A Novel Sequential RDF to Compute Partially Observable Markov Processes with MSE Distortion

We develop a new sequential rate distortion function to compute lower bounds on the average length of all causal prefix-free codes for partially observable multivariate Markov processes with a mean-squared error (MSE) distortion constraint. Our information measure is characterized by a variant of causally conditioned directed information and is utilized in various application examples. First, it is used to obtain an optimal finite-dimensional characterization of the problem for jointly Gaussian processes and to derive the corresponding optimal linear encoding and decoding policies. Under the assumption that all matrices commute by pairs, we show that our problem can be cast as a convex program which achieves its global minimum. We also derive sufficient conditions which ensure that our assumption holds. We then solve the KKT conditions and derive a new reverse-waterfilling algorithm that we implement. If our assumption is violated, one can still use our approach to derive sub-optimal (upper bound) waterfilling solutions. For scalar-valued Gauss-Markov processes with additional observation noise, we derive a new closed-form solution and compare it with known results in the literature. For partially observable time-invariant Markov processes driven by additive i.i.d. system noise only, we use an alternative approach to recover, and thereby strengthen, a recent result by Kostina and Hassibi in [1, Theorem 9], whereas for time-invariant and spatially i.i.d. Markov processes driven by an additive noise process we also derive new analytical lower bounds.


I. INTRODUCTION
Nonanticipatory ε-entropy was introduced in [2], [3], motivated by real-time communication with minimal encoding and decoding delays. This quantity is shown to be a tight lower bound on causal codes for scalar processes [4], whereas for vector processes it provides a tight lower bound, at high rates, on causal codes and on the average length of all causal prefix-free codes [5] (also termed zero-delay coding).
Inspired by the usefulness of nonanticipatory ε-entropy in real-time communication, Tatikonda et al. in [6] reinvented the same measure under the name sequential rate distortion function (RDF) (in the literature this information measure can also be found under the name nonanticipative RDF [7]) to study a linear fully observable Gaussian closed-loop control system over a memoryless communication channel subject to rate constraints. In particular, the authors of [6] used the sequential RDF subject to a pointwise mean-squared error (MSE) distortion constraint to describe a lower bound on the minimum cost of control for scalar-valued Gaussian processes and a suboptimal lower bound for the multivariate case obtained by means of a reverse-waterfilling algorithm [8, §10.3.3]. Tanaka et al. in [10] revisited the estimation/communication part of the problem introduced by Tatikonda et al. and showed that the specific description of the sequential RDF is semidefinite representable. Around the same time, Stavrou et al. in [11] solved the general KKT conditions that correspond to the rate distortion characterization of the optimal estimation problem in [6] and proposed a dynamic reverse-waterfilling characterization (for both pointwise and total MSE distortions) that computes the KKT conditions optimally as long as all dimensions of the multidimensional setup are active, which is the case in the high-rate regime. In addition, in [11] they found the optimal linear coding policies (by means of a linear forward test-channel realization) that achieve the specific rate distortion characterization, thus filling a gap created in [2, Theorem 5]. Recently, the optimal realization therein was used as a benchmark in [12] to derive bounds on a zero-delay multiple description source coding problem with feedback.
Kostina and Hassibi in [1] revisited the framework of [6] to derive bounds on rate-cost tradeoffs for time-invariant multivariate fully observable processes. One major result therein is a lower bound for the case where the system is driven by additive i.i.d. zero-mean noise. For scalar-valued Markov processes, this bound naturally generalizes known closed-form results, see, e.g., [3, Eq. (1.43)], [4, Theorem 3], beyond additive i.i.d. Gaussian zero-mean noise processes. Recently, the authors of [13] used a state augmentation technique to extend the characterization of the Gaussian nonanticipatory ε-entropy derived in [3] to nonstationary multivariate Gaussian autoregressive models of any finite order.
The extension of the framework in [6] to linear partially observable Gaussian control systems under noisy or noiseless communication channels was independently studied in [14] (see also [15, §VII]) and [1, §II]. In [14], [15, §VII], the authors choose to minimize the rate-cost tradeoffs via a multi-letter characterization where the objective is cast as the directed information [16] from the observations process to the output of the decoder/controller, when the system's encoder and decoder/controller are allowed to access previous decoder/controller signals via noiseless feedback. They then showed that the optimal coding policies of the system can be realized by a pre-Kalman filter (KF) stage that reduces the partially observable system to a fully observable one, a sensor design obtained via an SDP-representable problem, and a post-KF algorithm. Unfortunately, although it is pointed out in [15, Theorem 2] that an optimal policy solving their problem numerically exists if and only if their SDP characterization is feasible and, in addition, its solution coincides with the rate-cost characterization, such conditions are never provided. In [1], the authors choose a rate-cost characterization similar to [15] and derive lower and upper bounds on its optimal solution for jointly Gaussian processes. This means that, as in [15], an additional pre-KF algorithm is required to transform the partially observable problem into a fully observable one.

A. Contributions
In this paper, we study a new sequential RDF, reminiscent of the information measure introduced in [2], [3], to address problems modeled by partially observable Markov processes with MSE distortion. Our new information measure is defined via a variant of causally conditioned directed information [17, Chapter 3] between the observations process and the output of the decoder, when both the encoder and the decoder have access to the previous outputs of the decoder via noiseless feedback, whereas the decoder is also allowed to access previous observation signals (see Theorem 1, Definition 1). Armed with this major result, we obtain the following additional major results. (R1) We show that the description of our bound is, in general, a lower bound on the description of the information measure utilized in [1], [15] to compute performance limitations on the communication/estimation part of a partially observable time-invariant LQG closed-loop control system (see Proposition 1); (R2) For partially observable Gaussian processes we completely characterize our problem as a finite-dimensional optimization (see Theorems 2, 3) and show that the optimal solution can be realized by means of only one KF algorithm (see Lemma 2). We also derive the optimal linear policies, including the identification of the realization coefficients, which surprisingly show that the best policy at the decoder's output follows a first-order Markov process whereas the decoder ends up being independent of the previous observation signals (see Theorems 2, 3); (R3) We convexify the characterization obtained in (R2) under the assumption that all matrices in the system commute by pairs (see Theorem 4), that is, they are simultaneously diagonalizable by an orthogonal matrix (more details are given in the supplementary material of Appendix A).
Then, we solve the corresponding KKT conditions [18] to obtain a new reverse-waterfilling algorithm (see Theorem 5) that we also implement (see Algorithm 1); (R4) We give sufficient conditions that meet our assumption in (R3) (see Proposition 2); (R5) For scalar-valued time-invariant Gauss-Markov processes with additional observation noise we derive a new closed-form solution (see Corollary 1); (R6) We extend our results to Markov processes driven by additive i.i.d. zero-mean noise (not necessarily Gaussian): first, by recovering, for a partially observable system without observation noise, the bound of [1, Theorem 9] using an alternative method that leverages Minkowski's determinant inequality [19] and standard entropy power inequalities (EPI) [20]; second, by deriving two analytical solutions for a time-invariant multivariate fully observable system with additive i.i.d. observation noise, when each dimension is statistically i.i.d. The last result is obtained under the assumption that either the system model or the observations model is driven by an additive i.i.d. zero-mean noise process, but not both. Discussion of our results. (R1) suggests that improvements (i.e., further lower bounds) can be obtained when minimizing the estimation/communication cost in the celebrated quantized partially observable LQG closed-loop control problem, provided the control-theoretic separation principle holds [21]. Our result in (R2) provides a smaller system realization with significantly reduced complexity compared to [1], [14], [15], where two KF algorithms are required. Moreover, identifying the realization coefficients is important because they can lead to practical achievability schemes to bound the optimal performance theoretically attainable by causal codes [22] or the minimum average length of all causal prefix-free codes, see, e.g., [1], [4], [5], [23].
If the assumption of commuting matrices does not hold in (R3), then our reverse-waterfilling solution offers an elegant sub-optimal (upper bound) solution. In such a case, the tightness of our bound depends on the structure of the given matrices and the dimensionality of the partially observable system. We note that the sub-optimality of our algorithm for arbitrarily chosen matrices comes with the advantage of an algorithmic approach that is extremely fast and easily adaptable to high-dimensional systems when implemented (see, e.g., [24] for more details). Our result in (R5) is compared with an existing lower bound obtained in [25, Appendix A] using [1] to show, via simulation experiments, that the latter is not tight, in general, with respect to its optimal solution. This also means that our lower bound is the first exact lower bound obtained for scalar-valued Gauss-Markov processes with additional observation noise (i.e., it has an actual implementable realization). Our result in (R5) is also used to recover the known result obtained for scalar-valued Gauss-Markov processes derived in numerous papers, see, e.g., [3, Eq. (1.43)], [4, Theorem 3], [6, Equation (14)], and to compute analytically the rate loss (RL) gap due to the additional observation noise (see Proposition 3). To obtain our results in (R6), we use the fact that the optimal linear policies obtained in (R2) are also the best linear policies for additive i.i.d. zero-mean noise processes and correspond to a lower bound on the minimum average length of all causal prefix-free codes (see Proposition 4).
Notation. We let $\mathbb{R} = (-\infty, \infty)$, $\mathbb{Z} = \{\ldots, -1, 0, 1, \ldots\}$, $\mathbb{N}_0 = \{0, 1, \ldots\}$, $\mathbb{N}_0^n = \{0, 1, \ldots, n\}$, $n \in \mathbb{N}_0$. Let $\mathcal{X}$ be a finite-dimensional Euclidean space and $\mathcal{B}(\mathcal{X})$ the Borel $\sigma$-field of $\mathcal{X}$. A random variable (RV) defined on some probability space $(\Omega, \mathcal{F}, \mathbb{P})$ is a map $x : \Omega \to \mathcal{X}$, where $(\mathcal{X}, \mathcal{B}(\mathcal{X}))$ is a measurable space. We denote a sequence of RVs by $x_r^t \triangleq (x_r, x_{r+1}, \ldots, x_t)$, $(r, t) \in \mathbb{Z} \times \mathbb{Z}$, $t \ge r$, and their realizations by $x_r^t \in \mathcal{X}_r^t \triangleq \times_{k=r}^t \mathcal{X}_k$, for simplicity. If $r = -\infty$ and $t = -1$, we use the notation $x_{-\infty}^{-1} = x^{-1}$, and if $r = 0$, we use the notation $x_0^t = x^t$. The distribution of the RV $x$ on $\mathcal{X}$ is denoted by $P(dx)$. The conditional distribution of a RV $y$ given $x = x$ is denoted by $P(dy|x)$. The transpose and covariance of a random vector $x$ are denoted by $x^{\mathsf{T}}$ and $\Sigma_x$, respectively. We denote the determinant, trace, diagonal and diagonal elements of a square matrix $S$ by $|S|$, $\mathrm{trace}(S)$, $\mathrm{diag}(S)$ and $[\cdot]_{ii}$, respectively. We denote the transpose of a square or rectangular matrix $S$ by $S^{\mathsf{T}}$. We denote the eigenvalues of a square matrix $S \in \mathbb{R}^{p \times p}$ by $\{\mu_{S,i}\}_{i=1}^p$. The notation $S \succ 0$ (resp. $S \succeq 0$) denotes a positive definite (resp. positive semidefinite) matrix.
We denote a $p \times p$ identity matrix by $I_p$. $R^G(D)$ denotes the Gaussian version of a specific RDF, and $h^G(x)$ (resp. $h^G(x|y)$) denotes the Gaussian differential entropy (resp. conditional Gaussian differential entropy) of a distribution $P(dx)$ (resp. $P(dx|y)$). $N(\cdot)$ denotes the entropy power (EP) of a RV or a random vector $x$. The expectation operator is denoted by $\mathbf{E}\{\cdot\}$; $\|\cdot\|$ denotes the Euclidean norm; $[\cdot]^+ \triangleq \max\{0, \cdot\}$.

II. PROBLEM STATEMENT
We consider the "online" or zero-delay source coding setup of Fig. 1. In this setting, the "hidden" $\mathbb{R}^p$-valued Gaussian source is modeled by the discrete-time time-invariant Markov process

$x_{t+1} = A x_t + w_t, \quad t \in \mathbb{N}_0, \qquad (1)$

where $A \in \mathbb{R}^{p \times p}$ is a deterministic matrix and $w_t \in \mathbb{R}^p \sim (0; \Sigma_w)$, $\Sigma_w \succ 0$, is an i.i.d. sequence independent of $x_0$. The observation process is modeled by the discrete-time time-invariant process

$z_t = C x_t + n_t, \quad t \in \mathbb{N}_0, \qquad (2)$

where $C \in \mathbb{R}^{m \times p}$ is a deterministic full row rank matrix and $n_t \in \mathbb{R}^m \sim (0; \Sigma_n)$, $\Sigma_n \succ 0$, is an i.i.d. sequence, independent of $(\{w_t : t \in \mathbb{N}_0\}, x_0)$. System's operation: At every time step $t \in \mathbb{N}_0$, the hidden vector source $x_t$ is conveyed with additional noise $n_t$ to the encoder, which observes the noisy measurement $z_t$ (provided $z^{t-1}$ has already been observed) and produces a binary codeword $m_t$ of length $l_t$ (in bits) from a predefined set of codewords $\mathcal{M}_t$ with an at most countable number of codewords. The codewords are transmitted across an instantaneous noiseless digital channel to a decoder. Upon receiving $m_t$, the decoder immediately produces an estimate $y_t$ of the observation sample $z_t$, under the assumption that $y^{t-1}$ has already been reproduced. The analysis of the noiseless digital channel is restricted to the class of prefix-free binary codes $m_t$. Zero-delay source coding: Formally, the zero-delay source coding problem of Fig. 1 can be explained as follows. Define the input and output alphabet of the noiseless digital channel by $\mathcal{B} = \{1, 2, \ldots, B\}$, where $B = \max_t |\mathcal{M}_t|$ (which is allowed to be infinite). The elements of $\mathcal{B}$ enumerate the codewords of $\mathcal{M}_t$. The encoder is specified by a sequence of functions $\{f_t : t \in \mathbb{N}_0\}$ with $f_t : \mathcal{B}^{t-1} \times \mathcal{Z}^t \to \mathcal{B}_t$. At time $t \in \mathbb{N}_0$, the output of the encoder is a message $m_t = f_t(m^{t-1}, z^t)$, with $m_0 = f_0(z_0)$, which is transmitted through a noiseless channel to the decoder. The decoder is specified by a sequence of measurable functions $\{g_t : t \in \mathbb{N}_0\}$ with $g_t : \mathcal{B}^t \to \mathcal{Y}_t$. For each $t \in \mathbb{N}_0$, the decoder generates $y_t = g_t(m^t)$, with $y_0 = g_0(m_0)$, assuming $y^{t-1}$ has already been generated. The design in Fig. 1 is required to yield the long-term average distortion $\limsup_{n \to \infty} \frac{1}{n+1} \sum_{t=0}^n \mathbf{E}\{\|x_t - y_t\|^2\} \le D$. The rate at the encoder is given by the long-term average codeword length of all instantaneous codes, denoted by $\limsup_{n \to \infty} \frac{1}{n+1} \sum_{t=0}^n \mathbf{E}\{l_t\}$. We denote by $L_n \triangleq \sum_{t=0}^n l_t$ the accumulated number of bits received by the decoder by the time it reproduces the estimate $y_n$.
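As a minimal numerical illustration of the source and observation models above, the following Python sketch simulates a hidden Gauss-Markov source $x_{t+1} = A x_t + w_t$ together with its noisy observations $z_t = C x_t + n_t$. The matrices and noise covariances are illustrative placeholders (not values from the paper), and the initialization $x_0 \sim N(0, I_p)$ is an assumption made only for this sketch.

```python
import numpy as np

def simulate_source(A, C, Sigma_w, Sigma_n, n_steps, seed=None):
    """Simulate the hidden Gauss-Markov source x_{t+1} = A x_t + w_t and its
    noisy observations z_t = C x_t + n_t.  The initial state x_0 ~ N(0, I)
    is an assumption made for this sketch."""
    rng = np.random.default_rng(seed)
    p, m = A.shape[0], C.shape[0]
    xs, zs = [], []
    x = rng.multivariate_normal(np.zeros(p), np.eye(p))
    for _ in range(n_steps):
        zs.append(C @ x + rng.multivariate_normal(np.zeros(m), Sigma_n))  # z_t
        xs.append(x)
        x = A @ x + rng.multivariate_normal(np.zeros(p), Sigma_w)          # x_{t+1}
    return np.array(xs), np.array(zs)

# Illustrative 2-D system (placeholder matrices, not values from the paper).
A = np.array([[0.9, 0.1], [0.0, 0.8]])
C = np.eye(2)
xs, zs = simulate_source(A, C, 0.1 * np.eye(2), 0.05 * np.eye(2), n_steps=100, seed=0)
```

Such a simulated pair $(x^n, z^n)$ is what the encoder of Fig. 1 would causally compress.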
Performance. The performance of the multi-input multi-output (MIMO) system in Fig. 1 can be cast as the following optimization problem:

$R^{op}(D) \triangleq \inf_{\{(f_t, g_t) : t \in \mathbb{N}_0\}} \limsup_{n \to \infty} \frac{1}{n+1} \sum_{t=0}^n \mathbf{E}\{l_t\}$ subject to $\limsup_{n \to \infty} \frac{1}{n+1} \sum_{t=0}^n \mathbf{E}\{\|x_t - y_t\|^2\} \le D. \qquad (3)$

III. A NEW LOWER BOUND
In this section, we derive a new information measure that gives a new lower bound on (3). To do so, we first prove a novel data processing theorem that corresponds to the system of Fig. 1.
First, we write the data processing of information for the MIMO system of Fig. 1 in terms of its joint distribution. In particular, the joint distribution induced by the joint process $\{(z_t, m_t, y_t) : t \in \mathbb{N}_0^n\}$ admits the following decomposition: where (a) stems from the fact that we assume the following conditional independence constraints in our system: Remark 1: (Trivial initial information) In (4) we assume that the joint distribution $P(dz^{-1}, dm^{-1}, dy^{-1})$ generates trivial information.
The following data processing theorem is a main result of this paper.
Theorem 1: (Data processing theorem) Provided the decomposition of the joint distribution in (4) holds, the system in Fig. 1 admits the following data processing inequalities: where $I(z^n \to y^n \| z^{n-1}) \triangleq \sum_{t=0}^n I(z_t; y_t | z^{t-1}, y^{t-1})$, and where (a) follows because conditioning reduces entropy [8]; (b) follows from the non-negativity of discrete entropy [8]; (c) follows by construction. Next, we prove (ii). This is shown as follows: where (d) follows from an adaptation of [26, Lemma 3.3] to processes, in which the second term is zero because of the conditional independence constraint (6); (e) follows from the chain rule of conditional mutual information (again an adaptation of [26, Theorem 3.4]), which decomposes the conditional mutual information in two different ways, where $I(z_t; m^{t-1} | y^{t-1}, z^{t-1}) = 0$ due to the conditional independence in (5). From (8) we obtain $\sum_{t=0}^n I(z_t; m_t | z^{t-1}, y^t) \ge 0$ (due to the non-negativity of conditional mutual information [8]). This completes the derivation. We stress the following technical remark on Theorem 1.
(2) Theorem 1 is a non-trivial generalization of [15, Lemma 1]. Compared to [15, Lemma 1], which assumes noiseless feedback information at both the encoder and the decoder, here we additionally assume knowledge of all previous observation signals at the decoder.
The next lemma plays an important role in some of our results in the sequel.
Lemma 1: (Inequality bound) The following bound holds, where Observe that by definition we obtain where (a) follows by definition, under the assumption that both information measures are well defined; (b) follows because conditioning reduces (differential) entropy. Next, we show how to construct the new sequential RDF. Hidden source distribution. The distribution of the hidden source $x_t$ satisfies the conditional independence At $t = 0$ we have $P(dz_0|x_0)$. Also, by Bayes' rule we obtain $\overrightarrow{P}(dz^n|x^n) \triangleq \otimes_{t=0}^n P(dz_t|x_t)$. Clearly, from (11) and (12), we can define the joint distribution In addition, from (13), we can define the $\mathcal{Z}^n$-marginal distribution parametrized by $\mathcal{Y}^{n-1}$ as follows where $P(dz_t|z^{t-1}, y^{t-1}) = \int_{\mathcal{X}_t} P(dz_t|x_t) \otimes P(dx_t|z^{t-1}, y^{t-1})$. We assume that at $t = 0$, $P(dz_0|z^{-1}, y^{-1}) = P(dz_0)$. Reproduction or "test channel". The reproduction conditional distributions, known as test channels, satisfy the conditional independence At $t = 0$, no initial state information is assumed; hence (15) uniquely defines the family of conditional distributions on $\mathcal{Y}^n$ parametrized by $z^n \in \mathcal{Z}^n$, given by $\overrightarrow{Q}(dy^n|z^n) \triangleq \otimes_{t=0}^n P(dy_t|y^{t-1}, z^t)$, and vice-versa. From (14) and (15), we can uniquely define the joint distribution of $\{(z_t, y_t) : t \in \mathbb{N}_0^n\}$ by In addition, from (16), we can define the $\mathcal{Y}^n$-marginal distribution parametrized by $\mathcal{Z}^{n-1}$ as follows where Given the above construction of distributions, we obtain the following variant of causally conditioned directed information [17]: where (a) is due to the chain rule of relative entropy using the Radon-Nikodym derivative; (b) follows by definition. Next, we formally define the non-trivial generalization of the sequential RDF defined in [2].
Definition 1: (Sequential RDF for partially observable Markov systems) For given hidden and observation processes that induce (11) and (12), the following variant of the sequential RDF, subject to an average total MSE distortion constraint both in finite time and in the asymptotic limit, can be defined, provided the limit takes a finite value, where $d(x^n, y^n) \triangleq \sum_{t=0}^n \|x_t - y_t\|^2$. Next, we state some useful properties of Definition 1 that can be extracted from known results.

Remark 3: (Comments on (20))
(1) Similar to what we have already discussed in Remark 2, (1), it can be shown that the objective function in (20) can be replaced by causally conditioned mutual information.
(2) It is easy to show that (20) is convex with respect to the test channel, following for instance [27].
(3) If the joint process $\{(x_t, z_t) : t \in \mathbb{N}_0\}$ is jointly Gaussian and $\{(x_t, z_t, y_t) : t \in \mathbb{N}_0\}$ is also jointly Gaussian, then (20) achieves a smaller value (see, e.g., [28, Theorem 1]).
(4) The implicit expressions of the optimal minimizing distribution $P^*(dy_t|y^{t-1}, z^t)$, obtained backward in time, can be found using dynamic programming [29]. The solution will extend the result for fully observable processes derived in [11, Theorem 4.1] to partially observable processes. Although the information measure introduced in Definition 1 can be quite handy in addressing problems related to partially observable Markov decision processes [30], we will not pursue that goal in this paper.

A. Comparison of Definition 1 to other information measures
Recall that the problem statement of §II was also studied in [14] (see also [15]), [1, §II.B] for jointly Gaussian processes. In both works, the authors considered the following information measure to compute a lower bound on (3): Unfortunately, the use of (22), (23) complicates a characterization that yields analytical tractability, because the pay-off in (22) admits a multi-letter expression with respect to the observation process $\{z_t : t \in \mathbb{N}_0^n\}$, and one needs to transform this problem into a similar one that admits a single-letter characterization of the observed signals in order to compute it. For Gaussian processes this can be done optimally (with an increased computational complexity) by transforming the partially observable Gauss-Markov process into a fully observable one using a pre-KF algorithm, from the system's process to the observations process, that estimates the hidden state of the Gauss-Markov process. The estimate of such a KF serves as a sufficient statistic of the partially observable process (see, e.g., [14]).
The next result is immediate from Lemma 1. Proposition 1: (Comparison to similar bounds) For the system model in (1), (2), the following bounds hold: Proposition 1 is fundamental, as it shows that our sequential RDF defined in Definition 1 is a lower bound on the description utilized in [1], [14], [15] to compute the communication cost for partially observable controlled Markov systems driven by an additive Gaussian noise process.

IV. NEW RESULTS FOR JOINTLY GAUSSIAN PROCESSES
In this section, we use Definition 1 to derive a general finite-dimensional characterization for multivariate partially observable Gaussian processes with an MSE distortion constraint. To solve the problem in the asymptotic limit, we assume that all involved matrices commute by pairs (sufficient conditions for this assumption to hold are also provided). Then, the corresponding optimization problem becomes convex and can be solved optimally using the KKT conditions [18]. The solution of the KKT conditions reveals a new non-trivial reverse-waterfilling algorithm.
First, we need the following helpful lemma, which is a non-trivial generalization of the classical KF algorithm [31], [32] and a generalization of a recent result in [11].
Lemma 2: (Realization of $\{P^*(dy_t|y^{t-1}, z^t) : t \in \mathbb{N}_0^n\}$) Suppose that the joint process $\{(x_t, y_t, z_t) : t \in \mathbb{N}_0^n\}$ is jointly Gaussian. Then, the following statements hold: where $\{v_t : t \in \mathbb{N}_0^n\}$ is an independent Gaussian process, independent of $\{(w_t, n_t) : t \in \mathbb{N}_0^n\}$ and $x_0$, and $\{H_t \in \mathbb{R}^{p \times m} : t \in \mathbb{N}_0^n\}$ are time-varying deterministic matrices (to be designed). Moreover, the innovations process $\{I_t \in \mathbb{R}^p : t \in \mathbb{N}_0^n\}$ of (25) is the orthogonal process given by $I_t \triangleq H_t(z_t - \hat{z}_{t|t-1}) + v_t$, and $\{(\hat{x}_{t|t}, \hat{x}_{t|t-1}, \Sigma_{t|t}, \Sigma_{t|t-1}) : t \in \mathbb{N}_0^n\}$ satisfy the following generalized discrete-time KF recursions: $\hat{x}_{t|t} = \hat{x}_{t|t-1} + k_t I_t$, where $\Sigma_{t|t} = \Sigma_{t|t-1} - k_t \Sigma_{I_t} k_t^{\mathsf{T}}$, where Proof: (1) Since the joint process is assumed to be jointly Gaussian, $\{P^*(dy_t|y^{t-1}, z^t) : t \in \mathbb{N}_0^n\}$ is conditionally Gaussian, and we can obtain the orthogonal realization where $\{H_t : t \in \mathbb{N}_0^n\}$ are deterministic matrices and $(G_{t-1}, \Gamma_{t-1})$ are deterministic matrices of appropriate dimensions. For such a realization, $I(z_t; y_t|y^{t-1}, z^{t-1})$ does not depend on $R_t(\cdot, \cdot)$, $\forall t \in \mathbb{N}_0^n$. Moreover, (2) this follows from the discrete-time KF equations. (3) The characterization that achieves (25) is obtained from (1), (2) as follows. First note that, by definition, we have The first term in (30) is computed as follows: where (a) follows from the form of the conditionally Gaussian distribution $P(dy_t|y^{t-1}, z^{t-1})$. The second term in (30) is computed as follows: where (b) stems from the fact that $P(dy_t|y^{t-1}, z^t) \sim N(H_t(z_t - \hat{z}_{t|t-1}) + \hat{x}_{t|t-1}; \Sigma_{v_t})$. Incorporating (31), (32) into (30), we obtain the objective in (28). Finally, the MSE distortion constraint follows from (1). This completes the proof. The next theorem gives the general finite-dimensional characterization of (20), including the feasible set of solutions, for time-varying partially observable Gaussian processes with an average total MSE distortion constraint. It also reveals the optimal linear Gaussian test-channel distribution (forward test-channel realization) that corresponds to this problem.
The derivation relies on the identification of the optimization variables $(H_t, \Sigma_{v_t})$ of (25).
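Since the construction above rests on KF recursions, a minimal numerical sketch of the covariance iterates $\Lambda_t = \Sigma_{t|t-1}$ and $\Delta_t = \Sigma_{t|t}$ may help. The code below implements only the standard discrete-time KF covariance recursion (the paper's generalized recursion additionally carries the design matrices $H_t$ and the noise $v_t$, which are omitted here); all matrices are illustrative placeholders.

```python
import numpy as np

def kf_covariances(A, C, Sigma_w, Sigma_n, Lambda0, n_steps):
    """Standard discrete-time KF covariance recursion: Lambda_t = Sigma_{t|t-1}
    and Delta_t = Sigma_{t|t}.  (The paper's generalized recursion also carries
    the design matrices H_t and the noise v_t; both are omitted in this sketch.)"""
    Lambdas, Deltas = [], []
    Lam = Lambda0
    for _ in range(n_steps):
        S = C @ Lam @ C.T + Sigma_n            # innovations covariance
        K = Lam @ C.T @ np.linalg.inv(S)       # Kalman gain
        Delta = Lam - K @ C @ Lam              # filtered covariance Sigma_{t|t}
        Lambdas.append(Lam)
        Deltas.append(Delta)
        Lam = A @ Delta @ A.T + Sigma_w        # predicted covariance Sigma_{t+1|t}
    return Lambdas, Deltas

# Illustrative 2-D system (placeholder matrices, not values from the paper).
A = np.array([[0.9, 0.1], [0.0, 0.8]])
C = np.eye(2)
Lams, Dels = kf_covariances(A, C, 0.1 * np.eye(2), 0.05 * np.eye(2), np.eye(2), 50)
```

Note that every iterate satisfies $\Delta_t \preceq \Lambda_t$, consistent with the feasibility constraints appearing in the characterization below.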
Theorem 2: (Characterization of (20) for jointly Gaussian processes) Let $\Lambda_t \triangleq \Sigma_{t|t-1}$ and $\Delta_t \triangleq \Sigma_{t|t}$. Then, the characterization of (20) for the system (1), (2) with a total MSE distortion constraint is the following, for some $D \in [D_{\min}, D_{\max}] \subset (0, D_{\max}]$, where $Q \triangleq C^{\dagger} \Sigma_n C^{\dagger\mathsf{T}} \succeq 0$ is such that $\mathrm{rank}(Q) = \mathrm{rank}(\Sigma_n)$. Moreover, the above characterization is achieved by a linear Gaussian "test channel" $P^*(dy_t|y^{t-1}, z^t)$ of the form with $y_{-1} = 0$, where $C^{\dagger}$ is the generalized inverse matrix of $C$. Proof: From MSE estimation theory we know that the inequality $\sum_{t=0}^n \mathbf{E}\{\|x_t - y_t\|^2\} \ge \sum_{t=0}^n \mathbf{E}\{\|x_t - \hat{x}_{t|t}\|^2\}$ holds for all $(H_t, \Sigma_{v_t})$, $t \in \mathbb{N}_0^n$, and is achieved if and only if $\hat{x}_{t|t} = y_t$. Sufficient conditions for the latter to hold are (i) $\hat{x}_{t|t-1} \equiv \mathbf{E}\{y_t | y^{t-1}, z^{t-1}\}$ and (ii) $k_t = I_p$. Note that (i) holds by the general KF algorithm in Lemma 2. The choice of (35) satisfies (ii); hence a smaller distortion for a given rate is achieved, and the Markov realization in (34) also holds. Moreover, by substituting the scalings of (35) into the pay-off of (28) (without the $\frac{1}{2}\sum_{t=0}^n$) and after some matrix algebra, we obtain where we have set $Q \triangleq C^{\dagger} \Sigma_n C^{\dagger\mathsf{T}}$, which is a positive semidefinite matrix whose rank is dictated by the rank of $\Sigma_n$. Observe that (36) can be reformulated as follows where (a) follows by dividing both the numerator and denominator of the second term in (37) by $I_p - \Lambda_t^{-1}\Delta_t$; (b) follows by multiplying both the numerator and denominator of the second term in (38) by $\Lambda_t$.
We ensure that both terms in (38) are well defined if the linear matrix inequality $\Delta_t \preceq \Lambda_t$, $\forall t$, holds, and, for the second term only, if the nonlinear matrix inequality $0 \prec \Lambda_t(\Lambda_t + Q)^{-1} Q \prec \Delta_t$, $\forall t$, is imposed (otherwise that term may give a matrix with non-positive real eigenvalues, in which case the solution would not be well defined). Additionally, a sufficient condition for the existence of a finite solution to both terms in (38) is to restrict $\Delta_t \succ 0$. In fact, that condition can be made more rigorous upon observing that $\Lambda_t(\Lambda_t + Q)^{-1} Q \prec \Delta_t$, $\forall t$. The previous analysis is described by the pair of linear and nonlinear matrix inequalities in (33).
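These feasibility conditions can be checked numerically. The sketch below tests $\Delta \preceq \Lambda$ and $0 \prec \Lambda(\Lambda + Q)^{-1} Q \prec \Delta$ for given matrices; it symmetrizes products before the eigenvalue test, which is exact when the matrices commute (the setting of Theorem 4 below) and only a heuristic otherwise.

```python
import numpy as np

def feasible(Delta, Lam, Q, tol=1e-9):
    """Numerically check Delta <= Lam (Loewner order) and
    0 < Lam (Lam + Q)^{-1} Q < Delta.  Products are symmetrized before the
    eigenvalue test: exact when the matrices commute, heuristic otherwise."""
    def sym(M):
        return (M + M.T) / 2
    def psd(M):
        return bool(np.all(np.linalg.eigvalsh(sym(M)) >= -tol))
    def pd(M):
        return bool(np.all(np.linalg.eigvalsh(sym(M)) > tol))
    G = Lam @ np.linalg.inv(Lam + Q) @ Q
    return psd(Lam - Delta) and pd(G) and pd(Delta - G)
```

For example, with $\Lambda = 2 I$ and $Q = I$ (so $\Lambda(\Lambda + Q)^{-1} Q = \frac{2}{3} I$), the choice $\Delta = 1.5 I$ is feasible while $\Delta = 0.5 I$ violates the lower matrix bound.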
In what follows we point out some interesting technical observations that stem from the result of Theorem 2.
(2) Theorem 2 shows that the optimal minimizer in (20) for jointly Gaussian processes with MSE distortion admits a Markov realization, i.e., $P^*(dy_t|y^{t-1}, z^t) = P^*(dy_t|y_{t-1}, z_t)$. This implies that the output of the decoder $\{y_t : t \in \mathbb{N}_0\}$ is modeled as a first-order Markov process, and the corresponding rate distortion characterization (20) simplifies to the following finite-state rate distortion expression: The description in (39) demonstrates that the additional information via the previous observation signals $z^{t-1}$ at the decoder is surprisingly redundant. Moreover, (39) corresponds to the best linear coding policies as long as the additive noise processes are zero-mean, uncorrelated and white (these conditions are satisfied in our setup), because then the KF algorithm becomes the best linear MMSE estimator, see, e.g., [31, §3.2] or [32, p. 130].
(3) Clearly, the result of Theorem 2 can be trivially reformulated for the case of a pointwise MSE distortion, i.e., $\mathrm{trace}(\Delta_t) \le D_t$, $\forall t$. (4) The first term in the objective function of (33) is precisely the objective of the "fully observable" Gauss-Markov process with an MSE distortion constraint, see, e.g., [10], [11], [5]. Interestingly, if we assume in the second term of the objective function that $Q = 0$ (the null matrix), which means that $\Sigma_n = 0$, then the problem recovers the known expression for fully observable Gauss-Markov processes.
Next, we restrict Theorem 2 to its time-invariant characterization.
Theorem 3: (Time-invariant characterization of (33)) Suppose that $D \in [D_{\min}, D_{\max}] \subset (0, D_{\max}]$. Moreover, let the conditionally Gaussian distribution $P^*(dy_t|y_{t-1}, z_t)$ be time-invariant and let the corresponding distribution $P^*(dy_t|y_{t-1})$ have a unique invariant distribution. Then, if $R^G_{\mathrm{in}}(D) < \infty$, its characterization is given by where $(\Lambda, \Delta)$ are the time-invariant values of $(\Lambda_t, \Delta_t)$, respectively. Moreover, (40) is achieved by a time-invariant realization of the form and Proof: From the sub-additivity property of $R^G_{[0,n],\mathrm{in}}(D)$ in Definition 1, the limit $R^G_{\mathrm{in}}(D)$ always exists, although it may be infinite. If $R^G_{\mathrm{in}}(D) = +\infty$ there is nothing to prove. However, by restricting the distributions $P^*(dy_t|y_{t-1}, z_t)$ and $P^*(dy_t|y_{t-1})$ to be time-invariant, and if $R^G_{\mathrm{in}}(D) < \infty$, then its solution is given by (40); (41) follows because we have assumed a time-invariant $P^*(dy_t|y_{t-1}, z_t)$.
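The time-invariant pair $(\Lambda, \Delta)$ appearing in the theorem can be approximated numerically by iterating the prediction-covariance recursion to its fixed point. The sketch below does this by plain iteration for the standard KF covariance (again without the $H_t$ design), under the assumption that the chosen stable system makes the iteration convergent; the scalar values are illustrative placeholders.

```python
import numpy as np

def steady_state_covariances(A, C, Sigma_w, Sigma_n, tol=1e-12, max_iter=100_000):
    """Approximate the time-invariant pair (Lambda, Delta) by iterating the
    standard KF covariance recursion to its fixed point (plain iteration;
    assumes the chosen system makes the recursion convergent)."""
    Lam = np.eye(A.shape[0])
    for _ in range(max_iter):
        S = C @ Lam @ C.T + Sigma_n
        Delta = Lam - Lam @ C.T @ np.linalg.inv(S) @ C @ Lam
        Lam_next = A @ Delta @ A.T + Sigma_w
        if np.max(np.abs(Lam_next - Lam)) < tol:
            return Lam_next, Delta
        Lam = Lam_next
    raise RuntimeError("covariance iteration did not converge")

# Scalar illustration (placeholder values, not from the paper).
Lam, Delta = steady_state_covariances(np.array([[0.9]]), np.array([[1.0]]),
                                      np.array([[0.1]]), np.array([[0.05]]))
```

In the scalar case the returned value satisfies the fixed-point equation $\lambda = a^2(\lambda - \lambda^2/(\lambda + \sigma_n)) + \sigma_w$ up to the chosen tolerance.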
Clearly, Theorems 2 and 3 provide not only the characterization of the optimization problem that corresponds to the system (1), (2) driven by an additive i.i.d. Gaussian noise process with MSE distortion, but also the feasible set of solutions that ensures a finite value for the corresponding optimization problem. Unfortunately, in general, the characterization in Theorem 3 (see also Theorem 2) forms a non-convex optimization problem, because we cannot ensure that the product of the specific positive semidefinite matrices preserves a symmetric (positive semidefinite) structure (i.e., the set of involved matrices does not belong to the positive semidefinite cone [18]). This means that without certain sufficient conditions guaranteeing that the problem is convex, the solution will not, in general, achieve its global minimum value [18]. In the next theorem, we give sufficient conditions ensuring that the optimization problem of Theorem 3, when solved, achieves its minimum value (i.e., it is convex).
Theorem 4: (A convex characterization of (20) for time-invariant partially observable Gaussian processes with MSE distortion) Suppose that the square matrices $(\Lambda, \Delta, Q)$ commute by pairs (for details on this concept, see the supplementary material in Appendix A, Definitions 2, 3). Then, the characterization of (33) simplifies to the following convex optimization problem: Proof: Note that, by the assumption of the theorem, i.e., that $(\Lambda, \Delta, Q)$ commute by pairs, they are also simultaneously diagonalizable by an orthogonal matrix $U \in \mathbb{R}^{p \times p}$ (for details, see the supplementary material in Appendix A, Theorem 8). It is simple to show that, by writing the spectral representation (eigenvalue decomposition) of each of the above matrices and performing simple matrix algebra, we can simplify the complex structure of (40) to the one in (43). Clearly, the objective, as a function of the variable $\mu_{\Delta,i} > \frac{\mu_{\Lambda,i}\mu_{Q,i}}{\mu_{\Lambda,i} + \mu_{Q,i}}$, $\forall i$, is differentiable and continuous in its domain. Moreover, it can be shown that its second derivative (with respect to $\mu_{\Delta,i}$) is non-negative; hence the objective is convex with respect to the $\mu_{\Delta,i}$'s.
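The simultaneous-diagonalization step in the proof can be illustrated numerically: if symmetric matrices commute pairwise, the orthogonal eigenbasis $U$ of one of them diagonalizes the others. The sketch below assumes the first matrix has distinct eigenvalues, so its eigenbasis is essentially unique; the matrices are placeholders built to share an eigenbasis.

```python
import numpy as np

def commute(M, N, tol=1e-10):
    """True if M and N commute (MN = NM) up to numerical tolerance."""
    return bool(np.max(np.abs(M @ N - N @ M)) < tol)

def common_eigvals(mats):
    """For symmetric, pairwise-commuting matrices, diagonalize the first one
    and reuse its orthogonal eigenbasis U for the rest.  Sketch: assumes the
    first matrix has distinct eigenvalues so U is unique up to column signs."""
    assert all(commute(M, N) for M in mats for N in mats)
    _, U = np.linalg.eigh(mats[0])
    return [np.diag(U.T @ M @ U) for M in mats]

# Two matrices built to share an eigenbasis (placeholder values).
Qm, _ = np.linalg.qr(np.array([[1.0, 2.0], [3.0, 4.0]]))
M = Qm @ np.diag([1.0, 2.0]) @ Qm.T
N = Qm @ np.diag([3.0, 5.0]) @ Qm.T
vals = common_eigvals([M, N])
```

The returned diagonals are exactly the eigenvalue tuples $(\mu_{M,i}, \mu_{N,i})$ that the scalar reformulation in (43) operates on.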
Next, we derive sufficient conditions that meet the assumption of Theorem 4. This requires identifying the structure of the square matrices (A, Q, Σ_w).
Proof: We only prove (1) because the other cases follow similarly. Under the matrix structure of (1), (A, Σ_w, Q) commute by pairs, because the last two matrices are scalar matrices, which commute with every matrix of the same dimensions. This means that (A, Σ_w, Q) are simultaneously diagonalizable by an orthogonal matrix U. Clearly, (Σ_w, Q) commute with ∆ because they are scalar matrices. It remains to show that A commutes with ∆. We write the spectral representations A = UΛ_A Uᵀ and ∆ = VΛ_∆ Vᵀ. But ∆ is a design matrix (variable); therefore we can always choose the design parameter V = U (i.e., eigenvector alignment), ensuring that (A, ∆) commute. Finally, since (A, Σ_w, Q, ∆) commute by pairs, Λ and ∆ commute as well. Cases (2), (3), (4), (5) follow similarly.
Remark 5: (Technical comments) Although the sufficient conditions derived in Proposition 2 are restrictive compared to the generally available structure of the matrices (A, Q, Σ_w), they maintain the optimality of the convex optimization problem in Theorem 4 and achieve the minimum possible rates. Unfortunately, if one deviates from these conditions, so that the chosen structure of (A, Q, Σ_w) does not yield pairwise commutation and simultaneous diagonalization, then Theorem 4 will, in general, give a sub-optimal solution to the general characterization of Theorem 3. The tightness of this sub-optimal solution depends on the dimensionality of the problem and the structure of the corresponding matrices.
The next result is a main result of this paper.
Theorem 5: (Reverse-waterfilling solution of (43)) The parametric solution of (43) is such that µ_{Λ,i} = µ_{A²,i} µ_{∆,i} + µ_{Σ_w,i}, ∀i, and µ_{∆,i} is computed based on the following reverse-waterfilling algorithm, with υ_i ≜ µ_{Σ_w,i} + (1 − µ_{A²,i})µ_{Q,i}, µ_{Q,i} ≠ ∞, and where µ_{∆*,i} > µ_{∆*_min,i} is the solution achieving the highest rates among those obtained from the third-degree polynomial equation:
Proof: The solution is obtained using the KKT conditions [18, Chapter 5.5.3]. First, we introduce the augmented (unconstrained) Lagrangian of (43), where µ_{∆,i} ≥ 0 is the primal variable and θ ≥ 0, {(f¹_i ≥ 0, f²_i ≥ 0)}, ∀i, are the dual variables (Lagrange multipliers) responsible for the distortion constraint, the quadratic inequality constraint µ_{Λ,i}µ_{Q,i}/(µ_{Λ,i} + µ_{Q,i}) < µ_{∆,i}, ∀i, and the linear inequality constraint µ_{∆,i} ≤ µ_{Λ,i}, ∀i, respectively. Note that the optimization problem in (43) under the specific constraints is convex (the objective is convex w.r.t. µ_{∆,i}; the quadratic inequality is a continuously differentiable convex function under the assumption that µ_{∆,i} ≥ 0; and the other two inequality constraints are affine). Hence Slater's condition is satisfied and the KKT conditions are also sufficient for global optimality. The KKT conditions for this problem are given in (50)–(52). From (52), the solution is zero; hence without loss of optimality we can also take f²_i = f²*_i = 0, ∀i. Next, we proceed to solve (50), obtaining (55), where (55) follows after some algebra and by setting υ_i ≜ µ_{Σ_w,i} + (1 − µ_{A²,i})µ_{Q,i}. Hence we obtain (47), (48). From the constraints of the optimization problem, the solution of (55) must satisfy (56), which corresponds to a quadratic inequality with two boundary points, of which one is negative and hence rejected because µ_{∆*,i} > 0 (by definition). The latter also implies that the quadratic inequality in (56) is active only when (46) holds ∀i.
Hence, the third-degree polynomial equation in (55) should satisfy the minimum-distortion criterion of (46) at each i. If this criterion is satisfied by all three solutions, then we pick the one that achieves the highest rate, by the Lagrange duality theorem [33]. The problem is solved once we check whether (44) is non-negative via the criterion in (45). This completes the derivation.
Next, we point out a technical comment and certain special cases for Theorem 5.
Remark 6: (Comments on Theorem 5) (1) The reverse-waterfilling solution of Theorem 5 relies on the third-degree polynomial equation (47). Such a polynomial equation is known to have either one or three real-valued solutions, and it can be solved numerically using simple commands in Matlab or Python.
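The per-component root selection described above can be sketched numerically. In the snippet below the cubic coefficients are placeholders, not the actual coefficients of (47); the point is only the mechanics of extracting the real roots and discarding those violating the feasibility threshold (the minimum-distortion criterion of (46)).

```python
import numpy as np

def admissible_cubic_roots(coeffs, lower_bound):
    """Real roots of the cubic c3*x^3 + c2*x^2 + c1*x + c0 = 0 that
    exceed a feasibility threshold. `coeffs` is [c3, c2, c1, c0]."""
    roots = np.roots(coeffs)                       # up to three complex roots
    real = roots[np.abs(roots.imag) < 1e-10].real  # keep real-valued solutions
    return np.sort(real[real > lower_bound])       # enforce the lower bound

# example cubic: x^3 - 6x^2 + 11x - 6 = (x-1)(x-2)(x-3)
cands = admissible_cubic_roots([1.0, -6.0, 11.0, -6.0], lower_bound=1.5)
# here the feasible candidates are 2 and 3
```

Among the surviving candidates, Theorem 5 keeps the one achieving the highest rate, consistent with the Lagrange duality argument above.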
(2) If in (44) we take µ_{Q,i} = 0 for some i, then for that i it can easily be shown that our solution recovers, as a special case, the reverse-waterfilling algorithm obtained for "fully observable" time-invariant multidimensional Gauss-Markov processes subject to an MSE distortion constraint, derived in [24, Proposition 1].
(3) Because the pay-off in (44) is greater than the corresponding one obtained for fully observable processes (where only the first term appears), the reverse-waterfilling solution of Theorem 5 is always an upper bound on the reverse-waterfilling solution obtained for fully observable multivariate processes in [24, Proposition 1], over the rate-distortion region on which both are defined.

Algorithm 1 Implementation of (45)
Initialize: number of spatial components p; error tolerances ε₁, ε₂ with ε₁ ≥ ε₂; nominal minimum and maximum values θ_min = 0 and θ_max (sufficiently large); an initial variance for µ_{Λ,1}; the matrix structure of (A, Σ_w, Q) based on (1), (2), and their corresponding eigenvalues.
Comments on Algorithm 1. Algorithm 1 utilizes the bisection method to converge to a certain value for a given error tolerance. This is done by picking nominal values for θ (i.e., θ_min, θ_max). Finally, the convergence of θ affects the convergence of Σ_{i=1}^p µ_{∆,i} → D. Our iterative scheme does not guarantee that both θ and Σ_{i=1}^p µ_{∆,i} → D converge concurrently.
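The outer bisection loop of Algorithm 1 can be sketched as follows. This is a schematic only: the per-component rule below uses the classical reverse-waterfilling clamp min(θ, µ_{Λ,i}) as a stand-in for the cubic-polynomial solution (47) of Theorem 5, and the tolerances and θ_max are illustrative choices.

```python
import numpy as np

def bisection_waterfill(mu_Lam, D, tol=1e-9, theta_max=1e6):
    """Bisect on the water level theta until the allocated per-component
    distortions sum to the target D (outer loop of Algorithm 1, schematic).
    The component rule here is a placeholder for the cubic of (47)."""
    lo, hi = 0.0, theta_max
    while hi - lo > tol:
        theta = 0.5 * (lo + hi)
        mu_Delta = np.minimum(theta, mu_Lam)  # placeholder component rule
        if mu_Delta.sum() > D:                # too much distortion: lower theta
            hi = theta
        else:                                 # room left: raise theta
            lo = theta
    return np.minimum(0.5 * (lo + hi), mu_Lam)

mu_Lam = np.array([1.0, 2.0, 4.0])            # hypothetical eigenvalues µ_Λ,i
mu_Delta = bisection_waterfill(mu_Lam, D=3.0)
# the allocations satisfy sum(mu_Delta) ≈ D and mu_Delta_i ≤ µ_Λ,i
```

As noted in the comments on Algorithm 1, the bisection controls θ directly; the convergence of Σ_i µ_{∆,i} to D is inherited from it and is not separately guaranteed.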
Proof: The solution follows from Theorem 5 when p = 1. In particular, the complementary slackness conditions in (51), (52) ensure that δ = D (uniform distortion allocation) and specify the values of D_min and D_max.
Next, we remark on how one can recover the well-known result of the sequential or nonanticipative RDF obtained for scalar-valued time-invariant Gauss-Markov processes (without an additional observation noise process).
Remark 7: (Sequential RDF) If in (57) we set σ²_n = 0, we obtain (58), with D_min > 0 and D_max = λ. This special case corresponds precisely to the sequential or nonanticipative RDF derived for scalar-valued time-invariant Gauss-Markov sources with MSE distortion (without an observation noise process) in [3, Eq. (1.43)], [6, Eq. (14)], [4, Theorem 3].
Using Remark 7 and the optimal closed-form expression (57), we can compute the RL gap due to the additional observation noise. This result is stated in the following proposition.
Proposition 3: (RL gap due to the additional observation noise) The RL gap between (57) and (58) is given by the following expression, where (D_min, D_max) are obtained in Corollary 1.
In the next remark, we state the steady-state solution of a lower bound on the optimal R̄^G_in(D), obtained via [1] (see [25, Appendix A, Eq. (103)]), for scalar-valued Gauss-Markov processes with additional observation noise.
Remark 8: (A lower bound on R̄^G_in(D)) A lower bound on R̄^G_in(D) was recently derived in [25, Appendix A, Eq. (103)] via [1], for time-invariant scalar-valued Gauss-Markov processes with additional observation noise. This bound is described by (60), where D̄_max = D_max = λ, and D̄_min can be computed by finding the steady-state a posteriori error variance solution of a pre-KF algorithm imposed between the system process {x_t : t ∈ N₀} and the output of the observations process {z_t : t ∈ N₀}. One can easily obtain, after some calculations, that D̄_min = D_min.
Simulation study. In Fig. 3, we provide an illustrative example in which we compare (57), (58) and (60). In addition, we illustrate the RL gap of Proposition 3. Interestingly, our simulations show that the lower bound obtained via [1] achieves lower rates than the exact solution of our bound. Hence, by Proposition 1, this means that the bound in (60) is not tight, in general, with respect to its optimal solution R^G_in(D), because if it were tight it would be an upper bound on our solution. Clearly, (57) and (60) coincide if σ²_n = 0.

V. ANALYTICAL BOUNDS BEYOND GAUSSIAN PROCESSES
In this section, we derive analytical lower bounds for additive i.i.d. zero-mean noise processes. First, note that a subclass of R_in(D) in (21) that only considers the best linear coding policies (among all linear policies with causal MMSE decoding) is already obtained from the analysis of jointly Gaussian processes in §IV (see Remark 4). Recall that this description is defined in (39). In what follows, we denote this specific class of linear coding policies by R^linear_in(D). Next, we state the following proposition.
Proposition 4: (Data-rate bounds) For the system model described by (1), (2) with MSE distortion, the following inequalities hold

VI. CONCLUSIONS AND ONGOING RESEARCH
In this paper, we introduced a novel sequential RDF to compute lower bounds on the average length of all causal prefix free codes (zero-delay codes) for a partially observable multivariate Markov system driven by additive i.i.d. zero-mean noise processes with an MSE distortion constraint. When our setup is characterized by jointly Gaussian processes, we provided a complete characterization in finite time and in the asymptotic limit, together with a non-trivial computational approach by means of a generalized reverse-waterfilling algorithm that solves the problem optimally in the asymptotic limit, under the assumption that all matrices of the problem commute by pairs. For scalar-valued Gaussian processes with additional observation noise, we found an exact explicit solution. Finally, we derived analytical results for setups that go beyond additive zero-mean i.i.d. Gaussian processes.
As an ongoing research activity, we will extend our framework to non-stationary partially observable systems with various distortion constraints, revisit the celebrated LQG closed-loop control problem for linear time-invariant partially observable Gaussian systems (see, e.g., [21]), and re-derive the separation principle for noiseless channels based on our new information measure. We expect that, in comparison to [21, Fig. 3], the additional pre-KF algorithm in the realization of the predictive coding scheme will not be necessary, which will significantly reduce the system's complexity.