Multi Stage Kalman Filter (MSKF) Based Time-Varying Sparse Channel Estimation with Fast Convergence

—The paper develops novel algorithms for time-varying (TV) sparse channel estimation in massive multiple-input, multiple-output (MMIMO) systems. This is achieved by employing a novel reduced (non-uniformly spaced tap) delay-line equalizer, which can be related to low/reduced-rank filters. This low-rank filter is implemented by deriving an innovative TV (Krylov-space based) Multi-Stage Kalman Filter (MSKF), employing appropriate state estimation techniques. MSKF converges very quickly, within a few stages/iterations (at each symbol). This is possible because MSKF uses those signal subspaces that are maximally correlated with the desired signal, rather than the standard principal component (PCA) signal subspaces. MSKF also reduces the channel tracking errors encountered by a standard Kalman filter in a high-mobility channel. In addition, MSKF is well suited to large-scale MMIMO systems, unlike most existing methods, including recent Bayesian belief-propagation, Krylov, fast iterative re-weighted compressed sensing (RCS) and minimum-rank minimization methods, which require more and more iterations to converge as the scale of the MMIMO system increases.


I. INTRODUCTION
MMIMO systems are considered for high data rate communications in sparse channels, e.g., digital television (DTV) [1], [2], echo cancellation, underwater acoustics [3], and millimeter-wave (mmwave) 5G communications [4]. For example, in terrestrial DTV transmission [1], [2], a typical receiver is expected to handle multipath with delays as long as 18 microseconds, which, at high symbol rates, requires adaptive finite impulse response (FIR) linear equalizers with several hundred symbol-spaced taps [5]. In order to alleviate dynamic multipath, due to propagation effects, flutter from moving objects (e.g., airplanes) and changing atmospheric conditions, the equalizer must update its coefficients at high speed. This situation is also witnessed in a high data rate wireless channel, where only the main signal and a few multipath reflected signals are significant among (maybe) hundreds of channel taps in a tapped-delay-line model. Advanced sparse channel estimation methods, requiring estimation of only a few significant channel tap weights, have been developed for orthogonal frequency division multiplexing (OFDM) [6] and code division multiple access (CDMA) systems, and provide superior performance.
Existing works estimate the significant channel tap locations first (by methods such as [7]), after which least-squares (LS) methods (employing training subcarriers only) are used to estimate the significant channel tap weights; these may require a large amount of training data, making them unsuitable for MMIMO. The large number of training symbols/subcarriers required, and the resultant pilot contamination/re-use problem in MMIMO [8], necessitated the development of blind/semi-blind sparse time-invariant (ITV) channel/data estimation methods [9]-[12]. Though [11], [12] perform much better than [9], [10], all of these blind algorithms are data block-based methods (i.e., they require a block of received data symbols to be collected before filtering), and are not adapted for symbol-by-symbol update in rapidly time-varying (TV) channels. This motivates the development of time (symbol) iterative/update Kalman-like filters. The novel methods here utilize the ideas of reduced-rank, sparse, and multistage Kalman filters (MSKF) jointly, to exploit the sparsity in different dimensions (time, space, etc.).
Recent methods, like the popular sparsity-based compressed sensing (CS) and Bayesian methods [13]-[20], yield superior performance in sparse channels, including MMIMO mmwave spatially sparse channels, by exploiting the low-rank angular structure induced by the multi-ray channel model with narrow angular spread (AS). But they hold only for much simpler single-path (not multipath) channels, and thus do not utilize the temporal sparsity in the multipath (lag) domain. Moreover, CS and Bayesian methods are computationally very demanding, and their few iterative versions, like proximal re-weighted CS (RCS) [16], [18] or [19], converge only after many symbols, making them inadequate for high-mobility TV channels. Also, one of the few existing Krylov-space based TV channel estimators [21] models the time variation by a basis expansion model (BEM), which inevitably introduces approximation error into the channel estimates, due to the imperfect model assumed. Here, on the other hand, the novel equalizer is updated by reduced-rank, TV, novel MSKF and Krylov-Kalman filters, with reduced computational complexities. In particular, the novel multi-stage MSKF performs very well, employs some data censoring and converges quickly, within a few stages/iterations. Moreover, the novel MSKF is able to reduce the channel tracking errors of a standard Kalman filter, which occur in a high-mobility TV channel (as seen in Fig. 8 and the text below it, [22]).

A. Contributions
The main contributions of this paper are:
• A novel symbol-iterative MSKF, having superior performance and rapid stage-wise convergence (for each symbol), is developed for high-mobility sparse TV channels. This is achieved by:
1. A novel TV reduced (non-uniformly spaced) equalizer structure, reminiscent of some time-invariant (ITV) reduced equalizers for DTV and echo cancellation. ITV reduced equalizers have been seen to outperform uniformly symbol-spaced estimators, as evidenced in [1] (Fig. 3), and in learning curves (Figure 4, [1] and Figure 3, [2]). Using auto-regressive (AR) TV channel models, the reduced equalizer is generalized to exploit any available sparsity in cluster-sparse channels, even with continuous ISI and with time-varying significant channel tap locations, to cater to real-world channels encountered in 4G/5G transmission (Sections II and III).
2. Substantial synthesis and analysis of the reduced equalizer then provides the reduced channel vectors and matrices for data estimation in high-mobility sparse TV channels. Then, utilizing the fact that MMIMO systems have very large J (number of receive antennas), a low-rank algebraic reduced equalizer structure is deduced (Section III).
3. Next, motivated by the ITV Multi-Stage Wiener Filter (MSWF), which exhibits fast stage-wise convergence in some DOA applications [23], a novel TV Multi-Stage Kalman Filter (MSKF) is integrated into the reduced-rank equalizer structure above (Sections III and IV). A significant derivation of this novel dynamic, sparse MSKF equalizer, with data censoring (as in sensor networks [24]), is developed in Sections III and IV.
4. Performance analysis of the TV MSKF's much improved (order/iteration)-wise convergence of mean squared error (MSE) (at each symbol), and analytical comparison and connection with Bayesian, CS and other existing sparse methods, along with its computational efficiency (Section VI).
Other contributions include:
1. A close connection between the ideas of compressed sensing (CS) and reduced-rank filters (matrix rank minimization) in sparse estimation (Sec. III.B), and the development of a second novel TV Krylov filter (Sec. V).
2. Extensive comparative simulations of MSKF with recent Bayesian, CS and Krylov-space based sparse estimators, in high-mobility MMIMO systems (Section VII).
Conclusions are provided in Section VIII.
Notations: Bold upper-case symbols (A) denote matrices; bold lower-case symbols (b) denote vectors. I_i is an identity matrix of size i × i, and 0_{j,k} is a j × k zero matrix. Also, A(i : j, k : l) denotes the ith to jth rows and kth to lth columns of the matrix A.

II. SYSTEM MODEL
A single-carrier sparse channel transmission system, with maximum multipath delay spread of up to L symbols, is considered. A novel algorithm is designed to determine the finite impulse response (FIR) equalizer required to invert this channel in the minimum mean squared error (MMSE) sense [5]. Consider first a SIMO system, with a single transmit antenna and J receive antennas. The TV lth-tap channel weight, at symbol n, is h(n, l) = [h_1(n, l), h_2(n, l), ..., h_J(n, l)]^T, l = 0, 1, ..., L − 1, where h_j(n, l) is the lth-lag channel weight from the transmitter to the jth receive antenna. However, in a sparse channel, only D (out of a total of L) channel tap weights have non-zero values; in many cases, D << L. In the next section, we consider more general cluster-sparse channels with continuous ISI (e.g., the 3G LTE channel). The J × 1 received signal (on the J receive antennas) is

y(n) = Σ_{k=0}^{D−1} h(n, l_k) s(n − l_k) + w(n),   (1)

where the l_k's denote the non-zero-weighted channel tap locations. Generally, l_0 = 0 [9], [10]. Also, assume that 0 = l_0 < l_1 < l_2 < ... < l_{D−1} ≤ L − 1; w(n) is the J × 1 additive white Gaussian noise (AWGN). For MIMO systems with S̄ transmit antennas and J receive antennas, the channel matrix H̃(n, l) is a J × S̄ matrix. Three general assumptions are made as follows: (A1) The symbol sequence of each user, s(n), is temporally white with zero mean and unit variance, and s(n) is statistically uncorrelated with s(n − m) for m ≠ 0. (A2) The noise sequences w_j(n) are stationary, and temporally and spatially white with zero mean and variance σ_w². (A3) The symbol sequences s(n) are statistically uncorrelated with the noise sequences w_j(n).
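As a concrete illustration of the sparse model (1) and assumptions (A1)-(A3), the following minimal numpy sketch generates the received signal for a SIMO sparse channel. All sizes, the BPSK alphabet, and the fixed tap weights are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

J, L, D = 4, 16, 3                # receive antennas, delay spread, nonzero taps
taps = [0, 5, 11]                 # significant tap locations l_0 < l_1 < l_2, l_0 = 0
h = rng.standard_normal((D, J))   # J x 1 weight vector h(n, l_k) per significant tap
sigma_w = 0.1                     # noise std (A2)

N = 50
s = rng.choice([-1.0, 1.0], size=N + L)   # white, unit-variance symbols (A1); BPSK here

def received(n):
    """y(n) = sum_k h(n, l_k) s(n - l_k) + w(n): only D of the L taps contribute."""
    y = sum(h[k] * s[L + n - taps[k]] for k in range(D))
    return y + sigma_w * rng.standard_normal(J)   # AWGN w(n), uncorrelated with s (A3)

Y = np.array([received(n) for n in range(N)])
```

Note that although the tapped-delay line has L = 16 lags, only the D = 3 listed taps ever enter the sum, which is exactly the sparsity the later algorithms exploit.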

III. ALGORITHM DEVELOPMENT
A. Extended Channel Model 1) Group or Clustered Sparsity: Next, group sparsity [25] is considered, where the few non-zero (i.e., significant) channel taps occur in clusters/blocks in a structured manner; see Fig. 1 a). Suppose the multipath channel consists of D clusters (instead of D single taps). The total support S of the channel is given by S = ∪_{k=0}^{D−1} S(k), with S(k) being the support of the kth cluster. The kth cluster consists of |S_k| consecutive multipaths at lags l_k, (l_k + 1), ..., (l_k + |S_k| − 1); let S̄ = max_{k=0,1,...,D−1} |S_k|. Then (1) can be rewritten as

y(n) = Σ_{i=0}^{S̄−1} Σ_{k=0}^{D−1} h(n, l_k + i) s(n − l_k − i) + w(n),   (2)

with h(n, l_k + i) = 0 for i ≥ |S_k|. For a fixed i, the inner summation in (2) is again a D-sparse channel, with sparse multipaths still separated by {(l_{k+1} + i) − (l_k + i) = (l_{k+1} − l_k)} taps, just as in (1). Defining H(n, l_k) ≜ [h(n, l_k), h(n, l_k + 1), ..., h(n, l_k + S̄ − 1)] and S(n − l_k) ≜ [s(n − l_k), s(n − l_k − 1), ..., s(n − l_k − S̄ + 1)]^T, (1) is equivalent to

y(n) = Σ_{k=0}^{D−1} H(n, l_k) S(n − l_k) + w(n).   (3)

Since the different components of H(n, l_k) are uncorrelated with each other (and similarly for S(n)), equation (3) can be regarded as an S̄-user sparse channel model. Then there is the issue of continuous ISI, as in a 3G LTE Pedestrian B channel: there may be some multipaths in between the significant multipath clusters, shown in Figure 3, [14]. Their powers may be approximately 50, 70, 170 dB below that of the main path. Simulation results for the TV 3G LTE channel, in Sec. VII, show that the effect of these paths is small; see (last right-most paragraph, p. 1428) [11] and Table II, [9].
2) Time-Variation of Significant Tap Locations: Assume that the time-evolution of the random TV channel is modeled by a first-order auto-regressive (AR(1)) model [22], [5], with the significant channel tap locations l_k(n) a function of the symbol index n:

H(n, l_k(n)) = λ H(n − 1, l_k(n − 1)) + V(n),   (4)

where V(n) is the process noise with zero mean and covariance σ_v² I. λ represents how fast and how much the time-varying part of the channel taps H(n, l_k(n)) varies with respect to the mean of H(n, l_k(n)). Substitute (4) into (3); the noises satisfy E{V(n)} = 0, E{w(n)} = 0, and both are also uncorrelated with the signal S(m). Then l_k(n − 1) = l_k(n) − 1, i.e., the significant channel tap locations are shifted by 1, which is already accommodated in the AR(1) model (4) above. This novel channel model is illustrated in Fig. 1 a) and b).
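The AR(1) evolution (4) can be sketched in a few lines. Choosing σ_v² = 1 − λ² is our own assumption (not stated in the text) that keeps the per-tap power stationary over time:

```python
import numpy as np

rng = np.random.default_rng(1)

J, D = 4, 3
lam = 0.988                        # λ near 1: slowly fading; smaller λ: high mobility
sigma_v = np.sqrt(1 - lam**2)      # our choice: keeps E{|H|^2} constant over symbols

H = rng.standard_normal((D, J))    # H(0, l_k(0)) for the D significant taps
for n in range(1, 200):
    # eq. (4): H(n, l_k(n)) = λ H(n-1, l_k(n-1)) + V(n)
    H = lam * H + sigma_v * rng.standard_normal((D, J))
```

With this σ_v, each entry's stationary variance satisfies λ²·1 + (1 − λ²) = 1, so λ only trades memory for innovation, matching its role as a mobility parameter.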

B. Motivation
There is a close connection between the ideas of compressed sensing (CS) and matrix rank minimization (similar to reduced-rank filters), for sparse beamformed (channel) estimation in the angular and temporal domains [18]-[20]. [20] investigates (single-tap, non-multipath) mmwave channel sparsity in the angular/space (Direction of Arrival (DOA)) domain, where the low-rank algebraic structure of the channel matrix is exploited by employing a reduced-rank method, followed by a CS sparse method. [19] solves the same problem (in MMIMO) using CS methods only. On the other hand, [18] compares the performance of individual CS and reduced-rank methods in MMIMO mmwave multipath channels. In this paper, a novel reduced-rank filtering method is used for sparse multipath (non-beamformed) channel estimation. This is facilitated by the very large J in MMIMO, which makes Assumption (A4) (below) more likely to be satisfied. A novel model of a sparse (reduced) equalizer, adequate for data estimation in a sparse channel, is introduced in this section. A sparse equalizer means that only a few of its taps (in a tapped-delay-line model of a linear FIR equalizer) have significant weights. These significant, non-uniformly spaced FIR equalizer tap locations are seen to be related to the auto-correlation matrix of y(n) [11]. Define the ith-lag auto-correlation matrix, at symbol n, as

R(n, i) = E{y(n) y^H(n − i)}.   (5)

After evaluating R(n, i) for the sparse channel in (1), it can be seen that the TV R(n, i) ≠ 0 only for the lags i that are differences of the significant tap locations [11]. Then, an algorithm for selecting the non-uniformly spaced equalizer tap delays {m_p}_{p=1}^{N} is enumerated in Table I, by extending [11]. Now the noiseless reduced data vector y_red(n) = [y^T(n − m_1), y^T(n − m_2), ..., y^T(n − m_N)]^T (Table I) can be written as y_red(n) = H_red(n) s_red(n), where H_red(n) and s_red(n) are the corresponding reduced channel matrix and transmitted data vector, respectively.
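The lag-autocorrelation test behind Table I can be checked numerically: for a channel of the form (1), the sample R(n, i) is (near) zero except at lags that are differences of significant tap locations. The fixed tap weights and the 10%-of-peak-norm threshold below are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(2)

J, L, N = 4, 16, 20000
taps = [0, 5, 11]                                   # significant tap locations
h = np.stack([np.full(J, 1.0),                      # fixed, deterministic tap weights
              np.full(J, -0.8),
              np.full(J, 0.6)])

s = rng.choice([-1.0, 1.0], size=N + L)             # white unit-variance symbols
w = 0.05 * rng.standard_normal((N, J))
Y = np.zeros((N, J))
for n in range(N):
    Y[n] = sum(h[k] * s[L + n - taps[k]] for k in range(3)) + w[n]

def R_hat(i):
    """Sample lag-i autocorrelation, R(i) = E{y(n) y^T(n - i)}."""
    return Y[i:].T @ Y[:N - i] / (N - i)

norms = np.array([np.linalg.norm(R_hat(i)) for i in range(L)])
significant_lags = set(np.flatnonzero(norms > 0.1 * norms.max()))
```

The detected lags are exactly the pairwise differences {l_k − l_j} = {0, 5, 6, 11}; the equalizer taps of Table I are then placed only at such lags.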
Obtaining general expressions for the novel reduced channel matrix (for generic sparse channels, which have many different combinations of {l_j}_{j=0}^{D−1}, even for the same sparsity level D) may not always be possible [11], [12]. The reduced channel matrix and transmitted data vector can only be illustrated fully for specific channels, e.g., the Example I channel (equations (14), (47), (48) and (49) in [11]). Extending it to the reduced, TV Example I channel matrix H_red(n) here gives (for M = 23) equation (7) (on top of the next page), of size 12J × 21S̄. If we had taken equalizer taps at all lags, the "full" channel matrix H_full(n) would be of size MJ × (M + L)S̄ = 23J × (23 + 12)S̄ = 23J × 35S̄ for the Example I channel [11]. An assumption made is: (A4) the channel matrix has full column rank. In MMIMO, with a very large number of receive antennas J, the channel matrix is a very "tall" matrix, which makes Assumption (A4) more likely to be valid. Then rank(H_full(n)) = 35S̄ always, irrespective of the sparsity structure in the channel. But rank(H_red(n)) = 21S̄ for the Example I channel, a reduced rank that depends on each specific sparse channel. This is unlike the full channel matrix used traditionally in data estimation; it implies that the reduced equalizer methodology has transformed the sparse channel estimation problem into one of TV reduced-rank filtering. The equivalence between the reduced equalizer and a sparsity-promoting Bayesian estimator [26], [27] is shown in [12] (Section V), by taking the (sparse) channel's prior probability density function (pdf) as f((H(n, l_k))_{(i,j)}) = |(H(n, l_k))_{(i,j)}|^{−1/2} (i.e., the magnitude of a channel tap has a low value with high probability, and a large value with low probability [26]). Then equations (26)-(34) in [12] show that this (prior) pdf leads to a reduced-rank filter.
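The rank observation above is easy to check numerically: a tall matrix of the "full" Example I size has full column rank (with probability one for generic entries), while the reduced matrix simply has fewer columns. The i.i.d. Gaussian entries below are an illustrative stand-in for the structured channel matrices, with S̄ = 1 and J = 4 assumed:

```python
import numpy as np

rng = np.random.default_rng(3)
J, S_bar = 4, 1

H_full = rng.standard_normal((23 * J, 35 * S_bar))   # "full" equalizer channel matrix
H_red = rng.standard_normal((12 * J, 21 * S_bar))    # reduced (non-uniform taps) version

# (A4): tall generic matrices have full column rank almost surely
rank_full = np.linalg.matrix_rank(H_full)            # 35 * S_bar
rank_red = np.linalg.matrix_rank(H_red)              # 21 * S_bar
```

The reduced structure thus lowers the effective rank of the estimation problem from 35S̄ to 21S̄ while keeping the matrix tall, which is what permits the reduced-rank filtering that follows.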

C. State Space Representation and Innovations in Reduced Filter
Here, the state is the channel H(n, l_k(n)), for which the measurement equation (using (3)) is

y(n) = C(n) H̃(n) + w(n),   (8)

where C(n) is the measurement matrix and H̃(n) collects the channel taps. However, since the significant channel locations {l_k} are not known a priori, the novel MSKF algorithm and its simulations start by assuming that all channel taps are present. The only thing known (using Algorithm I) is the set of (non-uniformly spaced) reduced equalizer lags, i.e., the y_red(n) vector. The TV channel's dynamic state equation,

H̃(n) = λ H̃(n − 1) + V(n),   (9)

follows from (4). Since the reduced equalizer is y_red(n) = [y^T(n − m_1), y^T(n − m_2), ..., y^T(n − m_N)]^T, H̃(n) needs to be updated at times (symbols) {(n − m_N), ..., (n − m_2), (n − m_1), ..., n} only. This may be viewed as a form of data censoring (i.e., using the received signal only at certain selected symbols), as in sensor networks [24]. Now, y(n) in (8) is used in block-based channel estimation in [11], i.e., a block of received data symbols is collected before H̃(n) is estimated; the estimate is not iteratively updated from one symbol to the next, which is the objective of this paper. For our novel reduced Kalman filter, the innovations (used to derive a time-iterative algorithm) are

ỹ(n) = y(n) − C(n) Ĥ(n|n − m_1).   (10)

Similarly, the channel estimate Ĥ(n|n) uses the current data ỹ(n); the a priori channel estimate Ĥ(n|n − m_1) uses the past data {y(n − m_j)}_{j=1}^{N}; the a priori estimation error is denoted by H̃_e(n|n − m_1). Next, the innovations have to be expressed in terms of the Kalman state matrices (given in equations (8) and (9)) [unlike the time-invariant Wiener case (MSWF [23], [11])]. From (8), ỹ(n) and its auto-correlation matrix are

ỹ(n) = C(n) H̃_e(n|n − m_1) + w(n),   Rỹ(n) = C(n) R_{H̃_e}(n|n − m_1) C^H(n) + σ_w² I,   (11)

where R_{H̃_e}(n|n − m_1) is the a priori channel error correlation matrix. Now, a time-update of the state H̃(n) is obtained (similar to the classical "full" Kalman filter, which uses all channel taps [28]).

IV. MULTI STAGE KALMAN FILTER (MSKF)
A computationally efficient, reduced-rank Multistage Wiener Filter (MSWF) for time-invariant (ITV) systems was developed in [23]; it converges similarly to some Krylov-based methods. It involves a reduction in the dimensionality of the observed data to obtain an MMSE filter that is as close as possible to the one attainable if all the observed data were used in the estimation process. [23], and its variants [31], have been successfully used in CDMA data estimation, and recently in semi-blind estimation of time-invariant (ITV) sparse channels [11]. The novel Multistage Kalman Filter (MSKF) here is inspired by such considerations, and involves a substantial extension to TV state estimation. In this paper, the innovations data ỹ(n) is used to estimate the desired signal H̃(n|ỹ(n)) (i.e., the second term on the RHS of (12)), by a novel, fast-converging, stage-by-stage filter structure, referred to as the MSKF.

B. MSKF Derivation
The top-level (main) Kalman filter weight w_{z0}(n), in (16), is now implemented in a multi-stage fashion (MSWF), leading to faster (order)-wise convergence at reduced complexity. The scalar MSWF [23] is extended to a vector MSWF (V-MSWF); for ease of presentation, the derivation of the V-MSWF is provided in the Technical Report [32], and the main steps of the V-MSWF algorithm are shown in Table II. Table II's equations, (47)-(51), will then be directly applied here to the TV state space model, (9), (10), to derive the MSKF's novel state estimation, in terms of the state space matrices. The block diagram of the novel MSKF, with N = 3 stages, is shown in Fig. 2. First, the J × J(M + 1)S̄-sized (0th-order) cross-correlation, R_{ỹ(n),H̃_e(n|n−m_1)}, and its normalized version, C_1, are defined as

R_{ỹ(n),H̃_e(n|n−m_1)} = E{ỹ(n) H̃_e^H(n|n − m_1)},   (17)
C_1 = R_{ỹ(n),H̃_e(n|n−m_1)} / ||R_{ỹ(n),H̃_e(n|n−m_1)}||.   (18)

In the time-invariant (ITV) case [23], [11], the expectation operator in (17) is implemented by time averaging over a number of received symbols. But here, the different received symbols are generated by different channels H̃(n), which vary from symbol to symbol, making time averaging unsuitable for this situation. To circumvent this problem, R_{ỹ(n),H̃_e(n|n−m_1)} has to be computed in terms of matrices available at time n (as is done in a standard Kalman filter). Substituting equation (10) into (17),

R_{ỹ(n),H̃_e(n|n−m_1)} = C(n) R_{H̃_e}(n|n − m_1) + E{w(n) H̃_e^H(n|n − m_1)}.   (19)

Now, the 2nd term on the RHS of (19) (the noise term) is

E{w(n) H̃_e^H(n|n − m_1)} = E{w(n) H̃^H(n)} − E{w(n) Ĥ^H(n|n − m_1)}.   (20)

The first term in (20) is zero: w(n) is uncorrelated with the process noise V(n), and also with H̃(n − m_1) (which depends on V(n − m_1), ..., V(n − m_1 − k), etc.). Similarly, the second term in (20), E{w(n) Ĥ^H(n|n − m_1)} = 0, since Ĥ(n|n − m_1) is estimated from {ỹ(n − j)}_{j=m_N}^{m_1}, which contain the measurement noises {w(n − j)}_{j=m_N}^{m_1}, all of which are uncorrelated with the white noise w(n) at symbol n. Then (18) can be computed by

R_{ỹ(n),H̃_e(n|n−m_1)} = C(n) R_{H̃_e}(n|n − m_1).   (21)

Equation (21) is a key equation. From (8), C(n) is known at time n.
Moreover, R_{H̃_e}(n|n − m_1) is iteratively updated from its past value at the (n − m_1 − m_2)th symbol, by (15) and (40) below, and is thus available (at the present time n) for computing C_1 by (21), as the product of C(n) (the measurement matrix in the state-space representation) and R_{H̃_e}(n|n − m_1). This avoids the explicit time averaging in (17), which is done in the time-invariant (ITV) case in [11]. Equation (21) also gives the 0th-order Wiener filter weights in (16) as

w_{z0}(n) = Rỹ(n)^{−1} C(n) R_{H̃_e}(n|n − m_1).   (22)

Again, (22) avoids explicit time averaging and is computed from the state matrices available at the present nth symbol. In the next stage, the (1st-order) JS̄(M + 1) × 1 desired signal vector d_1(n) and blocking matrix B_1 are formed as

d_1(n) = C_1^H ỹ(n),   (23)
B_1 = I − C_1 C_1^H.   (24)

It can easily be shown that, when B_1 operates on any signal, it removes the component of C_1 present in that signal, i.e., B_1 C_1 = 0. The 1st-order signal is defined as

ỹ_1(n) = B_1 ỹ(n).   (25)

Next, the 1st-order normalized cross-correlation C_2 is defined as the normalized version of R_{ỹ_1(n),d_1(n)}, by using (23) and (24). From (21),

R_{ỹ_1(n),d_1(n)} = E{ỹ_1(n) d_1^H(n)} = B_1 Rỹ(n) C_1.   (26)

Again, explicit time averaging (over a number of received data symbols) is avoided in (26).
To obtain the equations for any generic order (stage), the order is changed from the ith to the (i + 1)th and (24) is used; the 2nd-order cross-correlation is obtained in the same way as (26), and the analogous expression can be shown for the generic ith order. Also, Rỹ_i(n) = (∏_{j=1}^{i} B_j) Rỹ(n) (∏_{j=1}^{i} B_j). The normalized cross-correlations C_i, blocking matrices B_i, and the ith-order desired signal d_i(n) and data ỹ_i(n) are thus generated as the order i is increased over 1, 2, ..., N (up-recursions). Now that the different-order signals have been generated, a reduced-rank multistage estimation algorithm is derived. This requires the 0th-order desired signal D_0(n) = H̃(n|ỹ(n)) (in the outer loop of the block diagram of Fig. 2) to be estimated from the 1st-order data (Table II), in the 1st inner loop of Fig. 2. The MMSE filter w_{z1}, in (48) (Table II), is employed for this purpose. This process is continued in a nested fashion, to generate the (i + 1)th-order inner loop from the ith loop in Fig. 2. Thus, at the (i + 1)th stage, d_i(n) has to be estimated by

d̂_i(n) = w_{i+1}^H ε_{i+1}(n),   (30)

where w_{i+2} = Rỹ_{i+1}^{−1} R_{ỹ_{i+1},d_{i+1}} are the Wiener tap weights for estimating d_{i+1}(n) from ỹ_{i+1}(n). In (30), obtained by extending (51) to the ith order (and after some algebra), the estimation error ε_{i+1}(n) = [d_{i+1}(n) − w_{i+2}^H ỹ_{i+1}(n)] is the error between the (i + 1)th-order desired signal d_{i+1}(n) and its estimate d̂_{i+1}(n) = w_{i+2}^H ỹ_{i+1}(n) (using the (i + 1)th-stage data ỹ_{i+1}(n)). The application of the weight w_{i+1} to this ε_{i+1}(n), in (30), provides the estimate of the lower (ith)-order desired signal, i.e., d̂_i(n), leading to the down-recursion in (34). Physical Interpretation: This nested (order)-wise filter structure is possible because: 1. By the orthogonality principle of MMSE estimation, the estimation error ε_{i+1}(n) is orthogonal to the data used in the estimation, i.e., ỹ_{i+1}(n).

2. Again, by construction, the lower-order desired signal d_i(n) is uncorrelated with the higher-order data ỹ_{i+1}(n), since the blocking matrix B_{i+1} removes the C_{i+1} component.
3. Since d_i(n) and ε_{i+1}(n) are both uncorrelated with ỹ_{i+1}(n), there may still be some correlation between d_i(n) and ε_{i+1}(n), which incidentally is ∆_{i+1}. Hence, ε_{i+1}(n) can be used for estimating the desired signal d_i(n) [i.e., d̂_i(n) = w_{i+1}^H ε_{i+1}(n) in equation (30) above], leading to the novel nested MSKF filter. Initializing ε_N(n) by ε_N(n) = d_N(n) = C_N^H ỹ_{N−1}(n), the error energy (33) is obtained; again, no explicit time averaging (over a number of data symbols) is involved. Using (29)-(33), we have the order down-recursions, for j = N, N − 1, ..., 1,

ε_{j−1}(n) = d_{j−1}(n) − w_j^H ε_j(n).   (34)

Now, in order to avoid using the expectation operator in (36), we have, from (28), equation (37). Again, (37) is implemented from quantities precomputed at the previous time, employing only matrix multiplications, without any explicit time averaging (over a number of received data symbols). Time-Updates: Using (34)-(37) and (14), the desired-signal D_0(n) = H̃(n|ỹ(n)) channel estimate and the channel estimation error H̃_e(n|n) are given by (38) and (39). First, the a priori channel error correlation matrix is iteratively predicted by

R_{H̃_e}(n|n − m_1) = λ² R_{H̃_e}(n − m_1|n − m_1) + σ_v² I.   (40)

Then, using (15), the a posteriori R_{H̃_e}(n|n) is updated. The novel MSKF algorithm is then fully tabulated in Table III.
C. Some Issues
1) MMIMO case: Next, re-visit the discussion in the "Note" (Section II) about the common sparsity-support (over all receive antennas) assumption being violated in MMIMO, since the received signal is delayed differently at the J receive antennas, with the overall distance between them increasing for large J. Following ([15], p. 106, and Table I), for a large array aperture relative to the signal bandwidth (BW), the significant channel tap locations l_k vary spatially, i.e., differ across the farthest antennas, as proved in [15]. As an illustrative example, in Fig. 1 c), J = 3J̄; say, over the bottom J̄ receive antennas, the significant channel tap locations are l_0, l_1, l_2; over the next (middle) J̄ antennas, the locations are l_0 + 1, l_1 + 1, l_2 + 1; while over the top-most J̄ antennas, they are at l_0 + 2, l_1 + 2, l_2 + 2.
Thus, for the middle antenna group, the channel location vector has non-zero entries only at lags (l_0 + 1), (l_1 + 1), (l_2 + 1), i.e., h_mid = [0, ..., 0, h(l_0 + 1), 0, ..., 0, h(l_1 + 1), 0, ..., 0, h(l_2 + 1), 0, ..., 0]^T. Then y(n), in (3), can be considered as an S̄ = 3-user system, with channels [h_bottom, h_mid, h_top], as in Fig. 1 c) (simulations in Sec. VII).
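To make the up/down recursions of the MSKF derivation concrete, a time-invariant (MSWF [23]) toy version with sample-averaged statistics is sketched below; the MSKF replaces these sample averages with the state-space quantities of (21), (26), etc. The scenario (a scalar desired signal observed through an 8-dimensional data vector in colored noise) is our own illustrative construction:

```python
import numpy as np

rng = np.random.default_rng(4)
M, Nsamp = 8, 5000

a = rng.standard_normal(M)                 # coupling of desired signal into the data
d0 = rng.standard_normal(Nsamp)            # desired signal d_0(n)
mix = rng.standard_normal((M, M))          # coloring matrix for the noise
Y = np.outer(d0, a) + 0.5 * rng.standard_normal((Nsamp, M)) @ mix

def mswf_estimate(Y, d0, n_stages):
    """Up-recursion: normalized cross-correlations c_i, blocking B_i = I - c_i c_i^T.
    Down-recursion: nested scalar Wiener weights combine the stage errors."""
    ys, d, ds = Y.copy(), d0, []
    for _ in range(n_stages):
        r = ys.T @ d / len(d)              # stage cross-correlation (time average)
        c = r / np.linalg.norm(r)          # normalized direction, cf. C_i
        d_next = ys @ c                    # d_{i+1}(n) = c_{i+1}^T y_i(n)
        ys = ys - np.outer(d_next, c)      # y_{i+1}(n) = B_{i+1} y_i(n)
        ds.append(d_next)
        d = d_next
    eps = ds[-1]                           # eps_N(n) = d_N(n)
    for i in range(n_stages - 2, -1, -1):  # down-recursion, cf. (34)
        w = (ds[i] @ eps) / (eps @ eps)
        eps = ds[i] - w * eps              # eps_i(n) = d_i(n) - w_{i+1} eps_{i+1}(n)
    w0 = (d0 @ eps) / (eps @ eps)
    return w0 * eps                        # estimate of d_0(n)

mse = [np.mean((d0 - mswf_estimate(Y, d0, k)) ** 2) for k in (1, 2, 3)]
```

Because each added stage enlarges the (nested Krylov) subspace used for the estimate, the sample MSE is non-increasing in the number of stages, which is the order-wise convergence analyzed in Section V.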

D. Kalman Krylov Filter (KKL)
In KKL, the Wiener filter (16) is implemented using an Arnoldi-Krylov-Householder method [30], [29], expected to have superior numerical properties to [28]. For ease of presentation, the KKL algorithm is shown in Table IV.

V. PERFORMANCE ANALYSIS: COMPARISON OF ORDER-WISE CONVERGENCE SPEEDS
In [23], it is shown that the time-invariant (ITV) MSWF filter converges to an N-dimensional subspace that has the largest correlations between the eigenvectors of Rỹ(n) and the desired signal D_0(n) = H̃(n|ỹ(n)) (equation (76), [23]). Suppose that, at the 1st stage, the Kalman filter weight (with ỹ(n) = z_0(n) in (16)) is collinear with the cross-correlation vector C_1, i.e., let w_ỹ(n) = k C_1, where k is a scalar constant. Then, after just 1 stage/iteration (by (23)), d_1(n) = C_1^H ỹ(n) = (1/k) w_ỹ^H(n) ỹ(n), i.e., the final channel estimate. Thus, just after 1 stage, d_1(n) (in MSKF) gives the optimal estimate of the desired signal H̃(n|ỹ(n)), for each symbol n. Then, using (41), the 2nd-stage C_2 is zero by (31) (for i = 0). Thus, there is no further need to compute the succeeding stages' C_j, for j ≥ 2, similar to (p. 2953, [23]). Again, (15) in MSKF gives (43). For a given R_{H̃_e}(n|n − m_1), (43) is minimized when its 2nd term, w_{ỹ(n)}^H R_{ỹ(n),H̃_e(n|n−m_1)} (= w_{ỹ(n)}^H C_1 ∆_1), is maximized [23]. This happens only when w_ỹ(n) and C_1(n) are in phase with each other, i.e., the Kalman filter weight w_ỹ(n) is collinear with the cross-correlation vector C_1(n). Then the maximum value (of the 2nd term of (43)) is C_1^H(n) C_1(n) ∆_1, and this minimizes (43). Thereby, by making the normalized cross-correlation collinear with the Kalman filter weight (at each stage/iteration, for symbol n), the MSKF achieves fast stage/iteration-wise convergence. This also leads to the novel MSKF exhibiting approximately the same convergence speed (number of iterations) even for large MIMO systems, i.e., a larger loading ratio R = S̄/J. This has also been seen in the ITV MSWF [33], where equation (46) and Fig. 4 show that, for any R < 1, the output SINR increases rapidly (to its optimal value) with increasing stage number i. This unique property of the novel MSKF, i.e., rapid convergence even for large-scale systems, is not exhibited by the Bayesian MSBL [16], [21] and RCS methods (Simulations, Sec. VII).
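The collinearity argument can be checked in a toy rank-1 setting: with Rỹ = σ²I + C_1 ∆ C_1^H (a single dominant cross-correlation direction), the Wiener/Kalman weight Rỹ^{−1} C_1 ∆ is exactly collinear with C_1, so one stage already attains the optimum. The numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
M = 6
c1 = rng.standard_normal(M)
c1 /= np.linalg.norm(c1)               # normalized cross-correlation direction C_1
delta, sigma2 = 2.0, 0.5

R = sigma2 * np.eye(M) + delta * np.outer(c1, c1)   # innovation covariance
w = np.linalg.solve(R, delta * c1)                  # Wiener weight R^{-1} C_1 Δ

# (σ² I + Δ c c^T)^{-1} c = c / (σ² + Δ): w is collinear with C_1
cos = abs(w @ c1) / (np.linalg.norm(w) * np.linalg.norm(c1))
```

Here w = (∆/(σ² + ∆)) C_1 exactly, so the first-stage direction C_1 already spans the optimal weight, and all higher-stage cross-correlations vanish.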
On the other hand, the Kalman-Krylov filter (KKL) converges to the dominant signal/eigen-subspace of Rỹ(n), corresponding to its N largest eigenvalues. Theorem 3.5.1 (p. 48, [28]) shows that the angle between the KKL subspace and the true eigen-subspace decreases at every stage/iteration. Also, the KKL algorithm steps show that it determines an N-dimensional subspace of Rỹ(n) (principal component analysis, PCA), rather than converging to one that has the largest correlations between Rỹ(n)'s eigenvectors and the desired signal D_0(n) (as done by the novel MSKF). Simulation results (Sec. VII) show that the MSKF converges quickly, within an order of 14 (i.e., the channel NRMSE remains almost the same even as the number of iterations increases from 14 to 40). But KKL converges slowly with increasing iteration number. For large systems (J = 28, S̄ = 14), MSKF performs well, while the MSBL, CS (RCS), [21] and PCA methods perform inadequately, i.e., they do not scale up well.

A. Comparison with Existing Methods
Not much work exists on TV MMIMO channel estimation [4] (p. 1926). The novelty of MSKF vis-a-vis existing algorithms is:
1. The MSKF is compared with re-weighted compressed sensing (RCS) [19], [18] (which is closer to the l_0-norm criterion than the l_1 norm). Starting with the RCS minimization (in proximal form), [19] develops an iterative CS algorithm using soft thresholding. This is then applied to sparse (angular domain) beamforming channel estimation; however, [19] is only for single-path (not multipath) channels. RCS converges very slowly, after many symbols, in ([19], Fig. 3) and also in simulations for our signal model (Section VII, Fig. 3), where it takes as many as 200-600 symbols to converge. This requires the channel to be static over that time period, and is thus unsuitable for high-Doppler channels, whereas MSKF works with the channel changing every symbol. This is because the number of stages (in OMP) is the sparsity level d = DKJ, which increases rapidly with increasing J, K in large-scale MMIMO systems; thus its convergence speed is slow.
2. By using additional beamforming hardware, which slows down the MMIMO channel time-variation ([17], p. 2, and its Ref. [12]), [4] and [17] estimate high-Doppler channels in mmwave communications, but both are developed only for single-tap channels, and do not exploit sparsity in the temporal (lag) domain.
3. The uniformly-spaced "Full" equalizer performs worse than a reduced, non-uniformly tap-spaced equalizer (see Figure 3 in [1]), and also in learning curves (Figure 4, [1] and Figure 3, [2]), even for time-invariant (ITV) sparse DTV channels, and in the Simulations (Section VII) here. [Similar results in [11] are due to the non-required taps in the "Full" equalizer just adding noise to the estimation process.] Also, the Bayesian filter (used in the Expectation step in [4]) employs a "Full" (uniformly spaced, all-taps) Kalman filter, unlike the novel reduced (non-uniformly tap-spaced), re-configurable equalizer here.
4. In addition, KKL and the Bayesian MSBL [16] will be shown to converge much more slowly than the novel MSKF (for each symbol), especially for large-scale MMIMO systems (Sec. VII).
5. Bayesian methods [14] (simulated only for slowly varying channels, an AR(1) model with λ = 0.9999) and [4] can be computationally very demanding for MMIMO systems, since they are not equipped with suitable model-order reduction to reduce complexity.
6. Also, the channel magnitude corresponding to smaller λ (high-mobility channels) decreases as n increases, since the distance between the mobile and the base station increases. Thus the received signal (for large n) will be noisier, leading to the channel tracking errors encountered by a standard Kalman filter (see Fig. 8 and the text below it, [22]). This is alleviated by the data censoring and reduced equalizer in MSKF.
7. MSKF has also been compared to one of the few existing Krylov-based TV channel estimators [21]. [21] models the channel time-variation by a basis expansion model (BEM), which inevitably introduces approximation error into the channel estimates, due to the imperfect model assumed [17]. Though [21], [34] are developed for high-mobility channels, they do not exploit its sparsity. [21] performs worse than the novel MSKF and KKL methods (Sec. VII).
8. Moreover, our novel reduced-rank filters are shown to be equivalent to a Bayesian estimator [12], by considering the channel's sparse (prior) pdf as f((H(n, l_k))_{(i,j)}) = |(H(n, l_k))_{(i,j)}|^{−1/2}, i.e., the magnitude of a channel tap has a low value with high probability. Then equations (26)-(34) in [12] show that this prior pdf leads one to a reduced-rank filter (see the end of Section III.B above).
9. Unlike existing methods, MSKF combines both sparsity (in the multipath lag domain) and Kalman filtering, and also incorporates (multistage) model-order reduction, leading to fast convergence.

VI. SIMULATION RESULTS
In MMIMO (J = 50 receive antennas) systems, the following algorithms and scenarios are simulated: 1. the "Full" Kalman filter, also used in the recent [4], [14] and the TV EM method [36]; 2. the reduced-rank KKL filter over varying orders, denoted "Krylov"; 3. the novel Multi-Stage Kalman Filter (MSKF) for varying numbers of iterations, versus symbol number, at different SNRs and varying λ's; 4. multi-user sparse channels; 5. cluster-sparse channels; 6. the recent dynamic Bayesian-Belief Propagation method [16], denoted "SBL"; 7. large-scale MMIMO (large loading ratio R = S/J) systems; 8. iterative re-weighted compressed sensing (RCS) [19]; 9. the existing BEM-based Krylov TV channel estimator [21], denoted "Klov-BEM". The receiver signal-to-noise ratio (SNR) is defined as SNR = E(||y(n) − w(n)||^2) / E(||w(n)||^2), where w(n) is the AWGN noise; the performance of the different estimators is measured by the normalized MSE (NRMSE).

First, following Sec. VI.B, we consider a stationary channel, as in [19], to investigate the convergence speed (in symbols) of OMP-based RCS, for different values of the loading factor R = K/J and different SNRs of 5, 20 and 30 dB in Fig. 3 a); Fig. 3 b) is the corresponding plot for the novel MSKF. Fig. 3 a) shows RCS to converge very slowly, after as many as 400-500 symbols, making it unsuitable for high-Doppler channels (this has also been witnessed in Fig. 3 of [19]). The performance of RCS also degrades substantially at lower SNRs (note that the non-stationarity factor λ is not incorporated into RCS [22], as is done in MSKF). Fig. 3 b) shows the novel MSKF to perform very well, with low NRMSE from n = 12 onwards; there are also no channel tracking errors, since this is a stationary channel. Moreover, MSKF's performance degradation at low SNR is much less than that of RCS.

Next, Fig. 4 simulates the channel NRMSE of "Full" (which also limits the performance of the Bayesian methods [4], [14], [36]; Sec. VI.B), along with comparative simulations of the novel MSKF and KKL over varying numbers of iterations, at 30 and 5 dB SNR and λ = 0.988. The plot shows KKL to converge slowly, its NRMSE decreasing as the number of iterations increases from 14 to 22, ..., 50. The MSKF sparse channel estimator converges very quickly, within an order of 14, as the difference in channel NRMSE between orders 14 and 40 is not significant. The uniformly spaced "Full" equalizer also performs inadequately. The NRMSE of MSKF (with 14 iterations) is also less than that of "Full" and of KKL (with 50 iterations). Fig. 4 b) shows results for λ = 0.995, yielding lower NRMSEs. The novel MSKF also handles the non-stationarity of the channel better than the "Full" filter: for λ = 0.995, the ratio of the NRMSE at symbol 26 to that at symbol 12 is 4.67 for MSKF and 21.33 for "Full", while for a more rapidly varying channel (λ = 0.988) it is 10.43 for MSKF and increases rapidly to about 100 for "Full". Updating only at the required symbols {(n − m_l)} (akin to data censoring in a sparse channel) and updating in a reduced subspace prevent the novel MSKF from exhibiting growing NRMSE with increasing n, i.e., the channel tracking issues (see Fig. 8 and the text below it in [22]) occurring in a TV channel (see Sec. VI.A.6).

Fig. 5 provides results for the multi-user (S = 2) Example I channel and for cluster-sparse (S = 3) channels. Fig. 5 b) also includes the case of space-variant sparse channels, where large antenna arrays make the l_j's change by a few lags over the two ends of the array in MMIMO (see Section IV.C). As expected from Sec. VI.A.7, "Klov-BEM" (the Krylov-space based method using BEM instead of the more generic Kalman filter) performs worse than our Kalman-Krylov KKL method (denoted "Krylov") in Fig. 5 a) and b).

Fig. 6 a) shows simulation results for the 3G LTE Pedestrian B channel (having continuous ISI and some significant multipath clusters, with a TV component λ = 0.988/0.995 incorporated here), illustrating that the novel MSKF performs well for such practical channels too; see Sec. III.A.1 for how MSKF adapts to such general cases. Fig. 6 b) simulates a J = 50, S = 1, SNR = 10 dB system, where MSKF performs well, similarly to the very recent SBL [16]. Fig. 7 shows results for large-scale (large R) MMIMO systems: a) J = 20, K = 11 and b) J = 28, K = 14. Fig. 7 a) shows that MSKF still converges within 14 iterations even for these large-scale MMIMO systems, i.e., MSKF exhibits almost the same convergence speed (though with a larger NRMSE for a more heavily loaded system) as for the J = 50, K = 1 system in Fig. 3. Though it uses a symbol-iterative Kalman filter, the recent (sparse-PCA based) SBL [16] requires many more iterations to converge (at each symbol n) and even then performs poorly for large-scale MMIMO. For example, with J = 28, K = 14 at 10 dB SNR, MSKF still converges fast, in fewer than 14 iterations, while KKL's performance is inferior (requiring more than 40 iterations). SBL [16], however, performs very poorly and does not converge even with the number of iterations increased beyond 100 (for each symbol n), at both 10 dB and the higher 35 dB SNR.
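The two performance metrics above can be computed empirically as in the following sketch. The SNR formula matches the definition given earlier; the Frobenius-norm normalization of the NRMSE is our assumption (the paper's exact normalization is not reproduced here), and the toy dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_snr_db(y, w):
    """Receiver SNR (in dB) as defined above:
    SNR = E(||y(n) - w(n)||^2) / E(||w(n)||^2), with y(n) the received
    vector at symbol n and w(n) the AWGN component."""
    signal_power = np.mean(np.sum(np.abs(y - w) ** 2, axis=-1))
    noise_power = np.mean(np.sum(np.abs(w) ** 2, axis=-1))
    return 10.0 * np.log10(signal_power / noise_power)

def nrmse(H_hat, H):
    """Normalized channel estimation error, assumed here to be the
    Frobenius-norm ratio ||H_hat - H||_F / ||H||_F."""
    return np.linalg.norm(H_hat - H) / np.linalg.norm(H)

# toy check: J = 8 antennas, 1000 symbols, noise scaled for ~20 dB SNR
x = rng.standard_normal((1000, 8))
w = rng.standard_normal((1000, 8)) * np.sqrt(1e-2 * np.mean(x ** 2))
y = x + w
```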

VII. CONCLUSIONS
The paper develops TV sparse data/channel estimation algorithms for MMIMO, using a novel non-uniformly spaced TV equalizer, which transforms the channel/data estimation problem into one of reduced-rank filtering. This is enabled by a novel reduced-rank variant of the omnipresent Kalman filter, which can be extended to TV 5G mmwave communications by appropriate inclusion of angular estimation [17], [4] and beamforming vectors.

Step 1. From the computed noisy R(n, k), find the lags k at which the Frobenius norm ||R(n, k)||_F is above a threshold γ(n), calculated as γ(n) = [1/(N + Z)] Σ_{j=0}^{M−1} ||R(n, j)||_F [9], with ||·||_F the Frobenius norm. Simulations show this threshold to work very well, even in noisy situations (see the discussion in [11]).
Step 3. Further values of t_p, for p > D*, are obtained from all possible integral combinations of the already obtained t_p, p = 0, 1, 2, ..., D* (from Step 2), under the constraint that t_p ≤ M (M: "full" equalizer length): t_p = Σ_{l=0}^{D*} c_{p,l} t_l, p > D*, where the c_{p,l} are integer constants.
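The two steps above can be sketched as follows. This is our own illustration: the function names and toy values are hypothetical, and restricting Step 3's "all possible integral combinations" to pairwise sums of the base taps is a simplifying assumption.

```python
import numpy as np

def select_active_lags(R_norms, N, Z):
    """Step 1 sketch: keep the lags k whose Frobenius norm ||R(n, k)||_F
    exceeds the data-driven threshold
        gamma(n) = (1/(N + Z)) * sum_{j=0}^{M-1} ||R(n, j)||_F."""
    gamma = R_norms.sum() / (N + Z)
    return np.flatnonzero(R_norms > gamma)

def extend_lags(base_taps, M):
    """Step 3 sketch: generate further taps t_p, p > D*, as integer
    combinations of the already-selected taps, keeping only t_p <= M.
    (Only pairwise sums are formed here -- an illustrative assumption.)"""
    ext = set(base_taps)
    for a in base_taps:
        for b in base_taps:
            if a + b <= M:
                ext.add(a + b)
    return sorted(ext)

# example: Frobenius norms of R(n, k) over M = 5 lags, with N + Z = 5
active = select_active_lags(np.array([5.0, 0.1, 0.1, 4.0, 0.1]), N=3, Z=2)
```

With the toy norms above, γ(n) = 9.3/5 = 1.86, so only lags 0 and 3 survive Step 1.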