Streaming Algorithms for Subspace Analysis: an Edge- and Hardware-Oriented Review

—Subspace analysis is a basic tool for coping with high-dimensional data and is becoming a fundamental step in the early processing of many signal processing tasks. The current trend of collecting huge quantities of usually very redundant data by means of decentralized systems suggests that these techniques be deployed as close as possible to the data sources. Regrettably, despite its conceptual simplicity, subspace analysis is ultimately equivalent to eigenspace computation and brings along non-negligible computational and memory requirements. To make this fit into typical systems operating at the edge, specialized streaming algorithms have recently been devised; here we classify and review them, giving them a coherent description, highlighting features and analogies, and easing comparisons. The implementation of these methods is also tested on a common edge digital hardware platform to estimate not only abstract functional and complexity features, but also the more practical running times and memory footprints on which compliance with real-world applications hinges.


I. INTRODUCTION
THE IDEA behind Subspace Analysis (SA) in its different declinations (principal or minor subspace identification or tracking) is to find a subspace of the original signal space such that the projection of the data set onto it has some special properties.
In principal component/subspace analysis (PCA/PSA) the projection embeds the data set into a lower-dimensional subspace that maintains most of the original energy, while in minor component/subspace analysis (MCA/MSA) the projection should produce nearly vanishing vectors.
PCA/PSA historically has a huge number of applications, among which, for example, pattern identification in computer network traffic analysis [1], anomaly detection [2], biomedical applications [3], [4], blind source separation [5], and surveillance [6], [7], though the list is by far not exhaustive. The most straightforward solutions to subspace finding are batch methods, such as the eigenvalue decomposition (EVD) or the singular-value decomposition (SVD) techniques [8]. They require acquiring and storing large chunks of the available data, and processing them with algorithms that make use of standard linear algebra computations. For this reason, batch methods imply a large memory usage and have a complexity in the order of at least O(n²k), where n is the original dimension of the data and a subspace of dimension k is sought.
Yet, nowadays, identifying and tracking principal and minor subspaces is becoming particularly important due to the increasing availability of massive, but often highly redundant, collections of data. It may become an engineering nonsense to transport and store all this data, when a large part is eventually discarded as a result of information extraction procedures.
For this reason, subspace analysis is one of the best candidates for implementation at the edge of the cloud, i.e., on those devices that appear along the data path from sensors to central processing facilities. Edge devices typically have hub functionalities, as they are in charge of concentrating data coming from multiple physically close sources and sending their aggregation through the Internet.
Hence, they are the first stage of the processing chain in which data may become high-dimensional while exhibiting a noteworthy level of redundancy, making them ideal candidates for early processing and dimensionality reduction.
Yet, hubs do not support the ideally unlimited and easily scalable computational and storage resources that characterize the cloud model, and this implies the need for tailored approaches.
In this light, a number of streaming algorithms (see, e.g., [9]-[11]) have been recently presented. Here, data windows are not stored but processed as they become available, resulting in a complexity in the order of O(nk²) or, at the price of some approximation, even O(nk). Figure 1 shows the general scheme of streaming algorithms for subspace analysis. Among the many proposals, often differing only in subtle details, our review selects those that embody the main principles and approaches, are sufficiently lightweight to fit in a low-resource device, and have performances representative of what can be obtained even by more specialized implementations. For example, we only consider purely streaming approaches, as block-wise methods are usually a straightforward extension of sample-wise methods but require a larger use of memory and computation resources without yielding a significant improvement in subspace identification.
All methods are expressed within the same coherent framework that facilitates comparison in terms of both functionality (estimation accuracy) and resource needs. We also investigate and discuss the deployment of these methods on edge devices, distinguishing high-end from low-end devices. The former class of devices usually features an operating system that supports high-level programming languages, while the latter class requires direct programming.
The paper is organized as follows. Sec. II proposes a classification of the methods, while Sec. III presents a list of streaming algorithms along with their relationships. Method performance is assessed in Sec. IV and, before the Conclusion, Sec. V compares the computational budget and memory footprint on reference edge devices.

II. METHODS CLASSIFICATION
We model the data stream as a discrete-time stochastic process that generates occurrences x_t ∈ R^n with t = 0, 1, . . . . The algorithms we describe process the vectors x_t sequentially to extract a characterization of the whole data set. We assume that E[x_t] = 0_n for every t, where 0_n is the n-dimensional null vector and E[·] indicates the expectation operator. We also assume the covariance/correlation matrix Σ_t = E[x_t x_tᵀ], where (·)ᵀ indicates transposition, is either constant or slowly varying over a long span of subsequent time instants, so that we may safely drop the time indication, i.e., Σ_t ≈ Σ. If this constraint holds we can also use x instead of x_t to represent a generic signal occurrence.
We aim at identifying the so-called principal subspace, i.e., the one spanned by the m < n eigenvectors associated with the largest eigenvalues, or the so-called minor subspace or noise subspace, i.e., the one spanned by the remaining n − m eigenvectors associated with the smallest eigenvalues.
In any case, the target subspace is expressed as the span of the columns of a matrix U ∈ R^{n×k} that is column-orthonormal, i.e., such that UᵀU = I_k, where I_k stands for the k × k identity matrix.
For principal subspace estimation k = m, while k = n − m in the case of the minor subspace. In several applications m ≪ n [12], so that the size of U drastically changes when passing from the task of identifying the principal subspace to that of identifying the minor subspace.
Streaming algorithms are iterative. They maintain an estimate U_t that is updated every time a new vector x_t is acquired, in such a way that U_t → U as t grows. The streaming algorithms we review reach this goal in different ways. Here we list some key features that can be used to make distinctions or draw connections.
1) principal and minor subspaces: most algorithms are able to target principal subspaces. Some of them are also able to target minor subspaces, and a few methods are designed for the minor subspace only.
3) objective function: identifying the k-dimensional principal (minor) subspace is equivalent to finding the column-orthonormal matrix U ∈ R^{n×k} that maximizes (minimizes) the variance of the projection y = Uᵀx [13], [14]. Hence, some approaches consider the objective function

E[‖Uᵀx‖²] = tr(UᵀΣU)    (1)

where ‖·‖ denotes the l₂ norm, tr(·) computes the trace of its argument, and where U is constrained to be column-orthonormal. Maximizing (minimizing) the projection variance is equivalent to minimizing (maximizing) the average norm of the residual vector r = x − UUᵀx. Hence, some methods [15], [16] consider the objective function

E[‖x − UUᵀx‖²]    (2)

As a further alternative, it is possible to define an objective function based on the so-called spiked model [17], whereby the observable x is assumed to be an expansion of an m-dimensional signal s such that s = Ψ|mᵀ x [18], [19]. Despite their obvious link, some methods consider x and s separately and minimize

J_SM(U) = E[‖x − Us‖²]    (3)

to enforce their relationship.
As a remark, the signal model behind (3) makes this objective function effective only in the case of principal subspace identification, i.e., J_SM(U) = 0 implies span(U) = span(Ψ|m).
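As a toy illustration of the equivalence between maximizing the projection variance (1) and minimizing the residual norm (2), the following NumPy sketch (variable names and sizes are ours, not from the paper) evaluates both objectives empirically on synthetic data with a dominant low-dimensional component:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, T = 20, 3, 5000

# Toy data with a dominant k-dimensional subspace spanned by Psi.
Psi = np.linalg.qr(rng.standard_normal((n, k)))[0]   # column-orthonormal basis
X = Psi @ rng.standard_normal((k, T)) + 0.01 * rng.standard_normal((n, T))

def J_var(U, X):
    """Empirical projection variance  E[||U^T x||^2] = tr(U^T Sigma U)."""
    Y = U.T @ X
    return np.mean(np.sum(Y**2, axis=0))

def J_res(U, X):
    """Empirical average residual norm  E[||x - U U^T x||^2]."""
    R = X - U @ (U.T @ X)
    return np.mean(np.sum(R**2, axis=0))

U_good = Psi                                         # spans the principal subspace
U_bad = np.linalg.qr(rng.standard_normal((n, k)))[0] # unrelated random subspace

# For a column-orthonormal U the two objectives sum to the (constant)
# total energy E[||x||^2], so maximizing one minimizes the other.
total = np.mean(np.sum(X**2, axis=0))
assert np.isclose(J_var(U_good, X) + J_res(U_good, X), total)
assert J_var(U_good, X) > J_var(U_bad, X)
assert J_res(U_good, X) < J_res(U_bad, X)
```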
Only one of the algorithms we review does not use optimization to identify the subspace. Its description is in Section III.
4) column-orthonormality: though the final U must be column-orthonormal, the orthonormality of each U_t depends on the algorithm. Two main classes exist: the one containing procedures that ensure every U_t to be orthonormal, and the one containing procedures that produce a U_t that is only approximately orthonormal.
In general, algorithms ensuring orthonormality have complexity O(nk²), while algorithms limiting themselves to approximate orthonormality are able to achieve O(nk).
Those ensuring orthonormality can be further distinguished depending on the technique used. Some do so by applying a specific orthonormalization procedure (typically a QR decomposition) at each step. In this case, if the update produces a matrix that is not column-orthonormal, the columns of the final U_t span the same subspace as the columns of that matrix, while being orthonormal. Some others do so by projecting onto the Grassmannian manifold [14], [20] that contains all possible n×k column-orthonormal matrices. Using projection implies that the span of the result of the non-orthonormal update is not necessarily the same as that of the final orthonormal U_t, and some of the improvement made by the update might be lost.
As a third option, orthonormality can be guaranteed by constraining the updates of U t to be along Grassmannian geodesics. In this case, the updated estimation improves over the previous one without leaving the acceptability region.
Tab. I classifies the methods, whose detailed description is given in the next Section, along directions from 1) to 4) above. In the same Table, we also report the computational complexity and the field of interest in which the methods were originally conceived.

III. METHODS DESCRIPTION
This section describes the methods we then test and implement. For each method we give the update step, i.e., the sequence of operations leading from the matrix U_{t−1}, the estimate of the subspace after t − 1 input observations, to the matrix U_t, the current estimate, which also takes into account the new occurrence x_t.
Most update steps compute intermediate quantities that have a proper role, namely y_t = U_{t−1}ᵀ x_t, which is the vector of coefficients expressing the projection of x_t onto the subspace as it is estimated at time t−1, and r_t = x_t − U_{t−1} y_t, which is the residual of such a projection. Note that if U_{t−1} is column-orthonormal, the residual r_t is orthogonal to the estimated subspace, i.e., U_{t−1}ᵀ r_t = 0_k.
For the sake of brevity, the computation of y t and r t are not explicitly mentioned in the descriptions of the methods.
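As a concrete reference, the two shared quantities can be computed as in the following NumPy sketch (sizes and names are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 3
# U_{t-1}: a column-orthonormal estimate of the subspace.
U_prev = np.linalg.qr(rng.standard_normal((n, k)))[0]
x_t = rng.standard_normal(n)

y_t = U_prev.T @ x_t          # projection coefficients y_t = U_{t-1}^T x_t
r_t = x_t - U_prev @ y_t      # residual r_t = x_t - U_{t-1} y_t

# With a column-orthonormal U_{t-1}, the residual is orthogonal to the
# estimated subspace (and hence to the reconstruction U_{t-1} y_t).
assert np.allclose(U_prev.T @ r_t, 0)
assert abs(r_t @ (U_prev @ y_t)) < 1e-10
```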

A. Oja
Originally proposed in [27] for principal subspace estimation with k = m = 1, and then extended to the rank-k case in [28], it starts from a random column-orthonormal U_0 ∈ R^{n×k}. At the t-th sample it updates the estimation of U = Ψ|k according to the input data x_t as

U_t = Ω(U_{t−1} + γ_t x_t y_tᵀ)    (4)

where Ω(·) is an operator that orthonormalizes the columns of its argument, e.g., gives the Q matrix in the QR decomposition of its argument [29, Chapter 2]. The parameter γ_t is the step size or learning rate, which may change with time.
Notably, the authors of [15], [30], [31] show that Oja is an extension of the well-known power method [32] that, in turn, is equivalent to solving a maximization problem where the objective function is (1) and U_t is constrained to be column-orthonormal. In particular, the gradient of (1) with respect to U is ΣU, such that (4) is equivalent to the update of a stochastic gradient algorithm where Σ is approximated by x_t x_tᵀ, γ_t is the learning rate, and Ω(·) forces the update to yield a column-orthonormal matrix.
Actually, to save some computation, one may think of applying Ω only after a certain number of updates [15]. Yet, the proper sizing of the number of steps without orthonormalization depends on the application. Furthermore, when minimization is considered instead of maximization, with the same objective function and constraint, one may aim at identifying the minor subspace U = Ψ_k| with k = n − m. With respect to (4), here the stochastic gradient step performing minimization follows the opposite of the gradient of (1), i.e.,

U_t = Ω(U_{t−1} − γ_t x_t y_tᵀ)    (5)

Lastly, since the convergence of the method depends on the choice of the initial matrix U_0, [33] proposes a warm-start procedure that avoids the random initialization.
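A minimal NumPy sketch of the Oja update (4) follows; it is our illustrative implementation, run on toy spiked-model data with an arbitrary constant learning rate rather than the tuned values used later in the paper:

```python
import numpy as np

def oja_update(U_prev, x_t, gamma, minor=False):
    """One Oja step, U_t = Ω(U_{t-1} ± γ x_t y_tᵀ), with QR as Ω.
    minor=True flips the gradient sign to target the minor subspace."""
    y_t = U_prev.T @ x_t
    sign = -1.0 if minor else 1.0
    Q, R = np.linalg.qr(U_prev + sign * gamma * np.outer(x_t, y_t))
    return Q * np.sign(np.diag(R))   # canonical column signs

# Principal subspace recovery on toy spiked-model data (sizes are ours).
rng = np.random.default_rng(2)
n, m = 20, 2
Phi = np.linalg.qr(rng.standard_normal((n, m)))[0]   # true subspace
U = np.linalg.qr(rng.standard_normal((n, m)))[0]     # random start
for _ in range(3000):
    x = Phi @ rng.standard_normal(m) + 0.01 * rng.standard_normal(n)
    U = oja_update(U, x, gamma=0.01)

# Residual of the true basis w.r.t. the estimate should be near zero.
err = np.linalg.norm(Phi - U @ (U.T @ Phi))**2
assert err < 0.1
```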

B. Krasulina
Originally proposed in [34], [35] and recently revised in [13] to include the k > 1 case, it also starts from a random column-orthonormal U_0 ∈ R^{n×k}. The update step is

U_t = Ω(U_{t−1} + γ_t r_t y_tᵀ)    (6)

According to [13], [16], the update in (6) converges to a matrix that approximates the minimizer of (2).
As for Oja, a warm-start approach has been proposed in [13], and the minor subspace can be targeted by simply changing the sign in (6) to yield

U_t = Ω(U_{t−1} − γ_t r_t y_tᵀ)    (7)

In this case the method converges to U = Ψ_k| with k = n − m.
Oja and Krasulina are deeply linked. In fact, [21] proves that

Ω(U_{t−1} + γ_t x_t y_tᵀ) = U_{t−1} + γ_t r_t y_tᵀ + o(γ_t²)    (8)

thus ultimately establishing equivalent convergence properties for (4) and (6) as γ_t → 0 for growing t.
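The o(γ_t²) nature of the gap in (8) can be checked numerically: in the NumPy sketch below (our own toy construction), reducing the step size by a factor 10 shrinks the distance between the orthonormalized Oja update and the raw Krasulina-style update by roughly a factor 100:

```python
import numpy as np

def ortho(A):
    """Ω(.): Q factor of the QR decomposition, with positive-diagonal R."""
    Q, R = np.linalg.qr(A)
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(3)
n, k = 10, 3
U = np.linalg.qr(rng.standard_normal((n, k)))[0]  # U_{t-1}
x = rng.standard_normal(n)
y = U.T @ x                                       # y_t
r = x - U @ y                                     # r_t

def gap(gamma):
    """|| Ω(U + γ x yᵀ) − (U + γ r yᵀ) ||_F, the o(γ²) term in (8)."""
    return np.linalg.norm(ortho(U + gamma * np.outer(x, y))
                          - (U + gamma * np.outer(r, y)))

# The gap decays faster than linearly in γ (roughly quadratically).
assert gap(1e-3) < gap(1e-2) / 10
```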

C. HFRANS
The analogy between the two previous methods highlighted by (8) has led to a class of methods that try to provide column-orthonormality while avoiding the Ω operator.
Since U_t = U_{t−1} + γ_t r_t y_tᵀ tends to be column-orthonormal for t → ∞, an extreme option is to consider the o(γ_t²) term negligible and simply avoid orthonormalization.
Such a choice can be acceptable when targeting U = Ψ |k . Yet, when aiming at identifying U = Ψ k| , the very small amplitude of the projection of the signal on the minor subspace makes the whole procedure extremely error-prone and may spoil convergence.
To overcome this impasse, [22] first proposed a term approximating the o(γ_t²) residue in (8), giving rise to the so-called OOja method, which was then extended in [23] with a policy that adapts γ_t to both U_t and x_t, leading to the so-called NOOja method.
OOja and NOOja further evolved [25] into HFRANS. This method is also an adjustment of FRANS, a previous Rayleigh quotient-based adaptive noise subspace method [24]. In HFRANS, Householder transformations are introduced to grant the numerical stability needed to cope with the minor subspace case.
The method starts from a random U_0 ∈ R^{n×k} matrix and uses an update rule (9) that depends on a given parameter 0 < γ < 2.

D. PAST

The Projection Approximation Subspace Tracking (PAST) [16], [36] is an algorithm obtained by minimizing (3) in which the expectation is unrolled in time as an exponentially weighted sum, i.e., by setting

J_t(U) = Σ_{l=1}^{t} β^{t−l} ‖x_l − U s_l‖²    (10)

without the constraint of U_t being column-orthonormal, and where β ∈ [0, 1] is the forgetting factor that weights the prior samples. Equation (10) is based on the fact that the signal observations x_l are generated according to a spiked model, i.e., s_l = Ψ|kᵀ x_l.
Since, at each signal occurrence, only x_l is known, s_l is approximated by the projection vector y_l = U_{l−1}ᵀ x_l, i.e., by adopting the last estimated U. Thanks to this approximation, the problem has a closed-form solution and U_t can be retrieved by means of recursive least squares (RLS) methods, which allow for a computational cost as low as O(nk).
Iterations start from a random U_0 ∈ R^{n×k} and P_0 = δI_k for some δ > 0, and the update step is

g_t = P_{t−1} y_t / (β + y_tᵀ P_{t−1} y_t)
P_t = (P_{t−1} − g_t y_tᵀ P_{t−1}) / β    (11)
U_t = U_{t−1} + r_t g_tᵀ

Though U_t is not guaranteed to be column-orthonormal for any finite t, [16], [37] show that column-orthonormality is achieved asymptotically as t increases. In applications where column-orthonormality is important at each step, variants of PAST can be adopted. For instance, PASTd in [16] is a version based on the deflation technique that comes at the cost of an increase in complexity. Otherwise, in [38] near-orthonormality is provided by correction terms applied to each update of U, which keeps the complexity at O(nk).
PAST is the base for other approaches such as [39] that deal with the case in which data is perturbed by colored noise, [19] that is designed to cope with the missing components in the vectors x t , and [40], [41] where the Approximated Power Iteration (API) extends the standard power method by exploiting the same approximation used in PAST.
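The PAST recursion can be sketched in NumPy as follows. This is our illustrative implementation of the standard RLS form, run on toy spiked-model data; the forgetting factor and trial sizes are arbitrary choices, not the tuned values of Table II:

```python
import numpy as np

def past_update(U_prev, P_prev, x_t, beta=0.95):
    """One PAST step in RLS form, O(nk): returns (U_t, P_t).
    P_t tracks the inverse correlation matrix of the projections y."""
    y = U_prev.T @ x_t
    h = P_prev @ y
    g = h / (beta + y @ h)                  # gain vector
    P_t = (P_prev - np.outer(g, h)) / beta  # rank-one inverse update
    P_t = (P_t + P_t.T) / 2                 # keep P symmetric numerically
    r = x_t - U_prev @ y                    # residual
    U_t = U_prev + np.outer(r, g)
    return U_t, P_t

rng = np.random.default_rng(4)
n, m = 20, 2
Phi = np.linalg.qr(rng.standard_normal((n, m)))[0]
U = np.linalg.qr(rng.standard_normal((n, m)))[0]
P = np.eye(m)                               # P_0 = delta * I with delta = 1
for _ in range(2000):
    x = Phi @ rng.standard_normal(m) + 0.01 * rng.standard_normal(n)
    U, P = past_update(U, P, x)

# U is only asymptotically orthonormal; orthonormalize before measuring.
Q = np.linalg.qr(U)[0]
err = np.linalg.norm(Phi - Q @ (Q.T @ Phi))**2
assert err < 0.1
```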

E. ISVD
Given any n × t matrix A, the Singular Value Decomposition (SVD) [32, Chapter 2] finds three factors P, D and Q such that A = P D Qᵀ, with P and Q being square orthonormal matrices of size n×n and t×t respectively, and D a diagonal n×t matrix whose diagonal entries are called singular values.

SVD is symbolized as A −SVD→ P, D, Q.
Let also X_t = [x_1 · · · x_t] be the matrix containing the samples up to the t-th. Subspace analysis has to do with the SVD of X_t with a very large t.
In [46] it is shown that the SVD of X_t can be effectively computed from the SVD of X_{t−1}. Even more, [42] shows that this is possible even if we focus on the so-called thin SVD (tSVD), i.e., on a decomposition that computes only the first k columns of the left and right factors, the former set of columns being exactly the matrix U we look for when targeting the principal subspace.
The method relies on the fact that, from the tSVD of X_{t−1} = U_{t−1} D_{t−1} V_{t−1}ᵀ, one may express the data matrix X_t as

X_t = [X_{t−1}  x_t] = [U_{t−1}  r_t/‖r_t‖] [ D_{t−1}  y_t ; 0ᵀ  ‖r_t‖ ] [ V_{t−1}  0 ; 0ᵀ  1 ]ᵀ    (12)

where the semicolon separates block rows. The equality in (12) holds exactly only if the rank of X_{t−1} is k, and is otherwise an approximation. Moreover, (12) is the foundation of the ISVD method, which computes the SVD of the inner matrix at each update. In more detail, the method computes the SVD of

[ βD_{t−1}  y_t ; 0ᵀ  ‖r_t‖ ]    (13)

which represents an adjusted version of the inner matrix in (12), where the parameter 0 < β < 1 is added to control the memory of the algorithm. The computed SVD yields two orthonormal factors P and Q and a diagonal factor D whose product can be plugged into (12) to obtain an inner diagonal factor and thus yield the updated tSVD of X_t. Such a representation is then shrunk to the minimum by keeping only the first k columns of the left and right factors, and only the first k columns and rows of the central factor. Overall, starting from a random column-orthonormal U_0 ∈ R^{n×k}, the update step computes

[ βD_{t−1}  y_t ; 0ᵀ  ‖r_t‖ ] −SVD→ P, D, Q
U_t = ([U_{t−1}  r_t/‖r_t‖] P)_|k ,  D_t = (D)_k    (14)

where (·)_|k is the same operator used before, which selects the first k columns of its argument, while (·)_k selects the first k columns and the first k rows of its argument. Besides, the speed of convergence is highly affected by the condition number of X_t. To partially overcome this problem, the authors in [43] propose the Polar Incremental Matrix Completion (PIMC) algorithm, which adapts the memory factor β to the norm of the observed samples by tracking the Frobenius norm ‖·‖_F of the data seen so far.
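A compact NumPy sketch of the ISVD update follows. It is our illustrative rendering of the steps above (the right factor is dropped since only U_t and the singular values are tracked), run on toy spiked-model data with an arbitrary memory factor:

```python
import numpy as np

def isvd_update(U_prev, D_prev, x_t, beta=0.98):
    """One ISVD step: SVD of the small (k+1)x(k+1) inner matrix,
    then shrink back to rank k. Returns (U_t, D_t)."""
    k = U_prev.shape[1]
    y = U_prev.T @ x_t
    r = x_t - U_prev @ y
    rho = np.linalg.norm(r)
    # Inner matrix [ beta*D  y ; 0  ||r|| ] of size (k+1) x (k+1).
    M = np.zeros((k + 1, k + 1))
    M[:k, :k] = beta * np.diag(D_prev)
    M[:k, k] = y
    M[k, k] = rho
    P, d, _ = np.linalg.svd(M)                        # right factor unused
    U_ext = np.hstack([U_prev, (r / rho)[:, None]])   # n x (k+1), orthonormal
    return (U_ext @ P)[:, :k], d[:k]                  # keep first k columns

rng = np.random.default_rng(5)
n, m = 20, 2
Phi = np.linalg.qr(rng.standard_normal((n, m)))[0]
U = np.linalg.qr(rng.standard_normal((n, m)))[0]
D = np.ones(m)
for _ in range(1000):
    x = Phi @ rng.standard_normal(m) + 0.01 * rng.standard_normal(n)
    U, D = isvd_update(U, D, x)

err = np.linalg.norm(Phi - U @ (U.T @ Phi))**2
assert err < 0.1
```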

F. GROUSE
Grassmannian Rank-One Update Subspace Estimation (GROUSE) is a streaming algorithm for subspace tracking proposed in [18]. Although it is designed to deal with the case in which some components of x_t are unknown, we here consider the version for complete data. The idea consists in applying stochastic gradient descent to minimize (3) while making moves that do not exit the set of all possible column-orthonormal matrices, i.e., the Grassmannian manifold of the k-dimensional subspaces of R^n.
Starting from one such matrix, the update rule for the case in which all the components of x_t are known is

U_t = U_{t−1} + ( (cos θ_t − 1) U_{t−1}y_t/‖y_t‖ + sin θ_t r_t/‖r_t‖ ) y_tᵀ/‖y_t‖

where the expression for θ_t is derived from eqs. (3)-(4) in [44] and where α_t is meant to mitigate the effect of noise. The convergence of GROUSE is analyzed in [44], [45], and [26] shows that GROUSE and ISVD are strictly linked. In particular, the application of the ISVD to the missing data case is equivalent to GROUSE for a certain choice of its parameters.
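The geodesic update can be sketched in NumPy as below. This is our illustrative implementation for the full-data case; in particular the choice θ_t = η‖r_t‖‖y_t‖ is one common step-size rule and stands in for the exact expression of [44], with η an arbitrary constant:

```python
import numpy as np

def grouse_update(U_prev, x_t, eta):
    """One GROUSE step along a Grassmannian geodesic (full-data case).
    Only the in-subspace direction U y / ||y|| is rotated toward the
    residual direction r / ||r||, so U stays exactly orthonormal."""
    y = U_prev.T @ x_t
    p = U_prev @ y                 # projection onto current subspace
    r = x_t - p                    # residual
    yn, rn = np.linalg.norm(y), np.linalg.norm(r)
    if rn < 1e-12 or yn < 1e-12:
        return U_prev
    theta = eta * rn * yn          # step-size-dependent rotation angle
    step = (np.cos(theta) - 1) * p / yn + np.sin(theta) * r / rn
    return U_prev + np.outer(step, y / yn)

rng = np.random.default_rng(6)
n, m = 20, 2
Phi = np.linalg.qr(rng.standard_normal((n, m)))[0]
U = np.linalg.qr(rng.standard_normal((n, m)))[0]
for _ in range(3000):
    x = Phi @ rng.standard_normal(m) + 0.01 * rng.standard_normal(n)
    U = grouse_update(U, x, eta=0.05)

err = np.linalg.norm(Phi - U @ (U.T @ Phi))**2
assert err < 0.1
```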

IV. SAMPLE SUBSPACE ANALYSIS PERFORMANCE
In this section we test the functional performance of the methods described in Section III by asking each method to correctly detect a subspace characterizing a set of signal observations.
Data are generated according to the spiked model [17], with a subspace structure of the signal that does not change in time, meaning that

x_t = Φ s_t + ν_t    (15)

where Φ ∈ R^{n×m}, with m < n, is a column-orthonormal matrix that expands instances s_t ∈ R^m of a zero-mean Gaussian source with covariance I_m, while ν_t ∈ R^n represents realizations of a zero-mean Gaussian noise term with covariance ρI_n, where 0 < ρ < 1 controls the noise level.
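Generating data following this model is straightforward; the NumPy sketch below (the number of samples is an arbitrary choice of ours) also verifies that the empirical covariance approaches ΦΦᵀ + ρI_n:

```python
import numpy as np

rng = np.random.default_rng(7)
n, m, rho, T = 100, 10, 1e-3, 20000   # paper's n, m, rho; T is ours

Phi = np.linalg.qr(rng.standard_normal((n, m)))[0]   # column-orthonormal mixing
S = rng.standard_normal((m, T))                      # sources, covariance I_m
N = np.sqrt(rho) * rng.standard_normal((n, T))       # noise, covariance rho*I_n
X = Phi @ S + N                                      # x_t = Phi s_t + nu_t

# Empirical covariance should approach Phi Phi^T + rho I_n.
Sigma_hat = X @ X.T / T
Sigma = Phi @ Phi.T + rho * np.eye(n)
assert np.linalg.norm(Sigma_hat - Sigma) / np.linalg.norm(Sigma) < 0.1
```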
With this model we have Σ = E[x_t x_tᵀ] = ΦΦᵀ + ρI_n, and since Φ is column-orthonormal and ρ < 1, Φ itself spans the m-dimensional principal subspace, while its orthogonal complement Φ⊥ spans the (n − m)-dimensional minor subspace. As a result, the target n × k matrix U is Φ in case of principal subspace estimation and Φ⊥ for the minor subspace.
To quantify the effectiveness of subspace analysis, for each method we monitor the sequence of reconstruction errors

e_t = ‖(I_n − U_t U_tᵀ) U‖_F²    (16)

where ‖·‖_F indicates the Frobenius norm of its argument. In case of correct estimation we have U_t = U and thus e_t = 0, while the error is maximum when U_t is orthogonal to U, yielding e_t = k.
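The two extreme cases of the error metric can be checked directly (a NumPy sketch with sizes of our choosing):

```python
import numpy as np

def recon_error(U_t, U):
    """e_t = ||(I - U_t U_t^T) U||_F^2: 0 for a perfect estimate, k when
    span(U_t) is orthogonal to span(U)."""
    return np.linalg.norm(U - U_t @ (U_t.T @ U))**2

rng = np.random.default_rng(8)
n, k = 12, 4
U = np.linalg.qr(rng.standard_normal((n, k)))[0]

assert np.isclose(recon_error(U, U), 0)          # perfect estimate
# Build a U_t inside the orthogonal complement of span(U).
C = np.eye(n) - U @ U.T
U_orth = np.linalg.qr(C @ rng.standard_normal((n, k)))[0]
assert np.isclose(recon_error(U_orth, U), k)     # maximal error e_t = k
```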
Simulations consider n = 100, m = 10 and noise amplitude ρ = 10⁻³. For each task, 100 Monte Carlo trials are performed, where a random column-orthonormal Φ and a random column-orthonormal initialization matrix U_0 are drawn independently. Each single Monte Carlo trial is composed of a stream of 1000 samples x_t generated independently following (15).
The parameters controlling each method are set as reported in Table II: Oja and Krasulina need a learning rate γ_t, PAST and ISVD need a forgetting factor β, and finally GROUSE and HFRANS depend on α (controlling the effect of noise) and γ (which can be seen as a re-scaled learning rate). Parameters are tuned to optimize the capability to identify the target subspace.
For the methods adopting a learning rate we tested the classical trend γ_t = c/t^d with d ∈ {0, 1/2, 1} and c selected to increase the convergence speed. Since, in the fixed subspace case, performance is approximately the same for the three values of d, we selected d = 0 and report in Table II only the value γ_t = γ = c. Such a constant learning rate allows for substantial updates even when t grows. For HFRANS, the learning rate tuning does not depend on t, while the value of α in GROUSE is set to cope with the noise level.
Finally, the forgetting factors of ISVD and PAST are selected as the largest possible values that make the fixed subspace case converge.
Results in terms of e_t are shown in Figure 3: Figure 3a is for the principal subspace while Figure 3b is for the minor one. Solid lines represent the median values over the Monte Carlo trials, while shaded areas indicate the spread containing 50% of the values.
As expected, all methods are able to deal with the subspace identification task. ISVD and PAST are the fastest to converge, as they have been designed with the spiked model in mind, while GROUSE performs on par with the two historical approaches Oja and Krasulina. This behavior reflects the fact that the main GROUSE novelty lies in its ability to manage the missing data case.

V. IMPLEMENTATION ON EDGE DEVICES
Since edge devices are employed in a very large variety of applications and for many different tasks, several kinds of devices have been proposed, ranging from single-board systems equipped with a single microcontroller to more complex systems composed of different modules. Examples in the latter class are: the SONY Spresense dev board 1 with a 6-core ARM Cortex-M4 microcontroller, the Raspberry Pi 4 Model B 2 with a quad-core ARM A72 microcontroller, and the STM STM32MP157A dev board 3 equipped with both an ARM M4F microcontroller and a dual ARM A7 microcontroller. All these systems are able to support an operating system, such that high-level programming languages can be considered in the development of the reviewed methods, e.g., Python with dedicated libraries for digital signal processing and numerical computing. On the contrary, single-board systems do not support high-level programming languages, requiring direct programming of the device. To cope with both scenarios, two subsections report the implementation details.

A. High-end edge devices
We here focus on the Raspberry Pi 4 Model B, which can be considered a reference for the high-end edge device family. In more detail, we refer to the board equipped with 1 GB of RAM and running the Raspberry Pi OS (32-bit) Lite 5.4.51-v7l+ operating system. The clock frequency ranges from 600 MHz to 800 MHz 4. Method performance has been computed by adopting a Python (version 3.7.3) implementation with the packages SciPy (version 1.1.0) and NumPy (version 1.16.2) for the required algebraic manipulations.
To avoid a misleading comparison between methods performing the identification of the principal subspace and those performing identification of the minor subspace, we remark that the n × k matrix to be estimated is n × m in the case of the principal subspace, while minor subspace identification looks at an n × (n − m) matrix. In the case of m ≪ n, the computational budget for the minor subspace case amply exceeds what is expected for the principal subspace. We consider here the setting n = 100 and m = 10 (the one adopted in Sec. IV); averages and standard deviations of the update times are in Tab. II, where 1000 different updates have been performed for each method and for each task. As expected, the computational time in the case of minor subspace identification is higher compared to that of the methods identifying the principal subspace.
For principal subspace identification, the most time-consuming method is ISVD, which computes a singular value decomposition of a (k + 1) × (k + 1) matrix at each update. On the contrary, PAST is the method with the lowest computational time. This is mainly due to the fact that PAST does not require any complex operation such as orthonormalization or SVD. As a further remark, note that, as expected, standard deviations are negligible, since differences between single time updates are only due to operating system activity. In the case of minor subspace identification, the method to be preferred in terms of computational time is HFRANS. As for PAST, this is a method that does not require any complex operation at each update.
To enlarge this analysis, several (n, k) couples of values were tested in order to better compare the reviewed methods in terms of required computational time. Here we keep the ratio between n and k constant, and Fig. 4a shows how the average computational time increases with n in the case of n/k = 100/10 for the principal subspace and n/k = 100/90 for the minor subspace.
Regardless of the task to be performed, we can identify two different classes of methods: Oja, Krasulina and ISVD, which require at each step either an orthonormalization of the columns or an SVD, and HFRANS, GROUSE and PAST, which do not require complex operations, thus reducing the computational time with respect to the first group of methods.

B. Low-end edge devices
This subsection describes how we propose to directly implement the discussed methods on low-resource devices. In detail, we developed a C library 5 targeting the ARM Cortex microcontroller (MCU) family of devices. The library backend is mainly based on the well-known ARM CMSIS-DSP library 6.
The implementation of the methods is not straightforward, as we want to reduce to the minimum their memory requirements and processing time. In Appendix A it is possible to find the tricks that have been used for the implementation of fast and low-memory matrix operations.
Both the GROUSE and PAST implementations adopt all these expedients to reduce the memory footprint and computational complexity as much as possible. Beyond this, Oja, Krasulina and ISVD require extra care due to two main bottlenecks, namely the QR decomposition and the SVD operation. How these strong limitations are overcome is described hereafter.
After that, speed performance, energy consumption and memory requirements on the MCU are analyzed, with a final assessment on the performance for both subspace identification tasks.
1) QR decomposition: the operator Ω(·) that makes its argument column-orthonormal appears in (4) and (6), the two update rules of the Oja and Krasulina methods. It can be obtained through the QR decomposition operation, for which several implementations have been proposed.
Householder QR decomposition (Hous-QR) [47] is one of the most used algorithms, as it guarantees great numerical stability, even in the case of column-orthonormalization of ill-conditioned matrices. This means that the output matrix is always as close as possible to a column-orthonormal matrix, i.e., the difference between I_k and U_tᵀU_t is in the order of the machine epsilon. Nonetheless, Hous-QR is an iterative algorithm that requires multiple matrix multiplications for each iteration, thus making it computationally very expensive.
A less expensive algorithm is the Modified Gram-Schmidt QR decomposition (MGS-QR) [48]. However, the price for the reduced complexity is the possibility of numerical instability: U_t is only guaranteed to be approximately column-orthonormal, especially if the argument of the QR operator is an ill-conditioned matrix.
Finally, it is also possible to employ the Cholesky decomposition (Chol-QR) [49]. This algorithm is even faster than MGS-QR [50], at the price of an even greater numerical instability. This translates into an output matrix U_t that is always only approximately column-orthonormal. While this is a critical problem for many applications, in the case of the Oja and Krasulina methods we noticed that they are quite robust to this limitation, i.e., both methods tolerate well matrices U_t that are only approximately column-orthonormal. Because of this, Chol-QR does not affect the performance of the two algorithms and has thus been selected for the final implementation.
The Chol-QR algorithm is quite simple and its description can be found in Algorithm 1. The implementation of the Cholesky decomposition is based on the ARM CMSIS-DSP library. Also, the inversion of R can be easily performed with the backward substitution technique [12], as it is an upper triangular matrix. Since R⁻¹ is upper triangular as well, R and R⁻¹ can be stored as the two triangular parts of a single square matrix by keeping R⁻¹ in transposed form.
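The essence of Chol-QR can be sketched in a few NumPy lines (a reference model of the C routine, not the library code itself; on the MCU the triangular solve is replaced by backward substitution):

```python
import numpy as np

def chol_qr(A):
    """Cholesky-based QR: G = AᵀA, G = RᵀR with R upper triangular
    (from the Cholesky factor), then Q = A R⁻¹.
    Fast, but less numerically stable than Householder QR."""
    G = A.T @ A
    L = np.linalg.cholesky(G)          # G = L Lᵀ, so R = Lᵀ
    R = L.T
    Q = np.linalg.solve(R.T, A.T).T    # Q = A R⁻¹ via a triangular system
    return Q, R

rng = np.random.default_rng(9)
A = rng.standard_normal((100, 10))     # well-conditioned tall matrix
Q, R = chol_qr(A)

assert np.allclose(Q @ R, A)                         # valid factorization
assert np.allclose(Q.T @ Q, np.eye(10), atol=1e-8)   # approx. orthonormal
assert np.allclose(np.tril(R, -1), 0)                # R upper triangular
```

For ill-conditioned inputs the QᵀQ ≈ I check degrades, which is exactly the instability discussed above.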
2) SVD operation: the SVD operation A −SVD→ P, D, Q is required by the ISVD method (Equation (13)). The algorithm used on the MCU is the Golub-Reinsch SVD (GR-SVD) [51], whose computational time is comparable to that of Chol-QR. The implementation of GR-SVD is based on Daniel Mårtensson's CControl library 7.
In order to accelerate the final implementation, the evaluation of Q has been elided, as it is not needed by ISVD. Also, GR-SVD can be performed by directly overwriting matrix A with matrix P during the computation, thus keeping the memory requirements to a minimum.
The complete implementations of the methods have been tested on an STM32H743ZIT (rev. V), an MCU based on the ARM Cortex-M7 family, with a 32-bit floating point unit, f_CLK = 480 MHz, and both instruction and data caches enabled 8.
With this setup, the energy consumption of a single update has been estimated as E_update = V_DD × I_DD × t_update, where V_DD is the supply voltage, I_DD is the absorbed current, and t_update is the number of clock cycles needed to apply a single update rule of a method divided by the clock frequency. Values are obtained from the datasheet. In particular, we refer to the current values corresponding to V_DD = 1.8 V, with either no peripherals or all peripherals enabled.
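The estimate itself is elementary arithmetic; the sketch below uses the voltage and clock frequency quoted above, while the current draw and cycle count are placeholder values of ours (the real figures come from the STM32H743 datasheet and from the cycle measurements):

```python
# V_DD and f_clk are from the setup above; I_DD and the cycle count are
# illustrative placeholders, NOT datasheet values.
V_DD = 1.8                        # supply voltage [V]
I_DD = 0.100                      # absorbed current [A] (assumed)
f_clk = 480e6                     # clock frequency [Hz]
cycles_per_update = 1_200_000     # clock cycles per update (assumed)

t_update = cycles_per_update / f_clk    # seconds per update
E_update = V_DD * I_DD * t_update       # joules per update

print(f"t_update = {t_update*1e3:.3f} ms, E_update = {E_update*1e6:.1f} uJ")
```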
The results obtained for time per update and energy consumption per update are reported in Table III. Figure 5 shows how the time per update, which is proportional to the energy consumption, scales with n at a fixed n/k ratio.
For each method, the memory footprint can be split into three parts. The first part is the stack memory, a fixed cost that we exclude from our analysis: it is generally negligible in the overall memory footprint and does not depend on either the adopted method or the considered setting (the values of n and k). The second part is identical for every method and consists of the input vector x and the matrix U t , of size n and n × k respectively. The third part consists of the extra memory buffers of various sizes required by each method.
The size of the extra buffers, along with the overall memory footprint formula for each method and the memory required in the case n = 100 and k = 10, can be found in Table IV 9 . Finally, the subspace identification performance of the algorithms on the MCU has been assessed with the same setup reported in Section IV; the results are shown in Figure 6.

VI. CONCLUSION
We review six major approaches to subspace identification by streaming computation, tackling both principal and minor subspace analysis.
The aim is to give a coherent presentation of the methods from which most of the subtle variants currently available in the literature originate, and to perform some elementary comparisons with an eye to their deployment at the edge of a system that concentrates data from a host of distributed nodes.
Beyond functional testing, we implement the methods on both a high-end and a low-end edge device to assess how well their computational requirements match the needs of today's data collection systems, which must reduce redundancy as early as possible along the signal chain to maintain efficiency in the presence of massive data flows.

9 Without the buffered multiplication technique described in Appendix A, the memory requirements for Oja and Krasulina are about 45% higher while computation is about 19% faster compared to the results shown in this section.
At least two of the reviewed methods (PAST and HFRANS) excel in this respect, as they deliver substantial accuracy in principal subspace identification (PAST) and minor subspace identification (HFRANS) with very limited computational requirements.
If heavier computations can be tolerated, the fastest convergence in the identification and tracking of principal subspaces is obtained with ISVD.

APPENDIX A FAST AND MEMORY-EFFICIENT ARITHMETIC ON MCU
Since the arithmetic is based on the ARM CMSIS-DSP library, on the MCU a single n × k matrix is stored as a linear vector of size nk. In this way memory contiguity is maximized and the number of memory accesses is reduced to a minimum.
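Concretely, with row-major flat storage, element (i, j) of an n × k matrix lives at offset i·k + j, so an entire row can be traversed with unit-stride accesses. A minimal sketch (names are illustrative):

```c
#include <assert.h>

/* Row-major flat storage: offset of element (i, j) of an n x k matrix. */
#define IDX(i, j, k) ((i)*(k) + (j))

/* Sum of row i: a unit-stride scan over contiguous memory,
 * friendly to the data cache and to hardware prefetching. */
static float row_sum(const float *A, int i, int k)
{
    const float *row = &A[IDX(i, 0, k)];
    float s = 0.0f;
    for (int j = 0; j < k; j++) s += row[j];
    return s;
}
```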
A brief overview of the techniques used for vector and matrix multiplications follows.
1) Loop unrolling: as loops are used extensively, unrolling greatly improves performance. In fact, it reduces the amount of data to be transferred from memory and the number of loop-index updates. Unrolling is performed automatically at compile time by the -Ofast gcc option.
2) Register blocking: this technique is illustrated in Figure 7(a). Matrix-matrix multiplication can be reduced to multiple subsequent vector-vector dot products; each dot product is performed through multiply-and-accumulate operations and generates one scalar of the output matrix. By using multiple accumulators at the same time, i.e., by interleaving two or more dot products, local registers are better exploited and performance greatly increases.
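A minimal sketch of the idea, interleaving two dot products so that each element of the current row of A is loaded once and reused for two output columns (for brevity, an even number of output columns is assumed; names are illustrative):

```c
#include <assert.h>

/* C = A * B, with A n x k, B k x m, C n x m, all row-major flat arrays.
 * Two output columns are produced per inner loop: the two accumulators
 * stay in registers, and each element of A's row is loaded only once. */
static void matmul_blocked2(const float *A, const float *B, float *C,
                            int n, int k, int m /* assumed even */)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j += 2) {
            float acc0 = 0.0f, acc1 = 0.0f; /* two interleaved dot products */
            for (int r = 0; r < k; r++) {
                float a = A[i*k + r];        /* loaded once, used twice */
                acc0 += a * B[r*m + j];
                acc1 += a * B[r*m + j + 1];
            }
            C[i*m + j]     = acc0;
            C[i*m + j + 1] = acc1;
        }
    }
}
```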
3) Buffered multiplication: matrix-matrix multiplications in which the first matrix is n × k and the second is k × k are common (Oja, Krasulina, ISVD). The result is an n × k matrix, so the total memory required for this operation would be (2n + k)k. If n ≫ k, the memory requirement can be almost halved by storing the output matrix in the same memory location as the first input matrix, thus overwriting it. This is possible if, for each output row, the input row is temporarily copied into a k-sized buffer; on the other hand, this slightly increases the computation time. The same method can also be applied in the transposed case (a k × k times k × n matrix-matrix multiplication). This technique is illustrated in Figure 7(b).
4) Vector outer product and matrix addition merging: some methods (Krasulina, PAST, GROUSE, HFRANS) require update operations of the type A = A + ab T , where A ∈ R n×k , a ∈ R n and b ∈ R k . Instead of first evaluating B = ab T and then performing A = A + B, each entry of B can be summed directly into the corresponding entry of A the moment it is evaluated, thus reducing both the memory requirements and the computation time of the overall operation. This is shown in Figure 7(c).
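The two techniques can be sketched as follows in plain C (row-major flat storage as above; names are illustrative and no CMSIS-specific optimization is shown):

```c
#include <assert.h>

/* (3) Buffered multiplication: A <- A * B in place, with A n x k and
 * B k x k. Each row of A is first copied into a k-sized scratch buffer,
 * so the output row can safely overwrite the input row. */
static void matmul_inplace(float *A, const float *B, int n, int k,
                           float *buf /* k-sized scratch */)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < k; j++) buf[j] = A[i*k + j]; /* save input row */
        for (int j = 0; j < k; j++) {
            float s = 0.0f;
            for (int r = 0; r < k; r++) s += buf[r] * B[r*k + j];
            A[i*k + j] = s;                              /* overwrite row */
        }
    }
}

/* (4) Merged rank-one update: A <- A + a b^T, with A n x k, a length n,
 * b length k. Each entry of the outer product a b^T is summed into A
 * the moment it is evaluated, so B = a b^T is never materialized. */
static void rank1_update(float *A, const float *a, const float *b,
                         int n, int k)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < k; j++)
            A[i*k + j] += a[i] * b[j];
}
```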

5) Transposition of square matrices: in general, transposing a rectangular matrix requires copying the entire matrix to another memory area. A square matrix can instead be transposed in place, overwriting the original by swapping the values in its two triangular parts.
6) Transposed multiplications: when performing multiplications such as AB T or A T B, the transposition operation can be avoided entirely by modifying the multiplication so that the transposed matrix is scanned row-first instead of column-first (or vice versa).
7) Column concatenation: the ISVD method requires the concatenation of a column vector. Columns are not memory-contiguous while rows are. To greatly simplify this operation and reduce its computation time, the whole method can be transposed so that the column concatenation becomes a row concatenation.
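As a minimal sketch of technique 5, an in-place square transposition swaps each element above the diagonal with its mirror below it (row-major flat storage as above; names are illustrative):

```c
#include <assert.h>

/* In-place transposition of a k x k row-major matrix: every element
 * (i, j) with j > i is swapped with its mirror (j, i), so no second
 * matrix-sized buffer is needed. */
static void transpose_inplace(float *A, int k)
{
    for (int i = 0; i < k; i++)
        for (int j = i + 1; j < k; j++) {
            float t = A[i*k + j];
            A[i*k + j] = A[j*k + i];
            A[j*k + i] = t;
        }
}
```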