MAXVAR-Based Distributed Correlation Estimation in a Wireless Sensor Network

The performance of most array signal processing tasks relies on the presence of correlation between sensor signals. In a wireless sensor network, where sensor nodes are spread out over a relatively large area, it is useful to identify nodes observing similar sensor signals and hence common phenomenons, for example to partition the network according to the observed latent signals and corresponding correlation structure. This can be achieved via the so-called MAXVAR formulation of generalized canonical correlation analysis, which finds a low-dimensional subspace that highlights correlated signal components between multiple nodes' observed signal subspaces. The classical procedure for computing the solutions of MAXVAR consists in performing a generalized eigenvalue decomposition after collecting all the sensors' signals at a fusion center. However, this typically incurs high communication and computational costs. In this paper, we describe a low communication and computational cost distributed algorithm that computes the solutions of MAXVAR without aggregating the nodes' observations at a central location. We show how a subset of those solutions can be used locally by each node to estimate the global correlation structure across all nodes in the network, thereby allowing any node to evaluate the presence of correlated signals at any other node, even if no direct link is shared. We prove the convergence of the algorithm and validate our method for estimating the correlation structure via simulations.


I. INTRODUCTION
A WIRELESS Sensor Network (WSN) consists of a collection of nodes that are equipped with one or more sensors, wireless communication capabilities, and a processing unit. The sensor observations collected in a WSN can either be forwarded to a "fusion center" (FC) where all the data is collected and centrally processed, or the data processing task can be collaboratively performed in a decentralized fashion by the sensing nodes Manuscript  themselves [2]. Centralized computation has the advantage of allowing the use of off-the-shelf algorithms and techniques, but comes at the cost of greater bandwidth requirements, since all nodes have to send their raw data to the FC, as well as a large computational cost at the FC, which scales poorly with respect to the number of sensor nodes [3]. The aforementioned increase in bandwidth is even more severe in the case of networks relying on multi-hop routing towards the FC, where some nodes act as data relays for other nodes [4]. Furthermore, the FC constitutes a single point of failure which compromises the robustness of the network [4], [5]. In comparison, distributed processing only requires each node to solve a local low-complexity task to collaboratively complete a global task of higher complexity. Array signal processing tasks such as signal estimation, filtering, or subspace estimation, generally boil down to the extraction of specific signal components, often split across many sensor signals. Due to the presence of sensing noise and the spatial distribution of the underlying signal sources, different nodes typically observe different but correlated signals, some sharing common components and others not. It is therefore of great importance to identify which node pairs share a common latent signal subspace and which do not, as this knowledge can be used to prune the network or cluster the nodes according to the similarity between their signal subspaces, resulting in further bandwidth and computational complexity reduction at each node [6], [7].
In this paper, we leverage the fact that the so-called principal angles derived from the solutions of the canonical correlation analysis (CCA) problem can be used to quantify the similarity between two nodes' sensor signal subspaces [8], [9]. Indeed, CCA can be used to estimate the signal components that are maximally correlated between two different nodes. It is closely related to principal component analysis (PCA) [10] and the Karhunen-Loève transform (KLT) [11], which extract the highest power components, yet not necessarily observed by both nodes. As CCA is only applicable to identify the most correlated signal components within the signal subspaces of two nodes, applying it to a network of more than two nodes would result in a combinatorial complexity scaling. In this paper, we show that one of its multi-set generalizations, the so-called "Maximum Variance" (MAXVAR) generalization [12], [13], [14], can be used to approximate the solutions of the pairwise CCA problems and therefore the complete correlation structure of the network with a complexity that scales linearly with the network's size. MAXVAR has indeed been shown to provide a description of the intersection between multiple subspaces [15], which fits with 1053-587X © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
the general purpose of finding "shared" signals subspaces across multiple nodes of a WSN. There exist multiple multi-set generalizations of CCA, each characterized by a specific objective function and set of constraints [14]. The MAXVAR formulation historically refers to the objective function introduced by Horst [12]. Carroll [13] later introduced a new set of constraints for this objective which turns the formulation into an easily interpretable subspace decomposition and on which we focus in this paper.
As using MAXVAR in a centralized fashion would hinder our progress towards bandwidth and complexity reduction, we present a distributed and adaptive algorithm for tracking the solutions of MAXVAR, relying on the exchange of compressed signals between the nodes. We describe variants of this distributed MAXVAR (D-MAXVAR) algorithm for fully connected, startopology and tree-topology networks, and use the later as a basis for extending the algorithm to arbitrary topologies. We then explain how its solution can be used to evaluate the pair-wise correlation between any pair of nodes in the network, even if they are not connected via a direct link.
Previous works have investigated similar distributed subspace decompositions, targeting the union of the nodes' subspaces by extracting the components of greatest power [16], [17], maximal SNR [18], [19] or maximizing correlation between two sets of signals [20], [21]. In this work, we focus on finding the components that most adequately describe the inter-node correlation coefficient, also known as Pearson's correlation coefficient. Distributed algorithms already exist for the so-called SUMCORR generalization of CCA [22] but, to our knowledge, no distributed algorithm has been developed for MAXVAR, which, contrarily to SUMCORR, admits an analytical solution. In addition, the focus of [22] was on computational efficiency, rather than on bandwidth scaling, as is the case of the present work. Finally, as an alternative to distributed methods, several works investigate the applicability of CCA-like techniques to scenarios involving two sensor arrays and where only a limited number of samples is available at each node [23], [24], [25].
In our previous work [1], we presented a condensed description of the algorithm for star-and tree-topology networks. In this paper, we provide a proof of convergence and explain how the algorithm's outputs can be used to estimate the network's correlation structure. We also provide numerical simulations to demonstrate the algorithm's effectiveness and its convergence properties.
The paper is organized as follows. In Section II, we cover the preliminaries and algebraic concepts related to MAXVAR and explain how it can be used in the context of WSNs. In Section III, we describe a distributed MAXVAR algorithm for fully connected networks and discuss its convergence properties. The algorithm is extended to general topologies in Section IV. Section V discusses how a specific subset of the solutions of MAXVAR can be used by each node to estimate the inter-node correlation structure, and in particular evaluate the degree to which each node's signals correlate with any other node, even several hops away. In Section VI, we assess the algorithm's performance through various simulations. We conclude by a brief discussion in Section VII.

A. WSN Setting and Notation
We consider a WSN consisting of K nodes in which each node k ∈ K = {1, . . . , K} collects discrete observations of a realvalued M k -channel sensor signal x k = [x k,1 , . . . , x k,M k ] T . We model x k ∈ R M k as a stochastic process and denote x k [t] its value at time t. We assume that x k is zero-mean, ergodic and short-time stationary, allowing us to estimate the slowly varying covariance matrices from sample averages over finite segments of data: where E{·} denotes the expectation operator and X k [t] denotes the M k × T observation matrix containing T consecutive observations of x k centered around t in its columns. Finally, we define the network-wide observation vector as the M -channel vector x obtained by stacking the x k 's and where M = k M k .

B. Canonical Correlation Analysis (CCA)
We wish to characterize the intensity of correlations between the signals of any two nodes. The so-called canonical correlation coefficients provide such a characterization, which can be found by means of canonical correlation analysis [26]. Considering two multi-channel signals x k and x l associated with two nodes k, l ∈ K, CCA computes spatial filters w k and w l that maximize the correlation coefficient ρ kl between their output signals z k = w T k x k and z l = w T l x l . The output signals z l and z k are referred to as the first canonical directions, and ρ kl is referred to as the first canonical correlation coefficient. Additional canonical directions and coefficients can be found by computing additional pairs of spatial filters that maximize the correlation between their outputs while having their outputs remain uncorrelated (orthogonal) to the previous canonical directions. Formally, the i-th canonical correlation coefficient ρ kl i and canonical directions z k,i , z l,i are defined as The three last constraints require the canonical directions to have unit-variance and be orthogonal with respect to each other within each node. Note that ρ kl i = 1 implies that the pairs (z k,i , z l,i ) span the exact intersection between the sensor signal subspaces of nodes k and l.
If we now express the set of canonical directions of node k with respect to node l as with M kl = min(M k , M l ) and W kl ∈ R M k ×M kl , it can be shown [27] that the canonical directions and coefficients between two nodes k and l are the solutions of the following generalized eigenvalue problem: where R x k is a shorthand notation for R x k x k and Λ kl is a diagonal matrix whose diagonal entries are non-negative and correspond to the canonical correlation coefficients. As a result, the optimal solution of (2) is given by the generalized eigenvectors corresponding to the largest generalized eigenvalues. By expressing the canonical directions via the parametrization introduced by (3), CCA can indeed be seen as the problem of finding two sets of spatial filters whose corresponding outputs have maximal total correlation.

C. Total Squared Correlation
From the definition of canonical correlation coefficients, we can define the total squared correlation (TSC) where Tr(·) denotes the trace operator. (5) can be interpreted as a generalization of correlation between random variables to correlation between sets of random variables. Indeed, if we denote X k = Spanx k , i.e. the signal subspace spanned by the channels of x k , we have: assuming dim(X k ) ≤ dim(X l ). Note that in the onedimensional case (M k = M l = 1), the TSC reduces to the usual squared correlation coefficient. Contrarily to correlation, the TSC is not concerned with the direction of the relationship (i.e. positively or negatively correlated), only the absolute magnitude of the relationship is captured. The TSC is positive and symmetric but does not satisfy the triangle inequality and is therefore not a distance [28]. Still, in the case where the channels of x k and x l are linearly independent (which is in practice always true in noisy settings), it can easily be turned into one by defining which satisfies the triangle equality [29] and corresponds to the so-called Chordal distance in the particular case where M k = M l [30]. Considering these above properties, the TSC is an adequate metric for characterizing the degree to which the signals of two nodes correlate, which is one of the goals of this paper. In a WSN context, the TSC could for example be used as the key ingredient to an adaptive network where links between low-TSC node pairs are pruned, while more bandwidth is allocated to links between high-TSC node pairs. Similarly, the TSC could be used to define a weighted graph describing the correlation structure of the network, and on which spectral clustering techniques could be applied [31].

D. MAXVAR: Extension to More Than two Subspaces
Problem (2) can be generalized to more than two nodes or subspaces in various fashions. As it admits a closed form solution, 1 we consider here the so-called Maximum-Variance (MAXVAR) generalization of the CCA problem, as described by Carroll [13] and defined as follows for K signal subspaces: where Q is the number of desired components. From the parametrization introduced by constraint (9b), the problem becomes equivalent to finding a Q-outputs filter W k ∈ C M k ×Q per node such that the filtered observations z k , which we refer to as the per-node MAXVAR directions, are as close as possible to some common network-wide Q-dimensional signal s, and hence as close as possible to each other. From (9), it is clear that the per-node MAXVAR directions are the orthogonal projections of s onto the nodes' signal subspaces Span(x k ). Hence, Span(s) is the Q-dimensional subspace whose signals have minimal average projection error onto the nodes' individual signal subspaces.
In [32], it is shown that the solution of (9) satisfies where W [W T 1 · · · W T K ] T is the matrix obtained by stacking the W k 's. Substituting (10) in (9a), the objective can be reformulated as The problem therefore consists in finding the set of node-specific filters whose outputs are as close to each other as possible in a minimum squared error sense. Furthermore, it is also shown in [32] that the W k 's are solutions to the following eigenvalue problem: where R D Blkdiag(R x 1 , . . . , R x K ) is the block diagonal matrix containing the node-specific covariance matrices and Λ is a diagonal matrix. Solving problem (9) is therefore equivalent to computing the generalized eigenvalue decomposition (GEVD) of the matrix pencil (R D , R xx ), keeping only the generalized eigenvectors (GEVC) corresponding to the Q smallest generalized eigenvalues (GEVL).

E. Problem Statement
For each pair of nodes, we wish to assess how "close" their observed signal subspaces are to each other. This can be achieved by computing the TSC as described by (5). However, this approach has two major drawbacks. Firstly, it requires the computation of R x k x l which is only possible if the full observations x k and x l are co-located at a single node (or an FC), therefore incurring high communication cost. Secondly, a CCA solution needs to be computed for each of the K(K − 1)/2 pairs of nodes, further increasing the communication and computational cost of the procedure. As an efficient alternative, we propose to jointly approximate the TSCs between all node pairs using a distributed procedure based on MAXVAR. Indeed, we will show in Section V-A that solving a slightly modified version of the MAXVAR problem produces a joint approximation of the TSC of each node pair. As it is useful in its own right and of more general interest, we will first describe a distributed procedure for computing the solutions of the classical MAXVAR problem (12). In order to obtain such a solution, all nodes would typically need to share their observations to an FC where the full covariance matrix could be estimated. This requires a large communication bandwidth between the nodes and the FC, in particular if all nodes are not directly connected to the FC, which increases the stress on the communication links of the nodes that are close to the fusion center. In addition, such a centralized processing does not leverage the fact that the signal subspace in which inter-node correlation is present typically manifests a low dimension, and that its signals can therefore be efficiently described by only a few components.
In Section III, we present a distributed algorithm for solving (12) in a distributed fashion and relying on the transmission of low-dimensional compressed views of the data between neighboring nodes, thus lifting the need to transfer the raw M k -channel sensor observations of all nodes to an FC through a possibly multi-hop network. Instead, the signal observations are directly fused with other observations within the network, and such that each node eventually have access to an estimate of the solutions, thereby avoiding the need for a fusion center altogether. In Section V, we describe how the distributed MAX-VAR procedure can be modified to produce an approximation of (5).

III. DISTRIBUTED MAXVAR IN FULLY-CONNECTED NETWORKS
In this section, we derive a distributed iterative algorithm for computing the Q first per-node MAXVAR directions associated with each node, and which is such that each node k eventually has access to the network-wide estimate of s and its own per-node MAXVAR direction z k as defined in (9b). Relying on our interest in a subset of the components only (which is motivated in Section V) and the particular problem structure, we show that neighboring nodes only need to share Q-dimensional compressed views of their observed signal subspaces at each iteration, which eventually converge to their first Q per-node MAXVAR directions z k . In order to facilitate the reader's understanding and intuition in the algorithm development, we first derive the algorithm for the simpler case of fully-connected networks and later extend it to more general topologies.

A. Algorithm Derivation
In a fully-connected network, any node can communicate with any other node via a single hop. By denoting the set of neighbors of node k as N k , we have for such networks N k = K \ {k}.
Considering the GEVD formulation (12), the problem of finding the Q first per-node MAXVAR directions is equivalent to finding the Q GEVCs of the matrix pencil (R D , R xx ) associated with its smallest GEVLs and corresponding to the columns of W . The algorithm iteratively updates the M k × Q matrices W i k (where the superscript i denotes the iteration index), which act as the local estimates of W k and, as will be shown, as a node-specific compressor of the nodes' signals. It can be shown [33] that the GEVCs (corresponding to columns of W ) associated with the Q smallest GEVLs from the GEVD in (12) and hence the MAXVAR solutions of (9) coincide with the solution of the following trace minimization problem 2 where Tr(·) denotes the trace operator. The core idea of the algorithm is to have the nodes solve a local version of (13) in turns, and expressed in terms of the node's own raw observations and the other nodes' compressed observations z i k = W iH k x k (W iH k denotes the compression matrix associated with the signals of node k at iteration i of the algorithm). The notation z i k , W iH k for the compressed observations and compression matrices is chosen deliberately, as they also correspond to node k's local estimate of its part of the solution of (9). We indeed expect that eventually The algorithm in fully connected networks is as follows. At the beginning of each iteration, an updating node q is selected. Every other node transmits a batch of its compressed observations z i k ∀k ∈ N q to the updating node, such that it can locally solve the following problem: and Note that the solution of (15) can again be found from a GEVD, this time applied to the pencil . An update rule for the network-wide W i+1 emerges naturally by noticing that solving the local problem (15) is equivalent to solving the original centralized problem (13) with additional constraints: where the operator C(·) denotes the column space of its argument. Indeed, notice that the range constraints in (18c) can be equivalently formulated by introducing the following parametrization of W in (13): where the multiplication with G k ∈ R Q×Q allows to select a new W k within the column space of W i k . With this parametrization, (18a) becomes Finally introducing W as (21) and substituting in (20), we indeed obtain the objective of the local problem (15a). A similar derivation can be done for the constraints (18c). The parametrization therefore indeed defines the update rule for the local estimate of W k at each node: where W q and G k are extracted from the solution of (15) based on the partitioning (21). Note that the range constraints make (18) different from a traditional nonlinear Gauss-Seidel approach where all W k would be fixed to W i k for k = q, in which case (18) could not be solved as a GEVD anymore.
Concretely, the procedure at each iteration can be divided into three steps (the detailed procedure is formally described in Algorithm 1): 1) Aggregation: Each node k ∈ N q sends a new batch of T observations of the locally compressed signals z i k to the updating node q.
2) Local solution: The updating node q solves problem (15), expressed exclusively in terms of locally available data. From the solution, it extracts the update of its own local estimate of the solution W q as well as the update matrices G k corresponding to each other node, as defined by the partitioning (21). Due to the sign ambiguities of the GEVCs 3 , the solution of (15) is not unique and independent from the sign changes of its columns. Therefore, in order to ensure convergence, we select the solution resulting in the smallest difference W i+1 − W i F . In practice, this amounts to checking whether for each of the columns w i+1 and w i of W and [W iT q I Q · · · I Q ] T , respectively, and choosing the signs of the columns accordingly. 3) Update: Node q sends the update matrices {G k } k =q to each node, which update their local estimates W i k according to (22). Note that the transmission cost of these Q × Q update matrices is negligible compared to the transmission costs in the aggregation step (assuming T Q 2 ). The role of the updating node is finally passed on to node (q mod K) + 1. If required, the common components s can be estimated at any iteration by performing an in-network summation: It is noted that the updating node q has access to all the data required to compute (23), such that no additional bandwidth is required to compute s at node q. Remark III.1: The batch of T (compressed) samples that is transmitted by each node during the aggregation step would typically consist of different samples than the ones used in the previous iteration, in order to avoid retransmitting the same data multiple times which would substantially increase the bandwidth. Therefore, the time index t corresponding to the first sample of a batch is typically updated as t + iT at each iteration (with T possibly smaller than T ). This implies that the iterations are spread out over time, and that the algorithm behaves as an adaptive filter tracking the signal statistics over time. In order for the algorithm to converge, we therefore have to assume that the signal statistics change sufficiently slowly compared to the convergence dynamics of the algorithm. To make the convergence analysis mathematically tractable, all convergence proofs in the remaining of this paper implicitly assume that the signal statistics remain stationary during convergence of the algorithm.

B. Convergence and Optimality
In this subsection, we provide some insight in the convergence and optimality properties of the D-MAXVAR algorithm, and establish formal convergence proofs.
A first important observation is that the solution of (18) at iteration i is by definition in the constraint set of the problem at iteration i + 1 (corresponding to selecting G k = I Q ). As a result, the objective function (18a) decreases monotonically and must therefore converge, as the objective function is bounded in the constraint set. However, the convergence of the sequence of optimization variables (W i ) i∈N is less straightforward. Indeed, monotonic convergence of the objective function does not imply convergence of its arguments, nor that the global minimum is eventually attained. Nonetheless, by making an assumption which is always satisfied in practice, we can show several interesting properties about the convergence behavior and limit points of the algorithm.
Assumption III.1: The accumulation points of the sequence of local pencils ((R i D q , R i x q )) i∈N are non singular and have distinct Q-th and (Q + 1)-th smallest GEVLs. Due to the presence of uncorrelated sensor noise, the nonsingularity assumption is always verified. The assumption about distinct eigenvalues is merely technical, as we can show that such points corresponding to degenerate local problems would be unstable in practice, unless they correspond to the global solution of (13) (see Theorem III.3 below), and the algorithm would therefore eventually diverge from such points.
In what follows, a fixed point W * is defined as a point which is invariant under the updates of Algorithm 1, i.e., (W i ) i∈N = (W * ) i∈N if W 0 = W * . The following theorem gives an important characterization of the algorithm's fixed points: Theorem III.1: The columns of matrices which are fixed points of Algorithm 1 are stationary points of problem (13) and therefore GEVCs of the pencil (R D , R xx ).
Proof: See Appendix A. Assumption III.1 guarantees that the local problems have well-defined solutions at accumulation points of the algorithm, allowing us to state our main convergence result: Theorem III.2 (Convergence): If Assumption III.1 holds, (W i ) i∈N converges to a fixed point, and hence stationary point, of problem (13).
Proof: See Appendix B.
Finally, we show that the global minimizers of (13) are the only stable fixed points. In other words, all convergence trajectories to limit cycles or stationary points where (13) is not minimized are unstable in the sense that the algorithm can be kicked out of such trajectories by infinitesimally small perturbations. This is formalized in the following theorem: Theorem III.3 (Unstable Accumulation Points): Let W * be an accumulation point of Algorithm 1. Then W * is an unstable accumulation point if and only if it is not a global minimizer of problem (13).
Proof: See Appendix C. Therefore, in the presence of numerical noise, we expect the algorithm to converge to a global minimizer of problem (13), as demonstrated by the simulations performed in Section VI.

IV. DISTRIBUTED MAXVAR IN GENERAL NETWORK TOPOLOGIES
Before describing the D-MAXVAR algorithm for more general topologies, we first explain how it can be established in a star topology, thereby introducing some important insights towards further extensions.

A. Star-Topology Networks
In a star-topology network, we can distinguish two kinds of nodes: the central node k c , which shares a link with all other nodes in the network, and the leaf nodes k ∈ L = K\{k c }, which are exclusively connected to the central node. Therefore, A naive strategy to apply the algorithm presented above for fully connected networks to a star-topology network would be to let the central node act as a relay between the leaf nodes. We discard this solution for two reasons: Firstly, the bandwidth required at the central node would grow linearly with the number of nodes in the case of a broadcast communication protocol or quadratically in the case of one-to-one communication. The maximum network size would therefore largely depend on the bandwidth available at the central node. Secondly, further bandwidth savings can be achieved by allowing the central node to compress and fuse the signals it receives from the leaf nodes.
We will apply a separate treatment to the iterations where the updating node is the central node and those where it is a leaf node, of which we give a brief overview hereafter: a) The updating node is the central node (q = k c ): As all nodes share a link with the updating node, the network proceeds as in the fully-connected case. All nodes send their compressed observations z i k to the central node which then solves (15). The leaf nodes update their local estimates of the solution as in the fully-connected case.
b) The updating node is a leaf node (q = k c ): As only the central node shares a link with the updating node, it collects the compressed observations of the other leaf nodes. It fuses (i.e. adds) them together with its own observations and sends them to the updating node. The updating node now acts as if the data received from the central node were the compressed observations of a single node and proceeds to compute its local solution estimate and the single update matrix G k c for the central node accordingly. The central node then relays the update matrix G k c to the other leaf nodes, which then all update their local estimate with this single update matrix. The data flow for star-topology networks is illustrated in Fig. 1. Note that the computation of the local solution at node q also requires aggregating and fusing the second order signal statistics through the network similarly to the signal observations, as will be explained below.
Formally, for a star-topology network, when the updating node is a leaf node q ∈ L, node q receives the following from the central node: It then constructs as it did in the fully-connected case according to (16)- (17), and solves problem (15). With those new definitions, we can show that solving the local problem (15) is equivalent to solving where W −q is the matrix obtained by removing the rows of W corresponding to W q . Note that when the updating node is the central node (i.e. q / ∈ L), problem (30) reduces to the fully connected case problem (18). When q ∈ L, constraint (30c) results in the equivalent parametrization where G k c ∈ R Q×Q . Remark IV.1: Note that the number of degrees of freedom is lower for iterations where the updating node q is a leaf node (constraint (30c) is active) than for iterations where the updating node q is the center node (constraint (30d) is active). We therefore generally expect a lower decrease of the objective function (on average) for iterations in which a leaf node is the updating node.

B. Tree-Topology Networks
A tree-topology network has an acyclic graph, which implies that there is a unique path between any two nodes. The nodes with a single neighbor are also referred to as leaf nodes and constitute the end points of the branches in the tree. As we did for star-topology networks, we denote the set of leaf nodes L.
Note that the tree concept should not be viewed as an actual topology constraint, but instead as a framework to organize the data streams within the network, i.e., to define in which order nodes have to transmit their data through the network, which will be essential when extending our algorithm to general topologies.
The procedure in tree-topology networks is conceptually similar to the star-topology case, where the updating node behaves as if it were the center node of a star-topology network, as described in Section IV-A. Consider the two isolated subtrees obtained by disconnecting the updating node q from one of its neighbors k ∈ N q . We denote B kq the set of nodes in the subtree containing k (see Fig. 2).
At each iteration, each of the neighboring nodes k ∈ N q of node q recursively collects and sums the compressed observations and related covariance matrices of its respective subtree  B kq and sends them to node q. The updating node q therefore receives Note that these definitions are recursive and that those values can be efficiently computed by performing an in-network summation in a recursive fashion. This recursive aggregation procedure is described by Algorithm 2 and illustrated in Fig. 3. Note that even in nodes where Q > M k (e.g. in case of single-channel sensor nodes) this aggregation process realizes an overall bandwidth reduction due to the in-network fusion of data (as opposed to straightforwardly relaying the raw data). Similarly to the other topologies, we define with n q = |N q | and {k 1 , . . . , k n q } = N q . This allows us to solve the local problem (15) which is again equivalent to the global problem (13) equipped with additional range constraints, this time for each subtree B (·) . A complete description of the procedure is given by Algorithm 3. Finally, note that similarly to (23), the common components in s can be estimated as

C. Arbitrary Networks
It is in tree-topology networks that our algorithm offers the greatest benefits in terms of both complexity and communication costs. Even though tree-topology networks are quite specific, the D-MAXVAR algorithm can be generalized to networks with arbitrary topologies by overlaying it with a topology management layer responsible for designing and maintaining a tree-topology at each iteration (assuming the initial graph is connected). Ideally, such a tree should preserve all the neighbors of the updating node q. Indeed, cutting off neighbors would reduce the number of update matrices G in (34), which would reduce the degrees of freedom in the minimization of (15), using

Algorithm 4: Distributed Algorithm for Computing a Spanning Tree Preserving the Neighbors of Node q.
the definitions in (34)- (36). Algorithm 4 describes a simple distributed procedure for constructing a tree rooted at node q and preserving the links to its neighbors. It has a per-node message complexity at node k of O(N k ) and an overall time complexity of O(D) where D is the diameter of the network (i.e. longest shortest-path between two-nodes). The procedure is as follows. Each neighbor of node q sends a message to each of its neighbors offering to become its parent in the tree. A node accepts to become a child if it does not yet have a parent. Once it has a parent, it does the same with its own neighbors (its parent excluded). One can see that at each step of the algorithm, the nodes which have a parent are part of a tree. The algorithm stops when each node has a parent, producing a tree spanning the entire network graph.
In practice, the data exchange required to construct such a tree is expected to be negligible compared to the exchange of raw sensor data in Algorithm 3, in which a whole batch of T multi-channel signal samples is transmitted by each node. Furthermore, we expect the network topology to be relatively static and change only after many iterations of the algorithm, as the objective function of the algorithm would otherwise not be relevant anymore. In that scenario, it is reasonable to only periodically compute a new spanning tree using either Algorithm 4 or a more elaborate distributed spanning tree algorithm such as those described in [34].
A similar convergence statement as for fully-connected networks can be made for tree networks and their extension to arbitrary networks, and the associated proof can be relatively straightforwardly adapted to these cases, although the notation and definitions become substantially more convoluted.  variant is that it scales well, i.e. the per-node communication cost and transmission cost is independent of the network size (as opposed to a naive multi-hop relay procedure, which would grow with the depth of the tree). This is because the sensor observations of the different nodes are fused along the way when the data travels through the network, as described in the aggregation step of Algorithm 2.

D. Complexity and Communication Cost
The above figures assume the transmission cost of the update matrices G and of the messages involved in the distributed spanning-tree procedure to be negligible compared to a batch of T samples.

V. CORRELATION STRUCTURE ESTIMATION
In this section, we describe how the D-MAXVAR algorithm can be modified to efficiently approximate the TSC of all node pairs, even for a pair of nodes that do not directly exchange signals with each other. We first describe how the TSC relates to a low-rank approximation of a particular block-whitened correlation matrix (defined below). We then show how this lowrank approximation can be computed by computing a different subset of the per-node MAXVAR directions, instead of the ones corresponding to the Q smallest GEVLs. Finally, we describe how to select an appropriate value for Q.

A. MAXVAR as a Low-Rank Approximation
The covariance matrix between node k and l of the (per-node) whitened signals is (38) and the TSC between nodes k and l can be shown to be equal to where · F denotes the Frobenius norm (See Appendix D for a proof). We first show how W , i.e. the MAXVAR solution of (9), relates to the eigenvalue decomposition of D , the network-wide covariance matrix of the (pernode) whitened signals. Note that P xx is a block matrix such that where [·] kl denotes the M k × M l block corresponding to node k and l, and where in particular the diagonal blocks [P xx ] kk = I M k . We stated earlier that the spatial filters W were solutions of the GEVD (12). If we assume that Λ = W T R D W (see (12)) only contains non-zero GEVLs (which is always the case in the presence of noise), we can define allowing us to write (12) as which can be reorganized as The columns of U are therefore orthogonal eigenvectors of P xx and Λ −1 contains the corresponding eigenvalues. As a result, the Q-dimensional filters computed by MAXVAR are isomorphic to the eigenvectors corresponding to the Q largest eigenvalues of P xx (i.e. the smallest diagonal elements of Λ) via (41).

B. Naive TSC Approximation Using D-MAXVAR
Considering the definition (39) of the TSC, it seems reasonable to approximate the TSC by replacing P x k x l by [P xx ] kl in (39). Indeed, if we denote the Q-rank approximation of P xx aŝ P xx , we havê P xx can therefore be interpreted as a joint low-rank approximation of the matrices P x k x l . This approach, however, has two issues. Consider the split of (45) between diagonal and off-diagonal blocks: First, if the off-diagonal blocks of P xx have a very small norm, the joint low-rank approximation will be focused on reconstructing the diagonal blocks, which do not contain any information about the inter-node correlation structure. Second, P xx might have full rank, even though the inter-node correlation structure can be described by a small number of components (e.g. in the case of a completely uncorrelated network, P xx = I). This makes the choice of an appropriate value for Q quite challenging as it is seemingly unrelated to the rank of P xx . Ideally, our approximation method should be such that increasing the value of Q, and therefore allocating more bandwidth, always results in a better approximation of the TSC. In addition, when the inter-node correlation structure can be described by only a few components, the TSC should be perfectly recovered by selecting Q M .

C. TSC Approximation Using a Modified D-MAXVAR
In subsection V-A, we have shown that MAXVAR was equivalent to computing the low-rank approximation of P xx and in the previous subsection, we have described how using this low-rank approximation to approximate the TSC would result in an approximation lacking multiple desirable properties. Via a slight modification of the original problem, we can obtain a low-rank approximation of P xx − I, from which we can derive a TSC approximation which satisfies the desired properties for Q. For this purpose, we propose to approximate the TSC by usinĝ P x k x l [P D xx ] kl in (39), whereP D xx denotes the low-rank approximation of P xx − I instead of P xx . This would resolve the problems mentioned in the previous subsection, as it removes the influence of the (irrelevant) diagonal entries in (47). In the following, we describe how the D-MAXVAR algorithm can be modified to efficiently compute this new TSC approximation.
We first note that the eigenvectors of P xx − I and P xx are the same, and that the corresponding eigenvalues can be obtained by substracting 1 from the eigenvalues of P xx . Indeed, the full eigenvalue decomposition of P xx can be expressed as where U M is an orthogonal matrix that contains the full set of M eigenvectors in its columns, and Λ −1 M is the diagonal matrix of corresponding eigenvalues. The previously defined matrices U and Λ −1 are therefore Q-columns submatrices of U M amd Λ −1 M , respectively. Substracting I from both sides yields One can therefore obtain a Q-rank approximation of P xx − I by computing the Q eigenvectors of P xx corresponding to the Q largest diagonal elements of Λ −1 M − I in absolute magnitude. 4 We can link U M to the full set of GEVCs W M of (R D , R xx ) by noting that (41) is valid for any subset of eigenvectors. Therefore and (49) in combination with the orthogonality of U M yields with Γ M Λ −1 M − I. W M thus contains GEVCs of the pencil (R xx − R D , R D ) in its columns, and Γ M contains the corresponding GEVLs. The problem of finding a Q-rank approximation of P xx − I is therefore equivalent to finding the Q GEVCs of (R xx − R D , R D ) associated with its Q-largest (in absolute magnitude) GEVLs. These can be found based on a slightly modified version of the D-MAXVAR algorithm, described hereafter.

Modified D-MAXVAR Algorithm:
Let W Q denote the Qcolumns submatrix of W M associated with the Q-rank approximation of P xx − I. In order to compute W Q in a distributed fashion, Algorithm 3 must be modified such that, at each iteration i, W i contains the Q GEVCs of (R i x q − R i D q , R i D q ) associated with the largest GEVLs (in absolute magnitude) instead of the Q smallest GEVLs of (R i D q , R i x q ). This change has no impact on Theorems III.1 and III.2, which remain valid as W Q is just another stationary point of the original problem (13). A similar result to Theorem III.3 can be obtained by showing that selecting the components as described above is equivalent to solving a related optimization problem for which Theorem III.3 applies, that is find the Q largest GEVCs of the pencil (R xx − R D , R D ).
Remark V.1: We note that the original MAXVAR components in W generally differ from the components in W Q . Additional bandwidth would therefore need to be allocated if both the TSC approximation and the solutions to the classical MAXVAR problem would be required. If the goal is only to find an approximate TSC matrix (as in the example of Section VI), D-MAXVAR can be replaced with its modified version, in which case no additional bandwidth is required.
An expression for the approximation of the TSC usinĝ P x k x l [P D xx ] kl in (39) can be obtained as follows. Let Γ Q be the diagonal submatrix of Γ M corresponding to the GEVCs in W Q obtained by the modified D-MAXVAR algorithm. We define the approximate TSC aŝ (50)). From the definition of the Frobenius norm, we have and from the cyclic property of the trace From (50) and the fact that Λ is the covariance matrix of the compressed signal z k associated with W Q,k , such that we finally obtain Note that R Q z k can be locally computed at node k and shared with negligible communication cost compared to the compressed signal observations themselves. Indeed, instead of sharing M k T samples (T being the window length considered) samples, the nodes only need to share Q × Q matrices, which is cheap even for nodes several hops away. Finally, we have from (51) that Γ Q can be computed as which can be efficiently estimated at each node using (33) without any additional communication.

D. Selecting Q
The following theorem provides a useful bound on the TSC total approximation error: Theorem V.1: Let γ denote the Q-largest element of Γ in absolute magnitude. Then the total TSC approximation error Proof: See Appendix E. Any node can compute the above bound to assess the quality of the TSC approximation, as γ also corresponds to the smallest diagonal element of Γ Q , locally computable via (58).
Note that if Rank(P xx − I) is known a priori, selecting Q = Rank(P xx − I) would result in a perfect recovery of P xx − I, and hence of the TSC (by definition (39)). However, the rank is in practice not known in advance, and the node must therefore estimate it by increasing Q until γ = 0, as then Rank(P xx − I) = Q − 1.
Remark V.2: One can also take a statistical approach and test whether the absolute magnitudes of the eigenvalues are significantly larger than 0, i.e., whether they truly capture correlated components or whether their energy is explained by the estimation noise. The latter can be tested, e.g., via a maximum likelihood ratio test as in [23], [24], [25], [35], [36].

VI. SIMULATION RESULTS
In this section, we validate the TSC approximation described in Section V and demonstrate the convergence properties of the D-MAXVAR algorithm in tree-topology networks.

A. Simulation Settings
In the context of WSNs, it is appropriate to model the observed signals as noisy observations of a mixture of uniformly spatially distributed latent sources, i.e.
where n k is M k -dimensional spatially white sensing noise at node k and uncorrelated with the noise at other nodes, s is a d-dimensional spatially white latent signal and A k is the mixing matrix associated with node k. Finally, α is a network-wide parameter allowing us to modulate the signal-to-noise ratio (SNR). Let A k = [a 1 k , . . . , a d k ], where a j k denote the steering vector at node k associated with source s j . We model it as where g j k is an M k -dimensional vector of random variables drawn uniformly from [0.95, 1.05], modeling the slight discrepancies in the channel gains, and m k and l j are the random coordinates of node k and source j, uniformly drawn from a 10 by 10 square. With this model the sources can be seen as point sources radiating energy uniformly in all directions. The associated covariance matrix can be computed as where A is the matrix obtained by stacking the A k 's. The total power of the latent sources picked up by the nodes is P s = A 2 F and the total noise power is α 2 M . We set α 2 = P s (M SNR) −1 to obtain the desired SNR. For each simulated scenario, we performed 1000 Monte-Carlo runs where the covariance matrices were directly computed via (62). Additionally, we set M k = 8 and d = 3.

B. TSC Approximation
Using the above model and settings with K = 10, we obtained the results depicted by Fig. 4 with the average absolute error E defined as As expected, the quality of the approximation increases with increasing Q and always eventually reaches 0 for a sufficiently high value of Q. As the SNR grows, so does the value of Q required to obtain a perfect approximation of the TSC. Indeed, a larger SNR causes the eigenvalues of R xx , and hence P xx , to grow closer to 0. As a consequence, the corresponding eigenvalues of P xx − I become closer to −1, resulting in more power in the off-diagonal blocks which needs to be accounted for. In practice, using our method in such a setting where the signal subspace is perfectly observed by every node does not make much sense as a simple distributed PCA (based for example  on [17]) would yield an excellent approximation of R xx with only Q = d components. On the other hand, such a PCA-based method would perform very poorly in low-SNR regimes, as it would first compute the components of highest power, which would in that case correspond to noise and would therefore not yield any information about the correlation structure. Our method is therefore best suited for scenarios where the nodespecific subspaces and common subspaces have similar power levels and are hard to separate. Fig. 5 showcases such a scenario. We see that the correlation structure is well captured by the approximate TSC using only 3 components when node-specific noise and latent signals have similar power levels.

C. Convergence
where f (W ) = W H R D W is the objective function of (13) and f * is its value at a global minimizer. In the top plot, we see that convergence is indeed faster with a larger number of components Q. In the bottom plot, we see the impact of the tree depth: the deeper the tree, the higher the compression (due to the recursive summing resulting from Algorithm 2). We expect convergence speed to increase in more densely connected networks, that is networks with more links, as this increases the number of neighbors of each node and hence the number of G matrices possibly used at each iteration (that is assuming that an instance of Algorithm 4 is run at each iteration). In tree-topology networks, the total number of links is K − 1, while it is K 2 in fully connected networks, the convergence speed is therefore expected to drop much more rapidly with increasing network size K in the case of tree-topology networks as the compensation in terms of added neighbors per node is much less significant than for fully connected networks. Finally, we see that, in practice, convergence to a global minima is always achieved due to the instability of other fixed points (see Theorem III.3).

VII. CONCLUSION AND DISCUSSION
In this paper, we have presented an algorithm which computes the solution of the so-called MAXVAR problem in a distributed setting. Our algorithm displays significant savings in computational and communication requirements compared to a centralized procedure where signal observations are collected at a single location. In particular, in arbitrary-topology networks, the communication cost is independent of the network size and only depends on the degree of each node and the chosen compression factor Q. We have proven the convergence properties of the algorithm, and shown via simulations that the condition for global convergence holds in practice.
We have also shown how the networks' correlation structure can be efficiently estimated from the algorithm's outputs and for negligible additional cost, via an approximation of the total squared cosine. Even though the proposed approximation can be arbitrarily precise, it is not adapted to very high-SNR regimes, as the number of required components can quickly grow to become larger than the number of channels at a given node. Nevertheless, we have demonstrated that the approximation is quite accurate for lower SNR regimes, even with low Q, and allows significant computational savings compared to the exact computation of all pair-wise TSCs. Finally, we note that, in most cases, there will be no interest in perfectly estimating the TSC. Indeed, for most applications, the TSC will be thresholded in order to determine whether the link between two nodes should be kept alive, or simply used as a distance input to some clustering algorithm.

A. Proof of Theorem III.1
Proof: Let us assume that W * is a fixed point of Algorithm 1. Then W * is a solution of problem (18) when W i = W * for any q ∈ K. The Lagrangian of this problem can be expressed as where Λ q and Γ k ∀k = q are matrices of proper dimensions containing the Lagrange multipliers and N k is an M k × (M k − Q) whose columns span the left null space of W * k . As W * is a solution of (18), it must satisfy where Γ q is the matrix obtained by vertically stacking all the rows of Γ k and N q is the block diagonal matrix whose blocks are N k , and where the entries of the blocks corresponding to q are set to zero for both matrices. Left-multiplying by W * T and using constraint (18b), we obtain Since the left-hand side is independent of q, we can conclude that Λ q = Λ is the same for every choice of q. From this and (66), we have Combining those equations for q ∈ K yields R D W * = R xx W * Λ. As W * is computed via the GEVD of the local pencil (R i D q , R i x q ) corresponding to W i = W * , it must be such that the local W in (15) diagonalizes R i D q at every updating node q. As a consequence, and considering that the update matrices G k are identity matrices due to the fact that W * is a fixed point, is a diagonal matrix and the columns of W * are therefore GEVCs of the pencil (R D , R xx ).

B. Proof of Theorem III.2
In order to prove Theorem III.2, we first prove two intermediate results. In the following, let f : R M ×Q → R denote the objective function (13a) and let the constraint set (13b) be Furthermore, let the constraint set of the local problems (18) be where V = W i in (18). For convenience, we define the following equivalent procedure to an update of Algorithm 1: where q i = i mod K. We denote F q the mapping producing a new iterate W i+1 from W i at the updating node q at iteration i.
Then m q is a continuous function for any q at points where the blocks W k =q have full column rank.
Proof: The constraint set of the local problem (15) can be expressed as where P q W denotes the orthogonal projection matrix on the linear subspace (71). m q (W ) therefore corresponds to the sum of the Q smallest GEVLs of (P q W R D P qT W , P q W R xx P qT W ) (which is found by substituting the optimization variable in (15) by P q W V ). As the GEVLs of a pencil vary continuously with the entries of its constituent matrices [37], and as P q W varies continuously with W at points where its blocks W k =q have linearly independent columns [38] (which is true under the assumption that the local problems are non-singular), m q is a continuous function at points where the blocks W k =q have full column rank.
Let (W i ) i∈N be any sequence of iterates satisfying the mapping defined by Algorithm 1 and therefore procedure (72). We now show that if (W i ) i∈N has an accumulation point, then it is a fixed point of Algorithm 1 and therefore a stationary point of problem (13).
Lemma B.2: The accumulation points of (W i ) i∈N are fixed points of Algorithm 1.
Proof: Let W * be an accumulation point of (W i ) i∈N . From the continuity of m q (see Lemma B.1), and under the assumption of non-singular local pencils, we have: As (f (W i )) i∈N is bounded and decreases monotonically, it converges to some f * and therefore, from the continuity of f , f (W * ) = f * . By definition of m q we have with q i = i mod K. Therefore, As W * is an accumulation point of (W i ) i∈N , there is some index set N k ⊆ N such that (W i ) i∈N k converges to W * and (q i ) i∈N k = (k) i for some node k. From (76) and the continuity of m q we have As, by definition (71), W * is in D k (W * ), and by the definition of m q , the minimum value of f in D k (W * ) is m k (W * ) = f * , W * must be in M (as defined in (72b)). In virtue of (72a), it must be that F k (W * ) = W * and W * is therefore a fixed point of F k .
We will now establish that W * is also a fixed point of node k + 1. As, by hypothesis, the Q-th and (Q + 1)-th smallest GEVLs of the local pencils (R i D k , R i x k ) are distinct at the accumulations points of (W i ) i∈N , a small perturbation of the pencil around an accumulation point results in a small perturbation of the generalized eigenspace [9], [37] 5 . As a consequence, the convergence of the subsequence (W i ) i∈N k to W * implies the convergence of (M i ) i∈N k to some M * , which from (72) corresponds to the sequence of sets of generalized eigenvectors of the local pencils (R i x q , R i D q ), where the convergence of (M i ) i∈N k must be understood in terms of the Haussdorf distance between sets. 6 As m k (W * ) = f * and f (W * ) = f * , it must be that W * ∈ M * . Therefore, there exists some convergent sequence {V i+1 } i∈N k with V i+1 ∈ M i converging to W * (As the convergence of (M i ) i∈N k to M * implies that for any point W in M * we can find a set M i ∈ (M i ) i∈N k containing a point arbitrarily close to W ). As both sequences converge to the same point, and as V i+1 ∈ M i , we have from (72a) that (78) in combination with the squeeze theorem therefore implies that The convergence of (W i ) i∈N k to W * therefore implies the convergence of (W i+1 ) i∈N k to the same point. As (W i+1 ) i∈N k =(W i ) i∈N k +1 with (q i ) i∈N k +1 = (k + 1) i , the argument showing that W * is a fixed point of node k can be applied to node k + 1, and inductively to node k + l for any l. W * is therefore a fixed point of Algorithm 1.
We can now finally prove Theorem III.2. As D is compact, (W i ) i∈N must, by definition of an accumulation point, converge to the (possibly infinite) set of its accumulation points. As a consequence, if (W i ) i∈N is not a convergent sequence, it will eventually oscillate between points arbitrarily close to accumulation points. As from Lemma B.2, the accumulation points of (W i ) i∈N are fixed points of the algorithm, and hence from Theorem III.1, stationary points of the problem, the sequence has a finite number of accumulation points, separated by a finite and fixed distance. As a consequence, it cannot be that W i+1 − W i F converges to 0 and there must be some index set N such that lim i→∞,i∈N Therefore there must also be some other index set N k ⊆ N such that (q i ) i∈N k = (k) i . (81) thus contradicts (80) and (W i ) i∈N must therefore be a convergent sequence.

C. Proof of Theorem III.3
Since W * is not a global minimizer and since (13) has no local minima 7 , every neighborhood V ⊆ D around W * contains a continuum of points U ∈ V for which f (U ) < f(W * ). Now take any point U in U . Since (f (W i )) i∈N is monotonically decreasing, setting W 0 = U will result in a sequence (W i ) i∈N that remains at a finite distance from W * . Therefore, W * cannot be a stable accumulation point.

D. Proof of (39)
Let U kl R 1 2 x k W kl . (4) becomes As described in [27], the full eigenvalue decomposition of this matrix consists in the eigenvectors and diagonal matrix As the eigenvectors matrix is unitary, we have which in conjunction with (5) gives P x k x l 2 F = Θ kl .

E. Proof of Theorem V.1
From the Cauchy-Schwartz inequality, we have As γ 2 corresponds to the Q-largest squared eigenvalue of P xx − I, the low-rank approximation error P xx − P xx 7 Matrices spanning the generalized eigenspaces of (R D , R xx ) corresponding to the Q largest or smallest GEVLs, are global maximizers and minimizers of (13). Matrices spanning the generalized eigengspaces corresponding to any other combination of GEVLs are saddle points.