Deep Graph Unfolding for Beamforming in MU-MIMO Interference Networks

We develop an efficient and near-optimal solution for beamforming in multi-user multiple-input-multiple-output single-hop wireless ad-hoc interference networks. Inspired by the weighted minimum mean squared error (WMMSE) method, a classical approach to solving this problem, and the principle of algorithm unfolding, we present unfolded WMMSE (UWMMSE) for MU-MIMO. This method learns a parameterized functional transformation of key WMMSE parameters using graph neural networks (GNNs), where the channel and interference components of a wireless network constitute the underlying graph. These GNNs are trained through gradient descent on a network utility metric using multiple instances of the beamforming problem. Comprehensive experimental analyses illustrate the superiority of UWMMSE over the classical WMMSE and state-of-the-art learning-based methods in terms of performance, generalizability, and robustness.


I. INTRODUCTION
Multi-user multi-input-multi-output (MU-MIMO) [2], [3] systems have become increasingly useful in the context of multi-antenna beamforming [4], [5] in both multi-cell [6] and ad-hoc [7] wireless network scenarios. They are especially beneficial for increasing spectral efficiency and improving effective network capacity to meet the high quality-of-service (QoS) requirements of modern wireless systems [8]. The task of multi-antenna beamforming is particularly challenging for wireless ad-hoc networks (WANETs) wherein the transceivers may operate under strict power constraints. For example, a mission-specific military deployment might use multiple handheld devices with limited battery life to constitute a tactical WANET. Moreover, these deployments can be in various topological, environmental, and weather conditions, giving rise to varying fading effects and path loss. Also, the devices in the network may suffer from interference with each other, posing further challenges towards maintaining the required QoS. Broadly then, the key task of beamforming in a MU-MIMO WANET involves managing the channel and interference conditions in the wireless network, to obtain beams that achieve a reasonably high value of a given QoS metric without violating the overall power constraint.

Research was sponsored by the Army Research Office and was accomplished under Cooperative Agreement Number W911NF-19-2-0269. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. Preliminary results were presented in [1]. A. Chowdhury and S. Segarra are with the Dept. of ECE, Rice University. G. Verma and A. Swami are with the US Army's DEVCOM Army Research Laboratory. Emails: {ac131, segarra}@rice.edu, {gunjan.verma, ananthram.swami}.civ@army.mil.
The beamforming task can be formalized as an optimization problem involving a network utility function (e.g., sum-rate, mean-rate, or harmonic-rate) as the objective with resource constraints, such as maximum available power, at each transceiver. Optimization problems of this form have been shown to be non-convex and NP-hard [6], [7], [9] and therefore lack closed-form solutions. In the absence of methods to generate exact solutions to the beamforming problem, multiple classical approaches have been proposed that try to obtain approximate solutions. For the most common case of sum-rate maximization, the following broad categories of methods have been commonly employed in the last few decades: Lagrangian dual decomposition [10], [11], successive convex approximation [12], interference pricing [13], and weighted minimum mean squared error (WMMSE) minimization [14].
WMMSE is a popular algorithm for beamforming in both multi-cell and ad-hoc wireless networks. It offers closed-form, iterative update rules to solve a surrogate optimization problem that has been shown to have the same local optima as the sum-rate optimization problem [14]. In spite of being an effective classical solution, it has several drawbacks. First, it is computationally complex on account of several matrix inversion and eigendecomposition operations in each iteration. Additionally, WMMSE has to be applied from scratch each time a new channel condition is presented, creating a significant lag in obtaining successive beamformers. Finally, the solution offered by WMMSE is near-optimal at best, as it can only generate a local optimum of the sum-rate optimization problem.
Deep learning methods have also been proposed to solve the challenging beamforming problem. More specifically, these neural models take a channel state information (CSI) tensor (a consolidated structure containing channel and interference measures between pairs of indexed nodes) as input and generate the corresponding beamformer for all the transmitters in the network. Several neural architectures of varying complexity, including multi-layer perceptrons (MLP) [15], [16], convolutional neural networks (CNN) [17], recurrent neural networks (RNN) [18], and even graph neural networks (GNN) [19], [20], have been applied to this end. A major advantage of these methods lies in their low feedforward computational complexity. Moreover, these methods generate the beamformers in parallel for all nodes in the network, making them applicable to large-scale networks. Nevertheless, these methods fail to generalize to channel conditions that were unseen at training. Further, they often fail to be competitive with WMMSE since they lack the domain-specific information encoded in the structured WMMSE updates.

arXiv:2304.00446v1 [eess.SP] 2 Apr 2023
To address these drawbacks, a class of hybrid algorithms [1], [21]-[24] that combine the classical update structure of WMMSE with the fast inference capabilities of neural models has been proposed. This is achieved through the learning paradigm of algorithm unfolding [25]-[28]. Algorithm unfolding decouples the update steps of an iterative algorithm to create a cascade of hybrid layers that preserve the original update structure but introduce one or more parameters learned from data. This form of domain-inspired learning has been extremely popular and effective in several application areas, including but not limited to non-negative matrix factorization [29], iterative soft thresholding [30], semantic image segmentation [31], blind deblurring [32], clutter suppression [33], particle filtering [34], symbol detection [35], link scheduling [36], energy-aware power allocation [37], and beamforming in wireless networks [21]-[23]. These algorithms use various neural layers to learn one or more parameters of the iterative algorithm being unfolded or to approximate certain computational steps in the algorithm to reduce complexity and speed up processing.
In this work, we propose an unfolding solution of WMMSE wherein we combine two separate classes of neural architectures with very specific advantages: a ReLU MLP that enforces a parametric functional transformation on a specific WMMSE variable (in a manner elaborated further in Section III-A, specifically in (8)), and a GNN that learns the parameters of the MLP by leveraging the underlying graph structure of the wireless network. While both MLP [15] and GNN [23] are common generic architectures, the specific form of unfolding proposed in this paper is entirely novel. The proposed scheme relies on the universal approximation property of MLPs to allow the unfolded variable to learn its own functional transformation, necessary for better convergence, without enforcing any prior structure on it. At the same time, the use of GNNs to learn the transformation parameters allows the model to incorporate the connectivity information embedded in the wireless network that cannot otherwise be extracted for arbitrary network topologies by a general-purpose MLP.

Contribution. The three main contributions of this work are as follows:
i) We propose a hybrid algorithm, namely UWMMSE, obtained by unfolding WMMSE, for beamforming in multi-user multi-input-multi-output wireless ad-hoc networks, where the parameters of a functional transformation are learned via a GNN, along with provisions for its distributed implementation. Further, we emphasize obtaining a fast and lightweight model through extensive parameter sharing to reduce model complexity.
ii) We present a theoretical analysis of the proposed model in terms of a necessary condition that the learnable transformation must satisfy to enable effective learning (see Theorem 1). Additionally, we establish permutation equivariance of the proposed model (see Proposition 1).
iii) We provide comprehensive experimental analyses to illustrate the empirical superiority of the proposed method. Firstly, we present a comparison with state-of-the-art connectionist methods. We then demonstrate the generalizability of the proposed model to unseen network sizes and its robustness to out-of-distribution inputs.

Notation. $[\mathcal{X}]_{ij\ldots}$, $[X]_{ij}$, and $[x]_i$ denote the entries of a multi-dimensional tensor $\mathcal{X}$, a matrix $X$, and a vector $x$. The generic subindex $:$ denotes a whole dimension, e.g., row $i$ of matrix $X$ is denoted as $[X]_{i:}$. $\mathbb{E}(\cdot)$ is the expectation operator while $(\cdot)^H$ represents conjugate transpose. The all-zeros and all-ones tensors are denoted by $\mathbf{0}$ and $\mathbf{1}$, respectively, where the dimensions are clear from context. The $z \times z$ identity matrix is represented by $I_z$. The diagonal matrix $\mathrm{diag}(X)$ stores the diagonal elements of $X$. The zero-mean complex-normal distribution is denoted by $\mathcal{CN}(0, \sigma^2 I)$.

II. SYSTEM MODEL AND PROBLEM FORMULATION
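Our system is a single-hop MU-MIMO WANET with $M$ distinct transmitter-receiver pairs. Each transmitter, having $T$ antennas, transmits independent data signals to a corresponding receiver equipped with $R$ antennas. Let $V_i \in \mathbb{C}^{T \times d}$ denote the beamformer that transmitter $i$ uses to transmit a signal $x_i \in \mathbb{C}^d$, where $\mathbb{E}[x_i x_i^H] = I_d$, to its assigned receiver $r(i)$. Assuming a linear channel model, the signal $y_i \in \mathbb{C}^R$ received at $r(i)$ is of the form

$$y_i = H_{ii} V_i x_i + \sum_{j \neq i} H_{ij} V_j x_j + n_i, \quad \text{for all } i, \tag{1}$$

where $n_i \in \mathbb{C}^R$ denotes independent additive white Gaussian noise sampled from $\mathcal{CN}(0, \sigma^2 I_R)$. Here, $H_{ii} \in \mathbb{C}^{R \times T}$ represents the communication channel between transmitter $i$ and its assigned receiver $r(i)$, while $H_{ij} \in \mathbb{C}^{R \times T}$ for all $j \neq i$ represents the interference between $r(i)$ and all other transmitters $j$. Finally, the transmitted signal is estimated at $r(i)$ using a receiver-beamformer $U_i \in \mathbb{C}^{R \times d}$, to obtain $\hat{x}_i = U_i^H y_i$ for all $i \in \{1, \ldots, M\}$. If we define the channel state information (CSI) tensor $\mathcal{H} \in \mathbb{C}^{M \times M \times R \times T}$ such that $[\mathcal{H}]_{ij::} = H_{ij}$ and the transmitter-beamformer tensor $\mathcal{V} \in \mathbb{C}^{M \times T \times d}$ such that $[\mathcal{V}]_{i::} = V_i$, then for every user $i$, assuming perfect knowledge of the CSI matrices, its achievable rate [14] is given by

$$c_i(\mathcal{H}, \mathcal{V}) = \log_2 \det\Big( I_R + H_{ii} V_i V_i^H H_{ii}^H \Big( \sigma^2 I_R + \sum_{j \neq i} H_{ij} V_j V_j^H H_{ij}^H \Big)^{-1} \Big). \tag{2}$$

Our objective is to determine the $\mathcal{V}$ that maximizes the sum-rate of the whole network,

$$\max_{\mathcal{V}} \ \sum_{i=1}^{M} \alpha_i \, c_i(\mathcal{H}, \mathcal{V}) \quad \text{s.t.} \quad \mathrm{Trace}(V_i V_i^H) \leq P_{\max}, \quad \text{for all } i, \tag{3}$$

where $\alpha_i \in \mathbb{R}$ represents the priority of the transceiver pair $i$ and the maximum power available uniformly to each transmitter is denoted by $P_{\max} \in \mathbb{R}$. Henceforth, for simplicity, we focus on the case where every user is given the same $\alpha_i = 1$ in the objective.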
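As a concrete reading of the system model, the following NumPy sketch evaluates the network sum-rate for a given CSI tensor and set of transmit beamformers. It assumes the standard log-det rate expression of [14], unit priorities $\alpha_i = 1$, and rates measured in bits (base-2 logarithm). The function name `sum_rate` and the array layout are illustrative choices, not taken from the authors' implementation.

```python
import numpy as np

def sum_rate(H, V, sigma2):
    """Sum-rate (in bits) of an M-pair MIMO interference network.

    H: (M, M, R, T) complex array; H[i, j] links transmitter j to receiver r(i).
    V: (M, T, d) complex array of transmit beamformers.
    sigma2: noise variance (assumed identical at every receiver).
    """
    M, _, R, _ = H.shape
    total = 0.0
    for i in range(M):
        # Interference-plus-noise covariance seen at receiver r(i).
        J = sigma2 * np.eye(R, dtype=complex)
        for j in range(M):
            if j != i:
                G = H[i, j] @ V[j]
                J += G @ G.conj().T
        S = H[i, i] @ V[i]                      # desired-signal path
        A = np.eye(R) + S @ S.conj().T @ np.linalg.inv(J)
        # slogdet returns (sign, log|det|); it avoids overflow for large R.
        _, logabsdet = np.linalg.slogdet(A)
        total += logabsdet / np.log(2.0)
    return float(total)
```

For a single SISO link with unit channel, unit beamformer, and unit noise variance, this reduces to $\log_2(1 + 1) = 1$ bit, which is a convenient sanity check.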
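The optimization problem in (3) is non-convex and NP-hard [7], [38]. A standard approach to solving this problem is to reformulate it as a constrained weighted-minimum-mean-square-error (WMMSE) optimization [14]. Specifically, introducing the receiver-weight tensor $\hat{\mathcal{W}} \in \mathbb{C}^{M \times d \times d}$ and receiver-beamformer tensor $\mathcal{U} \in \mathbb{C}^{M \times R \times d}$, the problem can be defined as

$$\min_{\hat{\mathcal{W}}, \mathcal{U}, \mathcal{V}} \ \sum_{i=1}^{M} \Big( \mathrm{Trace}(\hat{W}_i E_i) - \log \det(\hat{W}_i) \Big) \quad \text{s.t.} \quad \mathrm{Trace}(V_i V_i^H) \leq P_{\max}, \quad \text{for all } i, \tag{4}$$

where $\hat{W}_i = [\hat{\mathcal{W}}]_{i::} \succeq 0$ is a weight matrix at receiver $r(i)$ while $E_i \in \mathbb{C}^{d \times d}$ is the mean squared error between transmitted and received signals [14].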
The optimization problem (4) is equivalent to (3), as shown in [14, Thm. 3]: the variable $\mathcal{V}^*$ in the global optimal solution $\{\hat{\mathcal{W}}^*, \mathcal{U}^*, \mathcal{V}^*\}$ of the former is the same as the optimal transmitter-beamformer $\mathcal{V}^*$ in the latter. Moreover, the problem in (4) is tri-convex, i.e., fixing any two variables renders the objective function convex in the third variable. This makes (4) amenable to a block-coordinate-descent (BCD) based solution.
In spite of the tri-convexity property that results in a tractable closed-form solution, WMMSE performance is limited by its cumbersome iterations that are composed of expensive computational steps like matrix inversion, eigendecomposition, and bisection search (per-iteration complexity of WMMSE scales as $O(M^2)$, where $M$ is the number of users) [24]. Naturally, WMMSE tends to be time-consuming depending on the size and complexity of the wireless network, making it particularly ineffective for fast-changing channels. Moreover, while WMMSE can achieve a near-optimal solution, it can only solve a single instance of (3) for a given $\mathcal{H}$. In case there are multiple CSI tensors, say $\{\mathcal{H}_i\}_{i=1}^{n}$, to be processed (e.g., in a scenario wherein multiple wireless sub-networks are being optimized by a centralized optimizing agent), WMMSE has to be repeated from scratch independently for each of the $n$ instances.
From a practical standpoint, it is desirable to have a mechanism for fast, efficient, and interpretable processing of a set of independent CSI tensors. We propose to achieve this through a GNN-based unfolded algorithm. More specifically, we leverage the near-optimal solution provided by the iterations of the classical WMMSE method and enhance it with the expressivity and computational efficiency of trained graph neural models.

III. UNFOLDED-WMMSE FOR MU-MIMO
Since WMMSE is composed of computationally expensive iterations, we reduce the computations while preserving the update structure by truncating the number of iterations and then compensating for the reduced iterations using data-driven neural modules.

A. Designing the unfolded architecture
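We define a $K$-layered parametric function $\Lambda(\cdot; \Theta): \mathbb{C}^{M \times M \times R \times T} \to \mathbb{C}^{M \times T \times d}$, where $\Theta$ is a set of trainable parameters and $\mathcal{V}^{(K)} = \Lambda(\mathcal{H}; \Theta)$ approximates the solution to (3) for a given CSI tensor $\mathcal{H}$. The layers in $\Lambda$ are hybrid structures designed using the WMMSE updates [14], augmented by a learnable transformation $\Phi$ to accelerate convergence. More specifically, by setting the initial beamformers $V_i^{(0)}$ for all $i$ such that $\mathrm{Trace}(V_i^{(0)} V_i^{(0)H}) \leq P_{\max}$, we have for every layer $k = 1, \ldots, K$,

$$U_i^{(k)} = \Big( \sum_{j} H_{ij} V_j^{(k-1)} V_j^{(k-1)H} H_{ij}^H + \sigma^2 I_R \Big)^{-1} H_{ii} V_i^{(k-1)}, \tag{5}$$

$$\hat{W}_i^{(k)} = \Big( I_d - U_i^{(k)H} H_{ii} V_i^{(k-1)} \Big)^{-1}, \tag{6}$$

$$\xi^{(k)} = \Psi(S, Q; \theta), \tag{7}$$

$$W_i^{(k)} = \hat{W}_i^{(k)} + \Phi_{\xi_i^{(k)}}\big(\hat{W}_i^{(k)}\big), \tag{8}$$

$$\bar{V}_i^{(k)} = \Big( \sum_{j} H_{ji}^H U_j^{(k)} W_j^{(k)} U_j^{(k)H} H_{ji} + \mu I_T \Big)^{-1} H_{ii}^H U_i^{(k)} W_i^{(k)}, \tag{9}$$

where updates (5)-(9) are computed in parallel for every user $i$.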
Here, $\Psi$ in (7) is a complex-valued GNN (CV-GNN) architecture with a set of trainable parameters $\theta$. The matrix $S \in \mathbb{C}^{M \times M}$ is obtained through a learnable transformation applied to the CSI tensor, enabling its use within the CV-GNN $\Psi$. This tensor transformation is described in more detail in (11). The trainable parameter $\mu \in \mathbb{C}$ resembles the Lagrange multiplier for the power constraint in the original WMMSE formulation. The output of each hybrid layer of our architecture is then given by $\bar{\mathcal{V}}^{(k)}$, such that $[\bar{\mathcal{V}}^{(k)}]_{i::} = \bar{V}_i^{(k)}$ for all $i$. However, $\bar{\mathcal{V}}^{(k)}$ is the raw transmitter-beamformer that does not necessarily obey the power constraint.
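To enforce the power constraint on the output of the feedforward architecture, we introduce a non-linear activation function $\beta(\cdot)$ in the hybrid layers of $\Lambda(\cdot; \Theta)$ which saturates the model output beyond the permissible values. For the multi-antenna setup that we consider here, this involves constraining $\mathcal{V}^{(k)}$ identically in each layer $k$ such that all its elements $V_i^{(k)}$ satisfy $\mathrm{Trace}(V_i^{(k)} V_i^{(k)H}) \leq P_{\max}$. To attain this, the activation $\beta$ in each layer $k$ for all $i$ is defined as

$$V_i^{(k)} = \beta\big(\bar{V}_i^{(k)}\big) = \begin{cases} \bar{V}_i^{(k)}, & \text{if } \|\bar{V}_i^{(k)}\|_F^2 \leq P_{\max}, \\ \sqrt{P_{\max}} \, \bar{V}_i^{(k)} / \|\bar{V}_i^{(k)}\|_F, & \text{otherwise}, \end{cases} \tag{10}$$

where $\|\cdot\|_F$ denotes the Frobenius norm. We note that a non-linear mapping of this form was used in the PGD-based beamforming strategy of [22] as the projection step.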
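The saturating activation can be sketched in a few lines of NumPy. This is a minimal illustration that assumes $\beta$ is the usual projection onto the Frobenius-norm ball of radius $\sqrt{P_{\max}}$, as suggested by the reference to the projection step of [22]; the names `beta` and `p_max` are hypothetical.

```python
import numpy as np

def beta(V_bar, p_max):
    """Return V_bar unchanged if it meets the power budget, else rescale it.

    V_bar: (T, d) complex raw beamformer of one transmitter.
    p_max: maximum transmit power, i.e. the bound on Trace(V V^H).
    """
    # For a 2-D array np.linalg.norm defaults to the Frobenius norm,
    # whose square equals Trace(V V^H).
    power = np.linalg.norm(V_bar) ** 2
    if power <= p_max:
        return V_bar                      # linear region: identity map
    return np.sqrt(p_max / power) * V_bar  # saturated region: rescale to the boundary
```

Because the rescaling is a positive real factor, the direction (and hence the spatial pattern) of the beamformer is preserved and only its power is reduced.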
A schematic view of the variable dependence of the proposed UWMMSE is given in Fig. 1.
Fig. 1. Updates are shown for an arbitrary node $i$ and are computed in parallel for all $i$. The yellow blocks represent the five equations in (5)-(9) plus (10) and (11). The input to the layered structure is $V_i^{(0)}$ and the output transmitter-beamformer is given by $V_i^{(K)}$ for all $i$. Dependence on $\sigma$ and $\{U_j, W_j, V_j\}$ for all $j \neq i$ is implicit.

It is essential to note here that if we ignore (7) and set $\Phi_{\xi^{(k)}}(\cdot) = 0_{d \times d}$ for every layer $k$, then (5)-(10) boil down to the classical BCD closed-form updates of WMMSE. However, by providing UWMMSE with the additional flexibility of learning a set of representations $\xi^{(k)}$ in (7) for each node, which are implemented as parameters of a CV-MLP in (8) corresponding to the node, we enable faster convergence and better performance compared with the classical WMMSE, as illustrated in Section IV.
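To make the layer structure concrete, here is a hedged NumPy sketch of one hybrid layer. It assumes the standard closed-form WMMSE updates of [14] for the receiver beamformer, the receiver weight, and the transmitter beamformer, a fixed scalar `mu` in place of the Lagrange multiplier, and an additive learnable correction on the weight (so that a zero correction recovers the classical update, as stated above). The function `unfolded_layer` and the callable `phi` are illustrative names, not the authors' code.

```python
import numpy as np

def unfolded_layer(H, V, sigma2, mu, p_max, phi=None):
    """One hybrid layer: U update, W update (+ optional correction), V update.

    H: (M, M, R, T) CSI tensor, H[i, j] links transmitter j to receiver r(i).
    V: (M, T, d) beamformers from the previous layer.
    phi: optional callable (i, W_hat) -> (d, d) correction; None means zero.
    """
    M, _, R, T = H.shape
    d = V.shape[2]
    U = np.empty((M, R, d), dtype=complex)
    W = np.empty((M, d, d), dtype=complex)
    for i in range(M):
        # Covariance of everything received at r(i), signal included.
        C = sigma2 * np.eye(R, dtype=complex)
        for j in range(M):
            G = H[i, j] @ V[j]
            C += G @ G.conj().T
        U[i] = np.linalg.solve(C, H[i, i] @ V[i])
        W_hat = np.linalg.inv(np.eye(d) - U[i].conj().T @ H[i, i] @ V[i])
        W[i] = W_hat if phi is None else W_hat + phi(i, W_hat)
    V_new = np.empty((M, T, d), dtype=complex)
    for i in range(M):
        # mu > 0 keeps this matrix invertible even when T exceeds its rank.
        B = mu * np.eye(T, dtype=complex)
        for j in range(M):
            G = H[j, i].conj().T @ U[j]
            B += G @ W[j] @ G.conj().T
        V_bar = np.linalg.solve(B, H[i, i].conj().T @ U[i] @ W[i])
        power = np.linalg.norm(V_bar) ** 2
        # Saturating activation: rescale only if the power budget is exceeded.
        V_new[i] = V_bar if power <= p_max else np.sqrt(p_max / power) * V_bar
    return U, W, V_new
```

Stacking `K` calls of this function, each fed with the beamformers returned by the previous one, gives the truncated-iteration structure described in this section; the learned part enters only through `phi`.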
In building (5)-(9), one of the primary design considerations is the choice of WMMSE variables that are to be learned. We choose to preserve the update structures of $\mathcal{U}$ and $\mathcal{V}$ since these are tightly related to the underlying communication dynamics of the wireless network. Indeed, these two update equations explicitly quantify the effects of interference on transmitters and receivers. On the other hand, $\hat{\mathcal{W}}$ is representative of the quality of the channel connecting a transmitter and its corresponding receiver and plays a key role in driving $\mathcal{V}$ and $\mathcal{U}$ to their near-optimal values. Our hypothesis is that if $\hat{\mathcal{W}}$ can be accelerated towards convergence through data-driven optimization, then it will lead $\mathcal{U}$ and $\mathcal{V}$ to faster convergence without affecting the dynamics of the wireless network. Having thus finalized the variable to be augmented by learning, the next design consideration is the structure of the learnable transformation $\Phi$. To that end, we propose the use of a complex-valued multi-layer perceptron (CV-MLP) with a single hidden layer as $\Phi: \mathbb{C}^{d \times d} \to \mathbb{C}^{d \times d}$. On account of its universal approximation property (UAP), an MLP is capable of modelling any continuous and bounded function of arbitrary complexity [39]. Therefore, without imposing any additional inductive bias on the structure of the transformation, the proposed method provides the necessary capabilities for $\mathcal{W}$ to follow an improved update trajectory compared to that taken by $\hat{\mathcal{W}}$ alone.
In any given layer $k$, parameters $\xi_i^{(k)}$, defined on node $i$, are learned using a CV-GNN $\Psi$. In addition to the complex-valued parameters $\xi$, $\Phi$ also uses Cartesian non-linearities [40] on both hidden and output layers, which are capable of handling complex-valued outputs. More specifically, it uses the ReLU family of activations [41] that are applied independently on the real and imaginary components of the layer-wise outputs of $\Phi$, thereby transforming both magnitude and phase. In essence, we propose a learnable transformation in each unfolded layer $k$ at two levels. Firstly, we leverage the UAP of the CV-MLP $\Phi$ to frame a general functional transformation for the receiver weights, and secondly, the node-specific parameters $\xi_i^{(k)}$ of the transformation $\Phi_{\xi_i^{(k)}}$ are learned using a CV-GNN $\Psi$ as representations for all nodes $i$. While the standard practice is to learn the parameters of a CV-MLP directly, we take the aforementioned route because a standalone CV-MLP cannot generalize to arbitrary connectivity patterns in a wireless network graph. For instance, the learnable model must be capable of generating node representations by leveraging the local connectivity structure of the wireless network embedded in $S$. A generic GNN architecture, through a sequence of aggregation and transformation operations [42], [43], is able to sufficiently capture this structure. Moreover, GNNs are typically permutation equivariant [42], thus offering better generalization performance against variations in node ordering. The choice of the specific GNN architecture and the size of the trainable parameters are, however, arbitrary and can be made depending on the nature of the problem. In this case, to ensure computational simplicity [44], we choose a complex-valued architecture inspired by a graph convolutional network [42] (CV-GCN) along with Cartesian non-linearities [40] on $\Psi(\cdot; \theta)$, as described in (12)-(13).
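The two ingredients named above, a Cartesian ReLU and a single-hidden-layer CV-MLP whose parameters arrive as a node representation, can be sketched as follows for the scalar case $d = 1$. The plain ReLU (rather than another member of the ReLU family), the hidden width, and the way `xi` is unpacked into weights and biases are all assumptions made for illustration.

```python
import numpy as np

def cartesian_relu(z):
    """Apply ReLU separately to the real and imaginary parts of z."""
    z = np.asarray(z, dtype=complex)
    return np.maximum(z.real, 0.0) + 1j * np.maximum(z.imag, 0.0)

def cv_mlp(w_hat, xi):
    """Single-hidden-layer complex MLP acting on a scalar weight (d = 1).

    xi is a complex vector of length 3G + 1 holding, in order, the hidden
    weights (G), hidden biases (G), output weights (G) and output bias (1).
    This packing is a hypothetical choice, not taken from the paper.
    """
    xi = np.asarray(xi, dtype=complex)
    G = (xi.size - 1) // 3
    w1, b1, w2, b2 = xi[:G], xi[G:2 * G], xi[2 * G:3 * G], xi[3 * G]
    hidden = cartesian_relu(w1 * w_hat + b1)           # hidden layer, width G
    return cartesian_relu(np.dot(w2, hidden) + b2)     # scalar complex output
```

Because the non-linearity clips the real and imaginary parts independently, it changes both the magnitude and the phase of its input, which is the behavior the text attributes to Cartesian activations.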
Independent of the choice of architecture $\Psi$, we treat the CSI tensor as a weighted adjacency structure of a directed graph which is used to aggregate information from the neighboring nodes [45]. In the multi-antenna setting that we consider here, the link (either the channel or interference) between any transmitter $i$ and receiver $r(j)$ is described by an $R \times T$ matrix that depends on the number of transmitter and receiver antennas. However, since CV-GCN $\Psi$ requires the channel between $i$ and $r(j)$ to be represented by a scalar coefficient [42], we propose the use of a single-layered $1 \times 1$ depth-wise convolution [46] operation with shared filter parameters to transform $\mathcal{H}$ to an amenable structure. Essentially, we define an additional fully connected neural layer $\Gamma(\mathcal{H}; \omega): \mathbb{C}^{M \times M \times R \times T} \to \mathbb{C}^{M \times M}$ such that $S = \Gamma(\mathcal{H}; \omega)$ (11), which forms the input¹ to $\Psi(\cdot; \theta)$ in (7). Indeed, this operation can be interpreted as a learnable weighted combination of the $RT$ antenna coefficients for each channel matrix $H_{ij}$.

In addition to capturing the local connectivity structures in graphs, GNNs are also well suited to handle features or signals supported on the nodes of a graph [42], [47]. More precisely, setting the aggregation matrix $S \in \mathbb{C}^{M \times M}$ and defining a features matrix $Q \in \mathbb{C}^{M \times F}$, the GNN in (7) takes the form $\Psi(S, Q; \theta)$, where row $i$ of $Q$ is a concatenation of the current iterates for $U_i$ and $V_i$. Specifically, we consider the case where $d = 1$ in the rest of the paper, resulting in $F = R + T$. However, note that the model can be extended to the case where $d > 1$ by a simple pooling transformation on the last dimensions of $\mathcal{U}$ and $\mathcal{V}$. Such a formulation explicitly couples the CV-GNN with the current state of the WMMSE variables. This is essential to ensure that the CV-GNN output has a functional dependence on the optimization trajectory across layers. Additionally, certain QoS metrics like traffic rates, user priority, queue lengths, etc. can also constitute relevant node features in this case. While the incorporation of the aforementioned or other node features in our model is a straightforward task, the detailed analyses, theoretical and experimental, of their effects on model performance are beyond the scope of this work.
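The tensor-to-matrix transformation described above can be illustrated with a single `einsum`. The sketch below assumes a bias-free linear layer whose $R \times T$ complex filter `omega` is shared by all $M^2$ links; the name `gamma` is hypothetical and the optional row-normalization mentioned in the footnote is deliberately left out because its exact form is not specified here.

```python
import numpy as np

def gamma(H, omega):
    """Collapse each R x T channel matrix into one complex coefficient.

    H: (M, M, R, T) CSI tensor.
    omega: (R, T) complex filter shared by every transmitter-receiver link.
    Returns S of shape (M, M) with S[i, j] = sum_{r,t} omega[r, t] * H[i, j, r, t].
    """
    return np.einsum("ijrt,rt->ij", H, omega)
```

In the SISO case ($R = T = 1$) with `omega` equal to one, `gamma` simply returns the channel matrix itself, so no extra parameters are needed there.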
Thus, having described the aggregator matrix $S$ and the feature matrix $Q$, we now present the exact architecture of $\Psi(\cdot; \theta)$ in the proposed model, given in (12)-(13), where $\theta = \{\theta_{11}, \theta_{12}, \theta_{21}, \theta_{22}\}$, and both $\alpha_1$ and $\alpha_2$ are Cartesian ReLU activation functions. Note here that we have an additional set of weights for the diagonal elements in the formulation of $\Psi(\cdot; \theta)$. This is essentially to emphasize the importance of the transmission channel elements with respect to (w.r.t.) the off-diagonal interference elements in the learnable model. Finally, the trainable parameter $\mu$, shared by all nodes, occupies the place of the Lagrange multiplier in the original WMMSE formulation. Its primary purpose is to incorporate the node-wise power constraint. Intuitively, a larger $\mu$ in (9) would result in a $\bar{V}$ such that $\mathrm{Trace}(\bar{V} \bar{V}^H)$ is small even when the interference component is small. This ensures that $\bar{V}$ is less likely to deviate far from the power constraint, even before it is enforced explicitly in (10). This allows the model to operate more frequently in the linear region of the activation $\beta$, leading to better numerically conditioned gradient propagation across layers. The parameter $\mu$ is trained directly using the gradient feedback from (14).

¹ In practice, we observed that a row-normalization operation on $S$ prior to inputting it in (7) had the effect of stabilizing training and thereby improving the overall performance. See footnote 2 for access to implementation code.
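Returning to the architecture of $\Psi$, one plausible reading of a two-layer CV-GCN with a separate set of weights for the diagonal (direct-channel) entries is sketched below. The exact layer equations are an assumption: the split of $S$ into its diagonal and off-diagonal parts, and the pairing of $\theta_{11}, \theta_{21}$ with the diagonal and $\theta_{12}, \theta_{22}$ with the interference terms, are illustrative choices rather than the paper's definition.

```python
import numpy as np

def cartesian_relu(z):
    """ReLU applied separately to real and imaginary parts."""
    return np.maximum(z.real, 0.0) + 1j * np.maximum(z.imag, 0.0)

def cv_gcn(S, Q, theta11, theta12, theta21, theta22):
    """Two-layer complex graph convolution with distinct diagonal weights.

    S: (M, M) complex aggregation matrix.
    Q: (M, F) complex node features.
    theta11, theta12: (F, F1) first-layer weights (diagonal / off-diagonal).
    theta21, theta22: (F1, P) second-layer weights (diagonal / off-diagonal).
    Returns an (M, P) array whose row i is the representation of node i.
    """
    D = np.diag(np.diag(S))   # direct-channel part of S
    O = S - D                 # interference part of S
    Z = cartesian_relu(D @ Q @ theta11 + O @ Q @ theta12)
    return cartesian_relu(D @ Z @ theta21 + O @ Z @ theta22)
```

Since none of the weight shapes depend on $M$, the same parameters can be applied to graphs with a different number of nodes, which is the property that later allows deployment on networks of varying size.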

B. Permutation Equivariance of UWMMSE
Permutation equivariance is a property by virtue of which the performance of a GNN model remains consistent w.r.t. variations in node identities. It is of particular importance for dynamic WANETs wherein nodes may move in and out of the network or even operate with varying topologies. Moreover, the channel sensors may convey their estimates in different orders, leading to a rearrangement of the rows and columns of $S$. In all these cases, the GNN must be equipped to consistently maintain the quality of its predictions across the variations. Since the fundamental learnable module $\Psi(S, Q; \theta)$ in our proposed model $\Lambda(\mathcal{H}; \Theta)$, where $\Theta = \{\theta, \omega, \mu\}$ such that $\omega$ and $\mu$ are independent of node ordering, is permutation equivariant [42], it is essential to establish that this key property is inherited by the overall UWMMSE architecture.
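Similar to [24], [48], we now formalize the definition of permutation equivariance. Let us consider a generic permutation matrix $\Pi \in \{0, 1\}^{M \times M}$. Further, let $\mathcal{F}$ denote the set of all functions $f$ that map a matrix $S \in \mathbb{C}^{M \times M}$ to an output indexed by the $M$ nodes. A function $f \in \mathcal{F}$ is permutation equivariant if $f(\Pi S \Pi^\top) = \Pi f(S)$ for all permutations $\Pi$ and all matrices $S$.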
Essentially, permuting the node indices of a graph that is input to a permutation-equivariant function $f$ applies the same permutation to the indices of the corresponding node outputs without altering their values. Next, we formally establish the permutation equivariance of the proposed UWMMSE model conditioned on the specific $\Psi$ in (7).
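The equivariance property can be checked numerically on a generic graph-convolution step. The sketch below is not the UWMMSE model itself: it uses a single aggregation layer with a Cartesian ReLU, random complex inputs, and a random permutation, and verifies that relabeling the nodes before the layer gives the same result as relabeling its output. All names are illustrative.

```python
import numpy as np

def aggregate(S, Q, theta):
    """One graph-convolution step: mix features through S, transform, activate."""
    Y = S @ Q @ theta
    return np.maximum(Y.real, 0.0) + 1j * np.maximum(Y.imag, 0.0)

def is_equivariant(M=5, F=3, seed=0):
    """Check aggregate(Pi S Pi^T, Pi Q) == Pi aggregate(S, Q) on random data."""
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
    Q = rng.standard_normal((M, F)) + 1j * rng.standard_normal((M, F))
    theta = rng.standard_normal((F, F)) + 1j * rng.standard_normal((F, F))
    Pi = np.eye(M)[rng.permutation(M)]        # random permutation matrix
    lhs = aggregate(Pi @ S @ Pi.T, Pi @ Q, theta)
    rhs = Pi @ aggregate(S, Q, theta)
    return bool(np.allclose(lhs, rhs))
```

The check succeeds because $\Pi^\top \Pi = I$ cancels inside the product and an element-wise activation commutes with any reordering of rows.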
Proof: The proof is relegated to Appendix A.

C. Training Process
Having explained the inner workings of the proposed UWMMSE for a given set of learnable parameters $\Theta = \{\theta, \omega, \mu\}$, we now shift focus to the training of the architecture. Given a fixed $\Theta$, the model $\Lambda(\mathcal{H}; \Theta)$ generates the transmitter-beamformer corresponding to CSI $\mathcal{H}$, which is used to obtain a network sum-rate utility given as $\sum_{i=1}^{M} c_i(\mathcal{H}, \Lambda(\mathcal{H}; \Theta))$, where $c_i$ is the achievable rate of user $i$. Thus, the loss can be defined as

$$\ell(\Theta) = -\mathbb{E}_{\mathcal{H} \sim \mathcal{D}} \Big[ \sum_{i=1}^{M} c_i\big(\mathcal{H}, \Lambda(\mathcal{H}; \Theta)\big) \Big], \tag{14}$$

where $\mathcal{D}$ is the channel state distribution of interest. Even if $\mathcal{D}$ is known, minimizing $\ell(\Theta)$ w.r.t. $\Theta = \{\theta, \omega, \mu\}$ is a non-convex problem. However, notice that $\Psi(\cdot; \theta)$ in (7) and $\Gamma(\cdot; \omega)$ in (11) are differentiable w.r.t. $\theta$ and $\omega$, respectively. Thus, given a set of $\mathcal{H}$ drawn from $\mathcal{D}$, we employ stochastic gradient descent to minimize (14). Therefore, UWMMSE is essentially an unsupervised learning algorithm which only needs samples of the CSI $\mathcal{H}$ but does not need the true-optimal transmitter-beamformers (labels) associated with those channels, which can be tremendously expensive to generate. While another possibility is to use the WMMSE output as the training labels, this effectively limits the learning capacity of the proposed UWMMSE by the near-optimality of the WMMSE output.
Remark 1 (Application to SISO systems) While our method is designed for beamforming in the more general MIMO setting, it can be seamlessly employed for power allocation in SISO wireless networks. In the SISO setting, however, there would be no need for the neural network in (11) since, when $R = T = 1$, the CSI matrix can directly be used as the generalized adjacency matrix in the GNN $\Psi$, thus reducing the trainable parameters to only $\Theta = \{\theta, \mu\}$. Similarly, intermediate variables are reduced to scalars, but their update equations in (5)-(9) remain valid and naturally present a lower computational load. In this sense, for the SISO case, our proposed UWMMSE resembles the unfolding scheme presented in [24]. However, in [24], the embedded transformation $\Phi$ on $W$ in each hybrid layer has a fixed affine structure that lacks the UAP of the CV-MLP. Further, the authors in [24] consider only real-valued channel realizations which are constructed using the magnitude of the complex-valued channels. Thus, their method completely ignores the phase information embedded in the channel coefficients, thereby oversimplifying the problem. Also, each layer $k$ of the model presented in [24] has its own independent set of GNNs $\Psi(\cdot; \theta^{(k)})$, resulting in growing complexity of their proposed solution with an increasing number of layers. More importantly, a model of this form requires the number of layers to be fixed at both training and inference, making it inflexible at deployment. In this respect, our more general UWMMSE framework for complex-valued MIMO still presents advantages w.r.t. existing works even if we focus on the SISO setting.

D. UWMMSE convergence: A necessary condition
We theoretically establish the behavior of the UWMMSE architecture with an arbitrarily large number of layers.
Theorem 1 Consider a UWMMSE architecture (5)-(11) of infinite depth, where $\Phi_{\xi_i^{(k)}}(\cdot)$ are continuous functions for all $i, k$, being used for beamforming to transmit a signal $x_i \in \mathbb{C}^1$, i.e., $d = 1$. Consider the extreme low-noise regime so that $\sigma = 0$ and set $\mu = 0$. Denote by $*$ the optimal solutions to problem (4), and let $W \in \mathbb{C}^{1 \times 1}$ be represented as the scalar $w \in \mathbb{C}$.
Proof: The proof is relegated to Appendix B.
Theorem 1 states that if UWMMSE learns the optimal transmitter beamformer $V_i^*$, where $V_i^{*H} V_i^* = P_i^*$, uniformly at deeper layers, then the learned transformation $\Phi_{\xi_i}(\cdot)$ must satisfy (15) asymptotically for all $i$. Notice that certain choices of $\Phi_{\xi_i^{(k)}}(w_i^{(k)})$, parameterized by a constant $\delta$, satisfy the requirement in (15). This is intuitively pleasing, since this limiting behavior of $\Phi_{\xi_i^{(k)}}(w_i^{(k)})$ implies that deeper layers of UWMMSE would resemble the classical iteration of WMMSE, for which we know that the optimal beamformer is a fixed point [14]. In other words, the proposed learnable module is sufficiently expressive to modify the updates for the first few layers to accelerate convergence while recovering the optimal asymptotic guarantees of the classical WMMSE algorithm at deeper unfolded layers. Finally, note that Theorem 1 is independent of the choice of the specific CV-GNN $\Psi(\cdot; \theta)$ in (7). Therefore, we can safely claim that the aforementioned result is an attribute of the hybrid model in (5)-(10), and is independent of the specific CV-GNN architecture chosen to learn the parameters $\xi$.

E. Parameter sharing for computational efficiency
The CV-GNN $\Psi(\cdot; \theta)$ is shared by all unfolded layers, i.e., $\theta$ does not depend on the layer index $k$ in (7). Such a scheme ensures that $\theta$ is trained using gradient feedback that accumulates across layers and depends on the overall optimization trajectory. Moreover, a formulation of this form allows for flexibility in adding or removing unfolded layers at deployment (after training has been completed). Another immediate advantage is an $O(K)$ reduction in the number of trainable parameters with respect to the layer-dependent alternative, making the training process less computationally expensive and time-consuming. Further, the trainable parameters $\omega$ of the tensor transformation $\Gamma(\cdot; \omega)$ in (11) are identical for all channel elements. This is appropriate as all channel representations must have an identical functional mapping from their respective antenna coefficients, in a way that is analogous to shared $1 \times 1$ convolutions of image pixels. Additionally, having a shared filter kernel allows for an $O(M^2)$ reduction in the number of trainable parameters. Finally, note that $\mu$ is also tied across layers, further reducing the number of trainable parameters.

F. Complexity analysis, scalability and distributed implementation
The per-iteration computational complexity of WMMSE [14] for an $M$-user interference network with $R$ receiver antennas and $T$ transmitter antennas can be written as $O(M^2 [\max\{R, T\}]^3)$. Each unfolded layer in UWMMSE inherits this complexity directly, as it performs the same updates as WMMSE. Moreover, the trained CV-GCN in each unfolded layer incurs a feedforward complexity of $O(M^2 F)$, where the hidden layer size is $F$ [42]. Finally, each trained single-hidden-layered CV-MLP has a feedforward complexity of $O(MG)$, where the hidden layer size is $G$. Hence, the total complexity of the trained UWMMSE is given as

$$O\Big( K \big( M^2 [\max\{R, T\}]^3 + M^2 F + M G \big) \Big). \tag{16}$$

Clearly, for a fixed set of antenna sizes $\{R, T\}$ and pre-designed hidden dimensions $F$ and $G$, the per-layer complexity of UWMMSE varies as $O(M^2)$ w.r.t. network size $M$. This is the same as the per-iteration complexity of WMMSE with fixed antenna size. Consequently, by truncating the number of unfolded layers $K$ in UWMMSE as compared to the number of iterations in WMMSE, the inference time is significantly reduced. This will be empirically validated in Section IV-C.
In addition to the computational complexity, it is also important to note the size of the model given by the number of its trainable parameters. The proposed architecture has very few trainable parameters, making it easy to train and likely to generalize, as illustrated in Section IV. The number of parameters $\theta$ in the CV-GCN is $O(F[F + G])$, where $F$ is the input size. Further, the linear layer $\Gamma$ has $O(RT)$ trainable parameters and $\mu$ has a size of just 1. Thus, the total number of trainable parameters of UWMMSE is $O(RT + F[F + G])$, and is independent of the number of users $M$. As a result, the same model can be employed to process wireless networks of varying size, under the assumption that the underlying channel model is identical.
While it is necessary to train the proposed UWMMSE in a centralized manner, under the assumption that the centralized trainer has access to the full CSI tensor, the trained UWMMSE can support a distributed deployment with only local information available at each node. This is mainly possible given the fact that power allocation at a given node i depends only on the row slice $H_{r(i):::}$ and the column slice $H_{:i::}$ of the CSI tensor. Nevertheless, to achieve a fully distributed deployment, three vital assumptions are necessary, two of which are inherited directly from the distributed version of WMMSE [14]. Firstly, for all receivers r(j) where j = 1, ..., M, the local channel state estimates $H_{ji}$ should be available to each transmitter i. Secondly, a mechanism is required to facilitate information feedback from receivers to all transmitters. These assumptions essentially enable the transmitters to compute $V_i^{(k)}$ after receiving the corresponding $U_i^{(k)}$ and $W_i^{(k)}$ from each receiver r(i) in all unfolded layers k. Finally, a copy of the full set of trained parameters Θ = {θ, ω, µ} must be available to each node. Note that, while we can achieve distributed deployment in this manner, the feedback links will add to the communication overhead and therefore the inference time would be higher than that of the centralized version.

IV. NUMERICAL EXPERIMENTS
In this section, we present comprehensive numerical experiments to demonstrate the performance of the proposed UWMMSE model in allocating power to complex-valued MU-MIMO WANETs operating under various fading conditions and topologies.² A detailed description of the datasets is provided in Section IV-A, while the model architecture, hyperparameters, and system setup are presented in Section IV-B. In Section IV-C, we compare our model performance with WMMSE and its truncated version, in terms of achieved sum-rate and allocation time. In Section IV-D and Section IV-E, we evaluate the generalization performance of our model across different operating conditions in training and inference. Further, in Section IV-F, we present an illustration of the convergence behavior of our model. Finally, in Section IV-G, we investigate the robustness of our proposed model against norm-bounded distortions in the input CSI tensor.

A. Datasets
We use randomly generated geometric channel realizations to evaluate the model performance. A geometric channel model has a composite structure with path loss and fading components. To simulate this, we construct a 2-D geometric graph with M randomly sampled transceiver pairs. All transmitters and receivers are dropped uniformly at random within a square region. The path gain between transmitter i and receiver r(j) is then computed as an inverse function of their corresponding physical distance $l_{ij}$. We set the number of antennas as R = 3 and T = 5 for all the experiments. For simplicity, we assume that a scalar complex-valued signal is being transmitted, i.e., d = 1.
Similar to [1], [15], [16], [49], we choose the following fading channel models:

Rayleigh: For each channel matrix $H_{ij}$ corresponding to the transceiver pair ij, we generate Rayleigh channel coefficients $[H_{ij}]_{rt}$ independently for all antenna pairs (r, t), with the real and imaginary components sampled independently from a standard normal distribution. Incorporating the path loss component, the elements $[H_{ij}]_{rt}$ of the channel matrix are formed from the path gain and a complex coefficient with real part a and imaginary part b, where a ∼ N(0, 1) and b ∼ N(0, 1).

Rician: For each channel matrix $H_{ij}$ corresponding to the transceiver pair ij, we generate Rician channel coefficients $[H_{ij}]_{rt}$ with 20 dB K-factor [50] independently for all antenna pairs (r, t), with the real and imaginary components sampled independently from a normal distribution. Incorporating the path loss component, the elements $[H_{ij}]_{rt}$ of the channel matrix are formed in the same way, where $a \sim N(\mu_{ric}, \sigma_{ric})$ and $b \sim N(\mu_{ric}, \sigma_{ric})$, with $\mu_{ric} = \sqrt{k / (2(k+1))}$ and $\sigma_{ric} = \sqrt{1 / (2(k+1))}$ for k = 100.

² We have released our code for this work at https://github.com/ArCho48/Unrolled-WMMSE-for-MU-MIMO.
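The channel construction described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions and is not taken from the released code: the function name `sample_channels`, the square side length of √M, the specific path gain 1/(1 + l), and the indexing convention (entry `H[j, i]` links transmitter i to receiver r(j)) are all hypothetical choices, and the square roots in the Rician mean and standard deviation follow the standard Rician parameterization.

```python
import numpy as np

def sample_channels(M, R=3, T=5, fading="rayleigh", k=100.0, rng=None):
    """Return a CSI tensor of shape (M, M, R, T); H[j, i] links transmitter i to receiver r(j)."""
    rng = np.random.default_rng() if rng is None else rng
    side = np.sqrt(M)  # assumption: square region of area M
    tx = rng.uniform(0.0, side, size=(M, 2))
    rx = rng.uniform(0.0, side, size=(M, 2))
    # dist[j, i] is the distance between receiver r(j) and transmitter i
    dist = np.linalg.norm(rx[:, None, :] - tx[None, :, :], axis=-1)
    gain = 1.0 / (1.0 + dist)  # assumption: one possible inverse function of distance
    if fading == "rayleigh":
        mu, sigma = 0.0, 1.0
    elif fading == "rician":
        # assumption: standard Rician parameterization with K-factor k (k = 100 is 20 dB)
        mu = np.sqrt(k / (2.0 * (k + 1.0)))
        sigma = np.sqrt(1.0 / (2.0 * (k + 1.0)))
    else:
        raise ValueError(f"unknown fading model: {fading}")
    a = rng.normal(mu, sigma, size=(M, M, R, T))  # real parts
    b = rng.normal(mu, sigma, size=(M, M, R, T))  # imaginary parts
    return gain[:, :, None, None] * (a + 1j * b)

H_rayleigh = sample_channels(10, fading="rayleigh", rng=np.random.default_rng(0))
H_rician = sample_channels(10, fading="rician", rng=np.random.default_rng(0))
```

Passing an explicit `rng` keeps the realizations reproducible, which is convenient when the same CSI samples must be fed to several beamforming algorithms.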

B. Model architecture
Our proposed feedforward UWMMSE architecture is composed of 3 unfolded-WMMSE layers with a 2-layered CV-GCN (shared by the unfolded layers) modeling the function Ψ in (7). We set the hidden layer size of the CV-GNN as F = 32 and that of the CV-MLP as G = 16. The model consists of 3302 trainable parameters. The NovoGrad [51] optimizer is employed for training across a maximum of 15000 iterations on a batch of 64 randomly sampled channel realizations with early stopping. The initial learning rate is set to $1 \times 10^{-2}$. An interesting observation was that while 3 unfolded layers offered the best performance trade-off in terms of sum-rate and time at inference, the best training performance was achieved with just 1 unfolded layer. Recalling that, on account of parameter sharing, our model can use different numbers of unfolded layers at training and inference, we established a model setup wherein UWMMSE is trained with a single layer and then 2 more layers, which share the learned components, are appended at inference. Experimental results presented in this section demonstrate the effectiveness of this setup. At inference, we average the achieved sum-rate over 10000 channel realizations for all experiments. Our model, on account of being lightweight, is well suited for both CPU and GPU operating environments. Nevertheless, for uniformity and reproducibility, all experimental results for this paper are generated on an Nvidia GeForce RTX 2080 GPU.
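The idea of training with one unfolded layer and appending further layers at inference relies only on the parameters being tied across layers. The toy sketch below illustrates that mechanism and nothing more: the layer is a plain gradient step on a least-squares objective (a stand-in, not the WMMSE update), and `unfolded_forward`, `residual`, and the value of `theta` are hypothetical.

```python
import numpy as np

def unfolded_forward(A, y, theta, num_layers):
    """Apply num_layers copies of one layer; every copy reuses the same theta."""
    v = np.zeros(A.shape[1])
    for _ in range(num_layers):
        # Toy layer: gradient step on 0.5 * ||A v - y||^2 (stand-in for a WMMSE layer)
        v = v - theta * A.T @ (A @ v - y)
    return v

def residual(A, y, v):
    """Euclidean norm of the least-squares residual."""
    return float(np.linalg.norm(A @ v - y))

A = np.array([[2.0, 0.0], [0.0, 1.0]])
y = np.array([2.0, 1.0])
theta = 0.2  # stands in for a parameter learned with a single unfolded layer

v_one_layer = unfolded_forward(A, y, theta, num_layers=1)     # depth used in training
v_three_layers = unfolded_forward(A, y, theta, num_layers=3)  # depth used at inference
print(residual(A, y, v_one_layer), residual(A, y, v_three_layers))
```

Because `theta` does not depend on the layer index, the depth is just a loop bound and can be changed after training without introducing new parameters.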

C. Performance Comparison
The sum-rate performance of the proposed UWMMSE is compared with that of existing baselines, including classical WMMSE and state-of-the-art connectionist methods. For these comparisons, we had to choose an operating point between two regimes based on additive channel noise. Firstly, in the low-noise regime, the channel noise power is set at −90 dB or less. This is a more challenging scenario since the effects of interference tend to dominate those of channel noise, and therefore the sum-capacity achieved depends largely on the precise beamforming at each transmitter. On the other hand, the high-noise regime (noise power set at 0 dB) offers a simplified setting wherein the achievable sum-capacity is innately low on account of high noise, and therefore the exact beamforming at the transmitters is not critically important. In fact, our observation is that in this regime, the transmitters often exhibit binary characteristics, either transmitting at full power or not transmitting at all. For our experiments, we choose to operate in the low-noise regime, specifically at $\sigma = 2.6 \times 10^{-5}$ (−114 dB), since it allows us to evaluate the full potential of the proposed model.
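For reference, the sum-rate utility used throughout these comparisons can be evaluated with a short NumPy routine. The sketch below uses the standard MIMO interference-channel rate expression, treats interference as noise, and assumes a noise covariance of `noise_var` times the identity. The function name `sum_rate`, the tensor layout (`H[i, j]` is the channel from transmitter j to receiver r(i), `V[i]` is the T × d beamformer of transmitter i), and the use of base-2 logarithms are assumptions made for illustration.

```python
import numpy as np

def sum_rate(H, V, noise_var):
    """Sum of per-link rates in bits per channel use.

    H: complex array (M, M, R, T), H[i, j] is the channel from transmitter j to receiver i.
    V: complex array (M, T, d), V[i] is the beamformer of transmitter i.
    """
    M, _, R, _ = H.shape
    total = 0.0
    for i in range(M):
        # Interference-plus-noise covariance seen by receiver i
        cov = noise_var * np.eye(R, dtype=complex)
        for j in range(M):
            if j != i:
                G = H[i, j] @ V[j]
                cov += G @ G.conj().T
        S = H[i, i] @ V[i]  # effective desired channel, shape (R, d)
        d = S.shape[1]
        mat = np.eye(d) + S.conj().T @ np.linalg.solve(cov, S)
        _, logdet = np.linalg.slogdet(mat)  # natural-log determinant
        total += logdet / np.log(2.0)
    return total

# Two single-antenna links with unit channels, unit beamformers, and unit noise:
# each link has SINR 1 / (1 + 1), so the total is 2 * log2(1.5).
H = np.ones((2, 2, 1, 1), dtype=complex)
V = np.ones((2, 1, 1), dtype=complex)
print(sum_rate(H, V, noise_var=1.0))
```

Using `slogdet` instead of `det` followed by `log` avoids overflow or underflow when the noise power is very small, as in the low-noise regime considered here.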
We now present a list of methods that we choose for performance comparison. All these methods address the common problem of beamforming in CV MU-MIMO WANETs.
1) WMMSE [14] is the classical baseline for our experiments, as our method is an unfolded extension of it and is potentially an improvement over it. The maximum number of iterations per sample for WMMSE is set to 100.
It is important to note here that both IAIDNN [21] and GCN-WMMSE [23] are unfolded extensions of WMMSE and therefore belong to a very specific class of hybrid algorithms that the proposed UWMMSE is also a part of. The main difference of UWMMSE w.r.t. these methods lies in the structure of the respective unfolded components and the choice of the corresponding learnable modules (see Section III).
The comparisons are shown in Fig. 2. At any given instant, the channel conditions are randomly sampled from a fading distribution. As a result, the sum-rate utility, which is conditioned on CSI, can vary significantly based on the exact samples used for evaluating different beamforming algorithms. A particular instance of CSI can be easy or hard to solve depending on the interference conditions and how these conditions contrast with the channel intensities. However, the general performance over a large set of test samples should reveal the superiority of a particular beamforming algorithm over others. This is illustrated in the form of a histogram of the sum-rate achieved on the full test set in a Rayleigh channel setting; see Fig. 2(a). The observed empirical distribution of sum-rate values over multiple channel realizations clearly reveals that UWMMSE significantly outperforms WMMSE for most realizations, with only a small overlap. The gain of full WMMSE over Tr-WMMSE, while expected due to the difference in the number of iterations, is not significant. Intuitively, this is perhaps an indication that the WMMSE iterates approach a local optimum of the sum-rate objective fairly quickly and then simply converge to it over the rest of the gradient steps. On the other hand, the UWMMSE iterates, supported by the embedded learnable components, take data-driven modified gradient steps to converge to a better local optimum within fewer steps.
Having established the superiority of UWMMSE w.r.t. WMMSE, we now extend our investigation to the class of hybrid algorithms, specifically under a Rayleigh channel setting; see Fig. 2(b). IAIDNN [21], which uses learnable parameters to approximate matrix multiplication and inversion steps in WMMSE update rules, falls short of WMMSE by a significant margin with an average sum-rate of 32.17. We suspect that the connectionist components of IAIDNN fail to leverage the inherent graph structure of the wireless networks. While its current formulation works well for fading channels without the path loss component (essentially lacking the geometric structure) and under a simplified high-noise setting, it is not equipped to deal with the full geometric channel model in a challenging low-noise scenario. This shortcoming is addressed by the more recent GCN-WMMSE [23], which uses a combination of graph filters and GCN [42] based graph learners to approximate certain WMMSE variables. Clearly, the use of graph information provides a big boost in its performance (average sum-rate of 55.19) w.r.t. IAIDNN. Nevertheless, it also falls short of WMMSE, albeit marginally, in this particular setting. Clearly, the proposed UWMMSE is the superior algorithm among all the compared approaches. This superiority can be largely attributed to the two-step learning scheme, wherein the first step leverages the underlying graph structure in the geometric wireless network to learn a set of parameters and the second step enforces a general functional transformation on a key WMMSE variable based on these learnt parameters (ref. Section III-A), thus facilitating a better convergence than the other approaches.

We now shift focus to the Rician channel setting. Our objective in choosing the two different channel models for comparison is to demonstrate that the superiority of our method and the general performance trend for the various beamforming algorithms are not dependent on a particular channel type; see Fig. 2(c). While we observe a similar performance trend wherein the proposed UWMMSE beats all other methods comfortably, GCN-WMMSE [23] surpasses WMMSE marginally in this case and also closes the gap slightly with UWMMSE. IAIDNN [21] still lags behind all the methods but does marginally better than its Rayleigh counterpart.

While achieving a high sum-rate is the primary objective of the hybrid beamforming algorithms, it is also important for these methods to offer rapid inference, as the beamforming process has to be capable of operating on the same time scale as potentially quickly varying channels. We therefore consider it essential to compare the time complexity of the aforementioned algorithms to generate the beamforming output for any given CSI input. A comparison of computation time is provided in Table I. The per-sample inference time of UWMMSE is 54 ms, roughly 24 times lower than that of WMMSE, which takes around 1.30 s. The inference time of UWMMSE is predictably similar to that of Tr-WMMSE (47 ms per sample) since they have the same number of iterations; however, the learnable components of UWMMSE add slightly to its time complexity. IAIDNN [21] has an inference time that is an order of magnitude higher than that of UWMMSE, whereas GCN-WMMSE [23] has the longest inference time (1.36 s per sample) among all the methods. Moreover, neither IAIDNN nor GCN-WMMSE achieves the same average performance as the proposed UWMMSE in its respective time duration. Further, all the hybrid methods have a training component, unlike the classical approach, which tends to be time-consuming (ref. Table I). Clearly, the training for IAIDNN is the fastest, while the proposed UWMMSE takes the longest to train among all the hybrid methods. However, since training is one-time and is typically done prior to deployment, this is generally not a major concern for most applications. We observe similar trends in both Rayleigh and Rician fading cases, but we only present the Rayleigh case in Table I for the sake of brevity.

D. Generalization across network sizes and fading types
Wireless networks are typically dynamic in terms of size, fading conditions, and channel noise power, among multiple other aspects that evolve through time. Thus, we are interested in models that work for multiple network conditions. To quantify this aspect, we evaluate the model on its generalization performance under a set of operating conditions at inference that are different from those at training. Specifically, we perform this evaluation w.r.t. network size and fading channel type, as shown in Fig. 3. We choose network size as one of the dynamic quantities since it is very common for a wireless

E. Generalization across spatial distributions
So far, we have considered a WANET setting wherein the transceivers are dropped uniformly at random in a square region of area M, as discussed in Section IV-A. This choice of distribution is arbitrary and was made for the sake of simplicity. Different real-world situations may yield other spatial distributions depending on topological conditions or mission-specific requirements. For example, an Army unit might be deployed in a manner such that there are more soldiers stationed at a particular point of interest and their concentration reduces with distance from that point. Such a deployment can be modelled more appropriately as a Gaussian distribution. It is important that the proposed UWMMSE is able to handle such CSI tensors at inference, even when trained under a different distribution. To that end, we take a UWMMSE model trained on uniformly distributed transceivers and evaluate its generalization performance on Gaussian distributed transceivers with a controlled standard deviation parameter. As shown in Fig. 4(a), the performance of UWMMSE on the Gaussian test set is equivalent to that on the uniform test set only when the standard deviation is large enough, at which point the Gaussian distribution is wide enough over the square region to essentially emulate a uniform distribution. There is a clear degradation in performance with decreasing standard deviation, as the nodes come closer together and generate stronger interference. Nevertheless, it is important to note that the mean sum-rates, normalized w.r.t. the corresponding WMMSE sum-rates, are always above 1.0. Hence, although UWMMSE achieves its best performance when it sees identical distributions in training and inference, it generalizes reasonably well as compared to WMMSE even for unseen distributions at inference.
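The two spatial distributions compared in this experiment can be sampled as follows. This is a hedged sketch rather than the procedure used for the reported results: the helper name `drop_transceivers`, the centering of the Gaussian at the middle of the square, and the clipping of samples to the region are all assumptions.

```python
import numpy as np

def drop_transceivers(M, distribution="uniform", std=1.0, rng=None):
    """Sample 2-D locations for M transmitters and M receivers; returns shape (2M, 2)."""
    rng = np.random.default_rng() if rng is None else rng
    side = np.sqrt(M)  # square region of area M
    if distribution == "uniform":
        return rng.uniform(0.0, side, size=(2 * M, 2))
    if distribution == "gaussian":
        # assumption: Gaussian centered on the square, clipped so nodes stay inside it
        pts = rng.normal(side / 2.0, std, size=(2 * M, 2))
        return np.clip(pts, 0.0, side)
    raise ValueError(f"unknown distribution: {distribution}")

rng = np.random.default_rng(0)
tight = drop_transceivers(50, "gaussian", std=0.1, rng=rng)  # nodes bunched together
wide = drop_transceivers(50, "gaussian", std=2.0, rng=rng)   # closer to the uniform case
print(tight.std(axis=0), wide.std(axis=0))
```

A small `std` concentrates all nodes near the center, which shortens cross-link distances and strengthens interference, matching the degradation trend described above.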

F. Convergence
In this section, we present an empirical convergence analysis of the proposed UWMMSE model and also provide a comparison with the WMMSE algorithm under the same framework. Our analysis focuses on the variation of the transformed receiver-weight W and its similarity with the final transmitter beamformer output V. Essentially, it is W that represents the channel conditions between a given transceiver pair. Intuitively, transceivers with better channel conditions should transmit with high power, as opposed to transceivers with poor channel conditions, which must hold transmission to conserve power. Therefore, W plays a key role in driving V to its near-optimal value. Also, W is a function of V and depends on it to measure the quality of the channel. To emphasize this interdependence, which leads to convergence of the UWMMSE model, we extract $W_i^{(k)}$ and $V_i^{(k)}$ for all nodes $i \in \{1, \ldots, M\}$ in all layers $k \in \{1, 2, 3\}$ for all 10000 test samples. Typically, a large value of $\|W_i\|_F$ represents a strong channel that is suitable for transmission and must be allocated greater power. On the contrary, a smaller value signifies a poor channel that should not transmit. To validate this hypothesis, we threshold $W_i$ for all i such that all transmitters with $\|W_i\|_F > 1.0$ are assigned a scalar 1.0 and all transmitters with $\|W_i\|_F \le 1.0$ are assigned 0.0. Further, we compute the real-valued final power allocation vector p for the entire network as $\|V_i\|_F$ for all i and threshold it to yield a binary power allocation vector. Analogous to binary classification, we treat p as ground truth and (the thresholded) w as the prediction, and then compute the F1-score between the two vectors to match them. The choice of the metric is to ensure that both false positives and false negatives have an impact on the score. These scores are then averaged across all test samples to find a global trend in matching the two variables. Similarly, a matching score between Ŵ and V is computed for the case of the WMMSE algorithm. Since WMMSE takes 100 iterations to offer best performance, we compute the scores for iterations 1, 2, and 3 for a direct comparison with UWMMSE and also iterations 50 and 100 for a more complete comparison. It is important to note here that we use the respective final power allocations for UWMMSE and WMMSE for this experiment. This is because we have already established in Section IV-C that UWMMSE reaches a better convergence than WMMSE in terms of sum-rate, and therefore the two algorithms are unlikely to reach an identical final p. The main objective of this experiment is, however, to evaluate how fast the proposed UWMMSE achieves its respective convergence as compared to the classical WMMSE. The comparison is shown in Fig. 4(b). Clearly, UWMMSE offers a better match between W and V in all 3 layers as compared to the first 3 iterations of WMMSE. Notwithstanding the error bars, we observe that the proposed UWMMSE learns the near-optimal W much earlier as compared to WMMSE.
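The matching score described in this section can be computed for one test sample as sketched below. The threshold of 1.0 on the Frobenius norm of each W_i comes from the text; the threshold applied to the power vector p is not stated, so `p_thresh` is a hypothetical parameter with an arbitrary default, and the function names are illustrative.

```python
import numpy as np

def frobenius_norms(X):
    """Per-node Frobenius norm of a stack of matrices with shape (M, rows, cols)."""
    return np.sqrt((np.abs(X) ** 2).sum(axis=(1, 2)))

def matching_f1(W, V, w_thresh=1.0, p_thresh=0.5):
    """F1-score between thresholded ||W_i||_F (prediction) and ||V_i||_F (ground truth)."""
    pred = frobenius_norms(W) > w_thresh
    truth = frobenius_norms(V) > p_thresh  # assumption: threshold value not given in the text
    tp = np.sum(pred & truth)    # predicted active, actually active
    fp = np.sum(pred & ~truth)   # predicted active, actually silent
    fn = np.sum(~pred & truth)   # predicted silent, actually active
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

# Four nodes with d = 1: W_i is 1 x 1, V_i is reduced to a single entry for brevity.
W = np.array([2.0, 0.5, 3.0, 0.1]).reshape(4, 1, 1)
V = np.array([1.0, 0.0, 0.0, 1.0]).reshape(4, 1, 1)
score = matching_f1(W, V)
```

In this small example the prediction is (1, 0, 1, 0) and the ground truth is (1, 0, 0, 1), giving one true positive, one false positive, and one false negative. Averaging `matching_f1` over all test samples, separately for each layer or iteration, yields the kind of per-layer trend reported in the comparison.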

G. Robustness
We analyze the robustness of the proposed UWMMSE model to distortion in channel state information. An analysis of this form is essential since, under real-world scenarios, the channel estimation process is imperfect [52]. Yet, it is imperative that the beamforming algorithms are robust to the extent that they are able to maintain a steady performance in spite of these variations. To that end, we add random Gaussian noise of bounded variance to the channel coefficients in the CSI tensor. The sum-rate, however, is computed for the undistorted CSI tensor H. For example, in the case of the Rayleigh channel model, the distorted elements $[\tilde{H}_{ij}]_{rt}$ of the input tensor $\tilde{H}$ are obtained by perturbing the channel coefficients with noise terms $c \sim N(0, \sigma_r)$ and $d \sim N(0, \sigma_r)$, where $\sigma_r = 0.001$. The elements which are distorted are chosen uniformly at random from all the entries of the tensor. Fig. 4(c) shows that under a controlled rate of distortion that varies from 0.0 (no element of H is distorted) to 1.0 (all elements of H are distorted), the proposed UWMMSE maintains a sum-rate that is better than WMMSE-with-undistorted-input until about 20% of the channel coefficients are distorted.
Although the performance dips beyond the 20% mark, it is still better than WMMSE-with-distorted-input until about the 60% mark, beyond which the performance of both algorithms applied to the distorted CSI is similar. Clearly, UWMMSE is reasonably robust insofar as it achieves a higher sum-rate than the classical method under distorted channel estimation as long as at least about 40% of the coefficients are estimated correctly. This degree of robustness, albeit empirical, is inherent to the model since no robustness criterion is enforced at training. We expect that methods like adversarial training and noise-regularized training will further enhance the robustness of the proposed model. That analysis, however, is beyond the scope of this work.
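A controlled distortion rate of this kind can be emulated as in the sketch below. It assumes, for illustration only, that the perturbation is additive complex Gaussian noise applied directly to the selected entries and that σ_r denotes a standard deviation; the function name `distort_csi` is hypothetical.

```python
import numpy as np

def distort_csi(H, rate, sigma_r=0.001, rng=None):
    """Perturb a fraction `rate` of the entries of H, chosen uniformly at random."""
    rng = np.random.default_rng() if rng is None else rng
    n = H.size
    num_distorted = int(round(rate * n))
    # Pick exactly num_distorted distinct entries of the flattened tensor
    idx = rng.choice(n, size=num_distorted, replace=False)
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    mask = mask.reshape(H.shape)
    # assumption: additive complex noise with independent real and imaginary parts
    c = rng.normal(0.0, sigma_r, size=H.shape)
    d = rng.normal(0.0, sigma_r, size=H.shape)
    return H + mask * (c + 1j * d)

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 4, 3, 5)) + 1j * rng.normal(size=(4, 4, 3, 5))
H_distorted = distort_csi(H, rate=0.2, rng=rng)
num_changed = int(np.sum(H_distorted != H))
```

In the experiment described above, the distorted tensor would be fed to the beamforming algorithm while the sum-rate is still evaluated on the clean tensor `H`.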

V. CONCLUSION
We presented UWMMSE, a hybrid algorithm for fast, efficient, and near-optimal beamforming in complex-valued MU-MIMO WANETs, obtained by unfolding the iterations of the classical WMMSE algorithm using complex-valued neural models. The main contribution of this work lies in forming a synergistic combination of an MLP-based parametric functional transformation with a GNN-based learner appropriate for tackling wireless network graphs. Superiority of this method is established through extensive experiments. Further, the proposed model is lightweight on account of extensive parameter sharing and also easy to implement in a distributed fashion. The per-layer computational complexity of UWMMSE matches the per-iteration complexity of WMMSE. However, the main gain lies in the reduction of the number of layers in UWMMSE as compared to the number of iterations necessary for WMMSE to converge. Future work will involve considering time-varying channels with long-term constraints on the beamforming selection, such as fairness or battery constraints. Evaluating the model performance on real-world datasets is also an important next step.
Leveraging these identities, we now want to show that Λ(·; Θ) is equivariant. Specifically, for the special case of K = 1, it can be obtained from the definition of Λ(·; Θ) that

$$\Lambda(\tilde{H}; \Theta) = \Lambda_1(\tilde{H}, V^{(0)}; \Theta) = \Lambda_1(\tilde{H}, \Pi V^{(0)}; \Theta) = \Pi \Lambda_1(H, V^{(0)}; \Theta) = \Pi \Lambda(H; \Theta),$$

where the fact that $V^{(0)}$ is a constant tensor gives rise to the second equality, while the third equality is obtained as a special case of the previous identity for k = 1. This completes the proof for permutation equivariance of a single-layered UWMMSE. For UWMMSE with K > 1, permutation equivariance can be established via a simple induction argument, omitted here.
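The permutation-equivariance property that the proof relies on for Ψ can be checked numerically on a toy graph layer. The sketch below is real-valued and uses a generic one-layer graph convolution, so it is only an analogue of the CV-GNN; the names `graph_layer` and `check_equivariance` are hypothetical, and the node relabeling of S is written explicitly as Π S Πᵀ.

```python
import numpy as np

def graph_layer(S, Q, W0, W1):
    """One graph-convolution style layer: own features plus neighbor aggregation."""
    return np.tanh(Q @ W0 + S @ Q @ W1)

def check_equivariance(seed=0, M=6, f_in=4, f_out=3):
    """Return True if relabeling the nodes permutes the layer output in the same way."""
    rng = np.random.default_rng(seed)
    S = rng.normal(size=(M, M))        # stand-in for the transformed CSI matrix
    Q = rng.normal(size=(M, f_in))     # stand-in for node feature vectors
    W0 = rng.normal(size=(f_in, f_out))
    W1 = rng.normal(size=(f_in, f_out))
    P = np.eye(M)[rng.permutation(M)]  # random permutation matrix
    permuted_input = graph_layer(P @ S @ P.T, P @ Q, W0, W1)
    permuted_output = P @ graph_layer(S, Q, W0, W1)
    return bool(np.allclose(permuted_input, permuted_output))

print(check_equivariance())
```

The check succeeds because Πᵀ Π is the identity, so the permutation cancels inside the aggregation, and the elementwise nonlinearity commutes with a reordering of rows.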

APPENDIX B PROOF OF THEOREM 1
Proof : This proof is inspired by that of the convergence result presented in [24], and the theoretical linear convergence results of unfolded ISTA [30], as presented in [53].We assume that Trace V for all i uniformly.This means that, for all η > 0, there exists a layer index K 1 such that for all k > K 1 , V (k−1) = V * + E V and V (k) = V * + E V where [E V ] i F < η and [E V ] i F < η for all i.Following notations identical to that of the proof of Proposition 1 in Appendix A, we have that V (k) = Λ k (H, V (k−1) ; Θ).It follows from uniform convergence that, Therefore, we need to obtain an expression g k (H, V * +E V ; Θ) such that Λ k (H, V * + E V ; Θ) = V * + g k (H, V * + E V ; Θ), that can be replaced in (16) to obtain The fact that (17) holds for a positive η → 0, implies that as k → ∞, g k (H, V * + E V ; Θ) → 0. In what follows, we endeavour to determine the explicit form of g k (H, V * + E V ; Θ) and thereby demonstrate (15).First, focusing on an arbitrary i, we replace V (9), where H Since we operate in the low-noise regime [24], the noise term in the inverse can be neglected in comparison to the interference term to yield where, we define two dummy variables Clearly, the first term represents U * i .All remaining terms that depend on E V , constitute E U such that ||E Ui || F → 0 as η → 0 for all i.Thus, U (k) i takes the form, We then consider (6), where we replace U (k) and V (k−1) , by U * + E U and V * + E V , respectively.Similar to the procedure followed in (18), and leveraging the continuity of Φ ξ , we have where | wi | → 0 as η → 0 for all i.Next, we repeat the procedure for (9).To that end, we redefine the dummy variables A i , B i and C i as, i , and i , by their respective forms obtained in (18) and (19).Clearly, for k → ∞, the non-linearity β(•) is no longer significant since Trace V(k) i V(k) H i → P * i and 0 < P * i < P max for all i.Therefore, from ( 9)- (10) we get that, where ||E Vi || F → 0 as η → 0 for all i.Since, (U * , W * , V 
* ) form a fixed point of the WMMSE updates [14], V * i is given by the first term in the r.h.s of (20) for all i.Therefore, the last two terms in the r.h.s of (20) constitute, element-wise, the function g k (H, V * + E; Θ).Further, since ||E Vi || F → 0 as η → 0 (equivalently, as k → ∞), as k goes to infinity, the last term in (20) must go to 0, thus completing the proof.

Fig. 1. Flow diagram depicting the variable dependencies in any given intermediate layer k of the proposed K-layered UWMMSE algorithm. Updates are shown for an arbitrary node i and are computed in parallel for all i. The yellow blocks represent the five equations in (5)-(9) plus (10) and (11). The input to the layered structure is V

Fig. 2. Comparison of the achieved sum-rate by the proposed UWMMSE with full WMMSE [14], a truncated version of it, IAIDNN [21], and GCN-WMMSE [23]. (a) Sum-rate histogram for ∼10000 Rayleigh CSI samples in the low-noise regime ($\sigma = 2.6 \times 10^{-5}$). UWMMSE achieves significantly better performance than WMMSE while requiring the same number of iterations as the truncated version of it. (b) Box plots corresponding to histograms in (a), with additional comparisons with IAIDNN and GCN-WMMSE on identical CSIs. (c) Counterpart of (b) for Rician channels.

Fig. 3. Generalization performance of UWMMSE, normalized w.r.t. the corresponding WMMSE performance, across multiple network sizes and fading models. (a) Normalized mean sum-rate achieved by UWMMSE on Rayleigh channel realizations with test sizes varying in the range {10, 11, ..., 50}, while training sizes were in the range {10, 12, 14, ..., 50}. The normalized mean sum-rate achieved by Tr-WMMSE for the same range of network sizes is shown for comparison. (b) Counterpart of (a) but for test sizes varying in the range {55, ..., 100}, with training sizes still in the range {10, 12, 14, ..., 50}. (c) Counterpart of (a) but for Rician channel realizations, with an additional plot of normalized sum-rate achieved by a UWMMSE model trained on Rayleigh channel realizations and tested on Rician channel realizations.

ad-hoc network to have new nodes added to it or existing nodes removed from it. The choice of fading channel type as the second dynamic setting is motivated by the fact that an ad-hoc network can group and re-group under various geographical (rural, urban, suburban) and climatic conditions, such that there may or may not be dominant line-of-sight paths among transmitters and corresponding receivers.

Firstly, as a means of evaluating the interpolation behavior of UWMMSE across network sizes, we train it on even-valued sizes in {10, ..., 50} and then test it on all sizes between 10 and 50 for a Rayleigh channel setting. As shown in Fig. 3(a), the model performs significantly better (more than 1.2 times the corresponding WMMSE sum-rate) for all sizes, including the odd-valued sizes that were not available during training. The observed drop in the model performance with increase in network size is expected, since a larger network offers more interference at each transceiver and essentially poses a more challenging problem for the beamforming algorithm.

Next, to evaluate the extrapolation behavior of UWMMSE, we test an identically trained model from the previous experiment on all sizes between 55 and 100. As shown in Fig. 3(b), the model performs reasonably well (more than 1.05 times the corresponding WMMSE sum-rate) for all unseen sizes. Similar to the interpolation setting, a steady decrease in mean performance is observed in the extrapolation setting, mainly due to the increase in network size and the departure from the training setting. However, the fact that the UWMMSE performance for all sizes between 10 and 100 is strictly above WMMSE irrespective of training conditions demonstrates the generalizability of our model to variations in network size.

Fig. 3(c) illustrates the generalization behavior of UWMMSE to variation in fading conditions. A UWMMSE model trained on a Rayleigh setting (UWMMSE_ray), identical to that of the interpolation experiment, is employed for inference on a Rician setting. While UWMMSE_ray fails to beat UWMMSE_ric (which is specifically trained on a Rician channel setting), it still manages to comfortably beat the corresponding WMMSE performance for all network sizes. Clearly, the proposed UWMMSE, when trained under a fixed fading condition, can generalize reasonably well to different fading channel settings without any re-training on other fading conditions.

Fig. 4. Performance generalization across spatial distributions, convergence, and robustness results. (a) Variation in normalized sum-rate utility achieved by the UWMMSE model trained on uniformly sampled transceiver locations and tested on transceiver locations sampled from Gaussian distributions with varying standard deviations. (b) Matching scores between the variables W and V for the proposed UWMMSE in layers 1, 2, and 3 and the classical WMMSE in iterations 1, 2, 3, 50, and 100. (c) Variation in normalized mean sum-rate achieved by UWMMSE and WMMSE under controlled distortion in the input CSI tensor H, as compared to the ideal performance achieved on the same CSI tensors without distortion.

APPENDIX A PROOF OF PROPOSITION 1
Proof: Let $V^{(k)} = \Lambda_k(H, V^{(k-1)}; \Theta)$ denote the output of the UWMMSE architecture in (5)-(10) at layer k. Also, $\tilde{H} = \Pi H \Pi$ denotes an arbitrary permuted version of the channel tensor, with the permutations being enforced only on the first two dimensions of the tensor, and $\tilde{V}^{(k-1)} = \Pi V^{(k-1)}$ represents the input $V^{(k-1)}$ to the kth layer, with the permutations being enforced on the first dimension of the tensor only. Further, $\tilde{S} = \Pi S \Pi$ is a permuted version of the transformed CSI matrix, and $\tilde{Q} = \Pi Q$ is the equivalent permutation of the node feature vectors. Firstly, we want to prove that $\Lambda_k(\tilde{H}, \tilde{V}^{(k-1)}; \Theta) = \Pi V^{(k)}$. We know that $\tilde{\xi}^{(k)} = \Psi(\tilde{S}, \tilde{Q}; \theta) = \Pi \xi^{(k)}$ since Ψ is permutation equivariant. Let node i be assigned a new index π(i) after permutation Π. Then, by setting $\tilde{H}_{ij} = [\tilde{H}]_{ij::}$, it follows from (5) that $\tilde{U}_i^{(k)} =$

2) Truncated WMMSE (Tr-WMMSE) offers an empirical lower bound for UWMMSE in terms of the performance that can be achieved by the same number of WMMSE iterations as the number of unfolded layers, without any learning. Tr-WMMSE is allowed 3 iterations to match the number of UWMMSE layers.
3) IAIDNN [21] is a deep-unfolding framework to solve the sum-rate maximization problem for precoding design in MU-MIMO systems.
4) GCN-WMMSE [23] is a graph-based unfolding framework for transceiver design in multicell MU-MIMO interference channels with local channel state information.