Adaptive Regularized Zero-Forcing Beamforming in Massive MIMO with Multi-Antenna Users

Modern wireless cellular networks use massive multiple-input multiple-output (MIMO) technology. This technology involves operations with an antenna array at a base station that simultaneously serves multiple mobile devices which also use multiple antennas on their side. For this, various precoding and detection techniques are used, allowing each user to receive the signal intended for him from the base station. There is an important class of linear precoding called Regularized Zero-Forcing (RZF). In this work, we propose Adaptive RZF (ARZF) with a special kind of regularization matrix with different coefficients for each layer of multi-antenna users. These regularization coefficients are defined by explicit formulas based on Singular Value Decomposition (SVD) of user channel matrices. We study the optimization problem, which is solved by the proposed algorithm, with the connection to other possible problem statements. We prove theoretical estimates of the number of conditionality of the inverse covariance matrix of the ARZF method and the standard RZF method, which is important for systems with fixed computational accuracy. Finally, We compare the proposed algorithm with state-of-the-art linear precoding algorithms on simulations with the Quadriga channel model. The proposed approach provides a significant increase in quality with the same computation time as in the reference methods.


Introduction
In multiple-input multiple-output (MIMO) systems with numerous antennas, precoding is an important part of downlink signal processing, since this procedure can focus the transmission signal energy on smaller areas and allows for greater spectral efficiency with less transmitted power [1,2]. Various linear precodings allow directing the maximum amount of energy to the user as Maximum Ratio Transmission (MRT) or completely get rid of inter-user interference as Zero-Forcing (ZF) [3]. In the case of Reg-Emails: eugenbobrov@ya.ru, roborisor@gmail.com, kuznetsov.victor@huawei.com, luhao12@huawei.com, minenkov.ds@gmail.com, stroshin@hse.ru, d.yudakov43@gmail.com, zaev.da@gmail.com.
ularized Zero-Forcing (RZF) algorithms (aka the Wiener filter), we balance between maximizing the signal power and minimizing the interference leakage [4,5,6,7,8] and still have low complexity compared to the non-linear precoding. Note that usually a "scalar" regularization is considered, i.e. regularization with a scalar multiplied by a unitary matrix of the corresponding size. E. Björnson in [5] uses the primal-dual approach and proves that the optimal regularization has the form of a diagonal matrix (generally speaking, with different elements). This proof is not constructive and does not provide any particular formula or algorithm for this optimal regularization. In the current paper, we propose an explicit heuristic formula for a diagonal regularization that provides better results compared with scalar RZF. We support the proposal with theoretical justification and tests using Quadriga [9].
There are also approaches for linear precoding, like Transceivers (detection-aware precoding, see e.g. [10]), Block-diagonal precoding (see e.g. [11]), and Power Allocation problem (see the involved study in [12]). There are also different non-linear precoding techniques such as Dirty Paper Coding (DPC) and Vector Perturbation (VP) but they have much higher implementation complexity [13] and in the case of usage of numerous antennas in massive MIMO, linear precoding techniques are more preferable. There also are many good surveys of precoding techniques [14,15,16] and papers that consider different aspects and variations of these algorithms (see e.g. [17,18,19]). In particular, the work [19] presents various variants of RZF in the case of multiple base stations, which help to reduce inter-cell interference.
Most works do not pay much attention to multi-antenna user equipment (UE) for simplicity. In our work, we consider a single base station serving multi-antenna users who simultaneously receive fewer data channels than their number of antennas. This approach is necessary because, in practice, the channels between different antennas of one UE are often spatial correlated [20]. Therefore, the matrix of the user channel is ill-conditioned (or even has incomplete rank), thus one can not efficiently transmit data using the maximum number of streams. To solve this problem, instead of the full matrix of the user channel, vectors from its singular value decomposition (SVD) with the largest singular values are used for precoding [21]. In the case of UE with one antenna, the channel matrix can be normalized [6] and normalization coefficients are the path losses of each UE that can differ by several orders, common Reference Signal Received Power (RSRP) values vary from (−130) dBm to (−70) dBm. When we use SVD, the singular values of different UEs have the order of the corresponding path-losses and therefore vary greatly.
We managed to find a simple heuristic formula that provides gain over known RZF algorithms. Motivated by the above issue, we propose Adaptive RZF (ARZF) precoding with diagonal regularization. Algorithms of the Wiener filter type [4] have a scalar regularization of the form λI, and ARZF belongs to the narrow class of precoding which uses a diagonal regularization matrix. Such algorithms effectively use different regularization parameters for different UEs by taking into account the singular values of transmitted layers (streams).
The idea of ARZF is well-known, for example, in [5] it is shown that maximization of the SINR function (including the considered sum SE (19)) is achieved by an algorithm with appropriate diagonal regularization, and in [22] the authors derive a similar formula for the single-antenna MU-SIMO user system.
The results of this paper are the adaptation of the formula W ARZF (V ) together with Theorem 2.8 for Multi-User (MU)-MIMO systems.
The rest of this paper is organized as follows. In Section 2.1 we formulate the problem, which consists of the Channel and System Model, quality measures, and 1 2 … Figure 1. An example of the MIMO precoding usage. The problem is to find an optimal precoding matrix W of the system given the target Spectral Efficiency (SE) function with the given constraints (24). power constraints. The downlink MIMO channel model is simplified using SVDdecomposition of the channel (sec. 2.1.1) and useful idealistic detection (sec. 2.1.3); quality measures and power constraints are discussed in sec. 2.1.5, 2.1.6. In Section 2.2 we propose an adaptive precoding algorithm that utilizes UE singular values. Then, we study its relation with known precoding, including MRT, ZF, and RZF. Comparison of these algorithms on numerical experiments with Quadriga is provided in Section 3.2. Conclusions are drawn in Section 4. Symbols and notations are shown in Tab. 1.
Throughout the paper, we assume that the transmitter has perfect channel state information (CSI) of all downlink channels, this assumption is reasonable in time division duplex (TDD) systems, which allows the transmitter to employ reciprocity to estimate the downlink channels, and that each user only has access to their own CSI, but not the CSI of the downlink channels of the other users.
Throughout the paper we use the following notations. We consider one cell with K UE, the number of transmit antennas is T and the number of receive antennas and transmit symbols of UE k are R k = 1, 2, 4 and L k R k with total R = K k=1 R k and L = K k=1 L k : L k R k T . By bold lower case letters we denote vectors: either columns or rows, which will be clear from the context. We denote matrices by bold upper case letters, considering them as sets of vectors, e.g. channel matrix is a set of vector-rows H = [h 1 ; ...; h R ] ∈ C R×T and precoding matrix is a set of vector-columns W = (w 1 , ..., w L ) ∈ C T ×L . Matrix elements are denoted by ordinary lower case letters with the first index standing for rows and the second one -for columns: H = {h rt }, W = {w tl }, r = 1, ..., R, t = 1, ..., T, l = 1, ..., L. Hermitian conjugate is denoted by H H := H T . Diagonal and block-diagonal matrices are written as S k = diag{s k,1 , . . . , s k,Rk } and S = bdiag{S 1 , . . . , S K } correspondingly, the identity matrix of size T is I T = diag{1, . . . , 1} ∈ C T ×T . Trace of a square matrix A is denoted by trA = K k=1 a kk , and Frobenius norm is H =

Channel and System Model
According to [2,23,12] we consider a MIMO broadcast channel. The Multi-User MIMO model is described using the following linear system: where x, r ∈ C L are correspondingly transmitted and received vectors, H ∈ C R×T is a downlink channel matrix, W ∈ C T ×L is a precoding matrix, and G ∈ C L×R is a blockdiagonal detection matrix ; noise-vector n ∼ CN (0, σ 2 I R ) is assumed to be independent Gaussian with noise level σ 2 (Fig. 1). Note that the linear precoding and detection are implemented by simple matrix multiplications. The constant T is the number of transmit antennas, R is the total number of receive antennas, and L is the total number of transmitted symbols in the system. They are usually related as L R T . Each of the matrices G, H, W decomposes by K users:  Figure 3. The main decomposition of the channel matrix.

Singular Value Decomposition of the Channel
It is convenient [21] to represent the channel matrix of UE k via its reduced Singular Value Decomposition (SVD): (2) Where the channel matrix for user k, H k ∈ C Rk×T contains channel vectors h i ∈ C T by rows, the singular values S k ∈ C Rk×Rk are sorted by descending, U k ∈ C Rk×Rk is a unitary matrix of left singular vectors, and matrix V k ∈ C Rk×T consists of right singular vectors vector-rows Collecting all users together, we may write the following channel matrix decomposition: H = U H SV (Lemma 2.1 and Fig. 3), where each of the decomposition matrices Note that this representation is not actually SVD-decomposition of the matrix H: vectors v k,j , v l,i that correspond to different UE k = l are not generally orthogonal. Nonetheless, this representation has important properties that make it useful: the matrix S = diag(S k ) ∈ C R×R is a diagonal matrix and U = bdiag(U k ) ∈ C R×R is a block-diagonal unitary matrix. This allows to compensate factor U H S by detection on UE side (each UE deals with its own U H k S k ). Thus, on the transmitter side, it is sufficient to invert only the matrix V , which in itself is much simpler than the channel H: first, its rows have a unit norm and, second, it is a natural object for Rank adaptation problem [24]. We heavily use the Lemma 2.1 in the following.

Idea of Proposed Precoding Algorithm
Let us briefly formulate the resulting algorithm. The intuitive idea of the proposed precoding method is that it should provide maximum signal power and eliminate the so-called inter-user interference. It is known that maximal signal power is provided by Maximum Ratio Transmission (MRT) precoding, W M RT (V ) = V H , and the interference is vanished by the Zero-Forcing (ZF) precoding, W ZF (V ) = V H (V V H ) −1 [3]. Both of these algorithms have disadvantages and can be improved by using Regularized Zero-Forcing (RZF) precoding W RZF (V ) = V H (V V H + λI) −1 , where the regularization parameter λ > 0 depends on noise level and average path-losses [4].
When users have significantly different path-losses, it is better to use regularization with a diagonal matrix. Non-scalar (matrix) regularization is introduced by E. Björnson in [5], where it is constructed using the primal-dual technique and demonstrated that the optimal regularization takes the shape of a diagonal matrix (generally speaking, with different elements). This proof is not constructive since it does not propose a specific formula or procedure for optimum regularization. In our work, we propose an explicit heuristic formula for diagonal regularization, Adaptive Regularized Zero-Forcing (ARZF) for the MU-MIMO system, as follows: Theoretical justification is given below (Sec. 2.2). Experiments with Quadriga [9] show that proposed ARZF outperforms known scalar RZF algorithms (see Sec. 3.2).

Number of Transmitted Streams and Conjugate Detection
In the previous section, we have introduced the notation of SVD (eq. 2). Usually, the transmitter sends to UE several layers and the number of layers (rank) is less than the number of UE antennas (L k R k ). In this case, it is natural to choose for transmission the first L k vectors from V k that correspond to the L k largest singular values from S k . Denote by S k ∈ C Lk×Lk the first L k largest singular values from S k , and by U H k ∈ C Rk×Lk , V k ∈ C Lk×T the first L k left and right singular vectors that correspond to S k : where the wave sign · denotes the contraction of singular vectors, i.e. rank V k = L k R k = rankV k . Numbers L k (and particular selection of V k ) are defined during the Rank Adaptation problem that, along with Scheduler, is solved before precoding. For the Rank adaptation problem, we refer for example to [24] and in what follows we consider L k , V k already chosen. After precoding and transmission, on the UE k side, we have to choose the detection matrix G k ∈ C Lk×Rk that takes into account UE Rank L k . The way, how UE performs detection, heavily affects the total performance and different detection algorithms require different optimal precodings (see [4] where precoding is a function of the detection matrix). The best way would be to choose precoding and detection consistently but it is hardly possible due to the distributed nature of wireless communication. Nevertheless, there are ideas on how to adjust the precoding matrix assuming a particular way of detection on the UE side at the transmitter [10]. We do not consider such an approach in this paper, though it can be used to further improve our main proposal.
To conduct analytical calculations, we assume the Conjugate Detection (CD) [25] in the following form: where the unitary matrix U k ∈ C Lk×Rk contains the first L k singular vectors. The diagonal matrix S k ∈ C Lk×Lk contains the first L k of the largest singular values.
A block-diagonal unitary matrix U consists of U k blocks. The diagonal matrix S −1 consists of blocks S −1 k . Finally, the block-diagonal detection matrix G C consists of the blocks G C k on the main diagonal or, which is the same, the product of the matrices Proof. Using Lemma (2.1) we can write which immediately leads to with combination of users k = 1 . . . K (5).
Remark 1. The Corollary 2.3 is essential for our study, since it gives the idea to use S −2 ∈ C L×L in regularization part of precoding to take into account the correct effective noise S −1 U n.
Remark 2. The formulated Theorem sufficiently simplifies the initial problem, decreases its dimensions, and allows notation to be uniform. Namely, we can work with user layers of shapes L k and L instead of considering user antennas space. Note also that it is sufficient to only perform Partial SVD of the channel H k ∈ C Rk×T , keeping just the first L k singular values and vectors for each user k such as Based on this, in what follows we omit the tilde and write Remark 3. The introduced Conjugate Detection is "ideal" and can not be implemented in practice. However, one can show that realistic detection policies as MMSE or IRC detection [26] often behave similarly to Conjugate Detection. In simulations, we use MMSE detection to compare various precoding methods.

Known MMSE Detection
Signal detection aims at pseudo-reversing the product of the channel and precoding matrices. The most common form of the G detection matrix is studied, which are called minimum MSE (MMSE) [27]. In statistics and signal processing, MMSE -is an estimation method that minimizes the MSE function, which in turn is a general measure of the quality of the estimation of fitted values of the dependent variable. The definition of MMSE detection implements the following rule: The parameter P -is the base station power, and σ 2 -the system noise. The MMSE detection method seeks to eliminate noise by assuming it is the same for all symbols: n ∼ CN (0, σ 2 I R ), which may be violated in practice.
Proof. Let's substitute the system expression The second summand will go to zero from introducing the expectation of noise with zero mean, and the third summand will go to zero Let's open the brackets using the sum norm squared formula: Next, applying expectation over the symbols x with the condition E x∼CN (0,IL) xx * = I: If the conditions are met: Let's calculate the gradient of the function (12) and equate it to zero: We get the desired solution (9).

Quality Measures
The following functions are used to measure the quality of the precoding methods. These functions are based not on the actual sending symbols x ∈ C L×1 , but some distribution of them [5]. Thus, we get the common function for all assumed symbols, which can be sent using the specified precoding matrix. Let us consider the end-to-end numbering of symbols l = 1, ..., L for all UE and index function k = k(l) that returns the index of UE that receives the symbol l.
The Signal-to-Interference-and-Noise functional of the l-th symbol of user k = k(l) is defined as: The formula (14) shows the ratio between the useful and harmful parts of the signal. It depends on the whole precoding matrix W ∈ C T ×L , where the complex vector w l ∈ C T ×1 denotes the precoding for the l-th symbol, on the channel matrix H k ∈ C Rk×T of the k-th user, the detection vector g l ∈ C 1×Rk of the l-th symbol; after detection noise n k ∈ C Rk×1 level of UE k becomes E[g l n] = σ 2 g l 2 . The formula (14) can be efficiently computed for all L layers using several matrix multiplications and summations. It can be further simplified, using Theorem 2.2.
Corollary 2.5. For given SVD (2) and assuming Conjugate Detection (4) the following formula for SINR holds: The important criterion of network performance is the spectral efficiency SE k of UE k, which refers to the information rate that can be transmitted over a given bandwidth for a certain UE. In the case of one symbol L k = 1 it is bounded by Shannon's entropy that depends on SINR in the following way: This is theoretically supreme for the possible transmission Rate, and recent modulation and coding schemes along with Hybrid Automatic Repeat Request (HARQ) retransmission and Block Error Rate (BLER) management allow us to achieve a Rate very close to Shannon's entropy. Note that SINR here is taken in linear values, not in dB.
In the case of several transmitted symbols, generally speaking, SE k of UE k is not a sum over its layers because there is one common transport block to be transmitted and thus the coding and modulation algorithms are also common. Usually, one would introduce an effective SINR as a function of per-symbol SINR: SINR ef f k = f (SINR 1 , . . . , SINR Lk ), and so using (16) we obtain: There are different approaches to estimate this effective SINR for QAM64 and QAM256 (see [28]) and in simulations, we use the QAM256 model. Also, we con-sider approximate formula using the geometric mean of per-symbol SINR. That is the same as the usual average of per-symbols in dB. This heuristic approximation will be used later in numerical optimization by gradient search: The most general problem statement is the multi-criteria optimization of the whole vector (SE 1 , ..., SE K ). For such a problem, the Pareto optimality can be studied (see [29,30]), which is hard and does not provide a unique solution, that is why usually a suitable decomposition into one-criterial optimization is considered: [5]. Such decomposition can be done in different ways, we consider the sum of Spectral Efficiencies (17) over all UEs: Finally, we consider Single-User SINR for the k-th user and the average of SU SINR is an important parameter in simulations: The formula (22) reflects the quality of the channel (geometrical average of per-symbol SINR) for the specified UE without taking into account other users. It depends on the greatest L k singular values S k ∈ R Lk×Lk of the k-th user channel matrix H k ∈ C Rk×T and may be derived from the (14) and (18) formulas assuming Single-User case, MRT or ZF precoding matrix and Conjugate Detection (4). We will use this function in our experiments as a universal channel characteristic, including system noise σ 2 and station power P .

Problem Statement and Power Constraints
First of all, we assume that the total channel H, the number K of UE, and their ranks L k are known given values. This means that the Scheduler problem (which UE are to be served from the set of active UE) and Rank adaptation problem (which rank is provided to each UE) are already solved. This is usually the case in real networks. Scheduler and Rank adaptation problems are complicated and important radio resource management problems themselves but are out of the scope of this study (for example of Scheduler problem we refer to [31] and bibliography within, for Rank adaptation, [24] and [10]. These algorithms affect the properties of matrix H, e.g. Scheduler can choose only UE with small enough correlations C = V V H − I R ε. We take this into account and consider scenarios with small correlations of UE channels. Next, we consider the channel model in the form (1) that particularly means exact measurements of the channel. To further simplify the problem we suppose detection policy G = G(H, W ) to be a known function, moreover we assume Conjugate Detection (4) that simplifies channel model to (5). Based on this channel model we calculate SINR of transmitted symbols by (15) and effective SINR of UE, which can be approximately calculated by (18).
Finally, denote the total power of the system as P and that the sent vector has unit norm E[xx H ] = I L . The total power constraints and the more realistic per-antenna power constraints (see [12]) impose the following conditions on the precoding matrix: The ultimate goal is to find a precoding matrix that maximizes sum Spectral Efficiency (17) subject to power constraints (23), e.g.: Even after all the above simplifications, the formulated problem is too complicated to solve analytically. Moreover, it is not convex or concave, hence it could have a lot of (essentially) different local maximums. That's why our strategy in this paper is to propose some heuristic formula that will prove to be better than known algorithms on some reliable simulations. After defining the particular form of precoding (anzats) W 0 = W 0 (V , S, n) we can always satisfy power constraints by normalizing constant, e.g. for (23): In simulations, we use more realistic per-antenna power constraints. A heuristic algorithm is formulated below and its idea is hinted at by formulated simplifications in Corollary 2.3 and Remark 1. Theoretically, it is supported by the model problem of MSE minimization (see e.g. [4]): where r(W ) is given by channel model (1) or (5).

Reference Precoding Methods
We repeat known reference precoding algorithms and present our solution.

Maximum Ratio Transmission (MRT)
A Maximum Ratio Transmission precoding algorithm takes single-user weights V H from the SVD-decomposition. This approach leads to the maximization of single-user power, ignoring the interference. The MRT approach is preferred in noisy systems when the noise power is higher than inter-user interference [3]: where normalizing constant µ is defined from power constraint (25). This algorithm outputs a set of interfering channels assuming the model (5):

Zero-Forcing (ZF)
The next modification of the precoding algorithm performs decorrelation of the symbols using the inverse correlation matrix of the channel vectors. Such precoding construction sends the signal beams to the users without creating any interference between them. Different from the MRT method, the Zero-Forcing approach is preferred when the potential inter-user interference is higher than the noise power, and the Spectral Efficiency quality improves by eliminating this interference [3].
The following receiver model is: Theorem 2.6. Assume the channel matrix has the form of H = U H SV (Lemma 2.1). Then, the following relation for Zero-Forcing holds: Proof.

Regularized Zero-Forcing (RZF)
In the geometrical sense, in the ZF method (28), beams are sent not directly to the users but with some deviation, which reduces the useful signal. The following modification corrects the beams, which allows some inter-user interference and significantly increases the payload.
In the practical sense, in the ZF method (28), the channel right inversion may not exist or matrix V V H may be ill-conditioned, making ZF poorly perform. There are many practical solutions to this problem based on regularization.
Regularized Zero-Forcing is the most common method in real practice, and therefore we use it as the main reference method. As the baseline, we use the analytical form of the regularization matrix using λ = Lσ 2 P [8]. This method cannot cancel all multi-user and multi-layer interference. It admits some interference to maximize single-user power. It is used as a trade-off between using MRT and ZF precoding [5] balancing between maximizing the signal power and minimizing the interference leakage, and thus we need to appropriately manage them by optimizing the regularization parameter depending on the noise level.
The RZF method has the following asymptotic properties [3]: if σ 2 → ∞, it becomes equivalent to W M RT = µV H , which is optimal in low SINR cases. And if we set σ 2 = 0, the formula becomes equal to ZF precoding: The precoding matrix based on the un-normalized channel [6] in the case when the number of sending symbols L is less than the number of receiver antennas may be written in the following form of RZF: Here, F = SV parameter and λ = Lσ 2 P [8] are chosen taking into account noise level E[nn H ] = σ 2 I R . Actually, this method is equivalent to an un-normalized channel matrix H, which in our case is the matrix F , obtained after splitting user channel into multiple streams.
Let us formulate the following well-known fact about RZF method (31).
Theorem 2.7. Consider the channel decomposition H = U H SV from the Lemma 2.1. The precoding W RZF (V ) with any parameter λ > 0 is the solution of the following optimization problem: Proof. Calculating the gradient and equating it to zero, we get: The last identity may be proved using multiplication by the (V V H + λI) matrix from the right side of the identity and by the (V H V + λI) from the left side of it.
Remark 4. The algorithm W RZF (V ) is also the solution to the constrained optimization problem (26) in assumption GH = V (see [4]): Constraint optimization problem (35) is decomposed to (33)

Wiener Filter Zero-Forcing (WRZF)
In [4] authors study the Wiener filter that provides optimum in the problem (35) for arbitrary symbol and noise covariance matrices R x and R n in the case of known detection matrix G. We take Conjugate Detection (4) G = G C and apply the Wiener filter for normalized channel V and symbol R x = I L and corresponding noise covariance R n = S −2 . The algorithm is the solution to the constrained optimization problem (26) in assumption of CD (G = G C so that G C H = V according to Theorem 2.2): where

Proposed Adaptive Regularized Zero-Forcing (ARZF)
Algorithms W RZF and W W RZF (and even the Wiener filter in more general cases) are precodings with scalar Regularization. Taking into account effective noise from Corollary 2.3, we propose Adaptive RZF (ARZF) algorithm with diagonal matrix regularization: ARZF allows the application of different regularization for different users and layers corresponding to their singular values that also include UE path loss.
Using r = V W x + S −1 U n, let's write out the quadratic error of the received and sent symbols r and x: Let's introduce the inverse noise covariance matrix S into the definition of norm, and we get the following weighted least squares function (40): So, ARZF provides optimum to the problem 40 (Theorem 2.8).
Proof. Calculating the gradient and equating it to zero, we get: The last identity may be proved using multiplication by the (V V H + λS −2 ) matrix from the right side of the identity and by the (V H S 2 V + λI) from the left side. (41) is not common, and it can not be formulated in the form similar to (35). Nevertheless, such a problem statement is a more adequate approximation to sum SE maximization problem (24): because ARZF provides a larger sum SE than RZF of WRZF. Detailed simulation results are given below.

Remark 5. Optimization problem
Remark 6. The proposed algorithm W ARZF (V ) along with Theorem 2.8 is the main result of the paper. Transmit Wiener Filter [4] algorithms have scalar regularization of the form λI, so ARZF is not from this class. As it is shown in [5], the maximum of the function of UE SINR (incl. considered the sum of SE (17)) is achieved by an algorithm with a proper diagonal regularization, and ARZF is a suboptimal heuristic of such form.
The possible interpretation of the function J S (W ) (33) is as follows. The second term λ W 2 is the standard noise regularization part and the first term, the norm S(V W − I) 2 , weighted by the matrix S, weights more for the layers with higher singular values. And, therefore, the function is optimized more precisely for the layers with a higher signal quality compared to the layers with lower signal quality. In other words, precoding vectors for layers with higher singular values become similar to Zero-Forcing precoding, and for layers with lower singular values become similar to Maximum-Ratio Transmission, i.e. ARZF provides adaptive regularization. In the next section, we will see that this approach leads to a uniform increase in spectral efficiency compared to the basic method with unit weights.
Let us study the relation of ARZF with other algorithms. Firstly, we see that regularization parameter in WRZF is the (arithmetical) average of ARZF regularization: In the case when path-losses of all UE are similar s l ≈ s, l = 1, ..., L ARZF and WRZF provide a similar result. Secondly, the relation between W RZF and W ARZF precoding is formulated as follows: Theorem 2.9. Assume the channel matrix has the form of H = U H SV (Lemma 2.1) and denote F = U H = SV . Then, the following relation for ARZF holds: Proof.
Right factor S in the r.h.s. of (43) can be interpreted as a special type of Power Allocation algorithm (see an interesting study in [12, sec. 7]) that distributes the total transmission power between layers. In practice, it is better to use W RZF (V ) rather than W RZF (F ), because the norms of the rows of the matrix F F H + λI can differ sufficiently (up to several orders!), which leads to unbalanced power distribution between layers (as state Theorem 2.8 another way is to apply a proper Power allocation for W RZF (F ). On the other hand, the regularization parameter of W RZF (F ) is more natural and correct. The proposed W ARZF (V ) combines the benefits from these two approaches and provides an envelope of them.
Remark 7. ARZF formula (38) can be found using the PCA-decomposition [32], which is stated in the Theorem 2.8.
In this work we prove theoretical estimates of the number of conditionality of the inverse covariance matrix of the ARZF method and the standard RZF method, which is important for systems with fixed computational accuracy and are also related by the ratio: The assumption that the matrix of singular user vectors is close to unitary is valid under low correlation user selection: V H V (ε) = I + O(ε), with ε → 0. Then the matrices under study are reduced to diagonal: Their conditionality functions are equal (45). By comparing these functions, we get transition points where it is clearly better to use the first formula λ < s 2 min < s 2 max : χ(A) < χ(B), it is better to use the second s 2 min < s 2 max < λ: χ(A) > χ(B).

Remark 8.
In the situation s 2 min < λ < s 2 max nothing can be said and further investigation is required. Note that only case 1) λ < s 2 min < s 2 max is used in real networks. Singular vectors whose singular values are smaller than the noise power are not used, so the condition χ(A) < χ(B) is always satisfied.
The mathematical notation W ARZF = V H A −1 improves the conditionality of the system under low to medium noise λ conditions, which makes the algorithm computationally more accurate under limited discharge grid conditions compared to another mathematical notation of the method W RZF = SV H B −1 . An experimental comparison of the conditionality value is shown in Fig. 4.

Asymptotic properties of ARZF
Using the Neumann series [33], we formulate the following Lemma 2.11. Consider square Hermit matrices A and B of the same size and full rank. Introduce a small parameter 0 < ε 1. The following and same size square matrices the following asymptotic holds: where Proof. We are looking for the inverse in the form of formal series (A + εB) −1 = A 0 + εA 1 + . . . . Calculating product of A + εB and its inverse, we get the chain of equations for each power of ε: The following theorem 2.12 gives the asymptotic properties of ARZF.
Theorem 2.12. Assume that matrix V V H is of full rank equals to L. It holds: Proof. To get asymptotics as λ → 0+ one should apply Lemma 2.11 with A = V V H and B = S −2 , asymptotics as λ → +∞ are given by Lemma 2.11 with A = S −2 , B = V V H , ε = λ −1 . Also note that normalizing coefficient µ (25) is different for different precoding algorithms.
Property at low noise λ → 0+ is the same as for W RZF . Another asymptotic from Theorem 2.12 means that, when noise is greater than signal even for UE with the best channel, ARZF only serves UE with the best channel.

Gradient-Based Optimal Regularization (OPT)
The proposed algorithm provides minimum to quadratic optimization problem (41), but still, there is a question: how good is it w.r.t. to sum SE function (19)? In [5] it is proved that optimal solution to the function f (SINR 1 , ..., SINR K ) with total power constraint is given by algorithm of the form W OPT (V ) = µV H (V V H + R) −1 P , R = diag(r 1 , . . . , r L ), P = diag(p 1 , . . . , p L ).
(51) It is hardly possible to explicitly express the particular values of R, P for optimal precoding but the structure itself is useful. According to this result, we are going to study the effectiveness of the proposed ARZF algorithm comparing it with gradient search over the sum of SE (17).
To this end we formulate an iterative solution with differentiable embedded parts. The proposed parametric solution preserves the structure of the basic RZF algorithm and optimizes the target functional of SE, which leads to a significant improvement in quality. We set the constrained smooth optimization problem. The problem is to find the local maximum of the Spectral Efficiency Algorithm 1: On the Optimal Precoding Regularization R Input: Channel H and its decomposition H = U H SV by Lemma 2.1, station power P , noise σ 2 , # of iterations N ; Define precoding function W (R) using (53); Define detection function G(R) = G MMSE (W (R)) using (9); Define target function J SE (R) = SE(W (R), H, G(R), σ 2 ) of (14), (16), (19); Set tolerance grad ε g , termination tolerance on first order optimality (1e-5); Set tolerance change ε c , termination tolerance on function and parameters (1e-9); Initialize regularization by R 0 = Lσ 2 P S −2 ; if Convergence conditions hold on ε g or ε c then return Calculate L-BFGS direction [34] using the gradient: Optimize scalar step length α n = arg max α J SE (R n + αD n ); Make the optimization step: R n+1 = R n + α n D n ; end end return W OPT = W (R N ) A parametric solution uses the RZF formula as follows: We restrict the maximum power of antennas by multiplying the precoding matrix by the scalar, which allows us to satisfy the power constraints and save the geometry and desired properties of the constructed precoding.
In further experiments, we will make the real diagonal matrix R ∈ R L×L differentiable and optimize it for the target Spectral Efficiency functional, which is one of our contributions. The optimization of precoding matrix W is given in the corresponding article [35].
Detection is involved in the calculation of the gradient, it can be considered as an integral part of the Spectral Efficiency. That is, the differentiable variables here are the diagonal of the regularization matrix, then the precoding matrix is calculated using the regularization, then this precoding is substituted into the detection. Finally, the calculated precoding and detection are both substituted into the gradient. The regularization diagonal is involved in all these operations as an internal variable of a composite function, and its gradient can be calculated using a chain rule, backpropagation algorithm, as it happens using automatic differentiation of the PyTorch library. Remark 9. In considered gradient optimization precoding is constructed, assuming some particular detection that could be performed on the UE side (namely, MMSE detection). This idea is widely discussed in state-of-the-art (see e.g. [10]) and can be used to improve merely any precoding policy with a corresponding iterative procedure.

Setup in Quadriga
In this section, we describe how we obtain data from Quadriga, open-source software for generating a realistic radio channel. The considered scenario is Urban Non-Lineof-Sight 3GPP 38.901 RMa NLOS [23,36]. Fig. 5 shows an example of user placement, and users are assigned either in a cluster of one of the buildings or on the ground next to a building. The users are highlighted with blue circles, and the base station -one in red. The distances between the individual antennas of the station and the users are negligible compared to the distances between the users and the station, so the station and users are depicted as separate circles, each containing multiple antennas within it.
The overall procedure is as follows, for each random seed: (1) We generate a random environment around the base station; (2) We select random user positions near the base station; (3) We select relevant users based on correlations.
Next, we describe the procedure in detail. First, we fix the position of the base station at the coordinates [0, 0, 25] and set the random seed in Quadriga [9] to generate a random environment. Second, we select the random user positions around the base station. To do this, we locate users in the urban landscape (see example at Fig 5): (1) We sample up to 8 cluster centers x c , y c in the 120 • sector from the base station within 2000m from the base station. Each cluster represents a part of a city building; (2) We assign a random cluster height z c = 1.5m + (3 · U ({1, . . . , 10}) − 1) , selecting the cluster floor in a building from the uniform distribution U ; (3) For each user, we assign a cluster id c(u) and sample x u , y u position randomly over the 60m circle around the cluster centre; (4) We sample the height of each user in addition to the height of the cluster, at the same time, 80% of users are placed near the cluster floor z u = z c(u) + 3 · U ({−1, 0, 1}) m. and 20% of users are placed outdoors z u = 1.5m.
Third, after generating channel matrices for a fairly large number of users K max = 64, we perform the user selection to choose subsets of users that will be served together (this procedure simulates Scheduler). We select one subset of K < K max users, such that there is no pair of too high correlated users (in real life users with high correlations are served at a different time or on different frequency intervals). The correlation between users i, j is measured as the squared cosine between the main singular vectors: The number of transmitted symbols for each UE is L k = 2 (it is the simplest Rank Selection policy).
We also consider two different situations (scenarios): (1) When UE have different path-losses (PL), when we add random factor PL ∈ [−10dB, 10dB], which is the realistic channel variance for close UE (in real networks path-loss variance for UE within one cell can be up to 60dB). (2) When UE have almost equal path-losses, i.e. first singular values of each two UE are the same s i,1 ∼ s j,1 (this is the default option for Quadriga channel generation); Presented results are the average over 40 considered random realizations of channel matrix and UE subsets. Namely, for each scenario we generate 40 different channels: H ∈ C K×Rk×T : • T = 64 base station antennas; • K = 4 users; • R = 16 total user antennas; • L = 8 total user layers.
The carrier frequency for each channel matrix is selected randomly over the bandwidth. The base station antenna array forms a grid with 8 placeholders along the y axis and 4 placeholders along the y axis, the receiver antenna array consists of two placeholders along the x. Each placeholder contains two cross-polarized antennas. Our user generation algorithm produces realistic setups for the Urban case and can be convenient for other studies. An interested reader can find detailed hyperparameters for antenna models and generation processes in Table 6 in Appendix.

Results and Discussion
We compare the proposed algorithm (ARZF) with reference algorithms (MR, ZF, RZF, WRZF) and the estimation of the precoding with optimal diagonal regularization (OPT). In Fig. 6 (Tab. 2) and Fig. 7 (Tab. 3) the Average Spectral Efficiency (19) and the Minimal SE (20) of the described algorithms are presented for the different path-loss scenarios. In Fig. 8 (Tab. 2) and Fig. 9 (Tab. 5) the same quality functions are compared for the equal path-loss scenario.
ARZF (the red line, 38) provides the best Average Spectral Efficiency (19) compared to all other analytical methods up to the highest SUSINR region (see Fig. 6, 8). The advantages of the ARZF algorithm are revealed due to adaptive regularization for a specific path-loss (and thus singular values order) for each user. In detail, the situation is as follows. ZF (W ZF (F ), F = SV and W ZF (V )) are better than MRT (W M RT (V )) on high SUSINR region, and are worse for low SUSINR (W ZF (V ) is always better than W ZF (F )). RZF is better than both MRT and corresponding ZF, e.g. W RZF (V ) is better than W M RT (V ) and W ZF (V ). One can say that RZF provides an envelope of MRT and ZF. It is worth saying that although W RZF (V ) outperforms W RZF (F ) on high SUSINR region W RZF (F ) is better for low SUSINR. Due to the adaptive approach, ARZF makes an envelope of both RZF algorithms. It is quite natural that on equal path-loss scenario, WRZF and ARZF work similarly: it happens because all the singular values are almost the same. Well, on the different path-loss scenarios, WRZF degrades significantly and ARZF is better.
One may notice, that the gradient-based iterative method OPT (the black line, (1)) provides superior results. On the other hand, OPT is computationally expensive and cannot be used in practice. However, OPT is very good for upper bound estimation: the results for OPT show that the ARZF method can be improved in further research. We also measure the quality of Minimal Spectral Efficiency (20) (see Fig. 7,9) to investigate the performance of the weakest user.
These results show that the ARZF method, configuring the regularization in a special adaptive way, outperforms the Average SE at the expense of Minimal SE, i.e. of the weaker users. This follows from the fact that in terms of the Minimal SE function, the ARZF method can lose compared to other algorithms, especially in the low SUSINR region. The Average SE quality of the system, however, increases significantly. The contradiction between Average and Minimal SE functions is wellknown (see. e.g. [12]). Fig. 8 (Tab. 4) and Fig. 9 (Tab. 5) represent Average and Minimal scenarios and show the experimental results for users with almost equal path-losses. The general tendency is the same as for different path-loss scenarios: The proposed ARZF method outperforms all other analytical references in the Average SE, but it is not the best from the Minimal SE point of view. In this scenario the gain in Average SE is not so big, particularly, ARZF behaves almost the same way as WRZF. The last observation in Fig (8 (4)) is very important. The red line of W ARZF (V ) coincides with the blue line of W W RZF (V ). This result shows that both algorithms work the same when users have equal path-losses, which is also confirmed by theoretical calculations

Conclusions
Multi-user precoding optimization is a key problem in modern cellular wireless systems, which are based on massive MIMO technology. In this paper, we analyze the performance of different transmission precoding techniques in a downlink multi-user scenario. Linear techniques are computationally less expensive. On the other hand, non-linear techniques can provide better performance. The first technique that we propose in this paper is a low-complexity heuristic formula of the Adaptive Regularized Zero-Forcing (ARZF) algorithm. This technique is especially attractive in cases when the users and the base station are equipped with multiple antennas and have various path losses. We study analytically the relation of ARZF with known RZF-like algorithms and its asymptotic properties. Finally, we study the properties of the proposed ARZF method on simulations, using a realistic Quadriga channel. Simulations show uniform improvement over the reference methods to Average Spectral Efficiency for all considered scenarios. This particularly means that weighted MSE problem statement (41) is a more adequate approximation of (24) than the standard MMSE statement. Minimal Spectral Efficiency function is not the best, but is still acceptable over ARZF. We also introduce a non-linear technique Gradient-Based Optimal Regularization (OPT) that performs gradient optimization on the target Spectral Efficiency function and finds the optimal diagonal regularization matrix this way. This algorithm can be used for the upper bound study.

Availability of data and materials
The Quadriga-generated channel matrices, as well as the full Python implementation of the proposed approach and baselines, are publicly available in the Github repository at https://github.com/eugenbobrov/ Adaptive-Regularized-Zero-Forcing-Beamforming-in-Massive-MIMO-with-Multi-Antenna-Users

Authors Contribution
EB and BC formulated the proposed formula of ARZF; EB, BC, DM and DY formulated, proved, and wrote the theoretical results section; EB did the main work of conducting the experiments and drafting the manuscript; BC also did the main work of the literature review and the introduction; EB and DM conceived the study; ST helped with dataset preparation; VK and DM were involved in methodology; HL and DZ contributed to project administration and supervision. All authors read and approved the final manuscript.

Acknowledges
Authors are grateful to E. Barinova and E. Levitskaya for support.

Funding
Authors have received research support from Huawei Technologies.

Competing Interest
The authors declare that they have no competing interests.      Fig. 6 and shows the Average Spectral Efficiency of the different precodings in the Urban NLOS scenario using different path-losses. Proposal algorithm is W ARZF (V ). Optimal regularization W OPT (V ) was configured using the L-BFGS optimization algorithm.