A Memory-Efficient Learning Framework for Symbol-Level Precoding with Quantized NN Weights

This paper proposes a memory-efficient deep neural network (DNN) framework for symbol-level precoding (SLP). We focus on a DNN with realistic finite-precision weights and adopt an unsupervised deep learning (DL) based SLP model (SLP-DNet). We apply a stochastic quantization (SQ) technique to obtain its quantized counterpart, called SLP-SQDNet. The proposed scheme offers a scalable tradeoff between performance and memory by quantizing a tunable percentage of the DNN weights, and we explore both binary and ternary quantization. Our results show that while SLP-DNet provides near-optimal performance, its SQ-based quantized versions yield 3.46x and 2.64x model compression for the binary-based and ternary-based SLP-SQDNets, respectively. We also find that our proposals offer 20x and 10x reductions in computational complexity compared to the optimization-based SLP and SLP-DNet, respectively.


I. Introduction
Precoding using the known channel state information (CSI) at the transmitter has been proven to be an efficient interference management technique in downlink multiuser multiple-input-single-output (MU-MISO) communication systems [1] [2]. Precoding also moves much of the complex signal processing to the base station (BS), which simplifies the users' terminals. Classical block-level precoding (BLP) schemes, where the precoding coefficients are applied across a block of symbols (codewords), are less computationally expensive than the optimal dirty paper coding (DPC) but suffer performance degradation [3]. Masouros and Alsusa [4] first proposed a method for classifying instantaneous interference into constructive and destructive. Suboptimal precoding strategies that exploit constructive interference (CI) were first introduced in [5]. Precoding methods based on optimization are appealing because they can be tailored to various performance targets. An optimization-based CI precoding was first introduced using a quadratic optimization strategy in light of vector perturbation precoding [6].
To further improve the performance, a precoding design termed symbol-level precoding (SLP), which exploits the multiuser interference via CI with the known CSI and transforms it into useful power at the mobile user end, has received a lot of attention [7]-[11]. The CI-based solution is suitable for practical implementation, and it has been shown that massive multiple-input-multiple-output (m-MIMO) systems can take advantage of CI with SLP [12]-[14]. Because of its performance gains over BLP schemes, the idea of CI combined with optimization has been applied in many wireless physical layer designs to achieve different objectives, such as transmit power minimization and SINR balancing [15]-[19]. A closed-form precoding design with optimal performance for CI exploitation in the MISO downlink, covering optimization with both strict and relaxed phase rotations, was proposed by Li and Masouros [19]. While CI-based precoding methods offer superior performance, computing them online on a symbol-by-symbol basis can be computationally demanding.
As a result of the proliferation of machine learning algorithms, the model-driven deep learning (DL) technique that exploits expert knowledge has been applied to many wireless communication problems due to its explicability, reliability, and low computational complexity [20]-[22]. Therefore, DL-based precoding designs that use domain knowledge have recently been proposed for MU-MISO downlink transmission [23]-[25]. However, the drawback of such methods is that the optimization constraints are not directly integrated with the loss function. Furthermore, their performance is bounded by the assumptions and accuracy of the optimal solutions obtained from the optimization algorithm. An unsupervised deep unfolding precoding design termed "SLP-DNet" [26], which utilizes the specifics of the optimization objectives of the precoding problem, has been proposed to address the issues mentioned above, and will be used as our benchmark in this work.
Typically, a DL model contains thousands or even millions of learnable parameters, usually stored in a 32-bit floating-point (FP32) numerical representation, making the model computationally and memory demanding during inference and deployment. To facilitate the online training and deployment of a trained DL model at the device edge, lightweight deep neural network (DNN) designs with lower-precision numerical formats have gained significant attention within the deep learning community, typically applied to image processing applications [27]-[30]. However, this concept has not been fully explored in wireless communications. In this work, we propose a structural simplification method for DL models through weight quantization for SLP design. We adopt the DL-based SLP model (SLP-DNet) introduced by Mohammad et al. [26]. Our contributions are summarized below:
• We propose a memory- and complexity-efficient DNN approach, applied to the learning-based precoding framework (SLP-DNet) [26]. Specifically, we propose an efficient model simplification via weight compression to accelerate both training and inference and to facilitate deployment on the device edge.
• We devise a scalable tradeoff between performance and inference complexity/memory efficiency by allowing a tunable percentage of the DNN weights to be quantized while retaining important weights in full precision.
• We further introduce a stochastic quantization (SQ) technique that uses the quantization error to alleviate the loss in performance caused by the nonhomogeneous quantization errors of conventional extreme quantization (binary and ternary). In the SQ technique, a fraction of the neural network (NN) weight matrix is quantized to a lower resolution while the remainder is retained in full precision, resulting in a hybrid quantized weight matrix. The technique yields a memory-efficient DL-based SLP model with a good balance between performance and computational complexity.
arXiv:2110.06542v2 [eess.SP] 28 Nov 2021
The remainder of the paper is structured as follows: The system model and a review of the relevant precoding techniques are presented in Section II. Section III gives the preliminaries of NN weight quantization. We introduce a technique for designing compressed unsupervised learning-based SLP schemes in Section IV. Simulations and results are presented in Section V. Finally, Section VI concludes the paper.
Notations: We use bold uppercase symbols for matrices, bold lowercase symbols for vectors and lowercase symbols for scalars. Operators $\|\cdot\|_2$, $\|\cdot\|_1$ and $|\cdot|$ denote the $\ell_2$-norm, the $\ell_1$-norm and the absolute value, respectively. The symbol $\Omega_i$ represents the i-th trainable parameter associated with the DNN layers. Re{·} and Im{·} represent the real and imaginary parts of a complex vector/matrix, respectively. Finally, the notations L(·) and D(·) are used for the loss and parameter update functions, respectively.

A. System Model
Consider an MU-MISO downlink transmission in a single-cell scenario where an M-antenna BS serves K single-antenna users. The data is transmitted to the users over flat-fading Rayleigh channels denoted by $h_i \in \mathbb{C}^{M \times 1}$. The received signal at the i-th user is given by (1), where $h_i$, $u_i$ and $n_i$ represent the channel vector, precoding vector and additive white Gaussian noise for the i-th user, respectively.
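For reference, a standard form of the received-signal model in the MU-MISO downlink literature is sketched below. The data symbol $s_k$ of user $k$ and the use of the plain transpose (some works use the Hermitian transpose instead) are notational assumptions made here for illustration; $h_i$, $u_k$ and $n_i$ follow the definitions in the text.

```latex
% Assumed standard MU-MISO downlink model; s_k and the transpose convention are illustrative.
y_i = \mathbf{h}_i^{T} \sum_{k=1}^{K} \mathbf{u}_k s_k + n_i, \qquad i = 1, \dots, K
```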

B. Symbol Level Precoding Power Minimization
The CI precoding scheme enhances symbol detection by pushing the received signals away from the constellation detection boundaries without consuming extra transmission power [8]. As an illustration, Fig. 1 shows a symbolic example representing the constellation point 1 + j in QPSK. The green shaded area depicts the constructive region of the constellation based on the least distance (χ) from the decision boundaries, whose value is determined by the SNR constraints. This allows the interfering signals to align constructively with the symbol of interest through the precoding vectors. If the maximum angle shift in the CI region is zero (θ = 0), the interfering signals overlap entirely with the signal of interest, and the problem reduces to a strict phase angle optimization. It is important to note that the strict phase formulation is not appealing because it yields an increase in the transmission power compared to the corresponding relaxed version [31]. For simplicity, we adopt the definitions of [8]; for the details and rationale of these definitions, the reader is referred to [4]. We define the channel and precoding matrices, respectively, as $H = [h_1, \cdots, h_K]$ and $U = [u_1, \cdots, u_K]$. Therefore, for an M-phase shift keying (M-PSK) modulation scheme, where M is the modulation order, the optimization-based SLP for a nonrobust multicast power minimization is given by (2) [26], where θ is the maximum phase shift in the CI region.
To avoid repetition, we refer the reader to [8] [26] for details and the description of equivalent robust formulations under channel uncertainty.

C. Learning-Based SLP for Power Minimization (SLP-DNet)
This work is based on the unsupervised deep unfolding framework that unfolds the interior point method (IPM) 'log' barrier function for problem (2) by reformulating it as unconstrained subproblems per user, expressed as in (3). There, $B(u_i)$ is the logarithmic barrier function, formed from the sum $\sum_{i=1}^{t} \ln(g(u_{1i}))$ of the constraint functions $g(\cdot)$, $\upsilon$ is the Lagrangian multiplier related to the inequality constraints, and $t$ is the number of optimization variables. To derive the learning architecture based on the IPM, we define a proximity barrier of (3) as in (4), where $u_0$ is the initial precoding vector and $\gamma \in (0, +\infty)$ is the training step size. The precoding vector at every $l$-th iteration is obtained from the learning update rule (5). The parameter $\lambda$ is introduced as an additional constraint to provide more stability to the learning architecture. Intuitively, NN cascade layers can be formed from (5) as in (6), where $W_l$ and $b_l$ are the weight and bias parameters, respectively, and the nonlinear activation functions are denoted by $\Pi_l$.
The SLP-DNet structure shown in Fig. 2 is built based on (6) and Algorithms 1 and 2 of [26]. As shown in Fig. 2, SLP-DNet has two main units: the parameter update module (PUM) and the auxiliary processing module (APM). The PUM has three core components associated with the Lagrangian multiplier ($\upsilon$), the auxiliary parameter ($\lambda$) and the training step size ($\gamma$), which are updated according to (7) and (8). The structure related to the inequality constraint in (2) is the proximity barrier term. It is constructed with one convolutional layer, an average pooling layer, a fully connected layer, and a softplus layer that constrains the output to a positive real value to satisfy the inequality constraint.
The loss function over N batch training samples (the batch size, or the number of channel realizations) is the Lagrangian function given in (9), where $\Omega_i$ are the trainable parameters (weights and biases) of the DNN layers, and µ > 0 is the penalty parameter that controls the bias and variance of the trainable coefficients. The optimal precoder (10) is obtained from the Lagrangian function (9).

D. Robust SLP-DNet
In a similar fashion to the above, we can derive a CSI-robust SLP-DNet from the robust SLP formulation under worst-case CSI error. The robust SLP is given by (11) [26]. For convenience, we introduce the notations $Q_1 = (\Upsilon - I_{2M}\tan\theta)$ and $Q_2 = (\Upsilon + I_{2M}\tan\theta)$, and $\varsigma^2$ is the CSI error bound. Problem (11) is a second-order cone program (SOCP) and can be solved using a convex optimization software package.
It is important to note that the structure of the robust SLP-DNet is obtained by following steps similar to (3)-(8) of Subsection II-C, transforming (11) into its equivalent unfolded IPM 'log' barrier form. The loss function (12) is obtained from the Lagrangian of (11), and the optimal precoder (13) can be easily obtained from (12). Note that the Lagrange multipliers $\upsilon_1$ and $\upsilon_2$ are associated with the barrier term and are randomly initialized from a uniform distribution.

III. Preliminaries of NN Weight Quantization
Traditionally, DNNs are designed with full-precision weights and activations. This can result in significant memory consumption and computational complexity. For this reason, there has been a recent drive to reduce the DNN model size, originating in image processing research [27]. DNN acceleration techniques can be broadly classified into three categories:
i. Structured simplification: This involves a systematic approach of network factorization (factorizing a convolutional layer into several more efficient ones), channel pruning and sparse connections to reduce the size of the DNN model [32].
ii. Optimized implementation: This approach uses the Fast Fourier Transform (FFT), based on NVIDIA's cuFFT library, to provide significant speedups [33].
iii. Quantization: In this technique, the computations involving weights, activations, and sometimes input tensors are performed at lower bit-widths than floating-point precision [27].
Among the above three model simplification techniques, quantization is the most appealing because, in addition to model reduction, most multiply-accumulate (MAC) operations required to compute the neurons' weighted sums are replaced by simple binary operations (bit-wise or XNOR operations). Quantization improves both training and inference efficiency and reduces hardware requirements during model deployment on edge devices.
Typically, the weights of the l-th layer of a DNN architecture are represented by $W = \{W_1, \cdots, W_m\}$ [28], where m is the number of kernels/filters (output channels). In the l-th convolutional layer, each weight tensor $W_i \in \mathbb{R}^n$ is n-dimensional with $n = c \times w \times h$, where c, w and h represent the number of input channels, the filter width and the filter height, respectively; for a fully connected layer, $n = m \times c$ (the numbers of output and input neurons, respectively). For convenience, in what follows, we drop the kernel subscript.
1) Binary Weights: The real-valued weights are converted to binary values, $B_w \in \{+1, -1\}^n$. A full-precision 32-bit weight matrix is binarized as in [34]. A more robust binary-weight network (BWN) is proposed as an extension of the straightforward binary network (BinaryConnect) by introducing a real scaling factor $\beta \in \mathbb{R}^+$ such that $W \approx \beta B_w$; solving the associated optimization problem [28] yields the closed-form solution (16).
2) Ternary Weights: A ternary weight network (TWN) is one in which an extra 0 state is introduced into the BWN, leading to the optimization problem (17) [35]. Solving (17) gives the ternary weights (18), where $\delta = \frac{0.7}{n}\sum_{i=1}^{n} |W_i|$ is the ternarization threshold and $\beta^*$ is the corresponding optimal scaling factor.
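As an illustration of the two quantizers, the following minimal Python sketch implements the closed-form BWN and TWN rules as they are commonly stated in the cited works [28], [35]: the binary scaling factor is the mean absolute weight, and the ternary scaling factor is the mean absolute value of the weights whose magnitude exceeds δ. The function names, the convention sign(0) = +1 and the example weights are assumptions made for this sketch.

```python
def binarize(weights):
    """BWN-style quantization: W ~ beta * B with B in {+1, -1}^n (assumed form)."""
    n = len(weights)
    beta = sum(abs(w) for w in weights) / n            # beta = ||W||_1 / n
    signs = [1 if w >= 0 else -1 for w in weights]     # B = sign(W), sign(0) := +1
    return beta, signs


def ternarize(weights, threshold_factor=0.7):
    """TWN-style quantization: W ~ beta * T with T in {+1, 0, -1}^n (assumed form)."""
    n = len(weights)
    delta = threshold_factor * sum(abs(w) for w in weights) / n   # threshold delta
    codes = [1 if w > delta else -1 if w < -delta else 0 for w in weights]
    kept = [abs(w) for w in weights if abs(w) > delta]            # weights above delta
    beta = sum(kept) / len(kept) if kept else 0.0                 # mean magnitude of kept weights
    return beta, delta, codes


# Hypothetical example weights for one filter.
weights = [0.9, -0.1, 0.4, -0.6]
beta_b, signs = binarize(weights)
beta_t, delta, codes = ternarize(weights)
```

In this example the small weight -0.1 falls below the threshold and is mapped to the extra 0 state, which is what distinguishes the ternary from the binary quantizer.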

IV. Proposed Low-bit SLP-DNet Design
A. Low-Bit Weights and Stochastic Division
Existing works on low-bit DNN design focus only on reducing the bit-widths of the weights and activations to speed up training and inference and to improve memory efficiency. However, the impact of quantization on the performance of the learning algorithm in low-bit DNN designs has not been fully explored and understood. In this work, we adopt the quantization technique proposed in [30] and propose a simple linear probability function for selecting the filter weights to be quantized, in order to design a low-bit scalable learning-based precoder.
The weight matrix of each layer of the DNN can be expressed as $W = \{W_1, \cdots, W_n\}$. Here, the rows of the weight matrix are partitioned into two parts according to (19), where $W_q = \{W_{q1}, \cdots, W_{qM}\}$ and $W_f = \{W_{f1}, \cdots, W_{fN}\}$ represent the quantized and full-precision parts of the weights, respectively; the two parts must be disjoint and together make up $W$. As seen from (19), one subset of the weights, $W_q$, is quantized to a low bit-width while the remainder, $W_f$, is kept in its full-precision form, so that the entire weight matrix is composed of both binary and floating-point values. Note that a fully quantized DNN can be obtained by setting $W_f$ to the empty set. Suppose $r_{sq}$ is the quantization ratio (QR), i.e., the fraction of the DNN weights that are quantized, and n is the length of the weight matrix (number of elements); then the number of elements in the quantization group is $M_q = r_{sq} n$, while that of the full-precision part is $M_f = (1 - r_{sq}) n$. The QR can be gradually increased to 100%, at which point the entire network is quantized. To select the channel to be quantized, we adopt a lottery disc algorithm as in [30]. As shown in Fig. 3, each sector of the disc represents the probability of selecting a channel (a row of the weight matrix). The disc is rotated by choosing a value from the uniform distribution whose magnitude is slightly above the probability value. After every selection, the probability of the selected channel is reset (i.e., $p_j = 0$) to ensure that channels are selected without replacement, as summarized in Algorithm 1.

Algorithm 1 Circular Lottery Algorithm for Weight matrix Division
Input: Stochastic quantization ratio $r_{sq}$ and weight matrix $W$
Output: $W_q$ and $W_f$
1: Initialization:
Select a random value $\vartheta_j \in [0, 1]$ from a uniform distribution
Compute: $W_q = W_q \cup W$
13: Reset $pr_i = 0$ {This is to avoid the i-th channel weight being selected again}
14:
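The selection-without-replacement idea behind the circular lottery can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a transcription of Algorithm 1: the function name `lottery_partition`, the rounding of $r_{sq} n$ to an integer, and the way the disc is "spun" (a uniform draw over the remaining probability mass, followed by a cumulative scan) are choices made for this sketch.

```python
import random


def lottery_partition(probabilities, quantization_ratio, rng=random):
    """Split row indices into a quantized group and a full-precision group."""
    probs = list(probabilities)
    n_rows = len(probs)
    n_quantized = int(round(quantization_ratio * n_rows))   # assumed rounding of r_sq * n
    quantized = []
    for _ in range(n_quantized):
        total = sum(probs)
        if total <= 0.0:
            break                                  # nothing left to draw from
        spin = rng.uniform(0.0, total)             # spin the disc over the remaining sectors
        cumulative = 0.0
        chosen = None
        for index, p in enumerate(probs):
            if p <= 0.0:
                continue                           # sector already removed
            cumulative += p
            chosen = index
            if spin <= cumulative:
                break
        quantized.append(chosen)
        probs[chosen] = 0.0                        # reset so this row is not selected again
    full_precision = [i for i in range(n_rows) if i not in quantized]
    return sorted(quantized), full_precision


# Hypothetical selection probabilities for four rows, with QR = 50%.
random.seed(0)
rows_q, rows_f = lottery_partition([0.4, 0.3, 0.2, 0.1], 0.5)
```

Rows with larger probability are more likely to land in the quantized group, and resetting a selected sector to zero guarantees that the two groups are disjoint.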

B. Quantization Error and Quantization Probability
Recall that classical binarized DNNs suffer a significant performance loss due to the heterogeneous nature of the quantization error (QE) over the entire network. The performance can, however, be improved by stochastically selecting the filter or channel weight matrix to be quantized using a random probability distribution based on the QE between the real-valued and quantized weights, where the quantized weights $Q_j^*$ could be binary or ternary based on (16) or (18). We define the QE vector over the rows of the weight matrix of a given layer as $e = [e_1, \cdots, e_n]$. The quantization probability is formulated such that a higher probability is assigned to a filter/weight if its quantization error is small, because quantizing these weights does not yield a significant loss of accuracy or performance. For a given weight matrix, QR, and quantization probability (QP), a channel is randomly sampled without replacement using the circular lottery of Algorithm 1. The QP function is therefore inversely proportional to the QE and is defined through $f_p = \frac{1}{e + \delta}$, where $\delta = 10^{-6}$ avoids possible numerical overflow when the QE is zero. The QP function is monotonically non-decreasing in $f_p$ so as to prioritize the selection of the channels/weights to be quantized. Several monotonically non-decreasing functions can be used, namely the uniform (constant), linear, softmax and half-Gaussian functions. The simplest of these QP functions is the uniform or constant function, but it is not appealing because it is independent of the QE and therefore ignores the random quantization proposition. The most intriguing is the half-Gaussian function because of its extra parameter (σ), which can be learned, but it is more complicated. The linear and softmax functions have been found to yield nearly the same performance, but the former is simpler to implement. Accordingly, in this work, we use the linear function because it balances performance and simplicity.
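The mapping from quantization error to selection probability can be sketched as below. The relative $\ell_1$ error used for $e$ is an assumption borrowed from common practice in the stochastic quantization literature, and the linear QP function is taken here to be $f_p$ normalized to sum to one; the function names and example rows are hypothetical.

```python
def quantization_errors(rows, quantized_rows):
    """Assumed QE: relative L1 distance between each row and its quantized version."""
    errors = []
    for row, q_row in zip(rows, quantized_rows):
        diff = sum(abs(w - q) for w, q in zip(row, q_row))
        norm = sum(abs(w) for w in row)
        errors.append(diff / norm if norm > 0 else 0.0)
    return errors


def linear_quantization_probabilities(errors, delta=1e-6):
    """Linear QP (assumed normalization): p_i proportional to f_p = 1 / (e_i + delta)."""
    scores = [1.0 / (e + delta) for e in errors]   # small error -> large score
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical rows and their scaled binary versions (beta = mean |w| = 0.5 for both).
rows = [[0.5, -0.5], [0.9, -0.1]]
binary_rows = [[0.5, -0.5], [0.5, -0.5]]
errors = quantization_errors(rows, binary_rows)
probs = linear_quantization_probabilities(errors)
```

The first row is reproduced exactly by its binary version, so it receives almost all of the probability mass and would be quantized first by the lottery.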

C. Low-bit Activation Function
The inputs to convolutional and fully connected layers are often the outputs of the previous layers' activations. In many low-bit DNN designs, the activation layer is left in full precision. However, quantizing the activation layer is crucial for replacing the floating-point operations with more efficient binary operations. Conventional activation functions such as ReLU may not be suitable for low-bit DNNs [36]. Therefore, the activations are quantized from 32-bit ($u_{32}$) to $k$-bit according to (22), where $W_{32}$ is the floating-point activation bounded by the input dimension (x, y) and k = 2. The activations are not stochastically quantized because, unlike the weights, the activations do not have learnable parameters.

1) SLP-DNet and Classically Quantized SLP-DNet:
SLP-DNet is trained in the same way as its classically quantized versions based on binary and ternary weights (SLP-DBNet and SLP-DTNet). Each PUM block contains three main components and is trained block-wise for k iterations. Similarly, the APM is trained for r iterations, and the numbers of training iterations of the PUM and APM need not be equal. The PUM is trained for 20 iterations and the APM for 10 iterations. To improve the training efficiency, we scale the learning rate by a factor $\alpha \in \mathbb{R}^+$ at every training step, using a stochastic gradient descent algorithm with the Adam optimizer [37].

2) Stochastic Quantized SLP-DNet (SLP-DSQNet):
The SLP-DSQNet training is slightly different from that of SLP-DNet. The training is summarized in four stages: stochastic weight matrix division, forward propagation, backward propagation, and parameter update. Given the QR, the weight matrix is partitioned into a quantization group and a full-precision group using Algorithm 1. A hybrid weight matrix is then formed containing the quantized and the real-valued weights, and it provides a better gradient direction than purely quantized weights. If $W_{qf}$ is the composite weight matrix, the weight update with respect to the composite gradients is given by $W^{r+1} = W^{r} - \eta \frac{\partial L}{\partial W^{r}_{qf}}$. We train the network with different QRs, each of which is fixed for all the training iterations and for inference.
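The update rule above can be illustrated with a toy Python sketch in which the forward pass uses the hybrid matrix $W_{qf}$ while the gradient step is applied to the real-valued weights. Everything specific here is an assumption for illustration: a single linear layer, a squared-error loss standing in for the Lagrangian loss, scaled binary quantization of the selected rows, and the names `hybrid_weights` and `sq_update_step`.

```python
def binarize_row(row):
    """Scaled binary version of one row: beta * sign(w), with beta = mean |w|."""
    beta = sum(abs(w) for w in row) / len(row)
    return [beta if w >= 0 else -beta for w in row]


def hybrid_weights(weights, quantized_rows):
    """Composite matrix W_qf: quantized rows where selected, real-valued rows elsewhere."""
    return [binarize_row(row) if i in quantized_rows else list(row)
            for i, row in enumerate(weights)]


def sq_update_step(weights, quantized_rows, x, target, lr):
    """One step of W <- W - lr * dL/dW_qf for the toy loss L = 0.5 * ||W_qf x - target||^2."""
    w_qf = hybrid_weights(weights, quantized_rows)
    outputs = [sum(w * xi for w, xi in zip(row, x)) for row in w_qf]   # forward pass with W_qf
    residuals = [o - t for o, t in zip(outputs, target)]
    loss = 0.5 * sum(r * r for r in residuals)
    # dL/dW_qf[i][j] = residual_i * x_j, applied to the full-precision weights.
    new_weights = [[w - lr * r * xj for w, xj in zip(row, x)]
                   for row, r in zip(weights, residuals)]
    return new_weights, loss


# Hypothetical 2x2 layer with row 0 in the quantization group.
weights = [[0.9, -0.1], [0.2, 0.4]]
new_weights, loss = sq_update_step(weights, {0}, x=[1.0, 2.0], target=[0.0, 1.0], lr=0.1)
```

Keeping the real-valued copy of every row means that a row quantized in one iteration can still accumulate small gradient updates, which is the usual motivation for updating the full-precision weights rather than the quantized ones.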
The learning is performed in an unsupervised fashion in which the loss function is the statistical mean of the Lagrangian function over the training batch. During inference, a feed-forward pass is performed over all layers using the learned Lagrangian multipliers to compute the precoding vector using (10) and (13) for the nonrobust and robust SLP formulations, respectively. Note that, except where otherwise stated, the training SINR is drawn from a random uniform distribution to enable learning across a wide range of SINR values.

E. Computational Complexity Analysis
This subsection presents the analytical evaluation of the computational costs of the proposed SLP-DSQNet precoding schemes and compares them with those of SLP-DNet, the conventional BLP, and the optimization-based SLP methods. The complexities are computed in terms of the number of real arithmetic operations involved. To derive the analytical complexity of the optimization-based SLP, we first convert the second-order cone program (SOCP) into the form (24), where $U = [u_{11}, \cdots, u_{1K}]$; $\forall i = 1, \cdots, K$. Given the target accuracy $\epsilon > 0$, the complexity of solving a convex optimization problem via the IPM is characterized by the formation ($C_{form}$) and factorization ($C_{fact}$) of the coefficient matrix of $n$ linear equations in $n$ unknowns, and is given by (25) [38], where $Q$ represents the constraint's dimension, and $M_{lc}$ and $M_{sc}$ denote the numbers of linear matrix inequality and second-order cone (SOC) constraints, respectively. The overall complexity is then given by (26). It can be observed that (24) has $K$ constraints of dimension $2M + 1$. Therefore, using (26), the total computational cost is obtained as $C_{total} = \sqrt{2M+1}\,\left[n(2M+1) + n(2M+1)^2 + n^3\right]\ln\frac{1}{\epsilon}$.
By following similar principles and steps above, we can obtain the complexities of the robust SLP and the conventional BLP schemes.
On the other hand, to determine the complexities of our proposed precoders, we first evaluate the complexities of the learning modules (PUM and APM) in terms of the arithmetic operations involved. For the PUM, there are three convolution blocks. The feature map determines the arithmetic operations for a convolution layer, which are given by the number of multiplications and additions involved in the convolution operation. The number of operations in a given convolutional layer is given by (28), where $N_h$, $N_w$, $k_f$, $C_{in}$ and $C_{out}$ denote the height and width of the input layer tensor, the filter size, and the numbers of input and output channels, respectively. It is important to note that only the first and second convolutions are quantized; the last convolution is not, to avoid losing essential features of the output precoder. Since in our proposed approach the layer weight matrix contains both floating-point and quantized entries, the quantized approximation of the convolution has $\frac{1}{32} C_{in} k_f^2 N_w N_h C_{out} \times QR$ binary operations and $C_{in} k_f^2 N_w N_h C_{out} \times (1 - QR)$ non-binary operations based on (28). Using these expressions, we obtain the generic complexity of the PUM. Similarly, the APM's complexity is determined by the cost of the feed-forward pass of the shallow CNN, as shown in Table III, and of the 'log' barrier that forms the barrier term.
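The two operation counts quoted above can be evaluated with a short Python sketch. The full-precision count $C_{in} k_f^2 N_w N_h C_{out}$ used as the base is inferred from the QR = 0 limit of the non-binary expression; the function names and the layer dimensions in the example are hypothetical.

```python
def conv_layer_operations(n_h, n_w, k_f, c_in, c_out):
    """Base operation count of one convolutional layer (inferred from the QR = 0 case)."""
    return c_in * k_f ** 2 * n_w * n_h * c_out


def hybrid_conv_operations(n_h, n_w, k_f, c_in, c_out, qr):
    """Binary and non-binary operation counts for a layer with quantization ratio qr."""
    total = conv_layer_operations(n_h, n_w, k_f, c_in, c_out)
    binary_ops = total * qr / 32.0             # quantized share, 32 binary ops per FP32 op
    full_precision_ops = total * (1.0 - qr)    # share kept in floating point
    return binary_ops, full_precision_ops


# Hypothetical layer: 8x8 input, 3x3 filters, 1 input channel, 8 output channels, QR = 50%.
binary_ops, fp_ops = hybrid_conv_operations(8, 8, 3, 1, 8, 0.5)
```

For this hypothetical layer the 4608 base operations split into 72 binary-equivalent operations and 2304 floating-point operations, which shows why the floating-point share dominates the cost until the QR approaches one.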

A. Simulation Set-up
We consider a downlink scenario in which the BS is equipped with four antennas (M = 4) and serves K single-antenna users in a single cell. We obtain the dataset from channel realizations randomly generated from a normal distribution with zero mean and unit variance. The dataset is reshaped and converted to the real-number domain using the expression $\Phi = [h_{Ri}; h_{Ii}]$, as summarized in Fig. 4. The input dataset is normalized by the transmit data symbol so that the data entries are within the nominal range, potentially aiding the training. We generate 50,000 training samples and 2000 test samples. The transmit data symbols are modulated using a QPSK modulation scheme. The training SINR is drawn randomly from a uniform distribution, $\Gamma_{train} \sim U(\Gamma_{low}, \Gamma_{high})$. Stochastic gradient descent is used with the Lagrangian function as the loss metric. A parametric rectified linear unit (PReLU) activation function is used for both convolutional and fully connected layers in the full-precision SLP-DNet, and the low-bit activation function (22) is used for SLP-SQDNet. After every iteration, the learning rate is reduced by a factor α = 0.65 to help the learning algorithm converge faster. The models are implemented in PyTorch 1.7.1 and Python 3.7.8 on a computer with the following specifications: Intel(R) Core(TM) i7-6700 CPU, 32.0 GB of RAM. Table 1 summarizes the simulation parameters, while Tables 2 and 3 list the NN component settings of SLP-DNet [26].
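The data pipeline described above can be sketched with the Python standard library as follows. Several details are assumptions made for illustration: the real and imaginary parts are each drawn with unit variance, the SINR range and the initial learning rate are placeholder values, and the helper names are hypothetical; the normalization by the transmit symbol is omitted.

```python
import random


def generate_channel(num_antennas, rng):
    """One complex channel vector, returned as separate real and imaginary parts."""
    real = [rng.gauss(0.0, 1.0) for _ in range(num_antennas)]   # zero mean, unit variance
    imag = [rng.gauss(0.0, 1.0) for _ in range(num_antennas)]
    return real, imag


def to_real_domain(real, imag):
    """Stack real and imaginary parts: Phi = [h_R; h_I]."""
    return real + imag


def sample_training_sinr(low_db, high_db, rng):
    """Training SINR drawn from U(low, high); the range itself is a placeholder."""
    return rng.uniform(low_db, high_db)


def learning_rate_schedule(initial_rate, num_iterations, alpha=0.65):
    """Learning rate after each iteration when it is scaled by alpha every time."""
    return [initial_rate * alpha ** i for i in range(num_iterations)]


rng = random.Random(0)
h_real, h_imag = generate_channel(4, rng)      # M = 4 antennas
phi = to_real_domain(h_real, h_imag)           # real-valued input of length 2M
sinr_db = sample_training_sinr(0.0, 30.0, rng)
rates = learning_rate_schedule(0.01, 3)
```

Stacking the real and imaginary parts doubles the input length to 2M, which lets a real-valued network process complex channel vectors.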

B. Performance Evaluation of QSLP-DNet and SLP-DNet
In the following set of results, we compare the performance of our proposed quantized DL-based SLP schemes against that of the full-precision counterpart (SLP-DNet) [26] and other benchmark schemes, such as the conventional BLP [39] [40] and the optimization-based SLP [8]. First, we design fully low-bit binary and ternary SLP-DNet models (SLP-DBNet and SLP-DTNet), where the real-valued weights and activations are constrained to 1-bit. The expressive learning abilities of SLP-DBNet and SLP-DTNet are then further enhanced by designing their corresponding low-bit hybrid stochastically quantized versions (SLP-DSQBNet and SLP-DSQTNet), where part of the weight matrix is quantized to a lower bit-width while the remainder is left in 32-bit floating-point precision. The resulting weight matrix is a hybrid containing both binary and real-valued entries, with the activations all reduced to 2-bit according to (22).
The performances of SLP-DBNet, SLP-DTNet, SLP-DSQBNet, SLP-DSQTNet for QR = 0.5 against SLP-DNet and other benchmark precoding schemes (conventional BLP, SLP optimization-based) are shown in Fig. 5. It can be observed that both SLP-DBNet and SLP-DTNet have higher transmit power than the SLP optimization-based and SLP-DNet schemes. Therefore, SLP optimization-based and SLP-DNet solutions require less power to transmit the same amount of data symbols than SLP-DBNet and SLP-DTNet. The loss in performance is expected because some information is lost during feed-forward weight/input convolutions due to quantization and the inhomogeneous nature of the quantization errors.
Furthermore, a closer examination of Fig. 5 reveals that SLP-DSQBNet and SLP-DSQTNet require less transmit power than their corresponding fully binary and ternary versions. Our simulation also shows that learning by stochastic quantization results in performance close to that of the full-precision learning model (SLP-DNet) with a significant model size reduction (memory savings at inference), as we shall see later. We argue that the decrease in transmit power at the BS in this scenario arises because not all the rows of the weight matrix are quantized at once, and the quantization error is used to direct the gradient descent towards the best local minima during training. Accordingly, we find that at 30 dB, the performance of SLP-DBNet and SLP-DTNet falls by 58% and 35% relative to the SLP optimization-based solution, respectively. On the other hand, the performance gaps of SLP-DSQBNet, SLP-DSQTNet, and SLP-DNet are 22.2%, 9.62%, and 5% of the SLP optimization-based solution, respectively. Therefore, while the accuracy of the fully quantized models is significantly lower, the accuracy of the stochastically quantized hybrid counterparts and the full-precision model is within 88%-96% of the optimal solution.

C. Performance Evaluation of Robust SLP-SQDNet
Figs. 6(a) and 6(b) compare the performance of SLP-SQDNet and the traditional CSI-robust precoder for the 4 × 4 MISO system. Fig. 6(a) depicts how the average transmit power increases with the SNR thresholds for a CSI error bound of $\varsigma^2 = 10^{-4}$ and QR = 50%. The robust optimization-based SLP shows significant power savings of more than 60% compared to the robust conventional BLP. The proposed unsupervised learning-based precoders exhibit a similar transmit power reduction trend, with considerable power savings of 40%-58% against the conventional optimization result. While the fully quantized models show a substantial performance loss compared to the SLP-based optimal precoder, SLP-DSQBNet and SLP-DSQTNet come within 90%-98% of the performance of the SLP optimization-based optimal solutions. Furthermore, we investigate the effect of the CSI error bounds on the transmit power at 30 dB. Fig. 6(b) depicts the variation of the transmit power with increasing CSI error bounds. A significant increase in transmit power can be observed where the channel uncertainty lies within the region of CSI error bounds of $\varsigma^2 = 10^{-3}$. Interestingly, like the SLP optimization-based algorithm, the proposed unsupervised learning methods exploit the CI and also show only a decent or moderate increase in transmit power. To further understand the impact of the QR on the transmit power, Fig. 7 compares the performance of the proposed stochastic quantization learning-based CSI-robust precoders evaluated at 30 dB. As in the nonrobust scenario, the average transmit power required at the BS to transmit the data symbols increases as more weights and activations are quantized.

D. Complexity and Memory Evaluation
The complexities of the proposed learning schemes are examined in two parts: first, we compare the number of floating-point operations (FLOPs) involved in our proposed learning methods with those of the benchmark precoding schemes. Second, we evaluate the inference memory requirements of our proposed learning-based precoding techniques.

1) Number of FLOPs Operations:
The computational cost of SLP-DNet comes from the PUM and from the feed-forward convolutions of the CNN that makes up an APM. For the PUM, the dominant cost is the computation of the proximal barrier term [26]. Both the SLP optimization-based algorithm and the proposed learning schemes are feasible for any combination of M BS antennas and K mobile users, whereas the conventional BLP solution is feasible only for M ≥ K.

The gap in computational cost between SLP-DNet and the SLP optimization-based method widens as the number of mobile users grows. For example, at K = 10 the complexity of SLP-DNet is ∼10× lower than that of the SLP optimization-based method, while that of SLP-DSQBNet and SLP-DSQTNet is ∼20× lower owing to the binary operations. SLP-DBNet and SLP-DTNet reduce the complexity further relative to SLP-DSQBNet and SLP-DSQTNet because binary bit-wise operations replace all MAC calculations in the forward pass. Recall that SLP-DTNet outperforms SLP-DBNet in all scenarios; however, SLP-DTNet is slightly slower than SLP-DBNet because of the additional '0' state in the former. The advantages of SLP-DBNet and SLP-DTNet are further enhanced by stochastic quantization, at the expense of a small additional complexity overhead. The same trend is observed in the robust channel scenario, as shown in Fig. 8(b).

Accordingly, while a fully binarized DNN could offer significant training and inference acceleration, it could also lead to significant performance degradation. Quantizing the weight matrix via a stochastic channel selection based on the quantization error, by contrast, improves the received power. The results in Figs. 8(a) and 8(b) therefore demonstrate that the proposed quantized DL-based SLP solutions offer a good trade-off between performance and computational complexity.
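The stochastic channel selection mentioned above can be illustrated with a short Python sketch. It binarizes a fraction QR of the rows (output channels) of a weight matrix and leaves the remaining rows in full precision. The function name `stochastic_binarize`, the mean-absolute-value scaling factor, and the rule that selects rows with probability inversely proportional to their relative quantization error are assumptions made for illustration; they are not taken from this paper and may differ from the scheme actually used in SLP-DSQBNet.

```python
import numpy as np


def stochastic_binarize(weights, qr, rng):
    """Binarize a randomly selected fraction `qr` of the rows of `weights`.

    Hypothetical sketch: rows with a smaller relative quantization error
    are more likely to be selected (assumed selection rule).
    Returns the mixed-precision matrix and the boolean row mask.
    """
    w = np.asarray(weights, dtype=np.float64)
    n_rows = w.shape[0]

    # Assumed per-row scaling factor: mean absolute value of the row.
    scale = np.mean(np.abs(w), axis=1, keepdims=True)
    w_bin = scale * np.where(w >= 0, 1.0, -1.0)

    # Relative L1 quantization error of each row.
    err = np.abs(w - w_bin).sum(axis=1) / (np.abs(w).sum(axis=1) + 1e-12)

    # Selection probability inversely proportional to the error.
    inv = 1.0 / (err + 1e-12)
    prob = inv / inv.sum()

    # Draw round(qr * n_rows) distinct rows to quantize.
    n_quant = int(round(qr * n_rows))
    chosen = rng.choice(n_rows, size=n_quant, replace=False, p=prob)
    mask = np.zeros(n_rows, dtype=bool)
    mask[chosen] = True

    # Selected rows take their binarized values; the rest stay unchanged.
    w_mixed = np.where(mask[:, None], w_bin, w)
    return w_mixed, mask


rng = np.random.default_rng(0)
w = rng.standard_normal((8, 5))
w_mixed, mask = stochastic_binarize(w, qr=0.5, rng=rng)
```

With `qr=0` the matrix is returned unchanged, and with `qr=1` every row is binarized, which mirrors the two end points of the quantization ratio discussed in this section.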
2) Model Size and Memory Utilization:
Generally, a GPU can speed up the offline training of DNNs. However, most modern GPUs are memory-constrained (e.g., GTX 980: 4 GB, Tesla K40: 12 GB, Tesla K20: 5 GB, and GTX Titan X: 12 GB) [41]. In practice, the size of a DNN is often bounded by the available memory, so it is beneficial to estimate the memory requirements of the DNN at inference. The actual memory utilization also depends on the implementation. Here, we analyze the memory utilization of the full-precision SLP-DNet and its quantized versions at inference, where memory utilization refers to the model size at the testing phase. Following the approach in [42], we calculate the inference memory utilization as the sum of 32 bits times the number of floating-point parameters and 1 bit times the number of binary parameters. Expressed in units of 32-bit words, this is $\frac{1}{32} W_b + W_f$, where $W_b$ and $W_f$ are the numbers of binary and floating-point weights, respectively.

Fig. 9(a) shows the average transmit power versus the quantization ratio (QR), i.e., the proportion of weights that are quantized, at 30 dB SINR. The average power at QR = 0 corresponds to SLP-DNet, while QR = 1 corresponds to the fully quantized counterparts (SLP-DBNet and SLP-DTNet), in which all weights are quantized and the model is purely binary or ternary. The transmit power increases gradually as more weights are quantized, and SLP-DSQTNet requires less transmit power than SLP-DSQBNet. Quantizing half of the weights (QR = 0.5) keeps the performance within 80%−98% of the full-precision model for both SLP-DSQBNet and SLP-DSQTNet.

To investigate how the memory required at inference changes with the quantization ratio, we plot the model size versus QR in Fig. 9(b). Less memory is required as the quantization moves toward extreme binarization at the right of the QR axis. The continuous line represents the full-precision SLP-DNet (QR = 0), while QR = 1 represents a fully quantized model. Furthermore, Fig. 10 shows that SLP-DBNet and SLP-DTNet provide considerable memory savings of up to ∼21× and ∼13×, respectively, compared to the full-precision SLP-DNet, because the extreme quantization significantly reduces the available learning parameters. This brings about a trade-off between performance and model size, which is compensated by hybrid quantization as in SLP-DSQBNet and SLP-DSQTNet. Table V summarizes the inference memory requirements, MACs, and binary operations of the different proposed learning implementations. For SLP-DSQBNet and SLP-DSQTNet, the weights are constrained to the quantization levels $\{-\beta_{qf}, \beta_{qf}\}$ and $\{-\beta_{qf}, 0, \beta_{qf}\}$, respectively, while the activations are clipped to the 2-bit quantized values $\{-\beta_{2\text{-bit}}, \beta_{2\text{-bit}}\}$. This shows that the hybrid quantization enhances the representational capabilities of the convolutional block.
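The memory expression used in this subsection, $\frac{1}{32} W_b + W_f$, can be evaluated for any quantization ratio with a few lines of Python. The sketch below is illustrative only: the function names, the parameter count of 3200, and the `bits_per_quantized` argument are hypothetical, and using 2 bits per ternary weight is an assumption, since the expression in the text covers only the binary case.

```python
def model_size_words(n_weights, qr, bits_per_quantized=1):
    """Inference model size in 32-bit words for a quantization ratio `qr`.

    Implements W_f + (bits / 32) * W_q, which reduces to the expression
    in the text, W_f + W_b / 32, when bits_per_quantized == 1.
    """
    n_quantized = round(qr * n_weights)   # W_b (or ternary weights)
    n_float = n_weights - n_quantized     # W_f
    return n_float + n_quantized * bits_per_quantized / 32


def compression_ratio(n_weights, qr, bits_per_quantized=1):
    """Full-precision size divided by the partially quantized size."""
    return n_weights / model_size_words(n_weights, qr, bits_per_quantized)


# Hypothetical layer with 3200 weights, half of them binarized.
half = model_size_words(3200, qr=0.5)    # 1600 + 1600 / 32 = 1650.0 words
full = compression_ratio(3200, qr=1.0)   # 3200 / 100 = 32.0
```

The ratio of 32 is the upper bound obtained when every parameter is binarized. The sketch does not reproduce the savings reported in Fig. 10, which depend on the actual parameter counts of the networks.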

VI. Conclusion
This paper proposed a hybrid-quantization DNN-based SLP scheme, termed SLP-SQDNet, based on binary and ternary operations for power minimization in a multiuser downlink MISO system. We proposed various weight quantization techniques to obtain the corresponding fully and partially quantized counterparts. We showed that the proposed approach results in fast online learning and a significant model size reduction, which could help render the trained model memory-efficient when deployed on edge devices. Overall, our proposed approaches provide a scalable trade-off between performance and complexity in learning-based SLP transmission.