Learning-Based Precoding-Aware Radio Resource Scheduling for Cell-Free mMIMO Networks

Communication by jointly precoded transmission from many distributed access points (APs), called cell-free massive multiple-input multiple-output (CF mMIMO), is a promising concept for beyond 5G systems. One of the challenging aspects of CF mMIMO is the efficient management of the radio resources. We propose both reinforcement learning (RL)-based and heuristic precoding-aware radio resource scheduling (RRS) algorithms aiming at maximizing sum spectral efficiency (SE). The proposed algorithms allocate resources for Maximum Ratio Transmission (MRT), Zero-Forcing (ZF), Regularised Zero-Forcing (RZF), and Optimized Zero-Forcing (OZF) precoders. For the resource allocation, both the set of serving APs and the physical resource blocks are considered. In high noise scenarios, the proposed RL-based RRS algorithm combined with the MRT precoder shows 2.4 times higher sum SE than the standard Round Robin scheduler. Moreover, we demonstrate that the proposed heuristic algorithms offer similar sum SE while significantly reducing the complexity compared to the RL-based solution. We also show that the RZF and OZF precoders, which are superior to the ZF precoder in noisy environments, result overall in more transmitted power. Therefore, assuming the same radio resource schedule and precoding strategy in the neighboring cells, they will cause more inter-cell interference and an overall reduced performance.

gains by employing multiple antennas at the transmitter and receiver which results in significantly increased spectral efficiency (SE). The multiple base station antennas can transmit multiple streams to multiple user equipments (UEs) simultaneously, hence the concept of multi-user MIMO (MU-MIMO) was proposed [2]. MU-MIMO has a huge potential to combine the high capacity achievable with MIMO processing with the benefits of space-division multiple access [3].
To further improve the SE in wireless communication, the concept of massive MIMO (mMIMO) has been proposed [4], where a large number of antennas simultaneously serve several UEs. The mMIMO system reduces both the transmission energy and the inter-cell interference [5]. A large number of antennas could either be co-located in a single antenna array or geographically distributed over a cell [6]. From the performance point of view, the distributed mMIMO may outperform co-located (also referred to as cellular) mMIMO, but it might be costly in terms of practical deployments. Recently, the network paradigm has shifted from co-located mMIMO to Cell-Free (CF) mMIMO, which is based on the distributed mMIMO concept. In the CF mMIMO network, a number of UEs in a geographical area are simultaneously served by a large number of distributed Access Points (APs) coordinated by the Central Processor Unit (CPU) [7], [8].
In CF mMIMO, the concept of a cell disappears. Instead, a UE is served by a set of APs which is referred to as the AP cluster [9]. In highly dynamic network architectures, the general problem of clustering the APs and designing the beamforming vectors to maximize the sum-rate is studied in [10]. One of the CF mMIMO challenges is the problem of joint AP selection and precoding optimization formulated in [11], where the weighted sum-rate is maximized by a novel hybrid Deep Reinforcement Learning (DRL) method. A deep learning enabled AP clustering scheme is also proposed in [12] to mitigate the preamble collision problem in grant-free random access (RA) of distributed massive MIMO networks. Moreover, a K-means AP clustering algorithm is also developed to cluster the neighboring APs of collided RA UEs and organize each AP cluster to decode the received data individually. A TDD-reciprocity calibration and CSI interpolation of a CF mMIMO system using cascaded deep learning is proposed in [13] to enable a single-shot solution to estimate the downlink (DL) channel at all subcarriers from the uplink channel at a selected pilot subcarrier. From the above discussion, it appears that the CF mMIMO architecture shows huge potential in terms of improving the SE [9] and energy efficiency. Therefore, we consider the CF mMIMO network concept and address a precoding-aware radio resource scheduling (RRS) optimization problem to improve the network SE performance. In the paper, the precoding with maximum ratio transmission (MRT), zero forcing (ZF), regularized zero forcing (RZF), and a proposed optimized zero forcing (OZF) is investigated in combination with RL and heuristic RRS algorithms. The OZF precoding is essentially an RZF precoding with a per-user regularisation factor that is also optimized by the algorithm.

1536-1276 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

A. Motivation and Related Works
The CF mMIMO system architecture has demonstrated better performance than the small-cell and cellular mMIMO architectures in several practical deployment scenarios [14]. In recent years, several research works have used conventional optimization techniques to optimize the RRS of CF mMIMO networks. For example, a user scheduling algorithm is proposed in [15] to maximize the DL sum-rate of CF mMIMO networks. The proposed semi-definite relaxation method for resource allocation and the power control algorithms based on sequential convex approximation show better performance than the traditional canonical CF solution. Similarly, distributed resource allocation for user-centric CF mMIMO is proposed by Ammar et al. in [16] to solve user scheduling, beamforming, and power control in order to maximize the network sum SE. In addition, a novel resource allocation scheme for joint user clustering and power allocation (PA) is proposed in [17] for CF mMIMO to maximize the access rate of the networks. The work [17], however, does not include precoder awareness. As is known, precoding mitigates the effects of interference and fading and increases the SE of the system. This effect is not studied in the resource allocation works proposed in [9], [15], [16], and [17]. Therefore, we consider a precoding-aware radio resource scheduling algorithm for CF mMIMO to maximize the SE of the networks.
For the CF mMIMO system, Palhares et al. propose access point selection, linear minimum mean-square error (LMMSE) precoding, and PA for DL transmission with single-antenna UEs and multi-antenna APs to maximize the network sum-rate [18]. The authors mainly focus on developing three algorithms. However, with the LMMSE AP precoding method, users are only multiplexed in space and physical resource block (PRB) selection is not included. Precoding and PA for CF mMIMO are studied by Nayebi et al. in [19], where conjugate beamforming and ZF precoding are combined with a low complexity PA scheme to showcase the SE performance of the network. In [19], the authors mainly focus on comparing several PA schemes to enhance the SE performance of CF mMIMO networks and on the performance comparison with the ZF precoder. To maximize the DL sum-rate performance of CF mMIMO networks, a novel iterative robust MMSE (RMMSE) precoding technique with several PA strategies is studied in [20] to show the network sum-rate in comparison with ZF and conjugate beamforming techniques. Many relevant studies focus on joint precoding and PA schemes [18], [19], [20], but they do not address the joint precoding, PA, and RRS problem, which is the focus of this paper.
In CF mMIMO, various precoding techniques are used, including an important class of linear precoding called RZF. Several research works have already proposed modifications of the traditional RZF precoder to improve the SE of CF mMIMO networks. An adaptive RZF precoder is proposed by Babrov et al. in [21] that allows the users to use different regularization parameters and layers corresponding to their singular values and the path loss between the UEs and APs. A local RZF (LRZF) is proposed in [22] that provides weighting between interference suppression and maximizing the intended signal at the receiver by exploiting channel statistics. The authors investigated the impact of pilot sequences along with the number of antennas per AP and showed that LRZF provides higher SE than the other precoding techniques. Similarly, modified regularized ZF (mRZF) at the APs for downlink in CF mMIMO with non-orthogonal multiple access (NOMA) is proposed by Rezaei et al. in [23], and it is shown that the proposed mRZF significantly outperforms OMA with MRT in terms of the achievable sum-rate. Many studies focus on modifying the traditional RZF precoding techniques for CF mMIMO [21], [22], [23], but they do not address the RZF precoding and RRS problem jointly through precoding-aware RRS or RRS-aware optimization of the RZF regularisation terms for CF mMIMO networks. Therefore, we provide an optimized regularised zero forcing (OZF) precoding solution in the paper by investigating the impact of using a diagonal matrix instead of the identity matrix of RZF, which can significantly improve the SE performance of CF mMIMO networks.
With the advances in machine learning algorithms, more specifically, reinforcement learning algorithms, deep RL (DRL) approaches are adopted to deal with wireless (mobile) network RRS problems. A deep learning architecture for uplink (UL) transmission in CF mMIMO is proposed in [24], where an artificial neural network (ANN) handles the joint AP selection and power control mechanisms for various fronthaul bandwidths and system loads. Similarly, power allocation strategies for CF mMIMO based on deep Q-learning are proposed in [25], where the proposed deep Q-network method allocates DL transmission power to maximize the spectral efficiency performance of the networks. A deep reinforcement learning based beamforming design is proposed by Weilai et al. in [26] to maximize the energy efficiency (EE) of CF mMIMO networks. Moreover, they further defined the closed-form expression of the EE in beamforming design and maximized the long-term EE of CF mMIMO networks. Similarly, a DRL-based PA scheme for CF mMIMO networks is proposed by Rajapaksha et al. in [27] to maximize the minimum user rate by allocating optimal power to the users.
In summary, existing works on ML algorithms for CF mMIMO networks have generally studied AP selection, PA, and beamforming design problems either separately or jointly, to maximize the SE, sum spectral efficiency, energy efficiency, and the minimum user rate of the networks [24], [25], [26], [27]. As is known, precoding plays an important role in CF mMIMO systems: it reduces the effects of interference and path loss and significantly improves the spectral efficiency of the network. To the best of our knowledge, a learning-based precoding-aware RRS scheme to maximize the sum SE is still an open problem for cell-free mMIMO networks. In the next sub-section, we summarize the key contributions of the paper and the methods used to maximize the sum SE of CF mMIMO networks.

B. Contributions and Organization of the Paper
In the paper, we address sum SE maximization through precoding and radio resource allocation. The main contributions of the paper include:
• Proposing a solution for the precoding and radio resource allocation problem by using precoding-aware RRS. In the precoding-aware RRS, the allocation of the radio resources (i.e., PRBs of APs) to users is optimized for a specific precoder, where MRT, ZF, RZF and OZF precoders are considered.
• Proposing a reinforcement learning (RL) algorithm with state compression for RRS with MRT, ZF, RZF and OZF precoders.
• Proposing heuristic algorithms for RRS with the MRT, ZF, RZF and OZF precoders providing similar results as the RL-based algorithm with a computational complexity that is reduced by three orders of magnitude.
• Presenting a sum SE performance comparison of the proposed RL-based and heuristic algorithms for MRT, ZF, RZF and OZF precoders by considering two different scenarios of noise and interference: fixed noise, and adaptive adjacent cell interference and noise.
The rest of the paper is organized as follows. In Section II, we discuss the system model and formulate the problem of sum spectral efficiency maximization for cell-free mMIMO networks. In Section III, the underlying concepts of reinforcement learning and precoding are described. In Section IV, we describe the proposed precoding-aware RRS techniques. In Section V, we present and discuss the simulation results of the proposed precoding-aware RRS techniques. In Section VI, we conclude our paper and indicate future research directions.
Notations: $\mathbf{X}^*$, $\mathbf{X}^T$, and $\mathbf{X}^H$ denote the conjugate, transpose, and conjugate transpose (Hermitian operator) of the matrix $\mathbf{X}$, respectively; $[\mathbf{X}]_{u,p}$ stands for the $(u, p)$th entry of the matrix $\mathbf{X}$; $x$ denotes a complex value by default and $\mathbf{x}$ denotes a vector.

II. SYSTEM MODEL AND PROBLEM FORMULATION
The CF mMIMO network model is illustrated in Fig. 1, where $U$ single-antenna users are simultaneously served in the CF mMIMO manner by $N$ single-antenna APs. In the following subsections, we present the considered channel model and formulate the optimization problem.

A. Channel Model Description
Let us consider a joint transmission by $N$ APs to $U$ UEs (Fig. 1) on a PRB $b$, which consists of 12 consecutive subcarriers with the same channel characteristics. The index $b$ may look superfluous in (1) to (5), but it will play its role in (6). Denote by $x_{nb}$ the complex-valued signal transmitted from a single-antenna AP $n$ on a subcarrier of the PRB $b$. The transmission results in the DL received complex-valued signal $y_{ub}$ at the user $u$, given by (1), where $n_{ub}$ is the noise at UE $u$ in PRB $b$. The noise is the sum of the (thermal) noise and the ambient external (i.e., inter-system) interference. For this reason, the noise power $|n_{ub}|^2$ is modeled to follow a log-normal distribution. The complex-valued channel (propagation) coefficient $h_{unb}$ between UE $u$ and transmit AP $n$ in PRB $b$ is modeled as in (2), where $\beta_{un}$ and $f_{unb}$ indicate the large-scale and small-scale fading coefficients, respectively. In our work, the channel $h_{unb}$ is modeled based on the ITU recommendations. The large-scale fading coefficient $\beta_{un}$ depends upon the path loss and shadowing between the corresponding UE and AP. The $\beta_{un}$ coefficient is modeled for different environments (indoor hotspot, urban macro, urban micro, etc.) according to [28]. The small-scale fading coefficient $f_{unb}$ is modeled for different power delay profiles according to [29].
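For orientation, one form of the signal and channel model that is consistent with the definitions above is sketched here. The square-root scaling of the large-scale term is an assumption (it treats $\beta_{un}$ as a power gain); the authoritative expressions are (1) and (2) of the paper.

```latex
y_{ub} = \sum_{n=1}^{N} h_{unb}\, x_{nb} + n_{ub},
\qquad
h_{unb} = \sqrt{\beta_{un}}\, f_{unb}
```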
In Time Division Duplex (TDD), thanks to the DL and UL channel reciprocity, the DL channels can be obtained by UL measurements at the APs. By joint transmission from $N$ APs, the system plans to deliver a complex-valued modulation symbol $s_{ub}$ to UE $u$ in PRB $b$. The average modulation symbol amplitude is normalized to satisfy the power constraint $E\{|s_{ub}|^2\} = 1$. Prior to the transmission, AP $n$ linearly precodes the modulation symbols $s_{ub}$ as in (3), where $w_{nub}$ denotes the complex precoding value that AP $n$ applies to transmit the symbol $s_{ub}$ on PRB $b$ and $|w_{nub}|^2$ is the allocated downlink transmit power at AP $n$ for UE $u$ on PRB $b$.
The received signal $y_{ub}$ at UE $u$ on PRB $b$ from (1) can now be written as (4). The inter- and intra-system interference signal $i_{ub}$ at UE $u$ on PRB $b$ is given by (5). So far, the PRB index $b$ has been redundant. The received signal $y_{ub}$ in (4) and the interference $i_{ub}$ in (5) are calculated independently for each PRB because no inter-frequency interference is assumed. However, the transport block is coded and transmitted to the UE through all PRBs allocated to the UE. Therefore, in the UE signal-to-interference-plus-noise ratio (SINR) estimation, the total received power and the total interference on all allocated PRBs are considered, as expressed in (6), where $B_u$ is the number of PRBs allocated to the UE $u$ (not necessarily all PRBs of the system).
The downlink achievable spectral efficiency of the UE $u$, denoted by $SE_u$, is expressed in (7). The DL sum SE, denoted by $SE_{DL}$ and given in (8), is the sum of the spectral efficiencies of all UEs and represents the system SE.
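The chain from precoded transmission to sum SE can be illustrated with a short sketch. It assumes unit-power, mutually independent symbols, so that powers add, and it uses the plain Shannon form $\log_2(1 + \mathrm{SINR}_u)$ without any bandwidth weighting; both are assumptions, and the function name `sum_se` and the array layout are hypothetical.

```python
import math

def sum_se(H, W, noise_power):
    """Sum SE over all UEs (sketch, not the paper's exact expressions).

    H[b][u][n]: channel from AP n to UE u on PRB b.
    W[b][n][u]: precoding value of AP n for UE u on PRB b (0 = not allocated).
    noise_power[b][u]: noise-plus-external-interference power, assumed > 0.
    """
    B, U, N = len(H), len(H[0]), len(H[0][0])
    total = 0.0
    for u in range(U):
        signal = 0.0        # total wanted power over the PRBs allocated to u
        impairment = 0.0    # total interference plus noise over the same PRBs
        for b in range(B):
            # A PRB counts for UE u only if at least one AP transmits to u on it.
            if all(W[b][n][u] == 0 for n in range(N)):
                continue
            for v in range(U):
                amplitude = sum(H[b][u][n] * W[b][n][v] for n in range(N))
                if v == u:
                    signal += abs(amplitude) ** 2
                else:
                    impairment += abs(amplitude) ** 2
            impairment += noise_power[b][u]
        if signal > 0:
            total += math.log2(1 + signal / impairment)
    return total
```

The SINR is formed from totals across the allocated PRBs rather than per PRB, which mirrors the statement that the transport block is coded across all PRBs allocated to the UE.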

B. Problem Formulation
The target of the paper is to maximize the system SE. Therefore, the overall optimization problem is formulated as the maximization of $SE_{DL}$ in (9), subject to the AP power constraint in (10), where $P_{\max}$ is the maximum transmit power per PRB derived from the AP power divided by the number of PRBs available in the carrier bandwidth. It can be seen from (6) that the radio resource scheduling, which aims at determining the precoder $w_{nub}$, is a 3-dimensional AP-UE-PRB problem.
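A compact way to write this problem, under the assumption that the constraint limits the total power an AP spends on one PRB across all UEs, is sketched below. The exact form of (10) is not reproduced here, so the per-AP, per-PRB sum is an inference from the definition of $P_{\max}$.

```latex
\max_{\{w_{nub}\}} \; SE_{DL}
\quad \text{subject to} \quad
\sum_{u=1}^{U} |w_{nub}|^2 \le P_{\max}, \quad \forall\, n, b
```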
An exact solution of the optimization problem in (9) is not feasible in practice, since the complexity increases exponentially as $N$ and $U$ increase linearly, i.e., the problem is non-deterministic polynomial (NP)-hard and heuristic methods are necessary.

C. Two-Step RRS Approach
To reduce the complexity of the RRS problem in (9), we propose the RRS to be carried out in two steps:
• Step 1: Resource allocation. At first, for each PRB, the APs serving users are selected. Selected (allocated) resources have $w_{nub} \neq 0$, while non-allocated resources have $w_{nub} = 0$.
• Step 2: Precoder determination. Secondly, the precoder $w_{nub}$ is determined for all non-zero precoding values.
For the resource allocation (i.e., for step 1), we propose a reinforcement learning (RL)-based algorithm and a heuristic algorithm. For precoder determination (i.e., for step 2), we use MRT, ZF, and RZF precoders, but we also propose an algorithm for the Optimized Zero Forcing (OZF) precoder, which is an extended version of the RZF precoder taking into account per-user regularisation. The resource allocation algorithm of step 1 is optimized for the precoder type that will be used in step 2; therefore, the proposed technique is referred to as precoding-aware radio resource scheduling (RRS).

III. UNDERLYING CONCEPTS
This section presents two underlying concepts, namely, the reinforcement learning used in step 1 and multi-AP multi-user precoding used in step 2 of the RRS.

A. Reinforcement Learning
In this paper, RL is used to select serving APs for each UE on each PRB.In the O-RAN ALLIANCE system, we locate the algorithm centrally in the Near Real-Time RAN Intelligent Controller (Near-RT RIC).The RL algorithm first learns and then exploits its knowledge.
• In the learning phase, the algorithm, also referred to as the agent (see Fig. 2), learns by performing actions and receiving observations. An action causes a transition to another state and yields a reward. The knowledge obtained from the rewards is stored by the agent in the Q-function.
• In the exploitation phase, the agent uses its knowledge gathered in the Q-function and performs actions that transfer from one state to another state in a way that ensures the maximum cumulative future reward, e.g., maximum sum SE improvement.
In state $s$, there exists a set of possible actions $A = \{a_1, \ldots, a_k\}$. For each state and action pair, the cumulative future reward is obtained from the Q-function.
The action that provides the maximum cumulative future reward is the optimum action, given by (11). The Q-function learns the cumulative future rewards in an iterative learning process by performing actions on the environment and receiving the action rewards. After iteration $t$, where the action $a_t$ has been performed and the action reward $r_{t+1}$ has been received, the Q-function is updated according to (12), where $\alpha_{RL}$ and $\gamma_{RL}$ are the learning rate and discount factor, respectively. The Q-function update propagates the reward of being in state $s_{t+1}$ back to the state $s_t$, so that a proper action is taken the next time the agent is in state $s_t$. The following common realizations of the Q-function exist:
• Q-function as a matrix. When the number of states is small, the Q-function can be a simple table with states in rows and actions in columns.
• Q-function as a neural network. In the case of a large number of states, when the table is too big to manage and train efficiently, the Q-function can be a neural network.
We propose another realization of the Q-function, where information about states is stored in a compressed way. The proposed approach is presented in Section IV-C.1. There exist two popular action selection policies: the softmax approach, which selects an action based on the Boltzmann distribution function, and the $\epsilon$-greedy approach, which is used by us, where the RL algorithm selects a random action with probability $\epsilon$ and chooses the optimum action derived from (11) with probability $1 - \epsilon$, where $\epsilon$ is the exploration factor.
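The learning loop described above can be sketched with a tabular Q-function. The update below is the standard Q-learning rule, which is assumed here to correspond to (12); the function names and the dictionary-based table are hypothetical illustration choices.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, eps, rng=random):
    """Pick a random action with probability eps, else the greedy one."""
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(Q, s, a, reward, s_next, actions, alpha_rl, gamma_rl):
    """Standard Q-learning update (assumed form of the paper's update rule)."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha_rl * (reward + gamma_rl * best_next - Q[(s, a)])

# Usage: a table that defaults to zero for unseen state-action pairs.
Q = defaultdict(float)
actions = ["a1", "a2"]
q_update(Q, "s0", "a1", reward=1.0, s_next="s1", actions=actions,
         alpha_rl=0.5, gamma_rl=0.9)
chosen = epsilon_greedy(Q, "s0", actions, eps=0.0)  # pure exploitation
```

With `eps=0.0` the policy is purely greedy, which corresponds to the exploitation phase; a positive `eps` keeps exploring during learning.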

B. Multi-AP Multi-User Precoding
In this paper, the Maximum Ratio Transmission (MRT), Zero Forcing (ZF), Regularized Zero Forcing (RZF) and Optimized Zero Forcing (OZF) precoder types are considered. The same precoder type is applied to all UEs in the system. The AP-UE-PRB resources to which the precoder $w_{nub}$ is applied are determined in step 1 of the RRS algorithm.
1) MRT Precoding: The MRT precoding aims at maximizing the received signal at each UE by shifting the phase of the transmitted symbol $s_{ub}$ to compensate for the propagation time difference between the AP and the UE. Thus, the MRT precoder $\tilde{w}_{nub}$ (the tilde denotes a precoder before power normalization) has the phase opposite to the phase of the channel, as given in (13). In order to meet the AP power constraint in (10), the MRT precoder is power normalized per PRB of an AP.
The normalization in (14) ensures that the AP transmits each PRB with its maximum power and that the PRB power is distributed evenly between UEs if the AP transmits to more than one UE on the PRB. It is important to note that the MRT precoder is simple to implement because the phase of the MRT precoder applied for one UE does not depend on the phase of the precoder applied for another UE. In other words, no inter-UE precoder coordination is necessary in MRT.
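A minimal sketch of MRT for one PRB follows. It assumes a unit-modulus conjugate-phase precoder before normalization and an equal split of $P_{\max}$ among the UEs an AP serves on the PRB, as described above; the names `mrt_precoder` and `served` are hypothetical.

```python
import math

def mrt_precoder(h, served, p_max):
    """MRT precoder for a single PRB (sketch under the stated assumptions).

    h[u][n]: channel between UE u and AP n (assumed non-zero where served).
    served[n]: list of UE indices that AP n serves on this PRB.
    Returns w[n][u], zero where AP n does not serve UE u.
    """
    U, N = len(h), len(h[0])
    w = [[0j] * U for _ in range(N)]
    for n in range(N):
        if not served[n]:
            continue
        # Equal power sharing: the AP spends p_max in total on this PRB.
        amplitude = math.sqrt(p_max / len(served[n]))
        for u in served[n]:
            # Phase opposite to the channel phase, so h * w is real and positive.
            w[n][u] = amplitude * h[u][n].conjugate() / abs(h[u][n])
    return w

# Usage: one UE served by two APs; contributions add coherently at the UE.
h = [[1j, 2 + 0j]]
w = mrt_precoder(h, served=[[0], [0]], p_max=4.0)
received = sum(h[0][n] * w[n][0] for n in range(2))
```

Each entry depends only on the channel of the served UE, which illustrates why no inter-UE precoder coordination is needed.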
2) ZF Precoding: Unlike MRT, the ZF precoder aims at interference cancellation at other UEs in the system and, therefore, the ZF precoder must be calculated globally for the system. The precoding can be done independently for each PRB. For clarity and to simplify the notation, consider a single PRB $b$, and denote by $\mathbf{H}$ the $U \times N$ channel matrix with elements $h_{unb}$ and by $\mathbf{W}$ the $N \times U$ precoding matrix with elements $w_{nub}$ for the PRB $b$.
The ZF precoder without power normalisation, $\tilde{\mathbf{W}}_{ZF}$, is given by (15), and it is only applicable to systems with $U \le N$. The $\tilde{\mathbf{W}}_{ZF}$ precoder is next power normalized to $\mathbf{W}_{ZF}$, as shown in Section III-B.5.
3) RZF Precoding: The RZF precoder before normalisation, $\tilde{\mathbf{W}}_{RZF}$, is given by (16) [30], where $\alpha > 0$ is the regularisation factor. By maximizing the SINR at the UEs, the optimal regularization can be derived as α = SNR [30]. Note that this optimal regularization factor was only obtained for the case of homogeneous SNR conditions. Using a local search algorithm, we find an optimum value of $\alpha$, which maximizes the sum SE for a given channel matrix $\mathbf{H}$. The $\tilde{\mathbf{W}}_{RZF}$ precoder is next power normalized to $\mathbf{W}_{RZF}$, as shown in Section III-B.5.
4) OZF Precoding: In the RZF precoder, the regularization factor $\alpha$ is a system-specific value, which is common for all UEs. Note that in [31], it is already mentioned that RZF is a heuristic approximation of the optimal multi-user beamforming, which in general requires a regularization that is not the same for each user. Therefore, we have investigated the impact of using a diagonal matrix instead of the identity matrix in (16), which gives the OZF precoder in (17), where $\mathrm{diag}(\boldsymbol{\alpha})$ is the diagonal matrix formed from the vector $\boldsymbol{\alpha}$ and $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_U]$ is the regularization vector. The fields $\alpha_u$ of the diagonal matrix correspond to different users, but modification of one field may also impact other users. Therefore, the fields $\alpha_u$ are sequentially modified in small steps. In one sequence, all diagonal fields are changed once. Typically, two or three sequences are necessary to reach the local sum SE maximum. This optimization process is performed separately for each channel matrix $\mathbf{H}$.
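The three inverse-based precoders differ only in the regularization term, which the following sketch makes explicit. It assumes the textbook form $\mathbf{H}^H(\mathbf{H}\mathbf{H}^H + \mathrm{diag}(\boldsymbol{\alpha}))^{-1}$, with all-zero $\boldsymbol{\alpha}$ for ZF and a common value for RZF; whether this matches (15) to (17) term by term is an assumption, and all helper names are hypothetical.

```python
def hermitian(A):
    """Conjugate transpose of a matrix given as a list of rows."""
    return [[A[i][j].conjugate() for i in range(len(A))]
            for j in range(len(A[0]))]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def inverse(A):
    """Gauss-Jordan inversion with partial pivoting for small complex matrices."""
    n = len(A)
    M = [[complex(x) for x in row] + [complex(i == j) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            raise ValueError("singular matrix")
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [x / pivot for x in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [row[n:] for row in M]

def regularized_precoder(H, alphas):
    """Unnormalized precoder H^H (H H^H + diag(alphas))^-1 for one PRB.

    H is U x N (requires U <= N when all alphas are zero, i.e. ZF).
    alphas: one regularization value per UE (all equal gives RZF).
    """
    Hh = hermitian(H)          # N x U
    G = matmul(H, Hh)          # U x U Gram matrix
    for u in range(len(H)):
        G[u][u] += alphas[u]
    return matmul(Hh, inverse(G))

# Usage: with zero regularization the effective channel H W is the identity.
H = [[1, 1j], [1, -1j]]
W_zf = regularized_precoder(H, [0.0, 0.0])
W_ozf = regularized_precoder(H, [2.0, 0.0])   # per-user regularization
HW_zf = matmul(H, W_zf)
HW_ozf = matmul(H, W_ozf)
```

A per-user search over `alphas`, changing one entry at a time in small steps and keeping the change only if the sum SE improves, would follow the sequential procedure described for OZF.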

5) Normalisation:
The $\tilde{\mathbf{W}}_{ZF}$, $\tilde{\mathbf{W}}_{RZF}$ and $\tilde{\mathbf{W}}_{OZF}$ precoders need to be power normalized to meet the constraint (10) and to derive the normalized $\mathbf{W}_{ZF}$, $\mathbf{W}_{RZF}$ and $\mathbf{W}_{OZF}$ precoders, respectively. The normalization process is described by (18). In contrast to the MRT precoder, where the power normalization is done separately for each PRB of each AP, the normalization in (18) scales the power of all APs on a specific PRB in the network by the same factor, in such a way that the strongest AP transmits at the maximum power $P_{\max}$. As a result of this normalization, the system does not use the maximum power $P_{\max}$ at each AP. The consequences of the power usage of a precoder are analyzed in Section V-F.
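The network-level normalization can be sketched as a single common scale factor per PRB. The form below, which scales so that the AP with the largest total power reaches exactly $P_{\max}$, is inferred from the description and is an assumption about (18); the function name is hypothetical.

```python
import math

def normalize_network(W_tilde, p_max):
    """Scale all APs on one PRB by a common factor (sketch).

    W_tilde[n][u]: unnormalized precoder of AP n for UE u.
    After scaling, the AP with the highest total power transmits p_max
    and every other AP transmits less.
    """
    ap_power = [sum(abs(w) ** 2 for w in row) for row in W_tilde]
    scale = math.sqrt(p_max / max(ap_power))
    return [[scale * w for w in row] for row in W_tilde]

# Usage: AP 1 is the strongest, so it ends up at p_max; AP 0 at half of it.
W = normalize_network([[1, 1], [2, 0]], p_max=1.0)
powers = [sum(abs(w) ** 2 for w in row) for row in W]
```

Because the factor is common to all APs, the relative amplitudes and phases that provide interference cancellation are preserved, at the price of unused power at the weaker APs.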

IV. PRECODER-AWARE RRS TECHNIQUES
This section describes step 1 of the RRS technique, i.e., resource allocation, where the APs serving each user on each PRB are determined. In the following sections, we explain why we limit the number of serving APs and state the assumptions made to limit the complexity of the radio resource allocation algorithms. We finally discuss three proposed radio resource allocation algorithms: (1) the RL-based approach, which was trained for MRT and ZF precoders only because of the excessive computation time for RZF and OZF precoders, (2) the heuristic approach for the MRT precoder, and (3) the heuristic approach for ZF, RZF, and OZF precoders.

A. Motivation to Limit the Number of Serving APs
The following observations motivated us to look for an optimum number of serving APs for each UE:
• Interference. In MRT precoding, the power transmitted to one UE causes interference to other UEs; therefore, limiting the number of serving APs can reduce interference and improve sum SE. The motivation also applies to RZF and OZF precoders because they do not cancel interference fully.
• Power normalization. In ZF precoding, interference is fully canceled; therefore, ZF with more APs may seem reasonable, because additional APs provide additional power, increasing the signal power at the UEs. However, the ZF, RZF, and OZF precoders are power normalized on the network level to the AP with the highest transmit power. This normalization discourages the addition of distant APs to serve a user, as such a distant AP may result in a significant reduction of the transmit power of nearby APs.

B. Radio Resource Allocation Assumptions
The optimization problem could be solved using a neural network with its ability to find important relations between actions and states. However, due to the long neural network learning time and the requirement of real-time RRS, we propose an approach where reasonable assumptions are applied to reduce the RRS algorithm complexity at the cost of a small performance gap.
Assumption 1: Serving APs are determined separately for each PRB. In our model in (6), to consider perfect coding and interleaving, the UE $\mathrm{SINR}_u$ is calculated by dividing the total received signal power across all PRBs by the total received noise and interference power. Next, the user $SE_u$ is calculated in (7) and finally the system $SE_{DL}$ in (8), while we optimize $SE_{DL}$ independently for each PRB. Therefore, our optimization approach does not exploit the variation of the UEs' signal power across PRBs, which still exists even though the macro diversity provided by the mMIMO diminishes it. However, with this assumption, the 3-dimensional UE-AP-PRB optimization problem is solved separately for each PRB, and the computational complexity of the problem, $\Omega$, is reduced from $\Omega = \omega^B$ to $\Omega = B\omega$, where $\omega$ is the complexity per PRB, and $B$ is the number of PRBs.
For the ZF, RZF, and OZF, we select APs to serve all UEs and the power transmitted by APs per PRB is determined by the precoding and power normalization in (18). For the MRT, the power normalization in (14) determines the sum of UEs' power on a PRB and the distribution of the maximum PRB power between UEs needs to be determined by the resource allocation algorithm. Thus, for MRT, we make further assumptions to limit the RRS algorithm complexity.
Assumption 2: For the MRT precoder, an AP serves at most one UE per PRB using its maximum PRB power. An AP may serve multiple users but on different PRBs. If we did not limit the number of served UEs per PRB, but allowed for, e.g., equal power sharing between UEs on a PRB, the complexity per PRB would be $\omega = 2^{UN}$, which is $\sim 10^{192}$ for 10 UEs and 64 APs. If we assume a maximum of two UEs per PRB with equal power sharing, referred to as the "MRT exh. search 2 UEs per AP/PRB", then the complexity per PRB is $\omega = \left(1 + U + \frac{U(U-1)}{2}\right)^N$, i.e., $\sim 10^{111}$ for 10 UEs and 64 APs. With Assumption 2, referred to as the "MRT exh. search 1 UE per AP/PRB", the complexity per PRB is limited to $\omega = (1 + U)^N$, which is $\sim 10^{66}$ for 10 UEs and 64 APs. In Section V-E, the performance gap of this assumption is estimated at 3.7%.
Assumption 3: For the MRT precoder, an AP can only serve the UE with the strongest received signal power for this PRB. This assumption is a result of our search for an optimum algorithm. It results in a further complexity reduction to $\omega = 2^N$, i.e., $\sim 2 \times 10^{19}$ for 64 APs and 10 UEs. In Section V-E, the performance gap of this assumption is estimated at 4.4%.
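The orders of magnitude quoted for the per-PRB search space can be checked with a few lines of arithmetic. The sketch below evaluates the base-10 logarithm of each expression for $\omega$; the function name and dictionary keys are hypothetical labels.

```python
import math

def log10_complexity(U, N):
    """Base-10 logarithm of the per-PRB search-space size for each variant."""
    return {
        # Any subset of UEs per AP: 2^(U*N)
        "unrestricted": U * N * math.log10(2),
        # At most two UEs per AP: (1 + U + U(U-1)/2)^N
        "max_2_ues": N * math.log10(1 + U + U * (U - 1) // 2),
        # At most one UE per AP (Assumption 2): (1 + U)^N
        "max_1_ue": N * math.log10(1 + U),
        # Only the strongest UE per AP (Assumption 3): 2^N
        "strongest_ue": N * math.log10(2),
    }

exponents = log10_complexity(U=10, N=64)
# The integer parts are 192, 111, 66 and 19, respectively.
```

Each assumption removes dozens of orders of magnitude, yet even $2^{64}$ remains far too large for exhaustive search, which motivates the RL-based and heuristic algorithms that follow.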

C. Proposed Precoder-Aware RRS Algorithms
In the following sections, we propose three resource allocation algorithms that address the complex problem.
1) The RL-Based Resource Allocation Algorithm for MRT, ZF, RZF and OZF Precoders: The proposed RL-based algorithm selects the set of APs that should serve the UEs. The algorithm first undergoes the learning process and is then used for resource allocation. The learning must be carried out for the given precoder, UEs and channels. If the channels change, e.g., as a result of UE mobility or a radio environment change, or a UE finishes its transmission, then the learning must be done again. The RL algorithm has been trained for MRT or ZF precoders only because of the excessive computation time for RZF and OZF precoders. The algorithm trained for ZF was also used for RZF and OZF precoders.
The states, actions, and rewards definitions have a fundamental impact on the success of the RL algorithm. The following approach is proposed and carried out separately for each PRB: • resulting from the action. The huge number of states does not allow us to use the Q-function as a matrix. Such a matrix would have a size of $2^N \times N$, which is the total number of state-action combinations. We propose a solution where information about states is stored in a compressed way in the C-matrix of size $N \times N$. The way in which the Q-value is derived from the C-matrix is explained in Fig. 3.
With this design, a C-matrix row is used for Q-value calculation each time when the AP corresponding to the row is serving.Also, an update of the row takes place during an action in any state where the AP is serving.Thus each field of the C-matrix stores information about average cumulative future reward when the corresponding AP is serving.The advantage of this approach is that we converge faster and update the reduced state space faster, as even multiple rows are updated in one iteration.The disadvantage is that we do not keep track very well of which sets of APs work well together.
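The Q-value derivation described in the caption of Fig. 3 can be sketched as follows. The state is a tuple of bits (1 = AP serving) and an action toggles one AP. How each C-matrix field is moved towards the target and the zero Q-value for a state without serving APs are assumptions of this sketch, as are the function names.

```python
def q_value(C, state, action):
    """Q-value from the compressed C-matrix (sketch of the Fig. 3 rule)."""
    if state[action] == 1:
        # Removal action: minus the Q-value of the opposite addition action.
        base = state[:action] + (0,) + state[action + 1:]
        return -q_value(C, base, action)
    rows = [n for n, serving in enumerate(state) if serving]
    if not rows:
        return 0.0  # assumption: no information before any AP is serving
    # Mean of the fields in the action's column over the serving APs' rows.
    return sum(C[n][action] for n in rows) / len(rows)

def c_update(C, state, action, reward, next_state, alpha_rl, gamma_rl):
    """Feed the best next Q-value back into the fields used for (state, action)."""
    best_next = max(q_value(C, next_state, a) for a in range(len(C)))
    target = reward + gamma_rl * best_next
    if state[action] == 1:
        base, sign = state[:action] + (0,) + state[action + 1:], -1.0
    else:
        base, sign = state, 1.0
    for n, serving in enumerate(base):
        if serving:
            # Assumed per-field update: move each used field towards the target.
            C[n][action] += alpha_rl * (sign * target - C[n][action])

# Usage: 5 APs, AP3 and AP5 serving (0-based indices 2 and 4), add AP2 (index 1).
C = [[0.0] * 5 for _ in range(5)]
c_update(C, (0, 0, 1, 0, 1), 1, reward=1.0, next_state=(0, 1, 1, 0, 1),
         alpha_rl=0.5, gamma_rl=0.9)
q_add = q_value(C, (0, 0, 1, 0, 1), 1)
q_remove = q_value(C, (0, 1, 1, 0, 1), 1)
```

Only $N \times N$ values are stored instead of $2^N \times N$, and one update touches every row of a serving AP, which reflects the faster convergence and the loss of information about specific AP sets noted above.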
2) The Heuristic Resource Allocation Algorithm for MRT Precoder: We have been looking for an algorithm that yields a sum SE similar to that of the MRT RL-based algorithm yet runs faster by using a predefined scheme instead of the computationally costly RL training. Thus, we propose the heuristic MRT Algorithm 1, which selects APs sequentially in two loops taking into account the received power at the UE from the APs. In the first loop, it selects one AP for each UE to ensure that each UE is served. In the second loop over all remaining APs, an AP is selected if (1) the received power from the AP is not more than $P_{\mathrm{offset}}$ times weaker than the power received from the AP that provides the strongest power for the UE and (2) the AP selection results in a sum SE improvement. In our simulations, the parameter $P_{\mathrm{offset}} = 200$ (i.e., 23 dB) provided good sum SE performance.
The key aspect of the algorithm is the order in which the APs are selected. For MRT, the sequence order is optimized for interference avoidance. The APs that create less interference are allocated first. The interference is estimated by the ratio between the strongest and the second strongest signal power received from this AP by UEs. The APs having a higher ratio tend to create less interference and are considered for allocation first.

Fig. 3. Q-value derivation from the C-matrix for an example of a network with 5 APs. Assume the current state $s_t = [00101]$, where the n-th bit of the sequence indicates whether the n-th AP is serving or not. Five actions are possible and they correspond to the C-matrix columns. For each addition action, the Q-value is equal to the mean value of the fields in the column corresponding to the action and the rows corresponding to the serving APs in the state. The fields are indicated by blue dots. If, in the state $s_t$, the action $a_2$ is taken by adding the serving AP2, then the system transitions to the state $s_{t+1} = [01101]$. Next, the Q-values for all actions in the state $s_{t+1}$ are calculated and the maximum Q-value is fed back, updating the C-matrix according to (12) in the fields that were used in the Q-value calculation for state $s_t$ and action $a_2$, i.e., the blue dot fields. In the state [01101], where AP2 is already serving, the action $a_2$ leads to the removal of AP2 and is the opposite of the action $a_2$ in the state [00101]. The Q-value of the removal action is equal to the Q-value of the opposite addition action multiplied by −1, e.g., $Q([01101], a_2) = -Q([00101], a_2)$, i.e., if the addition action resulted in a sum SE improvement, then the opposite removal action will result in a sum SE degradation.
3) The Heuristic Resource Allocation Algorithm for ZF, RZF and OZF Precoders: For the ZF, RZF, and OZF precoders, we propose Algorithm 2, which determines the sequence of AP selection without the need for a computationally expensive SE calculation. For each UE, the algorithm first selects the AP offering the strongest signal power. Next, the AP with the second strongest power is selected for each UE, and so on, until the required number of APs, $R_{AP}$, is selected. The selected APs serve all UEs. The parameter $R_{AP}$ can be used to reduce the fronthaul load while keeping the sum SE loss low. As shown in Fig. 4, the selection of all APs for ZF, RZF, and OZF precoders usually results in the highest sum SE, but the selection of more APs in the 7-2 O-RAN split also means more fronthaul load (network cost), as the same data is sent to more APs. For a network with 64 APs and a noise level of −80 dBm per PRB, selecting, e.g., 35 APs to serve 10 UEs with the RZF or OZF precoder results in a 2.5% sum SE loss compared to selecting all 64 APs, as shown in Fig. 4, but requires only 35/64 = 55% of the fronthaul resources. Fig. 4 also shows that the required number of APs increases with the number of UEs, i.e., a 2.5% sum SE loss with the RZF or OZF precoder requires 22 APs for 5 UEs and 39 APs for 20 UEs. For the same sum SE loss, the ZF precoder requires more APs than the RZF and OZF precoders.
Algorithm 1 The Heuristic Resource Allocation Algorithm for MRT Precoder
1: Start
2: UE mapping: For each AP find the UE with the strongest received power from the AP. The AP can serve this UE only and the AP is referred to as the strongest AP for the UE.
3: AP sorting: For each AP calculate the ratio between the strongest and the second strongest power received by UEs from the AP, next sort the APs in descending order of the ratio.
4: Loop 1: Loop over all APs in the AP sorting order. If the UE mapped to this AP does not have a selected AP yet then select the AP.
5: Loop 2: Loop over unselected APs in the AP sorting order. If the power received from the AP by its mapped UE is not more than $P_{\text{offset}}$ times weaker than the power received by the UE from its strongest AP and the selection of the AP improves the sum SE then select the AP. In the simulations the parameter $P_{\text{offset}} = 200$.
6: Return the selected APs.
7: End
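The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: `heuristic_mrt` and its arguments are hypothetical names, `P[u, n]` is assumed to hold the linear power received by UE `u` from AP `n`, and `sum_se` is a user-supplied callable standing in for the sum SE of (8), which is not reproduced here. The "strongest AP" of a UE is taken over all APs, following the description in Section IV.

```python
import numpy as np

def heuristic_mrt(P, sum_se, p_offset=200.0):
    """Sketch of Algorithm 1. P has shape (num_UEs, num_APs), num_UEs >= 2.

    sum_se(selected_aps) -> float is a placeholder for the sum SE evaluation.
    """
    P = np.asarray(P, dtype=float)
    ue_of_ap = P.argmax(axis=0)                 # step 2: UE mapping
    top2 = np.sort(P, axis=0)[-2:]              # per AP: [second strongest, strongest]
    ratio = top2[1] / top2[0]                   # step 3: interference estimate
    order = [int(n) for n in np.argsort(-ratio, kind="stable")]
    strongest = P.max(axis=1)                   # per UE: power from its strongest AP

    selected, served = [], set()
    for n in order:                             # step 4: Loop 1, one AP per UE
        u = int(ue_of_ap[n])
        if u not in served:
            selected.append(n)
            served.add(u)
    for n in order:                             # step 5: Loop 2, remaining APs
        if n in selected:
            continue
        u = int(ue_of_ap[n])
        strong_enough = P[u, n] * p_offset >= strongest[u]
        if strong_enough and sum_se(selected + [n]) > sum_se(selected):
            selected.append(n)
    return sorted(selected)                     # step 6

# Toy example: 2 UEs, 4 APs; the sum SE placeholder simply rewards more APs.
P_toy = np.array([[100.0, 1.0, 50.0, 0.1],
                  [1.0, 80.0, 2.0, 0.2]])
chosen = heuristic_mrt(P_toy, sum_se=lambda aps: float(len(aps)))
```

In the toy example AP 3 is rejected because its power at its mapped UE is more than 200 times weaker than that UE's strongest AP, regardless of the sum SE placeholder. Note that a UE to which no AP is mapped in step 2 stays unserved in this sketch.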

V. SIMULATION RESULTS AND DISCUSSIONS
In this section, we present the simulation environment, the convergence of the RL algorithm, the computational complexity analysis, the performance penalty analysis and the sum SE performance of the proposed precoding-aware RRS algorithms. The performance is first shown in the presence of fixed noise and interference. Next, we focus on the transmit power aspects of the precoders and show the performance of the RRS in an environment where the level of noise and interference depends on the precoder transmit power, which is the case when neighboring systems use the same RRS strategies.

A. Simulation Environment
The simulated system details are presented in Fig. 5. It is a dense low-power network consisting of 64 APs spaced by 40 m in a serving area of 320 m × 277 m, serving 10 randomly distributed UEs in 1000 simulation iterations.

B. DQN Vs. C-Matrix-Based RL
In this research, we aim at comparing our best heuristic-based approach with our best ML-based approach for a set of precoders at different noise levels. Besides the RL with the C-matrix, we have also investigated a Deep Q-network (DQN) for resource allocation with the MRT precoder. For the DQN, we used a fully connected 3-layer neural network with an input layer encoding the state, an output layer providing Q-values for the actions, one hidden layer with 20 nodes, and a sigmoid activation function. For the noise level of −60 dBm, the "MRT with DQN" yields almost identical sum SE performance to the "MRT with RL"; however, the computation time of the MRT with DQN is 180 s compared to 2.4 s for the RL algorithm, as shown in Fig. 8. Thus, we have selected the RL with C-matrix as our main ML algorithm.

Fig. 6. The sum SE comparison for "MRT with DQN", "MRT with RL" and "MRT with heuristic" for a reference scenario with 64 APs and 10 UEs with −60 dBm noise.
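For orientation, the forward pass of a network with the shape described above can be sketched with NumPy. Only the layer sizes (N inputs, 20 hidden nodes, N outputs) and the sigmoid activation come from the text; the class name `TinyDQN`, the random weight initialization and the linear output layer are assumptions made for illustration, and training is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyDQN:
    """Forward pass only: state bits -> 20 sigmoid hidden nodes -> one Q-value per action."""

    def __init__(self, n_aps, n_hidden=20, seed=0):
        rng = np.random.default_rng(seed)
        # Assumed initialization: small Gaussian weights, zero biases.
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_aps))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_aps, n_hidden))
        self.b2 = np.zeros(n_aps)

    def q_values(self, state):
        """`state` is the binary AP-selection vector of length N."""
        h = sigmoid(self.W1 @ np.asarray(state, dtype=float) + self.b1)
        return self.W2 @ h + self.b2     # assumed linear output layer

net = TinyDQN(n_aps=64)
q = net.q_values(np.zeros(64))           # Q-values of the 64 toggle actions
best_action = int(np.argmax(q))
```

Even this small network carries 2 × 64 × 20 weights that must be trained by gradient descent, which is consistent with the much longer computation time reported for the DQN.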

C. Convergence of the RL RRS Algorithm
The RL algorithm was trained with the learning rate α = 0.3, discount factor γ = 0.1 and exploration factor ϵ = 0.5 in 40 lessons, each consisting of N steps. The lesson length scales with N because the time after which the C-matrix converged depended on the number of APs.
A crucial aspect of the RL algorithm is its exploitation mode convergence, i.e., the capability to reach the result regardless of the initial state. Fig. 7 shows three exploitation mode searches for five different combinations of the number of APs and UEs, when ϵ = 0. It can be seen in the figure that (i) the RL algorithm converges very well for all combinations of APs and UEs regardless of the initial state, (ii) the number of iterations required to converge depends on the number of APs that will finally be serving as well as on how far the initial state was from the target state, and (iii) when no AP is initially serving (solid line in Fig. 7), the convergence takes as many iterations as the final number of serving APs.

D. Computational Time Analysis
To estimate the complexity of the proposed RRS algorithms, Fig. 8 shows their computation time for networks of 16 to 196 APs serving 10 UEs.

E. Performance Penalty of the Proposed RRS Algorithms
Fig. 9 shows the sum SE performance of the RRS algorithms compared to the exhaustive search for a small network similar to the one in Fig. 5, but with 8 APs, 3 UEs and 1 PRB. The performance gap of the "MRT with RL" relative to the "MRT exh. search 1 UE per AP/PRB" and the "MRT exh. search 2 UEs per AP/PRB", which were defined in Section IV-B, increases as the noise decreases. For the lowest noise of −90 dBm, the "MRT exh. search 2 UEs per AP/PRB" yields a sum SE of 24.4 bit/s/Hz, the "MRT exh. search 1 UE per AP/PRB" 23.5 bit/s/Hz and the "MRT with RL" 22.5 bit/s/Hz.
• Performance gap of Assumption 2 defined in Section IV-B, i.e., the performance difference between the "MRT exh. search 1 UE per AP/PRB" and the "MRT exh. search 2 UEs per AP/PRB", is 3.7%. Thus, the constraint that an AP may serve at most one UE per PRB does not lead to a significant performance penalty. Serving two UEs per PRB seldom improves the sum SE.
• Performance gap of Assumption 3 defined in Section IV-B, i.e., the difference between the "MRT with RL" and the "MRT exh. search 1 UE per AP/PRB", is 4.4%. Thus, the assumption that an AP serves the UE with the strongest signal power from this AP does not lead to a significant performance gap for MRT precoding, but it significantly reduces the complexity of the problem by restricting the search to the region where a good resource allocation is likely to be found.
Fig. 9 also shows that, for higher noise levels, the performance gap of the "MRT with RL" is smaller. The ZF with RL does not have any performance gap for the analyzed network. The solution with all APs serving provides the maximum performance with ZF and this is also the RL choice.

F. Performance Analysis of the RRS Algorithms With Fixed Noise
We have analyzed the sum SE performance of our RRS algorithms at different noise levels. The noise is calculated as the signal received from the interfering APs of the external network shown in Fig. 5. The noise distribution is close to the log-normal distribution with a standard deviation of 4.8 dB. To show the RRS algorithm performance at different noise levels, we scale the noise distribution linearly in the power domain to the desired mean value. Fig. 10 presents the DL sum SE, as defined in (8), in the presence of relatively high noise with a mean value of −60 dBm. It can be seen in the figure that the standard Round Robin scheduler yields a median sum SE of 9.3 bit/s/Hz. The "MRT with RL" algorithm, with a median sum SE of 21.9 bit/s/Hz, performs 2.4 times better than the Round Robin scheduler. The "MRT with heuristic", with a median sum SE of 21.5 bit/s/Hz, yields a similar, but slightly worse, spectral efficiency than the "MRT with RL".
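The linear scaling of the noise distribution mentioned above can be illustrated with a short sketch. It assumes that the "mean value" refers to the mean in the linear power domain (mW), which the text does not state explicitly; the function name `scale_noise_dbm` and the synthetic input samples are hypothetical.

```python
import numpy as np

def scale_noise_dbm(noise_dbm, target_mean_dbm):
    """Scale noise samples (in dBm) by one common linear factor so that
    their mean power, taken in mW, equals the target (assumed convention)."""
    noise_mw = 10.0 ** (np.asarray(noise_dbm, dtype=float) / 10.0)   # dBm -> mW
    target_mw = 10.0 ** (target_mean_dbm / 10.0)
    scaled_mw = noise_mw * (target_mw / noise_mw.mean())
    return 10.0 * np.log10(scaled_mw)                                # mW -> dBm

# Synthetic stand-in for the simulated noise: log-normal, 4.8 dB standard deviation.
rng = np.random.default_rng(1)
samples_dbm = rng.normal(-75.0, 4.8, size=1000)
scaled_dbm = scale_noise_dbm(samples_dbm, target_mean_dbm=-60.0)
mean_after_dbm = 10.0 * np.log10(np.mean(10.0 ** (scaled_dbm / 10.0)))
```

Because a single multiplicative factor is a constant shift in dB, the shape of the distribution, including its 4.8 dB spread, is preserved while the mean moves to the desired level.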
The two MRT algorithms converge to different numbers of APs to serve 10 UEs. The RL algorithm learns the environment and does not require any configuration parameters to determine the number of serving APs. The heuristic MRT algorithm sequentially adds APs only if they meet the minimum signal power threshold and improve the sum SE. As a result, the "MRT with RL" chooses on average 62.3 APs while the "MRT with heuristic" chooses on average 51.3 APs.
The usage of 62.3 APs out of the 64 available results in the system power utilization factor κ = 62.3/64 = 0.973, as shown in Table I. Results of similar calculations performed for the heuristic MRT and the Round Robin schedulers are also shown in the table. The "MRT with heuristic" uses less power than the "MRT with RL" and thus creates less interference.

Fig. 10 also shows that, in the presence of noise at the level of −60 dBm, the performance of the RZF precoder is significantly better than the performance of the ZF precoder, while the OZF precoder is superior to the RZF. In the figure, the performance of each precoder is shown with three radio resource allocation algorithms. The curve "with all APs" denotes the case where all 64 APs in the network are serving, which means that no radio resource allocation algorithm is needed. The "ZF with RL", which optimizes the selection of serving APs, yields only a slightly better sum SE than the "ZF with all APs". This means that the power normalization, which motivated us to consider limiting the number of serving APs for the ZF, RZF, and OZF precoders (see Section IV-A), does not harm the sum SE performance very much; therefore, using all APs often gives the best sum SE performance.
The RL trained for the ZF precoder, but used with the RZF or OZF precoders, is not superior to the solution with all APs. The heuristic algorithm for ZF, RZF and OZF precoders, which optimizes the selection of 35 serving APs (see Algorithm 2), yields a sum SE which is not much poorer than that of the solution "with all APs". The heuristic algorithm can be used to optimize the selection of serving APs if this is required due to limited fronthaul capacity.
The resulting transmit power of the RRS technique can be expressed by the power utilization factor κ. In the case of ZF, RZF, and OZF precoders, the APs do not necessarily transmit at their maximum output power; therefore, the power utilization is defined as the average AP PRB transmission power divided by the maximum AP PRB transmission power:

$$\kappa = \frac{1}{N B P_{\max}} \sum_{n=1}^{N} \sum_{b=1}^{B} P_{nb},$$

where N is the number of APs, B is the number of PRBs in the carrier bandwidth, $P_{\max}$ is the maximum transmit power per PRB and $P_{nb}$ is the actual power transmitted by AP n on PRB b. The resulting power utilization for all the precoders is shown in Table I, where it can be seen that the RZF precoder uses more power than the ZF precoder and the OZF precoder uses more power than the RZF precoder. Out of the three radio resource allocation algorithms (i.e., "RL", "heuristic" and "all APs"), the heuristic algorithm transmits the least power due to the lowest number of serving APs. Interestingly, due to power normalization, the "ZF with RL" uses more power than the "ZF with all APs".

Fig. 11 shows the mean sum SE performance for a wide range of noise power. The performance for noise at the level of −60 dBm, which was analyzed so far, is marked by a vertical line. In real systems, the noise is mostly caused by power transmitted from other macro or micro APs that transmit on the same frequency, causing interference. In the following subsection, we evaluate RRS taking into account the power transmitted by different precoders.
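The verbal definition of κ (average AP PRB transmission power divided by the maximum AP PRB transmission power) translates directly into code. The sketch below assumes the per-AP, per-PRB transmit powers are available as an N × B array in the same linear unit as the maximum power; the function name `power_utilization` and the example numbers are hypothetical.

```python
import numpy as np

def power_utilization(p_tx, p_max):
    """kappa: mean transmit power over all APs and PRBs, divided by the maximum power per PRB.

    p_tx has shape (N, B); non-serving APs contribute zero power.
    """
    p_tx = np.asarray(p_tx, dtype=float)
    return float(p_tx.mean() / p_max)

# Illustrative MRT-like case: serving APs transmit at full power on one PRB.
p_max = 1.0
p_tx = np.zeros((64, 1))
p_tx[:48, 0] = p_max                             # 48 of 64 APs serving
kappa_example = power_utilization(p_tx, p_max)   # 48 / 64
```

For a full-power precoder this reduces to the fraction of serving APs, which is how the value κ = 62.3/64 for the "MRT with RL" arises when averaged over the simulation iterations.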

G. Performance Analysis of the RRS Algorithms With Adaptive Noise
In this subsection, we simulate a scenario where the serving micro CF mMIMO network is deployed next to other micro CF mMIMO networks which are scheduled independently, for example, by different Distributed Units. Fig. 5 shows 68 neighboring APs, which are located around the serving network. We assume that the neighboring APs perform RRS independently and interfere with the UEs in the serving network. The interference power at each UE is estimated as the sum of the received power from all 68 interfering APs. We further assume that the interfering APs follow the same RRS strategy as the serving network; therefore, we assume that the power transmitted by each interfering AP is equal to the average power transmitted by the APs of the serving network. In other words, the power utilization factor κ of the interfering network is equal to the power utilization factor of the serving network. To fulfill the equal power utilization requirement, we adjusted the power of the interfering APs according to the κ of the serving network and repeated the simulations until the resulting κ of the serving network was equal to the κ of the interfering APs. The resulting κ at convergence is given in Table I for the adaptive noise scenario. The interference level at convergence for the given RRS technique is marked in Fig. 11 and the CDF for all the investigated RRS techniques is shown in Fig. 12.
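The procedure of adjusting the interfering power and repeating the simulations until both κ values agree is a fixed-point iteration, which can be sketched as follows. The function `simulate_kappa` is a hypothetical placeholder for a full network simulation that returns the serving network's κ for a given κ of the interfering APs; the starting value, tolerance and iteration limit are assumptions.

```python
def converge_kappa(simulate_kappa, kappa_start=1.0, tol=1e-3, max_iter=50):
    """Iterate until the serving network's kappa matches the kappa assumed
    for the interfering APs (adaptive noise scenario)."""
    kappa_interf = kappa_start
    for _ in range(max_iter):
        kappa_serving = simulate_kappa(kappa_interf)   # run RRS under this interference
        if abs(kappa_serving - kappa_interf) < tol:
            return kappa_serving
        kappa_interf = kappa_serving                   # feed the result back
    raise RuntimeError("kappa did not converge")

# Toy stand-in for the simulator: an affine response with fixed point 0.5 / 0.7.
kappa_star = converge_kappa(lambda k: 0.5 + 0.3 * k)
```

The toy response is a contraction, so the loop settles within a handful of iterations. Whether the real simulation behaves as a contraction for every precoder is not stated in the text, hence the iteration limit.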
It can be seen in Fig. 11 that, in the adaptive noise scenario, the ZF, RZF, and OZF precodings are superior to the MRT precoding not only because they avoid interfering with the UEs of the serving network, but also because they transmit less power, resulting in less noise.
Interestingly, the more complex RZF and OZF precodings yield a mean sum SE similar to that of the simple ZF precoding. Among the MRT resource allocation algorithms, the RL performs better than the heuristic algorithm for any fixed noise level, but in the adaptive noise scenario, the MRT heuristic performs slightly better because it uses fewer PRBs and creates less noise.
Fig. 12 shows that the superiority of the RZF and OZF precoding over the ZF lies in a more uniform sum SE distribution. The figure also shows that, for all the precoders, the RL and heuristic algorithms provide similar sum SE performance in the adaptive noise scenario.

VI. CONCLUSION
This paper presents a reinforcement learning (RL) algorithm that determines the serving APs in the CF mMIMO system for sum spectral efficiency (SE) maximization. The algorithm is applicable to the MRT, ZF, RZF and the proposed Optimized Zero Forcing (OZF) precoders. To manage the huge number of possible network states, a compression technique has been proposed, which results in fast learning and a low performance penalty. The paper also proposes two heuristic algorithms for the selection of serving APs, one for the MRT precoder and another for the ZF, RZF and OZF precoders. The heuristic algorithms provide a sum SE similar to that of the RL algorithm with an over 1000-fold reduction in computation time. The paper presents the adaptive noise approach to radio resource scheduling in the CF mMIMO architecture, where the noise is proportional to the transmitted power for a given precoder. The adaptive noise represents a scenario where a CF mMIMO network is deployed next to another CF mMIMO network scheduled independently, for example, by a different Distributed Unit. The proposed RL and heuristic algorithms show that, in a dense low-power network, the advantage of the RZF and OZF precoding over the ZF precoding is clearly visible in the high fixed noise and interference scenario, e.g., when the considered micro CF mMIMO network is interfered with by a strong macro cell. In this scenario, however, the MRT precoding is superior to the ZF, RZF and OZF. If the CF mMIMO network is adjacent to other independently scheduled CF mMIMO networks, referred to as the adaptive noise scenario, the ZF precoder is superior to the MRT precoder, while both the RL and heuristic algorithms for ZF reduce the number of serving APs and yield a sum SE similar to that of the ZF with all APs. In the adaptive noise scenario, the RZF and OZF do not show a significant mean sum SE improvement compared to the ZF; however, they provide a more uniform sum SE at the cost of an over 10-fold higher computation time. In future work, we will use a deep reinforcement learning model to develop an algorithm that can learn the network topology off-line and apply its knowledge to any UE locations. We will also investigate radio resource scheduling considering UE mobility and imperfect channel state information.

Fig. 1. The CF mMIMO network model. The figure shows the channels in one PRB; therefore, the frequency-dependent index b has been omitted.

Fig. 2. Interaction of the agent with the environment in the RL algorithm.

Algorithm 2 The Heuristic Resource Allocation Algorithm for ZF, RZF, and OZF Precoders
1: Start
2: Outer loop: Loop over i from 1 to N, where N is the number of APs.
3: i-th list: For each UE, find the AP which provides the i-th strongest signal power for the UE, next sort the found APs in the descending order of the received signal power.
4: Inner loop: Loop over the i-th list from 1 to U, where U is the number of UEs, and select APs from the list until the required number of APs, $R_{AP}$ (e.g., $R_{AP} = 35$), is selected.
5: Return: Return selected APs.
6: End
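Algorithm 2 maps onto a short Python sketch. As before, this is an illustration under assumptions, not the authors' code: `heuristic_zf` is a hypothetical name, `P[u, n]` is assumed to hold the power received by UE `u` from AP `n`, and an AP that appears again in a later list is simply skipped, a detail the algorithm box leaves open.

```python
import numpy as np

def heuristic_zf(P, r_ap):
    """Sketch of Algorithm 2: pick r_ap APs by per-UE signal strength rank.

    P has shape (num_UEs, num_APs); r_ap must not exceed num_APs.
    """
    P = np.asarray(P, dtype=float)
    n_ue, n_ap = P.shape
    # ranked[u, i] is the AP providing the (i+1)-th strongest power at UE u.
    ranked = np.argsort(-P, axis=1, kind="stable")
    selected = []
    for i in range(n_ap):                                   # outer loop over ranks
        candidates = ranked[:, i]                           # i-th list, one AP per UE
        powers = P[np.arange(n_ue), candidates]
        candidates = candidates[np.argsort(-powers, kind="stable")]   # descending power
        for n in candidates:                                # inner loop over the list
            if len(selected) == r_ap:
                return selected
            if n not in selected:                           # assumption: skip repeats
                selected.append(int(n))
    return selected

# Toy example: 2 UEs, 4 APs, 3 APs required.
P_toy2 = np.array([[10.0, 5.0, 1.0, 0.5],
                   [2.0, 8.0, 3.0, 0.1]])
picked = heuristic_zf(P_toy2, r_ap=3)
```

No SE evaluation appears anywhere in the loop, which is why this heuristic is cheap: its cost is dominated by sorting the received powers.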

Fig. 4. The relative sum SE for the mean noise per PRB of −80 dBm for a network with 64 APs.

Fig. 7. The convergence of MRT with RL for a noise and interference level of −70 dBm. The impact of the network size (36, 64 and 196 APs) and the number of UEs (5, 10 and 20) is presented. For each combination of the number of APs and the number of UEs, three exploitation mode search attempts are shown with different numbers of initially serving APs: solid line, no AP initially serving; dashed line, random number of APs initially serving; dotted line, all APs initially serving.

Fig. 8. Computation time of the RRS algorithms vs. number of APs in the system.

Fig. 9. Performance gap of the RRS algorithms for a network with 8 APs and 3 UEs, indicated by red arrows.

Fig. 11. Mean sum SE performance of the RRS algorithms at different noise levels.
• State is the current selection of serving APs on the PRB. Based on the assumptions in Section IV-B, in the case of the MRT precoder, the serving AP serves one UE per PRB and, for ZF, RZF, and OZF, serves all users on the PRB. The state is identified by a binary sequence of length N, where the n-th bit of the sequence indicates whether the n-th AP is selected. The number of possible states, i.e., variations of AP selections for N APs, is $2^N$.
• Action is the change of the state made by toggling the selection of an AP, which is equivalent to selecting one more AP if it was not serving, or removing it if it was serving. There are N actions possible in each state.
• Reward is the sum SE change (positive or negative) resulting from the action.

TABLE I. POWER UTILIZATION FACTOR