Deep Reinforcement Learning for NextG Radio Access Network Slicing With Spectrum Coexistence

Reinforcement learning (RL) is applied for dynamic admission control and resource allocation in NextG radio access network slicing. When sharing the spectrum with an incumbent user (that dynamically occupies frequency-time blocks), communication and computational resources are allocated to slicing requests, each with priority (weight), throughput, latency, and computational requirements. RL maximizes the total weight of granted requests over time, outperforming myopic, greedy, random, and first come, first served solutions. As the state-action space grows, a deep Q-network admits requests and allocates resources effectively as a low-complexity solution that is robust to sensing errors in detecting incumbent user activity.


I. INTRODUCTION
NextG communications has evolved to meet increasing user demand with its quality of experience (QoE) promises, including high throughput and low end-to-end delay. Starting with 5G, the Radio Access Network (RAN) slicing capability is introduced, where the physical network infrastructure is shared among mobile virtual network operators. The static allocation of resources (such as frequency, power, and computational resources) is replaced by reserving them on the fly with network slicing based on dynamic user demands. Therefore, novel algorithms are needed for admission control and resource allocation of network slicing requests to serve applications such as enhanced Mobile Broadband (eMBB), massive machine-type communications (mMTC), and ultra-reliable low-latency communications (URLLC), based on their throughput and latency requirements [2]. The control of RAN slices can then be implemented in the near-real-time RAN Intelligent Controller (RIC) of Open RAN (O-RAN) to support microservice-based applications (xApps).
Since the spectrum is a scarce resource, NextG communications is envisioned to share the spectrum with legacy systems. For example, the Federal Communications Commission (FCC) adopted rules for shared commercial use of the 3.5-GHz Citizens Broadband Radio Service (CBRS) band [3], where NextG communications and radar systems (the incumbent user) need to share the spectrum. To prevent interference with the incumbent user, the Environmental Sensing Capability (ESC) monitors the spectrum and informs the Spectrum Access System (SAS) to reconfigure NextG communications. In this letter, we consider the dynamic allocation of resources to NextG network slicing requests in spectrum coexistence scenarios.
In RAN slicing, the complex network dynamics make the underlying network optimization problem challenging. Reference [4] considered RAN slicing at the resource block (RB) level to minimize interference and improve base station (BS) coordination. Reference [5] modeled the assignment of resources to multiple requests as a geometric knapsack problem. Reference [6] proposed resource allocation algorithms (based on convex optimization and an auction game) for different slices in a cloud RAN.
Machine learning can optimize the RAN as an alternative to model-based approaches that easily become intractable due to the complexity of the dynamics involving resources and requests. As training data may not be readily available, reinforcement learning (RL) can be used to learn from the NextG network environment and update resource allocation decisions for network slicing. To that end, RL was compared to static and round-robin scheduling in [7]. Bandwidth and computational resources were considered in [8] for network slicing subject to service-upon-arrival and batch service modes. In [9], RL-based resource allocation was compared to heuristic, best-effort, and random methods. Reference [10] presented a network slicing prototype in an end-to-end mobile network system. Reference [11] used RL for power-efficient resource allocation in cloud RANs with multiple transmitters and receivers at the same frequency. In [12], RL-based resource allocation was considered by predicting communication requests. Security aspects of RAN slicing with RL were considered in [13], [14].
Previous work on 5G network slicing did not consider the constraints imposed on resources by the potential spectrum utilization of incumbent signals in the same band. Overall, there is a lack of a full characterization of the RL states, actions, and rewards, of the communication and computational resources, and of the objectives of a network slicing request with respect to latency, throughput, priorities, and deadlines.
In this letter, we consider the dynamic allocation of RBs, transmit power, and computational resources to support downlink communications from a NextG BS (gNodeB) to user equipments (UEs) in a spectrum sharing scenario (shown in Fig. 1(a)), where the network resources occupied by incumbent users change over time and are known only subject to spectrum sensing errors. Each request arrives with priority, throughput, CPU usage, and latency (deadline) requirements, and needs to be served for a specific duration (without interruption from any other user). Admission control is formulated as a Markov decision process and solved with RL, accounting for the coupling of current and future allocations and improving efficiency compared to a myopic solution. In RL, the state is the set of available resources and requests. The state transitions over time depending on how resources are occupied (by granted network slicing requests or the incumbent user) or released (by completed requests). The actions are the admission control decisions for requests. The reward is the sum of satisfied requests weighted by their priorities. We use Q-learning and deep Q-learning (as illustrated in Fig. 1(b)) to admit network slicing requests over a time horizon. Our RL solution provides major gains in network reward compared to myopic, greedy, random, and first come first served (FCFS) algorithms. As the number of UEs increases or the priorities of network slicing requests change over time, we show that RL successfully adapts to the dynamics of user demands and the spectrum utilization of incumbent users. Since there are potential errors in detecting incumbent signals, resources are only known imperfectly subject to misdetection and false alarm errors. We show the robustness of our RL-based solution in the presence of spectrum sensing errors.
The rest of this letter is organized as follows. Section II describes resources, requests, and optimization for network slicing. Section III presents the RL solutions. Section IV evaluates the performance. Section V concludes this letter.

II. NETWORK RESOURCES AND SLICING REQUESTS FOR ADMISSION CONTROL AND RESOURCE ALLOCATION

A. Network Resources and Network Slicing Requests
The system model is shown in Fig. 1(a). There is one gNodeB serving N UEs, each requesting downlink communication service with dynamic QoE levels regarding different throughput, CPU usage, and latency (deadline) requirements, and priorities (relative importance) for different network services (eMBB, URLLC, and mMTC). We consider dynamic spectrum sharing of NextG with an incumbent user such as radar in the CBRS band [3]. When the radar signal is detected by the ESC, the SAS reconfigures the RAN to assign network slices to frequency bands that are not occupied by the radar.
Requests are handled by the gNodeB, and then appropriate network slices with corresponding RBs are assigned to the requests. If a request is not answered yet, it stays in a waiting list until its deadline (the time limit from the request arrival until the service starts) expires. The objective is to maximize the weighted sum of supported requests, where the weights represent the priorities of these requests. Network resources are bandwidth, communication power, and CPU usage. At time $t$, there is a set of active requests $A(t)$ that includes requests that have just arrived or requests in the waiting list. The CPU usage requirement of UE $i$ for its network slicing request $j$ is

$$P^{C}_{ij} \geq p^{C}_{ij}, \qquad (1)$$

where $P^{C}_{ij}$ is the assigned computational resource (measured by CPU usage) and $p^{C}_{ij}$ is the minimum required resource. The throughput requirement of UE $i$ for its request $j$ is

$$D_{ij} \geq d_{ij}, \qquad (2)$$

where $D_{ij}$ is the achieved rate and $d_{ij}$ is the minimum required rate. $D_{ij}$ is determined by the bandwidth $F_{ij}$, the transmit power $P^{T}_{ij}$ (for downlink traffic) at the gNodeB, the modulation and coding scheme for communications between the gNodeB and UE $i$, and channel effects. For 5G NR, the approximate rate (in bps) for a given number of aggregated carriers in a band or band combination is computed as [15]

$$r = \sum_{k=1}^{K} v_m^{(k)} \cdot f^{(k)} \cdot R_{\max} \cdot \frac{N_{PRB}^{B(k),\mu} \cdot 12}{T_s^{\mu}} \cdot \left(1 - O^{(k)}\right), \qquad (3)$$

where $K$ is the number of aggregated component carriers (CCs) in a band or band combination and, for the $k$th CC, $v_m^{(k)}$ is the maximum supported modulation order, $f^{(k)}$ is the scaling factor in {1, 0.8, 0.75, 0.4}, $R_{\max} = 948/1024$, $\mu$ is the numerology defined in [16], $N_{PRB}^{B(k),\mu}$ is the maximum RB allocation in the UE-supported maximum bandwidth $B(k)$ in the given band (or band combination), $T_s^{\mu} = \frac{10^{-3}}{14 \cdot 2^{\mu}}$ is the average OFDM symbol duration in a subframe for normal cyclic prefix, and $O^{(k)}$ is the overhead (0.08 for the uplink in frequency range 1). Assuming a single-antenna UE with QPSK modulation, 60 kHz subcarrier spacing, and 10 MHz bandwidth, (3) becomes $r = c \times K$, where $c \approx 12.59 \times 10^6$. The achieved data rate $r$ is reduced by the bit error rate (BER). For the additive white Gaussian noise channel and low-density parity-check coding, the BER is computed as 0.1, 0.1, 0.09, 0.08, and $10^{-3}$ at −5, −4, −3, −2, and −1 dB SNR values, respectively. Thus, (2) becomes

$$(1 - e_{ij}) \cdot c \cdot K_{ij} \geq d_{ij}, \qquad (4)$$

where $e_{ij}$ is the BER and $K_{ij}$ is the number of CCs assigned to UE $i$'s request $j$.
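As a concrete check of (3), the following is a minimal sketch that computes the per-CC rate under the single-antenna QPSK setting above; the function name and default arguments are ours for illustration, not from [15].

```python
# Approximate 5G NR data rate per (3), reconstructed above. A minimal sketch;
# the defaults mirror the single-antenna QPSK, 60 kHz SCS, 10 MHz setting.

def nr_rate_bps(K, v_m=2, f=1.0, R_max=948 / 1024, n_prb=11, mu=2, overhead=0.08):
    """Approximate rate in bps for K aggregated component carriers."""
    T_s = 1e-3 / (14 * 2**mu)          # average OFDM symbol duration (normal CP)
    per_cc = v_m * f * R_max * (n_prb * 12) / T_s * (1 - overhead)
    return K * per_cc

c = nr_rate_bps(1)                      # rate of a single CC
print(f"c = {c / 1e6:.2f} Mbps per CC") # ~12.59e6, matching the text

# Effective rate after bit errors per (4), e.g., BER = 0.08 at -2 dB SNR:
ber = 0.08
print(f"effective rate = {(1 - ber) * c / 1e6:.2f} Mbps")
```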

B. Optimization Problem for Network Slicing
Denote $x_{ij}(t)$ as the binary indicator of whether UE $i$'s request $j$ is satisfied at time $t$. The constraints of resource assignments to network slices are given by

$$\bigcup_{(i,j) \in A(t):\, x_{ij}(t)=1} F_{ij}(t) \subseteq F(t), \qquad \sum_{(i,j) \in A(t)} x_{ij}(t)\, P^{k}_{ij}(t) \leq P^{k}(t), \qquad (5)$$

where $k \in \{C, T\}$ and $F(t)$, $P^{C}(t)$, and $P^{T}(t)$ are the available communication (frequency-time blocks), computational, and transmit power resources, respectively, at the gNodeB at time $t$. Some of the gNodeB's resources may have already been assigned to other requests or occupied by incumbent users, and thus are not available. We assume that the ESC fails to detect an incumbent-occupied RB with probability of misdetection $p_M$, and flags an available RB as occupied with probability of false alarm $p_F$. The myopic objective that optimizes the resource allocation at time $t$ only is to select $F_{ij}(t)$, $P^{C}_{ij}(t)$, and $P^{T}_{ij}(t)$ for

$$\max \sum_{(i,j) \in A(t)} w_{ij}\, x_{ij}(t) \qquad (6)$$

subject to (1) and (4)-(5), where $w_{ij}$ is the weight of UE $i$'s request $j$, reflecting its priority. In (6), the myopic objective function is the reward of the network slicing allocation at time $t$.
Next, we consider optimization over a time horizon. The resources are updated from time $t-1$ to time $t$ as

$$F(t) = \left(F(t-1) \cup F_r(t-1)\right) \setminus F_a(t), \qquad (7)$$

$$P^{k}(t) = P^{k}(t-1) + P^{k}_r(t-1) - P^{k}_a(t), \qquad (8)$$

where $k \in \{C, T\}$ and $F_r(t-1)$, $P^{C}_r(t-1)$, and $P^{T}_r(t-1)$ are the resources released at $t-1$ on frequency, CPU usage, and transmit power, respectively, and $F_a(t)$, $P^{C}_a(t)$, and $P^{T}_a(t)$ are the resources allocated at $t$ on frequency, CPU usage, and transmit power, respectively. Each request has a lifetime $l_{ij}$, and if it is satisfied at $t$ (the service starts at $t$), this request will end at time $t + l_{ij}$. $R(t)$ is the set of requests ending (completed or expired) at $t$. The released and allocated (communication and computational) resources at $t$ are

$$F_r(t) = \bigcup_{(i,j) \in R(t)} F_{ij}(t), \qquad P^{k}_r(t) = \sum_{(i,j) \in R(t)} P^{k}_{ij}(t), \qquad (9)$$

$$F_a(t) = \bigcup_{(i,j) \in A(t):\, x_{ij}(t)=1} F_{ij}(t), \qquad P^{k}_a(t) = \sum_{(i,j) \in A(t)} x_{ij}(t)\, P^{k}_{ij}(t), \qquad (10)$$

where $k \in \{C, T\}$. The optimization problem changes to

$$\max \sum_{t} \sum_{(i,j) \in A(t)} w_{ij}\, x_{ij}(t) \qquad (11)$$

subject to (1), (4)-(5), and (7)-(10). If we unrealistically assume that the gNodeB knows all future requests in advance, this problem can be solved offline. Instead, we solve this problem with RL [12] without any knowledge of future requests. This formulation can be extended to multiple gNodeBs, each with its own transmit power and processing power, as follows.

Pre-processing: An RB is unavailable if (i) it was already allocated to earlier requests or (ii) it is used by neighboring gNodeBs. Each gNodeB can sense the spectrum and try to identify the RBs used by neighboring gNodeBs, or it can obtain such information in post-processing.

Post-processing: Neighboring gNodeBs may plan to assign the same RBs in their RL algorithms. A central controller can be used to resolve such conflicts (by removing some assignments) and broadcast the remaining assignments. A distributed scheme is also possible, by letting each gNodeB locally broadcast its planned assignments and compare them with the planned assignments of neighboring gNodeBs to decide whether a planned assignment should be kept.
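To make the bookkeeping in (7)-(10) concrete, the following is a minimal sketch that treats frequency, CPU usage, and transmit power as scalar amounts (the formulation above tracks frequency-time blocks as sets); the Request and update_resources names are hypothetical, not from the letter.

```python
# A minimal per-slot resource bookkeeping sketch for (7)-(10).

from dataclasses import dataclass

@dataclass
class Request:
    rbs: int        # F_ij: frequency-time blocks required (counted, not set-valued)
    cpu: float      # P^C_ij: CPU usage required
    power: float    # P^T_ij: transmit power required
    weight: int     # w_ij: priority
    lifetime: int   # l_ij: service duration in slots

def update_resources(avail, granted, completed):
    """avail = (F, P^C, P^T); apply one slot of updates per (7)-(10)."""
    F, Pc, Pt = avail
    # (9): resources released by requests ending at the previous slot
    for r in completed:
        F, Pc, Pt = F + r.rbs, Pc + r.cpu, Pt + r.power
    # (10): resources allocated to requests granted in the current slot
    for r in granted:
        F, Pc, Pt = F - r.rbs, Pc - r.cpu, Pt - r.power
    return F, Pc, Pt
```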

III. REINFORCEMENT LEARNING FOR RAN SLICING
At the beginning of each time slot, we consider all requests in the waiting list, including requests that arrived in the previous time slot and older requests whose deadlines have not yet expired. RBs occupied by an incumbent user are avoided, subject to misdetection errors (with probability $p_M$) and false alarm errors (with probability $p_F$) in detecting the incumbent user.
State: The state consists of (i) a network status part, representing the occupied resources, and (ii) a request-specific part, with the required amount of resources, the weight, and the duration of the considered request.
Actions: The actions are the admission control decisions of accepting or rejecting the considered request. If a request is accepted, resources are assigned and the network status part of the state is updated.
Reward: The immediate reward of accepting a request in a time slot depends on the weight of that request. If the incumbent user is not detected correctly, any request using at least one overlapping RB is dropped. The loss of this request in service is reflected in the reward of that time slot.
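As an illustration, the state above can be flattened into a fixed-length vector; the layout and field order below are our assumption, as the letter does not specify an encoding.

```python
# One possible flat state encoding: the network-status part (occupied
# resources) concatenated with the request-specific part.

import numpy as np

def encode_state(occupied_rbs, cpu_used, power_used,
                 req_rbs, req_cpu, req_power, weight, duration):
    return np.array([occupied_rbs, cpu_used, power_used,
                     req_rbs, req_cpu, req_power, weight, duration],
                    dtype=np.float32)
```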
Reinforcement learning: The goal is to find the optimal policy that maximizes the average cumulative discounted reward. RL learns the policy that determines which action the gNodeB should take in a given state (available resources and request). The gNodeB computes the function $Q : S \times A \to \mathbb{R}$ to evaluate the quality of action $a \in A$ producing reward $R$ at state $s \in S$. The optimal Q-function satisfies the Bellman equation

$$Q^{*}(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \mid s, a \,\right], \qquad (12)$$

where the expectation is over both the distribution of immediate rewards $r$ and the possible next states $s'$, and $\gamma$ is the discount factor ($0 \leq \gamma \leq 1$) for rewards over time. Q-learning solves (12) by iteratively updating the Q-function, while deep Q-learning uses a deep neural network (DNN) to approximate the Q-function.

A. Q-Learning Algorithm
The gNodeB maintains $Q(\cdot)$ as a Q-table. At each time $t$, the gNodeB selects an action $a_t$, observes a reward $r_t$, transitions from the current state $s_t$ to a new state $s_{t+1}$ based on action $a_t$, and updates $Q(\cdot)$. Starting with $Q(\cdot)$ as a random matrix and using the weighted average of the old value and the new information, Q-learning performs the value iteration update

$$Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) \right), \qquad (13)$$

where $\alpha$ is the learning rate ($0 < \alpha \leq 1$). In (13), $\max_a Q(s_{t+1}, a)$ is the estimate of the optimal future value. The reward at time $t$ is $r_t = w_{ij}$ if UE $i$'s request $j$ is satisfied at time $t$, and $r_t = 0$ otherwise. Multiple actions can be taken at the same time. The state transition at time $t$ is driven by blocking resources for requests granted at time $t$ and releasing resources when the lifetimes of active services expire at time $t$. The state transitions are given by (7)-(10). The optimal admission control policy is learned by interacting with the environment based on the $\epsilon$-greedy method, following the greedy policy $a^* = \arg\max_a Q(s, a)$ with probability $1 - \epsilon$ and selecting a random action with probability $\epsilon$.
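The following is a minimal sketch of the tabular update (13) with $\epsilon$-greedy exploration, assuming discrete states represented as hashable tuples and two actions (reject/accept); the hyperparameter values follow Section IV, and the exploration rate is our assumption.

```python
# Tabular Q-learning with epsilon-greedy exploration, per (13).

import numpy as np
from collections import defaultdict

alpha, gamma, eps = 0.1, 0.95, 0.1        # learning rate, discount, exploration
Q = defaultdict(lambda: np.zeros(2))      # Q-table, one row per visited state

def select_action(s):
    if np.random.rand() < eps:            # explore with probability epsilon
        return np.random.randint(2)
    return int(np.argmax(Q[s]))           # otherwise act greedily

def q_update(s, a, r, s_next):
    # (13): weighted average of the old value and the new information
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * Q[s_next].max())
```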

B. Deep Q-learning Algorithm
The complexity of the value update in (13) is very low. However, as the state-action space grows, Q-learning becomes inefficient, as it needs to keep a large Q-table for all state-action pairs. Deep Q-learning approximates the optimal action-value function with a deep Q-network (DQN) [17]. The DQN maps state-action pairs to a parameterized approximate function $Q(s, a; \theta_i)$, where the parameter $\theta_i$ denotes the neural network weights at the $i$-th iteration. We use the state representation as the input to the DQN, with a separate output for each possible action giving the estimated Q-value associated with the input state.
We adopt experience replay [17] to learn off-policy while reducing oscillations or divergence in the parameters. In experience replay, past experiences $e_t = (s_t, a_t, s_{t+1}, r_t)$ are stored in a replay memory. Training is performed using samples drawn uniformly at random from the replay memory, reducing the correlation found in sequential observations and increasing efficiency when training the network, as one step of experience may be reused in many weight updates.
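A minimal replay memory along these lines is sketched below; the capacity value is our assumption.

```python
# Experience replay: store transitions e_t, sample uniformly at random.

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # drops the oldest when full

    def push(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))  # e_t = (s_t, a_t, s_{t+1}, r_t)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```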
The DQN is trained by adjusting the parameters $\theta_i$ at each iteration to minimize the mean-squared error in (12), replacing the optimal target values with approximate target values $y_i$ computed from a previous iteration. The Q-learning update at each iteration aims to minimize the loss $L_i = (y_i - Q(s, a; \theta_i))^2$, where the parameters used to compute the target value at the $i$-th iteration are periodically updated with the DQN parameters and held constant between updates. We use a separate network to determine the target values $y_i$ in each iteration and replace the target network with the DQN every $C$ iterations, improving stability compared to online Q-learning [17]. The current state $s$ used as the input includes the current availability of network resources and the features of requests, while the two outputs are $Q(s, \text{'accept'}; \theta_i)$ and $Q(s, \text{'reject'}; \theta_i)$. We normalize the input to expedite the DQN training. The optimal policy is again learned by interacting with the environment based on the $\epsilon$-greedy method.
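A sketch of one DQN training step with a target network, using the TensorFlow/Keras stack named in Section IV; the layer sizes, optimizer, and learning rate are our assumptions, not from the letter.

```python
# One DQN training step: targets y_i from a frozen target network,
# mean-squared-error loss on Q(s, a; theta_i).

import numpy as np
import tensorflow as tf

def build_dqn(state_dim=8, n_actions=2):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(state_dim,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(n_actions),        # Q(s,'reject'), Q(s,'accept')
    ])

dqn, target = build_dqn(), build_dqn()
target.set_weights(dqn.get_weights())            # initial synchronization
optimizer, gamma = tf.keras.optimizers.Adam(1e-3), 0.95

def train_step(batch):
    s, a, s_next, r = map(np.array, zip(*batch))
    # Approximate targets y_i from the frozen target network (no gradient)
    y = r + gamma * target(s_next.astype(np.float32)).numpy().max(axis=1)
    with tf.GradientTape() as tape:
        q = tf.gather(dqn(s.astype(np.float32)),
                      a.astype(np.int32), batch_dims=1)      # Q(s, a; theta_i)
        loss = tf.reduce_mean((y.astype(np.float32) - q) ** 2)
    grads = tape.gradient(loss, dqn.trainable_variables)
    optimizer.apply_gradients(zip(grads, dqn.trainable_variables))

# Every C training steps: target.set_weights(dqn.get_weights())
```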

IV. PERFORMANCE EVALUATION
We use the following setting for performance evaluation. For each UE, requests arrive at a rate of 0.5 per slot. Each slot is 0.23 ms long with 60 kHz subcarrier spacing. CPU usage is measured from 0% to 100% in 2% increments. For each request, the weight is an integer in [1, 5], the lifetime is in [1, 10] slots, and the deadline is in [1, 20] slots. The number of transmit power levels is 5, and the maximum received SNR is in [1.5, 3]. The total bandwidth is 10 MHz and is split into 11 bands. For Q-learning, the discount factor is γ = 0.95 and the learning rate is α = 0.1. Simulations are performed in Python with TensorFlow and Keras. OpenAI's Gym framework is used for DRL. An NVIDIA GeForce RTX 2080 Ti is used as the computing platform.
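A skeleton of how such an admission-control environment can be wrapped in the Gym interface is sketched below; the hooks are placeholders for the Section II dynamics, not the actual simulator.

```python
# Gym-style admission-control environment skeleton: flat state, binary
# accept/reject action, weight-based reward. Dynamics are placeholders.

import gym
import numpy as np
from gym import spaces

class SlicingEnv(gym.Env):
    def __init__(self, n_rbs=11, horizon=1000):
        self.observation_space = spaces.Box(0.0, np.inf, shape=(8,),
                                            dtype=np.float32)
        self.action_space = spaces.Discrete(2)   # 0 = reject, 1 = accept
        self.n_rbs, self.horizon = n_rbs, horizon

    def reset(self):
        self.t = 0
        return self._observe()

    def step(self, action):
        reward = self._grant(action)             # weight of request if granted
        self.t += 1
        return self._observe(), reward, self.t >= self.horizon, {}

    # Placeholder hooks; the real versions implement the Section II dynamics.
    def _observe(self):
        return np.zeros(8, dtype=np.float32)

    def _grant(self, action):
        return 0.0
```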
Three algorithms are used as baselines. The Random algorithm allocates available resources to uniformly randomly selected network slice requests. The FCFS algorithm allocates available resources to network slice requests based on their arrival times, i.e., at any given time, the oldest network slice request is considered first. The Myopic algorithm allocates available resources to maximize the current reward only, by solving (6). The same scenario is run over 1000 time slots to compare the different algorithms. When there is no incumbent user activity, the cumulative network reward (summed over time) for serving three UEs is 1807, 1456, 1416, and 1334 for the Q-learning, Myopic, FCFS, and Random algorithms, respectively. These results are the average rewards over multiple runs. The range of achieved reward is measured as [1731, 1831] for Q-learning and [1359, 1466] for the Myopic algorithm. Fig. 2(a) shows that with more UEs and a larger arrival rate, it is more likely to find and accept requests with lower resource requirements and higher weights, increasing the network reward.
Next, we consider the NextG-radar spectrum coexistence scenario, where the radar signal (incumbent user) possibly occupies multiple frequency blocks over time. We consider two types of arrival patterns for the incumbent: (i) independent and identically distributed (i.i.d.) arrivals, where the incumbent signal appears in any time slot with probability $p_I$, and (ii) session (bursty) arrivals, where the lifetime of sessions is in [10, 50] slots (the arrival of sessions is adjusted to obtain $p_I$ as the probability of incumbent occupancy). First, we assume that the incumbent signal is reliably detected. Then, we introduce misdetection and false alarm errors.
Fig. 2(b) shows the network reward achieved by RL when we vary $p_I$ for the i.i.d. and session arrival types of the incumbent user. RL adapts to the spectrum occupancy pattern of the incumbent user successfully and is effective in utilizing the network resources left over from the incumbent user's spectrum occupancy. The network reward of RL drops as $p_I$ increases, but this drop is sub-linear in $p_I$. Fig. 3(b) shows that the average reward decreases with $p_F$, since admission control avoids the resources falsely thought to be occupied by an incumbent user. Fig. 3(c) shows that the average reward decreases with $p_M$. This is due to the failure in detecting the incumbent signal, since at the end of each time slot the requests that occupied overlapping RBs are considered failed and are sent back to the waiting list. While the spectrum sensing errors considered in Fig. 3 have an impact on efficient resource allocation, RL still effectively improves resource allocation with respect to the baseline algorithms; that is, RL is robust to sensing errors. Fig. 4(a) shows the average reward of an episode (averaged over time) vs. the number of total available RBs. While Q-learning and deep Q-learning both have very low time (computation) complexity, their space (memory) complexities are significantly different. The memory occupied by the DQN remains almost the same (around 1 MB), and thus the DQN scales up well with the number of RBs; the DQN can find a solution for up to 100 RBs in Fig. 4(a). Q-learning has a large Q-table of size $N_S \times N_A$, where $N_S$ is the number of states and $N_A$ is the number of actions. The action is either accepting a request or not, so $N_A = 2$. The state includes the number of RBs occupied (from the set {0, 1, ..., R}) by an incumbent user or by requests, the amount of computational resources used (from the set {0, 0.02, ..., 1}), the amount of communication power used (from the set {0, 1, ..., 5}), the number of RBs required (from the set {1, 2, ..., R}), the communication power determined by the throughput requirement (from the set {1, 2, ..., 5}), the amount of processor usage required (from the set {0.02, 0.04, ..., 1}), the weight (from the set {1, 2, ..., 5}), and the duration of the considered request (from the set {1, 2, ..., 10}). If R = 11, then $N_S = 12 \times 51 \times 6 \times 11 \times 5 \times 50 \times 5 \times 10 = 5.049 \times 10^8$. The amount of occupied memory increases with the state-action space (and the number of RBs) in Q-learning, namely from 24 GB for 11 RBs to 66 GB for 20 RBs. With 40 or more RBs, Q-learning runs into a memory error (the Q-table no longer fits in memory), and thus there is no Q-learning result in Fig. 4(a). Fig. 4(b) shows the results when both the DQN and Q-learning can find a solution, where the number of RBs is 11 and the number of users is up to 100. The performance of the Q-learning and deep Q-learning algorithms is close (less than 7% difference).
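The state-space count above can be verified with a few lines; the raw-table size estimate below is ours (a dense float64 table), and the larger measured footprints reported in the text presumably include implementation overhead.

```python
# Reproducing the Q-table size analysis for R = 11 RBs.

factors = {
    "occupied RBs {0..11}": 12,  "CPU used {0,0.02,..,1}": 51,
    "power used {0..5}": 6,      "RBs required {1..11}": 11,
    "power required {1..5}": 5,  "CPU required {0.02,..,1}": 50,
    "weight {1..5}": 5,          "duration {1..10}": 10,
}
N_S = 1
for v in factors.values():
    N_S *= v
print(f"N_S = {N_S:.3e}")                        # 5.049e8 states
print(f"raw table: {N_S * 2 * 8 / 1e9:.1f} GB")  # N_A = 2 actions, float64
```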

V. CONCLUSION
We addressed admission control and resource allocation for NextG RAN slicing when NextG shares the spectrum with an incumbent user subject to spectrum sensing errors. We followed an RL approach to assign communication and computational resources to network slicing requests (each with latency, rate, and CPU usage requirements and a priority weight) for downlink communications from the gNodeB to UEs. In RL, the state represents the available network resources and request features, the actions are admitting/rejecting requests, and the reward is the weighted sum of granted requests. RL significantly outperforms the baselines in network reward, while the DQN has low memory usage and scales up better than Q-learning. Overall, RL can effectively allocate resources to network slices when resources may become unavailable due to sharing the spectrum with incumbent users, even when spectrum sensing involves misdetection and false alarm errors.

Manuscript received 16 February 2023; revised 2 May 2023; accepted 29 May 2023. Date of publication 9 June 2023; date of current version 25 September 2023. This work was supported by the U.S. Army Research Office under Contract W911NF-21-C0015. A preliminary version of the material in this letter was partially presented at the IEEE International Workshop on Computer Aided Modeling and Design of Communication Links and Networks (CAMAD), 2020 [1] [DOI: 10.1109/CAMAD50429.2020.9209299]. The associate editor coordinating the review of this article and approving it for publication was P. Monti. (Corresponding author: Yi Shi.)

Yi Shi is with the Commonwealth Cyber Initiative, Virginia Tech, Arlington, VA 22203 USA (e-mail: yshi@vt.edu).

Fig. 2. (a) Network reward vs. the number of UEs. (b) Network reward when NextG shares the spectrum with an incumbent user.

Fig. 4. Average reward vs. (a) the number of RBs; (b) the number of UEs.
Fig. 4(b) shows the average reward of an episode vs. the number of users. Note that Figs. 2(a) and 2(b) show the cumulative network reward, while Figs. 4(a) and 4(b) show the time-averaged reward.