Deep Reinforcement Learning-Based Online Resource Management for UAV-Assisted Edge Computing With Dual Connectivity

Mobile Edge Computing (MEC) is a key enabling technology for delay-sensitive and computation-intensive applications in future cellular networks. In this paper, we consider a multi-user, multi-server system in which the cellular base station is assisted by a UAV, and both provide MEC services to the terrestrial users. Via dual connectivity (DC), each user can simultaneously offload tasks to the macro base station and the UAV-mounted MEC server for parallel computing, while also processing some tasks locally. We propose an online resource management framework that minimizes the average power consumption of the whole system, subject to long-term constraints on queue stability and on the computational delay of the queueing system. Due to the coexistence of two servers, the problem is highly complex and is formulated as a multi-stage mixed integer non-linear programming (MINLP) problem. To solve the MINLP with reduced computational complexity, we first adopt Lyapunov optimization to transform the original multi-stage problem into deterministic problems that are manageable in each time slot. Afterward, the transformed problem is solved using an integrated learning-optimization approach, in which model-free Deep Reinforcement Learning (DRL) is combined with model-based optimization. Via extensive simulation and theoretical analyses, we show that the proposed framework is guaranteed to converge and produces nearly the same performance as the optimal solution obtained via an exhaustive search.


I. INTRODUCTION
Mobile Edge Computing (MEC) refers to an emerging distributed computing paradigm that brings cloud processing and storage capabilities closer to the end user (i.e., to the network's edge) [1], [2], [3]. The technology has been widely recognized as a promising solution to the challenges that accompany the rapid growth of mobile applications and the Internet of Things (IoT). By offloading part of their computational tasks to the MEC server, end users, especially those that are battery-powered and have limited hardware capabilities, can experience a better quality of service.

Linh T. Hoang and Anh T. Pham are with the Computer Communications Laboratory, The University of Aizu, Aizuwakamatsu 965-8580, Japan (e-mail: d8232104@u-aizu.ac.jp; pham@u-aizu.ac.jp).
In support of the dynamic and rapid deployment of MEC networks, mounting the MEC server on Unmanned Aerial Vehicles (UAVs) has recently attracted attention in both industry and academia [2], [3], [8], [9], [10]. In this approach, the UAV acts as a flying base station that can effectively complement existing cellular networks by providing additional computational services to ground users. Owing to the inherent mobility and flexibility of UAVs, this approach is an on-demand solution well-suited to hotspot areas or rural areas where network capacity is insufficient [3], [8], [9], [11]. In addition, Dual Connectivity (DC) technology [12] can be integrated into the network to further enhance the computational efficiency of edge devices. Using DC, edge devices can communicate simultaneously with several eNodeBs, which can significantly improve the network's throughput and mobility support [12], [13], [14], [15], [16]. Indeed, the concept of DC was introduced in Third-Generation Partnership Project (3GPP) Release 12 and has recently been widely recognized as a promising approach for the deployment of ultra-dense 5G heterogeneous networks [12], [13], [14], [15], [16]. Associating users that request computationally intensive services with a single MEC server might overload the server and cause it to deny service to other users. DC allows a user to offload tasks to two servers simultaneously for parallel computing, making it a promising solution for balancing the workload and avoiding service denial. Following these trends, MEC networks can be configured such that the mBS and the UAV act as the Master eNodeB (MeNB) and the Secondary eNodeB (SeNB), respectively, both equipped with an MEC server. An edge user can then offload tasks to both MEC servers simultaneously for parallel computing with abundant computational resources.
Besides the mentioned advantages, integrating UAVs and DC into MEC networks poses various challenges; one is the time complexity of resource management. In general, optimization in a multi-user, multi-server MEC network involves solving a mixed integer non-linear programming (MINLP) problem that jointly determines the channel assignment (i.e., the user association) for the MEC servers and the resource management for communication (e.g., bandwidth allocation) and computation (e.g., CPU frequency selection for local and remote computing) at both the server and the user. Solving such a problem is computationally expensive, especially when the number of users is large.

1558-2566 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
Various solutions have been proposed to tackle the issue, such as metaheuristic methods [4], [17], [18], convex relaxation of binary variables [19], local search-based approaches [20], and decomposition-based methods [21]. Still, these conventional methods share a common drawback: they require a large number of iterations to achieve good performance and are thus unsuitable for real-time control of dynamic MEC systems.
To tackle the issue of time complexity, data-driven solutions such as deep reinforcement learning (DRL) are promising candidates that can perform very well while satisfying real-time control requirements [22], [23], [24], [25], [26], [27]. The DRL framework harnesses deep neural networks (DNNs) to learn the optimal policy that directly maps the system state (e.g., the channel condition and the number of backlogged tasks) to a proper action (i.e., a resource management decision) in each time slot. Training is performed via continuous interaction between the optimization solver and the environment (i.e., the MEC network) to maximize the reward following the decision in each step (e.g., the system's energy efficiency and throughput). Indeed, using DNNs in optimization is a model-free approach in which the solver learns from experience (i.e., driven by the training data) to construct an optimal mapping policy, rather than relying on complex mathematical models that might not always be accurate and readily available. However, relying purely on the model-free solution has been reported to lead to unstable performance and to suffer from slow convergence or even divergence [23], [25], [26]. A better approach is to let the DNN handle part of the optimization (e.g., the binary variables) while conventional model-based methods handle the rest. Indeed, integrating data-driven and conventional model-based methods has improved the robustness and convergence of the DRL framework via online training [23], [24], [25], [26].
Another challenge in optimizing MEC networks is accounting for the long-term key performance indicators (KPIs) of dynamic queueing systems, i.e., time-evolving queueing networks in which decisions made in one time slot affect the optimization in subsequent slots [26], [28]. A typical example is minimizing the long-term average power consumption subject to queue stability constraints, given the randomness of the environment (e.g., channel gains and task arrivals). Despite its importance, most existing DRL-based solutions [23], [24], [29], [30], [31] do not focus on long-term performance when solving resource management problems in MEC networks. A well-known approach to coping with the long-term KPIs of a dynamic system is Lyapunov optimization [28]. The framework can be used to transform a multi-stage problem into deterministic per-time slot sub-problems while providing a theoretical guarantee of long-term system stability. The combination of these two robust tools, DRL and the Lyapunov framework, is thus a promising approach to solving the MINLP problem of network resource management while monitoring long-term KPIs, especially in large-scale multi-user, multi-server MEC systems [26].
In this paper, we consider a UAV-assisted MEC network with DC, where the mBS and the UAV act as the MeNB and the SeNB, respectively, to provide edge computing services to a set of ground users. Via DC, a mobile user can simultaneously obtain communication resources from both MEC servers in support of parallel computing. Under randomness of the channel condition and task arrival, an MINLP is formulated to minimize the average power consumption of the whole system (including the users and the UAV, which are all battery-operated), given constraints on long-term queue stability and an average task-execution delay threshold. We aim to develop an online resource management algorithm with reduced computational complexity that achieves system-wide energy efficiency while satisfying all QoS requirements of the users. To this end, we jointly optimize various system variables, including the channel assignment, the local and remote computational resource scheduling, and the bandwidth allocation in each time slot. The Lyapunov framework is adopted to transform the original multi-stage problem into deterministic per-time slot problems. A hybrid scheme combining model-free DRL and model-based optimization is then proposed to solve the resource management optimization in each time slot. To the authors' best knowledge, this is the first work that considers a dynamic multi-user, multi-server MEC system assisted by a UAV via DC and develops a DRL framework for resource management in such a system. The main contributions can be summarized as follows:

1) Power Minimization for a Multi-User, Multi-Server MEC System With Dual Connectivity: In support of parallel computing, we propose to utilize a UAV to assist edge computing in a cellular network via DC. The resource management problem is formulated as a multi-stage MINLP that minimizes the long-term average of the weighted-sum power consumption, subject to long-term constraints on queue stability and task execution delay.

2) Lyapunov-Guided DRL Approach: We develop a Lyapunov-guided DRL framework that efficiently produces sub-optimal solutions. DRL copes with the complexity introduced by the coexistence of two servers, while Lyapunov optimization deals with the long-term constraints on queue stability and average task execution delay of the queueing system.

3) Hybrid Approach for the Actor-Critic Structure: The proposed framework integrates conventional model-based optimization and a data-driven approach via model-free DRL. The actor module utilizes a DNN and an efficient action quantizer to balance exploration and exploitation in producing channel assignment decisions. To accurately evaluate decisions made by the actor, the critic module relies on model-based optimization rather than on another DNN, as is conventionally done.

4) We provide theoretical analyses and numerical results via extensive simulation to demonstrate the efficiency of the proposed method.

The remainder of the paper is organized as follows. Related works are reviewed in Section II. Sections III and IV detail the system model and the problem formulation, respectively. Sections V and VI provide the description and theoretical analyses of the proposed framework. Numerical results are presented in Section VII. Finally, Section VIII concludes the paper.

II. RELATED WORKS
User Association. In recent years, several works have studied the user association problem (i.e., channel assignment or server selection) in resource management for multi-user, multi-server MEC networks [5], [20], [21], [30], [31]. Dai et al. [5] formulated a joint computation offloading and user association problem to minimize the overall power consumption of a MEC system in which each user has multiple mutually dependent tasks. Tran and Pompili [20] studied the problem of joint sub-channel assignment and resource allocation in a multi-cell network to maximize a weighted sum of the reduction in task completion time and energy consumption. Liu and Cao [30] considered the switching cost incurred when a mobile user migrates its service from one server to another and modeled the problem of continuous server selection as a Markov Decision Process (MDP). Guo et al. [31] proposed an online learning-based MEC server selection mechanism under incomplete network information to minimize the time-averaged task execution delay. Hu et al. [21] proposed a submodular optimization-based server selection method for mobile users to optimize the long-term energy-delay tradeoff. However, the works mentioned above have not investigated the user association problem in a parallel-computing scenario with dual connectivity. Instead, users are grouped into separate clusters and allowed to connect to only one server at a time. More investigation is thus needed to fill this gap.
DRL for Resource Management. The utilization of DRL-based methods in optimization for MEC networks has recently attracted considerable attention from the research community [23], [24], [25], [29], [30], [31]. Min et al. [29] proposed a reinforcement learning-based offloading scheme for IoT devices with energy harvesting (EH), which selects the MEC server according to the current battery level and the predicted amount of harvested energy. Huang et al. [23] proposed a DRL-based offloading framework that utilizes a deep neural network to produce potential binary offloading solutions, while a model-based optimization module is responsible for evaluating candidate decisions and labeling training data samples. Wu et al. [24] developed a hybrid framework that combines a deep Q-network and convex optimization for determining offloading strategies at the user side and allocating resources at the computational access point. However, the works [23], [24], [29], [30], [31] only considered quasi-static scenarios and failed to adequately address the long-term performance requirements (such as queue stability and average energy consumption) of a time-evolving queueing system. Following the direction toward long-term KPIs, Bi et al. [25] recently proposed a hybrid optimization-learning framework called LyDROO, which combines Lyapunov optimization and DRL to optimize task offloading, where each user either computes tasks locally or offloads them (i.e., binary offloading).
In this paper, we also integrate Lyapunov optimization and DRL into a resource management framework to cope with the long-term KPIs of a queueing network. Compared to [25], our innovations are three-fold. First, we propose a new actor module that utilizes a DNN to optimize parallel computation between users and servers. The method in [25] allows the user either to process tasks locally or to offload them to an edge server; thus, it does not apply to the parallel paradigm. Second, our proposed framework jointly optimizes not only the user side but also the server side. Since the problem involves many interacting network entities, we propose a new model-based critic module, which is entirely different from that of [25]. Third, [25] considered a single powerful MEC server with no limit on the number of users served at a time. Thus, their algorithms cannot be directly applied in our study, where we consider a multi-server system with dual connectivity and specify constraints on each server's serving capability. A comprehensive comparison between our proposed scheme and other existing DRL-based methods is summarized in Table I. It is noteworthy that the critic module's algorithm in this paper is adopted in part from our previous work [32] to optimize local computation on the user side.

III. SYSTEM MODEL
As illustrated in Fig. 1, we consider a multi-server MEC system with Dual Connectivity (DC). There are two MEC servers: one is located at a macro base station (i.e., the Master eNB, MeNB) and the other is mounted on a UAV (i.e., the Secondary eNB, SeNB); together, they provide edge computing services to a set of ground users. Each mobile user can connect to the two MEC servers simultaneously. Hereafter, the terms UAV-mounted MEC server and SeNB, as well as macro base station and MeNB, are used interchangeably.
For convenience, we denote the index sets of the mobile devices, the MEC servers, and the time slots as N ≜ {1, 2, . . . , N}, S ≜ {UAV, mBS}, and T ≜ {1, 2, . . .}, respectively. In the following, we index each user by the letter i and each MEC server by the letter j, i.e., i ∈ N, j ∈ S.
The modeling of task computation, queueing, task offloading, and power consumption is presented below. For ease of reference, the key notations used in this article are summarized in Table II.

A. Task Queuing Model
We assume that the mobile devices process independent and fine-grained tasks [1]. Each task is represented by a volume of bits, which can be decomposed into several packets transmitted to nearby MEC servers and processed in parallel. At the beginning of each slot, a volume of A_i^t bits arrives at user i and can be processed starting from the next slot. Without loss of generality, we assume that A_i^t is independent and identically distributed (i.i.d.) over time slots with a Poisson distribution and an average rate E[A_i^t] = λ_i (bits), i ∈ N. In the t-th time slot, user i processes l_i^t bits locally and, when given the opportunity, can upload r_{i,UAV}^t and r_{i,mBS}^t bits to the UAV-mounted MEC server and the macro base station, respectively. The task volumes newly arrived at the beginning of a slot are buffered in the user queue before they can be processed in subsequent slots. Let Q_i^l(t) denote the local queue length of user i at the beginning of time slot t; the queue update process can be expressed as

Q_i^l(t + 1) = Q_i^l(t) − D_i^t + A_i^t,    (1)

where D_i^t ≜ l_i^t + r_{i,UAV}^t + r_{i,mBS}^t denotes the amount of tasks departing from user i's local queue in time slot t.
At the UAV side (i.e., the SeNB), we assume that the UAV maintains N dedicated queues, one per user, to buffer the tasks offloaded by the users. Let c_i^t denote the amount of tasks from user i executed by the UAV in time slot t; the UAV's task queue dedicated to user i, denoted by Q_i^s(t), can be derived similarly as

Q_i^s(t + 1) = Q_i^s(t) − c_i^t + r_{i,UAV}^t.    (2)

In this paper, all user and UAV task queues are assumed to have sufficiently large capacity. In addition, without loss of generality, all task queues are initially empty, i.e., Q_i^l(0) = Q_i^s(0) = 0, i ∈ N. Regarding the macro BS (i.e., the MeNB), we assume that the server has redundant computational resources and is powered by the electrical grid; therefore, we do not consider the macro BS's queues and power consumption in the optimization. In other words, tasks offloaded to the macro BS are not buffered in queues but executed promptly, and the energy they consume is less critical than that of the other network entities.
According to Little's Law [33], the average delay experienced by a user is proportional to the long-term average number of tasks awaiting in the system. Thus, we use the average queue lengths at the user and the UAV, denoted by Q̄_i^l and Q̄_i^s, as a measure of the task completion delay for local and remote task processing, respectively. Furthermore, two thresholds Q_{l,i}^{th} and Q_{s,i}^{th} are imposed as Quality of Service (QoS) constraints for the i-th user:

lim_{T→∞} (1/T) Σ_{t=1}^{T} E[Q_i^l(t)] ≤ Q_{l,i}^{th},    (3)

lim_{T→∞} (1/T) Σ_{t=1}^{T} E[Q_i^s(t)] ≤ Q_{s,i}^{th},    (4)

where the expected values of the queue lengths are taken over the randomness of the channel gain and task arrival in a time slot. It is worth noting that from (1) we have Q_i^l(t + 1) ≥ A_i^t, since the departure D_i^t never exceeds the current backlog Q_i^l(t).
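The queue recursion and the Little's-law delay measure can be illustrated with a short simulation. The sketch below uses hypothetical parameters (arrival rate, per-slot service capacity) and a textbook Poisson sampler; it illustrates the queueing model only, not the paper's control algorithm:

```python
import math
import random

def poisson_sample(rng, lam):
    # Knuth's method; adequate for small rates
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def simulate_local_queue(T=20_000, lam=20, max_departure=25, seed=1):
    """Simulate Q(t+1) = Q(t) - D(t) + A(t) with D(t) <= Q(t):
    arrivals are buffered and served from the next slot onward."""
    rng = random.Random(seed)
    q = 0                       # Q(0) = 0: queues start empty
    backlog_sum = 0
    for _ in range(T):
        d = min(q, max_departure)     # departures bounded by the backlog
        a = poisson_sample(rng, lam)  # A(t) ~ Poisson(lambda)
        q = q - d + a
        backlog_sum += q
    avg_q = backlog_sum / T
    # Little's law: average delay (in slots) = average backlog / arrival rate
    return avg_q, avg_q / lam

avg_q, avg_delay = simulate_local_queue()
```

Because the per-slot service capacity (25 bits) exceeds the mean arrival rate (20 bits), the queue is stable and the average backlog stays bounded.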

B. Task Execution Model
To process tasks locally, the mobile user needs to assign a specific CPU frequency to task computation. Let f_{l,i}^t denote the local CPU frequency of user i in time slot t; the amount of locally computed tasks in time slot t can then be expressed as

l_i^t = τ f_{l,i}^t / L_i,

where τ denotes the time-slot length and L_i denotes the processing density, defined as the number of CPU cycles required for user i to process one bit. According to circuit theory, the power consumption for local execution at the i-th user is given by [34], [35]

p_{l,i}(t) = κ_i (f_{l,i}^t)^3,

where the parameter κ_i is the effective switched capacitance of the CPU at the i-th device, which depends on the hardware architecture.
Similarly, at the UAV, the amount of tasks computed by the MEC server for the i-th user in time slot t can be expressed as

c_i^t = τ f_{c,i}^t / L_s,

where f_{c,i}^t denotes the CPU frequency resources that the UAV allocates to computing the i-th user's tasks, and L_s denotes the processing density of the UAV's CPU (cycles per bit). The power consumption of the UAV for computation is defined similarly as

p_{c,i}(t) = κ_s (f_{c,i}^t)^3,

where κ_s denotes the effective switched capacitance of the UAV's CPU. We assume that the computational capability of the UAV is stronger than that of a mobile user but limited by a maximum CPU frequency, i.e., Σ_{i∈N} f_{c,i}^t ≤ f_c^max.
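The two computation models above are straightforward to evaluate. The snippet below is a minimal sketch with hypothetical parameter values (1 s slot, 1 GHz CPU, 1000 cycles/bit, κ = 1e-28):

```python
def computed_bits(tau, f, L):
    """Bits processed in one slot of length tau at CPU frequency f (Hz),
    given a processing density of L CPU cycles per bit: l = tau * f / L."""
    return tau * f / L

def cpu_power(kappa, f):
    """Dynamic CPU power from circuit theory: p = kappa * f^3."""
    return kappa * f ** 3

bits = computed_bits(1.0, 1e9, 1000)   # 1e6 bits per slot
power = cpu_power(1e-28, 1e9)          # 0.1 W
```

The cubic power law is why splitting a workload between local and remote CPUs, each running at a lower frequency, can consume less total energy than processing everything at a single fast CPU.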

C. Task Offloading Model
The channel power gain from user i to MEC server j is modeled as

h_{i,j}^t = g_j h̃_{i,j}^t (d_{i,j})^{−γ_j},

where d_{i,j} denotes the distance from user i to MEC server j, γ_j denotes the path-loss exponent (γ_j ≥ 2), g_j denotes the reference channel gain, and h̃_{i,j}^t denotes the small-scale fading channel power gain, which is assumed to have a finite mean value, E[h̃_{i,j}^t] < ∞ [34], for server j's link. To allocate radio resources to the mobile devices, each MEC server first selects a subset of users (assuming that one server cannot serve all users at the same time) and then allocates each user in that subset an appropriate bandwidth for offloading. Let x_{i,j}^t denote the link association of user i on server j's communication channel in time slot t: x_{i,j}^t = 1 indicates that user i can utilize bandwidth allocated on server j's channel; otherwise, no bandwidth is allocated and offloading is prohibited. The set of mobile devices associated with MEC server j in time slot t can then be defined as N_j^t ≜ {i ∈ N : x_{i,j}^t = 1} ⊆ N. Similarly, the set of MEC servers associated with the i-th user in time slot t is S_i^t ≜ {j ∈ S : x_{i,j}^t = 1} ⊆ S. Regarding the communication energy, according to the Shannon-Hartley formula, the transmit power required for user i to offload r_i^t bits can be obtained as [36]

p_{Tx,i}(t) = Σ_{j∈S_i^t} (α_{i,j}^t W_j N_0 / h_{i,j}^t) (2^{r_{i,j}^t / (τ α_{i,j}^t W_j)} − 1),    (10)

where W_j denotes the total bandwidth of server j, α_{i,j}^t denotes the bandwidth ratio allocated to user i on server j's channel, and N_0 denotes the background noise power density.
In (10), r_{i,j}^t denotes the offloading volume of user i on server j's communication channel in time slot t; thus r_i^t = Σ_{j∈S_i^t} r_{i,j}^t. It is worth noting that since the mobile user is supported by dual connectivity, one user can connect to both MEC servers at the same time, i.e., x_{i,UAV}^t and x_{i,mBS}^t can both equal one in a time slot; thus |S_i^t| ≤ 2, where |A| denotes the number of elements in a set A. Due to the signaling overhead of resource management, we assume that server j can serve at most χ_j^max users in a time slot, i.e., |N_j^t| ≤ χ_j^max.
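Inverting the Shannon-Hartley rate gives the transmit power needed to push a given number of bits through the allocated band within one slot. The helper below is a sketch consistent with one common form of (10); the numeric values used in the round trip are hypothetical:

```python
import math

def offload_tx_power(r_bits, alpha, W, tau, h, N0):
    """Power needed to deliver r_bits in a slot of length tau over
    bandwidth alpha*W with channel power gain h and noise density N0:
    p = (alpha*W*N0 / h) * (2^(r_bits / (tau*alpha*W)) - 1)."""
    bw = alpha * W
    return (bw * N0 / h) * (2.0 ** (r_bits / (tau * bw)) - 1.0)

def achievable_bits(p, alpha, W, tau, h, N0):
    """Forward direction: bits deliverable at transmit power p."""
    bw = alpha * W
    return tau * bw * math.log2(1.0 + p * h / (bw * N0))
```

Note that the power grows exponentially in the offloaded volume r_bits relative to the allocated band: halving the bandwidth ratio alpha more than doubles the power needed for the same volume, which is why the bandwidth split α and the offloading split r must be optimized jointly.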

IV. PROBLEM FORMULATION AND TRANSFORMATION

A. Problem Formulation
We focus on the weighted-sum system power consumption, which consists of the power consumed for task execution at the user devices and at the UAV-mounted MEC server, as well as the users' transmit power for task offloading. Energy consumed for other purposes, such as maintaining the basic operations of the MEC system and propelling the UAV, is omitted for simplicity. Accordingly, the system's power consumption in time slot t, denoted by P_sys(t), can be calculated as the weighted sum

P_sys(t) = Σ_{i∈N} ψ_i (p_{l,i}(t) + p_{Tx,i}(t)) + ψ_c Σ_{i∈N} p_{c,i}(t),

where ψ_i and ψ_c are positive weight factors for the power consumption of user i and the UAV, respectively. They can be adjusted to reflect the system's preference in optimizing the power consumption of different nodes, as well as to balance the impact of the UAV's and the mobile devices' energy [34].
The ultimate goal of the optimization is to minimize the long-term average of the system power consumption, given constraints on the stability of task queues and the limit on the radio and computational resources.The optimization variables include the user's local computation, the volume of offloaded tasks, the UAV's remote processing scheduling, and the radio resource allocation.
Let X = {X^t}_{t∈T} denote the combination of all optimization variables over time. Additionally, let X^t = {x_j^t, α_j^t, r_j^t, f_l^t, f_c^t}_{j∈S} denote the combined vector of optimization variables in time slot t for all servers j ∈ S. The problem can then be formulated as a multi-stage MINLP:

(P1) min_X lim_{T→∞} (1/T) Σ_{t=1}^{T} E[P_sys(t)]
s.t. (14b)-(14j), (3), and (4).

In P1, (14b) states that server j can serve at most χ_j^max users at a time. (14c) ensures that the total bandwidth used for server j's uplink communication is bounded by W_j. (14d) and (14e) indicate the maximum CPU frequencies of the user and the UAV, denoted by f_i^max and f_c^max, respectively. (14f) denotes the maximum transmit power of the mobile device on each communication link. (14g) and (14h) guarantee that the amount of tasks processed in a time slot (i.e., tasks offloaded and computed by the user, and tasks computed by the UAV) does not surpass the backlog of the corresponding task queue. Finally, (14i) and (14j) impose mean-rate stability [28] on the local and remote queues. Note that (14i) and (14j) do not provide any guarantee on the time-average expected queue backlogs (and thus on the average computation delay). The QoS constraints (3) and (4), which are a stronger form of stability [28], are therefore also imposed.
We observe that P1 is a stochastic optimization problem over a time-evolving system. Indeed, radio and computational resource management decisions need to be made in each time slot under the randomness of task arrivals and channel fading. Furthermore, optimal decisions are temporally correlated and should adapt to the time-varying system state, such as the current queue sizes at the mobile devices and the UAV. Solving P1 is also challenging because the optimization involves a large number of interdependent variables. Specifically, the radio resource management variables across end users (i.e., x_{i,j}^t and α_{i,j}^t, i ∈ N) are coupled and interdependent with the computational resource scheduling on both sides (i.e., f_{l,i}^t and f_{c,i}^t, i ∈ N), which suggests that a joint optimization approach is indeed needed. Later, in the numerical results, we show that overly aggressive approaches (e.g., greedy policies based on channel gain or queue length) cannot solve the formulated problem effectively.
In the following, instead of solving P1 directly, we consider a modified version, denoted by P2, to obtain an efficient, asymptotically optimal online solution. First, note that (14c) can be rewritten as Σ_{i∈N} α_{i,j}^t x_{i,j}^t ≤ 1, j ∈ S. To obtain P2, we replace (14c) with the following constraint:

ϵ_A x_{i,j}^t ≤ α_{i,j}^t,  Σ_{i∈N} α_{i,j}^t ≤ 1,  i ∈ N, j ∈ S,    (14k)

where ϵ_A ∈ (0, 1/N) is a constant. Constraint (14k) makes the transmit power function in (10) continuous and differentiable with respect to α_{i,j}^t, j ∈ S_i^t. It is worth noting that although the optimal solution to P2 is only an approximation of the optimal solution to P1, the two can be made arbitrarily close by setting ϵ_A sufficiently small.

B. Lyapunov-Guided Problem Transformation
We adopt the Lyapunov optimization framework [28] to decouple the multi-stage problem P2 into deterministic problems that can be solved in each time slot.
First, to cope with the QoS constraints (3) and (4) on the long-term average queue lengths, we introduce two virtual queues for each mobile user:

Z_i^l(t + 1) = max[Z_i^l(t) + Q_i^l(t + 1) − Q_{l,i}^{th}, 0],
Z_i^s(t + 1) = max[Z_i^s(t) + Q_i^s(t + 1) − Q_{s,i}^{th}, 0],

for i ∈ N, t ∈ T, where Z_i^l(0) = Z_i^s(0) = 0. By the definition of the two virtual queues, it is proved in [28] that constraints (3) and (4) are satisfied if the two virtual queues are mean-rate stable, i.e., lim_{T→∞} E[Z_i^l(T)]/T = 0 and lim_{T→∞} E[Z_i^s(T)]/T = 0. In support of the problem transformation, we define the system state in time slot t as

Θ(t) ≜ [Q_i^l(t), Q_i^s(t), Z_i^l(t), Z_i^s(t)]_{i∈N}.

The Lyapunov function is then defined as a measure of the total queue backlog in time slot t:

L(Θ(t)) ≜ (1/2) Σ_{i∈N} [Q_i^l(t)^2 + Q_i^s(t)^2 + Z_i^l(t)^2 + Z_i^s(t)^2].

To keep all the queues stable, the Lyapunov drift function is introduced as

∆(Θ(t)) ≜ E[L(Θ(t + 1)) − L(Θ(t)) | Θ(t)].

To minimize the long-term average power consumption while ensuring the queue stability constraints, we define the Lyapunov drift-plus-penalty as

∆_V(Θ(t)) ≜ ∆(Θ(t)) + V E[P_sys(t) | Θ(t)],

where V is a positive control parameter for the trade-off between the system's power consumption and the average queueing delay. The following theorem provides an upper bound on ∆_V(Θ(t)), which is crucial to transforming the multi-stage problem P2 into per-time slot deterministic problems.

Theorem 1: The drift-plus-penalty ∆_V(Θ(t)) is bounded as

∆_V(Θ(t)) ≤ B + V E[P_sys(t) | Θ(t)] + Σ_{i∈N} E[(Q_i^l(t) + Z_i^l(t))(A_i^t − D_i^t) + (Q_i^s(t) + Z_i^s(t))(r_{i,UAV}^t − c_i^t) | Θ(t)],    (20)

where D_i^t = l_i^t + r_{i,UAV}^t + r_{i,mBS}^t, and B consists of constant terms obtained from the observation at the beginning of time slot t, which can therefore be set aside from the optimization of the target variables X^t.
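The virtual-queue mechanism can be illustrated in isolation. The sketch below (with a hypothetical threshold and backlog trace) shows that Z accumulates the running excess of the physical backlog over its threshold, so keeping Z stable keeps the time-averaged backlog below the threshold:

```python
def virtual_queue_step(Z, Q_next, Q_th):
    """Standard virtual-queue update Z(t+1) = max[Z(t) + Q(t+1) - Q_th, 0]:
    the virtual queue grows whenever the physical backlog exceeds its
    QoS threshold, and drains when the backlog is below it."""
    return max(Z + Q_next - Q_th, 0.0)

# Backlog alternating around the threshold (Q_th = 10): Z stays bounded,
# consistent with the time-average constraint being met.
Z = 0.0
for t in range(1000):
    q_next = 12.0 if t % 2 == 0 else 8.0
    Z = virtual_queue_step(Z, q_next, 10.0)
bounded_Z = Z

# Backlog persistently above the threshold: Z grows without bound,
# signalling that the corresponding QoS constraint would be violated.
Z = 0.0
for _ in range(1000):
    Z = virtual_queue_step(Z, 12.0, 10.0)
growing_Z = Z
```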
Proof: Please refer to Appendix A. □

With support from Theorem 1, we can transform the multi-stage problem P2 into a deterministic problem that can be solved in each time slot using the opportunistic expectation minimization technique [28]. Specifically, the long-term goal of P2 can be achieved by minimizing the upper bound of ∆_V(Θ(t)) in each time slot, taking the system state Θ(t) observed at the beginning of the slot as input. By removing the constant terms in (20), the deterministic per-time slot problem is formulated as

(P3) min_{X^t} G(X^t)
s.t. (14b), (14d)-(14h), and (14k),

with the objective function

G(X^t) ≜ V P_sys(t) − Σ_{i∈N} [(Q_i^l(t) + Z_i^l(t)) D_i^t + (Q_i^s(t) + Z_i^s(t))(c_i^t − r_{i,UAV}^t)].

It is worth mentioning that solving P3 does not require any future information about incoming tasks or the wireless channel state beyond the current state of the task queues, which makes the approach an online optimization design. In the next section, we introduce a DRL-based online optimization algorithm to solve P3 efficiently.
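To make the per-slot trade-off concrete, the sketch below evaluates a drift-plus-penalty objective of the standard form; the grouping of terms here is an assumption, and the paper's exact bound may arrange constants differently. Larger physical-plus-virtual backlogs make serving bits more "valuable", so the minimizer tolerates higher power when queues are long:

```python
def per_slot_objective(V, P_sys, Q_l, Z_l, D, Q_s, Z_s, c, r_uav):
    """Sketch of a per-slot drift-plus-penalty objective (assumed grouping):
    V weighs the power penalty, and each served bit is credited in
    proportion to the (physical + virtual) backlog behind it."""
    obj = V * P_sys
    for i in range(len(Q_l)):
        obj -= (Q_l[i] + Z_l[i]) * D[i]               # local-queue departures
        obj -= (Q_s[i] + Z_s[i]) * (c[i] - r_uav[i])  # UAV-queue net service
    return obj

# With a long local backlog, serving bits at extra power cost still lowers
# the objective, i.e., the solver is pushed to drain the queue.
idle = per_slot_objective(V=10.0, P_sys=1.0, Q_l=[5000.0], Z_l=[100.0],
                          D=[0.0], Q_s=[0.0], Z_s=[0.0], c=[0.0], r_uav=[0.0])
busy = per_slot_objective(V=10.0, P_sys=3.0, Q_l=[5000.0], Z_l=[100.0],
                          D=[2000.0], Q_s=[0.0], Z_s=[0.0], c=[0.0], r_uav=[0.0])
```

Increasing V shifts the balance back toward power saving at the cost of longer average queueing delay, which is exactly the V trade-off described above.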

V. DEEP REINFORCEMENT LEARNING-BASED ONLINE RESOURCE MANAGEMENT
We propose a DRL-based resource management (DRLRM) scheme for solving the per-time slot problem P3. The proposed framework, as depicted in Fig. 2, consists of three main modules: (1) the actor module, (2) the critic module, and (3) the policy update module. The actor module obtains the information necessary for optimization from the system observer and adopts a DNN to output several potential channel assignment decisions, x̂^t = {x̂_{i,j}^t}_{i∈N,j∈S}. The critic module evaluates each decision made by the actor module by solving for the remaining variables y^t = {α_{i,j}^t, r_{i,j}^t, f_{l,i}^t, f_{c,i}^t}_{i∈N,j∈S} using model-based optimization. The policy update module logs a history of system state-optimal decision pairs on the fly and retrains the actor module's DNN periodically, so that the mapping policy is updated and adapts to the time-varying channel condition. In the following, we describe the learning-based framework depicted in Fig. 2 in detail, followed by the model-based optimization of the critic module.
For convenience, the input and output of the actor module are denoted as follows. The input, denoted by Ξ^t, includes the channel gains and the lengths of all physical and virtual queues for each user. The module's output is a set of k potential channel assignment decisions, {x̂_1^t, . . . , x̂_k^t}.
Remark 1: The above approach is backed by the Tammer decomposition technique [37], in which the combinatorial variables in x̂^t are decoupled from the continuous variables in y^t. Indeed, by temporarily fixing x̂^t, we can further decompose P3 into several sub-problems with separate objectives and constraints.
Remark 2: To obtain the optimal decision (x^t)⋆, an exhaustive search requires evaluating C(N, χ_UAV^max) · C(N, χ_mBS^max) possible channel assignment decisions, where C(n, k) indicates the binomial coefficient for selecting an (unordered) subset of k elements from a fixed set of n elements. Since C(n, k) is in O(n^k), an exhaustive search over all possible channel assignments has complexity O(N^{χ_UAV^max + χ_mBS^max}). In the following subsection, we propose to use a DNN to find (x^t)⋆ with much lower computational complexity. The DNN is trained periodically to dynamically adapt to the time-varying channel condition and to approximate an optimal policy Π_t⋆ that maps the current system state to an optimal channel assignment decision in time slot t, Π_t⋆ : Ξ^t → (x^t)⋆.
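The combinatorial blow-up in Remark 2 is easy to verify by enumeration. The toy sketch below (with hypothetical sizes) lists every pair of full-capacity user subsets and checks the binomial count:

```python
from itertools import combinations
from math import comb

def exhaustive_assignments(N, chi_uav, chi_mbs):
    """Enumerate every channel-assignment pair: one subset of users of size
    chi_uav for the UAV and one of size chi_mbs for the mBS. Used only to
    illustrate the O(N^(chi_uav + chi_mbs)) search space."""
    users = range(N)
    return [(u, m)
            for u in combinations(users, chi_uav)
            for m in combinations(users, chi_mbs)]

pairs = exhaustive_assignments(8, 3, 3)   # C(8,3) * C(8,3) = 3136 candidates
```

Even with N = 8 users and a capacity of 3 per server there are already 3136 candidate assignments, and evaluating each one additionally requires solving a continuous resource-allocation sub-problem, which is what motivates the DNN-based search.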
A. Outline of the Proposed DRLRM Framework

1) Model-Free Actor Module: The actor module consists of a DNN and an action quantizer. The DNN takes Ξ^t as input and performs forward propagation to output a relaxed channel assignment x̂^t = {x̂^t_{i,j} ∈ [0, 1]}_{i∈N,j∈S}, which is later quantized into a number of potential channel assignment decisions. We adopt the DNN to represent the channel assignment policy (an approximation of the optimal policy Π⋆_t) as Π_{Φ_t} : Ξ^t → x̂^t, where Φ_t denotes the DNN's parameters at time slot t. To ensure that x̂^t_{i,j} ∈ [0, 1] for all i ∈ N, j ∈ S, we use the sigmoid activation function at the output layer of the DNN.
The action quantizer of the actor module then takes the relaxed channel assignment x̂^t and generates a batch of k potential decisions. Let Γ denote the quantizer policy; we have Γ : x̂^t → x^t, where x^t = {x^t_{i,j} ∈ {0, 1}}_{i∈N,j∈S} denotes the output channel assignment decision. Let ς^t_j denote the χ^max_j-th largest element of x̂^t_j ≜ {x̂^t_{i,j}}_{i∈N} (the set of relaxed channel assignments corresponding to server j). Then, for each x̂^t_{i,j} in x̂^t, the quantization policy Γ generates the corresponding value x^t_{i,j} = 1 if x̂^t_{i,j} ≥ ς^t_j and x^t_{i,j} = 0 otherwise. To generate the first decision (x^t_1), we apply the policy Γ directly to the output of the DNN, i.e., x̂^t. The remaining (k − 1) decisions are generated by applying the policy Γ to noise-added versions of x̂^t, given as Sigmoid(x̂^t + n), where Sigmoid(•) denotes the element-wise sigmoid function that ensures each element of the output vector falls within the range (0, 1). Here, n ∼ N(0, σ²_n I) is a 2N-dimensional zero-mean random vector following the normal (Gaussian) distribution with diagonal covariance matrix σ²_n I, where I denotes the identity matrix. The variable σ_n is a hyper-parameter that balances exploration and exploitation in the action quantizer. Too large values of σ_n might not take advantage of the DNN output to predict the optimal channel assignment. In contrast, too small values of σ_n might hamper the critic module from extracting good approximations of the per-time slot problem's global minimum in each time slot. In the long term, this phenomenon makes it difficult for the DNN to learn the optimal channel assignment policy due to the noisy labels extracted by the critic module.
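The quantizer Γ and the noisy candidate generation can be sketched in NumPy as follows. This is a minimal sketch under one simplifying assumption: ties among relaxed values are broken arbitrarily by sorting, which keeps each server within its connection limit χ^max_j (held in `chi_max`).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def quantize(x_relaxed, chi_max):
    """Policy Gamma: per server j, set the chi_max[j] largest relaxed
    entries (those >= the chi_max[j]-th largest value) to 1, others to 0."""
    x = np.zeros_like(x_relaxed, dtype=int)
    for j in range(x_relaxed.shape[1]):
        top = np.argsort(x_relaxed[:, j])[::-1][: chi_max[j]]
        x[top, j] = 1
    return x

def generate_candidates(x_relaxed, chi_max, k, sigma_n, rng):
    """First decision: Gamma applied to the DNN output directly; the other
    k-1 decisions: Gamma applied to Sigmoid(x + n), n ~ N(0, sigma_n^2 I)."""
    candidates = [quantize(x_relaxed, chi_max)]
    for _ in range(k - 1):
        noisy = sigmoid(x_relaxed + rng.normal(0.0, sigma_n, x_relaxed.shape))
        candidates.append(quantize(noisy, chi_max))
    return candidates
```

With the paper's settings (σ_n = 0.25, k = 16), all generated candidates are feasible by construction: each server column carries exactly χ^max_j ones.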
2) Model-Based Critic Module: The model-based critic module takes the set of potential channel assignment decisions from the actor module, selects the best decision among them, and solves the optimization problem for the remaining variables. Unlike the conventional approach that adopts another DNN for the critic module, our approach leverages the model information on user-server communication and power consumption to evaluate each channel assignment decision analytically. Indeed, by fixing the setting for x^t_j, j ∈ S, it is feasible to find the optimal settings for the remaining variables in y^t, which are all continuous. Specifically, let (y^t)⋆ denote the optimal decision for y^t and J⋆(x^t, Ξ^t) denote the optimal value of the objective function (21a) given x^t and Ξ^t. Then P3 is equivalent to the problem, denoted as P4, of finding (x^t)⋆ = arg min_{x^t} J⋆(x^t, Ξ^t), where x^t is one among the k channel assignment decisions given by the actor module. We introduce in detail the algorithm to obtain J⋆(x^t, Ξ^t) in Section V-B. Using a model-based critic module brings the advantage of an accurate evaluation of each channel assignment decision, thus improving the convergence of training. Besides, it is worth noting that to obtain (x^t)⋆ and (y^t)⋆, we need to evaluate the function J⋆(x^t, Ξ^t) k times. Thus, k is another hyper-parameter of the system that affects the trade-off between performance and computational complexity. In general, larger values of k result in better performance in terms of convergence time but require more computational resources.

3) Policy Update Module: The policy update module exploits training samples labeled by the critic module (i.e., the pairs (Ξ^t, (x^t)⋆)) to update the parameters of the actor module's DNN. A replay memory of size q is adopted to record the training samples. Only the most recent data samples are kept, i.e., new data continuously replace the old ones to avoid memory bloat. Beginning with an empty memory, we start training the DNN only when
at least q/2 data samples are available. Afterward, the DNN is trained periodically, once every δ_T time slots. Such a training scheme helps prevent the DNN from overfitting to noise in the input and enables the neural network to adapt dynamically to the time-varying channel condition.
Specifically, a batch of training samples is randomly selected from the replay memory when mod(t, δ_T) = 0 (mod denotes the modulo operation). We then use these samples to train the DNN with the Adam algorithm [38] to minimize the cross-entropy cost function L(Φ_t) given in (26).
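As an illustration of the training objective, the batch cost can be written as the binary cross-entropy between the DNN's relaxed output and the critic's 0/1 labels. This is a minimal sketch under the assumption that L(Φ_t) in (26) takes this standard averaged form.

```python
import numpy as np

def cross_entropy(x_pred, x_star, eps=1e-12):
    """Binary cross-entropy between the relaxed DNN output x_pred in (0, 1)
    and the critic-labeled optimal assignment x_star in {0, 1}, averaged
    over all entries of the sampled batch."""
    x_pred = np.clip(x_pred, eps, 1.0 - eps)   # numerical safety near 0 and 1
    return -np.mean(x_star * np.log(x_pred) + (1 - x_star) * np.log(1 - x_pred))
```

Minimizing this cost pushes the sigmoid outputs toward the critic-selected assignments, which is what lets the quantizer's first (noise-free) candidate track the optimal policy over time.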

B. Model-Based Optimization of the Critic Module
In this section, we present in detail the optimization algorithm used by the critic module to obtain J⋆(x^t, Ξ^t) in (25). For a given channel assignment decision x^t = {x^t_{i,j}}_{i∈N,j∈S}, we can obtain the optimal solution for y^t = {α^t_{i,j}, r^t_{i,j}, f^t_{l,i}, f^t_{c,i}}_{i∈N,j∈S} by decomposing the per-time slot problem P3 into four sub-problems: optimization of the offloading volume and bandwidth allocation for the UAV link, the same optimization for the mBS link, optimization of local computation, and optimization of the UAV's computational resource scheduling.
1) Optimization on Offloading Volume and Bandwidth Allocation for the UAV Link: Given a feasible channel assignment decision, the optimization variables related to the UAV include the bandwidth allocation, α^t_UAV ≜ {α^t_{i,j} | i ∈ N^t_UAV, j = UAV}, and the offloading volume on the UAV link, r^t_UAV ≜ {r^t_{i,j} | i ∈ N^t_UAV, j = UAV}. Let x^t_UAV ≜ {α^t_UAV, r^t_UAV} denote the combination of these variables. The optimal decision for x^t_UAV can be obtained by solving the following problem, hereinafter referred to as P3.1, where

r^{t,max}_{i,UAV} = W_UAV α^t_{i,UAV} τ log₂(1 + P^max_i h^t_{i,UAV} / (N₀ W_UAV))

denotes the upper bound of r^t_{i,UAV} when the maximum transmit power is used. Note that for all users that are not associated with the UAV in time slot t, the bandwidth and offloading volume assigned to them equal zero, i.e., r^{t,max}_{i,UAV} = 0 and α^t_{i,UAV} = 0, ∀i ∉ N^t_UAV. To solve P3.1, we adopt the Gauss-Seidel approach [39] to optimize the offloading volume and the bandwidth allocation in an alternating manner. Specifically, in each iteration, the offloading volume decision is obtained in closed form, and the bandwidth allocation is determined by the Lagrangian method. The alternating approach is guaranteed to converge to the optimal solution since P3.1 is convex and its feasible region is a Cartesian product [39] of the feasible sets of r^t_UAV and α^t_UAV.

a) Optimal offloading volume: For a feasible bandwidth allocation α^t_UAV, the optimal offloading volume for mobile devices in N^t_UAV can be obtained by solving the corresponding sub-problem. The optimal solution is either the stationary point of (27a) or one of the boundary points. Specifically, for i ∈ N^t_UAV, (r^t_{i,UAV})⋆ = max{min{r̃^t_{i,UAV}, r^{t,max}_{i,UAV}}, 0}, where r̃^t_{i,UAV} denotes the stationary point of (27a).

b) Optimal bandwidth allocation: For a feasible offloading volume decision r^t_UAV, the optimal bandwidth allocation can be obtained by solving the sub-problem P3.1.2, where

p^UAV_{Tx,i}(t) = (N₀ W_UAV / h^t_{i,UAV}) (2^{r^t_{i,UAV}/(τ W_UAV α^t_{i,UAV})} − 1)

denotes the transmit power of user i on the UAV link. First, we observe that (28c) is equivalent to α^t_{i,UAV} ≥ α^{t,min}_{i,UAV}. Then, the partial Lagrangian function associated with this problem can be written with λ^t_UAV ≥ 0 being the Lagrangian multiplier associated with the total-bandwidth constraint. Based on the Karush-Kuhn-Tucker (KKT) conditions, the optimal bandwidth allocation (α^t_UAV)⋆ and the optimal Lagrangian multiplier (λ^t_UAV)⋆ should satisfy the following equation set:

(α^t_{i,UAV})⋆ = max{ε_A, R_{i,UAV}((λ^t_UAV)⋆)}, ∀i ∈ N^t_UAV,   Σ_{i∈N^t_UAV} (α^t_{i,UAV})⋆ = 1,
where ε_A = max{ϵ_A, α^{t,min}_{i,UAV}} and R_{i,UAV}(λ^t_UAV) denotes the root of ∂L(α^t_{i,UAV}, λ^t_UAV)/∂α^t_{i,UAV} = 0. The following proposition provides a closed-form expression for R_{i,UAV}(λ^t_UAV).

Proposition 1: Given λ^t_UAV > 0, the root of ∂L(α^t_{i,UAV}, λ^t_UAV)/∂α^t_{i,UAV} = 0 is positive and unique, as given in (32), where W(•) denotes the Lambert-W function.
Proof: Please refer to Appendix B. □

It is observed that ∂L(α^t_{i,UAV}, λ^t_UAV)/∂α^t_{i,UAV} is a monotonic function with respect to α^t_{i,UAV}. Thus, the optimal Lagrangian multiplier (λ^t_UAV)⋆ can be found using a bisection search with appropriately selected boundaries. The search for (λ^t_UAV)⋆ terminates when the search interval is narrower than ξ, where ξ is the accuracy of the algorithm. Details of the Lagrangian method for solving P3.1.2 are summarized in Algorithm 1.
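The bisection step of Algorithm 1 can be sketched as follows. This is a minimal sketch under two assumptions: `g` is a monotone function of λ whose root gives (λ^t_UAV)⋆ (e.g., the residual Σ_i α_i(λ) − 1 of the total-bandwidth condition), and the initial bracket [lam_lo, lam_hi] contains that root.

```python
def bisect_multiplier(g, lam_lo, lam_hi, xi=1e-4, max_iter=100):
    """Bisection search for the optimal Lagrangian multiplier; terminates
    once the bracket is narrower than the accuracy xi."""
    sign_lo = g(lam_lo) > 0
    while lam_hi - lam_lo >= xi and max_iter > 0:
        mid = 0.5 * (lam_lo + lam_hi)
        if (g(mid) > 0) == sign_lo:
            lam_lo = mid          # root lies in the upper half
        else:
            lam_hi = mid          # root lies in the lower half
        max_iter -= 1
    return 0.5 * (lam_lo + lam_hi)
```

Halving the bracket each step yields the O(log₂(·/ξ)) iteration count used in the complexity analysis of Section VI.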
Algorithm 1: Lagrangian Method for P3.1.2

2) Optimization on Offloading Volume and Bandwidth Allocation for the mBS Link: Similar to the UAV link, given a feasible channel assignment decision, the optimization variables related to the mBS include the bandwidth allocation, α^t_mBS ≜ {α^t_{i,j} | i ∈ N^t_mBS, j = mBS}, and the offloading volume on the mBS link, r^t_mBS ≜ {r^t_{i,j} | i ∈ N^t_mBS, j = mBS}. Denoting by x^t_mBS ≜ {α^t_mBS, r^t_mBS} the combination of these variables, the optimal decisions for the mBS link can be obtained by solving P3.2:

where r^{t,max}_{i,mBS} = W_mBS α^t_{i,mBS} τ log₂(1 + P^max_i h^t_{i,mBS} / (N₀ W_mBS)) and the right-hand side (RHS) of (34c) denotes the upper bound of r^t_{i,mBS} at time slot t. Similar to P3.1, we adopt the Gauss-Seidel method to solve for α^t_mBS and r^t_mBS of P3.2 in an alternating manner, in which the optimal offloading volume is given in closed form and the bandwidth allocation is determined by the Lagrangian method.
a) Optimal offloading volume: For a feasible bandwidth allocation α^t_mBS, the optimal offloading volume for mobile devices in N^t_mBS can be obtained by solving P3.2.1: minimize (34a) subject to (34c). The optimal solution is either the stationary point of (34a) or one of the boundary points. Specifically, for i ∈ N^t_mBS, (r^t_{i,mBS})⋆ = max{min{r̃^t_{i,mBS}, RHS of (34c)}, 0}, where r̃^t_{i,mBS} denotes the stationary point of (34a).
b) Optimal bandwidth allocation: For a feasible offloading volume decision, the optimal bandwidth allocation on the mBS link can be obtained by solving P3.2.2. We observe that P3.2.2 has the same structure as P3.1.2, with the former corresponding to the MeNB and the latter to the SeNB. Thus, we can adopt Algorithm 1 to solve P3.2.2; the procedure is exactly the same after replacing j = UAV with j = mBS.
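The alternating (Gauss-Seidel) procedure used for both P3.1 and P3.2 can be sketched as follows. This is a minimal structural sketch: `update_r` and `update_alpha` are hypothetical placeholders for the closed-form offloading-volume update and the Lagrangian bandwidth update.

```python
def gauss_seidel(update_r, update_alpha, r0, alpha0, tol=1e-9, max_iter=1000):
    """Alternate the two block updates until the iterates stop changing.
    Convergence to the optimum is guaranteed for P3.1/P3.2 because the
    problem is convex and its feasible region is a Cartesian product of
    the two variable blocks [39]."""
    r, alpha = list(r0), list(alpha0)
    for _ in range(max_iter):
        r_new = update_r(alpha)            # closed form, given the bandwidth
        alpha_new = update_alpha(r_new)    # Lagrangian method, given the volumes
        change = max(abs(a - b) for a, b in zip(r + alpha, r_new + alpha_new))
        r, alpha = r_new, alpha_new
        if change < tol:
            break
    return r, alpha
```

The Cartesian-product structure is what makes block-coordinate descent safe here: neither block's feasible set depends on the other block's current value.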
3) Optimization on User's Local Computation: Given the optimal decision on the offloading volume of each user, the problem of resource allocation for local computation, P3.3, can be decomposed for each individual f^t_{l,i}, in which constraint (36c) is added to ensure (14g) of the original problem P1. First, we observe that P3.3 is a convex problem since its objective function (36a) is convex and all of its constraints are linear. Furthermore, since both the objective function and the constraints of P3.3 can be decomposed for each f^t_{l,i}, the optimization for f^t_l can be done by solving for each f^t_{l,i} separately. Specifically, the optimal solution to P3.3 is either the stationary point of the objective function (36a) or one of the boundary points.
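For a convex one-dimensional sub-problem of this type, the optimum is simply the unconstrained stationary point projected onto the feasible interval. The sketch below illustrates this with a hypothetical objective φ; it is not the exact (36a).

```python
def project_stationary(stationary, f_min, f_max):
    """Optimal point of a convex 1-D problem over [f_min, f_max]:
    the unconstrained stationary point clipped to the interval."""
    return max(f_min, min(stationary, f_max))
```

Because the objective is convex, whenever the stationary point falls outside the interval the optimum sits at the nearer boundary, which is exactly what the clipping expresses.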

4) Optimization on UAV's Computational Resource Scheduling: The optimal (f^t_c)⋆ can be obtained by solving P3.4. We observe that, similar to P3.3, P3.4 is convex and the optimization can be decomposed into sub-problems that involve each f^t_{c,i} separately. Specifically, the optimal solution for each f^t_{c,i} is either the stationary point of the objective function or one of the boundary points.

VI. ALGORITHM ANALYSIS

In this section, we provide the computational complexity of the proposed scheme in solving the per-time slot problem, followed by an analysis of the scheme's optimality.

A. Computational Complexity
The computational cost mainly comes from the model-based critic module because, in each time slot, this module must examine the k potential channel assignment decisions made by the actor module to select the best one. Importantly, examining one decision involves solving the optimization of computation and communication for all users and servers. The time complexity of each sub-problem is as follows.
1) Complexity of Joint Optimization on Offloading Volume and Bandwidth Allocation: Given feasible offloading volumes, Algorithm 1 requires O(log₂(N/ξ)) iterations of the bisection search to find the optimal bandwidth allocation, where N denotes the number of users and ξ specifies the algorithm accuracy (e.g., ξ = 10⁻⁴). Given a feasible bandwidth allocation, the optimal offloading volume for each user is obtained in closed form; thus, the complexity is O(N) for N users. Consequently, the Gauss-Seidel joint optimization with at most I_max iterations has complexity O(I_max(N + log₂(N/ξ))).
2) Complexity of Optimization on the Computation of the User and the UAV: Since the solutions to P3.3 and P3.4 can both be obtained in closed form, the optimization has complexity O(1) for each user.
In summary, the computational complexity of evaluating one channel assignment decision is O(I_max(N + log₂(N/ξ))). Since the critic module examines k potential decisions in each time slot, the overall computational complexity of the proposed scheme is O(kI_max(N + log₂(N/ξ))), which allows fast execution.¹
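The growth of the per-slot workload can be made concrete as follows. This is an illustrative tally only: constant factors are suppressed and the function is hypothetical, not a measured operation count.

```python
import math

def critic_ops(k, n_users, i_max, xi=1e-4):
    """Order-of-growth estimate O(k * I_max * (N + log2(N / xi))) of the
    per-slot work done by the critic module (constants suppressed)."""
    return k * i_max * (n_users + math.log2(n_users / xi))
```

The estimate grows linearly in k and I_max and only mildly (N + log N) in the number of users, which is why the scheme scales far better than the exhaustive search's O(N^{χ^max_UAV + χ^max_mBS}).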

B. Optimality Analysis
We assume that the traffic arrival and channel gain fluctuation for each user form an independent and identically distributed (i.i.d.) random process, denoted as ω(t) = {h^t_{i,j}, A^t_i}_{i∈N,j∈S}. A policy that observes ω(t) in each time slot and makes control decisions independently of the queue backlog is referred to as an ω-only policy.
To ensure the strong stability of the queues, we assume that the following assumption, which is the Slater condition for Lyapunov optimization [28], holds. The asymptotic optimality of the proposed DRLRM scheme is then provided in Theorem 2.
Assumption 1 (Slater Condition): There exist values ϵ > 0 and Φ(ϵ) (where 0 ≤ Φ(ϵ) ≤ P^max_sys) and an ω-only policy Π making control decision α^{Π,t} in time slot t that satisfies (38).

¹ Given the settings in Section VII, the running time of the proposed DRLRM scheme for a one-shot decision is on the order of milliseconds, using Python and TensorFlow on a PC with an Intel Core i7 2.9 GHz CPU and an Nvidia GeForce RTX 3070 GPU. In this regard, the MEC server can output the one-slot decision within the time frame duration (e.g., 10 ms in LTE).
Theorem 2: Suppose that the ω(t) process is i.i.d. over time slots, P1 is feasible, and the Slater condition holds for some ϵ and Φ(ϵ). Suppose further that the proposed DRLRM algorithm produces a C-additive approximation (C ≥ 0) of the minimum of (22) in every time slot. Then the following statements hold.

(a) The average system power consumption satisfies

lim sup_{T→∞} (1/T) Σ_{t=0}^{T−1} E[P_sys(t)] ≤ P⋆_sys + (B + C)/V,   (39)

where P⋆_sys is the minimum average power cost achievable by any policy that meets the required constraints.
(b) All queues Q^l_i(t), Q^s_i(t), Z^l_i(t), and Z^s_i(t) are mean rate stable, and the QoE constraints (3) and (4) are satisfied.

Proof: Please refer to Appendix C. □

VII. NUMERICAL RESULTS AND DISCUSSION

In this section, simulation results are provided to evaluate the proposed scheme's performance. In the simulation, we consider N = 8 mobile devices placed randomly around a hot spot within a radius of 100 meters. The MeNB (i.e., the mBS) is 500 meters away from the users, while the SeNB (i.e., the UAV) flies above them at a fixed altitude of 50 meters. Each MEC server serves at most two users at a time. The small-scale fading channel power gains h^t_{i,j} are randomly generated in each time slot. For the proposed method, we set σ_n = 0.25 and k = 16 to balance exploration and exploitation in the actor module. The DNN consists of three 1-D convolutional layers, followed by a Flatten layer and three Dense layers; all are implemented using TensorFlow. The ReLU (rectified linear unit) activation is used for all layers except the last one, which uses the sigmoid function.
Benchmarking schemes: To evaluate the performance, we use four benchmark schemes: the Exhaustive, QL-JRA, CH-JRA, and RA-JRA methods, where in RA-JRA the network randomly assigns users to the mBS and the UAV. These benchmark schemes differ in the approach to solving x^t for channel assignment. However, all of them apply the critic module's optimization procedure specified in Section V-B to optimize the remaining variables in y^t. It is noteworthy that the exhaustive method is considered the optimal solution for benchmarking. However, finding the optimal solution via exhaustive search is not feasible for a system with many users.
Performance evaluation scenarios: To evaluate the degree of suboptimality and convergence, we compare the proposed DRLRM method with the benchmark schemes in the two scenarios described in Table III.
-Scenario I: Power consumption is considered the main KPI for performance evaluation since all methods can satisfy the queue stability constraints under a reasonable computational load. The arrival rate is set at a reasonable level, λ = 10 kb. Accordingly, the queue length thresholds are Q^th_l = 15 kb and Q^th_s = 3 kb. For the proposed DRLRM method, the number of generated actions is set at k = 16 (i.e., 2% of the search space).
-Scenario II: A very high computational load with λ = 30 kb is set to emulate high-traffic periods. All methods are expected to run at the highest energy level virtually all the time to stabilize the queues; thus, power consumption is not the focus. The queue length thresholds are relaxed (i.e., Q^th_l = Q^th_s = ∞) to facilitate the optimization. For the proposed method, k = 80 is set (approximately 10% of the search space).

Performance in Scenario I: In Fig. 3, we compare all methods in Scenario I, where we focus on the power consumption under the predefined queue threshold constraints. Each point in the figure is a moving average over 1500 time slots. From Fig. 3a and Fig. 3b, we observe that, thanks to the critic module's optimization, all methods can keep the user's and the UAV's queues stable at levels lower than or equal to the predefined thresholds (i.e., the queue threshold constraints are satisfied). From Fig. 3c, we observe a power consumption reduction for the proposed scheme over time. In the early stage, the proposed method consumes considerably more power than the optimal channel assignment (as much as the QL-JRA method). In the later phase, the scheme's power consumption gently decreases over time and eventually converges to the same level as if the optimal decision were selected. This result demonstrates the effectiveness of the training as the DNN gradually learns from experience to mimic the optimal policy. With the considered settings, our proposed method provides power reductions of 13.4%, 26.6%, and 30.5% compared to the QL-JRA, CH-JRA, and RA-JRA schemes, respectively.
Performance in Scenario II: Figure 4 illustrates the performance of all schemes in Scenario II, where we focus on the queue stability KPI under very high traffic. Each point in the figure is a moving average over 1500 time slots. We observe that while the UAV's queue is kept stable by all methods, only the optimal channel assignment and the proposed scheme can keep the users' local queues stable. Specifically, the user's backlog queue under the QL-JRA, CH-JRA, and RA-JRA schemes increases almost linearly over time, indicating that the queue is not stable. This is because, without a proper policy for channel assignment, the achievable task processing capability is very limited and cannot keep up with the given task arrival rate. In contrast, the proposed method's user queue length also increases rapidly in the early stage but decreases gradually in the later phase. In the end, the scheme's user queue length is almost the same as that of the optimal channel assignment. This result once again demonstrates the effectiveness and convergence performance of the proposed DRLRM framework, even in unfavorable circumstances with a very high computational workload.
Effect of the parameter k: In Fig. 5, we investigate the convergence behavior of the proposed DRLRM method under different settings of the hyper-parameter k in Scenario II, plotting the moving average of the user queue length over 2000 time slots for k = 40, 80, and 120. The figure clearly shows that this hyper-parameter plays a critical role in the convergence speed of the proposed method. Higher values of k help the DNN learn the optimal channel assignment policy faster. They speed up the convergence at the cost of the higher computational resources required for the critic module to investigate the generated potential decisions. Specifically, the time until convergence of the proposed method (within a 1% gap compared to the optimal decision) is 6000 and 8000 time slots for k = 120 and k = 80, respectively. This is because, with more decisions generated, the critic module has a higher chance of extracting good approximations of the global minimum of the per-time slot problem. In other words, by investigating more potential decisions, the critic module improves the quality of the training data. In the long term, this advantage speeds up the learning process.
Effect of the queue length threshold: Fig. 6a illustrates the impact of the queue threshold on the average system queue length when varying V. It is noteworthy that the queue length threshold can serve as a means of controlling the service delay, since the average task execution delay is proportional to the average queue length. We observe that if the queue stability level is not constrained, the average queue length increases almost linearly with the Lyapunov parameter. In addition, a higher task arrival rate leads to an increase in the average queue length. The interesting point arises when we add constraints (3) and (4) to enforce the stability level of the queue length. We notice that for small values of V, the queue length threshold constraint does not affect the optimization. As V increases, the average queue length is effectively controlled so that the constraint is satisfied for all considered task arrival rates. Note that in the case of λ = 10 kb, the queue stability levels with and without the constraints are almost the same, since the setting of V is not large enough to make the average queue length surpass the threshold. Fig. 6b investigates the impact of the queue length threshold on the system power consumption when varying V. We observe that the power consumption in all settings decreases as the parameter V increases, as expected. This is because increasing V puts more emphasis on the system power than on stabilizing the queue length in the per-time slot problem. In addition, we notice that as the arrival rate increases, the power consumption increases accordingly. The queue threshold constraint also significantly impacts the power consumption. The gap between the two cases (with and without the constraint) tends to widen with an increasing arrival rate and control parameter V. This is because the network has no choice but to consume more energy to keep the queues stable at a satisfactory level (as depicted in Fig. 6a), especially in cases with high arrival rates.
Power-delay trade-off: In Fig. 7, we investigate the trade-off between the average weighted-sum system power consumption and the average task computation delay by varying the Lyapunov control parameter V with unconstrained queue thresholds. As can be observed, the average task computation delay increases as the power consumption decreases, indicating that a proper setting of V is critical to balance the two objectives. Besides, we observe that, given a specific execution delay level, the average weighted-sum power consumption increases with the task arrival rate. The result is logical since more power is required to keep the queues stable when the workload grows.

VIII. CONCLUSION
This paper proposes a hybrid method that combines conventional model-based optimization and model-free DRL to minimize the power consumption of a multi-user, multi-server MEC network. Via DC, each user can offload tasks to the macro base station and the UAV-mounted MEC server simultaneously. The power minimization is formulated as a multi-stage MINLP problem with long-term constraints on queue stability and average task computation delay. Lyapunov optimization is exploited to transform the original multi-stage problem into a per-time slot problem, which is then solved using a DRL framework. Theoretical analyses are provided to demonstrate the proposed method's optimality and computational complexity. Extensive simulations show that the proposed framework can produce nearly the same performance as the optimal solution obtained via an exhaustive search. In future research, it would be interesting to investigate the impact of UAV-ground base station collaboration and the adaptive deployment of multiple UAVs on system performance. Other research directions, such as multi-layer edge computing, vertical networks of edge servers, and quality of experience-aware deployment, should also be investigated.

APPENDIX A
PROOF OF THEOREM 1

To begin with, let Q_l(t) ≜ {Q^l_i(t)}_{i∈N}, Q_s(t) ≜ {Q^s_i(t)}_{i∈N}, Z_l(t) ≜ {Z^l_i(t)}_{i∈N}, and Z_s(t) ≜ {Z^s_i(t)}_{i∈N} denote the system state variables. We then define the Lyapunov function L(•) and the drift function ∆(•) for Q_l(t), Q_s(t), Z_l(t), and Z_s(t) in a similar way as for Θ(t) in (17) and (18), respectively, where the conditional expectation is taken given Θ(t). The following lemmas give the upper bound of the drift function for each of the above system states. Note that in what follows, [x]⁺ denotes the function max{x, 0}. Additionally, since the communication and computation resources at the user and the UAV are limited, we denote the upper bounds of D^t_i, r^t_i, and c^t_i as D_{i,max}, r_{i,max}, and c_{i,max}, respectively. Similarly, A_{i,max} denotes user i's maximal task arrival in a time slot.
Lemma 1: The drift function for Q_l(t) is bounded as in (40).
Proof: Step (†) is obtained since D^t_i = l^t_i + r^t_i ≤ Q^l_i(t), as in (14g). By some algebraic transformations, taking the conditional expectation on both sides of the inequality, and summing up over all users i ∈ N, we obtain (40).
□

Lemma 2: The drift function ∆(Z_l(t)) is upper bounded as in (41), where r^t_i = r^t_{i,UAV} + r^t_{i,mBS}.
Proof: From (15), step (†) is due to the fact that (max{a − b, 0})² ≤ (a − b)², and step (‡) follows from condition (14g) of P1, i.e., D^t_i = l^t_i + r^t_i ≤ Q^l_i(t). □

It is straightforward that ∆(Θ(t)) = ∆(Q_l(t)) + ∆(Q_s(t)) + ∆(Z_l(t)) + ∆(Z_s(t)). Thus, by summing up the bounds in (40), (41), (43), and (44), we obtain the upper bound of the drift-plus-penalty as in (20). We observe that B consists of constant terms from the observation at the beginning of time slot t and thus can be put aside from the optimization of the target control variables X^t.

APPENDIX B
PROOF OF PROPOSITION 1

Substituting a and c, we obtain (32).

APPENDIX C
PROOF OF THEOREM 2
To begin with, we introduce the following lemma regarding the feasibility of problem P1.
□

Proof of statement (a): Consider an ω-only policy Π with a corresponding value δ > 0 and apply Lemma 5 to the right-hand side (RHS) of (20). Step (†) holds because the ω-only policy Π is independent of the queue backlog Θ(t) and ω(t) is i.i.d. over time; step (‡) is obtained by plugging in (45). By letting δ → 0, we obtain ∆(Θ(t)) + V E[P_sys(t)] ≤ B + C + V P⋆_sys. By summing both sides from t = 0 to T − 1, taking iterated expectations and telescoping sums, and then dividing both sides by T V, we obtain (1/T) Σ_{t=0}^{T−1} E[P_sys(t)] ≤ (B + C)/V + P⋆_sys. Taking the limit on both sides of the inequality as T → ∞, we obtain (39). This concludes the proof of statement (a).
Proof of statement (b): We consider an ω-only policy Π that satisfies the Slater condition in Assumption 1. Plugging (38) into step (†) of (46), we obtain ∆(Θ(t)) + V E[P_sys(t)] ≤ B + C − ϵ Σ_{i∈N} [Q^l_i(t) + Z^l_i(t) + Q^s_i(t) + Z^s_i(t)] + V Φ(ϵ). By taking iterated expectations, summing the telescoping series, and rearranging terms, we obtain an inequality indicating that all queues Q^l_i(t), Z^l_i(t), Q^s_i(t), and Z^s_i(t) are strongly stable, which also implies mean rate stability (Theorem 2.8 of [28]). In addition, since the two virtual queues Z^l_i(t) and Z^s_i(t) are mean rate stable, the QoE constraints (3) and (4) are satisfied. This concludes the proof of statement (b).

Manuscript received 24 January 2023; accepted 24 March 2023; approved by IEEE/ACM TRANSACTIONS ON NETWORKING Editor T. Lan. Date of publication 13 April 2023; date of current version 19 December 2023. This work was supported by the University of Aizu's Competitive Fund 2022 under Grant P8. (Corresponding author: Anh T. Pham.)

Fig. 5. Convergence behavior of the proposed DRLRM method under different settings of the hyper-parameter k (k = 40, 80, and 120) in Scenario II; the moving average of the user queue length over 2000 time slots is plotted.

TABLE I
A COMPARISON OF OUR WORK WITH EXISTING DRL-BASED RESOURCE MANAGEMENT SCHEMES

TABLE II
SUMMARY OF KEY NOTATIONS

TABLE III
DETAILS OF TWO SCENARIOS FOR PERFORMANCE EVALUATION

⋯)² + D_{i,max} Q^th_{l,i} + Q^l_i(t) A^t_i, where (†) is obtained since (a − b)² ≤ a² + b² for ab ≥ 0 and 0 ≤ D^t_i ≤ D_{i,max}. By replacing D^t_i with l^t_i + r^t_i, taking the conditional expectation on both sides of the inequality, and summing up over all users i ∈ N, we obtain (41). □

Lemma 3: The drift function for Q_s(t) is bounded as ∆(Q_s(t)) ≤ B_3 − ⋯ (43).
Proof: The proof is similar to that of Lemma 1. □

Lemma 4: The drift function ∆(Z_s(t)) is upper bounded by

∆(Z_s(t)) ≤ B_4 + Σ_{i∈N} E[ Z^s_i(t) (c^t_i − Q^s_i(t) − r^t_{i,UAV} + Q^th_{s,i}) | Θ(t) ],   (44)

where B_4 = (1/2) Σ_{i∈N} [c²_{i,max} + Q^s_i(t)² + r²_{i,max} + (Q^th_{s,i})² + c_{i,max} Q^th_{s,i} + Q^s_i(t) r_{i,max}].
Proof: The proof is similar to that of Lemma 2. □