Traffic Learning and Proactive UAV Trajectory Planning for Data Uplink in Markovian IoT Models

The Age of Information (AoI) measures the freshness of data. In IoT networks, traditional resource management schemes rely on a message exchange between the devices and the base station (BS) before communication, which causes high AoI, high energy consumption, and low reliability. Unmanned aerial vehicles (UAVs) acting as flying BSs offer many advantages in minimizing the AoI, saving energy, and improving throughput. In this article, we present a novel learning-based framework that estimates the traffic arrival of IoT devices based on Markovian events. The learned traffic is then used to optimize the trajectory of multiple UAVs and their scheduling policy. First, the BS predicts the future traffic of the devices. We compare two traffic predictors: 1) the forward algorithm (FA) and 2) the long short-term memory (LSTM). Afterward, we propose a deep reinforcement learning (DRL) approach to learn the optimal policy of each UAV. Finally, we optimize the reward function of the proposed DRL approach. Simulation results show that the proposed algorithm outperforms the random-walk (RW) baseline model regarding the AoI, scheduling accuracy, and transmission power.


I. INTRODUCTION
The recent progress in machine-type communication (MTC) based IoT has witnessed new service modes, such as massive MTC (mMTC) and ultra-reliable low latency communication (URLLC) [1]. These new service modes introduce critical applications, e.g., remote surgery and vehicle-to-vehicle (V2V) communications, and have raised the need for new technologies to meet the quality-of-service (QoS) demands of such applications [2]. Among the most critical recent QoS requirements are low end-to-end latency and real-time data collection [3].
Meanwhile, MTC devices (MTDs) have different traffic characteristics compared to traditional human-type communication (HTC) devices [4]. For instance, the MTC packets are usually shorter in length than the HTC packets. Moreover, MTC traffic is highly correlated and more homogeneous, i.e., nearby devices tend to have similar traffic [5]. For example, assume sensor a monitors the package count in an industrial factory and sensor b detects package defects. Let event 1 correspond to no missing package, whereas event 2 corresponds to a ruined package. In this example, only sensor a is active under event 1, whereas both sensors are activated under event 2.
It is important to forecast the activation pattern of each sensor to allocate the available resources efficiently.
In addition, an IoT network can have a massive number of devices within a small area, i.e., mMTC, and it also may include devices with strict reliability and latency demands, i.e., URLLC [6]. The traditional access protocols, such as random access (RA) in LTE and grant-free (GF) non-orthogonal multiple access (NOMA), are efficient in serving HTC; however, they have many limitations when operating in IoT networks [7]. For instance, the RA relies on a four-step handshake procedure, where the device requests a transmission and waits for the response and identification. Hence, many messages are exchanged between the devices and the base station (BS) before transmission, which introduces high signaling overhead. In addition, this high signaling overhead, which is necessary for device scheduling, reduces the spectral efficiency of the resources. Moreover, in massive deployments, the traditional resource allocation techniques can suffer from a high collision rate. The high signaling overhead and the collisions are sources of latency and higher energy consumption, so these schemes fail to meet the QoS requirements of the MTC use cases. Other conventional schemes, such as time-division multiple-access (TDMA), also fail to meet the QoS demands of IoT devices [8]. Although TDMA schedules the resources fairly among the devices, it does not distinguish between the active and silent devices, which wastes resources and reduces the spectral efficiency.
Machine learning schemes have recently been considered as potential solutions to the aforementioned problems. The learning-based resource allocation schemes aim to reduce the signaling overhead and the collisions, consequently reducing the resultant latency. In [9], the 3GPP defines the fast uplink as the potential replacement for the traditional RA. It requires a traffic estimation to predict the active and silent devices at each time instant. This learning-based access scheme has shown promising results regarding access delay, reliability, collision, and signaling overhead compared to the RA schemes [10], [11].
According to [4], the initial step in the fast uplink solution is traffic prediction to identify active and silent devices at a time, i.e., a time-series binary classification problem.
Machine learning tools are efficient in solving time-series problems. Classical machine learning has been widely used in forecasting traffic, such as auto-regressive integrated moving average (ARIMA) [12], the forward algorithm (FA) in hidden Markov models (HMMs) [13], and Gaussian mixture models (GMMs) [14]. Classical methods rely on statistical equations and probabilistic models to estimate the probability of a device being active or silent at each time instant. They are usually fast and simple for low-dimensional problems but suffer from poor accuracy in complex, high-dimensional problems. On the other hand, modern machine learning techniques for time-series problems have evolved recently and rely on deep neural networks, such as recurrent neural networks (RNNs), long short-term memory (LSTM), and attention mechanisms [15]. Modern deep learning methods extract features automatically through deep, connected hidden neural network layers. They are normally complex, and their training is time-consuming compared to classical techniques. However, they have been shown to perform better in high-complexity setups.
An important metric that measures the degree of freshness of information received from devices is the age of information (AoI). AoI is defined in [16] as the time difference between the current time and the generation time of each device's last received packet. It was developed to evaluate the freshness of the data collected from each device [17], where lower AoI means fresher information. There is a direct relation between the traffic prediction and the AoI, where active devices that successfully receive resources and transmit their packets have lower AoI. In contrast, wrong allocation increases the AoI of non-served active devices. In addition, granting the resources to the active devices without receiving transmission requests reduces the signaling overhead, leading to lower latency and lower average AoI. One of the emerging technologies considered as a potential solution to minimize the AoI in future wireless communications is the unmanned aerial vehicle (UAV) [18]. Relying on their flexibility, accessibility, and ability to be fully controllable [19], deploying UAVs enables dynamic and real-time data collection, allowing critical applications to operate safely [20]. Introducing the UAV as a flying BS has many advantages, such as enhancing the line-of-sight (LoS) communication between the device and the UAV as the UAV flies near the served device, improving the throughput, decreasing the transmission energy, enabling the deployment of a massive number of devices in a network, and minimizing the AoI in wireless networks [21]. Despite these advantages, using the UAV as a flying BS has also raised many challenges, such as trajectory optimization, flight energy minimization, the freshness of the collected data, and scheduling the resources efficiently among the served devices [22].
To this end, minimizing the AoI in UAV-based networks is necessary to guarantee that fresh information is received from each device and to boost fairness among the devices. Moreover, most of those devices are limited-power devices [23], so their transmission power needs to be minimized. Therefore, three crucial aspects should be monitored in UAV-based networks, namely, the regret that describes the accuracy of scheduling the resources, the AoI that captures the freshness of the information, and the transmission power of the limited-power devices [24].

A. Related Literature
Herein, we summarize the existing literature covering the solutions to the traditional RA protocols and the work done on UAV-based networks. To begin with, the authors in [25] proposed a reinforcement learning model to schedule the MTDs to the RA slots, whereas Ali et al. [26] exploited the sleeping multi-armed bandit to formulate a fast uplink grant algorithm that prioritizes the devices according to their activation and their importance. Their proposed model enhances fairness in the system and decreases the average access delay. In addition, several solutions were mentioned in [27], such as access class barring schemes, dynamic resource allocation, slotted access, and pull-based schemes. However, they still suffer from undesired latency or collision [11]. In [28], Zhiyi et al. overcame the high signaling overhead by introducing a hybrid resource allocation scheduler. The authors in [29] presented a federated-learning solution to estimate the future traffic of the MTDs. Their work does not consider the latency and complexity analysis that is crucial in designing resource allocation schemes.
In [30]-[32], the number of devices and their positions were optimized to satisfy the capacity or throughput constraints of a UAV-based network. For instance, the authors in [30] proposed a transport theory-based solution to determine cell boundaries and maximize transmission rate. Meanwhile, [31] used a heuristic algorithm to determine the optimum number of the UAVs and their positions, where their simulation results showed that all the users were meeting their QoS constraints. In addition, [32] optimizes the UAV's trajectory to maximize the throughput from the users' perspective.
The works in [33]-[36] discussed the AoI minimization via deep reinforcement learning (DRL) solutions. Tong et al. presented a trajectory optimization to minimize the AoI while ensuring that the packet drop rate is as low as possible [33]. In [34], a Markov decision process (MDP) was proposed to formulate the trajectory optimization problem. The authors in [35], [36] formulated a DRL solution to minimize the weighted-sum AoI, where their solutions outperform baseline schemes such as the random walk and distance-based approaches. The position of the users was optimized in [37] using a weighted expectation maximization approach.
Due to the complexity of the UAV-based optimization problems, most of the literature neglected the traffic arrival of the MTDs and assumed them to be active all the time. This assumption is not realistic and completely avoids the resource management aspect of the UAV acting as a flying BS. However, a few recent works have addressed the resource management problem in UAV optimization. For instance, [38] utilized a block successive upper-bound minimization algorithm to jointly minimize the energy consumption and the resource management of the UAV. Moreover, Peng and Shen [39] presented a multi-agent deep deterministic policy gradient solution to the aforementioned problem. They claim that using multiple agents outperforms the single-agent scenario regarding delay and QoS satisfaction ratios.

B. Contribution
This work addresses the problem of how to exploit a predictive dynamic traffic pattern in order to proactively and efficiently design a UAV trajectory to "navigate and collect data" from IoT devices. Our scheme aims at the joint minimization of the average AoI of the IoT devices as well as their average transmit power. The proposed algorithm comprises two stages: traffic estimation and UAV learning. Our main contributions are:
• We design the system model with the aid of the hidden Markov models (HMMs), where Markovian events govern the activation of devices. We assume that multiple UAVs are serving the devices.
• We present the FA as the classical traffic estimation approach that estimates device activation probabilities. In addition, we propose an LSTM architecture as the modern deep-learning traffic estimation approach. Both traffic estimators are evaluated in terms of accuracy and complexity.
• We propose a DRL solution that optimizes the trajectory of each UAV and the scheduling policy to jointly minimize the average AoI, scheduling regret, and average IoT transmission power. Moreover, we obtain the reward function that yields the best joint performance regarding AoI, scheduling regret, and transmission power for various device deployments.
• Exploiting this, the DRL-based UAV trajectory and scheduling optimization outperforms the baseline random-walk (RW) scheme. Our simulation results show that the performance of the proposed algorithms approaches the genie-aided case (the one that uses true activation probabilities instead of the predicted ones). The LSTM traffic predictor shows better AoI and transmission power results than the FA traffic predictor.

C. Outline
The rest of the paper is organized as follows: Section II illustrates the system model and the problem formulation. Section III discusses the traffic estimation stage, whereas Section IV presents the UAV learning stage and the proposed deep Q-network (DQN) solution. The key performance indicators are described in Section V. Section VI depicts the numerical results, and Section VII concludes the paper.

II. SYSTEM MODEL
Consider an uplink model of $D$ static IoT devices, where $\mathcal{D} = \{1, 2, \cdots, D\}$. The devices are randomly distributed in a grid world and served by a set $\mathcal{U} = \{1, 2, \cdots, U\}$ of $U$ rotary-wing UAVs that fly with fixed velocity $v_u$ at height $h_u$ and transmit the collected information from the IoT devices to a static BS of height $h_{BS}$. The location of a device $d$ is given by $l_d = (x_d, y_d)$, while the location of the UAV is projected on the 2D plane as $l_u = (x_u, y_u)$, and the BS is located at the center of the grid world, where $l_{BS} = (0, 0)$.
The distance between an IoT device $d$ and a serving UAV $u$ is denoted by $L_{du}$, while the distance between a serving UAV $u$ and the BS is denoted by $L_{uBS}$, and the distance between two horizontally or vertically adjacent points on the grid world is given by $L_g$. Four charging depots are located at the corners of the grid world to enable the UAVs to recharge. The time axis is discretized into $[\tau, 2\tau, ...]$, where $\tau$ is the time needed for the UAV to navigate from one grid point to another, i.e., $\tau = L_g / v_u$. During the time $\tau$, the UAV can allocate a resource to one IoT device $d$ ($\alpha_d(t) = 1$), and the scheduling vector is a one-hot vector given by $\alpha(t) = [\alpha_1(t), \alpha_2(t), \cdots, \alpha_D(t)]$. The system model is illustrated in Fig. 1.

A. System Analysis
Assuming LoS$^1$ communication between the UAV and the BS, and between the IoT devices and the UAV, the channel gain between the UAV and the BS is calculated from $g_0$, the channel gain at the reference distance (1 m). Each UAV has a battery of capacity $E$, which is discretized into $e_{\max}$ energy quanta. Each energy quantum has an energy of $E / e_{\max}$. The battery evolution $e_u$ of each UAV is formulated as in [40], [41] using the ceiling operator $\lceil \cdot \rceil$, the energy $e^t_u$ consumed by the communication between the UAV and the BS, and the energy $e^f_u$ consumed by movement. Here, $e^t_u$ is calculated at time instant $t$ from the noise power $\sigma^2$, the packet size $M$ of the sensor updates, and the signal bandwidth $B$. In addition, according to [42], $e^f_u$ is formulated in terms of $P_0$ and $P_1$, which represent the blade profile power and the induced power when the UAVs are hovering, respectively, the velocity $v_t$ of the UAVs, and the tip speed $v_{tip}$ of the blade. Meanwhile, $v_0$ is the mean rotor-induced velocity when hovering, $d_0$ represents the fuselage drag ratio, $\rho$ is the air density, $\mu_0$ represents the rotor solidity, and $Z$ is the area of the rotor disk.
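The movement energy described above can be illustrated with a short sketch. It assumes the rotary-wing propulsion power model that is widely used in the UAV literature and that matches the listed symbols; the exact expression used in this paper is not reproduced here. The function names, the multiplication by the slot length `tau`, and all numeric values are illustrative assumptions.

```python
import math

def propulsion_power(v, P0, P1, v_tip, v0, d0, rho, mu0, Z):
    """Propulsion power (W) of a rotary-wing UAV at speed v (m/s).

    Assumed model: blade profile + induced + parasite power.
    """
    blade_profile = P0 * (1.0 + 3.0 * v**2 / v_tip**2)
    induced = P1 * math.sqrt(
        math.sqrt(1.0 + v**4 / (4.0 * v0**4)) - v**2 / (2.0 * v0**2)
    )
    parasite = 0.5 * d0 * rho * mu0 * Z * v**3
    return blade_profile + induced + parasite

def movement_energy_quanta(v, tau, E, e_max, **params):
    """Energy spent in one slot of length tau, rounded up to whole quanta.

    One quantum holds E / e_max joules (assumption: energy = power * tau).
    """
    energy = propulsion_power(v, **params) * tau
    return math.ceil(energy / (E / e_max))

# Illustrative parameter values (assumptions, not taken from the paper).
params = dict(P0=79.86, P1=88.63, v_tip=120.0, v0=4.03,
              d0=0.6, rho=1.225, mu0=0.05, Z=0.503)
hover = propulsion_power(0.0, **params)    # equals P0 + P1 when hovering
cruise = propulsion_power(10.0, **params)  # moderate speeds need less power
```

With these placeholder values, hovering costs exactly $P_0 + P_1$, which is consistent with their description as the hovering blade profile and induced powers.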

B. Traffic Arrival
We denote the activation of a device $d$ at a time instant $t$ as $w_d(t)$, where $w_d(t) = 1$ means that device $d$ is active at time instant $t$ and $w_d(t) = 0$ means it is silent. Hence, the activation vector of the IoT devices in the network at time instant $t$ can be written as $W(t) = \{w_1(t), w_2(t), ..., w_D(t)\}$. A set of $K$ binary events controls the activation of the IoT devices. Each event is considered an event-driven background Markovian ON-OFF process that influences the IoT devices$^2$.
We denote the activation of an event $k$ at time instant $t$ as $S_k(t)$, where $S_k(t) = 1$ means that the event $k$ is in the ON state at time instant $t$ and $S_k(t) = 0$ means that it is in the OFF state. Hence, the activation vector of the binary events in the network at time instant $t$ can be described as $S(t) = \{S_1(t), S_2(t), ..., S_K(t)\}$. As shown in Fig. 2, the activation model is described as a set of binary Markov chains with transition probabilities $\epsilon_k$, which is the transition probability of an event $k$ from the OFF state ($S_k(t) = 0$) to the ON state ($S_k(t+1) = 1$), and $\epsilon_k^{(0)}$, which is the transition probability of an event $k$ from the ON state ($S_k(t) = 1$) to the OFF state ($S_k(t+1) = 0$). The transition between states for each Markov chain can be summarized with the transition matrix
$$\begin{bmatrix} 1 - \epsilon_k & \epsilon_k \\ \epsilon_k^{(0)} & 1 - \epsilon_k^{(0)} \end{bmatrix},$$
whose rows correspond to the current state $S_k(t) \in \{0, 1\}$ and whose columns correspond to the next state $S_k(t+1) \in \{0, 1\}$. Moreover, each active event $S_k(t) = 1$ has a probability $p_{dk}$ to activate a device $d$ at time instant $t$. In contrast, a silent event $S_k(t) = 0$ has a zero probability of activating a device $d$ at time instant $t$. Therefore, the probability of a device $d$ being active or silent under an event $k$ at time instant $t$ can be summarized with the activation matrix
$$\begin{bmatrix} 1 & 0 \\ 1 - p_{dk} & p_{dk} \end{bmatrix},$$
whose rows correspond to the event state $S_k(t) \in \{0, 1\}$ and whose columns correspond to the device state $w_d(t) \in \{0, 1\}$. In addition, the activation probability of a device $d$ affected by all the events in the network, $\Pr(w_d(t) = 1 | S(t))$, is given in (15).
$^2$ We note that many outdoor or indoor events may follow a Markovian behaviour of occurrence. The reader can refer to examples related to indoor fire events and autonomous vehicles in [8], which utilizes the same model for the sake of illustration in predictive traffic scenarios.
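The traffic model above can be simulated in a few lines. The sketch below is illustrative only: `eps_on` and `eps_off` stand for the OFF-to-ON and ON-to-OFF transition probabilities, `p[d][k]` stands for $p_{dk}$, and the way several events are combined (a device stays silent only if no ON event activates it, with events acting independently) is an assumption rather than the paper's stated formula.

```python
import random

def step_event(state, eps_on, eps_off, rng):
    """Advance one binary ON/OFF Markov chain by one time instant."""
    if state == 0:
        return 1 if rng.random() < eps_on else 0
    return 0 if rng.random() < eps_off else 1

def activation_probability(p_d, events):
    """Pr(device active | event states), assuming independent events."""
    silent = 1.0
    for p_dk, s_k in zip(p_d, events):
        silent *= 1.0 - p_dk * s_k  # an OFF event (s_k = 0) never activates
    return 1.0 - silent

def simulate(T, eps_on, eps_off, p, seed=0):
    """Return a list of (S(t), W(t)) pairs for T time instants."""
    rng = random.Random(seed)
    K = len(eps_on)
    S = [0] * K  # all events start in the OFF state (assumption)
    history = []
    for _ in range(T):
        S = [step_event(S[k], eps_on[k], eps_off[k], rng) for k in range(K)]
        W = [1 if rng.random() < activation_probability(p_d, S) else 0
             for p_d in p]
        history.append((S, W))
    return history
```

For example, `simulate(100, [0.1, 0.05], [0.3, 0.2], [[0.9, 0.0], [0.8, 0.7]])` mimics the two-sensor factory setting from the introduction: the first device reacts only to the first event, while the second device reacts to both.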

C. Problem formulation
We aim to jointly minimize the average AoI, the accumulative regret, and the devices' average transmission power. However, before we cast the optimization problem, we introduce the metrics addressed in the problem formulation.
1) Age of Information: The AoI is used to measure the freshness of the transmitted packets and the network fairness among the devices [43]. The AoI for device $d$ can be formulated as the difference between the current time instant $t$ and the last time slot $t_d$ such that $u_d(t_d) = 1$. If a device transmits an update packet at instant $t$, i.e., $\alpha_d(t) = 1$, its AoI is reset to one. To reduce the AoI, the UAVs need to forecast the active devices and serve those with longer AoI. We formulate the discrete AoI of device $d$ as follows
$$A_d(t) = \begin{cases} 1, & \text{if } \alpha_d(t) = 1, \\ \min\{A_{\max}, A_d(t-1) + 1\}, & \text{otherwise}, \end{cases} \tag{17}$$
where $A_{\max}$ is the maximum AoI threshold in the model. The average age of the network at time instant $t$ is calculated as $\frac{1}{D} \sum_{d=1}^{D} A_d(t)$.
2) Accumulative regret: Regret occurs when allocating a resource to an inactive device while an active device is left unserved [8]. For example, consider a network of two devices $d_1$ and $d_2$, which are active and silent, respectively, i.e., $w_{d_1}(t) = 1$ and $w_{d_2}(t) = 0$. Suppose the scheduling policy is $\alpha = [0, 1]$, i.e., $\alpha_{d_1}(t) = 0$ and $\alpha_{d_2}(t) = 1$. The network has scheduled a resource to an inactive device while an unserved active device exists. Therefore, the regret in this scenario is 1. The regret at time instant $t$ can be computed as the minimum of the wrongfully scheduled resources $\omega_t$ and the missed scheduled resources $\eta_t$, i.e., $\min\{\omega_t, \eta_t\}$, and the accumulative regret at time instant $T$ is the sum of the regrets up to $T$, i.e., $\sum_{t=1}^{T} \min\{\omega_t, \eta_t\}$.
3) Transmission power: The transmission power of an IoT device $d$ at time instant $t$ is denoted by $P_d$, and the average transmission power of the network is taken over all the devices.
4) Joint optimization problem: We are now ready to cast the joint optimization of the average age of information, the accumulative regret, and the average transmission power as problem P1 in (23), where $l_{c,u}$ are the coordinates of the charging depot from which UAV $u$ takes off. Therefore, each UAV is forced to start its trajectory from one of the corners of the grid world. The constraint in (23b) ensures that the UAV still has enough energy to move back to the nearest corner to recharge whenever a low battery is detected. During recharging, the IoT devices transmit their information directly to the BS to avoid long waiting times. In addition, $\zeta_1$
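The AoI and regret metrics can be checked on small examples. The following is a minimal sketch, not the paper's implementation: it assumes that $\omega_t$ counts resources granted to silent devices and that $\eta_t$ counts active devices left without a resource, which matches the two-device example in the text.

```python
def update_aoi(aoi, alpha, a_max):
    """One AoI step: served devices reset to 1, others age up to a_max."""
    return [1 if served == 1 else min(a_max, a + 1)
            for a, served in zip(aoi, alpha)]

def regret(w, alpha):
    """Regret at one time instant for activity w and scheduling alpha."""
    # omega_t: resources granted to silent devices (assumed definition)
    wrongly_scheduled = sum(1 for w_d, a_d in zip(w, alpha)
                            if a_d == 1 and w_d == 0)
    # eta_t: active devices that received no resource (assumed definition)
    missed = sum(1 for w_d, a_d in zip(w, alpha)
                 if a_d == 0 and w_d == 1)
    return min(wrongly_scheduled, missed)

def accumulative_regret(activity, schedule):
    """Sum of the per-instant regrets over a whole trace."""
    return sum(regret(w, alpha) for w, alpha in zip(activity, schedule))

# The example from the text: d1 active, d2 silent, resource given to d2.
example_regret = regret([1, 0], [0, 1])
```

Taking the minimum means that serving a silent device costs nothing when no active device was waiting, and that leaving active devices unserved costs nothing when every granted resource went to an active device.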

III. THE TRAFFIC PREDICTION STAGE
As shown in Fig. 3, the proposed algorithm has two stages: the traffic prediction stage and the UAV learning stage.
In the UAV learning stage, the agents learn the optimal policy and optimize the values of $\zeta_1$ and $\zeta_2$ in the reward function for different device deployments. Fig. 4 shows the flowchart of the proposed algorithm and the relationship among the stages. This section introduces the traffic estimation stage, while the UAV learning stage is discussed in the next section.
Herein, the objective is to minimize the AoI, regret, and power jointly. However, to minimize regret, the UAVs need to know at each time instant which devices are active, so that they are granted a resource, and which are silent, so that the available resources are not wasted. In addition, there is a correlation between the AoI and regret. For instance, suppose there is an active device $d_1$ at time instant $t$, i.e., $w_{d_1}(t) = 1$, and a scheduling policy $\alpha_{d_1}(t) = 1$. In this case, the regret is zero, and the AoI of that device is reset to 1. On the other hand, if the scheduling policy of the active device $d_1$ is $\alpha_{d_1}(t) = 0$, the regret is one, and the AoI is incremented according to (17). However, the two metrics are not always positively correlated. Therefore, a good traffic predictor is needed for joint optimization.
In this section, we present an HMM architecture as the proposed classical FA traffic predictor and an LSTM architecture as the potential modern traffic predictor. We compare both architectures from different points of view, such as the inputs, the outputs that will be used by the UAV to perform the scheduling, the space complexity, the time complexity, and the accuracy.

A. The Forward Algorithm
As described in the previous section, the activation of IoT devices is completely determined by the state of the background events. In addition, those states are unknown to the BS (hidden), i.e., the BS does not have information about the active and inactive events. Therefore, we can model the relation between the events and the devices as HMMs [13]. The HMMs consist of a set of unknown events and observations (device activations) affected by the states of those events. The activation probability of the devices $Y_d = \Pr(w_d(t) = 1 | S(t))$ is calculated from (15). The major concern is that the latter equation relies on the knowledge of the unknown states. The FA can estimate the hidden states by computing the joint distribution between the states and the observations recursively as
$$\Pr(S(t), W(1:t)) = \Pr(W(t) | S(t)) \sum_{S(t-1)} \Pr(S(t) | S(t-1)) \Pr(S(t-1), W(1:t-1)).$$
Then the state-activation joint probability is maximized over all possible events using [44]
$$S^*(t) = \arg\max_{S(t)} \Pr(S(t), W(1:t)). \tag{25}$$
The BS utilizes the estimation of the hidden states of the events to predict the activation of the IoT devices. The predicted activation probability of device $d$ at time instant $t+1$ given the predicted hidden states $S^*(t)$ is given by (26). The output of the FA is an estimated probability of each device being active.
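To make the recursion concrete, the sketch below runs the forward algorithm for a single hidden ON/OFF event observed through several devices. It is a simplified illustration under stated assumptions: only one event ($K = 1$), a uniform prior over the initial state, normalization of the joint probability at every step to avoid underflow, and a one-step prediction based on the full belief rather than on the hard estimate $S^*(t)$. The names `eps_on`, `eps_off`, and `p` are hypothetical.

```python
def likelihood(w, s, p):
    """Pr(W(t) = w | S(t) = s) for one event with device probabilities p."""
    prob = 1.0
    for w_d, p_d in zip(w, p):
        active = p_d * s  # a silent event never activates a device
        prob *= active if w_d == 1 else 1.0 - active
    return prob

def forward(observations, eps_on, eps_off, p, prior=(0.5, 0.5)):
    """Return the most likely state per instant and the final belief."""
    trans = [[1.0 - eps_on, eps_on], [eps_off, 1.0 - eps_off]]
    belief = list(prior)
    estimates = []
    for w in observations:
        # Prediction step: propagate the belief through the Markov chain.
        predicted = [sum(belief[i] * trans[i][j] for i in range(2))
                     for j in range(2)]
        # Update step: weight by the observation likelihood.
        joint = [likelihood(w, j, p) * predicted[j] for j in range(2)]
        total = sum(joint)
        belief = [x / total for x in joint]  # normalized joint probability
        estimates.append(max(range(2), key=lambda j: belief[j]))
    return estimates, belief

def predict_activation(belief, eps_on, eps_off, p):
    """Pr(w_d(t+1) = 1) for every device from the current belief."""
    p_on_next = belief[0] * eps_on + belief[1] * (1.0 - eps_off)
    return [p_d * p_on_next for p_d in p]
```

Observing any active device forces the belief to the ON state, because a silent event cannot activate a device in this model. The sketch assumes $0 < p_d < 1$ so that every observation has nonzero probability under the ON state.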

B. Long Short-Term Memory
The LSTM was introduced in [45] to solve the problem of vanishing gradient in RNNs resulting from long sequences [46]. It proposes a short memory $h(t)$ for short series in the past and a long memory $C(t)$ to store the relevant information from the long sequences. As shown in Fig. 3, the LSTM consists of 4 gates:
1) The forget gate $f(t)$: it is used to extract the relevant information from the input to be stored in the long memory and forget irrelevant information. Its update equation is
$$f(t) = \sigma\left(w_f [h(t-1), x(t)]\right),$$
where $\sigma$ is a sigmoid activation function, $w_f$ are the weights to be updated, $h(t-1)$ is the previous hidden state, and $x(t)$ is the input features.
2) The learn gate $i(t)$: it works similarly to an RNN, as its main purpose is to learn new patterns from the short sequences. Its update equations are
$$i(t) = \sigma\left(w_i [h(t-1), x(t)]\right), \quad \tilde{C}(t) = \tanh\left(w_C [h(t-1), x(t)]\right),$$
where $\tilde{C}(t)$ is a vector of the potential features to be used to update the long memory.
3) The remember gate: it uses the result of the forget gate and the learn gate to update the long memory, where
$$C(t) = f(t) \odot C(t-1) + i(t) \odot \tilde{C}(t).$$
4) The use gate: it updates the short memory as
$$o(t) = \sigma\left(w_o [h(t-1), x(t)]\right), \quad h(t) = o(t) \odot \tanh\left(C(t)\right).$$
One of the strongest aspects of modern deep learning techniques in time-series prediction is that they rely on observations only. Therefore, the LSTM only needs to collect a sufficient amount of data generated by the described model to capture the model parameters and the hidden states. It uses the collected observations from history and captures their pattern to estimate possible future observations. The output of the LSTM is binary, as it returns which devices are expected to be active or silent at each time instant.
The output of this stage is a vector that describes the activation of the devices. Its size is the number of devices in the network $D$. This vector is either an activation probability vector in $[0, 1]^D$, in case of using the FA as the traffic predictor, or a binary vector, in case of using the LSTM as the traffic predictor.
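A single LSTM step with the four gates can be written out directly. This is a minimal, framework-free sketch of the standard LSTM cell, not the architecture trained in the paper: bias terms are omitted, the weight matrices `w_f`, `w_i`, `w_c`, and `w_o` are plain nested lists, and the layer sizes in the example are arbitrary assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(weights, vec):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def lstm_step(x, h_prev, c_prev, w_f, w_i, w_c, w_o):
    """One LSTM step; returns the new short memory h and long memory c."""
    z = h_prev + x  # list concatenation [h(t-1), x(t)]
    f = [sigmoid(a) for a in affine(w_f, z)]            # forget gate
    i = [sigmoid(a) for a in affine(w_i, z)]            # learn gate
    c_tilde = [math.tanh(a) for a in affine(w_c, z)]    # candidate features
    # Remember gate: blend the old long memory with the new candidates.
    c = [f_j * c_j + i_j * ct_j
         for f_j, c_j, i_j, ct_j in zip(f, c_prev, i, c_tilde)]
    o = [sigmoid(a) for a in affine(w_o, z)]            # use gate
    h = [o_j * math.tanh(c_j) for o_j, c_j in zip(o, c)]
    return h, c

# Example: 2 hidden units, 3 input features, all-zero weights.
zeros = [[0.0] * 5 for _ in range(2)]
h, c = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0],
                 zeros, zeros, zeros, zeros)
```

With all-zero weights every gate outputs 0.5 and the candidate vector is zero, so the long memory is simply halved, which gives a quick sanity check of the wiring. In the traffic predictor, a final thresholded output layer would map $h(t)$ to the binary activity vector; that layer is not shown here.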

IV. THE UAV LEARNING STAGE
In this section, we formulate the addressed problem as an MDP. A DRL-based solution is presented, where we cast the reward function that will be used in the UAV learning stage to jointly minimize the average AoI, transmission power, and accumulative regret.

A. Markov Decision Process
An MDP is usually described in terms of the tuple ⟨s, a, r, p⟩, which consists of the state $s$, the action $a$, the reward $r$, and the state transition probability $p$. In addition, the environment is the IoT network modeled in Section II, and the agents are the UAVs that serve the devices. At time instant $t$, the agents are found at state $s(t)$ and select an action $a(t)$. Each agent moves to a new state $s(t+1)$ following the state transition probability $p_{a(t)}(s(t), s(t+1))$ of the environment. In addition, the agents gain an immediate reward $r(t)$ based on the selected action that moves the agent from one state to another. The agents aim to maximize the received reward, which is usually formulated in terms of the desired functions to be minimized or maximized. Here, the reward function is formulated in terms of the average age, transmission power, and accumulative regret as in (23). A policy $\pi$ is the strategy that the agents follow to select a particular action at each state. Whenever the agent selects an action that results in a state that has low AoI, transmission power, and accumulative regret, that agent receives a higher reward. Therefore, the agent's task is to discover the best possible action at each state that results in the best possible reward. The resulting strategy is referred to as the optimal policy $\pi^*$.
1) State space: In the described problem, the state space at time instant $t$ consists of four elements: i) the AoI vector $A(t) = [A_1(t), ..., A_D(t)]$, ii) a position vector of the UAVs $[l_1, ..., l_U]$, iii) the parameter $\Delta_u$ that describes the difference between the available energy at each UAV and the required energy to reach the nearest charging depot, and iv) the predicted activity vector of the devices $W(t)$ in the case of using the LSTM architecture for the traffic prediction, or an activation probability vector of the devices $[\Pr(w_1(t) = 1 | S(t)), ..., \Pr(w_D(t) = 1 | S(t))]$ in the case of using the forward algorithm for the traffic prediction.
2) Action space: The action space at time instant $t$ consists of two elements: i) the device to be served by each UAV, $\alpha(t) = [\alpha_1, ..., \alpha_U]$, and ii) the movement of the UAV, $\beta_u(t) \in \{\text{north}, \text{south}, \text{east}, \text{west}, \text{hovering}\}$.
3) State transition probability: We assume a deterministic state transition, so each component of the state vector follows a deterministic transition equation. For instance, the AoI is updated according to (17), the position of each UAV is updated according to the selected movement $\beta_u(t)$, and the energy margin $\Delta_u$ is updated as the difference between the available energy calculated in (2) and the energy needed to move towards the nearest charging depot. Finally, the traffic predictor outputs the activation pattern or probability.
4) Reward function: Based on the optimization problem in P1, the immediate reward is given by (36), and the accumulative reward aggregates the immediate rewards over time.
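The deterministic position update can be sketched as follows. The mapping from movement actions to grid offsets follows the action space above, while the clamping at the grid boundary and the coordinate values are assumptions made for illustration only.

```python
# Unit offsets for each movement action on the 2D grid.
MOVES = {
    "north": (0, 1),
    "south": (0, -1),
    "east": (1, 0),
    "west": (-1, 0),
    "hovering": (0, 0),
}

def move_uav(position, action, grid_step, half_width):
    """Return the next UAV position after one time slot.

    grid_step plays the role of L_g; the grid is assumed to span
    [-half_width, half_width] on both axes with the BS at (0, 0).
    """
    dx, dy = MOVES[action]
    x = position[0] + dx * grid_step
    y = position[1] + dy * grid_step
    # Assumption: a move that would leave the grid keeps the UAV at the edge.
    x = max(-half_width, min(half_width, x))
    y = max(-half_width, min(half_width, y))
    return (x, y)
```

For instance, `move_uav((0, 0), "north", 100, 500)` returns `(0, 100)`, and a UAV at the eastern edge that selects `"east"` stays in place.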

B. Solving the MDP Problem
The action-value function $q_\pi(s, a)$ is the expected reward starting from a state $s$, taking an action $a$, and then following the policy $\pi$. The optimal action-value function can be described as
$$q^*(s, a) = \max_{\pi} q_\pi(s, a),$$
where, at the optimal policy $\pi^*$, the optimal action-value function satisfies $q^*(s, a) = q_{\pi^*}(s, a)$. In addition, the optimal policy simply maximizes, at each state, the action-value function over all the possible actions. The Bellman equation describes the optimal action-value function recursively as
$$q^*(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} q^*(s', a') \right],$$
where $s'$ and $a'$ denote the next state and the next action, and $\gamma \in [0, 1]$ is the discount factor that controls how much the agent cares about the future rewards relative to the immediate rewards, i.e., $\gamma = 0$ means that the model cares only about the immediate reward, whereas $\gamma = 1$ means that the model accounts for the future rewards over an infinite horizon. The Bellman equation is non-linear and has no closed-form solution. Therefore, iterative solutions are used to solve it. Q-learning is a model-free iterative algorithm that is used to learn how good an action is in a particular state. It is formulated as follows
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right],$$
where $\alpha$ is the learning rate and $Q(s, a) \rightarrow q^*(s, a)$. The $\epsilon$-greedy policy is used in the Q-learning algorithm such that the model chooses a random action with probability $\epsilon$ and the greedy action, i.e., the action that maximizes the action-value function, with probability $1 - \epsilon$. Usually, $\epsilon$ is set to a large value (close to 1) at the beginning of the learning process and decays with time. This procedure addresses the exploration-exploitation trade-off. The larger the value of $\epsilon$, the more the exploration, whereas a small $\epsilon$ means that the model is exploiting what it has learned to maximize its action-value function. The Q-learning algorithm is suitable for simple problems, where the state space and the action space are relatively small. However, in high-dimensional state and action spaces, such as the described UAV model, Q-learning fails to converge. Therefore, action-value function estimation algorithms are used to solve such problems with high dimensions.
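The tabular Q-learning update and the $\epsilon$-greedy rule described above can be sketched in a few lines. This is a generic illustration and not the paper's implementation: the states and actions are placeholder labels, and `lr` stands for the learning rate.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(Q, s, a, r, s_next, actions, lr, gamma):
    """Apply one Q-learning update and return the new value of Q(s, a)."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += lr * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

def decay(epsilon, rate=0.995, floor=0.01):
    """Shrink epsilon over time to move from exploration to exploitation."""
    return max(floor, epsilon * rate)

# A table that returns 0.0 for unseen state-action pairs.
Q = defaultdict(float)
ACTIONS = ["north", "south", "east", "west", "hovering"]
```

The table `Q` grows with every visited state-action pair, which is exactly what becomes impractical when the state contains the AoI of every device, the UAV positions, and the traffic predictions.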

C. The DQN solution
To overcome the curse of dimensionality of the state and action spaces, the deep Q-network (DQN) was proposed in [47]. The DQN utilizes an artificial neural network (ANN) to estimate the action-value function $Q(s, a | \theta_1)$, where $\theta_1$ is a vector containing the weights of the ANN trained to estimate the action-value function. This ANN is called the estimate network. Therefore, the action-value function is estimated by optimizing the weights that minimize the loss function
$$L(\theta_1) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a' | \theta_1) - Q(s, a | \theta_1) \right)^2 \right].$$
The DQN introduces the experience replay and the fixed Q-targets techniques. The experience replay saves the tuple ⟨s, a, r, p⟩ in a memory called the replay memory. Then, a mini-batch is sampled from this memory to be used in the training of the estimate network. The fixed Q-targets technique utilizes a second ANN called the target network, whose weights $\theta_2$ are updated every $O$ time instants and are used to compute the targets for the estimate network. Hence, the loss function is now formulated as
$$L(\theta_1) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a' | \theta_2) - Q(s, a | \theta_1) \right)^2 \right], \tag{41}$$
and the weights are optimized using stochastic gradient descent methods. Thus, the weights are updated as follows
$$\theta_1 \leftarrow \theta_1 - \alpha \nabla_{\theta_1} L(\theta_1), \tag{42}$$
where $\nabla_{\theta_1}$ is the gradient with respect to $\theta_1$.
A second DQN is used to optimize the reward function. The weights for the new estimate and target networks are $\theta_3$ and $\theta_4$, respectively. The reward function of the new network depends on the reward function of the initially trained network, and the reward function of the initial network depends on the values of $\zeta_1$ and $\zeta_2$ optimized by the second DQN. Therefore, the problem is solved iteratively between both networks. Note that this approach finds the optimized priority factors that jointly minimize the AoI, power consumption, and regret; however, these factors could change depending on the application's priority. Algorithm 1 summarizes the proposed initial DQN, where the forward algorithm is used as the traffic predictor, whereas Algorithm 2 describes the proposed initial DQN, where an LSTM$^3$ architecture is used as the traffic predictor. Finally, Algorithm 3 presents the reward function optimization$^4$.
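The two DQN techniques, experience replay and fixed Q-targets, can be illustrated without a deep learning framework. In the sketch below, `q_estimate` and `q_target` are plain callables standing in for the estimate and target networks, and each stored transition keeps the next state instead of the transition probability; both choices are assumptions made to keep the example self-contained.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of past transitions (oldest entries are dropped)."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size, rng):
        """Draw a mini-batch uniformly at random without replacement."""
        return rng.sample(list(self.buffer), batch_size)

def dqn_loss(batch, q_estimate, q_target, actions, gamma):
    """Mean squared TD error with targets from the (frozen) target network."""
    total = 0.0
    for s, a, r, s_next in batch:
        target = r + gamma * max(q_target(s_next, a2) for a2 in actions)
        total += (target - q_estimate(s, a)) ** 2
    return total / len(batch)

def should_sync(step, period):
    """True when the target weights should be copied from the estimate net."""
    return step % period == 0
```

In a full implementation the loss would be minimized with a stochastic gradient step on the estimate network, and `should_sync(step, O)` would trigger copying its weights into the target network every $O$ time instants.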

V. KEY PERFORMANCE INDICATORS
In this section, we introduce the key performance indicators (KPIs). First, we present the KPIs related to the traffic estimation stage. Then, we discuss the KPIs of the DQNs in the UAV learning stage.
A. Traffic Prediction KPIs
1) Mean square error: The forward algorithm recursively estimates the probability Ỹ_d that device d is active. One way to evaluate this estimate is to compare it with the true activation probability Y_d using the mean square error (MSE).
2) Training and validation losses of LSTM: The loss quantifies the prediction error of a machine learning model. A high loss indicates that the model generates erroneous results, whereas a low loss indicates that the model works well with few errors. The MSE loss function is the best-known loss function in time-series regression-type problems [49]. The LSTM uses a sequence of data from the past (training data) to fit the weights of the gates.
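As a small illustration of the first KPI, the sketch below computes the MSE between true and estimated activation probabilities. It assumes the error is averaged over the devices at a single time instant; the function name and this averaging convention are assumptions, not taken from the original formulation.

```python
def mse(true_probs, est_probs):
    """Mean square error between true and estimated activation probabilities.

    Assumption: one probability per device, averaged over all devices.
    """
    if len(true_probs) != len(est_probs):
        raise ValueError("both lists must contain one value per device")
    squared = [(y - y_hat) ** 2 for y, y_hat in zip(true_probs, est_probs)]
    return sum(squared) / len(squared)
```

For example, two devices with true probabilities 1.0 and 0.0 that are both estimated as 0.5 give an MSE of 0.25.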
Algorithm 1: The proposed DRL algorithm with the FA as the traffic predictor (steps 1 to 6 and 13 to 15).
1 Define the number of devices D and their coordinates l_d.
2 Define ϵ, γ, α, and O.
3 Estimate the hidden states using (25).
4 Calculate the device's activation probabilities using (26).
5 Utilize the estimated probabilities from (26) in the state space.
6 Define the reward function in (36).
13 Save ⟨s(t), a(t), r(t), p(t)⟩ in the replay buffer.
14 Sample a mini-batch from the buffer.
15 Update θ_1 and θ_2 every O instants using (42).

Algorithm 2 (steps 3 to 6 and 13 to 15).
3 Generate an activation sequence W(t) for each device at each instant t.
4 Use an LSTM to predict the future activation using the past sequence, as illustrated in Section III-B.
5 Utilize the predicted activity for each device w_D(t) in the state space, as illustrated in Section IV-A1.
6 Define the reward function in (36).
13 Save ⟨s(t), a(t), r(t), p(t)⟩ in the replay buffer.
14 Sample a mini-batch from the buffer.
15 Update θ_1 and θ_2 every O instants using (42).

Algorithm 3 (steps 6 to 11).
6 Choose a random value for ζ_1 and ζ_2 (action a) with probability ϵ or select the greedy action a = arg max_a Q(s(t), a) with probability 1 − ϵ.
7 Train the DQN in Algorithm 1 or 2 to calculate the reward function.
8 Save ⟨s(t), a(t), r(t), p(t)⟩ in the replay buffer.
9 Sample a mini-batch from the buffer.
10 Update θ_3 and θ_4 every O instants using (42).
11 end

The fitted weights are then used with new known data (validation data) to test how well they fit data not seen during training. Therefore, the validation loss measures how well the model is expected to perform on future test data.
3) LSTM classification metrics:
• The confusion matrix presents the correct and wrong classifications of each class in matrix form.
• The precision (P) is the ratio between the correctly predicted samples of a class and the total predicted samples of that class, whereas the recall (R) is the ratio between the correctly predicted samples of a class and the total actual samples of that class. In addition, the overall accuracy (acc) is the ratio between the correctly classified samples of both classes and the total samples.
• The f1-score (f1s) is calculated as follows: f1s = 2PR / (P + R).
B. UAV Learning KPIs
1) Immediate and accumulative reward: In DRL models, an immediate reward that increases over the episodes is an important indication that the model learns. A decreasing immediate reward indicates that the learning scheme of the model is poor, whereas a fixed immediate reward over the episodes indicates the convergence of the DRL model and the possibility to terminate the training. The accumulative reward is an important evaluation metric to compare the DRL model with a baseline model such as the RW. In addition, it is an indicative KPI to compare multiple DRL models with different hyperparameters, such as the learning rate, replay buffer size, and exploration rate, among others.
2) Ergodic age: The average age Ā(t) at time instant t is the average of the individual ages of all devices. The ergodic age is the mean of the average age over time.
3) Ergodic transmission power: The average power P(t) at time instant t is the average of the individual powers of all devices. The ergodic power is the time average of the accumulative power.
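The two ergodic KPIs share the same structure: average over devices at each instant, then average over time. The sketch below assumes the measurements are stored as a list of per-instant lists (values[t][d]) and that the ergodic power is the accumulated per-instant average divided by the number of instants; the function names and data layout are illustrative assumptions.

```python
def average_per_instant(values):
    """Average over devices at each time instant; values[t][d] is device d at instant t."""
    return [sum(row) / len(row) for row in values]

def ergodic(values):
    """Time average of the per-instant device averages."""
    per_instant = average_per_instant(values)
    return sum(per_instant) / len(per_instant)

# Example: ages of three devices over two time instants.
ages = [[1, 2, 3], [2, 3, 4]]
ergodic_age = ergodic(ages)  # per-instant averages are 2.0 and 3.0
```

The same function applies to the transmission powers by passing powers[t][d] instead of ages.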

VI. SIMULATION RESULTS AND DISCUSSION
In this section, we present the numerical results of the proposed DRL algorithm. First, we discuss the results of the traffic estimation via both the FA and the LSTM. Afterward, we exploit the proposed DQN to jointly optimize the AoI, regret, and transmission power. Finally, we present the results of optimizing the reward function for different network setups.

We consider a grid world of 11 × 11 cells, where each cell is a square with a side length of 100 m. The simulation parameters are defined in Table I. We train the proposed algorithm using the PyTorch framework on a single NVIDIA Tesla V100 GPU and 20 GB of RAM. The RW scheme stands for random movement of the UAVs and a random scheduling policy. The genie-aided scheme refers to the proposed DRL assuming perfect knowledge of the active and silent devices in the network. The term FA-DRL is used to describe the proposed DRL scheme with the FA as the traffic predictor, whereas the term LSTM-DRL is used to describe the proposed DRL with the LSTM as the traffic predictor.
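For reference, the RW baseline can be sketched in a few lines. The 11 × 11 grid is taken from the setup above; the movement set (hover plus four neighboring cells), the single device scheduled per instant, and all names are assumptions for illustration, since the exact action space is defined elsewhere in the paper.

```python
import random

GRID = 11  # 11 x 11 cells, as in the simulation setup
# Assumed movement set: hover, up, down, right, left.
MOVES = [(0, 0), (0, 1), (0, -1), (1, 0), (-1, 0)]

def rw_step(pos, num_devices, rng):
    """One RW step: random move (clipped to the grid) and random scheduling."""
    dx, dy = rng.choice(MOVES)
    x = min(max(pos[0] + dx, 0), GRID - 1)
    y = min(max(pos[1] + dy, 0), GRID - 1)
    scheduled = rng.randrange(num_devices)  # index of the scheduled device
    return (x, y), scheduled

def rw_rollout(start, num_devices, steps, seed=0):
    """Trajectory and scheduling decisions of one UAV under the RW scheme."""
    rng = random.Random(seed)
    pos, path, schedule = start, [start], []
    for _ in range(steps):
        pos, dev = rw_step(pos, num_devices, rng)
        path.append(pos)
        schedule.append(dev)
    return path, schedule
```

A call such as rw_rollout((5, 5), 10, 100) yields the visited cells and the scheduled device indices for one UAV over 100 instants.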
Fig. 5a depicts the MSE between the activation probability estimated by the FA and the true activation probability in two different network setups, namely, D = 7 and D = 10. In the beginning, the error is high, as the observations are not enough for the FA to estimate the hidden states and the future activation as discussed in (26). Then, after 12 time slots, the MSE decreases to less than 0.5% and starts to converge. Moreover, the average loss function is plotted versus epochs when using the LSTM to forecast the activation of the devices. In the beginning, the loss is higher, as the weights of the LSTM architecture are randomly chosen. Afterward, the loss decreases as the weights are optimized using the accumulated device activation patterns. Convergence occurs after about 20 epochs, at which point the training could be stopped.

Fig. 5b shows the trajectory optimization result of the proposed DRL algorithm. For illustration, we present the trajectory optimization using the LSTM as the traffic predictor. The devices at the bottom of the grid world are set to have a higher activation probability by increasing the values of ϵ_1 and decreasing the values of ϵ_0. This ensures a higher activation probability of the events that affect these devices. In addition, the values of p_dk for the devices at the bottom of the network are higher than those of the other devices, forcing those devices to be active as long as possible. As shown in Fig. 5b, both UAVs tend to spend more time navigating near the bottom of the map. This indicates that the proposed learning scheme captures that those devices are active most of the time. This trajectory is the optimized path that jointly minimizes age, regret, and transmission power.

Table II compares the FA and the LSTM as traffic predictors. The FA relies on the model parameters and prior observations to predict the probability of a device being active in future instants. On the other hand, the LSTM is model-free; it relies only on the previous observations to predict the activation of the devices. Therefore, both predictors require a long sequence of previous activations to work efficiently. As the FA works recursively over all the previous time instants, including very long sequences to estimate the probability becomes cumbersome. This is not the case with the LSTM, which uses very long sequences efficiently to produce the actual activation pattern, thanks to the forget gate. Table II shows that the LSTM is more complex than the FA regarding time and memory consumption. However, if the BS uses the same long sequence for the FA as for the LSTM, the FA becomes more complex than the LSTM regarding training time and memory consumption.

We evaluate the FA performance using the MSE of the activation probabilities. The FA has an average MSE of 0.0016 for all devices. The LSTM returns the actual activation pattern (binary). Therefore, we evaluate the LSTM performance using the confusion matrix, where, on average, it correctly predicts 48 active instants and 46 silent instants from a total of 100 time instants. This means that the LSTM predicts a device to be silent while it is active two times, and it predicts a device to be active while it is silent four times. For the active instants, it has a precision of 92%, a recall of 96%, and an f1-score of 94%. Meanwhile, the silent instants have a precision of 96%, a recall of 92%, and an f1-score of 94%. The LSTM has an overall accuracy of 94%, which is quite high.
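The reported classification metrics follow directly from the confusion-matrix counts given above (48 correctly predicted active instants, 46 correctly predicted silent instants, 4 silent instants predicted as active, and 2 active instants predicted as silent). The short check below recomputes them with the standard definitions; the function and variable names are illustrative.

```python
def class_metrics(true_pos, false_pos, false_neg):
    """Precision, recall, and f1-score of one class from its confusion counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts taken from the text, with "active" as the positive class.
tp, tn, fp, fn = 48, 46, 4, 2

active = class_metrics(tp, fp, fn)   # precision 48/52, recall 48/50
silent = class_metrics(tn, fn, fp)   # precision 46/48, recall 46/50
accuracy = (tp + tn) / (tp + tn + fp + fn)
```

Rounded to whole percentages, these values reproduce the 92%, 96%, and 94% reported for the active class, the 96%, 92%, and 94% reported for the silent class, and the overall accuracy of 94%.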
Table III shows the immediate reward over episodes for the proposed DRL approach. The reward improves as more episodes are trained, which confirms that the DQN is learning over time. Fig. 7 shows the performance of the proposed algorithm using different reward functions, namely, different values for ζ_1 and ζ_2. Using ζ_1 = 0 and ζ_2 = 0 gives the best performance concerning AoI. On the other hand, the accumulative regret and the accumulative power increase, as they carry no weight in the reward function. In addition, utilizing the LSTM as the traffic predictor leads to a lower AoI when compared to the FA. Using ζ_1 = 100 and ζ_2 = 0 increases the weight of the regret in the reward function, which results in the best accumulative regret using both proposed traffic predictors, whereas the average age and the accumulative power increase. Setting ζ_1 = 0 and ζ_2 = 1000 reduces the power consumption at the cost of worse AoI and regret, where the LSTM and the FA traffic predictors give almost the same accumulative power results.

In Fig. 6, the average AoI, the accumulative regret, and the accumulative transmission power are plotted over time for a network of D = 10 devices served by two UAVs. The optimized values of ζ_1 and ζ_2 are 25 and 500, respectively. We can notice in Fig. 6a that the LSTM-DRL performs better concerning AoI than the FA-DRL, and both outperform the RW baseline scheme. In Fig. 6b, the performance of the FA-DRL is worse than that of the genie-aided scheme and the LSTM-DRL due to the uncertainty of the traffic prediction that relies on the estimated probabilities. The accumulative power of the proposed FA-DRL and LSTM-DRL is lower than that of the RW scheme and almost matches the power of the genie-aided scheme, as shown in Fig. 6c. Despite being more complex than the FA, the LSTM achieves better performance in terms of AoI, regret, and power consumption. Herein, we use the FA as a baseline model to be compared to the LSTM and show the trade-off between complexity and performance efficiency.

Fig. 7 depicts the algorithm's ergodic age and ergodic power while sweeping the values of the power factor ζ_2 on the x-axis. From Fig. 7a, we can notice that assigning (0, 0) to (ζ_1, ζ_2) minimizes the ergodic AoI for all the DRL schemes. Assigning large values such as (100, 1000) to (ζ_1, ζ_2), and asymptotically up to (∞, ∞), renders a high AoI. On the other hand, Fig. 7b shows the effect of different values of (ζ_1, ζ_2) on the ergodic power. We observe that high values of ζ_2 minimize the transmission power and vice versa. Overall, the LSTM traffic predictor outperforms the FA traffic predictor regarding ergodic age and ergodic power consumption over time.
Note that both Fig. 7a and Fig. 7b indicate that the average AoI is inversely related to the transmit power. This is because if the devices are able to transmit with high power, the signal can travel longer distances. Therefore, the UAVs receive the data without the need to move closer to the devices, and updates arrive more frequently, which improves the AoI.
It is worth mentioning that, during training, the UAVs share their states with the central unit (BS), which sends back the actions and the rewards for each UAV. On the other hand, during testing, each UAV stores the look-up table of the converged state-action pairs, and the UAVs only exchange their current states with each other, which avoids high signaling overhead.

VII. CONCLUSION
In this paper, we proposed a novel framework to jointly minimize the average AoI, regret, and transmission power of the IoT devices by optimizing the trajectory of multiple UAVs and their scheduling policy. First, in the traffic estimation stage, the BS predicts the traffic of the IoT devices using a classical approach (FA) and a deep learning approach (LSTM). Then, we propose a DQN solution and select the optimum reward function in the UAV learning stage by optimizing the importance weights of the regret and the transmission power, i.e., ζ_1 and ζ_2, respectively. Finally, the optimal policy regarding the trajectory of the UAVs and their scheduling is obtained. The simulation results show that the LSTM outperforms the FA in predicting the traffic of the devices to be used in the UAV learning stage. However, the LSTM has higher time and space complexity demands than the FA. Furthermore, the BS stage chooses the best reward function for the UAV learning stage. In the UAV learning stage, the proposed DQN approach shows better results regarding the AoI, regret, and transmission power than the baseline RW scheme.
Note that, in our endeavour, we considered only a static deployment of devices to illustrate the idea. To solve the same problem in a dynamic environment, meta-learning could be applied, where the experience of solving a specific setup could be reused to facilitate the learning process when devices move around. Finally, we would like to note that increasing the number of UAVs and devices remains an open research question for future investigation using distributed learning approaches.

Fig. 1: The system model: IoT devices are served by multiple UAVs that relay the information to the BS located at the center of the grid world.

Fig. 2: The activation of D devices is modeled as a Markovian arrival of K binary events. If an event k is active, it influences a device d with an activation probability of p_dk.

Fig. 3: The stages of the proposed algorithm.

Fig. 4: Flow chart of the proposed algorithm. First, the BS estimates the traffic using the LSTM or the FA in the traffic estimation stage. Then, the optimal policy of the UAVs is optimized in the UAV learning stage.

Choosing the values for ζ_1 and ζ_2 in the reward function controls the resulting AoI, regret, and transmission power. Therefore, the BS needs to find the values for ζ_1 and ζ_2 that jointly minimize the AoI, regret, and transmission power for a given setup. We propose a DQN architecture to estimate the optimal reward function based on the best values for ζ_1 and ζ_2 for a given setup. For the new DQN, the states are the number of devices in the network, and the action space consists of the chosen values for ζ_1 and ζ_2. The reward function for the new DQN is simply the negative of the product of the average AoI, the accumulative regret, and the average transmission power calculated from the first trained DQN within an episode of duration T:
r_DQN = −Ā(T) R(T) P(T). (43)
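The reward of the second DQN and the search over (ζ_1, ζ_2) can be sketched as follows. This is a simplified stand-in, not the authors' method: because the number of devices is fixed for a given setup, the sketch reduces the second DQN to a single-state ε-greedy value table, and toy_first_dqn is a hypothetical placeholder for training Algorithm 1 or 2 with the chosen weights. Its formulas, the candidate grid, and all names are assumptions.

```python
import random

def reward_dqn(avg_aoi, acc_regret, avg_power):
    """Reward of the second DQN: negative product of the three KPIs."""
    return -(avg_aoi * acc_regret * avg_power)

def toy_first_dqn(z1, z2):
    """Hypothetical stand-in for training the first DQN with weights (z1, z2)."""
    avg_aoi = 2.0 + 0.01 * z1 + 0.002 * z2   # made-up trend: AoI grows with both weights
    acc_regret = 10.0 / (1.0 + 0.1 * z1)     # made-up trend: regret falls with z1
    avg_power = 1.0 / (1.0 + 0.01 * z2)      # made-up trend: power falls with z2
    return avg_aoi, acc_regret, avg_power

def optimize_weights(candidates, episodes=300, epsilon=0.2, alpha=0.5, seed=0):
    """Single-state epsilon-greedy search over candidate (z1, z2) pairs."""
    rng = random.Random(seed)
    q = {c: 0.0 for c in candidates}          # action values, one per candidate
    for _ in range(episodes):
        if rng.random() < epsilon:
            action = rng.choice(candidates)   # explore
        else:
            action = max(candidates, key=lambda c: q[c])  # exploit
        r = reward_dqn(*toy_first_dqn(*action))
        q[action] += alpha * (r - q[action])  # move the estimate toward the reward
    return max(candidates, key=lambda c: q[c])

candidates = [(z1, z2) for z1 in (0, 25, 100) for z2 in (0, 500, 1000)]
best = optimize_weights(candidates)
```

With the made-up trends above, the search settles on the largest weights; with KPIs measured from an actual trained DQN the outcome depends on the setup.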

Algorithm 1 (steps 7 to 12).
7 Train the DQN in Algorithm 3 to optimize ζ_1 and ζ_2.
8 Define the number of episodes E.
9 Initialize t = 1.
10 for e = 1, ..., E do
11 while ∆_u(t) > 0 do
12 Choose a random action a with probability ϵ or select the greedy action a = arg max_a Q(s(t), a) with probability 1 − ϵ.

Algorithm 2: The proposed DRL algorithm with LSTM as the traffic predictor (steps 1, 2, 7 to 12, and the closing steps).
1 Define the number of devices D and their coordinates l_d.
2 Define ϵ, γ, α, and O.
7 Train the DQN in Algorithm 3 to optimize ζ_1 and ζ_2.
8 Define the number of episodes E.
9 Initialize t = 1.
10 for e = 1, ..., E do
11 while ∆_u(t) > 0 do
12 Choose a random action a with probability ϵ or select the greedy action a = arg max_a Q(s(t), a) with probability 1 − ϵ.
end
18 end

Herein, the training loss measures how well the model fits the training data, whereas the estimated weights are then evaluated on validation data.

Algorithm 3: The reward function optimization (steps 1 to 6).
1 Define ϵ, γ, α, and O.
2 Define the reward function in (43).
3 Initialize the replay buffer.
4 Define the number of episodes E.
5 for e = 1, ..., E do
6 Choose a random value for ζ_1 and ζ_2 (action a)

Fig. 5: (a) MSE of the FA activation probability prediction and training and validation losses of LSTM. (b) Trajectory path of 2-UAVs serving a network of D = 10 devices using the LSTM as the traffic predictor. The values for ζ_1 and ζ_2 are 25 and 500, respectively. The lower devices have a higher activation probability than the rest of the devices.

Fig. 6: The average age, accumulative regret, and accumulative power of a network of D = 10 devices served by two UAVs. The values for ζ_1 and ζ_2 are 25 and 500, respectively.

TABLE I: The UAV and DQN model parameters.

TABLE II: A comparison of the performance of the FA and the LSTM as traffic predictors.

TABLE III: The reward function of training 2-UAVs serving a network of D = 10 devices using the LSTM as the traffic predictor. The values for ζ_1 and ζ_2 are 25 and 500, respectively.