Multi-UAV IRS-Assisted Communications: Multinode Channel Modeling and Fair Sum-Rate Optimization via Deep Reinforcement Learning

Unmanned aerial vehicles (UAVs) combined with intelligent reflective surfaces (IRSs) represent a cutting-edge technology for improving the channel capacity of wireless communications, by capitalizing on UAVs’ 3-D mobility coupled with the IRSs’ smart radio capabilities. This work envisions a scenario in which a swarm of UAVs equipped with IRSs serves multiple Internet of Things (IoT) ground nodes (GNs) concurrently transmitting to a single base station (BS) via OFDMA. The huge number of passive elements composing the IRSs introduces a significant complexity in the mission design. Therefore, each IRS is divided into patches that can be simultaneously used to serve different nodes. Considering general Rician fading, a comprehensive channel model for IRS-assisted UAV-aided networks is derived. Then, a multiobjective mixed-integer nonlinear programming problem is conceived to maximize the sum-rate of the GNs and, at the same time, minimize the difference among the users’ data rates, by jointly optimizing the trajectories and the phase shift matrices. This nonconvex problem, reformulated in terms of scheduling (i.e., patch-GN assignment), is challenging to solve. Hence, it is rearranged as a Markov Decision Process and a quasi-optimal solution is obtained via Deep Reinforcement Learning. Extensive simulation analysis is performed to validate the results and the accuracy of the proposed model.


I. INTRODUCTION
I NTERNET of Drones [1] is a disruptive paradigm which is expected to play a key role in upcoming 6G communications.The inherent versatility of unmanned aerial vehicles (UAVs) enables a vast plethora of applications [2], such as sensing, delivery, surveying, and patrolling.Information exchange with ground infrastructures often plays a central role in these use cases.Moreover, UAVs can act as flying base station (FBS) or relays, thus enhancing the communication capabilities of Internet of Things (IoT) networks [3], [4].Indeed, IoT nodes are usually capability-constrained devices which can greatly benefit from the aid provided by drones' versatility [5], [6], [7].Another cornerstone of future cellular communications are intelligent reflective surfaces (IRSs), also named reconfigurable intelligent surfaces (RISs).These metasurfaces are composed by a matrix of elements, namely, passive reflective units (PRUs), and have demonstrated capabilities of controlling the radio environment.IRS-assisted wireless systems grant significant performance improvement, enhancing the channel quality perceived by communicating nodes.In particular, IRSs are able to shift incident electromagnetic waves by a programmable phase, thus yielding beamforming.This unique property is at the basis of the emerging concept of Smart Radio Environments [8].
Early literature has mainly considered to deploy IRSs on building facades, so as to reflect incoming signals toward multiple users.Recently, the possibility to equip UAVs with IRSs is attracting interest.Drones mobility adds more degrees of freedom that can be exploited to further improve the channel quality, with great advantage for IoT applications [9], [10].Indeed, the high mobility of drones yields a better Line of Sight (LoS) link and a lower pathloss, due to the possibility to adjust the IRS location.Moreover, IRSs are characterized by a limited Size, Weight, and Power consumption (SWaP) with respect to common phased-array antennas [11], which allows to prolong the mission duration, while still providing broad coverage.However, this comes with new challenges [12], [13] and in particular with the necessity of an accurate, general-purpose, and flexible channel model.Indeed, the presence of a base station (BS) with multiple IRSs and ground nodes (GNs), leads to interference patterns that need to be taken into account in case of system design and assessment.
This motivated recent studies that started to investigate the achievable performance of IRS-assisted UAV-aided systems.In [14] a nonconvex optimization problem is formulated for the joint design of UAV trajectory, IRS's phase shifts, scheduling, and resource allocation.The system employs orthogonal frequency multiple access (OFDMA), which introduces frequency and spatial selectivity in the fading of the resulting channel.However, as the majority of the present literature, it considers only the LoS component and hence a deterministic channel which indeed is not able to capture the full characteristics of an actual wireless channel.In this regard, the literature on IRS has investigated different directions to obtain more realistic channel models.In [15] a formulation is proposed for IRS-aided wireless networks over Rician fading channels, which enables closed-form approximations of outage probability, average symbol error probability, and channel capacity.However, in this contribution the presence of a direct, yet weak, transmitter-receiver link is not considered.To fill this gap, [16] introduces a Rician model to account for the LoS, which yields a closed-form upper bound for the ergodic capacity, and a tight approximation for the outage probability.However, the resulting distribution cannot be recast as a known one, which limits the practical tractability.To partially circumvent this issue, simplified expressions are provided in [15] and [16], but only for the asymptotic regime.
More tractable closed-form expressions considering Rician channels have been recently derived in [17], and evaluated in terms of outage and symbol error probability.However, the signal components are assumed to add coherently, which is not always strictly verified.In fact, there could be scenarios in which the same IRS has to serve multiple GN, thus introducing interference among surface elements.Besides, the presence of a direct, though weak, link is not considered, which would instead require to consider the additional interference between PRUs and GN.In order to account for the channel stochasticity, where interference between PRUs, BS, and GNs is present, while considering the presence of a (possibly weak) GN-BS link, [18] proposes a channel gain approximation for IRSassisted UAV-aided OFDMA communications.Although this contribution acknowledges several aspects of the communications, it does not explicitly model a multiuser multidrone scenario.Moreover, the model introduces a huge number of degrees of freedom, one for each element of the IRS, which guarantees a great customization but at the same brings a tremendous complexity in terms of beam design.On the other hand, in [19] the number of degrees of freedom of the surface is drastically reduced by imposing a constraint on the phase shifts, which results to be extremely advantageous in terms of complexity but limits the surface to reflect the signal toward a single direction per time.Besides, [19] treats only a piece of the channel model, in a deterministic way, considering the sole path loss and not the stochastic term due to fading.
In light of the above, the major contributions given by this work are listed below.To achieve this goal a suitable objective function is defined.It includes a fairness factor that determines the importance of a uniform distribution of the resources.The problem is then reformulated in terms of patch scheduling, i.e., patch-GN assignment.To overcome the intractability of the nonconvex cosine patterns related to wave interference, the problem is finally rearranged as a Markov Decision Process (MDP) and solved via a deep reinforcement learning (DRL) method i.e., a proximal policy optimization (PPO)-based approach.3) An extensive simulation campaign is carried out to assess the validity of this work.First, the theoretical findings are corroborated by a thorough analysis which demonstrates, employing Monte Carlo (MC) simulations, the accuracy of the proposed channel model.Then, the solutions obtained by the DRL algorithm are analyzed for different numbers of patches and fairness levels.To further prove the effectiveness of the proposal, the obtained solution is compared with the special case of absence of the swarm and with a baseline approach.The results of this work clearly indicate that: 1) the presence of IRS-equipped drones strongly improves the channel capacity of the GNs and 2) the proposed solution outperforms the baseline in terms of total and average transmitted data.
The remainder of the present contribution is as follows.Section II describes the adopted system model.Section III presents the proposed channel model.Section IV describes the problem formulation and the proposed solution.Section V analyzes accuracy of the model and investigates the obtained numerical results.Finally, Section VI concludes the work and draws future research perspectives.Notations adopted in this work are hereby described.Boldface lower and capital case letters refer to vectors and matrices, respectively; j = √ −1 is the imaginary unit; atan2(x) denotes the four-quadrant arctangent of a real number x; x T is the transpose of a generic vector x; x ⊗ y denotes  the Kronecker product between two generic vectors; diag(x) represents a diagonal matrix whose diagonal is given by a vector x; arg (x) returns the phase of a complex number x; and x ∼ CN (μ, σ 2 ) define a circularly symmetric complex Gaussian distribution x with mean μ and variance σ 2 .For clarity, the adopted notations of this article are summarized in Table I.

II. SYSTEM MODEL
The envisioned scenario, depicted in Fig. 1, considers a swarm of U IRS-equipped UAVs is in charge of optimally reflecting the incident signal coming from a set of G GNs to enhance the channel quality at a single BS, located at a known position q BS = [x B y B z B ] T ∈ R 3 .Both GNs and the BS are equipped with a single-antenna communication apparatus.The entire mission lasts T seconds, split into K timeslots of δ t seconds each.Therefore, the UAVs fly following trajectories discretized into and 2) reflects the incident signal through a complex factor k,u,n,m = a k,u,n,m e jφ k,u,n,m , where a k,u,n,m is the amplitude and φ k,u,n,m ∈ [−π, π) is the phase shift.The IRSs are assumed to be controlled by the BS through a dedicated control channel.PRUs of the uth IRS are grouped into P u patches of . ., G, leading to the definition of the distance d BG g = q BS − q G g from the BS, which does not change over time. 1 On the opposite, UAV-GN and BS-UAV distances depend on kth timeslot since they are a function of the time-varying UAV positions.In particular, in far-field, the distance between each surface element of the uth drone and the gth GN can be approximated as follows [19]: where d RG k,g,u is the distance from the center of the IRS, while θ RG k,g,u and ϕ RG k,g,u denote the vertical and horizontal Angles of Arrival (AoA) of the signal.In particular, Similarly, the distance between the BS and each PRU of the uth drone is where d BR k,u is defined as the distance from the center of the IRS, while θ BR k and ϕ BR k are the vertical and horizontal Angles of Departure (AoD), so that

III. PROPOSED CHANNEL MODELING
The whole communication system employs OFDMA which allows to mitigate the interference among the network entities.As a consequence, the whole available bandwidth B is split into I subcarriers of δ F Hz each that can be assigned to GNs to transmit.Moreover, the power radiation pattern functions, including antenna gains, which characterize the antennas of GNs, BS, and IRSs are denoted by F GN (θ, ϕ), ∀g, F BS (θ, ϕ), and F IRS (θ, ϕ), respectively.As known, these functions define how the transmitted/received power at each antenna varies along a certain direction in inclination θ and azimuth ϕ angles.
The channel gain between the gth GN and the BS, in kth timeslot and subcarrier i = 1, . . ., I is where β BG denotes the channel power gain at the reference distance of 1 m, α BG is the pathloss exponent, 2 ) denotes the channel coefficient representing the stochastic fluctuation due to multipath propagation and fading.It is worth noting that the channel envelope 2 ) and average power , since the presence of a (possibly weak) LoS path is taken into account-according to the propagation characteristics of the actual environment and the presence of obstacles, as depicted in Fig. 1.Therefore, the channel coefficient can be modeled as follows: In particular, h is the LoS component, characterized by a phase shift depending on the user's subcarrier index, f i = f C + iδ F , f c is the carrier frequency, and h BG k,i,g ∼ CN (0, 1) describes the small-scale fading resulting from Non LoS (NLoS) propagation.
The channel gain between the p u th patch of the uth UAV and the gth GN in timeslot k, subcarrier i, is obtained by collecting in a vector the corresponding E u,p u channel gains of the PRUs, i.e., 7) denotes the far-field array response defined in ( 8), shown at the bottom of the page.
Similarly, the channel gain between a certain patch p u of a UAV u and the BS is where [21]: with κ MIN,BR k,g,u and κ MAX,BR k,g,u the minimum and maximum possible K-factors, respectively.The same holds true for κ RG k,g,u and κ BG k,g,u .The IRS phase shift matrix k,u,p u ∈ C E u,pu ×E u,pu , for each timeslot k, UAV u, patch p u is defined as follows: Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
recalling that in general k,u,n,m = a k,u,n,m e jφ k,u,n,m .Since each patch p u is meant to coherently reflect the incident signal toward a certain location (to serve one of the G GNs), all its elements can share the same amplitude a k,u,n,m = a k,u,p u for all (n, m) belonging to patch p u , and their phases can be described in terms of two parameters only, denoted by φ X k,u,p u and φ Y k,u,p u .In general, it is possible to reduce the amount of degrees of freedom introduced by PRUs by imposing that By considering ( 5), (9), and ( 14), the composite channel gain G k,i,g,u,p u is given by ( 15), shown at the bottom of the page, where h BR k,i,g,u,p u ,n,m and h RG k,i,g,u,p u ,n,m are the components of the channel vectors h BR k,i,g,u,p u and h RG k,i,g,u,p u , respectively.Moreover, η k,g,u = and h BRG k,i,g,u,p u ,n,m describes the composite BS-PRU-GN channel coefficient.
The composite BS-PRU-GN channel coefficient involves the pairwise product of two complex Gaussian random variables (RVs), whose distribution, named complex double Gaussian, is given in terms of an infinite sum of modified Bessel functions [22].For better tractability, it is however possible to approximate such a product through a complex Gaussian [18], thus implying that the envelope |h BRG k,i,g,u,p u ,n,m | is a Rician RV.The following expression for the channel coefficient, for the generic (n, m)th element of the p u th patch, is obtained: where and h BRG ∼ CN (0, 1) since phase terms are irrelevant to the scatter component [14].
The expression derived in (15), which includes ( 16) and the subsequent definitions, requires further elaboration in order to obtain a more compact form, with a lower computational complexity.To this aim, the following result will be useful to compute the sum of the PRUs' channel coefficients belonging to the same patch p u .
The following theorem is one of the main contributions of this work.Indeed, it characterizes the overall gain G k,i,g , given by the sum of all signals reflected toward user g (by any patch from any UAV) plus the direct link.
Theorem 1: Let G k,i,g be the channel gain perceived by a GN g, in timeslot k and subcarrier i.The channel envelope 2 defined in ( 23) and ( 24), respectively.Proof: Leveraging the result of Lemma 1, since ( 15) is a sum of complex Gaussian distributions, and assuming that all the elements of the same patch p u have the same amplitude a k,u,p u , the channel coefficients sum of all the PRUs G k,i,g,u,p u ∼ CN (μ G k,i,g,u,pu , 2σ 2 G k,i,g,u,pu ) is described by the mean in (19), shown at the bottom of the previous page, and by the second moment Finally, the expression of the channel gain perceived by a certain GN g is Following the same rationale adopted in [18] it is possible to derive the K-factor κ k,i,g and average power k,i,g of the envelope of G k,i,g as follows: |μ G k,i,g | 2 the squared LoS component, defined in (23) at the bottom of the page, with ϕ k,i,g,u = arg (ω k,i,g,u ), ∀k, i, g, u and κ BG g = (κ BG g /κ BG g + 1).Finally, the NLoS component reads where κ BG g = (κ BG g + 1) −1 .Corollary 1: The proposed general channel model can be specialized in case of scattered, i.e., Rayleigh fading (κ BG g → 0), or even totally absent direct BS-GN link (λ g → 0).In both cases, the squared LoS component Based on the above results, it is possible to obtain expressions for the maximum achievable data rates, outage probability, and other inherent performance metrics.In particular, recalling Shannon's capacity formula, the channel capacity in timeslot k and subcarrier i is given by where ρ k,i,g is the transmit power of GNs in ith subcarrier and N 0 is the spectral noise power.Given a maximum achievable data rate R k,i,g , to guarantee a reliable communication, it is required that the outage probability remains below a threshold ε, such that Therefore, considering the maximum tolerable outage, i.e., ε, the previous equation can be rewritten as follows: It follows that the inverse expression, with respect to the second argument is: Since the inverse Marcum Q-function has no closed-form expression, to avoid numerical computation an approximated formula can be used as in [24, eq. (17)] Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
with K 0 the intersection of the subfunctions at 2κ k,i,g > Q the expression of the achievable data rate can be derived as follows: with SNR k,i,g denoting the SNR.
In the following, the scaling behavior of the SNR and other properties are analyzed.
Theorem 2: The SNR of the gth GN scales by a factor 2 .Proof: The maximum squared LoS component is obtained when the swarm is perfectly aligned with BS and GNs, which ∀k, i, g and ∀u ≥ u corresponds to thus allowing to recast it in the more compact notation Therefore, assuming that all the patches of the swarm perfectly reflect the incident signal toward the desired GN yields for any b, c ∈ Z, which is verified when Furthermore, recalling the definition of the average power in (22) and the NLoS component in (24), as {E u,p u } → +∞ the following inequalities hold true: the SNR derived in (26) scales as Corollary 2: Following the same rationale of Theorem 2, it is not difficult to show that in case of a single drone (U = 1), since the alignment condition ( 27) is inherently satisfied, the SNR measured at a specific GN illuminated by P ≤ P u scales as ζ ≥ ( P p u =1 E p ) 2 .Corollary 3: In the special case of no direct BS-GN link (λ g → 0), a single drone (U = 1) illuminating one GN (G = 1) with the entire IRS (P = 1), the channel follows a Rician distribution characterized by , where g, u, and p are omitted for brevity.Further, the corresponding SNR scales as ζ = E 2 .

IV. TRAJECTORY AND PHASE SHIFT MATRIX OPTIMIZATION
Leveraging the model derived in the previous section, an optimization problem aiming at maximizing the sum-rate of nodes and, at the same time, at fairly distributing the resources to the GNs is formulated and a resolution approach provided.

A. Problem Formulation
Let's assume that: 1) the UAVs fly at a constant altitude z U k,u = H ∀k, u; 2) the GNs have a dedicated subcarrier; and 3) the drones have the same number of patches, i.e, P u = P ∀u.Moreover, it is assumed that the UAVs has enough energy to fulfill the mission, since its duration is limited.The objective is to maximize the sum-rate k,i,g R k,i,g , with R k,i,g given in ( 26) and, at the same time, to minimize the sum of the absolute differences g>g | K k=1 I i=1 R k,i,g − R k,i,g | among these amounts, so as to provide an adequate level of fairness among users.This multiobjective problem requires the optimization of the trajectories Q = {q k,u , ∀k, u}, the speed profiles V = {v k,u , ∀k, u}, the acceleration profiles A = {a k,u , ∀k, u}, and the phase shift matrices X k = {φ X k,u,p , ∀u, p} and Y k = {φ Y k,u,p , ∀u, p}.Therefore, it can be formulated as follows: where is a penalty factor that determines the fairness of the solution.Constraints (32b) and (32c) encode the possible 2-D movement of the drones and (32a) describes the start point of the trajectory.Constraints (32d) and (32e) state Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
that the speed and acceleration norm is upper-bounded by v MAX and a MAX , respectively.Equation (32f) guarantees that, throughout the mission, a minimum security distance d MIN is maintained among UAVs.Since the GNs positions are known, it is possible to derive the optimal patch phase shift through (30).Therefore, the problem can be rearranged in terms of scheduling, i.e, patch-GN assignment, by means of a matrix S k = {S k,u,p ∀k, u} ∈ {1, . . ., G} U×P .Let's define an auxiliary function which takes as input the scheduling matrix and returns the optimal phases vectors.Therefore, the expression of the data rate can be rearranged as R k,i,g R k,i,g (•, ).The optimization problem is cast as follows: Problem ( 34) is an MINLP problem and hence very challenging to solve.Common techniques as block coordinate descendent (BCD) and successive convex approximation (SCA) cannot be applied in this case, since the channel model involves cosine interference patterns, which are not convexificable.Therefore, in this work a DRL approach is employed to achieve a quasi-optimal solution.

B. Background on Proximal Policy Optimization
DRL is a branch of machine learning which involves an agent interacting via actions with an environment.Among the vast plethora of available DRL algorithms, PPO [25] represents the most adopted solution due to the noticeable performance.The benefits brought by PPO approach are numerous.The most important ones are: 1) ease of implementation; 2) low complexity; 3) sample efficiency; and 4) few hyper-parameters are needed in the learning process.It ensures a smoother training, compared with other approaches, by constraining the new policy to not excessively differ from the previous one.This yields lower variance in the training process and prevents the agent from taking unrecoverable paths.In this work, it is adopted to solve problem (34) and hence achieve a quasi-optimal solution.PPO uses an actor-critic (AC) approach, which employs two Deep Neural Networks.The Actor, denoted as Q ϑ (s k , a k ), at each discrete time instant k, observes the current state s k .Based on that, it takes a certain action a k , according to the current policy π ϑ (s k |a k ), being ϑ the corresponding parameter set.Consequently, it receives a reward r k and moves to the next state s k+1 .The Critic, denoted as V τ (s k ), evaluates the action taken by the Actor and provides a rating.Based on this value, the Actor improves the current policy π ϑ , adopting a gradient descent algorithm, to take better, or to avoid worst, actions in the future.Concretely, the aim of this framework is to derive the optimal policy π * ϑ from the transition tuples s k , a k , r k , s k+1 that maximize the discounted cumulative sum of all future rewards.Hence, π ϑ (s k |a k ) maps each state to the probability p(s k+1 |s k , a k ) of taking action a k .
PPO offers several advantages [25] over the classical AC algorithm.Specifically, it incorporates a trust region optimization approach, ensuring stable policy updates and preventing catastrophic policy changes.The clipped surrogate objective further constrains the policy update range, enhancing robustness.Moreover, the multiple epochs of minibatch updates allow efficient utilization of collected data, thus improving learning effectiveness.Differently from the AC algorithm, PPO demonstrates scalability, sample efficiency, and compatibility with both continuous and discrete action spaces.It strikes a balance between exploration and exploitation, facilitating optimal policy learning.In contrast, classical AC algorithms often lack such robustness, require careful hyperparameter tuning, exhibit higher variance in updates, and may have limited exploration capabilities.
To provide more technical background, the rationale behind the Policy Gradient Methods, and hence of the PPO algorithm, are hereby discussed.
The algorithm compares the current and new policies to maximize the following objective function: being the probability ratio, ϑ the old policy parameters, and a small constant.Moreover, A π k (•, •) is the advantage function defined as follows: where m ∈ [0, 1] is a smoothing factor and A π k,l (s k , a k ) is the l -step look-ahead advantage function, with l = k, . . ., K. clip(•, •, •) denotes the clipping function [25] which bounds the first argument in range of the last two, such that the new policy is not allowed to go too far from the old.While, in case of positive advantage the objective reduces to where the first term limits to how much the objective can increase.Similarly, for negative advantage (38) Therefore, the clipping function provides a regularization of the objective function by preventing extreme changes, and the hyperparameter represents how far the new policy π ϑ (s k |a k ) can go from the old one π ϑ (s k |a k ).Finally, the two networks are trained using a stochastic gradient descent method, namely, Adam [26], over a mini-batch of D samples.For each one, the parameters of the policy π ϑ (s k |a k ) and the Actor network are updated as follows: while the Critic parameters as follows: For further details, the reader is referred to [25] and [27].

C. Proposed Solution
Problem (34) is reformulated as an MDP, which is a mathematical framework at the foundation of DRL.In this way, it is possible to embed the optimization problem into a PPO-based algorithm.
First, to increase training speed and prevent divergence, the scheduling matrix is normalized such that −1 ≤ S k,u,p ≤ 1, as follows: Similarly, also the acceleration profile vectors a k,u are normalized as a k,u = (a a,u /a MAX ).The reference MDP S, A, P, R, ς , is described as follows.1) State Space S: The set of all the possible states s k ∈ S that can be observed by the agent while interacting with the environment with 2) Action Space A: The set of all possible actions a k ∈ A that the agent can perform during a timestep where 3) Transition Probability P: The set of probabilities p k ∈ P which denote the transition from a state s k to s k+1 .4) Reward function R: The rewards r k ∈ R obtained by maximizing the objective function.In this case, it has been already defined in problem (34).5) Discount Factor ς : The discount factor that determines the importance of future rewards, where 0 ≤ ς ≤ 1.At the beginning, it is necessary to initialize the structures needed for the computation.In particular, the initial positions of the drones are set according to (32a).After the initialization phase, at each timeslot k, the agent observes from the environment the current positions of the drones Q k and the assignment of the patches S k .Then, the agent provides the acceleration profile A k and the new scheduling matrix S k in input to the algorithm that computes the new positions and speed profiles of the drones, according to (32b) and (32c), and the data rate of each GNs.The final output of the computation is the new state (observation), the generated reward, and a flag that notifies weather the mission has terminated or not.Clearly, whereas the mission ends before K, the cumulative reward would be lower; this mechanism is employed to satisfy the remaining constraints (32d)-(32f).

V. NUMERICAL RESULTS AND DISCUSSION
In this section, a simulation campaign is carried out to investigate different aspects of the proposed model.
In particular, the analysis first focuses on the validation of the channel model from a probabilistic standpoint by comparing the derived approximation with MC simulations.Then, the radiation pattern corresponding to different patch configurations is studied.Moreover, results are shown for the optimization problem defined in Section IV, to demonstrate the potential of the diversity introduced by the patches.Finally, the solution is compared with the case of absence of IRS-equipped drones and with a baseline approach implementing a random patch scheduling.
In all the simulations, all the involved antennas are isotropic, thus with unitary gains.All the considered IRS are composed by N × M = 48 × 48 elements.Moreover, according to [19], the cascaded channel BS-UAV-GN is characterized by β BRG = c 2 w/(64π 3 f 2 i ) and α BR = α RG = 2, while the direct weak BS-GN link by β BG = c 2 /(16π 2 f 2 i ) and α BG = 4. Besides, it is assumed , for all k, g, u.
The main considered parameters, unless otherwise specified, are summarized in Table II.

A. Channel Model Analysis
Without loss of generality, the first scenario considers G = 1 GN, located at q G 1 = [50 0 0] T , served by a BS at q BS = [ − 50 0 0] T .The communication is assisted by a drone, positioned at q 1,1 = [0 0 50] T , equipped with just P = 1 patch, i.e., the whole IRS.To provide a comprehensive analysis, four cases are considered: 1) absence of the direct BS-GN link (λ g = 0); 2) scattered (Rayleigh) BS-GN link (κ BG g = 0); 3) worst possible UAV-BS alignment (Worst case); and 4) best possible UAV-BS alignment (Best case).Moreover, to assess the performance of the investigated scenarios, define the normalized SNR ∀k, i, g, as follows: It is worth noting that the data rate derived in (26) is directly proportional to the normalized SNR when the latter is expressed in decibels (due to the presence of the logarithm), as in the graphs shown in this section.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
(a) (b) Fig. 2 represents the comparison between the proposed model and an MC simulation for 5 • 10 6 realizations.As a matter of fact, the proposed model provides an accurate approximation in terms of probability density function (PDF) and SNR curves.Moreover, it can be observed that, even if the BS-GN link exists, the UAV-BS alignment is crucial: in the worst-case scenario the channel quality results to be deteriorated with respect to the pure reflection case (λ g = 0).Fig. 3 corroborates the theoretical results obtained in Corollary 2. Indeed, a large number of PRUs implies a quadratic scale of the SNR.Clearly, in the four considered cases, the convergence depends on the involved system setup and environmental surrounding conditions.It is worth noting that a reduced path loss exponent α BG = 3 produces, in the Best case and for a low number of elements, a higher gain with respect to the pure reflection case (λ g = 0).When a higher path loss coefficient is considered, i.e, α BG = 4, this gap significantly reduces since the direct link brings a minor contribution to the SNR.
The second scenario involves the same UAV and IRS, but considers G = 2 GNs located at q G 1 = [ − 120 0 0] T and q G 2 = [70 0 0] T , connected to a BS at q BS = [ − 50 0 0] T with λ g = 0. Two cases are investigated.In the first one, the IRS surface is divided horizontally, while in the second one, vertically.Then, the phase shift matrices are set so that each user has its own patch assigned.From Fig. 4 it clearly emerges that the produced radiation patterns yield a different SNR perceived in the environment in the two cases.Indeed, the side of the patch characterized by more elements leads to a narrower beam in the orthogonal direction and viceversa.This phenomenon, which can be observed especially in horizontally divided case, is also influenced by the UAV-GN distance.Therefore, the shape design is particularly important since remarkably modifies the irradiated area, and has different implications based on the use case considered.For instance, if the objective is to maximize channel quality perceived by multiple GNs, the neighborhood of an illuminated user benefits from configurations that produces a larger fingerprint.At the same time, if the aim is to establish secure communications, patches must be designed on the opposite  criteria, thus hinder eavesdropping.An agnostic option is to split the IRS into square-shaped patches that can be directed toward specific GNs, thus implying a customizable radiation pattern which guarantees, however, a computation complexity much lower than the solution proposed in [18].

B. Results on Trajectory and Phase Shift Optimization
It is considered a scenario in which U = 3 drones, starting from q 1,u = [0 −150 50] T , ∀u, are in charge of fairly serving G = 20 nodes randomly deployed on the ground.The BS is located at q BS = [0 0 0] T .The Actor and Critic neural networks of the PPO algorithm are characterized by four hidden layers of 1024 neurons each, trained for 3 • 10 5 epochs.
First, it is considered the case in which each drone is equipped with an IRS split into P u = 9 ∀u patches, by adopting a low fairness factor = 1.25 • 10 −4 .Fig. 5 shows the trajectories of the drones during the mission, with heatmap representing the values of the normalized SNR considering only the BS-UAV-GN channel.It is visible that, during the mission, the drones serve different regions at each instant, with the aim of maximizing the sum-rate while providing adependent allocation to the users.The swarm flies, as result of the optimization, from the starting location toward the center of the area, where the BS is located.Consequently, the patches of the different UAVs are allocated to provide an optimal coverage to all users, thus improving their SNRs.Moreover, in Fig. 6, the speed and acceleration profiles, i.e., v k,u and a k,u , are shown for each drone over time.According to (32b) and (32c), the drones accelerate yielding an increase in the speed, without violating the bound constraints (32d) and (32e).This proves that the strategy adopted and discussed in Section IV-C allows the PPO-based algorithm to learn from tuples that satisfy the imposed limits.To give more insights about the exchanged data volume and to validate the effectiveness of the proposed algorithm in other configurations, the IRS are split into P u = 4 and P u = 9 patches.Moreover, two values for the fairness parameter are considered, i.e., 1.25 • 10 −4 and 5 • 10 −4 .The first value penalizes the objective function more, thus obtaining a lower fairness with respect to the second one, which implies a higher balance in the resource distribution.
In Fig. 7, it is depicted the cumulative amount of data transmitted from each GNs to the BS through the direct link plus the reflected one, for P u = 4.In case of a higher fairness factor, i.e., = 1.25 • 10 −4 , the total amount of received data is 5.37 Mbits with a coefficient of variation (ratio between standard deviation and mean) of 0.41.On the contrary, in the second case, i.e, = 5 • 10 −4 , a lower amount of data is received by the BS, i.e, 4.21 Mbits, but with a coefficient of variation equal to 0.2.This implies that a tradeoff between fairness and data transmission exists: it is due to the fact that in case of lower fairness the drones can focus on reflecting the signals from the GN closer to the current position, which in turns implies that farther nodes are less served.
For the same scenario parameters, a simulation with P u = 9 patches (for each drone) is carried out.As expected, Fig. 8 confirms the same tradeoff behavior highlighted in the previous case.As a matter of fact, with P u = 9 patches, in the Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.first case a total of 5.12 Mbits have been transmitted with a coefficient of variation equal to 0.42, while in the second case the values are 4.58 and 0.27 Mbits, respectively.Moreover, in Fig. 9 it can be seen the convergence of the neural network training process in all the configurations discussed above.Finally, Fig. 10 summarizes the results and the discussion above by comparing the obtained values with the case in which the swarm is absent and only the BS serves the GNs, i.e, {η k,g,u } = 0 ∀k, g, u.Moreover, to provide a benchmark, the proposed solution is also compared with a baseline approach, in which the optimized trajectories are maintained but a random patch scheduling is implemented (5 • 10 5 trials have been performed).As expected, in all cases, the drones provide a substantial enhancement of the channel capacity, especially when an optimized solution is employed.Indeed, the scheduling plan obtained from the PPO performs better than that related to the baseline approach.In particular, this is always true in terms of average data rate.Nonetheless, when a higher number of patches is employed a higher fairness factor is required to guarantee a uniform resource distribution.

VI. CONCLUSION
In this work, a comprehensive channel model for UAV-aided IRS-assisted OFDMA communications has been derived.Differently from other contributions, it considers the presence of a swarm of drones equipped with IRSs, which are split into an arbitrary number of patches to simultaneously serve multiple GNs.The proposed model inherently captures the constructive/destructive interference among users which can also experience different propagation conditions, due to the adoption of Rice fading.Relying on this channel model, a realistic communication scenario has been investigated, which led to the formulation of a Multiobjective MINLP problem, aiming to fairly distribute the resources among nodes, i.e, to maximize the sum-rate of GNs and, at the same time, to minimize the differences among rates.This required the joint optimization of the phase shift matrices of the IRSs and the trajectories of the drones.To overcome the intractability of the nonconvex cosine patterns related to wave interference, the problem has been rearranged as an MDP and solved via a DRL method, namely, PPO.The validity of this work has been corroborated by extensive simulations, for different parameter settings, including the comparison with a baseline approach.
Results demonstrated that: 1) channel capacity benefits from IRS employment and that 2) the proposed solution outperforms the baseline in terms of total and average transmitted data.Future work includes adding subcarrier scheduling in the joint optimization and considering the presence of multiple BSs.Moreover, models that capture the dependency of amplitude and phase of the IRS elements will be considered.

Fig. 10 .
Fig. 10.Comparison among total data transmitted amounts with optimized trajectories for random (Rnd) and optimized (Opt) scheduling.The black squares represent the mean value.
The aim is to maximize the sum-rate of GNs and, at the same time, minimize the differences among rates.Each node is served by a single BS (direct link) and by a swarm of drones (reflected links), thus requiring the jointly optimization of the phase shift matrices of the IRSs and the trajectories of the drones.
1) A comprehensive channel model for UAV-aided IRSassisted OFDMA systems is derived.Specifically, a swarm of drones equipped with IRSs is considered.The latter is split into an arbitrary number of patches