Deep Q-Learning-Based Resource Allocation in NOMA Visible Light Communications

Visible light communication (VLC) has been introduced as a key enabler for high-data-rate wireless services in future wireless communication networks. In addition, it was recently demonstrated that non-orthogonal multiple access (NOMA) can further improve the spectral efficiency of multi-user VLC systems. In this context, and owing to the promising potential of artificial intelligence in wireless communications, the present contribution proposes a deep Q-learning (DQL) framework that aims to optimize the performance of an indoor NOMA-VLC downlink network. In particular, we formulate a joint power allocation and LED transmission angle tuning optimization problem in order to maximize the average sum rate and the average energy efficiency. The obtained results demonstrate that our algorithm offers a noticeable performance enhancement for NOMA-VLC systems in terms of average sum rate and average energy efficiency, while keeping the convergence time to a minimum, particularly for a higher number of users. Furthermore, considering a realistic downlink VLC network setup, the simulation results show that our algorithm outperforms the genetic algorithm (GA) and the differential evolution (DE) algorithm in terms of average sum rate, and offers considerably lower run-time complexity.

Short-range communication with LEDs can be accomplished by modulating the intensity of the LED, a process known as intensity modulation (IM). At the receiver side, a photodetector (PD) performs direct detection (DD) by converting the received light intensity fluctuations into an electrical current for data demodulation [2]. Although VLC systems offer a large amount of available bandwidth, existing off-the-shelf LEDs have a restricted bandwidth, which limits the number of users that a single LED can serve. Therefore, to reap the full potential of VLC networks, spectrally efficient multiple access schemes are required to improve connectivity and the overall quality of service. To that end, one emerging multiple access technique that is capable of improving the spectral efficiency is non-orthogonal multiple access (NOMA) [4]. Through power-domain superposition coding at the transmitter and successive interference cancellation (SIC) at the receiver, all users in NOMA systems can utilize the entire modulation bandwidth of the system simultaneously. Thus, NOMA offers higher connectivity and spectral efficiency (SE) in IoT networks compared with the orthogonal frequency division multiple access (OFDMA) scheme. Moreover, it has been shown that NOMA performs considerably better in high signal-to-noise ratio (SNR) scenarios [5], rendering it a prominent candidate for VLC systems, which enjoy high SNRs owing to the relatively short distances between the transmitter and the receiver. The performance of NOMA-enabled VLC systems has been extensively studied [6], [7], [8], [9]. The main challenge of applying NOMA in VLC systems is the nonnegative real-valued requirement imposed on VLC signals, which renders current power allocation schemes in RF-NOMA systems inapplicable to VLC scenarios.
Accordingly, in this paper, we revisit these schemes and develop a deep Q-learning (DQL) approach in order to obtain an optimal resource allocation while considering the VLC channel characteristics. The authors in [6] proposed a gain ratio power allocation (GRPA) method and suggested NOMA as a possible candidate for high-speed VLC systems. The studies in [7] and [8] reported more advanced power allocation methods for NOMA-VLC, at the expense of increased computational complexity. Likewise, to improve the error rate performance of uplink NOMA-VLC systems, the authors in [9] proposed a phase pre-distortion approach.
It is recalled that the typically high energy consumption of connected devices in wireless networks constitutes a fundamental challenge in designing future 6G wireless networks, which are envisioned to enable a wide range of essential but energy-consuming applications [10], [11]. Therefore, it is critical to improve the energy efficiency of future wireless communication systems while maintaining or increasing the desired quality of service (QoS). In this context, it is worth mentioning that the exploitation of superposition modulation in NOMA enables energy-efficient wireless transmission [12], [13]. Thus, in order to ensure the desired QoS levels for all superimposed users, several research efforts have been devoted to the efficient design of power allocation mechanisms. To that end, power allocation problems were studied in [14], [15], whereas joint power allocation and sub-channel assignment problems were investigated in [16], [17], [18], [19].
Joint optimization problems in NOMA have received considerable attention from the research community. For example, Zhao et al. [20] proposed a joint UAV trajectory and NOMA precoding optimization framework with the aim of improving the system throughput. In another work, Peng et al. [21] considered a hybrid precoding and power allocation scheme in order to maximize the energy efficiency of mmWave-enabled NOMA UAV networks. Nevertheless, most of the joint resource allocation problems reported for NOMA-enabled networks are non-deterministic polynomial-time hard (NP-hard) [22], especially when users are mobile. Therefore, it is challenging to obtain an optimal solution due to the high amount of uncertainty and the high computational complexity. As a result, sub-optimal solutions were subsequently proposed in [23], [24], [25]. Heuristic optimization techniques such as the genetic algorithm (GA) [26] and the differential evolution (DE) algorithm [27] can be applied to these NP-hard problems. However, these techniques often converge to a local optimum. Hence, using heuristic techniques may limit the performance of NOMA in different scenarios for future wireless networks. Therefore, it is of paramount importance to employ an efficient method for obtaining an optimal power allocation mechanism for VLC networks with uniformly distributed users. With this motivation, in the present contribution we utilize an algorithm based on deep reinforcement learning (DRL), in which an agent in the network continuously learns from the environment and adapts the network parameters accordingly. The proposed algorithm aims to improve the average sum rate of a VLC network with uniformly distributed users. This also provides an answer to the following question: is it practically feasible to jointly optimize the power allocation and the LED transmission angle of an indoor VLC-NOMA network?

A. RELATED WORKS
Recently, Q-learning has sparked considerable interest among researchers and engineers in various fields. Q-learning is a form of reinforcement learning that relies on Q-tables to store the optimal sequence of actions, which maximizes the future reward. In the context of optimizing communication networks, several studies have adopted Q-learning to enhance the performance of wireless networks from different perspectives [25], [28], [29], [30], [31], [32], [33], [34], [35]. In [28], the authors proposed a fast RL-based power allocation scheme to improve the spectral efficiency of a multiple-input multiple-output (MIMO) NOMA system in the presence of interference from a smart jammer. The study in [30] used Q-learning to develop a framework for enabling mobile edge computing with NOMA. By incorporating deep learning into RL, DRL addresses a challenge associated with Q-learning in terms of Q-table storage and look-up. Based on this, Yang et al. [31] used a deep Q-network (DQN) to model a multi-user NOMA offloading problem, whereas the authors in [32] designed a power allocation scheme for cache-assisted NOMA systems using DRL. Likewise, Zhang et al. [33] proposed a dynamic power allocation mechanism based on actor-critic RL, whilst DRL was used in [35] to arrive at sub-optimal power allocation solutions for an uplink multicarrier NOMA system. Finally, He et al. [25] solved a joint power allocation and channel assignment problem in a two-user NOMA system using a DRL framework.

B. MOTIVATION
The aim of this work is to jointly optimize power allocation and LED transmission angle tuning, with uniformly distributed users in an indoor VLC network. In such a setup, the problem is NP-hard and cannot be tackled using conventional optimization methods. The most significant advantage offered by DQL is its ability to solve complex joint optimization problems in wireless communication, which cannot be solved by conventional mathematical tools [36]. The effectiveness of DQL has been demonstrated in several works in the literature. For instance, the authors in [37] optimized an IRS-NOMA system by using DQL to predict and optimally tune the IRS phase shift matrices. In [38], DQL is used to allocate optimal channels to a cluster of users in order to maximize energy efficiency. The DQL algorithm allows the agent to learn about the communication environment and develop new knowledge that can lead to an optimal solution, even with imperfect channel state information acquisition. Therefore, in the current contribution, we leverage the DQL algorithm in order to solve this complex problem. The motivation underlying the utilization of DQL in our optimization framework is two-fold:
• The proposed solution can cope with channel uncertainty and does not require perfect knowledge of the channel state information to maximize the average sum rate.
• The solution avoids an exhaustive search, which would have to evaluate all the power allocation coefficients in combination with all possible LED transmission angles and is therefore impractical.

C. CONTRIBUTIONS
To the best of the authors' knowledge, none of the previous studies proposed a DQL algorithm to maximize the average sum rate of uniformly distributed users in a NOMA-VLC indoor network. In this work, we propose an efficient DQL-based algorithm that maximizes the average sum rate by jointly optimizing the power allocation and the transmission angle of the LEDs. The main contributions of this paper can be summarized as follows:
• We formulate a joint optimal power allocation and LED transmission angle tuning problem in the downlink of the considered NOMA-VLC network. The optimization problem aims to maximize the average sum rate of uniformly distributed users, under total power and individual LED transmission angle constraints.
• We propose a joint power allocation and LED transmission angle tuning algorithm to solve the aforementioned non-convex optimization problem by introducing the DQL concept. In particular, we define a reward function that maximizes the sum rate while adhering to the constraints on power and LED transmission angles.
• We conduct a theoretical complexity analysis of the proposed deep Q-learning framework and draw valuable insights on the efficiency of the proposed scheme.
• We validate the superiority of the proposed algorithm over the fixed power allocation policy and the exhaustive search method. The simulation results indicate that, after a few iterations, the proposed scheme converges and performs better under varying transmit SNR, cell radius, and VLC access point (AP) height.
• The offered results provide useful insights on the achievable performance of the proposed technique, which is of particular practical importance.

II. INTRODUCTION TO DEEP REINFORCEMENT LEARNING
In this section, we introduce the concept of DRL, which is a special case of reinforcement learning. First, it is recalled that reinforcement learning is a sub-field of machine learning, where an agent interacts with the environment in order to perform the series of actions that maximizes the expected future reward. This interplay between the agent and the environment is depicted in Fig. 1.
In general, RL can be classified as single-agent or multi-agent based on the number of agents. In the case of single-agent RL, if the agent can observe the environment's full state information, the sequential decision-making problem can be modeled using the Markov decision process (MDP) framework. On the other hand, multi-agent reinforcement learning is typically modeled as a Markov, or stochastic, game (a generalization of the traditional repeated game) when two or more agents have complete environment observation and make decisions accordingly. Without loss of generality, our underlying framework assumes a single agent for a single VLC access point. In DRL, the best sequence of actions for an agent is predicted by a deep neural network. Therefore, the deep neural network in DRL acts as a universal function approximator.
The fundamental elements of RL are:
• Observations: These are continuous measurements of the environment's properties. They are collected in a vector $O \in \mathbb{R}^p$, where $p$ denotes the number of observed properties.
• States: The state $s_t \in \mathcal{S}$ denotes the discretized observation at time step $t$.
• Actions: An action $a_t \in \mathcal{A}$ is one of the valid decisions that the agent can take at time step $t$.
• Policy: A policy, denoted by $\pi(\cdot)$, is the mapping from any given state of the environment to the action to be taken by the agent.
• Rewards: The value $u_{s,s',a}[t]$ is the reward obtained after an agent takes a particular action $a_t$ in a given state $s$ at time $t$, which leads to state $s'$.
• State-action value: Denoted by $Q_\pi(s, a)$, and defined as the expected discounted reward when the agent starts at state $s$ and selects action $a$ according to policy $\pi$.
At a given time step $t$, when an agent performs an action $a_t$, the agent's environment changes from the current state $s_t$ to the following state $s'$. As a result of this transition, the agent receives an immediate reward $u_t$ that represents the outcome of performing action $a_t$ while in state $s_t$. At time $t$, this system generates an experience tuple $e_t = (s_t, a_t, u_t, s')$, which is stored in a buffer $D$. Based on this, the main goal of the agent is to maximize the long-term cumulative discounted reward, which is defined as
$$U_t = \sum_{j=0}^{\infty} \gamma^{j} u_{t+j}, \tag{1}$$
with discount factor $\gamma \in [0, 1]$. To accomplish this, an optimal policy $\pi^*$ that maps the best actions to states is required. In other words, the optimal policy acts as a guide, informing the agent which action should be taken at any given state in order to maximize the long-term cumulative reward. It is noted that the Q-value function [39] represents the expected cumulative reward $U_t$ of starting at state $s_t$, performing action $a_t$, and following a certain policy $\pi$.
This function is critical in solving RL problems, and is given by
$$Q_\pi(s_t, a_t) = \mathbb{E}_\pi\left[ U_t \mid s_t, a_t \right], \tag{2}$$
where $\mathbb{E}[\cdot]$ denotes statistical expectation. The optimal policy $\pi^*$ that maximizes (1) for all states and actions also maximizes (2). Consequently, the optimal Q-value function that follows $\pi^*$ is obtained using
$$Q_{\pi^*}(s_t, a_t) = \mathbb{E}\left[ u_t + \gamma \max_{a'} Q_{\pi^*}(s', a') \,\middle|\, s_t, a_t \right]. \tag{3}$$
The definition in (3) is known as the Bellman equation [40]. The purpose of the equation is to divide the value function into two components: the immediate reward $u_t$ and the discounted long-term cumulative reward. Rather than summing over multiple time steps, the definition in (3) simplifies the computation of the Q-value function by decomposing it into simpler, recursive sub-problems and determining their optimal solutions. Nevertheless, the Bellman equation in (3) is nonlinear, and hence, it has no closed-form solution. As a result, numerous iterative methods have been proposed (e.g., Q-learning), each of which has been shown to converge to the optimal Q-value function [39]. However, these methods become impractical in multi-user systems with a large state or action space, as the size of the Q-value table (i.e., all possible values of (2) for all possible states and actions) is extremely large. The solution to this problem is to estimate the Q-value using function approximators, e.g., deep neural networks, which is the core idea of the underlying deep Q-network.
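As an illustration of the iterative methods mentioned above, the following minimal sketch applies one tabular Q-learning update toward the Bellman target. The function name, the learning rate `lr`, and the toy table size are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def q_learning_update(q_table, s, a, reward, s_next, gamma=0.9, lr=0.1):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Bellman target: immediate reward plus discounted best next-state value.
    target = reward + gamma * np.max(q_table[s_next])
    td_error = target - q_table[s, a]
    q_table[s, a] += lr * td_error
    return td_error

# Toy usage (assumed sizes): 3 states, 2 actions, all Q-values start at zero.
q = np.zeros((3, 2))
q_learning_update(q, s=0, a=1, reward=1.0, s_next=2)
```

The table has one entry per state-action pair, which is exactly what becomes impractical for large state or action spaces and motivates replacing the table with a neural network.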
The DQN design, shown in Fig. 2, consists of the following three main components:
• The input layer represents the states of the environment.
• The hidden layer acts as a function approximator. In this component, the rectified linear unit (ReLU) activation function is used to compute the hidden layer values. The ReLU function is defined as
$$y_o = \max(0, y_i), \tag{4}$$
where $y_o$ is the output of the activation function and $y_i$ is its input. The main advantage of employing ReLU as an activation function is its computational efficiency, since it does not compute exponentials or divisions [41]. Additionally, ReLU introduces more sparsity in the hidden units, because the output is zero whenever $y_i < 0$ [42]. Therefore, the computational efficiency and the increased sparsity can lead to faster convergence.
• The output layer represents the predicted state-action value function, $Q^{*}_{\pi}(s, a; W_t)$.

III. SYSTEM AND CHANNEL MODELS
Fig. 3 shows the underlying system model of the considered study. We consider a NOMA-VLC indoor network, which consists of a VLC AP installed on a ceiling at height $L$. The VLC AP serves $K$ users, uniformly distributed over a polar coordinate plane of radius $r$. Without loss of generality, in this work, we focus on the downlink communication.
Although VLC channels consist of line-of-sight (LOS) and non-LOS (NLOS) components, this study considers only the direct LOS component, since the NLOS component carries much less energy.

A. VLC CHANNEL MODEL
The signal transmitted by the VLC AP can be expressed as where $P_e$ denotes the total electrical transmit power, $I_{DC}$ represents the LED DC bias, which is essential for intensity-modulation-based optical baseband transmission, $s_i$ represents the modulated symbol of the $i$th out of $K$ links, and $\alpha_i$ is the power allocation coefficient for the corresponding link. It is assumed that the transmitted signal for each user follows a uniform distribution with zero mean and unit variance. Based on this and a given total power constraint, the following constraint should hold, Furthermore, the optical transmit power of the LED can be expressed as where $\eta$ denotes the LED efficiency, which, without loss of generality, is assumed to be normalized to unity. Based on this, the received signal at the $k$th user can be expressed as where $z_k$ is the additive white Gaussian noise with zero mean and variance $\sigma_k^2$, and the channel gain $h_k$ is given by [43]
$$h_k = \begin{cases} \dfrac{A R_p (m+1)}{2\pi d_k^2} \cos^m(\phi_k)\, T(\psi_k)\, g(\psi_k) \cos(\psi_k), & 0 \le \psi_k \le \psi_c, \\ 0, & \text{otherwise}, \end{cases} \tag{9}$$
with $A$ denoting the area of the PD, $R_p$ representing the responsivity of the PD, and $d_k$ denoting the Euclidean distance between the VLC AP and the $k$th user. Also, $T(\psi_k)$ and $g(\psi_k)$ denote the optical filter gain and the optical concentrator gain, respectively. It is also noted that (9) indicates that the channel gain $h_k$ is inversely proportional to the square of the distance of the $k$th user. As shown in Fig. 3, the light emitted from the LED follows a Lambertian radiation pattern with order
$$m = -\frac{\ln 2}{\ln\left(\cos \phi_{1/2}\right)}, \tag{10}$$
where $\phi_{1/2}$ is the transmission angle of the VLC AP, $\psi_c$ denotes the receiver's field of view (FOV), whereas $\psi_k$ and $\phi_k$ denote the angle of incidence and the angle of irradiance, respectively. It is recalled that in power-domain NOMA systems, users with stronger channel conditions are allocated lower signal power, whereas users with poorer channel conditions are allocated more power, which implies that $\alpha_1 \ge \cdots \ge \alpha_k \ge \cdots \ge \alpha_{K-1} \ge \alpha_K$. Without loss of generality, we assume that the users in the considered setup are sorted in ascending order according to their channels, namely,
$$h_1 \le \cdots \le h_k \le \cdots \le h_K. \tag{11}$$
In order to perform reliable signal detection, the $k$th user performs SIC to cancel the interference from signals with higher power levels. Also, the signals of the users that are allocated lower power coefficients are treated as noise.
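The LOS gain and Lambertian order described above can be evaluated numerically as in the following minimal sketch. The default parameter values (PD area, responsivity, FOV, refractive index) and the concentrator model $n^2/\sin^2(\psi_c)$ are illustrative assumptions commonly used in the VLC literature, not values taken from this paper.

```python
import math

def lambertian_order(half_angle_deg):
    """Lambertian order m = -ln 2 / ln(cos(phi_1/2))."""
    return -math.log(2.0) / math.log(math.cos(math.radians(half_angle_deg)))

def los_gain(d, phi_deg, psi_deg, half_angle_deg, area=1e-4, resp=0.4,
             filter_gain=1.0, fov_deg=60.0, refr_index=1.5):
    """LOS DC channel gain; returns 0 when the incidence angle is outside the FOV."""
    if psi_deg < 0.0 or psi_deg > fov_deg:
        return 0.0
    m = lambertian_order(half_angle_deg)
    phi, psi = math.radians(phi_deg), math.radians(psi_deg)
    # Assumed concentrator model: g = n^2 / sin^2(psi_c).
    conc_gain = refr_index ** 2 / math.sin(math.radians(fov_deg)) ** 2
    return (area * resp * (m + 1) / (2.0 * math.pi * d ** 2)
            * math.cos(phi) ** m * filter_gain * conc_gain * math.cos(psi))
```

For example, `lambertian_order(60)` evaluates to approximately 1, and doubling the distance `d` at fixed angles reduces the gain by a factor of four.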

B. IMPERFECT CSI MODEL
Unlike the majority of previous related contributions, which assumed perfect CSI knowledge for the underlying VLC system model, this work makes the practical assumption of imperfect CSI knowledge. CSI is typically obtained at the receivers via pilot symbols. The channel coefficients are then fed back to the transmitter via an RF or infrared (IR) uplink, where the channel uncertainty increases as the uplink and downlink channel noise increases. Additionally, the channel uncertainty is increased by quantization errors introduced by the imperfect digital-to-analog and analog-to-digital conversion processes, which ultimately degrades system performance. It is worth noting that the current analysis uses the same noisy CSI model as [44], which takes into account the resultant CSI error regardless of its source, e.g., location uncertainty, orientation uncertainty, and LED half-angle uncertainty.
The channel coefficient for the VLC link can be modeled by using the minimum mean squared error (MMSE) estimation method, yielding [44] $h_k = \hat{h}_k + e_k$, where $\hat{h}_k$ is the estimated channel gain and $e_k$ denotes the estimation error in the channel, which follows a Gaussian distribution with mean $h_k$ and variance $\sigma_e^2$. It is worth noting that the random variables $\hat{h}_k$ and $e_k$ are uncorrelated.

C. VLC CHANNEL MODEL OF UNIFORMLY DISTRIBUTED USERS
Without loss of generality, we assume that the users are uniformly distributed within the attocell. This assumption is widely adopted as a baseline in the literature, e.g., [45], [46], [47], and it can be readily generalized to Poisson or normal distributions. Following this assumption, the relationship between the angle of incidence, the angle of irradiance, the Euclidean distance of the $k$th user, the height $L$, and the radial distance $r_k$ is given by where Substituting (13) and (14) in (9), the DC gain of the LOS component can be expressed as where is a constant. Furthermore, given that users are uniformly distributed, the following probability density function (PDF) is used: $f_{r_k}(r) = 2r/r_e^2$. Therefore, the PDF of the corresponding channel gain is given by and Based on this and in order to obtain the corresponding cumulative distribution function (CDF), we integrate (17) With the aid of order statistics [48], the PDF of the ordered channel gain of the $k$th user, denoted by $f_{h_k}(t)$, can be obtained as which after some algebraic manipulations can be equivalently expressed as follows: where the constant = 1 . Note that the PDF in (22) has a convenient mathematical form, as it consists of only elementary functions, which renders it tractable both analytically and computationally.
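To make the uniform-user assumption concrete, the sketch below draws radial distances by inverting the CDF $F(r) = r^2/r_e^2$ of a uniform distribution over a disc, and then derives the link distance. The geometry used here, $d_k = \sqrt{r_k^2 + L^2}$ and $\cos\phi_k = \cos\psi_k = L/d_k$, assumes a receiver facing straight up toward a ceiling-mounted AP; this is an assumption of the example and may differ from the paper's (13) and (14).

```python
import numpy as np

def sample_users(num_users, cell_radius, height, rng=None):
    """Sample uniformly distributed users and sort them weakest channel first."""
    rng = np.random.default_rng() if rng is None else rng
    # Inverse-CDF sampling: F(r) = r^2 / r_e^2  =>  r = r_e * sqrt(U).
    r = cell_radius * np.sqrt(rng.uniform(size=num_users))
    d = np.sqrt(r ** 2 + height ** 2)      # AP-to-user Euclidean distance
    cos_angle = height / d                 # assumed: cos(phi_k) = cos(psi_k) = L / d_k
    order = np.argsort(-d)                 # farthest user first (weakest LOS gain)
    return r[order], d[order], cos_angle[order]
```

Sorting by decreasing distance matches the ascending channel-gain order used for NOMA decoding, since under this geometry the LOS gain decreases monotonically with the radial distance.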

D. AVERAGE SUM RATE OF NOMA-VLC WITH UNIFORMLY DISTRIBUTED USERS
Following [43], the average sum rate of NOMA-VLC under imperfect CSI can be expressed as¹ where $\rho = P_e/\sigma_k^2$ denotes the average transmit SNR. It is worth noting that (23) is derived under the assumption of a perfect SIC process. In addition, the decoding order is assumed to be fixed and known to the receivers.
¹ It should be noted that Shannon's capacity equation is valid for VLC systems if the transmitted signal is frequency-upshifted. Specifically, the real-valued baseband transmission signal model for VLC can be converted into a complex-valued baseband channel by applying a frequency upshift to an intermediate frequency (IF), which has a slightly higher center frequency than half the bandwidth of the transmitted signal, before applying the bias current.
For $K$ uniformly distributed users and an arbitrary power allocation strategy, the average sum rate of NOMA-VLC is given in (24) [49]. The average sum rate in (24) is expressed in bits per second (bits/s), whereas the average energy efficiency metric is expressed in bits per joule, and can be calculated using where $Q_{VLC}$ represents the fixed power consumption of the VLC AP, expressed in watts.
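For intuition only, the following sketch evaluates an instantaneous NOMA sum rate and the corresponding energy efficiency for a given set of channel gains. It assumes the textbook power-domain NOMA rate with perfect SIC and perfect CSI, normalized to unit bandwidth, and an energy efficiency defined as the rate divided by the transmit power plus the fixed power. These forms are assumptions of the example and are not the paper's closed-form averages in (23) to (25).

```python
import numpy as np

def noma_sum_rate(h, alpha, rho):
    """Sum rate in bits/s/Hz; users are sorted by ascending gain, perfect SIC."""
    h = np.asarray(h, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    # User k treats the signals of users k+1..K (lower power) as noise.
    interference = np.cumsum(alpha[::-1])[::-1] - alpha
    sinr = rho * h ** 2 * alpha / (rho * h ** 2 * interference + 1.0)
    return float(np.sum(np.log2(1.0 + sinr)))

def energy_efficiency(sum_rate, tx_power, fixed_power):
    """Assumed definition: rate divided by total consumed power."""
    return sum_rate / (tx_power + fixed_power)
```

With a single user, unit gain, and `rho = 3`, the function returns `log2(4) = 2`, which is a quick sanity check of the SINR expression.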

IV. PROBLEM FORMULATION
The main objective of this work is to perform a joint optimization of the power allocation and the LED transmission angle $\phi_{1/2}$, with the aim of maximizing the average sum rate of uniformly distributed users. Accordingly, the joint optimization problem is formulated as where the first constraint (P1.a) refers to the maximum allowed transmission power, and the second constraint (P1.b) ensures that the total transmit power of the superimposed signal equals $P_e$. The final constraint (P1.c) ensures that the selected LED transmission angles fall within a practical range. Due to the high computational complexity and the varying nature of the channels, it is challenging to obtain a globally optimal solution to (P1). Two conventional approaches can be considered to solve the above optimization problem. The first approach, which is the simplest in terms of implementation, uses a fixed power allocation policy and a fixed LED transmission angle. However, such an approach results in a sub-optimal solution. The other approach is the exhaustive search, which can lead to an optimal solution at the expense of increased complexity. In the next section, we introduce DRL as an alternative approach and present the proposed DRL-based optimization framework for joint power allocation and LED transmission angle tuning.

V. JOINT POWER ALLOCATION AND LED TRANSMISSION ANGLE TUNING (JPA-LTAT): DRL-BASED FRAMEWORK
In what follows, we propose a DRL-based framework to solve the optimization problem (P1). First, we will present how the DQN is trained with an appropriate policy selection criterion. Then we introduce an algorithm that relies on the DRL framework to achieve optimal performance.

A. TRAINING PHASE
The DQN is trained and updated to approximate the action-value function $Q_{\pi^*}(s, a)$. It is recalled that the experience tuple is defined as $e_t = (s_t, a_t, u_t, s')$. The agent saves its experiences in a buffer $D = \{e_1, e_2, \ldots, e_t\}$ that is used to train the DQN using the gradient descent algorithm [7].
While it is ideal for DQN training to use all data in each iteration, this is prohibitively expensive when the training set is large. A more efficient method is to evaluate the gradients in each iteration using a random subset of the replay buffer $D$, referred to as a mini-batch. Accordingly, the loss function is defined as follows where (26) denotes the DQN's loss function for a random mini-batch of $D$ at time slot $t$, and $\hat{W}$ denotes the quasi-static target parameters that are updated every $t$ time slots. Finally, the optimal weights are obtained using In order to minimize the loss function defined in (26), the weights of the DQN are updated at every time step $t$ using a stochastic gradient descent (SGD) algorithm on a mini-batch sampled from the replay buffer $D$. To this effect, the SGD algorithm updates the weights $W$ in an iterative process with a learning rate $\mu > 0$ as follows [50]
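The training step described above can be sketched as follows. To keep the example self-contained, a linear Q-function (one weight row per action) stands in for the deep network, and the loss is taken to be the mean squared temporal-difference error over the mini-batch; both choices are assumptions of this sketch rather than the paper's exact definitions in (26) to (28).

```python
import numpy as np

def q_values(weights, state):
    """Linear stand-in for the DQN: returns one Q-value per action."""
    return weights @ state

def sgd_step(weights, target_weights, batch, gamma=0.9, lr=0.01):
    """One SGD step on the mean squared TD error; returns the pre-update loss."""
    grad = np.zeros_like(weights)
    loss = 0.0
    for s, a, u, s_next in batch:
        # TD target computed with the quasi-static target parameters.
        y = u + gamma * np.max(q_values(target_weights, s_next))
        err = y - q_values(weights, s)[a]
        loss += err ** 2
        grad[a] += -2.0 * err * s          # gradient of err^2 w.r.t. row a
    weights -= lr * grad / len(batch)      # in-place update, W := W - mu * grad
    return loss / len(batch)
```

Here `target_weights` plays the role of the quasi-static parameters $\hat{W}$; in practice they would be copied from `weights` only every few updates.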

B. POLICY SELECTION
Generally speaking, Q-learning is an off-policy algorithm, which means that it estimates the reward for future actions and assigns a value to the new state without actually following any greedy policy [51]. Based on this, we consider a near-greedy action selection policy. The near-greedy policy has two modes:
1) Exploration: The agent tries different actions at every time step $t$ to discover an effective action $a_t$.
2) Exploitation: The agent chooses an action at time step $t$ that maximizes the state-action value function $Q_\pi(s, a; W_t)$ based on the previous experience.
In the near-greedy policy, the agent has an exploration rate of $\epsilon$ and an exploitation rate of $1 - \epsilon$, where $0 < \epsilon < 1$ is a hyper-parameter that controls the trade-off between exploitation and exploration. At every time step $t$, the agent performs a specific action $a_t$ at a given current state $s_t$. Accordingly, the agent receives a positive or negative reward $u_{s,s',a}[t]$ and moves into a target state $s' := s_{t+1}$.
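The near-greedy selection rule can be written in a few lines, as in the sketch below. The function name and the injectable `rng` argument are assumptions made for illustration and testing convenience.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action index with probability epsilon, else the greedy one."""
    if rng.random() <= epsilon:
        # Exploration: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploitation: pick the action with the largest predicted Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

A larger `epsilon` favors exploration early in training, and it is common (though not stated in the text) to decay it over episodes.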
The period of time in which the agent interacts with the environment is called an episode, where each episode has a total duration of $T$ time steps. An episode is considered to have converged once the target objective is fulfilled. The dimension of the input layer is set equal to the number of states in $\mathcal{S}$, and the dimension of the output layer is equal to the number of possible actions in $\mathcal{A}$. For the hidden layer, the depth has a considerable impact on the computational complexity. Therefore, we opted for a small depth that offers a reasonable balance between performance and computational complexity.

C. PROPOSED ALGORITHM
In this subsection, we propose the joint power allocation and LED transmission angle tuning (JPA-LTAT) algorithm, an optimization framework based on DRL. The JPA-LTAT algorithm optimizes the average sum rate of the VLC system, assuming that the CSI of each user is unknown. At each time step $t$, the algorithm calculates the average sum rate of the NOMA users in the considered VLC network, which is given in (24). In what follows, we provide some details on the state space, the action space, and the reward function.

1) STATE SPACE
All possible states form the state space, denoted as $\mathcal{S}$. In this paper, the state space $\mathcal{S}$ contains the power allocation coefficients of each user in the VLC network and the LED transmission angle of the VLC AP. Accordingly, the resultant state space is $\mathcal{S} = [\alpha_1, \alpha_2, \ldots, \alpha_K, \phi_{1/2}]$. For instance, assuming an initial equal power allocation, the initial state for $K = 4$ users and $M = 1$ VLC access point with a $45^\circ$ LED transmission angle is $[0.25, 0.25, 0.25, 0.25, 45^\circ]$.

2) ACTION SPACE
All actions that the agent can take form the action space, denoted as $\mathcal{A}$. The possible actions in the action space $\mathcal{A}$ are:
• Increase or decrease the power allocation factor of user $k$ by a fixed step size, which is added to (or subtracted from) $\alpha_k$ for each $k \in \mathcal{K}$, while maintaining a unit sum such that $\sum_{k=1}^{K} \alpha_k = 1$.
• Increase or decrease the LED transmission angle of the VLC AP by a step size $\iota_m$, where $\iota_m$ is a fixed value to be added to (or subtracted from) the LED transmission angle of the $m$th VLC AP, such that the LED transmission angle satisfies $30^\circ \le \phi_{1/2} \le 70^\circ$.
The total number of actions in the action space $\mathcal{A}$ is calculated using $|\mathcal{A}| = 2M + 2K$.
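The action space can be enumerated and applied to a state vector as in the following sketch. The step sizes, the tuple encoding of actions, and the renormalization used to keep the power coefficients summing to one are assumptions of this example; the text does not specify how the unit-sum constraint is enforced.

```python
import numpy as np

def build_actions(num_users, num_aps=1):
    """Enumerate (kind, index, sign) actions; the list length is 2K + 2M."""
    actions = [("power", k, s) for k in range(num_users) for s in (+1, -1)]
    actions += [("angle", m, s) for m in range(num_aps) for s in (+1, -1)]
    return actions

def apply_action(state, action, num_users, power_step=0.01, angle_step=5.0):
    """Return (next_state, valid) for state = [alpha_1..alpha_K, phi_1..phi_M]."""
    kind, idx, sign = action
    nxt = np.array(state, dtype=float)
    if kind == "power":
        nxt[idx] = np.clip(nxt[idx] + sign * power_step, 1e-6, None)
        nxt[:num_users] /= nxt[:num_users].sum()   # assumed renormalization
    else:
        nxt[num_users + idx] += sign * angle_step
    angles = nxt[num_users:]
    # The episode should be aborted when an angle leaves [30, 70] degrees.
    valid = bool(np.all((angles >= 30.0) & (angles <= 70.0)))
    return nxt, valid
```

For $K = 4$ users and $M = 1$ access point, `build_actions(4, 1)` yields 10 actions, in agreement with $|\mathcal{A}| = 2M + 2K$.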

3) REWARD FUNCTION
The reward function plays an essential role in the RL algorithm. We use the average sum rate of the VLC-NOMA system, which is calculated using (24), to represent the immediate reward $u_t$ returned after choosing action $a_t$ in state $s_t$.
Having described the State Space, Action Space, and the Reward Function. In the following, we describe in detail the operational steps of the JPA-LTAT algorithm. Algorithm 1 further summarizes the JPA-LTAT algorithm.
1) The VLC network environment is initialized according to Table 1. The DRL hyper-parameters are initialized as in Table 2. The policy network weights $W_t$ are randomly initialized.
2) The power allocation coefficients are reset to their initial values at the start of each episode to improve the learning experience. Similarly, the LED transmission angle is reset to its initial value of $45^\circ$.
3) JPA-LTAT uses the $\epsilon$-greedy algorithm to select an action from the action space for a given state in our time-sequential decision process.
4) To allow the exploration of the action space, $\tau$ is randomly sampled from a uniform distribution.
   a) If the sampled value is less than or equal to $\epsilon$, the agent takes a random action.
   b) Otherwise, the agent selects an action based on the learned policy $a_t = \arg\max_a Q^{\pi}(s, a; W_t)$, which aims to maximize the cumulative future reward.
5) In order to maximize the Q-value, which is constructed from the policy network outputs, the agent observes the next state and performs an action from the following set:
   a) Increase or decrease the power allocation factor $\alpha_k$ by its step size, $\forall k \in K$, for each user in the VLC network.
   b) Increase or decrease the LED transmission angle $\phi_{1/2}$ of the VLC AP by the step size $\iota$.
6) Following (24), the average sum rate is computed for the new set of power allocation factors and the newly modified LED transmission angle, and is stored as the reward $u_t$.
7) If the agent tries to move the LED transmission angle outside the specified range $30^\circ \le \phi_{1/2} \le 70^\circ$, the episode is aborted.
8) Following that, $s_t$, $s'$, $a_t$, and $u_t$ are stored in the replay memory buffer D, which has a capacity of M.
9) Using the gradient descent algorithm with a learning rate $\mu$, a mini-batch is sampled from the buffer and used to train the policy network to minimize the loss function, which is given by (26).
10) The resulting loss $L(W)$ at time step t is recorded, and the next state $s'$ becomes the current state $s_t$.

Algorithm 1: JPA-LTAT Algorithm
Input: The average sum rate of the VLC network.
Output: The optimal power allocation coefficients of each user, and the optimal LED transmission angle.
1: Initialize time, actions, states, and replay buffer D.
...
11: Select an action based on $a_t = \arg\max_a Q^{\pi}(s, a; W_t)$.
12: if $\phi_{1/2} < 30^\circ$ or $\phi_{1/2} > 70^\circ$ then
13:    Abort episode.
14: Compute the average sum rate based on (24).
15: Store experience $e_t = (s_t, a_t, u_t, s')$ in D.
16: Sample a minibatch from D, $e_j = (s_j, a_j, u_j, s_{j+1})$.
17: Set $y_j := u_j + \gamma \max_{a'} Q^{\pi^*}(s_{j+1}, a'; W_t)$.
18: Obtain the optimal weights $W$ by performing SGD on $(y_j - Q^{\pi^*}(s_j, a_j; W_t))^2$.
19: Update $W_t := W$ in the DQN.
20: Record the loss $L_t$.
21: Update $s_t := s'$.
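The exploration rule, the replay memory, and the target computation used in the steps above can be sketched as follows. This is a minimal, framework-free illustration under stated assumptions: the class and function names are hypothetical, the Q-values are passed in as plain lists rather than produced by a policy network, and the network update itself is omitted.

```python
import random
from collections import deque


def epsilon_greedy(q_values, epsilon, rng=random):
    """Take a random action if tau <= epsilon, otherwise the greedy action."""
    tau = rng.random()  # tau drawn uniformly from [0, 1)
    if tau <= epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


class ReplayBuffer:
    """Fixed-capacity experience memory; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Sample without replacement, never asking for more than is stored.
        return rng.sample(list(self.memory), min(batch_size, len(self.memory)))


def td_target(reward, next_q_values, gamma):
    """Target y_j = u_j + gamma * max_a' Q(s_{j+1}, a')."""
    return reward + gamma * max(next_q_values)
```

In a full training loop, the squared difference between `td_target(...)` and the network's current estimate for the sampled state-action pair would be minimized by gradient descent, as in steps 9 and 10.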

D. COMPLEXITY ANALYSIS OF THE PROPOSED ALGORITHM
It is crucial to quantify the computational complexity of the proposed algorithm. However, since deep learning algorithms depend on hyperparameters, it is difficult to apply analytical methodologies that guarantee the convergence of the proposed DQL-based method. This is a common challenge in the literature for analytically proving optimality and convergence [52], [53], [54], [55]. Therefore, instead of a convergence result, we present the following theorem, which gives the amount of work per iteration in Algorithm 1.

Theorem 1: For an indoor NOMA-VLC system with K users and M access points, the computational complexity of the proposed Algorithm 1 is given by (29).

Proof: First, the DQL agent observes the state of the system, executes the most valuable action, and calculates the reward based on (24). The computational complexity of calculating the reward is assumed to be given by (30), where $C_1$ is directly proportional to K. Second, it is known that the size of the state space and the size of the action space play a significant role in the complexity of the deep Q-learning algorithm. Following [56], the computational complexity of the Q-learning algorithm with the greedy policy is estimated to be $O(S \times A \times H)$ per iteration, where S is the number of states, A is the number of actions, and H is the number of steps per episode. Recall that the size of the state space is K + M and the size of the action space is 2K + 2M. Therefore, the amount of work per iteration is $O((K + M)(2K + 2M)H)$, as stated in (31). Based on this and by incorporating (30) into (31), equation (29) is deduced, which completes the proof.
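As a worked instance of the $S \times A \times H$ estimate, consider K = 4 users and M = 1 access point, which matches the single-AP setup used in the results section. The episode length H is kept symbolic here, since its value is not restated in this section; the reward-computation term involving $C_1$ is likewise left out.

```latex
\begin{align}
S &= K + M = 4 + 1 = 5, \\
A &= 2K + 2M = 8 + 2 = 10, \\
S \times A \times H &= 5 \times 10 \times H = 50\,H .
\end{align}
```

Doubling the number of users to K = 8 gives $9 \times 18 \times H = 162\,H$, which illustrates the roughly quadratic growth in K + M.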

E. FIXED POWER ALLOCATION
FPA is considered one of the simplest power allocation schemes. In this scheme, the power allocated to each user is predefined and determined by a fixed power allocation factor α. It is worth noting that FPA has a complexity of O(1); however, it does not yield optimal or near-optimal performance.

VI. ACHIEVED RESULTS AND DISCUSSION
This section discusses and analyzes the performance of the proposed DQL-based algorithm, which maximizes the average sum rate of the indoor NOMA-VLC network. Without loss of generality, we assume K users, uniformly distributed in an indoor environment with a room size of 4 × 4 meters and a height of 3 meters. The room has a single VLC AP in the ceiling, with a fixed power consumption of 4 W and a conversion efficiency of 1 W/A. The remaining simulation parameters are summarized in Table 1. The DQL algorithm was implemented and trained on a PC equipped with an Nvidia 2080 Ti GPU and an 18-core 2.6 GHz processor. We developed our framework using Python and the TensorFlow library [57]. The deep Q-learning hyper-parameters are shown in Table 2.

Fig. 4 compares the convergence performance of the proposed DQL-based algorithm, the GA, and the DE algorithm. The settings for the GA are as follows: the number of bits per variable is 8, the population size is 20, the crossover rate is 0.9, and two typical mutation rates of 0.1 and 0.2 are considered. It is worth mentioning that each algorithm has a different execution time per iteration, which is shown in Table 3. To begin with, the proposed algorithm converges after 48 iterations with a maximum average sum rate of 35.9 bpcu. Notably, it converges faster than the two baseline schemes. For example, the GA with a mutation rate of 0.1 takes approximately 478 iterations to converge, and the GA with a mutation rate of 0.2 converges after 481 iterations, which is similar. The DE algorithm converges after 1787 iterations, the highest among all the techniques. The rapid convergence of the proposed algorithm is partly attributed to the fact that the DQL algorithm can leverage the GPU cores to parallelize its operations.
As for the average sum rate performance, the proposed algorithm achieves a maximum average sum rate of 35.9 bpcu, which outperforms both baselines. The DE algorithm achieves a maximum average sum rate of 35.5 bpcu, which outperforms the GA with both mutation rates. The GA with the lower mutation rate achieves 32.5 bpcu, which is slightly better than the GA with the higher mutation rate, which achieves 32.1 bpcu.

Fig. 5 shows the average sum rate versus the transmit SNR, where we compare the proposed DQL-based algorithm, the GA with a mutation rate of 0.1, and the DE algorithm for K = 4 users. It can be seen that the proposed algorithm outperforms both baselines (GA and DE) at medium to high SNR values. When the SNR ≤ 130 dB, all algorithms achieve nearly the same average sum rate. The curves begin to diverge at SNR = 140 dB, where the proposed algorithm outperforms both the GA and DE baselines. As the SNR approaches 150 dB, the proposed algorithm yields an average sum rate of 17.8 bpcu, which is around 33% more than the average sum rate achieved by the DE and 47% more than that of the GA. Finally, it can be deduced that the DE outperforms the GA in the medium to high SNR range. However, the difference between the DE and the GA fluctuates as the SNR increases.

Fig. 6 depicts the average sum rate as a function of SNR for equal power allocation (EPA), FPA, and DQL-based power allocation, for both NOMA and OFDMA as a benchmark solution, with K = 4 users. In the case of NOMA, it can be seen that our algorithm outperforms both techniques over the entire SNR range. Moreover, as the SNR increases, the performance gap between our algorithm and the other two methods, FPA and EPA, becomes more substantial. For instance, at SNR = 150 dB, NOMA-DQL-PA yields an average sum rate of 17.1 bpcu, compared to 10.2 bpcu and 11 bpcu achieved by NOMA-EPA and NOMA-FPA, respectively.
For the case of SNR = 180 dB, the NOMA-FPA and NOMA-EPA techniques yield an average sum rate of 21 bpcu and 22 bpcu, respectively, whereas DQL-PA achieves an average sum rate of 36 bpcu, which is approximately 71% and 64% higher than NOMA-FPA and NOMA-EPA, respectively. For the OFDMA counterpart, it can be seen that the proposed algorithm still offers a noticeable enhancement over FPA and EPA. For instance, when the SNR = 180 dB, the proposed algorithm achieves 12 bpcu, compared to 8 bpcu for FPA and EPA. Finally, it can be seen that NOMA-based techniques outperform OFDMA-based techniques in terms of the average sum rate. This is expected, since in NOMA each user utilizes the entire bandwidth, whereas OFDMA divides the bandwidth between the 4 users.
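The percentage gains quoted above follow from a simple relative-difference calculation. The short sketch below recomputes them from the sum rates reported in the text at SNR = 180 dB; the helper name is illustrative, and the inputs are the rounded values read from the discussion rather than raw simulation data.

```python
def relative_gain_percent(value, baseline):
    """Gain of `value` over `baseline`, expressed in percent of the baseline."""
    return 100.0 * (value - baseline) / baseline


# Average sum rates (bpcu) quoted in the text at SNR = 180 dB.
noma_dql, noma_fpa, noma_epa = 36.0, 21.0, 22.0
ofdma_dql, ofdma_baseline = 12.0, 8.0

gain_over_noma_fpa = relative_gain_percent(noma_dql, noma_fpa)      # about 71.4
gain_over_noma_epa = relative_gain_percent(noma_dql, noma_epa)      # about 63.6
gain_over_ofdma = relative_gain_percent(ofdma_dql, ofdma_baseline)  # 50.0
```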
In Fig. 7, we compare the FPA and DQL-PA algorithms in terms of the average sum rate as a function of the LED transmission angle $\phi_{1/2}$, with K = 4. It can be seen that our algorithm outperforms FPA over the entire LED transmission angle range. More specifically, the performance gap between the two techniques increases as the transmit SNR increases. Furthermore, the impact of the LED transmission angle on the performance follows a similar pattern in both techniques. Therefore, it becomes evident from Fig. 8 that there is an optimal LED transmission angle, which is both unique and significant.
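To illustrate why the transmission angle matters, the sketch below assumes the standard Lambertian LED emission model, in which the Lambertian order is $m = -\ln 2 / \ln(\cos \phi_{1/2})$ and the on-axis radiant intensity per unit optical power is $(m + 1)/(2\pi)$. This model is not restated in this section, so its use here is an assumption, and the function names are illustrative.

```python
import math


def lambertian_order(half_angle_deg):
    """Lambertian order m for a given half-power semi-angle in degrees."""
    return -math.log(2.0) / math.log(math.cos(math.radians(half_angle_deg)))


def on_axis_gain(half_angle_deg):
    """On-axis radiant intensity per unit optical power, (m + 1) / (2 * pi)."""
    return (lambertian_order(half_angle_deg) + 1.0) / (2.0 * math.pi)


# Sweep the allowed range 30 to 70 degrees in steps of 10 degrees.
sweep = {angle: lambertian_order(angle) for angle in range(30, 71, 10)}
```

A narrower angle gives a larger order m and therefore concentrates more power near the LED axis, while a wider angle spreads it over the room. Under this model, that trade-off is one plausible reason an intermediate angle can maximize the sum rate for a given user distribution.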
In Fig. 9, the average sum rate is shown versus the LED transmission angle $\phi_{1/2}$ for different numbers of users K, using the DQL-PA algorithm. Similar to Fig. 6, we observe that the number of users K plays a vital role in defining the optimal LED transmission angle. More specifically, for the case of SNR = 170 dB, the optimal transmission angle for K = 3 is $35^\circ$, whereas for K = 6 the optimal transmission angle is around $30^\circ$. Interestingly, the optimal angle tends to decrease as the number of users increases. This phenomenon is analogous to water-filling power allocation techniques [58], in which strong users are allocated more power and, conversely, weak users are allocated less power. Also, the fact that there is a unique optimal LED transmission angle for each K motivates jointly optimizing the power allocation and the LED transmission angle using the DQL technique.

Fig. 9 shows the average sum rate as a function of the VLC AP vertical length L, using DQL-PA with tuning and FPA with a fixed LED transmission angle, with five users. In this scenario, the impact of the channel symmetry dilemma in VLC is investigated. As the vertical distance becomes large, the channel symmetry becomes worse. At SNR = 180 dB, our DQL-PA with tuning outperforms the FPA approach with no tuning by 65% to 70%. Even under the worst channel symmetry conditions, DQL-PA with tuning achieves an average sum rate of 29.5 bpcu, which is still higher than the best-case value of 19.2 bpcu for FPA with no tuning. This shows that our proposed framework outperforms the FPA benchmark with no tuning, even with varying channel symmetry.
Finally, Fig. 10 shows the average energy efficiency as a function of the cell radius r, using DQL-PA with tuning and FPA with a fixed LED transmission angle. This is an important metric, since it quantifies how much energy we expect to save by using our approach compared to the conventional scheme. It is shown that DQL-PA with tuning outperforms FPA with no tuning, even after varying the distances between the users from 3 to 7 meters. For instance, the average energy efficiency of DQL-PA with tuning at r = 7 meters and SNR = 180 dB is 7.28 b/J, compared to 5.24 b/J for FPA with no tuning. Moreover, DQL-PA with tuning at r = 7 meters outperforms FPA with no tuning at r = 3 meters by 21%.

VII. CONCLUSION
In this work, we proposed an algorithm to maximize the average sum rate and the average energy efficiency in an indoor NOMA-VLC network. We leveraged a DRL algorithm to train an agent to obtain an optimal power allocation policy for the users. Jointly with the power allocation, the agent selects the optimal LED transmission angle at the VLC AP. The obtained results demonstrated that our algorithm outperforms the GA and the DE in terms of average sum rate, and offers considerably lower run-time complexity. It was also shown that, compared to optimizing the power allocation alone, the joint optimization of the power allocation and the LED transmission angle becomes more effective as the number of users increases.