Security-Aware Data Offloading and Resource Allocation for MEC Systems: A Deep Reinforcement Learning Approach

Abstract—The Internet of Things (IoT) is permeating our daily lives, providing data collection tools and important measurements to inform our decisions. IoT devices also continually generate massive amounts of data and exchange essential messages over networks for further analysis. The promise of low communication latency, enhanced security and efficient bandwidth utilization is driving the shift from Mobile Cloud Computing (MCC) towards Mobile Edge Computing (MEC). In this study, we propose an advanced deep reinforcement learning model for resource allocation and security-aware data offloading that considers the computation and radio resources of industrial IoT devices, guaranteeing that resources shared between multiple users are utilized efficiently. This model is formulated as an optimization problem with the goal of decreasing energy consumption and computation delay. This type of problem is NP-hard due to the curse-of-dimensionality challenge; thus, a deep learning optimization approach is presented to find a near-optimal solution. Additionally, an AES-based cryptographic approach is implemented as a security layer to satisfy data security requirements. Experimental evaluation results show that the proposed model can reduce offloading overhead by up to 13.2% and 64.7% in comparison with full offloading and local execution, respectively, while scaling well to large numbers of devices.

I. INTRODUCTION

IoT applications continually generate large volumes of data [1]. Such applications include efficient manufacturing inspection, virtual/augmented reality, image recognition, the Internet of Vehicles (IoV) and e-Health [2]-[4]. To alleviate the resource constraints of mobile IoT devices and meet communication/processing delay requirements, complex computations can be offloaded to more resourceful devices [5].
Cloud computing was first exploited as a resource-rich service for mobile devices via the Mobile Cloud Computing (MCC) paradigm. MCC provides flexible processing, storage and service capabilities while reducing battery consumption. High latency is considered one of the key challenges facing MCC, especially in real-time and delay-sensitive applications. Additionally, security poses a critical challenge for MCC, where application data and services may be vulnerable to many types of attacks during the various stages of data transmission and processing [6].
Mobile Edge Computing (MEC) was recently introduced as a viable and promising solution to address MCC's challenges. In MEC, the computation capabilities of the cloud are pushed to the edge of the radio access network, in close proximity to mobile devices, resulting in a cost-efficient and low-latency architecture [7], [8]. Application domains such as predictive maintenance of industrial machines benefit from MEC's ability to provide fast, highly localised feedback that modifies a live representation of the world [9].
Numerous approaches and models for computation offloading in MEC have emerged in the literature with the goals of decreasing energy consumption, reducing computation latency and/or allocating radio resources efficiently [10]-[14]. Obtaining an optimal offloading solution in a complex and dynamic multi-user wireless MEC system is a challenging task. Additionally, the security threats encountered during data transmission have not been addressed by most offloading approaches in the literature [15]. Moreover, the lack of adequate data protection controls can quickly overshadow the advantages of the MEC paradigm. Motivated by these considerations, we present a deep reinforcement learning model that handles performance optimization in multi-user, multi-task MEC systems, together with a security layer for protecting data during transmission to the edge server. The main contributions of our paper are summarized as follows:
• Formulating a combined model of computation offloading, security and resource allocation as an optimization problem with the goal of decreasing the total time and energy overhead of mobile devices.
• Transforming the formulated problem into an equivalent reinforcement learning form, in which all possible solutions are modeled as the state space and the movements between different states as actions. A Deep-Q-Network-based algorithm is then proposed for solving this problem and obtaining a near-optimal solution efficiently.
The remainder of this study is organized as follows. Related work on offloading strategies is introduced in Section II. In Section III, our system model is presented and our optimization problem is formulated. The proposed Deep-Q-Network-based algorithm is then presented in Section IV. Section V presents the experimental evaluation and discussion. Finally, this study is concluded in Section VI, where future work directions are also presented.

II. RELATED WORK
Numerous optimization models and approaches for computation offloading in MEC environments have been proposed in the literature. Some of these models handle only multi-user single-task MEC systems, e.g., [16], whereas others deal with multi-user multi-task environments, e.g., [17]. In addition, conventional offloading methods such as Lyapunov and convex optimization techniques [18] have been used to solve these models, whereas new algorithms based on artificial intelligence and deep learning have recently emerged [11], [19]-[21]. This section provides a brief overview of the common offloading optimization models.

A. Conventional Optimization Methods
Minimizing the total energy consumption under a latency constraint for a multi-user, single-task MEC environment is the objective of [22]. The authors formulated an optimization problem to jointly optimize the computation and communication resources and the offloading decisions. Further, an efficient algorithm based on a separable semi-definite relaxation approach is developed for obtaining a near-optimal solution to this problem. However, this work neglects the deadline delay requirement of the computation tasks. Tuysuz et al. [23] proposed a novel approach for addressing video streaming mobility based on quality of experience (QoE), which can be deployed at the MEC servers. More precisely, this method first generates a session on the basis of the QoE level and collects a set of information from the user. Afterward, three core operations are performed to maintain the quality-of-experience level for each mobile device and to balance the load between mobile users based on user locations and their mobility via handover operations.
Nur et al. [24] applied the caching concept with computation offloading for a multi-user system, in which the application code and related data of completed tasks are cached at the edge server for the next execution. To reduce the energy and delay costs, [24] considers a priority for each computation task, calculated from task popularity, deadline, data size and computing resources. Nevertheless, the main drawback of [24] is the absence of security mechanisms to protect application data from attacks during transmission.
Dai et al. addressed computation offloading for a multi-user, multi-task environment in [25] and [26]. Specifically, in [25], a new two-tier offloading framework is proposed for a heterogeneous network. An optimization problem is formulated with the aim of decreasing the overall energy consumption of mobile devices and MEC servers, in which computation offloading, user association, transmission power allocation and computation resource allocation are jointly considered. Furthermore, an algorithm is developed to find the optimal offloading decision. In [26], the authors jointly considered resource allocation and offloading along with the mobility factors of vehicular edge computing systems. The load among vehicular edge computing servers is balanced by selecting the optimal offloading decision for the computation tasks, with maximizing the system utility as the main goal. However, the main drawback of [25] and [26] is that the security and privacy of data during the offloading process are not considered.
The authors of [27] and [28] presented solutions to effectively secure application data during computation offloading in MEC systems. Meng et al. [27] presented a secure and efficient offloading framework for MCC, in which regular renewal of the server key and random padding are jointly combined to protect against timing attacks. In addition, a hybrid queuing model based on a Markov chain is utilized to optimize security and performance. Elgendy et al. [28] introduced a new security layer based on the AES cryptographic algorithm combined with a genetic algorithm to protect application data during transmission. However, the management of offloading and processing in [27] is performed via a cloud data center, which results in increased delay, while [28] only addressed a multi-user single-task environment and used a computationally prohibitive method for solving the associated offloading problem, especially for large-scale environments.

B. Deep Learning Methods
Deep learning algorithms are widely used in offloading for multi-user environments [11]. For example, an offloading scheme based on deep reinforcement learning for IoT devices was proposed in [29] with the goal of minimizing the total system overhead. Specifically, the battery level, the predicted energy consumption and the channel capacity are used to select the optimal edge server for offloading the computation tasks. Then, a Deep-Q-Network-based learning algorithm is proposed to reduce the dimensionality of the state space and to accelerate learning. However, in [29], the application data is not protected from cyber-attacks during the transmission process.
A stochastic computational offloading policy for a multi-user, multi-server environment was proposed in [30]. In this work, task arrivals, computation resources and the time-varying communication qualities between mobile users and the edge server are jointly considered. The authors formulated the problem as a Markov decision process whose aim is to maximize the long-term utility performance of the entire system. Two efficient algorithms based on double Deep Q-Networks are then proposed to address the curse of dimensionality. In [31], Dai et al. proposed a novel artificial-intelligence-empowered vehicular network architecture for IoV which can intelligently orchestrate edge computing and caching resources. In addition, they jointly formulate edge computing and caching as a Markov decision process problem and design a Deep Deterministic Policy Gradient (DDPG) algorithm to allocate the computation resources efficiently. However, in [31], popular contents are shared between the vehicles at the edge cache, which makes them vulnerable to different types of attacks.
More recently, Huang et al. [32] proposed a framework based on deep reinforcement learning for online computation offloading, where resource allocation and the offloading decision are jointly formulated as a non-convex problem. The aim is to maximize the computation rate in wireless networks. A deep reinforcement learning-based online algorithm is developed for solving this problem by decomposing it into two sub-problems, namely the offloading decision and resource allocation. In addition, for rapid algorithm convergence, an order-preserving quantization method and an adaptive procedure are designed. Meanwhile, a multi-user, multi-task offloading model for IoT was proposed in [33], in which service latency, energy consumption and task success rate are jointly formulated to enhance QoE-oriented computation offloading. However, the common drawback of [32], [33] is the absence of security mechanisms to protect application data from attacks during transmission.
It is evident from the literature review that computation offloading has been investigated for multi-user environments, with both conventional methods and deep learning used to solve the resulting problems. However, handling the security issue in a MEC system, especially a multi-user, multi-task environment, has not been addressed. In this class of systems, most mobile applications provide multimedia services and generate substantial data which may be offloaded via the mobile networks. This motivates our study of jointly considering the resource allocation challenge and offloading for a multi-user, multi-task environment. In addition, we address the data security requirement during transmission to protect against various types of attacks.

III. SYSTEM MODEL
We study a multi-user MEC system with a single wireless base station and N mobile devices, represented by a set N = {1, 2, . . . , N}, as shown in Fig. 1. An edge server is associated with the wireless base station to provide computational and storage services. Furthermore, each mobile device has a set M = {1, 2, . . . , M} of different computation tasks that need to be accomplished locally or transmitted through a wireless channel and executed remotely. In our study, a quasi-static approach is assumed, in which the number of users does not change during an offloading period but may vary across different periods [28].
The next subsections present the modeling of communication, computation and security, followed by the formulation of our optimization problem.

A. Communication Model
The assumed environment has a set of N = {1, 2, . . . , N} users that are connected to a single wireless base station via a wireless channel. Each mobile device has a set of M = {1, 2, . . . , M} computationally intensive tasks that need to be completed either locally or remotely. Our aim is to reduce the system overhead in terms of communication/processing time and energy consumption.
We denote by a_{i,j} ∈ {0, 1} the offloading decision for computation task j of user i. Specifically, a_{i,j} = 0 indicates that mobile device i executes its computation task j locally, while a_{i,j} = 1 indicates that device i transmits its computation task j for remote execution. We thus define A = {a_{1,1}, a_{1,2}, . . . , a_{N,M}} as the offloading decision profile of all users.
Subsequently, in the offloading case, the uplink data rate for user i can be expressed, following the standard Shannon capacity model, as

r_i = B log2(1 + (p_i g_i) / θ_0),

where B and p_i denote the bandwidth and the transmission power of user i, and g_i and θ_0 denote the channel gain and the noise power density, respectively. Consequently, simultaneous offloading by multiple mobile devices is limited by the system bandwidth: the total bandwidth allocated to offloading users cannot exceed B. In this study, an Orthogonal Frequency Division Multiple Access (OFDMA) method is adopted for handling the transmission of multiple users in the same cell, where intra-cellular uplink transmission interference is significantly reduced [28]. Furthermore, the overhead of transmitting the result back is neglected, because the output (result) of a computation task is small in comparison with the input data size [34].
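For concreteness, a minimal numerical sketch of this rate model (assuming the standard Shannon form above; the channel gain value is hypothetical, while the bandwidth, power and noise figures loosely follow the setup in Section V):

```python
import math

def uplink_rate(bandwidth_hz: float, tx_power_w: float,
                channel_gain: float, noise_power_w: float) -> float:
    """Shannon-capacity uplink rate (bits/s) for one user."""
    return bandwidth_hz * math.log2(1 + tx_power_w * channel_gain / noise_power_w)

B = 20e6                              # 20 MHz system bandwidth
p_i = 0.1                             # 100 mW transmission power
g_i = 1e-6                            # hypothetical channel gain
theta_0 = 10 ** (-100 / 10) / 1000    # -100 dBm noise, converted to watts
r_i = uplink_rate(B, p_i, g_i, theta_0)
print(f"uplink rate: {r_i / 1e6:.2f} Mbit/s")
```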

B. Computation Model
This section presents the computation model of our system, which is composed of N mobile devices, each with M computation-intensive tasks that need to be completed. We use a tuple {I_{i,j}, C_{i,j}, τ_{i,j}} to represent a computation task, where I_{i,j}, C_{i,j} and τ_{i,j} denote the input data size of the task (code and parameters), the CPU cycles needed to accomplish the task, and the maximum tolerable delay for completing task j of user i, respectively. The values of I_{i,j} and C_{i,j} depend on the nature of the application and can be obtained using a program profiler [35].

In the following subsections, the computation overhead of the local and edge server computing approaches is introduced with respect to both execution time and energy consumption.
1) Local Execution Approach: In the local execution approach, user i executes task j locally on its own computation resources. The time and energy consumed by processing task j of user i locally can be calculated as

T^l_{i,j} = C_{i,j} / f^l_i,    E^l_{i,j} = ξ_i C_{i,j},

where f^l_i and ξ_i denote the computational capability (CPU cycles/second) and the per-CPU-cycle energy consumption of user i, respectively.
2) Edge Server Execution Approach: In the edge server execution approach, task j of user i is transmitted and processed remotely. The time and energy consumed by offloading and executing task j of user i remotely, i.e., task transmission plus execution, can be calculated as

T^e_{i,j} = I_{i,j} / r_i + C_{i,j} / f^e_i,    E^e_{i,j} = p_i (I_{i,j} / r_i),

where f^e_i denotes the edge computation capability (CPU cycles/second) allocated to user i. This study assumes that the edge server's computational resources are shared equally among all users.
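A short sketch of these local and edge cost models, with hypothetical task and device parameters loosely following the experimental setup in Section V:

```python
def local_cost(C, f_l, xi):
    """Local execution: time = C/f_l (s), energy = xi*C (J)."""
    return C / f_l, xi * C

def edge_cost(I, C, r, f_e, p):
    """Edge execution: transmit I bits at rate r, then run C cycles at f_e.
    Device-side energy is dominated by transmission; the result download
    is neglected as in the model above (cf. [34])."""
    t_tx = I / r
    return t_tx + C / f_e, p * t_tx

# Hypothetical task: 1 MB input, 1000 cycles/bit (as in Section V).
I = 1e6 * 8                  # input size in bits
C = I * 1000                 # required CPU cycles
t_l, e_l = local_cost(C, f_l=1e9, xi=5e-11)
t_e, e_e = edge_cost(I, C, r=50e6, f_e=20e9, p=0.1)
print(f"local: {t_l:.2f} s, {e_l:.3f} J")   # 8.00 s, 0.400 J
print(f"edge:  {t_e:.2f} s, {e_e:.3f} J")   # 0.56 s, 0.016 J
```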

C. Security Model
During the offloading of computation tasks and their related data to an edge server, the offloaded data may be vulnerable to different types of attacks. To mitigate these data security risks, a new layer is introduced to fulfil the data security requirements. AES is used to encrypt/decrypt application data during transmission due to its efficient security and performance [36].
First, each user receives the offloading decision from the edge server, which determines whether the mobile user will offload its computation task. In the offloading case, the user is issued a secret key to encrypt the transmitted data using 128-bit AES before sending the encrypted data to the edge server. The edge server then uses the same key to decrypt the received data and executes the computation task on it. Finally, the edge server sends the result back to the user.
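A minimal sketch of this encrypt/decrypt exchange using the PyCryptodome library; the paper fixes the key length (128 bits) but not the mode of operation, so EAX mode is assumed here for its built-in integrity check:

```python
from Crypto.Cipher import AES
from Crypto.Random import get_random_bytes

key = get_random_bytes(16)    # 128-bit secret key issued to the user

def encrypt_task_data(data: bytes, key: bytes):
    """User side: encrypt task input data before offloading."""
    cipher = AES.new(key, AES.MODE_EAX)
    ciphertext, tag = cipher.encrypt_and_digest(data)
    return cipher.nonce, ciphertext, tag

def decrypt_task_data(nonce: bytes, ciphertext: bytes, tag: bytes, key: bytes):
    """Edge-server side: decrypt and verify the received data."""
    cipher = AES.new(key, AES.MODE_EAX, nonce=nonce)
    return cipher.decrypt_and_verify(ciphertext, tag)

nonce, ct, tag = encrypt_task_data(b"face-recognition task input", key)
assert decrypt_task_data(nonce, ct, tag, key) == b"face-recognition task input"
```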
We denote β_i ∈ {0, 1} as the security decision for user i. Specifically, β_i = 0 indicates that the computation task data of user i is offloaded without encryption, whereas β_i = 1 indicates that the computation task data of user i is encrypted using our security layer before being transmitted to the edge. We therefore define β = {β_1, β_2, . . . , β_N} as the security profile. The extra overhead of applying this layer is determined by η_{i,j} and δ_{i,j}, the CPU cycles needed for encrypting and decrypting the data at user i and at the edge server, respectively [37], [38]. Moreover, combining the security, computation and communication models, the total time and energy consumed by processing task j of user i can be defined as

T_{i,j} = (1 − a_{i,j}) T^l_{i,j} + a_{i,j} T^r_{i,j},    E_{i,j} = (1 − a_{i,j}) E^l_{i,j} + a_{i,j} E^r_{i,j},

where T^r_{i,j} and E^r_{i,j} denote the total remote time and energy of our model with security consideration, i.e., the edge transmission and execution cost of Section III-B plus the encryption/decryption overhead incurred when β_i = 1. Finally, from Eq. (9) and Eq. (10), the total time and energy overhead can be calculated as

O_{i,j} = w^t_i T_{i,j} + w^e_i E_{i,j},

where w^t_i, w^e_i ∈ [0, 1] are weighting parameters for the time and energy consumption of user i.
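A sketch of this per-task overhead computation, assuming the weighted-sum form described above, in which the security cost is incurred only when a task is offloaded with encryption enabled:

```python
def total_overhead(a, beta, t_local, e_local, t_remote, e_remote,
                   t_sec, e_sec, w_t, w_e):
    """Weighted time/energy overhead for one task.
    a    : offloading decision (0 = local, 1 = edge)
    beta : security decision   (0 = plain, 1 = encrypted offloading)
    t_sec/e_sec: extra encryption/decryption cost, applied only to
                 offloaded tasks with the security layer enabled."""
    if a == 0:
        t, e = t_local, e_local
    else:
        t = t_remote + beta * t_sec
        e = e_remote + beta * e_sec
    return w_t * t + w_e * e

cost = total_overhead(a=1, beta=1, t_local=8.0, e_local=0.4,
                      t_remote=0.56, e_remote=0.016,
                      t_sec=0.05, e_sec=0.002, w_t=0.5, w_e=0.5)
```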

D. Problem Formulation
In this section, an optimization model for a multi-user, multi-task environment is formulated with the goal of decreasing the total system overhead of users with respect to communication/processing time and energy; the formulation, Eq. (14), is sketched below. The first two constraints, C1 and C2, are the energy and time limits for each computation task j. Constraints C3 and C4 are the uplink data rate capacity and the CPU computation capacity of an edge server node, where F is the total CPU resource at each edge server. Finally, constraint C5 ensures that the offloading decision variable is binary.
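A sketch of this formulation, reconstructed from the constraint descriptions above (E^max_{i,j} and R are labels introduced here for the per-task energy budget and the uplink rate capacity, and are not the authors' notation):

```latex
\min_{\mathbf{a}} \; \sum_{i=1}^{N}\sum_{j=1}^{M}
    \bigl( w_i^{t}\, T_{i,j} + w_i^{e}\, E_{i,j} \bigr)
\quad \text{s.t.} \quad
\begin{aligned}
 &C_1:\; E_{i,j} \le E_{i,j}^{\max}, \quad \forall i, j\\
 &C_2:\; T_{i,j} \le \tau_{i,j}, \quad \forall i, j\\
 &C_3:\; \textstyle\sum_{i=1}^{N} r_i \le R,\\
 &C_4:\; \textstyle\sum_{i=1}^{N} f_i^{e} \le F,\\
 &C_5:\; a_{i,j} \in \{0, 1\}, \quad \forall i, j
\end{aligned}
```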
Eq. (14) would be a linear problem whose optimal solution could be found by obtaining the values of the offloading decision vector a. However, since a is a binary variable, the feasible set and the objective are non-convex, which makes the problem difficult to solve, especially for a large number of users. This is due to the curse-of-dimensionality problem, in which the problem size grows rapidly as the number of users increases [39]. Therefore, a deep reinforcement learning-based algorithm is proposed to obtain near-optimal values of a.

IV. PROBLEM SOLUTION

A. Reinforcement Learning
Reinforcement learning is a branch of machine learning that allows a system to learn how to behave within an unknown dynamic environment and make decisions in an optimal way without being explicitly programmed or requiring human intervention. Fig. 2 shows a general illustration of a reinforcement learning scenario, in which the agent, environment, state, action and reward are the main components. At time step t, the agent receives an observation of state s_t and chooses an action a_t, which moves the agent from state s_t to a new state s_{t+1} according to the policy π = P(a_t | s_t). The agent then obtains a reward r_t and transitions to state s_{t+1} on the basis of the reward function R(s, a) and the state transition probability P(s_{t+1} | s_t, a_t) [40]. These steps are repeated until the agent reaches the terminal state; the main goal is maximizing the expected cumulative reward, defined as R_t = Σ_{k=0}^{∞} γ^k r_{t+k} with a discount factor γ ∈ [0, 1]. The Q-learning algorithm is one of the most popular reinforcement learning algorithms; its learning method is based on recording Q-values in the form of a Q-table. This table holds the state-action pairs, in which the row headers represent the system states S, the column headers represent the system actions A, and each cell holds the quality value Q(s, a) of taking an action from that state with respect to the long-term accumulated reward. Q(s, a) is calculated as

Q(s, a) ← Q(s, a) + α [ r(s, a) + γ max_{a′} Q(s′, a′) − Q(s, a) ],

where Q(s, a) and Q(s′, a′) denote the current and new Q-values for the corresponding state and action, respectively. In addition, r(s, a) denotes the reward obtained when selecting action a in state s, and max_{a′} Q(s′, a′) denotes the maximum expected future reward given the new state s′ over all possible actions in that state. Finally, α and γ denote the learning rate and the discount factor, respectively. In this study, the computation offloading decision a_{i,j} is used to represent the state s = {a_{i,j}}, while the corresponding movement among different states represents the action space A; this is discussed in more detail in the following subsection.
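The update rule above translates directly into a one-step tabular routine; the table dimensions and values in this sketch are purely illustrative:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.01, gamma=0.99):
    """One tabular Q-learning step:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

# Toy Q-table: 4 states x 2 actions.
Q = np.zeros((4, 2))
q_update(Q, s=0, a=1, r=1.0, s_next=2)
```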
Regarding our optimization problem in Eq. (14), the Q-learning algorithm is not effective for obtaining the optimal solution, as the complexity of the problem grows rapidly with the number of users and their computation tasks; this leads to an increase in the number of state-action pairs. Moreover, it becomes difficult to store and compute the corresponding Q-values in the Q-table, and solving the problem becomes computationally prohibitive as the number of state-action pairs increases exponentially [39]. Therefore, a Deep Q-Network (DQN) is adopted to handle this Q-learning limitation by estimating the Q-value function instead of storing the Q-table, as we show in the next subsection.

B. Deep Q-Network
DQN is one of the most effective reinforcement learning algorithms, in which a neural network with parameters ω is used to approximate the Q-value function and to generate the value of each action, as shown in Fig. 3. In DQN, the state is given as the input of the neural network, and the Q-values of all actions are generated as the output. In addition, an ε-greedy strategy is used to select the action: a random action is selected with probability ε ∈ (0, 1), i.e., exploration, and a = arg max_{a_t} Q(s(t), a(t); ω) with probability 1 − ε, i.e., exploitation.
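The ε-greedy selection reduces to a few lines; this sketch assumes the Q-values are already available as a vector:

```python
import numpy as np

def epsilon_greedy(q_values: np.ndarray, epsilon: float) -> int:
    """Explore with probability epsilon, otherwise exploit argmax Q."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))   # exploration
    return int(np.argmax(q_values))               # exploitation
```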
In this study, an efficient DQN algorithm is proposed for solving our optimization problem, presented in Eq. (14), and obtaining the near-optimal offloading decision. The optimization problem first needs to be transformed into an equivalent reinforcement learning form, in which all possible solutions are modeled as the state space and the movements between different states as actions; the reward value is then calculated from the objective function. Consequently, the state space, actions and reward of the problem can be defined as follows:
• State: The state space S is represented by the computation offloading decision X = {a_{1,1}, a_{1,2}, . . . , a_{N,M}}, which is a 1 × NM vector. Therefore, at an arbitrary step t, the system state can be defined as s(t) = {a_{1,1}(t), a_{1,2}(t), . . . , a_{N,M}(t)}. (16)
• Action: The action space A is represented by the movement between two different states. In this study, a system action is defined as an index selection within the state vector, by which the agent moves from the current state to a specific neighboring state. Specifically, a variable v denotes the selected index, where v = 1, 2, . . . , NM, and the action a(t) = {a_v(t)} is a 1 × NM vector; a small sketch of this encoding is given after this list.
• Reward: At each step t, the agent obtains a reward R(s, a) for taking action a in state s, which serves as a scalar feedback signal indicating how well the agent is doing. Since the system state s(t) represents the computation offloading decision, the objective function of our problem, Z(t), can be evaluated from the state s(t), where {a_{i,j}(t)} is given by s(t) according to the definition in Eq. (16). Based on the values of Z_{s(t)}(t) and Z_{s(t+1)}(t + 1), the reward of the state-action pair (s(t), a(t)) is defined as r(t) = 1 if Z_{s(t)}(t) > Z_{s(t+1)}(t + 1), r(t) = −1 if Z_{s(t)}(t) < Z_{s(t+1)}(t + 1), and r(t) = 0 otherwise. (18)
In this study, a pre-classification step is applied to the state space, in which computation tasks satisfying the local completion-time deadline constraint, i.e., T^l_{i,j} ≤ τ_{i,j}, are forced to execute locally on the mobile device, i.e., a_{i,j} = 0.
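A short sketch of this state-action encoding; interpreting "neighboring state" as toggling the offloading bit at the selected index is an assumption:

```python
import numpy as np

N, M = 5, 3                          # users x tasks, as in the experiments
state = np.zeros(N * M, dtype=int)   # offloading decisions a_{i,j}, flattened

def apply_action(state: np.ndarray, v: int) -> np.ndarray:
    """Move to the neighboring state selected by index v
    (assumed: flip the offloading decision at position v)."""
    nxt = state.copy()
    nxt[v] ^= 1
    return nxt

next_state = apply_action(state, v=4)   # offload the task at index 4
```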
As shown in Fig. 3 and Algorithm 1, DQN can be used to solve our optimization problem in Eq. (14). First, given the state, action and reward definitions, the evaluation and target Q-networks are initialized with random weights ω and ω′, respectively, and the replay memory Y is initialized with capacity L. Then, for each episode k, an initial state s_init is chosen. Afterward, for each time step t, following the ε-greedy strategy, the evaluation network generates a random action a(t) with probability ε ∈ (0, 1) and a = arg max_{a_t} Q_pre(s(t), a(t); ω) with probability 1 − ε. The reward r(t) and the next state s(t + 1) are then obtained on the basis of Eq. (18), and the transition (s(t), a(t), r(t), s(t + 1)) is stored in the experience replay Y. To update the evaluation network, a random minibatch of transitions (s(k), a(k), r(k), s(k + 1)) is sampled from Y, and the predicted and labeled Q-values are calculated as Q_pre = Q(s(t), a(t); ω) and Q_lab = r(t) + γ max_{a′} Q_tar(s(t + 1), a′(t); ω′) using the evaluation and target networks, as shown in Procedure 1. A loss function measuring the difference between the predicted and labeled Q-values is adopted for the neural network, and the Gradient Descent algorithm [41] is used to minimize this loss. Finally, the parameters ω′ of the target network are updated every C steps. A compact code sketch of this training loop is given below.
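This sketch uses TensorFlow/Keras, the paper's own environment. The network sizes, the placeholder overhead function Z, the bit-flip transition, the episode handling and all hyperparameter values are illustrative assumptions; only the structure (ε-greedy selection, experience replay, and a target-network reset every C steps) mirrors Algorithm 1:

```python
import random
from collections import deque
import numpy as np
import tensorflow as tf

N_M = 15                 # state length (N*M), e.g., 5 users x 3 tasks
GAMMA, EPS, LR = 0.99, 0.1, 0.01
BATCH, MEM, C_STEPS = 32, 512, 100

def build_net():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(N_M,)),
        tf.keras.layers.Dense(N_M)   # one Q-value per index-selection action
    ])

q_eval, q_tar = build_net(), build_net()
q_tar.set_weights(q_eval.get_weights())
q_eval.compile(optimizer=tf.keras.optimizers.Adam(LR), loss="mse")
memory = deque(maxlen=MEM)

def Z(state):                        # placeholder for the system overhead
    return float(np.sum(state))

state = np.zeros(N_M)
for t in range(1000):                # single flattened episode for brevity
    if random.random() < EPS:        # epsilon-greedy action selection
        action = random.randrange(N_M)
    else:
        action = int(np.argmax(q_eval.predict(state[None], verbose=0)))
    nxt = state.copy()
    nxt[action] = 1 - nxt[action]    # move to the neighboring state
    reward = 1.0 if Z(state) > Z(nxt) else (-1.0 if Z(state) < Z(nxt) else 0.0)
    memory.append((state, action, reward, nxt))
    if len(memory) >= BATCH:         # Procedure 1: update evaluation network
        s, a, r, s2 = map(np.array, zip(*random.sample(list(memory), BATCH)))
        q_lab = q_eval.predict(s, verbose=0)
        q_lab[np.arange(BATCH), a] = r + GAMMA * q_tar.predict(s2, verbose=0).max(1)
        q_eval.train_on_batch(s, q_lab)
    if t % C_STEPS == 0:             # reset target network every C steps
        q_tar.set_weights(q_eval.get_weights())
    state = nxt
```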

V. EXPERIMENTAL EVALUATION AND ANALYSIS
This section first introduces the experimental setup. Afterward, an extensive discussion of the simulation results is presented to critically assess the performance of our proposed model.

A. Experiment Setup
Our simulation is undertaken on a personal computer with an Intel® Core(TM) i7-4770 CPU at 3.4 GHz and 16 GB of RAM. Python is used for development; the software environment consists of TensorFlow and NumPy with Python 3.6 preinstalled on Windows 10 Professional 64-bit [42]. A multi-user, multi-task environment with five users is considered. The system bandwidth, background noise and transmission power of each device are set to 20 MHz, −100 dBm and 100 mW, respectively. Each mobile user runs a face recognition application as an example, consisting of three independent computation tasks, namely face detection, pre-processing, and feature extraction and classification. The data size is uniformly distributed in (0, 10) MB, while the required CPU cycles are set to 1000 cycles/bit. Each user's computational capability is assigned randomly from the set {0.5, 0.6, . . . , 1.0} GHz, while the edge server's CPU computational capability is set to 100 GHz. The energy consumption of each mobile device is uniformly distributed within (0, 20 × 10^−11) J/cycle [34]. For the DQN algorithm, the number of episodes, the mini-batch size and the replay memory size are set to 20000, 32 and 512, while the discount factor, learning rate and ε-greedy values are set to 0.99, 0.01 and 0.1, respectively.
Finally, to verify the performance of our algorithm, five different policies are compared:
• Unsecure DQN: Our model applied without the security layer.
• Secure DQN: Our model applied with the security layer added.
• Local Execution: All computation tasks are processed locally.
• Full Offloading: All computation tasks are processed remotely.
• Random Offloading: A random set of computation tasks is processed remotely, while the remaining tasks are executed locally.

1) Convergence Performance:
This subsection studies the convergence performance of the proposed algorithm, in which different values of each parameter are tested and the most suitable value is selected for the subsequent simulations. Fig. 4 demonstrates the convergence of the total cost for different learning rates, which control the update speed of ω. The figure shows that convergence with a value of 0.01 is faster than with 0.001, and that this speed increases as the learning rate increases. However, with a large learning rate, i.e., 0.1, the process cannot converge well and falls into a locally optimal solution. It is therefore important to choose a learning rate suited to the specific situation; accordingly, we set the learning rate to 0.01, which is the most appropriate value. Fig. 5 depicts the effect of different memory sizes on the convergence performance. The figure shows that with a smaller memory size, convergence is faster, but a locally optimal solution is obtained instead of a global one. Therefore, in the following simulations, the replay memory size is set to 1024, which is the most appropriate value. Fig. 6 demonstrates the convergence of the proposed algorithm for different batch sizes, which determine the number of experience samples extracted from memory at each training interval. Based on these results, the batch size is set to 32 in the subsequent simulations.
Algorithm 1 DQN-based Computation Offloading Algorithm
1: Initialize the evaluation and target Q-network parameters with random weights ω and ω′, respectively (ω′ = ω)
2: Initialize replay memory Y with capacity L
3: for each episode k = 1, 2, . . . , K do
4:   Choose an initial state s_init
5:   for each step t do
6:     Generate a random number ϕ ∈ [0, 1]
7:     if ϕ < ε then
8:       Randomly select an action a(t)
...
12:    Execute the action a(t) and calculate Z_{s(t)}(t) according to Eq. (18)
13:    if Z_{s(t)}(t) > Z_{s(t+1)}(t + 1) then
14:      Set r(t) = 1
15:    else if Z_{s(t)}(t) < Z_{s(t+1)}(t + 1) then
...
20:    Save the transition (s(t), a(t), r(t), s(t + 1)) in Y
21:    Execute Procedure 1 to update the evaluation network
22:    Reset ω′ = ω after each C steps

Procedure 1
...
Calculate the labeled Q-value: Q_lab = r(k) + γ max_{a(k+1)} Q_tar(s(k + 1), a(k + 1); ω′)
6: end if
7: Optimize the parameters ω using the gradient descent algorithm, minimizing the loss between the predicted and labeled Q-values

2) System Performance: This subsection presents and discusses the simulation results of our proposed model. First, the overhead of processing the computation tasks under the five defined policies for different numbers of users is shown in Fig. 7. The figure demonstrates that with 3 users, the overhead of our proposed DQN algorithm, with and without the security layer, equals that of the full offloading policy and is lower than that of the other two policies. As the number of users increases, our model with and without the security layer achieves a lower overhead than the full offloading policy. This is because the shared communication channels become overloaded as the number of users grows, which increases the communication time. Moreover, our model can optimally select which computation tasks should be offloaded and which should not, while minimizing the total cost of users.

Similarly, Fig. 8 illustrates the total cost of executing the computation tasks under the five policies versus the data size of each task. As seen in this figure, the total cost of all five policies increases with the input data size of each task. Additionally, our DQN algorithm, with and without the security layer, outperforms the other policies. Moreover, the full offloading curve rises much more rapidly than the other four policies as the input data size increases: as the amount of transmitted data grows, the communication time also grows, leading to a significant increase in the total cost of the entire system.
Finally, Fig. 9 shows the total overhead of processing the computation tasks for different MEC server capacities. The figure shows that the local execution policy is unaffected by the server capacity, since all of its tasks are executed on the mobile devices.

VI. CONCLUSION
Our study proposed a resource allocation and security-aware data offloading model for a multi-user, multi-task environment. A new, efficient security layer based on the AES algorithm is introduced to protect the communicated data against attacks. In addition, a combined model of security, resource allocation and computation offloading is formulated as an optimization problem with the goal of reducing the total time and energy overhead of mobile users. Furthermore, to obtain the optimal solution in practice, an equivalent reinforcement learning form is given, in which the state space is defined by all available solutions and the movements between different states define the actions. An efficient DQN-based algorithm is then proposed for solving this problem and finding a near-optimal solution. Simulation results demonstrate that the proposed model can achieve overhead reductions of up to 13.2% and 64.7% in comparison with the full offloading and local execution approaches, respectively. Additionally, our DQN-based approach was shown to scale well to large networks. As future work, a new compression layer will be added to our model to reduce the size of transmitted data, thereby reducing transmission time and enhancing overall system performance. Additionally, mobile users' mobility will be managed efficiently, allowing each user to move dynamically among different edge servers within an offloading period.