Energy-Efficient Task Offloading Under E2E Latency Constraints

In this paper, we propose a novel resource management scheme that jointly allocates transmit power and computational resources in a centralized radio access network architecture. The network comprises a set of computing nodes to which the requested tasks of different users are offloaded. The optimization problem minimizes the energy consumption of task offloading while taking the end-to-end latency, i.e., the transmission, execution, and propagation latencies of each task, into account. We aim to allocate transmit power and computational resources such that the maximum acceptable latency of each task is satisfied. Since the optimization problem is non-convex, we divide it into two sub-problems, one for transmit power allocation and another for task placement and computational resource allocation. Transmit power is allocated via the convex-concave procedure, and a heuristic algorithm is proposed to jointly manage computational resources and task placement. We also propose a feasibility analysis that finds a feasible subset of tasks. Furthermore, a disjoint method that separately allocates transmit power and computational resources is proposed as a comparison baseline. A lower bound on the optimal solution of the optimization problem is also derived based on an exhaustive search over task placement decisions and the Karush-Kuhn-Tucker conditions. Simulation results show that the joint method outperforms the disjoint method in terms of acceptance ratio, and that the optimality gap of the joint method is less than 5%.


A. Background
In order to fulfill the requirements of 5G mobile networks, key enabling technologies such as network function virtualization (NFV) and mobile edge computing (MEC) have been introduced. With NFV, the network functions (NFs) that traditionally used dedicated hardware are implemented as applications running on top of commodity servers [1]. MEC, on the other hand, aims to support low-latency mobile services by bringing remote servers closer to the mobile users [2], [3]. Moreover, MEC enables offloading the computational burden of users' tasks to reduce the impact of the limited battery power of user equipment (UE). Note that when the executing servers are NFV-enabled, they can process various types of tasks; as a result, there is no restriction on offloading a task to a predetermined server.
A typical task offloading example is shown in Fig. 1. In task offloading, the non-processed data of a task is sent from the UE to an executing server, which takes over the computational burden of the task execution. As Fig. 1 shows, the user transmits the non-processed data of the task over the wireless link to its serving base station, which results in the transmission latency T^tx. Then, the received data is forwarded to an executing server. Executing servers are placed at the base station and at distant nodes in the transport network. The data transmission through the transport network adds the propagation latency T^prop to the offloading process. Finally, the received data is processed at the executing server with execution latency T^exe and then sent back to the user over the downlink. Therefore, the end-to-end (E2E) latency of task offloading equals the summation of T^tx, T^prop, and T^exe in both uplink and downlink.
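As a concrete numerical illustration of this latency decomposition, the sketch below sums the three components for a single task; all the numbers are invented for illustration and are not values from the paper.

```python
def e2e_latency(data_bits, rate_bps, load_cycles, cpu_cps, link_prop_s):
    """E2E offloading latency as the sum of transmission, execution,
    and (round-trip) propagation latency; hypothetical model sketch."""
    t_tx = data_bits / rate_bps      # uplink transmission latency T^tx
    t_exe = load_cycles / cpu_cps    # execution latency T^exe
    t_prop = 2 * sum(link_prop_s)    # propagation latency T^prop (up + down)
    return t_tx + t_exe + t_prop

# 1 Mb of data at 20 Mbps, 5e8 CPU cycles at 2 GHz, two transport links
t = e2e_latency(1e6, 20e6, 5e8, 2e9, [1e-3, 2e-3])
# t_tx = 0.05 s, t_exe = 0.25 s, t_prop = 0.006 s
```

The doubled propagation term reflects the assumption, stated later in the paper, that the uplink and downlink use the same path.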

B. Related Works
We classify the related works on task offloading into four categories and discuss their applicability in practical scenarios.
1) Task offloading to multiple executing servers: In task offloading, a UE either offloads a task to a single predetermined executing server or selects an executing server out of multiple servers.
Offloading a task to one server in a set of executing servers in a multi-tier heterogeneous network is considered in [4], [5]. Moreover, the authors in [6]-[8] propose to offload a user's task to one of the executing servers at the base stations of a multi-cell network. Note that in the aforementioned works, the executing servers are located at the edge of the radio access network, and the computational resources in the non-radio part of the network are not considered. In contrast, [9] allows a task to be offloaded to any server in the network, i.e., servers in both the radio access and non-radio parts of the network. However, radio resources are not allocated in [9]. Note that ignoring the computational resources in the non-radio part of the network, or ignoring radio resource allocation, results in inefficient task offloading.
2) Task placement and computational resource allocation: Task offloading comprises two steps: i) task placement, which selects an executing server, and ii) computational resource allocation, which allocates the resources of the executing server to each task. In this context, various works only focus on task placement with given computational resources for each task [4]-[6], [9]-[12], while others include resource allocation as well [7], [13]-[21]. Note that the servers in the non-radio part of the network are not involved in these works. As a result, computationally intensive tasks with moderate sensitivity to latency may occupy the capacity of the executing servers in the radio part of the network while high-capacity servers in the non-radio part of the network remain underutilized.
3) Joint Radio and Computational Resource Allocation: Extensive research has been conducted on joint radio and computational resource allocation [7], [11], [13]-[27]. In these works, radio resources, including transmit power and/or bandwidth, as well as computational resources are allocated to each task. Energy-efficient resource allocation is performed in [11], [13], [17], [19], [21], [24], [26], [27], and a weighted combination of consumed energy and latency is optimized in [7], [14]-[16], [20], [22], [25]. Moreover, the impact of radio link quality without radio resource allocation is taken into account by [5], [6], [8], [10], [28], [29]. In these works, the latency of data transmission over radio links is taken into account, which impacts the optimal task placement. Note that although joint optimization of radio and computational resources increases the degrees of freedom in task offloading, the available computational resources in the radio access network are very limited, which limits the acceptance ratio of the network.

4) Feasibility Analysis:
When task offloading is subject to a maximum acceptable latency, sufficient resources are required in various parts of the network. In case of insufficient resources, a feasibility analysis is needed to determine a feasible subset of the requested tasks. One approach to dealing with infeasibility is to make simplifying assumptions, e.g., assuming sufficient available resources for task offloading [9], [22] or offloading a task only when it is beneficial, i.e., when offloading results in less energy consumption or latency [10]. In practice, however, resources are limited and tasks are subject to execution deadlines; as a result, a feasibility analysis is inevitable. The feasibility analysis is performed by introducing a binary optimization variable, which is one when the task is accepted and zero when the task is rejected [4], [7], [12], [13], [15], [19]-[21], [27]. Note that finding the optimal binary variables results in combinatorial optimization problems that are challenging and of high complexity.

C. Motivation
The performance of a task offloading method is mainly measured by its latency and energy consumption. In practice, E2E latency comes from radio links, transport network links, and execution at the servers; and the energy consumption is impacted by consumed transmit power and computational resources.
Optimizing the performance of task offloading necessitates a joint optimization of all available resources in the network. However, existing works optimize a subset of resources and focus only on one part of the whole network. Moreover, the impact of E2E latency is not considered in the literature. As a result, existing methods may not perform well in practice.
In this paper, we propose a task offloading method that optimizes the energy consumption in terms of transmit power and computational resources under E2E latency constraints. Throughout the paper, task offloading refers to the process of transmit power allocation over radio links, task placement, i.e., selecting an executing server and its path, and computational resource allocation. The proposed method jointly allocates the required transmit power to tasks, places each task in a proper NFV-enabled node, and allocates sufficient computational resources to the tasks.
With this joint method, high latency of radio links caused by weak radio channels is compensated by a proper task placement and computational resource allocation. In contrast, high execution latency caused by limited computational resources is compensated by consuming more transmit power in radio links. As a result, more tasks are served, compared to a disjoint method wherein transmit power allocation is independent of task placement and computational resource allocation.
NFV enables a general-purpose server to execute various tasks without needing a specialized server for each task. Therefore, various tasks are dynamically offloaded to general-purpose executing servers in a network of NFV-enabled nodes instead of offloading each task to a respective specialized server. As a result, a task placement method is needed to determine an executing server and its route for each task. In contrast to conventional routing methods, which choose a route to a predetermined server, our task placement method jointly determines an executing server, the associated route to the executing server, and the required computational resources in the executing server.
We assume a deadline for offloading each task, i.e., sending the task from the UE to the executing server and sending the result back to the UE must be performed under a maximum acceptable latency constraint. As a result, the sum of the latencies in the radio link, the transport network links, and the execution at the executing server must be less than the maximum acceptable latency. The feasibility of this E2E offloading method depends on the available resources and the location of the executing servers in the network. For example, when the available transmit power is low, the radio link latency is large, which may violate the E2E latency constraint. Similarly, when the available computational resources at the executing server are low, the execution latency is large, which may also violate the E2E latency constraint. Therefore, our task offloading method includes a feasibility analysis that finds a set of feasible tasks. The set of feasible tasks is obtained by solving an optimization problem that minimizes the sum of non-negative slack variables, i.e., maximizes the number of feasible tasks.
Joint task offloading results in a non-convex problem due to coupled optimization variables. Moreover, the task placement is performed by obtaining binary variables, which further complicates the optimization problem. To deal with this, we decouple transmit power allocation from task placement and computational resource allocation. Transmit power allocation is performed via the well-known convex-concave procedure (CCP), and a heuristic algorithm is proposed for task placement and computational resource allocation. CCP and the heuristic algorithm are applied alternately until convergence. Note that both CCP and the heuristic algorithm preserve the monotonicity of convergence.
We also develop two baseline methods to evaluate the efficiency of our joint task offloading method. The first is a disjoint method in which transmit power allocation is performed independently of task placement. In doing so, the maximum acceptable E2E latency of each task is divided into a radio latency constraint and a non-radio latency constraint. We allocate transmit power under the radio latency constraint. Then, the task placement and computational resource allocation are performed under the non-radio latency constraint.
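The deadline-splitting step of the disjoint baseline can be sketched as follows; the even 50/50 split ratio is our assumption for illustration, not a choice stated in the paper.

```python
def split_deadline(t_max, radio_share=0.5):
    """Split the E2E deadline T_k of the disjoint baseline into a radio
    latency budget and a non-radio (propagation + execution) budget.
    The 50/50 share is a hypothetical choice for illustration."""
    t_radio = radio_share * t_max
    t_nonradio = t_max - t_radio
    return t_radio, t_nonradio

# A task with a 100 ms deadline gets 50 ms for the radio link and
# 50 ms for transport propagation plus execution.
t_radio, t_nonradio = split_deadline(0.1)
```

Because the two budgets are fixed before any allocation, a weak radio link cannot borrow slack from a fast executing server, which is exactly the degree of freedom the joint method retains.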
The second baseline method achieves a lower bound on the optimal solution of the joint task offloading optimization problem. The lower bound is achieved by relaxing some constraints in the optimization problem, which comes from leveraging practical assumptions such as orthogonality of wireless channels in large-scale antenna array systems. The optimal solution is then found by an exhaustive search over all feasible task placement candidates, finding the optimal computational resource allocation for each placement candidate, and choosing the placement candidate that results in the lowest objective value.

D. Contributions
In this paper, we develop an energy-efficient task offloading method that offloads the computational burden of a task from a UE to one of the executing servers in a network of NFV-enabled nodes. In doing so, a task is offloaded by sending the non-processed data of the task from the UE to a remote radio head (RRH) over a radio link, sending the data from the RRH toward the executing server through a transport network, and sending the processed data back from the executing server to the UE. We assume that each task is offloaded under a respective deadline, i.e., the E2E latency of task offloading must be less than the maximum acceptable latency of the task.
The main contributions and achievements of this paper are as follows:
• We develop a joint task offloading method for a practical scenario, i.e., the proposed method allocates the transmit power, finds an executing server and the route to it, and allocates the computational resources in an energy-efficient manner. Moreover, the proposed method takes the E2E latency of task offloading into account. With the proposed method, the impact of weak radio links is compensated by placing tasks in servers closer to the UEs and consuming more computational resources. In contrast, limited computational resources are compensated by allocating more transmit power, resulting in an efficient and adaptive task offloading method.
• We propose a novel method for task placement and computational resource allocation.
While the conventional routing methods find a route to a predetermined node, our proposed method jointly finds the executing server, its associated route, and the required computational resources in an energy-efficient manner.
• We find a lower bound on the objective function of the optimization problem in the feasibility analysis, i.e., an upper bound on the acceptance ratio of the proposed method. The lower bound is obtained by relaxing some of the constraints in the optimization problem, performing an exhaustive search over all feasible task placement candidates, and finding the optimal computational resource allocation by utilizing the Karush-Kuhn-Tucker conditions.
• Simulation results show that the proposed joint method outperforms its disjoint counterpart in terms of acceptance ratio. Moreover, the lower bound on the optimal solution is almost tight, as the joint method nearly attains it in practical scenarios. Specifically, the optimality gap of the joint method is less than 5%.

E. Organization
The rest of the paper is organized as follows. Section II introduces the system model. Section III describes the optimization problem formulation. In Section IV, we propose joint task offloading while disjoint task offloading and lower bound on optimal task offloading are proposed in Sections V and VI, respectively. Simulation results are presented in Section VII and the paper is concluded in Section VIII.

F. Notation
The notation used in this paper is as follows. Vectors are denoted by bold lowercase symbols. Operators ∥·∥ and |·| denote the vector norm and the absolute value of a scalar, respectively. a^T is the transpose of a, and [a]^+ = max(a, 0). A\{a} discards the element a from the set A. Finally, a ∼ CN(0, Σ) is a complex Gaussian vector with zero mean and covariance matrix Σ.

II. SYSTEM MODEL
The structure of the radio access network, channel model, and signaling scheme as well as NFV-enabled network, computational resources, and capacity of network links are described in this section.

A. Radio Access Network (RAN)
We consider a centralized RAN architecture with a baseband unit (BBU) pool, which serves a set of U RRHs, each equipped with M antennas. The set of all users is denoted by K. Each user is equipped with a single antenna and the total number of users is K = |K|. The considered model is shown in Fig. 2. It is assumed that each RRH is connected to the BBU pool through a fronthaul link.
We assume that each user requests a single task. Task k is represented by a triplet ⟨L_k, D_k, T_k⟩, where L_k is the load of task k (i.e., the required CPU cycles), D_k is the data size of task k (in bits), and T_k is the maximum acceptable latency of task k.
Each UE transmits the non-processed data of its task to its serving RRH through a wireless link. We assume that each UE is served by a single RRH. The set of users served by RRH u is denoted by K_u = {k ∈ K : J_{u,k} = 1}, where J_{u,k} is an indicator that equals 1 if UE k is connected to RRH u (and 0 otherwise). In this paper, we assume that the UE-RRH assignment is given and fixed.
Focusing on the wireless link, we assume a narrow-band block fading channel model [21]. The channel vector between UE k and RRH u is denoted by h_{u,k} = √(Q_{u,k}) h̃_{u,k}, where Q_{u,k} represents the path loss between RRH u and UE k and the small-scale fading is modeled as h̃_{u,k} ∼ CN(0, I_M). Similar to [16], [17], we assume that the channel state information (CSI) is constant over the offloading time. As we show through simulations, this assumption is non-restrictive in practical scenarios in sub-6 GHz bands. UE k transmits a symbol x_k ∼ CN(0, 1) with transmit power ρ_k toward its serving RRH. The transmit power of UE k is constrained to a maximum value, i.e., ρ_k ≤ P_k^max, ∀k. The received signal vector at RRH u is

y_u = Σ_{k∈K} √(ρ_k) h_{u,k} x_k + n_u,

where n_u ∼ CN(0, σ_n² I_M) is the received noise vector at RRH u. We assume maximum ratio combining (MRC) at the RRHs because of its simplicity; moreover, MRC is asymptotically optimal in massive MIMO systems [30]. Therefore, the estimated signal of UE k served by RRH u is x̂_k = h_{u,k}^H y_u, and the signal to interference plus noise ratio (SINR) of UE k is

γ_k = ρ_k ∥h_{u,k}∥⁴ / ( Σ_{j≠k} ρ_j |h_{u,k}^H h_{u,j}|² + σ_n² ∥h_{u,k}∥² ).

Hence, the achievable data rate of UE k is R_k = W log₂(1 + γ_k), where W is the radio access network bandwidth, and the radio transmission latency of task k in the uplink is T_k^tx = D_k / R_k. The sum of the data rates of the UEs served by RRH u must not exceed the capacity of its fronthaul link, i.e., Σ_{k∈K_u} R_k ≤ B_{f,u}, ∀u. In this paper, similar to [10], [11], and [29], we assume that the processed data size of task k is small. Moreover, since the power budget of the RRHs is generally large, the radio transmission latency in the downlink is assumed negligible.
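The MRC receiver and the resulting per-user rate and uplink latency described above can be sketched numerically as follows; the toy dimensions, transmit powers, noise level, and bandwidth are our assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 8, 3                       # antennas per RRH, users (toy sizes)
sigma2 = 1e-3                     # noise power sigma_n^2
W = 20e6                          # bandwidth in Hz (hypothetical)
rho = np.array([0.1, 0.2, 0.15])  # transmit powers rho_k
# i.i.d. CN(0, I_M) small-scale fading (unit path loss for simplicity)
H = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)

def mrc_sinr(H, rho, sigma2, k):
    """SINR of user k under maximum ratio combining:
    rho_k ||h_k||^4 / (sum_{j!=k} rho_j |h_k^H h_j|^2 + sigma2 ||h_k||^2)."""
    h_k = H[:, k]
    signal = rho[k] * np.linalg.norm(h_k) ** 4
    interference = sum(rho[j] * abs(np.vdot(h_k, H[:, j])) ** 2
                       for j in range(H.shape[1]) if j != k)
    noise = sigma2 * np.linalg.norm(h_k) ** 2
    return signal / (interference + noise)

rates = [W * np.log2(1 + mrc_sinr(H, rho, sigma2, k)) for k in range(K)]
t_tx = [1e6 / r for r in rates]   # T_k^tx = D_k / R_k with D_k = 1 Mb
```

Note that `np.vdot` conjugates its first argument, so it computes the inner product h_k^H h_j required by the MRC correlation terms.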

B. NFV-enabled Network
The NFV-enabled network is modeled as a graph G = (N, E), where N and E are the sets of nodes and edges (or links), respectively. A typical node in N is denoted by n, while the BBU pool is indicated by n̄ (which is also a node in N). The link between two nodes m and m′ is denoted by (m, m′). Each NFV-enabled node comprises an executing server and a routing device.
The processing capacity (i.e., the maximum CPU cycles per second that can be carried out) of the executing server in NFV-enabled node n is indicated by Υ_n. Moreover, the capacity of link (m, m′) is indicated by B_(m,m′) in bps.
In this paper, we assume a full offloading scheme, i.e., the task of each user is completely executed in an executing server in the NFV-enabled network. Therefore, each task needs to be placed at a proper executing server. A task placement decision consists of selecting an NFV-enabled node n and an associated path from n̄ to n. We denote the b-th path between nodes n̄ and n by p_n^b, where b ∈ B_n = {1, ..., B_n} and B_n is the total number of paths between nodes n̄ and n. Note that a path between n̄ and n may comprise some intermediate nodes, which only forward the tasks' data via their routing devices and do not deliver the data to their executing servers. We define the decision variable ξ_{k,p_n^b}, which equals 1 when task k is offloaded to node n over path p_n^b (and 0 otherwise). Each task is offloaded to one and only one node and path when we have

Σ_{n∈N} Σ_{b∈B_n} ξ_{k,p_n^b} = 1, ∀k ∈ K.

The indicator I_{(m,m′),p_n^b} determines whether a link contributes to a path: it equals 1 when link (m, m′) contributes to path p_n^b (and 0 otherwise). Moreover, the set of all links that contribute to path p_n^b is denoted by L(p_n^b). The amount of computational resources allocated to task k is denoted by υ_k (in CPU cycles per second). Note that each task is executed at only one node. To ensure that the allocated computational resources do not violate the processing capacity of that node, we should have

Σ_{k∈K} Σ_{b∈B_n} ξ_{k,p_n^b} υ_k ≤ Υ_n, ∀n ∈ N.

Since the data of task k is sent over the network at rate R_k, the aggregate rate of all tasks that pass through a link must not exceed its capacity, which is guaranteed by the following constraint:

Σ_{k∈K} Σ_{n∈N} Σ_{b∈B_n} ξ_{k,p_n^b} I_{(m,m′),p_n^b} R_k ≤ B_(m,m′), ∀(m, m′) ∈ E.

The execution latency of task k is T_k^exe = L_k / υ_k. The processed data of task k is sent back toward the BBU pool (i.e., node n̄). In this paper, we assume that the uplink and downlink paths are the same. Therefore, the overall propagation latency of task k over path p_n^b is twice the propagation latency of that path, i.e., T_k^prop = 2 Σ_{(m,m′)∈L(p_n^b)} T_(m,m′)^prop, where T_(m,m′)^prop is the propagation latency of link (m, m′).
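The per-task checks implied by the node-capacity and link-capacity constraints above can be sketched as follows; the node names, capacities, and rates are hypothetical.

```python
def placement_feasible(path_links, rate_bps, link_cap_left, node, cpu_req, cpu_left):
    """Check the node processing-capacity constraint and the link-capacity
    constraint for placing one task at `node` over the path `path_links`.
    All remaining-capacity values are hypothetical bookkeeping state."""
    if cpu_left[node] < cpu_req:
        return False                  # executing server cannot host the task
    # every link on the path must still carry the task's rate R_k
    return all(link_cap_left[link] >= rate_bps for link in path_links)

cpu_left = {"n1": 2e9, "n2": 5e8}                         # cycles/s per node
link_cap_left = {("bbu", "n1"): 1e8, ("n1", "n2"): 4e7}   # bps per link
ok = placement_feasible([("bbu", "n1")], 5e7, link_cap_left, "n1", 1e9, cpu_left)
```

A placement algorithm would run this check for every candidate (node, path) pair and then subtract the accepted task's rate and CPU share from the remaining capacities.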
Table I summarizes the notation used in the paper.

III. PROBLEM FORMULATION
In this section, we formulate the optimization problem of joint task offloading, where each task is offloaded under its E2E latency constraint and in an energy-efficient manner. Let ξ, υ = [υ_1, ..., υ_K]^T, and ρ = [ρ_1, ..., ρ_K]^T be the vectors of all ξ_{k,p_n^b}, υ_k, and ρ_k, respectively; Λ_n denotes the computational energy efficiency coefficient of node n [18], and η is a weight. The objective function is

E(ξ, υ, ρ) = η Σ_{k∈K} ρ_k + Σ_{k∈K} Λ_{n_k} υ_k³,

where n_k is the executing node of task k. Note that the first term in E is the transmit power consumption and the second term is the power consumption of the executing servers. Therefore, the joint task offloading optimization problem is

min_{ξ,υ,ρ} E(ξ, υ, ρ) s.t. C1-C6,

under variables ξ ∈ {0, 1}, υ ≥ 0, ρ ≥ 0. Constraint C1 guarantees that the maximum acceptable latency of task offloading is respected. Constraints C2 and C3 make sure that all tasks are offloaded without violating the processing capacity of the nodes and the capacity of the links, respectively. Constraint C4 ensures the capacity of the fronthaul links. Constraint C5 guarantees the power budget of the UEs, while constraint C6 makes sure that each task is offloaded to only one node and path.
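A minimal sketch of evaluating this weighted energy objective for a given allocation is shown below; the cubic CPU power model Λ_n υ³ follows the text, while all numeric values (η, Λ_n, powers, CPU shares) are hypothetical.

```python
def offloading_energy(rho, upsilon, node_of, Lambda, eta):
    """E(xi, upsilon, rho) = eta * sum_k rho_k + sum_k Lambda_{n_k} * upsilon_k^3,
    where node_of[k] is the executing node selected by the placement xi."""
    tx_power = eta * sum(rho)
    cpu_power = sum(Lambda[node_of[k]] * upsilon[k] ** 3
                    for k in range(len(upsilon)))
    return tx_power + cpu_power

# Two tasks: 0.1 W and 0.2 W uplink power, 1 GHz and 2 GHz CPU shares,
# placed on nodes 0 and 1 with hypothetical efficiency coefficients.
E = offloading_energy(rho=[0.1, 0.2], upsilon=[1e9, 2e9],
                      node_of=[0, 1], Lambda=[1e-27, 2e-27], eta=1.0)
```

The cubic term makes the CPU contribution grow quickly with the allocated cycles, which is why the optimization later prefers the minimum CPU share that still meets each deadline.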

IV. JOINT TASK OFFLOADING (JTO)
In this section, we solve optimization problem (7). This problem is non-convex due to the integer variable ξ and the coupled variables in C1-C4. Therefore, we solve (7) by decoupling transmit power allocation from task placement and computational resource allocation. In doing so, transmit power is allocated given the task placement and the allocated computational resources. Then, we perform task placement and computational resource allocation given the allocated transmit powers. The proposed approach needs a feasible initialization; however, constraint C1 is likely to make (7) infeasible. Thus, we propose a feasibility analysis to find a feasible subset of tasks.

A. Feasibility Analysis
The feasible set of (7) is extended by adding a non-negative variable α_k to the maximum acceptable latency of task k. The feasibility problem is then constructed by replacing the objective function of (7) with the sum of the non-negative variables, i.e., Σ_{k∈K} α_k [31]. The constraints that cause infeasibility are found by solving the feasibility problem and determining the constraints with positive non-negative variables. The feasibility problem is

min_{ξ,υ,ρ,α} Σ_{k∈K} α_k s.t. C1-a: T_k^tx + T_k^exe + T_k^prop ≤ T_k + α_k, ∀k ∈ K, C2-C6, α ≥ 0.

Note that the non-negative variables are added only to C1 because when C1 is eliminated, optimization problem (7) is always feasible. Thus, we seek the tasks whose maximum acceptable latencies are violated and eliminate them one by one until a feasible subset of tasks remains. The solution to (8) not only identifies the infeasible constraints but also determines the level of infeasibility, i.e., constraints with larger non-negative variables need more resources to become feasible. Therefore, we first eliminate the tasks with the largest non-negative variables.
Without loss of equivalence, we add the summation of the inequalities in C1-a as a new constraint C7, so that optimization problem (8) is restated as (9), which is equivalent to

min_{ξ,υ,ρ,α} Σ_{k∈K} (T_k^tx + T_k^exe + T_k^prop) s.t. C1-a, C2-C6,

in which the term Σ_{k∈K} T_k is removed from the objective because it is constant. We solve (10) by decoupling transmit power allocation from task placement and computational resource allocation. In other words, we solve (10) under variables υ, ξ, α with ρ fixed, and vice versa.
To perform task placement and computational resource allocation, we need an initial ρ = ρ 0 that satisfies C3 and C4, which are satisfied with a small value of R k , i.e., small values of ρ k .
Next, we solve the following optimization problem by a heuristic method:

min_{ξ,υ,α} Σ_{k∈K} (T_k^exe + T_k^prop) s.t. C1-a, C2, C3, and C6. (11)

As in Algorithm 1, we find the variables ξ and υ that minimize the objective of (11). Then, we set the non-negative variables so that C1 becomes feasible. In doing so, for task k, we calculate the amount of unused computational resources at each node n, denoted by Ῡ_n^k. Note that, from C1, the computational resource allocation sufficient for task k is υ_temp = L_k / (T_k − T_k^tx − T_k^prop). When Ῡ_n^k ≥ υ_temp, C1 is satisfied by setting υ_k = υ_temp and α_k = 0. Otherwise, we set υ_k = Ῡ_n^k and α_k = T_k^tx + T_k^exe + T_k^prop − T_k. Next, the available computational resources of the nodes and the available capacity of the links are updated, and this process is repeated for all tasks. Note that we begin with the tasks that require lower resources, i.e., the tasks with lower values of T_k.
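One step of the slack-setting rule just described can be sketched as follows, assuming the transmission and propagation latencies of the task are already known; the numbers in the usage example are hypothetical.

```python
def set_resources_and_slack(L_k, T_k, t_tx, t_prop, cpu_avail):
    """One step of the feasibility heuristic: allocate just enough CPU to
    meet the deadline if possible; otherwise take all remaining CPU at the
    node and record the positive slack alpha_k = latency excess."""
    budget = T_k - t_tx - t_prop          # time left for execution
    v_temp = L_k / budget                 # CPU cycles/s needed to meet T_k
    if cpu_avail >= v_temp:
        return v_temp, 0.0                # feasible: alpha_k = 0
    alpha = t_tx + L_k / cpu_avail + t_prop - T_k
    return cpu_avail, alpha               # infeasible: positive slack

# 5e8 cycles, 300 ms deadline, 50 ms uplink, 10 ms propagation
v_ok, a_ok = set_resources_and_slack(5e8, 0.3, 0.05, 0.01, cpu_avail=3e9)
v_bad, a_bad = set_resources_and_slack(5e8, 0.3, 0.05, 0.01, cpu_avail=1e9)
```

In the first call the node has spare capacity, so the task gets exactly the CPU share that makes C1-a tight with α_k = 0; in the second the node is short of cycles, so α_k records by how much the deadline is missed.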
After solving (11), we allocate the transmit power by solving

min_ρ Σ_{k∈K} T_k^tx s.t. C1-a, and C3-C5. (12)

Note that the heuristic method makes C1-a active, i.e., T_k^tx = T_k + α_k − T_k^exe − T_k^prop. As a result, any feasible solution to (12) does not increase T_k^tx, because (12) is infeasible for larger values of T_k^tx. Hence, replacing (12) with its feasibility-problem counterpart does not affect the decreasing monotonicity of the objective function in (10). The feasibility problem of (12) is: find ρ s.t. C1-a, and C3-C5. (13)
Algorithm 1: Task Placement and Computational Resource Allocation.
Input: ρ
1: Sort the tasks such that T_[1] ≤ T_[2] ≤ · · · ≤ T_[|K|]
2: for k = [1] : [|K|] do
3:    % Find the feasible nodes according to the capacity of the paths terminating at each node
4:    % Find the best node and its associated path
5:    % Update the computational resource allocation and the non-negative variables

In solving (13), we note that constraints C1-a, C3, and C4 are non-convex. Therefore, we need a convexified version of (13), for which we use CCP [32]. In doing so, we reformulate C1-a as

D_k / R_k ≤ T_k + α_k − T_k^{exe,i} − T_k^{prop,i}, ∀k ∈ K, (14)

where T_k^{exe,i} and T_k^{prop,i} are the execution latency and the propagation latency of task k obtained from the heuristic method in the i-th iteration, respectively. In order to convexify (14), we need a concave approximation of R_k with respect to ρ. The rate R_k = W log₂(1 + γ_k) is equivalent to the difference of two concave functions,

R_k = h_k(ρ) − g_k(ρ),

where h_k(ρ) is the logarithm of the total received power (signal plus interference plus noise) and g_k(ρ) is the logarithm of the interference plus noise. Both h_k(ρ) and g_k(ρ) are concave functions of ρ. Thus, we find a linear approximation ĝ_k(ρ) of g_k(ρ) around the current point; since g_k is concave, ĝ_k(ρ) ≥ g_k(ρ), and h_k(ρ) − ĝ_k(ρ) is a concave lower bound on R_k. Next, we focus on the convex approximation of C3 and C4. To this aim, we find a convex approximation of R_k via the linear approximation ĥ_k(ρ) of h_k(ρ), so that ĥ_k(ρ) − g_k(ρ) is a convex upper bound on R_k. Finally, the convexified version of (13) is

find ρ s.t. convexified C1-a, C3, C4, and C5, (19)

under variable ρ ≥ 0. Note that, based on CCP, any feasible solution of (19) is also feasible in (13) [32]. The feasibility problem (8) is solved by alternately solving (11) and (19). Then, we reject the tasks that make (7) infeasible. According to Algorithm 2, we find the value of the maximum non-negative variable; if it is positive, its associated task is rejected, the set of served tasks is updated, and (8) is solved for the updated set of tasks. This procedure continues until all non-negative variables are zero. The output of Algorithm 2 is the feasible subset of tasks K′ as well as the solution of (8), i.e., the values of ξ_ini, ρ_ini, and υ_ini, which are utilized as the initialization for solving (7).
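The difference-of-concave structure of the rate and the linearization step that CCP relies on can be illustrated on a toy two-user example; the gains, powers, and bandwidth below are invented, and the sketch only checks the over-estimation property ĝ_k ≥ g_k that makes h_k − ĝ_k a valid concave lower bound on R_k.

```python
import math

W, sigma2 = 1.0, 0.1
a = [[1.0, 0.3], [0.2, 0.8]]   # a[k][j]: effective gain of user j at receiver k

def h(k, rho):   # concave: log of signal + interference + noise
    return W * math.log2(sigma2 + sum(a[k][j] * rho[j] for j in range(2)))

def g(k, rho):   # concave: log of interference + noise only
    return W * math.log2(sigma2 + sum(a[k][j] * rho[j] for j in range(2) if j != k))

def g_lin(k, rho, rho0):
    """First-order expansion of g_k at rho0. Concavity of g_k guarantees
    g_lin >= g everywhere, so h - g_lin is a concave lower bound on
    the rate R_k = h - g, as required by CCP."""
    base = sigma2 + sum(a[k][j] * rho0[j] for j in range(2) if j != k)
    val = W * math.log2(base)
    grad = [W * a[k][j] / (base * math.log(2)) if j != k else 0.0
            for j in range(2)]
    return val + sum(grad[j] * (rho[j] - rho0[j]) for j in range(2))

rho0, rho = [0.5, 0.5], [0.6, 0.4]
rate_lb = h(0, rho) - g_lin(0, rho, rho0)   # concave lower bound on R_0
```

At the expansion point the bound is tight (ĝ_k(ρ⁰) = g_k(ρ⁰)), which is why each CCP iteration cannot decrease the true rate achieved at the previous iterate.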

B. Optimization
Given the feasible solution ξ_ini, ρ_ini, υ_ini, and the set of accepted tasks K′, we seek the solution of (7). Similar to Algorithm 2, we decouple power allocation from task placement and computational resource allocation. Given ρ, the task placement and computational resource allocation sub-problem (20) minimizes the power consumption of the executing servers, Σ_{k∈K′} Λ_{n_k} υ_k³, and is non-convex. Note that the objective of (20) is an increasing function of υ_k, so allocating less computational resources to task k decreases the power consumption. However, allocating less computational resources increases the execution latency and may violate the E2E latency constraint. As a result, we need to find nodes with smaller propagation latency to compensate for the increased execution latency. In doing so, we find a subset of nodes with smaller propagation latency than the current executing server and with sufficient capacity on the links terminating at those nodes.

This set of nodes, denoted by N_k, contains every node n′ that is reachable over some path p_{n′}^{b′} whose propagation latency is smaller than that of the current path p_n^b and whose links have sufficient remaining capacity, where we assume task k was previously placed through path p_n^b. For each node in N_k, we calculate the minimum computational resources that satisfy the E2E latency constraint, i.e., υ_temp = L_k / (T_k − T_k^tx − T_k^prop). When Ῡ_{n′}^k ≥ υ_temp and Λ_{n′} υ_temp³ ≤ Λ_n υ_k³, the task placement through p_{n′}^{b′} and the computational resource allocation υ_temp are feasible and result in lower power consumption; therefore, we set υ_k = υ_temp. Otherwise, we reinstate the previous υ_k for task k.
Algorithm 3 begins with the tasks with the largest power consumption, i.e., Λ_{n_k} υ_k³, where n_k denotes the executing server of task k. This procedure is repeated for all accepted tasks.
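The acceptance test of this refinement step can be sketched as follows; node names, efficiency coefficients, and latencies are hypothetical, and only the move-or-keep decision described above is modeled.

```python
def try_move(cur_node, cand_node, t_tx, t_prop_cand, cpu_left_cand,
             Lambda, upsilon_cur, L_k, T_k):
    """Refinement step sketch: accept a candidate node only if the minimum
    CPU share meeting the deadline fits in its spare capacity AND is
    strictly cheaper under the Lambda_n * upsilon^3 power model."""
    v_temp = L_k / (T_k - t_tx - t_prop_cand)   # minimum CPU meeting T_k
    cheaper = Lambda[cand_node] * v_temp ** 3 < Lambda[cur_node] * upsilon_cur ** 3
    if cpu_left_cand >= v_temp and cheaper:
        return cand_node, v_temp                # move the task
    return cur_node, upsilon_cur                # keep the current placement

Lambda = {"edge": 2e-27, "core": 1e-27}         # hypothetical coefficients
# Task on "edge" with 2 GHz share (energy 16 W); "core" node is closer
# in propagation terms and more efficient, with 3 GHz spare capacity.
node, v = try_move("edge", "core", t_tx=0.05, t_prop_cand=0.01,
                   cpu_left_cand=3e9, Lambda=Lambda,
                   upsilon_cur=2e9, L_k=5e8, T_k=0.3)
```

Here the candidate wins: the minimum feasible share ≈ 2.08 GHz costs ≈ 9 W on the efficient node versus 16 W at the current placement, so the task moves.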
Based on CCP in Algorithm 4 and starting from ρ_0 = ρ_ini, an iterative solution of (21) provides a sub-optimal transmit power allocation. Finally, optimization problem (7) is solved via Algorithm 5, which alternately solves optimization problem (20) via Algorithm 3 and optimization problem (21) via Algorithm 4.
From the implementation point of view, BBU is responsible for gathering the required information, performing resource allocation, and sending the decisions to the associated entities.
Specifically, in JTO, the BBU needs to acquire the CSI of the UEs and the available computational resources in the NFV-enabled nodes. The CSI of each UE is estimated at its serving RRH and is forwarded through the fronthaul links with negligible latency. In addition, each NFV-enabled node sends its available computational resources to the BBU through the transport network. After performing JTO, the BBU transmits the values of the allocated powers to the RRHs. Next, the BBU forwards the received data of the tasks as well as the obtained computational resources to the associated NFV-enabled nodes based on the task placement variables. In the downlink, the processed data of the tasks is sent to the BBU, which in turn transmits the UEs' processed data to their serving RRHs.

Algorithm 4: Power Allocation in JTO.
Input: ρ_0 = ρ_ini, i = 0, ε = 10^−3, I_max^ρ = 10^2
1: repeat % Allocate power to users
2:    Solve (21) and return ρ_{i+1}

C. Convergence Analysis
In this subsection, we prove the convergence of Algorithms 2 and 5.
Proof. We show that the objective value of (8), i.e., k∈K α k , is non-increasing in each step of Algorithm 2 and since the objective value is lower bounded by zero, Algorithm 2 is convergent.
In i th iteration of Algorithm 2, Algorithm 1 sets α i+1 k either equal to 0 when E2E latency of task k is guaranteed or equal to T tx k + T exe k + T prop k − T k when E2E latency is larger than its maximum acceptable value. Therefore, we have α i+1 does not increase after i th iteration. Algorithm 1 affloads task k so that T exe k + T prop k in the objective of (11) is minimized (line 5 in Algorithm 1). As a result, Algorithm 1 does not increase the objective value of (11), i.e., k∈K (T prop . Moreover, as discussed in subsection IV-A, Algorithm 1 makes C1-a active, i.e., T tx , and therefore, any feasible solution to (13) does not increase the objective vlaue of (12), i.e., k∈K T tx . As a result, we have k∈K α i+1 k ≤ k∈K α i k , that is, Algorithm 2 is convergent. Note that Algorithm 2 may eliminate the task with maximum non-negative variable. This elimination is equivalent to removing the constraints of (8) associated with the eliminated task.
Note that eliminating a task increases the available capacity of the links in the transport network and the available computational resources in the NFV-enabled nodes. As a result, the search space of Algorithm 1 increases, which may result in lower propagation and execution latencies. Moreover, eliminating a task extends the feasible set of (13). Therefore, the data rates of the users may increase, which in turn may decrease ∑_{k∈K} T_k^tx. As a result, eliminating the task with the maximum non-negative variable does not increase the objective of (8).
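The elimination loop of Algorithm 2 can be sketched with a deliberately simplified latency model, in which all tasks share the capacity of a single node equally. The model, names, and numbers below are hypothetical; only the loop structure (compute violations α_k, eliminate the worst violator, repeat) mirrors the algorithm:

```python
def feasibility_analysis(loads, capacity, deadline):
    """Sketch of the Algorithm-2 style elimination loop (toy model).

    Each accepted task k gets an equal share of `capacity`; its latency
    is load_k / share.  alpha_k = max(0, latency_k - deadline) is the
    constraint violation.  While any alpha_k > 0, the task with the
    largest violation is eliminated, which frees resources for the rest.
    Returns the indices of the accepted (feasible) tasks.
    """
    accepted = dict(enumerate(loads))
    while accepted:
        share = capacity / len(accepted)
        alpha = {k: max(0.0, load / share - deadline)
                 for k, load in accepted.items()}
        worst = max(alpha, key=alpha.get)
        if alpha[worst] == 0.0:          # all deadlines met
            return sorted(accepted)
        del accepted[worst]              # eliminate the worst violator
    return []
```

Removing the worst violator frees capacity for the remaining tasks, mirroring the argument above that elimination does not increase the objective of (8).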

Theorem 2. Algorithm 5 is convergent.
Proof. Algorithm 5 solves (7) by alternating the minimization of (20) and (21). Therefore, we need to show that Algorithm 3 (which solves (20)) and Algorithm 4 (which solves (21)) do not increase the objective value of (7). According to line 7 of Algorithm 3, the computational resource allocation and task placement do not increase the objective value of (20). In addition, based on [32], the convergence of Algorithm 4 is guaranteed and CCP does not increase the objective of (21).
As a result, the objective value of (7) is non-increasing in each iteration, and since Ψ(ξ, υ, ρ) is lower bounded by zero, Algorithm 5 is convergent.
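The monotonicity argument behind Theorem 2 can be reproduced on a toy convex objective, where each block update is a closed-form argmin, so the objective history is non-increasing. The quadratic f and its updates are illustrative, not the paper's Ψ(ξ, υ, ρ):

```python
def alternating_minimization(x0, y0, iters=50):
    """Sketch of the Algorithm-5 style alternating scheme on the toy
    convex objective f(x, y) = (x-1)**2 + (y-2)**2 + x*y.

    Each half-step minimizes f over one block with the other fixed
    (setting the partial derivative to zero), so f is non-increasing,
    mirroring the convergence argument of Theorem 2."""
    f = lambda x, y: (x - 1) ** 2 + (y - 2) ** 2 + x * y
    x, y = x0, y0
    history = [f(x, y)]
    for _ in range(iters):
        x = 1 - y / 2          # argmin over x given y
        y = 2 - x / 2          # argmin over y given x
        history.append(f(x, y))
    return x, y, history
```

For this objective the iterates contract toward the global minimizer (0, 2), and the recorded objective values never increase, which is exactly the property the proof relies on.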

D. Summary of JTO
Herein, we summarize JTO. We obtain a set of feasible tasks by solving (8). In doing so, we decouple the power allocation from the task placement and computational resource allocation, which are performed by solving (13) and by Algorithm 1, respectively. Then, we solve (7) for the feasible tasks via Algorithm 5, which consists of the alternating minimization of (20) and (21) via Algorithm 3 and Algorithm 4, respectively.

E. Computational Complexity (CC) Analysis
In this section, we analyze the CC of the proposed algorithms. The CC of JTO is the sum of the CC of the feasibility analysis in Algorithm 2 and the CC of the optimization in Algorithm 5. Algorithm 2 includes two nested while loops, and the CC of the inner loop equals the CC of Algorithm 1 plus the CC of solving (19). The CC of Algorithm 1 depends on the computations required to calculate the parameters in Algorithm 1, which are provided in Table II, where B is the maximum number of paths between any node and n̄, and E is the total number of edges in the network graph G. Hence, the CC of Algorithm 1 is CC_1 = O(K^2 N B E).
Problem (19) is solved via CVX, which exploits the interior-point method (IPM) to find the optimal solution [33]. Based on [34] and [35], the number of iterations required for IPM to converge is log(N_c/(t^0 ε))/log ς, where N_c = 2K + U + E is the total number of constraints in (19), t^0 is the initial point for the approximation of the barrier function, ε is the desired accuracy of convergence, and 0 < ς ≪ 1 is used for updating the step size of the barrier function accuracy. Note that the inner loop in Algorithm 2 is repeated at most K I_max times. As a result, the CC of Algorithm 2 is CC_2 = K I_max (CC_1 + log(N_c/(t^0 ε))/log ς).
Algorithm 5 consists of the alternating minimization of (20) and (21) via Algorithm 3 and Algorithm 4, respectively. Note that the CC of Algorithm 3 is of the same order as that of Algorithm 1, i.e., CC_3 = CC_1. Algorithm 4 solves (21) at most I_max^ρ times, and the CC of solving (21) equals that of (19). As a result, the CC of Algorithm 4 is CC_4 = I_max^ρ log(N_c/(t^0 ε))/log ς. Algorithms 3 and 4 are executed at most I_max times. Therefore, the CC of Algorithm 5 is CC_5 = I_max (CC_3 + CC_4). Finally, the CC of JTO is CC_JTO = CC_2 + CC_5. Note that Algorithm 1, Algorithm 3, and IPM are of polynomial time complexity. Therefore, JTO is also of polynomial time complexity.
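As a quick sanity check, the IPM iteration bound above can be evaluated numerically. The constants t^0, ε, and ς below are illustrative placeholders, and log(1/ς) is used so that the count is positive for 0 < ς < 1:

```python
import math

def ipm_iterations(n_constraints, t0, eps, sigma):
    """Evaluate the interior-point iteration bound used in the CC
    analysis, ceil(log(N_c / (t0 * eps)) / log(1/sigma)).
    All constants are illustrative, not taken from the paper."""
    return math.ceil(math.log(n_constraints / (t0 * eps))
                     / math.log(1.0 / sigma))
```

With, e.g., K = 20, U = 4, and E = 10 we get N_c = 2K + U + E = 54 constraints, and for t^0 = 1, ε = 10^-3, ς = 0.1 the bound evaluates to only a handful of iterations, which is why IPM-based solves stay polynomial.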

V. DISJOINT TASK OFFLOADING (DTO)
In DTO, the transmit power allocation is independent of the task placement and computational resource allocation. The transmit power is allocated under a radio latency constraint, i.e., T_k^tx ≤ T_k^RAN. Then, the task placement and computational resource allocation are performed. The convexified sub-problem of the transmit power allocation is given in (24), subject to C4-a and C5. According to the discussion on (7), a feasibility analysis is needed for (24). Similar to JTO, the feasibility problem of (24) is given in (25). Having obtained the transmit power ρ, the task placement and computational resource allocation are performed; the associated sub-problem is given in (26), subject to C2, C3, and C6.
A feasibility analysis is also needed for solving (26). Similar to the transmit power allocation, we introduce a set of non-negative variables. The resulting sub-problem is similar to (11) with C1-a replaced by C1-f, and it is solved by Algorithm 1. After obtaining a set of feasible tasks, the CC of DTO is derived as follows. The inner loop of Algorithm 6 includes solving (25) via CVX, whose CC is log((2K + U)/(t^0 ε))/log ς. Moreover, Algorithm 6 minimizes the transmit power via CCP, whose CC is derived in Subsection IV-E.
As a result, the CC of Algorithm 6 is CC_6 = (K I_max + I_max^ρ) log((2K + U)/(t^0 ε))/log ς. Algorithm 7 includes two nested while loops, in which Algorithm 1 is executed. Based on the CC of Algorithm 1 derived in Subsection IV-E, the CC of the feasibility analysis in Algorithm 7 is K I_max O(K^2 N B E). Algorithm 7 also includes executing Algorithm 3 at most I_max times. As a result, the CC of Algorithm 7 is CC_7 = (K + 1) I_max O(K^2 N B E). Based on the above, the CC of DTO is CC_DTO = CC_6 + CC_7. Note that although CC_DTO is larger than CC_JTO, they are of the same order of complexity. The difference in CCs comes from the fact that CCP is performed at most I_max times in JTO and only once in DTO.

VI. LOWER BOUND ON OPTIMAL SOLUTION (LTO)
Since the optimization problem (8) is non-convex, we make some assumptions to resolve the non-convexity of (8) without loss of optimality. First, we note that the fiber-optic links are very likely to have sufficient capacity for carrying the traffic of the UEs, which is the case for the fronthaul links and any wired link in the transport network. As a result, we relax the constraints C3 and C4 in (8). Note that the relaxation of C3 and C4 extends the feasible set of (8), resulting in a lower bound on the optimal solution to (8). In addition, with a large number of antenna elements at the RRHs, the channel vectors between different RRHs and a specific user are approximately orthogonal, i.e., |h_{u,k}^H h_{u,j}| ≈ 0 for all j ≠ k [30]. Therefore, the interference in the wireless channels is negligible and (15) becomes (27). The elimination of the interference increases R_k for the same amount of power allocated to each UE, which again results in a lower bound on the optimal solution to (8). Based on the fact that min_{α,ξ,υ,ρ} ∑_{k∈K} α_k = min_{α,ξ,υ} min_ρ ∑_{k∈K} α_k, the optimal power allocation is the solution to (28). The data rate in (27) removes the cross-coupling impact of the power allocated to different users.
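The near-orthogonality assumption |h_{u,k}^H h_{u,j}| ≈ 0 can be checked with a short Monte-Carlo experiment: for i.i.d. Rayleigh channel vectors with M antennas, the normalized inner product decays like 1/sqrt(M). This is a self-contained sketch; the trial count and antenna numbers are arbitrary:

```python
import random

def normalized_cross_correlation(num_antennas, trials=200, seed=0):
    """Average |h_k^H h_j| / M over random unit-variance complex
    Gaussian channel pairs; favorable propagation predicts a value
    on the order of 1/sqrt(M)."""
    rng = random.Random(seed)
    s = 0.7071067811865476  # 1/sqrt(2): unit variance per complex entry
    total = 0.0
    for _ in range(trials):
        h_k = [complex(rng.gauss(0, s), rng.gauss(0, s))
               for _ in range(num_antennas)]
        h_j = [complex(rng.gauss(0, s), rng.gauss(0, s))
               for _ in range(num_antennas)]
        inner = sum(a.conjugate() * b for a, b in zip(h_k, h_j))
        total += abs(inner) / num_antennas
    return total / trials
```

For M = 4 the average normalized correlation is roughly 0.4, while for M = 256 it drops well below 0.1, supporting the interference-free approximation.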
Hence, without loss of optimality, (28) is solved for each ρ_k independently. The associated power allocation problem is (29), in which α_k in the objective is replaced with T_k^tx. Note that minimizing T_k^tx is equivalent to maximizing the data rate; since the data rate in (27) is increasing with ρ_k, the optimal solution of (29) is ρ_k = P_k^max.
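The full-power conclusion can be sanity-checked numerically: with interference removed, the Shannon rate is increasing in the transmit power, so the transmission latency is decreasing in it and is minimized at P_k^max. All parameters below are illustrative:

```python
import math

def transmission_latency(power, bandwidth, gain, noise, data_size):
    """Interference-free transmission latency
    T_tx = D / (B * log2(1 + p*g/noise)).
    Since the rate is increasing in p, the latency is decreasing in p,
    so the latency-minimizing power is simply the power budget P_max."""
    rate = bandwidth * math.log2(1.0 + power * gain / noise)
    return data_size / rate
```

For any fixed bandwidth, gain, noise, and data size, doubling the power strictly reduces the latency, which is exactly why (29) is solved by ρ_k = P_k^max.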
Note that the feasibility of C1 is ensured by optimizing the other variables.
Next, we deal with the binary optimization variable ξ. We propose an exhaustive search over all possible values of ξ to avoid any performance loss due to the non-convexity of (8) stemming from the binary ξ. The number of all possible combinations of task placement decisions equals |B|^|K|, where |B| = ∑_n |B_n|. Thus, we solve (8) for α and υ for each task placement decision and select the decision that results in the lowest ∑_k α_k as the optimal decision. Note that the exhaustive search may impose an excessive computational complexity. However, LTO is developed as a baseline for performance evaluation and is not supposed to work in real time.
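The exhaustive search over task placement decisions can be sketched as follows. The per-node capacity check (each node hosts at most a fixed number of tasks) is a simplified stand-in for the paper's constraints, and all inputs are hypothetical:

```python
from itertools import product

def exhaustive_placement(exec_time, capacities):
    """Brute-force search in the spirit of the LTO baseline.

    exec_time[k][n] is the latency of running task k on node n;
    capacities[n] caps how many tasks node n can host (toy model).
    Every assignment of tasks to nodes is enumerated; infeasible ones
    are skipped, and the feasible assignment with the lowest total
    latency is returned together with its cost.
    """
    tasks = range(len(exec_time))
    nodes = range(len(capacities))
    best, best_cost = None, float("inf")
    for assign in product(nodes, repeat=len(exec_time)):
        # capacity check: node n hosts at most capacities[n] tasks
        if any(assign.count(n) > capacities[n] for n in nodes):
            continue
        cost = sum(exec_time[k][assign[k]] for k in tasks)
        if cost < best_cost:
            best, best_cost = assign, cost
    return best, best_cost
```

With K tasks and |B| candidate servers, the loop visits |B|^|K| assignments, which is why LTO serves only as an offline baseline rather than a real-time scheme.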
The optimization problem for solving α and υ is (30), where T̄_k = T_k − T_k^prop − T_k^tx and K_n is the set of tasks to be executed at executing server n. Problem (30) is convex in both α and υ. As a result, the KKT conditions determine the optimal solution. To derive the KKT conditions, we first write the Lagrangian function as in (31). Differentiating the Lagrangian function with respect to α_k and υ_k yields the stationarity conditions (32) and (33). In addition, we have the complementary slackness conditions (34)-(37). Constraint C1-a implies υ_k > 0. Hence, from (37) we have µ_k = 0, and condition (33) yields the expression for υ_k in (38), which implies λ_n > 0. Thus, (35) implies that the computational capacity of node n is fully utilized. On the other hand, when (7) is infeasible, we have α_k > 0. Thus, (36) leads to η_k = 0, and condition (32) results in γ_k = 1. As a result, (34) then determines α_k. Combining α_k ≥ 0 with (38), the optimal non-negative variable is given in (41), wherein λ_n is found such that the capacity constraint of node n is met with equality. Then, the optimal values of α_k and υ_k are found as in (41) and (38), respectively. Having the optimal solution of (30) for all possible ξ, the optimal solution of (8) is the one with the lowest objective of (30).

VII. SIMULATION RESULTS
In this section, we evaluate the performance of JTO. The simulation setup is presented in Table III. We assume that U = 4 RRHs are placed with an inter-site distance of 100 m and all users are served in an area of 100 m radius with a given user-RRH assignment. The nodes in the transport network are divided into three tiers based on their distance from the UEs: the local tier, the regional tier, and the national tier. Although the number of serving nodes is very large, some distant nodes in each tier impose a large propagation latency. Hence, we only incorporate the nodes with a reasonable propagation latency in the transport network [7]. The network graph G consists of N = 6 nodes: n̄ at the local tier with zero propagation latency, three nodes at the regional tier with relatively low propagation latency, and two distant nodes at the national tier.
For simplicity of comparison, we assume that all nodes have the same computational capacity and all tasks are of the same size, load, and maximum acceptable latency, i.e., D_k = D, L_k = L, and T_k = T, ∀k. Moreover, we assume equal propagation latency and capacity for the network links. Note that the relatively low value of the link capacity (0.4 Gbps) is the amount of capacity reserved solely for task offloading. Finally, the simulations are performed on a 3.30 GHz Core i5 CPU with 16 GB of RAM.
Fig. 3 (a) reports the performance of the feasibility analysis in JTO, showing the acceptance ratio versus T. The acceptance ratio is defined as the ratio of the services accepted by the feasibility analysis to the total number of requested tasks. Note that the acceptance ratio increases with T. This is due to the fact that tasks with a higher T need less transmit power and fewer computational resources to be served. Moreover, for a higher T, a larger number of nodes are available for task offloading. In addition, we solve (8) by the alternate search method (ASM), in which (8) is alternately solved for each variable. Note that the sub-problem of υ is solved by CVX and the sub-problem of ξ is solved by MOSEK (details are not provided due to space limitations). The effectiveness of JTO against ASM is also shown in Fig. 3 (a). Note that for latencies smaller than 75 ms, JTO outperforms ASM. Moreover, the performance of both methods is identical for low values of T. This is due to the fact that the set of accessible NFV-enabled nodes for low values of T is restricted to n̄ and, therefore, JTO is not able to offload the tasks to more distant NFV-enabled nodes because their propagation latencies violate the E2E latency constraints.
The acceptance ratio of JTO for different numbers of tasks is shown in Fig. 3 (b). Since the amount of available resources is limited, the acceptance ratio decreases as the total number of tasks increases. Again, the superiority of JTO over ASM is observed.
The convergence of Algorithm 2 is shown in Fig. 4 (a). As expected, the sum of the non-negative variables decreases in each iteration. Furthermore, Algorithm 2 converges faster than ASM, which stems from the higher acceptance ratio of JTO.
The acceptance ratio of JTO is compared with that of DTO in Fig. 4 (b) for T = 30 ms. For DTO, we obtain the acceptance ratio for different values of T_RAN ∈ (0, T). Moreover, the acceptance ratio of the feasibility analysis in the transmit power allocation phase of DTO, i.e., Algorithm 6, is depicted. The acceptance ratio of DTO increases with T_RAN for small values of T_RAN, since small values of T_RAN impose high rates on the users, which cannot be supported due to either insufficient bandwidth or limited fronthaul capacity.
On the other hand, for larger values of T RAN , the acceptance ratio of Algorithm 6 is 1 but the task placement and computational resource allocation restricts the number of accepted tasks.
Furthermore, JTO outperforms DTO for all considered values of T_RAN. According to Fig. 5 (a), the average radio transmission latency increases with D, and subsequently the average execution latency decreases to maintain the maximum acceptable latency. Therefore, it is inferred that JTO efficiently manages the transmit power and the computational resources.
Similarly, according to Fig. 5 (b), the average execution latency increases with L, and this increase is compensated by a lower radio transmission latency.
For the task placement evaluation in Fig. 6, we consider a local node (i.e., n̄) with zero propagation latency, a regional node with 20 ms propagation latency, and a national node with 40 ms propagation latency. The propagation latencies are the summation of the uplink and downlink propagation latencies. Fig. 6 shows the task placement for different values of the processing capacity of the nodes, C = Υ_n, ∀n. When C = 1, none of the nodes is able to serve the tasks in class (1) due to their high resource utilization. However, the tasks in class (2) are mainly served at the local node and the class (3) tasks are placed at the regional and national nodes. When C = 10, some of the tasks in class (1) are placed at the local node.
Moreover, some tasks in classes (2) and (3) are served at the local node as well. Furthermore, the national node does not serve any task because JTO places the tasks at the nearest nodes in order to reduce the transmit power. When C = 20, more tasks in class (1) are served at the local node and the acceptance ratio reaches 1. Finally, when C = 50, almost all of the tasks are placed at the local node to reduce the transmit power consumption. Table IV shows the acceptance ratio of each class for different values of C. Note that the acceptance ratio of every class increases with C. Moreover, the acceptance ratio of class (1) is lower than that of classes (2) and (3). The reason is twofold: first, the high resource utilization of the tasks in this class; second, the limited number of nodes available for tasks with low latency requirements (only node n̄ in this example).

VIII. CONCLUSIONS AND FUTURE WORK
In this paper, we considered an energy-efficient task offloading problem under E2E latency constraints. We investigated the joint impact of the radio transmission, the propagation of tasks through the transport network, and the execution of tasks on the experienced latency. Due to the non-convexity of the optimization problem, we decoupled the transmit power allocation from the task placement and computational resource allocation. The transmit power allocation was solved by adopting CCP to convexify the sub-problem. The task placement and computational resource allocation were solved via our proposed heuristic method, which minimizes the sum of the propagation and execution latencies. Furthermore, to ensure the feasibility of the optimization problem, we proposed a feasibility analysis that eliminates the tasks causing infeasibility. Simulation results showed the superiority of JTO over both DTO and ASM. The performance of DTO depends on the part of the latency required to be met in the radio access network, i.e., T_RAN; however, JTO showed higher acceptance ratios for all considered values of T_RAN. As future work, we plan to incorporate task scheduling into JTO. Moreover, investigating a solution that divides the required computational load of each task among several nodes is an interesting direction for future research.