A Joint Communication and Learning Framework for Hierarchical Split Federated Learning

In contrast to methods that rely on centralized training, emerging Internet of Things (IoT) applications can employ federated learning (FL) to train a variety of models for improved performance and privacy preservation. FL calls for the distributed training of local models at end-devices, which demands substantial processing power (i.e., CPU cycles/s). Many end-devices, such as IoT temperature sensors, have limited computing power. One solution to this problem is split FL. However, split FL has its own problems, including a single point of failure, fairness issues, and a poor convergence rate. We propose a novel framework, called hierarchical split FL (HSFL), to overcome these issues. Our HSFL framework is built on grouping. Within each group, partial models are computed at the devices, with the remaining work done at the edge servers. After the local models are computed, each group performs local aggregation at the edge. The end-devices are then given access to this edge aggregated model so they can update their local models. After a set number of rounds, this procedure produces a distinct edge aggregated HSFL model for each group. These edge aggregated HSFL models are then shared among the edge servers and aggregated to produce a global model. Additionally, we formulate an optimization problem that takes into account the relative local accuracy (RLA) of the devices, the transmission latency, the transmission energy, and the edge servers' computing latency in order to reduce the cost of HSFL. The formulated problem is a mixed-integer nonlinear programming (MINLP) problem and cannot be solved easily. To tackle this challenge, we decompose the formulated problem into two subproblems: an edge computing resource allocation subproblem and a joint RLA minimization, wireless resource allocation, task offloading, and transmit power allocation subproblem. Due to its convex nature, the edge computing resource allocation subproblem is solved using a convex optimizer, whereas a block successive upper-bound minimization (BSUM)-based approach is used for the joint RLA minimization, wireless resource allocation, task offloading, and transmit power allocation subproblem. Finally, we present the performance evaluation findings for the proposed HSFL scheme.

Latif U. Khan, Mohsen Guizani, Fellow, IEEE, Ala Al-Fuqaha, Senior Member, IEEE, Choong Seon Hong, Senior Member, IEEE, Dusit Niyato, Fellow, IEEE, and Zhu Han, Fellow, IEEE

I. INTRODUCTION
APPLICATIONS for the Internet of Things (IoT) are designed to serve a large number of users by satisfying their different needs [1], [2], [3], [4], [5], [6]. IoT solutions must be carefully designed in order to meet these needs. Effective modeling of IoT network functions can help us optimize the performance of IoT systems. Such modeling techniques can be based on a variety of theories, including optimization theory, game theory, graph theory, and heuristics. However, some IoT problems seem to be challenging to model accurately using the aforementioned techniques [7], [8], [9], [10]. One can utilize machine learning (ML) to get around this restriction. Generally, centralized ML relies on moving data from end-devices to a single location for training, and thus it suffers from privacy leakage. Federated learning (FL), which does not require transferring data from devices to a centralized server for training, can be used to address the privacy leakage problem associated with centralized ML [11], [12], [13]. Devices in FL learn their local models and send them to a server at the edge or in the cloud, where they are aggregated to produce a global model. Despite the fact that FL can maintain privacy more effectively than centralized ML, it faces several challenges, including resource optimization, a single point of failure (SPF), incentive design, and learning algorithm design [14], [15], [16]. Additionally, devices with insufficient computational power may be unable to train a local learning model within the allotted time.
The single aggregation server in FL has the potential to be attacked by an unauthorized user or to malfunction as a result of physical damage, which would degrade the FL performance [11]. To motivate the end-devices toward learning a global FL model, the works in [17] and [18] presented incentive mechanisms based on the Stackelberg game and auction theory, respectively. The works in [19], [20], and [21] presented the concept of split FL to resolve the issue of resource constraints in IoT devices. Different from the existing works [19], [20], and [21], we propose a novel hierarchical split FL (HSFL) framework. We consider resource optimization, resolve the SPF issue, and perform hierarchical aggregations to improve the convergence speed for resource-constrained devices. The SPF issue is resolved by enabling FL with distributed aggregation so as to avoid a central point of failure. Also, we modify the learning algorithm by performing local aggregations prior to the global aggregation to improve the convergence speed. The issue of resource-constrained devices is resolved by performing partial local model learning at the devices and the remainder at the edge servers, in a similar fashion to traditional split FL. Note that our proposed framework is based on distributed aggregations of edge aggregated models and on true decentralization. In traditional hierarchical FL [22], [23], [24], a leader-follower relationship is assumed: a global aggregation system acts as the leader and the IoT devices as followers. In traditional hierarchical FL, some of the groups used for edge aggregated models may perform better than others and will thus influence the global model more. In our proposed decentralized aggregation-based HSFL, different groups can be tuned for edge aggregations and local iterations based on their performance to obtain a fairness-enabled global FL model. The key contributions of this article are given below.
1) We propose a novel framework, namely, HSFL. The latter computes partial local models at end-devices and the remaining parts at the edge servers. Additionally, distributed aggregations are considered to avoid the SPF issue. To improve the convergence speed, HSFL uses local aggregations prior to computing the global one. HSFL provides a tradeoff between the communication cost due to edge aggregations and the global HSFL accuracy: an increase in edge aggregations improves convergence but increases the communication cost.
2) A cost function is defined that accounts for the relative local accuracy (RLA), the overall latency (i.e., the transmission delay and the edge server computing delay), and the energy consumption (i.e., the transmission energy).
3) Since the formulated problem is nonconvex, we divide the main problem into two subproblems: a joint RLA minimization, transmit power allocation, resource allocation, and association subproblem, and an edge server computing resource optimization subproblem. We employ a convex optimization-based method for allocating computing resources at the edge servers. We utilize a block successive upper-bound minimization (BSUM)-based technique for the joint RLA minimization, transmit power allocation, resource allocation, and association subproblem due to its nonconvex nature.
4) We conclude by presenting the findings from our performance assessment of the proposed HSFL framework.

II. RELATED WORK

A. Resource Optimization in FL
Various research attempts [12], [25], [26], [27], [28], [29] have discussed resource optimization for FL. The work in [26] considered FL over wireless networks and formulated a problem to reduce the overall FL time. To optimize the wireless resource allocation and the local device operating frequencies, the authors proposed a heuristic approach for reducing the global FL time. Although the authors demonstrated a performance gain, the high complexity of the heuristic strategy may not be preferred. Additionally, the performance of [26] can be further improved by performing efficient power allocation and association in the case of multiple base stations. The work in [12] discussed FL incentive systems and resource optimization; the authors also presented open challenges and a Stackelberg game-based incentive mechanism. Another work [30] proposed a framework for collaborative learning and communication for FL across wireless networks. The authors discussed how FL performance is impacted by the packet error rate. They formulated an optimization problem to reduce the FL cost by jointly optimizing resource and transmit power allocation, employing bipartite matching for resource allocation while deriving an equation for the optimal power allocation. A complexity analysis was also carried out. Wang et al. [27] offered a control strategy for resource-constrained scenarios that provides a tradeoff between local model updates and global model aggregation for loss function minimization; finally, the authors used a real-world data set to validate their idea. Tran et al. [28] analyzed wireless FL. They formulated an optimization problem and identified two types of tradeoffs: 1) between communication and computing latencies and 2) between the FL model computing time and the local energy consumption. All of the works in [12], [26], [27], and [28] used a single aggregation server, which can fail if the centralized server crashes or goes out of commission.

B. Split FL
The works in [19], [20], and [21] considered split learning. Vepakomma et al. [19] presented a split neural network for collaborating healthcare institutions. In split learning, a partial model is learned at the device, and the rest of the layers are computed at the server. They also identified two configurations of split neural networks. Finally, the authors presented simulation results for a split neural network using the CIFAR-100 and CIFAR-10 data sets. In another work [20], the authors coupled split learning and FL for devices with limited resources. In their framework, a set of resource-constrained devices learns a part of the local learning models and sends the remainder to nearby servers for computation. Once the complete local models have been computed through interaction between the server and the devices, they are combined at the central server to produce a global model. The end-devices then receive this global model to further update their local models. Finally, the authors validated their concept using CIFAR-10, MNIST, and FMNIST. Liu et al. [21] also considered split learning and FL. Their framework consists of a set of devices, each of which computes a partial local model that is then shared with its corresponding server. Additionally, they used a chest X-ray data set for COVID-19 recognition and MNIST to assess how well their suggested architecture performs. All of the works in [19], [20], and [21] presented split learning, but they did not take into account the robustness problems brought on by an SPF. Additionally, the proposals of [19], [20], and [21] must be improved for faster convergence, and an effective analysis of split learning over wireless networks is still needed.

C. Hierarchical FL
Few works [22], [23], [24] have studied hierarchical FL. Liu et al. [22] presented a client-edge hierarchical FL. Every edge server is associated with a set of clients. The clients compute their own local models and share them with the edge servers, which perform aggregation. After aggregation, the aggregated models are exchanged with the devices for an update. This procedure repeats for a predetermined number of iterations. Finally, the locally aggregated models at the edge servers are aggregated at a cloud to yield a global FL model. Another work [23] also presented hierarchical FL, where edge aggregated models are computed at small cell base stations (SBSs) via iterative interaction between the devices and the SBSs. Finally, a global FL model is computed by aggregating the edge aggregated models. Wang et al. [24] performed a thorough investigation of the mathematical convergence of hierarchical FL. All the works in [22], [23], and [24] are based on a centralized aggregation server, which might cause performance degradation in the case of a centralized server failure. Additionally, there may be fairness issues in hierarchical FL: some of the groups may perform better and thus influence the global model more than groups with poor performance. Therefore, we must take the fairness issue into account while designing hierarchical FL schemes. A summary of the comparison of our work with the existing works is given in Table I.

III. PROPOSED FRAMEWORK
We present our proposed HSFL framework as depicted in Fig. 1(a). The framework consists of a set of resource-constrained IoT devices involved in learning. Generally, the IoT devices perform multiple tasks simultaneously (e.g., learning, augmented reality, signal processing, and gaming). Different devices can have different computing resources (i.e., CPU cycles/s) available for the learning task (i.e., local model computation). Some devices are able to compute their models faster than devices with less computing power. Additionally, due to resource constraints, some devices might not be able to compute their complete local learning models. Therefore, those devices compute a part of their models and offload the remaining learning task to nearby edge servers. Note that output labels are required for computing a loss function at the edge servers. There are two possible ways to compute the loss function: edge labels-enabled HSFL and device labels-enabled HSFL. In the former, output labels are available at the edge servers, whereas in the latter, the server sends the output of the last layer to the devices for computing the loss function. Here, we assume that the edge servers have the output labels and compute the loss functions at the edge. The computing capacity of an edge server strictly determines the number of devices that can offload their tasks to it. Consequently, the number of devices per edge server must respect the maximum available computing power at that edge server. After the remaining parts of the devices' local models are computed at the edge, the computed task data are shared with the devices to update their weights. Next, the devices send the local models to the edge servers for aggregation to produce edge aggregated models, as shown in Fig. 1(b). The steps of the framework are given below (a code sketch follows the list).

1) Step 1: Compute the partial local models at end-devices due to computing resource constraints.
2) Step 2: Offload the partial local models to the edge.
3) Step 3: Learn the remaining parts of the local models at the edge.
4) Step 4: After computing the remaining local models using the partial models and data labels, send the gradients to the devices for updating their weights.
5) Step 5: Send the updated weights of all devices to their associated edge servers, where edge aggregations take place.
6) Step 6: The local devices receive the edge aggregated models for a further update. This process repeats for a fixed number of edge iterations.
7) Step 7: The edge aggregated models are shared among the different edge servers. The transfer of edge aggregated models can take place either through a core network or wirelessly. For a core network, the delay is lower because of the high-speed optical fiber link.
8) Step 8: After the edge aggregated models are shared, a global model is computed at every edge server in a distributed manner, thus avoiding an SPF.

To enable HSFL for a wireless network, one must efficiently allocate wireless and computing resources. Both local devices and edge servers provide computing resources. A local device's energy consumption grows with its operating frequency. However, as the local operating frequency rises, the computation time for the local model decreases. As a result, we must carefully balance the device operating frequency against the time required to compute the local model. Meanwhile, there must be an effective learning task offloading scheme for end-devices; transmission latency is an effective criterion for the offloading decision. Additionally, we must take into account the computing power at both devices and servers while computing local models.
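To make the control flow of Steps 1-8 concrete, the following is a minimal, self-contained Python sketch of the HSFL loop. The split model is abstracted as device-side and edge-side weight vectors, and the toy gradient stands in for real backpropagation; the names and shapes here are illustrative placeholders, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

U, N_u = 2, 3          # groups (edge servers) and devices per group
DIM_D, DIM_E = 8, 4    # sizes of device-side and edge-side weight vectors
EDGE_ITERS, GLOBAL_ROUNDS = 5, 3

# Device-side and edge-side partial models for every device in every group.
w_dev = rng.normal(size=(U, N_u, DIM_D))
w_edge = rng.normal(size=(U, N_u, DIM_E))

def local_gradient(w):
    """Toy gradient: pull weights toward zero (stands in for real backprop)."""
    return w  # gradient of 0.5 * ||w||^2

for rnd in range(GLOBAL_ROUNDS):
    for _ in range(EDGE_ITERS):
        for u in range(U):
            for n in range(N_u):
                # Steps 1-4: partial computation at the device, remainder at
                # the edge, then gradient updates on both partial models.
                w_dev[u, n] -= 0.1 * local_gradient(w_dev[u, n])
                w_edge[u, n] -= 0.1 * local_gradient(w_edge[u, n])
        # Steps 5-6: edge aggregation within each group, broadcast back.
        for u in range(U):
            w_dev[u, :] = w_dev[u].mean(axis=0)
            w_edge[u, :] = w_edge[u].mean(axis=0)
    # Steps 7-8: edge servers exchange edge aggregated models; every server
    # computes the same global model, so there is no single point of failure.
    w_dev[:, :] = w_dev.mean(axis=(0, 1))
    w_edge[:, :] = w_edge.mean(axis=(0, 1))
    print(f"round {rnd}: ||w_dev|| = {np.linalg.norm(w_dev):.4f}")
```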

IV. SYSTEM MODEL AND PROBLEM FORMULATION
We consider a system that consists of a set $\mathcal{N}$ of N devices deployed randomly. Every device n has a local data set $\mathcal{D}_n$ comprising $D_n$ data points. Meanwhile, there is a set $\mathcal{M}$ of M edge server-enabled SBSs. Each edge server $m \in \mathcal{M}$ has a certain computing power $C_m$ that limits the association of end-devices with it. Table II contains a list of important notations. Depending on their available processing power, the devices compute a portion of their local models before sending the rest to the edge servers. After the local models are computed, an edge aggregated model is obtained. The aggregated models at the edge are then sent back to the devices. This procedure repeats for a specific number of edge iterations.

TABLE I COMPARISON OF RELATED WORKS WITH THE PROPOSED WORK
After computing the models aggregated at the edge, they are shared among the different servers. Finally, the global model is computed. Now, we present the communication model.

A. Communication Model
For communication, we utilize orthogonal frequency division multiple access (OFDMA). The transmission of HSFL updates between the edge servers and the devices uses a collection of R resource blocks. We suggest reusing resource blocks already utilized by cellular users. Let the binary variable $x_{n,r}$ denote the allocation of resource blocks to devices (i.e., $x_{n,r} = 1$ when device n is assigned resource block r and $x_{n,r} = 0$ otherwise). The devices and the cellular users will interfere with one another. The devices engaged in HSFL, however, use orthogonal resource blocks, preventing interference among themselves. The size of the HSFL local updates depends on the architecture and the type of the local learning model (e.g., CNN) [31]. Therefore, similar to the works in [11] and [30], we use a single resource block per device for communication. No more than one device may be assigned a particular resource block, i.e.,

$$\sum_{n=1}^{N} x_{n,r} \le 1 \quad \forall r \in \mathcal{R}. \tag{1}$$

TABLE II SUMMARY OF KEY NOTATIONS
For all devices, the allotment of wireless resource blocks must also not exceed the wireless resources that are readily available, that is,

$$\sum_{n=1}^{N} \sum_{r=1}^{R} x_{n,r} \le R. \tag{2}$$

When a device n uses a resource block r, the SINR is given by

$$\Gamma_{n,r} = \frac{p_n h_{n,r}}{\sum_{t \in \mathcal{T}} p_t h_{t,r} + \sigma^2}$$

where $p_n$ and $h_{n,r}$ denote the transmit power of device n and the gain of the wireless channel, respectively, and $\sum_{t \in \mathcal{T}} p_t h_{t,r}$ and $\sigma^2$ denote the interference of the cellular users and the noise power, respectively. The devices' transmit power must stay within a certain range, i.e.,

$$0 \le p_n \le P_{m} \quad \forall n \in \mathcal{N}. \tag{5}$$

In our system, the sum of the devices' transmit powers must additionally satisfy a total power constraint. Each device must offload its task to at most one edge server, i.e.,

$$\sum_{m=1}^{M} y_{n,m} \le 1 \quad \forall n \in \mathcal{N}$$

where $y_{n,m}$ denotes the variable that indicates the association of devices with edge servers (i.e., $y_{n,m} = 1$ if device n is associated with edge server m and $y_{n,m} = 0$ otherwise). The serving capacity of each edge server must not be violated. The achievable throughput can be written as

$$r_n = \sum_{r=1}^{R} x_{n,r}\, B \log_2\left(1 + \Gamma_{n,r}\right)$$

where B is the resource block bandwidth, used for all resource blocks in our model. Next, we discuss the wireless HSFL model.
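Before doing so, a minimal numeric sketch of the SINR and throughput expressions above may be helpful; the bandwidth, noise power, channel gains, and interference values below are all assumed placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

N, R = 4, 6                  # devices and resource blocks
B = 180e3                    # resource block bandwidth in Hz (assumed value)
sigma2 = 1e-13               # noise power (assumed value)
p = rng.uniform(0.05, 0.2, size=N)        # device transmit powers (W)
h = rng.exponential(1e-10, size=(N, R))   # device channel gains
I_cell = rng.exponential(1e-13, size=R)   # cellular-user interference per block

# Allocation satisfying (1): each device gets one block, each block <= 1 device.
x = np.zeros((N, R), dtype=int)
x[np.arange(N), rng.permutation(R)[:N]] = 1

sinr = p[:, None] * h / (I_cell[None, :] + sigma2)
rate = (x * B * np.log2(1.0 + sinr)).sum(axis=1)  # achievable throughput per device
for n in range(N):
    print(f"device {n}: rate = {rate[n]/1e6:.2f} Mb/s")
```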

B. HSFL for Wireless Networks
In HSFL, a set $\mathcal{N}$ of N devices is divided into U groups, each with $N_u$ devices. Each device $n_u$ of group u has a local data set of $k_{n_u}$ data points. By optimizing the device weights, HSFL seeks to minimize the global loss function [32]. For instance, the loss function for regression differs from the loss function for image classification tasks. On the other hand, end-devices generally have limited computing power. Therefore, end-devices compute only a portion of their local models due to computing resource constraints. In addition to learning, the devices have other tasks (e.g., augmented reality) to perform. Therefore, the available computing power for performing the learning task at device n is $f_n = \tilde{f}_n - f_n^{\text{mis}}$, where $\tilde{f}_n$ is the total computing power of device n and $f_n^{\text{mis}}$ is the computing power occupied by miscellaneous tasks. Depending on the available computing power $f_n$, a part of the model is learned by the device and the remaining part is learned by the edge server. Let a device n complete its task (i.e., the partial computation of the local model) by processing $e_n$ computing points, each requiring $q_n$ CPU cycles. The partial local model is sent to the edge server for further computation. Following the transmission of the output (i.e., the partial local model) of device n to edge server m, the computing time (i.e., for processing the $v_n$ computing points of the partial local model) at the server is given by

$$T^{\text{comp}}_{n,m} = \frac{v_n q_n}{c_{n,m}}$$

where $c_{n,m}$ and $q_n$ denote the continuous variable representing the computing power assigned by edge server m to the learning task of device n and the CPU cycles needed for processing a single data point, respectively. Note that every edge server has a certain computing power to serve the end-users, and therefore the total computing power allocated by edge server m should stay within a limit. The computing power allocated by the edge servers must take values within a defined range, i.e.,

$$c_{\min} \le c_{n,m} \le c_{\max} \quad \forall n \in \mathcal{N},\; m \in \mathcal{M}. \tag{11}$$

Moreover, the overall allocated computing power should not exceed the computing power that is readily available, i.e.,

$$\sum_{n=1}^{N} y_{n,m}\, c_{n,m} \le C_m \quad \forall m \in \mathcal{M}. \tag{12}$$

The devices update their models locally and communicate them to the edge following the partial local model computation and the gradient sharing with the end-devices. The time taken by device n for partial model computing and the weight update after receiving gradients from the edge server is given by

$$t^{\text{cmp}}_{n} = \log(1/\theta)\, \frac{e_n q_n}{f_n}$$

where θ denotes the RLA. The lower the value of the RLA, the better the performance of the local learning model, and vice versa. Note that the local accuracy of the devices significantly affects the number of global HSFL rounds for a fixed global accuracy. For a global HSFL accuracy $\epsilon$ and relative accuracy θ, we can write [17], [28]

$$I^{g}(\epsilon, \theta) = \frac{\chi \log(1/\epsilon)}{1 - \theta} \tag{14}$$

where χ is a constant that depends on the local data set. It is evident from (14) that, for a fixed global HSFL accuracy, a drop in the RLA results in a proportional decrease in the number of global communication rounds, and vice versa. The local accuracy performance is influenced by a number of variables, including the local data set, the local iterations, and the local model design.
One can usually state that, for a particular local data set and architecture, the local iterations generally determine the local model performance. In HSFL, there is a deadline to compute the local model. An increase in the local operating frequency is needed in order to increase the local iterations while meeting the local model computing deadline. The device energy consumption, however, increases when the computation frequency is increased. Meanwhile, every end-device has limited backup power. As a result, we must balance the RLA performance against the local operating frequency.
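As a rough numeric illustration of the relation in (14), here is a minimal sketch under the assumption that the number of global rounds scales as χ/(1−θ) while the per-round local work grows as log(1/θ); χ and the θ values are arbitrary stand-ins.

```python
import math

chi = 2.0  # data-set-dependent constant (assumed value)
for theta in (0.1, 0.3, 0.5, 0.7, 0.9):
    global_rounds = chi / (1.0 - theta)   # fewer rounds for smaller theta ...
    local_iters = math.log(1.0 / theta)   # ... but more local work per round
    print(f"theta={theta:.1f}: ~{global_rounds:.1f} global rounds, "
          f"local-iteration factor {local_iters:.2f}")
```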
The energy a device consumes scales with the square of its local computing frequency [28]. Meanwhile, there are limitations on the available backup power of the devices. Therefore, the local operating frequencies of the devices should be selected wisely. Alternatively, improving the local model (i.e., reducing θ) within the allowable time while keeping the other parameters fixed requires increasing the local iterations. However, such an increase in the local device frequency increases the energy consumption. Therefore, we must make a tradeoff between θ and the local energy consumption. The requirement that every device's energy usage remains within the permitted limit can be represented as

$$E^{\text{cmp}}_{n}(\theta, f_n) \le O_{\min} \quad \forall n \in \mathcal{N} \tag{15}$$

where $O_{\min}$ depends mainly on the energy consumption required for training a local model; lower values of θ demand more local energy, and vice versa. Also, θ must lie within a range, i.e.,

$$0 < \theta < 1. \tag{16}$$

Let $u_n$ denote the size of the local model output. Then, the transmission latency for sending $u_n$ to a server m at the network edge is given by

$$t^{\text{trans}}_{n} = \frac{u_n}{r_n}$$

where $r_n$ is the achievable throughput defined above. The transmission energy can be given by

$$e^{\text{trans}}_{n}(\boldsymbol{X}, \boldsymbol{P}) = p_n\, t^{\text{trans}}_{n}.$$

A summary of the HSFL algorithm is given in Algorithm 1. Initially, all devices compute their partial local models, owing to computing resource constraints, and send them to the edge servers for further computation (lines 6-8). The computed model parameters are sent back to the devices for updating their weights (line 9). Then, the local learning model weights are updated (lines 9 and 10). After the local learning models of all devices have been computed, the edge aggregated models are computed (line 13). The computed edge aggregated models are shared among all edge servers. In the end, the computation of a global model takes place at every edge server (line 17).

Algorithm 1 HSFL
7: Send partial local models to edge servers.
8: Compute partial models at edge servers.
9: Share the gradients of edge servers with the devices to update their local weights.
10: Send the local weights to the edge servers.
12: end for
13: Compute the edge aggregated model for each group.
14: end for
16: Step 2: Global aggregation
17: $\omega_{n_u} \leftarrow \omega_g$
19: end for

C. Problem Formulation
First, we define a cost that considers both latency and energy. The latency in our model is due to the local model computing latency, the transmission latency, and the edge server computing latency. The delay due to computing the partial local models at the edge servers is given by

$$T^{\text{edge}}(\boldsymbol{Y}, \boldsymbol{C}) = \sum_{n=1}^{N} \sum_{m=1}^{M} y_{n,m}\, \frac{v_n q_n}{c_{n,m}}.$$

For an edge aggregated model, the latency $T^{\text{trans}}(\theta, \boldsymbol{X}, \boldsymbol{Y}, \boldsymbol{P}, \boldsymbol{C})$ is obtained by scaling the per-round computation and transmission latency by the number of global rounds in (14) [17]. By applying Taylor's approximation, (20) can be rewritten in a form that is tractable for optimization (21). The total latency cost for the HSFL is the local model computation latency at the edge plus $T^{\text{trans}}$:

$$T_{\text{HSFL}}(\theta, \boldsymbol{X}, \boldsymbol{Y}, \boldsymbol{P}, \boldsymbol{C}) = T^{\text{trans}} + T^{\text{edge}}. \tag{22}$$

The energy cost accounts for the transmission energy. Similar to (21), the cost associated with the transmission energy, $E_{\text{HSFL}}$, can be rewritten accordingly (23). As a whole, we write the cost of HSFL as

$$C_{\text{HSFL}}(\theta, \boldsymbol{X}, \boldsymbol{Y}, \boldsymbol{P}, \boldsymbol{C}) = \alpha E_{\text{HSFL}} + (1 - \alpha) T_{\text{HSFL}} \tag{24}$$

where α ∈ (0, 1) is a scaling constant that balances the proportions of latency and energy. Note that θ depends mainly on the local model architecture and the local data set distribution. Next, we present our formulated problem:

$$\text{P-1:} \quad \min_{\theta, \boldsymbol{X}, \boldsymbol{Y}, \boldsymbol{P}, \boldsymbol{C}} \; C_{\text{HSFL}}(\theta, \boldsymbol{X}, \boldsymbol{Y}, \boldsymbol{P}, \boldsymbol{C}) \tag{25}$$

$$\text{subject to: } (1){-}(3),\ (5){-}(8),\ (11){-}(12),\ (15),\ (16). \tag{25a}$$

Problem P-1 is a mixed-integer nonlinear programming (MINLP) problem. Moreover, problem P-1 is NP-hard. A problem belongs to NP if a candidate solution can be verified in polynomial time, i.e., if it can be solved in polynomial time by a nondeterministic machine. A problem is NP-hard if every problem in NP can be reduced to it in polynomial time; NP-hard therefore means "at least as hard as any NP problem." Finding an exact solution to our problem P-1 is thus very hard, but the problem can be tackled after simplification and the application of suitable schemes. It has three continuous variables: the RLA θ, the devices' transmit power matrix P, and the edge server computing capacity matrix C. The other two variables, the resource allocation X and the association Y, are binary.
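For concreteness, the following is a minimal sketch of evaluating the cost in (24) for one candidate allocation, using the latency and energy expressions above; all numeric values (sizes, rates, powers) are assumed placeholders rather than the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 4, 2
alpha = 0.4                               # latency/energy scaling constant

v = rng.integers(50, 100, size=N)         # edge-side computing points per device
q = 1e6                                   # CPU cycles per computing point (assumed)
c = rng.uniform(1e9, 2e9, size=(N, M))    # edge CPU cycles/s granted per device
y = np.zeros((N, M)); y[np.arange(N), rng.integers(0, M, size=N)] = 1

u = rng.uniform(1e5, 5e5, size=N)         # local model update size (bits)
rate = rng.uniform(1e6, 5e6, size=N)      # achievable throughput (bit/s)
p = rng.uniform(0.05, 0.2, size=N)        # transmit powers (W)

T_edge = (y * (v[:, None] * q) / c).sum()       # edge computing latency
t_trans = u / rate                              # per-device transmission latency
T_hsfl = T_edge + t_trans.sum()
E_hsfl = (p * t_trans).sum()                    # transmission energy
C_hsfl = alpha * E_hsfl + (1 - alpha) * T_hsfl  # cost (24)
print(f"T={T_hsfl:.3f} s, E={E_hsfl:.5f} J, cost={C_hsfl:.3f}")
```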

V. PROPOSED SOLUTION

A. Problem Decomposition
Problem P-1 has five variables: the transmit power variable P, the relative accuracy variable θ, the association variable Y, the edge computing resource variable C, and the resource allocation variable X. θ varies with many parameters, including the local model architecture, the local data set quality (e.g., noise level), and the local data set distribution. Problem P-1 is difficult to solve because of its MINLP nature, and it becomes NP-hard for a large number of devices, resource blocks, and edge servers. Therefore, we cannot apply convex optimization schemes directly. To solve P-1, we first relax the binary variables into continuous variables. Then, P-1 can be rewritten as

$$\text{P-2:} \quad \min_{\theta, \tilde{\boldsymbol{X}}, \tilde{\boldsymbol{Y}}, \boldsymbol{P}, \boldsymbol{C}} \; C_{\text{HSFL}}\left(\theta, \tilde{\boldsymbol{X}}, \tilde{\boldsymbol{Y}}, \boldsymbol{P}, \boldsymbol{C}\right)$$

In P-2, $\tilde{x}_{n,r}$ and $\tilde{y}_{n,m}$ are the continuous versions of the binary variables $x_{n,r}$ and $y_{n,m}$, respectively. Because of its nonconvex character, the problem remains highly complex and challenging to solve even after the conversion to continuous variables. As a result, we divide the core problem into subproblems, as depicted in Fig. 2. We first decompose problem P-2 into two subproblems: 1) joint optimization of transmit power, resource allocation, association, and RLA and 2) optimization of the edge server computing resource allocation. The problem for optimizing association, power allocation, resource allocation, and RLA can be written as follows:

subject to (C1)−(C10), (C12), (C13). (27a)
Problem P-3 is a nonconvex programming problem. Additionally, it still has four continuous variables that are difficult to solve for directly, and it cannot be solved using conventional convex optimization schemes. Therefore, one either transforms it into a convex problem and then applies convex optimization, or uses a scheme that is well suited to nonconvex problems (e.g., BSUM). Next, we define the edge computing resource allocation subproblem as follows:

$$\text{P-4:} \quad \min_{\boldsymbol{C}} \; C_{\text{HSFL}}(\boldsymbol{C})$$

subject to: (C8), (C11). (28a)

B. Proposed Solution
Here, the solutions of the various subproblems are discussed. First, we solve subproblem P-4, which is the edge server computing resource allocation problem. To solve the convex subproblem P-4, we can use a convex optimizer.
Lemma 1 (Convexity of Subproblem P-4): We use the Hessian matrix to prove the convexity of P-4 for all feasible values of the variable C. For P-4, the Hessian can be computed as follows [33], [34]: $C_{\text{HSFL}}(\boldsymbol{C})$ consists of a summation of terms in $c_i$, $i = 1, 2, \ldots, M$. Taking the partial derivatives of the off-diagonal terms results in 0. From (29), we obtain an $M \times M$ diagonal matrix $\boldsymbol{\gamma}$ with each diagonal element equal to $2 v_n q_n / c_m^3$, where $c_m$ denotes the computing resources assigned by edge server m to its associated devices. Positive semidefiniteness requires

$$\boldsymbol{z}^{\mathsf{T}} \boldsymbol{\gamma}\, \boldsymbol{z} \ge 0 \quad \forall \boldsymbol{z}. \tag{30}$$

For all feasible positive values (i.e., between $c_{\min}$ and $c_{\max}$) of the variable C, it is clear that (30) is fulfilled. This shows that the objective function of P-4 is a convex function. Additionally, the constraints are linear inequality constraints. Therefore, subproblem P-4 is a convex optimization problem.
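Since P-4 is convex by Lemma 1, it can be handed to an off-the-shelf convex solver. Below is a minimal sketch using CVXPY for a single edge server, where the objective mirrors the edge computing delay $\sum_n v_n q_n / c_n$ and the bounds play the role of constraints (11) and (12); all problem data are assumed values, not the paper's settings.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(3)
N = 5                                   # devices served by one edge server
v = rng.integers(50, 100, size=N)       # computing points offloaded per device
q = 1e6                                 # CPU cycles per computing point (assumed)
C_m = 5e9                               # total CPU cycles/s at the edge server
c_min, c_max = 1e8, 3e9                 # per-device allocation range

c = cp.Variable(N)                      # computing resource per device
delay = cp.sum(cp.multiply(v * q, cp.inv_pos(c)))   # convex for c > 0
prob = cp.Problem(cp.Minimize(delay),
                  [cp.sum(c) <= C_m, c >= c_min, c <= c_max])
prob.solve()
print("optimal total delay (s):", prob.value)
print("allocation (GHz):", np.round(c.value / 1e9, 3))
```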
Now, we present the solution of subproblem P-3, which is a nonconvex problem; therefore, one cannot solve it directly using a convex optimization solver. There are two ways to solve P-3: 1) transforming the nonconvex problem into a convex one and 2) using an approach that works well for the nonconvex problem P-3. Transforming the nonconvex problem into a convex problem is very difficult, and therefore we solve P-3 using a scheme that is suitable for nonconvex problems. Our scheme is based on a modification of the BSUM approach presented in [26]. BSUM solves the problem using distributed computation in parallel and offers easier problem decomposability and fast convergence [35]. BSUM addresses problems of the form

$$\min_{\boldsymbol{v}} \; g(\boldsymbol{v}_1, \ldots, \boldsymbol{v}_O) \quad \text{subject to} \quad \boldsymbol{v}_o \in \mathcal{A}_o \; \forall o$$

where $\mathcal{A} := \mathcal{A}_1 \times \mathcal{A}_2 \times \cdots \times \mathcal{A}_O$, the function $g(\cdot)$ is continuous, and $\boldsymbol{v}_o$ and $\mathcal{A}_o$ denote a block of variables and a closed convex set, respectively. To solve for a single block of variables, one can use a BCD scheme as follows:

$$\boldsymbol{v}^{t+1}_o = \arg\min_{\boldsymbol{v}_o \in \mathcal{A}_o} \; g\left(\boldsymbol{v}_o, \boldsymbol{v}^{t}_{-o}\right)$$

where $\boldsymbol{v}^{t}_{-o}$ collects the most recent values of all other blocks. It is very difficult to obtain a solution of these block updates by applying the BCD scheme, the reason being the nonconvex nature of the problem. To address this challenge, our work considers BSUM, which minimizes an upper-bound function $b(\boldsymbol{v}_o, \boldsymbol{y})$ of $g(\boldsymbol{v}_o, \boldsymbol{y}_{-o})$. One can consider different upper-bound functions (e.g., Jensen's upper bound and the quadratic upper bound) for BSUM. We make the following assumptions, similar to previous works using BSUM [26], [35].
Assumptions: We consider the following assumptions. To guarantee the global upper-bound nature of b, assumptions 1) and 2) are used [26], [35]. These assumptions are made because BSUM is based on minimizing an upper bound so as to give a good approximate solution to nonconvex problems [26]; this is the reason assumptions 1) and 2) keep b an upper bound. Also, assumption 3) indicates that the steps of the upper-bound function follow the negative of the objective function gradient in the given direction. For the upper-bound function, a quadratic penalization is adopted here, i.e.,

$$b(\boldsymbol{v}_o, \boldsymbol{y}) = g\left(\boldsymbol{v}_o, \boldsymbol{y}_{-o}\right) + \frac{\mu}{2}\left\|\boldsymbol{v}_o - \boldsymbol{y}_o\right\|^2$$

where μ denotes the positive penalty constant. Then, the proximal upper-bound problem is solved as follows:

$$\boldsymbol{v}^{t+1}_o = \arg\min_{\boldsymbol{v}_o \in \mathcal{A}_o} \; b\left(\boldsymbol{v}_o, \boldsymbol{v}^{t}\right).$$

In BSUM, the blocks of primary variables are updated successively to minimize the upper-bound function of the original objective. A block of variables should reach a minimum point $\boldsymbol{v}^* = \boldsymbol{v}^{t+1}_j$ for the stationary solution to be a coordinate-wise minimum.
P-3 using BSUM can be written as follows:

$$\min_{\theta, \tilde{\boldsymbol{X}}, \tilde{\boldsymbol{Y}}, \boldsymbol{P}} \; C\left(\theta, \tilde{\boldsymbol{X}}, \tilde{\boldsymbol{Y}}, \boldsymbol{P}\right) \tag{42}$$

where $C(\theta, \tilde{\boldsymbol{X}}, \tilde{\boldsymbol{Y}}, \boldsymbol{P}) = C_{\text{HSFL}}(\theta, \tilde{\boldsymbol{X}}, \tilde{\boldsymbol{Y}}, \boldsymbol{P})$, and the feasible sets of θ, $\tilde{\boldsymbol{X}}$, $\tilde{\boldsymbol{Y}}$, and $\boldsymbol{P}$ are denoted by $\Theta$, $\mathcal{X}$, $\mathcal{Y}$, and $\mathcal{P}$, respectively. For every iteration k, a set of indices $\mathcal{I}_k$ is selected (i.e., $i \in \mathcal{I}_k$). The problem in (42) remains nonconvex in spite of the fact that we converted the binary variables into continuous ones. Hence, BCD is not feasible to use. Therefore, a proximal term is added to (42): we introduce a quadratic penalty term with penalty parameter μ > 0, whose essential goal is to preserve the convexity of the upper-bound function b; for example, for the block θ,

$$b\left(\theta, \boldsymbol{v}^{(k)}\right) = C\left(\theta, \tilde{\boldsymbol{X}}^{(k)}, \tilde{\boldsymbol{Y}}^{(k)}, \boldsymbol{P}^{(k)}\right) + \frac{\mu}{2}\left\|\theta - \theta^{(k)}\right\|^2. \tag{43}$$

Similarly, one can apply the penalty term to the blocks $\tilde{\boldsymbol{X}}$, $\tilde{\boldsymbol{Y}}$, and $\boldsymbol{P}$. Minimizing b with respect to θ, $\tilde{\boldsymbol{X}}$, $\tilde{\boldsymbol{Y}}$, and $\boldsymbol{P}$ in (43) produces
unique values of θ, $\tilde{\boldsymbol{X}}$, $\tilde{\boldsymbol{Y}}$, and $\boldsymbol{P}$ in every iteration. These values serve as the solution of iteration k. For iteration (k + 1), the solution is given by

$$\theta^{(k+1)} = \arg\min_{\theta \in \Theta} \; b\left(\theta, \tilde{\boldsymbol{X}}^{(k)}, \tilde{\boldsymbol{Y}}^{(k)}, \boldsymbol{P}^{(k)}\right)$$

$$\tilde{\boldsymbol{X}}^{(k+1)} = \arg\min_{\tilde{\boldsymbol{X}} \in \mathcal{X}} \; b\left(\theta^{(k+1)}, \tilde{\boldsymbol{X}}, \tilde{\boldsymbol{Y}}^{(k)}, \boldsymbol{P}^{(k)}\right)$$

$$\tilde{\boldsymbol{Y}}^{(k+1)} = \arg\min_{\tilde{\boldsymbol{Y}} \in \mathcal{Y}} \; b\left(\theta^{(k+1)}, \tilde{\boldsymbol{X}}^{(k+1)}, \tilde{\boldsymbol{Y}}, \boldsymbol{P}^{(k)}\right)$$

$$\boldsymbol{P}^{(k+1)} = \arg\min_{\boldsymbol{P} \in \mathcal{P}} \; b\left(\theta^{(k+1)}, \tilde{\boldsymbol{X}}^{(k+1)}, \tilde{\boldsymbol{Y}}^{(k+1)}, \boldsymbol{P}\right)$$

Each update optimizes one block of variables while keeping the others fixed. For solving these block updates, we use our proposed BSUM-based Algorithm 2. After Algorithm 2 converges, the resource allocation and offloading/association variables are transformed back to binary.
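The following is a toy sketch of the BSUM iteration above on a simple nonconvex two-block objective; the objective, the grid minimizer, and all constants are illustrative stand-ins rather than the actual P-3 blocks.

```python
import numpy as np

# Toy BSUM with a quadratic proximal upper bound on a nonconvex
# two-block objective g(a, b). Everything here is illustrative.
def g(a, b):
    return (a * b - 1.0) ** 2 + 0.1 * (a ** 2 + b ** 2)  # nonconvex jointly

def argmin_block(obj, lo=-5.0, hi=5.0, steps=2001):
    grid = np.linspace(lo, hi, steps)   # crude 1-D minimizer for the sketch
    return grid[np.argmin(obj(grid))]

mu = 1.0                                # proximal penalty constant
a, b = 3.0, -2.0                        # initial blocks
for k in range(30):
    # Minimize b(v_o, y) = g(v_o, y_-o) + mu/2 * ||v_o - y_o||^2, block by block.
    a = argmin_block(lambda x: g(x, b) + 0.5 * mu * (x - a) ** 2)
    b = argmin_block(lambda x: g(a, x) + 0.5 * mu * (x - b) ** 2)
print(f"stationary point: a={a:.3f}, b={b:.3f}, g={g(a, b):.5f}")
```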

C. Complexity and Convergence Analysis

1) Complexity Analysis:
In our solution approach, we have three subproblems. For the edge server computing resource and local computing resource allocation subproblems, we use a convex optimizer that converges within a finite number of iterations [31]. Therefore, one can deduce that the local computing resource and edge server computing resource allocation subproblems have a reasonable complexity. On the other hand, for the joint wireless resource, transmit power, and association subproblem, we employ a BSUM-based approach that also has a reasonable complexity. The standard BSUM in Algorithm 2 is based on minimizing the proximal upper bound. To minimize the upper-bound function, every block of variables is updated iteratively until the upper-bound function converges to a coordinate-wise minimum stationary point. Using Remark 1 and the assumptions, it is clear that the BSUM in Algorithm 2 has a sublinear convergence rate. Additionally, the BSUM in Algorithm 2 uses flexible update rules. From the above discussion, it is clear that Algorithm 2 has a sublinear iteration complexity of O(1/i) for the ith iteration [36]. Note that sublinear here means the error decreases at a rate of O(1/i), i.e., more slowly than a linear (geometric) convergence rate.
2) Convergence Analysis: We discuss the convergence analysis of the proposed HSFL scheme. There are two main aspects of convergence in the proposed HSFL: 1) the convergence of the BSUM- and convex optimization-based solutions and 2) the convergence of HSFL itself in terms of its upper bound. From Remark 1, it is clear that the proposed BSUM algorithm converges with a reasonable complexity. Additionally, the convex optimization-based solutions for local computing resource management and edge server computing resource management also converge within finite iterations. Therefore, one can say that the BSUM- and convex optimization-based solutions converge within a finite, reasonable number of iterations. For analyzing the convergence of HSFL, we make several assumptions [24], [27], [28], [30], [31], as follows.
1) $F(\omega^g)$ is convex.
2) $F(\omega^g)$ is ρ-Lipschitz, i.e.,

$$\left\|F\left(\omega^{g(t+1)}\right) - F\left(\omega^{g(t)}\right)\right\| \le \rho \left\|\omega^{g(t+1)} - \omega^{g(t)}\right\| \tag{41}$$

where ρ is a positive constant and $\|\cdot\|$ denotes the norm.
3) $F(\omega^g)$ is β-smooth for any $\omega^{g(t+1)}, \omega^{g(t)}$, i.e.,

$$\left\|\nabla F\left(\omega^{g(t+1)}\right) - \nabla F\left(\omega^{g(t)}\right)\right\| \le \beta \left\|\omega^{g(t+1)} - \omega^{g(t)}\right\|. \tag{42}$$

Similar to the works in [22] and [27], one can use the notion of an auxiliary parameter vector $v^{[k]}(t)$ that follows centralized gradient descent. Here, we observe the convergence of both traditional FL and HSFL. We can divide the whole of the T local iterations into K intervals, where each interval can be represented by $t \in [(k-1)\tau, k\tau]$. For FL, within each interval, the local model weights $\omega(t)$ of each device show no divergence from $v^{[k]}(t)$ for the first two iterations [22], [27]. However, with an increase in the number of local iterations (e.g., four local iterations), a divergence between $\omega_{n_u}(t)$ and $v^{[k]}(t)$ appears. The convergence bound of FL depends on the gap between $\omega_{n_u}(t)$ and $v^{[k]}(t)$. For FL, the convergence upper bound can be given in terms of [27]

$$h(x) = \frac{\sigma}{\beta}\left((\eta\beta + 1)^x - 1\right) - \eta\sigma x$$

where $w^f$ and $w^*$ denote the weights at the end of the learning process and the optimal weights, respectively, and σ and η denote the client-edge divergence and the step size, respectively. On the other hand, for HSFL there are two kinds of gradient divergence: 1) between the devices and the edge servers and 2) between the edge servers and the cloud running the global aggregators. Therefore, we must carefully tune the edge aggregations, local iterations, and global aggregations for the HSFL. For the HSFL, similar to hierarchical FL, the upper bound on $F(w^f) - F(w^*)$ takes the same form [22], with a divergence term $G_c(k_1 k_2)$ in which Δ represents the edge-cloud divergence.

Remark 2 (Convergence for IID Data): For IID data, there is no divergence of the device weight iterations from the virtually centralized weight iterations, and thus $G_c(k)$ is zero because δ and Δ are both equal to 0 for the IID distribution. This shows that the centralized weight iterations are the same as the distributed weight iterations.
Remark 3 (Convergence for Non-IID Data): For non-IID data, there are weight divergences of both kinds, client-edge and edge-cloud. In $G_c(k_1 k_2)$, the first term $h(k_1 k_2)$ is due to the edge-cloud divergence, whereas the second term is due to the client-edge divergence [37], [38]. The first and second terms have exponential relations with $k_1 k_2$ and $k_1$, respectively. Equation (44) clearly shows that, for a fixed product $k_1 \times k_2$, an increase in the number of edge aggregations (i.e., $k_2$) improves the performance. However, this comes at the cost of communication resources. Therefore, one must make a tradeoff between the frequency of edge aggregations and the performance.
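To see the effect described in Remarks 2 and 3 numerically, one can evaluate the divergence-growth function h(x) from the bound above for different aggregation intervals; σ, β, and η below are assumed toy values.

```python
sigma, beta, eta = 1.0, 2.0, 0.01   # assumed toy constants

def h(x):
    # Divergence growth over x local steps between aggregations.
    return (sigma / beta) * ((eta * beta + 1.0) ** x - 1.0) - eta * sigma * x

# Fixed budget of 20 local steps per global round: a shorter interval between
# aggregations (more frequent aggregation) keeps the divergence term smaller.
for interval in (20, 10, 5, 2, 1):
    print(f"aggregation interval {interval:2d}: h = {h(interval):.5f}")
```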

VI. PERFORMANCE EVALUATIONS
In this section, we discuss the simulation results for the proposed HSFL scheme. An area of 1000 × 1000 is considered, with randomly deployed cellular users and a set of devices participating in the HSFL. Table III provides additional simulation-related parameters. The devices involved in the HSFL reuse the resource blocks of the cellular users and thus receive interference from them. For the HSFL, we deploy multiple SBSs that are positioned randomly in the considered scenario for each run. The cellular users transmit signals with equal power in all runs. All figures are averaged over 35 runs. Additionally, we compare the performance using two baselines: baseline-A and baseline-R. Baseline-A differs from the proposed scheme only by its random resource allocation, whereas baseline-R differs from the proposed scheme by its random association.
The variations in the HSFL cost for different SINR and RLA values are shown in Fig. 3(a). A low value of the RLA and a high value of the SINR are desirable. However, this comes at the cost of computing and communication: the communication cost relates to the resource blocks and the transmit power allocation, whereas the computing cost relates to the local model iterations for a fixed local model and architecture.

The cost $C_{\text{HSFL}}$ versus iterations for the different schemes, namely, the proposed scheme, baseline-A, and baseline-R, is depicted in Fig. 3(b). The proposed scheme performs better than baseline-A and baseline-R, as shown in Fig. 3(b). The cause of this performance improvement is that the proposed scheme jointly optimizes the RLA, the edge server computing resources, the resource allocation, the association, and the power allocation. Baseline-A performs better than baseline-R but worse than the proposed approach. The reason for the performance enhancement of baseline-A over baseline-R is that, in the current settings, the cost $C_{\text{HSFL}}$ depends more on the association than on the resource allocation.

In the proposed scheme, we optimize several variables: resource allocation, edge computing resources, RLA, association, and transmit power allocation. For comparison, we use two baselines: baseline-A (i.e., the proposed association, computing resource allocation, and transmit power allocation with random resource allocation) and baseline-R (i.e., the proposed resource allocation, computing resource allocation, and transmit power allocation with random association). Therefore, to see the effect of the proposed transmit power optimization, resource allocation, and association, we also compare the performance of the proposed scheme with an exhaustive search algorithm for association and allocation along with equal power. Fig. 3(c) compares the performance of the proposed scheme with an exhaustive search algorithm using equal power [26] and reveals that the HSFL outperforms the exhaustive search scheme with equal power. From here onwards, the cost $C_{\text{HSFL}}$ denotes the cost that accounts for the transmission latency and the transmission energy.

Next, Fig. 4(a) shows the cost $C_{\text{HSFL}}$ versus iterations for the proposed scheme and the proposed scheme with equal power. Fig. 4(a) depicts that the proposed scheme outperforms the proposed scheme with equal power. Also, if we increase the maximum transmit power limit, the performance of the proposed scheme may be further enhanced. Therefore, a tradeoff must be made when setting the maximum transmit power limit for all devices. Fig. 4(b) shows the cost versus iterations for various numbers of devices. It is clear that the proposed scheme converges fast for all numbers of devices (i.e., 24, 30, and 36). Fig. 4(c) shows the cost versus iterations for a fixed set of 24 devices and different numbers of SBSs. Similar to Fig. 4(b), the curves in Fig. 4(c) converge for the different numbers of SBSs. Fig. 5 shows the cost $C_{\text{HSFL}}$ for variations in the constant α for the proposed scheme and the proposed scheme with equal power.
Fig. 5 shows a slight increase in the cost for higher values of α for both schemes. This indicates that the cost $C_{\text{HSFL}}$ has a larger contribution from the latency term than from the transmission energy term.

Now, we present learning results for evaluating the performance of the proposed HSFL scheme. We consider the MNIST and Fashion-MNIST data sets for evaluation. The training data are divided into shards of 300 images each. We consider three cases: non-IID1, non-IID2, and non-IID3 (a partitioning sketch follows at the end of this section). For non-IID1, a single shard is assigned to every device, whereas for non-IID2, two shards are assigned to each device. Five groups, each with five devices, are considered. For non-IID3, every device in a group of 10 is assigned images of a single class, such that, across a group, all classes of MNIST are covered.

An increase in the number of edge aggregations can improve the performance but at the cost of communication. Therefore, one must make a tradeoff between the communication cost and the performance. Note that computing an edge aggregated HSFL model offers the advantage of easy reuse of communication resource blocks. Within a small group, one can reuse communication resources more efficiently than in the case of a single group used for computing the SFL model. For the single-group SFL case, the coverage area is larger; therefore, to communicate with a single centralized aggregation server, devices must transmit with high power, which can cause higher interference to the cellular users.
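As promised above, here is a minimal sketch of the shard-based non-IID partitioning; the label array is a stand-in for MNIST, and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
labels = rng.integers(0, 10, size=15000)      # stand-in for MNIST labels
SHARD = 300                                   # images per shard
n_devices = 25                                # five groups of five devices

# Sort by label so each shard is (nearly) single class, as in non-IID1/2.
order = np.argsort(labels, kind="stable")
shards = order.reshape(-1, SHARD)             # each row is one shard of indices

def assign(shards_per_device):
    """non-IID1: 1 shard per device; non-IID2: 2 shards per device."""
    picks = rng.permutation(len(shards))[: n_devices * shards_per_device]
    return {d: np.concatenate(
        [shards[s] for s in
         picks[d * shards_per_device:(d + 1) * shards_per_device]])
        for d in range(n_devices)}

non_iid1 = assign(1)
non_iid2 = assign(2)
print("device 0, non-IID1 classes:", np.unique(labels[non_iid1[0]]))
print("device 0, non-IID2 classes:", np.unique(labels[non_iid2[0]]))
```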

VII. CONCLUSION
In this article, we have presented a novel framework, namely, HSFL, that combines split learning and hierarchical FL. A problem considering latency and energy was formulated. Because the formulated problem was nonconvex and NP-hard, we subdivided the main problem into two subproblems. The edge server computing resource allocation subproblem was solved using a convex optimization scheme. We employed a BSUM-based strategy for the joint RLA minimization, resource allocation, association, and transmit power allocation. Additionally, we discussed the complexity and convergence analysis of the proposed solution. Results were then provided to validate the various schemes. We conclude that a variety of IoT applications in which devices lack the power to compute their local models could benefit from our HSFL model. Our work can serve as a guideline for applications in which end-devices have resource constraints, such as intelligent transportation, healthcare, and Industry 4.0, among others. Our framework offers privacy-aware learning with fast convergence for resource-constrained end-devices and is thus preferable for use in future applications.
$k_1$ and $k_2$ denote the number of local iterations prior to an edge aggregation and the number of edge aggregations prior to the global aggregation, respectively.

Fig. 3. (a) $C_{\text{HSFL}}$ for various values of SINR and RLA. (b) $C_{\text{HSFL}}$ versus iterations for various schemes using 48 devices and 6 SBSs. (c) $C_{\text{HSFL}}$ versus iterations for the proposed and heuristic algorithms with equal power using 48 devices and 6 SBSs.

Fig. 4. (a) $C_{\text{HSFL}}$ versus iterations for the proposed scheme and the proposed scheme with equal power. (b) $C_{\text{HSFL}}$ versus iterations for the proposed scheme using different numbers of devices. (c) $C_{\text{HSFL}}$ versus iterations for the proposed scheme using different numbers of SBSs.

Fig. 6. (a) Accuracy versus global rounds for non-IID1 using five groups, each having ten devices. (b) Accuracy versus global rounds for non-IID2 using five groups, each having ten devices. (c) Accuracy versus global rounds for non-IID3 using five groups, each having ten devices.
For Fig. 6(a)-(c), the product of the local iterations and the edge aggregations per global round of communication is set to 10. For the traditional scheme, 10 local iterations are used in Fig. 6(a)-(c). Fig. 6(a) uses a single shard per device, whereas Fig. 6(b) uses two shards per device. It is clear from Fig. 6(a)-(c) that the HSFL outperforms SFL.