Semi-Supervised Federated Learning Over Heterogeneous Wireless IoT Edge Networks: Framework and Algorithms

Federated learning (FL) is a promising paradigm for future sixth-generation wireless systems to underpin network edge intelligence for smart city applications. However, most of the data collected by Internet of Things devices in such applications is unlabeled, necessitating the use of semi-supervised learning. Existing studies have introduced solutions to run semi-supervised FL; however, they have overlooked the critical impacts of the inherent wireless characteristics at the network edge. We fill this gap by proposing novel solutions to run semi-supervised FL over the wireless network edge, considering the limited computation and communication resources and the deadline constraints, and realizing that unlabeled data can be automatically labeled during the training rounds to improve the performance of the global model. The problem is first formulated as an optimization problem, followed by a two-phase solution. In the first phase, we propose a bisection-based algorithm to find the transmit power and local processing speed that optimally fit the newly injected labeled data. In the second phase, we propose three algorithms to control the local updates and injected samples so as to meet the deadline constraint. We analyze the performance of each algorithm concerning the tradeoffs between learning performance, training time, and total energy consumption. Targeting two applications in smart cities, human activity recognition and object detection, we conduct extensive simulations using realistic federated data sets under nonindependent and identically distributed settings. Numerical results show that the proposed algorithms effectively utilize unlabeled samples while accounting for the characteristics of wireless edge networks in smart cities.


I. INTRODUCTION
Nowadays, mobile and Internet of Things (IoT) devices are becoming standard computing infrastructure for billions of people around the globe [1]. Massive volumes of data are generated by these devices, which can be utilized to enhance a variety of applications [2], [3]. From privacy and cost perspectives, processing such data and training models locally is becoming increasingly popular, prompting the emergence of a new distributed learning paradigm called federated learning (FL). FL has gained significant impetus over the last two years, mainly due to its increasing rate of adoption in IoT smart services and automation systems, while preserving privacy [4], [5] and reducing communication costs [6], [7]. FL combines locally learned model parameters to create a global shared model without exposing sensitive local data.
Recent advances in FL algorithms have enabled the development of federated edge learning systems, enabling low-latency edge intelligence over the wireless network edge. FL at the network edge empowers intelligent systems to collaboratively train a shared machine learning model over wireless links while keeping all the training data on edge devices. It aims to leverage the data collected by edge devices in real time for fast inference response by making use of edge devices' processing capability to handle local data sets and train a learning model cooperatively [8]- [11]. As bandwidth, energy, and computation capabilities (i.e., central processing unit (CPU) and memory) at the wireless network edge are limited, it is crucial to design and develop FL approaches that accelerate convergence while addressing these challenges and consuming fewer resources [12]- [16]. Therefore, as detailed in Section II, considerable attention in the literature has been devoted to overcoming these challenges [17]- [29].
However, all relevant research focuses on supervised learning tasks, assuming that IoT devices' data is fully labeled. This does not reflect the reality of many existing applications, in which the majority of the collected data is unlabeled, and ordinary IoT devices, unlike the specialized entities in cross-silo FL, are unlikely to perform the correct labeling, particularly for streaming data. In cross-silo FL, the entities are organizations, and the collected unlabeled data is likely to demand professional experts for labeling, such as in medical sectors (e.g., disease diagnosis and health monitoring). In on-device FL, on the other hand, labeling all of the unlabeled data in such a manner is a costly and complicated endeavor, necessitating auto-labeling techniques and inducing the need to employ semi-supervised learning and to find efficient tools and algorithms that ensure correct labeling [30]. In smart city applications, data is massively distributed, and it is daunting to label all the data manually. Experts can only label small parts of it, which can help bootstrap the model; auto-labeling can then be used to expand the data set and improve the model. For example, pseudo-label methods can be used in conjunction with semi-supervised FL to generate pseudo-labels for unlabeled samples among distributed devices based on a confidence level that determines whether the label is correct for a given sample [31]- [35]. In the literature, different approaches [30], [35]- [42] have been proposed to apply semi-supervised learning under FL settings, aiming to exploit the unlabeled data and inject more data into the learning process. However, all these works [30], [35]- [42] have overlooked the dynamic nature of the data being labeled, which tends to increase as the training progresses.
Furthermore, they have not taken into account the characteristics of the wireless network edge, particularly the resource constraints (i.e., bandwidth, transmit power, or local CPU speed) and the deadline constraint determined by the system to avoid missing the update or having to wait longer.
To this end, motivated by the above observations and remarks, we bridge the above research gap and propose novel solutions to run federated semi-supervised learning (FedSemL) over the wireless network edge, considering the computation and communication resources and the deadline constraints. To the best of our knowledge, this is the first work that considers both FedSemL and the characteristics of edge networks. The key contributions of this article are as follows.
1) We build a FedSemL system on the network edge considering the limited resources and the deadline constraints, assuming that all IoT devices hold a scarcity of labeled data and an abundance of unlabeled data samples.
2) We formulate an optimization problem considering the training objective of minimizing the global loss, the adaptive labeling, the computation and communication resources, and the deadline constraint. The problem is then solved iteratively using three proposed algorithms.
3) We develop novel algorithms that control and perform the high-confidence pseudo-labeling of the unlabeled samples and then use them to improve the model performance. All proposed algorithms use strong data augmentation during the training phase and weak data augmentation during the pseudo-labeling phase. Then, the proper number of unlabeled samples is injected into the training process after optimizing the transmit power, CPU speed, and local updates.
4) We provide an analysis of each proposed algorithm with respect to the accuracy, computation time, and energy consumption and present the tradeoff performance.
5) Targeting two applications in smart cities, one for human activity recognition (HAR) and another for object detection, we carry out extensive simulation experiments to evaluate the proposed algorithms and verify their performance using realistic federated data sets under nonindependent and identically distributed (non-i.i.d.) data distribution and different percentages of labeled data.

The remainder of this article is organized as follows. In Section II, the associated literature is reviewed. In Section III, the system model is discussed, where the learning, computation, and communication models are presented. The problem is formulated in Section IV, while the proposed solutions and algorithms are introduced in Section V. The convergence analysis is presented in Section VI, and the complexity analysis of all algorithms is provided in Section VII.
The data set and the experiments are discussed in Section VIII along with the outcomes and most important lessons learned. Finally, the work is concluded, and the future research directions are presented in Section IX.

II. RELATED WORK
Given the resource constraints of IoT edge networks, the deployment of FL over such networks with massive numbers of IoT edge devices has received increased attention in the literature. For instance, the works in [17] and [19] studied the convergence analysis of FL training algorithms over wireless channels, considering the impact of unreliable links on the convergence rate. To minimize the training time of FL, several scheduling policies have been proposed in [18], [20]- [22], and [26], aiming to accelerate the convergence rate while accounting for the limited resources of the wireless edge. The work in [43] proposed an updated federated averaging algorithm that uses a distributed Adam optimizer to reduce the number of communication rounds required for the global model to converge. Yu et al. [44] studied how to select the clients participating in the training rounds, aiming to optimize the tradeoff between participant maximization and resource minimization.
Focusing on energy efficiency and wireless characteristics, Zeng et al. [45] investigated this problem with the aim of minimizing the overall energy consumption. A greedy allocation technique was presented to solve a minimization problem that provides additional resources to devices with bad channels. Wang et al. [46] studied energy-efficient FL over wireless channels when computing and communication resources are limited, aiming to reduce the overall training time while minimizing the associated energy. Albaseer et al. [14], [15] introduced fine-grained data selection to improve energy efficiency, where useless data were excluded from the training, leaving more time allocated for transmission.
However, all of these works have mainly focused on federated supervised learning, where the data amongst IoT devices is fully labeled. In reality, most of the collected data is unlabeled, necessitating the use of semi-supervised learning and the design of techniques and algorithms that provide efficient labeling [30]. The work in [35] proposed a two-phase algorithm to apply semi-supervised FL, where the first phase mainly performs the training on the labeled data; the second phase then starts after the model converges by injecting the unlabeled data into the training process after applying the pseudo-labeling. Similarly, the work in [36] exploits the unlabeled data by proposing a scheme where the IoT devices agree on a consistency loss that regularizes the local models of participating clients to produce similar predictions. Later, the work in [37] considered the unlabeled data by applying federated distillation [38], [39] with an assumption of shared unlabeled data, where the training of the large model is performed on the server side while the IoT devices train the small model. Similar research in [40] proposes an approach to synchronize the training and aggregation of the model parameters on unlabeled IoT devices and the labeled server. Recently, the work in [41] presented SemiFL, a new FL framework for users with fully unlabeled data and a small quantity of labeled data on the server. SemiFL separates the training of server-side supervised data from IoT device-side unsupervised data. Finally, Zhu et al. [42] studied the use of semi-supervised FL to identify the travel mode from GPS data, applying pseudo-labeling to the unlabeled data. However, none of these efforts [30], [35]- [42] have taken into account the network edge's limited resources (such as bandwidth, transmit power, and local CPU speed) or the system's deadline constraint to avoid missing the update or having to wait longer.
To conclude, on the one hand, the works in [14], [15], [17]- [22], [26], and [43]- [46] studied the deployment of FL at the wireless network edge while mainly focusing on supervised learning, where the data amongst IoT devices is fully labeled. On the other hand, the works in [30] and [35]- [42] studied semi-supervised FL while ignoring the characteristics of the wireless network edge. This motivates us to bridge this gap by studying such a problem and introducing a framework to run FedSemL over the network edge. We propose a management algorithm for resource optimization and three different algorithms to leverage the unlabeled data, considering the limited computation and communication resources and the system deadline, which determines the deadline for uploading the model updates from IoT devices to the server.

III. SYSTEM MODEL
For the system model illustrated in Fig. 1, we consider a beamforming-based base station (BS) linked to K IoT devices. Without loss of generality, we assume that each kth IoT device, k = 1, 2, ..., K, holds a large set of unlabeled data D_k^u together with a set of labeled data D_k^l, where x_{i,k} is the input feature vector of sample i, y_{i,k} is the associated class label vector, and i = 1, 2, ..., D_k^l. Here, D_k^l = |D_k^l| and D_k^u = |D_k^u| denote the cardinalities of the labeled and unlabeled sets, respectively. All IoT devices and the server share the same model structure, denoted by θ. The server initiates the model parameters, selects the IoT devices to perform the updates, and broadcasts the global model to all selected IoT devices. The selected IoT devices perform the required updates using their own labeled data and send them back to the server, which aggregates all uploaded updates, fuses them, and forms a new version of the global model. These steps are repeated until the model converges to a stationary point. Due to the characteristics of wireless networks (i.e., bandwidth limitation, channel uncertainty, and fading) and the deadline constraint imposed to prevent long waits for a model update, the server cannot include all available devices in a learning round; only a subset M can perform the model update, where M is the vector that comprises the indices of the selected IoT devices. It is worth noting that we mainly consider FedSemL with a scarcity of labeled data samples and an abundance of unlabeled data samples amongst clients, and the server has no access to any of this data. Table I lists the main notations utilized in this article.

A. Federated Learning Model
In FL at the wireless network edge, all devices and the server have the same model structure. The FL scenario can be summarized as follows:
1) the server collects prior information from all available devices, such as the local data size, CPU status, and battery level;
2) the server sets a deadline constraint (i.e., the maximum latency) and determines the default local epochs and batch size used for local training;
3) the server selects a subset of these devices to perform the model updates and broadcasts the global model to all selected devices. It is worth mentioning that the server initiates random parameters at the beginning or sends the latest version of the global model;
4) the selected devices receive the global model from the server and perform the required updates by applying the local solver (i.e., local stochastic gradient descent (SGD) with mini-batches) on their labeled data;
5) once the local updates finish, the updated model is sent back to the server;
6) the server aggregates all uploaded models from all participating devices, fuses them, and forms a new version of the global model. The server ensures that the resulting global model is unbiased by changing the selection set every round or every set of rounds.
These steps are repeated until the global model converges. It is worth noting that updating the global model on the devices begins from the same point (i.e., the latest global model parameters). However, each selected device performs a different number of sequential local updates based on its labeled data ratio D_k^l, CPU speed, and the time allocated for the uploading. Mathematically speaking, the total number of sequential local updates for ep epochs and batch size b is defined as

n = ep ⌈D_k^l / b⌉.     (1)

From (1), one can notice that a larger number of updates can be achieved with a larger labeled data ratio, a larger number of epochs, or a smaller batch size.
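The relation in (1) can be sketched in a few lines of Python (a minimal illustration; the function name `local_update_count` is ours, not from the paper):

```python
import math

def local_update_count(ep: int, labeled_size: int, batch_size: int) -> int:
    """Total number of sequential local updates n = ep * ceil(D_k^l / b), as in (1)."""
    return ep * math.ceil(labeled_size / batch_size)

# More labeled data, more epochs, or a smaller batch size all increase n.
n = local_update_count(ep=5, labeled_size=600, batch_size=32)  # 5 * 19 = 95 updates
```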
In machine learning and deep learning algorithms, the more updates performed on the model, the closer it gets to true convergence, as shown later in (3) and in the convergence analysis. In practice, each device k runs the local solver (i.e., local SGD) n times to capture the error of the model on the labeled data samples D_k^l, which is defined as

F_k(θ) = (1/D_k^l) Σ_{i=1}^{D_k^l} F_i(θ),     (2)

where F_i(θ) captures the local error over each sample i and i = 1, 2, ..., D_k^l. It is worth emphasizing that, in our work, we use local SGD with mini-batches (also known as local-update SGD, parallel SGD, or federated averaging) to balance the system resources [47], [48]. In local SGD, the data is first partitioned into batches of size b, and the local solver iterates through each batch B to perform one update, F_k(θ) ≈ (1/b) Σ_{i∈B} F_i(θ, x_i, y_i), meaning that each device performs ⌈D_k^l / b⌉ updates every single epoch. Accordingly, the local model parameters θ_k^r, for each IoT device k at every round r, are updated as

θ_k^r(j) = θ_k^r(j−1) − η ∇F_k(θ_k^r(j−1)), j = 1, 2, ..., n,     (3)

where η is the step size (i.e., learning rate) at each local iteration, which controls how much the weights of the model are adjusted w.r.t. the loss gradient, θ_k^r(0) is the received model parameters θ^{r−1}, and θ_k^r(n) is the updated model parameters θ_k^r uploaded to the server at each round r after performing n sequential local updates. In fully supervised FL settings, the number of epochs and the batch size are assumed to be homogeneous for all participants since the labeled data ratio used for training is known by the server prior to the local training, enabling the server to set the deadline properly. However, in semi-supervised FL, the labeled data ratio increases during the training process, depending on the model's performance in predicting the unlabeled data.
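The local update rule (3) amounts to running mini-batch SGD over the device's labeled set. The following sketch illustrates the n = ep·⌈D_k^l/b⌉ sequential updates; the helper name and the squared-error loss standing in for F_k are our illustrative choices:

```python
import numpy as np

def local_sgd(theta, X, y, eta=0.01, ep=2, b=32, rng=None):
    """One device's local training: ep epochs of mini-batch SGD over its
    D_k^l labeled samples, i.e. n = ep * ceil(D_k^l / b) updates of
    theta(j) = theta(j-1) - eta * grad F_k, as in (3). A squared-error
    objective is used here as a stand-in for the local loss F_k."""
    if rng is None:
        rng = np.random.default_rng(0)
    for _ in range(ep):
        order = rng.permutation(len(X))          # reshuffle each epoch
        for start in range(0, len(X), b):
            idx = order[start:start + b]
            Xb, yb = X[idx], y[idx]
            grad = 2.0 * Xb.T @ (Xb @ theta - yb) / len(idx)  # batch gradient
            theta = theta - eta * grad
    return theta
```

Smaller batches or more epochs increase the number of sequential updates within a round, at the cost of more computation time.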
Also, since the training and uploading tasks are subject to the time constraint (i.e., the deadline), reducing the batch size or increasing the labeled data ratio can compensate for the number of epochs, which is more appropriate for the new labeled data injected during the training. This enables the model to converge faster and recognize more data patterns. Therefore, we propose three controlling algorithms that adjust the number of epochs and the number of injected labeled samples based on the system's needs. More priority is given to the new labeled data as it has more impact on the model convergence, as shown in [20], [21], [26], and [49]- [51]. To clearly realize the relationship between learning epochs, labeled data ratio, batch size, and true model convergence, we provide an in-depth convergence analysis in Section VI. Once all local updates θ_k^r and related loss functions F_k(θ) are uploaded, the global loss function amongst IoT devices at every round r is obtained as

F(θ) = Σ_{k=1}^{K} (D_k^l / D) F_k(θ),     (4)

where D = Σ_{k=1}^{K} D_k^l. As a result, the relevant global model parameters are calculated as follows:

θ^r = Σ_{k=1}^{K} (D_k^l / D) θ_k^r.     (5)

During the training process, θ^r is sent to all selected IoT devices at each round to be used as a reference when updating the model parameters. Thus, the goal is to find the optimal θ* that minimizes F(θ).
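The aggregation step above weights each local model by its labeled-data share; a minimal sketch under our own naming:

```python
import numpy as np

def aggregate(local_params, local_sizes):
    """Weighted federated averaging: theta^r = sum_k (D_k^l / D) * theta_k^r,
    with D = sum_k D_k^l, so devices with more labeled data weigh more."""
    D = sum(local_sizes)
    return sum((d / D) * theta for theta, d in zip(local_params, local_sizes))

# A device holding 3x more labeled data pulls the global model 3x harder.
g = aggregate([np.array([1.0, 1.0]), np.array([3.0, 3.0])], [1, 3])  # -> [2.5, 2.5]
```

Note that in semi-supervised FL the weights D_k^l grow over rounds as pseudo-labeled samples are injected, so the aggregation weights are recomputed every round.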

B. Radio Frequency Uploading Model
We denote the uploading channel gain between IoT device k and the BS by g_k. Accordingly, for a given uploading time T_up,k using orthogonal transmission, the uploading data rate achieved by the kth IoT device can be calculated as

r_k = β_k B log₂(1 + (P_k^up ω_k^H g_k g_k^H ω_k) / (ω_k^H (σ₀² I) ω_k)),     (6)

where β_k B is the bandwidth allocated to IoT device k, ω_k denotes the receive beamforming vector at the RF BS, P_k^up is the kth IoT device's transmit power, σ₀² is the variance of the additive white Gaussian noise (AWGN), I is the identity matrix, and (·)^H stands for the Hermitian operation. Furthermore, the transmission energy consumption of the kth IoT device is defined as

E_up,k = P_k^up T_up,k,     (7)

where T_up,k is defined as

T_up,k = ξ / r_k,     (8)

and ξ is the model size determined by the server in the FL settings. In deep learning, the model size is determined by the width of the network and its depth (number of inputs, number of layers, and number of outputs).
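The uploading model above can be exercised numerically as follows. This is a simplified scalar sketch: the beamformed channel term is collapsed into a single effective gain, and all names are our own:

```python
import math

def upload_cost(bandwidth_hz, eff_gain, p_up, noise_psd, model_bits):
    """Scalar version of the uploading model: rate r_k = beta_k*B*log2(1 + SNR),
    upload time T_up,k = xi / r_k, and transmission energy E_up,k = P_up * T_up,k.
    eff_gain stands in for the beamformed channel gain |w^H g|^2."""
    rate = bandwidth_hz * math.log2(1 + eff_gain * p_up / (bandwidth_hz * noise_psd))
    t_up = model_bits / rate
    return t_up, p_up * t_up

# Raising transmit power shortens the upload but, once the log-rate
# saturates, costs more total transmission energy.
t1, e1 = upload_cost(1e6, 1.0, 0.1, 1e-9, 1e5)
t2, e2 = upload_cost(1e6, 1.0, 0.4, 1e-9, 1e5)
```

This concavity of the rate in the transmit power is exactly why blindly boosting power to meet the deadline is energy-inefficient, motivating the bisection-based search in Section V.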

C. Local Computation and Energy Models
Let f_k denote the local CPU speed and c denote the number of CPU cycles required to process one sample; thus, the local computation time T_cmp,k can be defined as

T_cmp,k = n b c / f_k.     (10)

Substituting (1), n = ep ⌈D_k^l / b⌉, into the right-hand side of (10) yields

T_cmp,k = ep c D_k^l / f_k.     (11)

It is worth mentioning that (10) has been extensively used in the literature, as in [24] and [26]- [29]. From (11), the corresponding local energy consumption of every kth IoT device due to model training is defined as [52]

E_cmp,k = (α_k / 2) f_k³ T_cmp,k,     (12)

where (α_k / 2) is the energy capacitance coefficient of the kth device. Substituting (11) into the right-hand side of (12) yields

E_cmp,k = (α_k / 2) ep c D_k^l f_k²,     (13)

and the total energy consumed by each kth device at every round due to the training and uploading tasks is given by

E_k = E_cmp,k + E_up,k.     (14)

In practical FL, as shown in Fig. 2, the server sets a deadline T to prevent longer waiting times, especially for participants with a low energy level, slower processing speed, or bad channel conditions (i.e., stragglers). It is worth noting that the terms "round deadline," "delay requirement," and "round latency" have been used interchangeably in the literature, as in [20], [21], [26], [49]- [51], and [53]. Therefore, each kth IoT device must finish its computation and communication tasks within T. The server determines T based on the prior information collected every round r, and the requirement can be expressed as follows:

T ≥ T_cmp,k + T_up,k, ∀k ∈ M.     (15)

As a consequence, the maximum local computation time can be expressed as

T_cmp,k^max = T − T_up,k.     (16)

To avoid exceeding the deadline and to send the update to the edge server on time, the local computation time should satisfy T_cmp,k ≤ T − T_up,k. Let us write this condition in a complete form as follows:

ep c D_k^l / f_k + ξ / r_k ≤ T.     (17)

From (17), we can see that the larger the amount of data, the higher the CPU speed required to finish the training on time. Now, assume that the maximum CPU speed is employed; then, the only option is to increase the transmit power so as not to exceed the deadline. Thus, a large labeled data size drives up not only the CPU speed but also the transmit power.
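The computation model above reduces to a simple per-device feasibility and energy check; a sketch under our own naming, with c the CPU cycles per sample:

```python
def round_feasible(ep, c, D_l, f_k, t_up, T):
    """Deadline condition (17): ep*c*D_k^l / f_k + T_up,k <= T."""
    t_cmp = ep * c * D_l / f_k   # local computation time, as in (11)
    return t_cmp + t_up <= T

def cmp_energy(alpha_k, c, D_l, ep, f_k):
    """Local training energy: (alpha_k/2) * ep * c * D_k^l * f_k^2."""
    return (alpha_k / 2) * ep * c * D_l * f_k ** 2

# A larger labeled set needs a faster CPU (or fewer epochs) to stay
# within the same deadline.
ok = round_feasible(ep=2, c=1e4, D_l=500, f_k=1e9, t_up=0.5, T=1.0)  # True: 0.01 + 0.5 <= 1
```

The quadratic dependence of the training energy on f_k is what makes simply maxing out the CPU speed wasteful, which the first optimization phase exploits.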

IV. PROBLEM FORMULATION
As previously stated, each IoT device has a large amount of unlabeled data that may be adaptively labeled using the pseudo-labeling technique and then injected during the training to improve model performance. However, the server should set a round deadline for updating and uploading the local models to avoid a long wait before beginning a new round. Hence, any IoT device that exceeds this time will not contribute to the global model. This brings out the challenge of effectively injecting a large number of unlabeled data samples that pass the confidence level (i.e., predicted with probability greater than ϕ) during the training task to improve the model quality while accounting for the communication and computation latencies and the system deadline constraint. Let N_k^l denote the number of unlabeled data samples that successfully meet the confidence level. Then, the updated number of labeled data samples is defined as

D_k^l = D_k^l + N_k^l.     (18)

From (18) and its reflection on (11), we can note that the more samples appended to the labeled data, the more computation time and the faster CPU speed needed to finish the training task, which in return consumes more energy. As a result, our goal is to find the optimal resource allocation (i.e., transmit power and local CPU speed), the appropriate number of injected samples, and the learning epochs committed to the round deadline while efficiently minimizing the global loss function. This problem can be posed as an optimization problem as follows:

min_{P_k^up, f_k, ep_k, N_k^l}  F(θ)     (19)
s.t.  T_cmp,k + T_up,k ≤ T, ∀k ∈ M     (19a)
      ξ / T_up,k ≤ r_k, ∀k ∈ M     (19b)
      Σ_{k∈M} β_k B ≤ B     (19c)
      E_cmp,k + E_up,k ≤ E_k^max, ∀k ∈ M     (19d)
      0 < f_k ≤ f_k^max, ∀k ∈ M     (19e)
      0 < P_k^up ≤ P_k^max, ∀k ∈ M.     (19f)

Constraint (19a) enforces that the computation and communication time in a given round does not exceed the time set by the server. Constraint (19b) indicates that the allocated bandwidth for each IoT device should be sufficient to upload the local model. The total bandwidth constraint is given in (19c). The energy consumption constraint is given in (19d) to guarantee that the energy expended on the computation and communication tasks is aligned with the energy budget (i.e., the battery level).
Constraint (19e) is relevant to the CPU frequency of each kth IoT device, while constraint (19f) is related to the transmission power. We can see that problem (19) is nonconvex, and solving it directly is challenging. In the following sections, we solve (19) iteratively as subproblems, where the server initially divides the bandwidth into M subchannels with a size of (B/ξ) each, adequate for uploading the local model of size ξ. As a consequence, only M IoT devices, where M = |M|, can take part in training the global model in a specific round. Here, M holds the indices of the selected devices. Next, we introduce two phases of optimization, where we first propose an iterative algorithm to find the optimal transmit power and CPU speed that minimize the total energy consumption of each kth device, aligned with the allocated bandwidth and the round deadline. In the second phase, we optimize the number of local updates and the newly injected samples considering the optimized transmit power, CPU speed, and deadline constraint.

V. PROPOSED SOLUTION
To solve (19), we first reformulate the problem taking into account the preallocated bandwidth, as stated before, based on the model size and the M selected participants. Thus, problem (19) can be reformulated as (20), where s_k(r) is a binary variable that determines whether device k is selected in the rth round (s_k(r) = 1) or not (s_k(r) = 0). We can notice that (20b) and (20c) can be checked by the edge server during the selection phase, while (20a), (20d), and (20e) are optimized locally.

A. Semi-Supervised Learning-Based Pseudo-Labeling
To fully utilize the unlabeled data, a straightforward method is to generate pseudo-labels using pretrained models. However, in FL, each IoT device holds two models: 1) the local model and 2) the global model. In the literature, there is a tradeoff when using either model to perform the pseudo-labeling. Among these approaches, the global model with strong data augmentation shows better performance [54]. Thus, we adopt this approach in this article. More specifically, each IoT device receives the global model from the server and uses strong data augmentation to create different sample forms using only the labeled data samples when training its local model. Thus, each input x_i in the training set is strongly augmented as x̃_i = A(x_i), where A(·) denotes a strong data augmentation. We consider warming-up rounds at the beginning of the training process, in which the IoT devices use only the labeled data samples at every global round (e.g., the warming-up rounds can be the first ten rounds). After receiving the global model parameters from the edge server, ignoring the warming-up rounds, each kth IoT device uses the updated model to predict the labels of the unlabeled samples (i.e., pseudo-labeling) after performing weak data augmentation and then feeding the unlabeled samples into the global model. Specifically, to perform the pseudo-labeling, every kth IoT device receives the global model θ^r and generates a pseudo-label y_{u,i} for each ith unlabeled sample as

ŷ_{u,i} = f(θ^r; ψ(x_{u,i})),     (21)

where ψ(·) represents a weak data augmentation and f(θ^r; ·) denotes the model's soft-max output. For each sample, the label is the top class corresponding to the highest probability, which can be expressed as follows:

y_{u,i} = arg max_c P(c | ψ(x_{u,i})),     (22)

where P is the class output probability. To trust such a label, a strict threshold probability is adopted to avoid accumulating errors caused by pseudo-labeling; the data samples filtered out with low confidence are not injected into the training process to avoid training with erroneous labels [31]- [34], [55].
Mathematically, the following condition should be satisfied:

max_c P(c | ψ(x_{u,i})) ≥ ϕ,     (23)

where ϕ is a predetermined threshold probability. In this work, as in Section VIII, we use ϕ = 0.8, 0.7, 0.6, 0.5, and 0.2 and provide insights on how to set the proper threshold. Each IoT device adds the passing samples to the labeled data as

D_k^l = D_k^l ∪ {(x_{u,i}, y_{u,i})}.     (24)

It is worth noting that the samples predicted with an uncertain probability (i.e., a probability less than the confidence level) are kept for the next global rounds, while the samples with a probability greater than the threshold (i.e., greater than the confidence level) are recognized as labeled data and appended to the labeled data set. Hence, they are injected into the local training when updating the global model. Practically, the pseudo-label of an unlabeled example is determined based on the maximum prediction probability in the pseudo-labeling process that satisfies the strict confidence level. The confidence level can be set based on system requirements, with critical applications setting a higher value (e.g., ϕ = 0.90) and noncritical applications setting a slightly lower value (e.g., ϕ = 0.70). Our technique utilizes the soft-max layer's outputs, where each output location gives the probability that the input belongs to that class. This can eliminate various statistical errors that may occur throughout the decision process. It is worth noting that other techniques can be applied to perform the pseudo-labeling without any modification to our proposed algorithms.
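The confidence-filtering rule above keeps only predictions whose top-class probability clears ϕ; a minimal sketch (our naming), operating on soft-max outputs:

```python
import numpy as np

def pseudo_label(probs, phi=0.8):
    """Confidence-filtered pseudo-labeling: keep unlabeled sample i only
    if max_c P(c|x_i) >= phi; its label is then the arg-max class.
    probs: (N, C) soft-max outputs of the global model on weakly
    augmented unlabeled samples."""
    conf = probs.max(axis=1)
    keep = np.nonzero(conf >= phi)[0]
    return keep, probs[keep].argmax(axis=1)

probs = np.array([[0.90, 0.10],   # confident -> accepted, label 0
                  [0.55, 0.45],   # uncertain -> kept for later rounds
                  [0.15, 0.85]])  # confident -> accepted, label 1
idx, labels = pseudo_label(probs, phi=0.8)  # idx = [0, 2], labels = [0, 1]
```

Raising ϕ admits fewer but cleaner pseudo-labels; lowering it injects more samples at the risk of propagating labeling errors.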

B. Resource Management
As explained in Section V-A, the computation time increases proportionally to the number of successfully labeled data samples. It is highly dependent on the deadline constraint, the number of local updates, the local CPU speed, and the transmit power as explained in Section III-C. Hence, each IoT device has to find the optimal transmit power, optimal CPU speed, and the proper number of local updates to address these challenges. This can be achieved using two phases. The first phase aims to find the optimal transmit power and CPU speed with respect to the deadline constraint and available energy using a bisection-based algorithm. The second phase aims to find the proper number of updates (i.e., number of epochs) and the number of new labeled data samples considering the available computation and communication resources and the deadline determined by the server.
As for the first phase, we optimize the CPU frequency f_k and the transmit power P_k^up with the aim of saving energy by using a bisection-based algorithm [56]. The bisection-based algorithm is applied to find the optimal f_k and P_k^up that meet the deadline constraint and consume less energy. Each kth IoT device solves the following subproblem:

min_{P_k^up, f_k}  E_cmp,k + E_up,k
s.t.  T_cmp,k + T_up,k ≤ T,  0 < f_k ≤ f_k^max,  0 < P_k^up ≤ P_k^max.

The algorithm then iterates through the possible intervals to find the optimal transmit time that consumes the least energy, as the transmit time is the bottleneck for resource-limited IoT devices. Accordingly, the computation time can be obtained, which leads to the optimal transmit power and local CPU speed. Algorithm 1 summarizes these steps.
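The core of the first phase can be sketched as a generic bisection over the uploading time, assuming the per-device round energy is unimodal (convex) in T_up,k; this is a simplification of Algorithm 1, and the names are ours:

```python
def bisect_upload_time(t_lo, t_hi, energy_of, tol=1e-6):
    """Bisection search for the uploading time that minimizes the round
    energy. energy_of(t) is assumed unimodal in t, so the sign of a
    finite-difference slope tells which half-interval to discard."""
    while t_hi - t_lo > tol:
        mid = (t_lo + t_hi) / 2
        if energy_of(mid + tol) > energy_of(mid):
            t_hi = mid  # slope positive: minimizer lies to the left
        else:
            t_lo = mid  # slope non-positive: minimizer lies to the right
    return (t_lo + t_hi) / 2

# With a toy convex energy curve, the search converges to its minimizer.
t_star = bisect_upload_time(0.0, 10.0, lambda t: (t - 3.0) ** 2 + 1.0)
```

Once the best transmit time is found, the transmit power follows from the rate equation and the CPU speed from the remaining computation budget T − T_up,k.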
In the following sections, we present the proposed algorithms for the second optimization phase, which aim to provide flexible, reliable, and scalable instantaneous labeling to increase the usage of unlabeled data at every rth round, considering the deadline and resource constraints.

C. Adaptive Pseudo-Labeling Algorithm With Deadline Exceeded Avoidance
This algorithm adjusts the learning epochs and the number of newly injected labeled samples based on the deadline constraint. The basic idea of this approach is that, if the number of new labeled samples is large, the maximum transmit power and CPU speed are used; thus, we aim to find the exact number of epochs or the exact number of new labeled samples that satisfy the deadline constraint. First, the number of local epochs is reduced; if it reaches 1 while more time is still required, the number of new labeled samples is reduced. Mathematically speaking, we derive the closed-form solutions, find the proper number of updates, and bound the maximum and minimum numbers of samples injected into the training.
From (16), we have ep(D^l_k / f_k) = T − T^{up,k}. Adding the number of successfully labeled data samples N^l yields

ep((D^l_k + N^l)/f_k) ≤ T − T^{up,k}.  (25)

In (25), the aim is to ensure that the number of newly injected samples is aligned with the deadline constraint. Accordingly, the proper number of local epochs is given as

ep = f_k(T − T^{up,k})/(D^l_k + N^l).  (26)

It is worth noting that decreasing the number of epochs while increasing the number of labeled data samples does not reduce the number of local updates, since n is proportional to the data size. As a consequence, this allows feeding the model with new samples, which in turn improves the performance compared with keeping the same data and epochs every round.
From (26), we can notice that the number of epochs can be computed to meet the time constraint as well, and it is inversely proportional to D^l_k and N^l. If the number of epochs obtained from (26) is less than one, ep < 1, we must reduce the number of appended samples, taking into account the transmit power and CPU speed while ensuring at least one local update (ep = 1). It can be bounded as

N^l ≤ f^max_k (T − T^{up,k}(P^max_k)) − D^l_k  (27)

where T^{up,k}(P^max_k) means that the maximum transmit power is utilized to upload the model. Algorithm 2 summarizes these steps. In Algorithm 2, we initially check whether the kth IoT device already uses the maximum CPU speed f^max; if so, we skip the bisection-based algorithm, as no further improvement can be obtained, and proceed directly to the second-phase algorithm, where the proper number of newly injected samples and the number of local epochs are adjusted to meet the deadline and resource constraints.
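The epoch-then-sample reduction described above can be sketched as follows. This is an illustrative sketch, not the paper's Algorithm 2: the `cycles_per_sample` constant and all names are assumptions, and the two branches mirror the Eq. (26)-style closed form for the epochs and the Eq. (27)-style cap on the injected samples.

```python
def apl_dev(T, t_up_max_power, f_max, cycles_per_sample, n_labeled, n_new):
    """APL-DEV sketch: shrink the epochs first, then the injected samples,
    so that local training plus upload fits the round deadline T."""
    budget = T - t_up_max_power                          # time left for computation
    per_epoch = cycles_per_sample * (n_labeled + n_new) / f_max
    epochs = int(budget // per_epoch)                    # Eq. (26)-style closed form
    if epochs >= 1:
        return epochs, n_new
    # Not even one full epoch fits: keep ep = 1 and cap the injected
    # samples by the Eq. (27)-style bound.
    cap = int(f_max * budget / cycles_per_sample) - n_labeled
    return 1, max(0, min(n_new, cap))
```

For instance, if one epoch over the enlarged set would overrun the deadline, the sketch keeps ep = 1 and trims the injected samples until the computation budget is exactly met.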

D. Adaptive Pseudo-Labeling Algorithm With Samples Replacement
This algorithm enables IoT devices to use only the active samples from the labeled set while excluding unwanted samples, that is, samples predicted with very high probability (e.g., 0.9) that provide no further improvement to the model quality. These steps are performed before employing the deadline-avoidance algorithm described in Section V-C. To differentiate between these steps, we define two threshold probabilities: the first, ϕ_1, is applied for sample exclusion from the labeled set, whereas the second, ϕ_2, is utilized as a confidence level to perform the pseudo-labeling after the warm-up rounds. Mathematically speaking, let N^x denote the number of excluded samples of the existing labeled data and N^l denote the number of included samples after applying the pseudo-labeling. The number of samples used in the rest of the epochs can then be defined as

D^l_k ← D^l_k − N^x + N^l.

We need to ensure that the computation time to train D^l_k samples satisfies the deadline constraint to avoid missing the update or waiting longer. Hence, from (27), we can bound the maximum number of injected samples as

N^l ≤ f^max_k (T − T^{up,k}(P^max_k)) − D^l_k + N^x.

The steps of this approach are summarized in Algorithm 3, and details on the threshold settings are given in Section VIII.
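The two-threshold replacement step can be sketched as follows. This is a minimal illustration under assumptions, not the paper's Algorithm 3: confidences are taken as precomputed top-class probabilities, and `max_size` stands in for the Eq. (27)-style cap on the working set.

```python
def apl_sr(labeled_conf, unlabeled_conf, phi1, phi2, max_size):
    """APL-SR sketch: drop already-mastered labeled samples (top-class
    probability >= phi1) and replace them with confidently
    pseudo-labeled ones (probability >= phi2), up to max_size."""
    # Exclusion pass over the existing labeled set (threshold phi1).
    keep = [i for i, c in enumerate(labeled_conf) if c < phi1]
    # Inclusion pass over the pseudo-labeled candidates (threshold phi2).
    inject = [j for j, c in enumerate(unlabeled_conf) if c >= phi2]
    # Cap the injected samples so the working set fits the deadline.
    inject = inject[: max(0, max_size - len(keep))]
    return keep, inject
```

Each pass is a single linear scan, which matches the O(1)-per-sample cost noted in the analysis of Section VII.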

E. Adaptive Pseudo-Labeling Algorithm With Adaptive Deadline
In this algorithm, all IoT devices use the global model to perform the pseudo-labeling with a predefined confidence level on the unlabeled samples, solely to estimate the expected computation time for the next training round. This can be achieved by feeding the unlabeled data into the received global model. The expected time for the next round can be estimated as

T^{r+1}_k = ep((D^l_k + N^l)/f_k).

As for the local training, only the current labeled set is used after finding the optimal transmit power and local CPU speed using the bisection algorithm based on D^l_k. After finishing the training task, the updates are uploaded to the server together with a request for a deadline extension based on the expected computation time T^{r+1}_k. The server then adapts the deadline of the next training round based on the extension requests. It is worth noting that this algorithm does not require any change to the local training algorithm.
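The device-side estimate and the server-side deadline adaptation can be sketched as follows; the function names are hypothetical, and the server policy of adopting the largest request is an assumption consistent with honoring all extension requests.

```python
def expected_comp_time(epochs, n_labeled, n_pass, f):
    """Device side of APL-AD: predict the next round's computation time
    from the current labeled set plus the samples that passed the
    confidence level when fed through the received global model."""
    return epochs * (n_labeled + n_pass) / f

def adapt_deadline(requests):
    """Server side of APL-AD: adopt a deadline covering all extension
    requests collected with the uploaded updates."""
    return max(requests)
```

A device with 500 labeled samples whose global model confidently labels 300 more would request `expected_comp_time(2, 500, 300, f)` seconds of computation for the next round, and the server keeps the largest such request as the new deadline.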

F. FedSemL Framework
In this section, we present the complete framework of running FedSemL at the wireless network edge with our proposed algorithms, which exploits the unlabeled data samples to improve the learning performance with high-confidence labeling while accounting for the resource and system constraints.

[Algorithm listing: the FedSemL training procedure. Recoverable steps: the server collects prior information (CPU speed, channel state) from the available IoT devices, specifies a round deadline T, selects M IoT devices, and sends them θ^{r−1}; every IoT device k ∈ M, in parallel, sets θ^r_k ← θ^r, generates pseudo-labels after applying weak augmentation on the unlabeled samples D^u_k, adds the passing samples to the labeled data, computes F_k(θ^r_k(j)) using (1), and sets θ^r_k = θ^r_k(j).]

Algorithm 1 starts with the initialization phase, where the model is constructed and the number of available IoT devices with their local data sizes, the confidence level, and the learning hyperparameters are determined. In steps 1-12, the system performs the training process for R rounds. In each round, the server first collects the prior information from the available IoT devices (step 3), specifies the deadline (step 4), and then selects a subset of these devices to take part in the training process (step 5). After that, in steps 6-9, all selected devices use the received model to label the unlabeled data before conducting the local updates as in steps 14-27. The server then aggregates all updates and forms a new global model version (steps 10 and 11). Each IoT device uses weak augmentation when performing the pseudo-labeling, as in step 16. In step 17, only the samples that are labeled with a probability higher than the confidence level are appended to the labeled data set. To align with the deadline and find the optimal transmit power and local CPU speed, each device uses one of the proposed algorithms, as in step 18, where the two optimization levels are employed. In steps 19-27, the local model is updated using strong data augmentation and local SGD with minibatches.
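A single FedSemL round can be sketched as follows. This is an illustrative skeleton of the framework, not the paper's listing: the device API (`pseudo_label`, `fit_resources`, `local_update`, `n_labeled`) is hypothetical, standing in for the device-side steps described above, and models are represented as flat parameter lists.

```python
import random

def fedseml_round(global_model, devices, deadline, confidence, n_select):
    """One FedSemL round: select devices, let each pseudo-label with weak
    augmentation, fit its resources to the deadline (phases 1 and 2),
    run local SGD with strong augmentation, then aggregate."""
    selected = random.sample(devices, k=min(n_select, len(devices)))
    updates, sizes = [], []
    for dev in selected:
        dev.pseudo_label(global_model, confidence)      # weak augmentation, confidence filter
        dev.fit_resources(deadline)                     # phase 1 + one phase-2 algorithm
        updates.append(dev.local_update(global_model))  # strong augmentation + minibatch SGD
        sizes.append(dev.n_labeled)
    total = sum(sizes)
    # Server-side aggregation weighted by labeled-set sizes.
    return [sum(s * u[i] for s, u in zip(sizes, updates)) / total
            for i in range(len(global_model))]
```

The size-weighted average matches the global loss definition F^r(θ) = Σ_k (D^l_k/D) F_k(θ) used in the convergence analysis of Section VI.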

VI. CONVERGENCE ANALYSIS
This section presents the convergence rate, which provides insights to better understand the relationship between the learning epochs, the labeled data ratio, the batch size, and the model's true convergence. It is worth noting that the following assumptions have been extensively used in the literature w.r.t. FL's behavior [15], [17], [27], [57]-[60]. To start, let us define θ*_k = arg min_θ F_k(θ) as the optimal model parameters for device k, which correspond to the minimum loss F_k(θ*_k) on its local data. The following assumptions are made.
Assumption 1: F_k(θ) is convex for the server and ∀k.
Assumption 2: F_k(θ) is L-Lipschitz continuous ∀k.
Assumption 3: F_k(θ) is β-Lipschitz smooth ∀k.
Assumption 4: Let B be a minibatch sampled at every local update from the kth participating device's local data. The variance of the stochastic gradients is bounded, i.e., E‖∇F_k(θ; B) − ∇F_k(θ)‖² ≤ σ²_k.
Definition 1: From Assumptions 1-4, the local loss function moves closer to the optimum as the number of local iterations over different batches increases. This is also reflected by (1) and (3): the larger the value of n, the closer the convergence toward the optimal model parameters, as more updates of the form θ^r_k(j) = θ^r_k(j − 1) − η∇F_k(θ^r_k(j − 1)) reduce the gap F_k(θ) − F_k(θ*_k).
Remark 1: From Assumptions 1-4 as well as (1) and (3), one can notice that the number of updates can be increased by injecting a larger labeled data ratio into the training process, increasing the number of epochs, or reducing the batch size. However, due to the wireless network constraints, injecting a larger labeled data ratio with small batches is the better choice to accelerate convergence, since increasing the number of epochs leads to exceeding the deadline while revisiting the same data samples. Thus, reducing the batch size and increasing the labeled data ratio can compensate for the number of epochs and enable the model to capture new data patterns.
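The counting argument behind Remark 1 can be made concrete with a small illustration, assuming the standard relation that one epoch performs ⌈D/B⌉ minibatch updates (the helper name is ours, not the paper's):

```python
def local_updates(epochs, n_samples, batch_size):
    """Number of local SGD updates n = ep * ceil(D / B)."""
    return epochs * -(-n_samples // batch_size)   # negation trick = ceil division
```

Starting from 2 epochs over 500 samples with batch size 50 (20 updates), injecting 250 confident pseudo-labeled samples raises this to 30 updates, and halving the batch size instead raises it to 40, all without adding epochs that would revisit the same data and risk the deadline.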
Definition 2 (Global Loss Function): For a given global model, the global loss function is F^r(θ) = Σ_k (D^l_k/D) F_k(θ).
Lemma 1: Under Assumptions 1-3, the global loss function F^r is also convex, L-Lipschitz continuous, and β-Lipschitz smooth.
Proof: This follows directly from Definition 2, given that F^r(θ) is a linear combination of the local loss functions of the participating devices F_k(θ). Now, let F^{r*}(θ) and F*_k(θ) be the optimal minimum values of F^r and F_k, respectively. Then, the gap F^{r*}(θ) − Σ_k (D^l_k/D) F*_k(θ) clearly approaches zero as the labeled data ratio increases. Consequently, as the labeled data ratios increase among the selected devices, the convergence rate is accelerated, thereby reducing the number of global communication rounds.

VII. ALGORITHMS ANALYSIS AND COMPLEXITY
This section presents the analysis of the proposed algorithms in terms of computational complexity, networking aspects, and total energy consumption.
For all algorithms, the major complexity lies in finding the optimal transmit power and local CPU speed at every rth round, which has a time complexity of O(log((b − a)/ε₁)), where [a, b] is the search interval and ε₁ is a small tolerance (e.g., ε₁ = 0.001) determining the convergence of Algorithm 1. It is worth emphasizing that, for the adaptive pseudo-labeling algorithm with samples replacement (APL-SR), the search interval is narrower, leading to a decrease in the number of iterations needed to find the optimal solution. In contrast, for the adaptive pseudo-labeling algorithm with adaptive deadline (APL-AD), the bisection algorithm requires more iterations to converge, as the interval [a, b] becomes larger due to adding all new labeled data in the next round, where N^l = |D^new_k|. On the other hand, due to injecting new labeled data samples, each kth device initially checks whether the maximum CPU speed f^max is already used, so as not to waste time running the bisection-based algorithm when no further improvement can be obtained; in that case, one of the second-phase algorithms is called directly. In the following, we analyze the properties of each algorithm in terms of time and energy consumption.
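The iteration count follows directly from halving the search interval until it is smaller than the tolerance, which a one-liner makes explicit (the interval endpoints below are illustrative, not values from the paper):

```python
import math

def bisection_iterations(a, b, eps=1e-3):
    """Iterations to localize the optimum within eps: O(log((b - a)/eps))."""
    return max(0, math.ceil(math.log2((b - a) / eps)))
```

For instance, a full interval of width 1 s needs ten halvings at ε₁ = 10⁻³, whereas a four-times-narrower APL-SR-style interval needs only eight, matching the claimed iteration savings.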
For the adaptive pseudo-labeling algorithm with deadline exceeded avoidance (APL-DEV), more time and energy are required, especially if the number of originally labeled data samples is small and the allocated CPU speed f_k and transmit power P^up_k are at their lower levels, f_k ≈ f^min_k and P^up_k ≈ P^min_k. In contrast, the APL-SR algorithm starts by excluding the useless data samples and labeling the unlabeled samples based on the received global model; each phase takes O(1) time per sample to complete. In comparison to APL-DEV, excluding the useless samples evidently helps to find the optimal transmit power and local CPU speed faster, as APL-DEV reuses the same labeled data samples regardless of whether or not they help improve the model. APL-SR is therefore faster and more energy efficient, since only the active samples are used to train the local models for the specified epochs. Last but not least, the cost of using APL-AD lies in the deadline extensions required in subsequent training rounds. This time increases incrementally over the training rounds as more data samples are labeled, which requires more energy in future rounds.

VIII. RESULTS AND DISCUSSION
In this section, we evaluate the performance of the FedSemL framework using the proposed algorithms.

A. Experimental Setup
Unless otherwise specified, we assume an FL setting with a bandwidth of B = 10 MHz and a background noise power of σ = 10^−8. The distance between the IoT devices and the edge server is uniformly distributed between 25 and 100 m. We model the channel using a Rician distribution with a Rician factor of 8 dB [61] and a path-loss exponent of 3.2. The coordinating server has two antennas, and every kth IoT device has a single antenna. For the maximum and minimum transmit power, we adopt P^max = 20 dBm and P^min = 10 dBm, respectively. The CPU frequency is uniformly distributed between 1 and 9 GHz. We target two IoT applications: 1) HAR [62] and 2) object detection (CIFAR-10 [63]). The simulation parameters are listed in Table II.

B. Performance Evaluation
We compare the proposed algorithms in terms of testing accuracy, training time, and energy consumption. The aim here is to identify the tradeoffs of each algorithm based on the system's needs. Afterward, we examine the performance of each algorithm under different confidence levels (i.e., threshold probabilities). We initially use 10% labeled data samples for each IoT device. We assume that the server has no access to any form of the IoT devices' data (i.e., labeled or unlabeled). We use similar network parameter settings, while the number of users is 30 and 50 for the HAR and CIFAR-10 data sets, respectively. Finally, we further evaluate the proposed algorithms using different percentages of labeled samples. We consider non-i.i.d. data distribution for all conducted experiments, as it is the main challenge in FL settings. It is worth mentioning that the channel state and the energy budget continually change during the global training rounds. For example, the battery-level constraint (20c) may allow a participating device to take part in only a few rounds. This enables the server to select different participants every round or every set of rounds, which diversifies the model updates and addresses the data heterogeneity. Thus, the resulting model will be unbiased, as diversifying the updates during the training rounds improves the generalization performance.

C. Human Activity Recognition Data Set
HAR is an appealing smart city application that offers a variety of benefits. Portable HAR-based health applications can help elderly and sick individuals recover rapidly from injuries and avoid accidents. We use a HAR data set of 10 299 samples obtained from the accelerometers and gyroscopes of mobile phones carried by 30 people performing six different activities: 1) standing; 2) walking; 3) sitting; 4) lying down; 5) walking upstairs; and 6) walking downstairs. Each sample comprises 561 features with time- and frequency-domain variables of the sensor signals. The sensor signals (accelerometer and gyroscope) were preprocessed using noise filters before being sampled in 2.56-s stationary sliding windows with 50% overlap (128 readings/window). A Butterworth low-pass filter was used to split the sensor acceleration signal into its body-motion and gravitational components. We use the HAR data set under federated non-i.i.d. settings, where each user has a different number of samples covering only some activities.
1) Importance of Applying FedSemL: This section investigates the importance of applying FedSemL to improve the model performance compared with applying only fully supervised FL. Here, we use fully supervised FL as a baseline to show the effectiveness of using FedSemL and the unlabeled data samples to improve the FL model. We use ϕ = 0.70 as the confidence level for FedSemL and assume that only 10% of the data is labeled. From Fig. 3, we can notice that injecting the unlabeled data during the training rounds provides significant accuracy improvements. Thus, in the following sections, we evaluate the proposed algorithms with respect to FedSemL.
2) Testing Accuracy of the Proposed Algorithms: We start with the HAR application, where the users hold both labeled and unlabeled data samples. Fig. 4(a)-(c) shows the testing accuracy of the proposed algorithms on the HAR application for confidence levels ϕ = 0.90, 0.8, 0.7, 0.6, 0.5, and 0.2. We can see that injecting the unlabeled data improves the model performance for all algorithms and that the APL-AD algorithm achieves the best accuracy. This stems from the fact that APL-AD ensures more local updates and more new labeled samples in round r + 1, which accelerates the convergence rate, as seen in Fig. 4(c) when ϕ = 0.90. This algorithm exhibits a tradeoff between the reduction in the number of communication rounds and a substantial increase in the computation time of round r + 1, as seen in Figs. 5 and 6. On the other hand, APL-DEV, as in Fig. 4(a), achieves the lowest accuracy compared to APL-SR and APL-AD due to the fewer local updates performed when updating the local model, which may drop to one if the number of new labeled samples is large. For example, a large amount of new labeled data may force the use of the maximum transmit power and local CPU speed, leading to the dropping of some important samples while reducing the number of local updates to meet the deadline. Moreover, APL-SR achieves satisfying accuracy while consuming fewer resources. This is due to excluding unwanted samples that do not contribute to the training update; at the same time, those samples are replaced with active samples that satisfy the confidence level and positively affect the training accuracy. This algorithm thus balances accuracy against resource consumption. Fig. 6 shows the average energy consumption per round for all proposed algorithms. We can see that APL-AD consumes more energy because it requests more time in round r + 1 to train all new labeled samples, increasing the computing operations throughout local training.
As for the APL-DEV algorithm, it consumes less energy than the APL-AD algorithm and slightly more than the APL-SR algorithm, even though it reduces the number of local updates and injects only the proper number of samples into the training round. This results from repeatedly using useless data samples when performing the local training, which yields a slower convergence rate and an increased number of communication rounds. On the other hand, APL-SR shows the best energy performance, as it saves more energy by excluding the unwanted samples and replacing them with new active labeled samples before conducting the model update for the rest of the epochs. We should emphasize that APL-SR might consume more energy than APL-DEV in some rounds, especially if the number of newly injected samples is large. Nevertheless, APL-SR can exclude more samples as the training progresses, leading to less energy consumption on average.

3) Training Time and Energy Consumption of the Proposed Algorithms:
As for the time consumption, Fig. 5 shows the average computation time for all proposed algorithms during the training rounds. From this figure, one can observe that APL-AD takes, on average, more than twice as long as APL-DEV and APL-SR. This is due to the adaptive round deadline based on the new labeled samples with a fixed number of local epochs. As for APL-SR, it takes less time on average to train the local models due to excluding the unwanted samples at the beginning of the local training, which accelerates the model updates. We can notice that APL-SR reduces the computation time, which in return allocates more time for transmission, further reducing the energy consumption, as seen in Fig. 6. On the other hand, APL-DEV takes more time than APL-SR, which forces a higher transmit power to upload the update before the deadline, leading to higher energy consumption, as one can see in Fig. 6.

D. Experiments on CIFAR-10 for Object Detection
After evaluating the performance of the proposed algorithms on the HAR application, we also use CIFAR-10 as a more complex training task. CIFAR-10 has 50 000 images for training and 10 000 for testing. We use this data set under realistic federated settings and non-i.i.d. data distribution, where CIFAR-10 is split into ten partitions (the number of labels) and each IoT device is assigned batches from only two classes. Fig. 7(a)-(c) shows the testing accuracy of all proposed algorithms. We can see that the performance is almost aligned with the HAR results, even though the training task on CIFAR-10 is more complex. In all figures, the global model requires more communication rounds than for HAR; nevertheless, the proposed algorithms behave similarly. Furthermore, all algorithms with ϕ ≥ 0.50 and ϕ < 0.80 achieve the best accuracy, as exhibited in Fig. 8. This brings out the need for carefully choosing the confidence level, and based on our experiments, we recommend 0.50 ≤ ϕ ≤ 0.80. Also, one can observe that when ϕ ≤ 0.60, the lowest accuracy, 72%, is achieved by APL-DEV, and the highest accuracy is 81%. Moreover, the largest accuracy difference across all proposed algorithms is less than 7%, demonstrating that the proposed solutions are robust enough to trust auto-labeling once a satisfactory confidence level is used. In contrast, all algorithms show undesirable accuracy when the confidence level is less than 0.50 (e.g., ϕ = 0.20). This is due to the fact that pseudo-labeling induces an accumulating error when the top-class probability is at the uncertainty level.
Figs. 9 and 10, on the other hand, show the average energy and time consumption per round. We can clearly observe an increase in energy and time compared to HAR. This stems from the large number of samples, 50 000, and the complexity of each training sample, which is a 32 × 32 image with ten classes, as opposed to HAR, which has 561 features with six classes. Even so, all proposed algorithms show the same behavior, i.e., APL-AD consumes more energy and time, while APL-SR shows the best performance, as seen for HAR.
Next, we consider different percentages of labeled data among the IoT devices, as depicted in Fig. 11. Here, we use 5%, 10%, and 20% labeled data, respectively, while considering a single confidence level ϕ = 0.60, which shows the best performance, as discussed above. We note that regardless of the labeled data percentage, injecting the unlabeled data improves the model performance. This stems from the ability to capture more patterns from the tested data. On the other hand, there is no significant difference when the percentage of labeled data is 5%, 10%, or even 20%, where the maximum accuracy difference is only 3%, demonstrating that the proposed algorithms achieve robust labeling regardless of the percentage of labeled data samples.

E. Experiments on Other Data Sets
To further verify our results, we carry out additional experiments using the MNIST data set, which is commonly used by the research community. MNIST consists of 69 000 images of handwritten digits (0-9) with a 28 × 28 pixel resolution each. To conduct experiments under non-i.i.d. settings, the data was distributed among the IoT devices such that each IoT device has imbalanced samples of just two digits, and the number of samples per IoT device follows a power law to ensure an imbalanced data distribution. The model input is a flattened 784-D (28 × 28) image, and the output is a class label between 0 and 9. In Figs. 12-14, it is evident that the proposed algorithms behave similarly regardless of the complexity of the model, as the performance in terms of accuracy, time, and energy is similar to that of HAR and CIFAR-10. In general, it is worth pointing out that the learning task strongly impacts the energy and time consumption. In light of our findings, this can be characterized as follows: a CIFAR-10 training round consumes the most energy and time, MNIST consumes less, and HAR consumes the least.

F. Lessons Learned
The following important lessons are learned from the experiments performed in this article.
1) Regardless of the application, be it HAR, object detection, or even handwritten-digit classification, injecting the unlabeled data improves the performance and enables the system to make more accurate decisions.
2) The proposed algorithms can efficiently employ unlabeled data while considering limited computation and communication resources, as well as energy and time constraints.
3) The proposed algorithms are well suited to cope with the energy, time, and accuracy tradeoff depending on the system's needs.
4) APL-SR is recommended for applications that can tolerate some minor errors while energy is critical, whereas APL-AD is recommended for critical applications, as all samples are repeatedly trained with the same local iterations.
5) Even when the edge server only has labeled data and the IoT devices only have unlabeled data, the proposed algorithms can be employed, as the server needs no information about the number of unlabeled data samples injected in a given round.

IX. CONCLUSION
In this article, we introduced novel algorithms to run semi-supervised FL at the network edge, where the devices have a scarcity of labeled data and an abundance of unlabeled data. We considered the limited computation and communication resources as well as the deadline constraint specified by the system. We proposed a bisection-based algorithm to minimize energy consumption by finding the optimal transmit power and local CPU speed. Then, we proposed three controlling algorithms that apply auto-labeling to the unlabeled data samples during the training rounds. All algorithms use strong data augmentation during the training phase and weak data augmentation during the pseudo-labeling phase. We presented an analysis of each proposed algorithm with respect to accuracy, computation time, and energy consumption, and discussed the resulting performance tradeoffs. We experimentally evaluated all proposed algorithms under non-i.i.d. data distribution. We consider that the proposed algorithms provide an efficient tool to run FedSemL over wireless edge networks while satisfying the system's needs. Finally, finding the optimal client scheduling for FedSemL is a promising research direction, considering convergence rate acceleration, resource allocation, and energy consumption.