Communication-Efficient Randomized Algorithm for Multi-Kernel Online Federated Learning

Online federated learning (OFL) is a promising framework to learn a sequence of global functions from distributed sequential data at local devices. In this framework, we first introduce a single-kernel-based OFL method (termed S-KOFL) by incorporating the random-feature (RF) approximation, online gradient descent (OGD), and federated averaging (FedAvg). As in the centralized counterpart, an extension to multiple kernels is necessary. Harnessing the extension principle of the centralized method, we construct a vanilla multi-kernel algorithm (termed vM-KOFL) and prove its asymptotic optimality. However, it is not practical, as its communication overhead grows linearly with the size of the kernel dictionary. Moreover, this problem cannot be addressed via the existing communication-efficient techniques (e.g., quantization and sparsification) in conventional federated learning. Our major contribution is to propose a novel randomized algorithm (named eM-KOFL), which exhibits similar performance to vM-KOFL while maintaining a low communication cost. We theoretically prove that eM-KOFL achieves an optimal sublinear regret bound. Mimicking the key concept of eM-KOFL in an efficient way, we propose a more practical pM-KOFL having the same communication overhead as S-KOFL. Via numerical tests with real datasets, we demonstrate that pM-KOFL yields almost the same performance as vM-KOFL (or eM-KOFL) on various online learning tasks.


INTRODUCTION
Federated learning has emerged as a promising decentralized machine learning framework, in which distributed nodes (e.g., mobile phones, wearable devices, etc.) learn a global function collaboratively under the coordination of a server without sharing the raw data placed at the edge nodes [1], [2]. To be specific, federated learning optimizes a global function (or model) by repeating two operations: i) local model optimizations at the edge nodes; and ii) a global model update (e.g., model averaging) at the server [3]. This distributed learning framework has a myriad of applications: predicting the activities of mobile phone users, predicting low blood sugar or heart-attack risk from wearable devices, and detecting burglaries within smart homes [4], [5], [6].
In many real-world applications, function learning tasks are expected to be performed in an online fashion. For instance, online learning is required when data is generated as a function of time (e.g., time-series prediction) [7], [8], or when the sheer volume of data makes it hard to carry out data analytics in batch form [9]. Focusing on a centralized network, this challenging problem has been efficiently addressed via online multiple kernel learning (OMKL) [10], [11], [12]. OMKL learns a sequence of functions (or models) from continuous streaming data, which can predict the label of a newly incoming data in real time. Using multiple kernels, OMKL can yield superior accuracy and enjoy greater flexibility compared with single-kernel online learning [11], [12]. Despite its effectiveness, OMKL is restricted to centralized networks, and an extension to a decentralized network (e.g., federated learning) has not been investigated yet.
Motivated by the success of OMKL, this paper considers a novel kernel-based OFL (KOFL) framework, in which the objective is to learn a sequence of global (or common) functions $f(\mathbf{x}; \hat{\mathbf{m}}_t)$ with a model parameter $\hat{\mathbf{m}}_t \in \mathbb{R}^M$ using streaming data across a large number of distributed nodes. In particular, the learned function $f(\mathbf{x}; \hat{\mathbf{m}}_t)$ belongs to a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ [13], i.e., $f(\mathbf{x}; \hat{\mathbf{m}}_t) \in \mathcal{H}$. Specifically, at every time $t$, each node $k \in \{1, 2, \ldots, K\}$ estimates a local model $\hat{\mathbf{m}}_{[k,t+1]}$ from the current global model $\hat{\mathbf{m}}_t$ and an incoming data $(\mathbf{x}_{k,t}, y_{k,t})$. Then, it sends $\hat{\mathbf{m}}_{[k,t+1]}$ to the server; this transmission is called the uplink. Leveraging the aggregated local models, the server constructs an updated global model as

$$\hat{\mathbf{m}}_{t+1} = h\big(\hat{\mathbf{m}}_{[1,t+1]}, \ldots, \hat{\mathbf{m}}_{[K,t+1]}\big), \quad (1)$$

and broadcasts it to the $K$ edge nodes. The mapping $h$ should be carefully designed according to the structures of the model parameters; one representative example is averaging (called FedAvg) [3]. Given the global model, every node $k$ can estimate the label of a newly incoming data $\mathbf{x}_{k,t+1}$ in real time, e.g., $\hat{y}_{k,t+1} = f(\mathbf{x}_{k,t+1}; \hat{\mathbf{m}}_{t+1})$.

Probabilistically, eM-KOFL can yield the same performance as vM-KOFL while maintaining a low communication cost. Notably, the communication overhead of eM-KOFL is independent of the size of the kernel dictionary (i.e., $P$); thus, to achieve higher learning accuracy, a sufficiently large number of kernels can be used for free. Our major contributions are summarized as follows.
We propose a delayed-Exp strategy, from which the server constructs a sequence of probability mass functions (PMFs) $\hat{\mathbf{q}}_t = (\hat{q}_{[t,1]}, \ldots, \hat{q}_{[t,P]})$, $t = 1, 2, \ldots, T$, in a distributed and online fashion. At every time $t$, the server chooses one kernel index out of the $P$ kernels randomly according to the PMF $\hat{\mathbf{q}}_t$. The proposed strategy guarantees that the selected kernel converges to the best kernel in hindsight with high probability. Like vM-KOFL, therefore, the proposed eM-KOFL can operate as S-KOFL with the best kernel in hindsight.
Leveraging a martingale argument, we theoretically prove that the proposed eM-KOFL achieves an optimal sublinear regret bound $\mathcal{O}(\sqrt{T})$ compared with the best kernel function in hindsight. This analysis also implies that eM-KOFL can asymptotically achieve the same learning accuracy as the centralized OMKL in [11], [12] without sharing raw data (i.e., preserving edge-node privacy). Regarding the communication overhead (per node), S-KOFL and vM-KOFL require $M$ and $PM$, respectively, for both downlink and uplink. Due to the clever use of multiple kernels, the communication overheads of eM-KOFL are reduced to $M + 1$ and $P + M$ for downlink and uplink, respectively.

We further reduce the uplink communication overhead of eM-KOFL by mimicking the delayed-Exp strategy efficiently. The resulting algorithm is named pM-KOFL, which has the communication overhead $M + 1 \approx M$ for both downlink and uplink. These results are summarized in Table 4.2.

Via numerical tests on real datasets, we demonstrate the effectiveness of the proposed methods on online regression and time-series prediction tasks. Notably, pM-KOFL yields almost the same performance as vM-KOFL and eM-KOFL. This gives a positive answer to the question of whether pM-KOFL can fully enjoy the advantage of multiple kernels while maintaining the communication cost of S-KOFL.

The remainder of this paper is organized as follows. In Section 2, we formally define the problem setting of the KOFL framework. In Section 3, a vanilla multi-kernel method (termed vM-KOFL) is constructed as a baseline approach and its drawback is identified. In Section 4, we propose a randomized method (named eM-KOFL) and its communication-efficient variant pM-KOFL. We provide theoretical analyses for the proposed methods in Section 5. In Section 6, numerical experiments on real datasets demonstrate the effectiveness of the proposed methods for online regression and time-series prediction tasks. We conclude this paper in Section 7.
Notations: Bold lowercase letters denote column vectors. For any vector $\mathbf{x}$, $\mathbf{x}^{\mathsf{T}}$ denotes the transpose of $\mathbf{x}$ and $\|\mathbf{x}\|$ denotes the $\ell_2$-norm of $\mathbf{x}$. Also, $\mathbb{E}[\cdot]$ represents the expectation over the associated probability distribution. To simplify the notation, we let $[n] \stackrel{\Delta}{=} \{1, \ldots, n\}$ for any positive integer $n$. The indices $k$, $t$, and $i$ stand for node, time, and kernel, respectively. The number of nodes, the number of kernels in the kernel dictionary, and the total number of incoming data are denoted by $K$, $P$, and $T$, respectively.

PRELIMINARIES
We first introduce online federated learning (OFL) and then define our kernel-based OFL (KOFL) framework.

Online Federated Learning (OFL)
The objective of OFL is to learn a sequence of global (or common) functions using sequential data (e.g., time-series data) across a large number of distributed edge nodes. In detail, at every time $t$, the server distributes the latest global model $\hat{\mathbf{m}}_t$ (i.e., the parameter of a global function $f(\mathbf{x}; \hat{\mathbf{m}}_t)$) to the $K$ decentralized nodes. Then, each node $k$ receives the global model $\hat{\mathbf{m}}_t$ and an incoming data $(\mathbf{x}_{k,t}, y_{k,t})$, where $\mathbf{x}_{k,t} \in \mathcal{X} \subseteq \mathbb{R}^N$ and $y_{k,t} \in \mathcal{Y} \subseteq \mathbb{R}$ represent the feature and the label, respectively. Using them, each node $k$ updates its local model, denoted as $\hat{\mathbf{m}}_{[k,t+1]} \in \mathbb{R}^M$ for $k \in [K]$. Every node $k$ sends $\hat{\mathbf{m}}_{[k,t+1]}$ back to the server, from which the server constructs an updated global model as in (1). The mapping $h$ should be carefully designed according to the structure of the learned function. In the existing federated learning frameworks, one representative mapping is to average the aggregated local models, called FedAvg [3], [16]. Our learned global function at time $t$ is given as $f(\mathbf{x}; \hat{\mathbf{m}}_{t+1})$, from which each node $k$ can estimate the label of a newly incoming data $\mathbf{x}_{k,t+1}$ in real time.

KOFL Framework
We briefly describe kernel-based function learning and then, by incorporating it into OFL, define our kernel-based OFL (KOFL) framework. A kernel-based learning seeks a function in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, defined as $\mathcal{H} = \{f : f(\mathbf{x}) = \sum_t \alpha_t \kappa(\mathbf{x}, \mathbf{x}_t)\}$, where $\kappa(\mathbf{x}, \mathbf{x}_t): \mathcal{X} \times \mathcal{X} \rightarrow \mathcal{Y}$ is a symmetric positive semidefinite basis function (called a kernel) [23]. The major drawback of this learning technique is that the dimension of the parameter to be optimized grows linearly with the number of incoming data. Thus, it is not applicable to online learning with continuous streaming data. This problem has been circumvented in [24] by the random-feature (RF) approximation, in which a kernel is well-approximated with a fixed and small number of random features. The RF-based kernel function is represented with a parameter $\hat{\mathbf{w}} \in \mathbb{R}^M$ as

$$f(\mathbf{x}; \hat{\mathbf{w}}) = \hat{\mathbf{w}}^{\mathsf{T}} \mathbf{z}(\mathbf{x}) \in \mathcal{H}, \quad (4)$$

where the feature mapping $\mathbf{z}(\mathbf{x})$, which relies on a kernel $\kappa$, is defined as

$$\mathbf{z}(\mathbf{x}) = \frac{1}{\sqrt{d}}\big[\sin(\mathbf{v}_1^{\mathsf{T}}\mathbf{x}), \ldots, \sin(\mathbf{v}_d^{\mathsf{T}}\mathbf{x}), \cos(\mathbf{v}_1^{\mathsf{T}}\mathbf{x}), \ldots, \cos(\mathbf{v}_d^{\mathsf{T}}\mathbf{x})\big]^{\mathsf{T}}, \quad (5)$$

and where $\{\mathbf{v}_i : i \in [d]\}$ denotes independent and identically distributed (i.i.d.) samples from the Fourier transform of the given kernel function $\kappa(\cdot, \cdot)$. Via the RF approximation, the hyper-parameter $M = 2d$ can be chosen independently of the number of incoming data.

The accuracy of the above RF-based kernel learning fully relies on the predetermined kernel $\kappa$, which can be chosen manually either by task-specific prior knowledge or by some intensive cross-validation process. As shown in [11], [12], [22], multiple kernel learning, using a preselected set of $P$ kernels $\mathcal{K} = \{\kappa_1, \kappa_2, \ldots, \kappa_P\}$ (called a kernel dictionary), is more powerful as it enlarges the function space for optimization. In this case, the RF-based multi-kernel function is represented as

$$f(\mathbf{x}; \{\hat{q}_i, \hat{\mathbf{w}}_i\}) = \sum_{i=1}^{P} \hat{q}_{[i]} \hat{\mathbf{w}}_i^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}), \quad (6)$$

where $\hat{q}_{[i]} \in [0, 1]$ denotes the combination weight (or reliability) of the associated kernel function $f(\mathbf{x}; \hat{\mathbf{w}}_i) = \hat{\mathbf{w}}_i^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}) \in \mathcal{H}_i$.

Now we are ready to define the KOFL framework formally. It is a class of OFL which seeks a sequence of kernel functions in the form of (6). An algorithm (or method) for the KOFL framework is evaluated in terms of learning accuracy and communication overhead:

Learning accuracy: As in centralized OMKL [11], [12], the learning accuracy of an algorithm is measured by the cumulative regret over $T$ time slots:

$$\mathrm{regret}_T = \sum_{t=1}^{T}\sum_{k=1}^{K} \mathcal{L}\big(f(\mathbf{x}_{k,t}; \hat{\mathbf{m}}_t), y_{k,t}\big) - \min_{i \in [P]}\, \min_{f \in \mathcal{H}_i} \sum_{t=1}^{T}\sum_{k=1}^{K} \mathcal{L}\big(f(\mathbf{x}_{k,t}), y_{k,t}\big), \quad (7)$$

where $\mathcal{L}(\cdot, \cdot)$ denotes a loss (or cost) function. This metric compares the cumulative loss of our algorithm to the cumulative loss of the static optimal function from the best kernel.

Communication overhead (per node): It is measured by the number of transmissions between each node and the server (i.e., the dimension of the local (resp. global) model parameter for uplink (resp. downlink)). For example, if a learned function has the form of (6) (i.e., $\hat{\mathbf{m}}_t = \{\hat{q}_{[t,i]}, \hat{\mathbf{w}}_{[t,i]} : i \in [P]\}$), the communication overhead is equal to $PM + P$ for both downlink and uplink. Throughout the paper, $M$ denotes the dimension of the random features in (5) (i.e., the dimension of the parameter of each single-kernel function).
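To make the RF construction concrete, the following Python snippet builds the feature map $\mathbf{z}(\mathbf{x})$ in (5) for a single Gaussian kernel. It is a minimal sketch under our own naming and toy dimensions (rf_feature_map, sigma, d = 50), not code from the paper.

```python
import numpy as np

def rf_feature_map(x, V):
    """z(x) in (5): V is a (d, N) matrix whose rows v_i are i.i.d. samples
    from the Fourier transform of the kernel. For a Gaussian kernel with
    bandwidth sigma, the rows are drawn from N(0, sigma^{-2} I). The output
    dimension is M = 2d, independent of how much data has arrived."""
    proj = V @ x                                   # projections v_i^T x, shape (d,)
    return np.concatenate([np.sin(proj), np.cos(proj)]) / np.sqrt(V.shape[0])

# Toy usage: N = 5 input features, d = 50 Fourier samples, so M = 2d = 100.
rng = np.random.default_rng(0)
N, d, sigma = 5, 50, 1.0
V = rng.normal(scale=1.0 / sigma, size=(d, N))     # Fourier samples of the kernel
z = rf_feature_map(rng.normal(size=N), V)          # z(x), shape (M,)
```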

VANILLA METHODS
In this section, we present two vanilla methods for the KOFL framework, which can be constructed by properly combining existing techniques. A single-kernel method (termed S-KOFL) is first introduced, which is constructed by incorporating the RF approximation [14], online gradient descent (OGD) [15], and FedAvg [3]. In addition, harnessing the idea of the multi-kernel extension in the centralized counterpart [11], [12], a vanilla multi-kernel KOFL method (termed vM-KOFL) is constructed. Our key observation is that such an extension incurs a heavy communication overhead compared with S-KOFL, and the existing overhead-reduction techniques in federated learning cannot address this problem. This is the motivation of our work.

S-KOFL
In S-KOFL, the goal is to seek a sequence of RF-based kernel functions $f(\mathbf{x}; \hat{\mathbf{w}}_1), \ldots, f(\mathbf{x}; \hat{\mathbf{w}}_T)$ defined in (4), i.e.,

$$f(\mathbf{x}; \hat{\mathbf{w}}_t) = \hat{\mathbf{w}}_t^{\mathsf{T}} \mathbf{z}(\mathbf{x}) \in \mathcal{H},$$

where $\mathbf{z}(\mathbf{x})$ is given in (5) according to a preselected single kernel $\kappa$. To optimize the global models $\hat{\mathbf{w}}_t \in \mathbb{R}^M$, $t = 1, \ldots, T$, in an online fashion, S-KOFL operates with the following two steps.

Local Model Update. At time $t$, every node $k$ receives the current global model $\hat{\mathbf{w}}_t$ from the server and an incoming data $(\mathbf{x}_{k,t}, y_{k,t})$. Using them, it updates the local model via OGD [15]:

$$\hat{\mathbf{w}}_{[k,t+1]} = \hat{\mathbf{w}}_t - \eta_l \nabla \mathcal{L}\big(\hat{\mathbf{w}}_t^{\mathsf{T}} \mathbf{z}(\mathbf{x}_{k,t}), y_{k,t}\big), \quad (8)$$

with a step size $\eta_l > 0$ and the initial value $\hat{\mathbf{w}}_1 = \mathbf{0}$, where $\nabla \mathcal{L}(\hat{\mathbf{w}}_t^{\mathsf{T}} \mathbf{z}(\mathbf{x}_{k,t}), y_{k,t})$ is the gradient at the point $\hat{\mathbf{w}}_t$. Then, each node $k$ sends the updated local model $\hat{\mathbf{w}}_{[k,t+1]} \in \mathbb{R}^M$ back to the server.
Global Model Update. The server updates the global model from the aggregated models $\{\hat{\mathbf{w}}_{[k,t+1]} : k \in [K]\}$ via FedAvg [16]:

$$\hat{\mathbf{w}}_{t+1} = \frac{1}{K} \sum_{k=1}^{K} \hat{\mathbf{w}}_{[k,t+1]}. \quad (9)$$

Then, it distributes the updated global model $\hat{\mathbf{w}}_{t+1} \in \mathbb{R}^M$ to the $K$ nodes. In S-KOFL, the uplink/downlink communication overhead is equal to $M$.
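The two steps of one S-KOFL round can be illustrated with a very small sketch. The snippet below instantiates (8) and (9) for the regularized least-squares loss used later in Section 6; the helper names, toy sizes, and random data are our own assumptions.

```python
import numpy as np

def ogd_local_step(w_t, z, y, eta_l=0.5, lam=0.01):
    """One local OGD step (8) on a regularized least-squares loss
    L(w^T z, y) = (w^T z - y)^2 + lam * ||w||^2 (the loss used in Section 6).
    z is the RF feature vector z(x_{k,t}) of the incoming datum."""
    grad = 2.0 * (w_t @ z - y) * z + 2.0 * lam * w_t   # gradient at w_t
    return w_t - eta_l * grad

def fedavg(local_models):
    """Global update (9): the server averages the K uploaded local models."""
    return np.mean(np.stack(local_models), axis=0)

# One S-KOFL round with K = 3 toy nodes and M = 100 RF features.
rng = np.random.default_rng(1)
M, K = 100, 3
w_t = np.zeros(M)                                      # initial global model
uplinks = [ogd_local_step(w_t, rng.normal(size=M), rng.normal())
           for _ in range(K)]                          # uplink: M numbers per node
w_next = fedavg(uplinks)                               # downlink: M numbers
```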

vM-KOFL
In vM-KOFL, the global learned function at time $t$ is represented as

$$f(\mathbf{x}; \{\hat{\mathbf{w}}_{[t,i]}, \hat{q}_{[t,i]}\}) = \sum_{i=1}^{P} \hat{q}_{[t,i]} \hat{\mathbf{w}}_{[t,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}),$$

where the global model is defined as $\hat{\mathbf{m}}_t = \{\hat{\mathbf{w}}_{[t,i]}, \hat{q}_{[t,i]} : i \in [P]\}$. Then, vM-KOFL consists of the following two operations (see Algorithm 1).
Local Model Update. Given the global model $\{\hat{\mathbf{w}}_{[t,i]} : i \in [P]\}$ and an incoming data $(\mathbf{x}_{k,t}, y_{k,t})$, each node $k$ updates its local parameters $\{\hat{\mathbf{w}}_{[k,t+1,i]}\}$ via OGD:

$$\hat{\mathbf{w}}_{[k,t+1,i]} = \hat{\mathbf{w}}_{[t,i]} - \eta_l \nabla \mathcal{L}\big(\hat{\mathbf{w}}_{[t,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big), \quad (11)$$

for all $i \in [P]$, where $\eta_l > 0$ is a step size and the initial value is $\hat{\mathbf{w}}_{[1,i]} = \mathbf{0}$ for all $i \in [P]$. Also, it computes the reliabilities of the $P$ kernels with respect to its own local data:

$$\hat{\ell}_{[k,t+1,i]} = \hat{\ell}_{[k,t,i]} \exp\big(-\eta_g \mathcal{L}(\hat{\mathbf{w}}_{[t,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t})\big), \quad (12)$$

with the initial values $\hat{\ell}_{[k,1,i]} = 1$ for all $i \in [P]$. Then, it sends the updated local model $\{\hat{\mathbf{w}}_{[k,t+1,i]}, \hat{\ell}_{[k,t+1,i]} : i \in [P]\}$ back to the server.

Algorithm 1 (vM-KOFL) summarizes this procedure: at every time $t$, each node $k$ receives a streaming data $(\mathbf{x}_{k,t}, y_{k,t})$ and performs the local updates (11)-(12), and the server performs the global updates (13)-(14) and sends the updated global model $\{\hat{\mathbf{w}}_{[t+1,i]}, \hat{q}_{[t+1,i]} : i \in [P]\}$ to the $K$ nodes. S-KOFL follows the above procedures with a predetermined single kernel $\kappa$ (i.e., $P = 1$).
Global Model Update. The server receives the updated local models $\{\hat{\mathbf{w}}_{[k,t+1,i]}, \hat{\ell}_{[k,t+1,i]} : i \in [P], k \in [K]\}$ from the $K$ nodes. First, the parameters of the $P$ kernel functions are obtained via FedAvg:

$$\hat{\mathbf{w}}_{[t+1,i]} = \frac{1}{K} \sum_{k=1}^{K} \hat{\mathbf{w}}_{[k,t+1,i]}, \quad i \in [P]. \quad (13)$$

Then, the weights for combining the $P$ kernel functions, which are determined on the basis of the entire losses of the $K$ nodes, are computed as

$$\hat{q}_{[t+1,i]} = \frac{\prod_{k=1}^{K} \hat{\ell}_{[k,t+1,i]}}{\sum_{j=1}^{P} \prod_{k=1}^{K} \hat{\ell}_{[k,t+1,j]}}, \quad i \in [P]. \quad (14)$$

The above method to compute the combining weights is known as the Exp strategy [25]. Finally, the server broadcasts the updated global model $\{\hat{\mathbf{w}}_{[t+1,i]}, \hat{q}_{[t+1,i]} : i \in [P]\}$ to the $K$ nodes. In vM-KOFL, the uplink/downlink communication overhead is equal to $P(M + 1)$.
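The following sketch shows how the server can evaluate the Exp-strategy weights in (14) numerically. Combining the per-node reliabilities in log space is our own implementation choice to avoid underflow; the helper name and toy values are illustrative.

```python
import numpy as np

def exp_strategy_weights(reliabilities):
    """Combination weights (14): multiply each kernel's per-node reliabilities
    over the K nodes (done in log space to avoid underflow) and normalize
    over the P kernels. reliabilities has shape (K, P)."""
    log_w = np.log(np.asarray(reliabilities)).sum(axis=0)  # combine the K nodes
    log_w -= log_w.max()                                   # numerical stability
    q = np.exp(log_w)
    return q / q.sum()                                     # PMF over the P kernels

# e.g., K = 4 nodes and P = 11 kernels with reliabilities in (0, 1]
rng = np.random.default_rng(2)
q_hat = exp_strategy_weights(rng.uniform(0.1, 1.0, size=(4, 11)))
```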

Remark 1.
We observe that vM-KOFL enjoys the merit of multiple kernels at the expense of communication overhead. In particular, the uplink/downlink communication overhead of vM-KOFL grows linearly with the size of the kernel dictionary $P$. Certainly, as in S-KOFL, the communication-reduction techniques in federated learning (e.g., quantization or sparsification) [17], [18], [20], [26] can be applied to vM-KOFL. We remark that these methods can only reduce the overhead term $M$ to some $M' \ll M$, whereas the term $P$ is unchanged. This poses a new and demanding problem: the construction of a communication-efficient multi-kernel method for the KOFL framework. We contribute to this subject in the next section.

Algorithm 2. Proposed Method (eM-KOFL)
1: Input: $K$ edge nodes, a set of preselected $P$ kernels $\{\kappa_i : i \in [P]\}$, and hyper-parameters $(\eta_l, \eta_g)$.
2: Output: A sequence of common functions $f(\mathbf{x}; \{\hat{q}_{[t,i]}, \bar{\mathbf{w}}_{[t,i]} : i \in [P]\})$, $t = 1, \ldots, T$.
3: Initialization: $\bar{\mathbf{w}}_{[1,i]} = \mathbf{0}$ for all $i \in [P]$. Each node has identical random features $\mathbf{z}_i(\cdot)$ for all $i \in [P]$.
4: At each node $k \in [K]$: update the local parameters via (16)-(19) and send $\{\bar{\mathbf{w}}_{[k,t+1]}, \bar{\ell}_{[k,t+2,i]} : i \in [P]\}$ to the server.
5: At the server: update the global parameter $\bar{\mathbf{w}}_{t+1}$ via (20). Randomly choose $\hat{p}_{t+2}$ according to the PMF in (21). Send the updated global model $\{\bar{\mathbf{w}}_{t+1}, \hat{p}_{t+2}\}$ to the $K$ nodes.

PROPOSED METHODS
In this section, we propose a novel randomized algorithm (named eM-KOFL), which can enjoy the advantage of multiple kernels while having almost the same communication overhead as S-KOFL. Remarkably, the uplink/downlink communication overhead of eM-KOFL is independent of $P$ (i.e., the number of kernels); thus, to achieve higher learning accuracy, a sufficiently large number of kernels can be used without increasing the communication burden. The main idea of eM-KOFL is as follows. At every time $t$, one kernel out of the entire $P$ kernels is randomly selected according to a carefully designed probability mass function (PMF). Only the local functions of the selected kernel are updated globally, while the other functions are updated locally, which ensures that the communication overhead no longer depends on $P$. More importantly, it is proved that our PMF, obtained by the proposed delayed-Exp strategy, guarantees that as $t$ proceeds, the best kernel in hindsight is chosen with high probability. The communication overheads of eM-KOFL are equal to $M + 1$ and $M + P \leq 2M$ for downlink and uplink, respectively. Mimicking the delayed-Exp strategy in an efficient way, we further reduce the uplink communication overhead of eM-KOFL from $M + P$ to $M + 1$. This variant of eM-KOFL is named pM-KOFL. The communication overheads of S-KOFL, vM-KOFL, eM-KOFL, and pM-KOFL are summarized in Table 4.2, where we recall that $M$ is the dimension of the random features and $P$ is the size of the kernel dictionary. They are hyper-parameters; in our experiments, $M$ and $P$ are set to 100 and 11, respectively. The actual communication costs in our experiments follow immediately from Table 4.2 with $M = 100$ and $P = 11$.

eM-KOFL
In the proposed eM-KOFL, the global learned function at time $t$ is fully determined by the parameters $\{\hat{p}_t, \bar{\mathbf{w}}_t\}$, i.e.,

$$f(\mathbf{x}; \{\bar{\mathbf{w}}_t, \hat{p}_t\}) = \bar{\mathbf{w}}_t^{\mathsf{T}} \mathbf{z}_{\hat{p}_t}(\mathbf{x}) \in \mathcal{H}_{\hat{p}_t}, \quad (15)$$

where the global model is defined as $\hat{\mathbf{m}}_t = \{\hat{p}_t, \bar{\mathbf{w}}_t\}$. We note that $\{\hat{p}_t, \bar{\mathbf{w}}_t\}$ are random variables; thus, eM-KOFL is a randomized algorithm. Unlike S-KOFL, the associated RKHS in eM-KOFL can therefore change over time. The proposed eM-KOFL consists of the following two operations (see Algorithm 2):

Local Model Update. Every node $k$ keeps its own local information $\{\bar{\mathbf{g}}_{[k,t,i]} : i \in [P]\}$, which is not shared with the server. Also, at time $t$, it receives the global model $(\hat{p}_{t+1}, \bar{\mathbf{w}}_t)$ from the server. Leveraging $\hat{p}_t$ (received at time $t-1$) and $\bar{\mathbf{w}}_t$, each node $k$ updates the parameters of the $P$ kernel functions as

$$\bar{\mathbf{w}}_{[k,t,i]} = \begin{cases} \bar{\mathbf{w}}_t, & i = \hat{p}_t, \\ \bar{\mathbf{g}}_{[k,t,i]}, & i \neq \hat{p}_t, \end{cases} \quad (16)$$

for all $i \in [P]$. Note that only the kernel function belonging to $\mathcal{H}_{\hat{p}_t}$ is updated globally. Leveraging these, each node $k$ updates its local information via OGD:

$$\bar{\mathbf{g}}_{[k,t+1,i]} = \bar{\mathbf{w}}_{[k,t,i]} - \eta_l \nabla \mathcal{L}\big(\bar{\mathbf{w}}_{[k,t,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big), \quad (17)$$

for all $i \in [P]$, where $\eta_l > 0$ is a step size. Also, using the received index $\hat{p}_{t+1}$, it forms the local model to be conveyed to the server:

$$\bar{\mathbf{w}}_{[k,t+1]} = \bar{\mathbf{g}}_{[k,t+1,\hat{p}_{t+1}]}. \quad (18)$$

As in vM-KOFL, the accumulated losses (or reliabilities) of the $P$ kernels are computed as

$$\bar{\ell}_{[k,t+2,i]} = \exp\bigg(-\eta_g \sum_{\tau=1}^{t} \mathcal{L}\big(\bar{\mathbf{w}}_{[k,\tau,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,\tau}), y_{k,\tau}\big)\bigg), \quad (19)$$

where $\bar{\ell}_{[k,1,i]} = \bar{\ell}_{[k,2,i]} = 1$ for all $i \in [P]$. Obviously, $\bar{\ell}_{[k,t+2,i]}$ in (19) can differ from $\hat{\ell}_{[k,t+2,i]}$ in (12), as the former is computed with the local parameters. Finally, every node $k$ sends the updated local model $\bar{\mathbf{w}}_{[k,t+1]}$ and $\{\bar{\ell}_{[k,t+2,i]} : i \in [P]\}$ to the server. The corresponding uplink communication overhead is equal to $M + P$. Since $P$ is generally much smaller than $M$ (e.g., $P = 11$ and $M = 100$ in our experiments), the uplink communication overhead is less than $2M$.

Global Model Update. At time $t$, the server receives the updated local models $\{\bar{\mathbf{w}}_{[k,t+1]} : k \in [K]\}$ and $\{\bar{\ell}_{[k,t+2,i]} : i \in [P], k \in [K]\}$ from the $K$ nodes. As in S-KOFL, it updates the global parameter via FedAvg:

$$\bar{\mathbf{w}}_{t+1} = \frac{1}{K} \sum_{k=1}^{K} \bar{\mathbf{w}}_{[k,t+1]}. \quad (20)$$

Next, the server chooses a kernel index $\hat{p}_{t+2}$ by taking the reliabilities of the $P$ kernels into account. Specifically, $\hat{p}_{t+2} \in [P]$ is chosen according to the following PMF:

$$\hat{q}_{[t+2,i]} = \frac{\prod_{k=1}^{K} \bar{\ell}_{[k,t+2,i]}}{\sum_{j=1}^{P} \prod_{k=1}^{K} \bar{\ell}_{[k,t+2,j]}}, \quad \forall i \in [P]. \quad (21)$$

The proposed method above is called the delayed-Exp strategy.
Unlike the conventional Exp strategy [25], our strategy employs one-slot-delayed information, i.e., $\hat{q}_{[t+2,i]}$ is determined on the basis of the incoming data up to time $t$, whereas in the Exp strategy it would be determined up to time $t + 1$. Finally, the server distributes the updated global model $\{\bar{\mathbf{w}}_{t+1}, \hat{p}_{t+2}\}$ to the $K$ nodes. The downlink communication overhead of eM-KOFL is equal to $M + 1$. We emphasize that unlike vM-KOFL, the uplink/downlink communication overhead of eM-KOFL does not grow with $P$.
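To summarize the eM-KOFL round, the sketch below mirrors the node-side updates (16)-(18) and the server-side updates (20)-(21). It is a minimal illustration assuming the regularized least-squares loss of Section 6; the function names and toy dimensions are our own.

```python
import numpy as np

def emkofl_node_update(g, w_bar, p_hat, p_next, zs, y, eta_l=0.5, lam=0.01):
    """Node side, a sketch of (16)-(18): sync the globally-updated kernel p_hat
    to w_bar, run one OGD step for every kernel on the regularized
    least-squares loss, and return the single vector g[p_next] to upload."""
    g = g.copy()
    g[p_hat] = w_bar                                  # (16): only kernel p_hat is global
    for i, z in enumerate(zs):                        # (17): local OGD for all P kernels
        grad = 2.0 * (g[i] @ z - y) * z + 2.0 * lam * g[i]
        g[i] -= eta_l * grad
    return g, g[p_next]                               # (18): uplink vector (M numbers)

def emkofl_server_update(uplinks, delayed_losses, rng):
    """Server side, a sketch of (20)-(21): FedAvg the K uploaded vectors and
    sample the next kernel index from the delayed-Exp PMF, i.e. the product
    of the one-slot-delayed reliabilities over nodes, normalized over P."""
    w_bar = np.mean(np.stack(uplinks), axis=0)        # (20): FedAvg
    log_q = np.log(np.asarray(delayed_losses)).sum(axis=0)
    log_q -= log_q.max()                              # numerical stability
    q = np.exp(log_q)
    q /= q.sum()                                      # (21): delayed-Exp PMF
    return w_bar, int(rng.choice(len(q), p=q))        # downlink: M + 1 numbers

# Toy round: P = 11 kernels, M = 100 features, K = 4 nodes.
rng = np.random.default_rng(3)
P, M, K = 11, 100, 4
g, zs = np.zeros((P, M)), rng.normal(size=(P, M))
g, up = emkofl_node_update(g, np.zeros(M), p_hat=2, p_next=5, zs=zs, y=1.0)
w_bar, p_next = emkofl_server_update([up] * K, rng.uniform(0.1, 1, (K, P)), rng)
```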

pM-KOFL
We propose pM-KOFL as a communication-efficient variant of eM-KOFL, where the corresponding uplink/downlink communication overhead is equal to $M + 1$. Since pM-KOFL follows the procedures of eM-KOFL with only a few modifications, we highlight the differences. The full description of pM-KOFL is illustrated in Fig. 1.
In pM-KOFL, every node $k$ transmits a candidate kernel index $\tilde{p}_{[k,t+2]}$, rather than sending the accumulated losses $\{\bar{\ell}_{[k,t+2,i]} : i \in [P]\}$ as in eM-KOFL. The corresponding communication overhead is thereby reduced from $P$ to $1$. To be specific, the candidate index $\tilde{p}_{[k,t+2]}$ is chosen according to the following local PMF:

$$\tilde{q}_{[k,t+2,i]} = \frac{(\bar{\ell}_{[k,t+2,i]})^K}{\sum_{j=1}^{P} (\bar{\ell}_{[k,t+2,j]})^K}, \quad \forall i \in [P], \quad (22)$$

where $\bar{\ell}_{[k,t+2,i]}$ is defined in (19). Then, the server chooses a kernel index $\tilde{p}_{t+2}$ from the aggregated candidates $\{\tilde{p}_{[k,t+2]} : k \in [K]\}$ uniformly at random. By integrating the randomness at the $K$ nodes and the server, it can be interpreted that $\tilde{p}_{t+2}$ in pM-KOFL is chosen according to the following PMF:

$$\tilde{q}_{[t+2,i]} = \frac{1}{K} \sum_{k=1}^{K} \tilde{q}_{[k,t+2,i]}, \quad \forall i \in [P]. \quad (23)$$

Then, $\{\tilde{q}_{[t+2,i]}\}$ can be considered a proxy of the true PMF $\{\hat{q}_{[t+2,i]}\}$ in (21). Clearly, pM-KOFL can approach the performance of eM-KOFL provided that $\tilde{q}_{[t+2,i]}$ is sufficiently close to $\hat{q}_{[t+2,i]}$. In Section 6, it will be demonstrated that the proxy in (23) is quite accurate, so that pM-KOFL yields the same performance as eM-KOFL. This leads us to conclude that pM-KOFL can almost achieve the performance of vM-KOFL while having the same communication overhead as S-KOFL.
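The selection mechanism can be written compactly. In the sketch below, each node samples a candidate from (22) and the server picks one candidate uniformly, so that the composite draw follows the proxy PMF (23); the helper name and toy values are illustrative.

```python
import numpy as np

def pmkofl_select(reliabilities, rng):
    """pM-KOFL kernel selection: each node k samples one candidate index from
    its local PMF (22), proportional to (l_{k,t+2,i})^K, and uploads only that
    single integer; the server then picks one of the K candidates uniformly
    at random. The composite randomness realizes the proxy PMF (23)."""
    R = np.asarray(reliabilities, dtype=float)
    K, P = R.shape
    candidates = []
    for lk in R:                                     # at each node k
        w = lk ** K                                  # raise reliabilities to power K
        candidates.append(int(rng.choice(P, p=w / w.sum())))
    return candidates[int(rng.integers(K))]          # uniform pick at the server

rng = np.random.default_rng(4)
p_tilde = pmkofl_select(rng.uniform(0.1, 1.0, size=(4, 11)), rng)
```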

REGRET ANALYSIS
In this section, we theoretically prove that eM-KOFL attains the same asymptotic performance as vM-KOFL. Namely, both methods achieve a sublinear regret bound, i.e., $\mathrm{regret}_T \leq \mathcal{O}(\sqrt{T})$. Our analysis also reveals that eM-KOFL asymptotically yields the same performance as the centralized counterpart [11], [12] without sharing raw data (i.e., preserving edge-node privacy). Thus, eM-KOFL is asymptotically optimal. Before stating our main theorems, we introduce some useful notations and assumptions. For any fixed kernel $\kappa_i$, let $f(\mathbf{x}; \mathbf{w}_{[\star,i]}) = \mathbf{w}_{[\star,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x})$ denote the best RF-based function belonging to the RKHS $\mathcal{H}_i$, i.e.,

$$\mathbf{w}_{[\star,i]} = \arg\min_{\mathbf{w} \in \mathbb{R}^M} \sum_{t=1}^{T} \sum_{k=1}^{K} \mathcal{L}\big(\mathbf{w}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big). \quad (24)$$

For our analysis, we make the following assumptions, which are standard in the analysis of online convex optimization and online learning [11], [12], [15], [25]:

Assumption 1. For any fixed $\mathbf{z}_i(\mathbf{x}_t)$ and $y_t$, the loss function $\mathcal{L}(\mathbf{w}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_t), y_t)$ is convex with respect to $\mathbf{w}$ and bounded, i.e., $\mathcal{L}(\mathbf{w}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_t), y_t) \in [0, 1]$.
Assumption 2. For any fixed kernel $\kappa_i$, the optimal parameter $\mathbf{w}_{[\star,i]}$ is bounded, i.e., $\|\mathbf{w}_{[\star,i]}\|^2 \leq C$.

Assumption 3. For any fixed $\mathbf{z}_i(\mathbf{x}_t)$ and $y_t$, the gradient of the loss function is bounded, i.e., $\|\nabla \mathcal{L}(\mathbf{w}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_t), y_t)\| \leq L$.
Under Assumptions 1-3, the main results of this section are derived as follows.
Theorem 1. Given any set of $P$ kernels $\{\kappa_i : i \in [P]\}$, the vanilla method (vM-KOFL) in Algorithm 1 achieves the sublinear regret bound $\mathrm{regret}_T \leq \mathcal{O}(\sqrt{T})$ with the step sizes $\eta_l, \eta_g = \mathcal{O}(1/\sqrt{T})$.

Proof. The proof is provided in Section 5.1.

□
Theorem 2. Given any set of $P$ kernels $\{\kappa_i : i \in [P]\}$, the proposed eM-KOFL in Algorithm 2, with the step sizes $\eta_l, \eta_g = \mathcal{O}(1/\sqrt{T})$, achieves the following regret bound with probability at least $1 - \delta$:

$$\sum_{t=1}^{T}\sum_{k=1}^{K} \mathcal{L}\big(\bar{\mathbf{w}}_t^{\mathsf{T}} \mathbf{z}_{\hat{p}_t}(\mathbf{x}_{k,t}), y_{k,t}\big) - \min_{i \in [P]} \sum_{t=1}^{T}\sum_{k=1}^{K} \mathcal{L}\big(\mathbf{w}_{[\star,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big) \leq \mathcal{O}(\sqrt{T}).$$

Note that $\{\hat{p}_t, \bar{\mathbf{w}}_t\}$, $t = 1, \ldots, T$, are random sequences.
Proof. The proof is provided in Section 5.2. □

Remark 2.

The performance of pM-KOFL relies on whether the proxy PMF $\tilde{q}_{[t,i]}$ in (23) is sufficiently close to the true PMF $\hat{q}_{[t,i]}$ in (21). In an asymptotic analysis, it is easily proved that pM-KOFL can achieve a sublinear regret bound as in eM-KOFL if the following condition holds:

$$\sum_{t=1}^{T} \sum_{i=1}^{P} \big|\hat{q}_{[t,i]} - \tilde{q}_{[t,i]}\big| \leq \mathcal{O}(\sqrt{T}). \quad (25)$$

Unfortunately, it is not possible to theoretically prove that pM-KOFL satisfies this condition for an arbitrary decentralized dataset. Via numerical tests, we have confirmed that the condition easily holds for the real datasets used in Section 6. The corresponding result is shown in Fig. 2, where the y-axis represents $\mathbb{P}(\hat{p}_t = \tilde{p}_t)$ (which captures the left-hand side of (25)).

Remark 3.
We give some discussion of our theoretical analysis with respect to two major challenges in conventional federated learning: non-identically and non-independently distributed (non-IID) data (or data heterogeneity) and partial (node) participation [27]. Partial participation means that only a random portion (e.g., 10%) of the distributed nodes in the network transmit their local models to the server for FedAvg. We emphasize that our analytical results in Theorem 1 and Theorem 2 are valid for any dataset (e.g., non-IID data and any real-world data), provided that all the nodes in the network contribute to the global update. Namely, in the case of full participation, the proposed eM-KOFL is asymptotically optimal irrespective of the degree of heterogeneity of the incoming data at the distributed nodes. However, our analysis does not cover the case of partial participation with non-IID data. Even in the simple single-kernel case, we cannot ensure that S-KOFL, based on OGD and FedAvg, yields an optimal sublinear regret bound. Harnessing the similarity between OGD and SGD, the existing ideas to correct for non-IID data in conventional federated learning, such as adding a penalty (or proximal) term to local objective functions [28], [29], [30] or designing a new optimizer [31], can be applied to S-KOFL. Yet, the regret analysis of the resulting algorithms is not straightforward and should be further investigated. When considering multiple kernels, the problem becomes more challenging as the combination weights (or PMF) are also affected by partial participation. An estimated PMF $\check{q}_{[t,i]}$ (computed from the participating nodes) can be different from the true PMF $\hat{q}_{[t,i]}$ (computed from the entire set of nodes). Obviously, this difference tends to grow as the degree of data heterogeneity increases. Under partial participation and non-IID data, our algorithm can thus achieve an optimal sublinear regret bound only when the following condition holds:

$$\sum_{t=1}^{T} \sum_{i=1}^{P} \big|\hat{q}_{[t,i]} - \check{q}_{[t,i]}\big| \leq \mathcal{O}(\sqrt{T}). \quad (26)$$

The left-hand side of (26) reflects the heterogeneity of the dataset. The above condition may not hold in the case of high-degree data heterogeneity, and in that case the proposed delayed-Exp strategy should be further enhanced. In this scenario, constructing an efficient multi-kernel algorithm with an analytical performance guarantee is an interesting research topic, which is beyond the scope of this paper. Instead, via the experimental analysis in Section 6, we demonstrate that pM-KOFL (or eM-KOFL) provides robustness to partial participation and non-IID data to some degree (see Fig. 5).

Proof of Theorem 1
We provide the proof of Theorem 1. Following the notations in Section 3.2, let $\{\hat{\mathbf{w}}_{[t,i]} : i \in [P]\}$ and $\{\hat{q}_{[t,i]} : i \in [P]\}$ denote the global parameters of the learned function at time $t$. Using these notations, we derive two key lemmas.
Lemma 1. For any kernel $\kappa_i$ and any step size $\eta_l > 0$, S-KOFL in Section 3.1 achieves the following regret bound:

$$\sum_{t=1}^{T}\sum_{k=1}^{K} \Big[ \mathcal{L}\big(\hat{\mathbf{w}}_{[t,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big) - \mathcal{L}\big(\mathbf{w}_{[\star,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big) \Big] \leq \frac{KC}{2\eta_l} + \frac{\eta_l L^2 KT}{2}.$$

Proof. Fix a kernel $\kappa_i$ for any $i \in [P]$. From the OGD update in (8) and FedAvg in (9), the global model is updated as

$$\hat{\mathbf{w}}_{[t+1,i]} = \hat{\mathbf{w}}_{[t,i]} - \frac{\eta_l}{K} \sum_{k=1}^{K} \nabla_{[k,t,i]}, \quad (27)$$

where, for ease of exposition, we let

$$\nabla_{[k,t,i]} \stackrel{\Delta}{=} \nabla \mathcal{L}\big(\hat{\mathbf{w}}_{[t,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big). \quad (28)$$

From (27), we have

$$\|\hat{\mathbf{w}}_{[t+1,i]} - \mathbf{w}_{[\star,i]}\|^2 = \|\hat{\mathbf{w}}_{[t,i]} - \mathbf{w}_{[\star,i]}\|^2 + \frac{\eta_l^2}{K^2} \Big\| \sum_{k=1}^{K} \nabla_{[k,t,i]} \Big\|^2 - \frac{2\eta_l}{K} \sum_{k=1}^{K} \nabla_{[k,t,i]}^{\mathsf{T}} \big(\hat{\mathbf{w}}_{[t,i]} - \mathbf{w}_{[\star,i]}\big). \quad (29)$$

Using the convexity of the loss function, we get

$$\sum_{k=1}^{K} \Big[ \mathcal{L}\big(\hat{\mathbf{w}}_{[t,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big) - \mathcal{L}\big(\mathbf{w}_{[\star,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big) \Big] \leq \sum_{k=1}^{K} \nabla_{[k,t,i]}^{\mathsf{T}} \big(\hat{\mathbf{w}}_{[t,i]} - \mathbf{w}_{[\star,i]}\big). \quad (30)$$

From (29) and (30), we derive the upper bound

$$\sum_{k=1}^{K} \Big[ \mathcal{L}\big(\hat{\mathbf{w}}_{[t,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big) - \mathcal{L}\big(\mathbf{w}_{[\star,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big) \Big] \stackrel{(a)}{\leq} \frac{K}{2\eta_l}\Big( \|\hat{\mathbf{w}}_{[t,i]} - \mathbf{w}_{[\star,i]}\|^2 - \|\hat{\mathbf{w}}_{[t+1,i]} - \mathbf{w}_{[\star,i]}\|^2 \Big) + \frac{\eta_l}{2K} \Big\| \sum_{k=1}^{K} \nabla_{[k,t,i]} \Big\|^2, \quad (31)$$

where (a) is from the Cauchy-Schwarz inequality. Taking the telescoping sum over $t = 1, 2, \ldots, T$, we get

$$\sum_{t=1}^{T}\sum_{k=1}^{K} \Big[ \mathcal{L}\big(\hat{\mathbf{w}}_{[t,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big) - \mathcal{L}\big(\mathbf{w}_{[\star,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big) \Big] \stackrel{(a)}{\leq} \frac{KC}{2\eta_l} + \frac{\eta_l L^2 KT}{2},$$

where (a) follows from $\hat{\mathbf{w}}_{[1,i]} = \mathbf{0}$, Assumption 2, and Assumption 3. This completes the proof. □

Lemma 2. For any learning rate $\eta_g > 0$, the Exp strategy guarantees the following regret bound:

$$\sum_{t=1}^{T}\sum_{i=1}^{P} \hat{q}_{[t,i]} \sum_{k=1}^{K} \mathcal{L}\big(\hat{\mathbf{w}}_{[t,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big) - \min_{i \in [P]} \sum_{t=1}^{T}\sum_{k=1}^{K} \mathcal{L}\big(\hat{\mathbf{w}}_{[t,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big) \leq \frac{\log P}{\eta_g} + \mathcal{O}(\eta_g KT).$$

Proof. The proof is completed from [12, Lemma 2].

□
Note that Lemma 1 holds for any kernel $\kappa_i$. Thus, combining Lemma 1 and Lemma 2, we complete the proof of Theorem 1.

Proof of Theorem 2
We provide the proof of Theorem 2. In this proof, we follow the notations in Section 4.1. For example, let $\{\bar{\mathbf{w}}_{[k,t,i]} : i \in [P]\}$ denote the local parameters at node $k$. Also, let $\bar{\mathbf{w}}_t$ and $\hat{p}_t$ denote the global parameters. Using them, we provide the key lemmas for the main proof.
Lemma 3. For any kernel $i \in [P]$ and step size $\eta_l > 0$, the local kernel functions of eM-KOFL achieve the following regret bound:

$$\sum_{t=1}^{T}\sum_{k=1}^{K} \Big[ \mathcal{L}\big(\bar{\mathbf{w}}_{[k,t,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big) - \mathcal{L}\big(\mathbf{w}_{[\star,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big) \Big] \leq \frac{KC}{2\eta_l} + \mathcal{O}(\eta_l KT).$$

Proof. Given the global parameters $\hat{p}_t \in [P]$ and $\bar{\mathbf{w}}_t$, and from (17), the local parameters are updated via OGD as

$$\bar{\mathbf{g}}_{[k,t+1,i]} = \bar{\mathbf{w}}_{[k,t,i]} - \eta_l \nabla \mathcal{L}\big(\bar{\mathbf{w}}_{[k,t,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big), \quad (33)$$

for all $i \in [P]$, where

$$\bar{\mathbf{w}}_{[k,t,i]} = \begin{cases} \bar{\mathbf{w}}_t, & i = \hat{p}_t, \\ \bar{\mathbf{g}}_{[k,t,i]}, & i \neq \hat{p}_t. \end{cases} \quad (34)$$

For ease of exposition, throughout the proof we let

$$\nabla_{[k,t,i]} \stackrel{\Delta}{=} \nabla \mathcal{L}\big(\bar{\mathbf{w}}_{[k,t,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big). \quad (36)$$

According to the two cases in (34), we get: i) when $\bar{\mathbf{w}}_{[k,t,i]} = \bar{\mathbf{g}}_{[k,t,i]}$ (i.e., $i \neq \hat{p}_t$), from (17) we have

$$\|\bar{\mathbf{g}}_{[k,t+1,i]} - \mathbf{w}_{[\star,i]}\|^2 = \|\bar{\mathbf{g}}_{[k,t,i]} - \mathbf{w}_{[\star,i]}\|^2 + \eta_l^2 \|\nabla_{[k,t,i]}\|^2 - 2\eta_l \nabla_{[k,t,i]}^{\mathsf{T}} \big(\bar{\mathbf{w}}_{[k,t,i]} - \mathbf{w}_{[\star,i]}\big); \quad (37)$$

ii) when $\bar{\mathbf{w}}_{[k,t,i]} = \bar{\mathbf{w}}_t$ (i.e., $i = \hat{p}_t$), an analogous recursion (38) holds, where the additional cross terms are controlled via the triangle and Cauchy-Schwarz inequalities. Also, from the convexity of the loss function, we obtain for any $k \in [K]$:

$$\mathcal{L}\big(\bar{\mathbf{w}}_{[k,t,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big) - \mathcal{L}\big(\mathbf{w}_{[\star,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big) \leq \nabla_{[k,t,i]}^{\mathsf{T}} \big(\bar{\mathbf{w}}_{[k,t,i]} - \mathbf{w}_{[\star,i]}\big). \quad (39)$$

Plugging (39) into (37) and (38) separately and combining the two cases, we obtain a per-slot bound (40). Summing (40) over $t = 1, \ldots, T$, we obtain for any fixed $i \in [P]$:

$$\sum_{t=1}^{T}\sum_{k=1}^{K} \Big[ \mathcal{L}\big(\bar{\mathbf{w}}_{[k,t,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big) - \mathcal{L}\big(\mathbf{w}_{[\star,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big) \Big] \stackrel{(a)}{\leq} \frac{KC}{2\eta_l} + \mathcal{O}(\eta_l KT),$$

where (a) is due to the telescoping sum, $\bar{\mathbf{g}}_{[k,1,i]} = \mathbf{0}$, Assumption 2, and Assumption 3. This completes the proof. □

Lemma 4. For any learning rate $\eta_g > 0$, the proposed delayed-Exp strategy guarantees the following regret bound:

$$\sum_{t=1}^{T}\sum_{i=1}^{P} \hat{q}_{[t,i]} \sum_{k=1}^{K} \mathcal{L}\big(\bar{\mathbf{w}}_{[k,t,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big) - \min_{i \in [P]} \sum_{t=1}^{T}\sum_{k=1}^{K} \mathcal{L}\big(\bar{\mathbf{w}}_{[k,t,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big) \leq \frac{\log P}{\eta_g} + \mathcal{O}(\eta_g KT).$$

Proof. For ease of exposition, we define $f_{[k,t,i]}(\mathbf{x}) \stackrel{\Delta}{=} \bar{\mathbf{w}}_{[k,t,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x})$. The proof will be completed using upper and lower bounds on the potential

$$Z \stackrel{\Delta}{=} \sum_{t=1}^{T} \log \mathbb{E}\bigg[\exp\bigg(-\eta_g \sum_{k=1}^{K} \mathcal{L}\big(f_{[k,t,I_t]}(\mathbf{x}_{k,t}), y_{k,t}\big)\bigg)\bigg].$$

We first derive the upper bound on $Z$:

$$Z \leq -\eta_g \sum_{t=1}^{T}\sum_{i=1}^{P} \hat{q}_{[t,i]} \sum_{k=1}^{K} \mathcal{L}\big(f_{[k,t,i]}(\mathbf{x}_{k,t}), y_{k,t}\big) + \mathcal{O}(\eta_g^2 K^2 T),$$

where the expectation is over the random variable $I_t \sim (\hat{q}_{[t,1]}, \ldots, \hat{q}_{[t,P]})$ and the second term follows from Hoeffding's inequality for bounded random variables. We next derive the lower bound on $Z$. First, define

$$\ell_{[k,t,i]} \stackrel{\Delta}{=} \exp\bigg(-\eta_g \sum_{\tau=1}^{t-1} \mathcal{L}\big(\bar{\mathbf{w}}_{[k,\tau,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,\tau}), y_{k,\tau}\big)\bigg),$$

and recall $\bar{\ell}_{[k,t,i]}$ from (19). Obviously, we have $\ell_{[k,t,i]} = \bar{\ell}_{[k,t+1,i]}$ and $\ell_{[k,t,i]} \leq \bar{\ell}_{[k,t,i]}$. Then, using these definitions and the telescoping sum,

$$Z \geq \log\bigg(\frac{1}{P} \sum_{i=1}^{P} \prod_{k=1}^{K} \ell_{[k,T+1,i]}\bigg) \geq -\eta_g \min_{i \in [P]} \sum_{t=1}^{T}\sum_{k=1}^{K} \mathcal{L}\big(\bar{\mathbf{w}}_{[k,t,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big) - \log P,$$

where we used $\sum_{i} \bar{\ell}_{[k,T+1,i]} \geq \sum_{i} \ell_{[k,T+1,i]}$ and the fact that the accumulated loss is non-negative. Combining the lower and upper bounds on $Z$ and rearranging the terms completes the proof. □

We remark that Lemma 3 and Lemma 4 hold for any realization of our randomized algorithm. We next prove that our randomized algorithm, which chooses one kernel at every time instead of the combination of all $P$ kernels, only incurs a bounded loss compared with using the combination of the $P$ kernels.
Lemma 5. For any $\delta > 0$, the proposed randomized algorithm achieves the following bound with probability at least $1 - \delta$:

$$\sum_{t=1}^{T}\sum_{k=1}^{K} \mathcal{L}\big(\bar{\mathbf{w}}_t^{\mathsf{T}} \mathbf{z}_{\hat{p}_t}(\mathbf{x}_{k,t}), y_{k,t}\big) - \sum_{t=1}^{T}\sum_{i=1}^{P} \hat{q}_{[t,i]} \sum_{k=1}^{K} \mathcal{L}\big(\bar{\mathbf{w}}_{[k,t,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big) \leq \mathcal{O}\big(K\sqrt{T \log(K/\delta)}\big).$$

Proof. We first define $\check{\mathbf{w}}_{[k,t,i]}$ as

$$\check{\mathbf{w}}_{[k,t,i]} = \frac{1}{K} \sum_{k'=1}^{K} \bar{\mathbf{w}}_{[k',t,i]}, \quad \forall i \in [P].$$

Note that $\check{\mathbf{w}}_{[k,t,i]} = \bar{\mathbf{w}}_{[k,t,i]}$ only when $i = \hat{p}_t$. Then, we define a random variable

$$X_{k,t} = \mathcal{L}\big(\bar{\mathbf{w}}_t^{\mathsf{T}} \mathbf{z}_{\hat{p}_t}(\mathbf{x}_{k,t}), y_{k,t}\big) - \sum_{i=1}^{P} \hat{q}_{[t,i]} \mathcal{L}\big(\check{\mathbf{w}}_{[k,t,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big).$$

Note that $\hat{q}_{[t,i]}$ is obtained as a consequence of the random variables $\hat{p}_1, \ldots, \hat{p}_{t-1}$, and $\hat{p}_t$ is chosen according to the PMF $(\hat{q}_{[t,1]}, \ldots, \hat{q}_{[t,P]})$. Let $\mathcal{F}_t = \sigma(\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_t)$ be the smallest sigma-algebra such that $(\hat{p}_1, \ldots, \hat{p}_t)$ is measurable. Then $\{\mathcal{F}_t : t = 1, \ldots, T\}$ is a filtration and $X_{k,t}$ is $\mathcal{F}_t$-measurable. Note that, conditioned on $\mathcal{F}_{t-1}$, $\hat{q}_{[t,i]}$ is fixed while $\hat{p}_t$ and $\bar{\mathbf{w}}_t$ are random variables. Using this fact, we have $\mathbb{E}[X_{k,t} \mid \mathcal{F}_{t-1}] = 0$. Hence $\{X_{k,t} : t \in [T]\}$ is a martingale difference sequence, and $X_{k,t} \in [B_t, B_t + c_t]$ is bounded, where $B_t$ is an $\mathcal{F}_{t-1}$-measurable random variable,

$$B_t = -\sum_{i=1}^{P} \hat{q}_{[t,i]} \mathcal{L}\big(\check{\mathbf{w}}_{[k,t,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big),$$

and $c_t = 1$. From the Azuma-Hoeffding inequality, the following bound holds for any $\delta > 0$ with probability at least $1 - \delta$:

$$\sum_{t=1}^{T} X_{k,t} \leq \sqrt{\frac{T \log(1/\delta)}{2}}.$$

Since this is true for any $k \in [K]$, summing over the $K$ nodes (with the union bound) yields the bound in the lemma. Also, by the definition of $\check{\mathbf{w}}_{[k,t,i]}$ and the convexity of the loss function (Assumption 1),

$$\sum_{k=1}^{K} \mathcal{L}\big(\check{\mathbf{w}}_{[k,t,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big) \leq \frac{1}{K} \sum_{k=1}^{K}\sum_{k'=1}^{K} \mathcal{L}\big(\bar{\mathbf{w}}_{[k',t,i]}^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{k,t}), y_{k,t}\big).$$

Combining the above bounds completes the proof.
□

Combining Lemmas 3, 4, and 5, we complete the proof of Theorem 2.

EXPERIMENTAL RESULTS
In this section, we demonstrate the superiority of the proposed eM-KOFL and pM-KOFL via experiments with real datasets on online regression and time-series prediction tasks. As benchmark methods, the two vanilla methods, S-KOFL and vM-KOFL, are considered. We believe that they are reasonable baselines because they are constructed by leveraging the best-known federated learning and multi-kernel learning approaches; also, to the best of our knowledge, no other method for the KOFL framework exists. In our experiments, we consider a communication network consisting of $K = 20$ and $K = 100$ decentralized nodes for small-scale and large-scale datasets, respectively. A regularized least-squares loss function is used, i.e.,

$$\mathcal{L}\big(f(\mathbf{x}_{k,t}), y_{k,t}\big) = \big(f(\mathbf{x}_{k,t}) - y_{k,t}\big)^2 + \lambda \|\mathbf{w}\|^2.$$

The learning accuracy at time $t$ is measured by the cumulative mean-square-error (MSE):

$$\mathrm{MSE}(t) = \frac{1}{Kt} \sum_{\tau=1}^{t} \sum_{k=1}^{K} \big(\hat{y}_{k,\tau} - y_{k,\tau}\big)^2,$$

where $\hat{y}_{k,t}$ and $y_{k,t}$ denote a predicted label and a true label, respectively. Due to the randomness of the above methods caused by the RF approximation, the MSE performances averaged over 50 trials are evaluated. Also, the following hyper-parameters are used:

$$\eta_l = 0.5, \quad \eta_g = 0.5, \quad \lambda = 0.01, \quad \text{and} \quad M = 100. \quad (56)$$

These parameters could be further tuned; however, as noticed in [12], such hyper-parameter optimization is still an open problem even in the simpler centralized network. In our experiments, thus, the single set of hyper-parameters in (56) is used for all test datasets. We build the kernel dictionary from 11 Gaussian kernels (i.e., $P = 11$), each defined by a Gaussian basis kernel with a different bandwidth parameter. Finally, the real datasets for our experiments on online regression and time-series prediction are described in Sections 6.1 and 6.2, respectively, and summarized in Table 2.

Performance Evaluations. We first verify that pM-KOFL operates equivalently to eM-KOFL, i.e., the true PMF in (21) is well-approximated by the proxy PMF in (23). The corresponding numerical result is illustrated in Fig. 2, where $\mathbb{P}(\hat{p}_t = \tilde{p}_t)$ is computed empirically with 100 samples. Recall that $\hat{p}_t$ and $\tilde{p}_t$ indicate the best kernel indices at time $t$ randomly chosen by the true PMF (in eM-KOFL) and the proxy PMF (in pM-KOFL), respectively. It is clearly shown that after a certain time (called a mixing time), pM-KOFL and eM-KOFL operate in the same way with high probability. Furthermore, the mixing time is extremely short. For this reason, pM-KOFL can yield almost the same performance as eM-KOFL even with a not-so-large number of incoming data (i.e., finite $T$).

Next, we demonstrate the effectiveness of the proposed methods on various online learning tasks. Fig. 3 shows the MSE performances on online regression tasks with the real datasets in Section 6.1. We identify that the multi-kernel methods, vM-KOFL, eM-KOFL, and pM-KOFL, yield more stable performances than the single-kernel methods. In contrast, S-KOFL can provide an attractive performance only when a proper single kernel is preselected; otherwise, S-KOFL deteriorates the learning accuracy considerably. This situation can happen in real-world applications, and thus S-KOFL is not recommended in practice. Notably, both eM-KOFL and pM-KOFL attain almost the same performance as vM-KOFL for all real datasets, where the performance of vM-KOFL can be regarded as the best performance (lower bound) under the KOFL framework. Namely, they can fully enjoy the advantage of multiple kernels as in vM-KOFL while having a communication overhead similar to S-KOFL.
One can expect that, without increasing the communication overhead, pM-KOFL (or eM-KOFL) yields an outstanding performance for any real-world application by using a sufficiently large number of kernels. This surprising result could not be attained with the vanilla vM-KOFL. Exactly the same trends are observed in Fig. 4 on time-series prediction tasks. This verifies that the proposed pM-KOFL and eM-KOFL give stable performances on various online learning tasks. These numerical results suggest the practicality of the proposed methods.
We next evaluate the robustness of the proposed pM-KOFL to partial participation and non-IID (or heterogeneous) data. Toward this, the real datasets (e.g., Twitter, Air quality, Parking occupancy, and Power consumption) are partitioned in a non-IID fashion as follows. Given the dataset $\{(\mathbf{x}_t, y_t) : t \in [KT]\}$, the samples are sorted in ascending order with respect to $\{y_t : t \in [KT]\}$, and the sorted dataset is denoted as $\{(\mathbf{x}_{t_j}, y_{t_j}) : j \in [KT]\}$ with $y_{t_1} \leq y_{t_2} \leq \cdots \leq y_{t_{KT}}$. Then, the local data of node $k$ is obtained as

$$\mathcal{D}_k = \{(\mathbf{x}_{k,t}, y_{k,t}) : t \in [T]\}, \quad (59)$$

where $\mathbf{x}_{k,t} = \mathbf{x}_{t_{(k-1)T + t}}$ and $y_{k,t} = y_{t_{(k-1)T + t}}$. Recall that at each time $t$, node $k$ receives the incoming data $(\mathbf{x}_{k,t}, y_{k,t})$. By construction, $\mathcal{D}_k$ tends to contain larger labels as $k$ increases, and the labels of $\mathcal{D}_1$ and $\mathcal{D}_K$ can be quite different.
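A minimal sketch of this label-sorted partition (59) is given below; the helper name and toy data are our own.

```python
import numpy as np

def noniid_partition(X, y, K):
    """Label-sorted partition of (59): sort the pooled K*T samples by label in
    ascending order and assign node k the k-th contiguous chunk of length T,
    so nodes with larger k receive larger labels."""
    order = np.argsort(y)                 # ascending label order
    T = len(y) // K
    return [(X[order[k * T:(k + 1) * T]], y[order[k * T:(k + 1) * T]])
            for k in range(K)]

# e.g., K = 20 nodes over a toy pooled dataset of K*T samples
rng = np.random.default_rng(5)
X, y = rng.normal(size=(2000, 8)), rng.normal(size=2000)
node_streams = noniid_partition(X, y, K=20)   # node_streams[k] = (X_k, y_k)
```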
Regarding partial participation, only 10% of the distributed nodes in the network transmit their local models to the server for FedAvg. Also, the 10% of nodes are selected uniformly at random and independently from time to time (called uniform selection). Only for the purpose of verifying the heterogeneity of the dataset in (59), we consider the so-called biased selection, in which 10% of the nodes are selected uniformly at random and independently from the first $\lfloor K/2 \rfloor$ nodes. Due to the aforementioned data partition, it is expected that a global function learned under biased selection is customized to the data with smaller labels. In Fig. 5, the MSE results obtained with biased selection verify that our data partition indeed reflects the non-IID nature. From Fig. 5, we observe that the proposed pM-KOFL provides robustness to non-IID data and partial participation on both online regression and time-series prediction tasks. Namely, our algorithm tolerates the degree of heterogeneity of the datasets used in our experiments. As explained in Remark 3, nonetheless, the robustness of our algorithm has not been theoretically proved; such a rigorous analysis requires extensive additional effort, which is left for future work.

Online Regression Tasks
In the experiments on online regression tasks, the following popular real datasets from the UCI Machine Learning Repository are considered:
Twitter [32]: Data contains buzz events from Twitter, where each attribute is used to predict the popularity of a topic; a higher value indicates more popularity.
Conductivity [33]: Data contains samples extracted from superconductors, where each feature represents information critical to constructing a superconductor, such as the density and mass of atoms. The goal is to predict the critical temperature for creating a superconductor.
Air quality [34]: Data includes samples whose features are hourly responses from an array of chemical sensors deployed in an Italian city. The task is to predict the concentration of polluting chemicals in the air.

Time-Series Prediction Tasks
We consider time-series prediction tasks which estimate future values in an online fashion. As in the centralized counterpart [12], the well-known autoregressive (AR) model is considered. An AR(s) model predicts the future value $y_t$ assuming a linear dependency on its $s$ past values, i.e.,

$$y_t = \sum_{i=1}^{s} \gamma_i y_{t-i} + n_t,$$

where $\gamma_i$ denotes the weight for $y_{t-i}$ and $n_t$ denotes Gaussian noise at time $t$. Based on this, the RF-based kernelized AR(s) model, which can explore a nonlinear dependency, is introduced in [12]:

$$y_t = f_t(\mathbf{x}_t) + n_t,$$

where $\mathbf{x}_t = [y_{t-1}, \ldots, y_{t-s}]^{\mathsf{T}}$. The proposed pM-KOFL aims at learning $f_t(\cdot)$ with a parameterized model $f(\mathbf{x}; \{\tilde{p}_t, \bar{\mathbf{w}}_t\}) = \bar{\mathbf{w}}_t^{\mathsf{T}} \mathbf{z}_{\tilde{p}_t}(\mathbf{x})$. The other methods are defined similarly. Then, the proposed and benchmark methods are tested with the following univariate time-series datasets from the UCI Machine Learning Repository:
Power consumption [35]: Data contains samples, each representing the active energy consumed every minute (in watt-hours) by electrical equipment in a household.
Parking occupancy [36]: Data contains samples obtained from parking lots in Birmingham, each indicating the car park occupancy rate.
Traffic [37]: Data contains time-series traffic data obtained from the Minneapolis Department of Transportation in the US. Data is collected from hourly Interstate 94 westbound traffic volume for MN DoT ATR station 301, roughly midway between Minneapolis and St Paul, MN.
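As a small illustration, the sketch below builds the lagged feature vectors $\mathbf{x}_t = [y_{t-1}, \ldots, y_{t-s}]^{\mathsf{T}}$ on which the KOFL methods run; the helper name and toy series are our own assumptions.

```python
import numpy as np

def ar_features(series, s):
    """Lagged features for the kernelized AR(s) model: x_t = [y_{t-1}, ...,
    y_{t-s}] with target y_t, turning a univariate series into a stream on
    which the KOFL methods can run."""
    X = np.stack([series[t - s:t][::-1] for t in range(s, len(series))])
    return X, series[s:]

# e.g., s = 5 lags on a toy sinusoidal series
X, y = ar_features(np.sin(np.arange(200) / 5.0), s=5)
```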

CONCLUSION
We proposed a novel randomized algorithm (named eM-KOFL) for the kernel-based online federated learning (KOFL) framework. It was theoretically proved that eM-KOFL achieves the same asymptotic performance as the vanilla multi-kernel method (termed vM-KOFL) while having a lower communication overhead. Also, our analysis revealed that eM-KOFL yields the same asymptotic performance as the centralized counterpart without sharing raw data (i.e., preserving edge-node privacy). Focusing on the practical aspect, we presented a communication-efficient variant of eM-KOFL by mimicking the delayed-Exp strategy in an efficient way; the proposed method is named pM-KOFL. Via experiments with real datasets, we demonstrated the effectiveness of the proposed eM-KOFL and pM-KOFL on various online learning tasks. In particular, pM-KOFL yields almost the same performance as vM-KOFL while having roughly a $1/P$ fraction of its uplink/downlink communication overhead, where $P$ denotes the size of the kernel dictionary. These results suggest the practicality of pM-KOFL. One interesting extension is to build a collaborative KOFL by integrating collaborative learning with KOFL so as to enable edge nodes to participate in the KOFL framework without directly connecting to the server. Another interesting future work is to provide a more rigorous theoretical analysis of our algorithm by taking into account the impact of partial participation and non-IID data.

Jeongmin Chae (Student Member, IEEE) received the MSc degree in electrical and computer engineering from Ajou University, Suwon, South Korea, in 2020. She is currently working toward the PhD degree in electrical engineering at the University of Southern California, Los Angeles, CA, USA, where she has been engaged in structured matrix completion and active matrix factorization. Her research interests include high-dimensional data inference, theoretical machine learning, and statistical signal processing.