Distributed Stochastic Consensus Optimization Using Communication-Censoring Strategy

In this article, a novel communication-efficient distributed stochastic algorithm (CO-DSA) is proposed for solving large-scale consensus optimization problems. Compared to the existing relevant work, where only a sublinear convergence rate is obtained for strongly convex and smooth objective functions, the CO-DSA achieves a linear convergence rate even in the presence of an event-triggered communication-censoring strategy. Moreover, by properly setting the threshold function of the event-triggered communication scheme, the CO-DSA maintains the same convergence rate as the algorithm without event-triggered communication. This means the CO-DSA theoretically yields communication efficiency for free. Numerical experiments verify the theoretical findings and also show the excellent communication-saving effect of the CO-DSA in large distributed networks.


I. INTRODUCTION
THIS article considers a connected network consisting of N nodes (e.g., data centers or mobile devices). The N nodes cooperatively solve the following optimization problem:

min_{x ∈ R^p} Σ_{n=1}^{N} f_n(x),    (1)

where f_n(x): R^p → R is the so-called local objective function that is only accessed by node n and is defined as the average of q_n local sample functions f_{n,i}(x), i.e., f_n(x) = (1/q_n) Σ_{i=1}^{q_n} f_{n,i}(x). To solve the problem distributively, a peer-to-peer communication network is used among the nodes, described by a directed graph G = {V, A}, where V = {1, 2, ..., N} is the set of N nodes and A denotes the collection of directed edges (i, j), i, j ∈ V, such that node i can receive information from node j. It is noted that the local sample functions are assumed to be strongly convex and smooth throughout this article.

The authors are with the College of Information Science and Engineering, Northeastern University, Shenyang 110819, China (e-mail: liranran6@qq.com; 1337620113@qq.com; yufan_0@126.com).
In this article, we propose a novel communication-efficient distributed stochastic gradient algorithm (CO-DSA) with an event-triggered communication strategy; therefore, both the computational burden and the communication efficiency are addressed. To the best of our knowledge, algorithms of the form "distributed stochastic gradient + event-triggered communication" have been investigated in [15] and [21]. In [15], however, a central server is assumed to exist that receives/sends information from/to all the nodes. From this point of view, the algorithm proposed in [15] is not a fully distributed algorithm. In [21], a sublinear convergence rate is obtained, and the stringent assumption that the gradient is bounded is needed. In our article, no central server exists and, hence, our proposed algorithm is fully distributed. More importantly, we have proved that a linear convergence rate is attained by our proposed algorithm without the bounded gradient assumption.

A. Challenges
The CO-DSA builds on the idea of the distributed stochastic algorithm (DSA) [11] and a threshold-function-based event-triggered communication scheme. The CO-DSA requires all communications among the nodes to be reviewed by prescribed communication-censoring threshold functions. Communication is then permitted only if the local model parameters have changed significantly, and thereby, a communication reduction is achieved. Some significant technical challenges arise in analyzing the convergence of the CO-DSA, for the following reason. The DSA mainly relies on using the difference of two consecutive stochastic averaging gradients as the descent direction, e.g., g_n^t − g_n^{t−1} in step 7 of Algorithm 1. Here, g_n^t is an unbiased estimate of the local gradient ∇f_n(x_n^t), which is designed to alleviate the noise due to the stochastic gradient approximation. The event-triggered communication scheme, however, introduces noise artificially. For example, if ξ_n^{t+1} < τ^{t+1} at some t = t_0 in step 9 of Algorithm 1, x̃_n^{t_0+1} is kept unchanged and is therefore possibly not equal to x_n^{t_0+1}. This mechanism, in fact, introduces noise into the iteration process that may offset the effect of the unbiased stochastic gradient estimator. The combination of these two seemingly contradictory ideas brings the technical challenges to the theoretical analysis, as can be seen in the proof of Theorem 1.
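To make the censoring step concrete, here is a minimal sketch of the rule in Python (the function and variable names are ours; in Algorithm 1 the innovation ξ_n^{t+1} is compared against the threshold τ^{t+1}):

```python
import numpy as np

def censor(x_new, x_last_sent, tau):
    """Event-triggered censoring rule (sketch): a node broadcasts its fresh
    state only if it has moved at least tau away from the last copy it sent;
    otherwise its neighbors keep working with the stale copy."""
    xi = np.linalg.norm(x_new - x_last_sent)  # innovation, the role of xi_n^{t+1}
    if xi >= tau:
        return x_new, True       # transmit: neighbors store the fresh state
    return x_last_sent, False    # censored: neighbors keep the stale state

# a large move passes the censor, a small one is suppressed
sent_a, tx_a = censor(np.array([1.0, 1.0]), np.zeros(2), tau=0.5)
sent_b, tx_b = censor(np.array([0.1, 0.0]), np.zeros(2), tau=0.5)
```

The suppressed case is exactly the source of the artificial noise discussed above: the stale copy `sent_b` stands in for the true state in the neighbors' updates.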

B. Our Contributions
The main contributions of this article are summarized as follows.
1) A linear convergence rate is achieved by the CO-DSA.
To the best of our knowledge, most existing results apply the event-triggered communication scheme to deterministic distributed algorithms, with the exceptions of [15] and [21].
In [15], a central parameter server is needed, and every working node in the network has to conduct bidirectional communication with the parameter server. In our work, no such central parameter server exists. The difference between our work and [15] is illustrated through an example given in Fig. 1. In [21], only a sublinear convergence rate is obtained. Moreover, [21] relies on a bounded gradient assumption that is removed in our article.

2) By properly setting the censoring threshold function of the CO-DSA, the same convergence rate is maintained as the algorithm without event-triggered communication.
This means our proposed algorithm theoretically yields communication efficiency for free.

The rest of this article is organized as follows. In Section II, we give the details of the proposed CO-DSA. In Section III, some assumptions are presented, under which the linear convergence of the CO-DSA is theoretically proved. Section IV presents numerical experiments to validate the effectiveness of the CO-DSA. Finally, Section V concludes this article. Some detailed proofs are given in the Appendix.

Notations:
In this article, we use the following notations.
1) R^N denotes the N-dimensional Euclidean space.
2) ||·|| denotes the Euclidean norm for vectors and the Frobenius norm for matrices.
3) A^T represents the transpose of the matrix A.
4) ⟨A, B⟩ := A^T B, where A and B are two matrices with appropriate dimensions.
5) A ⪯ B and A ≺ B mean that B − A is a positive-semidefinite matrix and a positive-definite matrix, respectively.
6) ||a||_A = √(a^T A a) denotes the A-weighted norm of the vector a, where A is a positive-definite matrix.
7) E[x] denotes the expectation over the stochastic variable x, and E[x | F^t] denotes the conditional expectation of x, where F^t measures the history of the system up until iteration t.
8) Null{A} and span{A} are defined as the null space and the span space of the matrix A, respectively.
9) "⊗" denotes the Kronecker product.
10) N_i denotes the set of nodes that can receive information from node i in the network, and |N_i| denotes its cardinality.
11) i_n^t is a number chosen randomly from the set {1, ..., q_n} of node n at the tth iteration.
12) LHS and RHS are abbreviations of left-hand side and right-hand side, respectively.
13) ∇f_{n,i}(·) denotes the gradient of the function f_{n,i}(·). 1 denotes the column vector of all ones.
17) x* := 1_N ⊗ x*, where x* ∈ R^p denotes the optimal argument of (1), i.e., Σ_{n=1}^N ∇f_n(x*) = 0.

Algorithm 1 (fragment). Step 3: choose i_n^t from the set {1, ..., q_n} randomly. Step 4: compute and store the stochastic averaging gradient ∇f_{n,i}(y_{n,i}^t).

In step 5, the table variables y_{n,i}^{t+1} are updated; in step 6 (for t = 0) and step 7 (for t > 0), the variable x_n^t is updated. To solve (1) in a computation- and communication-efficient manner, we propose a DSA with an event-triggered communication scheme. The details are given in Algorithm 1.

It can be seen from Algorithm 1 that, when the censoring condition is not met, the transmitted copy is kept unchanged, i.e., x̃_n^{t+1} = x̃_n^t. The details are given in steps 8-10. The variables y_{n,i}^t, i = 1, ..., q_n, can be seen as intermediate variables at which the instantaneous gradients ∇f_{n,i}, i = 1, ..., q_n, are evaluated. For node n, the variables y_{n,i}^{t+1} are updated at iteration t, where i_n^t is chosen randomly from the set {1, ..., q_n}. It is assumed that the instantaneous gradients ∇f_{n,i} are stored and maintained in a gradient table. At each iteration, one gradient entry is updated randomly, while the other gradients, i.e., ∇f_{n,i} for i ≠ i_n^t, are kept unchanged. This idea was proposed in [7] to reduce the noise introduced by the stochastic gradient approximation. Generally speaking, x_n^t is the consensus variable, x̃_n^t is related to the event-triggered communication scheme, and y_{n,i}^t is related to the reduction of the computational burden by the stochastic gradient approximation scheme.
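The gradient-table idea can be sketched as follows for a single node with toy quadratic sample functions (the names and the quadratic f_i are our illustrative assumptions, not the paper's setup); with the table held fixed, the returned g is an unbiased estimate of the full local gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
q, p = 5, 2                              # local samples and dimension at one node
targets = rng.normal(size=(q, p))        # toy samples: f_i(x) = 0.5*||x - targets[i]||^2

def grad(i, x):
    return x - targets[i]                # gradient of the toy quadratic f_i

# gradient table: one stored gradient per sample, each evaluated at y_{n,i}
y = np.zeros((q, p))
table = np.stack([grad(i, y[i]) for i in range(q)])
table_avg = table.mean(axis=0)

def averaging_gradient(x, i):
    """Stochastic averaging gradient (SAGA-style sketch):
    g = grad_i(x) - table[i] + mean(table), then entry i is refreshed.
    E_i[g] equals the full local gradient when the table is held fixed."""
    global table_avg
    g_new = grad(i, x)
    g = g_new - table[i] + table_avg
    table_avg = table_avg + (g_new - table[i]) / q   # cheap running-average update
    table[i] = g_new
    y[i] = x                                         # entry i is now evaluated at x
    return g

x = np.ones(p)
g = averaging_gradient(x, i=2)
```

Only one sample gradient is evaluated per call, which is the source of the per-iteration computational saving discussed above.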
Observe that the last part of the average gradient g^t in step 4 can be equivalently expressed as in (2). All terms on the RHS of (2) are known at iteration t, which means that the computation of g_n^t in step 4 can be performed efficiently.
It is noted that two network connection weights, w_nm and w̃_nm, are used in Algorithm 1, e.g., in steps 6 and 7. This is due to the fact that two mixing matrices are used in our algorithm. In the following, we illustrate the weight matrices and the communication graph used in the algorithm. The weight matrices W and W̃ can be chosen as different matrices, provided that they satisfy Assumption 1. It is found in [8] that the simple choice W̃ = (I + W)/2 facilitates the theoretical analysis and also leads to high efficiency in applications. Throughout our article, W and W̃ should satisfy the following assumption.
Assumption 1: The matrices W and W̃ are symmetric; moreover, we have the following: and 0 ≺ W̃.

Remark 1: Under Assumption 1, it can be derived from [8, Proposition 2.2] that null{I − W} = span{1}. Furthermore, the symmetry of W and W̃ implies that we use undirected networks throughout this article.
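As a numerical illustration of Assumption 1 and the simple choice W̃ = (I + W)/2, the following sketch builds Metropolis-Hastings weights (one common construction, not necessarily the matrix used in the paper) and also compares the resulting ε = λ_min(W̃) on a ring versus a complete graph:

```python
import numpy as np

def metropolis(adj):
    """Metropolis-Hastings mixing weights for an undirected graph (sketch)."""
    n = len(adj)
    W = np.zeros((n, n))
    deg = adj.sum(axis=1)
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()       # rows sum to one
    return W

def eps_of(adj):
    """Minimum eigenvalue of W_tilde = (I + W)/2, the epsilon appearing later."""
    W = metropolis(adj)
    return np.linalg.eigvalsh((np.eye(len(adj)) + W) / 2).min()

n = 8
idx = np.arange(n)
ring = np.zeros((n, n))
ring[idx, (idx + 1) % n] = ring[(idx + 1) % n, idx] = 1.0
complete = np.ones((n, n)) - np.eye(n)

W = metropolis(ring)
W_tilde = (np.eye(n) + W) / 2
symmetric = np.allclose(W, W.T)
row_stochastic = np.allclose(W.sum(axis=1), 1.0)
eps_ring, eps_complete = eps_of(ring), eps_of(complete)
```

Note that the better-connected complete graph yields a larger ε, which matches the connectivity intuition discussed after Corollary 1.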
Then, Algorithm 1 can be rewritten in the following form.

Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
When t > 0, the recursion (4) applies. In the next section, we will analyze the convergence of (3) and (4).

A. Preconditions of Convergence
In the following, two assumptions are presented that guarantee the convergence of the proposed algorithm.
Assumption 2: The sample functions f_{n,i}(·) are strongly convex with parameter μ, i.e., for all n ∈ {1, ..., N}, i ∈ {1, ..., q_n}, and a, b ∈ R^p, we have ⟨∇f_{n,i}(a) − ∇f_{n,i}(b), a − b⟩ ≥ μ ||a − b||².

Assumption 3: The sample functions are differentiable, and their gradients are Lipschitz continuous with parameter L, i.e., for all n ∈ {1, ..., N}, i ∈ {1, ..., q_n}, and a, b ∈ R^p, we have ||∇f_{n,i}(a) − ∇f_{n,i}(b)|| ≤ L ||a − b||.

Assumptions 2 and 3 are introduced to ensure linear convergence and are widely used in the existing literature, such as [8], [11], and [18].

B. Main Results
For the existence of a solution to the optimization problem (1), there should exist x* ∈ span(1_N ⊗ I_p) such that the following Karush-Kuhn-Tucker condition holds. We will next show that x^t of (3) and (4) converges to x* at a linear rate in expectation. Summing up (4) from k = 1 to t yields (5). Applying (3) to (5) yields (6). Define the new variables e^t = x^t − x̃^t, v^0 = U x^0, and u^t (stacking x^t and v^t). Then, (6) is transformed to (7) and (8).

Lemma 1: Under Assumption 1, the variables x and v in (7) and (8) satisfy (9), where v* is defined by α∇f(x*) + U v* = 0.

Proof: Combining (7) and (8) yields (10), where equality i of (10) follows from the preceding definitions. Then, we add α∇f(x*) + U v* = 0 to the RHS of (10), which completes the proof.
Next, we will present two lemmas from [11] that will be used in our theoretical analysis.
Lemma 2 (see [11, Lemma 4]): Under Assumptions 1-3, we obtain the corresponding bound, where p^t is as defined in (12).

Lemma 3 (see [11, Lemma 6]): If Assumptions 1-3 hold, then for all t ≥ 0, the sequence p^t in (12) satisfies (13), where q_min and q_max denote the smallest and largest values of the number of instantaneous functions at a node, respectively, i.e., 1 ≤ q_min ≤ q_n ≤ q_max.

Lemmas 2 and 3 establish the corresponding upper bounds in expectation, respectively. It is noted that, due to the strong convexity of the local sample functions, p^t is positive during the iteration process.
Lemma 4: Under Assumptions 1-3, we choose a nonincreasing, nonnegative, square-summable censoring threshold, i.e., 0 ≤ τ^{t+1} ≤ τ^t and Σ_{t=0}^∞ (τ^t)² < ∞.
Proof: The following relationship is easily obtained as (15). From Lemma 1, we have (16). Taking the conditional expectation of (15) yields (17), where step i of (17) holds because of ||a + b||² ≤ 2||a||² + 2||b||² and (16). Notice that v^0 = U x^0 and v^{t+1} = v^t + U x̃^t; then, v^{t+1} is a sum of terms of the form U x̃^i and, therefore, the vectors v^t at any iteration t lie in the column space of the matrix U, i.e., (18) holds. Combining (17) and (18), we obtain (19), where step i of (19) uses Lemma 2 and ||e^{t+1}|| ≤ τ^{t+1} ≤ τ^t. Multiplying both sides of (9) by x^{t+1} − x* yields (20). According to (7) and the fact that U x* = 0, we obtain U(x^{t+1} − x*) = v^{t+1} − v^t + U e^{t+1}; then, we have (21). Then, substituting (21) into (20) yields (22), where step i of (22) uses the preceding relations.
With the lemmas obtained above, we are ready to present the first convergence result as summarized in the following theorem.
Theorem 1: Under the conditions of Lemma 4, if we set the step size α in Algorithm 1 as follows, where ε is the minimum eigenvalue of the positive-definite matrix W̃ and β_1 is a parameter chosen from the following interval, then there exists a positive constant δ ∈ (0, 1) such that (35) holds, where c and d can be arbitrarily chosen from the following intervals, respectively. Here, θ_1, θ_2, θ_3, and θ_4 denote the maximum eigenvalues of the matrices W̃ − W, I − W, I + W − 2W̃, and W̃, respectively, and λ_1 is the minimum nonzero eigenvalue of the positive-semidefinite matrix W̃ − W. The selections of the parameters β_2, β_3, and β_4 are given in (49).
Proof: See the Appendix.

Remark 2: It can be seen from (35) that the convergence of the CO-DSA depends on the threshold function τ^t. Meanwhile, the algorithm has a linear convergence rate if the event-triggered communication mechanism is removed, i.e., τ^t = 0. Next, we will show that the linear convergence rate can also be attained by properly setting the threshold function.
We define a scalar function as follows; then, the following corollary is obtained.
Corollary 1: Under the same conditions as Theorem 1, and setting the censoring threshold function τ^t = ρ·σ^t with ρ > 0 and 0 < σ < √(1 − δ), we have the stated bound.

Proof: Taking the expectation of (35) yields (37). Equation (37) can be expressed equivalently as (38), where step i holds due to the fact that 1 − δ > σ². Recalling that p^t is a positive sequence, we then obtain (39). Combining (38) and (39), the proof is completed.

We make the following three observations from Corollary 1.

1) Corollary 1 implies that the variable x converges to the optimal solution at a linear rate of √(1 − δ) in expectation by setting the threshold function as τ^t = ρ·σ^t. From (35), we know that the algorithm has a linear convergence rate of √(1 − δ) if the event-triggered communication mechanism is removed, i.e., τ^t = 0. Therefore, Corollary 1 discloses the fact that the CO-DSA can attain the same convergence rate as the algorithm without the event-triggered mechanism. This is an important result in that the CO-DSA is theoretically proved to obtain communication efficiency for free.
2) It is worth noting that the upper bound of σ is √(1 − δ). Recalling that τ^t = ρ·σ^t, where σ is the linear decay rate of the threshold, the condition σ < √(1 − δ) means that τ^t must decay to 0 faster than the linear convergence rate of the algorithm without the event-triggered mechanism.
3) ε denotes the minimum eigenvalue of the mixing matrix W̃, and a larger ε means better connectivity of the communication network. On the other hand, it is observed from (36) that a larger ε leads to a faster convergence speed. This observation is consistent with the intuition that better connectivity of the communication network leads to faster convergence.
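The rate condition in observation 2) can be checked numerically; δ, ρ, and σ below are arbitrary example values, not constants derived from the paper:

```python
import numpy as np

delta, rho = 0.1, 1.0
rate = np.sqrt(1 - delta)     # linear rate sqrt(1 - delta) of the uncensored algorithm
sigma = 0.9                   # threshold decay; Corollary 1 needs sigma < sqrt(1 - delta)

t = np.arange(200)
tau = rho * sigma**t          # censoring thresholds tau^t = rho * sigma^t
envelope = rate**t            # error envelope of the uncensored algorithm

# the threshold must vanish relative to the error envelope, so censoring
# costs nothing in terms of the convergence rate
ratio = tau / envelope
vanishes = bool(ratio[-1] < 1e-3 and np.all(np.diff(ratio) <= 0))
```

Because σ < √(1 − δ), the ratio τ^t/(√(1 − δ))^t decays monotonically to zero, which is the "communication efficiency for free" mechanism in numbers.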

IV. NUMERICAL EXPERIMENT
In this section, we give some numerical experiments to validate the performance of the CO-DSA, and some comparisons with existing algorithms are also given. The following logistic regression problem is considered as (40), where (λ/2)||x||² denotes the regularization term, which is used to prevent overfitting, and λ is set to 10⁻⁴. Both parts of the global function are strongly convex, and so is the function f(x). The local function and the sample function follow from the definition of the global function. The training samples are characterized by the feature vectors s_{n,i} ∈ R^p and the related labels l_{n,i} ∈ {−1, 1}. The constant edge weight matrix W is constructed with a positive constant σ. To ensure that the symmetric doubly stochastic weight matrix W is positive semidefinite, we set σ = 0. In particular, we set W̃ to (I + W)/2, which has excellent performance in EXTRA, and the network is generated at random with different connectivity ratios κ. Throughout the numerical experiments, the accuracy of the algorithm is characterized by the residual error ||x^t − x*||²/||x^0 − x*||². The cumulative communication cost is defined as the number of broadcast messages of all the nodes.
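For reference, the per-sample loss and gradient of the regularized logistic problem (40) can be sketched as follows, with a finite-difference check on toy data (variable names are ours):

```python
import numpy as np

lam = 1e-4  # the regularization weight lambda used in the experiments

def sample_loss(x, s, l):
    """One regularized logistic sample loss: log(1 + exp(-l*s^T x)) + (lam/2)*||x||^2."""
    return np.log1p(np.exp(-l * (s @ x))) + 0.5 * lam * (x @ x)

def sample_grad(x, s, l):
    """Gradient of sample_loss with respect to x."""
    z = -l * (s @ x)
    sig = 1.0 / (1.0 + np.exp(-z))      # sigmoid(z)
    return -l * sig * s + lam * x

# finite-difference check of the gradient on toy data
rng = np.random.default_rng(0)
s, l, x = rng.normal(size=3), 1.0, rng.normal(size=3)
eps = 1e-6
fd = np.array([(sample_loss(x + eps * e, s, l) - sample_loss(x - eps * e, s, l)) / (2 * eps)
               for e in np.eye(3)])
grad_ok = bool(np.allclose(fd, sample_grad(x, s, l), atol=1e-5))
```

The λ-term makes each sample loss strongly convex, which is what places the experiment within the scope of Assumptions 2 and 3.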

1) Comparison With DSA:
To compare fairly, we use the dataset presented in [11]. We set the number of global samples to Q = 500 and the total number of nodes to N = 50, so each node is assigned ten samples. The comparisons are performed over three different networks, and the threshold functions τ^t = ρ × σ^t of the CO-DSA are set as ρ = 0.1, σ = 0.99 over the line network; ρ = 0.1, σ = 0.98 over the random network with connectivity ratio κ = 0.4; and ρ = 0.1, σ = 0.975 over the complete network. The step sizes for the CO-DSA and DSA are chosen as α = 4 × 10⁻³, α = 1.5 × 10⁻², and α = 2.2 × 10⁻² for the three networks, respectively. It can be seen from Figs. 2 and 3 that the CO-DSA has almost the same convergence speed as the DSA, but the communication cost is reduced significantly.
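The qualitative behavior in Figs. 2 and 3 can be reproduced with a much-simplified stand-in for the CO-DSA: censored decentralized gradient descent with deterministic gradients on toy quadratics (all parameters and names below are illustrative assumptions, not the paper's experimental setup). Both runs reach essentially the same limit, while the censored run transmits far fewer messages:

```python
import numpy as np

N, p, T = 10, 2, 400
rng = np.random.default_rng(1)
targets = rng.normal(size=(N, p))       # node n minimizes 0.5*||x - targets[n]||^2
x_star = targets.mean(axis=0)           # consensus optimum of the summed objective

# Metropolis weights on a ring (every degree is 2, so every weight is 1/3)
W = np.zeros((N, N))
for i in range(N):
    W[i, (i - 1) % N] = W[i, (i + 1) % N] = W[i, i] = 1.0 / 3.0

def run(rho, sigma, alpha=0.3):
    """Censored decentralized gradient descent, a simplified stand-in for
    the CO-DSA (deterministic gradients, plain DGD mixing)."""
    x = np.zeros((N, p))
    x_sent = x.copy()                   # stale copies held by the neighbors
    broadcasts = 0
    for t in range(T):
        tau = rho * sigma**t
        for n in range(N):              # event-triggered censoring step
            if np.linalg.norm(x[n] - x_sent[n]) >= tau:
                x_sent[n] = x[n]
                broadcasts += 1
        # each node mixes its own fresh state with the neighbors' stale copies
        mixed = W @ x_sent + np.diag(W)[:, None] * (x - x_sent)
        x = mixed - alpha * (x - targets)
    err = np.linalg.norm(x - x_star)
    return err, broadcasts

err_cen, comm_cen = run(rho=1.0, sigma=0.97)    # censored run
err_unc, comm_unc = run(rho=0.0, sigma=0.97)    # rho = 0 => every step transmits
```

Counting `broadcasts` plays the role of the cumulative communication cost defined above; the censored run ends at nearly the same error with a fraction of the messages.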
2) Effect of Censoring Threshold: We consider four different censoring thresholds to discuss their influence. The parameters of the threshold function τ^t = ρ × σ^t are chosen as ρ = 1, while σ = 0.93, 0.95, 0.97, and 0.99. We choose a random graph with κ = 0.4 and step size α = 2.5 × 10⁻². It can be seen from Fig. 4 that a larger σ leads to a greater reduction in communication cost but at the expense of slower convergence. This demonstrates a tradeoff between the communication cost and the convergence speed when choosing the parameter of the threshold function.
3) Censoring Threshold Sequence Is Square Summable but Not Summable: In this part, the threshold function is chosen as τ^t = 3/(10000 t), which is square summable but not summable. The performance is then compared with the summable case τ^t = 0.1 × 0.98^t. All other parameters are chosen as in the last part. It can be seen from Fig. 5 that the effect of communication censoring is weak in the early stage when τ^t = 3/(10000 t). After reaching a certain stage, the intensity of the communication censoring gradually strengthens, but the convergence speed slows down significantly. This is because the attenuation rate of the square-summable sequence τ^t = 3/(10000 t) is less than the convergence rate of the algorithm in the early stage, during which the communication-censoring effect is therefore not obvious. After reaching a certain interval, the convergence rate of the algorithm is much less than the decay of τ^t = 3/(10000 t) and, therefore, the communication-censoring mechanism begins to take significant effect.

Fig. 7. Performance comparison of the CO-DSA with the algorithms in [18], ET-LALM [22], and ET-GT [23].
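The crossover between the two threshold schedules can be seen numerically (the 2000-iteration horizon is our choice): the geometric threshold starts larger, which censors aggressively early, but eventually decays far below the 1/t-type threshold:

```python
import numpy as np

t = np.arange(1, 2001, dtype=float)
tau_sq = 3.0 / (10000.0 * t)      # square summable but not summable
tau_geo = 0.1 * 0.98**t           # summable (geometric) threshold

# square-summability of the 1/t-type sequence: partial sums of tau^2 stay bounded
sum_sq = float(np.sum(tau_sq**2))

# first iteration at which the geometric threshold drops below the 1/t one
crossover = int(t[np.argmax(tau_geo < tau_sq)])
```

After the crossover, the geometric schedule suppresses essentially all residual chatter, while the 1/t schedule keeps a comparatively large threshold that slows convergence, consistent with the Fig. 5 discussion.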

B. Performance Validation Over a Real-World Dataset
In this part, the performance validation is presented on the real-world dataset Covertype, which is available at https://www.openml.org. We compare the CO-DSA with another algorithm, referred to as SPARO-SGD in [21]. We selected two types of samples from the Covertype dataset, 500 samples in total. In the experiments, the network consists of 50 nodes, and each node is allocated ten samples. We use a random network with connectivity ratio κ = 0.4. The step size of the DSA is set to α = 0.4. The parameters of the CO-DSA are set as α = 0.4 and τ^t = 3 × 0.997^t. The parameters of SPARO-SGD are set as η_t = 500/(t + 500), γ = 0.1, and c_t = 1500 × t. As we can see from Fig. 6, compared to SPARO-SGD, the CO-DSA has a faster convergence speed and is more communication efficient. This is mainly due to the fact that the CO-DSA uses a fixed step size, while SPARO-SGD uses a decaying learning rate.

C. Comparison With the Algorithms in [18], [22], and [23]

In this subsection, we compare the CO-DSA with the algorithms with communication-censoring schemes in [18], [22], and [23]. In contrast to the stochastic gradients used in the CO-DSA, deterministic gradients are used in [18], [22], and [23]. We still consider problem (40) and a network of N = 500 with connectivity ratio κ = 0.4. We use the same dataset as in [11]. It can be seen from Fig. 7(a) that the CO-DSA has a much smaller computational burden than the algorithms in [18], [22], and [23]. This is due to the fact that only one local sample gradient is evaluated randomly at each iteration in the CO-DSA, while the algorithms in [18], [22], and [23] need to evaluate all the local sample gradients. Fig. 7(b) shows that the deterministic algorithms proposed in [18], [22], and [23] converge faster than the CO-DSA. This is mainly due to the noise introduced by the stochastic gradient approximation of the CO-DSA.

V. CONCLUSION
In this article, we proposed a DSA with an event-triggered communication scheme to solve a large-scale optimization problem. Compared to existing stochastic algorithms such as the DSA, the communication cost is reduced significantly. Moreover, we established the linear convergence rate in expectation for the CO-DSA, in contrast to the existing communication-efficient algorithm SPARO-SGD, for which only sublinear convergence is obtained. We demonstrated the communication-saving effect over different graphs through experiments on a real dataset. Our next research direction is to apply the communication-censoring strategy to nonconvex optimization algorithms.

A. Proof of Theorem 1
Proof: Inequality (35) can be equivalently written as (42). Then, we establish the upper bound of the LHS of (42) and the lower bound of the RHS of (42), and we ensure that the inequality always holds by setting the parameters appropriately. First, we need to find the lower bound of the RHS of (42). Under Lemmas 2 and 3 and (34), we obtain (43).
Then, substituting (19) into (43) yields (44). Next, we need to find the upper bound of the LHS of (42). We have (45), where step i of (45) is due to the strong convexity of the objective function and step ii uses (19); θ_4 represents the maximum eigenvalue of the matrix Z.
Next, we show that the upper bound of the LHS of (42) is less than the lower bound of the RHS. For convenience, we define the quantities H_i. Combining (44) and (45), we obtain (46). To ensure that (46) holds, the conditions (47a)-(47f) should be satisfied, e.g., H_3 ≥ 0 in (47c). In order to ensure that (47a)-(47f) hold, δ should satisfy (48). To ensure that the minimum of the six terms in (48) is greater than zero, we choose the parameters sequentially as in (49). We sum the expectation of (35) from t = 0 to ∞, which yields (50). Equation (50) means that E[||u^t − u*||²_G + c p^t] converges to zero as t → ∞. Recall that ||u^t − u*||²_G = ||v^t − v*||² + ||x^t − x*||²_Z; then, we obtain Σ_{t=0}^∞ ||x^t − x*||²_Z < ∞, so ||x^t − x*||²_Z → 0 as t → ∞, and the proof is completed.
Remark 3: It is worth noting that the range for choosing c in (49) is nonempty. To verify this, we need to show that the lower bound of the range is not greater than its upper bound, i.e., that inequality (51) holds. For the two bounds, we further have the stated inequalities, where steps i and ii hold due to the selections β_3 ≤ λ_1/(2θ_1) and α ≤ ε/(2β_1). Then, (51) holds if inequality (52) holds. It is easily obtained that (52) holds by the selection of β_1 as in (49). Therefore, we conclude that the range for choosing c is nonempty under the selections of β_1, α, and β_3 in (49).

Ranran Li, Weicheng Xu, and Fan Yu

Index Terms—Communication-censoring strategy, large-scale distributed optimization, linear convergence, stochastic gradient.

Manuscript received 19 November 2022; revised 1 March 2023; accepted 24 May 2023. Date of publication 31 May 2023; date of current version 1 March 2024. This work was supported by the National Natural Science Foundation of China under Grant 61603084. Recommended by Associate Editor J. He. (Corresponding author: Ranran Li.)

Fig. 1. Illustration of the difference between our work and [15] through a five-worker network. Red arrows represent communication. (a) Communication network in [15]. (b) Communication network in our work.
Step 10 of Algorithm 1: if x_m^{t+1} is received from neighbor node m, let x̃_m^{t+1} = x_m^{t+1}; else let x̃_m^{t+1} = x̃_m^t. Step 11: end for. λ_1 denotes the smallest nonzero eigenvalue of the matrix W̃ − W, and ε denotes the minimum eigenvalue of the positive-definite matrix W̃. 21) To express the algorithm more clearly, we provide the following definitions.

Fig. 2. (a) and (b) Performance comparison of the DSA and CO-DSA over a random network.

Fig. 3. (a) and (b) Performance comparison of the DSA and CO-DSA over a complete network.

Fig. 4. (a) and (b) Performance comparison of the CO-DSA with different censoring thresholds and the DSA over a random network.

Fig. 5. (a) and (b) Performance comparison of the CO-DSA with a square-summable but not summable censoring threshold, a summable censoring threshold, and the DSA over a random network.

Fig. 6. (a) and (b) Performance comparison of the CO-DSA with SPARO-SGD over the Covertype dataset and a random network.
Here, w_nm and w̃_nm are combination weights of two different networks, and x̃_n^t stores the last transmitted state variable. We define the Euclidean distance ξ_n^{t+1}. For each node n at iteration t, the variables x_n^t, x̃_n^t, and y_{n,i}^t are introduced; x_n^t represents the local estimate of the model parameters and is designed to achieve consensus for all the nodes (see steps 6 and 7 of Algorithm 1).