Distributed Discrete-Time Convex Optimization With Closed Convex Set Constraints: Linearly Convergent Algorithm Design

The convergence rate and the applicability to directed interaction topologies are two important features for practical applications of distributed optimization algorithms. In this article, a new class of fast distributed discrete-time algorithms is developed for solving convex optimization problems with closed convex set constraints over directed interaction networks. Under the gradient tracking framework, two distributed algorithms are designed over balanced and unbalanced graphs, respectively, involving momentum terms and two time-scales. Furthermore, it is demonstrated that the designed distributed algorithms attain linear convergence rates provided that the momentum coefficients and the step size are appropriately selected. Finally, numerical simulations verify the effectiveness and the global accelerated effect of the designed algorithms.

some other fields [4]. Within the context of algorithm design, improving the convergence rate and handling directed interaction topologies pose many technical challenges, especially for constrained optimization problems.
In the past two decades, a great deal of effort has been devoted to studying various distributed constrained optimization problems under different conditions [5], [6], [7], [8]. For balanced graphs, discrete-time algorithms for optimization problems with various constraints were designed in [9], [10], [11], and [12]. A new push-sum-based constrained optimization algorithm was proposed in [13] to address the optimization problem with closed convex set constraints over unbalanced graphs, where a gradient-descent-like method was utilized. Moreover, by using a rescaling gradient technique, a distributed optimization problem in the presence of inequality and equality constraints over unbalanced graphs was solved in [14]. Furthermore, a discrete-time algorithm with a fixed step size was developed in [15] to obtain the exact optimal solution for optimization with convex and smooth objective functions and coupled constraints. However, the convergence rate of an algorithm is also vital in some practical decision scenarios with constraints [16], and a fast linear convergence rate cannot be guaranteed by the above-mentioned algorithms.
To improve the convergence rate, some accelerated distributed algorithms were recently proposed and discussed, such as the EXTRA algorithm [17] and the accelerated distributed Nesterov gradient descent method in [18]. Generally speaking, a faster convergence rate can be obtained from discrete-time optimization algorithms with a constant step size compared with those with decaying step sizes. Motivated partly by this observation, the DGD method using a constant step size was presented in [19]. However, the faster convergence rate was attained at the cost of reduced solution accuracy. Moreover, the convergence rate of the ADMM algorithm was established in [20]. Even though these algorithms present outstanding acceleration performance, most of them are only effective for solving distributed unconstrained optimization problems.
Recently, the gradient tracking method was proposed for designing distributed optimization algorithms in [21] and [22]. The essence of this method is to introduce an additional variable that estimates the average gradient while seeking the optimal solution. With this method, distributed algorithms with constant step sizes were designed over balanced graphs to solve the unconstrained optimization problem in [21], [22], and [23]. Moreover, it was proved in [21], [22], and [23] that if the step sizes are suitably selected, the designed algorithms attain a linear convergence rate. For the cases with unbalanced graphs, the Push-DIGing algorithm and the ADD-OPT algorithm were designed in [22] and [24], respectively, under the push-sum framework with column stochastic matrices. Furthermore, the AB algorithms utilizing both row stochastic and column stochastic matrices were developed in [25] and [26]. The TV-AB algorithm was further developed in [27], which is effective on random and time-varying directed interaction topologies. Additionally, the push-pull gradient algorithm was also applied in the gossip setting to reduce the communication cost. Nevertheless, the aforementioned algorithms over directed interaction topologies necessarily involve the out-degree information of each node. On the other hand, it is more standard and practical to employ row stochastic weight matrices in the distributed setting. In view of this point, algorithms using only row stochastic matrices were proposed in [28] and [29]. It is worth noting that distributed constrained optimization problems have not been adequately studied under the gradient tracking framework. As is commonly known, it is valuable to study constrained optimization problems since constraints are often indispensable in reality, such as in the power grid dispatch problem [30] and some problems arising in intelligent transportation
[31]. Inspired by this observation, in the present article, a new class of fast distributed algorithms is designed with the gradient tracking technique and row stochastic matrices employed to effectively address the distributed optimization problem subject to the closed convex set constraint.
Specifically, this article aims to investigate the distributed discrete-time optimization problem where decision variables are subject to set constraints and the network topologies are allowed to be unbalanced. The main contributions of this article are summarized as follows.
1) The linear convergence rate of the designed optimization algorithms is analyzed, showing that linear convergence rates can be obtained for solving the considered optimization problem over both balanced and unbalanced network topologies if the step size and the maximum momentum coefficient are appropriately selected. Simultaneously, the decision variables are always kept within the feasible region during the operation process, which means they are safe to utilize after any finite number of iterations in practical constrained scenarios. Moreover, with only the row stochastic weight matrix involved, the implementability of the algorithms is improved since each node can decide the importance of the information received from its neighbors.
2) The projection method is successfully integrated into the gradient tracking framework by utilizing the fast and slow time-scales technique inspired by [32]. In detail, in order to achieve convergence and the linear convergence rate, the condition λ < 1/2, or even the condition λ < 1/(m+1) with arbitrary m ≥ 1, needs to be assumed, where λ is a positive parameter related to the network topology satisfying ρ(W − (1/N)1π^T) < λ as defined in Lemma 2. However, only λ < 1 can be guaranteed in the general case, which is the essential difficulty in employing the projection method in gradient tracking algorithms. On the other hand, these parameter restrictions can be satisfied with the fast and slow time-scales technique, and thus the difficulty is overcome.
3) Heavy-ball momentum terms are introduced into the distributed algorithms to improve the convergence rate. The effectiveness and advantages of the present algorithms are compared with existing algorithms by simulations.
The remainder of this article is organized as follows. Section II introduces notations and preliminaries on graph theory and matrix theory. In Section III, two distributed optimization algorithms are designed, and their convergence is analyzed. In Section IV, theoretical results are verified by numerical simulation. Finally, Section V concludes this article.

II. PRELIMINARIES
Some basic notations and preliminaries on graph theory and matrix theory are introduced in this section.

A. Notations
R^n represents the n-dimensional Euclidean space. Let 1 be the column vector of appropriate dimension with all elements equal to 1. Given a vector μ ∈ R^n, ‖μ‖ represents its 2-norm, μ_s represents its sth entry, and μ is said to be a non-negative (positive) vector if all elements of μ are non-negative (positive). ∇h(μ) stands for the gradient of function h at μ. Let R^{n×n} and I be the set of n × n real matrices and the identity matrix of proper dimension, respectively. For a matrix B ∈ R^{n×n}, ‖B‖ represents its induced 2-norm, B_st denotes the element in row s and column t, and ρ(B) denotes its spectral radius. Moreover, matrix B is said to be positive (non-negative) if all its entries are positive (non-negative). [B]_j denotes the jth column of matrix B. Additionally, λ_max(B) and λ_min(B) are the maximum and minimum eigenvalues of matrix B, respectively. Let diag{β_1, β_2, . . ., β_n} represent a diagonal matrix with β_1, β_2, . . ., β_n on the diagonal. For a differentiable function f : R^p → R and any

B. Preliminaries on Graph Theory
A directed graph (digraph) G consists of an adjacency matrix W = [w_lm] ∈ R^{N×N}, a node set V = {1, . . ., N}, and an edge set E ⊆ V × V. Here, if (l, m) ∈ E, that is, node l can receive information from node m, then w_lm > 0; otherwise, w_lm = 0. Specifically, w_ll > 0 for 1 ≤ l ≤ N. Besides, w_lm > 0 also means that node m is an in-neighbor of node l, while node l is an out-neighbor of node m. N_l^in and N_l^out represent the collections of in-neighbors and out-neighbors of node l, respectively. G is balanced if the in-degree weight of each node equals its out-degree weight. If each node has a directed path to every other node, G is said to be strongly connected.
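The two graph properties above (balancedness and strong connectivity) can be checked numerically. The following is a minimal sketch; the example matrices are illustrative, not the ones used in the paper.

```python
import numpy as np

def is_balanced(W):
    """A weighted digraph is balanced if every node's in-degree weight
    equals its out-degree weight, i.e., row sums equal column sums."""
    return bool(np.allclose(W.sum(axis=1), W.sum(axis=0)))

def is_strongly_connected(W):
    """Reachability test: the digraph is strongly connected iff
    (I + A)^(N-1) is entrywise positive, where A is the 0/1 pattern of W."""
    N = W.shape[0]
    A = (W > 0).astype(float)
    R = np.linalg.matrix_power(np.eye(N) + A, N - 1)
    return bool((R > 0).all())

# A 3-node directed ring with self-loops: strongly connected and balanced.
W_bal = np.array([[0.5, 0.0, 0.5],
                  [0.5, 0.5, 0.0],
                  [0.0, 0.5, 0.5]])
# Row stochastic but with unequal column sums: strongly connected, unbalanced.
W_unb = np.array([[0.7, 0.3, 0.0],
                  [0.0, 0.8, 0.2],
                  [0.5, 0.0, 0.5]])
```

The reachability test works because the (s, t) entry of (I + A)^(N-1) counts walks of length at most N − 1 from s to t.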

C. Preliminaries on Matrix Theory

Definition: A non-negative matrix W ∈ R^{N×N} is said to be row stochastic if W1 = 1.
Furthermore, if a non-negative matrix W satisfies W1 = 1 and 1^T W = 1^T, it is regarded as doubly stochastic.
Obviously, 1 is an eigenvalue of a row stochastic matrix, and some relevant properties are presented in the following lemmas.
Lemma 1 [33]: For a row stochastic matrix W, there exists a non-negative left eigenvector π associated with the eigenvalue 1 such that π^T 1 = N.
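Lemma 1 can be illustrated numerically: for a primitive row stochastic W, power iteration on W^T recovers the left eigenvector of eigenvalue 1, which can then be rescaled so that π^T 1 = N as in the lemma. The matrix below is an illustrative example, not the paper's.

```python
import numpy as np

def left_eigenvector(W, iters=1000):
    """Power iteration on W^T for the left eigenvector of eigenvalue 1
    of a primitive row stochastic W, normalized so that pi^T 1 = N."""
    N = W.shape[0]
    pi = np.ones(N)
    for _ in range(iters):
        pi = W.T @ pi
        pi /= pi.sum()          # keep iterates on the simplex
    return N * pi               # rescale so that pi^T 1 = N

W = np.array([[0.6, 0.4, 0.0],
              [0.2, 0.5, 0.3],
              [0.3, 0.0, 0.7]])   # row stochastic, strongly connected
pi = left_eigenvector(W)
```

The positive diagonal plus strong connectivity make W primitive, which guarantees the power iteration converges to the Perron left eigenvector.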

III. MAIN RESULTS
The convex optimization problem is considered as follows: where f_i : R^p → R as the objective function is only known by node i, μ ∈ R^p, and Ω ⊆ R^p denotes a closed convex set. For convenience, assume p = 1; the case of p > 1 is discussed in Remark 3.
Next, define the projection operator on the closed convex set Ω for arbitrary u ∈ R^p as P_Ω(u) = arg min_{v∈Ω} ‖u − v‖. The projection operator has the following nonexpansive property: ‖P_Ω(u) − P_Ω(v)‖ ≤ ‖u − v‖ for any u, v ∈ R^p.

Algorithm 1: Algorithm Over Balanced Graphs for Node i. Parameters: the momentum parameter β_i ≥ 0, the step size η > 0, and the weights w_ij, j = 1, 2, . . ., N. Initialization: node i sets its initial values.

Generally speaking, a stationary point is not necessarily the minimum point of a convex optimization problem with closed convex set constraints, which is different from the unconstrained case. The following lemma gives an accurate expression of the optimal solution to (2).
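The projection operator and its nonexpansive property can be sketched for a simple box constraint, where the projection reduces to a componentwise clip; the box and test points are illustrative.

```python
import numpy as np

def project_box(u, lo=-1.0, hi=1.0):
    """Euclidean projection onto the box [lo, hi]^p, a simple closed
    convex set for which the projection is a componentwise clip."""
    return np.clip(u, lo, hi)

# Nonexpansiveness: ||P(u) - P(v)|| <= ||u - v|| for all u, v.
u = np.array([2.0, -3.0])
v = np.array([0.5, 0.2])
pu, pv = project_box(u), project_box(v)
```

Points already inside the set (such as v here) are left unchanged, which is exactly why feasibility is preserved at every iteration of a projected algorithm.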
The following two subsections present distributed discrete-time algorithms to address (2) over strongly connected balanced graphs and unbalanced graphs, respectively.

A. Algorithm Over Balanced Graphs
In this section, a distributed discrete-time algorithm is designed as follows for (2) over strongly connected and balanced graphs. Then, the algorithm is shown to converge linearly when the step size and the maximum momentum coefficient are appropriately selected.
First, for 1 ≤ i ≤ N, the processing dynamics are designed in Algorithm 1 for the variables associated with node i. Write Algorithm 1 in the following compact form:

Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
where matrix W associated with the balanced graph G is nonnegative and doubly stochastic.
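Since the explicit update equations are elided above, the following is a hedged single-time-scale sketch (i.e., with the fast time-scale count set to one) that combines the ingredients the text names: doubly stochastic mixing, gradient tracking, a heavy-ball momentum term, and projection onto the constraint set. All function names, the test problem, and the parameter values are illustrative assumptions, not the paper's exact update (6).

```python
import numpy as np

def projected_gradient_tracking(W, grads, project, eta=0.05, beta=0.2, iters=500):
    """Sketch: each node mixes with doubly stochastic W, takes a projected
    step along its gradient-tracking variable chi_i, and adds momentum."""
    N = W.shape[0]
    mu = np.zeros(N)                 # local estimates mu_i(k), case p = 1
    mu_prev = mu.copy()
    chi = np.array([g(mu[i]) for i, g in enumerate(grads)])  # chi_i(0) = grad f_i(mu_i(0))
    for _ in range(iters):
        mom = beta * (mu - mu_prev)                  # heavy-ball momentum term
        mu_new = project(W @ mu - eta * chi + mom)   # consensus + projected step
        g_new = np.array([g(mu_new[i]) for i, g in enumerate(grads)])
        g_old = np.array([g(mu[i]) for i, g in enumerate(grads)])
        chi = W @ chi + g_new - g_old                # gradient tracking update
        mu_prev, mu = mu, mu_new
    return mu

# Three nodes, f_i(x) = (x - a_i)^2 with Omega = [0, 0.5]; the constrained
# optimum of f_0 = sum_i f_i is the boundary point 0.5.
W = np.array([[0.5, 0.25, 0.25],
              [0.25, 0.5, 0.25],
              [0.25, 0.25, 0.5]])   # doubly stochastic
a = np.array([1.0, 2.0, 3.0])
grads = [lambda x, ai=ai: 2 * (x - ai) for ai in a]
mu = projected_gradient_tracking(W, grads, lambda v: np.clip(v, 0.0, 0.5))
```

Because W is doubly stochastic, the column sums of the tracking update are preserved, so the average of the χ_i always equals the current average gradient; the projection keeps every iterate feasible.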
Remark 1: The design of algorithm (6) over balanced graphs mainly utilizes the projection operator to deal with the set constraint and momentum terms to accelerate convergence under the gradient tracking framework [21], [22]. However, different from the distributed projected subgradient algorithms [7], [10], integrating a projection operator into the gradient tracking framework causes great difficulty in restricting the parameters related to the network topology, and thus the fast and slow time-scales technique is introduced.
Remark 2: When the function varies severely in different dimensions, most existing optimization algorithms approach the neighborhood of a local optimal solution slowly. The heavy-ball momentum method was introduced in [37], where the key idea is to correct the descent direction using historical information so that the states converge faster toward the optimal solution. Following [38], the heavy-ball momentum term, as a globally effective method to reduce the fluctuation of the state evolution (as shown in Fig. 1) and accelerate convergence in certain directions, is introduced into the algorithm for further acceleration. The global acceleration effect of the momentum term will be verified in the subsequent simulations.
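The effect described in Remark 2 can be demonstrated on a centralized toy problem: a quadratic with very different curvatures in its two coordinates. The function, parameters, and tolerance below are illustrative, not taken from the paper.

```python
import numpy as np

def iterations_to_converge(beta, kappa=50.0, tol=1e-8, max_it=20000):
    """Heavy-ball iteration on f(x) = 0.5*(x1^2 + kappa*x2^2); beta = 0
    recovers plain gradient descent.  Returns the iteration count needed
    to reach ||x|| <= tol (the minimizer is the origin)."""
    H = np.diag([1.0, kappa])
    eta = 1.0 / kappa                      # stable step size for both runs
    x = np.array([1.0, 1.0])
    x_prev = x.copy()
    for k in range(max_it):
        if np.linalg.norm(x) <= tol:
            return k
        x_new = x - eta * (H @ x) + beta * (x - x_prev)  # heavy-ball update
        x_prev, x = x, x_new
    return max_it

plain = iterations_to_converge(beta=0.0)
heavy = iterations_to_converge(beta=0.7)
```

With β = 0, the slow eigendirection contracts by only 1 − 1/κ per step; the momentum term roughly squares the effective conditioning, so the heavy-ball run needs far fewer iterations.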
For the convergence analysis below, some compact notations are defined.
Lemma 6: With the given initial values, the following equalities hold.
Proof: Equality (7) can be directly obtained from (6a). On the basis of (6b), one gets an expression which, by accumulative calculation, implies (8) with the given initial values.
Lemma 7: For 0 < η < 2/l, the following linear inequality is established.
Proof: The proof is divided into the following four steps.
Step 1: Analyze the consensus error of μ(k + 1). According to the dynamic formula (6a) and equality (7), and based on Lemma 2, one can find a constant λ.
Clearly, it is necessary to let the diagonal elements of Q_1 be strictly less than one to guarantee ρ(Q_1) < 1. Without the fast and slow time-scales technique, 2λ < 1, that is, λ < 1/2, is required. However, this condition does not hold in general. Inspired by [32], fast and slow time-scales are used, which are indexed by k_1 and k in Algorithm 1, respectively. By making λ̃ = λ^{S_1} < 1/2, the encountered difficulty is overcome. Note that this technique can be similarly applied to the case of unbalanced graphs. Now, the first main result of this article is established.
Theorem 1: Suppose that Assumption 1 holds, and let the step size η and the maximum momentum parameter β satisfy the corresponding bounds with the associated positive parameters. Then, μ_i(k) converges linearly to the optimal solution μ* of problem (2).
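The time-scale trick above has a simple quantitative reading: although only λ < 1 is guaranteed, running S fast time-scale steps replaces λ by λ^S, and S can always be chosen to meet any bound 1/(m+1). A minimal sketch (the helper name is an assumption, not the paper's notation):

```python
def min_fast_steps(lam, m=1):
    """Smallest S such that lam**S < 1/(m+1): the number of fast
    time-scale steps needed so the effective contraction factor
    lam**S satisfies the bound that lam itself need not satisfy."""
    assert 0.0 < lam < 1.0
    S = 1
    while lam ** S >= 1.0 / (m + 1):
        S += 1
    return S
```

For example, λ = 0.9 violates λ < 1/2, but seven fast steps give 0.9^7 ≈ 0.478 < 1/2.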
Remark 3: For the case of p > 1, define the corresponding stacked vectors; then, the compact form (6) and the convergence analysis remain valid by replacing μ* and μ(k) with their stacked counterparts, respectively. Additionally, the following algorithm over unbalanced graphs for p > 1 can be analyzed similarly.

B. Algorithm Over Unbalanced Graphs
In this section, the above result is generalized to the case of an unbalanced graph.
Motivated by [28], a new algorithm is presented as Algorithm 2.

Algorithm 2: Algorithm Over Unbalanced Graphs for Node i. Parameters: the momentum parameter β_i ≥ 0, the step size η > 0, w_ij, j = 1, 2, . . ., N. Initialization: node i sets its initial values. 1: for k = 0, 1, 2, . . . do 2: Node i calculates the momentum term. 4: Slow time scale: Node i updates the local estimation μ_i(k + 1) of μ*, the estimation vector y_i(k + 1) for the left eigenvector corresponding to the eigenvalue 1 of matrix W, and the gradient tracking term χ_i(k + 1) of f_0, with ∇r_i(k + 1) = ∇f_i(μ_i(k + 1))/y_i^i(k + 1). 5: end for

The matrix W associated with the unbalanced graph G is non-negative and row stochastic. Since w_ii > 0 and w_ij ≥ 0, one gets y_i^i(k) > 0 with the given initial values, implying that ∇r_i(k) is well-defined.
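The role of the estimation vectors y_i can be sketched as follows: stacking them gives Y(k + 1) = W Y(k) with Y(0) = I, and for a primitive row stochastic W, W^k converges to (1/N)1π^T, so the diagonal entries y_i^i(k) converge to π_i/N. Dividing ∇f_i by y_i^i then removes the bias introduced by the unbalanced digraph. The matrix below is an illustrative placeholder.

```python
import numpy as np

def diagonal_estimates(W, iters=200):
    """Y(k+1) = W Y(k) with Y(0) = I; returns the diagonal entries
    y_i^i(k), which converge to pi_i / N for primitive row stochastic W."""
    Y = np.eye(W.shape[0])
    for _ in range(iters):
        Y = W @ Y
    return np.diag(Y).copy()     # y_i^i(k) for each node i

W = np.array([[0.6, 0.4, 0.0],
              [0.2, 0.5, 0.3],
              [0.3, 0.0, 0.7]])   # row stochastic, unbalanced
d = diagonal_estimates(W)
```

Since π^T 1 = N, the limits π_i/N sum to one, and each is strictly positive, which is what keeps the rescaled gradients ∇r_i well-defined.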
Then, Lemma 8 can be established with a proof similar to that of Lemma 6, which is omitted.
Lemma 9: When the initial values are given, the following equality (38) holds.
Proof: According to the definition of ∇R(k) and the result stated in Remark 4, one obtains (38), which completes the proof.
The following convergence analysis is similar to that in the case involving the balanced graph.
Lemma 10: For any 0 < η < 2/l, the following inequality holds, where the definition of the vector v(k) is the same as that in Lemma 7.
Proof: The proof is divided into the following four steps.
Step 1: Analyze the consensus error of μ(k + 1). Using the dynamic formula (35a) and Lemma 8, one gets the corresponding estimate.
Using the nonexpansive property of the projection operator, one obtains the bound, where m = 1π^T/N. From Lemma 2, there exists a positive constant 0 < λ < 1 satisfying ρ(W − (1/N)1π^T) < λ. Below, let λ̃ = λ^{S_2} for convenience. Thus, one has the Step 1 estimate.
Step 2: Analyze μ(k + 1) − 1μ*. From Lemmas 5 and 8, one has the corresponding relation. Based on Lemma 4 and letting Assumption 1 hold, combining (45) with (44) gives the Step 2 bound.
Step 3: Analyze ε(k + 1). From the update rule (35a) and Lemma 5, together with Lemma 4, the corresponding inequalities hold.
Based on the result in Remark 4 and the fact that Ω is compact, there exists a positive constant D_0 such that the stated bound holds, where r_1 is a positive constant such that ρ_1 < 1. From Lemma 10, one has the desired estimate. Thus, the proof is completed.
Remark 5: To facilitate analysis and computation, the step sizes and the maximum momentum coefficients designed in Theorems 1 and 2 are very conservative. There is interference between their ranges due to the adopted proof framework and the selected auxiliary vector δ. In applications, the parameters can be manually adjusted to improve the performance of the algorithms.
Remark 6: In the case of a balanced graph, algorithm (6) needs to be operated for a corresponding number of iterations; similarly, for the case of unbalanced graphs, algorithm (35) needs to be operated accordingly.
Remark 7: Compared with the distributed optimization algorithms in [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], and [28] for unconstrained optimization problems, the algorithms proposed in this article can efficiently solve distributed optimization problems with the closed convex set constraint at linear convergence rates. Additionally, compared with [29], the projection method is directly integrated into the gradient tracking framework to deal with the closed convex set constraint, and thus fewer fast time-scale convergence steps are needed. Moreover, the introduced momentum terms can greatly reduce the number of convergence steps and are effective for ill-conditioned functions, which will be verified in the simulations.

IV. SIMULATION EXAMPLES
In this section, consider a balanced and an unbalanced strongly connected graph, each with N = 10 nodes, as shown in Fig. 2. Clearly, the undirected graph is balanced. The corresponding adjacency matrices are defined accordingly, and a closed convex box set is taken as the constraint of the optimization problem on R^2. Select η = 0.01 and η = 0.001 as the step sizes for algorithms (6) and (35).
Case 1: As a classical extension of linear regression, logistic regression, with the advantages of simplicity and parallelization, is widely employed to deal with classification problems [4], such as spam recognition and image content recognition. Consider the logistic regression problem with its objective function parameterized by b_rt^T = [0.1r + 0.1t, 0.2r + 0.2t], λ_0 = 10, and T = 5, where the training data are labeled as l(r, t) = (−1)^{r+t} and μ = (μ_1, μ_2)^T. The simulation results are shown in Figs. 3 and 4. For the balanced graph, a numerical comparison is given between algorithm (6) in this article and the projection subgradient algorithm in [10].
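A regularized local logistic-regression objective of the general form used in Case 1 can be sketched as follows. This is a hedged illustration: the exact feature construction b_rt and label rule from the paper are not reproduced, so the matrix B and the labels below are placeholder data.

```python
import numpy as np

def local_logistic_loss(mu, B, labels, lam0=10.0, N=10):
    """Local objective sketch: sum_t log(1 + exp(-l_t * b_t^T mu))
    plus an l2 regularizer (lam0 / 2N) * ||mu||^2."""
    z = labels * (B @ mu)                  # margins l_t * b_t^T mu
    return np.log1p(np.exp(-z)).sum() + lam0 / (2 * N) * (mu @ mu)

B = np.array([[0.2, 0.4],
              [0.3, 0.6],
              [0.4, 0.8]])                 # placeholder feature vectors b_t
labels = np.array([1.0, -1.0, 1.0])        # placeholder labels l_t in {-1, +1}
loss_at_zero = local_logistic_loss(np.zeros(2), B, labels)  # 3 * log(2)
```

This loss is smooth and, with the quadratic regularizer, strongly convex, which is the setting in which the linear-rate guarantees apply.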
As shown in Figs. 3 and 5, about 30 steps are needed for algorithm (6) in this article to achieve the optimal solution μ* = [−0.004, −0.008], while at least 120 steps are needed for the projection subgradient algorithm in [10] with step size 1/(t + 1). For the unbalanced case, the simulation results of algorithm (35) in this article and Algorithm 1 in [7] are presented in Figs. 4 and 6, respectively. It is obvious that algorithm (35) reaches the optimal solution μ* faster.
In addition, the momentum coefficients have a great influence on the convergence and the evolution of the states. Taking algorithm (6) over the balanced graph with different momentum coefficients as an example, the state evolution is shown in Fig. 7. Compared with the simulation results where the momentum coefficients are set as β_i = 0.02, 0.4, and 0.8 for i = 1, 2, . . ., N, the acceleration performance of algorithm (6) is most obvious when the momentum coefficients are set as 0.6.
Furthermore, the performance of algorithm (6) over balanced graphs is compared with the algorithm proposed in [21] and algorithm (5) in [29], and algorithm (35) over unbalanced graphs is compared with Algorithm 1 in [7], the algorithm in [13], and the FROST algorithm in [28], respectively. To be fair, set S_1 = 1 and S_2 = 1 for algorithms (6) and (35), respectively. The residual error (1/N)Σ_{i=1}^N ‖x_i(t) − x*‖ over all nodes is shown in Fig. 8 for these algorithms. It can be seen that algorithms (6) and (35) in this article show fast convergence rates, even comparable

TABLE I COMPARISON OF ALGORITHM CONVERGENCE STEPS
to the algorithm in [21] and the FROST algorithm for the unconstrained optimization problem, respectively.
Case 2: To further verify the global acceleration performance of the designed momentum terms, consider an optimization problem whose objective function is a sum of ill-conditioned functions, namely a distributed quadratic programming problem. An ill-conditioned function is a function with a large condition number, where the condition number κ of f_0(μ) is λ_max(Σ_{i=1}^N R_i)/λ_min(Σ_{i=1}^N R_i). Generally speaking, since ill-conditioned functions vary severely in different directions, conventional algorithms converge very slowly on them.
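The condition number defined above is straightforward to compute; the matrices R_i below are illustrative placeholders, not the paper's data.

```python
import numpy as np

# Condition number of the aggregate quadratic objective:
# kappa = lambda_max(sum_i R_i) / lambda_min(sum_i R_i).
R = [np.diag([1.0, 100.0]), np.diag([2.0, 150.0]), np.diag([1.0, 50.0])]
H = sum(R)                          # sum_i R_i = diag(4, 300)
eigs = np.linalg.eigvalsh(H)        # eigenvalues sorted ascending
kappa = eigs[-1] / eigs[0]          # 300 / 4 = 75
```

A large κ means the level sets of f_0 are strongly elongated, which is exactly the regime in which the heavy-ball momentum terms pay off in Table I.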
For algorithms (6) and (35) with and without momentum terms, the convergence steps are shown in Table I. The momentum coefficients can be adjusted manually to improve algorithm performance in simulations. Clearly, as seen from Table I, the designed momentum terms with appropriate coefficients have a significant acceleration effect. This shows that the distributed algorithms designed in this article are very effective for handling ill-conditioned functions.
V. CONCLUSION

Accelerated distributed optimization algorithms have been developed in this article for the convex optimization problem with a closed convex set constraint over digraphs under the gradient tracking framework. The designed algorithms have been proved to achieve linear convergence rates when the step size and the maximum momentum coefficients are appropriately selected. A key feature of this article is utilizing the fast and slow time-scales technique to successfully implement the gradient projection method under the gradient tracking framework. Besides, compared with existing distributed first-order methods, the heavy-ball momentum terms combined in the designed algorithms provide excellent acceleration performance, even for ill-conditioned functions. Future work will focus on solving constrained optimization problems with more general constraints and in online optimization scenarios.

Algorithm 1, Step 4 (slow time scale): Node i updates the local estimation μ_i(k + 1) of μ* and the gradient tracking term χ_i(k + 1) of f_0.

Fig. 1. Effect comparison of the momentum term on the algorithm. (a) Optimization algorithm without the momentum term. (b) Optimization algorithm with the momentum term.