Scaling Stratified Stochastic Gradient Descent for Distributed Matrix Completion

Stratified SGD (SSGD) is the primary approach for achieving serializable parallel SGD for matrix completion. State-of-the-art parallelizations of SSGD fail to scale due to large communication overhead: during an SGD epoch, these methods send data proportional to one of the dimensions of the rating matrix. We propose a framework for scalable SSGD that significantly reduces the communication overhead by exchanging point-to-point messages that exploit the sparsity of the rating matrix. We provide formulas that capture the essential communication for correctly performing parallel SSGD, and we propose a dynamic programming algorithm for efficiently computing them to establish the point-to-point message schedules. This scheme, however, significantly increases the number of messages sent by a processor per epoch from O(K) to O(K²) for a K-processor system, which might limit the scalability.
To remedy this, we propose a Hold-and-Combine strategy that limits the upper bound on the number of messages sent per processor to O(K lg K). We also propose a hypergraph partitioning model that correctly encapsulates reducing the communication volume. Experimental results show that the framework successfully achieves a scalable distributed SSGD through significantly reducing the communication overhead. Our code is publicly available at: github.com/nfabubaker/CESSGD

There are several approaches to building recommender systems, among which Collaborative Filtering (CF) is the most widely used.
CF approaches recommend an item to a target user by using other users' ratings, given that those users and the target user have rated some other items similarly. The rating data produced nowadays, whether by social networks or e-commerce, is huge and changes frequently. Recommender systems for such huge data are usually implemented on distributed-memory systems that might span multiple data centers. Therefore, the CF component should be performant and scalable in order to utilize the available computational resources as well as the high-speed networks.
Low-rank matrix factorization has been used successfully in CF by revealing feature vectors (latent factors) that represent the users and the items. In matrix-factorization-based CF methods, the known ratings are stored as a sparse matrix whose rows represent users and whose columns represent items. The sparse matrix is factorized into two dense matrices representing the feature vectors of users and items, and these dense matrices are then used to predict the missing ratings in the original rating matrix. This use of matrix factorization is commonly referred to as matrix completion. The factorization can be computed with different methods, including stochastic gradient descent (SGD), alternating least squares (ALS), cyclic coordinate descent (CCD), and others.
SGD is very efficient and usually achieves high completion accuracy compared to other methods [1]. However, given its sequential nature, it is challenging to parallelize efficiently while maintaining accuracy and convergence guarantees. For this reason, serializable parallel SGD algorithms are the most desirable. Serializability of parallel SGD refers to the existence of an equivalent serially executed SGD algorithm with the same update order. Serializability guarantees convergence and ensures that no two processors update the same feature vector at the same time (a race condition), thus leading to faster convergence [2]. Stratified SGD [3] is the de-facto algorithm for achieving serializable parallel SGD.
The state-of-the-art methods implementing SSGD (such as DSGD [3], DSGD++ [4], and NOMAD [5]) achieve the interprocessor communication necessary for the correctness of SSGD by sending/receiving feature vectors with sizes proportional to one of the dimensions of the input rating matrix. In other words, these methods perform dense communications without exploiting the sparse nature of the rating matrix, leading to a huge amount of unnecessary communication, especially when the nonzero density of the rating matrix is low. The extra communication has not posed a concern because these methods were tested on a relatively small number of processors (up to 64) in distributed settings. At such small scale, the SGD runtime is expected to be dominated by computation, and improving the communication component does not drastically affect the overall running time.
At large scale (hundreds and thousands of processors), the communication component becomes dominant, and reducing the communication overhead therefore becomes essential for the scalability of the SSGD algorithm. The dense communications of the state-of-the-art SSGD methods prohibit their scalability. This is empirically confirmed by us (Section VI) as well as by another work [6] that runs DSGD on thousands of processors (cf. Fig. 8 in [6]).
In this work, we propose a communication-efficient framework for the SSGD algorithm. Our framework starts by exploiting the sparsity of the rating matrix, performing sparse communications instead of dense communications. This is achieved by efficiently finding the essential feature vectors to be communicated between processors and communicating them through point-to-point (P2P) messages. This approach, although invaluable for reducing the communication volume, has the downside of increasing the number of messages sent per processor from O(K) (as in DSGD) to O(K²).
Inter-processor communication cost consists of a latency term and a bandwidth term. The latency term is proportional to the number of messages sent, whereas the bandwidth term is proportional to the volume of data transferred. If the number of messages is high, the latency cost might dominate the overall communication component, since each message's startup time can exceed the time of sending a few kilobytes of data [7]. The O(K²) bound of the new sparse communication method has the potential to increase the latency overhead and possibly hinder scalability as K increases, which makes it latency-unsafe. To remedy this, we propose a novel approach called hold and combine that reduces the upper bound on the number of messages from O(K²) to O(K lg K), rendering the new sparse communication method latency-safe.
The volume of the sparse communication in parallel SSGD is also affected by how the ratings are distributed to different processors. This property indicates that there is room for reducing the communication volume combinatorially via intelligent partitioning methods. We propose a partitioning method utilizing a hypergraph partitioning model that correctly encapsulates the total volume of communication between processors. In this method, the objective of reducing the cutsize of the hypergraph model partition also corresponds to reducing the total volume of communication in an SSGD epoch.
The rest of the paper is organized as follows: Section II gives the essentials of using and parallelizing SGD for matrix completion. In Section III, the communication requirement in parallel SSGD is studied in detail. In Section IV, the proposed framework for scaling P2P SSGD, including the hold-and-combine scheme, is presented. In Section V, the proposed hypergraph partitioning (HP) method is presented. Section VI contains the experiments conducted on an HPC system along with the results and discussions. Related works are discussed in Section VII, and the paper is concluded in Section VIII.

A. Matrix Completion With SGD
We define the matrix completion problem in the context of collaborative filtering as follows: Given a set U of N users, a set I of M items, and a set Ω of ratings as the known entries of a sparse rating matrix R ∈ ℝ^{N×M}, the problem is to find two dense factor matrices W ∈ ℝ^{N×F} and H ∈ ℝ^{M×F} such that a low-rank approximation R ≈ WHᵀ is achieved. Here, F ≪ M, N is called the dimension or the rank of the factorization. The approximation r̂_ij of rating r_ij can then be calculated as

r̂_ij = w_i h_jᵀ, (1)

where w_i and h_j respectively denote the ith row of W and the jth row of H. The quality of the approximation is usually measured by an application-dependent loss function L, thus the problem becomes argmin_{W,H} L(R, W, H). For collaborative filtering, L is usually the Euclidean distance, and thus the problem becomes

argmin_{W,H} Σ_{r_ij ∈ Ω} (r_ij − r̂_ij)² + γ(‖w_i‖² + ‖h_j‖²), (2)

where γ is a regularization parameter to avoid over-fitting, and r̂_ij is computed with (1).
Since the minimization problem in (2) has two unknowns W and H, L is a non-convex function [1]. SGD has been widely used to optimize (minimize) such functions due to its ability to escape local minima. In an SGD epoch, each rating r_ij ∈ Ω is used to update the objective function's parameters. The gradient of the objective function at point r_ij is calculated (∇_{r_ij} L_{r_ij}(R, W, H)) and the corresponding w_i and h_j rows are updated as

w_i ← w_i − ε ∇_{w_i} L_{r_ij}(R, W, H), (3)
h_j ← h_j − ε ∇_{h_j} L_{r_ij}(R, W, H), (4)

where ε is the step size.
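For concreteness, one SGD epoch with the updates of (3) and (4) for the squared loss of (2) can be sketched as follows. This is a minimal serial Python illustration, not the paper's C/MPI implementation; the rank, step size, and initialization below are arbitrary choices for the sketch:

```python
import random

def sgd_epoch(ratings, W, H, lr=0.01, gamma=0.02):
    """One SGD epoch over the known ratings of a sparse matrix.

    ratings: dict mapping (i, j) -> r_ij; W, H: lists of F-dimensional rows.
    Applies the gradient steps of (3) and (4) for the regularized squared loss.
    """
    entries = list(ratings.items())
    random.shuffle(entries)  # visit the known ratings in random order
    for (i, j), r in entries:
        F = len(W[i])
        pred = sum(W[i][f] * H[j][f] for f in range(F))  # r_hat_ij = w_i . h_j
        e = r - pred
        for f in range(F):
            wi, hj = W[i][f], H[j][f]
            # gradient step on w_i and h_j with L2 regularization
            W[i][f] = wi + lr * (e * hj - gamma * wi)
            H[j][f] = hj + lr * (e * wi - gamma * hj)
    return W, H
```

Running repeated epochs on a tiny rating matrix drives the squared loss toward the regularized optimum, mirroring the serial baseline that the parallel schemes below must reproduce.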
It is clear from (3) and (4) that SGD is sequential in nature; thus parallelizing it requires communicating up-to-date W- and H-matrix rows. Trivially, the up-to-date W- and H-matrix rows should be communicated after each SGD update, which enforces very high communication and synchronization overheads. Otherwise, some SGD updates will be performed on stale versions of W- and H-matrix rows, which may drastically affect the learning process and the convergence guarantee. The parallel SGD methods that allow updating on stale W- and H-matrix rows (i.e., allow staleness) are called asynchronous. These methods are usually non-serializable. Simple parallelizations of SGD-based matrix completion, such as row-wise or column-wise partitioning of the rating matrix, are examples of asynchronous SGD (see Fig. 1).

Fig. 1. Stale updates in simple row- or column-wise partitions (upper part) versus stale-free DSGD (bottom). In the row-wise partition of R, the rows of W are partitioned conformably and thus each W-matrix row is accessed by one processor. However, this is not the case for H-matrix rows. For instance, ratings r_il and r_jl are respectively assigned to p_1 and p_2 and both are used to update h_l, possibly at the same time; thus either p_1 or p_2 will update on a stale h_l. A similar discussion holds for the column-wise partition in a dual manner regarding r_jl, r_jn and w_j. Black stars are known ratings.

B. Stratified SGD (SSGD) and Its Parallelization

1) SSGD:
The SSGD method was proposed by Gemulla et al. [3] in order to mitigate the staleness problem. In SSGD, the rating matrix is divided into K² 2D blocks using K-way mutually exclusive and exhaustive partitions on the rows, Π_R = {R_1, ..., R_K}, and columns, Π_C = {C_1, ..., C_K}, of R. The rows of the dense matrices W and H are partitioned conformably with Π_R and Π_C, respectively. We denote the row blocks of W and H that respectively conform with R_α and C_β as W_α and H_β. We denote a block of R with rows in R_α and columns in C_β as R_αβ.
In SSGD, a set of K non-overlapping 2D sub-matrix blocks is called a stratum (denoted by S hereafter). Two 2D sub-matrix blocks are said to be non-overlapping if they do not share any row or column. A set of K strata S = {S_1, ..., S_K} that exhausts all of the K² sub-matrix blocks is called correct strata. Fig. 2 shows an example of correct strata. Given correct strata to be used in an SSGD epoch, each stratum is processed in a separate mini epoch (called a sub-epoch), and the order in which these sub-epochs are executed can be random. Although the SSGD algorithm is serial, its distinguishing property is that no ratings in different blocks of a stratum can update the same row of the factor matrices W and H, which makes it suitable for stale-free parallelization.
2) Parallel SSGD: In [3], the parallel algorithm that utilizes SSGD is called the Distributed Stochastic Gradient Descent (DSGD) algorithm.

Fig. 2. The numbers identify the sub-matrix blocks that constitute a stratum in a ring strata with seed = 1. Stratum S_2 is highlighted. Side arrows show the processor update order of h_i and h_j in H_1.

In DSGD, each stratum is executed in parallel in one sub-epoch, where the W- and H-matrix rows are updated with the ratings in the stratum according to (3) and (4). Then, inter-processor communications are performed to synchronize all updated rows of the factor matrices. If a row-parallel execution is chosen, that is, the R matrix is partitioned row-wise such that each row block is processed by a single processor, then communication is restricted to the H-matrix rows. Row-parallel execution is usually preferred because the number of items is generally much smaller than the number of users, which means the amount of data to be communicated (H) is small compared to W. In row-parallel execution, we abuse the stratum notation S to also be viewed as a mapping function S : [K] → [K] (where [K] denotes the set {1, ..., K} hereafter) from a processor p_k to the index β of a column block C_β. For instance, S_2(p_4) = 5 means that during sub-epoch 2, processor p_4 will exclusively update the rows of the H_5 sub-matrix. We also use S⁻¹_β(p_x) to retrieve the sub-epoch at which p_x updates H_β. As mentioned in the introduction, DSGD performs dense communications. We will utilize the parallelization style of DSGD in our methods while changing how the communication is performed. Hereafter, we will refer to the parallelization style of DSGD as "parallel SSGD," and we will use the name "DSGD" to distinguish the algorithm that performs dense communication.
3) Generating Correct Strata: There are several ways to generate correct strata that cover the whole dataset and to schedule the strata to sub-epochs. For simplicity, we consider a simple form of scheduling as follows: at sub-epoch 1, processor p_x, for x = 1, 2, ..., K, processes the ratings in R_xx to update the rows in W_x and H_x; at sub-epoch k, processor p_x processes the ratings in R_xβ to update the rows in W_x and H_β, where β = 1 + (x + k − 2) mod K. We refer to this scheduling as "ring scheduling" or "ring strata" hereafter. A general form of ring scheduling uses a seed, where 1 ≤ seed ≤ K. At sub-epoch k, processor p_seed processes the ratings in R_{seed,k} to update the rows in W_seed and H_k. At sub-epoch k, processor p_x processes the ratings in R_xβ to update the rows in W_x and H_β, where

β = 1 + (x − seed + k − 1) mod K. (5)
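The ring scheduling above can be sketched in a few lines (Python; 1-indexed processors and sub-epochs; the function name is ours):

```python
def ring_strata(K, seed=1):
    """S[(k, x)] = index beta of the column block H_beta that processor p_x
    updates at sub-epoch k, following the ring scheduling rule."""
    return {(k, x): 1 + (x - seed + k - 1) % K
            for k in range(1, K + 1) for x in range(1, K + 1)}
```

Two properties make the result correct strata: within each sub-epoch the block indices form a permutation (the K blocks are non-overlapping), and over the K sub-epochs every one of the K² blocks is processed exactly once.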

III. COMMUNICATION IN PARALLEL SSGD
In this section, we analyze the communication requirement of parallel SSGD. We define the essential communication required in an SSGD epoch that utilizes the data sparsity, and compare it with the dense communication of DSGD.
Consider strata S where each stratum is to be processed in a sub-epoch in row-parallel execution. For an H-matrix row block H_β, we define Υ_β = ⟨p_{i_1}, p_{i_2}, ..., p_{i_K}⟩ as the sequence of processors that compute gradients using ratings in the column block C_β according to S. That is, p_{i_1} updates the rows of H_β in the first sub-epoch, p_{i_2} in the second sub-epoch, and so forth. Furthermore, we define a distance metric d^{xy}_β between two processors p_x and p_y updating H_β as

d^{xy}_β = (S⁻¹_β(p_y) − S⁻¹_β(p_x)) mod K − 1. (6)

This distance quantifies the number of sub-epochs elapsed after p_x updates rows in H_β and before p_y does so.
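The update sequence Υ_β and the distance metric can be sketched as follows (Python; `S_inv` is assumed to map each processor to the sub-epoch at which it updates H_β):

```python
def update_sequence(S_inv):
    """Upsilon_beta: processors ordered by the sub-epoch at which they
    update H_beta; S_inv[p] = that sub-epoch."""
    return sorted(S_inv, key=S_inv.get)

def distance(S_inv, px, py, K):
    """Number of sub-epochs strictly between p_x's and p_y's updates of
    H_beta, taken cyclically over the K sub-epochs."""
    return (S_inv[py] - S_inv[px]) % K - 1
```

The assertions below use the update sequence of H_1 from the Fig. 2 example, where p_5 sits one position between p_6 and p_4.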

A. Defining Essential Required Communication
The communication of H-matrix rows required for correctly executing SSGD in a distributed fashion is described according to the following definition:

Definition 1 (d-gap rows): During parallel SSGD, if a row h_j ∈ H_β is updated by both p_{i_x} and p_{i_{x+1}}, then h_j is called a zero-gap row. If h_j is updated by both p_{i_x} and p_{i_{x+2}} but not p_{i_{x+1}}, then h_j is called a one-gap row. For the general case, consider two nonadjacent processors in Υ_β, p_{i_x} and p_{i_y} with x < y: h_j is called a d-gap row if it is updated by both p_{i_x} and p_{i_y} but not by any of the d = d^{xy}_β processors in between (that is, p_{i_{x+1}}, ..., p_{i_{y−1}}). The set of all such d-gap rows between p_{i_x} and p_{i_y} in H_β is given by

H^{i_x i_y}_β = {h_j ∈ H_β | h_j is updated by both p_{i_x} and p_{i_y} but by none of p_{i_{x+1}}, ..., p_{i_{y−1}}}. (7)

Communicating H^{i_x i_y}_β from p_{i_x} to p_{i_y} after p_{i_x} processes the ratings in 2D block R_{i_x β} and before p_{i_y} starts processing the ratings in R_{i_y β} guarantees a correct distributed row-parallel SSGD execution.
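The d-gap row sets can be computed naively from an update sequence as follows. This is a Python sketch over a single row block; `updates[p]` is assumed to hold the indices of the H_β rows that processor p updates, and the wrap-around to the next epoch is ignored for brevity:

```python
def dgap_rows(seq, updates):
    """For ordered pairs (p, q) in update sequence `seq`, return the rows
    updated by both p and q but by none of the processors between them."""
    out = {}
    n = len(seq)
    for a in range(n):
        between = set()  # rows updated by processors strictly after seq[a]
        for b in range(a + 1, n):
            p, q = seq[a], seq[b]
            rows = (updates[p] & updates[q]) - between
            if rows:
                out[(p, q)] = rows
            between |= updates[q]
    return out
```

The test below encodes the Fig. 2 example for H_1: h_i is updated by p_1, p_8, p_7, p_6 and p_4, while h_j is updated only by p_1 and p_4.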

B. Communication in DSGD
In the original DSGD algorithm [3], after processor p_x updates row block H_β in sub-epoch k, it sends the rows in H_β to the processor that will update H_β in sub-epoch k + 1. Therefore, at each sub-epoch, each processor sends a whole row block of H to exactly one processor. For instance, assuming DSGD is executed according to the ring strata given in Fig. 2, after sub-epoch 1 is completed, p_1 sends H_1 to p_8, p_8 sends H_8 to p_7, and so forth.
The communication scheme of DSGD guarantees the correctness of the SSGD algorithm, since an up-to-date h_j ∈ H^{i_x i_y}_β will eventually reach p_{i_y} from p_{i_x} (assuming x < y in Υ_β) via forwarding through p_{i_{x+1}}, ..., p_{i_{y−1}}. Furthermore, the communication scheme of DSGD has the nice property of very low latency overhead, since it restricts the number of messages sent by any processor at any sub-epoch to one. However, this scheme suffers from increased bandwidth overhead (communication volume) due to forwarding the H-matrix rows. For each epoch, the communication volume sent by all processors is equal to F × M × K words, as each processor sends approximately M/K dense H-matrix rows, each of size F words, during each of the K sub-epochs. Especially for highly sparse rating matrices, the volume of communication performed is clearly much more than required, and the increased bandwidth overhead due to forwarding can be prohibitive as K increases; see Fig. 3.
In Fig. 2, the update sequence for row block H_1 is Υ_1 = ⟨p_1, p_8, p_7, p_6, p_5, p_4, p_3, p_2⟩. The communication of h_i ∈ H_1 through the subsequence/subchain p_1 → p_8 → p_7 → p_6 does not incur any extra volume, since each of these processors updates h_i. However, p_5 does not update h_i, yet p_5 still needs to receive the up-to-date h_i from p_6 and forward it to p_4 in the next sub-epoch. In this case, h_i incurs F words of forwarding overhead. In the case of h_j ∈ H_1, the first processor to update it after p_1 is p_4. Therefore, four forwarding communications, each of size F, are incurred due to h_j in p_1 → p_8 → p_7 → p_6 → p_5.
Let λ(h_j) denote the number of processors that update h_j ∈ H_β; then the amount of forwarding overhead of h_j in DSGD is F(K − λ(h_j)). The total amount of forwarding overhead per epoch then becomes F(MK − Σ_{h_j ∈ H} λ(h_j)). The clear difference between the communication in DSGD and the essential required communication is that the former is a direct factor of K and M, whereas the latter is upper bounded by the number of nonzeros (nnz) of the rating matrix. This can be shown as follows: at sub-epoch k, processor p_x sends at most nnz(R_{x,S_k(p_x)}) H-matrix rows. This means that, in the worst case, the total volume of communication sent by all processors per SSGD epoch is equal to

F · Σ_{k=1}^{K} Σ_{x=1}^{K} nnz(R_{x,S_k(p_x)}) = F · nnz(R) words.

Algorithm 1: P2P-Based Parallel SSGD on Processor p_x.
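The DSGD-versus-essential volume contrast above can be made concrete with a small sketch (Python; `lams[j]` holds λ(h_j), and the numbers in the assertions are illustrative only):

```python
def dsgd_volume(M, K, F=1):
    """Per-epoch DSGD volume: every row of H travels the full ring of K hops."""
    return F * M * K

def dsgd_forwarding_overhead(lams, K, F=1):
    """F * (M*K - sum_j lambda(h_j)): the hops that carry no needed update."""
    return F * (len(lams) * K - sum(lams))

def p2p_volume(lams, F=1):
    """P2P volume: each of the lambda(h_j) updaters of h_j sends it once."""
    return F * sum(lams)
```

By construction, the DSGD volume decomposes into the P2P volume plus the forwarding overhead, which is exactly what direct P2P delivery eliminates.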

A. Communicating d-Gap Rows Through P2P Messages
We propose to avoid the forwarding overhead by sending an updated H-matrix row directly to the processor that updates it next, through P2P communications. At the beginning of sub-epoch k, processor p_x sends P2P messages to a set of processors SendSet_k(p_x) and receives from RecvSet_k(p_x). These two sets can be respectively constructed as

SendSet_k(p_x) = {p_y | H^{xy}_β ≠ ∅ for β = S_{k−1}(p_x)}, (8)
RecvSet_k(p_x) = {p_z | H^{zx}_β ≠ ∅ for β = S_{k−1}(p_z)}. (9)

For example, in Fig. 2, at the beginning of the second sub-epoch, p_1 sends h_i to p_8 and h_j to p_4. Algorithm 1 presents the P2P-based parallel SSGD algorithm for processor p_x. At line 3, processor p_1 picks strata S and broadcasts it to all other processors. At line 4, p_x determines the communication requirement according to (7) and constructs the send/receive information of the P2P messages according to (8) and (9). Then, the up-to-date rows required in the current sub-epoch are communicated at lines 8-13 through P2P messages. The SGD updates are performed at lines 15 and 16, respectively according to (3) and (4).

Algorithm 2: Find d-gap H-matrix Rows on Processor p_x
Require: Rating matrix R, processor count K, strata S

B. Efficiently Constructing d-Gap Row Sets
Computing the d-gap H-matrix rows using (7) involves recurring computations for different instances; for example, computing H^{i_x i_y}_β and H^{i_x i_{y+1}}_β both require the combined update information of the processors in between. For an efficient computation, we devise an algorithm that utilizes a dynamic programming formulation leveraging efficient bulk bit-wise operations.
Consider a binary string B^{i_x}_β of length |H_β| whose bth bit is set to '1' if p_{i_x} updates the bth row in H_β, and to '0' otherwise. Then, the indices of the rows to be communicated between p_{i_x} and p_{i_y} are the indices of the 1-bits in

(B^{i_x}_β ∧ B^{i_y}_β) ⊕ ((B^{i_x}_β ∧ B^{i_y}_β) ∧ (B^{i_{x+1}}_β ∨ ⋯ ∨ B^{i_{y−1}}_β)), (10)

where ⊕, ∧ and ∨ respectively denote logical exclusive OR (XOR), logical AND and logical OR operations. The term (B^{i_{x+1}}_β ∨ ⋯ ∨ B^{i_{y−1}}_β) in (10) can be computed incrementally thanks to the associativity of the ∨ operation.
Given H_β and Υ_β, we define Υ^{p_x}_β as the sequence of processors updating H_β starting from p_x. Υ^{p_x}_β can be obtained from Υ_β by left-rotating the sequence until p_x is at the first index. Algorithm 2 presents the efficient dynamic-programming-based computation of the d-gap H-matrix rows between p_x and the other K − 1 processors. For each H_β, the order of processors updating H_β starting from p_x according to strata S is maintained in Υ^{p_x}_β (line 3). Then, in lines 4-9, p_x constructs the d-gap row sets one by one according to this order, leveraging the bottom-up construction of the term (B^{i_{x+1}}_β ∨ ⋯ ∨ B^{i_{y−1}}_β).
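The bitwise evaluation of (10) with an incrementally maintained OR term can be sketched as follows (Python integers as bitsets; as in Algorithm 2, the sketch computes the d-gap rows between the first processor of the rotated sequence and every later one):

```python
def dgap_bitsets(bits):
    """bits[x]: an int whose b-th bit is 1 iff the processor at position x of
    the (rotated) update sequence updates row b of H_beta. Returns
    {(0, y): bitmask of d-gap rows between positions 0 and y}."""
    out = {}
    acc = 0  # OR of update indicators of the processors between 0 and y
    for y in range(1, len(bits)):
        common = bits[0] & bits[y]
        mask = common ^ (common & acc)  # (10): shared rows not seen in between
        if mask:
            out[(0, y)] = mask
        acc |= bits[y]  # incremental bottom-up construction of the OR term
    return out
```

The test encodes the H_1 example of Fig. 2 with bit 0 for h_i and bit 1 for h_j: the zero-gap row h_i goes to the next updater, while h_j skips four processors.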

C. Hold & Combine Strategy for Reducing Latency
Using P2P messages to communicate the updated rows without forwarding is indispensable for reducing the bandwidth overhead of the communication. However, it has a high potential of increasing the latency overhead by increasing the number of messages sent per epoch compared to DSGD. In DSGD, a processor sends K messages per epoch (one message to one processor at each sub-epoch), whereas the P2P scheme requires sending at most K × (K − 1) messages per epoch (up to K − 1 messages from each of the K processors at each sub-epoch).
We propose the hold and combine (H&C) strategy to reduce the upper bound on the number of messages sent per epoch to O(K lg K).
Definition 2: A fixed-distance strata is any strata that satisfies

d^{xy}_α = d^{xy}_β for every pair of processors p_x, p_y and every pair of H-matrix row blocks H_α and H_β. (11)

That is, fixed-distance strata have the property of a constant distance between any two processors regardless of the H-matrix row block they are updating. We refer to the distance between two processors p_x and p_y in a fixed-distance strata as d^{xy}. Any ring strata scheduled with (5) is a fixed-distance strata. Under H&C, p_x sends at most ⌈K/d^{xy}⌉ combined messages to p_y per epoch, so the total number of messages sent by p_x is bounded by Σ_{y≠x} ⌈K/d^{xy}⌉ ≤ K Σ_{d=1}^{K−1} 1/d + K. The second summation is a harmonic series, which is O(lg K); thus O(K lg K) is the upper bound on the number of messages sent per processor per epoch.
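The per-processor message counts of plain P2P and H&C can be contrasted with a quick sketch (Python; the ceiling in the H&C count is our reading of "at most K/d messages to the processor at distance d"):

```python
import math

def p2p_messages(K):
    """Worst-case P2P messages per processor per epoch: K-1 per sub-epoch."""
    return K * (K - 1)

def hc_messages(K):
    """H&C upper bound: one combined message per window of d sub-epochs
    toward the processor at distance d, summed over the K-1 distances."""
    return sum(math.ceil(K / d) for d in range(1, K))
```

Already at K = 64 the combined schedule needs an order of magnitude fewer messages than plain P2P, and the count stays within the K lg K envelope.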
To facilitate the presentation of the H&C strategy, we assume that each processor constructs a tabular-shaped message schedule (TSMS). In the TSMS of p_x, rows are the K − 1 processors that p_x communicates with during an epoch, and columns represent sub-epochs as well as the corresponding H-matrix row blocks updated by p_x. Each table entry TSMS(p_y, H_β) represents the sub-epoch S⁻¹_β(p_y). Fig. 4 shows a TSMS for p_3 using strata with seed = 5. In the figure, the circled TSMS entries denote the messages (H-matrix row blocks) that can be combined. For instance, the communication requirement between p_3 and p_7 during an SGD epoch can be met with two messages. The first message, required at the beginning of sub-epoch 5, consists of H^{3,7}_7 ∪ H^{3,7}_8 ∪ H^{3,7}_1 ∪ H^{3,7}_2. The second message, required at the beginning of sub-epoch 1 of the next SGD epoch, consists of H^{3,7}_3 ∪ H^{3,7}_4 ∪ H^{3,7}_5 ∪ H^{3,7}_6. Observe that the sub-epoch at which a combined message should be sent is decided by the first H-matrix block of the combined message. For instance, the first message to p_7 must arrive before p_7 starts updating rows in H_7, which is sub-epoch 5.

Algorithm 3: Construct Combined Messages on Processor p_x
Require: SendSet_k(p_x) for k ∈ {1, ..., K}, d-gap row sets H^{xy}_β
Algorithm 3 shows the procedure to construct combined messages from P2P messages at p_x. Given SendSet_k(p_x) ∀k ∈ {1, ..., K} and the d-gap rows between p_x and {p_y | y ≠ x}, the combined messages are constructed as follows: There are ⌈K/d^{xy}⌉ possible messages to p_y, each of which is identified by m_id. For each p_y ∈ SendSet_k(p_x), the rows in H^{xy}_β are assigned to a combined message M^{xy}_{m_id} (lines 3 and 4). Then, p_y is added to the new send set of the sub-epoch at which message m_id is sent (lines 5-9).
Algorithm 1 can be modified to accommodate the H&C strategy as follows: After constructing the P2P communication (line 4), Algorithm 3 is used to combine the messages. Then, lines 8-13 can be replaced with the sending/receiving of combined messages; for each p_y in cSendSet_k(p_x), a combined message is identified using m_id = ⌈k/d^{xy}⌉ and sent to p_y, and similarly so for receiving from each p_z in cRecvSet_k(p_x).
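The grouping rule m_id = ⌈k/d^{xy}⌉ can be sketched as follows (Python; the helper name is ours):

```python
import math

def combined_message_ids(K, d):
    """Group the K sub-epochs into the combined messages sent to the
    processor at distance d: sub-epoch k belongs to message ceil(k / d)."""
    msgs = {}
    for k in range(1, K + 1):
        msgs.setdefault(math.ceil(k / d), []).append(k)
    return msgs
```

For K = 8 and d = 4, as in the p_3 → p_7 example of Fig. 4, this yields two combined messages, each covering four consecutive H-matrix row blocks.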
It is important to ensure that the O(K lg K) messages sent per epoch are uniformly distributed over the K sub-epochs. Otherwise, some sub-epochs will constitute a performance bottleneck due to a high number of messages. We show that utilizing Algorithm 3 for combining the messages has the nice property of limiting the expected number of messages sent by each processor at each sub-epoch to O(lg K).
Theorem 1: Using the H&C strategy, the expected number of messages sent by each processor at each sub-epoch is O(lg K).
Proof: Consider a set Φ^{xy} that consists of all sub-epochs at which a message is sent from p_x to p_y. For each sub-epoch k, the indicator function φ^{xy}(k), which is 1 if k ∈ Φ^{xy} and 0 otherwise, defines whether there is a message to be sent from p_x to p_y at k.
We can prove that O(lg K) messages are sent by each processor at each sub-epoch as follows. The number of messages sent by p_x at a sub-epoch is equal to the number of occurrences of that sub-epoch in ∪_{y∈[K], y≠x} Φ^{xy}. For each processor p_y with distance d^{xy}, the probability that k is one of the sub-epochs at which a message is sent to p_y is equal to 1/d^{xy}. In other words, given K sub-epochs, the probability that sub-epoch k will be used to send one of the K/d^{xy} messages is 1/d^{xy}. Then, the expected number of messages from p_x to p_y at sub-epoch k is E[φ^{xy}(k)] = 1/d^{xy}. Using linearity of expectation, the expected total number of messages sent by p_x at sub-epoch k is Σ_{y≠x} 1/d^{xy} = Σ_{d=1}^{K−1} 1/d = O(lg K).
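Theorem 1 can be sanity-checked with a small Monte-Carlo sketch. We assume, as an illustration only, that the message stream to the processor at distance d fires once every d sub-epochs with a uniformly random phase:

```python
import random

def expected_messages_per_subepoch(K, trials=2000, seed=0):
    """Estimate the mean number of messages a processor sends per sub-epoch
    when each distance-d stream (one message every d sub-epochs) has a
    uniformly random phase; the expectation is the harmonic number H_{K-1}."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        k = rng.randrange(1, K + 1)      # a uniformly random sub-epoch
        for d in range(1, K):
            phase = rng.randrange(d)     # random alignment of the stream
            if (k - 1) % d == phase:     # the stream fires at this sub-epoch
                total += 1
    return total / trials
```

For K = 64, the estimate lands near H_63 ≈ 4.7 messages per sub-epoch, far below the K − 1 = 63 worst case of unbalanced scheduling.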

V. HP MODEL FOR REDUCING BANDWIDTH COST
There exist two hypergraph models for 1D partitioning of sparse matrices for SpMV-like kernels, namely the column-net model for rowwise partitioning and the row-net model for columnwise partitioning [8]. In these models, the "connectivity − 1" metric [8] is utilized for the partitioning objective of reducing the communication volume in SpMV-like kernels, whereas the partitioning constraint is maintaining computational balance among processors. As mentioned earlier, rating matrices usually have a larger number of rows than columns; hence we mainly focus on rowwise partitioning of the rating matrix R. The hypergraph model discussed here is topologically similar to the column-net model; however, the cutsize metric utilized in the partitioning objective is different.
In the hypergraph model H_R = (V, N), there exists a vertex v_i ∈ V for each row r_i of R and a net (hyperedge) n_j ∈ N for each column c_j of R. Each net n_j connects the vertices corresponding to the R-matrix rows that contain nonzeros in column c_j. That is, Pins(n_j) = {v_i ∈ V | r_ij ≠ 0}. Each vertex v_i is associated with a weight equal to the number of nonzeros in row r_i. Each net is associated with a cost F.
A K-way partition Π(H_R) = {V_1, V_2, ..., V_K} is decoded as a K-way rowwise partition of R, where the rows corresponding to the vertices in part V_α constitute the row block R_α, for α = 1, 2, ..., K. Without loss of generality, row block R_α is assigned to processor p_α for α = 1, 2, ..., K. The W-matrix rows are partitioned conformably with the R-matrix row partition. That is, W-matrix rows in W_α correspond to the R-matrix rows in R_α.
In partition Π(H_R), the weight of each part is equal to the sum of the weights of the vertices in that part. Hence, the partitioning constraint of maintaining balance on the part weights encodes maintaining balance on the nonzero counts of the R-matrix row blocks. This in turn corresponds to maintaining balance on the computational loads of the processors.
In partition Π(H_R), a net n_j is said to connect a part V_α if it connects at least one vertex in V_α, that is, Pins(n_j) ∩ V_α ≠ ∅. The connectivity set Λ(n_j) of a net n_j is defined as the set of parts that n_j connects, whereas the connectivity λ(n_j) denotes the number of parts connected by n_j, that is, λ(n_j) = |Λ(n_j)|. A net n_j is said to be cut if λ(n_j) > 1 and uncut otherwise. The partitioning objective is to minimize the cutsize, which is defined over the cut nets.
In this model, Λ(n_j) also represents the set of R-matrix row blocks that have at least one nonzero in column c_j of R. Hence, the connectivity set of net n_j denotes the set of processors that update the H-matrix row h_j. Consider the H-matrix row h_j corresponding to a cut net n_j in the P2P communication scheme, and consider the h_j update sequence defined using the connectivity set and the strata. In each epoch, each processor except the last one in the sequence sends its updated h_j value once to the next processor in the sequence, and the last processor sends its updated h_j value to the first processor for the next epoch. Hence, each cut net n_j incurs a communication volume of F λ(n_j), whereas uncut nets incur no communication. Therefore, the cutsize, which encapsulates the total communication volume during an SSGD epoch, can be computed as

cutsize(Π) = Σ_{n_j ∈ N : λ(n_j) > 1} F λ(n_j). (14)

Among the various cutsize metrics in the literature, cutsize (14) is called the sum of external degrees (SOED) [9]. There exist several successful hypergraph partitioning tools that utilize multilevel recursive bipartitioning (RB) algorithms. Among these tools, to our knowledge, only hMETIS [9] supports the SOED metric, via direct multi-way partitioning [10]. In fact, Karypis and Kumar [10] clearly indicate that the RB framework does not allow directly optimizing the SOED metric. Here, we propose an RB framework that encodes the minimization of the SOED metric correctly.
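The SOED cutsize of (14) can be computed directly from a partition (a Python sketch; the net and part encodings are ours):

```python
def soed_cutsize(pins, part, F=1):
    """Sum-of-external-degrees cutsize: each cut net n_j (connectivity
    lambda > 1) contributes F * lambda; uncut nets contribute nothing.
    pins[j] = vertices of net j; part[v] = part index of vertex v."""
    total = 0
    for verts in pins.values():
        lam = len({part[v] for v in verts})  # connectivity lambda(n_j)
        if lam > 1:
            total += F * lam
    return total
```

Under the P2P scheme this quantity equals the total H-matrix communication volume of an SSGD epoch when F is the factorization rank.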
In the RB framework, a given hypergraph H is recursively bipartitioned until K parts are obtained, assuming K is a power of two without loss of generality. At each RB step, a bipartition Π_2 = {V_L, V_R} of the current (sub)hypergraph is obtained. Here, V_L and V_R are respectively used to refer to the left and right parts of the bipartition. The net sets N_L and N_R are constructed through the cut-net splitting method [8] as follows: internal nets of V_L and V_R are respectively included in N_L and N_R. A cut net n_j in Π_2 is split into two subnets n′_j and n″_j, where Pins(n′_j) = Pins(n_j) ∩ V_L and Pins(n″_j) = Pins(n_j) ∩ V_R.
In order to encode the SOED metric (14), we propose the following strategy during the RB framework. We assign a cost of 2F to each net of the initial hypergraph. Then, after each RB step, internal nets inherit their cost, whereas split nets are assigned a cost of F. That is, a net holds its cost of 2F until it becomes cut for the first time; then a cost of F is assigned to each of its split subnets, and they inherit this cost of F through the further RB steps until the end of the partitioning. Hence, when a net becomes cut for the first time it incurs 2F to the cutsize; then, whenever its subnets become cut, they each incur F to the cutsize. In this way, the sum of all cut-net costs encountered during the overall RB algorithm becomes equal to the SOED metric (14).
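To see that this cost-inheritance scheme indeed accumulates the SOED metric, consider the small C sketch below (an illustration we provide here, not part of the actual partitioner; the function name rb_cost and the array-based net representation are our own). It replays the RB steps over the part-id range spanned by a single net: the net pays its current cost the first time it is cut, and its two subnets then carry cost F. The accumulated total equals F λ(n j ) for cut nets and 0 for internal ones, matching (14).

```c
#include <assert.h>

#define F 16  /* number of latent features; any positive value works */

/* parts[]: the distinct part ids (in 0..K-1, K a power of two) spanned by
 * the net after K-way partitioning; n: their count.
 * rb_cost() replays the RB tree over the part-id range [lo, hi) and
 * accumulates cutsize exactly as the proposed scheme does: a net pays its
 * current cost when first cut, and its subnets then carry a cost of F. */
static int rb_cost(const int *parts, int n, int lo, int hi, int cost) {
    if (hi - lo == 1) return 0;                 /* leaf: net is internal  */
    int mid = (lo + hi) / 2, inL = 0, inR = 0;
    for (int i = 0; i < n; i++) {
        if (parts[i] < lo || parts[i] >= hi) continue; /* pin not here    */
        if (parts[i] < mid) inL = 1; else inR = 1;
    }
    if (inL && inR)                             /* cut at this RB step    */
        return cost + rb_cost(parts, n, lo, mid, F)
                    + rb_cost(parts, n, mid, hi, F);
    if (inL) return rb_cost(parts, n, lo, mid, cost); /* inherit cost     */
    if (inR) return rb_cost(parts, n, mid, hi, cost);
    return 0;                                   /* no pins in this range  */
}
```

A first cut contributes 2F and each of the remaining λ − 2 subnet cuts contributes F, giving 2F + (λ − 2)F = F λ in total.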

A. Experimental Framework
We evaluate the contributions proposed in this work through comparing three methods implementing parallel SSGD using six real-world rating matrices. The first method, DSGD, is the algorithm proposed in the original work of Gemulla et al. [3]. DSGD performs block-wise communication of H-matrix row blocks in each sub-epoch. The second method, P2P, uses P2P messages as in Algorithm 1. The third, H&C, uses combined P2P messages (Algorithm 3) for communication.
In all three methods, column-to-stratum assignments are done randomly in such a way that the number of columns per stratum differs by at most one. Row-to-processor assignments are obtained either randomly, in a way similar to that of column-to-stratum assignments, or using the HP method discussed in Section V. Whenever the former is used, the method will be prefixed by RAND, whereas if the latter is used the method will be prefixed by HP. The HP method is implemented according to the RB framework described in Section V to encapsulate the SOED metric. In order to obtain two-way partitions on the (sub)hypergraphs at each RB level, we use the HP tool PaToH [8] with default parameters in SPEED mode.
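The balanced random assignment described above can be sketched as follows (an illustrative snippet, not the actual implementation; function and variable names are our own): shuffle the column ids, then deal them round-robin, which guarantees that stratum sizes differ by at most one.

```c
#include <stdlib.h>

/* Randomly assign n_cols columns to K strata so that stratum sizes differ
 * by at most one: Fisher-Yates shuffle of the column ids, followed by a
 * round-robin deal of the shuffled ids over the K strata. */
static void assign_columns(int *stratum_of, int n_cols, int K, unsigned seed) {
    int *perm = malloc(n_cols * sizeof *perm);
    for (int j = 0; j < n_cols; j++) perm[j] = j;
    srand(seed);
    for (int j = n_cols - 1; j > 0; j--) {      /* Fisher-Yates shuffle */
        int r = rand() % (j + 1);
        int t = perm[j]; perm[j] = perm[r]; perm[r] = t;
    }
    for (int j = 0; j < n_cols; j++)
        stratum_of[perm[j]] = j % K;            /* round-robin deal */
    free(perm);
}
```

The same scheme serves for the RAND row-to-processor assignment.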
We implemented the parallel SSGD code that includes DSGD, P2P and H&C in C and used MPI for inter-process communication. We perform our experiments on an HPC system with AMD EPYC 7742 processors and a high-speed HDR InfiniBand network with 200 Gb/s bandwidth.
We compare the three methods in terms of communication cost metrics as well as SGD iteration time. The communication cost metrics consist of the bandwidth-oriented metrics sum-max vol and tot vol, and the latency-oriented metrics sum-max msgs and tot msgs. sum-max msgs is calculated as follows: at each sub-epoch, the number of messages sent by the bottleneck processor (the processor that sends the highest number of messages) is obtained. Then, the summation is taken over all K sub-epochs. That is,

sum-max msgs = Σ_{k=1}^{K} max_x msgs(p x , k).

In a similar way, sum-max vol is computed as

sum-max vol = Σ_{k=1}^{K} max_x vol(p x , k).

tot msgs and tot vol are respectively computed as

tot msgs = Σ_{k=1}^{K} Σ_x msgs(p x , k) and tot vol = Σ_{k=1}^{K} Σ_x vol(p x , k).

Here, msgs(p x , k) and vol(p x , k) respectively denote the number of messages and the volume sent by processor p x during sub-epoch k, regardless of whether P2P or H&C are used. Whenever the values for the volume of communication are presented, these values are normalized with respect to F. This uncoupling of F from the volume values helps evaluate the proposed methods and model for any F value.
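For concreteness, the latency-oriented metrics can be computed from a per-sub-epoch message-count table as sketched below (an illustration with our own naming, not the benchmarking code; the volume metrics are computed identically from a volume table). The sketch also computes the max-max msgs metric discussed with Fig. 6.

```c
/* msgs is a K x K table where msgs[k*K + p] is the number of messages
 * processor p sends in sub-epoch k. sum_max accumulates the bottleneck
 * count of each sub-epoch, max_max keeps the largest bottleneck count,
 * and tot sums all entries. */
static void comm_metrics(int K, const int *msgs,
                         int *sum_max, int *max_max, int *tot) {
    *sum_max = *max_max = *tot = 0;
    for (int k = 0; k < K; k++) {
        int mx = 0;                       /* bottleneck of sub-epoch k */
        for (int p = 0; p < K; p++) {
            int m = msgs[k * K + p];
            if (m > mx) mx = m;
            *tot += m;
        }
        *sum_max += mx;                   /* sum-max msgs contribution */
        if (mx > *max_max) *max_max = mx; /* max-max msgs              */
    }
}
```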
Table I shows the real-world matrices used to evaluate the proposed methods and their properties. Amz Items contains product reviews from Amazon between May 1996 and July 2014 [11] with aggressive duplicate removal. The other two Amazon datasets, Books and Clothing, are category-based subsets of the original comprehensive reviews. Goodreads Reviews contains user ratings of books from the Goodreads website [12]. Google Reviews contains user ratings/reviews of local businesses from the Google Maps website [13], [14]. Twitch contains ratings relative to how much time a user spent on a stream on the Twitch streaming website [15]. The original data does not contain any explicit ratings; we modified the dataset to represent (user, stream, rating) triplets such that the rating value is proportional to the amount of time the user spent in the specific stream.
amount of communication volume per epoch (more than 10x). Compared to RAND, the HP-based P2P and H&C methods incur significantly reduced volume (between 1.4x and 5x).

B. Evaluations With Communication Cost Metrics
Fig. 5(b) shows that in all matrix instances P2P and H&C have a significantly reduced sum-max vol compared to DSGD (more than 10x). H&C has slightly higher sum-max vol compared to P2P. This is because combining the messages disturbs the random volume balancing of P2P. As expected, HP-based P2P incurs less sum-max vol compared to RAND-based P2P. HP-based H&C shows a decrease in sum-max vol on two matrices (Amz Books and Amz Clothing & Jewelry), and an increase on the other four matrices. This is because the HP method, when used for H&C, does not encapsulate reducing the sum-max vol metric.
2) Latency-Oriented Communication Cost Metrics: Fig. 5(c) shows that the H&C method significantly reduces tot msgs on all dataset matrices. DSGD always incurs a constant number of messages for each K value; thus its tot msgs is always equal to K² = (1024)² = 1048576. tot msgs of P2P can go up to K² × (K −1). On the other hand, H&C keeps tot msgs limited to O(K² lg K). Depending on the sparsity pattern of the matrix, tot msgs of P2P can be very high (e.g., Amz Books, Amz Items and Twitch) or relatively close to the lower bound (e.g., Amz Clothing & Jewelry). The H&C method successfully controls the fluctuation in the number of messages thanks to the lg K factor. The significant reduction in tot vol of the HP-based P2P and H&C methods compared to the RAND-based ones is expected to reflect on the total number of messages, which is the case as shown in the figure. Fig. 6 showcases the H&C method's regularization of messages sent per epoch over K sub-epochs. In order to experimentally verify the O(lg K) bound given in Theorem 1, we introduce the max-max msgs metric as the maximum number of messages sent per sub-epoch among all sub-epochs. That is,

max-max msgs = max_k max_x msgs(p x , k).

As seen in Fig. 6(a), using H&C, max-max msgs is empirically found to be ≈ 3 × lg K, which is within a small constant factor of the expected lg K bound on the number of messages per sub-epoch given in (13). The figure shows that P2P incurs high max-max msgs on K = 256, and the max-max msgs values then start to decrease as K increases. We believe this is attributed to the ability of random partitioning to balance P2P message counts and volume. In Fig. 6(b), the sum-max msgs metric is shown for all matrices in the dataset using P2P and H&C on K = 64, …, 1024 processors. The figure shows the success of H&C in keeping the number of messages under the K lg K theoretical bound. Since P2P's sum-max msgs do not decrease as K increases, the maximum number of messages per sub-epoch must be almost equal across all sub-epochs, especially when K ≥ 512. On the other hand, although H&C's max-max msgs come very close to those of P2P on some instances such as Goodreads Reviews and Google Reviews, its sum-max msgs stay significantly lower than those of P2P. This means that although the maximum number of messages sent per sub-epoch can reach 3 lg K in a few sub-epochs, in most sub-epochs it stays at or below the expected lg K messages.

C. Evaluations With SGD Iteration Time
Fig. 5(d) and (e) compare the methods in terms of SGD iteration time on K = 1024 processors using F = 16 and F = 64 values, respectively. The figure shows that the P2P improvement over DSGD is significant (more than 4x on all matrices, except for Twitch where it is 1.4x) when F = 16. The improvement grows further as F increases to 64: it becomes more than 15x on all matrices except Twitch, and on Twitch the improvement becomes at least 4.7x.
Using HP improves the P2P runtime by 1.3x, 1.17x, 1.22x and 3.35x on Amz Books, Amz Items, Goodreads Reviews and Google Reviews, respectively, when F = 16. On Amz Clothing & Jewelry there is no significant improvement, and on Twitch there is a deterioration of 1.4x. When F = 64, HP improves the P2P runtime by 1.4x, 1.42x, 1.3x and 3.9x respectively on Amz Books, Amz Items, Goodreads Reviews and Google Reviews. Table II also shows the cost metrics of P2P-HP normalized with respect to those of P2P-RAND. On average, HP improves (reduces) the SGD iteration time by 22% when F = 16 and 29% when F = 64. The increase in the gap between HP and RAND in terms of P2P runtime when F grows from 16 to 64 is expected, since the HP method aims at reducing the total volume, the effect of which is seen more with higher F values. We observed that the HP method improves the H&C runtime compared to RAND only on Goodreads Reviews and Google Reviews. Increasing the F value is expected to render the SGD communication bandwidth-bound. Therefore, the effect of the methods that reduce the bandwidth (volume of communication) becomes conspicuous. This is observed in two different cases when moving from F = 16 in Fig. 7(a) to F = 64 in Fig. 7(b): (i) the performance gap between P2P/H&C and DSGD increases as F becomes larger as a result of the huge reduction in communication volume when using P2P/H&C, and (ii) the difference in performance between P2P and H&C slightly reduces due to the communication overhead leaning towards bandwidth.

D. Evaluations With Loss Values
Since all the methods discussed in this work follow the stratified SGD algorithm, their loss values per iteration are expected to be very similar regardless of the communication strategy used or the number of processors. We demonstrate this using Fig. 8(a). The figure shows the loss value (y-axis) following each SGD iteration (x-axis) for Amz Books and Goodreads Reviews using the RAND-based DSGD, P2P and H&C methods on K = {64, 256, 1024} processors. The loss values are very close as expected, thus the curves appear to be on top of each other.

VII. RELATED WORK
There exist several works in the literature that adopt SSGD for parallel matrix completion on shared-memory systems [16], [17], [18] and distributed-memory systems [3], [4], [5]. Here, we focus on the works that involve distributed-memory implementations. The work of Gemulla et al. [3] proposed the SSGD approach as well as the parallel DSGD algorithm discussed in Sections II-B and III-B. Teflioudi et al. [4] proposed DSGD++, an improved DSGD framework with better performance. They overlap computation and communication by dividing the input matrix into K × 2K blocks; in each of the K sub-epochs, DSGD++ performs computation on K blocks while simultaneously communicating the other K blocks. They report up to 2.3x improvement over DSGD in terms of runtime. Yun et al. [5] extend the idea of DSGD++ in their framework, NOMAD, and divide the input matrix into K × M blocks. Each of the K processors dedicates threads to updating H-matrix rows and other threads to communication. Once processor p x updates an H-matrix row, or a set of rows, it sends it/them to another processor p y that has idle computation threads. DSGD, DSGD++ and NOMAD have the same total communication volume during an SGD epoch, which equals F × M × K as discussed in Section III-B. The number of messages sent per processor during an epoch of DSGD and DSGD++ has an upper bound of O(K), whereas NOMAD may send up to O(M) messages. Guo et al. [6] proposed a novel framework, BaPa, for improving the nonzero load balance of DSGD through a novel algorithm for balancing per-processor and per-epoch ratings. Their BaPa-based DSGD shows a significant runtime improvement on small numbers of processors (< 16). However, their results show that both the original DSGD and the BaPa-based DSGD stop scaling after 256 processors.
There are several asynchronous-SGD-based parallel matrix completion algorithms in the literature. ASGD [4] (shown in the upper part of Fig. 1) is the simplest example of such an algorithm. During ASGD, it is possible that several processors update the same H-matrix row h j at the same time (i.e., stale updates). This results in each processor having a different copy of h j . These copies are coordinated by sending them to a processor responsible for h j . This processor takes their average and then sends the up-to-date version of h j back to the same set of processors. This type of coordination is done once or more during an SGD epoch [4], [19]. GASGD [19] extends ASGD by utilizing intelligent partitioning for balancing computational loads, reducing communication between processors, and reducing staleness. The authors utilize a bipartite graph model and
propose a partitioning method based on the balanced K-way vertex-cut problem [20] to achieve the partitioning goals. Luo et al. [21] proposed a different strategy, called alternating SGD, to facilitate computing SGD asynchronously in parallel. In alternating SGD, each epoch is divided into two sub-epochs; in each sub-epoch one factor matrix is fixed while the other is updated. This approach limits the feature-vector updates that use stale data to one of the two factor matrices during a sub-epoch. Recently, Shi et al. [22] proposed a distributed algorithm based on alternating SGD with data-aware partitioning.

VIII. CONCLUSION
We proposed a framework for scaling stratified SGD through significantly reducing the communication overhead. The framework reduces the bandwidth overhead by efficiently finding the essential communication required during an SGD epoch, performing it with P2P messages, and further reducing the P2P communication volume with an HP-based method. The framework reduces the increase in latency overhead through the novel H&C strategy, which limits the number of messages sent by a processor per epoch to O(K lg K). Our proposed framework achieves scalable distributed SGD, on up to K = 1024 processors, without compromising the convergence rate or performing updates on stale factors. The proposed framework achieves up to 15x runtime improvement over the state-of-the-art DSGD method, on 1024 processors, using six real-world rating matrices.

Fig. 3 .
Fig. 3. Illustrating the extra communication of DSGD. The figure shows two blocks of R that belong to processors p x and p x+1 such that p x+1 updates column block H β right after p x . A row h j ∈ H β has to be sent from p x to p x+1 if both processors contain a nonzero with column index j, because p x+1 has to know the up-to-date version of h j after p x updates it. When either only one of p x or p x+1 has such a nonzero, or neither of them does, the communication of h j at this stage is considered extra and can be avoided.

Fig. 4. Algorithm 3:
Fig. 4. An example TSMS for p 3 . The rows are the processors that p 3 communicates with, sorted according to their distance from p 3 . The columns represent both the sub-epochs and the H-matrix blocks to be updated at each sub-epoch. An entry (p y , H β ) gives the sub-epoch at which p y updates H β after p 3 does (note that this sub-epoch might be in the next epoch). The circles show the messages that can be combined.

Fig. 5 .
Fig. 5. Comparing RAND- and HP-based P2P and H&C methods against RAND-based DSGD using communication cost metrics (a to c) and SGD iteration time (d and e) using all dataset matrices on K = 1024 processors.

Fig. 6 .
Fig. 6. Showcasing the upper bound of the max-max messages and sum-max messages sent per sub-epoch using the H&C method compared to P2P on K = {64, …, 1024} processors.

Fig. 5.
Fig. 5(a), (b) and (c) compare DSGD, P2P and H&C in terms of the communication cost metrics tot vol, sum-max vol and tot msgs on K = 1024 processors. In the figures, the red bars denote RAND-based methods whereas light blue bars denote HP-based methods. HP does not affect DSGD's communication, which is why HP is not applicable for DSGD and hence DSGD has only red bars. Comparison in terms of sum-max msgs will be discussed with Fig. 6.
1) Bandwidth-Oriented Communication Cost Metrics: As seen in Fig. 5(a), both P2P and H&C incur the essential amount of communication volume as defined in (10), without any forwarding overhead. Compared to DSGD, both RAND- and HP-based P2P and H&C methods incur significantly reduced

Fig. 7.
Using RAND, the H&C improvement over P2P is also significant. When F = 16, H&C improves the iteration runtime over P2P by 2x, 1.2x, 2x, 1.5x, 2.15x, and 1.25x respectively on Amz Books, Amz Clothing & Jewelry, Amz Items, Goodreads Reviews, Google Reviews and Twitch. When F = 64, the respective values become 1.7x, 1.2x, 2x, 1.4x, 1.74x, and 1.22x. Fig. 7 shows the strong scaling curves of RAND-based DSGD, P2P and H&C using two different F values on K = {64, 128, 256, 512, 1024} processors. As seen in the figure, P2P and H&C show superior scaling compared to DSGD. Furthermore, H&C performs significantly better than P2P, especially with smaller F values.

Fig. 8.
Fig. 8(b) shows the amount of time (x-axis) required to reach a certain loss value (y-axis) for Amz Items and Google Reviews using the RAND-based DSGD, P2P and H&C methods on 1,024 processors. The figure shows that DSGD requires significantly more time to reach a certain loss value compared to P2P and H&C. Fig. 8(c) shows the scaling behavior of the RAND-based H&C method on Amz Clothing & Jewelry and Twitch in terms of loss value as the time increases.

Algorithm 1: Point-to-Point Parallel SSGD on Processor p x .
During an SGD epoch, the communication of H xy β should be performed after p x updates H β in sub-epoch k and before p y starts updating H β . This means that H xy β can be sent at the beginning of any sub-epoch between k + 1 and k + d xy β . Now consider the communication of H xy β at sub-epoch k in fixed-distance strata. Observe that when sub-epoch k + d xy is reached, all the rows of H xy for strata S k+1 (p x ), S k+2 (p x ), …, S k+d xy −1 (p x ) are already updated by p x and ready to be sent to p y . So, these rows can be held by p x and sent all at once in one message to p y in sub-epoch k + d xy along with H xy β . Utilizing fixed-distance strata, we propose to hold P2P messages and combine them as follows: if d xy ≥ K/2, then the messages between p x and p y in an epoch can be combined into just two P2P messages; this is because if d xy = K −1 then one message is needed for K −1 H-matrix row blocks and another message is needed for the last block. Otherwise, if d xy < K/2, then the messages between p x and p y can be combined into ⌈K/d xy ⌉ P2P messages. Therefore, the number of messages sent per processor per epoch can be computed by summing ⌈K/d xy ⌉ over all distances d xy = 1, …, K −1.
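The resulting per-processor message count can be checked numerically with the sketch below (an illustrative check of the harmonic-like sum, not the paper's implementation; the function name is ours): the processor at distance d receives ⌈K/d⌉ combined messages per epoch, and summing over d = 1, …, K−1 yields the O(K lg K) bound.

```c
/* Upper bound on the number of messages one processor sends per epoch
 * under Hold-and-Combine: it sends ceil(K/d) combined messages to the
 * processor at distance d, for each distance d = 1 .. K-1. */
static long hc_msgs_per_epoch(int K) {
    long total = 0;
    for (int d = 1; d < K; d++)
        total += (K + d - 1) / d;       /* ceil(K/d) */
    return total;
}
```

For K = 1024 the sum stays well below 2 K lg K ≈ 20480, whereas uncombined P2P messaging can grow toward K(K − 1) messages per processor per epoch.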

TABLE I: PROPERTIES OF MATRICES IN THE DATASET

TABLE II: NORMALIZED COST METRICS OF P2P-HP WITH RESPECT TO P2P-RAND ON