{"title": "Double Quantization for Communication-Efficient Distributed Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 4438, "page_last": 4449, "abstract": "Modern distributed training of machine learning models often suffers from high communication overhead for synchronizing stochastic gradients and model parameters. In this paper, to reduce the communication complexity, we propose \\emph{double quantization}, a general scheme for quantizing both model parameters and gradients. Three communication-efficient algorithms are proposed based on this general scheme. Specifically, (i) we propose a low-precision algorithm AsyLPG with asynchronous parallelism, (ii) we explore integrating gradient sparsification with double quantization and develop Sparse-AsyLPG, (iii) we show that double quantization can be accelerated by the momentum technique and design accelerated AsyLPG. We establish rigorous performance guarantees for the algorithms, and conduct experiments on a multi-server test-bed with real-world datasets to demonstrate that our algorithms can effectively save transmitted bits without performance degradation, and significantly outperform existing methods with either model parameter or gradient quantization.", "full_text": "Double Quantization for Communication-Ef\ufb01cient\n\nDistributed Optimization\n\nYue Yu\n\nIIIS, Tsinghua University\n\nyu-y14@mails.tsinghua.edu.cn\n\nJiaxiang Wu\nTencent AI Lab\n\njonathanwu@tencent.com\n\nLongbo Huang\u2217\n\nIIIS, Tsinghua University\n\nlongbohuang@tsinghua.edu.cn\n\nAbstract\n\nModern distributed training of machine learning models often suffers from high\ncommunication overhead for synchronizing stochastic gradients and model pa-\nrameters. In this paper, to reduce the communication complexity, we propose\ndouble quantization, a general scheme for quantizing both model parameters and\ngradients. 
Three communication-efficient algorithms are proposed based on this general scheme. Specifically, (i) we propose a low-precision algorithm AsyLPG with asynchronous parallelism, (ii) we explore integrating gradient sparsification with double quantization and develop Sparse-AsyLPG, and (iii) we show that double quantization can be accelerated by the momentum technique and design accelerated AsyLPG. We establish rigorous performance guarantees for the algorithms, and conduct experiments on a multi-server test-bed with real-world datasets to demonstrate that our algorithms can effectively save transmitted bits without performance degradation, and significantly outperform existing methods with either model parameter or gradient quantization.

1 Introduction

The data parallel mechanism is a widely used architecture for distributed optimization, which has received much recent attention due to data explosion and increasing model complexity. It decomposes the time-consuming gradient computations into sub-tasks, and assigns them to separate worker machines for execution. Specifically, the training data is distributed among M workers and each worker maintains a local copy of the model parameters. At each iteration, each worker computes a gradient from a mini-batch randomly drawn from its local data. The global stochastic gradient is then computed by synchronously aggregating the M local gradients. The model parameters are then updated accordingly.

Two issues significantly slow down methods based on the data parallel architecture. One is the communication cost. For example, all workers must send their entire local gradients to the master node. If gradients are dense, the master node has to receive and send M × d floating-point numbers per iteration (d is the size of the model vector), which scales linearly with the network and model vector sizes.
With the increasing computing cluster size and model complexity, it has been observed in many systems that such communication overhead has become the performance bottleneck [38, 35]. The other factor is the synchronization cost, i.e., the master node has to wait for the last local gradient to arrive at each iteration. This coordination dramatically increases the system's idle time.

*Corresponding author.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

To overcome the first issue, many works focus on reducing the communication complexity of gradients in the data parallel architecture. Generally, there are two approaches. One is quantization [2], which stores gradients using a smaller number of bits (lower precision). The other is sparsification [1], i.e., dropping out some coordinates of gradients following certain rules. However, existing communication-efficient algorithms based on data-parallel networks still suffer from significant synchronization cost. To address the second issue, many asynchronous algorithms have recently been developed for distributed training [17, 23]. By allowing workers to communicate with the master without synchronization, they can effectively improve training efficiency. Unfortunately, the communication bottleneck caused by transmitting gradients as floating-point numbers still exists. In this paper, we are interested in jointly achieving communication efficiency and asynchronous parallelism. Specifically, we study the following composite problem

    min_{x ∈ Ω} P(x) = f(x) + h(x),    f(x) = (1/n) Σ_{i=1}^n f_i(x),    (1)

where x ∈ R^d is the model vector, f_i(x) is smooth, and h(x) is convex but can be nonsmooth. Domain Ω ⊆ R^d is a convex set.
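As a concrete instance of (1), a minimal sketch of ℓ1-regularized logistic regression (our illustration: the component loss f_i, the regularizer h(x) = λ||x||₁, and the soft-thresholding form of its proximal operator; names and the pure-Python style are assumptions, not the paper's implementation):

```python
import math

def logistic_loss_grad(x, a, y):
    """f_i(x) = log(1 + exp(-y * <a, x>)) for one sample (a, y); returns (loss, grad)."""
    z = -y * sum(ai * xi for ai, xi in zip(a, x))
    loss = math.log(1.0 + math.exp(z))
    coef = -y * (1.0 - 1.0 / (1.0 + math.exp(z)))  # d(loss)/d<a, x>
    return loss, [coef * ai for ai in a]

def prox_l1(x, step, lam):
    """Proximal operator of h(x) = lam * ||x||_1: coordinate-wise soft-thresholding."""
    return [math.copysign(max(abs(xi) - step * lam, 0.0), xi) for xi in x]
```

Here n is the number of samples, each f_i is smooth, and the nonsmooth h is handled through its proximal operator, matching the structure assumed by (1).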
This formulation has found applications in many different areas, such as operations research, statistics and machine learning [10, 14], e.g., classification or regression. In these problems, n often denotes the number of samples, f_i(x) denotes the loss function for sample i, and h(x) represents a certain regularizer.

To solve (1), we propose a novel double quantization scheme, which quantizes both model parameters and gradients. This quantization is nontrivial, because we have to deal with low-precision gradients evaluated on low-precision model vectors. Three communication-efficient algorithms are then proposed under double quantization. We analyze the precision loss of low-precision gradients and prove that these algorithms achieve fast convergence rates while significantly reducing the communication cost. The main contributions are summarized as follows.

(i) We propose an asynchronous low-precision algorithm AsyLPG to solve the nonconvex and nonsmooth problem (1). We show that AsyLPG achieves the same asymptotic convergence rate as its unquantized serial counterpart, but with a significantly lower communication cost. (ii) We combine gradient sparsification with double quantization and propose Sparse-AsyLPG to further reduce communication overhead. Our analysis shows that the convergence rate scales with √(d/ϕ) for a sparsity budget ϕ. (iii) We propose accelerated AsyLPG, and mathematically prove that double quantization can be accelerated by the momentum technique [19, 26]. (iv) We conduct experiments on a multi-server distributed test-bed. The results validate the efficiency of our algorithms.

2 Related Work

Designing large-scale distributed algorithms for machine learning has been receiving increasing attention, and many algorithms, both synchronous and asynchronous, have been proposed, e.g., [22, 4, 17, 12].
In order to reduce the communication cost, researchers have also started to focus on cutting down the number of transmitted bits per iteration, based mainly on two schemes, i.e., quantization and sparsification.

Quantization. Algorithms based on quantization store a floating-point number using a limited number of bits. For example, [25] quantized gradients to a representation in {−1, 1}, and empirically showed the communication efficiency in training deep neural networks. [5, 6] considered bi-directional communication of gradients between master and workers; in their setting, each worker transmitted the gradient sign to the master, and the master aggregated the signs by majority vote. [2, 34, 35] adopted an unbiased gradient quantization with multiple levels. [13] provided a convergence rate of O(1/√K) for SGD with an unbiased gradient quantizer on nonconvex objectives, where K is the number of iterations. The error-feedback method was applied in [25, 35, 29] to integrate the history quantization error into the current stage. Specifically, [29] compressed transmitted gradients with error compensation in both directions between master and workers, and showed a linear speedup in the nonconvex case. [15] constructed several examples where simply transmitting the gradient sign cannot converge; they combined the error-feedback method to fix the divergence and proved a convergence rate for nonconvex smooth objectives. [40] also studied bi-directional compression with error feedback. They partitioned gradients into several blocks, compressed each block with a separate 1-bit quantizer, and analyzed the convergence rate when integrating momentum. [9] proposed a low-precision framework of SVRG [14], which quantized model parameters for single-machine computation. [38] proposed an end-to-end low-precision scheme, which quantized data, model and gradients with synchronous parallelism.
A biased quantization with gradient clipping was analyzed in [37]. [8] empirically studied asynchronous, low-precision SGD on logistic regression. [28] considered decentralized training and proposed an extrapolation compression method to obtain a higher compression level. [36] proposed a two-phase parameter quantization method, where the parameter in the first phase is a linear combination of full-precision and low-precision parameters; in the second phase, the weight of the full-precision value is set to zero to obtain full compression.

Sparsification. Sparsification methods drop out certain coordinates of transmitted vectors. [1] only transmitted gradients exceeding a threshold. [33, 30] formulated gradient sparsification as an optimization problem to balance sparsity and variance. [31] reduced transmissions in the parameter-server setting by solving a shifted L1-regularized minimization problem. [13] studied a variant of distributed SGD that transmits only a subset of model parameters in each iteration; they showed a convergence rate of O(1/√K) and a linear speedup as long as all model parameters are transmitted within a limited number of consecutive iterations. Recently, [27, 3] analyzed the convergence behavior of sparsified SGD with memory, i.e., compensating the gradient with the sparsification error.

Our work distinguishes itself from the above results in that: (i) we quantize both model vectors and gradients, (ii) we integrate gradient sparsification into double quantization and prove convergence, and (iii) we analyze how double quantization can be accelerated to reduce communication rounds.

Notation. x* is the optimal solution of (1). ||x||_∞, ||x||₁ and ||x|| denote the max, L1 and L2 norms of x, respectively. For a vector v_t ∈ R^d, [v_t]_i or v_{t,i} denotes its i-th coordinate. {e_i}_{i=1}^d is the standard basis in R^d. The base of the logarithmic function is 2.
Õ(f) denotes O(f · polylog(f)). We use the proximal operator to handle a nonsmooth function h(x), i.e., prox_{ηh}(x) = argmin_y h(y) + (1/(2η))||y − x||². If problem (1) is nonconvex and nonsmooth, we apply the commonly used convergence metric of gradient mapping [20], i.e., G_η(x) ≜ (1/η)[x − prox_{ηh}(x − η∇f(x))]. x is defined as an ε-accurate solution if it satisfies E||G_η(x)||² ≤ ε.

3 Preliminary

Low-Precision Representation via Quantization. Low-precision representation stores numbers using a limited number of bits, in contrast to the 32-bit full precision.² It can be represented by a tuple (δ, b), where δ ∈ R is the scaling factor and b ∈ N+ is the number of bits used. Specifically, given a tuple (δ, b), the set of representable numbers is given by

    dom(δ, b) = {−2^{b−1} · δ, ..., −δ, 0, δ, ..., (2^{b−1} − 1) · δ}.

For any full-precision x ∈ R, we call the procedure of transforming it to a low-precision representation quantization, denoted by the function Q_{(δ,b)}(x). It outputs a number in dom(δ, b) according to the following rules:

(i) If x lies in the convex hull of dom(δ, b), i.e., there exists a point z ∈ dom(δ, b) such that x ∈ [z, z + δ], then x is stochastically rounded in an unbiased way:

    Q_{(δ,b)}(x) = z + δ with probability (x − z)/δ, and z with probability (z + δ − x)/δ.

(ii) Otherwise, Q_{(δ,b)}(x) outputs the closest point to x in dom(δ, b).

This quantization method is widely used in existing works, e.g., [38, 2, 34, 9], sometimes under different formulations. In the following sections, we use Q_{(δ,b)}(v) to denote quantization of a vector v ∈ R^d, meaning that each coordinate of v is independently quantized using the same tuple (δ, b).
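A minimal sketch of the stochastic rounding rule above (pure Python for illustration; the paper does not prescribe an implementation):

```python
import math
import random

def quantize(x, delta, b):
    """Stochastic quantization Q_(delta,b)(x): unbiased rounding onto the grid
    dom(delta, b) = {-2**(b-1)*delta, ..., -delta, 0, delta, ..., (2**(b-1)-1)*delta}."""
    lo = -(2 ** (b - 1)) * delta
    hi = (2 ** (b - 1) - 1) * delta
    if x <= lo:              # outside the convex hull of dom(delta, b):
        return lo            # output the closest representable point
    if x >= hi:
        return hi
    z = math.floor(x / delta) * delta
    # round up to z + delta with probability (x - z)/delta, else down to z
    return z + delta if random.random() < (x - z) / delta else z
```

Unbiasedness (E[Q(x)] = x inside the hull) is what lets quantized gradients still serve as stochastic gradients; transmitting Q_{(δ,b)}(v) for v ∈ R^d then costs 32 + bd bits (32 for δ, b per coordinate) instead of 32d.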
Low-precision representation can effectively reduce communication cost, because we only need (32 + bd) bits to transmit the quantized Q_{(δ,b)}(v) (32 bits for δ, and b bits for each coordinate), whereas a full-precision v needs 32d bits.

²We assume without loss of generality that a floating-point number is stored using 32 bits (also see, e.g., [2, 37]). Our results extend to the case when numbers are stored with other precisions.

Algorithm 1 AsyLPG
1: Input: S, m, η, b_x, b, x̃⁰ = x⁰;
2: for s = 0, 1, ..., S − 1 do
3:   x^{s+1}_0 = x̃^s;
4:   Compute ∇f(x̃^s) = (1/n) Σ_{i=1}^n ∇f_i(x̃^s);  /* Map-reduce global gradient computation
5:   for t = 0 to m − 1 do
6:     /* For master:
7:     (i) Model Parameter Quantization: Set δ_x = ||x^{s+1}_{D(t)}||_∞ / (2^{b_x−1} − 1) and quantize x^{s+1}_{D(t)} subject to (2). Then, send Q_{(δ_x,b_x)}(x^{s+1}_{D(t)}) to the workers;
8:     (ii) Receive local gradient ζ_t, update u^{s+1}_t = ζ_t + ∇f(x̃^s), x^{s+1}_{t+1} = prox_{ηh}(x^{s+1}_t − η u^{s+1}_t);
9:     /* For worker:
10:    (i) Receive Q_{(δ_x,b_x)}(x^{s+1}_{D(t)}), stochastically sample a data-point a ∈ {1, ..., n}, and calculate gradient α_t = ∇f_a(Q_{(δ_x,b_x)}(x^{s+1}_{D(t)})) − ∇f_a(x̃^s);
11:    (ii) Gradient Quantization: Set δ_{α_t} = ||α_t||_∞ / (2^{b−1} − 1) and send the quantized gradient ζ_t = Q_{(δ_{α_t},b)}(α_t) to the master;
12:  end for
13:  x̃^{s+1} = x^{s+1}_m;
14: end for
15: Output: Uniformly chosen from {{x^{s+1}_t}_{t=0}^{m−1}}_{s=0}^{S−1}.

Distributed Network with Asynchronous Communication. As shown in Figure 1, we consider a network with one master and multiple workers, e.g., the parameter-server setting. The master maintains and updates a model vector x, and keeps a training clock.
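A serial, single-machine simulation of Algorithm 1's inner loop can make the double-quantized update concrete. The sketch below is our illustration under simplifying assumptions: a toy quadratic objective f_i(x) = 0.5||x − c_i||², h = 0 (so the proximal step is the identity), one simulated worker, and zero delay (D(t) = t); it is not the distributed implementation.

```python
import math
import random

def quantize_vec(v, b):
    """Stochastic quantization with scale delta = ||v||_inf / (2**(b-1) - 1),
    as in Steps 7 and 11 of Algorithm 1."""
    delta = max(abs(x) for x in v) / (2 ** (b - 1) - 1)
    if delta == 0.0:
        return list(v)
    out = []
    for x in v:
        z = math.floor(x / delta) * delta
        out.append(z + delta if random.random() < (x - z) / delta else z)
    return out

random.seed(0)
cs = [[random.gauss(0, 1), random.gauss(2, 1)] for _ in range(50)]   # toy data points
mean_c = [sum(c[j] for c in cs) / len(cs) for j in range(2)]         # minimizer of f

x_tilde, eta, m, bx, b = [8.0, -8.0], 0.5, 25, 8, 8
for s in range(4):                                           # outer epochs
    full_grad = [x_tilde[j] - mean_c[j] for j in range(2)]   # full-batch grad f(x~)
    x = list(x_tilde)
    for t in range(m):                                       # inner iterations
        qx = quantize_vec(x, bx)                             # master -> worker
        a = random.choice(cs)                                # sampled data point
        # alpha = grad f_a(Q(x)) - grad f_a(x~); for this quadratic it is qx - x~
        alpha = [(qx[j] - a[j]) - (x_tilde[j] - a[j]) for j in range(2)]
        zeta = quantize_vec(alpha, b)                        # worker -> master
        u = [zeta[j] + full_grad[j] for j in range(2)]       # semi-stochastic gradient
        x = [x[j] - eta * u[j] for j in range(2)]            # prox step with h = 0
    x_tilde = x                                              # snapshot update
```

Despite both messages being quantized, the iterate settles near the minimizer, with a residual error on the order of the quantization scale.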
Each worker can access the full dataset and keeps a disjoint partition of the data. In each communication round, a worker retrieves x from the master, evaluates the gradient g(x), and then sends it back to the master. Since workers asynchronously pull and push data during the training process, at a time t the master may use a delayed gradient calculated on a previous x_{D(t)}, where D(t) ≤ t. Many works showed that a near linear speedup can be achieved if the delay is reasonably moderate [17, 23].

Figure 1: The framework of a distributed network with asynchronous communication. Left: network structure. Right: training process.

4 Algorithms

To solve problem (1), we propose a communication-efficient algorithm with double quantization, namely AsyLPG, and introduce its two variants with gradient sparsification and momentum acceleration. We begin with the assumptions made in this paper. They are mild and often assumed in the literature, e.g., [14, 24].

Assumption 1. The stochastically sampled gradient is unbiased, i.e., for a ∈ {1, ..., n} sampled in Algorithms 1, 2, 3, E_a[∇f_a(x)] = ∇f(x). Moreover, the random variables in different iterations are independent.

Assumption 2. Each f_i(x) in (1) is L-smooth, i.e., ||∇f_i(x) − ∇f_i(y)|| ≤ L||x − y||, ∀x, y ∈ Ω.

Assumption 3. The gradient delay is bounded by some finite constant τ > 0, i.e., t − D(t) ≤ τ, ∀t.

4.1 Communication-Efficient Algorithm with Double Quantization: AsyLPG

In this section, we introduce our new distributed algorithm AsyLPG, with asynchronous communication and low-precision floating-point representation.
As shown in Algorithm 1, AsyLPG divides the training procedure into epochs, similar to SVRG [14], with each epoch containing m inner iterations. At the beginning of each epoch, AsyLPG performs one round of communication between the master and workers to calculate the full-batch gradient ∇f(x̃^s), where x̃^s is a snapshot variable evaluated at the end of epoch s. This one round of communication uses full precision because it is only performed once per epoch, and its communication overhead is small compared to the subsequent m communication rounds in the inner iterations, where the model parameters are updated.

In the inner iterations, the communication between the master and workers uses the asynchronous parallelism described in Section 3. To reduce communication complexity, we propose double quantization, i.e., quantizing both model parameters and gradients.

Model Parameter Quantization. Prior works [9, 39] showed that simply quantizing model vectors using a constant tuple (δ, b) fails to converge to the optimal solution due to a non-diminishing quantization error. In our case, we impose the additional requirement that the chosen (δ, b) satisfies the following condition (see Step 7 of AsyLPG):

    E_Q||Q_{(δ_x,b_x)}(x^{s+1}_{D(t)}) − x^{s+1}_{D(t)}||² ≤ μ||x^{s+1}_{D(t)} − x̃^s||².    (2)

This condition is set to control the precision loss of x^{s+1}_{D(t)} through the dynamic value ||x^{s+1}_{D(t)} − x̃^s||² and a positive hyperparameter μ, so as to achieve an accurate solution. Note that with a larger μ, we can aggressively save more transmitted bits (using a smaller b_x). In practice, the precision loss and communication cost can be balanced by selecting a proper μ.

Figure 2: The value of μ in different epochs that guarantees (2). Left: b_x = 8. Right: b_x = 4. The statistics are based on a logistic regression on dataset covtype [7].
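Condition (2) can be checked empirically by Monte Carlo estimation of the quantization error. The following sketch is our illustration (the paper does not specify a verification procedure); it reuses the stochastic quantizer of Section 3:

```python
import math
import random

def quantize_vec(v, delta, b):
    """Coordinate-wise stochastic quantization onto dom(delta, b)."""
    lo, hi = -(2 ** (b - 1)) * delta, (2 ** (b - 1) - 1) * delta
    out = []
    for x in v:
        x = min(max(x, lo), hi)
        z = math.floor(x / delta) * delta
        out.append(z + delta if random.random() < (x - z) / delta else z)
    return out

def satisfies_condition_2(x, x_snap, bx, mu, trials=2000):
    """Estimate E_Q ||Q(x) - x||^2 and test it against mu * ||x - x_snap||^2."""
    delta = max(abs(xi) for xi in x) / (2 ** (bx - 1) - 1)
    err = sum(sum((qi - xi) ** 2 for qi, xi in zip(quantize_vec(x, delta, bx), x))
              for _ in range(trials)) / trials
    bound = mu * sum((xi - si) ** 2 for xi, si in zip(x, x_snap))
    return err <= bound
```

A master could, for instance, decrease b_x until this test fails and keep the smallest passing bit-width, trading precision loss against transmitted bits exactly as the μ discussion above suggests.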
On the other hand, from the analysis aspect, we can always find a μ that guarantees (2) throughout the training process, for any given b_x. In the special case when x^{s+1}_{D(t)} = x̃^s, the master only needs to send a flag bit, since x̃^s is already stored at the workers, and (2) still holds. Figure 2 validates the practicability of (2), where we plot the value of μ required to guarantee (2) for a given b_x. Note that the algorithm already converges in both graphs. In this case, we see that when b_x is 4 or 8, μ can be upper bounded by a constant. Also, by setting a larger μ, we can choose a smaller b_x. The reason μ increases at the beginning of each epoch is that ||x^{s+1}_{D(t)} − x̃^s||² is small there. After several inner iterations, x^{s+1}_{D(t)} moves further away from x̃^s, so a smaller μ suffices to guarantee (2).

Gradient Quantization. After receiving the low-precision model parameter, as shown in Steps 10-11, a worker calculates a gradient α_t, quantizes it into its low-precision representation Q_{(δ_{α_t},b)}(α_t), and then sends it to the master.

In Step 8, the master constructs a semi-stochastic gradient u^{s+1}_t based on the received low-precision Q_{(δ_{α_t},b)}(α_t) and the full-batch gradient ∇f(x̃^s), and updates the model vector x using step size η. The semi-stochastic gradient adopts the variance reduction method proposed in SVRG [14] and is used to accelerate convergence. If Algorithm 1 is run without double quantization and asynchronous parallelism, i.e., with only one compute node and no delay, [24] showed that:

Theorem 1. ([24], Theorem 5) Suppose h(x) is convex and Assumptions 1, 2 hold. Let T = Sm and η = ρ/L, where ρ ∈ (0, 1/2) satisfies 4ρ²m² + ρ ≤ 1. Then for the output x_out of Algorithm 1, we have E||G_η(x_out)||² ≤ (2L(P(x⁰) − P(x*))) / (ρ(1 − 2ρ)T).

4.1.1 Theoretical Analysis

Lemma 1. Denote ∆ = d / (4(2^{b−1} − 1)²). If Assumptions 1, 2, 3 hold, then for the gradient u^{s+1}_t in Algorithm 1, its variance can be bounded by

    E||u^{s+1}_t − ∇f(x^{s+1}_t)||² ≤ 2L²(μ + 1)(∆ + 2) E[||x^{s+1}_{D(t)} − x^{s+1}_t||² + ||x^{s+1}_t − x̃^s||²].

Theorem 2. Suppose h(x) is convex, the conditions in Lemma 1 hold, T = Sm, η = ρ/L, where ρ ∈ (0, 1/2), and ρ, τ satisfy 8ρ²m²(μ + 1)(∆ + 2) + 2ρ²(μ + 1)(∆ + 2)τ² + ρ ≤ 1. Then, for the output x_out of Algorithm 1, we have E||G_η(x_out)||² ≤ (2L(P(x⁰) − P(x*))) / (ρ(1 − 2ρ)T).

Remarks. From Theorem 2, we see that if μ = O(1), b = O(log √d) and ρ = O(1/m), AsyLPG achieves the same asymptotic convergence rate as in Theorem 1, while transmitting much fewer bits (b = O(log √d) is much smaller than 32). Our analytical results focus on convergence, as done in [35, 34].
Exactly quantifying the amount of improvement in communication complexity, however, remains challenging due to the complicated dynamics of parameter updates, e.g., [9, 28]. We instead show through extensive experiments that our scheme significantly improves upon existing benchmarks, e.g., Figure 3 and Table 1. Note that AsyLPG can also adopt other gradient quantization methods (even biased ones), e.g., [37], and similar results can be established.

Algorithm 2 Sparse-AsyLPG: Procedures for worker
1: (i) Receive Q_{(δ_x,b_x)}(x^{s+1}_{D(t)}), stochastically sample a data-point a ∈ {1, ..., n}, and calculate gradient α_t = ∇f_a(Q_{(δ_x,b_x)}(x^{s+1}_{D(t)})) − ∇f_a(x̃^s);
2: (ii) Gradient Sparsification: Select a budget ϕ_t and sparsify α_t to obtain β_t using (3);
3: (iii) Gradient Quantization: Set δ_{β_t} = ||β_t||_∞ / (2^{b−1} − 1), and send ζ_t = Q_{(δ_{β_t},b)}(β_t) to the master;

4.2 AsyLPG with Gradient Sparsification

In this section, we explore how to further reduce the communication cost by incorporating gradient sparsification into double quantization, and propose a new algorithm, Sparse-AsyLPG. As shown in Algorithm 2, after calculating α_t, workers successively perform sparsification and quantization on it. Specifically, we drop out certain coordinates of α_t to obtain a sparsified vector β_t according to the following rule [33]:

    β_t = [Z₁ α_{t,1}/p₁, Z₂ α_{t,2}/p₂, ..., Z_d α_{t,d}/p_d],    (3)

where Z = [Z₁, Z₂, ..., Z_d] is a binary-valued vector with Z_i ∼ Bernoulli(p_i), 0 < p_i ≤ 1, and the Z_i's are independent.
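A minimal sketch of the sparsification rule (3), using the variance-minimizing probabilities p_i = |α_{t,i}| · ϕ_t / ||α_t||₁ (the function name and pure-Python style are our assumptions for illustration):

```python
import random

def sparsify(alpha, phi):
    """Keep coordinate i with probability p_i and rescale it by 1/p_i, so that
    E[beta] = alpha. Choosing p_i = |alpha_i| * phi / ||alpha||_1 minimizes the
    variance for a sparsity budget phi = sum_i p_i, and requires
    phi <= ||alpha||_1 / ||alpha||_inf so that every p_i <= 1."""
    l1 = sum(abs(a) for a in alpha)
    beta = []
    for a in alpha:
        p = abs(a) * phi / l1
        assert p <= 1.0, "budget phi violates phi <= ||alpha||_1 / ||alpha||_inf"
        beta.append(a / p if a != 0 and random.random() < p else 0.0)
    return beta
```

Since β_t is sparse, only its nonzero coordinates need to be encoded before transmission (a position plus a value), which is where the communication saving comes from.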
Thus, β_t is obtained by randomly selecting the i-th coordinate of α_t with probability p_i. It can be verified that E[β_t] = α_t. Define ϕ_t ≜ Σ_{i=1}^d p_i to measure the sparsity of β_t. To reduce the communication complexity, it is desirable to make ϕ_t as small as possible, which, on the other hand, brings about a large variance. The following lemma quantifies the relationship between β_t and ϕ_t, and is derived based on results in [30].

Lemma 2. Suppose ϕ_t ≤ ||α_t||₁/||α_t||_∞. Then, for α_t = Σ_{i=1}^d α_{t,i} e_i and β_t generated by (3), we have E||β_t||² ≥ ||α_t||₁²/ϕ_t. The equality holds if and only if p_i = |α_{t,i}| · ϕ_t / ||α_t||₁.

Based on Lemma 2, in Step 2 of Algorithm 2, we can select a sparsity budget ϕ_t ≤ ||α_t||₁/||α_t||_∞ and set p_i = |α_{t,i}| · ϕ_t / ||α_t||₁ to minimize the variance of β_t. Then, we quantize β_t and send its low-precision version to the master. The master node in Sparse-AsyLPG employs the same model parameter quantization and updates the parameter x in the same manner as AsyLPG. The asynchronous parallelism is also applied in the communication between master and workers.

4.2.1 Theoretical Analysis

We first study the variance of the sparsified gradient u^{s+1}_t = ζ_t + ∇f(x̃^s).

Lemma 3. Suppose ϕ_t ≤ ||α_t||₁/||α_t||_∞, Assumptions 1, 2, 3 hold, and for each i ∈ {1, ..., d}, p_i = |α_{t,i}| · ϕ_t / ||α_t||₁. Denote Γ = d² / (4ϕ(2^{b−1} − 1)²) + d/ϕ + 1, where ϕ = min_t{ϕ_t}. Then, for the gradient u^{s+1}_t in Sparse-AsyLPG, we have

    E||u^{s+1}_t − ∇f(x^{s+1}_t)||² ≤ 2L²(μ + 1)Γ E[||x^{s+1}_{D(t)} − x^{s+1}_t||² + ||x^{s+1}_t − x̃^s||²].

Theorem 3. Suppose h(x) is convex, the conditions in Lemma 3 hold, T = Sm, η = ρ/L, where ρ ∈ (0, 1/2), and ρ, τ satisfy 8ρ²m²(μ + 1)Γ + 2ρ²(μ + 1)τ²Γ + ρ ≤ 1.
Then, for the output x_out of Sparse-AsyLPG, we have E||G_η(x_out)||² ≤ (2L(P(x⁰) − P(x*))) / (ρ(1 − 2ρ)T).

Remarks. Setting b = O(log √d), we obtain Γ = O(d/ϕ). We then conclude from Theorem 3 that Sparse-AsyLPG converges at a rate that scales linearly with √(d/ϕ), and significantly reduces the number of transmitted bits per iteration. Note that before transmitting ζ_t, we need to encode it into a string, which contains 32 bits for δ_{β_t} and b bits for each coordinate. Since β_t is sparse, we only encode the nonzero coordinates, i.e., we use log d bits to encode the position of a nonzero element, followed by its value.

Algorithm 3 Acc-AsyLPG
1: Input: S, m, b_x, b, x̃⁰, y⁰_m = x̃⁰;
2: for s = 1, 2, ..., S do
3:   Update θ_s, η_s; set x^s_0 = θ_s y^s_0 + (1 − θ_s)x̃^{s−1}, y^s_0 = y^{s−1}_m;
4:   Compute ∇f(x̃^{s−1}) = (1/n) Σ_{i=1}^n ∇f_i(x̃^{s−1});  /* Map-reduce global gradient computation
5:   for t = 0 to m − 1 do
6:     /* For master:
7:     (i) Model Parameter Quantization: Set δ_x = ||x^s_{D(t)}||_∞ / (2^{b_x−1} − 1), and quantize x^s_{D(t)} subject to (4). Then, send Q_{(δ_x,b_x)}(x^s_{D(t)}) to the workers;
8:     (ii) Momentum Acceleration: Receive local gradient Q_{(δ_{α_t},b)}(α_t), compute u^s_t = Q_{(δ_{α_t},b)}(α_t) + ∇f(x̃^{s−1}), and update y^s_{t+1} = prox_{η_s h}(y^s_t − η_s u^s_t), x^s_{t+1} = x̃^{s−1} + θ_s(y^s_{t+1} − x̃^{s−1});
9:     /* For worker:
10:    (i) Receive Q_{(δ_x,b_x)}(x^s_{D(t)}), stochastically sample a data-point a ∈ {1, ..., n}, and calculate gradient α_t = ∇f_a(Q_{(δ_x,b_x)}(x^s_{D(t)})) − ∇f_a(x̃^{s−1});
11:    (ii) Gradient Quantization: Quantize α_t using Step 11 in Algorithm 1;
12:  end for
13:  x̃^s = (1/m) Σ_{t=0}^{m−1} x^s_{t+1};
14: end for
15: Output: x̃^S.

4.3 Accelerated AsyLPG

Above, we mainly focused on reducing the communication cost within each iteration. Here we propose an algorithm with an even faster convergence and fewer communication rounds. Specifically, we incorporate the popular momentum, or Nesterov, technique [19, 26] into AsyLPG. To simplify the presentation, we only present accelerated AsyLPG (Acc-AsyLPG) in Algorithm 3. The method can similarly be applied to Sparse-AsyLPG.

Algorithm 3 still adopts asynchronous parallelism and double quantization, and makes the following key modifications. (i) In Step 7, the model parameter quantization satisfies

    E_Q||Q_{(δ_x,b_x)}(x^s_{D(t)}) − x^s_{D(t)}||² ≤ θ_s μ||x^s_{D(t)} − x̃^{s−1}||²,    (4)

where μ is the hyperparameter that controls the precision loss, and θ_s is the momentum weight, whose value will be specified later. (ii) Momentum acceleration is implemented in Steps 3 and 8, through an auxiliary variable y^s_{t+1}. The update of x^s_{t+1} combines the history information x̃^{s−1} and y^s_{t+1}. In the following, we show that with the above modifications, Acc-AsyLPG achieves an even faster convergence rate.

Theorem 4. Suppose each f_i(x) and h(x) are convex, Assumptions 1, 2, 3 hold, and the domain Ω of x is bounded by D, such that ∀x, y ∈ Ω, ||x − y||² ≤ D. Let θ_s = 2/(s + 2) and η_s = 1/(σLθ_s), where σ > 1 is a constant. If σ, τ satisfy

    τ ≤ (1/2)[√((2/(γθ_s) + θ_s∆)² + 4(σ − 1)/γ) − (2/(γθ_s) + θ_s∆)],

where ∆ = d/(2^{b−1} − 1)² + 2 and γ = 1 + 2θ_s μ, then under Algorithm 3 we have E[P(x̃^S) − P(x*)] ≤ Õ((L/m + LDμ∆/τ + LDμ)/S²).

The bounded domain condition in Theorem 4 is commonly assumed in the literature, e.g., [32], and the possibility of leaving the domain is avoided by the proximal operator in Step 8. If b = O(log √d) and μ = O(1), the constraint on the delay τ can easily be satisfied with a moderate σ. Then, our Acc-AsyLPG achieves acceleration while effectively reducing the communication cost.

5 Experiments

We conduct experiments to validate the efficiency of our algorithms. We start with the logistic regression problem and then evaluate the performance of our algorithms on neural network models. We further study the relationship between the hyperparameter μ and the number of transmitted bits.

Figure 3: Results on real-sim (panels (a), (b)) and rcv1 (panels (c), (d)). (a) and (c): training curves. (b): decomposition of time consumption (recorded until the training loss first drops below 0.5). (d): number of transmitted bits until the training loss first drops below 0.4.

5.1 Evaluations on Logistic Regression

We begin with logistic regression on dataset real-sim [7]. The evaluations are set up on a 6-server distributed test-bed. Each server has 16 cores and 16 GB memory. The communication among servers is handled by OpenMPI [21]. We use L1 and L2 regularization with weights 10−5 and 10−4, respectively.
The mini-batch size is B = 200 and the epoch length is m = ⌈n/B⌉. The following six algorithms are compared, using a constant learning rate (denoted lr) tuned to achieve the best result from {1e−1, 1e−2, 5e−2, 1e−3, 5e−3, ..., 1e−5, 5e−5}.

(i) AsyLPG, Sparse-AsyLPG, Acc-AsyLPG. We set b_x = 8 and b = 8 in these three algorithms. The sparsity budget in Sparse-AsyLPG is selected as ϕ_t = ||α_t||₁/||α_t||_∞. We do not tune ϕ_t, to present a fair comparison. Parameters in Acc-AsyLPG are set to θ_s = 2/(s + 2) and η_s = lr/θ_s.

(ii) QSVRG [2], a gradient-quantized algorithm. We implement it with asynchronous parallelism. Its gradient quantization method is equivalent to Step 11 of AsyLPG. If run with synchronization and without quantization, QSVRG and AsyLPG have the same convergence rate. For a fair comparison, we set the gradient quantization bit-width b = 8 for QSVRG.

(iii) The full-precision implementations of AsyLPG and Acc-AsyLPG, denoted AsyFPG and Acc-AsyFPG, respectively. In both algorithms, we remove double quantization.

Convergence and time consumption. Figure 3 presents the evaluations on dataset real-sim. Plot (a) shows that AsyLPG and Acc-AsyLPG have convergence rates similar to their full-precision counterparts. Our Sparse-AsyLPG also converges fast, with a very small accuracy degradation. The time consumption presented in plot (b) shows the communication efficiency of our algorithms. With similar convergence rates, our low-precision algorithms significantly reduce the communication overhead when achieving the same training loss. Moreover, the comparison between AsyLPG and QSVRG validates the redundancy of the 32-bit representation of model parameters.

Communication complexity. We ran logistic regression on dataset rcv1 [7]. L1 and L2 regularization are adopted, both with weights 10−4. lr is tuned in the same way as for real-sim.
In Figure 3(d), we record the total number of transmitted bits. It shows that AsyLPG, Sparse-AsyLPG and Acc-AsyLPG reduce the number of transmitted bits by up to 6.48×, 7.33× and 19.19×, respectively, compared to AsyFPG.

5.2 Evaluations on Neural Network

We conduct evaluations on the dataset MNIST [18] using a 3-layer fully connected neural network. The 6-server distributed test-bed of Section 5.1 is adopted. The hidden layer contains 100 nodes and uses the ReLU activation function. The softmax loss function and an L2 regularizer with weight 10−4 are adopted. We use 10k training and 2k test samples randomly drawn from the full dataset (60k training / 10k test). The mini-batch size is 20 and the epoch length m is 500. We set bx = 8 and b = 4 for the low-precision algorithms.

We also compare our algorithms with HALP [9], a low-precision variant of SVRG with quantized model vectors. We implement its distributed version with asynchronous parallelism. For a fair comparison, we also set bx = 8 for HALP. lr is constant and is tuned to achieve the best result for each algorithm. Figure 4 presents the convergence of the seven algorithms, and the left part of Table 1 records the total number of transmitted bits. The results are similar to the logistic regression case: our new low-precision algorithms significantly reduce communication overhead compared to their full-precision counterparts and QSVRG. Moreover, our AsyLPG with double quantization has a convergence rate comparable to HALP while sending far fewer bits.

Figure 4: Evaluation on dataset MNIST. Top: the training curve. Bottom: the test error rate.

Figure 6: Evaluations on CIFAR10: training loss (1st column), test accuracy (2nd column) and total number of transmitted bits.

Table 1: Evaluation on dataset MNIST. Left: # of transmitted bits until the training loss is first below 0.05. Right: the value of bx and # of transmitted bits of AsyLPG under different µ.

Left:
  Algorithm        # bits    Ratio
  AsyFPG           2.42e9    −
  Acc-AsyFPG       6.87e8    3.52×
  QSVRG            4.50e8    5.38×
  HALP             7.36e8    3.29×
  AsyLPG           3.33e8    7.28×
  Sparse-AsyLPG    2.73e8    8.87×
  Acc-AsyLPG       1.26e8    19.13×

Right:
  µ        bx    # bits
  0.005    11    2.42e7
  0.01     10    2.33e7
  0.05      9    1.52e7
  0.1       8    1.07e7
  0.5       7    1.12e7
  2.0       6    1.24e7
  10        5    1.14e7
  50        4    1.08e7
  150       3    1.44e7
  800       2    2.16e8

Study of µ. The hyperparameter µ controls the precision loss incurred by model-parameter quantization. Above, we fixed bx to compare our algorithms with other methods. Now we study the relationship between µ and the number of transmitted bits.
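A back-of-the-envelope accounting helps explain the trend in the right part of Table 1. Sending a d-dimensional vector at b bits per coordinate costs roughly d·b bits plus a constant for scale metadata; the function below is our own rough model, not the paper's actual encoder (which may further compress the quantized values):

```python
def bits_per_vector(d, b, scale_bits=32):
    # Rough cost of one quantized d-dimensional vector: b bits per
    # coordinate plus one full-precision scale factor.
    return d * b + scale_bits

# Shrinking bx cuts the per-iteration cost proportionally, but a
# too-coarse grid (large mu, tiny bx) forces more iterations, so the
# total bit count eventually rebounds, as in Table 1 (right).
full = bits_per_vector(10**6, 32)
low = bits_per_vector(10**6, 8)
ratio = full / low  # roughly 4x fewer bits per iteration at bx = 8
```

Under this rough model, total communication is (bits per iteration) × (iterations to target loss), which is why the optimum sits at an intermediate µ.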
Note that when µ is fixed, we choose bx to satisfy (2). In Figure 5, we set µ = 0.6 and study how the accuracy of the model quantizer improves over iterations when running AsyLPG on MNIST. We see that the quantization error diminishes; correspondingly, the number of transmitted bits increases as the number of iterations grows.

Next, in the right part of Table 1, we study the performance of AsyLPG under different µ. The value of bx is again chosen to guarantee (2). We report the overall number of transmitted bits under each bx until the training loss is first below 0.5. The results validate that with increasing µ, we can choose a smaller bx and thus save more communication cost per iteration. The total number of transmitted bits decreases until a threshold µ = 0.5, beyond which significant precision loss occurs and more training iterations are needed to reach the same accuracy.

Figure 5: The accuracy of the model quantizer.

Evaluations on Deep Model. We further set up experiments in PyTorch with ResNet18 [11] on the CIFAR10 dataset [16]. The model size is about 44MB. We use 50k training samples and 10k evaluation samples. For direct comparison, no data augmentation is used. The batch size is 128. The learning rate starts from 0.1 and is divided by 10 at epochs 150 and 250. We set bx = 8 and b = 4 for the low-precision algorithms. The sparsity budget is φt = ||αt||1/||αt||∞. In Figure 6, we plot the training loss and test accuracy with respect to epochs, and report the total number of transmitted bits until the training loss first gets below 0.17. Our algorithms achieve similar accuracy while effectively reducing the communication cost compared to the benchmarks.

6 Conclusion

We propose three communication-efficient algorithms for distributed training with asynchronous parallelism. The key idea is to quantize both model parameters and gradients, which we call double quantization.
We analyze the variance of low-precision gradients and show that our algorithms achieve the same asymptotic convergence rate as their full-precision counterparts while transmitting far fewer bits per iteration. We also incorporate gradient sparsification into double quantization, and establish the relation between convergence rate and sparsity budget. Finally, we accelerate double quantization by integrating momentum techniques. Evaluations of logistic regression and neural networks on real-world datasets validate that our algorithms can significantly reduce communication cost.

Acknowledgments

The work of Yue Yu and Longbo Huang was supported in part by the National Natural Science Foundation of China Grant 61672316, the Zhongguancun Haihua Institute for Frontier Information Technology and the Turing AI Institute of Nanjing. We would like to thank the anonymous reviewers for their insightful suggestions.

References

[1] A. F. Aji and K. Heafield. Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017.

[2] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.

[3] D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems, pages 5977–5987, 2018.

[4] R. Bekkerman, M. Bilenko, and J. Langford.
Scaling up machine learning: Parallel and distributed approaches. Cambridge University Press, 2011.

[5] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar. signSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning (ICML), 2018.

[6] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar. signSGD with majority vote is communication efficient and fault tolerant. In International Conference on Learning Representations (ICLR), 2019.

[7] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

[8] C. De Sa, M. Feldman, C. Ré, and K. Olukotun. Understanding and optimizing asynchronous low-precision stochastic gradient descent. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 561–574. ACM, 2017.

[9] C. De Sa, M. Leszczynski, J. Zhang, A. Marzoev, C. R. Aberger, K. Olukotun, and C. Ré. High-accuracy low-precision training. arXiv preprint arXiv:1803.03383, 2018.

[10] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.

[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[12] Z. Huo and H. Huang. Asynchronous stochastic gradient descent with variance reduction for non-convex optimization. arXiv preprint arXiv:1604.03584, 2016.

[13] P. Jiang and G. Agrawal. A linear speedup analysis of distributed deep learning with sparse and quantized communication. In Advances in Neural Information Processing Systems, pages 2525–2536. Curran Associates, Inc., 2018.

[14] R. Johnson and T. Zhang.
Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[15] S. P. Karimireddy, Q. Rebjock, S. U. Stich, and M. Jaggi. Error feedback fixes signSGD and other gradient compression schemes. In International Conference on Machine Learning (ICML), 2019.

[16] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. Technical report, 2009.

[17] X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2737–2745, 2015.

[18] MNIST. http://yann.lecun.com/exdb/mnist/.

[19] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.

[20] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2003.

[21] OpenMPI. https://www.open-mpi.org/.

[22] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.

[23] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. J. Smola. On variance reduction in stochastic gradient descent and its asynchronous variants. In Advances in Neural Information Processing Systems, pages 2647–2655, 2015.

[24] S. J. Reddi, S. Sra, B. Poczos, and A. Smola. Fast stochastic methods for nonsmooth nonconvex optimization. In Advances in Neural Information Processing Systems, 2016.

[25] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[26] F. Shang, Y. Liu, J. Cheng, and J. Zhuo.
Fast stochastic variance reduced gradient method with momentum acceleration for machine learning. arXiv preprint arXiv:1703.07948, 2017.

[27] S. U. Stich, J.-B. Cordonnier, and M. Jaggi. Sparsified SGD with memory. In Advances in Neural Information Processing Systems, pages 4452–4463, 2018.

[28] H. Tang, S. Gan, C. Zhang, T. Zhang, and J. Liu. Communication compression for decentralized training. In Advances in Neural Information Processing Systems, pages 7652–7662, 2018.

[29] H. Tang, C. Yu, X. Lian, T. Zhang, and J. Liu. DoubleSqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. In International Conference on Machine Learning (ICML), 2019.

[30] H. Wang, S. Sievert, S. Liu, Z. Charles, D. Papailiopoulos, and S. Wright. ATOMO: Communication-efficient learning via atomic sparsification. In Advances in Neural Information Processing Systems, 2018.

[31] J. Wang, M. Kolar, N. Srebro, and T. Zhang. Efficient distributed learning with sparsity. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3636–3645. PMLR, 2017.

[32] J. Wang, W. Wang, and N. Srebro. Memory and communication efficient distributed stochastic optimization with minibatch-prox. Conference on Learning Theory (COLT), 2017.

[33] J. Wangni, J. Wang, J. Liu, and T. Zhang. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, 2018.

[34] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1509–1519, 2017.

[35] J. Wu, W. Huang, J. Huang, and T. Zhang.
Error compensated quantized SGD and its applications to large-scale distributed optimization. In International Conference on Machine Learning (ICML), 2018.

[36] P. Yin, S. Zhang, J. Lyu, S. Osher, Y. Qi, and J. Xin. BinaryRelax: A relaxation approach for training deep neural networks with quantized weights. SIAM Journal on Imaging Sciences, 11(4):2205–2223, 2018.

[37] Y. Yu, J. Wu, and J. Huang. Exploring fast and communication-efficient algorithms in large-scale distributed networks. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

[38] H. Zhang, J. Li, K. Kara, D. Alistarh, J. Liu, and C. Zhang. ZipML: Training linear models with end-to-end low precision, and a little bit of deep learning. In International Conference on Machine Learning (ICML), pages 4035–4043, 2017.

[39] X. Zhang, J. Liu, Z. Zhu, and E. S. Bentley. Compressed distributed gradient descent: Communication-efficient consensus over networks. In IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, pages 2431–2439. IEEE, 2019.

[40] S. Zheng, Z. Huang, and J. T. Kwok. Communication-efficient distributed blockwise momentum SGD with error-feedback. In Advances in Neural Information Processing Systems, 2019.