Finite-Sum Optimization: Adaptivity to Smoothness and Loopless Variance Reduction

Authors: Bastien Batardière, Joon Kwon

Abstract:

For finite-sum optimization, variance-reduced (VR) gradient methods compute at each iteration the gradient of a single function (or of a mini-batch), yet achieve faster convergence than SGD thanks to a carefully crafted lower-variance stochastic gradient estimator that reuses past gradients. Another important line of research of the past decade in continuous optimization is adaptive algorithms such as AdaGrad, which dynamically adjust the (possibly coordinate-wise) learning rate based on past gradients and thereby adapt to the geometry of the objective function. Variants such as RMSprop and Adam demonstrate outstanding practical performance that has contributed to the success of deep learning. In this work, we present AdaLVR, which combines the AdaGrad algorithm with loopless variance-reduced gradient estimators such as SAGA or L-SVRG, and which benefits from a straightforward construction and a streamlined analysis. We show that AdaLVR inherits both the good convergence properties of VR methods and the adaptive nature of AdaGrad: for L-smooth convex functions we establish a gradient complexity of O(n + (L + √nL)/ε) without prior knowledge of L. Numerical experiments demonstrate the superiority of AdaLVR over state-of-the-art methods. Moreover, we empirically show that RMSprop and Adam combined with variance-reduced gradient estimators achieve even faster convergence.

Keywords: Convex optimization, variance reduction, adaptive algorithms, loopless.
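
To make the two ingredients of the abstract concrete, the sketch below pairs coordinate-wise AdaGrad step sizes with an L-SVRG-style loopless variance-reduced gradient estimator. It is only a minimal illustration under our own assumptions (the oracles `grad_i` and `full_grad`, the snapshot-refresh probability `p`, and the step-size constant `lr` are illustrative choices), not the authors' exact AdaLVR algorithm.

```python
import numpy as np

def adagrad_loopless_vr(grad_i, full_grad, x0, n, lr=1.0, p=None,
                        eps=1e-8, iters=1000, rng=None):
    """Sketch: coordinate-wise AdaGrad steps driven by an L-SVRG-style
    loopless variance-reduced gradient estimator (illustrative only).

    grad_i(x, i): gradient of the i-th component function at x.
    full_grad(x): full gradient (1/n) * sum_i grad_i(x, i).
    """
    rng = np.random.default_rng() if rng is None else rng
    p = 1.0 / n if p is None else p           # snapshot-refresh probability
    x = np.asarray(x0, dtype=float).copy()
    w = x.copy()                               # snapshot point
    mu = full_grad(w)                          # full gradient at the snapshot
    s = np.zeros_like(x)                       # AdaGrad accumulator of squared estimates
    for _ in range(iters):
        i = rng.integers(n)
        # Loopless SVRG estimator: unbiased, with lower variance than plain SGD.
        g = grad_i(x, i) - grad_i(w, i) + mu
        s += g * g
        x -= lr * g / (np.sqrt(s) + eps)       # coordinate-wise AdaGrad step
        # With probability p, refresh the snapshot; no outer loop is needed.
        if rng.random() < p:
            w = x.copy()
            mu = full_grad(w)
    return x
```

On a least-squares or logistic-regression finite sum, `grad_i` would return the gradient of one data point's loss and `full_grad` the average over all n points; the probabilistic snapshot refresh (with p of order 1/n) is what makes the estimator "loopless", in contrast to the explicit outer loop of the original SVRG.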

