Proximal Gradient Methods

One approach for solving instances of Problem (1) is to use proximal gradient methods. The basic form of the proximal gradient iteration is

\[x_{k+1} = \arg\,\min_{x \in \mathbb{R}^{d}} \left\lbrace \operatorname{f}(x_{k}) + \left\langle \nabla\operatorname{f}(x_{k}), x - x_{k} \right\rangle + \operatorname{h}(x) + \frac{1}{2\gamma_{k}} {\left\lVert x - x_{k} \right\rVert}_{2}^{2} \right\rbrace \,, \tag{2}\]

where \(\gamma_{k}\) is the step size. Thus, the next iterate, \(x_{k+1}\), is selected to be the minimizer of the sum of three terms: the first-order approximation of the smooth loss function around the current iterate, \(x_{k}\); the nonsmooth loss function; and a quadratic penalty on the deviation from the current iterate [2017-Beck]. After some algebraic manipulations, one can rewrite Equation (2) in terms of the proximal operator [2017-Beck]

\[\begin{split}x_{k+1} & = \arg\,\min_{x \in \mathbb{R}^{d}} \left\lbrace \gamma_{k} \operatorname{h}(x) + \frac{1}{2} {\left\lVert x - \left(x_{k} - \gamma_{k} \nabla\operatorname{f}(x_{k})\right) \right\rVert}_{2}^{2} \right\rbrace \\ & := \operatorname{prox}_{\gamma_{k}\operatorname{h}} \left(x_{k} - \gamma_{k} \nabla\operatorname{f}(x_{k})\right) \,.\end{split}\]
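Concretely, the manipulation amounts to completing the square in \(x\): dropping the constant term \(\operatorname{f}(x_{k})\) from Equation (2) and using

\[\left\langle \nabla\operatorname{f}(x_{k}), x - x_{k} \right\rangle + \frac{1}{2\gamma_{k}} {\left\lVert x - x_{k} \right\rVert}_{2}^{2} = \frac{1}{2\gamma_{k}} {\left\lVert x - \left(x_{k} - \gamma_{k} \nabla\operatorname{f}(x_{k})\right) \right\rVert}_{2}^{2} - \frac{\gamma_{k}}{2} {\left\lVert \nabla\operatorname{f}(x_{k}) \right\rVert}_{2}^{2}\]

leaves an objective whose last term is independent of \(x\); discarding that term and scaling what remains by \(\gamma_{k} > 0\) does not change the minimizer, which gives the prox form above.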

As a result, the method can be interpreted as a two-step procedure: first, a query point is computed by moving the current iterate in the direction of the negative gradient, and then the prox operator is applied to this query point.
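As a minimal sketch of this two-step view, the snippet below implements the iteration in Equation (2) for user-supplied callables; the names grad_f and prox_h are placeholders of our own choosing rather than the interface of any particular package.

```python
import numpy as np


def proximal_gradient(grad_f, prox_h, x0, step, num_iters=100):
    """Basic proximal gradient iteration with a constant step size.

    grad_f : callable, x -> gradient of the smooth part f at x
    prox_h : callable, (v, gamma) -> argmin_x { gamma * h(x) + 0.5 * ||x - v||_2^2 }
    x0     : initial iterate (array-like)
    step   : constant step size gamma > 0
    """
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(num_iters):
        query = x - step * grad_f(x)  # gradient step on the smooth part
        x = prox_h(query, step)       # prox step on the nonsmooth part
    return x
```

When \(\operatorname{h} \equiv 0\), the prox-mapping is the identity and the iteration reduces to plain gradient descent.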

Even though the proximal gradient method described in Equation (2) looks involved, in the sense that it requires solving an optimization problem at each iteration, the prox-mapping can actually be evaluated very efficiently for several important functions \(\operatorname{h}(\cdot)\) such as, for instance, projections onto affine sets, half-spaces, boxes, and \(\ell_{1}\)- and \(\ell_{2}\)-norm balls. Together with its strong theoretical convergence guarantees, this makes the proximal gradient method a favorable option in many applications. However, the gradient calculation step in the vanilla proximal gradient method can be prohibitively expensive when the number of component functions (\(N\)) or the dimension of the decision vector (\(d\)) is large. To improve the scalability of the proximal gradient method, researchers have proposed ways of parallelizing the proximal gradient computations and of constructing more clever query points than the simple gradient step in Equation (2). As a result, the proximal gradient family encompasses a large variety of algorithms. We have listed some of the more influential variants in Table 1 [1].
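Before moving on to the table, the snippet below makes the first point concrete with closed-form prox-mappings for two of the projections mentioned above, plus the widely used soft-thresholding operator for the \(\ell_{1}\)-norm; the formulas are standard [2017-Beck], and the function names are purely illustrative.

```python
import numpy as np


def prox_box(v, gamma, lower=-1.0, upper=1.0):
    """Prox of the indicator of the box [lower, upper]^d, i.e. projection onto the box.

    The step size gamma has no effect here because the indicator is either 0 or +inf.
    """
    return np.clip(v, lower, upper)


def prox_l2_ball(v, gamma, radius=1.0):
    """Prox of the indicator of the Euclidean ball of the given radius, i.e. projection onto it."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else (radius / norm) * v


def prox_l1(v, gamma, lam=1.0):
    """Prox of h(x) = lam * ||x||_1: elementwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - gamma * lam, 0.0)
```

Plugging prox_l1 into the proximal_gradient sketch above, with grad_f set to the gradient of a least-squares loss, recovers the classical ISTA iteration for the lasso.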

Table 1 Some members of the family of proximal gradient methods. Here, s, cr, ir, and ps stand for serial, consistent read/write, inconsistent read/write, and Parameter Server [2013-Li], respectively.

| Algorithm | boosting | smoothing | step | prox | execution |
|---|---|---|---|---|---|
| (S)GD | \(\times\) | \(\times\) | \(\gamma\), \(\gamma_{k}\) | \(\times\) | s, cr, ps |
| IAG [2007-Blatt] | aggregated | \(\times\) | \(\gamma\) | \(\times\) | s, cr, ps |
| PIAG [2016-Aytekin] | aggregated | \(\times\) | \(\gamma\) | \(\checkmark\) | s, cr, ps |
| SAGA [2014-Defazio] | saga | \(\times\) | \(\gamma\) | \(\checkmark\) | s |
| Heavyball [1964-Polyak] | classical | \(\times\) | \(\gamma\) | \(\times\) | s |
| Nesterov [1983-Nesterov] | nesterov | \(\times\) | \(\gamma\) | \(\times\) | s |
| AdaGrad [2011-Duchi] | \(\times\) | adagrad | \(\gamma\) | \(\times\) | s |
| AdaDelta [2012-Zeiler] | \(\times\) | adadelta | \(\gamma\) | \(\times\) | s |
| Adam [2014-Kingma] | classical | rmsprop | \(\gamma\) | \(\times\) | s |
| Nadam [2016-Dozat] | nesterov | rmsprop | \(\gamma\) | \(\times\) | s |
| AdaDelay [2015-Sra] | \(\times\) | \(\times\) | \(\gamma_{k}\) | \(\checkmark\) | s, cr, ps |
| HOGWILD! [2011-Recht] | \(\times\) | \(\times\) | \(\gamma\) | \(\times\) | ir |
| ASAGA [2016-Leblond] | saga | \(\times\) | \(\gamma\) | \(\times\) | ir |
| ProxASAGA [2017-Pedregosa] | saga | \(\times\) | \(\gamma\) | \(\checkmark\) | ir |
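To give a sense of how the boosting and prox columns combine, the sketch below adds Nesterov-type extrapolation (as in FISTA) on top of the basic prox-gradient step; it is our own illustration, not the implementation behind any particular entry in Table 1.

```python
import numpy as np


def accelerated_proximal_gradient(grad_f, prox_h, x0, step, num_iters=100):
    """FISTA-style Nesterov boosting on top of the basic prox-gradient step."""
    x = np.asarray(x0, dtype=float).copy()
    y = x.copy()  # extrapolated (boosted) query point
    t = 1.0       # momentum parameter
    for _ in range(num_iters):
        x_next = prox_h(y - step * grad_f(y), step)        # prox-gradient step at the query point
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))  # momentum update
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)   # Nesterov extrapolation (boosting)
        x, t = x_next, t_next
    return x
```

With prox_h equal to the identity, this reduces to Nesterov's accelerated gradient method on the smooth part alone.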

In the rest of the chapter, we will apply several preconfigured algorithms from Table 1 to two problems: Logistic Regression and Least Squares Regression.

2017-Beck

Amir Beck. First-Order Methods in Optimization (MOS-SIAM Series on Optimization). Society for Industrial and Applied Mathematics (SIAM), 2017. ISBN: 978-1-611974-98-0.

2013-Li

Mu Li et al. “Parameter Server for Distributed Machine Learning.” Big Learning Workshop, Advances in Neural Information Processing Systems 26 (NIPS). 2013. URL: http://web.archive.org/web/20160304101521/http://www.biglearn.org/2013/files/papers/biglearning2013_submission_2.pdf

2007-Blatt

Doron Blatt, Alfred O. Hero, and Hillel Gauchman. “A Convergent Incremental Gradient Method with a Constant Step Size.” SIAM Journal on Optimization 18.1 (Jan. 2007), pp. 29-51. DOI: 10.1137/040615961.

2016-Aytekin

Arda Aytekin, Hamid R. Feyzmahdavian, and Mikael Johansson. “Analysis and Implementation of an Asynchronous Optimization Algorithm for the Parameter Server.” (Oct. 2016). arXiv: 1610.05507.

2014-Defazio

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. “SAGA: A Fast Incremental Gradient Method with Support for Non-Strongly Convex Composite Objectives.” Advances in Neural Information Processing Systems 27 (NIPS). Curran Associates, Inc., 2014, pp. 1646-1654. URL: http://papers.nips.cc/paper/5258-saga-a-fast-incremental-gradient-method-with-support-for-non-strongly-convex-composite-objectives.pdf

2015-Sra

Suvrit Sra et al. “AdaDelay: Delay Adaptive Distributed Stochastic Convex Optimization.” (Aug. 2015). arXiv: 1508.05003.

2011-Recht

Benjamin Recht et al. “HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent.” Advances in Neural Information Processing Systems 24 (NIPS). Curran Associates, Inc., 2011, pp. 693-701. URL: http://papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent.pdf

2016-Leblond

Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien. “Asaga: Asynchronous Parallel Saga.” (June 2016). arXiv: 1606.04809.

2017-Pedregosa

Fabian Pedregosa, Rémi Leblond, and Simon Lacoste-Julien. “Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization.” Advances in Neural Information Processing Systems 30 (NIPS). Curran Associates, Inc., 2017, pp. 56-65. URL: http://papers.nips.cc/paper/6611-breaking-the-nonsmooth-barrier-a-scalable-parallel-method-for-composite-optimization.pdf

Footnotes

[1] The meanings of boosting, smoothing, step, prox, and execution will become clear in Proximal Gradient Methods.