Large-scale Stochastic Optimization
11-741/641/441 (Spring 2016)
Hanxiao Liu ([email protected])
March 24, 2016
Outline

1. Gradient Descent (GD)
2. Stochastic Gradient Descent (SGD)
   - Formulation
   - Comparisons with GD
3. Useful large-scale SGD solvers
   - Support Vector Machines
   - Matrix Factorization
4. Random topics
   - Variance reduction
   - Implementation trick
   - Other variants
Risk Minimization

$\{(x_i, y_i)\}_{i=1}^n$: training data, drawn i.i.d. from $\mathcal{D}$.

$$\min_f \underbrace{\mathbb{E}_{(x,y)\sim\mathcal{D}}\,\ell(f(x), y)}_{\text{true risk}} \;\Longrightarrow\; \min_f \underbrace{\frac{1}{n}\sum_{i=1}^n \ell(f(x_i), y_i)}_{\text{empirical risk}} \qquad (1)$$

$$\Longrightarrow\; \min_w \frac{1}{n}\sum_{i=1}^n \ell(f_w(x_i), y_i) \qquad (2)$$

Algorithm              $\ell(f_w(x_i), y_i)$
Logistic Regression    $\ln(1 + e^{-y_i w^\top x_i}) + \frac{\lambda}{2}\|w\|^2$
SVMs                   $\max(0,\, 1 - y_i w^\top x_i) + \frac{\lambda}{2}\|w\|^2$
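The two losses in the table above can be evaluated in a few lines. Below is a minimal NumPy sketch of the regularized empirical risk (2); the function names are our own, not from the slides:

```python
import numpy as np

def logistic_risk(w, X, y, lam):
    """Empirical risk (2) with the logistic loss from the table:
    (1/n) sum ln(1 + exp(-y_i w^T x_i)) + (lam/2) ||w||^2."""
    margins = y * (X @ w)
    return np.mean(np.log1p(np.exp(-margins))) + 0.5 * lam * (w @ w)

def svm_risk(w, X, y, lam):
    """Empirical risk (2) with the hinge loss from the table:
    (1/n) sum max(0, 1 - y_i w^T x_i) + (lam/2) ||w||^2."""
    margins = y * (X @ w)
    return np.mean(np.maximum(0.0, 1.0 - margins)) + 0.5 * lam * (w @ w)
```

A quick sanity check: at $w = 0$ every margin is zero, so the logistic risk equals $\ln 2$ and the hinge risk equals $1$.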
Gradient Descent (GD)

Define $\ell_i(w) \stackrel{\text{def}}{=} \ell(f_w(x_i), y_i)$ and $\ell(w) \stackrel{\text{def}}{=} \frac{1}{n}\sum_{i=1}^n \ell_i(w)$.

Training objective:
$$\min_w \ell(w) \qquad (3)$$

Gradient update: $w^{(k)} = w^{(k-1)} - \eta_k \nabla\ell(w^{(k-1)})$

- $\eta_k$: pre-specified or determined via backtracking
- Can be easily generalized to handle nonsmooth losses:
  1. Gradient → Subgradient
  2. Proximal gradient method (for structured $\ell(w)$)

Question of interest: How fast does GD converge?
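As a concrete instance of the update rule above, here is our own sketch of full-batch GD on the regularized logistic loss from the earlier table; note that every step touches all $n$ examples:

```python
import numpy as np

def full_gradient(w, X, y, lam):
    """Gradient of (1/n) sum ln(1 + exp(-y_i w^T x_i)) + (lam/2)||w||^2."""
    margins = y * (X @ w)
    coef = -y / (1.0 + np.exp(margins))   # derivative of log-loss wrt margin
    return X.T @ coef / len(y) + lam * w

def gd(X, y, lam=0.1, eta=0.5, iters=200):
    """w^(k) = w^(k-1) - eta * grad l(w^(k-1)); each step costs O(n)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= eta * full_gradient(w, X, y, lam)
    return w
```

On a small separable problem this drives the gradient norm to nearly zero in a few hundred iterations; the $O(n)$ per-step cost is the complaint taken up below.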
Convergence

Theorem (GD convergence). If $\ell$ is both convex and differentiable,¹

$$\ell(w^{(k)}) - \ell(w^*) \le \begin{cases} \dfrac{\|w^{(0)} - w^*\|_2^2}{2\eta k} = O\!\left(\tfrac{1}{k}\right) & \text{in general} \\[2ex] \dfrac{c^k L \|w^{(0)} - w^*\|_2^2}{2} = O(c^k) & \text{if } \ell \text{ is strongly convex} \end{cases} \qquad (4)$$

To achieve $\ell(w^{(k)}) - \ell(w^*) \le \rho$, GD needs $O(1/\rho)$ iterations in general, and $O(\log(1/\rho))$ iterations with strong convexity.

¹ The step size $\eta$ must be no larger than $\frac{1}{L}$, where $L$ is the Lipschitz constant satisfying $\|\nabla\ell(a) - \nabla\ell(b)\|_2 \le L\|a - b\|_2$ for all $a, b$.
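The geometric $O(c^k)$ rate in the strongly convex case is easy to observe numerically. Here is a toy example of our own (not from the slides): minimize $\ell(w) = \frac{1}{2} w^\top A w$ with step size $\frac{1}{L}$ and watch successive optimality gaps shrink by a constant factor:

```python
import numpy as np

# Toy strongly convex quadratic: l(w) = 0.5 * w^T A w, minimized at w* = 0.
A = np.diag([1.0, 10.0])   # L = 10 (largest eigenvalue), strong convexity m = 1
eta = 1.0 / 10.0           # step size 1/L, as footnote 1 requires
w = np.array([1.0, 1.0])
gaps = []
for k in range(50):
    gaps.append(0.5 * w @ A @ w)   # l(w^(k)) - l(w*), since l(w*) = 0
    w = w - eta * (A @ w)          # GD step
# Late-iteration ratios gap[k+1]/gap[k] settle at a constant c < 1.
ratios = [gaps[k + 1] / gaps[k] for k in range(40, 49)]
```

Here the slow coordinate contracts by $(1 - \eta \cdot 1) = 0.9$ per step, so the gap decays like $0.81^k$, matching the $O(c^k)$ bound in (4).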
GD Efficiency

Why aren't we happy with GD?

- Fast convergence ≠ high efficiency:

$$w^{(k)} = w^{(k-1)} - \eta_k \nabla\ell(w^{(k-1)}) \qquad (5)$$

$$\phantom{w^{(k)}} = w^{(k-1)} - \eta_k \nabla\left[\frac{1}{n}\sum_{i=1}^n \ell_i(w^{(k-1)})\right] \qquad (6)$$

- Per-iteration complexity is $O(n)$, which is extremely large: a single cycle through all the data may take forever.
- A cheaper GD? SGD.
Stochastic Gradient Descent

Approximate the full gradient via an unbiased estimator:

$$w^{(k)} = w^{(k-1)} - \eta_k \nabla\left(\frac{1}{n}\sum_{i=1}^n \ell_i(w^{(k-1)})\right) \qquad (7)$$

$$\approx w^{(k-1)} - \eta_k \nabla\left(\frac{1}{|B|}\sum_{i\in B} \ell_i(w^{(k-1)})\right), \quad B \overset{\text{unif}}{\sim} \{1, 2, \dots, n\} \qquad \text{(mini-batch SGD}^2\text{)} \quad (8)$$

$$\approx w^{(k-1)} - \eta_k \nabla\ell_i(w^{(k-1)}), \quad i \overset{\text{unif}}{\sim} \{1, 2, \dots, n\} \qquad \text{(pure SGD)} \quad (9)$$

Trade-off: lower computation cost vs. larger variance.

² When using a GPU, $|B|$ usually depends on the memory budget.
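The mini-batch update (8) can be sketched as follows, under our own toy interface where `grad_i(w, i)` returns $\nabla\ell_i(w)$ (the names are assumptions, not from the slides):

```python
import numpy as np

def minibatch_sgd(grad_i, n, dim, eta0=1.0, batch=5, iters=2000, seed=0):
    """Update (8): average the gradient over a uniformly sampled batch B,
    with a decaying step size eta_k = eta0 / sqrt(k)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    for k in range(1, iters + 1):
        B = rng.integers(0, n, size=batch)               # B ~ unif{1..n}
        g = np.mean([grad_i(w, i) for i in B], axis=0)   # batch gradient
        w -= (eta0 / np.sqrt(k)) * g
    return w
```

With `batch=1` this reduces to the pure SGD update (9); larger batches cost more per step but have lower variance, which is exactly the trade-off above.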
GD v.s. SGD

For strongly convex $\ell(w)$, according to [Bottou, 2012]:

Optimizer                           GD                     SGD       Winner
Time per iteration                  O(n)                   O(1)      SGD
Iterations to accuracy ρ            O(log(1/ρ))            O(1/ρ)    GD
Time to accuracy ρ                  O(n log(1/ρ))          O(1/ρ)    Depends
Time to "generalization error" ε    O(ε^(-1/α) log(1/ε))   O(1/ε)    SGD

where 1/2 ≤ α ≤ 1.
SVMs Solver: Pegasos

[Shalev-Shwartz et al., 2011]

Recall

$$\ell_i(w) = \max(0,\, 1 - y_i w^\top x_i) + \frac{\lambda}{2}\|w\|^2 \qquad (10)$$

$$= \begin{cases} \frac{\lambda}{2}\|w\|^2 & y_i w^\top x_i \ge 1 \\ 1 - y_i w^\top x_i + \frac{\lambda}{2}\|w\|^2 & y_i w^\top x_i < 1 \end{cases} \qquad (11)$$

Therefore

$$\nabla\ell_i(w) = \begin{cases} \lambda w & y_i w^\top x_i \ge 1 \\ \lambda w - y_i x_i & y_i w^\top x_i < 1 \end{cases} \qquad (12)$$
SVMs in 10 Lines

Algorithm 1: Pegasos: SGD solver for SVMs

Input: n, λ, T
Initialization: w ← 0
for k = 1, 2, ..., T do
    i ~ unif{1, 2, ..., n}
    η_k ← 1/(λk)
    if y_i w^(k)⊤ x_i < 1 then
        w^(k+1) ← w^(k) − η_k (λ w^(k) − y_i x_i)
    else
        w^(k+1) ← w^(k) − η_k λ w^(k)
    end
end
Output: w^(T+1)
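Algorithm 1 translates almost line for line into NumPy. This is our own sketch of Pegasos, not the authors' code:

```python
import numpy as np

def pegasos(X, y, lam, T, seed=0):
    """Pegasos (Algorithm 1): pure SGD on the SVM objective (10),
    using the subgradient (12) and step size eta_k = 1/(lam * k)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for k in range(1, T + 1):
        i = rng.integers(n)                    # i ~ unif{1, ..., n}
        eta = 1.0 / (lam * k)
        if y[i] * (X[i] @ w) < 1:              # hinge term active
            w = w - eta * (lam * w - y[i] * X[i])
        else:
            w = w - eta * (lam * w)
    return w
```

Note that the shrink factor $(1 - \eta_k\lambda) = \frac{k-1}{k}$, so the first iteration wipes out the initialization and later iterates are effectively weighted averages of the sampled subgradient corrections.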
Empirical Comparisons

SGD v.s. batch solvers³ on RCV1 (47,152 features; 781,265 training examples):

Algorithm                  Time (secs)    Primal Obj    Test Error
SMO (SVMlight)             ≈ 16,000       0.2275        6.02%
Cutting Plane (SVMperf)    ≈ 45           0.2275        6.02%
SGD                        < 1            0.2275        6.02%

Where is the magic?

³ http://leon.bottou.org/projects/sgd
The Magic

- SGD would take a long time to solve the optimization problem to high accuracy.
- But there is no need to solve the problem super accurately in order to get good generalization ability.³

³ http://leon.bottou.org/slides/largescale/lstut.pdf
SGD for Matrix Factorization

The idea of SGD can be trivially extended to MF:⁴

$$\ell(U, V) = \frac{1}{|O|}\sum_{(a,b)\in O} \underbrace{\ell_{a,b}(u_a, v_b)}_{\text{e.g. } (r_{ab} - u_a^\top v_b)^2} \qquad (13)$$

SGD updating rule: for each user-item pair $(a, b) \sim O$,

$$u_a^{(k)} = u_a^{(k-1)} - \eta_k \nabla\ell_{a,b}\big(u_a^{(k-1)}\big) \qquad (14)$$

$$v_b^{(k)} = v_b^{(k-1)} - \eta_k \nabla\ell_{a,b}\big(v_b^{(k-1)}\big) \qquad (15)$$

This is the building block for distributed SGD for MF.

⁴ We omit the regularization term for simplicity.
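Updates (14) and (15) can be sketched in code as follows; this is a minimal toy implementation with our own naming, and regularization is omitted as in the slide:

```python
import numpy as np

def mf_sgd(ratings, n_users, n_items, rank=2, eta=0.05, epochs=500, seed=0):
    """SGD for (13) with squared loss (r_ab - u_a^T v_b)^2: for each observed
    triple (a, b, r_ab), apply updates (14)-(15) to rows u_a and v_b."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, rank))
    V = 0.1 * rng.standard_normal((n_items, rank))
    for _ in range(epochs):
        for a, b, r in rng.permutation(np.asarray(ratings)):
            a, b = int(a), int(b)
            err = r - U[a] @ V[b]
            ua = U[a].copy()                 # u_a^(k-1), needed for the v update
            U[a] += eta * 2.0 * err * V[b]   # -eta * grad wrt u_a
            V[b] += eta * 2.0 * err * ua     # -eta * grad wrt v_b
    return U, V
```

Only the two touched rows change per update. This locality is what enables the distributed scheme of [Gemulla et al., 2011]: pairs whose users and items do not overlap can be updated in parallel.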
Empirical Comparisons

On Netflix [Gemulla et al., 2011]:

- DSGD: distributed SGD
- ALS: alternating least squares (one of the state-of-the-art batch solvers)
- DGD: distributed GD
SGD Revisited

Can we do even better?

Bottleneck of SGD: high variance in $\nabla\ell_i(w)$.

- Less effective gradient steps.
- The variance forces $\lim_{k\to\infty} \eta_k = 0$ for convergence, which means slower progress.

Variance reduction: SVRG [Johnson and Zhang, 2013], SAG, SDCA, ...
Stochastic Variance Reduced Gradient

- $\tilde w$: a snapshot of $w$ (updated every few cycles)
- $\tilde\mu \stackrel{\text{def}}{=} \frac{1}{n}\sum_{i=1}^n \nabla\ell_i(\tilde w)$

Key idea: use $\nabla\ell_i(\tilde w)$ to "cancel" the volatility in $\nabla\ell_i(w)$:

$$\frac{1}{n}\sum_{i=1}^n \nabla\ell_i(w) = \frac{1}{n}\sum_{i=1}^n \nabla\ell_i(w) - \tilde\mu + \tilde\mu \qquad (16)$$

$$= \frac{1}{n}\sum_{i=1}^n \nabla\ell_i(w) - \frac{1}{n}\sum_{i=1}^n \nabla\ell_i(\tilde w) + \tilde\mu \qquad (17)$$

$$\approx \nabla\ell_i(w) - \nabla\ell_i(\tilde w) + \tilde\mu, \quad i \sim \{1, 2, \dots, n\} \qquad (18)$$

A desirable property: $\nabla\ell_i\big(w^{(k)}\big) - \nabla\ell_i(\tilde w) + \tilde\mu \to 0$ as $w^{(k)}$ and $\tilde w$ approach $w^*$, so a constant step size can be used.
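A sketch of the resulting algorithm, again under our own toy interface where `grad(w, i)` returns $\nabla\ell_i(w)$; the inner loop uses the corrected estimator (18) with a constant step size:

```python
import numpy as np

def svrg(grad, n, dim, eta=0.1, epochs=50, m=20, seed=0):
    """Each epoch: take a snapshot w_tilde, recompute the full gradient mu
    at the snapshot, then run m stochastic steps with estimator (18)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    for _ in range(epochs):
        w_tilde = w.copy()                                    # snapshot
        mu = np.mean([grad(w_tilde, i) for i in range(n)], axis=0)
        for _ in range(m):
            i = rng.integers(n)
            # variance-reduced gradient: unbiased, and -> 0 near the optimum
            w = w - eta * (grad(w, i) - grad(w_tilde, i) + mu)
    return w
```

The full gradient is recomputed only once per epoch, so the per-step cost stays O(1) while the vanishing variance allows a constant step size.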
Results

Image classification using neural networks [Johnson and Zhang, 2013].
Implementation trick

For $\ell_i(w) = \psi_i(w) + \frac{\lambda}{2}\|w\|^2$:

$$w^{(k+1)} \leftarrow \underbrace{(1 - \eta\lambda)\, w^{(k)}}_{\text{shrink } w} - \underbrace{\eta\nabla\psi_i\big(w^{(k)}\big)}_{\text{highly sparse}} \qquad (19)$$

The shrinking operation alone takes $O(p)$ per update, which is not happy. Trick⁵: recast $w$ as $w = c \cdot w'$:

$$c^{(k+1)} \cdot w'^{(k+1)} \leftarrow \underbrace{(1 - \eta\lambda)\, c^{(k)}}_{\text{scalar update}} \cdot \underbrace{\left( w'^{(k)} - \frac{\eta\nabla\psi_i\big(c^{(k)} w'^{(k)}\big)}{(1 - \eta\lambda)\, c^{(k)}} \right)}_{\text{sparse update}} \qquad (20)$$

More SGD tricks can be found in [Bottou, 2012].

⁵ http://blog.smola.org/post/940672544/fast-quadratic-regularization-for-online-learning
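Trick (20) can be sketched with a small wrapper class (the class and method names are ours). Storing $w$ as $c \cdot w'$ turns the dense $O(p)$ shrink into one scalar multiplication, so each update only touches the nonzero gradient coordinates:

```python
import numpy as np

class ScaledVector:
    """Represent w = c * w_prime so the shrink in (19) is a scalar update."""

    def __init__(self, p):
        self.c = 1.0
        self.w = np.zeros(p)

    def step(self, eta, lam, idx, grad_vals):
        """One update w <- (1 - eta*lam)*w - eta*g, where g is zero outside
        idx.  Cost: O(nnz(g)) instead of O(p)."""
        self.c *= (1.0 - eta * lam)                 # scalar shrink
        self.w[idx] -= eta * grad_vals / self.c     # sparse correction

    def dense(self):
        return self.c * self.w
```

Expanding $c^{(k+1)} w'^{(k+1)}$ shows this reproduces (19) exactly; in practice one also renormalizes occasionally so that $c$ does not underflow.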
Popular SGD Variants

A non-exhaustive list:

1. AdaGrad [Duchi et al., 2011]
2. Momentum [Rumelhart et al., 1988]
3. Nesterov's method [Nesterov et al., 1994]
4. AdaDelta: AdaGrad refined [Zeiler, 2012]
5. Rprop & RMSprop [Tieleman and Hinton, 2012]: ignoring the magnitude of the gradient

All are empirically found effective in solving nonconvex problems (e.g., deep neural nets).

Demos⁶: Animation 0, 1, 2, 3

⁶ https://www.reddit.com/r/MachineLearning/comments/2gopfa/visualizing_gradient_optimization_techniques/cklhott
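To give a flavor, here are minimal sketches of two of these update rules, momentum and AdaGrad; these are illustrative versions written by us, not the reference implementations:

```python
import numpy as np

def momentum_step(w, v, g, eta=0.1, beta=0.9):
    """Momentum: v is an exponentially decaying accumulation of past steps."""
    v = beta * v - eta * g
    return w + v, v

def adagrad_step(w, h, g, eta=0.1, eps=1e-8):
    """AdaGrad: h accumulates squared gradients, giving each coordinate its
    own (shrinking) effective step size."""
    h = h + g * g
    return w - eta * g / (np.sqrt(h) + eps), h
```

RMSprop replaces AdaGrad's ever-growing sum `h` with an exponential moving average, so the effective step size does not vanish over time.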
Summary

Today's talk:

1. GD: expensive, accurate gradient evaluation
2. SGD: cheap, noisy gradient evaluation
3. SGD-based solvers (SVMs, MF)
4. Variance reduction techniques

Remarks about SGD:

- extremely handy for large problems
- only one of many handy tools
  - alternatives: quasi-Newton (BFGS), coordinate descent, ADMM, CG, etc.
  - the right choice depends on the problem structure
Reference I

Bordes, A., Bottou, L., and Gallinari, P. (2009). SGD-QN: careful quasi-Newton stochastic gradient descent. The Journal of Machine Learning Research, 10:1737–1754.

Bottou, L. (2012). Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pages 421–436. Springer.

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159.

Gemulla, R., Nijkamp, E., Haas, P. J., and Sismanis, Y. (2011). Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 69–77. ACM.

Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323.
Reference II

Nesterov, Y., Nemirovskii, A., and Ye, Y. (1994). Interior-Point Polynomial Algorithms in Convex Programming, volume 13. SIAM.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1988). Learning representations by back-propagating errors. Cognitive Modeling, 5:3.

Shalev-Shwartz, S., Singer, Y., Srebro, N., and Cotter, A. (2011). Pegasos: primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30.

Tieleman, T. and Hinton, G. (2012). Lecture 6.5 - RMSprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4.

Zeiler, M. D. (2012). ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.

Zhang, T. Modern optimization techniques for big data machine learning.