
Large-scale Stochastic Optimization
11-741/641/441 (Spring 2016)

Hanxiao Liu
hanxiaol@cs.cmu.edu

March 24, 2016

1 / 22


Outline

1. Gradient Descent (GD)

2. Stochastic Gradient Descent (SGD)
   - Formulation
   - Comparisons with GD

3. Useful large-scale SGD solvers
   - Support Vector Machines
   - Matrix Factorization

4. Random topics
   - Variance reduction
   - Implementation trick
   - Other variants

2 / 22


Risk Minimization

$\{(x_i, y_i)\}_{i=1}^n$: training data $\overset{\text{i.i.d.}}{\sim} D$.

$$\min_f \; \mathbb{E}_{(x,y)\sim D}\, \underbrace{\ell(f(x), y)}_{\text{True risk}} \;\Longrightarrow\; \min_f \; \underbrace{\frac{1}{n}\sum_{i=1}^n \ell(f(x_i), y_i)}_{\text{Empirical risk}} \tag{1}$$

$$\Longrightarrow\; \min_w \; \frac{1}{n}\sum_{i=1}^n \ell(f_w(x_i), y_i) \tag{2}$$

Loss $\ell(f_w(x_i), y_i)$ by algorithm:

- Logistic Regression: $\ln\big(1 + e^{-y_i w^\top x_i}\big) + \frac{\lambda}{2}\|w\|^2$
- SVMs: $\max\big(0,\, 1 - y_i w^\top x_i\big) + \frac{\lambda}{2}\|w\|^2$

3 / 22
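To make (1)-(2) concrete, here is a minimal NumPy sketch (not from the slides) that evaluates the two regularized losses listed above; the array names and the toy data at the bottom are illustrative assumptions.

```python
import numpy as np

def logistic_loss(w, X, y, lam):
    """Empirical risk: (1/n) sum_i ln(1 + exp(-y_i w^T x_i)) + (lam/2)||w||^2."""
    margins = y * (X @ w)                              # y_i * w^T x_i for every example
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * lam * (w @ w)

def hinge_loss(w, X, y, lam):
    """Empirical risk: (1/n) sum_i max(0, 1 - y_i w^T x_i) + (lam/2)||w||^2."""
    margins = y * (X @ w)
    return np.mean(np.maximum(0.0, 1.0 - margins)) + 0.5 * lam * (w @ w)

# Toy usage: 100 examples, 5 features, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.where(rng.normal(size=100) > 0, 1.0, -1.0)
w = np.zeros(5)
print(logistic_loss(w, X, y, lam=0.01), hinge_loss(w, X, y, lam=0.01))
```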


Gradient Descent (GD)

Define $\ell_i(w) \overset{\text{def}}{=} \ell(f_w(x_i), y_i)$ and $\ell(w) \overset{\text{def}}{=} \frac{1}{n}\sum_{i=1}^n \ell_i(w)$.

Training objective:
$$\min_w \; \ell(w) \tag{3}$$

Gradient update: $w^{(k)} = w^{(k-1)} - \eta_k \nabla \ell\big(w^{(k-1)}\big)$

- $\eta_k$: pre-specified or determined via backtracking
- Can be easily generalized to handle nonsmooth losses:
  1. Gradient → Subgradient
  2. Proximal gradient method (for structured $\ell(w)$)

Question of interest: How fast does GD converge?

4 / 22
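A compact sketch (my own, not from the slides) of the full-batch update $w^{(k)} = w^{(k-1)} - \eta_k \nabla\ell(w^{(k-1)})$, instantiated for the regularized logistic loss above; the fixed step size is an illustrative choice rather than a backtracking rule.

```python
import numpy as np

def grad_logistic(w, X, y, lam):
    """Gradient of (1/n) sum_i ln(1 + exp(-y_i w^T x_i)) + (lam/2)||w||^2."""
    margins = y * (X @ w)
    coeff = -y / (1.0 + np.exp(margins))   # d/dm ln(1 + e^{-m}) = -1/(1 + e^{m}), times y_i
    return X.T @ coeff / len(y) + lam * w

def gradient_descent(X, y, lam=0.01, eta=0.1, iters=100):
    """Full-batch GD: every iteration touches all n examples (O(n) per step)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = w - eta * grad_logistic(w, X, y, lam)
    return w
```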


Convergence

Theorem (GD convergence). If $\ell$ is both convex and differentiable,¹

$$\ell\big(w^{(k)}\big) - \ell(w^*) \le
\begin{cases}
\dfrac{\|w^{(0)} - w^*\|_2^2}{2\eta k} = O\!\left(\dfrac{1}{k}\right) & \text{in general} \\[2ex]
\dfrac{c^k L \|w^{(0)} - w^*\|_2^2}{2} = O\big(c^k\big) & \text{if } \ell \text{ is strongly convex}
\end{cases}
\tag{4}$$

To achieve $\ell\big(w^{(k)}\big) - \ell(w^*) \le \rho$, GD needs $O\!\left(\frac{1}{\rho}\right)$ iterations in general, and $O\!\left(\log\frac{1}{\rho}\right)$ iterations with strong convexity.

¹ The step size $\eta$ must be no larger than $\frac{1}{L}$, where $L$ is the Lipschitz constant satisfying $\|\nabla\ell(a) - \nabla\ell(b)\|_2 \le L\|a - b\|_2$ for all $a, b$.

5 / 22
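As a rough illustration of the gap between the two rates (constants suppressed; the target $\rho$ below is an arbitrary choice, not from the slides):

```latex
% iterations needed for accuracy rho = 10^{-6}, up to constants
\rho = 10^{-6}: \qquad
\underbrace{\tfrac{1}{\rho} = 10^{6}}_{\text{general convex}}
\quad \text{vs.} \quad
\underbrace{\log\tfrac{1}{\rho} = 6\ln 10 \approx 14}_{\text{strongly convex}}
```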


GD Efficiency

Why are we not happy with GD?

- Fast convergence ≠ high efficiency.

$$\begin{aligned}
w^{(k)} &= w^{(k-1)} - \eta_k \nabla\ell\big(w^{(k-1)}\big) && (5) \\
&= w^{(k-1)} - \eta_k \nabla\!\left[\frac{1}{n}\sum_{i=1}^n \ell_i\big(w^{(k-1)}\big)\right] && (6)
\end{aligned}$$

- Per-iteration complexity = $O(n)$ (extremely large)
  - A single cycle through all the data may take forever.
- Cheaper GD? → SGD

6 / 22


Stochastic Gradient Descent

Approximate the full gradient via an unbiased estimator:

$$\begin{aligned}
w^{(k)} &= w^{(k-1)} - \eta_k \nabla\!\left(\frac{1}{n}\sum_{i=1}^n \ell_i\big(w^{(k-1)}\big)\right) && (7) \\
&\approx \underbrace{w^{(k-1)} - \eta_k \nabla\!\left(\frac{1}{|B|}\sum_{i\in B} \ell_i\big(w^{(k-1)}\big)\right)}_{\text{mini-batch SGD}^2}, \quad B \overset{\text{unif}}{\sim} \{1, 2, \ldots, n\} && (8) \\
&\approx \underbrace{w^{(k-1)} - \eta_k \nabla\ell_i\big(w^{(k-1)}\big)}_{\text{pure SGD}}, \quad i \overset{\text{unif}}{\sim} \{1, 2, \ldots, n\} && (9)
\end{aligned}$$

Trade-off: lower computation cost vs. larger variance.

² When using a GPU, |B| usually depends on the memory budget.

7 / 22
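A minimal sketch (not from the slides) of the mini-batch update (8); `grad_i` is assumed to be any per-batch average gradient such as `grad_logistic` from the earlier sketch, `batch_size=1` recovers pure SGD, and the 1/√k step-size decay is just one common illustrative choice.

```python
import numpy as np

def sgd(X, y, grad_i, eta0=0.1, lam=0.01, batch_size=1, epochs=5, seed=0):
    """Mini-batch SGD sketch: each step uses |B| examples instead of all n.
    grad_i(w, X_B, y_B, lam) returns the average gradient over the sampled batch."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    k = 0
    for _ in range(epochs):
        for _ in range(n // batch_size):
            k += 1
            B = rng.integers(0, n, size=batch_size)   # uniform sample (with replacement)
            eta = eta0 / np.sqrt(k)                    # decaying step size (illustrative)
            w -= eta * grad_i(w, X[B], y[B], lam)
    return w
```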


GD vs. SGD

For strongly convex $\ell(w)$, according to [Bottou, 2012]:

| Cost | GD | SGD | Winner |
|---|---|---|---|
| Time per iteration | $O(n)$ | $O(1)$ | SGD |
| Iterations to accuracy $\rho$ | $O\!\left(\log\frac{1}{\rho}\right)$ | $O\!\left(\frac{1}{\rho}\right)$ | GD |
| Time to accuracy $\rho$ | $O\!\left(n \log\frac{1}{\rho}\right)$ | $O\!\left(\frac{1}{\rho}\right)$ | Depends |
| Time to "generalization error" $\varepsilon$ | $O\!\left(\frac{1}{\varepsilon^{1/\alpha}} \log\frac{1}{\varepsilon}\right)$ | $O\!\left(\frac{1}{\varepsilon}\right)$ | SGD |

where $\frac{1}{2} \le \alpha \le 1$.

8 / 22
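Why "Depends" for the time to accuracy $\rho$: a back-of-the-envelope comparison (my own numbers, constants dropped) with $n = 10^6$:

```latex
\rho = 10^{-4}: \; n\ln\tfrac{1}{\rho} \approx 10^{7} \;\gg\; \tfrac{1}{\rho} = 10^{4} \quad (\text{SGD wins})
\qquad
\rho = 10^{-9}: \; n\ln\tfrac{1}{\rho} \approx 2\times 10^{7} \;\ll\; \tfrac{1}{\rho} = 10^{9} \quad (\text{GD wins})
```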


SVMs Solver: Pegasos

[Shalev-Shwartz et al., 2011]

Recall

$$\begin{aligned}
\ell_i(w) &= \max\big(0,\, 1 - y_i w^\top x_i\big) + \frac{\lambda}{2}\|w\|^2 && (10) \\
&= \begin{cases}
\frac{\lambda}{2}\|w\|^2 & y_i w^\top x_i \ge 1 \\[0.5ex]
1 - y_i w^\top x_i + \frac{\lambda}{2}\|w\|^2 & y_i w^\top x_i < 1
\end{cases} && (11)
\end{aligned}$$

Therefore

$$\nabla\ell_i(w) = \begin{cases}
\lambda w & y_i w^\top x_i \ge 1 \\[0.5ex]
\lambda w - y_i x_i & y_i w^\top x_i < 1
\end{cases} \tag{12}$$

9 / 22


SVMs in 10 Lines

Algorithm 1: Pegasos: SGD solver for SVMs

Input: n, λ, T
Initialization: w^(1) ← 0
for k = 1, 2, ..., T do
    i ~unif~ {1, 2, ..., n}
    η_k ← 1 / (λk)
    if y_i w^(k)ᵀ x_i < 1 then
        w^(k+1) ← w^(k) − η_k (λ w^(k) − y_i x_i)
    else
        w^(k+1) ← w^(k) − η_k λ w^(k)
    end
end
Output: w^(T+1)

10 / 22
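A direct NumPy transcription of Algorithm 1 (a sketch; the data layout and default hyperparameters are my own assumptions):

```python
import numpy as np

def pegasos(X, y, lam=0.01, T=10_000, seed=0):
    """Pegasos (Algorithm 1): SGD on the regularized hinge loss.
    X: (n, d) feature matrix; y: labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for k in range(1, T + 1):
        i = rng.integers(n)                # i ~ Uniform{1, ..., n}
        eta = 1.0 / (lam * k)              # step size eta_k = 1 / (lambda * k)
        if y[i] * (w @ X[i]) < 1:          # hinge active: subgradient lam*w - y_i*x_i
            w = w - eta * (lam * w - y[i] * X[i])
        else:                              # hinge inactive: only the regularizer shrinks w
            w = w - eta * lam * w
    return w
```

Note that since $\eta_k \lambda = 1/k$, each step can be rewritten as a scalar shrink $(1 - 1/k)\, w^{(k)}$ plus an optional sparse correction $\eta_k y_i x_i$, which is exactly the structure exploited by the implementation trick later in these slides.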


Empirical Comparisons

SGD vs. batch solvers³ on RCV1 (47,152 features; 781,265 training examples):

| Algorithm | Time (secs) | Primal Obj | Test Error |
|---|---|---|---|
| SMO (SVMlight) | ≈ 16,000 | 0.2275 | 6.02% |
| Cutting Plane (SVMperf) | ≈ 45 | 0.2275 | 6.02% |
| SGD | < 1 | 0.2275 | 6.02% |

Where is the magic?

³ http://leon.bottou.org/projects/sgd

11 / 22


The Magic

- SGD takes a long time to solve the problem accurately.
- There is no need to solve the problem super accurately in order to get good generalization ability.

³ http://leon.bottou.org/slides/largescale/lstut.pdf

12 / 22


SGD for Matrix Factorization

The idea of SGD can be trivially extended to MF:⁴

$$\ell(U, V) = \frac{1}{|O|} \sum_{(a,b)\in O} \underbrace{\ell_{a,b}(u_a, v_b)}_{\text{e.g. } (r_{ab} - u_a^\top v_b)^2} \tag{13}$$

SGD update rule: for each observed user-item pair $(a, b) \sim O$,

$$u_a^{(k)} = u_a^{(k-1)} - \eta_k \nabla \ell_{a,b}\big(u_a^{(k-1)}\big) \tag{14}$$

$$v_b^{(k)} = v_b^{(k-1)} - \eta_k \nabla \ell_{a,b}\big(v_b^{(k-1)}\big) \tag{15}$$

This is the building block for distributed SGD for MF.

⁴ We omit the regularization term for simplicity.

13 / 22
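A minimal sketch of updates (14)-(15) for the squared loss $(r_{ab} - u_a^\top v_b)^2$; regularization is omitted as in the slide, and the data format (a list of (a, b, r_ab) triples) is an assumption.

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, rank=10, eta=0.01, epochs=10, seed=0):
    """SGD for matrix factorization with squared loss (regularization omitted).
    ratings: iterable of observed triples (a, b, r_ab)."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.normal(size=(n_users, rank))   # user factors u_a
    V = 0.1 * rng.normal(size=(n_items, rank))   # item factors v_b
    for _ in range(epochs):
        for a, b, r in ratings:
            err = r - U[a] @ V[b]                # residual r_ab - u_a^T v_b
            # gradient wrt u_a is -2*err*v_b, wrt v_b is -2*err*u_a
            ua_new = U[a] + eta * 2 * err * V[b]
            V[b] += eta * 2 * err * U[a]         # uses the old U[a], as in (14)-(15)
            U[a] = ua_new
    return U, V
```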


Empirical Comparisons

On Netflix [Gemulla et al., 2011]. (Figure omitted; methods compared:)

- DSGD: Distributed SGD
- ALS: Alternating least squares (one of the state-of-the-art batch solvers)
- DGD: Distributed GD

14 / 22


SGD Revisited

Can we do even better?

Bottleneck of SGD: high variance in $\nabla\ell_i(w)$

- Less effective gradient steps
- The existence of variance $\Longrightarrow$ $\lim_{k\to\infty}\eta_k = 0$ is needed for convergence $\Longrightarrow$ slower progress

Variance reduction: SVRG [Johnson and Zhang, 2013], SAG, SDCA, ...

15 / 22


Stochastic Variance Reduced Gradient

$\tilde w$ - a snapshot of $w$ (to be updated every few cycles)

$\mu$ - $\frac{1}{n}\sum_{i=1}^n \nabla\ell_i(\tilde w)$

Key idea - use $\nabla\ell_i(\tilde w)$ to "cancel" the volatility in $\nabla\ell_i(w)$:

$$\begin{aligned}
\frac{1}{n}\sum_{i=1}^n \nabla\ell_i(w) &= \frac{1}{n}\sum_{i=1}^n \nabla\ell_i(w) - \mu + \mu && (16) \\
&= \frac{1}{n}\sum_{i=1}^n \nabla\ell_i(w) - \frac{1}{n}\sum_{i=1}^n \nabla\ell_i(\tilde w) + \mu && (17) \\
&\approx \nabla\ell_i(w) - \nabla\ell_i(\tilde w) + \mu, \quad i \sim \{1, 2, \ldots, n\} && (18)
\end{aligned}$$

A desirable property: $\nabla\ell_i\big(w^{(k)}\big) - \nabla\ell_i(\tilde w) + \mu \to 0$ (since both $w^{(k)}$ and the snapshot $\tilde w$ converge to $w^*$, where the full gradient vanishes).

16 / 22
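A sketch of the inner/outer loop implied by (18) (my own arrangement; the inner-loop length and step size are illustrative choices, following the common presentation of [Johnson and Zhang, 2013]):

```python
import numpy as np

def svrg(X, y, grad_i, lam=0.01, eta=0.1, outer=20, m=None, seed=0):
    """SVRG sketch: w_tilde is the snapshot, mu the full gradient at the snapshot.
    grad_i(w, x_i, y_i, lam) returns the gradient of a single loss term."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or 2 * n                       # inner-loop length (common heuristic)
    w_tilde = np.zeros(d)
    for _ in range(outer):
        # full gradient at the snapshot: mu = (1/n) sum_i grad_i(w_tilde)
        mu = np.mean([grad_i(w_tilde, X[i], y[i], lam) for i in range(n)], axis=0)
        w = w_tilde.copy()
        for _ in range(m):
            i = rng.integers(n)
            # variance-reduced direction: grad_i(w) - grad_i(w_tilde) + mu
            g = grad_i(w, X[i], y[i], lam) - grad_i(w_tilde, X[i], y[i], lam) + mu
            w -= eta * g
        w_tilde = w                      # refresh the snapshot every m steps
    return w_tilde
```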


Results

Image classification using neural networks [Johnson and Zhang, 2013]. (Figure omitted.)

17 / 22


Implementation trick

For $\ell_i(w) = \psi_i(w) + \frac{\lambda}{2}\|w\|^2$:

$$w^{(k+1)} \leftarrow \underbrace{(1 - \eta\lambda)\, w^{(k)}}_{\text{shrink } w} - \underbrace{\eta \nabla\psi_i\big(w^{(k)}\big)}_{\text{highly sparse}} \tag{19}$$

The shrinking operation takes $O(p)$ time ($p$ = dimensionality of $w$) – not happy.

Trick⁵: recast $w$ as $w = c \cdot w'$:

$$c^{(k+1)} \cdot w'^{(k+1)} \leftarrow \underbrace{(1 - \eta\lambda)\, c^{(k)}}_{\text{scalar update}} \cdot \underbrace{\left( w'^{(k)} - \frac{\eta \nabla\psi_i\big(c^{(k)} w'^{(k)}\big)}{(1 - \eta\lambda)\, c^{(k)}} \right)}_{\text{sparse update}} \tag{20}$$

More SGD tricks can be found in [Bottou, 2012].

⁵ http://blog.smola.org/post/940672544/fast-quadratic-regularization-for-online-learning

18 / 22
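A sketch of the scalar-scaling trick (20), assuming the per-example gradient $\nabla\psi_i$ touches only the nonzero features of a sparse $x_i$; the dict-based sparse gradient and variable names are illustrative.

```python
def lazy_l2_sgd_step(c, w_prime, sparse_grad, eta, lam):
    """One step of w = c * w' under l_i(w) = psi_i(w) + (lam/2)||w||^2.
    sparse_grad: dict {feature_index: partial derivative of psi_i at c*w'}.
    Only O(nnz) coordinates of w' are touched; the shrink becomes a scalar update."""
    c_new = (1.0 - eta * lam) * c            # scalar update: absorbs the O(p) shrink
    for j, g in sparse_grad.items():         # sparse update on the few active features
        w_prime[j] -= eta * g / c_new        # divide by (1 - eta*lam) * c, as in (20)
    return c_new, w_prime
```

At prediction time the actual weights are recovered as `c * w_prime`; if `c` drifts toward zero after many steps, the representation can occasionally be renormalized.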


Popular SGD Variants

A non-exhaustive list:

1. AdaGrad [Duchi et al., 2011]
2. Momentum [Rumelhart et al., 1988]
3. Nesterov's method [Nesterov et al., 1994]
4. AdaDelta: AdaGrad refined [Zeiler, 2012]
5. Rprop & RMSprop [Tieleman and Hinton, 2012]: ignoring the magnitude of the gradient

All are empirically found effective in solving nonconvex problems (e.g., deep neural nets); a sketch of two of these update rules follows below.

Demos⁶: Animation 0, 1, 2, 3

⁶ https://www.reddit.com/r/MachineLearning/comments/2gopfa/visualizing_gradient_optimization_techniques/cklhott

19 / 22
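A minimal sketch (not from the slides) of two of the listed variants, classical momentum and AdaGrad, written as stateful update rules; the hyperparameter values are illustrative.

```python
import numpy as np

def momentum_step(w, v, grad, eta=0.01, beta=0.9):
    """Classical momentum: v accumulates an exponentially weighted sum of gradients."""
    v = beta * v + grad
    return w - eta * v, v

def adagrad_step(w, G, grad, eta=0.1, eps=1e-8):
    """AdaGrad [Duchi et al., 2011]: per-coordinate step sizes shrink with the
    accumulated squared gradients G."""
    G = G + grad ** 2
    return w - eta * grad / (np.sqrt(G) + eps), G

# Usage: keep the state (v or G) across iterations, initialized to np.zeros_like(w).
```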


Summary

Today's talk:

1. GD - expensive, accurate gradient evaluation
2. SGD - cheap, noisy gradient evaluation
3. SGD-based solvers (SVMs, MF)
4. Variance reduction techniques

Remarks about SGD:

- extremely handy for large problems
- only one of many handy tools
  - alternatives: quasi-Newton (BFGS), coordinate descent, ADMM, CG, etc.
  - the best choice depends on the problem structure

20 / 22


Reference I

Bordes, A., Bottou, L., and Gallinari, P. (2009). SGD-QN: Careful quasi-Newton stochastic gradient descent. The Journal of Machine Learning Research, 10:1737–1754.

Bottou, L. (2012). Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pages 421–436. Springer.

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159.

Gemulla, R., Nijkamp, E., Haas, P. J., and Sismanis, Y. (2011). Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 69–77. ACM.

Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323.

21 / 22


Reference II

Nesterov, Y., Nemirovskii, A., and Ye, Y. (1994). Interior-point polynomial algorithms in convex programming, volume 13. SIAM.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1988). Learning representations by back-propagating errors. Cognitive Modeling, 5:3.

Shalev-Shwartz, S., Singer, Y., Srebro, N., and Cotter, A. (2011). Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30.

Tieleman, T. and Hinton, G. (2012). Lecture 6.5 - RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4.

Zeiler, M. D. (2012). Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.

Zhang, T. Modern optimization techniques for big data machine learning.

22 / 22