Lecture 5: Semi-Stochastic Methods

Transcript of richtarik/docs/SOCN-Lec5.pdf (2015-11-18)

Page 1:

Lecture 5: Semi-Stochastic Methods

Peter Richtárik

Graduate School in Systems, Optimization, Control and Networks, Belgium 2015

Alan Turing Institute

Page 2:

Jakub Konecny and P.R. Semi-Stochastic Gradient Descent Methods. arXiv:1312.1666, 2013

Page 3:

Further Papers

Jakub Konecny, Zheng Qu and P.R. Semi-Stochastic Coordinate Descent. arXiv:1412.6293, 2014

Rie Johnson and Tong Zhang. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. Neural Information Processing Systems, 2013

Jakub Konecny, Jie Liu, P.R. and Martin Takac. Mini-Batch Semi-Stochastic Gradient Descent in the Proximal Setting. IEEE Journal of Selected Topics in Signal Processing, 2015

Page 4:

The Problem

Page 5:

Minimizing Average Loss

• Problems are often structured

• Arising in machine learning, signal processing, engineering, …

$$\min_{x\in \mathbb{R}^d} \left\{ F(x) = \frac{1}{n}\sum_{i=1}^n f_i(x) \right\}$$

n is big

Page 6:

Examples

• Linear regression (least squares)

• Logistic regression (classification)
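For concreteness, the standard loss forms (not spelled out on the slide), with examples $a_i \in \mathbb{R}^d$ and labels $b_i$:

$$f_i(x) = \tfrac{1}{2}(a_i^\top x - b_i)^2 \ \text{(least squares)}, \qquad f_i(x) = \log\left(1 + e^{-b_i a_i^\top x}\right) \ \text{(logistic, } b_i \in \{\pm 1\}\text{)}$$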

   

Page 7:

Assumptions

• L-smoothness (L is the Lipschitz constant of the gradients):

$$f_i(x+h) \leq f_i(x) + \langle \nabla f_i(x), h \rangle + \frac{L}{2}\|h\|^2$$

• Strong convexity (μ is the strong convexity constant):

$$F(x+h) \geq F(x) + \langle \nabla F(x), h \rangle + \frac{\mu}{2}\|h\|^2$$
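For example (a standard computation, not on the slide), for the least-squares loss the smoothness constant is explicit:

$$f_i(x) = \tfrac{1}{2}(a_i^\top x - b_i)^2 \ \Rightarrow\ \nabla^2 f_i(x) = a_i a_i^\top \ \Rightarrow\ L = \max_i \|a_i\|^2$$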

Page 8:

Applications

Page 9:

SPAM DETECTION

Page 10:

PageRank

Page 11:

FAKE DETECTION

Page 12:

Recommender Systems

Page 13:

Page 14:

GD vs SGD

Page 15:

http://madeincalifornia.blogspot.co.uk/2012/11/gradient-descent-algorithm.html

Page 16:

Gradient Descent (GD)

• Update rule:

$$x_{k+1} = x_k - \frac{1}{L}\nabla F(x_k)$$

• Complexity: O(nκ log(1/ε)) stochastic gradient evaluations, i.e., O(κ log(1/ε)) iterations (κ = L/μ) times n stochastic gradient evaluations per iteration

• Cost of a single iteration: n

Page 17:

Stochastic Gradient Descent (SGD)

• Update rule (h_k is the stepsize; i is chosen uniformly at random, which makes the direction unbiased):

$$x_{k+1} = x_k - h_k \nabla f_i(x_k)$$

$$\mathbb{E}[\nabla f_i(x)] = \nabla F(x)$$

• Complexity: O(1/ε) stochastic gradient evaluations

• Cost of a single iteration: 1
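A minimal Python sketch of this loop (illustrative only; grad_fi is an assumed callback returning the gradient of f_i at x):

import numpy as np

def sgd(grad_fi, x0, n, h, iters, seed=0):
    """Plain SGD: x_{k+1} = x_k - h * grad f_i(x_k), i uniform at random."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(iters):
        i = rng.integers(n)        # uniform index => unbiased direction
        x -= h * grad_fi(i, x)     # one cheap stochastic step
    return x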

Page 18:

Dream…

GD: fast convergence, but expensive iterations

SGD: cheap iterations, but slow convergence

Combine the good stuff in a single algorithm

Page 19:

S2GD: Semi-Stochastic Gradient Descent

Page 20:

Intuition

The gradient does not change drastically from one iterate to the next… Can we recycle older information?

Page 21:

Gradient Approximation

Start from the identity

$$\nabla F(x) = \nabla F(x) - \nabla F(\tilde{x}) + \nabla F(\tilde{x})$$

For x ≈ x̃, the gradient change ∇F(x) − ∇F(x̃) can be estimated from a single example, while ∇F(x̃) is an already computed gradient:

$$\nabla F(x) \approx \nabla f_i(x) - \nabla f_i(\tilde{x}) + \nabla F(\tilde{x})$$
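In code the estimator is one line; a Python sketch (names are my own; grad_fi(i, x) is assumed to return the gradient of f_i at x):

def s2gd_direction(grad_fi, i, x, x_tilde, g_tilde):
    """Variance-reduced estimate of grad F(x).

    g_tilde = grad F(x_tilde) is the stored full gradient at the reference
    point; the stochastic difference estimates grad F(x) - grad F(x_tilde).
    """
    return grad_fi(i, x) - grad_fi(i, x_tilde) + g_tilde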

Page 22:

The S2GD Algorithm

In each outer iteration, compute and store the full gradient ∇F(x̃_j); in the inner loop, take cheap steps along the estimate

$$\nabla F(x) \approx \nabla f_i(x) - \nabla f_i(\tilde{x}) + \nabla F(\tilde{x})$$

Simplification: the size of the inner loop (m) is random in theory, following a geometric rule.
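A runnable Python sketch of the double-loop scheme, based on my reading of arXiv:1312.1666 (parameter names and the exact geometric law for the inner-loop length are assumptions, not the author's code):

import numpy as np

def s2gd(grad_fi, grad_F, x0, n, h, m, nu, outer_iters, seed=0):
    """Semi-Stochastic Gradient Descent (sketch).

    grad_fi(i, x): gradient of f_i at x; grad_F(x): full gradient.
    h: stepsize; m: maximum inner-loop size; nu: lower bound on mu
    used in the geometric law for the inner-loop length.
    """
    rng = np.random.default_rng(seed)
    x_tilde = x0.copy()
    # P(t = tau) proportional to (1 - nu*h)^(m - tau), tau = 1, ..., m
    probs = (1 - nu * h) ** (m - np.arange(1, m + 1))
    probs /= probs.sum()
    for _ in range(outer_iters):
        g_tilde = grad_F(x_tilde)              # one full pass over the data
        t = rng.choice(np.arange(1, m + 1), p=probs)
        x = x_tilde.copy()
        for _ in range(int(t)):                # cheap variance-reduced steps
            i = rng.integers(n)
            x -= h * (grad_fi(i, x) - grad_fi(i, x_tilde) + g_tilde)
        x_tilde = x
    return x_tilde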

Page 23:

Complexity  

Page 24:

Theorem: Convergence Rate

$$\mathbb{E}\left[\frac{F(\tilde{x}_j)-F(x_*)}{F(\tilde{x}_0)-F(x_*)}\right] \leq c^j$$

How to set the parameters h (stepsize) and m (# of inner iterations)?

• One term of c can be made arbitrarily small by decreasing the stepsize h

• For any fixed h, the other term of c can be made arbitrarily small by increasing m

Page 25:

Setting the Parameters

Parameters: # of outer iterations (j), stepsize (h), # of inner iterations (m)

Target error tolerance:

$$\mathbb{E}\left[\frac{F(\tilde{x}_j)-F(x_*)}{F(\tilde{x}_0)-F(x_*)}\right] \leq \epsilon$$

This is achieved by a suitable setting of j, h and m.

Total complexity (# stochastic gradient evaluations): # outer iters × (one full gradient evaluation, costing n, plus m inner iterations)

Page 26:

Complexity of GD vs S2GD

Writing complexity as (cost of 1 iteration) × (# iterations), with κ = L/μ:

• GD complexity: O(nκ log(1/ε))

• S2GD complexity: O((n + κ) log(1/ε))
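To see the scale (an illustrative calculation, not from the slides): with n = 10^5 examples and condition number κ = 10^3, GD pays nκ ≈ 10^8 stochastic gradients per accuracy digit while S2GD pays n + κ ≈ 10^5, a saving of

$$\frac{n\kappa}{n+\kappa} = \frac{10^5 \cdot 10^3}{10^5 + 10^3} \approx 10^3$$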

Page 27:

Experiment (logistic regression on: ijcnn, rcv, real-sim, url)

Page 28:

Related Methods

• SAG: Stochastic Average Gradient (Mark Schmidt, Nicolas Le Roux, Francis Bach, 2013)
– Refreshes a single stochastic gradient in each iteration
– Needs to store n gradients
– Similar convergence rate
– Cumbersome analysis

• SAGA (Aaron Defazio, Francis Bach, Simon Lacoste-Julien, 2014)
– Refined analysis

• MISO: Minimization by Incremental Surrogate Optimization (Julien Mairal, 2014)
– Similar to SAG, slightly worse performance
– Elegant analysis

Page 29:

Related Methods

• SVRG: Stochastic Variance Reduced Gradient (Rie Johnson, Tong Zhang, 2013)
– Arises as a special case of S2GD

• Prox-SVRG (Tong Zhang, Lin Xiao, 2014)
– Extended to the proximal setting

• EMGD: Epoch Mixed Gradient Descent (Lijun Zhang, Mehrdad Mahdavi, Rong Jin, 2013)
– Handles simple constraints
– Worse convergence rate

Page 30:

Extensions  

Page 31:

Extensions  

– Constraints [Prox-SVRG]
– Proximal setup [Prox-SVRG]
– Mini-batching [mS2GD]
– Efficient handling of sparse data [S2GD]
– Pre-processing with SGD [S2GD]
– Optimal choice of parameters [S2GD]
– Weakly convex functions [S2GD]
– High-probability result [S2GD]
– Inexact computation of gradients

Page 32:

S2CD: Semi-Stochastic Coordinate Descent

Complexity: $O\left((n\,C_{\mathrm{grad}} + \hat{\kappa}\,C_{\mathrm{cd}})\log(1/\epsilon)\right)$

S2GD: $O\left((n\,C_{\mathrm{grad}} + \kappa\,C_{\mathrm{grad}})\log(1/\epsilon)\right)$

Page 33:

mS2GD: S2GD with Mini-batching

Page 34:

Sparse Data

• For linear/logistic regression, the stochastic gradient copies the sparsity pattern of the example

• But the update direction is fully dense:

$$\underbrace{\nabla f_i(x) - \nabla f_i(\tilde{x})}_{\text{sparse}} + \underbrace{\nabla F(\tilde{x})}_{\text{dense}}$$

• Can we do something about it?

Page 35:

S2GD: Implementation for Sparse Data
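A minimal Python sketch of the standard lazy-update idea (my own illustration, not necessarily the author's MLOSS implementation; supports and grad_fi_sparse are assumed helpers): between touches, a coordinate v only accumulates the constant dense term -h*g_tilde[v] per step, so the accumulated shift can be applied just in time.

import numpy as np

def lazy_epoch(grad_fi_sparse, supports, x_tilde, g_tilde, h, t, n, seed=0):
    """One S2GD epoch with lazy handling of the dense term -h*g_tilde.

    supports[i]: indices of the nonzero features of example i.
    grad_fi_sparse(i, x): gradient of f_i at x, restricted to supports[i].
    """
    rng = np.random.default_rng(seed)
    x = x_tilde.copy()
    last = np.zeros(len(x), dtype=int)  # step through which each coord is current
    for k in range(1, t + 1):
        i = rng.integers(n)
        v = supports[i]
        x[v] -= (k - 1 - last[v]) * h * g_tilde[v]   # catch up on skipped dense steps
        # sparse variance-reduced step, touching only the support of example i
        x[v] -= h * (grad_fi_sparse(i, x) - grad_fi_sparse(i, x_tilde) + g_tilde[v])
        last[v] = k
    x -= (t - last) * h * g_tilde                    # final catch-up for all coords
    return x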

Page 36:

S2GD+

• Observing that SGD can make reasonable progress while S2GD computes the first full gradient, we can formulate the following algorithm:

S2GD+
• Run one pass of SGD
• Use the output as the starting point of S2GD
• Run S2GD
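As code, reusing the sgd and s2gd sketches from earlier pages (illustrative only):

def s2gd_plus(grad_fi, grad_F, x0, n, h_sgd, h, m, nu, outer_iters):
    # one pass of SGD over the data, then warm-start S2GD from its output
    x1 = sgd(grad_fi, x0, n, h_sgd, iters=n)
    return s2gd(grad_fi, grad_F, x1, n, h, m, nu, outer_iters)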

Page 37:

S2GD+ Experiment

Page 38:

High Probability Result

• The convergence result holds only in expectation
• Can we say anything about the concentration of the result in practice?

• For any ρ ∈ (0, 1) we have:

$$\mathbb{P}\left(\frac{F(x_k)-F(x_*)}{F(x_0)-F(x_*)} \leq \epsilon\right) \geq 1-\rho \quad \text{for} \quad k \geq \frac{\log\left(\frac{1}{\epsilon\rho}\right)}{\log\left(\frac{1}{c}\right)}$$

Paying just a logarithmic cost in ρ
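The statement follows from the expectation bound via Markov's inequality (a step the slides leave implicit):

$$\mathbb{P}\left(\frac{F(x_k)-F(x_*)}{F(x_0)-F(x_*)} > \epsilon\right) \leq \frac{1}{\epsilon}\,\mathbb{E}\left[\frac{F(x_k)-F(x_*)}{F(x_0)-F(x_*)}\right] \leq \frac{c^k}{\epsilon} \leq \rho \iff k \geq \frac{\log(1/(\epsilon\rho))}{\log(1/c)}$$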

Page 39:

Code

Efficient implementation for logistic regression available at MLOSS:

http://mloss.org/software/view/556/

Page 40:

THE END