Neural Networks for Machine Learning
Lecture 6a: Overview of mini-batch gradient descent
Transcript of lecture slides by Geoffrey Hinton, with Nitish Srivastava and Kevin Swersky (hinton/coursera/lecture6/lec6.pdf)

Page 1:

Neural Networks for Machine Learning

Lecture 6a: Overview of mini-batch gradient descent

Geoffrey Hinton, with Nitish Srivastava and Kevin Swersky

Page 2:

Reminder: The error surface for a linear neuron

• The error surface lies in a space with a horizontal axis for each weight and one vertical axis for the error.
  – For a linear neuron with a squared error, it is a quadratic bowl.
  – Vertical cross-sections are parabolas.
  – Horizontal cross-sections are ellipses.

• For multi-layer, non-linear nets the error surface is much more complicated.
  – But locally, a piece of a quadratic bowl is usually a very good approximation.

[Figure: a quadratic bowl, with the error E plotted over two weight axes w1 and w2.]

Page 3:

Convergence speed of full batch learning when the error surface is a quadratic bowl

• Going downhill reduces the error, but the direction of steepest descent does not point at the minimum unless the ellipse is a circle.
  – The gradient is big in the direction in which we only want to travel a small distance.
  – The gradient is small in the direction in which we want to travel a large distance.

Even for non-linear multi-layer nets, the error surface is locally quadratic, so the same speed issues apply.

Page 4:

How the learning goes wrong

• If the learning rate is big, the weights slosh to and fro across the ravine.
  – If the learning rate is too big, this oscillation diverges.

• What we would like to achieve:
  – Move quickly in directions with small but consistent gradients.
  – Move slowly in directions with big but inconsistent gradients.

[Figure: the error E as a function of a weight w, with the weight oscillating to and fro across a ravine.]

Page 5:

Stochastic gradient descent

• If the dataset is highly redundant, the gradient on the first half is almost identical to the gradient on the second half.
  – So instead of computing the full gradient, update the weights using the gradient on the first half and then get a gradient for the new weights on the second half.
  – The extreme version of this approach updates the weights after each case. It is called "online" learning.

• Mini-batches are usually better than online learning.
  – Less computation is used updating the weights.
  – Computing the gradient for many cases simultaneously uses matrix-matrix multiplies, which are very efficient, especially on GPUs.

• Mini-batches need to be balanced for classes.
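As a concrete illustration of the update granularities above, here is a minimal NumPy sketch of mini-batch gradient descent for a single linear neuron with squared error. The function name, batch size and learning rate are illustrative assumptions, not from the lecture; setting batch_size to the dataset size gives full-batch learning, and setting it to 1 gives online learning.

import numpy as np

def minibatch_gradient_descent(X, y, epochs=10, batch_size=100, lr=0.01):
    # One weight per input component; error E = 0.5 * mean((X @ w - y)^2).
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = np.random.permutation(len(X))        # shuffle so mini-batches stay roughly balanced
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ w - yb) / len(idx)   # gradient of E, computed for the whole mini-batch at once
            w -= lr * grad                           # one weight update per mini-batch
    return w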

 

Page 6:

Two types of learning algorithm

If we use the full gradient computed from all the training cases, there are many clever ways to speed up learning (e.g. non-linear conjugate gradient).
  – The optimization community has studied the general problem of optimizing smooth non-linear functions for many years.
  – Multilayer neural nets are not typical of the problems they study, so their methods may need a lot of adaptation.

For large neural networks with very large and highly redundant training sets, it is nearly always best to use mini-batch learning.
  – The mini-batches may need to be quite big when adapting fancy methods.
  – Big mini-batches are more computationally efficient.

Page 7:

A basic mini-batch gradient descent algorithm

• Guess an initial learning rate.
  – If the error keeps getting worse or oscillates wildly, reduce the learning rate.
  – If the error is falling fairly consistently but slowly, increase the learning rate.

• Write a simple program to automate this way of adjusting the learning rate (a sketch follows below).

• Towards the end of mini-batch learning it nearly always helps to turn down the learning rate.
  – This removes fluctuations in the final weights caused by the variations between mini-batches.

• Turn down the learning rate when the error stops decreasing.
  – Use the error on a separate validation set.
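A minimal sketch of the "simple program" suggested above for adjusting the learning rate from recent training errors. The specific thresholds and scaling factors are arbitrary assumptions, not from the lecture.

def adjust_learning_rate(lr, errors):
    # errors: training (or validation) error measured after each recent epoch.
    if len(errors) < 3:
        return lr
    last, prev = errors[-1] - errors[-2], errors[-2] - errors[-3]
    if last > 0 or last * prev < 0:
        return lr * 0.5          # error got worse or is oscillating: reduce the learning rate
    if last < 0 and abs(last) < 0.001 * errors[-1]:
        return lr * 1.1          # error is falling consistently but slowly: increase it
    return lr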

Page 8:

Neural Networks for Machine Learning

Lecture 6b: A bag of tricks for mini-batch gradient descent

Geoffrey Hinton, with Nitish Srivastava and Kevin Swersky

Page 9:

Be careful about turning down the learning rate

• Turning down the learning rate reduces the random fluctuations in the error that are due to the different gradients on different mini-batches.
  – So we get a quick win.
  – But then we get slower learning.

• Don't turn down the learning rate too soon!

[Figure: error vs. epoch; reducing the learning rate gives a quick drop in error, followed by slower learning.]

Page 10:

Initializing the weights

• If two hidden units have exactly the same bias and exactly the same incoming and outgoing weights, they will always get exactly the same gradient.
  – So they can never learn to be different features.
  – We break symmetry by initializing the weights to have small random values.

• If a hidden unit has a big fan-in, small changes on many of its incoming weights can cause the learning to overshoot.
  – We generally want smaller incoming weights when the fan-in is big, so initialize the weights to be proportional to 1/sqrt(fan-in) (see the sketch below).

• We can also scale the learning rate the same way.
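A minimal sketch of the initialization rule described above. The 1/sqrt(fan-in) factor is the rule from the slide; the Gaussian distribution, the scale constant and the function name are assumptions.

import numpy as np

def init_weights(fan_in, fan_out, scale=0.1, seed=0):
    # Small random values break the symmetry between hidden units;
    # dividing by sqrt(fan_in) gives smaller incoming weights to units with a big fan-in.
    rng = np.random.default_rng(seed)
    return scale * rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)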

Page 11:

Shifting the inputs

• When using steepest descent, shifting the input values makes a big difference.
  – It usually helps to transform each component of the input vector so that it has zero mean over the whole training set.

• The hyperbolic tangent (which is 2*logistic - 1) produces hidden activations that are roughly zero mean.
  – In this respect it is better than the logistic.

[Figure: error surfaces for a linear neuron with weights w1 and w2; colour indicates the training case. The training cases 101, 101 → 2 and 101, 99 → 0 give a long, narrow elliptical error surface; shifting the inputs to 1, 1 → 2 and 1, -1 → 0 gives a nearly circular one.]

Page 12:

Scaling the inputs

• When using steepest descent, scaling the input values makes a big difference.
  – It usually helps to transform each component of the input vector so that it has unit variance over the whole training set (see the standardization sketch after the figure below).

[Figure: error surfaces for a linear neuron with weights w1 and w2; colour indicates the weight axis. The training cases 0.1, 10 → 2 and 0.1, -10 → 0 give a highly elongated, axis-aligned elliptical error surface; rescaling the inputs to 1, 1 → 2 and 1, -1 → 0 gives a circular one.]
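The shifting (zero mean) and scaling (unit variance) transformations of the last two slides can be done in one standardization step. A minimal sketch; the small epsilon guarding against constant components is an added assumption.

import numpy as np

def standardize(X_train, X_test):
    # Give each input component zero mean and unit variance over the training set,
    # then apply the same transformation to the test data.
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + 1e-8
    return (X_train - mean) / std, (X_test - mean) / std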

Page 13:

A more thorough method: Decorrelate the input components

• For a linear neuron, we get a big win by decorrelating each component of the input from the other input components.

• There are several different ways to decorrelate inputs. A reasonable method is to use Principal Components Analysis (PCA), sketched below.
  – Drop the principal components with the smallest eigenvalues. This achieves some dimensionality reduction.
  – Divide the remaining principal components by the square roots of their eigenvalues. For a linear neuron, this converts an axis-aligned elliptical error surface into a circular one.

• For a circular error surface, the gradient points straight towards the minimum.
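A minimal PCA-whitening sketch of the recipe above; the number of kept components and the epsilon in the denominator are assumptions.

import numpy as np

def pca_whiten(X, n_components):
    # Rotate into the principal components, drop those with the smallest eigenvalues,
    # and divide each kept component by the square root of its eigenvalue.
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / len(Xc))   # eigenvalues in ascending order
    keep = np.argsort(eigvals)[::-1][:n_components]
    return (Xc @ eigvecs[:, keep]) / np.sqrt(eigvals[keep] + 1e-8)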

Page 14:

Common problems that occur in multilayer networks

• If we start with a very big learning rate, the weights of each hidden unit will all become very big and positive or very big and negative.
  – The error derivatives for the hidden units will all become tiny and the error will not decrease.
  – This is usually a plateau, but people often mistake it for a local minimum.

• In classification networks that use a squared error or a cross-entropy error, the best guessing strategy is to make each output unit always produce an output equal to the proportion of time it should be a 1.
  – The network finds this strategy quickly and may take a long time to improve on it by making use of the input.
  – This is another plateau that looks like a local minimum.

Page 15:

Four ways to speed up mini-batch learning

• Use "momentum".
  – Instead of using the gradient to change the position of the weight "particle", use it to change the velocity.

• Use separate adaptive learning rates for each parameter.
  – Slowly adjust the rate using the consistency of the gradient for that parameter.

• rmsprop: Divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight.
  – This is the mini-batch version of just using the sign of the gradient.

• Take a fancy method from the optimization literature that makes use of curvature information (not this lecture).
  – Adapt it to work for neural nets.
  – Adapt it to work for mini-batches.

Page 16:

Neural Networks for Machine Learning

Lecture 6c: The momentum method

Geoffrey Hinton, with Nitish Srivastava and Kevin Swersky

Page 17:

The intuition behind the momentum method

Imagine a ball on the error surface. The location of the ball in the horizontal plane represents the weight vector.
  – The ball starts off by following the gradient, but once it has velocity, it no longer does steepest descent.
  – Its momentum makes it keep going in the previous direction.

• It damps oscillations in directions of high curvature by combining gradients with opposite signs.

• It builds up speed in directions with a gentle but consistent gradient.

Page 18:

The equations of the momentum method

v(t) = \alpha\, v(t-1) - \varepsilon\, \frac{\partial E}{\partial w}(t)

The effect of the gradient is to increment the previous velocity. The velocity also decays by \alpha, which is slightly less than 1. The weight change is equal to the current velocity, so it can be expressed in terms of the previous weight change and the current gradient:

\Delta w(t) = v(t) = \alpha\, v(t-1) - \varepsilon\, \frac{\partial E}{\partial w}(t) = \alpha\, \Delta w(t-1) - \varepsilon\, \frac{\partial E}{\partial w}(t)
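These equations translate directly into an update rule. A minimal sketch, with illustrative values for the learning rate and momentum:

def momentum_step(w, v, grad, lr=0.01, alpha=0.9):
    # v(t) = alpha * v(t-1) - lr * dE/dw(t); the weight change is the current velocity.
    v = alpha * v - lr * grad
    return w + v, v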

Page 19:

The behavior of the momentum method

• If the error surface is a tilted plane, the ball reaches a terminal velocity.
  – If the momentum is close to 1, this is much faster than simple gradient descent.

• At the beginning of learning there may be very large gradients.
  – So it pays to use a small momentum (e.g. 0.5).
  – Once the large gradients have disappeared and the weights are stuck in a ravine, the momentum can be smoothly raised to its final value (e.g. 0.9 or even 0.99).

• This allows us to learn at a rate that would cause divergent oscillations without the momentum.

v(\infty) = \frac{1}{1-\alpha} \left( -\varepsilon\, \frac{\partial E}{\partial w} \right)
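The terminal velocity follows from setting v(t) = v(t-1) = v(\infty) in the velocity update above:

v(\infty) = \alpha\, v(\infty) - \varepsilon\, \frac{\partial E}{\partial w}
\;\Rightarrow\;
(1 - \alpha)\, v(\infty) = -\varepsilon\, \frac{\partial E}{\partial w}
\;\Rightarrow\;
v(\infty) = \frac{1}{1-\alpha} \left( -\varepsilon\, \frac{\partial E}{\partial w} \right)

With \alpha = 0.99, for example, the terminal velocity is 100 times the step that simple gradient descent would take with the same learning rate.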

Page 20:

A better type of momentum (Nesterov 1983)

• The standard momentum method first computes the gradient at the current location and then takes a big jump in the direction of the updated accumulated gradient.

• Ilya Sutskever (2012, unpublished) suggested a new form of momentum that often works better.
  – It is inspired by the Nesterov method for optimizing convex functions.

• First make a big jump in the direction of the previous accumulated gradient.

• Then measure the gradient where you end up and make a correction.
  – It is better to correct a mistake after you have made it!
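A minimal sketch of the Nesterov-style update described above. Here grad_fn stands for a function that returns dE/dw at a given weight vector; the name and the constants are assumptions.

def nesterov_momentum_step(w, v, grad_fn, lr=0.01, alpha=0.9):
    # First jump in the direction of the previous accumulated gradient (the velocity),
    # then measure the gradient where you end up and make a correction.
    lookahead = w + alpha * v
    v = alpha * v - lr * grad_fn(lookahead)
    return w + v, v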

Page 21:

A picture of the Nesterov method

• First make a big jump in the direction of the previous accumulated gradient.
• Then measure the gradient where you end up and make a correction.

[Figure: brown vector = jump, red vector = correction, green vector = accumulated gradient, blue vectors = standard momentum.]

Page 22:

Neural Networks for Machine Learning

Lecture 6d: A separate, adaptive learning rate for each connection

Geoffrey Hinton, with Nitish Srivastava and Kevin Swersky

Page 23:

The intuition behind separate adaptive learning rates

• In a multilayer net, the appropriate learning rates can vary widely between weights:
  – The magnitudes of the gradients are often very different for different layers, especially if the initial weights are small. (Gradients can get very small in the early layers of very deep nets.)
  – The fan-in of a unit determines the size of the "overshoot" effects caused by simultaneously changing many of the incoming weights of a unit to correct the same error. (The fan-in often varies widely between layers.)

• So use a global learning rate (set by hand) multiplied by an appropriate local gain that is determined empirically for each weight.

Page 24:

One way to determine the individual learning rates

• Start with a local gain of 1 for every weight.

• Increase the local gain if the gradient for that weight does not change sign.

• Use small additive increases and multiplicative decreases (for mini-batch learning).
  – This ensures that big gains decay rapidly when oscillations start.
  – If the gradient is totally random, the gain will hover around 1: we increase by plus δ half the time and decrease by times (1 − δ) half the time.

if \;\frac{\partial E}{\partial w_{ij}}(t)\, \frac{\partial E}{\partial w_{ij}}(t-1) > 0
then \; g_{ij}(t) = g_{ij}(t-1) + 0.05
else \; g_{ij}(t) = g_{ij}(t-1) \times 0.95

\Delta w_{ij} = -\varepsilon\, g_{ij}\, \frac{\partial E}{\partial w_{ij}}
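A minimal vectorized sketch of this rule; the gain limits come from the next slide, and the global learning rate is an illustrative assumption.

import numpy as np

def adaptive_gain_step(w, gains, grad, prev_grad, lr=0.01):
    # Additive increase (+0.05) while the gradient keeps its sign,
    # multiplicative decrease (*0.95) when it flips, then the usual gated update.
    same_sign = grad * prev_grad > 0
    gains = np.where(same_sign, gains + 0.05, gains * 0.95)
    gains = np.clip(gains, 0.1, 10.0)        # keep the gains in a reasonable range
    return w - lr * gains * grad, gains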

Page 25:

Tricks for making adaptive learning rates work better

• Limit the gains to lie in some reasonable range, e.g. [0.1, 10] or [0.01, 100].

• Use full batch learning or big mini-batches.
  – This ensures that changes in the sign of the gradient are not mainly due to the sampling error of a mini-batch.

• Adaptive learning rates can be combined with momentum.
  – Use the agreement in sign between the current gradient for a weight and the velocity for that weight (Jacobs, 1989).

• Adaptive learning rates only deal with axis-aligned effects.
  – Momentum does not care about the alignment of the axes.

Page 26:

Neural Networks for Machine Learning

Lecture 6e: rmsprop: Divide the gradient by a running average of its recent magnitude

Geoffrey Hinton, with Nitish Srivastava and Kevin Swersky

Page 27:

rprop: Using only the sign of the gradient

• The magnitude of the gradient can be very different for different weights and can change during learning.
  – This makes it hard to choose a single global learning rate.

• For full batch learning, we can deal with this variation by only using the sign of the gradient.
  – The weight updates are all of the same magnitude.
  – This escapes from plateaus with tiny gradients quickly.

• rprop: This combines the idea of only using the sign of the gradient with the idea of adapting the step size separately for each weight.
  – Increase the step size for a weight multiplicatively (e.g. times 1.2) if the signs of its last two gradients agree.
  – Otherwise decrease the step size multiplicatively (e.g. times 0.5).
  – Limit the step sizes to be less than 50 and more than a millionth (Mike Shuster's advice).
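A simplified full-batch rprop sketch following the description above; the classic rprop algorithm has extra details (such as backtracking after a sign change) that are omitted here.

import numpy as np

def rprop_step(w, step, grad, prev_grad, up=1.2, down=0.5):
    # Only the sign of the gradient is used; each weight has its own step size.
    agree = grad * prev_grad > 0
    step = np.where(agree, step * up, step * down)
    step = np.clip(step, 1e-6, 50.0)          # keep step sizes between a millionth and 50
    return w - np.sign(grad) * step, step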

Page 28:

Why rprop does not work with mini-batches

• The idea behind stochastic gradient descent is that when the learning rate is small, it averages the gradients over successive mini-batches.
  – Consider a weight that gets a gradient of +0.1 on nine mini-batches and a gradient of -0.9 on the tenth mini-batch.
  – We want this weight to stay roughly where it is.

• rprop would increment the weight nine times and decrement it once by about the same amount (assuming any adaptation of the step sizes is small on this time-scale).
  – So the weight would grow a lot.

• Is there a way to combine:
  – the robustness of rprop,
  – the efficiency of mini-batches, and
  – the effective averaging of gradients over mini-batches?
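Working through the example on this slide:

\text{averaged gradient: } \frac{9 \times (+0.1) + 1 \times (-0.9)}{10} = 0
\qquad
\text{net rprop movement: } 9s - 1s = 8s \text{ for a step size } s

So gradient descent with a small learning rate leaves the weight roughly where it is, while rprop moves it by about eight step sizes.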

Page 29:

rmsprop: A mini-batch version of rprop

• rprop is equivalent to using the gradient but also dividing by the size of the gradient.
  – The problem with mini-batch rprop is that we divide by a different number for each mini-batch. So why not force the number we divide by to be very similar for adjacent mini-batches?

• rmsprop: Keep a moving average of the squared gradient for each weight:

  MeanSquare(w, t) = 0.9\, MeanSquare(w, t-1) + 0.1 \left( \frac{\partial E}{\partial w}(t) \right)^{2}

• Dividing the gradient by \sqrt{MeanSquare(w, t)} makes the learning work much better (Tijmen Tieleman, unpublished).
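A minimal sketch of the update defined above. The slide specifies the moving average but not a learning rate, so lr and the small epsilon in the denominator are assumptions.

import numpy as np

def rmsprop_step(w, mean_square, grad, lr=0.001, eps=1e-8):
    # MeanSquare(w,t) = 0.9 * MeanSquare(w,t-1) + 0.1 * (dE/dw(t))^2,
    # then divide the gradient by sqrt(MeanSquare) before applying the learning rate.
    mean_square = 0.9 * mean_square + 0.1 * grad ** 2
    return w - lr * grad / (np.sqrt(mean_square) + eps), mean_square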

Page 30:

Further developments of rmsprop

• Combining rmsprop with standard momentum.
  – Momentum does not help as much as it normally does. Needs more investigation.

• Combining rmsprop with Nesterov momentum (Sutskever 2012).
  – It works best if the RMS of the recent gradients is used to divide the correction rather than the jump in the direction of accumulated corrections.

• Combining rmsprop with adaptive learning rates for each connection.
  – Needs more investigation.

• Other methods related to rmsprop.
  – Yann LeCun's group has a fancy version in "No more pesky learning rates".

Page 31:

Summary of learning methods for neural networks

• For small datasets (e.g. 10,000 cases) or bigger datasets without much redundancy, use a full-batch method.
  – Conjugate gradient, LBFGS, ...
  – Adaptive learning rates, rprop, ...

• For big, redundant datasets use mini-batches.
  – Try gradient descent with momentum.
  – Try rmsprop (with momentum?).
  – Try LeCun's latest recipe.

• Why there is no simple recipe:

  Neural nets differ a lot:
  – Very deep nets (especially ones with narrow bottlenecks).
  – Recurrent nets.
  – Wide shallow nets.

  Tasks differ a lot:
  – Some require very accurate weights, some don't.
  – Some have many very rare cases (e.g. words).