SGD cont'd, AdaGrad

Machine Learning for Big Data
CSE547/STAT548, University of Washington

Emily Fox
April 2nd, 2015

©Emily Fox 2015

Case Study 1: Estimating Click Probabilities

Learning Problem for Click Prediction
•  Prediction task:
•  Features:

•  Data:  

–  Batch:
–  Online:

•  Many approaches (e.g., logistic regression, SVMs, naïve Bayes, decision trees, boosting, ...)
–  Focus on logistic regression; it captures the main concepts, and the ideas generalize to other approaches



Standard v. Regularized Updates
•  Maximum conditional likelihood estimate:

•  Regularized maximum conditional likelihood estimate (a code sketch follows the equation below):


\[ \mathbf{w}^\ast = \arg\max_{\mathbf{w}} \, \ln\!\left[ \prod_j P(y_j \mid \mathbf{x}_j, \mathbf{w}) \right] - \lambda \sum_{i > 0} \frac{w_i^2}{2} \]
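To make the objective concrete, here is a minimal NumPy sketch of the regularized conditional log-likelihood and its full (batch) gradient. This is an illustration rather than the course's reference code: it assumes a dense N×d feature matrix X whose first column is an all-ones intercept feature, labels y in {0, 1}, and a hypothetical regularization strength lam; the intercept weight w[0] is left unpenalized to match the sum over i > 0.

```python
import numpy as np

def sigmoid(z):
    # P(Y = 1 | x, w) under the logistic model
    return 1.0 / (1.0 + np.exp(-z))

def regularized_log_likelihood(w, X, y, lam):
    # ln prod_j P(y_j | x_j, w)  -  lam * sum_{i>0} w_i^2 / 2
    p = sigmoid(X @ w)
    eps = 1e-12  # guard against log(0)
    log_lik = np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return log_lik - 0.5 * lam * np.sum(w[1:] ** 2)

def batch_gradient(w, X, y, lam):
    # Gradient of the objective above: X^T (y - p), minus lam * w on the penalized coordinates
    p = sigmoid(X @ w)
    grad = X.T @ (y - p)
    grad[1:] -= lam * w[1:]
    return grad
```

Note that batch_gradient touches every example and every feature, so one full gradient step costs O(Nd); that cost is exactly what the next slide asks about.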

Challenge 1: Complexity of computing gradients
•  What's the cost of a gradient update step for LR???



Challenge 2: Data is streaming
•  Assumption thus far: Batch data

•  But click prediction is a streaming data task:
–  User enters query, and ad must be selected:
•  Observe x_j, and must predict y_j
–  User either clicks or doesn't click on ad:
•  Label y_j is revealed afterwards
–  Google gets a reward if user clicks on ad
–  Weights must be updated for next time:


SGD: Stochastic Gradient Ascent (or Descent)

•  "True" gradient:
\[ \nabla \ell(\mathbf{w}) = E_{\mathbf{x}}\!\left[ \nabla \ell(\mathbf{w}, \mathbf{x}) \right] \]

•  Sample-based approximation:

•  What if we estimate the gradient with just one sample???
–  Unbiased estimate of the gradient
–  Very noisy!
–  Called stochastic gradient ascent (or descent)
•  Among many other names
–  VERY useful in practice!!!


Stochastic Gradient Ascent: General Case
•  Given a stochastic function of parameters:
–  Want to find maximum

•  Start from w(0)
•  Repeat until convergence:
–  Get a sample data point x_t
–  Update parameters (a code sketch of this loop follows below):

•  Works in the online learning setting!
•  Complexity of each gradient step is constant in the number of examples!
•  In general, the step size changes with iterations

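The loop below is a minimal sketch of this procedure (not code from the course). It assumes hypothetical callables grad_fn(w, x), returning the sample gradient of ℓ(w, x), and sample_fn(), drawing one data point, plus a 1/√t step-size schedule, which is one common choice; the convergence theorem a few slides ahead uses a schedule tailored to strongly convex losses.

```python
import numpy as np

def stochastic_gradient_ascent(grad_fn, sample_fn, w0, eta0=1.0, num_steps=10_000):
    # Generic SGD loop in ascent form: w <- w + eta_t * grad ell(w, x_t)
    w = np.array(w0, dtype=float)
    for t in range(1, num_steps + 1):
        x_t = sample_fn()                # one observation per step: cost is independent of N
        eta_t = eta0 / np.sqrt(t)        # step size decays with the iteration count
        w = w + eta_t * grad_fn(w, x_t)  # noisy but unbiased gradient step
    return w
```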

Stochastic Gradient Ascent for Logistic Regression

•  Logistic loss as a stochastic function:
\[ E_{\mathbf{x}}\!\left[ \ell(\mathbf{w}, \mathbf{x}) \right] = E_{\mathbf{x}}\!\left[ \ln P(y \mid \mathbf{x}, \mathbf{w}) - \frac{\lambda \|\mathbf{w}\|_2^2}{2} \right] \]

•  Batch gradient ascent updates:
\[ w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \left\{ -\lambda w_i^{(t)} + \frac{1}{N} \sum_{j=1}^{N} x_i^{(j)} \left[ y^{(j)} - P(Y = 1 \mid \mathbf{x}^{(j)}, \mathbf{w}^{(t)}) \right] \right\} \]

•  Stochastic gradient ascent updates:
–  Online setting (see the code sketch below):
\[ w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta_t \left\{ -\lambda w_i^{(t)} + x_i^{(t)} \left[ y^{(t)} - P(Y = 1 \mid \mathbf{x}^{(t)}, \mathbf{w}^{(t)}) \right] \right\} \]
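Putting the online update into code gives a sketch like the following (an illustration, not the course's implementation). It assumes a hypothetical iterable stream of (x_t, y_t) pairs with x_t a length-d NumPy array and y_t in {0, 1}, a 1/√t step size, and regularization strength lam applied to all coordinates, matching the −λw_i term in the update above.

```python
import numpy as np

def sgd_logistic_regression(stream, d, eta0=0.5, lam=0.01):
    # Stochastic gradient ascent for L2-regularized logistic regression,
    # one update per arriving example (online setting).
    w = np.zeros(d)
    for t, (x_t, y_t) in enumerate(stream, start=1):
        p = 1.0 / (1.0 + np.exp(-(x_t @ w)))       # P(Y = 1 | x_t, w^(t))
        eta_t = eta0 / np.sqrt(t)                  # decaying step size
        w += eta_t * (-lam * w + x_t * (y_t - p))  # the per-example update from the slide
    return w
```

In the click-prediction setting, x_t would be the feature vector built when the query arrives and y_t the click label revealed afterwards.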


Convergence Rate of SGD
•  Theorem:
–  (see Nemirovski et al. '09 from readings)
–  Let ℓ be a strongly convex stochastic function
–  Assume the gradient of ℓ is Lipschitz continuous and bounded
–  Then, for step sizes:
–  The expected loss decreases as O(1/t):


Convergence Rates for Gradient Descent/Ascent vs. SGD

•  Number of iterations to get to accuracy
\[ \ell(\mathbf{w}^\ast) - \ell(\mathbf{w}) \le \epsilon \]

•  Gradient descent:
–  If the function is strongly convex: O(ln(1/ϵ)) iterations
•  Stochastic gradient descent:
–  If the function is strongly convex: O(1/ϵ) iterations

•  Seems exponentially worse, but much more subtle:
–  Total running time, e.g., for logistic regression (see the note below):
•  Gradient descent:
•  SGD:
•  SGD can win when we have a lot of data
–  See readings for more details
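For reference, a commonly cited back-of-the-envelope comparison (not taken from the slide, which leaves these lines to be filled in): assume one batch gradient for logistic regression costs O(Nd) and one stochastic gradient costs O(d). Then the total running times are roughly

\[ \text{Gradient descent: } O\!\left( N d \, \ln \tfrac{1}{\epsilon} \right) \qquad \text{vs.} \qquad \text{SGD: } O\!\left( \tfrac{d}{\epsilon} \right), \]

so SGD comes out ahead once N ln(1/ϵ) grows larger than 1/ϵ, i.e., when the data set is large relative to the accuracy actually needed.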


Constrained SGD: Projected Gradient

•  Consider an arbitrary restricted feature space \( \mathcal{W} \):  \( \mathbf{w} \in \mathcal{W} \)

•  Optimization objective:

•  If \( \mathbf{w} \in \mathcal{W} \), can use projected gradient for (sub)gradient descent (see the sketch below):
\[ \mathbf{w}^{(t+1)} = \]
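As an illustration of the projection step, here is a minimal sketch assuming W is a Euclidean ball of radius R (a hypothetical choice; other convex sets need their own projection operators). It implements the "step, then project back onto W" update that appears later under "Regret Bounds for Standard SGD".

```python
import numpy as np

def project_onto_l2_ball(v, radius=1.0):
    # Euclidean projection onto W = {w : ||w||_2 <= radius}
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def projected_sgd_step(w, g_t, eta, radius=1.0):
    # w^(t+1) = argmin_{w in W} ||w - (w^(t) - eta * g_t)||_2^2,
    # i.e., a gradient step followed by projection back onto W
    return project_onto_l2_ball(w - eta * g_t, radius)
```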

Motivating AdaGrad (Duchi, Hazan, Singer 2011)

•  Assuming \( \mathbf{w} \in \mathbb{R}^d \), standard stochastic (sub)gradient descent updates are of the form:
\[ w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta_t \, g_{t,i} \]

•  Should all features share the same learning rate?

•  Often have high-dimensional feature spaces
–  Many features are irrelevant
–  Rare features are often very informative

•  AdaGrad provides a feature-specific adaptive learning rate by incorporating knowledge of the geometry of past observations


Why Adapt to Geometry?

Example from Duchi et al. ISMP 2012 slides ("Why adapt to geometry?"; Hard vs. Nice instances):

   y_t   x_{t,1}   x_{t,2}   x_{t,3}
    1       1         0         0
   -1       .5        0         1
    1      -.5        1         0
   -1       0         0         0
    1       .5        0         0
   -1       1         0         0
    1      -1         1         0
   -1      -.5        0         1

Feature 1: frequent, irrelevant
Feature 2: infrequent, predictive
Feature 3: infrequent, predictive


Not All Features are Created Equal

•  Examples (from Duchi et al. ISMP 2012 slides):
–  Text data: "The most unsung birthday in American business and technological history this year may be the 50th anniversary of the Xerox 914 photocopier." (The Atlantic, July/August 2010)
–  High-dimensional image features
–  Other motivation: selecting advertisements in online advertising, document ranking, problems with parameterizations of many magnitudes...


Visualizing Effect

(Image credit: http://imgur.com/a/Hqolp)

Regret Minimization

•  How do we assess the performance of an online algorithm?
•  Algorithm iteratively predicts \( \mathbf{w}^{(t)} \)
•  Incur loss \( \ell_t(\mathbf{w}^{(t)}) \)
•  Regret: What is the total incurred loss of the algorithm relative to the best choice of \( \mathbf{w} \) that could have been made retrospectively?
\[ R(T) = \sum_{t=1}^{T} \ell_t(\mathbf{w}^{(t)}) - \inf_{\mathbf{w} \in \mathcal{W}} \sum_{t=1}^{T} \ell_t(\mathbf{w}) \]


Regret Bounds for Standard SGD

•  Standard projected gradient stochastic updates:
\[ \mathbf{w}^{(t+1)} = \arg\min_{\mathbf{w} \in \mathcal{W}} \left\| \mathbf{w} - \left( \mathbf{w}^{(t)} - \eta \, g_t \right) \right\|_2^2 \]

•  Standard regret bound:
\[ \sum_{t=1}^{T} \ell_t(\mathbf{w}^{(t)}) - \ell_t(\mathbf{w}^\ast) \le \frac{1}{2\eta} \left\| \mathbf{w}^{(1)} - \mathbf{w}^\ast \right\|_2^2 + \frac{\eta}{2} \sum_{t=1}^{T} \| g_t \|_2^2 \]

Projected Gradient Using the Mahalanobis Norm

•  Standard projected gradient stochastic updates:
\[ \mathbf{w}^{(t+1)} = \arg\min_{\mathbf{w} \in \mathcal{W}} \left\| \mathbf{w} - \left( \mathbf{w}^{(t)} - \eta \, g_t \right) \right\|_2^2 \]

•  What if, instead of an L2 metric for the projection, we considered the Mahalanobis norm?
\[ \mathbf{w}^{(t+1)} = \arg\min_{\mathbf{w} \in \mathcal{W}} \left\| \mathbf{w} - \left( \mathbf{w}^{(t)} - \eta \, A^{-1} g_t \right) \right\|_A^2 \]


Mahalanobis Regret Bounds

\[ \mathbf{w}^{(t+1)} = \arg\min_{\mathbf{w} \in \mathcal{W}} \left\| \mathbf{w} - \left( \mathbf{w}^{(t)} - \eta \, A^{-1} g_t \right) \right\|_A^2 \]

•  What A to choose?
•  Regret bound now:
\[ \sum_{t=1}^{T} \ell_t(\mathbf{w}^{(t)}) - \ell_t(\mathbf{w}^\ast) \le \frac{1}{2\eta} \left\| \mathbf{w}^{(1)} - \mathbf{w}^\ast \right\|_2^2 + \frac{\eta}{2} \sum_{t=1}^{T} \| g_t \|_{A^{-1}}^2 \]

•  What if we minimize the upper bound on regret w.r.t. A in hindsight?
\[ \min_A \sum_{t=1}^{T} g_t^{T} A^{-1} g_t \]

Mahalanobis Regret Minimization

•  Objective:
\[ \min_A \sum_{t=1}^{T} g_t^{T} A^{-1} g_t \quad \text{subject to } A \succeq 0, \; \mathrm{tr}(A) \le C \]

•  Solution:
\[ A = c \left( \sum_{t=1}^{T} g_t g_t^{T} \right)^{\frac{1}{2}} \]
For the proof, see Appendix E, Lemma 15 of Duchi et al. 2011. It uses the "trace trick" and a Lagrangian; a short sketch of the argument follows below.

•  A defines the norm of the metric space we should be operating in
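For intuition, here is a compressed version of the standard argument (my paraphrase, not text from the slide or a substitute for Duchi et al.'s Lemma 15): write \( G := \sum_t g_t g_t^{T} \), so the objective is a trace, and apply a Cauchy–Schwarz inequality for traces.

\[ \sum_{t=1}^{T} g_t^{T} A^{-1} g_t = \mathrm{tr}\!\left( A^{-1} G \right), \qquad \big( \mathrm{tr}\, G^{1/2} \big)^2 = \Big( \mathrm{tr}\big( A^{1/2} \cdot A^{-1/2} G^{1/2} \big) \Big)^2 \le \mathrm{tr}(A) \, \mathrm{tr}\!\left( A^{-1} G \right) \le C \, \mathrm{tr}\!\left( A^{-1} G \right), \]

so \( \mathrm{tr}(A^{-1} G) \ge (\mathrm{tr}\, G^{1/2})^2 / C \) for every feasible A, with equality when \( A \propto G^{1/2} \), which recovers the solution above.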


AdaGrad Algorithm

•  At time t, estimate the optimal (sub)gradient modification A by
\[ A_t = \left( \sum_{\tau=1}^{t} g_\tau g_\tau^{T} \right)^{\frac{1}{2}} \]

•  For large d, A_t is computationally intensive to compute. Instead, use the diagonal approximation diag(A_t).
•  Then, the algorithm is a simple modification of the normal updates (see the sketch below):
\[ \mathbf{w}^{(t+1)} = \arg\min_{\mathbf{w} \in \mathcal{W}} \left\| \mathbf{w} - \left( \mathbf{w}^{(t)} - \eta \, \mathrm{diag}(A_t)^{-1} g_t \right) \right\|_{\mathrm{diag}(A_t)}^2 \]
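A minimal sketch of the diagonal version in the unconstrained case (W = R^d), again assuming hypothetical callables grad_fn and sample_fn. The small eps term is a standard numerical safeguard against dividing by zero before a feature has ever produced a nonzero gradient; it is not on the slide.

```python
import numpy as np

def adagrad(grad_fn, sample_fn, d, eta=0.1, eps=1e-8, num_steps=10_000):
    # Diagonal AdaGrad: per-coordinate step size eta / sqrt(sum of squared past gradients)
    w = np.zeros(d)
    g_sq_sum = np.zeros(d)                           # running sum of g_{t,i}^2 per coordinate
    for _ in range(num_steps):
        g = grad_fn(w, sample_fn())                  # (sub)gradient at the current point
        g_sq_sum += g ** 2
        w -= eta * g / (np.sqrt(g_sq_sum) + eps)     # rare features keep larger effective steps
    return w
```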

AdaGrad in Euclidean Space

•  For \( \mathcal{W} = \mathbb{R}^d \),
•  For each feature dimension,
\[ w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta_{t,i} \, g_{t,i} \]
where
\[ \eta_{t,i} = \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}} \]

•  That is,
\[ w_i^{(t+1)} \leftarrow w_i^{(t)} - \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}} \, g_{t,i} \]

•  Each feature dimension has its own learning rate!
–  Adapts with t
–  Takes geometry of the past observations into account
–  Primary role of η is determining the rate the first time a feature is encountered


AdaGrad Theoretical Guarantees

•  AdaGrad regret bound:
\[ \sum_{t=1}^{T} \ell_t(\mathbf{w}^{(t)}) - \ell_t(\mathbf{w}^\ast) \le 2 R_\infty \sum_{i=1}^{d} \| g_{1:T,i} \|_2, \qquad R_\infty := \max_t \| \mathbf{w}^{(t)} - \mathbf{w}^\ast \|_\infty \]

–  In stochastic setting:
\[ E\!\left[ \ell\!\left( \frac{1}{T} \sum_{t=1}^{T} \mathbf{w}^{(t)} \right) \right] - \ell(\mathbf{w}^\ast) \le \frac{2 R_\infty}{T} \sum_{i=1}^{d} E\!\left[ \| g_{1:T,i} \|_2 \right] \]

•  This really is used in practice!
•  Many cool examples. Let's just examine one...

AdaGrad Theoretical Example

•  Expect to out-perform when gradient vectors are sparse
•  SVM hinge loss example:
\[ \ell_t(\mathbf{w}) = \left[ 1 - y_t \langle \mathbf{x}_t, \mathbf{w} \rangle \right]_+, \qquad \mathbf{x}_t \in \{-1, 0, 1\}^d \]

•  If \( x_{t,j} \ne 0 \) with probability \( \propto j^{-\alpha}, \; \alpha > 1 \):
\[ E\!\left[ \ell\!\left( \frac{1}{T} \sum_{t=1}^{T} \mathbf{w}^{(t)} \right) \right] - \ell(\mathbf{w}^\ast) = O\!\left( \frac{\| \mathbf{w}^\ast \|_\infty}{\sqrt{T}} \cdot \max\{ \log d, \; d^{1 - \alpha/2} \} \right) \]

•  Previously best known method:
\[ E\!\left[ \ell\!\left( \frac{1}{T} \sum_{t=1}^{T} \mathbf{w}^{(t)} \right) \right] - \ell(\mathbf{w}^\ast) = O\!\left( \frac{\| \mathbf{w}^\ast \|_\infty}{\sqrt{T}} \cdot \sqrt{d} \right) \]


Neural Network Learning

•  Very non-convex problem, but use SGD methods anyway

•  Wildly non-convex objective (example from Duchi et al. ISMP 2012 slides; diagram shows inputs x_1, ..., x_5 feeding sigmoid units p(⟨w_k, x_k⟩)):
\[ \ell(\mathbf{w}, \mathbf{x}) = \log\!\left( 1 + \exp\!\left( \left\langle \left[ p(\langle \mathbf{w}_1, \mathbf{x}_1 \rangle) \cdots p(\langle \mathbf{w}_k, \mathbf{x}_k \rangle) \right], \ \mathbf{x}_0 \right\rangle \right) \right), \qquad p(\alpha) = \frac{1}{1 + \exp(\alpha)} \]

•  Idea: Use stochastic gradient methods to solve it anyway

•  Results from Dean et al. 2012 (figure: "Accuracy on Test Set", average frame accuracy (%) vs. time (hours), comparing SGD GPU, Downpour SGD, Downpour SGD w/ Adagrad, and Sandblaster L-BFGS):
–  Distributed, d = 1.7 × 10^9 parameters. SGD and AdaGrad use 80 machines (1,000 cores); L-BFGS uses 800 (10,000 cores)

Images from Duchi et al. ISMP 2012 slides

What you should know about Logistic Regression (LR) and Click Prediction

•  Click prediction problem:
–  Estimate probability of clicking
–  Can be modeled as logistic regression
•  Logistic regression model: Linear model
•  Gradient ascent to optimize conditional likelihood
•  Overfitting + regularization
•  Regularized optimization
–  Convergence rates and stopping criterion
•  Stochastic gradient ascent for large/streaming data
–  Convergence rates of SGD
•  AdaGrad motivation, derivation, and algorithm
