SGD cont'd, AdaGrad

Machine Learning for Big Data
CSE547/STAT548, University of Washington

Emily Fox
April 2nd, 2015

©Emily Fox 2015

Case Study 1: Estimating Click Probabilities

Learning Problem for Click Prediction
•  Prediction task:
•  Features:

•  Data:  

–  Batch:
–  Online:

•  Many approaches (e.g., logistic regression, SVMs, naïve Bayes, decision trees, boosting, ...)
–  Focus on logistic regression; it captures the main concepts, and the ideas generalize to other approaches



Standard v. Regularized Updates
•  Maximum conditional likelihood estimate:

•  Regularized maximum conditional likelihood estimate (a code sketch follows the equation below):


\[ \mathbf{w}^\ast = \arg\max_{\mathbf{w}} \, \ln\!\left[ \prod_j P(y_j \mid \mathbf{x}_j, \mathbf{w}) \right] - \lambda \sum_{i > 0} \frac{w_i^2}{2} \]
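To make the objective concrete, here is a minimal NumPy sketch of the regularized conditional log-likelihood and its full (batch) gradient. This is an illustration rather than the course's reference code: it assumes a dense N×d feature matrix X whose first column is an all-ones intercept feature, labels y in {0, 1}, and a hypothetical regularization strength lam; the intercept weight w[0] is left unpenalized to match the sum over i > 0.

```python
import numpy as np

def sigmoid(z):
    # P(Y = 1 | x, w) under the logistic model
    return 1.0 / (1.0 + np.exp(-z))

def regularized_log_likelihood(w, X, y, lam):
    # ln prod_j P(y_j | x_j, w)  -  lam * sum_{i>0} w_i^2 / 2
    p = sigmoid(X @ w)
    eps = 1e-12  # guard against log(0)
    log_lik = np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return log_lik - 0.5 * lam * np.sum(w[1:] ** 2)

def batch_gradient(w, X, y, lam):
    # Gradient of the objective above: X^T (y - p), minus lam * w on the penalized coordinates
    p = sigmoid(X @ w)
    grad = X.T @ (y - p)
    grad[1:] -= lam * w[1:]
    return grad
```

Note that batch_gradient touches every example and every feature, so one full gradient step costs O(Nd); that cost is exactly what the next slide asks about.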

Challenge 1: Complexity of computing gradients
•  What's the cost of a gradient update step for LR???



Challenge 2: Data is streaming
•  Assumption thus far: Batch data

•  But click prediction is a streaming data task:
–  User enters query, and ad must be selected:
•  Observe x_j, and must predict y_j
–  User either clicks or doesn't click on ad:
•  Label y_j is revealed afterwards
–  Google gets a reward if user clicks on ad
–  Weights must be updated for next time:


SGD: Stochastic Gradient Ascent (or Descent)

•  "True" gradient:
\[ \nabla \ell(\mathbf{w}) = E_{\mathbf{x}}\!\left[ \nabla \ell(\mathbf{w}, \mathbf{x}) \right] \]

•  Sample-based approximation:

•  What if we estimate the gradient with just one sample???
–  Unbiased estimate of the gradient
–  Very noisy!
–  Called stochastic gradient ascent (or descent)
•  Among many other names
–  VERY useful in practice!!!


Stochastic Gradient Ascent: General Case
•  Given a stochastic function of parameters:
–  Want to find maximum

•  Start from w(0)
•  Repeat until convergence:
–  Get a sample data point x_t
–  Update parameters (a code sketch of this loop follows below):

•  Works in the online learning setting!
•  Complexity of each gradient step is constant in the number of examples!
•  In general, the step size changes with iterations

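The loop below is a minimal sketch of this procedure (not code from the course). It assumes hypothetical callables grad_fn(w, x), returning the sample gradient of ℓ(w, x), and sample_fn(), drawing one data point, plus a 1/√t step-size schedule, which is one common choice; the convergence theorem a few slides ahead uses a schedule tailored to strongly convex losses.

```python
import numpy as np

def stochastic_gradient_ascent(grad_fn, sample_fn, w0, eta0=1.0, num_steps=10_000):
    # Generic SGD loop in ascent form: w <- w + eta_t * grad ell(w, x_t)
    w = np.array(w0, dtype=float)
    for t in range(1, num_steps + 1):
        x_t = sample_fn()                # one observation per step: cost is independent of N
        eta_t = eta0 / np.sqrt(t)        # step size decays with the iteration count
        w = w + eta_t * grad_fn(w, x_t)  # noisy but unbiased gradient step
    return w
```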

Stochastic Gradient Ascent for Logistic Regression

•  Logistic loss as a stochastic function:
\[ E_{\mathbf{x}}\!\left[ \ell(\mathbf{w}, \mathbf{x}) \right] = E_{\mathbf{x}}\!\left[ \ln P(y \mid \mathbf{x}, \mathbf{w}) - \frac{\lambda \|\mathbf{w}\|_2^2}{2} \right] \]

•  Batch gradient ascent updates:
\[ w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \left\{ -\lambda w_i^{(t)} + \frac{1}{N} \sum_{j=1}^{N} x_i^{(j)} \left[ y^{(j)} - P(Y = 1 \mid \mathbf{x}^{(j)}, \mathbf{w}^{(t)}) \right] \right\} \]

•  Stochastic gradient ascent updates:
–  Online setting (see the code sketch below):
\[ w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta_t \left\{ -\lambda w_i^{(t)} + x_i^{(t)} \left[ y^{(t)} - P(Y = 1 \mid \mathbf{x}^{(t)}, \mathbf{w}^{(t)}) \right] \right\} \]
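Putting the online update into code gives a sketch like the following (an illustration, not the course's implementation). It assumes a hypothetical iterable stream of (x_t, y_t) pairs with x_t a length-d NumPy array and y_t in {0, 1}, a 1/√t step size, and regularization strength lam applied to all coordinates, matching the −λw_i term in the update above.

```python
import numpy as np

def sgd_logistic_regression(stream, d, eta0=0.5, lam=0.01):
    # Stochastic gradient ascent for L2-regularized logistic regression,
    # one update per arriving example (online setting).
    w = np.zeros(d)
    for t, (x_t, y_t) in enumerate(stream, start=1):
        p = 1.0 / (1.0 + np.exp(-(x_t @ w)))       # P(Y = 1 | x_t, w^(t))
        eta_t = eta0 / np.sqrt(t)                  # decaying step size
        w += eta_t * (-lam * w + x_t * (y_t - p))  # the per-example update from the slide
    return w
```

In the click-prediction setting, x_t would be the feature vector built when the query arrives and y_t the click label revealed afterwards.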


Convergence Rate of SGD
•  Theorem:
–  (see Nemirovski et al. '09 from readings)
–  Let ℓ be a strongly convex stochastic function
–  Assume the gradient of ℓ is Lipschitz continuous and bounded
–  Then, for step sizes:
–  The expected loss decreases as O(1/t):


Convergence Rates for Gradient Descent/Ascent vs. SGD

•  Number of iterations to get to accuracy
\[ \ell(\mathbf{w}^\ast) - \ell(\mathbf{w}) \le \epsilon \]

•  Gradient descent:
–  If the function is strongly convex: O(ln(1/ϵ)) iterations
•  Stochastic gradient descent:
–  If the function is strongly convex: O(1/ϵ) iterations

•  Seems exponentially worse, but much more subtle:
–  Total running time, e.g., for logistic regression (see the note below):
•  Gradient descent:
•  SGD:
•  SGD can win when we have a lot of data
–  See readings for more details
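For reference, a commonly cited back-of-the-envelope comparison (not taken from the slide, which leaves these lines to be filled in): assume one batch gradient for logistic regression costs O(Nd) and one stochastic gradient costs O(d). Then the total running times are roughly

\[ \text{Gradient descent: } O\!\left( N d \, \ln \tfrac{1}{\epsilon} \right) \qquad \text{vs.} \qquad \text{SGD: } O\!\left( \tfrac{d}{\epsilon} \right), \]

so SGD comes out ahead once N ln(1/ϵ) grows larger than 1/ϵ, i.e., when the data set is large relative to the accuracy actually needed.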


Constrained SGD: Projected Gradient

•  Consider an arbitrary restricted feature space \( \mathcal{W} \):  \( \mathbf{w} \in \mathcal{W} \)

•  Optimization objective:

•  If \( \mathbf{w} \in \mathcal{W} \), can use projected gradient for (sub)gradient descent (see the sketch below):
\[ \mathbf{w}^{(t+1)} = \]
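As an illustration of the projection step, here is a minimal sketch assuming W is a Euclidean ball of radius R (a hypothetical choice; other convex sets need their own projection operators). It implements the "step, then project back onto W" update that appears later under "Regret Bounds for Standard SGD".

```python
import numpy as np

def project_onto_l2_ball(v, radius=1.0):
    # Euclidean projection onto W = {w : ||w||_2 <= radius}
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def projected_sgd_step(w, g_t, eta, radius=1.0):
    # w^(t+1) = argmin_{w in W} ||w - (w^(t) - eta * g_t)||_2^2,
    # i.e., a gradient step followed by projection back onto W
    return project_onto_l2_ball(w - eta * g_t, radius)
```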

Motivating AdaGrad (Duchi, Hazan, Singer 2011)

•  Assuming \( \mathbf{w} \in \mathbb{R}^d \), standard stochastic (sub)gradient descent updates are of the form:
\[ w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta_t \, g_{t,i} \]

•  Should all features share the same learning rate?

•  Often have high-dimensional feature spaces
–  Many features are irrelevant
–  Rare features are often very informative

•  AdaGrad provides a feature-specific adaptive learning rate by incorporating knowledge of the geometry of past observations


Why Adapt to Geometry?

Example from Duchi et al. ISMP 2012 slides ("Why adapt to geometry?"; Hard vs. Nice instances):

   y_t   x_{t,1}   x_{t,2}   x_{t,3}
    1       1         0         0
   -1       .5        0         1
    1      -.5        1         0
   -1       0         0         0
    1       .5        0         0
   -1       1         0         0
    1      -1         1         0
   -1      -.5        0         1

Feature 1: frequent, irrelevant
Feature 2: infrequent, predictive
Feature 3: infrequent, predictive


Not All Features are Created Equal

•  Examples (from Duchi et al. ISMP 2012 slides):
–  Text data: "The most unsung birthday in American business and technological history this year may be the 50th anniversary of the Xerox 914 photocopier." (The Atlantic, July/August 2010)
–  High-dimensional image features
–  Other motivation: selecting advertisements in online advertising, document ranking, problems with parameterizations of many magnitudes...


Visualizing Effect

(Image credit: http://imgur.com/a/Hqolp)

Regret Minimization

•  How do we assess the performance of an online algorithm?
•  Algorithm iteratively predicts \( \mathbf{w}^{(t)} \)
•  Incur loss \( \ell_t(\mathbf{w}^{(t)}) \)
•  Regret: What is the total incurred loss of the algorithm relative to the best choice of \( \mathbf{w} \) that could have been made retrospectively?
\[ R(T) = \sum_{t=1}^{T} \ell_t(\mathbf{w}^{(t)}) - \inf_{\mathbf{w} \in \mathcal{W}} \sum_{t=1}^{T} \ell_t(\mathbf{w}) \]


Regret Bounds for Standard SGD

•  Standard projected gradient stochastic updates:
\[ \mathbf{w}^{(t+1)} = \arg\min_{\mathbf{w} \in \mathcal{W}} \left\| \mathbf{w} - \left( \mathbf{w}^{(t)} - \eta \, g_t \right) \right\|_2^2 \]

•  Standard regret bound:
\[ \sum_{t=1}^{T} \ell_t(\mathbf{w}^{(t)}) - \ell_t(\mathbf{w}^\ast) \le \frac{1}{2\eta} \left\| \mathbf{w}^{(1)} - \mathbf{w}^\ast \right\|_2^2 + \frac{\eta}{2} \sum_{t=1}^{T} \| g_t \|_2^2 \]

Projected Gradient Using the Mahalanobis Norm

•  Standard projected gradient stochastic updates:
\[ \mathbf{w}^{(t+1)} = \arg\min_{\mathbf{w} \in \mathcal{W}} \left\| \mathbf{w} - \left( \mathbf{w}^{(t)} - \eta \, g_t \right) \right\|_2^2 \]

•  What if, instead of an L2 metric for the projection, we considered the Mahalanobis norm?
\[ \mathbf{w}^{(t+1)} = \arg\min_{\mathbf{w} \in \mathcal{W}} \left\| \mathbf{w} - \left( \mathbf{w}^{(t)} - \eta \, A^{-1} g_t \right) \right\|_A^2 \]


Mahalanobis Regret Bounds

\[ \mathbf{w}^{(t+1)} = \arg\min_{\mathbf{w} \in \mathcal{W}} \left\| \mathbf{w} - \left( \mathbf{w}^{(t)} - \eta \, A^{-1} g_t \right) \right\|_A^2 \]

•  What A to choose?
•  Regret bound now:
\[ \sum_{t=1}^{T} \ell_t(\mathbf{w}^{(t)}) - \ell_t(\mathbf{w}^\ast) \le \frac{1}{2\eta} \left\| \mathbf{w}^{(1)} - \mathbf{w}^\ast \right\|_2^2 + \frac{\eta}{2} \sum_{t=1}^{T} \| g_t \|_{A^{-1}}^2 \]

•  What if we minimize the upper bound on regret w.r.t. A in hindsight?
\[ \min_A \sum_{t=1}^{T} g_t^{T} A^{-1} g_t \]

Mahalanobis Regret Minimization

•  Objective:
\[ \min_A \sum_{t=1}^{T} g_t^{T} A^{-1} g_t \quad \text{subject to } A \succeq 0, \; \mathrm{tr}(A) \le C \]

•  Solution:
\[ A = c \left( \sum_{t=1}^{T} g_t g_t^{T} \right)^{\frac{1}{2}} \]
For the proof, see Appendix E, Lemma 15 of Duchi et al. 2011. It uses the "trace trick" and a Lagrangian; a short sketch of the argument follows below.

•  A defines the norm of the metric space we should be operating in
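For intuition, here is a compressed version of the standard argument (my paraphrase, not text from the slide or a substitute for Duchi et al.'s Lemma 15): write \( G := \sum_t g_t g_t^{T} \), so the objective is a trace, and apply a Cauchy–Schwarz inequality for traces.

\[ \sum_{t=1}^{T} g_t^{T} A^{-1} g_t = \mathrm{tr}\!\left( A^{-1} G \right), \qquad \big( \mathrm{tr}\, G^{1/2} \big)^2 = \Big( \mathrm{tr}\big( A^{1/2} \cdot A^{-1/2} G^{1/2} \big) \Big)^2 \le \mathrm{tr}(A) \, \mathrm{tr}\!\left( A^{-1} G \right) \le C \, \mathrm{tr}\!\left( A^{-1} G \right), \]

so \( \mathrm{tr}(A^{-1} G) \ge (\mathrm{tr}\, G^{1/2})^2 / C \) for every feasible A, with equality when \( A \propto G^{1/2} \), which recovers the solution above.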


AdaGrad Algorithm

•  At time t, estimate the optimal (sub)gradient modification A by
\[ A_t = \left( \sum_{\tau=1}^{t} g_\tau g_\tau^{T} \right)^{\frac{1}{2}} \]

•  For large d, A_t is computationally intensive to compute. Instead, use the diagonal approximation diag(A_t).
•  Then, the algorithm is a simple modification of the normal updates (see the sketch below):
\[ \mathbf{w}^{(t+1)} = \arg\min_{\mathbf{w} \in \mathcal{W}} \left\| \mathbf{w} - \left( \mathbf{w}^{(t)} - \eta \, \mathrm{diag}(A_t)^{-1} g_t \right) \right\|_{\mathrm{diag}(A_t)}^2 \]
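A minimal sketch of the diagonal version in the unconstrained case (W = R^d), again assuming hypothetical callables grad_fn and sample_fn. The small eps term is a standard numerical safeguard against dividing by zero before a feature has ever produced a nonzero gradient; it is not on the slide.

```python
import numpy as np

def adagrad(grad_fn, sample_fn, d, eta=0.1, eps=1e-8, num_steps=10_000):
    # Diagonal AdaGrad: per-coordinate step size eta / sqrt(sum of squared past gradients)
    w = np.zeros(d)
    g_sq_sum = np.zeros(d)                           # running sum of g_{t,i}^2 per coordinate
    for _ in range(num_steps):
        g = grad_fn(w, sample_fn())                  # (sub)gradient at the current point
        g_sq_sum += g ** 2
        w -= eta * g / (np.sqrt(g_sq_sum) + eps)     # rare features keep larger effective steps
    return w
```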

AdaGrad in Euclidean Space

•  For \( \mathcal{W} = \mathbb{R}^d \),
•  For each feature dimension,
\[ w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta_{t,i} \, g_{t,i} \]
where
\[ \eta_{t,i} = \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}} \]

•  That is,
\[ w_i^{(t+1)} \leftarrow w_i^{(t)} - \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}} \, g_{t,i} \]

•  Each feature dimension has its own learning rate!
–  Adapts with t
–  Takes geometry of the past observations into account
–  Primary role of η is determining the rate the first time a feature is encountered


AdaGrad Theoretical Guarantees

•  AdaGrad regret bound:
\[ \sum_{t=1}^{T} \ell_t(\mathbf{w}^{(t)}) - \ell_t(\mathbf{w}^\ast) \le 2 R_\infty \sum_{i=1}^{d} \| g_{1:T,i} \|_2, \qquad R_\infty := \max_t \| \mathbf{w}^{(t)} - \mathbf{w}^\ast \|_\infty \]

–  In stochastic setting:
\[ E\!\left[ \ell\!\left( \frac{1}{T} \sum_{t=1}^{T} \mathbf{w}^{(t)} \right) \right] - \ell(\mathbf{w}^\ast) \le \frac{2 R_\infty}{T} \sum_{i=1}^{d} E\!\left[ \| g_{1:T,i} \|_2 \right] \]

•  This really is used in practice!
•  Many cool examples. Let's just examine one...

AdaGrad Theoretical Example

•  Expect to out-perform when gradient vectors are sparse
•  SVM hinge loss example:
\[ \ell_t(\mathbf{w}) = \left[ 1 - y_t \langle \mathbf{x}_t, \mathbf{w} \rangle \right]_+, \qquad \mathbf{x}_t \in \{-1, 0, 1\}^d \]

•  If \( x_{t,j} \ne 0 \) with probability \( \propto j^{-\alpha}, \; \alpha > 1 \):
\[ E\!\left[ \ell\!\left( \frac{1}{T} \sum_{t=1}^{T} \mathbf{w}^{(t)} \right) \right] - \ell(\mathbf{w}^\ast) = O\!\left( \frac{\| \mathbf{w}^\ast \|_\infty}{\sqrt{T}} \cdot \max\{ \log d, \; d^{1 - \alpha/2} \} \right) \]

•  Previously best known method:
\[ E\!\left[ \ell\!\left( \frac{1}{T} \sum_{t=1}^{T} \mathbf{w}^{(t)} \right) \right] - \ell(\mathbf{w}^\ast) = O\!\left( \frac{\| \mathbf{w}^\ast \|_\infty}{\sqrt{T}} \cdot \sqrt{d} \right) \]


Neural Network Learning

•  Very non-convex problem, but use SGD methods anyway

•  Wildly non-convex objective (example from Duchi et al. ISMP 2012 slides; diagram shows inputs x_1, ..., x_5 feeding sigmoid units p(⟨w_k, x_k⟩)):
\[ \ell(\mathbf{w}, \mathbf{x}) = \log\!\left( 1 + \exp\!\left( \left\langle \left[ p(\langle \mathbf{w}_1, \mathbf{x}_1 \rangle) \cdots p(\langle \mathbf{w}_k, \mathbf{x}_k \rangle) \right], \ \mathbf{x}_0 \right\rangle \right) \right), \qquad p(\alpha) = \frac{1}{1 + \exp(\alpha)} \]

•  Idea: Use stochastic gradient methods to solve it anyway

•  Results from Dean et al. 2012 (figure: "Accuracy on Test Set", average frame accuracy (%) vs. time (hours), comparing SGD GPU, Downpour SGD, Downpour SGD w/ Adagrad, and Sandblaster L-BFGS):
–  Distributed, d = 1.7 × 10^9 parameters. SGD and AdaGrad use 80 machines (1,000 cores); L-BFGS uses 800 (10,000 cores)

Images from Duchi et al. ISMP 2012 slides

What you should know about Logistic Regression (LR) and Click Prediction

•  Click prediction problem:
–  Estimate probability of clicking
–  Can be modeled as logistic regression
•  Logistic regression model: Linear model
•  Gradient ascent to optimize conditional likelihood
•  Overfitting + regularization
•  Regularized optimization
–  Convergence rates and stopping criterion
•  Stochastic gradient ascent for large/streaming data
–  Convergence rates of SGD
•  AdaGrad motivation, derivation, and algorithm
