Recap: perceptrons


Transcript of Recap: perceptrons

Page 1: Recap: perceptrons


Page 2: Recap: perceptrons

The perceptron

Online protocol between A and B:
•  A sends B an instance xi
•  B computes ŷi = sign(vk . xi) and returns the prediction ŷi
•  A sends B the true label yi
•  If mistake (ŷi != yi): vk+1 = vk + yi xi
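The mistake-driven update above can be sketched in a few lines of Python (a minimal dense-vector sketch; the `perceptron` function and the toy data are illustrative, not the lecture's code):

```python
import numpy as np

def perceptron(examples, epochs=10):
    """Mistake-driven perceptron: on each error, v <- v + y*x."""
    v = np.zeros(len(examples[0][0]))
    for _ in range(epochs):
        for x, y in examples:
            # B's prediction is sign(v . x); treat a score of 0 as a mistake
            if y * np.dot(v, x) <= 0:
                v = v + y * x  # update only on mistakes
    return v

# Toy linearly separable data (labels follow the sign of the first coordinate)
data = [(np.array([2.0, 1.0]), 1), (np.array([-1.5, 0.5]), -1),
        (np.array([1.0, -2.0]), 1), (np.array([-2.0, -1.0]), -1)]
v = perceptron(data)
```

On separable data like this, the mistake bound guarantees the loop stops updating after finitely many errors.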


Page 3: Recap: perceptrons

The perceptron

Online protocol between A and B (as on the previous slide):
•  A sends B an instance xi
•  B computes ŷi = sign(vk . xi) and returns ŷi
•  A sends B the true label yi
•  If mistake: vk+1 = vk + yi xi

[Figure: positive (+) and negative (−) points separated by the target vector u (and −u), with margin γ.]

Mistake bound: k ≤ (R/γ)², where R bounds the norm of the xi and γ is the margin of the separator u.

Page 4: Recap: perceptrons

Perceptron tricks

•  voting and averaging
•  sparsified averaging
•  kernel perceptrons
   –  the “hash trick” as a kernel
•  ranking perceptrons
•  structured perceptrons

Page 5: Recap: perceptrons

On-line to batch learning

1.  Pick a vk at random according to mk/m, the fraction of examples it was used for.

2.  Predict using the vk you just picked.

3.  (In practice, use some sort of deterministic approximation to this.)

Voted perceptron

Page 6: Recap: perceptrons

1.  Pick a vk at random according to mk/m, the fraction of examples it was used for.

2.  Predict using the vk you just picked.

3.  (In practice, use some sort of deterministic approximation to this.)

Deterministic approximation: predict using sign(v* . x), where v* is the mk/m-weighted average of the vk’s.

Averaged perceptron

Page 7: Recap: perceptrons

The averaged perceptron


Page 8: Recap: perceptrons

Complexity of perceptron

•  Algorithm:
   •  v = 0 (init hashtable)
   •  for each example x,y:                     O(n) examples
      –  if sign(v.x) != y:
         •  v = v + yx                          O(|x|) = O(|d|)
            (i.e., for each nonzero xi: vi += y xi)

Final hypothesis (last): v

Page 9: Recap: perceptrons

Complexity of averaged perceptron

•  Algorithm:
   •  vk = 0; va = 0
   •  for each example x,y:                     O(n) examples, O(n|V|) total
      –  if sign(vk.x) != y:
         •  va = va + mk*vk                     O(|V|)
         •  vk = vk + yx                        O(|x|) = O(|d|)
         •  m = m+1
         •  mk = 1
      –  else:
         •  mk++

Final hypothesis (avg): va/m
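The lazy bookkeeping above can be sketched as follows (an illustrative reimplementation, not the course code; note the final flush of mk*vk before returning, which the slide leaves implicit):

```python
import numpy as np

def averaged_perceptron(examples, epochs=10):
    """Averaged perceptron with the lazy update from the slide:
    va absorbs mk*vk only when vk is about to change, so correct
    predictions cost O(1) bookkeeping instead of an O(|V|) add."""
    dim = len(examples[0][0])
    vk, va = np.zeros(dim), np.zeros(dim)
    m, mk = 0, 0  # m: examples seen; mk: how long the current vk has survived
    for _ in range(epochs):
        for x, y in examples:
            m += 1
            if y * np.dot(vk, x) <= 0:
                va += mk * vk   # flush the surviving vk into the running sum
                vk = vk + y * x
                mk = 1
            else:
                mk += 1
    va += mk * vk               # flush the last surviving vk
    return va / m

data = [(np.array([2.0, 1.0]), 1), (np.array([-1.5, 0.5]), -1),
        (np.array([1.0, -2.0]), 1), (np.array([-2.0, -1.0]), -1)]
v_avg = averaged_perceptron(data)
```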

Page 10: Recap: perceptrons

Sparsified averaged perceptron

•  Algorithm:
   •  vk = 0; va = 0
   •  for each example x,y:
      –  m = m+1
      –  if sign(vk.x) != y:
         •  vk = vk + y*x
         •  va = va + (T−m)*y*x                 O(|x|) !
•  Return va/T

T = the total number of examples in the stream (all epochs). The term (T−m)*y*x accounts for all subsequent additions of x to va, replacing the dense per-example update va = va + vk (where vk = Σ_{j∈Sk} yj xj).

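The (T−m) trick can be sketched like this (illustrative; it assumes the stream length T is known up front, and uses (T − m + 1) so that the updated vk also counts for the current example, one of the two off-by-one conventions):

```python
import numpy as np

def sparsified_averaged_perceptron(examples, epochs=10):
    """Averaged perceptron where va is touched only on mistakes:
    a mistake at step m contributes y*x to every later vk, so it
    adds (T - m + 1) copies of y*x to the running sum va."""
    dim = len(examples[0][0])
    vk, va = np.zeros(dim), np.zeros(dim)
    T = epochs * len(examples)  # total examples in the stream (all epochs)
    m = 0
    for _ in range(epochs):
        for x, y in examples:
            m += 1
            if y * np.dot(vk, x) <= 0:
                vk = vk + y * x
                # all current and subsequent additions of x to va, at once
                va = va + (T - m + 1) * y * x
    return va / T  # equals the average of vk over the whole stream

data = [(np.array([2.0, 1.0]), 1), (np.array([-1.5, 0.5]), -1),
        (np.array([1.0, -2.0]), 1), (np.array([-2.0, -1.0]), -1)]
v_avg = sparsified_averaged_perceptron(data)
```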

Page 11: Recap: perceptrons

The kernel perceptron

Online protocol between A and B:
•  A sends B an instance xi
•  B computes ŷi = vk . xi and returns ŷi
•  A sends B the true label yi
•  If mistake: vk+1 = vk + yi xi

Equivalently, B can compute:
   ŷ = Σ_{xi∈FN} x . xi − Σ_{xi∈FP} x . xi
•  If false negative (score too low) mistake: add x to FN
•  If false positive (score too high) mistake: add x to FP

Mathematically the same as before … but allows use of the kernel trick

Page 12: Recap: perceptrons

The kernel perceptron

Online protocol between A and B:
•  A sends B an instance xi
•  B computes ŷi = vk . xi and returns ŷi
•  A sends B the true label yi
•  If mistake: vk+1 = vk + yi xi

Equivalently:
   ŷ = Σ_{xi∈FN} x . xi − Σ_{xi∈FP} x . xi
•  If false negative (score too low) mistake: add x to FN
•  If false positive (score too high) mistake: add x to FP

Mathematically the same as before … but allows use of the “kernel trick”:
   ŷ = Σ_{xi∈FN} K(x, xi) − Σ_{xi∈FP} K(x, xi),   where K(x, xk) ≡ x ⋅ xk

Other kernel methods (SVM, Gaussian processes) aren’t constrained to a limited set (+1/−1/0) of weights on the K(x,v) values.
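A minimal sketch of the FN/FP form with a kernel (illustrative, not the lecture's code; the quadratic kernel and the XOR data are my choices, picked because no linear perceptron can separate them):

```python
def kernel_perceptron(examples, K, epochs=10):
    """Keep two support sets instead of a weight vector:
    score(x) = sum_{xi in FN} K(x, xi) - sum_{xi in FP} K(x, xi)."""
    FN, FP = [], []  # stored false-negative / false-positive instances

    def score(x):
        return sum(K(x, xi) for xi in FN) - sum(K(x, xi) for xi in FP)

    for _ in range(epochs):
        for x, y in examples:
            s = score(x)
            if y > 0 and s <= 0:
                FN.append(x)  # false negative (score too low): add x to FN
            elif y < 0 and s >= 0:
                FP.append(x)  # false positive (score too high): add x to FP
    return lambda x: 1 if score(x) > 0 else -1

# A quadratic kernel makes XOR separable (its feature map contains x1*x2)
K = lambda a, b: (a[0] * b[0] + a[1] * b[1] + 1) ** 2
xor = [((1, 1), 1), ((1, -1), -1), ((-1, -1), 1), ((-1, 1), -1)]
predict = kernel_perceptron(xor, K)
```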

12

Page 13: Recap: perceptrons

Kernels 101

•  Duality: two ways to look at this

In the original space:
   ŷ = Σ_{xk∈FN} K(x, xk) − Σ_{xk∈FP} K(x, xk),   with K(x, xk) ≡ x ⋅ xk
   ŷ = x ⋅ w = K(x, w),   with w = Σ_{xk∈FN} xk − Σ_{xk∈FP} xk

In the feature space:
   ŷ = Σ_{xk∈FN} K(x, xk) − Σ_{xk∈FP} K(x, xk),   with K(x, xk) ≡ φ(x) ⋅ φ(xk)
   ŷ = φ(x) ⋅ w,   with w = Σ_{xk∈FN} φ(xk) − Σ_{xk∈FP} φ(xk)

Two different computational ways of getting the same behavior:
•  Explicitly map from x to φ(x) – i.e. to the point corresponding to x in the Hilbert space
•  Implicitly map from x to φ(x) by changing the kernel function K

Page 14: Recap: perceptrons

Hash trick for logistic regression

•  Algorithm:
   •  Initialize arrays W, A of size R and set k = 0
   •  For each iteration t = 1,…,T
      –  For each example (xi, yi):
         •  Let V be the hashed feature array: V[h] = Σ_{j: hash(xij) % R = h} xij
         •  pi = … ; k++
         •  For each hash value h with V[h] > 0:
            »  W[h] *= (1 − λ2μ)^(k − A[h])      (lazy regularization)
            »  W[h] = W[h] + λ(yi − pi)V[h]
            »  A[h] = k

The array V replaces the hashtable x – like φ(x) implicitly replaces x in a kernel.
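Building the hashed feature array V can be sketched as follows (illustrative; `stable_hash` stands in for whatever hash function is used, and no sign trick is applied here):

```python
import hashlib

def stable_hash(s):
    """Deterministic string hash (Python's built-in hash() is salted per process)."""
    return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)

def hashed_features(x, R):
    """Collapse a sparse feature dict {feature_name: value} into a
    dense array V of size R: V[h] = sum of xij with hash(j) % R == h."""
    V = [0.0] * R
    for j, xj in x.items():
        V[stable_hash(j) % R] += xj
    return V

# Hypothetical sparse example; feature names are made up for illustration
x = {"word=post": 1.0, "word=notes": 1.0, "bias": 1.0}
V = hashed_features(x, R=8)
```

Collisions simply add feature values together, which is exactly why V behaves like an implicit feature map.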

Page 15: Recap: perceptrons

The voted perceptron for ranking

Online protocol between A and B:
•  A sends B instances x1 x2 x3 x4 …
•  B computes yi = vk . xi and returns b*, the index of the “best” xi
•  A tells B the index b of the actually best xi
•  If mistake (b* != b): vk+1 = vk + xb − xb*
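The ranking update can be sketched as below (an illustrative toy, not the lecture's code; each example is a candidate list plus the index of the correct best item):

```python
import numpy as np

def ranking_perceptron(examples, epochs=10):
    """Each example is (candidates, b): a list of candidate vectors
    and the index b of the one that should score highest."""
    dim = len(examples[0][0][0])
    v = np.zeros(dim)
    for _ in range(epochs):
        for xs, b in examples:
            b_star = int(np.argmax([np.dot(v, x) for x in xs]))  # B's top pick
            if b_star != b:
                v = v + xs[b] - xs[b_star]  # promote the answer, demote the pick
    return v

examples = [([np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.5, 0.5])], 0),
            ([np.array([0.0, 2.0]), np.array([2.0, 0.0]), np.array([1.0, 1.0])], 1)]
v = ranking_perceptron(examples)
```

Note that nothing in the loop depends on how many candidates each example has, matching the remark later that the analysis doesn't depend on the number of x's being ranked.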


Page 16: Recap: perceptrons

[Figure: target vector u and −u, several x’s, margin γ.]

Ranking some x’s with the target vector u

Page 17: Recap: perceptrons

[Figure: u, −u, several x’s, margin γ, and a guess vector v.]

Ranking some x’s with some guess vector v – part 1

Page 18: Recap: perceptrons

[Figure: u, −u, several x’s, and a guess vector v.]

Ranking some x’s with some guess vector v – part 2.

The purple-circled x is xb* – the one the learner has chosen to rank highest. The green-circled x is xb, the right answer.

Page 19: Recap: perceptrons

[Figure: u, −u, several x’s, and the guess vector v.]

Correcting v by adding xb – xb*

Page 20: Recap: perceptrons

[Figure: several x’s, with vk and vk+1.]

Correcting v by adding xb – xb* (part 2)

Page 21: Recap: perceptrons

[Figure: (3a) the guess v2 after the two positive examples, v2 = v1 + x2; plus the ranking picture with guess vector v.]

Notice this doesn’t depend at all on the number of x’s being ranked.

Neither proof depends on the dimension of the x’s.

Page 22: Recap: perceptrons

Ranking perceptrons → structured perceptrons

•  Ranking API:
   –  A sends B a (maybe huge) set of items to rank
   –  B finds the single best one according to the current weight vector
   –  A tells B which one was actually best

•  Structured classification API:
   –  Input: list of words x = (w1,…,wn)
   –  Output: list of labels y = (y1,…,yn)
   –  If there are K classes, there are K^n possible labelings for x

Page 23: Recap: perceptrons

Ranking perceptrons → structured perceptrons

•  New API:
   –  A sends B the word sequence x
   –  B finds the single best y according to the current weight vector, using Viterbi
   –  A tells B which y was actually best
   –  This is equivalent to ranking pairs g = (x, y’)

•  Structured classification API:
   –  Input: list of words x = (w1,…,wn)
   –  Output: list of labels y = (y1,…,yn)
   –  If there are K classes, there are K^n possible labelings for x

Page 24: Recap: perceptrons

Viterbi for a linear-chain MRF

[Figure: a trellis with states B, I, O at each position of the sentence “When will prof Cohen post the notes …”.]

•  Inference: find the highest-weight path
•  This can be done efficiently using dynamic programming
•  I can also find the features for a particular label sequence
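The dynamic program can be sketched as follows (a generic max-sum Viterbi over per-position label scores plus transition scores; the toy BIO scores are invented for illustration):

```python
def viterbi(obs_scores, trans_scores):
    """obs_scores: list (one dict per position) of label -> score;
    trans_scores: dict (prev_label, label) -> score (missing pairs score 0).
    Returns the highest-weight label sequence."""
    labels = list(obs_scores[0])
    delta = [dict(obs_scores[0])]  # delta[t][y]: best score of a path ending in y at t
    back = []                      # backpointers
    for t in range(1, len(obs_scores)):
        d, b = {}, {}
        for y in labels:
            prev = max(labels,
                       key=lambda p: delta[-1][p] + trans_scores.get((p, y), 0.0))
            d[y] = (delta[-1][prev] + trans_scores.get((prev, y), 0.0)
                    + obs_scores[t][y])
            b[y] = prev
        delta.append(d)
        back.append(b)
    y = max(labels, key=lambda l: delta[-1][l])  # best final label
    path = [y]
    for b in reversed(back):                     # follow backpointers
        y = b[y]
        path.append(y)
    return path[::-1]

# Toy 3-word BIO example with all transition scores zero
obs = [{"B": 2.0, "I": 0.0, "O": 1.0},
       {"B": 0.0, "I": 2.0, "O": 1.0},
       {"B": 0.0, "I": 0.0, "O": 2.0}]
best = viterbi(obs, {})
```

In the structured perceptron, the scores would come from w · F(x, y) decomposed over positions and transitions.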

Page 25: Recap: perceptrons

The voted perceptron for NER

Online protocol between A and B:
•  A sends B instances g1 g2 g3 g4 …
•  B computes yi = vk . gi and returns b*, the index of the “best” gi
•  A tells B the index b of the actually best gi
•  If mistake: vk+1 = vk + gb − gb*

1.  A sends B feature functions, and instructions for creating the instances g:
    •  A sends a word vector xi. Then B could create the instances g1 = F(xi,y1), g2 = F(xi,y2), …
    •  but instead B just returns the y* that gives the best score for the dot product vk . F(xi,y*), found using Viterbi.

2.  A sends B the correct label sequence yi.

3.  On errors, B sets vk+1 = vk + gb − gb* = vk + F(xi,yi) − F(xi,y*)

Page 26: Recap: perceptrons

Perceptron tricks

•  voting and averaging
•  sparsified averaging
•  kernel perceptrons
   –  the “hash trick” as a kernel
•  ranking perceptrons
•  structured perceptrons
   –  you need a way to find the best label y given example x relative to weights w
   –  but the weights can be used in some search or other non-trivial computation

Page 27: Recap: perceptrons


Page 28: Recap: perceptrons

NAACL 2010


Page 29: Recap: perceptrons

Comment

•  It’s easier to speed up a structured perceptron than a regular perceptron

•  Why?


Page 30: Recap: perceptrons

Parallel Structured Perceptrons

•  Simplest idea:
   –  Split data into S “shards”
   –  Train a perceptron on each shard independently
      •  the weight vectors are w(1), w(2), …
   –  Produce some weighted average of the w(i)’s as the final result

Page 31: Recap: perceptrons

Parallelizing perceptrons

[Figure: split the instances/labels into example subsets 1–3; compute vk’s on the subsets (vk-1, vk-2, vk-3); combine them into vk by some sort of weighted averaging.]

Page 32: Recap: perceptrons

Parallel Perceptrons

•  Simplest idea:
   –  Split data into S “shards”
   –  Train a perceptron on each shard independently
      •  the weight vectors are w(1), w(2), …
   –  Produce some weighted average of the w(i)’s as the final result

•  Theorem: this doesn’t always work.
•  Proof: by constructing an example where you can converge on every shard and still have the averaged vector fail to separate the full training set – no matter how you average the components.

Page 33: Recap: perceptrons

Parallel Perceptrons – take 2

Idea: do the simplest possible thing iteratively.
•  Split the data into shards
•  Let w = 0
•  For n = 1, …
   •  Train a perceptron on each shard with one pass, starting from w
   •  Average the weight vectors (somehow) and let w be that average

Extra communication cost (the averaging step is an All-Reduce):
•  redistributing the weight vectors
•  done less frequently than if fully synchronized, more frequently than if fully parallelized
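The iterative scheme can be sketched as below (an illustrative serial simulation of the parallel algorithm; in a real cluster each `perceptron_pass` would run on its own worker and the `np.mean` would be the all-reduce):

```python
import numpy as np

def perceptron_pass(shard, w):
    """One perceptron pass over a shard, starting from the shared w."""
    w = w.copy()
    for x, y in shard:
        if y * np.dot(w, x) <= 0:
            w = w + y * x
    return w

def iterative_parameter_mixing(shards, rounds=10):
    """Each round: one pass per shard starting from the shared w,
    then a uniform average (mu = 1/S) of the local weight vectors."""
    w = np.zeros(len(shards[0][0][0]))
    for _ in range(rounds):
        w = np.mean([perceptron_pass(s, w) for s in shards], axis=0)
    return w

# Two toy shards, both separable by the same direction
shards = [[(np.array([2.0, 1.0]), 1), (np.array([-1.5, 0.5]), -1)],
          [(np.array([1.0, -2.0]), 1), (np.array([-2.0, -1.0]), -1)]]
w = iterative_parameter_mixing(shards)
```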

Page 34: Recap: perceptrons

Parallelizing perceptrons – take 2

[Figure: split the instances/labels into example subsets 1–3; compute local weight vectors (w-1, w-2, w-3), each starting from the previous w; combine them into the new w by some sort of weighted averaging.]

Page 35: Recap: perceptrons

A theorem

Corollary: if we weight the vectors uniformly, then the number of mistakes is still bounded. I.e., this is “enough communication” to guarantee convergence.


Page 36: Recap: perceptrons

What we know and don’t know

•  uniform mixing: μ = 1/S
•  could we lose our speedup-from-parallelizing to slower convergence? In the worst case, a speedup by a factor of S is cancelled by slower convergence by a factor of S.

Page 37: Recap: perceptrons

Results on NER

[Figure: learning curves for the perceptron vs. the averaged perceptron.]

Page 38: Recap: perceptrons

Results on parsing

[Figure: learning curves for the perceptron vs. the averaged perceptron.]

Page 39: Recap: perceptrons

The theorem…


Page 40: Recap: perceptrons

The theorem…

This is not new….

Page 41: Recap: perceptrons

[Figure: (3a) the guess v2 after the two positive examples, v2 = v1 + x2. (3b) the guess v2 after one positive and one negative example, v2 = v1 − x2.]

If mistake: yi vk . xi < 0

Page 42: Recap: perceptrons

This is new …. We’ve never considered averaging operations before.

Follows from: u . w(i,1) ≥ k1,i γ

Page 43: Recap: perceptrons

IH1 inductive case: [derivation shown on the slide; the steps are justified by IH1, assumption A1, distributing the sum, and the fact that the μ’s form a distribution.]

Page 44: Recap: perceptrons

IH2 proof is similar

IH1, IH2 together imply the bound (as in the usual perceptron case)


Page 45: Recap: perceptrons

What we know and don’t know

•  uniform mixing
•  could we lose our speedup-from-parallelizing to slower convergence?

Page 46: Recap: perceptrons

What we know and don’t know


Page 47: Recap: perceptrons

What we know and don’t know


Page 48: Recap: perceptrons

What we know and don’t know


Page 49: Recap: perceptrons

Regret analysis for on-line optimization


Page 50: Recap: perceptrons

2009


Page 51: Recap: perceptrons

1.  Take a gradient step: x’ = xt – ηt gt
2.  If you’ve restricted the parameters to a subspace X (e.g., they must be positive, …), find the closest point in X to x’: xt+1 = argmin_{x∈X} dist(x, x’)
3.  But… you might be using a “stale” g (from τ steps ago)

Here f is the loss function and x is the parameter vector.
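Steps 1–2 can be sketched in a few lines (illustrative; the nonnegativity constraint, step size, and values are made up for the example):

```python
import numpy as np

def projected_sgd_step(x, g, eta, project):
    """One delayed-SGD step: gradient step with a (possibly stale)
    gradient g, then project back onto the feasible set X."""
    return project(x - eta * g)

# X = {x : x >= 0}: the Euclidean projection just clips negatives to zero
project_nonneg = lambda z: np.maximum(z, 0.0)

x_t = np.array([0.5, 0.1])
g_stale = np.array([1.0, 1.0])  # a gradient computed tau steps ago
x_next = projected_sgd_step(x_t, g_stale, eta=0.5, project=project_nonneg)
```

Here both coordinates of the raw step (0.0 and −0.4) touch or cross the boundary, and the projection clips the second back to zero.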


Page 52: Recap: perceptrons

Regret: how much loss was incurred during learning, over and above the loss incurred with an optimal choice of x.

Special case:
•  ft is 1 if a mistake was made, 0 otherwise
•  ft(x*) = 0 for the optimal x*
•  Then regret = # mistakes made during learning

Page 53: Recap: perceptrons

Theorem: you can find a learning rate so that the regret of delayed SGD is bounded (bound shown on the slide), where T = # timesteps and τ = staleness > 0.

Page 54: Recap: perceptrons

Lemma: if you have to use information from at least τ time steps ago, then R[m] ≥ τ R[m/τ], where R[m] = regret after m instances. This is the worst case!

Page 55: Recap: perceptrons

Theorem 8: you can do better if you assume (1) examples are i.i.d. and (2) the gradients are smooth (analogous to the assumption about L). Then you can show a bound on expected regret in which the no-delay loss is the dominant term.

Page 56: Recap: perceptrons

Experiments?


Page 57: Recap: perceptrons

Experiments

But: this speedup required using a quadratic kernel, which took 1 ms/example.