Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors...

Transcript of Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors...

Online Learning, Games, Boosting

Arindam Banerjee

. – p.1

Page 2: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

Predicting with Expert Advice

Basic setting: n experts, boolean prediction

Prediction proceeds in trialsEach expert makes a boolean prediction“Master algorithm” combines the predictionsThe true label is revealed

No explicit notion of input

Master algorithms with “good” cumulative/relative loss

. – p.2

Page 3: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

The Halving Algorithm

If one of the experts is always known to be correct:

Always vote based on the majority

If the master is wrong, get rid of the experts that were wrong

Proposition: The master makes at most log2 n mistakes

. – p.3

Page 4: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

Weighted Majority (Simple)

Initialize the weights wi = 1, [i]n1

Each expert makes a boolean prediction zi ∈ {0, 1}, [i]n1

The master predicts 1 if

∑

i:zi=1

wi ≥∑

i:zi=0

wi ,

and 0 otherwise

The true label ` is received

Modify the weights as follows: if zi = `, no change; if zi 6= `,wi ← wi/2, [i]n1

Theorem: The master makes at most 2.41(m + log2 n) mistakes,where m is the number of mistakes by the best expert so far

. – p.4

Page 5: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

Weighted Majority (Randomized)

Initialize the weights wi = 1, [i]n1

Each expert makes a boolean prediction zi ∈ {0, 1}, [i]n1

The master predicts zi with probability wi/W , where W =∑

i wi

The true label ` is received

Modify the weights as follows: if zi = `, no change; if zi 6= `,wi ← βwi, [i]n1 for β ∈ [0, 1]

Theorem: The expected number of mistakes the master makes is atmost m ln(1/β)+ln n

1−β , where m is the number of mistakes by the bestexpert so far

Corollary: If β is adjusted dynamically, or, if the total number of trialsis known upfront, the expected number of mistakes is at mostm + ln n + O(

√m lnn)

. – p.5

Page 6: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

Extensions

The range of prediction is more general, i.e., not necessarilyboolean

An explicit notion of input exists

The adversary is not arbitrary, e.g., choosing labels from afunction, a fixed distribution, etc.

Labels from a function with random noise

Learning (boolean) functions

Realizable learning

Agnostic learning

Random Noise

. – p.6

Game Theory

Rock Paper ScissorsRock 1

2 1 0Paper 0 1

2 1Scissors 1 0 1

2

The loss matrix M for the row player

The actual game has payoffs 2(M − 12 )

Pure strategies: Rock, Paper, Scissors

Randomized playRow player chooses distribution P

Column player chooses distribution Q

The expected loss for the row player

M(P, Q) = P T MQ =∑

ij

P (i)Q(j)M(i, j)

. – p.7

Page 8: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

Sequential Play

Row player chooses a (randomized) strategy P

Given P , column player chooses a strategy Q such that the lossof row player is maximized, i.e., maxQ M(P, Q)

Row player chooses P upfront to minimize the maximum loss

minP

maxQ

M(P, Q)

If column player chooses first, then the loss of the row player

maxQ

minP

M(P, Q)

“Obviously” maxQ

minP

M(P, Q) ≤ minP

maxQ

M(P, Q)

Von Neumann says the following:

maxQ

minP

M(P, Q) = minP

maxQ

M(P, Q)

. – p.8

Page 9: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

Repeated Play

Row is the learner, Column is the adversaryThe game is being played repeatedly t = 1, . . . , T

On round t, learner chooses Pt

Adversary chooses Qt

Learner gets losses M(i, Qt) for individual strategiesLearner suffers loss M(Pt, Qt)

Learner’s algorithm (Hedging)Initialize the weights w1(i) = 1, [i]n1At round t, choose

Pt(i) =wt(i)

∑

i wt(i)

Upon receiving loss M(i, Qt), update

wt+1(i) = wt(i)βM(i,Qt)

. – p.9

Page 10: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

Main Results

Theorem: The sequence of strategies P1, . . . , PT , chosen by thealgorithm satisfies

T∑

t=1

M(Pt, Qt) ≤ aβminP

T∑

t=1

M(P, Qt) + cβ ln n

where aβ = ln(1/β)1−β and cβ = 1

1−β

Corollary: With β = 1

1+√

2 ln n

T

, the average per-trial loss satisfies

1

T

T∑

t=1

M(Pt, Qt) ≤ minP

1

T

T∑

t=1

M(P, Qt) + ∆T

where ∆T = O

(

√

ln nT

)

. – p.10

Page 11: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

Learning in Games

Corollary: If v is the value of the game

1

T

T∑

t=1

M(Pt, Qt) ≤ v + ∆T

A (new) proof of the minimax theorem

minP

maxQ

PT MQ ≤ maxQ

1

T

∑

t

PTt MQ

≤ 1

T

∑

t

maxQ

PTt MQ =

1

T

∑

t

PTt MQt

≤ minP

1

T

∑

t

PT MQt + ∆t

≤ maxQ

minP

PT MQ + ∆t

. – p.11

Page 12: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

Weak Learning

A weak learner predicts slightly better than random

The PAC setting (basics)Let X be a instance space, c : X 7→ {0, 1} be a targetconcept, H be a hypothesis space (h : X 7→ {0, 1})Let Q be a fixed but unknown distribution on X

An algorithm, after training on (xi, c(xi)), [i]m1 , selects h ∈ H

such that

Px∼Q[h(x) 6= c(x)] ≤ 1

2− γ

The algorithm is called a γ-weak learner

We assume the existence of such a learner

. – p.12

Page 13: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

The Boosting Model

Boosting converts a weak learner to a strong learner

Boosting proceeds in roundsBooster constructs Dt on X, the train setWeak learner produces a hypothesis ht ∈ H so thatPx∼Dt

[ht(x) 6= c(x)] ≤ 12 − γ

After T rounds, the weak hypotheses ht, [t]T1 are combined

into a final hypothesis hfin

We need proceduresfor obtaining Dt at each stepfor combining the weak hypotheses

. – p.13

Page 14: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

Minimax and Majority Votes

Consider the mistake matrix M(h, x) = 1 if h(x) 6= c(x) and 0otherwise

The row player can choose P over HThe column player can choose Q over XNow minP maxx M(P, x) = v = maxQ minh M(h, Q)

However M(h, Q) = Px∼Q[h(x) 6= c(x)]

Now ∃Q∗, M(h, Q∗) = Px∼Q∗ [h(x) 6= c(x)] ≥ v, ∀hFrom weak learning, ∃h, Px∼Q∗ [h(x) 6= c(x)] ≤ 1

2 − γ

Hence v ≤ 12 − γ

Also ∃P ∗ such that ∀x,

M(P ∗, x) = Ph∼P∗ [h(x) 6= c(x)] ≤ v ≤ 1

2− γ <

1

2

Majority voting according to P ∗ is equivalent to c

. – p.14

Page 15: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

The row player can choose P over H

The column player can choose Q over XNow minP maxx M(P, x) = v = maxQ minh M(h, Q)

2 − γ

M(P ∗, x) = Ph∼P∗ [h(x) 6= c(x)] ≤ v ≤ 1

2− γ <

1

2

. – p.14

Page 16: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

The row player can choose P over HThe column player can choose Q over X

Now minP maxx M(P, x) = v = maxQ minh M(h, Q)

2 − γ

M(P ∗, x) = Ph∼P∗ [h(x) 6= c(x)] ≤ v ≤ 1

2− γ <

1

2

. – p.14

Page 17: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

2 − γ

M(P ∗, x) = Ph∼P∗ [h(x) 6= c(x)] ≤ v ≤ 1

2− γ <

1

2

. – p.14

Page 18: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

2 − γ

M(P ∗, x) = Ph∼P∗ [h(x) 6= c(x)] ≤ v ≤ 1

2− γ <

1

2

. – p.14

Page 19: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

Now ∃Q∗, M(h, Q∗) = Px∼Q∗ [h(x) 6= c(x)] ≥ v, ∀h

From weak learning, ∃h, Px∼Q∗ [h(x) 6= c(x)] ≤ 12 − γ

M(P ∗, x) = Ph∼P∗ [h(x) 6= c(x)] ≤ v ≤ 1

2− γ <

1

2

. – p.14

Page 20: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

2 − γ

M(P ∗, x) = Ph∼P∗ [h(x) 6= c(x)] ≤ v ≤ 1

2− γ <

1

2

. – p.14

Page 21: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

2 − γ

M(P ∗, x) = Ph∼P∗ [h(x) 6= c(x)] ≤ v ≤ 1

2− γ <

1

2

. – p.14

Page 22: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

2 − γ

M(P ∗, x) = Ph∼P∗ [h(x) 6= c(x)] ≤ v ≤ 1

2− γ <

1

2

. – p.14

Page 23: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

2 − γ

M(P ∗, x) = Ph∼P∗ [h(x) 6= c(x)] ≤ v ≤ 1

2− γ <

1

2

. – p.14

Page 24: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

Putting it together

Boosting maintains distribution over data

Minimax argument requires distribution over hypothesis

Change the mistake matrix to M ′ = 1−MT so thatM ′(x, h) = 1−M(h, x) = 1 if h(x) = c(x) and 0 otherwise

On round t of boostingLet Dt = Pt from hedging over rows (data)Weak learner returns ht with Px∼Dt

[ht(x) = c(x)] ≥ 12 + γ

Update weights using Qt = ht by adversary

Let hfin be the majority vote of ht, [t]T1

. – p.15

Page 25: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

Putting it together

[ht(x) = c(x)] ≥ 12 + γ

. – p.15

Page 26: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

Putting it together

[ht(x) = c(x)] ≥ 12 + γ

. – p.15

Page 27: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

Putting it together

[ht(x) = c(x)] ≥ 12 + γ

. – p.15

Page 28: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

Putting it together

[ht(x) = c(x)] ≥ 12 + γ

. – p.15

Page 29: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

Analysis of Boosting

We have M ′(Pt, ht) = Px∼Pt[ht(x) = c(x)] ≥ 1

2 + γ

Hence, from Corollary,

1

2+ γ ≤ 1

T

∑

t

M ′(Pt, ht) ≤ minx

1

T

∑

t

M ′(x, ht) + ∆T

Therefore, ∀x (and sufficiently large T )

1

T

∑

t

M ′(x, ht) ≥1

2+ γ −∆T >

1

2

Hence, the majority vote hfin(x) = c(x), ∀x

But what happens on unseen data?

. – p.16

Page 30: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

2 + γ

1

2+ γ ≤ 1

T

∑

t

1

T

∑

t

M ′(x, ht) + ∆T

1

T

∑

t

M ′(x, ht) ≥1

2+ γ −∆T >

1

2

. – p.16

Page 31: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

2 + γ

1

2+ γ ≤ 1

T

∑

t

1

T

∑

t

M ′(x, ht) + ∆T

1

T

∑

t

M ′(x, ht) ≥1

2+ γ −∆T >

1

2

. – p.16

Page 32: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

2 + γ

1

2+ γ ≤ 1

T

∑

t

1

T

∑

t

M ′(x, ht) + ∆T

1

T

∑

t

M ′(x, ht) ≥1

2+ γ −∆T >

1

2

. – p.16

Page 33: Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors Rock 1 2 1 0 Paper 0 1 2 1 Scissors 1 0 1 2 The loss matrix M for the row player

2 + γ

1

2+ γ ≤ 1

T

∑

t

1

T

∑

t

M ′(x, ht) + ∆T

1

T

∑

t

M ′(x, ht) ≥1

2+ γ −∆T >

1

2

. – p.16

Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors...

Documents

Transcript of Online Learning, Games, Boostingbanerjee/Teaching/Spring06/online.pdfGame Theory Rock Paper Scissors...