Online Learning + Game Theory
Transcript of Online Learning + Game Theory
![Page 1: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/1.jpg)
Online Learning + Game Theory
Your guide: Avrim Blum, Carnegie Mellon University
Indo-US Lectures Week in Machine Learning, Game Theory, and Optimization
![Page 2: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/2.jpg)
Theme: Simple but powerful algorithms for online decision-making when you don’t have a lot of knowledge.
Nice connections between these algorithms and game-theoretic notions of optimality and equilibrium.
Start with something called the problem of “combining expert advice”.
![Page 3: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/3.jpg)
Using “expert” advice
Say we want to predict the stock market.
• We solicit n “experts” for their advice. (Will the market go up or down?)
• We then want to use their advice somehow to make our prediction.
Basic question: Is there a strategy that allows us to do nearly as well as the best of these in hindsight?
[“expert” = someone with an opinion. Not necessarily someone who knows anything.]
![Page 4: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/4.jpg)
Simpler question
• We have n “experts”.
• One of these is perfect (never makes a mistake). We just don’t know which one.
• Can we find a strategy that makes no more than lg(n) mistakes?
Answer: sure. Just take a majority vote over all experts that have been correct so far. Each mistake cuts the number of available experts by a factor of 2.
Note: this means it is ok for n to be very large.
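This halving scheme can be sketched in a few lines of Python (function and variable names here are illustrative, not from the lecture; predictions are arbitrary labels such as 'U'/'D'):

```python
# Halving-algorithm sketch: keep the set of experts that have been right on
# every round so far, predict by majority vote over that set, and cross off
# every expert that errs. With one perfect expert, mistakes <= lg(n).

def run_halving(expert_predictions, truths):
    """expert_predictions[t][i] = expert i's prediction at time t."""
    n = len(expert_predictions[0])
    alive = set(range(n))          # experts correct on every round so far
    mistakes = 0
    for preds, truth in zip(expert_predictions, truths):
        votes = [preds[i] for i in alive]
        guess = max(set(votes), key=votes.count)          # majority vote
        if guess != truth:
            mistakes += 1
        alive = {i for i in alive if preds[i] == truth}   # cross off wrong experts
    return mistakes
```

Each of our mistakes means the majority of the surviving experts was wrong, so at least half of them are removed.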
![Page 5: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/5.jpg)
What if no expert is perfect?
One idea: just run the above protocol until all experts are crossed off, then repeat.
This makes at most log(n) mistakes per mistake of the best expert (plus an initial log(n)).
Seems wasteful. Constantly forgetting what we've “learned”. Can we do better?
![Page 6: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/6.jpg)
What if no expert is perfect?
Intuition: Making a mistake doesn't completely disqualify an expert. So, instead of crossing off, just lower its weight.
Weighted Majority Alg:
– Start with all experts having weight 1.
– Predict based on a weighted majority vote.
– Penalize mistakes by cutting weight in half.
Example: Weights: 1 1 1 1. Predictions: U U U D. We predict: U. Truth: D. New weights: ½ ½ ½ 1.
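One step of Weighted Majority, matching the slide's example, might look like this (a minimal sketch with made-up names; predictions are 'U'/'D'):

```python
# Weighted Majority step: predict with the weighted majority vote, then halve
# the weight of every expert that predicted wrongly.

def weighted_majority_step(weights, preds, truth):
    up = sum(w for w, p in zip(weights, preds) if p == 'U')
    down = sum(w for w, p in zip(weights, preds) if p == 'D')
    guess = 'U' if up >= down else 'D'
    new_weights = [w / 2 if p != truth else w for w, p in zip(weights, preds)]
    return guess, new_weights
```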
![Page 7: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/7.jpg)
Analysis: do nearly as well as the best expert in hindsight
• M = # mistakes we've made so far.
• m = # mistakes the best expert has made so far.
• W = total weight (starts at n).
• After each of our mistakes, W drops by at least 25%. So, after M mistakes, W is at most n(3/4)^M.
• The weight of the best expert is (1/2)^m. So, (1/2)^m ≤ n(3/4)^M, which solves to M ≤ 2.4(m + lg n) (a constant ratio).
So, if m is small, then M is pretty small too.
![Page 8: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/8.jpg)
Randomized Weighted Majority
2.4(m + lg n) is not so good if the best expert makes a mistake 20% of the time. Can we do better? Yes.
• Instead of taking a majority vote, use the weights as probabilities. (E.g., if 70% of the weight is on up and 30% on down, then predict up:down with probability 70:30.) Idea: smooth out the worst case.
• Also, generalize the ½ penalty to 1−ε.
With M = expected # mistakes, this gives M ≤ (1+ε)m + ε⁻¹ln(n). Unlike most worst-case bounds, the numbers here are pretty good.
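A single step of the randomized version could be sketched as follows (illustrative names; the two changes from Weighted Majority are weight-proportional sampling and the 1−ε penalty):

```python
import random

# Randomized Weighted Majority step: treat the weights as a probability
# distribution over experts, follow a randomly drawn expert's prediction,
# and multiply each wrong expert's weight by (1 - eps).

def rwm_step(weights, preds, truth, eps, rng=random):
    total = sum(weights)
    probs = [w / total for w in weights]
    i = rng.choices(range(len(weights)), weights=probs)[0]
    guess = preds[i]                      # follow the sampled expert
    new_weights = [w * (1 - eps) if p != truth else w
                   for w, p in zip(weights, preds)]
    return guess, new_weights
```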
![Page 9: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/9.jpg)
Analysis
• Say at time t we have fraction F_t of the weight on experts that made a mistake.
• So, we have probability F_t of making a mistake, and we remove an εF_t fraction of the total weight.
  – W_final = n(1−εF_1)(1−εF_2)…
  – ln(W_final) = ln(n) + Σ_t ln(1−εF_t) ≤ ln(n) − ε Σ_t F_t (using ln(1−x) ≤ −x) = ln(n) − εM, since Σ_t F_t = E[# mistakes] = M.
• If the best expert makes m mistakes, then ln(W_final) ≥ ln((1−ε)^m) = m ln(1−ε).
• Now solve: ln(n) − εM ≥ m ln(1−ε).
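Solving this last inequality for M (a short derivation; it uses the standard estimate −ln(1−ε) ≤ ε + ε² for ε ≤ ½):

```latex
\[
\ln n - \epsilon M \;\ge\; m\,\ln(1-\epsilon)
\quad\Longrightarrow\quad
M \;\le\; \frac{\ln n}{\epsilon} + m\cdot\frac{-\ln(1-\epsilon)}{\epsilon}
\;\le\; \frac{\ln n}{\epsilon} + (1+\epsilon)\,m,
\]
```

which is the additive-regret bound used on the next slide (with OPT = m).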
![Page 10: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/10.jpg)
Additive regret
• So, we have M ≤ OPT + εOPT + ε⁻¹log(n), where OPT = m is the number of mistakes of the best expert and T = # time steps.
• If we set ε = (log(n)/OPT)^{1/2} to balance the two terms out (or use guess-and-double), we get a bound of M ≤ OPT + 2(OPT·log n)^{1/2} ≤ OPT + 2(T·log n)^{1/2}.
• These are called “additive regret” bounds: M/T ≤ OPT/T + 2(log(n)/T)^{1/2}, so the per-step gap over OPT/T vanishes as T grows (“no regret”).
![Page 11: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/11.jpg)
Extensions
• What if experts are actions? (Rows in a matrix game, ways to drive to work, …)
• At each time t, each action has a loss (cost) in {0,1}.
• Can still run the algorithm:
  – Rather than viewing it as “pick a prediction with probability proportional to its weight”,
  – view it as “pick an expert with probability proportional to its weight”.
  – The algorithm pays the expected cost.
• The same analysis applies: do nearly as well as the best action in hindsight!
![Page 12: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/12.jpg)
Extensions
• What if losses (costs) are in [0,1]?
• Just modify the update rule to w_i ← w_i(1−εc_i).
• The fraction of weight removed from the system is then ε·Σ_i p_i c_i, i.e., ε times the algorithm's expected cost.
• The analysis is very similar to the {0,1} case.
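The generalized update for real-valued costs can be sketched as follows (illustrative names; `fraction_removed` just checks the claim that an εΣp_i c_i fraction of the weight disappears each step):

```python
# Multiplicative-weights update for costs c_i in [0,1]:
# each weight is multiplied by (1 - eps * c_i).

def mw_update(weights, costs, eps):
    return [w * (1 - eps * c) for w, c in zip(weights, costs)]

def fraction_removed(weights, costs, eps):
    # eps times the algorithm's expected cost under the current weights
    total = sum(weights)
    return eps * sum((w / total) * c for w, c in zip(weights, costs))
```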
![Page 13: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/13.jpg)
World – life – opponent
RWM (multiplicative weights alg)
[Figure: the world presents cost columns c^1, c^2, … (scaled so costs lie in [0,1]); RWM starts each row at weight 1 and updates it to (1−εc_i^1), then (1−εc_i^1)(1−εc_i^2), and so on.]
Guarantee: do nearly as well as the best fixed row in hindsight, which implies doing nearly as well as (or better than) minimax optimal.
![Page 14: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/14.jpg)
Connections to minimax optimality
World – life – opponent
[Figure: the same RWM weight-update picture as before, with cost columns scaled to [0,1].]
If we play RWM against a best-response oracle, its distributions will approach minimax optimality (most of them will be close). (If they didn’t, RWM wouldn’t be getting its promised guarantee.)
![Page 15: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/15.jpg)
Connections to minimax optimality
World – life – opponent
[Figure: the same RWM weight-update picture as before, with cost columns scaled to [0,1].]
If we play two RWM algorithms against each other, then their empirical distributions must be near-minimax-optimal. (Else, one or the other could and would take advantage.)
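This self-play claim is easy to watch numerically. Below is a minimal sketch (illustrative names, deliberately biased starting weights): two RWM learners play rock-paper-scissors against each other's mixed strategies, with losses scaled to [0,1]; the time-averaged strategy should approach the minimax play (1/3, 1/3, 1/3).

```python
# Row player's loss, scaled to [0,1]; columns are R, P, S.
LOSS = [[0.5, 1.0, 0.0],
        [0.0, 0.5, 1.0],
        [1.0, 0.0, 0.5]]

def rwm_selfplay(T=4000, eps=0.02):
    w1, w2 = [1.0, 2.0, 4.0], [4.0, 1.0, 2.0]   # biased starting weights
    avg1 = [0.0, 0.0, 0.0]                      # time-average of row's strategy
    for _ in range(T):
        p1 = [w / sum(w1) for w in w1]
        p2 = [w / sum(w2) for w in w2]
        avg1 = [a + p / T for a, p in zip(avg1, p1)]
        # expected losses of each action against the opponent's current mix
        c1 = [sum(LOSS[a][b] * p2[b] for b in range(3)) for a in range(3)]
        c2 = [sum((1 - LOSS[a][b]) * p1[a] for a in range(3)) for b in range(3)]
        w1 = [w * (1 - eps * c) for w, c in zip(w1, c1)]   # RWM update
        w2 = [w * (1 - eps * c) for w, c in zip(w2, c2)]
    return avg1
```

The per-round strategies themselves cycle; it is the empirical average that settles near minimax, exactly as the slide says.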
![Page 16: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/16.jpg)
A natural generalization
A natural generalization of our regret goal (thinking of driving) is: what if we also want that on rainy days, we do nearly as well as the best route for rainy days? And on Mondays, do nearly as well as the best route for Mondays?
More generally, we have N “rules” (e.g., “on Monday, use path P”). Goal: simultaneously, for each rule i, guarantee to do nearly as well as it on the time steps in which it fires.
For all i, we want E[cost_i(alg)] ≤ (1+ε)cost_i(i) + O(ε⁻¹log N).
(cost_i(X) = cost of X on the time steps where rule i fires.)
Can we get this?
![Page 17: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/17.jpg)
A natural generalization
This generalization is especially natural in machine learning, for combining multiple if-then rules. E.g., document classification. Rule: “if <word-X> appears then predict <Y>”. E.g., if the document contains “football” then classify it as sports.
So, if 90% of documents containing “football” are about sports, we should have error ≤ 11% on them.
This is the “specialists” or “sleeping experts” problem.
Assume we have N rules. For all i, we want E[cost_i(alg)] ≤ (1+ε)cost_i(i) + O(ε⁻¹log N).
(cost_i(X) = cost of X on the time steps where rule i fires.)
![Page 18: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/18.jpg)
A simple algorithm and analysis (all on one slide)
Start with all rules at weight 1. At each time step, among the rules i that fire, select one with probability p_i ∝ w_i. Update the weights:
– If a rule didn’t fire, leave its weight alone.
– If it did fire, raise or lower its weight depending on its performance compared to the weighted average:
  r_i = [Σ_j p_j cost(j)]/(1+ε) − cost(i)
  w_i ← w_i(1+ε)^{r_i}
So, if rule i does exactly as well as the weighted average, its weight drops a little. Its weight increases if it does better than the weighted average by more than a (1+ε) factor. This ensures the sum of weights doesn’t increase.
Final w_i = (1+ε)^{E[cost_i(alg)]/(1+ε) − cost_i(i)}. So, the exponent is at most ε⁻¹log N.
So, E[cost_i(alg)] ≤ (1+ε)cost_i(i) + O(ε⁻¹log N).
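One step of this sleeping-experts update can be sketched as follows (illustrative names; `fired` is the set of indices of rules that fire this step, and costs lie in [0,1]):

```python
# Sleeping-experts step: among the rules that fire, play i with probability
# proportional to w_i, then reweight each firing rule by (1+eps)^(r_i), where
# r_i compares the weighted-average cost (shrunk by 1/(1+eps)) to rule i's cost.

def sleeping_experts_step(weights, fired, costs, eps):
    total = sum(weights[i] for i in fired)
    probs = {i: weights[i] / total for i in fired}
    avg_cost = sum(probs[i] * costs[i] for i in fired)
    new_w = list(weights)                      # non-firing rules keep their weight
    for i in fired:
        r_i = avg_cost / (1 + eps) - costs[i]
        new_w[i] = weights[i] * (1 + eps) ** r_i
    return probs, new_w
```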
![Page 19: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/19.jpg)
Application: adapting to change
What if we want to adapt to change, i.e., do nearly as well as the best recent expert?
For each expert, instantiate a copy who wakes up on day t, for each 0 ≤ t ≤ T−1. Then our cost over the previous t days is at most (1+ε)·(cost of the best expert over the last t days) + O(ε⁻¹log(NT)).
(Not the best possible bound, since it has an extra log(T), but not bad.)
![Page 20: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/20.jpg)
[ACFS02]: applying RWM to the bandit setting
What if we only get our own cost/benefit as feedback?
Use RWM as a subroutine to get an algorithm with cumulative regret O((TN log N)^{1/2}). [Average regret O(((N log N)/T)^{1/2}).]
We'll do a somewhat weaker version of their analysis (same algorithm, but not as tight a bound).
For fun, we'll talk about it in the context of online pricing…
![Page 21: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/21.jpg)
Online pricing
• Say you are selling lemonade (or a cool new software tool, or bottles of water at the world cup).
• For t = 1, 2, …, T:
  – Seller sets a price p_t.
  – Buyer arrives with valuation v_t.
  – If v_t ≥ p_t, the buyer purchases and pays p_t; else doesn’t.
• Assume all valuations ≤ h.
• Goal: do nearly as well as the best fixed price in hindsight.
• View each possible price as a different row/expert.
• If v_t is revealed, we can run RWM: E[gain] ≥ OPT(1−ε) − O(ε⁻¹ h log n).
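The full-information case can be sketched directly (illustrative names; here each candidate price is an expert, price p earns p when p ≤ v_t, and since this is a gains setting the weights are multiplied by (1+ε)^{gain/h} rather than cut):

```python
# RWM over prices with revealed valuations. Gains lie in [0, h].

def price_gains(prices, valuation):
    return [p if p <= valuation else 0.0 for p in prices]

def rwm_pricing(prices, valuations, eps=0.1, h=1.0):
    w = [1.0] * len(prices)
    expected_gain = 0.0
    for v in valuations:
        g = price_gains(prices, v)
        total = sum(w)
        expected_gain += sum(wi * gi for wi, gi in zip(w, g)) / total
        w = [wi * (1 + eps) ** (gi / h) for wi, gi in zip(w, g)]
    # best fixed price in hindsight
    best_fixed = max(sum(gs) for gs in
                     zip(*(price_gains(prices, v) for v in valuations)))
    return expected_gain, best_fixed
```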
![Page 22: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/22.jpg)
Multi-armed bandit problem
Exponential Weights for Exploration and Exploitation (Exp3) [Auer, Cesa-Bianchi, Freund, Schapire]
[Figure: RWM (n = # experts) runs inside Exp3. RWM outputs a distribution p^t; Exp3 plays expert i ~ q^t, where q^t = (1−γ)p^t + γ·uniform. On receiving gain g_i^t, Exp3 feeds RWM the estimated gain vector ĝ^t = (0, …, 0, g_i^t/q_i^t, 0, …, 0), whose entries are at most nh/γ.]
1. RWM believes its gain is p^t · ĝ^t = p_i^t(g_i^t/q_i^t) ≡ g^t_RWM.
2. So Σ_t g^t_RWM ≥ (1−ε)·max_j Σ_t ĝ_j^t − O(ε⁻¹(nh/γ)log n).
3. The actual gain is g_i^t = g^t_RWM·(q_i^t/p_i^t) ≥ g^t_RWM·(1−γ).
4. E[max_j Σ_t ĝ_j^t] ≥ OPT, because E[ĝ_j^t] = (1−q_j^t)·0 + q_j^t(g_j^t/q_j^t) = g_j^t, so E[max_j Σ_t ĝ_j^t] ≥ max_j E[Σ_t ĝ_j^t] = OPT.
![Page 23: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/23.jpg)
Multi-armed bandit problem
Exponential Weights for Exploration and Exploitation (Exp3) [Auer, Cesa-Bianchi, Freund, Schapire]
[Figure: the same RWM-inside-Exp3 picture as before: q^t = (1−γ)p^t + γ·uniform, ĝ^t = (0, …, 0, g_i^t/q_i^t, 0, …, 0), with entries at most nh/γ.]
Conclusion (setting γ = ε): E[Exp3] ≥ OPT(1−ε)² − O(ε⁻² nh log(n)).
Balancing the two terms would give O((OPT·nh·log n)^{2/3}) in the bound because of the ε⁻². But this can be reduced to ε⁻¹, giving O((OPT·nh·log n)^{1/2}), with more care in the analysis.
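Putting the pieces together, a minimal Exp3 sketch looks like this (illustrative names; `gain_fn(i, t)` returns the gain of arm i at time t, observed only for the arm actually pulled, and gains are assumed to lie in [0,1]):

```python
import random

# Exp3 sketch: mix RWM's distribution with uniform exploration (rate gamma),
# and feed RWM an importance-weighted gain estimate that is nonzero only on
# the arm actually pulled.

def exp3(gain_fn, n, T, eps=0.1, gamma=0.1, seed=0):
    rng = random.Random(seed)
    w = [1.0] * n
    total_gain = 0.0
    for t in range(T):
        p = [wi / sum(w) for wi in w]
        q = [(1 - gamma) * pi + gamma / n for pi in p]   # forced exploration
        i = rng.choices(range(n), weights=q)[0]
        g = gain_fn(i, t)                 # only arm i's gain is observed
        total_gain += g
        ghat = g / q[i]                   # unbiased estimate of arm i's gain
        w[i] *= (1 + eps) ** ghat         # RWM update on the estimate
    return total_gain
```

With one clearly best arm, the weights concentrate on it quickly and the realized gain approaches that arm's total.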
![Page 24: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/24.jpg)
Summary (of results so far)
Algorithms for online decision-making with strong guarantees on performance compared to the best fixed choice.
• Application: play a repeated game against an adversary. Perform nearly as well as the best fixed strategy in hindsight.
Can apply even with very limited feedback.
• Application: choosing which way to drive to work, with feedback only about your own paths; online pricing, even with only buy/no-buy feedback.
![Page 25: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/25.jpg)
Internal/Swap Regret and
Correlated Equilibria
![Page 26: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/26.jpg)
What if all players minimize regret?
In zero-sum games, empirical frequencies quickly approach minimax optimality.
In general-sum games, does behavior quickly (or at all) approach a Nash equilibrium?
After all, a Nash equilibrium is exactly a set of distributions that are no-regret with respect to each other. So if the distributions stabilize, they must converge to a Nash equilibrium.
Well, unfortunately, they might not stabilize.
![Page 27: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/27.jpg)
An interesting bad example
• [Balcan-Constantin-Mehta12]:
  – Failure to converge even in rank-1 games (games where R+C has rank 1).
  – Interesting because one can find equilibria efficiently in such games.
![Page 28: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/28.jpg)
What can we say?
If algorithms minimize “internal” or “swap” regret, then the empirical distribution of play approaches correlated equilibrium. [Foster & Vohra, Hart & Mas-Colell, …] Though this doesn’t imply play is stabilizing.
What are internal/swap regret and correlated equilibria?
![Page 29: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/29.jpg)
More general forms of regret
1. “Best expert” or “external” regret:
  – Given n strategies, compete with the best of them in hindsight.
2. “Sleeping expert” or “regret with time-intervals”:
  – Given n strategies and k properties, let S_i be the set of days satisfying property i (these might overlap). Want to simultaneously achieve low regret over each S_i.
3. “Internal” or “swap” regret: like (2), except that S_i = the set of days in which we chose strategy i.
![Page 30: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/30.jpg)
Internal/swap-regret
• E.g., each day we pick one stock to buy shares in.
  – We don’t want to have regret of the form “every time I bought IBM, I should have bought Microsoft instead”.
• Formally, swap regret is with respect to the optimal function f: {1,…,n} → {1,…,n} such that every time you played action j, it plays f(j).
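The definition can be computed directly for a finished play sequence (a sketch with illustrative names; for each action j, all of j's plays are rerouted to the single best alternative, which yields the optimal f):

```python
# Swap regret of a play/cost history.
# plays[t]    = action chosen at time t
# costs[t][i] = cost of action i at time t

def swap_regret(plays, costs):
    n = len(costs[0])
    actual = sum(costs[t][plays[t]] for t in range(len(plays)))
    swapped = 0.0
    for j in range(n):
        times = [t for t in range(len(plays)) if plays[t] == j]
        if times:
            # best single action f(j) to substitute on j's time steps
            swapped += min(sum(costs[t][i] for t in times) for i in range(n))
    return actual - swapped
```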
![Page 31: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/31.jpg)
Weird… why care?
“Correlated equilibrium”: a distribution over entries in the matrix, such that if a trusted party chooses one at random and tells you your part, you have no incentive to deviate.
• E.g., the Shapley game (row payoff, column payoff):

        R       P       S
  R   -1,-1   -1,1    1,-1
  P    1,-1   -1,-1   -1,1
  S   -1,1    1,-1   -1,-1

In general-sum games, if all players have low swap-regret, then the empirical distribution of play is an approximate correlated equilibrium.
![Page 32: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/32.jpg)
Connection
• If all parties run a low swap-regret algorithm, then the empirical distribution of play is an approximate correlated equilibrium.
  – The correlator chooses a random time t ∈ {1,2,…,T} and tells each player to play the action j they played at time t (but does not reveal the value of t).
  – Expected incentive to deviate: Σ_j Pr(j)·(Regret | j) = the swap-regret of the algorithm.
  – So, this suggests correlated equilibria may be natural things to see in multi-agent systems where individuals are optimizing for themselves.
![Page 33: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/33.jpg)
Correlated vs Coarse-correlated Eq
In both cases: a distribution over entries in the matrix. Think of a third party choosing from this distribution and telling you your part as “advice”.
“Correlated equilibrium”:
• You have no incentive to deviate, even after seeing what the advice is.
“Coarse-correlated equilibrium”:
• If your only choice is to see and follow the advice, or not to see it at all, you would prefer the former.
Low external regret ⇒ approximate coarse-correlated equilibrium.
![Page 34: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/34.jpg)
Internal/swap-regret, contd.
Algorithms for achieving low regret of this form:
– Foster & Vohra, Hart & Mas-Colell, Fudenberg & Levine.
– We'll present the method of [BM05], showing how to convert any “best expert” algorithm into one achieving low swap regret.
– Unfortunately, the # of steps needed to achieve low swap regret is O(n log n) rather than O(log n).
![Page 35: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/35.jpg)
Can convert any “best expert” algorithm A into one achieving low swap regret. Idea:
– Instantiate one copy A_j responsible for our expected regret over the times we play j.
[Figure: copies A_1, …, A_n output distributions q_1, …, q_n, the rows of a matrix Q; the master Alg plays p = pQ, receives the cost vector c, and gives each copy A_j the scaled feedback p_j·c.]
– This allows us to view p_j as the probability we play action j, or as the probability we play algorithm A_j.
– Give A_j feedback of p_j·c. Then A_j guarantees Σ_t (p_j^t c^t)·q_j^t ≤ min_i Σ_t p_j^t c_i^t + [regret term].
– Write this as: Σ_t p_j^t (q_j^t · c^t) ≤ min_i Σ_t p_j^t c_i^t + [regret term].
![Page 36: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/36.jpg)
Can convert any “best expert” algorithm A into one achieving low swap regret. Idea:
– Instantiate one copy A_j responsible for our expected regret over the times we play j.
[Figure: the same master/copies picture as before: copies A_1, …, A_n output rows q_1, …, q_n of Q; the master plays p = pQ and gives A_j feedback p_j·c.]
– We had: Σ_t p_j^t (q_j^t · c^t) ≤ min_i Σ_t p_j^t c_i^t + [regret term].
– Sum over j to get: Σ_t p^t Q^t c^t ≤ Σ_j min_i Σ_t p_j^t c_i^t + n·[regret term].
The left-hand side is our total cost; on the right, for each j we can move our probability to its own best i = f(j).
![Page 37: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/37.jpg)
Can convert any “best expert” algorithm A into one achieving low swap regret. Idea:
– Instantiate one copy A_j responsible for our expected regret over the times we play j.
[Figure: the same master/copies picture as before.]
– Summing over j gives: Σ_t p^t Q^t c^t ≤ Σ_j min_i Σ_t p_j^t c_i^t + n·[regret term].
The left-hand side is our total cost; on the right, for each j we can move our probability to its own best i = f(j).
– So we get swap-regret at most n times the original external regret.
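The one nontrivial computational step in the master is solving p = pQ for the play distribution, i.e., finding a stationary distribution of the row-stochastic matrix Q. A minimal sketch (illustrative names; power iteration, assuming Q mixes, as it does when each copy's distribution has full support):

```python
# Find p with p = pQ by power iteration, where Q's rows are the copies'
# distributions q_1, ..., q_n (each row sums to 1).

def master_distribution(Q, iters=200):
    n = len(Q)
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [sum(p[j] * Q[j][i] for j in range(n)) for i in range(n)]
    return p
```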
![Page 38: Online Learning + Game Theory](https://reader035.fdocuments.us/reader035/viewer/2022081505/56816724550346895ddbb066/html5/thumbnails/38.jpg)
Summary (game theory connection)
Zero-sum game: two players minimizing external regret ⇒ minimax optimality.
General-sum game: players minimizing external regret ⇒ coarse-correlated equilibrium.
General-sum game: players minimizing internal regret ⇒ correlated equilibrium.
These algorithms can also be useful for fast approximately optimal solutions to optimization problems.