Learning in Multiagent Systems
Jose M Vidal
Department of Computer Science and Engineering, University of South Carolina
February 11, 2010
Abstract
We introduce the topic of learning in multiagent systems and present recent results.
Outline
1 Introduction
2 Cooperative Learning
3 Learning in Games: Fictitious Play, Replicator Dynamics, The AWESOME Algorithm
4 Stochastic Games: Reinforcement Learning
5 General Theories for Learning Agents: CLRI Theory, N-Level Agents
6 Collective Intelligence
Introduction

The Learning Problem

[Figure: positive (+) and negative (−) training examples plotted on Weight vs. Speed axes.]
The Multiagent Learning Problem

[Figure: the same positive (+) and negative (−) examples on the Weight vs. Speed axes.]
Cooperative Learning
Sharing Learned Knowledge
Fairly easy with identical agent abilities.
Largely unexplored for heterogeneous agents.
Generalizing learned knowledge seems very domain-specific (the induction problem again).
Learning in Games
           j
           C      D
i    A    0,0    5,1
     B   -1,6    1,5
Fictitious Play
Weight Function
k_i^t(s_j) = k_i^{t-1}(s_j) + \begin{cases} 1 & \text{if } s_j^{t-1} = s_j, \\ 0 & \text{if } s_j^{t-1} \neq s_j. \end{cases}
Model of Opponent
\Pr_i^t[s_j] = \frac{k_i^t(s_j)}{\sum_{s_j' \in S_j} k_i^t(s_j')}.
Best Response
Player i then determines the strategy that will give it the highest expected utility given that j will play each of its s_j ∈ S_j with probability Pr_i^t[s_j].
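To make the update concrete, here is a minimal sketch of two-player fictitious play in Python. It is an illustration under assumed conventions (numpy arrays for the payoff matrices, one pseudo-count per action as the initial weights), not code from the slides.

import numpy as np

def fictitious_play(payoff_i, payoff_j, rounds=100):
    # Each player best-responds to the empirical frequencies (the weight
    # function k, normalized) of the opponent's past actions.
    n_i, n_j = payoff_i.shape
    k_i = np.ones(n_j)  # i's counts of j's actions (one pseudo-count each)
    k_j = np.ones(n_i)  # j's counts of i's actions
    for _ in range(rounds):
        pr_j = k_i / k_i.sum()  # i's model of j: Pr_i[s_j]
        pr_i = k_j / k_j.sum()  # j's model of i
        a_i = int(np.argmax(payoff_i @ pr_j))    # i's best response
        a_j = int(np.argmax(payoff_j.T @ pr_i))  # j's best response
        k_i[a_j] += 1  # update the weight functions with observed play
        k_j[a_i] += 1
    return k_i / k_i.sum(), k_j / k_j.sum()

# The 2x2 game from the example below: i picks rows A,B; j picks columns C,D.
u_i = np.array([[0, 1], [1, 0]])
u_j = np.array([[0, 2], [2, 0]])
print(fictitious_play(u_i, u_j))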
Example

           j
           C      D
i    A    0,0    1,2
     B    1,2    0,0

s_i   s_j   k_i(C)   k_i(D)   Pr_i[C]   Pr_i[D]
 A     C      1        0         1         0
 B     D      1        1        1/2       1/2
 A     D      1        2        1/3       2/3
 A     D      1        3        1/4       3/4
 A     D      1        4        1/5       4/5
Theorem (Nash Equilibrium is Attractor to Fictitious Play)

If s is a strict Nash equilibrium and it is played at time t then it will be played at all times greater than t.
Theorem (Fictitious Play Converges to Nash)

If fictitious play converges to a pure strategy then that strategy must be a Nash equilibrium.
Infinite Cycle Example

           j
           C      D
i    A    0,0    1,1
     B    1,1    0,0

s_i   s_j   k_i(C)   k_i(D)   k_j(A)   k_j(B)
              1       1.5       1       1.5
 A     C      2       1.5       2       1.5
 B     D      2       2.5       2       2.5
 A     C      3       2.5       3       2.5
 B     D      3       3.5       3       3.5
Replicator Dynamics
Fraction of Agents Playing s
Let φ^t(s) be the number of agents using strategy s at time t. We can then define
\theta^t(s) = \frac{\varphi^t(s)}{\sum_{s' \in S} \varphi^t(s')}
Expected Utility for Playing s
u^t(s) \equiv \sum_{s' \in S} \theta^t(s') \, u(s, s'),
Reproduction Rate
\varphi^{t+1}(s) = \varphi^t(s) \left(1 + u^t(s)\right).
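The three equations above translate directly into a short simulation. The following sketch is illustrative: the game matrix is the symmetric a/b/c game shown later in this section, and the initial population counts are made up.

import numpy as np

def replicator_step(phi, u):
    # phi^{t+1}(s) = phi^t(s) * (1 + u^t(s)), where
    # u^t(s) = sum_{s'} theta^t(s') u(s, s') and theta are population fractions.
    theta = phi / phi.sum()
    expected = u @ theta
    return phi * (1 + expected)

u = np.array([[1, 2, 0],
              [0, 1, 2],
              [2, 0, 1]])            # symmetric a/b/c game from this section
phi = np.array([60.0, 25.0, 15.0])   # illustrative initial counts
for _ in range(100):
    phi = replicator_step(phi, u)
    phi *= 100.0 / phi.sum()         # scale population size back, as noted below
print(phi / phi.sum())               # fractions playing a, b, c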
Population Dynamics
Population size could change, but we scale it back or ignore the change.
The game must be symmetric.
A stable population of more than one strategy corresponds to a mixed strategy.
Theorem (Nash Equilibrium is a Steady State)

Every Nash equilibrium is a steady state for the replicator dynamics.

Proof.
By contradiction. If an agent had a pure strategy that would return a higher utility than any other strategy then this strategy would be a best response to the Nash equilibrium. If this strategy was different from the Nash equilibrium then we would have a best response to the equilibrium which is not the equilibrium, so the system could not be at a Nash equilibrium.
Theorem (Stable Steady State is a Nash Equilibrium)

A stable steady state of the replicator dynamics is a Nash equilibrium. A stable steady state is one that, after suffering a small perturbation, is pushed back to the same steady state by the system's dynamics.
Theorem (Asymptotically Stable is Trembling-Hand Nash)

An asymptotically stable steady state corresponds to a Nash equilibrium that is trembling-hand perfect and isolated. That is, the stable steady states are a refinement of Nash equilibria: only a few Nash equilibria are stable steady states.
Definition (Evolutionary Stable Strategy)

An ESS is an equilibrium strategy that can overcome the presence of a small number of invaders. That is, if the equilibrium strategy profile is ω and a small number ε of invaders start playing ω′, then ESS requires that the existing population get a higher payoff against the new mixture (εω′ + (1 − ε)ω) than the invaders get.
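The definition can be checked numerically. Below is a minimal sketch; the Hawk-Dove payoff matrix and the invader strategies are assumptions for illustration.

import numpy as np

def is_ess_against(u, w, w_inv, eps=0.01):
    # ESS condition: the incumbent w earns more against the post-invasion
    # mixture eps*w' + (1-eps)*w than the invader w' does.
    mix = eps * w_inv + (1 - eps) * w
    return w @ u @ mix > w_inv @ u @ mix

# Hawk-Dove with V=2, C=4 (assumed): mixed ESS plays Hawk with prob V/C = 1/2.
u = np.array([[-1.0, 2.0],
              [ 0.0, 1.0]])
ess = np.array([0.5, 0.5])
print(is_ess_against(u, ess, np.array([1.0, 0.0])))  # all-Hawk invader: True
print(is_ess_against(u, ess, np.array([0.0, 1.0])))  # all-Dove invader: True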
Theorem (ESS is Steady State of Replicator Dynamics)

An ESS is an asymptotically stable steady state of the replicator dynamics. However, the converse need not be true: a stable state in the replicator dynamics does not need to be an ESS.
           j
           a      b      c
i    a    1,1    2,2    0,0
     b    0,0    1,1    2,2
     c    2,2    0,0    1,1

[Figure: simplex over the strategies a, b, c illustrating the dynamics.]
The AWESOME Algorithm
AWESOME
1  play-eq, play-sta ← true; eq-rej ← false; φ ← π_i; t ← 0
2  while play-sta
3    do play φ for N times in a row (an epoch)
4       ∀j update s_j given what they played in these N rounds
5       if play-eq
6         then if some player j has max_a |s_j(a) − π_j(a)| > ε_e
7                then eq-rej ← true; φ ← random action
8         else if ¬eq-rej ∧ ∃j max_a |s_j^old(a) − s_j(a)| > ε_s
9                then play-sta ← false
10      eq-rej ← false
11      b ← arg max_a u_i(a, s_−i)
12      if u_i(b, s_−i) > u_i(φ, s_−i) + n|A_i| ε_s^{t+1} µ
13        then φ ← b
14      ∀j s_j^old ← s_j
15      t ← t + 1
16 goto 1
The Schedule
In order for the algorithm to always converge, ε_e and ε_s must be decreased and N must be increased over time using a schedule where

1 ε_s and ε_e decrease monotonically to 0,
2 N increases to infinity,
3 \prod_{t=1}^{\infty} \left(1 - \sum_i \frac{|A_i|}{N^t (\epsilon_s^t)^2}\right) > 0,
4 \prod_{t=1}^{\infty} \left(1 - \sum_i \frac{|A_i|}{N^t (\epsilon_e^t)^2}\right) > 0.
It Converges
Theorem (AWESOME converges)

With a valid schedule, the AWESOME algorithm converges to a best response if all the other players play fixed strategies, and to a Nash equilibrium if all the other players are AWESOME players.
Stochastic Games
What is a Stochastic Game?
One where the agents do not know the payoff they might get.
That is, an unexplored MDP.
Reinforcement Learning
Reinforcement Learning Problem Definition

s_t is a state, taken from S,
a_t is an action, taken from A,
P(s_{t+1} | s_t, a_t) is the state transition function,
r(s_t, a_t) → ℝ is the reward function.

The problem is to find the policy π(s) → a which maximizes the discounted successive rewards r_t the agent receives when using π. That is, find

\pi^* = \arg\max_{\pi} \sum_{i=0}^{\infty} \gamma^i r_i
Q-learning
1  ∀s ∀a Q(s,a) ← 0; λ ← 1; ε ← 1
2  s ← current state
3  if rand() < ε                      (ε is the exploration rate)
4    then a ← random action
5    else a ← arg max_a Q(s,a)
6  Take action a
7  Receive reward r
8  s′ ← current state
9  Q(s,a) ← λ(r + γ max_{a′} Q(s′,a′)) + (1 − λ)Q(s,a)
10 λ ← .99λ
11 ε ← .98ε
12 goto 2
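The pseudocode maps almost line-for-line onto runnable Python. The sketch below runs tabular Q-learning on a toy 5-state chain; the environment and decay constants are invented for illustration, while the update rule is the one above.

import random

N_STATES, ACTIONS, GAMMA = 5, (-1, +1), 0.9   # toy chain MDP (invented)
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
lam, eps = 1.0, 1.0                           # learning and exploration rates
s = 0
for _ in range(5000):
    if random.random() < eps:                 # explore
        a = random.choice(ACTIONS)
    else:                                     # exploit: a <- argmax_a Q(s,a)
        a = max(ACTIONS, key=lambda x: Q[(s, x)])
    s2 = min(max(s + a, 0), N_STATES - 1)     # take action a
    r = 1.0 if s2 == N_STATES - 1 else 0.0    # reward only at the right end
    best_next = max(Q[(s2, x)] for x in ACTIONS)
    # Q(s,a) <- lam*(r + gamma*max_a' Q(s',a')) + (1-lam)*Q(s,a)
    Q[(s, a)] = lam * (r + GAMMA * best_next) + (1 - lam) * Q[(s, a)]
    lam *= 0.999                              # decay both rates slowly
    eps *= 0.999
    s = 0 if s2 == N_STATES - 1 else s2       # restart episode at the goal
print([max(ACTIONS, key=lambda x: Q[(s, x)]) for s in range(N_STATES)])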
Theorem (Q-learning Converges)

Given that the learning and exploration rates decrease slowly enough, Q-learning is guaranteed to converge to the optimal policy.
Definition (Nash Equilibrium Point)

A Nash equilibrium point is a tuple of n strategies (π_1^*, ..., π_n^*) such that for all s ∈ S and i = 1, ..., n,

\forall \pi_i \in \Pi_i \quad v_i(s, \pi_1^*, \ldots, \pi_n^*) \geq v_i(s, \pi_1^*, \ldots, \pi_{i-1}^*, \pi_i, \pi_{i+1}^*, \ldots, \pi_n^*)
Theorem (Nash Equilibrium Point Exists)

Every n-player discounted stochastic game possesses at least one Nash equilibrium point in stationary strategies.
NashQ-learning
1  t ← 0
2  s^0 ← current state
3  ∀s∈S ∀j∈{1,...,n} ∀a_j∈A_j : Q_j^t(s, a_1, ..., a_n) ← 0
4  Choose action a_i^t
5  Observe r_1^t, ..., r_n^t; a_1^t, ..., a_n^t; s^{t+1} = s′
6  for j ← 1, ..., n
7    do Q_j^{t+1}(s, a_1, ..., a_n) ← (1 − λ^t) Q_j^t(s, a_1, ..., a_n) + λ^t (r_j^t + γ NashQ_j^t(s′))
       where NashQ_j^t(s′) = Q_j^t(s′, π_1(s′) ⋯ π_n(s′))
       and π_1(s′), ..., π_n(s′) are a Nash equilibrium point calculated from the Q values
8  t ← t + 1
9  goto 4
Assumption

There exists an adversarial equilibrium for the entire game and for every game defined by the Q functions encountered during learning.

Assumption

There exists a coordination equilibrium for the entire game and for every game defined by the Q functions encountered during learning.

Theorem (NashQ-learning Converges)

Under these assumptions NashQ-learning converges to a Nash equilibrium as long as all the equilibria encountered during the game are unique.
friend-or-foe
1  t ← 0
2  s^0 ← current state
3  ∀s∈S ∀a_j∈A_j : Q_i^t(s, a_1, ..., a_n) ← 0
4  Choose action a_i^t
5  Observe r_1^t, ..., r_n^t; a_1^t, ..., a_n^t; s^{t+1} = s′
6  Q_i^{t+1}(s, a_1, ..., a_n) ← (1 − λ^t) Q_i^t(s, a_1, ..., a_n) + λ^t (r_i^t + γ NashQ_i^t(s′))
   where NashQ_i^t(s′) = max_{π∈Π(X_1×⋯×X_k)} min_{y_1,...,y_l ∈ Y_1×⋯×Y_l} ∑_{x_1,...,x_k ∈ X_1×⋯×X_k} π(x_1) ⋯ π(x_k) Q_i(s, x_1, ..., x_k, y_1, ..., y_l)
   and the X are the actions of i's friends while the Y are those of its foes.
7  t ← t + 1
8  goto 4
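For the all-friends special case (friend-q), the NashQ_i value reduces to a plain maximum of Q_i over joint actions, which makes a compact sketch possible; the dictionary representation and parameter values below are illustrative assumptions.

from collections import defaultdict

def friend_q_update(Q, s, joint_a, r_i, s2, joint_actions, lam=0.1, gamma=0.9):
    # friend-q backup: with only friends, NashQ_i(s') = max over joint
    # actions of Q_i(s', .), i.e. there is no minimization over foes.
    nash_q = max(Q[(s2, a)] for a in joint_actions)
    Q[(s, joint_a)] = (1 - lam) * Q[(s, joint_a)] + lam * (r_i + gamma * nash_q)

# Illustrative usage: two agents, two actions each.
joint_actions = [(x, y) for x in (0, 1) for y in (0, 1)]
Q = defaultdict(float)
friend_q_update(Q, s=0, joint_a=(1, 0), r_i=1.0, s2=1, joint_actions=joint_actions)
print(Q[(0, (1, 0))])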
Theorem (friend-or-foe converges)

friend-or-foe converges.

However, in general the values it converges to do not correspond to a Nash equilibrium point. Still, we can show:

Theorem

foe-q learns values for a Nash equilibrium policy if the game has an adversarial equilibrium, and friend-q learns values for a Nash equilibrium policy if the game has a coordination equilibrium. This is true regardless of opponent behavior.
General Theories for Learning Agents
Moving Target Function Problem
As the other agents change their behavior, what you need to do also changes.
CLRI Theory
CLRI Notation
δ_i^t : W → A is agent i's decision function.
Δ_i^t(w) is the target function.
e(δ_i^t) = Pr[δ_i^t(w) ≠ Δ_i^t(w) | w ∈ D(W)] is the error function.
[Figure: the moving target function problem. The agent learns, moving δ_i^t to δ_i^{t+1}, while the target function moves from Δ_i^t to Δ_i^{t+1}, yielding a new error e(δ_i^{t+1}).]
CLRI Parameters

Change rate (c) is the probability that an agent will change at least one of its incorrect mappings in δ^t(w).
Learning rate (l) is the probability that the agent changes an incorrect mapping to the correct one.
Retention rate (r) is the probability that the agent will retain its correct mappings.
Impact (I_ij) is the impact that i's learning has on j's target function. Specifically, it is the probability that Δ_j^t(w) will change given that δ_i^{t+1}(w) ≠ δ_i^t(w).
CLRI Equation
E[e(\delta_i^{t+1})] = 1 - r_i + v_i \left( \frac{|A_i| r_i - 1}{|A_i| - 1} \right) + e(\delta_i^t) \left( r_i - l_i + v_i \left( \frac{|A_i|(l_i - r_i) + l_i - c_i}{|A_i| - 1} \right) \right)   (1)

where v_i is the volatility, the probability that i's target function changes.
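Iterating Equation 1 shows how the expected error settles at a fixed point determined by the rates. The parameter values in this sketch are invented for illustration.

def clri_step(e, c, l, r, v, n_actions):
    # One step of the CLRI expected-error equation (Equation 1).
    A = n_actions
    return (1 - r + v * (A * r - 1) / (A - 1)
            + e * (r - l + v * (A * (l - r) + l - c) / (A - 1)))

e = 0.9                       # start with a high expected error
for t in range(50):
    e = clri_step(e, c=0.9, l=0.6, r=0.95, v=0.1, n_actions=5)
print(round(e, 3))            # settles near a fixed point around 0.2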
N-Level Agents
I Think that You Think that I Think that...

A 0-level agent is one that does not recognize the existence of other agents in the world.
A 1-level agent recognizes that there are other agents in the world whose actions affect its payoff. It also has some knowledge that tells it the utility it will receive given any set of joint actions.
A 2-level agent believes that all other agents are 1-level agents.
Decreasing Returns of Thinking

An n-level agent always beats (n−1)-level agents.
Marginal utility gains grow smaller with each extra level.
Computational costs grow exponentially with each extra level.
Many times, it does not pay to think about your opponent.
Collective Intelligence
COllective INtelligence
Idea: start with the global utility function U(s, \vec{a}) and determine from it the individual utilities.
Define Preferences
We define i ’s preference over s,~a as
Pi (s,~a) =∑~a′∈~A Θ[ri (s,~a)− ri (s,~a′)]
|~A|,
where Θ(x) is the Heaviside function which is 1 if x is greater thanor equal to 0, otherwise it is 0. SSimilarly, we define the global preference function as
P(s,~a) =∑~a′∈~A Θ[U(s,~a)−U(s,~a′)]
|~A|.
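Since Θ[r_i(s, \vec{a}) − r_i(s, \vec{a}')] is 1 exactly when \vec{a} does at least as well as \vec{a}', the preference is just the fraction of joint actions that \vec{a} weakly beats. A minimal sketch, with the state fixed and an invented reward function:

import itertools

def preference(reward, a, joint_actions):
    # Fraction of joint actions a' with reward(a) >= reward(a'),
    # i.e. the sum of Heaviside terms divided by |A|.
    return sum(reward(a) >= reward(a2) for a2 in joint_actions) / len(joint_actions)

joint_actions = list(itertools.product((0, 1), repeat=2))
r_i = lambda a: a[0] + 2 * a[1]       # invented reward for agent i
print({a: preference(r_i, a, joint_actions) for a in joint_actions})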
The Easier Case
A system where, for all agents i, it is true that P_i(s, \vec{a}) = P(s, \vec{a}) is called factored.
But it might still not converge, because agents' actions change other agents' target functions.
Opacity
We define the opacity Ω_i for agent i as

\Omega_i(s, \vec{a}) = \sum_{\vec{a}' \in \vec{A}} \Pr[\vec{a}'] \, \frac{\left| u_i(s, \vec{a}) - u_i(s, \vec{a}'_{-i}, \vec{a}_i) \right|}{\left| u_i(s, \vec{a}) - u_i(s, \vec{a}_{-i}, \vec{a}'_i) \right|}.
System Categorization
If the system is factored and has zero opacity then it is easy to solve, but such systems are rare.
If it has zero opacity then it amounts to multiple parallel learning problems.
Goal: find low-opacity reward functions that are highly factored.
Wonderful Life
The wonderful life utility function gives each agent:

u_i(s, \vec{a}) = U(s, \vec{a}) - U(s, \vec{a}_{-i}, 0),

where 0 stands for replacing agent i's action with a fixed null action.
Aristocrat Utility
Another solution is the aristocrat utility
u_i(s, \vec{a}) = U(s, \vec{a}) - \sum_{\vec{a}' \in \vec{A}} \Pr[\vec{a}'] \, U(s, \vec{a}_{-i}, \vec{a}'_i),

where Pr[\vec{a}'] is the probability that \vec{a}' happens.
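To compare the two shaped utilities concretely, the sketch below computes the wonderful life utility (clamping agent i's action to action 0 as the null action) and the aristocrat utility for a tiny two-agent system; the global utility U and the action distribution are invented for illustration.

import itertools

ACTIONS = (0, 1)                      # action 0 doubles as the null action

def U(a):
    # Invented global utility over the joint action a = (a_0, a_1).
    return a[0] + a[1] - 0.5 * a[0] * a[1]

def wonderful_life(i, a):
    # u_i = U(a) - U(a with agent i's action clamped to the null action).
    clamped = list(a); clamped[i] = 0
    return U(a) - U(tuple(clamped))

def aristocrat(i, a, pr):
    # u_i = U(a) minus the expectation of U with agent i's action resampled;
    # pr marginalizes the slide's Pr[a'] down to agent i's own action.
    return U(a) - sum(p * U(a[:i] + (ai2,) + a[i+1:])
                      for ai2, p in zip(ACTIONS, pr))

pr = (0.5, 0.5)                       # assumed action distribution
for a in itertools.product(ACTIONS, repeat=2):
    print(a, [wonderful_life(i, a) for i in range(2)],
             [round(aristocrat(i, a, pr), 2) for i in range(2)])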
COIN Tests
Both utility functions have been shown to perform better than u_i = U and other hand-tailored utility functions.