Learning in Multiagent Systems

Jose M Vidal

Department of Computer Science and Engineering, University of South Carolina

February 11, 2010

Abstract

We introduce the topic of learning in multiagent systems and present recent results.

Learning in Multiagent Systems

Introduction

Outline

1 Introduction

2 Cooperative Learning

3 Learning in Games: Fictitious Play, Replicator Dynamics, The AWESOME Algorithm

4 Stochastic Games: Reinforcement Learning

5 General Theories for Learning Agents: CLRI Theory, N-Level Agents

6 Collective Intelligence

Learning in Multiagent Systems

Introduction

The Learning Problem

[Figure: training examples plotted on Speed vs. Weight axes, labeled + and −.]


Learning in Multiagent Systems

Introduction

The Multiagent Learning Problem

[Figure: the same + and − examples plotted on Speed vs. Weight axes.]


Learning in Multiagent Systems

Cooperative Learning

Sharing Learned Knowledge

Fairly easy with identical agent abilities.

Largely unexplored for heterogeneous agents.

Generalizing learned knowledge seems very domain-specific (the induction problem again).


Learning in Multiagent Systems

Learning in Games

             j
           C        D
  i   A   0,0      5,1
      B   -1,6     1,5


Learning in Multiagent Systems

Learning in Games

Fictitious Play

Weight Function

k_i^t(s_j) = k_i^{t-1}(s_j) +
  \begin{cases}
    1 & \text{if } s_j^{t-1} = s_j, \\
    0 & \text{if } s_j^{t-1} \neq s_j.
  \end{cases}

Learning in Multiagent Systems

Learning in Games

Fictitious Play

Model of Opponent

\Pr_i^t[s_j] = \frac{k_i^t(s_j)}{\sum_{s_j' \in S_j} k_i^t(s_j')}.

Learning in Multiagent Systems

Learning in Games

Fictitious Play

Best Response

Player i then determines the strategy that will give it the highest expected utility, given that j will play each of its s_j ∈ S_j with probability Pr_i^t[s_j].
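Putting the three pieces together, here is a minimal Python sketch of one fictitious-play update for player i. The payoff matrix, the prior counts, and the assumption that j's previous action is directly observable are illustrative choices, not part of the slides.

import numpy as np

# Illustrative payoff matrix for player i: rows are i's actions, columns are j's.
payoffs_i = np.array([[0.0, 1.0],
                      [1.0, 0.0]])

def fictitious_play_step(counts, observed_j):
    """One update: bump the weight k_i^t(s_j) of j's observed action, rebuild the
    empirical model Pr_i^t[s_j], and return i's best response to it."""
    counts = counts.copy()
    counts[observed_j] += 1                      # weight function
    pr_j = counts / counts.sum()                 # model of opponent
    expected_utility = payoffs_i @ pr_j          # expected utility of each of i's actions
    return counts, int(np.argmax(expected_utility))   # best response

counts = np.ones(2)                              # illustrative prior counts over j's actions
for _ in range(5):
    counts, best = fictitious_play_step(counts, observed_j=1)
    print(counts, best)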

Learning in Multiagent Systems

Learning in Games

Fictitious Play

Example

             j
           C        D
  i   A   0,0      1,2
      B   1,2      0,0

  si  sj  ki(C)  ki(D)  Pri[C]  Pri[D]
  A   C   1      0      1       0
  B   D   1      1      .5      .5
  A   D   1      2      1/3     2/3
  A   D   1      3      1/4     3/4
  A   D   1      4      1/5     4/5


Learning in Multiagent Systems

Learning in Games

Fictitious Play

Theorem (Nash Equilibrium is Attractor to Fictitious Play)

If s is a strict Nash equilibrium and it is played at time t, then it will be played at all times greater than t.

Learning in Multiagent Systems

Learning in Games

Fictitious Play

Theorem (Fictitious Play Converges to Nash)

If fictitious play converges to a pure strategy, then that strategy must be a Nash equilibrium.

Learning in Multiagent Systems

Learning in Games

Fictitious Play

Infinite Cycle Example

             j
           C        D
  i   A   0,0      1,1
      B   1,1      0,0

  si  sj  ki(C)  ki(D)  kj(A)  kj(B)
          1      1.5    1      1.5
  A   C   2      1.5    2      1.5
  B   D   2      2.5    2      2.5
  A   C   3      2.5    3      2.5
  B   D   3      3.5    3      3.5



Learning in Multiagent Systems

Learning in Games

Replicator Dynamics

Fraction of Agents Playing s

Let φ^t(s) be the number of agents using strategy s at time t. We can then define

\theta^t(s) = \frac{\phi^t(s)}{\sum_{s' \in S} \phi^t(s')}

Learning in Multiagent Systems

Learning in Games

Replicator Dynamics

Expected Utility for Playing s

u^t(s) \equiv \sum_{s' \in S} \theta^t(s')\, u(s, s'),

Learning in Multiagent Systems

Learning in Games

Replicator Dynamics

Reproduction Rate

\phi^{t+1}(s) = \phi^t(s)\,(1 + u^t(s)).
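These three definitions give one generation of the replicator dynamics. A small Python sketch follows; the symmetric payoff matrix is illustrative, and the population is rescaled each generation so its size stays constant.

import numpy as np

# Illustrative symmetric game: u[s, s'] is the payoff for playing s against s'.
u = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 2.0],
              [2.0, 0.0, 1.0]])

def replicator_generation(phi):
    """phi[s] is the number of agents playing strategy s at time t."""
    theta = phi / phi.sum()                 # θ^t(s): fraction playing each strategy
    util = u @ theta                        # u^t(s): expected utility against the population
    phi_next = phi * (1.0 + util)           # φ^{t+1}(s) = φ^t(s)(1 + u^t(s))
    return phi_next * (phi.sum() / phi_next.sum())   # scale back to the original population size

phi = np.array([60.0, 30.0, 10.0])          # arbitrary starting population
for _ in range(100):
    phi = replicator_generation(phi)
print(phi / phi.sum())                      # strategy shares after 100 generations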

Learning in Multiagent Systems

Learning in Games

Replicator Dynamics

Population Dynamics

Population size could change, but we either scale it back or ignore the change.

Game must be symmetric.

A stable population of more than one strategy corresponds toa mixed strategy.

Learning in Multiagent Systems

Learning in Games

Replicator Dynamics

Theorem (Nash equilibrium is a Steady State)

Every Nash equilibrium is a steady state for the replicator dynamics.

Proof.

By contradiction. If some agent had a pure strategy that returned a higher utility than every other strategy, then that strategy would be a best response to the Nash equilibrium. If it differed from the equilibrium strategy, we would have a best response to the equilibrium that is not the equilibrium itself, so the system could not have been at a Nash equilibrium.

Learning in Multiagent Systems

Learning in Games

Replicator Dynamics

Theorem (Stable Steady State is a Nash Equilibrium)

A stable steady state of the replicator dynamics is a Nash equilibrium. A stable steady state is one that, after suffering a small perturbation, is pushed back to the same steady state by the system's dynamics.

Learning in Multiagent Systems

Learning in Games

Replicator Dynamics

Theorem (Asymptotically Stable is Trembling-Hand Nash)

An asymptotically stable steady state corresponds to a Nash equilibrium that is trembling-hand perfect and isolated. That is, the asymptotically stable steady states are a refinement of the Nash equilibria: only a few Nash equilibria are stable steady states.

Learning in Multiagent Systems

Learning in Games

Replicator Dynamics

Definition (Evolutionary Stable Strategy)

An ESS is an equilibrium strategy that can overcome the presence of a small number of invaders. That is, if the equilibrium strategy profile is ω and a small fraction ε of invaders start playing ω', then the ESS condition states that the existing population gets a higher payoff against the new mixture εω' + (1 − ε)ω than the invaders do.

Learning in Multiagent Systems

Learning in Games

Replicator Dynamics

Theorem (ESS is Steady State of Replicator Dynamics)

An ESS is an asymptotically stable steady state of the replicator dynamics. However, the converse need not be true: a stable state of the replicator dynamics need not be an ESS.

Learning in Multiagent Systems

Learning in Games

Replicator Dynamics

             j
           a      b      c
  i   a   1,1    2,2    0,0
      b   0,0    1,1    2,2
      c   2,2    0,0    1,1

[Figure: the strategy simplex with vertices a, b, and c.]


Learning in Multiagent Systems

Learning in Games

The AWESOME Algorithm

AWESOME

1  play-eq, play-sta ← true; eq-rej ← false; φ ← π_i; t ← 0
2  while play-sta
3     do play φ for N times in a row (an epoch)
4        ∀j update s_j given what they played in these N rounds
5        if play-eq
6           then if some player j has max_a(s_j(a), π_j(a)) > ε_e
7                   then eq-rej ← true; φ ← random action
8           else if ¬eq-rej ∧ ∃j max_a(s_j^old(a), s_j(a)) > ε_s
9                   then play-sta ← false
10       eq-rej ← false
11       b ← arg max_a u_i(a, s_{−i})
12       if u_i(b, s_{−i}) > u_i(φ, s_{−i}) + n|A_i| ε_s^{t+1} µ
13          then φ ← b
14       ∀j s_j^old ← s_j
15       t ← t + 1
16 goto 1

Learning in Multiagent Systems

Learning in Games

The AWESOME Algorithm

The Schedule

In order for the algorithm to always converge, ε_e and ε_s must be decreased and N must be increased over time using a schedule where

1 εs and εe decrease monotonically to 0,

2 N increases to infinity,

3 \prod_{t=1}^{\infty} \left( 1 - \frac{\sum_i |A_i|}{N^t (\epsilon_s^t)^2} \right) > 0,

4 \prod_{t=1}^{\infty} \left( 1 - \frac{\sum_i |A_i|}{N^t (\epsilon_e^t)^2} \right) > 0.

Learning in Multiagent Systems

Learning in Games

The AWESOME Algorithm

It Converges

Theorem (AWESOME converges)

With a valid schedule, the AWESOME algorithm converges to a best response if all the other players play fixed strategies, and to a Nash equilibrium if all the other players are AWESOME players.


Learning in Multiagent Systems

Stochastic Games

What is a Stochastic Game?

One where the agents do not know the payoff they might get.

That is, an unexplored MDP.


Learning in Multiagent Systems

Stochastic Games

Reinforcement Learning

Reinforcement Learning Problem Definition

s_t is a state, taken from S,

a_t is an action, taken from A,

P(s_{t+1} | s_t, a_t) is the state transition function,

r(s_t, a_t) → ℜ is the reward function.

The problem is to find the policy π(s) → a which maximizes the discounted successive rewards r_t the agent receives when using π. That is, find

\pi^* = \arg\max_{\pi} \sum_{i=0}^{\infty} \gamma^i r_i


Learning in Multiagent Systems

Stochastic Games

Reinforcement Learning

Q-learning

1  ∀s ∀a Q(s,a) ← 0; λ ← 1; ε ← 1
2  s ← current state
3  if rand() < ε   (ε is the exploration rate)
4     then a ← random action
5     else a ← arg max_a Q(s,a)
6  Take action a
7  Receive reward r
8  s' ← current state
9  Q(s,a) ← λ(r + γ max_{a'} Q(s',a')) + (1 − λ)Q(s,a)
10 λ ← .99λ
11 ε ← .98ε
12 goto 2
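The same loop as runnable Python. This is a minimal tabular sketch: the environment object with reset(), step(), and an actions list is a hypothetical interface, not something defined in the slides.

import random
from collections import defaultdict

def q_learning(env, gamma=0.9, episodes=500):
    """Tabular Q-learning. `env` is a hypothetical environment exposing
    reset() -> state, step(action) -> (next_state, reward, done), and .actions."""
    Q = defaultdict(float)                # Q(s, a), implicitly zero everywhere
    lam, eps = 1.0, 1.0                   # learning rate λ and exploration rate ε
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:                                # explore
                a = random.choice(env.actions)
            else:                                                    # exploit
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)
            best_next = max(Q[(s2, act)] for act in env.actions)
            Q[(s, a)] = lam * (r + gamma * best_next) + (1 - lam) * Q[(s, a)]
            s = s2
            lam *= 0.99                                              # decay learning rate
            eps *= 0.98                                              # decay exploration rate
    return Q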

Learning in Multiagent Systems

Stochastic Games

Reinforcement Learning

Theorem (Q-learning Converges)

Given that the learning and exploration rates decrease slowly enough, Q-learning is guaranteed to converge to the optimal policy.

Learning in Multiagent Systems

Stochastic Games

Reinforcement Learning

Definition (Nash Equilibrium Point)

A Nash equilibrium point is a tuple of n strategies (π_1^*, ..., π_n^*) such that for all s ∈ S and i = 1, ..., n,

v_i(s, π_1^*, ..., π_n^*) ≥ v_i(s, π_1^*, ..., π_{i−1}^*, π_i, π_{i+1}^*, ..., π_n^*)   for all π_i ∈ Π_i.

Learning in Multiagent Systems

Stochastic Games

Reinforcement Learning

Theorem (Nash Equilibrium Point Exists)

Every n-player discounted stochastic game possesses at least one Nash equilibrium point in stationary strategies.

Learning in Multiagent Systems

Stochastic Games

Reinforcement Learning

NashQ-learning

1  t ← 0
2  s^0 ← current state
3  ∀s ∈ S, ∀j ← 1,...,n, ∀a_j ∈ A_j : Q_j^t(s, a_1, ..., a_n) ← 0
4  Choose action a_i^t
5  Observe r_1^t, ..., r_n^t; a_1^t, ..., a_n^t; s^{t+1} = s'
6  for j ← 1, ..., n
7     do Q_j^{t+1}(s, a_1, ..., a_n) ← (1 − λ^t) Q_j^t(s, a_1, ..., a_n) + λ^t (r_j^t + γ NashQ_j^t(s'))
         where NashQ_j^t(s') = Q_j^t(s', π_1(s') ··· π_n(s'))
         and π_1(s'), ..., π_n(s') are a Nash equilibrium point calculated from the Q values
8  t ← t + 1
9  goto 4
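For intuition, a sketch of the two-player case in Python. The pure-strategy stage-game solver and all names here are illustrative simplifications; NashQ in general needs mixed equilibria, and the convergence result that follows assumes the equilibria encountered are unique.

import numpy as np

def pure_nash(Q1, Q2):
    """Return a pure-strategy Nash equilibrium (a1, a2) of the stage game whose
    payoff matrices are Q1 (player 1) and Q2 (player 2), or None if none exists."""
    for a1 in range(Q1.shape[0]):
        for a2 in range(Q1.shape[1]):
            if Q1[a1, a2] >= Q1[:, a2].max() and Q2[a1, a2] >= Q2[a1, :].max():
                return a1, a2
    return None

def nashq_update(Q1, Q2, s, actions, rewards, s2, lam, gamma):
    """One NashQ update of both players' tables. Qk[s] is a payoff matrix over
    joint actions; `actions` = (a1, a2), `rewards` = (r1, r2)."""
    eq = pure_nash(Q1[s2], Q2[s2])               # Nash equilibrium point of the next-state game
    for Q, r, nash_value in ((Q1, rewards[0], Q1[s2][eq]),
                             (Q2, rewards[1], Q2[s2][eq])):
        Q[s][actions] = (1 - lam) * Q[s][actions] + lam * (r + gamma * nash_value)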

Learning in Multiagent Systems

Stochastic Games

Reinforcement Learning

Assumption

There exists an adversarial equilibrium for the entire game and for every game defined by the Q functions encountered during learning.

Assumption

There exists a coordination equilibrium for the entire game and for every game defined by the Q functions encountered during learning.

Theorem (NashQ-learning Converges)

Under these assumptions, NashQ-learning converges to a Nash equilibrium as long as all the equilibria encountered during the game are unique.


Learning in Multiagent Systems

Stochastic Games

Reinforcement Learning

friend-or-foe

1  t ← 0
2  s^0 ← current state
3  ∀s ∈ S, ∀a_j ∈ A_j : Q_i^t(s, a_1, ..., a_n) ← 0
4  Choose action a_i^t
5  Observe r_1^t, ..., r_n^t; a_1^t, ..., a_n^t; s^{t+1} = s'
6  Q_i^{t+1}(s, a_1, ..., a_n) ← (1 − λ^t) Q_i^t(s, a_1, ..., a_n) + λ^t (r_i^t + γ NashQ_i^t(s'))
      where NashQ_i^t(s') = max_{π ∈ Π(X_1×···×X_k)} min_{y_1,...,y_l ∈ Y_1×···×Y_l}
                            ∑_{x_1,...,x_k ∈ X_1×···×X_k} π(x_1) ··· π(x_k) Q_i(s', x_1, ..., x_k, y_1, ..., y_l)
      and the X are the actions of i's friends and the Y those of its foes.
7  t ← t + 1
8  goto 4

Learning in Multiagent Systems

Stochastic Games

Reinforcement Learning

Theorem (friend-or-foe converges)

friend-or-foe converges.

However, in general these do not correspond to a Nash equilibrium point.

Still, we can show

Theorem

foe-q learns values for a Nash equilibrium policy if the game has an adversarial equilibrium, and friend-q learns values for a Nash equilibrium policy if the game has a coordination equilibrium. This is true regardless of opponent behavior.



Learning in Multiagent Systems

General Theories for Learning Agents

Moving Target Function Problem

As the other agents change their behavior, what you need to do also changes.


Learning in Multiagent Systems

General Theories for Learning Agents

CLRI Theory

CLRI Notation

δ_i^t(w) : W → A: the decision function.

∆_i^t(w): the target function.

e(δ_i^t) = Pr[δ_i^t(w) ≠ ∆_i^t(w) | w ∈ D(W)]: the error function.

Learning in Multiagent Systems

General Theories for Learning Agents

CLRI Theory

[Figure: δ_i^t is compared against ∆_i^t to give the error e(δ_i^t); the agent then learns, producing δ_i^{t+1}, while the target function moves to ∆_i^{t+1}, giving a new error e(δ_i^{t+1}).]

Figure: The moving target function problem.


Learning in Multiagent Systems

General Theories for Learning Agents

CLRI Theory

CLRI Parameter

Change rate (c) is the probability that an agent will change at least one of its incorrect mappings in δ^t(w).

Learning rate (l) is the probability that the agent changes an incorrect mapping to the correct one.

Retention rate (r) is the probability that the agent will retain its correct mappings.

Impact (I_ij) is the impact that i's learning has on j's target function. Specifically, it is the probability that ∆_j^t(w) will change given that δ_i^{t+1}(w) ≠ δ_i^t(w).

Learning in Multiagent Systems

General Theories for Learning Agents

CLRI Theory

CLRI Equation

E[e(\delta_i^{t+1})] = 1 - r_i + v_i \left( \frac{|A_i| r_i - 1}{|A_i| - 1} \right)
  + e(\delta_i^t) \left( r_i - l_i + v_i \left( \frac{|A_i|(l_i - r_i) + l_i - c_i}{|A_i| - 1} \right) \right)    (1)

where v_i is agent i's volatility: the probability that its target function changes from time t to t + 1.
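Equation (1) is a difference equation and can be iterated directly to predict how an agent's expected error evolves. A small Python sketch with illustrative parameter values (the argument names are mine):

def expected_error(e_t, c, l, r, v, num_actions):
    """One step of the CLRI difference equation (1). c, l, r, v are the change,
    learning, retention, and volatility rates; num_actions is |A_i|."""
    A = num_actions
    return (1 - r + v * (A * r - 1) / (A - 1)
            + e_t * (r - l + v * (A * (l - r) + l - c) / (A - 1)))

e = 1.0                                   # start with a completely wrong decision function
for t in range(25):
    e = expected_error(e, c=0.9, l=0.7, r=0.95, v=0.1, num_actions=5)
print(e)                                  # expected error after 25 steps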


Learning in Multiagent Systems

General Theories for Learning Agents

N-Level Agents

I Think that You Think that I Think that. . .

0-level agent is one that does not recognize the existence of other agents in the world.

1-level agent recognizes that there are other agents in the world whose actions affect its payoff. It also has some knowledge that tells it the utility it will receive given any set of joint actions.

2-level agent believes that all other agents are 1-level agents.

Learning in Multiagent Systems

General Theories for Learning Agents

N-Level Agents

Decreasing Returns of Thinking

n-level always beats (n−1)-level agents.

Marginal utility gains grow smaller with each extra level.

Computational costs grow exponentially with each extra level.

Many times, it doesn’t pay to think about your opponent.



Learning in Multiagent Systems

Collective Intelligence

COllective INtelligence

Idea: start with the global utility function U(s, \vec{a}) and determine from it the individual utilities.

Learning in Multiagent Systems

Collective Intelligence

Define Preferences

We define i's preference over s, \vec{a} as

P_i(s, \vec{a}) = \frac{\sum_{\vec{a}' \in \vec{A}} \Theta[r_i(s, \vec{a}) - r_i(s, \vec{a}')]}{|\vec{A}|},

where Θ(x) is the Heaviside function, which is 1 if x is greater than or equal to 0 and 0 otherwise. Similarly, we define the global preference function as

P(s, \vec{a}) = \frac{\sum_{\vec{a}' \in \vec{A}} \Theta[U(s, \vec{a}) - U(s, \vec{a}')]}{|\vec{A}|}.
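As a concrete reading of the formula, a small Python sketch; the function and argument names are mine. Passing an agent's reward r_i gives P_i, while passing the global utility U gives P.

def preference(reward, s, a, joint_actions):
    """P(s, a): the fraction of joint actions a' whose reward the pair (s, a)
    weakly beats. `reward` can be an agent's r_i or the global utility U."""
    theta = lambda x: 1.0 if x >= 0 else 0.0        # Heaviside step function
    return sum(theta(reward(s, a) - reward(s, ap)) for ap in joint_actions) / len(joint_actions)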

Learning in Multiagent Systems

Collective Intelligence

The Easier Case

A system where, for all agents i, P_i(s, \vec{a}) = P(s, \vec{a}) is called factored.

But it might still not converge, because agents' actions change other agents' target functions.


Learning in Multiagent Systems

Collective Intelligence

Opacity

We define the opacity Ω_i for agent i as

\Omega_i(s, \vec{a}) = \sum_{\vec{a}' \in \vec{A}} \Pr[\vec{a}'] \, \frac{|u_i(s, \vec{a}) - u_i(s, \vec{a}'_{-i}, \vec{a}_i)|}{|u_i(s, \vec{a}) - u_i(s, \vec{a}_{-i}, \vec{a}'_i)|}.

Learning in Multiagent Systems

Collective Intelligence

System Categorization

If the system is factored and has zero opacity then it is easy to solve, but such systems are rare.

If it has zero opacity then it amounts to multiple parallel learning problems.

Goal: find reward functions with low opacity that are highly factored.


Learning in Multiagent Systems

Collective Intelligence

Wonderful Life

The wonderful life utility function gives each agent:

u_i(s, \vec{a}) = U(s, \vec{a}) - U(s, \vec{a}_{-i}, 0).

Learning in Multiagent Systems

Collective Intelligence

Aristocrat Utility

Another solution is the aristocrat utility

u_i(s, \vec{a}) = U(s, \vec{a}) - \sum_{\vec{a}' \in \vec{A}} \Pr[\vec{a}'] \, U(s, \vec{a}_{-i}, \vec{a}'_i),

where \Pr[\vec{a}'] is the probability that \vec{a}' happens.
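A minimal Python sketch of both reward shapes, assuming a global utility function U(s, a) over joint actions is available. The helper names, the tuple representation of joint actions, and the collapse of the sum over \vec{a}' to a distribution over agent i's own action are illustrative choices.

def wonderful_life_utility(U, s, a, i, null_action=0):
    """WLU: the global utility minus what it would have been had agent i
    taken a null action instead (everything else held fixed)."""
    a_null = a[:i] + (null_action,) + a[i + 1:]
    return U(s, a) - U(s, a_null)

def aristocrat_utility(U, s, a, i, actions_i, pr_i):
    """AU: subtract the expected global utility when agent i's action is
    drawn from the distribution pr_i, with the other agents held fixed."""
    expected = sum(pr_i[ai] * U(s, a[:i] + (ai,) + a[i + 1:]) for ai in actions_i)
    return U(s, a) - expected

In both cases the subtracted term does not depend on agent i's own action, which is what lowers the opacity relative to simply using u_i = U.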

Learning in Multiagent Systems

Collective Intelligence

COIN Tests

Both utility functions have been shown to perform better than u_i = U and other hand-tailored utility functions.