Multi-Agent Planning in Complex Uncertain
Environments
Daphne Koller, Stanford University
Joint work with: Carlos Guestrin (CMU), Ronald Parr (Duke)
©2004 – Carlos Guestrin, Daphne Koller
Collaborative Multiagent Planning
Search and rescue and firefighting; factory management; multi-robot tasks (RoboSoccer); network routing; air traffic control; computer game playing.

Multiple agents making coordinated decisions toward long-term goals: collaborative multiagent planning.
Joint Planning Space
Joint action space: each agent i takes action ai at each step; the joint action a = {a1,…, an} covers all agents.
Joint state space: an assignment x1,…,xn to a set of variables X1,…,Xn; the joint state x = {x1,…, xn} describes the entire system.
Joint system: payoffs and state dynamics depend on the joint state and joint action.
Cooperative agents: want to maximize total payoff.
Exploiting Structure
Real-world problems have:
Hundreds of objects, googols of states
Real-world problems have structure!
Approach: Exploit structured representation to obtain efficient approximate solution
Outline
Action Coordination: Factored Value Functions; Coordination Graphs; Context-Specific Coordination
Joint Planning: Multi-Agent Markov Decision Processes; Efficient Linear Programming Solution; Decentralized Market-Based Solution
Generalizing to New Environments: Relational MDPs; Generalizing Value Functions
One-Shot Optimization Task
Q-function Q(x,a) encodes agents’ payoff for joint action a in joint state x
Agents’ task: to compute argmax_a Q(x,a).

Obstacles: the number of joint actions is exponential in the number of agents; this also assumes complete state observability and full agent communication.
Factored Payoff Function
Approximate the Q function as a sum of Q sub-functions, each depending on a small local part of the system: two interacting agents; an agent and an important resource; two inter-dependent pieces of machinery. [K. & Parr ’99,’00] [Guestrin, K., Parr ’01]

Q(A1,…,A4, X1,…,X4) ≈ Q1(A1,A4,X1,X4) + Q2(A1,A2,X1,X2) + Q3(A2,A3,X2,X3) + Q4(A3,A4,X3,X4)
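As a concrete illustration of the decomposition, here is a minimal Python sketch of a four-agent ring; the local payoff functions are made up, and the brute-force argmax is only there to show the exponential enumeration that the coordination graph will avoid:

```python
import itertools

# Hypothetical local Q sub-functions over binary states and actions:
# Q1(A1,A4,X1,X4), Q2(A1,A2,X1,X2), Q3(A2,A3,X2,X3), Q4(A3,A4,X3,X4)
def Q1(a1, a4, x1, x4): return 2.0 * (a1 == a4) - x1
def Q2(a1, a2, x1, x2): return 1.5 * a1 * a2 + x2
def Q3(a2, a3, x2, x3): return -1.0 * (a2 != a3) + x3
def Q4(a3, a4, x3, x4): return 0.5 * a3 + a4 * x4

def Q(a, x):
    """Global Q(x,a) as the sum of the four local sub-functions."""
    a1, a2, a3, a4 = a
    x1, x2, x3, x4 = x
    return (Q1(a1, a4, x1, x4) + Q2(a1, a2, x1, x2) +
            Q3(a2, a3, x2, x3) + Q4(a3, a4, x3, x4))

# Brute force over all 2^4 joint actions -- exponential in #agents in general.
x = (1, 0, 1, 0)
best = max(itertools.product([0, 1], repeat=4), key=lambda a: Q(a, x))
print(best, Q(best, x))
```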
Distributed Q Function
The Q sub-functions are assigned to the relevant agents: agent 1 holds Q1(A1,A4,X1,X4), agent 2 holds Q2(A1,A2,X1,X2), agent 3 holds Q3(A2,A3,X2,X3), and agent 4 holds Q4(A3,A4,X3,X4), so that

Q(A1,…,A4, X1,…,X4) ≈ Q1(A1,A4,X1,X4) + Q2(A1,A2,X1,X2) + Q3(A2,A3,X2,X3) + Q4(A3,A4,X3,X4)

[Guestrin, K., Parr ’01]
Multiagent Action Selection
With the distributed Q function Q1(A1,A4,X1,X4) + Q2(A1,A2,X1,X2) + Q3(A2,A3,X2,X3) + Q4(A3,A4,X3,X4), multiagent action selection proceeds in two steps: instantiate the current state x, then compute the maximal joint action argmax_a.
Instantiating State x
Limited observability: agent i only observes the variables that appear in Qi. For example, agent 2 observes only X1 and X2, which is all it needs to evaluate Q2(A1,A2,X1,X2).
Choosing Action at State x
Instantiating the current state x reduces each sub-function to a function of actions alone: Q1(A1,A4,X1,X4) becomes Q1(A1,A4), Q2(A1,A2,X1,X2) becomes Q2(A1,A2), Q3(A2,A3,X2,X3) becomes Q3(A2,A3), and Q4(A3,A4,X3,X4) becomes Q4(A3,A4). The maximal action is then max_a over the sum of these terms.
Variable Elimination
Use variable elimination for the maximization:

max_{A1,A2,A3,A4} [ Q1(A1,A4) + Q2(A1,A2) + Q3(A2,A3) + Q4(A3,A4) ]
= max_{A1,A2,A4} [ Q1(A1,A4) + Q2(A1,A2) + max_{A3} [ Q3(A2,A3) + Q4(A3,A4) ] ]
= max_{A1,A2,A4} [ Q1(A1,A4) + Q2(A1,A2) + g1(A2,A4) ]

Limited communication suffices for the optimal action choice: the communication bandwidth equals the tree-width of the coordination graph.

The intermediate function g1(A2,A4) tabulates the value of the optimal A3 action:

A2      A4      Value of optimal A3 action
Attack  Attack   5
Attack  Defend   6
Defend  Attack   8
Defend  Defend  12
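Here is a minimal Python sketch of this elimination, with made-up local payoffs standing in for the state-instantiated Q functions; eliminating A3 produces a table playing the role of g1(A2, A4) above:

```python
import itertools

ACTIONS = ["attack", "defend"]

# Hypothetical local payoffs after instantiating the current state x.
Q1 = {(a1, a4): 1.0 * (a1 == a4) for a1 in ACTIONS for a4 in ACTIONS}
Q2 = {(a1, a2): 2.0 * (a1 == "attack") + (a2 == "defend")
      for a1 in ACTIONS for a2 in ACTIONS}
Q3 = {(a2, a3): 3.0 * (a2 == a3 == "defend") for a2 in ACTIONS for a3 in ACTIONS}
Q4 = {(a3, a4): 1.5 * ((a3 == "defend") and (a4 == "defend"))
      for a3 in ACTIONS for a4 in ACTIONS}

# Eliminate A3: g1(A2, A4) = max_{A3} [ Q3(A2, A3) + Q4(A3, A4) ].
g1 = {(a2, a4): max(Q3[a2, a3] + Q4[a3, a4] for a3 in ACTIONS)
      for a2 in ACTIONS for a4 in ACTIONS}

# The remaining maximization involves only A1, A2, A4.
a1, a2, a4 = max(itertools.product(ACTIONS, repeat=3),
                 key=lambda v: Q1[v[0], v[2]] + Q2[v[0], v[1]] + g1[v[1], v[2]])
# A3's best response is recovered from the stored maximization.
a3 = max(ACTIONS, key=lambda a: Q3[a2, a] + Q4[a, a4])
print(a1, a2, a3, a4)
```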
Choosing Action at State x
Eliminating A3 first:

max_{A1,A2,A3,A4} [ Q1(A1,A4) + Q2(A1,A2) + Q3(A2,A3) + Q4(A3,A4) ]
= max_{A1,A2,A4} [ Q1(A1,A4) + Q2(A1,A2) + max_{A3} [ Q3(A2,A3) + Q4(A3,A4) ] ]
= max_{A1,A2,A4} [ Q1(A1,A4) + Q2(A1,A2) + g1(A2,A4) ]
Choosing Action at State x
In the coordination graph, maximizing out A3 removes agent 3 and replaces Q3(A2,A3) and Q4(A3,A4) with the new function g1(A2,A4):

max_{A1,A2,A3,A4} [ Q1(A1,A4) + Q2(A1,A2) + Q3(A2,A3) + Q4(A3,A4) ]
= max_{A1,A2,A4} [ Q1(A1,A4) + Q2(A1,A2) + max_{A3} [ Q3(A2,A3) + Q4(A3,A4) ] ]
= max_{A1,A2,A4} [ Q1(A1,A4) + Q2(A1,A2) + g1(A2,A4) ]
Coordination Graphs
Communication follows a triangulated graph; computation grows exponentially in the tree-width, a graph-theoretic measure of “connectedness” that also arises in Bayesian networks, CSPs, and elsewhere. The cost is exponential in the worst case, but fairly low for many real graphs. [Figure: example coordination graph over agents A1–A11.]
Context-Specific Interactions
Payoff structure can vary by context. Example: agents A1 and A2 are both trying to pass through the same narrow corridor at location X. We can use a context-specific “value rule”:

⟨At(X,A1) ∧ At(X,A2) ∧ A1 = fwd ∧ A2 = fwd : -100⟩

Hope: context-specific payoffs will induce context-specific coordination.
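One plausible encoding of such value rules, purely illustrative (the class and field names below are not from the talk):

```python
from dataclasses import dataclass

@dataclass
class ValueRule:
    """A context-specific value rule <context : value>: the rule contributes
    its value only when every variable in the context takes the listed value."""
    context: dict   # partial assignment to state/action variables
    value: float

    def applies(self, assignment: dict) -> bool:
        return all(assignment.get(var) == val
                   for var, val in self.context.items())

def rule_value(rules, assignment):
    """Total payoff = sum of the rules whose context holds."""
    return sum(r.value for r in rules if r.applies(assignment))

# The corridor rule from the slide: both agents pushing forward is penalized.
corridor = ValueRule(
    context={"At(X,A1)": True, "At(X,A2)": True, "A1": "fwd", "A2": "fwd"},
    value=-100.0,
)
both_fwd = {"At(X,A1)": True, "At(X,A2)": True, "A1": "fwd", "A2": "fwd"}
print(rule_value([corridor], both_fwd))  # -100.0
```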
Context-Specific Coordination
Instantiate current state: x = true

Value rules over agents A1–A6:

⟨a2 ∧ a3 ∧ x : 0.1⟩   ⟨a3 ∧ a4 ∧ x : 3⟩   ⟨a1 ∧ a2 ∧ a4 ∧ x : 3⟩
⟨a1 ∧ a2 ∧ x : 5⟩   ⟨a1 ∧ a3 ∧ ¬x : 1⟩   ⟨a6 ∧ x : 7⟩
⟨a1 ∧ a5 ∧ x : 4⟩   ⟨a5 ∧ a6 ∧ x : 2⟩   ⟨a1 ∧ a6 ∧ ¬x : 3⟩   ⟨a4 ∧ x : 1⟩
Context-Specific Coordination
After instantiating x = true, only the rules consistent with x remain:

⟨a2 ∧ a3 : 0.1⟩   ⟨a3 ∧ a4 : 3⟩   ⟨a1 ∧ a2 ∧ a4 : 3⟩   ⟨a1 ∧ a2 : 5⟩
⟨a6 : 7⟩   ⟨a1 ∧ a5 : 4⟩   ⟨a5 ∧ a6 : 2⟩   ⟨a4 : 1⟩

Coordination structure varies based on context.
Context-Specific Coordination
Maximizing out A1 with rule-based variable elimination [Zhang & Poole ’99]: the rules mentioning A1 (⟨a1 ∧ a2 ∧ a4 : 3⟩, ⟨a1 ∧ a2 : 5⟩, ⟨a1 ∧ a5 : 4⟩) are replaced by new rules recording the value of A1’s best choice, e.g. ⟨a2 : 5⟩ and ⟨a5 : 4⟩ when A1 = a1. The remaining rules (⟨a2 ∧ a3 : 0.1⟩, ⟨a3 ∧ a4 : 3⟩, ⟨a6 : 7⟩, ⟨a5 ∧ a6 : 2⟩, ⟨a4 : 1⟩) are untouched.

Coordination structure varies based on communication.
Context-Specific Coordination
Eliminating A1 from the graph (rule-based variable elimination [Zhang & Poole ’99]) leaves the rule set ⟨a2 ∧ a3 : 0.1⟩, ⟨a3 ∧ a4 : 3⟩, ⟨a6 : 7⟩, ⟨a5 ∧ a6 : 2⟩, ⟨a4 : 1⟩, ⟨a2 : 5⟩, ⟨a5 : 4⟩.

Coordination structure varies based on agent decisions.
Robot Soccer
UvA Trilearn 2002 won the German Open 2002, but placed fourth in RoboCup 2002.
“ … the improvements introduced in UvA Trilearn 2003 … include an extension of the intercept skill, improved passing behavior and especially the usage of coordination graphs to specify the coordination requirements between the different agents.”
Kok, Vlassis & Groen, University of Amsterdam
RoboSoccer Value Rules
Coordination graph rules include conditions on player role and aspects of global system state
Example rules for player i, in the role of passer, depend on the distance of teammate j to the goal after the move.
UvA Trilearn 2003 Results

Round       Opponent                         Score
Round 1     Mainz Rolling Brains (Germany)    4-0
            Iranians (Iran)                  31-0
            Sahand (Iran)                    39-0
            a4ty (Latvia)                    25-0
Round 2     Helios (Iran)                     2-1
            AT-Humboldt (Germany)             5-0
            ZJUBase (China)                   6-0
            Aria (Iran)                       6-0
            Hana (Japan)                     26-0
Round 3     Zenit-NewERA (Russia)             4-0
            RoboSina (Iran)                   6-0
            Wright Eagle (China)              3-1
            Everest (China)                   7-1
            Aria (Iran)                       5-0
Semi-final  Brainstormers (Germany)           4-1
Final       TsinghuAeolus (China)             4-3
Total                                       177-7

UvA Trilearn won: German Open 2003, US Open 2003, RoboCup 2003, German Open 2004.
Outline
Action Coordination: Factored Value Functions; Coordination Graphs; Context-Specific Coordination
Joint Planning: Multi-Agent Markov Decision Processes; Efficient Linear Programming Solution; Decentralized Market-Based Solution
Generalizing to New Environments: Relational MDPs; Generalizing Value Functions
Real-time Strategy Game

Peasants collect resources and build; footmen attack enemies; buildings train peasants and footmen. [Figure: screenshot labeling a peasant, a footman, and a building.]
Planning Over Time
Markov Decision Process (MDP) representation:

Action space: joint agent actions a = {a1,…, an}
State space: joint state descriptions x = {x1,…, xn}
Momentary reward function R(x,a)
Probabilistic system dynamics P(x’|x,a)
Policy
Policy: π(x) = a; at state x, action a is taken by all agents.

π(x0) = both peasants get wood
π(x1) = one peasant gets gold, the other builds the barracks
π(x2) = peasants get gold, footmen attack
Value of Policy
Value: Vπ(x) is the expected long-term reward starting from x. Starting from x0:

Vπ(x0) = E[R(x0) + γ R(x1) + γ² R(x2) + γ³ R(x3) + γ⁴ R(x4) + …]

Future rewards are discounted by γ ∈ [0,1). [Figure: trajectory tree from x0 under policy π, branching over stochastic successor states x1, x1’, x1’’.]
Optimal Long-term Plan
Optimal policy π*(x) and optimal Q-function Q*(x,a).

Bellman equations:

Q*(x,a) = R(x,a) + γ Σ_{x’} P(x’|x,a) V*(x’)
V*(x) = max_a Q*(x,a)

Optimal policy: π*(x) = argmax_a Q*(x,a)
Solving an MDP
Many algorithms solve the Bellman equations:

Policy iteration [Howard ’60, Bellman ‘57]
Value iteration [Bellman ‘57]
Linear programming [Manne ’60]
…

Solving the Bellman equations yields the optimal value V*(x) and the optimal policy π*(x).
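For reference, a compact value-iteration sketch for a flat MDP; the array layout is an assumption made for this example, not anything prescribed by the talk:

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Solve the Bellman equations by value iteration.

    P: array [A, S, S] with P[a, s, s'] = transition probability
    R: array [S, A] of momentary rewards
    Returns the optimal values V* and a greedy policy pi*.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```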
LP Solution to MDP
minimize:  Σ_x V(x)
subject to:  V(x) ≥ Q(x,a)  for all x, a

One variable V(x) for each state; one constraint for each state x and action a; polynomial-time solution.
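A minimal sketch of this LP using scipy.optimize.linprog, with the same flat [A, S, S] / [S, A] array layout assumed in the value-iteration sketch above:

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, R, gamma=0.95):
    """Exact LP solution of a flat MDP:
        minimize   sum_x V(x)
        subject to V(x) >= R(x,a) + gamma * sum_x' P(x'|x,a) V(x')  for all x,a
    """
    n_actions, n_states, _ = P.shape
    c = np.ones(n_states)                      # minimize sum_x V(x)
    A_ub, b_ub = [], []
    for a in range(n_actions):
        # V >= R_a + gamma * P_a V   <=>   -(I - gamma * P_a) V <= -R_a
        A_ub.append(-(np.eye(n_states) - gamma * P[a]))
        b_ub.append(-R[:, a])
    res = linprog(c, A_ub=np.vstack(A_ub), b_ub=np.concatenate(b_ub),
                  bounds=[(None, None)] * n_states)
    return res.x                               # optimal V*
```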
Are We Done?
Planning is polynomial in #states and #actions
#states exponential in number of variables
#actions exponential in number of agents
Efficient approximation by exploiting structure!
Structured Representation

Factored MDP [Boutilier et al. ’95]: state dynamics, decisions, and rewards are encoded in a dynamic Bayesian network over time slices t and t+1, with state variables (Peasant, Footman, Enemy, Gold), action variables (APeasant, ABuild, AFootman), reward R, and local transition models such as P(F’|F, G, AB, AF).

Complexity of representation: exponential in the number of parents (worst case).
Structured Value Function?

Does factored MDP structure yield structure in V*? Not exactly, but almost: a factored V often provides a good approximate value function. [Figure: DBN unrolled over time slices t through t+3 with variables X, Y, Z and reward R.]
[Bellman et al. ‘63] [Tsitsiklis & Van Roy ‘96] [K. & Parr ’99,’00]

Structured Value Functions

Approximate V* as a factored value function:

V(x) = Σ_i wi hi(x)

In the rule-based case, hi is a rule concerning a small part of the system and wi is the value associated with the rule. Goal: find weights w giving a good approximation V to V*. A factored value function V = Σ wi hi yields a factored Q function Q = Σ Qi, so the coordination graph can be used for action selection.
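A tiny sketch of evaluating such a factored value function; the basis functions and weights below are hypothetical:

```python
def factored_value(w, basis, x):
    """Evaluate V(x) = sum_i w_i * h_i(x)."""
    return sum(w_i * h_i(x) for w_i, h_i in zip(w, basis))

# Hypothetical rule-style basis: indicators over small parts of the system.
basis = [
    lambda x: 1.0 if x["m1"] == "good" else 0.0,
    lambda x: 1.0 if x["m2"] == "good" else 0.0,
    lambda x: 1.0,                      # constant bias
]
w = [1.7, 2.3, 0.5]                     # weights found by the planning LP
print(factored_value(w, basis, {"m1": "good", "m2": "faulty"}))  # 2.2
```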
Approximate LP Solution
minimize:  Σ_x Σ_i wi hi(x)
subject to:  Σ_i wi hi(x) ≥ Σ_i Qi(x,a)  for all x, a

One variable wi for each basis function: a polynomial number of LP variables. One constraint for every state and action: exponentially many LP constraints.
So What Now?

subject to:  Σ_i wi hi(x) ≥ Σ_i Qi(x,a)  for all x, a

is equivalent to the single constraint

0 ≥ max_{x,a} [ Σ_i Qi(x,a) - Σ_i wi hi(x) ]

Exponentially many linear constraints collapse into one nonlinear constraint. [Guestrin, K., Parr ’01]
Variable Elimination Revisited
Use variable elimination to represent the constraints:

0 ≥ max_{A,B,C,D} [ f1(A,B) + f2(A,C) + f3(C,D) + f4(B,D) ]

becomes

0 ≥ max_{A,B,C} [ f1(A,B) + f2(A,C) + g1(B,C) ],  where g1(B,C) = max_D [ f3(C,D) + f4(B,D) ]

This gives exponentially fewer constraints. Applied to 0 ≥ max_{x,a} [ Σ_i Qi(x,a) - Σ_i wi hi(x) ], it yields a polynomial-size LP for finding a good factored approximation to V*. [Guestrin, K., Parr ’01]
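A small numeric check of the identity, with made-up local terms over binary variables:

```python
import itertools

# Hypothetical local terms over binary variables A, B, C, D.
f1 = lambda a, b: a + 2 * b
f2 = lambda a, c: 3 * a * c
f3 = lambda c, d: c - d
f4 = lambda b, d: 2 * b * d

vals = [0, 1]

# Naive form: one term per joint assignment (exponential in #variables).
naive_max = max(f1(a, b) + f2(a, c) + f3(c, d) + f4(b, d)
                for a, b, c, d in itertools.product(vals, repeat=4))

# Factored form: eliminate D first, introducing g1(B, C).
g1 = {(b, c): max(f3(c, d) + f4(b, d) for d in vals)
      for b in vals for c in vals}
factored_max = max(f1(a, b) + f2(a, c) + g1[b, c]
                   for a, b, c in itertools.product(vals, repeat=3))

assert naive_max == factored_max  # same maximum, far fewer constraints
```

In the LP itself, the entries of g1 become extra LP variables constrained by g1(B,C) ≥ f3(C,D) + f4(B,D) for every value of D, which is where the constraint savings come from.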
Network Management Problem
Example topologies: ring, star, ring of rings, k-grid.

Each computer runs processes; computer status ∈ {good, dead, faulty}. Dead neighbors increase a machine's probability of dying. Reward is received for successful processes. Each SysAdmin takes a local action ∈ {reboot, not reboot}.
Scaling of Factored LP
The explicit LP needs 2^n constraints for n state variables; the factored LP needs (n+1-k)·2^k, where k is the tree-width. [Plot: number of constraints vs. number of variables (2–16) for the explicit LP and for the factored LP with k = 3, 5, 8, 10, 12.]
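The two counts are easy to reproduce with the slide's formulas:

```python
def explicit_constraints(n: int) -> int:
    # One constraint per joint assignment of n binary state variables.
    return 2 ** n

def factored_constraints(n: int, k: int) -> int:
    # The slide's count for tree-width k: (n + 1 - k) * 2^k.
    return (n + 1 - k) * 2 ** k

for n in (8, 12, 16):
    print(n, explicit_constraints(n), factored_constraints(n, k=3))
# 8 256 48 / 12 4096 80 / 16 65536 112
```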
Multiagent Running Time
[Plot: total running time (seconds) vs. number of agents (2–16) for three settings: star topology with single-variable basis, star with pair basis, and ring of rings.]
Strategic 2x2
Factored MDP model: 2 peasants, 2 footmen, enemy, gold, wood, barracks; ~1 million state/action pairs.

Offline: the factored LP computes the value function Q.
Online: the coordination graph computes argmax_a Q(x,a) for the current world state x.
Limited Interaction MDPs

Some MDPs have additional structure: agents are largely autonomous and interact only in limited ways, e.g., competing for resources. Such an MDP can be decomposed into a set of agent-based MDPs (M1, M2, …) with a limited interface of shared variables. [Figure: a factored MDP over A1, A2, X1–X3, R1–R3 split into agent-based MDPs M1 and M2 that overlap on shared variables.] [Guestrin & Gordon, ’02]
Limited Interaction MDPs
In such MDPs, our LP matrix is highly structured
Can use Dantzig-Wolfe LP decomposition to solve LP optimally, in a decentralized way
Gives rise to a market-like algorithm with multiple agents and a centralized “auctioneer”
[Guestrin & Gordon, ’02]
Auction-Style Planning

Each agent solves its local (stand-alone) MDP. Agents send “constraint messages” to the auctioneer: they must agree on a “policy” for shared variables. The auctioneer sets prices based on conflicts and sends “pricing messages” to the agents; pricing reflects penalties for constraint violations and influences the agents’ rewards in their local MDPs. [Guestrin & Gordon, ’02]
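A schematic, toy version of this loop; the Agent class and the additive price update below are illustrative stand-ins, while the actual method is a Dantzig-Wolfe decomposition of the planning LP:

```python
from dataclasses import dataclass

@dataclass
class Agent:
    """Toy stand-in for an agent-based MDP: the agent claims a shared
    resource only while its private utility exceeds the posted price."""
    name: str
    utility: dict  # shared resource -> private utility of using it

    def solve_local_mdp(self, prices):
        return {r: u > prices[r] for r, u in self.utility.items()}

def auction(agents, resources, step=0.5, rounds=100):
    """Auctioneer loop: raise prices on resources that agents fight over."""
    prices = {r: 0.0 for r in resources}
    for _ in range(rounds):
        plans = {ag.name: ag.solve_local_mdp(prices) for ag in agents}
        # Conflict: more than one agent claims the same shared resource.
        conflicts = [r for r in resources
                     if sum(p.get(r, False) for p in plans.values()) > 1]
        if not conflicts:
            break
        for r in conflicts:
            prices[r] += step          # penalize the contested resource
    return plans, prices

agents = [Agent("uav1", {"fuel": 3.0}), Agent("uav2", {"fuel": 2.0})]
plans, prices = auction(agents, ["fuel"])
print(plans, prices)   # only uav1 still claims fuel once the price rises
```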
Fuel Allocation Problem

UAVs share a pot of fuel; targets have varying priority; target interference is ignored. [Figure: map with UAV start locations and targets.]

[Bererton, Gordon, Thrun & Khosla ’03]
High-Speed Robot Paintball
[Figures: game variants 1 and 2, showing the coordination point, sensor placement, start locations (x), and goal locations (+).]
Outline
Action Coordination: Factored Value Functions; Coordination Graphs; Context-Specific Coordination
Joint Planning: Multi-Agent Markov Decision Processes; Efficient Linear Programming Solution; Decentralized Market-Based Solution
Generalizing to New Environments: Relational MDPs; Generalizing Value Functions
Generalizing to New Problems
Solve Problem 1, Problem 2, …, Problem n, and obtain a good solution to Problem n+1.

The MDPs are different: different sets of states, actions, rewards, transitions, … Yet many problems are “similar”.
Generalizing with Relational MDPs
“Similar” domains have similar “types” of objects. Exploit the similarities by computing generalizable value functions: Relational MDP → Generalization. This avoids the need to replan and lets us tackle larger problems.
Relational Models and MDPs
Classes: Peasant, Footman, Gold, Barracks, Enemy, …
Relations: Collects, Builds, Trains, Attacks, …
Instances: Peasant1, Peasant2, Footman1, Enemy1, …

Builds on Probabilistic Relational Models [K. & Pfeffer ‘98]. [Guestrin, K., Gearhart & Kanodia ‘03]
Relational MDPs
A very compact representation whose size does not depend on the number of objects. Class-level transition probabilities depend on a class's attributes, its actions, and the attributes of related objects; there is also a class-level reward function. [Figure: class-level DBN fragments for Enemy and Footman, with Health → H’ transitions, action AFootman, the my_enemy link, and reward RCount.] [Guestrin, K., Gearhart & Kanodia ‘03]
World is a Large Factored MDP
An instantiation (a world) specifies the number of instances of each class and the links between the instances. Relational MDP + number of objects + links between objects = a well-defined factored MDP.
MDP with 2 Footmen and 2 Enemies
[Figure: the instantiated factored MDP. State variables F1.Health, F2.Health, E1.Health, E2.Health transition to F1.H’, F2.H’, E1.H’, E2.H’; actions F1.A and F2.A; rewards R1 and R2; Footman1 is linked to Enemy1 and Footman2 to Enemy2.]
World is a Large Factored MDP
Instantiate the world, obtain a well-defined factored MDP, and use the factored LP for planning. But if every new world requires planning from scratch, we have gained nothing!
Class-level Value Functions
V(F1.H, E1.H, F2.H, E2.H) = VF1(F1.H, E1.H) + VE1(E1.H) + VF2(F2.H, E2.H) + VE2(E2.H)

Units are interchangeable: VF1 = VF2 = VF and VE1 = VE2 = VE, so one class-level component per class suffices. At state x, each footman still makes its own contribution to V, since its own local state is plugged into VF. Given the class-level weights wC, we can instantiate the value function for any world. [Bar charts: VF over the four alive/dead combinations of a footman and its enemy; VE over enemy alive/dead.]
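A sketch of how class-level components instantiate to worlds of any size; the component values below are made up:

```python
# Hypothetical class-level value components, learned once.
V_F = {("alive", "alive"): 20.0, ("alive", "dead"): 15.0,
       ("dead", "alive"): 0.0, ("dead", "dead"): 5.0}   # VF(F.H, E.H)
V_E = {"alive": 0.0, "dead": 10.0}                       # VE(E.H)

def world_value(footmen, enemies):
    """Instantiate the class-level value function for a world: every
    footman reuses VF, every enemy reuses VE."""
    pairs = zip(footmen, enemies)          # footman i linked to my_enemy i
    return (sum(V_F[f, e] for f, e in pairs) +
            sum(V_E[e] for e in enemies))

# The same components value a 2-vs-2 world and a 3-vs-3 world:
print(world_value(["alive", "alive"], ["alive", "dead"]))
print(world_value(["alive", "dead", "alive"], ["dead", "dead", "alive"]))
```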
Factored LP-based Generalization
Sample a set I of small worlds (e.g., one footman paired with one enemy). The class-level factored LP computes the class-level components VF and VE from these samples; generalization then instantiates the same components in a larger world (e.g., three footman-enemy pairs as VF1/VE1, VF2/VE2, VF3/VE3). [Bar charts: the learned VF and VE, and their instantiations in the generalized world.]

How many samples are needed?
Sampling Complexity
Do exponentially many worlds require exponentially many samples? And since the number of objects in a world is unbounded, must we sample very large worlds? NO!
Theorem

Sample m small worlds, each with up to O(ln 1/δ) objects. Then the resulting value function is within O(ε) of the class-level value function optimized for all worlds, with probability at least 1-δ; the required m depends on Rcmax, the maximum class reward.
Strategic 2x2
Relational MDP model: 2 peasants, 2 footmen, enemy, gold, wood, barracks; ~1 million state/action pairs.

Offline: the factored LP computes the value function Q.
Online: the coordination graph computes argmax_a Q(x,a) for the current world state x.
Strategic 9x3

Relational MDP model: 9 peasants, 3 footmen, enemy, gold, wood, barracks; ~3 trillion state/action pairs, growing exponentially in the number of agents.

Offline: the factored LP computes the value function Q.
Online: the coordination graph computes argmax_a Q(x,a) for the current world state x.
Strategic Generalization
Relational MDP model. Offline: on the 2-peasant, 2-footman world (enemy, gold, wood, barracks; ~1 million state/action pairs), the factored LP computes a class-level value function wC. Online: in the 9-peasant, 3-footman world (~3 trillion state/action pairs), the coordination graph computes argmax_a Q(x,a); the instantiated Q-functions grow only polynomially in the number of agents.
Tactical Generalization
Planned in a 3-footmen-versus-3-enemies world; generalized to 4 footmen versus 4 enemies. [Figure: 3 v. 3 → generalize → 4 v. 4.]
Demo: Planned Tactical 3x3 (Guestrin, Koller, Gearhart & Kanodia)
Demo: Generalized Tactical 4x4 (Guestrin, Koller, Gearhart & Kanodia) [Guestrin, K., Gearhart & Kanodia ‘03]
Summary
Structured multi-agent MDPs enable effective planning under uncertainty, distributed coordinated action selection, and generalization to new problems.
Important Questions
Continuous spaces. Partial observability. Complex actions. Learning to act. How far can we go?