Graphical Models for Online Solutions to Interactive POMDPs
Prashant Doshi (University of Georgia, USA), Yifeng Zeng (Aalborg University, Denmark), Qiongyu Chen (National University of Singapore)
International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2007)
Decision-Making in Multiagent Settings
[Figure: two agents i and j interacting with a shared environment in state S. Each agent chooses its actions (Ai, Aj), receives its observations (Oi, Oj), holds a belief over the state and a model of the other agent, and acts to optimize its preferences given its beliefs.]
Finitely Nested I-POMDP (Gmytrasiewicz & Doshi, 2005)

A finitely nested I-POMDP of agent i with strategy level l:

$I\text{-}POMDP_{i,l} = \langle IS_{i,l},\, A,\, T_i,\, \Omega_i,\, O_i,\, R_i \rangle$

- Interactive states $IS_{i,l} = S \times M_{j,l-1}$: beliefs about the physical environment together with beliefs about other agents in terms of their preferences, capabilities, and beliefs
- Models of j, $M_{j,l-1}$, include types $\theta_{j,l-1} = \langle b_{j,l-1},\, A_j,\, \Omega_j,\, T_j,\, O_j,\, R_j,\, OC_j \rangle$
- $A = A_i \times A_j$: joint actions
- $\Omega_i$: possible observations
- $T_i$: transition function, $S \times A \times S \to [0,1]$
- $O_i$: observation function, $S \times A \times \Omega_i \to [0,1]$
- $R_i$: reward function, $IS_{i,l} \times A \to \mathbb{R}$
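As a minimal data-structure sketch of this tuple (Python; the names are illustrative, not drawn from the authors' implementation):

```python
# A sketch only: containers for the finitely nested I-POMDP tuple.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class ModelOfJ:
    """Intentional model (type) of j at level l-1: a belief plus a frame."""
    belief: Dict   # b_{j,l-1}: distribution over j's interactive states
    frame: Tuple   # (A_j, Omega_j, T_j, O_j, R_j, OC_j)

@dataclass
class IPOMDP:
    """Finitely nested I-POMDP of agent i at strategy level l."""
    S: List                    # physical states
    M_j: List[ModelOfJ]        # M_{j,l-1}; IS_{i,l} = S x M_{j,l-1}
    A_i: List                  # i's actions; a joint action is (a_i, a_j)
    A_j: List
    Omega_i: List              # i's possible observations
    T_i: Callable[..., float]  # T_i(s, a, s') in [0, 1]
    O_i: Callable[..., float]  # O_i(s', a, o_i) in [0, 1]
    R_i: Callable[..., float]  # R_i(is, a)
```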
Belief Update
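The exact update has roughly this nested form (a sketch following Gmytrasiewicz & Doshi, 2005, where $\tau(b_{j,l-1}^{t-1}, a_j^{t-1}, o_j^{t}, b_{j,l-1}^{t}) = 1$ exactly when j's own belief update maps $b_{j,l-1}^{t-1}$ to $b_{j,l-1}^{t}$):

$$b_{i,l}^{t}(is^{t}) \propto \sum_{is^{t-1}} b_{i,l}^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} P(a_j^{t-1} \mid \theta_{j,l-1}^{t-1})\, T_i(s^{t-1}, a^{t-1}, s^{t})\, O_i(s^{t}, a^{t-1}, o_i^{t}) \sum_{o_j^{t}} O_j(s^{t}, a^{t-1}, o_j^{t})\, \tau(b_{j,l-1}^{t-1}, a_j^{t-1}, o_j^{t}, b_{j,l-1}^{t})$$

Evaluating the nested $\tau$ term means updating, and hence solving, every model of j at every level of the nesting.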
Forget It!
A different approach:
- Belief update: use the language of Influence Diagrams (IDs) to represent the problem more transparently
- Solution: use standard ID algorithms to solve it
Challenges
- Representing nested models of other agents: the influence diagram is a single-agent oriented language
- Updating beliefs over the models of other agents: new models of others arise because, over time, agents revise their beliefs over the models of others as they receive observations
Related Work
- Multiagent Influence Diagrams (MAIDs) (Koller & Milch, 2001): use IDs to represent incomplete-information games and compute Nash equilibrium solutions efficiently by exploiting conditional independence
- Networks of Influence Diagrams (NIDs) (Gal & Pfeffer, 2003): allow uncertainty over the game and multiple models of an individual agent; the solution involves collapsing models into a MAID or ID
- Both model static, single-play games; they do not consider agent interactions over time (sequential decision making)
Introduce Model Node and Policy Link
A generic level l interactive ID (I-ID) for agent i situated with one other agent j:
- Model node Mj,l-1: models of agent j at level l-1
- Policy link (dashed line): distribution over the other agent's actions given its models
- Beliefs over Mj,l-1: P(Mj,l-1 | s). How should they be updated?

[Figure: level l I-ID with chance nodes S, Oi, and Aj, decision node Ai, utility node Ri, and model node Mj,l-1]
Details of the Model Node
Members of the model node:
- The chance nodes Aj^1 and Aj^2 hold the solutions of the models mj,l-1^1 and mj,l-1^2
- Mod[Mj] represents the different models of agent j
- The CPT of the chance node Aj is a multiplexer: it assumes the distribution of each of the action nodes (Aj^1, Aj^2) depending on the value of Mod[Mj] (sketched in code below)
- mj,l-1^1 and mj,l-1^2 could themselves be I-IDs or IDs

[Figure: the model node Mj,l-1 expanded into Mod[Mj], the solution nodes Aj^1 and Aj^2, and the aggregated node Aj, with S as a parent]
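A minimal sketch of the multiplexer in Python (hypothetical names, not the authors' implementation): conditioned on Mod[Mj] selecting model k, Aj takes on the action distribution that solving mj,l-1^k produced.

```python
# Multiplexer CPT for Aj: P(Aj | Mod[Mj] = k) is simply the action
# distribution computed by solving the k-th model of j.
def aj_cpt(mod_mj: int, solutions: list) -> dict:
    """solutions[k]: distribution over j's actions (node Aj^k),
    obtained by solving model m_j^k."""
    return solutions[mod_mj]

# Example with two models of j whose solutions differ:
solutions = [
    {"listen": 0.9, "open_left": 0.05, "open_right": 0.05},  # solves m_j^1
    {"listen": 0.2, "open_left": 0.10, "open_right": 0.70},  # solves m_j^2
]
p_aj = aj_cpt(0, solutions)  # distribution of Aj when Mod[Mj] = m_j^1
```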
Whole I-ID
[Figure: the complete level l I-ID with the model node expanded in place: S, Oi, decision node Ai, utility Ri, Aj, Mod[Mj], and the solution nodes Aj^1 and Aj^2 of the models mj,l-1^1 and mj,l-1^2, which could themselves be I-IDs or IDs]
Interactive Dynamic Influence Diagrams (I-DIDs)
[Figure: a level l I-DID unrolled over two time slices. Time step t contains chance nodes S^t, Oi^t, and Aj^t, model node Mj,l-1^t, decision node Ai^t, and utility Ri; time step t+1 contains the corresponding nodes. The model update link connects Mj,l-1^t to Mj,l-1^t+1.]
Semantics of Model Update Link

[Figure: the model update link expanded between two time slices. At time t, model node Mj,l-1^t holds models mj,l-1^t,1 and mj,l-1^t,2 with the selector Mod[Mj^t], their solution nodes Aj^1 and Aj^2 feeding Aj^t, and observation nodes Oj^1 and Oj^2 feeding Oj. At time t+1, Mj,l-1^t+1 holds four updated models mj,l-1^t+1,1 ... mj,l-1^t+1,4, with solutions Aj^1 ... Aj^4 and selector Mod[Mj^t+1].]

The updated models differ in their initial beliefs: each is the result of j updating its belief given one of its actions and one of its possible observations.
Notes
- The updated set of models at time step t+1 will have at most $|M_{j,l-1}^t|\,|A_j|\,|\Omega_j|$ models, where $|M_{j,l-1}^t|$ is the number of models at time step t, $|A_j|$ is the largest space of actions, and $|\Omega_j|$ is the largest space of observations
- The new distribution over the updated models uses the original distribution over the models, the probability of the other agent performing the action, and the probability of it receiving the observation that led to the updated model (see the sketch below)
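A sketch of this update in Python (the helper signatures are assumptions for illustration, not the authors' code):

```python
# Enumerate the updated models: each model of j, combined with each
# action j may take under that model's solution and each observation
# j may receive, yields one updated model; hence at most
# |M^t| * |A_j| * |Omega_j| models at time t+1.
def update_models(models, prior, Omega_j, policy, obs_prob, belief_update):
    """models: j's models at time t; prior[k] = P(models[k]).
    policy(m): dict a_j -> P(a_j | m), the solution of model m.
    obs_prob(m, a, o): probability that j observes o after action a.
    belief_update(m, a, o): m with its belief updated on (a, o).
    Returns the updated models and the normalized distribution."""
    new_models, weights = [], []
    for k, m in enumerate(models):
        for a, p_a in policy(m).items():
            for o in Omega_j:
                w = prior[k] * p_a * obs_prob(m, a, o)  # prior x action x obs
                if w > 0:
                    new_models.append(belief_update(m, a, o))
                    weights.append(w)
    z = sum(weights)
    return new_models, [w / z for w in weights]
```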
[Figure: the full two-time-slice I-DID with both model nodes expanded as above, showing how the model update link is realized through Mod[Mj^t], the solution nodes Aj^1 and Aj^2, the observation nodes Oj^1 and Oj^2, Mod[Mj^t+1], and the four updated models mj,l-1^t+1,1 ... mj,l-1^t+1,4.]
Example Applications: Emergence of Social Behaviors

- Followership and leadership in the persistent multiagent tiger problem
- Altruism and reciprocity in the public good problem with punishment
- Strategies in a simple version of two-player poker
Followership and Leadership in the Multiagent Persistent Tiger Problem

Experimental setup:
- Agent j has a better hearing capability (95% accurate) than i (65% accurate)
- Agent i has no initial information about the tiger's location
- Agent i considers two models of agent j that differ in j's level 0 initial beliefs: in one, j likely thinks the tiger is behind the left door; in the other, j likely thinks it is behind the right door
- Solve the corresponding level 1 I-DID expanded over three time steps to obtain the normative behavioral policy of agent i
Level 1 I-ID in the Tiger Problem

- Expand over three time steps
- Map the other agent's decision nodes to chance nodes
Policy Tree 1: Agent i Has Hearing Accuracy of 65%

[Figure: three-step policy tree for agent i. Nodes are actions (L = listen, OL/OR = open left/right door); edges are joint observations pairing growls (GL/GR) with creaks or silence (CL/CR, S), e.g., GL,CR or GR,S/CL. i mostly listens, opening a door only on branches whose growl and creak observations jointly support it.]

Conditional followership
Policy Tree 2: Agent i Loses Its Hearing Ability (accuracy is 0.5)

[Figure: policy tree for agent i with uninformative hearing. i listens, then simply mimics j regardless of growls (*): a creak right (*,CR) leads to OR, a creak left (*,CL) to OL, and silence (*,S) to listening again.]

Unconditional (blind) followership
Example 2: Altruism and Reciprocity in the Public Good Problem

The public good game:
- Two agents, each initially endowed with an amount X_T of resources
- Each agent may either contribute (C) a fixed amount of its resources to a public pot or not contribute, i.e., defect (D)
- The agents' actions and the pot are not observable, but each agent receives an observation symbolizing the state of the public pot: plenty (PY) or meager (MR)
- The value of the resources in the public pot is discounted by c_i (< 1) for each agent i, where c_i is the marginal private return
- To encourage contributions, contributing agents punish free riders (P), but incur a small cost c_p for administering the punishment (see the payoff sketch below)
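To make the incentives concrete, a hedged one-step payoff sketch (X_T, c_i, c_p, and P are the quantities named above; the functional form and the numbers are illustrative assumptions):

```python
# Illustrative one-step payoff for agent i in the public good game
# with punishment; the exact form used in the paper may differ.
def payoff(my_act, other_act, X_T=2.0, contrib=1.0, c_i=0.9, c_p=0.1, P=0.5):
    """my_act/other_act: 'C' (contribute) or 'D' (defect)."""
    pot = (my_act == 'C') * contrib + (other_act == 'C') * contrib
    kept = X_T - (contrib if my_act == 'C' else 0.0)
    value = kept + c_i * pot          # pot discounted by marginal return
    if my_act == 'C' and other_act == 'D':
        value -= c_p                  # cost of administering punishment
    if my_act == 'D' and other_act == 'C':
        value -= P                    # punishment received as a free rider
    return value

# With these numbers, mutual contribution (2.8) beats mutual defection
# (2.0), but a lone defector still nets 2.4 against a contributor's 1.8.
```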
Agent Types

Altruistic and non-altruistic types:
- An altruistic agent has a high marginal private return (c_i close to 1) and does not punish others who defect

Optimal behavior:
- One action remaining: both types of agents choose to contribute, to avoid being punished
- Two actions to go: the altruistic type chooses to contribute, while the other defects. Why?
- Three steps to go: the altruistic agent contributes to avoid punishment, and the non-altruistic type defects
- More than three steps: the altruistic agent continues to contribute to the public pot depending on how close its marginal return is to 1, while the non-altruistic type defects
Level 1 I-ID in the Public Good Game

Expand over three time steps
Policy Tree 1: Altruism in PG

If agent i (altruistic type) believes with probability 1 that j is altruistic, i chooses to contribute at each of the three steps. This behavior persists when i is unsure whether j is altruistic, and even when i assigns a high probability to j being the non-altruistic type.

[Figure: policy tree in which i contributes (C) at every step regardless of its observations (*)]
Policy Tree 2: Reciprocal Agents

Reciprocal type:
- The reciprocal type's marginal private return is lower, and it obtains a greater payoff when its action matches that of the other agent

Experimental setup:
- Consider the case where the reciprocal agent i is unsure whether j is altruistic and believes that the public pot is likely to be half full

Optimal behavior:
- From this prior belief, i chooses to defect
- On receiving an observation of plenty (PY), i decides to contribute, while an observation of meager (MR) makes it defect
- With one action to go, i, believing that j contributes, chooses to contribute too to avoid punishment, regardless of its observations

[Figure: policy tree in which i defects (D) first, branches on PY/MR at the second step, and contributes (C) at the final step]
Conclusion and Future Work

- I-DIDs: a general ID-based formalism for sequential decision making in multiagent settings; online counterparts of I-POMDPs
- Solve I-DIDs approximately for computational efficiency (see our AAAI '07 paper on model clustering)
- Apply I-DIDs to other application domains

Visit our poster on I-DIDs today for more information
Thank You!