Stochastic Dynamic Programming with Factored Representations
Presentation by Dafna Shahaf (Boutilier, Dearden, Goldszmidt 2000)
The Problem
Standard MDP algorithms require explicit state-space enumeration: the curse of dimensionality.
Need: a compact representation (intuition: STRIPS).
Need: versions of the standard dynamic programming algorithms that work on it.
A Glimpse of the Future
[Figures: a policy tree and a value tree]
A Glimpse of the Future: Some Experimental Results
Roadmap
MDPs: Reminder
Structured Representation for MDPs: Bayesian Nets, Decision Trees
Algorithms for Structured Representation
Experimental Results
Extensions
MDPs: Reminder
An MDP is a tuple $\langle S, A, T, R \rangle$: states, actions, transitions, rewards.
Discounted infinite-horizon criterion.
Stationary policies $\pi : S \to A$ (an action to take at state s).
Value functions: $V_\pi^k(s)$ is the k-stage-to-go value function for $\pi$.
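Spelled out, with discount factor $\gamma$, the k-stage-to-go values satisfy the standard recurrence:

$$V_\pi^0(s) = R(s), \qquad V_\pi^{k+1}(s) = R(s) + \gamma \sum_{s' \in S} \Pr(s' \mid s, \pi(s))\, V_\pi^k(s').$$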
Roadmap
MDPs: Reminder
Structured Representation for MDPs: Bayesian Nets, Decision Trees
Algorithms for Structured Representation
Experimental Results
Extensions
Representing MDPs as Bayesian Networks: Coffee World
Variables: O (robot is in office), W (robot is wet), U (robot has umbrella), R (it is raining), HCR (robot has coffee), HCO (owner has coffee).
Actions: Go (switch location), BuyC (buy coffee), DelC (deliver coffee), GetU (get umbrella).
The effects of the actions may be noisy, so we need to provide a distribution for each effect.
Representing Actions: DelC
[Figure: two-stage Bayesian network with decision-tree CPTs for the DelC action]
Representing Actions: Interesting Points
No need to provide a marginal distribution over pre-action variables.
Markov property: we need only the previous state. For now, no synchronic arcs.
Frame problem?
A single network vs. a network for each action.
Why decision trees? (see the sketch below)
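For illustration, a minimal Python sketch (not from the paper) of a tree-structured CPT for one post-action variable. Variable names follow the coffee world; the 0.8 success probability is made up for this sketch.

# A decision-tree CPT is either a float leaf (probability the variable is
# true after the action) or a node testing a pre-action variable.
def make_node(var, if_true, if_false):
    return (var, if_true, if_false)

def prob_true(tree, state):
    """Evaluate a decision-tree CPT at a pre-action state (dict var -> bool)."""
    while not isinstance(tree, float):
        var, if_true, if_false = tree
        tree = if_true if state[var] else if_false
    return tree

# Illustrative CPT for HCO after DelC: the owner ends up with coffee only if
# he already had it, or if the robot is in the office holding coffee (the
# 0.8 success probability is invented for this sketch).
hco_after_delc = make_node("HCO", 1.0,
                           make_node("O", make_node("HCR", 0.8, 0.0), 0.0))

state = {"HCO": False, "O": True, "HCR": True}
print(prob_true(hco_after_delc, state))  # -> 0.8

The tree only tests the variables that matter, which is exactly the compactness the slide is after: a full CPT over six boolean variables would need 64 rows.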
Representing Reward
The reward is generally determined by a subset of the features, so it too can be represented as a compact tree, Tree(R).
Policies and Value Functions
[Figures: a policy tree (internal nodes test features, e.g. the HCR=T / HCR=F branches; leaves are actions) and a value tree (leaves are values)]
The optimal choice may depend only on certain variables (given the values of some others).
Roadmap
MDPs: Reminder
Structured Representation for MDPs: Bayesian Nets, Decision Trees
Algorithms for Structured Representation
Experimental Results
Extensions
Bellman Backup
Q-function: the value of performing a in s, given value function v:

$$Q_a^v(s) = R(s) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, v(s')$$

Value Iteration: Reminder

$$V^{k+1}(s) = \max_a Q_a^{V^k}(s) = \max_a \Big\{ R(s) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^k(s') \Big\}$$
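As a concrete reference point, a minimal tabular value-iteration sketch in Python (not from the paper); it enumerates the state space explicitly, which is exactly what the structured algorithms avoid.

def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    """Tabular value iteration: P[a][s][s2] = Pr(s2 | s, a), R[s] = reward."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {s: max(R[s] + gamma * sum(P[a][s][s2] * V[s2] for s2 in states)
                        for a in actions)
                 for s in states}
        if max(abs(V_new[s] - V[s]) for s in states) < eps:
            return V_new
        V = V_new

# Toy two-state usage example (invented for this sketch):
states, actions = [0, 1], ["stay", "move"]
P = {"stay": {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}},
     "move": {0: {0: 0.0, 1: 1.0}, 1: {0: 1.0, 1: 0.0}}}
R = {0: 0.0, 1: 1.0}
print(value_iteration(states, actions, P, R))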
Structured Value Iteration: Overview
Input: Tree(R). Output: Tree($V^*$).
1. Set Tree($V^0$) = Tree(R).
2. Repeat:
   (a) Compute Tree($Q_a^{V^k}$) = Regress(Tree($V^k$), a) for each action a.
   (b) Merge (via maximization) the trees Tree($Q_a^{V^k}$) to obtain Tree($V^{k+1}$).
   Until the termination criterion holds.
3. Return Tree($V^*$).
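A Python skeleton of this loop (a sketch, not the paper's implementation); regress and merge_max stand for the per-action regression and the maximization merge sketched with the later slides, and tree equality is a stand-in termination criterion.

def structured_value_iteration(reward_tree, actions, regress, merge_max):
    """Sketch of structured value iteration over decision trees."""
    v = reward_tree                                # Tree(V0) = Tree(R)
    while True:
        q_trees = [regress(v, a) for a in actions]  # step (a): Tree(Q_a^{Vk})
        v_next = q_trees[0]
        for q in q_trees[1:]:                       # step (b): merge by max
            v_next = merge_max(v_next, q)
        if v_next == v:   # stand-in termination test; the paper's criterion
            return v_next  # compares successive value trees more carefully
        v = v_next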
Example World
Step 2a: Calculating Q-Functions

$$Q_a^V(s) = R(s) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V(s')$$

The three parts: (1) the expected future value, (2) discounting the future value, (3) adding the immediate reward.
How to use the structure of the trees? Tree($Q_a^V$) should distinguish only conditions under which a makes a branch of Tree(V) true with different odds.
Calculating Tree($Q_a^1$):
PTree($Q_a^1$): found by identifying the conditions under which a will have distinct expected value with respect to $V^0$.
FVTree($Q_a^1$): the undiscounted expected future value of performing action a with one stage to go.
Tree($Q_a^1$): obtained by discounting FVTree (by 0.9) and adding the immediate reward function.
[Figure: Tree($V^0$) and the derived PTree, FVTree, and Tree($Q_a^1$); e.g., one leaf's expected future value works out to $1 \cdot 10 + 0 \cdot 0 = 10$]
An Alternative View (a more complicated example):
Starting from Tree($V^1$): build PartialPTree($Q_a^2$), expand it into the unsimplified PTree($Q_a^2$), simplify to PTree($Q_a^2$), compute FVTree($Q_a^2$), and finally Tree($Q_a^2$).
The Algorithm: Regress
Input: Tree(V), action a. Output: Tree($Q_a^V$).
1. PTree($Q_a^V$) = PRegress(Tree(V), a) (simplified).
2. Construct FVTree($Q_a^V$): for each branch b of the PTree, with leaf node l(b):
   (a) $\Pr_b$ = the product of the individual distributions at l(b).
   (b) $v_b = \sum_{b' \in \mathrm{Tree}(V)} \Pr_b(b')\, V(b')$
   (c) Re-label leaf l(b) with $v_b$.
3. Discount FVTree($Q_a^V$) by $\gamma$; append Tree(R).
4. Return FVTree($Q_a^V$).
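A minimal Python sketch of step 2, using the tuple-based tree representation from the earlier sketches (evaluate and expected_leaf_value are illustrative names, not from the paper); dists is the product distribution at one PTree leaf and is assumed to cover every variable that Tree(V) tests.

from itertools import product

def evaluate(tree, state):
    """Evaluate a value tree (float leaves, (var, t, f) nodes) at a state."""
    while not isinstance(tree, float):
        var, t, f = tree
        tree = t if state[var] else f
    return tree

def expected_leaf_value(value_tree, dists):
    """v_b = sum over b' of Pr_b(b') * V(b'), with dists[X] = Pr(X true)."""
    vars_ = sorted(dists)
    total = 0.0
    for bits in product([True, False], repeat=len(vars_)):
        state = dict(zip(vars_, bits))
        p = 1.0
        for x in vars_:
            p *= dists[x] if state[x] else 1.0 - dists[x]
        total += p * evaluate(value_tree, state)
    return total

# e.g., a value tree giving 10 when HCO holds and 0 otherwise, at a PTree
# leaf where HCO becomes true with probability 1: 1*10 + 0*0 = 10.
print(expected_leaf_value(("HCO", 10.0, 0.0), {"HCO": 1.0}))  # -> 10.0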
The Algorithm: PRegress
Input: Tree(V), action a. Output: PTree($Q_a^V$).
1. If Tree(V) is a single node, return the empty tree.
2. X = the variable at the root of Tree(V). $T_X^P$ = the tree for CPT(X) (label the leaves with X's distribution).
3. $T_{X=t}^V, T_{X=f}^V$ = the subtrees of Tree(V) for X = t and X = f.
4. $T_{X=t}^P, T_{X=f}^P$ = the results of calling PRegress on $T_{X=t}^V$ and $T_{X=f}^V$.
5. For each leaf l of $T_X^P$, append $T_{X=t}^P$, $T_{X=f}^P$, or both (according to the distribution at l; use union to combine the labels).
6. Return $T_X^P$.
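A compressed Python sketch of PRegress over the same representation, skipping the simplification step (so it produces the "unsimplified" PTree the earlier slide mentions). All names are illustrative; cpts[a][X] is assumed to hold the CPT tree of post-action variable X under action a, with float leaves giving Pr(X true).

# PTree leaves are dicts {var: Pr(var true after a)}; internal nodes are
# (test_var, true_subtree, false_subtree), as in the earlier sketches.
def merge_union(t1, t2):
    """Combine two PTrees, taking the union of leaf labels (step 5)."""
    if isinstance(t1, dict):
        if isinstance(t2, dict):
            return {**t1, **t2}
        v, t, f = t2
        return (v, merge_union(t1, t), merge_union(t1, f))
    v, t, f = t1
    return (v, merge_union(t, t2), merge_union(f, t2))

def pregress(vtree, a, cpts):
    """Unsimplified PRegress for boolean variables."""
    if isinstance(vtree, float):            # step 1: a single node
        return {}
    x, v_true, v_false = vtree              # step 2: root variable X
    p_true = pregress(v_true, a, cpts)      # step 4: recurse on subtrees
    p_false = pregress(v_false, a, cpts)

    def attach(node):                       # step 5: walk CPT(X)'s tree
        if isinstance(node, float):
            p = node                        # Pr(X true after a) at this leaf
            if p == 1.0:
                sub = p_true
            elif p == 0.0:
                sub = p_false
            else:                           # both outcomes possible
                sub = merge_union(p_true, p_false)
            return merge_union({x: p}, sub)
        v, t, f = node
        return (v, attach(t), attach(f))

    return attach(cpts[a][x])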
Step 2b. Maximization
Value Iteration Complete.
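A minimal Python sketch of the maximization merge over the same tuple representation (restrict and merge_max are illustrative names; no simplification of redundant tests is attempted).

def restrict(tree, var, val):
    """The value tree obtained by fixing var = val."""
    if isinstance(tree, float):
        return tree
    v, t, f = tree
    if v == var:
        return restrict(t if val else f, var, val)
    return (v, restrict(t, var, val), restrict(f, var, val))

def merge_max(t1, t2):
    """Pointwise maximum of two value trees (float leaves, (var, t, f) nodes)."""
    if isinstance(t1, float):
        if isinstance(t2, float):
            return max(t1, t2)
        v, t, f = t2
        return (v, merge_max(t1, t), merge_max(t1, f))
    v, t, f = t1
    return (v, merge_max(t, restrict(t2, v, True)),
               merge_max(f, restrict(t2, v, False)))

# e.g., the max of a tree testing HCO and a constant tree:
print(merge_max(("HCO", 10.0, 0.0), 3.0))  # -> ('HCO', 10.0, 3.0)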
Roadmap
MDPs: Reminder
Structured Representation for MDPs: Bayesian Nets, Decision Trees
Algorithms for Structured Representation
Experimental Results
Extensions
Experimental Results
[Figures: worst-case and best-case results for the structured algorithms]
Roadmap
MDPs: Reminder
Structured Representation for MDPs: Bayesian Nets, Decision Trees
Algorithms for Structured Representation
Experimental Results
Extensions
Extensions
Synchronic edges
POMDPs
Rewards
Approximation
Questions?
Backup slides
Here be dragons.
Regression through a Policy
Improving Policies: Example
Maximization Step, Improved Policy