5/11/2015 Mahdi Naser-Moghadasi Texas Tech University.
04/18/23
Mahdi Naser-Moghadasi, Texas Tech University
An MDP model contains:
A set of states S
A set of actions A
A state-transition function T
A reward function R(s, a)
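The tuple above can be sketched as plain Python data. The two-state MDP below is only a minimal illustration of the shape of S, A, T, and R; all state and action names are made up.

```python
# Minimal sketch of an MDP as plain Python data (all names are illustrative).
S = ["s0", "s1"]                      # set of states S
A = ["stay", "go"]                    # set of actions A

# State-transition function T: T[s][a] maps next states to probabilities.
T = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
}

# Reward function R(s, a).
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0,
     ("s1", "stay"): 0.0, ("s1", "go"): 1.0}

# Sanity check: each T[s][a] is a probability distribution over next states.
assert all(abs(sum(d.values()) - 1.0) < 1e-12
           for acts in T.values() for d in acts.values())
```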
[Diagram: the agent-environment loop. The agent takes an action; the environment returns a state and a reward.]
A POMDP model contains:
A set of states S
A set of actions A
A state-transition function T
A reward function R(s, a)
A finite set of observations Ω
An observation function O: S × A → Π(Ω), written O(s', a, o): the probability of observing o after taking action a and landing in state s'
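Because the agent never observes the state directly, it maintains a belief b (a distribution over S) and updates it by Bayes' rule after each action-observation pair. A minimal sketch of that standard update, assuming T and O are stored as dictionaries (the data layout is an assumption, not from the paper):

```python
def belief_update(b, a, o, S, T, O):
    """Bayes update of belief b after taking action a and observing o.

    b: {s: probability}; T[s][a]: {s_next: probability};
    O[(s_next, a)]: {o: probability}, i.e. O(s', a, o) from the slides.
    """
    new_b = {}
    for s_next in S:
        # Predicted probability of s_next before seeing the observation.
        predicted = sum(T[s][a].get(s_next, 0.0) * b[s] for s in S)
        # Weight by the probability of the observation in s_next.
        new_b[s_next] = O[(s_next, a)].get(o, 0.0) * predicted
    norm = sum(new_b.values())
    return {s: p / norm for s, p in new_b.items()}
```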
Value function
A policy is a description of the agent's behavior.
Policy trees; the Witness algorithm
A policy tree of depth t specifies a complete t-step policy.
Nodes: actions; the top node determines the first action to be taken.
Edges: the resulting observations
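A policy tree can be represented directly as a node holding an action plus one subtree per observation. The class below is an illustrative sketch (the names are assumptions), using the Tl/Tr observations from the tiger example later in the slides:

```python
class PolicyTree:
    """A node of a t-step policy tree: an action, then one subtree per observation."""

    def __init__(self, action, subtrees=None):
        self.action = action            # action taken at this node
        self.subtrees = subtrees or {}  # observation -> PolicyTree

    def depth(self):
        if not self.subtrees:
            return 1
        return 1 + max(t.depth() for t in self.subtrees.values())

# A depth-2 tree: listen first, then open the door away from where
# the tiger was heard.
tree = PolicyTree("listen", {"Tl": PolicyTree("right"),
                             "Tr": PolicyTree("left")})
```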
Value Function:
V_p(s) is the t-step value of starting in state s and executing policy tree p.
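V_p(s) satisfies a simple recursion: the immediate reward for the tree's root action, plus the expected (discounted) value of the subtree selected by whichever observation arrives. A sketch, assuming a tree is encoded as an (action, {observation: subtree}) pair and that T[s][a] maps next states to probabilities, O[(s_next, a)] maps observations to probabilities, and R[(s, a)] is the reward (an assumed data layout):

```python
def tree_value(p, s, T, O, R, gamma=1.0):
    """V_p(s): expected value of executing policy tree p from state s.

    A tree is (action, {observation: subtree}); a leaf has an empty dict.
    T[s][a]: {s_next: prob}; O[(s_next, a)]: {o: prob}; R[(s, a)]: reward.
    """
    a, subtrees = p
    value = R[(s, a)]
    if not subtrees:
        return value
    for s_next, p_t in T[s][a].items():
        for o, p_o in O[(s_next, a)].items():
            value += gamma * p_t * p_o * tree_value(subtrees[o], s_next,
                                                   T, O, R, gamma)
    return value
```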
Value Evaluation: V_t with only two states:
Value Function: V_t with three states:
Improved by keeping only the useful policy trees:
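One cheap way to discard a useless policy tree is pointwise dominance over value vectors: if some other vector is at least as good in every state, the tree can never be best at any belief. This is only a conservative sketch with made-up numbers; detecting all dominated vectors in general requires a linear program.

```python
def pointwise_dominated(vp, others):
    """True if value vector vp is beaten componentwise by some vector in others."""
    return any(all(vq[i] >= vp[i] for i in range(len(vp)))
               for vq in others)

# Illustrative two-state value vectors (numbers are made up):
U = [[10.0, -100.0],   # e.g. "open left"
     [-100.0, 10.0],   # e.g. "open right"
     [-1.0, -1.0]]     # e.g. "listen"
```

Here [-1.0, -1.0] is not dominated by either of the other two vectors, so it survives this test, while a vector such as [-2.0, -2.0] would be pruned.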
Witness algorithm:
Witness algorithm: finding a witness:
At each iteration we ask: is there some belief state b for which the true value, V_t(b), computed by one-step lookahead using V_{t-1}, differs from the estimated value computed using the set U?
Witness algorithm: complete value iteration:
Start with an agenda containing any single policy tree and a set U of the useful policy trees found so far. For each tree p_new taken from the agenda, use witness points to determine whether it is an improvement over the policy trees in U:
1. If no witness point is discovered, that policy tree is removed from the agenda. When the agenda is empty, the algorithm terminates.
2. If a witness point is discovered, the best policy tree for that point is computed and added to U, and all policy trees that differ from the current policy tree in a single subtree are added to the agenda.
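The estimated value of a belief under a set U of value vectors is the maximum dot product, and a witness is a belief at which the true value beats that estimate. In the sketch below, `find_witness` merely scans a list of candidate beliefs; the actual algorithm finds a witness point exactly by solving a linear program, so this is an illustration of the test, not the real search.

```python
def estimated_value(b, U):
    """Value of belief b under a set U of value vectors: max of dot products."""
    return max(sum(bi * ai for bi, ai in zip(b, alpha)) for alpha in U)

def find_witness(true_value, U, candidate_beliefs, eps=1e-9):
    """Return a belief where the true value exceeds the U-estimate, or None.

    (The real Witness algorithm solves a linear program instead of sampling
    candidate beliefs; this scan is only for illustration.)
    """
    for b in candidate_beliefs:
        if true_value(b) > estimated_value(b, U) + eps:
            return b
    return None
```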
The tiger problem:
Two doors: behind one door is a tiger; behind the other is a large reward.
Two states: the tiger is on the left (sl) or on the right (sr).
Three actions: left, right, and listen.
Rewards: +10 for opening the correct door, -100 for opening the door with the tiger behind it, and -1 for each listen.
Observations: hearing the tiger on the left (Tl) or on the right (Tr). In state sl, the listen action yields observation Tl with probability 0.85 and Tr with probability 0.15; conversely for world state sr.
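Plugging the slide's numbers into Bayes' rule shows how a single listen shifts the belief: starting from the uniform belief (0.5, 0.5), hearing Tl once moves the probability of sl to exactly 0.85. A quick sanity check, not taken from the paper:

```python
# Tiger problem: belief over {sl, sr} after one "listen" that returns Tl.
prior = {"sl": 0.5, "sr": 0.5}
# From the slides: P(Tl | sl, listen) = 0.85, P(Tl | sr, listen) = 0.15,
# and listening does not move the tiger.
p_Tl = {"sl": 0.85, "sr": 0.15}

unnormalized = {s: p_Tl[s] * prior[s] for s in prior}
norm = sum(unnormalized.values())            # = 0.5
posterior = {s: v / norm for s, v in unnormalized.items()}
# posterior["sl"] -> 0.85
```

A second Tl pushes the belief to 0.85^2 / (0.85^2 + 0.15^2) ≈ 0.97, which is why repeated listening pays off despite the -1 cost per listen.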
Decreasing listening reliability from 0.85 down to 0.65:
How does the horizon length affect the complexity of solving POMDPs? And can we conclude that pruning non-useful policies is the key to solving POMDPs?
On page 5, they say “sometimes we need to … compute a greedy policy given a function”, why would you need the greedy policy?
Can you explain the “Witness Algorithm”? I don’t understand it at all. (Page 15)
Did you find any papers that implement the techniques in this paper and provide a discussion of timing or accuracy?
Can you give some more real-world POMDP problems besides the tiger problem?