5/11/2015 Mahdi Naser-Moghadasi Texas Tech University.
04/18/23
Mahdi Naser-Moghadasi, Texas Tech University
An MDP model contains:
A set of states S
A set of actions A
A state-transition function T
A reward function R(s, a)
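The tuple above can be sketched as plain Python data. The two-state MDP below is only a minimal illustration of the shape of S, A, T, and R; all state and action names are made up.

```python
# Minimal sketch of an MDP as plain Python data (all names are illustrative).
S = ["s0", "s1"]                      # set of states S
A = ["stay", "go"]                    # set of actions A

# State-transition function T: T[s][a] maps next states to probabilities.
T = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
}

# Reward function R(s, a).
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0,
     ("s1", "stay"): 0.0, ("s1", "go"): 1.0}

# Sanity check: each T[s][a] is a probability distribution over next states.
assert all(abs(sum(d.values()) - 1.0) < 1e-12
           for acts in T.values() for d in acts.values())
```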
[Diagram: the agent-environment loop. The agent takes an action; the environment returns a state and a reward.]
A POMDP model contains:
A set of states S
A set of actions A
A state-transition function T
A reward function R(s, a)
A finite set of observations Ω
An observation function O: S × A → Π(Ω), written O(s', a, o): the probability of observing o after taking action a and landing in state s'
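Because the agent never observes the state directly, it maintains a belief b (a distribution over S) and updates it by Bayes' rule after each action-observation pair. A minimal sketch of that standard update, assuming T and O are stored as dictionaries (the data layout is an assumption, not from the paper):

```python
def belief_update(b, a, o, S, T, O):
    """Bayes update of belief b after taking action a and observing o.

    b: {s: probability}; T[s][a]: {s_next: probability};
    O[(s_next, a)]: {o: probability}, i.e. O(s', a, o) from the slides.
    """
    new_b = {}
    for s_next in S:
        # Predicted probability of s_next before seeing the observation.
        predicted = sum(T[s][a].get(s_next, 0.0) * b[s] for s in S)
        # Weight by the probability of the observation in s_next.
        new_b[s_next] = O[(s_next, a)].get(o, 0.0) * predicted
    norm = sum(new_b.values())
    return {s: p / norm for s, p in new_b.items()}
```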
Value function
A policy is a description of the agent's behavior.
Policy trees; the Witness algorithm
A policy tree of depth t specifies a complete t-step policy.
Nodes: actions; the top node determines the first action to be taken.
Edges: the resulting observations
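A policy tree can be represented directly as a node holding an action plus one subtree per observation. The class below is an illustrative sketch (the names are assumptions), using the Tl/Tr observations from the tiger example later in the slides:

```python
class PolicyTree:
    """A node of a t-step policy tree: an action, then one subtree per observation."""

    def __init__(self, action, subtrees=None):
        self.action = action            # action taken at this node
        self.subtrees = subtrees or {}  # observation -> PolicyTree

    def depth(self):
        if not self.subtrees:
            return 1
        return 1 + max(t.depth() for t in self.subtrees.values())

# A depth-2 tree: listen first, then open the door away from where
# the tiger was heard.
tree = PolicyTree("listen", {"Tl": PolicyTree("right"),
                             "Tr": PolicyTree("left")})
```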
Value Function:
V_p(s) is the t-step value of starting in state s and executing policy tree p.
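V_p(s) satisfies a simple recursion: the immediate reward for the tree's root action, plus the expected (discounted) value of the subtree selected by whichever observation arrives. A sketch, assuming a tree is encoded as an (action, {observation: subtree}) pair and that T[s][a] maps next states to probabilities, O[(s_next, a)] maps observations to probabilities, and R[(s, a)] is the reward (an assumed data layout):

```python
def tree_value(p, s, T, O, R, gamma=1.0):
    """V_p(s): expected value of executing policy tree p from state s.

    A tree is (action, {observation: subtree}); a leaf has an empty dict.
    T[s][a]: {s_next: prob}; O[(s_next, a)]: {o: prob}; R[(s, a)]: reward.
    """
    a, subtrees = p
    value = R[(s, a)]
    if not subtrees:
        return value
    for s_next, p_t in T[s][a].items():
        for o, p_o in O[(s_next, a)].items():
            value += gamma * p_t * p_o * tree_value(subtrees[o], s_next,
                                                   T, O, R, gamma)
    return value
```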
Value Evaluation: V_t with only two states:
Value Function: V_t with three states:
Improved by keeping only the useful policy trees:
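One cheap way to discard a useless policy tree is pointwise dominance over value vectors: if some other vector is at least as good in every state, the tree can never be best at any belief. This is only a conservative sketch with made-up numbers; detecting all dominated vectors in general requires a linear program.

```python
def pointwise_dominated(vp, others):
    """True if value vector vp is beaten componentwise by some vector in others."""
    return any(all(vq[i] >= vp[i] for i in range(len(vp)))
               for vq in others)

# Illustrative two-state value vectors (numbers are made up):
U = [[10.0, -100.0],   # e.g. "open left"
     [-100.0, 10.0],   # e.g. "open right"
     [-1.0, -1.0]]     # e.g. "listen"
```

Here [-1.0, -1.0] is not dominated by either of the other two vectors, so it survives this test, while a vector such as [-2.0, -2.0] would be pruned.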
Witness algorithm:
Witness algorithm: finding a witness:
At each iteration we ask: is there some belief state b for which the true value, V_t(b), computed by one-step lookahead using V_{t-1}, differs from the estimated value computed using the set U?
Witness algorithm: complete value iteration:
Start with an agenda containing any single policy tree and a set U of the useful policy trees found so far. For each tree p_new taken from the agenda, use witness points to determine whether it is an improvement over the policy trees in U:
1. If no witness point is discovered, that policy tree is removed from the agenda. When the agenda is empty, the algorithm terminates.
2. If a witness point is discovered, the best policy tree for that point is computed and added to U, and all policy trees that differ from the current policy tree in a single subtree are added to the agenda.
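The estimated value of a belief under a set U of value vectors is the maximum dot product, and a witness is a belief at which the true value beats that estimate. In the sketch below, `find_witness` merely scans a list of candidate beliefs; the actual algorithm finds a witness point exactly by solving a linear program, so this is an illustration of the test, not the real search.

```python
def estimated_value(b, U):
    """Value of belief b under a set U of value vectors: max of dot products."""
    return max(sum(bi * ai for bi, ai in zip(b, alpha)) for alpha in U)

def find_witness(true_value, U, candidate_beliefs, eps=1e-9):
    """Return a belief where the true value exceeds the U-estimate, or None.

    (The real Witness algorithm solves a linear program instead of sampling
    candidate beliefs; this scan is only for illustration.)
    """
    for b in candidate_beliefs:
        if true_value(b) > estimated_value(b, U) + eps:
            return b
    return None
```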
The tiger problem:
Two doors: behind one door is a tiger; behind the other is a large reward.
Two states: the tiger is on the left (sl) or on the right (sr).
Three actions: left, right, and listen.
Rewards: +10 for opening the correct door, -100 for opening the door with the tiger behind it, and -1 for each listen.
Observations: hearing the tiger on the left (Tl) or on the right (Tr). In state sl, the listen action yields observation Tl with probability 0.85 and Tr with probability 0.15; conversely for world state sr.
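Plugging the slide's numbers into Bayes' rule shows how a single listen shifts the belief: starting from the uniform belief (0.5, 0.5), hearing Tl once moves the probability of sl to exactly 0.85. A quick sanity check, not taken from the paper:

```python
# Tiger problem: belief over {sl, sr} after one "listen" that returns Tl.
prior = {"sl": 0.5, "sr": 0.5}
# From the slides: P(Tl | sl, listen) = 0.85, P(Tl | sr, listen) = 0.15,
# and listening does not move the tiger.
p_Tl = {"sl": 0.85, "sr": 0.15}

unnormalized = {s: p_Tl[s] * prior[s] for s in prior}
norm = sum(unnormalized.values())            # = 0.5
posterior = {s: v / norm for s, v in unnormalized.items()}
# posterior["sl"] -> 0.85
```

A second Tl pushes the belief to 0.85^2 / (0.85^2 + 0.15^2) ≈ 0.97, which is why repeated listening pays off despite the -1 cost per listen.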
Decreasing listening reliability from 0.85 down to 0.65:
How does the horizon length affect the complexity of solving POMDPs? And can we conclude that pruning non-useful policies is the key to solving POMDPs?
On page 5, they say “sometimes we need to … compute a greedy policy given a function”, why would you need the greedy policy?
Can you explain the “Witness Algorithm”? I don’t understand it at all. (Page 15)
Did you find any papers that implement the techniques in this paper and provide a discussion of timing or accuracy?
Can you give some more real-world POMDP problems besides the tiger problem?