
Transcript of 5/11/2015 Mahdi Naser-Moghadasi Texas Tech University.

Page 1:


Mahdi Naser-Moghadasi, Texas Tech University

Page 2:

An MDP model contains:
- A set of states S
- A set of actions A
- A state transition function T
- A reward function R(s, a)
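
As a concrete illustration of these four components, here is a minimal Python sketch; the field names and the tiny deterministic example are my own, not from the slide.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str
Action = str

@dataclass
class MDP:
    """The four components listed above: states S, actions A, transitions T, rewards R."""
    states: List[State]                           # S
    actions: List[Action]                         # A
    T: Dict[Tuple[State, Action, State], float]   # T(s, a, s') = Pr(s' | s, a)
    R: Dict[Tuple[State, Action], float]          # R(s, a)

# A hypothetical two-state, two-action example, only to show the shapes of T and R.
toy = MDP(
    states=["s0", "s1"],
    actions=["stay", "go"],
    T={("s0", "stay", "s0"): 1.0, ("s0", "go", "s1"): 1.0,
       ("s1", "stay", "s1"): 1.0, ("s1", "go", "s0"): 1.0},
    R={("s0", "stay"): 0.0, ("s0", "go"): 1.0,
       ("s1", "stay"): 1.0, ("s1", "go"): 0.0},
)
```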


[Figure: the agent-environment interaction loop - the agent sends an action to the environment, and the environment returns a state and a reward.]

Page 3:

A POMDP model contains:
- A set of states S
- A set of actions A
- A state transition function T
- A reward function R(s, a)
- A finite set of observations Ω
- An observation function O: S × A → Π(Ω), where Π(Ω) is the set of probability distributions over Ω and O(s', a, o) is the probability of observing o after taking action a and ending up in state s'.
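
Because the agent cannot observe the state directly, it acts on a belief state b, a probability distribution over S. The slide does not spell out the update, but the standard Bayes rule for the belief after taking action a and observing o is:

```latex
b'(s') = \frac{O(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s)}{\Pr(o \mid a, b)},
\qquad
\Pr(o \mid a, b) = \sum_{s' \in S} O(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s).
```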


Page 4:

Value function

A policy is a description of the behavior of the agent.

- Policy tree
- Witness algorithm


Page 5:

A tree of depth t that specifies a complete t-step policy.
- Nodes: actions; the top node determines the first action to be taken.
- Edges: the resulting observations.
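
One minimal way to represent such a tree in code (my own sketch; the class and the tiger-problem labels below are not from the slide):

```python
from dataclasses import dataclass, field
from typing import Dict

Observation = str

@dataclass
class PolicyTree:
    """A depth-t policy tree: take `action`, observe, then descend into the matching subtree."""
    action: str
    children: Dict[Observation, "PolicyTree"] = field(default_factory=dict)

    def subtree(self, observation: Observation) -> "PolicyTree":
        """Follow the edge labeled with the observation that resulted from acting."""
        return self.children[observation]

# A 2-step tree for the tiger problem discussed later in the deck:
# listen first, then open the door away from where the tiger was heard.
two_step = PolicyTree(
    action="listen",
    children={
        "hear-tiger-left":  PolicyTree(action="open-right"),
        "hear-tiger-right": PolicyTree(action="open-left"),
    },
)
```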


Page 6:

Page 7:

Value function:

V_p(s) is the expected t-step value obtained by starting from state s and executing policy tree p.
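
The slide gives the definition in words; in the standard policy-tree formulation the value satisfies the recursion below, where a(p) is the action at the root of p, o(p) is the depth-(t-1) subtree reached along the edge for observation o, and γ is the discount factor:

```latex
V_p(s) = R(s, a(p)) + \gamma \sum_{s' \in S} T(s, a(p), s') \sum_{o \in \Omega} O(s', a(p), o)\, V_{o(p)}(s')
```

Extending this to a belief state b gives the linear function V_p(b) = \sum_{s \in S} b(s)\, V_p(s).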

Page 8:


Value evaluation: V_t with only two states:
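
The figure from this slide is not reproduced in the transcript, but the construction it illustrates is the following: with two states, a belief is summarized by the single number b(s_1) ∈ [0, 1] (since b(s_2) = 1 - b(s_1)), each policy tree's value V_p(b) is a line over that interval, and

```latex
V_t(b) = \max_{p \in \mathcal{P}_t} \sum_{s \in S} b(s)\, V_p(s)
```

is their upper envelope, which is why V_t is piecewise linear and convex.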

Page 9:


Value function: V_t with three states (the belief simplex is now a two-dimensional triangle, and each policy tree contributes a plane over it):

Page 10:


The representation is improved by keeping only the useful policy trees; dominated trees are pruned:
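
One way to make "useful" precise (my own phrasing of the standard definition, not copied from the slide): a policy tree p is useful when there is some belief at which it beats every other tree,

```latex
\exists\, b \ \text{such that}\ \ V_p(b) > V_q(b) \quad \text{for all policy trees } q \neq p,
```

and every tree that is dominated everywhere can be removed without changing V_t.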

Page 11:


Witness algorithm:

Page 12:


Witness algorithm: finding a witness.

At each iteration we ask: is there some belief state b for which the true value, V_t(b), computed by one-step lookahead using V_{t-1}, is different from the estimated value, V̂_t(b), computed using the set U?
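
Written out (the transcript dropped the two symbols), the quantities being compared are

```latex
\hat{V}_t(b) = \max_{p \in U} \sum_{s \in S} b(s)\, V_p(s),
\qquad
V_t(b) = \max_{a \in A} \Big[ \sum_{s \in S} b(s)\, R(s, a) + \gamma \sum_{o \in \Omega} \Pr(o \mid a, b)\, V_{t-1}(b^{a}_{o}) \Big],
```

where b^a_o is the belief reached from b after taking action a and observing o. Any belief b at which the two values differ is a witness that U is still missing a useful policy tree.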

Page 13:

Witness algorithm: complete value iteration.

- Maintain an agenda, initialized with any single policy tree, and a set U that will hold the useful policy trees found so far.
- Take a tree p_new from the agenda and use it to determine whether it is an improvement over the policy trees in U (a sketch of this loop follows the list):
  1. If no witness point is discovered, that policy tree is removed from the agenda. When the agenda is empty, the algorithm terminates.
  2. If a witness point is discovered, the best policy tree for that point is calculated and added to U, and all policy trees that differ from the current policy tree in a single subtree are added to the agenda.
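
Here is a minimal Python sketch of that agenda loop. The hard sub-steps are left as injected helpers; `find_witness` (the linear program that looks for a belief where U is wrong), `best_tree_at` (which builds the best depth-t tree for a belief using V_{t-1}), and `one_subtree_variants` are hypothetical names, not from the slides.

```python
from typing import Any, Callable, List, Optional, Sequence

Belief = Sequence[float]   # a probability distribution over states
Tree = Any                 # e.g. the PolicyTree sketch shown earlier

def witness_value_iteration_step(
    initial_tree: Tree,
    find_witness: Callable[[Tree, List[Tree]], Optional[Belief]],
    best_tree_at: Callable[[Belief], Tree],
    one_subtree_variants: Callable[[Tree], List[Tree]],
) -> List[Tree]:
    """Build U, the set of useful depth-t policy trees, from V_{t-1}."""
    U: List[Tree] = []
    agenda: List[Tree] = [initial_tree]        # agenda starts with any single policy tree
    while agenda:                              # terminate when the agenda is empty
        p_new = agenda.pop()
        b = find_witness(p_new, U)             # a belief where U underestimates, or None
        if b is None:
            continue                           # case 1: no witness point, drop p_new
        p_best = best_tree_at(b)               # case 2: best policy tree at the witness point
        U.append(p_best)
        agenda.extend(one_subtree_variants(p_best))  # trees differing in a single subtree
    return U
```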


Page 14:

The tiger problem:
- Two doors: behind one door is a tiger; behind the other door is a large reward.
- Two states: the state of the world when the tiger is on the left, s_l, and when it is on the right, s_r.
- Three actions: left, right, and listen.
- Rewards: the reward for opening the correct door is +10, the penalty for choosing the door with the tiger behind it is -100, and the cost of listening is -1.
- Observations: hearing the tiger on the left (T_l) or hearing the tiger on the right (T_r). In state s_l, the listen action results in observation T_l with probability 0.85 and observation T_r with probability 0.15; conversely for world state s_r. (A small runnable sketch of this model follows the list.)
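
A compact, runnable sketch of the tiger model above together with the Bayes belief update after listening. The numbers (0.85/0.15, +10, -100, -1) come from the slide; the function and variable names are mine.

```python
# States, actions, and observations for the tiger problem (the labels are my own).
STATES = ["tiger-left", "tiger-right"]
ACTIONS = ["open-left", "open-right", "listen"]
OBSERVATIONS = ["hear-left", "hear-right"]

# R(s, a): +10 for opening the correct door, -100 for the tiger's door, -1 for listening.
R = {
    ("tiger-left", "open-left"): -100.0, ("tiger-left", "open-right"): 10.0,
    ("tiger-right", "open-left"): 10.0,  ("tiger-right", "open-right"): -100.0,
    ("tiger-left", "listen"): -1.0,      ("tiger-right", "listen"): -1.0,
}

# O(s, listen, o): listening reports the correct side with probability 0.85.
O = {
    ("tiger-left", "hear-left"): 0.85,  ("tiger-left", "hear-right"): 0.15,
    ("tiger-right", "hear-left"): 0.15, ("tiger-right", "hear-right"): 0.85,
}

def belief_update_after_listen(b_left: float, obs: str) -> float:
    """Bayes update of Pr(tiger-left) after listening (listening does not move the tiger)."""
    p_left = O[("tiger-left", obs)] * b_left
    p_right = O[("tiger-right", obs)] * (1.0 - b_left)
    return p_left / (p_left + p_right)

# Starting from a uniform belief and hearing the tiger on the left twice:
b = 0.5
b = belief_update_after_listen(b, "hear-left")   # 0.85
b = belief_update_after_listen(b, "hear-left")   # ~0.97
print(b)
```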


Page 15:

Page 16:

Decreasing the listening reliability from 0.85 down to 0.65:
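
To see the effect (my own arithmetic, not from the slides): each observation is now much less informative, so the agent must listen more often before it is confident enough to open a door. Starting from a uniform belief and hearing the tiger on the same side twice in a row gives

```latex
\text{reliability } 0.85:\ \frac{0.85^2}{0.85^2 + 0.15^2} \approx 0.97,
\qquad
\text{reliability } 0.65:\ \frac{0.65^2}{0.65^2 + 0.35^2} \approx 0.78 .
```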

Page 17:

How does the length of the horizon affect the complexity of solving POMDPs? And can we conclude that pruning non-useful policy trees is the key to solving POMDPs?

On page 5, they say “sometimes we need to … compute a greedy policy given a function”, why would you need the greedy policy?

Can you explain the “Witness Algorithm”? I don’t understand it at all. (Page 15)

Did you find any papers that implement the techniques in this paper and provide a discussion of timing or accuracy?

Can you give some more real-world POMDP problems besides the tiger problem?


Page 18: