CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty
Jiang Bian, Fall 2012, University of Arkansas at Little Rock
Planning under Uncertainty

[Diagram: planning, uncertainty, and learning as overlapping areas: planning under uncertainty gives MDPs and POMDPs; adding learning gives reinforcement learning (RL).]
Planning Agent Task Characteristics

                       DETERMINISTIC             STOCHASTIC
FULLY OBSERVABLE       A*, depth-first, etc.     MDP
PARTIALLY OBSERVABLE                             POMDP
• An environment is stochastic if the outcome of an action is somewhat random, whereas in a deterministic environment the outcome of an action is predictable and always the same.
• An environment is fully observable if you can see the state of the environment, i.e., if you can make all decisions based on the momentary sensory input. If you need memory, it is partially observable.
Markov Decision Process (MDP)
[Diagram: on the left, a finite state machine over states S1–S3 with actions a1 and a2; on the right, the same states as a Markov decision process, where the outcome of an action is random; one action's outcome splits 50%/50% between two successor states.]
Markov Decision Process (MDP)
[Diagram: the MDP from above, with one action's outcome split 50%/50%.]

States: S1 … Sn
Actions: a1 … ak
State transition matrix: T(s, a, s') = P(s' | a, s)
Reward function: R(s)
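The definitions above map directly onto a small data structure. A minimal sketch in Python (the state names, probabilities, and rewards here are illustrative, not from the lecture):

```python
# A minimal MDP representation: states S1..Sn, actions a1..ak,
# transition matrix T(s, a, s') = P(s' | a, s), and reward R(s).

states = ["S1", "S2", "S3"]
actions = ["a1", "a2"]

# T[(s, a)] maps each successor state s' to P(s' | a, s).
# Probabilities are made up for illustration; a1 in S1 is random (50%/50%).
T = {
    ("S1", "a1"): {"S2": 0.5, "S3": 0.5},
    ("S1", "a2"): {"S1": 1.0},
    ("S2", "a1"): {"S3": 1.0},
    ("S2", "a2"): {"S1": 1.0},
    ("S3", "a1"): {"S3": 1.0},
    ("S3", "a2"): {"S2": 1.0},
}

R = {"S1": 0.0, "S2": 0.0, "S3": 10.0}   # reward depends only on the state

# Every row of T must be a proper probability distribution.
for dist in T.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```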
MDP Grid World

       1      2      3      4
a      ·      ·      ·    +100
b      ·      X      ·    -100
c    START    ·      ·      ·

(X marks the blocked cell b2; +100 and -100 are absorbing states.)

Stochastic actions: each action moves in the intended direction with probability 80%, and slips to each of the two perpendicular directions with probability 10%.

Policy: π(s) → A. The planning problem we have becomes one of finding the optimal policy.
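The 80%/10%/10% motion model can be sketched as follows (cell coordinates are 0-based (row, column); treating b2 as blocked and bouncing off walls are assumptions read off the value grids later in the lecture):

```python
# Stochastic motion model for the grid world: the intended move succeeds
# with probability 0.8; the agent slips to each perpendicular side with
# probability 0.1. Moving into a wall or the blocked cell leaves it in place.

ROWS, COLS = 3, 4           # rows a-c, columns 1-4
OBSTACLE = (1, 1)           # cell b2

MOVES = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}
LEFT  = {"N": "W", "W": "S", "S": "E", "E": "N"}
RIGHT = {"N": "E", "E": "S", "S": "W", "W": "N"}

def step(cell, direction):
    """Deterministic move; bumping a wall or the obstacle stays put."""
    r, c = cell
    dr, dc = MOVES[direction]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < ROWS and 0 <= nc < COLS) or (nr, nc) == OBSTACLE:
        return cell
    return (nr, nc)

def transition(cell, action):
    """P(s' | s, a) as a dict: 80% intended, 10% per perpendicular slip."""
    probs = {}
    for direction, p in [(action, 0.8), (LEFT[action], 0.1), (RIGHT[action], 0.1)]:
        s2 = step(cell, direction)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs

# From c1 (bottom-left), moving North: 80% to b1, 10% slip west off the
# grid (stay in c1), 10% slip east to c2.
print(transition((2, 0), "N"))
```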
Stochastic Environments – Conventional Planning
[Search-tree sketch: starting from c1 in the grid above, each action N, S, W, E has three possible outcomes (e.g., N from c1 leads to b1, c1, or c2).]

Problems:
1) Branching factor: 4 action choices with 3 outcomes each give at least 12 branches to follow.
2) Depth of the search tree (loops can make it very deep or unbounded).
3) Many states are visited more than once (states may re-occur), whereas in A* we ensure each state is visited only once.
Policy
[Grid world as above: +100 at a4, -100 at b4, START at c1.]

Goal: find an optimal policy for all these states that, with maximum probability, leads to the +100 absorbing state.

Quiz: What is the optimal action?
a1: N, S, W, or E?
c1: N, S, W, or E?
c4: N, S, W, or E?
b3: N, S, W, or E?
MDP and Costs
[Grid world as above.]

R(s): +100 at a4; -100 at b4; -3 in all other states (giving us an incentive to shorten the action sequence).

γ = discount factor, e.g., γ = 0.9 (decay of the future rewards).

Objective of MDP: maximize the expected sum of discounted future rewards, E[ Σ_t γ^t R(s_t) ] → max.
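As a quick numeric illustration of the discount factor, the return of a reward sequence r_0, r_1, … is Σ_t γ^t r_t:

```python
# How gamma decays future rewards: later rewards are weighted by gamma^t.

def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^t * r_t over a reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Four -3 steps followed by the +100 absorbing state:
rewards = [-3, -3, -3, -3, 100]
print(discounted_return(rewards, gamma=1.0))   # undiscounted: -12 + 100 = 88
print(discounted_return(rewards, gamma=0.9))   # future rewards count less
```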
Value Function
      1     2     3     4
a    -3    -3    -3   +100
b    -3     X    -3   -100
c    -3    -3    -3    -3

Value Function: for each state s, the value of the state is the expected sum of future discounted rewards, provided that we start in state s and execute policy π:

V_π(s) = E[ Σ_t γ^t R(s_t) | s_0 = s, π ]

Planning = iteratively calculating value functions.
Value Iteration
Initial values (0 everywhere except the absorbing states):

      1     2     3     4
a     0     0     0   +100
b     0     X     0   -100
c     0     0     0     0

After running value iteration to convergence:

      1     2     3     4
a    85    89    93   +100
b    81     X    68   -100
c    77    73    70    47
Value Iteration - 2

      1     2     3     4
a    85    89    93   +100
b    81     X    68   -100
c    77    73    70    47

Back-up equation:
V(s) ← R(s) + γ max_a Σ_s' P(s'|a, s) V(s')
If s is a terminal (absorbing) state: V(s) = R(s).

After convergence (Bellman equality), V(s) is the optimal future cost/reward trade-off that you can achieve if you act optimally in any given state.
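The back-up equation can be run to convergence on this grid world directly. A minimal sketch (the blocked cell b2 and the bounce-back behavior at walls are assumptions consistent with the slides' numbers; the result matches the converged grid above to within rounding):

```python
# Value iteration for the 3x4 grid world, applying
#   V(s) <- R(s) + gamma * max_a sum_{s'} P(s'|a,s) V(s'),
# with V(s) = R(s) at the absorbing states.
# Parameters follow the slides: gamma = 1, R(s) = -3, 80%/10%/10% noise.

ROWS, COLS = 3, 4                             # rows a-c, columns 1-4
OBSTACLE = (1, 1)                             # cell b2 is blocked
TERMINALS = {(0, 3): 100.0, (1, 3): -100.0}   # a4 = +100, b4 = -100
STEP_REWARD, GAMMA = -3.0, 1.0

MOVES = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}
SLIPS = {"N": "WE", "S": "WE", "W": "NS", "E": "NS"}   # perpendicular slips

def step(cell, d):
    """Move one cell in direction d; bounce back off walls and the obstacle."""
    nr, nc = cell[0] + MOVES[d][0], cell[1] + MOVES[d][1]
    if not (0 <= nr < ROWS and 0 <= nc < COLS) or (nr, nc) == OBSTACLE:
        return cell
    return (nr, nc)

def successors(cell, action):
    """(s', P(s'|a,s)) pairs: 80% intended move, 10% per perpendicular slip."""
    yield step(cell, action), 0.8
    for slip in SLIPS[action]:
        yield step(cell, slip), 0.1

cells = [(r, c) for r in range(ROWS) for c in range(COLS) if (r, c) != OBSTACLE]
V = {s: TERMINALS.get(s, 0.0) for s in cells}

for _ in range(200):                          # plenty of sweeps to converge
    for s in cells:
        if s in TERMINALS:
            continue                          # terminal states keep V(s) = R(s)
        V[s] = STEP_REWARD + GAMMA * max(
            sum(p * V[s2] for s2, p in successors(s, a)) for a in MOVES)

for r in range(ROWS):                         # print the converged value grid
    print([None if (r, c) == OBSTACLE else round(V[(r, c)], 1)
           for c in range(COLS)])
```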
Quiz – DETERMINISTIC (γ = 1, R(s) = -3)

V(a3) = 97 (one step east to +100: 100 - 3)
V(b3) = 94 (two steps: north, then east)
V(c1) = 85 (five steps to +100)

Fully solved value grid:

      1     2     3     4
a    91    94    97   +100
b    88     X    94   -100
c    85    88    91    88
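In the deterministic case with γ = 1 and R(s) = -3, each value is simply 100 minus 3 per step along the shortest path to a4. A one-line check:

```python
# Deterministic values: goal reward minus 3 per step on the shortest path.

def det_value(steps_to_goal, goal=100, step_reward=-3):
    """Value of a state that is steps_to_goal moves from the +100 exit."""
    return goal + step_reward * steps_to_goal

print(det_value(1))   # a3: one step east      -> 97
print(det_value(2))   # b3: north, then east   -> 94
print(det_value(5))   # c1: five steps to a4   -> 85
```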
Quiz – STOCHASTIC (γ = 1, R(s) = -3, P = 0.8)

V(a3) = 77
V(b3) = 48.6

For b3, compare the two sensible actions:
N: 0.8 × 77 + 0.1 × (-100) + 0.1 × 0 - 3 = 48.6
W: 0.1 × 77 + 0.8 × 0 + 0.1 × 0 - 3 = 4.7
North wins, so V(b3) = 48.6.
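The quiz arithmetic for b3 can be checked numerically (V(a3) = 77 is already computed; all other non-terminal neighbors are still 0 at this point):

```python
# One-step backed-up value of an action at b3, per the back-up equation.

def backup(outcomes, reward=-3.0):
    """Expected value of an action: sum of P(s') * V(s') plus R(s)."""
    return sum(p * v for p, v in outcomes) + reward

# Action N from b3: 80% to a3 (77), 10% slips east into b4 (-100),
# 10% slips west, bounces off the blocked cell b2, and stays in b3 (0).
north = backup([(0.8, 77.0), (0.1, -100.0), (0.1, 0.0)])

# Action W from b3: 80% bounces off b2 (0), 10% slips north to a3 (77),
# 10% slips south to c3 (0).
west = backup([(0.8, 0.0), (0.1, 77.0), (0.1, 0.0)])

print(north, west)   # N is the best action, so V(b3) = 48.6
```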
Value Iteration and Policy - 1

What is the optimal policy?

Value Iteration and Policy - 2

STOCHASTIC, γ = 1, R(s) = -3, P = 0.8; converged values:

      1     2     3     4
a    85    89    93   +100
b    81     X    68   -100
c    77    73    70    47

What is the optimal policy?

[Policy diagram: arrows in each cell pointing along the optimal action toward +100.]

This is a situation where the risk of falling into the -100 is balanced against the time spent going around it.
Value Iteration and Policy - 3

STOCHASTIC, γ = 1, R(s) = 0, P = 0.8; converged values:

       1     2     3     4
a    100   100   100   +100
b    100     X   100   -100
c    100   100   100    100

What is the optimal policy?

[Policy diagram: with no step cost there is no incentive to be fast, so the policy takes long but safe routes that avoid any risk of entering -100.]
Value Iteration and Policy - 4

STOCHASTIC, γ = 1, R(s) = -100, P = 0.8; converged values:

        1      2      3      4
a     -704   -423   -173   +100
b     -954     X    -357   -100
c    -1082   -847   -597   -377

What is the optimal policy?

[Policy diagram: the per-step penalty is so large that the policy takes the shortest route to any exit, even accepting -100 when +100 is too far away.]
Markov Decision Processes
• Fully observable: S1, …, Sn; a1, …, ak
• Stochastic: P(s'|a, s)
• Reward: R(s)
• Objective: maximize E[ Σ_t γ^t R(s_t) ]
• Value iteration: V(s)
• After convergence: π(s) = argmax_a Σ_s' P(s'|a, s) V(s')
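Once V(s) has converged, the policy is read off greedily via the argmax above. A small sketch with a made-up two-state MDP (names and numbers are illustrative):

```python
# Greedy policy extraction: pi(s) = argmax_a sum_{s'} P(s'|a,s) * V(s').

def greedy_policy(states, actions, T, V):
    """T[(s, a)] is a dict {s': P(s'|a,s)}; returns the greedy policy pi."""
    pi = {}
    for s in states:
        pi[s] = max(actions,
                    key=lambda a: sum(p * V[s2] for s2, p in T[(s, a)].items()))
    return pi

states, actions = ["S1", "S2"], ["stay", "go"]
T = {("S1", "stay"): {"S1": 1.0}, ("S1", "go"): {"S2": 0.8, "S1": 0.2},
     ("S2", "stay"): {"S2": 1.0}, ("S2", "go"): {"S1": 1.0}}
V = {"S1": 0.0, "S2": 10.0}

# S1 prefers "go" (expected value 8 vs 0); S2 prefers "stay" (10 vs 0).
print(greedy_policy(states, actions, T, V))
```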
Partial Observability

[Maze diagram with two exits labeled +100 and -100.]

• Fully observable, deterministic
• Fully observable, stochastic

[The same maze with the exit labels hidden, both shown as "?".]

• Partially observable, stochastic

MDP vs. POMDP: optimal exploration versus exploitation, where some of the actions might be information-gathering actions, whereas others might be goal-driven actions.

[Diagram: a SIGN in the maze that the agent can travel to and read, an information-gathering action.]

Information Space (Belief Space)

[Diagram: in belief space the agent starts with a 50%/50% belief over the two "?" exits, each of which might hide the +100.]