CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty


Page 1: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

CPSC 7373: Artificial Intelligence, Lecture 10: Planning with Uncertainty

Jiang Bian, Fall 2012, University of Arkansas at Little Rock

Page 2: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

Planning under Uncertainty

[Diagram: Planning, Uncertainty, and Learning as overlapping areas; MDPs and POMDPs combine planning with uncertainty, while RL (reinforcement learning) also involves learning.]

Page 3: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

Planning Agent Tasks Characteristics

                         DETERMINISTIC             STOCHASTIC
FULLY OBSERVABLE         A*, depth-first, etc.     MDP
PARTIALLY OBSERVABLE                               POMDP

• A stochastic environment is one where the outcome of an action is somewhat random, whereas in a deterministic environment the outcome of an action is predictable and always the same.

• An environment is fully observable if you can see its state, i.e., you can make all decisions based on the momentary sensory input. If you need memory of past observations, it is partially observable.

Page 4: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

Markov Decision Process (MDP)

[Diagram: a finite state machine with states S1-S3 and actions a1, a2, contrasted with a Markov Decision Process, where an action's outcome is random, e.g., a 50%/50% split between two successor states.]

Page 5: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

Markov Decision Process (MDP)

[Diagram: MDP with states S1-S3 and actions a1, a2; taking an action leads to each of two successor states with probability 50%.]

States: S1, …, Sn
Actions: a1, …, ak
State transition matrix: T(S, a, S') = P(S'|a, S)
Reward function: R(S)
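As an illustration of these ingredients (not from the slides), here is a minimal Python sketch; the variable names `states`, `actions`, `T`, and `R` are chosen for this example, and the 50%/50% outcome of a1 in S1 mirrors the diagram above.

```python
# Minimal sketch of the ingredients listed above (illustrative names only):
# states S1..Sn, actions a1..ak, transition matrix T(S, a, S') = P(S' | a, S),
# and reward function R(S).

states = ["S1", "S2", "S3"]
actions = ["a1", "a2"]

# T[(S, a)] maps each successor state S' to its probability P(S' | a, S).
T = {
    ("S1", "a1"): {"S2": 0.5, "S3": 0.5},  # a1 in S1 is stochastic (50%/50%)
    ("S1", "a2"): {"S1": 1.0},
    ("S2", "a1"): {"S3": 1.0},
    ("S2", "a2"): {"S1": 1.0},
    ("S3", "a1"): {"S3": 1.0},
    ("S3", "a2"): {"S2": 1.0},
}

# R[S] is the reward received in state S.
R = {"S1": 0.0, "S2": 0.0, "S3": 0.0}
```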

Page 6: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

MDP Grid World

[Grid world: rows a-c, columns 1-4; cell b2 is blocked. The absorbing states are a4 (+100) and b4 (-100); the agent starts at c1. Actions are stochastic: with probability 80% the agent moves in the intended direction, and with probability 10% each it veers to one of the two perpendicular directions.]

Policy: π(s) → A. The planning problem becomes one of finding the optimal policy.
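A sketch of the 80/10/10 transition model for this grid; the cell naming ("a1".."c4"), the `step` and `transition` helpers, and the handling of walls are illustrative assumptions for this example, not code from the lecture.

```python
# Sketch of the 80/10/10 stochastic transition model for the grid world
# (cell names "a1".."c4", the blocked cell, and these helpers are assumptions
# made for this illustration, not code from the lecture).

ROWS, COLS = "abc", "1234"
BLOCKED = {"b2"}                      # the wall cell in the middle of the grid
ABSORBING = {"a4": +100, "b4": -100}  # absorbing states and their rewards

MOVES = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}
SIDEWAYS = {"N": ("W", "E"), "S": ("W", "E"), "W": ("N", "S"), "E": ("N", "S")}

def step(cell, direction):
    """Apply one move deterministically; bumping into a wall or the grid
    edge leaves the agent where it is."""
    r, c = ROWS.index(cell[0]), COLS.index(cell[1])
    dr, dc = MOVES[direction]
    nr, nc = r + dr, c + dc
    if 0 <= nr < len(ROWS) and 0 <= nc < len(COLS):
        nxt = ROWS[nr] + COLS[nc]
        if nxt not in BLOCKED:
            return nxt
    return cell

def transition(cell, action):
    """Return {successor cell: probability}: 80% intended direction,
    10% each for the two perpendicular directions."""
    left, right = SIDEWAYS[action]
    probs = {}
    for d, p in [(action, 0.8), (left, 0.1), (right, 0.1)]:
        nxt = step(cell, d)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs
```

For example, `transition("b3", "N")` would give `{'a3': 0.8, 'b3': 0.1, 'b4': 0.1}`, which is the case worked out in the stochastic quiz later in the lecture.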

Page 7: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

Stochastic Environments – Conventional Planning

[Search-tree diagram: from the start state c1, each of the four actions N, S, W, E branches into its possible stochastic outcomes (e.g., b1, c1, c2).]

Problems with conventional (tree-search) planning here:
1) Branching factor: 4 action choices with 3 outcomes each means at least 12 branches to follow.
2) Depth of the search tree (loops can make it very deep).
3) Many states are visited more than once (states may re-occur), whereas in A* we ensure each state is visited only once.

Page 8: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

Policy


Goal: Find an optimal policy for all states that, with maximum probability, leads to the +100 absorbing state.

Quiz: What is the optimal action?
a1: N, S, W, or E?
c1: N, S, W, or E?
c4: N, S, W, or E?
b3: N, S, W, or E?

Page 9: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

MDP and Costs


R(s): +100 at a4; -100 at b4; -3 at all other states (the -3 gives us an incentive to shorten our action sequences).

γ = discount factor, e.g., γ = 0.9 (i.e., decay of future rewards)

Objective of MDP: maximize the expected sum of discounted future rewards, E[ Σ_t γ^t R(S_t) ].
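To make the objective concrete, here is a tiny illustrative snippet (the function name is an assumption for this example) that computes the discounted sum of rewards for one fixed sequence of visited states:

```python
# Illustrative only: the discounted return sum_t gamma**t * R(S_t) for one
# trajectory; the MDP objective is to maximize its expectation.

def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Three steps costing -3 each, then reaching the +100 state:
# -3 - 3*0.9 - 3*0.81 + 100*0.729 = 64.77
print(discounted_return([-3, -3, -3, 100], gamma=0.9))
```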

Page 10: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

Value Function

      1     2     3     4
a    -3    -3    -3  +100
b    -3          -3  -100
c    -3    -3    -3    -3

Value function: for each state, the value of the state is the expected sum of future discounted rewards, provided that we start in state S and execute policy π: V^π(S) = E[ Σ_t γ^t R(S_t) | S_0 = S, π ].

Planning = Iteratively calculating value functions

Page 11: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

Value Iteration

Initial values:

      1     2     3     4
a     0     0     0  +100
b     0           0  -100
c     0     0     0     0

After running value iteration through convergence:

      1     2     3     4
a    85    89    93  +100
b    81          68  -100
c    77    73    70    47
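A hedged sketch of the value-iteration loop itself, reusing the illustrative `transition` helper and grid constants from the earlier sketch; with R(S) = -3 for non-absorbing states, γ = 1, and the 80/10/10 model it should converge to values close to the grid above (85, 89, 93, ...).

```python
# Sketch of value iteration for the grid world (builds on the illustrative
# `transition`, `MOVES`, `BLOCKED`, and `ABSORBING` definitions above).

GAMMA = 1.0
STEP_REWARD = -3.0

CELLS = [r + c for r in "abc" for c in "1234" if r + c not in BLOCKED]

def value_iteration(eps=1e-6):
    V = {s: 0.0 for s in CELLS}
    while True:
        delta = 0.0
        for s in CELLS:
            if s in ABSORBING:
                new_v = ABSORBING[s]  # terminal state: V(S) = R(S)
            else:
                # back-up: V(S) = R(S) + gamma * max_a sum_S' P(S'|a,S) V(S')
                new_v = STEP_REWARD + GAMMA * max(
                    sum(p * V[s2] for s2, p in transition(s, a).items())
                    for a in MOVES
                )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < eps:
            return V
```

Rounding the values returned by `value_iteration()` should give numbers close to those in the converged grid above.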

Page 12: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

Value Iteration - 2

      1     2     3     4
a    85    89    93  +100
b    81          68  -100
c    77    73    70    47

If S is a terminal state: V(S) = R(S).

Back-up equation: V(S) ← R(S) + γ max_a Σ_S' P(S'|a, S) V(S')

After convergence (Bellman equality), V(S) gives the optimal future cost/reward trade-off that you can achieve if you act optimally in any given state.

Page 13: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

Quiz – DETERMINISTIC


V(a3) = ???

DETERMINISTIC, γ = 1, R(S) = -3

Page 14: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

Quiz - 1

[Grid: a3 = 97; a4 = +100, b4 = -100]

V(a3) = 97
V(b3) = ???

DETERMINISTIC, γ = 1, R(S) = -3

Page 15: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

Quiz - 1

[Grid: a3 = 97, b3 = 94; a4 = +100, b4 = -100]

V(a3) = 97
V(b3) = 94
V(c1) = ???

DETERMINISTIC, γ = 1, R(S) = -3

Page 16: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

Quiz - 1

      1     2     3     4
a    91    94    97  +100
b    88          94  -100
c    85    88    91    88

V(a3) = 97
V(b3) = 94
V(c1) = 85

DETERMINISTIC, γ = 1, R(S) = -3

Page 17: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

Quiz – STOCHASTIC


V(a3) = ???

STOCHASTIC, γ = 1, R(S) = -3, P = 0.8

Page 18: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

Quiz – STOCHASTIC

[Grid: a3 = 77; a4 = +100, b4 = -100]

V(a3) = 77
V(b3) = ???

STOCHASTIC, γ = 1, R(S) = -3, P = 0.8

Page 19: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

Quiz – STOCHASTIC

[Grid: a3 = 77, b3 = 48.6; a4 = +100, b4 = -100]

V(a3) = 77
V(b3) = 48.6

STOCHASTIC, γ = 1, R(S) = -3, P = 0.8

N: 0.8 * 77 + 0.1 * (-100) + 0.1 * 0 - 3 = 48.6
W: 0.1 * 77 + 0.8 * 0 + 0.1 * 0 - 3 = 4.7
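As a quick check on this arithmetic (an illustrative snippet using the cell names from the earlier sketches), here are the one-step backups for going North versus West from b3 when a3 has already been updated to 77 and every other non-terminal value is still 0:

```python
# Re-computing the two backups quoted above (illustrative only).
V = {"a3": 77.0, "b4": -100.0, "b3": 0.0, "c3": 0.0}

# North: 80% reach a3, 10% veer East into b4, 10% veer West and bounce off
# the wall at b2, staying in b3; then add the step reward of -3.
q_north = 0.8 * V["a3"] + 0.1 * V["b4"] + 0.1 * V["b3"] - 3

# West: 80% bounce off the wall and stay in b3, 10% veer North to a3,
# 10% veer South to c3; then add the step reward of -3.
q_west = 0.1 * V["a3"] + 0.8 * V["b3"] + 0.1 * V["c3"] - 3

print(round(q_north, 1), round(q_west, 1))  # 48.6 and 4.7, matching the slide
```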

Page 20: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

Value Iteration and Policy - 1

What is the optimal policy?

Page 21: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

Value Iteration and Policy - 2

      1     2     3     4
a    85    89    93  +100
b    81          68  -100
c    77    73    70    47

STOCHASTIC, γ = 1, R(S) = -3, P = 0.8

What is the optimal policy?


This is a situation where the risk of falling into the -100 state is balanced against the time spent going around it.

Page 22: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

Value Iteration and Policy - 3

      1     2     3     4
a   100   100   100  +100
b   100         100  -100
c   100   100   100   100

STOCHASTIC, γ = 1, R(S) = 0, P = 0.8

What is the optimal policy?


Page 23: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

Value Iteration and Policy - 4

        1      2      3      4
a    -704   -423   -173   +100
b    -954          -357   -100
c   -1082   -847   -597   -377

STOCHASTIC, γ = 1, R(S) = -100, P = 0.8

What is the optimal policy?


Page 24: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

Markov Decision Processes

• Fully observable: S1, …, Sn; a1, …, ak
• Stochastic: P(S'|a, S)
• Reward: R(S)
• Objective: maximize E[ Σ_t γ^t R(S_t) ]
• Value iteration: V(S)
• Converges: π(S) = argmax_a Σ_S' P(S'|a, S) V(S')
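A short sketch (reusing the illustrative helpers from the earlier code) of the final argmax step that reads a greedy policy off the converged value function:

```python
# Sketch of pi(S) = argmax_a sum_S' P(S'|a, S) V(S'), using the illustrative
# grid-world helpers defined earlier.

def extract_policy(V):
    policy = {}
    for s in CELLS:
        if s in ABSORBING:
            continue  # no action is needed in an absorbing state
        policy[s] = max(
            MOVES,
            key=lambda a: sum(p * V[s2] for s2, p in transition(s, a).items()),
        )
    return policy

V = value_iteration()
print(extract_policy(V))  # e.g. maps c1 to 'N', a3 to 'E', and so on
```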

Page 25: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

Partial Observability

• Fully Observable, Deterministic
• Fully Observable, Stochastic

Page 26: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

Partial Observability

[Grid: two candidate goal cells marked "?".]

• Partially Observable, Stochastic

MDP vs. POMDP: optimal exploration versus exploitation, where some actions might be information-gathering actions while others might be goal-driven actions.

[Figure: a SIGN cell in the grid, an information-gathering opportunity.]

Page 27: CPSC 7373: Artificial Intelligence Lecture 10: Planning with Uncertainty

Partial Observability

Information Space (Belief Space)

[Figure: two belief states, each with the +100 behind a different "?" cell, each with probability 50%.]