Transcript of "Stochastic Optimal Control" (Stanford University)
adl.stanford.edu/groups/aa241x/wiki/e054d...

Stochastic Optimal Control

Marco Pavone

Stanford University

April 29, 2015


AA 241X Mission

Mission: "A wild fire is occurring in Lake Lagunita and AA241X teams have been contracted to minimize the damage. Teams have to design, build and fly a UAV that can detect, prevent and extinguish the fire, with the goal of minimizing the area on fire in a fixed amount of time. Multiple fires can be present at the start of the mission and as time goes by the fire propagates through Lake Lagunita."

• A difficult problem: it combines exploration and exploitation

• Goal: to provide you with fundamental knowledge in the field of stochastic optimal control (focus on exploitation)

• Approach: dynamic programming


Basic SOC Problem

• System: xk+1 = fk(xk, uk, wk), k = 0, . . . , N − 1

• Control constraints: uk ∈ U(xk)

• Probability distribution: Pk(· | xk, uk) of wk

• Policies: π = {µ0, . . . , µN−1}, where uk = µk(xk)

• Expected cost:

  Jπ(x0) = E{ gN(xN) + ∑_{k=0}^{N−1} gk(xk, µk(xk), wk) }

• Stochastic optimal control problem:

  J∗(x0) = min_π Jπ(x0)
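For concreteness, the expected cost of a fixed policy can be estimated by Monte-Carlo simulation of the closed loop. This is a minimal sketch with illustrative names (the model functions are passed in as arguments; nothing here is prescribed by the slides):

```python
# Monte Carlo evaluation of J_pi(x0) for the basic SOC model above:
# simulate the closed loop x_{k+1} = f_k(x_k, mu_k(x_k), w_k) and
# average the additive cost over n_runs sample paths.
def policy_cost(x0, mu, f, g, g_N, sample_w, N, n_runs=10000):
    total = 0.0
    for _ in range(n_runs):
        x, cost = x0, 0.0
        for k in range(N):
            u = mu[k](x)                 # u_k = mu_k(x_k)
            w = sample_w(k, x, u)        # w_k ~ P_k(. | x_k, u_k)
            cost += g(k, x, u, w)        # additive stage cost g_k
            x = f(k, x, u, w)            # dynamics f_k
        total += cost + g_N(x)           # terminal cost g_N(x_N)
    return total / n_runs
```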


Key points

• Discrete-time model

• Markovian model

• Objective: find optimal closed-loop policy

• Additive cost (central assumption)

• Risk-neutral formulation

Other communities use different notation:

• Powell, W. B. AI, OR and control theory: A Rosetta Stone for stochastic optimization. Princeton University, 2012. http://castlelab.princeton.edu/Papers/AIOR_July2012.pdf


Principle of Optimality

• Let π∗ = {µ∗0, µ∗1, . . . , µ∗N−1} be an optimal policy

• Consider the tail subproblem of minimizing, from state xi at time i,

  E{ gN(xN) + ∑_{k=i}^{N−1} gk(xk, µk(xk), wk) },

  and the tail policy {µ∗i, . . . , µ∗N−1}

• Principle of optimality: the tail policy is optimal for the tail subproblem


The DP Algorithm

Intuition:

• DP first solves ALL tail subproblems at the final stage

• At a generic step, it solves ALL tail subproblems of a given time length, using the solution of the tail subproblems of shorter length

The DP algorithm:

• Start with

  JN(xN) = gN(xN),

  and go backwards using

  Jk(xk) = min_{uk ∈ Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) },

  for k = 0, 1, . . . , N − 1

• Then J∗(x0) = J0(x0), and the optimal policy is constructed by setting µ∗k(xk) = u∗k, where u∗k attains the minimum above
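In code, the backward recursion above can be sketched as follows for finite state, control, and disturbance sets (all function and variable names are illustrative, not from the slides; `P(k, x, u)` is assumed to return pairs of disturbance values and probabilities):

```python
# Minimal backward-DP sketch for the finite-horizon recursion above.
def dp_solve(states, controls, f, g, g_N, P, N):
    """Return cost-to-go tables J[k][x] and a policy table mu[k][x]."""
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    for x in states:
        J[N][x] = g_N(x)                      # J_N(x_N) = g_N(x_N)
    for k in range(N - 1, -1, -1):            # go backwards
        for x in states:
            best_u, best_cost = None, float("inf")
            for u in controls(x):             # u_k in U_k(x_k)
                # E_{w_k}{ g_k(x,u,w) + J_{k+1}(f_k(x,u,w)) }
                cost = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                           for w, p in P(k, x, u))
                if cost < best_cost:
                    best_u, best_cost = u, cost
            J[k][x], mu[k][x] = best_cost, best_u
    return J, mu
```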


Example: Inventory Control Problem (1/2)

• Stock available xk ∈ N, inventory ordered uk ∈ N, and demand wk ∈ N

• Dynamics: xk+1 = max(0, xk + uk − wk)

• Constraints: xk + uk ≤ 2

• Probabilistic structure: p(wk = 0) = 0.1, p(wk = 1) = 0.7, and p(wk = 2) = 0.2

• Cost:

  E{ g3(x3) + ∑_{k=0}^{2} [ uk + (xk + uk − wk)² ] },

  where the terminal cost is g3(x3) = 0 and the stage cost is g(xk, uk, wk) = uk + (xk + uk − wk)²


Example: Inventory Control Problem (2/2)

• The algorithm takes the form

  Jk(xk) = min_{0 ≤ uk ≤ 2−xk} E_{wk}{ uk + (xk + uk − wk)² + Jk+1(max(0, xk + uk − wk)) },

  for k = 0, 1, 2

• For example,

  J2(0) = min_{u2=0,1,2} E_{w2}{ u2 + (u2 − w2)² }
        = min_{u2=0,1,2} [ u2 + 0.1(u2)² + 0.7(u2 − 1)² + 0.2(u2 − 2)² ],

  which yields J2(0) = 1.3 and µ∗2(0) = 1

• Final solution: J0(0) = 3.7, J0(1) = 2.7, and J0(2) = 2.818
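The inventory instance is small enough to solve exactly. The following short script (written for this transcript; variable names are illustrative) reproduces the numbers above:

```python
# Exact DP for the inventory example: states 0..2, horizon N = 3,
# demand w in {0, 1, 2} with probabilities 0.1, 0.7, 0.2.
P = {0: 0.1, 1: 0.7, 2: 0.2}
N = 3
J = [{x: 0.0 for x in range(3)} for _ in range(N + 1)]  # J_3(x) = g_3(x) = 0
mu = [dict() for _ in range(N)]

for k in range(N - 1, -1, -1):
    for x in range(3):
        best_u, best_cost = None, float("inf")
        for u in range(3 - x):                 # constraint x + u <= 2
            cost = sum(p * (u + (x + u - w) ** 2
                            + J[k + 1][max(0, x + u - w)])
                       for w, p in P.items())
            if cost < best_cost:
                best_u, best_cost = u, cost
        J[k][x], mu[k][x] = best_cost, best_u

print(round(J[2][0], 3), mu[2][0])                 # 1.3 1
print({x: round(v, 3) for x, v in J[0].items()})   # {0: 3.7, 1: 2.7, 2: 2.818}
```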


Difficulties of DP

• Curse of dimensionality:

  • Exponential growth of the computational and storage requirements

  • Intractability of imperfect state information problems

• Curse of modeling: if the "system stochastics" are complex, it is difficult to obtain expressions for the transition probabilities

• Curse of time:

  • The data of the problem to be solved is given with little advance notice

  • The problem data may change as the system is controlled, so there is a need for on-line replanning


Solution: Approximate DP

• Certainty Equivalent Control

• Cost-to-Go Approximation

• Other Approaches (e.g., approximation in policy space)


Certainty Equivalent Control

• Idea: Replace the stochastic problem with a deterministic one

• At each time k, the future uncertain quantities are fixed at some "typical" values

• Online implementation:

  1. Fix the wi, i ≥ k, at some nominal values w̄i and solve the deterministic problem

     min gN(xN) + ∑_{i=k}^{N−1} gi(xi, ui, w̄i),

     where xi+1 = fi(xi, ui, w̄i)

  2. Use as control µ̄k(xk) the first element of the optimal control sequence and move to step k + 1

• Extends to the imperfect state information case (use x̄k(Ik))
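As an illustration (not from the slides), here is a CEC sketch on the inventory example, fixing each future demand at its most likely value w̄ = 1 and solving the deterministic tail problem by enumeration; the choice of w̄ and all names are assumptions of this sketch:

```python
from itertools import product

# Certainty Equivalent Control on the inventory example: fix every
# future demand at the nominal value w_bar = 1, solve the resulting
# deterministic problem, and apply the first control of the minimizer.
W_BAR, N = 1, 3

def cec_control(k, x):
    """Solve the deterministic tail problem from (k, x); return first control."""
    best_seq, best_cost = None, float("inf")
    for seq in product(range(3), repeat=N - k):    # candidate u_k..u_{N-1}
        xi, cost, feasible = x, 0.0, True
        for u in seq:
            if xi + u > 2:                         # constraint x + u <= 2
                feasible = False
                break
            cost += u + (xi + u - W_BAR) ** 2      # stage cost with w = w_bar
            xi = max(0, xi + u - W_BAR)            # nominal dynamics
        if feasible and cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq[0]
```

Note that at stage k = 2 with x = 0 this sketch (which breaks ties toward smaller u) returns u = 0, while the exact DP solution computed earlier is µ∗2(0) = 1, illustrating that CEC is a heuristic and can be suboptimal.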


Cost-to-Go Approximation (CGA)

• Idea: Truncate time horizon and approximate cost-to-go

• One-step lookahead policy: at each time k and state xk, use a control µ̄k(xk) attaining

  min_{uk ∈ Uk(xk)} E{ gk(xk, uk, wk) + J̃k+1(fk(xk, uk, wk)) },

  where

  • J̃N = gN

  • J̃k+1 is an approximation to the true cost-to-go Jk+1

• Analogously, a two-step lookahead policy uses all of the above with

  J̃k+1(xk+1) = min_{uk+1 ∈ Uk+1(xk+1)} E{ gk+1(xk+1, uk+1, wk+1) + J̃k+2(fk+1(xk+1, uk+1, wk+1)) }
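A one-step lookahead step is straightforward to code. The sketch below (illustrative names) applies it at stage k = 2 of the inventory example with the exact tail cost J̃3 = g3 = 0, recovering the computation of J2(0) from the earlier slide:

```python
# One-step lookahead: pick the control minimizing expected stage cost
# plus an approximate cost-to-go J_tilde at the next state.
def one_step_lookahead(x, controls, f, g, P, J_tilde):
    """Return argmin_u  E_w[ g(x,u,w) + J_tilde(f(x,u,w)) ]."""
    return min(controls,
               key=lambda u: sum(p * (g(x, u, w) + J_tilde(f(x, u, w)))
                                 for w, p in P))

# Inventory example at k = 2, x = 0, with the exact tail cost J_3 = 0.
P = [(0, 0.1), (1, 0.7), (2, 0.2)]
u = one_step_lookahead(0, range(3),
                       f=lambda x, u, w: max(0, x + u - w),
                       g=lambda x, u, w: u + (x + u - w) ** 2,
                       P=P, J_tilde=lambda x: 0.0)
print(u)   # 1, matching mu*_2(0) = 1 from the DP example
```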


CGA—Computational Aspects

• If J̃k+1 is readily available and the minimization is not too hard, this approach is implementable on-line

• The choice of the approximating functions J̃k is critical:

  1. Problem Approximation: approximate by considering a simpler problem

  2. Parametric Cost-to-Go Approximation: approximate the cost-to-go function with a function of suitable parametric form (parameters tuned by some scheme → neuro-dynamic programming)

  3. Rollout Approach: approximate the cost-to-go with the cost of some suboptimal policy


CGA—Problem Approximation

• Many problem-dependent possibilities:

  • Replace uncertain quantities by nominal values (in the spirit of CEC)

  • Simplify difficult constraints or dynamics

  • Decouple subsystems

  • Aggregate states


CGA—Parametric Approximation

• Use a cost-to-go approximation from a parametric class J̃(x, r), where x is the current state and r = (r1, . . . , rm) is a vector of "tunable" weights

• Two key aspects:

  • Choice of the parametric class J̃(x, r). Example: the feature extraction method

    J̃(x, r) = ∑_{i=1}^{m} ri yi(x),

    where the yi's are features

• Algorithm for tuning the weights (possibly, simulation-based)
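One simple tuning scheme, shown here as an illustrative assumption rather than anything prescribed by the slides, is least-squares regression of sampled cost-to-go targets onto the features:

```python
import numpy as np

# Least-squares tuning of the weights r in J_tilde(x, r) = sum_i r_i y_i(x),
# given sampled states and (e.g., simulated) cost-to-go targets.
# Feature choice and data below are illustrative.
def fit_weights(xs, targets, features):
    """Solve min_r sum_s (J_tilde(x_s, r) - target_s)^2 by least squares."""
    Y = np.array([[phi(x) for phi in features] for x in xs])   # design matrix
    r, *_ = np.linalg.lstsq(Y, np.array(targets), rcond=None)
    return r

# Sanity check: recover the weights of a target that is exactly linear
# in the features 1, x, x^2.
features = [lambda x: 1.0, lambda x: x, lambda x: x * x]
xs = [0.0, 1.0, 2.0, 3.0]
targets = [2.0 + 0.5 * x - 0.1 * x * x for x in xs]
r = fit_weights(xs, targets, features)
print(np.round(r, 6))   # approximately [ 2.   0.5 -0.1]
```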


CGA—Rollout Approach

• J̃k is the cost-to-go of some heuristic policy (called the base policy)

• To compute the rollout control, one needs, for all uk,

  Qk(xk, uk) := E{ gk(xk, uk, wk) + Hk+1(fk(xk, uk, wk)) },

  where Hk+1 is the cost-to-go of the base policy

• Q-factors can be evaluated via Monte-Carlo simulation

• Q-factors can be approximated, e.g., by using a CEC approach

• Model predictive control (MPC) can be viewed as a special case of rollout algorithms (AA 203)
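The Monte-Carlo evaluation of the Q-factors can be sketched on the inventory example. The base policy here ("never order", u = 0) is an illustrative choice of this transcript, not one given on the slides:

```python
import random

# Monte Carlo Q-factor estimation for the rollout approach on the
# inventory example, with the illustrative base policy u = 0.
P = [(0, 0.1), (1, 0.7), (2, 0.2)]
N = 3

def sample_w():
    return random.choices([w for w, _ in P], [p for _, p in P])[0]

def q_factor(k, x, u, n_samples=20000, base=lambda k, x: 0):
    """Estimate Q_k(x,u) = E[g_k + H_{k+1}] by simulating the base policy."""
    total = 0.0
    for _ in range(n_samples):
        w = sample_w()
        cost = u + (x + u - w) ** 2                 # stage cost g_k
        xi = max(0, x + u - w)
        for i in range(k + 1, N):                   # follow base policy to the end
            ui, wi = base(i, xi), sample_w()
            cost += ui + (xi + ui - wi) ** 2
            xi = max(0, xi + ui - wi)
        total += cost
    return total / n_samples

# At the last stage (k = 2) there is no tail, so the estimates should be
# close to the exact values 1.5, 1.3, 3.1 for u = 0, 1, 2 computed earlier;
# the rollout control is then the argmin of the estimated Q-factors.
```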


Other ADP Approaches

• Minimize the DP equation error

• Direct approximation of control policies

• Approximation in policy space


References

• Bertsekas, D. P. Dynamic programming and optimal control.Volumes 1 & 2. Athena Scientific, 2005.

• Puterman, M. L. Markov decision processes: discretestochastic dynamic programming. John Wiley & Sons, 2014.

• Sutton, R. S., and Barto, A. G. Reinforcement learning: An introduction. MIT Press, 1998.