Stochastic Optimal Control - Stanford
Transcript of lecture slides
Stochastic Optimal Control
Marco Pavone
Stanford University
April 29, 2015
AA 241X Mission
Mission: “A wild fire is occurring in Lake Lagunita, and AA 241X teams have been contracted to minimize the damage. Teams have to design, build, and fly a UAV that can detect, prevent, and extinguish the fire, with the goal of minimizing the area on fire in a fixed amount of time. Multiple fires can be present at the start of the mission, and as time goes by the fire propagates through Lake Lagunita.”
• A difficult problem: it combines exploration and exploitation
• Goal: to provide you with fundamental knowledge in the field of stochastic optimal control (focus on exploitation)
• Approach: dynamic programming
Basic SOC Problem
• System: x_{k+1} = f_k(x_k, u_k, w_k),  k = 0, …, N − 1
• Control constraints: u_k ∈ U(x_k)
• Probability distribution: P_k(· | x_k, u_k) of w_k
• Policies: π = {µ_0, …, µ_{N−1}}, where u_k = µ_k(x_k)
• Expected cost:

  J_π(x_0) = E{ g_N(x_N) + Σ_{k=0}^{N−1} g_k(x_k, µ_k(x_k), w_k) }

• Stochastic optimal control problem:

  J*(x_0) = min_π J_π(x_0)
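To make these objects concrete, here is a minimal Monte-Carlo estimate of J_π(x_0) for a toy one-dimensional instance. The dynamics, costs, policy, and disturbance distribution below are all invented for illustration, not part of the lecture:

```python
import random

# Toy instance of x_{k+1} = f_k(x_k, u_k, w_k) with a fixed policy
# u_k = mu_k(x_k); every concrete choice here is a placeholder.
N = 10

def f(x, u, w):
    return x + u + w          # toy linear dynamics

def g(x, u, w):
    return u * u + x * x      # toy stage cost

def g_terminal(x):
    return x * x              # toy terminal cost g_N

def mu(x):
    return -x                 # toy stationary policy u_k = -x_k

def J_pi(x0, num_rollouts=20000, seed=0):
    """Monte-Carlo estimate of the expected cost J_pi(x0)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_rollouts):
        x, cost = x0, 0.0
        for _ in range(N):
            u = mu(x)
            w = rng.choice([-1, 0, 1])   # toy disturbance distribution
            cost += g(x, u, w)
            x = f(x, u, w)
        cost += g_terminal(x)
        total += cost
    return total / num_rollouts
```

Note that the expectation is over the disturbance sequence (w_0, …, w_{N−1}), which is why a simple average over independent simulated trajectories estimates it.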
Key points
• Discrete-time model
• Markovian model
• Objective: find optimal closed-loop policy
• Additive cost (central assumption)
• Risk-neutral formulation
Other communities use different notation:
• Powell, W. B. “AI, OR and control theory: A rosetta stone for stochastic optimization.” Princeton University, 2012. http://castlelab.princeton.edu/Papers/AIOR_July2012.pdf
Principle of Optimality
• Let π* = {µ*_0, µ*_1, …, µ*_{N−1}} be an optimal policy
• Consider the tail subproblem of minimizing the expected cost from time i onward,

  E{ g_N(x_N) + Σ_{k=i}^{N−1} g_k(x_k, µ_k(x_k), w_k) },

  and the tail policy {µ*_i, …, µ*_{N−1}}
• Principle of optimality: the tail policy is optimal for the tail subproblem
The DP Algorithm
Intuition:
• DP first solves ALL tail subproblems at the final stage
• At a generic step, it solves ALL tail subproblems of a given time length, using the solutions of the tail subproblems of shorter length

The DP algorithm:
• Start with

  J_N(x_N) = g_N(x_N),

  and go backwards using

  J_k(x_k) = min_{u_k ∈ U_k(x_k)} E_{w_k}{ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) },

  for k = 0, 1, …, N − 1
• Then J*(x_0) = J_0(x_0), and the optimal policy is constructed by setting µ*_k(x_k) = u*_k, where u*_k attains the minimum above
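The recursion translates almost line-for-line into code when the state, control, and disturbance spaces are finite. A sketch for a generic finite problem (the function and argument names are my own, not from the slides):

```python
def dp_backward(states, controls, P_w, f, g, g_N, N):
    """Exact backward DP for a finite problem.

    states: iterable of states; controls(x): feasible controls at x;
    P_w: dict mapping disturbance w to its probability;
    f(k, x, u, w): dynamics; g(k, x, u, w): stage cost; g_N(x): terminal cost.
    Returns cost-to-go tables J[k][x] and a policy mu[k][x].
    """
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    for x in states:
        J[N][x] = g_N(x)                  # start with J_N = g_N
    for k in range(N - 1, -1, -1):        # go backwards: k = N-1, ..., 0
        for x in states:
            best_u, best_cost = None, float("inf")
            for u in controls(x):
                # E_w { g_k(x,u,w) + J_{k+1}(f_k(x,u,w)) }
                q = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                        for w, p in P_w.items())
                if q < best_cost:
                    best_u, best_cost = u, q
            J[k][x] = best_cost           # minimized expected cost-to-go
            mu[k][x] = best_u             # minimizing control u*_k
    return J, mu
```

Storing the minimizer at each (k, x) is exactly how the optimal closed-loop policy µ*_k is constructed alongside J_k.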
Example: Inventory Control Problem (1/2)
• Stock available x_k ∈ ℕ, inventory ordered u_k ∈ ℕ, and demand w_k ∈ ℕ
• Dynamics: x_{k+1} = max(0, x_k + u_k − w_k)
• Constraints: x_k + u_k ≤ 2
• Probabilistic structure: p(w_k = 0) = 0.1, p(w_k = 1) = 0.7, and p(w_k = 2) = 0.2
• Cost:

  E{ g_3(x_3) + Σ_{k=0}^{2} ( u_k + (x_k + u_k − w_k)² ) },

  where the terminal cost is g_3(x_3) = 0 and the stage cost is g(x_k, u_k, w_k) = u_k + (x_k + u_k − w_k)²
Example: Inventory Control Problem (2/2)
• The algorithm takes the form

  J_k(x_k) = min_{0 ≤ u_k ≤ 2 − x_k} E_{w_k}{ u_k + (x_k + u_k − w_k)² + J_{k+1}(max(0, x_k + u_k − w_k)) },

  for k = 0, 1, 2
• For example,

  J_2(0) = min_{u_2 = 0,1,2} E_{w_2}{ u_2 + (u_2 − w_2)² }
         = min_{u_2 = 0,1,2} [ u_2 + 0.1 u_2² + 0.7 (u_2 − 1)² + 0.2 (u_2 − 2)² ],

  which yields J_2(0) = 1.3 and µ*_2(0) = 1
• Final solution: J_0(0) = 3.7, J_0(1) = 2.7, and J_0(2) = 2.818
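These numbers can be reproduced by running the DP recursion directly on the slide's data. The short script below is a sketch (the data matches the slides; the variable and function names are invented):

```python
# Exact DP for the inventory example: states and demand in {0, 1, 2},
# constraint x_k + u_k <= 2, demand probabilities (0.1, 0.7, 0.2).
P = {0: 0.1, 1: 0.7, 2: 0.2}
STATES = (0, 1, 2)
N = 3

def solve_inventory():
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    J[N] = {x: 0.0 for x in STATES}       # J_3(x_3) = g_3(x_3) = 0
    for k in range(N - 1, -1, -1):        # k = 2, 1, 0
        for x in STATES:
            best_u, best = None, float("inf")
            for u in range(0, 3 - x):     # 0 <= u_k <= 2 - x_k
                # E_w { u + (x + u - w)^2 + J_{k+1}(max(0, x + u - w)) }
                q = sum(p * (u + (x + u - w) ** 2 + J[k + 1][max(0, x + u - w)])
                        for w, p in P.items())
                if q < best:
                    best_u, best = u, q
            J[k][x], mu[k][x] = best, best_u
    return J, mu

J, mu = solve_inventory()
# Recovers the slide's values (up to floating point):
# J[2][0] = 1.3 with mu[2][0] = 1, and J[0] = {0: 3.7, 1: 2.7, 2: 2.818}
```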
Difficulties of DP
• Curse of dimensionality:
  • Exponential growth of the computational and storage requirements
  • Intractability of imperfect-state-information problems
• Curse of modeling: if the “system stochastics” are complex, it is difficult to obtain expressions for the transition probabilities
• Curse of time:
  • The data of the problem to be solved is given with little advance notice
  • The problem data may change as the system is controlled, creating a need for on-line replanning
Solution: Approximate DP
• Certainty Equivalent Control
• Cost-to-Go Approximation
• Other Approaches (e.g., approximation in policy space)
Certainty Equivalent Control
• Idea: replace the stochastic problem with a deterministic one
• At each time k, the future uncertain quantities are fixed at some “typical” values
• Online implementation:
  1. Fix the w_i, i ≥ k, at some nominal values w̄_i and solve the deterministic problem

     min  g_N(x_N) + Σ_{i=k}^{N−1} g_i(x_i, u_i, w̄_i),

     where x_{i+1} = f_i(x_i, u_i, w̄_i)
  2. Use as control µ̄_k(x_k) the first element of the optimal control sequence, and move to step k + 1
• Extends to the imperfect-state-information case (use the estimate x̄_k(I_k))
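A minimal sketch of the two-step CEC loop, on an invented toy problem small enough that the deterministic tail problem can be solved by brute-force enumeration (every concrete choice here — horizon, dynamics, costs, nominal disturbance — is a placeholder):

```python
import itertools

N = 5
U = (-1, 0, 1)                  # toy control set
w_bar = 0                       # "typical" disturbance value

def f(x, u, w):
    return x + u + w            # toy dynamics

def g(x, u, w):
    return abs(x) + 0.5 * abs(u)  # toy stage cost

def g_N(x):
    return abs(x)               # toy terminal cost

def cec_control(x, k):
    """Step 1: solve the deterministic tail problem with every w_i fixed
    at w_bar. Step 2: return the first control of the optimal sequence."""
    best_seq, best_cost = None, float("inf")
    for seq in itertools.product(U, repeat=N - k):   # brute-force search
        xi, cost = x, 0.0
        for u in seq:
            cost += g(xi, u, w_bar)
            xi = f(xi, u, w_bar)   # deterministic dynamics with w_bar
        cost += g_N(xi)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq[0]
```

In practice the inner deterministic problem would be solved by a real optimizer rather than enumeration; the point is only the structure: replan at every k, apply only the first control.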
Cost-to-Go Approximation (CGA)
• Idea: truncate the time horizon and approximate the cost-to-go
• One-step lookahead policy: at each time k and state x_k, use the control µ̄_k(x_k) that attains

  min_{u_k ∈ U_k(x_k)} E{ g_k(x_k, u_k, w_k) + J̃_{k+1}(f_k(x_k, u_k, w_k)) },

  where
  • J̃_N = g_N
  • J̃_{k+1} is an approximation to the true cost-to-go J_{k+1}
• Analogously, two-step lookahead policy: all of the above, with

  J̃_{k+1}(x_{k+1}) = min_{u_{k+1} ∈ U_{k+1}(x_{k+1})} E{ g_{k+1}(x_{k+1}, u_{k+1}, w_{k+1}) + J̃_{k+2}(f_{k+1}(x_{k+1}, u_{k+1}, w_{k+1})) }
CGA—Computational Aspects
• If J̃_{k+1} is readily available and the minimization is not too hard, this approach is implementable on-line
• The choice of the approximating functions J̃_k is critical:
  1. Problem approximation: approximate by considering a simpler problem
  2. Parametric cost-to-go approximation: approximate the cost-to-go function with a function of suitable parametric form (parameters tuned by some scheme → neuro-dynamic programming)
  3. Rollout approach: approximate the cost-to-go with the cost of some suboptimal policy
CGA—Problem Approximation
• Many problem-dependent possibilities:
  • Replace uncertain quantities by nominal values (in the spirit of CEC)
  • Simplify difficult constraints or dynamics
  • Decouple subsystems
  • Aggregate states
CGA—Parametric Approximation
• Use a cost-to-go approximation from a parametric class J̃(x, r), where x is the current state and r = (r_1, …, r_m) is a vector of “tunable” weights
• Two key aspects:
  • Choice of the parametric class J̃(x, r)
    • Example: feature extraction method,

      J̃(x, r) = Σ_{i=1}^{m} r_i y_i(x),

      where the y_i ’s are features
  • Algorithm for tuning the weights (possibly simulation-based)
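As one illustration of the feature-based form, the weights r can be tuned by least squares on sampled state/cost pairs. Everything concrete below — the features, the sample states, and the target costs (which in practice would come from simulation) — is invented:

```python
import numpy as np

# Linear-in-the-weights approximation J~(x, r) = sum_i r_i * y_i(x).
def features(x):
    return np.array([1.0, x, x * x])     # invented features y_1, y_2, y_3

xs = np.linspace(-2.0, 2.0, 21)          # sampled states
targets = 2.0 + 3.0 * xs ** 2            # stand-in for simulated costs-to-go

# One row of features per sample; fit r by linear least squares.
Y = np.stack([features(x) for x in xs])
r, *_ = np.linalg.lstsq(Y, targets, rcond=None)

def J_tilde(x):
    return float(features(x) @ r)
# Here the targets lie exactly in the span of the features, so the fit
# recovers them: r is approximately (2, 0, 3).
```

The same least-squares step is the core of many simulation-based tuning schemes: generate target costs by simulating a policy, then project them onto the feature space.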
CGA—Rollout Approach
• J̃_k is the cost-to-go of some heuristic policy (called the base policy)
• To compute the rollout control, one needs, for all u_k, the Q-factors

  Q_k(x_k, u_k) := E{ g_k(x_k, u_k, w_k) + H_{k+1}(f_k(x_k, u_k, w_k)) },

  where H_{k+1} is the cost-to-go of the base policy
• The Q-factors can be evaluated via Monte-Carlo simulation
• The Q-factors can be approximated, e.g., by using a CEC approach
• Model predictive control (MPC) can be viewed as a special case of rollout algorithms (AA 203)
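A sketch of Monte-Carlo Q-factor evaluation for rollout on an invented toy problem (the base policy, dynamics, costs, and disturbance distribution are all placeholders):

```python
import random

N = 8
U = (-1, 0, 1)                            # toy control set

def f(x, u, w):
    return x + u + w                      # toy dynamics

def g(x, u, w):
    return x * x                          # toy stage cost

def g_N(x):
    return x * x                          # toy terminal cost

def base_policy(x):
    return 0                              # deliberately lazy heuristic

def Q_factor(k, x, u, rng, num_sims=2000):
    """Estimate Q_k(x,u) = E{ g_k + H_{k+1}(f_k(x,u,w)) } by applying u
    once, then simulating the base policy from the successor state."""
    total = 0.0
    for _ in range(num_sims):
        w = rng.choice([-1, 0, 1])
        cost = g(x, u, w)
        xi = f(x, u, w)
        for i in range(k + 1, N):         # roll out the base policy
            wi = rng.choice([-1, 0, 1])
            cost += g(xi, base_policy(xi), wi)
            xi = f(xi, base_policy(xi), wi)
        total += cost + g_N(xi)
    return total / num_sims

def rollout_control(k, x, seed=0):
    """The rollout policy picks the control with the smallest Q-factor."""
    rng = random.Random(seed)
    return min(U, key=lambda u: Q_factor(k, x, u, rng))
```

Because the base policy here never acts, rollout improves on it simply by choosing the one-step control that steers toward cheaper future states, which is the basic mechanism behind rollout's policy-improvement property.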
Other ADP Approaches
• Minimize the DP equation error
• Direct approximation of control policies
• Approximation in policy space
References
• Bertsekas, D. P. Dynamic Programming and Optimal Control, Vols. 1 & 2. Athena Scientific, 2005.
• Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
• Sutton, R. S., and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 1998.