Stochastic Optimal Control - Stanford
Transcript of lecture slides
Stochastic Optimal Control
Marco Pavone
Stanford University
April 29, 2015
AA 241X Mission
Mission: “A wild fire is occurring in Lake Lagunita, and AA 241X teams have been contracted to minimize the damage. Teams have to design, build, and fly a UAV that can detect, prevent, and extinguish the fire, with the goal of minimizing the area on fire in a fixed amount of time. Multiple fires can be present at the start of the mission, and as time goes by the fire propagates through Lake Lagunita.”
• A difficult problem: it combines exploration and exploitation
• Goal: to provide you with fundamental knowledge in the field of stochastic optimal control (focus on exploitation)
• Approach: dynamic programming
Basic SOC Problem
• System: x_{k+1} = f_k(x_k, u_k, w_k),  k = 0, …, N − 1
• Control constraints: u_k ∈ U(x_k)
• Probability distribution: P_k(· | x_k, u_k) of w_k
• Policies: π = {µ_0, …, µ_{N−1}}, where u_k = µ_k(x_k)
• Expected cost:

  J_π(x_0) = E{ g_N(x_N) + Σ_{k=0}^{N−1} g_k(x_k, µ_k(x_k), w_k) }

• Stochastic optimal control problem:

  J*(x_0) = min_π J_π(x_0)
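To make these objects concrete, here is a minimal Monte-Carlo estimate of J_π(x_0) for a toy one-dimensional instance. The dynamics, costs, policy, and disturbance distribution below are all invented for illustration, not part of the lecture:

```python
import random

# Toy instance of x_{k+1} = f_k(x_k, u_k, w_k) with a fixed policy
# u_k = mu_k(x_k); every concrete choice here is a placeholder.
N = 10

def f(x, u, w):
    return x + u + w          # toy linear dynamics

def g(x, u, w):
    return u * u + x * x      # toy stage cost

def g_terminal(x):
    return x * x              # toy terminal cost g_N

def mu(x):
    return -x                 # toy stationary policy u_k = -x_k

def J_pi(x0, num_rollouts=20000, seed=0):
    """Monte-Carlo estimate of the expected cost J_pi(x0)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_rollouts):
        x, cost = x0, 0.0
        for _ in range(N):
            u = mu(x)
            w = rng.choice([-1, 0, 1])   # toy disturbance distribution
            cost += g(x, u, w)
            x = f(x, u, w)
        cost += g_terminal(x)
        total += cost
    return total / num_rollouts
```

Note that the expectation is over the disturbance sequence (w_0, …, w_{N−1}), which is why a simple average over independent simulated trajectories estimates it.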
Key points
• Discrete-time model
• Markovian model
• Objective: find optimal closed-loop policy
• Additive cost (central assumption)
• Risk-neutral formulation
Other communities use different notation:
• Powell, W. B. “AI, OR and control theory: A rosetta stone for stochastic optimization.” Princeton University, 2012. http://castlelab.princeton.edu/Papers/AIOR_July2012.pdf
Principle of Optimality
• Let π* = {µ*_0, µ*_1, …, µ*_{N−1}} be an optimal policy
• Consider the tail subproblem of minimizing the expected cost from time i onward,

  E{ g_N(x_N) + Σ_{k=i}^{N−1} g_k(x_k, µ_k(x_k), w_k) },

  and the tail policy {µ*_i, …, µ*_{N−1}}
• Principle of optimality: the tail policy is optimal for the tail subproblem
The DP Algorithm
Intuition:
• DP first solves ALL tail subproblems at the final stage
• At a generic step, it solves ALL tail subproblems of a given time length, using the solutions of the tail subproblems of shorter length

The DP algorithm:
• Start with

  J_N(x_N) = g_N(x_N),

  and go backwards using

  J_k(x_k) = min_{u_k ∈ U_k(x_k)} E_{w_k}{ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) },

  for k = 0, 1, …, N − 1
• Then J*(x_0) = J_0(x_0), and the optimal policy is constructed by setting µ*_k(x_k) = u*_k, where u*_k attains the minimum above
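The recursion translates almost line-for-line into code when the state, control, and disturbance spaces are finite. A sketch for a generic finite problem (the function and argument names are my own, not from the slides):

```python
def dp_backward(states, controls, P_w, f, g, g_N, N):
    """Exact backward DP for a finite problem.

    states: iterable of states; controls(x): feasible controls at x;
    P_w: dict mapping disturbance w to its probability;
    f(k, x, u, w): dynamics; g(k, x, u, w): stage cost; g_N(x): terminal cost.
    Returns cost-to-go tables J[k][x] and a policy mu[k][x].
    """
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    for x in states:
        J[N][x] = g_N(x)                  # start with J_N = g_N
    for k in range(N - 1, -1, -1):        # go backwards: k = N-1, ..., 0
        for x in states:
            best_u, best_cost = None, float("inf")
            for u in controls(x):
                # E_w { g_k(x,u,w) + J_{k+1}(f_k(x,u,w)) }
                q = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                        for w, p in P_w.items())
                if q < best_cost:
                    best_u, best_cost = u, q
            J[k][x] = best_cost           # minimized expected cost-to-go
            mu[k][x] = best_u             # minimizing control u*_k
    return J, mu
```

Storing the minimizer at each (k, x) is exactly how the optimal closed-loop policy µ*_k is constructed alongside J_k.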
Example: Inventory Control Problem (1/2)
• Stock available x_k ∈ ℕ, inventory ordered u_k ∈ ℕ, and demand w_k ∈ ℕ
• Dynamics: x_{k+1} = max(0, x_k + u_k − w_k)
• Constraints: x_k + u_k ≤ 2
• Probabilistic structure: p(w_k = 0) = 0.1, p(w_k = 1) = 0.7, and p(w_k = 2) = 0.2
• Cost:

  E{ g_3(x_3) + Σ_{k=0}^{2} ( u_k + (x_k + u_k − w_k)² ) },

  where the terminal cost is g_3(x_3) = 0 and the stage cost is g(x_k, u_k, w_k) = u_k + (x_k + u_k − w_k)²
Example: Inventory Control Problem (2/2)
• The algorithm takes the form

  J_k(x_k) = min_{0 ≤ u_k ≤ 2 − x_k} E_{w_k}{ u_k + (x_k + u_k − w_k)² + J_{k+1}(max(0, x_k + u_k − w_k)) },

  for k = 0, 1, 2
• For example,

  J_2(0) = min_{u_2 = 0,1,2} E_{w_2}{ u_2 + (u_2 − w_2)² }
         = min_{u_2 = 0,1,2} [ u_2 + 0.1 u_2² + 0.7 (u_2 − 1)² + 0.2 (u_2 − 2)² ],

  which yields J_2(0) = 1.3 and µ*_2(0) = 1
• Final solution: J_0(0) = 3.7, J_0(1) = 2.7, and J_0(2) = 2.818
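These numbers can be reproduced by running the DP recursion directly on the slide's data. The short script below is a sketch (the data matches the slides; the variable and function names are invented):

```python
# Exact DP for the inventory example: states and demand in {0, 1, 2},
# constraint x_k + u_k <= 2, demand probabilities (0.1, 0.7, 0.2).
P = {0: 0.1, 1: 0.7, 2: 0.2}
STATES = (0, 1, 2)
N = 3

def solve_inventory():
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    J[N] = {x: 0.0 for x in STATES}       # J_3(x_3) = g_3(x_3) = 0
    for k in range(N - 1, -1, -1):        # k = 2, 1, 0
        for x in STATES:
            best_u, best = None, float("inf")
            for u in range(0, 3 - x):     # 0 <= u_k <= 2 - x_k
                # E_w { u + (x + u - w)^2 + J_{k+1}(max(0, x + u - w)) }
                q = sum(p * (u + (x + u - w) ** 2 + J[k + 1][max(0, x + u - w)])
                        for w, p in P.items())
                if q < best:
                    best_u, best = u, q
            J[k][x], mu[k][x] = best, best_u
    return J, mu

J, mu = solve_inventory()
# Recovers the slide's values (up to floating point):
# J[2][0] = 1.3 with mu[2][0] = 1, and J[0] = {0: 3.7, 1: 2.7, 2: 2.818}
```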
Difficulties of DP
• Curse of dimensionality:
  • Exponential growth of the computational and storage requirements
  • Intractability of imperfect-state-information problems
• Curse of modeling: if the “system stochastics” are complex, it is difficult to obtain expressions for the transition probabilities
• Curse of time:
  • The data of the problem to be solved is given with little advance notice
  • The problem data may change as the system is controlled, creating a need for on-line replanning
Solution: Approximate DP
• Certainty Equivalent Control
• Cost-to-Go Approximation
• Other Approaches (e.g., approximation in policy space)
Certainty Equivalent Control
• Idea: replace the stochastic problem with a deterministic one
• At each time k, the future uncertain quantities are fixed at some “typical” values
• Online implementation:
  1. Fix the w_i, i ≥ k, at some nominal values w̄_i and solve the deterministic problem

     min  g_N(x_N) + Σ_{i=k}^{N−1} g_i(x_i, u_i, w̄_i),

     where x_{i+1} = f_i(x_i, u_i, w̄_i)
  2. Use as control µ̄_k(x_k) the first element of the optimal control sequence, and move to step k + 1
• Extends to the imperfect-state-information case (use the estimate x̄_k(I_k))
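A minimal sketch of the two-step CEC loop, on an invented toy problem small enough that the deterministic tail problem can be solved by brute-force enumeration (every concrete choice here — horizon, dynamics, costs, nominal disturbance — is a placeholder):

```python
import itertools

N = 5
U = (-1, 0, 1)                  # toy control set
w_bar = 0                       # "typical" disturbance value

def f(x, u, w):
    return x + u + w            # toy dynamics

def g(x, u, w):
    return abs(x) + 0.5 * abs(u)  # toy stage cost

def g_N(x):
    return abs(x)               # toy terminal cost

def cec_control(x, k):
    """Step 1: solve the deterministic tail problem with every w_i fixed
    at w_bar. Step 2: return the first control of the optimal sequence."""
    best_seq, best_cost = None, float("inf")
    for seq in itertools.product(U, repeat=N - k):   # brute-force search
        xi, cost = x, 0.0
        for u in seq:
            cost += g(xi, u, w_bar)
            xi = f(xi, u, w_bar)   # deterministic dynamics with w_bar
        cost += g_N(xi)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq[0]
```

In practice the inner deterministic problem would be solved by a real optimizer rather than enumeration; the point is only the structure: replan at every k, apply only the first control.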
Cost-to-Go Approximation (CGA)
• Idea: truncate the time horizon and approximate the cost-to-go
• One-step lookahead policy: at each time k and state x_k, use the control µ̄_k(x_k) that attains

  min_{u_k ∈ U_k(x_k)} E{ g_k(x_k, u_k, w_k) + J̃_{k+1}(f_k(x_k, u_k, w_k)) },

  where
  • J̃_N = g_N
  • J̃_{k+1} is an approximation to the true cost-to-go J_{k+1}
• Analogously, two-step lookahead policy: all of the above, with

  J̃_{k+1}(x_{k+1}) = min_{u_{k+1} ∈ U_{k+1}(x_{k+1})} E{ g_{k+1}(x_{k+1}, u_{k+1}, w_{k+1}) + J̃_{k+2}(f_{k+1}(x_{k+1}, u_{k+1}, w_{k+1})) }
CGA—Computational Aspects
• If J̃_{k+1} is readily available and the minimization is not too hard, this approach is implementable on-line
• The choice of the approximating functions J̃_k is critical:
  1. Problem approximation: approximate by considering a simpler problem
  2. Parametric cost-to-go approximation: approximate the cost-to-go function with a function of suitable parametric form (parameters tuned by some scheme → neuro-dynamic programming)
  3. Rollout approach: approximate the cost-to-go with the cost of some suboptimal policy
CGA—Problem Approximation
• Many problem-dependent possibilities:
  • Replace uncertain quantities by nominal values (in the spirit of CEC)
  • Simplify difficult constraints or dynamics
  • Decouple subsystems
  • Aggregate states
CGA—Parametric Approximation
• Use a cost-to-go approximation from a parametric class J̃(x, r), where x is the current state and r = (r_1, …, r_m) is a vector of “tunable” weights
• Two key aspects:
  • Choice of the parametric class J̃(x, r)
    • Example: feature extraction method,

      J̃(x, r) = Σ_{i=1}^{m} r_i y_i(x),

      where the y_i ’s are features
  • Algorithm for tuning the weights (possibly simulation-based)
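As one illustration of the feature-based form, the weights r can be tuned by least squares on sampled state/cost pairs. Everything concrete below — the features, the sample states, and the target costs (which in practice would come from simulation) — is invented:

```python
import numpy as np

# Linear-in-the-weights approximation J~(x, r) = sum_i r_i * y_i(x).
def features(x):
    return np.array([1.0, x, x * x])     # invented features y_1, y_2, y_3

xs = np.linspace(-2.0, 2.0, 21)          # sampled states
targets = 2.0 + 3.0 * xs ** 2            # stand-in for simulated costs-to-go

# One row of features per sample; fit r by linear least squares.
Y = np.stack([features(x) for x in xs])
r, *_ = np.linalg.lstsq(Y, targets, rcond=None)

def J_tilde(x):
    return float(features(x) @ r)
# Here the targets lie exactly in the span of the features, so the fit
# recovers them: r is approximately (2, 0, 3).
```

The same least-squares step is the core of many simulation-based tuning schemes: generate target costs by simulating a policy, then project them onto the feature space.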
CGA—Rollout Approach
• J̃_k is the cost-to-go of some heuristic policy (called the base policy)
• To compute the rollout control, one needs, for all u_k, the Q-factors

  Q_k(x_k, u_k) := E{ g_k(x_k, u_k, w_k) + H_{k+1}(f_k(x_k, u_k, w_k)) },

  where H_{k+1} is the cost-to-go of the base policy
• The Q-factors can be evaluated via Monte-Carlo simulation
• The Q-factors can be approximated, e.g., by using a CEC approach
• Model predictive control (MPC) can be viewed as a special case of rollout algorithms (AA 203)
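A sketch of Monte-Carlo Q-factor evaluation for rollout on an invented toy problem (the base policy, dynamics, costs, and disturbance distribution are all placeholders):

```python
import random

N = 8
U = (-1, 0, 1)                            # toy control set

def f(x, u, w):
    return x + u + w                      # toy dynamics

def g(x, u, w):
    return x * x                          # toy stage cost

def g_N(x):
    return x * x                          # toy terminal cost

def base_policy(x):
    return 0                              # deliberately lazy heuristic

def Q_factor(k, x, u, rng, num_sims=2000):
    """Estimate Q_k(x,u) = E{ g_k + H_{k+1}(f_k(x,u,w)) } by applying u
    once, then simulating the base policy from the successor state."""
    total = 0.0
    for _ in range(num_sims):
        w = rng.choice([-1, 0, 1])
        cost = g(x, u, w)
        xi = f(x, u, w)
        for i in range(k + 1, N):         # roll out the base policy
            wi = rng.choice([-1, 0, 1])
            cost += g(xi, base_policy(xi), wi)
            xi = f(xi, base_policy(xi), wi)
        total += cost + g_N(xi)
    return total / num_sims

def rollout_control(k, x, seed=0):
    """The rollout policy picks the control with the smallest Q-factor."""
    rng = random.Random(seed)
    return min(U, key=lambda u: Q_factor(k, x, u, rng))
```

Because the base policy here never acts, rollout improves on it simply by choosing the one-step control that steers toward cheaper future states, which is the basic mechanism behind rollout's policy-improvement property.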
Other ADP Approaches
• Minimize the DP equation error
• Direct approximation of control policies
• Approximation in policy space
References
• Bertsekas, D. P. Dynamic Programming and Optimal Control, Vols. 1 & 2. Athena Scientific, 2005.
• Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
• Sutton, R. S., and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 1998.