Stochastic Dynamic Programming
gdrro.lip6.fr/sites/default/files/Expose-Leclere...

Transcript of "Stochastic Dynamic Programming"

Page 1:

Stochastic Dynamic Programming

V. Leclere (CERMICS, ENPC)

July 5, 2016

V. Leclere Dynamic Programming July 5, 2016 1 / 20

Page 2:

Contents

1 Deterministic Dynamic Programming

2 Stochastic Dynamic Programming

3 Curses of Dimensionality



Page 4:

Controlled Dynamic System

A controlled dynamic system is defined by its dynamics

    x_{t+1} = f_t(x_t, u_t)

and an initial state x_0. The variables:

x_t is the state of the system,

u_t is the control applied to the system at time t.

Examples:

x_t is the position and speed of a satellite, u_t the acceleration due to the engine (at time t).

x_t is the stock of products available, u_t the consumption at time t.

...

Page 5:

Optimization Problem

We want to solve the following optimization problem:

    min_{u_0, ..., u_{T−1}}  ∑_{t=0}^{T−1} L_t(x_t, u_t) + K(x_T)    (1a)

    s.t.  x_{t+1} = f_t(x_t, u_t),  x_0 given                        (1b)

          u_t ∈ U_t(x_t)                                             (1c)

where

L_t(x, u) is the cost incurred between t and t+1 for a starting state x with control u;

K(x) is the final cost incurred for the final state x;

f_t is the dynamics of the dynamical system;

U_t(x) is the set of admissible controls at time t with starting state x.

Note: this is a Shortest Path Problem on an acyclic directed graph.

Page 6:

Problem decomposition

The problem can be written

    min_{u_0}  { L_0(x_0, u_0) + min_{u_1, ..., u_{T−1}}  ∑_{t=1}^{T−1} L_t(x_t, u_t) + K(x_T) }

    s.t.  x_{t+1} = f_t(x_t, u_t)

          x_1 = f_0(x_0, u_0)

          u_t ∈ U_t(x_t)

Or, more simply,

    min_{u_0}  L_0(x_0, u_0) + V_1(f_0(x_0, u_0))

where V_1(x) is the value of the problem starting at time t = 1 with state x_1 = x.

Page 7:

Bellman value function

More generically, we denote by V_{t_0}(x) the optimal value of the problem starting at time t_0 with state x:

    V_{t_0}(x) = min_{u_{t_0}, ..., u_{T−1}}  ∑_{t=t_0}^{T−1} L_t(x_t, u_t) + K(x_T)    (2a)

    s.t.  x_{t+1} = f_t(x_t, u_t),  x_{t_0} = x                                         (2b)

          u_t ∈ U_t(x_t)                                                                (2c)

Page 8:

Bellman Equation

Theorem

We have the Bellman equation (we assume existence of minimizers):

    V_T(x) = K(x)    ∀ x ∈ X_T

    V_t(x) = min_{u_t ∈ U_t(x)}  L_t(x, u_t) + V_{t+1}(f_t(x, u_t))    ∀ x ∈ X_t,

where f_t(x, u_t) plays the role of x_{t+1}. The optimal policy is given by

    π♯_t(x) ∈ arg min_{u_t ∈ U_t(x)}  { L_t(x, u_t) + V_{t+1}(f_t(x, u_t)) }    ∀ x ∈ X_t.
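The backward recursion above can be sketched in a few lines of Python. The horizon, dynamics, costs and control sets below are illustrative assumptions (a toy inventory problem), not data from the slides:

```python
# Deterministic dynamic programming by backward induction (Bellman equation).
# Toy inventory problem (illustrative): state = stock in {0..4},
# control = quantity ordered, one unit consumed per period.

T = 3
STATES = range(5)

def f(t, x, u):                 # dynamics f_t: next stock, clipped to [0, 4]
    return max(0, min(4, x + u - 1))

def L(t, x, u):                 # stage cost L_t: ordering cost + holding cost
    return 2 * u + 0.5 * x

def K(x):                       # final cost K
    return 0.0

def U(t, x):                    # admissible controls U_t(x)
    return range(0, 6 - x)

V = {T: {x: K(x) for x in STATES}}      # V_T = K
policy = {}
for t in reversed(range(T)):            # t = T-1, ..., 0
    V[t], policy[t] = {}, {}
    for x in STATES:
        # Bellman equation: V_t(x) = min_u L_t(x, u) + V_{t+1}(f_t(x, u))
        best_u = min(U(t, x), key=lambda u: L(t, x, u) + V[t + 1][f(t, x, u)])
        policy[t][x] = best_u
        V[t][x] = L(t, x, best_u) + V[t + 1][f(t, x, best_u)]
```

Running the loop yields both the look-up tables V_t and the policy π_t for every state at every time, which is exactly what the theorem provides.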

Page 9:

Policy

Definition

An admissible policy for problem (1) is a sequence of functions (π_t), each mapping the set X_t of possible states at time t into the set U_t of possible controls, and such that

    ∀ t ∈ {0, ..., T−1},  ∀ x ∈ X_t,  π_t(x) ∈ U_t(x).

Page 10:

Open-Loop vs Closed-Loop solution

Problem (1) can be solved with a Pontryagin approach, which yields a sequence of optimal controls (u♯_0, ..., u♯_{T−1}). This is a so-called open-loop solution, as it is decided once (at time t = 0) and never questioned. This type of solution is easy to store and use, but not robust to errors or imprecisions.

The Dynamic Programming approach yields an optimal policy {π♯_t}_{t ∈ {0, ..., T−1}}. This is a so-called closed-loop solution, as the control u_t is chosen at time t according to the actual state x_t. It is more complex to compute and use, but more robust to errors or imprecisions.

In a deterministic and exact setting, an open-loop solution is equivalent to a closed-loop solution.

Page 11:

Contents

1 Deterministic Dynamic Programming

2 Stochastic Dynamic Programming

3 Curses of Dimensionality


Page 12:

Stochastic Controlled Dynamic System

A stochastic controlled dynamic system is defined by its dynamics

    x_{t+1} = f_t(x_t, u_t, ξ_{t+1})

and an initial state x_0. The variables:

x_t is the state of the system,

u_t is the control applied to the system at time t,

ξ_t is an exogenous noise.

Page 13:

Examples

Stock of water in a dam:
  x_t is the amount of water in the dam at time t,
  u_t is the amount of water turbined at time t,
  ξ_t is the inflow of water at time t.

Boat in the ocean:
  x_t is the position of the boat at time t,
  u_t is the direction and speed chosen at time t,
  ξ_t is the wind and current at time t.

Subway network:
  x_t is the position and speed of each train at time t,
  u_t is the acceleration chosen at time t,
  ξ_t is the delay due to passengers and incidents on the network at time t.

Page 14:

Optimization Problem

We want to solve the following optimization problem:

    min  E[ ∑_{t=0}^{T−1} L_t(x_t, u_t, ξ_{t+1}) + K(x_T) ]    (3a)

    s.t.  x_{t+1} = f_t(x_t, u_t, ξ_{t+1}),  x_0 given         (3b)

          u_t ∈ U_t(x_t)                                       (3c)

          σ(u_t) ⊂ F_t := σ(ξ_0, ..., ξ_t)                     (3d)

where

constraint (3b) is the dynamics of the system;

constraint (3c) is the constraint on the controls;

constraint (3d) is the information constraint: u_t is chosen knowing the realisations of the noises ξ_0, ..., ξ_t, but without knowing the realisations of the noises ξ_{t+1}, ..., ξ_{T−1}.
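A policy satisfying the information constraint (3d) — here in the form u_t = π_t(x_t), so u_t uses no future noise — can be evaluated by Monte-Carlo simulation of the expected cost (3a). The dynamics, costs, noise law and policy below are illustrative assumptions:

```python
import random

# Monte-Carlo estimate of the expected cost (3a) for a fixed admissible policy.
# Dynamics f_t, costs L_t and K, and the noise law are illustrative assumptions.

T = 5
random.seed(0)

def f(t, x, u, xi):             # dynamics f_t(x, u, xi)
    return x + u - xi

def L(t, x, u, xi):             # stage cost L_t
    return u + 0.1 * x * x

def K(x):                       # final cost
    return abs(x)

def policy(t, x):               # a feasible policy: u_t = pi_t(x_t)
    return max(0, 1 - x)        # order up to stock level 1 (illustrative)

def simulate(x0):
    """One trajectory: u_t depends only on x_t, respecting constraint (3d)."""
    x, cost = x0, 0.0
    for t in range(T):
        u = policy(t, x)
        xi = random.choice([0, 1, 2])   # exogenous demand noise xi_{t+1}
        cost += L(t, x, u, xi)
        x = f(t, x, u, xi)
    return cost + K(x)

estimate = sum(simulate(x0=1) for _ in range(10_000)) / 10_000
```

Note that sampling the noise *after* the control is computed enforces the decision-hazard information structure discussed below.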

Page 15:

Dynamic Programming Principle

Theorem

Assume that the noises ξ_t are independent and exogenous. Then there exists (under technical assumptions, satisfied in the discrete case) an optimal solution, called a strategy, of the form u_t = π_t(x_t). We have

    π_t(x) ∈ arg min_{u ∈ U_t(x)}  E[ L_t(x, u, ξ_{t+1}) + V_{t+1}(f_t(x, u, ξ_{t+1})) ],

where the first term is the current cost and the second the future costs, and where (Dynamic Programming Equation)

    V_T(x) = K(x)

    V_t(x) = min_{u ∈ U_t(x)}  E[ L_t(x, u, ξ_{t+1}) + V_{t+1}(f_t(x, u, ξ_{t+1})) ],

with f_t(x, u, ξ_{t+1}) playing the role of the next state x_{t+1}.


Page 17:

Interpretation of Bellman Value

The Bellman value function V_{t_0}(x) can be interpreted as the value of the problem starting at time t_0 from the state x. More precisely, we have

    V_{t_0}(x) = min  E[ ∑_{t=t_0}^{T−1} L_t(x_t, u_t, ξ_{t+1}) + K(x_T) ]

    s.t.  x_{t+1} = f_t(x_t, u_t, ξ_{t+1}),  x_{t_0} = x

          u_t ∈ U_t(x_t)

          σ(u_t) ⊂ σ(ξ_0, ..., ξ_t)

Page 18:

Information structure I

In Problem (3), constraint (3d) is the information constraint. There are different possible information structures.

If constraint (3d) reads σ(u_t) ⊂ F_0, the problem is open-loop, as the controls are chosen without knowledge of the realisation of any noise.

If constraint (3d) reads σ(u_t) ⊂ F_t, the problem is said to be in decision-hazard structure, as the decision u_t is chosen without knowing ξ_{t+1}.

If constraint (3d) reads σ(u_t) ⊂ F_{t+1}, the problem is said to be in hazard-decision structure, as the decision u_t is chosen with knowledge of ξ_{t+1}.

If constraint (3d) reads σ(u_t) ⊂ F_{T−1}, the problem is said to be anticipative, as the decision u_t is chosen with knowledge of all the noises.


Page 23:

Information structure II

Be careful when modeling your information structure:

An open-loop information structure might happen in practice (you have to decide on a planning and stick to it). If the problem does not require an open-loop solution, then an open-loop solution might be largely suboptimal (imagine driving a car with your eyes closed...). In any case it yields an upper bound on the problem.

In some cases decision-hazard and hazard-decision are both approximations of reality. Hazard-decision yields a lower value than decision-hazard.

The anticipative structure is never an accurate model of reality. However, it can yield a lower bound on your optimization problem, relying on deterministic optimization and Monte-Carlo.


Page 26:

Non-independence of noise in DP

The Dynamic Programming equation requires only the time-independence of the noises. This can be relaxed if we consider an extended state.

Consider a dynamic system driven by an equation

    y_{t+1} = f_t(y_t, u_t, ε_{t+1})

where the random noise ε_t is an AR(1) process:

    ε_t = α_t ε_{t−1} + β_t + ξ_t,

with {ξ_t}_{t ∈ Z} independent. Then y_t is called the physical state of the system, and DP can be used with the information state x_t = (y_t, ε_{t−1}).

Generically speaking, if the noise ξ_t is exogenous (not affected by the decisions u_t), then we can always apply Dynamic Programming with the state (x_t, ξ_1, ..., ξ_t).
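A minimal sketch of this state augmentation, with illustrative AR(1) coefficients and physical dynamics: the pair (y_t, ε_{t−1}) evolves as a Markov chain driven only by the independent innovations ξ_t, so DP applies to the augmented state:

```python
import random

# Sketch: when the noise eps_t is AR(1), eps_t = a * eps_prev + b + xi_t,
# the pair (y_t, eps_{t-1}) is a valid Markovian state for DP.
# The coefficients and the physical dynamics below are illustrative assumptions.

a, b = 0.8, 0.1

def step(state, u):
    """Evolve the augmented information state (y, eps_prev)."""
    y, eps_prev = state
    xi = random.gauss(0.0, 1.0)          # independent innovation xi_t
    eps = a * eps_prev + b + xi          # reconstruct the AR(1) noise eps_t
    y_next = y + u - eps                 # physical dynamics (illustrative)
    return (y_next, eps)                 # next state carries eps for time t+1

state = (0.0, 0.0)
for t in range(3):
    state = step(state, u=1.0)
```

The point of the sketch is only that `step` needs nothing but the augmented state and an independent draw, which is exactly the Markov property DP requires.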

Page 27:

Contents

1 Deterministic Dynamic Programming

2 Stochastic Dynamic Programming

3 Curses of Dimensionality


Page 28:

Dynamic Programming Algorithm

    Data: problem parameters
    Result: optimal policy and value
    V_T ≡ K
    for t = T−1 down to 0 do
        for x ∈ X_t do
            V_t(x) = +∞
            for u ∈ U_t(x) do
                v_u = E[ L_t(x, u, ξ_{t+1}) + V_{t+1}(f_t(x, u, ξ_{t+1})) ]
                if v_u < V_t(x) then
                    V_t(x) = v_u
                    π_t(x) = u

Algorithm 1: Dynamic Programming Algorithm (discrete case)

Number of flops: O(T × |X_t| × |U_t| × |Ξ_t|).
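Algorithm 1 translates almost line by line into Python. The problem data below (states, controls, noise distribution, costs) are illustrative assumptions; the triple loop and the expectation over the discrete noise follow the pseudocode:

```python
# Algorithm 1 (discrete stochastic DP), with illustrative problem data:
# states {0..3}, controls {0, 1}, noise xi in {0, 1} with equal probability.

T = 3
STATES = range(4)
NOISE = [(0, 0.5), (1, 0.5)]            # (value, probability) pairs

def f(t, x, u, xi):                     # dynamics, clipped to the state grid
    return max(0, min(3, x + u - xi))

def L(t, x, u, xi):                     # stage cost (illustrative)
    return u + 0.2 * abs(x - 2)

def K(x):                               # final cost (illustrative)
    return float(x)

def U(t, x):                            # admissible controls
    return (0, 1)

V = {T: {x: K(x) for x in STATES}}      # V_T = K
pi = {}
for t in reversed(range(T)):            # for t: T-1 -> 0
    V[t], pi[t] = {}, {}
    for x in STATES:                    # for x in X_t
        V[t][x] = float("inf")
        for u in U(t, x):               # for u in U_t(x)
            # v_u = E[ L_t(x, u, xi) + V_{t+1}(f_t(x, u, xi)) ]
            v_u = sum(p * (L(t, x, u, xi) + V[t + 1][f(t, x, u, xi)])
                      for xi, p in NOISE)
            if v_u < V[t][x]:
                V[t][x], pi[t][x] = v_u, u
```

The cost is visible in the code: one pass over T × |X_t| × |U_t| × |Ξ_t| combinations, matching the flop count above.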

Page 29:

3 curses of dimensionality

1 State. If we consider 3 independent states, each taking 10 values, then |X_t| = 10^3 = 1000. In practice, DP is not applicable for states of dimension more than 5.

2 Decision. The decisions are often vector decisions, that is, a number of independent decisions, hence leading to a huge |U_t(x)|.

3 Expectation. In practice, random information comes from large data sets. Without proper statistical treatment, computing an expectation is costly. Monte-Carlo approaches are costly too, and imprecise.

Page 30:

Numerical considerations

The DP equation holds in (almost) any case. The algorithm shown before computes offline a look-up table of controls, one for every possible state. This is impossible to do if the state is (partly) continuous.

Alternatively, we can focus on computing offline an approximation of the value function V_t, and derive the optimal control online by solving a one-step problem at the current state only:

    π_t(x) ∈ arg min_{u ∈ U_t(x)}  E[ L_t(x, u, ξ_{t+1}) + V_{t+1}(f_t(x, u, ξ_{t+1})) ]

The field of Approximate DP gives methods for computing these approximate value functions (decomposed on a basis of functions). The simplest one consists in discretizing the state, and then interpolating the value function.
