
STOCHASTIC OPTIMIZATION METHODS FOR THE SIMULTANEOUS CONTROL OF PARAMETER-DEPENDENT SYSTEMS

Umberto Biccari
Fundación Deusto and Universidad de Deusto, Bilbao, Spain
umberto.biccari@deusto.es
cmc.deusto.es/umberto-biccari

Joint work with:
Ana Navarro (Universitat de València)
Enrique Zuazua (FAU, Fundación Deusto and Universidad Autónoma de Madrid)

June 12, 2020

INTRODUCTION

Keywords

Key concepts of the presentation:

• parameter-dependent models

• simultaneous controllability

• stochastic optimization

Parameter-dependent models

Parameter-dependent models appear in many real-life applications, to describe physical phenomena which may have different realizations:

$$\begin{cases} x_\nu'(t) = A_\nu x_\nu(t) + B u(t), & 0 < t < T, \\ x_\nu(0) = x^0, \end{cases} \qquad \nu \in \mathcal{K}$$

Example 1: linearized cart-inverted pendulum system

$$\begin{pmatrix} x_\nu \\ v_\nu \\ \theta_\nu \\ \omega_\nu \end{pmatrix}' = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & -\frac{\nu}{M} & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & \frac{\nu+M}{M\ell} & 0 & 0 \end{pmatrix}\begin{pmatrix} x_\nu \\ v_\nu \\ \theta_\nu \\ \omega_\nu \end{pmatrix} + \begin{pmatrix} 0 \\ 1 \\ 0 \\ -1 \end{pmatrix} u.$$


Example 2: system of thermoelasticity (λ and μ are the Lamé coefficients)

$$\begin{cases} w_{tt} - \mu\,\Delta w - (\lambda+\mu)\,\nabla\mathrm{div}(w) + \alpha\nabla\theta = u\,\mathbf{1}_\omega \\ \theta_t - \Delta\theta + \beta\,\mathrm{div}(w_t) = 0 \end{cases}$$

Lebeau and Zuazua, Null controllability of a system of linear thermoelasticity, 2002

Simultaneous controllability

We look for a unique parameter-independent control u such that, at time T > 0, the corresponding solution x_ν satisfies

$$x_\nu(T) = x^T, \quad \text{for all } \nu \in \mathcal{K}$$

In the ODE setting, simultaneous controllability is equivalent to the classical controllability of the augmented system

$$\mathbf{x}' = \mathbf{A}\mathbf{x} + \mathbf{B}u$$

with $\mathbf{x} = (x_{\nu_1}, \ldots, x_{\nu_{|\mathcal{K}|}})^\top \in \mathbb{R}^{N|\mathcal{K}|}$, $\mathbf{u} = (u, \ldots, u)^\top \in L^2(0,T;\mathbb{R}^{N|\mathcal{K}|})$, and where the matrices $\mathbf{A}$ and $\mathbf{B}$ are given by

$$\mathbf{A} = \begin{pmatrix} A_{\nu_1} & & 0 \\ & \ddots & \\ 0 & & A_{\nu_{|\mathcal{K}|}} \end{pmatrix} \in \mathbb{R}^{N|\mathcal{K}| \times N|\mathcal{K}|} \quad \text{and} \quad \mathbf{B} = \begin{pmatrix} B \\ \vdots \\ B \end{pmatrix} \in \mathbb{R}^{N|\mathcal{K}| \times 1}$$

Lohéac and Zuazua, From averaged to simultaneous controllability, 2016
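The block-diagonal construction above can be illustrated in a few lines of numpy (a minimal sketch with toy 2×2 matrices, not the models of this talk):

```python
import numpy as np

# Minimal sketch of the augmented system: stack |K| realizations A_nu
# into a block-diagonal A and repeat B. Toy 2x2 matrices for illustration.
def augmented_system(A_list, B):
    N = A_list[0].shape[0]
    K = len(A_list)
    A_aug = np.zeros((N * K, N * K))
    for i, A_nu in enumerate(A_list):
        A_aug[i * N:(i + 1) * N, i * N:(i + 1) * N] = A_nu
    B_aug = np.vstack([B] * K)   # B repeated |K| times, in R^{N|K| x 1}
    return A_aug, B_aug

nus = [0.1, 0.5, 1.0]
A_list = [np.array([[0.0, 1.0], [-nu, 0.0]]) for nu in nus]
B = np.array([[0.0], [1.0]])
A_aug, B_aug = augmented_system(A_list, B)
print(A_aug.shape, B_aug.shape)  # (6, 6) (6, 1)
```

Classical controllability tests (e.g. the Kalman rank condition) can then be applied directly to the pair (A_aug, B_aug).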


Computation of simultaneous controls

$$u = \operatorname*{arg\,min}_{u \in L^2(0,T;\mathbb{R}^M)} F_\nu(u)$$

$$F_\nu(u) := \frac{1}{2}\,\mathbb{E}\left[\left\|x_\nu(T) - x^T\right\|^2_{\mathbb{R}^N}\right] + \frac{\beta}{2}\,\|u\|^2_{L^2(0,T;\mathbb{R}^M)}$$

which, for a finite parameter set, becomes

$$F_\nu(u) := \frac{1}{|\mathcal{K}|}\sum_{\nu_k \in \mathcal{K}} f_{\nu_k} + \frac{\beta}{2}\,\|u\|^2_{L^2(0,T;\mathbb{R}^M)}$$

Typical approaches:

• Gradient Descent (GD): $u^{k+1} = u^k - \eta_k \nabla F_\nu(u^k)$

• Conjugate Gradient (CG)

Nocedal and Wright, Numerical optimization, 1999

Ciarlet, Introduction à l'analyse numérique matricielle et à l'optimisation, 1988

Both approaches have a high computational cost when dealing with large parameter sets.


Stochastic optimization

STOCHASTIC GRADIENT DESCENT (SGD)
This is a simplification of the classical GD in which, instead of computing ∇F_ν for all parameters ν ∈ K, at each iteration the gradient is estimated on the basis of a single randomly picked configuration:

$$u^{k+1} = u^k - \eta_k \nabla f_{\nu_k}(u^k)$$

Robbins and Monro, A stochastic approximation method, 1951

CONTINUOUS STOCHASTIC GRADIENT (CSG)
This is a variant of SGD, based on the idea of reusing previously obtained information to improve the efficiency of the algorithm:

$$u^{k+1} = u^k - \eta_k G^k, \qquad G^k = \sum_{\ell=1}^{k} \alpha_\ell \nabla f_{\nu_\ell}(u^\ell)$$

Pflug, Bernhardt, Grieshammer and Stingl, A new stochastic gradient method for the efficient solution of structural optimization problems with infinitely many state problems, 2020

OPTIMIZATION ALGORITHMS

Gradient Descent

$$u^{k+1} = u^k - \eta_k \nabla F_\nu(u^k)$$

Convergence

Since $F_\nu$ is convex, if we take $\eta_k$ constant and small enough, we have

$$\left\|u^k - u\right\|^2_{\mathbb{R}^N} \le \left\|u^0 - u\right\|^2_{\mathbb{R}^N}\, e^{-2C_{GD}k}, \qquad C_{GD} = \ln\left(\frac{\rho+1}{\rho-1}\right)$$

$$\left\|u^k - u\right\|^2_{\mathbb{R}^N} < \varepsilon \;\rightarrow\; k = \mathcal{O}\left(\frac{\ln(\varepsilon^{-1})}{C_{GD}}\right) \;\rightarrow\; \mathrm{cost}_{GD} = \mathcal{O}\left(\frac{|\mathcal{K}|\,\ln(\varepsilon^{-1})}{C_{GD}}\right)$$

In practice, the gradient step is computed by solving, for each ν, the forward dynamics and the backward adjoint system:

$$u^{k+1} = u^k - \eta_k\left(\beta u^k - \frac{1}{|\mathcal{K}|}\sum_{\nu \in \mathcal{K}} B^\top p^k_\nu\right)$$

$$\begin{cases} x_\nu'(t) = A_\nu x_\nu(t) + Bu, & 0 < t < T \\ p_\nu'(t) = -A_\nu^\top p_\nu(t), & 0 < t < T \\ x_\nu(0) = x^0, \quad p_\nu(T) = -\left(x_\nu(T) - x^T\right) \end{cases}$$

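This forward/adjoint gradient computation can be sketched in discrete time for a single ν (explicit Euler; the matrices, step size, and cost scaling below are illustrative assumptions, not data from the talk):

```python
import numpy as np

# Assumed discretization (explicit Euler): gradient of
#   f_nu(u) = 1/2 ||x_J - xT||^2 + (beta/2) * h * sum_j u_j^2
# via one forward solve of the state and one backward solve of the adjoint,
# giving grad_j = h * (beta * u_j - B^T p_{j+1}), as on the slide.
def gradient_via_adjoint(A, B, x0, xT, u, h, beta):
    J = len(u)
    xs = [x0]
    for j in range(J):                       # forward sweep
        xs.append(xs[-1] + h * (A @ xs[-1] + B * u[j]))
    p = -(xs[-1] - xT)                       # terminal adjoint condition
    grads = np.empty(J)
    for j in reversed(range(J)):             # backward (adjoint) sweep
        grads[j] = h * (beta * u[j] - B @ p)
        p = p + h * (A.T @ p)
    return grads

# Illustrative data (NOT the cart-pendulum system)
A = np.array([[0.0, 1.0], [-0.3, 0.0]])
B = np.array([0.0, 1.0])
x0 = np.array([1.0, 0.0]); xT = np.zeros(2)
h, beta = 0.01, 1e-2
u = np.linspace(0.0, 1.0, 100)
g = gradient_via_adjoint(A, B, x0, xT, u, h, beta)

# Sanity check: the adjoint gradient matches a finite-difference quotient
def cost(u):
    x = x0.copy()
    for j in range(len(u)):
        x = x + h * (A @ x + B * u[j])
    return 0.5 * np.sum((x - xT) ** 2) + 0.5 * beta * h * np.sum(u ** 2)

eps = 1e-6
e = np.zeros_like(u); e[40] = 1.0
fd = (cost(u + eps * e) - cost(u - eps * e)) / (2 * eps)
print(abs(fd - g[40]))  # close to round-off level
```

The same two sweeps, repeated for every ν ∈ K and averaged, give the full gradient used in the GD update; this is why each GD iteration costs |K| resolutions of the dynamics.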

GD - practical considerations

The expected exponential convergence of GD may be violated in practice.

The convergence rate is given in terms of the constant $C_{GD}(\rho)$, which is positive, decreasing, and converges to zero as $\rho \to +\infty$.

Bad conditioning of the minimization problem therefore affects the actual convergence of GD.

Example

$$\min_{x \in \mathbb{R}^3}\left(\frac{1}{2}\,x^\top Q_\tau x - b^\top x\right), \qquad Q_\tau = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \tau & 0 \\ 0 & 0 & \tau^2 \end{pmatrix}, \quad b = -\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}, \qquad \rho = \frac{\lambda_{\max}}{\lambda_{\min}} = \tau^2$$

τ      iterations      ρ
2      27              4
5      161             25
10     633             100
20     2511            400
50     15619           2500

Meza, Steepest descent, 2010

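This growth can be reproduced in a few lines (a sketch with an assumed fixed step η = 1/τ² and stopping tolerance, so the counts will not match the table exactly, but the growth with ρ = τ² is the same):

```python
import numpy as np

# Gradient descent on (1/2) x^T Q x - b^T x with Q = diag(1, tau, tau^2):
# the iteration count to a fixed tolerance grows with the condition
# number rho = tau^2. Step size and tolerance are illustrative choices.
def gd_iterations(tau, tol=1e-4, max_iter=100_000):
    Q = np.diag([1.0, tau, tau ** 2])
    b = -np.ones(3)
    x_star = np.linalg.solve(Q, b)   # exact minimizer
    eta = 1.0 / tau ** 2             # safe fixed step: 1 / lambda_max
    x = np.zeros(3)
    for k in range(1, max_iter + 1):
        x = x - eta * (Q @ x - b)    # gradient step
        if np.linalg.norm(x - x_star) < tol:
            return k
    return max_iter

for tau in (2, 5, 10, 20):
    print(tau, gd_iterations(tau))   # iterations grow roughly like rho = tau^2
```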

Conjugate Gradient

$$\nabla F_\nu(u) = \beta u - \frac{1}{|\mathcal{K}|}\sum_{\nu \in \mathcal{K}} B^\top p_\nu$$

Convergence

$$\left\|u^k - u\right\|^2_{\mathbb{R}^N} \le 4\left\|u^0 - u\right\|^2_{\mathbb{R}^N}\, e^{-2C_{CG}k}, \qquad C_{CG} = \ln\left(\frac{\sqrt{\rho}+1}{\sqrt{\rho}-1}\right)$$

$$\left\|u^k - u\right\|^2_{\mathbb{R}^N} < \varepsilon \;\rightarrow\; k = \mathcal{O}\left(\frac{\ln(\varepsilon^{-1})}{C_{CG}}\right) \;\rightarrow\; \mathrm{cost}_{CG} = \mathcal{O}\left(\frac{|\mathcal{K}|\,\ln(\varepsilon^{-1})}{C_{CG}}\right)$$

The gradient can be rewritten as an affine map, so that computing the control amounts to solving a linear system:

$$\nabla F_\nu(u) = \underbrace{\left(\beta I + \mathbb{E}\left[L^*_{T,\nu} L_{T,\nu}\right]\right)}_{A}\, u + \underbrace{\mathbb{E}\left[L^*_{T,\nu}\left(y_\nu(T) - x^T\right)\right]}_{-b} \;\rightarrow\; Au = b$$

$$L_{T,\nu}: L^2(0,T;\mathbb{R}^M) \to \mathbb{R}^N, \quad u \mapsto z_\nu(T) \qquad\qquad L^*_{T,\nu}: \mathbb{R}^N \to L^2(0,T;\mathbb{R}^M), \quad p_{T,\nu} \mapsto B^\top p_\nu$$

$$\begin{cases} y_\nu'(t) = A_\nu y_\nu(t), & 0 < t < T \\ y_\nu(0) = x^0 \end{cases} \qquad\qquad \begin{cases} z_\nu'(t) = A_\nu z_\nu(t) + Bu(t), & 0 < t < T \\ z_\nu(0) = 0 \end{cases}$$


CG - practical considerations

The expected exponential convergence of CG may be violated in practical experiments, although the situation is less critical than for GD.

• The constant $C_{CG}(\rho)$ depends on the square root of $\rho$, hence CG is less sensitive to the conditioning of the problem.

• CG enjoys the finite termination property: if we apply CG to solve an N-dimensional problem, the algorithm will converge in at most N iterations.
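The finite termination property is easy to check numerically (a textbook CG sketch on a random SPD system, not the control problem itself):

```python
import numpy as np

# Sketch: plain conjugate gradient on an N-dimensional SPD system.
# In exact arithmetic CG terminates in at most N iterations; in floating
# point the residual after N steps is at round-off level.
def conjugate_gradient(A, b, n_iter):
    x = np.zeros_like(b)
    r = b - A @ x            # initial residual
    p = r.copy()             # initial search direction
    for _ in range(n_iter):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < 1e-14:
            return x         # already (numerically) exact
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return x

rng = np.random.default_rng(0)
N = 8
M = rng.standard_normal((N, N))
A = M @ M.T + N * np.eye(N)          # well-conditioned SPD matrix
b = rng.standard_normal(N)
x = conjugate_gradient(A, b, N)      # at most N = 8 iterations
print(np.linalg.norm(A @ x - b))     # round-off-level residual
```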

Stochastic Gradient Descent

$$u^{k+1} = u^k - \eta_k \nabla f_{\nu_k}(u^k), \qquad \nu_k \ \text{i.i.d. from } \mathcal{K}$$

Applying SGD to minimize $F_\nu(u)$ requires, at each iteration k, only one resolution of the dynamics.

In our control setting, the explicit SGD update reads

$$u^{k+1} = u^k - \eta_k\left(\beta u^k - B^\top p^k_{\nu_k}\right)$$

$$\begin{cases} x_{\nu_k}'(t) = A_{\nu_k} x_{\nu_k}(t) + Bu, & 0 < t < T \\ p_{\nu_k}'(t) = -A_{\nu_k}^\top p_{\nu_k}(t), & 0 < t < T \\ x_{\nu_k}(0) = x^0, \quad p_{\nu_k}(T) = -\left(x_{\nu_k}(T) - x^T\right) \end{cases}$$

Stochastic Gradient Descent - convergence

In SGD the iterate sequence $(u^k)_{k \ge 1}$ is a stochastic process determined by the random sequence $(\nu_k)_{k \ge 1} \subset \mathcal{K}$. Hence, the convergence properties are defined in expectation, $\mathbb{E}\left[\|u^{k+1} - u\|^2_{\mathbb{R}^N}\right]$, or in the context of almost sure convergence.

Bach and Moulines, Non-asymptotic analysis of stochastic approximation algorithms for ma-chine learning, 2011

Bottou, Online learning and stochastic approximations, 1998

In SGD, convergence is guaranteed if the step sizes are chosen such that $\mathbb{E}\left[\|\nabla F_\nu(u^k)\|^2\right]$ is bounded above by a deterministic quantity. In particular, a fixed step size $\eta_k = \eta$, even if small, does not guarantee convergence. A standard approach is to use a decreasing sequence such that

$$\sum_{k=1}^{\infty} \eta_k = +\infty \qquad \text{and} \qquad \sum_{k=1}^{\infty} \eta_k^2 < +\infty$$

Robbins and Monro, A stochastic approximation method, 1951

Bottou, Curtis and Nocedal, Optimization methods for large-scale machine learning, 2018
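The step-size conditions can be illustrated on a toy scalar problem (a finite-sum quadratic, not the control functional; η_k = 1/k satisfies both conditions):

```python
import numpy as np

# Toy sketch: SGD with Robbins-Monro step sizes eta_k = 1/k on the
# finite-sum quadratic F(u) = (1/|K|) * sum_k (1/2)(u - c_k)^2,
# whose minimizer is the mean of the c_k. Not the control problem itself.
rng = np.random.default_rng(1)
c = rng.standard_normal(200)   # one sampled "realization" per entry
u_star = c.mean()              # exact minimizer

u = 0.0
for k in range(1, 20001):
    i = rng.integers(len(c))   # pick one realization at random
    grad = u - c[i]            # gradient of the sampled term f_i
    u -= grad / k              # eta_k = 1/k: sum diverges, sum of squares converges
print(abs(u - u_star))         # small: SGD has drifted to the minimizer
```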


Stochastic Gradient Descent - convergence

If $\eta_k$ is properly chosen, by means of standard martingale techniques we can show that SGD converges almost surely:

$$u^k \xrightarrow{\,a.s.\,} u, \quad \text{as } k \to +\infty$$

Convergence rate

Because of the noise introduced by the random selection of the descent direction, the convergence of SGD is only sublinear:

$$\mathbb{E}\left[\left\|u^k - u\right\|^2_{\mathbb{R}^N}\right] = \mathcal{O}\left(k^{-1}\right)$$

$$\mathbb{E}\left[\left\|u^k - u\right\|^2_{\mathbb{R}^N}\right] < \varepsilon \;\rightarrow\; k = \mathcal{O}\left(\varepsilon^{-1}\right) \;\rightarrow\; \mathrm{cost}_{SGD} = \mathcal{O}\left(\varepsilon^{-1}\right)$$


Continuous Stochastic Gradient

$$u^{k+1} = u^k - \eta_k G^k, \qquad G^k = \sum_{\ell=1}^{k} \alpha_\ell \nabla f_{\nu_\ell}(u^\ell)$$

CONVERGENCE PROPERTIES
As the optimization process evolves, the approximated gradient $G^k$ converges almost surely to the full gradient of the objective functional:

$$G^k \xrightarrow{\,a.s.\,} \nabla F_\nu, \quad \text{as } k \to +\infty$$

Hence, CSG is a less noisy algorithm and has a better convergence behavior. In particular, convergence may be guaranteed also when choosing a fixed learning-rate sequence $\eta_k = \eta$.

Pflug, Bernhardt, Grieshammer and Stingl, A new stochastic gradient method for the efficient solution of structural optimization problems with infinitely many state problems, 2020

In our control setting, the explicit CSG update reads

$$u^{k+1} = u^k - \eta_k \sum_{\ell=1}^{k} \alpha_\ell \left(\beta u^\ell - B^\top p^\ell_{\nu_\ell}\right)$$
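The almost sure convergence of the accumulated gradient can be seen in a toy setting (uniform weights α_ℓ = 1/k at a frozen control u, which is a simplification; the actual CSG weights in Pflug et al. come from integration rules over the parameter set):

```python
import numpy as np

# Toy sketch of the CSG gradient-accumulation property: at a frozen control
# u, the running average of sampled gradients converges to the full gradient
# (here for f_i(u) = (1/2)(u - c_i)^2, so grad f_i = u - c_i). Uniform
# weights alpha_l = 1/k are an illustrative simplification.
rng = np.random.default_rng(2)
c = rng.standard_normal(500)
u = 0.7
full_grad = u - c.mean()          # gradient of the full objective at u

G = 0.0
for k in range(1, 10001):
    i = rng.integers(len(c))
    G += (u - c[i] - G) / k       # running average of all sampled gradients
print(abs(G - full_grad))         # -> 0 as k grows (law of large numbers)
```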

NUMERICAL SIMULATIONS

Numerical simulations

Linearized cart-inverted pendulum system:

$$\begin{pmatrix} x_\nu \\ v_\nu \\ \theta_\nu \\ \omega_\nu \end{pmatrix}' = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & -\frac{\nu}{M} & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & \frac{\nu+M}{M\ell} & 0 & 0 \end{pmatrix}\begin{pmatrix} x_\nu \\ v_\nu \\ \theta_\nu \\ \omega_\nu \end{pmatrix} + \begin{pmatrix} 0 \\ 1 \\ 0 \\ -1 \end{pmatrix} u$$

• The system includes a cart of mass M and a rigid pendulum of length ℓ.

• The pendulum is anchored to the cart, and a variable mass, described by the parameter ν, is placed at its free extremity.

• The cart moves on a horizontal plane. The states x_ν(t) and v_ν(t) describe its position and velocity, respectively.

• During the motion of the cart, the pendulum deviates from the initial vertical position by an angle θ_ν(t), with angular velocity ω_ν(t).

• Starting from an initial state (x^i, v^i, 0, 0), we want to compute a parameter-independent control function u steering all the realizations of the system in time T to the final state (x^f, 0, 0, 0).


Numerical simulations

Input data

• x^0 = (−1, 1, 0, 0)^⊤

• x^T = (0, 0, 0, 0)^⊤

• T = 1 s

• ε = 10^−4

• M = 10

• ℓ = 1

• ν ∈ K = {ν_1, …, ν_|K|} with ν_1 = 0.1 and ν_|K| = 1


Numerical simulations

|K|     GD Iter.   GD Time     CG Iter.   CG Time    SGD Iter.   SGD Time   CSG Iter.   CSG Time
2       1868       45.1s       12         1.1s       2195        33.1s      930         18.6s
10      1869       150.1s      13         2.6s       2106        31.4s      923         17.4s
100     1870       1799.5s     12         17.7s      2102        28.9s      929         17.4s
250     n/a        n/a         13         50.3s      2080        28.2s      928         17.9s
500     n/a        n/a         13         101.3s     2099        32.9s      927         21.5s

Numerical simulations

CSG outperforms SGD in terms of the number of iterations it requires to converge and, consequently, in total computational time. This is because its optimization process is less noisy than that of SGD, yielding a better convergence behavior.

Conclusions

We compared the GD, CG, SGD and CSG algorithms for the minimization of a quadratic functional associated with the simultaneous controllability of linear parameter-dependent models.

We observed the following:

1. The GD approach is the worst one in terms of computational complexity, as a consequence of the bad conditioning of the simultaneous controllability problem.

2. The choice of SGD and CSG instead of CG is preferable only when dealing with parameter sets of large cardinality |K|.

Open problems

SIMULTANEOUS CONTROLLABILITY OF PDE MODELS

• In the PDE setting, simultaneous controllability is a quite delicate issue because of the appearance of peculiar phenomena which are not detected at the ODE level.

• For some PDE systems, simultaneous controllability may be understood by looking at the spectral properties of the model. Roughly speaking, one needs all the eigenvalues to have multiplicity one in order to be able to observe every eigenmode independently. This fact generally yields restrictions to the validity of simultaneous controllability, which may be difficult to tackle at the numerical level.

Dáger and Zuazua, Controllability of star-shaped networks of strings, 2001


Open problems

COMPARISON WITH THE GREEDY METHODOLOGY

• The greedy approach aims to approximate the dynamics and controls of linear parameter-dependent systems by identifying the most meaningful realizations of the parameters.

Lazar and Zuazua, Greedy controllability of finite dimensional linear systems, 2016

Hernández-Santamaría, Lazar and Zuazua, Greedy controllability of finite dimensional linear systems, 2019

• A comparison of the greedy and the stochastic approaches would be an interesting issue.

THANK YOU FOR YOUR ATTENTION!

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No 694126-DYCON).