Transcript of UTA talks/online adapt optimal 01 10.pdf – "Adaptive Control is generally not Optimal…"
Online Learning of Optimal Control and Zero-Sum Game Solutions

F.L. Lewis, K. Vamvoudakis, and D. Vrabie
Moncrief-O'Donnell Endowed Chair; Head, Controls & Sensors Group
Automation & Robotics Research Institute (ARRI), The University of Texas at Arlington
Supported by: NSF (Paul Werbos), ARO (Randy Zachery)

Talk available online at http://ARRI.uta.edu/acs
Thank you: Fuli Wang, Changyun Wen, Jiongtian Liu, Guang-Hong Yang.
Organized and invited by Jianliang Wang.

Online Gaming
F.L. Lewis, Moncrief-O'Donnell Endowed Chair; Head, Controls & Sensors Group
Automation & Robotics Research Institute (ARRI), The University of Texas at Arlington
It is man's obligation to explore the most difficult questions in the clearest possible way and use reason and intellect to arrive at the best answer.
Man's task is to understand patterns in nature and society.
The first task is to understand the individual problem, then to analyze symptoms and causes, and only then to design treatment and controls.
— Ibn Sina (Avicenna), 1002-1021
Adaptive Control: online, in real time; no dynamics knowledge needed.
- Adaptive control generally minimizes a squared (tracking) error.
- Inverse optimal adaptive control minimizes a cost, but not one of our choosing.
- Indirect optimal adaptive control identifies A and B and then solves the Riccati equation.
Adaptive control is generally not optimal.

Optimal Control is off-line, and needs to know the system dynamics to solve the design equations; e.g., the Riccati equation needs A and B.

We want ONLINE DIRECT ADAPTIVE OPTIMAL control, for any performance cost of our own choosing.
[Diagram: Adaptive control structures acting on the plant (control input, measured output). Indirect adaptive control identifies the system model; direct adaptive control identifies the controller. How to do Optimal Adaptive control?]
IEEE Circuits & Systems Magazine (Editor: Ron Chen).
Synthesis of:
- computational intelligence
- control systems
- neurobiology

Different methods of learning
Machine learning: the formal study of learning systems.

Xi-Ren Cao, Stochastic Learning and Optimization: A Sensitivity-Based Approach
Different methods of learning

Reinforcement learning: Ivan Pavlov, 1890s.

Actor-Critic Learning: every living organism interacts with its environment and uses those interactions to improve its own actions in order to survive and increase.

We want OPTIMAL performance: ADP, Approximate Dynamic Programming.

[Diagram: Actor-critic structure. A critic compares the system outputs with the desired performance and generates a reinforcement signal that tunes the actor; the actor applies control inputs to the system/environment; the adaptive learning system closes the loop.]

ADP – Paul Werbos
1. Optimal Control: Linear Quadratic Regulator

System: $\dot x = Ax + Bu$

Cost: $V(x(t)) = \int_t^\infty \big(x^TQx + u^TRu\big)\,d\tau = x^T(t)Px(t)$

Differential equivalent:
$0 = H\big(x,u,\tfrac{\partial V}{\partial x}\big) = x^TQx + u^TRu + \frac{\partial V^T}{\partial x}\dot x = x^TQx + u^TRu + 2x^TP(Ax+Bu)$

Given any stabilizing FB policy $u = -Kx$, the cost value is found by solving the Lyapunov equation
$0 = (A-BK)^TP + P(A-BK) + Q + K^TRK$

The optimal control is
$u(t) = -R^{-1}B^TPx(t) = -Lx(t)$
where P solves the algebraic Riccati equation
$0 = A^TP + PA + Q - PBR^{-1}B^TP$

The full system dynamics must be known; this is an off-line solution.
Kleinman Algorithm to Solve the ARE

Start with a stabilizing feedback $K_0$.

1. For the given control policy $u = -K_kx$, solve for the cost (a Lyapunov equation):
$0 = A_k^TP_k + P_kA_k + Q + K_k^TRK_k, \qquad A_k = A - BK_k$

2. Improve the policy:
$K_{k+1} = R^{-1}B^TP_k$

Kleinman 1968:
- $P_k$ monotonically converges to the unique positive-definite solution of the Riccati equation.
- Every iteration step returns a stabilizing controller.

The system has to be known: this is an OFF-LINE DESIGN, and a LYAPUNOV EQUATION MUST BE SOLVED AT EACH STEP.
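The two steps above can be sketched numerically. A minimal off-line implementation using SciPy's Lyapunov solver (the plant matrices here are illustrative, not from the talk):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

def kleinman(A, B, Q, R, K0, iters=20):
    """Kleinman (1968) policy iteration for the continuous-time ARE.
    K0 must stabilize A - B @ K0."""
    K = K0
    for _ in range(iters):
        Ak = A - B @ K                                   # closed-loop matrix A_k
        # Step 1 (policy evaluation): Ak^T P + P Ak + Q + K^T R K = 0
        P = solve_continuous_lyapunov(Ak.T, -(Q + K.T @ R @ K))
        # Step 2 (policy improvement): K_{k+1} = R^{-1} B^T P_k
        K = np.linalg.solve(R, B.T @ P)
    return P, K

# Illustrative stable plant, so K0 = 0 is already stabilizing
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)

P, K = kleinman(A, B, Q, R, K0=np.zeros((1, 2)))
print(np.allclose(P, solve_continuous_are(A, B, Q, R)))  # True
```

Each pass solves one Lyapunov equation exactly, so the iterates converge to the ARE solution in a handful of steps, exactly as the off-line design described above.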
2. Optimal Control for Nonlinear Systems

System: $\dot x = f(x,u) = f(x) + g(x)u$

Cost: $V(x(t)) = \int_t^\infty r(x,u)\,dt = \int_t^\infty\big(Q(x) + u^TRu\big)\,dt$

Differential equivalent:
$0 = r(x,u) + \frac{\partial V^T}{\partial x}f(x,u) = H\big(x,u,\tfrac{\partial V}{\partial x}\big), \qquad V(0)=0$

Given any stabilizing policy $u = \mu(x)$, the cost value is found by solving the nonlinear Lyapunov equation
$0 = Q(x) + \mu^T(x)R\mu(x) + \frac{\partial V^T}{\partial x}\big(f(x)+g(x)\mu(x)\big) = H\big(x,\mu,\tfrac{\partial V}{\partial x}\big), \qquad V(0)=0$
The optimal control is found by solving
$0 = \min_{u(t)}H\big(x,u,\tfrac{\partial V^*}{\partial x}\big) = \min_{u(t)}\Big(Q(x) + u^TRu + \frac{\partial V^{*T}}{\partial x}f(x,u)\Big)$

Optimal control policy:
$\mu^*(x(t)) = \arg\min_{u(t)}H\big(x,u,\tfrac{\partial V^*}{\partial x}\big) = -\tfrac{1}{2}R^{-1}g^T(x)\frac{\partial V^*}{\partial x}$

HJB equation (a nonlinear Riccati equation), with $V^*(0)=0$:
$0 = Q(x) + \frac{dV^{*T}}{dx}f(x) - \tfrac{1}{4}\frac{dV^{*T}}{dx}\,g(x)R^{-1}g^T(x)\,\frac{dV^*}{dx}$
Policy Iteration Algorithm to Solve the HJB

Start with a stabilizing initial policy $\mu_0(x)$.

1. For the given control policy $\mu_j(x)$, solve for the cost $V_j(x(t))$ (a nonlinear Lyapunov equation), with $V_j(0)=0$:
$0 = H\big(x,\mu_j,\tfrac{\partial V_j}{\partial x}\big) = Q(x) + \mu_j^T(x)R\mu_j(x) + \frac{\partial V_j^T}{\partial x}\big(f(x)+g(x)\mu_j(x)\big)$

2. Improve the policy:
$\mu_{j+1} = \arg\min_u H\big(x,u,\tfrac{\partial V_j}{\partial x}\big) = -\tfrac{1}{2}R^{-1}g^T(x)\frac{\partial V_j}{\partial x}$

- Convergence proved by Saridis (1979) if the nonlinear Lyapunov equation is solved exactly.
- Beard & Saridis used Galerkin integrals to solve the Lyapunov equation.
- Abu-Khalaf & Lewis used NNs to approximate V for nonlinear systems and proved convergence.

Off-line solution: a nonlinear Lyapunov equation must be solved at each step.
3. Online Solution of Optimal Control for Nonlinear Systems (Kyriakos Vamvoudakis)

Optimal Adaptive Control

Need to solve online:

the nonlinear Lyapunov equation for the value,
$0 = Q(x) + u^TRu + \frac{\partial V^T}{\partial x}\big(f(x)+g(x)u\big) = H\big(x,\mu,\tfrac{\partial V}{\partial x}\big)$

and the control update,
$\mu(x) = -\tfrac{1}{2}R^{-1}g^T(x)\frac{\partial V}{\partial x}$

Solve by parameterizing the value V(x): this converts the PDE into an algebraic equation.
i. Value Function Approximation (VFA)

In the LQR case $V(x) = x^TPx = \operatorname{vec}(P)^T(x\otimes x) = W^T\bar\phi(x)$: use a quadratic basis set.

In the nonlinear case one must approximate both $V$ and $\partial V/\partial x$: approximate in the Sobolev norm (a higher-order Weierstrass approximation theorem).

There exists a basis set $\{\varphi_i(x)\}$ such that
$V(x) = \sum_{i=1}^\infty c_i\varphi_i(x) = \sum_{i=1}^N c_i\varphi_i(x) + \sum_{i=N+1}^\infty c_i\varphi_i(x) \equiv C_1^T\phi_1(x) + \epsilon(x)$
$\frac{\partial V}{\partial x} = \sum_{i=1}^N c_i\frac{\partial\varphi_i}{\partial x} + \sum_{i=N+1}^\infty c_i\frac{\partial\varphi_i}{\partial x}$

A complete independent basis set is needed. Retain N terms; the last term (the approximation error) goes to zero uniformly as $N\to\infty$.

Take the VFA as $V(x) = W_1^T\phi_1(x) + \epsilon(x)$.

Then the nonlinear Lyapunov equation
$0 = \frac{\partial V^T}{\partial x}f(x,u) + r(x,u) = H\big(x,u,\tfrac{\partial V}{\partial x}\big)$
becomes
$H(x,W_1,u) = W_1^T\nabla\phi_1(f+gu) + Q(x) + u^TRu = \epsilon_H$
where $W_1$ is the LS solution to this equation for the given N; it is unknown.

Theorem (Abu-Khalaf). As $N\to\infty$:
a. $\sup_x|\epsilon_H|\to 0$
b. $\|W_1 - C_1\|\to 0$
c. $\sup_x|V_1 - V|\to 0$
d. $\sup_x\|\partial V_1/\partial x - \partial V/\partial x\|\to 0$
CRITIC NN for VFA (Paul Werbos): $\hat V(x) = \hat W_1^T\phi_1(x)$

Find the NN weights that solve the algebraic (Lyapunov) equation:
$H(x,\hat W_1,u) = \hat W_1^T\nabla\phi_1(f+gu) + Q(x) + u^TRu = e_1$

Parameter Identification of the Value Function

Theorem (Kyriakos Vamvoudakis) – Online tuning of the critic NN weights.
Let $u(t)$ be bounded and stabilizing, and let $\sigma_1 = \nabla\phi_1(f+gu)$ be persistently exciting (PE). Tune the critic NN weights with the normalized-gradient (modified Levenberg-Marquardt) law
$\dot{\hat W}_1 = -a_1\,\frac{\sigma_1}{(\sigma_1^T\sigma_1+1)^2}\,\big(\sigma_1^T\hat W_1 + Q(x) + u^TRu\big)$
Then the critic parameter error $\tilde W_1 = W_1 - \hat W_1$ converges exponentially, with a decay factor determined by the PE parameters, to a residual set whose radius is of the order of the approximation-error bound.
ii. Action NN for Control Approximation

Control policy (actor NN): $\hat u(x) = -\tfrac{1}{2}R^{-1}g^T(x)\,\nabla\phi_1^T\hat W_2$
This comes from $\mu(x) = -\tfrac{1}{2}R^{-1}g^T(x)\,\partial V/\partial x$.

Theorem (Kyriakos Vamvoudakis) – Online Learning of Nonlinear Optimal Control.
Let $\sigma_1 = \nabla\phi_1(f+gu)$ be PE. Tune the critic NN weights as above, and tune the actor NN weights as
$\dot{\hat W}_2 = -a_2\Big\{\big(F_2\hat W_2 - F_1\bar\sigma_1^T\hat W_1\big) - \tfrac{1}{4}\bar D_1(x)\,\hat W_2\,\bar m^T(x)\,\hat W_1\Big\}$
where $\bar D_1(x) = \nabla\phi_1(x)\,g(x)R^{-1}g^T(x)\,\nabla\phi_1^T(x)$ and $\bar m = \sigma_1/(\sigma_1^T\sigma_1+1)^2$.

Then there exists an $N_0$ such that, for the number of hidden-layer units $N > N_0$, the closed-loop system state, the critic NN error $\tilde W_1 = W_1 - \hat W_1$, and the actor NN error $\tilde W_2 = W_1 - \hat W_2$ are uniformly ultimately bounded (UUB).
Summary / Nota Bene

Tune the critic NN weights and the actor NN weights as above; the extra terms in the actor law are needed for stability. Note: it does NOT work to simply set
$\hat u(x) = -\tfrac{1}{2}R^{-1}g^T(x)\,\nabla\phi_1^T\hat W_1$

Proof: use the Lyapunov function
$L(t) = V(x) + \tfrac{1}{2}\operatorname{tr}\big(\tilde W_1^T a_1^{-1}\tilde W_1\big) + \tfrac{1}{2}\operatorname{tr}\big(\tilde W_2^T a_2^{-1}\tilde W_2\big)$
where:
- $V(x)$ = the unknown solution of the HJB equation $0 = \frac{dV^T}{dx}f + Q(x) - \tfrac{1}{4}\frac{dV^T}{dx}gR^{-1}g^T\frac{dV}{dx}$ (guarantees stability);
- $W_1$ = the unknown LS solution of $H(x,W_1,u) = W_1^T\nabla\phi_1(f+gu) + Q(x) + u^TRu = \epsilon_H$ for the given N;
- $\tilde W_1 = W_1 - \hat W_1$ and $\tilde W_2 = W_1 - \hat W_2$ (guarantee convergence).

ONLINE solution: does not require solution of the HJB or the nonlinear Lyapunov equation. Does require the system dynamics to be known. Finds an approximate local smooth solution to the NONLINEAR HJB equation online.

This is an optimal adaptive controller: 'indirect' because it identifies parameters for the VFA, 'direct' because the control is found directly from the value function.
[Diagram: Optimal adaptive control identifies the value, in contrast with indirect adaptive control (identify the system model) and direct adaptive control (identify the controller), all acting on the plant (control input, measured output).]
Simulation 1 – F-16 aircraft pitch-rate controller (Stevens and Lewis 2003)

$\dot x = \begin{bmatrix} -1.01887 & 0.90506 & -0.00215 \\ 0.82225 & -1.07741 & -0.17555 \\ 0 & 0 & -1 \end{bmatrix}x + \begin{bmatrix}0\\0\\1\end{bmatrix}u, \qquad x = [\alpha\ \ q\ \ \delta_e]^T$
$Q = I$, $R = I$.

Exact solution of the ARE $0 = A^TP + PA + Q - PBR^{-1}B^TP$: selecting the quadratic NN basis set for VFA with
$W = [p_{11}\ \ 2p_{12}\ \ 2p_{13}\ \ p_{22}\ \ 2p_{23}\ \ p_{33}]^T$
gives
$W^* = [1.4245\ \ 1.1682\ \ -0.1352\ \ 1.4349\ \ -0.1501\ \ 0.4329]^T$

Probing noise must be added to get PE:
$\hat u(x) = -\tfrac{1}{2}R^{-1}g^T(x)\,\nabla\phi_1^T\hat W_2 + n(t)$, with exponentially decaying $n(t)$.

The critic converges to
$\hat W_1(t_f) = [1.4279\ \ 1.1612\ \ -0.1366\ \ 1.4462\ \ -0.1480\ \ 0.4317]^T$
The actor converges to the same weights,
$\hat W_2(t_f) = [1.4279\ \ 1.1612\ \ -0.1366\ \ 1.4462\ \ -0.1480\ \ 0.4317]^T$,
giving the control
$\hat u_2(x) = -\tfrac{1}{2}R^{-1}B^T\hat Px$, with $\hat P$ reconstructed from $\hat W_2$ via the quadratic basis gradient $\nabla\phi_1(x)$.

[Plots: Critic NN parameters; system states.]
Simulation 2 – Nonlinear System (a converse-optimal design; Nevistic and Primbs 1996)

$\dot x = f(x) + g(x)u, \qquad x\in\mathbb{R}^2$
$f(x) = \begin{bmatrix} -x_1 + x_2 \\ -0.5x_1 - 0.5x_2\big(1 - (\cos(2x_1)+2)^2\big) \end{bmatrix}, \qquad g(x) = \begin{bmatrix} 0 \\ \cos(2x_1)+2 \end{bmatrix}$
$Q = I$, $R = I$.

Optimal value: $V^*(x) = \tfrac{1}{2}x_1^2 + x_2^2$
Optimal control: $u^*(x) = -\big(\cos(2x_1)+2\big)x_2$
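The stated value and control can be checked against the HJB equation numerically. A sketch, assuming the drift f(x) as reconstructed above (for which V* and u* are exact with Q = I, R = 1):

```python
import numpy as np

# Converse-optimal example: with this f, g and Q = I, R = 1,
# V*(x) = 0.5 x1^2 + x2^2 and u*(x) = -(cos(2 x1) + 2) x2 solve the HJB.
def f(x):
    c = np.cos(2 * x[0]) + 2
    return np.array([-x[0] + x[1],
                     -0.5 * x[0] - 0.5 * x[1] * (1 - c**2)])

def g(x):
    return np.array([0.0, np.cos(2 * x[0]) + 2])

dV = lambda x: np.array([x[0], 2 * x[1]])          # gradient of V*
u_star = lambda x: -(np.cos(2 * x[0]) + 2) * x[1]  # = -0.5 R^{-1} g^T dV*

def hjb_residual(x):
    # H(x, u*, dV*/dx) = x^T x + u*^T R u* + dV*/dx^T (f + g u*)
    u = u_star(x)
    return x @ x + u**2 + dV(x) @ (f(x) + g(x) * u)

rng = np.random.default_rng(0)
worst = max(abs(hjb_residual(rng.uniform(-2, 2, size=2))) for _ in range(200))
print(worst < 1e-9)  # True: the HJB residual vanishes to machine precision
```

The terms cancel identically, which is exactly how the converse-optimal construction works: f is chosen so a prescribed V* satisfies the HJB.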
Select the VFA basis set $\phi_1(x) = [x_1^2\ \ x_1x_2\ \ x_2^2]^T$.

The algorithm converges to
$\hat W_1(t_f) = [0.5017\ \ -0.0020\ \ 1.0008]^T$ and $\hat W_2(t_f) = [0.5017\ \ -0.0020\ \ 1.0008]^T$
(the true weights are $[0.5\ \ 0\ \ 1]^T$), giving
$\hat u_2(x) = -\tfrac{1}{2}R^{-1}g^T(x)\begin{bmatrix} 2x_1 & x_2 & 0 \\ 0 & x_1 & 2x_2 \end{bmatrix}\hat W_2$

[Plots: Critic NN parameters; states; optimal value function; value-function approximation error; control approximation error.]
4. Zero-Sum Games for Nonlinear Systems

System: $\dot x = f(x,u,d) = f(x) + g(x)u + k(x)d, \qquad y = h(x)$

Cost: $V(x(t),u,d) = \int_t^\infty\big(h^Th + u^TRu - \gamma^2\|d\|^2\big)\,dt = \int_t^\infty r(x,u,d)\,dt$

Differential equivalent: given any stabilizing control and disturbance policies $u(x)$, $d(x)$, the cost value is found by solving the nonlinear Lyapunov equation
$0 = r(x,u,d) + \frac{\partial V^T}{\partial x}\big(f(x)+g(x)u+k(x)d\big) = H\big(x,\tfrac{\partial V}{\partial x},u,d\big), \qquad V(0)=0$

Define the 2-player zero-sum game as
$V^*(x(0)) = \min_u\max_d J(x(0),u,d) = \min_u\max_d\int_0^\infty\big(h^T(x)h(x) + u^TRu - \gamma^2\|d\|^2\big)\,dt$

The game has a unique value (saddle-point solution) iff the Nash condition holds:
$\min_u\max_d J(x(0),u,d) = \max_d\min_u J(x(0),u,d)$
A necessary condition for this is the Isaacs condition:
$\min_u\max_d H\big(x,\tfrac{\partial V}{\partial x},u,d\big) = \max_d\min_u H\big(x,\tfrac{\partial V}{\partial x},u,d\big)$

The game saddle-point solution is found from the Hamiltonian
$H\big(x,\tfrac{\partial V}{\partial x},u,d\big) = h^Th + u^TRu - \gamma^2\|d\|^2 + \frac{\partial V^T}{\partial x}\big(f(x)+g(x)u+k(x)d\big)$

The optimal control and disturbance policies follow from the stationarity conditions $\partial H/\partial u = 0$, $\partial H/\partial d = 0$:
$u^* = -\tfrac{1}{2}R^{-1}g^T(x)\frac{\partial V^*}{\partial x}, \qquad d^* = \frac{1}{2\gamma^2}k^T(x)\frac{\partial V^*}{\partial x}$

HJI equation (a nonlinear 'Riccati' equation), with $V^*(0)=0$:
$0 = h^Th + \frac{\partial V^{*T}}{\partial x}f(x) - \tfrac{1}{4}\frac{\partial V^{*T}}{\partial x}g(x)R^{-1}g^T(x)\frac{\partial V^*}{\partial x} + \frac{1}{4\gamma^2}\frac{\partial V^{*T}}{\partial x}k(x)k^T(x)\frac{\partial V^*}{\partial x}$
Policy Iteration Algorithm to Solve the HJI

Start with a stabilizing initial control policy $u_0(x)$.

1. For the given control policy $u_j(x)$, solve for the value $V_{j+1}(x(t))$ from the HJ equation (a nonlinear 'Riccati' equation), with $V_{j+1}(0)=0$:
$0 = h^Th + u_j^TRu_j + \frac{\partial V_{j+1}^T}{\partial x}\big(f(x)+g(x)u_j(x)\big) + \frac{1}{4\gamma^2}\frac{\partial V_{j+1}^T}{\partial x}kk^T\frac{\partial V_{j+1}}{\partial x}$

2. Improve the policy:
$u_{j+1}(x) = -\tfrac{1}{2}R^{-1}g^T(x)\frac{\partial V_{j+1}}{\partial x}$

The minimal nonnegative-definite solution of the HJ equation is the available storage for $u_j(x)$.

Off-line solution: a nonlinear HJ equation must be solved at each step.
Double Policy Iteration Algorithm to Solve the HJI

Add an inner loop to solve for the available storage. Start with a stabilizing initial policy $u_0(x)$.

1. For the given control policy $u_j(x)$, solve for the value $V_{j+1}(x(t))$:
2. Set $d^0 = 0$. For $i = 0, 1, \ldots$, solve the nonlinear Lyapunov equation for $V_j^i(x(t))$,
$0 = h^Th + u_j^TRu_j - \gamma^2\|d^i\|^2 + \frac{\partial V_j^{iT}}{\partial x}\big(f + gu_j + kd^i\big)$
and update the disturbance
$d^{i+1} = \frac{1}{2\gamma^2}k^T(x)\frac{\partial V_j^i}{\partial x}$
On convergence, set $V_{j+1}(x) = V_j^\infty(x)$.
3. Improve the policy:
$u_{j+1}(x) = -\tfrac{1}{2}R^{-1}g^T(x)\frac{\partial V_{j+1}}{\partial x}$

- Convergence proved by Van der Schaft if the nonlinear Lyapunov equation can be solved exactly.
- Abu-Khalaf & Lewis used NNs to approximate V for nonlinear systems and proved convergence.

Off-line solution: a nonlinear Lyapunov equation must be solved at each step.
5. Online Solution of ZS Games for Nonlinear Systems (Kyriakos Vamvoudakis)

Optimal (Game) Adaptive Control

Need to solve online:

the nonlinear Lyapunov equation for the value,
$0 = h^Th + u^TRu - \gamma^2\|d\|^2 + \frac{\partial V^T}{\partial x}\big(f + gu + kd\big)$

the control update, $u(x) = -\tfrac{1}{2}R^{-1}g^T(x)\frac{\partial V}{\partial x}$,
and the disturbance update, $d(x) = \frac{1}{2\gamma^2}k^T(x)\frac{\partial V}{\partial x}$.

Use three neural networks:
Critic NN: $\hat V(x) = \hat W_1^T\phi_1(x)$
Control actor NN: $\hat u(x) = -\tfrac{1}{2}R^{-1}g^T(x)\,\nabla\phi_1^T\hat W_2$
Disturbance actor NN: $\hat d(x) = \frac{1}{2\gamma^2}k^T(x)\,\nabla\phi_1^T\hat W_3$

Simultaneously: (a) solve the Lyapunov equation
$H(x,\hat W_1,u,d) = \hat W_1^T\nabla\phi_1(f+gu+kd) + h^Th + u^TRu - \gamma^2\|d\|^2 = e$
and (b) update $u(x)$ and $d(x)$.
Theorem (Kyriakos Vamvoudakis) – Online Gaming.
Let $\sigma_2 = \nabla\phi_1(f + gu_2 + kd_3)$ be PE. Tune the critic NN weights as before, and tune the actor NN weights as
$\dot{\hat W}_2 = -a_2\Big\{\big(F_2\hat W_2 - F_1\bar\sigma_2^T\hat W_1\big) - \tfrac{1}{4}\bar D_1(x)\,\hat W_2\,\bar m^T(x)\,\hat W_1\Big\}$
$\dot{\hat W}_3 = -a_3\Big\{\big(F_4\hat W_3 - F_3\bar\sigma_2^T\hat W_1\big) - \frac{1}{4\gamma^2}\bar E_1(x)\,\hat W_3\,\bar m^T(x)\,\hat W_1\Big\}$
where
$\bar D_1(x) = \nabla\phi_1(x)\,g(x)R^{-1}g^T(x)\,\nabla\phi_1^T(x), \qquad \bar E_1(x) = \nabla\phi_1(x)\,k(x)k^T(x)\,\nabla\phi_1^T(x), \qquad \bar m = \sigma_2/(\sigma_2^T\sigma_2+1)^2$

Then there exists an $N_0$ such that, for the number of hidden-layer units $N > N_0$, the closed-loop system state, the critic NN error $\tilde W_1 = W_1 - \hat W_1$, and the actor NN errors $\tilde W_2 = W_1 - \hat W_2$, $\tilde W_3 = W_1 - \hat W_3$ are UUB.
ONLINE solution: does not require solution of the HJI equation, the HJ equation, or the nonlinear Lyapunov equation. Does require the system dynamics to be known. Finds an approximate local smooth solution to the NONLINEAR HJI equation online.

This is an optimal adaptive controller: 'indirect' because it identifies parameters for the VFA, 'direct' because the control is found directly from the value function.
Simulation 1 – F-16 aircraft pitch-rate controller with a wind-gust disturbance (Stevens and Lewis 2003)

$\dot x = \begin{bmatrix} -1.01887 & 0.90506 & -0.00215 \\ 0.82225 & -1.07741 & -0.17555 \\ 0 & 0 & -1 \end{bmatrix}x + \begin{bmatrix}0\\0\\1\end{bmatrix}u + \begin{bmatrix}1\\0\\0\end{bmatrix}d, \qquad y = Cx, \quad x = [\alpha\ \ q\ \ \delta_e]^T$
$Q = C^TC = I$, $R = I$; $d$ is the wind gust.

Exact solution of the game ARE (GARE)
$0 = A^TP + PA + Q - PBR^{-1}B^TP + \frac{1}{\gamma^2}PKK^TP$
with the quadratic basis $W = [p_{11}\ \ 2p_{12}\ \ 2p_{13}\ \ p_{22}\ \ 2p_{23}\ \ p_{33}]^T$:
$W^* = [1.6573\ \ 1.3954\ \ -0.1661\ \ 1.6573\ \ -0.1804\ \ 0.4371]^T$

Probing noise must be added to both u(x) and d(x) to get PE:
$\hat u(x) = -\tfrac{1}{2}R^{-1}g^T(x)\,\nabla\phi_1^T\hat W_2 + n(t)$, with exponentially decaying $n(t)$.

The algorithm converges to
$\hat W_1(t_f) = [1.7090\ \ 1.3303\ \ -0.1629\ \ 1.7354\ \ -0.1730\ \ 0.4468]^T$
with $\hat W_2(t_f) = \hat W_3(t_f) = \hat W_1(t_f)$, giving the corresponding control $\hat u(x)$ and disturbance $\hat d(x)$ policies.

[Plots: Critic NN parameters; system states.]
Simulation 2 – F-16 aircraft pitch-rate controller with d(t) = 0 (no disturbance)

[Plots: Critic NN parameters with disturbance vs. critic NN parameters without disturbance.]

The algorithm converges FASTER with an opponent: one learns faster with an adversary.
Simulation 3 – Nonlinear System

$\dot x = f(x) + g(x)u + k(x)d, \qquad x\in\mathbb{R}^2$
with
$g(x) = \begin{bmatrix}0\\ \cos(2x_1)+2\end{bmatrix}, \qquad k(x) = \begin{bmatrix}0\\ \sin(4x_1)+2\end{bmatrix}$
and $f(x)$ a converse-optimal construction (containing cubic terms in $x_1, x_2$ and terms in $x_2(\cos(2x_1)+2)^2$ and $x_2(\sin(4x_1)+2)^2$) chosen so that the saddle point below is exact.
$Q = I$, $R = I$, $\gamma^2 = 8$.

Saddle-point solution:
Optimal value: $V^*(x) = \tfrac{1}{4}x_1^4 + \tfrac{1}{2}x_2^2$
Optimal control: $u^*(x) = -\tfrac{1}{2}\big(\cos(2x_1)+2\big)x_2$
Optimal disturbance: $d^*(x) = \frac{1}{2\gamma^2}\big(\sin(4x_1)+2\big)x_2$

Select the VFA basis set $\phi_1(x) = [x_1^2\ \ x_2^2\ \ x_1^4\ \ x_2^4]^T$.

The algorithm converges to
$\hat W_1(t_f) = [0.0008\ \ 0.4999\ \ 0.2429\ \ 0.0032]^T$
(the true weights are $[0\ \ 0.5\ \ 0.25\ \ 0]^T$), with $\hat W_2(t_f) = \hat W_3(t_f) = \hat W_1(t_f)$, giving the corresponding policies $\hat u(x)$ and $\hat d(x)$.

[Plots: Critic NN parameters; states; value-function approximation error; control approximation error; disturbance approximation error.]
6. Avoiding Knowledge of the Drift Term f(x): Integral Reinforcement Learning (work of Draguna Vrabie)

Policy iteration requires repeated solution of the Lyapunov equation
$0 = r(x,\mu(x)) + \frac{\partial V^T}{\partial x}f(x,\mu(x)) = Q(x) + u^TRu + \frac{\partial V^T}{\partial x}\big(f(x)+g(x)u\big) = H\big(x,\mu,\tfrac{\partial V}{\partial x}\big)$

This can be done online, without knowing f(x), using measurements of x(t), u(t) along the system trajectories.

Lemma 1 (Draguna Vrabie). The Lyapunov equation
$0 = \frac{\partial V^T}{\partial x}f(x,u) + r(x,u) = H\big(x,u,\tfrac{\partial V}{\partial x}\big), \qquad V(0)=0$
is equivalent to the CT Bellman equation
$V(x(t)) = \int_t^{t+T}r(x,u)\,d\tau + V(x(t+T)), \qquad V(0)=0$
which solves the Lyapunov equation without knowing f(x,u).

Proof: $\frac{dV(x)}{dt} = \frac{\partial V^T}{\partial x}f(x,u) = -r(x,u)$, so
$\int_t^{t+T}r(x,u)\,d\tau = -\int_t^{t+T}dV(x) = V(x(t)) - V(x(t+T))$

This allows the definition of a temporal-difference error for CT systems:
$e(t) = -V(x(t)) + \int_t^{t+T}r(x,u)\,d\tau + V(x(t+T))$
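Lemma 1 can be checked numerically on a scalar LQR example (the numbers below are illustrative, not from the talk): the quadratic value xᵀPx, with P from the Lyapunov equation, satisfies the CT Bellman equation over any interval [t, t+T]:

```python
import numpy as np
from scipy.integrate import quad

# Scalar plant xdot = a x + b u with fixed policy u = -L x (illustrative numbers)
a, b, q, r, L = -1.0, 1.0, 1.0, 1.0, 0.5
c = a - b * L                     # closed-loop pole (must be negative)
P = -(q + L**2 * r) / (2 * c)     # scalar Lyapunov eq: 2 c P + (q + L^2 r) = 0

x0, t, T = 2.0, 0.3, 0.7
x = lambda s: x0 * np.exp(c * s)  # closed-loop trajectory x(s)

# integral reinforcement over [t, t+T]
rho, _ = quad(lambda s: (q + L**2 * r) * x(s)**2, t, t + T)

lhs = P * x(t)**2                 # V(x(t))
rhs = rho + P * x(t + T)**2       # reinforcement + V(x(t+T))
print(abs(lhs - rhs) < 1e-8)      # True: the CT Bellman equation holds
```

Note that neither a nor b appears in the Bellman check itself; they are used only to generate the trajectory, which in practice would come from measurements.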
Lemma 1 (D. Vrabie), LQR case. With $A_c = A - BL$, the Lyapunov equation
$0 = A_c^TP + PA_c + L^TRL + Q$
is equivalent to
$x^T(t)Px(t) = \int_t^{t+T}x^T\big(Q + L^TRL\big)x\,d\tau + x^T(t+T)Px(t+T)$
which solves the Lyapunov equation without knowing A or B.

Proof:
$\frac{d(x^TPx)}{dt} = x^T\big(A_c^TP + PA_c\big)x = -x^T\big(L^TRL + Q\big)x$
so
$\int_t^{t+T}x^T\big(Q + L^TRL\big)x\,d\tau = -\int_t^{t+T}d(x^TPx) = x^T(t)Px(t) - x^T(t+T)Px(t+T)$

BUT: this does not allow simultaneous tuning of the critic and actor NNs.

Reinforcement learning = predict the cost, observe behavior, update the cost prediction:
$x^T(t)Px(t) = \int_t^{t+T}x^T\big(Q + L^TRL\big)x\,d\tau + x^T(t+T)Px(t+T)$
Integral Reinforcement Learning (Draguna Vrabie)

1. Policy iteration

Policy evaluation (cost update):
$V_{k+1}(x(t)) = \int_t^{t+T}r(x,u_k)\,dt + V_{k+1}(x(t+T))$

For the LQR case, with $u_k(t) = -L_kx(t)$:
$x^T(t)P_{k+1}x(t) = \int_t^{t+T}x^T\big(Q + L_k^TRL_k\big)x\,d\tau + x^T(t+T)P_{k+1}x(t+T)$
A and B do not appear.

Policy improvement (control gain update):
$L_{k+1} = R^{-1}B^TP_{k+1}$
B is needed for the control update, and an initial stabilizing control is needed.
CT Policy Iteration – how to implement it online? Linear systems, quadratic cost (LQR)

Policy evaluation:
$x^T(t)P_{k+1}x(t) = \int_t^{t+T}x^T\big(Q + L_k^TRL_k\big)x\,d\tau + x^T(t+T)P_{k+1}x(t+T)$
i.e.
$x^T(t)P_{k+1}x(t) - x^T(t+T)P_{k+1}x(t+T) = \int_t^{t+T}x^T\big(Q + L_k^TRL_k\big)x\,d\tau$

For a second-order state, writing P in terms of its distinct entries gives the quadratic-basis form
$\bar p_{k+1}^T\big(\bar x(t) - \bar x(t+T)\big) = \int_t^{t+T}x^T\big(Q + L_k^TRL_k\big)x\,d\tau$
with parameter vector $\bar p = [p_{11}\ \ 2p_{12}\ \ p_{22}]^T$ and quadratic basis set $\bar x = [x_1^2\ \ x_1x_2\ \ x_2^2]^T$.

Critic update.
Algorithm Implementation

$x^T(t)P_{k+1}x(t) = \int_t^{t+T}x^T\big(Q + L_k^TRL_k\big)x\,d\tau + x^T(t+T)P_{k+1}x(t+T)$

Use the Kronecker product identity $\operatorname{vec}(ABC) = (C^T\otimes A)\operatorname{vec}(B)$ to set this up in terms of the quadratic basis $\bar x(t) = x(t)\otimes x(t)$.
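The Kronecker identity can be verified directly with NumPy (vec is column-major stacking; matrix sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A_ = rng.standard_normal((2, 3))
B_ = rng.standard_normal((3, 4))
C_ = rng.standard_normal((4, 2))

vec = lambda M: M.flatten(order="F")   # column-stacking vec operator

lhs = vec(A_ @ B_ @ C_)
rhs = np.kron(C_.T, A_) @ vec(B_)
print(np.allclose(lhs, rhs))  # True: vec(ABC) = (C^T kron A) vec(B)
```

Applied to $x^TPx = \operatorname{vec}(P)^T(x\otimes x)$, this is what turns the quadratic form into a linear regression in the unknown entries of P.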
$\bar p_{k+1}^T\,\bar x(t) = \int_t^{t+T}x^T\big(Q + K_k^TRK_k\big)x\,d\tau + \bar p_{k+1}^T\,\bar x(t+T)$
so
$\bar p_{k+1}^T\big(\bar x(t) - \bar x(t+T)\big) = \int_t^{t+T}x^T\big(Q + K_k^TRK_k\big)x\,d\tau \equiv \rho(t,t+T)$, the reinforcement on the time interval [t, t+T].

c.f. linear-in-the-parameters system identification: this has the same form as standard system-ID problems $h_k^T(u,y)\theta = r(u,y)$. The quadratic regression vector is $\bar x(t) - \bar x(t+T)$, and the unknown parameter vector $\bar p_{k+1}$ has $n(n+1)/2$ entries. Solve using RLS or batch LS, with at least $n(n+1)/2$ data points along the system trajectory.
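The steps above can be sketched as batch least-squares policy iteration on simulated trajectory data. In this sketch (the plant is illustrative, not from the talk) the A matrix is used only to generate the "measurements", never in the learning update, and B appears only in the gain update, exactly as in the slides:

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.linalg import solve_continuous_are

# Illustrative stable plant (A is only used to simulate measurements of x(t))
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)
T = 0.1  # reinforcement interval

def quad_basis(x):
    # quadratic basis [x1^2, x1*x2, x2^2] -> parameters [p11, 2*p12, p22]
    return np.array([x[0]**2, x[0]*x[1], x[1]**2])

def collect(L, x0, n_steps):
    """Roll the closed loop forward; return regression rows and reinforcements."""
    Ac, M = A - B @ L, Q + L.T @ R @ L
    rows, rhos, x = [], [], x0
    for _ in range(n_steps):
        rhs = lambda t, z: np.concatenate([Ac @ z[:2], [z[:2] @ M @ z[:2]]])
        z = solve_ivp(rhs, [0, T], np.concatenate([x, [0.0]]),
                      rtol=1e-10, atol=1e-12).y[:, -1]
        x_next, rho = z[:2], z[2]        # x(t+T) and integral cost over [t, t+T]
        rows.append(quad_basis(x) - quad_basis(x_next))
        rhos.append(rho)
        x = x_next
    return rows, rhos

L = np.zeros((1, 2))                     # initial stabilizing gain (A is stable)
for _ in range(8):                       # policy iteration
    rows, rhos = [], []
    for x0 in ([1.0, 0.0], [0.0, 1.0], [1.0, -1.0]):  # several trajectories for PE
        r_, v_ = collect(L, np.array(x0), 4)
        rows += r_; rhos += v_
    p = np.linalg.lstsq(np.array(rows), np.array(rhos), rcond=None)[0]
    P = np.array([[p[0], p[1]/2], [p[1]/2, p[2]]])    # critic update (batch LS)
    L = np.linalg.solve(R, B.T @ P)                   # gain update: L = R^-1 B^T P

print(np.allclose(P, solve_continuous_are(A, B, Q, R), atol=1e-4))  # True
```

Each outer pass is one policy-iteration step: the batch LS plays the role of the critic, and the gain update plays the role of the actor; P converges to the ARE solution without A ever entering the update.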
Algorithm Implementation – Approximate Dynamic Programming (or Adaptive DP)

$V_{k+1}(x(t)) = \int_t^{t+T}\big(Q(x) + u_k^TRu_k\big)\,dt + V_{k+1}(x(t+T))$

Approximate the value by a neural network, $V = W^T\phi(x)$:
$W_{k+1}^T\phi(x(t)) = \int_t^{t+T}\big(Q(x) + u_k^TRu_k\big)\,dt + W_{k+1}^T\phi(x(t+T))$
$W_{k+1}^T\big(\phi(x(t)) - \phi(x(t+T))\big) = \int_t^{t+T}\big(Q(x) + u_k^TRu_k\big)\,dt$

The regression vector is $\phi(x(t)) - \phi(x(t+T))$, and the right-hand side is the reinforcement on the time interval [t, t+T]. Use RLS along the trajectory to get the new weights $W_{k+1}$.

Then find the updated feedback:
$u_{k+1}(x) = h_{k+1}(x) = -\tfrac{1}{2}R^{-1}g^T(x)\frac{\partial V_{k+1}(x(t))}{\partial x} = -\tfrac{1}{2}R^{-1}g^T(x)\,\nabla\phi^TW_{k+1}$
Direct Optimal Adaptive Control for Partially Unknown CT Systems

This is a data-based approach that uses measurements of x(t), u(t) instead of the plant dynamical model.

1. Select an initial stabilizing control policy and apply $u_k(t) = -L_kx(t)$.
2. Find the associated cost; this solves the Lyapunov equation without knowing the dynamics:
$\bar p_{k+1}^T\big(\bar x(t) - \bar x(t+T)\big) = \int_t^{t+T}x^T\big(Q + L_k^TRL_k\big)x\,d\tau \equiv \rho(t,t+T)$
Observe x(t); measure the cost increment (the reinforcement) by adding V as a state, with $\dot V = x^TQx + u_k^TRu_k$; observe the cost integral and x(t+T); update P. Run RLS until convergence to $P_{k+1}$. A is not needed anywhere.
3. Improve the control: $L_{k+1} = R^{-1}B^TP_{k+1}$, and update the control gain to $L_{k+1}$.
Persistence of Excitation

$\bar p_{k+1}^T\big(\bar x(t) - \bar x(t+T)\big) = \int_t^{t+T}x^T\big(Q + L_k^TRL_k\big)x\,d\tau$

The regression vector must be persistently exciting. This relates to the choice of the reinforcement interval T.
Direct Optimal Adaptive Controller (Draguna Vrabie)

Solves the Riccati equation online without knowing the A matrix.

[Diagram: A hybrid continuous/discrete dynamic controller whose internal state is the observed cost over the interval. The critic samples V every T seconds (ZOH), where $\dot V = x^TQx + u^TRu$ is measured along the trajectory of the system $\dot x = Ax + Bu$, $x(0) = x_0$ (a dynamic control system with memory). The critic runs RLS or batch LS to identify the value of the current control; the actor gain K, implementing $u = -Kx$, is updated only after the critic has converged.]

Gain (policy) updates: the control is $u_k(t) = -L_kx(t)$ with piecewise-constant gains $L_k$, $k = 0, 1, 2, \ldots$: continuous-time control with discrete gain updates.

The reinforcement intervals T need not be the same; they can be selected online, in real time, on the fly.
Simulations on: F-16 autopilot; load-frequency control for a power system.

[Plots: System states and controller parameters vs. time; the A matrix is not needed. The critic parameters P(1,1), P(1,2), P(2,2) converge to the steady-state Riccati equation solution, plotted against the optimal values.]

Solves the ARE online without knowing A.
[Diagram: The actor-critic structure mapped onto the brain (ADP – Paul Werbos). The critic comprises the cerebral cortex (motor areas), thalamus, basal ganglia, and hippocampus, which holds a cognitive map of the environment (place cells, theta rhythms at 4-10 Hz). Critic information, the behavior reference, is sent to the actor: cerebellum, brainstem, spinal cord, and inferior olive, producing muscle contraction and movement (motor control at about 200 Hz). Interoceptive and exteroceptive receptors close the loop.]