Transcript of UTA talks/online adapt optimal 01 10.pdf – "Adaptive Control is generally not Optimal…"
Online Learning of Optimal Control and Zero-Sum Game Solutions

F.L. Lewis, K. Vamvoudakis, and D. Vrabie
Moncrief-O'Donnell Endowed Chair; Head, Controls & Sensors Group
Automation & Robotics Research Institute (ARRI), The University of Texas at Arlington
Supported by: NSF (Paul Werbos), ARO (Randy Zachery)

Talk available online at http://ARRI.uta.edu/acs
Thank you: Fuli Wang, Changyun Wen, Jiongtian Liu, Guang-Hong Yang.
Organized and invited by Jianliang Wang.

Online Gaming
F.L. Lewis, Moncrief-O'Donnell Endowed Chair; Head, Controls & Sensors Group
Automation & Robotics Research Institute (ARRI), The University of Texas at Arlington
It is man's obligation to explore the most difficult questions in the clearest possible way and use reason and intellect to arrive at the best answer.
Man's task is to understand patterns in nature and society.
The first task is to understand the individual problem, then to analyze symptoms and causes, and only then to design treatment and controls.
— Ibn Sina (Avicenna), 1002-1021
Adaptive Control: online, in real time; no dynamics knowledge needed.
- Adaptive control generally minimizes a squared (tracking) error.
- Inverse optimal adaptive control minimizes a cost, but not one of our choosing.
- Indirect optimal adaptive control identifies A and B and then solves the Riccati equation.
Adaptive control is generally not optimal.

Optimal Control is off-line, and needs to know the system dynamics to solve the design equations; e.g., the Riccati equation needs A and B.

We want ONLINE DIRECT ADAPTIVE OPTIMAL control, for any performance cost of our own choosing.
[Diagram: Adaptive control structures acting on the plant (control input, measured output). Indirect adaptive control identifies the system model; direct adaptive control identifies the controller. How to do Optimal Adaptive control?]
IEEE Circuits & Systems Magazine (Editor: Ron Chen).
Synthesis of:
- computational intelligence
- control systems
- neurobiology

Different methods of learning
Machine learning: the formal study of learning systems.

Xi-Ren Cao, Stochastic Learning and Optimization: A Sensitivity-Based Approach
Different methods of learning

Reinforcement learning: Ivan Pavlov, 1890s.

Actor-Critic Learning: every living organism interacts with its environment and uses those interactions to improve its own actions in order to survive and increase.

We want OPTIMAL performance: ADP, Approximate Dynamic Programming.

[Diagram: Actor-critic structure. A critic compares the system outputs with the desired performance and generates a reinforcement signal that tunes the actor; the actor applies control inputs to the system/environment; the adaptive learning system closes the loop.]

ADP – Paul Werbos
1. Optimal Control: Linear Quadratic Regulator

System: $\dot x = Ax + Bu$

Cost: $V(x(t)) = \int_t^\infty \big(x^TQx + u^TRu\big)\,d\tau = x^T(t)Px(t)$

Differential equivalent:
$0 = H\big(x,u,\tfrac{\partial V}{\partial x}\big) = x^TQx + u^TRu + \frac{\partial V^T}{\partial x}\dot x = x^TQx + u^TRu + 2x^TP(Ax+Bu)$

Given any stabilizing FB policy $u = -Kx$, the cost value is found by solving the Lyapunov equation
$0 = (A-BK)^TP + P(A-BK) + Q + K^TRK$

The optimal control is
$u(t) = -R^{-1}B^TPx(t) = -Lx(t)$
where P solves the algebraic Riccati equation
$0 = A^TP + PA + Q - PBR^{-1}B^TP$

The full system dynamics must be known; this is an off-line solution.
Kleinman Algorithm to Solve the ARE

Start with a stabilizing feedback $K_0$.

1. For the given control policy $u = -K_kx$, solve for the cost (a Lyapunov equation):
$0 = A_k^TP_k + P_kA_k + Q + K_k^TRK_k, \qquad A_k = A - BK_k$

2. Improve the policy:
$K_{k+1} = R^{-1}B^TP_k$

Kleinman 1968:
- $P_k$ monotonically converges to the unique positive-definite solution of the Riccati equation.
- Every iteration step returns a stabilizing controller.

The system has to be known: this is an OFF-LINE DESIGN, and a LYAPUNOV EQUATION MUST BE SOLVED AT EACH STEP.
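The two steps above can be sketched numerically. A minimal off-line implementation using SciPy's Lyapunov solver (the plant matrices here are illustrative, not from the talk):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

def kleinman(A, B, Q, R, K0, iters=20):
    """Kleinman (1968) policy iteration for the continuous-time ARE.
    K0 must stabilize A - B @ K0."""
    K = K0
    for _ in range(iters):
        Ak = A - B @ K                                   # closed-loop matrix A_k
        # Step 1 (policy evaluation): Ak^T P + P Ak + Q + K^T R K = 0
        P = solve_continuous_lyapunov(Ak.T, -(Q + K.T @ R @ K))
        # Step 2 (policy improvement): K_{k+1} = R^{-1} B^T P_k
        K = np.linalg.solve(R, B.T @ P)
    return P, K

# Illustrative stable plant, so K0 = 0 is already stabilizing
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)

P, K = kleinman(A, B, Q, R, K0=np.zeros((1, 2)))
print(np.allclose(P, solve_continuous_are(A, B, Q, R)))  # True
```

Each pass solves one Lyapunov equation exactly, so the iterates converge to the ARE solution in a handful of steps, exactly as the off-line design described above.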
2. Optimal Control for Nonlinear Systems

System: $\dot x = f(x,u) = f(x) + g(x)u$

Cost: $V(x(t)) = \int_t^\infty r(x,u)\,dt = \int_t^\infty\big(Q(x) + u^TRu\big)\,dt$

Differential equivalent:
$0 = r(x,u) + \frac{\partial V^T}{\partial x}f(x,u) = H\big(x,u,\tfrac{\partial V}{\partial x}\big), \qquad V(0)=0$

Given any stabilizing policy $u = \mu(x)$, the cost value is found by solving the nonlinear Lyapunov equation
$0 = Q(x) + \mu^T(x)R\mu(x) + \frac{\partial V^T}{\partial x}\big(f(x)+g(x)\mu(x)\big) = H\big(x,\mu,\tfrac{\partial V}{\partial x}\big), \qquad V(0)=0$
The optimal control is found by solving
$0 = \min_{u(t)}H\big(x,u,\tfrac{\partial V^*}{\partial x}\big) = \min_{u(t)}\Big(Q(x) + u^TRu + \frac{\partial V^{*T}}{\partial x}f(x,u)\Big)$

Optimal control policy:
$\mu^*(x(t)) = \arg\min_{u(t)}H\big(x,u,\tfrac{\partial V^*}{\partial x}\big) = -\tfrac{1}{2}R^{-1}g^T(x)\frac{\partial V^*}{\partial x}$

HJB equation (a nonlinear Riccati equation), with $V^*(0)=0$:
$0 = Q(x) + \frac{dV^{*T}}{dx}f(x) - \tfrac{1}{4}\frac{dV^{*T}}{dx}\,g(x)R^{-1}g^T(x)\,\frac{dV^*}{dx}$
Policy Iteration Algorithm to Solve the HJB

Start with a stabilizing initial policy $\mu_0(x)$.

1. For the given control policy $\mu_j(x)$, solve for the cost $V_j(x(t))$ (a nonlinear Lyapunov equation), with $V_j(0)=0$:
$0 = H\big(x,\mu_j,\tfrac{\partial V_j}{\partial x}\big) = Q(x) + \mu_j^T(x)R\mu_j(x) + \frac{\partial V_j^T}{\partial x}\big(f(x)+g(x)\mu_j(x)\big)$

2. Improve the policy:
$\mu_{j+1} = \arg\min_u H\big(x,u,\tfrac{\partial V_j}{\partial x}\big) = -\tfrac{1}{2}R^{-1}g^T(x)\frac{\partial V_j}{\partial x}$

- Convergence proved by Saridis (1979) if the nonlinear Lyapunov equation is solved exactly.
- Beard & Saridis used Galerkin integrals to solve the Lyapunov equation.
- Abu-Khalaf & Lewis used NNs to approximate V for nonlinear systems and proved convergence.

Off-line solution: a nonlinear Lyapunov equation must be solved at each step.
3. Online Solution of Optimal Control for Nonlinear Systems (Kyriakos Vamvoudakis)

Optimal Adaptive Control

Need to solve online:

the nonlinear Lyapunov equation for the value,
$0 = Q(x) + u^TRu + \frac{\partial V^T}{\partial x}\big(f(x)+g(x)u\big) = H\big(x,\mu,\tfrac{\partial V}{\partial x}\big)$

and the control update,
$\mu(x) = -\tfrac{1}{2}R^{-1}g^T(x)\frac{\partial V}{\partial x}$

Solve by parameterizing the value V(x): this converts the PDE into an algebraic equation.
i. Value Function Approximation (VFA)

In the LQR case $V(x) = x^TPx = \operatorname{vec}(P)^T(x\otimes x) = W^T\bar\phi(x)$: use a quadratic basis set.

In the nonlinear case one must approximate both $V$ and $\partial V/\partial x$: approximate in the Sobolev norm (a higher-order Weierstrass approximation theorem).

There exists a basis set $\{\varphi_i(x)\}$ such that
$V(x) = \sum_{i=1}^\infty c_i\varphi_i(x) = \sum_{i=1}^N c_i\varphi_i(x) + \sum_{i=N+1}^\infty c_i\varphi_i(x) \equiv C_1^T\phi_1(x) + \epsilon(x)$
$\frac{\partial V}{\partial x} = \sum_{i=1}^N c_i\frac{\partial\varphi_i}{\partial x} + \sum_{i=N+1}^\infty c_i\frac{\partial\varphi_i}{\partial x}$

A complete independent basis set is needed. Retain N terms; the last term (the approximation error) goes to zero uniformly as $N\to\infty$.

Take the VFA as $V(x) = W_1^T\phi_1(x) + \epsilon(x)$.

Then the nonlinear Lyapunov equation
$0 = \frac{\partial V^T}{\partial x}f(x,u) + r(x,u) = H\big(x,u,\tfrac{\partial V}{\partial x}\big)$
becomes
$H(x,W_1,u) = W_1^T\nabla\phi_1(f+gu) + Q(x) + u^TRu = \epsilon_H$
where $W_1$ is the LS solution to this equation for the given N; it is unknown.

Theorem (Abu-Khalaf). As $N\to\infty$:
a. $\sup_x|\epsilon_H|\to 0$
b. $\|W_1 - C_1\|\to 0$
c. $\sup_x|V_1 - V|\to 0$
d. $\sup_x\|\partial V_1/\partial x - \partial V/\partial x\|\to 0$
CRITIC NN for VFA (Paul Werbos): $\hat V(x) = \hat W_1^T\phi_1(x)$

Find the NN weights that solve the algebraic (Lyapunov) equation:
$H(x,\hat W_1,u) = \hat W_1^T\nabla\phi_1(f+gu) + Q(x) + u^TRu = e_1$

Parameter Identification of the Value Function

Theorem (Kyriakos Vamvoudakis) – Online tuning of the critic NN weights.
Let $u(t)$ be bounded and stabilizing, and let $\sigma_1 = \nabla\phi_1(f+gu)$ be persistently exciting (PE). Tune the critic NN weights with the normalized-gradient (modified Levenberg-Marquardt) law
$\dot{\hat W}_1 = -a_1\,\frac{\sigma_1}{(\sigma_1^T\sigma_1+1)^2}\,\big(\sigma_1^T\hat W_1 + Q(x) + u^TRu\big)$
Then the critic parameter error $\tilde W_1 = W_1 - \hat W_1$ converges exponentially, with a decay factor determined by the PE parameters, to a residual set whose radius is of the order of the approximation-error bound.
ii. Action NN for Control Approximation

Control policy (actor NN): $\hat u(x) = -\tfrac{1}{2}R^{-1}g^T(x)\,\nabla\phi_1^T\hat W_2$
This comes from $\mu(x) = -\tfrac{1}{2}R^{-1}g^T(x)\,\partial V/\partial x$.

Theorem (Kyriakos Vamvoudakis) – Online Learning of Nonlinear Optimal Control.
Let $\sigma_1 = \nabla\phi_1(f+gu)$ be PE. Tune the critic NN weights as above, and tune the actor NN weights as
$\dot{\hat W}_2 = -a_2\Big\{\big(F_2\hat W_2 - F_1\bar\sigma_1^T\hat W_1\big) - \tfrac{1}{4}\bar D_1(x)\,\hat W_2\,\bar m^T(x)\,\hat W_1\Big\}$
where $\bar D_1(x) = \nabla\phi_1(x)\,g(x)R^{-1}g^T(x)\,\nabla\phi_1^T(x)$ and $\bar m = \sigma_1/(\sigma_1^T\sigma_1+1)^2$.

Then there exists an $N_0$ such that, for the number of hidden-layer units $N > N_0$, the closed-loop system state, the critic NN error $\tilde W_1 = W_1 - \hat W_1$, and the actor NN error $\tilde W_2 = W_1 - \hat W_2$ are uniformly ultimately bounded (UUB).
Summary / Nota Bene

Tune the critic NN weights and the actor NN weights as above; the extra terms in the actor law are needed for stability. Note: it does NOT work to simply set
$\hat u(x) = -\tfrac{1}{2}R^{-1}g^T(x)\,\nabla\phi_1^T\hat W_1$

Proof: use the Lyapunov function
$L(t) = V(x) + \tfrac{1}{2}\operatorname{tr}\big(\tilde W_1^T a_1^{-1}\tilde W_1\big) + \tfrac{1}{2}\operatorname{tr}\big(\tilde W_2^T a_2^{-1}\tilde W_2\big)$
where:
- $V(x)$ = the unknown solution of the HJB equation $0 = \frac{dV^T}{dx}f + Q(x) - \tfrac{1}{4}\frac{dV^T}{dx}gR^{-1}g^T\frac{dV}{dx}$ (guarantees stability);
- $W_1$ = the unknown LS solution of $H(x,W_1,u) = W_1^T\nabla\phi_1(f+gu) + Q(x) + u^TRu = \epsilon_H$ for the given N;
- $\tilde W_1 = W_1 - \hat W_1$ and $\tilde W_2 = W_1 - \hat W_2$ (guarantee convergence).

ONLINE solution: does not require solution of the HJB or the nonlinear Lyapunov equation. Does require the system dynamics to be known. Finds an approximate local smooth solution to the NONLINEAR HJB equation online.

This is an optimal adaptive controller: 'indirect' because it identifies parameters for the VFA, 'direct' because the control is found directly from the value function.
[Diagram: Optimal adaptive control identifies the value, in contrast with indirect adaptive control (identify the system model) and direct adaptive control (identify the controller), all acting on the plant (control input, measured output).]
Simulation 1 – F-16 aircraft pitch-rate controller (Stevens and Lewis 2003)

$\dot x = \begin{bmatrix} -1.01887 & 0.90506 & -0.00215 \\ 0.82225 & -1.07741 & -0.17555 \\ 0 & 0 & -1 \end{bmatrix}x + \begin{bmatrix}0\\0\\1\end{bmatrix}u, \qquad x = [\alpha\ \ q\ \ \delta_e]^T$
$Q = I$, $R = I$.

Exact solution of the ARE $0 = A^TP + PA + Q - PBR^{-1}B^TP$: selecting the quadratic NN basis set for VFA with
$W = [p_{11}\ \ 2p_{12}\ \ 2p_{13}\ \ p_{22}\ \ 2p_{23}\ \ p_{33}]^T$
gives
$W^* = [1.4245\ \ 1.1682\ \ -0.1352\ \ 1.4349\ \ -0.1501\ \ 0.4329]^T$

Probing noise must be added to get PE:
$\hat u(x) = -\tfrac{1}{2}R^{-1}g^T(x)\,\nabla\phi_1^T\hat W_2 + n(t)$, with exponentially decaying $n(t)$.

The critic converges to
$\hat W_1(t_f) = [1.4279\ \ 1.1612\ \ -0.1366\ \ 1.4462\ \ -0.1480\ \ 0.4317]^T$
The actor converges to the same weights,
$\hat W_2(t_f) = [1.4279\ \ 1.1612\ \ -0.1366\ \ 1.4462\ \ -0.1480\ \ 0.4317]^T$,
giving the control
$\hat u_2(x) = -\tfrac{1}{2}R^{-1}B^T\hat Px$, with $\hat P$ reconstructed from $\hat W_2$ via the quadratic basis gradient $\nabla\phi_1(x)$.

[Plots: Critic NN parameters; system states.]
Simulation 2 – Nonlinear System (a converse-optimal design; Nevistic and Primbs 1996)

$\dot x = f(x) + g(x)u, \qquad x\in\mathbb{R}^2$
$f(x) = \begin{bmatrix} -x_1 + x_2 \\ -0.5x_1 - 0.5x_2\big(1 - (\cos(2x_1)+2)^2\big) \end{bmatrix}, \qquad g(x) = \begin{bmatrix} 0 \\ \cos(2x_1)+2 \end{bmatrix}$
$Q = I$, $R = I$.

Optimal value: $V^*(x) = \tfrac{1}{2}x_1^2 + x_2^2$
Optimal control: $u^*(x) = -\big(\cos(2x_1)+2\big)x_2$
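The stated value and control can be checked against the HJB equation numerically. A sketch, assuming the drift f(x) as reconstructed above (for which V* and u* are exact with Q = I, R = 1):

```python
import numpy as np

# Converse-optimal example: with this f, g and Q = I, R = 1,
# V*(x) = 0.5 x1^2 + x2^2 and u*(x) = -(cos(2 x1) + 2) x2 solve the HJB.
def f(x):
    c = np.cos(2 * x[0]) + 2
    return np.array([-x[0] + x[1],
                     -0.5 * x[0] - 0.5 * x[1] * (1 - c**2)])

def g(x):
    return np.array([0.0, np.cos(2 * x[0]) + 2])

dV = lambda x: np.array([x[0], 2 * x[1]])          # gradient of V*
u_star = lambda x: -(np.cos(2 * x[0]) + 2) * x[1]  # = -0.5 R^{-1} g^T dV*

def hjb_residual(x):
    # H(x, u*, dV*/dx) = x^T x + u*^T R u* + dV*/dx^T (f + g u*)
    u = u_star(x)
    return x @ x + u**2 + dV(x) @ (f(x) + g(x) * u)

rng = np.random.default_rng(0)
worst = max(abs(hjb_residual(rng.uniform(-2, 2, size=2))) for _ in range(200))
print(worst < 1e-9)  # True: the HJB residual vanishes to machine precision
```

The terms cancel identically, which is exactly how the converse-optimal construction works: f is chosen so a prescribed V* satisfies the HJB.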
Select the VFA basis set $\phi_1(x) = [x_1^2\ \ x_1x_2\ \ x_2^2]^T$.

The algorithm converges to
$\hat W_1(t_f) = [0.5017\ \ -0.0020\ \ 1.0008]^T$ and $\hat W_2(t_f) = [0.5017\ \ -0.0020\ \ 1.0008]^T$
(the true weights are $[0.5\ \ 0\ \ 1]^T$), giving
$\hat u_2(x) = -\tfrac{1}{2}R^{-1}g^T(x)\begin{bmatrix} 2x_1 & x_2 & 0 \\ 0 & x_1 & 2x_2 \end{bmatrix}\hat W_2$

[Plots: Critic NN parameters; states; optimal value function; value-function approximation error; control approximation error.]
4. Zero-Sum Games for Nonlinear Systems

System: $\dot x = f(x,u,d) = f(x) + g(x)u + k(x)d, \qquad y = h(x)$

Cost: $V(x(t),u,d) = \int_t^\infty\big(h^Th + u^TRu - \gamma^2\|d\|^2\big)\,dt = \int_t^\infty r(x,u,d)\,dt$

Differential equivalent: given any stabilizing control and disturbance policies $u(x)$, $d(x)$, the cost value is found by solving the nonlinear Lyapunov equation
$0 = r(x,u,d) + \frac{\partial V^T}{\partial x}\big(f(x)+g(x)u+k(x)d\big) = H\big(x,\tfrac{\partial V}{\partial x},u,d\big), \qquad V(0)=0$

Define the 2-player zero-sum game as
$V^*(x(0)) = \min_u\max_d J(x(0),u,d) = \min_u\max_d\int_0^\infty\big(h^T(x)h(x) + u^TRu - \gamma^2\|d\|^2\big)\,dt$

The game has a unique value (saddle-point solution) iff the Nash condition holds:
$\min_u\max_d J(x(0),u,d) = \max_d\min_u J(x(0),u,d)$
A necessary condition for this is the Isaacs condition:
$\min_u\max_d H\big(x,\tfrac{\partial V}{\partial x},u,d\big) = \max_d\min_u H\big(x,\tfrac{\partial V}{\partial x},u,d\big)$

The game saddle-point solution is found from the Hamiltonian
$H\big(x,\tfrac{\partial V}{\partial x},u,d\big) = h^Th + u^TRu - \gamma^2\|d\|^2 + \frac{\partial V^T}{\partial x}\big(f(x)+g(x)u+k(x)d\big)$

The optimal control and disturbance policies follow from the stationarity conditions $\partial H/\partial u = 0$, $\partial H/\partial d = 0$:
$u^* = -\tfrac{1}{2}R^{-1}g^T(x)\frac{\partial V^*}{\partial x}, \qquad d^* = \frac{1}{2\gamma^2}k^T(x)\frac{\partial V^*}{\partial x}$

HJI equation (a nonlinear 'Riccati' equation), with $V^*(0)=0$:
$0 = h^Th + \frac{\partial V^{*T}}{\partial x}f(x) - \tfrac{1}{4}\frac{\partial V^{*T}}{\partial x}g(x)R^{-1}g^T(x)\frac{\partial V^*}{\partial x} + \frac{1}{4\gamma^2}\frac{\partial V^{*T}}{\partial x}k(x)k^T(x)\frac{\partial V^*}{\partial x}$
Policy Iteration Algorithm to Solve the HJI

Start with a stabilizing initial control policy $u_0(x)$.

1. For the given control policy $u_j(x)$, solve for the value $V_{j+1}(x(t))$ from the HJ equation (a nonlinear 'Riccati' equation), with $V_{j+1}(0)=0$:
$0 = h^Th + u_j^TRu_j + \frac{\partial V_{j+1}^T}{\partial x}\big(f(x)+g(x)u_j(x)\big) + \frac{1}{4\gamma^2}\frac{\partial V_{j+1}^T}{\partial x}kk^T\frac{\partial V_{j+1}}{\partial x}$

2. Improve the policy:
$u_{j+1}(x) = -\tfrac{1}{2}R^{-1}g^T(x)\frac{\partial V_{j+1}}{\partial x}$

The minimal nonnegative-definite solution of the HJ equation is the available storage for $u_j(x)$.

Off-line solution: a nonlinear HJ equation must be solved at each step.
Double Policy Iteration Algorithm to Solve the HJI

Add an inner loop to solve for the available storage. Start with a stabilizing initial policy $u_0(x)$.

1. For the given control policy $u_j(x)$, solve for the value $V_{j+1}(x(t))$:
2. Set $d^0 = 0$. For $i = 0, 1, \ldots$, solve the nonlinear Lyapunov equation for $V_j^i(x(t))$,
$0 = h^Th + u_j^TRu_j - \gamma^2\|d^i\|^2 + \frac{\partial V_j^{iT}}{\partial x}\big(f + gu_j + kd^i\big)$
and update the disturbance
$d^{i+1} = \frac{1}{2\gamma^2}k^T(x)\frac{\partial V_j^i}{\partial x}$
On convergence, set $V_{j+1}(x) = V_j^\infty(x)$.
3. Improve the policy:
$u_{j+1}(x) = -\tfrac{1}{2}R^{-1}g^T(x)\frac{\partial V_{j+1}}{\partial x}$

- Convergence proved by Van der Schaft if the nonlinear Lyapunov equation can be solved exactly.
- Abu-Khalaf & Lewis used NNs to approximate V for nonlinear systems and proved convergence.

Off-line solution: a nonlinear Lyapunov equation must be solved at each step.
5. Online Solution of ZS Games for Nonlinear Systems (Kyriakos Vamvoudakis)

Optimal (Game) Adaptive Control

Need to solve online:

the nonlinear Lyapunov equation for the value,
$0 = h^Th + u^TRu - \gamma^2\|d\|^2 + \frac{\partial V^T}{\partial x}\big(f + gu + kd\big)$

the control update, $u(x) = -\tfrac{1}{2}R^{-1}g^T(x)\frac{\partial V}{\partial x}$,
and the disturbance update, $d(x) = \frac{1}{2\gamma^2}k^T(x)\frac{\partial V}{\partial x}$.

Use three neural networks:
Critic NN: $\hat V(x) = \hat W_1^T\phi_1(x)$
Control actor NN: $\hat u(x) = -\tfrac{1}{2}R^{-1}g^T(x)\,\nabla\phi_1^T\hat W_2$
Disturbance actor NN: $\hat d(x) = \frac{1}{2\gamma^2}k^T(x)\,\nabla\phi_1^T\hat W_3$

Simultaneously: (a) solve the Lyapunov equation
$H(x,\hat W_1,u,d) = \hat W_1^T\nabla\phi_1(f+gu+kd) + h^Th + u^TRu - \gamma^2\|d\|^2 = e$
and (b) update $u(x)$ and $d(x)$.
Theorem (Kyriakos Vamvoudakis) – Online Gaming.
Let $\sigma_2 = \nabla\phi_1(f + gu_2 + kd_3)$ be PE. Tune the critic NN weights as before, and tune the actor NN weights as
$\dot{\hat W}_2 = -a_2\Big\{\big(F_2\hat W_2 - F_1\bar\sigma_2^T\hat W_1\big) - \tfrac{1}{4}\bar D_1(x)\,\hat W_2\,\bar m^T(x)\,\hat W_1\Big\}$
$\dot{\hat W}_3 = -a_3\Big\{\big(F_4\hat W_3 - F_3\bar\sigma_2^T\hat W_1\big) - \frac{1}{4\gamma^2}\bar E_1(x)\,\hat W_3\,\bar m^T(x)\,\hat W_1\Big\}$
where
$\bar D_1(x) = \nabla\phi_1(x)\,g(x)R^{-1}g^T(x)\,\nabla\phi_1^T(x), \qquad \bar E_1(x) = \nabla\phi_1(x)\,k(x)k^T(x)\,\nabla\phi_1^T(x), \qquad \bar m = \sigma_2/(\sigma_2^T\sigma_2+1)^2$

Then there exists an $N_0$ such that, for the number of hidden-layer units $N > N_0$, the closed-loop system state, the critic NN error $\tilde W_1 = W_1 - \hat W_1$, and the actor NN errors $\tilde W_2 = W_1 - \hat W_2$, $\tilde W_3 = W_1 - \hat W_3$ are UUB.
ONLINE solution: does not require solution of the HJI equation, the HJ equation, or the nonlinear Lyapunov equation. Does require the system dynamics to be known. Finds an approximate local smooth solution to the NONLINEAR HJI equation online.

This is an optimal adaptive controller: 'indirect' because it identifies parameters for the VFA, 'direct' because the control is found directly from the value function.
Simulation 1 – F-16 aircraft pitch-rate controller with a wind-gust disturbance (Stevens and Lewis 2003)

$\dot x = \begin{bmatrix} -1.01887 & 0.90506 & -0.00215 \\ 0.82225 & -1.07741 & -0.17555 \\ 0 & 0 & -1 \end{bmatrix}x + \begin{bmatrix}0\\0\\1\end{bmatrix}u + \begin{bmatrix}1\\0\\0\end{bmatrix}d, \qquad y = Cx, \quad x = [\alpha\ \ q\ \ \delta_e]^T$
$Q = C^TC = I$, $R = I$; $d$ is the wind gust.

Exact solution of the game ARE (GARE)
$0 = A^TP + PA + Q - PBR^{-1}B^TP + \frac{1}{\gamma^2}PKK^TP$
with the quadratic basis $W = [p_{11}\ \ 2p_{12}\ \ 2p_{13}\ \ p_{22}\ \ 2p_{23}\ \ p_{33}]^T$:
$W^* = [1.6573\ \ 1.3954\ \ -0.1661\ \ 1.6573\ \ -0.1804\ \ 0.4371]^T$

Probing noise must be added to both u(x) and d(x) to get PE:
$\hat u(x) = -\tfrac{1}{2}R^{-1}g^T(x)\,\nabla\phi_1^T\hat W_2 + n(t)$, with exponentially decaying $n(t)$.

The algorithm converges to
$\hat W_1(t_f) = [1.7090\ \ 1.3303\ \ -0.1629\ \ 1.7354\ \ -0.1730\ \ 0.4468]^T$
with $\hat W_2(t_f) = \hat W_3(t_f) = \hat W_1(t_f)$, giving the corresponding control $\hat u(x)$ and disturbance $\hat d(x)$ policies.

[Plots: Critic NN parameters; system states.]
Simulation 2 – F-16 aircraft pitch-rate controller with d(t) = 0 (no disturbance)

[Plots: Critic NN parameters with disturbance vs. critic NN parameters without disturbance.]

The algorithm converges FASTER with an opponent: one learns faster with an adversary.
Simulation 3 – Nonlinear System

$\dot x = f(x) + g(x)u + k(x)d, \qquad x\in\mathbb{R}^2$
with
$g(x) = \begin{bmatrix}0\\ \cos(2x_1)+2\end{bmatrix}, \qquad k(x) = \begin{bmatrix}0\\ \sin(4x_1)+2\end{bmatrix}$
and $f(x)$ a converse-optimal construction (containing cubic terms in $x_1, x_2$ and terms in $x_2(\cos(2x_1)+2)^2$ and $x_2(\sin(4x_1)+2)^2$) chosen so that the saddle point below is exact.
$Q = I$, $R = I$, $\gamma^2 = 8$.

Saddle-point solution:
Optimal value: $V^*(x) = \tfrac{1}{4}x_1^4 + \tfrac{1}{2}x_2^2$
Optimal control: $u^*(x) = -\tfrac{1}{2}\big(\cos(2x_1)+2\big)x_2$
Optimal disturbance: $d^*(x) = \frac{1}{2\gamma^2}\big(\sin(4x_1)+2\big)x_2$

Select the VFA basis set $\phi_1(x) = [x_1^2\ \ x_2^2\ \ x_1^4\ \ x_2^4]^T$.

The algorithm converges to
$\hat W_1(t_f) = [0.0008\ \ 0.4999\ \ 0.2429\ \ 0.0032]^T$
(the true weights are $[0\ \ 0.5\ \ 0.25\ \ 0]^T$), with $\hat W_2(t_f) = \hat W_3(t_f) = \hat W_1(t_f)$, giving the corresponding policies $\hat u(x)$ and $\hat d(x)$.

[Plots: Critic NN parameters; states; value-function approximation error; control approximation error; disturbance approximation error.]
6. Avoiding Knowledge of the Drift Term f(x): Integral Reinforcement Learning (work of Draguna Vrabie)

Policy iteration requires repeated solution of the Lyapunov equation
$0 = r(x,\mu(x)) + \frac{\partial V^T}{\partial x}f(x,\mu(x)) = Q(x) + u^TRu + \frac{\partial V^T}{\partial x}\big(f(x)+g(x)u\big) = H\big(x,\mu,\tfrac{\partial V}{\partial x}\big)$

This can be done online, without knowing f(x), using measurements of x(t), u(t) along the system trajectories.

Lemma 1 (Draguna Vrabie). The Lyapunov equation
$0 = \frac{\partial V^T}{\partial x}f(x,u) + r(x,u) = H\big(x,u,\tfrac{\partial V}{\partial x}\big), \qquad V(0)=0$
is equivalent to the CT Bellman equation
$V(x(t)) = \int_t^{t+T}r(x,u)\,d\tau + V(x(t+T)), \qquad V(0)=0$
which solves the Lyapunov equation without knowing f(x,u).

Proof: $\frac{dV(x)}{dt} = \frac{\partial V^T}{\partial x}f(x,u) = -r(x,u)$, so
$\int_t^{t+T}r(x,u)\,d\tau = -\int_t^{t+T}dV(x) = V(x(t)) - V(x(t+T))$

This allows the definition of a temporal-difference error for CT systems:
$e(t) = -V(x(t)) + \int_t^{t+T}r(x,u)\,d\tau + V(x(t+T))$
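Lemma 1 can be checked numerically on a scalar LQR example (the numbers below are illustrative, not from the talk): the quadratic value xᵀPx, with P from the Lyapunov equation, satisfies the CT Bellman equation over any interval [t, t+T]:

```python
import numpy as np
from scipy.integrate import quad

# Scalar plant xdot = a x + b u with fixed policy u = -L x (illustrative numbers)
a, b, q, r, L = -1.0, 1.0, 1.0, 1.0, 0.5
c = a - b * L                     # closed-loop pole (must be negative)
P = -(q + L**2 * r) / (2 * c)     # scalar Lyapunov eq: 2 c P + (q + L^2 r) = 0

x0, t, T = 2.0, 0.3, 0.7
x = lambda s: x0 * np.exp(c * s)  # closed-loop trajectory x(s)

# integral reinforcement over [t, t+T]
rho, _ = quad(lambda s: (q + L**2 * r) * x(s)**2, t, t + T)

lhs = P * x(t)**2                 # V(x(t))
rhs = rho + P * x(t + T)**2       # reinforcement + V(x(t+T))
print(abs(lhs - rhs) < 1e-8)      # True: the CT Bellman equation holds
```

Note that neither a nor b appears in the Bellman check itself; they are used only to generate the trajectory, which in practice would come from measurements.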
Lemma 1 (D. Vrabie), LQR case. With $A_c = A - BL$, the Lyapunov equation
$0 = A_c^TP + PA_c + L^TRL + Q$
is equivalent to
$x^T(t)Px(t) = \int_t^{t+T}x^T\big(Q + L^TRL\big)x\,d\tau + x^T(t+T)Px(t+T)$
which solves the Lyapunov equation without knowing A or B.

Proof:
$\frac{d(x^TPx)}{dt} = x^T\big(A_c^TP + PA_c\big)x = -x^T\big(L^TRL + Q\big)x$
so
$\int_t^{t+T}x^T\big(Q + L^TRL\big)x\,d\tau = -\int_t^{t+T}d(x^TPx) = x^T(t)Px(t) - x^T(t+T)Px(t+T)$

BUT: this does not allow simultaneous tuning of the critic and actor NNs.

Reinforcement learning = predict the cost, observe behavior, update the cost prediction:
$x^T(t)Px(t) = \int_t^{t+T}x^T\big(Q + L^TRL\big)x\,d\tau + x^T(t+T)Px(t+T)$
Integral Reinforcement Learning (Draguna Vrabie)

1. Policy iteration

Policy evaluation (cost update):
$V_{k+1}(x(t)) = \int_t^{t+T}r(x,u_k)\,dt + V_{k+1}(x(t+T))$

For the LQR case, with $u_k(t) = -L_kx(t)$:
$x^T(t)P_{k+1}x(t) = \int_t^{t+T}x^T\big(Q + L_k^TRL_k\big)x\,d\tau + x^T(t+T)P_{k+1}x(t+T)$
A and B do not appear.

Policy improvement (control gain update):
$L_{k+1} = R^{-1}B^TP_{k+1}$
B is needed for the control update, and an initial stabilizing control is needed.
CT Policy Iteration – how to implement it online? Linear systems, quadratic cost (LQR)

Policy evaluation:
$x^T(t)P_{k+1}x(t) = \int_t^{t+T}x^T\big(Q + L_k^TRL_k\big)x\,d\tau + x^T(t+T)P_{k+1}x(t+T)$
i.e.
$x^T(t)P_{k+1}x(t) - x^T(t+T)P_{k+1}x(t+T) = \int_t^{t+T}x^T\big(Q + L_k^TRL_k\big)x\,d\tau$

For a second-order state, writing P in terms of its distinct entries gives the quadratic-basis form
$\bar p_{k+1}^T\big(\bar x(t) - \bar x(t+T)\big) = \int_t^{t+T}x^T\big(Q + L_k^TRL_k\big)x\,d\tau$
with parameter vector $\bar p = [p_{11}\ \ 2p_{12}\ \ p_{22}]^T$ and quadratic basis set $\bar x = [x_1^2\ \ x_1x_2\ \ x_2^2]^T$.

Critic update.
Algorithm Implementation

$x^T(t)P_{k+1}x(t) = \int_t^{t+T}x^T\big(Q + L_k^TRL_k\big)x\,d\tau + x^T(t+T)P_{k+1}x(t+T)$

Use the Kronecker product identity $\operatorname{vec}(ABC) = (C^T\otimes A)\operatorname{vec}(B)$ to set this up in terms of the quadratic basis $\bar x(t) = x(t)\otimes x(t)$.
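The Kronecker identity can be verified directly with NumPy (vec is column-major stacking; matrix sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A_ = rng.standard_normal((2, 3))
B_ = rng.standard_normal((3, 4))
C_ = rng.standard_normal((4, 2))

vec = lambda M: M.flatten(order="F")   # column-stacking vec operator

lhs = vec(A_ @ B_ @ C_)
rhs = np.kron(C_.T, A_) @ vec(B_)
print(np.allclose(lhs, rhs))  # True: vec(ABC) = (C^T kron A) vec(B)
```

Applied to $x^TPx = \operatorname{vec}(P)^T(x\otimes x)$, this is what turns the quadratic form into a linear regression in the unknown entries of P.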
$\bar p_{k+1}^T\,\bar x(t) = \int_t^{t+T}x^T\big(Q + K_k^TRK_k\big)x\,d\tau + \bar p_{k+1}^T\,\bar x(t+T)$
so
$\bar p_{k+1}^T\big(\bar x(t) - \bar x(t+T)\big) = \int_t^{t+T}x^T\big(Q + K_k^TRK_k\big)x\,d\tau \equiv \rho(t,t+T)$, the reinforcement on the time interval [t, t+T].

c.f. linear-in-the-parameters system identification: this has the same form as standard system-ID problems $h_k^T(u,y)\theta = r(u,y)$. The quadratic regression vector is $\bar x(t) - \bar x(t+T)$, and the unknown parameter vector $\bar p_{k+1}$ has $n(n+1)/2$ entries. Solve using RLS or batch LS, with at least $n(n+1)/2$ data points along the system trajectory.
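The steps above can be sketched as batch least-squares policy iteration on simulated trajectory data. In this sketch (the plant is illustrative, not from the talk) the A matrix is used only to generate the "measurements", never in the learning update, and B appears only in the gain update, exactly as in the slides:

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.linalg import solve_continuous_are

# Illustrative stable plant (A is only used to simulate measurements of x(t))
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)
T = 0.1  # reinforcement interval

def quad_basis(x):
    # quadratic basis [x1^2, x1*x2, x2^2] -> parameters [p11, 2*p12, p22]
    return np.array([x[0]**2, x[0]*x[1], x[1]**2])

def collect(L, x0, n_steps):
    """Roll the closed loop forward; return regression rows and reinforcements."""
    Ac, M = A - B @ L, Q + L.T @ R @ L
    rows, rhos, x = [], [], x0
    for _ in range(n_steps):
        rhs = lambda t, z: np.concatenate([Ac @ z[:2], [z[:2] @ M @ z[:2]]])
        z = solve_ivp(rhs, [0, T], np.concatenate([x, [0.0]]),
                      rtol=1e-10, atol=1e-12).y[:, -1]
        x_next, rho = z[:2], z[2]        # x(t+T) and integral cost over [t, t+T]
        rows.append(quad_basis(x) - quad_basis(x_next))
        rhos.append(rho)
        x = x_next
    return rows, rhos

L = np.zeros((1, 2))                     # initial stabilizing gain (A is stable)
for _ in range(8):                       # policy iteration
    rows, rhos = [], []
    for x0 in ([1.0, 0.0], [0.0, 1.0], [1.0, -1.0]):  # several trajectories for PE
        r_, v_ = collect(L, np.array(x0), 4)
        rows += r_; rhos += v_
    p = np.linalg.lstsq(np.array(rows), np.array(rhos), rcond=None)[0]
    P = np.array([[p[0], p[1]/2], [p[1]/2, p[2]]])    # critic update (batch LS)
    L = np.linalg.solve(R, B.T @ P)                   # gain update: L = R^-1 B^T P

print(np.allclose(P, solve_continuous_are(A, B, Q, R), atol=1e-4))  # True
```

Each outer pass is one policy-iteration step: the batch LS plays the role of the critic, and the gain update plays the role of the actor; P converges to the ARE solution without A ever entering the update.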
Algorithm Implementation – Approximate Dynamic Programming (or Adaptive DP)

$V_{k+1}(x(t)) = \int_t^{t+T}\big(Q(x) + u_k^TRu_k\big)\,dt + V_{k+1}(x(t+T))$

Approximate the value by a neural network, $V = W^T\phi(x)$:
$W_{k+1}^T\phi(x(t)) = \int_t^{t+T}\big(Q(x) + u_k^TRu_k\big)\,dt + W_{k+1}^T\phi(x(t+T))$
$W_{k+1}^T\big(\phi(x(t)) - \phi(x(t+T))\big) = \int_t^{t+T}\big(Q(x) + u_k^TRu_k\big)\,dt$

The regression vector is $\phi(x(t)) - \phi(x(t+T))$, and the right-hand side is the reinforcement on the time interval [t, t+T]. Use RLS along the trajectory to get the new weights $W_{k+1}$.

Then find the updated feedback:
$u_{k+1}(x) = h_{k+1}(x) = -\tfrac{1}{2}R^{-1}g^T(x)\frac{\partial V_{k+1}(x(t))}{\partial x} = -\tfrac{1}{2}R^{-1}g^T(x)\,\nabla\phi^TW_{k+1}$
Direct Optimal Adaptive Control for Partially Unknown CT Systems

This is a data-based approach that uses measurements of x(t), u(t) instead of the plant dynamical model.

1. Select an initial stabilizing control policy and apply $u_k(t) = -L_kx(t)$.
2. Find the associated cost; this solves the Lyapunov equation without knowing the dynamics:
$\bar p_{k+1}^T\big(\bar x(t) - \bar x(t+T)\big) = \int_t^{t+T}x^T\big(Q + L_k^TRL_k\big)x\,d\tau \equiv \rho(t,t+T)$
Observe x(t); measure the cost increment (the reinforcement) by adding V as a state, with $\dot V = x^TQx + u_k^TRu_k$; observe the cost integral and x(t+T); update P. Run RLS until convergence to $P_{k+1}$. A is not needed anywhere.
3. Improve the control: $L_{k+1} = R^{-1}B^TP_{k+1}$, and update the control gain to $L_{k+1}$.
Persistence of Excitation

$\bar p_{k+1}^T\big(\bar x(t) - \bar x(t+T)\big) = \int_t^{t+T}x^T\big(Q + L_k^TRL_k\big)x\,d\tau$

The regression vector must be persistently exciting. This relates to the choice of the reinforcement interval T.
Direct Optimal Adaptive Controller (Draguna Vrabie)

Solves the Riccati equation online without knowing the A matrix.

[Diagram: A hybrid continuous/discrete dynamic controller whose internal state is the observed cost over the interval. The critic samples V every T seconds (ZOH), where $\dot V = x^TQx + u^TRu$ is measured along the trajectory of the system $\dot x = Ax + Bu$, $x(0) = x_0$ (a dynamic control system with memory). The critic runs RLS or batch LS to identify the value of the current control; the actor gain K, implementing $u = -Kx$, is updated only after the critic has converged.]

Gain (policy) updates: the control is $u_k(t) = -L_kx(t)$ with piecewise-constant gains $L_k$, $k = 0, 1, 2, \ldots$: continuous-time control with discrete gain updates.

The reinforcement intervals T need not be the same; they can be selected online, in real time, on the fly.
Simulations on: F-16 autopilot; load-frequency control for a power system.

[Plots: System states and controller parameters vs. time; the A matrix is not needed. The critic parameters P(1,1), P(1,2), P(2,2) converge to the steady-state Riccati equation solution, plotted against the optimal values.]

Solves the ARE online without knowing A.
[Diagram: The actor-critic structure mapped onto the brain (ADP – Paul Werbos). The critic comprises the cerebral cortex (motor areas), thalamus, basal ganglia, and hippocampus, which holds a cognitive map of the environment (place cells, theta rhythms at 4-10 Hz). Critic information, the behavior reference, is sent to the actor: cerebellum, brainstem, spinal cord, and inferior olive, producing muscle contraction and movement (motor control at about 200 Hz). Interoceptive and exteroceptive receptors close the loop.]