Reinforcement Learning for Optimal Tracking and Regulation: A Unified Framework

F.L. Lewis, National Academy of Inventors
Moncrief-O’Donnell Chair, UTA Research Institute (UTARI), The University of Texas at Arlington, USA
Qian Ren Consulting Professor, State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang, China

Talk available online at http://www.UTA.edu/UTARI/acs

Supported by: NSF, AFOSR Europe, ONR – Marc Steinberg, US TARDEC – Dariusz Mikulski
Supported by: China NNSF, China Project 111

Work of Reza Modares and Bahare Kiumarsi
Books

- F.L. Lewis, D. Vrabie, and V. Syrmos, Optimal Control, third edition, John Wiley and Sons, New York, 2012. New chapters on Reinforcement Learning and Differential Games.
- D. Vrabie, K. Vamvoudakis, and F.L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles, IET Press, 2012.
- F.L. Lewis, Applied Optimal Control and Estimation: Digital Design and Implementation, Prentice-Hall, New Jersey, TI Series, Feb. 1992.
Tracking a Reference Input
Part 1: Optimal Tracking Control of Continuous-time (CT) Systems

Linear Quadratic Tracking (LQT)

CT Optimal Tracking Control: Standard Solution

System dynamics: $\dot{x}(t) = f(x(t)) + g(x(t)) u(t)$

Objective: $x(t) \to x_d(t)$

Standard solution.

Feedforward part: set $x(t) = x_d(t)$, so that $\dot{x}_d(t) = f(x_d(t)) + g(x_d(t)) u_d(t)$, giving

$u_d(t) = g^{-1}(x_d(t)) \left( \dot{x}_d(t) - f(x_d(t)) \right)$

This contains the dynamics, is noncausal, requires $g(x)$ invertible, and is an offline solution.

Feedback part: with tracking error $e = x - x_d$, minimize

$V(e(t)) = \int_t^{\infty} \left[ e^T Q_e e + u_e^T R u_e \right] d\tau, \qquad u_e^* = \arg\min_{u_e} V(e(t))$

Suboptimal solution: $u^*(t) = u_e^*(t) + u_d(t)$ is suboptimal, because the feedforward and feedback parts are designed separately.
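As a concrete illustration of the feedforward inversion (a minimal sketch; the scalar plant $f(x) = -\sin x$, $g(x) = 1 + 0.1 x^2$ and the sinusoidal $x_d$ are assumptions, not from the talk), note that the whole profile $u_d(\cdot)$ is computed offline from the known trajectory and its derivative, which is exactly why this part of the solution is noncausal:

```python
# Sketch: offline feedforward u_d = g(x_d)^{-1} (xdot_d - f(x_d)).
import numpy as np

f = lambda x: -np.sin(x)          # assumed drift dynamics f(x)
g = lambda x: 1.0 + 0.1 * x**2    # assumed input gain g(x), invertible everywhere

t = np.linspace(0.0, 10.0, 1001)
xd = np.sin(t)                    # desired trajectory x_d(t)
xd_dot = np.cos(t)                # its derivative, known in advance (noncausal)

ud = (xd_dot - f(xd)) / g(xd)     # feedforward input along the whole trajectory
```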
LQT Problem for Continuous-time Systems

Linear system: $\dot{x} = A x + B u$, $y = C x$

Objective: design $u$ such that $y \to y_d$ while minimizing the quadratic performance function

$V(x(t), y_d(t)) = \frac{1}{2} \int_t^{\infty} e^{-\gamma(\tau - t)} \left[ (C x - y_d)^T Q (C x - y_d) + u^T R u \right] d\tau$

with discount factor $\gamma$. The control $u = K x + K' y_d$ contains both feedback and feedforward parts (optimal).
Assumption. The reference trajectory $y_d$ is generated by $\dot{y}_d = F y_d$. Matrix $F$ is not assumed stable. This generates useful command trajectories like the unit step, sinusoidal waveforms, the ramp, etc.
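For instance (standard constructions for these reference types, stated here for concreteness): a unit step is generated by $F = 0$, a sinusoid of frequency $\omega$ by $F = \begin{bmatrix} 0 & \omega \\ -\omega & 0 \end{bmatrix}$, and a ramp by $F = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}$. The latter two are not asymptotically stable, which is why $F$ is not assumed stable and why the discount factor is needed for a finite cost.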
Lemma. For $u = K x + K' y_d$, the value function is quadratic:

$V(x(t), y_d(t)) = \frac{1}{2} \begin{bmatrix} x(t) \\ y_d(t) \end{bmatrix}^T P \begin{bmatrix} x(t) \\ y_d(t) \end{bmatrix}$
CT LQT Bellman Equation

Augmented system state: $X(t) = \begin{bmatrix} x(t)^T & y_d(t)^T \end{bmatrix}^T$

Augmented system: $\dot{X} = \begin{bmatrix} A & 0 \\ 0 & F \end{bmatrix} X + \begin{bmatrix} B \\ 0 \end{bmatrix} u \equiv T X + B_1 u$

Value function:

$V(X(t)) = \frac{1}{2} \int_t^{\infty} e^{-\gamma(\tau - t)} \left[ X^T Q_1 X + u^T R u \right] d\tau = \frac{1}{2} X(t)^T P X(t)$

$Q_1 = C_1^T Q C_1, \qquad C_1 = [\, C \;\; -I \,]$

LQT Bellman equation:

$0 = (T X + B_1 u)^T P X + X^T P (T X + B_1 u) - \gamma X^T P X + X^T Q_1 X + u^T R u$
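Where the Bellman equation comes from (a one-step derivation from the definitions above): differentiating the value integral with respect to $t$ gives $\dot{V} = \gamma V - \frac{1}{2}\left( X^T Q_1 X + u^T R u \right)$, while differentiating the quadratic form $V = \frac{1}{2} X^T P X$ along $\dot{X} = T X + B_1 u$ gives $\dot{V} = \frac{1}{2}\left[ (T X + B_1 u)^T P X + X^T P (T X + B_1 u) \right]$. Equating the two expressions and multiplying by 2 yields the LQT Bellman equation.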
Augmented ARE and Causal Solution to the CT LQT

For the fixed control input $u = K x + K' y_d = K_1 X$, $K_1 = [\, K \;\; K' \,]$, one has the LQT Lyapunov equation

$(T + B_1 K_1)^T P + P (T + B_1 K_1) - \gamma P + Q_1 + K_1^T R K_1 = 0$

Theorem. Causal solution for the LQT problem. The optimal control gain is

$K_1 = -R^{-1} B_1^T P$

where $P$ solves the LQT ARE

$0 = T^T P + P T - \gamma P - P B_1 R^{-1} B_1^T P + Q_1$

$K_1$ gives both feedback and feedforward parts simultaneously.
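A minimal numerical sketch of this theorem (Python/SciPy; the scalar plant, step reference, and weights are assumed for illustration, not values from the talk). It uses the identity $T^T P + P T - \gamma P = (T - \frac{\gamma}{2} I)^T P + P (T - \frac{\gamma}{2} I)$, so the discounted LQT ARE reduces to a standard CARE on a shifted drift matrix:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Assumed illustrative data: scalar plant, step reference generator (F = 0)
A = np.array([[-1.0]]); B = np.array([[1.0]]); C = np.array([[1.0]])
F = np.array([[0.0]])
gamma = 0.1
Q = np.array([[10.0]]); R = np.array([[1.0]])

# Augmented matrices T, B1, and Q1 = C1' Q C1 with C1 = [C  -I]
T  = np.block([[A, np.zeros((1, 1))], [np.zeros((1, 1)), F]])
B1 = np.vstack([B, np.zeros((1, 1))])
C1 = np.hstack([C, -np.eye(1)])
Q1 = C1.T @ Q @ C1

# Discounted LQT ARE == standard CARE with shifted drift T - (gamma/2) I
P  = solve_continuous_are(T - 0.5 * gamma * np.eye(2), B1, Q1, R)
K1 = -np.linalg.inv(R) @ B1.T @ P   # K1 = [K  K']: feedback + feedforward
print(K1)
```

As the theorem states, the single gain $K_1$ returned here contains the feedback gain on $x$ and the feedforward gain on $y_d$ at once.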
Online Solution to the CT LQT ARE: On-policy Integral Reinforcement Learning (IRL)

IRL Bellman equation:

$V(X(t)) = \frac{1}{2} \int_t^{t+\Delta t} e^{-\gamma(\tau - t)} \left[ X^T Q_1 X + u^T R u \right] d\tau + e^{-\gamma \Delta t} V(X(t + \Delta t))$

Algorithm. Online IRL algorithm for LQT.

Policy evaluation (solved by LS):

$X(t)^T P^i X(t) = \int_t^{t+\Delta t} e^{-\gamma(\tau - t)} \left[ X^T Q_1 X + (u^i)^T R u^i \right] d\tau + e^{-\gamma \Delta t} X(t+\Delta t)^T P^i X(t+\Delta t)$

Policy improvement (requires knowledge of $B_1$):

$u^{i+1} = -R^{-1} B_1^T P^i X$

This is a sequential algorithm.

IRL – Draguna Vrabie, 2009
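A hedged sketch of this sequential on-policy loop (not the authors' implementation; the plant, weights, window length, and window count are assumptions carried over from the example above). Each integration window supplies one least-squares equation for the entries of $P^i$, and each improvement step uses $B_1$, exactly as the algorithm requires. In practice a probing signal is added for regressor rank; this simple example happens to be sufficiently excited by its transients.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Same assumed augmented model as in the ARE example above
A = np.array([[-1.0]]); B = np.array([[1.0]]); C = np.array([[1.0]])
F = np.array([[0.0]]); gamma = 0.1
Q = np.array([[10.0]]); R = np.array([[1.0]])
T  = np.block([[A, np.zeros((1, 1))], [np.zeros((1, 1)), F]])
B1 = np.vstack([B, np.zeros((1, 1))])
C1 = np.hstack([C, -np.eye(1)]); Q1 = C1.T @ Q @ C1
n = 2

def xbar(X):
    """Quadratic basis for X'PX: diagonal terms once, off-diagonals doubled."""
    M = np.outer(X, X); M = M + M - np.diag(np.diag(M))
    return M[np.triu_indices(n)]

def window(K1, X0, dt):
    """One data window under the current policy u = K1 X.
    Returns the LS regressor row, the discounted-cost target, and X(t+dt)."""
    def rhs(t, z):
        X = z[:n]; u = K1 @ X
        cost = np.exp(-gamma * t) * (X @ Q1 @ X + u @ R @ u)
        return np.concatenate([T @ X + B1 @ u, [cost]])
    zf = solve_ivp(rhs, [0.0, dt], np.concatenate([X0, [0.0]]), rtol=1e-8).y[:, -1]
    Xe, d = zf[:n], zf[n]
    return xbar(X0) - np.exp(-gamma * dt) * xbar(Xe), d, Xe

K1, dt = np.zeros((1, n)), 0.05
for i in range(10):                         # sequential policy iteration
    rows, ds, X = [], [], np.array([1.0, 1.0])
    for _ in range(20):                     # 20 windows >= n(n+1)/2 = 3 unknowns
        row, d, X = window(K1, X, dt)
        rows.append(row); ds.append(d)
    p = np.linalg.lstsq(np.array(rows), np.array(ds), rcond=None)[0]
    P = np.zeros((n, n)); P[np.triu_indices(n)] = p
    P = P + P.T - np.diag(np.diag(P))       # policy evaluation: P^i by LS
    K1 = -np.linalg.inv(R) @ B1.T @ P       # policy improvement: needs B1
print(K1)
```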
On-policy RL

Target and behavior policy:
- Target policy: the policy that we are learning about.
- Behavior policy: the policy that generates actions and behavior.

In on-policy RL, the target policy and the behavior policy are the same.
Off-policy RL

In off-policy RL, the target policy and the behavior policy are different: the behavior policy drives the system while the target policy is evaluated and improved.
Off-policy IRL

Humans can learn optimal policies while actually applying suboptimal policies.

System: $\dot{x} = f(x) + g(x) u$

Value: $J(x(t)) = \int_t^{\infty} r(x, u)\, d\tau$

On-policy IRL (must know $g(x)$):

$J^{[i]}(x(t)) - J^{[i]}(x(t+T)) = \int_t^{t+T} \left[ Q(x) + u^{[i]T} R u^{[i]} \right] d\tau$

$u^{[i+1]} = -\frac{1}{2} R^{-1} g^T \nabla_x J^{[i]}$

Off-policy IRL (Yu Jiang & Zhong-Ping Jiang, Automatica 2012): rewrite the system as $\dot{x} = f + g u^{[i]} + g (u - u^{[i]})$, which yields

$J^{[i]}(x(t)) - J^{[i]}(x(t+T)) = \int_t^{t+T} \left[ Q(x) + u^{[i]T} R u^{[i]} \right] d\tau + \int_t^{t+T} 2 u^{[i+1]T} R (u - u^{[i]})\, d\tau$

This is a linear equation for $J^{[i]}$ and $u^{[i+1]}$: they can be found simultaneously online from measured data using the Kronecker product and value function approximation (VFA).

1. Completely unknown system dynamics.
2. Can use applied $u(t)$ for:
- disturbance rejection – Z.P. Jiang; R. Song and Lewis, 2015
- robust control – Y. Jiang & Z.P. Jiang, IEEE TCS 2012
- exploring probing noise without bias! – J.Y. Lee, J.B. Park, Y.H. Choi 2012
Online Solution to the CT LQT ARE: Off-policy IRL

Rewrite the augmented dynamics in terms of the target policy gain $K_1^i$:

$\dot{X} = T X + B_1 u = T_i X + B_1 (u - K_1^i X), \qquad T_i \equiv T + B_1 K_1^i$

Differentiating $e^{-\gamma(\tau - t)} X^T P^i X$ along these dynamics and integrating over $[t, t+T]$ gives the off-policy IRL Bellman equation

$e^{-\gamma T} X(t+T)^T P^i X(t+T) - X(t)^T P^i X(t) = \int_t^{t+T} \frac{d}{d\tau}\left( e^{-\gamma(\tau - t)} X^T P^i X \right) d\tau = -\int_t^{t+T} e^{-\gamma(\tau - t)} X^T \left( Q_1 + K_1^{iT} R K_1^i \right) X\, d\tau + 2 \int_t^{t+T} e^{-\gamma(\tau - t)} (u - K_1^i X)^T B_1^T P^i X\, d\tau$

where $B_1^T P^i = -R K_1^{i+1}$.

Algorithm. Online off-policy IRL algorithm for LQT.

Online step: apply a fixed control input and collect some data.

Offline step: policy evaluation and improvement using LS on the collected data:

$e^{-\gamma T} X(t+T)^T P^i X(t+T) - X(t)^T P^i X(t) = -\int_t^{t+T} e^{-\gamma(\tau - t)} X^T \left( Q_1 + K_1^{iT} R K_1^i \right) X\, d\tau - 2 \int_t^{t+T} e^{-\gamma(\tau - t)} (u - K_1^i X)^T R K_1^{i+1} X\, d\tau$

No knowledge of the dynamics is required.
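A hedged sketch of this off-policy variant (again the assumed scalar tracking example, not the authors' code). A fixed exploratory behavior policy generates the data; the least-squares step then identifies the critic kernel $P^i$ and the improved gain $K_1^{i+1}$ simultaneously, and neither $T$ nor $B_1$ appears in the updates:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Same assumed augmented model as above (simulation stands in for measurement)
A = np.array([[-1.0]]); B = np.array([[1.0]]); C = np.array([[1.0]])
F = np.array([[0.0]]); gamma = 0.1
Q = np.array([[10.0]]); R = np.array([[1.0]])
T  = np.block([[A, np.zeros((1, 1))], [np.zeros((1, 1)), F]])
B1 = np.vstack([B, np.zeros((1, 1))])
C1 = np.hstack([C, -np.eye(1)]); Q1 = C1.T @ Q @ C1
n, m = 2, 1

def xbar(X):
    M = np.outer(X, X); M = M + M - np.diag(np.diag(M))
    return M[np.triu_indices(n)]

def behavior(t):
    """Fixed exploratory behavior policy (probing noise), never updated."""
    return np.array([0.5 * np.sin(3 * t) + 0.3 * np.cos(7 * t)])

def window(K, t0, X0, dt):
    """Integrate the state and the two data integrals over [t0, t0 + dt]."""
    def rhs(t, z):
        X = z[:n]; u = behavior(t); w = np.exp(-gamma * (t - t0))
        dd  = w * (X @ (Q1 + K.T @ R @ K) @ X)      # known-cost integrand
        dth = w * np.kron(X, R @ (u - K @ X))       # multiplies vec(K^{i+1})
        return np.concatenate([T @ X + B1 @ u, [dd], dth])
    zf = solve_ivp(rhs, [t0, t0 + dt],
                   np.concatenate([X0, np.zeros(1 + n * m)]), rtol=1e-8).y[:, -1]
    Xe, d, th = zf[:n], zf[n], zf[n + 1:]
    row = np.concatenate([xbar(X0) - np.exp(-gamma * dt) * xbar(Xe), -2.0 * th])
    return row, d, Xe

K, dt = np.zeros((m, n)), 0.1
for i in range(8):
    # Re-integrates along the same behavior trajectory each iteration;
    # a real implementation reuses one stored batch of (X, u) samples.
    rows, ds, t0, X = [], [], 0.0, np.array([1.0, 1.0])
    for _ in range(12):                      # 12 windows >= 3 + 2 unknowns
        row, d, X = window(K, t0, X, dt); t0 += dt
        rows.append(row); ds.append(d)
    z = np.linalg.lstsq(np.array(rows), np.array(ds), rcond=None)[0]
    P = np.zeros((n, n)); P[np.triu_indices(n)] = z[:3]
    P = P + P.T - np.diag(np.diag(P))        # critic: P^i
    K = z[3:].reshape(m, n)                  # actor: K^{i+1}, found jointly
print(K)
```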
Data-based Adaptive Optimal Tracking Scheme (DDO)

[Figure: actor–critic structure. The System receives the control input and the reference, and emits the state and reward; the Critic and Actor networks are updated from measured data.]

Off-policy: solve the critic network ($P^i$) and the actor network ($K_1^{i+1}$) simultaneously from the off-policy tracking IRL Bellman equation

$e^{-\gamma T} X(t+T)^T P^i X(t+T) - X(t)^T P^i X(t) = -\int_t^{t+T} e^{-\gamma(\tau - t)} X^T \left( Q_1 + K_1^{iT} R K_1^i \right) X\, d\tau - 2 \int_t^{t+T} e^{-\gamma(\tau - t)} (u - K_1^i X)^T R K_1^{i+1} X\, d\tau$
Convergence
Lemma. The Bellman equation, the IRL Bellman equation, and the off-policy IRL Bellman equation have the same solution for the value function and the control update law.

Theorem. The IRL algorithm and the off-policy IRL algorithm for the LQT problem converge to the optimal solution.
Output Synchronization of Heterogeneous MAS

Agent dynamics: $\dot{x}_i = A_i x_i + B_i u_i$, $y_i = C_i x_i$

MAS leader: $\dot{\zeta}_0 = S \zeta_0$, $y_0 = R \zeta_0$

Output regulation objective: $y_i(t) - y_0(t) \to 0$ for all $i$. Pioneered by Jie Huang.

Output regulator equations:

$A_i \Pi_i + B_i \Gamma_i = \Pi_i S, \qquad C_i \Pi_i = R$

Control:

$\dot{\zeta}_i = S \zeta_i + c \left[ \sum_{j=1}^{N} a_{ij} (\zeta_j - \zeta_i) + g_i (\zeta_0 - \zeta_i) \right]$

$u_i = K_{1i} (x_i - \Pi_i \zeta_i) + \Gamma_i \zeta_i = K_{1i} x_i + (\Gamma_i - K_{1i} \Pi_i) \zeta_i \equiv K_{1i} x_i + K_{2i} \zeta_i$

Must know the leader's dynamics $S, R$.
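For reference, the regulator equations are linear in $(\Pi_i, \Gamma_i)$ and can be solved by vectorization, using $\mathrm{vec}(AXB) = (B^T \otimes A)\,\mathrm{vec}(X)$. A minimal sketch with an assumed second-order agent and a sinusoid-generating leader (the example matrices are illustrative, not from the talk):

```python
import numpy as np

Ai = np.array([[0.0, 1.0], [-2.0, -3.0]])   # assumed agent dynamics
Bi = np.array([[0.0], [1.0]])
Ci = np.array([[1.0, 0.0]])
S  = np.array([[0.0, 1.0], [-1.0, 0.0]])    # leader: sinusoid generator
Rl = np.array([[1.0, 0.0]])                 # leader output matrix

n, m, q, p = 2, 1, 2, 1
In, Iq = np.eye(n), np.eye(q)
# Stacked rows: vec(Ai Pi - Pi S) + vec(Bi Gamma) = 0 and vec(Ci Pi) = vec(Rl)
M = np.block([
    [np.kron(Iq, Ai) - np.kron(S.T, In), np.kron(Iq, Bi)],
    [np.kron(Iq, Ci),                    np.zeros((p * q, m * q))],
])
rhs = np.concatenate([np.zeros(n * q), Rl.flatten(order="F")])
sol = np.linalg.solve(M, rhs)       # square here; use lstsq in general
Pi    = sol[:n * q].reshape(n, q, order="F")
Gamma = sol[n * q:].reshape(m, q, order="F")
```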
Optimal Output Synchronization of Heterogeneous MAS Using Off-policy IRL
Nageshrao, Modares, Lopes, Babuska, Lewis

Agent dynamics: $\dot{x}_i = A_i x_i + B_i u_i$, $y_i = C_i x_i$; MAS leader: $\dot{\zeta}_0 = S \zeta_0$, $y_0 = R \zeta_0$

Augmented systems. Augmented state: $X_i(t) = \begin{bmatrix} x_i(t)^T & \zeta_0^T \end{bmatrix}^T \in \mathbb{R}^{n_i + p}$

$\dot{X}_i = T_i X_i + B_{1i} u_i, \qquad T_i = \begin{bmatrix} A_i & 0 \\ 0 & S \end{bmatrix}, \quad B_{1i} = \begin{bmatrix} B_i \\ 0 \end{bmatrix}$

Value function (optimal tracker):

$V_i(X_i(t)) = \int_t^{\infty} e^{-\gamma_i(\tau - t)} X_i^T \left( C_{1i}^T Q_i C_{1i} + K_i^T W_i K_i \right) X_i\, d\tau = X_i(t)^T P_i X_i(t)$

Control: $u_i = K_{1i} x_i + K_{2i} \zeta_0 = K_i X_i$, updated by $u_i^{\kappa+1} = K_i^{\kappa+1} X_i$

Do not have to know the leader's dynamics $(S, R)$.
To avoid knowledge of the leader's state $\zeta_0$ in $u_i = K_{1i} x_i + K_{2i} \zeta_0 = K_i X_i$, use an adaptive observer for the leader's state:

$\dot{\hat{\zeta}}_i = \hat{S}_i \hat{\zeta}_i + c \left[ \sum_{j=1}^{N} a_{ij} (\hat{\zeta}_j - \hat{\zeta}_i) + g_i (\zeta_0 - \hat{\zeta}_i) \right]$

$\mathrm{vec}(\dot{\hat{S}}_i) = -\Gamma_{Si} \left( I_q \otimes \hat{\zeta}_i \right) \left[ \sum_{j=1}^{N} a_{ij} (\hat{\zeta}_j - \hat{\zeta}_i) + g_i (\zeta_0 - \hat{\zeta}_i) \right]$

Then use the control

$u_i = K_{1i} x_i + K_{2i} \hat{\zeta}_i \equiv K_i \hat{X}_i, \qquad \hat{X}_i = \begin{bmatrix} x_i^T & \hat{\zeta}_i^T \end{bmatrix}^T$

Note that $\hat{S}_i$ may not converge to the actual leader's matrix $S$. Do not have to know the leader's dynamics $(S, R)$.
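A minimal single-follower sketch of such an adaptive leader observer (this is our own gradient-type arrangement of the update law, chosen so that a standard Lyapunov argument gives $\hat{\zeta} \to \zeta_0$ for the skew-symmetric leader below; the talk's $\Gamma_{Si}$ gain structure, graph coupling terms, and signs may differ):

```python
import numpy as np
from scipy.integrate import solve_ivp

S = np.array([[0.0, 1.0], [-1.0, 0.0]])    # true leader dynamics (unknown to agent)
c, gS, q = 5.0, 10.0, 2                    # assumed coupling and adaptation gains

def rhs(t, z):
    z0, zh = z[:q], z[q:2*q]               # leader state, observer state
    Sh = z[2*q:].reshape(q, q)             # current estimate S-hat
    e = z0 - zh                            # pinning error (g_i = 1, no neighbors)
    dz0 = S @ z0
    dzh = Sh @ zh + c * e                  # observer with adaptive S-hat
    dSh = gS * np.outer(e, zh)             # gradient-type update of S-hat
    return np.concatenate([dz0, dzh, dSh.flatten()])

z_init = np.concatenate([[1.0, 0.0], [0.0, 0.0], np.zeros(q * q)])
sol = solve_ivp(rhs, [0.0, 30.0], z_init, rtol=1e-8)
print(sol.y[q:2*q, -1] - sol.y[:q, -1])    # observer error (should approach 0)
# As the slide warns, S-hat itself need not converge to the true S;
# only the observed leader state does.
```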
The RL Tracker Formulation Implicitly Solves the Output Regulation Equations!

Standard output regulation control: $u_i = K_{1i} (x_i - \Pi_i \zeta_i) + \Gamma_i \zeta_i$

Our RL tracker control: $u_i = K_{1i} x_i + K_{2i} \hat{\zeta}_i \equiv K_i \hat{X}_i$