Transcript of slides: Reinforcement Learning for Optimal Tracking and Regulation: A Unified Framework

Page 1:

Moncrief-O’Donnell Chair, UTA Research Institute (UTARI), The University of Texas at Arlington, USA

and

F.L. Lewis, National Academy of Inventors

Talk available online at http://www.UTA.edu/UTARI/acs

Reinforcement Learning for Optimal Tracking and Regulation: A Unified Framework

Qian Ren Consulting Professor, State Key Laboratory of Synthetical Automation for Process Industries

Northeastern University, Shenyang, China

Supported by: NSF, AFOSR Europe, ONR – Marc Steinberg, US TARDEC – Dariusz Mikulski

Supported by: China NNSF, China Project 111

Work of Reza Modares and Bahare Kiumarsi

Page 2:

Books

F.L. Lewis, D. Vrabie, and V. Syrmos, Optimal Control, third edition, John Wiley and Sons, New York, 2012.
New chapters on: Reinforcement Learning, Differential Games

D. Vrabie, K. Vamvoudakis, and F.L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles, IET Press, 2012.

Page 3:

F.L. Lewis, Applied Optimal Control and Estimation: Digital Design and Implementation, Prentice-Hall, New Jersey, TI Series, Feb. 1992.

Tracking a Reference Input

Page 4:

Part 1: Optimal Tracking Control of Continuous-time (CT) Systems

Page 5:

Linear Quadratic Tracking (LQT)
CT Optimal Tracking Control: Standard Solution

System dynamics: $\dot{x}(t) = f(x(t)) + g(x(t))\,u(t)$

Objective: $x(t) \to x_d(t)$

Standard solution.

Feedforward part: set $x(t) = x_d(t)$, with $\dot{x}_d(t) = f(x_d(t)) + g(x_d(t))\,u_d(t)$, so that
$$u_d(t) = g^{-1}(x_d(t))\,\big(\dot{x}_d(t) - f(x_d(t))\big)$$
This contains the dynamics, is noncausal, requires $g(x)$ invertible, and is an offline solution.

Feedback part: with tracking error $e_d = x - x_d$,
$$V(e_d(t)) = \int_t^{\infty} \big[\,e_d^T Q_d\, e_d + u_e^T R\, u_e\,\big]\, d\tau, \qquad u_e^* = \arg\min_{u_e} V(e_d(t))$$

Suboptimal solution: $u^*(t) = u_e^*(t) + u_d(t)$ (suboptimal).
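As a quick numerical illustration of this feedforward/feedback split, here is a minimal sketch for a linear plant with an invertible input matrix; the plant matrices, weights, and reference are my own hypothetical choices, not from the slides.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Hypothetical linear plant x_dot = A x + B u with B square and invertible,
# so the feedforward u_d = B^{-1}(x_d_dot - A x_d) is well defined.
A  = np.array([[0.0, 1.0], [-2.0, -3.0]])
B  = np.eye(2)
Qe = np.diag([10.0, 1.0])      # error weight Q_d
R  = np.eye(2)                 # control weight R

# Reference trajectory x_d(t) and its derivative (a sinusoid, for illustration).
x_d     = lambda t: np.array([np.sin(t), np.cos(t)])
x_d_dot = lambda t: np.array([np.cos(t), -np.sin(t)])

# Feedforward part: noncausal/offline, needs the model and an invertible g(x) (= B here).
u_d = lambda t: np.linalg.solve(B, x_d_dot(t) - A @ x_d(t))

# Feedback part: LQR on the error dynamics e_d_dot = A e_d + B u_e.
P = solve_continuous_are(A, B, Qe, R)
K = np.linalg.solve(R, B.T @ P)              # u_e = -K e_d
u = lambda t, x: u_d(t) - K @ (x - x_d(t))   # combined (suboptimal) tracking control

print(u(0.0, np.zeros(2)))
```

This mirrors the point of the slide: the feedforward term needs the full dynamics offline, while only the error feedback comes from an optimal control problem, so the combined controller is suboptimal for the original tracking cost.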

Page 6:

LQT Problem for Continuous-time Systems

Linear system: $\dot{x} = A x + B u$, $\quad y = C x$

Objective: design $u = K x + K' y_d$ such that $y \to y_d$ while minimizing
$$V = \frac{1}{2}\int_t^{\infty} e^{-\gamma(\tau - t)}\Big[(C x - y_d)^T Q\,(C x - y_d) + u^T R\, u\Big]\, d\tau$$
where $e^{-\gamma(\tau - t)}$ is the discount factor.

Both the feedback and feedforward parts are obtained (optimal).

Page 7:

Quadratic Performance Function

Assumption. The reference trajectory $y_d$ is generated by the command generator
$$\dot{y}_d = F\, y_d$$
Matrix $F$ is not assumed stable. This generates useful command trajectories such as the unit step, sinusoidal waveforms, ramps, etc. (examples below).

Lemma. For $u = K x + K' y_d$, the value function is quadratic:
$$V(x(t), y_d(t)) = \frac{1}{2}\begin{bmatrix} x(t) \\ y_d(t) \end{bmatrix}^T P \begin{bmatrix} x(t) \\ y_d(t) \end{bmatrix}$$
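For concreteness, typical generator matrices are (illustrative choices; where needed, the reference vector $y_d$ is taken to contain the signal together with a companion state):
$$\text{unit step: } F = 0; \qquad \text{ramp } \big(y_d = [\,r \;\; \dot r\,]^T\big):\; F = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}; \qquad \text{sinusoid } \sin\omega t \;\big(y_d = [\,\sin\omega t \;\; \cos\omega t\,]^T\big):\; F = \begin{bmatrix} 0 & \omega \\ -\omega & 0 \end{bmatrix}$$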

Page 8:

CT LQT Bellman Equation

Augmented system state: $X(t) = \big[\,x(t)^T \;\; y_d(t)^T\,\big]^T$

Augmented system:
$$\dot{X} = \begin{bmatrix} A & 0 \\ 0 & F \end{bmatrix} X + \begin{bmatrix} B \\ 0 \end{bmatrix} u \equiv T X + B_1 u$$

Value function:
$$V(X(t)) = \frac{1}{2}\int_t^{\infty} e^{-\gamma(\tau - t)}\big[\,X^T Q_1 X + u^T R\, u\,\big]\, d\tau = \frac{1}{2} X(t)^T P\, X(t), \qquad Q_1 = C_1^T Q\, C_1, \quad C_1 = [\,C \;\; -I\,]$$

LQT Bellman equation:
$$0 = (T X + B_1 u)^T P X + X^T P\,(T X + B_1 u) - \gamma\, X^T P X + X^T Q_1 X + u^T R\, u$$

Page 9:

Augmented ARE and Causal Solution to the CT LQT

For the fixed control input $u = K x + K' y_d = K_1 X$, with $K_1 = [\,K \;\; K'\,]$, one has the LQT Lyapunov equation
$$(T + B_1 K_1)^T P + P\,(T + B_1 K_1) - \gamma P + Q_1 + K_1^T R\, K_1 = 0$$

Theorem (causal solution of the LQT problem). The optimal control gain is
$$K_1 = -R^{-1} B_1^T P$$
where $P$ solves the LQT ARE
$$0 = T^T P + P\,T - \gamma P - P B_1 R^{-1} B_1^T P + Q_1$$

$K_1$ gives both the feedback and feedforward parts simultaneously.
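A minimal numerical sketch of this causal solution, for a hypothetical scalar plant tracking a constant reference (all values are my own choices). It uses the fact that the $-\gamma P$ term can be absorbed by shifting $T$ to $T - \tfrac{\gamma}{2}I$, so a standard ARE solver applies.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Hypothetical scalar plant and constant (step) reference generator, for illustration only.
A = np.array([[-1.0]]); B = np.array([[1.0]]); C = np.array([[1.0]])
F = np.array([[0.0]])                      # y_d_dot = F y_d = 0 (step reference)
Q = np.array([[10.0]]); R = np.array([[1.0]])
gamma = 0.2                                # discount factor

# Augmented system X = [x; y_d]:  X_dot = T X + B1 u,  Q1 = C1' Q C1,  C1 = [C  -I]
T  = np.block([[A, np.zeros((1, 1))], [np.zeros((1, 1)), F]])
B1 = np.vstack([B, np.zeros((1, 1))])
C1 = np.hstack([C, -np.eye(1)])
Q1 = C1.T @ Q @ C1

# Discounted LQT ARE: 0 = T'P + PT - gamma*P - P B1 R^{-1} B1' P + Q1.
# Since T'P + PT - gamma*P = (T - gamma/2 I)'P + P (T - gamma/2 I), a standard solver applies.
P  = solve_continuous_are(T - 0.5 * gamma * np.eye(2), B1, Q1, R)
K1 = -np.linalg.solve(R, B1.T @ P)         # K1 = -R^{-1} B1' P = [K  K']
print("P =\n", P, "\nK1 =", K1)
```

The single gain $K_1$ acting on the augmented state contains both the error feedback on $x$ and the feedforward action on $y_d$, which is the point of the causal formulation.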

Page 10:

Online Solution to the CT LQT ARE: On-policy Integral Reinforcement Learning (IRL)

IRL Bellman equation:
$$V(X(t)) = \frac{1}{2}\int_t^{t+\Delta t} e^{-\gamma(\tau - t)}\big[\,X^T Q_1 X + u^T R\, u\,\big]\, d\tau + e^{-\gamma \Delta t}\, V(X(t+\Delta t))$$

Algorithm (online IRL algorithm for LQT).

Policy evaluation (solved by least squares, LS):
$$X(t)^T P^i X(t) = \int_t^{t+\Delta t} e^{-\gamma(\tau - t)}\big[\,X^T Q_1 X + u^{i\,T} R\, u^i\,\big]\, d\tau + e^{-\gamma \Delta t}\, X(t+\Delta t)^T P^i X(t+\Delta t)$$

Policy improvement:
$$u^{i+1} = -R^{-1} B_1^T P^i X$$
Requires knowledge of $B_1$.

This is a sequential algorithm (policy evaluation, then policy improvement).

IRL – Draguna Vrabie 2009
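The sketch below runs this on-policy IRL loop on the same hypothetical scalar tracking example used above: under the current policy $u = K^i X$, data windows of length $\Delta t$ are collected, the quadratic value $X^T P^i X$ is identified by least squares, and the gain is then improved using $B_1$ (which this on-policy algorithm must know). Every numerical choice is mine, not from the slides.

```python
import numpy as np
from scipy.linalg import expm, solve_continuous_are

# Hypothetical scalar plant tracking a constant reference (illustration only).
A = np.array([[-1.0]]); B = np.array([[1.0]]); C = np.array([[1.0]])
F = np.array([[0.0]]); Q = np.array([[10.0]]); R = np.array([[1.0]]); gamma = 0.2
T  = np.block([[A, np.zeros((1, 1))], [np.zeros((1, 1)), F]])
B1 = np.vstack([B, np.zeros((1, 1))])
C1 = np.hstack([C, -np.eye(1)]); Q1 = C1.T @ Q @ C1

def qb(X):                      # quadratic basis: X'PX = p[0]*X1^2 + p[1]*X1*X2 + p[2]*X2^2
    return np.array([X[0]**2, X[0]*X[1], X[1]**2])

dt, sub, n_win = 0.05, 50, 60   # window length, sub-steps per window, windows per iteration
h = dt / sub
K = np.zeros((1, 2))            # initial admissible gain (cost finite thanks to the discount)

for i in range(8):              # sequential policy iteration
    Ad = expm((T + B1 @ K) * h) # on-policy: behavior = target policy u = K X
    W  = Q1 + K.T @ R @ K
    Phi, d = [], []
    X = np.array([1.0, 1.0])    # (re)start the data-collection trajectory
    for j in range(n_win):
        X0, cost = X.copy(), 0.0
        for s in range(sub):    # discounted running cost over the window (Riemann sum)
            cost += np.exp(-gamma * s * h) * (X @ W @ X) * h
            X = Ad @ X
        Phi.append(qb(X0) - np.exp(-gamma * dt) * qb(X))   # policy-evaluation regressor
        d.append(cost)
    p = np.linalg.lstsq(np.asarray(Phi), np.asarray(d), rcond=None)[0]
    P = np.array([[p[0], p[1] / 2], [p[1] / 2, p[2]]])
    K = -np.linalg.solve(R, B1.T @ P)                      # policy improvement: needs B1

P_are = solve_continuous_are(T - 0.5 * gamma * np.eye(2), B1, Q1, R)
print("IRL P:\n", P, "\nARE P:\n", P_are)                  # iterates should approach the ARE solution
```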

Page 11:

On-policy RL

[Block diagram: a single policy acts on the system and reference; it serves as both the target and the behavior policy.]

Target policy: the policy that we are learning about.
Behavior policy: the policy that generates actions and behavior.

In on-policy RL, the target policy and the behavior policy are the same.

Page 12:

Off-policy RL

[Block diagram: a behavior policy drives the system and reference, while a separate target policy is being learned.]

In off-policy RL, the target policy and the behavior policy are different.

Page 13:

Off-policy IRL: humans can learn optimal policies while actually applying suboptimal policies.

System: $\dot{x} = f(x) + g(x)u$

Value: $J(x) = \int_t^{\infty} r(x, u)\, d\tau$

On-policy IRL:
$$J^{[i]}(x(t)) = \int_t^{t+T}\big[\,Q(x) + u^{[i]T} R\, u^{[i]}\,\big]\, d\tau + J^{[i]}(x(t+T)), \qquad u^{[i+1]} = -\tfrac{1}{2} R^{-1} g^T \nabla_x J^{[i]}$$
Must know $g(x)$.

Off-policy IRL: rewrite the system as $\dot{x} = f(x) + g(x)u^{[i]} + g(x)\big(u - u^{[i]}\big)$, where $u$ is the applied (behavior) input. Then
$$J^{[i]}(x(t)) - J^{[i]}(x(t+T)) = \int_t^{t+T}\big[\,Q(x) + u^{[i]T} R\, u^{[i]}\,\big]\, d\tau + \int_t^{t+T} 2\, u^{[i+1]T} R\,\big(u - u^{[i]}\big)\, d\tau$$

This is a linear equation for $J^{[i]}$ and $u^{[i+1]}$; they can be found simultaneously online from measured data using the Kronecker product and value function approximation (VFA).

1. Completely unknown system dynamics.
2. Can use the applied $u(t)$ for:
disturbance rejection – Z.P. Jiang; R. Song and Lewis, 2015
robust control – Y. Jiang & Z.P. Jiang, IEEE TCS 2012
exploring probing noise, without bias! – J.Y. Lee, J.B. Park, Y.H. Choi 2012

DDO

Yu Jiang & Zhong-Ping Jiang, Automatica 2012

Page 14:

Online Solution to the CT LQT ARE: Off-policy IRL

Write the augmented dynamics in terms of the target policy $u^i = K^i X$ and the applied (behavior) input $u$:
$$\dot{X} = T X + B_1 u = T_i X + B_1\big(u - K^i X\big), \qquad T_i \equiv T + B_1 K^i$$

Off-policy IRL Bellman equation:
$$e^{-\gamma \Delta t} X(t+\Delta t)^T P^i X(t+\Delta t) - X(t)^T P^i X(t) = -\int_t^{t+\Delta t} e^{-\gamma(\tau - t)} X^T\big(Q_1 + K^{i\,T} R\, K^i\big) X\, d\tau - \int_t^{t+\Delta t} e^{-\gamma(\tau - t)}\, 2\big(u - K^i X\big)^T R\, K^{i+1} X\, d\tau$$
where $R\, K^{i+1} = -B_1^T P^i$.

Algorithm (online off-policy IRL algorithm for LQT).
Online step: apply a fixed (behavior) control input and collect data.
Offline step: policy evaluation and improvement, solving the off-policy IRL Bellman equation above for $P^i$ and $K^{i+1}$ by LS on the collected data.

Because $B_1^T P^i$ enters only through the unknown $R\, K^{i+1}$, no knowledge of the dynamics is required.
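Here is a minimal sketch of the off-policy variant on the same hypothetical example: data are collected once under a fixed behavior input (probing noise), and then $P^i$ and $K^{i+1}$ are extracted jointly by least squares from the stored data, without using $T$ or $B_1$ in the learning step. All numerical choices are mine.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Same hypothetical scalar tracking example as before (illustration only).
A = np.array([[-1.0]]); B = np.array([[1.0]]); C = np.array([[1.0]])
F = np.array([[0.0]]); Q = np.array([[10.0]]); R = np.array([[1.0]]); gamma = 0.2
T  = np.block([[A, np.zeros((1, 1))], [np.zeros((1, 1)), F]])
B1 = np.vstack([B, np.zeros((1, 1))])
C1 = np.hstack([C, -np.eye(1)]); Q1 = C1.T @ Q @ C1

def qb(X):
    return np.array([X[0]**2, X[0]*X[1], X[1]**2])

# --- Online step: apply a fixed behavior input (probing noise) and store the data. ---
dt, sub, n_win = 0.1, 100, 80
h = dt / sub
rng = np.random.default_rng(0)
freqs = rng.uniform(0.5, 8.0, size=10)
u_beh = lambda t: 0.5 * np.sum(np.sin(freqs * t))       # probing noise for excitation
X, t, data = np.array([1.0, 1.0]), 0.0, []
for j in range(n_win):
    Xs, us = [], []
    for s in range(sub):
        u = u_beh(t)
        Xs.append(X.copy()); us.append(u)
        X = X + h * (T @ X + B1.flatten() * u)          # Euler step: simulation only;
        t += h                                          # the learner never uses T or B1
    data.append((np.array(Xs), np.array(us), X.copy())) # stored window plus end state

# --- Offline step: iterate LS on the stored data; P^i and K^{i+1} are found jointly. ---
w = np.exp(-gamma * h * np.arange(sub)) * h             # discounted quadrature weights
K = np.zeros((1, 2))
for i in range(8):
    rows, rhs = [], []
    for Xs, us, Xend in data:
        d_j = np.sum(w * np.einsum('sk,kl,sl->s', Xs, Q1 + K.T @ R @ K, Xs))
        e_j = us - (Xs @ K.T).ravel()                   # u - K^i X along the window
        c_j = 2.0 * R[0, 0] * (w * e_j) @ Xs            # multiplies the unknown K^{i+1}
        phi = qb(Xs[0]) - np.exp(-gamma * dt) * qb(Xend)
        rows.append(np.concatenate([phi, -c_j])); rhs.append(d_j)
    sol = np.linalg.lstsq(np.asarray(rows), np.asarray(rhs), rcond=None)[0]
    P = np.array([[sol[0], sol[1] / 2], [sol[1] / 2, sol[2]]])
    K = sol[3:].reshape(1, 2)                           # K^{i+1}, found without using B1

P_are = solve_continuous_are(T - 0.5 * gamma * np.eye(2), B1, Q1, R)
print("off-policy P:\n", P, "\nARE P:\n", P_are)        # iterates should approach the ARE solution
```

Note the design point the slides emphasize: because $B_1^T P^i$ appears only through $R K^{i+1}$, treating $K^{i+1}$ as an extra unknown in the regression removes any need for the system matrices.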

Page 15:

[Block diagram: the reference and system produce the state, control input, and reward; a Critic network estimates $P^i$ and an Actor network estimates $K^{i+1}$.]

Off-policy Tracking IRL Bellman equation (as on the previous slide):
$$e^{-\gamma \Delta t} X(t+\Delta t)^T P^i X(t+\Delta t) - X(t)^T P^i X(t) = -\int_t^{t+\Delta t} e^{-\gamma(\tau - t)} X^T\big(Q_1 + K^{i\,T} R\, K^i\big) X\, d\tau - \int_t^{t+\Delta t} e^{-\gamma(\tau - t)}\, 2\big(u - K^i X\big)^T R\, K^{i+1} X\, d\tau$$

Data-based adaptive optimal tracking scheme (DDO). Off-policy: solve the critic network ($P^i$) and the actor network ($K^{i+1}$) simultaneously.

Page 16:

Convergence

Lemma. The Bellman equation, the IRL Bellman equation, and the off-policy IRL Bellman equation have the same solution for the value function and the control update law.

Theorem. The IRL algorithm and the off-policy IRL algorithm for the LQT problem converge to the optimal solution.

Page 17:

Output Synchronization of Heterogeneous MAS

MAS agent dynamics: $\dot{x}_i = A_i x_i + B_i u_i$, $\quad y_i = C_i x_i$

Leader: $\dot{\zeta}_0 = S\, \zeta_0$, $\quad y_0 = R\, \zeta_0$

Output regulation objective: $y_i(t) - y_0(t) \to 0$ for all $i$.

Pioneered by Jie Huang.

Output regulator equations:
$$A_i \Pi_i + B_i \Gamma_i = \Pi_i S, \qquad C_i \Pi_i = R$$

Control (distributed observer plus state feedback):
$$\dot{\zeta}_i = S\, \zeta_i + c\Big[\sum_{j=1}^{N} a_{ij}\,(\zeta_j - \zeta_i) + g_i\,(\zeta_0 - \zeta_i)\Big]$$
$$u_i = K_{1i}\,(x_i - \Pi_i \zeta_i) + \Gamma_i \zeta_i = K_{1i} x_i + \big(\Gamma_i - K_{1i}\Pi_i\big)\zeta_i \equiv K_{1i} x_i + K_{2i}\,\zeta_i$$

Must know the leader's dynamics $S, R$.
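Since the output regulator (Francis) equations are linear in $(\Pi_i, \Gamma_i)$, they can be solved by a vectorized least-squares problem. Below is a minimal sketch for one hypothetical scalar agent tracking a sinusoidal leader; all matrix values are my own example choices.

```python
import numpy as np

# Hypothetical agent i and leader (illustration only).
Ai = np.array([[-1.0]]); Bi = np.array([[1.0]]); Ci = np.array([[1.0]])      # agent
S  = np.array([[0.0, 1.0], [-1.0, 0.0]]); Rm = np.array([[1.0, 0.0]])        # leader: zeta0_dot = S zeta0, y0 = Rm zeta0

n, m, p, q = Ai.shape[0], Bi.shape[1], Rm.shape[0], S.shape[0]

# Regulator equations: Ai*Pi + Bi*Gi = Pi*S and Ci*Pi = Rm, unknowns Pi (n x q), Gi (m x q).
# Vectorize using vec(A X B) = (B' kron A) vec(X), with column-stacked vec().
Iq = np.eye(q)
row1 = np.hstack([np.kron(Iq, Ai) - np.kron(S.T, np.eye(n)), np.kron(Iq, Bi)])  # Ai*Pi - Pi*S + Bi*Gi = 0
row2 = np.hstack([np.kron(Iq, Ci), np.zeros((p * q, m * q))])                    # Ci*Pi = Rm
M    = np.vstack([row1, row2])
rhs  = np.concatenate([np.zeros(n * q), Rm.flatten(order='F')])

sol = np.linalg.lstsq(M, rhs, rcond=None)[0]
Pi  = sol[:n * q].reshape(n, q, order='F')
Gi  = sol[n * q:].reshape(m, q, order='F')
print("Pi =", Pi, " Gi =", Gi)     # for this example, Pi = [1, 0] and Gi = [1, 1]
```

This is the model-based baseline that the RL tracker on the following slides avoids computing explicitly.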

Page 18:

Optimal Output Synchronization of Heterogeneous MAS Using Off-policy IRL
Nageshrao, Modares, Lopes, Babuska, Lewis

MAS agent dynamics: $\dot{x}_i = A_i x_i + B_i u_i$, $\quad y_i = C_i x_i$

Leader: $\dot{\zeta}_0 = S\, \zeta_0$, $\quad y_0 = R\, \zeta_0$

Augmented systems (one per agent):
$$X_i(t) = \big[\,x_i(t)^T \;\; \zeta_0(t)^T\,\big]^T \in \mathbb{R}^{n_i + p}, \qquad \dot{X}_i = T_i X_i + B_{1i} u_i, \qquad T_i = \begin{bmatrix} A_i & 0 \\ 0 & S \end{bmatrix}, \quad B_{1i} = \begin{bmatrix} B_i \\ 0 \end{bmatrix}$$

Optimal tracker: for the policy $u_i = K_{1i} x_i + K_{2i}\,\zeta_0 = K_i X_i$, the discounted value is
$$V_i(X_i(t)) = \int_t^{\infty} e^{-\gamma(\tau - t)} X_i^T\big(C_{1i}^T Q_i C_{1i} + K_i^T W_i K_i\big) X_i\, d\tau = X_i(t)^T P_i X_i(t)$$
with policy updates $u_i^{\kappa+1} = K_i^{\kappa+1} X_i$.

Do not have to know the leader's dynamics $(S, R)$.

Page 19:

To avoid needing the leader's state $\zeta_0$ in the control $u_i = K_{1i} x_i + K_{2i}\,\zeta_0 = K_i X_i$, use an adaptive observer for the leader's state:
$$\dot{\zeta}_i = \hat{S}_i\, \zeta_i + c\Big[\sum_{j=1}^{N} a_{ij}\,(\zeta_j - \zeta_i) + g_i\,(\zeta_0 - \zeta_i)\Big]$$
$$\operatorname{vec}\big(\dot{\hat{S}}_i\big) = -\Gamma_{S_i}\big(I_q \otimes \zeta_i\big)\Big[\sum_{j=1}^{N} a_{ij}\,(\zeta_j - \zeta_i) + g_i\,(\zeta_0 - \zeta_i)\Big]$$

Then use the control
$$u_i = K_{1i} x_i + K_{2i}\,\zeta_i \equiv K_i \hat{X}_i, \qquad \hat{X}_i = \begin{bmatrix} x_i \\ \zeta_i \end{bmatrix}$$

Note that $\hat{S}_i$ may not converge to the actual leader matrix $S$.

Do not have to know the leader's dynamics $(S, R)$.

Page 20:

The RL Tracker Formulation Implicitly Solves the o/p Reg Equations!

Standard o/p regulation control:
$$u_i = K_{1i}\,(x_i - \Pi_i \zeta_i) + \Gamma_i \zeta_i$$

Our RL tracker control:
$$u_i = K_{1i} x_i + K_{2i}\,\zeta_i \equiv K_i \hat{X}_i, \qquad \hat{X}_i = \begin{bmatrix} x_i \\ \zeta_i \end{bmatrix}$$
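A one-line check, using the gain definition $K_{2i} = \Gamma_i - K_{1i}\Pi_i$ from the output-regulation controller on Page 17, makes the equivalence explicit:
$$K_{1i} x_i + K_{2i}\,\zeta_i = K_{1i} x_i + \big(\Gamma_i - K_{1i}\Pi_i\big)\zeta_i = K_{1i}\big(x_i - \Pi_i \zeta_i\big) + \Gamma_i \zeta_i$$
so learning the single augmented gain $K_i = [\,K_{1i} \;\; K_{2i}\,]$ implicitly encodes the regulator-equation solution $(\Pi_i, \Gamma_i)$ without solving those equations explicitly.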

Page 21: