Transcript of slides: Reinforcement Learning for Optimal Tracking and Regulation: A Unified Framework

Page 1:

Moncrief-O’Donnell Chair, UTA Research Institute (UTARI), The University of Texas at Arlington, USA

and

F.L. Lewis, National Academy of Inventors

Talk available online at http://www.UTA.edu/UTARI/acs

Reinforcement Learning for Optimal Tracking and Regulation: A Unified Framework

Qian Ren Consulting Professor, State Key Laboratory of Synthetical Automation for Process Industries

Northeastern University, Shenyang, China

Supported by: NSF, AFOSR Europe, ONR – Marc Steinberg, US TARDEC – Dariusz Mikulski

Supported by: China NNSF, China Project 111

Work of Reza Modares and Bahare Kiumarsi

Page 2:

Books

F.L. Lewis, D. Vrabie, and V. Syrmos, Optimal Control, third edition, John Wiley and Sons, New York, 2012.
New chapters on: Reinforcement Learning, Differential Games

D. Vrabie, K. Vamvoudakis, and F.L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles, IET Press, 2012.

Page 3:

F.L. Lewis, Applied Optimal Control and Estimation: Digital Design and Implementation, Prentice-Hall, New Jersey, TI Series, Feb. 1992.

Tracking a Reference Input

Page 4:

Part 1: Optimal Tracking Control of Continuous-time (CT) Systems

Page 5:

Linear Quadratic Tracking (LQT)
CT Optimal Tracking Control: Standard Solution

System dynamics: $\dot{x}(t) = f(x(t)) + g(x(t))\,u(t)$

Objective: $x(t) \to x_d(t)$

Standard solution.

Feedforward part: set $x(t) = x_d(t)$, with $\dot{x}_d(t) = f(x_d(t)) + g(x_d(t))\,u_d(t)$, so that
$$u_d(t) = g^{-1}(x_d(t))\,\big(\dot{x}_d(t) - f(x_d(t))\big)$$
This contains the dynamics, is noncausal, requires $g(x)$ invertible, and is an offline solution.

Feedback part: with tracking error $e_d = x - x_d$,
$$V(e_d(t)) = \int_t^{\infty} \big[\,e_d^T Q_d\, e_d + u_e^T R\, u_e\,\big]\, d\tau, \qquad u_e^* = \arg\min_{u_e} V(e_d(t))$$

Suboptimal solution: $u^*(t) = u_e^*(t) + u_d(t)$ (suboptimal).
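As a quick numerical illustration of this feedforward/feedback split, here is a minimal sketch for a linear plant with an invertible input matrix; the plant matrices, weights, and reference are my own hypothetical choices, not from the slides.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Hypothetical linear plant x_dot = A x + B u with B square and invertible,
# so the feedforward u_d = B^{-1}(x_d_dot - A x_d) is well defined.
A  = np.array([[0.0, 1.0], [-2.0, -3.0]])
B  = np.eye(2)
Qe = np.diag([10.0, 1.0])      # error weight Q_d
R  = np.eye(2)                 # control weight R

# Reference trajectory x_d(t) and its derivative (a sinusoid, for illustration).
x_d     = lambda t: np.array([np.sin(t), np.cos(t)])
x_d_dot = lambda t: np.array([np.cos(t), -np.sin(t)])

# Feedforward part: noncausal/offline, needs the model and an invertible g(x) (= B here).
u_d = lambda t: np.linalg.solve(B, x_d_dot(t) - A @ x_d(t))

# Feedback part: LQR on the error dynamics e_d_dot = A e_d + B u_e.
P = solve_continuous_are(A, B, Qe, R)
K = np.linalg.solve(R, B.T @ P)              # u_e = -K e_d
u = lambda t, x: u_d(t) - K @ (x - x_d(t))   # combined (suboptimal) tracking control

print(u(0.0, np.zeros(2)))
```

This mirrors the point of the slide: the feedforward term needs the full dynamics offline, while only the error feedback comes from an optimal control problem, so the combined controller is suboptimal for the original tracking cost.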

Page 6:

LQT Problem for Continuous-time Systems

Linear system: $\dot{x} = A x + B u$, $\quad y = C x$

Objective: design $u = K x + K' y_d$ such that $y \to y_d$ while minimizing
$$V = \frac{1}{2}\int_t^{\infty} e^{-\gamma(\tau - t)}\Big[(C x - y_d)^T Q\,(C x - y_d) + u^T R\, u\Big]\, d\tau$$
where $e^{-\gamma(\tau - t)}$ is the discount factor.

Both the feedback and feedforward parts are obtained (optimal).

Page 7:

Quadratic Performance Function

Assumption. The reference trajectory $y_d$ is generated by the command generator
$$\dot{y}_d = F\, y_d$$
Matrix $F$ is not assumed stable. This generates useful command trajectories such as the unit step, sinusoidal waveforms, ramps, etc. (examples below).

Lemma. For $u = K x + K' y_d$, the value function is quadratic:
$$V(x(t), y_d(t)) = \frac{1}{2}\begin{bmatrix} x(t) \\ y_d(t) \end{bmatrix}^T P \begin{bmatrix} x(t) \\ y_d(t) \end{bmatrix}$$
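For concreteness, typical generator matrices are (illustrative choices; where needed, the reference vector $y_d$ is taken to contain the signal together with a companion state):
$$\text{unit step: } F = 0; \qquad \text{ramp } \big(y_d = [\,r \;\; \dot r\,]^T\big):\; F = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}; \qquad \text{sinusoid } \sin\omega t \;\big(y_d = [\,\sin\omega t \;\; \cos\omega t\,]^T\big):\; F = \begin{bmatrix} 0 & \omega \\ -\omega & 0 \end{bmatrix}$$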

Page 8:

CT LQT Bellman Equation

Augmented system state: $X(t) = \big[\,x(t)^T \;\; y_d(t)^T\,\big]^T$

Augmented system:
$$\dot{X} = \begin{bmatrix} A & 0 \\ 0 & F \end{bmatrix} X + \begin{bmatrix} B \\ 0 \end{bmatrix} u \equiv T X + B_1 u$$

Value function:
$$V(X(t)) = \frac{1}{2}\int_t^{\infty} e^{-\gamma(\tau - t)}\big[\,X^T Q_1 X + u^T R\, u\,\big]\, d\tau = \frac{1}{2} X(t)^T P\, X(t), \qquad Q_1 = C_1^T Q\, C_1, \quad C_1 = [\,C \;\; -I\,]$$

LQT Bellman equation:
$$0 = (T X + B_1 u)^T P X + X^T P\,(T X + B_1 u) - \gamma\, X^T P X + X^T Q_1 X + u^T R\, u$$

Page 9:

Augmented ARE and Causal Solution to the CT LQT

For the fixed control input $u = K x + K' y_d = K_1 X$, with $K_1 = [\,K \;\; K'\,]$, one has the LQT Lyapunov equation
$$(T + B_1 K_1)^T P + P\,(T + B_1 K_1) - \gamma P + Q_1 + K_1^T R\, K_1 = 0$$

Theorem (causal solution of the LQT problem). The optimal control gain is
$$K_1 = -R^{-1} B_1^T P$$
where $P$ solves the LQT ARE
$$0 = T^T P + P\,T - \gamma P - P B_1 R^{-1} B_1^T P + Q_1$$

$K_1$ gives both the feedback and feedforward parts simultaneously.
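A minimal numerical sketch of this causal solution, for a hypothetical scalar plant tracking a constant reference (all values are my own choices). It uses the fact that the $-\gamma P$ term can be absorbed by shifting $T$ to $T - \tfrac{\gamma}{2}I$, so a standard ARE solver applies.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Hypothetical scalar plant and constant (step) reference generator, for illustration only.
A = np.array([[-1.0]]); B = np.array([[1.0]]); C = np.array([[1.0]])
F = np.array([[0.0]])                      # y_d_dot = F y_d = 0 (step reference)
Q = np.array([[10.0]]); R = np.array([[1.0]])
gamma = 0.2                                # discount factor

# Augmented system X = [x; y_d]:  X_dot = T X + B1 u,  Q1 = C1' Q C1,  C1 = [C  -I]
T  = np.block([[A, np.zeros((1, 1))], [np.zeros((1, 1)), F]])
B1 = np.vstack([B, np.zeros((1, 1))])
C1 = np.hstack([C, -np.eye(1)])
Q1 = C1.T @ Q @ C1

# Discounted LQT ARE: 0 = T'P + PT - gamma*P - P B1 R^{-1} B1' P + Q1.
# Since T'P + PT - gamma*P = (T - gamma/2 I)'P + P (T - gamma/2 I), a standard solver applies.
P  = solve_continuous_are(T - 0.5 * gamma * np.eye(2), B1, Q1, R)
K1 = -np.linalg.solve(R, B1.T @ P)         # K1 = -R^{-1} B1' P = [K  K']
print("P =\n", P, "\nK1 =", K1)
```

The single gain $K_1$ acting on the augmented state contains both the error feedback on $x$ and the feedforward action on $y_d$, which is the point of the causal formulation.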

Page 10:

Online Solution to the CT LQT ARE: On-policy Integral Reinforcement Learning (IRL)

IRL Bellman equation:
$$V(X(t)) = \frac{1}{2}\int_t^{t+\Delta t} e^{-\gamma(\tau - t)}\big[\,X^T Q_1 X + u^T R\, u\,\big]\, d\tau + e^{-\gamma \Delta t}\, V(X(t+\Delta t))$$

Algorithm (online IRL algorithm for LQT).

Policy evaluation (solved by least squares, LS):
$$X(t)^T P^i X(t) = \int_t^{t+\Delta t} e^{-\gamma(\tau - t)}\big[\,X^T Q_1 X + u^{i\,T} R\, u^i\,\big]\, d\tau + e^{-\gamma \Delta t}\, X(t+\Delta t)^T P^i X(t+\Delta t)$$

Policy improvement:
$$u^{i+1} = -R^{-1} B_1^T P^i X$$
Requires knowledge of $B_1$.

This is a sequential algorithm (policy evaluation, then policy improvement).

IRL – Draguna Vrabie 2009
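The sketch below runs this on-policy IRL loop on the same hypothetical scalar tracking example used above: under the current policy $u = K^i X$, data windows of length $\Delta t$ are collected, the quadratic value $X^T P^i X$ is identified by least squares, and the gain is then improved using $B_1$ (which this on-policy algorithm must know). Every numerical choice is mine, not from the slides.

```python
import numpy as np
from scipy.linalg import expm, solve_continuous_are

# Hypothetical scalar plant tracking a constant reference (illustration only).
A = np.array([[-1.0]]); B = np.array([[1.0]]); C = np.array([[1.0]])
F = np.array([[0.0]]); Q = np.array([[10.0]]); R = np.array([[1.0]]); gamma = 0.2
T  = np.block([[A, np.zeros((1, 1))], [np.zeros((1, 1)), F]])
B1 = np.vstack([B, np.zeros((1, 1))])
C1 = np.hstack([C, -np.eye(1)]); Q1 = C1.T @ Q @ C1

def qb(X):                      # quadratic basis: X'PX = p[0]*X1^2 + p[1]*X1*X2 + p[2]*X2^2
    return np.array([X[0]**2, X[0]*X[1], X[1]**2])

dt, sub, n_win = 0.05, 50, 60   # window length, sub-steps per window, windows per iteration
h = dt / sub
K = np.zeros((1, 2))            # initial admissible gain (cost finite thanks to the discount)

for i in range(8):              # sequential policy iteration
    Ad = expm((T + B1 @ K) * h) # on-policy: behavior = target policy u = K X
    W  = Q1 + K.T @ R @ K
    Phi, d = [], []
    X = np.array([1.0, 1.0])    # (re)start the data-collection trajectory
    for j in range(n_win):
        X0, cost = X.copy(), 0.0
        for s in range(sub):    # discounted running cost over the window (Riemann sum)
            cost += np.exp(-gamma * s * h) * (X @ W @ X) * h
            X = Ad @ X
        Phi.append(qb(X0) - np.exp(-gamma * dt) * qb(X))   # policy-evaluation regressor
        d.append(cost)
    p = np.linalg.lstsq(np.asarray(Phi), np.asarray(d), rcond=None)[0]
    P = np.array([[p[0], p[1] / 2], [p[1] / 2, p[2]]])
    K = -np.linalg.solve(R, B1.T @ P)                      # policy improvement: needs B1

P_are = solve_continuous_are(T - 0.5 * gamma * np.eye(2), B1, Q1, R)
print("IRL P:\n", P, "\nARE P:\n", P_are)                  # iterates should approach the ARE solution
```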

Page 11:

On-policy RL

[Block diagram: a single policy acts on the system and reference; it serves as both the target and the behavior policy.]

Target policy: the policy that we are learning about.
Behavior policy: the policy that generates actions and behavior.

In on-policy RL, the target policy and the behavior policy are the same.

Page 12:

Off-policy RL

[Block diagram: a behavior policy drives the system and reference, while a separate target policy is being learned.]

In off-policy RL, the target policy and the behavior policy are different.

Page 13:

Off-policy IRL: humans can learn optimal policies while actually applying suboptimal policies.

System: $\dot{x} = f(x) + g(x)u$

Value: $J(x) = \int_t^{\infty} r(x, u)\, d\tau$

On-policy IRL:
$$J^{[i]}(x(t)) = \int_t^{t+T}\big[\,Q(x) + u^{[i]T} R\, u^{[i]}\,\big]\, d\tau + J^{[i]}(x(t+T)), \qquad u^{[i+1]} = -\tfrac{1}{2} R^{-1} g^T \nabla_x J^{[i]}$$
Must know $g(x)$.

Off-policy IRL: rewrite the system as $\dot{x} = f(x) + g(x)u^{[i]} + g(x)\big(u - u^{[i]}\big)$, where $u$ is the applied (behavior) input. Then
$$J^{[i]}(x(t)) - J^{[i]}(x(t+T)) = \int_t^{t+T}\big[\,Q(x) + u^{[i]T} R\, u^{[i]}\,\big]\, d\tau + \int_t^{t+T} 2\, u^{[i+1]T} R\,\big(u - u^{[i]}\big)\, d\tau$$

This is a linear equation for $J^{[i]}$ and $u^{[i+1]}$; they can be found simultaneously online from measured data using the Kronecker product and value function approximation (VFA).

1. Completely unknown system dynamics.
2. Can use the applied $u(t)$ for:
disturbance rejection – Z.P. Jiang; R. Song and Lewis, 2015
robust control – Y. Jiang & Z.P. Jiang, IEEE TCS 2012
exploring probing noise, without bias! – J.Y. Lee, J.B. Park, Y.H. Choi 2012

DDO

Yu Jiang & Zhong-Ping Jiang, Automatica 2012

Page 14:

Online Solution to the CT LQT ARE: Off-policy IRL

Write the augmented dynamics in terms of the target policy $u^i = K^i X$ and the applied (behavior) input $u$:
$$\dot{X} = T X + B_1 u = T_i X + B_1\big(u - K^i X\big), \qquad T_i \equiv T + B_1 K^i$$

Off-policy IRL Bellman equation:
$$e^{-\gamma \Delta t} X(t+\Delta t)^T P^i X(t+\Delta t) - X(t)^T P^i X(t) = -\int_t^{t+\Delta t} e^{-\gamma(\tau - t)} X^T\big(Q_1 + K^{i\,T} R\, K^i\big) X\, d\tau - \int_t^{t+\Delta t} e^{-\gamma(\tau - t)}\, 2\big(u - K^i X\big)^T R\, K^{i+1} X\, d\tau$$
where $R\, K^{i+1} = -B_1^T P^i$.

Algorithm (online off-policy IRL algorithm for LQT).
Online step: apply a fixed (behavior) control input and collect data.
Offline step: policy evaluation and improvement, solving the off-policy IRL Bellman equation above for $P^i$ and $K^{i+1}$ by LS on the collected data.

Because $B_1^T P^i$ enters only through the unknown $R\, K^{i+1}$, no knowledge of the dynamics is required.
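Here is a minimal sketch of the off-policy variant on the same hypothetical example: data are collected once under a fixed behavior input (probing noise), and then $P^i$ and $K^{i+1}$ are extracted jointly by least squares from the stored data, without using $T$ or $B_1$ in the learning step. All numerical choices are mine.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Same hypothetical scalar tracking example as before (illustration only).
A = np.array([[-1.0]]); B = np.array([[1.0]]); C = np.array([[1.0]])
F = np.array([[0.0]]); Q = np.array([[10.0]]); R = np.array([[1.0]]); gamma = 0.2
T  = np.block([[A, np.zeros((1, 1))], [np.zeros((1, 1)), F]])
B1 = np.vstack([B, np.zeros((1, 1))])
C1 = np.hstack([C, -np.eye(1)]); Q1 = C1.T @ Q @ C1

def qb(X):
    return np.array([X[0]**2, X[0]*X[1], X[1]**2])

# --- Online step: apply a fixed behavior input (probing noise) and store the data. ---
dt, sub, n_win = 0.1, 100, 80
h = dt / sub
rng = np.random.default_rng(0)
freqs = rng.uniform(0.5, 8.0, size=10)
u_beh = lambda t: 0.5 * np.sum(np.sin(freqs * t))       # probing noise for excitation
X, t, data = np.array([1.0, 1.0]), 0.0, []
for j in range(n_win):
    Xs, us = [], []
    for s in range(sub):
        u = u_beh(t)
        Xs.append(X.copy()); us.append(u)
        X = X + h * (T @ X + B1.flatten() * u)          # Euler step: simulation only;
        t += h                                          # the learner never uses T or B1
    data.append((np.array(Xs), np.array(us), X.copy())) # stored window plus end state

# --- Offline step: iterate LS on the stored data; P^i and K^{i+1} are found jointly. ---
w = np.exp(-gamma * h * np.arange(sub)) * h             # discounted quadrature weights
K = np.zeros((1, 2))
for i in range(8):
    rows, rhs = [], []
    for Xs, us, Xend in data:
        d_j = np.sum(w * np.einsum('sk,kl,sl->s', Xs, Q1 + K.T @ R @ K, Xs))
        e_j = us - (Xs @ K.T).ravel()                   # u - K^i X along the window
        c_j = 2.0 * R[0, 0] * (w * e_j) @ Xs            # multiplies the unknown K^{i+1}
        phi = qb(Xs[0]) - np.exp(-gamma * dt) * qb(Xend)
        rows.append(np.concatenate([phi, -c_j])); rhs.append(d_j)
    sol = np.linalg.lstsq(np.asarray(rows), np.asarray(rhs), rcond=None)[0]
    P = np.array([[sol[0], sol[1] / 2], [sol[1] / 2, sol[2]]])
    K = sol[3:].reshape(1, 2)                           # K^{i+1}, found without using B1

P_are = solve_continuous_are(T - 0.5 * gamma * np.eye(2), B1, Q1, R)
print("off-policy P:\n", P, "\nARE P:\n", P_are)        # iterates should approach the ARE solution
```

Note the design point the slides emphasize: because $B_1^T P^i$ appears only through $R K^{i+1}$, treating $K^{i+1}$ as an extra unknown in the regression removes any need for the system matrices.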

Page 15:

[Block diagram: the reference and system produce the state, control input, and reward; a Critic network estimates $P^i$ and an Actor network estimates $K^{i+1}$.]

Off-policy Tracking IRL Bellman equation (as on the previous slide):
$$e^{-\gamma \Delta t} X(t+\Delta t)^T P^i X(t+\Delta t) - X(t)^T P^i X(t) = -\int_t^{t+\Delta t} e^{-\gamma(\tau - t)} X^T\big(Q_1 + K^{i\,T} R\, K^i\big) X\, d\tau - \int_t^{t+\Delta t} e^{-\gamma(\tau - t)}\, 2\big(u - K^i X\big)^T R\, K^{i+1} X\, d\tau$$

Data-based adaptive optimal tracking scheme (DDO). Off-policy: solve the critic network ($P^i$) and the actor network ($K^{i+1}$) simultaneously.

Page 16:

Convergence

Lemma. The Bellman equation, the IRL Bellman equation, and the off-policy IRL Bellman equation have the same solution for the value function and the control update law.

Theorem. The IRL algorithm and the off-policy IRL algorithm for the LQT problem converge to the optimal solution.

Page 17:

Output Synchronization of Heterogeneous MAS

MAS agent dynamics: $\dot{x}_i = A_i x_i + B_i u_i$, $\quad y_i = C_i x_i$

Leader: $\dot{\zeta}_0 = S\, \zeta_0$, $\quad y_0 = R\, \zeta_0$

Output regulation objective: $y_i(t) - y_0(t) \to 0$ for all $i$.

Pioneered by Jie Huang.

Output regulator equations:
$$A_i \Pi_i + B_i \Gamma_i = \Pi_i S, \qquad C_i \Pi_i = R$$

Control (distributed observer plus state feedback):
$$\dot{\zeta}_i = S\, \zeta_i + c\Big[\sum_{j=1}^{N} a_{ij}\,(\zeta_j - \zeta_i) + g_i\,(\zeta_0 - \zeta_i)\Big]$$
$$u_i = K_{1i}\,(x_i - \Pi_i \zeta_i) + \Gamma_i \zeta_i = K_{1i} x_i + \big(\Gamma_i - K_{1i}\Pi_i\big)\zeta_i \equiv K_{1i} x_i + K_{2i}\,\zeta_i$$

Must know the leader's dynamics $S, R$.
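Since the output regulator (Francis) equations are linear in $(\Pi_i, \Gamma_i)$, they can be solved by a vectorized least-squares problem. Below is a minimal sketch for one hypothetical scalar agent tracking a sinusoidal leader; all matrix values are my own example choices.

```python
import numpy as np

# Hypothetical agent i and leader (illustration only).
Ai = np.array([[-1.0]]); Bi = np.array([[1.0]]); Ci = np.array([[1.0]])      # agent
S  = np.array([[0.0, 1.0], [-1.0, 0.0]]); Rm = np.array([[1.0, 0.0]])        # leader: zeta0_dot = S zeta0, y0 = Rm zeta0

n, m, p, q = Ai.shape[0], Bi.shape[1], Rm.shape[0], S.shape[0]

# Regulator equations: Ai*Pi + Bi*Gi = Pi*S and Ci*Pi = Rm, unknowns Pi (n x q), Gi (m x q).
# Vectorize using vec(A X B) = (B' kron A) vec(X), with column-stacked vec().
Iq = np.eye(q)
row1 = np.hstack([np.kron(Iq, Ai) - np.kron(S.T, np.eye(n)), np.kron(Iq, Bi)])  # Ai*Pi - Pi*S + Bi*Gi = 0
row2 = np.hstack([np.kron(Iq, Ci), np.zeros((p * q, m * q))])                    # Ci*Pi = Rm
M    = np.vstack([row1, row2])
rhs  = np.concatenate([np.zeros(n * q), Rm.flatten(order='F')])

sol = np.linalg.lstsq(M, rhs, rcond=None)[0]
Pi  = sol[:n * q].reshape(n, q, order='F')
Gi  = sol[n * q:].reshape(m, q, order='F')
print("Pi =", Pi, " Gi =", Gi)     # for this example, Pi = [1, 0] and Gi = [1, 1]
```

This is the model-based baseline that the RL tracker on the following slides avoids computing explicitly.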

Page 18:

Optimal Output Synchronization of Heterogeneous MAS Using Off-policy IRL
Nageshrao, Modares, Lopes, Babuska, Lewis

MAS agent dynamics: $\dot{x}_i = A_i x_i + B_i u_i$, $\quad y_i = C_i x_i$

Leader: $\dot{\zeta}_0 = S\, \zeta_0$, $\quad y_0 = R\, \zeta_0$

Augmented systems (one per agent):
$$X_i(t) = \big[\,x_i(t)^T \;\; \zeta_0(t)^T\,\big]^T \in \mathbb{R}^{n_i + p}, \qquad \dot{X}_i = T_i X_i + B_{1i} u_i, \qquad T_i = \begin{bmatrix} A_i & 0 \\ 0 & S \end{bmatrix}, \quad B_{1i} = \begin{bmatrix} B_i \\ 0 \end{bmatrix}$$

Optimal tracker: for the policy $u_i = K_{1i} x_i + K_{2i}\,\zeta_0 = K_i X_i$, the discounted value is
$$V_i(X_i(t)) = \int_t^{\infty} e^{-\gamma(\tau - t)} X_i^T\big(C_{1i}^T Q_i C_{1i} + K_i^T W_i K_i\big) X_i\, d\tau = X_i(t)^T P_i X_i(t)$$
with policy updates $u_i^{\kappa+1} = K_i^{\kappa+1} X_i$.

Do not have to know the leader's dynamics $(S, R)$.

Page 19:

To avoid needing the leader's state $\zeta_0$ in the control $u_i = K_{1i} x_i + K_{2i}\,\zeta_0 = K_i X_i$, use an adaptive observer for the leader's state:
$$\dot{\zeta}_i = \hat{S}_i\, \zeta_i + c\Big[\sum_{j=1}^{N} a_{ij}\,(\zeta_j - \zeta_i) + g_i\,(\zeta_0 - \zeta_i)\Big]$$
$$\operatorname{vec}\big(\dot{\hat{S}}_i\big) = -\Gamma_{S_i}\big(I_q \otimes \zeta_i\big)\Big[\sum_{j=1}^{N} a_{ij}\,(\zeta_j - \zeta_i) + g_i\,(\zeta_0 - \zeta_i)\Big]$$

Then use the control
$$u_i = K_{1i} x_i + K_{2i}\,\zeta_i \equiv K_i \hat{X}_i, \qquad \hat{X}_i = \begin{bmatrix} x_i \\ \zeta_i \end{bmatrix}$$

Note that $\hat{S}_i$ may not converge to the actual leader matrix $S$.

Do not have to know the leader's dynamics $(S, R)$.

Page 20:

The RL Tracker Formulation Implicitly Solves the o/p Reg Equations!

Standard o/p regulation control:
$$u_i = K_{1i}\,(x_i - \Pi_i \zeta_i) + \Gamma_i \zeta_i$$

Our RL tracker control:
$$u_i = K_{1i} x_i + K_{2i}\,\zeta_i \equiv K_i \hat{X}_i, \qquad \hat{X}_i = \begin{bmatrix} x_i \\ \zeta_i \end{bmatrix}$$
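A one-line check, using the gain definition $K_{2i} = \Gamma_i - K_{1i}\Pi_i$ from the output-regulation controller on Page 17, makes the equivalence explicit:
$$K_{1i} x_i + K_{2i}\,\zeta_i = K_{1i} x_i + \big(\Gamma_i - K_{1i}\Pi_i\big)\zeta_i = K_{1i}\big(x_i - \Pi_i \zeta_i\big) + \Gamma_i \zeta_i$$
so learning the single augmented gain $K_i = [\,K_{1i} \;\; K_{2i}\,]$ implicitly encodes the regulator-equation solution $(\Pi_i, \Gamma_i)$ without solving those equations explicitly.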

Page 21: