
Reinforcement Learning for Optimal Tracking and Regulation: A Unified Framework

F.L. Lewis, National Academy of Inventors

Moncrief-O’Donnell Chair, UTA Research Institute (UTARI), The University of Texas at Arlington, USA

and

Qian Ren Consulting Professor, State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang, China

Talk available online at http://www.UTA.edu/UTARI/acs

Supported by: NSF, AFOSR Europe, ONR – Marc Steinberg, US TARDEC – Dariusz Mikulski

Supported by: China NNSF, China Project 111

Work of Reza Modares and Bahare Kiumarsi

Books

F.L. Lewis, D. Vrabie, and V. Syrmos, Optimal Control, third edition, John Wiley and Sons, New York, 2012. New chapters on: Reinforcement Learning, Differential Games.

D. Vrabie, K. Vamvoudakis, and F.L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles, IET Press, 2012.

F.L. Lewis, Applied Optimal Control and Estimation: Digital Design and Implementation, Prentice-Hall, New Jersey, TI Series, Feb. 1992.

Tracking a Reference Input

Part 1: Optimal Tracking Control of Continuous-time (CT) Systems

Linear Quadratic Tracking (LQT)

CT Optimal Tracking Control: Standard Solution

System dynamics: $\dot{x}(t) = f(x(t)) + g(x(t))u(t)$

Objective: $x(t) \to x_d(t)$

Standard solution.

Feedforward part: set $x(t) = x_d(t)$ with $\dot{x}_d(t) = f(x_d(t)) + g(x_d(t))u_d(t)$, so that

$u_d(t) = g^{-1}(x_d(t))\left(\dot{x}_d(t) - f(x_d(t))\right)$

This contains the dynamics, is noncausal, requires $g(x)$ to be invertible, and is an offline solution.

Feedback part: with tracking error $e = x - x_d$,

$V(e(t)) = \int_t^{\infty}\left[e^T Q_e\, e + u_e^T R\, u_e\right] d\tau, \qquad u_e^* = \arg\min_{u_e} V(e(t))$

Suboptimal solution: the combined control $u^*(t) = u_e^*(t) + u_d(t)$ is suboptimal.
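As a concrete illustration of the feedforward term, here is a minimal sketch that evaluates $u_d = g^{-1}(x_d)(\dot{x}_d - f(x_d))$ along a reference; the scalar dynamics $f$, $g$, the reference $x_d(t)$, and all numbers are illustrative assumptions, not from the talk.

```python
import numpy as np

# Minimal sketch of the feedforward term u_d = g(x_d)^(-1) (x_d' - f(x_d)).
# The scalar dynamics and the reference below are illustrative assumptions.
f = lambda x: -np.sin(x)           # assumed drift f(x)
g = lambda x: 2.0 + np.cos(x)      # assumed input gain g(x), never zero here

t = np.linspace(0.0, 10.0, 1001)   # time grid
xd = np.sin(t)                     # assumed reference x_d(t)
xd_dot = np.cos(t)                 # its known derivative

# Needs the model and the reference derivative over the whole horizon,
# i.e. this is a noncausal, offline computation.
ud = (xd_dot - f(xd)) / g(xd)
```

Because $u_d$ needs $\dot{x}_d$ over the whole horizon and the model $f$, $g$, it is computed offline, which is exactly the noncausality noted above.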


LQT Problem for Continuous-time Systems

Linear system: $\dot{x} = A x + B u$, $y = C x$

Objective: design $u = K x + K' y_d$ such that $y \to y_d$ while minimizing the quadratic performance function, with discount factor $e^{-\gamma(\tau - t)}$:

$V(x(t), y_d(t)) = \frac{1}{2}\int_t^{\infty} e^{-\gamma(\tau - t)}\left[(C x - y_d)^T Q\,(C x - y_d) + u^T R\, u\right] d\tau$

Both feedback and feedforward parts (optimal).

Assumption. The reference trajectory $y_d$ is generated by $\dot{y}_d = F y_d$. The matrix $F$ is not assumed stable, so this generates useful command trajectories such as the unit step, sinusoidal waveforms, ramps, etc.
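For illustration, here is a small sketch of command-generator matrices $F$ of this type (the numerical values and the frequency w are assumptions); none of them is Hurwitz, consistent with $F$ not being assumed stable.

```python
import numpy as np

# Command-generator matrices for y_d' = F y_d (values are illustrative):
F_step = np.array([[0.0]])               # constant (unit-step) command
F_ramp = np.array([[0.0, 1.0],
                   [0.0, 0.0]])          # ramp: y_d = [position, slope]
w = 2.0                                  # assumed frequency (rad/s)
F_sine = np.array([[0.0,  w],
                   [-w,  0.0]])          # sinusoid of frequency w

# Forward-Euler rollout of the sinusoidal command, as an example:
dt, yd = 0.01, np.array([0.0, 1.0])
traj = []
for _ in range(1000):
    traj.append(yd.copy())
    yd = yd + dt * (F_sine @ yd)
```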

Lemma. For $u = K x + K' y_d$, the value function is quadratic:

$V(x(t), y_d(t)) = \frac{1}{2}\begin{bmatrix} x(t) \\ y_d(t) \end{bmatrix}^T P \begin{bmatrix} x(t) \\ y_d(t) \end{bmatrix}$

CT LQT Bellman Equation

Augmented system state: $X(t) = \begin{bmatrix} x(t)^T & y_d(t)^T \end{bmatrix}^T$

Augmented system:

$\dot{X} = \begin{bmatrix} A & 0 \\ 0 & F \end{bmatrix} X + \begin{bmatrix} B \\ 0 \end{bmatrix} u \equiv T X + B_1 u$

Value function:

$V(X(t)) = \frac{1}{2}\int_t^{\infty} e^{-\gamma(\tau - t)}\left[X^T Q_1 X + u^T R\, u\right] d\tau = \frac{1}{2} X(t)^T P X(t)$,

where $Q_1 = C_1^T Q\, C_1$, $C_1 = \begin{bmatrix} C & -I \end{bmatrix}$.

LQT Bellman equation:

$0 = (T X + B_1 u)^T P X + X^T P\,(T X + B_1 u) - \gamma X^T P X + X^T Q_1 X + u^T R\, u$

Augmented ARE and causal solution to the CT LQT

For the fixed control input $u = K x + K' y_d = K_1 X$, with $K_1 = \begin{bmatrix} K & K' \end{bmatrix}$, one has the LQT Lyapunov equation

$(T + B_1 K_1)^T P + P\,(T + B_1 K_1) - \gamma P + Q_1 + K_1^T R\, K_1 = 0$

Theorem (Causal solution for the LQT problem). The optimal control gain is

$K_1 = -R^{-1} B_1^T P$

where $P$ solves the LQT ARE

$0 = T^T P + P\, T - \gamma P - P B_1 R^{-1} B_1^T P + Q_1$

$K_1$ gives both the feedback and feedforward parts simultaneously.
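Since the discount enters only through the term $-\gamma P$, the LQT ARE is a standard continuous-time ARE in the shifted matrix $T - \tfrac{\gamma}{2}I$, so a standard solver can be used when the model is known. A minimal sketch follows; the plant, reference generator, and weights are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Illustrative scalar plant x' = a x + b u, y = x, tracking a unit-step
# reference generated by y_d' = F y_d with F = 0 (all numbers are assumptions).
a, b, c, F, gamma = -1.0, 1.0, 1.0, 0.0, 0.1
A, B, C = np.array([[a]]), np.array([[b]]), np.array([[c]])

T  = np.block([[A, np.zeros((1, 1))],
               [np.zeros((1, 1)), np.array([[F]])]])   # augmented drift
B1 = np.vstack([B, np.zeros((1, 1))])                   # augmented input matrix
C1 = np.hstack([C, -np.eye(1)])                         # error output [C  -I]
Q, R = np.eye(1), np.eye(1)
Q1 = C1.T @ Q @ C1

# Discounted LQT ARE  0 = T'P + PT - gamma P - P B1 R^-1 B1' P + Q1
# equals the standard CARE for the shifted drift T - (gamma/2) I.
P  = solve_continuous_are(T - 0.5 * gamma * np.eye(2), B1, Q1, R)
K1 = -np.linalg.solve(R, B1.T @ P)   # K1 = -R^-1 B1' P: feedback and feedforward gains
print(P)
print(K1)
```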

Online solution to the CT LQT ARE: On-policy Integral Reinforcement Learning (IRL)

IRL Bellman equation:

$V(X(t)) = \frac{1}{2}\int_t^{t+\Delta t} e^{-\gamma(\tau - t)}\left[X^T Q_1 X + u^T R\, u\right] d\tau + e^{-\gamma \Delta t}\, V(X(t+\Delta t))$

Algorithm. Online IRL algorithm for LQT.

Policy evaluation (by least squares):

$X(t)^T P^i X(t) = \int_t^{t+\Delta t} e^{-\gamma(\tau - t)}\left[X^T Q_1 X + (u^i)^T R\, u^i\right] d\tau + e^{-\gamma \Delta t}\, X(t+\Delta t)^T P^i X(t+\Delta t)$

Policy improvement:

$u^{i+1} = -R^{-1} B_1^T P^i X$ (requires knowledge of $B_1$)

This is a sequential algorithm.

IRL – Draguna Vrabie 2009
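The following is a minimal sketch of this sequential on-policy IRL loop on an assumed scalar example of the augmented system above; the system matrices, initial gain, interval $\Delta t$, and the use of random restarts to obtain rich data are illustrative assumptions rather than the talk's implementation. Policy evaluation solves for $\mathrm{vec}(P^i)$ by least squares over quadratic-form regressors, and policy improvement uses $B_1$, as noted.

```python
import numpy as np

# On-policy IRL sketch for the LQT policy-evaluation step (scalar example;
# system data, the initial gain, and tuning values are illustrative assumptions).
np.random.seed(0)
A, B, F = np.array([[-1.0]]), np.array([[1.0]]), np.array([[0.0]])
T  = np.block([[A, np.zeros((1, 1))], [np.zeros((1, 1)), F]])
B1 = np.vstack([B, np.zeros((1, 1))])
C1 = np.array([[1.0, -1.0]])
Q1, R, gamma = C1.T @ C1, np.eye(1), 0.1

K = np.array([[-0.5, 0.5]])      # initial admissible target policy u = K X (assumed)
dt, Dt, n = 1e-3, 0.05, 2        # Euler step, IRL interval, dimension of X

for it in range(10):             # policy-iteration loop
    Phi, d = [], []
    for _ in range(40):          # data tuples from random restarts, for richness
        X = np.random.randn(n, 1)
        X0, cost = X.copy(), 0.0
        for k in range(int(Dt / dt)):          # integrate one IRL interval
            u = K @ X                          # on-policy: apply the target policy
            cost += dt * np.exp(-gamma * k * dt) * (X.T @ Q1 @ X + u.T @ R @ u).item()
            X = X + dt * (T @ X + B1 @ u)
        Phi.append((np.kron(X0, X0) - np.exp(-gamma * Dt) * np.kron(X, X)).ravel())
        d.append(cost)
    p = np.linalg.lstsq(np.array(Phi), np.array(d), rcond=None)[0]   # vec(P^i)
    P = p.reshape(n, n)
    P = 0.5 * (P + P.T)                        # symmetrize the critic matrix
    K = -np.linalg.solve(R, B1.T @ P)          # policy improvement (needs B1)
print(P)
print(K)
```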

On-policy RL

Target and behavior policy:

Target policy: the policy that we are learning about.

Behavior policy: the policy that generates actions and behavior.

In on-policy RL, the target policy and the behavior policy are the same.

[Block diagram: reference and system in the loop with the single policy.]

Off-policy RL

In off-policy RL, the target policy and the behavior policy are different.

[Block diagram: the behavior policy drives the system while the target policy is learned separately.]

Off-policy IRL

Humans can learn optimal policies while actually applying suboptimal policies.

System: $\dot{x} = f(x) + g(x)u$, with cost $J(x) = \int_t^{\infty} r(x, u)\, d\tau$

On-policy IRL:

$J^{[i]}(x(t)) - J^{[i]}(x(t+T)) = \int_t^{t+T}\left[Q(x) + (u^{[i]})^T R\, u^{[i]}\right] d\tau$

$u^{[i+1]} = -\tfrac{1}{2} R^{-1} g^T \nabla J_x^{[i]}$ (must know $g(x)$)

Off-policy IRL: write the system as $\dot{x} = f + g\, u^{[i]} + g\,(u - u^{[i]})$, which gives

$J^{[i]}(x(t)) - J^{[i]}(x(t+T)) = \int_t^{t+T}\left[Q(x) + (u^{[i]})^T R\, u^{[i]}\right] d\tau + 2\int_t^{t+T} (u^{[i+1]})^T R\,(u - u^{[i]})\, d\tau$

This is a linear equation for $J^{[i]}$ and $u^{[i+1]}$; they can be found simultaneously online from measured data using the Kronecker product and value function approximation (VFA).

1. Completely unknown system dynamics.

2. Can use the applied u(t) for:
disturbance rejection – Z.P. Jiang; R. Song and Lewis, 2015
robust control – Y. Jiang & Z.P. Jiang, IEEE TCS 2012
exploring probing noise, without bias! – J.Y. Lee, J.B. Park, Y.H. Choi 2012

Yu Jiang & Zhong-Ping Jiang, Automatica 2012

Online solution to the CT LQT ARE: Off-policy IRL

Write the augmented dynamics in terms of the target policy gain $K_1^i$:

$\dot{X} = T X + B_1 u = T^i X + B_1\,(u - K_1^i X), \qquad T^i \equiv T + B_1 K_1^i$

Off-policy IRL Bellman equation (here $T$ in the integration limits denotes the data-collection interval, distinct from the augmented drift matrix $T$):

$e^{-\gamma T} X(t+T)^T P^i X(t+T) - X(t)^T P^i X(t) = -\int_t^{t+T} e^{-\gamma(\tau - t)}\, X^T\!\left(Q_1 + (K_1^i)^T R\, K_1^i\right) X\, d\tau - 2\int_t^{t+T} e^{-\gamma(\tau - t)}\,\left(u - K_1^i X\right)^T R\, K_1^{i+1} X\, d\tau$

Algorithm. Online off-policy IRL algorithm for LQT.

Online step: apply a fixed control input and collect data.

Offline step: policy evaluation and improvement by least squares on the collected data, solving the off-policy IRL Bellman equation for $P^i$ and $K_1^{i+1}$ simultaneously.

No knowledge of the dynamics is required.
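A sketch of the data-based off-policy iteration follows, on the same assumed scalar example. Each data tuple is rearranged into one least-squares row in the unknowns $\mathrm{vec}(P^i)$ and $\mathrm{vec}(K_1^{i+1})$, so the critic and actor are found simultaneously; the behavior policy, probing noise, and all numerical values are assumptions for illustration.

```python
import numpy as np

# Off-policy IRL sketch for the LQT problem (scalar example; the behavior
# policy and all numerical values are illustrative assumptions).
np.random.seed(1)
A, B, F = np.array([[-1.0]]), np.array([[1.0]]), np.array([[0.0]])
Tm = np.block([[A, np.zeros((1, 1))], [np.zeros((1, 1)), F]])   # augmented drift
B1 = np.vstack([B, np.zeros((1, 1))])
C1 = np.array([[1.0, -1.0]])
Q1, R, gamma = C1.T @ C1, np.eye(1), 0.1
n, m, dt, Ti = 2, 1, 1e-3, 0.05          # dims, Euler step, data interval

def collect(K, n_tuples=60):
    """Apply an exploring behavior policy and build one least-squares row per
    data tuple:  phiP vec(P) + 2 theta vec(K_next) = -rho."""
    rows, rhs = [], []
    for _ in range(n_tuples):
        X = np.random.randn(n, 1)
        X0, rho, theta = X.copy(), 0.0, np.zeros((1, n * m))
        for k in range(int(Ti / dt)):
            u = -0.3 * (C1 @ X) + 0.1 * np.random.randn(m, 1)   # behavior policy + noise
            disc = np.exp(-gamma * k * dt)
            rho += dt * disc * (X.T @ (Q1 + K.T @ R @ K) @ X).item()
            theta += dt * disc * np.kron(X, R @ (u - K @ X)).T
            X = X + dt * (Tm @ X + B1 @ u)
        phiP = (np.exp(-gamma * Ti) * np.kron(X, X) - np.kron(X0, X0)).T
        rows.append(np.hstack([phiP, 2.0 * theta]).ravel())
        rhs.append(-rho)
    return np.array(rows), np.array(rhs)

K = np.array([[-0.5, 0.5]])                      # initial target gain (assumed)
for it in range(10):                             # off-policy policy iteration
    Phi, d = collect(K)
    z = np.linalg.lstsq(Phi, d, rcond=None)[0]   # [vec(P^i); vec(K_1^{i+1})]
    P = z[:n * n].reshape(n, n, order='F')
    P = 0.5 * (P + P.T)                          # critic (value) matrix
    K = z[n * n:].reshape(m, n, order='F')       # actor (gain) for the next iteration
print(P)
print(K)
```

Because the probing noise makes the applied input differ from the target policy $K_1^i X$, the second regressor is nonzero, so both unknowns are identifiable from the same batch of data without bias from the probing noise.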

Data-based adaptive optimal tracking scheme (DDO)

[Block diagram: reference and system generate the state, control input, and reward signals that feed the Critic and Actor networks.]

Off-policy: solve the critic network ($P^i$) and the actor network ($K_1^{i+1}$) simultaneously from the off-policy tracking IRL Bellman equation

$e^{-\gamma T} X(t+T)^T P^i X(t+T) - X(t)^T P^i X(t) = -\int_t^{t+T} e^{-\gamma(\tau - t)}\, X^T\!\left(Q_1 + (K_1^i)^T R\, K_1^i\right) X\, d\tau - 2\int_t^{t+T} e^{-\gamma(\tau - t)}\,\left(u - K_1^i X\right)^T R\, K_1^{i+1} X\, d\tau$

Convergence

Lemma. The Bellman equation, the IRL Bellman equation, and the off-policy IRL Bellman equation have the same solution for the value function and the control update law.

Theorem. The IRL algorithm and the off-policy IRL algorithm for the LQT problem converge to the optimal solution.

Output Synchronization of Heterogeneous MAS

Agent dynamics: $\dot{x}_i = A_i x_i + B_i u_i$, $y_i = C_i x_i$

MAS leader: $\dot{\zeta}_0 = S \zeta_0$, $y_0 = R \zeta_0$

Pioneered by Jie Huang.

Output regulator equations:

$A_i \Pi_i + B_i \Gamma_i = \Pi_i S, \qquad C_i \Pi_i = R$

Control:

$\dot{\zeta}_i = S \zeta_i + c\left[\sum_{j=1}^{N} a_{ij}(\zeta_j - \zeta_i) + g_i(\zeta_0 - \zeta_i)\right]$

$u_i = K_{1i}(x_i - \Pi_i \zeta_i) + \Gamma_i \zeta_i = K_{1i} x_i + (\Gamma_i - K_{1i}\Pi_i)\zeta_i \equiv K_{1i} x_i + K_{2i} \zeta_i$

Must know the leader's dynamics S, R.

Output regulation: $y_i(t) - y_0(t) \to 0, \ \forall i$
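When the agent model and the leader's (S, R) are known, the output regulator equations are linear in $(\Pi_i, \Gamma_i)$ and can be solved directly by vectorization. A minimal sketch follows; the agent and leader matrices below are illustrative assumptions.

```python
import numpy as np

# Sketch: solve the output regulator equations
#   A Pi + B Gam = Pi S,   C Pi = R
# by vectorization; the numerical matrices below are illustrative assumptions.
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
S = np.array([[0.0, 1.0], [-1.0, 0.0]])      # leader generates a sinusoid
R = np.array([[1.0, 0.0]])

n, m = B.shape
p, q = R.shape
Iq, In = np.eye(q), np.eye(n)

# Row blocks:  (Iq (x) A - S' (x) In) vec(Pi) + (Iq (x) B) vec(Gam) = 0
#              (Iq (x) C)            vec(Pi)                        = vec(R)
M = np.block([
    [np.kron(Iq, A) - np.kron(S.T, In), np.kron(Iq, B)],
    [np.kron(Iq, C), np.zeros((p * q, m * q))],
])
b = np.concatenate([np.zeros(n * q), R.flatten(order='F')])

sol = np.linalg.lstsq(M, b, rcond=None)[0]
Pi  = sol[:n * q].reshape(n, q, order='F')
Gam = sol[n * q:].reshape(m, q, order='F')
print(Pi, Gam)
print(np.allclose(A @ Pi + B @ Gam, Pi @ S), np.allclose(C @ Pi, R))
```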

Optimal Output Synchronization of Heterogeneous MAS Using Off-policy IRL

Nageshrao, Modares, Lopes, Babuska, Lewis

Agent dynamics: $\dot{x}_i = A_i x_i + B_i u_i$, $y_i = C_i x_i$

MAS leader: $\dot{\zeta}_0 = S \zeta_0$, $y_0 = R \zeta_0$

Augmented systems:

$X_i(t) = \begin{bmatrix} x_i(t)^T & \zeta_0^T \end{bmatrix}^T \in \mathbb{R}^{n_i + p}$

$\dot{X}_i = T_i X_i + B_{1i} u_i, \qquad T_i = \begin{bmatrix} A_i & 0 \\ 0 & S \end{bmatrix}, \quad B_{1i} = \begin{bmatrix} B_i \\ 0 \end{bmatrix}$

Optimal tracker value function:

$V_i(X_i(t)) = \int_t^{\infty} e^{-\gamma_i(\tau - t)}\, X_i^T\!\left(C_{1i}^T Q_i C_{1i} + K_i^T W_i K_i\right) X_i\, d\tau = X_i(t)^T P_i X_i(t)$

Control: $u_i = K_{1i} x_i + K_{2i} \zeta_0 = K_i X_i$, updated as $u_i^{\kappa+1} = K_i^{\kappa+1} X_i$.

Do not have to know the leader's dynamics (S, R).

To avoid knowledge of the leader's state $\zeta_0$ in the control $u_i = K_{1i} x_i + K_{2i} \zeta_0 = K_i X_i$, use an adaptive observer for the leader's state:

$\dot{\zeta}_i = \hat{S}_i \zeta_i + c\left[\sum_{j=1}^{N} a_{ij}(\zeta_j - \zeta_i) + g_i(\zeta_0 - \zeta_i)\right]$

$\mathrm{vec}(\dot{\hat{S}}_i) = -\Gamma_{Si}\left(I_q \otimes \zeta_i\right)\left[\sum_{j=1}^{N} a_{ij}(\zeta_j - \zeta_i) + g_i(\zeta_0 - \zeta_i)\right]$

Then use the control

$u_i = K_{1i} x_i + K_{2i} \zeta_i \equiv K_i \hat{X}_i \equiv K_i \begin{bmatrix} x_i \\ \zeta_i \end{bmatrix}$

Note that $\hat{S}_i$ may not converge to the actual leader's matrix $S$.

Do not have to know the leader's dynamics (S, R).

The RL Tracker Formulation Implicitly Solves the o/p Reg Equations!

Standard o/p reg equation control: $u_i = K_{1i}(x_i - \Pi_i \zeta_i) + \Gamma_i \zeta_i$

Our RL Tracker control: $u_i = K_{1i} x_i + K_{2i} \zeta_i \equiv K_i \begin{bmatrix} x_i \\ \zeta_i \end{bmatrix}$