Proximal Policy Optimization (PPO)
speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture...


Transcript of the 29-page slide deck follows, page by page.

Page 1: Proximal Policy Optimization (PPO)

PPO is the default reinforcement learning algorithm at OpenAI.

Outline:
• Policy Gradient
• On-policy → Off-policy
• Add constraint

Page 2: Demo (DeepMind)

DeepMind: https://youtu.be/gn4nRCC9TwQ

Page 3: Demo (OpenAI)

OpenAI: https://blog.openai.com/openai-baselines-ppo/

Page 4: Policy Gradient (Review)

Page 5: Basic Components

The basic components of RL are the Actor, the Environment, and the Reward Function.

• Video game example: the environment is the game itself; the reward function gives, e.g., 20 points for killing a monster.
• Go example: the environment is the opponent; the reward function is the rule of Go (win or lose).

The environment and the reward function are given in advance; you cannot control them. Only the actor is learned.

Page 6: Policy of Actor

• Policy $\pi$ is a network with parameters $\theta$.
• Input: the observation of the machine, represented as a vector or a matrix (e.g., the game pixels).
• Output: each action corresponds to a neuron in the output layer, whose value is the score of that action (e.g., left 0.1, right 0.2, fire 0.7).

Take the action based on the probability.
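
A minimal sketch of such a policy network in PyTorch (the concrete shapes are assumptions for illustration: an 84×84 frame flattened into a vector, three actions):

    import torch
    import torch.nn as nn

    class PolicyNet(nn.Module):
        """Maps an observation vector to a probability distribution over actions."""
        def __init__(self, obs_dim=84 * 84, n_actions=3):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, 128),
                nn.ReLU(),
                nn.Linear(128, n_actions),
            )

        def forward(self, obs):
            return torch.softmax(self.net(obs), dim=-1)  # e.g. [0.1, 0.2, 0.7]

    policy = PolicyNet()
    obs = torch.rand(1, 84 * 84)          # one flattened frame
    probs = policy(obs)                   # scores of left / right / fire
    action = torch.multinomial(probs, 1)  # sample the action from the distribution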

Page 7: Example: Playing Video Game

Start with observation $s_1$. Take action $a_1$ ("right"), obtain reward $r_1 = 0$, and observe $s_2$. Take action $a_2$ ("fire", killing an alien), obtain reward $r_2 = 5$, and observe $s_3$. And so on.

Page 8: Example: Playing Video Game

After many turns, the actor takes action $a_T$ and obtains reward $r_T$; then the game is over (the spaceship is destroyed). This whole sequence is an episode. We want the total reward to be maximized.

Total reward: $R = \sum_{t=1}^{T} r_t$

Page 9: Actor, Environment, Reward

Trajectory: $\tau = \{s_1, a_1, s_2, a_2, \cdots, s_T, a_T\}$

The environment emits $s_1$, the actor takes $a_1$; the environment emits $s_2$, the actor takes $a_2$; and so on.

$p_\theta(\tau) = p(s_1)\, p_\theta(a_1 | s_1)\, p(s_2 | s_1, a_1)\, p_\theta(a_2 | s_2)\, p(s_3 | s_2, a_2) \cdots = p(s_1) \prod_{t=1}^{T} p_\theta(a_t | s_t)\, p(s_{t+1} | s_t, a_t)$
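
Only the policy factors $p_\theta(a_t | s_t)$ depend on $\theta$; the dynamics $p(s_{t+1} | s_t, a_t)$ come from the environment. A sketch of the $\theta$-dependent part of $\log p_\theta(\tau)$, with `states` a (T, obs_dim) tensor and `actions` a (T,) integer tensor (hypothetical names):

    import torch

    def log_prob_of_trajectory(policy, states, actions):
        probs = policy(states)                          # (T, n_actions)
        chosen = probs.gather(1, actions.unsqueeze(1))  # p_theta(a_t | s_t) per step
        return torch.log(chosen).sum()                  # sum_t log p_theta(a_t | s_t)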

Page 10: Actor, Environment, Reward

At each step, a reward function gives reward $r_t$ for the pair $(s_t, a_t)$; over one trajectory the total reward is $R(\tau) = \sum_{t=1}^{T} r_t$.

Expected reward:
$\bar{R}_\theta = \sum_\tau R(\tau)\, p_\theta(\tau) = E_{\tau \sim p_\theta(\tau)}[R(\tau)]$

Page 11: Policy Gradient

$\bar{R}_\theta = \sum_\tau R(\tau)\, p_\theta(\tau)$. What is $\nabla \bar{R}_\theta$?

$\nabla \bar{R}_\theta = \sum_\tau R(\tau)\, \nabla p_\theta(\tau) = \sum_\tau R(\tau)\, p_\theta(\tau)\, \frac{\nabla p_\theta(\tau)}{p_\theta(\tau)} = \sum_\tau R(\tau)\, p_\theta(\tau)\, \nabla \log p_\theta(\tau)$
$\quad = E_{\tau \sim p_\theta(\tau)}[R(\tau)\, \nabla \log p_\theta(\tau)]$
$\quad \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n)\, \nabla \log p_\theta(\tau^n) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n | s_t^n)$

(using the identity $\nabla f(x) = f(x)\, \nabla \log f(x)$)

$R(\tau)$ does not have to be differentiable; it can even be a black box.

Page 12: Policy Gradient

Given policy $\pi_\theta$, the training loop alternates two phases:

1. Data collection: use $\pi_\theta$ to play $N$ games, recording $\tau^1: (s_1^1, a_1^1), (s_2^1, a_2^1), \ldots$ with return $R(\tau^1)$; $\tau^2: (s_1^2, a_1^2), (s_2^2, a_2^2), \ldots$ with return $R(\tau^2)$; and so on. Every pair in $\tau^n$ is weighted by $R(\tau^n)$.
2. Model update: $\theta \leftarrow \theta + \eta \nabla \bar{R}_\theta$, with $\nabla \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n | s_t^n)$.

Because $\nabla \bar{R}_\theta = E_{\tau \sim p_\theta(\tau)}[R(\tau)\, \nabla \log p_\theta(\tau)]$ is an expectation under the current policy, the collected data can only be used once: after each update the policy has changed, and new data must be collected.
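
The loop in code form (a sketch: `collect_trajectory`, `policy_gradient_loss`, `env`, `optimizer`, `N`, and `num_iterations` are all hypothetical names; the loss itself is sketched on the next page):

    for iteration in range(num_iterations):
        batch = [collect_trajectory(env, policy) for _ in range(N)]  # data collection
        loss = policy_gradient_loss(policy, batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                             # theta updated
        # `batch` is now stale: it was sampled from the old theta, so discard it.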

Page 13: Implementation

$\theta \leftarrow \theta + \eta \nabla \bar{R}_\theta$, $\quad \nabla \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n | s_t^n)$

Consider it as a classification problem: given input $s_t^n$, the "correct class" is the action $a_t^n$ that was actually taken (one-hot target, e.g., left = 1, right = 0, fire = 0). Maximizing
$\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log p_\theta(a_t^n | s_t^n)$
is ordinary cross-entropy training, whose gradient is $\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \nabla \log p_\theta(a_t^n | s_t^n)$. To obtain the policy gradient, weight each term by $R(\tau^n)$: the objective becomes
$\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \log p_\theta(a_t^n | s_t^n)$
with gradient $\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n | s_t^n)$, i.e., a cross-entropy weighted by $R(\tau^n)$, which TF, PyTorch, … can optimize directly.
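
A minimal sketch of this weighted cross-entropy in PyTorch (tensor names are assumptions: `states` is (B, obs_dim), `actions` is (B,), and `returns` repeats $R(\tau^n)$ across all steps of trajectory $n$; `policy.net` is the raw-logit network from the Page 6 sketch):

    import torch
    import torch.nn.functional as F

    def policy_gradient_loss(policy, states, actions, returns):
        logits = policy.net(states)               # raw scores, before softmax
        logp = F.log_softmax(logits, dim=-1)      # log p_theta(. | s_t)
        logp_taken = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
        # Negative sign: optimizers minimize, but we maximize the objective.
        return -(returns * logp_taken).mean()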

Page 14: Tip 1: Add a Baseline

$\theta \leftarrow \theta + \eta \nabla \bar{R}_\theta$, $\quad \nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n | s_t^n)$

It is possible that $R(\tau^n)$ is always positive. In the ideal case this is still fine: the probabilities of all sampled actions rise, but those with larger rewards rise more, and normalization pushes the rest down. With sampling, however, an action that is simply never sampled has its probability decreased, even if it is a good action.

Fix: subtract a baseline $b$ so that the weight can be negative:
$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \big(R(\tau^n) - b\big)\, \nabla \log p_\theta(a_t^n | s_t^n)$, $\quad b \approx E[R(\tau)]$

Page 15: Tip 2: Assign Suitable Credit

$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \big(R(\tau^n) - b\big)\, \nabla \log p_\theta(a_t^n | s_t^n)$

Weighting every action in a trajectory by the same total reward is unfair. Example: in the episode $(s_a, a_1)\, (s_b, a_2)\, (s_c, a_3)$ with rewards $+5, +0, -2$, the total is $R = +3$, so all three actions are weighted $\times 3$, even though the last one led to a negative reward. In the episode $(s_a, a_2)\, (s_b, a_2)\, (s_c, a_3)$ with rewards $-5, +0, -2$, the total is $R = -7$ and all three actions are weighted $\times(-7)$, even though only the first action was bad: $(s_b, a_2)$ and $(s_c, a_3)$ should be weighted by what happened after them, i.e., $\times(-2)$.

Fix: weight the action at time $t$ by the reward obtained from $t$ onward, $\sum_{t'=t}^{T_n} r_{t'}^n$, instead of $R(\tau^n)$.

Page 16: Tip 2: Assign Suitable Credit

In
$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \big(R(\tau^n) - b\big)\, \nabla \log p_\theta(a_t^n | s_t^n)$
replace $R(\tau^n)$ first by the reward-to-go $\sum_{t'=t}^{T_n} r_{t'}^n$, and then add a discount factor $\gamma < 1$:
$\sum_{t'=t}^{T_n} \gamma^{t'-t}\, r_{t'}^n$

The baseline $b$ can be state-dependent. The resulting weight is the Advantage Function $A^\theta(s_t, a_t)$: how good it is to take $a_t$ rather than the other actions at $s_t$. It can be estimated by a "critic" (covered later).
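
A short sketch of the discounted reward-to-go (pure Python; the backward recursion implements $\sum_{t' \ge t} \gamma^{t'-t} r_{t'}$ in one pass):

    def rewards_to_go(rewards, gamma=0.99):
        out, running = [], 0.0
        for r in reversed(rewards):
            running = r + gamma * running  # discounted sum from t onward
            out.append(running)
        return list(reversed(out))

    # The Page 15 example with gamma = 1:
    print(rewards_to_go([5.0, 0.0, -2.0], gamma=1.0))  # [3.0, -2.0, -2.0]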

Page 17: From On-policy to Off-policy

Goal: using the experience more than once.

Page 18: On-policy vs. Off-policy

• On-policy: the agent learned and the agent interacting with the environment are the same one.
• Off-policy: the agent learned and the agent interacting with the environment are different.

(Hikaru no Go analogy from the slide: on-policy is Hikaru learning by playing Go himself; off-policy is Sai playing while Hikaru watches from the side.)

Page 19: On-policy → Off-policy

$\nabla \bar{R}_\theta = E_{\tau \sim p_\theta(\tau)}[R(\tau)\, \nabla \log p_\theta(\tau)]$

• Use $\pi_\theta$ to collect data. When $\theta$ is updated, we have to sample training data again.
• Goal: use samples from $\pi_{\theta'}$ to train $\theta$. Since $\theta'$ is fixed, the sampled data can be re-used.

Importance Sampling:
$E_{x \sim p}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i)$, where the $x_i$ are sampled from $p(x)$.
But suppose we only have $x_i$ sampled from $q(x)$. Then
$E_{x \sim p}[f(x)] = \int f(x)\, p(x)\, dx = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx = E_{x \sim q}\Big[f(x)\, \frac{p(x)}{q(x)}\Big]$
where $\frac{p(x)}{q(x)}$ is the importance weight.
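
A numerical sketch of importance sampling in NumPy: estimate $E_{x \sim p}[x^2] = 1$ for $p = N(0, 1)$ using only samples from a different Gaussian $q$ (the distributions are illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)

    def gauss_pdf(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    x = rng.normal(0.5, 1.5, size=100_000)               # samples from q = N(0.5, 1.5)
    w = gauss_pdf(x, 0.0, 1.0) / gauss_pdf(x, 0.5, 1.5)  # importance weights p/q
    print(np.mean(x ** 2 * w))                           # close to 1.0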

Page 20: Issue of Importance Sampling

$E_{x \sim p}[f(x)] = E_{x \sim q}\Big[f(x)\, \frac{p(x)}{q(x)}\Big]$, but the variances differ:
$\mathrm{Var}_{x \sim p}[f(x)] \neq \mathrm{Var}_{x \sim q}\Big[f(x)\, \frac{p(x)}{q(x)}\Big]$

Using $\mathrm{Var}[X] = E[X^2] - (E[X])^2$:
$\mathrm{Var}_{x \sim p}[f(x)] = E_{x \sim p}[f(x)^2] - \big(E_{x \sim p}[f(x)]\big)^2$
$\mathrm{Var}_{x \sim q}\Big[f(x)\, \frac{p(x)}{q(x)}\Big] = E_{x \sim q}\Big[\Big(f(x)\, \frac{p(x)}{q(x)}\Big)^2\Big] - \Big(E_{x \sim q}\Big[f(x)\, \frac{p(x)}{q(x)}\Big]\Big)^2 = E_{x \sim p}\Big[f(x)^2\, \frac{p(x)}{q(x)}\Big] - \big(E_{x \sim p}[f(x)]\big)^2$

The first term carries an extra factor $\frac{p(x)}{q(x)}$: if $p$ and $q$ differ a lot, the variance can blow up.

Page 21: Issue of Importance Sampling

$E_{x \sim p}[f(x)] = E_{x \sim q}\Big[f(x)\, \frac{p(x)}{q(x)}\Big]$

Suppose $f(x)$ is negative where $p(x)$ is large, while $q(x)$ puts its mass where $f(x)$ is positive. The true $E_{x \sim p}[f(x)]$ is negative, but sampling only a few points from $q$ we are likely to see only the positive-$f$ region and wrongly conclude that $E_{x \sim p}[f(x)]$ is positive. Only a rare sample from the low-$q$ region, which carries a very large importance weight $\frac{p(x)}{q(x)}$, corrects the sign.

Page 22: On-policy → Off-policy

$\nabla \bar{R}_\theta = E_{\tau \sim p_\theta(\tau)}[R(\tau)\, \nabla \log p_\theta(\tau)]$

• Use $\pi_\theta$ to collect data. When $\theta$ is updated, we have to sample training data again.
• Goal: use samples from $\pi_{\theta'}$ to train $\theta$. Since $\theta'$ is fixed, the sampled data can be re-used.

Applying importance sampling, $E_{x \sim p}[f(x)] = E_{x \sim q}\big[f(x)\, \frac{p(x)}{q(x)}\big]$:
$\nabla \bar{R}_\theta = E_{\tau \sim p_{\theta'}(\tau)}\Big[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)}\, R(\tau)\, \nabla \log p_\theta(\tau)\Big]$

• Sample the data from $\theta'$.
• Use that data to train $\theta$ many times.

Page 23: On-policy → Off-policy

With a baseline and credit assignment, the gradient (per state-action pair) is

$E_{(s_t, a_t) \sim \pi_\theta}\big[A^\theta(s_t, a_t)\, \nabla \log p_\theta(a_t^n | s_t^n)\big]$
$= E_{(s_t, a_t) \sim \pi_{\theta'}}\Big[\frac{P_\theta(s_t, a_t)}{P_{\theta'}(s_t, a_t)}\, A^{\theta'}(s_t, a_t)\, \nabla \log p_\theta(a_t^n | s_t^n)\Big]$
$= E_{(s_t, a_t) \sim \pi_{\theta'}}\Big[\frac{p_\theta(a_t | s_t)}{p_{\theta'}(a_t | s_t)}\, \frac{p_\theta(s_t)}{p_{\theta'}(s_t)}\, A^{\theta'}(s_t, a_t)\, \nabla \log p_\theta(a_t^n | s_t^n)\Big]$

The advantage becomes $A^{\theta'}(s_t, a_t)$ because this term is estimated from the sampled data. The state ratio $\frac{p_\theta(s_t)}{p_{\theta'}(s_t)}$ is assumed to be $\approx 1$ and dropped. Using $\nabla f(x) = f(x)\, \nabla \log f(x)$, this is the gradient for updating $\theta$ on the surrogate objective
$J^{\theta'}(\theta) = E_{(s_t, a_t) \sim \pi_{\theta'}}\Big[\frac{p_\theta(a_t | s_t)}{p_{\theta'}(a_t | s_t)}\, A^{\theta'}(s_t, a_t)\Big]$

When to stop updating $\theta$ on the same data?
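
A sketch of $J^{\theta'}(\theta)$ in PyTorch (hypothetical names: `logp_old` holds $\log p_{\theta'}(a_t | s_t)$ recorded and detached at sampling time, `advantages` holds $A^{\theta'}(s_t, a_t)$, and `policy.net` is as in the earlier sketches):

    import torch
    import torch.nn.functional as F

    def surrogate_objective(policy, states, actions, logp_old, advantages):
        logp = F.log_softmax(policy.net(states), dim=-1)
        logp_new = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
        ratio = torch.exp(logp_new - logp_old)  # p_theta / p_theta'
        return (ratio * advantages).mean()      # maximize this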

Page 24: Add Constraint

(Motto on the slide, translated from Chinese: advance steadily, secure every step of the way.)

Page 25: PPO / TRPO

Proximal Policy Optimization (PPO):
$J_{PPO}^{\theta'}(\theta) = J^{\theta'}(\theta) - \beta\, KL(\theta, \theta')$, $\quad J^{\theta'}(\theta) = E_{(s_t, a_t) \sim \pi_{\theta'}}\Big[\frac{p_\theta(a_t | s_t)}{p_{\theta'}(a_t | s_t)}\, A^{\theta'}(s_t, a_t)\Big]$

TRPO (Trust Region Policy Optimization):
$J_{TRPO}^{\theta'}(\theta) = E_{(s_t, a_t) \sim \pi_{\theta'}}\Big[\frac{p_\theta(a_t | s_t)}{p_{\theta'}(a_t | s_t)}\, A^{\theta'}(s_t, a_t)\Big]$, $\quad$ subject to $KL(\theta, \theta') < \delta$

Either way, $\theta$ cannot be very different from $\theta'$. Note that $KL(\theta, \theta')$ is a constraint on behavior, not on the parameters: it measures the distance between the action distributions of the two policies, not between $\theta$ and $\theta'$ themselves.

Page 26: PPO Algorithm

• Initialize policy parameters $\theta^0$.
• In each iteration:
  • Use $\theta^k$ to interact with the environment, collect $\{(s_t, a_t)\}$, and compute the advantages $A^{\theta^k}(s_t, a_t)$.
  • Find $\theta$ optimizing $J_{PPO}(\theta)$, updating the parameters several times on the same data:
    $J_{PPO}^{\theta^k}(\theta) = J^{\theta^k}(\theta) - \beta\, KL(\theta, \theta^k)$, $\quad J^{\theta^k}(\theta) \approx \sum_{(s_t, a_t)} \frac{p_\theta(a_t | s_t)}{p_{\theta^k}(a_t | s_t)}\, A^{\theta^k}(s_t, a_t)$
  • Adaptive KL penalty:
    • If $KL(\theta, \theta^k) > KL_{max}$, increase $\beta$.
    • If $KL(\theta, \theta^k) < KL_{min}$, decrease $\beta$.

Page 27: PPO2 Algorithm

PPO: $J_{PPO}^{\theta^k}(\theta) = J^{\theta^k}(\theta) - \beta\, KL(\theta, \theta^k)$, $\quad J^{\theta^k}(\theta) \approx \sum_{(s_t, a_t)} \frac{p_\theta(a_t | s_t)}{p_{\theta^k}(a_t | s_t)}\, A^{\theta^k}(s_t, a_t)$

PPO2 replaces the KL penalty with clipping:
$J_{PPO2}^{\theta^k}(\theta) \approx \sum_{(s_t, a_t)} \min\Big(\frac{p_\theta(a_t | s_t)}{p_{\theta^k}(a_t | s_t)}\, A^{\theta^k}(s_t, a_t),\ \mathrm{clip}\Big(\frac{p_\theta(a_t | s_t)}{p_{\theta^k}(a_t | s_t)},\, 1 - \varepsilon,\, 1 + \varepsilon\Big)\, A^{\theta^k}(s_t, a_t)\Big)$

(Plot on the slide: as a function of the ratio $\frac{p_\theta(a_t | s_t)}{p_{\theta^k}(a_t | s_t)}$, the clipped term is flat at $1 - \varepsilon$ below $1 - \varepsilon$, linear in between, and flat at $1 + \varepsilon$ above $1 + \varepsilon$.)

Page 28: PPO2 Algorithm

$J_{PPO2}^{\theta^k}(\theta) \approx \sum_{(s_t, a_t)} \min\Big(\frac{p_\theta(a_t | s_t)}{p_{\theta^k}(a_t | s_t)}\, A^{\theta^k}(s_t, a_t),\ \mathrm{clip}\Big(\frac{p_\theta(a_t | s_t)}{p_{\theta^k}(a_t | s_t)},\, 1 - \varepsilon,\, 1 + \varepsilon\Big)\, A^{\theta^k}(s_t, a_t)\Big)$

(Plots on the slide show the objective versus the ratio for the two cases.)
• $A > 0$ (good action): we want to increase $p_\theta(a_t | s_t)$, but the objective flattens once the ratio exceeds $1 + \varepsilon$, so there is no benefit in pushing further.
• $A < 0$ (bad action): we want to decrease $p_\theta(a_t | s_t)$, but the objective flattens once the ratio drops below $1 - \varepsilon$.

In both cases $p_\theta$ is kept from moving too far from $p_{\theta^k}$.
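
A sketch of the PPO2 clipped objective in PyTorch (same hypothetical inputs as the surrogate-objective sketch on Page 23, with `logp_old` now recorded under $\theta^k$):

    import torch
    import torch.nn.functional as F

    def ppo2_objective(policy, states, actions, logp_old, advantages, eps=0.2):
        logp = F.log_softmax(policy.net(states), dim=-1)
        logp_new = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
        ratio = torch.exp(logp_new - logp_old)                  # p_theta / p_theta^k
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
        return torch.min(unclipped, clipped).mean()             # maximize this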

Page 29: Experimental Results

See the PPO paper: https://arxiv.org/abs/1707.06347