
Proximal Policy Optimization (PPO)

default reinforcement learning algorithm at OpenAI

Policy Gradient

On-policy → Off-policy

Add constraint

DeepMind https://youtu.be/gn4nRCC9TwQ

OpenAI https://blog.openai.com/openai-baselines-ppo/

Policy Gradient (Review)

Basic Components

• Actor
• Environment (Env)
• Reward Function

The environment and the reward function are given; you cannot control them. Examples: in a video game, the reward function may give 20 points for killing a monster; in Go, the reward is determined by the rules of Go.

Policy of Actor

• Policy $\pi$ is a network with parameters $\theta$
• Input: the observation of the machine, represented as a vector or a matrix
• Output: each action corresponds to a neuron in the output layer

[Figure: game pixels are fed into the network, which outputs a score for each action, e.g. left 0.7, right 0.2, fire 0.1.]

The actor takes an action by sampling according to these probabilities.
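As a concrete illustration of this slide, here is a minimal PyTorch sketch of such a policy network; the three actions (left/right/fire) follow the example above, while the network size, observation dimension, and all names are illustrative assumptions, not from the slides.

```python
# Minimal sketch of an actor/policy network (illustrative only).
# Assumptions: the observation is a flattened pixel vector; three actions: left, right, fire.
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int = 3, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one output neuron per action
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Returns a probability for each action (e.g. left 0.7, right 0.2, fire 0.1).
        return torch.softmax(self.net(obs), dim=-1)

# Take the action by sampling from the output distribution.
policy = PolicyNetwork(obs_dim=64)
obs = torch.rand(1, 64)                      # dummy observation
probs = policy(obs)
action = torch.distributions.Categorical(probs).sample()
```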

Example: Playing Video Game

Start with observation $s_1$ → take action $a_1$: "right" → obtain reward $r_1 = 0$
Observation $s_2$ → take action $a_2$: "fire" (kill an alien) → obtain reward $r_2 = 5$
Observation $s_3$ → …

Example: Playing Video Game

After many turns, the actor takes action $a_T$ and obtains reward $r_T$; then the game is over (spaceship destroyed). This is an episode.

We want the total reward

$$R = \sum_{t=1}^{T} r_t$$

to be maximized.

Actor, Environment, Reward

Trajectory: $\tau = \{s_1, a_1, s_2, a_2, \cdots, s_T, a_T\}$

[Figure: the actor observes $s_t$ and outputs $a_t$; the environment takes $s_t, a_t$ and returns $s_{t+1}$, alternating until the episode ends.]

$$p_\theta(\tau) = p(s_1)\, p_\theta(a_1 \mid s_1)\, p(s_2 \mid s_1, a_1)\, p_\theta(a_2 \mid s_2)\, p(s_3 \mid s_2, a_2) \cdots = p(s_1) \prod_{t=1}^{T} p_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$

Actor, Environment, Reward

[Figure: at each step the reward function takes $s_t, a_t$ and returns $r_t$.]

Total reward of a trajectory:

$$R(\tau) = \sum_{t=1}^{T} r_t$$

Expected reward:

$$\bar{R}_\theta = \sum_\tau R(\tau)\, p_\theta(\tau) = E_{\tau \sim p_\theta(\tau)}\big[R(\tau)\big]$$

Policy Gradient

$$\bar{R}_\theta = \sum_\tau R(\tau)\, p_\theta(\tau) \qquad \nabla \bar{R}_\theta = ?$$

$$\nabla \bar{R}_\theta = \sum_\tau R(\tau)\, \nabla p_\theta(\tau) = \sum_\tau R(\tau)\, p_\theta(\tau)\, \frac{\nabla p_\theta(\tau)}{p_\theta(\tau)} = \sum_\tau R(\tau)\, p_\theta(\tau)\, \nabla \log p_\theta(\tau) = E_{\tau \sim p_\theta(\tau)}\big[R(\tau)\, \nabla \log p_\theta(\tau)\big]$$

$$\approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n)\, \nabla \log p_\theta(\tau^n) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n \mid s_t^n)$$

$R(\tau)$ does not have to be differentiable; it can even be a black box. (The derivation uses $\nabla f(x) = f(x)\, \nabla \log f(x)$.)
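The step from $\nabla \log p_\theta(\tau)$ to the per-step sum follows from the factorization of $p_\theta(\tau)$ above: the initial-state and transition probabilities do not depend on $\theta$, so their gradients vanish:

$$\nabla \log p_\theta(\tau) = \nabla \Big[\log p(s_1) + \sum_{t=1}^{T} \log p_\theta(a_t \mid s_t) + \sum_{t=1}^{T} \log p(s_{t+1} \mid s_t, a_t)\Big] = \sum_{t=1}^{T} \nabla \log p_\theta(a_t \mid s_t)$$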

Policy Gradient

Given policy $\pi_\theta$, collect data by running the policy:

$\tau^1$: $(s_1^1, a_1^1), (s_2^1, a_2^1), \dots$ with return $R(\tau^1)$
$\tau^2$: $(s_1^2, a_1^2), (s_2^2, a_2^2), \dots$ with return $R(\tau^2)$
…

Update the model:

$$\theta \leftarrow \theta + \eta\, \nabla \bar{R}_\theta, \qquad \nabla \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n \mid s_t^n)$$

Alternate between data collection and model update. Since $\nabla \bar{R}_\theta = E_{\tau \sim p_\theta(\tau)}\big[R(\tau)\, \nabla \log p_\theta(\tau)\big]$ is an expectation under the current policy, the collected data can only be used once: after each update, new data must be collected.

Implementation

Treat it as a classification problem: given state $s_t^n$, the target "class" is the sampled action $a_t^n$ (e.g. left = 1, right = 0, fire = 0).

The usual classification objective (cross entropy) and its gradient are

$$\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log p_\theta(a_t^n \mid s_t^n), \qquad \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \nabla \log p_\theta(a_t^n \mid s_t^n)$$

For policy gradient, weight each term by the return $R(\tau^n)$:

$$\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \log p_\theta(a_t^n \mid s_t^n), \qquad \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n \mid s_t^n)$$

This can be implemented directly in TF, PyTorch, etc., by maximizing the weighted log-likelihood of the sampled $(s_t^n, a_t^n)$ pairs with weights $R(\tau^n)$.
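Following the weighted-classification view above, here is a minimal PyTorch sketch of the loss; the names `policy`, `states`, `actions`, and `returns` are placeholders I introduce for illustration, not from the slides.

```python
# Sketch of the policy-gradient update as a weighted classification loss.
# Assumes `policy` maps a batch of states to action probabilities (as in the earlier sketch),
# `states` is a float tensor [B, obs_dim], `actions` a long tensor [B], and
# `returns` a float tensor [B] holding R(tau^n) for the trajectory each step came from.
import torch

def policy_gradient_loss(policy, states, actions, returns):
    probs = policy(states)                                            # p_theta(a|s) for every action
    log_probs = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1))
    # Maximize sum R(tau) * log p_theta(a|s)  <=>  minimize its negative.
    return -(returns * log_probs).mean()

# Typical usage:
# optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
# loss = policy_gradient_loss(policy, states, actions, returns)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```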

Tip 1: Add a Baseline

$$\theta \leftarrow \theta + \eta\, \nabla \bar{R}_\theta, \qquad \nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n \mid s_t^n)$$

It is possible that $R(\tau^n)$ is always positive, so every sampled action has its probability increased. In the ideal case this is still fine: actions with larger weights increase more, and after normalization the relatively worse actions become less likely. But we are sampling: an action that happens not to be sampled gets no update, so after normalization its probability decreases even if it is a good action.

[Figure: with actions a, b, c, the probability of the action that was not sampled drops after the update.]

Subtract a baseline $b$ so that the weight can be negative:

$$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \big(R(\tau^n) - b\big)\, \nabla \log p_\theta(a_t^n \mid s_t^n), \qquad b \approx E\big[R(\tau)\big]$$

Tip 2: Assign Suitable Credit

$$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \big(R(\tau^n) - b\big)\, \nabla \log p_\theta(a_t^n \mid s_t^n)$$

Example: the trajectory $(s_a, a_1), (s_b, a_2), (s_c, a_3)$ obtains rewards $+5, +0, -2$, so $R = +3$ and every action in it is weighted by $+3$, even though the rewards obtained after $(s_b, a_2)$ sum to $-2$. Another trajectory $(s_a, a_2), (s_b, a_2), (s_c, a_3)$ obtains rewards $-5, +0, -2$, so $R = -7$ and every action is weighted by $-7$, although most of that comes from $(s_a, a_2)$ and the rewards after $(s_b, a_2)$ again sum to $-2$. It is fairer to credit each action only with the rewards obtained after it is taken: replace $R(\tau^n)$ with $\sum_{t'=t}^{T_n} r_{t'}^n$.

Tip 2: Assign Suitable Credit

$$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \Big(\sum_{t'=t}^{T_n} \gamma^{t'-t}\, r_{t'}^n - b\Big)\, \nabla \log p_\theta(a_t^n \mid s_t^n)$$

Add a discount factor $\gamma < 1$: replace $\sum_{t'=t}^{T_n} r_{t'}^n$ with $\sum_{t'=t}^{T_n} \gamma^{t'-t}\, r_{t'}^n$, so rewards far in the future contribute less to the credit of the current action. The baseline $b$ can be state-dependent.

The whole weight is called the advantage function $A^\theta(s_t, a_t)$: how much better it is to take $a_t$ rather than the other actions at $s_t$. It can be estimated by a "critic" (introduced later).
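Tips 1 and 2 together replace $R(\tau^n)$ by the discounted reward-to-go minus a baseline. A small NumPy sketch of that computation; the constant mean baseline and $\gamma = 0.99$ are illustrative assumptions, not values from the slides.

```python
# Sketch: per-step weights (sum_{t'>=t} gamma^(t'-t) r_{t'}) - b for one episode.
import numpy as np

def discounted_rewards_to_go(rewards, gamma=0.99):
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):   # accumulate from the end of the episode
        running = rewards[t] + gamma * running
        out[t] = running
    return out

rewards = [5.0, 0.0, -2.0]                    # the example rewards from Tip 2
weights = discounted_rewards_to_go(rewards)
baseline = weights.mean()                     # simple constant baseline b ~ E[R]
advantages = weights - baseline               # used in place of R(tau^n) - b
```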

From on-policy to off-policy

Goal: use the collected experience more than once.

On-policy vs. Off-policy

• On-policy: the agent being learned and the agent interacting with the environment are the same.
• Off-policy: the agent being learned and the agent interacting with the environment are different.

(Hikaru no Go analogy: on-policy is Hikaru learning by playing Go himself; off-policy is Sai playing while Hikaru watches from the side.)

On-policy → Off-policy

$$\nabla \bar{R}_\theta = E_{\tau \sim p_\theta(\tau)}\big[R(\tau)\, \nabla \log p_\theta(\tau)\big]$$

• We use $\pi_\theta$ to collect data. When $\theta$ is updated, we have to sample training data again.
• Goal: use samples from $\pi_{\theta'}$ to train $\theta$. Since $\theta'$ is fixed, the sampled data can be re-used.

Importance Sampling

$$E_{x \sim p}\big[f(x)\big] \approx \frac{1}{N} \sum_{i=1}^{N} f(x^i), \quad x^i \text{ sampled from } p(x)$$

But we only have $x^i$ sampled from $q(x)$:

$$E_{x \sim p}\big[f(x)\big] = \int f(x)\, p(x)\, dx = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx = E_{x \sim q}\Big[f(x)\, \frac{p(x)}{q(x)}\Big]$$

The factor $\frac{p(x)}{q(x)}$ is the importance weight.
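A quick numerical check of the importance-sampling identity: estimate $E_{x \sim p}[f(x)]$ using samples from $q$, reweighted by $p(x)/q(x)$. The particular Gaussians and the choice $f(x) = \sin x$ are arbitrary illustrative assumptions.

```python
# Sketch: estimate E_{x~p}[f(x)] with samples drawn from q, reweighted by p(x)/q(x).
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x)

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# target p = N(0, 1), sampling distribution q = N(0.5, 1.5)
x = rng.normal(0.5, 1.5, size=100_000)
weights = gauss_pdf(x, 0.0, 1.0) / gauss_pdf(x, 0.5, 1.5)   # importance weights p(x)/q(x)
estimate = np.mean(f(x) * weights)                          # approximates E_{x~p}[f(x)]

direct = np.mean(f(rng.normal(0.0, 1.0, size=100_000)))     # sanity check: sample from p directly
```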

Issue of Importance Sampling

$$E_{x \sim p}\big[f(x)\big] = E_{x \sim q}\Big[f(x)\, \frac{p(x)}{q(x)}\Big] \quad \text{but} \quad \mathrm{Var}_{x \sim p}\big[f(x)\big] \neq \mathrm{Var}_{x \sim q}\Big[f(x)\, \frac{p(x)}{q(x)}\Big]$$

Using $\mathrm{Var}[X] = E[X^2] - \big(E[X]\big)^2$:

$$\mathrm{Var}_{x \sim p}\big[f(x)\big] = E_{x \sim p}\big[f(x)^2\big] - \big(E_{x \sim p}[f(x)]\big)^2$$

$$\mathrm{Var}_{x \sim q}\Big[f(x)\, \frac{p(x)}{q(x)}\Big] = E_{x \sim q}\Big[\Big(f(x)\, \frac{p(x)}{q(x)}\Big)^2\Big] - \Big(E_{x \sim q}\Big[f(x)\, \frac{p(x)}{q(x)}\Big]\Big)^2 = E_{x \sim p}\Big[f(x)^2\, \frac{p(x)}{q(x)}\Big] - \big(E_{x \sim p}[f(x)]\big)^2$$

The first term is multiplied by $\frac{p(x)}{q(x)}$, so if $p$ and $q$ differ a lot, the variance can be much larger.

Issue of Importance Sampling

[Figure: $f(x)$ together with distributions $p(x)$ and $q(x)$ whose mass lies on opposite sides; where $p$ is large, $f(x)$ is negative, and where $q$ is large, $f(x)$ is positive.]

The true value $E_{x \sim p}[f(x)]$ is negative, but with only a few samples from $q$ we are likely to see only points where $f(x)$ is positive, so the estimate looks positive. Only when a rare sample falls where $p(x)/q(x)$ is very large does its huge importance weight pull the estimate back to negative. With too few samples, the two sides of the identity can therefore look very different.

On-policy → Off-policy

• We use $\pi_\theta$ to collect data. When $\theta$ is updated, we have to sample training data again.
• Goal: use samples from $\pi_{\theta'}$ to train $\theta$. Since $\theta'$ is fixed, the sampled data can be re-used.

Apply importance sampling, $E_{x \sim p}[f(x)] = E_{x \sim q}\big[f(x)\frac{p(x)}{q(x)}\big]$, to the gradient:

$$\nabla \bar{R}_\theta = E_{\tau \sim p_\theta(\tau)}\big[R(\tau)\, \nabla \log p_\theta(\tau)\big] = E_{\tau \sim p_{\theta'}(\tau)}\Big[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)}\, R(\tau)\, \nabla \log p_\theta(\tau)\Big]$$

• Sample the data from $\theta'$.
• Use the data to train $\theta$ many times.

On-policy → Off-policy

With the tips above, weight each step by the advantage:

$$\nabla \bar{R}_\theta = E_{(s_t, a_t) \sim \pi_\theta}\big[A^\theta(s_t, a_t)\, \nabla \log p_\theta(a_t^n \mid s_t^n)\big]$$

$$= E_{(s_t, a_t) \sim \pi_{\theta'}}\Big[\frac{P_\theta(s_t, a_t)}{P_{\theta'}(s_t, a_t)}\, A^{\theta'}(s_t, a_t)\, \nabla \log p_\theta(a_t^n \mid s_t^n)\Big] = E_{(s_t, a_t) \sim \pi_{\theta'}}\Big[\frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)}\, \frac{p_\theta(s_t)}{p_{\theta'}(s_t)}\, A^{\theta'}(s_t, a_t)\, \nabla \log p_\theta(a_t^n \mid s_t^n)\Big]$$

The advantage becomes $A^{\theta'}(s_t, a_t)$ because it is estimated from the sampled data, which were collected with $\theta'$. Assuming $p_\theta(s_t) \approx p_{\theta'}(s_t)$, drop that ratio. Using $\nabla f(x) = f(x)\, \nabla \log f(x)$, this is the gradient of the surrogate objective used for the update:

$$J^{\theta'}(\theta) = E_{(s_t, a_t) \sim \pi_{\theta'}}\Big[\frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)}\, A^{\theta'}(s_t, a_t)\Big]$$

This gives the gradient for the update. But when should we stop re-using the old data, i.e. how different is $\theta$ allowed to become from $\theta'$?

Add Constraint (make steady progress, step by step)

PPO / TRPO

$$J_{PPO}^{\theta'}(\theta) = J^{\theta'}(\theta) - \beta\, KL(\theta, \theta'), \qquad J^{\theta'}(\theta) = E_{(s_t, a_t) \sim \pi_{\theta'}}\Big[\frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)}\, A^{\theta'}(s_t, a_t)\Big]$$

TRPO (Trust Region Policy Optimization) uses the same objective but with the KL term as a hard constraint:

$$J_{TRPO}^{\theta'}(\theta) = E_{(s_t, a_t) \sim \pi_{\theta'}}\Big[\frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)}\, A^{\theta'}(s_t, a_t)\Big] \quad \text{s.t.} \quad KL(\theta, \theta') < \delta$$

In Proximal Policy Optimization (PPO) the KL term is simply added to the objective as a penalty, so $\theta$ cannot become very different from $\theta'$. Note that $KL(\theta, \theta')$ is a constraint on the behavior (the distance between the action distributions), not on the parameters themselves.

PPO algorithm

• Initial policy parameters $\theta^0$
• In each iteration:
  • Use $\theta^k$ to interact with the environment to collect $\{(s_t, a_t)\}$ and compute the advantage $A^{\theta^k}(s_t, a_t)$
  • Find $\theta$ optimizing $J_{PPO}(\theta)$, updating the parameters several times:
    $J_{PPO}^{\theta^k}(\theta) = J^{\theta^k}(\theta) - \beta\, KL(\theta, \theta^k)$, where $J^{\theta^k}(\theta) \approx \sum_{(s_t, a_t)} \frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)}\, A^{\theta^k}(s_t, a_t)$
  • Adaptive KL penalty:
    • If $KL(\theta, \theta^k) > KL_{max}$, increase $\beta$
    • If $KL(\theta, \theta^k) < KL_{min}$, decrease $\beta$
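A small sketch of the adaptive KL penalty step from the algorithm above. The slides only say to increase or decrease $\beta$; the multiplicative factor of 2 and the threshold names here are illustrative assumptions.

```python
# Sketch of the adaptive KL penalty: adjust beta after each round of updates.
# The factor of 2 and the KL thresholds are illustrative assumptions.
def adapt_beta(beta, kl, kl_min, kl_max):
    if kl > kl_max:       # policy moved too far from theta_k: penalize KL more
        beta *= 2.0
    elif kl < kl_min:     # policy barely moved: relax the penalty
        beta /= 2.0
    return beta

# Rough placement inside each PPO iteration:
#   loss = -(ratio * advantage).mean() + beta * measured_kl
#   ... several gradient steps on loss ...
#   beta = adapt_beta(beta, measured_kl, kl_min, kl_max)
```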

PPO algorithm

$$J_{PPO}^{\theta^k}(\theta) = J^{\theta^k}(\theta) - \beta\, KL(\theta, \theta^k), \qquad J^{\theta^k}(\theta) \approx \sum_{(s_t, a_t)} \frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)}\, A^{\theta^k}(s_t, a_t)$$

PPO2 algorithm

$$J_{PPO2}^{\theta^k}(\theta) \approx \sum_{(s_t, a_t)} \min\!\Bigg(\frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)}\, A^{\theta^k}(s_t, a_t),\ \mathrm{clip}\Big(\frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)},\ 1-\varepsilon,\ 1+\varepsilon\Big)\, A^{\theta^k}(s_t, a_t)\Bigg)$$

[Figure: the clip term as a function of the ratio $p_\theta(a_t \mid s_t) / p_{\theta^k}(a_t \mid s_t)$: it equals the ratio between $1-\varepsilon$ and $1+\varepsilon$ and is flat outside that range.]

[Figure: the full clipped objective as a function of the ratio for $A > 0$ and for $A < 0$. When $A > 0$ the objective is flat once the ratio exceeds $1+\varepsilon$; when $A < 0$ it is flat once the ratio drops below $1-\varepsilon$, so there is no incentive to push the ratio outside $[1-\varepsilon, 1+\varepsilon]$.]
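A minimal PyTorch sketch of the PPO2 clipped objective above, written as a loss to minimize. The tensor names and $\varepsilon = 0.2$ are assumptions for illustration; `log_prob_old` would come from the policy $\theta^k$ that collected the data.

```python
# Sketch of the PPO2 (clipped) objective as a PyTorch loss.
# Assumes log_prob_new = log p_theta(a_t|s_t), log_prob_old = log p_theta_k(a_t|s_t)
# (detached, from the data-collecting policy), and advantage = A^{theta_k}(s_t, a_t).
import torch

def ppo2_loss(log_prob_new, log_prob_old, advantage, eps=0.2):
    ratio = torch.exp(log_prob_new - log_prob_old)            # p_theta / p_theta_k
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Maximize the element-wise min of the two terms  <=>  minimize its negative.
    return -torch.min(unclipped, clipped).mean()
```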

Experimental Results

https://arxiv.org/abs/1707.06347