Proximal Policy Optimization (PPO)
Transcript of lecture slides (speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture...)
Proximal Policy Optimization (PPO)
default reinforcement learning algorithm at OpenAI
Policy Gradient
On-policy → Off-policy
Add constraint
DeepMind https://youtu.be/gn4nRCC9TwQ
OpenAI https://blog.openai.com/openai-baselines-ppo/
Policy Gradient (Review)
Basic Components
Env, Actor, Reward Function
Examples: a video game (reward: get 20 points when killing a monster), the game of Go (reward given by the rules of Go).
The environment and the reward function are given; you cannot control them.
Policy of Actor
• Policy $\pi$ is a network with parameters $\theta$
• Input: the observation of the machine, represented as a vector or a matrix
• Output: each action corresponds to a neuron in the output layer
[Figure: the network takes the game pixels as input and outputs a score (probability) for each action, e.g. left 0.7, right 0.2, fire 0.1.]
Take the action based on the probability.
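A minimal sketch of such a policy network in PyTorch. The layer sizes, the 128-dimensional observation, and the 3-action space are illustrative assumptions, not from the slides:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Policy pi_theta: observation vector -> probability for each action."""
    def __init__(self, obs_dim=128, n_actions=3):  # sizes are illustrative
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, obs):
        logits = self.net(obs)
        return torch.softmax(logits, dim=-1)   # e.g. [left 0.7, right 0.2, fire 0.1]

# Take the action based on the probability (sample, do not argmax).
policy = PolicyNetwork()
obs = torch.randn(128)                          # stand-in for the observed pixels
probs = policy(obs)
action = torch.distributions.Categorical(probs).sample()
```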
Example: Playing a Video Game
Start with observation $s_1$ → take action $a_1$ ("right"), obtain reward $r_1 = 0$ → observation $s_2$ → take action $a_2$ ("fire", kill an alien), obtain reward $r_2 = 5$ → observation $s_3$ → ...
After many turns: take action $a_T$, obtain reward $r_T$, game over (spaceship destroyed).
This is an episode.
We want the total reward to be maximized.
Total reward: $R = \sum_{t=1}^{T} r_t$
Actor, Environment, Reward
Trajectory $\tau = \{s_1, a_1, s_2, a_2, \dots, s_T, a_T\}$
[Figure: the Actor sees $s_1$ from the Env and outputs $a_1$; the Env returns $s_2$; the Actor outputs $a_2$; and so on.]
$$p_\theta(\tau) = p(s_1)\, p_\theta(a_1 \mid s_1)\, p(s_2 \mid s_1, a_1)\, p_\theta(a_2 \mid s_2)\, p(s_3 \mid s_2, a_2) \cdots = p(s_1) \prod_{t=1}^{T} p_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
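Only the $p_\theta(a_t \mid s_t)$ factors depend on $\theta$; the environment terms do not. A tiny sketch of the factorization in log space (the helper and its inputs are hypothetical):

```python
def log_p_theta_tau(log_p_s1, actor_log_probs, env_log_probs):
    """log p_theta(tau) = log p(s1) + sum_t [ log p_theta(a_t|s_t) + log p(s_{t+1}|s_t,a_t) ].

    actor_log_probs: log p_theta(a_t|s_t) for each t   (the only terms that depend on theta)
    env_log_probs:   log p(s_{t+1}|s_t,a_t) for each t (environment dynamics, not controllable)
    """
    return log_p_s1 + sum(actor_log_probs) + sum(env_log_probs)
```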
Actor, Environment, Reward
[Figure: the same Actor/Env loop, now with the Reward function emitting $r_1$ for $(s_1, a_1)$, $r_2$ for $(s_2, a_2)$, and so on; the Actor is the part that gets updated.]
$$R(\tau) = \sum_{t=1}^{T} r_t$$
Expected reward:
$$\bar{R}_\theta = \sum_\tau R(\tau)\, p_\theta(\tau) = E_{\tau \sim p_\theta(\tau)}[R(\tau)]$$
Policy Gradient
$$\bar{R}_\theta = \sum_\tau R(\tau)\, p_\theta(\tau) \qquad \nabla \bar{R}_\theta = ?$$
$$\nabla \bar{R}_\theta = \sum_\tau R(\tau)\, \nabla p_\theta(\tau) = \sum_\tau R(\tau)\, p_\theta(\tau) \frac{\nabla p_\theta(\tau)}{p_\theta(\tau)} = \sum_\tau R(\tau)\, p_\theta(\tau)\, \nabla \log p_\theta(\tau)$$
(using $\nabla f(x) = f(x)\, \nabla \log f(x)$)
$$= E_{\tau \sim p_\theta(\tau)}[R(\tau)\, \nabla \log p_\theta(\tau)] \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n)\, \nabla \log p_\theta(\tau^n) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n \mid s_t^n)$$
$R(\tau)$ does not have to be differentiable. It can even be a black box.
Policy Gradient
Given policy $\pi_\theta$, sample trajectories: $\tau^1: (s_1^1, a_1^1), (s_2^1, a_2^1), \dots$ with return $R(\tau^1)$; $\tau^2: (s_1^2, a_1^2), (s_2^2, a_2^2), \dots$ with return $R(\tau^2)$; ...
Update model: $\theta \leftarrow \theta + \eta \nabla \bar{R}_\theta$, where
$$\nabla \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n \mid s_t^n)$$
Data collection and model update alternate. Because $\nabla \bar{R}_\theta = E_{\tau \sim p_\theta(\tau)}[R(\tau)\, \nabla \log p_\theta(\tau)]$ is an expectation under the current policy, the sampled data is only used once: after each update it must be collected again with the new $\theta$.
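A sketch of that loop under the assumptions of the earlier PolicyNetwork sketch; `collect_trajectories` is a hypothetical stand-in for actually running the policy in the game:

```python
import torch

# Hypothetical stand-in for running the current policy in the environment and
# returning (state, action, whole-episode return) for every visited step.
def collect_trajectories(policy, n_episodes):
    obs = torch.randn(n_episodes * 10, 128)            # dummy observations
    actions = torch.randint(0, 3, (n_episodes * 10,))  # dummy actions
    returns = torch.randn(n_episodes * 10)             # dummy R(tau^n)
    return obs, actions, returns

policy = PolicyNetwork(obs_dim=128, n_actions=3)        # from the earlier sketch
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for iteration in range(3):
    # Data collection: sample trajectories with the CURRENT pi_theta.
    obs, actions, returns = collect_trajectories(policy, n_episodes=8)
    # Update model: one step of theta <- theta + eta * grad R_bar.
    log_p = torch.distributions.Categorical(policy(obs)).log_prob(actions)
    loss = -(returns * log_p).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # The sampled data is now off-policy and is thrown away ("only used once").
```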
Implementation
Consider it as a classification problem: at each sampled step, the action actually taken $a_t^n$ (e.g. "left") is treated as the ground-truth class for the state $s_t^n$.
[Figure: the network outputs left/right/fire probabilities; the target is the one-hot vector of the taken action, e.g. left = 1, right = 0, fire = 0.]
Ordinary classification would maximize
$$\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log p_\theta(a_t^n \mid s_t^n), \quad \text{with gradient} \quad \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \nabla \log p_\theta(a_t^n \mid s_t^n).$$
Policy gradient simply weights each term by the return of its trajectory: maximize
$$\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n) \log p_\theta(a_t^n \mid s_t^n), \quad \text{with gradient} \quad \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n \mid s_t^n).$$
This can be implemented directly in TF, PyTorch, ...
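A sketch of that "weighted classification" view in PyTorch: the same loss as in the loop above, written as cross-entropy on the taken actions with each term weighted by $R(\tau^n)$. The stand-in network, batch shapes, and data are dummy placeholders:

```python
import torch
import torch.nn.functional as F

logits_net = torch.nn.Linear(128, 3)     # stand-in policy network producing action logits
obs = torch.randn(32, 128)               # states s_t^n pooled over all sampled trajectories
actions = torch.randint(0, 3, (32,))     # actions a_t^n actually taken (the "labels")
returns = torch.randn(32)                # R(tau^n), repeated for every step of trajectory n

# Plain classification minimizes cross-entropy = -(1/N) sum log p_theta(a_t^n | s_t^n).
# Policy gradient weights each term by the return of its trajectory.
ce = F.cross_entropy(logits_net(obs), actions, reduction="none")   # -log p_theta(a_t^n | s_t^n)
loss = (returns * ce).mean()             # = -(1/N) sum_n sum_t R(tau^n) log p_theta(a_t^n | s_t^n)
loss.backward()                          # TF / PyTorch computes the gradient automatically
```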
Tip 1: Add a Baseline
$$\theta \leftarrow \theta + \eta \nabla \bar{R}_\theta, \qquad \nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n \mid s_t^n)$$
It is possible that $R(\tau^n)$ is always positive. In the ideal case this is still fine: every sampled action's probability is pushed up, but actions with larger reward are pushed up more, and after normalization the relative probabilities are still correct. With sampling, however, some actions (say action $a$ among $a, b, c$) may simply not be sampled; the probability of the actions not sampled will decrease, only because every sampled action received a positive weight.
Fix: subtract a baseline $b$ so that the weight can be negative:
$$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \left(R(\tau^n) - b\right) \nabla \log p_\theta(a_t^n \mid s_t^n), \qquad b \approx E[R(\tau)]$$
Tip 2: Assign Suitable Credit
$$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \left(R(\tau^n) - b\right) \nabla \log p_\theta(a_t^n \mid s_t^n)$$
Example: in the episode $(s_a, a_1), (s_b, a_2), (s_c, a_3)$ with rewards $+0, +5, -2$, the total reward is $R = +3$, so every state-action pair is weighted by $+3$. In the episode $(s_a, a_2), (s_b, a_2), (s_c, a_3)$ with rewards $+0, -5, -2$, the total is $R = -7$, so every pair is weighted by $-7$, even though only the rewards that come after an action should be credited to it.
Fix: weight each $(s_t^n, a_t^n)$ by the rewards obtained from time $t$ onwards, i.e. replace $R(\tau^n)$ with $\sum_{t'=t}^{T_n} r_{t'}^n$ (so $(s_c, a_3)$ above is weighted by $-2$ instead of $-7$).
Tip 2: Assign Suitable Credit
$$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \left(\sum_{t'=t}^{T_n} r_{t'}^n - b\right) \nabla \log p_\theta(a_t^n \mid s_t^n)$$
Add a discount factor $\gamma < 1$: replace $\sum_{t'=t}^{T_n} r_{t'}^n$ with $\sum_{t'=t}^{T_n} \gamma^{t'-t} r_{t'}^n$.
The baseline $b$ can be state-dependent. The whole weighting term is then the advantage function $A^\theta(s_t, a_t)$: how good it is to take $a_t$ rather than the other actions at $s_t$. It is usually estimated by a "critic" (introduced later).
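A small NumPy sketch combining Tip 1 and Tip 2: each step is weighted by the discounted reward-to-go minus a baseline. The simple mean baseline here is an illustrative stand-in; as the slides note, the baseline can be state-dependent and the advantage is usually estimated by a critic:

```python
import numpy as np

def discounted_rewards_to_go(rewards, gamma=0.99):
    """G_t = sum_{t' >= t} gamma^(t'-t) * r_{t'} for one episode."""
    g = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        g[t] = running
    return g

# Example episode from the slide: rewards +0, +5, -2.
g = discounted_rewards_to_go([0.0, 5.0, -2.0], gamma=0.99)
baseline = g.mean()            # crude baseline b ~ E[R]; a learned V(s_t) is the usual choice
advantages = g - baseline      # weights used in place of the whole-episode return R(tau)
```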
From on-policy to off-policy
Goal: use the experience more than once.
On-policy vs. Off-policy
• On-policy: the agent that is learning and the agent interacting with the environment are the same.
• Off-policy: the agent that is learning and the agent interacting with the environment are different.
(Analogy: Hikaru learning Go by playing himself, vs. Sai playing while Hikaru watches from the side.)
On-policy → Off-policy
$$\nabla \bar{R}_\theta = E_{\tau \sim p_\theta(\tau)}[R(\tau)\, \nabla \log p_\theta(\tau)]$$
• Use $\pi_\theta$ to collect data. When $\theta$ is updated, we have to sample the training data again.
• Goal: use samples from $\pi_{\theta'}$ to train $\theta$. Since $\theta'$ is fixed, we can re-use the sampled data.
This requires estimating an expectation $E_{x \sim p}[f(x)]$ using samples drawn from a different distribution.
Importance Sampling
$$E_{x \sim p}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x^i), \qquad x^i \text{ sampled from } p(x)$$
But we only have $x^i$ sampled from $q(x)$:
$$E_{x \sim p}[f(x)] = \int f(x)\, p(x)\, dx = \int f(x) \frac{p(x)}{q(x)}\, q(x)\, dx = E_{x \sim q}\!\left[f(x) \frac{p(x)}{q(x)}\right]$$
$p(x)/q(x)$ is the importance weight.
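A quick numerical check of the identity $E_{x \sim p}[f(x)] = E_{x \sim q}[f(x)\, p(x)/q(x)]$; the choice of $f$ and of two Gaussians for $p$ and $q$ is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x)

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p_mu, p_sigma = 0.0, 1.0      # distribution we care about
q_mu, q_sigma = 0.5, 1.5      # distribution we can actually sample from

x_p = rng.normal(p_mu, p_sigma, 100_000)
x_q = rng.normal(q_mu, q_sigma, 100_000)

direct = f(x_p).mean()                                               # E_{x~p}[f(x)]
weights = gaussian_pdf(x_q, p_mu, p_sigma) / gaussian_pdf(x_q, q_mu, q_sigma)
importance = (f(x_q) * weights).mean()                               # E_{x~q}[f(x) p(x)/q(x)]
print(direct, importance)     # the two estimates should be close
```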
Issue of Importance Sampling
$$E_{x \sim p}[f(x)] = E_{x \sim q}\!\left[f(x) \frac{p(x)}{q(x)}\right], \qquad \text{but} \qquad \mathrm{Var}_{x \sim p}[f(x)] \ne \mathrm{Var}_{x \sim q}\!\left[f(x) \frac{p(x)}{q(x)}\right]$$
Using $\mathrm{Var}[X] = E[X^2] - (E[X])^2$:
$$\mathrm{Var}_{x \sim p}[f(x)] = E_{x \sim p}[f(x)^2] - \left(E_{x \sim p}[f(x)]\right)^2$$
$$\mathrm{Var}_{x \sim q}\!\left[f(x) \frac{p(x)}{q(x)}\right] = E_{x \sim q}\!\left[\left(f(x) \frac{p(x)}{q(x)}\right)^2\right] - \left(E_{x \sim q}\!\left[f(x) \frac{p(x)}{q(x)}\right]\right)^2 = E_{x \sim p}\!\left[f(x)^2 \frac{p(x)}{q(x)}\right] - \left(E_{x \sim p}[f(x)]\right)^2$$
The first term is inflated by the ratio $p(x)/q(x)$, so the variance can blow up when $p$ and $q$ differ a lot.
Issue of Importance Sampling
$$E_{x \sim p}[f(x)] = E_{x \sim q}\!\left[f(x) \frac{p(x)}{q(x)}\right]$$
[Figure: $p(x)$ has most of its mass where $f(x)$ is negative, $q(x)$ where $f(x)$ is positive.]
$E_{x \sim p}[f(x)]$ is negative, but with only a few samples from $q$ the estimate looks positive; only a rare sample from the region where $q$ is small (and the importance weight $p(x)/q(x)$ is very large) brings it back to negative.
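A sketch of that failure mode; the specific $p$, $q$, and $f$ below are illustrative choices, picked only so that $f$ is negative exactly where $p$ has its mass while $q$ rarely samples that region:

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_pdf(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

f = lambda x: np.where(x < 0, -1.0, 1.0)
p_mu, q_mu = -1.0, 2.0          # true E_{x~p}[f(x)] is about -0.68

def is_estimate(n_samples):
    x = rng.normal(q_mu, 1.0, n_samples)               # samples from q
    w = gaussian_pdf(x, p_mu) / gaussian_pdf(x, q_mu)  # importance weights p(x)/q(x)
    return (f(x) * w).mean()

print(is_estimate(20))          # few samples: usually a small POSITIVE number (wrong sign)
print(is_estimate(2_000_000))   # many samples: rare, hugely weighted points pull it near -0.68
```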
On-policy → Off-policy
$$\nabla \bar{R}_\theta = E_{\tau \sim p_\theta(\tau)}[R(\tau)\, \nabla \log p_\theta(\tau)]$$
• Use $\pi_\theta$ to collect data. When $\theta$ is updated, we have to sample the training data again.
• Goal: use samples from $\pi_{\theta'}$ to train $\theta$. Since $\theta'$ is fixed, we can re-use the sampled data.
Applying importance sampling, $E_{x \sim p}[f(x)] = E_{x \sim q}[f(x)\, p(x)/q(x)]$:
$$\nabla \bar{R}_\theta = E_{\tau \sim p_{\theta'}(\tau)}\!\left[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)} R(\tau)\, \nabla \log p_\theta(\tau)\right]$$
• Sample the data from $\theta'$.
• Use the data to train $\theta$ many times.
On-policy → Off-policy
$$\nabla \bar{R}_\theta = E_{(s_t, a_t) \sim \pi_\theta}\!\left[A^\theta(s_t, a_t)\, \nabla \log p_\theta(a_t^n \mid s_t^n)\right]$$
$$= E_{(s_t, a_t) \sim \pi_{\theta'}}\!\left[\frac{P_\theta(s_t, a_t)}{P_{\theta'}(s_t, a_t)} A^{\theta'}(s_t, a_t)\, \nabla \log p_\theta(a_t^n \mid s_t^n)\right]$$
$$= E_{(s_t, a_t) \sim \pi_{\theta'}}\!\left[\frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)} \frac{p_\theta(s_t)}{p_{\theta'}(s_t)} A^{\theta'}(s_t, a_t)\, \nabla \log p_\theta(a_t^n \mid s_t^n)\right]$$
The advantage $A^{\theta'}(s_t, a_t)$ is estimated from the sampled data (hence $\theta'$, not $\theta$). Assuming $p_\theta(s_t) \approx p_{\theta'}(s_t)$, the state ratio is dropped. Using $\nabla f(x) = f(x)\, \nabla \log f(x)$, this is the gradient for updating the objective
$$J^{\theta'}(\theta) = E_{(s_t, a_t) \sim \pi_{\theta'}}\!\left[\frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)} A^{\theta'}(s_t, a_t)\right]$$
When to stop? Add a constraint (advance steadily, securing every step).
PPO / TRPO
Proximal Policy Optimization (PPO):
$$J_{\mathrm{PPO}}^{\theta'}(\theta) = J^{\theta'}(\theta) - \beta\, \mathrm{KL}(\theta, \theta'), \qquad J^{\theta'}(\theta) = E_{(s_t, a_t) \sim \pi_{\theta'}}\!\left[\frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)} A^{\theta'}(s_t, a_t)\right]$$
TRPO (Trust Region Policy Optimization):
$$J_{\mathrm{TRPO}}^{\theta'}(\theta) = E_{(s_t, a_t) \sim \pi_{\theta'}}\!\left[\frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)} A^{\theta'}(s_t, a_t)\right], \qquad \mathrm{KL}(\theta, \theta') < \delta$$
$\theta$ cannot be very different from $\theta'$. $\mathrm{KL}(\theta, \theta')$ is a constraint on behavior (the distance between the action distributions), not on the parameters themselves. TRPO imposes it as a hard constraint; PPO puts it into the objective as a penalty, which is easier to optimize.
PPO algorithm (Adaptive KL Penalty)
• Initial policy parameters $\theta^0$
• In each iteration:
  • Use $\theta^k$ to interact with the environment to collect $\{(s_t, a_t)\}$ and compute the advantage $A^{\theta^k}(s_t, a_t)$
  • Find $\theta$ optimizing $J_{\mathrm{PPO}}(\theta)$ (update the parameters several times on the same data):
$$J_{\mathrm{PPO}}^{\theta^k}(\theta) = J^{\theta^k}(\theta) - \beta\, \mathrm{KL}(\theta, \theta^k), \qquad J^{\theta^k}(\theta) \approx \sum_{(s_t, a_t)} \frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)} A^{\theta^k}(s_t, a_t)$$
  • Adaptive KL penalty: if $\mathrm{KL}(\theta, \theta^k) > \mathrm{KL}_{\max}$, increase $\beta$; if $\mathrm{KL}(\theta, \theta^k) < \mathrm{KL}_{\min}$, decrease $\beta$
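A sketch of one iteration of the adaptive-KL version, reusing the earlier PolicyNetwork so KL can be computed in closed form for Categorical distributions. All tensors are dummy placeholders, and the hyperparameters and the factor-of-2 adjustment of $\beta$ are illustrative choices, not prescribed by the slides:

```python
import torch
from torch.distributions import Categorical, kl_divergence

# Dummy batch collected with the old policy theta^k.
obs = torch.randn(64, 128)
actions = torch.randint(0, 3, (64,))
advantages = torch.randn(64)                       # A^{theta^k}(s_t, a_t), e.g. from a critic

policy = PolicyNetwork(obs_dim=128, n_actions=3)   # theta, starting from theta^k
old_probs = policy(obs).detach()                   # p_{theta^k}(.|s_t), frozen
old_log_p = Categorical(old_probs).log_prob(actions)

beta, kl_max, kl_min = 1.0, 0.02, 0.005            # illustrative hyperparameters
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

for _ in range(10):                                # update parameters several times on the same data
    dist = Categorical(policy(obs))
    ratio = torch.exp(dist.log_prob(actions) - old_log_p)    # p_theta / p_{theta^k}
    kl = kl_divergence(Categorical(old_probs), dist).mean()  # KL between old and new action dists
    loss = -(ratio * advantages).mean() + beta * kl          # negative of J_PPO
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Adaptive KL penalty: adjust beta depending on how far theta moved.
with torch.no_grad():
    kl = kl_divergence(Categorical(old_probs), Categorical(policy(obs))).mean()
if kl > kl_max:
    beta *= 2        # KL too large -> increase the penalty
elif kl < kl_min:
    beta /= 2        # KL too small -> decrease the penalty
```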
PPO2 algorithm
$$J_{\mathrm{PPO2}}^{\theta^k}(\theta) \approx \sum_{(s_t, a_t)} \min\!\left( \frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)} A^{\theta^k}(s_t, a_t),\ \mathrm{clip}\!\left(\frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)},\ 1 - \varepsilon,\ 1 + \varepsilon\right) A^{\theta^k}(s_t, a_t) \right)$$
For comparison, the PPO (adaptive KL) objective:
$$J_{\mathrm{PPO}}^{\theta^k}(\theta) = J^{\theta^k}(\theta) - \beta\, \mathrm{KL}(\theta, \theta^k), \qquad J^{\theta^k}(\theta) \approx \sum_{(s_t, a_t)} \frac{p_\theta(a_t \mid s_t)}{p_{\theta^k}(a_t \mid s_t)} A^{\theta^k}(s_t, a_t)$$
[Figure: the clipped term plotted against the ratio $p_\theta(a_t \mid s_t) / p_{\theta^k}(a_t \mid s_t)$, flat outside $[1 - \varepsilon,\ 1 + \varepsilon]$. Taking the min with the unclipped term means that for $A > 0$ the objective stops improving once the ratio exceeds $1 + \varepsilon$, and for $A < 0$ it stops improving once the ratio falls below $1 - \varepsilon$, so there is no benefit in pushing $\theta$ far from $\theta^k$.]
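A sketch of the PPO2 (clipped) objective as a PyTorch loss; `log_p_new`, `log_p_old`, and `advantages` stand in for the quantities from a batch like the one above, and $\varepsilon = 0.2$ is just a common illustrative value:

```python
import torch

def ppo2_loss(log_p_new, log_p_old, advantages, epsilon=0.2):
    """Negative of J_PPO2: min( ratio * A, clip(ratio, 1-eps, 1+eps) * A ), averaged over (s_t, a_t)."""
    ratio = torch.exp(log_p_new - log_p_old)                    # p_theta(a_t|s_t) / p_{theta^k}(a_t|s_t)
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # For A > 0 the benefit is capped at ratio = 1 + eps; for A < 0 it is capped at ratio = 1 - eps,
    # so the optimizer has no incentive to move theta far from theta^k.
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Dummy usage with placeholder tensors:
log_p_new = torch.randn(64, requires_grad=True)
log_p_old = torch.randn(64)
advantages = torch.randn(64)
loss = ppo2_loss(log_p_new, log_p_old, advantages)
loss.backward()
```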
Experimental Results
https://arxiv.org/abs/1707.06347