PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/Reward (v3).pdf · Title...
Transcript of PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/Reward (v3).pdf · Title...
Sparse RewardHung-yi Lee
Sparse RewardReward Shaping
Reward Shaping
Take “Play”, 𝑟𝑡+1 = 1, 𝑟𝑡+100 = −100
Take “Study”, 𝑟𝑡+1 = −1, 𝑟𝑡+100 = 100
𝑟𝑡+1 = 1
Reward Shaping
https://openreview.net/pdf?id=Hk3mPK5gg
Get reward, when closer
Need domain knowledge
VizDoomhttps://openreview.net/forum?id=Hk3mPK5gg¬eId=Hk3mPK5gg
Curiosity https://arxiv.org/abs/1705.05363
Actor
𝑠1
𝑎1
Env
𝑠2
Env
𝑠1
𝑎1
Actor
𝑠2
𝑎2
Env
𝑠3
𝑎2
……
𝑅 𝜏 =
𝑡=1
𝑇
𝑟𝑡
Reward
𝑟1
Reward
𝑟2
updatedupdated
ICM = intrinsic curiosity module +𝑟𝑡𝑖
ICM ICM
𝑟1𝑖 𝑟2
𝑖
Intrinsic Curiosity Module𝑟𝑡𝑖
𝑠𝑡𝑎𝑡 𝑠𝑡+1
Network 1
Ƹ𝑠𝑡+1 diff
Large reward if 𝑠𝑡+1 is hard to predict
鼓勵冒險
Some states is hard to predict, but not
important.
樹葉飄動
Intrinsic Curiosity Module𝑟𝑡𝑖
𝑠𝑡𝑎𝑡 𝑠𝑡+1
Network 1
Feature Ext
Feature Ext
Network 2
𝜙 𝑠𝑡 𝜙 𝑠𝑡+1
𝜙 𝑠𝑡+1
𝑎𝑡
ො𝑎𝑡
diff
𝜙 is useful features related to actions
Reward from Auxiliary Task
https://arxiv.org/abs/1611.05397
Demo
Sparse RewardCurriculum Learning
Curriculum Learning
• Starting from simple training examples, and then becoming harder and harder.
VizDoom
Reverse Curriculum Generation
𝑠𝑔
➢Given a goal state 𝑠𝑔.
➢ Sample some states 𝑠1 “close” to 𝑠𝑔
➢ Start from states 𝑠1, each trajectory has reward R 𝑠1
𝑠1
Reverse Curriculum Generation
𝑠𝑔
➢Delete 𝑠1 whose reward is too large (already learned) or too small (too difficult at this moment)
➢ Sample 𝑠2 from 𝑠1, start from 𝑠2
𝑠2
𝑠1
Sparse RewardHierarchical
Reinforcement Learning
Hierarchical RL
➢ If lower agent cannot achieve the goal, the upper agent would get penalty.
➢ If an agent get to the wrong goal, assume the original goal is the wrong one.
校長 教授
下面這個例子純屬虛構,跟真實的狀況完全不同
菸酒生
provide goal provide subgoal action
https://arxiv.org/abs/1805.08180
https://arxiv.org/abs/1805.08180
Acknowledgement
•感謝芮祥麟博士發現課程網頁上拼字的錯誤