PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/Reward (v3).pdf · Title...

17
Sparse Reward Hung-yi Lee

Transcript of PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/Reward (v3).pdf · Title...

Page 1: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/Reward (v3).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 7/6/2018 11:52:40 AM

Sparse RewardHung-yi Lee

Page 2: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/Reward (v3).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 7/6/2018 11:52:40 AM

Sparse RewardReward Shaping

Page 3: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/Reward (v3).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 7/6/2018 11:52:40 AM

Reward Shaping

Take “Play”, 𝑟𝑡+1 = 1, 𝑟𝑡+100 = −100

Take “Study”, 𝑟𝑡+1 = −1, 𝑟𝑡+100 = 100

𝑟𝑡+1 = 1

Page 4: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/Reward (v3).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 7/6/2018 11:52:40 AM

Reward Shaping

https://openreview.net/pdf?id=Hk3mPK5gg

Get reward, when closer

Need domain knowledge

VizDoomhttps://openreview.net/forum?id=Hk3mPK5gg&noteId=Hk3mPK5gg

Page 5: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/Reward (v3).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 7/6/2018 11:52:40 AM

Curiosity https://arxiv.org/abs/1705.05363

Actor

𝑠1

𝑎1

Env

𝑠2

Env

𝑠1

𝑎1

Actor

𝑠2

𝑎2

Env

𝑠3

𝑎2

……

𝑅 𝜏 =

𝑡=1

𝑇

𝑟𝑡

Reward

𝑟1

Reward

𝑟2

updatedupdated

ICM = intrinsic curiosity module +𝑟𝑡𝑖

ICM ICM

𝑟1𝑖 𝑟2

𝑖

Page 6: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/Reward (v3).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 7/6/2018 11:52:40 AM

Intrinsic Curiosity Module𝑟𝑡𝑖

𝑠𝑡𝑎𝑡 𝑠𝑡+1

Network 1

Ƹ𝑠𝑡+1 diff

Large reward if 𝑠𝑡+1 is hard to predict

鼓勵冒險

Some states is hard to predict, but not

important.

樹葉飄動

Page 7: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/Reward (v3).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 7/6/2018 11:52:40 AM

Intrinsic Curiosity Module𝑟𝑡𝑖

𝑠𝑡𝑎𝑡 𝑠𝑡+1

Network 1

Feature Ext

Feature Ext

Network 2

𝜙 𝑠𝑡 𝜙 𝑠𝑡+1

𝜙 𝑠𝑡+1

𝑎𝑡

ො𝑎𝑡

diff

𝜙 is useful features related to actions

Page 8: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/Reward (v3).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 7/6/2018 11:52:40 AM

Reward from Auxiliary Task

https://arxiv.org/abs/1611.05397

Page 9: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/Reward (v3).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 7/6/2018 11:52:40 AM

Demo

Page 10: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/Reward (v3).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 7/6/2018 11:52:40 AM

Sparse RewardCurriculum Learning

Page 11: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/Reward (v3).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 7/6/2018 11:52:40 AM

Curriculum Learning

• Starting from simple training examples, and then becoming harder and harder.

VizDoom

Page 12: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/Reward (v3).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 7/6/2018 11:52:40 AM

Reverse Curriculum Generation

𝑠𝑔

➢Given a goal state 𝑠𝑔.

➢ Sample some states 𝑠1 “close” to 𝑠𝑔

➢ Start from states 𝑠1, each trajectory has reward R 𝑠1

𝑠1

Page 13: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/Reward (v3).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 7/6/2018 11:52:40 AM

Reverse Curriculum Generation

𝑠𝑔

➢Delete 𝑠1 whose reward is too large (already learned) or too small (too difficult at this moment)

➢ Sample 𝑠2 from 𝑠1, start from 𝑠2

𝑠2

𝑠1

Page 14: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/Reward (v3).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 7/6/2018 11:52:40 AM

Sparse RewardHierarchical

Reinforcement Learning

Page 15: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/Reward (v3).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 7/6/2018 11:52:40 AM

Hierarchical RL

➢ If lower agent cannot achieve the goal, the upper agent would get penalty.

➢ If an agent get to the wrong goal, assume the original goal is the wrong one.

校長 教授

下面這個例子純屬虛構,跟真實的狀況完全不同

菸酒生

provide goal provide subgoal action

https://arxiv.org/abs/1805.08180

Page 16: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/Reward (v3).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 7/6/2018 11:52:40 AM

https://arxiv.org/abs/1805.08180

Page 17: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/Reward (v3).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 7/6/2018 11:52:40 AM

Acknowledgement

•感謝芮祥麟博士發現課程網頁上拼字的錯誤