
Working in OpenAI Environments &
Designing Your Own

Mike Rudd
CS 885 Guest Lecture
May 18, 2018


OpenAI*

• Not-for-profit, funded by private and corporate donations

• Employs a small team of high-caliber researchers and advisors

• Promotes research towards safe AGI

*https://openai.com/


OpenAI Gym

• Standard set of environments for evaluating RL agents

• Provides a benchmark for most new algorithms

• Extended to more complex problems as solutions improve



Recent Extensions

• Robotics
  • MuJoCo continuous control tasks now “easily solvable”
  • Harder set of continuous control tasks

• Retro contest
  • Agents can overfit to their environment
  • Train an agent that can transfer skills to new environments


Interacting with the Environment
Standardized Code Applicable Across Tasks


Sample Code
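
A minimal sketch of the standardized interaction loop, assuming the classic Gym API of this era, where env.step returns an (observation, reward, done, info) 4-tuple; CartPole-v0 and the random action choice are placeholders, not from the slides:

```python
import gym

# Any registered environment works with the same loop.
env = gym.make("CartPole-v0")

obs = env.reset()
for _ in range(1000):
    env.render()
    action = env.action_space.sample()  # placeholder: uniform random policy
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()  # episode over; start a new one
env.close()
```

Only the environment id and the action-selection line change from task to task; the reset, step, and render calls stay the same.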


Building Your Own Environment
Practically more important than beating Gym benchmarks


Building Your Own Environment

• Not very difficult

• Just define a Python class (sketched below) with methods for:
  • Initialization
  • Step
  • Reset
  • Render

• Existing packages (physics engines) do most of the heavy lifting:
  • Box2D
  • MuJoCo
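
A minimal sketch of such a class, assuming the gym.Env interface of this era. The one-dimensional “parking” task (bring a point car to rest at the origin) is invented for illustration and previews the example on the next slide:

```python
import gym
import numpy as np
from gym import spaces


class ParkingEnv(gym.Env):
    """Hypothetical toy task: accelerate a 1-D car so it stops at the origin."""

    def __init__(self):
        # Action: acceleration in [-1, 1]; observation: (position, velocity).
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
        self.observation_space = spaces.Box(low=-10.0, high=10.0, shape=(2,), dtype=np.float32)
        self.state = None

    def reset(self):
        # Random start position, zero velocity.
        self.state = np.array([np.random.uniform(-5.0, 5.0), 0.0], dtype=np.float32)
        return self.state

    def step(self, action):
        pos, vel = self.state
        acc = float(np.clip(action, -1.0, 1.0)[0])  # action is a shape-(1,) array
        vel += 0.1 * acc
        pos += 0.1 * vel
        self.state = np.array([pos, vel], dtype=np.float32)
        done = bool(abs(pos) < 0.05 and abs(vel) < 0.05)  # parked at the origin
        reward = 1.0 if done else 0.0                     # sparse success reward
        return self.state, reward, done, {}

    def render(self, mode="human"):
        print(f"position={self.state[0]:+.2f}  velocity={self.state[1]:+.2f}")
```

A task this small needs no physics engine; for richer dynamics, Box2D or MuJoCo would supply the state update inside step instead of the hand-written integration above.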


Example: Teaching a Car to Self-Park


Challenge of Reward Definition

• The major difficulty is in creating the reward function

• Algorithms can learn to exploit gaps in our logic, resulting in undesirable behaviours

• See e.g. Ng et al. (1999) for examples and theoretical analysis

Ng, A. Y., Harada, D., & Russell, S. (1999, June). Policy invariance under reward transformations: Theory and application to reward shaping. In ICML (Vol. 99, pp. 278-287).


Reward Shaping

• The theoretically correct reward is 1 for success and 0 otherwise

• This signal is sparse, though, and in practice very difficult to learn from

• Reward shaping modifies the reward function to speed up learning (by providing a dense signal) while leaving the theoretically optimal policy unchanged

• Ng et al. (1999) show that only shaping functions 𝐹 satisfying the following equation, for some potential function Φ over states, are guaranteed to preserve the optimal policy:

$$F(s, a, s') = \gamma\,\Phi(s') - \Phi(s) \qquad \forall\, s \in S \setminus \{s_0\}$$
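
As a concrete illustration, a minimal sketch of potential-based shaping, assuming the toy parking environment sketched earlier and a hypothetical distance-to-goal potential Φ (both illustrative, not from the slides):

```python
GAMMA = 0.99  # discount factor, assumed value for illustration


def potential(state):
    """Hypothetical potential Phi: larger (less negative) nearer the goal at the origin."""
    position, velocity = state
    return -abs(position)


def shaped_reward(reward, state, next_state):
    """Adds F(s, a, s') = gamma * Phi(s') - Phi(s) to the sparse environment reward.

    By Ng et al. (1999), a shaping term of exactly this form gives a dense
    learning signal while leaving the optimal policy unchanged.
    """
    return reward + GAMMA * potential(next_state) - potential(state)
```

The guarantee comes from telescoping: along any trajectory the discounted shaping terms sum to a quantity that depends only on the start state, so every policy's return is shifted by the same constant and the optimal policy is preserved.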
