Deep Reinforcement Learning - DATAWorks2020 · AlphaGo Zero taught human experts new strategies for...

Deep Reinforcement Learning

Ben [email protected]

Outline
● What is Reinforcement Learning (RL) and why is it useful?
● What is Deep Reinforcement Learning (DRL)?
● DRL case studies

What is Reinforcement Learning?
● The workflow:
○ An agent interacts with an environment to learn a policy that maximizes the reward it receives from the environment
○ By selecting actions at random, the agent explores the environment and gradually learns which actions yield a positive or negative reward
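The workflow above can be sketched with tabular Q-learning on a toy problem. This is a minimal illustration, not something taken from the talk: the five-cell corridor environment, the `step` function, and all hyperparameter values are assumptions chosen only to keep the example small.

```python
import random

N_STATES = 5          # hypothetical corridor with cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)    # step left or step right

def step(state, action):
    """Toy environment (an assumption): reward +1 on reaching the goal, else 0."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore with probability epsilon (and break ties randomly);
            # otherwise exploit the action with the higher estimated value.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, ACTIONS[a])
            # Q-learning update: move the estimate toward reward + discounted future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

def greedy_policy(q):
    """Best known action for every non-terminal cell."""
    return [ACTIONS[0 if row[0] > row[1] else 1] for row in q[:-1]]
```

Calling `greedy_policy(train())` returns the action the agent prefers in each cell. In this toy setting the learned policy steps right everywhere, even though the agent was never told where the goal is.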

Where is Reinforcement Learning Applied?
● General algorithm that can be applied to many complex problems
● Solves problems where the rules are simple but the solution is not
○ Games like Go and Chess
○ Robotic control
● Stochastic and partially observable environments where manually designing even a good solution is difficult or impossible

Why is Reinforcement Learning Useful?
● RL represents a paradigm shift from hand-engineering a solution to specifying an objective; no expert knowledge is needed
● Creates agents that exhibit complex behavior and often discover novel solutions
○ AlphaGo Zero taught human experts new strategies for a 3,000-year-old game
● Adaptive to changes in objectives or environments
● The same RL algorithm can solve different problems

Deep Reinforcement Learning (DRL)
● The combination of Reinforcement Learning and Deep Neural Networks (NNs) creates a truly general algorithm that can be applied to almost any problem
○ Neural networks can process many different kinds of data: numeric, images, video, audio, and any combination thereof
● NNs can generalize from past experience to new states
● NNs and RL algorithms are independent; advancements in either area improve agent performance
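One way to picture the pairing of RL with a neural network is to replace a lookup table of action values with a small network that maps state features to one value per action. The sketch below is an untrained, pure-Python illustration. The `QNetwork` class, its layer sizes, and the weight initialization are assumptions, and the training step (gradient updates from observed rewards) is omitted.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Small random weights plus zero biases for one dense layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def relu(v):
    return max(0.0, v)

def dense(x, layer, activation=None):
    """Apply one dense layer: weighted sum plus bias, then an optional activation."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [activation(v) for v in out] if activation else out

class QNetwork:
    """Hypothetical two-layer network: state features in, one value per action out."""

    def __init__(self, n_features, n_hidden, n_actions, seed=0):
        rng = random.Random(seed)
        self.hidden = init_layer(n_features, n_hidden, rng)
        self.output = init_layer(n_hidden, n_actions, rng)

    def q_values(self, state):
        return dense(dense(state, self.hidden, relu), self.output)

    def act(self, state):
        """Greedy action: index of the largest estimated value."""
        values = self.q_values(state)
        return max(range(len(values)), key=values.__getitem__)
```

For example, `QNetwork(4, 8, 2).act([0.1, 0.2, 0.3, 0.4])` yields an action index for a state vector the network has never been shown. A table has no entry for an unseen state, whereas a network always produces an estimate, which is the property that allows generalization once the weights are trained.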

Case Studies (Why learn on video games?)
● Simplified views into real-world scenarios
● An extensive list of environments already exists:
○ Driving simulators, tactical & strategy games, complex 3D worlds
● Games require many characteristics desirable in real-world scenarios
○ Quick decision making
○ Balancing short-term vs long-term goals
○ Adaptability to evolving scenarios

DRL Timeline (Breakout)
● Dec 2013 - First successful application of Deep Reinforcement Learning
● Nov 2016 - DeepMind announces plans to research StarCraft 2 (SC2)
● Jan 2019 - DRL agent beats professional players
● Only 5 years from Breakout to SC2
○ A 34-year difference in release date
○ May 1976 (Breakout) - July 2010 (SC2)

AlphaStar - StarCraft 2
● SC2 is a complex game, played in real time, which requires micro and macro strategic decision-making along with resource management
● Partially observable, requiring enemy positions to be scouted/tracked
● Initially trained to mimic human actions/strategy
● Multiple agents with differing objectives are trained by competing against each other, with AlphaStar incorporating the best strategies discovered
● Unlike AlphaGo, AlphaStar does not use a search algorithm

AlphaStar - 10, Humans - 0
● Beat TLO, a professional player ranked in the top 600, 5-0
○ “AlphaStar takes well-known strategies and turns them on their head. The agent demonstrated strategies I hadn’t thought of before...”
● After another week of training, defeated MaNa, a top 10 player, 5-0
○ “I’ve realised how much my gameplay relies on forcing mistakes and being able to exploit human reactions…”

OpenAI Five - DotA 2
● Multi-agent 5 vs 5 game, played in real time and partially observable; each agent must fulfill its role and trade off personal vs team rewards
● Trained with no human supervision; agents learn from random action policies and self-play
○ 180 years per day for 9 days -> 1,620 years of play
● Reward shaping is used for the final agent, but good policies can be learned from only a binary win/loss signal
● No explicit communication channel between agents; they collaborate based on a shared view of the environment (emergent swarming)
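The difference between a binary win/loss signal and a shaped reward, and the trade-off between personal and team rewards, can be sketched as follows. This is an illustrative assumption, not OpenAI Five's actual reward code: the function names, the event names, the weights, and the `team_weight` parameter are all hypothetical.

```python
def sparse_reward(won):
    """Binary outcome signal: +1 for a win, -1 for a loss."""
    return 1.0 if won else -1.0

def shaped_reward(won, game_over, events, weights):
    """Weighted sum of intermediate events, plus the outcome once the game ends.

    `events` maps a (hypothetical) event name to how often it occurred this step;
    `weights` maps the same names to hand-chosen reward values.
    """
    bonus = sum(weights.get(name, 0.0) * count for name, count in events.items())
    return bonus + (sparse_reward(won) if game_over else 0.0)

def blend_with_team(own_rewards, team_weight):
    """Mix each agent's own reward with the team average.

    team_weight = 0.0 makes every agent purely selfish;
    team_weight = 1.0 makes every agent receive only the team mean.
    """
    team_mean = sum(own_rewards) / len(own_rewards)
    return [(1.0 - team_weight) * r + team_weight * team_mean for r in own_rewards]
```

With `team_weight=1.0`, a reward earned by one of five agents is spread evenly across the team, so `blend_with_team([1.0, 0.0, 0.0, 0.0, 0.0], 1.0)` gives each agent 0.2. Intermediate values of `team_weight` let an agent value both its own role and the team's result.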

OpenAI Five - DotA 2
● Won 2-0 against semi-pro (99th percentile) players
○ Attack target coordination
○ Flanking & ambushes
○ Perfect timing / low-level control
○ Punishes opponents' mistakes without hesitation
● Lost 0-2 against professional players
○ Humans were able to adapt during the game to exploit the AI

Robots Learning (Walker)
● Learns a robust policy in 2 hours, training only on a flat surface

Robots Learning (Quadruped)
● https://youtu.be/aTDkYFZFWug?t=157
● “The learned policies are also robust to changes in hardware… different robot configurations, which roughly contribute 2.0 kg to the total weight, and a new drive which has a spring three times stiffer than the original one.”
● “In terms of computational cost ... the inference on the robot requires less than 25 µs using a single CPU thread.”
● “…this process [designing the rewards & NN architecture] takes about two days for the locomotion policies presented in this work.”

Where Does RL Fail?
● Generalization: applying learned concepts to new & unseen environments
● In the maze environment the agent overfits even with 20,000 training mazes
● AlphaStar competed on only one map
● OpenAI Five plays with hero (18/117), item, and skill restrictions
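Overfitting of this kind is usually detected by holding out environments the agent never trains on and comparing performance on the two sets. The sketch below shows one way to do that. The helper names, the use of integer seeds to identify environments, and the `run_episode(policy, seed)` callback are assumptions for illustration, not part of any benchmark mentioned in the talk.

```python
import random

def split_environments(env_seeds, test_fraction=0.2, seed=0):
    """Shuffle environment seeds and hold out a fraction that training never sees."""
    seeds = list(env_seeds)
    random.Random(seed).shuffle(seeds)
    n_test = max(1, int(len(seeds) * test_fraction))
    return seeds[n_test:], seeds[:n_test]  # (train, test)

def success_rate(policy, env_seeds, run_episode):
    """Fraction of environments in which the policy succeeds.

    `run_episode(policy, seed)` is a caller-supplied (hypothetical) function
    that returns True when the policy solves the environment built from `seed`.
    """
    return sum(bool(run_episode(policy, s)) for s in env_seeds) / len(env_seeds)

def generalization_gap(policy, train_seeds, test_seeds, run_episode):
    """Training success rate minus held-out success rate; near 0 is good."""
    return (success_rate(policy, train_seeds, run_episode)
            - success_rate(policy, test_seeds, run_episode))
```

As a worst-case illustration, a "policy" that simply memorizes its training environments (for example `policy = set(train_seeds)` with `run_episode = lambda p, s: s in p`) solves every training environment and none of the held-out ones, giving a gap of 1.0.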

RL Summary
● Used where the rules are defined but the optimal solution is unknown
● Typically requires an accurate simulation
● Adaptable to rule/requirement changes: retrain instead of re-engineer
● In real-world use cases, special care is needed to test and evaluate performance in unseen scenarios
● DotA 2 Rematch - April 13th

The Power of Reinforcement Learning

AlphaStar was able to beat a professional player in a restricted setting after 7 days of training, then a top 10 player the following week. “...the first success took almost three years of research time and the second success took seven days. Similarly, although OpenAI’s DotA 2 agent lost against a pro team, they were able to beat their old agent 80% of the time with 10 days of training. Wonder where it’s at now…”

- Alex Irpan, Software Engineer at Google Brain Robotics

Q&A
