Deep Reinforcement Learning - DATAWorks2020

Deep Reinforcement Learning Ben Bell [email protected]


  • Deep Reinforcement Learning

    Ben Bell [email protected]

  • Outline
    ● What is Reinforcement Learning (RL) and why is it useful?
    ● What is Deep Reinforcement Learning (DRL)?
    ● DRL case studies

  • What is Reinforcement Learning?
    ● The Workflow:
      ○ An agent interacts with an environment to learn a policy that maximizes the reward from the environment
      ○ By randomly selecting actions, the agent explores the environment and gradually learns which actions yield a positive or negative reward
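
The workflow above can be sketched with tabular Q-learning, one classic RL algorithm among many. This is a minimal illustration, not the method used in any of the case studies: the corridor environment, the function names, and every hyperparameter value are assumptions made for the example.

```python
import random


def train_corridor_agent(n_states=6, episodes=500, alpha=0.5,
                         gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a hypothetical 1-D corridor.

    States are 0..n_states-1. The agent starts at state 0 and the episode
    ends at the right-most state, which pays reward +1; every other step
    pays 0. Actions: 0 = move left, 1 = move right.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]  # q[state][action]
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        for _ in range(200):  # step cap so an unlucky episode still ends
            # Explore at random with probability epsilon (or on a tie),
            # otherwise exploit the action currently believed to be best.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = max(0, state - 1) if action == 0 else state + 1
            reward = 1.0 if next_state == goal else 0.0
            # The terminal state has no future value to bootstrap from.
            future = 0.0 if next_state == goal else gamma * max(q[next_state])
            q[state][action] += alpha * (reward + future - q[state][action])
            state = next_state
            if state == goal:
                break
    return q


def greedy_policy(q):
    """Best known action for every non-terminal state."""
    return [0 if left > right else 1 for left, right in q[:-1]]
```

Calling `greedy_policy(train_corridor_agent())` returns the learned action for each non-terminal state. Early episodes are mostly random walks; once the reward has been seen, its value propagates backwards through the table one state at a time.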

  • Where is Reinforcement Learning Applied?
    ● General algorithm that can be applied to many complex problems
    ● Solves problems where the rules are simple but the solution is not
      ○ Games like Go and Chess
      ○ Robotic control
    ● Stochastic and partially observable environments where manually designing even a good solution is difficult or impossible

  • Why is Reinforcement Learning Useful?
    ● RL represents a paradigm shift from hand-engineering a solution to specifying an objective; no expert knowledge is needed
    ● Creates agents that exhibit complex behavior and often discover novel solutions
      ○ AlphaGo Zero taught human experts new strategies for a 3,000-year-old game
    ● Adaptive to changes in objectives or environments
    ● The same RL algorithm can solve different problems

  • Deep Reinforcement Learning (DRL)
    ● The combination of Reinforcement Learning and Deep Neural Networks (NN) creates a truly general algorithm which can be applied to almost any problem
      ○ Neural networks can process many different kinds of data: numeric, images, video, audio, and any combination thereof
    ● NNs have the capability to generalize past experience to new states
    ● NNs and RL algorithms are independent; advancements in either area improve agent performance
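
The generalization point can be illustrated by swapping a lookup table for a function approximator behind the same interface. In this sketch a one-parameter linear model stands in for a neural network (an assumption made to keep the example dependency-free), and the class names, sample data, and learning rates are all hypothetical.

```python
class TableValue:
    """Lookup table: knows only the exact states it was trained on."""

    def __init__(self):
        self.values = {}

    def update(self, state, target, lr=0.5):
        old = self.values.get(state, 0.0)
        self.values[state] = old + lr * (target - old)

    def predict(self, state):
        return self.values.get(state, 0.0)  # unseen states default to 0


class LinearValue:
    """Stand-in for a neural network: value = w * state + b, fit by SGD."""

    def __init__(self):
        self.w, self.b = 0.0, 0.0

    def update(self, state, target, lr=0.05):
        error = target - self.predict(state)
        self.w += lr * error * state
        self.b += lr * error

    def predict(self, state):
        return self.w * state + self.b


def fit(model, samples, epochs=2000):
    """The training loop is identical whichever model sits behind update()."""
    for _ in range(epochs):
        for state, target in samples:
            model.update(state, target)
    return model


# Hypothetical experience: the true value is 0.5 * state; state 3.0 is never seen.
samples = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0), (4.0, 2.0)]
table = fit(TableValue(), samples)
linear = fit(LinearValue(), samples)
```

Querying both models at the unseen state 3.0 shows the difference: the table falls back to its default, while the approximator produces an estimate from the trend it fitted. Because `fit` never inspects the model, the learning loop and the value representation can be changed independently.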

  • Case Studies (Why learn on video games?)
    ● Simplified views into real-world scenarios
    ● An extensive list of environments already exists:
      ○ Driving simulators, tactical & strategy, complex 3D worlds
    ● Games require many characteristics desirable in real-world scenarios
      ○ Quick decision making
      ○ Balancing short-term vs long-term goals
      ○ Adaptability to evolving scenarios

  • DRL Timeline (Breakout)
    ● Dec 2013 - First successful application of Deep Reinforcement Learning
    ● Nov 2016 - DeepMind announces plans to research StarCraft 2 (SC2)
    ● Jan 2019 - DRL agent beats professional players
    ● Only 5 years from Breakout to SC2
      ○ A 34-year difference in release date
      ○ May 1976 - July 2010

  • AlphaStar - StarCraft 2
    ● SC2 is a complex game, played in real time, which requires micro and macro strategic decision-making along with resource management
    ● Partially observable, requiring enemy positions to be scouted/tracked
    ● Initially trained to mimic human actions/strategy
    ● Multiple agents with differing objectives are trained by competing against each other, with AlphaStar incorporating the best strategies discovered
    ● Unlike AlphaGo, AlphaStar does not use a search algorithm

  • AlphaStar - 10, Humans - 0
    ● Beat TLO, a professional player ranked in the top 600, 5-0
      ○ “AlphaStar takes well-known strategies and turns them on their head. The agent demonstrated strategies I hadn’t thought of before...”
    ● After another week of training, defeated MaNa, a top 10 player, 5-0
      ○ “I’ve realised how much my gameplay relies on forcing mistakes and being able to exploit human reactions…”

  • OpenAI Five - DotA 2
    ● Multi-agent 5 vs 5 game, played in real time and partially observable; each agent must fulfill its role and trade off personal vs team rewards
    ● Trained with no human supervision; agents start from random action policies and learn through self-play
      ○ 180 years per day for 9 days -> 1,620 years of play
    ● Reward shaping is used for the final agent, but good policies can be learned from only a binary win/loss signal
    ● No explicit communication channel between agents; they collaborate based on a shared view of the environment (emergent swarming)
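
The reward ideas on this slide can be sketched in a few lines. This is an illustration only: the event names, weights, and the `team_weight` blending parameter are hypothetical and are not taken from OpenAI Five's actual reward function.

```python
# Self-play experience quoted on the slide: 180 years per day for 9 days.
SELF_PLAY_YEARS = 180 * 9


def binary_reward(won):
    """Sparse signal: nothing until the game ends, then win (1) or loss (0)."""
    return 1.0 if won else 0.0


def shaped_reward(events, weights, outcome=0.0):
    """Dense signal: the outcome plus small hand-weighted bonuses/penalties.

    `events` maps hypothetical event names to counts observed this step;
    `weights` maps the same names to hand-chosen reward values.
    """
    bonus = sum(weights.get(name, 0.0) * count for name, count in events.items())
    return outcome + bonus


def blend_team_rewards(individual, team_weight):
    """Trade off personal vs team reward with a single weight in [0, 1].

    team_weight = 0 leaves each agent fully selfish; team_weight = 1 gives
    every agent the team average.
    """
    team_mean = sum(individual) / len(individual)
    return [(1.0 - team_weight) * r + team_weight * team_mean for r in individual]
```

For example, `blend_team_rewards([1.0, 0.0, 0.0, 0.0, 0.0], 1.0)` spreads one agent's reward evenly across all five, which removes any incentive to compete with teammates for it. Shaping speeds up learning by rewarding intermediate progress, at the cost of encoding human assumptions about what progress looks like.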

  • OpenAI Five - DotA 2
    ● Won 2-0 against semi-pro (99th percentile) players
      ○ Attack target coordination
      ○ Flanking & Ambushes
      ○ Perfect timing / low level control
      ○ Punishes opponents' mistakes without hesitation
    ● Lost 0-2 against professional players
      ○ Humans were able to adapt during the game to exploit the AI

  • Robots Learning (Walker)
    ● Learns a robust policy in 2 hours, training only on a flat surface

  • Robots Learning (Quadruped)
    ● https://youtu.be/aTDkYFZFWug?t=157
    ● “The learned policies are also robust to changes in hardware… different robot configurations, which roughly contribute 2.0 kg to the total weight, and a new drive which has a spring three times stiffer than the original one.”
    ● “In terms of computational cost ... the inference on the robot requires less than 25 µs using a single CPU thread.”
    ● “…this process [designing the rewards & NN architecture] takes about two days for the locomotion policies presented in this work.”


  • Where Does RL Fail?
    ● Generalization: applying learned concepts to new & unseen environments
    ● In the maze environment, the agent overfits even with 20,000 training mazes
    ● AlphaStar competed on only one map
    ● OpenAI Five plays with hero (18/117), item, and skill restrictions

  • RL Summary
    ● Used where the rules are defined but the optimal solution is unknown
    ● Typically requires an accurate simulation
    ● Adaptable to rule/requirement changes: retrain instead of re-engineer
    ● In real-world use cases, special care is needed to test and evaluate performance in unseen scenarios
    ● DotA 2 Rematch - April 13th
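
One way to act on the evaluation point above is to hold out environments the agent never trains on and report both success rates. The sketch below uses a deliberately extreme agent that only memorizes. The "maze" stand-in, the class and function names, and the seed ranges are all hypothetical, chosen to make the train/held-out gap visible.

```python
import random


def make_task(seed):
    """Hypothetical stand-in for a maze: its solution is a sequence of 8 turns."""
    rng = random.Random(seed)
    return tuple(rng.randrange(4) for _ in range(8))


class MemorizingAgent:
    """Worst case for generalization: stores answers, learns no concept."""

    def __init__(self):
        self.known = {}

    def train(self, seeds):
        for seed in seeds:
            self.known[seed] = make_task(seed)

    def act(self, seed):
        # Falls back to a fixed guess on any environment it has not seen.
        return self.known.get(seed, (0,) * 8)


def success_rate(agent, seeds):
    """Fraction of environments the agent solves exactly."""
    seeds = list(seeds)
    return sum(agent.act(seed) == make_task(seed) for seed in seeds) / len(seeds)


train_seeds = range(200)          # environments used for training
held_out_seeds = range(200, 300)  # environments never shown to the agent

agent = MemorizingAgent()
agent.train(train_seeds)
train_score = success_rate(agent, train_seeds)
held_out_score = success_rate(agent, held_out_seeds)
```

Reporting only `train_score` would make this agent look perfect. The gap between the two numbers is the quantity to watch when deciding whether a policy has learned something transferable.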

  • The Power of Reinforcement Learning

    AlphaStar was able to beat a professional player in a restricted setting after 7 days of training, then a top 10 player the following week. “...the first success took almost three years of research time and the second success took seven days. Similarly, although OpenAI’s DotA 2 agent lost against a pro team, they were able to beat their old agent 80% of the time with 10 days of training. Wonder where it’s at now…”

    - Alex Irpan, Software Engineer at Google Brain Robotics

  • Q&A
