An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level...

39
An Introduction to Reinforcement Learning and the AlphaZero AI James Frost Data Platform Director Quorum

Transcript of An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level...

Page 1: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

An Introduction to Reinforcement Learning and the AlphaZero AIJames Frost

Data Platform Director

Quorum

Page 2: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

About the speaker

• James Frost

• Data Platform Director at Quorum, an Edinburgh based IT consultancy

• Recently completed an MSc in Data Science at Dundee University

• Final year project was to build a backgammon AI influenced by techniques based on DeepMind AlphaGo. This AI achieved human Grandmaster level.

Page 3: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Session agenda

• An introduction to reinforcement learning concepts

• Monte-Carlo learning

• Neural networks as function approximators

• Issues with reinforcement learning and Deep Neural Networks

• DeepMind and AlphaGo

Page 4: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

What is reinforcement learning?

Page 5: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Types of Machine Learning

Supervised Learning

Reinforcement Learning

Unsupervised Learning

Supervised Learning:

Starts with a dataset of known examples. The engine then trains “by example”.

“A” is for Apple…

… not an apple!

Page 6: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Types of Machine Learning

Supervised Learning

Reinforcement Learning

Unsupervised Learning

Unsupervised Learning:

Starts with a dataset where the categories might not be known and looks for patterns / similarities / clusters which may be of interest.

For example, customer segmentation or fraud investigation

Page 7: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Types of Machine Learning

Supervised Learning

Reinforcement Learning

Unsupervised Learning

Reinforcement Learning:

Is based on an agent interacting with the environment, and getting feedback in the form of a reward mechanism.

Page 8: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

How do dogs learn?

All training should be reward based. Giving your dog something they really like such as food, toys or praise when they show a particular behaviour means that they are more likely to do it again.

RSPCA Website

Page 9: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Principles of reinforcement learning

Agent

Environment

actionAt

reward Rt

state St+1

Page 10: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Rewards

• A reward Rt is a scalar feedback signal

• Indicates the value of carrying out step t

Money won or lost –e.g. poker or stock market

Win (+1) or Loss (-1)

Kill John Connor (+10,000)Getting to destination (+100)Falling over (-50)Taking a step (-1)

Page 11: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Agent

The Agent generally has the following components:

• Model – the agents representation of the environment

• Policy – how the agent behaves

• Value function – estimate of how good a state or action is

Page 12: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Model

• Environment state is the environments private representation

• Often not visible

• Model is a representation of the environment state through observation

Page 13: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Value Functions

Almost all reinforcement learning algorithms involve estimating value functions that estimate how good it is for the agent to be in a given state

…the most important component of almost all reinforcement learning algorithms we consider is a method for efficiently estimating values

(Sutton, 2017)

Page 14: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Value Functions / Policy - chess

A sample value function for chess might be the estimated chance of winning from that position.

Chess PolicyFrom each state, calculate all legal movesFor each possible move, move to state with highest value function.

OR from each state pick the move with the highest action value.

This is the optimal policy.

Page 15: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Principles of reinforcement learning

1. Accurately estimating the value function is critical for reinforcement learning

But how do we do that?

Page 16: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Technique 1 – Monte Carlo Learning

• Play a large number of games at random.

• Record how many times each state is seen, and how many games were won from that state (or action). This lets us know the estimated value function.

• This is the evaluation step for the random policy

X O

X

O

41% win rate from here

Action values State values

-0.41 -0.54 -0.41

-0.54 X -0.54

-0.41 -0.54 -0.41

0.32 0.19 0.32

0.19 0.48 0.19

0.32 0.19 0.32

Page 17: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Now use Monte-Carlo Control

• Now play another bunch of games, but this time act “greedily” with respect of the value predicted by the previous policy.

• Record how many times each state is seen, and how many games were won from that state

100% win rate from here

X O

X

O

-0.14 -0.82 -0.14

-0.82 X -0.82

-0.14 -0.82 -0.14

0.07 0.08 0.07

0.08 0.20 0.08

0.07 0.08 0.07

Page 18: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Don’t get too greedy…

• However, by only picking the best moves (greedy) we sometimes miss possible moves that might be better.

• So need to introduce an element of exploration.

• ε-greedy learning works as follows:

• With probability (1-ε) make a greedy move• With probability ε move at random.

• ε is often reduced as the number of episodes increases – this is guaranteed to converge to the optimal policy.

Page 19: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Learning cycle

• Policy evaluation / policy improvement is the core concept of reinforcement learning

• By acting greedily with respect to the value function we can create a new, improved policy.

• Iterating this process will trend towards the optimal policy

Optimal policyPolicyevaluation

Policyimprovement

Page 20: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Problems with large state spaces

Page 21: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Neural Networks

Unfortunately most useful problems can’t store the state for every scenario.

• Chess has 10^47 states

• Go has 10^170 states.

• How many states to record every possible scenario for a driverless car or Terminator robot?

So we need some form of function approximator.

Page 22: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Neural networks as value approximators

-4

-1

-1 6

5 2

-1 6

2

-3

-4

0

0

0

0

0

0

-3

.

.

.

.

.

.

Page 23: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Monte-Carlo Learning and Neural Networks

• Same principles as tic-tac-toe

• Play a number of games at random

• Sample states (or state / action pairs) from the games, the reward that these states led to, discounted by the number of steps

• Use these samples to feed into the neural network for training

• Now repeat the process, but instead of random play, use the neural network to predict the best moves. Pick some moves at random (ε-greedy)

Page 24: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Deep Neural Networks as Value Functions

Deep Neural Networks (DNNs) would appear to be a great candidate for a value function approximator. However, these can suffer from the following causes of instability:

• correlations present in the sequence of observations

• small updates to action-values estimates (Q) may significantly change the policy

Page 25: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Atari

• Deepmind paper from 2015

• Taught a deep neural network to play Atari games, such as Breakout and Kung Fu Master

• Achieved above human level in over half of the games – in some cases superhuman performance (e.g. Breakout).

• The network was trained using a technique based on Q-learning

but introduced two important concepts:• Experience replay

• Target Q network

Page 26: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Source: Human-level control through deep reinforcement learning. Nature, 518. https://doi.org/10.1038/nature14236

Page 27: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Experience Replay and target Q network

Agent

Environment

actionAt

reward Rt state St+1

( St, At ,Rt St+1 ), ( St-1, At-1 ,Rt-1 St ), ( St-2, At-2 ,Rt-2 St-1 ), ( St-3, At-3 ,Rt-3 St-2 ), ( St-4, At-4 ,Rt-4 St-3 ), ( St-5, At-5 ,Rt-5 St-4 )

Replay Buffer

Random samplesTo learn new policy

Agent v2

Page 28: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Effect of Target Q and Experience Replay

Page 29: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

AlphaGo

Page 30: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

AlphaGo

• Initially trained on a database of expert human games

• Then self play

• After months of training beat Lee Sedol

• AlphaGoZero was trained from completely random play

• Within 36 hours of training AlphaZero beat AlphaGoLee 100-0

Page 31: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

“Each time you put knowledge into a system you are actually handicapping it”

David Silver (Silver, 2015)

Page 32: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

AlphaGoZero

DeepMind’s AlphaGoZero project applied the following principles

• only uses the raw board position as input features

• use a simple Monte-Carlo Tree Search (MCTS) to evaluate positions and sample moves

• residual neural network architecture

• dual-headed network

Page 33: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Monte-Carlo tree search

Source: Mastering the game of Go without human knowledge. https://doi.org/doi:10.1038/nature24270

Page 34: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

AlphaZero Success

• Within 4 hours had mastered chess from first principles, beating the worlds then greatest chess engine.

• Within 2 hours had beaten Elmo at Shogi.

• (running on 5,000 TPUs)

Page 35: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

What is most important?

Data

OR

Algorithm

Page 36: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Reinforcement learning concepts summary

1. Value estimation is critical to the success of a reinforcement learning algorithm.

2. Monte-Carlo learning is a relatively simple technique to get started with and can be applied to a wide range of problems

3. Balancing exploration vs exploitation critical

4. Deep neural networks can be unstable with reinforcement learning. Deep Q Networks with Experience Replay can help stabilise this.

Page 37: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Where next

• The re-enforcement learning community are now focussing on games with imperfect information and much deeper strategies:

• No limit Texas Hold’em Poker – Deepstack• StarCraft2 – AlphaStar• DOTA2 – OpenAI 5• OpenSpiel

• Ultimately the aim is to apply deep learning to real world scenarios:• Energy efficiency• Self driving cars• Protein folding• Medical diagnosis• Materials research• Artificial General Intelligence

Page 38: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

References / Useful links

Netflix AlphaGo documentary

Silver, D. (2015) UCL course on RL.

Sutton, R. S. and Barto, A. G. (2017) Reinforcement Learning: An Introduction.

OpenSpiel - https://github.com/deepmind/open_spiel

Deepmind (2016) Human-level control through deep reinforcement learning

Deepmind (2017) Mastering the game of Go without human knowledge

Deepmind (2018) AlphaZero: Shedding new light on the grand games of chess, shogi and Go, deepmind.com.

Page 39: An Introduction to Reinforcement Learning and the AlphaZero AI · Deepmind (2016) Human-level control through deep reinforcement learning Deepmind (2017) Mastering the game of Go

Quorum Confidential

Thank YouQuorum Network Resources Ltd

18 Greenside Lane, Edinburgh, EH1 3AH

www.qnrl.com | [email protected]

Reg. No. SC 196645, Registered Office: 18 Greenside Lane, Edinburgh, EH1 3AH