An Introduction to Reinforcement Learning and the AlphaZero AI

James Frost

Data Platform Director

Quorum

About the speaker

• James Frost

• Data Platform Director at Quorum, an Edinburgh-based IT consultancy

• Recently completed an MSc in Data Science at Dundee University

• Final year project was to build a backgammon AI using techniques influenced by DeepMind's AlphaGo. This AI achieved human Grandmaster level.

Session agenda

• An introduction to reinforcement learning concepts

• Monte-Carlo learning

• Neural networks as function approximators

• Issues with reinforcement learning and Deep Neural Networks

• DeepMind and AlphaGo

What is reinforcement learning?

Types of Machine Learning

Supervised Learning

Reinforcement Learning

Unsupervised Learning

Supervised Learning:

Starts with a dataset of known examples. The engine then trains “by example”.

“A” is for Apple…

… not an apple!


Unsupervised Learning:

Starts with a dataset where the categories might not be known and looks for patterns / similarities / clusters which may be of interest.

For example, customer segmentation or fraud investigation


Reinforcement Learning:

Is based on an agent interacting with an environment and receiving feedback in the form of rewards.

How do dogs learn?

All training should be reward based. Giving your dog something they really like such as food, toys or praise when they show a particular behaviour means that they are more likely to do it again.

RSPCA Website

Principles of reinforcement learning

[Diagram: the Agent sends action At to the Environment; the Environment returns reward Rt and state St+1 to the Agent.]
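The loop in the diagram can be sketched in a few lines of Python. This is a minimal illustration only: `CoinFlipEnvironment`, `random_agent` and `run_episode` are hypothetical names invented for this example, and the "guess the coin" game is an assumed toy problem, not something from the talk.

```python
import random


class CoinFlipEnvironment:
    """Hypothetical toy environment: guess a coin flip, +1 if right, -1 if wrong."""

    def __init__(self, episode_length=5, seed=0):
        self.rng = random.Random(seed)
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # initial state (here just the step counter)

    def step(self, action):
        coin = self.rng.choice(["heads", "tails"])
        reward = 1 if action == coin else -1      # reward Rt
        self.t += 1
        done = self.t >= self.episode_length
        return self.t, reward, done               # state St+1, reward Rt, finished?


def random_agent(state, rng):
    """A policy that ignores the state and acts at random."""
    return rng.choice(["heads", "tails"])


def run_episode(env, agent, seed=1):
    rng = random.Random(seed)
    state = env.reset()
    total_reward, done = 0, False
    while not done:
        action = agent(state, rng)                # agent chooses action At
        state, reward, done = env.step(action)    # environment responds
        total_reward += reward
    return total_reward


total = run_episode(CoinFlipEnvironment(), random_agent)
print(total)
```

The agent only ever sees what `step` returns, which is the point of the diagram: the environment's internals stay hidden behind states and rewards.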

Rewards

• A reward Rt is a scalar feedback signal

• Indicates the value of carrying out step t

Money won or lost – e.g. poker or stock market

Win (+1) or Loss (-1)

Kill John Connor (+10,000)

Getting to destination (+100)

Falling over (-50)

Taking a step (-1)

Agent

The Agent generally has the following components:

• Model – the agent's representation of the environment

• Policy – how the agent behaves

• Value function – estimate of how good a state or action is

Model

• Environment state is the environment's private representation

• Often not visible

• Model is a representation of the environment state through observation

Value Functions

Almost all reinforcement learning algorithms involve estimating value functions that estimate how good it is for the agent to be in a given state

…the most important component of almost all reinforcement learning algorithms we consider is a method for efficiently estimating values

(Sutton, 2017)

Value Functions / Policy - chess

A sample value function for chess might be the estimated chance of winning from that position.

Chess Policy

From each state, calculate all legal moves, then make the move that leads to the state with the highest value function.

OR from each state pick the move with the highest action value.

If the value estimates are accurate, this is the optimal policy.
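Both forms of the greedy policy fit in one line each. The sketch below is illustrative: the function names are hypothetical, and instead of chess it uses an assumed toy game in which the state is a number, the moves add -1, +1 or +2, and the value of a state is how close it is to a target of 10.

```python
def greedy_by_state_value(state, legal_moves, apply_move, value):
    """Pick the move that leads to the successor state with the highest value."""
    return max(legal_moves(state), key=lambda move: value(apply_move(state, move)))


def greedy_by_action_value(state, legal_moves, action_value):
    """Pick the move with the highest action value for this state."""
    return max(legal_moves(state), key=lambda move: action_value(state, move))


# Assumed toy game standing in for chess.
TARGET = 10

def legal_moves(state):
    return [-1, 1, 2]

def apply_move(state, move):
    return state + move

def value(state):
    return -abs(TARGET - state)          # closer to the target is better

def action_value(state, move):
    return value(apply_move(state, move))


best_from_7 = greedy_by_state_value(7, legal_moves, apply_move, value)
best_from_9 = greedy_by_action_value(9, legal_moves, action_value)
print(best_from_7, best_from_9)
```

The state-value form needs a model of the game (`apply_move`) to look one move ahead; the action-value form does not, which is one reason action values are popular when the rules are not known.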

Principles of reinforcement learning

1. Accurately estimating the value function is critical for reinforcement learning

But how do we do that?

Technique 1 – Monte Carlo Learning

• Play a large number of games at random.

• Record how many times each state is seen, and how many games were won from that state (or action). This lets us know the estimated value function.

• This is the evaluation step for the random policy

[Figure: a tic-tac-toe position annotated “41% win rate from here”, alongside two 3×3 grids labelled “Action values” and “State values”.]

-0.41  -0.54  -0.41
-0.54    X    -0.54
-0.41  -0.54  -0.41

 0.32   0.19   0.32
 0.19   0.48   0.19
 0.32   0.19   0.32
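The random-play evaluation described above can be sketched for tic-tac-toe as follows. This is a minimal version under stated assumptions: values are recorded from X's point of view for the positions reached after each X move, the reward is +1 for a win, -1 for a loss and 0 for a draw, and the function names and the number of games are choices made for this example.

```python
import random
from collections import defaultdict

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def play_random_game(rng):
    """Both sides move at random. Returns the states seen after X's moves and the winner."""
    board = [" "] * 9
    states_after_x = []
    player = "X"
    while winner(board) is None and " " in board:
        move = rng.choice([i for i, cell in enumerate(board) if cell == " "])
        board[move] = player
        if player == "X":
            states_after_x.append("".join(board))
        player = "O" if player == "X" else "X"
    return states_after_x, winner(board)


def evaluate_random_policy(num_games=20000, seed=0):
    """Monte-Carlo evaluation: average the final reward over every visit to a state."""
    rng = random.Random(seed)
    visits = defaultdict(int)
    returns = defaultdict(float)
    for _ in range(num_games):
        states, result = play_random_game(rng)
        reward = 1 if result == "X" else -1 if result == "O" else 0
        for state in states:
            visits[state] += 1
            returns[state] += reward
    return {state: returns[state] / visits[state] for state in visits}


def opening(square):
    """Board string with a single X on the given square (0-8)."""
    board = [" "] * 9
    board[square] = "X"
    return "".join(board)


values = evaluate_random_policy()
print("centre:", round(values[opening(4)], 2), "edge:", round(values[opening(1)], 2))
```

The estimates are noisy, so the exact numbers vary with the seed and the number of games, but the ordering of the opening squares should resemble the state-value grid on the slide.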

Now use Monte-Carlo Control

• Now play another batch of games, but this time act “greedily” with respect to the values estimated under the previous policy.

• Record how many times each state is seen, and how many games were won from that state

[Figure: a tic-tac-toe position annotated “100% win rate from here”, alongside two 3×3 grids of values.]

-0.14  -0.82  -0.14
-0.82    X    -0.82
-0.14  -0.82  -0.14

 0.07   0.08   0.07
 0.08   0.20   0.08
 0.07   0.08   0.07

Don’t get too greedy…

• However, by only picking the best moves (greedy) we sometimes miss possible moves that might be better.

• So need to introduce an element of exploration.

• ε-greedy learning works as follows:

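• With probability (1-ε) make a greedy move

• With probability ε move at random.

• ε is often reduced as the number of episodes increases. Provided every move keeps being explored and ε decays towards zero (e.g. ε = 1/k at episode k), tabular Monte-Carlo control converges to the optimal policy.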
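A minimal sketch of ε-greedy move selection follows. The function names are hypothetical, the action values are made-up numbers, and the 1/k decay schedule is one common choice assumed here for illustration rather than the schedule used in the talk.

```python
import random


def epsilon_greedy(action_values, epsilon, rng):
    """action_values maps each legal action to its estimated value."""
    actions = list(action_values)
    if rng.random() < epsilon:
        return rng.choice(actions)                 # explore: move at random
    return max(actions, key=action_values.get)     # exploit: greedy move


def decayed_epsilon(episode):
    """Assumed schedule: epsilon = 1/k for episode k = 1, 2, 3, ..."""
    return 1.0 / episode


rng = random.Random(0)
q = {"a": 0.1, "b": 0.5, "c": -0.2}                # made-up estimates
greedy_pick = epsilon_greedy(q, 0.0, rng)          # epsilon = 0 is purely greedy
random_picks = {epsilon_greedy(q, 1.0, rng) for _ in range(200)}
print(greedy_pick, sorted(random_picks))
```

With ε = 0 the agent never explores and with ε = 1 it never exploits; a decaying schedule moves gradually from the second behaviour to the first.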

Learning cycle

• Policy evaluation / policy improvement is the core concept of reinforcement learning

• By acting greedily with respect to the value function we can create a new, improved policy.

• Iterating this process will trend towards the optimal policy

[Diagram: policy evaluation and policy improvement alternate in a cycle, leading towards the optimal policy.]
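The evaluation / improvement cycle can be shown end to end on a problem small enough to evaluate exactly. The sketch below assumes a hypothetical five-square corridor with the goal at the right-hand end, a reward of -1 per step and a discount of 0.9; none of these details come from the talk, and the evaluation step uses exact back-ups rather than Monte-Carlo sampling to keep the example short.

```python
N_STATES, GOAL, GAMMA = 5, 4, 0.9
ACTIONS = (-1, +1)                      # move left or move right


def step(state, action):
    """Deterministic corridor: walls at both ends, every step costs -1."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    return next_state, -1


def evaluate(policy, sweeps=50):
    """Policy evaluation: back up V(s) = r + gamma * V(s') under the given policy."""
    values = [0.0] * N_STATES
    for _ in range(sweeps):
        for s in range(N_STATES):
            if s != GOAL:               # the goal is terminal, its value stays 0
                next_state, reward = step(s, policy[s])
                values[s] = reward + GAMMA * values[next_state]
    return values


def improve(values):
    """Policy improvement: act greedily with respect to the value estimates."""
    def backed_up(s, a):
        next_state, reward = step(s, a)
        return reward + GAMMA * values[next_state]
    return [max(ACTIONS, key=lambda a: backed_up(s, a)) for s in range(N_STATES)]


policy = [-1] * N_STATES                # start with a poor policy: always move left
for _ in range(10):
    values = evaluate(policy)
    new_policy = improve(values)
    if new_policy == policy:            # no change means the policy is stable
        break
    policy = new_policy

print(policy, [round(v, 3) for v in values])
```

Each pass fixes the policy for one more square next to the goal, so the improvement spreads backwards from the reward until every square points the right way.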

Problems with large state spaces

Neural Networks

Unfortunately, for most useful problems we can’t store a value for every state.

• Chess has roughly 10^47 states

• Go has roughly 10^170 states.

• How many states would be needed to record every possible scenario for a driverless car or Terminator robot?

So we need some form of function approximator.

Neural networks as value approximators

[Figure: neural network diagram with a column of numeric input values.]

Monte-Carlo Learning and Neural Networks

• Same principles as tic-tac-toe

• Play a number of games at random

• Sample states (or state / action pairs) from the games, together with the reward each one eventually led to, discounted by the number of steps between the state and the reward

• Use these samples to feed into the neural network for training

• Now repeat the process, but instead of random play, use the neural network to predict the best moves. Pick some moves at random (ε-greedy)
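The sampling step can be sketched as a function that turns one finished game into training pairs. This assumes the only reward arrives at the end of the game and that the discount is applied once per step; `discounted_targets` and the value of `gamma` are hypothetical choices for this example, and the network itself is left out.

```python
def discounted_targets(states, final_reward, gamma=0.99):
    """Pair each state with the final reward, discounted by its distance from the end."""
    last = len(states) - 1
    return [(state, final_reward * gamma ** (last - i))
            for i, state in enumerate(states)]


# A three-move game that ended in a win (+1), with a deliberately heavy discount.
samples = discounted_targets(["s0", "s1", "s2"], 1.0, gamma=0.5)
print(samples)
```

Each (state, target) pair is then one training example for the value network, in the same way a labelled row would be in supervised learning.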

Deep Neural Networks as Value Functions

Deep Neural Networks (DNNs) would appear to be a great candidate for a value function approximator. However, these can suffer from the following causes of instability:

• correlations present in the sequence of observations

• small updates to action-value estimates (Q) may significantly change the policy

Atari

• DeepMind paper from 2015

• Taught a deep neural network to play Atari games, such as Breakout and Kung Fu Master

• Achieved above human level in over half of the games – in some cases superhuman performance (e.g. Breakout).

• The network was trained using a technique based on Q-learning, but introduced two important concepts:

• Experience replay

• Target Q network

https://youtu.be/TmPfTpjtdgg

Source: Human-level control through deep reinforcement learning. Nature, 518. https://doi.org/10.1038/nature14236

Experience Replay and target Q network

[Diagram: the Agent sends action At to the Environment and receives reward Rt and state St+1. Each transition is stored in the Replay Buffer:]

(St, At, Rt, St+1), (St-1, At-1, Rt-1, St), (St-2, At-2, Rt-2, St-1), (St-3, At-3, Rt-3, St-2), (St-4, At-4, Rt-4, St-3), (St-5, At-5, Rt-5, St-4)

[Random samples are drawn from the Replay Buffer to learn a new policy (Agent v2).]
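Both ideas are small in code. The sketch below is illustrative rather than the DeepMind implementation: `ReplayBuffer` and `q_learning_target` are hypothetical names, the buffer stores plain tuples, and `target_q` stands for any function that returns the frozen target network's estimate for a state and action.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions; the oldest ones drop off the end."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Sampling at random breaks up the correlation between consecutive steps.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def q_learning_target(reward, next_state, done, target_q, actions, gamma=0.99):
    """Training target: reward plus the discounted best value from the target network."""
    if done:
        return reward
    return reward + gamma * max(target_q(next_state, a) for a in actions)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(t, "a", 0.0, t + 1, False)          # only the last 3 are kept
batch = buffer.sample(2, rng=random.Random(0))
print(len(buffer), batch)
```

Because the targets come from a copy of the network that is only refreshed occasionally, a small update to the live network does not immediately shift the values it is being trained towards.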

Effect of Target Q and Experience Replay

AlphaGo

• Initially trained on a database of expert human games

• Then self play

• After months of training, beat Lee Sedol

• AlphaGo Zero was trained from completely random play

• Within 36 hours of training AlphaGo Zero had surpassed AlphaGo Lee, and after 72 hours it beat AlphaGo Lee 100-0

“Each time you put knowledge into a system you are actually handicapping it”

David Silver (Silver, 2015)

AlphaGo Zero

DeepMind’s AlphaGo Zero project applied the following principles:

• uses only the raw board position as input features

• uses a simple Monte-Carlo Tree Search (MCTS) to evaluate positions and sample moves

• uses a residual neural network architecture

• uses a dual-headed network (one head outputs the policy, the other the value)
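To make "residual" and "dual-headed" concrete, here is a deliberately tiny forward pass in plain Python. It is a sketch under heavy assumptions: the real network is a deep convolutional one, whereas this uses a single fully connected residual block, random untrained weights and a 3×3 board, and every function name here is hypothetical.

```python
import math
import random


def linear(x, weights, bias):
    """Fully connected layer: one output per row of weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    peak = max(x)
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def residual_block(x, w1, b1, w2, b2):
    """Two layers whose output is added back onto the input (the skip connection)."""
    h = relu(linear(x, w1, b1))
    h = linear(h, w2, b2)
    return relu([xi + hi for xi, hi in zip(x, h)])


def forward(board, params):
    """Shared trunk, then two heads: move probabilities and a value in [-1, 1]."""
    trunk = residual_block(board, *params["block"])
    policy = softmax(linear(trunk, *params["policy_head"]))
    value = math.tanh(linear(trunk, *params["value_head"])[0])
    return policy, value


def random_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
params = {
    "block": random_layer(9, 9, rng) + random_layer(9, 9, rng),
    "policy_head": random_layer(9, 9, rng),
    "value_head": random_layer(9, 1, rng),
}
board = [1.0, 0.0, -1.0, 0.0, 1.0, 0.0, 0.0, 0.0, -1.0]   # raw position: X=1, O=-1
policy, value = forward(board, params)
print([round(p, 2) for p in policy], round(value, 2))
```

Sharing one trunk between the two heads means the features learned for predicting good moves are also used for judging the position, and vice versa.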

Monte-Carlo tree search

Source: Mastering the game of Go without human knowledge. https://doi.org/10.1038/nature24270
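The selection step of the tree search in that paper picks, at each node, the move that maximises the mean value Q plus an exploration bonus proportional to the network's prior P and shrinking with the visit count N. The sketch below shows only that selection rule; the dictionary layout, the function name `puct_select` and the value of `c_puct` are assumptions made for this example.

```python
import math


def puct_select(stats, c_puct=1.0):
    """stats maps each move to {"N": visit count, "Q": mean value, "P": prior}."""
    total_visits = sum(s["N"] for s in stats.values())

    def score(move):
        s = stats[move]
        bonus = c_puct * s["P"] * math.sqrt(total_visits) / (1 + s["N"])
        return s["Q"] + bonus

    return max(stats, key=score)


stats = {
    "a": {"N": 10, "Q": 0.5, "P": 0.3},   # well explored, decent value
    "b": {"N": 0, "Q": 0.0, "P": 0.7},    # unexplored, but the network likes it
}
explore_choice = puct_select(stats, c_puct=1.0)
exploit_choice = puct_select(stats, c_puct=0.0)
print(explore_choice, exploit_choice)
```

The bonus plays the same role as ε in ε-greedy play: it keeps the search trying moves it has not looked at much, but guided by the network's prior rather than chosen at random.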

AlphaZero Success

• Within 4 hours had mastered chess from first principles, beating the world’s then strongest chess engine (Stockfish).

• Within 2 hours had beaten Elmo at shogi.

• (running on 5,000 TPUs)

What is most important?

Data

OR

Algorithm

Reinforcement learning concepts summary

1. Value estimation is critical to the success of a reinforcement learning algorithm.

2. Monte-Carlo learning is a relatively simple technique to get started with and can be applied to a wide range of problems

3. Balancing exploration vs exploitation is critical

4. Deep neural networks can be unstable with reinforcement learning. Deep Q Networks with Experience Replay can help stabilise this.

Where next

• The reinforcement learning community is now focusing on games with imperfect information and much deeper strategies:

• No-limit Texas Hold’em Poker – DeepStack

• StarCraft II – AlphaStar

• Dota 2 – OpenAI Five

• OpenSpiel

• Ultimately the aim is to apply deep learning to real-world scenarios:

• Energy efficiency

• Self-driving cars

• Protein folding

• Medical diagnosis

• Materials research

• Artificial General Intelligence

References / Useful links

Netflix AlphaGo documentary. https://www.netflix.com/gb/title/80190844

Silver, D. (2015) UCL course on RL. http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html

Sutton, R. S. and Barto, A. G. (2017) Reinforcement Learning: An Introduction. https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf

OpenSpiel. https://github.com/deepmind/open_spiel

DeepMind (2015) Human-level control through deep reinforcement learning. https://deepmind.com/research/publications/human-level-control-through-deep-reinforcement-learning

DeepMind (2017) Mastering the game of Go without human knowledge. https://deepmind.com/research/publications/mastering-game-go-without-human-knowledge

DeepMind (2018) AlphaZero: Shedding new light on the grand games of chess, shogi and Go, deepmind.com. https://deepmind.com/blog/alphazero-shedding-new-light-grand-games-chess-shogi-and-go/

Thank You

Quorum Network Resources Ltd

18 Greenside Lane, Edinburgh, EH1 3AH

www.qnrl.com

Reg. No. SC 196645, Registered Office: 18 Greenside Lane, Edinburgh, EH1 3AH
