Seminar: AI for Games, 6/15/2019
Human Level Control Through Deep Reinforcement Learning
Paper by: V. Mnih, K. Kavukcuoglu, D. Silver et al.
Presented by: Carsten Lüth
Supervisor: Professor U. Köthe
Taken from: https://nervanasystems.github.io/coach/
Contents
1. Video Games
2. Deep Q Network
   A. Idea
   B. Methods
   C. Algorithm
3. Experiments
4. Discussion
1. Why do we play video games?
❖ Humans play to:
  ❖ learn
  ❖ test their abilities
  ❖ compete with each other
❖ Agents are trained on games:
  ❖ because it is easier, faster, and safer than training in the real world
  ❖ to benchmark their performance across different environments
ATARI 2600 games. Taken from: https://gym.openai.com/envs/#atari
1. How do we play video games?
❖ Humans:
  1. see and hear the video game
  2. interpret the input
  3. decide what to do
❖ Agents before DQN:
  1. get the pixels of the video input
  2. interpret the input: derive features from the pixels (hand-crafted features)
  3. decide what to do: linear value functions, policy representations, etc.
1. MarI/O - NEAT
Taken from: https://www.youtube.com/watch?v=qv6UVOQ0F44&t=224s
1. Hand-Crafted Features
❖ Why are they used?
  ❖ Computer vision: raw pixels of videos or images are poor features; they are high-dimensional and highly correlated
  ❖ they distill human knowledge into the agents
❖ Problems:
  ❖ costly to design, and often not generalizable from one game (e.g. Pac-Man) to another (e.g. Super Mario)
  ❖ the agent's performance is limited by the quality of the features
2.A Deep Q Network
❖ Fundamental concept: self-learned feature maps
  ❖ a CNN extracts the features
  ❖ the features feed a learned Q-function (a fully connected network)
  ❖ the whole agent is trainable via backpropagation
2.A DQN - playing Seaquest
Taken from: https://www.youtube.com/watch?v=XjsY8-P4WHM
2.B Convolutional Neural Network
❖ multiple convolutional layers; each layer applies filters (kernels) to produce new feature maps
❖ convolutions apply the same filters at different locations of the image (weight sharing)
Taken from: https://en.wikipedia.org/wiki/Convolutional_neural_network#/media/File:Typical_cnn.png
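As a sketch of how the feature-map sizes shrink through such a network (assuming the 84×84 input and the three convolutional layers reported in the DQN paper: 8×8 filters with stride 4, then 4×4 with stride 2, then 3×3 with stride 1):

```python
def conv_out(size, kernel, stride):
    """Spatial output size of a valid (no-padding) convolution."""
    return (size - kernel) // stride + 1

size = 84                              # input frames are 84x84 after preprocessing
sizes = []
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_out(size, kernel, stride)
    sizes.append(size)
print(sizes)                           # [20, 9, 7]
```

The final 7×7 feature maps are then flattened and passed into the fully connected layers that output one Q-value per action.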
2.B Features of a CNN
Filters of a VGG network trained on ImageNet, visualized.
Taken from: Visualizing and Understanding Convolutional Networks - Matthew D. Zeiler, Rob Fergus
2.B Value Function
❖ Optimal value function: the maximum expected return over all policies, given a state and an action:

Q*(s, a) = max_π 𝔼[r_t + γ r_{t+1} + γ² r_{t+2} + … | s_t = s, a_t = a, π]

  π = ℙ(a|s): policy
  r_t: reward
  s: state
  a: action
  γ ∈ (0, 1): discount factor

❖ Q-learning update:

Q_{i+1}(s_t, a_t) ← Q_i(s_t, a_t) + α (r_t + γ max_{a_{t+1}} Q_i(s_{t+1}, a_{t+1}) − Q_i(s_t, a_t))

❖ with Q_i(s, a) → Q*(s, a) as i → ∞
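The update rule above translates directly into code; here is a minimal tabular sketch (the states, actions, and step size are made up for illustration):

```python
from collections import defaultdict

alpha, gamma = 0.5, 0.9
Q = defaultdict(float)           # Q[(state, action)], initialized to 0
actions = [0, 1]

def q_update(s, a, r, s_next):
    """One Q-learning step: move Q(s, a) toward the bootstrapped target."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

q_update("s0", 1, 1.0, "s1")     # Q[("s0", 1)] = 0.5 * (1.0 + 0 - 0) = 0.5
```

In DQN the lookup table is replaced by the CNN, but the target r + γ max Q(s', a') stays the same.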
2.B Designing the Input
❖ The state s_t is a concatenation of consecutive downsampled and grayscaled frames
❖ The reward is the score difference of the ATARI game between consecutive frames:

r_t = Score_{t+1} − Score_t

Taken from: https://www.freecodecamp.org/news/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8/
2.B State: Concept of Time
❖ s_t: stack of consecutive frames
❖ Multiple frames allow the agent to learn a concept of movement and time
Taken from: https://towardsdatascience.com/self-learning-ai-agents-part-ii-deep-q-learning-b5ac60c3f47
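A minimal sketch of this preprocessing with NumPy. The grayscale-by-channel-mean and the naive 2× slicing downsample are simplifications for illustration; the paper crops and resizes frames to 84×84:

```python
import numpy as np

def preprocess(frame):
    """Grayscale an RGB ATARI frame and downsample it (simplified
    stand-in for the paper's crop-and-resize to 84x84)."""
    gray = frame.mean(axis=2)          # (210, 160, 3) -> (210, 160)
    return gray[::2, ::2]              # naive 2x downsampling -> (105, 80)

def stack_state(frames):
    """s_t: stack of the last 4 preprocessed frames along a new axis."""
    return np.stack([preprocess(f) for f in frames[-4:]], axis=0)

frames = [np.random.randint(0, 256, (210, 160, 3)) for _ in range(6)]
state = stack_state(frames)
print(state.shape)                     # (4, 105, 80)
```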
2.B Experience Replay
❖ Intuition: save past experiences so that the agent can learn from them

e_t = (s_t, a_t, r_t, s_{t+1})
D_t = {e_1, e_2, e_3, …, e_t}
(s, a, r, s′) ∼ U(D) for the Q-learning updates

❖ Benefits:
  ❖ avoids forgetting previous episodes
  ❖ reduces correlations between consecutive experiences
Taken from: https://www.freecodecamp.org/news/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8/
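A replay buffer needs only two operations, store and uniform sample; a minimal sketch (capacity and transition contents are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions e_t = (s, a, r, s_next); samples uniformly, U(D)."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest experiences drop out

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

buf = ReplayBuffer(capacity=10_000)
for t in range(100):
    buf.add(f"s{t}", t % 2, 1.0, f"s{t+1}")
batch = buf.sample(32)                         # decorrelated minibatch
```

Sampling uniformly across the whole buffer is what breaks the temporal correlation between consecutive frames.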
2.B Experience Replay: Forgetting Previous Episodes
(Figure: Level 1 vs. Level 2)
Taken from: https://www.freecodecamp.org/news/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8/
2.B Training Schedule
❖ Define two value functions with different updates:
  ❖ the value function is updated by gradient descent w.r.t. the target-value function
  ❖ the target-value function is updated periodically, every C iterations: θ⁻ ← θ
❖ Benefit: training is more stable
❖ Loss:

ℒ(θ_i) = 𝔼_{(s_t, a_t, r_t, s_{t+1}) ∼ U(D)}[(r_t + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; θ_i⁻) − Q(s_t, a_t; θ_i))²]

❖ Minimization objective:

θ = argmin_θ ℒ(θ)

  θ_i: value function parameters (iterative updates)
  θ_i⁻: target-value function parameters (periodic updates)
  D: experience dataset
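The two-network schedule can be sketched in a few lines. Here a toy linear Q-function and a random-noise "gradient step" stand in for the CNN and its real update; only the θ / θ⁻ bookkeeping is the point:

```python
import numpy as np

gamma, C = 0.99, 4
theta = np.zeros((5, 2))          # toy linear Q: Q(s, :) = theta.T @ s
theta_target = theta.copy()       # theta^- : frozen, periodically synced copy

def td_target(r, s_next):
    """The bootstrap target uses the *frozen* parameters theta^-."""
    return r + gamma * np.max(theta_target.T @ s_next)

for step in range(1, 13):
    theta += 0.01 * np.random.randn(5, 2)   # stand-in for a gradient step
    if step % C == 0:
        theta_target = theta.copy()         # periodic update every C steps
```

Freezing θ⁻ between syncs keeps the regression target from chasing the parameters it is training, which is what makes training more stable.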
2.B Exploration vs. Exploitation
❖ ε-greedy describes the trade-off between exploration and exploitation by going off-policy during training of the agent:

p(a|s) = 1 − ε + ε/n   if a = argmax_{a′} Q(s, a′)
p(a|s) = ε/n           otherwise

  n: number of actions

❖ ε is annealed during training from 1 to 0.1
Taken from: https://lilianweng.github.io/lil-log/2018/01/23/the-multi-armed-bandit-problem-and-its-solutions.html
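A minimal ε-greedy sketch with linear annealing (the annealing horizon of one million steps is illustrative; the piecewise probabilities above fall out of the if/else):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With prob. epsilon pick a uniformly random action, else the greedy one.
    The greedy action is then chosen with total prob. 1 - eps + eps/n."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def anneal(step, start=1.0, end=0.1, steps=1_000_000):
    """Linearly anneal epsilon from start to end over the first `steps` steps."""
    frac = min(step / steps, 1.0)
    return start + frac * (end - start)

action = epsilon_greedy([0.1, 0.9, 0.3], anneal(step=500_000))  # eps = 0.55
```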
2.C Algorithm
The training loop cycles through four stages:
❖ exploration vs. exploitation: select an action (ε-greedy)
❖ play: execute the action in the environment
❖ experience replay: store the transition and sample a minibatch
❖ update the Q-function on the minibatch
3. Performance on ATARI 2600
❖ Agents trained on 49 different games
❖ Normalized performance:

Performance = (R_A − R_R) / (R_H − R_R) · 100

  R_R: random play score
  R_H: human gaming tester score
  R_A: agent score

❖ Agents have an action frequency of 10 Hz
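The normalization above as code, with hypothetical scores for illustration (0 = random play, 100 = human tester):

```python
def normalized_score(agent, human, rand):
    """Human-normalized performance in percent: 0 = random, 100 = human."""
    return (agent - rand) / (human - rand) * 100

print(normalized_score(agent=400.0, human=500.0, rand=100.0))  # 75.0
print(normalized_score(agent=600.0, human=500.0, rand=100.0))  # 125.0 (superhuman)
```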
3. DQN - playing Breakout
Taken from: https://www.youtube.com/watch?v=TmPfTpjtdgg
4. Discussion
❖ Pros:
  ❖ generalizes well to different environments
  ❖ learns the important features itself
  ❖ minimizes the amount of human knowledge required
  ❖ outperforms most other algorithms (before 2015) and has the best overall score
❖ Cons:
  ❖ the decision horizon is rather short
  ❖ the algorithm learns more slowly than a human
Sources
❖ Scientific:
  ❖ Human-level control through deep reinforcement learning - V. Mnih, K. Kavukcuoglu, D. Silver et al.
  ❖ Playing Atari with Deep Reinforcement Learning - V. Mnih, K. Kavukcuoglu, D. Silver et al.
  ❖ Visualizing and Understanding Convolutional Networks - M. D. Zeiler, R. Fergus
  ❖ The Arcade Learning Environment: An Evaluation Platform for General Agents - M. G. Bellemare, Y. Naddaf, J. Veness and M. Bowling
❖ Further Reading:
  ❖ Rainbow: Combining Improvements in Deep Reinforcement Learning - M. Hessel, J. Modayil, H. van Hasselt et al.
  ❖ The Bitter Lesson - Richard Sutton: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
  ❖ Reinforcement Learning, Fast and Slow - M. Botvinick, S. Ritter, J. X. Wang et al.