Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations...
Transcript of Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations...
![Page 1: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/1.jpg)
Avi Bleiweiss NVIDIA Corporation San Jose | 2010
Playing Zero-Sum Games on the GPU
![Page 2: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/2.jpg)
Zero-Sum
One’s gain, other’s loss
Perfect info
Multi player game
— Simple and involved
Matching
Pennies Head Tail
Head 1,-1 -1,1
Tail -1,1 1,-1
![Page 3: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/3.jpg)
Deep Blue
32 CPUs
512 Chess ASICs
1.15B transistors
0
50
100
150
200
1985 1988 1991 1994 1997
Million Moves/Seconds
6’5” tower pair
1.4 tons
100M$
![Page 4: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/4.jpg)
Motivation
Play as you go
— Thousands players
Game cloud
— GPU computing
Mobile computer
— 3D graphics
NVIDIA Tesla
NVIDIA Tegra
![Page 5: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/5.jpg)
Problem
Game
• Two player
• Maximize look ahead
• Complexity • 1025 for 4x4x4 Tic-Tac-Toe
• 10120 for Chess
Search
• Efficient parallel • Single game
• Simultaneous matches
• Graceful stack
• Linear scale
![Page 6: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/6.jpg)
Game Tree
Root
A
C D E
B
F G H
d
g
9 2 7 3 2 4 2 1 5 8 7 7 2 9 1 1 6 2
![Page 7: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/7.jpg)
Mini-Max
Build and search
Unbound DFS
— All nodes visited
Non tail recursive
Depth limited
![Page 8: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/8.jpg)
Principal Variation
Root
A
C D E
B
F G H
d
g
9 2 7 3 2 4 2 1 5 8 7 7 2 9 1 1 6 2
7 4 5 8 1 6
1
4
4
![Page 9: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/9.jpg)
Alpha-Beta
Enhances Mini-Max
Elegant, efficient
— Prunes nodes
Perfect game
— Possible in cases
![Page 10: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/10.jpg)
Pruning
Root
A
C D E
B
F G H
d
g
9 2 7 3 2 4 2 1 5 8 7 7 2 9 1 1 6 2
∞ ∞
∞ ∞
∞ 7
∞ ∞
7 ∞ 4 ∞
7 ∞ 7 4
4 ∞ 4 5
1 4
∞ 4
![Page 11: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/11.jpg)
Optimization
Aspiration Search
• Tree value known
• Narrow window
• Re-search if fails
Principal Variation Search
• Non PV nodes
• beta = alpha + 1
• Full window PV nodes
• Perfect ordered tree
Iterative Deepening
• Fixed depth searches
• Transposition table
• Hash branch positions [Marsland, T. A. 1986]
![Page 12: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/12.jpg)
Parallelism
Principal Variation Split
• Strongly ordered tree
• Synchronization bound
• Load imbalance
Young Brothers Wait Concept
• Parallel at any node
• Processor owns node
• Scales to # processors
Dynamic Tree Splitting
• Processors share node
• Global job list
• Reasonable speedup
[Feldmann, R. 1993]
![Page 13: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/13.jpg)
Challenges
Deep recursion, limited stack
Divergent, irregular threads
Dynamic parallelism
Low arithmetic intensity
![Page 14: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/14.jpg)
Implementation
Kernel for each
— Mini-Max, Alpha-Beta
Board C++ class
— Rules specific
Games
3D Tic-Tac-Toe Connect-4 Reversi
![Page 15: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/15.jpg)
Board
Cells
Player
Successors
Move
Query
Winner
Full
Manipulate
Update
Undo
![Page 16: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/16.jpg)
Stack
Recursion depth >1000
Greedy allocation
Hybrid design
Local
Memory Global
Memory
Runtime/Compiler
• Local variables
• Function parameters
User
• Successors
![Page 17: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/17.jpg)
Split
Thousands
of working
threads
Find highest cut nodes
Parallel node search
Resolve up to root
CPU
GPU
producer consumer
Game Tree
game init
game over
foreach
move
![Page 18: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/18.jpg)
Shared αβ
Kernel global
scope
Check
private αβ Global atomic
update
![Page 19: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/19.jpg)
Limitations
Stack allocation
— Bounds parallelism
Split constraints
Depth Tic-Tac-Toe Threads
3x3x3 4x4x4 5x5x5
1 650 3906 15252
2 15600 238266 1860744
![Page 20: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/20.jpg)
Methodology
CUDA Toolkit 3.1, Windows
Processors
GPU SMs Warps/SM Clocks(MHz) L1/Shared (KB) L2(KB)
GTX480 15 2 723/1446/1796 48/16 640
CPU Cores Clocks(MHz) L1/L2 (KB)
I7-940 8 2942/(3*1066) 32/8192
![Page 21: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/21.jpg)
Single Game
0
0.5
1
1.5
2
2.5
3906 3660 3422 3192 2970 2756 2550
Sec
on
ds/
Mo
ve
Threads/Move
4x4x4 Tic-Tac-Toe
Naïve Split Shared αβ
lower is
good
![Page 22: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/22.jpg)
Simultaneous Matches
0
1
2
3
4
5
6
0
1
2
3
4
5
6
7
8
1 16 128 1024 4096 16384
Sp
eed
up
Av
era
ge
Sec
on
ds/
Move
Matches
4x4x4 Tic-Tac-Toe
GTX480 i7 8 Threads
lower is
good
higher is
good
![Page 23: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/23.jpg)
Game Analysis
0
1
2
3
4
5
6
7
8
1 3 5 7 9 11 13 15 17 19 21 23
Sec
on
ds
per
Mo
ve
Move #
4x4x4 Tic-Tac-Toe
GTX480 i7 8 Threads
lower is
good
4K Matches
![Page 24: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/24.jpg)
Future Work
Data packing
Backtracking search
Sudoku, toy game
— Generator, solver
Multi recursion
![Page 25: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/25.jpg)
GPU Performance
Metric Game Dimension Speedup
Shared αβ vs. Naïve Split Tic-Tac-Toe 4x4x4 13.37X mean
Simultaneous Matches
vs. CPU
Tic-Tac-Toe 4x4x4 5.22X @ 16K
Connect-4 7x6 6.26X @ 32K
Reversi 8x8 5.96X @ 16K
![Page 26: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/26.jpg)
Summary
Efficient hybrid stack
Dynamic parallelism
Tegra, Tesla solution
3D Chess on GPU!
Courtesy freegames4all
Deep Blue Tesla
$/(Moves/Sec) 0.5 0.02
![Page 27: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/27.jpg)
Thank You!
Questions?
![Page 28: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/28.jpg)
Info
Base
— http://developer.nvidia.com
GPU AI
— Technology Preview
Toolkit
— CUDA Zone
Debugger
— Parallel Nsight
![Page 29: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/29.jpg)
Backup
![Page 30: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/30.jpg)
Simultaneous Matches (1)
0
1
2
3
4
5
6
7
0
0.5
1
1.5
2
2.5
1 16 128 1024 4096 16384 32768
Sp
eed
up
Av
era
ge
Sec
on
ds
/ M
ove
Matches
7x6 Connect-4
GTX480 i7 8 Threads
lower is
good
higher is
good
![Page 31: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/31.jpg)
Simultaneous Matches (2)
0
1
2
3
4
5
6
7
0
0.2
0.4
0.6
0.8
1
1 16 128 1024 4096 16384
Sp
eed
up
Aver
ag
e S
econ
ds
/ M
ov
e
Matches
8x8 Reversi
GTX480 i7 8 Threads
lower is
better
higher is
better
![Page 32: Playing Zero-Sum Games on the GPU · scope Check private αβ Global atomic update . Limitations Stack allocation —Bounds parallelism Split constraints Depth Tic-Tac-Toe Threads](https://reader030.fdocuments.us/reader030/viewer/2022041006/5eac6515886b79666848c38c/html5/thumbnails/32.jpg)
Simultaneous Puzzles
0
0.5
1
1.5
2
2.5
3
3.5
4
0
1
2
3
4
5
6
7
1 16 128 1024 4096 16384
Sp
eed
up
Sec
on
ds
Puzzles
9x9 Sudoku
GTX480 i7 8 Threads
lower is
better
higher is
better