Introduction To Massively Parallel Computing (lecture transcript, 2011/01/09)
Introduction to Computing With Graphics Processors
Cris Cecka
Computational Mathematics, Stanford University
Lecture 1: Introduction to Massively Parallel Computing
Tutorial Goals
• Learn the architecture and computational environment of GPU computing
– massively parallel
– hierarchical threading and memory spaces
– principles and patterns of parallel programming
– processor architecture features and constraints
– scalability across future generations
• Introduction to programming in CUDA
– programming API, tools, and techniques
– functionality and maintainability
Moore’s Law (paraphrased)
“The number of transistors on an integrated circuit doubles every two years.”
– Gordon E. Moore
Moore’s Law (Visualized)
[Chart: transistor counts over time, with the GF100 GPU marked. Data credit: Wikipedia]
Serial Performance Scaling is Over
• Cannot continue to scale processor frequencies: no 10 GHz chips
• Cannot continue to increase power consumption: can’t melt the chip
• Can continue to increase transistor density, as per Moore’s Law
Why Massively Parallel Processing?
• A quiet revolution and potential build-up
– Computation: TFLOP/s vs. 100 GFLOP/s
– GPU in every PC: massive volume & potential impact
[Chart: peak GFLOP/s over time for GPUs (NV30, NV40, G70, G80, GT200, T12) vs. CPUs (3GHz Dual Core P4, 3GHz Core2 Duo, 3GHz Xeon Quad, Westmere)]
Why Massively Parallel Processing?
• A quiet revolution and potential build-up
– Bandwidth: ~10x
– GPU in every PC: massive volume & potential impact
[Chart: memory bandwidth over time for GPUs (NV30, NV40, G70, G80, GT200, T12) vs. CPUs (3GHz Dual Core P4, 3GHz Core2 Duo, 3GHz Xeon Quad, Westmere)]
The “New” Moore’s Law
• Computers no longer get faster, just wider
• You must re-think your algorithms to be parallel! Not only parallel, but hierarchically parallel...
• Data-parallel computing is the most scalable solution. Otherwise you refactor the code for every new core count: 4 cores, 8 cores, 16 cores, ...
• You will always have more data than cores: build the computation around the data
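The data-parallel style above can be sketched as a minimal CUDA SAXPY kernel (a standard illustrative example, not from the slides): one thread per data element, so the same code scales unchanged from a handful of cores to hundreds.

```cuda
// One thread computes one element: the parallelism comes from the data,
// not from the core count, so the kernel scales across generations.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n)                   // guard: n need not divide the launch size
        y[i] = a * x[i] + y[i];
}

// Launch with enough 256-thread blocks to cover all n elements:
// saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```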
Enter the GPU
• Highly scalable
• Massively parallel
– hundreds of cores
– thousands of threads
• Cheap
• Available
• Programmable
GPU Evolution
• High throughput computation
– GeForce GTX 280: 933 GFLOP/s
• High bandwidth memory
– GeForce GTX 280: 140 GB/s
• High availability to all
– 180+ million CUDA-capable GPUs in the wild
[Timeline, 1995–2010: RIVA 128 (3M transistors), GeForce 256 (23M), GeForce 3 (60M), GeForce FX (125M), GeForce 8800 (681M), “Fermi” (3B)]
Graphics Legacy
• Throughput is paramount
– must paint every pixel within frame time
– scalability
• Create, run, & retire lots of threads very rapidly
– measured 14.8 Gthread/s on an increment() kernel
• Use multithreading to hide latency
– 1 stalled thread is OK if 100 are ready to run
Why is this different from a CPU?
• CPU: minimize latency experienced by 1 thread
– big on-chip caches
– sophisticated control logic
• GPU: maximize throughput of all threads
– # threads in flight limited by resources => lots of resources (registers, bandwidth, etc.)
– multithreading can hide latency => skip the big caches
• Different goals produce different designs
– GPU assumes the workload is highly parallel
– CPU must be good at everything, parallel or not
NVIDIA GPU Architecture: Fermi GF100
[Die diagram: Host I/F, GigaThread scheduler, and a shared L2 cache at the center, surrounded by six DRAM I/F blocks and the array of streaming multiprocessors]
SM Multiprocessor
[Diagram: Instruction Cache; 2x (Scheduler + Dispatch); Register File; 32 Cores; Load/Store Units x 16; Special Func Units x 4; Interconnect Network; 64K Configurable Cache/Shared Mem; Uniform Cache]
• 32 CUDA Cores per SM (512 total)
• 8x peak FP64 performance
– 50% of peak FP32 performance
• Direct load/store to memory
– usual linear sequence of bytes
– high bandwidth (hundreds of GB/sec)
• 64KB of fast, on-chip RAM
– software- or hardware-managed
– shared amongst CUDA cores
– enables thread communication
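The per-SM resources listed above can be inspected from host code; a minimal sketch using the CUDA runtime's `cudaGetDeviceProperties` (reported values will vary by device; 16 SMs and 48KB shared memory per block are what a full GF100 would report):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Host-side sketch: query the multiprocessor layout of device 0.
int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    printf("SMs:                  %d\n", p.multiProcessorCount);
    printf("Shared mem per block: %zu bytes\n", p.sharedMemPerBlock);
    printf("Registers per block:  %d\n", p.regsPerBlock);
    return 0;
}
```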
Key Architectural Ideas
• GPU serves as a coprocessor to the CPU
– has its own device memory on the card
• SIMT (Single Instruction Multiple Thread) execution
– threads run in groups of 32 called warps
– threads in a warp share an instruction unit (IU)
– HW automatically handles divergence
• Hardware multithreading
– HW resource allocation & thread scheduling
– HW relies on threads to hide latency
• Threads have all resources needed to run
– any warp not waiting for something can run
– context switching is (basically) free
[Diagram: the SM multiprocessor from the previous slide]
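A small illustrative kernel (not from the slides) makes the SIMT point concrete: when threads of one 32-thread warp take different branches, the hardware serializes the two paths and reconverges automatically, so the code stays correct but loses throughput.

```cuda
// Divergence sketch: even and odd lanes of each warp take different
// branches; the hardware runs the two paths one after the other.
__global__ void divergent(float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        out[i] = 1.0f;   // half the warp executes this path...
    else
        out[i] = 2.0f;   // ...then the other half executes this one
    // Branching at warp granularity (e.g. on threadIdx.x / 32) would
    // keep whole warps on one path and avoid the serialization.
}
```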
Enter CUDA
• A compiler and toolkit for programming NVIDIA GPUs
• Minimal extensions to a familiar C/C++ environment
– lets programmers focus on parallel algorithms
• Scalable parallel programming model
– express parallelism and control the hierarchy of memory spaces
– but also a high-level abstraction from hardware
• Provides a straightforward mapping onto hardware
– good fit to the GPU architecture
– maps well to multi-core CPUs too
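The "minimal extensions" amount to a kernel qualifier, a launch syntax, and a few runtime calls; a minimal complete CUDA program (a standard pattern, sketched here rather than taken from the slides):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __global__ marks a kernel: CPU-callable, GPU-executed.
__global__ void add_one(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1024;
    int h[n];
    for (int i = 0; i < n; ++i) h[i] = i;

    int* d;
    cudaMalloc(&d, n * sizeof(int));                 // device memory
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);
    add_one<<<(n + 255) / 256, 256>>>(d, n);         // <<<grid, block>>> launch
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    printf("h[10] = %d\n", h[10]);                   // 10 + 1 = 11
    return 0;
}
```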
Motivation
[Chart: reported application speedups on GPUs, 13–457x overall; e.g. 17x, 30x, 35x, 45x, 100x, 110–240x]
Compute Environment
• Threads executed by SPs (scalar processors)
– executing the same sequential kernel
– on-chip registers, off-chip local memory
• Thread blocks executed by SMs
– threads in the same block can cooperate
– on-chip shared memory
– synchronization
• Grids of blocks executed by the device
– off-chip global memory
– no synchronization
[Diagram: a thread t; a block b of threads t0 t1 ... tB; a grid of such blocks]
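The three levels above map directly onto CUDA's built-in variables; an illustrative kernel (my sketch, not from the slides) that reverses each block's tile of the input, using per-block shared memory and the block-level barrier:

```cuda
// threadIdx: position within the block; blockIdx: position within the
// grid; __shared__: on-chip memory visible only to this block's threads.
__global__ void block_reverse(const float* in, float* out) {
    __shared__ float tile[256];                // assumes blockDim.x == 256
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[t] = in[base + t];                    // cooperative load
    __syncthreads();                           // barrier: block-level sync only

    out[base + t] = tile[blockDim.x - 1 - t];  // reversed within the block
}
```

Note that `__syncthreads()` synchronizes one block; there is no cheap barrier across the whole grid, matching the "no synchronization" point above.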
CUDA Model of Parallelism
• CUDA virtualizes the physical hardware
– thread is a virtualized scalar processor (registers, PC, state)
– block is a virtualized multiprocessor (threads, shared mem.)
• Scheduled onto physical hardware without pre-emption
– threads/blocks launch & run to completion
– blocks should be independent
[Diagram: blocks, each with its own block memory, above a shared Global Memory]
NOT: Flat Multiprocessor
• Global synchronization isn’t cheap
• Global memory access times are expensive
• cf. PRAM (Parallel Random Access Machine) model
[Diagram: processors connected directly to a flat Global Memory]
NOT: Distributed Processors
• Distributed computing is a different setting
• cf. BSP (Bulk Synchronous Parallel) model, MPI
[Diagram: processor + memory pairs connected by an Interconnection Network]
A Common Programming Strategy
• Global memory resides in device memory (DRAM)
– much slower access than shared memory
• Tile data to take advantage of fast shared memory:
– generalize from the adjacent_difference example
– divide and conquer
A Common Programming Strategy (continued)
1. Partition data into subsets that fit into shared memory
2. Handle each data subset with one thread block
3. Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
4. Perform the computation on the subset from shared memory
5. Copy the result from shared memory back to global memory
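The steps above can be sketched as a tiled adjacent_difference kernel (my reconstruction of the idea the slides reference, with a hypothetical fixed block size): each block stages its subset, plus one halo element, in shared memory before computing.

```cuda
// result[i] = in[i] - in[i-1], staged through shared memory.
__global__ void adj_diff(const int* in, int* out, int n) {
    __shared__ int tile[256 + 1];              // one block's subset + 1 halo
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;

    if (i < n) tile[t + 1] = in[i];            // step 3: cooperative load
    if (t == 0 && i > 0) tile[0] = in[i - 1];  // halo from the previous block
    __syncthreads();                           // wait until the tile is full

    if (i > 0 && i < n)
        out[i] = tile[t + 1] - tile[t];        // steps 4-5: compute from shared
}                                              // memory, store back to global
```

Each element of `in` is read from slow global memory once per block instead of twice per thread, which is the payoff the tiling strategy is after.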