IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Computational Physics (Kipton Barros, BU)

Transcript of the guest lecture slides.
[Slide 1]
CUDA Tricks and Computational Physics
Kipton Barros
In collaboration with R. Babich, R . Brower, M. Clark, C. Rebbi, J. Ellowitz
Boston University
[Slide 2]
High energy physics has huge computational needs
27 km
Large Hadron Collider, CERN
[Slide 3]
A disclaimer:
I’m not a high energy physicist
A request:
Please question/comment freely during the talk
[Slide 4]
View of the CMS detector at the end of 2007. (Maximilien Brice, © CERN)
[Slide 5]
View of the Computer Center during the installation of servers. (Maximilien Brice; Claudia Marcelloni, © CERN)
15 Petabytes to be processed annually
[Slide 6]
The “Standard Model” of Particle Physics
[Slide 7]
I’ll discuss Quantum ChromoDynamics
Although it’s “standard”, these equations are hard to solve
Big questions: Why do quarks appear in groups? What was the physics during the big bang?
[Slide 8]
[Slide 9]
Quantum ChromoDynamics: the theory of nuclear interactions (quarks bound by “gluons”)
Extremely difficult:
Must work at the level of fields, not particles
Calculation is quantum mechanical
[Slide 10]
Lattice QCD: Solving Quantum Chromodynamics by Computer
Discretize space and time (place the quarks and gluons on a 4D lattice)
[Slide 11]
Spacetime = 3+1 dimensions
32^4 ≈ 10^6 lattice sites
Quarks live on sites (24 floats each)
Gluons live on links (18 floats each)
Total system size = 4 bytes/float × 32^4 sites × (24 + 4×18) floats ≈ 384 MB
[Slide 12]
Lattice QCD: inner loop requires repeatedly solving a linear equation
DW is a sparse matrix with only nearest-neighbor couplings (the unknowns are the quarks; the gluons enter DW)
Applying DW needs to be fast!
[Slide 13]
Operation of DW:
1 output quark site (24 floats)
[Slide 14]
Operation of DW:
1 output quark site (24 floats)
2x4 input quark sites (24x8 floats)
[Slide 15]
Operation of DW:
1 output quark site (24 floats)
2x4 input quark sites (24x8 floats)
2x4 input gluon links (18x8 floats)
[Slide 16]
Operation of DW:
1 output quark site (24 floats)
2x4 input quark sites (24x8 floats)
2x4 input gluon links (18x8 floats)
1.4 kB of local storage required per quark update?
[Slide 17]
CUDA parallelization: must process many quark updates simultaneously
Odd/even sites processed separately
[Slide 18]
© NVIDIA Corporation 2006
Programming Model
A kernel is executed as a grid of thread blocks
A thread block is a batch of threads that can cooperate with each other by:
Sharing data through shared memory
Synchronizing their execution
Threads from different blocks cannot cooperate
[Diagram: the host launches Kernel 1 as Grid 1 (blocks (0,0) through (2,1)) and Kernel 2 as Grid 2; each block, e.g. Block (1,1), contains a 5x3 array of threads (0,0) through (4,2)]
Threading
Friday, January 23, 2009
[Slide 19]
DW parallelization:
Each thread processes 1 site
No communication required between threads!
All threads in warp execute same code
[Slide 20]
Step 1: Read neighbor site
[Slide 21]
Step 1: Read neighbor site
Step 2: Read neighbor link
[Slide 22]
Step 1: Read neighbor site
Step 2: Read neighbor link
Step 3: Accumulate into
[Slide 23]
Step 1: Read neighbor site
Step 2: Read neighbor link
Step 3: Accumulate into
Step 4: Read neighbor site
[Slide 24]
Step 1: Read neighbor site
Step 2: Read neighbor link
Step 3: Accumulate into
Step 4: Read neighbor site
Step 5: Read neighbor link
[Slide 25]
Step 1: Read neighbor site
Step 2: Read neighbor link
Step 3: Accumulate into
Step 4: Read neighbor site
Step 5: Read neighbor link
Step 6: Accumulate into
[Slide 26]
Occupancy
Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy
Occupancy = Number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently
Limited by resource usage:
Registers
Shared memory
Exec
[Slide 27]
Optimizing threads per block
Choose threads per block as a multiple of warp size
Avoid wasting computation on under-populated warps
More threads per block == better memory latency hiding
But, more threads per block == fewer registers per thread
Kernel invocations can fail if too many registers are used
Heuristics
Minimum: 64 threads per block
Only if multiple concurrent blocks
192 or 256 threads a better choice
Usually still enough regs to compile and invoke successfully
This all depends on your computation, so experiment!
Exec
[Slide 28]
Reminder -- each multiprocessor has:
16 kb shared memory
16 k registers
1024 active threads (max)
High occupancy needed for maximum performance (roughly 25% or so)
[Slide 29]
DW: does it fit onto the GPU?
Each thread requires 1.4 kb → 0.2 kb of fast local memory
24 → 12 floats (input quark), 18 floats (link), 24 floats (output)
[Slide 30]
DW: does it fit onto the GPU?
Each thread requires 0.2 kb of fast local memory
MP has 16 kb shared mem
Threads/MP = 16 / 0.2 = 80
[Slide 31]
DW: does it fit onto the GPU?
Each thread requires 0.2 kb of fast local memory
MP has 16 kb shared mem
Threads/MP = 16 / 0.2 = 80 → 64 (multiple of 64 only)
[Slide 32]
DW: does it fit onto the GPU?
Each thread requires 0.2 kb of fast local memory
MP has 16 kb shared mem
Threads/MP = 16 / 0.2 = 80 → 64 (multiple of 64 only)
MP occupancy = 64/1024 = 6%
[Slide 33]
6% occupancy sounds pretty bad!
Andreas Kuehn / Getty
[Slide 34]
Reminder -- each multiprocessor has:
16 kb shared memory
16 k registers
1024 active threads (max)
Each thread requires 0.2 kb of fast local memory
How can we get better occupancy?
[Slide 35]
Reminder -- each multiprocessor has:
16 kb shared memory
16 k registers = 64 kb memory
1024 active threads
Each thread requires 0.2 kb of fast local memory
How can we get better occupancy?
Occupancy > 25%
[Slide 36]
Registers as data (possible because no inter-thread communication)
Instead of shared memory
Registers are allocated as per-thread local variables
[Slide 37]
Registers as data
Can’t be indexed. All loops must be EXPLICITLY expanded
[Slide 38]
Code sample
(approx. 1000 LOC automatically generated)
[Slide 39]
Performance Results:
82 Gigabytes/sec (GTX 280)
44 Gigabytes/sec (Tesla C870)
(completely bandwidth limited)
For comparison:
twice as fast as Cell impl. (arXiv:0804.3654)
20 times faster than CPU implementations
(90 Gflops/s)
[Slide 40]
Surprise! Very robust to low occupancy
[Charts: GB/s vs occupancy. Tesla C870: 0–45 GB/s axis, occupancies ≥25%, 17%, 8%, 0%. GTX 280: 0–85 GB/s axis, occupancies ≥19%, 13%, 6%, 0%]
[Slide 41]
Device memory is the bottleneck: coalesced memory accesses crucial
Quark 1: q1_1, q1_2, ... q1_24; Quark 2: q2_1, q2_2, ... q2_24; Quark 3: q3_1, q3_2, ... q3_24; ...
Data reordering: q1_1 q2_1 q3_1 ... q1_2 q2_2 q3_2 ...
(thread 0, thread 1, thread 2, ... read consecutive elements)
[Slide 42]
Memory coalescing: store even/odd lattices separately
[Slide 43]
When memory access isn’t perfectly coalesced
Sometimes float4 arrays can hide latency
This global memory read corresponds to a single CUDA
instruction
thread 0, thread 1, thread 2
In case of a coalesce miss, at least 4x the data is transferred
[Slide 44]
When memory access isn’t perfectly coalesced
Binding to textures can help
This makes use of the texture cache and can reduce penalty for nearly coalesced accesses
(the texture read corresponds to a single CUDA instruction)
[Slide 45]
Regarding textures, there are two kinds of memory:
Linear array
Can be modified in kernel
Can only be bound to 1D texture
“CUDA array”
Can’t be modified in kernel
Gets reordered for 2D, 3D locality
Allows various hardware features
[Slide 46]
When a CUDA array is bound to a 2D texture, it is probably reordered to something like a Z-curve
Wikipedia image
This gives 2D locality
[Slide 47]
Warnings:
The effectiveness of float4 and textures depends on the CUDA hardware and driver (!)
Certain “magic” access patterns are many times faster than others
Testing appears to be necessary
[Slide 48]
Memory bandwidth test
Should be optimal
Simple kernel
Memory access completely coalesced
[Slide 49]
Memory bandwidth test
Simple kernel
Memory access completely coalesced
Bandwidth: 54 Gigabytes / sec(GTX 280, 140 GB/s theoretical!)
[Slide 50]
So why are NVIDIA samples so fast?
NVIDIA actually uses a modified access pattern:
54 Gigabytes / sec → 102 Gigabytes / sec
(GTX 280, 140 GB/s theoretical)
[Slide 51]
Naive access pattern
[Diagram: memory locations read by Block 1, Block 2, ... at Step 1, Step 2, ...]
[Slide 52]
Modified access pattern
[Diagram: memory locations read by Block 1, Block 2, ... at Step 1, Step 2, ...]
(much more efficient)
[Slide 53]
CUDA Compiler
CUDA C code → PTX code → CUDA machine code (LOTS of optimization along the way)
Use the unofficial CUDA disassembler to view CUDA machine code
[Slide 54]
CUDA Disassembler (decuda)
Compile foo.cu and save the cubin file
Disassemble
[Slide 55]
Look how CUDA implements integer division!
[Slide 56]
CUDA provides fast (but imperfect) trigonometry in hardware!
[Slide 57]
The compiler is very aggressive in optimization. It will group memory loads together to minimize latency
Notice: each thread reads 20 floats!
(snippet from LQCD)