Innovative Applications and Technology Pivots: A Perfect Storm in Computing
Wen-mei Hwu
Professor and Sanders-AMD Chair, ECE, NCSA
University of Illinois at Urbana-Champaign
with
Izzat El Hajj, Liwen Chang, Simon Garcia, and Carl Pearson
Agenda
• Revolutionary paradigm shift in applications
• Post-Dennard technology pivot - heterogeneity
• An example of positive application-technology spiral
• Engineering high-efficiency software for heterogeneous computing
A major paradigm shift
In the 20th Century, we were able to understand, design, and manufacture what we can measure
• Physical instruments and computing systems allowed us to see farther, capture more, communicate better, understand natural processes, control artificial processes…
In the 21st Century, we are able to understand, design, and create what we can compute
• Computational models are allowing us to see even farther, go back and forth in time, learn better, test hypotheses that cannot be verified any other way, create safe artificial processes…
Examples of Paradigm Shift
20th Century
• Small mask patterns
• Electron microscope and crystallography with computational image processing
• Anatomic imaging with computational image processing
• Teleconference
• GPS
21st Century
• Optical proximity correction
• Computational microscope with initial conditions from crystallography
• Metabolic imaging sees disease before visible anatomic change
• Tele-immersion
• Self-driving cars
Diving deeper into computational microscope
• Large clusters (scale out) allow simulation of biological systems of realistic space dimensions
  • 0.5 Å (0.05 nm) lattice spacing needed for accuracy
  • Interesting biological systems have dimensions of mm or larger
  • Thousands of nodes are required to hold and update all the grid points
• Fast nodes (scale up) allow simulation at realistic time scales
  • Simulation time steps at femtosecond (10^-15 second) level needed for accuracy
  • Biological processes take milliseconds or longer
  • Current molecular dynamics simulations progress at about one day for each 10-100 microseconds of the simulated process
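The figures on this slide imply some simple arithmetic, sketched below in C. The function names are illustrative and not from the talk; the 1 mm system size and 1 ms process length are taken as the lower bounds quoted above.

```c
#include <stdio.h>

/* Grid points along one dimension: system length divided by lattice spacing. */
double grid_points_per_dim(double length_m, double spacing_m) {
    return length_m / spacing_m;
}

/* Number of time steps needed to cover a simulated duration. */
double steps_needed(double duration_s, double step_s) {
    return duration_s / step_s;
}

/* Wall-clock days needed at a given rate of simulated seconds per day. */
double days_needed(double duration_s, double simulated_s_per_day) {
    return duration_s / simulated_s_per_day;
}

/* Prints the estimates for a 1 mm cube and a 1 ms process. */
void print_scale_estimates(void) {
    double per_dim = grid_points_per_dim(1e-3, 0.5e-10);  /* 1 mm at 0.5 Å */
    printf("grid points per dimension: %.1e\n", per_dim);
    printf("grid points in a cube:     %.1e\n", per_dim * per_dim * per_dim);
    printf("time steps for 1 ms:       %.1e\n", steps_needed(1e-3, 1e-15));
    printf("days at 100 us per day:    %.0f\n", days_needed(1e-3, 100e-6));
    printf("days at 10 us per day:     %.0f\n", days_needed(1e-3, 10e-6));
}
```

Under these assumptions, one simulated millisecond needs on the order of 10^12 femtosecond steps and roughly 10 to 100 days of wall-clock time, which is why both scale-out and scale-up matter.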
Blue Waters Science Breakthrough Example
Determination of the structure of the HIV capsid at atomic level
Collaborative effort of experimental groups at the U. of Pittsburgh and Vanderbilt U., and Schulten’s computational team at the U. of Illinois.
64-million-atom HIV capsid simulation of the process through which the capsid disassembles, releasing its genetic material, a critical step in understanding HIV infection and finding a target for antiviral drugs.
Post-Dennard technology pivot - heterogeneity
Dennard Scaling of MOS Devices
In this ideal scaling, as $L \to \alpha L$
• $V_{DD} \to \alpha V_{DD}$, $C \to \alpha C$, $I \to \alpha I$
• Delay $= C V_{DD} / I$ scales by $\alpha$, so $f \to f/\alpha$
• Power for each transistor is $C V_{DD}^2 f$ and scales by $\alpha^2$
• keeping total power constant for same chip area
JSSC Oct 1974, page 256
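The scaling rules above can be checked numerically. The following is a minimal sketch in C, not taken from the talk: the `Device` struct and all function names are illustrative, units are arbitrary, and the area of one transistor is assumed to be proportional to L squared.

```c
/* Quantities of one MOS transistor, in arbitrary consistent units. */
typedef struct {
    double length;       /* L */
    double vdd;          /* V_DD */
    double capacitance;  /* C */
    double current;      /* I */
} Device;

/* Apply ideal Dennard scaling by a factor alpha (0 < alpha < 1). */
Device dennard_scale(Device d, double alpha) {
    Device s = { d.length * alpha, d.vdd * alpha,
                 d.capacitance * alpha, d.current * alpha };
    return s;
}

/* Gate delay C * V_DD / I. */
double gate_delay(Device d) {
    return d.capacitance * d.vdd / d.current;
}

/* Clock frequency, taken here as the reciprocal of the gate delay. */
double clock_frequency(Device d) {
    return 1.0 / gate_delay(d);
}

/* Switching power of one transistor: C * V_DD^2 * f. */
double switching_power(Device d) {
    return d.capacitance * d.vdd * d.vdd * clock_frequency(d);
}

/* Power per unit area, assuming one transistor occupies an area of L^2. */
double power_density(Device d) {
    return switching_power(d) / (d.length * d.length);
}
```

Scaling a device with `alpha = 0.5` halves the delay and quarters the per-transistor power, while `power_density` stays the same because four times as many transistors fit in the same area.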
Frequency Scaled Too Fast 1993-2003
[Figure: clock frequency (MHz, logarithmic axis from 10 to 10,000) by year, 1985 to 2005]
Total Processor Power Increased (super-scaling of frequency and chip size)
[Figure: total processor power (logarithmic axis from 1 to 100) by year, 1985 to 2003]
Post-Dennard Pivoting
Multiple cores with more moderate clock frequencies
Heavy use of vector execution
Employ both latency-oriented and throughput-oriented cores
3D packaging for more memory bandwidth
Blue Waters Computing System
Operational at Illinois since 3/2013
• Compute: 12.5 PF, 1.6 PB DRAM, 49,504 CPUs, 4,224 GPUs, $250M
• IB Switch
• Sonexion: 26 PBs, >1 TB/sec
• Spectra Logic: 300 PBs, 100 GB/sec
• 10/40/100 Gb Ethernet Switch
• WAN: 120+ Gb/sec
CPUs: Latency Oriented Design
High clock frequency
Large caches
• Convert long latency memory accesses to short latency cache accesses
Sophisticated control
• Branch prediction for reduced branch latency
• Data forwarding for reduced data latency
Powerful ALU
• Reduced operation latency
[Diagram: CPU with Control, Cache, and four ALUs, connected to DRAM]
GPUs: Throughput Oriented Design
Moderate clock frequency
Small caches
• To boost memory throughput
Simple control
• No branch prediction
• No data forwarding
Energy efficient ALUs
• Many, long latency but heavily pipelined for high throughput
Require massive number of threads to tolerate latencies
[Diagram: GPU connected to DRAM]
Applications Benefit from Both CPU and GPU
CPUs for sequential parts where latency matters
• CPUs can be 10+X faster than GPUs for sequential code
GPUs for parallel parts where throughput wins
• GPUs can be 10+X faster than CPUs for parallel code
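One way to see how a 10X advantage on the parallel parts translates into whole-application speedup is the usual Amdahl's-law estimate. The sketch below is an illustration added here, not a model from the talk; the function name and the example fractions are assumptions.

```c
/* End-to-end speedup when parallel_fraction of the original run time is
 * parallel code that the GPU accelerates by gpu_speedup, and the remaining
 * sequential code stays on the CPU at its original speed. */
double end_to_end_speedup(double parallel_fraction, double gpu_speedup) {
    double sequential_time = 1.0 - parallel_fraction;
    double parallel_time = parallel_fraction / gpu_speedup;
    return 1.0 / (sequential_time + parallel_time);
}
```

For example, if 70% of the run time is parallel and the GPU runs that part 10X faster, the estimate is 1 / (0.3 + 0.07), or about 2.7X for the whole application. The sequential remainder bounds the gain, which is why keeping a latency-oriented CPU for those parts matters.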
Initial Production Use Results

| Application | Description | Application Speedup |
|---|---|---|
| NAMD | 100 million atom benchmark with Langevin dynamics and PME once every 4 steps, from launch to finish, all I/O included | 1.8 |
| Chroma | Lattice QCD parameters: grid size of 48^3 x 512 running at the physical values of the quark masses | 2.4 |
| QMCPACK | Full run Graphite 4x4x1 (256 electrons), QMC followed by VMC | 2.7 |
| ChaNGa | Collisionless N-body stellar dynamics with multipole expansion and hydrodynamics | 2.1 |
| AWP | Anelastic wave propagation with staggered-grid finite-difference and realistic plastic yielding | 1.2 |
An example of positive application-technology spiral
DEEP LEARNING IN COMPUTER VISION
Deep Learning Object Detection: DNN + Data + HPC
Traditional Computer Vision: Experts + Time
Deep Learning Achieves “Superhuman” Results
[Figure: ImageNet results (axis from 0% to 100%) by year, 2009 to 2016, comparing Traditional CV and Deep Learning]
Slide courtesy of Steve Oberlin, NVIDIA
DIFFERENT MODALITIES OF REAL-WORLD DATA
• Images/video: Image → Vision features → Detection
• Audio: Audio → Audio features → Speaker ID
• Text: Text → Text features → Text classification, machine translation, information retrieval, ...
Slide courtesy of Andrew Ng, Stanford University
A long way to go towards cognitive computing
Image Recognition
Text Extraction
Human Instructions
Speech Recognition
Natural Language Processing
Diagram Understanding
IR
Knowledge Indexing
Knowledge Inferencing
Programming Framework
Hardware Platform
More Heterogeneity Is Coming
Beyond traditional CPUs and GPUs
• FPGAs (e.g., Microsoft FPGA cloud)
• ASICs (e.g., Google’s TPU)
Beyond traditional DRAM
• Stacked DRAM for more memory bandwidth
• Non-volatile RAM for memory capacity
• Near/in memory computing for reduced power used in data movement
Some Lessons Learned
• Throughput computing using GPUs can result in 2-3X end-to-end application-level performance improvement
• GPUs, big data and deep learning have formed a positive spiral for the industry
• GPU computing has so far had narrow but deep impact in the application space
  • Data movement overhead and small GPU memory
    • Unified memory, HBM, NVLink, and HSA-style systems will help
  • Low-level programming interfaces with poor performance portability
Engineering high-efficiency software for heterogeneous computing
Performance-Portability: One Source for All

| Challenges | Solutions |
|---|---|
| Levels of Hierarchy | Codelet Composition |
| Memory Characteristics | Automatic Data Placement |
| Resource Sizes | Autotuning |
| Micro-architecture | Algorithmic Choice |
| Granularity of Parallelism | Coarsening |
Coarsening Scheduling Alternatives
[Figure panels: Depth First Order (DFO) Scheduling; DFO Scheduling with Vectorization; Breadth First Order (BFO) Scheduling; BFO with Vectorization. Time progresses as color gets darker.]
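The two schedules can be sketched in plain C for a hypothetical two-statement kernel body, as below. This is only an illustration of the loop structures under that assumption, not the output of any of the compilers discussed here; the function names and the `scratch` buffer are made up for the example.

```c
#include <stddef.h>

/* Depth-first order: each work-item runs the whole kernel body
 * before the next work-item starts. */
void run_dfo(const int *in, int *out, size_t n) {
    for (size_t item = 0; item < n; ++item) {
        int t = in[item] * 2;   /* statement 1 */
        out[item] = t + 1;      /* statement 2 */
    }
}

/* Breadth-first order: each statement runs across all work-items before
 * the next statement starts. The per-item temporary is expanded into the
 * caller-provided scratch array (n elements), and each inner loop becomes
 * a simple candidate for vectorization. */
void run_bfo(const int *in, int *out, int *scratch, size_t n) {
    for (size_t item = 0; item < n; ++item) {
        scratch[item] = in[item] * 2;     /* statement 1 for all items */
    }
    for (size_t item = 0; item < n; ++item) {
        out[item] = scratch[item] + 1;    /* statement 2 for all items */
    }
}
```

Both orders compute the same result; they differ in which data stays live between statements, which is why the better choice can depend on the kernel's memory access pattern and on the input.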
Performance Results
[Figure: speedup (normalized to fastest) on ctcp, hst, hw, kmns, lkct, lmd, lud, mrig, mriq, nw, pbfs, pf, rbfs, sad, sc, sgm, spmv, tpcf, and geo for AMD, Intel, LC (no vec.), and LC]
Speedups of 3.32x and 1.71x over AMD and Intel OpenCL implementations
Kim et al., CGO’15
Hierarchical Compute Organization of Devices
CPU
1. Process
2. Thread (vector-capable)
3. Vector Lane
4. Instruction-level Parallelism
GPU
1. Grid
2. Block
3. Warp
4. Thread
5. Instruction-level Parallelism
Hierarchical Compute Organization of Devices
CPU
1. Process
2. Thread (vector-capable)
3. Vector Lane
4. Instruction-level Parallelism
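    #include <omp.h>
    #include <stdlib.h>

    int sum(const int *in, int len) {
        /* One zeroed slot per thread the runtime may create. */
        int max_nt = omp_get_max_threads();
        int *partial = calloc(max_nt, sizeof *partial);
        #pragma omp parallel
        {
            int nt = omp_get_num_threads();   /* actual team size */
            int tile = (len + nt - 1) / nt;   /* elements per thread, rounded up */
            int j = omp_get_thread_num();     /* private to each thread */
            int accum = 0;
            #pragma unroll                    /* ILP hint; ignored if unsupported */
            for (int i = 0; i < tile && j*tile + i < len; ++i) {
                accum += in[j*tile + i];
            }
            partial[j] = accum;
        }
        int total = 0;
        for (int j = 0; j < max_nt; ++j) {
            total += partial[j];
        }
        free(partial);
        return total;
    }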
Hierarchical Compute Organization of Devices
GPU
1. Grid
2. Block
3. Warp
4. Thread
5. Instruction-level Parallelism
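    __global__ void sum_kernel(const int *in, int *partial, unsigned len) {
        extern __shared__ int tmp[];   // blockDim.x ints, sized at launch
        unsigned tile = (len + gridDim.x - 1) / gridDim.x;         // elements per block
        unsigned sub_tile = (tile + blockDim.x - 1) / blockDim.x;  // elements per thread
        unsigned id = threadIdx.x;
        int accum = 0;
        #pragma unroll
        for (unsigned i = 0; i < sub_tile; ++i) {
            unsigned off = i*blockDim.x + id;   // offset inside this block's tile
            unsigned idx = blockIdx.x*tile + off;
            if (off < tile && idx < len) accum += in[idx];
        }
        tmp[id] = accum;
        __syncthreads();
        for (unsigned s = 1; s < blockDim.x; s *= 2) {
            int addend = (id >= s) ? tmp[id - s] : 0;  // read before it is overwritten
            __syncthreads();
            tmp[id] += addend;
            __syncthreads();
        }
        if (id == 0) partial[blockIdx.x] = tmp[blockDim.x - 1];
        // Launch a new kernel to sum up partial
    }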
Tangram: Codelet-based Programming Model
(a) Atomic autonomous codelet
    __codelet
    int sum(const Array<1,int> in) {
        unsigned len = in.size();
        int accum = 0;
        for(unsigned i=0; i < len; ++i) {
            accum += in[i];
        }
        return accum;
    }
(b) Atomic cooperative codelet
    __codelet __coop __tag(kog)
    int sum(const Array<1,int> in) {
        __shared int tmp[coopDim()];
        unsigned len = in.size();
        unsigned id = coopIdx();
        tmp[id] = (id < len)? in[id] : 0;
        for(unsigned s=1; s<coopDim(); s *= 2) {
            if(id >= s)
                tmp[id] += tmp[id - s];
        }
        return tmp[coopDim()-1];
    }
(c) Compound codelet using adjacent tiling
    __codelet __tag(asso_tiled)
    int sum(const Array<1,int> in) {
        __tunable unsigned p;
        unsigned len = in.size();
        unsigned tile = (len+p-1)/p;
        return sum( map( sum, partition(in, p,
            sequence(0,tile,len), sequence(1), sequence(tile,tile,len+1))));
    }
(d) Compound codelet using strided tiling
    __codelet __tag(stride_tiled)
    int sum(const Array<1,int> in) {
        __tunable unsigned p;
        unsigned len = in.size();
        unsigned tile = (len+p-1)/p;
        return sum( map( sum, partition(in, p,
            sequence(0,1,p), sequence(p), sequence((p-1)*tile,1,len+1))));
    }
[Diagram: codelets labeled ca, cb, pc, and pd, with "?" placeholders under pc, pd, and the root]
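The difference between the two compound codelets is how the input is split into `p` parts before each part is summed. The following plain C sketch mirrors that idea outside Tangram; it is not Tangram syntax or compiler output, and the function names are illustrative.

```c
#include <stddef.h>

/* Adjacent tiling: part k covers the contiguous range
 * [k*tile, (k+1)*tile), clipped to len. */
int sum_adjacent_tiled(const int *in, size_t len, size_t p) {
    if (p == 0) p = 1;                    /* at least one part */
    size_t tile = (len + p - 1) / p;      /* part size, rounded up */
    int total = 0;
    for (size_t k = 0; k < p; ++k) {
        size_t begin = k * tile;
        size_t end = (begin + tile < len) ? begin + tile : len;
        int partial = 0;
        for (size_t i = begin; i < end; ++i) partial += in[i];
        total += partial;                 /* sum of the per-part sums */
    }
    return total;
}

/* Strided tiling: part k covers elements k, k+p, k+2p, ... below len. */
int sum_stride_tiled(const int *in, size_t len, size_t p) {
    if (p == 0) p = 1;                    /* at least one part */
    int total = 0;
    for (size_t k = 0; k < p; ++k) {
        int partial = 0;
        for (size_t i = k; i < len; i += p) partial += in[i];
        total += partial;                 /* sum of the per-part sums */
    }
    return total;
}
```

Both give the same sum. Adjacent parts tend to suit one thread walking its own contiguous range, while strided parts let neighboring workers touch neighboring elements at the same step, so the better choice depends on the target device.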
Tangram: Composition Example
[Diagram: compositions of codelets ca, cb, pc, and pd, in which "?" placeholders are filled in with codelets (for example pc over three instances of ca) and __syncthreads() barriers appear between levels]
Tangram Results
[Figure: normalized performance (higher is better) on scan, spmv, dgemm, kmeans, and bfs for Fermi, Kepler, and CPU, each comparing the reference implementation with Tangram (TGM)]
Need for Run-time Selection
• Statically determining the best algorithm could be difficult or infeasible
  • Sometimes it is input dependent
• Even a robust compiler or an expert could select a suboptimal sequence of optimizations
  • A catastrophic performance loss could happen
DySel Runtime Selects the Best Version
• Application or compiler provides multiple versions
  • Typically 4-10
• Runtime performs the final selection
  • Apply micro-profiling to sample the performance of each candidate
    • Use a small subset of the actual workload per candidate
    • Contributes to final result
  • Profile candidates concurrently
    • Reduces profiling overhead
  • Incurs less than 8% of overhead in the worst observed case
Productive Profiling Mode
• Computation in profiling also contributes to the final output
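[Diagram: the workload space is split into a probational period, in which Version A and Version B are each profiled on part of the workload, and a tenured period, in which the selected version computes the rest; all parts feed the output]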
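The idea of productive micro-profiling can be sketched in C as follows. This is a simplified illustration, not the DySel API: the candidate functions, `select_and_sum`, and the `slice` parameter are hypothetical, candidates are profiled one after another rather than concurrently, and `clock()` stands in for a real timer.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* A candidate version: sums in[begin..end) and returns the partial result. */
typedef long (*SumVersion)(const int *in, size_t begin, size_t end);

long sum_forward(const int *in, size_t begin, size_t end) {
    long acc = 0;
    for (size_t i = begin; i < end; ++i) acc += in[i];
    return acc;
}

long sum_unrolled(const int *in, size_t begin, size_t end) {
    long acc0 = 0, acc1 = 0;
    size_t i = begin;
    for (; i + 1 < end; i += 2) { acc0 += in[i]; acc1 += in[i + 1]; }
    if (i < end) acc0 += in[i];
    return acc0 + acc1;
}

/* Probational period: each candidate processes its own small slice of the
 * real workload and is timed. Tenured period: the fastest candidate
 * processes everything that is left. Every slice contributes to the result.
 * If chosen is not NULL, it receives the index of the selected version. */
long select_and_sum(const int *in, size_t n, const SumVersion *versions,
                    size_t num_versions, size_t slice, size_t *chosen) {
    assert(num_versions > 0);
    long total = 0;
    size_t next = 0, best = 0;
    clock_t best_time = 0;
    for (size_t v = 0; v < num_versions; ++v) {
        size_t end = (next + slice < n) ? next + slice : n;
        clock_t start = clock();
        total += versions[v](in, next, end);   /* profiled work is kept */
        clock_t elapsed = clock() - start;
        if (v == 0 || elapsed < best_time) { best = v; best_time = elapsed; }
        next = end;
    }
    total += versions[best](in, next, n);      /* tenured period */
    if (chosen) *chosen = best;
    return total;
}
```

Because the profiled slices are part of the real workload, the only cost of selection is the time lost when a slower candidate handles its slice, which is one reason the reported overhead can stay small.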
Case Study: Input-dependent Optimizations
• Best optimizations could be input-dependent
[Figure: relative execution time over oracle (lower is better) for random matrix and diagonal matrix inputs. (a) CPU: Oracle, Sync, Async (best initial selection), Async (worst initial selection), scalar DFO, scalar BFO, vector DFO, vector BFO, and Worst; off-scale bars labeled 8.63, 8.63, and 8.60. (b) GPU: Oracle, Sync, Async (best initial selection), Async (worst initial selection), Scalar, Vector, and Worst; off-scale bars labeled 4.73, 4.73, 22.73, and 22.73.]
Conclusion and Outlook
• Applications have very large appetite for more computing power
  • Both larger scale clusters and faster devices
• Heterogeneity has become the norm for all hardware systems
  • The HPC community is currently seeing about 2-3x application speedup
  • Recent positive spiral between deep learning and GPU computing
  • More positive spirals are yet to come
• Performance portability is critical for broad software adoption
  • There is critical need for programming systems with strong support for portability
  • Performance portability involves several dimensions of technical challenges
  • Unfortunately, vendors have not been interested in solving this problem.
Thank you!
Backup Slides
[Diagram: Automatic Parallelization and Thread Coarsening shown as the two directions between coarse-grain CPU threads and fine-grain GPU threads]
OpenCL/CUDA to CPU Compilers

| Compiler | Basic Coarsening (DFO) | Vectorization | Locality-aware Scheduling (DFO vs. BFO) |
|---|---|---|---|
| AMD | No | No | No |
| MCUDA | Yes | No | No |
| SnuCL | Yes | No | No |
| Karrenberg & Hack | Yes | Yes | No |
| pocl | Yes | Yes | No |
| Intel | Yes | Yes | No |
| MxPA | Yes | Yes | Yes |
Data Placement Options
CPU
Global memory
Caches (data tiling)
Registers
GPU
Global memory
Caches (data tiling)
Registers
+
Scratchpad memory
Constant memory
Texture memory
Rule-based vs. Model-based
• Rule-based (e.g., Jang et al.)
  • Heuristics on the memory access pattern
• Model-based (e.g., PORPLE)
  • Create a model of the memory subsystem
  • Slower but more accurate
Tangram’s Rule-based Data Placement
[Flowchart: for each container, yes/no decisions on memory access characteristics (shared?, stride 0?, stride 1?, constant stride) and memory system features (texture?, scratchpad?, shuffle?, cache?) lead to one of Transpose, Texture, Scratchpad, Registers, or Global; some branches mark the container as a candidate for on-chip memory]
GPU Tuning: Scan Case Study
[Figure: scan performance (% of best, higher is better) in four cases: tuned for Fermi and run on Fermi; then, after the architecture upgrade, tuned for Fermi and run on Kepler, retuned for Kepler, and re-optimized for Kepler]
![Page 54: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/54.jpg)
![Page 55: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/55.jpg)
Scratchpad atomics performance (stream compaction)
[Figure: execution time (ms, axis 0.0 to 10.0) versus fraction of filtered items (axis 5 to 100) on GTX 580 (Fermi GF110), K40c (Kepler GK110), and GTX 980 (Maxwell GM204); labeled data points 0.3380 and 0.7782.]
![Page 56: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/56.jpg)
Motivation Backup
![Page 57: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/57.jpg)
• Codesign among diverse areas will be required to reach exascale
  • Every level of the computational stack is a potential bottleneck.
• XPACC code will need to run efficiently and portably on next-generation heterogeneous platforms (CPUs, GPUs, Xeon Phis)
![Page 58: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/58.jpg)
Initial Production Use Results
• NAMD
  • 100 million atom benchmark with Langevin dynamics and PME once every 4 steps, from launch to finish, all I/O included
  • 768 nodes, Kepler+Interlagos is 3.9X faster than Interlagos-only
  • 768 nodes, XK7 is 1.8X XE6
• Chroma
  • Lattice QCD parameters: grid size of 48^3 x 512 running at the physical values of the quark masses
  • 768 nodes, Kepler+Interlagos is 4.9X faster than Interlagos-only
  • 768 nodes, XK7 is 2.4X XE6
• QMCPACK
  • Full run Graphite 4x4x1 (256 electrons), QMC followed by VMC
  • 700 nodes, Kepler+Interlagos is 4.9X faster than Interlagos-only
  • 700 nodes, XK7 is 2.7X XE6
![Page 59: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/59.jpg)
Blue Waters Science Production Applications
• Work with science teams to effectively use GPUs in their production code.
  • ChaNGa – cosmological simulation, University of Washington
  • AWP – earthquake simulation, Southern California Earthquake Center
• Significant speedup by tuning kernels to specific GPU characteristics
  • Real-world opportunities for performance portability

GPU Kernel Optimizations

| Application | Version | Running Time (ms) | Speedup |
| --- | --- | --- | --- |
| ChaNGa | Baseline | 1.35 | 2.11 |
| ChaNGa | Optimized | 1.16 | |
| AWP | Baseline | 61.6 | 1.33 |
| AWP | Optimized | 43.3 | |
![Page 60: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/60.jpg)
Levels of GPU Programming Interfaces
• Current generation: CUDA, OpenCL, DirectCompute
• Next generation: OpenACC, HCC++, Thrust, Bolt
  • Simplifies data movement, kernel details, and kernel launch
  • Same GPU execution model (but less boilerplate)
• Prototype and in development: X10, Chapel, Nesl, Delite, Par4all, Tangram...
  • Implementation manages GPU threading and synchronization invisibly to the user
![Page 61: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/61.jpg)
Portability: CPU vs. GPU Code Versions
• Maintaining multiple code versions is extremely expensive
• Most CUDA/OpenCL developers maintain the original CPU version
• Many developers report that when they back-ported the CUDA/OpenCL algorithms to the CPU, they got better-performing code
  • Locality, SIMD, multicore
• MxPA is designed to automate this process (John Stratton, Hee-Seok Kim, Izzat El Hajj)
![Page 62: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/62.jpg)
Performance Library
• A major qualifying factor for new computing platforms
  • MKL, BLAS, CUSPARSE, Thrust, FFT, OpenCV, CUDNN, etc.
  • Currently redeveloped and hand-tuned for each HW type/generation
• Exa-scale HW is expected to have increasing levels of heterogeneity, parallelism, and hierarchy
  • Increasing levels of memory heterogeneity and hierarchy
  • Increasing SIMD width and types/number of cores
• The performance library development process must keep up with the HW evolution and diversification
  • Performance portability
SCF 2016
![Page 63: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/63.jpg)
![Page 64: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/64.jpg)
[Figure: processor timeline. 2003: 1 core; 2005: 2 cores; 2006: 4 cores; 2010: 6 cores.]
![Page 65: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/65.jpg)
[Figure: the same timeline extended with many-core and integrated devices. Multicore: 2003 (1 core), 2005 (2 cores), 2006 (4 cores), 2010 (6 cores). Many-core: 2007, 2010, 2012, and NVIDIA Maxwell. Integrated parts between 2008 and 2014: Stellarton (SoC, 1 core, CPU+FPGA), APU 1st gen, APU 2nd gen, SoC (2 cores), APU 3rd gen (Kaveri), SoC (6 cores).]
![Page 66: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/66.jpg)
Portability Backup
![Page 67: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/67.jpg)
Granularity Tuning (OpenCL)
Results of thread coarsening for Parboil benchmarks (written for NVIDIA SIMT GPUs) on AMD Radeon HD6990 (VLIW-5)
Results compiled using MulticoreWare’s SlotMaximizer
* Not a single kernel
** Results from more than one dimension coarsening
![Page 68: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/68.jpg)
Reduction – CPU vs. GPU (Part 1)
• CPUs favor intra-thread locality
• GPUs favor inter-thread locality (within Work Groups)
• Tree-shape parallel reduction
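The contrast between the two locality styles can be sketched on the host in plain C++. This is only an illustration, not code from the talk: the function names `reduce_chunked` and `reduce_tree` are made up here, and the "work-items" of the tree version are emulated by an ordinary loop.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// CPU style: each "thread" walks one contiguous chunk sequentially
// (intra-thread locality); the per-chunk partial sums are then combined.
double reduce_chunked(const std::vector<double>& data, std::size_t num_chunks) {
    assert(num_chunks > 0);
    const std::size_t chunk = (data.size() + num_chunks - 1) / num_chunks;
    double total = 0.0;
    for (std::size_t c = 0; c < num_chunks; ++c) {  // one iteration per "thread"
        const std::size_t begin = std::min(c * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        double partial = 0.0;
        for (std::size_t i = begin; i < end; ++i) partial += data[i];
        total += partial;
    }
    return total;
}

// GPU style: tree-shaped reduction. In every step, "work-item" i adds the
// element `half` positions away, so neighbouring work-items touch
// neighbouring elements (inter-thread locality).
double reduce_tree(std::vector<double> buf) {
    if (buf.empty()) return 0.0;
    std::size_t n = buf.size();
    while (n > 1) {
        const std::size_t half = (n + 1) / 2;        // active work-items this step
        for (std::size_t i = 0; i + half < n; ++i)   // conceptually parallel
            buf[i] += buf[i + half];
        n = half;
    }
    return buf[0];
}
```

Both functions return the same sum; what differs is the order in which memory is touched, which is the property each architecture rewards.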
![Page 69: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/69.jpg)
Reduction – CPU vs. GPU (Part 2)
• CPU: 2-level hierarchy
• GPU: 4-level hierarchy
• Collect from Work Group partial results
![Page 70: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/70.jpg)
CPU Parameter Tuning
Mandelbrot performance with vector width
[Figure: speedup (axis 0 to 8) versus image size (128, 256, 512, 1024, 2048, 4096) for Scalar, SSE, and AVX code. Results courtesy of intel.com.]
![Page 71: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/71.jpg)
GPU Parameter Tuning
Non-portable tile sizes
[Figure: relative performance (all running on a Fermi GPU) of three versions: original version tuned for Tesla, tiling parameters retuned for Fermi, and tiling restructured; bar labels 58% and 68%.]
![Page 72: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/72.jpg)
![Page 73: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/73.jpg)
Slide courtesy of nvidia.com
![Page 74: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/74.jpg)
CPU Xeon Phi
C/FORTRAN
OpenMP, TBB, Pthreads, Cilk…
CUDA, OpenCL
Multicore GPU
+ SIMD Intrinsics
Verilog, VHDL
FPGA
![Page 75: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/75.jpg)
CPU Xeon Phi
C/FORTRAN
OpenMP, TBB, Pthreads, Cilk…
Multicore GPU
+ SIMD Intrinsics
Verilog, VHDL
FPGA
CUDA, OpenCL
MxPA
![Page 76: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/76.jpg)
CPU Xeon Phi
C/FORTRAN
OpenMP, TBB, Pthreads, Cilk…
Multicore GPU
+ SIMD Intrinsics
Verilog, VHDL
FPGA
CUDA, OpenCL
MxPA
• Locality-centric work-item scheduling
• Speedups of 3.32x and 1.71x over AMD and Intel OpenCL implementations
![Page 77: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/77.jpg)
CPU Xeon Phi
C/FORTRAN
OpenMP, TBB, Pthreads, Cilk…
CUDA, OpenCL
Multicore GPU
+ SIMD Intrinsics
Verilog, VHDL
FPGA
Tangram
![Page 78: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/78.jpg)
Tangram Backup
![Page 79: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/79.jpg)
Devices have different architectural hierarchies
![Page 80: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/80.jpg)
Computation Codelets
Decomposition Codelets
Programmer writes architecture-neutral computations and decomposition rules
![Page 81: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/81.jpg)
Computation Codelets
Decomposition Codelets
Compiler maps computations
to each level of the hierarchy…
![Page 82: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/82.jpg)
Computation Codelets
Decomposition Codelets
…and decomposition rules between
each level
![Page 83: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/83.jpg)
DySel Backup
![Page 84: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/84.jpg)
• Pronounced as diesel /ˈdiːzəl/
• Implies low cost and high efficiency
  • Diesel was cheaper than regular gas when we submitted the paper… :v
• A small but useful tool to save compiler optimization developers
![Page 85: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/85.jpg)
Motivation
• Statically determining the optimal code could be difficult or even infeasible
  • Sometimes it is input-dependent
• Even a robust compiler or an expert could select a suboptimal sequence of optimizations
  • A catastrophic performance loss could happen
![Page 86: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/86.jpg)
Example: Intel OpenCL Vectorization for CPU
• Suboptimal heuristic for vectorization in sgemm and spmv-jds
[Figure: performance chart annotated with slowdowns of 2.13X and 1.24X.]
![Page 87: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/87.jpg)
Relax the Constraints
• Instead of asking the compiler for the single optimized version it considers best
• Ask the compiler for multiple competitive versions
  • A typical number is around 4-10
• Let the runtime make the final selection
![Page 88: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/88.jpg)
Version Selection at Runtime
• We propose DySel for dynamic version selection at runtime
• Apply micro-profiling to sample the performance of each candidate
![Page 89: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/89.jpg)
Micro-Profiling
• Profile a kernel with a smaller workload
  • A smaller number of work-groups/thread blocks
• Avoid a large impact on performance
• Multiple micro-profiling runs can be scheduled and even executed concurrently
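A minimal host-side sketch of the micro-profiling idea, assuming each candidate version is a callable over a range of work-groups. The names `Kernel`, `Meter`, `wall_clock_meter`, and `select_version` are hypothetical and are not the DySel API; the measurement is passed in as a parameter so that the selection logic can be exercised with a deterministic cost instead of a timer.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// A kernel version processes work-groups [begin, end).
using Kernel = std::function<void(std::size_t begin, std::size_t end)>;
// Returns the cost of running `k` on [begin, end); replaceable for testing.
using Meter = std::function<double(const Kernel& k, std::size_t begin, std::size_t end)>;

// Default meter: wall-clock seconds measured with a steady clock.
inline double wall_clock_meter(const Kernel& k, std::size_t begin, std::size_t end) {
    const auto t0 = std::chrono::steady_clock::now();
    k(begin, end);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Micro-profile every candidate on the same small sample of work-groups
// and return the index of the cheapest one.
std::size_t select_version(const std::vector<Kernel>& candidates,
                           std::size_t sample_groups,
                           const Meter& meter = wall_clock_meter) {
    assert(!candidates.empty());
    std::size_t best = 0;
    double best_cost = meter(candidates[0], 0, sample_groups);
    for (std::size_t v = 1; v < candidates.size(); ++v) {
        const double cost = meter(candidates[v], 0, sample_groups);
        if (cost < best_cost) { best_cost = cost; best = v; }
    }
    return best;
}
```

Here every candidate runs the same sample, so the candidates would need private outputs; keeping `sample_groups` small relative to the full launch is what bounds the profiling overhead.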
![Page 90: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/90.jpg)
Productive Profiling Mode
• Computation in profiling also contributes to the final output
[Figure: Version A and Version B are each profiled on part of the workload, a compute phase handles the rest, and all results go to the same Output.]
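Productive profiling can be sketched as follows, under the assumption that the work-groups are independent so each candidate can be profiled on its own disjoint slice. `run_productive`, `Kernel`, and `Meter` are illustrative names, not part of DySel.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Kernel = std::function<void(std::size_t begin, std::size_t end)>;
using Meter  = std::function<double(const Kernel&, std::size_t, std::size_t)>;

// Productive profiling: candidate v is profiled on its own slice
// [v * slice, (v + 1) * slice), so profiled work is never repeated.
// The cheapest candidate then computes everything after the profiled
// slices. Returns the index of the selected candidate.
std::size_t run_productive(const std::vector<Kernel>& candidates,
                           std::size_t total_groups, std::size_t slice,
                           const Meter& meter) {
    assert(!candidates.empty());
    assert(slice > 0 && candidates.size() * slice <= total_groups);
    std::size_t best = 0;
    double best_cost = 0.0;
    for (std::size_t v = 0; v < candidates.size(); ++v) {
        const double cost = meter(candidates[v], v * slice, (v + 1) * slice);
        if (v == 0 || cost < best_cost) { best_cost = cost; best = v; }
    }
    candidates[best](candidates.size() * slice, total_groups);
    return best;
}
```

Because every slice is written by exactly one version, the output is complete once the winner finishes the remainder, and no profiling work is thrown away.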
![Page 91: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/91.jpg)
Synchronous vs Asynchronous Scheduling
• Synchronous: Schedule the remaining workload after the best version is finalized
• Asynchronous: Schedule remaining workload eagerly in a batch using the current best candidate
[Figure: timelines of workload, profile, and compute phases for (a) Sync, where compute is blocked until profiling ends, (b) Async with a bad default, and (c) Async with a good default.]
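The difference between the two policies can be shown with a toy model that only records which version is used for each batch. The model is an assumption made for illustration: profiling results are taken to arrive one per scheduling step, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Index of the cheapest version among the first `known` profiled versions.
std::size_t best_so_far(const std::vector<double>& cost, std::size_t known) {
    assert(known > 0 && known <= cost.size());
    std::size_t best = 0;
    for (std::size_t v = 1; v < known; ++v)
        if (cost[v] < cost[best]) best = v;
    return best;
}

// Sync: wait until all versions are profiled, then use the winner for
// every batch of the remaining workload.
std::vector<std::size_t> schedule_sync(const std::vector<double>& cost,
                                       std::size_t num_batches) {
    return std::vector<std::size_t>(num_batches, best_so_far(cost, cost.size()));
}

// Async: issue batches eagerly. Toy assumption: one more profiling result
// becomes available before each new batch is issued.
std::vector<std::size_t> schedule_async(const std::vector<double>& cost,
                                        std::size_t initial_default,
                                        std::size_t num_batches) {
    assert(initial_default < cost.size());
    std::vector<std::size_t> used;
    for (std::size_t t = 0; t < num_batches; ++t) {
        const std::size_t known = t < cost.size() ? t : cost.size();
        used.push_back(known == 0 ? initial_default : best_so_far(cost, known));
    }
    return used;
}
```

With costs {4, 1, 3} and an initial default of version 2, the async schedule uses versions 2, 0, 1, 1, 1 for five batches: the early batches run on whatever is known at the time, which is why the quality of the initial default matters.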
![Page 92: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/92.jpg)
Sync vs Async Scheduling
• Sync
  • Schedule the remaining workload after the best version is finalized
• Async
  • Schedule the remaining workload eagerly in batches using the current best candidate
![Page 93: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/93.jpg)
Sync vs Async Scheduling
[Figure: flowcharts of the two schedulers, both fed by a Kernel Version Generator that produces versions K1 to K5.
(a) Sync: assign workgroups to each Ki; run productive micro-profiling of each Ki; set Kselect = best Ki; apply Kselect to compute the remaining workload.
(b) Async: suggest an initial Kdefault; assign workgroups to each Ki for productive micro-profiling; schedule Kdefault for a batch of work groups; if profiling has not finished, update Kdefault using the best profiled Ki so far and schedule another batch; once profiling has finished, set Kselect = best Ki and apply Kselect to compute the remaining workload.]
![Page 94: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/94.jpg)
Sync vs Async Scheduling
[Figure: timelines of workload, profile, and compute phases for (a) Sync, with a blocking wait, (b) Async (bad default), and (c) Async (good default).]
![Page 95: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/95.jpg)
Things I skipped
• The two extra profiling modes
• Applicability and resource requirement of each mode
• What kinds of compiler analyses are needed for the different modes
• Where compilers add profiling code in both CPU and GPU
• More details about DySel runtime using TBB and CUDA
![Page 96: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/96.jpg)
DySel Interface
(a) Kernel Implementation Registration API
    DySelAddKernel(
        string kernel_sig,                // kernel name
        func_ptr implementation,          // kernel implementation
        dim3 wa_factor,                   // work assignment factor
        vector<int> sandbox_index = []    // argument offsets for sandboxes/private outputs
    );

(b) Kernel Launch API
    DySelLaunchKernel(
        string kernel_sig,                // kernel name
        bool profiling = true,            // profiling activation flag
        enum mode = fully_async           // profiling mode
    );
![Page 97: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/97.jpg)
Case Study: Locality-centric Scheduling for CPU OpenCL
• Iterate in-kernel loops first or work-item loops first for OpenCL on CPU (CGO’15), using MxPA
  • Decided by analyzing access patterns
• It is open-source and robust
  • “3.32x over AMD, 1.71x over Intel OpenCL stacks”
![Page 98: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/98.jpg)
Case Study: Locality-centric Scheduling for CPU OpenCL
[Figure: performance chart annotated with a 1.15X slowdown.]
![Page 99: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/99.jpg)
Case Study: Data Placement for GPU
• Data placement optimizations are crucial for performance on GPUs (TPDS 2011 & MICRO 2014)
  • Although they are not open-source, they did show the transformed results
• Suboptimal decisions due to an inaccurate model or an improper heuristic
[Figure: relative execution time over oracle (lower is better) for spmv-csr and particlefilter; series: Oracle, Sync, Async (best initial selection), Async (worst initial selection), PORPLE, Heuristic-based, Worst; annotated slowdowns of 1.29X and 2.29X.]
![Page 100: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/100.jpg)
Case Study: Experts’ Mixed Optimizations
• Parboil provides multiple versions with different optimization strategies
  • Optimized versions usually run better
• Some optimizations are improper or redundant
  • E.g. loop unrolling and prefetching in spmv-jds on Kepler
[Figure: relative execution time over oracle (lower is better) for cutcp, sgemm, spmv-jds, stencil, and GeoMean on (a) CPU and (b) GPU; series: Oracle, Sync, Async (best initial selection), Async (worst initial selection), Worst; labeled values 7.74 and 2.28.]
![Page 101: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/101.jpg)
Case Study: Input-dependent Optimizations
• Best optimizations could be input-dependent
[Figure: relative execution time over oracle (lower is better) for a random matrix and a diagonal matrix on (a) CPU and (b) GPU. CPU series: Oracle, Sync, Async (best initial selection), Async (worst initial selection), scalar DFO, scalar BFO, vector DFO, vector BFO, Worst; labeled values 8.63, 8.63, 8.60. GPU series: Oracle, Sync, Async (best initial selection), Async (worst initial selection), Scalar, Vector, Worst; labeled values 4.73, 4.73, 22.73, 22.73.]
![Page 102: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/102.jpg)
Conclusion
• DySel can deliver high accuracy and low overhead for dynamic version selection in data-parallel programming models
  • Incurs less than 8% overhead in the worst observed case
• Using DySel is like buying insurance…
![Page 103: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/103.jpg)
MxPA Backup
![Page 104: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/104.jpg)
Contributions
• Exploiting data locality in scheduling work-items for performance
• Real-system measurements demonstrate speedups of 3.32x and 1.71x over AMD and Intel OpenCL implementations
  • 18 benchmarks from Parboil and Rodinia
• Nominated for best paper award at CGO’15
• AE certified
![Page 105: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/105.jpg)
OpenCL Programming Model
[Figure: a Device contains Global Memory and several Compute Units, each with its own Local Memory; a Kernel consists of Work Groups, each made up of Work Items.]
![Page 106: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/106.jpg)
OpenCL Execution Model
Kernel code:
    void kernel(…) {
        i0; i1; …; ia-1;
        barrier();
        ia; ia+1; …; ib-1;
    }
Legend: ii = instruction or instruction block; wi = work-item; wg = work-group; LS = local size; GS = global size.
[Figure: execution graph. Columns are work-items wi0 … wiGS-1, grouped into work-groups wg0 … wgGS/LS-1 of LS work-items each; rows are instructions i0 … in-1. Arrows mark immediate dependencies, and the barrier for the work-items in a work-group separates region0 (i0 … ia-1) from region1 (ia onward).]
How to schedule this execution graph on a multicore CPU?
![Page 107: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/107.jpg)
Work-group Scheduling
• Assign whole work-groups to different cores
  • Considerations: locality, load balance
[Figure: work-groups distributed across four CPU cores.]
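Two simple static policies illustrate the locality versus load-balance trade-off named above. They are a sketch only; the slides do not say which policy MxPA or any vendor runtime uses, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Assignment = std::vector<std::vector<std::size_t>>;  // per core: work-group ids

// Round-robin: whole work-groups are dealt out one at a time, which evens
// out the count per core.
Assignment assign_round_robin(std::size_t num_groups, std::size_t num_cores) {
    assert(num_cores > 0);
    Assignment per_core(num_cores);
    for (std::size_t wg = 0; wg < num_groups; ++wg)
        per_core[wg % num_cores].push_back(wg);
    return per_core;
}

// Blocked: each core receives a contiguous range of work-groups, which
// keeps neighbouring work-groups (and often their data) on the same core.
Assignment assign_blocked(std::size_t num_groups, std::size_t num_cores) {
    assert(num_cores > 0);
    Assignment per_core(num_cores);
    const std::size_t chunk = (num_groups + num_cores - 1) / num_cores;
    for (std::size_t wg = 0; wg < num_groups; ++wg)
        per_core[wg / chunk].push_back(wg);
    return per_core;
}
```

For five work-groups on two cores, round-robin yields {0, 2, 4} and {1, 3}, while blocked yields {0, 1, 2} and {3, 4}.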
![Page 108: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/108.jpg)
Region Scheduling
• Serialize barrier-separated regions
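Serializing regions can be sketched by emulating one work-group on a single core. The kernel shape used here (a copy into local memory, a barrier, then a neighbour sum) is an invented example, and `run_work_group` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates one work-group of a kernel of the form
//   region0: local[lid] = in[lid];
//   barrier();
//   region1: out[lid] = local[lid] + local[(lid + 1) % LS];
// Each region runs to completion for all work-items before the next
// region starts, which makes the barrier implicit.
std::vector<int> run_work_group(const std::vector<int>& in) {
    assert(!in.empty());
    const std::size_t LS = in.size();            // local size
    std::vector<int> local(LS), out(LS);
    for (std::size_t lid = 0; lid < LS; ++lid)   // region0 for every work-item
        local[lid] = in[lid];
    // barrier(): nothing to do, region0 has finished for all work-items
    for (std::size_t lid = 0; lid < LS; ++lid)   // region1 for every work-item
        out[lid] = local[lid] + local[(lid + 1) % LS];
    return out;
}
```

Running region1 before region0 had finished for all work-items would read uninitialized entries of `local`, which is exactly the hazard the barrier guards against.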
![Page 109: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/109.jpg)
Work-item Scheduling
• How to schedule work-items within a region?
  • Different approaches by different compilers
![Page 110: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/110.jpg)
Existing Approaches
• Industry
  • Intel
  • AMD (Twin Peaks)
• Academia
  • Karrenberg & Hack
  • SnuCL
  • pocl
Depth First Order (DFO) Scheduling
![Page 111: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/111.jpg)
Existing Approaches
• Industry
  • Intel
  • AMD (Twin Peaks)
• Academia
  • Karrenberg & Hack
  • SnuCL
  • pocl
DFO Scheduling with Vectorization (time progresses as color gets darker)
![Page 112: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/112.jpg)
Memory Access Patterns
• e.g. bfs (each thread traverses a list of neighbors)
• e.g. sgemm (threads computing adjacent outputs access adjacent inputs)
• e.g. kmeans (all threads loop over the same mean values)
![Page 113: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/113.jpg)
DFO and Locality
![Page 114: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/114.jpg)
DFO and Locality
![Page 115: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/115.jpg)
DFO and Locality
![Page 116: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/116.jpg)
Alternative Schedule: BFO
Breadth First Order (BFO) Scheduling
![Page 117: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/117.jpg)
Alternative Schedule: BFO
BFO with Vectorization (time progresses as color gets darker)
![Page 118: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/118.jpg)
DFO’s vs. BFO’s Impact on Locality
[Bar chart: L1 data cache misses (normalized to worst), scale 0 to 1, for DFO and BFO across benchmarks sgm, ctcp, mrig, tpcf, sc, hw, kmns, hst, mriq, nw, spmv, lkct, lud, pf, sad, pbfs, rbfs, lmd, and geo. Benchmarks are grouped under "BFO has better locality" and "DFO has better locality".]
BFO has better locality for 13 benchmarks, DFO has better locality for 5 benchmarks. No schedule is always the best.
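A toy illustration of why neither order always wins. This is a sketch under stated assumptions, not the measurement setup behind the chart: DFO is modeled as each work-item running all loop iterations before the next work-item starts, BFO as all work-items running one iteration before any moves on, and the cache is a tiny fully associative LRU cache whose line size and capacity are arbitrary choices.

```python
from collections import OrderedDict

LINE = 4       # elements per cache line (assumed)
CAPACITY = 2   # lines held by the toy LRU cache (assumed)
W, N = 8, 16   # work-items and loop iterations (assumed)

def count_misses(addresses):
    """Count misses of a fully associative LRU cache over an address trace."""
    cache = OrderedDict()
    misses = 0
    for addr in addresses:
        line = addr // LINE
        if line in cache:
            cache.move_to_end(line)        # refresh LRU position
        else:
            misses += 1
            cache[line] = True
            if len(cache) > CAPACITY:
                cache.popitem(last=False)  # evict least recently used line
    return misses

def dfo(index_fn):
    # depth first: work-item loop outermost
    return [index_fn(wi, it) for wi in range(W) for it in range(N)]

def bfo(index_fn):
    # breadth first: loop iteration outermost
    return [index_fn(wi, it) for it in range(N) for wi in range(W)]

def shared_values(wi, it):
    # kmeans-like: all work-items read the same element per iteration
    return it

def private_rows(wi, it):
    # bfs-like: each work-item walks its own contiguous list
    return wi * N + it

results = {
    fn.__name__: (count_misses(dfo(fn)), count_misses(bfo(fn)))
    for fn in (shared_values, private_rows)
}
for name, (dfo_misses, bfo_misses) in results.items():
    print(f"{name}: DFO={dfo_misses} misses, BFO={bfo_misses} misses")
```

In this toy model BFO misses less on the shared-value pattern and DFO misses less on the private-row pattern, which mirrors the slide's point that the better schedule depends on the access pattern.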
![Page 119: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/119.jpg)
Locality Centric (LC) Scheduling
[Figure: DFO scheduling and BFO scheduling orders over work-items wi0, wi1, ..., wiLS-1 and instructions ibefore, i0, i1, ..., iN-1, iafter.]
[Flowchart: kernel region → contains loop? (Yes/No) → classify memory accesses in loop → prefers BFO? (Yes/No) → DFO scheduling or BFO scheduling.]
![Page 120: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/120.jpg)
Locality Centric (LC) Scheduling

| Loop Iteration Stride \ Work-item Stride | 0 | 1 | Other |
|---|---|---|---|
| 0 | - | DFO | DFO |
| 1 | BFO | - | DFO |
| Other | BFO | BFO | - |

Classify memory accesses per loop body and tally which schedule has greater popularity
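The classify-and-tally step can be sketched as a table lookup followed by a vote. The table entries follow the slide; the function name, the string encoding of strides, and the tie-breaking default are assumptions made for illustration.

```python
from collections import Counter

# (loop-iteration stride, work-item stride) -> preferred schedule.
# Entries follow the slide's table; "-" cells cast no vote and are omitted.
LC_TABLE = {
    ("0", "1"): "DFO",
    ("0", "other"): "DFO",
    ("1", "0"): "BFO",
    ("1", "other"): "DFO",
    ("other", "0"): "BFO",
    ("other", "1"): "BFO",
}

def choose_schedule(accesses, default="DFO"):
    """Tally the preferred schedule over the accesses in one loop body.

    `accesses` is a list of (loop-iteration stride, work-item stride) pairs.
    The `default` used on a tie or when no access votes is an assumption.
    """
    votes = Counter(LC_TABLE[a] for a in accesses if a in LC_TABLE)
    if votes["BFO"] > votes["DFO"]:
        return "BFO"
    if votes["DFO"] > votes["BFO"]:
        return "DFO"
    return default

# Two accesses shared by all work-items, one private walk: BFO wins 2 to 1.
print(choose_schedule([("1", "0"), ("1", "0"), ("1", "other")]))
```

Each entry picks the order whose innermost traversal has the smaller stride, so the accesses issued back to back stay close together in memory.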
![Page 121: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/121.jpg)
LC’s Impact on Locality
[Bar chart: L1 data cache misses (normalized to worst), scale 0 to 1, for DFO, BFO, and LC across benchmarks sgm, ctcp, mrig, tpcf, sc, hw, kmns, hst, mriq, nw, spmv, lkct, lud, pf, sad, pbfs, rbfs, lmd, and geo. Benchmarks are grouped under "BFO has better locality" and "DFO has better locality".]
LC captures the best of both schedules
![Page 122: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/122.jpg)
Locality Results
[Bar chart: L1 data cache misses (normalized to worst), scale 0 to 1, for AMD, Intel, and LC across benchmarks sgm, ctcp, tpcf, mrig, lkct, sc, lmd, kmns, hw, hst, pf, lud, mriq, nw, spmv, sad, pbfs, rbfs, and geo.]
LC has best locality for most benchmarks
![Page 123: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/123.jpg)
Performance Results
[Bar chart: speedup (normalized to fastest), scale 0 to 1, for AMD, Intel, LC (no vec.), and LC across benchmarks ctcp, hst, hw, kmns, lkct, lmd, lud, mrig, mriq, nw, pbfs, pf, rbfs, sad, sc, sgm, spmv, tpcf, and geo.]
LC (with vec.) outperforms AMD (without vec.) and Intel (with vec.) by 3.32x and 1.71x
LC (without vec.) is faster than Intel (with vec.) by 1.04x
![Page 124: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/124.jpg)
Summary
• Proposed an alternative scheduling approach to the state-of-the-art
• Demonstrated that no schedule is always best and proposed a static schedule selection
• Outperformed industry implementations in memory system efficiency and performance
![Page 125: Innovative Applications and Technology Pivots A Perfect ... · Wen-mei Hwu Professor and Sanders-AMD Chair, ECE, NCSA University of Illinois at Urbana-Champaign ... 10-100 microseconds](https://reader034.fdocuments.us/reader034/viewer/2022042121/5e9be95fdb47217be355dd6e/html5/thumbnails/125.jpg)
Heterogeneous Computing in Blue Waters
Blue Waters contains 4,224 Cray XK7 compute nodes.
Dual-socket Node
• One AMD Interlagos chip
  • 8 core modules, 32 threads
  • 156.5 GFLOPS peak performance
  • Consumes 2,504 GB of data per second
  • 32 GB memory
  • 51 GB/s bandwidth
• One NVIDIA Kepler chip
  • 1.3 TFLOPS peak performance
  • Consumes 20,800 GB of data per second
  • 6 GB GDDR5 memory
  • 250 GB/s bandwidth
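A short worked check of these figures. The "consumes" numbers match 16 bytes of operand data per floating-point operation (for example, two 8-byte operands); that factor is inferred from the slide's numbers rather than stated on it, so treat it as an assumption.

```python
BYTES_PER_FLOP = 16  # assumption inferred from the slide: 2,504 / 156.5 = 16

def demand_gb_per_s(peak_gflops):
    """Operand data needed per second to sustain peak compute, in GB/s."""
    return peak_gflops * BYTES_PER_FLOP

# name: (peak GFLOPS, memory bandwidth in GB/s), both from the slide
chips = {
    "AMD Interlagos": (156.5, 51.0),
    "NVIDIA Kepler": (1300.0, 250.0),
}

summary = {}
for name, (peak, bandwidth) in chips.items():
    demand = demand_gb_per_s(peak)
    summary[name] = (demand, demand / bandwidth)
    print(f"{name}: needs {demand:,.0f} GB/s, memory supplies {bandwidth:,.0f} GB/s "
          f"({demand / bandwidth:.1f}x gap)")
```

Under that assumption, peak compute on either chip wants roughly 49x (CPU) and 83x (GPU) more data than main memory can deliver, so most operands have to come from registers and caches.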