Keeping Performance Portable In High Performance Kernels

Transcript of "Keeping Performance Portable In High Performance Kernels".
Keeping Performance Portable In High Performance Kernels
Saman Amarasinghe, Una-May O'Reilly, Jason Ansel, Phitchaya Mangpo Phothilimthana, Jonathan Ragan-Kelley
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology
Example: Convolution (2D Convolution)

[Figure: a 3×3 2D kernel, with entries ki·kj, slides over the input array to produce each element of the output.]
Example: Convolution (Algorithmic Choices)

[Figure: two equivalent algorithms. 2D convolution applies the 3×3 kernel (entries ki·kj) directly: input → output. Separable convolution first applies the row kernel [k1 k2 k3] (Convolve Row: input → intermediate), then the column kernel [k1; k2; k3] (Convolve Column: intermediate → output).]
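To see concretely why the two choices are interchangeable, here is a small Python sketch using plain lists and no external libraries. The function names merely echo the slide; this is not ZettaBricks code.

```python
# 2D convolution with a separable kernel (entries ki*kj) equals a row
# pass followed by a column pass over the same 1D kernel.
def convolve2d(inp, k):
    n = len(k)
    h, w = len(inp), len(inp[0])
    return [[sum(inp[y + i][x + j] * k[i] * k[j]
                 for i in range(n) for j in range(n))
             for x in range(w - n + 1)]
            for y in range(h - n + 1)]

def convolve_rows(inp, k):
    n = len(k)
    return [[sum(row[x + j] * k[j] for j in range(n))
             for x in range(len(row) - n + 1)]
            for row in inp]

def convolve_columns(inp, k):
    n = len(k)
    h, w = len(inp), len(inp[0])
    return [[sum(inp[y + i][x] * k[i] for i in range(n))
             for x in range(w)]
            for y in range(h - n + 1)]

inp = [[float(x * y % 7) for x in range(6)] for y in range(5)]
k = [1.0, 2.0, 1.0]
out1 = convolve2d(inp, k)
out2 = convolve_columns(convolve_rows(inp, k), k)
assert out1 == out2   # both choices compute the same output
```

The separable version does O(n) work per output element per pass instead of O(n²), which is exactly why it is worth exposing as a choice.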
ZettaBricks Language [PLDI’09]
```
transform SeparableConvolution
from In[w, h], Kernel[KWIDTH]
to Out[w - KWIDTH+1, h - KWIDTH+1]
{
  // Choice 1: single pass 2D convolution
  to(Out out) from(In in, Kernel kernel) {
    Convolve2D(out, in, kernel);
  }

  // Choice 2: two pass separable convolution
  to(Out out) from(In in, Kernel kernel)
    using(buffer[w - KWIDTH+1, h]) {
    ConvolveRows(buffer, in, kernel);
    ConvolveColumns(out, buffer, kernel);
  }
}
```
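The point of declaring both choices is that the system can pick the faster one empirically. A hypothetical Python sketch of that pick-by-timing idea (the helper and the stand-in functions are invented for illustration; they are not part of ZettaBricks):

```python
# Time each interchangeable choice on representative input and keep the
# one with the lowest median wall-clock time.
import time

def pick_fastest(choices, make_input, trials=3):
    """Return the callable with the lowest median wall-clock time."""
    def median_time(fn):
        samples = []
        for _ in range(trials):
            data = make_input()
            t0 = time.perf_counter()
            fn(data)
            samples.append(time.perf_counter() - t0)
        return sorted(samples)[trials // 2]
    return min(choices, key=median_time)

# Hypothetical stand-ins for the two convolution choices:
def convolve2d(data):
    return [x * 9 for x in data]

def separable(data):
    return [(x * 3) * 3 for x in data]

best = pick_fastest([convolve2d, separable], lambda: list(range(10_000)))
assert best in (convolve2d, separable)
```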
ZettaBricks for Heterogeneous Systems [ASPLOS'13]

[Diagram: a ZettaBricks program goes through the Compiler to produce a C++ output program; the Runtime System feeds training information to the Autotuner, which feeds a choice configuration back to the Compiler.]

Compiler: dependency analysis, task creation, task scheduling, C++ code gen, etc.; for heterogeneous systems, also data movement analysis, CPU/GPU task creation, and OpenCL code gen.

Autotuner: algorithmic choices, parallelization techniques, data distributions, transformations, etc.; for heterogeneous systems, also CPU/GPU choices, global/local memory, CPU-GPU workload ratio, and GPU local work size.

Runtime System: CPU work-stealing model; for heterogeneous systems, also a GPU work-pushing model and memory management.
Scheduling Choices: Convolution

Before adding OpenCL:
Schedule 1: Convolve2D();
Schedule 2: ConvolveRows(); ConvolveColumns();

After adding OpenCL:
Schedule 1: Convolve2D();
Schedule 2: Convolve2D_opencl();
Schedule 3: ConvolveRows(); ConvolveColumns();
Schedule 4: ConvolveRows(); ConvolveColumns_opencl();
Schedule 5: ConvolveRows_opencl(); ConvolveColumns();
Schedule 6: ConvolveRows_opencl(); ConvolveColumns_opencl();
Scheduling Choices: Convolution

Original choices:
Schedule 1: Convolve2D();
Schedule 2: ConvolveRows(); ConvolveColumns();

After adding OpenCL:
Schedule 1: Convolve2D();
Schedule 2: Convolve2D_opencl();
Schedule 3: ConvolveRows(); ConvolveColumns();
Schedule 4: ConvolveRows(); ConvolveColumns_opencl();
Schedule 5: ConvolveRows_opencl(); ConvolveColumns();
Schedule 6: ConvolveRows_opencl(); ConvolveColumns_opencl();

After adding local memory versions:
Schedule 1: Convolve2D();
Schedule 2: Convolve2D_opencl();
Schedule 3: Convolve2D_opencl_local();
Schedule 4: ConvolveRows(); ConvolveColumns();
Schedule 5: ConvolveRows(); ConvolveColumns_opencl();
Schedule 6: ConvolveRows(); ConvolveColumns_opencl_local();
Schedule 7: ConvolveRows_opencl(); ConvolveColumns();
Schedule 8: ConvolveRows_opencl_local(); ConvolveColumns();
Schedule 9: ConvolveRows_opencl(); ConvolveColumns_opencl();
Schedule 10: ConvolveRows_opencl(); ConvolveColumns_opencl_local();
Schedule 11: ConvolveRows_opencl_local(); ConvolveColumns_opencl();
Schedule 12: ConvolveRows_opencl_local(); ConvolveColumns_opencl_local();

Local memory = scratchpad memory shared by all work-items (GPU threads) in a block.
CPU-GPU Workload Balancing

The CPU/GPU ratio parameter statically defines how much of the data is computed on each device.
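A minimal sketch of what such a static ratio means, assuming the work is split by rows (a hypothetical helper, not the ZettaBricks runtime):

```python
# Partition an iteration space between GPU and CPU by a fixed ratio.
def split_work(n_rows, gpu_ratio):
    """Return (gpu_rows, cpu_rows) for a ratio in [0, 1]."""
    cut = int(n_rows * gpu_ratio)
    return range(0, cut), range(cut, n_rows)

gpu, cpu = split_work(1024, 5 / 8)   # e.g. a tuned ratio of 5/8
assert len(gpu) == 640 and len(cpu) == 384
```

The ratio itself is one of the parameters the autotuner searches over, so the split is fixed per configuration, not adjusted at runtime.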
GPU Choice Representation

Schedule 1: Convolve2D();
Schedule 2: Convolve2D_opencl();
Schedule 3: Convolve2D_opencl_local();
Schedule 4: ConvolveRows(); ConvolveColumns();
Schedule 5: ConvolveRows(); ConvolveColumns_opencl();
Schedule 6: ConvolveRows(); ConvolveColumns_opencl_local();
Schedule 7: ConvolveRows_opencl(); ConvolveColumns();
Schedule 8: ConvolveRows_opencl_local(); ConvolveColumns();
Schedule 9: ConvolveRows_opencl(); ConvolveColumns_opencl();
Schedule 10: ConvolveRows_opencl(); ConvolveColumns_opencl_local();
Schedule 11: ConvolveRows_opencl_local(); ConvolveColumns_opencl();
Schedule 12: ConvolveRows_opencl_local(); ConvolveColumns_opencl_local();

[Figure: the schedules arranged as a choice tree, with local work size choices (4, 9, 16, 25, …) and GPU-CPU ratio choices (1/8, 2/8, 3/8, …, 8/8).]
GPU Choice Representation

[Figure: the same twelve schedules as on the previous slide, arranged as a choice tree with local work size, GPU-CPU ratio, and other parameters.]

Big search space! Up to 10^1040 choices, searched with a bottom-up evolutionary algorithm [GECCO'11].
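One way to picture searching such a space is a simple mutation-based evolutionary loop. The sketch below uses made-up parameter ranges and a stand-in cost model; it is not the GECCO'11 bottom-up algorithm itself.

```python
# Toy evolutionary search over a (schedule, local size, ratio) space.
import random

random.seed(0)

SPACE = {"schedule":   list(range(1, 13)),
         "local_size": [4, 9, 16, 25],
         "gpu_ratio":  [i / 8 for i in range(1, 9)]}

def random_config():
    return {k: random.choice(v) for k, v in SPACE.items()}

def mutate(cfg):
    child = dict(cfg)
    key = random.choice(list(SPACE))      # re-sample one dimension
    child[key] = random.choice(SPACE[key])
    return child

def autotune(measure, generations=30, population=8):
    pop = [random_config() for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=measure)             # lower runtime is better
        keep = pop[:population // 2]      # elitist selection
        pop = keep + [mutate(random.choice(keep))
                      for _ in range(population - len(keep))]
    return min(pop, key=measure)

def fake_runtime(cfg):
    # Stand-in cost model instead of a real timed run.
    return abs(cfg["gpu_ratio"] - 0.75) + abs(cfg["local_size"] - 16) / 32

best = autotune(fake_runtime)
```

In the real system each `measure` call is an actual timed execution, which is why giving the search a good structure over the space matters so much.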
Experimental Results

Benchmarks: Convolution, Black-Scholes, Poisson 2D SOR, Sort, Strassen, Tridiagonal Solver, Singular Value Decomposition.
Experiment: Convolution

[Charts: Desktop, Server, Laptop. All choices are in OpenCL.]
Experiment: Convolution
• Autotune on each machine
• Test cross-run
• Normalize execution time by the best config

Desktop config: separable convolution w/ local memory on GPU
Server config: separable convolution in OpenCL
Laptop config: 2D convolution w/ local memory on GPU
Also shown: hand-coded OpenCL. Lower is better.
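The normalization step can be pictured as follows, with illustrative numbers rather than the paper's measurements:

```python
# Run every tuned config on every machine, then divide each time by the
# best time observed on that machine, so 1.0 means "best for this machine".
times = {  # machine -> {config: seconds}  (made-up numbers)
    "desktop": {"desktop_cfg": 1.0, "server_cfg": 1.7, "laptop_cfg": 1.4},
    "server":  {"desktop_cfg": 2.1, "server_cfg": 0.8, "laptop_cfg": 1.9},
}

normalized = {
    machine: {cfg: t / min(cfgs.values()) for cfg, t in cfgs.items()}
    for machine, cfgs in times.items()
}

# The config tuned for a machine normalizes to 1.0 on that machine.
assert normalized["desktop"]["desktop_cfg"] == 1.0
assert normalized["server"]["server_cfg"] == 1.0
```

This makes the cross-run penalty directly readable: a value of 2.0 means running another machine's config took twice as long as the best.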
Experiment: Strassen (Matrix Multiply)

The right configuration can provide a huge performance improvement (16.5x).

Desktop config: data parallel on GPU
Server config: recursive decomposition, then LAPACK on CPU
Laptop config: LAPACK on CPU
Also shown: hand-coded OpenCL.
Experiment: Poisson 2D SOR

The optimal placement on one machine is almost the opposite of the optimal placement on another.

Desktop config: split on CPU, compute on GPU
Server config: split in OpenCL, compute on CPU
Laptop config: split on CPU, compute on GPU
Experiment: Tridiagonal Solver

Algorithmic choice dramatically affects performance.

Desktop config: cyclic reduction on GPU
Server config: direct solve on CPU
Laptop config: direct solve on CPU
Experiment: Sort

It is not always best to use accelerators: every tuned config stays on the CPU.

Desktop config: 2MS -> QS -> 4MS -> IS on CPU
Server config: 4MS -> 2MS -> IS on CPU
Laptop config: 4MS -> 2MS -> 4MS -> IS on CPU
GPU-only config: bitonic sort
Hand-coded OpenCL: radix sort
(2MS/4MS = 2-/4-way merge sort, QS = quicksort, IS = insertion sort)
Experiment: SVD

GPU-CPU task-parallel division on some machines.

Desktop config: task parallelism between CPU and GPU
Server config: all on CPU
Laptop config: all on CPU
Experiment: Black-Scholes

GPU-CPU workload division on some machines.

Desktop config: all on GPU
Server config: all in OpenCL
Laptop config: 25% on CPU, 75% on GPU
Choice Differences Across Machines

[Table: for each benchmark (Convolution, Strassen, SOR, Tridiagonal Solver, Sort, SVD, Black-Scholes), which choices differ across machines: devices (C++/OpenCL), algorithms, GPU-CPU ratio, GPU/CPU task parallelism, and global/local memory.]
OpenTuner: Make Autotuning Available Beyond ZettaBricks
• Every high performance programmer can, and should, use autotuners
• But autotuning is sparsely used; most still do exhaustive search!
• Taking advantage of what we have learned in the 5+ ZettaBricks autotuners
• We use sophisticated machine learning techniques
• A general framework for building autotuners: a toolbox, not a one-size-fits-all autotuner
• Making it available to the community: domain experts can put together a sophisticated autotuner
Lessons from ZettaBricks #1
• Configuration representation is critical
• Cartesian coordinates are often natural and useful, but represent things like trees poorly

OpenTuner: a custom format with dual interfaces:
o Cartesian view: a point in a high-dimensional space
o Maze view: a dynamic number of "moves" can be taken from any current position
Lessons from ZettaBricks #2
• There is no perfect search technique
• Techniques have differing strengths
o Experience with many novel techniques
• Exploitation/exploration tradeoff

OpenTuner: a library of competing techniques:
o Ensembles of techniques run in parallel
o Credit assignment gives larger testing budgets to successful techniques
o Long-term (cross-run) performance informs which techniques are best for each problem
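A toy sketch of the ensemble-with-credit-assignment idea; the techniques and cost function are invented for illustration, and OpenTuner's real mechanism is more sophisticated:

```python
# Two competing techniques propose configs; techniques that produce
# improvements earn credit and get a larger share of the testing budget.
import random

random.seed(1)

def hill_climb(best):
    # tweak one coordinate of the best-known config by +/-1
    x = list(best)
    i = random.randrange(len(x))
    x[i] = min(31, max(0, x[i] + random.choice([-1, 1])))
    return tuple(x)

def restart(_best):
    # ignore history and sample a fresh random config
    return tuple(random.randrange(32) for _ in range(3))

def cost(cfg):
    # stand-in for a timed run; minimum at (16, 16, 16)
    return sum((c - 16) ** 2 for c in cfg)

techniques = {"hill_climb": hill_climb, "restart": restart}
credit = {name: 1.0 for name in techniques}
best = (0, 0, 0)

for _ in range(200):
    # larger credit -> more likely to be given the next test
    name = random.choices(list(techniques),
                          weights=list(credit.values()))[0]
    candidate = techniques[name](best)
    if cost(candidate) < cost(best):
        best = candidate
        credit[name] += 1.0   # credit assignment: reward improvements
```

The bandit-style budget allocation is what lets the ensemble behave like whichever technique happens to suit the current problem.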
Lessons from ZettaBricks #3
• Usage, aggregation, and interpretation of results data vary widely
• Results are often accessed in different ways at different times

OpenTuner: a fully featured database of results (SQL):
o Cross-cutting access and mining of results data
o Supports transactional parallelism
o Long-term knowledge sharing between runs
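A minimal sketch of the shared-results-database idea, using SQLite from the Python standard library; the schema and values are illustrative, not OpenTuner's actual ones:

```python
# Store every (run, config, time) measurement, then mine results
# across runs with ordinary SQL queries.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE results (
    run_id INTEGER, config TEXT, seconds REAL)""")
db.executemany("INSERT INTO results VALUES (?, ?, ?)", [
    (1, "schedule=3,ratio=6/8", 0.042),
    (1, "schedule=9,ratio=8/8", 0.051),
    (2, "schedule=3,ratio=6/8", 0.040),  # knowledge shared across runs
])

# Cross-cutting query: best configuration seen over all runs.
best_cfg, best_t = db.execute(
    "SELECT config, MIN(seconds) FROM results").fetchone()
```

Keeping results in SQL is what makes "different access patterns at different times" cheap: new questions become new queries, not new instrumentation.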
OpenTuner Modules/Processes
OpenTuner Status
• V1 is ready; ZettaBricks is now ported to OpenTuner
• Looking for users
• Come find us at the Technology Marketplace