Code Generation for Embedded Heterogeneous Architectures on Android
Oliver Reiche, Richard Membarth, Frank Hannig, and Jürgen Teich
University of Erlangen-Nuremberg
Motivation
What do we need DSLs and code generation for?
3P: Performance, Productivity, and Portability
What’s the difference for embedded heterogeneous architectures?
25-Mar-14 2Oliver Reiche / University of Erlangen-Nuremberg
Outline
1. Programming Models
2. Code Generation
– HIPAcc Framework
– Renderscript Code Generation
– Vector Support
– HSA Memory Management
3. Results
Programming Models
Android NDK (Native Development Kit)
• no native support for GPUs
• low-level fine tuning:
– implicit and explicit vectorization (SSE/AVX/NEON)
– cache-aware programming
OpenCL (unofficial)
• support for CPUs, GPGPUs and others
• low-level fine tuning:
– explicit mapping of threads
– transparent memory hierarchy
– supports unified CPU/GPU memory
Renderscript
Renderscript Compute
• code mapping to native threads
• targets CPUs and DSPs
• additionally targets GPUs (since Android 4.2)
Filterscript
• stricter limitations
– relaxed precision
– no scatter writes
– pointers are illegal
• ensures wider compatibility
Renderscript in Detail
At first sight, Renderscript has many similarities to OpenCL, but it is fundamentally different ...
Philosophy behind Renderscript
• higher level of programming
• to widen support for different architectures
• dynamic execution on heterogeneous platforms
• decouple the developer from the target hardware
• at the cost of performance
Low-level optimizations are barely possible!
HIPAcc Framework
HIPAcc Framework Overview
HIPAcc Example: Host Code
HIPAcc Example: Kernel Code
Renderscript Code Generation
Renderscript Memory Access
Memory Access Mapping
DSL Kernel mapped to:
• Filterscript
• Renderscript
• Renderscript (4 pixels per thread)
Renderscript Iteration Space
• defined by output buffer size
• no custom launch configuration
When we need fewer threads, e.g., for
• processing multiple pixels per thread
• operating on a fraction of the buffer (ROI)
we need an appropriate Iteration Space Mapping
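As a rough illustration of why such a mapping is needed: the iteration space follows the output buffer size, so a kernel that handles several pixels per invocation needs a launch that is smaller than the image. The Python sketch below only models the index arithmetic. The names `launch_width` and `pixels_for_thread` are hypothetical and are not part of HIPAcc or Renderscript.

```python
def launch_width(image_width: int, pixels_per_thread: int) -> int:
    """Number of invocations per row when each one handles several pixels."""
    # Ceiling division so a partially filled last group still gets a thread.
    return (image_width + pixels_per_thread - 1) // pixels_per_thread


def pixels_for_thread(x: int, image_width: int, pixels_per_thread: int) -> range:
    """Column indices handled by invocation x, clipped at the image border."""
    start = x * pixels_per_thread
    return range(start, min(start + pixels_per_thread, image_width))
```

For a row of 10 pixels and 4 pixels per thread, `launch_width(10, 4)` yields 3 invocations, and the last one covers only the two remaining columns.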
Iteration Space Mapping (3 Approaches)
1. Temporary buffer
– additional memory
– copy overhead: widthROI x heightROI
2. Dummy buffer
– allocation overhead for unused buffer
– not suitable for Filterscript
3. Add guards to the kernel
– suitable for Filterscript
– copy overhead: (widthIMG x heightIMG) – (widthROI x heightROI)
– minor execution overhead
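The third approach can be modelled in a few lines of Python. This is a sketch under one reading of the slide, not generated HIPAcc code: the launch covers the whole image, a guard restricts the computation to the ROI, and every invocation outside the ROI passes its input value through, which is where the copy overhead comes from. The function name `run_with_guards` and the ROI tuple layout are assumptions made for this example.

```python
def run_with_guards(image, roi, kernel):
    """Model approach 3: launch over the whole image, guard inside the kernel.

    image  -- list of rows (lists of numbers)
    roi    -- (x0, y0, width, height) of the region to process
    kernel -- function applied to each pixel value inside the ROI
    Returns the output image and the number of pass-through copies.
    """
    x0, y0, roi_w, roi_h = roi
    copies = 0
    output = []
    for y, row in enumerate(image):      # one "invocation" per output element
        out_row = []
        for x, value in enumerate(row):
            inside = x0 <= x < x0 + roi_w and y0 <= y < y0 + roi_h
            if inside:                   # guard: only the ROI is computed
                out_row.append(kernel(value))
            else:                        # outside the ROI the value is copied
                out_row.append(value)
                copies += 1
        output.append(out_row)
    return output, copies
```

For a 4 x 3 image and a 2 x 1 ROI, the model performs 12 - 2 = 10 pass-through copies, matching the overhead term listed for this approach.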
Vector Support
Mobile GPUs: SIMD Units
Vector support is crucial for performance
Vector Support
• added vector types Tn (e.g., float4)
• added conversion functions Tn convert_Tn(…)
Figure: Single Core of the ARM Mali-T604
HSA Memory Management
Support for unified CPU/GPU memory
• abstract memory from developer
• implicitly handle memory transfers
• manage map() and unmap() operations
avoid unnecessary memory copies
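The bookkeeping behind implicit map() and unmap() handling might look like the following Python sketch. It is an illustration only, not the HIPAcc runtime: the class `ManagedBuffer` and its methods are hypothetical, and the counters merely stand in for the real map and unmap calls. The point is that a single shared allocation changes hands between host and device without being copied.

```python
class ManagedBuffer:
    """Tracks whether a unified-memory buffer is currently mapped for the host."""

    def __init__(self, size: int):
        self.data = [0] * size   # single allocation shared by host and device
        self.mapped = False      # True while the host may touch the data
        self.map_calls = 0
        self.unmap_calls = 0

    def host_view(self):
        """Return the data for host access, mapping only if necessary."""
        if not self.mapped:
            self.map_calls += 1      # stands in for the runtime's map() call
            self.mapped = True
        return self.data

    def launch(self, kernel):
        """Run a kernel on the buffer, unmapping first if the host holds it."""
        if self.mapped:
            self.unmap_calls += 1    # stands in for the runtime's unmap() call
            self.mapped = False
        self.data[:] = [kernel(v) for v in self.data]
```

Repeated host accesses or back-to-back kernel launches trigger no further map or unmap operation in this model; a transition happens only when ownership actually changes.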
Results
Results: Productivity
Productivity
HIPAcc is
• up to 156x more compact than OpenCV
• up to 780x more compact than generated Renderscript
Lines of Code for implementing different image filters
Results: Performance
Speedup GPU
Code variants show that the effect of using constant memory is almost negligible (≈5%) on embedded GPUs
Figure: 5x5 Gaussian Blur on an ARM Mali-T604
Execution Time HSA (GPU with OpenCL)
Summary
Contributions: We showed
• what kind of optimizations are useful on eGPGPUs
• that using DSLs for embedded devices is reasonable, with high productivity in describing image filters
• implicit use of unified CPU/GPU memory
HIPAcc Framework Features
• ROI definition
• boundary handling modes
• interpolation modes
• image pyramids
• built-in architecture model
• automatic exploration
• target-specific optimizations
HIPAcc Compiler Features
• exploit full GPU memory hierarchy
• loop unrolling
• constant propagation
• multiple pixels per thread
• forced use of textures
• vectorization (point operators)
• unified CPU/GPU memory support
Questions?
HIPAcc framework sources released under Simplified BSD License.
http://hipacc-lang.org
University Booth Demonstration: Wednesday, 12 p.m. & 4 p.m.
Results: Performance
Speedup CPU