2 © 2020 Arm Limited (or its affiliates)
Background on geophysical stencils
• Coupling of HPC and Numerical methods approaches : Mercerat et al. (2009), Breuer et al. (2016).
• Performance characterization and projection : C.Andreolli et al. (2011), R.Cruz & M.Araya-Polo (2011).
• Impact of the Absorbing Boundary Conditions and integration in complex applications : M.Christen et al.(2011)
Sync/Comms-avoiding strategies
• Blocking (mainly spatial) but traction for temporal algorithms
• Runtime systems (e.g. task-based)
Refs✓D.Orozco & G.Guao (2009), K.Datta et al. (2009),
Malas et al. (2015).✓ V.Martinez et al. (2015), L.Boilot et al.(2016).
Leveraging hardware features
• Heterogeneous architectures.• SIMD.• Mixed-precision (converged
architectures).
Refs✓ P.Micikevicious (2009), R.Abdelkhalak et al.
(2012), I.Said et al. (2018).✓G.Fabiel-Ouellet (2018).
Productivity
• High-level, DSL-like approaches(e.g. Devito, Patus, Yask)
• Directive-based, source-to-source• Auto-tuning, machine learning
Refs✓ Christen et al. (2012), C.Yount et al.(2016),
M.Louboutin et al.(2016), ✓ B.Videau et al.(2018).
3 © 2020 Arm Limited (or its affiliates)
Marvell(Cavium)
Ampere(X-Gene)
Fujitsu
Huawei(HiSilicon)
Amazon(Annapurna)
EPI / SiPearl
Other
AltraeMag
1616
PHYTIUM
RHEA
Arm is ubiquitous
4 © 2020 Arm Limited (or its affiliates)
PCleController
TofuInterface
C
C
C
C
NOC
HB
M2
HB
M2
HB
M2
HB
M2
CMG CMG
CMG CMG
CMG:Core Memory Group NOC:Network on Chip
Arm is ubiquitous
• Four NUMA Regions
• SVE-capable (512-bits)
• Various interfaces (more hierarchy)
• Diversity of accelerators
5 © 2020 Arm Limited (or its affiliates)Arithmetic Intensity (Flop/Byte)
4096
1024
256
64
16
40.25 1 644 16 256 1024
Theoretical roofline
Data from publicly available performance information.
6 © 2020 Arm Limited (or its affiliates)
• There is no preferred vector length• The vector length (VL) is a hardware choice, 128-2048b, in increments of 128b• A Vector Length Agnostic (VLA) programming adjusts dynamically to the available VL
• SVE addresses traditional barriers to auto-vectorization • Software-managed speculative vectorization of uncounted loops• Extract more data-level parallelism (DLP) from existing C/C++/Fortran source code
• SVE is a new approach to vectorization, not an iteration on existing ISAs (e.g. NEON)• SVE is a separate, optional extension with a new set of instruction encodings• Initial focus is HPC and general-purpose server, not media/image processing
What makes it a Scalable Vector Extension?
7 © 2020 Arm Limited (or its affiliates)
Dynamic Binary instrumentation
DynamoRIO
Armv8-A + SVE Binary
ArmIE(emulation client)
Emulation API
SVE Memtrace Client
SVE Inscount Client
SVE custom clients
Opcodes Client
• Diversity of languages, frameworks, dependencies
• Region of Interest feature for full-fledged code
Arm Research : Asvie: A Timing-Agnostic SVE Optimization Methodology (R.Rusitoru et al., IEEE SC19)
8 © 2020 Arm Limited (or its affiliates)
Walk through – source code modification
• Region of Interest feature to capture hotspots : _START_TRACE() / _STOP_TRACE()
• Standard pragmas, keywords to enhance vectorization (e.g. from LLVM)
9 © 2020 Arm Limited (or its affiliates)
Walk through - vectorization
• Add “+sve” to generate SVE instructions
• Use “–Rpass” or “opt-report” flag for compiler insights
10 © 2020 Arm Limited (or its affiliates)
Walk through - scripts
• Same binary, varying vector length
• Various post-processing scripts are included : “merge, analyze, flops-bytes …”
11 © 2020 Arm Limited (or its affiliates)
Walk through - metrics
SVE Mem operations
SVE instr. breakdown
Native instr. breakdown
12 © 2020 Arm Limited (or its affiliates)
NEON Vectorization study
• NEON (128-bits)
• LLVM and GNU toolchain.
• Speedup up to x3.7
Acoustic stencil - 8-th order (ISO)
13 © 2020 Arm Limited (or its affiliates)
Dynamic instructions breakdown
• Same binary for all SVE runs
• Impact of Amdhal law for large vector length
Implementation-dependent
Acoustic stencil - 8-th order (ISO)
52 %
85 %
36 %
67 %
78 %
14 © 2020 Arm Limited (or its affiliates)
Acoustic stencil - 8-th order (ISO)
SVE instructions breakdown
• Decrease of instruction count as we increase the vector length
• Ratios (Floating Point/ Memory or Loads/Stores instructions).
Stencil-based kernels characterization
15 © 2020 Arm Limited (or its affiliates)
Acoustic stencil - 8-th order (ISO)
Scalar instructions breakdown
• Instructions count reduction with larger SVE vector.
Fewer loop iteration (e.g. conditional branch – bcond)
16 © 2020 Arm Limited (or its affiliates)
SVE Vector lane utilization
• Number of bytes transferred
• Very useful for application characterization (SIMD lanes, scatter/gather operations ….)
Acoustic stencil - varying order (ISO)
© 2020 Arm Limited (or its affiliates)
Thank YouDankeMerci谢谢
ありがとうGracias
Kiitos감사합니다
धन्यवाद
شكرًاধন্যবাদתודה
The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in
the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.
www.arm.com/company/policies/trademarks
© 2020 Arm Limited (or its affiliates)
Top Related