EECS 583 – Class 22 Research Topic 4: Automatic SIMDization - Superword Level Parallelism
Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None...
Transcript of Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None...
![Page 1: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/1.jpg)
Graal Compiler Optimizations OnAArch64Xiaohong GongSep 2019
![Page 2: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/2.jpg)
Outline❖ Graal Overview
❖ Graal Optimization status
❖ What optimizations have we done for AArch64
❖ The future work
![Page 3: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/3.jpg)
Graal Overview
![Page 4: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/4.jpg)
What is Graal?
▪ A compiler written in Java, that compiles Java bytecode to machine codes (e.g. X86, AArch64, Sparc).
It could be used in several ways:
Usage Mode
Replacement for C2 compiler in Hotspot JIT (Just-In-Time)
Hotspot AOT AOT (Ahead-Of-Time)
Substrate VM AOT
Note: Hotspot is the Virtual Machine (VM) in OpenJDK, and OpenJDK is the official Java implementation.
![Page 5: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/5.jpg)
Note: Hotspot is the Virtual Machine (VM) in OpenJDK, and OpenJDK is the official Java implementation.
![Page 6: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/6.jpg)
Compilation Workflow
IR IR LIR
Bytecode
ASM
![Page 7: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/7.jpg)
Key Features
Designed for speculative optimizations and deoptimizationWritten in Java:o High development efficiency
o Safety
o Maintainable and extendable
o Good IDE support for debuggers
Designed for exact garbage collection (oop maps)
Aggressive high-level optimizations (like PEA)
Modular architecture (Compiler-VM separation)
Compiling and optimizing itself is also a good optimization opportunity
![Page 8: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/8.jpg)
Graal Optimization Status on AArch64
![Page 9: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/9.jpg)
Graal vs C2Mid-End Optimizations Graal C2
Inline, GVN, Constant propagation, Profile-
guided Optimization(PGO), Canonicalization,
Dead Code Elimination… (about 50 more)
Similar Similar
Escape Analysis Partial Escape Analysis Non-partial EA
Lowering Snippet Non-snippet
Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization
De-virtualization Cache more virtual function calls Cache fewer virtual function calls
Back-End Optimizations on AArch64 Graal C2
Instruction Selection Naïve instruction selection,
no match rules on AArch64 until
we added some
BURS(Bottom Up Rewrite System), match rules on
AArch64
Intrinsics Missing a lot Most implemented
Instruction scheduling Less tuned scheduler Architecture specific scheduler
Register Allocation Linear Scan Graph coloring
![Page 10: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/10.jpg)
Back-End Optimizations on AArch64 Graal C2
Instruction Selection Naïve instruction selection,
no match rules on AArch64 until
we added some
BURS(Bottom Up Rewrite System), match rules on
AArch64
Intrinsics Missing a lot Most implemented
Instruction scheduling Less tuned scheduler Architecture specific scheduler
Register Allocation Linear Scan Graph coloring
Mid-End Optimizations Graal C2
Inline, GVN, Constant propagation, Profile-
guided Optimization(PGO), Canonicalization,
Dead Code Elimination… (about 50 more)
Similar Similar
Escape Analysis Partial Escape Analysis Non-partial EA
Lowering Snippet Non-snippet
Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization
De-virtualization Cache more virtual function calls Cache fewer virtual function calls
Graal vs C2
![Page 11: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/11.jpg)
Back-End Optimizations on AArch64 Graal C2
Instruction Selection Naïve instruction selection,
no match rules on AArch64 until
we added some
BURS(Bottom Up Rewrite System), match rules on
AArch64
Intrinsics Missing a lot Most implemented
Instruction scheduling Less tuned scheduler Architecture specific scheduler
Register Allocation Linear Scan Graph coloring
Mid-End Optimizations Graal C2
Inline, GVN, Constant propagation, Profile-
guided Optimization(PGO), Canonicalization,
Dead Code Elimination… (about 50 more)
Similar Similar
Escape Analysis Partial Escape Analysis Non-partial EA
Lowering Snippet Non-snippet
Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization
De-virtualization Cache more virtual function calls Cache fewer virtual function calls
Graal vs C2
![Page 12: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/12.jpg)
Back-End Optimizations on AArch64 Graal C2
Instruction Selection Naïve instruction selection,
no match rules on AArch64 until
we added some
BURS(Bottom Up Rewrite System), match rules on
AArch64
Intrinsics Missing a lot Most implemented
Instruction scheduling Less tuned scheduler Architecture specific scheduler
Register Allocation Linear Scan Graph coloring
Mid-End Optimizations Graal C2
Inline, GVN, Constant propagation, Profile-
guided Optimization(PGO), Canonicalization,
Dead Code Elimination… (about 50 more)
Similar Similar
Escape Analysis Partial Escape Analysis Non-partial EA
Lowering Snippet Non-snippet
Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization
De-virtualization Cache more virtual function calls Cache fewer virtual function calls
Graal vs C2
![Page 13: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/13.jpg)
Performance on AArch64
0
5000
10000
15000
20000
25000
30000
35000
max-jOPS critical-jOPS
SPECjbb2015 (Higher is better)
C2 Graal
• SpecJbb2015 is a big benchmark developed for Java. It simulates a world-wide supermarket
company’s IT infrastructure.
Testing on AArch64 Cortex-A72 based platform with jdk11.
![Page 14: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/14.jpg)
Conclusion
❑ Compared to C2, Graal is quite new. And even without much backend optimizations
for AArch64 platform, it still provides similar performance to C2 for some cases.
❑ Graal is still missing some optimizations and new features of OpenJDK, and its
performance for some cases is worse than C2. So we need to catch up with C2.
❑ With more optimizations added for AArch64, Graal’s future performance will be
compelling.
![Page 15: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/15.jpg)
What have we done for optimizationson AArch64 ?
![Page 16: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/16.jpg)
Optimizations we have done for AArch64
Back-End on AArch64:o Instruction selections
Add match rules to delete or use light instructions like madd,
ubfx/ubfm, mneg, tbz, etc.
Address reshaping and memory bulk zeroing for initialization.
……
o Intrinsics
Integer.bitCount/Long.bitCount
o Others
Memory barrier
Mid-End:o Canonicalization
Integer div/rem optimization when the divisor is a constant.
If branch elimination.
![Page 17: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/17.jpg)
If Elimination
…
After optimization
Use conditional select(csel) instead of comparation and branch.
code gen
![Page 18: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/18.jpg)
-- Java array in memory --
Header a[0] a[1] a[2] a[3] a[4] a[5]
0 16 20 24 28 32 36 40
OopHeader LenCode Gen
array base data offset
array index
After optimization
array base array index
derived reference: base + index * 4
Address Reshaping
![Page 19: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/19.jpg)
After optimization
Graal implementation
Intrinsic - bitCount
![Page 20: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/20.jpg)
Performance – with our optimizations (1)
Before optimizations
After optimizations0
0.2
0.4
0.6
0.8
1
1.2
Jmh-algorithm0
0.2
0.4
0.6
0.8
1
1.2
Jmh-benchmarksgame
7% ~ 130% improvement
Lower is better
![Page 21: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/21.jpg)
0
100
200
300
400
500
600
700
800
900
1000
Jmh.vector
Before optimizations After optimizations
0.6% ~ 51% improvement
Lower is better
Performance – with our optimizations (2)
![Page 22: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/22.jpg)
0
200
400
600
800
1000
1200
Jmh.intrinsics
Before optimizations After optimizations
7% ~ 326% improvement
Lower is better
Performance – with our optimizations (3)
![Page 23: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/23.jpg)
lusearch xalan pmd tradebeans jython avrora h2 fop lusearch-fix luindex sunflow
Before optimizations 148 573 2037 11487 6762 6824 8857 514 145 1185 812
After optimizations 148 582 2008 11063 6508 6729 8260 491 135 1116 756
0
2000
4000
6000
8000
10000
12000
14000
msec
DaCapo 9.12 (lower is better)
Before optimizations After optimizations
Performance – with our optimizations (4)
![Page 24: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/24.jpg)
The future work
![Page 25: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/25.jpg)
Future work
o Loop related and other small optimizations
o ARM architecture new features(SVE, etc.)
o Instruction Selection (Missing match rules for AArch64 https://github.com/oracle/graal/issues/323
o Intrinsics (String, Unsafe, Math, etc.)
o Register Allocation & safepoint related optimizations
![Page 26: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/26.jpg)
Thank youJoin Linaro to accelerate deployment of your Arm-based solutions through collaboration
![Page 27: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/27.jpg)
Why use Graal?
Compared to C2 and LLVM compiler, Graal is:
o Implemented in java.
o Modular, VM independent.
o Support other languages, not only for java.
o Support both dynamic and static compilation.
Graal’s key optimizations that make its performance similar with C2:
o PEA
o Inlining
![Page 28: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/28.jpg)
Partial Escape Analysis (PEA)• EA is a method for determining the dynamic scope of an object, to decide whether to allocate it on the Java heap.
• PEA is a control flow sensitive EA, aimed at finding more escape analysis opportunities.
A. An example C2 can not handle
allocate on heap
escape
B. After analysis in Graal
PEA
scalar replacement
allocate on heap
![Page 29: Graal Compiler Optimizations On AArch64 · Lowering Snippet Non-snippet Auto-vectorization None Superword-Level Parallelism (SLP) auto-vectorization De-virtualization Cache more virtual](https://reader035.fdocuments.us/reader035/viewer/2022081514/5fb9bed012950111f928b40b/html5/thumbnails/29.jpg)
After optimization
-- Java array in memory --
Header a[0] a[1] a[2] a[3] a[4] a[5]
0 16 20 24 28 32 36 40
OopHeader LenCode Gen
array base
OopMap(stackmap, metadata) describes:
• Oop – a GC root
• Derived_oop – base oop is described
• …
data offset
array indexarray base array index
derived reference: base + index * 4
safepoint is allowed
What we have done – address reshaping