LLVM Performance Improvements and Headroom

Gerolf Hoflehner

LLVM Developers’ Meeting 2015 San Jose, CA

Messages

• Tuning and focused local optimizations

• Advancing optimization technology

• Getting inspired by ‘heroic’ optimizations

• Exposing performance opportunities to developers

Benchmarks

1) SPEC is a registered trademark of the Standard Performance Evaluation Corporation http://spec.org

• SPEC® CINT2006 1)

• Kernels

• LLVM Tests (--benchmarking-only)

• SPEC® CFP2006 (7 C/C++ benchmarks)

• Clang-600 (~LLVM 3.5) vs Clang-700 (~LLVM 3.7)

• -O3 -FLTO -PGO

• -O3 -FLTO

• ARM64

Some Performance Gains

SPEC CINT2006: +6.5%

Kernels: up to 70%

Acknowledgements

• Adam Nemet, Arnold Schwaighofer, Chad Rosier, Chandler Carruth, James Molloy, Michael Zolotukhin, Tyler Nowicki, Yi Jiang and many other contributors of the LLVM community

SPEC CINT2006

18.35%

Some Reasons For Gains

• Condition folding: ((c) >= 'A' && (c) <= 'Z') || ((c) >= 'a' && (c) <= 'z')

• cmp + br -> tbnz

• Unrolling of loops with conditional stores

• Register pressure aware loop rotation

Local (narrow) optimizations
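As an illustration of the condition-folding item, the sketch below shows one well-known way the two range tests can collapse into a single unsigned compare. It is not taken from the talk and does not claim to be the exact pattern LLVM emits; the function names are hypothetical and ASCII encoding is assumed.

```c
#include <assert.h>
#include <stdbool.h>

/* The range test as written on the slide. */
static bool is_alpha_ranges(int c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* Folded form (assumes ASCII): setting bit 0x20 maps 'A'..'Z' onto
   'a'..'z', and one unsigned compare replaces both bounds checks. */
static bool is_alpha_folded(int c) {
    return ((unsigned)(c | 0x20) - 'a') < 26u;
}

/* Hypothetical helper: true if both forms agree on every value in [lo, hi]. */
static bool forms_agree(int lo, int hi) {
    assert(lo <= hi);
    for (int c = lo; c <= hi; c++) {
        if (is_alpha_ranges(c) != is_alpha_folded(c)) return false;
    }
    return true;
}
```

The folded form needs one OR, one subtract, and one compare, where the original needs up to four compares and branches.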

Hot Loop: 456.hmmer

for (k = 1; k <= M; k++) {
  mc[k] = mpp[k - 1] + t0[k - 1];
  …;
  d[k] = d[k - 1] + t1[k - 1];
  if ((sc = mc[k - 1] + t2[k - 1]) > d[k]) d[k] = sc;
  …;
}

Hot Loop: 456.hmmer, Loop Distribution

for (k = 1; k <= M; k++) {
  mc[k] = mpp[k - 1] + t0[k - 1];
  …;
}

for (k = 1; k <= M; k++) { // split
  d[k] = d[k - 1] + t1[k - 1];
  if ((sc = mc[k - 1] + t2[k - 1]) > d[k]) d[k] = sc;
  …;
}

- Improves cache efficiency
- Partial vectorization

Hot Loop: 456.hmmer, Loop Distribution + Store To Load Forwarding

for (k = 1; k <= M; k++) {
  mc[k] = mpp[k - 1] + t0[k - 1];
  …;
}
// Tk and Tk-1 are scalar temporaries (T with subscripts k and k-1)
// that carry d[k] and d[k-1] in registers.
Tk-1 = d[0];
for (k = 1; k <= M; k++) { // split
  // d[k] = d[k-1] + t1[k-1]
  d[k] = Tk = Tk-1 + t1[k - 1];
  if ((sc = mc[k - 1] + t2[k - 1]) > Tk) Tk = d[k] = sc;
  …
}

- Critical Path Shortening
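The distribution and forwarding steps can be checked on a cut-down version of the loop. The sketch below keeps only the statements visible on the slides (the elided ones are omitted), assumes the arrays do not overlap, and uses made-up inputs; the function names and the size M = 8 are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum { M = 8 };  /* hypothetical problem size for the self-check */

/* Single loop as on the slide, elided statements left out. */
static void hot_loop_single(int *mc, int *d, const int *mpp,
                            const int *t0, const int *t1, const int *t2) {
    for (int k = 1; k <= M; k++) {
        mc[k] = mpp[k - 1] + t0[k - 1];
        d[k] = d[k - 1] + t1[k - 1];
        int sc = mc[k - 1] + t2[k - 1];
        if (sc > d[k]) d[k] = sc;
    }
}

/* Distributed version: the first loop has no loop-carried dependence;
   the second keeps d[k-1] in a scalar instead of reloading it. */
static void hot_loop_split(int *mc, int *d, const int *mpp,
                           const int *t0, const int *t1, const int *t2) {
    for (int k = 1; k <= M; k++) {
        mc[k] = mpp[k - 1] + t0[k - 1];
    }
    int prev = d[0];
    for (int k = 1; k <= M; k++) {
        int cur = prev + t1[k - 1];
        int sc = mc[k - 1] + t2[k - 1];
        if (sc > cur) cur = sc;
        d[k] = cur;
        prev = cur;
    }
}

/* Runs both versions on the same made-up inputs and compares results. */
static void check_hot_loop_versions(void) {
    int mpp[M + 1], t0[M + 1], t1[M + 1], t2[M + 1];
    int mc_a[M + 1] = {0}, d_a[M + 1] = {0};
    int mc_b[M + 1] = {0}, d_b[M + 1] = {0};
    for (int k = 0; k <= M; k++) {
        mpp[k] = 3 * k - 7;
        t0[k] = k % 4;
        t1[k] = -k;
        t2[k] = 5 - 2 * k;
    }
    hot_loop_single(mc_a, d_a, mpp, t0, t1, t2);
    hot_loop_split(mc_b, d_b, mpp, t0, t1, t2);
    assert(memcmp(mc_a, mc_b, sizeof mc_a) == 0);
    assert(memcmp(d_a, d_b, sizeof d_a) == 0);
}
```

Splitting is legal here because iteration k only reads mc[k - 1], which the first loop has already produced by the time the second loop runs.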

Reflection

• Many local narrowly focused optimizations

• Loop Distribution advances capabilities of Loop Transformation Framework

Kernel Performance

[Chart: performance change per kernel for SHA, Sobel Filter, Compress, BlackScholes, and Convolution; data labels: -17.5%, -6%, -8%, -10%]

“Convolution”: Estimate Loop Unrolling Benefits

No Unrolling

static const int k[] = { 0, 1, 5 };

for (v = 0; v < size; v++) {
  r += src[v] * k[v];
}

With Unrolling

static const int k[] = { 0, 1, 5 };
…
r += src[0] * k[0];
r += src[1] * k[1];
r += src[2] * k[2];
…

With Unrolling (k[] values substituted)

static const int k[] = { 0, 1, 5 };
…
r += src[0] * 0;
r += src[1] * 1;
r += src[2] * 5;
…

With Unrolling (after folding)

static const int k[] = { 0, 1, 5 };
…
r += 0;           // saves mul + add
r += src[1];      // saves mul
r += src[2] * 5;
…
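The benefit estimate above can be reproduced by hand. The sketch below assumes the trip count is the compile-time constant 3 (the slide writes size) and compares the rolled loop with the unrolled, constant-folded form; the function names are hypothetical.

```c
#include <assert.h>

static const int k[] = { 0, 1, 5 };
enum { KSIZE = 3 };  /* assumption: trip count known at compile time */

/* Rolled loop: three multiplies and three adds. */
static int dot_rolled(const int *src) {
    int r = 0;
    for (int v = 0; v < KSIZE; v++) {
        r += src[v] * k[v];
    }
    return r;
}

/* After full unrolling and folding of the constant k[] values. */
static int dot_unrolled(const int *src) {
    int r = 0;
    /* src[0] * 0 contributes nothing: mul and add both disappear. */
    r += src[1];      /* src[1] * 1: the mul disappears */
    r += src[2] * 5;  /* unchanged */
    return r;
}

/* Compares both forms on a made-up input. */
static void check_unrolling(void) {
    const int src[KSIZE] = { 7, 11, 13 };
    assert(dot_rolled(src) == dot_unrolled(src));
}
```

Only unrolling exposes these savings, because the loop body has to be specialized per iteration before k[v] becomes a known constant.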

Kernel Regressions

• SHA (-10%): aggressive load hoisting -> Tune Scheduler

• Sobel Filter (-8%): LSR expression normalization -> Avoid GEP in base address calculation

• Compress (-6%): no CCMP optimization due to tbnz -> Generalize CCMP

Performance Changes In LLVM Tests

[Chart: % gain per test, axis from -30% to 70%, showing both gains and losses]

Headroom

CINT2006 Headroom

[Chart: CINT2006 headroom on a 0% to 40% axis, split into Tuning, Advanced, Heroic, and Unknown; data labels: 5%, 15%, 10%, 2%]

• Tuning? +2%

• Advanced: AoS->SoA, libquantum (2X)

• Heroic: Inlining + GlobalOpts + Loop Fusion, libquantum (“10X”)

CFP2006 Headroom

[Chart: CFP2006 headroom on a 0% to 40% axis, split into Tuning, Advanced, Heroic, and Unknown; data labels: 5%, 25%, 2%]

• Tuning +2%

• AoS->SoA, milc (3X)

• Value Profiling, povray (10%)

Array of Structs (AoS) to Struct of Arrays (SoA)

Libquantum: AoS->SoA

struct quantum_reg_node {
  COMPLEX_FLOAT amplitude;
  int state;
};

struct quantum_reg {
  int width;
  int size;
  int hashw;
  quantum_reg_node *node;
  int *hash;
};

hot_code(…, quantum_reg *reg) {
  int i;
  …
  for (i = 0; i < reg->size; i++) {
    if (reg->node[i].state & C) {
      reg->node[i].state ^= T;
    }
  }
  …
}

Hot loop uses only some fields of a structure

struct quantum_reg_node { COMPLEX_FLOAT amplitude; int state; };

for (i = 0; i < reg->size; i++) { if (reg->N[i].state & C) { reg->N[i].state ^= T; } }

[Diagram: memory layout of quantum_reg_node N[] at byte offsets 0, 4, 8, ..., 36, showing N[0], N[1], and N[2]. Cache (fictitious cache line size of 64 bits, with boundaries marked at 64b, 128b, 192b): the lines holding N[0].state, N[1].state, and N[2].state, covering offsets 8, 12, 16, 20, 32, 36.]

COMPLEX_FLOAT *amplitude; int *state;

[Diagram: the array of structs split into two arrays, COMPLEX_FLOAT amplitude[] and MAX_UNSIGNED state[], each drawn with its own byte offsets.]

for (i = 0; i < reg->size; i++) { if (reg->state[i] & C) { reg->state[i] ^= T; } }

[Diagram: Cache (after AoS->SoA), with line boundaries at 0, 64, 128, 192: consecutive state[] entries at offsets 4, 8, 12, 16, 20.]

AoS to SoA speeds up benchmarks and applications that are memory-bandwidth bound and/or cache bound
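To make the two layouts concrete, the sketch below writes the hot loop against both. It is a hand-written illustration, not compiler output: the type names node_aos, reg_aos, and reg_soa are hypothetical, COMPLEX_FLOAT is assumed to be float _Complex, and state is an int as on the earlier slides.

```c
#include <assert.h>
#include <stdlib.h>

typedef float _Complex COMPLEX_FLOAT;  /* assumption: stand-in for libquantum's type */

/* AoS: each element carries an amplitude the hot loop never reads. */
struct node_aos { COMPLEX_FLOAT amplitude; int state; };
struct reg_aos { int size; struct node_aos *node; };

/* SoA: the state values are contiguous, so each cache line is all useful data. */
struct reg_soa { int size; COMPLEX_FLOAT *amplitude; int *state; };

static void toggle_aos(struct reg_aos *reg, int C, int T) {
    for (int i = 0; i < reg->size; i++) {
        if (reg->node[i].state & C) reg->node[i].state ^= T;
    }
}

static void toggle_soa(struct reg_soa *reg, int C, int T) {
    for (int i = 0; i < reg->size; i++) {
        if (reg->state[i] & C) reg->state[i] ^= T;
    }
}

/* Runs the loop on both layouts with identical made-up data. */
static void check_layouts_agree(void) {
    enum { N = 16 };
    struct reg_aos a = { N, calloc(N, sizeof(struct node_aos)) };
    struct reg_soa s = { N, calloc(N, sizeof(COMPLEX_FLOAT)),
                         calloc(N, sizeof(int)) };
    assert(a.node && s.amplitude && s.state);
    for (int i = 0; i < N; i++) {
        a.node[i].state = i;
        s.state[i] = i;
    }
    toggle_aos(&a, 2, 4);
    toggle_soa(&s, 2, 4);
    for (int i = 0; i < N; i++) {
        assert(a.node[i].state == s.state[i]);
    }
    free(a.node);
    free(s.amplitude);
    free(s.state);
}
```

With these field types each AoS element typically spans 12 bytes, of which the loop uses 4; in the SoA form every byte fetched into the cache belongs to a state value.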

AoS->SoA: Challenges

• Legality
  • Casts to/from struct type, escaped types, address taken of individual fields, parameter and return values, semantics of constants

• Transformations
  • Data accesses, memory allocation

• Usability
  • Debugging

AoS->SoA: Changing Structure Definition and Accesses

Before:

struct quantum_reg_node { COMPLEX_FLOAT amplitude; MAX_UNSIGNED state; };

struct quantum_reg { int width; int size; int hashw; quantum_reg_node *node; int *hash; };

After:

struct quantum_reg { int width; int size; int hashw; COMPLEX_FLOAT *amplitude; int *state; int *hash; };

Accesses change from reg->N[i].state to reg->state[i]

Constants

if (!reg.state) reg.node = calloc(reg.size, sizeof(quantum_reg_node));

%call10 = call i8* @calloc(i64 %conv, i64 16);

Parameter Passing

j = quantum_get_state(reg1->node);                     // before

j = quantum_get_state(reg1->amplitude, reg1->state);   // after AoS->SoA

References

• Prashantha NR et al., Implementing Data Layout Optimizations in LLVM Framework, LLVM Developers' Meeting, 2014

• G. Chakrabarti, F. Chow, Structure Layout Optimizations in the Open64 Compiler: Design, Implementation and Measurements, Open64 Workshop at CGO, 2008

• O. Golovanevsky et al., https://www.research.ibm.com/haifa/Workshops/compiler2007/present/data-layout-optimizations-in-gcc.pdf, GCC Workshop, 2007

• R. Hundt et al., Practical Structure Layout Optimization and Advice, CGO, 2006

More Headroom: ‘Heroics’


Libquantum: Heroics

test_sum() {
  ...;
  cnot(2 * width - 1, width - 1, reg);
  sigma_x(2 * width - 1, reg);
  ...;
}

• 2 similar functions

• Hot loops

Whole program visibility, Alias analysis, GlobalModRef, …, Cost Model

Inline + Interprocedural Loop Fusion

void cnot(int C, int T, q_reg *reg) {
  int i; int qec;
  status(&qec, NULL);
  if (qec) B1;
  else {
    …
    if (foo(x)) return;
    for (i = 0; i < reg->size; i++) {
      if (reg->state[i] & C) reg->state[i] ^= T;
    }
    decohere(reg);
  }
}

void sigma(int T, q_reg *reg) {
  int i; int qec;
  status(&qec, NULL);
  if (qec) B2;
  else {
    …
    if (foo(y)) return;
    for (i = 0; i < reg->size; i++) {
      reg->state[i] ^= T;
    }
    decohere(reg);
  }
}


Global Value Prop

void decohere(q_reg) { if (status) { do_something; } return; }

becomes

void decohere(q_reg) { if (0) { do_something; } return; }

Inlining

status(&qec, NULL); if (qec) B1;

status(&qec, NULL); if (qec) B2;

After inlining, the qec test is shown as if (globalVar):

void cnot(int C, int T, q_reg *reg) { int i; int qec; status(&qec, NULL); if (globalVar) B1; else { …

void sigma(int T, q_reg *reg) { int i; int qec; status(&qec, NULL); if (globalVar) B2; else { …

GlobalModRef

if (globalVar) { B1; B2; } else {
  …
  if (foo(x)) return;
  for (i = 0; i < reg->size; i++) {
    if (reg->state[i] & C) reg->state[i] ^= T;
  }
  …
  if (foo(y)) return;
  for (i = 0; i < reg->size; i++) {
    reg->state[i] ^= T;
  }

bool V1 = foo(x); bool V2 = foo(y);
for (i = 0; i < reg->size; i++) {
  if (reg->state[i] & C) reg->state[i] ^= T;
}
if (V1 || V2) return;
for (i = 0; i < reg->size; i++) {
  reg->state[i] ^= T;
}

Fused Loops

bool V1 = foo(x); bool V2 = foo(y);
for (i = 0; i < reg->size; i++) {
  if (reg->state[i] & C) reg->state[i] ^= T;
  reg->state[i] ^= T;
}
if (V1 || V2) return;
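The final fusion step can be checked in isolation. The sketch below covers only the two loops, not the inlining or the hoisting of foo(); the function names are hypothetical, and it uses two separate masks (T1 for the cnot loop, T2 for the sigma loop) where the slides write T for both.

```c
#include <assert.h>
#include <string.h>

/* Two separate passes over state[], as left behind after inlining. */
static void toggle_two_loops(unsigned *state, int size,
                             unsigned C, unsigned T1, unsigned T2) {
    for (int i = 0; i < size; i++) {
        if (state[i] & C) state[i] ^= T1;
    }
    for (int i = 0; i < size; i++) {
        state[i] ^= T2;
    }
}

/* Fused: one pass. Legal because iteration i only touches state[i]. */
static void toggle_fused(unsigned *state, int size,
                         unsigned C, unsigned T1, unsigned T2) {
    for (int i = 0; i < size; i++) {
        if (state[i] & C) state[i] ^= T1;
        state[i] ^= T2;
    }
}

/* Compares both versions on made-up data. */
static void check_fusion(void) {
    unsigned a[8], b[8];
    for (int i = 0; i < 8; i++) {
        a[i] = (unsigned)(i * 37 + 5);
    }
    memcpy(b, a, sizeof a);
    toggle_two_loops(a, 8, 4u, 16u, 1u);
    toggle_fused(b, 8, 4u, 16u, 1u);
    assert(memcmp(a, b, sizeof a) == 0);
}
```

Fusion halves the traffic over state[], which matters for a loop that is bound by memory rather than by arithmetic.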

What can we learn?

• Optimization Scope: Call Chain

• Concept: Function Similarity

• Challenge: Hoist statements across loops

• Techniques: Global Value Prop, Partial Inlining, GlobalModRef, Loop Fusion, …

Take Aways

• Techniques needed for ‘heroics’ can be generalized to advance optimization technology

• Cost Model?

How to expose performance opportunities to developers?

__builtin_nontemporal_store

void scaledCpy(float *__restrict__ a, float *__restrict__ b, float S, int N) {
  for (int i = 0; i < N; i++)
    b[i] = S * a[i];
}

With the builtin:

void scaledCpy(float *__restrict__ a, float *__restrict__ b, float S, int N) {
  for (int i = 0; i < N; i++)
    // b[i] = S * a[i];
    __builtin_nontemporal_store(S * a[i], &b[i]);
}
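Because __builtin_nontemporal_store is a Clang extension, code that must also build with other compilers may want a guard. The sketch below is one possible arrangement, not something shown in the talk: the NT_STORE macro and the check function are hypothetical, and the fallback is an ordinary store.

```c
#include <assert.h>

/* Use the builtin only where the compiler reports it. */
#if defined(__has_builtin)
#  if __has_builtin(__builtin_nontemporal_store)
#    define NT_STORE(val, ptr) __builtin_nontemporal_store((val), (ptr))
#  endif
#endif
#ifndef NT_STORE
#  define NT_STORE(val, ptr) (*(ptr) = (val))  /* plain store fallback */
#endif

static void scaledCpy(const float *restrict a, float *restrict b,
                      float S, int N) {
    for (int i = 0; i < N; i++) {
        NT_STORE(S * a[i], &b[i]);  /* b is written once and not reread here */
    }
}

/* The stored values are the same with or without the hint. */
static void check_scaled_copy(void) {
    const float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float b[4] = { 0 };
    scaledCpy(a, b, 0.5f, 4);
    assert(b[0] == 0.5f && b[1] == 1.0f && b[2] == 1.5f && b[3] == 2.0f);
}
```

The hint changes only how the store interacts with the cache, so results are identical either way; whether it helps depends on whether b is read again soon after.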

Vectorizer: Hints and Diagnostics

while (good) {
  for (i = 0; i < N; i++) {
    DW[i] = A[i - 3] + B[i - 2] + C[i - 1] + D[i];
    UW[i] = A[i] + B[i + 1] + C[i + 2] + D[i + 3];
  }
}

remark: loop not vectorized: … Avoid runtime pointer checking when you know the arrays will always be independent by specifying '#pragma clang loop vectorize(assume_safety)' before the loop or by specifying 'restrict' on the array arguments. Erroneous results will occur if these options are incorrectly applied!
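Following the remark's first suggestion, the pragma goes directly before the loop. The sketch below is an assumption-laden illustration: the function name, signature, and loop bounds (chosen so every index stays in range) are hypothetical, and the pragma is guarded so other compilers ignore it.

```c
#include <assert.h>

/* The caller promises that DW and UW do not overlap each other or the
   inputs; the pragma passes that promise on to the vectorizer. */
static void sum_windows(int n, float *DW, float *UW, const float *A,
                        const float *B, const float *C, const float *D) {
#if defined(__clang__)
#pragma clang loop vectorize(assume_safety)
#endif
    for (int i = 3; i < n - 3; i++) {
        DW[i] = A[i - 3] + B[i - 2] + C[i - 1] + D[i];
        UW[i] = A[i] + B[i + 1] + C[i + 2] + D[i + 3];
    }
}

/* Made-up data: every input is the ramp 0, 1, 2, ... */
static void check_sum_windows(void) {
    float ramp[16], DW[16] = { 0 }, UW[16] = { 0 };
    for (int i = 0; i < 16; i++) {
        ramp[i] = (float)i;
    }
    sum_windows(16, DW, UW, ramp, ramp, ramp, ramp);
    assert(DW[5] == 14.0f);  /* 2 + 3 + 4 + 5 */
    assert(UW[5] == 26.0f);  /* 5 + 6 + 7 + 8 */
}
```

As the remark warns, the pragma is a promise rather than a check: if the output arrays do alias the inputs, the vectorized loop can produce wrong results.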

Conclusions

The best days for LLVM performance are ahead of us

Questions?
