LLVM Performance Improvements and Headroom Gerolf Hoflehner Apple LLVM Developers’ Meeting 2015 San Jose, CA


Page 2

Messages

• Tuning and focused local optimizations

• Advancing optimization technology

• Getting inspired by ‘heroic’ optimizations

• Exposing performance opportunities to developers

Page 3

Benchmarks

• SPEC® CINT2006 1)

• Kernels

• LLVM Tests (--benchmarking-only)

• SPEC® CFP2006 (7 C/C++ benchmarks)

1) SPEC is a registered trademark of the Standard Performance Evaluation Corporation, http://spec.org

Page 4

Setup

• Clang-600 (~LLVM 3.5) vs Clang-700 (~LLVM 3.7)

• -O3 -flto with PGO

• -O3 -flto

• ARM64

Page 5

Some Performance Gains

SPEC CINT2006: +6.5%

Kernels: up to 70%

Page 6

Acknowledgements

• Adam Nemet, Arnold Schwaighofer, Chad Rosier, Chandler Carruth, James Molloy, Michael Zolotukhin, Tyler Nowicki, Yi Jiang and many other contributors from the LLVM community

Page 7

SPEC CINT2006

[Bar chart: gain per benchmark, y-axis from -1.6% to 25%]

400.perlbench: 3.2%
401.bzip2: 4.4%
403.gcc: 0.9%
429.mcf: 5.3%
445.gobmk: 1.3%
456.hmmer: 22%
458.sjeng: 6%
462.libquantum: 8%
464.h264ref: 2.5%
471.omnetpp: -1.6%
473.astar: 2.6%
483.xalancbmk: 19.3%
Geomean: 6.5%

Page 8

Some Reasons For Gains

Condition folding: ((c) >= 'A' && (c) <= 'Z') || ((c) >= 'a' && (c) <= 'z')

cmp + br -> tbnz

Unrolling of loops with conditional stores

Register pressure aware loop rotation

Local (narrow) optimizations
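
The condition-folding item can be illustrated in C. The sketch below assumes ASCII input and shows one well-known way to fold the two range checks into a single unsigned compare. The slide does not say which form the compiler actually emits, and the function names are made up for illustration.

```c
/* The check as written on the slide: two range tests, up to four compares. */
int is_alpha_ranges(int c) {
  return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* Hypothetical folded form (ASCII only): setting bit 0x20 maps 'A'..'Z'
   onto 'a'..'z', so one unsigned range test covers both ranges. */
int is_alpha_folded(int c) {
  return ((unsigned)(c | 0x20) - (unsigned)'a') < 26u;
}

/* Returns 1 if both versions agree on every value in [-300, 300]. */
int folding_agrees(void) {
  for (int c = -300; c <= 300; c++) {
    if (is_alpha_ranges(c) != is_alpha_folded(c)) return 0;
  }
  return 1;
}
```

The folded form trades branches for one OR, one subtract, and one compare, which is the kind of local saving this slide lists.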

Page 9

Hot Loop: 456.hmmer

for (k = 1; k <= M; k++) {
  mc[k] = mpp[k - 1] + t0[k - 1];
  …;
  d[k] = d[k - 1] + t1[k - 1];
  if ((sc = mc[k - 1] + t2[k - 1]) > d[k]) d[k] = sc;
  …;
}

Page 10

Hot Loop: 456.hmmer

Loop Distribution

for (k = 1; k <= M; k++) {
  mc[k] = mpp[k - 1] + t0[k - 1];
  …;
}

for (k = 1; k <= M; k++) { // split
  d[k] = d[k - 1] + t1[k - 1];
  if ((sc = mc[k - 1] + t2[k - 1]) > d[k]) d[k] = sc;
  …;
}

- Improves cache efficiency
- Partial vectorization

Page 11

Hot Loop: 456.hmmer

Loop Distribution + Store To Load Forwarding

for (k = 1; k <= M; k++) {
  mc[k] = mpp[k - 1] + t0[k - 1];
  …;
}

Tk-1 = d[0];
for (k = 1; k <= M; k++) { // split
  // d[k] = d[k-1] + t1[k-1]
  d[k] = Tk = Tk-1 + t1[k - 1];
  if ((sc = mc[k - 1] + t2[k - 1]) > Tk) Tk = d[k] = sc;
  …
}

- Critical Path Shortening
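
The two transformations on these slides can be checked against the original loop with a small C sketch. It is only an illustration under stated assumptions: the elided statements (…) are left out, the arrays are assumed not to alias, and the trip count `M`, the function names, and the test data are hypothetical.

```c
#define M 8 /* hypothetical trip count for the demo */

/* Original single loop from the slide (elided statements omitted). */
void hot_loop_original(int *mc, int *d, const int *mpp,
                       const int *t0, const int *t1, const int *t2) {
  for (int k = 1; k <= M; k++) {
    int sc;
    mc[k] = mpp[k - 1] + t0[k - 1];
    d[k] = d[k - 1] + t1[k - 1];
    if ((sc = mc[k - 1] + t2[k - 1]) > d[k]) d[k] = sc;
  }
}

/* Distributed version: the mc[] loop has no loop-carried dependence, and
   the d[] recurrence is carried in the scalar `prev` instead of being
   reloaded from d[k - 1] (store-to-load forwarding). */
void hot_loop_distributed(int *mc, int *d, const int *mpp,
                          const int *t0, const int *t1, const int *t2) {
  for (int k = 1; k <= M; k++)
    mc[k] = mpp[k - 1] + t0[k - 1];

  int prev = d[0];
  for (int k = 1; k <= M; k++) {
    int cur = prev + t1[k - 1];
    int sc = mc[k - 1] + t2[k - 1];
    if (sc > cur) cur = sc;
    d[k] = cur;
    prev = cur;
  }
}

/* Returns 1 if both versions produce identical mc[] and d[] arrays. */
int distribution_matches(void) {
  int mpp[M + 1], t0[M + 1], t1[M + 1], t2[M + 1];
  int mc_a[M + 1], mc_b[M + 1], d_a[M + 1], d_b[M + 1];
  for (int k = 0; k <= M; k++) {
    mpp[k] = 3 * k - 5;
    t0[k] = 7 - k;
    t1[k] = (k % 3) - 1;
    t2[k] = 2 * k - 9;
    mc_a[k] = mc_b[k] = 0;
    d_a[k] = d_b[k] = 0;
  }
  mc_a[0] = mc_b[0] = 4;
  d_a[0] = d_b[0] = -2;
  hot_loop_original(mc_a, d_a, mpp, t0, t1, t2);
  hot_loop_distributed(mc_b, d_b, mpp, t0, t1, t2);
  for (int k = 0; k <= M; k++) {
    if (mc_a[k] != mc_b[k] || d_a[k] != d_b[k]) return 0;
  }
  return 1;
}
```

Splitting is legal here because the second loop only reads `mc[k - 1]`, which the first loop has already written (or which is the untouched input `mc[0]`).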

Page 12

Reflection

• Many local narrowly focused optimizations

• Loop Distribution advances capabilities of Loop Transformation Framework

Page 13

Kernel Performance

[Bar chart: gain per kernel, y-axis from -17.5% to 70%]

SHA: -10%
Sobel Filter: -8%
Compress: -6%
BlackScholes: +53%
Convolution: +70%

Page 14

“Convolution”: Estimate Loop Unrolling Benefits

No Unrolling

static const int k[] = { 0, 1, 5 };

for (v = 0; v < size; v++) {
  r += src[v] * k[v];
}

Page 15

“Convolution”: Estimate Loop Unrolling Benefits

With Unrolling

static const int k[] = { 0, 1, 5 };
…
r += src[0] * k[0];
r += src[1] * k[1];
r += src[2] * k[2];
…

Page 17

“Convolution”: Estimate Loop Unrolling Benefits

With Unrolling

static const int k[] = { 0, 1, 5 };
…
r += src[0] * 0;
r += src[1] * 1;
r += src[2] * 5;
…

Page 18

“Convolution”: Estimate Loop Unrolling Benefits

With Unrolling

static const int k[] = { 0, 1, 5 };
…
r += 0;          // saves mul + add
r += src[1];     // saves mul
r += src[2] * 5;
…
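
The savings counted on this slide can be reproduced by hand. The following C sketch compares the rolled loop against the unrolled and constant-folded form; the function names and the fixed length of 3 are assumptions for illustration, not the compiler's cost model.

```c
static const int k[] = { 0, 1, 5 };

/* Rolled loop: one load of k[v], one multiply, and one add per iteration. */
int conv_loop(const int *src) {
  int r = 0;
  for (int v = 0; v < 3; v++) r += src[v] * k[v];
  return r;
}

/* Fully unrolled with the constants of k[] folded in. */
int conv_unrolled(const int *src) {
  int r = 0;
  /* src[0] * 0 folds to 0, so both the multiply and the add disappear. */
  r += src[1];     /* src[1] * 1: the multiply disappears */
  r += src[2] * 5; /* only this term still needs a multiply */
  return r;
}

/* Returns 1 if both versions agree on a few sample inputs. */
int conv_versions_agree(void) {
  const int samples[3][3] = { { 4, 7, 2 }, { -1, 0, 9 }, { 100, -20, -3 } };
  for (int s = 0; s < 3; s++) {
    if (conv_loop(samples[s]) != conv_unrolled(samples[s])) return 0;
  }
  return 1;
}
```

An unroller that can estimate these folds ahead of time can justify unrolling loops that a size-only heuristic would reject.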

Page 19

Kernel Regressions

SHA (-10%): Aggressive load hoisting
Sobel Filter (-8%): LSR expression normalization
Compress (-6%): No CCMP optimization due to tbnz

Tune Scheduler

Avoid GEP in base address calculation

Generalize CCMP

Page 20

Performance Changes In LLVM Tests

[Chart: % gain per test, y-axis from -30% to 70%, with separate series for gains and losses]

Page 21

Headroom

Page 22

CINT2006 Headroom

[Stacked bar chart: CINT2006, x-axis from 0% to 40%; legend: Tuning, Advanced, Heroic, Unknown; segment labels 2%, 10%, 15%, 5%]

Tuning? +2%
AoS->SoA, libquantum (2X): +10%
Inlining + GlobalOpts + Loop Fusion, libquantum (“10X”): +15%
? +5%

Page 23

CFP2006 Headroom

[Stacked bar chart: CFP2006, x-axis from 0% to 40%; legend: Tuning, Advanced, Heroic, Unknown; segment labels 2%, 25%, 5%]

Tuning: +2%
AoS->SoA, milc (3X); Value Profiling, povray (10%): +25%
? +5%

Page 24

Array of Structs (AoS) to Structs of Array (SoA)

Page 25

Libquantum: AoS->SoA

struct quantum_reg_node {
  COMPLEX_FLOAT amplitude;
  int state;
};

struct quantum_reg {
  int width;
  int size;
  int hashw;
  quantum_reg_node *node;
  int *hash;
};

hot_code(…, quantum_reg *reg) {
  int i;
  …
  for (i = 0; i < reg->size; i++) {
    if (reg->node[i].state & C) {
      reg->node[i].state ^= T;
    }
  }
  …
}

Hot loop uses only some fields of a structure

Page 26

struct quantum_reg_node {
  COMPLEX_FLOAT amplitude;
  int state;
};

for (i = 0; i < reg->size; i++) {
  if (reg->N[i].state & C) {
    reg->N[i].state ^= T;
  }
}

[Diagram: memory layout of quantum_reg_node N[], byte offsets 0 to 36 in steps of 4, with offsets 8 and 20 marked]

Page 27

[Diagram: quantum_reg_node N[] with elements N[0], N[1], N[2] at byte offsets 0 to 36; N[0].state, N[1].state, and N[2].state are mapped onto a cache with lines marked at 64b, 128b, and 192b. Slide note: fictitious cache line size!]

Page 29

for (i = 0; i < reg->size; i++) {
  if (reg->N[i].state & C) {
    reg->N[i].state ^= T;
  }
}

[Diagram: memory layout of quantum_reg_node N[], byte offsets 0 to 36 in steps of 4, with offsets 8 and 20 marked]

COMPLEX_FLOAT *amplitude; int *state;

Page 30

for (i = 0; i < reg->size; i++) {
  if (reg->N[i].state & C) {
    reg->N[i].state ^= T;
  }
}

COMPLEX_FLOAT *amplitude; int *state;

[Diagram: the array of structs is split into two arrays, COMPLEX_FLOAT amplitude[] and MAX_UNSIGNED state[], each with its own byte offsets starting at 0; state[] elements sit at offsets 0, 4, 8, …]

Page 31

COMPLEX_FLOAT *amplitude; int *state;

[Diagram: the two arrays COMPLEX_FLOAT amplitude[] and MAX_UNSIGNED state[], as on Page 30]

for (i = 0; i < reg->size; i++) {
  if (reg->state[i] & C) {
    reg->state[i] ^= T;
  }
}

Page 32

COMPLEX_FLOAT *amplitude; int *state;

[Diagram: the two arrays COMPLEX_FLOAT amplitude[] and MAX_UNSIGNED state[], as on Page 30]

[Diagram: Cache (after AoS->SoA), lines marked at 0, 64, 128, 192; state[] elements at offsets 0, 4, 8, 12, 16, 20 are packed contiguously]

Page 33

AoS to SoA speeds up benchmarks and applications that are memory-bandwidth bound and/or cache bound

Page 34

AoS->SoA: Challenges

• Legality

  • Casts to/from struct type, escaped types, address taken of individual fields, parameter and return values, semantics of constants

• Transformations

  • Data accesses, memory allocation

• Usability

  • Debugging

Page 35

AoS->SoA: Changing Structure Definition and Accesses

Before:

struct quantum_reg_node {
  COMPLEX_FLOAT amplitude;
  MAX_UNSIGNED state;
};

struct quantum_reg {
  int width;
  int size;
  int hashw;
  quantum_reg_node *node;
  int *hash;
};

After:

struct quantum_reg {
  int width;
  int size;
  int hashw;
  COMPLEX_FLOAT *amplitude;
  int *state;
  int *hash;
};

Accesses change from reg->N[i].state to reg->state[i]
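
The access change can be tried out in a small C sketch. Everything here is a simplified stand-in: `complex_float`, `node_aos`, `reg_aos`, `reg_soa`, and the helper functions are hypothetical names, not libquantum's real definitions, and the registers hold only the fields the hot loop needs.

```c
#include <stddef.h>

typedef struct { float re, im; } complex_float; /* stand-in for COMPLEX_FLOAT */

/* Array of structs: every state field sits next to an amplitude. */
typedef struct { complex_float amplitude; int state; } node_aos;
typedef struct { int size; node_aos *node; } reg_aos;

/* Struct of arrays: state fields are contiguous in memory. */
typedef struct { int size; complex_float *amplitude; int *state; } reg_soa;

void flip_aos(reg_aos *reg, int C, int T) {
  for (int i = 0; i < reg->size; i++)
    if (reg->node[i].state & C) reg->node[i].state ^= T;
}

void flip_soa(reg_soa *reg, int C, int T) {
  for (int i = 0; i < reg->size; i++)
    if (reg->state[i] & C) reg->state[i] ^= T;
}

/* Distance in bytes between consecutive state fields in each layout. */
size_t stride_aos(void) { return sizeof(node_aos); }
size_t stride_soa(void) { return sizeof(int); }

/* Returns 1 if the hot loop computes the same states in both layouts. */
int layouts_agree(void) {
  node_aos nodes[4];
  complex_float amp[4];
  int st[4];
  for (int i = 0; i < 4; i++) {
    nodes[i].amplitude.re = amp[i].re = (float)i;
    nodes[i].amplitude.im = amp[i].im = 0.0f;
    nodes[i].state = st[i] = i * 3;
  }
  reg_aos a = { 4, nodes };
  reg_soa s = { 4, amp, st };
  flip_aos(&a, 2, 4);
  flip_soa(&s, 2, 4);
  for (int i = 0; i < 4; i++)
    if (nodes[i].state != st[i]) return 0;
  return 1;
}
```

Because the SoA stride is smaller than the AoS stride, each cache line fetched by the hot loop carries more `state` values, which is the effect the cache diagrams illustrate.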

Page 36

Constants

if (!reg.state) reg.node = calloc(reg.size, sizeof(quantum_reg_node));

%call10 = call i8* @calloc(i64 %conv, i64 16);
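
The constant 16 in the IR is what remains of `sizeof(quantum_reg_node)`, so a transformation has to recognize it as an element size before it can split the allocation. A hedged C sketch of the source-level equivalent follows; the type names, `alloc_aos`, `alloc_soa`, and `free_soa` are hypothetical, and the 16-byte element size is an assumption that holds on typical 64-bit targets.

```c
#include <stdlib.h>

typedef struct { float re, im; } complex_float; /* stand-in for COMPLEX_FLOAT */
typedef unsigned long long max_unsigned;        /* stand-in for MAX_UNSIGNED */

typedef struct { complex_float amplitude; max_unsigned state; } node_aos;

typedef struct {
  size_t size;
  complex_float *amplitude;
  max_unsigned *state;
} reg_soa;

/* Before: one allocation whose element size is sizeof(node_aos). */
node_aos *alloc_aos(size_t n) {
  return calloc(n, sizeof(node_aos));
}

/* After: one allocation per field, each with its own element size.
   Returns 0 on success and -1 if either allocation fails. */
int alloc_soa(reg_soa *reg, size_t n) {
  reg->size = n;
  reg->amplitude = calloc(n, sizeof *reg->amplitude);
  reg->state = calloc(n, sizeof *reg->state);
  if (!reg->amplitude || !reg->state) {
    free(reg->amplitude);
    free(reg->state);
    reg->amplitude = NULL;
    reg->state = NULL;
    return -1;
  }
  return 0;
}

void free_soa(reg_soa *reg) {
  free(reg->amplitude);
  free(reg->state);
  reg->amplitude = NULL;
  reg->state = NULL;
}

/* Returns 1 if both allocation styles succeed and yield zeroed storage. */
int soa_alloc_works(void) {
  node_aos *nodes = alloc_aos(8);
  if (!nodes) return 0;
  int ok = (nodes[7].state == 0);
  free(nodes);

  reg_soa reg;
  if (alloc_soa(&reg, 8) != 0) return 0;
  ok = ok && (reg.size == 8) && (reg.state[7] == 0) &&
       (reg.amplitude[7].re == 0.0f);
  free_soa(&reg);
  return ok;
}
```

At the IR level the same split is harder, because nothing marks the literal 16 as a struct size rather than an unrelated number.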

Page 37

Parameter Passing

j = quantum_get_state(reg1->amplitude, reg1->state);

j = quantum_get_state(reg1->node)

Page 38

References

• Prashantha NR et al., Implementing Data Layout Optimizations in LLVM Framework, LLVM Developers’ Meeting, 2014

• G. Chakrabarti, F. Chow, Structure Layout Optimizations in the Open64 Compiler: Design, Implementation and Measurements, Open64 Workshop at CGO, 2008

• O. Golovanevsky et al., https://www.research.ibm.com/haifa/Workshops/compiler2007/present/data-layout-optimizations-in-gcc.pdf, GCC Workshop, 2007

• R. Hundt et al., Practical Structure Layout Optimization and Advice, CGO, 2006

Page 39

More Headroom: ‘Heroics’

💂💂 💂 💂

💂 💂 💂 💂 💂 💂

Page 40

Libquantum: Heroics

test_sum() {
  ...;
  cnot(2 * width - 1, width - 1, reg);
  sigma_x(2 * width - 1, reg);
  ...;
}

• 2 similar functions

• Hot loops

Inline + Fuse

Whole program visibility, Alias analysis, GlobalModRef, …, Cost Model

Page 41

Interprocedural Loop Fusion

void cnot(int C, int T, q_reg *reg) {
  int i;
  int qec;
  status(&qec, NULL);
  if (qec)
    B1;
  else {
    …
    if (foo(x)) return;
    for (i = 0; i < reg->size; i++) {
      if (reg->state[i] & C) reg->state[i] ^= T;
    }
    decohere(reg);
  }
}

void sigma(int T, q_reg *reg) {
  int i;
  int qec;
  status(&qec, NULL);
  if (qec)
    B2;
  else {
    …
    if (foo(y)) return;
    for (i = 0; i < reg->size; i++) {
      reg->state[i] ^= T;
    }
    decohere(reg);
  }
}

Page 43

Interprocedural Loop Fusion

[Animation frame over the cnot and sigma code; the two decohere(reg); calls are called out]

decohere(reg); decohere(reg);

Page 44

Interprocedural Loop Fusion

Global Value Prop

void decohere(q_reg) {
  if (status) {
    do_something;
  }
  return;
}

decohere(reg); decohere(reg);

Page 45

Interprocedural Loop Fusion

Global Value Prop

void decohere(q_reg) {
  if (0) {
    do_something;
  }
  return;
}

decohere(reg); decohere(reg);

Page 47

Interprocedural Loop Fusion

[Animation frame over the cnot and sigma code; the status queries are called out]

status(&qec, NULL); if (qec) B1;

status(&qec, NULL); if (qec) B2;

Page 48

Interprocedural Loop Fusion

Inlining

status(&qec, NULL); if (qec)

Page 49

Interprocedural Loop Fusion

Inlining

if (globalVar)

Page 50

Interprocedural Loop Fusion

void cnot(int C, int T, q_reg *reg) {
  int i;
  int qec;
  status(&qec, NULL);
  if (globalVar)
    B1;
  else {
    …
    if (foo(x)) return;
    for (i = 0; i < reg->size; i++) {
      if (reg->state[i] & C) reg->state[i] ^= T;
    }
    decohere(reg);
  }
}

void sigma(int T, q_reg *reg) {
  int i;
  int qec;
  status(&qec, NULL);
  if (globalVar)
    B2;
  else {
    …
    if (foo(y)) return;
    for (i = 0; i < reg->size; i++) {
      reg->state[i] ^= T;
    }
    decohere(reg);
  }
}

Page 51

Interprocedural Loop Fusion

GlobalModRef

status(&qec, NULL); if (globalVar) B1; else { … }

status(&qec, NULL); if (globalVar) B2; else { … }

Page 52

Interprocedural Loop Fusion

GlobalModRef

if (globalVar) {
  B1;
  B2;
} else {
  …
  if (foo(x)) return;
  for (i = 0; i < reg->size; i++) {
    if (reg->state[i] & C) reg->state[i] ^= T;
  }
  …
  if (foo(y)) return;
  for (i = 0; i < reg->size; i++) {
    reg->state[i] ^= T;
  }
}

Page 55

Interprocedural Loop Fusion

GlobalModRef

if (globalVar) {
  B1;
  B2;
} else {
  bool V1 = foo(x);
  bool V2 = foo(y);
  for (i = 0; i < reg->size; i++) {
    if (reg->state[i] & C) reg->state[i] ^= T;
  }
  if (V1 || V2) return;
  for (i = 0; i < reg->size; i++) {
    reg->state[i] ^= T;
  }
}

Fuse

Page 56

Interprocedural Loop Fusion

GlobalModRef

if (globalVar) {
  B1;
  B2;
} else {
  bool V1 = foo(x);
  bool V2 = foo(y);
  for (i = 0; i < reg->size; i++) {
    if (reg->state[i] & C) reg->state[i] ^= T;
    reg->state[i] ^= T;
  }
  if (V1 || V2) return;
}

Fused Loops
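
The final fusion step can be checked in isolation with a short C sketch. It covers only the two loops, not the inlining or hoisting steps before them. The function names are hypothetical, and separate toggle masks `T1` and `T2` are an assumption made because cnot and sigma are called with different target bits.

```c
/* The two loops as they stand once cnot and sigma are inlined side by side. */
void two_loops(int *state, int size, int C, int T1, int T2) {
  for (int i = 0; i < size; i++)
    if (state[i] & C) state[i] ^= T1;
  for (int i = 0; i < size; i++)
    state[i] ^= T2;
}

/* Fused version: one pass over state[]. This is legal because iteration i
   of either loop reads and writes only state[i]. */
void fused_loop(int *state, int size, int C, int T1, int T2) {
  for (int i = 0; i < size; i++) {
    if (state[i] & C) state[i] ^= T1;
    state[i] ^= T2;
  }
}

/* Returns 1 if both versions leave identical state arrays. */
int fusion_agrees(void) {
  int a[6] = { 0, 1, 2, 3, 12, 255 };
  int b[6] = { 0, 1, 2, 3, 12, 255 };
  two_loops(a, 6, 2, 4, 8);
  fused_loop(b, 6, 2, 4, 8);
  for (int i = 0; i < 6; i++)
    if (a[i] != b[i]) return 0;
  return 1;
}
```

The fused loop streams over the array once instead of twice, which matters for a memory-bound benchmark.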

Page 57

What can we learn?

• Optimization Scope: Call Chain

• Concept: Function Similarity

• Challenge: Hoist statements across loops

• Techniques: Global Value Prop, Partial Inlining, GlobalModRef, Loop Fusion, …

Page 58

Takeaways

• Techniques needed for ‘heroics’ can be generalized to advance optimization technology

• Cost Model?

Page 59

How to expose performance opportunities to developers?

Page 60

__builtin_nontemporal_store

void scaledCpy(float *__restrict__ a, float *__restrict__ b, float S, int N) {
  for (int i = 0; i < N; i++)
    b[i] = S * a[i];
}

Page 61

__builtin_nontemporal_store

void scaledCpy(float *__restrict__ a, float *__restrict__ b, float S, int N) {
  for (int i = 0; i < N; i++)
    // b[i] = S * a[i];
    __builtin_nontemporal_store(S * a[i], &b[i]);
}
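
Since `__builtin_nontemporal_store` is a Clang builtin, code that must also build elsewhere needs a guard. The sketch below is one possible way to do that with `__has_builtin` and a plain-store fallback; the names `scaled_copy` and `scaled_copy_works` are hypothetical.

```c
#ifndef __has_builtin
#define __has_builtin(x) 0 /* compilers without __has_builtin take the fallback */
#endif

/* Scaled copy whose stores bypass the cache where the builtin is available. */
void scaled_copy(const float *restrict a, float *restrict b, float S, int N) {
  for (int i = 0; i < N; i++) {
#if __has_builtin(__builtin_nontemporal_store)
    __builtin_nontemporal_store(S * a[i], &b[i]);
#else
    b[i] = S * a[i]; /* ordinary store when the builtin is missing */
#endif
  }
}

/* Returns 1 if the copy produces the expected scaled values. */
int scaled_copy_works(void) {
  float a[4] = { 1.0f, 2.0f, -3.0f, 0.5f };
  float b[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
  scaled_copy(a, b, 2.0f, 4);
  return b[0] == 2.0f && b[1] == 4.0f && b[2] == -6.0f && b[3] == 1.0f;
}
```

The result is the same either way. The nontemporal form only hints that `b[]` will not be read again soon, so it is worth using when the destination is large and written once.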

Page 62

Vectorizer: Hints and Diagnostics

while (good) {
  for (i = 0; i < N; i++) {
    DW[i] = A[i - 3] + B[i - 2] + C[i - 1] + D[i];
    UW[i] = A[i] + B[i + 1] + C[i + 2] + D[i + 3];
  }
}

remark: loop not vectorized: … Avoid runtime pointer checking when you know the arrays will always be independent by specifying '#pragma clang loop vectorize(assume_safety)' before the loop or by specifying 'restrict' on the array arguments. Erroneous results will occur if these options are incorrectly applied!
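
Both remedies named in the remark can be applied to this loop. The C sketch below shows them together purely for illustration (either one alone addresses the remark). The function name `stencil`, the loop bounds 3 to n - 3 (chosen to keep every access in range), and the test data are assumptions.

```c
/* `restrict` promises the compiler that the six arrays do not overlap. */
void stencil(int n, const float *restrict A, const float *restrict B,
             const float *restrict C, const float *restrict D,
             float *restrict DW, float *restrict UW) {
  /* Clang-specific hint, guarded so other compilers never see the pragma. */
#ifdef __clang__
#pragma clang loop vectorize(assume_safety)
#endif
  for (int i = 3; i < n - 3; i++) {
    DW[i] = A[i - 3] + B[i - 2] + C[i - 1] + D[i];
    UW[i] = A[i] + B[i + 1] + C[i + 2] + D[i + 3];
  }
}

/* Returns 1 if the stencil matches the closed form for the test data:
   with A[i] = i, B[i] = 2i, C[i] = 3i, D[i] = 4i the results are
   DW[i] = 10i - 10 and UW[i] = 10i + 20. */
int stencil_works(void) {
  enum { N = 10 };
  float A[N], B[N], C[N], D[N], DW[N], UW[N];
  for (int i = 0; i < N; i++) {
    A[i] = (float)i;
    B[i] = (float)(2 * i);
    C[i] = (float)(3 * i);
    D[i] = (float)(4 * i);
    DW[i] = UW[i] = 0.0f;
  }
  stencil(N, A, B, C, D, DW, UW);
  for (int i = 3; i < N - 3; i++) {
    if (DW[i] != (float)(10 * i - 10)) return 0;
    if (UW[i] != (float)(10 * i + 20)) return 0;
  }
  return 1;
}
```

As the remark warns, both annotations are promises from the programmer: if the arrays do overlap, the vectorized loop can silently produce wrong results.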

Page 63

Conclusions

The best days for LLVM performance are ahead of us

Page 64

Questions?