WORKING TOGETHER - Amazon Web Servicesconnect.linaro.org.s3.amazonaws.com/sfo17/Presentations... ·...


Page 2: WORKING TOGETHER - Amazon Web Servicesconnect.linaro.org.s3.amazonaws.com/sfo17/Presentations... · 2017-10-09 · ENGINEERS AND DEVICES WORKING TOGETHER Before After (68% perf boost)


K (Android 4.4): Dalvik + JIT compiler
L (Android 5.0): ART + AOT compiler
M (Android 6.0): ART + AOT compiler
N (Android 7.0): ART + JIT/AOT compiler
O (Android 8.0): ART + JIT/AOT compiler + vectorization


A SIMD instruction performs a single operation on multiple operands in parallel

ARM: NEON Technology (128-bit)

Intel: SSE* (128-bit) AVX* (256-bit, 512-bit)

MIPS: MSA (128-bit)

All modern general-purpose CPUs support small-scale SIMD instructions (typically between 64-bit and 512-bit)

4x32-bit operations


● Many vectorizing compilers were developed by supercomputer vendors

● Intel introduced first vectorizing compiler for SSE in 1999

● Since the Android O release, the optimizing compiler of ART has joined the family of vectorizing compilers

www.aartbik.com


for (int i = 0; i < 256; i++) {
  a[i] = b[i] + 1;
}

->

for (int i = 0; i < 256; i += 4) {
  a[i:i+3] = b[i:i+3] + [1,1,1,1];
}
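
The transformation above can be emulated in plain Java to check that the vector form computes the same result as the scalar loop. This is only a sketch: the class name, the method names, and the LANES constant are assumptions for illustration, and the inner lane loop stands in for what would be a single SIMD instruction in compiled code.

```java
import java.util.Arrays;

final class VectorizeSketch {
    static final int LANES = 4; // assumed: four 32-bit lanes in a 128-bit register

    // Scalar loop from the slide: one element per iteration.
    static void scalar(int[] a, int[] b) {
        for (int i = 0; i < 256; i++) {
            a[i] = b[i] + 1;
        }
    }

    // Emulates the vector loop: each outer iteration handles a[i:i+3].
    static void vectorizedEmulation(int[] a, int[] b) {
        int[] ones = {1, 1, 1, 1};
        for (int i = 0; i < 256; i += LANES) {
            // In compiled code this inner loop would be one SIMD add.
            for (int lane = 0; lane < LANES; lane++) {
                a[i + lane] = b[i + lane] + ones[lane];
            }
        }
    }

    public static void main(String[] args) {
        int[] b = new int[256];
        for (int i = 0; i < b.length; i++) {
            b[i] = i * 3;
        }
        int[] a1 = new int[256];
        int[] a2 = new int[256];
        scalar(a1, b);
        vectorizedEmulation(a2, b);
        System.out.println(Arrays.equals(a1, a2)); // true
    }
}
```

The trip count of 256 is a multiple of the lane count, so no cleanup loop is needed in this particular case.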



VectorOperation (has vector length, has packed data type)
  VectorMemOp (has alignment)
    VectorLoad
    VectorStore
    ….
  VectorBinOp
    VectorAdd
    VectorSub
    ….

A class hierarchy of general vector operations that is sufficiently powerful to represent SIMD operations common to all architectures
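
A hierarchy of this shape could be mirrored in Java roughly as follows. The class names come from the slide; every field, constructor, and method here is an assumption made for illustration and is not taken from the ART sources.

```java
import java.util.Arrays;

final class VectorHierarchyDemo {
    public static void main(String[] args) {
        VectorLoad load = new VectorLoad(4, "int32", 4);
        VectorStore store = new VectorStore(4, "int32", 4);
        VectorAdd add = new VectorAdd(4, "int32");
        int[] b = {10, 20, 30, 40};
        int[] a = new int[4];
        // a[0:3] = b[0:3] + [1,1,1,1]
        store.store(a, 0, add.apply(load.load(b, 0), new int[] {1, 1, 1, 1}));
        System.out.println(Arrays.toString(a));
    }
}

abstract class VectorOperation {
    final int vectorLength;  // number of lanes, e.g. 4
    final String packedType; // element type of each lane, e.g. "int32"

    VectorOperation(int vectorLength, String packedType) {
        this.vectorLength = vectorLength;
        this.packedType = packedType;
    }
}

abstract class VectorBinOp extends VectorOperation {
    VectorBinOp(int vectorLength, String packedType) {
        super(vectorLength, packedType);
    }

    abstract int applyLane(int x, int y);

    // Lane-by-lane emulation; a backend would emit one SIMD instruction.
    int[] apply(int[] x, int[] y) {
        int[] r = new int[vectorLength];
        for (int i = 0; i < vectorLength; i++) {
            r[i] = applyLane(x[i], y[i]);
        }
        return r;
    }
}

final class VectorAdd extends VectorBinOp {
    VectorAdd(int n, String t) { super(n, t); }
    @Override int applyLane(int x, int y) { return x + y; }
}

final class VectorSub extends VectorBinOp {
    VectorSub(int n, String t) { super(n, t); }
    @Override int applyLane(int x, int y) { return x - y; }
}

abstract class VectorMemOp extends VectorOperation {
    final int alignment; // known alignment of the accessed address, in bytes

    VectorMemOp(int n, String t, int alignment) {
        super(n, t);
        this.alignment = alignment;
    }
}

final class VectorLoad extends VectorMemOp {
    VectorLoad(int n, String t, int alignment) { super(n, t, alignment); }

    int[] load(int[] array, int index) {
        return Arrays.copyOfRange(array, index, index + vectorLength);
    }
}

final class VectorStore extends VectorMemOp {
    VectorStore(int n, String t, int alignment) { super(n, t, alignment); }

    void store(int[] array, int index, int[] value) {
        System.arraycopy(value, 0, array, index, vectorLength);
    }
}
```

Keeping vector length and packed type in the root class is what lets one node type describe, say, a 4x32-bit add and a 16x8-bit add without a separate class per shape.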


for (int i = 0; i < 256; i += 4) {
  a[i:i+3] = b[i:i+3] + [1,1,1,1];
}

->

t = [1,1,1,1];
for (int i = 0; i < 256; i += 8) {
  a[i  :i+3] = b[i  :i+3] + t;
  a[i+4:i+7] = b[i+4:i+7] + t;
}
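
As a rough Java emulation of this step, the sketch below hoists the constant vector out of the loop and unrolls the vector loop by two. The helper vecAdd and the class name are hypothetical; vecAdd stands in for one SIMD add over four lanes.

```java
import java.util.Arrays;

final class UnrollSketch {
    static final int LANES = 4; // assumed: four 32-bit lanes per 128-bit vector

    // Stands in for one SIMD add: a[from:from+3] = b[from:from+3] + t.
    static void vecAdd(int[] a, int[] b, int from, int[] t) {
        for (int lane = 0; lane < LANES; lane++) {
            a[from + lane] = b[from + lane] + t[lane];
        }
    }

    // Vector loop before the step: the constant vector is rebuilt every iteration.
    static void vectorLoop(int[] a, int[] b) {
        for (int i = 0; i < 256; i += 4) {
            vecAdd(a, b, i, new int[] {1, 1, 1, 1});
        }
    }

    // After hoisting t out of the loop and unrolling by two.
    static void unrolledLoop(int[] a, int[] b) {
        int[] t = {1, 1, 1, 1};
        for (int i = 0; i < 256; i += 8) {
            vecAdd(a, b, i, t);
            vecAdd(a, b, i + 4, t);
        }
    }

    public static void main(String[] args) {
        int[] b = new int[256];
        for (int i = 0; i < b.length; i++) {
            b[i] = i;
        }
        int[] a1 = new int[256];
        int[] a2 = new int[256];
        vectorLoop(a1, b);
        unrolledLoop(a2, b);
        System.out.println(Arrays.equals(a1, a2)); // true
    }
}
```

Unrolling halves the number of loop-control instructions per element, which is the overhead the later assembly listings keep trimming.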


t = [1,1,1,1];
for (int i = 0; i < 256; i += 8) {
  a[i  :i+3] = b[i  :i+3] + t;
  a[i+4:i+7] = b[i+4:i+7] + t;
}

->

      movi v0.4s, #0x1, lsl #0
      mov w3, #0xc
      mov w0, #0x0
Loop: cmp w0, #0x100 (256)
      b.hs Exit
      add w4, w0, #0x4 (4)
      add w0, w3, w0, lsl #2
      add w5, w3, w4, lsl #2
      ldr q1, [x2, x0]
      add v1.4s, v1.4s, v0.4s
      str q1, [x1, x0]
      ldr q1, [x2, x5]
      add v1.4s, v1.4s, v0.4s
      str q1, [x1, x5]
      add w0, w4, #0x4 (4)
      ldrh w16, [tr] ; suspend check
      cbz w16, Loop


VecReplicateScalar(x)

ARM64            x86-64                   MIPS64
dup v0.4s, w2    movdq xmm0, rdx          fill.w w0, a2
                 pshufd xmm0, xmm0, 0


/**
 * Cross-fade byte arrays x1 and x2 into byte array x_out.
 */
private static void avg(byte[] x_out, byte[] x1, byte[] x2) {
  // Compute minimum length of the three byte arrays.
  int min = Math.min(x_out.length, Math.min(x1.length, x2.length));

  // Morph with rounding halving add (unsigned).
  for (int i = 0; i < min; i++) {
    x_out[i] = (byte) (((x1[i] & 0xff) + (x2[i] & 0xff) + 1) >> 1);
  }
}
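
The masking in the loop body matters because Java bytes are signed. The sketch below wraps the slide's loop in a runnable class and contrasts it with a hypothetical unmasked variant (signedAvg, added here only for comparison); the class name and the sample inputs are likewise assumptions.

```java
final class CrossFadeSketch {
    // Loop as on the slide (visibility relaxed so other code can call it).
    static void avg(byte[] xOut, byte[] x1, byte[] x2) {
        int min = Math.min(xOut.length, Math.min(x1.length, x2.length));
        for (int i = 0; i < min; i++) {
            // & 0xff reinterprets each signed byte as an unsigned value 0..255.
            xOut[i] = (byte) (((x1[i] & 0xff) + (x2[i] & 0xff) + 1) >> 1);
        }
    }

    // Hypothetical variant without the masks, to show what goes wrong.
    static byte signedAvg(byte p, byte q) {
        return (byte) ((p + q + 1) >> 1);
    }

    public static void main(String[] args) {
        byte[] x1 = {0, 1, (byte) 255, (byte) 200};
        byte[] x2 = {0, 2, (byte) 255, 100};
        byte[] out = new byte[4];
        avg(out, x1, x2);
        for (byte v : out) {
            System.out.print((v & 0xff) + " "); // unsigned view: 0 2 255 150
        }
        System.out.println();
        // (byte) 200 is -56 as a signed byte, so the unmasked average is 22, not 150.
        System.out.println(signedAvg((byte) 200, (byte) 100) & 0xff);
    }
}
```

The masked form is the unsigned rounding halving add that a single NEON urhadd instruction computes per lane, which is why the compiler can recognize the idiom.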


SEQUENTIAL (ARMv8 AArch64)

L: cmp w5, w0
   b.hs Exit
   add w4, w2, #0xc (12)
   add w6, w3, #0xc (12)
   ldrsb w4, [x4, x5]
   ldrsb w6, [x6, x5]
   and w4, w4, #0xff
   and w6, w6, #0xff
   add w4, w4, w6
   add w6, w1, #0xc (12)
   add w4, w4, #0x1 (1)
   asr w4, w4, #1
   strb w4, [x6, x5]
   add w5, w5, #0x1 (1)
   ldrh w16, [tr] ; suspend check
   cbz w16, L

SIMD (ARMv8 AArch64 + NEON Technology)

L: cmp w5, w4
   b.hs Exit
   add w16, w2, w5
   ldur q0, [x16, #12]
   add w16, w3, w5
   ldur q1, [x16, #12]
   urhadd v0.16b, v0.16b, v1.16b
   add w16, w1, w5
   stur q0, [x16, #12]
   add w5, w5, #0x10 (16)
   ldrh w16, [tr] ; suspend check
   cbz w16, L

Runs about 10x faster!


Java code

void mul_add(int[] a, int[] b) {
  for (int i = 0; i < 512; i++) {
    a[i] += a[i] * b[i];
  }
}

Autovectorization result

L: cmp w0, #0x200
   b.hs Exit
   add w16, w1, #0xc
   add x16, x16, x0, lsl #2
   ld1 {v0.2s}, [x16]
   add w16, w2, #0xc
   add x16, x16, x0, lsl #2
   ld1 {v1.2s}, [x16]
   mul v1.2s, v0.2s, v1.2s
   add v0.2s, v0.2s, v1.2s
   add w16, w1, #0xc
   add x16, x16, x0, lsl #2
   st1 {v0.2s}, [x16]
   add w0, w0, #0x2
   ldrh w16, [tr]
   cbz w16, L


Before

L: cmp w0, #0x200
   b.hs Exit
   add w16, w1, #0xc
   add x16, x16, x0, lsl #2
   ld1 {v0.2s}, [x16]
   add w16, w2, #0xc
   add x16, x16, x0, lsl #2
   ld1 {v1.2s}, [x16]
   mul v1.2s, v0.2s, v1.2s
   add v0.2s, v0.2s, v1.2s
   add w16, w1, #0xc
   add x16, x16, x0, lsl #2
   st1 {v0.2s}, [x16]
   add w0, w0, #0x2
   ldrh w16, [tr]
   cbz w16, L

After (68% perf boost)

L: cmp w0, #0x200
   b.hs Exit
   add w16, w1, #0xc
   add x16, x16, x0, lsl #2
   ld1 {v0.4s}, [x16]
   add w16, w2, #0xc
   add x16, x16, x0, lsl #2
   ld1 {v1.4s}, [x16]
   mul v1.4s, v0.4s, v1.4s
   add v0.4s, v0.4s, v1.4s
   add w16, w1, #0xc
   add x16, x16, x0, lsl #2
   st1 {v0.4s}, [x16]
   add w0, w0, #0x4
   ldrh w16, [tr]
   cbz w16, L


Before

L: cmp w0, #0x200
   b.hs Exit
   add w16, w1, #0xc
   add x16, x16, x0, lsl #2
   ld1 {v0.4s}, [x16]
   add w16, w2, #0xc
   add x16, x16, x0, lsl #2
   ld1 {v1.4s}, [x16]
   mul v1.4s, v0.4s, v1.4s
   add v0.4s, v0.4s, v1.4s
   add w16, w1, #0xc
   add x16, x16, x0, lsl #2
   st1 {v0.4s}, [x16]
   add w0, w0, #0x4
   ldrh w16, [tr]
   cbz w16, L

After (11% perf boost)

L: cmp w0, #0x200
   b.hs Exit
   add w16, w1, #0xc
   add x16, x16, x0, lsl #2
   ld1 {v0.4s}, [x16]
   add w16, w2, #0xc
   add x16, x16, x0, lsl #2
   ld1 {v1.4s}, [x16]
   mov v2.16b, v0.16b
   mla v2.4s, v0.4s, v1.4s
   add w16, w1, #0xc
   add x16, x16, x0, lsl #2
   st1 {v2.4s}, [x16]
   add w0, w0, #0x4
   ldrh w16, [tr]
   cbz w16, L


Before

L: cmp w0, #0x200
   b.hs Exit
   add w16, w1, #0xc
   add x16, x16, x0, lsl #2
   ld1 {v0.4s}, [x16]
   add w16, w2, #0xc
   add x16, x16, x0, lsl #2
   ld1 {v1.4s}, [x16]
   mov v2.16b, v0.16b
   mla v2.4s, v0.4s, v1.4s
   add w16, w1, #0xc
   add x16, x16, x0, lsl #2
   st1 {v2.4s}, [x16]
   add w0, w0, #0x4
   ldrh w16, [tr]
   cbz w16, L

After (23% perf boost)

L: cmp w0, #0x200
   b.hs Exit
   add w16, w1, w0, lsl #2
   ldur q0, [x16, #12]
   add w16, w2, w0, lsl #2
   ldur q1, [x16, #12]
   mov v2.16b, v0.16b
   mla v2.4s, v0.4s, v1.4s
   add w16, w1, w0, lsl #2
   stur q2, [x16, #12]
   add w0, w0, #0x4
   ldrh w16, [tr]
   cbz w16, L


Before

L: cmp w0, #0x200
   b.hs Exit
   add w16, w1, w0, lsl #2
   ldur q0, [x16, #12]
   add w16, w2, w0, lsl #2
   ldur q1, [x16, #12]
   mov v2.16b, v0.16b
   mla v2.4s, v0.4s, v1.4s
   add w16, w1, w0, lsl #2
   stur q2, [x16, #12]
   add w0, w0, #0x4
   ldrh w16, [tr]
   cbz w16, L

After (10% perf boost)

   mov w3, #0xc
L: cmp w0, #0x200
   b.hs Exit
   add w4, w3, w0, lsl #2
   ldr q0, [x1, x4]
   ldr q1, [x2, x4]
   mov v2.16b, v0.16b
   mla v2.4s, v0.4s, v1.4s
   str q2, [x1, x4]
   add w0, w0, #0x4
   ldrh w16, [tr]
   cbz w16, L


Before

L: cmp w0, #0x200
   b.hs Exit
   add w4, w3, w0, lsl #2
   ldr q0, [x1, x4]
   ldr q1, [x2, x4]
   mov v2.16b, v0.16b
   mla v2.4s, v0.4s, v1.4s
   str q2, [x1, x4]
   add w0, w0, #0x4
   ldrh w16, [tr]
   cbz w16, L

After (2.5% perf boost)

L: cmp w0, #0x200
   b.hs Exit
   add w4, w3, w0, lsl #2
   ldr q0, [x1, x4]
   ldr q1, [x2, x4]
   mov v2.16b, v0.16b
   mla v2.4s, v0.4s, v1.4s
   str q2, [x1, x4]
   add w0, w0, #0x4
   add w4, w3, w0, lsl #2
   ldr q0, [x1, x4]
   ldr q1, [x2, x4]
   mov v2.16b, v0.16b
   mla v2.4s, v0.4s, v1.4s
   str q2, [x1, x4]
   add w0, w0, #0x4
   ldrh w16, [tr]
   cbz w16, L


Before

L: cmp w0, #0x200
   b.hs Exit
   add w4, w3, w0, lsl #2
   ldr q0, [x1, x4]
   ldr q1, [x2, x4]
   mov v2.16b, v0.16b
   mla v2.4s, v0.4s, v1.4s
   str q2, [x1, x4]
   add w0, w0, #0x4
   add w4, w3, w0, lsl #2
   ldr q0, [x1, x4]
   ldr q1, [x2, x4]
   mov v2.16b, v0.16b
   mla v2.4s, v0.4s, v1.4s
   str q2, [x1, x4]
   add w0, w0, #0x4
   ldrh w16, [tr]
   cbz w16, L

After (12% perf boost)

L: cmp w0, #0x200
   b.hs Exit
   add w4, w0, #0x4
   add w0, w3, w0, lsl #2
   add w5, w3, w4, lsl #2
   ldr q0, [x1, x0]
   ldr q1, [x2, x0]
   mov v2.16b, v0.16b
   mla v2.4s, v0.4s, v1.4s
   str q2, [x1, x0]
   ldr q0, [x1, x5]
   ldr q1, [x2, x5]
   mov v2.16b, v0.16b
   mla v2.4s, v0.4s, v1.4s
   str q2, [x1, x5]
   add w0, w4, #0x4
   ldrh w16, [tr]
   cbz w16, L


for (int i = 0; i < LENGTH; i++) {
  c[i] = (byte)(a[i] + b[i]);
}

i87 Add [i80,i79]
i102 IntermediateAddressIndex [i87,i98,i3]
i99 IntermediateAddressIndex [i80,i98,i3]
d89 VecLoad [l35,i102]
d84 VecLoad [l35,i99]
d83 VecLoad [l29,i99]
d88 VecLoad [l29,i102]
d85 VecAdd [d83,d84]
d90 VecAdd [d88,d89]
d86 VecStore [l27,i99,d85]
d91 VecStore [l27,i102,d90]
i92 Add [i87,i79]
v78 Goto


Java Code

static final int LENGTH = 1024 * 256; // 256K elements, 0x40000
static byte[] a = new byte[LENGTH];
static byte[] b = new byte[LENGTH];
static byte[] c = new byte[LENGTH];

(gdb) x/64u 0xefc0b000
0xefc0b000: 0 28 192 18 0 0 0 0
0xefc0b008: 0 0 4 0 100 101 102 103
0xefc0b010: 104 105 106 107 108 109 110 111
0xefc0b018: 112 113 114 115 116 117 118 119
0xefc0b020: 120 121 122 123 124 125 126 127
0xefc0b028: 128 129 130 131 132 133 134 135
0xefc0b030: 136 137 138 139 140 141 142 143
0xefc0b038: 144 145 146 147 148 149 150 151

Object Header

data[0]


One VecLoad / VecStore


0xefc0b000: 0 28 192 18 0 0 0 0
0xefc0b008: 0 0 4 0 100 101 102 103
0xefc0b010: 104 105 106 107 108 109 110 111
0xefc0b018: 112 113 114 115 116 117 118 119
0xefc0b020: 120 121 122 123 124 125 126 127
0xefc0b028: 128 129 130 131 132 133 134 135
0xefc0b030: 136 137 138 139 140 141 142 143
0xefc0b038: 144 145 146 147 148 149 150 151

SIMD from here->

Avoid SIMD from here
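
One possible reading of these annotations is that the first few elements are handled with scalar code so that vector accesses start on a 16-byte boundary. The Java sketch below illustrates that idea under stated assumptions: the 12-byte data offset is taken from the #0xc constant in the listings, 16 bytes is one 128-bit register, and the class and method names are hypothetical. It is not the ART implementation.

```java
final class AlignmentPeelSketch {
    static final int HEADER_BYTES = 12; // assumed array data offset (#0xc in the listings)
    static final int VECTOR_BYTES = 16; // one 128-bit register

    // Number of leading elements to process with scalar code so that the
    // first vector access lands on a VECTOR_BYTES boundary.
    // Assumes the distance to the boundary is a multiple of elementBytes.
    static int peelCount(long objectAddress, int elementBytes) {
        long dataAddress = objectAddress + HEADER_BYTES;
        long toBoundary = (VECTOR_BYTES - dataAddress % VECTOR_BYTES) % VECTOR_BYTES;
        return (int) (toBoundary / elementBytes);
    }

    // c[i] = a[i] + b[i] split into scalar prologue, vector body, scalar epilogue.
    // Assumes a and b are at least as long as c.
    static void addWithPeel(byte[] c, byte[] a, byte[] b, int peel) {
        int n = c.length;
        int i = 0;
        for (; i < Math.min(peel, n); i++) {               // scalar prologue
            c[i] = (byte) (a[i] + b[i]);
        }
        for (; i + VECTOR_BYTES <= n; i += VECTOR_BYTES) { // emulated vector body
            for (int lane = 0; lane < VECTOR_BYTES; lane++) {
                c[i + lane] = (byte) (a[i + lane] + b[i + lane]);
            }
        }
        for (; i < n; i++) {                               // scalar epilogue
            c[i] = (byte) (a[i] + b[i]);
        }
    }

    public static void main(String[] args) {
        // Address from the gdb dump: data starts at 0xefc0b00c, so four
        // byte elements precede the aligned address 0xefc0b010.
        System.out.println(peelCount(0xefc0b000L, 1)); // 4
    }
}
```

In practice each of the three arrays has its own address, so aligning accesses to one array does not by itself align the others; the sketch looks at a single array only.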


Analyzable and flexible CHECKED!

Embeddable CHECKED!

Stable and reproducible CHECKED!

Recognized CHECKED!


LDR q1, [x16] + LDR q2, [x16, #16] -> LDP q1, q2, [x16]


Java

void mul_add(int[] a, int[] b, int[] c) {
  for (int i=0; i<512; i++) {
    a[i] += a[i] * b[i];
  }
}

Scalar version

L: cmp w0, #0x200
   b.hs Exit
   add w4, w1, #0xc
   ldr w6, [x4, x0, lsl #2]
   add w5, w2, #0xc
   ldr w5, [x5, x0, lsl #2]
   madd w5, w6, w5, w6
   str w5, [x4, x0, lsl #2]
   add w0, w0, #0x1
   ldrh w16, [tr]
   cbz w16, L

Initial SIMD Version

L: cmp w0, #0x200
   b.hs Exit
   add w16, w1, #0xc
   add x16, x16, x0, lsl #2
   ld1 {v0.2s}, [x16]
   add w16, w2, #0xc
   add x16, x16, x0, lsl #2
   ld1 {v1.2s}, [x16]
   mul v1.2s, v0.2s, v1.2s
   add v0.2s, v0.2s, v1.2s
   add w16, w1, #0xc
   add x16, x16, x0, lsl #2
   st1 {v0.2s}, [x16]
   add w0, w0, #0x2
   ldrh w16, [tr]
   cbz w16, L