WORKING TOGETHER - Amazon Web Servicesconnect.linaro.org.s3.amazonaws.com/sfo17/Presentations... ·...
ENGINEERS AND DEVICES WORKING TOGETHER
K (Android 4.4): Dalvik + JIT compiler
L (Android 5.0): ART + AOT compiler
M (Android 6.0): ART + AOT compiler
N (Android 7.0): ART + JIT/AOT compiler
O (Android 8.0): ART + JIT/AOT compiler + vectorization
A SIMD instruction performs a single operation on multiple operands in parallel
ARM: NEON Technology (128-bit)
Intel: SSE* (128-bit) AVX* (256-bit, 512-bit)
MIPS: MSA (128-bit)
All modern general-purpose CPUs support small-scale SIMD instructions (typically between 64-bit and 512-bit)
4x32-bit operations
● Many vectorizing compilers were developed by supercomputer vendors
● Intel introduced the first vectorizing compiler for SSE in 1999
● Since the Android O release, the optimizing compiler of ART has joined the family of vectorizing compilers
www.aartbik.com
for (int i = 0; i < 256; i++) {         for (int i = 0; i < 256; i += 4) {
  a[i] = b[i] + 1;                ->      a[i:i+3] = b[i:i+3] + [1,1,1,1];
}                                       }
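Written out as runnable Java, the scalar side of this transformation is the following (a minimal sketch; the bound 256 and the names a and b come from the slide, and ART performs the vectorization on its IR, not on the source):

```java
public class AddOne {
    // Scalar loop from the slide. ART's optimizing compiler can rewrite
    // this into 4-wide vector form: a[i:i+3] = b[i:i+3] + [1,1,1,1].
    static void addOne(int[] a, int[] b) {
        for (int i = 0; i < 256; i++) {
            a[i] = b[i] + 1;
        }
    }
}
```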
VectorOperation                (has vector length, has packed data type)
├── VectorMemOp                (has alignment)
│   ├── VectorLoad
│   └── VectorStore
└── VectorBinOp
    ├── VectorAdd
    ├── VectorSub
    └── …
A class hierarchy of general vector operations that is sufficiently powerful to represent SIMD operations common to all architectures
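As an illustration only, the hierarchy can be sketched in Java. The real nodes are C++ classes inside ART's optimizing compiler, and the field names below are invented to mirror the slide's annotations:

```java
// Hypothetical Java sketch of the vector-operation hierarchy on the slide.
// Every vector operation carries a vector length and a packed data type.
abstract class VectorOperation {
    int vectorLength;        // "has vector length"
    Class<?> packedDataType; // "has packed data type"
}

// Memory operations additionally carry alignment information.
abstract class VectorMemOp extends VectorOperation {
    int alignment;           // "has alignment"
}

// Binary arithmetic operations.
abstract class VectorBinOp extends VectorOperation {}

class VectorAdd   extends VectorBinOp {}
class VectorSub   extends VectorBinOp {}
class VectorLoad  extends VectorMemOp {}
class VectorStore extends VectorMemOp {}
```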
for (int i = 0; i < 256; i += 4) {        t = [1,1,1,1];
  a[i:i+3] = b[i:i+3] + [1,1,1,1];   ->   for (int i = 0; i < 256; i += 8) {
}                                           a[i  :i+3] = b[i  :i+3] + t;
                                            a[i+4:i+7] = b[i+4:i+7] + t;
                                          }
movi v0.4s, #0x1, lsl #0
mov w3, #0xc
mov w0, #0x0
Loop: cmp w0, #0x100 (256)
b.hs Exit
add w4, w0, #0x4 (4)
add w0, w3, w0, lsl #2
add w5, w3, w4, lsl #2
ldr q1, [x2, x0]
add v1.4s, v1.4s, v0.4s
str q1, [x1, x0]
ldr q1, [x2, x5]
add v1.4s, v1.4s, v0.4s
str q1, [x1, x5]
add w0, w4, #0x4 (4)
ldrh w16, [tr] ; suspend check
cbz w16, Loop
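The same unrolling can be mimicked at the source level (a sketch only; ART unrolls its vector IR, so this scalar Java just illustrates the shape: the invariant hoisted out, and two chunks of work per iteration):

```java
public class Unrolled {
    // Scalar analogue of the 2x-unrolled vector loop above: the constant
    // is hoisted into t (like t = [1,1,1,1]) and each iteration performs
    // the work of two original iterations.
    static void addOneUnrolled(int[] a, int[] b) {
        final int t = 1; // hoisted loop invariant
        for (int i = 0; i < 256; i += 2) {
            a[i]     = b[i]     + t;
            a[i + 1] = b[i + 1] + t;
        }
    }
}
```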
VecReplicateScalar(x)

ARM64:   dup v0.4s, w2
x86-64:  movdq xmm0, rdx
         pshufd xmm0, xmm0, 0
MIPS64:  fill.w w0, a2
/**
 * Cross-fade byte arrays x1 and x2 into byte array x_out.
 */
private static void avg(byte[] x_out, byte[] x1, byte[] x2) {
  // Compute minimum length of the three byte arrays.
  int min = Math.min(x_out.length, Math.min(x1.length, x2.length));
  // Morph with rounding halving add (unsigned).
  for (int i = 0; i < min; i++) {
    x_out[i] = (byte) (((x1[i] & 0xff) + (x2[i] & 0xff) + 1) >> 1);
  }
}
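The masked expression in this loop is exactly an unsigned rounding halving add, which is why it maps to a single NEON urhadd per vector. Pulled out as a one-lane helper (the helper name is mine):

```java
public class RoundingAvg {
    // (x1 + x2 + 1) >> 1 on the unsigned byte values: the scalar form of
    // NEON's urhadd on one byte lane. The & 0xff masks undo Java's
    // sign-extending byte-to-int promotion.
    static byte avg(byte p, byte q) {
        return (byte) (((p & 0xff) + (q & 0xff) + 1) >> 1);
    }
}
```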
SEQUENTIAL (ARMv8 AArch64)
L: cmp   w5, w0
   b.hs  Exit
   add   w4, w2, #0xc (12)
   add   w6, w3, #0xc (12)
   ldrsb w4, [x4, x5]
   ldrsb w6, [x6, x5]
   and   w4, w4, #0xff
   and   w6, w6, #0xff
   add   w4, w4, w6
   add   w6, w1, #0xc (12)
   add   w4, w4, #0x1 (1)
   asr   w4, w4, #1
   strb  w4, [x6, x5]
   add   w5, w5, #0x1 (1)
   ldrh  w16, [tr]         ; suspend check
   cbz   w16, L
SIMD (ARMv8 AArch64 + NEON Technology)
L: cmp    w5, w4
   b.hs   Exit
   add    w16, w2, w5
   ldur   q0, [x16, #12]
   add    w16, w3, w5
   ldur   q1, [x16, #12]
   urhadd v0.16b, v0.16b, v1.16b
   add    w16, w1, w5
   stur   q0, [x16, #12]
   add    w5, w5, #0x10 (16)
   ldrh   w16, [tr]        ; suspend check
   cbz    w16, L
Runs about 10x faster!
Sequential performance: ≈20 fps        SIMD performance (NEON 128-bit): ≈60 fps
Java code Autovectorization result
void mul_add(int[] a, int[] b) {
  for (int i = 0; i < 512; i++) {
    a[i] += a[i] * b[i];
  }
}
L: cmp  w0, #0x200
   b.hs Exit
   add  w16, w1, #0xc
   add  x16, x16, x0, lsl #2
   ld1  {v0.2s}, [x16]
   add  w16, w2, #0xc
   add  x16, x16, x0, lsl #2
   ld1  {v1.2s}, [x16]
   mul  v1.2s, v0.2s, v1.2s
   add  v0.2s, v0.2s, v1.2s
   add  w16, w1, #0xc
   add  x16, x16, x0, lsl #2
   st1  {v0.2s}, [x16]
   add  w0, w0, #0x2
   ldrh w16, [tr]
   cbz  w16, L
Before:
L: cmp  w0, #0x200
   b.hs Exit
   add  w16, w1, #0xc
   add  x16, x16, x0, lsl #2
   ld1  {v0.2s}, [x16]
   add  w16, w2, #0xc
   add  x16, x16, x0, lsl #2
   ld1  {v1.2s}, [x16]
   mul  v1.2s, v0.2s, v1.2s
   add  v0.2s, v0.2s, v1.2s
   add  w16, w1, #0xc
   add  x16, x16, x0, lsl #2
   st1  {v0.2s}, [x16]
   add  w0, w0, #0x2
   ldrh w16, [tr]
   cbz  w16, L

After (68% perf boost, full 128-bit vectors):
L: cmp  w0, #0x200
   b.hs Exit
   add  w16, w1, #0xc
   add  x16, x16, x0, lsl #2
   ld1  {v0.4s}, [x16]
   add  w16, w2, #0xc
   add  x16, x16, x0, lsl #2
   ld1  {v1.4s}, [x16]
   mul  v1.4s, v0.4s, v1.4s
   add  v0.4s, v0.4s, v1.4s
   add  w16, w1, #0xc
   add  x16, x16, x0, lsl #2
   st1  {v0.4s}, [x16]
   add  w0, w0, #0x4
   ldrh w16, [tr]
   cbz  w16, L
Before:
L: cmp  w0, #0x200
   b.hs Exit
   add  w16, w1, #0xc
   add  x16, x16, x0, lsl #2
   ld1  {v0.4s}, [x16]
   add  w16, w2, #0xc
   add  x16, x16, x0, lsl #2
   ld1  {v1.4s}, [x16]
   mul  v1.4s, v0.4s, v1.4s
   add  v0.4s, v0.4s, v1.4s
   add  w16, w1, #0xc
   add  x16, x16, x0, lsl #2
   st1  {v0.4s}, [x16]
   add  w0, w0, #0x4
   ldrh w16, [tr]
   cbz  w16, L

After (11% perf boost, mul + add fused into mla):
L: cmp  w0, #0x200
   b.hs Exit
   add  w16, w1, #0xc
   add  x16, x16, x0, lsl #2
   ld1  {v0.4s}, [x16]
   add  w16, w2, #0xc
   add  x16, x16, x0, lsl #2
   ld1  {v1.4s}, [x16]
   mov  v2.16b, v0.16b
   mla  v2.4s, v0.4s, v1.4s
   add  w16, w1, #0xc
   add  x16, x16, x0, lsl #2
   st1  {v2.4s}, [x16]
   add  w0, w0, #0x4
   ldrh w16, [tr]
   cbz  w16, L
Before:
L: cmp  w0, #0x200
   b.hs Exit
   add  w16, w1, #0xc
   add  x16, x16, x0, lsl #2
   ld1  {v0.4s}, [x16]
   add  w16, w2, #0xc
   add  x16, x16, x0, lsl #2
   ld1  {v1.4s}, [x16]
   mov  v2.16b, v0.16b
   mla  v2.4s, v0.4s, v1.4s
   add  w16, w1, #0xc
   add  x16, x16, x0, lsl #2
   st1  {v2.4s}, [x16]
   add  w0, w0, #0x4
   ldrh w16, [tr]
   cbz  w16, L

After (23% perf boost, offset folded into ldur/stur addressing):
L: cmp  w0, #0x200
   b.hs Exit
   add  w16, w1, w0, lsl #2
   ldur q0, [x16, #12]
   add  w16, w2, w0, lsl #2
   ldur q1, [x16, #12]
   mov  v2.16b, v0.16b
   mla  v2.4s, v0.4s, v1.4s
   add  w16, w1, w0, lsl #2
   stur q2, [x16, #12]
   add  w0, w0, #0x4
   ldrh w16, [tr]
   cbz  w16, L
Before:
L: cmp  w0, #0x200
   b.hs Exit
   add  w16, w1, w0, lsl #2
   ldur q0, [x16, #12]
   add  w16, w2, w0, lsl #2
   ldur q1, [x16, #12]
   mov  v2.16b, v0.16b
   mla  v2.4s, v0.4s, v1.4s
   add  w16, w1, w0, lsl #2
   stur q2, [x16, #12]
   add  w0, w0, #0x4
   ldrh w16, [tr]
   cbz  w16, L

After (10% perf boost, data offset hoisted out of the loop):
   mov  w3, #0xc
L: cmp  w0, #0x200
   b.hs Exit
   add  w4, w3, w0, lsl #2
   ldr  q0, [x1, x4]
   ldr  q1, [x2, x4]
   mov  v2.16b, v0.16b
   mla  v2.4s, v0.4s, v1.4s
   str  q2, [x1, x4]
   add  w0, w0, #0x4
   ldrh w16, [tr]
   cbz  w16, L
Before:
L: cmp  w0, #0x200
   b.hs Exit
   add  w4, w3, w0, lsl #2
   ldr  q0, [x1, x4]
   ldr  q1, [x2, x4]
   mov  v2.16b, v0.16b
   mla  v2.4s, v0.4s, v1.4s
   str  q2, [x1, x4]
   add  w0, w0, #0x4
   ldrh w16, [tr]
   cbz  w16, L

After (2.5% perf boost, loop body unrolled 2x):
L: cmp  w0, #0x200
   b.hs Exit
   add  w4, w3, w0, lsl #2
   ldr  q0, [x1, x4]
   ldr  q1, [x2, x4]
   mov  v2.16b, v0.16b
   mla  v2.4s, v0.4s, v1.4s
   str  q2, [x1, x4]
   add  w0, w0, #0x4
   add  w4, w3, w0, lsl #2
   ldr  q0, [x1, x4]
   ldr  q1, [x2, x4]
   mov  v2.16b, v0.16b
   mla  v2.4s, v0.4s, v1.4s
   str  q2, [x1, x4]
   add  w0, w0, #0x4
   ldrh w16, [tr]
   cbz  w16, L
Before:
L: cmp  w0, #0x200
   b.hs Exit
   add  w4, w3, w0, lsl #2
   ldr  q0, [x1, x4]
   ldr  q1, [x2, x4]
   mov  v2.16b, v0.16b
   mla  v2.4s, v0.4s, v1.4s
   str  q2, [x1, x4]
   add  w0, w0, #0x4
   add  w4, w3, w0, lsl #2
   ldr  q0, [x1, x4]
   ldr  q1, [x2, x4]
   mov  v2.16b, v0.16b
   mla  v2.4s, v0.4s, v1.4s
   str  q2, [x1, x4]
   add  w0, w0, #0x4
   ldrh w16, [tr]
   cbz  w16, L

After (12% perf boost, both addresses computed up front):
L: cmp  w0, #0x200
   b.hs Exit
   add  w4, w0, #0x4
   add  w0, w3, w0, lsl #2
   add  w5, w3, w4, lsl #2
   ldr  q0, [x1, x0]
   ldr  q1, [x2, x0]
   mov  v2.16b, v0.16b
   mla  v2.4s, v0.4s, v1.4s
   str  q2, [x1, x0]
   ldr  q0, [x1, x5]
   ldr  q1, [x2, x5]
   mov  v2.16b, v0.16b
   mla  v2.4s, v0.4s, v1.4s
   str  q2, [x1, x5]
   add  w0, w4, #0x4
   ldrh w16, [tr]
   cbz  w16, L
for (int i = 0; i < LENGTH; i++) {
  c[i] = (byte)(a[i] + b[i]);
}
i87  Add [i80,i79]
i102 IntermediateAddressIndex [i87,i98,i3]
i99  IntermediateAddressIndex [i80,i98,i3]
d89  VecLoad [l35,i102]
d84  VecLoad [l35,i99]
d83  VecLoad [l29,i99]
d88  VecLoad [l29,i102]
d85  VecAdd [d83,d84]
d90  VecAdd [d88,d89]
d86  VecStore [l27,i99,d85]
d91  VecStore [l27,i102,d90]
i92  Add [i87,i79]
v78  Goto
(gdb) x/64u 0xefc0b000
0xefc0b000:   0  28 192  18   0   0   0   0
0xefc0b008:   0   0   4   0 100 101 102 103
0xefc0b010: 104 105 106 107 108 109 110 111
0xefc0b018: 112 113 114 115 116 117 118 119
0xefc0b020: 120 121 122 123 124 125 126 127
0xefc0b028: 128 129 130 131 132 133 134 135
0xefc0b030: 136 137 138 139 140 141 142 143
0xefc0b038: 144 145 146 147 148 149 150 151
Java Code:
static final int LENGTH = 1024 * 256; // 256K elements, 0x40000
static byte[] a = new byte[LENGTH];
static byte[] b = new byte[LENGTH];
static byte[] c = new byte[LENGTH];
Object Header
data[0]
One VecLoad / VecStore
0xefc0b000:   0  28 192  18   0   0   0   0
0xefc0b008:   0   0   4   0 100 101 102 103   <- avoid SIMD from here (data[0] at 0xefc0b00c is unaligned)
0xefc0b010: 104 105 106 107 108 109 110 111   <- SIMD from here (16-byte aligned)
0xefc0b018: 112 113 114 115 116 117 118 119
0xefc0b020: 120 121 122 123 124 125 126 127
0xefc0b028: 128 129 130 131 132 133 134 135
0xefc0b030: 136 137 138 139 140 141 142 143
0xefc0b038: 144 145 146 147 148 149 150 151
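With the array payload starting 12 bytes into the object, a few scalar iterations must be peeled before 16-byte-aligned SIMD can start. A sketch of that arithmetic (the helper name and signature are mine, not ART's):

```java
public class Peeling {
    // Number of byte elements to peel so that accesses become aligned to
    // vectorBytes. For the dump above: base 0xefc0b000, byte data at
    // offset 12, 16-byte vectors -> peel 4 elements, so data[4] (address
    // 0xefc0b010) is the first aligned element.
    static int peelCount(long baseAddress, int dataOffset, int vectorBytes) {
        long first = baseAddress + dataOffset;          // address of data[0]
        long misalign = first % vectorBytes;
        return (int) ((vectorBytes - misalign) % vectorBytes);
    }
}
```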
Analyzable and flexible CHECKED!
Embeddable CHECKED!
Stable and reproducible CHECKED!
Recognized CHECKED!
Combine adjacent vector loads into a load pair: LDR q1, [x16] + LDR q2, [x16, #16] -> LDP q1, q2, [x16]
Java Scalar version Initial SIMD Version
void mul_add(int[] a, int[] b, int[] c) {
  for (int i = 0; i < 512; i++) {
    a[i] += a[i] * b[i];
  }
}
L: cmp  w0, #0x200
   b.hs Exit
   add  w4, w1, #0xc
   ldr  w6, [x4, x0, lsl #2]
   add  w5, w2, #0xc
   ldr  w5, [x5, x0, lsl #2]
   madd w5, w6, w5, w6
   str  w5, [x4, x0, lsl #2]
   add  w0, w0, #0x1
   ldrh w16, [tr]
   cbz  w16, L
L: cmp  w0, #0x200
   b.hs Exit
   add  w16, w1, #0xc
   add  x16, x16, x0, lsl #2
   ld1  {v0.2s}, [x16]
   add  w16, w2, #0xc
   add  x16, x16, x0, lsl #2
   ld1  {v1.2s}, [x16]
   mul  v1.2s, v0.2s, v1.2s
   add  v0.2s, v0.2s, v1.2s
   add  w16, w1, #0xc
   add  x16, x16, x0, lsl #2
   st1  {v0.2s}, [x16]
   add  w0, w0, #0x2
   ldrh w16, [tr]
   cbz  w16, L