SIMD: CilkPlus and OpenMP. ISC15, The Road to Application Performance on Intel Xeon Phi, July 16, 2015
![Page 1: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/1.jpg)
ISC15 The Road to Application Performance on Intel Xeon Phi
July 16, 2015
SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos Rosales
1
![Page 2: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/2.jpg)
SIMD
2
Single Instruction Multiple Data (SIMD) Data Registers
Intel Cilk Plus SIMD Directive Declaration Examples OpenMP SIMD Directive Declaration SIMD loop SIMD CilkPlus OpenMP SIMD mapping Alignment & Elemental Functions Alignment Beyond Present Directives
![Page 3: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/3.jpg)
SIMD
3
Single Instruction Multiple Data (SIMD) Data Registers Playground
Intel Cilk Plus SIMD Directive Declaration Examples OpenMP SIMD Directive Declaration SIMD loop SIMD CilkPlus OpenMP SIMD mapping Alignment & Elemental Functions Alignment Beyond Present Directives
![Page 4: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/4.jpg)
What to consider about SIMD
• Known as vectorization in the scientific community.
• Speed kills
  – It was the speed of microprocessors that killed the Cray vector story in the 90’s.
  – We are rediscovering how to use vectors.
  – Microprocessor vectors were 2 DP words long for many years.
• We live in a parallel universe
  – It’s not just about parallel SIMD; we also live in a silky environment of thread tasks and MPI tasks.
4
![Page 5: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/5.jpg)
What to make of this?
• SIMD registers are getting wider now, but there are other factors to consider.
  – Caches: may be non-coherent, possibly 9 layers of memory later
  – Alignment: avoid cache-to-register hiccups
  – Prefetching: MIC needs user intervention here
  – Data arrangement: AoS vs SoA, gather, scatter, permutes
  – Masking: allows conditional execution, but you get less bang for your buck
  – Striding: 1 is best
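The data-arrangement and striding points can be illustrated with a minimal sketch in C. The type and function names (PointAoS, PointsSoA, sum_x_aos, sum_x_soa) are hypothetical and not from the slides; the sketch only contrasts the memory access pattern of the two layouts.

```c
#include <stddef.h>

/* Array of Structures (AoS): x, y, z of one point are adjacent,
   so a loop over x alone strides by 3 doubles. */
typedef struct { double x, y, z; } PointAoS;

/* Structure of Arrays (SoA): all x values are contiguous (stride 1). */
typedef struct { double *x, *y, *z; } PointsSoA;

/* Sum of x over an AoS layout: stride-3 access. */
double sum_x_aos(const PointAoS *p, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += p[i].x;
    return s;
}

/* Sum of x over an SoA layout: unit-stride access. */
double sum_x_soa(const PointsSoA *p, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += p->x[i];
    return s;
}
```

Both functions return the same value; the SoA loop reads consecutive doubles, which is the stride-1 case the slide calls best.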
5
SIMD Lanes
![Page 6: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/6.jpg)
SIMD
6
Single Instruction Multiple Data (SIMD) Evolution of SIMD Hardware Data Registers Instruction Set Overview, AVX
Intel Cilk Plus SIMD Directive Declaration Examples OpenMP SIMD Directive Declaration SIMD loop SIMD CilkPlus OpenMP SIMD mapping Alignment & Elemental Functions Alignment Beyond Present Directives
![Page 7: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/7.jpg)
Intel CilkPlus
• #pragma simd
  – Forces SIMD operation on loops
• Array Notation → data arranged appropriately for SIMD
  a[index:count]                   starts at index, ends at index+count-1 (also index:count:stride)
  a[i:n] = b[i-1:n] + b[i+1:n]     (think single-line SIMD, heap arrays; subarrays are optimized away)
  e[:] = f[:] + g[:]               (entire array, heap or stack)
  r[:] = s[i[:]], r[i[:]] = s[:]   (gather, scatter)
  func(a[:])                       (scalar function: applied by element; SIMD-enabled function: SIMD)
  if (5 == a[:]) result[:] = 0     (works with conditionals)
• SIMD-Enabled Functions: element → vector function
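Array sections take the form start:count, so a section of count elements ends at start+count-1. As a rough sketch, the section statements on this slide correspond to the plain C loops below. The function names are hypothetical, and the stencil assumes both operands are sections of length n.

```c
#include <stddef.h>

/* Loop form of  a[i:n] = b[i-1:n] + b[i+1:n];
   The caller must ensure i >= 1 and that b holds at least i+n+1 elements. */
void section_add(double *a, const double *b, size_t i, size_t n) {
    for (size_t k = 0; k < n; k++)
        a[i + k] = b[i - 1 + k] + b[i + 1 + k];
}

/* Loop form of  r[:] = s[idx[:]]  (gather) for arrays of length n. */
void gather(double *r, const double *s, const int *idx, size_t n) {
    for (size_t k = 0; k < n; k++)
        r[k] = s[idx[k]];
}

/* Loop form of  r[idx[:]] = s[:]  (scatter) for arrays of length n. */
void scatter(double *r, const double *s, const int *idx, size_t n) {
    for (size_t k = 0; k < n; k++)
        r[idx[k]] = s[k];
}
```

The loops are written in standard C so they compile without Cilk Plus support; a compiler may still vectorize them on its own.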
7
![Page 8: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/8.jpg)
SIMD pragma
– Instructs the compiler to create SIMD operations for iterations of the loop.
– Reasons for vectorization failure: too many pointers, complicated indexing … (ivdep is only a hint)

void do2(double a[n][n], double b[n][n], int end){
  #pragma simd
  for (int i=0 ; i<end ; i++) {
    a[i][0] = (b[i][0] - b[i+1][0]);
    a[i][1] = (b[i][1] - b[i+1][1]);
  }
}

ivdep and vector always don’t work here. (The Fortran code vectorizes.)
Without the pragma, -vec-report=2 was helpful:
remark #15541: outer loop was not auto-vectorized: consider using SIMD directive
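For comparison, here is a sketch of the same loop with the OpenMP 4.0 spelling of the directive. The fixed extent N is an assumption standing in for the slide's n. Without OpenMP SIMD support enabled (for example -fopenmp-simd with GCC or Clang), the pragma is ignored and the loop runs as ordinary scalar code.

```c
#define N 8  /* hypothetical fixed extent, standing in for the slide's n */

/* Differences of adjacent rows in the first two columns.
   Requires end < N, because row i+1 is read.
   The pragma asserts that a and b do not overlap. */
void do2(double a[N][N], double b[N][N], int end) {
    #pragma omp simd
    for (int i = 0; i < end; i++) {
        a[i][0] = b[i][0] - b[i + 1][0];
        a[i][1] = b[i][1] - b[i + 1][1];
    }
}
```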
![Page 9: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/9.jpg)
SIMD Enabled Functions
• SIMDizable Functions:

double fun1(double r, double s, double t);
double fun2(double r, double s, double t);
…
void driver (double R[N], double S[N], double T[N]){
  for (int i=0; i<N; i++){
    A[i] = fun1(R[i],S[i],T[i]);
    B[i] = fun2(R[i],S[i],T[i]);
  }
}
![Page 10: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/10.jpg)
SIMD Enabled Functions
• Can be invoked with scalar or vector arguments.
• Use array notation with the SIMD version (optimized for vector width).

__declspec(vector) double fun1(double r, double s, double t);
__declspec(vector) double fun2(double r, double s, double t);
…
void driver (double R[N], double S[N], double T[N]){
  A[:] = fun1(R[:],S[:],T[:]);
  B[:] = fun2(R[:],S[:],T[:]);
}

Note: __attribute__((vector)) can be used in place of __declspec(vector).

// The function is written for a single-element operation;
// in a parallel context (CilkPlus), passing an array invokes the vector version.
![Page 11: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/11.jpg)
SIMD Enabled Functions
• Vector attribute/declspec decorations generate scalar and SIMD versions with:

Syntax:
  __attribute__((vector (clauses))) function_declaration
  __declspec( vector(clauses)) function_declaration

Clauses:
  vectorlength(n)       vector length
  linear(list : step)   scalar list variables are incremented by step
  uniform(list)         (same) values are broadcast to all iterations
  [no]mask              generate a masked vector version
![Page 12: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/12.jpg)
SIMD and Threads
• Cilk’s “los tres amigos”
  – cilk_for
  – cilk_spawn
  – cilk_sync
• Cilk loops are SIMDized and invoke multiple threads.
• Functions use their SIMD form in CilkPlus loops.
12
![Page 13: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/13.jpg)
![Page 14: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/14.jpg)
OpenMP SIMD
• First appeared in OpenMP 4.0 (2013)
• Appears as
  – simd
  – do simd / for simd
  – declare simd
• SIMD refinements in OpenMP 4.1, ~2015.
14
![Page 15: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/15.jpg)
SIMD
• OMP directive SIMDizes a loop

Syntax (Fortran):  !$OMP SIMD [clause[[,] clause] ... ]
Syntax (C/C++):    #pragma omp simd [clause[[,] clause] ... ]

Clauses:
  safelen(n)            maximum number (n) of iterations in a SIMD chunk
  linear(list : step)   scalar list variables are incremented by step; loop iterations incremented by (vector length)*step
  aligned(list : n)     uses aligned (by n bytes) moves on listed variables
  collapse(n), lastprivate(list), private(list), reduction(operator : list)
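A minimal sketch of the simd directive with a reduction clause, in C. The function name dot is hypothetical. If the compiler is not told to honor OpenMP SIMD directives, the pragma is ignored and the result is unchanged.

```c
/* Dot product of x and y over n elements.
   The reduction clause gives each SIMD lane a private partial sum
   that is combined into sum after the loop. */
double dot(const double *x, const double *y, int n) {
    double sum = 0.0;
    #pragma omp simd reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

Because the lanes accumulate separately, the floating-point summation order can differ from the scalar loop, so results may differ in the last bits for general data.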
![Page 16: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/16.jpg)
SIMD + Worksharing Loop
• OMP directive workshares and SIMDizes a loop

Syntax:
  !$OMP DO SIMD [clause[[,] clause] ... ]
  #pragma omp for simd [clause[[,] clause] ... ]

Clauses:
  any DO clause: data-sharing attributes, nowait, etc.
  any SIMD clause

Creates a SIMD loop that uses chunks containing multiples of the vector size. Remaining iterations are distributed “consistently”. No scheduling details are given.
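A sketch of worksharing plus SIMD in C. It uses the combined parallel for simd construct so that the example is complete without an enclosing parallel region; the function name square_all is hypothetical. Compiled without OpenMP, the pragma is ignored and the loop runs serially.

```c
/* Element-wise square of in[0..n-1] into out[0..n-1].
   Iterations are divided among threads, and each thread's share
   is executed with SIMD instructions. */
void square_all(double *out, const double *in, int n) {
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        out[i] = in[i] * in[i];
}
```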
![Page 17: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/17.jpg)
SIMD Enabled Functions
• OMP directive generates scalar and SIMD versions with:

Syntax (Fortran):
  !$OMP DECLARE SIMD(routine-name) [clause[[,] clause]... ]

Clauses:
  aligned(list : n)        uses aligned (by n bytes) moves on listed variables
  inbranch / notinbranch   function is always [or never] called from inside a conditional
  linear(list : step)      scalar list variables are incremented by step; loop iterations incremented by (vector length)*step
  simdlen(n)               vector length
  uniform(list)            listed variables have an invariant value
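The C form of the directive is placed directly before the function declaration. The sketch below is illustrative only: the names dist2 and dist2_all are hypothetical, and the uniform and notinbranch clauses are one plausible choice for this function.

```c
/* Element function: squared distance from a fixed point (xp, yp).
   uniform: xp and yp have the same value in every SIMD lane.
   notinbranch: the function is never called from inside a conditional. */
#pragma omp declare simd uniform(xp, yp) notinbranch
double dist2(double x, double y, double xp, double yp) {
    return (x - xp) * (x - xp) + (y - yp) * (y - yp);
}

/* Caller: the simd loop lets the compiler use the vector version of dist2. */
void dist2_all(double *r, const double *x, const double *y,
               double xp, double yp, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        r[i] = dist2(x[i], y[i], xp, yp);
}
```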
![Page 18: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/18.jpg)
![Page 19: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/19.jpg)
CilkPlus → OpenMP Mapping

CilkPlus                        OpenMP
• SIMD (on loop)                • SIMD (on loop)
  – Reduction                     – Reduction
  – Vector length                 – Vector length
  – Linear (increment)            – Linear (increment)
  – Private, Lastprivate          – Private, Lastprivate

e.g. (Fortran)
  CilkPlus:  !dir$ simd reduction(+:mysum) linear(j:1) vectorlength(4)
             do…; mysum=mysum+j; j=fun(); enddo
  OpenMP:    !$omp simd reduction(+:mysum) linear(j:1) safelen(4)
             do…; mysum=mysum+j; j=fun(); enddo
![Page 20: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/20.jpg)
CilkPlus OpenMP SIMD Differences

CilkPlus
• SIMD (on loop)
  – firstprivate
  – vectorlengthfor
  – [no]vecremainder
  – [no]assert
• #pragma cilk grainsize

OpenMP
• SIMD (on loop)
  – aligned(var_list : bsize)
  – collapse
  – schedule(kind, chunk)
20
![Page 21: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/21.jpg)
CilkPlus SIMD-Enabled / OMP Declare SIMD Differences

CilkPlus
• vector clauses
  – vectorlength
  – linear
  – uniform
  – [no]mask
  – processor(cpuid)
  – vectorlengthfor

OpenMP
• declare simd
  – simdlen
  – linear
  – uniform
  – inbranch/notinbranch
  – aligned
21
![Page 22: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/22.jpg)
Alignment
• Memory Alignment
  – Allocation alignment
• C/C++
  – dynamic: memalign-style allocation routines
  – static: __declspec(align(64)) declaration
• Fortran
  – dynamic: !dir$ attributes align: 64 :: var
  – static: !dir$ attributes align: 64 :: var
  – compiler: -align array64byte
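As a portable sketch of allocation alignment in C, the example below uses standard C11 facilities (aligned_alloc and _Alignas) rather than the compiler-specific forms listed on the slide. The helper names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Static storage placed on a 64-byte boundary (C11 _Alignas). */
static _Alignas(64) double table[1024];

/* Accessor for the statically aligned array. */
double *static_table(void) {
    return table;
}

/* Allocate room for n doubles on a 64-byte boundary.
   aligned_alloc expects the size to be a multiple of the alignment,
   so the byte count is rounded up. Release the memory with free(). */
double *alloc_aligned64(size_t n) {
    size_t bytes = n * sizeof(double);
    bytes = (bytes + 63) / 64 * 64;
    return aligned_alloc(64, bytes);
}

/* Return 1 if p sits on a 64-byte boundary, 0 otherwise. */
int is_aligned64(const void *p) {
    return ((uintptr_t)p % 64) == 0;
}
```

Aligned storage by itself does not tell the compiler about the alignment at the point of use; that is what the access-side directives and the aligned clauses are for.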
22
![Page 23: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/23.jpg)
Alignment (CilkPlus)
• Memory Alignment
  – Access Description
• C/C++
  – loop: #pragma vector aligned (all variables)
  – cilk_for vars: __assume_aligned(var, size)
  – pointer attribute: __attribute__((align_value (size)))
• Fortran
  – dynamic: !dir$ attributes align: 64 :: var (allocatable var)
  – static: !dir$ attributes align: 64 :: var (stack var)
  – compiler: -align array64byte
23
![Page 24: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/24.jpg)
Alignment (OpenMP)
• Memory Alignment
  – No API functions, no separate construct
  – The declare simd and simd directives have aligned clauses
24
![Page 25: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/25.jpg)
Prefetch
• Prefetch distance can be controlled via compiler options and pragmas

  #pragma prefetch var:hint:distance

• inner loops
• may be important to turn off prefetch
• available for Fortran
25
![Page 26: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/26.jpg)
What do developers need to control at the directive level?
• Caches: locality of data
• Alignment: avoid cache-to-register hiccups
• Prefetching: hiding latency (not available with OMP)
• Rearranging data: characterizing data structure (Kokkos)
• Masking: allows conditional execution, but you get less bang for your buck
• Striding: characterized data structure, (restrict)
26
![Page 27: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/27.jpg)
SIMD
27
Single Instruction Multiple Data (SIMD) Evolution of SIMD Hardware Data Registers Instruction Set Overview, AVX
Intel Cilk Plus SIMD Directive Declaration Examples OpenMP SIMD Directive Declaration SIMD loop Alignment Beyond Present Directives
![Page 28: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/28.jpg)
Vector Compiler Options
• The compiler looks for vectorization opportunities at optimization level -O2.
• Use the architecture option -x<simd_instr_set> to ensure the latest vectorization hardware/instruction set is used.
• Confirm with the vector report: -vec-report=<n>, n = “verboseness”
• To get assembly code (myprog.s): -S
• Rough vectorization estimate: run with and without vectorization (-no-vec)
28
![Page 29: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/29.jpg)
Vector Compiler Options (cont.) • Alignment options here. • Inlining here.
29
![Page 30: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/30.jpg)
Alignment
• Alignment of data and data structures can affect performance. For AVX, alignment to 32-byte boundaries (4 double-precision words) allows a single reference to a cache line for moving 4 DP words into the registers (SIMD support). For MIC, alignment is 64 bytes.
• Compilers are great at detecting alignment and peeling off a few iterations before working on a sustained alignment within a loop body.
30
(Aligned data can use the more efficient movdqa instruction, rather than the less efficient movdqu instruction.)
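The cache-line argument can be checked with a little address arithmetic. The sketch below assumes a 64-byte cache line; the function name lines_touched is hypothetical.

```c
#include <stdint.h>

#define CACHE_LINE 64  /* bytes; assumed cache-line size */

/* Number of cache lines touched by a load of `width` bytes
   that starts at byte address `addr` (width must be > 0). */
unsigned lines_touched(uintptr_t addr, unsigned width) {
    uintptr_t first = addr / CACHE_LINE;
    uintptr_t last  = (addr + width - 1) / CACHE_LINE;
    return (unsigned)(last - first + 1);
}
```

A 32-byte AVX load that starts on a 32-byte boundary always stays within one line, while one starting at byte 40 covers bytes 40 to 71 and touches two lines.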
![Page 31: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/31.jpg)
Alignment
31
[Figure, upper panel "32-byte Aligned": four loads of 4 DP words from Cache Line 0 to Cache Line 3 into the registers; single cache access for 4 DP words.]
[Figure, lower panel "Non-Aligned": four loads of 4 DP words straddling Cache Line 0 to Cache Line 4 into the registers; access across cache lines for 4 DP words.]
![Page 32: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/32.jpg)
Vector Align
• Unaligned accesses are slower.
  – Non-sequential across “bus”.
  – Cross cache line boundary.
• #pragma vector aligned or !DEC$ vector aligned

for(i=0; i<loops; i++)
  #pragma vector aligned
  for(j=0;j<N-i;j++) a[j]=b[j]+c[j];

0.75 CP/Op without pragma*
0.50 CP/Op with pragma*
*When executed without -xSSE4.1 on Westmere.

Alignment can be forced:
  C: memalign(XXbyte,size)
  F90: use compiler option -align arrayXXbyte
![Page 33: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/33.jpg)
Inlining
• Function calls within a loop can prevent vectorization.
  – Inlining can often overcome this problem, e.g.

file main.c:
  ...
  for(i=0; i<nx; i++){
    x = x0 + i*h;
    sum = sum + do_r2(x,y, xp,yp);
  }
  ...

file funs.c:
  double do_r2(double x, double y, double xp, double yp){
    double r2;
    r2 = (x-xp)*(x-xp) + (y-yp)*(y-yp);
    return r2;
  }

• Since the call and the function are in different files, inlining and vectorization don’t occur. Use the interprocedural optimization option (-ipo) to inline and vectorize.
• If the call and the function are within the same unit (file), inlining and vectorization are performed at -O2 optimization and higher.
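A sketch of the same-file case: with do_r2 defined in the translation unit that calls it, the compiler can inline the call and then consider the loop for vectorization. The wrapper sum_r2 and its parameter list are hypothetical; the slide shows only the loop fragment.

```c
/* Defined in the same file as its caller, so it is visible for inlining. */
static inline double do_r2(double x, double y, double xp, double yp) {
    return (x - xp) * (x - xp) + (y - yp) * (y - yp);
}

/* Sum of squared distances from (xp, yp) to nx points on the line
   y = const, at x = x0, x0 + h, x0 + 2h, ... */
double sum_r2(double x0, double h, double y, double xp, double yp, int nx) {
    double sum = 0.0;
    for (int i = 0; i < nx; i++) {
        double x = x0 + i * h;
        sum += do_r2(x, y, xp, yp);
    }
    return sum;
}
```

Whether the loop is actually vectorized still depends on the compiler and its options; the vector report is the way to confirm it.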
33
![Page 34: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/34.jpg)
Inlining
file main.c:
  ...
  for(i=0; i<nx; i++){
    x = x0 + i*h;
    sum = sum + do_r2(x,y, xp,yp);
  }
  ...

file funs.c:
  double do_r2(double x, double y, double xp, double yp){
    double r2;
    r2 = (x-xp)*(x-xp) + (y-yp)*(y-yp);
    return r2;
  }

Inlining       Vectorization     Time (ms)
not inlined    not vectorized    1.55
inlined        not vectorized    0.44
inlined        vectorized        0.056
34
![Page 35: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/35.jpg)
SIMD END
• Questions • Discussion • …
35
![Page 36: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/36.jpg)
some Slides from 2012 Tutorial (kfm)
36
![Page 37: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/37.jpg)
SIMD Hardware (for Vectorization)

• Optimal vectorization requires concerns beyond the SIMD unit!
  – Operations: requires elemental (independent) operations (SIMD operations)
  – Registers: alignment of data on 64, 128, or 256 bit boundaries might be important
  – Cache: access to elements in caches is fast, access from memory is much slower
  – Memory: store vector elements sequentially for fastest aggregate retrieval

SAXPY operation:

$$
\begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ \vdots \\ z_n \end{bmatrix}
= a *
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{bmatrix}
+
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix}
$$

[Figure: the vectors x1, x2, x3, … xn; y1, y2, y3, … yn; z1, z2, z3, … zn and the scalar a, shown across Memory, Cache and Registers.]
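The operation z = a*x + y maps onto a single unit-stride loop. The following C sketch uses double precision (strictly a DAXPY rather than a SAXPY); the function name axpy and the use of restrict are assumptions, not taken from the slides.

```c
/* z[i] = a * x[i] + y[i] for i = 0 .. n-1.
   restrict promises that x, y and z do not overlap, which removes
   a common obstacle to automatic vectorization. */
void axpy(int n, double a, const double *restrict x,
          const double *restrict y, double *restrict z) {
    for (int i = 0; i < n; i++)
        z[i] = a * x[i] + y[i];
}
```

Each iteration is independent and the elements are stored sequentially, which matches the "Operations" and "Memory" points in the list above.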
37
![Page 38: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/38.jpg)
SIMD Processing -- Vectorization
• Vectorization, or SIMD* processing, allows a simultaneous, independent instruction on multiple data operands with a single instruction. (Loops over array elements often provide a constant stream of data.)

[Figure: instruction stream feeding the SIMD unit.]

Note: Streams provide vectors of length 2-16 for execution in the SIMD unit.
*SIMD = Single Instruction Multiple Data
38
![Page 39: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/39.jpg)
Vector Add -- AVX

• Only vector code will load multiple sets of data into registers simultaneously.
• Non-aligned sets do consume more Clock Periods (CPs).

Assembly instructions: vmovupd, vaddpd, vmovupd, vmovupd

[Figure: cache lines 1A, 2A, … 128A and 1D, 2D, … 128D in the L1 data cache feed the AVX unit, which performs the add (+ =). One 256-bit register holds four 64-bit DP FP values.]
![Page 40: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/40.jpg)
Vectorization (on KNC)

void mult(double *a, double *b, double *c, int n){
  for(int i=0; i<n; i++) a[i]=b[i]+c[i];
}

subroutine mult(a, b, c, n)
  real*8 :: a(n),b(n),c(n)
  do i=1,n; a(i)=b(i)+c(i); enddo
end subroutine

[Figure: Scalar instructions: 8 add instructions for 8 element pairs (b0 … b7, c0 … c7 into a0 … a7). Vector instruction: 1 vaddpd instruction for 8 element pairs.]
![Page 41: SIMD: CilkPlus and OpenMPISC15 The Road to Application Performance on Intel Xeon Ph i July 16, 2015 SIMD: CilkPlus and OpenMP Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos](https://reader034.fdocuments.us/reader034/viewer/2022042202/5ea23111a719eb7e58008517/html5/thumbnails/41.jpg)
Compiler Directives: Hints and Coercion
• alloc_section
• distribute_point
• inline, noinline, and forceinline
• ivdep
• loop_count
• memref_control
• novector
• optimize
• optimization_level
• prefetch/noprefetch
• simd
• unroll/nounroll
• unroll_and_jam/nounroll_and_jam
• vector
41