Unstructured Computations on Knights
Landing Architecture
Mohammed A. Al Farhan and David E. Keyes
Extreme Computing Research CenterKing Abdullah University of Science and Technology
Intel Extreme Performance Users Group (IXPUG)Middle East Conference 2018 at KAUSTApril 22-25, 2018 • Thuwal, Saudi Arabia
April 24, 2018
Highlights of the Contributions
Address the classical tension between unstructured data (indirect addressing) andhighly structured architectures (vector instructions)
Algorithm: nonlinearly implicit pseudo-transient Newton-Krylov-Schwarz solver
Application: computational aerodynamics (Gordon Bell winning NASA legacy code)
Architecture: Intel Xeon Phi “Knights Landing”
AVX-512CD instructions provide new solutions for conflicting objectives
Independence of writes versus memory locality
Finest-grain implementation of PETSc-FUN3D to exploit language features
Gains in wall-clock time 2.9x over scalar compilation for strong thread scaling
Distributed-memory not addressed here; proven for two decades in weak scaling
Gains in energy-to-solution for a given wall-clock time up to 1.7x
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 1 / 39
Outline
1 Application – Computational Aerodynamics
2 Shared-memory Optimizations and TuningData-level Parallelism
3 Performance Evaluation and Results
4 Summary and Reflections
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 2 / 39
Fully Unstructured Navier-Stokes in 3D
An unstructured tetrahedral mesh Euler and Navier-Stokes research code, which isclosely related to the export-controlled state-of-the-practice from NASA LangleyResearch Center
For over two decades FUN3D has been under active development for modeling fluidflow, and design optimization of airplanes, automobiles, and submarines with anumber of vertices up to 10 billion
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 3 / 39
PETSc-FUN3D
1995
1999Gordon Bell Special Prize
2005Adaptive Linear Solver
2016KNC/HSX/BDX
2017KNL/SKY
2018GPU
2001Parallel I/O
2010IBM BlueGene/P
2015SNB/IVB
MIMD/SPMD
1996 2002
SIMD/SIMT
2009
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 4 / 39
Edge-based Loop Kernel – Sequential
1: for all e ∈ array of edges do2: Get the index of the right node of e3: Get the index of the left node of e4: Read the flow variables from both endpoints of e5: Compute the flux and the residual6: Perform “write-back” operations to update the flow variables of e7: end for
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 5 / 39
Spatial and Temporal Locality of Reference
Left endpoints of the edges are ordered in ascending orderI The right endpoints and the edges’ components are ordered accordingly
Traversing the edges index table is done through one iterator pointer that issequentially incremented
A kernel loop that iterates over the stencil data items is essentially transformedfrom a loop over the edges into a loop over the vertices, in which the iterativetraversing is linearly based upon the left endpoints
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 6 / 39
Geometric Data LayoutTree of Struct-of-Arrays (SoA)
*edges(struct)
nedges (size t)
*nodes(struct) *n0 (unsigned int)
*n1 (unsigned int)
*normals(struct)
*x (double)
*y (double)
*z (double)
*ln (double)
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 7 / 39
Residual and Gradient Data LayoutFROM Array-of-Structs-of-Arrays (AoSoA) TO Array-of-Structs-of-Strided-Arrays
struct {X: u v w p ......... u v w p
(A) Y: u v w p ......... u v w pZ: u v w p ......... u v w p
} Gradient;
struct {(B) Q: u v w p ......... u v w p
} Residual;
Alignment and padding to 64-bytes cache line boundaries
Contiguously adjust every 4 DoF next to each other (u, v, w, p)
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 8 / 39
Memory-aware Allocation
Explicit heap allocator based on how frequent the data are being accessedI Intel memkind heap manager library and jemalloc library
F hbw posix memalign psize()
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 9 / 39
Thread-level Parallelism
Edges workloads is partitioned using multilevel k-way partitioning scheme of METIS
METIS searches for an adequate load balance across the OpenMP threads
It minimizes the aggregate cross edges’ weight (i.e., reduce the edge-cut)
Reverse Cuthill-McKee (RCM) is used for reordering the vertices
Work replication – METIS distribution conserves thread safety without the need ofatomic
Only the “master” thread is allowed to perform the write-back operation – “OwnerCompute”
The METIS strategy of redundancy in computations purges the overheads of usingglobal synchronizing barrier between the working threads
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 10 / 39
Edge-based Loop – Optimized and Threaded
1 #pragma omp parallel
2 {
3 const uint32_t t = omp_get_thread_num();
4 for(uint32_t i = ie[t]; i < ie[t+1]; i++){
5 /*Load and compute*/
6 if(parts[n0[i]] == t){/*Write-back into v[n0[i]]*/}
7 if(parts[n1[i]] == t){/*Write-back into v[n1[i]]*/}
8 }
9 }
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 11 / 39
Intel AVX-512 Instruction Set Architecture
Knights Landing
Skylake
AVX512F: Foundation
AVX512CD: Conflict Detection
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 12 / 39
Intel AVX-512 Instruction Set Architecture
Knights Landing
Skylake
AVX512ER: Exponential and Reciprocal
AVX512PF: Prefetching
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 13 / 39
Intel AVX-512 Instruction Set Architecture
Knights Landing
Skylake
AVX512VL: Vector Length
AVX512DQ: Doubleword and Quadword
AVX512BW: Byte and Word
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 14 / 39
Vectorizing Edge-based Loop Kernels
Read: Load and gather instructions
Arithmetic: FMA (vfmadd/vfmsub/vfnmadd)
Control-flow to Data-flow: Masking instructions
Square Root/Division: Reciprocal instructionsI√x: mm512 mul pd( mm512 rsqrtXX pd(x),x)
I cx : mm512 mul pd( mm512 rcpXX pd(x),c)
Prefetching: Software gather prefetching instructions
Write-back: Conflict detection and scatter instructions
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 15 / 39
Conflict Detection Instructions
01234567
Conflict-free Array
00011122
Array with Conflicts
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 16 / 39
Conflict Detection Instructions
01234567
Conflict-free Array
00011122
Array with Conflicts
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 17 / 39
Conflict Detection Instructions
01234567
Conflict-free Array
0 0 00 0 10 0 21 1 01→
1→
11 1 22 2 02 2 1
Array with Conflicts
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 18 / 39
Conflict Detection Instructions1 Generate an initial mask based on the thread ID and node index
2 Scan the SIMD lane to identify the conflicted data
3 Generate a mask that separates a conflict-free data subset
4 Perform a safe gather mask operation (load) to generate registers with contiguousdata items
5 Update the operands with multiple FMA instructions (number of double precisionarithmetics)
6 Perform a safe scatter mask operation (write-back) to sprinkle the updated dataitems over the memory addresses based on their indices
7 Perform SIMD boolean operation to “mask out” the already written indices fromthe mask (swizzling)
8 Repeat the aforementioned steps again on all of the remaining subsets within theSIMD lane
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 19 / 39
Vectorized Edge-based Loop Kernel1 #pragma omp parallel
2 {
3 const uint32_t t = omp_get_thread_num();
4 const uint32_t l = ie[t+1] - ((ie[t+1]-ie[t]) % 8);
5 for(uint32_t i = ie[t]; i < l; i += 8){
6 /*Load and compute on the SIMD lane elements*/
7
8 /*The Write-back inner loop*/
9 _mm512_cmpeq_epi32_mask(/*...*/);
10 do {
11 _mm512_mask_conflict_epi32(/*...*/);
12 _mm512_broadcastmw_epi32(/*...*/);
13 _mm512_mask_testn_epi32_mask(/*...*/);
14 /*Gather, compute, and scatter*/
15 _mm512_kxor(/*...*/);
16 } while(/*...*/);
17 }
18 /*Peel and remainder loop*/
19 for(uint32_t i = l; i < ie[t+1]; i++){
20 /*load and compute*/
21 if(parts[n0[i]] == t){/*Write-back into v[n0[i]]*/}
22 if(parts[n1[i]] == t){/*Write-back into v[n1[i]]*/}
23 }
24 }
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 20 / 39
Fine-grained Data Partitioning
Improve the vectorization efficiency and increase the size of the independent datasubset within a SIMD lane
Bucket sort routine based on a variant of the edge coloring algorithm
Aim to:I Extract independent subsets of the edgesI Prune the overhead of the innermost CD loop
Extract at most 8 independent edges within a bucketI To avoid a potential disruption of the initial ordering for cache locality efficiency
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 21 / 39
Bucket Sort Based on Partial Coloring
Edges Bitmap0 1 2 3 4 5 6 7
0 0 11 0 2 1 1 1 1 1 1 0 0 Bucket 12 0 33 1 0 1 1 1 1 0 0 0 0 Bucket 24 1 25 1 3 1 1 1 1 0 0 0 0 Bucket 36 2 37 5 4 1 1 0 0 0 0 0 0 Bucket 4
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 22 / 39
Bucket Sort Based on Partial Coloring
1: Create an nedges per color array to track the colors2: Create a bitmap to track the endpoints coloring scheme3: for c← 0, k ← 0, i← 0 to nndges do4: Clear all of the bitmap bits5: for j ← i to nndges do6: Read n0[j], n1[j] bits from the bitmap → b0, b17: if b0 ∨ b1 is TRUE then8: CONTINUE9: end if
10: if nedges per color[c] = 8 then11: BREAK12: end if13: Set n0[j], n1[j] bits in bitmap14: Swap edge data from index j to k and vice versa15: Increment k and nedges per color[c] by 116: end for17: Increment c by 118: i ← k19: end for
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 23 / 39
Mesh size1
Vertices: 2,761,774
DoFs: 11,047,096
Edges: 18,945,809
1In 1999 Gordon Bell prize paper, this was considered to be a large mesh but it is hostedconveniently today within a single node
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 24 / 39
Experimental Design
Sample space: 20 independent experiments
Runtime:I API: RDSTC instructions and POSIX timestampsI Results summary: arithmetic mean
Memory bandwidth and FLOPS:I Bandwidth: MC/EDC RPQ and WPQ insertsI Flops: count the arithmetic operations manuallyI Results summary: harmonic mean
For every experiment, the source code is recompiled, and the memory and cacheare completely flushed
+/- standard deviation of the mean for each experimental sample
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39
Hardware Specifications
KNL-A KNL-B KNL-C HSX BDX SKYFamily x200 x200 x200 E5V3 E5V4 ScalableModel 7210 7210 7290 2699 2680 8176Socket(s) 1 1 1 2 2 2Cores 64 64 72 36 28 56GHz 1.30 1.30 1.50 2.30 2.40 2.10Watts/socket 215 215 245 145 120 165DDR4 (GB) 96 96 192 264 132 264Freq. Driver* I A A A I AMax GHz 1.50 1.30 1.50 2.30 3.30 2.10Governor** P C C O P OTurbo Boost × X X X X X
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 26 / 39
Flux Performance with Different KNL Modes
0
20
40
60
80
100
120
140
160
180
64 128 256
Tim
e (
Second)
Number of Threads
Chip: KNL-B
All2All/CAll2All/F
Quadrant/CQuadrant/F
Hemisphere/CHemisphere/F
SNC-2/CSNC-2/FSNC-4/CSNC-4/F
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 27 / 39
Optimizations of the Flux Kernel
0
80
160
240
320
400
64 128 256
Tim
e (
Second)
Number of Threads
Chip: KNL-A
BaselineOptimized/HBW-AVX512/HBW
IntrinsicsIntrinsics/HBW
Reordering/HBW
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 28 / 39
Optimizations of the Gradient Kernel
0
80
160
240
320
64 128 256
Tim
e (
Second)
Number of Threads
Chip: KNL-A
BaselineOptimized/HBW-AVX512/HBW
IntrinsicsIntrinsics/HBW
Reordering/HBWRestructuring/HBW
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 29 / 39
Strong Thread Scalability
0
4000
8000
12000
16000
20000
1 2 4 8 16 32 64 128 256
Tim
e (
Second)
Number of Threads
Chip: KNL-B
PETSc RoutinesFlux Kernel
Gradient KernelJacobian Matrix Construction
Preprocessing and SetupPseudo Time Step Kernel
Aerodynamics Forces Computation
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 30 / 39
Strong Thread Scalability (Flux Kernel)
16
64
256
1024
4096
1 2 4 8 16 32 64 128 256
Tim
e (
Second)
Number of Threads
Chip: KNL-B
100%
100%
100%
98%
98%
94%
88%54% 24%
1.0
2.1
4.0
7.8
15.7
30.1
56.168.5 61.8
Flux Routine
Ideal Scaling
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 31 / 39
Strong Thread Scalability (Gradient Kernel)
16
64
256
1024
4096
1 2 4 8 16 32 64 128 256
Tim
e (
Second)
Number of Threads
Chip: KNL-B
100%
100%
100%
100%
98%
88%72% 42% 22%
1.0
2.3
4.3
8.3
15.7
28.345.8 53.7 57.2
Gradient Routine
Ideal Scaling
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 32 / 39
Performance Speedup
0
10
20
30
40
50
60
70
164
128
256 1
64
128
256 1
64
128
256 1
64
128
256 1
64
128
256 1
64
128
256
Sp
eed
up
Number of Threads
Chip: KNL-B
ForcesTSSetupJacoGradFlux
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 33 / 39
Amdahl’s Law
0
1
2
3
4
1 2 4 8 16 32 64 128 256
Speedup
Number of Threads
Chip: KNL-B
Amdahl's LawActual Speedup
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 34 / 39
Memory Bandwidth and Flops PerformanceFlux Kernel Gradient Kernel
Threads 72 144 288 72 144 288GFlop/s 117 144 123 21 24 25
Flux Kernel Gradient KernelThreads 72 144 288 72 144 288
DD
R Read: GB/s 14 17 17 8 10 12Write: GB/s 7 9 9 0.2 0.4 1
HB
W Read: GB/s 40 51 45 43 53 54Write: GB/s 0.1 0.1 0.1 18 22 25
GFlop/s: 5% out of 3 TFlop/sGB/s:
I DRAM: 25% out of STREAM (77 GB/s (R), 36 GB/s (W))I MCDRAM: 17% out of STREAM (314 GB/s (R), 171 GB/s (W))
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 35 / 39
Flux Performance on x86 Hardware
0
40
80
120
160
200
240
280
320
360
18
36
72
14
28
56
64
128
256
72
144
288
28
56
112
Tim
e (
Secon
d)
Number of ThreadsSKXKNL-CKNL-BBDXHSX
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 36 / 39
Gradient Performance on x86 Hardware
0
40
80
120
160
200
240
18
36
72
14
28
56
64
128
256
72
144
288
28
56
112
Tim
e (
Secon
d)
Number of ThreadsSKXKNL-CKNL-BBDXHSX
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 37 / 39
PETSc-FUN3D Performance on x86 Hardware
0
1000
2000
3000
4000
5000
6000
18
36
72
14
28
56
64
128
256
72
144
288
28
56
112
Tim
e (
Secon
d)
Number of ThreadsSKXKNL-CKNL-BBDXHSX
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 38 / 39
Conclusion
Demonstrate several shared-memory optimizations to extract:1 Thread-level parallelism – Careful workload distributions and load balancing2 Data-level parallelism – Utilizing the capabilities of AVX-512 ISA
Achieve 2.9x speedup in the performance of the flux kernel relative to the baselinecode
Exhibit almost linear scalability up to the full core count of KNL (64 cores), andcontinued scalability with SMT
Maintain on Skylake roughly similar speedup with less power consumption [KNL:245 Watts; SKX: 330 Watts]
Future Direction −→ Port PETSc-FUN3D onto GPU
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 39 / 39
Conclusion
Demonstrate several shared-memory optimizations to extract:1 Thread-level parallelism – Careful workload distributions and load balancing2 Data-level parallelism – Utilizing the capabilities of AVX-512 ISA
Achieve 2.9x speedup in the performance of the flux kernel relative to the baselinecode
Exhibit almost linear scalability up to the full core count of KNL (64 cores), andcontinued scalability with SMT
Maintain on Skylake roughly similar speedup with less power consumption [KNL:245 Watts; SKX: 330 Watts]
Future Direction −→ Port PETSc-FUN3D onto GPU
M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 39 / 39
Q/A
Top Related