
OPTIMISATION OF COMPUTATIONAL FLUID DYNAMICS APPLICATIONS ON MULTICORE AND MANYCORE ARCHITECTURES

Ioan Corneliu Hadade

Rolls-Royce Vibration UTC
Department of Mechanical Engineering

Imperial College London

Submitted in part fulfilment of the requirements for the degree of Doctor of Philosophy of Imperial College London and the Diploma of Imperial College


DECLARATION OF ORIGINALITY

The work presented herein is based on the research carried out by the author at the Rolls-Royce Vibration UTC, part of the Department of Mechanical Engineering at Imperial College London. The author considers it to be original and his own except where explicitly stated otherwise.


COPYRIGHT DECLARATION

The copyright of this thesis rests with the author and is made available under a Creative Commons Attribution Non-Commercial No Derivatives licence. Researchers are free to copy, distribute or transmit the thesis on the condition that they attribute it, that they do not use it for commercial purposes and that they do not alter, transform or build upon it. For any reuse or redistribution, researchers must make clear to others the licence terms of this work.


ABSTRACT

This thesis presents a number of optimisations for mapping the underlying computational patterns of finite volume CFD applications onto the architectural features of modern multicore and manycore processors. Their effectiveness and impact are demonstrated in a block-structured and an unstructured code, each of a size representative of industrial applications, and across a variety of processor architectures that make up contemporary high-performance computing systems.

The importance of vectorization, and the ways in which it can be achieved, is demonstrated in both structured and unstructured solvers, together with the impact that the underlying data layout can have on performance. The utility of auto-tuning for ensuring performance portability across multiple architectures is demonstrated and used for selecting optimal parameters such as prefetch distances for software prefetching or tile sizes for strip mining/loop tiling. On the manycore architectures, running more than one thread per physical core is found to be crucial for good performance on processors with in-order core designs, but not required on out-of-order architectures. For architectures with high-bandwidth memory packages, their exploitation, whether explicit or implicit, is shown to be imperative for best performance.

The implementation of all of these optimisations led to application speed-ups ranging between 2.7X and 3X on the multicore CPUs and between 5.7X and 24X on the manycore processors.


"For from Him and through Him and to Him are all things. To Him be theglory forever. Amen."

— Romans 11:36


To my son, Isaac


ACKNOWLEDGEMENTS

I would like to thank my supervisor, Luca di Mare, for giving me the opportunity to undertake this PhD and for welcoming me into his research group. Despite his very busy schedule, Luca’s door was always open whenever I encountered difficulties in my work or needed any sort of advice. Furthermore, he always encouraged me to pursue my own direction in research and made sure there were always funding and opportunities for further training, for which I am thankful.

I would also like to thank Peter Higgs for keeping the group running smoothly by organising travel and hotels for conferences, purchasing servers and whatever else was required. I can’t imagine what my PhD experience would have been without him.

I have been very fortunate to work alongside brilliant colleagues such as (in no particular order): Feng Wang, Edouard Minoux, Ahmad Nawab, Mauro Carnevale, Max Rife, Fernando Barbarossa and Gan Lu. I have come to regard all of you as family and genuinely dread the moment when each of us goes their own way.

Besides my wonderful colleagues, I have also enjoyed spending time with Teddy Szemberg O’Connor, Maria Esperanza Barrera Medrano, Nefeli Dimela and Nicolas Alferez and will most certainly miss our entertaining discussions over lunch in the Senior Common Room.

Special thanks go to all of my Romanian friends: Emanuel Pop, Paul Pop, Remus Pop, Patric Fülop, Delia Babiciu, Oana Puscas, Radu Oprescu, Ozgür Osman, Oana Marin and Adrian Draghici. You have certainly brightened up my life and made both Edinburgh and London feel a little bit closer to home.

I am also very much indebted to Christopher Dahnken, Dheevatsa Mudigere and Michael Klemm of Intel Corporation, who have helped me out over the years with various things ranging from compiler intrinsics to OpenMP intricacies; to Agner Fog for the support provided in using the VCL library as well as his incredibly helpful optimisation manuals; and to Timothy Jones from the University of Cambridge for help and discussions on software prefetching. The work herein was also shaped by the discussions and advice received from Paolo Adami of Rolls-Royce plc, for which I am very grateful.

Many thanks go to my examiners, Paul Kelly of Imperial College London and Graham Pullan from the University of Cambridge, for their excellent and constructive feedback, which has shaped and significantly improved the final version of this manuscript. I have thoroughly enjoyed our discussions, which enabled me to identify strengths and weaknesses of my work that I had not previously considered.

My biggest thanks go to my wife, Iulia, and to my loving parents, Ioan and Cornelia Hadade, as well as my "adoptive" UK parents, Melanie and Kevin Smith. I am immensely blessed to have you in my life and look forward to spending more time with all of you and less time debugging code.

This research was supported by the Engineering and Physical Sciences Research Council and Rolls-Royce plc through the Industrial CASE Award 13220161.


DISSEMINATION

Parts of the work presented in this thesis have been disseminated to a wider audience through peer-reviewed publications and oral presentations. These are listed below and are up to date as of April 2018.

journal papers

[1] Ioan Hadade and Luca di Mare. “Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code”. In: Computer Physics Communications 205 (2016), pp. 32–47. issn: 0010-4655. doi: http://doi.org/10.1016/j.cpc.2016.04.006.

[2] Ioan Hadade, Feng Wang, Mauro Carnevale and Luca di Mare. “Some useful optimisations for unstructured computational fluid dynamics codes on multicore and manycore architectures”. Computer Physics Communications (accepted).

conference papers

[1] Ioan Hadade and Luca di Mare. “Exploiting SIMD and Thread-Level Parallelism in Multiblock CFD”. In: Supercomputing: 29th International Conference, ISC 2014, Leipzig, Germany, June 22-26, 2014. Proceedings. Springer International Publishing, 2014, pp. 410–419. isbn: 978-3-319-07518-1. doi: 10.1007/978-3-319-07518-1_26.


oral presentations

[1] Ioan Hadade and Luca di Mare. “Turbomachinery CFD on Modern Multicore and Manycore Architectures”. SIAM Conference on Parallel Processing for Scientific Computing, Paris, France, April 12-15, 2016. url: http://www.siam.org/meetings/pp16/.

[2] Ioan Hadade and Luca di Mare. “Turbomachinery CFD on Modern Multicore and Manycore Architectures”. 10 years of HPC at Imperial, London, UK, July 7th, 2016.

[3] Ioan Hadade and Luca di Mare. “Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code”. UK Turbulence Consortium Annual Review and First UKFN SIG "Multicore and Manycore Algorithms to Tackle Turbulent flow" Meeting, London, UK, Sep 5th, 2017.


CONTENTS

1 introduction
  1.1 Preamble
  1.2 Thesis Contributions
  1.3 Thesis Outline
2 background
  2.1 Trends in processor design
    2.1.1 Moore’s Law
    2.1.2 Frequency and Power
    2.1.3 Multicore and manycore architectures
    2.1.4 Levels of parallelism
    2.1.5 Amdahl’s Law
    2.1.6 The memory hierarchy
  2.2 The performance gap
  2.3 Implications for CFD codes
  2.4 Conclusions
3 experimental setup
  3.1 Processor Architectures
    3.1.1 Intel Xeon Sandy Bridge
    3.1.2 Intel Xeon Haswell
    3.1.3 Intel Xeon Broadwell
    3.1.4 Intel Xeon Skylake Server
    3.1.5 Intel Xeon Phi Knights Corner
    3.1.6 Intel Xeon Phi Knights Landing
  3.2 The Roofline Model
  3.3 Conclusions
4 optimisation of block-structured mesh applications
  4.1 Introduction
  4.2 Related Work
  4.3 Background
    4.3.1 Governing Equations
    4.3.2 Spatial discretization
    4.3.3 Time integration
    4.3.4 Computational kernels
    4.3.5 Configuration of compute nodes
  4.4 Optimisations
    4.4.1 Flow variable update
    4.4.2 Flux computations
  4.5 Results and Discussions
    4.5.1 Flow variable update
    4.5.2 Flux computations
    4.5.3 Performance modelling
    4.5.4 Architectural comparison
  4.6 Conclusions
5 optimisation of unstructured mesh applications
  5.1 Introduction
  5.2 Related Work
  5.3 Background
    5.3.1 Governing Equations
    5.3.2 Spatial Discretization
    5.3.3 Time integration
    5.3.4 Test case
    5.3.5 Computational kernels
    5.3.6 Configuration of compute nodes
  5.4 Optimisations
    5.4.1 Grid renumbering
    5.4.2 Vectorization
    5.4.3 Colouring
    5.4.4 Array of Structures
    5.4.5 Gather Scatter Optimisations
    5.4.6 Array of Structures Structure of Arrays
    5.4.7 Loop tiling
    5.4.8 Software Prefetching
    5.4.9 Multithreading
  5.5 Results and Discussions
    5.5.1 Effects of optimisations
    5.5.2 Performance scaling across a node
    5.5.3 Performance modelling
    5.5.4 Architectural Comparison
  5.6 Conclusions
6 conclusions
  6.1 Summary of findings
  6.2 Advice for CFD practitioners
  6.3 Limitations
  6.4 Future work
bibliography
Appendix
a appendix
  a.1 Results of auto-tuning prefetch distances


LIST OF FIGURES

Figure 1: Trends in microprocessors over the past 46 years. Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. Data between 2010-2015 collected by K. Rupp. Latest data between 2015-2017 collected and plotted by I. Hadade.

Figure 2: Schematics of the execution units in the Intel Sandy Bridge and Haswell microarchitectures. Updates in the Haswell architecture compared to Sandy Bridge are highlighted in orange. Figures courtesy of [82].

Figure 3: Comparison between peak floating point performance in double precision and peak memory bandwidth across Intel Xeon multicore CPUs (CPUs), NVIDIA Tesla GPUs (GPUs) and Intel Xeon Phi processors (Phis) spanning the last decade. Data courtesy of K. Rupp [77] with minor updates by I. Hadade.

Figure 4: Number of floating point operations in double precision per byte from off-chip memory across Intel Xeon multicore CPUs, NVIDIA Tesla GPUs and Intel Xeon Phi processors spanning the last decade. Data courtesy of K. Rupp [77] with minor updates by I. Hadade.

Figure 5: Example of rooflines for computational nodes based on the processor architectures presented in Section 3.1.

Figure 6: Computational domain consisting of five blocks and corresponding stitch lines.

Figure 7: Schematic of the cell-centred finite volume scheme on structured grids.

Figure 8: Vector load operation at aligned boundary for left state and load operation at unaligned boundary for right state.

Figure 9: Side effects of unaligned vector load operations leading to cache line split loads.

Figure 10: Vector load operation at aligned boundary for both left and right states followed by inter-register shuffles for correct positioning of operands.

Figure 11: Effects of optimisations for the kernel performing the flow variable update across different grid sizes and running on a single core. Results in GFLOPS on KNC 5110P for the reference runs on grid sizes 10^5 and 10^6 are 0.03 and 0.02. The symbol (u) stands for unaligned memory access.

Figure 12: Performance strong scaling of kernel executing the flow variable update on 10^6 grid cells. The symbol (u) stands for unaligned memory access while entries that include the symbol (N) implement NUMA optimisations (i.e. first touch). Results of runs with NUMA (N) optimisations are only applicable to the multicore CPUs and not to KNC.

Figure 13: Effects of optimisations on the kernel performing the flux computations on different grid sizes. Results in GFLOPS on KNC 5110P for the reference runs on grid sizes 10^5 and 10^6 are 0.05 and 0.04. The symbol (u) stands for unaligned memory access.

Figure 14: Performance strong scaling of kernel executing the flux computations as a result of optimisations on 10^6 grid cells. The symbol (u) stands for unaligned memory access.

Figure 15: Roofline visualisation of optimisations for single-core and full-node configurations. Key: Reference, Vectorization (OMP), Vectorization (Shuffle), Memory Opt.

Figure 16: Schematic of the cell-centred finite volume scheme on unstructured grids.

Figure 17: Unstructured mesh of intake with hexahedral elements at near-wall regions and prisms in the free stream domain.

Figure 18: Solution of ground vortex ingestion highlighting the vorticity in the direction normal to the ground.

Figure 19: Example of reducing the bandwidth (β) via Reverse Cuthill-McKee on mesh 1 (3x10^6 elements). Figure (a) highlights the initial bandwidth where β = 2808603 while figure (b) presents the improved bandwidth where β = 52319 as a result of renumbering.

Figure 20: Structure of Arrays (SoA) layout for cell-centred variables such as unknowns.

Figure 21: Array of Structures (AoS) layout for cell-centred variables such as unknowns including padding.

Figure 22: On-the-fly transposition from AoS to SoA of the unknowns on Advanced Vector Extensions (AVX)/Advanced Vector Extensions 2 (AVX2) architectures. Corresponding compiler intrinsics can be seen in Listing 13.

Figure 23: AoSSoA layout for storing the normals to the face in three dimensions.

Figure 24: Effects of optimisations on the performance of face-based loops (flux computations) on the Sandy Bridge (SNB E5-2650) system. Results are reported as GFLOPS and speed-up and were collected by running the application on 1 MPI rank (i.e. single core).

Figure 25: Effects of optimisations on the performance of face-based loops (flux computations) on the Broadwell (BDW E5-2680) system. Results are reported as GFLOPS and speed-up and were collected by running the application on 1 MPI rank (i.e. single core).

Figure 26: Effects of optimisations on the performance of face-based loops (flux computations) on the Skylake (SKL Gold 6140) system. Results are reported as GFLOPS and speed-up and were collected by running the application on 1 MPI rank (i.e. single core).

Figure 27: Effects of optimisations on the performance of face-based loops (flux computations) on the Knights Corner (KNC 7120P) system. Results are reported as GFLOPS and speed-up and were collected by running the application on 1 MPI rank (i.e. single core).

Figure 28: Effects of optimisations on the performance of face-based loops (flux computations) on the Knights Landing (KNL 7210) system. Results are reported as GFLOPS and speed-up and were collected by running the application on 1 MPI rank (i.e. single core) and in Quadrant/Cache mode.

Figure 29: Effects of optimisations on the performance of cell-based loops (update to primitive variables) reported as GFLOPS. Results were collected by running the application on 1 MPI rank (i.e. single core). The Knights Landing (KNL 7210) system was configured as Quadrant/Cache.

Figure 30: Effects of optimisations on the performance of cell-based loops (update to primitive variables) reported as speed-up relative to the reference implementation. Results were collected by running the application on 1 MPI rank (i.e. single core). The Knights Landing (KNL 7210) system was configured as Quadrant/Cache.

Figure 31: Effects of optimisations on the average time per solution update within a Newton-Jacobi iteration obtained through each optimisation. Results were collected by running the application on 1 MPI rank (i.e. single core). The Knights Landing (KNL 7210) system was configured as Quadrant/Cache.

Figure 32: Effects of optimisations on the speed-up per solution update within a Newton-Jacobi iteration obtained through each optimisation relative to the reference implementation. Results were collected by running the application on 1 MPI rank (i.e. single core). The Knights Landing (KNL 7210) system was configured as Quadrant/Cache.

Figure 33: Performance strong scaling of face-based loops (flux computations) on the two-socket multicore nodes.

Figure 34: Performance strong scaling of face-based loops (flux computations) on the Intel Xeon Phi Knights Corner 7120P co-processor.

Figure 35: Performance strong scaling of face-based loops (flux computations) on the Intel Xeon Phi Knights Landing 7210 processor.

Figure 36: Performance strong scaling of cell-based loop kernel (updates to primitive variables) on all processors.

Figure 37: Multicore rooflines of single core and full node configurations for both classes of computational kernels. Key: Reference, Vectorization, Gather/Scatter, Sw. Prefetch.

Figure 38: Manycore rooflines of single core and full node configurations for both classes of computational kernels. Key for KNC 7120P system and 1 core KNL 7210: Reference, Vectorization, Gather/Scatter, Sw. Prefetch, Multi-threading. Key for KNL 7210 (64 cores): Reference Cache Opt DDR Opt Cache Opt MCDRAM Opt MCDRAM & DDR.

Figure 39: Architectural comparison of the performance of face-based loops (flux computations) represented as GFLOPS at single core and full node concurrency and across all architectures (higher is better). The results presented for Knights Landing were obtained at full concurrency by interleaving the memory access between DDR and MCDRAM (i.e. DDR+MCDRAM).

Figure 40: Architectural comparison of the performance of cell-based loops (update to primitive variables) represented as GFLOPS at single core and full node concurrency and across all architectures (higher is better). The results presented for Knights Landing were obtained at full concurrency by interleaving the memory access between DDR and MCDRAM (i.e. DDR+MCDRAM).

Figure 41: Architectural comparison of the performance of the whole application represented as time per Newton-Jacobi iteration at single core and full node concurrency and across all architectures (lower is better). The results presented for Knights Landing were obtained at full concurrency by interleaving the memory access between DDR and MCDRAM (i.e. DDR+MCDRAM).

LIST OF TABLES

Table 1: Evolution of Single Instruction Multiple Data (SIMD) in Intel processors.

Table 2: Hardware and software configuration of the compute nodes used in this chapter. The SIMD Instruction Set Architecture (ISA) represents the latest vector instruction set architecture supported by the particular platform.

Table 3: Computational kernels representing face-based and cell-based loops used for demonstrating the implementation and impact of our optimisations.

Table 4: Hardware and software configuration of the compute nodes used in this chapter. The SIMD ISA represents the latest vector instruction set architecture supported by the particular platform.

Table 5: Jump in indices for mesh 1 after reordering the faces based on the two algorithms in Löhner et al. [49].

Table 6: Comparison between time to solution update measured in seconds of baseline and best optimised implementation on the multicore CPUs.

Table 7: Comparison between time to solution update measured in seconds of baseline and optimised versions with 4 hyperthreads and no software prefetching, software prefetching and no hyperthreading, and both software prefetching and 4 hyperthreads on the Intel Xeon Phi Knights Corner coprocessor. The speed-up is calculated relative to baseline results and the timings of the best optimisation (prefetch x 4 threads). On this architecture, we could only perform experimental runs on mesh 1 at full concurrency due to the 16GB memory limit.

Table 8: table

Table 9: Speed-up of flux computation kernels and whole solver (Newton-Jacobi) depending on prefetch distance on Sandy Bridge E5-2650 CPU.

Table 10: Speed-up of flux computation kernels and whole solver (Newton-Jacobi) depending on prefetch distance on Broadwell.

Table 11: Speed-up of flux computation kernels and whole solver (Newton-Jacobi) depending on prefetch distance on Skylake.

Table 12: Speed-up of flux computation kernels and whole solver (Newton-Jacobi) depending on prefetch distance on KNC with no hyperthreading.

Table 13: Speed-up of flux computation kernels and whole solver (Newton-Jacobi) depending on prefetch distance on KNL quadrant cache mode.

LISTINGS

Listing 1: i-direction sweep.

Listing 2: j-direction sweep.

Listing 3: Example of extra compiler hints needed for generating aligned vector loads and stores on 64-byte boundaries.

Listing 4: Unaligned SIMD loads on Knights Corner.

Listing 5: Unaligned SIMD stores on Knights Corner.

Listing 6: Register shuffle for aligned accesses on the i-face sweep with AVX.

Listing 7: Register shuffle for aligned accesses on the i-face sweep with AVX2.

Listing 8: Register shuffle for aligned accesses on the i-face sweep with IMCI.

Listing 9: Implementation of cache blocking by nesting both i and j sweeps together.

Listing 10: Example of a face-based kernel.

Listing 11: Example of a cell-based kernel vectorized with OpenMP 4.0 directives.

Listing 12: Example of a SIMD-friendly implementation of face-based kernels vectorized with OpenMP 4.0 directives.

Listing 13: Example of AVX/AVX2 compiler intrinsics kernel for in-register transposition from AoS to SoA of unknowns.

Listing 14: Example of Advanced Vector Extensions 512 (AVX-512) compiler intrinsics kernel for in-register transposition from AoS to SoA of unknowns.

Listing 15: Example of Initial Many Core Instructions (IMCI) compiler intrinsics kernel for in-register transposition from AoS to SoA of unknowns.

Listing 16: Integration of primitives for on-the-fly conversion between AoS and SoA data layouts in a face-based kernel.

Listing 17: Implementation of software prefetching in face-based loops.

ACRONYMS

AoS Array of Structures

AoSSoA Array of Structures Structure of Arrays


CFD Computational Fluid Dynamics

DSL Domain Specific Language

CPU Central Processing Unit

ISA Instruction Set Architecture

NASA National Aeronautics and Space Administration

SIMD Single Instruction Multiple Data

SoA Structure of Arrays

RCMK Reverse Cuthill-McKee

MPI Message Passing Interface

TLB Translation Look-aside Buffer

OpenMP Open Multi-Processing

IMCI Initial Many Core Instructions

AVX Advanced Vector Extensions

AVX2 Advanced Vector Extensions 2

AVX-512 Advanced Vector Extensions 512

KNC Knights Corner

KNL Knights Landing

SNB Sandy Bridge

BDW Broadwell

SKL Skylake

HSW Haswell

MUSCL Monotonic Upwind Scheme for Conservation Laws

MCDRAM Multi-Channel DRAM

NUMA Non-Uniform-Memory-Access


UMA Uniform-Memory-Access

ILP Instruction-level parallelism

DLP Data-level parallelism

TLP Thread-level parallelism

FMA Fused-Multiply-Add

SIMT Single-Instruction-Multiple-Thread

VSX Vector Scalar Extension

SSE Streaming SIMD Extensions

SMT Simultaneous Multithreading

SPMD Single Program Multiple Data

SM Streaming Multiprocessor

HBM2 High-Bandwidth Memory 2

HBM3 High-Bandwidth Memory 3

VPU Vector Processing Unit

VCL Vector Class Library

QPI Quick Path Interconnect

BIU Bus Interface Unit

CHA Cache Homing Agent


1 INTRODUCTION

1.1 preamble

The current trend in processor design is parallelism, which has led to the advent of multicore and manycore processors whilst effectively ending the so-called "free lunch" era in performance scaling [85]. Modern multicore and manycore processors now integrate 10-100 cores on the same die, wide vector units with associated instruction set extensions, and multiple back-end execution ports, along with deeper and more complex memory hierarchies. Consequently, achieving high performance on these architectures mandates the exploitation of all of the available on-chip parallelism as well as of the features of the memory system.

In the context of Computational Fluid Dynamics (CFD) applications, optimising codes to make full use of the architectural features of modern processors can result in considerable speed-ups [60], [26]. However, incorporating such optimisations in existing applications is no small task. First of all, the implementation of the underlying numerical methods may have to be reconsidered in order to make them more amenable to parallelism across different granularities. Secondly, accessing certain features of modern processors may mandate the use of specialised, non-portable code. This may in turn harm performance portability across the myriad of architectures available in high-performance computing systems unless suitable abstractions are found to address this.

For the general CFD practitioner, however, the issue remains of finding a trade-off between programming effort and gains in performance for their application, both in the short and in the long term. Thus, it is essential to expose the CFD community to the whole range of techniques available and their impact on realistic applications and modern processors.

This thesis presents such optimisations across two distinct CFD codes representing both the structured and unstructured paradigms and evaluates their performance on modern multicore and manycore processors.


1.2 thesis contributions

This thesis makes the following contributions:

• Demonstrates empirically the existence of a performance gap between CFD codes that are not optimised for the current era of multicore and manycore processors and those that are, in both structured and unstructured applications. On the multicore CPUs this difference is in the region of 3X, while on manycore processors such as the Intel Xeon Phi the gap in performance grows to more than an order of magnitude.

• Presents techniques useful for extracting good performance out of modern multicore and manycore processors using a single body of source code written in a traditional high-level language for structured and unstructured applications. Architecture-specific optimisations are abstracted away using standard language constructs and machinery such as classes, templates and pre-processor macros, whilst optimal parameters are identified for different architectures via auto-tuning.

• Demonstrates the importance of making good use of the available vector units in modern processors and presents the changes required to vectorize the computational patterns found in structured and unstructured applications, such as stencil operators and gather and scatter primitives. This is achieved by exploring a wide variety of techniques ranging from compiler intrinsics to OpenMP (>=4.0) compiler directives, in conjunction with re-writing various sections of the code in order to expose data-level parallelism at the algorithmic level. The results in this thesis show that the best trade-off between performance and portability is obtained via the use of OpenMP directives throughout the code, with specialised architecture-specific implementations limited to efficiently loading data into vector registers (a minimal illustration is sketched after this list).

• Establishes the importance of choosing the optimal data layout format and how this may vary depending on whether the partial differential equations are discretised on structured or unstructured grids. The results in this thesis show that the Structure of Arrays (SoA) or the hybrid Array of Structures Structure of Arrays (AoSSoA) format is optimal for structured mesh applications, since it allows for efficient vector load and store operations. This is in contrast to unstructured mesh applications, where the Array of Structures (AoS) format was found to perform better due to its superior exploitation of the cache hierarchy in computational kernels that exhibit gather and scatter operations. Due to these differences, the recommendation is for applications to implement suitable abstractions of the underlying data layout format in order to seamlessly transition between different storage types at compile time (the layouts are sketched after this list).

• Demonstrates the benefit of software prefetching in unstructured mesh applications, due to their indirect memory access patterns, and the utility of auto-tuning in finding the optimal distance parameters across different multicore and manycore processors (a sketch is given after this list).

• Presents the implementation of thread-parallelism for both structured and unstructured applications on the manycore architectures. This is shown to be highly beneficial on in-order architectures, which require more than one active thread per physical core to hide memory latencies and keep the arithmetic units busy, but proved to be detrimental to performance on out-of-order architectures.

• Demonstrates the utility of using performance models for appraising and analysing the performance and merits of optimisations with respect to the underlying hardware and the characteristics of the computational kernels and patterns.

• The results in this thesis clearly show that optimisations that aim to exploit parallelism across different granularities in the underlying algorithms (instruction, data, thread, core) exhibit good performance across both multicore and manycore processors. The same also applies to optimisations that reduce the amount of data transfers from main memory and exploit the cache hierarchies or the available high-bandwidth memory implementations. However, these may not be useful or relevant for future architectures that deviate in their design from the architectural principles currently used in multicore and manycore processors. These are:

– 10-100 cores integrated on the same die via an on-chip interconnect (ring, mesh)

– multiple execution ports and deep instruction pipelines


– vector units with associated instruction set extensions

– multiple execution contexts per core (multithreading)

– deep memory hierarchy based on different cache levels and including high-bandwidth memory

– multiple processors per node (i.e. multi-socket boards)
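By way of illustration, the directive-based vectorization referred to in the list above can be reduced to the following minimal sketch. The kernel and all of its names are hypothetical, chosen only to show the shape of an OpenMP (>=4.0) SIMD loop; the actual thesis kernels are larger and are introduced in Chapters 4 and 5.

    #include <cstddef>

    // Minimal sketch of a cell-based loop vectorized with OpenMP 4.0
    // directives. All names are illustrative.
    void update_cells(std::size_t ncells, const double* __restrict res,
                      double* __restrict q, double dt)
    {
        // '#pragma omp simd' requests vectorization irrespective of the
        // compiler's cost model; 'aligned' asserts 64-byte aligned storage.
        #pragma omp simd aligned(res, q : 64)
        for (std::size_t i = 0; i < ncells; ++i)
            q[i] += dt * res[i];
    }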
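The competing data layouts mentioned in the list can likewise be summarised in a short sketch. The field count and block width below are assumptions for illustration only; the thesis abstracts the choice behind compile-time machinery.

    constexpr int NVAR = 5; // e.g. five conserved variables per cell
    constexpr int VLEN = 4; // SIMD block width in doubles (AVX); an assumption

    // AoS: all variables of one cell sit together, which favours the
    // gather/scatter access of unstructured face-based kernels.
    struct CellAoS { double q[NVAR]; };

    // SoA: one contiguous array per variable, giving unit-stride vector
    // loads and stores in structured stencil kernels.
    struct FieldSoA { double* q[NVAR]; };

    // AoSSoA: short SoA blocks of VLEN cells, a compromise between the two.
    struct BlockAoSSoA { double q[NVAR][VLEN]; };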
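Finally, the software prefetching mentioned in the list amounts to issuing prefetch hints a tuned number of iterations ahead of an indirect access stream. The following is only a sketch: the flux expression is a placeholder and 'dist' stands in for the auto-tuned prefetch distance.

    #include <xmmintrin.h> // _mm_prefetch

    // Sketch of software prefetching in a face-based loop with indirect
    // addressing; all names are illustrative.
    void face_loop(int nfaces, const int* left, const int* right,
                   const double* q, double* res, int dist)
    {
        for (int f = 0; f < nfaces; ++f) {
            if (f + dist < nfaces) {
                // Request the cell data that will be gathered 'dist' faces ahead.
                _mm_prefetch(reinterpret_cast<const char*>(&q[left[f + dist]]),
                             _MM_HINT_T1);
                _mm_prefetch(reinterpret_cast<const char*>(&q[right[f + dist]]),
                             _MM_HINT_T1);
            }
            const double flux = 0.5 * (q[left[f]] + q[right[f]]); // placeholder
            res[left[f]]  -= flux;
            res[right[f]] += flux;
        }
    }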

1.3 thesis outline

The structure of this thesis is as follows:

chapter 2 presents the changes that have taken place in the design of processors over the past decades and highlights the shift in paradigm which has led to the advent of multicore and manycore processors. This is followed by a description of the key architectural features of modern multicore and manycore processors and of the increasing gap in performance between applications that exploit these in their implementation and those that don’t. The implications that these have on the optimisation and performance of existing CFD codes are discussed, together with a review of some of the approaches that have been used in the literature for mapping CFD algorithms onto modern parallel architectures.

chapter 3 gives the necessary background on the processor architectures used in this work in terms of their architectural features and the implications that these have on application performance. This is followed by a description of the Roofline performance model and its application in the context of this work.

chapter 4 presents optimisations for improving the performance of block-structured CFD codes on modern multicore and manycore processors and demonstrates a wide range of techniques for mapping the underlying computational patterns of structured mesh solvers, such as stencil operators, onto the architectural features of modern processors.

chapter 5 presents a number of optimisations for improving the performance of unstructured CFD codes on modern multicore and manycore processors. Examples of optimisations include: grid renumbering, vectorization, face colouring, data layout transformations, on-the-fly transpositions, software prefetching, loop tiling and multithreading. Their implementation and impact are demonstrated in an unstructured CFD code of size and complexity representative of an industrial application and across a wide range of processor architectures such as the Intel Sandy Bridge, Broadwell and Skylake multicore CPUs and the Intel Xeon Phi Knights Corner and Knights Landing manycore processors.

chapter 6 discusses the achievements of this thesis, its limitations and future work.


2 BACKGROUND

This chapter begins by presenting the changes which have taken place in the design of processors over the past decades and highlights the shift in paradigm that has led to the advent of multicore and manycore processors. This is followed by a discussion on the key architectural features of modern multicore and manycore processors and on the increasing gap in performance between applications that are able to exploit their features and those that are not. Finally, the implications with respect to the optimisation and performance of CFD codes are discussed, as well as some of the approaches that can be used for mapping CFD algorithms onto modern hardware.

2.1 trends in processor design

Figure 1 presents trends in microprocessor characteristics over the past five decades based on data from all major Central Processing Unit (CPU) manufacturers such as Intel, AMD and IBM. These are discussed in more detail in the following sections, along with the effects they have on the design of modern processors and the constraints they place on the performance of applications that run on them.

2.1.1 Moore’s Law

Moore’s law [56] is based on observations made by Gordon Moore in 1965 that the number of components (i.e. transistors) per integrated circuit would double every year as a result of continuous improvements in the fabrication of such devices. A decade later, he revised the pace of transistor doubling to every two years.

The semiconductor industry has kept these predictions alive over the past decades, as seen in Figure 1, translating the constant flux of additional transistors into optimisations and architectural features which made new processors significantly faster and more efficient than their predecessors. Although there are signs that the rate at which the number of transistors doubles is slowing down [27], as demonstrated by the increasing gap between successive generations of processors that feature smaller transistors, it is believed that Moore’s law will still survive the next decade, by extending photolithography into the third dimension [81] as well as by revising the pace of exponential growth to every three to five years. Therefore, in the short to medium term, it is expected that Moore’s law will continue to provide chip designers with an increase in the number of transistors at their disposal for the design of the processors that will make up the high-performance computing systems of the next decade.

Figure 1: Trends in microprocessors over the past 46 years, plotting the number of physical and logical cores, frequency (MHz), vector length (bits), transistors (x10^3) and power (Watts) against year. Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. Data between 2010-2015 collected by K. Rupp. Latest data between 2015-2017 collected and plotted by I. Hadade.

2.1.2 Frequency and Power

For many years, Moore’s law was accompanied by Dennard scaling [23]. Robert Dennard demonstrated in 1974 the proportional relationship between the supply voltage and current and the linear dimensions of a transistor. Therefore, as the size of a transistor shrank, so did the required voltage and current. This allowed circuits to operate at higher frequencies for the same power and thus led to the increases in clock frequencies that we see in Figure 1. The trend of frequency scaling continued until 2004 and reached its peak during the 1990s, when clock rates doubled on average every 18 months. The combination of extra transistors on the back of Moore’s law, coupled with the continuous increase in the clock frequencies at which they operated, allowed chip designers to double the performance of processors with every new generation. The majority of hardware innovations at that time, such as complex branch predictors, out-of-order execution and deeper instruction pipelines, were aimed at increasing instruction throughput in order to exploit the high clock rates. Consequently, these improvements would automatically translate into faster execution of applications while requiring no changes to the programming paradigms. As performance and the number of transistors increased at a similar rate over this period, people came to equate Moore’s law with ever increasing performance [90].

However, Dennard scaling failed to predict that as transistors shrink to smaller and smaller scales, it becomes increasingly challenging to offset the increase in current leakage as well as to dissipate heat efficiently without affecting the integrity of the device. Consequently, increasing clock frequencies further was deemed to be no longer profitable, as evidenced by the plateau in both power and frequency in Figure 1 from circa 2004 onwards.

2.1.3 Multicore and manycore architectures

The end of clock frequency scaling due to power consumption has forced the semiconductor industry into a shift in paradigm. This has led to the advent of multicore processors where, instead of utilising the additional transistors guaranteed by Moore’s law to build a single monolithic processor, they were used to replicate and integrate multiple cores running at a lower frequency on the same die. The advantages of the multicore approach are based on the fact that reducing the clock frequency of a single core by 20% results in 50% less power consumption at a cost of a 13% drop in performance [75]. Consequently, if one were to divide the work equally among two processors, both of which operate at 80% of the original frequency, the performance would be 73% higher than on a single core for the same power usage. Moreover, dissipating the heat across multiple discrete cores is also more efficient than across a single one. As a consequence, from 2004 onwards (Figure 1), hardware designers started integrating more and more physical cores on the same die, a trend that continues to this day where a multicore CPU can implement as many as 28 physical cores per chip [3].
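For concreteness, the power and performance figures quoted above follow from the standard dynamic-power approximation; the cubic frequency dependence below is a textbook assumption rather than a derivation from this thesis:

\[
P \propto C V^{2} f \quad \text{with} \quad V \propto f \;\Rightarrow\; P \propto f^{3}, \qquad 0.8^{3} \approx 0.5 .
\]

A 13% per-core performance drop at 80% frequency leaves a factor of 0.87, so two such cores deliver 2 × 0.87 ≈ 1.73 times the original throughput (73% more) while drawing 2 × 0.5P = P, the original power budget.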

However, the problem with multicore processors is that translating their superior performance and energy efficiency into palpable application acceleration mandates the explicit exploitation of parallelism. Whereas the previous paradigm of increasing clock frequencies and improving serial performance required modest code interventions on the application side, the multicore trend only favours parallel applications to the exclusion of serial ones. As a result, the burden is placed on the shoulders of application developers to revisit and rethink their existing implementations so as to fully exploit the available parallelism as more and more cores are integrated onto a die [66].

This burden is further exacerbated by the emergence of manycore processors. These architectures push the boundaries of the multicore approach even further by integrating tens to hundreds of cores on the same die. This integration is made possible by reducing the clock rates of cores by more than half relative to their multicore counterparts and by discarding architectural features geared towards serial performance which require a significant amount of on-die logic [65], such as out-of-order execution or complex branch predictors, in favour of more space and energy efficient features such as wide vector units. Thus, although the "slower" core in a manycore architecture will deliver lower performance than the more powerful core in a multicore processor, the total compute throughput of the manycore processor, when all of its cores and their features are exploited in unison, will be higher than that of the multicore system [12]. Whether this holds in practice, however, depends on whether the application can not only scale linearly with the number of cores on the device but also across all other existing levels of parallelism (i.e. thread, data and instruction). Otherwise, applications will see significantly worse performance on manycore devices than on multicore systems due to the slower cores that lack features oriented towards single-thread performance. The one saving grace is that multicore and manycore processors share much of the same architectural features. As a result, optimisations for one architecture will most likely be of benefit on the other.

Thus, one returns to the earlier issue: making effective use of both multicore and manycore processors requires the exploitation of parallelism across different granularities. Understanding what these levels are, and the ways in which they can be exploited, will allow an application to perform well in the era of multicore and manycore architectures which, judging by the trends in Figure 1, is here to stay.

2.1.4 Levels of parallelism

instruction Instruction-level parallelism (ILP) is implemented by the underlying architecture transparently via techniques such as instruction pipelining, out-of-order execution, register renaming, speculative execution or branch prediction [65]. Consequently, in the majority of cases, these features will be exploited at a lower level of abstraction than that of the program code. However, some forms of ILP can be addressed by the application depending on the choice of algorithms and their implementation. For instance, Figure 2 presents the execution units of two successive multicore architectures, namely the Intel Sandy Bridge (2011) and Haswell (2013). The changes to the execution units implemented in the newer Haswell architecture are shown in orange. A close inspection of both reveals several things. On Sandy Bridge, in order to fully exploit the available floating point units on ports 0-1, the computations would need to exhibit a near perfect balance between multiplications and additions. Otherwise, in the case of an imbalance, represented by a significantly larger number of additions over multiplications or vice versa, one of the execution ports would be oversubscribed while the other would sit idle, thereby reducing the maximum attainable performance by a factor of two. On Haswell, both ports 0 and 1 implement Fused-Multiply-Add (FMA) capabilities. As a result, two separate FMA instructions can be executed on both in parallel. However, applications that have an imbalance between multiplications and additions will also suffer on Haswell, since these operations will not be cast to FMA instructions. Consequently, a vector multiplication would only execute on port 0 while an addition would only execute on port 1. Another example is floating point division, which is only implemented on port 0 across both microarchitectures. As these operations are non-pipelined and can only execute on 128-bit lanes [40], an algorithm that translates to a high number of divisions will degrade performance significantly, since it will prevent any other operation from utilising port 0, leading to an instruction bottleneck.

Some of the issues above can be addressed either through a different choice of algorithms or through programming techniques such as casting all multiplications or additions to FMA operations by multiplying by one or adding zero. This would in theory exploit both ports on Haswell, although it would lead to worse performance on Sandy Bridge due to the higher number of instructions that would execute on different ports. Moreover, these types of transformations are not recommended since the compiler would most likely remove them at certain optimisation levels. Therefore, the best approach and advice for application developers is to consider the operational balance of algorithms and, where possible, implement functionality that takes advantage of parallelism in the execution units by avoiding instructions that can lead to port pressure, such as floating point divisions, and by favouring implementations that exhibit a larger degree of parallelism (e.g. vector blend instructions vs. shuffles).
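As a concrete illustration of exposing such parallelism to the execution units, consider a dot product written with two independent FMA accumulator chains. This is a standard technique sketched here for illustration, not code from the thesis: a single accumulator would serialise on the FMA latency, whereas two independent chains give the out-of-order engine work for both Haswell FMA ports.

    #include <immintrin.h>

    // Dot product with two independent FMA chains (AVX2/FMA3).
    // Assumes n is a multiple of 8; names are illustrative.
    double dot(const double* a, const double* b, int n)
    {
        __m256d acc0 = _mm256_setzero_pd();
        __m256d acc1 = _mm256_setzero_pd();
        for (int i = 0; i < n; i += 8) {
            // Each iteration issues two independent FMAs, one per chain.
            acc0 = _mm256_fmadd_pd(_mm256_loadu_pd(a + i),
                                   _mm256_loadu_pd(b + i), acc0);
            acc1 = _mm256_fmadd_pd(_mm256_loadu_pd(a + i + 4),
                                   _mm256_loadu_pd(b + i + 4), acc1);
        }
        acc0 = _mm256_add_pd(acc0, acc1);
        // Horizontal sum of the four remaining lanes.
        __m128d lo = _mm256_castpd256_pd128(acc0);
        __m128d hi = _mm256_extractf128_pd(acc0, 1);
        lo = _mm_add_pd(lo, hi);
        return _mm_cvtsd_f64(_mm_add_sd(lo, _mm_unpackhi_pd(lo, lo)));
    }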

Figure 2: Schematics of the execution units in the Intel Sandy Bridge (a) and Haswell (b) microarchitectures. Updates in the Haswell architecture compared to Sandy Bridge are highlighted in orange. Figures courtesy of [82]

data Data-level parallelism (DLP) is exposed in conventional multicore and manycore architectures through the SIMD execution model. This is based on performing the same instruction, be it a floating point, integer or logical operation, across successive elements of a vector simultaneously. The number of operands which can be processed by one instruction is limited to the width of the vector registers. On modern multicore and manycore architectures, these vector registers vary between 128, 256 and 512 bits in length and are exploited via extensions to the ISA such as AVX [30] or variants thereof on x86 processors from Intel and AMD, NEON [4] on ARM processors or the Vector Scalar Extension (VSX) on the latest IBM POWER architectures.

Although the implementation of DLP through the SIMD taxonomy was popular during the 1970s in specialised high-performance computing systems such as the CRAY-1 [78], its adoption in commodity processors only came later, towards the end of the 1990s, when Intel introduced the MMX instruction set on 64-bit wide registers. This was followed by the Streaming SIMD Extensions (SSE) instruction set which increased the width of vector registers to 128 bits and provided support for single precision floating point operations. However, subsequent improvements were based only on incremental updates while the vector register width remained fixed for over a decade. Thus, the slow pace of development as well as the lack of support for double precision floating point operations limited the usability and adoption of SIMD in scientific codes.

The turning point came in 2010 with the introduction of AVX and 256-bit vector registers in the Intel Sandy Bridge architecture. From then onwards, the pace of improvements to SIMD capabilities in Intel architectures gained significant momentum, as seen in Table 1, and led to further increases in the width of vector registers to 512 bits as well as extensions to the instruction set such as AVX2, IMCI and AVX-512.

This recent trend of increasing SIMD capabilities in modern processors stems from the consideration that vector-based architectures are more area and energy efficient than scalar-based architectures [46],[84]. Consequently, we can attribute this to the same trend that contributed to the shift to multicore processors, that of energy efficiency. However, as with multicore architectures, making best use of vector-based architectures requires the explicit exploitation of significant amounts of parallelism in the application, albeit at a different granularity. This is either done manually through the utilisation of assembly, compiler intrinsics or libraries implementing the former, or by the compiler based on its internal heuristics or guided through directives. The conversion from a scalar to a vector implementation is called vectorization which, if applied automatically by the compiler with limited or no external intervention, becomes known as auto-vectorization. In the best case, the compiler is able to cast the majority of instructions in the program code as vector operations. However, this is only the case in applications that exhibit regular access patterns and where the safety of vectorization can be guaranteed by the compiler. Consequently, in the majority of cases, the vectorization of any scientific code of significant size and complexity is a laborious endeavour which is made even more complicated in the presence of irregular and complex access patterns.
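To make the guided-vectorization route concrete, the following is a minimal sketch assuming OpenMP 4.0 support; the simd directive and the restrict qualifiers assert that iterations are independent, allowing the compiler to auto-vectorize without runtime dependence checks. The function and array names are illustrative.

    /* the simd directive asserts independent iterations, so the
       compiler can vectorize without a runtime dependence check */
    void scale_add(int n, const double *restrict a,
                   const double *restrict b, double *restrict c)
    {
        #pragma omp simd
        for (int i = 0; i < n; i++)
            c[i] = 2.0 * a[i] + b[i];
    }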

On specialised hardware such as NVIDIA GPUs, data parallelism is implemented through Single-Instruction-Multiple-Thread (SIMT) execution whereby all threads within a warp execute the same instruction. As a result, these architectures behave similarly to the general purpose SIMD architectures of conventional multicore and manycore processors, the only difference being the number of operands that can be processed simultaneously. Consequently, optimisations that expose a large degree of data parallelism on vectors of variable lengths will perform well across all architectures.

year   vector width   instruction set        architecture

1997   64-bit         MMX                    P5
1999   128-bit        SSE                    P6
2001   128-bit        SSE2                   NetBurst
2004   128-bit        SSE3                   NetBurst
2006   128-bit        SSE4                   Core
2007   128-bit        SSE4.2                 Nehalem
2010   256-bit        AVX                    Sandy Bridge
2013   256-bit        AVX2                   Haswell
2013   512-bit        IMCI                   Knights Corner
2016   512-bit        AVX-512F/CD/ER/PF      Knights Landing
2017   512-bit        AVX-512F/CD/BW/DQ/VL   Skylake Server

Table 1: Evolution of SIMD in Intel processors

thread Thread-level parallelism (TLP) is based on the ability of modern multicore and manycore architectures to maintain architectural state for more than one execution context. In multicore architectures such as Intel CPUs, TLP is implemented via the Simultaneous Multithreading (SMT) model where every physical core on the die can appear to the operating system as two logical cores. This is achieved by duplicating the features that maintain execution state but not the execution units. Therefore, when two threads (i.e. execution contexts) are scheduled to run on the same physical core and one of them is stalled due to a cache miss, its state can be saved while the other thread is given control and ownership of the execution units, assuming that its instruction operands are available. Thus, one can hide latencies incurred from dependencies in the instruction pipeline or misses in the cache hierarchy by overlapping the execution of two threads.

On manycore architectures such as the Intel Xeon Phi, each core is capable of running up to four concurrent threads. On the Knights Corner architecture, the execution of more than one thread per physical core is required in order to compensate for the in-order nature of the core, where a miss in the L1 cache leads to a complete execution stall. By running two, three or four threads concurrently, context can be switched to any thread that has data available when a miss in L1 occurs, thereby circumventing a complete execution stall. In essence, TLP on the Knights Corner (KNC) coprocessor is used as a means of accomplishing out-of-order execution. Consequently, although a KNC coprocessor might integrate up to 61 physical cores, best performance mandates the exploitation of at least 120 independent execution contexts (i.e. threads).
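As a minimal sketch of exposing such execution contexts via OpenMP, the fragment below simply reports the concurrency it was given; on KNC one would typically launch it natively with settings along the lines of OMP_NUM_THREADS=122 and KMP_AFFINITY=balanced (illustrative values giving two threads per core on a 61-core part), whereas on an out-of-order multicore CPU one thread per core is usually sufficient.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        #pragma omp parallel
        {
            /* one thread reports the size of the team */
            #pragma omp single
            printf("running with %d threads\n", omp_get_num_threads());
        }
        return 0;
    }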

The approach on KNC is similar to that taken by NVIDIA GPUs, where the SIMT execution model utilises a number of threads grouped in warps which are executed in a lock-step fashion on a Streaming Multiprocessor (SM). If one group of threads waits on a memory transaction, context can be quickly switched to another one, thereby hiding the latency incurred by the prior group.

The effectiveness as well as the exploitation of TLP is therefore highly dependent on the application as well as the underlying architecture. On multicore architectures that feature powerful out-of-order cores, running more than one thread per core can be detrimental to performance depending on whether the application is compute or memory bound. On the other hand, on manycore architectures, the exploitation of TLP is required for good performance due to the "slow" in-order cores. For GPUs, the SIMT model somewhat blends data and thread parallelism together; on the Intel Xeon Phi architecture, however, thread parallelism differs from data parallelism and requires separate consideration, such as another level of domain decomposition for grid-based applications.

core Parallelism at core granularity is based on the exploitation of the available physical cores that are integrated on the same die. This can be explicitly addressed on conventional multicore and manycore architectures either by exploiting the shared memory nature of multicore systems, where the discrete cores of a multicore processor usually exhibit Uniform-Memory-Access (UMA) properties with respect to main memory, or via the Single Program Multiple Data (SPMD) model. In the latter, copies of the same application are executed independently across the physical cores while communication among them is performed explicitly via messages using programming models such as the Message Passing Interface (MPI). To complicate matters further, it is becoming common practice for scientific codes to implement both techniques, whereby one or more cores in a multicore or manycore processor run in an SPMD fashion whereas the rest take the form of threads spawned from the SPMD context that communicate via shared memory rather than explicit messages due to the more favourable latencies. Such programming models are defined as hybrid and are particularly encouraged on manycore architectures such as the Intel Xeon Phi due to the need of running more than one thread context per core. A minimal sketch of this hybrid model is given below.
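The following sketch assumes an MPI library with thread support together with OpenMP; the rank/thread reporting stands in for actual solver work on each rank's partition.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        /* FUNNELED: only the master thread of each rank calls MPI */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            /* each thread would work on a sub-block of the rank's
               partition, communicating via shared memory */
            printf("rank %d, thread %d of %d\n", rank,
                   omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }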

The equivalent of a discrete core on GPUs differs depending on the manufacturer. For NVIDIA GPUs, this would be the SM. However, compared to the physical cores in conventional multicore and manycore systems, the functionality and execution of an SM on NVIDIA GPUs is managed exclusively by the hardware. A group of threads (warp) can be scheduled to run on any available SM. As a result, parallelism at this level cannot be exploited explicitly by the application.

node Compute nodes, the building blocks of contemporary high-performance systems, usually host two or more multicore CPUs integrated on the same circuit board in separate sockets and connected together via proprietary high bandwidth interconnects. Alternatively, they can contain a single self-hosted manycore processor such as the Intel Xeon Phi Knights Landing, or a multicore CPU together with one or more accelerators such as GPUs or the Intel Xeon Phi KNC coprocessor, the latter attached as devices due to their inability to host their own operating system. Consequently, exploiting node parallelism can mean different things depending on the architectural composition of the node.

For heterogeneous nodes containing both a conventional multicore CPU and an accelerator or coprocessor, node parallelism can mean performing symmetric computations on both, where some computations are offloaded to the accelerator while another stream of independent calculations is performed on the CPU. The difficulty in this approach is that of load balancing the execution on two distinct architectures with different levels of throughput and latency. Another type of parallelism at the granularity of a heterogeneous node involves the exploitation of locality when more than one accelerator or coprocessor is connected to the same circuit board.

For homogeneous node architectures comprising a number of multicore CPUs or a manycore processor, parallelism at this granularity is usually the same as that at the core granularity, with the added consideration that communication via shared memory across different sockets will most likely involve Non-Uniform-Memory-Access (NUMA) effects as each socket is affine to the memory nearest to it.

2.1.5 Amdahl’s Law

Since achieving high performance on both multicore and manycore processors mandates the explicit exploitation of parallelism across different levels, the maximum speed-up that an application can exhibit on these architectures will be limited by Amdahl's law [38],[13].


Amdahl's law [7] states that the speed-up that can be attained by an application executed in parallel is limited by the fraction of execution that is performed serially, such that:

$$S(N) = \frac{1}{(1-P) + \frac{P}{N}} \qquad (1)$$

where $S(N)$ is the speed-up as a function of $N$ processors, $P$ the fraction of code that is executed in parallel and $N$ the number of processors it is parallelised on. Therefore, as $N \to \infty$, $S(N) \to \frac{1}{1-P}$.

In practical terms, the limitation imposed by Amdahl's law is that the maximum speed-up attainable for an application that spends 25% of its overall runtime in sequential execution is a factor of four, irrespective of the number of processors (i.e. $1/(1-0.75) = 4$).

Consequently, the applications that will perform best on multicore and manycore processors are those that can maximise $P$ as well as make good use of the underlying architectural features. Furthermore, the limitations in speed-up imposed by Amdahl's law apply across all of the available levels of parallelism in modern processors, where failing to exploit one granularity, such as data parallelism, drastically limits the maximum attainable performance.

2.1.6 The memory hierarchy

A trend that is not featured in Figure 1, although just as important, is the growing disparity between the performance of the processor and that of the memory system. Wulf et al [93] observed in 1995 that although both microprocessor speeds and memory bandwidth grew exponentially, they did so at different exponents, and the difference between them also grew at an exponential rate. The authors coined this trend "the memory wall" and argued that if it were to continue at the same pace, it would limit the performance of almost all applications to that of the memory system, thus making future advancements in microprocessors redundant.

As a consequence, chip designers began devoting large amounts of die area and transistors to the integration of on-chip caches. This improved memory performance, albeit only for applications that addressed the principles of spatial and temporal locality in their memory access patterns. Nevertheless, the integration of larger caches across deeper hierarchies, combined with their exploitation by the application, proved to be effective in bridging the gap in performance between the processor and the memory system.

However, with the emergence of multicore and manycore processors, the disparity between the performance of the processor and that of the memory system began to grow again. This was due to the fact that as the number of cores per processor increased, so did floating point performance, as arithmetic units were replicated as well. In contrast, other components such as the number of memory channels connecting the memory system to the processor did not scale at a similar pace [57]. Consequently, although peak floating point performance almost doubled with every processor generation, the ability of the memory system to keep all cores and their arithmetic units busy with data lagged further and further behind.

This is evidenced in Figure 3 which presents a comparison between peak floating point performance and peak memory bandwidth of high-end Intel multicore CPUs, NVIDIA Tesla GPUs and Intel Xeon Phi processors released between 2007 and 2017. The highest imbalance between floating point performance and memory bandwidth is seen in the Intel multicore CPUs. We can explain the reasons behind this by analysing and comparing the last two processors, released in 2016 and 2017 respectively. These are represented in Figure 3 by the 22-core Intel Broadwell E5-2699 v4 processor and its direct successor, the 28-core Intel Skylake Platinum 8180 processor. Although the latter has more than double the floating point performance of its predecessor due to AVX-512, the increase in size of vector registers from 256 to 512 bits and six additional cores, the number of memory channels between the two architectures only increased by 50%, from four to six. As a result, the increases in floating point performance on the Skylake-based processor have not been matched by a similar increase in memory bandwidth, leading to a higher imbalance between the arithmetic capabilities of the processor and the performance of the memory system.

Figure 3: Comparison between peak floating point performance in double precision and peak memory bandwidth across Intel Xeon multicore CPUs (CPUs), NVIDIA Tesla GPUs (GPUs) and Intel Xeon Phi processors (Phis) spanning the last decade. Data courtesy of K. Rupp [77] with minor updates by I. Hadade

On manycore architectures such as NVIDIA Tesla GPUs or Intel Xeon Phi processors, this imbalance is alleviated by the implementation of on-package high bandwidth memory such as the Multi-Channel DRAM (MCDRAM) in the Intel Xeon Phi Knights Landing architecture. This is integrated using 3D-stacked technology and is connected to the Knights Landing die via eight high-bandwidth controllers, offering a factor of four more bandwidth than the traditional DDR4 memory [44]. Therefore, although the Knights Landing architecture has increased floating point performance by more than a factor of three compared to its Knights Corner predecessor, the performance of MCDRAM over the previous GDDR5 implementation in Knights Corner has scaled at a similar rate.

A similar approach to that of the Knights Landing architecture is also taken by NVIDIA Tesla GPUs with the integration of High-Bandwidth Memory 2 (HBM2) in the P100 and V100 GPUs based on the Pascal and Volta architectures. However, an important difference between the two is that the MCDRAM implementation on the Knights Landing architecture can be configured in a variety of memory modes by the user whereas the behaviour of HBM2 is hard wired on the GPUs.

A consequence of the growing disparity in performance between the processor execution speed and the bandwidth and latency of main memory is that the number of floating point operations required to keep all of the available arithmetic units busy for every byte of data retrieved from main memory has increased considerably. This is highlighted in Figure 4. While this ratio is dependent on the machine balance and varies across architectures, it is still growing on the multicore CPUs, as previously seen, as well as on the NVIDIA GPUs, albeit at a different rate. The implication for application performance is that algorithms that were previously bound by the computational capability of the processor, such as dense matrix-vector operations, are now limited by the performance of the memory. Consequently, techniques that improve the exploitation of the cache hierarchy and limit the exposure of an application to the latency and bandwidth of main memory will have a high impact on modern architectures and especially on multicore processors. On the other hand, while the Intel multicore CPUs seem to suffer the most from this growing imbalance, this could be rectified by integrating on-package high bandwidth memory systems similar to the one in the Knights Landing architecture to counterbalance the "memory wall". Thus, it may be only a matter of time until such implementations are also found in high-end Intel multicore CPUs. This may somewhat complicate matters for the application developer, as these will potentially require explicit attention for best performance; however, without them, applications will see little benefit from continuous increases in floating point performance that are not matched by increases in memory performance.

2.2 the performance gap

The multicore and manycore trend has led to processors which exhibit parallelism at various granularities along with deeper and more complex memory hierarchies. Consequently, translating the performance of these modern architectures into palpable application speed-ups mandates the exploitation of all of their architectural features.

According to Satish et al [80], this has directly contributed to the so-called "ninja performance gap" whereby only a limited number of expert programmers are able to extract the full power of modern processors. In contrast, the average programmer can only extract a fraction of this or, even worse, sees performance drop on the more unforgiving manycore processors.


Figure 4: Number of floating point operations in double precision per byte from off-chip memory across Intel Xeon multicore CPUs, NVIDIA Tesla GPUs and Intel Xeon Phi processors spanning the last decade. Data courtesy of K. Rupp [77] with minor updates by I. Hadade


The authors refute the claim that traditional approaches to programming are at fault for this state of affairs and that the only panacea is a radical alternative such as new programming languages. Instead, they demonstrate that while a gap in performance does exist, it can be mitigated by exploiting low-hanging fruit such as vectorization via compiler directives and thread-level parallelism using application interfaces such as Open Multi-Processing (OpenMP). Through these techniques, the average difference in performance between an application that was hand tuned for the underlying architecture and its naively written counterpart dropped from 24X to 3.5X. Moreover, they demonstrated that more involved algorithmic improvements, such as data layout transformations for making better use of the underlying vector units and memory optimisations for alleviating memory bandwidth pressure via cache blocking or loop tiling, further reduced this difference to an average of 1.4X across 11 real-world applications.

Consequently, although the authors envisage that the "ninja gap" will invariably grow as emerging architectures improve precisely those architectural features that can only be exploited via explicit hierarchical parallelism, they also claim that traditional approaches to programming are more than able to extract performance out of these modern processors. However, they argue that higher performance will also translate to a considerably higher programming effort. Nevertheless, without optimisations focused on exploiting the multi-level parallelism available on modern processors and the features of the memory system, scientific applications will see their execution speed stagnate or drop even as newer processors claim increases in their peak performance, increases which will most likely be reserved exclusively for codes that are able to map to the architectural features of these processors.

2.3 implications for cfd codes

The majority of existing CFD codes used in both industry and academia were developed prior to the multicore and manycore era. As these codes were primarily designed for distributed-memory environments via programming models such as MPI with one rank per uniprocessor, they already exhibit parallelism at the granularity of a core or compute node. However, they do not exploit the parallelism available at instruction, vector or thread level which, as demonstrated in the previous sections, is essential for extracting a high degree of performance out of modern architectures. Furthermore, on modern processors, these codes are predominantly limited by memory bandwidth due to the growing disparity between peak floating point performance and the available memory bandwidth. Consequently, optimisations that alleviate the pressure placed on the memory system, such as cache blocking and loop tiling for structured grid applications or grid renumbering for unstructured grids, will most likely deliver palpable speed-ups [31]. A sketch of loop tiling is given below.
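The fragment below is a minimal sketch of loop tiling (cache blocking) for a 2D structured-grid sweep, assuming a simple 5-point stencil on an n x n grid; the tile size and names are illustrative and would in practice be tuned (or auto-tuned) to the target cache level.

    #define TILE 64  /* assumed tile size, tuned per architecture */

    void smooth_tiled(int n, const double *restrict in,
                      double *restrict out)
    {
        for (int jj = 1; jj < n - 1; jj += TILE)
            for (int ii = 1; ii < n - 1; ii += TILE)
                /* sweep a TILE x TILE block that stays resident in cache */
                for (int j = jj; j < jj + TILE && j < n - 1; j++)
                    for (int i = ii; i < ii + TILE && i < n - 1; i++)
                        out[j*n + i] = 0.25 * (in[j*n + i - 1] +
                                               in[j*n + i + 1] +
                                               in[(j-1)*n + i] +
                                               in[(j+1)*n + i]);
    }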

On manycore architectures such as the Intel Xeon Phi Knights Landing, exploiting the on-package MCDRAM is imperative for getting the most out of the hardware and, to that end, the processor supports a variety of configurations that are traditionally hard wired in architectures such as GPUs. Finding the best configuration therefore requires an empirical evaluation and depends on various aspects related to the application. Moreover, on manycore architectures that integrate simple in-order cores, such as the Intel Xeon Phi Knights Corner and NVIDIA GPUs, hiding the latency incurred by the lack of out-of-order execution and complex hardware prefetchers is possible through multithreading and an optimal software prefetching strategy; a sketch of the latter follows below.
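The following is a minimal sketch of software prefetching for a streaming loop, of the kind that matters most on in-order cores such as KNC; the prefetch distance is an assumed tunable that would be selected empirically (e.g. via auto-tuning), and the kernel itself is illustrative.

    #include <xmmintrin.h>

    #define PF_DIST 16  /* assumed prefetch distance, in elements */

    void scale(int n, const double *restrict in, double *restrict out)
    {
        for (int i = 0; i < n; i++) {
            /* request in[i + PF_DIST] into L1 ahead of its use */
            if (i + PF_DIST < n)
                _mm_prefetch((const char *)&in[i + PF_DIST], _MM_HINT_T0);
            out[i] = 2.0 * in[i];
        }
    }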

Finally, as the number of cores continues to increase in multicore and manycore processors, running one MPI rank per core might not be the best solution, especially for manycore architectures and their significantly slower cores. Consequently, the implementation of hybrid parallelism with multiple threads per MPI rank could be beneficial not only for manycore architectures but also for multicore systems due to the ability to exploit the shared memory nature of multi-socket compute nodes.

Incorporating these optimisations in existing large scale CFD codes is no small task and is highly dependent on both the underlying processor architectures as well as the CFD algorithms. In the context of finite volume methods, such optimisations will vary depending on whether the governing equations are discretised on structured or unstructured meshes, as these exhibit different computational patterns which mandate different optimisations. As for the underlying processor architectures, the majority of multicore CPUs from Intel, AMD, ARM and IBM share architectural similarities in that they all require the exploitation of the underlying vector units and of the memory hierarchies. As a result, the optimisations highlighted above will be transferable to a large extent among all of them unless architecture-specific implementations are used. This is also applicable to the Intel Xeon Phi processors, which can be exploited using the same programming models as their multicore counterparts and where optimisations for the Xeon Phi processors are also beneficial to the multicore CPUs and vice-versa.

On the other hand, although this is also applicable to more specialised architectures such as NVIDIA GPUs, which also rely on data and thread-level parallelism in their implementation albeit at larger scales, the programming models required to exploit these architectures differ significantly from those used for conventional hardware. To that extent, porting a large scale legacy CFD code to GPUs is almost synonymous with writing the entire code from scratch which, from the point of view of an industrial practitioner, is not feasible.

One approach to mitigating these portability issues is the development of a Domain Specific Language (DSL) or high level framework. This can express the underlying algorithms in a form accessible to the CFD practitioner and generate efficient code across both conventional multicore processors and specialised GPU hardware. One example is SBLOCK [14], developed by Brandvik et al for optimising the stencil computations arising from structured CFD mesh applications. Another example in the context of structured mesh algorithms is OPS [70], which focuses more on optimisations that reduce the memory traffic of such algorithms via loop tiling and cache blocking techniques, as well as their mapping onto both multicore and manycore architectures such as the Xeon Phi and NVIDIA GPUs. Examples of DSLs for unstructured mesh computations are OP2 [32],[59] and Liszt [22], along with more specialised implementations with application to finite element codes such as FEniCS [48] and Firedrake [69], the latter of which integrates features from both FEniCS and the OP2 project.

While the benefits of "write once, run everywhere" are obvious in the context of current processor trends, and this is certainly achievable with DSL implementations such as the ones presented above, as Giles et al [31] remarked, the utilisation of DSLs also brings forth a number of challenges. First of all, there is the issue of expressiveness and whether a high level abstraction can fully capture the requirements of a CFD practitioner in an industrial or research context. Secondly, maintaining the software is challenging, as many of these implementations arise from highly technical research groups with finite amounts of funding. Thirdly, there is also the issue of porting an existing large scale application to such a high level implementation, which is not very different from rewriting the application from the very beginning.


Consequently, in the context of existing codes, the only viable approach is to revisit and optimise the underlying CFD algorithms so as to map them onto the architectural features of modern processors. For such endeavours, the optimisations and techniques presented in this thesis will act as a valuable resource for establishing the return on investment, in terms of programming effort over application speed-up, that each particular optimisation can offer on conventional multicore and manycore processors.

2.4 conclusions

This chapter presented current trends in the development of modern processors and the effects these have on the performance of scientific applications and CFD solvers in particular. A discussion on how best to mitigate these effects in CFD applications was also given, which serves as the main motivation of this work, along with previous efforts that have been implemented for this purpose.


3 experimental setup

This chapter provides the necessary background on the processor architectures used in this thesis, followed by a discussion on the Roofline performance model and its application in our work.

3.1 processor architectures

The processor architectures described in this section and used throughout this thesis as experimental platforms have been selected because: (i) they make up a very large proportion of current and likely future node architectures in high-performance computing systems; (ii) they incorporate architectural features such as multiple execution ports, out-of-order and in-order core designs, wide vector units, multi-threading capabilities, medium to large core counts as well as high-bandwidth memory implementations representative of current multicore and manycore processor designs. Consequently, the optimisations presented and evaluated here, which exploit any or all of these features, will be applicable to any processor architecture, whether multicore or manycore, that implements one or more of them in its design.

3.1.1 Intel Xeon Sandy Bridge

The Intel Sandy Bridge (SNB) microarchitecture was the first to introduce 256-bit vector registers and an extension to the SIMD instruction set via AVX [30]. The AVX instruction set maintains compatibility with its predecessors and is implemented on top of SSE by fusing two 128-bit SSE registers. This design consideration has an impact on certain operations that target data elements across 128-bit lanes, such as shuffle operations and permutations. In theory, the load ports in Sandy Bridge can perform 256-bit loads via AVX; however, achieving this bandwidth requires the simultaneous usage of the two available load ports. This limitation leads to situations where vectorization might not deliver the expected performance boost due to load port pressure [40].


For arithmetic purposes, and as already mentioned in Chapter 2, the Sandy Bridge architecture can execute one vector multiplication and one vector addition in the same clock cycle, thus requiring a balance of such operations for best performance.

In terms of the cache hierarchy, the L1 is 32KB and 8-way associative and can sustain two 128-bit loads and a single 128-bit store per cycle. The L2 cache is 256KB in size and 8-way associative, with a 12-cycle load-to-use latency and a write-back design. The L3 cache is 20MB in size and shared across the cores on the die. Integration of all physical cores on the chip is done via a ring-based interconnect [40].

3.1.2 Intel Xeon Haswell

The Intel Haswell (HSW) microarchitecture is based on a 22nm design and is a successor to the Ivy Bridge architecture. Whereas Ivy Bridge did not contain any major architectural changes compared to its Sandy Bridge predecessor, Haswell provides a number of modifications through a new core design, an improved execution unit based on the AVX2 extensions and a revamped memory subsystem [68].

The major improvements to the execution unit in the Haswell microarchitecture concern the addition of two ports, one for memory and one for integer operations, which aids instruction-level parallelism. To that end, there are two execution ports for performing FMA operations as part of AVX2, with a peak performance of 16 double precision floating-point operations per cycle, double that of Sandy Bridge and Ivy Bridge. Further improvements provided by AVX2 are gather operations, which are particularly useful for packing non-contiguous elements within a vector register, with application to unstructured mesh calculations, and whole-register cross-lane permutations and shuffles through the vpermpd instruction.
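As a minimal sketch of the gather capability, assuming AVX2 and 32-bit indices (names are illustrative), four non-contiguous doubles can be packed into one register with a single intrinsic:

    #include <immintrin.h>

    /* gather x[idx[0]], ..., x[idx[3]] into one 256-bit register */
    __m256d gather4(const double *x, const int *idx)
    {
        __m128i vidx = _mm_loadu_si128((const __m128i *)idx);
        /* scale of 8 bytes: indices address double-sized elements */
        return _mm256_i32gather_pd(x, vidx, 8);
    }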

From a memory standpoint, Haswell fixes the back-end port width issue by providing true 64 byte load (2x256-bit) and 32 byte store (1x256-bit) per cycle functionality. As a result, vectorized codes that contain a large number of load and store operations should perform significantly better than on the Sandy Bridge and Ivy Bridge architectures.


3.1.3 Intel Xeon Broadwell

The Intel Broadwell (BDW) microarchitecture is the successor to Haswell, to which it brings a number of enhancements such as latency improvements for floating-point multiply and FMA operations as well as increased throughput of gather instructions [40]. The architecture is based on a 14nm die shrink from the previous 22nm on Haswell, which allows high-end multicore CPUs based on the Broadwell architecture to integrate as many as 24 physical cores on the same die, six more than on Haswell.

3.1.4 Intel Xeon Skylake Server

The Intel Skylake (SKL) Server microarchitecture was released in July 2017 and is the successor to Broadwell [41]. At core level, Skylake Server increases the vector register size to 512 bits and adds support for AVX-512 instructions. Although the execution unit has the same number of ports as Haswell and Broadwell, ports 0 and 1 can either perform AVX/AVX2 vector computations on 256-bit lanes or fused 512-bit AVX-512 computations. Port 5 is exclusive to AVX-512 execution. Therefore, to take full advantage of this architecture, vector computations should target the wider vector lanes via AVX-512 as this utilises the highest available throughput. Skylake Server can perform 32 double precision computations per cycle when utilising both AVX-512 FMA units, twice as many as Broadwell and Haswell.

In terms of the cache subsystem, the L1 cache on Skylake offers similar latency and size to Broadwell, 32KB at 4-6 cycles [40]. The major difference is the increase in bandwidth to 128 bytes (2x512-bits) for loads and 64 bytes (1x512-bits) for stores, as required by AVX-512 operations. In essence, the L1 cache can service up to two entire cache lines (64 bytes each) to the load ports if AVX-512 is used and the underlying data is aligned to a 64 byte boundary, which makes these considerations crucial for achieving best performance.
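A minimal sketch of allocating data on the 64 byte boundaries mentioned above, assuming a POSIX system (the helper name is illustrative):

    #include <stdlib.h>

    /* allocate n doubles aligned to the 64-byte cache line size so
       that AVX-512 loads can service full cache lines */
    double *alloc_aligned(size_t n)
    {
        void *p = NULL;
        if (posix_memalign(&p, 64, n * sizeof(double)) != 0)
            return NULL;
        return (double *)p;
    }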

Radical changes are also present in the L2 cache, which sees a factor of four increase in size compared to Broadwell (1MB vs. 256KB) [40]. The L3 cache is marginally smaller in size than on Broadwell and is configured as a victim cache (non-inclusive) to the higher levels. The effect of these changes is that applications making use of communication avoiding algorithms such as loop tiling or cache blocking should target the L2 cache rather than the L3, with the same considerations applying to software prefetching.

Access to main memory can be serviced by up to six memory channels on Skylake Server versus four on Broadwell, which should provide a 50% increase in available memory bandwidth. This figure has been corroborated by our own STREAM [53] benchmarks. Consequently, the difference in performance between Broadwell and Skylake Server depends on whether the application is compute or memory bound, as previously discussed in Chapter 2.

Further changes are present in the on-chip interconnect topology, where the previous ring implementation is replaced with a 2D mesh interconnect initially implemented on the Intel Xeon Phi Knights Landing architecture [40]. This enables the Skylake Server microarchitecture to scale to as many as 28 cores per die.

3.1.5 Intel Xeon Phi Knights Corner

The Intel Xeon Phi KNC coprocessor can be classed as an x86-based Shared-Memory-Multiprocessor-on-a-chip [43] with more than 60 physical cores on the die, each supporting four concurrent hardware threads.

A Knights Corner core contains one Vector Processing Unit (VPU) that can operate on 32 512-bit wide registers and 8 mask registers based on the IMCI extensions. The IMCI extensions, however, are not compatible with the SIMD extensions on x86 multicore CPUs. The functionality offered by the VPU is heavily geared towards floating point computations, with support for FMA as well as the gather and scatter operations useful for unstructured mesh computations. Theoretically, the VPU can execute one FMA operation per cycle (16 double precision floating-point operations); however, due to the in-order execution nature of the core, either software prefetching or more than one in-flight thread is required to keep the VPU busy. This is due to the fact that every miss in the L1 cache leads to a complete execution stall unless context can be switched to another in-flight thread. The cores on the die are connected via a bi-directional ring interconnect that offers full cache coherence.

Communication with the host CPU is performed via the PCI-Express bus in a similar fashion to GPUs. Although Knights Corner can be used in a number of modes such as offload, symmetric or native, the native mode has been used throughout this work as it removes any unnecessary synchronization constructs, thus enabling a more accurate assessment of the platform's computational characteristics and performance.

3.1.6 Intel Xeon Phi Knights Landing

The Knights Landing (KNL) architecture is the successor to Knights Corner and the second iteration of the Intel Xeon Phi family. Knights Landing is the first self-boot manycore processor able to run a standard operating system [44] and therefore differentiates itself from all other coprocessor and accelerator platforms. Another important distinction of the KNL architecture compared to its predecessor is binary compatibility with x86 multicore CPUs and the integration of a high bandwidth on-package memory (MCDRAM) [83] alongside standard DDR4 main memory.

The building block of the Knights Landing architecture is the tile, which is replicated across the entire chip on a 2D lattice interconnect. A tile is further composed of two cores sharing a 1MB L2 cache and associated memory control agents. The KNL core is based on the Intel Atom (Silvermont) architecture and includes additional features targeting floating point workloads while also supporting up to four in-flight threads [44].

For the purpose of compute, the core integrates two VPUs, each supporting AVX-512 execution. Implementation of SSE/AVX/AVX2 instructions [44] is present on only one of the two vector units. Therefore, similarly to the Skylake Server architecture, full throughput can only be achieved via AVX-512 calculations as SSE/AVX/AVX2 can only utilise half of the available vector units.

Due to its out-of-order nature, one thread per KNL core can saturate all available core resources. The L1 cache can perform two 64 byte (2x512-bit) loads and one 64 byte (1x512-bit) store per cycle and supports unaligned and split load accesses across multiple cache lines, which was not previously possible on KNC. Furthermore, the L1 cache also implements special logic for gather and scatter operations whereby a single instruction can access multiple memory locations at once without the need for the blocking loop implementation found in KNC [67]. This is particularly important for applications that exhibit indirect and irregular memory access patterns such as unstructured mesh solvers.


The L2 cache is shared by the two cores in a tile via the Bus Interface Unit (BIU) and has a bandwidth of 64 bytes (1x512-bits) for reads and 32 bytes (1x256-bits) for writes per cycle. This can therefore become a bottleneck for memory bound workloads when both cores issue AVX-512 vector load and store operations.

The Knights Landing architecture supports various clustering and memory mode configurations which were traditionally hard-wired during the chip manufacturing process but are now exposed as bootable options [44]. Clustering modes target the ways in which data is routed on the 2D mesh in the event of a miss in the shared L2 cache [83]. An L2 miss in Knights Landing involves the interaction of three actors: the tile from where the miss originated, the Cache Homing Agent (CHA) that tracks the location and state of the memory address for which the miss was generated, and the actual memory location (i.e. MCDRAM or DDR4) [44].

In the All-to-All configuration, all memory addresses are distributed uniformly across all CHAs without any locality considerations. This mode provides the lowest performance of all available configurations and is usually used as a fall-back in the event of memory imbalance or errors.

In the Quadrant configuration, the entire die is divided into four distinct regions and memory addresses are mapped to the caching agents which reside in the same quadrant. This creates affinity between the memory location and the CHA, which reduces latency and provides an increase in bandwidth.

Finally, in the Sub-NUMA-Clustering mode, the die is divided into either four or two distinct sections, analogous to a four or two socket multicore CPU node, where affinity exists between all agents: tile, CHA and memory. This configuration introduces NUMA considerations that have to be exploited on the application side, as accessing data owned by a tile from a different region will incur increased latency due to the longer path traversed on the 2D mesh. However, if implemented correctly, this mode of operation can provide the best performance as all communications are localised.

With regard to memory modes, the Knights Landing architecture supports three distinct memory configurations. The first option is Cache mode, in which the 16GB MCDRAM acts as a transparent direct mapped memory-side cache. In Flat mode, the MCDRAM is exposed as an explicit memory buffer that must be managed by the application and which has a different address space from standard DDR4 memory. Hybrid mode combines both of the previous options, where MCDRAM can be used as cache and as explicit high bandwidth memory based on pre-defined ratios.
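In Flat mode, explicit management can be done, for instance, through the memkind library's hbwmalloc interface (one option among several; numactl can also bind a whole process to the MCDRAM NUMA node). A minimal sketch, assuming memkind is installed, with a DDR4 fall-back:

    #include <hbwmalloc.h>
    #include <stdlib.h>

    /* allocate n doubles in MCDRAM if available, else in DDR4 */
    double *alloc_hbw(size_t n)
    {
        double *p = NULL;
        if (hbw_check_available() == 0 &&
            hbw_posix_memalign((void **)&p, 64, n * sizeof(double)) == 0)
            return p;
        return malloc(n * sizeof(double)); /* DDR4 fall-back */
    }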

In this work, experimental results for the Knights Landing processor were obtained using the ARCHER-KNL [2] evaluation platform. Consequently, all runs were performed in the Quadrant cluster mode as this was the global configuration across all nodes. With regard to memory modes, separate queues allowed for the evaluation of both Cache and Flat modes.

3.2 the roofline model

In order to determine the effects of our code optimisations rigorously, performance models are built and correlated based on the characteristics of the processor architectures presented in Section 3.1. A good performance model can highlight how well the implementation of an algorithm uses the underlying architectural features of the processor as well as the degree of performance that can still be achieved with further optimisations.

In this thesis, all performance modelling activities were performed using the Roofline [91],[92],[62] performance model. The model is based on the assumption that the principal vectors of performance in numerical algorithms are computation, communication and locality [92]. The Roofline model defines computation as the number of floating point operations per second, communication as the units of data from memory required by the computations and locality as the distance in memory from where the data is retrieved (i.e. cache, DRAM, MCDRAM etc.). However, since modelling the performance of the entire cache system is a complex task for one architecture, let alone six, the locality component in this work is set to DRAM main memory for the multicore CPUs, GDDR5 for the Intel Xeon Phi Knights Corner and MCDRAM for the Intel Xeon Phi Knights Landing processor.

The correlation between computation and communication is defined by the model as the arithmetic intensity of a given kernel, which can be expressed as the ratio of useful floating point operations to the corresponding number of bytes requested from memory. As the units of measurement for flops and byte transfers in current CPU architectures are given as GigaFLOPS per second and GigaBYTES per second, the maximum attainable performance of a computational kernel, measured in GigaFLOPS per second, can be calculated as:

$$\text{Max. GFlops/sec} = \min\left(\text{Peak floating-point performance},\; \text{Max. memory bandwidth} \times \text{Kernel flops/byte}\right) \qquad (2)$$

Peak floating point performance is obtained by multiplying the number of floating point operations per cycle by the nominal clock frequency and the number of physical cores, while maximum memory bandwidth is obtained using the STREAM [54],[53] benchmark.
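A minimal sketch of Equation 2 as executable arithmetic; the peak and bandwidth figures below are illustrative placeholders, not measured values from the platforms in Section 3.1.

    #include <stdio.h>

    /* attainable GFLOPS/s for a kernel of given arithmetic intensity */
    static double roofline(double peak_gflops, double peak_gbs,
                           double flops_per_byte)
    {
        double mem_bound = peak_gbs * flops_per_byte;
        return mem_bound < peak_gflops ? mem_bound : peak_gflops;
    }

    int main(void)
    {
        /* e.g. 0.25 flops/byte on a 1000 GFLOPS, 100 GB/s node is
           memory bound at 25 GFLOPS/s */
        printf("attainable: %.1f GFLOPS/s\n",
               roofline(1000.0, 100.0, 0.25));
        return 0;
    }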

The model visualises these metrics in a diagram [91] where the attainable performance of a kernel is plotted as a function of its arithmetic intensity. This acts as the main roofline of the model and exhibits a slope followed by a plateau when peak FLOPS is reached. The position at which the arithmetic intensity coupled with the available memory bandwidth equals peak FLOPS is called the "ridge" and signifies the transition from memory to compute bound.

Achieving either maximum memory bandwidth or peak FLOPS on modern multicore and manycore processors requires the exploitation of a plethora of architectural features even if the algorithm exhibits a favourable arithmetic intensity. For example, achieving peak FLOPS requires a balance of additions and multiplications for fully populating all functional units, or for all such operations to be cast as FMA operations on architectures that support such primitives. Moreover, even where the computational kernels exhibit a perfect balance of operations or can cast all of them into FMA instructions, they will not automatically extract peak performance if data parallelism through SIMD and instruction level parallelism through loop unrolling are not exploited as well.

Achieving performance close to the maximum memory bandwidth of the system requires that memory accesses are performed at unit stride and include some form of prefetching, whether by the hardware or the software. Furthermore, on two-socket systems, extracting the maximum performance out of the memory system also requires that NUMA effects are catered for when running at full node concurrency [92],[11], for example via the first-touch placement sketched below. Consequently, best attainable performance will vary significantly between applications that exhibit a high degree of irregular and indirect access patterns, such as unstructured CFD codes, and those with regular, unit stride memory access patterns, such as structured CFD codes.
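The fragment below is a minimal sketch of NUMA-aware "first touch" placement under an assumed Linux first-touch page policy: initialising the array with the same static thread schedule as the later compute loop places each page in the memory nearest the socket that will use it. The helper name is illustrative.

    #include <stdlib.h>

    /* allocate and touch n doubles with the compute loop's schedule */
    double *alloc_first_touch(size_t n)
    {
        double *a = malloc(n * sizeof(double));
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; i++)
            a[i] = 0.0; /* page faults map the page near this thread */
        return a;
    }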


To demonstrate these considerations, roofline diagrams are presented in Figure 5 for the processor architectures described in Section 3.1 which are used throughout this thesis.

The ceilings in Figure 5 that are parallel to the horizontal line representing peak floating point performance are optimisations required to improve computational performance. The ceilings that are parallel to the diagonal representing peak main memory bandwidth are memory optimisations necessary for efficiently exploiting the memory system of that particular architecture. The order of each ceiling is set based on how likely it is for the compiler to apply such optimisations automatically, without external intervention [92].

For the Sandy Bridge node, a balance between additions and multiplications is required due to the design of the execution ports, where one multiplication and one addition can be performed every clock cycle as previously discussed. An imbalanced kernel will fully utilise either the addition or the multiplication execution unit whilst the other sits idle, thereby reducing throughput by a factor of two. Furthermore, lack of SIMD computations via AVX on 256-bit vector registers further decreases performance by a factor of four from the previous ceiling in the context of double precision computations.

On the Haswell and Broadwell nodes, lack of FMA operations reduces the maximum attainable performance by a factor of four since two SIMD FMA operations can be performed in one cycle on both architectures. Lack of balance between floating point operations is also detrimental on these architectures since these can only be performed on one of the two ports, the same as on Sandy Bridge. As a result, this leads to another factor of two drop in the maximum performance that is attainable. Finally, absence of AVX/AVX2 execution on Haswell and Broadwell results in a further reduction by a factor of four as a result of the same considerations as on Sandy Bridge.

On the Skylake Server architecture, lack of FMA operations has a similar effect to that on Haswell and Broadwell. However, if the wide SIMD units are not exploited via AVX-512, the maximum attainable performance is reduced further by a factor of eight due to the wide 512-bit vector registers.

The two-socket compute nodes based on the multicore CPU architectures share the same ceiling for memory optimisations, namely NUMA optimisations when crossing socket boundaries, followed by larger drops in performance in cases where the kernel exhibits irregular access patterns.


Figure 5: Example of rooflines (GFLOPS against arithmetic intensity) for computational nodes based on the processor architectures presented in Section 3.1: 2x SNB E5-2650 (16 cores), 2x HSW E5-2650 (20 cores), 2x BDW E5-2680 (28 cores), 2x SKL Gold 6140 (36 cores), KNC 7120P (61 cores) and KNL 7210 (64 cores). Each panel shows the peak FP and peak memory bandwidth rooflines together with ceilings such as MUL/ADD balance, FMA, AVX/AVX2/AVX-512/IMCI, TLP, NUMA, prefetching, MCDRAM and irregular access.


On the Knights Corner 7120P coprocessor, the in-order cores require the use of multiple threads to compensate for the lack of out-of-order execution. In theory, the VPU can be filled by running between two and three threads on the core. Running with only one thread in flight reduces the attainable peak performance by a factor of four in the worst case scenario. Furthermore, similarly to Haswell and Broadwell, an imbalanced kernel that cannot fully exploit FMA operations on Knights Corner will suffer a factor of two drop in performance. Lack of SIMD results in a factor of eight decrease due to the wide 512-bit vector registers.

Since the Knights Corner architecture does not exhibit NUMA effects due to the ring topology and interleaved memory bank placement, the first ceiling for obtaining performance close to maximum memory bandwidth is prefetching. This is due to the lack of hardware prefetchers. The second ceiling is similar to the multicore architectures and represents the lack of contiguous and unit stride memory access patterns, which results in a factor of four drop in performance relative to the maximum attainable memory bandwidth.

On the Knights Landing architecture, the first in-core ceiling is based on the lack of FMA operations. As there are two VPUs in a Knights Landing core, if the kernel's floating-point operations cannot be executed as FMA instructions, performance will drop by a factor of four. Furthermore, lack of AVX-512 leads to a factor of eight drop in performance.

With regard to memory ceilings, the Knights Landing architecture integrates both an L1 and an L2 hardware prefetcher. As a result, the first ceiling is represented by the lack of contiguous memory access patterns. The second one is based on not exploiting the MCDRAM implementation, which leads to a factor of five decrease in the maximum attainable memory bandwidth.
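One explicit way of exploiting MCDRAM in flat mode is through the memkind library; the sketch below is a minimal illustration only (the function name and the DDR fallback policy are assumptions, not part of any solver discussed here), and the matching release must use hbw_free or free depending on which path allocated the array:

#include <hbwmalloc.h>  // memkind high-bandwidth memory allocator
#include <stdlib.h>

// Allocate a bandwidth-critical array in MCDRAM when available,
// falling back to regular DDR otherwise.
double *alloc_hbm(size_t n)
{
    if (hbw_check_available() == 0)               // MCDRAM (HBM) present
        return (double *)hbw_malloc(n * sizeof(double));
    return (double *)malloc(n * sizeof(double));  // DDR fallback
}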

3.3 conclusions

This chapter presented the necessary background on the processor architectures used in this thesis, followed by a discussion on the Roofline performance model and its application in this work.


4 optimisation of block-structured mesh applications

4.1 introduction

This chapter presents a number of optimisations for improving the computational performance of block-structured CFD solvers across different processors such as the Intel Sandy Bridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner manycore coprocessor. Code optimisations are demonstrated on two computational kernels exhibiting different computational patterns: regular and streaming access patterns represented by the update of flow variables, and the stencil-based operations arising from the computation of fluxes. A discussion on the code transformations required for achieving efficient exploitation of the available vector units, threads and cores for both kernels and across the selected processors is also given, while performance results are correlated with the Roofline performance model.

The remainder of this chapter is structured as follows. Section 4.2 discusses related work. Section 4.3 presents the test vehicle for this study, a description of the two computational kernels and the configuration of the hardware. Section 4.4 describes the code optimisations and their implementation, Section 4.5 presents results and discussions and Section 4.6 gives concluding remarks.

4.2 related work

There have been a number of previous efforts concentrated on optimising the performance of structured grid applications on modern processors, which are also applicable to block-structured CFD codes.

Henretty et al. [37] applied data transpositions for optimising the SIMD execution of stencil operations. Rostrup et al. [76] demonstrated similar techniques for exploiting data parallelism in structured grid applications, with the addition of hand-tuned primitives for shuffling vector operands in stencil operations in order to allow for aligned vector load and store operations on the Cell processor. Rosales et al. [74] determined the effect of data layout transformations for improving the performance of vector operations and the issues of exploiting a large degree of thread parallelism on the manycore Intel Xeon Phi Knights Corner processor, using a Lattice Boltzmann code as a test vehicle. Datta et al. [21] studied the problem of modelling and auto-tuning the performance of stencil operations across a wide range of modern processors in order to exploit their distinct architectural characteristics and therefore extract a higher degree of performance portability. With regards to the latter, McIntosh-Smith et al. [55] and Curran et al. [19] presented the need for a performance portable approach to the optimisation of structured grid applications. According to them, the solution to this problem is the utilisation of a framework that is compatible with the majority of manycore processors, such as OpenCL [63]. They demonstrated that performance portability across a variety of structured grid codes can be achieved by re-writing the main computational kernels in the OpenCL framework. Their results indicate that good performance portability can be achieved across a wide range of manycore architectures such as NVIDIA and AMD GPUs as well as Intel Xeon Phi processors, although this does differ significantly depending on the complexity of the application.

Other approaches for ensuring performance portability on both multicore and manycore processors are auto-tuning, as presented by Williams et al. [92], and the implementation of DSLs and source-to-source code generators such as the previously mentioned SBLOCK [14] and OPS [70].

The contribution of this chapter is the incorporation of some of the techniques presented in the literature, together with more novel optimisations, in a block-structured solver of complexity and structure representative of industrial CFD applications. Their impact is demonstrated on two multicore CPU architectures as well as on the Intel Xeon Phi Knights Corner coprocessor, and shows that performance in structured grid applications can be achieved across such architectures using traditional programming models coupled with portable APIs such as OpenMP.

4.3 background

The test vehicle for this study is an Euler solver used in an industrial setting for performing quick calculations of transonic turbomachinery flows. The solver computes inviscid flow solutions in the $m' - \theta$ coordinate system [87], where $\theta$ is the angular position around the annulus, $m$ is the arclength evaluated along a stream surface, $dm = \sqrt{dx^2 + dr^2}$, and $m'$ is a normalized (dimensionless) curvilinear coordinate defined from the differential relation $dm' = dm/r$. The $m' - \theta$ system is used in turbomachinery computations because it preserves aerofoil shapes and flow angles. A typical computational domain is shown in Figure 6 and consists of a number of blocks connected by stitch lines.

Figure 6: Computational domain consisting of five blocks and corresponding stitch lines.

4.3.1 Governing Equations

The Euler equations are solved in semi-discrete form

$$\frac{d}{dt}\left(W_{i,j}U_{i,j}\right) = F_{i-1/2,j} - F_{i+1/2,j} + G_{i,j-1/2} - G_{i,j+1/2} + S_{i,j} = RHS_{i,j} \qquad (3)$$


In equation (3), $W_{i,j}$ is the volume of cell $i,j$, $U_{i,j}$ is the vector of conserved variables, and the vectors $F_{i-1/2,j}$, $G_{i,j-1/2}$ and $S_{i,j}$ denote fluxes through i-faces, fluxes through j-faces and source terms, respectively. They are evaluated as follows:

$$F_{i-1/2,j} = s^{\zeta}_{i-1/2,j}\begin{pmatrix}\rho w_\zeta\\ \rho w_\zeta u_m + p n_{\zeta m}\\ \rho w_\zeta u_\theta + p n_{\zeta\theta}\\ \rho w_\zeta h - p w_\zeta\end{pmatrix}_{i-1/2,j} \qquad (4)$$

$$G_{i,j-1/2} = s^{\eta}_{i,j-1/2}\begin{pmatrix}\rho w_\eta\\ \rho w_\eta u_m + p n_{\eta m}\\ \rho w_\eta u_\theta + p n_{\eta\theta}\\ \rho w_\eta h - p w_\eta\end{pmatrix}_{i,j-1/2} \qquad (5)$$

$$S_{i,j} = W_{i,j}\rho_{i,j}\begin{pmatrix}0\\ u_\theta^2 \sin\phi\\ u_m u_\theta \cos\phi\\ 0\end{pmatrix}_{i,j} \qquad (6)$$

For the purpose of flux and source term evaluation, the contravariant velocities $w_{\zeta/\eta}$, the normals $n_{\zeta m,\theta}$ and $n_{\eta m,\theta}$ and the radial flow angle $\phi$ are also needed.

4.3.2 Spatial discretization

The physical fluxes are approximated with second-order TVD-MUSCL [6][39][73] numerical fluxes

$$F^*_{i-1/2,j} = \frac{1}{2}\left(F_{i-1,j} + F_{i,j}\right) - \frac{1}{2}R|\Lambda|L\left(U_{i,j} - U_{i-1,j}\right) - \frac{1}{2}R|\Lambda|\Psi L \Delta U_{i-1/2,j} \qquad (7)$$

The term $R|\Lambda|\Psi L \Delta V_{i-1/2,j}$ represents the second order contribution to the numerical fluxes. $\Psi$ is the limiter, and the flux eigenvectors and eigenvalues $R$, $\Lambda$, $L$ are evaluated at the Roe-average [73] state


$$\Psi L \Delta\begin{pmatrix}\rho\\ u_m\\ u_\theta\\ p\end{pmatrix}_{i-1/2,j} = \Psi^{+} L\, l_{i-1,j}\begin{pmatrix}\frac{\partial\rho}{\partial m} & \frac{\partial\rho}{\partial r\theta}\\ \frac{\partial u_m}{\partial m} & \frac{\partial u_m}{\partial r\theta}\\ \frac{\partial u_\theta}{\partial m} & \frac{\partial u_\theta}{\partial r\theta}\\ \frac{\partial p}{\partial m} & \frac{\partial p}{\partial r\theta}\end{pmatrix}_{i-1,j}\begin{bmatrix}n_{\zeta m}\\ n_{\zeta\theta}\end{bmatrix} + \Psi^{-} L\, l_{i,j}\begin{pmatrix}\frac{\partial\rho}{\partial m} & \frac{\partial\rho}{\partial r\theta}\\ \frac{\partial u_m}{\partial m} & \frac{\partial u_m}{\partial r\theta}\\ \frac{\partial u_\theta}{\partial m} & \frac{\partial u_\theta}{\partial r\theta}\\ \frac{\partial p}{\partial m} & \frac{\partial p}{\partial r\theta}\end{pmatrix}_{i,j}\begin{bmatrix}n_{\zeta m}\\ n_{\zeta\theta}\end{bmatrix} \qquad (8)$$

where L is the matrix of the left eigenvectors of the Roe-averaged flux Jacobian.

$$L = \begin{pmatrix} 1 & 0 & 0 & -\frac{1}{a^2}\\ 0 & t_m & t_\theta & 0\\ 0 & n_m & n_\theta & \frac{1}{\rho a}\\ 0 & -n_m & -n_\theta & \frac{1}{\rho a} \end{pmatrix} \qquad (9)$$

The evaluation of the numerical fluxes also requires the Roe-averaged eigenvalues and right-eigenvectors of the flux Jacobian, which are evaluated as follows:

$$\Lambda = \mathrm{diag}\left(u_n,\ u_n,\ u_n + a,\ u_n - a\right) + \varepsilon \qquad (10)$$

$$R = \begin{pmatrix} 1 & 0 & 1 & 1\\ 0 & t_m & (u_n + a)\,n_m & (u_n - a)\,n_m\\ 0 & t_\theta & (u_n + a)\,n_\theta & (u_n - a)\,n_\theta\\ k & u_t & h + u_n a & h - u_n a \end{pmatrix} \qquad (11)$$


where $\varepsilon$ is Harten's entropy correction [36]. The Roe-averaged state is defined by

$$u_m = \omega\, u^{i+1,j}_m + (1-\omega)\, u^{i,j}_m \qquad (12)$$

$$u_\theta = \omega\, u^{i+1,j}_\theta + (1-\omega)\, u^{i,j}_\theta \qquad (13)$$

$$h = \omega\, h^{i+1,j} + (1-\omega)\, h^{i,j} \qquad (14)$$

$$a^2 = (\gamma - 1)(h - k) \qquad (15)$$

$$\rho = \sqrt{\rho^{i+1,j}\rho^{i,j}} \qquad (16)$$

$$\omega = \frac{\sqrt{\rho^{i+1,j}}}{\sqrt{\rho^{i+1,j}} + \sqrt{\rho^{i,j}}} \qquad (17)$$

Similar definitions are applied for the $G^*_{i,j-1/2}$ numerical flux vector.

4.3.3 Time integration

Convergence to a steady state is achieved by a matrix-free, implicit algorithm. At each iteration, a correction to the primitive variable vector $V$ is determined as the solution to the linear problem [34]

$$\frac{\delta\left(W_{i,j}U_{i,j}\right)}{\delta t} = RHS_{i,j} + J^{h,k}_{i,j}\,\delta V_{h,k} \qquad (18)$$

or, equivalently

$$\left(W_{i,j}K_{i,j} - J^{h,k}_{i,j}\right)\delta V_{i,j} = -\left(\delta W_{i,j}\right)U_{i,j} + RHS_{i,j} \qquad (19)$$

where $V_{i,j}$ is the vector of primitive variables at cell $i,j$ and $K_{i,j}$ is the transformation Jacobian from primitive variables to conserved variables. The linear problem in equation (19) can be approximated by a diagonal problem if the assembled flux Jacobian $J^{h,k}_{i,j}$ is replaced by a matrix bearing on the main diagonal the sum of the spectral radii $|\Lambda|$ of the flux Jacobian contributions for each cell

$$J^{h,k}_{i,j} \approx -\mathrm{diag}\left(\sum_s |\Lambda|_{max}\right)_{i,j} \qquad (20)$$


At a fixed Courant number $\sigma = \frac{\delta t_{i,j} \sum_s |\Lambda|_{max}}{W_{i,j}}$, this approximation yields the following update

$$\delta V_{i,j} = \frac{1}{W_{i,j}(1+\sigma)} K^{-1}_{i,j}\left(-U_{i,j}\,\delta W_{i,j} + RHS_{i,j}\right) \qquad (21)$$

The solver has been validated against MISES [50] for turbine test cases. For the purpose of implementation, the baseline solver stores the primitive variables $V_{i,j}$ at each cell as well as auxiliary quantities such as the speed of sound and total enthalpy, while all computations are carried out in double precision. The application is implemented as a set of C++ classes where Eqs. (3)-(21) are implemented as methods in distinct classes representing an abstraction of a gas model. These methods are called at appropriate times during application execution to compute the numerical fluxes, updates etc. The remaining elements of the application, in charge of pre-processing and post-processing, are handled by a different set of classes.

4.3.4 Computational kernels

Inspection of equations (3)-(21) reveals the existence of two types of computational kernels and patterns. The first type can be defined as a cell-based loop which evaluates cell attributes and is performed by looping over the cells in the domain. The evaluation of $S_{i,j}$ in equation (3) or the block diagonal inversion in equation (21) fall within this category. The second type evaluates stencil operators and needs to be performed by looping over the neighbours of each cell or by looping over the cell boundaries. The evaluation of the numerical fluxes $F_{i\pm 1/2,j}$ and $G_{i,j\pm 1/2}$ in equation (3) is an example of such stencil kernels.

The solver spends 75% of its time in computing the numerical fluxes and approximately 15% in evaluating cell-centred attributes, the most time consuming of which is the update of flow variables. Therefore, optimising both of these kernels will have the highest impact on overall application performance.

flux computations The flux computations kernel iterates over a set of cell interfaces in separate i and j sweeps and computes numerical fluxes using Roe's approximate Riemann solver, as seen in Listings 1 and 2. The kernel accepts pointers to the left and right states (q), to the corresponding residuals (rhs), and to the normals of the interfaces. The computations performed at each interface are: the evaluation of Euler fluxes for the left and right state, equations (4, 5); the evaluation of Roe averages and corresponding eigenvectors and eigenvalues, equations (9)-(17); the assembly of the numerical fluxes; and the accumulation of the residuals. The kernel computes two external products (left and right Euler flux) and one 4x4 GAXPY. A judicious implementation requires 191 FLOPS and loads 38 double precision values per direction sweep.

Figure 7: Schematic of the cell-centred finite volume scheme on structured grids.


// fluxes in i-direction
for( j=0;j<nj;j++ ){
  for( i=0;i<ni-1;i++ ){
    iq = j*ni+i;
    // gather left and right states
    ql[0] = q[0][iq];
    qr[0] = q[0][iq+1];
    ...
    // Roe's approximate Riemann solver
    riem(ql,qr,...,f,&lmax);
    // accumulate residuals on both sides of the face
    rhs[0][iq]   -= f[0];
    rhs[0][iq+1] += f[0];
  }
}

Listing 1: i-direction sweep

// fluxes in j-direction
for( j=0;j<nj-1;j++ ){
  for( i=0;i<ni;i++ ){
    iq = j*ni+i;
    // gather left and right states
    ql[0] = q[0][iq];
    qr[0] = q[0][iq+ni];
    ...
    // Roe's approximate Riemann solver
    riem(ql,qr,...,f,&lmax);
    // accumulate residuals on both sides of the face
    rhs[0][iq]    -= f[0];
    rhs[0][iq+ni] += f[0];
  }
}

Listing 2: j-direction sweep

flow variable update The kernel computes the primitive variable updates, based on the residuals of the discretized Euler equations, as defined in equation (21). The kernel accesses the arrays storing the flow variables, the residuals, an array storing auxiliary variables and an array storing the spectral radii of the flux Jacobians. The arrays are passed to the function through their base pointers. The flow and auxiliary variables are used to compute the entries of the transformation Jacobian $K^{-1}_{i,j}$ between conserved variables and primitive variables. The kernel performs 40 FLOPS per cell and loads 35 double precision values, giving it a ratio of 40/(35 x 8 bytes) or approximately 0.14 flops/byte.

4.3.5 Configuration of compute nodes

Table 2 presents the configuration of the compute nodes used in this chapter: processor architecture and model, memory configuration and compiler version.


                     SNB        HSW        KNC
Version              E5-2650    E5-2650    5110P
Sockets              2          2          1
Cores (per socket)   8          10         60
Threads (per core)   2          2          4
Clock (GHz)          2.0        2.3        1.053
SIMD ISA             AVX        AVX2       IMCI
SIMD width           256-bit    256-bit    512-bit
L1 Cache (KB)        32         32         32
L2 Cache (KB)        256        256        512
L3 Cache (MB)        20         25         -
DRAM (GB)            32         64         8
DRAM type            DDR3       DDR4       GDDR5
Stream (GB/sec)      71         110        140
Compiler             icpc 15.0  icpc 15.0  icpc 15.0

Table 2: Hardware and software configuration of the compute nodes used in this chapter. The SIMD ISA represents the latest vector instruction set architecture supported by the particular platform.

4.4 optimisations

This section presents in detail a number of optimisation techniques that have been applied to the selected computational kernels introduced in Section 4.3.

4.4.1 Flow variable update

vectorization Vectorization of the flow variable update kernel has been achieved through the use of OpenMP 4.0 [64] compiler directives, i.e. #pragma omp simd. The use of OpenMP 4.0 directives is found preferable to compiler-specific directives such as the ones available in Intel compilers because they are portable and supported across other compilers. In order to achieve efficient vectorization, the qualifier restrict needs to be added to the function arguments. This guarantees to the compiler that the arrays represented by the arguments do not overlap. In the absence of further provisions, the compiler generates code for unaligned loads and stores. This is a safety precaution, as aligned SIMD load/store instructions on unaligned addresses lead to bus errors on the Intel Xeon Phi coprocessor. In contrast, Sandy Bridge and Haswell processors can deal with aligned access instructions on unaligned addresses, albeit with some performance penalty due to the inter-register data movements that are required.

Aligned vector load and store operations can be achieved by adding the aligned qualifier to the original directive and by allocating all the relevant arrays using the _mm_malloc wrapper function. _mm_malloc takes an extra argument representing the required alignment (32 bytes for AVX/AVX2 and 64 bytes for IMCI on Knights Corner). A number of additional directives may be needed to persuade the compiler that aligned loads can be issued safely, as shown in the snippet below for a pointer storing linearly four variables in the SoA arrangement, at offsets id0, id1:

__assume_aligned(rhs, 64);

__assume(id0%8==0);

__assume(id1%8==0);

__assume((id0*sizeof(rhs[0]))%64==0);

__assume((id1*sizeof(rhs[0]))%64==0);

Listing 3: Example of extra compiler hints needed for generating aligned vector loads and stores on 64-byte boundaries.

The __assume_aligned construct indicates that the base pointer is aligned with the SIMD register boundary. The following __assume statements indicate that subscripts and indices accessing the four sections of the array are also aligned on the SIMD register boundary. The need for such additional directives is due to the fact that the OpenMP 4.0 implementation can only generate aligned load and store operations if the argument passed to the aligned clause, i.e. aligned(rhs:64), is a single pointer and not a pointer to pointer. This somewhat complicates the issue of generating aligned vector load and store operations even for kernels with regular access patterns, since the additional directives presented in Listing 3 are specific to Intel compilers and therefore not portable.
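Where the data is reachable through a single pointer, the portable route suffices on its own; a minimal sketch under that assumption (the function and array names are illustrative):

#include <immintrin.h>  // _mm_malloc, _mm_free

void zero_rhs(int n)
{
    // 64-byte allocation so the aligned clause below holds
    double *rhs = (double *)_mm_malloc(n * sizeof(double), 64);

    // honoured because rhs is a single pointer, not a pointer-to-pointer
    #pragma omp simd aligned(rhs:64)
    for (int i = 0; i < n; ++i)
        rhs[i] = 0.0;

    _mm_free(rhs);
}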

Another way of achieving SIMD execution with aligned load and store operations is by invoking compiler intrinsics or by using a higher abstraction layer such as a library. In this work, versions of the cell-based flow variable update kernel have been implemented using both compiler intrinsics specific to each architecture as well as Agner Fog's Vector Class Library (VCL) [16]. The main advantage of the latter is that unwieldy compiler intrinsics are encapsulated away, allowing for more readable code. On Knights Corner, an extension of the VCL library was used, developed by Przemyslaw Karpinski at CERN [86].

software prefetching Software prefetching can also be used to improve the performance of memory-bound kernels. This has been implemented in the kernel version based on compiler intrinsics, using the _mm_prefetch intrinsic. The latter takes as an argument a hint specifying the cache level into which the data should be prefetched. When using _mm_prefetch it is advisable to disable the compiler prefetcher using the -no-opt-prefetch compilation flag (Intel 15.0 compilers).
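A minimal sketch of the mechanism is given below; the function and array names, the prefetch distance and the choice of hints are illustrative, with tuned values in practice coming from auto-tuning:

#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_* constants

void update_prefetch(double *rhs, const double *q, int n, double s)
{
    const int dist = 64;  // assumed prefetch distance, in elements
    for (int i = 0; i < n; ++i) {
        // request data ahead of use; prefetches past the end of the
        // array are harmless, as prefetch faults are suppressed
        _mm_prefetch((const char *)&q[i + dist], _MM_HINT_T1);
        _mm_prefetch((const char *)&q[i + dist / 2], _MM_HINT_T0);
        rhs[i] += s * q[i];
    }
}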

data layout transformations The previous data layout used to store both cell-centred and face quantities such as normals was based on the SoA format. The SoA layout is advisable for SIMD since it allows for contiguous SIMD loads and stores. However, it can lead to performance penalties such as Translation Look-aside Buffer (TLB) misses for very large arrays, while also requiring that multiple memory streams are kept in flight, one for each individual vector. A hybrid approach such as the Array of Structures Structure of Arrays (AoSSoA) can alleviate these issues by offering the recommended SIMD grouping in sub-vectors coupled with improved intra- and inter-structure locality.

Therefore, a version of the kernel using the AoSSoA data layout has also been studied. Four sub-vector lengths were tested: 4, 8, 16 and 32 double precision values on the multicore processors and 8, 16, 32 and 64 on the coprocessor. The best performing sub-vector length of AoSSoA proved to be the one equal to the SIMD vector width of the processor (i.e. 4 and 8, respectively).
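The layout can be sketched as follows; the field names, and the sub-vector length of 4 doubles matching one AVX register, are illustrative:

#define VL 4  // sub-vector length: one AVX register of doubles

typedef struct {
    double rho[VL];  // density of VL consecutive cells
    double um[VL];   // meridional velocity
    double uth[VL];  // tangential velocity
    double p[VL];    // pressure
} cells_aossoa;

// cell i is found in block i/VL at slot i%VL, e.g.
// double pi = cells[i / VL].p[i % VL];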

thread parallelism Thread parallelism of the flow variable update kernel was exploited using the OpenMP 4.0 construct for the auto-vectorized version, which complements the initial #pragma omp simd construct and further decomposes the loop iterations across all of the available threads in the environment. The kernel versions based on compiler intrinsics used the generic OpenMP loop-level construct. The work decomposition has been carried out via a static scheduling clause, whereby each thread receives a fixed-size chunk of the entire domain. Additionally, the first touch policy has been applied for all data structures utilised in this method in order to alleviate NUMA effects when crossing socket boundaries. First touch is a basic technique which consists in performing trivial operations, e.g. initialization to zero, on data involved in parallel loops before the actual computations. This forces the relevant sections of the arrays to be loaded on the virtual pages of the core and socket where the thread resides. The first touch needs to be performed with the same scheduling and thread pinning as the subsequent computations. In terms of thread pinning, the compact process affinity was used for Sandy Bridge and Haswell, running only one thread per physical core. On Knights Corner, the scatter affinity brought forth better results compared to compact.

4.4.2 Flux computations

vectorization Code vectorization of the kernel computing Roe's numerical fluxes has been achieved by evaluating a number of avenues such as auto-vectorization via compiler directives, compiler intrinsics and the VCL vector class, similar to the cell-based kernel.

In order for the code to be vectorized by the compiler, a number of modifications had to be performed. First of all, in the original version, the face neighbours were obtained using indirect references in order to also allow for calculations on unstructured meshes. These were removed and their positions explicitly computed for each iteration. As these increase in a linear fashion by unit stride once their offset is computed, the compiler was then able to replace gather and scatter instructions with unaligned vector loads and stores, albeit with the help of the OpenMP 4.0 linear clause. This had a particularly large effect on Knights Corner, where the performance of unaligned loads and stores was a factor of two higher compared to gather and scatter operations. Furthermore, the restrict qualifier was also necessary for any degree of vectorization to be attempted by the compiler.
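A minimal sketch of the transformation follows; the names are illustrative and the flux expression stands in for the full Riemann solve:

void row_fluxes(double *f, const double *q, int ni, int j)
{
    int iq = j * ni;  // offset of row j, computed once
    #pragma omp simd linear(iq:1)
    for (int i = 0; i < ni - 1; ++i) {
        double ql = q[iq];       // left state
        double qr = q[iq + 1];   // right state
        f[i] = 0.5 * (ql + qr);  // placeholder for the Riemann flux
        ++iq;                    // unit-stride advance declared above
    }
}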

In structured and block-structured codes, it is natural to group the faces according to the transformed coordinate that stays constant on their surface, e.g. i-faces and j-faces. As a result, fluxes can be evaluated in two separate sweeps visiting the i-faces or the j-faces, respectively. Each operation consists of nested i- and j-loops as seen in Section 4.3. When visiting i-faces, if the left and right states are stored at unit stride, they cannot be simultaneously aligned on the vector register boundary, thus preventing efficient SIMD execution as demonstrated in Figure 8. When visiting j-faces this problem does not appear, as padding is sufficient to guarantee that the left and right states of all j-faces can be simultaneously aligned on the vector register boundary.

As a result, vectorized versions of the flux computation kernel were implemented based on compiler intrinsics and the VCL library that utilise unaligned loads and stores when visiting the i-faces but aligned operations for the sweeps along the j-faces. On the Knights Corner architecture, the VPU ISA does not contain any native instructions for performing vector load and store operations at unaligned addresses. To this extent, the programmer has to rely upon the unpacklo, unpackhi, packstorelo and packstorehi instructions and their corresponding intrinsics to emulate such behaviour. Examples of methods used in this work for performing unaligned load and store operations on Knights Corner can be seen in Listings 4 and 5. As one can observe, emulating an unaligned load on Knights Corner requires two aligned loads/stores plus register manipulations, hence the penalty in performance compared to their aligned counterparts.

inline __m512d loadu(double *src)
{
  __m512d ret;
  // load and unpack the part of the vector in the first cache line
  ret = _mm512_loadunpacklo_pd(ret, src);
  // load and unpack the remainder from the next cache line
  ret = _mm512_loadunpackhi_pd(ret, src+8);
  return ret;
}

Listing 4: Unaligned SIMD loads on Knights Corner

inline void storeu(double *dest, __m512d val)
{
  // store the lower part into the first cache line
  _mm512_packstorelo_pd(dest, val);
  // store the upper part into the next cache line
  _mm512_packstorehi_pd(dest+8, val);
}

Listing 5: Unaligned SIMD stores on Knights Corner

The lack of alignment for the loop over the i-faces can be resolved by two techniques: shuffle operations or transposition. Shuffling consists in performing the loads using register-aligned addresses, and then modifying the vector registers to position all the operands correctly. On AVX, performing these shuffle operations requires three instructions, as register manipulations can only be performed within 128-bit lanes.


Figure 8: Vector load operation at aligned boundary for the left state and load operation at unaligned boundary for the right state.


Figure 9: Side effects of unaligned vector load operations leading to cache line split loads.


Figure 10: Vector load operation at aligned boundary for both left and right states followed by inter-register shuffles for correct positioning of operands.


The implementation using compiler intrinsics and register shuffles issues aligned loads and stores combined with shuffle operations when handling i-faces. The use of register shuffling also saves a load for each iteration, as the current right state can be re-used as the left state when evaluating the following face, therefore allowing for register blocking. The issue of unaligned loads for the i-sweep direction, as well as the remedy of performing inter-register shuffling, can be seen in Figures 8, 9 and 10. For best performance, the optimisation of the vector register rotations and shifts (i.e. the shuffle operations) is critical. On AVX, this was achieved by using the vperm2f128 instruction followed by a shuffle. AVX2 supports the vpermd instruction, which can perform cross-lane permutations, therefore a more efficient approach can be used which requires fewer instructions. Knights Corner also supports vpermd, although the values have to be cast from epi32 (integers) to the required format (doubles in the present case). The choice of compiler intrinsics for each operation is based on their latency and throughput: instructions with a throughput larger than one can be executed by more than one execution port, if these are available, leading to improved instruction-level parallelism. As an example, the AVX/AVX2 _mm256_blend_pd compiler intrinsic, which converts to a vblendpd instruction, has a latency of one cycle and a throughput of 3 on Haswell, and a latency of one cycle and a throughput of 2 on Sandy Bridge. Consequently, it is preferable to the AVX _mm256_shuffle_pd intrinsic (vshufpd), which has one cycle latency but unit throughput on both architectures. The code snippets below show how shuffling can be achieved across the three architectures.

// vl= 3 2 1 0

// vr= 7 6 5 4

__m256d t1=_mm256_permute2f128_pd(vl,vr,33); // 5 4 3 2

__m256d t2=_mm256_shuffle_pd(vl,t1,5); // 4 3 2 1

Listing 6: Register shuffle for aligned accesses on the i-face sweep with AVX.

// vl= 4,3,2,1

// vr= 8,7,6,5

__m256d blend = _mm256_blend_pd(vl,vr,0x1); // 4,3,2,5

__m256d res =_mm256_permute4x64_pd(blend,_MM_SHUFFLE(0,3,2,1)); // 5,4,3,2

Listing 7: Register shuffle for aligned accesses on the i-face sweep with AVX2.


// vl = 8,7,6,5,4,3,2,1
// vr = 16,15,14,13,12,11,10,9
// 32-bit permutation indices rotating the register left by one double
__m512i idx = {2,3,4,5,6,7,8,9,10,11,12,13,14,15,0,1};
// blend = 8,7,6,5,4,3,2,9
__m512d blend = _mm512_mask_blend_pd(0x1, vl, vr);
// res = 9,8,7,6,5,4,3,2
__m512d res = _mm512_castsi512_pd(
    _mm512_permutevar_epi32(idx, _mm512_castpd_si512(blend)));

Listing 8: Register shuffle for aligned accesses on the i-face sweep with IMCI.

Generating aligned loads/stores in the VCL library can be performed by using the blend4d method, which requires an index shuffle map as argument. However, this was found to be less efficient than the implementation based on compiler intrinsics, as the blend4d primitive did not utilise the instructions with the best latency and throughput on each distinct architecture.
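For comparison, a minimal sketch of the VCL route, reproducing the one-element shift of Listing 6; the wrapper function name is illustrative, and in blend4d indices 0-3 select from the first argument and 4-7 from the second:

#include "vectorclass.h"

Vec4d shift_by_one(Vec4d vl, Vec4d vr)  // vl = {0,1,2,3}, vr = {4,5,6,7}
{
    return blend4d<1,2,3,4>(vl, vr);    // -> {1,2,3,4}
}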

Another technique for addressing the stream alignment conflict is to transpose the cell-centred variables before the sweep over the i-faces and transpose them back prior to the sweep over the j-faces. This is somewhat similar to Henretty's dimension-lifted transposition [37]; however, the version implemented in this work has the disadvantage of requiring that data be re-transposed prior to the sweep on the j-faces due to the independent evaluation in the i and j directions.

cache blocking In the implementations up to this point, the loops visiting i-faces and j-faces have been kept separate. Spatial and temporal locality can be improved by fusing (blocking) the evaluation of fluxes across the j-face and the i-face of each cell, as shown in Listing 9 below. This technique is known as cache blocking. The cache blocking kernel fuses both i- and j-passes and is based on the kernel versions that perform inter-register shuffles, as described above, for allowing aligned loads and stores on the i-faces. The cache blocking kernel further benefits from the fact that it can save an extra load/store operation when writing back results for each evaluated face. The blocking factor should be a multiple of the SIMD vector length so as to allow for efficient vectorization.

// loop over tiles of size iBF x jBF (the blocking factors)
for( jj=1; jj<nj; jj+=jBF ){
  for( ii=1; ii<ni; ii+=iBF ){
    // evaluate i-face and j-face fluxes within the current tile
    fluxi(...);
    fluxj(...);
  }
}

Listing 9: Implementation of cache blocking by nesting both i and j sweeps together.

data layout transformations The discussion so far for the flux computations has assumed a SoA data layout format. A kernel using the AoSSoA data layout and based on inter-register shuffling for aligned SIMD load operations was also tested. The best performing sub-vector size was the size of the SIMD register, similar to the flow variable update kernel.

thread parallelism For the purpose of studying performance at full chip and node concurrency, domain decomposition is performed at block level. This can be done in two ways. One option is to split a block into slabs along the i- or j-planes depending on the sweep, the number of slabs being equal to the number of available threads. The second option is to split each block into tiles. This method increases locality and complements the cache-blocking processing of the kernel computing fluxes as discussed above. For the non-fused kernels where separate plane sweeps are performed, the first decomposition method was chosen due to the elimination of possible race conditions. For the fused kernels, the latter decomposition was required, together with the handling of race conditions at the tile boundaries. In the second option, the decomposition mimics that on distributed-memory machines via MPI and requires treatment for updating boundaries in a serial fashion once the fluxes are computed in the inner regions.
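A minimal sketch of the slab option for the i-face sweep is given below; the decomposition logic is illustrative and the flux expression is a placeholder for the Riemann solve:

#include <omp.h>

void sweep_i_faces(double *rhs, const double *q, int ni, int nj)
{
    #pragma omp parallel
    {
        // each thread owns a contiguous range of j-rows, so both
        // residual updates of an i-face stay within one slab
        int tid   = omp_get_thread_num();
        int nth   = omp_get_num_threads();
        int chunk = (nj + nth - 1) / nth;
        int j0    = tid * chunk;
        int j1    = (j0 + chunk < nj) ? j0 + chunk : nj;
        for (int j = j0; j < j1; ++j)
            for (int i = 0; i < ni - 1; ++i) {
                int iq = j * ni + i;
                double f = 0.5 * (q[iq] + q[iq + 1]); // placeholder flux
                rhs[iq]     -= f;
                rhs[iq + 1] += f;
            }
    }
}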

Thread and process affinity have been applied in a similar fashion to the flow variable update kernel, i.e. compact for the multicore CPUs and scatter for the coprocessor. Task and chunk allocation have been performed manually, as required by the custom domain decomposition within the OpenMP parallel region. Another reason for doing the above and not relying on the OpenMP runtime was the high overhead that the latter caused on the Knights Corner processor when running on more than 100 threads. Furthermore, all parallel runs of the optimised flux computation kernels implement NUMA-aware placement by applying the first touch technique.


4.5 results and discussions

This section presents the results of the optimisations on both computational kernels and a discussion on their effectiveness.

4.5.1 Flow variable update

Figure 11 presents the effect that the optimisations had on the performance of the flow variable update kernel when running on a single core. Results at full-chip and full-node concurrency for each optimisation are presented in Figure 12.

vectorization On Sandy Bridge and Haswell, even the auto-vectorized kernel performs two and three times faster, respectively, than the baseline kernel. On Knights Corner, the improvement is almost one order of magnitude, as the VPU and therefore the wide SIMD lanes are exploited whereas they previously were not.

Aligned versus unaligned loads and stores see no benefit on Sandy Bridge and, for smaller problem sizes, a marginal improvement on Haswell due to better L1 cache utilisation. The aligned access version outperforms its unaligned counterpart on Knights Corner, as a 512-bit vector register maps across an entire cache line, therefore allowing for efficient load and store operations to/from the L1 cache. The reason there is no significant difference between aligned and unaligned SIMD load and store operations on the multicores is possibly that the kernel is already limited by memory bandwidth on these platforms. On the other hand, on Knights Corner, the difference in instruction count as well as throughput between aligned and unaligned load and store operations leads to a noticeable difference in performance.

The utilisation of compiler intrinsics or the VCL implementation did not bring forth any speed-ups over the directive-based vectorization for this particular kernel. This indicates that the compiler is able to vectorize kernels with regular access patterns competitively, provided it is aided with additional hints such as the ones presented in Listing 3 and compiler directives.


software prefetching Software prefetching for the cell-based flow update kernel sees no benefit across any of the three architectures.

thread parallelism The kernel under consideration is memory-bound on all three architectures and is therefore very sensitive to NUMA effects. For Haswell, not using the first touch technique can lead to a large performance degradation as soon as the active threads spill outside a single socket. This can be seen in the behaviour of the data sets that lack NUMA optimisations in Figure 12. A reason why the discrepancies between NUMA and non-NUMA runs are larger on Haswell compared to Sandy Bridge is the larger core count on the former, which means that the Quick Path Interconnect (QPI) system linking the two sockets gets saturated more quickly due to the increase in cross-socket transfer requests as more cores are serviced.

On Knights Corner there are no NUMA effects due to the ring-based interconnect and the interleaved memory controller placement. However, aligned loads play an important role on this architecture: the SIMD length is 64 bytes (8 doubles), which maps to the size of an L1 cache line. If unaligned loads are issued, the thread is required to cross cache-line boundaries, loading two or more cache lines and performing inter-register movements for packing up the necessary data. This wastes memory bandwidth and forms a very damaging bottleneck, as can be seen from the drop in performance above 60 cores in the version written in compiler intrinsics with unaligned accesses and in the auto-vectorized data set in Figure 12. Consequently, although unaligned accesses did not produce such issues when run on a single core, the overhead of the additional memory traffic across the cache hierarchy grew as the number of cores increased.

The best performing version was based on the hybrid AoSSoA format. The version based on aligned load and store operations via compiler intrinsics combined with manual prefetching performed second best, followed by the version implementing aligned compiler intrinsics with compiler-issued prefetches.


Figure 11: Effects of optimisations for the kernel performing the flow variable update across different grid sizes and running on a single core. Results in GFLOPS on KNC 5110P for the reference runs on grid sizes $10^5$ and $10^6$ are 0.03 and 0.02. The symbol (u) stands for unaligned memory access.


Figure 12: Performance strong scaling of the kernel executing the flow variable update on $10^6$ grid cells. The symbol (u) stands for unaligned memory access, while entries that include the symbol (N) implement NUMA optimisations (i.e. first touch). Results of runs with NUMA (N) optimisations are only applicable to the multicore CPUs and not to KNC.


4.5.2 Flux computations

Figure 13 presents the effects that the optimisations had on the single-core performance of the flux computations kernel. Results at full-chip and full-node concurrency are shown in Figure 14.

vectorization For the kernel computing Roe's numerical fluxes, the version based on auto-vectorization via compiler directives brought forth a 2.5X improvement on Sandy Bridge, 2.2X on Haswell and 13X on Knights Corner when compared to their respective baselines. The penalty for unaligned access on the i-faces is better tolerated on the multicore processors, where inter-register movements can be performed with small latency, but is very damaging on Knights Corner, as already seen.

The kernel version based on unaligned load and store operations and compiler intrinsics delivers a 4X speed-up over the baseline implementation on Sandy Bridge and Haswell and 13.5X on Knights Corner. Comparison with the auto-vectorized version shows a 60% speed-up on Sandy Bridge and 50% on Haswell, although no noticeable improvement on the coprocessor. The increase in performance for the compiler intrinsics version on the multicore processors is due to the manual inner loop unrolling when assembling fluxes, which allows for more efficient instruction-level parallelism.

The compiler intrinsics and shuffle kernel performs similarly on Sandy Bridge compared to the unaligned version, due to AVX cross-lane shuffle limitations, whilst performing 30% and 10% faster on Haswell and Knights Corner, respectively. The kernel versions based on the VCL library perform worse on both multicore CPUs compared to the version using compiler intrinsics due to the inefficient implementation of shuffling in the blend4d routine. On Knights Corner, these were replaced with the hand-tuned compiler intrinsics primitives presented in Section 4.4.2, as they were not implemented in the VCLKNC library. For this reason, the aligned VCL version on the coprocessor delivers very similar performance compared to the version based on compiler intrinsics.

The kernel version based on aligned vector load and store operations on both sweeps, thanks to transpositions, performs worse than the version that is based on inter-register shuffles. This is because transposing the cell-centred variables prior to the sweep on the i-faces and back for the following sweep on the j-faces can have a degrading effect on performance due to the working set not fitting into the cache.


Figure 13: Effects of optimisations on the kernel performing the flux computations on different grid sizes. Results in GFLOPS on KNC 5110P for the reference runs on grid sizes $10^5$ and $10^6$ are 0.05 and 0.04. The symbol (u) stands for unaligned memory access.


As a result, this technique is discouraged, as the overhead it produces is not offset by the difference in performance between unaligned and aligned vector operations. However, it is important to note that such transpositions might be beneficial if the data is only transposed once, as seen in the work of Henretty et al. [37]. In hindsight, this could have been accomplished by fusing both i and j sweeps (i.e. in the cache-blocked version).

cache blocking Across all three platforms, cache blocking did not bring forth any palpable benefits for single core runs. The reason is that although data cache reuse is increased, fusing both loops increases register pressure and degrades performance, as the fused kernel performs 382 FLOPS for every iteration.

thread parallelism The performance on the Sandy Bridge and Haswell nodes at full concurrency is 53 GFLOPS and 101 GFLOPS, respectively. On Knights Corner, the best performing version achieves 94 GFLOPS when running on 180 threads. While on the multicore CPUs the best performing kernels were the versions based on compiler intrinsics, cache-blocking and the hybrid AoSSoA layout, the scaling of this particular kernel on the coprocessor was second best. This can be attributed to register pressure, as more threads scheduled on the same core compete for the available resources. This assumption can be validated by examining the scaling behaviour of the cache-blocked kernel without the hybrid AoSSoA data layout, which suffers from poor scalability as more than one thread gets allocated per core.

Comparing the best performing results to their respective baselines reveals a 3.1X speed-up on Sandy Bridge and Haswell and a 24X speed-up on the Knights Corner coprocessor at full concurrency.

4.5.3 Performance modelling

A visualisation of the Roofline model based on the results of the applied optimisations can be seen in Figure 15. For brevity, the main classes of optimisations are grouped as:

• Reference: for the initial baseline;


• Vectorization (OMP): for versions vectorized using compiler directives;

• Vectorization (Shuffle): for the best performing version based on compiler intrinsics on that particular kernel and platform (i.e. shuffle, transposition etc.);

• Memory Opt.: for the best performing version implementing aligned vector load and store operations, compiler intrinsics and specific memory optimisations (cache-blocking, AoSSoA etc.).

Figure 14: Performance strong scaling of the kernel executing the flux computations as a result of optimisations on $10^6$ grid cells. The symbol (u) stands for unaligned memory access.

single core Observations on the single core diagrams highlight interesting aspects of the two computational kernels and their evolution in the optimisation space.

For the cell-based flow variable update kernel, vectorization using compiler directives pierces through the IMCI roofline on the coprocessor. This is due to the fact that vectorization on Knights Corner allows for instructions to be routed towards the more efficient VPU. Furthermore, this kernel performs as well as the hand-tuned kernels based on compiler intrinsics across all three platforms (see Figures 11 and 12), as its performance is limited by memory bandwidth.

For flux computations, vectorization using compiler directives pierces through the respective SIMD wall, whilst hand-tuned compiler intrinsics and subsequent memory optimisations deliver close to the attainable performance prediction. On Haswell and Knights Corner, however, speed-up is limited by the fact that the FMA units are not fully utilised, due to the algorithmic nature of stencil operators and their inherent lack of FMA operations. Even so, for single core runs, the best optimised version incorporating SIMD as well as memory optimisations achieves between 80-90% of the available performance, compared to the baseline which delivers 20% on the multicore CPUs and only 1.67% on the coprocessor.

full node At full node concurrency, the sizeable increase in memory bandwidth permits the reference implementation to obtain a larger degree of performance across all architectures.

For the cell-based flow variable update kernel, the NUMA first touch optimisations applied to all of the optimised kernel versions allow them to bypass the NUMA wall. On the coprocessor, software prefetching only works when coupled with SIMD optimisations, due to the poor performance of the scalar processing unit compared to the VPU.


Figure 15: Roofline visualisation of optimisations for single-core and full-node configurations (flux and update kernels, plotted as GFLOPS against arithmetic intensity, on 1 core SNB E5-2650, 1 core HSW E5-2650, 1 core KNC 5110P, 2x SNB E5-2650 (16 cores), 2x HSW E5-2650 (20 cores) and KNC 5110P (60 cores)). Key: Reference, Vectorization (OMP), Vectorization (Shuffle), Memory Opt.


As more memory bandwidth becomes available, subsequent optimisations such as the hybrid data layout transformation, which prevents TLB misses and provides better data locality, perform better than the vectorized SoA-based hand-tuned compiler intrinsics version.

For flux computations, vectorization through compiler directives and intrinsics results in performance that is close to the main roofline represented by peak DRAM memory bandwidth, while subsequent memory optimisations bypass the model's prediction. This is due to the fact that some of the data in the optimised kernels is reused from the higher cache levels as a result of memory optimisations such as cache-blocking and is therefore not transferred from main memory. This in effect improves the arithmetic intensity of the kernel, since fewer loads are actually serviced by main memory, and is a limitation of this analysis that is worth considering. An alternative would be to also update the arithmetic intensity of each kernel as optimisations that improve data locality are implemented. However, the difficulty of such an approach is accurately determining the proportion of loads serviced by the higher cache levels versus main memory.

4.5.4 Architectural comparison

On a core-to-core comparison among the three processors, the Haswell-based Xeon E5-2650 core performs on average 2X faster than a Sandy Bridge Xeon E5-2650 core and 4-5X faster than a Xeon Phi Knights Corner 5110P core across both kernels. However, at full node and chip concurrency, the two-socket Haswell Xeon E5-2650 node is approximately on par with the Knights Corner coprocessor for flux computations and 25% faster for updating the flow variables. The Xeon Phi Knights Corner coprocessor and the Haswell node outperform the two-socket Sandy Bridge Xeon E5-2650 node by approximately a factor of two on flux computations and by 50% to 2X on the flow variable update. However, on flop-per-watt and flop-per-dollar metrics, the Knights Corner 5110P coprocessor delivers superior performance compared to both multicore CPUs, at the cost of the higher development and optimisation time needed for exploiting its underlying features in numerical computations. The increase in time spent on fine-tuning the code on the coprocessor is attenuated by the fact that the majority of optimisations targeting fine and coarse grained levels of parallelism, such as SIMD and threads, are transferable between both types of platforms, including GPGPUs.

4.6 conclusions

In this chapter, a variety of optimisation techniques have been implemented in two distinct computational kernels within a block-structured CFD code and across three modern architectures. A detailed description was given of the exploitation of all levels of parallelism available in modern multicore and manycore processors through efficient code vectorization and thread parallelism. Memory optimisations described in this work included software prefetching, data layout transformations through hybrid data structures such as the AoSSoA layout, and cache blocking.

The practicalities of enabling efficient vectorization were discussed, which established that for relatively simple kernels such as cell-based loops for updating state vectors, the compiler can generate efficient SIMD code with the aid of portable OpenMP 4.0 directives. This approach, however, does not fully extend to more complex kernels such as flux computations involving a stencil-based access pattern, where best SIMD performance is obtained through the use of aligned vector load and store operations made possible via inter-register shuffles and permutations. Implementations of such operations were performed in this work using compiler intrinsics and the VCL framework and included bespoke optimisations for each architecture. Vectorized and non-vectorized computations exhibit a 2X performance gap for the flow variable update kernel and up to 5X for flux computations on the Sandy Bridge and Haswell multicore CPUs. The difference in performance is significantly higher on the manycore Knights Corner coprocessor, where vectorized code outperforms the non-vectorized baseline by 13X in updating the flow variables and 23X in computing the numerical fluxes. These figures, when correlated with projections that future multicore and manycore architectures will contain further improvements to the width and latency of SIMD units, mandate efficient code SIMDization as a crucial avenue for attaining performance in structured grid applications on current and future architectures.

Modifying the data layout from SoA to a hybrid AoSSoA led to improvements for the vectorized kernels by minimizing TLB misses when running on large grids and by increasing data locality. Performance gains were particularly noticeable when running at full concurrency and were best on the Knights Corner coprocessor. Cache-blocking the two separate sweeps within the flux computation kernel, coupled with the hybrid data layout, delivered the best results on the multicore CPUs when running across all available cores; however, it performed second best on the coprocessor due to increased register pressure.

Core and thread parallelism has been achieved through the use of OpenMP 4.0 directives for the cell-based flow variable update kernel, which offers mixed parallelism at SIMD and thread granularities through common constructs. For the numerical fluxes, the domain was decomposed into slabs on the i and j faces for the implementations that performed separate plane sweeps, and into tiles for the fused sweep implementation with cache blocking. The flow variable update kernel achieved a 2X speed-up on the Sandy Bridge and Haswell nodes and 20X on the Knights Corner coprocessor when compared to their respective baselines. Computation of the numerical fluxes at full concurrency obtained a 3X speed-up on the multicore CPUs and 24X on the coprocessor over the baseline implementation.

The Roofline performance model has been used to appraise the optimisation process with respect to the algorithmic nature of the two computational kernels and the three architectures. For single core execution, the optimised flow variable update kernel achieved approximately 80% efficiency on the multicore CPUs and 90% on the coprocessor. Flux computations obtained 90% efficiency on Sandy Bridge, 80% on Haswell and approximately 60% on Knights Corner. The reason for the relatively low efficiency on the coprocessor is the in-order core design, which requires at least two threads to fully populate the coprocessor VPU. The computational efficiency at full concurrency outperforms the model's predictions for both kernels across the multicore platforms because some of the data in the optimised kernels is retrieved from the higher cache levels rather than from main memory.


5 OPTIMISATION OF UNSTRUCTURED MESH APPLICATIONS

5.1 introduction

The solution of fluid flow problems in the vicinity of complex geometries mandates the use of unstructured grids. However, compared to structured and block-structured mesh applications, computations on unstructured grids are known to exhibit unsatisfactory performance on cache-based architectures [9]. This is due in great part to the data structures required for expressing grid connectivity and the resulting indirect and irregular access patterns.

In finite volume discretizations, these data structures and access patterns appear when iterating over the faces or edges of the computational domain for the purpose of evaluating fluxes, gradients and limiters. These computational kernels are usually structured as a sequence of gather, compute and scatter operations, where variables are gathered from pairs of cells or vertices sharing a face or edge, followed by the calculation and scatter of results to the respective face or edge end-points.

An example of a typical face-based loop can be seen in Listing 10, where the unknowns are gathered from q and used together with the mesh geometrical attributes in geo to compute the flux residuals f, which are subsequently accumulated and scattered back to rhs. For edge-based solvers, one can simply replace faces with edges and cells with vertices or nodes.

for( ic=0;ic<num_faces;ic++ )
{
  u1 = q[ifq[0][ic]];
  u2 = q[ifq[1][ic]];
  f  = geo[ic]*(u2-u1);
  rhs[ifq[0][ic]]-= f;
  rhs[ifq[1][ic]]+= f;
}

Listing 10: Example of a face-based kernel.


The gather and scatter operations arising from the indirect access to q and rhs via the ifq index array can operate across large and irregular strides in memory, as determined by the mesh topology. As a result, these face-based kernels are an example of a class of computational patterns that do not naturally map to the architectural features of modern cache-based multicore and manycore processors. First of all, accessing memory in an irregular fashion is detrimental to performance, as it fails to exploit the spatial and temporal locality required for efficient utilisation of the memory hierarchy. Secondly, the indirect addressing used when accessing cell-centred variables inhibits vectorization by the compiler as well as the exploitation of thread parallelism, due to potential data dependencies when scattering the results back to rhs. Finally, accessing memory indirectly also has a detrimental impact on the operation of the hardware prefetchers, reducing any opportunity for memory parallelism.

A typical unstructured finite volume code spends more than two thirds of its execution time in face-based loops computing fluxes across cell interfaces and boundaries. Consequently, addressing the limitations that prevent optimal use of the vector units, threads and memory hierarchy will have the highest impact on achieving improved performance on modern processors. However, optimisations that are beneficial to face-based kernels might be to the detriment of the cell-based loops where the remaining execution time is spent. As a result, this could potentially offset any performance gains with respect to the whole application. Moreover, as face-based loops are optimised, the computational bottleneck will invariably shift towards the cell-based loops. Consequently, for best full application performance, one must optimise both face-based and cell-based kernels and assess the impact this has on overall performance.

This chapter presents a wide range of such optimisations in an unstructured finite volume CFD code typical in size and complexity of an industrial application. Their implementation and impact are demonstrated on kernels computing the inviscid, viscous and linearised fluxes, as examples of face-based loops, and on a kernel computing updates to the primitive variables, as an example of a cell-based loop. The benefits of each distinct optimisation are evaluated across a wide range of multicore and manycore compute nodes based on architectures such as the Intel Sandy Bridge, Broadwell, Skylake and the Intel Xeon Phi Knights Corner and Knights Landing processors, and across two different computational domain sizes.

The remainder of the chapter is structured as follows. Section 5.2 presents related work and Section 5.3 provides details of the code, test case and hardware setup. Section 5.4 presents an in-depth description of each optimisation, followed by results and discussion in Section 5.5 and concluding remarks in Section 5.6.

5.2 related work

The flexibility of unstructured mesh CFD solvers in dealing with complicated geometries has led to their wide adoption in industry and across the CFD community as a whole [52]. Consequently, there has always been significant interest in the optimisation and acceleration of unstructured mesh solvers on the latest computer architectures.

Anderson et al [8] presented the optimisation of FUN3D [28], a tetrahedral vertex-centred unstructured mesh code developed at the National Aeronautics and Space Administration (NASA) Langley Research Center for the solution of the compressible and incompressible Euler and Navier-Stokes equations, and for which they received the 1999 Gordon Bell Prize [1]. Their optimisations were based on the concept of memory-centric computation, the aim being to minimize the number of memory references in the recognition that flops are cheap relative to memory load and store operations. The authors achieved this by increasing spatial locality with the help of interlacing, in which data items that are required in close succession, such as the unknowns, are stored contiguously in memory using data structures such as AoS. They also reduced the impact of the underlying gather and scatter operations by renumbering the mesh vertices using the Cuthill-McKee [20] sparse matrix bandwidth minimizer. Their work was subsequently extended in the context of the FUN3D code by a number of studies: Gropp et al [35] introduced performance models in order to guide the optimisation process by classifying the operational characteristics of the computational kernels and their interaction with the underlying hardware; Mudigere et al [60] demonstrated shared memory optimisations on modern parallel architectures, including vectorization and threading through a hybrid MPI/OpenMP implementation; Al Farhan et al [29] presented optimisations specific to the Intel Xeon Phi Knights Corner processor; and Duffy et al [25] ported FUN3D for execution on graphical processing units, obtaining a factor of two speed-up as a result.

More recently, Economon et al [26] presented the performance optimisation of the open-source SU2 [79] unstructured CFD code on Intel architectures. Their work demonstrated the impact of a number of optimisations such as vectorization, edge reordering and data layout transformations for improving single core performance on modern multicore architectures, as well as optimisations of the linear solver that remove the impact of performing collective operations at large scale. As a result, they obtained speed-ups of more than a factor of two at single-node and multi-node granularities.

A different approach to optimising unstructured grid applications is to implement such optimisations at a higher level of abstraction through a DSL. Examples of such initiatives with respect to unstructured CFD solvers can be found by examining the work in the OP2 framework [32],[59],[71],[72] as well as the other examples discussed in Chapter 2.

The work presented herein complements the above in that it presents the impact of some of the described optimisations on newer architectures such as the Intel Xeon Skylake and Intel Xeon Phi Knights Landing processors, while also demonstrating a number of improvements and novel approaches for exploiting the architectural features of both multicore and manycore platforms in a large scale unstructured CFD application.

5.3 background

The test vehicle for this study is the in-house CFD solver AU3X [24],[88]. AU3X uses a cell-centred finite volume approach to solve the unsteady Favre-averaged Navier-Stokes equations on unstructured meshes. Steady solutions are obtained by pseudo time marching and time accurate solutions by dual time stepping [42]. The governing equations, spatial discretization and time integration schemes are briefly described in the following sections in order to complement the implementation details and code optimisation study, although some similarities with the numerical algorithms presented in Chapter 4 will be evident.


5.3.1 Governing Equations

The Favre-averaged Navier-Stokes equations for compressible flows in differential form read:

\[
\begin{aligned}
\frac{\partial \bar{\rho}}{\partial t} + \frac{\partial (\bar{\rho}\tilde{v}_j)}{\partial x_j} &= 0 \\
\frac{\partial (\bar{\rho}\tilde{v}_i)}{\partial t} + \frac{\partial (\bar{\rho}\tilde{v}_i \tilde{v}_j)}{\partial x_j} &= -\frac{\partial \bar{p}}{\partial x_i} + \frac{\partial}{\partial x_j}\left(\tilde{\tau}_{ij} + \tau^{t}_{ij}\right) \\
\frac{\partial (\bar{\rho}\tilde{E})}{\partial t} + \frac{\partial (\bar{\rho}\tilde{v}_j \tilde{H})}{\partial x_j} &= \frac{\partial}{\partial x_j}\left(\kappa \frac{\partial \tilde{T}}{\partial x_j} + \tilde{v}_i\left(\tilde{\tau}_{ij} + \tau^{t}_{ij}\right)\right)
\end{aligned}
\tag{22}
\]

where

\[
\tilde{\tau}_{ij} = 2\mu\left(\tilde{S}_{ij} - \frac{1}{3}\frac{\partial \tilde{v}_k}{\partial x_k}\delta_{ij}\right), \qquad
\bar{p} = (\gamma - 1)\bar{\rho}\left(\tilde{E} - \frac{1}{2}\tilde{v}_j \tilde{v}_j\right), \qquad
\kappa \frac{\partial \tilde{T}}{\partial x_j} = \frac{\gamma}{\gamma - 1}\frac{\mu}{Pr}\frac{\partial}{\partial x_j}\left(\frac{\bar{p}}{\bar{\rho}}\right)
\tag{23}
\]

The tilde "∼" and overbar "−" represent Favre averaging and Reynolds averaging respectively. The working fluid is air, treated as a calorically perfect gas, while γ and the Prandtl number Pr are held constant at 1.4 and 0.72 respectively. µ is evaluated by Sutherland's law, based on a reference viscosity of 1.7894 × 10⁻⁵ kg m⁻¹ s⁻¹ together with a reference temperature of 288.15 K and Sutherland's constant of 110 K. If the Boussinesq assumption holds, the Reynolds stress τ^t_ij can be written as a linear function of the mean flow gradient:

\[
\tau^{t}_{ij} = 2\mu_t\left(\tilde{S}_{ij} - \frac{1}{3}\frac{\partial \tilde{v}_k}{\partial x_k}\delta_{ij}\right)
\tag{24}
\]

Turbulent viscosity µt is computed by a turbulence model. In this work, the Wilcox k−ω turbulence model is used, which requires additional equations for k and ω. Readers can refer to Wilcox [89] for more details.
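For reference, the Sutherland's law evaluation of µ quoted above takes the standard form (stated here for completeness, with the constants from the text):

\[
\mu = \mu_{\mathrm{ref}} \left(\frac{\tilde{T}}{T_{\mathrm{ref}}}\right)^{3/2} \frac{T_{\mathrm{ref}} + S}{\tilde{T} + S},
\qquad
\mu_{\mathrm{ref}} = 1.7894\times10^{-5}\ \mathrm{kg\,m^{-1}\,s^{-1}},\quad
T_{\mathrm{ref}} = 288.15\ \mathrm{K},\quad S = 110\ \mathrm{K}
\]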

5.3.2 Spatial Discretization

The flow variables are stored at the cell centres and the boundary conditions are applied at ghost cells, the positions of which are generated by mirroring the positions of the cells immediately adjacent to the boundary.

Figure 16: Schematic of the cell-centred finite volume scheme on unstructured grids, showing two cells Ωi and Ωj sharing a cell interface and a ghost cell j mirrored across the boundary.

The inviscid and viscous fluxes are evaluated at the cell-to-cell and boundary-to-cell interfaces. The schematic of the finite volume scheme is shown in Figure 16. The flow gradient is computed at the cell centre using the weighted least-squares procedure [51]. The matrix of the weighted least-squares gradient is evaluated once at pre-processing for static grids and at every nonlinear iteration for moving meshes.

The inviscid fluxes are computed by an upwind scheme using the approximate Riemann solver of Roe [73]. Second order spatial discretization is obtained by extrapolating the values from the cell centre to the interface via the Monotonic Upwind Scheme for Conservation Laws (MUSCL) [47] with the van Albada limiter [39]. The viscous fluxes at the interface are computed by inverse-distance weighting of the values evaluated at the cell centres on both sides of the interface, while source terms are evaluated at the cell centres and are assumed to be piecewise constant in the cell.


5.3.3 Time integration

After the inviscid fluxes, viscous fluxes and source terms have been computed for each cell, the coupled system in Equation 22 can be written as:

\[
\Omega_i \frac{\mathrm{d}U_i}{\mathrm{d}t} = -\sum_{j=1}^{N} R_i(U_j)
\tag{25}
\]

where U_i are the conservative variables of cell i, namely (ρ, ρv_i, ρE)^T, Ω_i the cell volume, U_j the conservative variables of the cells neighbouring U_i, N the number of neighbouring cells and R_i the right-hand side of cell i, i.e. the fluxes evaluated at each cell. No mesh motion is assumed here, so Ω_i remains constant for each cell throughout the computation.

The system in Equation 22 is solved implicitly by first applying the backward Euler scheme:

\[
\Omega_i \frac{\Delta U_i}{\Delta t} = -\sum_{j=1}^{N} R_i\!\left(U_j^{n+1}\right)
\tag{26}
\]

where n denotes the current time level, n+1 the level to be solved and ∆U_i = U_i^{n+1} − U_i^n.

Expanding R_i(U_j^{n+1}) in a Taylor series, Equation 26 becomes:

\[
\Omega_i \frac{\Delta U_i}{\Delta t} = -\sum_{j=1}^{N} R_i\!\left(U_j^{n}\right) - \sum_{j=1}^{N} \frac{\partial R_i\!\left(U_j^{n}\right)}{\partial U_j}\,\Delta U_j^{n}
\tag{27}
\]

where ∂R_i(U_j^n)/∂U_j is the flux Jacobian. Equation 27 can be rearranged and the flux Jacobian approximated by its spectral radius. The resulting linear system reads:

\[
\left[ J_i^{n} \left( \frac{\Omega_i}{J_i^{n}\,\Delta t} + 1 \right) \right] \Delta U_i^{n} = -\sum_{j=1}^{N} R_i\!\left(U_j^{n}\right)
\tag{28}
\]


Equation 28 is the resulting linear system used to march the solution from time level n to n+1, and it is solved by the Newton-Jacobi method, where J_i^n is the spectral radius of the flux Jacobian matrix accumulated across the cell interfaces. The linearised fluxes ∂R_i(U_j^n)/∂U_j ΔU_j^n are required to update the solution at each Newton-Jacobi iteration and are evaluated exactly for the inviscid and viscous fluxes. For the Wilcox k−ω turbulence model, an approximation is used for linearising the governing equations for k and ω. The Newton-Jacobi method is executed for a user-specified number of iterations to march the solution from n to n+1; the right- and left-hand sides are then updated and the Newton-Jacobi method invoked again. This process proceeds until a user-specified convergence criterion is met.
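In outline, the nested solution procedure described above can be sketched as follows (an illustrative structure only; the routine names are hypothetical):

// hedged outline of the implicit update from time level n to n+1
while( !converged )                      // outer nonlinear (time) levels
{
    compute_fluxes_and_sources();        // assemble R_i(U^n) for all cells
    compute_spectral_radii();            // J_i^n accumulated across interfaces
    for( int k=0; k<njacobi; k++ )       // user-specified Newton-Jacobi sweeps
    {
        compute_linearised_fluxes();     // dR_i/dU_j * dU_j^n (cf. diflux, dvflux)
        relax_dU();                      // update dU_i^n from Equation 28
    }
    apply_update();                      // U^{n+1} = U^n + dU
    converged = check_convergence();     // user-specified criterion
}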

5.3.4 Test case

The numerical test case used for the optimisation study represents an aero-engine intake operating near the ground. Validation and numerical investigation using the AU3X code have previously been presented by Carnevale et al [17],[18] using experimental data provided by Murphy et al [61]. The computational domain, based on an unstructured mesh, can be seen in Figure 17. Near-wall regions have been discretized with hexahedral elements for boundary layer prediction whilst prismatic elements have been used in the free-stream domain. Furthermore, two mesh sizes have been utilised throughout this work: mesh 1, which contains 3.3×10⁶ elements, and mesh 2, with 6×10⁶ elements.

5.3.5 Computational kernels

Table 3 presents details of the face-based and cell-based kernels used to demonstrate the implementation and impact of the optimisations. Face-based loops are represented by four kernels computing the inviscid, viscous and linearised fluxes, while cell-based loops are represented by the linearised updates to the primitive variables. Face-based loops are characterised by a relatively high arithmetic intensity, as they involve a large number of floating point operations per pair of adjacent cells, whereas cell-based loops tend to exhibit a modest number of calculations per memory load operation. Therefore, it is expected that face-based loops, and especially the second order inviscid MUSCL fluxes, will benefit the most from optimisations that improve the throughput of floating point computations, such as vectorization, and that they will scale with the available number of computational cores. Cell-based loops are expected to scale with the available memory bandwidth, although it will be of interest to assess the impact that other optimisations, such as vectorization or data layout transformations, have on their performance, and whether optimisations for face-based kernels have any negative impact on them.

Figure 17: Unstructured mesh of the intake with hexahedral elements in near-wall regions and prisms in the free-stream domain.

Figure 18: Solution of ground vortex ingestion highlighting the vorticity in the direction normal to the ground.


The optimisations presented for the kernels in Table 3 have been implemented across all other face-based and cell-based kernels in the application. This is reflected in the results presenting whole application performance as time per solution update (Newton-Jacobi iteration).

kernel   loop    runtime (%)   flops/bytes   description
iflux    faces   13            1.30          second order TVD MUSCL fluxes
vflux    faces   8             0.80          viscous fluxes
diflux   faces   32            0.84          linearised inviscid fluxes
dvflux   faces   30            0.80          linearised viscous fluxes
dvar     cells   5             0.18          update to primitive variables

Table 3: Computational kernels representing face-based and cell-based loops used for demonstrating the implementation and impact of our optimisations.

5.3.6 Configuration of compute nodes

Table 4 presents the configuration of the compute nodes used in this chapter in terms of processor model, memory configuration and software environment. A broader discussion of the characteristics and architectural features of each processor is given in Chapter 3.

                 SNB       BDW       SKL        KNC      KNL
Version          E5-2650   E5-2680   Gold 6140  7120P    7210
Sockets          2         2         2          1        1
Cores            8         14        18         61       64
Threads          2         2         2          4        4
Clock (GHz)      2.0       2.4       2.3        1.2      1.3
SIMD ISA         AVX       AVX2      AVX-512    IMCI     AVX-512
SIMD width       256-bit   256-bit   512-bit    512-bit  512-bit
L1 Cache (KB)    32        32        32         32       32
L2 Cache (KB)    256       256       1024       512      1024
L3 Cache (MB)    20        35        25         -        -
DRAM (GB)        32        128       196        16       96/16
DRAM type        DDR3      DDR4      DDR4       GDDR5    DDR4/MCDRAM
Stream (GB/sec)  71        118       186        181      82/452
Compiler         icpc 17.0 (all platforms)
MPI Library      Intel MPI 2017 (all platforms)

Table 4: Hardware and software configuration of the compute nodes used in this chapter. The SIMD ISA entry represents the latest vector instruction set architecture supported by the particular platform.


5.4 optimisations

5.4.1 Grid renumbering

In order to improve the exploitation of the cache hierarchy in face-based loops, the distance between memory references when gathering and scattering cell-centred data has to be minimized. In this work, this is achieved using the Reverse Cuthill-McKee (RCMK) [20] sparse matrix bandwidth minimizer. The RCMK algorithm reorders the non-zero elements in the adjacency matrix derived from the underlying mesh topology so as to cluster them as close as possible to the main diagonal [15]. An example of the resulting bandwidth reduction when applying RCMK to mesh 1 can be seen in Figure 19, where the maximum distance to the diagonal has been reduced by 53X. Following the renumbering, the list of faces is sorted in ascending order on the first index, which results in a sequence of first indices that increases monotonically. A subsequent sort is performed on the second index so that, in groups of consecutive faces where the first index is constant, the second index is visited in ascending order. This improves both spatial and temporal locality, as the cells referenced by the first index are quasi-contiguous in memory and can therefore better exploit the cache hierarchy and the hardware prefetchers. Furthermore, minimizing the stride of memory accesses can also reduce the number of TLB misses, which are particularly expensive on manycore architectures based on simpler core designs that cannot handle a page walk as efficiently as conventional multicore CPUs. The final sort on the second index further improves memory performance, since hardware prefetchers operate best on streams with ordered accesses, whether in a forward or backward direction. The mesh reordering is performed immediately after solver initialisation and across all MPI ranks. Each rank renumbers its local cells, after which it traverses its list of halos in order to relabel the corresponding internal cells with their new values. Since the reordering is performed only once at solver start-up and scales linearly with the number of processors, its effect on overall application execution time is negligible.


Figure 19: Example of reducing the bandwidth (β) via Reverse Cuthill-McKee on mesh 1 (3.3×10⁶ elements): (a) the initial bandwidth, β = 2808603; (b) the improved bandwidth, β = 52319, as a result of renumbering.
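A minimal sketch of the two-level face sort described above, using std::sort with a lexicographic comparator (the Face type and its field names are illustrative, not the solver's actual data structures):

#include <algorithm>
#include <vector>

struct Face { int c0, c1; };  // indices of the two adjacent cells

// sort faces by first index, then by second index within groups
// sharing the same first index, as described in the text
void sort_faces(std::vector<Face>& faces)
{
    std::sort(faces.begin(), faces.end(),
              [](const Face& a, const Face& b)
              { return (a.c0 != b.c0) ? (a.c0 < b.c0) : (a.c1 < b.c1); });
}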

5.4.2 Vectorization

The continuous increase in the size of the underlying vector registers, combined with steady improvements to the associated SIMD ISAs, has made vectorization indispensable for exploiting the computational power of modern processors. Although compilers have evolved over the years and are generally better at vectorizing non-trivial kernels, they can only do so when they can guarantee safety. Consequently, computational kernels such as face-based loops are not vectorized due to the existence of indirection and potential dependencies when accumulating and scattering back to the cell centres. Furthermore, even in the case of kernels with regular access patterns, such as cell-based loops, compiler auto-vectorization is not always possible for reasons such as pointer aliasing, inner function calls or conditional branching. As a result, relying on the compiler to generate vector instructions based only on its internal analysis is not recommended, as the success rate will invariably differ across compilers, programming languages and computational kernels.


The alternative to compiler auto-vectorization is explicit vectorization via directives, lower level intrinsics or inline assembly. In this work, the predominant approach for achieving vectorization was based on OpenMP 4.0 [64] directives. The use of compiler directives has been preferred over the alternatives since they offer far greater flexibility and portability across SIMD architectures and compilers, at a small cost in performance compared to lower level implementations such as compiler intrinsics or assembly.

In cell-based loops, vectorization was made possible through the addition of directives either at loop level or at function declaration, which not only forces the compiler to vectorize the construct but also enables the generation of efficient vector code based on auxiliary data attributes such as alignment, variable scoping and vector lane length. An example can be seen in Listing 11, based on a simplified cell-based loop for updating the unknowns.

#pragma omp simd simdlen(VECLEN) safelen(VECLEN) \
        aligned(q:ALIGN,dq:ALIGN)
for( iq=0;iq<num_cells;iq++ )
  q[iq]+= dq[iq];

Listing 11: Example of a cell-based kernel vectorized with OpenMP 4.0 directives.

For face-based loops however, a complete re-write was necessary in order to switch to a vector programming paradigm. Listing 12 presents the improved vector-friendly layout which replaces the original example in Listing 10 (Section 5.1).

In the new implementation, the original loop is divided into three distinct stages which map naturally to the underlying gather, compute and scatter pattern. In essence, each main loop iteration processes a number of consecutive faces in parallel by exploiting the available vector lanes, as defined by the VECLEN macro. The first nested loop gathers the unknowns into local short vector arrays. Depending on the underlying architecture, the compiler will either generate SIMD gather instructions or serial load sequences if the architecture does not support such operations. Once all data is loaded into the short vectors, the second nested loop performs the computation. Similarly to the previous stage, computations are carried out in parallel on the available vector lanes. In double precision, four faces are processed concurrently on AVX/AVX2 and eight on IMCI and AVX-512 architectures. Since all intermediate short vectors are allocated on the stack as static arrays of sizes known at compile time, as defined by the pre-processor macros, the compiler can easily generate efficient vector code specific to the underlying architecture, as all dependencies have been eliminated. This includes loading contiguous data, such as face geometrical properties, at aligned addresses, as expressed through the aligned clause in the directive. The last stage involves the accumulation and scatter of the residuals to their respective cells. This is done sequentially, since generating vector code for this section would lead to incorrect results due to data dependencies, as multiple successive faces are processed in parallel.

Grouping the operations specific to the gather, compute and scatter patterns into distinct sections enabled the vectorization of the first two, where the majority of instructions are issued, while the scatter, which prevented vectorization in the first place, is still performed sequentially. Vectorizing the gather and compute stages is important since finite volume discretizations involve a large number of floating point operations per pair of cells which, if vectorized, can lead to significant speed-ups. The additional nested loops in the new implementation are unrolled automatically at compilation as they decay into single or multiple SIMD instructions, removing potential overhead as long as the iteration space (VECLEN) is equal to, or a relatively small multiple of, the underlying vector lane size. The promotion from scalar to short vectors is performed only for variables that appear in more than one of the three distinct stages (e.g. f, representing the flux residual, computed in stage two and written back in stage three).

In addition to re-writing all face-based kernels in the solver following the principles set out in Listing 12, a number of other optimisations were performed at this stage: (i) allocating face and cell data structures on aligned 32 or 64 byte boundaries using _mm_malloc, depending on the SIMD architecture; (ii) padding the list of faces through the addition of redundant entries up to a size that is a multiple of VECLEN; (iii) replacing divisions with reciprocal multiplications in places where the divisors were geometrical variables, as SIMD division operations are non-pipelined on the majority of architectures and suffer from very high latencies.
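Items (i) to (iii) can be illustrated briefly as follows (a sketch with illustrative variable names; ALIGN and VECLEN as defined in Listing 12):

// (ii) pad the list of faces up to a multiple of VECLEN with redundant entries
int padded_faces = ((num_faces + VECLEN - 1)/VECLEN)*VECLEN;

// (i) allocate face data on ALIGN-byte boundaries via _mm_malloc
double *geo = (double*)_mm_malloc(padded_faces*sizeof(double), ALIGN);

// (iii) hoist a division by a geometrical quantity out of the SIMD loop,
// replacing it with a multiplication by the precomputed reciprocal
double rvol = 1.0/vol[ic];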


#if defined __MIC__
#  define VECLEN 8
#  define ALIGN  64
#elif defined __AVX512F__
#  define VECLEN 8
#  define ALIGN  64
#elif defined __AVX__
#  define VECLEN 4
#  define ALIGN  32
#elif defined __SSE3__
#  define VECLEN 2
#  define ALIGN  16
#else
#  define VECLEN 1
#  define ALIGN  16
#endif

double u1[VECLEN];
double u2[VECLEN];
double f[VECLEN];

for( ic=0;ic<num_faces;ic+=VECLEN )
{
  // gather data from adjacent cells
  #pragma omp simd simdlen(VECLEN)
  for( iv=0;iv<VECLEN;iv++ )
  {
    u1[iv]= q[ifq[0][ic+iv]];
    u2[iv]= q[ifq[1][ic+iv]];
  }
  // computation
  #pragma omp simd simdlen(VECLEN) \
          aligned(geo:ALIGN,u1:ALIGN,u2:ALIGN)
  for( iv=0;iv<VECLEN;iv++ )
    f[iv]= geo[ic+iv]*(u2[iv]-u1[iv]);
  // scatter, serially due to dependencies
  for( iv=0;iv<VECLEN;iv++ )
  {
    rhs[ifq[0][ic+iv]]-= f[iv];
    rhs[ifq[1][ic+iv]]+= f[iv];
  }
}

Listing 12: Example of a SIMD friendly implementation of face-based kernelsvectorized with OpenMP 4.0 directives.


5.4.3 Colouring

The dependencies that prevent vectorization when scattering back to the face end-points can be removed by further colouring and reordering the list of faces. Löhner et al [49] presented two algorithms for this purpose, both of which were implemented in this work. The first algorithm (Löhner 1) uses a simple colouring approach whereby the list of faces is reordered such that groups of consecutive faces of size equal to the specified vector length (i.e. VECLEN) have no dependencies across their end-points. The second algorithm (Löhner 2) extends the first by also attempting to minimize the jumps in the second index within the vector groups. This is achieved by performing additional iterations, as defined by a search distance parameter, in order to find suitable faces that exhibit no dependencies, keeping the one that contributes the smallest jump in the second index relative to the first entry of the vector group.
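A minimal greedy sketch in the spirit of Löhner 1 is given below (illustrative only; the published algorithm and the implementation used in this work may differ in detail). Each pass over the remaining faces fills a group of up to VECLEN faces whose end-points are pairwise disjoint, so that the scatter stage of the group can be vectorized safely:

#include <unordered_set>
#include <vector>

struct Face { int c0, c1; };

std::vector<Face> colour_faces(const std::vector<Face>& faces, int veclen)
{
    std::vector<Face> out;
    out.reserve(faces.size());
    std::vector<bool> done(faces.size(), false);
    std::size_t remaining = faces.size();
    while( remaining > 0 )
    {
        std::unordered_set<int> used;   // cells touched by the current group
        int filled = 0;
        for( std::size_t i=0; i<faces.size() && filled<veclen; i++ )
        {
            if( done[i] ) continue;
            const Face& f = faces[i];
            if( used.count(f.c0) || used.count(f.c1) ) continue; // conflict
            used.insert(f.c0); used.insert(f.c1);
            out.push_back(f);
            done[i] = true;
            filled++; remaining--;
        }
        // padding of partially filled groups with redundant faces omitted
    }
    return out;
}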

Table 5 presents the jumps in both the first and the second index after reordering the faces with both algorithms on mesh 1. The entries jump1 and jump1a represent the maximum and average jumps across the first index, while jump2 and jump2a have the same meaning for the second index. It can be observed that Löhner 2 decreases the jumps in the second index as the search distance is increased, although this leads to significant increases in the first index. As a result, one has to choose a combination that provides the best trade-off between the two (i.e. the largest decrease in jumps in the second index for the smallest increase in the first). Consequently, throughout this work, the Löhner 2 algorithm was used with a vector length of 4 and a search distance of 16 for AVX/AVX2 architectures, and a vector length of 8 and a search distance of 16 for the IMCI and AVX-512 platforms.

After removing the dependencies at the face end-points in groups of faces equal to the underlying SIMD length (i.e. VECLEN), the scatter sections across all face-based loops were also vectorized using OpenMP 4.0 directives, in a similar fashion to the gather and computation sections.

5.4.4 Array of Structures

The reference implementation used the SoA layout to store both face and cell-centred variables. For example, the unknowns were stored in individual vectors.


algorithm   vector length   search distance   jump1   jump1a   jump2   jump2a
Original    -               -                 7       0.4      79490   18869
Löhner 1    4               -                 12      1.4      79490   18869
Löhner 2    4               16                69      11.5     79235   6540
Löhner 2    4               32                116     19       79023   6064
Löhner 2    4               64                201     31       79028   5853
Löhner 1    8               -                 25      1.9      79028   17296
Löhner 2    8               16                85      11.7     79261   4708
Löhner 2    8               32                158     21.3     79977   3541
Löhner 2    8               64                272     35       79030   3153
Löhner 1    16              -                 48      2.3      79039   16258
Löhner 2    16              16                89      10.6     78801   5235
Löhner 2    16              32                224     20.3     79051   2858
Löhner 2    16              64                376     37.5     78922   1941
Löhner 1    32              -                 94      2.6      79490   15683
Löhner 2    32              32                208     18.6     79274   3267
Löhner 2    32              64                428     35       79453   1767
Löhner 2    32              128               791     67.3     78750   1104
Löhner 1    64              -                 173     2.9      79670   15399
Löhner 2    64              64                412     31.7     97813   2143
Löhner 2    64              128               948     63.2     79521   1119
Löhner 2    64              256               1607    126.1    79287   698.5

Table 5: Jump in indices for mesh 1 after reordering the faces based on the two algorithms in Löhner et al [49].


u0 u1 u2 u3 u4 u5 ... un
v0 v1 v2 v3 v4 v5 ... vn
w0 w1 w2 w3 w4 w5 ... wn
t0 t1 t2 t3 t4 t5 ... tn
p0 p1 p2 p3 p4 p5 ... pn
k0 k1 k2 k3 k4 k5 ... kn
ω0 ω1 ω2 ω3 ω4 ω5 ... ωn

Figure 20: SoA layout for cell-centred variables such as the unknowns.

This layout is shown in Figure 20, where n is equal to the number of cells in the computational domain.

As faces are processed in consecutive order within face-based loops, accessing face variables such as normals, coordinates and frame speed is done via contiguous vector load operations which map effectively to the CPU vector registers. However, access to cell-centred variables is indirect and irregular, since cells are traversed non-consecutively according to the correspondence array. As a result, for each cell, all cell-centred variables have to be gathered from their respective vectors into the SIMD registers. For SIMD architectures that do not support vector gather operations, the compiler will generate sequential loads in order to fill the vector registers. Similarly, scattering values such as residuals back to the corresponding arrays translates into sequential stores if vector scatter instructions are not available. On architectures with gather and/or scatter support, the number of issued instructions is reduced by a factor equal to the underlying number of vector lanes. However, when compared to regular SIMD load and store operations, gather and scatter primitives are known to suffer from significantly higher latencies [40], especially on the Knights Corner architecture [67]. Consequently, the data structures storing cell-centred variables have been modified to an AoS implementation whilst the face data structures were kept in the SoA format.

In the AoS layout, cell-centred variables are stored contiguously in short arrays for every cell, as seen in Figure 21, rather than in individual vectors. As a result, although the AoS format does not fully remove all gather and scatter operations, irregular loads and stores are executed only once per cell, since subsequent elements are loaded automatically at cache line granularity as long as the short arrays are padded. This improves locality and minimizes cache misses, although the disadvantage compared to SoA is that variables have to be transposed into their respective vector register and lane positions.


u0 v0 w0 t0 p0 k0 ω0 u1 v1 w1 t1 p1 k1 ω1... un vn wn tn pn kn ωn

Figure 21: AoS layout for cell-centred variables such as the unknowns, including padding.
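As a concrete illustration of this layout, the following sketch shows a padded AoS record for the seven cell-centred unknowns (the AOS_t name matches the later listings; the explicit padding slot is an assumption consistent with Figure 21):

// hedged sketch: one AoS record per cell, rounded up to 8 doubles so that
// each record occupies exactly one 64-byte cache line
typedef struct {
    double var[8];   // u,v,w,t,p,k,omega + one padding slot
} AOS_t;

// aligned allocation so that every record starts on a cache-line boundary
AOS_t *q = (AOS_t*)_mm_malloc(num_cells*sizeof(AOS_t), 64);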

The advantage of using AoS over SoA in loops with gather and scatter operations can be demonstrated by analysing the number of cache lines required to gather the unknowns in a single vector iteration under each layout. For the purpose of this analysis, VLEN represents the vector register size, NVAR the number of variables to be gathered per cell, NCELL the number of cells and CLEN the number of variables that fit in one cache line. Therefore, VLEN is equal to either 4 or 8 depending on the underlying SIMD architecture (i.e. AVX/AVX2 or IMCI/AVX-512), NVAR = 7 for the unknowns since there are 5 flow variables and 2 turbulence model variables, CLEN = 8 in double precision since eight variables occupy an entire 64-byte cache line and, finally, NCELL = 2 × VLEN, as each face is shared by two cells and the indices referencing the cells are unique within every iteration due to colouring. In the best case scenario, the indices referencing the pairs of neighbouring cells within a vector iteration are ordered consecutively among themselves, while in the worst case, the cell indices are at a distance greater than the size of a cache line (i.e. > CLEN). As a result, the number of cache lines accessed by the SoA layout in the best case scenario is NVAR × (NCELL ÷ CLEN), while for AoS it is NCELL × (NVAR ÷ CLEN). In the worst case scenario, SoA requires NCELL × NVAR cache lines, whereas for AoS this remains NCELL × (NVAR ÷ CLEN), owing to the fact that all of the unknowns within a cell fit in a single cache line after padding. Therefore, for AVX/AVX2, the SoA layout requires 7 cache lines in the best case scenario and 56 in the worst, whereas the AoS layout requires 8 cache lines in both cases. For IMCI/AVX-512, one can simply multiply these numbers by a factor of two: the best and worst case for SoA become 14 and 112 respectively, and 16 in both cases for AoS.
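Collecting these counts into formulas (the ceiling makes the effect of padding on the AoS per-cell line count explicit):

\[
\text{lines}_{\mathrm{SoA}} =
\begin{cases}
NVAR \times \dfrac{NCELL}{CLEN} & \text{best case}\\[6pt]
NCELL \times NVAR & \text{worst case}
\end{cases}
\qquad
\text{lines}_{\mathrm{AoS}} = NCELL \times \left\lceil \frac{NVAR}{CLEN} \right\rceil \ \ \text{(both cases)}
\]

For AVX/AVX2 (NCELL = 8, NVAR = 7, CLEN = 8) this evaluates to 7 and 56 lines for SoA against a constant 8 lines for AoS, as stated above.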

Consequently, although SoA is superior in the best case scenario where the visited cells are contiguous in memory, it performs worse than AoS by a factor equal to the number of variables gathered per cell when the distance between the face end-points is larger than a cache line. Judging by the average indices in Table 5, one can clearly observe that, for both the first and the second index, the average jumps resulting from reordering the list of faces using the Löhner 2 algorithm are larger than a single cache line (>8). As a result, the AoS layout is in theory the optimal choice, since it significantly reduces the number of cache lines required per vector iteration and should therefore improve the performance of the cache hierarchy.

The transition from SoA to AoS was by no means a trivial endeavour, as it required significant changes to the message passing interface, array access semantics and the order of all nested loop constructs that manipulate cell-centred data. It is therefore recommended that some form of abstraction is implemented with regard to the memory layout of such data structures, so that a switch between different implementations can be performed at compile time. This is important when considering that for structured codes, SoA, or a hybrid thereof, is the better performing implementation, as demonstrated in Chapter 4, since cells are traversed consecutively following an i,j,k indexing system which allows for efficient vector load and store operations. Finally, with regard to unstructured grid applications, modifying cell data to the AoS layout has an impact on cell-based kernels. Whereas previously with SoA these kernels could exploit contiguous SIMD load and store operations, as cells are traversed in successive order, switching to AoS means that the variables of each cell have to be loaded into vector registers and subsequently transposed into the SoA format, which adds extra latency and decreases performance. It is therefore important to empirically evaluate the impact of this optimisation, as performance will be lost in the cell-based loops, although improvements are expected for the face-based constructs, which are the main bottleneck in the application.
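One possible shape for such an abstraction is sketched below (a compile-time layout switch; this illustrates the recommended design trait rather than the implementation used in this work):

enum class Layout { SoA, AoS };

// indexing policy selected at compile time; kernels call field(cell,var)
// and remain oblivious to the underlying memory layout
template <Layout L> struct CellField;

template <> struct CellField<Layout::SoA> {
    double *data; int ncells;
    double& operator()(int cell, int var) { return data[var*ncells + cell]; }
};

template <> struct CellField<Layout::AoS> {
    double *data; int nvars;
    double& operator()(int cell, int var) { return data[cell*nvars + var]; }
};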

5.4.5 Gather Scatter Optimisations

The AoS memory layout for cell-centred variables requires that gathers arising from indirect addressing are executed only once per structure and not for every variable, as is the case with SoA when the distance between cells is larger than the underlying cache line. However, successive elements in AoS, although contiguous in memory, need to be transposed into the correct vector register and lane positions. Considering the seven unknowns in the AoS layout with padding in the last position at a given cell index i, depending on the underlying SIMD architecture (i.e. 256-bit or 512-bit wide registers), all eight double precision values can be loaded from the structure with either one or two aligned vector load instructions for each cell end-point. As vectorization is applied across consecutive faces, the unknowns of all cell end-points need to be re-arranged on the fly and packed into the SoA format. This process is described in Figure 22, which presents the on-the-fly transposition from AoS to SoA on AVX/AVX2 architectures with 256-bit registers. In this work, the transpose primitives were implemented using compiler intrinsics for each individual SIMD architecture, using the instructions that exhibit the lowest latency and highest degree of instruction-level parallelism, as presented in Listings 13, 14 and 15 for AVX/AVX2, AVX-512 and IMCI respectively. This is relevant since the SIMD implementations differ significantly across processors and a generic solution would leave performance on the table. For example, the Knights Corner architecture provides swizzle operations which can perform on-the-fly data multiplexing prior to execution from the register. Consequently, some permutations and shuffles can be done with "zero" penalty, whereas on all other architectures these are forwarded to an execution port (port 5 on the multicores), which can lead to port pressure. Similarly, scatters for writing back results to the face end-points are implemented by transposing back from SoA to AoS and utilising aligned vector stores, which is possible since the faces have been coloured. Listing 16 presents the integration of the AoS to SoA conversion primitives in a simplified face-based loop. The conversion primitives are also used in cell-based kernels, such as the candidate kernel for updating the flow variables, in order to convert to and from AoS and SoA as efficiently as possible, with the distinction that the indices for performing the gathers and scatters in cell-based loops are linear and at unit stride.

Finally, similar work has been presented by Pennycook et al [67] for optimising the gather/scatter patterns in molecular dynamics applications. The solution presented here extends their work by supporting additional SIMD architectures and by applying the techniques to unstructured computational fluid dynamics solvers.

5.4.6 Array of Structures Structure of Arrays

While the SoA format is ideal for mapping face data to SIMD registers, it requires that multiple memory streams are serviced in parallel, one for each vector of variables. Since prefetchers can only operate on a limited number of streams, maintaining a large number of them in flight wastes memory bandwidth and other valuable resources.


inline void aos2soa(AOS_t *src, int *pos, double dest[NVAR][VECLEN])
{
  __m256d v[8],s[8],p[NVAR];

  // load variables in 256-bit registers
  v[0] = _mm256_load_pd(&src[pos[0]].var[0]); // u0,v0,w0,t0
  v[4] = _mm256_load_pd(&src[pos[0]].var[4]); // p0,k0,o0
  v[1] = _mm256_load_pd(&src[pos[1]].var[0]); // u1,v1,w1,t1
  v[5] = _mm256_load_pd(&src[pos[1]].var[4]); // p1,k1,o1
  v[2] = _mm256_load_pd(&src[pos[2]].var[0]); // u2,v2,w2,t2
  v[6] = _mm256_load_pd(&src[pos[2]].var[4]); // p2,k2,o2
  v[3] = _mm256_load_pd(&src[pos[3]].var[0]); // u3,v3,w3,t3
  v[7] = _mm256_load_pd(&src[pos[3]].var[4]); // p3,k3,o3

  // 64-bit wide interleave
  s[0] = _mm256_shuffle_pd(v[0],v[1],0x0); // u0,u1,w0,w1
  s[4] = _mm256_shuffle_pd(v[4],v[5],0x0); // p0,p1,o0,o1
  s[1] = _mm256_shuffle_pd(v[0],v[1],0xF); // v0,v1,t0,t1
  s[5] = _mm256_shuffle_pd(v[4],v[5],0xF); // k0,k1
  s[2] = _mm256_shuffle_pd(v[2],v[3],0x0); // u2,u3,w2,w3
  s[6] = _mm256_shuffle_pd(v[6],v[7],0x0); // p2,p3,o2,o3
  s[3] = _mm256_shuffle_pd(v[2],v[3],0xF); // v2,v3,t2,t3
  s[7] = _mm256_shuffle_pd(v[6],v[7],0xF); // k2,k3

  // 128-bit wide interleave
  p[0] = _mm256_permute2f128_pd(s[0],s[2],0x20); // u0,u1,u2,u3
  p[1] = _mm256_permute2f128_pd(s[1],s[3],0x20); // v0,v1,v2,v3
  p[2] = _mm256_permute2f128_pd(s[0],s[2],0x31); // w0,w1,w2,w3
  p[3] = _mm256_permute2f128_pd(s[1],s[3],0x31); // t0,t1,t2,t3
  p[4] = _mm256_permute2f128_pd(s[4],s[6],0x20); // p0,p1,p2,p3
  p[5] = _mm256_permute2f128_pd(s[5],s[7],0x20); // k0,k1,k2,k3
  p[6] = _mm256_permute2f128_pd(s[4],s[6],0x31); // o0,o1,o2,o3

  // store in SoA format
  _mm256_store_pd(&dest[0][0],p[0]);
  _mm256_store_pd(&dest[1][0],p[1]);
  _mm256_store_pd(&dest[2][0],p[2]);
  _mm256_store_pd(&dest[3][0],p[3]);
  _mm256_store_pd(&dest[4][0],p[4]);
  _mm256_store_pd(&dest[5][0],p[5]);
  _mm256_store_pd(&dest[6][0],p[6]);
}

Listing 13: Example of AVX/AVX2 compiler intrinsics kernel for in-registertransposition from AoS to SoA of unknowns.


inline void aos2soa(AOS_t *src, int *pos, double dest[NVAR][VECLEN])
{
  __m512d v[8],u[8],p[8],s[NVAR];

  // load variables in 512-bit registers
  v[0] = _mm512_load_pd(&src[pos[0]].var[0]); // u0,v0,w0,t0,p0,...
  v[1] = _mm512_load_pd(&src[pos[1]].var[0]); // u1,v1,w1,t1,p1,...
  // truncated, repeat for rows 2-7

  // 64-bit wide interleave
  u[0] = _mm512_unpacklo_pd(v[0],v[1]); // u0,u1,w0,w1,p0,p1,o0,o1
  u[1] = _mm512_unpacklo_pd(v[2],v[3]); // u2,u3,w2,w3,p2,p3,o2,o3
  u[2] = _mm512_unpacklo_pd(v[4],v[5]); // u4,u5,w4,w5,p4,p5,o4,o5
  u[3] = _mm512_unpacklo_pd(v[6],v[7]); // u6,u7,w6,w7,p6,p7,o6,o7
  u[4] = _mm512_unpackhi_pd(v[0],v[1]); // v0,v1,t0,t1,k0,k1
  u[5] = _mm512_unpackhi_pd(v[2],v[3]); // v2,v3,t2,t3,k2,k3
  u[6] = _mm512_unpackhi_pd(v[4],v[5]); // v4,v5,t4,t5,k4,k5
  u[7] = _mm512_unpackhi_pd(v[6],v[7]); // v6,v7,t6,t7,k6,k7

  // 128-bit wide interleave
  p[0] = _mm512_mask_permutex_pd(u[0],0xCC,u[1],0x44); // u0-3,p0-3
  p[2] = _mm512_mask_permutex_pd(u[4],0xCC,u[5],0x44); // v0-3,k0-3
  p[4] = _mm512_mask_permutex_pd(u[1],0x33,u[0],0xEE); // w0-3,o0-3
  p[6] = _mm512_mask_permutex_pd(u[5],0x33,u[4],0xEE); // t0-3
  p[1] = _mm512_mask_permutex_pd(u[2],0xCC,u[3],0x44); // u4-7,p4-7
  p[3] = _mm512_mask_permutex_pd(u[6],0xCC,u[7],0x44); // v4-7,k4-7
  p[5] = _mm512_mask_permutex_pd(u[3],0x33,u[2],0xEE); // w4-7,o4-7
  p[7] = _mm512_mask_permutex_pd(u[7],0x33,u[6],0xEE); // t4-7

  // 256-bit wide interleave
  s[0] = _mm512_shuffle_f64x2(p[0],p[1],0x44); // u0,u1,u2,u3,u4,...
  s[1] = _mm512_shuffle_f64x2(p[2],p[3],0x44); // v0,v1,v2,v3,v4,...
  s[2] = _mm512_shuffle_f64x2(p[4],p[5],0x44); // w0,w1,w2,w3,w4,...
  s[3] = _mm512_shuffle_f64x2(p[6],p[7],0x44); // t0,t1,t2,t3,t4,...
  s[4] = _mm512_shuffle_f64x2(p[0],p[1],0xEE); // p0,p1,p2,p3,p4,...
  s[5] = _mm512_shuffle_f64x2(p[2],p[3],0xEE); // k0,k1,k2,k3,k4,...
  s[6] = _mm512_shuffle_f64x2(p[4],p[5],0xEE); // o0,o1,o2,o3,o4,...

  // store in SoA format
  _mm512_store_pd(&dest[0][0],s[0]);
  _mm512_store_pd(&dest[1][0],s[1]);
  // truncated, repeat for dest[2-6] <- s[2-6]
}

Listing 14: Example of AVX-512 compiler intrinsics kernel for in-register transpositionfrom AoS to SoA of unknowns.


inline void aos2soa(AOS_t *src, int *pos, double dest[NVAR][VECLEN])
{
  __m512d tmp0,tmp1;
  __m512d v[8],b[2],p[NVAR];

  // load variables in 512-bit registers
  v[0] = _mm512_load_pd(&src[pos[0]].var[0]); // u0,v0,w0,t0,p0,...
  v[1] = _mm512_load_pd(&src[pos[1]].var[0]); // u1,v1,w1,t1,p1,...
  // truncated, repeat for rows 2-7

  // u0,u1,t0,t1,u0,u1,t0,t1
  tmp0 = _mm512_mask_blend_pd(0x22,v[0],
         _mm512_swizzle_pd(v[1],_MM_SWIZ_REG_CDAB));
  // u2,u3,t2,t3,u2,u3,t2,t3
  tmp1 = _mm512_mask_blend_pd(0x88,
         _mm512_swizzle_pd(v[2],_MM_SWIZ_REG_BADC),
         _mm512_swizzle_pd(v[3],_MM_SWIZ_REG_AAAA));
  // u0,u1,u2,u3,t0,t1,t2,t3
  b[0] = _mm512_mask_blend_pd(0xCC,tmp0,tmp1);

  // u4,u5,t4,t5,u4,u5,t4,t5
  tmp0 = _mm512_mask_blend_pd(0x22,v[4],
         _mm512_swizzle_pd(v[5],_MM_SWIZ_REG_CDAB));
  // u6,u7,t6,t7,u6,u7,t6,t7
  tmp1 = _mm512_mask_blend_pd(0x88,
         _mm512_swizzle_pd(v[6],_MM_SWIZ_REG_BADC),
         _mm512_swizzle_pd(v[7],_MM_SWIZ_REG_AAAA));
  // u4,u5,u6,u7,t4,t5,t6,t7
  b[1] = _mm512_mask_blend_pd(0xCC,tmp0,tmp1);

  // u0,u1,u2,u3,u4,u5,u6,u7
  p[0] = _mm512_castsi512_pd(
         _mm512_mask_permute4f128_epi32(
           _mm512_castpd_si512(b[0]), 0xFF00,
           _mm512_castpd_si512(b[1]), _MM_PERM_BABA));
  // t0,t1,t2,t3,t4,t5,t6,t7
  p[3] = _mm512_castsi512_pd(
         _mm512_mask_permute4f128_epi32(
           _mm512_castpd_si512(b[1]), 0xFF,
           _mm512_castpd_si512(b[0]), _MM_PERM_DCDC));
  // truncated, repeat for the v-k, w-o and p pairs

  // store in SoA format
  _mm512_store_pd(&dest[0][0],p[0]);
  _mm512_store_pd(&dest[3][0],p[3]);
  // truncated, repeat for the remaining variables
}

Listing 15: Example of IMCI compiler intrinsics kernel for in-register transpositionfrom AoS to SoA of unknowns.


u0 v0 w0 t0 p0 k0 ω0   u1 v1 w1 t1 p1 k1 ω1   u2 v2 w2 t2 p2 k2 ω2   u3 v3 w3 t3 p3 k3 ω3   (AoS in memory)

u0 v0 w0 t0 p0 k0 ω0   u1 v1 w1 t1 p1 k1 ω1   u2 v2 w2 t2 p2 k2 ω2   u3 v3 w3 t3 p3 k3 ω3   (loaded into 256-bit registers)

u0 u1 w0 w1   v0 v1 t0 t1   p0 p1 ω0 ω1   k0 k1   u2 u3 w2 w3   v2 v3 t2 t3   p2 p3 ω2 ω3   k2 k3   (64-bit wide interleave)

u0 u1 u2 u3   v0 v1 v2 v3   w0 w1 w2 w3   t0 t1 t2 t3   p0 p1 p2 p3   k0 k1 k2 k3   ω0 ω1 ω2 ω3   (128-bit wide interleave, SoA)

Figure 22: On-the-fly transposition from AoS to SoA of the unknowns on AVX/AVX2 architectures. The corresponding compiler intrinsics can be seen in Listing 13.

double u1[NVAR][VECLEN];
double u2[NVAR][VECLEN];
double r1[NVAR][VECLEN];
double r2[NVAR][VECLEN];
double f[NVAR][VECLEN];

for( ic=0;ic<num_faces;ic+=VECLEN )
{
  // gather
  aos2soa(q,&(ifq[0][ic]),u1);
  aos2soa(q,&(ifq[1][ic]),u2);
  aos2soa(rhs,&(ifq[0][ic]),r1);
  aos2soa(rhs,&(ifq[1][ic]),r2);

  // compute
  #pragma omp simd simdlen(VECLEN) \
          aligned(geo:ALIGN,u1:ALIGN,u2:ALIGN)
  for( iv=0;iv<VECLEN;iv++ )
  {
    for( jv=0;jv<NVAR;jv++ )
    {
      f[jv][iv] = geo[ic+iv]*(u2[jv][iv]-u1[jv][iv]);
      r1[jv][iv]-= f[jv][iv];
      r2[jv][iv]+= f[jv][iv];
    }
  }

  // scatter
  soa2aos(r1,&(ifq[0][ic]),rhs);
  soa2aos(r2,&(ifq[1][ic]),rhs);
}

Listing 16: Integration of primitives for on the fly conversion between AoS and SoAdata layouts in a face-based kernel.


Consequently, the SoA layout can be replaced with the hybrid AoSSoA layout. In the AoSSoA layout, face attributes such as normals and coordinates are grouped into short vectors of length equal to the underlying vector register size or a multiple of it. As an example, considering the normals to the face in three dimensions, x, y, z, on AVX/AVX2 with 256-bit wide registers, the hybrid AoSSoA implementation maps these to the layout presented in Figure 23. This increases locality, since distinct variables are stored at a stride equal to one or more vector load and store operations. Furthermore, whereas before three independent streams were required to load the normals in each dimension, the new layout merges these into a single stream whilst still allowing for aligned vector load and store operations.

x0 x1 x2 x3   y0 y1 y2 y3   z0 z1 z2 z3   ...   xn-3 xn-2 xn-1 xn   yn-3 yn-2 yn-1 yn   zn-3 zn-2 zn-1 zn

Figure 23: AoSSoA layout for storing the normals to the face in three dimensions.
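The corresponding addressing reduces to a simple block computation (a sketch; VECLEN as defined in Listing 12 and nvar = 3 for the normals):

// hedged sketch: address of variable 'var' of face 'face' in an AoSSoA
// array whose blocks hold VECLEN faces of each of 'nvar' variables
static inline double* aossoa_at(double *base, int face, int var, int nvar)
{
    int blk  = face / VECLEN;   // which AoSSoA block
    int lane = face % VECLEN;   // position within the block
    return &base[(blk*nvar + var)*VECLEN + lane];
}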

5.4.7 Loop tiling

The transfer of data to and from the memory hierarchy is the main bottleneck in unstructured mesh applications based on finite volume discretizations. Thus, techniques and algorithms that minimize this movement will most likely yield performance improvements. In this work, this was accomplished via loop tiling, also known as loop blocking or strip mining, with the purpose of reusing data across multiple kernels before writing the results back to main memory. The benefits of such optimisations in unstructured mesh applications have previously been demonstrated by Giles et al [33] via an analytical study.

The implementation of loop tiling in this work was limited to the linear solver and, more specifically, to the computation of the linearised fluxes, where the largest amount of time was being spent. This consisted of evaluating both the linearised inviscid and the linearised viscous flux computations over a unified iteration space. In essence, the inviscid fluxes are computed first for a pre-set number of faces, followed by the viscous computations, and this sequence is repeated until all faces of the domain have been processed. The number of faces processed in each set of iterations was determined by telescoping through various multiples of the underlying register size in order to block the data in the L2 cache of each architecture.
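Schematically, the fused and tiled sweep takes the following form (TILE being the face-count block size selected by the telescoping described above; the kernel entry points shown are illustrative wrappers around the diflux and dvflux kernels of Table 3):

// hedged sketch of the fused linearised-flux sweep over face tiles
for( int start=0; start<num_faces; start+=TILE )
{
    int end = (start + TILE < num_faces) ? start + TILE : num_faces;
    diflux(start, end);  // linearised inviscid fluxes on this tile
    dvflux(start, end);  // linearised viscous fluxes reuse the tile's
                         // cell data while it is still resident in L2
}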

5.4.8 Software Prefetching

In structured codes with contiguous and regular access patterns, memory parallelism is exploited transparently via the available hardware prefetchers across the different cache levels. In unstructured codes, the irregular and indirect access patterns make prefetching more difficult for the hardware to accomplish. In the majority of cases, hardware prefetchers are unable to anticipate which data is required in the upcoming iterations of face-based loops, due to the indirection and the non-consecutive traversal of the face end-points. However, this limitation can be addressed through the programmer's domain knowledge, since the order of traversal is known at run-time and can be deduced from the associated connectivity arrays.

The prefetch strategy implemented in this work is based on compiler intrinsics and auto-tuning. Prefetch instructions are executed inside the routines that gather and transpose cell-centred data from AoS to SoA as demonstrated in Listing 17. In this manner, as data is gathered and transposed from the group of cells traversed in vector iteration i, prefetches are also issued for the data of all the cells that will be visited in vector iteration i+d, where d is a distance parameter.

For software prefetching to be effective, the value of d has to be chosen carefully. If it is too small, the prefetched data will not arrive in time to be useful, and the redundant requests result in unnecessary memory traffic that consumes valuable memory bandwidth. On the other hand, if d is too large, the prefetched data will have been evicted by the time it is referenced, thereby increasing the number of cache misses and cache line replacements. The challenge of selecting the optimal value of d is exacerbated by the fact that it will invariably differ across computational kernels and processors, since it depends on architecture-specific metrics such as the latencies of the caches and main memory as well as kernel-specific characteristics such as the number of instructions executed per loop iteration [58], [45]. As a result, finding the optimal value of d across all kernels and for every distinct architecture can only be achieved by means of auto-tuning. In this work, an auto-tuning phase is executed once on every processor in order to telescope through a range of values for d as well as to assess the optimal placement of prefetch instructions across the available cache levels (i.e. L1 only, L1 and L2, or L2 only).
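
A minimal sketch of this sweep is shown below; the timing harness and the candidate distances are hypothetical and stand in for the actual auto-tuner:

    /* sweep index prefetch distances; the data distance is kept at
       half the index distance, as discussed below */
    int    best_d = 0;
    double best_t = 1.0e30;
    for (int d = 8; d <= 128; d *= 2) {
        double t = time_face_kernel(d, d >> 1);  /* hypothetical timing harness */
        if (t < best_t) { best_t = t; best_d = d; }
    }

The sweep is repeated for each placement option (L1 only, L1 and L2, L2 only) and the fastest combination is recorded per kernel and per architecture.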

In face-based loops, prefetch instructions are issued for both cell-centred data and for the indices in the connectivity arrays that are used for referencing the face end-points. Otherwise, cache misses that result from accessing the connectivity array would offset any benefits of prefetching the actual data. Furthermore, Ainsworth et al [5] demonstrated that the optimal distance for prefetching the indices is twice the distance used for prefetching the data. Thus, within the auto-tuning phase, the auto-tuner telescopes through multiple ranges of index and data prefetch distances whilst maintaining this ratio and selects the distances that result in the highest speed-up for every face-based kernel. This is demonstrated more clearly in Listing 17.

In cell-based loops, prefetch instructions are issued only for cell-centred data as there is no indirection. The optimal prefetch distance is obtained during the same auto-tuning phase that is performed for the face-based loops.

The efficiency of the software prefetching implementation is also improved by the choice of data structures for the cell-centred variables. A side effect of using the AoS data layout is that all successive variables within the structure are loaded at cache line granularity. This means that, compared to SoA, the AoS layout only requires that a single prefetch instruction is issued per structure using the address of the first variable, since consecutive entries will be loaded as well within the same cache line.

5.4.9 Multithreading

Thread-level parallelism has been exploited in the application via the utilisation of OpenMP directives and specifically targets the Intel Xeon Phi manycore architectures. For the multicore CPUs, multithreading in conjunction with MPI did not bring forth any performance improvement and was therefore abandoned. However, on the Intel Xeon Phi architecture, running more than one thread context per physical core is highly encouraged, especially on Knights Corner where it can hide memory latencies caused by the in-order core execution engine. Although exposing parallelism in loops over cells is trivial, exposing thread parallelism in face-based loops is more challenging and requires colouring for concurrency. There are a number of approaches which can be utilised in respect to the latter [10].


#include <immintrin.h>   /* _mm_prefetch and the _MM_HINT_* constants */

/* data prefetch distances are half the index distances (Ainsworth et al [5]) */
# if defined L1_INDEX
# define L1_DATA (L1_INDEX >> 1)
# endif
# if defined L2_INDEX
# define L2_DATA (L2_INDEX >> 1)
# endif

/* prefetch the connectivity indices ahead of the current vector iteration */
inline void prefetchi( int *pos )
{
# if defined L2_INDEX
  _mm_prefetch((char *)&(pos[L2_INDEX]), _MM_HINT_T1);
# endif
# if defined L1_INDEX
  _mm_prefetch((char *)&(pos[L1_INDEX]), _MM_HINT_T0);
# endif
}

/* prefetch the cell-centred data referenced half as far ahead */
inline void prefetchd( AOS_t *data, int *pos )
{
# if defined L2_INDEX
  for( int i = 0; i < NVAR; i++ )
    _mm_prefetch((char *)&(data[pos[L2_DATA]].var[i]), _MM_HINT_T1);
# endif
# if defined L1_INDEX
  for( int i = 0; i < NVAR; i++ )
    _mm_prefetch((char *)&(data[pos[L1_DATA]].var[i]), _MM_HINT_T0);
# endif
}

inline void aos2soa( AOS_t *src, int *pos, double dest[NVAR][VECLEN] )
{
  ...
  // prefetch the data
  prefetchd(src, pos);
  ...
}

for( ic = 0; ic < num_faces; ic += VECLEN )
{
  // prefetch the indices once per iteration
  prefetchi(&(ifq[0][ic]));
  prefetchi(&(ifq[1][ic]));
  // gather
  aos2soa(q, &(ifq[0][ic]), u1);
  ...
}

Listing 17: Implementation of software prefetching in face-based loops.


In this work, the available colouring algorithms were used to reorder the list of faces such that the number of dependency-free faces was not only a multiple of the underlying vector register size but could also be divided so that each active thread per core processes an equal number of vector iterations. The iteration space was then scheduled statically with a chunk size equal to one vector iteration, whereby each thread executes vector iterations in a round-robin fashion. The main rationale behind this approach is that it guarantees correctness and integrates well with MPI, whereby each rank is pinned to a physical core with subsequent threads spawned within the same core domain. This maintains data locality and affinity to the underlying cache hierarchy, reduces the traffic caused by the cache coherence protocols and mitigates the risk of false sharing. Most importantly, this approach hides latency best, especially on Knights Corner, since it guarantees that each thread will access different cell-centred data in face-based loops due to colouring and that every miss and stall in L1 can be circumvented by switching between active threads that have data available.
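
The scheme can be sketched as follows, assuming each colour's face count has been padded to a multiple of the vector length times the threads per core; the array and function names are illustrative:

    /* one parallel loop per colour; chunk = one vector iteration */
    for (int c = 0; c < num_colours; c++) {
        int nvec = colour_size[c] / VECLEN;      /* vector iterations in colour c */
        #pragma omp parallel for schedule(static, 1)
        for (int iv = 0; iv < nvec; iv++) {
            int ic = colour_offset[c] + iv * VECLEN;
            face_flux_simd(ic);                  /* processes faces ic .. ic+VECLEN-1 */
        }
    }

With schedule(static, 1), the vector iterations are dealt out to threads in round-robin fashion at the granularity of a single vector iteration, which is exactly the interleaving described above.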

5.5 results and discussions

5.5.1 Effects of optimisations

Figures 24 to 32 present the impact that each optimisation has on the performance of both classes of computational kernels and on the overall application run-time. Results are averaged across 10 Newton-Jacobi iterations while running on 1 MPI rank. For the Knights Landing system, results in this section were obtained while running in the Quadrant/Cache configuration.

grid renumbering Renumbering the grid via the RCMK algorithm led to minor improvements in performance for face-based kernels. Unsurprisingly, the highest impact is observed in kernels with a lower flop per byte ratio, represented by the computation of the linearised inviscid and viscous fluxes, where the speed-up is as high as 10%.

vectorization Vectorization results in the largest increases in performance across all architectures and computational patterns. In face-based kernels, these increases range between 2-5X on the multicore CPUs and 2-3X on the Knights Corner and Knights Landing processors.


[Figure: bar charts of GFLOPS and speed-up for the iflux, vflux, diflux and dvflux kernels on mesh 1 and mesh 2, with one bar per optimisation (Reference, RCMK, Vectorization, Colouring, AoS, Gather/Scatter, AoSSoA, Loop tiling, Sw. Prefetch).]

Figure 24: Effects of optimisations on the performance of face-based loops (flux computations) on the Sandy Bridge (SNB E5-2650) system. Results are reported as GFLOPS and speed-up and were collected by running the application on 1 MPI rank (i.e. single-core).


[Figure: bar charts of GFLOPS and speed-up for the iflux, vflux, diflux and dvflux kernels on mesh 1 and mesh 2, with one bar per optimisation (Reference, RCMK, Vectorization, Colouring, AoS, Gather/Scatter, AoSSoA, Loop tiling, Sw. Prefetch).]

Figure 25: Effects of optimisations on the performance of face-based loops (flux computations) on the Broadwell (BDW E5-2680) system. Results are reported as GFLOPS and speed-up and were collected by running the application on 1 MPI rank (i.e. single-core).


[Figure: bar charts of GFLOPS and speed-up for the iflux, vflux, diflux and dvflux kernels on mesh 1 and mesh 2, with one bar per optimisation (Reference, RCMK, Vectorization, Colouring, AoS, Gather/Scatter, AoSSoA, Loop tiling, Sw. Prefetch).]

Figure 26: Effects of optimisations on the performance of face-based loops (flux computations) on the Skylake (SKL Gold 6140) system. Results are reported as GFLOPS and speed-up and were collected by running the application on 1 MPI rank (i.e. single-core).


[Figure: bar charts of GFLOPS and speed-up for the iflux, vflux, diflux and dvflux kernels on mesh 1 and mesh 2, with one bar per optimisation (Reference, RCMK, Vectorization, Colouring, AoS, Gather/Scatter, AoSSoA, Loop tiling, Sw. Prefetch, 2 Threads, 3 Threads, 4 Threads, SwP & 4 Thrds).]

Figure 27: Effects of optimisations on the performance of face-based loops (flux computations) on the Knights Corner (KNC 7120P) system. Results are reported as GFLOPS and speed-up and were collected by running the application on 1 MPI rank (i.e. single-core).


[Figure: bar charts of GFLOPS and speed-up for the iflux, vflux, diflux and dvflux kernels on mesh 1 and mesh 2, with one bar per optimisation (Reference, RCMK, Vectorization, Colouring, AoS, Gather/Scatter, AoSSoA, Loop tiling, Sw. Prefetch, 2 Threads, 3 Threads, 4 Threads).]

Figure 28: Effects of optimisations on the performance of face-based loops (flux computations) on the Knights Landing (KNL 7210) system. Results are reported as GFLOPS and speed-up and were collected by running the application on 1 MPI rank (i.e. single-core) and in Quadrant/Cache mode.


[Figure: bar charts of GFLOPS for the cell-based kernel on mesh 1 and mesh 2 across SNB E5-2650, BDW E5-2680, SKL Gold 6140, KNL 7210 and KNC 7120P, with one bar per optimisation (Reference, Vectorization, AoS, Gather/Scatter, Sw. Prefetch, 2 Threads, 3 Threads, 4 Threads, SwP & 4 Thrds).]

Figure 29: Effects of optimisations on the performance of cell-based loops (update to primitive variables) reported as GFLOPS. Results were collected by running the application on 1 MPI rank (i.e. single-core). The Knights Landing (KNL 7210) system was configured as Quadrant/Cache.


[Figure: bar charts of speed-up for the cell-based kernel on mesh 1 and mesh 2 across SNB E5-2650, BDW E5-2680, SKL Gold 6140, KNL 7210 and KNC 7120P, with one bar per optimisation (Reference, Vectorization, AoS, Gather/Scatter, Sw. Prefetch, 2 Threads, 3 Threads, 4 Threads, SwP & 4 Thrds).]

Figure 30: Effects of optimisations on the performance of cell-based loops (update to primitive variables) reported as speed-up relative to the reference implementation. Results were collected by running the application on 1 MPI rank (i.e. single-core). The Knights Landing (KNL 7210) system was configured as Quadrant/Cache.


[Figure: bar charts of the average time per solution update (seconds) on mesh 1 and mesh 2 across SKL Gold 6140, BDW E5-2680, SNB E5-2650, KNC 7120P and KNL 7210, with one bar per optimisation.]

Figure 31: Effects of optimisations on the average time per solution update within a Newton-Jacobi iteration obtained through each optimisation. Results were collected by running the application on 1 MPI rank (i.e. single-core). The Knights Landing (KNL 7210) system was configured as Quadrant/Cache.


[Figure: bar charts of the speed-up per solution update on mesh 1 and mesh 2 across SKL Gold 6140, BDW E5-2680, SNB E5-2650, KNC 7120P and KNL 7210, with one bar per optimisation.]

Figure 32: Effects of optimisations on the speed-up per solution update within a Newton-Jacobi iteration obtained through each optimisation relative to the reference implementation. Results were collected by running the application on 1 MPI rank (i.e. single-core). The Knights Landing (KNL 7210) system was configured as Quadrant/Cache.


As one can observe, the gains are considerably higher in kernels with a high arithmetic intensity such as the computation of the inviscid second order MUSCL fluxes (iflux), since efficient vector computations are mandatory for achieving a high instruction throughput on modern processors. A similar impact is seen for cell-based kernels where vectorization provided the highest performance impact compared to all other optimisations. Furthermore, all of these translate to an overall application speed-up of between 2-3X relative to the original baseline.

colouring The colouring/reordering of faces for enabling vectorized scatters of data to the face end-points does not yield any benefits on the multicore CPUs, where it is in fact detrimental to performance. In contrast, it is beneficial on the manycore processors and leads to marginal speed-ups, although no higher than 10%. This difference might stem from the fact that non-vectorized serial stores are significantly slower on the manycore processors than the SIMD scatter instruction. On the multicore CPUs, the Sandy Bridge and Broadwell architectures implement AVX/AVX2, which do not provide vector scatter functionality. On these processors, scatters are performed as a series of vector insert and store operations that have a higher cumulative latency than their serial store counterparts. On Skylake, the AVX-512F implementation does provide support for vector scatters. On this architecture, the small decrease in performance is therefore attributed to the increase in the distance between the first and second face indices due to face reordering, which leads to an increase in cache misses owing to the limitations of the SoA layout in kernels with irregular access patterns. Nevertheless, colouring is important further down the line once additional optimisations are implemented, since it exposes parallelism in writing back values across all face-based loops and allows for the exploitation of thread parallelism.

array of structures The conversion of cell-centred data structures from SoA to AoS provides improvements in performance for some of the face-based kernels on the multicore CPUs. In face-based kernels with a high arithmetic intensity, the speed-up as a result of this conversion is as much as 25%, although this is not replicated across the board. On Knights Landing, the AoS layout sees significant speed-ups of as much as 50% across the majority of the studied face-based loops. This is in contrast to Knights Corner, where there is no noticeable improvement. Furthermore, modifying the memory layout of cell-centred data structures has a negative effect in cell-based loops. This is important when bearing in mind that the overall solver performance decreases after the conversion from SoA to AoS. However, as with colouring, this optimisation is required for further optimisations which exploit the additional data locality available within each vector of cell-centred variables.

gather scatter optimisations The on-the-fly transposition from AoS to SoA using specialised intrinsics-based functions improves performance across all processors. The impact is significantly higher on architectures with wider vector registers such as Skylake and the Intel Xeon Phi processors. This is due to the fact that consecutive variables are gathered via a single aligned vector load and subsequently transposed and arranged into the correct lane via architecture-specific permute instructions, and the fact that the number of SIMD instructions required for the transposition scales logarithmically compared to the serial implementation. It is worth mentioning that on Knights Landing and Skylake with AVX-512F, permute instructions using a single source operand (vpermpd) perform better than the newly available two source operand instructions (vpermi2pd) since they exhibit superior throughput. On Sandy Bridge and Broadwell with AVX/AVX2, the best version is also the one presented in Listing 13, even when compared with other implementations that performed the first step of the shuffle on the load ports through the utilisation of the vinsertf128 instruction on a memory operand. This would indicate that the bottleneck on these architectures is not port 5 pressure, where all of the interleave and shuffle operations are executed.
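
For reference, the following is a minimal sketch of one common AVX pattern for transposing four 4-wide double-precision rows (AoS records) into SoA lanes; it illustrates the logarithmic instruction count but is not necessarily identical to the implementation of Listing 13:

    #include <immintrin.h>

    /* r0..r3 each hold one cell's four variables; after the transpose,
       v0..v3 each hold one variable for four cells */
    static inline void transpose4x4(__m256d r0, __m256d r1, __m256d r2, __m256d r3,
                                    __m256d *v0, __m256d *v1, __m256d *v2, __m256d *v3)
    {
        __m256d t0 = _mm256_unpacklo_pd(r0, r1);     /* a0 b0 a2 b2 */
        __m256d t1 = _mm256_unpackhi_pd(r0, r1);     /* a1 b1 a3 b3 */
        __m256d t2 = _mm256_unpacklo_pd(r2, r3);     /* c0 d0 c2 d2 */
        __m256d t3 = _mm256_unpackhi_pd(r2, r3);     /* c1 d1 c3 d3 */
        *v0 = _mm256_permute2f128_pd(t0, t2, 0x20);  /* a0 b0 c0 d0 */
        *v1 = _mm256_permute2f128_pd(t1, t3, 0x20);  /* a1 b1 c1 d1 */
        *v2 = _mm256_permute2f128_pd(t0, t2, 0x31);  /* a2 b2 c2 d2 */
        *v3 = _mm256_permute2f128_pd(t1, t3, 0x31);  /* a3 b3 c3 d3 */
    }

Eight shuffle-class instructions thus transpose sixteen values, versus the sixteen scalar moves of a serial gather.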

In cell-based loops, performing the conversion between AoS and SoA on the fly via compiler intrinsics leads to a moderate improvement in performance for the candidate kernel and, in some cases, offsets the performance lost from switching to AoS from SoA in the first place. More importantly, the implementation of these primitives leads to whole-application speed-ups on all processors, ranging from a few percent on the multicore CPUs, where it compensates for the loss in performance due to the switch to AoS in cell-based loops, to as much as 50% on the Xeon Phi processors on top of the previous optimisations.

array of structures structure of arrays The conversion of face data structures to the hybrid AoSSoA layout concerns only face-based loops and yields minor improvements in performance. However, these are not significant enough to warrant its usefulness, as it did not generate any substantial speed-ups on any architecture. As a result, the effort of implementing such a hybrid memory layout within a code of representative size might not be warranted.

loop tiling Loop tiling yields a large increase in performance for the computation of the linearised viscous fluxes, since groups of consecutive faces processed during the computation of the linearised inviscid fluxes are then passed to the viscous routine and blocked in L2. The best performing tile sizes were 128 for Sandy Bridge, Broadwell and Knights Corner and 512 for Skylake and Knights Landing due to their larger L2 caches. It is therefore believed that extending the scope of such optimisations across the entire application, and not just the linear solver, would be beneficial across all cache-based architectures. This could be implemented by encapsulating the entire solver over a unified iteration space at a stride equal to the vector register size or a multiple of it. However, the difficulty in such an implementation is resolving the dependencies between face-based and subsequent cell-based loops, which is mesh dependent and can only be done at run-time.

software prefetching Software prefetching exhibits substantial speed-ups on the Knights Corner architecture due to the in-order core design, and on the Skylake and Knights Landing processors as a result of the larger L2 cache. On Knights Corner, the best performing strategy was to issue prefetch instructions at distances of 32 and 64 for the indices in L1 and L2 respectively, and half of that for the actual data, as described previously in the methodology. For Knights Landing and Skylake, the best strategy was to only issue prefetch instructions for the L2 cache at a distance of 32 for the index and therefore 16 for the data, as any prefetch instructions for L1 were in fact detrimental to performance. The reason for this is most likely related to the smaller size of the L1 cache (32KB), which remained unchanged from Sandy Bridge and Broadwell while the L2 was increased by a factor of four. Furthermore, in the case of Knights Landing, this approach bears fruit since the L1 hardware prefetcher can operate on irregular streams whereas the L2 prefetcher cannot, which is why issuing prefetch instructions for L2 in software is advisable. On Sandy Bridge and Broadwell, the advantage of software prefetching is minimal to virtually non-existent. This is probably due to the smaller L2 caches on these architectures, which lead to capacity misses. The effect that various combinations of prefetch distances have on the performance of face-based kernels and the overall solver across all architectures, obtained as a result of auto-tuning, can be inspected in Appendix A.1.

For the cell-based loop kernel, software prefetching has no effect on Sandy Bridge and Broadwell but is beneficial on Skylake, Knights Corner and Knights Landing. This is not surprising for Knights Corner, where the lack of an L1 hardware prefetcher mandates the utilisation of prefetch instructions, either in software or through the compiler, even for regular unit-stride access patterns. The prefetch distances for cell-based kernels with linear unit-stride loads have been selected via auto-tuning as well. The best linear prefetch distances were found to be 32 for L1 and 64 for L2 across Knights Corner, Knights Landing and Skylake.

Finally, an important consideration is that if software prefetching is implemented explicitly as described above, compiler prefetching should be switched off in order to avoid the issuing of competing prefetch instructions, which can degrade performance.
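
Assuming the Intel compiler toolchain used for these architectures, this can be sketched as follows; the exact flag spelling varies between compiler versions, so both lines should be treated as illustrative:

    icpc -O3 -qopt-prefetch=0 solver.cpp    # disable compiler-generated prefetches globally

or per loop in the source:

    #pragma noprefetch
    for( ic = 0; ic < num_faces; ic += VECLEN ) { ... }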

multithreading Multithreading via OpenMP on Knights Corner yields significant speed-ups, with performance increasing almost linearly with the number of threads. The reason for this lies again in the in-order nature of the Knights Corner core, where a load that misses in the L1 cache leads to a complete pipeline stall. Having more than one thread in flight can help minimize memory latency, as context can be switched to a thread for which data is available. Since both software prefetching and multithreading on Knights Corner have the same purpose, the runs with 2, 3 and 4 threads per MPI rank and physical core were performed with software prefetching disabled, while for the SwP & 4 Thrds analysis, both software prefetching and 4 active threads were utilised per MPI rank. The results indicate that best performance on Knights Corner is obtained when both threading and prefetching are enabled; however, if required to choose between the two, careful software prefetching based on an auto-tuning approach yields better performance than multithreading alone. This is an important aspect to take into account since the implementation of software prefetching is less intrusive and error prone than exposing another level of parallelism in the application for multithreading within an MPI rank. Surprisingly, on Knights Landing, multithreading was actually detrimental to performance.


This is most likely due to the fact that the core architecture of Knights Landing is out-of-order and therefore capable of utilising all core resources with a single thread per core. Running more than one thread per core leads to the division of core resources among the in-flight threads.

5.5.2 Performance scaling across a node

Performance strong scaling is performed for each computational kernel and the whole solver across all architectures. For the multicore-based two-socket compute nodes, the baseline implementation is compared across both mesh sizes with the best optimised version. On the Knights Corner co-processor, strong scaling studies are performed for three different versions besides the baseline in order to assess their scalability across the entire chip. These are the optimised version with software prefetching enabled and multithreading disabled, the optimised version with four threads per MPI rank and no software prefetching and, lastly, the optimised version with both software prefetching and four threads per rank enabled. On Knights Landing, the difference in performance between the baseline and optimised versions is studied with the additional exploration of the memory configurations that are available on this architecture. In cache mode, the MCDRAM is configured as a direct mapped cache and acts as a memory-side buffer transparent to the user. In flat mode, all memory allocations are performed explicitly in MCDRAM using the numactl utility. For the larger mesh 2, approximately 10% of the allocations were re-routed to DDR memory after utilising all of the available 16GB of MCDRAM. In DDR mode, the allocation of memory has been done only in DDR memory and MCDRAM has not been utilised at all. The last option was a hybrid evaluation where the storage of cell-centred data structures is allocated in MCDRAM whilst face data is allocated in DDR together with the MPI buffers, in order to interleave the memory accesses and therefore exploit the bandwidth of both memories. This has been implemented in the code using the libmemkind interface.
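
As an illustration, the hybrid allocation can be sketched with the memkind interface roughly as follows; the array names and sizes are illustrative rather than taken from the solver:

    #include <memkind.h>

    /* cell-centred data in the high-bandwidth MCDRAM */
    AOS_t *q = (AOS_t *)memkind_malloc(MEMKIND_HBW, ncells * sizeof(AOS_t));

    /* face data and MPI buffers in DDR4, exploiting the remaining controllers */
    double *fnorm = (double *)memkind_malloc(MEMKIND_DEFAULT,
                                             3 * nfaces * sizeof(double));
    ...
    memkind_free(MEMKIND_HBW, q);
    memkind_free(MEMKIND_DEFAULT, fnorm);

For whole-application placement without code changes, numactl --membind can instead pin all allocations to the MCDRAM NUMA node in flat mode.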

5.5.2.1 Face-based Loops

The performance scaling of face-based loops on the multicore nodes (Figure 33) is dependent on the arithmetic intensity of each individual kernel. Best performance is exhibited by the kernel computing the inviscid second order fluxes (iflux) due to its high arithmetic intensity (i.e. 1.3), which scales almost linearly with the number of cores on each node. Good performance is also exhibited by the kernel computing the linearised viscous fluxes (dvflux) due to loop tiling, which improves its effective arithmetic intensity since data is accessed from the higher cache levels rather than main memory. Therefore, it can be seen that optimisations that reduce the amount of data movement from the lower levels of the memory hierarchy to the processor are beneficial when scaling across the entire device as well.

On the manycore processors (Figures 34 and 35), the difference between the baseline and best optimised version for the face-based loops at full concurrency varies between 12-16X on Knights Corner (60 cores x 4 threads) and 7.3-11.7X on Knights Landing (64 cores) and remains similar to the results on 1 MPI rank.

On Knights Corner (Figure 34), for the compute bound TVD MUSCL fluxes (iflux), the best version as one scales across all physical cores is the one that performs both software prefetching and runs with four threads per core. This is followed by the version with multithreading and no software prefetching, while software prefetching only is last by a significant margin. However, for the more memory bound face-based loops such as the computation of the linearised inviscid fluxes (diflux) or the non-linear viscous fluxes (vflux), the second best version is software prefetching only, while the worst of the optimised versions is the one consisting of multithreading only (i.e. four threads per MPI rank). This demonstrates that for compute bound kernels, the best approach for exploiting the performance of an in-order architecture is to run more threads, whereas for memory bound kernels, a good software prefetching strategy is more beneficial.

With regard to the performance of the various memory modes on Knights Landing, not exploiting the MCDRAM, either as a cache or explicitly as a memory-side buffer, leads to a factor of three drop in performance once the application scales past 16 cores on the processor. This is to be expected since the difference in bandwidth between MCDRAM and DDR4 is more than a factor of four as measured with STREAM. The best performing version is the interleaved implementation, where both MCDRAM and DDR4 are utilised together and where the allocation of specific data structures is performed across both interfaces. This is due to the fact that, through this approach, one is able to not only utilise all eight MCDRAM memory channels but also the four available DDR4 controllers. As DDR4 exhibits lower bandwidth but better latency compared to MCDRAM, allocating latency sensitive data structures in DDR4 (i.e. MPI buffers) and all other data structures in MCDRAM allows for the exploitation of both interfaces simultaneously, therefore increasing the available memory bandwidth.

5.5.2.2 Cell-Based Loops

For the cell-based kernel (Figure 36), the baseline implementation slightly outperforms the optimised version on Skylake and Broadwell as the kernel scales across the node. This is most likely due to the extra latency incurred by the transposition from AoS to SoA in the optimised implementation, which becomes more of a bottleneck as memory bandwidth is saturated. The optimised implementation outperforms the baseline on Sandy Bridge and the Xeon Phi processors, since serial instructions are significantly slower on the manycore architectures than their vector counterparts and the Sandy Bridge node is more balanced compared to Broadwell and Skylake. As cell-based kernels are memory bound with low flop per byte ratios, they tend to scale with the available memory bandwidth. Therefore, optimisations focused on improving floating-point performance, such as vectorization, are not as effective as the number of cores is increased.

The negative effect that the conversion from SoA to AoS had on the performance of cell-based loops is made evident in the results for the Broadwell and Skylake systems, where the overhead of performing the transposition between the two formats is the main bottleneck at full node concurrency.

On Knights Corner, as with the face-based kernels, best performance was obtained by a large margin when running with 4 threads per MPI rank and software prefetching.

On Knights Landing, the DDR4 interface scales better than with face-based kernels due to the regular unit-stride access pattern; however, it is almost 3X worse than the other alternatives that exploit the MCDRAM.

5.5.2.3 Full Application

Results for whole-application strong scaling can be seen in Tables 6, 7 and 8.


[Figure: GFLOPS versus core count (up to 36 cores) for the iflux, vflux, diflux and dvflux kernels, comparing the reference (Ref) and optimised (Opt) versions on SNB E5-2650, BDW E5-2680 and SKL Gold 6140.]

Figure 33: Performance strong scaling of face-based loops (flux computations) on the two-socket multicore nodes.


[Figure: GFLOPS versus core count (up to 60 cores) for the iflux, vflux, diflux and dvflux kernels, comparing Ref, Opt 4 Threads, Opt Sw. Prefetch and Opt SwP & 4 Thrds.]

Figure 34: Performance strong scaling of face-based loops (flux computations) on the Intel Xeon Phi Knights Corner 7120P co-processor.


[Figure: GFLOPS versus core count (up to 64 cores) for the iflux, vflux, diflux and dvflux kernels, comparing Ref and Opt in Cache, MCDRAM and DDR modes, plus Opt DDR & MCDRAM.]

Figure 35: Performance strong scaling of face-based loops (flux computations) on the Intel Xeon Phi Knights Landing 7210 processor.


[Figure: GFLOPS versus core count for the cell-based kernel on SNB E5-2650, BDW E5-2680, SKL Gold 6140, KNC 7120P and KNL 7210, comparing the reference with the optimised variants (threads, software prefetching, and the KNL memory modes Cache, MCDRAM, DDR and DDR & MCDRAM).]

Figure 36: Performance strong scaling of the cell-based loop kernel (updates to primitive variables) on all processors.


At full concurrency and on both mesh sizes, the best optimised version is between 2.6 and 2.8X faster on the multicore CPUs, 8.6X on Knights Corner and 5.6X on the best performing configuration of Knights Landing in Quadrant and Flat mode with MCDRAM and DDR4 accesses interleaved. As expected, running only with DDR4 is 2.5X slower than the other modes that involve the utilisation of MCDRAM. The approach of utilising both memories explicitly is the fastest version, although by a very small percentage. On Knights Corner, running 4 concurrent threads per MPI rank combined with software prefetching, or performing software prefetching only, exhibit the same performance as the application scales across the whole coprocessor and saturates the available memory bandwidth.

5.5.3 Performance modelling

A visualisation of the Roofline model based on applying the optimisations presented in Section 5.4 can be seen in Figure 37 for the two-socket multicore CPU nodes and Figure 38 for the manycore Intel Xeon Phi nodes. For brevity, face-based loops are represented only by the kernel with the highest arithmetic intensity (i.e. iflux).
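
The rooflines follow the standard model: for a kernel with arithmetic intensity I (flops per byte of memory traffic), the attainable performance is bounded by

    P_attainable = min(P_peak, I x B)

where P_peak is the peak floating-point throughput of the device and B its sustained memory bandwidth. Kernels whose intensity lies left of the ridge point are bound by memory bandwidth, while those to the right are bound by compute throughput.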

single core Analysis of the single core rooflines on the multicore systems demonstrates that the performance of cell-based loops is limited by the memory bandwidth, while the performance of face-based loops is limited by the imbalance in floating point operations since the FMA units are not exploited on Broadwell and Skylake. The same considerations also apply to the manycore systems at this granularity.

full node At full node concurrency, the presence of irregular access patterns in face-based loops is the main impediment to performance for these classes of kernels. In contrast, the performance of cell-based loops comes very close to, or slightly surpasses, the memory bandwidth ceiling. This is possible due to a combination of prefetching and the fact that some of the data is available in the cache hierarchy and does not come from main DRAM as assumed by the performance model. This is a limitation of the model worth considering, as it allows certain optimisations that exploit the cache hierarchy or memory parallelism to pierce through the peak memory bandwidth roofline.


(columns: mesh 1 followed by mesh 2)

SNB E5-2650
# cores | reference (s) | optimised (s) | speedup (x) | reference (s) | optimised (s) | speedup (x)
 1 | 83.0 | 29.0 | 2.8 | 184.0 | 61.0 | 3.0
 2 | 42.0 | 15.3 | 2.7 |  86.8 | 29.7 | 2.9
 4 | 23.0 |  8.9 | 2.5 |  44.7 | 15.1 | 2.9
 8 | 13.6 |  5.8 | 2.3 |  24.7 |  9.0 | 2.7
10 | 10.7 |  4.5 | 2.3 |  19.5 |  7.5 | 2.6
12 |  9.4 |  3.7 | 2.5 |  18.3 |  7.2 | 2.6
16 |  7.4 |  2.8 | 2.6 |  14.4 |  5.5 | 2.6

BDW E5-2680
# cores | reference (s) | optimised (s) | speedup (x) | reference (s) | optimised (s) | speedup (x)
 1 | 46.5 | 16.8 | 2.7 | 91.0 | 35.7 | 2.5
 2 | 24.9 |  9.1 | 2.7 | 47.4 | 20.2 | 2.3
 4 | 14.6 |  5.5 | 2.6 | 26.5 | 11.2 | 2.3
 8 |  8.4 |  3.8 | 2.2 | 16.1 |  7.4 | 2.1
14 |  6.4 |  3.0 | 2.1 | 10.4 |  5.9 | 1.7
16 |  5.8 |  2.5 | 2.3 |  9.1 |  5.2 | 1.7
18 |  5.3 |  2.4 | 2.2 |  8.8 |  4.6 | 1.9
22 |  4.6 |  1.9 | 2.4 |  7.9 |  3.8 | 2.0
28 |  3.8 |  1.4 | 2.7 |  7.6 |  2.8 | 2.7

SKL Gold 6140
# cores | reference (s) | optimised (s) | speedup (x) | reference (s) | optimised (s) | speedup (x)
 1 | 47.8 | 15.5 | 3.0 | 91.3 | 32.1 | 2.8
 2 | 25.5 |  8.0 | 3.1 | 49.9 | 16.4 | 3.0
 4 | 13.4 |  4.5 | 2.9 | 28.1 |  8.8 | 3.1
 8 |  7.4 |  2.9 | 2.5 | 15.5 |  5.5 | 2.8
14 |  5.1 |  2.2 | 2.3 | 10.7 |  4.2 | 2.5
18 |  4.7 |  2.1 | 2.2 |  9.0 |  3.7 | 2.4
22 |  3.9 |  1.6 | 2.4 |  7.8 |  3.1 | 2.4
24 |  3.8 |  1.5 | 2.5 |  7.2 |  2.8 | 2.5
26 |  3.5 |  1.4 | 2.5 |  6.6 |  2.6 | 2.5
32 |  2.9 |  1.2 | 2.4 |  6.0 |  2.2 | 2.7
36 |  2.8 |  1.0 | 2.8 |  5.3 |  1.9 | 2.7

Table 6: Comparison between time to solution update measured in seconds of the baseline and best optimised implementations on the multicore CPUs.


[Figure: log-log roofline plots (GFLOPS versus arithmetic intensity) for 1 core and full node on SNB E5-2650, BDW E5-2680 and SKL Gold 6140, with MUL/ADD, FMA, AVX/AVX2 and AVX-512 compute ceilings and an irregular-access bandwidth ceiling; the iflux and dvar kernels are plotted.]

Figure 37: Multicore rooflines of single core and full node configurations for both classes of computational kernels. Key: Reference, Vectorization, Gather/Scatter, Sw. Prefetch.


[Figure: log-log roofline plots for 1 core and full node on KNL 7210 and KNC 7120P, with FMA, IMCI/AVX-512 and TLP compute ceilings and MCDRAM, prefetching and irregular-access bandwidth ceilings.]

Figure 38: Manycore rooflines of single core and full node configurations for both classes of computational kernels. Key for the KNC 7120P system and 1 core KNL 7210: Reference, Vectorization, Gather/Scatter, Sw. Prefetch, Multi-threading. Key for KNL 7210 (64 cores): Reference Cache, Opt DDR, Opt Cache, Opt MCDRAM, Opt MCDRAM & DDR.


KNC 7120P, mesh 1
# cores | reference (s) | optimised x 4 threads (s) | optimised prefetch (s) | optimised prefetch x 4 threads (s) | speedup (x)
 1 | 1349.0 | 193.6 | 183.4 | 149.1 | 9.0
 2 |  703.0 | 111.9 |  89.7 |  78.9 | 8.9
 4 |  385.8 |  56.7 |  48.2 |  40.8 | 9.4
 8 |  192.9 |  34.4 |  28.5 |  25.1 | 7.6
16 |  110.1 |  17.5 |  14.3 |  13.6 | 8.0
32 |   62.5 |  10.0 |   8.4 |   7.9 | 7.9
60 |   43.8 |   6.0 |   5.0 |   5.0 | 8.7

Table 7: Comparison between time to solution update measured in seconds of the baseline and the optimised versions with 4 hyperthreads and no software prefetching, with software prefetching and no hyperthreading, and with both software prefetching and 4 hyperthreads on the Intel Xeon Phi Knights Corner coprocessor. The speed-up is calculated relative to the baseline results and the timings of the best optimisation (prefetch x 4 threads). On this architecture, experimental runs could only be performed on mesh 1 at full concurrency due to the 16GB memory limit.

On the Knights Landing architecture, the difference between using MCDRAM as a cache or as an explicit memory buffer, as well as exploiting both MCDRAM and DDR4, is very small. It is, however, believed that better performance of the interleaved version can be obtained by auto-tuning and testing a variety of data structure allocations in order to find the right balance.

5.5.4 Architectural Comparison

face-based loops In face-based loops (Figure 39), the high arithmetic intensity means that architectures with wider SIMD units and larger numbers of cores will perform best, since these kernels tend to scale well across vector lanes and across cores and sockets. Consequently, best performance at single core granularity for kernels computing fluxes at cell interfaces was obtained on the Skylake processor followed by Broadwell. The performance of a single Knights Landing core was on par with Sandy Bridge, while the Knights Corner core performed the worst, which is to be expected due to the in-order execution unit.


(columns: mesh 1 followed by mesh 2)

KNL 7210 Cache
# cores | reference (s) | optimised (s) | speedup (x) | reference (s) | optimised (s) | speedup (x)
 1 | 205.0 | 33.7 | 6.0 | 411.0 | 71.1 | 5.7
 2 | 116.0 | 17.8 | 6.5 | 238.4 | 37.8 | 6.3
 4 |  57.1 |  9.4 | 6.0 | 120.4 | 19.3 | 6.2
 8 |  29.0 |  5.3 | 5.4 |  63.3 | 10.0 | 6.3
16 |  16.2 |  2.8 | 5.7 |  32.1 |  5.3 | 6.0
32 |   9.1 |  1.6 | 5.6 |  18.2 |  3.1 | 5.8
64 |   5.9 |  1.0 | 5.9 |  10.5 |  2.0 | 5.2

KNL 7210 MCDRAM
 1 | 204.5 | 33.2 | 6.1 | 405.1 | 69.1 | 5.8
 2 | 115.0 | 17.8 | 6.4 | 230.9 | 37.4 | 6.1
 4 |  57.3 |  9.4 | 6.0 | 119.4 | 18.9 | 6.3
 8 |  28.9 |  5.3 | 5.4 |  61.6 |  9.8 | 6.2
16 |  16.0 |  2.8 | 5.7 |  31.2 |  5.3 | 5.8
32 |   8.9 |  1.6 | 5.5 |  17.9 |  3.0 | 5.9
64 |   5.7 |  1.0 | 5.7 |  10.3 |  1.9 | 5.4

KNL 7210 DDR
 1 | 190.7 | 31.9 | 5.9 | 377.1 | 66.2 | 5.6
 2 | 106.8 | 17.1 | 6.2 | 213.6 | 35.9 | 5.9
 4 |  53.1 |  9.1 | 5.8 | 111.2 | 18.3 | 6.0
 8 |  27.1 |  5.2 | 5.2 |  57.3 |  9.7 | 5.9
16 |  15.4 |  3.0 | 5.1 |  29.6 |  5.7 | 5.1
32 |   9.2 |  2.6 | 3.5 |  18.1 |  4.9 | 3.6
64 |   7.6 |  2.6 | 2.9 |  13.6 |  4.9 | 2.7

KNL 7210 DDR+MCDRAM
 1 | - | 30.2 | - | - | 62.5 | -
 2 | - | 17.5 | - | - | 36.4 | -
 4 | - |  9.1 | - | - | 18.7 | -
 8 | - |  5.3 | - | - |  9.6 | -
16 | - |  2.8 | - | - |  5.1 | -
32 | - |  1.6 | - | - |  3.0 | -
64 | - |  0.9 | - | - |  1.8 | -

Table 8: Comparison between time to solution update measured in seconds of the baseline and best optimised implementation on the Intel Xeon Phi Knights Landing 7210 processor and across different memory modes.


However, it is interesting to observe the improvements in single core performance between successive Intel Xeon Phi generations.

At full node concurrency, the MCDRAM in the Knights Landing architecture, as well as the higher core count and wide SIMD units, contribute to the best overall performance. Unsurprisingly, the second best performance was exhibited by the Skylake node, followed by Broadwell, Sandy Bridge and Knights Corner. As previously mentioned, the difference in performance between Skylake and Broadwell is a factor of two for kernels with high arithmetic intensities (iflux) due to AVX-512; however, in kernels that are memory bound, such as the viscous fluxes and linearised inviscid fluxes (vflux and diflux), the difference between the two is 50%. This correlates well with their respective memory bandwidth performance.

cell-based loops As cell-based loops are heavily bound by memory performance, the best performance at single core is obtained by the Skylake system due to its having the highest per-core bandwidth. At full node concurrency, best performance is obtained on the Knights Landing processor due to the superior bandwidth of MCDRAM. As mentioned earlier, the difference in performance between architectures for memory-bound kernels is significantly lower, even though some of the processors, such as the Sandy Bridge system, are up to five years older than the two-socket Skylake node. This further epitomises the issue of processor speeds increasing at a significantly higher rate than the performance of the memory system.

full application In terms of the whole solver, represented as average time per Newton-Jacobi iteration, the overall winner is the Knights Landing processor at full node concurrency, followed closely by the multicore Skylake node and the remaining multicore systems. The advantage of the Knights Landing and Skylake systems over the Broadwell node is approximately 50%, 3X compared to Sandy Bridge and more than 5X compared to the Knights Corner coprocessor.

5.6 conclusions

This chapter presented a number of optimisations for improving the performance of unstructured CFD applications on multicore and manycore processors.


[Figure: bar charts of GFLOPS for the iflux, vflux, diflux and dvflux kernels on mesh 1 and mesh 2, at single core and full node, across SNB E5-2650, BDW E5-2680, SKL Gold 6140, KNC 7120P and KNL 7210.]

Figure 39: Architectural comparison of the performance of face-based loops (flux computations) represented as GFLOPS at single core and full node concurrency and across all architectures (higher is better). The results presented for Knights Landing were obtained at full concurrency by interleaving the memory accesses between DDR and MCDRAM (i.e. DDR+MCDRAM).


[Figure: bar charts of GFLOPS (log scale) for the cell-based kernel on mesh 1 and mesh 2, at single core and full node, across SNB E5-2650, BDW E5-2680, SKL Gold 6140, KNC 7120P and KNL 7210.]

Figure 40: Architectural comparison of the performance of cell-based loops (update to primitive variables) represented as GFLOPS at single core and full node concurrency and across all architectures (higher is better). The results presented for Knights Landing were obtained at full concurrency by interleaving the memory accesses between DDR and MCDRAM (i.e. DDR+MCDRAM).

[Figure: bar charts of time per Newton-Jacobi iteration (seconds, log scale) on mesh 1 and mesh 2, at single core and full node, across SNB E5-2650, BDW E5-2680, SKL Gold 6140, KNC 7120P and KNL 7210.]

Figure 41: Architectural comparison of the performance of the whole application represented as time per Newton-Jacobi iteration at single core and full node concurrency and across all architectures (lower is better). The results presented for Knights Landing were obtained at full concurrency by interleaving the memory accesses between DDR and MCDRAM (i.e. DDR+MCDRAM).


Their implementation and impact were demonstrated in a code of representative size and complexity to an industrial application. Examples of optimisations included grid renumbering, vectorization, colouring, data layout transformations, in-register transpositions, software prefetching, loop tiling and multithreading.

Renumbering the mesh using the Reverse Cuthill-McKee algorithm led to a reduction in the distance between memory references in face-based loops by up to 53X, although this only resulted in actual speed-ups for such kernels of up to 10%.

Vectorization contributed the largest speed-ups, ranging between 2-5X and 2-3X on the multicore CPUs and the Intel Xeon Phi manycore processors for both face-based and cell-based loops, and 2-3X for the whole application across all architectures. Although vectorizing cell-based loops was relatively straightforward using the features of the OpenMP 4.0 API, the vectorization of face-based loops required that these kernels be rewritten and restructured due to their more complex access patterns.
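
As an illustration of the former, a cell-based update can typically be vectorized with directives alone, assuming padded and 64-byte aligned arrays; the variable names below are illustrative rather than taken from the solver:

    #pragma omp simd aligned(rho, rhou, rhoE, p : 64)
    for (int ic = 0; ic < ncells_padded; ic++) {
        /* recover pressure from the conserved variables (one-dimensional sketch) */
        p[ic] = (GAMMA - 1.0) * (rhoE[ic] - 0.5 * rhou[ic] * rhou[ic] / rho[ic]);
    }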

Colouring the faces enabled the vectorization of scatter operations in face-based loops. This resulted in relatively worse performance for these kernels on the multicore CPUs, mainly due to the increase in cache misses as the distances between the face end-points increased as well. Despite this, the usefulness of colouring was demonstrated further down the line when evaluated in conjunction with other optimisations.

The advantage of the AoS memory layout compared to SoA has been demonstrated in kernels that contain indirect and irregular access patterns, such as the computation of fluxes, due to its more efficient utilisation of the cache hierarchy. However, this was shown to have a detrimental effect on the performance of cell-based loops. To alleviate this, hand-tuned conversion primitives were presented based on compiler intrinsics which performed fast transpositions to and from the AoS and SoA layouts. These improved performance in face-based and cell-based kernels and were found to be particularly beneficial on the manycore processors and on architectures with wide SIMD lengths.

The utility of auto-tuning for finding parameters such as prefetch distances or tile sizes across all processors has been demonstrated, along with the large impact that this can have on performance. Moreover, it has been shown that software prefetching is highly recommended on architectures with large L2 caches and on the Knights Corner co-processor due to its in-order core design and lack of hardware prefetchers. Another optimisation that made use of auto-tuning is loop tiling, which improved the performance of the linear solver by blocking data in the L2 cache.

On the Intel Xeon Phi Knights Landing processor, the performance of all available memory modes was studied together with the introduction of a third option in which both MCDRAM and DDR are utilised. The best approach was found to be the latter, although by a small margin when compared to the other options that make use of the MCDRAM. The utilisation of only the DDR interface came last by a significant margin, which demonstrates that the exploitation of the high-bandwidth memory package is imperative for best performance.

The utilisation of multiple threads per MPI rank led to very good performance on the Knights Corner architecture, especially when combined with software prefetching. In contrast, this was not beneficial on the Knights Landing processor, where it led to worse performance since the KNL core is out-of-order and multiple threads were competing for the available core resources.

Finally, all of the presented techniques were implemented in a single code base, with abstractions for architecture-specific optimisations being based on traditional language constructs available in the C++ programming language and on compiler pre-processing machinery. As for their overall performance benefit, their implementation led to full application speed-ups ranging between 2.6-2.8X on the multicore CPUs and 5-8X on the manycore Xeon Phi processors at double precision and across two different mesh sizes of a realistic industrial test case.


6 conclusions

This chapter reviews the achievements of this thesis, discusses its limitations and outlines the most important aspects worth taking into account when mapping existing or new CFD applications onto the architectural features of modern processors. This is presented in the form of advice to CFD practitioners and is applicable to future processors following similar architectural paradigms. Finally, future research directions are discussed that could improve the work presented herein.

6.1 summary of findings

Chapter 2 presented the processor trends that have led to the advent of multicore and manycore processors. This was followed by a discussion of their architectural features and the importance of exploiting parallelism across multiple granularities in order to obtain high performance on them. The implications that these have on the performance of CFD codes were also reviewed, with the conclusion that current legacy applications used in industry and in research, although able to exploit coarse-grained parallelism across processor nodes via message passing, are not amenable to parallelism at finer granularities and therefore exhibit poor performance on modern processors unless their implementation is revisited. This state of affairs has led to the main motivation behind this work, based on evaluating and presenting the optimisations required for mapping CFD algorithms within the finite volume framework onto the architectural features of modern multicore and manycore processors.

To this end, chapter 4 presented a number of optimisations for structured mesh applications in a block-structured Euler code used for turbomachinery computations. The optimisations were evaluated across three distinct architectures, the Intel Xeon Sandy Bridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner coprocessor, and in two computational kernels with distinct access patterns: stencil operations arising from the computation of fluxes at cell interfaces, and the evaluation of cell attributes in kernels that perform updates to state vectors.

The practicalities of enabling efficient vectorization across both types of computational kernel were discussed, along with the trade-off between performance and portability, by evaluating a number of approaches such as compiler directives, compiler intrinsics and Agner Fog's C++ vector class library. For stencil operators, it was demonstrated that efficient vectorization relies upon aligned vector load and store operations, which can only be made possible through techniques such as inter-register shuffle operations or transpositions. In kernels with regular and streaming access patterns, such as cell-based kernels, efficient vectorization can be achieved by relying solely on compiler directives found in OpenMP 4.0 together with padding and aligned memory allocations. The importance of the underlying data layout was also discussed, and it was demonstrated that the hybrid AoSSoA format leads to superior performance compared to SoA in structured mesh applications, as it allows for more efficient vector load and store operations and utilises fewer memory resources. The utility of cache-blocking the separate sweeps when computing fluxes at the interfaces was also demonstrated, along with a number of domain decomposition approaches for extracting thread-level parallelism. The best version of the latter was based on decomposing each block into two-dimensional tiles and mimicking the behaviour of the message passing layer, with halo swaps performed explicitly via shared memory. Finally, the impact of each optimisation across the three architectures was appraised using the Roofline performance model. The difference in performance for the optimised stencil-based flux kernel at full node concurrency was 3X on the multicore architectures and 24X on the manycore Knights Corner coprocessor when running on 180 threads. For the cell-based state vector update kernel, which features a regular memory access pattern and is bound by memory bandwidth, the difference in performance between the optimised and reference implementations was approximately 2X on the multicore CPUs and 13X on the manycore Knights Corner platform.
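
For reference, a minimal sketch of the hybrid AoSSoA layout is given below; the block width VL, the variable names and the alignment are illustrative assumptions rather than the solver's actual declarations.

    // Minimal sketch of the hybrid AoSSoA layout: cells are grouped into
    // SIMD-width blocks, each block stored as a small SoA.
    constexpr int VL = 4;   // e.g. four doubles per AVX register (assumed)

    struct alignas(32) BlockSoA {
        double rho[VL], rhou[VL], rhov[VL], rhoE[VL], p[VL];
    };

    // cell i lives in block i / VL at lane i % VL
    inline double& rho_of(BlockSoA* cells, int i) {
        return cells[i / VL].rho[i % VL];
    }

Consecutive cells thus map to consecutive lanes of an aligned vector, which is what makes the aligned load and store operations possible.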

Chapter 5 presented optimisations and techniques for improving the performance of unstructured CFD applications on modern multicore and manycore processors. The optimisations were implemented in a code of representative size and complexity to an industrial application and demonstrated in two distinct classes of computational patterns that form the backbone of unstructured CFD codes: gather and scatter operations in face-based loops for the evaluation of fluxes, and regular and streaming access patterns present in cell-based loops for updating state vectors. It was demonstrated that vectorization of face-based loops is possible using compiler directives based on the OpenMP 4.0 API, as long as such kernels are rewritten to isolate the scatter operations that prevent vectorization, thus allowing both the gather and the computation stage to be vectorized. Full vectorization of all face-based loops was achieved through reordering algorithms which not only allowed for SIMD scatter operations by removing the dependencies at the face end-points, but also minimised the jump in the second indices. Prior to that, the grid was renumbered via the Reverse Cuthill-McKee algorithm, which reduced the distance between references in face-based loops and therefore allowed for a better exploitation of the memory system. The advantage of the AoS data layout over SoA was also demonstrated in kernels with irregular access patterns, owing to its more efficient utilisation of the cache hierarchy. Conversion between AoS and SoA data layouts for vector operations was performed using hand-tuned compiler intrinsics, along with ways in which architecture-specific implementations of such primitives can be abstracted away in the application code. Another important contribution in this chapter was the implementation of an auto-tuning strategy for finding optimal parameters such as prefetch distances or tile sizes across all of the evaluated platforms. This resulted in palpable speed-ups on the back of software prefetching as well as loop tiling, especially on newer architectures that integrate large L2 caches and on the in-order Knights Corner coprocessor. Other optimisations included the evaluation of the AoSSoA layout for face attributes, in order to reduce the number of memory streams and improve memory performance, and the implementation of multithreading as a separate layer of parallelism by extending the functionality of the reordering algorithms.
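
The shape of such a rewritten face-based loop can be sketched as follows; the flux expression is a placeholder and all names are illustrative, the point being the separation of the vectorizable gather/compute stage from the scatter stage.

    // Minimal sketch: isolate the scatter so the gather and the computation
    // can be vectorized with an OpenMP 4.0 directive; illustrative only.
    void face_loop(const int* left, const int* right, const double* q,
                   double* res, double* flux, int nfaces)
    {
        #pragma omp simd
        for (int f = 0; f < nfaces; ++f) {     // gather + compute: vectorized
            double ql = q[left[f]];
            double qr = q[right[f]];
            flux[f] = 0.5 * (ql + qr);         // placeholder flux formula
        }
        for (int f = 0; f < nfaces; ++f) {     // scatter: kept scalar here
            res[left[f]]  -= flux[f];
            res[right[f]] += flux[f];
        }
    }

With the reordering algorithms in place, the second loop can itself proceed in SIMD-sized, dependency-free groups.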

On the Knights Landing architecture, the performance of the available memory modes was assessed, together with a hybrid approach that exploited both the DDR4 and the MCDRAM interfaces. Finally, all optimisations were correlated with the Roofline model and led to full application speed-ups ranging between 2.6X and 2.8X on the multicore CPUs and 5-8X on the manycore Xeon Phi processors, at double precision and across two different computational domain sizes.


6.2 advice for cfd practitioners

The findings in this thesis can be used to construct a list of recommendations and procedures aimed at CFD practitioners and developers who wish to map their new or existing application onto the architectural features of current and future multicore and manycore processors. These are described in more detail below.

vectorization As seen from the results for both structured and unstructured applications, vectorization is the most important optimisation on modern processors. This is likely to remain the case for future architectures, which are expected to increase the underlying vector register width further and continue to improve the SIMD ISA.

As a result, developers should make sure that their application is flexible enough that data parallelism can be exploited across different vector lengths. This is important for ensuring best performance across a variety of multicore and manycore processors. Another important consideration for making good use of the vector units is how to load data into them efficiently. As seen in Chapters 4 and 5, this differs depending on the underlying access pattern, i.e. stencil operators versus gather and scatter operations. Consequently, although best performance is achieved through architecture-specific optimisations that fully harness the capability of each platform, i.e. AVX-512 or IMCI, abstractions over these implementations are necessary in order to ensure portability. Furthermore, applications such as unstructured mesh solvers also require colouring and reordering algorithms so that efficient vector accumulate and store operations can be utilised. A sketch of one possible vector-length abstraction is given below.
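
The following is a minimal sketch, assuming the vector width is fixed at compile time from the instruction-set macros; SIMD_WIDTH is an illustrative name for such a knob, not an identifier from the code base.

    // Minimal sketch of a compile-time vector-length abstraction so kernels
    // can be strip-mined for any register width; illustrative only.
    #if defined(__AVX512F__)
    #define SIMD_WIDTH 8        // doubles per 512-bit register
    #elif defined(__AVX__)
    #define SIMD_WIDTH 4        // doubles per 256-bit register
    #else
    #define SIMD_WIDTH 2        // 128-bit SSE fallback
    #endif

    void update(double* __restrict u, const double* __restrict r, int n)
    {
        #pragma omp simd simdlen(SIMD_WIDTH) aligned(u, r : 64)
        for (int i = 0; i < n; ++i)
            u[i] += r[i];
    }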

data layout Another important optimisation with respect to CFD applications is the format used for storing cell-centred and face variables. For structured mesh codes, SoA or the hybrid AoSSoA are the best choice, as demonstrated in Chapter 4, since these translate to efficient SIMD load and store operations given the regular and streaming access patterns of structured solvers. However, for unstructured mesh solvers, the AoS format is in fact more efficient for cell-centred data structures, as it leads to a reduction in cache misses in kernels with gather and scatter operations, as seen in Chapter 5. With regard to face attributes, SoA remains the preferred choice in unstructured grid codes since faces are evaluated in consecutive order.

Changing the underlying data layout of an application is an arduous and error-prone task, especially in a code of representative size. Consequently, a key piece of advice to CFD developers is to provide an abstraction layer in the application for the storage format of such data structures. This would allow the layout to be changed at compile time depending on both the underlying architecture and the computational patterns, as sketched below.
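
A minimal sketch of such an abstraction follows, assuming a policy type that maps a (cell, variable) pair to a flat index; NVAR and the policy names are illustrative.

    // Minimal sketch of a compile-time layout abstraction: kernels index
    // through a policy type, so swapping AoS and SoA is a one-line change.
    #include <cstddef>

    constexpr int NVAR = 5;   // number of flow variables (assumed)

    struct AoS {
        static std::size_t at(std::size_t i, int v, std::size_t)
        { return i * NVAR + v; }
    };
    struct SoA {
        static std::size_t at(std::size_t i, int v, std::size_t n)
        { return std::size_t(v) * n + i; }
    };

    template <class Layout>
    void scale(double* q, std::size_t ncells, double s)
    {
        for (std::size_t i = 0; i < ncells; ++i)
            for (int v = 0; v < NVAR; ++v)
                q[Layout::at(i, v, ncells)] *= s;   // layout chosen at build time
    }
    // usage: scale<AoS>(q, n, 0.5);  or  scale<SoA>(q, n, 0.5);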

auto-tuning Although the majority of processor architectures in high-performance computing adhere to an abstract processor model, as seen in Chapter 2, they all exhibit particularities that are important to consider for best performance. Examples include the size and latency of caches, the length of vector registers, the core design (in-order or out-of-order, single-threaded or n-way multithreaded), and memory or cluster modes. Consequently, for optimisations that rely on these attributes in their implementation, performance portability can only be achieved by means of auto-tuning. This was demonstrated in Chapter 5, where an auto-tuning phase was used to select the best distance parameter for software prefetching across a wide range of architectures, as well as the appropriate tile size for the implementation of loop tiling.

As a result, applications should be modified or developed to allow "magic" values such as tile size or prefetch distance to be chosen automatically at an initial auto-tuning stage every time the application is executed on a new architecture, as sketched below.
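
What such an auto-tuning stage could look like is sketched below; the candidate values and the kernel_with_tile stand-in are illustrative assumptions.

    // Minimal sketch of an auto-tuning pass run once per machine: time the
    // kernel for each candidate "magic" value and keep the fastest.
    #include <chrono>
    #include <vector>

    int tune_tile(void (*kernel_with_tile)(int))
    {
        std::vector<int> candidates = {16, 32, 64, 128, 256};
        int best = candidates[0];
        double best_t = 1e30;
        for (int t : candidates) {
            auto t0 = std::chrono::steady_clock::now();
            kernel_with_tile(t);                       // one timed trial run
            std::chrono::duration<double> dt =
                std::chrono::steady_clock::now() - t0;
            if (dt.count() < best_t) { best_t = dt.count(); best = t; }
        }
        return best;   // persist to disk so later runs can skip the sweep
    }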

multithreading Running more than one thread per physical core is not always useful on multicore processors, since a single thread can fully utilise the execution pipeline and available resources. However, for in-order architectures such as the Intel Xeon Phi Knights Corner or GPGPUs, running more than one thread per core can help hide memory latencies and avoid pipeline stalls, as seen in both Chapters 4 and 5.

Consequently, the advice for CFD developers is to consider a separate level of parallelism in their application for multithreading, one that can be enabled or disabled depending on the underlying architecture. This was demonstrated in Chapter 5 with the help of colouring algorithms, where faces with dependency-free end-points were processed by each thread in groups equal in size to the width of the underlying vector register and in a round-robin fashion. A sketch of the colour-group traversal is given below.
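
In the minimal sketch that follows, colour_start and the residual update are illustrative, and the round-robin SIMD-sized grouping is omitted for brevity.

    // Minimal sketch of threading over pre-computed colour groups: faces
    // inside one colour share no end-points, so each group is safe to run
    // in parallel; illustrative only.
    void coloured_scatter(const int* colour_start, int ncolours,
                          const int* left, const int* right,
                          const double* flux, double* res)
    {
        for (int c = 0; c < ncolours; ++c) {          // colours stay sequential
            #pragma omp parallel for
            for (int f = colour_start[c]; f < colour_start[c + 1]; ++f) {
                res[left[f]]  -= flux[f];             // safe: no clashes within c
                res[right[f]] += flux[f];
            }
        }
    }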

6.3 limitations

In chapter 4, one limitation of the presented approach is the lack of loop tiling. In structured codes, temporal blocking or skewed tiling is significantly easier to achieve than in their unstructured counterparts and could have had a sizeable impact on performance. This could have been implemented by hoisting the entire solver under the same iteration space, which would proceed in steps equal to, or a multiple of, the underlying SIMD size, thereby exploiting data parallelism as well as minimising traffic to main memory. Another limitation in chapter 4 is the implementation of Henretty's [37] dimension-lifted transposition for aligned vector loads and stores in stencil operations. In this work, the transposition operations were only performed in one dimension, as the fluxes were computed in two distinct passes in the i and j directions. This required the data to be re-transposed after the i and before the j sweep. However, in Henretty's original work, the transposition can be performed more efficiently, and only once, by fusing both sweeps.

The work in chapter 5 could have been improved with respect to the hybrid MPI/OpenMP implementation. In such an implementation, the domain owned by each MPI rank is further decomposed across existing threads, thereby mimicking the execution model of the message passing layer while exploiting the superior latency of shared memory. Such approaches were first presented by Aubry et al. [10], and it is believed that as the number of cores per node continues to increase, a hybrid approach of one rank per socket, with threads pinned to the remaining physical cores and communicating via shared memory, would lead to even better performance and scalability. A minimal sketch of this execution model is given below.
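
In the sketch, the thread-level tiling and the halo exchange are only indicated by comments, and MPI_THREAD_FUNNELED is an assumption about the thread-support level such a scheme would request.

    // Minimal sketch of the one-rank-per-socket hybrid model: MPI initialised
    // with thread support, each rank spawning threads over its sub-domain.
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char** argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        #pragma omp parallel
        {
            // each thread works on a tile of the rank-local domain here
        }
        // halo swaps issued by the master thread between iterations
        MPI_Finalize();
        return 0;
    }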

6.4 future work

The most natural extension of this work would be to validate the claim that the optimisations presented herein for both structured and unstructured solvers are also applicable to more specialised manycore processors such as NVIDIA GPUs. A good avenue for testing this would be a compiler directive approach such as OpenACC or OpenMP 4.0/4.5 in the first instance. This would allow a fair comparison of the performance that can be extracted across all architectures whilst maintaining the same code structure. Following from this, hand-tuned kernels could be implemented in OpenCL or CUDA, with results correlated via performance models.
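
As a first step, such a directive-based port could look like the sketch below; this is an assumption about the form an OpenMP 4.5 offload of a cell-update loop might take, not a validated implementation.

    // Minimal sketch of directive-based offload with OpenMP 4.5 target
    // constructs; the map clauses move the arrays to and from the device.
    void update_cells(double* u, const double* r, int n)
    {
        #pragma omp target teams distribute parallel for \
            map(tofrom: u[0:n]) map(to: r[0:n])
        for (int i = 0; i < n; ++i)
            u[i] += r[i];
    }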


B I B L I O G R A P H Y

[1] ACM Gordon Bell Prize. http://awards.acm.org/bell. Accessed: 31-08-2017.

[2] ARCHER. Knights Landing Testing & Development Platform. http://www.archer.ac.uk/documentation/knl-guide/. Accessed: 01-08-2017.

[3] ARK. Intel Xeon Platinum 8180 Processor. https://ark.intel.com/products/120496/Intel-Xeon-Platinum-8180-Processor-38_5M-Cache-2_50-GHz. Accessed: 08-01-2018.

[4] ARM. ARM Developer. https://developer.arm.com/. Accessed: 09-01-2018.

[5] Sam Ainsworth and Timothy M. Jones. “Software Prefetching for Indirect Memory Accesses”. In: Proceedings of the 2017 International Symposium on Code Generation and Optimization. CGO ’17. Austin, USA: IEEE Press, 2017, pp. 305–317. isbn: 978-1-5090-4931-8.

[6] G. D. van Albada, B. van Leer and W. W. Roberts Jr. “A Comparative Study of Computational Methods in Cosmic Gas Dynamics”. English. In: Upwind and High-Resolution Schemes. Ed. by M. Yousuff Hussaini, Bram van Leer and John van Rosendale. Springer Berlin Heidelberg, 1997, pp. 95–103. isbn: 978-3-642-64452-8.

[7] Gene M. Amdahl. “Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities”. In: Proceedings of the April 18-20, 1967, Spring Joint Computer Conference. AFIPS ’67 (Spring). Atlantic City, New Jersey: ACM, 1967, pp. 483–485. doi: 10.1145/1465482.1465560.

[8] W. K. Anderson, W. D. Gropp, D. K. Kaushik, D. E. Keyes and B. F. Smith. “Achieving High Sustained Performance in an Unstructured Mesh CFD Application”. In: Proceedings of the 1999 ACM/IEEE Conference on Supercomputing. SC ’99. Portland, Oregon, USA: ACM, 1999. isbn: 1-58113-091-0. doi: 10.1145/331532.331600.


[9] K. Asanovic et al. The Landscape of Parallel Computing Research: A View from Berkeley. Tech. rep. UCB/EECS-2006-183. Electrical Engineering and Computer Sciences, University of California at Berkeley.

[10] R. Aubry, G. Houzeaux, M. Vázquez and J. M. Cela. “Some useful strategies for unstructured edge-based solvers on shared memory machines”. In: International Journal for Numerical Methods in Engineering 85.5 (2011), pp. 537–561. issn: 1097-0207. doi: 10.1002/nme.2973.

[11] David H. Bailey, Robert F. Lucas and Samuel W. Williams. Performance Tuning of Scientific Applications. New York, United States: CRC Press, 2011. isbn: 9781439815694.

[12] Shekhar Borkar. “Thousand Core Chips: A Technology Perspective”. In: Proceedings of the 44th Annual Design Automation Conference. DAC ’07. San Diego, California: ACM, 2007, pp. 746–749. isbn: 978-1-59593-627-1. doi: 10.1145/1278480.1278667.

[13] Shekhar Borkar and Andrew A Chien. “The future of microprocessors”. In: Communications of the ACM 54.5 (2011), pp. 67–77.

[14] T. Brandvik and G. Pullan. “SBLOCK: A Framework for Efficient Stencil-Based PDE Solvers on Multi-core Platforms”. In: 2010 10th IEEE International Conference on Computer and Information Technology. 2010, pp. 1181–1188. doi: 10.1109/CIT.2010.214.

[15] D. A. Burgess and M. B. Giles. “Renumbering Unstructured Grids to Improve the Performance of Codes on Hierarchical Memory Machines”. In: Adv. Eng. Softw. 28.3 (Apr. 1997), pp. 189–201. issn: 0965-9978. doi: 10.1016/S0965-9978(96)00039-7.

[16] C++ Vector Class Library. http://www.agner.org/optimize/#vectorclass. Accessed: 10-05-2015.

[17] Mauro Carnevale, Jeffrey S Green and Luca Di Mare. “Numerical studies into intake flow for fan forcing assessment”. In: Proceedings of ASME Turbo Expo 2014: Turbine Technical Conference and Exposition. 2014, pp. 16–20.

[18] Mauro Carnevale, Feng Wang and Luca di Mare. “Low Frequency Distortion in Civil Aero-engine Intake”. In: Journal of Engineering for Gas Turbines and Power 139.4 (2017), p. 041203.


[19] Dan Curran, Christian B Allen, Simon McIntosh-Smith and David Beckingsale. “Towards Portability For A Compressible Finite-Volume CFD Code”. In: 54th AIAA Aerospace Sciences Meeting. 2016, p. 1813.

[20] E. Cuthill and J. McKee. “Reducing the Bandwidth of Sparse Symmetric Matrices”. In: Proceedings of the 1969 24th National Conference. ACM ’69. New York, NY, USA: ACM, 1969, pp. 157–172.

[21] Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf and Katherine Yelick. “Stencil Computation Optimization and Auto-tuning on State-of-the-art Multicore Architectures”. In: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. SC ’08. Austin, Texas: IEEE Press, 2008, 4:1–4:12. isbn: 978-1-4244-2835-9.

[22] Zachary DeVito et al. “Liszt: A Domain Specific Language for Building Portable Mesh-based PDE Solvers”. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. SC ’11. Seattle, Washington: ACM, 2011, 9:1–9:12. isbn: 978-1-4503-0771-0. doi: 10.1145/2063384.2063396.

[23] R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bassous and A. R. LeBlanc. “Design of ion-implanted MOSFET’s with very small physical dimensions”. In: IEEE Journal of Solid-State Circuits 9.5 (1974), pp. 256–268. issn: 0018-9200. doi: 10.1109/JSSC.1974.1050511.

[24] Luca Di Mare, Davendu Y Kulkarni, Feng Wang, Artyom Romanov, Pandia R Ramar and Zacharias I Zachariadis. “Virtual gas turbines: Geometry and conceptual description”. In: Proceedings of ASME Turbo Expo, Vancouver, Canada (2011).

[25] Austen C. Duffy, Dana P. Hammond and Eric J. Nielsen. Production Level CFD Code Acceleration for Hybrid Many-Core Architectures. Tech. rep. NASA/TM-2012-217770. National Aeronautics and Space Administration.

[26] Thomas D. Economon, Dheevatsa Mudigere, Gaurav Bansal, Alexander Heinecke, Francisco Palacios, Jongsoo Park, Mikhail Smelyanskiy, Juan J. Alonso and Pradeep Dubey. “Performance optimizations for scalable implicit RANS calculations with SU2”. In: Computers & Fluids 129 (2016), pp. 146–158. issn: 0045-7930. doi: 10.1016/j.compfluid.2016.02.003.

[27] L. Eeckhout. “Is Moore’s Law Slowing Down? What’s Next?” In: IEEE Micro 37.4 (2017), pp. 4–5. issn: 0272-1732. doi: 10.1109/MM.2017.3211123.

[28] FUN3D. https://fun3d.larc.nasa.gov/. Accessed: 31-08-2017.

[29] Mohammed A. Al Farhan, Dinesh K. Kaushik and David E. Keyes. “Unstructured computational aerodynamics on many integrated core architecture”. In: Parallel Computing 59 (2016). Theory and Practice of Irregular Applications, pp. 97–118. issn: 0167-8191. doi: 10.1016/j.parco.2016.06.001.

[30] Pawel Gepner, Victor Gamayunov and David L. Fraser. “Early performance evaluation of AVX for HPC”. In: Procedia Computer Science 4.0 (2011). Proceedings of the International Conference on Computational Science, ICCS 2011, pp. 452–460. issn: 1877-0509.

[31] M. B. Giles and I. Reguly. “Trends in high-performance computing for engineering calculations”. In: Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 372.2022 (2014). issn: 1364-503X. doi: 10.1098/rsta.2013.0319.

[32] M. B. Giles, G. R. Mudalige, Z. Sharif, G. Markall and P. H. J. Kelly. “Performance Analysis of the OP2 Framework on Many-core Architectures”. In: SIGMETRICS Perform. Eval. Rev. 38.4 (Mar. 2011), pp. 9–15. issn: 0163-5999. doi: 10.1145/1964218.1964221.

[33] M. B. Giles, G. R. Mudalige, C. Bertolli, P. H. J. Kelly, E. László and I. Reguly. “An Analytical Study of Loop Tiling for a Large-Scale Unstructured Mesh Application”. In: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis. 2012, pp. 477–482. doi: 10.1109/SC.Companion.2012.68.

[34] F. Grasso and C. Meola. Handbook of Computational Fluid Mechanics. London: Academic Press, 1996, pp. 159–282.

[35] William D Gropp, Dinesh K Kaushik, David E Keyes and Barry F Smith. “High-performance parallel implicit CFD”. In: Parallel Computing 27.4 (2001). Parallel computing in aerospace, pp. 337–362. issn: 0167-8191. doi: 10.1016/S0167-8191(00)00075-2.


[36] Amiram Harten, Peter D. Lax and Bram van Leer. “On Upstream Differencing and Godunov-Type Schemes for Hyperbolic Conservation Laws”. In: SIAM Review 25.1 (1983), pp. 35–61.

[37] Tom Henretty, Kevin Stock, Louis-Noël Pouchet, Franz Franchetti, J. Ramanujam and P. Sadayappan. “Data Layout Transformation for Stencil Computations on Short-vector SIMD Architectures”. In: Proceedings of the 20th International Conference on Compiler Construction: Part of the Joint European Conferences on Theory and Practice of Software. CC’11/ETAPS’11. Berlin, Heidelberg: Springer-Verlag, 2011, pp. 225–245. isbn: 978-3-642-19860-1.

[38] M. D. Hill and M. R. Marty. “Amdahl’s Law in the Multicore Era”. In: Computer 41.7 (2008), pp. 33–38. issn: 0018-9162. doi: 10.1109/MC.2008.209.

[39] Charles Hirsch. Numerical Computation of Internal and External Flows. Chichester, West Sussex, UK: John Wiley and Sons, 1990.

[40] Intel Corporation. Intel® 64 and IA-32 Architectures Optimization Reference Manual. 248966-037. 2017.

[41] Intel Xeon Scalable Processors. https://newsroom.intel.com/press-kits/next-generation-xeon-processor-family/. Accessed: 18-07-2017.

[42] A. Jameson. “Time dependent calculations using multigrid, with applications to unsteady flows past airfoils and wings”. In: AIAA paper 1596 (1991).

[43] Jim Jeffers and James Reinders. Intel Xeon Phi Coprocessor High Performance Programming. Boston, United States: Morgan Kaufmann, 2013. isbn: 978-0-12-410414-3.

[44] Jim Jeffers, James Reinders and Avinash Sodani. Intel Xeon Phi Processor High Performance Programming. Morgan Kaufmann, 2016. isbn: 978-0-12-809194-4.

[45] Jaekyu Lee, Hyesoon Kim and Richard Vuduc. “When Prefetching Works, When It Doesn’t, and Why”. In: ACM Trans. Archit. Code Optim. 9.1 (Mar. 2012), 2:1–2:29. issn: 1544-3566. doi: 10.1145/2133382.2133384.


[46] Yunsup Lee, Rimas Avizienis, Alex Bishara, Richard Xia, Derek Lockhart, Christopher Batten and Krste Asanovic. “Exploring the Tradeoffs Between Programmability and Efficiency in Data-parallel Accelerators”. In: SIGARCH Comput. Archit. News 39.3 (June 2011), pp. 129–140. issn: 0163-5964. doi: 10.1145/2024723.2000080.

[47] Bram van Leer. “Towards the ultimate conservative difference scheme. V. A second-order sequel to Godunov’s method”. In: Journal of Computational Physics 32.1 (1979), pp. 101–136. issn: 0021-9991. doi: 10.1016/0021-9991(79)90145-1.

[48] Anders Logg, Kent-Andre Mardal and Garth Wells. Automated Solution of Differential Equations by the Finite Element Method: The FEniCS Book. Springer Publishing Company, Incorporated, 2012. isbn: 3642230989.

[49] Rainald Löhner. “Cache-efficient renumbering for vectorization”. In: International Journal for Numerical Methods in Biomedical Engineering 26.5 (2010), pp. 628–636. issn: 2040-7947. doi: 10.1002/cnm.1160.

[50] MIT. MISES. http://web.mit.edu/drela/Public/web/mises/. Accessed: 13-05-2015. 2009.

[51] D.J. Mavriplis. Revisiting the Least-squares Procedure for Gradient Reconstruction on Unstructured Meshes. Tech. rep. NASA/CR-2003-212683. National Aeronautics and Space Administration.

[52] Dimitri J Mavriplis. “Unstructured-mesh discretizations and solvers for computational aerodynamics”. In: AIAA Journal 46.6 (2008), pp. 1281–1298.

[53] John D. McCalpin. STREAM: Sustainable Memory Bandwidth in High Performance Computers. Tech. rep. A continually updated technical report. http://www.cs.virginia.edu/stream/. Charlottesville, Virginia: University of Virginia, 1991-2007.

[54] John D. McCalpin. “Memory Bandwidth and Machine Balance in Current High Performance Computers”. In: IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter (Dec. 1995), pp. 19–25.


[55] Simon McIntosh-Smith, Michael Boulton, Dan Curran and James Price. “On the performance portability of structured grid codes on many-core computer architectures”. In: International Supercomputing Conference. Springer. 2014, pp. 53–75.

[56] Gordon E. Moore. “Readings in Computer Architecture”. In: ed. by Mark D. Hill, Norman P. Jouppi and Gurindar S. Sohi. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2000. Chap. Cramming More Components Onto Integrated Circuits, pp. 56–59. isbn: 1-55860-539-8.

[57] S. K. Moore. “Multicore is bad news for supercomputers”. In: IEEE Spectrum 45.11 (2008), pp. 15–15. issn: 0018-9235. doi: 10.1109/MSPEC.2008.4659375.

[58] Todd C. Mowry, Monica S. Lam and Anoop Gupta. “Design and Evaluation of a Compiler Algorithm for Prefetching”. In: SIGPLAN Not. 27.9 (Sept. 1992), pp. 62–73. issn: 0362-1340. doi: 10.1145/143371.143488.

[59] G. R. Mudalige, M. B. Giles, I. Reguly, C. Bertolli and P. H. J. Kelly. “OP2: An active library framework for solving unstructured mesh-based applications on multi-core and many-core architectures”. In: 2012 Innovative Parallel Computing (InPar). 2012, pp. 1–12. doi: 10.1109/InPar.2012.6339594.

[60] D. Mudigere, S. Sridharan, A. Deshpande, J. Park, A. Heinecke, M. Smelyanskiy, B. Kaul, P. Dubey, D. Kaushik and D. Keyes. “Exploring Shared-Memory Optimizations for an Unstructured Mesh CFD Application on Modern Parallel Systems”. In: 2015 IEEE International Parallel and Distributed Processing Symposium. 2015, pp. 723–732. doi: 10.1109/IPDPS.2015.114.

[61] J. P. Murphy and D. G. MacManus. “Ground vortex aerodynamics under crosswind conditions”. In: Experiments in Fluids 50.1 (2011), pp. 109–124. issn: 1432-1114. doi: 10.1007/s00348-010-0902-4.

[62] G. Ofenbeck, R. Steinmann, V. Caparros, D.G. Spampinato and M. Puschel. “Applying the roofline model”. In: Performance Analysis of Systems and Software (ISPASS), 2014 IEEE International Symposium on. 2014, pp. 76–85. doi: 10.1109/ISPASS.2014.6844463.

[63] OpenCL. OpenCL Overview. https://www.khronos.org/opencl/. Accessed: 10-01-2018.


[64] OpenMP 4.0 Specifications. http://www.openmp.org/wp-content/uploads/OpenMP4.0.0.pdf. Accessed: 15-05-2016.

[65] David A Patterson and John L Hennessy. Computer organization and design: the hardware/software interface. Newnes, 2013.

[66] David Patterson. “The Trouble With Multi-core”. In: IEEE Spectr. 47.7 (July 2010), pp. 28–32. issn: 0018-9235. doi: 10.1109/MSPEC.2010.5491011.

[67] Simon J. Pennycook, Chris J. Hughes, M. Smelyanskiy and S. A. Jarvis. “Exploring SIMD for Molecular Dynamics, Using Intel® Xeon® Processors and Intel® Xeon Phi Coprocessors”. In: Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing. IPDPS ’13. Washington, DC, USA: IEEE Computer Society, 2013, pp. 1085–1097. isbn: 978-0-7695-4971-2. doi: 10.1109/IPDPS.2013.44.

[68] T. Piazza, H. Jiang, P. Hammarlund and R. Singhal. Technology Insight: Intel(R) Next Generation Microarchitecture Code Name Haswell. Tech. rep. Intel Corporation, 2012.

[69] Florian Rathgeber, David A. Ham, Lawrence Mitchell, Michael Lange, Fabio Luporini, Andrew T. T. Mcrae, Gheorghe-Teodor Bercea, Graham R. Markall and Paul H. J. Kelly. “Firedrake: Automating the Finite Element Method by Composing Abstractions”. In: ACM Trans. Math. Softw. 43.3 (Dec. 2016), 24:1–24:27. issn: 0098-3500. doi: 10.1145/2998441.

[70] I. Z. Reguly, G. R. Mudalige, M. B. Giles, D. Curran and S. McIntosh-Smith. “The OPS Domain Specific Abstraction for Multi-block Structured Grid Computations”. In: 2014 Fourth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing. 2014, pp. 58–67. doi: 10.1109/WOLFHPC.2014.7.

[71] I. Z. Reguly, G. R. Mudalige, C. Bertolli, M. B. Giles, A. Betts, P. H. J. Kelly and D. Radford. “Acceleration of a Full-Scale Industrial CFD Application with OP2”. In: IEEE Transactions on Parallel and Distributed Systems 27.5 (2016), pp. 1265–1278. issn: 1045-9219. doi: 10.1109/TPDS.2015.2453972.

[72] I. Z. Reguly, Endre László, Gihan R. Mudalige and Mike B. Giles. “Vectorizing unstructured mesh computations for many-core architectures”. In: Concurrency and Computation: Practice and Experience 28.2 (2016), pp. 557–577. issn: 1532-0634. doi: 10.1002/cpe.3621.


[73] Philip L Roe. “Approximate Riemann solvers, parameter vectors, and difference schemes”. In: Journal of Computational Physics 43.2 (1981), pp. 357–372.

[74] C. Rosales. “Porting to the Intel Xeon Phi: Opportunities and Challenges”. In: Extreme Scaling Workshop (XSW), 2013. 2013, pp. 1–7. doi: 10.1109/XSW.2013.5.

[75] P. E. Ross. “Why CPU Frequency Stalled”. In: IEEE Spectr. 45.4 (Apr. 2008), pp. 72–72. issn: 0018-9235. doi: 10.1109/MSPEC.2008.4476447.

[76] Scott Rostrup and Hans De Sterck. “Parallel hyperbolic PDE simulation on clusters: Cell versus GPU”. In: Computer Physics Communications 181.12 (2010), pp. 2164–2179. issn: 0010-4655. doi: 10.1016/j.cpc.2010.07.049.

[77] Karl Rupp. CPU, GPU and MIC Hardware Characteristics over Time. https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/. Accessed: 28-12-2017.

[78] Richard M. Russell. “The CRAY-1 Computer System”. In: Commun. ACM 21.1 (Jan. 1978), pp. 63–72. issn: 0001-0782. doi: 10.1145/359327.359336.

[79] SU2, the open-source CFD code. http://su2.stanford.edu. Accessed: 31-08-2017.

[80] Nadathur Satish, Changkyu Kim, Jatin Chhugani, Hideki Saito, Rakesh Krishnaiyer, Mikhail Smelyanskiy, Milind Girkar and Pradeep Dubey. “Can Traditional Programming Bridge the Ninja Performance Gap for Parallel Computing Applications?” In: Proceedings of the 39th Annual International Symposium on Computer Architecture. ISCA ’12. Portland, Oregon: IEEE Computer Society, 2012, pp. 440–451. isbn: 978-1-4503-1642-2.

[81] J. M. Shalf and R. Leland. “Computing beyond Moore’s Law”. In: Computer 48.12 (2015), pp. 14–23. issn: 0018-9162. doi: 10.1109/MC.2015.374.

[82] Anand Lal Shimpi. Intel’s Haswell Architecture Analyzed: Building a new PC and a new Intel. https://www.anandtech.com/show/6355/intels-haswell-architecture. Accessed: 12-07-2015.

[83] A. Sodani, R. Gramunt, J. Corbal, H. S. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal and Y. C. Liu. “Knights Landing: Second-Generation Intel Xeon Phi Product”. In: IEEE Micro 36.2 (2016), pp. 34–46. issn: 0272-1732. doi: 10.1109/MM.2016.25.


[84] Milan Stanic. “Design of Energy-Efficient Vector Units for In-order Cores”. PhD thesis. Universitat Politècnica de Catalunya, 2016.

[85] Herb Sutter and James Larus. “Software and the Concurrency Revolution”. In: Queue 3.7 (Sept. 2005), pp. 54–62. issn: 1542-7730. doi: 10.1145/1095408.1095421.

[86] VCLKNC. https://bitbucket.org/veclibknc/vclknc. Accessed: 10-08-2015.

[87] Michael Vavra. Aero-Thermodynamics and Flow in Turbomachines. Los Alamitos, CA, USA: John Wiley, 1960, pp. 139–145. isbn: 0-8186-0819-6.

[88] Feng Wang, Mauro Carnevale, Gan Lu, Luca di Mare and Davendu Kulkarni. “Virtual Gas Turbine: Pre-Processing and Numerical Simulations”. In: ASME Turbo Expo 2016: Turbomachinery Technical Conference and Exposition. American Society of Mechanical Engineers. 2016, V001T01A009.

[89] D.C. Wilcox. “Reassessment of the scale-determining equation for advanced turbulence models”. In: AIAA Journal 26 (Nov. 1988), pp. 1299–1310. doi: 10.2514/3.10041.

[90] Samuel W. Williams. “Auto-tuning performance on multicore computers”. PhD thesis. University of California at Berkeley, 2008.

[91] Samuel Williams, Andrew Waterman and David Patterson. “Roofline: An Insightful Visual Performance Model for Multicore Architectures”. In: Commun. ACM 52.4 (Apr. 2009), pp. 65–76. issn: 0001-0782. doi: 10.1145/1498765.1498785.

[92] Samuel Williams, Leonid Oliker, Jonathan Carter and John Shalf. “Extracting Ultra-scale Lattice Boltzmann Performance via Hierarchical and Distributed Auto-tuning”. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. SC ’11. Seattle, Washington: ACM, 2011, 55:1–55:12. isbn: 978-1-4503-0771-0. doi: 10.1145/2063384.2063458.

[93] Wm. A. Wulf and Sally A. McKee. “Hitting the Memory Wall: Implications of the Obvious”. In: SIGARCH Comput. Archit. News 23.1 (Mar. 1995), pp. 20–24. issn: 0163-5964. doi: 10.1145/216585.216588.


A P P E N D I X


A A P P E N D I X

a.1 results of auto-tuning prefetch distances

                                  Speedup
L1i   L2i   L1d   L2d     iflux   vflux   diflux   dvflux   newton-jacobi
4     -     2     -       1.10    0.99    0.99     0.92     0.99
4     8     2     4       1.04    1.05    0.95     0.97     0.99
4     16    2     8       1.09    1.07    0.93     0.96     0.97
4     32    2     16      1.11    1.08    0.92     0.96     0.97
8     -     4     -       1.10    1.05    0.96     0.94     0.99
8     16    4     8       1.01    1.04    0.93     0.93     0.95
8     32    4     16      1.02    1.04    0.92     0.93     0.95
8     64    4     32      1.02    1.06    0.92     0.93     0.96
16    -     8     -       1.11    1.06    0.96     0.94     0.99
16    32    8     16      1.01    1.04    0.92     0.92     0.94
16    64    8     32      1.01    1.05    0.93     0.99     0.96
16    128   8     64      1.00    1.03    0.91     0.98     0.95
32    -     16    -       1.11    1.08    0.95     0.92     0.98
32    64    16    32      1.00    1.04    0.91     0.98     0.96
32    128   16    64      0.99    1.02    0.91     0.97     0.95
32    256   16    128     0.98    1.00    0.89     0.95     0.94
64    -     32    -       1.11    1.08    0.94     0.94     0.99
64    128   32    64      0.98    1.01    0.90     0.88     0.92
64    256   32    128     0.97    0.99    0.89     0.86     0.91
64    512   32    256     0.94    0.94    0.85     0.84     0.89

Table 9: Speed-up of flux computation kernels and whole solver (newton-jacobi) depending on prefetch distance on the Sandy Bridge E5-2650 CPU.


                                  Speedup
L1i   L2i   L1d   L2d     iflux   vflux   diflux   dvflux   newton-jacobi
4     -     2     -       0.95    0.93    0.92     0.92     0.95
4     8     2     4       0.97    0.91    0.91     0.83     0.92
4     16    2     8       0.96    0.90    0.95     0.82     0.93
4     32    2     16      0.95    0.89    0.92     0.81     0.91
8     -     4     -       0.96    1.00    0.97     0.86     0.95
8     16    4     8       0.94    0.87    0.91     0.80     0.91
8     32    4     16      0.95    0.89    0.92     0.87     0.93
8     64    4     32      0.94    0.88    0.90     0.85     0.91
16    -     8     -       0.94    0.99    0.94     0.84     0.93
16    32    8     16      0.93    0.87    0.90     0.86     0.93
16    64    8     32      0.93    0.88    0.90     0.84     0.91
16    128   8     64      0.93    0.87    0.90     0.83     0.91
32    -     16    -       0.94    0.99    0.93     0.83     0.93
32    64    16    32      0.93    0.87    0.90     0.84     0.91
32    128   16    64      0.92    0.86    0.88     0.83     0.90
32    256   16    128     0.89    0.83    0.86     0.79     0.88
64    -     32    -       0.93    0.99    0.93     0.89     0.94
64    128   32    64      0.92    0.83    0.88     0.75     0.88
64    256   32    128     0.89    0.82    0.86     0.72     0.86
64    512   32    256     0.85    0.76    0.79     0.68     0.82

Table 10: Speed-up of flux computation kernels and whole solver (newton-jacobi) depending on prefetch distance on Broadwell.


                                  Speedup
L1i   L2i   L1d   L2d     iflux   vflux   diflux   dvflux   newton-jacobi
8     -     4     -       1.06    1.04    1.03     1.06     1.03
8     16    4     8       1.13    1.17    1.16     1.28     1.13
8     32    4     16      1.11    1.16    1.17     1.28     1.13
8     64    4     32      1.10    1.14    1.15     1.23     1.12
16    -     8     -       1.12    1.15    1.14     1.30     1.13
16    32    8     16      1.12    1.16    1.16     1.21     1.12
16    64    8     32      1.09    1.13    1.15     1.20     1.11
16    128   8     64      1.08    1.10    1.13     1.18     1.09
32    -     16    -       1.11    1.15    1.14     1.29     1.12
32    64    16    32      1.09    1.13    1.15     1.20     1.11
32    128   16    64      1.07    1.09    1.13     1.17     1.09
32    256   16    128     1.05    1.07    1.12     1.14     1.08
64    -     32    -       1.11    1.14    1.14     1.25     1.11
64    128   32    64      1.08    1.10    1.13     1.15     1.09
64    256   32    128     1.05    1.06    1.11     1.12     1.07
64    512   32    256     1.04    1.04    1.10     1.09     1.06
-     16    -     8       1.15    1.20    1.18     1.36     1.15
-     32    -     16      1.13    1.18    1.18     1.35     1.15
-     64    -     32      1.12    1.17    1.16     1.31     1.14
-     128   -     64      1.10    1.15    1.15     1.29     1.13
-     256   -     128     1.09    1.13    1.15     1.26     1.12

Table 11: Speed-up of flux computation kernels and whole solver (newton-jacobi) depending on prefetch distance on Skylake.


                                  Speedup
L1i   L2i   L1d   L2d     iflux   vflux   diflux   dvflux   newton-jacobi
8     -     4     -       1.65    2.76    2.40     1.99     1.66
8     16    4     8       1.80    4.49    3.93     1.97     1.97
8     32    4     16      1.81    4.52    4.10     2.27     1.92
8     64    4     32      1.79    4.50    4.10     2.25     1.99
16    -     8     -       1.72    3.65    3.60     1.97     1.85
16    32    8     16      1.74    4.09    3.96     2.04     1.90
16    64    8     32      1.71    3.96    3.87     2.00     1.93
16    128   8     64      1.69    3.98    3.85     1.98     1.87
32    -     16    -       1.68    3.40    3.48     1.88     1.68
32    64    16    32      1.67    3.65    3.63     1.90     1.60
32    128   16    64      1.66    3.61    3.62     1.88     1.70
32    256   16    128     1.63    3.53    3.51     1.83     1.76
64    -     32    -       1.65    3.18    3.26     1.76     1.74
64    128   32    64      1.64    3.37    3.38     1.79     1.73
64    256   32    128     1.60    3.25    3.25     1.68     1.65
64    512   32    256     1.56    3.12    3.14     1.65     1.64

Table 12: Speed-up of flux computation kernels and whole solver (newton-jacobi) depending on prefetch distance on KNC with no hyperthreading.


                                  Speedup
L1i   L2i   L1d   L2d     iflux   vflux   diflux   dvflux   newton-jacobi
8     -     4     -       1.09    1.02    1.01     0.81     0.96
8     16    4     8       1.39    1.65    1.48     1.07     1.19
8     32    4     16      1.36    1.60    1.47     1.00     1.18
8     64    4     32      1.33    1.57    1.49     0.96     1.16
16    -     8     -       1.26    1.34    1.46     1.06     1.17
16    32    8     16      1.32    1.52    1.53     0.96     1.16
16    64    8     32      1.30    1.47    1.44     0.92     1.13
16    128   8     64      1.29    1.41    1.45     0.88     1.12
32    -     16    -       1.27    1.37    1.49     1.05     1.11
32    64    16    32      1.29    1.45    1.46     0.91     1.13
32    128   16    64      1.24    1.37    1.43     0.86     1.10
32    256   16    128     1.21    1.30    1.38     0.82     1.07
64    -     32    -       1.26    1.36    1.44     1.00     1.15
64    128   32    64      1.23    1.33    1.42     0.84     1.09
64    256   32    128     1.17    1.25    1.35     0.78     1.05
64    512   32    256     1.11    1.16    1.26     0.74     1.00
-     16    -     8       1.54    1.96    1.80     1.22     1.31
-     32    -     16      1.53    1.93    1.80     1.20     1.33
-     64    -     32      1.47    1.83    1.70     1.11     1.27
-     128   -     64      1.42    1.69    1.63     1.03     1.23
-     256   -     128     1.31    1.54    1.48     0.95     1.17

Table 13: Speed-up of flux computation kernels and whole solver (newton-jacobi) depending on prefetch distance on KNL in quadrant cache mode.


colophon

This document was typeset using the typographical look-and-feel classicthesis developed by André Miede, with subsequent modifications by Ioan Hadade for adherence to Imperial College regulations. The style was inspired by Robert Bringhurst's seminal book on typography “The Elements of Typographic Style”.

Final Version as of 30th April 2018 (classicthesis).