PARALLEL SCIENTIFIC COMPUTATION
ON EMERGING ARCHITECTURES
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF MECHANICAL
ENGINEERING
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Erich Konrad Elsen
September 2009
© Copyright by Erich Konrad Elsen 2009
All Rights Reserved
I certify that I have read this dissertation and that, in my opinion, it
is fully adequate in scope and quality as a dissertation for the degree
of Doctor of Philosophy.
(Eric Darve) Principal Adviser
I certify that I have read this dissertation and that, in my opinion, it
is fully adequate in scope and quality as a dissertation for the degree
of Doctor of Philosophy.
(Juan Alonso
Aeronautics and Astronautics)
I certify that I have read this dissertation and that, in my opinion, it
is fully adequate in scope and quality as a dissertation for the degree
of Doctor of Philosophy.
(Frank Ham
Mechanical Engineering)
Approved for the University Committee on Graduate Studies.
Preface
The main goal of this thesis is to develop a method for more easily writing high-performance scientific codes, specifically mesh-based PDE solvers. The best method for achieving this is a Domain Specific Language (DSL), which we have named Liszt. Liszt provides hardware independence (for example between streaming computers, commodity graphics processors and specialized processors like IBM's CELL and ClearSpeed's line of accelerator boards) by making the mesh and mesh-based data storage primitives of the language. Code is forced to be written in a parallel way involving loops over mesh elements. Liszt has the additional desirable properties of reducing programmer time and effort, reducing program complexity, automatic parallelization/domain decomposition and built-in parallel visualization and checkpointing. Recognizing that creating a language capable of generating code for these platforms is a challenging problem, work was first done on how to best achieve high performance on these platforms for these kinds of problems to provide guidance when developing the language. The second and third chapters deal with implementing an O(N2) N-Body simulation and a compressible Euler flow solver on commodity graphics hardware and IBM's Cell. The final chapter, which is presented as an appendix because it is unrelated to the rest of the work, deals with a new periodic boundary condition developed for simulating nanowires undergoing torsion.
Acknowledgements
I would like to thank my parents for making me believe I could do anything and then not trying to tell me what that should be; Deb Michael and Doreen Wood for helping me navigate the bureaucracy of the University for 5 years; and all the teachers I've ever had, but especially: Don Porzio, Mrs. Franzen, Anthony Jacobi, John P. D'Angelo, Geir Dullerud, Rose Marie Wood, Gustavo Romero, Fred Weldy, and Wei Cai. Ilhami Torunglo and Ahmet Karakas helped me grow wise in the ways of the "real" world and I owe them deeply for all the generosity they have shown me. Frank Ham and Juan Alonso were especially helpful in that I worked with them on several projects during my stay and they always provided valuable insight and advice. I would especially like to thank Parviz Moin for enticing me to come to Stanford, for advising me during my first year, and for his guidance since. Finally, I would like to thank my advisor for everything over these last five years; hopefully some of his wisdom has been passed on to me. He took a chance on me solely because I expressed some interest in those GPU things (for which I'm grateful) and I think it worked out well.
Contents
Preface iv
Acknowledgements v
1 Historical Background 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Single-Threaded Performance . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Parallel Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.1 Parallel Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.2 Merrimac and Streaming . . . . . . . . . . . . . . . . . . . . . 11
1.3.3 Programmable GPUs . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.4 Cell Broadband Engine Architecture . . . . . . . . . . . . . . 18
1.4 Comparison of Technologies . . . . . . . . . . . . . . . . . . . . . . . 21
2 N-Body Simulations on GPUs 22
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 Implementation and Optimization on GPUs . . . . . . . . . . . . . . 28
2.3.1 Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.2 General Optimization . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.3 Optimization for small systems . . . . . . . . . . . . . . . . . 31
2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5.1 Comparison to other Architectures . . . . . . . . . . . . . . . 35
2.5.2 Hardware Constraints . . . . . . . . . . . . . . . . . . . . . . 37
2.5.3 On-board Memory vs. Cache Usage . . . . . . . . . . . . . . . 38
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.7 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.7.1 Flops Accounting . . . . . . . . . . . . . . . . . . . . . . . . . 39
3 Structured PDE Solvers on CELL and GPUs 41
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2 Review of prior work on GPUs . . . . . . . . . . . . . . . . . . . . . . 42
3.3 Flow Solver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4 Numerical accuracy considerations and performance comparisons be-
tween CPU and GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.5 Mapping the Algorithms to the GPU . . . . . . . . . . . . . . . . . . 48
3.5.1 Classification of kernel types . . . . . . . . . . . . . . . . . . . 48
3.5.2 Data layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.5.3 Summary of GPU code . . . . . . . . . . . . . . . . . . . . . . 52
3.5.4 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.6.1 Performance scaling with block size . . . . . . . . . . . . . . . 57
3.6.2 Performance of the three main kernel types . . . . . . . . . . . 58
3.6.3 Performance on real meshes . . . . . . . . . . . . . . . . . . . 60
3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.8 CELL Experiences . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.8.1 Amdahl’s Revenge . . . . . . . . . . . . . . . . . . . . . . . . 63
3.8.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4 Liszt 71
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3 Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3.1 Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3.2 Language Components . . . . . . . . . . . . . . . . . . . . . . 80
4.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5 Conclusions 100
A Torsion and Bending PBC 102
A.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
A.2 Generalization of Periodic Boundary Conditions . . . . . . . . . . . . 105
A.2.1 Review of Conventional PBC . . . . . . . . . . . . . . . . . . 105
A.2.2 Torsional PBC . . . . . . . . . . . . . . . . . . . . . . . . . . 106
A.2.3 Bending PBC . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
A.3 Virial Expressions for Torque and Bending Moment . . . . . . . . . . 112
A.3.1 Virial Stress in PBC . . . . . . . . . . . . . . . . . . . . . . . 113
A.3.2 Virial Torque in t-PBC . . . . . . . . . . . . . . . . . . . . . . 114
A.3.3 Virial Bending Moment in b-PBC . . . . . . . . . . . . . . . . 116
A.4 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
A.4.1 Si Nanowire under Torsion . . . . . . . . . . . . . . . . . . . . 118
A.4.2 Si Nanowire under Bending . . . . . . . . . . . . . . . . . . . 122
A.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Bibliography 127
List of Tables
1.1 SGEMM and DGEMM numbers are for the best performing matrix sizes on each platform that are very large (i.e., much too big to fit entirely in any kind of cache or local memory). The FFT is for the best performing (power of 2), very large 2D complex transforms. The Cell is the PowerXCell 8i accelerator board from Mercury Systems. . . . 21
2.1 Values for the maximum performance of each kernel on the X1900XTX.
The instructions are counted as the number of pixel shader assembly
arithmetic instructions in the inner loop. . . . . . . . . . . . . . . . 27
2.2 Values for the maximum performance of each kernel on the X1900XTX. 28
2.3 Comparison of GROMACS(GMX) running on a 3.2 GHz Pentium 4
vs. the GPU showing the estimated simulation time per day for a 1000
atom system.
*GROMACS does not have an SSE inner loop for LJC(linear) . . . . 34
3.1 Measured speed-ups for the NACA 0012 airfoil computation. . . . . . 61
3.2 Speed-ups for the hypersonic vehicle computation . . . . . . . . . . . 62
A.1 Comparison of torsional stiffness for Si NW estimated from MD simu-
lations and that predicted by Strength of Materials (SOM) theory. D∗
is the adjusted NW diameter that makes the SOM predictions exactly
match MD results. The critical twist angle φc and critical shear strain
γc at failure are also listed. . . . . . . . . . . . . . . . . . . . . . . . . 121
A.2 Comparison of the bending stiffnesses for Si NWs estimated from MD
simulations and that predicted by Strength of Materials (SOM) theory.
D∗ is the adjusted NW diameter that makes SOM predictions exactly
match MD results. The critical bending angle Θf and critical normal
strain εf at fracture are also listed. . . . . . . . . . . . . . . . . . . . 124
List of Figures
1.1 Transistor Counts Over the Last 35 Years . . . . . . . . . . . . . . . 3
1.2 Illustration of SIMD operation . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Parallel solve of Tri-diagonal Matrix . . . . . . . . . . . . . . . . . . . 11
1.4 G70 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5 Programming Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.6 CUDA Programming Model with N threads per block. Only the first
kernel is shown in full detail due to space constraints. . . . . . . . . . 18
1.7 Overview of the layout of the Cell . . . . . . . . . . . . . . . . . . . . 19
1.8 Conceptual Diagram of Cell SPE . . . . . . . . . . . . . . . . . . . . 20
2.1 GA Kernel with varying amounts of unrolling . . . . . . . . . . . . . 30
2.2 Performance improvement for LJC(sigmoidal) kernel with i-particle
replication for several values of N . . . . . . . . . . . . . . . . . . . . 33
2.3 Speed comparison of CPU, GPU and GRAPE-6A . . . . . . . . . . . 35
2.4 Useful MFlops per second per U.S. Dollar of CPU, GPU and GRAPE-6A 36
2.5 Millions of Interactions per Watt of CPU, GPU and GRAPE-6A . . . 36
2.6 GFlops achieved as a function of memory speed . . . . . . . . . . . . 39
3.1 Array of Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2 Structure of Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3 Flowchart of NSSUS running on the GPU. . . . . . . . . . . . . . . . 52
3.4 This figure illustrates the stencil in the x direction and the branching
on the GPU. Each colored square represents a mesh node. The color
corresponds to the stencil used for the node. Inner nodes (in grey) use
the same stencil. For optimal efficiency, nodes inside a 4 × 4 square
should branch coherently, i.e., use the same stencil (see square with a
dashed line border). For this calculation, this is not the case near the
boundary which leads to inefficiencies in the execution. The algorithm
proposed here reduces branching and leads to only one branch (instead
of 3 here). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5 The continuity of the solution across mesh blocks is enforced by com-
puting penalty terms using the SAT approach[16]. The fact that the
connectivity between blocks is unstructured creates special difficulty.
On this figure, for each node on the faces of the blue block, one must
identify the face of one of the green blocks from which the penalty
terms are to be computed. In this case, the left face of the blue block
intersects the faces of four distinct green blocks. This leads to the
creation of 4 sub-faces on the blue block. For each sub-face, penalty
terms need to be computed. Note that some nodes may belong to
several sub-faces. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.6 To calculate the penalty terms efficiently for each sub-face, one first
copies data from the 3D block into a smaller sub-face stream (shown
on the right). In this figure, the block has 10 sub-faces. Assume that
the largest sub-face can be stored in memory as a 2D rectangle of size
nx × ny. In the case shown, the sub-face stream is then composed of
12 nx × ny rectangles, 2 of which are unused. Some of the space is
occupied by real data (in blue); the rest is unused (shown in grey). . . 55
3.7 This figure shows the mapping from neighboring blocks to the neighbor
stream used to process the penalty terms for the blue block. There
are four large blocks surrounding the blue block (top and bottom not
shown). They lead to the first 4 green rectangles. The other rectangles
are formed by the two blocks in the front right and the four smaller
blocks in the front left. . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.8 Performance scaling with block size, 1st order. . . . . . . . . . . . . . 57
3.9 left: pointwise performance (inviscid flux calculation); right: stencil
performance (3rd order residual calculation). . . . . . . . . . . . . . . 59
3.10 Unstructured gather performance (boundary conditions and penalty
terms calculation). The decrease in speed-up is due to an unavoidable
O(n3) vs. O(n2) algorithmic difference in one of the kernels that make
up the boundary calculations. See the discussion in the text. . . . . 59
3.11 Three block C-mesh around the NACA 0012 airfoil. . . . . . . . . . . 60
3.12 Mach number around the NACA 0012 airfoil, M∞ = 0.63, α = 2. . . . 60
3.13 Mach number – side and back views of the hypersonic vehicle. . . . . 61
3.14 Amdahl’s Law (A = 1) vs. CBE (A = 10) . . . . . . . . . . . . . . . 64
3.15 Amdahl’s Law (A = 1) vs. CBE (A = 10) . . . . . . . . . . . . . . . 64
3.16 Ratio of Amdahl’s Law Speedup to CBE Speedup . . . . . . . . . . . 65
3.17 Cell Memory Bandwidth treating each SPE as an Independent Co-
processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.18 Cell Memory Bandwidth Viewing each SPE as a Step in a Pipeline . 67
3.19 Circular Buffering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
A.1 (a) A nanowire subjected to PBC along z axis. (b) A nanowire sub-
jected to t-PBC along z axis. . . . . . . . . . . . . . . . . . . . . . . 107
A.2 A nanowire subjected to b-PBC around z axis. At equilibrium the net
line tension force F must vanish but a non-zero bending moment M
will remain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
A.3 Snapshots of Si NWs of two diameters before torsional deformation
and after failure. The failure mechanism depends on its diameter. . . 119
A.4 Virial torque τ as a function of rotation angle φ between the two ends of the NWs of two different diameters. Because the two NWs have the same aspect ratio Lz/D, they have the same maximum strain (on the surface) $\gamma_{\max} = \frac{\phi D}{2 L_z}$ at the same twist angle φ. . . . 120
A.5 Virial bending moment M as a function of bending angle Θ between the two ends of the two NWs with different diameters. Because the two NWs have the same aspect ratio Lz/D, they have the same maximum strain $\varepsilon_{\max} = \frac{\Theta D}{2 L_z}$ at the same bending angle Θ. . . . 123
A.6 Snapshots of Si NWs of two diameters under bending deformation be-
fore and after fracture. While metastable hillocks form on the thinner
NWs before fracture (a), this does not happen for the thicker NW (c). 125
Chapter 1
Historical Background
1.1 Introduction
Since the invention of the integrated circuit in 1958, an empirical "law" has continued to predict our ability to manufacture in ever smaller dimensions. Gordon Moore, co-founder of Intel, made the observation that approximately every 18 months the number of transistors that could be mass produced in a given area doubled [71] (see figure 1.1). From 1958 until about 2002 this statement was equivalent to saying that the speed of the processor also doubled every 18 months. In fact, the correspondence was close enough that many people erred in thinking that the latter statement was actually Moore's law. Since then, the performance of a single core has increased much more slowly. The impact of this decrease in performance growth rate and its repercussions for scientific computing are the main motivating force behind this thesis.
The solution of the hardware designers to the inability to significantly increase
single-threaded performance was to increase the explicit parallelism both in the hard-
ware and in the programming model. No longer can software be written in a sequential
fashion relying on advances in hardware to improve performance. Software must now
be written to take advantage of the parallelism inherent in the processors by explic-
itly expressing the parallelism of the algorithms. This requires no less effort than
completely rethinking and rewriting most high-performance code.
1.2 Single-Threaded Performance
In the single threaded programming model the CPU is viewed as doing only one thing
at a time. It theoretically executes each command in its entirety before moving on
to the next; the results of a previous instruction are available for the next one. The
main factors determining performance are then:
• Speed of individual instructions
• Speed of data movement from memory to execution units
Figure 1.1: Transistor Counts Over the Last 35 Years
The speed of each instruction is mainly determined by the clock speed of the processor, since most arithmetic instructions (floating point division being the main exception) on modern processors take one cycle (when pipelining, which will be explained later, is taken into account). Manufacturing companies have been unable to continue
increasing the clock speeds of processors due to thermal dissipation issues even as
they continue to shrink transistor sizes. The speed of data movement is important
to ensure that every cycle a processor is performing a useful operation instead of
waiting for data to arrive. Unfortunately, delays to main memory can be on the order of
hundreds of cycles and cannot be significantly reduced. The obvious solution to the
first problem is to exploit parallelism somehow to execute more than one instruction
each clock cycle. There are two techniques for this. One is done in hardware, requires
no changes to program code and is known as ‘superscalar’ processing; the other
requires writing new code utilizing SIMD (Single Instruction, Multiple Data)
instructions or an auto-vectorizing compiler. The solutions to the second problem are
to add a memory hierarchy which decreases in size but increases in speed (cache) and
to try and find the processor another instruction to execute while waiting for data
for the current instruction, which is known as out-of-order execution. Pipelining,
superscalar execution and out-of-order execution all take advantage of and require
instruction level parallelism (ILP). Unfortunately, all these technologies reach a point of diminishing returns. First, each technology will be described and the reason it fails to scale beyond a certain point will be explained. The SIMD instructions are a limited step toward data-level parallelism.
Processor               80386    80486    Pentium    Pentium Pro    Pentium 4
Year                    1986     1989     1993       1995           2000
Cache Size (Internal)   -        8KB      32KB       512KB          2048KB
Pipelined                        X        X          X              X
Superscalar                               X          X              X
Out of order                                         X              X
SIMD                                                                X
Cache attempts to reduce the latency problem by storing recently used data closer to the processor (temporal locality) and by also bringing data spatially close to a requested location into the cache (spatial locality) under the assumption that it may also soon be needed. Generally, the hardware makes all the decisions with regard to what is brought into the cache and when data is evicted from the cache. This greatly simplifies the programming model (and, importantly, is backwards compatible with previous serial code), but can also lead to sub-optimal performance because the programmer cannot take advantage of a known access pattern by "informing" the cache. Increasing the size of the cache obviously increases the amount of data that can be in the cache at any one time and therefore also the time, on average, that a piece of data will reside in the cache before being evicted, increasing its chances of being reused. A doubling in cache size from two to four or three to six megabytes results in an average improvement of approximately 10% [85] [84] on a suite of typical application benchmarks including compression, rendering, video encoding and gaming. Clearly, the marginal efficiency of those extra transistors is not high.
Next, techniques for taking advantage of ILP are examined. Pipelining was the
earliest technique of this type to be implemented. It arose naturally because executing
a single instruction actually consists of multiple steps. In a very generic 4-stage
pipeline, an instruction must be fetched from memory, decoded, executed and then
the result written. Instead of keeping three of these stages idle while waiting for one
instruction to move all the way through the pipeline, a new instruction is begun as
soon as the first one has been fetched. Of course, the ability of the processor to do
this depends on there being 4 independent instructions in a row; otherwise it must
wait for a previous instruction to finish before starting the next one.
Listing 1.1: Pseudo-Assembly to Illustrate ILP

mul x1, y1 -> a
mul x2, y2 -> b  // independent
mul x3, y3 -> c  // independent
add a, b   -> a  // dependent
add a, c   -> a  // dependent
mov a -> memory  // dependent
add q, r   -> s  // independent

For example, in Listing 1.1, the first three instructions are independent and would start
filling up the pipeline, but then a "bubble" would form because the fourth instruction depends on the result of the first and second. And in the worst case scenario, the fifth instruction depends on the fourth, which means the pipeline is completely unused: the fifth must wait for the fourth to finish before it can enter. So clearly, the effectiveness of pipelining depends on the ability to find long runs of contiguous, independent instructions.
A super-scalar processor will have multiple functional units such as ALUs so that two multiplies can happen at exactly the same time: not pipelined, but truly in parallel. So in the example of Listing 1.1, the first two multiplications would be executed in parallel; then the next multiplication and add could also be executed in parallel; after that, only one instruction would be executed at a time because of dependencies. This technique can be, and often is, combined with pipelining so that each functional unit has its own pipeline. Ultimately though, these techniques are limited by the amount of parallelism available in the instruction stream.
Out of order execution attempts to solve this fundamental problem by allowing instructions to be executed in a different order from the one described by the instruction stream. In our example (Listing 1.1), assuming we still have a two-unit superscalar processor, instead of only executing the add a, c -> a instruction by itself because the next instruction depends on its result, the add q, r -> s statement could be executed with it since it has no dependencies. In practice this technique is complex and requires a large amount of transistors for book-keeping machinery. This places limits on how many dependent instructions can be "passed over" while looking for the next independent one.
The last mentioned technique, SIMD (Single Instruction Multiple Data), can be seen as a connection between the ILP of the past and the data parallelism of the future. Because only so much parallelism, even with all these techniques, can be extracted from a serial instruction stream, additional instructions that specifically operate on multiple data at one time were introduced. For example, to perform the additions a+b and c+d, a single SIMD instruction suffices if a and c are contiguous in memory, as are b and d; see figure 1.2. In this way the programmer could begin to explicitly specify the parallelism in the code. In some applications this can lead to a large speedup [75], but SIMD instructions are often difficult to use, essentially requiring programming in assembly language, and they require very specific data layout and alignment that can be very difficult to achieve for many applications.

Figure 1.2: Illustration of SIMD operation
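To make this concrete, the following sketch expresses the kind of operation shown in figure 1.2 using compiler intrinsics rather than raw assembly. It assumes x86 SSE (one particular SIMD instruction set, chosen only for illustration and not prescribed by the text) and 16-byte-aligned arrays; four single-precision additions are issued as a single instruction.

/* A minimal SIMD sketch using x86 SSE intrinsics (illustrative only).
 * The arrays a, b and c are assumed to be 16-byte aligned, reflecting
 * the strict layout and alignment requirements mentioned above. */
#include <xmmintrin.h>

void add4(const float *a, const float *b, float *c)
{
    __m128 va = _mm_load_ps(a);      /* load a[0..3]                    */
    __m128 vb = _mm_load_ps(b);      /* load b[0..3]                    */
    __m128 vc = _mm_add_ps(va, vb);  /* four additions, one instruction */
    _mm_store_ps(c, vc);             /* store c[0..3]                   */
}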
A possible solution to the complexity and size of the circuitry to determine and
keep track of dependencies between instructions in the various ILP techniques is to
remove them from the processor and instead move the job to the compiler. The
compiler should determine at compile time which instructions are independent and
should be executed in parallel. This is the approach of the Intel Itanium [67]. In
practice, writing the necessary compilers has proved to be a very challenging task
and current compilers are still not optimal [86] [19].
Even without introducing new techniques to take advantage of ILP, speeds could
still be increased if the clock speed of the processors could continue to be increased.
This also proved to not be possible. In an ideal CMOS transistor, current only flows when the transistor is switching states. As process nodes reached 130 and then 90 nanometers, an unanticipated phenomenon occurred: current leakage through the transistor even when it was not switching. This led to much higher thermal dissipation requirements than originally anticipated and limited the maximum clock rates of the chips. To some extent this problem has been mitigated with the introduction of so-called high-k materials that reduce this current leakage. Nonetheless, processor speeds remain capped at around 4 GHz.
A new paradigm was needed to continue increasing the performance of processors. That paradigm is data parallelism.
1.3 Parallel Architectures
In specialized areas such as High-Performance Computing (HPC), graphics and mul-
timedia applications the limitations of the general purpose processors had been ap-
parent for some time. Engineers realized that for the same transistor and power
budgets a great deal more computing power was possible - provided it was the right
kind of computing! The basic idea behind all of the following technologies is to use
a larger number of simple processors instead of a small number of very powerful pro-
cessors while placing the burden of expressing parallelism on the programmer. The
approaches taken by existing hardware are quite different but the commonality is
that the calculation must be parallel. If the algorithm/computation is completely
sequential there is nothing parallel hardware or algorithms can do to accelerate it.
An example of such a problem would be “pointer chasing”. The first memory location
contains the location of the second memory location, which contains the location of
the third and so on. Starting at the first memory location it is impossible to get to
the end of the chain in any fashion other than following the pointers. Algorithms
like this should be avoided at all cost. These new technologies depend on parallel
algorithms to fully utilize their power which requires a fundamental shift in software
development.
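As a minimal sketch of the pointer-chasing pattern just described (illustrative C, not from the thesis), note that each load address depends on the result of the previous load, so no amount of parallel hardware can overlap the accesses.

/* Pointer (index) chasing: each iteration's load address depends on the
 * previous iteration's result, so the traversal is inherently serial. */
int chase(const int *next, int start, int steps)
{
    int loc = start;
    for (int s = 0; s < steps; ++s)
        loc = next[loc];  /* cannot issue this load before the previous one finishes */
    return loc;
}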
1.3.1 Parallel Algorithms
Consider a simple example, solving a tri-diagonal matrix. The serial solution is well
known and is O(N). As a warmup to help the reader begin to think “parallelly” the
serial and parallel solutions are presented next.
• Serial : Simply perform Gaussian elimination from the bottom up until there is only one unknown left in the top row. Solve for this unknown. Now substitute back into the second row from the top, which allows the next unknown to be solved for. This process continues until the last unknown is found.
\[
\begin{bmatrix}
\beta & \gamma & 0 & 0 & 0 \\
\alpha & \beta & \gamma & 0 & 0 \\
0 & \alpha & \beta & \gamma & 0 \\
0 & 0 & \alpha & \beta & \gamma \\
0 & 0 & 0 & \alpha & \beta
\end{bmatrix}
\vec{x} =
\begin{bmatrix}
y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4
\end{bmatrix}
\]

after the first step becomes

\[
\begin{bmatrix}
\beta & \gamma & 0 & 0 & 0 \\
\alpha & \beta & \gamma & 0 & 0 \\
0 & \alpha & \beta & \gamma & 0 \\
0 & 0 & \alpha & \beta^*_1 & 0 \\
0 & 0 & 0 & \alpha & \beta
\end{bmatrix}
\vec{x} =
\begin{bmatrix}
y_0 \\ y_1 \\ y_2 \\ y^*_3 \\ y_4
\end{bmatrix}
\]

after going all the way up:

\[
\begin{bmatrix}
\beta^*_4 & 0 & 0 & 0 & 0 \\
\alpha & \beta^*_3 & 0 & 0 & 0 \\
0 & \alpha & \beta^*_2 & 0 & 0 \\
0 & 0 & \alpha & \beta^*_1 & 0 \\
0 & 0 & 0 & \alpha & \beta
\end{bmatrix}
\vec{x} =
\begin{bmatrix}
y^*_0 \\ y^*_1 \\ y^*_2 \\ y^*_3 \\ y_4
\end{bmatrix}
\]
• Parallel : One possible parallel algorithm for solving the system below is to use
cyclic reduction. At each step the even rows are used to eliminate the even
numbered unknowns from the odd equation above and below it resulting in a
new system containing just the odd rows. This reduction in the number of
unknowns is repeated until one equation in one unknown is left, which is then
solved, and the solution is propagated through the reverse of the reduction
procedure to solve for all of the unknowns (see figure 1.3; a serial reference implementation is sketched at the end of this subsection).
Original System:
\[
\begin{bmatrix}
\beta & \gamma & 0 & 0 & 0 & 0 & 0 \\
\alpha & \beta & \gamma & 0 & 0 & 0 & 0 \\
0 & \alpha & \beta & \gamma & 0 & 0 & 0 \\
0 & 0 & \alpha & \beta & \gamma & 0 & 0 \\
0 & 0 & 0 & \alpha & \beta & \gamma & 0 \\
0 & 0 & 0 & 0 & \alpha & \beta & \gamma \\
0 & 0 & 0 & 0 & 0 & \alpha & \beta
\end{bmatrix}
\vec{x} =
\begin{bmatrix}
y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6
\end{bmatrix}
\]
After first reduction step (note that the odd rows are decoupled from the even
rows):
\[
\begin{bmatrix}
\beta & \gamma & 0 & 0 & 0 & 0 & 0 \\
0 & \beta^*_1 & 0 & \gamma^*_1 & 0 & 0 & 0 \\
0 & \alpha & \beta & \gamma & 0 & 0 & 0 \\
0 & \alpha^*_3 & 0 & \beta^*_3 & 0 & \gamma^*_3 & 0 \\
0 & 0 & 0 & \alpha & \beta & \gamma & 0 \\
0 & 0 & 0 & \alpha^*_5 & 0 & \beta^*_5 & 0 \\
0 & 0 & 0 & 0 & 0 & \alpha & \beta
\end{bmatrix}
\vec{x} =
\begin{bmatrix}
y_0 \\ y^*_1 \\ y_2 \\ y^*_3 \\ y_4 \\ y^*_5 \\ y_6
\end{bmatrix}
\]
The disadvantage to this scheme, operating in a reduction fashion, is that the
amount of parallelism available at each stage decreases. When only a small
number of equations are left, it is likely faster (depending on the specifics of the
hardware) to perform the solve serially at that point instead of continuing the
reductions.
There are other schemes for the parallel solution of Toeplitz tri-diagonal matri-
ces (the coefficient on each diagonal is constant) that require no communication
at all provided one is willing to accept some error in the solution [68].
Figure 1.3: Parallel solve of a tri-diagonal matrix (left: reduce and solve the modified equation; right: propagate the solution to solve for all unknowns)
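For concreteness, the sketch below gives a plain, serial C reference implementation of the cyclic reduction just described. This is an illustration, not code from the thesis; the restriction to n = 2^k − 1 unknowns, the function name and the array interface are assumptions made for simplicity. Every iteration of each inner loop is independent and could execute in parallel, while the trip count halves at every level, which is exactly the shrinking parallelism noted above.

/* Cyclic reduction for a tridiagonal system with n = 2^k - 1 unknowns:
 *   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],  with a[0] = c[n-1] = 0.
 * The arrays a, b, c, d are overwritten during the reduction. */
void cyclic_reduction(int n, double *a, double *b, double *c,
                      double *d, double *x)
{
    int stride, i;

    /* Forward reduction: each "odd" row eliminates its two neighbours. */
    for (stride = 1; 2 * stride < n + 1; stride *= 2) {
        for (i = 2 * stride - 1; i < n; i += 2 * stride) {
            int lo = i - stride, hi = i + stride;
            double k1 = a[i] / b[lo];
            double k2 = c[i] / b[hi];
            b[i] -= k1 * c[lo] + k2 * a[hi];
            d[i] -= k1 * d[lo] + k2 * d[hi];
            a[i]  = -k1 * a[lo];   /* new coupling at distance 2*stride */
            c[i]  = -k2 * c[hi];
        }
    }

    /* Back substitution: solve the single remaining equation, then
     * propagate the solution back down through each level. */
    for (stride = (n + 1) / 2; stride >= 1; stride /= 2) {
        for (i = stride - 1; i < n; i += 2 * stride) {
            double xl = (i - stride >= 0) ? x[i - stride] : 0.0;
            double xr = (i + stride < n)  ? x[i + stride] : 0.0;
            x[i] = (d[i] - a[i] * xl - c[i] * xr) / b[i];
        }
    }
}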
1.3.2 Merrimac and Streaming
The Merrimac Streaming Supercomputing project began at Stanford to solve the
hardware and software issues outlined above. Specifically it recognized three require-
ments for continued high-performance on modern VLSI devices.
1. Parallelism
2. Latency Tolerance - 500 or more cycles to main memory
3. Exploitation of locality in addition to parallelism
One of its main contributions was the popularization of the stream programming
abstraction. In this abstraction data is organized into streams which are collections
of data on which similar computations are to be performed (Data Parallel paradigm).
Computation is performed by kernels which are computations that operate on each
element of an output stream. The key difference between this and the earlier vector
programming model is that kernels are not just simple arithmetic operations as in the
vector model, but rather a small program that has access to a local register file. This
change now allows for the programmer to express information about locality through
kernels and can prevent unnecessary writes and reads in main memory by keeping local data in the registers. It also allows for data dependencies other than a one-to-one mapping from input to output stream because of the ability to store information locally.
Although it was planned to develop and produce specialized hardware to take
advantage of this programming model, due to various circumstances, the project never
got past the design stage. Instead, a version of the Brook language was developed
for GPUs to take advantage of already existing hardware to which the programming model mapped very well.
1.3.3 Programmable GPUs
The advances in hardware design and programming abstractions have come very
quickly since the introduction of programmable GPUs. First, the GPUs of 2005,
when this research began, will be described. This will make the mapping of the
BrookGPU language to the hardware clear. It will also bring to light some of its
shortcomings. Then the current (2009) state of the programming model and hardware
will be described in the context of surmounting the aforementioned shortcomings.
Architecture circa 2005
The entire architecture of the GPU will not be described, but only that relating to
using the GPU for general purpose computations. GPUs of this era generally had
separate hardware for vertex and pixel shaders, but only pixel shaders were generally
used for general purpose computations. Likewise, because generally only one rectangle
the size of the screen is rendered, the vast majority of the fixed function pipeline is not
utilized. A top-level depiction of a GPU from this era (NVIDIA’s G70) can be
seen in figure 1.4. From the perspective of the stream programming abstraction,
everything above the fragment crossbar is not terribly important. What matters is that somehow fragments are generated (based upon the output destination) and fed into the fragment shaders to be processed independently. The programming model combined with its view of the hardware can be seen in figure 1.5.

Figure 1.4: G70 Architecture

Theoretically, if there are N fragments they can all be thought of as being executed simultaneously on N different processors. Of course, in reality, there were only 24 physical processors on
a GPU (the exact number obviously depended on the particular GPU); a far smaller
number than fragments to be processed, but there are actually more fragments ”in
flight” than the number of processors. The actual number of fragments ”in flight” is
approximately 20× the number of physical processors. This is done so that whenever
one fragment stalls waiting for a memory access, another fragment can be scheduled
immediately in its place and no processing capacity is lost due to memory latency.
The order in which fragments are generated and sent into the queue to be processed
is chosen to maximize the possibility of cache hits, assuming the fragments access
data that has 2D locality.
BrookGPU
Brook for GPUs (also known as BrookGPU) was designed by Ian Buck [13, 12, 59].
Brook is a source-to-source compiler which converts Brook code into C++ code and a high-level shader language like Cg or HLSL.

Figure 1.5: Programming Model (a kernel reads input streams, gather arguments and constants into local registers, with reads passing through the texture cache, and writes to a fixed output location)

This code then gets compiled into
pixel shader assembly by an appropriate shader compiler like Microsoft’s FXC or
NVIDIA’s CGC. The graphics driver finally maps the pixel shader assembly code
into hardware instructions as appropriate to the architecture. It can run on top of
either DirectX or OpenGL; due to its greater maturity, the DirectX backend was used for
all results in this thesis. Specifically Microsoft DirectX 9.0c [69] and the Pixel Shader
3.0 Specification [70]. In the Pixel Shader 3.0 specification, the shader has access to
32 general purpose, 4-component, single precision floating point (float4) registers,
16 float4 input textures, 4 float4 render targets (output streams) and 32 float4
constant registers. A shader consists of a number of assembly-like instructions. GPUs
of this era had a maximum static program length of 512 (ATI) or 1024 (NVIDIA)
instructions.
The syntax of Brook is based on C with some extensions. The data is represented
as streams. These streams are operated on by kernels which have specific restrictions:
each kernel is a short program to be executed concurrently on each record of the
output stream(s). This implies that each instance of a kernel automatically has an
output location associated with it. It is this location only to which output can be
written. Scatter operations (writing to arbitrary memory locations) are not allowed.
Gather operations (read with indirect addressing) are possible for input streams. Here
is a trivial example:
kernel void add(
    /* stream argument */ float a<>,
    /* gather argument */ float b[][],
    /* constant        */ int width,
    /* output          */ out float result<>)
{
    float2 indexToRight = indexof(result).xy + float2(1, 0);
    // wrap around if we're on the edge
    if (indexToRight.x == width)
        indexToRight.x = 0;
    // because a is a stream argument
    // we do not need to provide indices;
    // it is automatically that of the output location
    result = a + b[indexToRight];
}

float a<100>; float b<100>; float c<100>;
add(a, b, 100, c);

One hundred instances of the add kernel are created, implicitly executing a parallel for loop over all the elements of the output stream c. The indexof operator can be used to get the location a particular instance of the kernel will be writing to in the output stream(s).
The features of the hardware appear in the language in many ways. Unlike memory
in traditional machines, streams are all addressed using two coordinates because under
the hood all memory is represented as textures which are inherently 2D in graphics
languages (3D textures were not yet standardized or supported by all platforms when
Brook was created). Most importantly, caches are two dimensional. Instead of cache
lines one can instead think of cache squares around the data requested.¹ Some of the more annoying features of the hardware are related to looping. Both NVIDIA and ATI cards use an 8-bit counter for for loops, so each for loop is limited to 256 iterations (i = 0...255).² To do more iterations, for loops must be nested: 2 loops for 65,535 iterations and so on. The required control flow is given in the following code snippet.

bool breakFlag = false;
for (int i = 0; i < 256; ++i) {
    for (int j = 0; j < 256; ++j) {
        int linearIndex = i * 256 + j;
        if (linearIndex >= desiredNumIterations) {
            breakFlag = true;
            break;
        }
        // do something
    }
    if (breakFlag)
        break;
}

¹Technically, the memory on GPUs is still linear, but by using algorithms based on space filling Z-curves, the hardware gives the appearance of a two dimensional memory layout.
²For unknown reasons, on ATI the limit is actually i = 0...254.
A further complication with loops is that on NVIDIA hardware there is a hard limit
of 65,535 (assembly) instructions per kernel invocation. The exact number of in-
structions used is impossible to determine before runtime because the true assembly
instructions used by the hardware are generated by the driver at runtime (Just-In-
Time Compiled). The solution is to multi-pass a kernel that might do many loop
iterations across many kernel invocations, but this is naturally inefficient since data must be reloaded over and over again instead of remaining in registers (basically negating
some of the advantage of the stream paradigm over the vector processing paradigm).
The inability to scatter has some important algorithmic implications. For exam-
ple, if we make a calculation and then need to update values at multiple memory
locations with this single value:

foo[bar] += value;
foo[moo] += value;

we have to calculate value twice on GPUs since we could only output to one location.
A limitation of this programming model itself is that locality can only be directly
expressed by the programmer at one level, that of the registers. It is only indi-
rectly possible, through the texture cache, to make use of locality between different
fragments. Even this ability only allows for re-use of constant read-only data; it is
impossible to share calculated information between fragments.
CUDA and Recent Hardware
Even though the research in this thesis was done with BrookGPU, it is worth men-
tioning CUDA and recent architectural developments. NVIDIA’s G80 and later se-
ries chips as well as ATI’s R600 and later series chips have what’s known as unified
shaders. Instead of specific hardware that is only either a vertex or pixel shader, there
is one unit that can function as either depending on the demand. BrookGPU and
indeed, most general purpose GPU computing, never used vertex shaders so as far
as GPGPU was concerned, they were a waste of transistors. Now however, all of the
unified shaders can be utilized for computation. More importantly, NVIDIA released
CUDA which is an evolution of the stream programming model of BrookGPU. The
two main evolutionary features are:
Shared Memory - a small amount of read/write memory that can be shared among
a group of threads (the preferred terminology to move away from the graphics
specific fragment) known as a block.
Scatter - it is now possible to write to arbitrary memory locations from each thread.
It is also possible to place synchronization points in kernel code which will be respected
within a block. Combined with the shared memory, this provides a second level of
locality that is explicitly controlled by the programmer. The new programming model
can be seen in figure 1.6.
Figure 1.6: CUDA Programming Model with N threads per block. Only the first kernel is shown in full detail due to space constraints. (Each kernel has local registers and per-block shared memory, with access to constants, the texture cache and linear global memory.)
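Although the research in this thesis used BrookGPU, a minimal CUDA sketch makes the two new features concrete. The kernel below is a generic illustration, not code from this dissertation; the kernel name and block size are arbitrary. Each block cooperatively sums its elements in shared memory, synchronizing between steps, and thread 0 then scatters the block's partial sum to an arbitrary location in global memory.

// Generic CUDA sketch: shared memory, block-level synchronization, scatter.
#define BLOCK 256

__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float s[BLOCK];                 // shared among this block's threads
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // cooperative load into shared memory
    __syncthreads();                           // synchronization point within the block

    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = s[0];                // scatter: write to an arbitrary location
}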
1.3.4 Cell Broadband Engine Architecture
The Cell Broadband Engine Architecture (usually shortened to just Cell) was de-
veloped by Sony, Toshiba and IBM. It sits in between the completely data parallel
paradigm of GPUs and the instruction level paradigm of conventional CPUs. It con-
sists of one Power Processing Element (PPE), a simplified PowerPC processor and
eight Synergistic Processing Elements (SPE). They are all connected by the Element
Interconnect Bus (EIB), a circular ring connecting the PPE, 8 SPEs and a memory
controller (MIC). The MIC interfaces with the onboard XDR (extreme data rate)
RAM, which has a maximum data rate of 25.6 GB/sec to the ring. The EIB actually consists of 4 "lanes", two of which operate clockwise and two counter-clockwise. The maximum bandwidth around the ring is 204 GB/sec (at a clock speed of 3.2 GHz). Compared to the GPUs available when the Cell was released, the Cell's bandwidth to main memory was about half that of the GPUs; compared to today's GPUs the gap is nearly a factor of seven! This, combined with the similar discrepancy between the ring bandwidth and the main memory bandwidth, makes it clear that the Cell cannot be thought of as a pure streaming, data-parallel processor. It is partially data parallel but also task parallel. To make the most use of the available bandwidth, SPEs must process data in a pipeline fashion, with each SPE, generally, performing a different task, or, at a minimum, usefully reusing information amongst themselves. Unfortunately, this programming model does not always map well to complicated scientific codes.

Figure 1.7: Overview of the layout of the Cell (the PPE, a 64-bit PowerPC core, and eight SPEs connected by the EIB, with RAM attached at 25.6 GB/sec)
The PPE is a fairly standard PowerPC processor with the exception that it doesn't
support out-of-order execution. Additionally, some instructions were converted into
microcoded instructions (essentially a sequence of other instructions) that are stored
in a ROM chip. It takes 11 cycles for the instructions to be fetched from the ROM
and the pipeline stalls during this time. Although Cell aware compilers will try to
avoid these instructions, it is not always possible. These two differences can have a
significant impact on performance, as will be shown later.
The SPEs are unique processors. They have no cache; instead they have a small Local Store (LS), 256 KB in size, with a predictable 6 cycle latency on all loads. All operations are vector operations; there are no scalar instructions. Scalar operations can be emulated by the compiler using shifts and masks, but this results in under-utilization of the available compute power by at least a factor of 4 (likely much more).
It has a large register file: 128 general purpose registers are available. In keeping with the vector nature of the processor, the registers are 16 bytes in size, the size of a typical SIMD vector (4 floats or 2 doubles). Due to the relatively simple nature of the processor, execution of code on an SPE is deterministic; it can be determined statically how code will be pipelined. Optimizing code to prevent pipeline stalls is an important optimization technique that has some trade-offs that will be discussed later (code size vs. number of stalls).

Figure 1.8: Conceptual diagram of a Cell SPE (256 KB Local Store, Memory Flow Controller, 128 128-bit registers, an even pipe for arithmetic ops and an odd pipe for load/store/branch ops, connected to the EIB)
In addition to their compute capabilities, they also have a Memory Flow Controller (MFC) which contains a Direct Memory Access (DMA) controller that is used to transfer data between the LS and main memory. The SPEs queue up memory transfers with the MFC, which takes care of servicing them asynchronously while the SPE goes about computing. Ideally, by employing some kind of buffering strategy, the SPE should never be waiting for data transfers. There are, unfortunately, quite a range of restrictions and conditions on the DMAs to achieve maximum performance. Maximum performance "is achieved for transfers in which both the EA [main memory address] and LSA [local store address] are 128-byte aligned and multiples of 128 bytes." [41]. All of the requirements and conditions in their full detail can be found in the Cell Broadband Engine Programming Manual.
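The buffering strategy can be sketched as follows. This is generic double-buffering illustration code rather than anything from the solvers discussed later; it assumes the standard Cell SDK SPU intrinsics from spu_mfcio.h (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all) and a hypothetical process() routine, and it keeps transfers 128-byte aligned as the manual requires. While one buffer is being processed, the DMA for the next buffer is already in flight.

/* Double-buffered streaming of data from main memory into the local store. */
#include <spu_mfcio.h>

#define CHUNK 4096  /* bytes per transfer, a multiple of 128 */

static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process(volatile char *data, int bytes);  /* hypothetical compute routine */

void stream_in(unsigned long long ea, int nchunks)
{
    int cur = 0;
    /* start the first transfer with tag 0 */
    mfc_get(buf[0], ea, CHUNK, 0, 0, 0);

    for (int i = 0; i < nchunks; ++i) {
        int next = cur ^ 1;
        /* kick off the next transfer (tag = buffer index) before waiting */
        if (i + 1 < nchunks)
            mfc_get(buf[next], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, next, 0, 0);

        /* wait only for the current buffer's tag, then compute on it */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        process(buf[cur], CHUNK);

        cur = next;
    }
}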
CHAPTER 1. HISTORICAL BACKGROUND 21
1.4 Comparison of Technologies
            Transistors   Power    BW        Max Single   Max Double   SGEMM     DGEMM     FFT       Cost
            (Millions)    (Watts)  (GB/sec)  (GFlops)     (GFlops)     (GFlops)  (GFlops)  (GFlops)  ($)
Nehalem     731           130      25.6      102.4        51.2         92        45        41        1700
GTX 285     1400          183      159       1062         88           355       74        95        400
Cell        250           150      22.8      180          102          175       75        21        8000

Table 1.1: SGEMM and DGEMM numbers are for the best performing matrix sizes on each platform that are very large (i.e., much too big to fit entirely in any kind of cache or local memory). The FFT is for the best performing (power of 2), very large 2D complex transforms. The Cell is the PowerXCell 8i accelerator board from Mercury Systems.
From the raw numbers in table 1.1 many of the relative strengths and weaknesses
of each platform become apparent. In terms of raw and achieved single precision
performance and performance per dollar, the GPU is dominant. The Cell is very
efficient in terms of performance per transistor, but that is a rather useless metric,
except perhaps to IBM’s bottom line. The Cell also has slightly higher double pre-
cision performance than the GPU, and therefore also slightly better double precision
performance per watt, but fares horribly in performance per dollar comparisons. Ab-
solute performance of the CPU is generally the worst of the three, but it falls in the
middle when performance per dollar is examined.
What is missing from the table is performance on more complicated applications.
Matrix-Matrix multiply and Fast Fourier Transforms are simple compared to complex
scientific applications and important to such a variety of applications that a great
deal of manpower goes into optimizing a very small piece of code, which can make
performance numbers for these routines not representative of the performance one
can achieve on larger applications. This thesis attempts to fill in this gap in chapter 3
where the performance and implementation of a compressible Euler solver is detailed
on both GPUs and the CELL and compared with a reference CPU implementation.
Chapter 2
N-Body Simulations on GPUs
2.1 Introduction
The classical N -body problem consists of obtaining the time evolution of a system
of N mass particles interacting according to a given force law. The problem arises
in several contexts, ranging from molecular scale calculations in structural biology to
stellar scale research in astrophysics. Molecular dynamics (MD) has been successfully
used to understand how certain proteins fold and function, which have been outstand-
ing questions in biology for over three decades [87, 33]. Exciting new developments
in MD methods offer hope that such calculations will play a significant role in future
drug research [30]. In stellar dynamics where experimental observations are hard, if
not impossible, theoretical calculations may often be the only way to understand the
formation and evolution of galaxies.
Analytic solutions to the equations of motion for more than 2 particles or compli-
cated force functions are intractable, which forces one to resort to computer simulations. A typical simulation consists of a force evaluation step, where the force law and the current configuration of the system are used to compute the forces on each
particle, and an update step, where the dynamical equations (usually Newton’s laws)
are numerically stepped forward in time using the computed forces. The updated
configuration is then reused to calculate forces for the next time step and the cycle
is repeated as many times as desired.
The simplest force models are pairwise additive, that is the force of interaction
between two particles is independent of all the other particles, and the individual
forces on a particle add linearly. The force calculation for such models is of com-
plexity O(N2). Since typical studies involve a large number of particles (103 to 106)
and the desired number of integration steps is usually very large (106 to 1015), the
computational requirements often limit both the problem size as well as the simula-
tion time and consequently, the useful information that may be obtained from such
simulations. Numerous methods have been developed to deal with these issues. For
molecular simulations, it is common to reduce the number of particles by treating
the solvent molecules as a continuum. In stellar simulations, one uses individual time
stepping or tree algorithms to minimize the number of force calculations. Despite
such algorithmic approximations and optimizations, the computational capabilities
of current hardware remain a limiting factor.
Typically N -body simulations utilize neighborlists, tree methods or other algo-
rithms to reduce the order of the force calculations. In previous work [27], a GPU
implementation of a neighbor list based method to compute non-bonded forces was
demonstrated. However, since the GPU so far outperformed the CPU, the neigh-
borlist creation quickly became a limiting factor. Building the neighborlist on the
GPU is extremely difficult due to the lack of specific abilities (namely indirected out-
put) and research on computing the neighborlist on the GPU is still in progress. Other
simplistic simulations that do not need neighborlist updates have been implemented
by others [47]. However, for small N, one finds they can do an O(N2) calculation
significantly faster on the GPU than an O(N) method using the CPU (or even with
a combination of the GPU and CPU). This has direct applicability to biological sim-
ulations that use continuum models for the solvent. The reader should also note that
in many of the reduced order methods such as tree based schemes, at some stage an
O(N2) calculation is performed on a subsystem of the particles, so this method can
be used to improve the performance of such methods as well. When using GRAPE
accelerator cards for tree based algorithms, the host processor takes care of building
the tree and the accelerator cards are used to speed up the force calculation step;
GPUs could be used in a similar way in place of the GRAPE accelerator boards.
Using the methods described below, acceleration of the force calculation by a
factor of 25 is possible with GPUs compared to highly optimized SSE code running
on an Intel Pentium 4. This performance is in the range of the specially designed
GRAPE-6A [31] and MDGRAPE-3 [92] processors, but uses a commodity processor
at a much better performance/cost ratio.
2.2 Algorithm
General purpose CPUs are designed for a wide variety of applications and take limited
advantage of the inherent parallelism in many calculations. Improving performance in
the past has relied on increasing clock speeds and the size of high speed cache memo-
ries. Programming a CPU for high performance scientific applications involves careful
data layout to utilize the cache optimally and careful scheduling of instructions.
In contrast, graphics processors are designed for intrinsically parallel operations,
such as shading pixels, where the computations on one pixel are completely indepen-
dent of another. GPUs are an example of streaming processors, which use explicit data
parallelism to provide high compute performance and hide memory latency. Data is
expressed as streams and data parallel operations are expressed as kernels. Kernels
can be thought of as functions that transform each element of an input stream into
a corresponding element of an output stream. When expressed this way, the kernel
function can be applied to multiple elements of the input stream in parallel. Instead
of blocking data to fit caches, the data is streamed into the compute units. Since
streaming fetches are predetermined, data can be fetched in parallel with computa-
tion. This section describes how the N -body force calculation can be mapped to
streaming architectures.
In its simplest form the N -body force calculation can be described by the following
pseudo-code:

for i = 1 to N
    force[i] = 0
    ri = coordinates[i]
    for j = 1 to N
        rj = coordinates[j]
        force[i] = force[i] + force_function(ri, rj)
    end
end

Since all coordinates are fixed during the force calculation, the force computation can
be parallelized for the different values of i. In terms of streams and kernels, this can
be expressed as follows:
stream coordinates;
stream forces;

kernel kforce(ri)
    force = 0
    for j = 1 to N
        rj = coordinates[j]
        force = force + force_function(ri, rj)
    end
    return force
end kernel

forces = kforce(coordinates)

The kernel kforce is applied to each element of the stream coordinates to pro-
duce an element of the forces stream. Note that the kernel can perform an indexed
fetch from the coordinates stream inside the j-loop. An out-of-order indexed fetch
can be slow, since in general, there is no way to prefetch the data. However in this
case the indexed accesses are sequential. Moreover, the j-loop is executed simulta-
neously for many i-elements; even with minimal caching, rj can be reused for many i-elements without fetching from memory; thus the performance of this algorithm
would be expected to be high. The implementation of this algorithm on GPUs and
GPU-specific performance optimizations are described in the following section.
There is however one caveat in using a streaming model. Newton’s Third law
states that the force on particle i due to particle j is the negative of the force on
particle j due to particle i. CPU implementations use this fact to halve the number
of force calculations. However, in the streaming model, the kernel has no ability to
write an out-of-sequence element (scatter), so forces[j] cannot be updated while
summing over the j-loop to calculate forces[i]. This effectively doubles the number
of computations that must be done on the GPU compared to a CPU.
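As a point of reference, the symmetric update that a CPU implementation exploits (and that the streaming model rules out) can be sketched as follows; this is a hypothetical host-side C++ fragment, not code from this work, and force_function stands in for any of the pair force models discussed below.

#include <array>
#include <vector>

using Vec3 = std::array<double, 3>;

// CPU-style accumulation exploiting Newton's third law: each unordered pair
// (i, j) is evaluated once and the result is written to BOTH forces[i] and
// forces[j]. The second write is a scatter, which a streaming kernel cannot
// express, so on the GPU every pair is evaluated twice.
template <class PairForce>
void accumulate_symmetric(const std::vector<Vec3>& coords,
                          std::vector<Vec3>& forces,
                          PairForce force_function)   // any pair force model
{
    for (auto& f : forces) f = {0.0, 0.0, 0.0};
    for (std::size_t i = 0; i < coords.size(); ++i) {
        for (std::size_t j = i + 1; j < coords.size(); ++j) {
            Vec3 fij = force_function(coords[i], coords[j]);
            for (int k = 0; k < 3; ++k) {
                forces[i][k] += fij[k];   // update of the element being produced
                forces[j][k] -= fij[k];   // scatter: not available to a kernel
            }
        }
    }
}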
Several commonly used force functions were implemented to measure and compare
performance. For stellar dynamics, depending on the integration scheme being used,
Gravity (accel):
    formula: m_j r_ij / (r_ij² + ε²)^(3/2)
    19 flops per interaction, 4×4 unroll, 64 input bytes, 125 inner loop instructions, 19.9 GB/s

Gravity (accel & jerk):
    formula: m_j r_ij / (r_ij² + ε²)^(3/2) and
             m_j [ v_ij / (r_ij² + ε²)^(3/2) − 3 (r_ij · v_ij) r_ij / (r_ij² + ε²)^(5/2) ]
    42 flops per interaction, 1×4 unroll, 128 input bytes, 104 inner loop instructions, 40.6 GB/s

LJC (constant):
    formula: q_i q_j r_ij / (ε r_ij³) + ε_ij [ (σ_ij/r_ij)⁶ − (σ_ij/r_ij)¹² ]
    30 flops per interaction, 2×4 unroll, 104 input bytes, 109 inner loop instructions, 33.6 GB/s

LJC (linear):
    formula: q_i q_j r_ij / r_ij⁴ + ε_ij [ (σ_ij/r_ij)⁶ − (σ_ij/r_ij)¹² ]
    30 flops per interaction, 2×4 unroll, 104 input bytes, 107 inner loop instructions, 34.5 GB/s

LJC (sigmoidal):
    formula: q_i q_j r_ij / (ζ(r_ij) r_ij³) + ε_ij [ (σ_ij/r_ij)⁶ − (σ_ij/r_ij)¹² ],
             with ζ(r) = e^(αr³ + βr² + γr + δ)
    43 flops per interaction, 2×4 unroll, 104 input bytes, 138 inner loop instructions, 27.3 GB/s

Table 2.1: Values for the maximum performance of each kernel on the X1900XTX. The
instructions are counted as the number of pixel shader assembly arithmetic instructions
in the inner loop.
one may need to compute just the forces, or the forces as well as the time derivative of
the forces (jerk). These kernels are referred to as GA (Gravitational Acceleration) and
GAJ (Gravitational Acceleration and Jerk) in the rest of this chapter. In molecular
dynamics, it is not practical to use O(N²) approaches when the solvent is treated
explicitly, so this work restricts itself to continuum solvent models. In such models,
the quantum interaction of non-bonded atoms is given by a Lennard-Jones function
and the electrostatic interaction is given by Coulomb’s Law suitably modified to
account for the solvent. The LJC(constant) kernel calculates the Coulomb force with
a constant dielectric, while the LJC(linear) and LJC(sigmoidal) kernels use distance
dependent dielectrics. The equations used for each kernel as well as the arithmetic
complexity of the calculation are shown in Tables 2.1 and 2.2.
Kernel                    Useful GFLOPS   Giga interactions per sec.   System size
Gravity (accel)                94.3                4.97                  65,536
Gravity (accel & jerk)         53.5                1.27                  65,536
LJC (constant)                 77.6                2.59                   4096
LJC (linear)                   79.5                2.65                   4096
LJC (sigmoidal)                90.3                2.10                   4096

Table 2.2: Values for the maximum performance of each kernel on the X1900XTX.
2.3 Implementation and Optimization on GPUs
2.3.1 Precision
Recent graphics boards have 32-bit floating point arithmetic. Consequently all of
the calculations were done in single precision. Whether or not this is sufficiently
accurate for the answers being sought from the simulation is often the subject of
a debate which will not be settled here. In many cases, though certainly not all,
single precision is enough to obtain useful results. Furthermore, if double precision
is necessary, it is usually not required throughout the calculation, but rather only
in a select few instances. For reference, GRAPE-6 [61] performs the accumulation
of accelerations, subtraction of position vectors and update of positions in 64-bit
fixed point arithmetic with everything else in either 36, 32 or 29 bit floating point
precision. It is quite common to do the entire force calculation in single precision for
molecular simulations while using double precision for some operations in the update
step. If and where necessary, the appropriate precision could be emulated on graphics
boards [32]. The impact on performance would depend on where and how often it
would be necessary to do calculations in double precision.
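As an illustration of the kind of software technique such emulation typically relies on (a generic sketch, not the method of [32] nor code from these kernels), compensated summation recovers most of the precision lost when accumulating many single-precision terms:

#include <cstdio>
#include <vector>

// Kahan (compensated) summation in single precision: the rounding error of
// each addition is carried forward in c and folded back into later terms.
// Note: requires compiling without value-unsafe FP optimizations (e.g. -ffast-math).
float compensated_sum(const std::vector<float>& x) {
    float sum = 0.0f, c = 0.0f;
    for (float xi : x) {
        float y = xi - c;
        float t = sum + y;   // low-order bits of y are lost in this addition...
        c = (t - sum) - y;   // ...and recovered here
        sum = t;
    }
    return sum;
}

int main() {
    std::vector<float> terms(10000000, 0.1f);
    float naive = 0.0f;
    for (float t : terms) naive += t;   // drifts badly once the running sum is large
    std::printf("naive: %f  compensated: %f  exact: %f\n",
                naive, compensated_sum(terms), 0.1 * terms.size());
    return 0;
}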
2.3.2 General Optimization
The algorithm was implemented for several force models. For simplicity, in the follow-
ing discussion, only the GA kernel is discussed, which corresponds to the gravitational
attraction between two mass particles, given by
a_i = −G ∑_{j≠i} m_j r_ij / (r_ij² + ε²)^(3/2)        (2.1)
where ai is the acceleration on particle i, G is a constant (often normalized to one), mj
is the mass of particle j, ε is a softening parameter used to avoid near singular forces
when two particles become very close, and rij is the vector displacement between
particles i and j. The performance of the kernel for various input sizes is shown in
Figure 2.1.
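For concreteness, a direct (scalar, host-side) evaluation of Eq. (2.1) might look like the following C++ sketch; the container types and parameter names are illustrative, not taken from the GPU kernels.

#include <array>
#include <cmath>
#include <vector>

using Vec3 = std::array<float, 3>;

// Softened gravitational acceleration on particle i, per Eq. (2.1).
// Here r is the displacement from particle i to particle j, which absorbs
// the overall minus sign of the equation.
Vec3 acceleration(std::size_t i,
                  const std::vector<Vec3>& pos,
                  const std::vector<float>& mass,
                  float G, float eps)
{
    Vec3 a{0.0f, 0.0f, 0.0f};
    for (std::size_t j = 0; j < pos.size(); ++j) {
        if (j == i) continue;
        Vec3 r{pos[j][0] - pos[i][0],
               pos[j][1] - pos[i][1],
               pos[j][2] - pos[i][2]};
        float r2   = r[0]*r[0] + r[1]*r[1] + r[2]*r[2] + eps*eps;
        float coef = G * mass[j] / (r2 * std::sqrt(r2));   // m_j / (r^2 + eps^2)^(3/2)
        for (int k = 0; k < 3; ++k) a[k] += coef * r[k];
    }
    return a;
}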
The algorithm outlined in Section 2.2 was implemented in BrookGPU and targeted
for the ATI X1900XTX. Even this naive implementation performs very well, achieving
over 40 GFlops, but its performance can be improved. This kernel executes 48 Giga-
instructions/sec and has a memory bandwidth of 33 GB/sec. Using information from
GPUBench [14], one expects the X1900XTX to be able to execute approximately
30-50 Giga-instruction/sec (it depends heavily on the pipelining of commands) and
have a cache memory bandwidth of 41GB/sec. The nature of the algorithm is such
that almost all the memory reads will be from the cache since all the pixels being
rendered at a given time will be accessing the same j-particle. Thus this kernel is
limited by the rate at which the GPU can issue instructions (compute bound).
To achieve higher performance, the standard technique of loop unrolling was used.
This naive implementation is designated as a 1×1 kernel because it is not unrolled
in either i or j. The convention followed hereafter when designating the amount of
unrolling will be that A×B means i unrolled A times and j unrolled B times. The
second GA kernel (1×4) unrolled the j-loop four times, enabling the use of the 4-way
SIMD instructions on the GPU. This reduces the number of instructions that must be
issued by roughly a factor of 3 (some pixel shader instructions are scalar, which prevents
a full factor-of-4 reduction). The performance for this kernel is
shown in Figure 2.1. It achieves a modest speedup compared to the previous one,
and the kernel has now switched from being compute bound to bandwidth bound (35
Giga-Instructions/sec and ≈40GB/sec).
Figure 2.1: GA kernel performance (giga-interactions per second versus output stream
size) for varying amounts of unrolling (1×1, 1×4 and 4×4).
Further reducing bandwidth usage is somewhat more difficult. It involves using
the multiple render targets (MRT) capability of recent GPUs which is abstracted as
multiple output streams by BrookGPU. By reading in 4 i-particles into each kernel
invocation and outputting the force on each into a separate output stream, the size of
each output stream is reduced by a factor of four compared with the original. This
reduces the input bandwidth requirement to one quarter of the original because
each j-particle is only read by one-quarter as many fragments. To make this more
clear, the pseudo-code for this kernel is shown below. This kernel is designated as a
4×4 kernel.

stream coordinates;
stream indices = range( 1 to N skip 4 );
stream forces1, forces2, forces3, forces4;

kernel kforce4x4(i)
    force1 = 0
    force2 = 0
    force3 = 0
    force4 = 0
    ri1 = coordinates[i]
    ri2 = coordinates[i+1]
    ri3 = coordinates[i+2]
    ri4 = coordinates[i+3]
    for j = 1 to N skip 4
        rj1 = coordinates[j]
        rj2 = coordinates[j+1]
        rj3 = coordinates[j+2]
        rj4 = coordinates[j+3]
        force1 += force_function4(ri1, rj1, rj2, rj3, rj4)
        force2 += force_function4(ri2, rj1, rj2, rj3, rj4)
        force3 += force_function4(ri3, rj1, rj2, rj3, rj4)
        force4 += force_function4(ri4, rj1, rj2, rj3, rj4)
    end
    return force1, force2, force3, force4
end kernel

forces1, forces2, forces3, forces4 = kforce4x4(indices)

In the above code, the input is the sequence of integers 1, 5, 9, ..., N and the output
is 4 force streams. The force_function4 kernel uses the 4-way SIMD math available on
the GPU to compute 4 forces at a time. The four output streams can be trivially
merged into a single one if needed. Results for this kernel can be seen in Figure 2.1.
Once more the kernel has become instruction-rate limited and its bandwidth is half
that of the maximum bandwidth of the ATI board, but the overall performance has
increased significantly.
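Merging the four output streams back into a single force array, should a caller require it, is a simple interleave. The sketch below assumes the i-particle ordering implied by the pseudo-code above; the names and types are otherwise hypothetical.

#include <vector>

// Kernel invocation k produced the forces on i-particles 4k .. 4k+3, one per
// output stream, so the merged array simply interleaves the four streams.
template <class T>
std::vector<T> merge_outputs(const std::vector<T>& f1, const std::vector<T>& f2,
                             const std::vector<T>& f3, const std::vector<T>& f4)
{
    std::vector<T> out(4 * f1.size());
    for (std::size_t k = 0; k < f1.size(); ++k) {
        out[4*k + 0] = f1[k];
        out[4*k + 1] = f2[k];
        out[4*k + 2] = f3[k];
        out[4*k + 3] = f4[k];
    }
    return out;
}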
2.3.3 Optimization for small systems
In all cases, performance is severely limited when the number of particles is less than
about 4000. This is due to a combination of fixed overhead in executing kernels and
the lack of sufficiently many parallel threads of execution. It is sometimes necessary
to process small systems or subsystems of particles (N ≈ 100− 1000).
For example, in molecular dynamics where forces tend to be short-range in nature,
it is more common to use O(N) methods by neglecting or approximating the inter-
actions beyond a certain cutoff distance. However, when using continuum solvent
models, the number of particles is small enough (N ≈ 1000) that the O(N2) method
is comparable in complexity while giving greater accuracy than O(N) methods.
It is common in stellar dynamics to parallelize the individual time step scheme
by using the block time step method [66]. In this method forces are calculated on
only a subset of the particles at any one time. In some simulations a small core can
form such that the smallest subset might have less than 1000 particles in it. To take
maximal advantage of GPUs it is therefore important to get good performance for
small output stream sizes.
To do this, one can increase the number of parallel threads by decreasing the
j-loop length. For example, the input stream can be replicated twice, with the j-loop
looping over the first N/2 particles for the first half of the replicated stream and
looping over the second N/2 particles for the second half of the stream. Consider the
following pseudocode that replicates the stream size by a factor of 2:

stream coordinates;
stream indices = range( 1 to 2N );
stream partial_forces;

kernel kforce(i)
    force = 0
    if i <= N:
        ri = coordinates[i]
        for j = 1 to N/2
            rj = coordinates[j]
            force = force + force_function(ri, rj)
        end
    else
        ri = coordinates[i-N]
        for j = N/2+1 to N
            rj = coordinates[j]
            force = force + force_function(ri, rj)
        end
    endif
    return force
end kernel

partial_forces = kforce(indices)

In this example, the stream indices is twice as long as the coordinates stream
and contains integers in sequence from 1 to 2N . After applying the kernel kforce
on indices to get partial_forces, the force on particle i can be obtained by
adding partial_forces[i] and partial_forces[i+N], which can be expressed as a
trivial kernel (see the sketch below). The performance of the LJC(sigmoidal) kernel for
different numbers of replications of the i-particles is shown in Figure 2.2 for several system sizes.
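The combining kernel itself amounts to a single addition per stream element; a hypothetical host-side equivalent for a replication factor of 2 (treating each force component as one stream element) is:

#include <vector>

// Total force on particle i = partial sum over j = 1..N/2 (stored at i)
// plus partial sum over j = N/2+1..N (stored at i+N).
std::vector<float> combine_partials(const std::vector<float>& partial_forces) {
    const std::size_t n = partial_forces.size() / 2;
    std::vector<float> forces(n);
    for (std::size_t i = 0; i < n; ++i)
        forces[i] = partial_forces[i] + partial_forces[i + n];
    return forces;
}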
2.4 Results
All kernels were run on an ATI X1900XTX PCIe graphics card in a Dell Dimension
8400 with pre-release drivers from ATI (version 6.5) and the DirectX SDK of February
2006. A number of different force models were implemented with varying compute-to-
bandwidth ratios (see Table 2.1). A sample code listing is provided in the appendix
(2.7.1) to show the details of how flops are counted.

Figure 2.2: Performance improvement for the LJC(sigmoidal) kernel with i-particle
replication for several values of N (4096, 2048, 1024 and 768).

Kernel            GMX Million Intrxn/s   GMX ns/day   GPU Million Intrxn/s   GPU ns/day
LJC(constant)              66                11.4              2232               386
LJC(linear)*               33                 5.7              2271               392
LJC(sigmoidal)             40                 6.9              1836               317

Table 2.3: Comparison of GROMACS (GMX) running on a 3.2 GHz Pentium 4 vs. the
GPU, showing the estimated simulation time per day for a 1000 atom system.
*GROMACS does not have an SSE inner loop for LJC(linear)
To compare against the CPU, specially optimized versions of the GA and GAJ
kernels were written since no software suitable for a direct comparison to the GPU
existed. The work of [74] uses SSE for the GAJ kernel but does some parts of the
calculation in double precision which makes it unsuitable for a direct comparison. The
performance they achieved is comparable to the performance achieved here. Using
SSE intrinsics and Intel’s C++ Compiler v9.0, sustained performance of 3.8 GFlops
on a 3.0 GHz Pentium 4 was achieved.
GROMACS [56] is currently the fastest performing molecular dynamics software
with hand-written SSE assembly loops. As mentioned in Section 2.2 the CPU can do
out-of-order writes without a significant penalty. GROMACS uses this fact to halve
the number of calculations needed in each force calculation step. In the comparison
against the GPU in Table 2.3 the interactions per second as reported by GROMACS
have been doubled to reflect this. Also shown in the table are the estimated nanosec-
onds one could simulate in a day for a system of 1000 atoms - all O(N) operations
such as constraints and updates have been neglected in this estimate, as they consume
less than 2% of the total runtime. The GPU calculation thus represents an order of
magnitude improvement over existing methods on CPUs.
Figure 2.3: Speed comparison (billions of interactions per second) of the CPU (observed),
GPU (observed) and GRAPE-6A (theoretical peak) for the GA, GAJ and LJC(constant)
kernels.
2.5 Discussion
2.5.1 Comparison to other Architectures
Figure 2.3 shows a comparison of interactions/sec between the ATI X1900XTX, GRAPE-
6A and a Pentium 4 3.0 GHz. The numbers for the GPU and CPU are observed values,
those for GRAPE-6A are for its theoretical peak. Compared to GRAPE-6A, the GPU
can calculate over twice as many interactions when only the acceleration is computed,
and a little over half as many when both the acceleration and jerk are computed. The
GPU bests the CPU by 35x, 39x and 15x for the GA, LJC(constant) and GAJ kernels
respectively.
Another important metric is performance per unit of power dissipated. These
results can be seen in Figure 2.5. Here the custom design and much smaller on-board
memory allows GRAPE-6A to better the GPU by a factor of 4 for the GAJ kernel,
although they are still about equal for the GA kernel. The power dissipation of the
Intel Pentium 4 3.0 GHz is 82W [43], the X1900XTX is 120W [4], and GRAPE-6A’s
dissipation is estimated to be 48W since each of the 4 processing chips on the board
dissipates approximately 12W [62].
The advantages of the GPU become readily apparent when the metric of perfor-
mance per dollar is examined (Figure 2.4). The current price of an Intel Pentium 4
630 3.0GHz is $100, an ATI X1900XTX is $350, and an MD-GRAPE3 board costs
$16000 [42]. The GPU outperforms GRAPE-6A by a factor of 22 for the GA kernel
and 6 for the GAJ kernel.
Figure 2.4: Performance per U.S. dollar (millions of interactions per second per dollar)
of the CPU (observed), GPU (observed), GRAPE (theoretical peak) and MD-GRAPE3
(observed) for the GA, GAJ and LJC(constant) kernels.

Figure 2.5: Millions of interactions per Watt of the CPU (observed), GPU (observed),
GRAPE (theoretical peak) and MD-GRAPE3 (observed) for the GA, GAJ and
LJC(constant) kernels.
2.5.2 Hardware Constraints
The 4×4 unrolling that is possible with the GA kernel does not work for the other,
more complicated kernels. For example, the GAJ kernel requires two outputs per
particle (jerk in addition to acceleration). This reduces the maximum unrolling pos-
sibility to 2×4 because the GPU is limited to a maximum of 4 outputs per kernel.
However, even this amount of unrolling doesn’t work because the compiler cannot
fit the kernel within the 32 available registers. The number of registers is also what
prevents the LJC kernels from being unrolled by 4×4 instead of 2×4.
This apparent limitation due to the number of registers appears to result from
compiler inefficiencies; the authors are currently hand coding a 2×4 GAJ kernel
directly in pixel shader assembly which should cause the kernel to become compute
bound and greatly increase its performance. The performance gain of unrolling the
LJC kernels to 4×4 by rewriting them in assembly would most likely be small since
these kernels are already compute bound.
While the maximum texture size of 4096×4096 and 512 MB would make it pos-
sible to store up to 16 million particles on the board at a time, this really isn’t
necessary. In fact, GRAPE-6A only has storage for 131,000 particles on the board
at any one time. This is small enough to occasionally seem restrictive - a good bal-
ance is around 1 million particles which could easily be accommodated by 64MB. If
board manufacturers wanted to produce cheaper boards specifically for use in these
kinds of computations they could significantly reduce the cost without affecting the
functionality by reducing the amount of onboard RAM.
The current limits on the number of instructions also impact the efficiency of large
GPGPU programs. On ATI hardware, the maximum shader length of 512 instructions
limits the amount of loop unrolling and the complexity of the force functions one can
handle. On NVIDIA hardware, the dynamic instruction limit restricts us to very small
systems unless multi-pass techniques are used, which affect the cache efficiency
and therefore the performance of the proposed algorithms.
2.5.3 On-board Memory vs. Cache Usage
As mentioned in Section 2.3.2 one expects the kernels to make very efficient use of
the cache on the boards. At most 512 threads are in flight on the ATI
X1900XTX at any one time [4], and in the ideal situation, each of these threads will
try and access the same j-particle at approximately the same time. The first thread
to request a j-particle will miss the cache and cause the particle to be fetched from
on-board memory; however, once it is in the cache, all the threads should be able to
read it without it having to be fetched from on-board memory again.
For example, in the case of the GA kernel with 65,536 particles, there would
be 16,384 fragments to be processed, and if fragments were processed in perfectly
separate groups of 512, then 32 groups would need to be processed. Each group
would need to bring in 65,536 particles from main memory to the cache resulting in
an extremely low memory bandwidth requirement of 38.2 MB/sec.
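As a rough consistency check of that figure (assuming 16-byte float4 particles and the measured rate of 4.97 giga-interactions/s from Table 2.2):

32 groups × 65,536 particles × 16 bytes ≈ 33.6 MB of fetches per force evaluation,
65,536² interactions / (4.97 × 10⁹ interactions/s) ≈ 0.86 s per force evaluation,
33.6 MB / 0.86 s ≈ 39 MB/s,

which is in line with the 38.2 MB/sec quoted above.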
Of course, the reality is that particles are not processed in perfectly separate
groups of 512 particles that all request the same particle at the same time, but by
using ATITool [5] to adjust the memory clock of the board one can determine how
much bandwidth each kernel actually needs to main memory. The results of this
testing can be seen in Figure 2.6.
The performance degradation occurs at approximately 11.3, 5.2, and 2.1 GB/sec
for the LJC, GAJ and GA kernels respectively. The LJC kernels must also read in an
exclusion list for each particle which does not cache as well as the other reads, and is
the reason why their bandwidth to main memory is higher than that of the gravity
kernels. The number for the GA kernel suggests that approximately 10 particles are
accessing the same j-particle at once.
At memory speeds above 500MHz all the kernels run very near their peak speed,
thus board manufacturers could not only use less RAM, they could also use cheaper
RAM if they were to produce a number of boards that would only be used for these
calculations. This would reduce the cost and power requirements over the standard
high end versions used for gaming.
Figure 2.6: GFlops achieved as a function of memory clock speed (MHz) for the GA,
GAJ, LJC(sigmoidal), LJC(linear) and LJC(constant) kernels.
2.6 Conclusion
The processing power of GPUs has been successfully used to accelerate pairwise force
calculations for several commonly used force models in stellar and molecular dynam-
ics simulations. In some cases the GPU is more than 25 times as fast as a highly
optimized SSE-based CPU implementation and exceeds the performance of GRAPE-
6A, which is hardware specially designed for this task. Furthermore, the performance
is compute bound, so this work is well poised to take advantage of further increases
in the number of ALUs on GPUs, even if memory subsystem speeds do not increase
significantly. Because GPUs are mass produced, they are relatively inexpensive and
their performance to cost ratio is an order of magnitude better than the alternatives.
The wide availability of GPUs will allow distributed computing initiatives like Fold-
ing@Home to utilize the combined processing power of tens of thousands of GPUs to
address problems in structural biology that were hitherto computationally infeasible.
It is safe to conclude that the future will see some truly exciting applications of GPUs
to molecular dynamics.
2.7 Appendix
2.7.1 Flops Accounting
To detail how flops are counted a snippet of the actual Brook code for the GA ker-
nel is presented. The calculation of the acceleration on the first i-particle has been
commented with the flop counts for each instruction. In total, the calculation of
the acceleration on the first i-particle performs 76 flops. Since four interactions are
computed, this amounts to 19 flops per interaction.

float3 d1, d2, d3, d4, outaccel1;
float4 jmass, r, rinv, rinvcubed, scalar;

d1 = jpos1 - ipos1;                                      // 3
d2 = jpos2 - ipos1;                                      // 3
d3 = jpos3 - ipos1;                                      // 3
d4 = jpos4 - ipos1;                                      // 3
r.x = dot(d1, d1) + eps;                                 // 6
r.y = dot(d2, d2) + eps;                                 // 6
r.z = dot(d3, d3) + eps;                                 // 6
r.w = dot(d4, d4) + eps;                                 // 6
rinv = rsqrt(r);                                         // 4
rinvcubed = rinv * rinv * rinv;                          // 8
scalar = jmass * rinvcubed;                              // 4
outaccel1 += scalar.y*d2 + scalar.z*d3 + scalar.w*d4;    // 18
if (Ilist.x != Jlist1.x)   // don't add force due to ourself
    outaccel1 += scalar.x * d1;                          // 6
Chapter 3
Structured PDE Solvers on CELL
and GPUs
3.1 Introduction
In this section the implementation of flow solvers using GPUs and the CELL are
examined. Specifically, the focus is on solving the compressible Euler equations in
complicated geometry with a multi-block structured code. Flow solvers had been
implemented on these platforms before (see the next section), but a real large-scale
engineering application had never been demonstrated. In this section, a real engi-
neering flow calculation running on a single GPU with “engineering” accuracy and
numerics is presented. It demonstrates the potential of these processors for high per-
formance scientific computing. The CELL presented many more difficulties and its
level of performance was far below expectations, leading us to terminate our work on
that architecture. The difficulties and performance achieved are described after the
GPU work.
3.2 Review of prior work on GPUs
The current state of the art in applying GPUs to computational fluid mechanics is
either simulations for graphics purposes emphasizing speed and appearance over accu-
racy, or simulations generally dealing with 2D geometries and using simpler numerics
not suited for complex engineering flows. Some previous efforts in this direction
are now reviewed. The most notable work of engineering significance is the work of
Brandvik [11] who solved an Euler flow in 3D geometry.
Kruger and Westermann[51] implemented basic linear operators (vector-vector
arithmetic, matrix-vector multiplication with full and sparse matrices) and measured
a speed-up around 12–15 on ATI 9800 compared to Pentium 4 2.8 GHz. Applications
to the conjugate gradient method and the Navier-Stokes equations in 2D are pre-
sented. Rumpf and Strzodka[81] applied the conjugate gradient method and Jacobi
iterations to solve non-linear diffusion problems for image processing operations.
Bolz et al.[9] implemented sparse matrix solvers on GPU using the conjugate
gradient method and a multigrid acceleration. Their approach was tested on a 2D
flow problem. A 2D unit square was chosen as test case. A speed-up by 2 was
measured with a GeForce FX.
Goodnight et al.[34] implemented the multigrid method on GPUs for three appli-
cations: simulation of heat transfer, modeling of fluid mechanics, and tone mapping of
high dynamic range images. For the fluid mechanics application, the vorticity-stream
function formulation was applied to solve for the vorticity field of a 2D airfoil. This
was implemented on NVIDIA GeForceFX 5800 Ultra using Cg. A speed-up of 2.3
was measured compared to an AMD Athlon XP 1800.
In computer graphics where accuracy is not essential but speed is, flow simula-
tions using the method of Stam[89] are very popular. It is a semi-Lagrangian method
and allows large time-steps to be applied in solving the Navier-Stokes equations with
excellent stability. Though the method is not accurate enough for engineering compu-
tation, it does capture the characteristics of fluid motion with nice visual appearance.
Harris et al.[35] performed a rather comprehensive simulation of cloud visualization
based on Stam’s method[89]. Partial differential equations describe fluid motion,
thermodynamic processes, buoyant forces, and water phase transitions. Liu et al.[57]
performed various 3D flow calculations, e.g. flow over a city, using Stam’s method[89].
Their goal is to have a real-time solver along with visualization running on the GPU.
A Jacobi solver is used with a fixed number of iterations in order to obtain a satis-
factory visual effect.
The Lattice-Boltzmann model (LBM) is attractive for GPU processors since it is
simple to implement on sequential and parallel machines, requires a significant com-
putational cost (therefore benefits from faster processors) and is capable of simulating
flows around complex geometries. One should be aware of some limitations of this
approach; what is gained in terms of algorithm simplicity is often lost in terms of
overall accuracy and various physical/numerical limitations (see review by Khalighi
et al.[49]). Li et al.[55, 54] obtained a speed-up around 6 using Cg on an NVIDIA
GeForce FX 5900 Ultra (vs. Pentium 4 2.53 GHz). See the work of Fan et al.[28] using
a GPU cluster.
Scheidegger et al.[82] ported the simplified marker and cell (SMAC) method[3] for
time-dependent incompressible flows. SMAC is a technique used primarily to model
free surface flows. Scheidegger performed several 2D flow calculations and obtained
speed-ups on NV 35 and NV 40 varying from 7 to 21. The error of the results was in
the range 10⁻²–10⁻³. See also the recent review by Owens et al.[79].
The work of Brandvik et al.[11] is the closest to our own. They implement a 2D
and 3D compressible solver on the GPU in both BrookGPU and Nvidia’s CUDA. They
achieve speedups of 29 (2D) and 16 (3D) respectively, although the 3D BrookGPU
version achieved a speedup of only 3. A finite volume discretization with vertex
storage and a structured grid of quadrilaterals was used. No multi-grid or multiple
blocks were used.
3.3 Flow Solver
The Navier-Stokes Stanford University Solver (NSSUS) solves the three-dimensional
Unsteady Reynolds Averaged Navier-Stokes (URANS) equations on multi-block meshes
using a vertex-centered solution with first to sixth order finite difference and artificial
dissipation operators based on work by Mattson[65], Svard[91], and Carpenter[16] on
Summation by Parts (SBP) operators. Boundary conditions are implemented using
penalty terms based on the Simultaneous Approximation Term (SAT) approach[16].
Geometric multigrid with support for irregular coarsening of meshes is also imple-
mented. The SBP and SAT approaches allow for provably stable handling of the
boundary conditions (both physical boundaries and boundaries between blocks). The
numerics of the code are investigated in the work of Nordström et al. [76].
This work focuses on a subset of the capabilities in NSSUS, namely the steady
solution of the compressible Euler equations which come about if the viscous effects
and heat transfer in the Navier-Stokes equations are neglected. Flows modeled using
the Euler equations are routinely used as part of the analysis and design of transonic
and supersonic aircraft, missiles, hypersonic vehicles, and launch vehicles. Current
GPUs are well suited to solving the Euler equations since the use of double precision,
needed for the fine mesh spacing required to properly resolve the boundary layer in
RANS simulations, is not necessary.
The non-dimensional Euler equations in conservation form are
∂W/∂t + ∂E/∂x + ∂F/∂y + ∂G/∂z = 0,        (3.1)

where W is the vector of conserved flow variables and E, F, and G are the Euler flux
vectors defined as:

W = [ρ, ρu, ρv, ρw, ρe],
E = [ρu, ρu² + p, ρuv, ρuw, ρuh],
F = [ρv, ρuv, ρv² + p, ρvw, ρvh],
G = [ρw, ρuw, ρvw, ρw² + p, ρwh].

In these equations, ρ is the density, u, v, and w are the cartesian velocity components,
p is the static pressure, and h is the total enthalpy, related to the total energy by
h = e + p/ρ. For an ideal gas, the equation of state may be written as

p = (γ − 1) ρ [ e − ½ (u² + v² + w²) ].        (3.2)

For the finite difference discretization a coordinate transformation from the physical
coordinates (x, y, z) to the computational coordinates (ξ, η, ζ) is performed to yield

∂Ŵ/∂t + ∂Ê/∂ξ + ∂F̂/∂η + ∂Ĝ/∂ζ = 0,        (3.3)

where Ŵ = W/J, J is the coordinate transformation Jacobian, and

Ê = (1/J)(ξ_x E + ξ_y F + ξ_z G),   F̂ = (1/J)(η_x E + η_y F + η_z G),   Ĝ = (1/J)(ζ_x E + ζ_y F + ζ_z G).

Discretizing the spatial operators results in a system of ordinary differential equations

d/dt ( W_ijk / J_ijk ) + R_ijk = 0,        (3.4)
at every node in the mesh. An explicit five-stage Runge-Kutta scheme using modified
coefficients for a maximum stability region is used to advance the equations to a
steady state solution. Computing the residual R is the main computational cost;
it includes the inviscid Euler fluxes, the artificial dissipation for stability, and the
penalty terms for the boundary conditions. The penalty states, obtained either from
physical boundary conditions or (for internal block boundaries) from the value of the
flow solution in another block, are used to compute the penalty terms. Geometric
multi-grid is used to speed up convergence.
In the next sections, the implementation of NSSUS on GPUs is described. This
work was accomplished using BrookGPU. The algorithms required to implement
NSSUS on the GPU are discussed and numerical results and performance measure-
ments are reported.
3.4 Numerical accuracy considerations and performance comparisons between CPU and GPU
Producing identical results in a CPU and GPU implementation of an algorithm is,
perhaps surprisingly, not a simple matter. Even if the exact same sequence of
instructions is executed on each processor, it is quite possible for the results to be
different. Current GPUs do not support the entire IEEE-754 standard. Some of the
deviations are not, in the author’s experience, generally a concern: not all rounding
modes are supported; there is no support for denormalized numbers; and NaN and
floating point exceptions are not handled identically. However, other differences are
more significant and will affect most applications: division and square root are imple-
mented in a non-standard-compliant fashion, and multiplication and addition can be
combined by the compiler into a single instruction (FMAD) which has no counterpart
on current CPUs.

X = A*B + C;   // FMAD
This instruction truncates the result of the intermediate multiplication leading to
different behavior than if the operations were performed sequentially[78].
There are other differences between the architectures that can cause even a se-
quence of additions and multiplications (without FMADs) to yield different results.
This is because the FPU registers are 80-bit on CPUs but only 32-bit on current
generation GPUs. If the following sequence of operations was performed:

C = 1E5 + 1E-5;   // C is in a register
D = 10 * C;       // C is still in a register, so is D
E = D - 1E6;      // the result E is finally written to memory

on a GPU, E would be 0, while on a CPU it would contain the correct result of
0.0001. The result of the initial addition would be truncated to 1E5 to fit in the 32-bit
registers of a GPU, unlike the CPU, where the 80-bit registers can represent the result
of the addition.
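The effect is easy to reproduce on the host; the snippet below is illustrative only (it is not part of NSSUS), and on a CPU the exact outcome depends on whether intermediates are kept in extended-precision x87 registers or in 32-bit SSE registers.

#include <cstdio>

int main() {
    // Pure 32-bit arithmetic: 1e-5 is below the rounding unit of 1e5 in float,
    // so it is lost immediately and e comes out exactly 0.
    float c = 1e5f + 1e-5f;
    float d = 10.0f * c;
    float e = d - 1e6f;

    // Wider intermediate precision keeps the small term and recovers 1e-4.
    long double C = 1e5L + 1e-5L;
    long double D = 10.0L * C;
    long double E = D - 1e6L;

    std::printf("32-bit: %g   extended: %Lg\n", e, E);
    return 0;
}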
Evaluation of transcendental functions is also likely to produce different results,
especially for large values of the operand.
To further complicate matters, CPUs have an additional SIMD unit that is sepa-
rate from the traditional FPU. This unit has its own 128-bit registers that are used
to store either 4 single precision or 2 double precision numbers. This has implications
both for speed-up and accuracy comparisons. Each number is now stored in its own
32-bit quarter of the register. The above operations would yield the same result on
both platforms if the CPU was using the SIMD unit for the computation.
In addition, by utilizing the SIMD unit, the CPU performs these 4 operations
simultaneously which leads to a significant increase in performance. Unfortunately,
the SIMD unit can only be directly used by programming in assembly language or
using “intrinsics” in a language such as C/C++. Intrinsics are essentially assembly
language instructions, but allow the compiler to take care of instruction order opti-
mization and register allocation. In most scientific applications writing at such a low
level is impractical and rarely done; instead compilers that “auto-vectorize” code have
been developed. They attempt to transform loops so that the above SIMD operations
can be used.
3.5 Mapping the Algorithms to the GPU
3.5.1 Classification of kernel types
In mapping the various algorithms to the GPU it is useful to classify kernels into four
categories based on their memory access patterns. All of the kernels that make up
the entire PDE solver can be classified into one of these categories. Portions of the
computation that are often referred to as a unit, the artificial dissipation for example,
are often composed of a sequence of many different kernels. For each kernel type a
simple example of sequential C code is given, followed by how that code would be
transformed into streaming BrookGPU code.
The categories are:
Pointwise. When all memory accesses, possibly from many different streams, are
from the same location as the output location of the fragment. A simple example
of this type of kernel would be calculating momentum at all vertices by multiplying
the density and velocity at each vertex. Kernels of this type often have much greater
computational density than the following three types of kernels.

for (int i = 0; i < 100; ++i)
    c[i] = a[i] + b[i];

would be transformed into the above add kernel.
Stencil. Kernels of this type require data that is spatially local to the output loca-
tion of the fragment. The data may or may not be local in memory depending on
how the 3D data is mapped to 2D space. Difference approximations and multigrid
transfer operations lead to kernels of this type. These kernels often have a very low
computational density, often performing only one arithmetic operation per memory
load.

for (int x = left; x < right; ++x)
    for (int y = bottom; y < top; ++y)
        res[x][y] = (func[x+1][y] + func[x-1][y] + func[x][y+1]
                   + func[x][y-1] - 4*func[x][y]) / delta;

would become
kernel void res( float delta, float func[][], out float res<> ) {
    float2 my_index = indexof( res ).xy;
    float2 up    = my_index + float2( 0, 1 );
    float2 down  = my_index - float2( 0, 1 );
    float2 right = my_index + float2( 1, 0 );
    float2 left  = my_index - float2( 1, 0 );
    res = ( func[up] + func[down] + func[right] + func[left]
            - 4*func[my_index] ) / delta;
}
Unstructured gather. While connectivity inside a block is structured, the blocks
themselves are connected in an unstructured fashion. To access data from neighbor-
ing blocks, special data structures are created to be used by gather kernels which
consolidate non-local information. Copying the sub-faces of a block into their own
sub-face stream is a special case of this kind of kernel.

kernel void unstructGather( float2 pos[][], float data[],
                            out float reshuffle<> ) {
    float2 my_index = indexof( reshuffle );
    float2 gatherPos = pos[ my_index ];
    reshuffle = data[ gatherPos ];
}
The contents of the pos stream are indices that are used to access elements of the
data stream.
Reduction. Reduction kernels are used to monitor the convergence of the solver. A
reduction kernel outputs a single scalar by performing a commutative operation on all
the elements of the input stream. Examples include the sum, product or maximum
of all elements in a stream. Reduction operations are implemented in Brook using
efficient tree data structures and an optimal number of passes [13].
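The tree idea behind such reductions can be sketched on the host as follows (an illustrative C++ analogue, not the Brook implementation of [13]): each pass combines pairs of elements, halving the stream until one value remains, so N elements require about log2(N) passes.

#include <vector>

// Multi-pass pairwise (tree) reduction: each pass halves the stream length.
float tree_sum(std::vector<float> v) {
    while (v.size() > 1) {
        std::vector<float> next((v.size() + 1) / 2);
        for (std::size_t i = 0; i < next.size(); ++i) {
            float a = v[2*i];
            float b = (2*i + 1 < v.size()) ? v[2*i + 1] : 0.0f;
            next[i] = a + b;          // one "pass" over the current stream
        }
        v.swap(next);
    }
    return v.empty() ? 0.0f : v[0];
}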
Figure 3.1: Array of Structures (the x, y and z components are interleaved per element).

Figure 3.2: Structure of Arrays (all x components are stored contiguously, then all y, then all z).
3.5.2 Data layout
Because the entire iterative loop of the solver is performed on the GPU, the data
layout used by the CPU need not constrain the data layout used by the BrookGPU
version of NSSUS. A one-time translation to and from the GPU format can be done at
the beginning and end of the complete solve with minimal overhead. This translation
takes on the order of one second, whereas solves take minutes to tens of minutes.
Until the release of the G80 from NVIDIA, all graphics processors had 4-wide
SIMD processors; the latest ATI card, the R600, will retain this design. For maximum
efficiency on these vector designs, data should be laid out using a structure of arrays
(SoA), see figure 3.2, instead of the more convenient array of structures (AoS), see
figure 3.1, so that the full vector capability of the processor is utilized every cycle.
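For concreteness, the two layouts can be sketched as follows for the mesh metrics; the field names are illustrative and not those used in NSSUS.

#include <vector>

// Array of Structures (AoS): one record per vertex; the components of each
// quantity are interleaved in memory (convenient, maps to float3/float4 reads).
struct MetricsAoS {
    float xi_x, xi_y, xi_z;
    float eta_x, eta_y, eta_z;
    float zeta_x, zeta_y, zeta_z;
};
// e.g. std::vector<MetricsAoS> metrics(num_vertices);

// Structure of Arrays (SoA): each component is its own contiguous stream, so
// four consecutive vertices can fill one 4-wide SIMD operand, but the three
// float3 streams of the AoS layout become nine separate kernel inputs.
struct MetricsSoA {
    std::vector<float> xi_x, xi_y, xi_z;
    std::vector<float> eta_x, eta_y, eta_z;
    std::vector<float> zeta_x, zeta_y, zeta_z;
};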
Unfortunately, such a data layout presents a number of problems. The main one
is that the mesh metrics which would be stored as 3 float3 streams in AoS become
9 float4 streams in SoA. The maximum number of inputs to any kernel is 16, a
hardware limitation, and having 9 of those taken up by just the metrics means it will
not be possible to get all the necessary data into some of the kernels.
The second difficulty SoA introduces is that along the direction that data is packed
into float4s, mesh dimensions are forced to be multiples of 4, when in reality they
almost never are. This could, of course, be surmounted with sufficient effort and
increased complexity of the software.
Finally, since NVIDIA has moved to scalar chips, it should theoretically not matter
which format is used on their future cards. Even on ATI cards, AoS harnesses more
than 3/4th of the available computational power by use of intrinsics such as dot
product and length as well as combining floats into float3 or float4 when possible.
For all of these reasons, in this project, it was decided to go with the simpler (from
a software engineering standpoint) AoS format.
To lay the 3D data out in the 2D texture memory, the standard “flat” 3D texture
approach[53] was used where each 2D plane making up the 3D data is stored at a
different location in the 2D stream. This leads to some additional indexing to figure
out a fragment’s 3D index from its location in the 2D stream (12 flops) and also
additional work to convert 3D indices back to 2D locations (9 flops).

kernel float3 where_am_i( float2 index,
                          float sizex, float sizey, float dx ) {
    float3 my_loc;
    my_loc.x = fmod( index.x, sizex );
    my_loc.y = fmod( index.y, sizey );
    my_loc.z = floor( index.x / sizex ) +
               dx * floor( index.y / sizey );
    return my_loc;
}

kernel float2 newZIndex( float3 my_loc, float dz,
                         float sizex, float sizey, float dx ) {
    float2 new_index;
    new_index.x = fmod( my_loc.z + dz, dx ) * sizex + my_loc.x;
    new_index.y = floor( ( my_loc.z + dz ) / dx ) * sizey + my_loc.y;
    return new_index;
}
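A host-side mirror of this indexing arithmetic (hypothetical C++; dx above corresponds to tiles_per_row here) makes it easy to check that the 2D-to-3D and 3D-to-2D conversions round-trip correctly:

#include <cassert>

struct Loc3 { int x, y, z; };

// 2D flat-layout location -> 3D index (mirrors where_am_i above).
Loc3 to_3d(int ix, int iy, int sizex, int sizey, int tiles_per_row) {
    return { ix % sizex, iy % sizey,
             ix / sizex + tiles_per_row * (iy / sizey) };
}

// 3D index (optionally shifted by dz slices) -> 2D flat-layout location
// (mirrors newZIndex above).
void to_2d(const Loc3& p, int dz, int sizex, int sizey, int tiles_per_row,
           int& ix, int& iy) {
    ix = ((p.z + dz) % tiles_per_row) * sizex + p.x;
    iy = ((p.z + dz) / tiles_per_row) * sizey + p.y;
}

int main() {
    const int sizex = 8, sizey = 8, tiles_per_row = 4;   // 4 z-slices per row of tiles
    for (int iy = 0; iy < 2 * sizey; ++iy)
        for (int ix = 0; ix < tiles_per_row * sizex; ++ix) {
            Loc3 p = to_3d(ix, iy, sizex, sizey, tiles_per_row);
            int jx, jy;
            to_2d(p, 0, sizex, sizey, tiles_per_row, jx, jy);   // dz = 0: identity
            assert(jx == ix && jy == iy);
        }
    return 0;
}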
Data for each block in the multi-block topology is stored in separate streams; the
solver loops over the blocks and processes each one sequentially.
3.5.3 Summary of GPU code
A summary of the code execution is shown in Figure 3.3. The existing preprocess-
ing subroutines implemented on the CPU are unchanged. Additional GPU specific
preprocessing code is run on the CPU to setup the communication patterns between
blocks, and the treatment of the penalty states and penalty terms. The transfer of
data from the host to the GPU includes the initial value of the solution, preprocessed
quantities computed from the mesh coordinates, and weights and stencils used in the
multigrid scheme. Once the data is on the GPU the solver runs in a closed loop. The
only data communicated back to the host are the L2 norms of the residuals which
are used for monitoring the convergence of the code, and the current solution if the
output to a restart file is requested. The number of lines for the GPU implementation
is approximately: 4,500 lines of Brook code, 8,000 lines of supporting C++ code and
1,000 lines of new Fortran code. The original NSSUS code is in Fortran. It took
approximately 4 months to develop the necessary algorithms and make the changes
to the original code.
Figure 3.3: Flowchart of NSSUS running on the GPU. Preprocessing, GPU-specific
preprocessing, and the transfer of data to the GPU run on the CPU; the solver then runs
in a closed loop on the GPU before data is returned to the host and the output file is
written. The residual computation on the GPU consists of the inviscid flux, artificial
dissipation, multigrid forcing terms, inviscid residual, copying of sub-face data,
block-to-block communication, the penalty state for physical boundary conditions, and
the penalty terms. The solver loop is:

while iteration < iterationsMax and solution not converged:
    loop over steps of the multigrid cycle:
        if prolongation step: transfer correction/solution to fine grid
        if restriction step: transfer solution and residual to coarse grid
        if smoothing step:
            compute residual
            compute time step
            store solution state
            update solution
            compute residual
            loop over remaining Runge-Kutta stages:
                update solution
                compute residual
            update solution
    compute L2 norm of the residual
3.5.4 Algorithms
Constraints from the geometry of the mesh may require that in some blocks, especially
at coarse multigrid levels, the differencing in some directions is done at a lower order
than otherwise desired if the number of points in that direction becomes too small.
To accommodate this constraint imposed by realistic geometries and also to avoid
writing 27 different kernels for each possible combination of order and direction (up
to third order is currently implemented on the GPU), all differencing stencils are
applied in each direction separately.
The numerics of the code are such that one-sided difference approximations are
used near the boundaries of the domain. A boundary point is designated as a point
where a special stencil is needed, and an interior point as a point where the normal
stencil is applied. This distinction presents a problem for parallel data processors
such as GPUs because boundary points perform a different calculation from interior
points and furthermore different boundary points perform different calculations. This
can lead to terrible branch coherency problems. See Figure 3.4. However, regardless
of the order of the discretization, the branching can be reduced to only checking if
the fragment is a boundary point or not. While the calculation for each boundary
point is different, it is always a linear combination of field values which can be com-
puted as a dot product between stencil coefficients and field values. Thus by using a
small 1D stream (that can be indexed using the boundary point’s own location) to
hold the coefficients, only one branch instead of three is required. (Note: we count
an if...else... statement as one branch.) The exact number depends on the
branch granularity of the hardware which is theoretically 4×4 on the 8800. However,
GPUbench [14] suggests that in practice 8×8 performs better than 4×4 and 16×16
even better than 8× 8. For 16× 16, the maximum possible number of branches is 8
– one interior point plus 4 right boundary points and 4 left boundary points (which
can be adjacent due to the flat 3D layout). This technique reduces this maximum
to two – one branch for interior points plus one branch for right and left boundary
points. Higher order differencing would benefit even more from this technique.
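A one-dimensional sketch of this idea is given below (hypothetical C++, not the GPU kernel): every boundary point, whichever closure it needs, runs the same dot-product code path against its own row of stored coefficients, so the only branch is boundary versus interior.

#include <vector>

// u: field values; closure_coeffs: 2*nb rows of width w (nb left closures
// followed by nb right closures); interior points use a central difference.
float apply_stencil(int i, int n, const std::vector<float>& u,
                    const std::vector<float>& closure_coeffs,
                    int nb, int w, float inv_dx)
{
    const bool left  = i < nb;
    const bool right = i >= n - nb;
    if (left || right) {                               // the single branch
        int row  = left ? i : nb + (i - (n - nb));     // which coefficient row to use
        int base = left ? 0 : n - w;                   // closures read from the domain edge
        float acc = 0.0f;
        for (int k = 0; k < w; ++k)
            acc += closure_coeffs[row * w + k] * u[base + k];
        return acc * inv_dx;
    }
    return 0.5f * inv_dx * (u[i + 1] - u[i - 1]);      // interior stencil
}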
Dealing with the boundary conditions and penalty terms in an efficient manner
is significantly more difficult than either of the two previous cases. Figure 3.5 shows
Figure 3.4: This figure illustrates the stencil in the x direction and the branching on the GPU. Each colored square represents a mesh node. The color corresponds to the stencil used for the node. Inner nodes (in grey) use the same stencil. For optimal efficiency, nodes inside a 4×4 square should branch coherently, i.e., use the same stencil (see square with a dashed line border). For this calculation, this is not the case near the boundary, which leads to inefficiencies in the execution. The algorithm proposed here reduces branching and leads to only one branch (instead of 3 here).
how sub-faces and penalty terms are computed for each block. The unstructured
connectivity between blocks leads to several sub-faces on each block. Each node on
the blue block must be penalized against the corresponding node on the adjacent
blocks. For example, the node on the blue block located at the intersection of all
four green blocks must be penalized against the corner node in each of the four green
blocks.
In Brook, it is not possible to stream over a subset of the entries in an array.
Instead one must go through all O(n³) entries and use if statements to determine
whether, and what type of, calculation needs to be performed. This leads to a significant
loss of performance since effectively only O(n²) entries ("surface" entries) need to be
operated on. The problem is made worse by the fact that certain nodes belong to
multiple faces, thereby requiring multiple passes. To solve these issues, it was decided
to copy the sub-face data into one smaller 2D stream (hereafter called the sub-face
stream), and to copy data from other blocks, where necessary for the internal penalty
states, into a second stream (called the neighbor stream). These streams are then used
to calculate the penalty state for physical boundary conditions and the penalty terms.
This step is computationally efficient since, for the most part, only the nodes which
need to be processed are processed. This is a strictly O(n²) step. Finally, the result is
applied back into the full 3D stream. This is shown in more detail in Figures 3.6 and 3.7.
The copying of the sub-face data into the sub-face stream is done by calculating
Figure 3.5: The continuity of the solution across mesh blocks is enforced by computing penalty terms using the SAT approach[16]. The fact that the connectivity between blocks is unstructured creates special difficulty. On this figure, for each node on the faces of the blue block, one must identify the face of one of the green blocks from which the penalty terms are to be computed. In this case, the left face of the blue block intersects the faces of four distinct green blocks. This leads to the creation of 4 sub-faces on the blue block. For each sub-face, penalty terms need to be computed. Note that some nodes may belong to several sub-faces.

Figure 3.6: To calculate the penalty terms efficiently for each sub-face, one first copies data from the 3D block into a smaller sub-face stream (shown on the right). In this figure, the block has 10 sub-faces. Assume that the largest sub-face can be stored in memory as a 2D rectangle of size nx × ny. In the case shown, the sub-face stream is then composed of 12 nx × ny rectangles, 2 of which are unused. Some of the space is occupied by real data (in blue); the rest is unused (shown in grey).

Figure 3.7: This figure shows the mapping from neighboring blocks to the neighbor stream used to process the penalty terms for the blue block. There are four large blocks surrounding the blue block (top and bottom not shown). They lead to the first 4 green rectangles. The other rectangles are formed by the two blocks in the front right and the four smaller blocks in the front left.
and storing the location in the full 3D stream from which each fragment in the sub-
face stream will gather. The copying of the data from other blocks into the neighbor
stream is done by pre-computing and storing the block number and the location
within that block from which each fragment in the sub-face stream gathers. This
kernel requires multiple blocks as input and must branch to gather from the correct
block. This is illustrated by the pseudo-code below, which can be implemented in
Brook:

kernel void buildNeighborStream( float block1[][],
                                 float block2[][],
                                 float3 donor_list<>,
                                 out float penalty_data<> ) {
    float  block        = donor_list.x;
    float2 gather_coord = donor_list.yz;
    if      (block == 1) penalty_data = block1[ gather_coord ];
    else if (block == 2) penalty_data = block2[ gather_coord ];
    ...
}
An important point to make is that this method automatically handles the case
of intersecting sub-faces (such as at edges and corners) where multiple boundary
conditions and penalty terms need to be applied. In that respect, this approach leads
to a significantly simpler code.
3.6 Results
The performance scaling of the code with block size is examined followed by an
investigation of the performance of each of the three main kinds of kernels. Then
the performance on meshes for complex geometries typical of realistic engineering
problems is examined. In all our tests, the CPU used was a single core of an Intel
Core 2 Duo E6600 (2.4 GHz, 4 MB L2 cache) and the GPU used was an NVIDIA
8800GTX (128 scalar processor cores at 1.35 GHz).
For all the results given below, a consistent accuracy compared to the original
single precision code in the range of 5 to 6 significant digits was observed, including
the converged solution for the hypersonic vehicle. This is the accuracy to be expected
since the GPU operates in single precision. This good behavior is partly a result
of considering the Euler equation. The Navier-Stokes equations for example often
require a very fine mesh near the boundary to resolve the boundary layer. In that
case, differences in mesh element sizes may result in loss of accuracy.
3.6.1 Performance scaling with block size
Figure 3.8 shows the scaling of performance and speedup with respect to the block
size. These tests were run on single block cube geometries with freestream boundary
conditions on all faces. As the data set becomes larger than the L2 cache, the CPU
slows down by a factor of about two. On the other hand, when the data set
increases, the GPU becomes much more efficient, improving by about a factor of 100.
The GPU doesn’t reach its peak efficiency until it is working on streams with at least
32,000 elements.
Figure 3.8: Performance scaling with block size, 1st order: microseconds per vertex for
the CPU and GPU (single grid and multigrid), and the corresponding GPU speed-up
(single grid and multigrid), as a function of the number of vertices.
The multigrid cycle used in these and following tests was a 2 level V cycle. In
principle, multigrid should be used with more than 2 levels but for the compressible
Euler equations, the presence of shocks limits the number of grids which can be
efficiently used to two. Since our goal is to model a hypersonic vehicle in which
shocks are present, 2 grids were used throughout this work even in cases where there
is no shock. In 3D, 2 levels require computing on a grid approximately 8 times smaller
than the original and it is known from the single grid results that small grids will
be slower than larger ones; consequently, one would expect the multigrid solver to
be somewhat slower than the single grid. This is indeed the case. For 512 vertices,
multi-grid is about twice as slow. For larger grids, the performance of multi-grid
generally follows that of the single grid results but is slightly slower.
3.6.2 Performance of the three main kernel types
The three main different types of kernels have different performance characteristics
which will be examined here. For pointwise kernels, the inviscid flux kernel is con-
sidered; stencil kernels will be represented by the residual calculation (differencing
of the fluxes), and kernels with unstructured gathers by the boundary and penalty
terms calculation. Reduction kernels are not examined since they have been studied
elsewhere[13] and these kernels are less than one percent of the total runtime.
Figure 3.9 shows that the inviscid flux kernel scales similarly to the overall program
(figure 3.8) although with a more marked increase at the largest size. This kernel has
an approximately 1:1 ratio of flops to bytes loaded which suggests that it is still
limited by the maximum memory bandwidth of the card. Indeed, the largest mesh
achieved a bandwidth of 78 Gbytes/sec which is nearly the theoretical peak of the
card. The achievable memory bandwidth depends not only on the size of the data
stream, but also its shape. The second largest mesh has an x-dimension that is
divisible by 16, whereas the largest mesh has an x-dimension divisible by 128. This
is the likely reason for the variations between these two stream sizes.
The second type of kernel, the stencil computation, also follows the same basic
scaling pattern as the timings in Figure 3.8 (Figure 3.9). This particular kernel loads 5
bytes for every one (useful) flop it performs. This very poor ratio is due to loading the
differencing coefficients as well as some values which are never used – a byproduct
of the way some data is packed into float3s and float4s. Nonetheless, a very high
bandwidth for this type of kernel is achieved. The bandwidth is in fact higher than
the memory bandwidth of the card! This is possible because the 2D locality of the data
access allows the cache to be utilized very efficiently. The stencil coefficients, a
total of sixteen values, are also almost certainly kept in the cache.

Figure 3.9: Left: pointwise kernel performance (the inviscid flux calculation); right:
stencil kernel performance (the 3rd order residual calculation). Both plots show the
achieved GFlops and bandwidth (Gbytes/sec) as a function of the number of vertices.

Figure 3.10: Unstructured gather performance (boundary conditions and penalty terms
calculation): microseconds per vertex for the CPU and GPU, and the GPU speedup, as
a function of the number of vertices. The decrease in speed-up is due to an unavoidable
O(n³) vs. O(n²) algorithmic difference in one of the kernels that make up the boundary
calculations. See the discussion in the text.
The final type of kernel is the unstructured gather, which accounts for 3 of the 5 kernels
that make up the boundary and penalty term calculation. Its performance
and scaling can be seen in Figure 3.10. Startlingly, this routine does not see increased
efficiency with larger blocks and the speedup vs. the CPU actually decreases after
a point. To explain this, each of the individual kernels is examined. The first two
are unstructured gather kernels that copy data to the sub-face streams and they run,
as would be expected, at approximately the random memory bandwidth of the card
(∼ 10 GBytes/sec). The next two are pointwise calculations for the penalty terms
which behave much like the inviscid flux kernel. The last kernel applies the calculated
penalties to the volumetric data, which as mentioned above implies an implicit loop
over the entire volume even though one only wishes to apply the penalties to the
boundaries. This is unavoidable because of the inability in Brook to scatter outputs
to arbitrary memory locations. Even though most of the fragments do little work
other than determining if they are a boundary point or not, as the block grows the
ratio of interior to surface points increases and the overhead of all the interior points
determining their location slows the overall computation down. In practice however,
it is unlikely that the size of a given block will be larger than 2 million elements; so
in most practical situations, one is in the region where the GPU speed-up is large.
Figure 3.11: Three block C-mesh around the NACA 0012 airfoil.
Figure 3.12: Mach number around the NACA 0012 airfoil, M∞ = 0.63, α = 2.
3.6.3 Performance on real meshes
The NACA 0012 airfoil (from the National Advisory Committee for Aeronautics) is a
symmetric, 12% thick airfoil that is a standard test case geometry for computational
fluid dynamics codes. Figure 3.11 shows the mesh with three blocks used for this
simulation (C-mesh topology) and Figure 3.12 shows the Mach number around the
airfoil.
The CPU code was compiled with the following options using the Intel Fortran
compiler version 10: -O2 -tpp7 -axWP -ipo.
Table 3.1 shows the speed-ups for the NACA 0012 airfoil test case. As expected, the
speed-ups with multigrid are lower than with a single grid because the computations on
the coarser grids are not as efficient. However, over an order of magnitude reduction
in computation time is still achieved.
For our final calculations, the hypersonic vehicle configuration from Marta and
Table 3.1: Measured speed-ups for the NACA 0012 airfoil computation.
Order       Multigrid cycle   Speed-up
1st order   single grid       17.6
3rd order   single grid       15.1
1st order   2 grids           15.6
3rd order   2 grids           14.0
Alonso[64] was used. This is representative of a typical mesh used in the external
aerodynamic analysis of aerospace vehicles. It is a 15 block mesh; two versions were
used with approximately 720,000 and 1.5 million nodes. Because the blocks are
processed sequentially on the GPU, an important consideration is not only the overall
mesh size but the sizes of individual blocks. For the 1.5 million node mesh, the
approximate average block size is 100,000 nodes, with a minimum of 10,000 and a
maximum of 200,000 nodes. Figure 3.13 shows the Mach number on the surface of
the vehicle and the symmetry plane for a Mach 5 freestream.
Figure 3.13: Mach number – side and back views of the hypersonic vehicle.
In Table 3.2, one can see the same general trend for speed-ups as the problem size
and multigrid cycle are varied. Beyond just the pure speed-up, it’s also important
to note the practical impact of the shortened computational time. For example, a
converged solution for the 1.5M node mesh using a 2-grid multigrid cycle requires
approximately 4 CPU hours, but only about 15 minutes on one GPU!
Table 3.2: Speed-ups for the hypersonic vehicle computation
Mesh size   Multigrid cycle   Speed-up
720k        single grid       15.4
720k        2 grids           11.2
1.5M        single grid       20.2
1.5M        2 grids           15.8
3.7 Conclusion
Measured speed-ups range from 15x to over 40x. To demonstrate the capabilities
of the code a hypersonic vehicle in cruise at Mach 5 was simulated – something
out of the reach of most previous fluid simulation works on GPUs. The three main
types of kernels necessary for solving PDEs were presented and their performance
characteristics analyzed. Suggestions to reduce branch incoherency due to stencils
that vary at the boundaries were made. A novel technique to handle the complications
created by the boundary conditions and the unstructured multi-block nature of the
mesh was also developed.
Additional analysis has identified further ways in which the performance can be
improved. Performance on small blocks is lackluster and, unfortunately, with meshes
around realistic geometries, small blocks often cannot be avoided. By grouping all the
blocks into a single large texture, this problem could be avoided at the cost of increased
indexing difficulties. Also, NVIDIA’s new language, CUDA, offers some interesting
possibilities. It has an extremely fast memory shared by a group of processors, the
“parallel data cache”, which could be used to increase the memory bandwidth of the
stencil calculations even further. Scatter operations are also supported which means
that the application of the penalty terms (at block interfaces) could scale with the
number of surface vertices instead of the total number of vertices.
An important demonstration would be the use of a parallel computer with GPUs
for fluid dynamics simulations. This would establish the performance in a realistic
engineering setting. It will impose some interesting difficulties because while nodes
will be on the order of 10× faster, the network speeds and latencies will not have
changed and might be a bottleneck.
While the exact direction of future CPU developments is impossible to predict,
it seems very likely that they will incorporate many light computational cores very
similar in nature to the fragment shaders of current GPUs. The techniques presented
here should thus be applicable to the general purpose processors of tomorrow.
3.8 CELL Experiences
3.8.1 Amdahl’s Revenge
As discussed in section 1.3.4 the PPE on the CELL is significantly slower than the
normal PowerPC processor it is based on. In fact, for NSSUS, it is approximately 10× slower! Of course, the SPEs can be significantly faster. This leads to a new type of
Amdahl's Law on the CELL. Instead of

\[ \text{Speedup} = \frac{1}{(1 - P) + \frac{P}{S}} \]

where P is the parallel portion of the code and S is the speedup on that portion,
this yields a new law

\[ \text{Speedup} = \frac{1}{A\,(1 - P) + \frac{P}{S}} \]
where A is the slowdown of the PPE. Plotting this new law with A = 10 alongside
Amdahl’s original law as in figure 3.14 and figure 3.15 shows the dramatic impact
that this can have. For P < .9 it is impossible to achieve a speedup regardless of how
large S is. Even when P = .995, when only .5% of the code is still serial, figure 3.16
shows the speedup is four times less than what it could be if the PPE weren’t slower
than a normal processor. Note that on all of these plots, the maximum plotted S is
100. Since there are eight SPEs, each one would need to be over 12× as fast as a normal
processor to reach this speedup. In fact, their peak performance is 25.6 GFLOPs,
which is comparable to the performance of a normal processor. The next section
Figure 3.14: Amdahl's Law (A = 1) vs. CBE (A = 10) for P = .5 and P = .8; overall speedup plotted against the speedup of the parallel portion (S).

Figure 3.15: Amdahl's Law (A = 1) vs. CBE (A = 10) for P = .9 and P = .995; overall speedup plotted against the speedup of the parallel portion (S).
will show that the maximum speedup obtained was 10×; even with P = .995, this S
limits the overall speedup to 6.7×, significantly worse than for the GPU. For NSSUS,
there are about 220 routines that make up the last 0.5 percent of the runtime, so a
P value of .995 can be taken as a realistic maximum.
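To make the numbers above concrete, the following minimal C++ sketch (an illustration only, not part of NSSUS) evaluates both the classical Amdahl bound and the PPE-penalized version, using the A = 10 slowdown and the S = 10 kernel speedup quoted in this chapter.

#include <cstdio>

// Classical Amdahl speedup and the CELL variant with the serial portion slowed by A.
double amdahl(double P, double S)               { return 1.0 / ((1.0 - P) + P / S); }
double amdahl_cbe(double P, double S, double A) { return 1.0 / (A * (1.0 - P) + P / S); }

int main() {
    const double A = 10.0;                 // PPE slowdown measured for NSSUS
    const double S = 10.0;                 // kernel speedup reported in section 3.8.2
    const double Ps[] = {0.9, 0.995};
    for (double P : Ps)
        std::printf("P=%.3f  Amdahl=%.2f  CBE=%.2f\n",
                    P, amdahl(P, S), amdahl_cbe(P, S, A));
    // For P = 0.995 and S = 10 this prints a CBE speedup of about 6.7x, as quoted above.
    return 0;
}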
3.8.2 Implementation
The main type of computation considered here is that of the stencil type (see section
3.5.1). The generic problem considered is to apply a 3D stencil over a volume, with
the complication that it is not simply a stencil over one field, or array, but a stencil
over multiple fields, or arrays. For example, $f_i \neq f(a_{i-1}, a_i, a_{i+1})$ but rather $f_i = f(a_{i-1}, a_i, a_{i+1}, b_{i-1}, b_i, b_{i+1})$. In the actual NSSUS code the number of fields is thirteen
for the inviscid flux, artificial dissipation and inviscid residual!
Figure 3.16: Ratio of Amdahl's Law speedup to CBE speedup as a function of the speedup of the parallel portion (S), for P = .9 and P = .995.
The first factor to consider in implementing these kernels is what data must be
on an SPE to do the computation, keeping in mind that there is only 256 KB for the
data and the program code, and that, furthermore, the space for data has to be
divided into at least two separate buffers so that communication can be overlapped
with computation. Even worse, “halo” data must be brought in, which
means that the actual amount of data computed is smaller than the amount of
data brought in.
There is a competing constraint between performance and code size which exists
to some extent on all systems. On most modern systems however, code size is not
an issue and code is optimized for maximum speed. Code size most definitely is an
issue on the CELL because of the limited space on the SPE. But, to obtain maximum
performance, loops must be aggressively unrolled to eliminate branches and allow for
maximum instruction reordering by the compiler to eliminate data dependencies (the
SPEs do not have out-of-order capability; see section 1.3.4). In the case of NSSUS, the three
kernels (inviscid flux, artificial dissipation and inviscid residual) have a code size of
approximately 80 KB when loops are unrolled enough to achieve reasonable performance!
This leaves only 176 KB for data.
Each kernel $k_i$ has a different set of required arrays $A_{k_i}$. If $A_{k_i} \cap A_{k_j} = \emptyset$, then
there is no benefit, in terms of memory reuse, in trying to run $k_i$ and $k_j$ without
initiating new DMAs. However, often the overlap is significant and re-DMAing most
of the arrays would be a waste of memory bandwidth. The arrays needed for the
three kernels mentioned above are nearly identical.

Figure 3.17: Cell memory bandwidth treating each SPE as an independent co-processor; all of the SPEs share the single 25.6 GB/sec link to main memory.
One possible space saving method is to buffer the kernel code so that instead of
all the kernel codes taking up space on the SPE even when they aren’t being used,
only the kernel currently running is on the SPE while the kernel that is going to
be run next is being DMA’d in. When the current kernel is finished executing it
jumps to wherever the beginning of the next kernel was placed. There are a couple
of downsides to this.
1. It is complicated and error prone to actually implement. Debugging is difficult.
2. The size of the buffer must be the size of the largest kernel. If most of the
kernels are small and one is very large, the space savings may not actually be
significant.
3. Bandwidth that could be used for data is now instead being used for program
code.
Due to all of these considerations, this optimization was not attempted.
A second factor to consider is how to get data to the SPEs for them to perform
computation. The CELL has only 25.6 GB/sec of bandwidth from main memory to
all of the SPEs. The most straightforward use of the SPEs as eight independent
and homogeneous processors all simultaneously executing the same kernel leads to a
diagram like figure 3.17, in which all the SPEs are competing for this 25.6 GB/sec. This
bandwidth itself is already significantly lower than that of GPUs, which was about
100 GB/sec for the model used in the above work and is 160 GB/sec in the most
recent models. Therefore, this computational approach may lead to performance on
the CELL which is significantly lower than that of GPUs.

Figure 3.18: Cell memory bandwidth viewing each SPE as a step in a pipeline; 25.6 GB/sec to main memory and roughly 200 GB/sec in total around the ring.

Here, each
SPE would bring in one 3D block of all the necessary arrays, do the computation on
them while fetching the next block (buffering) and then start writing the results back
while getting the next block and beginning computation on the just fetched block.
The blocks are totally independent of one another.
The bandwidth around the element interconnect bus (EIB), which connects the
SPEs, can be significantly higher, approximately 200 GB/sec, but utilizing this bandwidth requires
algorithms that pass data between the SPEs. This requires an algorithmic complica-
tion - no longer can the SPEs be viewed as identical accelerators fetching data from
memory and writing the result back, but as steps in a pipeline that pass data to one
another, bringing data in at one side and eventually writing it back to main memory
at the other. This method addresses the problem of having multiple kernels taking
up space while only using one of them at a given time because each SPE would only
have one kernel. It has the major problem that, to make full use of the machine,
there must be exactly as many stages in the pipeline as there are SPEs: eight for the
CELL, or six for the CELL in the PlayStation 3 (PS3). In this case, a natural
breakdown of the computation into eight stages could not be found, ruling this
method out.
The solution that was found, which decreases the bandwidth needed from the SPEs
to main memory while retaining as much as possible the relative simplicity of the
homogeneous SPE approach, is a technique called circular buffering. The idea behind this method is
that instead of having each SPE process independent blocks, each SPE sweeps a
small rectangle in the x-y plane through the z-direction of the volume. This way it
is possible to reuse some of the data from the previous memory transfer in the next
calculation. A diagram of this procedure in 2D, for clarity, is in figure 3.19. As can be
seen in that figure, a problem that arises from this type of buffering is that arrays are
no longer completely contiguous and require some complicated indexing to address
the right location. If the computation were compute bound, then this extra math
would be a waste of possible computational power.
Another important advantage of circular buffering is that it reduces the amount
of space required for the buffer by two-thirds. Decreasing the size of the buffers
allows the size of the sub-block to be increased which is important for maximizing
the compute/bandwidth ratio.
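The following minimal host-side C++ sketch illustrates the indexing that this scheme implies; it is an illustration of the idea only, and the actual SPE implementation details (DMA setup, buffering of transfers, alignment padding) are omitted.

#include <cstddef>
#include <vector>

// An SPE sweeps a small x-y rectangle through z.  Only the newest z-plane must be
// transferred at each step; older planes stay resident and are addressed modulo the
// buffer depth, which is the "complicated indexing" mentioned above.
struct CircularZBuffer {
    int sx, sy, depth;                       // sub-block extent and number of resident z-planes
    std::vector<float> data;

    CircularZBuffer(int sx_, int sy_, int depth_)
        : sx(sx_), sy(sy_), depth(depth_),
          data(static_cast<std::size_t>(sx_) * sy_ * depth_) {}

    // Global plane z lives in slot z % depth; valid while z is one of the
    // 'depth' most recently loaded planes.
    float& at(int x, int y, int z) {
        const int slot = z % depth;
        return data[(static_cast<std::size_t>(slot) * sy + y) * sx + x];
    }
};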
Finally, some of the finer technical points regarding how memory DMAs on the
CELL work are considered. All memory transfers must be 16 byte aligned and transfer
128 bytes for maximum performance (32 single precision numbers). It is impossible
to start a transfer that is not 16 byte aligned; transfers can be smaller than
128 bytes, but performance will suffer by approximately the ratio of the actual
number of bytes to 128 bytes. These facts are important because of how a 3D array is
stored in main memory and how this dictates the way a small sub-block must be transferred.
The formula for determining the linear offset of a location in a 3D array is given by:

    linearOffset = sizeX * sizeY * zCoord + sizeX * yCoord + xCoord

The extent of a sub-block's x-coordinates will be (by definition) less than sizeX
so that there will be a jump in the linear offset each time the y-coordinate or z-
coordinate of the sub-block changes. Therefore, each sub-block (of size [sx, sy, sz])
cannot be transferred as one continuous memory copy, but as sy × sz copies, each of
length sx. The optimization problem is then: minimize the amount of time it takes to
transfer all of the sub-blocks, subject to the constraints that each sub-block can only
have 1000 floats (due to the space restrictions) and that the x size must be a multiple of
four (due to the alignment constraint). Long sub-blocks (in the x-direction) will transfer
more quickly because of the quirks of the CELL hardware, but more will need to be
transferred because of the poor surface-to-volume ratio. The solution of this particular
problem is a sub-block of size 16 × 10 × 6. This solution is very specific to this
application, although the technique for arriving at it is general. The z-direction is
chosen to be only six because that is the direction of the circular buffering and there is
no benefit to making it larger than the minimum necessary, because circular buffering
ensures no extra data is transferred in any case.

Figure 3.19: Circular buffering (2D diagram). The labels distinguish the sub-block size with halo, the size of the computed data, the data that is already complete, the data currently being DMA'd, and the portion that must be DMA'd for the next sub-block.
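As an illustration of this kind of search, the following C++ sketch enumerates candidate sub-block sizes under the two constraints quoted above. The cost model is a deliberate simplification (each of the sy × sz row transfers is charged a fraction of a full-width DMA, with a one-cell halo assumed), so it demonstrates the trade-off rather than reproducing the exact 16 × 10 × 6 choice made for NSSUS.

#include <cstdio>
#include <limits>

int main() {
    const int max_floats = 1000;   // space restriction quoted above
    double best_cost = std::numeric_limits<double>::max();
    int bx = 0, by = 0, bz = 0;

    for (int sx = 4; sx <= max_floats; sx += 4)              // x extent a multiple of four
        for (int sy = 1; sx * sy <= max_floats; ++sy)
            for (int sz = 1; sx * sy * sz <= max_floats; ++sz) {
                // Rows transferred (assumed one-cell halo in y and z); short rows still
                // pay for a large fraction of a 128-byte transfer.
                const double rows     = (sy + 2) * (sz + 2);
                const double row_len  = 4.0 * (sx + 2);       // bytes per row, halo included
                const double row_cost = (row_len < 128.0) ? 1.0 : row_len / 128.0;
                const int    interior = sx * sy * sz;         // points actually computed
                const double cost     = rows * row_cost / interior;
                if (cost < best_cost) { best_cost = cost; bx = sx; by = sy; bz = sz; }
            }
    std::printf("cheapest sub-block under this toy model: %d x %d x %d\n", bx, by, bz);
    return 0;
}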
The final performance figure, employing these optimization techniques along with
others related mostly to data alignment for DMA transfers and SIMD operations, is
a bandwidth-limited 10× speedup on the three aforementioned kernels combined.
However, these kernels make up only eighty percent of the program runtime. As
section 3.8.1 showed, this is not enough to achieve an overall program speedup. Even
the speedup achieved on the kernels alone is smaller than what can be achieved using
GPUs. This reality, coupled with the greater difficulty of developing software on the
CELL, led to the abandonment of this line of research in favor of GPUs.
The one advantage of the CELL over GPUs at the beginning of this research was
the ability to perform double precision operations. GPUs have since closed this gap
and have double precision performance approximately equal to that of the CELL,
while having, in general, significantly better single precision performance.
Chapter 4
Liszt - A Domain Specific
Language for Writing Codes on
Unstructured Meshes
4.1 Introduction
The last chapters have shown that a great deal of work and thought goes into achieving
maximum performance with these accelerator cards. In chapter 3, the same physics
and numerics were implemented on two different architectures which required two
very different implementation strategies. This is non-optimal from a number of per-
spectives:
• The amount of time spent writing and debugging code is approximately linearly
proportional to the number of architectures that must be supported.
• The total amount of code is approximately linearly proportional to the number
of architectures that must be supported, which means maintenance times and
costs are also linearly proportional to the number of architectures.
• The programmer must become adept at writing and optimizing code for multiple
architectures
There is clearly some possibility here to generalize how code is written so that it need
only be written once. What other benefits could this bring? And what would some
features of the resulting language need to be?
Ideally, one would write code once, in a Domain Specific Language (DSL), that
could be compiled to and optimized for multiple architectures. We have named this
DSL Liszt. First, we recognize that the accelerator cards are parallel in nature and
also that compilers which take serial code and parallelize it are difficult to write in
the best case and in the worst case impossible. For example, a compiler which would
transform the serial algorithm for solving tri-diagonal systems presented in Chapter
1 into the parallel version does not exist, and it is unclear how one would go about
writing such a thing. This places a restriction on Liszt - it must explicitly express
parallelism.
Parallelism is already present at a high level in NSSUS. It must be able to run on a
cluster of machines and as such performs domain decomposition and uses MPI to run
on multiple processors. The parallelism required to run on the accelerator cards is at
a lower level than this; Liszt should therefore also be able to handle the parallelism
at the cluster level. This includes parallel file i/o, domain decomposition, ghost
cell determination, ghost cell communication and parallel restart and visualization
output.
How should the language express the parallelism? The parallelism in a finite
difference code is over the vertices (in the sense that the calculation at every vertex
performs more or less the same operations), so a natural way to express this would be,
‘for all the vertices in the mesh, do this ’. Of course, other numerical techniques such
as finite volume, finite element and discontinuous Galerkin do not parallelize in this
way. They might involve operations over the faces or cells of the mesh instead. By
making the elements of a 3D (or 2D) mesh (vertex, edge, face and cell) primitives
in Liszt and allowing for groups of these elements, parallel loops over these groups
can be implemented.
Additionally, to allow Liszt to reason about the communication patterns and to
optimize not only the code but also the data structures themselves for each platform,
the layout of data in memory must be abstracted. Data should be represented on
the mesh in some fashion and then accessed through mesh primitives, allowing the
compiler freedom to determine how to physically layout the memory optimally for
each machine and program configuration.
The design of the Liszt language should be general enough that all of the main
techniques for solving PDEs on grids can be expressed while allowing enough flexibility
for most new algorithms to be developed. One of the main problems with the previous
work (see the next section) is that it is specialized for one specific numerical technique
and sometimes even for specific application areas.
Given this brief overview of the motivating ideas behind Liszt, first existing alter-
natives will be examined, followed by an in-depth description of the language along
with code samples for what finite difference (FD), finite volume (FV), Galerkin finite
element method (FEM) of both first and higher orders, and discontinuous Galerkin
(DG) methods would look like in Liszt. The goal of this chapter is not to describe the
functioning of Liszt down to the tiniest detail (even the syntax for some operations
is not necessarily finalized); the project is ongoing and many of these details are still
changing. The aim is to describe the higher level concepts of the language, which
will not change, and show that all major methods of solving PDEs can be expressed
cleanly and efficiently using these constructs.
4.2 Previous Work
Sundance [58] is a framework of C++ classes, developed at Sandia National Labs,
that allows for rapid development of parallel FEM solvers by expressing at a high
level the weak formulation of a PDE and its discretization. Within these boundaries
it is very successful. Its main shortcoming is that its boundaries are too narrowly
defined; it is impossible to use for anything but FEM methods and extremely difficult
to experiment with numerics (discretizations and quadrature rules for example) that
have not been supplied by Sundance’s creators. It is not well suited for solving time
dependent problems. Also, a major limitation for mechanics codes in particular is
that it doesn’t support moving or deforming meshes. We believe it is possible to create
a more general language that also allows for more expressive power while retaining
some of the simplicity of Sundance’s approach.
Next, an example of a simple Sundance program, taken from the Sundance tuto-
rial [58], for solving a potential flow problem. Many of the ideas are similar to ideas
in Liszt, which will be presented later.

/*
 * Solves the Laplace equation for potential flow past an elliptical
 * post in a wind tunnel.
 */
int main(int argc, void** argv)
{
  try
  {
    Sundance::init(&argc, &argv);

    /* We will do our linear algebra using Epetra */
    VectorType<double> vecType = new EpetraVectorType();

    /* Create a mesh. It will be of type BasisSimplicialMesh, and
     * will be built using a PartitionedRectangleMesher. */
    MeshType meshType = new BasicSimplicialMeshType();
    MeshSource mesher
      = new ExodusNetCDFMeshReader("post.ncdf", meshType);
    Mesh mesh = mesher.getMesh();

At the end a mesh, of a specific type, is loaded (possibly in parallel).
    /* Create a cell filter that will identify the maximal cells
     * in the interior of the domain */
    CellFilter interior = new MaximalCellFilter();
    CellFilter boundary = new BoundaryCellFilter();
    CellFilter in = boundary.labeledSubset(1);
    CellFilter out = boundary.labeledSubset(2);

CellFilters are just collections of cells. Here they are used to specify the cells over which boundary conditions will be applied. This concept is similar to the Set concept in Liszt (which again will be described later).

    /* Create unknown and test functions, discretized using
     * first-order Lagrange interpolants */
    Expr phi = new UnknownFunction(new Lagrange(1), "u");
    Expr phiHat = new TestFunction(new Lagrange(1), "v");

This defines the unknown function we are solving for (phi) and the test function that we will multiply the equation by and then integrate, as in the standard weak formulation of a FE problem. Notice that the basis used to represent these functions is a predefined polynomial basis (Lagrange); if we wished to use a more exotic basis that didn't exist in Sundance, we would have to modify Sundance itself to add such a capability.

    /* Create differential operator and coordinate functions */
    Expr x = new CoordExpr(0);
    Expr dx = new Derivative(0);
    Expr dy = new Derivative(1);
    Expr dz = new Derivative(2);
    Expr grad = List(dx, dy, dz);

    /* We need a quadrature rule for doing the integrations */
    QuadratureFamily quad2 = new GaussianQuadrature(2);
    double L = 1.0;

Here a quadrature rule is defined in case some of the terms in the Integral below cannot be exactly integrated (in this case they can). Note that, again, the quadrature rule must be chosen from a set of predefined options.

    /* Define the weak form */
    Expr eqn = Integral(interior, (grad*phiHat)*(grad*phi), quad2)
             + Integral(in, phiHat*(x-phi)/L, quad2);

This gives a symbolic representation of the weak formulation of the problem.

    /* Define the Dirichlet BC */
    Expr bc = EssentialBC(out, phiHat*phi/L, quad2);

Here, because the Dirichlet boundary conditions give rise to a separate equation, they are defined separately.

    /* We can now set up the linear problem! */
    LinearProblem prob(mesh, eqn, bc, phiHat, phi, vecType);

The only difference between a linear and non-linear problem from the user's point of view is that a non-linear problem must also be supplied with an initial guess.

    /* Read the parameters for the linear solver from an XML file */
    ParameterXMLFileReader reader("../../tutorial/bicgstab.xml");
    ParameterList solverParams = reader.getParameters();
    LinearSolver<double> linSolver
      = LinearSolverBuilder::createSolver(solverParams);

    /* solve the problem */
    Expr soln = prob.solve(linSolver);

A variety of linear and non-linear solvers are available.

    /* Project the velocity onto a discrete space for visualization */
    DiscreteSpace discreteSpace(mesh,
                                List(new Lagrange(1),
                                     new Lagrange(1),
                                     new Lagrange(1)), vecType);
    L2Projector projector(discreteSpace, grad*soln);
    Expr velocity = projector.project();

    /* Write the field in VTK format */
    FieldWriter w = new VTKWriter("Post3d");
    w.addMesh(mesh);
    w.addField("phi", new ExprFieldWrapper(soln[0]));
    w.addField("ux", new ExprFieldWrapper(velocity[0]));
    w.addField("uy", new ExprFieldWrapper(velocity[1]));
    w.addField("uz", new ExprFieldWrapper(velocity[2]));
    w.write();

It supports outputting data in the VTK file format, which can be used by programs such as Paraview [1] for visualization.

  }
  catch (exception& e)
  {
    Sundance::handleException(e);
  }
  Sundance::finalize();
}

Just some boilerplate to wrap things up nicely.
Sundance has many desirable features that Liszt seeks to incorporate. It makes
writing parallel code very similar to writing serial code, parallel mesh loading and vi-
sualization are handled automatically and a CellFilter or Set concept is used to group
cells on which similar computations are performed. However, its downsides (it is FEM
specific, lacks easy extensibility, is not suitable for time-dependent problems, has difficulty
dealing with moving meshes, and offers no clear way to take advantage of accelerator cards)
lead us to seek a more general and powerful solution.
SIERRA [26] is another framework also developed at Sandia National Lab. It
is not classified, but public information about it is scarce and it is not open-source.
Its objectives are similar to Liszt’s. It recognizes that a great deal of infrastruc-
ture related to mesh decomposition, parallelization, communication and mesh i/o are
essentially common across a great deal of mesh-based PDE solvers and this common-
ality should be leveraged. It does not attempt to take advantage of accelerator cards.
Information about its exact capabilities must be inferred from one of the publicly
available documents that shows it being used for large multi-physics simulations with
different but overlapping meshes - implying its capabilities are quite advanced.
Unfortunately, due to its relatively secret nature a close examination of it is not
possible. It is possible that Liszt may in many ways duplicate functionality present
in SIERRA. However, conversations with people who have seen it being used suggest
it is not the most user-friendly environment to work with. It also lacks the idea of
re-targeting to multiple accelerator architectures because its concept of parallelism is
at the domain decomposition level and not lower.
OpenFOAM [45] is a set of C++ classes which can be used to create PDE solvers.
Essentially the only supported numerical method is 2nd order Finite Volume. The
equations to be solved are represented symbolically and then the method of dis-
cretization for each term is chosen from a list. It provides a fairly complete set of pre
and post processing utilities and supports a moving mesh. Most of its capabilities
are geared toward writing fluid solvers, although others have been written (Electro-
magnetics, Solid Mechanics and Finance). Its main drawback is, again, the inability
to significantly alter its numerics.
ParFUM (Parallel Framework for Unstructured Meshing) [52] is a library devel-
oped at the University of Illinois at Urbana-Champaign on top of their Charm++
framework. Its goals are similar to the aforementioned solutions and Liszt’s. It is more
general than OpenFOAM and Sundance in that it does not target one specific nu-
merical method. It supports mesh-refinement and can take advantage of Charm++’s
dynamic (run-time) load balancing. However, it still requires the programmer to
manually describe and register ghost cells and trigger their update. It currently lacks
implicit solver support. It can support arbitrary cell shapes, but does not provide
for arbitrary connectivity relations and requires that all nodes have a position in
space. Because it is a library whose code is not parsed and analyzed like a language,
it cannot support retargeting one code to multiple architectures.
4.3 Language
4.3.1 Flow
A typical run of a Liszt program would follow these steps:
1. Load Configuration Files
(a) Determines which kernels will be used this particular run
(b) Specifies a particular mesh
(c) Possibly specifies a hardware configuration (e.g. which accelerator card to
use, how many nodes, etc.)
2. Load/Generate Sets to be used during computation
(a) Boundary Condition Sets
(b) Sets for Line Searches
(c) ...
3. Compiler Generates Optimized, Machine Specific Code
4. Solver Runs
(a) Parallel Mesh I/O
(b) Parallel Domain Decomposition
(c) Solver Loop
(d) Parallel Visualization and Restart Output
This technique of generating code at runtime is known as Just-In-Time (JIT)
Compilation [6]. For this to be advantageous the assumption is that the amount of
time spent in the initialization and code generation phase is small compared to the
time spent in the solving phase. For large scale scientific calculations this is likely to
be the case, since solves often take on the order of hours or even days.
4.3.2 Language Components
A sample Liszt fragment looks like:

Field<Vertex, double3> pos = ...;   // load vertex positions
SparseMatrix<Vertex, Vertex> A;
forall (Cell c in mesh.cells()) {
  double3 cellCenter = center(c);
  forall (Face f in c.faces()) {
    double3 face_dx = center(f) - cellCenter;
    // note that the following loop is parallel
    // the CCW implies the orientation of the edges
    // not their ordering
    forall (Edge e in f.edgesCCW(c)) {
      vertex v0 = e.tail();
      vertex v1 = e.head();
      double3 v0_dx = pos(v0) - cellCenter;
      double3 v1_dx = pos(v1) - cellCenter;
      double3 face_normal = v0_dx.cross(v1_dx);
      // calculate flux for face
      DOF d0 = v0.getDOF();   // code to place the DOF
      DOF d1 = v1.getDOF();   // not shown in this snippet
      A[d0][d1] += ...
      A[d1][d0] -= ...
    }
  }
}
Mesh – Liszt includes the interface to the mesh as part of the language, providing ob-
jects for vertices, edges, faces, and cells, along with a full set of topological func-
tions such as mesh.cells() (the set of all cells in the mesh) or f.edgesCCW()
(the edges of face f oriented counter clockwise around the face). This mesh
interface is known to the compiler, so it is able to reason about how best to split
up the mesh topology across many processors given a particular application.
The general case is supported by the facet-edge [24] data structure, although in
practice it is expected that most of the actual use cases will involve a very small
subset of possible topological relations. This leads to possible optimization
opportunities discussed later.
Note that position is not an inherent property of the mesh. Mesh contains only
connectivity information - it is really just a graph. Position is simply treated
as a field that is associated with vertices. This allows for more generality by
allowing Liszt to possibly be useful for general problems on graphs that aren’t
necessarily derived from a partitioning of space.
Sets/Lists – Are simple collections of one type of mesh primitive. They can be user de-
fined, as in the case of defining a region over which a certain boundary condition
is applied or a line search is to be performed. In addition to being defined explic-
itly, they can also be implicitly defined. Statements such as mesh.cells() also
return a set. More generally, imagine that with a fourth order finite difference
scheme, a stencil with a width of 2 neighbors to either side is needed. That is
to implement:
\[ \nabla^2 T = \frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2}
= \frac{1}{12h^2}\left( -\left(T_{i,j+2} + T_{i,j-2} + T_{i+2,j} + T_{i-2,j}\right)
+ 16\left(T_{i,j+1} + T_{i,j-1} + T_{i+1,j} + T_{i-1,j}\right) - 60\,T_{i,j} \right) + O(h^4) \]
One could write something like the following code:

forall v in mesh.vertices() {
  double sum = -60 * T[v];
  forall v1 in v.vertices() {
    sum += 16 * T[v1];
    vec3 dir1 = pos[v] - pos[v1];
    forall v2 in v1.vertices() {
      vec3 dir2 = pos[v1] - pos[v2];
      if (dir1 == dir2)
        sum -= T[v2];
    }
  }
  T[v] = T[v] + delta_t * sum / (12 * h * h);
}
But when writing a finite difference algorithm such as this, one expects a carte-
sian mesh, so the language provides an improved way of writing the same code.

assert(MeshType == Cartesian)   // check performed at runtime
// 2D example
forall (v in mesh.vertices()) {
  double sum = -60 * T[v];
  // built in function returning the set of 1st level halos
  forall (v1 in FDhalo(1, v))
    sum += 16 * T[v1];
  // built in function returning the set of 2nd level halos
  forall (v2 in FDhalo(2, v))
    sum -= T[v2];
  T[v] = T[v] + delta_t * sum / (12 * h * h);
}
This has clear advantages simply from a code clarity point of view. But more
importantly, it allows the compiler to make optimizations that might not oth-
erwise be possible. The code says it expects the mesh to be Cartesian, allowing
certain fast data structures to be used for storing and accessing the mesh. Fur-
thermore, the built in List FDhalo is used to specify the neighbors for the vertex.
This builtin function will be translated by the compiler into much more efficient
code than the example given above.
Lists are the same as Sets except they imply an ordering. This allows them to be
randomly accessed using the [ ] operator. Because of this additional constraint,
Lists should be used sparingly, but some algorithms are best expressed using
this construct (see the high order FEM example in section 4.4). In fact, should
one need different weights for vertices in the same halo level, the code could be
written as follows, taking advantage of the fact that FDhalo returns a List and
not a Set. // in s t ead o f t h i s
f o ra l l ( v1 in FDhalo (1 , v ) )
sum += 16 ∗ T[ v1 ] ;
// t h i s
vec4 weights = 13 , 14 , 15 , 16 ;
// the orde r ing o f FDhalo i s always :
//1 s t d i r e c t i o n , i n c r e a s i n g coord inate value
// then 2nd d i r e c t i o n , i n c r e a s i n g coord inate value , e t c .
CHAPTER 4. LISZT 84
// which p h y s i c a l d i r e c t i o n corresponds to the ” f i r s t ” d i r e c t i o n
// can be t e s t e d for , s i n c e i t the re i s no guarantee that the
//mesh axes are a l i gned with the c a r t e s i a n g r id
// f o r ex : (−1 , 0) (1 , 0) (0 , −1) (0 , 1)
Li s t<vertex> v e r t s = FDhalo (1 , v ) ;
vec4 temp =
T[ v e r t s [ 0 ] ] , T[ v e r t s [ 1 ] ] , T[ v e r t s [ 2 ] ] , T[ v e r t s [ 3 ] ] sum += weights . dot ( temp ) ;
Degrees of Freedom – Many higher order methods require more information than
can be stored at only the vertices, edges and faces of a cell. To accommodate
such methods degrees of freedom (DOF) are allowed to be placed at vertices
and in edges, faces and cells and also, importantly, at the following pairs: (face,
edge), (cell, edge), (cell, face). This is important because two cells that share
the same face might, for instance, be of different orders and need different DOF
on different sides of the same face.
Fields – Data can be stored at any of the mesh primitive types and also at DOF.
Additionally, data storage on the mesh is supported through fields which are
accessed through mesh elements rather than integer indices. This allows the
Liszt compiler to reason about what mesh relationships and data access patterns
are being used.
Sparse Matrices – Sparse Matrices are two dimensional and relate DOF to DOF.
The non-zero entries are determined by analyzing the kernels to determine which
entries are written to; they do not need to be specified or declared. Again, this
leads to some optimization possibilities that will be discussed later.
Solvers – Both linear and non-linear solvers will be provided. At the current state
of the project, existing solver packages, such as Trilinos [36] from Sandia Labs, will
be used to avoid ”re-inventing the wheel.” However, this does incur a cost in
translating data from Liszt’s internal format into whatever format the solver
packages use. Attempting to use the solver packages' internal formats within Liszt
could prevent many optimizations. In the long run, for maximum performance,
it is likely that solvers will eventually be written directly as a component of
Liszt.
Liszt abstracts the representation of commonly used objects to allow for architec-
ture specific optimizations. For instance, 3D vectors with dot and cross products are
included, allowing the compiler to implement them using SIMD when available. In
order to retarget Liszt code to many different architectures, we make a key domain
specific assumption: the computation is local to a particular piece of mesh topology.
For instance, an operation performed on a particular cell will only need data about a
limited number of neighboring values on the mesh.
While a whole range of optimizations are possible, five specific optimizations have
been chosen to focus on for the initial implementation.
Optimal Domain Decomposition – ParMETIS [83] is used for domain decompo-
sition. ParMETIS takes a graph of vertices and weighted edges and decomposes
it into a set of domains that minimize the cost of broken edges. Liszt uses its
knowledge of data access patterns to correctly weight the edges of the graph.
Liszt can also correctly determine what geometric primitive should be used as
the vertices of the graph. For example if the only parallel loops are over the
vertices of the mesh, forall(mesh.vertices()), the vertices should be parti-
tioned; if the parallel loops are over the cells, forall(mesh.cells()), the cells
should be partitioned. If cells are partitioned the ownership of the vertices,
edges and faces of the cell are determined using an algorithm that guarantees
on average each partition will have an equal number of each.
In the case where more than one kind of geometric primitive is used in an outer
loop, the compiler will partition the cells. If the pattern of neighbor access
changes, Liszt will automatically change the graph that is input to ParMETIS.
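As a hedged sketch of what the compiler would assemble, the C++ fragment below builds the CSR-style xadj/adjncy/adjwgt arrays that METIS-family partitioners take as input, with edge weights taken from how often each neighbor relation is accessed; the actual ParMETIS call and the access-counting machinery are omitted, and all names here are illustrative rather than the Liszt implementation.

#include <cstddef>
#include <vector>

struct PartitionGraph {
    std::vector<int> xadj;     // offsets into adjncy, one entry per graph vertex plus one
    std::vector<int> adjncy;   // neighbor ids
    std::vector<int> adjwgt;   // edge weights (communication cost if the edge is cut)
};

PartitionGraph build_graph(const std::vector<std::vector<int>>& neighbors,
                           const std::vector<std::vector<int>>& access_count) {
    PartitionGraph g;
    g.xadj.push_back(0);
    for (std::size_t v = 0; v < neighbors.size(); ++v) {
        for (std::size_t j = 0; j < neighbors[v].size(); ++j) {
            g.adjncy.push_back(neighbors[v][j]);
            g.adjwgt.push_back(access_count[v][j]);   // heavier edges are less likely to be cut
        }
        g.xadj.push_back(static_cast<int>(g.adjncy.size()));
    }
    return g;
}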
Ghost Cells – The same analysis used to determine the correct weighting of the
edges of the ParMETIS graph can also be used to determine what field values
need to be shared across domain boundaries. Liszt automatically determines
how many levels of ghost cells/faces are needed and will make sure that whenever
values are accessed they are up-to-date. This adaptability is a key advancement
over current codes that fix the level of ghost cells/faces. For example, in a
large eddy simulation code, when one wishes to change the extent of a filter,
a major code rewrite is involved to handle the new amount of necessary ghost
information. In Liszt, this will be automatic. When running on a distributed
memory machine, Liszt takes care of all the necessary MPI calls.
Currently this is handled in Liszt by the generation (currently by hand, although
soon to be automatic) of auxiliary kernels which mimic the actual kernel's mem-
ory access patterns. Special functions are used in place of normal memory
accesses. These functions do not actually perform the memory fetch, but in-
stead record what memory fetch will take place, allowing Liszt to determine
when memory will need to be accessed that isn’t locally available and where
it resides. Liszt runs these kernels after domain decomposition but before the
main solver loop.
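The C++ sketch below illustrates the recording idea: instead of reading field data, a recording accessor logs which element ids a kernel would touch, and anything touched but not owned by the local partition is a ghost element that must be exchanged. All names are illustrative, not the actual Liszt runtime API.

#include <cstddef>
#include <set>
#include <vector>

struct RecordingField {
    std::set<int>* touched;
    // Mimics a field read: records the access and returns a dummy value.
    double operator()(int element_id) const { touched->insert(element_id); return 0.0; }
};

std::set<int> find_ghosts(const std::vector<int>& owned,
                          const std::vector<std::vector<int>>& neighbors_of_owned) {
    std::set<int> touched;
    RecordingField phi{&touched};
    // "Auxiliary kernel": same loop structure and neighbor accesses as the real
    // kernel, but every field access goes through the recording accessor.
    for (std::size_t i = 0; i < owned.size(); ++i)
        for (int nbr : neighbors_of_owned[i])
            phi(nbr);
    std::set<int> ghosts = touched;
    for (int id : owned) ghosts.erase(id);   // touched but not owned means a ghost element
    return ghosts;
}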
Mesh Representation – Because Liszt understands what mesh relationships are
necessary for the program it can choose the optimal way to represent the mesh.
A program that moves around the mesh in an advanced way might need the full
generality of a facet-edge mesh. Many scientific applications use relatively few
mesh relations, and a more limited but smaller and faster representation is pre-
ferred. In addition to understanding mesh relationships, Liszt also understands
the mesh itself. If a mesh happens to only consist of tetrahedra, Liszt would
identify this and in conjunction with what mesh relationships are used choose
an optimal representation. Even in a more complicated case where the mesh
consisted of mostly tetrahedra but a few more complex elements, Liszt could
also optimize this by using a hybrid representation of the mesh: the optimal
tetrahedra representation combined with a slower, more general representation
for the complex elements. The load balancing would then also be able to take
advantage of this information (that the complex elements are likely to take
longer to process.)
Optimal Layout of Fields – Liszt’s abstraction of representing field data as being
stored at mesh primitives leaves the compiler with options as to how to
actually store the data in memory. The number of possible layouts is vast. The
simplest is an array of structures; each geometric primitive has a structure with
all the fields at that location. Another option is a structure of arrays; in this case
each field is stored contiguously. There are many other options depending on
the data access patterns and machine architecture. After analyzing the program
Liszt can choose an optimal layout. For example, consider a program with four
kernels and four fields.
          Kernel 1   Kernel 2   Kernel 3   Kernel 4
Field 1       X          X
Field 2                  X          X          X
Field 3                             X          X
Field 4                             X          X
For cache purposes on traditional architectures, fields 3 and 4 should clearly be
grouped together. Whether to group field 2 with field 1 or with fields 3 and 4 depends on
how long (assumed to be closely related to how much math and memory access the
kernel performs) each kernel will take. If kernel 2 is estimated to be bandwidth
bound then grouping fields 1 and 2 together would provide maximum speed.
On the other hand, if kernel 2 is arithmetic or instruction bound and kernel 3 is
bandwidth bound, then grouping fields 2, 3 and 4 together would be optimal.
It is possible that for some configurations the choice would essentially have no
impact (if all kernels were highly compute bound, for example), and then an
arbitrary choice can be made.
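The C++ sketch below illustrates the two extreme layouts the compiler can choose between; the field names are placeholders, and the point is only that the kernel source is unchanged while the physical layout differs.

#include <cstddef>
#include <vector>

// Array of structures: all fields for one vertex are adjacent in memory.
struct VertexAoS { double field1, field2, field3, field4; };
using MeshAoS = std::vector<VertexAoS>;

// Structure of arrays: each field is stored contiguously across the mesh.
struct MeshSoA {
    std::vector<double> field1, field2, field3, field4;
};

// A kernel that only reads fields 3 and 4 streams through two contiguous arrays in
// the SoA layout, but drags fields 1 and 2 through the cache in the AoS layout.
double kernel_soa(const MeshSoA& m) {
    double acc = 0.0;
    for (std::size_t v = 0; v < m.field3.size(); ++v)
        acc += m.field3[v] * m.field4[v];
    return acc;
}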
A second optimization on traditional cpus related to cache performance and
fields is blocking. A loop such as the following:
forall c in mesh.cells() {
  // do stuff ...
}

can be transformed into this:

// this is simply pseudo code to show how the loop might
// be transformed by the compiler
// the user would never write such code
forall chunk in mesh.cells() {
  forall c in chunk {
    // do stuff
  }
}
Each chunk’s size would be chosen such that all of the fields, mesh data, etc.
fit into the L2 cache. To take full advantage of this blocking, the layout of each
field would arrange each chunk to be contiguous in memory.
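A minimal C++ version of the same transformation is sketched below; the cache size and the bytes touched per cell are illustrative assumptions, not values used by Liszt.

#include <algorithm>
#include <cstddef>

// Stand-in for the kernel body ("do stuff") applied to one cell.
inline void process_cell(std::size_t /*c*/) {}

void blocked_loop(std::size_t num_cells) {
    const std::size_t l2_bytes       = 512 * 1024;  // assumed L2 capacity
    const std::size_t bytes_per_cell = 96;          // assumed field + mesh data per cell
    const std::size_t chunk          = std::max<std::size_t>(1, l2_bytes / bytes_per_cell);
    for (std::size_t start = 0; start < num_cells; start += chunk) {
        const std::size_t end = std::min(num_cells, start + chunk);
        for (std::size_t c = start; c < end; ++c)   // all data for this chunk stays in cache
            process_cell(c);
    }
}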
Optimal Sparse Matrix Representation – The optimal method to represent the
sparse matrices depends both on the structure of the matrix itself as well as
the hardware that is going to be used in the computation [8]. With knowledge
of the matrix structure and the hardware platform, the optimal matrix can be
chosen at runtime, by means of a table based lookup.
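One candidate format among those such a lookup could select is compressed sparse row (CSR), sketched below in C++ with a basic matrix-vector product; a GPU target might instead pick a blocked or ELLPACK-style layout, while the kernel source written in Liszt would not change.

#include <cstddef>
#include <vector>

struct CSRMatrix {
    std::vector<int>    row_ptr;   // size = rows + 1
    std::vector<int>    col;       // column index of each stored entry
    std::vector<double> val;       // value of each stored entry
};

// y = A * x for a CSR matrix (assumes row_ptr has rows + 1 entries).
std::vector<double> spmv(const CSRMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.row_ptr.size() - 1, 0.0);
    for (std::size_t r = 0; r + 1 < A.row_ptr.size(); ++r)
        for (int k = A.row_ptr[r]; k < A.row_ptr[r + 1]; ++k)
            y[r] += A.val[k] * x[A.col[k]];
    return y;
}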
Many of the optimizations that Liszt enables are only possible once the mesh itself
is known and, to a lesser extent, once it is known which subset of all possible kernels in
the program will be executed. This means that a just-in-time (JIT) compilation strategy must
be employed. The Liszt runtime will allow for the mesh to be loaded, configurations
to be read and sets to be constructed in a general and non-optimal manner. Once the
configuration, mesh, and sets are known, the optimizations are applied and code is
generated for the target architecture, and compiled. The main solver loop then runs
with this optimal code.
4.4 Examples
These examples are important for two reasons:
1. They demonstrate the possibility of writing all of these algorithms in Liszt
2. They provide a showcase for the language itself, rather than a specification
The examples are not complete programs; in general, only the main computational
loops are shown. Initialization code, mesh loading, boundary determination and
related code have been skipped over. The declaration of Fields and SparseMatrices
are still shown. Sometimes functions are used, but not declared in the code shown;
when this is the case it has been documented in comments.
A finite difference example has already been provided in the discussion on Sets.
The following is a simple finite volume code for solving the scalar convection
equation in three dimensions. That is, it solves

\[ \frac{\partial \phi}{\partial t} + \nabla \cdot (\phi \mathbf{u}) = 0 \]
where u is the velocity field and is constant in time (in the example below it is also
constant in space, but it need not be), φ is the convected quantity. Density has been
assumed constant. The usual finite volume approach is taken and the equation is
integrated over a control volume and then the second term is recast using Gauss’
divergence theorem to be over the surface of the volume.

\[ \int_{CV} \nabla \cdot (\phi \mathbf{u}) \, dV = \int_{S} \mathbf{n} \cdot (\phi \mathbf{u}) \, dS \]

To evaluate this new integral, the value of φ is assumed to be constant on each face.
Upwind differencing is used to determine which cell's value of φ should be used.

/* Initialization code not shown */
vec3 globalVelocity = (1, 0, 0);
// time stepping loop
for (double t = 0; t < NumSteps; t += delta_t) {
  // calculate a flux for the faces not part of boundary
  forall f in InteriorFaces {
    vec<3> normal = normals.normal(f);
    double vDotN = globalVelocity.dot(normal);
    double area = faceArea(f);
    double flux;
    // determine correct flux contribution
    if (vDotN >= 0)
      flux = area * vDotN * Phi(f.inside());
    else
      flux = area * vDotN * Phi(f.outside());
    // scatter fluxes
    Flux[f.inside()] -= flux;
    Flux[f.outside()] += flux;
  }
  // handle the boundary condition
  forall f in OutflowFaces {
    // need the face to have the correct orientation
    if (f.outside().ID() != 0) f.flip();
    vec<3> normal = normals.normal(f);
    double vDotN = globalVelocity.dot(normal);
    assert(vDotN >= 0);   // for an outflow face, it better be...
    double area = faceArea(f);
    double flux = area * vDotN * Phi(f.inside());
    Flux[f.inside()] -= flux;
  }
  // Now perform time advancement since the Flux is known
  forall c in mesh.cells() {
    double volume = cellVolume(c);
    Phi[c] = Phi(c) + delta_t * Flux(c) / volume;
    // need to zero flux for next iteration
    Flux[c] = 0.;
  }
  // initialization of camera object not shown
  camera.snapshot();   // write visualization data
}
The following example uses the Galerkin finite element method to solve Laplace’s
equation in two dimensions. That is, it solves the following problem

\[ -\nabla^2 u = f \quad \text{in } \Omega \]
\[ u = 0 \quad \text{on } \Gamma_D \]
\[ \frac{\partial u}{\partial n} = h \quad \text{on } \Gamma_N \]
The problem is put into the weak form by multiplying by a set of test functions,
v, and then integrating over the volume.

\[ -\int_{\Omega} v\, \nabla^2 u = \int_{\Omega} f v \qquad \text{for all } v \]
Then using Green's identity to rewrite the first term...

\[ \int_{\Omega} \nabla u \cdot \nabla v - \int_{\partial\Omega} \frac{\partial u}{\partial n}\, v = \int_{\Omega} f v \qquad \text{for all } v \]
Then using the boundary conditions to rewrite the second term...

\[ \int_{\Omega} \nabla u \cdot \nabla v = \int_{\Omega} f v + \int_{\partial\Omega} h v \qquad \text{for all } v \]
If we choose to represent the unknown u using the same space of functions v, then
we have the classical Galerkin finite element method.
\[ \sum_i u_i \int_{\Omega} \nabla v_i \cdot \nabla v_j = \int_{\Omega} f v_j + \int_{\partial\Omega} h v_j \qquad \text{for all } v_j \]
This gives rise to a system of equations KU = F. K, the stiffness matrix, is the
result of the first integral on the left hand side. U is the vector of unknowns u_i. F
is the result of the two integrals on the right hand side. Each one of these terms can
be seen being computed in the example below. The exact details of the calculation
of K are hidden inside a function to simplify this example. Note that because there is
one degree of freedom at each vertex, the method is using first order elements.

// place one DOF at each vertex
SparseMatrix<mesh.vertices().DOF(), mesh.vertices().DOF(),
             double> K;
Field<mesh.vertices().DOF(), double> rhs;
Field<mesh.vertices().DOF(), double> u;
Field<mesh.vertices()> pos = ...   // load position
Set<Edge> NeumannBCs;              // initialization not shown
// F is the inhomogeneous term in the PDE
// G is a function which computes the local stiffness matrix
// H is a function which evaluates the Neumann BC
forall c in mesh.cells() {
  // DOF are automatically returned in a counterclockwise ordering
  // in 2D
  LocalMatrix<c.DOF(), c.DOF(), double> Kloc = G(c.DOF(), pos);
  ReduceLocalToGlobal(Kloc, K);
  // performs the equivalent of the following
  // forall d1 in c.DOF() {
  //   forall d2 in c.DOF() {
  //     K[d1][d2] += Kloc[d1][d2];
  //   }
  // }
  // forcing term
  double localRHS = cellvolume(c) * F(cellcenter(c)) / 6;
  forall v in c.vertices() {
    rhs[v] += localRHS;
  }
}
forall e in NeumannBCs {
  double val = length(e) * H(center(e)) / 2;
  rhs[e.tail] += val;
  rhs[e.head] += val;
}
u = LinearSolve(K, rhs);
This is a more advanced Galerkin finite element example for solving the same
problem. The elements are now quadratic instead of linear and the details of the
integration and construction of the stiffness matrix are shown. The handling of the
boundary conditions is not shown. Most of the math in the function quadBasisEval
is to determine the mapping from the real space triangle to a scaled, reference, right
triangle. The functions, func1, func2, etc. which evaluate each of the basis functions
inside the triangle, work with scaled coordinates for simplicity. vec3 quadResult quadBasisEval ( L i s t<vertex> vert s , Fie ld<vertex , vec2> pos ,
DOF dof1 , Ce l l c )
vec3 returnVal ;
double det = 2 ∗ area ( c ) ;
//compute transformed coo rd ina t e s
double r = ( pos [ v e r t s [ 2 ] ] . y − pos [ v e r t s [ 0 ] ] . y ) ∗ ( pos [ qp ] . x − pos [ v e r t s [ 0 ] ] . x ) +
( pos [ v e r t s [ 0 ] ] . x − pos [ v e r t s [ 2 ] ] . x ) ∗ ( pos [ qp ] . y − pos [ v e r t s [ 0 ] ] . y ) ;
double drdx = ( pos [ v e r t s [ 2 ] ] . y − pos [ v e r t s [ 0 ] ] . y ) / det ;
double drdy = ( pos [ v e r t s [ 0 ] ] . x − pos [ v e r t s [ 2 ] ] . x ) / det ;
double s = ( pos [ v e r t s [ 0 ] ] . y − pos [ v e r t s [ 1 ] ] . y ) ∗ ( pos [ qp ] . x − pos [ v e r t s [ 0 ] ] . x ) +
( pos [ v e r t s [ 1 ] ] . x − pos [ v e r t s [ 0 ] ] . x ) ∗ ( pos [ qp ] . y − pos [ v e r t s [ 0 ] ] . y ) ;
double dsdx = ( pos [ v e r t s [ 0 ] ] . y − pos [ v e r t s [ 1 ] ] . y ) / det ;
double dsdy = ( pos [ v e r t s [ 1 ] ] . x − pos [ v e r t s [ 0 ] ] . x ) / det ;
double b , dbdr , dbds ;
i f ( dof1 . type == 0) b = func1 ( r , s ) ; // d e t a i l s not shown
dbdr = func2 ( r , s ) ; // d i t t o
dbds = func3 ( r , s ) ; // d i t t o
else i f ( dof1 . type == 1)
// . . .
// . . . up through a l l 6 p o s s i b i l i t i e s
double dbdx = dbdr ∗ drdx + dbds ∗ dsdx ;
double dbdy = dbdr ∗ drdy + dbds ∗ dsdy ;
CHAPTER 4. LISZT 94
returnVal . x = b ;
returnVal . y = dbdx ;
returnVal . z = dbdy ;
return returnVal ;
main() {
  // Place 1 DOF at each vertex and at each edge midpoint
  SparseMatrix<mesh.cells().DOF(), mesh.cells().DOF(), double> K;
  Field<mesh.cells().DOF(), double> rhs;
  Field<mesh.vertices(), vec2> pos = // load positions

  // each cell has a field containing a list of length 3 storing vec3s
  Field<mesh.cells(), List<vec3, 3> > QuadraturePointsWeights;

  // determine quadrature points and weights for each element
  forall c in mesh.cells() {
    List<vertex> verts = mesh.vertexList(c);
    vec2 v1 = pos[verts[0]];
    vec2 v2 = pos[verts[1]];
    vec2 v3 = pos[verts[2]];
    QuadraturePointsWeights(c)[0] = vec3((v1 + v2) / 2, 1/3);
    QuadraturePointsWeights(c)[1] = vec3((v2 + v3) / 2, 1/3);
    QuadraturePointsWeights(c)[2] = vec3((v1 + v3) / 2, 1/3);
  }

  // assembly loop
  forall c in mesh.cells() {
    // can be a forall loop even though a List implies an
    // ordering, as long as the ordering is not used
    forall vec3 qp in QuadraturePointsWeights(c) {
      double w = area(c) * qp.z;
      forall dof1 in c.DOF() {
        List<vertex> verts = mesh.vertexList(c);
        // fill up the rhs vector, F
        vec3 iQuad = quadBasisEval(verts, pos, dof1, c, qp);
        rhs(dof1) += w * F(qp) * iQuad.x;
        forall dof2 in c.DOF() {
          vec3 jQuad = quadBasisEval(verts, pos, dof2, c, qp);
          // fill up the stiffness matrix K
          K(dof1, dof2) += w * (iQuad.y * jQuad.y + iQuad.z * jQuad.z);
        }
      }
    }
  }

  // handle boundary conditions, not shown
  u = LinearSolve(K, rhs);
}
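The reference-triangle mapping and the edge-midpoint quadrature rule used above can be sketched outside of Liszt. The following Python snippet is purely illustrative (ref_coords and midpoint_quadrature are hypothetical helpers, not part of the thesis code); it maps a physical point to the reference coordinates (r, s) and checks that the three-point edge-midpoint rule with weights 1/3 integrates a quadratic exactly.

```python
import numpy as np

def ref_coords(p, v0, v1, v2):
    """Reference coordinates (r, s) of point p, assuming the affine map
    x = v0 + r*(v1 - v0) + s*(v2 - v0)."""
    A = np.column_stack([v1 - v0, v2 - v0])
    return np.linalg.solve(A, p - v0)

def midpoint_quadrature(v0, v1, v2):
    """Edge-midpoint rule: three points, equal weights, exact for quadratics."""
    area = 0.5 * abs((v1[0] - v0[0]) * (v2[1] - v0[1]) -
                     (v1[1] - v0[1]) * (v2[0] - v0[0]))
    pts = [(v0 + v1) / 2, (v1 + v2) / 2, (v0 + v2) / 2]
    wts = [area / 3] * 3
    return pts, wts

v0, v1, v2 = np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])
pts, wts = midpoint_quadrature(v0, v1, v2)
# the integral of r*s over this (reference) triangle is 1/24
approx = sum(w * np.prod(ref_coords(p, v0, v1, v2)) for p, w in zip(pts, wts))
print(approx)  # ~0.0416667
```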
To give the reader unfamiliar with discontinuous Galerkin methods some background, a brief overview of using them to solve a simple scalar wave equation is given. This should enable the reader to understand where the terms in the more complicated example come from. For further reading, [37] is recommended.
Consider the equation
$$\frac{\partial u}{\partial t} + \frac{\partial (a u)}{\partial x} = 0$$
where u is the unknown and a is a constant. Then consider space being decomposed into K distinct elements. On each of those elements we can locally represent the solution u as follows, where N is the order of the polynomial representation and $l_i$ is the Lagrange polynomial of order i:
$$u(x, t) = \sum_{i=1}^{N+1} u(x_i, t)\, l_i(x)$$
Next, like in the finite element method, we multiply by a set of test functions and integrate, but here only locally over each element. So we have for element E
$$\int_E \left(\frac{\partial u}{\partial t} + \frac{\partial (a u)}{\partial x}\right) l_i\, dx = 0, \qquad 1 \le i \le N+1$$
Then this is integrated by parts to yield
$$\int_E \left(\frac{\partial u}{\partial t}\, l_i - a u\, \frac{d l_i}{d x}\right) dx = -\int_{\partial E} n \cdot a u\, l_i\, dx, \qquad 1 \le i \le N+1$$
One of the key points of the method is that in the term on the right-hand side, the value of au is multiply defined at each interface between elements. How this discontinuity between elements is resolved depends on the equations one is solving. Without delving any deeper into this resolution, (au)* is simply referred to as the resolved quantity, known as the flux. This is known as the weak formulation; integrating the entire equation by parts one more time yields the strong formulation.
The expansion of u is substituted into the above equation, which can then be arranged into the following form
$$\mathbf{M}\frac{d\mathbf{u}}{dt} + \mathbf{S}^T a \mathbf{u} = -(au)^*_{x_r}\, \mathbf{l}(x_r) + (au)^*_{x_l}\, \mathbf{l}(x_l)$$
where $M_{ij} = \int_E l_i(x)\, l_j(x)\, dx$ and $S_{ij} = \int_E l_i(x)\, \frac{d l_j}{dx}\, dx$. The entire equation can be multiplied by $\mathbf{M}^{-1}$ to obtain an explicit expression for the time derivatives. Note that the expression on the right-hand side is really just a surface integral even though it isn't written as one in one dimension. It is even more important to realize that this equation is per element; it is not global. As such, no matrix inversions are required to advance the system in time (other than inverting the matrix M, which is small and can easily be done as a pre-processing step).
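As a small worked example (added for illustration and not part of the original text; it assumes the standard Lagrange basis $l_1(x) = (x_r - x)/h$, $l_2(x) = (x - x_l)/h$ on an element $[x_l, x_r]$ of length h), the local matrices for linear elements (N = 1) are
$$\mathbf{M} = \frac{h}{6}\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad \mathbf{S} = \begin{pmatrix} -\tfrac{1}{2} & \tfrac{1}{2} \\ -\tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix}$$
so M is only 2 x 2; its inverse can be precomputed once per element, and the time derivatives follow from purely local matrix-vector products plus the flux values at the two element endpoints.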
The following example showcases Liszt being used to implement the nodal discontinuous Galerkin method for solving the two-dimensional Maxwell's equations (in vacuum) with triangular elements.
The equations being solved (in dimensionless form) are thus (ignoring boundary conditions):
$$\frac{\partial H_x}{\partial t} = -\frac{\partial E_z}{\partial y}, \qquad \frac{\partial H_y}{\partial t} = \frac{\partial E_z}{\partial x}, \qquad \frac{\partial E_z}{\partial t} = \frac{\partial H_y}{\partial x} - \frac{\partial H_x}{\partial y}$$
$$E_z = 0 \ \text{ on } \ \Gamma$$
where H is the magnetic field, E is the electric field and Γ is the boundary.
The following variables are considered to be initialized previously in the program:

N : the order of the local approximation
Np : the number of terms in the local expansion, (N+1)(N+2)/2 in 2D
Nf : the number of terms of the local expansion located on a face of a cell, = N+1
Dr : Np x Np matrix such that ∂u/∂r = Dr u
Ds : Np x Np matrix such that ∂u/∂s = Ds u
r, s : the coordinates in the reference triangle space (not actual variables in the program)
rx, ry, sx, sy : each cell has unique vectors of length Np describing the mapping ∂r/∂x, ∂r/∂y, ∂s/∂x, ∂s/∂y
L : Np x 3Nf matrix describing how to "lift" the surface terms to volume terms
scale : Field from a (cell, edge) pair to a double containing the inverse of the Jacobian of the mapping along the edge

Field<mesh.cells().DOF(), double> Ez;  // out-of-plane electric field
Field<mesh.cells().DOF(), double> Hx;  // in-plane magnetic field components
Field<mesh.cells().DOF(), double> Hy;

// for each cell:
//   Np - 3N DOFs placed inside each cell
//   N - 1 DOFs placed along each edge (belonging to one (cell, edge) pair)
//   1 DOF placed at each vertex (belonging to two (cell, edge) pairs)
Field<mesh.cells(), List<c.edges().DOF(), double> > rhsHx;
Field<mesh.cells(), List<c.edges().DOF(), double> > rhsHy;
Field<mesh.cells(), List<c.edges().DOF(), double> > rhsEz;

// flux and rhs calculation
forall c in mesh.cells() {
  Field<c.edges().DOF(), double> dHx;  // size is 3Nf
  Field<c.edges().DOF(), double> dHy;
  Field<c.edges().DOF(), double> dEz;
  Field<c.edges().DOF(), double> fluxHx;
  Field<c.edges().DOF(), double> fluxHy;
  Field<c.edges().DOF(), double> fluxEz;

  forall e in c.edges() {
    // note that e carries with it the information about
    // which cell it came from, automatically providing
    // a (cell, edge) pair

    // compute field differences
    // There are Nf DOF per (cell, edge) pair, so the expression
    // Hx(e.DOF()) is inherently a vector
    // CellEdgePair creates a (cell, edge) pair
    // e.opposite(c) returns the cell on the other side of edge e from cell c
    dHx(e.DOF()) = Hx(e.DOF()) - Hx(CellEdgePair(e.opposite(c), e).DOF());
    dHy(e.DOF()) = Hy(e.DOF()) - Hy(CellEdgePair(e.opposite(c), e).DOF());
    dEz(e.DOF()) = Ez(e.DOF()) - Ez(CellEdgePair(e.opposite(c), e).DOF());
    normalDotDH = normal(e).x * dHx + normal(e).y * dHy;

    // compute fluxes along this edge
    fluxHx(e.DOF()) = normal(e).y * dEz +
                      (normalDotDH * normal(e).x - dHx) * scale(e);
    fluxHy(e.DOF()) = -normal(e).x * dEz +
                      (normalDotDH * normal(e).y - dHy) * scale(e);
    fluxEz(e.DOF()) = -normal(e).x * dHy + (normal(e).y * dHx - dEz) * scale(e);
  }

  // compute gradient and curl of fields in this cell
  Field<c.DOF(), double> Ezr = Dr * Ez(c.DOF());  // size is Np
  Field<c.DOF(), double> Ezs = Ds * Ez(c.DOF());
  Field<c.DOF(), double> Ezx = rx(c) * Ezr + sx(c) * Ezs;
  Field<c.DOF(), double> Ezy = ry(c) * Ezr + sy(c) * Ezs;
  Field<c.DOF(), double> Hxr = Dr * Hx(c.DOF());
  Field<c.DOF(), double> Hxs = Ds * Hx(c.DOF());
  Field<c.DOF(), double> Hyr = Dr * Hy(c.DOF());
  Field<c.DOF(), double> Hys = Ds * Hy(c.DOF());
  Field<c.DOF(), double> curlHz =
      rx(c) * Hyr + sx(c) * Hys - ry(c) * Hxr - sy(c) * Hxs;

  // compute rhs
  rhsHx(c) = -Ezy + L * fluxHx / 2;  // L is Np x 3Nf, flux is 3Nf x 1
  rhsHy(c) = Ezx + L * fluxHy / 2;
  rhsEz(c) = curlHz + L * fluxEz / 2;
}

// timestepping ...
Chapter 5
Conclusions
This thesis began with a brief overview of general purpose processor technology and discussed the eventual difficulties in further increasing performance under the instruction-level parallelism paradigm. Commodity graphics processors and IBM's Cell were introduced as two examples of an emerging class of hardware where the programming paradigm and hardware design are based upon data parallelism. The specifics of the hardware platforms and their programming models were described. The second chapter examined using GPUs for solving the O(N²) N-body problem. It provided some general and some hardware-specific techniques for achieving maximal performance. The maximum performance was quite high, over twenty-five times that of a highly optimized traditional CPU code. Also, on the metric of performance per dollar the GPU was far and away the best contender. On the metric of performance per watt the specialized hardware GRAPE was marginally better due to its single-purpose design. The third chapter examined both the CELL and GPUs for solving the compressible Euler equations. It was determined that GPUs were suitable, with a speedup of around twenty, while the CELL was not (with an overall slowdown) due to various architectural and programming difficulties which were discussed. This highlighted the difficulty of having multiple programming models. This led directly to the final chapter describing Liszt, a new Domain Specific Language for writing mesh-based PDE solvers, which would allow for writing code once and retargeting it to different acceleration technologies. It would also ease code development in general through the automatic handling
of domain decomposition and parallelization. Examples of all the mainstream mesh-based techniques for solving PDEs were presented in Liszt. The work of the previous chapters, especially Chapter 3, showed the need for this new language and will prove useful as the development of the Liszt compiler continues, especially the GPU backend.
Appendix A
New Periodic Boundary Conditions
for Simulating Nano-wires under
Torsion and Bending
A.1 Introduction
Recently there has been considerable interest in the directed growth of semiconductor
nanowires (NWs), which can be used to construct nano-scale field effect transistors
(FETs) [22, 94, 39], chemical and biological sensors [21], nano-actuators [17] and nano-
fluidic components [29]. Epitaxially grown NWs have the potential to function as
conducting elements between different layers of three-dimensional integrated circuits.
Because significant stress may build up during fabrication and service (e.g. due to
thermal or lattice mismatch), characterization and prediction of mechanical strength
and stability of NWs is important for the reliability of these novel devices.
NWs also offer unique opportunities for studying the fundamental deformation
mechanisms of materials at the nanoscale. The growing ability to fabricate and mechanically test microscale and nanoscale specimens and the increasing computational power allow for direct comparison between experiments and theory at the same length scale.
The size of these devices presents a challenge to test their mechanical properties.
In macroscale samples, the materials are routinely tested in tension, shear, torsion
and bending using standard grips and supports. Smaller samples, however, require
more inventive testing techniques. For nanoscale testing, tensile and bending tests
have been performed using nanoindentors, AFM [50, 23], and MEMS devices [96, 44].
Similar experiments have been performed at the microscale [46, 90]. With the rapid
progress of nanofabrication and nanomanipulation capabilities, additional tension,
torsion, and bending experimental data on crystalline and amorphous nanowires will
soon be available.
Molecular dynamics is poised to be the main theoretical tool to help understand and predict small-scale mechanical properties. However, since MD is limited in the number of atoms it can simulate, it cannot simulate whole nanowires: either the nanowire simulated must be extremely short or periodic boundary conditions (PBC) must be used. End conditions artificially alter the material locally such that defect nucleation and failure often occur there. This results in simulations that test the
strength of the boundary rather than the intrinsic strength of the material. Tradi-
tional PBC remove this artifact by enforcing translational invariance and eliminating
all artificial boundaries.
The use of conventional PBC allows for the simulation of tensile, pure shear, and
simple shear in MD [80]. In fact, the mechanical properties of silicon nanowires in
tension were recently calculated using this approach [48]. The nanowires were strained
by extending the periodicity along the nanowire length and the stress was calculated
through the Virial formula. However, regardless of the types of strain imposed on the
periodic simulation cell, the images form a perfect lattice which precludes nonzero
average torsion or bending. Therefore, to simulate torsion or bending tests, either
small finite nanowires must be simulated or the current PBC framework must be
altered.
Many Molecular Dynamics simulations on torsion and bending of nanoscale struc-
tures have been reported [38, 95, 77, 40, 72]. The artificial end effects are sometimes
reduced by putting the ends far away from the region undergoing severe deforma-
tion, requiring a long nanowire [60]. There have also been attempts to rectify this
problem [73]. Recently, the objective molecular dynamics (OMD) formulation [25]
has been proposed that generalizes periodic boundary conditions to accommodate
symmetries other than translational. Under this framework, torsion and bending
simulations can be performed without end effects. But the general formulation of
OMD is somewhat difficult to apply to existing MD simulation programs.
In this thesis, a simpler formulation that accommodates torsion and bending in
a generalized periodic boundary condition framework is presented. It is shown that
torsion and bending can be related to shear and normal strains when expressed in
cylindrical coordinates. This leads to t-PBC and b-PBC, respectively, as formulated in
Section 2. While only linear momenta are preserved in PBC, both t-PBC and b-PBC
preserve the angular momentum around their rotation axes. These new boundary
conditions can be easily implemented on top of existing simulation programs that use
conventional PBC. In Section 3, the Virial expressions for the torque and bending
moment are derived that are analogous to the Virial expressions for the average stress
in simulation cells under PBC. The Virial expressions of torque and bending moment,
expressed as a sum over discrete atoms, are found to correspond to a set of tensorial
quantities in continuum mechanics, expressed as a volume integral. Section 4 presents
the application of these new boundary conditions to modeling of the intrinsic strength
of Si nanowires under torsion and bending.
A.2 Generalization of Periodic Boundary Conditions
A.2.1 Review of Conventional PBC
PBC can be visualized as a primary cell surrounded by a set of replicas, or image cells.
The replicas are arranged into a regular lattice specified by three repeat vectors: c1,
c2, c3. This means that whenever there is an atom at location ri there are also atoms
at ri +n1c1 +n2c2 +n3c3, where n1, n2, n3 are arbitrary integers [2, 15]. Because the
atoms in the image cells behave identically to those in the primary cell, it is immaterial to specify which space belongs to the primary cell and which belongs to the image cells. Even though it is customary to refer to the parallelepiped formed by the three period vectors as the simulation cell and the surface of this parallelepiped as the boundary, there is no physical interface at this boundary. In other words, the "boundary" between the primary and image cells in PBC can be drawn anywhere and is only a matter of convention. Consequently, translational invariance is preserved and linear momentum is conserved in all three directions. It is customary to set the velocity of the center of mass to zero in the initial condition; it should then remain zero during the simulation. This provides an important check of the self-consistency of the simulation program.
The scaled coordinates si are usually introduced to simplify the notation and the
implementation of PBC, where
ri = H · si (A.1)
and H = [c1|c2|c3] is a 3x3 matrix whose three columns are formed by the coordinates of the three repeat vectors. For example, H becomes a diagonal matrix when the three repeat vectors are parallel to the x-, y-, and z-axes, respectively,
$$\mathbf{H} = \begin{pmatrix} L_x & 0 & 0 \\ 0 & L_y & 0 \\ 0 & 0 & L_z \end{pmatrix} \tag{A.2}$$
where Lx = |c1|, Ly = |c2|, Lz = |c3|. The periodic boundary conditions can also be stated in terms of the scaled coordinates as follows: whenever there is an atom at location $\mathbf{s}_i = (s^i_x, s^i_y, s^i_z)^T$, there are also atoms at locations $(s^i_x + n_1,\, s^i_y + n_2,\, s^i_z + n_3)^T$, where n1, n2, n3 are arbitrary integers. The scaled coordinates of each atom, $s^i_x$, $s^i_y$, $s^i_z$, are sometimes limited to [−0.5, 0.5), although this is not necessary.
To apply a normal strain in the x direction, one only needs to modify the magnitude of Lx. To introduce a shear strain εyz, one can simply add an off-diagonal term to the H matrix,
$$\mathbf{H} = \begin{pmatrix} L_x & 0 & 0 \\ 0 & L_y & 2\,\varepsilon_{yz} L_y \\ 0 & 0 & L_z \end{pmatrix} \tag{A.3}$$
Regardless of the normal or shear strain, the scaled coordinates $s^i_x$, $s^i_y$, $s^i_z$ still independently satisfy PBC in the domain [−0.5, 0.5), which is the main advantage for introducing the scaled coordinates. By modifying H in these ways, one can stretch and shear a crystal in MD.
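As a small illustration of how the scaled coordinates are used in practice (a sketch under assumed values; Lx, Ly, Lz and eps_yz below are arbitrary and the helper wrap is hypothetical, not code from this thesis), the following Python snippet builds the sheared H of Eq. (A.3), maps a real-space position to scaled coordinates, wraps them into [-0.5, 0.5), and maps back:

```python
import numpy as np

Lx, Ly, Lz, eps_yz = 10.0, 12.0, 15.0, 0.05   # illustrative values
H = np.array([[Lx, 0.0, 0.0],
              [0.0, Ly, 2.0 * eps_yz * Ly],   # off-diagonal term of Eq. (A.3)
              [0.0, 0.0, Lz]])

def wrap(s):
    """Wrap scaled coordinates into [-0.5, 0.5)."""
    return s - np.round(s)

r = np.array([1.0, 14.2, -8.0])   # a real-space atom position
s = np.linalg.solve(H, r)         # scaled coordinates, r = H . s
r_in_cell = H @ wrap(s)           # equivalent position inside the primary cell
```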
A.2.2 Torsional PBC
While the exact formulation of PBC as stated above cannot accommodate a non-
zero average torsion over the entire simulation cell, the general idea can still be used.
Consider a nanowire of length Lz aligned along the z-axis, as shown in Fig. A.1(a).
To apply PBC along the z-axis, one makes two copies of the atoms in the nanowire,
shifts them along z by ±Lz, and lets them interact with the atoms in the primary
wire. Two copies of the original nanowire would be sufficient if the cut-off radius rc of the interatomic potential function is smaller than Lz (usually rc ≪ Lz). After PBC is applied, the model may be considered as an infinitely long, periodic wire along the z-axis.
Figure A.1: (a) A nanowire subjected to PBC along the z axis. (b) A nanowire subjected to t-PBC along the z axis.

Any arbitrary section of length Lz can now be considered as the primary wire
due to the periodicity. Since the atomic arrangement must repeat itself after every
Lz distance along the wire, the average torsion that can be applied to the nanowire
is zero. A local torsion in some section of the wire has to be cancelled by an opposite
torsion at another section that is less than Lz away.
One way to introduce an average torque to this infinitely long wire is to rotate the
two images by angle +φ and −φ, respectively, before attaching them to the two ends
of the primary wire as shown in Fig. A.1(b). The image wire that is displaced by
Lz is rotated by φ, while the one that is displaced by −Lz is rotated by −φ. In this
case, as one travels along the wire by Lz, he will find that the atomic arrangement
in the cross section will be rotated around z axis by angle φ but otherwise identical.
Again, because this property is satisfied by any cross section of the nanowire, it is
arbitrary which is called the primary wire and which are called images, similar to
conventional periodic boundary conditions. The torsion imposed on the nanowire
can be characterized by the angle of rotation per unit length, φ/Lz. In the limit of
small deformation, the shear strain field produced by the torsion is
$$\varepsilon_{\theta z} = \frac{r\,\varphi}{2 L_z} \tag{A.4}$$
where r is the distance away from the z-axis.
The above procedure specifies torsional periodic boundary conditions (t-PBC)
that can be easily expressed in terms of scaled cylindrical coordinates. Consider
an atom i with cartesian coordinates ri = (xi, yi, zi)T and cylindrical coordinates
(ri, θi, zi)T, which are related to each other by,
xi = ri cos θi (A.5)
yi = ri sin θi (A.6)
When the wire is subjected to PBC along z (with free boundary conditions in x and y), the scaled cylindrical coordinates $(s^i_r, s^i_\theta, s^i_z)^T$ are introduced through the relationship
$$\begin{pmatrix} r_i \\ \theta_i \\ z_i \end{pmatrix} = \begin{pmatrix} R & 0 & 0 \\ 0 & 2\pi & 0 \\ 0 & 0 & L_z \end{pmatrix}\begin{pmatrix} s^i_r \\ s^i_\theta \\ s^i_z \end{pmatrix} \equiv \mathbf{M}\cdot\begin{pmatrix} s^i_r \\ s^i_\theta \\ s^i_z \end{pmatrix} \tag{A.7}$$
Both $s^i_\theta$ and $s^i_z$ independently satisfy periodic boundary conditions in the domain [−0.5, 0.5). No boundary condition is applied to the coordinate $s^i_r$. R is a characteristic length scale in the radial direction in order to make $s^i_r$ dimensionless. Although this is not necessary, one can choose R to be the radius of the nanowire, in which case $s^i_r$ would vary from 0 to 1.

Torsion can be easily imposed by introducing an off-diagonal term to the matrix M, which becomes
$$\mathbf{M} = \begin{pmatrix} R & 0 & 0 \\ 0 & 2\pi & \varphi \\ 0 & 0 & L_z \end{pmatrix} \tag{A.8}$$
The scaled coordinates $s^i_\theta$ and $s^i_z$ still independently satisfy periodic boundary conditions in the domain [−0.5, 0.5). This is analogous to the application of shear strain to a simulation cell subjected to conventional PBC, as described in Eq. (A.3). t-PBC can be easily implemented in an existing simulation program by literally following Fig. A.1(b), i.e. by making two copies of the wire, rotating them by ±φ, and placing the two copies at the two ends of the primary wire. In practice, it is not necessary to copy the entire wire, because the cut-off radius rc of the interatomic potential function is usually much smaller than Lz; only two sections at the ends of the primary wire with lengths longer than rc need to be copied.¹ It is important to perform this operation of "copy-and-paste" at every MD time step, or whenever the potential energy and atomic forces need to be evaluated. This will completely remove the end effects and will ensure that identical MD trajectories would be generated had a different section (also of length Lz) of the wire been chosen as the primary wire.
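A minimal sketch of this copy-and-paste step (illustrative only; tpbc_images is a hypothetical helper and the primary wire is assumed to occupy 0 ≤ z ≤ Lz):

```python
import numpy as np

def rot_z(p, angle):
    """Rotate an (N, 3) array of positions about the z axis by 'angle'."""
    c, s = np.cos(angle), np.sin(angle)
    Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return p @ Rz.T

def tpbc_images(pos, Lz, phi, rc):
    """Image atoms needed for t-PBC: the end sections of the primary wire,
    rotated by +/- phi and shifted by +/- Lz along z (Fig. A.1(b))."""
    bottom = pos[pos[:, 2] < pos[:, 2].min() + rc]   # becomes the image above
    top    = pos[pos[:, 2] > pos[:, 2].max() - rc]   # becomes the image below
    above = rot_z(bottom, +phi) + np.array([0.0, 0.0, +Lz])
    below = rot_z(top,    -phi) + np.array([0.0, 0.0, -Lz])
    return np.vstack([above, below])
```

As noted above, these image positions must be rebuilt at every time step (or whenever energies and forces are evaluated) from the current positions of the primary atoms.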
An important property of t-PBC is that the trajectory of every atom satisfies the classical (Newton's) equation of motion. In other words, among the infinite number of atoms that are periodic images of each other, it makes no physical difference which one is called "primary" and which ones are called "images". Since the primary atoms follow Newton's equation of motion ($\mathbf{f}_i = m\,\mathbf{a}_i$), to prove the above claim it suffices to show that the image atoms, which are slaves of the primary atoms (through the "copy-and-paste" operation), also follow Newton's equation of motion ($\mathbf{f}_{i'} = m\,\mathbf{a}_{i'}$).

To show this, consider an atom i and its periodic image i′, such that $s^{i'}_r = s^i_r$, $s^{i'}_\theta = s^i_\theta$, $s^{i'}_z = s^i_z + 1$. The positions of the two atoms are related by t-PBC: $\mathbf{r}_{i'} = \mathrm{Rot}_z(\mathbf{r}_i, \varphi) + \mathbf{e}_z L_z$, where $\mathrm{Rot}_z(\cdot, \varphi)$ represents rotation of a vector around the z-axis by angle φ and $\mathbf{e}_z$ is the unit vector along the z-axis. Hence the accelerations of the two atoms are related to each other through $\mathbf{a}_{i'} = \mathrm{Rot}_z(\mathbf{a}_i, \varphi)$. Now consider an arbitrary atom j that falls within the cut-off radius of atom i. Let $\mathbf{r}_{ij} \equiv \mathbf{r}_j - \mathbf{r}_i$ be the distance vector from atom i to j. Consider the image atom j′ such that $s^{j'}_r = s^j_r$, $s^{j'}_\theta = s^j_\theta$, $s^{j'}_z = s^j_z + 1$. Hence $\mathbf{r}_{j'} = \mathrm{Rot}_z(\mathbf{r}_j, \varphi) + \mathbf{e}_z L_z$, and $\mathbf{r}_{i'j'} \equiv \mathbf{r}_{j'} - \mathbf{r}_{i'} = \mathrm{Rot}_z(\mathbf{r}_{ij}, \varphi)$. Since this is true for an arbitrary neighbor atom j around atom i, the forces on atoms i and i′ must satisfy the relation $\mathbf{f}_{i'} = \mathrm{Rot}_z(\mathbf{f}_i, \varphi)$. Therefore, the trajectory of atom i′ also satisfies Newton's equation of motion $\mathbf{f}_{i'} = m\,\mathbf{a}_{i'}$.

¹This simple approach is not able to accommodate long-range Coulomb interactions, for which the Ewald summation is usually used in conventional PBC. Extension of the Ewald method to t-PBC is beyond the scope of this thesis.
MD simulations under t-PBC should conserve the total linear momentum Pz and
angular momentum Jz because t-PBC preserves both translational invariance along
and rotational invariance around the z axis. However, the linear momenta Px and Py
are no longer conserved in t-PBC due to the specific choice of the origin in the x-y
plane (which defines the cylindrical coordinates r and θ). In comparison, the angular
momentum Jz is usually not conserved in PBC. Consequently, at the beginning of MD
simulations under t-PBC, both Pz and Jz must be set to zero. Pz and Jz will remain
zero, which provides an important self-consistency check of the implementation of
boundary conditions and numerical integrators.
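The conserved quantities can be monitored with a few lines of code; the following sketch (illustrative, not from the thesis) computes Pz and Jz for the atoms of the primary wire:

```python
import numpy as np

def pz_jz(mass, pos, vel):
    """Linear momentum P_z and angular momentum J_z about the z axis.
    mass: (N,), pos: (N, 3), vel: (N, 3)."""
    p = mass[:, None] * vel
    Pz = p[:, 2].sum()
    Jz = np.sum(pos[:, 0] * p[:, 1] - pos[:, 1] * p[:, 0])   # sum of (x p_y - y p_x)
    return Pz, Jz
```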
A.2.3 Bending PBC
The same idea can be used to impose bending deformation on wires. Again, the atomic positions will be described through scaled cylindrical coordinates $(s^i_r, s^i_\theta, s^i_z)^T$, which are related to the real cylindrical coordinates $(r_i, \theta_i, z_i)^T$ through the following transformation,
$$\begin{pmatrix} r_i \\ \theta_i \\ z_i \end{pmatrix} = \begin{pmatrix} R & 0 & 0 \\ 0 & \Theta & 0 \\ 0 & 0 & L_z \end{pmatrix}\begin{pmatrix} s^i_r \\ s^i_\theta \\ s^i_z \end{pmatrix} + \begin{pmatrix} L_0/\Theta \\ 0 \\ 0 \end{pmatrix} \equiv \mathbf{N}\cdot\begin{pmatrix} s^i_r \\ s^i_\theta \\ s^i_z \end{pmatrix} + \begin{pmatrix} L_0/\Theta \\ 0 \\ 0 \end{pmatrix} \tag{A.9}$$
While the coordinate system here is still the same as that in the case of torsion, the wire is oriented along the θ direction, as shown in Fig. A.2. Among the three scaled coordinates, only $s^i_\theta$ is subjected to a periodic boundary condition, in the domain [−0.5, 0.5). This means that θi is periodic in the domain [−Θ/2, Θ/2). No boundary conditions are applied to $s^i_r$ and $s^i_z$. R and Lz are characteristic length scales in the r and z directions, respectively. L0 is the original (stress free) length of the wire and ρ = L0/Θ is the radius of curvature of the wire. The equation r = ρ specifies the neutral surface of the wire. Thus, $r_i = \rho + R s^i_r$, where $R s^i_r$ describes the displacement of atom i away from the neutral axis in the r direction.

Figure A.2: A nanowire subjected to b-PBC around the z axis. At equilibrium the net line tension force F must vanish but a non-zero bending moment M will remain.
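The coordinate relationship of Eq. (A.9) can be written out explicitly; the following sketch is illustrative (bpbc_position is a hypothetical helper, not thesis code) and returns the Cartesian position of an atom from its scaled coordinates under b-PBC:

```python
import numpy as np

def bpbc_position(s, R, Theta, Lz, L0):
    """Cartesian position from scaled cylindrical coordinates (s_r, s_theta, s_z),
    following Eq. (A.9)."""
    s_r, s_th, s_z = s
    s_th -= np.round(s_th)       # only s_theta is wrapped into [-0.5, 0.5)
    rho = L0 / Theta             # radius of curvature of the neutral surface
    r = rho + R * s_r
    theta = Theta * s_th
    z = Lz * s_z
    return np.array([r * np.cos(theta), r * np.sin(theta), z])
```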
In the previous section, an off-diagonal element had to be introduced to the transformation matrix M in order to introduce torsion. In comparison, the form of Eq. (A.9) does not need to be changed to accommodate bending. Different amounts of bending can be imposed by adjusting the value of Θ, while the matrix N remains diagonal. The larger Θ is, the more severe the bending deformation. The state of zero bending corresponds to the limit Θ → 0.
Intuitively, it may seem that increasing the value of Θ would elongate the wire and hence induce a net tension force F in addition to a bending moment M. However, this is not the case, because the directions of the force F at the two ends of the wire are not parallel to each other, as shown in Fig. A.2. When no lateral force (i.e. in the r direction) is applied to the wire, F must vanish for the entire wire to reach equilibrium. Otherwise, there would be a non-zero net force in the −x direction, which would cause the wire to move until F becomes zero. At equilibrium, only a bending moment (but no tension force) can be imposed by b-PBC.
b-PBC can be implemented in a similar way as t-PBC. One makes two copies of the primary wire and rotates them around the z axis by ±Θ. The atoms in these copies will interact with, and provide boundary conditions for, the atoms in the primary wire.² Again, this "copy-and-paste" operation is required at every step of the MD simulation. This will ensure that all atoms (primary and images) satisfy Newton's equation of motion. The proof is similar to that given in the previous section for t-PBC and is omitted here for brevity. Interestingly, both the linear momentum Pz and the angular momentum Jz for the center of mass are conserved in b-PBC, exactly as in t-PBC. Therefore, both Pz and Jz must be set to zero in the initial condition of MD simulations.

²Similar to the case of t-PBC, this simple approach is not able to accommodate long-range Coulomb interactions. While a wire under t-PBC can be visualized as an infinitely long wire, this interpretation will encounter some difficulty in b-PBC, because continuing the curved wire along the θ-direction will eventually make the wire overlap. The interpretation of b-PBC would then require the wire to exist in a multi-sheeted Riemann space [88, page 80] so that the wire does not really overlap with itself.
A.3 Virial Expressions for Torque and Bending
Moment
The experimental data on tensile tests are usually presented in the form of stress-
strain curves. The normal stress is calculated from σ = F/A, where F is the force
applied to the ends and A is the cross section area of the wire. In experiments on
macroscopic samples, the end effects are reduced by making the ends of the speci-
men much thicker than the middle (gauge) section where significant deformation is
expected. In atomistic simulations, on the other hand, the end effects are removed
by a different approach, usually through the use of periodic boundary conditions.
Unfortunately, with the end effects completely removed by PBC, there is no place
to serve as grips where external forces can be applied. Therefore, the stress must be
computed differently in atomistic simulations under PBC than in experiments. The Virial stress expression, which represents the time and volume average of the stress in the simulation cell, is widely used in atomistic calculations.
The same problem appears in atomistic simulations under t-PBC and b-PBC.
There needs to be a procedure to compute the torque and bending moment in these
new boundary conditions. In this section, the Virial expressions for the torque and
bending moment in t-PBC and b-PBC are developed. Similar to the Virial stress, the
new expressions involve discrete sums over all atoms in the simulation cell. The corre-
sponding expressions in continuum mechanics, expressed in terms of volume integrals,
are also identified. Since the derivation of these new expressions is motivated by that of the original Virial expression, a natural place to begin is with a quick review of the Virial stress.
A.3.1 Virial Stress in PBC
For an atomistic simulation cell subjected to PBC in all three directions, the Virial
formula gives the stress averaged over the entire simulation cell at thermal equilibrium
as
$$\sigma_{\alpha\beta} = \frac{1}{\Omega}\left\langle \sum_{i=1}^{N} -m_i v^i_\alpha v^i_\beta + \sum_{i=1}^{N-1}\sum_{j=i+1}^{N} \frac{\partial V}{\partial(x^i_\alpha - x^j_\alpha)}\,(x^i_\beta - x^j_\beta) \right\rangle \tag{A.10}$$
In this formula Ω = det(H) is the volume of the simulation cell, N is the total number of atoms, $v^i_\alpha$ and $x^i_\alpha$ are the α-components of the velocity and position of atom i, and V is the potential energy. The terms $(x^i_\alpha - x^j_\alpha)$ and $(x^i_\beta - x^j_\beta)$ in the second summation are assumed to be taken from the nearest images of atom i and atom j. The bracket ⟨·⟩ means ensemble average, which equals the long-time average if the system has reached equilibrium. Thus the Virial stress is the stress averaged both over the entire space and over a long time.
The Virial stress is the derivative of the free energy F of the atomistic simulation
cell with respect to a virtual strain εαβ, which deforms the periodic vectors c1, c2 and
c3 and hence the matrix H,
$$\sigma_{\alpha\beta} = \frac{1}{\Omega}\frac{\partial F}{\partial \varepsilon_{\alpha\beta}} \tag{A.11}$$
Assuming the simulation cell is in equilibrium under the canonical ensemble, the free energy is defined as
$$F \equiv -k_B T \ln\left\{\frac{1}{h^{3N}N!}\int d^{3N}r_i\, d^{3N}p_i\, \exp\left[-\frac{1}{k_B T}\left(\sum_{i=1}^{N}\frac{|\mathbf{p}_i|^2}{2m_i} + V(\{\mathbf{r}_i\})\right)\right]\right\} \tag{A.12}$$
where kB is Boltzmann's constant, T is the temperature, h is Planck's constant, $\mathbf{r}_i$ and $\mathbf{p}_i$ are the atomic position and momentum vectors, and V is the interatomic potential function. The momenta can be integrated out explicitly to give
$$F = -k_B T \ln\left\{\frac{1}{\Lambda^{3N}N!}\int d^{3N}r_i\, \exp\left[-\frac{V(\{\mathbf{r}_i\})}{k_B T}\right]\right\} \tag{A.13}$$
where $\Lambda \equiv h/(2\pi m k_B T)^{1/2}$ is the thermal de Broglie wavelength. In atomistic simulations under PBC, the potential energy can be written as a function of the scaled coordinates $\mathbf{s}_i$ and the matrix H. Hence, F can also be written in terms of an integral over the scaled coordinates,
$$F = -k_B T \ln\left\{\frac{\Omega^N}{\Lambda^{3N}N!}\int d^{3N}s_i\, \exp\left[-\frac{V(\{\mathbf{s}_i\}, \mathbf{H})}{k_B T}\right]\right\} \tag{A.14}$$
The Virial formula can be obtained by taking the derivative of Eq. (A.14) with respect to εαβ. The first term in the Virial formula comes from the derivative of the volume Ω with respect to εαβ, which contributes a $-N k_B T \delta_{\alpha\beta}/\Omega$ term to the total stress. This is equivalent to the velocity term in the Virial formula because $\langle m_i v^i_\alpha v^i_\beta \rangle = k_B T \delta_{\alpha\beta}$ in the canonical ensemble. The second term comes from the derivative of the potential energy V(si, H) with respect to εαβ. The Virial stress expression can also be derived through several alternative approaches (see [93, 63, 18, 20, 97] for more discussion). The corresponding quantity for the Virial stress in continuum mechanics is the volume average of the stress tensor,
$$\bar{\sigma}_{ij} = \frac{1}{\Omega}\int_\Omega \sigma_{ij}\, dV = \frac{1}{\Omega}\oint_S t_j\, x_i\, dS \tag{A.15}$$
where the integral $\oint_S$ is over the bounding surface of the volume Ω, $t_j$ is the traction force density on surface element dS, and $x_i$ is the position vector of the surface element.
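For a simple pair potential, Eq. (A.10) can be evaluated directly for a single configuration. The sketch below is illustrative only (virial_stress and phi_prime are hypothetical names; an orthogonal cell is assumed, and the ensemble average would be a further time average over the trajectory):

```python
import numpy as np

def virial_stress(pos, vel, mass, box, phi_prime):
    """Instantaneous Virial stress, Eq. (A.10), for a pair potential phi(r).
    pos, vel: (N, 3); mass: (N,); box: (3,) edge lengths; phi_prime(r) = d(phi)/dr."""
    N = len(pos)
    omega = np.prod(box)                    # cell volume
    sigma = np.zeros((3, 3))
    for i in range(N):                      # kinetic (velocity) term
        sigma -= mass[i] * np.outer(vel[i], vel[i])
    for i in range(N - 1):                  # pair (potential) term
        for j in range(i + 1, N):
            d = pos[i] - pos[j]
            d -= box * np.round(d / box)    # nearest image
            r = np.linalg.norm(d)
            sigma += phi_prime(r) / r * np.outer(d, d)
    return sigma / omega
```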
A.3.2 Virial Torque in t-PBC
The Virial torque expression for a simulation cell subjected to t-PBC can be derived in a similar fashion. First, the potential energy V is re-written as a function of the scaled cylindrical coordinates and the components of the matrix M, as given in Eq. (A.8),
$$V(\{\mathbf{r}_i\}) = V(\{s^i_r, s^i_\theta, s^i_z\}, R, \varphi, L_z) \tag{A.16}$$
The Virial torque is then defined as the derivative of the free energy F with respect to φ,
$$\tau \equiv \frac{\partial F}{\partial \varphi} \tag{A.17}$$
$$F = -k_B T \ln\left\{\frac{\Omega^N}{\Lambda^{3N}N!}\int d^{3N}s_i\, \exp\left[-\frac{V(\{s^i_r, s^i_\theta, s^i_z\}, R, \varphi, L_z)}{k_B T}\right]\right\} \tag{A.18}$$
Since ∂Ω/∂φ = 0, the torque reduces to
$$\tau = \frac{\int d^{3N}s_i\, \exp\left[-\frac{V(\{s^i_r, s^i_\theta, s^i_z\}, R, \varphi, L_z)}{k_B T}\right]\frac{\partial V}{\partial \varphi}}{\int d^{3N}s_i\, \exp\left[-\frac{V(\{s^i_r, s^i_\theta, s^i_z\}, R, \varphi, L_z)}{k_B T}\right]} \equiv \left\langle\frac{\partial V}{\partial \varphi}\right\rangle \tag{A.19}$$
In other words, the torque τ is simply the ensemble average of the derivative of the potential energy with respect to the torsion angle φ. To facilitate calculation in an atomistic simulation, one can express ∂V/∂φ in terms of the real coordinates of the atoms,
$$\frac{\partial V}{\partial \varphi} = \frac{1}{L_z}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N}\left[-\frac{\partial V}{\partial(x_i - x_j)}\,(y_i z_i - y_j z_j) + \frac{\partial V}{\partial(y_i - y_j)}\,(x_i z_i - x_j z_j)\right] \tag{A.20}$$
Hence one arrives at the Virial torque expression
$$\tau = \frac{1}{L_z}\left\langle\sum_{i=1}^{N-1}\sum_{j=i+1}^{N}\left[-\frac{\partial V}{\partial(x_i - x_j)}\,(y_i z_i - y_j z_j) + \frac{\partial V}{\partial(y_i - y_j)}\,(x_i z_i - x_j z_j)\right]\right\rangle \tag{A.21}$$
There is no velocity term in Eq. (A.21) because modifying φ does not change the volume Ω of the wire. This expression is verified numerically in Appendix C in the zero-temperature limit, when the free energy equals the potential energy. The corresponding quantity in continuum elasticity theory can be written in terms of an integral over the volume Ω of the simulation cell,
$$\tau = Q_{zz} \equiv \frac{1}{L_z}\int_\Omega \left(-y\,\sigma_{xz} + x\,\sigma_{yz}\right) dV \tag{A.22}$$
The derivation is given in Appendix A. The stress in the above expression refers
to the Cauchy stress in the context of finite deformation. Because it uses current
coordinates, the expression remains valid in finite deformation. The correspondence
between Eqs. (A.21) and (A.22) bears a strong resemblance to the correspondence
between Eqs. (A.10) and (A.15). While the Virial stress formula corresponds to the
average (i.e. zeroth moment) of the stress field over volume Ω, τ corresponds to a
linear combination of the first moments of the stress field.
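In the zero-temperature limit the ensemble average drops out and Eq. (A.21) becomes a plain double sum. The sketch below is illustrative only (virial_torque and phi_prime are hypothetical names; for a pair potential phi(r) the derivative ∂V/∂(x_i - x_j) equals phi'(r)(x_i - x_j)/r, and the t-PBC image handling is assumed to be done by the caller):

```python
import numpy as np

def virial_torque(pos, Lz, phi_prime):
    """Virial torque of Eq. (A.21) for a pair potential, single configuration."""
    N = len(pos)
    tau = 0.0
    for i in range(N - 1):
        for j in range(i + 1, N):
            d = pos[i] - pos[j]
            r = np.linalg.norm(d)
            dVdx = phi_prime(r) * d[0] / r   # dV/d(x_i - x_j)
            dVdy = phi_prime(r) * d[1] / r   # dV/d(y_i - y_j)
            xi, yi, zi = pos[i]
            xj, yj, zj = pos[j]
            tau += -dVdx * (yi * zi - yj * zj) + dVdy * (xi * zi - xj * zj)
    return tau / Lz
```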
A.3.3 Virial Bending Moment in b-PBC
Following a similar procedure, one can obtain the Virial expression for the bending
moment for a simulation cell subjected to b-PBC. First, rewrite the potential energy
of a system under b-PBC as,
$$V(\{\mathbf{r}_i\}) = V(\{s^i_r, s^i_\theta, s^i_z\}, R, \Theta, L_z) \tag{A.23}$$
The Virial bending moment is then the derivative of the free energy with respect to Θ,
$$M \equiv \frac{\partial F}{\partial \Theta} \tag{A.24}$$
$$F = -k_B T \ln\left\{\frac{\Omega^N}{\Lambda^{3N}N!}\int d^{3N}s_i\, \exp\left[-\frac{V(\{s^i_r, s^i_\theta, s^i_z\}, R, \Theta, L_z)}{k_B T}\right]\right\} \tag{A.25}$$
Again, one finds that M is simply the ensemble average of the derivative of the potential energy with respect to Θ,
$$M = \left\langle\frac{\partial V}{\partial \Theta}\right\rangle \tag{A.26}$$
The derivative ∂V/∂Θ can be expressed in terms of the real coordinates of the atoms,
$$\frac{\partial V}{\partial \Theta} = \frac{1}{\Theta}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N}\left[-\frac{\partial V}{\partial(x_i - x_j)}\left(y_i\theta_i - y_j\theta_j + \rho\cos\theta_i - \rho\cos\theta_j\right) + \frac{\partial V}{\partial(y_i - y_j)}\left(x_i\theta_i - x_j\theta_j - \rho\sin\theta_i + \rho\sin\theta_j\right)\right] \tag{A.27}$$
Hence one arrives at the Virial bending moment expression,
$$M = \frac{1}{\Theta}\left\langle\sum_{i=1}^{N-1}\sum_{j=i+1}^{N}\left[-\frac{\partial V}{\partial(x_i - x_j)}\left(y_i\theta_i - y_j\theta_j + \rho\cos\theta_i - \rho\cos\theta_j\right) + \frac{\partial V}{\partial(y_i - y_j)}\left(x_i\theta_i - x_j\theta_j - \rho\sin\theta_i + \rho\sin\theta_j\right)\right]\right\rangle \tag{A.28}$$
There is no velocity term in Eq. (A.28) because modifying Θ does not change the volume Ω of the wire. This expression is verified numerically in Appendix D in the zero-temperature limit, when the free energy equals the potential energy. The corresponding quantity in continuum elasticity theory can be written in terms of an integral over the volume Ω of the simulation cell,
$$M = Q_{z\theta} = \frac{1}{\Theta}\int_A dA \int_0^\Theta d\theta\,\left(-y\,\sigma_{x\theta} + x\,\sigma_{y\theta}\right) = \frac{1}{\Theta}\int_A dA \int_0^\Theta d\theta\, r\,\sigma_{\theta\theta} = \frac{1}{\Theta}\int_\Omega \sigma_{\theta\theta}\, dV \tag{A.29}$$
where A is the cross-section area of the continuum body subjected to b-PBC. The correspondence between Eqs. (A.28) and (A.29) bears a strong resemblance to the correspondence between Eqs. (A.10) and (A.15). Similar to τ, M also corresponds to a linear combination of the first moments of the stress field over the simulation cell volume.
A.4 Numerical Results
This section demonstrates the usefulness of the t-PBC and b-PBC described above through torsion and bending Molecular Dynamics simulations of Si nanowires (NWs) taken to failure.
The interactions between Si atoms are described by the modified embedded-atom-method (MEAM) potential [7], which has been found to be more reliable in the study of the failure of Si NWs than several other potential models for Si [48]. Two NWs both
oriented along the [111] direction with diameters D = 7.5 nm and D = 10 nm and
the same aspect ratio Lz/D = 2.27 were considered. To make sure the NW surface
is well reconstructed, the NWs are annealed by MD simulations at 1000 K for 1 ps
followed by a conjugate gradient relaxation. Simulation results on initially perfect
NWs under torsion and bending deformation at T = 300 K are presented.
A.4.1 Si Nanowire under Torsion
Simulations of Si NWs under torsion can be carried out easily using t-PBC. Before
applying a torsion, the NWs are first equilibrated at the specified temperature and
zero stress (i.e. zero axial force) by MD simulations under PBC where the NW
length is allowed to elongate to accommodate the thermal strain. Fig. A.3(a) and (c)
shows the annealed Si NW structures. Subsequently, torsion is applied to the NW
through t-PBC, where the twist angle φ (between the two ends of the NW) increases in steps of 0.02 radian (≈ 1.15°). For each twist angle, an MD simulation under t-PBC is performed for 2 ps. The Nosé-Hoover thermostat is used to maintain the temperature at T = 300 K using the Störmer-Verlet time integrator [10] with a time step of 1 fs. The linear momentum Pz and angular momentum Jz are conserved to within 2 × 10⁻¹⁰ eV·ps·Å⁻¹ and 9 × 10⁻⁷ eV·ps, respectively, during the simulation. The twist angle continues to increase until the NW fails. If the Virial torque at the end of the 2 ps simulation is lower than that at the beginning of the simulation, the MD simulation is continued in 2 ps increments without increasing the twist angle, until the torque increases. The purpose of this approach is to give enough simulation time to resolve the failure process whenever it occurs. The Virial torque is computed by time averaging over the last 1 ps of the simulation for each twist angle.
The torque versus twist angle relationship is plotted in Fig. A.4.
The τ -φ curve is linear for small values of φ and becomes non-linear as φ ap-
proaches the critical value at failure. The torsional stiffness can be obtained from the
(a) Initial structure, D = 7.5 nm
(b) After failure, D = 7.5 nm, φ = 1.16 rad
(c) Initial structure, D = 10 nm
(d) After failure, D = 10 nm, φ = 1.18 rad
Figure A.3: Snapshots of Si NWs of two diameters before torsional deformation and after failure. The failure mechanism depends on the diameter.
Figure A.4: Virial torque τ as a function of the rotation angle φ between the two ends of the NWs of two different diameters (D = 75 Å, L = 170 Å and D = 100 Å, L = 227 Å). Because the two NWs have the same aspect ratio Lz/D, they have the same maximum strain (on the surface) γmax = φD/(2Lz) at the same twist angle φ.
torque-twist relationship and its value at small φ can be compared to theory. The
torsional stiffness is defined as
$$k_t \equiv \frac{\partial \tau}{\partial \varphi} \tag{A.30}$$
In the limit of φ → 0, the torsional stiffness is estimated to be kt = 5.11 × 10³ eV for D = 7.5 nm and kt = 1.25 × 10⁴ eV for D = 10 nm. Strength of Materials predicts the following relationships for elastically isotropic circular shafts under torsion:
$$\tau = \frac{\varphi}{L_z}\, G J, \qquad k_t = \frac{G J}{L_z} \tag{A.31}$$
where G is the shear modulus and J = πD⁴/32 is the polar moment of inertia. This expression is valid only in the limit of small deformation (φ → 0). To compare the simulation results against this expression, one needs to use the shear modulus of Si given by the MEAM model (C11 = 163.78 GPa, C12 = 64.53 GPa, C44 = 76.47 GPa) on the (111) plane, which is G = 58.57 GPa. The predictions of the torsional stiffness from Strength of Materials are compared with the estimated values from MD simulations in Table A.1. The predictions overestimate the MD results by 25-30%. However, this difference can be easily eliminated by a slight adjustment
(∼ 6%) of the NW diameter D, given that kt ∝ D⁴. The adjusted diameters D* for the two NWs are approximately 6 Å smaller than the nominal diameters D, which corresponds to a reduction of the NW radius by 3 Å. This can be easily accounted for by the inaccuracy in the definition of the NW diameter and the possibility of a weak surface layer on Si NWs [48].
Table A.1: Comparison of the torsional stiffness for Si NWs estimated from MD simulations and that predicted by Strength of Materials (SOM) theory. D* is the adjusted NW diameter that makes the SOM predictions exactly match the MD results. The critical twist angle φc and critical shear strain γc at failure are also listed.

Nominal diameter D   kt (MD)     kt (SOM)    Adjusted diameter D*   φc          γc
7.5 nm               5110 eV     6680 eV     7.0 nm                 1.16 rad    0.26
10.0 nm              12538 eV    15812 eV    9.4 nm                 1.18 rad    0.26
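The SOM column of Table A.1 can be reproduced from Eq. (A.31) with a few lines of arithmetic; the script below is an illustrative check (not thesis code) using the values quoted in the text:

```python
import numpy as np

G = 58.57e9                    # MEAM shear modulus on the (111) plane, Pa
eV = 1.602176634e-19           # joules per eV
for D in (7.5e-9, 10.0e-9):    # nominal diameters, m
    Lz = 2.27 * D              # aspect ratio Lz/D = 2.27
    J = np.pi * D**4 / 32      # polar moment of inertia
    kt = G * J / Lz / eV       # torsional stiffness in eV per radian
    print(f"D = {D * 1e9:.1f} nm: kt(SOM) ~ {kt:.0f} eV")
# prints values close to the SOM column of Table A.1
# (about 6.7e3 eV and 1.58e4 eV)
```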
The above agreement gives us confidence in the use of Strength of Materials to
describe the behavior of NWs under torsion. Hence, it can be used to extract the
critical strain in both NWs at failure. The maximum strain (engineering strain) in a
cylindrical torsional shaft occurs on its surface,
$$\gamma_{\max} = \frac{\varphi D}{2 L_z} \tag{A.32}$$
Given that the aspect ratio of NWs is kept at Lz/D = 2.27, one has
γmax = 0.22φ (A.33)
for both NWs. The critical twist angle and critical strain at failure for both NWs are
listed in Table A.1.
The critical shear strain at failure is expected to be independent of the shaft di-
ameter for large diameters. This seems to hold remarkably well in the NW torsion
simulations. Because the NW under t-PBC has no “ends”, failure can initiate any-
where along the NW. However, different failure mechanisms are observed in the two NWs with different diameters. The thinner NW fails by sliding along a (111) plane,
as seen in Fig. A.3(b). The thicker NW fails by sliding both along a (111) plane
and along longitudinal planes, creating wedges on the (111) cross section, as seen
in Fig. A.3(d). The failure mechanism of the thicker NW is also more gradual than
that of the thinner NW. As can be observed in Fig. A.4, the torque is completely
relieved on the thinner NW when failure occurs, whereas the thicker NW experiences
a sequence of failures. A more detailed analysis on the size dependence of NW failure
modes and their mechanisms will be presented in a subsequent paper.
A.4.2 Si Nanowire under Bending
Simulations of Si NWs can be carried out using b-PBC just as was done for torsion.
The Si NWs are equilibrated in the same way as described in the previous section
before applying bending through b-PBC. The bending angle Θ (between the two ends of the NW) increases in steps of 0.02 radian (≈ 1.15°). For each bending angle, MD simulations under b-PBC were performed for 2 ps. The linear momentum Pz and angular momentum Jz are conserved to the same level of precision as in the torsion simulations. The bending angle continues to increase until the NW fails. If the Virial bending moment at the end of the 2 ps simulation is lower than that at the beginning of the simulation, the MD simulation is continued in 2 ps increments without increasing the bending angle, until the bending moment increases. The purpose of this approach is to give enough simulation time to resolve the failure process whenever it occurs. The Virial bending moment is computed by a time average over the last 1 ps of the simulation for each bending angle. The bending moment versus bending angle relationship is plotted in Fig. A.5.
The M -Θ curve is linear for small values of Θ and becomes non-linear as Θ ap-
proaches the critical value at failure. The bending stiffness can be computed from
the M -Θ curve and its value at small Θ can be compared to theory. Similar to the
torsional stiffness in the previous section, define a bending stiffness as
$$k_b \equiv \frac{\partial M}{\partial \Theta} \tag{A.34}$$
In the limit of Θ → 0, the bending stiffness is estimated to be kb = 8.12 × 10³ eV for
Figure A.5: Virial bending moment M as a function of the bending angle Θ between the two ends of the two NWs with different diameters (D = 75 Å, L = 170 Å and D = 100 Å, L = 227 Å). Because the two NWs have the same aspect ratio Lz/D, they have the same maximum strain εmax = ΘD/(2Lz) at the same bending angle Θ.
D = 7.5 nm and kb = 1.96 × 10⁴ eV for D = 10 nm. Strength of Materials predicts the following relationships for an elastically isotropic beam under bending,
$$M = \frac{\Theta}{L_0}\, E I_z, \qquad k_b = \frac{E I_z}{L_0} \tag{A.35}$$
where E is the Young's modulus and Iz = πD⁴/64 is the moment of inertia of the NW cross section around the z-axis. To compare the simulation results against this expression, one needs to use the Young's modulus of Si given by the MEAM model along the [111] direction, which is 181.90 GPa. The predictions of the bending stiffness from Strength of Materials are compared with the estimated values from MD simulations in Table A.2. The predictions overestimate the MD results by 23-25%. But this difference can be easily eliminated by a slight adjustment (∼ 5%) of the NW diameter D, given that kb ∝ D⁴. The adjusted diameters D* for the two NWs are approximately 5 Å smaller than the nominal diameters D, which corresponds to a reduction of the NW radius by 2.5 Å. It is encouraging to see that the adjusted diameters from the torsion simulations match those for the bending simulations reasonably well.
The above agreement gives us confidence in the use of Strength of Materials theory
to describe the behavior of NW under bending. Hence it can be used to extract the
Table A.2: Comparison of the bending stiffnesses for Si NWs estimated from MD simulations and that predicted by Strength of Materials (SOM) theory. D* is the adjusted NW diameter that makes the SOM predictions exactly match the MD results. The critical bending angle Θf and critical normal strain εf at fracture are also listed.

Nominal diameter D   kb (MD)     kb (SOM)    Adjusted diameter D*   Θf          εf
7.5 nm               8117 eV     10374 eV    7.1 nm                 0.96 rad    0.21
10.0 nm              19619 eV    24554 eV    9.5 nm                 0.76 rad    0.17
critical strain experienced by both NWs at the point of fracture. Based on the
Strength of Materials theory, the maximum strain (engineering strain) of a beam in
pure bending occurs at the points furthest away from the bending axis,
$$\varepsilon_{\max} = \frac{\Theta D}{2 L_0} \tag{A.36}$$
Since the aspect ratio of NWs is kept at L0/D = 2.27, one has
εmax = 0.22 Θ (A.37)
for both NWs. The critical bending angle and critical normal strain at failure for
both NWs are listed in Table A.2. The critical strain at fracture is similar to results
obtained from MD simulations of Si NWs under uniaxial tension, εf = 0.18, also
using the MEAM model [48]. The higher critical stress value observed in the thinner
NW in bending is related to the higher stress gradient across its cross section.
Fig. A.6 shows the atomic structure of the NWs right before and right after frac-
ture. The much larger critical strain observed in the thinner NW is related to the formation of metastable hillocks on the compressive side of the NW, as shown in Fig. A.6(a). It seems that the formation of hillocks relieves some bending strain
and allows the thinner NW to deform further without causing fracture. In fact, the
onset of hillock formation in the thinner NW happens at the same rotation angle
(Θ = 0.76 rad) as the angle at which the thicker NW fractures.
(a) Before fracture, D = 7.5 nm, φ = 0.94 rad
(b) After fracture, D = 7.5 nm, φ = 0.96 rad
(c) Before fracture, D = 10 nm, φ = 0.74 rad
(d) After fracture, D = 10 nm, φ = 0.76 rad
Figure A.6: Snapshots of Si NWs of two diameters under bending deformation before and after fracture. While metastable hillocks form on the thinner NWs before fracture (a), this does not happen for the thicker NW (c).
A.5 Summary
In this appendix a unified approach to handle torsion and bending of wires in atom-
istic simulations by generalizing the Born-von Karman periodic boundary conditions
to cylindrical coordinates has been presented. The expressions for the torque and
bending moments in terms of an average over the entire simulation cell were derived,
in close analogy to the Virial stress expression. Molecular Dynamics simulations un-
der these new boundary conditions show several failure modes of Silicon nanowires
under torsion and bending, depending on the nanowire diameter. These simulations
are able to probe the intrinsic behavior of nanowires because the artificial end effects
are completely removed.
Bibliography
[1] J. Ahrens, B. Geveci, and C. Law. Paraview: An end user tool for large data
visualization. Technical report, Academic Press, 2005.
[2] M. P. Allen and D. J. Tildesley. Computer Simulation of Liquids. Oxford Uni-
versity Press, 2007.
[3] A. A. Amsden and F. H. Harlow. The SMAC method: a numerical technique
for calculating incompressible flows. Technical Report LA-4370, Los Alamos
National Laboratory, 1970.
[4] ATI. Radeon X1900 product site, 2006.
http://www.ati.com/products/radeonx1900/index.html .
[5] ATITool. techpowerup.com, 2006.
http://www.techpowerup.com/atitool.
[6] John Aycock. A brief history of just-in-time. ACM Comput. Surv., 35(2):97–113,
2003.
[7] M. I. Baskes. Modified embedded-atom potentials for cubic materials and impu-
rities. Phys. Rev. B, 46:2727–2742, 1992.
[8] Nathan Bell and Michael Garland. Efficient sparse matrix-vector multiplication
on cuda. Technical report, NVIDIA, 2008.
[9] J. Bolz, I. Farmer, E. Grinspun, and P. Schroder. Sparse matrix solvers on the
GPU: conjugate gradients and multigrid. In SIGGRAPH ’03: ACM SIGGRAPH
Papers, pages 917–924, New York, NY, USA, 2003. ACM.
[10] S. D. Bond, B. J. Leimkuhler, and B. B. Laird. The nose-poincare method for
constant temperature molecular dynamics. J. Comput. Phys., 151:114–134, 1999.
[11] T. Brandvik and G. Pullan. Acceleration of a 3d euler solver using commod-
ity graphics hardware. In 46th AIAA Aerospace Sciences Meeting and Exhibit,
January 2008.
[12] I. Buck. High level languages for GPUs. In SIGGRAPH ’05: ACM SIGGRAPH
2005 Courses, page 109, New York, NY, USA, 2005. ACM Press.
[13] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Han-
rahan. Brook for GPUs: Stream computing on graphics hardware. ACM Trans-
actions on Graphics, 23(3):777 – 786, 2004 2004.
[14] Ian Buck, Kayvon Fatahalian, and Pat Hanrahan. Gpubench: Evaluating gpu
performance for numerical and scientific applications. In Poster Session at GP2
Workshop on General Purpose Computing on Graphics Processors, 2004.
http://gpubench.sourceforge.net.
[15] V. V. Bulatov and W. Cai. Computer Simulations of Dislocations. Oxford
University Press, 2006.
[16] M. H. Carpenter, D. Gottlieb, and S. Abarbanel. Time-stable boundary condi-
tions for finite-difference schemes solving hyperbolic systems: Methodology and
application to high-order compact schemes. J. Comput. Phys., 111(2):220–236,
1994.
[17] M. Chau, O. Englander, and L. Lin. Silicon nanowire-based nanoactuator. In
Proceedings of the 3rd IEEE conference on nanotechnology, volume 2, pages 879–
880, San Francisco, CA, Aug 12-14 2003.
[18] K. S. Chueng and S. Yip. Atomic-level stress in an inhomogeneous system. J.
Appl. Phys., 70:5688–90, 1991.
[19] J.-F. Collard and D. Lavery. Optimizations to prevent cache penalties for the
intel itanium 2 processor. pages 105–114, March 2003.
[20] J. Cormier, J. M. Rickman, and T. J. Delph. Stress calculation in atomistic
simulations of perfect and imperfect solids. J. Appl. Phys., 89:99–104, 2001.
[21] Y. Cui, Q. Wei, H. Park, and C. M. Lieber. Nanowire nanosensors for highly
sensitive and selective detection of biological and chemical species. Science,
293:1289–1292, 2001.
[22] Y. Cui, Z. Zhong, D. Wang, W. U. Wang, and C. M. Lieber. High performance
silicon nanowire field effect transistors. Nano Letters, 3:149–152, 2003.
[23] W. Ding, L. Calabri, X. Chen, K. Kohlhass, and R. S. Ruoff. Mechanics of
crystalline boron nanowires. presented at the 2006 MRS spring meeting, San
Francisco, CA, 2006.
[24] David Dobkin and Michael Laszlo. Primitives for the manipulation of three-
dimensional subdivisions. Algorithmica, 4(1-4):3–32, 1989.
[25] T. Dumitrica and R. D. James. Objective molecular dynamics. J. Mech. Phys.
Solids, 55:2206–2236, 2007.
[26] Carter Edwards. Sierra framework version 3: Core services theory and design.
Technical report, Sandia National Laboratories, 2002.
[27] E. Elsen, V. Vishal, E. Darve, P. Hanrahan, V. Pande,
and I. Buck. GROMACS on the GPU, 2005.
http://bcats.stanford.edu/pdf/BCATS 2005 abstract book.pdf.
[28] A. Fan, F. Qiu, A. Kaufman, and S. Yoakum-Stover. GPU Cluster for High
Performance Computing. SC, 00:47, 2004.
[29] R. Fan, R. Karnik, M. Yue, D. Y. Li, A. Majumdar, and P. D. Yang. DNA
translocation in inorganic nanotubes. Nano Letters, 5:1633–1637, 2005.
[30] H. Fujitani, Y. Tanida, M. Ito, G. Jayachandran, C. D. Snow, M. R. Shirts,
E. J. Sorin, and V. S. Pande. Direct Calculation of the Binding Free Energies of
FKBP Ligands. J. Chem. Phys., 123(8):84108, 2005.
[31] T. Fukushige, J. Makino, and A. Kawai. GRAPE-6A: A Single-Card GRAPE-
6 for Parallel PC-GRAPE Cluster Systems. Publications of the Astronomical
Society of Japan, 57:1009–1021, dec 2005.
[32] Dominik Goddeke, Robert Strzodka, and Stefan Turek. Accelerating double
precision FEM simulations with GPUs. In Proceedings of ASIM 2005 - 18th
Symposium on Simulation Technique, September 2005.
[33] D. Rodriguez Gomez, E. Darve, and A. Pohorille. Assessing the efficiency of free
energy calculation methods. Journal of Chemical Physics, 120(8):3563–78, Feb
2004.
[34] N. Goodnight, C. Woolley, G. Lewin, D. Luebke, and G. Humphreys. A multigrid
solver for boundary value problems using programmable graphics hardware. In
SIGGRAPH ’05: ACM SIGGRAPH 2005 Courses, pages 193–203, 2005.
[35] M. J. Harris, W. V. Baxter, T. Scheuermann, and A. Lastra. Simulation of cloud
dynamics on graphics hardware. In HWWS ’03: Proceedings of the ACM SIG-
GRAPH/EUROGRAPHICS conference on Graphics hardware, pages 92–101,
2003.
[36] Michael A. Heroux, Roscoe A. Bartlett, Vicki E. Howle, Robert J. Hoekstra,
Jonathan J. Hu, Tamara G. Kolda, Richard B. Lehoucq, Kevin R. Long, Roger P.
Pawlowski, Eric T. Phipps, Andrew G. Salinger, Heidi K. Thornquist, Ray S.
Tuminaro, James M. Willenbring, Alan Williams, and Kendall S. Stanley. An
overview of the trilinos project. ACM Trans. Math. Softw., 31(3):397–423, 2005.
[37] Jan Hesthaven and Tim Warburton. Nodal Discontinuous Galerkin Methods:
Algorithms, Analysis and Applications. Springer, 2008.
[38] M. F. Horstemeyer, J. Lim, W. Y. Lu, D. A. Mosher, M. I. Baskes, V. C. Prantil,
and S. J. Plimpton. Torsion/simple shear of single crystal copper. J. Eng. Mater.
Tech., 124:322–328, 2002.
[39] Y. Huang and C. M. Lieber. Integrated nanoscale electronics and optoelectronics:
Exploring nanoscale science and technology through semiconductor nanowires.
Pure Appl. Chem, 76:2051–2068, 2004.
[40] M. Huhtala, A. Kuronen, and K. Kaski. Dynamical simulations of carbon nan-
otube bending. Int. J. Modern Phys. C, 15:517–534, 2004.
[41] IBM. Cell Broadband Engine Programming Handbook, 1.11 edition, May 2008.
[42] Peta Computing Institute. Mdgrape-3 pci-x, 2006.
[43] Intel. Intel Pentium 4 Thermal Management, 2006.
http://www.intel.com/support/processors/pentium4/sb/CS-007999.htm.
[44] Y. Isono, M. Kiuchi, and S. Matsui. Development of electrostatic actuated nano
tensile testing device for mechanical and electrical characterstics of FIB deposited
carbon nanowire. presented at the 2006 MRS spring meeting, San Francisco, CA,
2006.
[45] Hrvoje Jasak, Aleksandar Jemcov, and Zeljko Tukovic. Openfoam: A c++ li-
brary for complex physics simulations. volume 47, 2007.
[46] P. M. Jeff and N. A. Fleck. The failure of composite tubes due to combined
compression and torsion. J. Mater. Sci., 29:3080–3084, 1994.
[47] Yunfei Chen, Juekuan Yang, Yujuan Wang. Accelerated molecular dynamics
simulation of thermal conductivities. Journal of Computational Physics, 2006.
doi:10.1016/j.jcp.2006.06.039.
[48] K. Kang and W. Cai. Brittle and ductile fracture of semiconductor nanowires –
molecular dynamics simulations. Philosophical Magazine, 87:2169–2189, 2007.
[49] Y. Khalighi, G. Iaccarino, and P. Moin. Comparison of Lattice Boltzmann
Method and conventional CFD techniques. APS Meeting Abstracts, pages K7+,
November 2004.
[50] T. Kizuka, Y. Takatani, K. Asaka, and R. Yoshizaki. Measurements of the
atomistic mechanics of single crystalline silicon wires of nanometer width. Phys.
Rev. B, 72:035333–1–6, 2005.
[51] J. Kruger and R. Westermann. Linear Algebra Operators for GPU Implementa-
tion of Numerical Algorithms. In ACM Transactions on Graphics (Proceedings
of SIGGRAPH), pages 908–916, July 2003.
[52] Orion S. Lawlor, Sayantan Chakravorty, Terry L. Wilmarth, Nilesh Choudhury,
Isaac Dooley, Gengbin Zheng, and Laxmikant V. Kale. ParFUM: a parallel frame-
work for unstructured meshes for scalable dynamic physics applications. Eng.
with Comput., 22(3):215–235, 2006.
[53] A. Lefohn. GPU data structures. In GPGPU: General-Purpose Computation
on Graphics Hardware Tutorial, Int. Conf. for High Perf. Comput., Netw., Stor.
and Anal., Nov. 2006.
[54] W. Li, Z. Fan, X. Wei, and A. Kaufman. GPU Gems 2, chapter 47, GPU-based
flow simulation with complex boundaries, pages 747–764. Addison-Wesley, 2005.
[55] W. Li, X. Wei, and A. Kaufman. Implementing lattice boltzmann computation
on graphics hardware. Visual Comput., 19:444–456, 2003.
[56] E. Lindahl, B. Hess, and D. van der Spoel. GROMACS 3.0: A package for
molecular simulation and trajectory analysis. J. Mol. Mod., 7:306–317, 2001.
[57] Y. Liu, X. Liu, and E. Wu. Real-time 3D fluid simulation on GPU with complex
obstacles. In 12th Pacific Conference on Computer Graphics and Applications,
6-8 Oct. 2004, Seoul, South Korea, pages 247–256, 2004.
[58] K. Long. Sundance 2.0 tutorial. Technical Report SAND2004-4793, Sandia
National Laboratories, 2004.
[59] D. Luebke, M. Harris, J. Kruger, T. Purcell, N. Govindaraju, I. Buck, C. Woolley,
and A. Lefohn. GPGPU: general purpose computation on graphics hardware. In
SIGGRAPH ’04: ACM SIGGRAPH 2004 Course Notes, page 33, 2004.
[60] M. A. Makeev and D. Srivastava. Silicon carbide nanowires under external loads:
An atomistic simulation study. Phys. Rev. B, 74:165303, 2006.
[61] J. Makino, T. Fukushige, M. Koga, and K. Namura. GRAPE-6: Massively-
Parallel Special-Purpose Computer for Astrophysical Particle Simulations. Pub-
lications of the Astronomical Society of Japan, 55:1163–1187, December 2003.
[62] Junichiro Makino, Eiichiro Kokubo, and Toshiyuki Fukushige. Performance evaluation
and tuning of GRAPE-6 - towards 40 "real" TFlops. In SC '03: Proceedings of
the 2003 ACM/IEEE conference on Supercomputing, page 2, Washington, DC,
USA, 2003. IEEE Computer Society.
[63] G. Marc and W. G. McMillan. The virial theorem. Adv. Chem. Phys., 58:209–
361, 1985.
[64] A.C. Marta and J.J. Alonso. High-speed MHD flow control using adjoint-
based sensitivities. AIAA paper 2006-8009, 14th AIAA/AHI International Space
Planes and Hypersonic Systems and Technologies Conference, Canberra, Aus-
tralia, November 2006.
[65] K. Mattsson, M. Svard, M. Carpenter, and J. Nordstrom. Accuracy requirements
for transient aerodynamics. In 16th AIAA Computational Fluid Dynamics Con-
ference, Orlando, FL, June 2003.
[66] Stephen McMillan. The Vectorization of Small-N Integrators. In Piet Hut and
Stephen McMillan, editors, The Use of Supercomputers in Stellar Dynamics,
pages 156–161, 1986.
[67] C. McNairy and D. Soltis. Itanium 2 processor microarchitecture. IEEE Micro,
23(2):44–55, March-April 2003.
[68] Jeffrey M. McNally, L.E. Garey, and R.E. Shaw. A communication-less parallel
algorithm for tridiagonal Toeplitz systems. Journal of Computational and Applied
Mathematics, 212(2):260–271, 2008.
[69] Microsoft. DirectX home page, 2003. http://www.microsoft.com/windows/directx/default.asp.
[70] Microsoft. Pixel Shader 3.0 specification on MSDN, 2006.
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/directx_sdk.asp.
[71] Gordon Moore. Cramming more components onto integrated circuits. Electronics,
38(8), April 19, 1965.
[72] K. Mylvaganam, T. Vodenitcharova, and L. C. Zhang. The bending-kinking
analysis of a single-walled carbon nanotube - a combined molecular dynamics and
continuum mechanics technique. J. Mater. Sci., 41:3341–3347, 2006.
[73] A. Nakatani and H. Kitagawa. Atomistic study of size effect in torsion tests of
nanowire. XXI ICTAM, 15-21 August 2004.
[74] Keigo Nitadori, Junichiro Makino, and Piet Hut. Performance tuning of n-body
codes on modern microprocessors: I. direct integration with a Hermite scheme
on x86_64 architectures, November 2005. http://arxiv.org/abs/astro-ph/0511062.
[75] Keigo Nitadori, Junichiro Makino, and Piet Hut. Performance tuning of n-body
codes on modern microprocessors: I. direct integration with a Hermite scheme
on x86_64 architecture. New Astron., 12:169, 2006.
[76] J. Nordstrom, E. van der Weide, J. Gong, and M. Svard. A hybrid method
for the unsteady compressible Navier-Stokes equations. Annual CTR Research
Briefs, Center for Turbulence Research, Stanford, 2007.
[77] T. Nozaki, M. Doyama, and Y. Kogure. Computer simulation of high-speed
bending deformation in copper. Radiation Effects and Defects in Solids, 157:217–
222, 2002.
[78] NVIDIA. CUDA Programming Guide 1.1, November 2007.
http://developer.download.nvidia.com/compute/cuda/1_1/NVIDIA_CUDA_Programming_Guide_1.1.pdf.
[79] J.D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. E. Lefohn, and
T. J. Purcell. A survey of general-purpose computation on graphics hardware.
Computer Graphics Forum, 26(1):80–113, 2007.
[80] M. Parrinello and A. Rahman. Polymorphic transitions in single crystals: a new
molecular dynamics method. J. Appl. Phys., 52:7182–7190, 1981.
[81] M. Rumpf and R. Strzodka. Nonlinear diffusion in graphics hardware. In Pro-
ceedings of EG/IEEE TCVG Symposium on Visualization VisSym ’01, pages
75–84, 2001.
[82] C. E. Scheidegger, J. L. D. Comba, and R. D. da Cunha. Practical CFD Simu-
lations on Programmable Graphics Hardware using SMAC. Computer Graphics
Forum, 24(4):715–728, 2005.
[83] Kirk Schloegel, George Karypis, and Vipin Kumar. Parallel static and dynamic
multi-constraint graph partitioning. Concurrency and Computation: Practice
and Experience, 14(3):219–240, 2002.
[84] Patrick Schmid. Does cache size really boost performance?, October 2007.
[85] Anand Lal Shimpi. 6MB L2 vs. 3MB L2, 2008.
[86] J.W. Sias, Sain-Zee Ueng, G.A. Kent, I.M. Steiner, E.M. Nystrom, and W.-M.W.
Hwu. Field-testing IMPACT EPIC research results in Itanium 2. Pages 26–37, June
2004.
[87] Christopher D. Snow, Eric J. Sorin, Young Min Rhee, and Vijay S. Pande. How
Well Can Simulation Predict Protein Folding Kinetics and Thermodynamics?
Ann. Rev. Biophys. Biomol. Struc., 34:43–69, 2005.
[88] A. Sommerfeld. Partial Differential Equations in Physics, Lectures on Theoretical
Physics, volume VI. Academic Press, 1964.
[89] J. Stam. Stable fluids. In SIGGRAPH, pages 121–128, July 1999.
[90] J. S. Stölken and A. G. Evans. A microbend test method for measuring the
plasticity length scale. Acta Mater., 46:5100–5115, 1998.
[91] M. Svard, K. Mattsson, and J. Nordstrom. Steady-state computations using
summation-by-parts operators. J. Sci. Comput., 24(1):79–95, 2005.
[92] M. Taiji, T. Narumi, Y. Ohno, N. Futatsugi, A. Suenaga, N. Takada, and A. Kon-
agaya. Protein Explorer: A petaflops special-purpose computer system for molec-
ular dynamics simulations. In SC ’03: Proceedings of the 2003 ACM/IEEE con-
ference on Supercomputing, 2003.
[93] D. H. Tsai. Virial theorem and stress calculation in molecular-dynamics. J.
Chem. Phys., 70:1375–82, 1979.
[94] D. Wang, Q. Wang, A. Javey, R. Tu, and H. Dai. Germanium nanowire field-
effect transistors with SiO2 and high-κ HfO2. Appl. Phys. Lett., 83:2432–2434,
2003.
[95] C. Zhang and H. Shen. Buckling and postbuckling analysis of single-walled
carbon nanotubes in thermal environments via molecular dynamics simulation.
Carbon, 44:2608–2616, 2006.
[96] Y. Zhu and H. D. Espinosa. An electromechanical material testing system for in
situ electron microscopy and applications. Proc. Nat’l. Acad. Sci., 102:14503–
14508, 2005.
[97] J. A. Zimmerman, E. B. Webb III, J. J. Hoyt, R. E. Jones, P. A. Klein, and D. J.
Bammann. Calculation of stress in atomistic simulation. Modell. Simul. Mater.
Sci. Eng., 12:S319–332, 2004.