PARALLEL SCIENTIFIC COMPUTATION
ON EMERGING ARCHITECTURES
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF MECHANICAL
ENGINEERING
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Erich Konrad Elsen
September 2009
© Copyright by Erich Konrad Elsen 2009
All Rights Reserved
I certify that I have read this dissertation and that, in my opinion, it
is fully adequate in scope and quality as a dissertation for the degree
of Doctor of Philosophy.
(Eric Darve) Principal Adviser
I certify that I have read this dissertation and that, in my opinion, it
is fully adequate in scope and quality as a dissertation for the degree
of Doctor of Philosophy.
(Juan Alonso
Aeronautics and Astronautics)
I certify that I have read this dissertation and that, in my opinion, it
is fully adequate in scope and quality as a dissertation for the degree
of Doctor of Philosophy.
(Frank Ham
Mechanical Engineering)
Approved for the University Committee on Graduate Studies.
Preface
The main goal of this thesis is to develop a method for more easily writing high-performance scientific codes, specifically mesh-based PDE solvers. The best method for achieving this is a Domain Specific Language (DSL), which we have named Liszt. Liszt provides hardware independence (for example between streaming computers, commodity graphics processors and specialized processors like IBM's CELL and ClearSpeed's line of accelerator boards) by making the mesh and mesh-based data storage primitives of the language. Code is forced to be written in a parallel way involving loops over mesh elements. Liszt has the additional desirable properties of reducing programmer time and effort, reducing program complexity, automatic parallelization/domain decomposition and built-in parallel visualization and checkpointing. Recognizing that creating a language capable of generating code for these platforms is a challenging problem, work was first done on how to best achieve high performance on these platforms for these kinds of problems to provide guidance when developing the language. The second and third chapters deal with implementing an O(N2) N-Body simulation and a compressible Euler flow solver on commodity graphics hardware and IBM's Cell. The final chapter, which is presented as an appendix because it is unrelated to the rest of the work, deals with a new periodic boundary condition developed for simulating nanowires undergoing torsion.
Acknowledgements
I would like to thank my parents for making me believe I could do anything and then not trying to tell me what that should be; Deb Michael and Doreen Wood for helping me navigate the bureaucracy of the University for 5 years; and all the teachers I've ever had, but especially: Don Porzio, Mrs. Franzen, Anthony Jacobi, John P. D'Angelo, Geir Dullerud, Rose Marie Wood, Gustavo Romero, Fred Weldy, and Wei Cai. Ilhami Torunglo and Ahmet Karakas helped me grow wise in the ways of the "real" world and I owe them deeply for all the generosity they have shown me. Frank Ham and Juan Alonso were especially helpful in that I worked with them on several projects during my stay and they always provided valuable insight and advice. I would especially like to thank Parviz Moin for enticing me to come to Stanford, for advising me during my first year, and for his guidance since. Finally, I would like to thank my advisor for everything over these last five years; hopefully some of his wisdom has been passed on to me. He took a chance on me solely because I expressed some interest in those GPU things (for which I'm grateful) and I think it worked out well.
Contents
Preface iv
Acknowledgements v
1 Historical Background 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Single-Threaded Performance . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Parallel Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.1 Parallel Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.2 Merrimac and Streaming . . . . . . . . . . . . . . . . . . . . . 11
1.3.3 Programmable GPUs . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.4 Cell Broadband Engine Architecture . . . . . . . . . . . . . . 18
1.4 Comparison of Technologies . . . . . . . . . . . . . . . . . . . . . . . 21
2 N-Body Simulations on GPUs 22
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 Implementation and Optimization on GPUs . . . . . . . . . . . . . . 28
2.3.1 Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.2 General Optimization . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.3 Optimization for small systems . . . . . . . . . . . . . . . . . 31
2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5.1 Comparison to other Architectures . . . . . . . . . . . . . . . 35
2.5.2 Hardware Constraints . . . . . . . . . . . . . . . . . . . . . . 37
2.5.3 On-board Memory vs. Cache Usage . . . . . . . . . . . . . . . 38
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.7 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.7.1 Flops Accounting . . . . . . . . . . . . . . . . . . . . . . . . . 39
3 Structured PDE Solvers on CELL and GPUs 41
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2 Review of prior work on GPUs . . . . . . . . . . . . . . . . . . . . . . 42
3.3 Flow Solver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4 Numerical accuracy considerations and performance comparisons be-
tween CPU and GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.5 Mapping the Algorithms to the GPU . . . . . . . . . . . . . . . . . . 48
3.5.1 Classification of kernel types . . . . . . . . . . . . . . . . . . . 48
3.5.2 Data layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.5.3 Summary of GPU code . . . . . . . . . . . . . . . . . . . . . . 52
3.5.4 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.6.1 Performance scaling with block size . . . . . . . . . . . . . . . 57
3.6.2 Performance of the three main kernel types . . . . . . . . . . . 58
3.6.3 Performance on real meshes . . . . . . . . . . . . . . . . . . . 60
3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.8 CELL Experiences . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.8.1 Amdahl’s Revenge . . . . . . . . . . . . . . . . . . . . . . . . 63
3.8.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4 Liszt 71
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3 Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3.1 Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3.2 Language Components . . . . . . . . . . . . . . . . . . . . . . 80
4.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5 Conclusions 100
A Torsion and Bending PBC 102
A.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
A.2 Generalization of Periodic Boundary Conditions . . . . . . . . . . . . 105
A.2.1 Review of Conventional PBC . . . . . . . . . . . . . . . . . . 105
A.2.2 Torsional PBC . . . . . . . . . . . . . . . . . . . . . . . . . . 106
A.2.3 Bending PBC . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
A.3 Virial Expressions for Torque and Bending Moment . . . . . . . . . . 112
A.3.1 Virial Stress in PBC . . . . . . . . . . . . . . . . . . . . . . . 113
A.3.2 Virial Torque in t-PBC . . . . . . . . . . . . . . . . . . . . . . 114
A.3.3 Virial Bending Moment in b-PBC . . . . . . . . . . . . . . . . 116
A.4 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
A.4.1 Si Nanowire under Torsion . . . . . . . . . . . . . . . . . . . . 118
A.4.2 Si Nanowire under Bending . . . . . . . . . . . . . . . . . . . 122
A.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Bibliography 127
List of Tables
1.1 SGEMM and DGEMM numbers are for the best performing matrix sizes on each platform that are very large (i.e., much too big to fit entirely in any kind of cache or local memory). The FFT is for the best performing (power of 2), very large 2D complex transforms. The Cell is the PowerXCell 8i accelerator board from Mercury Systems. . . . 21
2.1 Values for the maximum performance of each kernel on the X1900XTX.
The instructions are counted as the number of pixel shader assembly
arithmetic instructions in the inner loop. . . . . . . . . . . . . . . . 27
2.2 Values for the maximum performance of each kernel on the X1900XTX. 28
2.3 Comparison of GROMACS(GMX) running on a 3.2 GHz Pentium 4
vs. the GPU showing the estimated simulation time per day for a 1000
atom system.
*GROMACS does not have an SSE inner loop for LJC(linear) . . . . 34
3.1 Measured speed-ups for the NACA 0012 airfoil computation. . . . . . 61
3.2 Speed-ups for the hypersonic vehicle computation . . . . . . . . . . . 62
A.1 Comparison of torsional stiffness for Si NW estimated from MD simu-
lations and that predicted by Strength of Materials (SOM) theory. D∗
is the adjusted NW diameter that makes the SOM predictions exactly
match MD results. The critical twist angle φc and critical shear strain
γc at failure are also listed. . . . . . . . . . . . . . . . . . . . . . . . . 121
A.2 Comparison of the bending stiffnesses for Si NWs estimated from MD
simulations and that predicted by Strength of Materials (SOM) theory.
D∗ is the adjusted NW diameter that makes SOM predictions exactly
match MD results. The critical bending angle Θf and critical normal
strain εf at fracture are also listed. . . . . . . . . . . . . . . . . . . . 124
List of Figures
1.1 Transistor Counts Over the Last 35 Years . . . . . . . . . . . . . . . 3
1.2 Illustration of SIMD operation . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Parallel solve of Tri-diagonal Matrix . . . . . . . . . . . . . . . . . . . 11
1.4 G70 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5 Programming Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.6 CUDA Programming Model with N threads per block. Only the first
kernel is shown in full detail due to space constraints. . . . . . . . . . 18
1.7 Overview of the layout of the Cell . . . . . . . . . . . . . . . . . . . . 19
1.8 Conceptual Diagram of Cell SPE . . . . . . . . . . . . . . . . . . . . 20
2.1 GA Kernel with varying amounts of unrolling . . . . . . . . . . . . . 30
2.2 Performance improvement for LJC(sigmoidal) kernel with i-particle
replication for several values of N . . . . . . . . . . . . . . . . . . . . 33
2.3 Speed comparison of CPU, GPU and GRAPE-6A . . . . . . . . . . . 35
2.4 Useful MFlops per second per U.S. Dollar of CPU, GPU and GRAPE-6A 36
2.5 Millions of Interactions per Watt of CPU, GPU and GRAPE-6A . . . 36
2.6 GFlops achieved as a function of memory speed . . . . . . . . . . . . 39
3.1 Array of Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2 Structure of Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3 Flowchart of NSSUS running on the GPU. . . . . . . . . . . . . . . . 52
3.4 This figure illustrates the stencil in the x direction and the branching
on the GPU. Each colored square represents a mesh node. The color
corresponds to the stencil used for the node. Inner nodes (in grey) use
the same stencil. For optimal efficiency, nodes inside a 4 × 4 square
should branch coherently, i.e., use the same stencil (see square with a
dashed line border). For this calculation, this is not the case near the
boundary which leads to inefficiencies in the execution. The algorithm
proposed here reduces branching and leads to only one branch (instead
of 3 here). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5 The continuity of the solution across mesh blocks is enforced by com-
puting penalty terms using the SAT approach[16]. The fact that the
connectivity between blocks is unstructured creates special difficulty.
On this figure, for each node on the faces of the blue block, one must
identify the face of one of the green blocks from which the penalty
terms are to be computed. In this case, the left face of the blue block
intersects the faces of four distinct green blocks. This leads to the
creation of 4 sub-faces on the blue block. For each sub-face, penalty
terms need to be computed. Note that some nodes may belong to
several sub-faces. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.6 To calculate the penalty terms efficiently for each sub-face, one first
copies data from the 3D block into a smaller sub-face stream (shown
on the right). In this figure, the block has 10 sub-faces. Assume that
the largest sub-face can be stored in memory as a 2D rectangle of size
nx × ny. In the case shown, the sub-face stream is then composed of
12 nx × ny rectangles, 2 of which are unused. Some of the space is
occupied by real data (in blue); the rest is unused (shown in grey). . . 55
3.7 This figure shows the mapping from neighboring blocks to the neighbor
stream used to process the penalty terms for the blue block. There
are four large blocks surrounding the blue block (top and bottom not
shown). They lead to the first 4 green rectangles. The other rectangles
are formed by the two blocks in the front right and the four smaller
blocks in the front left. . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.8 Performance scaling with block size, 1st order. . . . . . . . . . . . . . 57
3.9 left: pointwise performance (inviscid flux calculation); right: stencil
performance (3rd order residual calculation). . . . . . . . . . . . . . . 59
3.10 Unstructured gather performance (boundary conditions and penalty
terms calculation). The decrease in speed-up is due to an unavoidable
O(n3) vs. O(n2) algorithmic difference in one of the kernels that make
up the boundary calculations. See the discussion in the text. . . . . 59
3.11 Three block C-mesh around the NACA 0012 airfoil. . . . . . . . . . . 60
3.12 Mach number around the NACA 0012 airfoil, M∞ = 0.63, α = 2. . . . 60
3.13 Mach number – side and back views of the hypersonic vehicle. . . . . 61
3.14 Amdahl’s Law (A = 1) vs. CBE (A = 10) . . . . . . . . . . . . . . . 64
3.15 Amdahl’s Law (A = 1) vs. CBE (A = 10) . . . . . . . . . . . . . . . 64
3.16 Ratio of Amdahl’s Law Speedup to CBE Speedup . . . . . . . . . . . 65
3.17 Cell Memory Bandwidth treating each SPE as an Independent Co-
processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.18 Cell Memory Bandwidth Viewing each SPE as a Step in a Pipeline . 67
3.19 Circular Buffering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
A.1 (a) A nanowire subjected to PBC along z axis. (b) A nanowire sub-
jected to t-PBC along z axis. . . . . . . . . . . . . . . . . . . . . . . 107
A.2 A nanowire subjected to b-PBC around z axis. At equilibrium the net
line tension force F must vanish but a non-zero bending moment M
will remain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
A.3 Snapshots of Si NWs of two diameters before torsional deformation
and after failure. The failure mechanism depends on its diameter. . . 119
A.4 Virial torque τ as a function of rotation angle φ between the two ends of the NWs of two different diameters. Because the two NWs have the same aspect ratio Lz/D, they have the same maximum strain (on the surface) $\gamma_{\max} = \frac{\phi D}{2 L_z}$ at the same twist angle φ. . . . 120
A.5 Virial bending moment M as a function of bending angle Θ between the two ends of the two NWs with different diameters. Because the two NWs have the same aspect ratio Lz/D, they have the same maximum strain $\varepsilon_{\max} = \frac{\Theta D}{2 L_z}$ at the same bending angle Θ. . . . 123
A.6 Snapshots of Si NWs of two diameters under bending deformation be-
fore and after fracture. While metastable hillocks form on the thinner
NWs before fracture (a), this does not happen for the thicker NW (c). 125
Chapter 1
Historical Background
1.1 Introduction
Since the invention of the integrated circuit in 1958, an empirical "law" has continued to predict our ability to manufacture in ever smaller dimensions. Gordon Moore, co-founder of Intel, made the observation that approximately every 18 months the number of transistors that could be mass produced in a given area doubled [71] (see figure 1.1). From 1958 until about 2002 this statement was equivalent to saying that the speed of the processor also doubled every 18 months. In fact, the correspondence was close enough that many people erred in thinking that the latter statement was actually Moore's law. Since then, the performance of a single core has increased much more slowly. The impact of this decrease in performance growth rate and its repercussions for scientific computing are the main motivating force behind this thesis.
The solution of the hardware designers to the inability to significantly increase
single-threaded performance was to increase the explicit parallelism both in the hard-
ware and in the programming model. No longer can software be written in a sequential
fashion relying on advances in hardware to improve performance. Software must now
be written to take advantage of the parallelism inherent in the processors by explic-
itly expressing the parallelism of the algorithms. This requires no less effort than
completely rethinking and rewriting most high-performance code.
1.2 Single-Threaded Performance
In the single threaded programming model the CPU is viewed as doing only one thing
at a time. It theoretically executes each command in its entirety before moving on
to the next; the results of a previous instruction are available for the next one. The
main factors determining performance are then:
• Speed of individual instructions
• Speed of data movement from memory to execution units
Figure 1.1: Transistor Counts Over the Last 35 Years
The speed of each instruction is mainly determined by the clock speed of the processor, since most arithmetic instructions (floating point division being the main exception) on modern processors take one cycle (when pipelining, which will be explained later, is taken into account). Manufacturing companies have been unable to continue
increasing the clock speeds of processors due to thermal dissipation issues even as
they continue to shrink transistor sizes. The speed of data movement is important
to ensure that every cycle a processor is performing a useful operation instead of
waiting for data to arrive. Unfortunately, delays to main memory can be on the order of
hundreds of cycles and cannot be significantly reduced. The obvious solution to the
first problem is to exploit parallelism somehow to execute more than one instruction
each clock cycle. There are two techniques for this. One is done in hardware, requires
no changes to program code and is known as ‘superscalar’ processing; the other
requires writing new code utilizing SIMD (Single Instruction, Multiple Data)
instructions or an auto-vectorizing compiler. The solutions to the second problem are
to add a memory hierarchy which decreases in size but increases in speed (cache) and
to try and find the processor another instruction to execute while waiting for data
for the current instruction, which is known as out-of-order execution. Pipelining,
superscalar execution and out-of-order execution all take advantage of and require
instruction level parallelism (ILP). Unfortunately, all these technologies reach a point of diminishing returns. First, each technology will be described and the reason it fails to scale beyond a certain point will be explained. The SIMD instructions are a limited step toward data-level parallelism.
Processor               80386    80486    Pentium    Pentium Pro    Pentium 4
Year                    1986     1989     1993       1995           2000
Cache Size (Internal)   -        8KB      32KB       512KB          2048KB
Pipelined                        X        X          X              X
Superscalar                               X          X              X
Out of order                                         X              X
SIMD                                                                X
Cache attempts to reduce the latency problem by storing recently used data closer to the processor (temporal locality) and by also bringing data spatially close to a requested location into the cache (spatial locality) under the assumption that it may also soon be needed. Generally, the hardware makes all the decisions with regard to what is brought into the cache and when data is evicted from the cache. This greatly simplifies the programming model (and, importantly, is backwards compatible with previous serial code), but can also lead to sub-optimal performance because the programmer cannot take advantage of a known access pattern by "informing" the cache. Increasing the size of the cache obviously increases the amount of data that can be in the cache at any one time and therefore also the time, on average, that a piece of data will reside in the cache before being evicted, increasing its chances of being reused. A doubling in cache size from two to four or three to six megabytes results in an average improvement of approximately 10% [85] [84] on a suite of typical application benchmarks including compression, rendering, video encoding and gaming. Clearly, the marginal efficiency of those extra transistors is not high.
Next, techniques for taking advantage of ILP are examined. Pipelining was the
earliest technique of this type to be implemented. It arose naturally because executing
a single instruction actually consists of multiple steps. In a very generic 4-stage
pipeline, an instruction must be fetched from memory, decoded, executed and then
the result written. Instead of keeping three of these stages idle while waiting for one
instruction to move all the way through the pipeline, a new instruction is begun as
soon as the first one has been fetched. Of course, the ability of the processor to do
this depends on there being 4 independent instructions in a row; otherwise it must
wait for a previous instruction to finish before starting the next one.
Listing 1.1: Pseudo-Assembly to Illustrate ILP

mul x1, y1 -> a
mul x2, y2 -> b  // independent
mul x3, y3 -> c  // independent
add a, b   -> a  // dependent
add a, c   -> a  // dependent
mov a -> memory  // dependent
add q, r   -> s  // independent

For example, in Listing 1.1, the first three instructions are independent and would start
filling up the pipeline, but then a "bubble" would form because the fourth instruction depends on the result of the first and second. And in the worst case scenario, the fifth instruction depends on the fourth, which means the pipeline is completely unused: the fifth must wait for the fourth to finish before it can enter. So clearly, the effectiveness of pipelining depends on the ability to find long runs of contiguous, independent instructions.
A super-scalar processor will have multiple functional units such as ALUs so that two multiplies can happen at exactly the same time: not pipelined, but truly in parallel. So in the example of Listing 1.1, the first two multiplications would be executed in parallel; then the next multiplication and add could also be executed in parallel; after that, only one instruction would be executed at a time because of dependencies. This technique can be, and often is, combined with pipelining so that each functional unit has its own pipeline. Ultimately though, these techniques are limited by the amount of parallelism available in the instruction stream.
Out of order execution attempts to solve this fundamental problem by allowing instructions to be executed in a different order from the one described by the instruction stream. In our example (Listing 1.1), assuming we still have a two-unit superscalar processor, instead of only executing the add a, c -> a instruction by itself because the next instruction depends on its result, the add q, r -> s statement could be executed with it since it has no dependencies. In practice this technique is complex and requires a large amount of transistors for book-keeping machinery. This places limits on how many dependent instructions can be "passed over" while looking for the next independent one.
The last mentioned technique, SIMD (Single Instruction Multiple Data), can be seen as a connection between the ILP of the past and the data parallelism of the future. Because only so much parallelism, even with all these techniques, can be extracted from a serial instruction stream, additional instructions that specifically operate on multiple data at one time were introduced. For example, to perform the additions a+b and c+d, a single SIMD instruction suffices if a and c are contiguous in memory, as are b and d; see figure 1.2. In this way the programmer could begin to explicitly specify the parallelism in the code. In some applications this can lead to a large speedup [75], but SIMD instructions are often difficult to use, essentially requiring programming in assembly language, and they require very specific data layout and alignment that can be very difficult to achieve for many applications.

Figure 1.2: Illustration of SIMD operation
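To make this concrete, the following sketch expresses the kind of operation shown in figure 1.2 using compiler intrinsics rather than raw assembly. It assumes x86 SSE (one particular SIMD instruction set, chosen only for illustration and not prescribed by the text) and 16-byte-aligned arrays; four single-precision additions are issued as a single instruction.

/* A minimal SIMD sketch using x86 SSE intrinsics (illustrative only).
 * The arrays a, b and c are assumed to be 16-byte aligned, reflecting
 * the strict layout and alignment requirements mentioned above. */
#include <xmmintrin.h>

void add4(const float *a, const float *b, float *c)
{
    __m128 va = _mm_load_ps(a);      /* load a[0..3]                    */
    __m128 vb = _mm_load_ps(b);      /* load b[0..3]                    */
    __m128 vc = _mm_add_ps(va, vb);  /* four additions, one instruction */
    _mm_store_ps(c, vc);             /* store c[0..3]                   */
}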
A possible solution to the complexity and size of the circuitry to determine and
keep track of dependencies between instructions in the various ILP techniques is to
remove them from the processor and instead move the job to the compiler. The
compiler should determine at compile time which instructions are independent and
should be executed in parallel. This is the approach of the Intel Itanium [67]. In
practice, writing the necessary compilers has proved to be a very challenging task
and current compilers are still not optimal [86] [19].
Even without introducing new techniques to take advantage of ILP, speeds could
still be increased if the clock speed of the processors could continue to be increased.
This also proved to not be possible. In an ideal CMOS transistor, current only flows when the transistor is switching states. As process nodes reached 130 and then 90 nanometers, an unanticipated phenomenon occurred: current leakage through the transistor even when it was not switching. This led to much higher thermal dissipation requirements than originally anticipated and limited the maximum clock rates of the chips. To some extent this problem has been mitigated with the introduction of so-called high-k materials that reduce this current leakage. Nonetheless, processor speeds remain capped at around 4 GHz.
A new paradigm was needed to continue increasing the performance of processors. That paradigm is data parallelism.
1.3 Parallel Architectures
In specialized areas such as High-Performance Computing (HPC), graphics and mul-
timedia applications the limitations of the general purpose processors had been ap-
parent for some time. Engineers realized that for the same transistor and power
budgets a great deal more computing power was possible - provided it was the right
kind of computing! The basic idea behind all of the following technologies is to use
a larger number of simple processors instead of a small number of very powerful pro-
cessors while placing the burden of expressing parallelism on the programmer. The
approaches taken by existing hardware are quite different but the commonality is
that the calculation must be parallel. If the algorithm/computation is completely
sequential there is nothing parallel hardware or algorithms can do to accelerate it.
An example of such a problem would be “pointer chasing”. The first memory location
contains the location of the second memory location, which contains the location of
the third and so on. Starting at the first memory location it is impossible to get to
the end of the chain in any fashion other than following the pointers. Algorithms
like this should be avoided at all cost. These new technologies depend on parallel
algorithms to fully utilize their power which requires a fundamental shift in software
development.
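As a minimal sketch of the pointer-chasing pattern just described (illustrative C, not from the thesis), note that each load address depends on the result of the previous load, so no amount of parallel hardware can overlap the accesses.

/* Pointer (index) chasing: each iteration's load address depends on the
 * previous iteration's result, so the traversal is inherently serial. */
int chase(const int *next, int start, int steps)
{
    int loc = start;
    for (int s = 0; s < steps; ++s)
        loc = next[loc];  /* cannot issue this load before the previous one finishes */
    return loc;
}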
1.3.1 Parallel Algorithms
Consider a simple example, solving a tri-diagonal matrix. The serial solution is well
known and is O(N). As a warmup to help the reader begin to think “parallelly” the
serial and parallel solutions are presented next.
• Serial : Simply perform Gaussian elimination from the bottom up until there is only one unknown left in the top row. Solve for this unknown. Now substitute back into the second row from the top, which allows the next unknown to be solved for. This process continues until the last unknown is found.
\[
\begin{bmatrix}
\beta & \gamma & 0 & 0 & 0 \\
\alpha & \beta & \gamma & 0 & 0 \\
0 & \alpha & \beta & \gamma & 0 \\
0 & 0 & \alpha & \beta & \gamma \\
0 & 0 & 0 & \alpha & \beta
\end{bmatrix}
\vec{x} =
\begin{bmatrix}
y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4
\end{bmatrix}
\]

after the first step becomes

\[
\begin{bmatrix}
\beta & \gamma & 0 & 0 & 0 \\
\alpha & \beta & \gamma & 0 & 0 \\
0 & \alpha & \beta & \gamma & 0 \\
0 & 0 & \alpha & \beta^*_1 & 0 \\
0 & 0 & 0 & \alpha & \beta
\end{bmatrix}
\vec{x} =
\begin{bmatrix}
y_0 \\ y_1 \\ y_2 \\ y^*_3 \\ y_4
\end{bmatrix}
\]

after going all the way up:

\[
\begin{bmatrix}
\beta^*_4 & 0 & 0 & 0 & 0 \\
\alpha & \beta^*_3 & 0 & 0 & 0 \\
0 & \alpha & \beta^*_2 & 0 & 0 \\
0 & 0 & \alpha & \beta^*_1 & 0 \\
0 & 0 & 0 & \alpha & \beta
\end{bmatrix}
\vec{x} =
\begin{bmatrix}
y^*_0 \\ y^*_1 \\ y^*_2 \\ y^*_3 \\ y_4
\end{bmatrix}
\]
• Parallel : One possible parallel algorithm for solving the system below is to use
cyclic reduction. At each step the even rows are used to eliminate the even
numbered unknowns from the odd equation above and below it resulting in a
new system containing just the odd rows. This reduction in the number of
unknowns is repeated until one equation in one unknown is left, which is then
solved, and the solution is propagated through the reverse of the reduction
procedure to solve for all of the unknowns (see figure 1.3; a serial reference implementation is sketched at the end of this subsection).
Original System:
\[
\begin{bmatrix}
\beta & \gamma & 0 & 0 & 0 & 0 & 0 \\
\alpha & \beta & \gamma & 0 & 0 & 0 & 0 \\
0 & \alpha & \beta & \gamma & 0 & 0 & 0 \\
0 & 0 & \alpha & \beta & \gamma & 0 & 0 \\
0 & 0 & 0 & \alpha & \beta & \gamma & 0 \\
0 & 0 & 0 & 0 & \alpha & \beta & \gamma \\
0 & 0 & 0 & 0 & 0 & \alpha & \beta
\end{bmatrix}
\vec{x} =
\begin{bmatrix}
y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6
\end{bmatrix}
\]
After first reduction step (note that the odd rows are decoupled from the even
rows):
\[
\begin{bmatrix}
\beta & \gamma & 0 & 0 & 0 & 0 & 0 \\
0 & \beta^*_1 & 0 & \gamma^*_1 & 0 & 0 & 0 \\
0 & \alpha & \beta & \gamma & 0 & 0 & 0 \\
0 & \alpha^*_3 & 0 & \beta^*_3 & 0 & \gamma^*_3 & 0 \\
0 & 0 & 0 & \alpha & \beta & \gamma & 0 \\
0 & 0 & 0 & \alpha^*_5 & 0 & \beta^*_5 & 0 \\
0 & 0 & 0 & 0 & 0 & \alpha & \beta
\end{bmatrix}
\vec{x} =
\begin{bmatrix}
y_0 \\ y^*_1 \\ y_2 \\ y^*_3 \\ y_4 \\ y^*_5 \\ y_6
\end{bmatrix}
\]
The disadvantage to this scheme, operating in a reduction fashion, is that the
amount of parallelism available at each stage decreases. When only a small
number of equations are left, it is likely faster (depending on the specifics of the
hardware) to perform the solve serially at that point instead of continuing the
reductions.
There are other schemes for the parallel solution of Toeplitz tri-diagonal matri-
ces (the coefficient on each diagonal is constant) that require no communication
at all provided one is willing to accept some error in the solution [68].
Figure 1.3: Parallel solve of a tri-diagonal matrix (left: reduce and solve the modified equation; right: propagate the solution to solve for all unknowns)
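For concreteness, the sketch below gives a plain, serial C reference implementation of the cyclic reduction just described. This is an illustration, not code from the thesis; the restriction to n = 2^k − 1 unknowns, the function name and the array interface are assumptions made for simplicity. Every iteration of each inner loop is independent and could execute in parallel, while the trip count halves at every level, which is exactly the shrinking parallelism noted above.

/* Cyclic reduction for a tridiagonal system with n = 2^k - 1 unknowns:
 *   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],  with a[0] = c[n-1] = 0.
 * The arrays a, b, c, d are overwritten during the reduction. */
void cyclic_reduction(int n, double *a, double *b, double *c,
                      double *d, double *x)
{
    int stride, i;

    /* Forward reduction: each "odd" row eliminates its two neighbours. */
    for (stride = 1; 2 * stride < n + 1; stride *= 2) {
        for (i = 2 * stride - 1; i < n; i += 2 * stride) {
            int lo = i - stride, hi = i + stride;
            double k1 = a[i] / b[lo];
            double k2 = c[i] / b[hi];
            b[i] -= k1 * c[lo] + k2 * a[hi];
            d[i] -= k1 * d[lo] + k2 * d[hi];
            a[i]  = -k1 * a[lo];   /* new coupling at distance 2*stride */
            c[i]  = -k2 * c[hi];
        }
    }

    /* Back substitution: solve the single remaining equation, then
     * propagate the solution back down through each level. */
    for (stride = (n + 1) / 2; stride >= 1; stride /= 2) {
        for (i = stride - 1; i < n; i += 2 * stride) {
            double xl = (i - stride >= 0) ? x[i - stride] : 0.0;
            double xr = (i + stride < n)  ? x[i + stride] : 0.0;
            x[i] = (d[i] - a[i] * xl - c[i] * xr) / b[i];
        }
    }
}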
1.3.2 Merrimac and Streaming
The Merrimac Streaming Supercomputing project began at Stanford to solve the
hardware and software issues outlined above. Specifically it recognized three require-
ments for continued high-performance on modern VLSI devices.
1. Parallelism
2. Latency Tolerance - 500 or more cycles to main memory
3. Exploitation of locality in addition to parallelism
One of its main contributions was the popularization of the stream programming
abstraction. In this abstraction data is organized into streams which are collections
of data on which similar computations are to be performed (Data Parallel paradigm).
Computation is performed by kernels which are computations that operate on each
element of an output stream. The key difference between this and the earlier vector
programming model is that kernels are not just simple arithmetic operations as in the
vector model, but rather a small program that has access to a local register file. This
change now allows for the programmer to express information about locality through
kernels and can prevent unnecessary writes and reads in main memory by keeping local data in the registers. It also allows for data dependencies other than a one-to-one mapping from input to output stream because of the ability to store information locally.
Although it was planned to develop and produce specialized hardware to take
advantage of this programming model, due to various circumstances, the project never
got past the design stage. Instead, a version of the Brook language was developed
for GPUs to take advantage of already existing hardware to which the programming model mapped very well.
1.3.3 Programmable GPUs
The advances in hardware design and programming abstractions have come very
quickly since the introduction of programmable GPUs. First, the GPUs of 2005,
when this research began, will be described. This will make the mapping of the
BrookGPU language to the hardware clear. It will also bring to light some of its
shortcomings. Then the current (2009) state of the programming model and hardware
will be described in the context of surmounting the aforementioned shortcomings.
Architecture circa 2005
The entire architecture of the GPU will not be described, but only that relating to
using the GPU for general purpose computations. GPUs of this era generally had
separate hardware for vertex and pixel shaders, but only pixel shaders were generally
used for general purpose computations. Likewise, because generally only one rectangle
the size of the screen is rendered, the vast majority of the fixed function pipeline is not
utilized. A top-level depiction of a GPU from this era (NVIDIA’s G70) can be
seen in figure 1.4. From the perspective of the stream programming abstraction,
everything above the fragment crossbar is not terribly important. What matters is that somehow fragments are generated (based upon the output destination) and fed into the fragment shaders to be processed independently. The programming model combined with its view of the hardware can be seen in figure 1.5.

Figure 1.4: G70 Architecture

Theoretically, if there are N fragments they can all be thought of as being executed simultaneously on N different processors. Of course, in reality, there were only 24 physical processors on
a GPU (the exact number obviously depended on the particular GPU); a far smaller
number than fragments to be processed, but there are actually more fragments ”in
flight” than the number of processors. The actual number of fragments ”in flight” is
approximately 20× the number of physical processors. This is done so that whenever
one fragment stalls waiting for a memory access, another fragment can be scheduled
immediately in its place and no processing capacity is lost due to memory latency.
The order in which fragments are generated and sent into the queue to be processed
is chosen to maximize the possibility of cache hits, assuming the fragments access
data that has 2D locality.
BrookGPU
Brook for GPUs (also known as BrookGPU) was designed by Ian Buck [13, 12, 59].
Brook is a source-to-source compiler which converts Brook code into C++ code and a high-level shader language like Cg or HLSL.

Figure 1.5: Programming Model (a kernel reads input streams, gather arguments and constants into local registers, with reads passing through the texture cache, and writes to a fixed output location)

This code then gets compiled into
pixel shader assembly by an appropriate shader compiler like Microsoft’s FXC or
NVIDIA’s CGC. The graphics driver finally maps the pixel shader assembly code
into hardware instructions as appropriate to the architecture. It can run on top of
either DirectX or OpenGL; due to its greater maturity, the DirectX backend was used for
all results in this thesis. Specifically Microsoft DirectX 9.0c [69] and the Pixel Shader
3.0 Specification [70]. In the Pixel Shader 3.0 specification, the shader has access to
32 general purpose, 4-component, single precision floating point (float4) registers,
16 float4 input textures, 4 float4 render targets (output streams) and 32 float4
constant registers. A shader consists of a number of assembly-like instructions. GPUs
of this era had a maximum static program length of 512 (ATI) or 1024 (NVIDIA)
instructions.
The syntax of Brook is based on C with some extensions. The data is represented
as streams. These streams are operated on by kernels which have specific restrictions:
each kernel is a short program to be executed concurrently on each record of the
output stream(s). This implies that each instance of a kernel automatically has an
output location associated with it. It is this location only to which output can be
written. Scatter operations (writing to arbitrary memory locations) are not allowed.
Gather operations (read with indirect addressing) are possible for input streams. Here
is a trivial example:
kernel void add(
    /* stream argument */ float a<>,
    /* gather argument */ float b[][],
    /* constant        */ int width,
    /* output          */ out float result<>)
{
    float2 indexToRight = indexof(result).xy + float2(1, 0);
    // wrap around if we're on the edge
    if (indexToRight.x == width)
        indexToRight.x = 0;
    // because a is a stream argument
    // we do not need to provide indices;
    // it is automatically that of the output location
    result = a + b[indexToRight];
}

float a<100>; float b<100>; float c<100>;
add(a, b, 100, c);

One hundred instances of the add kernel are created, implicitly executing a parallel for loop over all the elements of the output stream c. The indexof operator can be used to get the location a particular instance of the kernel will be writing to in the output stream(s).
The features of the hardware appear in the language in many ways. Unlike memory
in traditional machines, streams are all addressed using two coordinates because under
the hood all memory is represented as textures which are inherently 2D in graphics
languages (3D textures were not yet standardized or supported by all platforms when
Brook was created). Most importantly, caches are two dimensional. Instead of cache
lines one can instead think of cache squares around the data requested.¹ Some of the more annoying features of the hardware are related to looping. Both NVIDIA and ATI cards use an 8-bit counter for for loops, so each for loop is limited to 256 iterations (i = 0...255).² To do more iterations, for loops must be nested: 2 loops for 65,535 iterations and so on. The required control flow is given in the following code snippet.

bool breakFlag = false;
for (int i = 0; i < 256; ++i) {
    for (int j = 0; j < 256; ++j) {
        int linearIndex = i * 256 + j;
        if (linearIndex >= desiredNumIterations) {
            breakFlag = true;
            break;
        }
        // do something
    }
    if (breakFlag)
        break;
}

¹Technically, the memory on GPUs is still linear, but by using algorithms based on space filling Z-curves, the hardware gives the appearance of a two dimensional memory layout.
²For unknown reasons, on ATI the limit is actually i = 0...254.
A further complication with loops is that on NVIDIA hardware there is a hard limit
of 65,535 (assembly) instructions per kernel invocation. The exact number of in-
structions used is impossible to determine before runtime because the true assembly
instructions used by the hardware are generated by the driver at runtime (Just-In-
Time Compiled). The solution is to multi-pass a kernel that might do many loop
iterations across many kernel invocations, but this is naturally inefficient since data must be reloaded over and over again instead of remaining in registers (basically negating
some of the advantage of the stream paradigm over the vector processing paradigm).
The inability to scatter has some important algorithmic implications. For exam-
ple, if we make a calculation and then need to update values at multiple memory
locations with this single value:

foo[bar] += value;
foo[moo] += value;

we have to calculate value twice on GPUs since we could only output to one location.
A limitation of this programming model itself is that locality can only be directly
expressed by the programmer at one level, that of the registers. It is only indi-
rectly possible, through the texture cache, to make use of locality between different
fragments. Even this ability only allows for re-use of constant read-only data; it is
impossible to share calculated information between fragments.
CUDA and Recent Hardware
Even though the research in this thesis was done with BrookGPU, it is worth men-
tioning CUDA and recent architectural developments. NVIDIA’s G80 and later se-
ries chips as well as ATI’s R600 and later series chips have what’s known as unified
shaders. Instead of specific hardware that is only either a vertex or pixel shader, there
is one unit that can function as either depending on the demand. BrookGPU and
indeed, most general purpose GPU computing, never used vertex shaders so as far
as GPGPU was concerned, they were a waste of transistors. Now however, all of the
unified shaders can be utilized for computation. More importantly, NVIDIA released
CUDA which is an evolution of the stream programming model of BrookGPU. The
two main evolutionary features are:
Shared Memory - a small amount of read/write memory that can be shared among
a group of threads (the preferred terminology to move away from the graphics
specific fragment) known as a block.
Scatter - it is now possible to write to arbitrary memory locations from each thread.
It is also possible to place synchronization points in kernel code which will be respected
within a block. Combined with the shared memory, this provides a second level of
locality that is explicitly controlled by the programmer. The new programming model
can be seen in figure 1.6.
Figure 1.6: CUDA Programming Model with N threads per block. Only the first kernel is shown in full detail due to space constraints. (Each kernel has local registers and per-block shared memory, with access to constants, the texture cache and linear global memory.)
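Although the research in this thesis used BrookGPU, a minimal CUDA sketch makes the two new features concrete. The kernel below is a generic illustration, not code from this dissertation; the kernel name and block size are arbitrary. Each block cooperatively sums its elements in shared memory, synchronizing between steps, and thread 0 then scatters the block's partial sum to an arbitrary location in global memory.

// Generic CUDA sketch: shared memory, block-level synchronization, scatter.
#define BLOCK 256

__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float s[BLOCK];                 // shared among this block's threads
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // cooperative load into shared memory
    __syncthreads();                           // synchronization point within the block

    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = s[0];                // scatter: write to an arbitrary location
}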
1.3.4 Cell Broadband Engine Architecture
The Cell Broadband Engine Architecture (usually shortened to just Cell) was de-
veloped by Sony, Toshiba and IBM. It sits in between the completely data parallel
paradigm of GPUs and the instruction level paradigm of conventional CPUs. It con-
sists of one Power Processing Element (PPE), a simplified PowerPC processor and
eight Synergistic Processing Elements (SPE). They are all connected by the Element
Interconnect Bus (EIB), a circular ring connecting the PPE, 8 SPEs and a memory
controller (MIC). The MIC interfaces with the onboard XDR (extreme data rate)
RAM, which has a maximum data rate of 25.6 GB/sec to the ring. The EIB actually consists of 4 "lanes", two of which operate clockwise and two counter-clockwise. The maximum bandwidth around the ring is 204 GB/sec (at a clock speed of 3.2 GHz). Compared to the GPUs available when the Cell was released, the Cell's bandwidth to main memory was about half that of the GPUs; compared to today's GPUs the gap is nearly a factor of seven! This, combined with the similar discrepancy between the ring bandwidth and the main memory bandwidth, makes it clear that the Cell cannot be thought of as a pure streaming, data-parallel processor. It is partially data parallel but also task parallel. To make the most use of the available bandwidth, SPEs must process data in a pipeline fashion, with each SPE, generally, performing a different task, or, at a minimum, usefully reusing information amongst themselves. Unfortunately, this programming model does not always map well to complicated scientific codes.

Figure 1.7: Overview of the layout of the Cell (the PPE, a 64-bit PowerPC core, and eight SPEs connected by the EIB, with RAM attached at 25.6 GB/sec)
The PPE is a fairly standard PowerPC processor with the exception that it doesn't
support out-of-order execution. Additionally, some instructions were converted into
microcoded instructions (essentially a sequence of other instructions) that are stored
in a ROM chip. It takes 11 cycles for the instructions to be fetched from the ROM
and the pipeline stalls during this time. Although Cell aware compilers will try to
avoid these instructions, it is not always possible. These two differences can have a
significant impact on performance, as will be shown later.
The SPEs are unique processors. They have no cache; instead they have a small Local Store (LS), 256 KB in size, with a predictable 6 cycle latency on all loads. All operations are vector operations; there are no scalar instructions. Scalar operations can be emulated by the compiler using shifts and masks, but this results in under-utilization of the available compute power by at least a factor of 4 (likely much more).
It has a large register file: 128 general purpose registers are available. In keeping with the vector nature of the processor, the registers are 16 bytes in size, the size of a typical SIMD vector (4 floats or 2 doubles). Due to the relatively simple nature of the processor, execution of code on an SPE is deterministic; it can be determined statically how code will be pipelined. Optimizing code to prevent pipeline stalls is an important optimization technique that has some trade-offs that will be discussed later (code size vs. number of stalls).

Figure 1.8: Conceptual diagram of a Cell SPE (256 KB Local Store, Memory Flow Controller, 128 128-bit registers, an even pipe for arithmetic ops and an odd pipe for load/store/branch ops, connected to the EIB)
In addition to their compute capabilities, they also have a Memory Flow Controller (MFC) which contains a Direct Memory Access (DMA) controller that is used to transfer data between the LS and main memory. The SPEs queue up memory transfers with the MFC, which takes care of servicing them asynchronously while the SPE goes about computing. Ideally, by employing some kind of buffering strategy, the SPE should never be waiting for data transfers. There are, unfortunately, quite a range of restrictions and conditions on the DMAs to achieve maximum performance. Maximum performance "is achieved for transfers in which both the EA [main memory address] and LSA [local store address] are 128-byte aligned and multiples of 128 bytes." [41]. All of the requirements and conditions in their full detail can be found in the Cell Broadband Engine Programming Manual.
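The buffering strategy can be sketched as follows. This is generic double-buffering illustration code rather than anything from the solvers discussed later; it assumes the standard Cell SDK SPU intrinsics from spu_mfcio.h (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all) and a hypothetical process() routine, and it keeps transfers 128-byte aligned as the manual requires. While one buffer is being processed, the DMA for the next buffer is already in flight.

/* Double-buffered streaming of data from main memory into the local store. */
#include <spu_mfcio.h>

#define CHUNK 4096  /* bytes per transfer, a multiple of 128 */

static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process(volatile char *data, int bytes);  /* hypothetical compute routine */

void stream_in(unsigned long long ea, int nchunks)
{
    int cur = 0;
    /* start the first transfer with tag 0 */
    mfc_get(buf[0], ea, CHUNK, 0, 0, 0);

    for (int i = 0; i < nchunks; ++i) {
        int next = cur ^ 1;
        /* kick off the next transfer (tag = buffer index) before waiting */
        if (i + 1 < nchunks)
            mfc_get(buf[next], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, next, 0, 0);

        /* wait only for the current buffer's tag, then compute on it */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        process(buf[cur], CHUNK);

        cur = next;
    }
}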
CHAPTER 1. HISTORICAL BACKGROUND 21
1.4 Comparison of Technologies
            Transistors   Power    BW        Max Single   Max Double   SGEMM     DGEMM     FFT       Cost
            (Millions)    (Watts)  (GB/sec)  (GFlops)     (GFlops)     (GFlops)  (GFlops)  (GFlops)  ($)
Nehalem     731           130      25.6      102.4        51.2         92        45        41        1700
GTX 285     1400          183      159       1062         88           355       74        95        400
Cell        250           150      22.8      180          102          175       75        21        8000

Table 1.1: SGEMM and DGEMM numbers are for the best performing matrix sizes on each platform that are very large (i.e., much too big to fit entirely in any kind of cache or local memory). The FFT is for the best performing (power of 2), very large 2D complex transforms. The Cell is the PowerXCell 8i accelerator board from Mercury Systems.
From the raw numbers in table 1.1 many of the relative strengths and weaknesses
of each platform become apparent. In terms of raw and achieved single precision
performance and performance per dollar, the GPU is dominant. The Cell is very
efficient in terms of performance per transistor, but that is a rather useless metric,
except perhaps to IBM’s bottom line. The Cell also has slightly higher double pre-
cision performance than the GPU, and therefore also slightly better double precision
performance per watt, but fares horribly in performance per dollar comparisons. Ab-
solute performance of the CPU is generally the worst of the three, but it falls in the
middle when performance per dollar is examined.
What is missing from the table is performance on more complicated applications.
Matrix-Matrix multiply and Fast Fourier Transforms are simple compared to complex
scientific applications and important to such a variety of applications that a great
deal of manpower goes into optimizing a very small piece of code, which can make
performance numbers for these routines not representative of the performance one
can achieve on larger applications. This thesis attempts to fill in this gap in chapter 3
where the performance and implementation of a compressible Euler solver is detailed
on both GPUs and the CELL and compared with a reference CPU implementation.
Chapter 2
N-Body Simulations on GPUs
2.1 Introduction
The classical N -body problem consists of obtaining the time evolution of a system
of N mass particles interacting according to a given force law. The problem arises
in several contexts, ranging from molecular scale calculations in structural biology to
stellar scale research in astrophysics. Molecular dynamics (MD) has been successfully
used to understand how certain proteins fold and function, which have been outstand-
ing questions in biology for over three decades [87, 33]. Exciting new developments
in MD methods offer hope that such calculations will play a significant role in future
drug research [30]. In stellar dynamics where experimental observations are hard, if
not impossible, theoretical calculations may often be the only way to understand the
formation and evolution of galaxies.
Analytic solutions to the equations of motion for more than 2 particles or compli-
cated force functions are intractable, which forces one to resort to computer simulations. A typical simulation consists of a force evaluation step, where the force law and the current configuration of the system are used to compute the forces on each
particle, and an update step, where the dynamical equations (usually Newton’s laws)
are numerically stepped forward in time using the computed forces. The updated
configuration is then reused to calculate forces for the next time step and the cycle
is repeated as many times as desired.
The simplest force models are pairwise additive, that is the force of interaction
between two particles is independent of all the other particles, and the individual
forces on a particle add linearly. The force calculation for such models is of com-
plexity O(N2). Since typical studies involve a large number of particles (103 to 106)
and the desired number of integration steps is usually very large (106 to 1015), the
computational requirements often limit both the problem size as well as the simula-
tion time and consequently, the useful information that may be obtained from such
simulations. Numerous methods have been developed to deal with these issues. For
molecular simulations, it is common to reduce the number of particles by treating
the solvent molecules as a continuum. In stellar simulations, one uses individual time
stepping or tree algorithms to minimize the number of force calculations. Despite
such algorithmic approximations and optimizations, the computational capabilities
of current hardware remain a limiting factor.
Typically N -body simulations utilize neighborlists, tree methods or other algo-
rithms to reduce the order of the force calculations. In previous work [27], a GPU
implementation of a neighbor list based method to compute non-bonded forces was
demonstrated. However, since the GPU so far outperformed the CPU, the neigh-
borlist creation quickly became a limiting factor. Building the neighborlist on the
GPU is extremely difficult due to the lack of specific abilities (namely indirected out-
put) and research on computing the neighborlist on the GPU is still in progress. Other
simplistic simulations that do not need neighborlist updates have been implemented
by others [47]. However, for small N, one finds they can do an O(N2) calculation
significantly faster on the GPU than an O(N) method using the CPU (or even with
a combination of the GPU and CPU). This has direct applicability to biological sim-
ulations that use continuum models for the solvent. The reader should also note that
in many of the reduced order methods such as tree based schemes, at some stage an
O(N2) calculation is performed on a subsystem of the particles, so this method can
be used to improve the performance of such methods as well. When using GRAPE
accelerator cards for tree based algorithms, the host processor takes care of building
the tree and the accelerator cards are used to speed up the force calculation step;
GPUs could be used in a similar way in place of the GRAPE accelerator boards.
Using the methods described below, acceleration of the force calculation by a
factor of 25 is possible with GPUs compared to highly optimized SSE code running
on an Intel Pentium 4. This performance is in the range of the specially designed
GRAPE-6A [31] and MDGRAPE-3 [92] processors, but uses a commodity processor
at a much better performance/cost ratio.
2.2 Algorithm
General purpose CPUs are designed for a wide variety of applications and take limited
advantage of the inherent parallelism in many calculations. Improving performance in
the past has relied on increasing clock speeds and the size of high speed cache memo-
ries. Programming a CPU for high performance scientific applications involves careful
data layout to utilize the cache optimally and careful scheduling of instructions.
In contrast, graphics processors are designed for intrinsically parallel operations,
such as shading pixels, where the computations on one pixel are completely indepen-
dent of another. GPUs are an example of streaming processors, which use explicit data
parallelism to provide high compute performance and hide memory latency. Data is
expressed as streams and data parallel operations are expressed as kernels. Kernels
can be thought of as functions that transform each element of an input stream into
a corresponding element of an output stream. When expressed this way, the kernel
function can be applied to multiple elements of the input stream in parallel. Instead
of blocking data to fit caches, the data is streamed into the compute units. Since
streaming fetches are predetermined, data can be fetched in parallel with computa-
tion. This section describes how the N -body force calculation can be mapped to
streaming architectures.
In its simplest form the N -body force calculation can be described by the following
pseudo-code:

for i = 1 to N
    force[i] = 0
    ri = coordinates[i]
    for j = 1 to N
        rj = coordinates[j]
        force[i] = force[i] + force_function(ri, rj)
    end
end

Since all coordinates are fixed during the force calculation, the force computation can
be parallelized for the different values of i. In terms of streams and kernels, this can
be expressed as follows:
stream coordinates;
stream forces;

kernel kforce(ri)
    force = 0
    for j = 1 to N
        rj = coordinates[j]
        force = force + force_function(ri, rj)
    end
    return force
end kernel

forces = kforce(coordinates)

The kernel kforce is applied to each element of the stream coordinates to pro-
duce an element of the forces stream. Note that the kernel can perform an indexed
fetch from the coordinates stream inside the j-loop. An out-of-order indexed fetch
can be slow, since in general, there is no way to prefetch the data. However in this
case the indexed accesses are sequential. Moreover, the j-loop is executed simulta-
neously for many i-elements; even with minimal caching, rj can be reused for many i-elements without fetching from memory; thus the performance of this algorithm
would be expected to be high. The implementation of this algorithm on GPUs and
GPU-specific performance optimizations are described in the following section.
There is however one caveat in using a streaming model. Newton’s Third law
states that the force on particle i due to particle j is the negative of the force on
particle j due to particle i. CPU implementations use this fact to halve the number
of force calculations. However, in the streaming model, the kernel has no ability to
write an out-of-sequence element (scatter), so forces[j] cannot be updated while
summing over the j-loop to calculate forces[i]. This effectively doubles the number
of computations that must be done on the GPU compared to a CPU.
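As a point of reference, the symmetric update that a CPU implementation exploits (and that the streaming model rules out) can be sketched as follows; this is a hypothetical host-side C++ fragment, not code from this work, and force_function stands in for any of the pair force models discussed below.

#include <array>
#include <vector>

using Vec3 = std::array<double, 3>;

// CPU-style accumulation exploiting Newton's third law: each unordered pair
// (i, j) is evaluated once and the result is written to BOTH forces[i] and
// forces[j]. The second write is a scatter, which a streaming kernel cannot
// express, so on the GPU every pair is evaluated twice.
template <class PairForce>
void accumulate_symmetric(const std::vector<Vec3>& coords,
                          std::vector<Vec3>& forces,
                          PairForce force_function)   // any pair force model
{
    for (auto& f : forces) f = {0.0, 0.0, 0.0};
    for (std::size_t i = 0; i < coords.size(); ++i) {
        for (std::size_t j = i + 1; j < coords.size(); ++j) {
            Vec3 fij = force_function(coords[i], coords[j]);
            for (int k = 0; k < 3; ++k) {
                forces[i][k] += fij[k];   // update of the element being produced
                forces[j][k] -= fij[k];   // scatter: not available to a kernel
            }
        }
    }
}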
Several commonly used force functions were implemented to measure and compare
performance. For stellar dynamics, depending on the integration scheme being used,
Gravity (accel):
    formula: m_j r_ij / (r_ij² + ε²)^(3/2)
    19 flops per interaction, 4×4 unroll, 64 input bytes, 125 inner loop instructions, 19.9 GB/s

Gravity (accel & jerk):
    formula: m_j r_ij / (r_ij² + ε²)^(3/2) and
             m_j [ v_ij / (r_ij² + ε²)^(3/2) − 3 (r_ij · v_ij) r_ij / (r_ij² + ε²)^(5/2) ]
    42 flops per interaction, 1×4 unroll, 128 input bytes, 104 inner loop instructions, 40.6 GB/s

LJC (constant):
    formula: q_i q_j r_ij / (ε r_ij³) + ε_ij [ (σ_ij/r_ij)⁶ − (σ_ij/r_ij)¹² ]
    30 flops per interaction, 2×4 unroll, 104 input bytes, 109 inner loop instructions, 33.6 GB/s

LJC (linear):
    formula: q_i q_j r_ij / r_ij⁴ + ε_ij [ (σ_ij/r_ij)⁶ − (σ_ij/r_ij)¹² ]
    30 flops per interaction, 2×4 unroll, 104 input bytes, 107 inner loop instructions, 34.5 GB/s

LJC (sigmoidal):
    formula: q_i q_j r_ij / (ζ(r_ij) r_ij³) + ε_ij [ (σ_ij/r_ij)⁶ − (σ_ij/r_ij)¹² ],
             with ζ(r) = e^(αr³ + βr² + γr + δ)
    43 flops per interaction, 2×4 unroll, 104 input bytes, 138 inner loop instructions, 27.3 GB/s

Table 2.1: Values for the maximum performance of each kernel on the X1900XTX. The
instructions are counted as the number of pixel shader assembly arithmetic instructions
in the inner loop.
one may need to compute just the forces, or the forces as well as the time derivative of
the forces (jerk). These kernels are referred to as GA (Gravitational Acceleration) and
GAJ (Gravitational Acceleration and Jerk) in the rest of this chapter. In molecular
dynamics, it is not practical to use O(N²) approaches when the solvent is treated
explicitly, so this work restricts itself to continuum solvent models. In such models,
the quantum interaction of non-bonded atoms is given by a Lennard-Jones function
and the electrostatic interaction is given by Coulomb’s Law suitably modified to
account for the solvent. The LJC(constant) kernel calculates the Coulomb force with
a constant dielectric, while the LJC(linear) and LJC(sigmoidal) kernels use distance
dependent dielectrics. The equations used for each kernel as well as the arithmetic
complexity of the calculation are shown in Tables 2.1 and 2.2.
Kernel                    Useful GFLOPS   Giga interactions per sec.   System size
Gravity (accel)                94.3                4.97                  65,536
Gravity (accel & jerk)         53.5                1.27                  65,536
LJC (constant)                 77.6                2.59                   4096
LJC (linear)                   79.5                2.65                   4096
LJC (sigmoidal)                90.3                2.10                   4096

Table 2.2: Values for the maximum performance of each kernel on the X1900XTX.
2.3 Implementation and Optimization on GPUs
2.3.1 Precision
Recent graphics boards have 32-bit floating point arithmetic. Consequently all of
the calculations were done in single precision. Whether or not this is sufficiently
accurate for the answers being sought from the simulation is often the subject of
a debate which will not be settled here. In many cases, though certainly not all,
single precision is enough to obtain useful results. Furthermore, if double precision
is necessary, it is usually not required throughout the calculation, but rather only
in a select few instances. For reference, GRAPE-6 [61] performs the accumulation
of accelerations, subtraction of position vectors and update of positions in 64-bit
fixed point arithmetic with everything else in either 36, 32 or 29 bit floating point
precision. It is quite common to do the entire force calculation in single precision for
molecular simulations while using double precision for some operations in the update
step. If and where necessary, the appropriate precision could be emulated on graphics
boards [32]. The impact on performance would depend on where and how often it
would be necessary to do calculations in double precision.
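As an illustration of the kind of software technique such emulation typically relies on (a generic sketch, not the method of [32] nor code from these kernels), compensated summation recovers most of the precision lost when accumulating many single-precision terms:

#include <cstdio>
#include <vector>

// Kahan (compensated) summation in single precision: the rounding error of
// each addition is carried forward in c and folded back into later terms.
// Note: requires compiling without value-unsafe FP optimizations (e.g. -ffast-math).
float compensated_sum(const std::vector<float>& x) {
    float sum = 0.0f, c = 0.0f;
    for (float xi : x) {
        float y = xi - c;
        float t = sum + y;   // low-order bits of y are lost in this addition...
        c = (t - sum) - y;   // ...and recovered here
        sum = t;
    }
    return sum;
}

int main() {
    std::vector<float> terms(10000000, 0.1f);
    float naive = 0.0f;
    for (float t : terms) naive += t;   // drifts badly once the running sum is large
    std::printf("naive: %f  compensated: %f  exact: %f\n",
                naive, compensated_sum(terms), 0.1 * terms.size());
    return 0;
}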
2.3.2 General Optimization
The algorithm was implemented for several force models. For simplicity, in the follow-
ing discussion, only the GA kernel is discussed, which corresponds to the gravitational
attraction between two mass particles, given by
a_i = −G ∑_{j≠i} m_j r_ij / (r_ij² + ε²)^(3/2)        (2.1)
where ai is the acceleration on particle i, G is a constant (often normalized to one), mj
is the mass of particle j, ε is a softening parameter used to avoid near singular forces
when two particles become very close, and rij is the vector displacement between
particles i and j. The performance of the kernel for various input sizes is shown in
Figure 2.1.
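For concreteness, a direct (scalar, host-side) evaluation of Eq. (2.1) might look like the following C++ sketch; the container types and parameter names are illustrative, not taken from the GPU kernels.

#include <array>
#include <cmath>
#include <vector>

using Vec3 = std::array<float, 3>;

// Softened gravitational acceleration on particle i, per Eq. (2.1).
// Here r is the displacement from particle i to particle j, which absorbs
// the overall minus sign of the equation.
Vec3 acceleration(std::size_t i,
                  const std::vector<Vec3>& pos,
                  const std::vector<float>& mass,
                  float G, float eps)
{
    Vec3 a{0.0f, 0.0f, 0.0f};
    for (std::size_t j = 0; j < pos.size(); ++j) {
        if (j == i) continue;
        Vec3 r{pos[j][0] - pos[i][0],
               pos[j][1] - pos[i][1],
               pos[j][2] - pos[i][2]};
        float r2   = r[0]*r[0] + r[1]*r[1] + r[2]*r[2] + eps*eps;
        float coef = G * mass[j] / (r2 * std::sqrt(r2));   // m_j / (r^2 + eps^2)^(3/2)
        for (int k = 0; k < 3; ++k) a[k] += coef * r[k];
    }
    return a;
}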
The algorithm outlined in Section 2.2 was implemented in BrookGPU and targeted
for the ATI X1900XTX. Even this naive implementation performs very well, achieving
over 40 GFlops, but its performance can be improved. This kernel executes 48 Giga-
instructions/sec and has a memory bandwidth of 33 GB/sec. Using information from
GPUBench [14], one expects the X1900XTX to be able to execute approximately
30-50 Giga-instruction/sec (it depends heavily on the pipelining of commands) and
have a cache memory bandwidth of 41GB/sec. The nature of the algorithm is such
that almost all the memory reads will be from the cache since all the pixels being
rendered at a given time will be accessing the same j-particle. Thus this kernel is
limited by the rate at which the GPU can issue instructions (compute bound).
To achieve higher performance, the standard technique of loop unrolling was used.
This naive implementation is designated as a 1×1 kernel because it is not unrolled
in either i or j. The convention followed hereafter when designating the amount of
unrolling will be that A×B means i unrolled A times and j unrolled B times. The
second GA kernel (1×4) unrolled the j-loop four times, enabling the use of the 4-way
SIMD instructions on the GPU. This reduces the number of instructions that must be
issued by roughly a factor of 3 (some pixel shader instructions are scalar, which prevents
a full factor-of-4 reduction). The performance for this kernel is
shown in Figure 2.1. It achieves a modest speedup compared to the previous one,
and the kernel has now switched from being compute bound to bandwidth bound (35
Giga-Instructions/sec and ≈40GB/sec).
Figure 2.1: GA kernel performance (giga-interactions per second versus output stream
size) for varying amounts of unrolling (1×1, 1×4 and 4×4).
Further reducing bandwidth usage is somewhat more difficult. It involves using
the multiple render targets (MRT) capability of recent GPUs which is abstracted as
multiple output streams by BrookGPU. By reading in 4 i-particles into each kernel
invocation and outputting the force on each into a separate output stream, the size of
each output stream is reduced by a factor of four compared with the original. This
reduces the input bandwidth requirement to one quarter of the original because
each j-particle is only read by one-quarter as many fragments. To make this more
clear, the pseudo-code for this kernel is shown below. This kernel is designated as a
4×4 kernel.

stream coordinates;
stream indices = range( 1 to N skip 4 );
stream forces1, forces2, forces3, forces4;

kernel kforce4x4(i)
    force1 = 0
    force2 = 0
    force3 = 0
    force4 = 0
    ri1 = coordinates[i]
    ri2 = coordinates[i+1]
    ri3 = coordinates[i+2]
    ri4 = coordinates[i+3]
    for j = 1 to N skip 4
        rj1 = coordinates[j]
        rj2 = coordinates[j+1]
        rj3 = coordinates[j+2]
        rj4 = coordinates[j+3]
        force1 += force_function4(ri1, rj1, rj2, rj3, rj4)
        force2 += force_function4(ri2, rj1, rj2, rj3, rj4)
        force3 += force_function4(ri3, rj1, rj2, rj3, rj4)
        force4 += force_function4(ri4, rj1, rj2, rj3, rj4)
    end
    return force1, force2, force3, force4
end kernel

forces1, forces2, forces3, forces4 = kforce4x4(indices)

In the above code, the input is the sequence of integers 1, 5, 9, ..., N and the output
is 4 force streams. The force_function4 kernel uses the 4-way SIMD math available on
the GPU to compute 4 forces at a time. The four output streams can be trivially
merged into a single one if needed. Results for this kernel can be seen in Figure 2.1.
Once more the kernel has become instruction-rate limited and its bandwidth is half
that of the maximum bandwidth of the ATI board, but the overall performance has
increased significantly.
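Merging the four output streams back into a single force array, should a caller require it, is a simple interleave. The sketch below assumes the i-particle ordering implied by the pseudo-code above; the names and types are otherwise hypothetical.

#include <vector>

// Kernel invocation k produced the forces on i-particles 4k .. 4k+3, one per
// output stream, so the merged array simply interleaves the four streams.
template <class T>
std::vector<T> merge_outputs(const std::vector<T>& f1, const std::vector<T>& f2,
                             const std::vector<T>& f3, const std::vector<T>& f4)
{
    std::vector<T> out(4 * f1.size());
    for (std::size_t k = 0; k < f1.size(); ++k) {
        out[4*k + 0] = f1[k];
        out[4*k + 1] = f2[k];
        out[4*k + 2] = f3[k];
        out[4*k + 3] = f4[k];
    }
    return out;
}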
2.3.3 Optimization for small systems
In all cases, performance is severely limited when the number of particles is less than
about 4000. This is due to a combination of fixed overhead in executing kernels and
the lack of sufficiently many parallel threads of execution. It is sometimes necessary
to process small systems or subsystems of particles (N ≈ 100− 1000).
For example, in molecular dynamics where forces tend to be short-range in nature,
it is more common to use O(N) methods by neglecting or approximating the inter-
actions beyond a certain cutoff distance. However, when using continuum solvent
models, the number of particles is small enough (N ≈ 1000) that the O(N2) method
is comparable in complexity while giving greater accuracy than O(N) methods.
It is common in stellar dynamics to parallelize the individual time step scheme
by using the block time step method [66]. In this method forces are calculated on
only a subset of the particles at any one time. In some simulations a small core can
form such that the smallest subset might have less than 1000 particles in it. To take
maximal advantage of GPUs it is therefore important to get good performance for
small output stream sizes.
To do this, one can increase the number of parallel threads by decreasing the
j-loop length. For example, the input stream can be replicated twice, with the j-loop
looping over the first N/2 particles for the first half of the replicated stream and
looping over the second N/2 particles for the second half of the stream. Consider the
following pseudocode that replicates the stream size by a factor of 2:

stream coordinates;
stream indices = range( 1 to 2N );
stream partial_forces;

kernel kforce(i)
    force = 0
    if i <= N:
        ri = coordinates[i]
        for j = 1 to N/2
            rj = coordinates[j]
            force = force + force_function(ri, rj)
        end
    else
        ri = coordinates[i-N]
        for j = N/2+1 to N
            rj = coordinates[j]
            force = force + force_function(ri, rj)
        end
    endif
    return force
end kernel

partial_forces = kforce(indices)

In this example, the stream indices is twice as long as the coordinates stream
and contains integers in sequence from 1 to 2N . After applying the kernel kforce
on indices to get partial_forces, the force on particle i can be obtained by
adding partial_forces[i] and partial_forces[i+N], which can be expressed as a
trivial kernel (see the sketch below). The performance of the LJC(sigmoidal) kernel for
different numbers of replications of the i-particles is shown in Figure 2.2 for several system sizes.
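The combining kernel itself amounts to a single addition per stream element; a hypothetical host-side equivalent for a replication factor of 2 (treating each force component as one stream element) is:

#include <vector>

// Total force on particle i = partial sum over j = 1..N/2 (stored at i)
// plus partial sum over j = N/2+1..N (stored at i+N).
std::vector<float> combine_partials(const std::vector<float>& partial_forces) {
    const std::size_t n = partial_forces.size() / 2;
    std::vector<float> forces(n);
    for (std::size_t i = 0; i < n; ++i)
        forces[i] = partial_forces[i] + partial_forces[i + n];
    return forces;
}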
2.4 Results
All kernels were run on an ATI X1900XTX PCIe graphics card in a Dell Dimension
8400 with pre-release drivers from ATI (version 6.5) and the DirectX SDK of February
2006. A number of different force models were implemented with varying compute-to-
bandwidth ratios (see Table 2.1). A sample code listing is provided in the appendix
(2.7.1) to show the details of how flops are counted.

Figure 2.2: Performance improvement for the LJC(sigmoidal) kernel with i-particle
replication for several values of N (4096, 2048, 1024 and 768).

Kernel            GMX Million Intrxn/s   GMX ns/day   GPU Million Intrxn/s   GPU ns/day
LJC(constant)              66                11.4              2232               386
LJC(linear)*               33                 5.7              2271               392
LJC(sigmoidal)             40                 6.9              1836               317

Table 2.3: Comparison of GROMACS (GMX) running on a 3.2 GHz Pentium 4 vs. the
GPU, showing the estimated simulation time per day for a 1000 atom system.
*GROMACS does not have an SSE inner loop for LJC(linear)
To compare against the CPU, specially optimized versions of the GA and GAJ
kernels were written since no software suitable for a direct comparison to the GPU
existed. The work of [74] uses SSE for the GAJ kernel but does some parts of the
calculation in double precision which makes it unsuitable for a direct comparison. The
performance they achieved is comparable to the performance achieved here. Using
SSE intrinsics and Intel’s C++ Compiler v9.0, sustained performance of 3.8 GFlops
on a 3.0 GHz Pentium 4 was achieved.
GROMACS [56] is currently the fastest performing molecular dynamics software
with hand-written SSE assembly loops. As mentioned in Section 2.2 the CPU can do
out-of-order writes without a significant penalty. GROMACS uses this fact to halve
the number of calculations needed in each force calculation step. In the comparison
against the GPU in Table 2.3 the interactions per second as reported by GROMACS
have been doubled to reflect this. Also shown in the table are the estimated nanosec-
onds one could simulate in a day for a system of 1000 atoms - all O(N) operations
such as constraints and updates have been neglected in this estimate, as they consume
less than 2% of the total runtime. The GPU calculation thus represents an order of
magnitude improvement over existing methods on CPUs.
Figure 2.3: Speed comparison (billions of interactions per second) of the CPU (observed),
GPU (observed) and GRAPE-6A (theoretical peak) for the GA, GAJ and LJC(constant)
kernels.
2.5 Discussion
2.5.1 Comparison to other Architectures
Figure 2.3 shows a comparison of interactions/sec between the ATI X1900XTX, GRAPE-
6A and a Pentium 4 3.0 GHz. The numbers for the GPU and CPU are observed values,
those for GRAPE-6A are for its theoretical peak. Compared to GRAPE-6A, the GPU
can calculate over twice as many interactions when only the acceleration is computed,
and a little over half as many when both the acceleration and jerk are computed. The
GPU bests the CPU by 35x, 39x and 15x for the GA, LJC(constant) and GAJ kernels
respectively.
Another important metric is performance per unit of power dissipated. These
results can be seen in Figure 2.5. Here the custom design and much smaller on-board
memory allows GRAPE-6A to better the GPU by a factor of 4 for the GAJ kernel,
although they are still about equal for the GA kernel. The power dissipation of the
Intel Pentium 4 3.0 GHz is 82W [43], the X1900XTX is 120W [4], and GRAPE-6A’s
dissipation is estimated to be 48W since each of the 4 processing chips on the board
dissipates approximately 12W [62].
The advantages of the GPU become readily apparent when the metric of perfor-
mance per dollar is examined (Figure 2.4). The current price of an Intel Pentium 4
630 3.0GHz is $100, an ATI X1900XTX is $350, and an MD-GRAPE3 board costs
$16000 [42]. The GPU outperforms GRAPE-6A by a factor of 22 for the GA kernel
and 6 for the GAJ kernel.
Figure 2.4: Performance per U.S. dollar (millions of interactions per second per dollar)
of the CPU (observed), GPU (observed), GRAPE (theoretical peak) and MD-GRAPE3
(observed) for the GA, GAJ and LJC(constant) kernels.

Figure 2.5: Millions of interactions per Watt of the CPU (observed), GPU (observed),
GRAPE (theoretical peak) and MD-GRAPE3 (observed) for the GA, GAJ and
LJC(constant) kernels.
2.5.2 Hardware Constraints
The 4×4 unrolling that is possible with the GA kernel does not work for the other,
more complicated kernels. For example, the GAJ kernel requires two outputs per
particle (jerk in addition to acceleration). This reduces the maximum unrolling pos-
sibility to 2×4 because the GPU is limited to a maximum of 4 outputs per kernel.
However, even this amount of unrolling doesn’t work because the compiler cannot
fit the kernel within the 32 available registers. The number of registers is also what
prevents the LJC kernels from being unrolled by 4×4 instead of 2×4.
This apparent limitation due to the number of registers appears to result from
compiler inefficiencies; the authors are currently hand coding a 2×4 GAJ kernel
directly in pixel shader assembly which should cause the kernel to become compute
bound and greatly increase its performance. The performance gain of unrolling the
LJC kernels to 4×4 by rewriting them in assembly would most likely be small since
these kernels are already compute bound.
While the maximum texture size of 4096×4096 and 512 MB would make it pos-
sible to store up to 16 million particles on the board at a time, this really isn’t
necessary. In fact, GRAPE-6A only has storage for 131,000 particles on the board
at any one time. This is small enough to occasionally seem restrictive - a good bal-
ance is around 1 million particles which could easily be accommodated by 64MB. If
board manufacturers wanted to produce cheaper boards specifically for use in these
kinds of computations they could significantly reduce the cost without affecting the
functionality by reducing the amount of onboard RAM.
The current limits on the number of instructions also impact the efficiency of large
GPGPU programs. On ATI hardware, the maximum shader length of 512 instructions
limits the amount of loop unrolling and the complexity of the force functions one can
handle. On NVIDIA hardware, the dynamic instruction limit restricts us to very small
systems unless multi-pass techniques are used, which affect the cache efficiency
and therefore the performance of the proposed algorithms.
2.5.3 On-board Memory vs. Cache Usage
As mentioned in Section 2.3.2 one expects the kernels to make very efficient use of
the cache on the boards. At most 512 threads are in flight on the ATI
X1900XTX at any one time [4], and in the ideal situation, each of these threads will
try and access the same j-particle at approximately the same time. The first thread
to request a j-particle will miss the cache and cause the particle to be fetched from
on-board memory; however, once it is in the cache, all the threads should be able to
read it without it having to be fetched from on-board memory again.
For example, in the case of the GA kernel with 65,536 particles, there would
be 16,384 fragments to be processed, and if fragments were processed in perfectly
separate groups of 512, then 32 groups would need to be processed. Each group
would need to bring in 65,536 particles from main memory to the cache resulting in
an extremely low memory bandwidth requirement of 38.2 MB/sec.
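As a rough consistency check of that figure (assuming 16-byte float4 particles and the measured rate of 4.97 giga-interactions/s from Table 2.2):

32 groups × 65,536 particles × 16 bytes ≈ 33.6 MB of fetches per force evaluation,
65,536² interactions / (4.97 × 10⁹ interactions/s) ≈ 0.86 s per force evaluation,
33.6 MB / 0.86 s ≈ 39 MB/s,

which is in line with the 38.2 MB/sec quoted above.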
Of course, the reality is that particles are not processed in perfectly separate
groups of 512 particles that all request the same particle at the same time, but by
using ATITool [5] to adjust the memory clock of the board one can determine how
much bandwidth each kernel actually needs to main memory. The results of this
testing can be seen in Figure 2.6.
The performance degradation occurs at approximately 11.3, 5.2, and 2.1 GB/sec
for the LJC, GAJ and GA kernels respectively. The LJC kernels must also read in an
exclusion list for each particle which does not cache as well as the other reads, and is
the reason why their bandwidth to main memory is higher than that of the gravity
kernels. The number for the GA kernel suggests that approximately 10 particles are
accessing the same j-particle at once.
At memory speeds above 500MHz all the kernels run very near their peak speed,
thus board manufacturers could not only use less RAM, they could also use cheaper
RAM if they were to produce a number of boards that would only be used for these
calculations. This would reduce the cost and power requirements over the standard
high end versions used for gaming.
Figure 2.6: GFlops achieved as a function of memory clock speed (MHz) for the GA,
GAJ, LJC(sigmoidal), LJC(linear) and LJC(constant) kernels.
2.6 Conclusion
The processing power of GPUs has been successfully used to accelerate pairwise force
calculations for several commonly used force models in stellar and molecular dynam-
ics simulations. In some cases the GPU is more than 25 times as fast as a highly
optimized SSE-based CPU implementation and exceeds the performance of GRAPE-
6A, which is hardware specially designed for this task. Furthermore, the performance
is compute bound, so this work is well poised to take advantage of further increases
in the number of ALUs on GPUs, even if memory subsystem speeds do not increase
significantly. Because GPUs are mass produced, they are relatively inexpensive and
their performance to cost ratio is an order of magnitude better than the alternatives.
The wide availability of GPUs will allow distributed computing initiatives like Fold-
ing@Home to utilize the combined processing power of tens of thousands of GPUs to
address problems in structural biology that were hitherto computationally infeasible.
It is safe to conclude that the future will see some truly exciting applications of GPUs
to molecular dynamics.
2.7 Appendix
2.7.1 Flops Accounting
To detail how flops are counted a snippet of the actual Brook code for the GA ker-
nel is presented. The calculation of the acceleration on the first i-particle has been
commented with the flop counts for each instruction. In total, the calculation of
the acceleration on the first i-particle performs 76 flops. Since four interactions are
computed, this amounts to 19 flops per interaction.

float3 d1, d2, d3, d4, outaccel1;
float4 jmass, r, rinv, rinvcubed, scalar;

d1 = jpos1 - ipos1;                                      // 3
d2 = jpos2 - ipos1;                                      // 3
d3 = jpos3 - ipos1;                                      // 3
d4 = jpos4 - ipos1;                                      // 3
r.x = dot(d1, d1) + eps;                                 // 6
r.y = dot(d2, d2) + eps;                                 // 6
r.z = dot(d3, d3) + eps;                                 // 6
r.w = dot(d4, d4) + eps;                                 // 6
rinv = rsqrt(r);                                         // 4
rinvcubed = rinv * rinv * rinv;                          // 8
scalar = jmass * rinvcubed;                              // 4
outaccel1 += scalar.y*d2 + scalar.z*d3 + scalar.w*d4;    // 18
if (Ilist.x != Jlist1.x)   // don't add force due to ourself
    outaccel1 += scalar.x * d1;                          // 6
Chapter 3
Structured PDE Solvers on CELL
and GPUs
3.1 Introduction
In this section the implementation of flow solvers using GPUs and the CELL are
examined. Specifically, the focus is on solving the compressible Euler equations in
complicated geometry with a multi-block structured code. Flow solvers had been
implemented on these platforms before (see the next section), but a real large-scale
engineering application had never been demonstrated. In this section, a real engi-
neering flow calculation running on a single GPU with “engineering” accuracy and
numerics is presented. It demonstrates the potential of these processors for high per-
formance scientific computing. The CELL presented many more difficulties and its
level of performance was far below expectations, leading us to terminate our work on
that architecture. The difficulties and performance achieved are described after the
GPU work.
3.2 Review of prior work on GPUs
The current state of the art in applying GPUs to computational fluid mechanics is
either simulations for graphics purposes emphasizing speed and appearance over accu-
racy, or simulations generally dealing with 2D geometries and using simpler numerics
not suited for complex engineering flows. Some previous efforts in this direction
are now reviewed. The most notable work of engineering significance is the work of
Brandvik [11] who solved an Euler flow in 3D geometry.
Kruger and Westermann[51] implemented basic linear operators (vector-vector
arithmetic, matrix-vector multiplication with full and sparse matrices) and measured
a speed-up around 12–15 on ATI 9800 compared to Pentium 4 2.8 GHz. Applications
to the conjugate gradient method and the Navier-Stokes equations in 2D are pre-
sented. Rumpf and Strzodka[81] applied the conjugate gradient method and Jacobi
iterations to solve non-linear diffusion problems for image processing operations.
Bolz et al.[9] implemented sparse matrix solvers on GPU using the conjugate
gradient method and a multigrid acceleration. Their approach was tested on a 2D
flow problem. A 2D unit square was chosen as test case. A speed-up by 2 was
measured with a GeForce FX.
Goodnight et al.[34] implemented the multigrid method on GPUs for three appli-
cations: simulation of heat transfer, modeling of fluid mechanics, and tone mapping of
high dynamic range images. For the fluid mechanics application, the vorticity-stream
function formulation was applied to solve for the vorticity field of a 2D airfoil. This
was implemented on NVIDIA GeForceFX 5800 Ultra using Cg. A speed-up of 2.3
was measured compared to an AMD Athlon XP 1800.
In computer graphics where accuracy is not essential but speed is, flow simula-
tions using the method of Stam[89] are very popular. It is a semi-Lagrangian method
and allows large time-steps to be applied in solving the Navier-Stokes equations with
excellent stability. Though the method is not accurate enough for engineering compu-
tation, it does capture the characteristics of fluid motion with nice visual appearance.
Harris et al.[35] performed a rather comprehensive simulation of cloud visualization
based on Stam’s method[89]. Partial differential equations describe fluid motion,
thermodynamic processes, buoyant forces, and water phase transitions. Liu et al.[57]
performed various 3D flow calculations, e.g. flow over a city, using Stam’s method[89].
Their goal is to have a real-time solver along with visualization running on the GPU.
A Jacobi solver is used with a fixed number of iterations in order to obtain a satis-
factory visual effect.
The Lattice-Boltzmann model (LBM) is attractive for GPU processors since it is
simple to implement on sequential and parallel machines, requires a significant com-
putational cost (therefore benefits from faster processors) and is capable of simulating
flows around complex geometries. One should be aware of some limitations of this
approach; what is gained in terms of algorithm simplicity is often lost in terms of
overall accuracy and various physical/numerical limitations (see review by Khalighi
et al.[49]). Li et al.[55, 54] obtained a speed-up around 6 using Cg on an NVIDIA
GeForce FX 5900 Ultra (vs. Pentium 4 2.53 GHz). See the work of Fan et al.[28] using
a GPU cluster.
Scheidegger et al.[82] ported the simplified marker and cell (SMAC) method[3] for
time-dependent incompressible flows. SMAC is a technique used primarily to model
free surface flows. Scheidegger performed several 2D flow calculations and obtained
speed-ups on NV 35 and NV 40 varying from 7 to 21. The error of the results was in
the range 10⁻²–10⁻³. See also the recent review by Owens et al.[79].
The work of Brandvik et al.[11] is the closest to our own. They implement a 2D
and 3D compressible solver on the GPU in both BrookGPU and Nvidia’s CUDA. They
achieve speedups of 29 (2D) and 16 (3D) respectively, although the 3D BrookGPU
version achieved a speedup of only 3. A finite volume discretization with vertex
storage and a structured grid of quadrilaterals was used. No multi-grid or multiple
blocks were used.
3.3 Flow Solver
The Navier-Stokes Stanford University Solver (NSSUS) solves the three-dimensional
Unsteady Reynolds Averaged Navier-Stokes (URANS) equations on multi-block meshes
using a vertex-centered solution with first to sixth order finite difference and artificial
dissipation operators based on work by Mattson[65], Svard[91], and Carpenter[16] on
Summation by Parts (SBP) operators. Boundary conditions are implemented using
penalty terms based on the Simultaneous Approximation Term (SAT) approach[16].
Geometric multigrid with support for irregular coarsening of meshes is also imple-
mented. The SBP and SAT approaches allow for provably stable handling of the
boundary conditions (both physical boundaries and boundaries between blocks). The
numerics of the code are investigated in the work of Nordström et al. [76].
This work focuses on a subset of the capabilities in NSSUS, namely the steady
solution of the compressible Euler equations which come about if the viscous effects
and heat transfer in the Navier-Stokes equations are neglected. Flows modeled using
the Euler equations are routinely used as part of the analysis and design of transonic
and supersonic aircraft, missiles, hypersonic vehicles, and launch vehicles. Current
GPUs are well suited to solving the Euler equations since the use of double precision,
needed for the fine mesh spacing required to properly resolve the boundary layer in
RANS simulations, is not necessary.
The non-dimensional Euler equations in conservation form are
∂W/∂t + ∂E/∂x + ∂F/∂y + ∂G/∂z = 0,        (3.1)

where W is the vector of conserved flow variables and E, F, and G are the Euler flux
vectors defined as:

W = [ρ, ρu, ρv, ρw, ρe],
E = [ρu, ρu² + p, ρuv, ρuw, ρuh],
F = [ρv, ρuv, ρv² + p, ρvw, ρvh],
G = [ρw, ρuw, ρvw, ρw² + p, ρwh].

In these equations, ρ is the density, u, v, and w are the cartesian velocity components,
p is the static pressure, and h is the total enthalpy, related to the total energy by
h = e + p/ρ. For an ideal gas, the equation of state may be written as

p = (γ − 1) ρ [ e − ½ (u² + v² + w²) ].        (3.2)

For the finite difference discretization a coordinate transformation from the physical
coordinates (x, y, z) to the computational coordinates (ξ, η, ζ) is performed to yield

∂Ŵ/∂t + ∂Ê/∂ξ + ∂F̂/∂η + ∂Ĝ/∂ζ = 0,        (3.3)

where Ŵ = W/J, J is the coordinate transformation Jacobian, and

Ê = (1/J)(ξ_x E + ξ_y F + ξ_z G),   F̂ = (1/J)(η_x E + η_y F + η_z G),   Ĝ = (1/J)(ζ_x E + ζ_y F + ζ_z G).

Discretizing the spatial operators results in a system of ordinary differential equations

d/dt ( W_ijk / J_ijk ) + R_ijk = 0,        (3.4)
at every node in the mesh. An explicit five-stage Runge-Kutta scheme using modified
coefficients for a maximum stability region is used to advance the equations to a
steady state solution. Computing the residual R is the main computational cost;
it includes the inviscid Euler fluxes, the artificial dissipation for stability, and the
penalty terms for the boundary conditions. The penalty states, obtained either from
physical boundary conditions or (for internal block boundaries) from the value of the
flow solution in another block, are used to compute the penalty terms. Geometric
multi-grid is used to speed up convergence.
In the next sections, the implementation of NSSUS on GPUs is described. This
work was accomplished using BrookGPU. The algorithms required to implement
NSSUS on the GPU are discussed and numerical results and performance measure-
ments are reported.
3.4 Numerical accuracy considerations and performance comparisons between CPU and GPU
Producing identical results in a CPU and GPU implementation of an algorithm is,
perhaps surprisingly, not a simple matter. Even if the exact same sequence of
instructions is executed on each processor, it is quite possible for the results to be
different. Current GPUs do not support the entire IEEE-754 standard. Some of the
deviations are not, in the author’s experience, generally a concern: not all rounding
modes are supported; there is no support for denormalized numbers; and NaN and
floating point exceptions are not handled identically. However, other differences are
more significant and will affect most applications: division and square root are imple-
mented in a non-standard-compliant fashion, and multiplication and addition can be
combined by the compiler into a single instruction (FMAD) which has no counterpart
on current CPUs.

X = A*B + C;   // FMAD
This instruction truncates the result of the intermediate multiplication leading to
different behavior than if the operations were performed sequentially[78].
There are other differences between the architectures that can cause even a se-
quence of additions and multiplications (without FMADs) to yield different results.
This is because the FPU registers are 80-bit on CPUs but only 32-bit on current
generation GPUs. If the following sequence of operations was performed:

C = 1E5 + 1E-5;   // C is in a register
D = 10 * C;       // C is still in a register, so is D
E = D - 1E6;      // the result E is finally written to memory

on a GPU, E would be 0, while on a CPU it would contain the correct result of
0.0001. The result of the initial addition would be truncated to 1E5 to fit in the 32-bit
registers of a GPU, unlike the CPU, where the 80-bit registers can represent the result
of the addition.
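The effect is easy to reproduce on the host; the snippet below is illustrative only (it is not part of NSSUS), and on a CPU the exact outcome depends on whether intermediates are kept in extended-precision x87 registers or in 32-bit SSE registers.

#include <cstdio>

int main() {
    // Pure 32-bit arithmetic: 1e-5 is below the rounding unit of 1e5 in float,
    // so it is lost immediately and e comes out exactly 0.
    float c = 1e5f + 1e-5f;
    float d = 10.0f * c;
    float e = d - 1e6f;

    // Wider intermediate precision keeps the small term and recovers 1e-4.
    long double C = 1e5L + 1e-5L;
    long double D = 10.0L * C;
    long double E = D - 1e6L;

    std::printf("32-bit: %g   extended: %Lg\n", e, E);
    return 0;
}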
Evaluation of transcendental functions is also likely to produce different results,
especially for large values of the operand.
To further complicate matters, CPUs have an additional SIMD unit that is sepa-
rate from the traditional FPU. This unit has its own 128-bit registers that are used
to store either 4 single precision or 2 double precision numbers. This has implications
both for speed-up and accuracy comparisons. Each number is now stored in its own
32-bit quarter of the register. The above operations would yield the same result on
both platforms if the CPU was using the SIMD unit for the computation.
In addition, by utilizing the SIMD unit, the CPU performs these 4 operations
simultaneously which leads to a significant increase in performance. Unfortunately,
the SIMD unit can only be directly used by programming in assembly language or
using “intrinsics” in a language such as C/C++. Intrinsics are essentially assembly
language instructions, but allow the compiler to take care of instruction order opti-
mization and register allocation. In most scientific applications writing at such a low
level is impractical and rarely done; instead compilers that “auto-vectorize” code have
been developed. They attempt to transform loops so that the above SIMD operations
can be used.
3.5 Mapping the Algorithms to the GPU
3.5.1 Classification of kernel types
In mapping the various algorithms to the GPU it is useful to classify kernels into four
categories based on their memory access patterns. All of the kernels that make up
the entire PDE solver can be classified into one of these categories. Portions of the
computation that are often referred to as a unit, the artificial dissipation for example,
are often composed of a sequence of many different kernels. For each kernel type a
simple example of sequential C code is given, followed by how that code would be
transformed into streaming BrookGPU code.
The categories are:
Pointwise. When all memory accesses, possibly from many different streams, are
from the same location as the output location of the fragment. A simple example
of this type of kernel would be calculating momentum at all vertices by multiplying
the density and velocity at each vertex. Kernels of this type often have much greater
computational density than the following three types of kernels.

for (int i = 0; i < 100; ++i)
    c[i] = a[i] + b[i];

would be transformed into the above add kernel.
Stencil. Kernels of this type require data that is spatially local to the output loca-
tion of the fragment. The data may or may not be local in memory depending on
how the 3D data is mapped to 2D space. Difference approximations and multigrid
transfer operations lead to kernels of this type. These kernels often have a very low
computational density, often performing only one arithmetic operation per memory
load.

for (int x = left; x < right; ++x)
    for (int y = bottom; y < top; ++y)
        res[x][y] = (func[x+1][y] + func[x-1][y] + func[x][y+1]
                   + func[x][y-1] - 4*func[x][y]) / delta;

would become
kernel void res( float delta, float func[][], out float res<> ) {
    float2 my_index = indexof( res ).xy;
    float2 up    = my_index + float2( 0, 1 );
    float2 down  = my_index - float2( 0, 1 );
    float2 right = my_index + float2( 1, 0 );
    float2 left  = my_index - float2( 1, 0 );
    res = ( func[up] + func[down] + func[right] + func[left]
            - 4*func[my_index] ) / delta;
}
Unstructured gather. While connectivity inside a block is structured, the blocks
themselves are connected in an unstructured fashion. To access data from neighbor-
ing blocks, special data structures are created to be used by gather kernels which
consolidate non-local information. Copying the sub-faces of a block into their own
sub-face stream is a special case of this kind of kernel.

kernel void unstructGather( float2 pos[][], float data[],
                            out float reshuffle<> ) {
    float2 my_index = indexof( reshuffle );
    float2 gatherPos = pos[ my_index ];
    reshuffle = data[ gatherPos ];
}
The contents of the pos stream are indices that are used to access elements of the
data stream.
Reduction. Reduction kernels are used to monitor the convergence of the solver. A
reduction kernel outputs a single scalar by performing a commutative operation on all
the elements of the input stream. Examples include the sum, product or maximum
of all elements in a stream. Reduction operations are implemented in Brook using
efficient tree data structures and an optimal number of passes [13].
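The tree idea behind such reductions can be sketched on the host as follows (an illustrative C++ analogue, not the Brook implementation of [13]): each pass combines pairs of elements, halving the stream until one value remains, so N elements require about log2(N) passes.

#include <vector>

// Multi-pass pairwise (tree) reduction: each pass halves the stream length.
float tree_sum(std::vector<float> v) {
    while (v.size() > 1) {
        std::vector<float> next((v.size() + 1) / 2);
        for (std::size_t i = 0; i < next.size(); ++i) {
            float a = v[2*i];
            float b = (2*i + 1 < v.size()) ? v[2*i + 1] : 0.0f;
            next[i] = a + b;          // one "pass" over the current stream
        }
        v.swap(next);
    }
    return v.empty() ? 0.0f : v[0];
}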
Figure 3.1: Array of Structures (the x, y and z components are interleaved per element).

Figure 3.2: Structure of Arrays (all x components are stored contiguously, then all y, then all z).
3.5.2 Data layout
Because the entire iterative loop of the solver is performed on the GPU, the data
layout used by the CPU need not constrain the data layout used by the BrookGPU
version of NSSUS. A one-time translation to and from the GPU format can be done at
the beginning and end of the complete solve with minimal overhead. This translation
takes on the order of one second, whereas solves take minutes to tens of minutes.
Until the release of the G80 from NVIDIA, all graphics processors had 4-wide
SIMD processors; the latest ATI card, the R600, will retain this design. For maximum
efficiency on these vector designs, data should be laid out using a structure of arrays
(SoA), see figure 3.2, instead of the more convenient array of structures (AoS), see
figure 3.1, so that the full vector capability of the processor is utilized every cycle.
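For concreteness, the two layouts can be sketched as follows for the mesh metrics; the field names are illustrative and not those used in NSSUS.

#include <vector>

// Array of Structures (AoS): one record per vertex; the components of each
// quantity are interleaved in memory (convenient, maps to float3/float4 reads).
struct MetricsAoS {
    float xi_x, xi_y, xi_z;
    float eta_x, eta_y, eta_z;
    float zeta_x, zeta_y, zeta_z;
};
// e.g. std::vector<MetricsAoS> metrics(num_vertices);

// Structure of Arrays (SoA): each component is its own contiguous stream, so
// four consecutive vertices can fill one 4-wide SIMD operand, but the three
// float3 streams of the AoS layout become nine separate kernel inputs.
struct MetricsSoA {
    std::vector<float> xi_x, xi_y, xi_z;
    std::vector<float> eta_x, eta_y, eta_z;
    std::vector<float> zeta_x, zeta_y, zeta_z;
};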
Unfortunately, such a data layout presents a number of problems. The main one
is that the mesh metrics which would be stored as 3 float3 streams in AoS become
9 float4 streams in SoA. The maximum number of inputs to any kernel is 16, a
hardware limitation, and having 9 of those taken up by just the metrics means it will
not be possible to get all the necessary data into some of the kernels.
The second difficulty SoA introduces is that along the direction that data is packed
into float4s, mesh dimensions are forced to be multiples of 4, when in reality they
almost never are. This could, of course, be surmounted with sufficient effort and
increased complexity of the software.
Finally, since NVIDIA has moved to scalar chips, it should theoretically not matter
which format is used on their future cards. Even on ATI cards, AoS harnesses more
than 3/4th of the available computational power by use of intrinsics such as dot
product and length as well as combining floats into float3 or float4 when possible.
For all of these reasons, in this project, it was decided to go with the simpler (from
a software engineering standpoint) AoS format.
To lay the 3D data out in the 2D texture memory, the standard “flat” 3D texture
approach[53] was used where each 2D plane making up the 3D data is stored at a
different location in the 2D stream. This leads to some additional indexing to figure
out a fragment’s 3D index from its location in the 2D stream (12 flops) and also
additional work to convert 3D indices back to 2D locations (9 flops).

kernel float3 where_am_i( float2 index,
                          float sizex, float sizey, float dx ) {
    float3 my_loc;
    my_loc.x = fmod( index.x, sizex );
    my_loc.y = fmod( index.y, sizey );
    my_loc.z = floor( index.x / sizex ) +
               dx * floor( index.y / sizey );
    return my_loc;
}

kernel float2 newZIndex( float3 my_loc, float dz,
                         float sizex, float sizey, float dx ) {
    float2 new_index;
    new_index.x = fmod( my_loc.z + dz, dx ) * sizex + my_loc.x;
    new_index.y = floor( ( my_loc.z + dz ) / dx ) * sizey + my_loc.y;
    return new_index;
}
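A host-side mirror of this indexing arithmetic (hypothetical C++; dx above corresponds to tiles_per_row here) makes it easy to check that the 2D-to-3D and 3D-to-2D conversions round-trip correctly:

#include <cassert>

struct Loc3 { int x, y, z; };

// 2D flat-layout location -> 3D index (mirrors where_am_i above).
Loc3 to_3d(int ix, int iy, int sizex, int sizey, int tiles_per_row) {
    return { ix % sizex, iy % sizey,
             ix / sizex + tiles_per_row * (iy / sizey) };
}

// 3D index (optionally shifted by dz slices) -> 2D flat-layout location
// (mirrors newZIndex above).
void to_2d(const Loc3& p, int dz, int sizex, int sizey, int tiles_per_row,
           int& ix, int& iy) {
    ix = ((p.z + dz) % tiles_per_row) * sizex + p.x;
    iy = ((p.z + dz) / tiles_per_row) * sizey + p.y;
}

int main() {
    const int sizex = 8, sizey = 8, tiles_per_row = 4;   // 4 z-slices per row of tiles
    for (int iy = 0; iy < 2 * sizey; ++iy)
        for (int ix = 0; ix < tiles_per_row * sizex; ++ix) {
            Loc3 p = to_3d(ix, iy, sizex, sizey, tiles_per_row);
            int jx, jy;
            to_2d(p, 0, sizex, sizey, tiles_per_row, jx, jy);   // dz = 0: identity
            assert(jx == ix && jy == iy);
        }
    return 0;
}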
Data for each block in the multi-block topology is stored in separate streams; the
solver loops over the blocks and processes each one sequentially.
3.5.3 Summary of GPU code
A summary of the code execution is shown in Figure 3.3. The existing preprocess-
ing subroutines implemented on the CPU are unchanged. Additional GPU specific
preprocessing code is run on the CPU to setup the communication patterns between
blocks, and the treatment of the penalty states and penalty terms. The transfer of
data from the host to the GPU includes the initial value of the solution, preprocessed
quantities computed from the mesh coordinates, and weights and stencils used in the
multigrid scheme. Once the data is on the GPU the solver runs in a closed loop. The
only data communicated back to the host are the L2 norms of the residuals which
are used for monitoring the convergence of the code, and the current solution if the
output to a restart file is requested. The number of lines for the GPU implementation
is approximately: 4,500 lines of Brook code, 8,000 lines of supporting C++ code and
1,000 lines of new Fortran code. The original NSSUS code is in Fortran. It took
approximately 4 months to develop the necessary algorithms and make the changes
to the original code.
Figure 3.3: Flowchart of NSSUS running on the GPU. Preprocessing, GPU-specific
preprocessing, and the transfer of data to the GPU run on the CPU; the solver then runs
in a closed loop on the GPU before data is returned to the host and the output file is
written. The residual computation on the GPU consists of the inviscid flux, artificial
dissipation, multigrid forcing terms, inviscid residual, copying of sub-face data,
block-to-block communication, the penalty state for physical boundary conditions, and
the penalty terms. The solver loop is:

while iteration < iterationsMax and solution not converged:
    loop over steps of the multigrid cycle:
        if prolongation step: transfer correction/solution to fine grid
        if restriction step: transfer solution and residual to coarse grid
        if smoothing step:
            compute residual
            compute time step
            store solution state
            update solution
            compute residual
            loop over remaining Runge-Kutta stages:
                update solution
                compute residual
            update solution
    compute L2 norm of the residual
3.5.4 Algorithms
Constraints from the geometry of the mesh may require that in some blocks, especially
at coarse multigrid levels, the differencing in some directions is done at a lower order
than otherwise desired if the number of points in that direction becomes too small.
To accommodate this constraint imposed by realistic geometries and also to avoid
writing 27 different kernels for each possible combination of order and direction (up
to third order is currently implemented on the GPU), all differencing stencils are
applied in each direction separately.
The numerics of the code are such that one-sided difference approximations are
used near the boundaries of the domain. A boundary point is designated as a point
where a special stencil is needed, and an interior point as a point where the normal
stencil is applied. This distinction presents a problem for parallel data processors
such as GPUs because boundary points perform a different calculation from interior
points and furthermore different boundary points perform different calculations. This
can lead to terrible branch coherency problems. See Figure 3.4. However, regardless
of the order of the discretization, the branching can be reduced to only checking if
the fragment is a boundary point or not. While the calculation for each boundary
point is different, it is always a linear combination of field values which can be com-
puted as a dot product between stencil coefficients and field values. Thus by using a
small 1D stream (that can be indexed using the boundary point’s own location) to
hold the coefficients, only one branch instead of three is required. (Note: we count
an if...else... statement as one branch.) The exact number depends on the
branch granularity of the hardware which is theoretically 4×4 on the 8800. However,
GPUbench [14] suggests that in practice 8×8 performs better than 4×4 and 16×16
even better than 8× 8. For 16× 16, the maximum possible number of branches is 8
– one interior point plus 4 right boundary points and 4 left boundary points (which
can be adjacent due to the flat 3D layout). This technique reduces this maximum
to two – one branch for interior points plus one branch for right and left boundary
points. Higher order differencing would benefit even more from this technique.
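A one-dimensional sketch of this idea is given below (hypothetical C++, not the GPU kernel): every boundary point, whichever closure it needs, runs the same dot-product code path against its own row of stored coefficients, so the only branch is boundary versus interior.

#include <vector>

// u: field values; closure_coeffs: 2*nb rows of width w (nb left closures
// followed by nb right closures); interior points use a central difference.
float apply_stencil(int i, int n, const std::vector<float>& u,
                    const std::vector<float>& closure_coeffs,
                    int nb, int w, float inv_dx)
{
    const bool left  = i < nb;
    const bool right = i >= n - nb;
    if (left || right) {                               // the single branch
        int row  = left ? i : nb + (i - (n - nb));     // which coefficient row to use
        int base = left ? 0 : n - w;                   // closures read from the domain edge
        float acc = 0.0f;
        for (int k = 0; k < w; ++k)
            acc += closure_coeffs[row * w + k] * u[base + k];
        return acc * inv_dx;
    }
    return 0.5f * inv_dx * (u[i + 1] - u[i - 1]);      // interior stencil
}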
Dealing with the boundary conditions and penalty terms in an efficient manner
is significantly more difficult than either of the two previous cases. Figure 3.5 shows
Figure 3.4: This figure illustrates the stencil in the x direction and the branching on the GPU. Each colored square represents a mesh node. The color corresponds to the stencil used for the node. Inner nodes (in grey) use the same stencil. For optimal efficiency, nodes inside a 4×4 square should branch coherently, i.e., use the same stencil (see square with a dashed line border). For this calculation, this is not the case near the boundary, which leads to inefficiencies in the execution. The algorithm proposed here reduces branching and leads to only one branch (instead of 3 here).
how sub-faces and penalty terms are computed for each block. The unstructured
connectivity between blocks leads to several sub-faces on each block. Each node on
the blue block must be penalized against the corresponding node on the adjacent
blocks. For example, the node on the blue block located at the intersection of all
four green blocks must be penalized against the corner node in each of the four green
blocks.
In Brook, it is not possible to stream over a subset of the entries in an array.
Instead one must go through all O(n³) entries and use if statements to determine
whether, and what type of, calculation needs to be performed. This leads to a significant
loss of performance since effectively only O(n²) entries ("surface" entries) need to be
operated on. The problem is made worse by the fact that certain nodes belong to
multiple faces, thereby requiring multiple passes. To solve these issues, it was decided
to copy the sub-face data into one smaller 2D stream (hereafter called the sub-face
stream), and to copy data from other blocks, where necessary for the internal penalty
states, into a second stream (called the neighbor stream). These streams are then used
to calculate the penalty state for physical boundary conditions and the penalty terms.
This step is computationally efficient since, for the most part, only the nodes which
need to be processed are processed. This is a strictly O(n²) step. Finally, the result is
applied back into the full 3D stream. This is shown in more detail in Figures 3.6 and 3.7.
The copying of the sub-face data into the sub-face stream is done by calculating
Figure 3.5: The continuity of the solution across mesh blocks is enforced by computing penalty terms using the SAT approach[16]. The fact that the connectivity between blocks is unstructured creates special difficulty. On this figure, for each node on the faces of the blue block, one must identify the face of one of the green blocks from which the penalty terms are to be computed. In this case, the left face of the blue block intersects the faces of four distinct green blocks. This leads to the creation of 4 sub-faces on the blue block. For each sub-face, penalty terms need to be computed. Note that some nodes may belong to several sub-faces.

Figure 3.6: To calculate the penalty terms efficiently for each sub-face, one first copies data from the 3D block into a smaller sub-face stream (shown on the right). In this figure, the block has 10 sub-faces. Assume that the largest sub-face can be stored in memory as a 2D rectangle of size nx × ny. In the case shown, the sub-face stream is then composed of 12 nx × ny rectangles, 2 of which are unused. Some of the space is occupied by real data (in blue); the rest is unused (shown in grey).

Figure 3.7: This figure shows the mapping from neighboring blocks to the neighbor stream used to process the penalty terms for the blue block. There are four large blocks surrounding the blue block (top and bottom not shown). They lead to the first 4 green rectangles. The other rectangles are formed by the two blocks in the front right and the four smaller blocks in the front left.
and storing the location in the full 3D stream from which each fragment in the sub-
face stream will gather. The copying of the data from other blocks into the neighbor
stream is done by pre-computing and storing the block number and the location
within that block from which each fragment in the sub-face stream gathers. This
kernel requires multiple blocks as input and must branch to gather from the correct
block. This is illustrated by the pseudo-code below, which can be implemented in
Brook:

kernel void buildNeighborStream( float block1[][],
                                 float block2[][],
                                 float3 donor_list<>,
                                 out float penalty_data<> ) {
    float  block        = donor_list.x;
    float2 gather_coord = donor_list.yz;
    if      (block == 1) penalty_data = block1[ gather_coord ];
    else if (block == 2) penalty_data = block2[ gather_coord ];
    ...
}
An important point to make is that this method automatically handles the case
of intersecting sub-faces (such as at edges and corners) where multiple boundary
conditions and penalty terms need to be applied. In that respect, this approach leads
to a significantly simpler code.
3.6 Results
The performance scaling of the code with block size is examined followed by an
investigation of the performance of each of the three main kinds of kernels. Then
the performance on meshes for complex geometries typical of realistic engineering
problems is examined. In all our tests, the CPU used was a single core of an Intel
Core 2 Duo E6600 (2.4 GHz, 4 MB L2 cache) and the GPU used was an NVIDIA
8800GTX (128 scalar processor cores at 1.35 GHz).
For all the results given below, a consistent accuracy compared to the original
single precision code in the range of 5 to 6 significant digits was observed, including
the converged solution for the hypersonic vehicle. This is the accuracy to be expected
since the GPU operates in single precision. This good behavior is partly a result
of considering the Euler equation. The Navier-Stokes equations for example often
require a very fine mesh near the boundary to resolve the boundary layer. In that
case, differences in mesh element sizes may result in loss of accuracy.
3.6.1 Performance scaling with block size
Figure 3.8 shows the scaling of performance and speedup with respect to the block
size. These tests were run on single block cube geometries with freestream boundary
conditions on all faces. As the data set becomes larger than the L2 cache, the CPU
slows down by a factor of about two. On the other hand, when the data set
increases, the GPU becomes much more efficient, improving by about a factor of 100.
The GPU doesn’t reach its peak efficiency until it is working on streams with at least
32,000 elements.
Figure 3.8: Performance scaling with block size, 1st order: microseconds per vertex for
the CPU and GPU (single grid and multigrid), and the corresponding GPU speed-up
(single grid and multigrid), as a function of the number of vertices.
The multigrid cycle used in these and following tests was a 2 level V cycle. In
principle, multigrid should be used with more than 2 levels but for the compressible
Euler equations, the presence of shocks limits the number of grids which can be
efficiently used to two. Since our goal is to model a hypersonic vehicle in which
shocks are present, 2 grids were used throughout this work even in cases where there
is no shock. In 3D, 2 levels require computing on a grid approximately 8 times smaller
than the original and it is known from the single grid results that small grids will
be slower than larger ones; consequently, one would expect the multigrid solver to
be somewhat slower than the single grid. This is indeed the case. For 512 vertices,
multi-grid is about twice as slow. For larger grids, the performance of multi-grid
generally follows that of the single grid results but is slightly slower.
3.6.2 Performance of the three main kernel types
The three main different types of kernels have different performance characteristics
which will be examined here. For pointwise kernels, the inviscid flux kernel is con-
sidered; stencil kernels will be represented by the residual calculation (differencing
of the fluxes), and kernels with unstructured gathers by the boundary and penalty
terms calculation. Reduction kernels are not examined since they have been studied
elsewhere[13] and these kernels are less than one percent of the total runtime.
Figure 3.9 shows that the inviscid flux kernel scales similarly to the overall program
(figure 3.8) although with a more marked increase at the largest size. This kernel has
an approximately 1:1 ratio of flops to bytes loaded which suggests that it is still
limited by the maximum memory bandwidth of the card. Indeed, the largest mesh
achieved a bandwidth of 78 Gbytes/sec which is nearly the theoretical peak of the
card. The achievable memory bandwidth depends not only on the size of the data
stream, but also its shape. The second largest mesh has an x-dimension that is
divisible by 16, whereas the largest mesh has an x-dimension divisible by 128. This
is the likely reason for the variations between these two stream sizes.
The second type of kernel, the stencil computation, also follows the same basic
scaling pattern as the timings in Figure 3.8 (Figure 3.9). This particular kernel loads 5
bytes for every one (useful) flop it performs. This very poor ratio is due to loading the
differencing coefficients as well as some values which are never used – a byproduct
of the way some data is packed into float3s and float4s. Nonetheless, a very high
bandwidth for this type of kernel is achieved. The bandwidth is in fact higher than
the memory bandwidth of the card! This is possible because the 2D locality of the data
access allows the cache to be utilized very efficiently. The stencil coefficients, a
total of sixteen values, are also almost certainly kept in the cache.

Figure 3.9: Left: pointwise kernel performance (the inviscid flux calculation); right:
stencil kernel performance (the 3rd order residual calculation). Both plots show the
achieved GFlops and bandwidth (Gbytes/sec) as a function of the number of vertices.

Figure 3.10: Unstructured gather performance (boundary conditions and penalty terms
calculation): microseconds per vertex for the CPU and GPU, and the GPU speedup, as
a function of the number of vertices. The decrease in speed-up is due to an unavoidable
O(n³) vs. O(n²) algorithmic difference in one of the kernels that make up the boundary
calculations. See the discussion in the text.
The final type of kernel is the unstructured gather, which accounts for 3 of the 5 kernels
that make up the boundary and penalty term calculation. Its performance
and scaling can be seen in Figure 3.10. Startlingly, this routine does not see increased
efficiency with larger blocks and the speedup vs. the CPU actually decreases after
a point. To explain this, each of the individual kernels is examined. The first two
are unstructured gather kernels that copy data to the sub-face streams and they run,
as would be expected, at approximately the random memory bandwidth of the card
(∼ 10 GBytes/sec). The next two are pointwise calculations for the penalty terms
which behave much like the inviscid flux kernel. The last kernel applies the calculated
penalties to the volumetric data, which as mentioned above implies an implicit loop
over the entire volume even though one only wishes to apply the penalties to the
boundaries. This is unavoidable because of the inability in Brook to scatter outputs
to arbitrary memory locations. Even though most of the fragments do little work
other than determining if they are a boundary point or not, as the block grows the
ratio of interior to surface points increases and the overhead of all the interior points
determining their location slows the overall computation down. In practice however,
it is unlikely that the size of a given block will be larger than 2 million elements; so
in most practical situations, one is in the region where the GPU speed-up is large.
Figure 3.11: Three block C-mesh around the NACA 0012 airfoil.
Figure 3.12: Mach number around the NACA 0012 airfoil, M∞ = 0.63, α = 2.
3.6.3 Performance on real meshes
The NACA 0012 airfoil (from the National Advisory Committee for Aeronautics) is a
symmetric, 12% thick airfoil that is a standard test case geometry for computational
fluid dynamics codes. Figure 3.11 shows the mesh with three blocks used for this
simulation (C-mesh topology) and Figure 3.12 shows the Mach number around the
airfoil.
The CPU code was compiled with the following options using the Intel Fortran
compiler version 10: -O2 -tpp7 -axWP -ipo.
Table 3.1 shows the speed-ups for the NACA 0012 airfoil test case. As expected, the
speed-ups with multigrid are lower than with a single grid because the computations on
the coarser grids are not as efficient. However, over an order of magnitude reduction
in computation time is still achieved.
For our final calculations, the hypersonic vehicle configuration from Marta and
Table 3.1: Measured speed-ups for the NACA 0012 airfoil computation.
Order       Multigrid cycle   Speed-up
1st order   single grid       17.6
3rd order   single grid       15.1
1st order   2 grids           15.6
3rd order   2 grids           14.0
Alonso[64] was used. This is representative of a typical mesh used in the external
aerodynamic analysis of aerospace vehicles. It is a 15 block mesh; two versions were
used with approximately 720,000 and 1.5 million nodes. Because the blocks are
processed sequentially on the GPU, an important consideration is not only the overall
mesh size but the sizes of individual blocks. For the 1.5 million node mesh, the
approximate average block size is 100,000 nodes, with a minimum of 10,000 and a
maximum of 200,000 nodes. Figure 3.13 shows the Mach number on the surface of
the vehicle and the symmetry plane for a Mach 5 freestream.
Figure 3.13: Mach number – side and back views of the hypersonic vehicle.
In Table 3.2, one can see the same general trend for speed-ups as the problem size
and multigrid cycle are varied. Beyond just the pure speed-up, it’s also important
to note the practical impact of the shortened computational time. For example, a
converged solution for the 1.5M node mesh using a 2-grid multigrid cycle requires
approximately 4 CPU hours, but only about 15 minutes on one GPU!
Table 3.2: Speed-ups for the hypersonic vehicle computation
Mesh size   Multigrid cycle   Speed-up
720k        single grid       15.4
720k        2 grids           11.2
1.5M        single grid       20.2
1.5M        2 grids           15.8
3.7 Conclusion
Measured speed-ups range from 15x to over 40x. To demonstrate the capabilities
of the code a hypersonic vehicle in cruise at Mach 5 was simulated – something
out of the reach of most previous fluid simulation works on GPUs. The three main
types of kernels necessary for solving PDEs were presented and their performance
characteristics analyzed. Suggestions to reduce branch incoherency due to stencils
that vary at the boundaries were made. A novel technique to handle the complications
created by the boundary conditions and the unstructured multi-block nature of the
mesh was also developed.
Additional analysis has identified further ways in which the performance can be
improved. Performance on small blocks is lackluster and, unfortunately, with meshes
around realistic geometries, small blocks often cannot be avoided. By grouping all the
blocks into a single large texture, this problem could be avoided at the cost of increased
indexing difficulties. Also, NVIDIA’s new language, CUDA, offers some interesting
possibilities. It has an extremely fast memory shared by a group of processors, the
“parallel data cache”, which could be used to increase the memory bandwidth of the
stencil calculations even further. Scatter operations are also supported which means
that the application of the penalty terms (at block interfaces) could scale with the
number of surface vertices instead of the total number of vertices.
An important demonstration would be the use of a parallel computer with GPUs
for fluid dynamics simulations. This would establish the performance in a realistic
engineering setting. It will impose some interesting difficulties because while nodes
will be on the order of 10× faster, the network speeds and latencies will not have
changed and might be a bottleneck.
While the exact direction of future CPU developments is impossible to predict,
it seems very likely that they will incorporate many light computational cores very
similar in nature to the fragment shaders of current GPUs. The techniques presented
here should thus be applicable to the general purpose processors of tomorrow.
3.8 CELL Experiences
3.8.1 Amdahl’s Revenge
As discussed in section 1.3.4 the PPE on the CELL is significantly slower than the
normal PowerPC processor it is based on. In fact, for NSSUS, it is approximately 10× slower! Of course, the SPEs can be significantly faster. This leads to a new type of
Amdahl's Law on the CELL. Instead of

\[ \text{Speedup} = \frac{1}{(1 - P) + \frac{P}{S}} \]

where P is the parallel portion of the code and S is the speedup on that portion,
this yields a new law

\[ \text{Speedup} = \frac{1}{A\,(1 - P) + \frac{P}{S}} \]
where A is the slowdown of the PPE. Plotting this new law with A = 10 alongside
Amdahl’s original law as in figure 3.14 and figure 3.15 shows the dramatic impact
that this can have. For P < .9 it is impossible to achieve a speedup regardless of how
large S is. Even when P = .995, when only .5% of the code is still serial, figure 3.16
shows the speedup is four times less than what it could be if the PPE weren’t slower
than a normal processor. Note that on all of these plots, the maximum plotted S is
100. Since there are eight SPEs, each one would need to be over 12× as fast as a normal
processor to reach this speedup. In fact, their peak performance is 25.6 GFLOPs,
which is comparable to the performance of a normal processor. The next section
Figure 3.14: Amdahl's Law (A = 1) vs. CBE (A = 10) for P = .5 and P = .8; overall speedup plotted against the speedup of the parallel portion (S).

Figure 3.15: Amdahl's Law (A = 1) vs. CBE (A = 10) for P = .9 and P = .995; overall speedup plotted against the speedup of the parallel portion (S).
will show that the maximum speedup obtained was 10×; even with P = .995, this S
limits the overall speedup to 6.7×, significantly worse than for the GPU. For NSSUS,
there are about 220 routines that make up the last 0.5 percent of the runtime, so a
P value of .995 can be taken as a realistic maximum.
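To make the numbers above concrete, the following minimal C++ sketch (an illustration only, not part of NSSUS) evaluates both the classical Amdahl bound and the PPE-penalized version, using the A = 10 slowdown and the S = 10 kernel speedup quoted in this chapter.

#include <cstdio>

// Classical Amdahl speedup and the CELL variant with the serial portion slowed by A.
double amdahl(double P, double S)               { return 1.0 / ((1.0 - P) + P / S); }
double amdahl_cbe(double P, double S, double A) { return 1.0 / (A * (1.0 - P) + P / S); }

int main() {
    const double A = 10.0;                 // PPE slowdown measured for NSSUS
    const double S = 10.0;                 // kernel speedup reported in section 3.8.2
    const double Ps[] = {0.9, 0.995};
    for (double P : Ps)
        std::printf("P=%.3f  Amdahl=%.2f  CBE=%.2f\n",
                    P, amdahl(P, S), amdahl_cbe(P, S, A));
    // For P = 0.995 and S = 10 this prints a CBE speedup of about 6.7x, as quoted above.
    return 0;
}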
3.8.2 Implementation
The main type of computation considered here is that of the stencil type (see section
3.5.1). The generic problem considered is to apply a 3D stencil over a volume, with
the complication that it is not simply a stencil over one field, or array, but a stencil
over multiple fields, or arrays. For example, $f_i \neq f(a_{i-1}, a_i, a_{i+1})$ but rather $f_i = f(a_{i-1}, a_i, a_{i+1}, b_{i-1}, b_i, b_{i+1})$. In the actual NSSUS code the number of fields is thirteen
for the inviscid flux, artificial dissipation and inviscid residual!
Figure 3.16: Ratio of Amdahl's Law speedup to CBE speedup as a function of the speedup of the parallel portion (S), for P = .9 and P = .995.
The first factor to consider in implementing these kernels is what data must be
on an SPE to do the computation, keeping in mind that there is only 256 KB for the
data and the program code, and that, furthermore, the space for data has to be
divided into at least two separate buffers so that communication can be overlapped
with computation. Even worse, “halo” data must be brought in, which
means that the actual amount of data computed is smaller than the amount of
data brought in.
There is a competing constraint between performance and code size which exists
to some extent on all systems. On most modern systems however, code size is not
an issue and code is optimized for maximum speed. Code size most definitely is an
issue on the CELL because of the limited space on the SPE. But, to obtain maximum
performance, loops must be aggressively unrolled to eliminate branches and allow for
maximum instruction reordering by the compiler to eliminate data dependencies (the
SPEs do not have out-of-order capability; see section 1.3.4). In the case of NSSUS, the three
kernels (inviscid flux, artificial dissipation and inviscid residual) have a code size of
approximately 80 KB when loops are unrolled enough to achieve reasonable performance!
This leaves only 176 KB for data.
Each kernel $k_i$ has a different set of required arrays $A_{k_i}$. If $A_{k_i} \cap A_{k_j} = \emptyset$, then
there is no benefit, in terms of memory reuse, in trying to run $k_i$ and $k_j$ without
initiating new DMAs. However, often the overlap is significant and re-DMAing most
of the arrays would be a waste of memory bandwidth. The arrays needed for the
three kernels mentioned above are nearly identical.

Figure 3.17: Cell memory bandwidth treating each SPE as an independent co-processor; all of the SPEs share the single 25.6 GB/sec link to main memory.
One possible space saving method is to buffer the kernel code so that instead of
all the kernel codes taking up space on the SPE even when they aren’t being used,
only the kernel currently running is on the SPE while the kernel that is going to
be run next is being DMA’d in. When the current kernel is finished executing it
jumps to wherever the beginning of the next kernel was placed. There are a couple
of downsides to this.
1. It is complicated and error prone to actually implement. Debugging is difficult.
2. The size of the buffer must be the size of the largest kernel. If most of the
kernels are small and one is very large, the space savings may not actually be
significant.
3. Bandwidth that could be used for data is now instead being used for program
code.
Due to all of these considerations, this optimization was not attempted.
A second factor to consider is how to get data to the SPEs for them to perform
computation. The CELL has only 25.6 GB/sec of bandwidth from main memory to
all of the SPEs. The most straightforward use of the SPEs as eight independent
and homogeneous processors all simultaneously executing the same kernel leads to a
diagram like figure 3.17, in which all the SPEs are competing for this 25.6 GB/sec. This
bandwidth itself is already significantly lower than that of GPUs, which was about
100 GB/sec for the model used in the above work and is 160 GB/sec in the most
recent models. Therefore, this computational approach may lead to performance on
the CELL which is significantly lower than that of GPUs.

Figure 3.18: Cell memory bandwidth viewing each SPE as a step in a pipeline; 25.6 GB/sec to main memory and roughly 200 GB/sec in total around the ring.

Here, each
SPE would bring in one 3D block of all the necessary arrays, do the computation on
them while fetching the next block (buffering) and then start writing the results back
while getting the next block and beginning computation on the just fetched block.
The blocks are totally independent of one another.
The bandwidth around the element interconnect bus (EIB), which connects the
SPEs, can be significantly higher, approximately 200 GB/sec, but utilizing this bandwidth requires
algorithms that pass data between the SPEs. This requires an algorithmic complica-
tion - no longer can the SPEs be viewed as identical accelerators fetching data from
memory and writing the result back, but as steps in a pipeline that pass data to one
another, bringing data in at one side and eventually writing it back to main memory
at the other. This method addresses the problem of having multiple kernels taking
up space while only using one of them at a given time because each SPE would only
have one kernel. It has the major problem that, to make full use of the machine,
there must be exactly as many stages in the pipeline as there are SPEs: eight for the
CELL, or six for the CELL in the PlayStation 3 (PS3). In this case, a natural
breakdown of the computation into eight stages could not be found, ruling this
method out.
The solution that was found, which decreases the bandwidth needed from the SPEs
to main memory while retaining as much as possible the relative simplicity of the
homogeneous SPE approach, is a technique called circular buffering. The idea behind this method is
that instead of having each SPE process independent blocks, each SPE sweeps a
small rectangle in the x-y plane through the z-direction of the volume. This way it
is possible to reuse some of the data from the previous memory transfer in the next
calculation. A diagram of this procedure in 2D, for clarity, is in figure 3.19. As can be
seen in that figure, a problem that arises from this type of buffering is that arrays are
no longer completely contiguous and require some complicated indexing to address
the right location. If the computation were compute bound, then this extra math
would be a waste of possible computational power.
Another important advantage of circular buffering is that it reduces the amount
of space required for the buffer by two-thirds. Decreasing the size of the buffers
allows the size of the sub-block to be increased which is important for maximizing
the compute/bandwidth ratio.
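The following minimal host-side C++ sketch illustrates the indexing that this scheme implies; it is an illustration of the idea only, and the actual SPE implementation details (DMA setup, buffering of transfers, alignment padding) are omitted.

#include <cstddef>
#include <vector>

// An SPE sweeps a small x-y rectangle through z.  Only the newest z-plane must be
// transferred at each step; older planes stay resident and are addressed modulo the
// buffer depth, which is the "complicated indexing" mentioned above.
struct CircularZBuffer {
    int sx, sy, depth;                       // sub-block extent and number of resident z-planes
    std::vector<float> data;

    CircularZBuffer(int sx_, int sy_, int depth_)
        : sx(sx_), sy(sy_), depth(depth_),
          data(static_cast<std::size_t>(sx_) * sy_ * depth_) {}

    // Global plane z lives in slot z % depth; valid while z is one of the
    // 'depth' most recently loaded planes.
    float& at(int x, int y, int z) {
        const int slot = z % depth;
        return data[(static_cast<std::size_t>(slot) * sy + y) * sx + x];
    }
};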
Finally, some of the finer technical points regarding how memory DMAs on the
CELL work are considered. All memory transfers must be 16 byte aligned and transfer
128 bytes for maximum performance (32 single precision numbers). It is impossible
to start a transfer that is not 16 byte aligned; transfers can be smaller than
128 bytes, but performance will suffer by approximately the ratio of the actual
number of bytes to 128 bytes. These facts are important because of how a 3D array is
stored in main memory and how this dictates the way a small sub-block must be transferred.
The formula for determining the linear offset of a location in a 3D array is given by:

    linearOffset = sizeX * sizeY * zCoord + sizeX * yCoord + xCoord

The extent of a sub-block's x-coordinates will be (by definition) less than sizeX
so that there will be a jump in the linear offset each time the y-coordinate or z-
coordinate of the sub-block changes. Therefore, each sub-block (of size [sx, sy, sz])
cannot be transferred as one continuous memory copy, but as sy × sz copies, each of
length sx. The optimization problem is then: minimize the amount of time it takes to
transfer all of the sub-blocks, subject to the constraints that each sub-block can only
have 1000 floats (due to the space restrictions) and that the x size must be a multiple of
four (due to the alignment constraint). Long sub-blocks (in the x-direction) will transfer
more quickly because of the quirks of the CELL hardware, but more will need to be
transferred because of the poor surface-to-volume ratio. The solution of this particular
problem is a sub-block of size 16 × 10 × 6. This solution is very specific to this
application, although the technique for arriving at it is general. The z-direction is
chosen to be only six because that is the direction of the circular buffering and there is
no benefit to making it larger than the minimum necessary, because circular buffering
ensures no extra data is transferred in any case.

Figure 3.19: Circular buffering (2D diagram). The labels distinguish the sub-block size with halo, the size of the computed data, the data that is already complete, the data currently being DMA'd, and the portion that must be DMA'd for the next sub-block.
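As an illustration of this kind of search, the following C++ sketch enumerates candidate sub-block sizes under the two constraints quoted above. The cost model is a deliberate simplification (each of the sy × sz row transfers is charged a fraction of a full-width DMA, with a one-cell halo assumed), so it demonstrates the trade-off rather than reproducing the exact 16 × 10 × 6 choice made for NSSUS.

#include <cstdio>
#include <limits>

int main() {
    const int max_floats = 1000;   // space restriction quoted above
    double best_cost = std::numeric_limits<double>::max();
    int bx = 0, by = 0, bz = 0;

    for (int sx = 4; sx <= max_floats; sx += 4)              // x extent a multiple of four
        for (int sy = 1; sx * sy <= max_floats; ++sy)
            for (int sz = 1; sx * sy * sz <= max_floats; ++sz) {
                // Rows transferred (assumed one-cell halo in y and z); short rows still
                // pay for a large fraction of a 128-byte transfer.
                const double rows     = (sy + 2) * (sz + 2);
                const double row_len  = 4.0 * (sx + 2);       // bytes per row, halo included
                const double row_cost = (row_len < 128.0) ? 1.0 : row_len / 128.0;
                const int    interior = sx * sy * sz;         // points actually computed
                const double cost     = rows * row_cost / interior;
                if (cost < best_cost) { best_cost = cost; bx = sx; by = sy; bz = sz; }
            }
    std::printf("cheapest sub-block under this toy model: %d x %d x %d\n", bx, by, bz);
    return 0;
}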
The final performance figure, employing these optimization techniques along with
others related mostly to data alignment for DMA transfers and SIMD operations, is
a bandwidth-limited 10× speedup on the three aforementioned kernels combined.
However, these kernels make up only eighty percent of the program runtime. As
section 3.8.1 showed, this is not enough to achieve an overall program speedup. Even
the speedup achieved on the kernels alone is smaller than what can be achieved using
GPUs. This reality, coupled with the greater difficulty of developing software on the
CELL, led to the abandonment of this line of research in favor of GPUs.
The one advantage of the CELL over GPUs at the beginning of this research was
the ability to perform double precision operations. GPUs have since closed this gap
and have double precision performance approximately equal to that of the CELL,
while having, in general, significantly better single precision performance.
Chapter 4
Liszt - A Domain Specific
Language for Writing Codes on
Unstructured Meshes
4.1 Introduction
The last chapters have shown that a great deal of work and thought goes into achieving
maximum performance with these accelerator cards. In chapter 3, the same physics
and numerics were implemented on two different architectures which required two
very different implementation strategies. This is non-optimal from a number of per-
spectives:
• The amount of time spent writing and debugging code is approximately linearly
proportional to the number of architectures that must be supported.
• The total amount of code is approximately linearly proportional to the number
of architectures that must be supported, which means maintenance times and
costs are also linearly proportional to the number of architectures.
• The programmer must become adept at writing and optimizing code for multiple
architectures
There is clearly some possibility here to generalize how code is written so that it need
only be written once. What other benefits could this bring? And what would some
features of the resulting language need to be?
Ideally, one would write code once, in a Domain Specific Language (DSL), that
could be compiled to and optimized for multiple architectures. We have named this
DSL Liszt. First, we recognize that the accelerator cards are parallel in nature and
also that compilers which take serial code and parallelize it are difficult to write in
the best case and in the worst case impossible. For example, a compiler which would
transform the serial algorithm for solving tri-diagonal systems presented in Chapter
1 into the parallel version does not exist, and it is unclear how one would go about
writing such a thing. This places a restriction on Liszt - it must explicitly express
parallelism.
Parallelism is already present at a high level in NSSUS. It must be able to run on a
cluster of machines and as such performs domain decomposition and uses MPI to run
on multiple processors. The parallelism required to run on the accelerator cards is at
a lower level than this; Liszt should therefore also be able to handle the parallelism
at the cluster level. This includes parallel file i/o, domain decomposition, ghost
cell determination, ghost cell communication and parallel restart and visualization
output.
How should the language express the parallelism? The parallelism in a finite
difference code is over the vertices (in the sense that the calculation at every vertex
performs more or less the same operations), so a natural way to express this would be,
‘for all the vertices in the mesh, do this ’. Of course, other numerical techniques such
as finite volume, finite element and discontinuous Galerkin do not parallelize in this
way. They might involve operations over the faces or cells of the mesh instead. By
making the elements of a 3D (or 2D) mesh (vertex, edge, face and cell) primitives
in Liszt and allowing for groups of these elements, parallel loops over these groups
can be implemented.
Additionally, to allow Liszt to reason about the communication patterns and to
optimize not only the code but also the data structures themselves for each platform,
the layout of data in memory must be abstracted. Data should be represented on
the mesh in some fashion and then accessed through mesh primitives, allowing the
compiler freedom to determine how to physically layout the memory optimally for
each machine and program configuration.
The design of the Liszt language should be general enough that all of the main
techniques for solving PDEs on grids can be expressed while allowing enough flexibility
for most new algorithms to be developed. One of the main problems with the previous
work (see the next section) is that it is specialized for one specific numerical technique
and sometimes even for specific application areas.
Given this brief overview of the motivating ideas behind Liszt, first existing alter-
natives will be examined, followed by an in-depth description of the language along
with code samples for what finite difference (FD), finite volume (FV), Galerkin finite
element method (FEM) of both first and higher orders, and discontinuous Galerkin
(DG) methods would look like in Liszt. The goal of this chapter is not to describe the
functioning of Liszt down to the tiniest detail (even the syntax for some operations
is not necessarily finalized); the project is ongoing and many of these details are still
changing. The aim is to describe the higher level concepts of the language, which
will not change, and show that all major methods of solving PDEs can be expressed
cleanly and efficiently using these constructs.
4.2 Previous Work
Sundance [58] is a framework of C++ classes, developed at Sandia National Labs,
that allows for rapid development of parallel FEM solvers by expressing at a high
level the weak formulation of a PDE and its discretization. Within these boundaries
it is very successful. Its main shortcoming is that its boundaries are too narrowly
defined; it is impossible to use for anything but FEM methods and extremely difficult
to experiment with numerics (discretizations and quadrature rules for example) that
have not been supplied by Sundance’s creators. It is not well suited for solving time
dependent problems. Also, a major limitation for mechanics codes in particular is
that it doesn’t support moving or deforming meshes. We believe it is possible to create
a more general language that also allows for more expressive power while retaining
some of the simplicity of Sundance’s approach.
Next, an example of a simple Sundance program, taken from the Sundance tuto-
rial [58], for solving a potential flow problem. Many of the ideas are similar to ideas
in Liszt, which will be presented later.

/*
 * Solves the Laplace equation for potential flow past an elliptical
 * post in a wind tunnel.
 */
int main(int argc, void** argv)
{
  try
  {
    Sundance::init(&argc, &argv);

    /* We will do our linear algebra using Epetra */
    VectorType<double> vecType = new EpetraVectorType();

    /* Create a mesh. It will be of type BasisSimplicialMesh, and
     * will be built using a PartitionedRectangleMesher. */
    MeshType meshType = new BasicSimplicialMeshType();
    MeshSource mesher
      = new ExodusNetCDFMeshReader("post.ncdf", meshType);
    Mesh mesh = mesher.getMesh();

At the end a mesh, of a specific type, is loaded (possibly in parallel).
    /* Create a cell filter that will identify the maximal cells
     * in the interior of the domain */
    CellFilter interior = new MaximalCellFilter();
    CellFilter boundary = new BoundaryCellFilter();
    CellFilter in = boundary.labeledSubset(1);
    CellFilter out = boundary.labeledSubset(2);

CellFilters are just collections of cells. Here they are used to specify the cells over which boundary conditions will be applied. This concept is similar to the Set concept in Liszt (which again will be described later).

    /* Create unknown and test functions, discretized using
     * first-order Lagrange interpolants */
    Expr phi = new UnknownFunction(new Lagrange(1), "u");
    Expr phiHat = new TestFunction(new Lagrange(1), "v");

This defines the unknown function we are solving for (phi) and the test function that we will multiply the equation by and then integrate, as in the standard weak formulation of a FE problem. Notice that the basis used to represent these functions is a predefined polynomial basis (Lagrange); if we wished to use a more exotic basis that didn't exist in Sundance, we would have to modify Sundance itself to add such a capability.

    /* Create differential operator and coordinate functions */
    Expr x = new CoordExpr(0);
    Expr dx = new Derivative(0);
    Expr dy = new Derivative(1);
    Expr dz = new Derivative(2);
    Expr grad = List(dx, dy, dz);

    /* We need a quadrature rule for doing the integrations */
    QuadratureFamily quad2 = new GaussianQuadrature(2);
    double L = 1.0;

Here a quadrature rule is defined in case some of the terms in the Integral below cannot be exactly integrated (in this case they can). Note that, again, the quadrature rule must be chosen from a set of predefined options.

    /* Define the weak form */
    Expr eqn = Integral(interior, (grad*phiHat)*(grad*phi), quad2)
             + Integral(in, phiHat*(x-phi)/L, quad2);

This gives a symbolic representation of the weak formulation of the problem.

    /* Define the Dirichlet BC */
    Expr bc = EssentialBC(out, phiHat*phi/L, quad2);

Here, because the Dirichlet boundary conditions give rise to a separate equation, they are defined separately.

    /* We can now set up the linear problem! */
    LinearProblem prob(mesh, eqn, bc, phiHat, phi, vecType);

The only difference between a linear and non-linear problem from the user's point of view is that a non-linear problem must also be supplied with an initial guess.

    /* Read the parameters for the linear solver from an XML file */
    ParameterXMLFileReader reader("../../tutorial/bicgstab.xml");
    ParameterList solverParams = reader.getParameters();
    LinearSolver<double> linSolver
      = LinearSolverBuilder::createSolver(solverParams);

    /* solve the problem */
    Expr soln = prob.solve(linSolver);

A variety of linear and non-linear solvers are available.

    /* Project the velocity onto a discrete space for visualization */
    DiscreteSpace discreteSpace(mesh,
                                List(new Lagrange(1),
                                     new Lagrange(1),
                                     new Lagrange(1)), vecType);
    L2Projector projector(discreteSpace, grad*soln);
    Expr velocity = projector.project();

    /* Write the field in VTK format */
    FieldWriter w = new VTKWriter("Post3d");
    w.addMesh(mesh);
    w.addField("phi", new ExprFieldWrapper(soln[0]));
    w.addField("ux", new ExprFieldWrapper(velocity[0]));
    w.addField("uy", new ExprFieldWrapper(velocity[1]));
    w.addField("uz", new ExprFieldWrapper(velocity[2]));
    w.write();

It supports outputting data in the VTK file format, which can be used by programs such as Paraview [1] for visualization.

  }
  catch (exception& e)
  {
    Sundance::handleException(e);
  }
  Sundance::finalize();
}

Just some boilerplate to wrap things up nicely.
Sundance has many desirable features that Liszt seeks to incorporate. It makes
writing parallel code very similar to writing serial code, parallel mesh loading and vi-
sualization are handled automatically and a CellFilter or Set concept is used to group
cells on which similar computations are performed. However, its downsides (it is FEM
specific, lacks easy extensibility, is not suitable for time-dependent problems, has difficulty
dealing with moving meshes, and offers no clear way to take advantage of accelerator cards)
lead us to seek a more general and powerful solution.
SIERRA [26] is another framework also developed at Sandia National Lab. It
is not classified, but public information about it is scarce and it is not open-source.
Its objectives are similar to Liszt’s. It recognizes that a great deal of infrastruc-
ture related to mesh decomposition, parallelization, communication and mesh i/o are
essentially common across a great deal of mesh-based PDE solvers and this common-
ality should be leveraged. It does not attempt to take advantage of accelerator cards.
Information about its exact capabilities must be inferred from one of the publicly
available documents that shows it being used for large multi-physics simulations with
different but overlapping meshes - implying its capabilities are quite advanced.
Unfortunately, due to its relatively secret nature a close examination of it is not
possible. It is possible that Liszt may in many ways duplicate functionality present
in SIERRA. However, conversations with people who have seen it being used suggest
it is not the most user-friendly environment to work with. It also lacks the idea of
re-targeting to multiple accelerator architectures because its concept of parallelism is
at the domain decomposition level and not lower.
OpenFOAM [45] is a set of C++ classes which can be used to create PDE solvers.
Essentially the only supported numerical method is 2nd order Finite Volume. The
equations to be solved are represented symbolically and then the method of dis-
cretization for each term is chosen from a list. It provides a fairly complete set of pre
and post processing utilities and supports a moving mesh. Most of its capabilities
are geared toward writing fluid solvers, although others have been written (Electro-
magnetics, Solid Mechanics and Finance). Its main drawback is, again, the inability
to significantly alter its numerics.
ParFUM (Parallel Framework for Unstructured Meshing) [52] is a library devel-
oped at the University of Illinois at Urbana-Champaign on top of their Charm++
framework. Its goals are similar to the aforementioned solutions and Liszt’s. It is more
general than OpenFOAM and Sundance in that it does not target one specific nu-
merical method. It supports mesh-refinement and can take advantage of Charm++’s
dynamic (run-time) load balancing. However, it still requires the programmer to
manually describe and register ghost cells and trigger their update. It currently lacks
implicit solver support. It can support arbitrary cell shapes, but does not provide
for arbitrary connectivity relations and requires that all nodes have a position in
space. Because it is a library whose code is not parsed and analyzed like a language,
it cannot support retargeting one code to multiple architectures.
4.3 Language
4.3.1 Flow
A typical run of a Liszt program would follow these steps:
1. Load Configuration Files
(a) Determines which kernels will be used this particular run
(b) Specifies a particular mesh
(c) Possibly specifies a hardware configuration (e.g. which accelerator card to
use, how many nodes, etc.)
2. Load/Generate Sets to be used during computation
(a) Boundary Condition Sets
(b) Sets for Line Searches
(c) ...
3. Compiler Generates Optimized, Machine Specific Code
4. Solver Runs
(a) Parallel Mesh I/O
(b) Parallel Domain Decomposition
(c) Solver Loop
(d) Parallel Visualization and Restart Output
This technique of generating code at runtime is known as Just-In-Time (JIT)
Compilation [6]. For this to be advantageous the assumption is that the amount of
time spent in the initialization and code generation phase is small compared to the
time spent in the solving phase. For large scale scientific calculations this is likely to
be the case, since solves often take on the order of hours or even days.
4.3.2 Language Components
A sample Liszt fragment looks like:

Field<Vertex, double3> pos = ...;   // load vertex positions
SparseMatrix<Vertex, Vertex> A;
forall (Cell c in mesh.cells()) {
  double3 cellCenter = center(c);
  forall (Face f in c.faces()) {
    double3 face_dx = center(f) - cellCenter;
    // note that the following loop is parallel
    // the CCW implies the orientation of the edges
    // not their ordering
    forall (Edge e in f.edgesCCW(c)) {
      vertex v0 = e.tail();
      vertex v1 = e.head();
      double3 v0_dx = pos(v0) - cellCenter;
      double3 v1_dx = pos(v1) - cellCenter;
      double3 face_normal = v0_dx.cross(v1_dx);
      // calculate flux for face
      DOF d0 = v0.getDOF();   // code to place the DOF
      DOF d1 = v1.getDOF();   // not shown in this snippet
      A[d0][d1] += ...
      A[d1][d0] -= ...
    }
  }
}
Mesh – Liszt includes the interface to the mesh as part of the language, providing ob-
jects for vertices, edges, faces, and cells, along with a full set of topological func-
tions such as mesh.cells() (the set of all cells in the mesh) or f.edgesCCW()
(the edges of face f oriented counter clockwise around the face). This mesh
interface is known to the compiler, so it is able to reason about how best to split
up the mesh topology across many processors given a particular application.
The general case is supported by the facet-edge [24] data structure, although in
practice it is expected that most of the actual use cases will involve a very small
subset of possible topological relations. This leads to possible optimization
opportunities discussed later.
Note that position is not an inherent property of the mesh. Mesh contains only
connectivity information - it is really just a graph. Position is simply treated
as a field that is associated with vertices. This allows for more generality by
allowing Liszt to possibly be useful for general problems on graphs that aren’t
necessarily derived from a partitioning of space.
Sets/Lists – Are simple collections of one type of mesh primitive. They can be user de-
fined, as in the case of defining a region over which a certain boundary condition
is applied or a line search is to be performed. In addition to being defined explic-
itly, they can also be implicitly defined. Statements such as mesh.cells() also
return a set. More generally, imagine that with a fourth order finite difference
scheme, a stencil with a width of 2 neighbors to either side is needed. That is
to implement:
\[ \nabla^2 T = \frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2}
= \frac{1}{12h^2}\left( -\left(T_{i,j+2} + T_{i,j-2} + T_{i+2,j} + T_{i-2,j}\right)
+ 16\left(T_{i,j+1} + T_{i,j-1} + T_{i+1,j} + T_{i-1,j}\right) - 60\,T_{i,j} \right) + O(h^4) \]
One could write something like the following code:

forall v in mesh.vertices() {
  double sum = -60 * T[v];
  forall v1 in v.vertices() {
    sum += 16 * T[v1];
    vec3 dir1 = pos[v] - pos[v1];
    forall v2 in v1.vertices() {
      vec3 dir2 = pos[v1] - pos[v2];
      if (dir1 == dir2)
        sum -= T[v2];
    }
  }
  T[v] = T[v] + delta_t * sum / (12 * h * h);
}
But when writing a finite difference algorithm such as this, one expects a carte-
sian mesh, so the language provides an improved way of writing the same code.

assert(MeshType == Cartesian)   // check performed at runtime
// 2D example
forall (v in mesh.vertices()) {
  double sum = -60 * T[v];
  // built in function returning the set of 1st level halos
  forall (v1 in FDhalo(1, v))
    sum += 16 * T[v1];
  // built in function returning the set of 2nd level halos
  forall (v2 in FDhalo(2, v))
    sum -= T[v2];
  T[v] = T[v] + delta_t * sum / (12 * h * h);
}
This has clear advantages simply from a code clarity point of view. But more
importantly, it allows the compiler to make optimizations that might not oth-
erwise be possible. The code says it expects the mesh to be Cartesian, allowing
certain fast data structures to be used for storing and accessing the mesh. Fur-
thermore, the built in List FDhalo is used to specify the neighbors for the vertex.
This builtin function will be translated by the compiler into much more efficient
code than the example given above.
Lists are the same as Sets except they imply an ordering. This allows them to be
randomly accessed using the [ ] operator. Because of this additional constraint,
Lists should be used sparingly, but some algorithms are best expressed using
this construct (see the high order FEM example in section 4.4). In fact, should
one need different weights for vertices in the same halo level, the code could be
written as follows, taking advantage of the fact that FDhalo returns a List and
not a Set. // in s t ead o f t h i s
f o ra l l ( v1 in FDhalo (1 , v ) )
sum += 16 ∗ T[ v1 ] ;
// t h i s
vec4 weights = 13 , 14 , 15 , 16 ;
// the orde r ing o f FDhalo i s always :
//1 s t d i r e c t i o n , i n c r e a s i n g coord inate value
// then 2nd d i r e c t i o n , i n c r e a s i n g coord inate value , e t c .
CHAPTER 4. LISZT 84
// which p h y s i c a l d i r e c t i o n corresponds to the ” f i r s t ” d i r e c t i o n
// can be t e s t e d for , s i n c e i t the re i s no guarantee that the
//mesh axes are a l i gned with the c a r t e s i a n g r id
// f o r ex : (−1 , 0) (1 , 0) (0 , −1) (0 , 1)
Li s t<vertex> v e r t s = FDhalo (1 , v ) ;
vec4 temp =
T[ v e r t s [ 0 ] ] , T[ v e r t s [ 1 ] ] , T[ v e r t s [ 2 ] ] , T[ v e r t s [ 3 ] ] sum += weights . dot ( temp ) ;
Degrees of Freedom – Many higher order methods require more information than
can be stored at only the vertices, edges and faces of a cell. To accommodate
such methods degrees of freedom (DOF) are allowed to be placed at vertices
and in edges, faces and cells and also, importantly, at the following pairs: (face,
edge), (cell, edge), (cell, face). This is important because two cells that share
the same face might, for instance, be of different orders and need different DOF
on different sides of the same face.
Fields – Data can be stored at any of the mesh primitive types and also at DOF.
Additionally, data storage on the mesh is supported through fields which are
accessed through mesh elements rather than integer indices. This allows the
Liszt compiler to reason about what mesh relationships and data access patterns
are being used.
Sparse Matrices – Sparse Matrices are two dimensional and relate DOF to DOF.
The non-zero entries are determined by analyzing the kernels to determine which
entries are written to; they do not need to be specified or declared. Again, this
leads to some optimization possibilities that will be discussed later.
Solvers – Both linear and non-linear solvers will be provided. At the current state
of the project, existing solver packages, such as Trilinos [36] from Sandia Labs, will
be used to avoid ”re-inventing the wheel.” However, this does incur a cost in
translating data from Liszt’s internal format into whatever format the solver
packages use. Attempting to use the solver packages' internal formats within Liszt
could prevent many optimizations. In the long run, for maximum performance,
it is likely that solvers will eventually be written directly as a component of
Liszt.
Liszt abstracts the representation of commonly used objects to allow for architec-
ture specific optimizations. For instance, 3D vectors with dot and cross products are
included, allowing the compiler to implement them using SIMD when available. In
order to retarget Liszt code to many different architectures, we make a key domain
specific assumption: the computation is local to a particular piece of mesh topology.
For instance, an operation performed on a particular cell will only need data about a
limited number of neighboring values on the mesh.
While a whole range of optimizations are possible, five specific optimizations have
been chosen to focus on for the initial implementation.
Optimal Domain Decomposition – ParMETIS [83] is used for domain decompo-
sition. ParMETIS takes a graph of vertices and weighted edges and decomposes
it into a set of domains that minimize the cost of broken edges. Liszt uses its
knowledge of data access patterns to correctly weight the edges of the graph.
Liszt can also correctly determine what geometric primitive should be used as
the vertices of the graph. For example if the only parallel loops are over the
vertices of the mesh, forall(mesh.vertices()), the vertices should be parti-
tioned; if the parallel loops are over the cells, forall(mesh.cells()), the cells
should be partitioned. If cells are partitioned the ownership of the vertices,
edges and faces of the cell are determined using an algorithm that guarantees
on average each partition will have an equal number of each.
In the case where more than one kind of geometric primitive is used in an outer
loop, the compiler will partition the cells. If the pattern of neighbor access
changes, Liszt will automatically change the graph that is input to ParMETIS.
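As a hedged sketch of what the compiler would assemble, the C++ fragment below builds the CSR-style xadj/adjncy/adjwgt arrays that METIS-family partitioners take as input, with edge weights taken from how often each neighbor relation is accessed; the actual ParMETIS call and the access-counting machinery are omitted, and all names here are illustrative rather than the Liszt implementation.

#include <cstddef>
#include <vector>

struct PartitionGraph {
    std::vector<int> xadj;     // offsets into adjncy, one entry per graph vertex plus one
    std::vector<int> adjncy;   // neighbor ids
    std::vector<int> adjwgt;   // edge weights (communication cost if the edge is cut)
};

PartitionGraph build_graph(const std::vector<std::vector<int>>& neighbors,
                           const std::vector<std::vector<int>>& access_count) {
    PartitionGraph g;
    g.xadj.push_back(0);
    for (std::size_t v = 0; v < neighbors.size(); ++v) {
        for (std::size_t j = 0; j < neighbors[v].size(); ++j) {
            g.adjncy.push_back(neighbors[v][j]);
            g.adjwgt.push_back(access_count[v][j]);   // heavier edges are less likely to be cut
        }
        g.xadj.push_back(static_cast<int>(g.adjncy.size()));
    }
    return g;
}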
Ghost Cells – The same analysis used to determine the correct weighting of the
edges of the ParMETIS graph can also be used to determine what field values
need to be shared across domain boundaries. Liszt automatically determines
how many levels of ghost cells/faces are needed and will make sure that whenever
values are accessed they are up-to-date. This adaptability is a key advancement
over current codes that fix the level of ghost cells/faces. For example, in a
large eddy simulation code, when one wishes to change the extent of a filter,
a major code rewrite is involved to handle the new amount of necessary ghost
information. In Liszt, this will be automatic. When running on a distributed
memory machine, Liszt takes care of all the necessary MPI calls.
Currently this is handled in Liszt by the generation (currently by hand, although
soon to be automatic) of auxiliary kernels which mimic the actual kernel's mem-
ory access patterns. Special functions are used in place of normal memory
accesses. These functions do not actually perform the memory fetch, but in-
stead record what memory fetch will take place, allowing Liszt to determine
when memory will need to be accessed that isn’t locally available and where
it resides. Liszt runs these kernels after domain decomposition but before the
main solver loop.
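The C++ sketch below illustrates the recording idea: instead of reading field data, a recording accessor logs which element ids a kernel would touch, and anything touched but not owned by the local partition is a ghost element that must be exchanged. All names are illustrative, not the actual Liszt runtime API.

#include <cstddef>
#include <set>
#include <vector>

struct RecordingField {
    std::set<int>* touched;
    // Mimics a field read: records the access and returns a dummy value.
    double operator()(int element_id) const { touched->insert(element_id); return 0.0; }
};

std::set<int> find_ghosts(const std::vector<int>& owned,
                          const std::vector<std::vector<int>>& neighbors_of_owned) {
    std::set<int> touched;
    RecordingField phi{&touched};
    // "Auxiliary kernel": same loop structure and neighbor accesses as the real
    // kernel, but every field access goes through the recording accessor.
    for (std::size_t i = 0; i < owned.size(); ++i)
        for (int nbr : neighbors_of_owned[i])
            phi(nbr);
    std::set<int> ghosts = touched;
    for (int id : owned) ghosts.erase(id);   // touched but not owned means a ghost element
    return ghosts;
}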
Mesh Representation – Because Liszt understands what mesh relationships are
necessary for the program it can choose the optimal way to represent the mesh.
A program that moves around the mesh in an advanced way might need the full
generality of a facet-edge mesh. Many scientific applications use relatively few
mesh relations, and a more limited but smaller and faster representation is pre-
ferred. In addition to understanding mesh relationships, Liszt also understands
the mesh itself. If a mesh happens to only consist of tetrahedra, Liszt would
identify this and in conjunction with what mesh relationships are used choose
an optimal representation. Even in a more complicated case where the mesh
consisted of mostly tetrahedra but a few more complex elements, Liszt could
also optimize this by using a hybrid representation of the mesh: the optimal
tetrahedra representation combined with a slower, more general representation
for the complex elements. The load balancing would then also be able to take
advantage of this information (that the complex elements are likely to take
longer to process.)
Optimal Layout of Fields – Liszt’s abstraction of representing field data as being
stored at mesh primitives leaves the compiler with options as to how to
actually store the data in memory. The number of possible layouts is vast. The
simplest is an array of structures; each geometric primitive has a structure with
all the fields at that location. Another option is a structure of arrays; in this case
each field is stored contiguously. There are many other options depending on
the data access patterns and machine architecture. After analyzing the program
Liszt can choose an optimal layout. For example, consider a program with four
kernels and four fields.
          Kernel 1   Kernel 2   Kernel 3   Kernel 4
Field 1       X          X
Field 2                  X          X          X
Field 3                             X          X
Field 4                             X          X
For cache purposes on traditional architectures, fields 3 and 4 should clearly be
grouped together. Whether to group field 2 with field 1 or with fields 3 and 4 depends on
how long (assumed to be closely related to how much math and memory access the
kernel performs) each kernel will take. If kernel 2 is estimated to be bandwidth
bound then grouping fields 1 and 2 together would provide maximum speed.
On the other hand, if kernel 2 is arithmetic or instruction bound and kernel 3 is
bandwidth bound, then grouping fields 2, 3 and 4 together would be optimal.
It is possible that for some configurations the choice would essentially have no
impact (if all kernels were highly compute bound, for example), and then an
arbitrary choice can be made.
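The C++ sketch below illustrates the two extreme layouts the compiler can choose between; the field names are placeholders, and the point is only that the kernel source is unchanged while the physical layout differs.

#include <cstddef>
#include <vector>

// Array of structures: all fields for one vertex are adjacent in memory.
struct VertexAoS { double field1, field2, field3, field4; };
using MeshAoS = std::vector<VertexAoS>;

// Structure of arrays: each field is stored contiguously across the mesh.
struct MeshSoA {
    std::vector<double> field1, field2, field3, field4;
};

// A kernel that only reads fields 3 and 4 streams through two contiguous arrays in
// the SoA layout, but drags fields 1 and 2 through the cache in the AoS layout.
double kernel_soa(const MeshSoA& m) {
    double acc = 0.0;
    for (std::size_t v = 0; v < m.field3.size(); ++v)
        acc += m.field3[v] * m.field4[v];
    return acc;
}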
A second optimization on traditional cpus related to cache performance and
fields is blocking. A loop such as the following:
forall c in mesh.cells() {
  // do stuff ...
}

can be transformed into this:

// this is simply pseudo code to show how the loop might
// be transformed by the compiler
// the user would never write such code
forall chunk in mesh.cells() {
  forall c in chunk {
    // do stuff
  }
}
Each chunk’s size would be chosen such that all of the fields, mesh data, etc.
fit into the L2 cache. To take full advantage of this blocking, the layout of each
field would arrange each chunk to be contiguous in memory.
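A minimal C++ version of the same transformation is sketched below; the cache size and the bytes touched per cell are illustrative assumptions, not values used by Liszt.

#include <algorithm>
#include <cstddef>

// Stand-in for the kernel body ("do stuff") applied to one cell.
inline void process_cell(std::size_t /*c*/) {}

void blocked_loop(std::size_t num_cells) {
    const std::size_t l2_bytes       = 512 * 1024;  // assumed L2 capacity
    const std::size_t bytes_per_cell = 96;          // assumed field + mesh data per cell
    const std::size_t chunk          = std::max<std::size_t>(1, l2_bytes / bytes_per_cell);
    for (std::size_t start = 0; start < num_cells; start += chunk) {
        const std::size_t end = std::min(num_cells, start + chunk);
        for (std::size_t c = start; c < end; ++c)   // all data for this chunk stays in cache
            process_cell(c);
    }
}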
Optimal Sparse Matrix Representation – The optimal method to represent the
sparse matrices depends both on the structure of the matrix itself as well as
the hardware that is going to be used in the computation [8]. With knowledge
of the matrix structure and the hardware platform, the optimal matrix can be
chosen at runtime, by means of a table based lookup.
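One candidate format among those such a lookup could select is compressed sparse row (CSR), sketched below in C++ with a basic matrix-vector product; a GPU target might instead pick a blocked or ELLPACK-style layout, while the kernel source written in Liszt would not change.

#include <cstddef>
#include <vector>

struct CSRMatrix {
    std::vector<int>    row_ptr;   // size = rows + 1
    std::vector<int>    col;       // column index of each stored entry
    std::vector<double> val;       // value of each stored entry
};

// y = A * x for a CSR matrix (assumes row_ptr has rows + 1 entries).
std::vector<double> spmv(const CSRMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.row_ptr.size() - 1, 0.0);
    for (std::size_t r = 0; r + 1 < A.row_ptr.size(); ++r)
        for (int k = A.row_ptr[r]; k < A.row_ptr[r + 1]; ++k)
            y[r] += A.val[k] * x[A.col[k]];
    return y;
}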
Many of the optimizations that Liszt enables are only possible once the mesh itself
is known and, to a lesser extent, once it is known which subset of all possible kernels in
the program will be executed. This means that a just-in-time (JIT) compilation strategy must
be employed. The Liszt runtime will allow for the mesh to be loaded, configurations
to be read and sets to be constructed in a general and non-optimal manner. Once the
configuration, mesh, and sets are known, the optimizations are applied and code is
generated for the target architecture, and compiled. The main solver loop then runs
with this optimal code.
4.4 Examples
These examples are important for two reasons:
1. They demonstrate the possibility of writing all of these algorithms in Liszt
2. They provide a showcase for the language itself, rather than a specification
The examples are not complete programs; in general, only the main computational
loops are shown. Initialization code, mesh loading, boundary determination and
related code have been skipped over. The declaration of Fields and SparseMatrices
are still shown. Sometimes functions are used, but not declared in the code shown;
when this is the case it has been documented in comments.
A finite difference example has already been provided in the discussion on Sets.
The following is a simple finite volume code for solving the scalar convection
equation in three dimensions. That is, it solves

\[ \frac{\partial \phi}{\partial t} + \nabla \cdot (\phi \mathbf{u}) = 0 \]
where u is the velocity field and is constant in time (in the example below it is also
constant in space, but it need not be), φ is the convected quantity. Density has been
assumed constant. The usual finite volume approach is taken and the equation is
integrated over a control volume and then the second term is recast using Gauss’
divergence theorem to be over the surface of the volume.

\[ \int_{CV} \nabla \cdot (\phi \mathbf{u}) \, dV = \int_{S} \mathbf{n} \cdot (\phi \mathbf{u}) \, dS \]

To evaluate this new integral, the value of φ is assumed to be constant on each face.
Upwind differencing is used to determine which cell's value of φ should be used.

/* Initialization code not shown */
vec3 globalVelocity = (1, 0, 0);
// time stepping loop
for (double t = 0; t < NumSteps; t += delta_t) {
  // calculate a flux for the faces not part of boundary
  forall f in InteriorFaces {
    vec<3> normal = normals.normal(f);
    double vDotN = globalVelocity.dot(normal);
    double area = faceArea(f);
    double flux;
    // determine correct flux contribution
    if (vDotN >= 0)
      flux = area * vDotN * Phi(f.inside());
    else
      flux = area * vDotN * Phi(f.outside());
    // scatter fluxes
    Flux[f.inside()] -= flux;
    Flux[f.outside()] += flux;
  }
  // handle the boundary condition
  forall f in OutflowFaces {
    // need the face to have the correct orientation
    if (f.outside().ID() != 0) f.flip();
    vec<3> normal = normals.normal(f);
    double vDotN = globalVelocity.dot(normal);
    assert(vDotN >= 0);   // for an outflow face, it better be...
    double area = faceArea(f);
    double flux = area * vDotN * Phi(f.inside());
    Flux[f.inside()] -= flux;
  }
  // Now perform time advancement since the Flux is known
  forall c in mesh.cells() {
    double volume = cellVolume(c);
    Phi[c] = Phi(c) + delta_t * Flux(c) / volume;
    // need to zero flux for next iteration
    Flux[c] = 0.;
  }
  // initialization of camera object not shown
  camera.snapshot();   // write visualization data
}
The following example uses the Galerkin finite element method to solve Laplace’s
equation in two dimensions. That is, it solves the following problem

\[ -\nabla^2 u = f \quad \text{in } \Omega \]
\[ u = 0 \quad \text{on } \Gamma_D \]
\[ \frac{\partial u}{\partial n} = h \quad \text{on } \Gamma_N \]
The problem is put into the weak form by multiplying by a set of test functions,
v, and then integrating over the volume.

\[ -\int_{\Omega} v\, \nabla^2 u = \int_{\Omega} f v \qquad \text{for all } v \]
Then using Green's identity to rewrite the first term...

\[ \int_{\Omega} \nabla u \cdot \nabla v - \int_{\partial\Omega} \frac{\partial u}{\partial n}\, v = \int_{\Omega} f v \qquad \text{for all } v \]
Then using the boundary conditions to rewrite the second term...

\[ \int_{\Omega} \nabla u \cdot \nabla v = \int_{\Omega} f v + \int_{\partial\Omega} h v \qquad \text{for all } v \]
If we choose to represent the unknown u using the same space of functions v, then
we have the classical Galerkin finite element method.
\[ \sum_i u_i \int_{\Omega} \nabla v_i \cdot \nabla v_j = \int_{\Omega} f v_j + \int_{\partial\Omega} h v_j \qquad \text{for all } v_j \]
This gives rise to a system of equations KU = F. K, the stiffness matrix, is the
result of the first integral on the left hand side. U is the vector of unknowns u_i. F
is the result of the two integrals on the right hand side. Each one of these terms can
be seen being computed in the example below. The exact details of the calculation
of K are hidden inside a function to simplify this example. Note that because there is
one degree of freedom at each vertex, the method is using first order elements.

// place one DOF at each vertex
SparseMatrix<mesh.vertices().DOF(), mesh.vertices().DOF(),
             double> K;
Field<mesh.vertices().DOF(), double> rhs;
Field<mesh.vertices().DOF(), double> u;
Field<mesh.vertices()> pos = ...   // load position
Set<Edge> NeumannBCs;              // initialization not shown
// F is the inhomogeneous term in the PDE
// G is a function which computes the local stiffness matrix
// H is a function which evaluates the Neumann BC
forall c in mesh.cells() {
  // DOF are automatically returned in a counterclockwise ordering
  // in 2D
  LocalMatrix<c.DOF(), c.DOF(), double> Kloc = G(c.DOF(), pos);
  ReduceLocalToGlobal(Kloc, K);
  // performs the equivalent of the following
  // forall d1 in c.DOF() {
  //   forall d2 in c.DOF() {
  //     K[d1][d2] += Kloc[d1][d2];
  //   }
  // }
  // forcing term
  double localRHS = cellvolume(c) * F(cellcenter(c)) / 6;
  forall v in c.vertices() {
    rhs[v] += localRHS;
  }
}
forall e in NeumannBCs {
  double val = length(e) * H(center(e)) / 2;
  rhs[e.tail] += val;
  rhs[e.head] += val;
}
u = LinearSolve(K, rhs);
This is a more advanced Galerkin finite element example for solving the same
problem. The elements are now quadratic instead of linear and the details of the
integration and construction of the stiffness matrix are shown. The handling of the
boundary conditions is not shown. Most of the math in the function quadBasisEval
is to determine the mapping from the real space triangle to a scaled, reference, right
triangle. The functions, func1, func2, etc. which evaluate each of the basis functions
inside the triangle, work with scaled coordinates for simplicity. vec3 quadResult quadBasisEval ( L i s t<vertex> vert s , Fie ld<vertex , vec2> pos ,
DOF dof1 , Ce l l c )
vec3 returnVal ;
double det = 2 ∗ area ( c ) ;
//compute transformed coo rd ina t e s
double r = ( pos [ v e r t s [ 2 ] ] . y − pos [ v e r t s [ 0 ] ] . y ) ∗ ( pos [ qp ] . x − pos [ v e r t s [ 0 ] ] . x ) +
( pos [ v e r t s [ 0 ] ] . x − pos [ v e r t s [ 2 ] ] . x ) ∗ ( pos [ qp ] . y − pos [ v e r t s [ 0 ] ] . y ) ;
double drdx = ( pos [ v e r t s [ 2 ] ] . y − pos [ v e r t s [ 0 ] ] . y ) / det ;
double drdy = ( pos [ v e r t s [ 0 ] ] . x − pos [ v e r t s [ 2 ] ] . x ) / det ;
double s = ( pos [ v e r t s [ 0 ] ] . y − pos [ v e r t s [ 1 ] ] . y ) ∗ ( pos [ qp ] . x − pos [ v e r t s [ 0 ] ] . x ) +
( pos [ v e r t s [ 1 ] ] . x − pos [ v e r t s [ 0 ] ] . x ) ∗ ( pos [ qp ] . y − pos [ v e r t s [ 0 ] ] . y ) ;
double dsdx = ( pos [ v e r t s [ 0 ] ] . y − pos [ v e r t s [ 1 ] ] . y ) / det ;
double dsdy = ( pos [ v e r t s [ 1 ] ] . x − pos [ v e r t s [ 0 ] ] . x ) / det ;
double b , dbdr , dbds ;
i f ( dof1 . type == 0) b = func1 ( r , s ) ; // d e t a i l s not shown
dbdr = func2 ( r , s ) ; // d i t t o
dbds = func3 ( r , s ) ; // d i t t o
else i f ( dof1 . type == 1)
// . . .
// . . . up through a l l 6 p o s s i b i l i t i e s
double dbdx = dbdr ∗ drdx + dbds ∗ dsdx ;
double dbdy = dbdr ∗ drdy + dbds ∗ dsdy ;
CHAPTER 4. LISZT 94
returnVal . x = b ;
returnVal . y = dbdx ;
returnVal . z = dbdy ;
return returnVal ;
main() {
  // Place 1 DOF at each vertex and at each edge midpoint
  SparseMatrix<mesh.cells().DOF(), mesh.cells().DOF(), double> K;
  Field<mesh.cells().DOF(), double> rhs;
  Field<mesh.vertices(), vec2> pos = // load positions

  // each cell has a field containing a list of length 3 storing vec3s
  Field<mesh.cells(), List<vec3, 3> > QuadraturePointsWeights;

  // determine quadrature points and weights for each element
  forall c in mesh.cells() {
    List<vertex> verts = mesh.vertexList(c);
    vec2 v1 = pos[verts[0]];
    vec2 v2 = pos[verts[1]];
    vec2 v3 = pos[verts[2]];
    QuadraturePointsWeights(c)[0] = vec3((v1 + v2) / 2, 1/3);
    QuadraturePointsWeights(c)[1] = vec3((v2 + v3) / 2, 1/3);
    QuadraturePointsWeights(c)[2] = vec3((v1 + v3) / 2, 1/3);
  }

  // assembly loop
  forall c in mesh.cells() {
    // can be a forall loop even though a List implies an
    // ordering, as long as the ordering is not used
    forall vec3 qp in QuadraturePointsWeights(c) {
      double w = area(c) * qp.z;
      forall dof1 in c.DOF() {
        List<vertex> verts = mesh.vertexList(c);
        // fill up the rhs vector, F
        vec3 iQuad = quadBasisEval(verts, pos, dof1, c, qp);
        rhs(dof1) += w * F(qp) * iQuad.x;
        forall dof2 in c.DOF() {
          vec3 jQuad = quadBasisEval(verts, pos, dof2, c, qp);
          // fill up the stiffness matrix K
          K(dof1, dof2) += w * (iQuad.y * jQuad.y + iQuad.z * jQuad.z);
        }
      }
    }
  }

  // handle boundary conditions, not shown
  u = LinearSolve(K, rhs);
}
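The reference-triangle mapping and the edge-midpoint quadrature rule used above can be sketched outside of Liszt. The following Python snippet is purely illustrative (ref_coords and midpoint_quadrature are hypothetical helpers, not part of the thesis code); it maps a physical point to the reference coordinates (r, s) and checks that the three-point edge-midpoint rule with weights 1/3 integrates a quadratic exactly.

```python
import numpy as np

def ref_coords(p, v0, v1, v2):
    """Reference coordinates (r, s) of point p, assuming the affine map
    x = v0 + r*(v1 - v0) + s*(v2 - v0)."""
    A = np.column_stack([v1 - v0, v2 - v0])
    return np.linalg.solve(A, p - v0)

def midpoint_quadrature(v0, v1, v2):
    """Edge-midpoint rule: three points, equal weights, exact for quadratics."""
    area = 0.5 * abs((v1[0] - v0[0]) * (v2[1] - v0[1]) -
                     (v1[1] - v0[1]) * (v2[0] - v0[0]))
    pts = [(v0 + v1) / 2, (v1 + v2) / 2, (v0 + v2) / 2]
    wts = [area / 3] * 3
    return pts, wts

v0, v1, v2 = np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])
pts, wts = midpoint_quadrature(v0, v1, v2)
# the integral of r*s over this (reference) triangle is 1/24
approx = sum(w * np.prod(ref_coords(p, v0, v1, v2)) for p, w in zip(pts, wts))
print(approx)  # ~0.0416667
```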
To give the reader unfamiliar with discontinuous Galerkin methods some background, a brief overview of using them to solve a simple scalar wave equation is given. This should enable the reader to understand where the terms in the more complicated example come from. For further reading, [37] is recommended.
Consider the equation
$$\frac{\partial u}{\partial t} + \frac{\partial (a u)}{\partial x} = 0$$
where u is the unknown and a is a constant. Then consider space being decomposed into K distinct elements. On each of those elements we can locally represent the solution u as follows, where N is the order of the polynomial representation and $l_i$ is the Lagrange polynomial of order i:
$$u(x, t) = \sum_{i=1}^{N+1} u(x_i, t)\, l_i(x)$$
Next, like in the finite element method, we multiply by a set of test functions and integrate, but here only locally over each element. So we have for element E
$$\int_E \left(\frac{\partial u}{\partial t} + \frac{\partial (a u)}{\partial x}\right) l_i\, dx = 0, \qquad 1 \le i \le N+1$$
Then this is integrated by parts to yield
$$\int_E \left(\frac{\partial u}{\partial t}\, l_i - a u\, \frac{d l_i}{d x}\right) dx = -\int_{\partial E} n \cdot a u\, l_i\, dx, \qquad 1 \le i \le N+1$$
One of the key points of the method is that in the term on the right-hand side, the value of au is multiply defined at each interface between elements. How this discontinuity between elements is resolved depends on the equations one is solving. Without delving any deeper into this resolution, (au)* is simply referred to as the resolved quantity, known as the flux. This is known as the weak formulation; integrating the entire equation by parts one more time yields the strong formulation.
The expansion of u is substituted into the above equation, which can then be arranged into the following form
$$\mathbf{M}\frac{d\mathbf{u}}{dt} + \mathbf{S}^T a \mathbf{u} = -(au)^*_{x_r}\, \mathbf{l}(x_r) + (au)^*_{x_l}\, \mathbf{l}(x_l)$$
where $M_{ij} = \int_E l_i(x)\, l_j(x)\, dx$ and $S_{ij} = \int_E l_i(x)\, \frac{d l_j}{dx}\, dx$. The entire equation can be multiplied by $\mathbf{M}^{-1}$ to obtain an explicit expression for the time derivatives. Note that the expression on the right-hand side is really just a surface integral even though it isn't written as one in one dimension. It is even more important to realize that this equation is per element; it is not global. As such, no matrix inversions are required to advance the system in time (other than inverting the matrix M, which is small and can easily be done as a pre-processing step).
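As a small worked example (added for illustration and not part of the original text; it assumes the standard Lagrange basis $l_1(x) = (x_r - x)/h$, $l_2(x) = (x - x_l)/h$ on an element $[x_l, x_r]$ of length h), the local matrices for linear elements (N = 1) are
$$\mathbf{M} = \frac{h}{6}\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad \mathbf{S} = \begin{pmatrix} -\tfrac{1}{2} & \tfrac{1}{2} \\ -\tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix}$$
so M is only 2 x 2; its inverse can be precomputed once per element, and the time derivatives follow from purely local matrix-vector products plus the flux values at the two element endpoints.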
The following example showcases Liszt being used to implement the nodal discontinuous Galerkin method for solving the two-dimensional Maxwell's equations (in vacuum) with triangular elements.
The equations being solved (in dimensionless form) are thus (ignoring boundary conditions):
$$\frac{\partial H_x}{\partial t} = -\frac{\partial E_z}{\partial y}, \qquad \frac{\partial H_y}{\partial t} = \frac{\partial E_z}{\partial x}, \qquad \frac{\partial E_z}{\partial t} = \frac{\partial H_y}{\partial x} - \frac{\partial H_x}{\partial y}$$
$$E_z = 0 \ \text{ on } \ \Gamma$$
where H is the magnetic field, E is the electric field and Γ is the boundary.
The following variables are considered to be initialized previously in the program:

N : the order of the local approximation
Np : the number of terms in the local expansion, (N+1)(N+2)/2 in 2D
Nf : the number of terms of the local expansion located on a face of a cell, = N+1
Dr : Np x Np matrix such that ∂u/∂r = Dr u
Ds : Np x Np matrix such that ∂u/∂s = Ds u
r, s : the coordinates in the reference triangle space (not actual variables in the program)
rx, ry, sx, sy : each cell has unique vectors of length Np describing the mapping ∂r/∂x, ∂r/∂y, ∂s/∂x, ∂s/∂y
L : Np x 3Nf matrix describing how to "lift" the surface terms to volume terms
scale : Field from a (cell, edge) pair to a double containing the inverse of the Jacobian of the mapping along the edge

Field<mesh.cells().DOF(), double> Ez;  // out-of-plane electric field
Field<mesh.cells().DOF(), double> Hx;  // in-plane magnetic field components
Field<mesh.cells().DOF(), double> Hy;

// for each cell:
//   Np - 3N DOFs placed inside each cell
//   N - 1 DOFs placed along each edge (belonging to one (cell, edge) pair)
//   1 DOF placed at each vertex (belonging to two (cell, edge) pairs)
Field<mesh.cells(), List<c.edges().DOF(), double> > rhsHx;
Field<mesh.cells(), List<c.edges().DOF(), double> > rhsHy;
Field<mesh.cells(), List<c.edges().DOF(), double> > rhsEz;

// flux and rhs calculation
forall c in mesh.cells() {
  Field<c.edges().DOF(), double> dHx;  // size is 3Nf
  Field<c.edges().DOF(), double> dHy;
  Field<c.edges().DOF(), double> dEz;
  Field<c.edges().DOF(), double> fluxHx;
  Field<c.edges().DOF(), double> fluxHy;
  Field<c.edges().DOF(), double> fluxEz;

  forall e in c.edges() {
    // note that e carries with it the information about
    // which cell it came from, automatically providing
    // a (cell, edge) pair

    // compute field differences
    // There are Nf DOF per (cell, edge) pair, so the expression
    // Hx(e.DOF()) is inherently a vector
    // CellEdgePair creates a (cell, edge) pair
    // e.opposite(c) returns the cell on the other side of edge e from cell c
    dHx(e.DOF()) = Hx(e.DOF()) - Hx(CellEdgePair(e.opposite(c), e).DOF());
    dHy(e.DOF()) = Hy(e.DOF()) - Hy(CellEdgePair(e.opposite(c), e).DOF());
    dEz(e.DOF()) = Ez(e.DOF()) - Ez(CellEdgePair(e.opposite(c), e).DOF());
    normalDotDH = normal(e).x * dHx + normal(e).y * dHy;

    // compute fluxes along this edge
    fluxHx(e.DOF()) = normal(e).y * dEz +
                      (normalDotDH * normal(e).x - dHx) * scale(e);
    fluxHy(e.DOF()) = -normal(e).x * dEz +
                      (normalDotDH * normal(e).y - dHy) * scale(e);
    fluxEz(e.DOF()) = -normal(e).x * dHy + (normal(e).y * dHx - dEz) * scale(e);
  }

  // compute gradient and curl of fields in this cell
  Field<c.DOF(), double> Ezr = Dr * Ez(c.DOF());  // size is Np
  Field<c.DOF(), double> Ezs = Ds * Ez(c.DOF());
  Field<c.DOF(), double> Ezx = rx(c) * Ezr + sx(c) * Ezs;
  Field<c.DOF(), double> Ezy = ry(c) * Ezr + sy(c) * Ezs;
  Field<c.DOF(), double> Hxr = Dr * Hx(c.DOF());
  Field<c.DOF(), double> Hxs = Ds * Hx(c.DOF());
  Field<c.DOF(), double> Hyr = Dr * Hy(c.DOF());
  Field<c.DOF(), double> Hys = Ds * Hy(c.DOF());
  Field<c.DOF(), double> curlHz =
      rx(c) * Hyr + sx(c) * Hys - ry(c) * Hxr - sy(c) * Hxs;

  // compute rhs
  rhsHx(c) = -Ezy + L * fluxHx / 2;  // L is Np x 3Nf, flux is 3Nf x 1
  rhsHy(c) = Ezx + L * fluxHy / 2;
  rhsEz(c) = curlHz + L * fluxEz / 2;
}

// timestepping ...
Chapter 5
Conclusions
This thesis began with a brief overview of general purpose processor technology and discussed the eventual difficulties in further increasing performance under the instruction-level parallelism paradigm. Commodity graphics processors and IBM's Cell were introduced as two examples of an emerging class of hardware where the programming paradigm and hardware design are based upon data parallelism. The specifics of the hardware platforms and their programming models were described. The second chapter examined using GPUs for solving the O(N²) N-body problem. It provided some general and some hardware-specific techniques for achieving maximal performance. The maximum performance was quite high, over twenty-five times that of a highly optimized traditional CPU code. Also, on the metric of performance per dollar the GPU was far and away the best contender. On the metric of performance per watt the specialized hardware GRAPE was marginally better due to its single-purpose design. The third chapter examined both the CELL and GPUs for solving the compressible Euler equations. It was determined that GPUs were suitable, with a speedup of around twenty, while the CELL was not (with an overall slowdown) due to various architectural and programming difficulties which were discussed. This highlighted the difficulty of having multiple programming models. This led directly to the final chapter describing Liszt, a new Domain Specific Language for writing mesh-based PDE solvers, which would allow for writing code once and retargeting it to different acceleration technologies. It would also ease code development in general through the automatic handling
of domain decomposition and parallelization. Examples of all the mainstream mesh-based techniques for solving PDEs were presented in Liszt. The work of the previous chapters, especially Chapter 3, showed the need for this new language and will prove useful as the development of the Liszt compiler continues, especially the GPU backend.
Appendix A
New Periodic Boundary Conditions
for Simulating Nano-wires under
Torsion and Bending
A.1 Introduction
Recently there has been considerable interest in the directed growth of semiconductor
nanowires (NWs), which can be used to construct nano-scale field effect transistors
(FETs) [22, 94, 39], chemical and biological sensors [21], nano-actuators [17] and nano-
fluidic components [29]. Epitaxially grown NWs have the potential to function as
conducting elements between different layers of three-dimensional integrated circuits.
Because significant stress may build up during fabrication and service (e.g. due to
thermal or lattice mismatch), characterization and prediction of mechanical strength
and stability of NWs is important for the reliability of these novel devices.
NWs also offer unique opportunities for studying the fundamental deformation
mechanisms of materials at the nanoscale. The growing ability to fabricate and mechanically test microscale and nanoscale specimens and the increasing computational power allow for direct comparison between experiments and theory at the same length scale.
The size of these devices presents a challenge to test their mechanical properties.
In macroscale samples, the materials are routinely tested in tension, shear, torsion
and bending using standard grips and supports. Smaller samples, however, require
more inventive testing techniques. For nanoscale testing, tensile and bending tests
have been performed using nanoindentors, AFM [50, 23], and MEMS devices [96, 44].
Similar experiments have been performed at the microscale [46, 90]. With the rapid
progress of nanofabrication and nanomanipulation capabilities, additional tension,
torsion, and bending experimental data on crystalline and amorphous nanowires will
soon be available.
Molecular dynamics is poised to be the main theoretical tool to help understand and predict small-scale mechanical properties. However, since MD is limited in the number of atoms it can simulate, it cannot simulate whole nanowires: either the nanowire simulated must be extremely short or periodic boundary conditions (PBC) must be used. End conditions artificially alter the material locally such that defect nucleation and failure often occur there. This results in simulations that test the
strength of the boundary rather than the intrinsic strength of the material. Tradi-
tional PBC remove this artifact by enforcing translational invariance and eliminating
all artificial boundaries.
The use of conventional PBC allows for the simulation of tensile, pure shear, and
simple shear in MD [80]. In fact, the mechanical properties of silicon nanowires in
tension were recently calculated using this approach [48]. The nanowires were strained
by extending the periodicity along the nanowire length and the stress was calculated
through the Virial formula. However, regardless of the types of strain imposed on the
periodic simulation cell, the images form a perfect lattice which precludes nonzero
average torsion or bending. Therefore, to simulate torsion or bending tests, either
small finite nanowires must be simulated or the current PBC framework must be
altered.
Many Molecular Dynamics simulations on torsion and bending of nanoscale struc-
tures have been reported [38, 95, 77, 40, 72]. The artificial end effects are sometimes
reduced by putting the ends far away from the region undergoing severe deforma-
tion, requiring a long nanowire [60]. There have also been attempts to rectify this
problem [73]. Recently, the objective molecular dynamics (OMD) formulation [25]
has been proposed that generalizes periodic boundary conditions to accommodate
symmetries other than translational. Under this framework, torsion and bending
simulations can be performed without end effects. But the general formulation of
OMD is somewhat difficult to apply to existing MD simulation programs.
In this thesis, a simpler formulation that accommodates torsion and bending in
a generalized periodic boundary condition framework is presented. It is shown that
torsion and bending can be related to shear and normal strains when expressed in
cylindrical coordinates. This leads to t-PBC and b-PBC, respectively, as formulated in
Section 2. While only linear momenta are preserved in PBC, both t-PBC and b-PBC
preserve the angular momentum around their rotation axes. These new boundary
conditions can be easily implemented on top of existing simulation programs that use
conventional PBC. In Section 3, the Virial expressions for the torque and bending
moment are derived that are analogous to the Virial expressions for the average stress
in simulation cells under PBC. The Virial expressions of torque and bending moment,
expressed as a sum over discrete atoms, are found to correspond to a set of tensorial
quantities in continuum mechanics, expressed as a volume integral. Section 4 presents
the application of these new boundary conditions to modeling of the intrinsic strength
of Si nanowires under torsion and bending.
A.2 Generalization of Periodic Boundary Conditions
A.2.1 Review of Conventional PBC
PBC can be visualized as a primary cell surrounded by a set of replicas, or image cells.
The replicas are arranged into a regular lattice specified by three repeat vectors: c1,
c2, c3. This means that whenever there is an atom at location ri there are also atoms
at ri +n1c1 +n2c2 +n3c3, where n1, n2, n3 are arbitrary integers [2, 15]. Because the
atoms in the image cells behave identically to those in the primary cell, it is immaterial to specify which space belongs to the primary cell and which belongs to the image cells. Even though it is customary to refer to the parallelepiped formed by the three period vectors as the simulation cell and the surface of this parallelepiped as the boundary, there is no physical interface at this boundary. In other words, the "boundary" between the primary and image cells in PBC can be drawn anywhere and is only a matter of convention. Consequently, translational invariance is preserved and linear momentum is conserved in all three directions. It is customary to set the velocity of the center of mass to zero in the initial condition; it should then remain zero during the simulation. This provides an important check of the self-consistency of the simulation program.
The scaled coordinates si are usually introduced to simplify the notation and the
implementation of PBC, where
ri = H · si (A.1)
and H = [c1|c2|c3] is a 3x3 matrix whose three columns are formed by the coordinates of the three repeat vectors. For example, H becomes a diagonal matrix when the three repeat vectors are parallel to the x-, y-, and z-axes, respectively,
$$\mathbf{H} = \begin{pmatrix} L_x & 0 & 0 \\ 0 & L_y & 0 \\ 0 & 0 & L_z \end{pmatrix} \tag{A.2}$$
where Lx = |c1|, Ly = |c2|, Lz = |c3|. The periodic boundary conditions can also be stated in terms of the scaled coordinates as follows: whenever there is an atom at location $\mathbf{s}_i = (s^i_x, s^i_y, s^i_z)^T$, there are also atoms at locations $(s^i_x + n_1,\, s^i_y + n_2,\, s^i_z + n_3)^T$, where n1, n2, n3 are arbitrary integers. The scaled coordinates of each atom, $s^i_x$, $s^i_y$, $s^i_z$, are sometimes limited to [−0.5, 0.5), although this is not necessary.
To apply a normal strain in the x direction, one only needs to modify the magnitude of Lx. To introduce a shear strain εyz, one can simply add an off-diagonal term to the H matrix,
$$\mathbf{H} = \begin{pmatrix} L_x & 0 & 0 \\ 0 & L_y & 2\,\varepsilon_{yz} L_y \\ 0 & 0 & L_z \end{pmatrix} \tag{A.3}$$
Regardless of the normal or shear strain, the scaled coordinates $s^i_x$, $s^i_y$, $s^i_z$ still independently satisfy PBC in the domain [−0.5, 0.5), which is the main advantage for introducing the scaled coordinates. By modifying H in these ways, one can stretch and shear a crystal in MD.
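As a small illustration of how the scaled coordinates are used in practice (a sketch under assumed values; Lx, Ly, Lz and eps_yz below are arbitrary and the helper wrap is hypothetical, not code from this thesis), the following Python snippet builds the sheared H of Eq. (A.3), maps a real-space position to scaled coordinates, wraps them into [-0.5, 0.5), and maps back:

```python
import numpy as np

Lx, Ly, Lz, eps_yz = 10.0, 12.0, 15.0, 0.05   # illustrative values
H = np.array([[Lx, 0.0, 0.0],
              [0.0, Ly, 2.0 * eps_yz * Ly],   # off-diagonal term of Eq. (A.3)
              [0.0, 0.0, Lz]])

def wrap(s):
    """Wrap scaled coordinates into [-0.5, 0.5)."""
    return s - np.round(s)

r = np.array([1.0, 14.2, -8.0])   # a real-space atom position
s = np.linalg.solve(H, r)         # scaled coordinates, r = H . s
r_in_cell = H @ wrap(s)           # equivalent position inside the primary cell
```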
A.2.2 Torsional PBC
While the exact formulation of PBC as stated above cannot accommodate a non-
zero average torsion over the entire simulation cell, the general idea can still be used.
Consider a nanowire of length Lz aligned along the z-axis, as shown in Fig. A.1(a).
To apply PBC along the z-axis, one makes two copies of the atoms in the nanowire,
shifts them along z by ±Lz, and lets them interact with the atoms in the primary
wire. Two copies of the original nanowire would be sufficient if the cut-off radius rc of the interatomic potential function is smaller than Lz (usually rc ≪ Lz). After PBC is applied, the model may be considered as an infinitely long, periodic wire along the z-axis.
Figure A.1: (a) A nanowire subjected to PBC along the z axis. (b) A nanowire subjected to t-PBC along the z axis.

Any arbitrary section of length Lz can now be considered as the primary wire
due to the periodicity. Since the atomic arrangement must repeat itself after every
Lz distance along the wire, the average torsion that can be applied to the nanowire
is zero. A local torsion in some section of the wire has to be cancelled by an opposite
torsion at another section that is less than Lz away.
One way to introduce an average torque to this infinitely long wire is to rotate the
two images by angle +φ and −φ, respectively, before attaching them to the two ends
of the primary wire as shown in Fig. A.1(b). The image wire that is displaced by
Lz is rotated by φ, while the one that is displaced by −Lz is rotated by −φ. In this
case, as one travels along the wire by Lz, he will find that the atomic arrangement
in the cross section will be rotated around z axis by angle φ but otherwise identical.
Again, because this property is satisfied by any cross section of the nanowire, it is
arbitrary which is called the primary wire and which are called images, similar to
conventional periodic boundary conditions. The torsion imposed on the nanowire
can be characterized by the angle of rotation per unit length, φ/Lz. In the limit of
small deformation, the shear strain field produced by the torsion is
$$\varepsilon_{\theta z} = \frac{r\,\varphi}{2 L_z} \tag{A.4}$$
where r is the distance away from the z-axis.
The above procedure specifies torsional periodic boundary conditions (t-PBC)
that can be easily expressed in terms of scaled cylindrical coordinates. Consider
an atom i with cartesian coordinates ri = (xi, yi, zi)T and cylindrical coordinates
(ri, θi, zi)T, which are related to each other by,
xi = ri cos θi (A.5)
yi = ri sin θi (A.6)
When the wire is subjected to PBC along z (with free boundary conditions in x and y), the scaled cylindrical coordinates $(s^i_r, s^i_\theta, s^i_z)^T$ are introduced through the relationship
$$\begin{pmatrix} r_i \\ \theta_i \\ z_i \end{pmatrix} = \begin{pmatrix} R & 0 & 0 \\ 0 & 2\pi & 0 \\ 0 & 0 & L_z \end{pmatrix}\begin{pmatrix} s^i_r \\ s^i_\theta \\ s^i_z \end{pmatrix} \equiv \mathbf{M}\cdot\begin{pmatrix} s^i_r \\ s^i_\theta \\ s^i_z \end{pmatrix} \tag{A.7}$$
Both $s^i_\theta$ and $s^i_z$ independently satisfy periodic boundary conditions in the domain [−0.5, 0.5). No boundary condition is applied to the coordinate $s^i_r$. R is a characteristic length scale in the radial direction in order to make $s^i_r$ dimensionless. Although this is not necessary, one can choose R to be the radius of the nanowire, in which case $s^i_r$ would vary from 0 to 1.

Torsion can be easily imposed by introducing an off-diagonal term to the matrix M, which becomes
$$\mathbf{M} = \begin{pmatrix} R & 0 & 0 \\ 0 & 2\pi & \varphi \\ 0 & 0 & L_z \end{pmatrix} \tag{A.8}$$
The scaled coordinates $s^i_\theta$ and $s^i_z$ still independently satisfy periodic boundary conditions in the domain [−0.5, 0.5). This is analogous to the application of shear strain to a simulation cell subjected to conventional PBC, as described in Eq. (A.3). t-PBC can be easily implemented in an existing simulation program by literally following Fig. A.1(b), i.e. by making two copies of the wire, rotating them by ±φ, and placing the two copies at the two ends of the primary wire. In practice, it is not necessary to copy the entire wire, because the cut-off radius rc of the interatomic potential function is usually much smaller than Lz; only two sections at the ends of the primary wire with lengths longer than rc need to be copied.¹ It is important to perform this operation of "copy-and-paste" at every MD time step, or whenever the potential energy and atomic forces need to be evaluated. This will completely remove the end effects and will ensure that identical MD trajectories would be generated had a different section (also of length Lz) of the wire been chosen as the primary wire.
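A minimal sketch of this copy-and-paste step (illustrative only; tpbc_images is a hypothetical helper and the primary wire is assumed to occupy 0 ≤ z ≤ Lz):

```python
import numpy as np

def rot_z(p, angle):
    """Rotate an (N, 3) array of positions about the z axis by 'angle'."""
    c, s = np.cos(angle), np.sin(angle)
    Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return p @ Rz.T

def tpbc_images(pos, Lz, phi, rc):
    """Image atoms needed for t-PBC: the end sections of the primary wire,
    rotated by +/- phi and shifted by +/- Lz along z (Fig. A.1(b))."""
    bottom = pos[pos[:, 2] < pos[:, 2].min() + rc]   # becomes the image above
    top    = pos[pos[:, 2] > pos[:, 2].max() - rc]   # becomes the image below
    above = rot_z(bottom, +phi) + np.array([0.0, 0.0, +Lz])
    below = rot_z(top,    -phi) + np.array([0.0, 0.0, -Lz])
    return np.vstack([above, below])
```

As noted above, these image positions must be rebuilt at every time step (or whenever energies and forces are evaluated) from the current positions of the primary atoms.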
An important property of t-PBC is that the trajectory of every atom satisfies the classical (Newton's) equation of motion. In other words, among the infinite number of atoms that are periodic images of each other, it makes no physical difference which one is called "primary" and which ones are called "images". Since the primary atoms follow Newton's equation of motion ($\mathbf{f}_i = m\,\mathbf{a}_i$), to prove the above claim it suffices to show that the image atoms, which are slaves of the primary atoms (through the "copy-and-paste" operation), also follow Newton's equation of motion ($\mathbf{f}_{i'} = m\,\mathbf{a}_{i'}$).

To show this, consider an atom i and its periodic image i′, such that $s^{i'}_r = s^i_r$, $s^{i'}_\theta = s^i_\theta$, $s^{i'}_z = s^i_z + 1$. The positions of the two atoms are related by t-PBC: $\mathbf{r}_{i'} = \mathrm{Rot}_z(\mathbf{r}_i, \varphi) + \mathbf{e}_z L_z$, where $\mathrm{Rot}_z(\cdot, \varphi)$ represents rotation of a vector around the z-axis by angle φ and $\mathbf{e}_z$ is the unit vector along the z-axis. Hence the accelerations of the two atoms are related to each other through $\mathbf{a}_{i'} = \mathrm{Rot}_z(\mathbf{a}_i, \varphi)$. Now consider an arbitrary atom j that falls within the cut-off radius of atom i. Let $\mathbf{r}_{ij} \equiv \mathbf{r}_j - \mathbf{r}_i$ be the distance vector from atom i to j. Consider the image atom j′ such that $s^{j'}_r = s^j_r$, $s^{j'}_\theta = s^j_\theta$, $s^{j'}_z = s^j_z + 1$. Hence $\mathbf{r}_{j'} = \mathrm{Rot}_z(\mathbf{r}_j, \varphi) + \mathbf{e}_z L_z$, and $\mathbf{r}_{i'j'} \equiv \mathbf{r}_{j'} - \mathbf{r}_{i'} = \mathrm{Rot}_z(\mathbf{r}_{ij}, \varphi)$. Since this is true for an arbitrary neighbor atom j around atom i, the forces on atoms i and i′ must satisfy the relation $\mathbf{f}_{i'} = \mathrm{Rot}_z(\mathbf{f}_i, \varphi)$. Therefore, the trajectory of atom i′ also satisfies Newton's equation of motion $\mathbf{f}_{i'} = m\,\mathbf{a}_{i'}$.

¹This simple approach is not able to accommodate long-range Coulomb interactions, for which the Ewald summation is usually used in conventional PBC. Extension of the Ewald method to t-PBC is beyond the scope of this thesis.
MD simulations under t-PBC should conserve the total linear momentum Pz and
angular momentum Jz because t-PBC preserves both translational invariance along
and rotational invariance around the z axis. However, the linear momenta Px and Py
are no longer conserved in t-PBC due to the specific choice of the origin in the x-y
plane (which defines the cylindrical coordinates r and θ). In comparison, the angular
momentum Jz is usually not conserved in PBC. Consequently, at the beginning of MD
simulations under t-PBC, both Pz and Jz must be set to zero. Pz and Jz will remain
zero, which provides an important self-consistency check of the implementation of
boundary conditions and numerical integrators.
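The conserved quantities can be monitored with a few lines of code; the following sketch (illustrative, not from the thesis) computes Pz and Jz for the atoms of the primary wire:

```python
import numpy as np

def pz_jz(mass, pos, vel):
    """Linear momentum P_z and angular momentum J_z about the z axis.
    mass: (N,), pos: (N, 3), vel: (N, 3)."""
    p = mass[:, None] * vel
    Pz = p[:, 2].sum()
    Jz = np.sum(pos[:, 0] * p[:, 1] - pos[:, 1] * p[:, 0])   # sum of (x p_y - y p_x)
    return Pz, Jz
```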
A.2.3 Bending PBC
The same idea can be used to impose bending deformation on wires. Again, the atomic positions will be described through scaled cylindrical coordinates $(s^i_r, s^i_\theta, s^i_z)^T$, which are related to the real cylindrical coordinates $(r_i, \theta_i, z_i)^T$ through the following transformation,
$$\begin{pmatrix} r_i \\ \theta_i \\ z_i \end{pmatrix} = \begin{pmatrix} R & 0 & 0 \\ 0 & \Theta & 0 \\ 0 & 0 & L_z \end{pmatrix}\begin{pmatrix} s^i_r \\ s^i_\theta \\ s^i_z \end{pmatrix} + \begin{pmatrix} L_0/\Theta \\ 0 \\ 0 \end{pmatrix} \equiv \mathbf{N}\cdot\begin{pmatrix} s^i_r \\ s^i_\theta \\ s^i_z \end{pmatrix} + \begin{pmatrix} L_0/\Theta \\ 0 \\ 0 \end{pmatrix} \tag{A.9}$$
While the coordinate system here is still the same as that in the case of torsion, the wire is oriented along the θ direction, as shown in Fig. A.2. Among the three scaled coordinates, only $s^i_\theta$ is subjected to a periodic boundary condition, in the domain [−0.5, 0.5). This means that θi is periodic in the domain [−Θ/2, Θ/2). No boundary conditions are applied to $s^i_r$ and $s^i_z$. R and Lz are characteristic length scales in the r and z directions, respectively. L0 is the original (stress free) length of the wire and ρ = L0/Θ is the radius of curvature of the wire. The equation r = ρ specifies the neutral surface of the wire. Thus, $r_i = \rho + R s^i_r$, where $R s^i_r$ describes the displacement of atom i away from the neutral axis in the r direction.

Figure A.2: A nanowire subjected to b-PBC around the z axis. At equilibrium the net line tension force F must vanish but a non-zero bending moment M will remain.
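The coordinate relationship of Eq. (A.9) can be written out explicitly; the following sketch is illustrative (bpbc_position is a hypothetical helper, not thesis code) and returns the Cartesian position of an atom from its scaled coordinates under b-PBC:

```python
import numpy as np

def bpbc_position(s, R, Theta, Lz, L0):
    """Cartesian position from scaled cylindrical coordinates (s_r, s_theta, s_z),
    following Eq. (A.9)."""
    s_r, s_th, s_z = s
    s_th -= np.round(s_th)       # only s_theta is wrapped into [-0.5, 0.5)
    rho = L0 / Theta             # radius of curvature of the neutral surface
    r = rho + R * s_r
    theta = Theta * s_th
    z = Lz * s_z
    return np.array([r * np.cos(theta), r * np.sin(theta), z])
```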
In the previous section, an off-diagonal element had to be introduced to the transformation matrix M in order to introduce torsion. In comparison, the form of Eq. (A.9) does not need to be changed to accommodate bending. Different amounts of bending can be imposed by adjusting the value of Θ, while the matrix N remains diagonal. The larger Θ is, the more severe the bending deformation. The state of zero bending corresponds to the limit Θ → 0.
Intuitively, it may seem that increasing the value of Θ would elongate the wire and hence induce a net tension force F in addition to a bending moment M. However, this is not the case, because the directions of the force F at the two ends of the wire are not parallel to each other, as shown in Fig. A.2. When no lateral force (i.e. in the r direction) is applied to the wire, F must vanish for the entire wire to reach equilibrium. Otherwise, there would be a non-zero net force in the −x direction, which would cause the wire to move until F becomes zero. At equilibrium, only a bending moment (but no tension force) can be imposed by b-PBC.
b-PBC can be implemented in a similar way as t-PBC. One makes two copies of the primary wire and rotates them around the z axis by ±Θ. The atoms in these copies will interact with, and provide boundary conditions for, the atoms in the primary wire.² Again, this "copy-and-paste" operation is required at every step of the MD simulation. This will ensure that all atoms (primary and images) satisfy Newton's equation of motion. The proof is similar to that given in the previous section for t-PBC and is omitted here for brevity. Interestingly, both the linear momentum Pz and the angular momentum Jz for the center of mass are conserved in b-PBC, exactly as in t-PBC. Therefore, both Pz and Jz must be set to zero in the initial condition of MD simulations.

²Similar to the case of t-PBC, this simple approach is not able to accommodate long-range Coulomb interactions. While a wire under t-PBC can be visualized as an infinitely long wire, this interpretation will encounter some difficulty in b-PBC, because continuing the curved wire along the θ-direction will eventually make the wire overlap. The interpretation of b-PBC would then require the wire to exist in a multi-sheeted Riemann space [88, page 80] so that the wire does not really overlap with itself.
A.3 Virial Expressions for Torque and Bending
Moment
The experimental data on tensile tests are usually presented in the form of stress-
strain curves. The normal stress is calculated from σ = F/A, where F is the force
applied to the ends and A is the cross section area of the wire. In experiments on
macroscopic samples, the end effects are reduced by making the ends of the speci-
men much thicker than the middle (gauge) section where significant deformation is
expected. In atomistic simulations, on the other hand, the end effects are removed
by a different approach, usually through the use of periodic boundary conditions.
Unfortunately, with the end effects completely removed by PBC, there is no place
to serve as grips where external forces can be applied. Therefore, the stress must be
computed differently in atomistic simulations under PBC than in experiments. The Virial stress expression, which represents the time and volume average of the stress in the simulation cell, is widely used in atomistic calculations.
The same problem appears in atomistic simulations under t-PBC and b-PBC.
There needs to be a procedure to compute the torque and bending moment in these
new boundary conditions. In this section, the Virial expressions for the torque and
bending moment in t-PBC and b-PBC are developed. Similar to the Virial stress, the
new expressions involve discrete sums over all atoms in the simulation cell. The corre-
sponding expressions in continuum mechanics, expressed in terms of volume integrals,
are also identified. Since the derivation of these new expressions is motivated by that of the original Virial expression, a natural place to begin is with a quick review of the Virial stress.
A.3.1 Virial Stress in PBC
For an atomistic simulation cell subjected to PBC in all three directions, the Virial
formula gives the stress averaged over the entire simulation cell at thermal equilibrium
as
$$\sigma_{\alpha\beta} = \frac{1}{\Omega}\left\langle \sum_{i=1}^{N} -m_i v^i_\alpha v^i_\beta + \sum_{i=1}^{N-1}\sum_{j=i+1}^{N} \frac{\partial V}{\partial(x^i_\alpha - x^j_\alpha)}\,(x^i_\beta - x^j_\beta) \right\rangle \tag{A.10}$$
In this formula Ω = det(H) is the volume of the simulation cell, N is the total number of atoms, $v^i_\alpha$ and $x^i_\alpha$ are the α-components of the velocity and position of atom i, and V is the potential energy. The terms $(x^i_\alpha - x^j_\alpha)$ and $(x^i_\beta - x^j_\beta)$ in the second summation are assumed to be taken from the nearest images of atom i and atom j. The bracket ⟨·⟩ means ensemble average, which equals the long-time average if the system has reached equilibrium. Thus the Virial stress is the stress averaged both over the entire space and over a long time.
The Virial stress is the derivative of the free energy F of the atomistic simulation
cell with respect to a virtual strain εαβ, which deforms the periodic vectors c1, c2 and
c3 and hence the matrix H,
$$\sigma_{\alpha\beta} = \frac{1}{\Omega}\frac{\partial F}{\partial \varepsilon_{\alpha\beta}} \tag{A.11}$$
Assuming the simulation cell is in equilibrium under the canonical ensemble, the free energy is defined as
$$F \equiv -k_B T \ln\left\{\frac{1}{h^{3N}N!}\int d^{3N}r_i\, d^{3N}p_i\, \exp\left[-\frac{1}{k_B T}\left(\sum_{i=1}^{N}\frac{|\mathbf{p}_i|^2}{2m_i} + V(\{\mathbf{r}_i\})\right)\right]\right\} \tag{A.12}$$
where kB is Boltzmann's constant, T is the temperature, h is Planck's constant, $\mathbf{r}_i$ and $\mathbf{p}_i$ are the atomic position and momentum vectors, and V is the interatomic potential function. The momenta can be integrated out explicitly to give
$$F = -k_B T \ln\left\{\frac{1}{\Lambda^{3N}N!}\int d^{3N}r_i\, \exp\left[-\frac{V(\{\mathbf{r}_i\})}{k_B T}\right]\right\} \tag{A.13}$$
where $\Lambda \equiv h/(2\pi m k_B T)^{1/2}$ is the thermal de Broglie wavelength. In atomistic simulations under PBC, the potential energy can be written as a function of the scaled coordinates $\mathbf{s}_i$ and the matrix H. Hence, F can also be written in terms of an integral over the scaled coordinates,
$$F = -k_B T \ln\left\{\frac{\Omega^N}{\Lambda^{3N}N!}\int d^{3N}s_i\, \exp\left[-\frac{V(\{\mathbf{s}_i\}, \mathbf{H})}{k_B T}\right]\right\} \tag{A.14}$$
The Virial formula can be obtained by taking the derivative of Eq. (A.14) with respect to εαβ. The first term in the Virial formula comes from the derivative of the volume Ω with respect to εαβ, which contributes a $-N k_B T \delta_{\alpha\beta}/\Omega$ term to the total stress. This is equivalent to the velocity term in the Virial formula because $\langle m_i v^i_\alpha v^i_\beta \rangle = k_B T \delta_{\alpha\beta}$ in the canonical ensemble. The second term comes from the derivative of the potential energy V(si, H) with respect to εαβ. The Virial stress expression can also be derived through several alternative approaches (see [93, 63, 18, 20, 97] for more discussion). The corresponding quantity for the Virial stress in continuum mechanics is the volume average of the stress tensor,
$$\bar{\sigma}_{ij} = \frac{1}{\Omega}\int_\Omega \sigma_{ij}\, dV = \frac{1}{\Omega}\oint_S t_j\, x_i\, dS \tag{A.15}$$
where the integral $\oint_S$ is over the bounding surface of the volume Ω, $t_j$ is the traction force density on surface element dS, and $x_i$ is the position vector of the surface element.
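For a simple pair potential, Eq. (A.10) can be evaluated directly for a single configuration. The sketch below is illustrative only (virial_stress and phi_prime are hypothetical names; an orthogonal cell is assumed, and the ensemble average would be a further time average over the trajectory):

```python
import numpy as np

def virial_stress(pos, vel, mass, box, phi_prime):
    """Instantaneous Virial stress, Eq. (A.10), for a pair potential phi(r).
    pos, vel: (N, 3); mass: (N,); box: (3,) edge lengths; phi_prime(r) = d(phi)/dr."""
    N = len(pos)
    omega = np.prod(box)                    # cell volume
    sigma = np.zeros((3, 3))
    for i in range(N):                      # kinetic (velocity) term
        sigma -= mass[i] * np.outer(vel[i], vel[i])
    for i in range(N - 1):                  # pair (potential) term
        for j in range(i + 1, N):
            d = pos[i] - pos[j]
            d -= box * np.round(d / box)    # nearest image
            r = np.linalg.norm(d)
            sigma += phi_prime(r) / r * np.outer(d, d)
    return sigma / omega
```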
A.3.2 Virial Torque in t-PBC
The Virial torque expression for a simulation cell subjected to t-PBC can be derived in a similar fashion. First, the potential energy V is re-written as a function of the scaled cylindrical coordinates and the components of the matrix M, as given in Eq. (A.8),
$$V(\{\mathbf{r}_i\}) = V(\{s^i_r, s^i_\theta, s^i_z\}, R, \varphi, L_z) \tag{A.16}$$
The Virial torque is then defined as the derivative of the free energy F with respect to φ,
$$\tau \equiv \frac{\partial F}{\partial \varphi} \tag{A.17}$$
$$F = -k_B T \ln\left\{\frac{\Omega^N}{\Lambda^{3N}N!}\int d^{3N}s_i\, \exp\left[-\frac{V(\{s^i_r, s^i_\theta, s^i_z\}, R, \varphi, L_z)}{k_B T}\right]\right\} \tag{A.18}$$
Since ∂Ω/∂φ = 0, the torque reduces to
$$\tau = \frac{\int d^{3N}s_i\, \exp\left[-\frac{V(\{s^i_r, s^i_\theta, s^i_z\}, R, \varphi, L_z)}{k_B T}\right]\frac{\partial V}{\partial \varphi}}{\int d^{3N}s_i\, \exp\left[-\frac{V(\{s^i_r, s^i_\theta, s^i_z\}, R, \varphi, L_z)}{k_B T}\right]} \equiv \left\langle\frac{\partial V}{\partial \varphi}\right\rangle \tag{A.19}$$
In other words, the torque τ is simply the ensemble average of the derivative of the potential energy with respect to the torsion angle φ. To facilitate calculation in an atomistic simulation, one can express ∂V/∂φ in terms of the real coordinates of the atoms,
$$\frac{\partial V}{\partial \varphi} = \frac{1}{L_z}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N}\left[-\frac{\partial V}{\partial(x_i - x_j)}\,(y_i z_i - y_j z_j) + \frac{\partial V}{\partial(y_i - y_j)}\,(x_i z_i - x_j z_j)\right] \tag{A.20}$$
Hence one arrives at the Virial torque expression
$$\tau = \frac{1}{L_z}\left\langle\sum_{i=1}^{N-1}\sum_{j=i+1}^{N}\left[-\frac{\partial V}{\partial(x_i - x_j)}\,(y_i z_i - y_j z_j) + \frac{\partial V}{\partial(y_i - y_j)}\,(x_i z_i - x_j z_j)\right]\right\rangle \tag{A.21}$$
There is no velocity term in Eq. (A.21) because modifying φ does not change the volume Ω of the wire. This expression is verified numerically in Appendix C in the zero-temperature limit, when the free energy equals the potential energy. The corresponding quantity in continuum elasticity theory can be written in terms of an integral over the volume Ω of the simulation cell,
$$\tau = Q_{zz} \equiv \frac{1}{L_z}\int_\Omega \left(-y\,\sigma_{xz} + x\,\sigma_{yz}\right) dV \tag{A.22}$$
The derivation is given in Appendix A. The stress in the above expression refers
to the Cauchy stress in the context of finite deformation. Because it uses current
coordinates, the expression remains valid in finite deformation. The correspondence
between Eqs. (A.21) and (A.22) bears a strong resemblance to the correspondence
between Eqs. (A.10) and (A.15). While the Virial stress formula corresponds to the
average (i.e. zeroth moment) of the stress field over volume Ω, τ corresponds to a
linear combination of the first moments of the stress field.
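In the zero-temperature limit the ensemble average drops out and Eq. (A.21) becomes a plain double sum. The sketch below is illustrative only (virial_torque and phi_prime are hypothetical names; for a pair potential phi(r) the derivative ∂V/∂(x_i - x_j) equals phi'(r)(x_i - x_j)/r, and the t-PBC image handling is assumed to be done by the caller):

```python
import numpy as np

def virial_torque(pos, Lz, phi_prime):
    """Virial torque of Eq. (A.21) for a pair potential, single configuration."""
    N = len(pos)
    tau = 0.0
    for i in range(N - 1):
        for j in range(i + 1, N):
            d = pos[i] - pos[j]
            r = np.linalg.norm(d)
            dVdx = phi_prime(r) * d[0] / r   # dV/d(x_i - x_j)
            dVdy = phi_prime(r) * d[1] / r   # dV/d(y_i - y_j)
            xi, yi, zi = pos[i]
            xj, yj, zj = pos[j]
            tau += -dVdx * (yi * zi - yj * zj) + dVdy * (xi * zi - xj * zj)
    return tau / Lz
```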
A.3.3 Virial Bending Moment in b-PBC
Following a similar procedure, one can obtain the Virial expression for the bending
moment for a simulation cell subjected to b-PBC. First, rewrite the potential energy
of a system under b-PBC as,
$$V(\{\mathbf{r}_i\}) = V(\{s^i_r, s^i_\theta, s^i_z\}, R, \Theta, L_z) \tag{A.23}$$
The Virial bending moment is then the derivative of the free energy with respect to Θ,
$$M \equiv \frac{\partial F}{\partial \Theta} \tag{A.24}$$
$$F = -k_B T \ln\left\{\frac{\Omega^N}{\Lambda^{3N}N!}\int d^{3N}s_i\, \exp\left[-\frac{V(\{s^i_r, s^i_\theta, s^i_z\}, R, \Theta, L_z)}{k_B T}\right]\right\} \tag{A.25}$$
Again, one finds that M is simply the ensemble average of the derivative of the potential energy with respect to Θ,
$$M = \left\langle\frac{\partial V}{\partial \Theta}\right\rangle \tag{A.26}$$
The derivative ∂V/∂Θ can be expressed in terms of the real coordinates of the atoms,
$$\frac{\partial V}{\partial \Theta} = \frac{1}{\Theta}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N}\left[-\frac{\partial V}{\partial(x_i - x_j)}\left(y_i\theta_i - y_j\theta_j + \rho\cos\theta_i - \rho\cos\theta_j\right) + \frac{\partial V}{\partial(y_i - y_j)}\left(x_i\theta_i - x_j\theta_j - \rho\sin\theta_i + \rho\sin\theta_j\right)\right] \tag{A.27}$$
Hence one arrives at the Virial bending moment expression,
$$M = \frac{1}{\Theta}\left\langle\sum_{i=1}^{N-1}\sum_{j=i+1}^{N}\left[-\frac{\partial V}{\partial(x_i - x_j)}\left(y_i\theta_i - y_j\theta_j + \rho\cos\theta_i - \rho\cos\theta_j\right) + \frac{\partial V}{\partial(y_i - y_j)}\left(x_i\theta_i - x_j\theta_j - \rho\sin\theta_i + \rho\sin\theta_j\right)\right]\right\rangle \tag{A.28}$$
There is no velocity term in Eq. (A.28) because modifying Θ does not change the volume Ω of the wire. This expression is verified numerically in Appendix D in the zero-temperature limit, when the free energy equals the potential energy. The corresponding quantity in continuum elasticity theory can be written in terms of an integral over the volume Ω of the simulation cell,
$$M = Q_{z\theta} = \frac{1}{\Theta}\int_A dA \int_0^\Theta d\theta\,\left(-y\,\sigma_{x\theta} + x\,\sigma_{y\theta}\right) = \frac{1}{\Theta}\int_A dA \int_0^\Theta d\theta\, r\,\sigma_{\theta\theta} = \frac{1}{\Theta}\int_\Omega \sigma_{\theta\theta}\, dV \tag{A.29}$$
where A is the cross-section area of the continuum body subjected to b-PBC. The correspondence between Eqs. (A.28) and (A.29) bears a strong resemblance to the correspondence between Eqs. (A.10) and (A.15). Similar to τ, M also corresponds to a linear combination of the first moments of the stress field over the simulation cell volume.
A.4 Numerical Results
This section demonstrates the usefulness of the t-PBC and b-PBC described above through torsion and bending Molecular Dynamics simulations of Si nanowires (NWs) taken to failure.
The interactions between Si atoms are described by the modified embedded-atom-method (MEAM) potential [7], which has been found to be more reliable in the study of the failure of Si NWs than several other potential models for Si [48]. Two NWs both
oriented along the [111] direction with diameters D = 7.5 nm and D = 10 nm and
the same aspect ratio Lz/D = 2.27 were considered. To make sure the NW surface
is well reconstructed, the NWs are annealed by MD simulations at 1000 K for 1 ps
followed by a conjugate gradient relaxation. Simulation results on initially perfect
NWs under torsion and bending deformation at T = 300 K are presented.
A.4.1 Si Nanowire under Torsion
Simulations of Si NWs under torsion can be carried out easily using t-PBC. Before
applying a torsion, the NWs are first equilibrated at the specified temperature and
zero stress (i.e. zero axial force) by MD simulations under PBC where the NW
length is allowed to elongate to accommodate the thermal strain. Fig. A.3(a) and (c)
shows the annealed Si NW structures. Subsequently, torsion is applied to the NW
through t-PBC, where the twist angle φ (between the two ends of the NW) increases in steps of 0.02 radian (≈ 1.15°). For each twist angle, an MD simulation under t-PBC is performed for 2 ps. The Nosé-Hoover thermostat is used to maintain the temperature at T = 300 K using the Störmer-Verlet time integrator [10] with a time step of 1 fs. The linear momentum Pz and angular momentum Jz are conserved to within 2 × 10⁻¹⁰ eV·ps·Å⁻¹ and 9 × 10⁻⁷ eV·ps, respectively, during the simulation. The twist angle continues to increase until the NW fails. If the Virial torque at the end of the 2 ps simulation is lower than that at the beginning of the simulation, the MD simulation is continued in 2 ps increments without increasing the twist angle, until the torque increases. The purpose of this approach is to give enough simulation time to resolve the failure process whenever it occurs. The Virial torque is computed by time averaging over the last 1 ps of the simulation for each twist angle.
The torque versus twist angle relationship is plotted in Fig. A.4.
The τ -φ curve is linear for small values of φ and becomes non-linear as φ ap-
proaches the critical value at failure. The torsional stiffness can be obtained from the
(a) Initial structure, D = 7.5 nm
(b) After failure, D = 7.5 nm, φ = 1.16 rad
(c) Initial structure, D = 10 nm
(d) After failure, D = 10 nm, φ = 1.18 rad
Figure A.3: Snapshots of Si NWs of two diameters before torsional deformation and after failure. The failure mechanism depends on the diameter.
Figure A.4: Virial torque τ as a function of the rotation angle φ between the two ends of the NWs of two different diameters (D = 75 Å, L = 170 Å and D = 100 Å, L = 227 Å). Because the two NWs have the same aspect ratio Lz/D, they have the same maximum strain (on the surface) γmax = φD/(2Lz) at the same twist angle φ.
torque-twist relationship and its value at small φ can be compared to theory. The
torsional stiffness is defined as
$$k_t \equiv \frac{\partial \tau}{\partial \varphi} \tag{A.30}$$
In the limit of φ → 0, the torsional stiffness is estimated to be kt = 5.11 × 10³ eV for D = 7.5 nm and kt = 1.25 × 10⁴ eV for D = 10 nm. Strength of Materials predicts the following relationships for elastically isotropic circular shafts under torsion:
$$\tau = \frac{\varphi}{L_z}\, G J, \qquad k_t = \frac{G J}{L_z} \tag{A.31}$$
where G is the shear modulus and J = πD⁴/32 is the polar moment of inertia. This expression is valid only in the limit of small deformation (φ → 0). To compare the simulation results against this expression, one needs to use the shear modulus of Si given by the MEAM model (C11 = 163.78 GPa, C12 = 64.53 GPa, C44 = 76.47 GPa) on the (111) plane, which is G = 58.57 GPa. The predictions of the torsional stiffness from Strength of Materials are compared with the estimated values from MD simulations in Table A.1. The predictions overestimate the MD results by 25-30%. However, this difference can be easily eliminated by a slight adjustment
(∼ 6%) of the NW diameter D, given that kt ∝ D⁴. The adjusted diameters D* for the two NWs are approximately 6 Å smaller than the nominal diameters D, which corresponds to a reduction of the NW radius by 3 Å. This can be easily accounted for by the inaccuracy in the definition of the NW diameter and the possibility of a weak surface layer on Si NWs [48].
Table A.1: Comparison of the torsional stiffness for Si NWs estimated from MD simulations and that predicted by Strength of Materials (SOM) theory. D* is the adjusted NW diameter that makes the SOM predictions exactly match the MD results. The critical twist angle φc and critical shear strain γc at failure are also listed.

Nominal diameter D   kt (MD)     kt (SOM)    Adjusted diameter D*   φc          γc
7.5 nm               5110 eV     6680 eV     7.0 nm                 1.16 rad    0.26
10.0 nm              12538 eV    15812 eV    9.4 nm                 1.18 rad    0.26
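The SOM column of Table A.1 can be reproduced from Eq. (A.31) with a few lines of arithmetic; the script below is an illustrative check (not thesis code) using the values quoted in the text:

```python
import numpy as np

G = 58.57e9                    # MEAM shear modulus on the (111) plane, Pa
eV = 1.602176634e-19           # joules per eV
for D in (7.5e-9, 10.0e-9):    # nominal diameters, m
    Lz = 2.27 * D              # aspect ratio Lz/D = 2.27
    J = np.pi * D**4 / 32      # polar moment of inertia
    kt = G * J / Lz / eV       # torsional stiffness in eV per radian
    print(f"D = {D * 1e9:.1f} nm: kt(SOM) ~ {kt:.0f} eV")
# prints values close to the SOM column of Table A.1
# (about 6.7e3 eV and 1.58e4 eV)
```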
The above agreement gives us confidence in the use of Strength of Materials to
describe the behavior of NWs under torsion. Hence, it can be used to extract the
critical strain in both NWs at failure. The maximum strain (engineering strain) in a
cylindrical torsional shaft occurs on its surface,
$$\gamma_{\max} = \frac{\varphi D}{2 L_z} \tag{A.32}$$
Given that the aspect ratio of NWs is kept at Lz/D = 2.27, one has
γmax = 0.22φ (A.33)
for both NWs. The critical twist angle and critical strain at failure for both NWs are
listed in Table A.1.
The critical shear strain at failure is expected to be independent of the shaft di-
ameter for large diameters. This seems to hold remarkably well in the NW torsion
simulations. Because the NW under t-PBC has no “ends”, failure can initiate any-
where along the NW. However, different failure mechanisms are observed in the two NWs with different diameters. The thinner NW fails by sliding along a (111) plane,
as seen in Fig. A.3(b). The thicker NW fails by sliding both along a (111) plane
and along longitudinal planes, creating wedges on the (111) cross section, as seen
in Fig. A.3(d). The failure mechanism of the thicker NW is also more gradual than
that of the thinner NW. As can be observed in Fig. A.4, the torque is completely
relieved on the thinner NW when failure occurs, whereas the thicker NW experiences
a sequence of failures. A more detailed analysis on the size dependence of NW failure
modes and their mechanisms will be presented in a subsequent paper.
A.4.2 Si Nanowire under Bending
Simulations of Si NWs can be carried out using b-PBC just as was done for torsion.
The Si NWs are equilibrated in the same way as described in the previous section
before applying bending through b-PBC. The bending angle Θ (between the two ends of the NW) increases in steps of 0.02 radian (≈ 1.15°). For each bending angle, MD simulations under b-PBC were performed for 2 ps. The linear momentum Pz and angular momentum Jz are conserved to the same level of precision as in the torsion simulations. The bending angle continues to increase until the NW fails. If the Virial bending moment at the end of the 2 ps simulation is lower than that at the beginning of the simulation, the MD simulation is continued in 2 ps increments without increasing the bending angle, until the bending moment increases. The purpose of this approach is to give enough simulation time to resolve the failure process whenever it occurs. The Virial bending moment is computed by a time average over the last 1 ps of the simulation for each bending angle. The bending moment versus bending angle relationship is plotted in Fig. A.5.
The M -Θ curve is linear for small values of Θ and becomes non-linear as Θ ap-
proaches the critical value at failure. The bending stiffness can be computed from
the M -Θ curve and its value at small Θ can be compared to theory. Similar to the
torsional stiffness in the previous section, define a bending stiffness as
$$k_b \equiv \frac{\partial M}{\partial \Theta} \tag{A.34}$$
In the limit of Θ → 0, the bending stiffness is estimated to be kb = 8.12 × 10³ eV for
Figure A.5: Virial bending moment M as a function of the bending angle Θ between the two ends of the two NWs with different diameters (D = 75 Å, L = 170 Å and D = 100 Å, L = 227 Å). Because the two NWs have the same aspect ratio Lz/D, they have the same maximum strain εmax = ΘD/(2Lz) at the same bending angle Θ.
D = 7.5 nm and kb = 1.96 × 10⁴ eV for D = 10 nm. Strength of Materials predicts the following relationships for an elastically isotropic beam under bending,
$$M = \frac{\Theta}{L_0}\, E I_z, \qquad k_b = \frac{E I_z}{L_0} \tag{A.35}$$
where E is the Young's modulus and Iz = πD⁴/64 is the moment of inertia of the NW cross section around the z-axis. To compare the simulation results against this expression, one needs to use the Young's modulus of Si given by the MEAM model along the [111] direction, which is 181.90 GPa. The predictions of the bending stiffness from Strength of Materials are compared with the estimated values from MD simulations in Table A.2. The predictions overestimate the MD results by 23-25%. But this difference can be easily eliminated by a slight adjustment (∼ 5%) of the NW diameter D, given that kb ∝ D⁴. The adjusted diameters D* for the two NWs are approximately 5 Å smaller than the nominal diameters D, which corresponds to a reduction of the NW radius by 2.5 Å. It is encouraging to see that the adjusted diameters from the torsion simulations match those for the bending simulations reasonably well.
The above agreement gives us confidence in the use of Strength of Materials theory
to describe the behavior of NW under bending. Hence it can be used to extract the
Table A.2: Comparison of the bending stiffnesses for Si NWs estimated from MD simulations and that predicted by Strength of Materials (SOM) theory. D* is the adjusted NW diameter that makes the SOM predictions exactly match the MD results. The critical bending angle Θf and critical normal strain εf at fracture are also listed.

Nominal diameter D   kb (MD)     kb (SOM)    Adjusted diameter D*   Θf          εf
7.5 nm               8117 eV     10374 eV    7.1 nm                 0.96 rad    0.21
10.0 nm              19619 eV    24554 eV    9.5 nm                 0.76 rad    0.17
critical strain experienced by both NWs at the point of fracture. Based on the
Strength of Materials theory, the maximum strain (engineering strain) of a beam in
pure bending occurs at the points furthest away from the bending axis,
$$\varepsilon_{\max} = \frac{\Theta D}{2 L_0} \tag{A.36}$$
Since the aspect ratio of NWs is kept at L0/D = 2.27, one has
εmax = 0.22 Θ (A.37)
for both NWs. The critical bending angle and critical normal strain at failure for
both NWs are listed in Table A.2. The critical strain at fracture is similar to results
obtained from MD simulations of Si NWs under uniaxial tension, εf = 0.18, also
using the MEAM model [48]. The higher critical stress value observed in the thinner
NW in bending is related to the higher stress gradient across its cross section.
Fig. A.6 shows the atomic structure of the NWs right before and right after frac-
ture. The much larger critical strain observed in the thinner NW is related to the formation of metastable hillocks on the compressive side of the NW, as shown in Fig. A.6(a). It seems that the formation of hillocks relieves some bending strain
and allows the thinner NW to deform further without causing fracture. In fact, the
onset of hillock formation in the thinner NW happens at the same rotation angle
(Θ = 0.76 rad) as the angle at which the thicker NW fractures.
(a) Before fracture, D = 7.5 nm, φ = 0.94 rad
(b) After fracture, D = 7.5 nm, φ = 0.96 rad
(c) Before fracture, D = 10 nm, φ = 0.74 rad
(d) After fracture, D = 10 nm, φ = 0.76 rad
Figure A.6: Snapshots of Si NWs of two diameters under bending deformation before and after fracture. While metastable hillocks form on the thinner NWs before fracture (a), this does not happen for the thicker NW (c).
A.5 Summary
In this appendix a unified approach to handle torsion and bending of wires in atom-
istic simulations by generalizing the Born-von Karman periodic boundary conditions
to cylindrical coordinates has been presented. The expressions for the torque and
bending moments in terms of an average over the entire simulation cell were derived,
in close analogy to the Virial stress expression. Molecular Dynamics simulations un-
der these new boundary conditions show several failure modes of Silicon nanowires
under torsion and bending, depending on the nanowire diameter. These simulations
are able to probe the intrinsic behavior of nanowires because the artificial end effects
are completely removed.
Bibliography
[1] J. Ahrens, B. Geveci, and C. Law. Paraview: An end user tool for large data
visualization. Technical report, Academic Press, 2005.
[2] M. P. Allen and D. J. Tildesley. Computer Simulation of Liquids. Oxford Uni-
versity Press, 2007.
[3] A. A. Amsden and F. H. Harlow. The SMAC method: a numerical technique
for calculating incompressible flows. Technical Report LA-4370, Los Alamos
National Laboratory, 1970.
[4] ATI. Radeon X1900 product site, 2006.
http://www.ati.com/products/radeonx1900/index.html .
[5] ATITool. techpowerup.com, 2006.
http://www.techpowerup.com/atitool.
[6] John Aycock. A brief history of just-in-time. ACM Comput. Surv., 35(2):97–113,
2003.
[7] M. I. Baskes. Modified embedded-atom potentials for cubic materials and impu-
rities. Phys. Rev. B, 46:2727–2742, 1992.
[8] Nathan Bell and Michael Garland. Efficient sparse matrix-vector multiplication
on cuda. Technical report, NVIDIA, 2008.
[9] J. Bolz, I. Farmer, E. Grinspun, and P. Schroder. Sparse matrix solvers on the
GPU: conjugate gradients and multigrid. In SIGGRAPH ’03: ACM SIGGRAPH
Papers, pages 917–924, New York, NY, USA, 2003. ACM.
[10] S. D. Bond, B. J. Leimkuhler, and B. B. Laird. The nose-poincare method for
constant temperature molecular dynamics. J. Comput. Phys., 151:114–134, 1999.
[11] T. Brandvik and G. Pullan. Acceleration of a 3d euler solver using commod-
ity graphics hardware. In 46th AIAA Aerospace Sciences Meeting and Exhibit,
January 2008.
[12] I. Buck. High level languages for GPUs. In SIGGRAPH ’05: ACM SIGGRAPH
2005 Courses, page 109, New York, NY, USA, 2005. ACM Press.
[13] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Han-
rahan. Brook for GPUs: Stream computing on graphics hardware. ACM Trans-
actions on Graphics, 23(3):777 – 786, 2004 2004.
[14] Ian Buck, Kayvon Fatahalian, and Pat Hanrahan. Gpubench: Evaluating gpu
performance for numerical and scientific applications. In Poster Session at GP2
Workshop on General Purpose Computing on Graphics Processors, 2004.
http://gpubench.sourceforge.net.
[15] V. V. Bulatov and W. Cai. Computer Simulations of Dislocations. Oxford
University Press, 2006.
[16] M. H. Carpenter, D. Gottlieb, and S. Abarbanel. Time-stable boundary condi-
tions for finite-difference schemes solving hyperbolic systems: Methodology and
application to high-order compact schemes. J. Comput. Phys., 111(2):220–236,
1994.
[17] M. Chau, O. Englander, and L. Lin. Silicon nanowire-based nanoactuator. In
Proceedings of the 3rd IEEE conference on nanotechnology, volume 2, pages 879–
880, San Francisco, CA, Aug 12-14 2003.
[18] K. S. Chueng and S. Yip. Atomic-level stress in an inhomogeneous system. J.
Appl. Phys., 70:5688–90, 1991.
[19] J.-F. Collard and D. Lavery. Optimizations to prevent cache penalties for the
intel itanium 2 processor. pages 105–114, March 2003.
[20] J. Cormier, J. M. Rickman, and T. J. Delph. Stress calculation in atomistic
simulations of perfect and imperfect solids. J. Appl. Phys., 89:99–104, 2001.
[21] Y. Cui, Q. Wei, H. Park, and C. M. Lieber. Nanowire nanosensors for highly
sensitive and selective detection of biological and chemical species. Science,
293:1289–1292, 2001.
[22] Y. Cui, Z. Zhong, D. Wang, W. U. Wang, and C. M. Lieber. High performance
silicon nanowire field effect transistors. Nano Letters, 3:149–152, 2003.
[23] W. Ding, L. Calabri, X. Chen, K. Kohlhass, and R. S. Ruoff. Mechanics of
crystalline boron nanowires. presented at the 2006 MRS spring meeting, San
Francisco, CA, 2006.
[24] David Dobkin and Michael Laszlo. Primitives for the manipulation of three-
dimensional subdivisions. Algorithmica, 4(1-4):3–32, 1989.
[25] T. Dumitrica and R. D. James. Objective molecular dynamics. J. Mech. Phys.
Solids, 55:2206–2236, 2007.
[26] Carter Edwards. Sierra framework version 3: Core services theory and design.
Technical report, Sandia National Laboratories, 2002.
[27] E. Elsen, V. Vishal, E. Darve, P. Hanrahan, V. Pande,
and I. Buck. GROMACS on the GPU, 2005.
http://bcats.stanford.edu/pdf/BCATS 2005 abstract book.pdf.
[28] A. Fan, F. Qiu, A. Kaufman, and S. Yoakum-Stover. GPU Cluster for High
Performance Computing. SC, 00:47, 2004.
[29] R. Fan, R. Karnik, M. Yue, D. Y. Li, A. Majumdar, and P. D. Yang. DNA
translocation in inorganic nanotubes. Nano Letters, 5:1633–1637, 2005.
[30] H. Fujitani, Y. Tanida, M. Ito, G. Jayachandran, C. D. Snow, M. R. Shirts,
E. J. Sorin, and V. S. Pande. Direct Calculation of the Binding Free Energies of
FKBP Ligands. J. Chem. Phys., 123(8):84108, 2005.
[31] T. Fukushige, J. Makino, and A. Kawai. GRAPE-6A: A Single-Card GRAPE-
6 for Parallel PC-GRAPE Cluster Systems. Publications of the Astronomical
Society of Japan, 57:1009–1021, dec 2005.
[32] Dominik Goddeke, Robert Strzodka, and Stefan Turek. Accelerating double
precision FEM simulations with GPUs. In Proceedings of ASIM 2005 - 18th
Symposium on Simulation Technique, September 2005.
[33] D. Rodriguez Gomez, E. Darve, and A. Pohorille. Assessing the efficiency of free
energy calculation methods. Journal of Chemical Physics, 120(8):3563–78, Feb
2004.
[34] N. Goodnight, C. Woolley, G. Lewin, D. Luebke, and G. Humphreys. A multigrid
solver for boundary value problems using programmable graphics hardware. In
SIGGRAPH ’05: ACM SIGGRAPH 2005 Courses, pages 193–203, 2005.
[35] M. J. Harris, W. V. Baxter, T. Scheuermann, and A. Lastra. Simulation of cloud
dynamics on graphics hardware. In HWWS ’03: Proceedings of the ACM SIG-
GRAPH/EUROGRAPHICS conference on Graphics hardware, pages 92–101,
2003.
[36] Michael A. Heroux, Roscoe A. Bartlett, Vicki E. Howle, Robert J. Hoekstra,
Jonathan J. Hu, Tamara G. Kolda, Richard B. Lehoucq, Kevin R. Long, Roger P.
Pawlowski, Eric T. Phipps, Andrew G. Salinger, Heidi K. Thornquist, Ray S.
Tuminaro, James M. Willenbring, Alan Williams, and Kendall S. Stanley. An
overview of the trilinos project. ACM Trans. Math. Softw., 31(3):397–423, 2005.
[37] Jan Hesthaven and Tim Warburton. Nodal Discontinuous Galerkin Methods:
Algorithms, Analysis and Applications. Springer, 2008.
[38] M. F. Horstemeyer, J. Lim, W. Y. Lu, D. A. Mosher, M. I. Baskes, V. C. Prantil,
and S. J. Plimpton. Torsion/simple shear of single crystal copper. J. Eng. Mater.
Tech., 124:322–328, 2002.
[39] Y. Huang and C. M. Lieber. Integrated nanoscale electronics and optoelectronics:
Exploring nanoscale science and technology through semiconductor nanowires.
Pure Appl. Chem, 76:2051–2068, 2004.
[40] M. Huhtala, A. Kuronen, and K. Kaski. Dynamical simulations of carbon nan-
otube bending. Int. J. Modern Phys. C, 15:517–534, 2004.
[41] IBM. Cell Broadband Engine Programming Handbook, 1.11 edition, May 2008.
[42] Peta Computing Institute. Mdgrape-3 pci-x, 2006.
[43] Intel. Intel Pentium 4 Thermal Management, 2006.
http://www.intel.com/support/processors/pentium4/sb/CS-007999.htm.
[44] Y. Isono, M. Kiuchi, and S. Matsui. Development of electrostatic actuated nano
tensile testing device for mechanical and electrical characterstics of FIB deposited
carbon nanowire. presented at the 2006 MRS spring meeting, San Francisco, CA,
2006.
[45] Hrvoje Jasak, Aleksandar Jemcov, and Zeljko Tukovic. Openfoam: A c++ li-
brary for complex physics simulations. volume 47, 2007.
[46] P. M. Jeff and N. A. Fleck. The failure of composite tubes due to combined
compression and torsion. J. Mater. Sci., 29:3080–3084, 1994.
[47] Yunfei Chen, Juekuan Yang, Yujuan Wang. Accelerated molecular dynamics
simulation of thermal conductivities. Journal of Computational Physics, 2006.
doi:10.1016/j.jcp.2006.06.039.
[48] K. Kang and W. Cai. Brittle and ductile fracture of semiconductor nanowires –
molecular dynamics simulations. Philosophical Magazine, 87:2169–2189, 2007.
[49] Y. Khalighi, G. Iaccarino, and P. Moin. Comparison of Lattice Boltzmann
Method and conventional CFD techniques. APS Meeting Abstracts, pages K7+,
November 2004.
[50] T. Kizuka, Y. Takatani, K. Asaka, and R. Yoshizaki. Measurements of the
atomistic mechanics of single crystalline silicon wires of nanometer width. Phys.
Rev. B, 72:035333–1–6, 2005.
[51] J. Kruger and R. Westermann. Linear Algebra Operators for GPU Implementa-
tion of Numerical Algorithms. In ACM Transactions on Graphics (Proceedings
of SIGGRAPH), pages 908–916, July 2003.
[52] Orion S. Lawlor, Sayantan Chakravorty, Terry L. Wilmarth, Nilesh Choudhury,
Isaac Dooley, Gengbin Zheng, and Laxmikant V. Kale. ParFUM: a parallel frame-
work for unstructured meshes for scalable dynamic physics applications. Eng.
with Comput., 22(3):215–235, 2006.
[53] A. Lefohn. GPU data structures. In GPGPU: General-Purpose Computation
on Graphics Hardware Tutorial, Int. Conf. for High Perf. Comput., Netw., Stor.
and Anal., Nov. 2006.
[54] W. Li, Z. Fan, X. Wei, and A. Kaufman. GPU Gems 2, chapter 47, GPU-based
flow simulation with complex boundaries, pages 747–764. Addison-Wesley, 2005.
[55] W. Li, X. Wei, and A. Kaufman. Implementing lattice boltzmann computation
on graphics hardware. Visual Comput., 19:444–456, 2003.
[56] E. Lindahl, B. Hess, and D. van der Spoel. GROMACS 3.0: A package for
molecular simulation and trajectory analysis. J. Mol. Mod., 7:306–317, 2001.
[57] Y. Liu, X. Liu, and E. Wu. Real-time 3D fluid simulation on GPU with complex
obstacles. In 12th Pacific Conference on Computer Graphics and Applications,
6-8 Oct. 2004, Seoul, South Korea, pages 247–256, 2004.
[58] K. Long. Sundance 2.0 tutorial. Technical Report SAND2004-4793, Sandia
National Laboratories, 2004.
[59] D. Luebke, M. Harris, J. Kruger, T. Purcell, N. Govindaraju, I. Buck, C. Woolley,
and A. Lefohn. GPGPU: general purpose computation on graphics hardware. In
SIGGRAPH ’04: ACM SIGGRAPH 2004 Course Notes, page 33, 2004.
[60] M. A. Makeev and D. Srivastava. Silicon carbide nanowires under external loads:
An atomistic simulation study. Phys. Rev. B, 74:165303, 2006.
[61] J. Makino, T. Fukushige, M. Koga, and K. Namura. GRAPE-6: Massively-
Parallel Special-Purpose Computer for Astrophysical Particle Simulations. Pub-
lications of the Astronomical Society of Japan, 55:1163–1187, December 2003.
[62] Junichiro Makino, Eiichiro Kokubo, and Toshiyuki Fukushige. Performance evaluation
and tuning of GRAPE-6 - towards 40 "real" TFlops. In SC '03: Proceedings of
the 2003 ACM/IEEE conference on Supercomputing, page 2, Washington, DC,
USA, 2003. IEEE Computer Society.
[63] G. Marc and W. G. McMillan. The virial theorem. Adv. Chem. Phys., 58:209–
361, 1985.
[64] A.C. Marta and J.J. Alonso. High-speed MHD flow control using adjoint-
based sensitivities. AIAA paper 2006-8009, 14th AIAA/AHI International Space
Planes and Hypersonic Systems and Technologies Conference, Canberra, Aus-
tralia, November 2006.
[65] K. Mattsson, M. Svard, M. Carpenter, and J. Nordstrom. Accuracy requirements
for transient aerodynamics. In 16th AIAA Computational Fluid Dynamics Con-
ference, Orlando, FL, June 2003.
[66] Stephen McMillan. The Vectorization of Small-N Integrators. In Piet Hut and
Stephen McMillan, editors, The Use of Supercomputers in Stellar Dynamics,
pages 156–161, 1986.
[67] C. McNairy and D. Soltis. Itanium 2 processor microarchitecture. IEEE Micro,
23(2):44–55, March-April 2003.
[68] Jeffrey M. McNally, L.E. Garey, and R.E. Shaw. A communication-less parallel
algorithm for tridiagonal Toeplitz systems. Journal of Computational and Applied
Mathematics, 212(2):260–271, 2008.
[69] Microsoft. DirectX home page, 2003. http://www.microsoft.com/windows/directx/default.asp.
[70] Microsoft. Pixel Shader 3.0 specification on MSDN, 2006.
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/directx_sdk.asp.
[71] Gordon Moore. Cramming more components onto integrated circuits. Electronics,
38(8), April 19, 1965.
[72] K. Mylvaganam, T. Vodenitcharova, and L. C. Zhang. The bending-kinking
analysis of a single-walled carbon nanotube - a combined molecular dynamics and
continuum mechanics technique. J. Mater. Sci., 41:3341–3347, 2006.
[73] A. Nakatani and H. Kitagawa. Atomistic study of size effect in torsion tests of
nanowire. XXI ICTAM, 15-21 August 2004.
[74] Keigo Nitadori, Junichiro Makino, and Piet Hut. Performance tuning of n-body
codes on modern microprocessors: I. direct integration with a Hermite scheme
on x86_64 architectures, November 2005. http://arxiv.org/abs/astro-ph/0511062.
[75] Keigo Nitadori, Junichiro Makino, and Piet Hut. Performance tuning of n-body
codes on modern microprocessors: I. direct integration with a Hermite scheme
on x86_64 architecture. New Astron., 12:169, 2006.
[76] J. Nordstrom, E. van der Weide, J. Gong, and M. Svard. A hybrid method
for the unsteady compressible Navier-Stokes equations. Annual CTR Research
Briefs, Center for Turbulence Research, Stanford, 2007.
[77] T. Nozaki, M. Doyama, and Y. Kogure. Computer simulation of high-speed
bending deformation in copper. Radiation Effects and Defects in Solids, 157:217–
222, 2002.
[78] NVIDIA. CUDA Programming Guide 1.1, November 2007.
http://developer.download.nvidia.com/compute/cuda/1_1/NVIDIA_CUDA_Programming_Guide_1.1.pdf.
[79] J.D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. E. Lefohn, and
T. J. Purcell. A survey of general-purpose computation on graphics hardware.
Computer Graphics Forum, 26(1):80–113, 2007.
[80] M. Parrinello and A. Rahman. Polymorphic transitions in single crystals: a new
molecular dynamics method. J. Appl. Phys., 52:7182–7190, 1981.
[81] M. Rumpf and R. Strzodka. Nonlinear diffusion in graphics hardware. In Pro-
ceedings of EG/IEEE TCVG Symposium on Visualization VisSym ’01, pages
75–84, 2001.
[82] C. E. Scheidegger, J. L. D. Comba, and R. D. da Cunha. Practical CFD Simu-
lations on Programmable Graphics Hardware using SMAC. Computer Graphics
Forum, 24(4):715–728, 2005.
[83] Kirk Schloegel, George Karypis, and Vipin Kumar. Parallel static and dynamic
multi-constraint graph partitioning. Concurrency and Computation: Practice
and Experience, 14(3):219–240, 2002.
[84] Patrick Schmid. Does cache size really boost performance?, October 2007.
[85] Anand Lal Shimpi. 6MB L2 vs. 3MB L2, 2008.
[86] J.W. Sias, Sain-Zee Ueng, G.A. Kent, I.M. Steiner, E.M. Nystrom, and W.-M.W.
Hwu. Field-testing IMPACT EPIC research results in Itanium 2. Pages 26–37, June
2004.
[87] Christopher D. Snow, Eric J. Sorin, Young Min Rhee, and Vijay S. Pande. How
Well Can Simulation Predict Protein Folding Kinetics and Thermodynamics?
Ann. Rev. Biophys. Biomol. Struc., 34:43–69, 2005.
[88] A. Sommerfeld. Partial Differential Equations in Physics, Lectures on Theoretical
Physics, volume VI. Academic Press, 1964.
[89] J. Stam. Stable fluids. In SIGGRAPH, pages 121–128, July 1999.
[90] J. S. Stölken and A. G. Evans. A microbend test method for measuring the
plasticity length scale. Acta Mater., 46:5100–5115, 1998.
[91] M. Svard, K. Mattsson, and J. Nordstrom. Steady-state computations using
summation-by-parts operators. J. Sci. Comput., 24(1):79–95, 2005.
[92] M. Taiji, T. Narumi, Y. Ohno, N. Futatsugi, A. Suenaga, N. Takada, and A. Kon-
agaya. Protein Explorer: A petaflops special-purpose computer system for molec-
ular dynamics simulations. In SC ’03: Proceedings of the 2003 ACM/IEEE con-
ference on Supercomputing, 2003.
[93] D. H. Tsai. Virial theorem and stress calculation in molecular-dynamics. J.
Chem. Phys., 70:1375–82, 1979.
[94] D. Wang, Q. Wang, A. Javey, R. Tu, and H. Dai. Germanium nanowire field-
effect transistors with SiO2 and high-κ HfO2. Appl. Phys. Lett., 83:2432–2434,
2003.
[95] C. Zhang and H. Shen. Buckling and postbuckling analysis of single-walled
carbon nanotubes in thermal environments via molecular dynamics simulation.
Carbon, 44:2608–2616, 2006.
[96] Y. Zhu and H. D. Espinosa. An electromechanical material testing system for in
situ electron microscopy and applications. Proc. Nat’l. Acad. Sci., 102:14503–
14508, 2005.
[97] J. A. Zimmerman, E. B. Webb III, J. J. Hoyt, R. E. Jones, P. A. Klein, and D. J.
Bammann. Calculation of stress in atomistic simulation. Modell. Simul. Mater.
Sci. Eng., 12:S319–332, 2004.