An optimized GPU implementation of a 2D free surface
simulation model on unstructured meshes
A. Lacasta∗, M. Morales-Hernandez∗, J. Murillo∗, P. Garcia-Navarro∗
Abstract
This work concerns the implementation of a finite volume method to solve the 2D shallow water equations on Graphics Processing Units (GPU). The strategy is fully oriented to work efficiently with unstructured meshes, which are widely used in many fields of engineering. Due to the design of GPU cards, structured meshes are better suited to work with than unstructured meshes. In order to overcome this situation, some strategies based on introducing a certain ordering on the unstructured meshes are proposed and analyzed in terms of computational gain. The necessity of performing the simulations using unstructured rather than structured meshes is also justified by means of some test cases with analytic solution.
Keywords: GPU, Finite volume methods, Unsteady flow, Unstructured meshes,
Dry/wet boundaries, CUDA, High Performance Computing
1. Introduction
Physically based simulations of complex systems usually require large computational facilities to be completed in a reasonable time. Moreover, when the simulated phenomenon is unsteady and based on a dynamical estimation of the updating time step, the computational performance is an important topic to be taken into account. One of the most widespread strategies to reduce the computational cost is the use of parallel techniques, involving a suitable number of processors. Since CPU frequencies seem to be reaching their maximum capacity [1], many-core parallel techniques nowadays appear to be an interesting option.
∗LIFTEC-EINA, CSIC-Universidad Zaragoza
Email addresses: [email protected] (A. Lacasta), [email protected] (M.
Morales-Hernandez), [email protected] (J. Murillo), [email protected] (P.
Garcia-Navarro)
Preprint submitted to Advances in Engineering Software June 2, 2014
In recent years, the Graphics Processing Unit (GPU) has been used to accelerate calculations because of its inherently vector-oriented design. This paradigm is known as General-Purpose Computing on Graphics Processing Units (GPGPU) and it is widely used for a very large range of applications in CFD, such as [2], [3], [4] and [5], as well as other environmental applications such as [6]. In the present work, special attention is paid to the application of these GPUs to unsteady flows of interest in hydraulics. Shallow water models in particular are widely used to simulate surface geophysical flows. These situations usually involve large domains and long time scales. Practical applications require a compromise between spatial accuracy and computational efficiency. In order to achieve the necessary spatial resolution, rather fine grids become necessary in many cases, requiring more data storage, increasing proportionally the number of operations and reducing the allowable time step size for explicit calculations. When, at the same time, a reasonable computational time is desired, the use of GPU codes is one of the options for computing problems with large spatial and temporal domains.
The idea of accelerating the calculations in unsteady hydraulic simulation using multiple CPUs was recently reported in [7], [8] or [9], as well as using GPUs in [10][11][12][5]. Although a very good compromise between the number of CPUs used and performance is offered by the former option, the cost of using multiple CPUs is significant due to the hardware investment and associated use. Alternatively, GPU technology offers the performance of smaller clusters with less disbursement [13]. The main difficulty, and apparent drawback, when porting codes from CPU to GPU is the cell order required by the GPU to process data efficiently. This drawback is not present when dealing with structured meshes due to their inherent order, and a simple and efficient implementation is relatively easy to obtain.
Despite the wide use of structured meshes, complex geometries for internal or external boundaries are difficult to represent without unstructured meshes. Moreover, when dealing with topographic representation, some recent works [14] have shown the benefit of using unstructured meshes in unsteady hydraulic simulations over irregular topography. The quality of the numerical results is sensitive to the grid resolution, hence grid refinement is clearly an option to modify the overall resolution. In that sense, adaptive grid refinement, readily available when using triangular unstructured meshes [15] and designed to follow local bed variations or irregular boundaries, can be very useful. The present work is motivated by the implementation on GPU of a code able to perform unsteady hydraulic simulations on variable density triangular unstructured meshes.
The performance of GPU based calculations with double precision (double) is lower than those that use single precision (float) [16], [5]. In the particular case of the 2D shallow water equations with source terms [17], [18], the use of float is not always desirable. In fact, when simulating complex topography cases, wave propagation over dry beds represents a numerical challenge. Past experience with the dynamical stability control of such transient flows involving wet/dry fronts indicates that double precision is always required. All the performance analyses presented will deal with that kind of data.
In the first part of the text, the governing equations are outlined. They are followed by a description of the finite volume updating scheme used. Then, the most relevant general aspects of the implementation on GPU are identified. The particular difficulties encountered when dealing with triangular unstructured meshes, and some improvements to overcome them, are detailed in the following section. Finally, they are applied to two test cases in order to prove their behaviour when using unstructured meshes.
2. Mathematical model/Governing equations
The two-dimensional SW equations, which represent depth averaged mass and momentum conservation, can be written as follows:

$$\frac{\partial \mathbf{U}}{\partial t} + \vec{\nabla}\cdot\mathbf{E} = \mathbf{H} \qquad (1)$$

where

$$\mathbf{U} = (h, q_x, q_y)^T \qquad (2)$$

are the conserved variables, with $h$ representing the water depth, $q_x = hu$, $q_y = hv$ and $\mathbf{u} = (u, v)$ the depth averaged velocity vector along the $(x, y)$ coordinates respectively. The fluxes of these variables are $\mathbf{E} = (\mathbf{F}, \mathbf{G})$, given by:
$$\mathbf{F} = \left(q_x,\ \frac{q_x^2}{h} + \frac{1}{2}gh^2,\ \frac{q_x q_y}{h}\right)^T, \qquad \mathbf{G} = \left(q_y,\ \frac{q_x q_y}{h},\ \frac{q_y^2}{h} + \frac{1}{2}gh^2\right)^T \qquad (3)$$
where $g$ is the acceleration due to gravity.
The source terms of the system are the bed slope and the friction terms:
$$\mathbf{H} = \left(0,\ -gh\frac{\partial z}{\partial x} - \frac{\tau_{b,x}}{\rho_w},\ -gh\frac{\partial z}{\partial y} - \frac{\tau_{b,y}}{\rho_w}\right)^T \qquad (4)$$
where $\tau_{b,x}$ and $\tau_{b,y}$ are the components of the bed friction stress and $\rho_w$ is the water density. These friction losses along both $(x, y)$ axes are written in terms of Manning's roughness coefficient $n$:
$$\frac{\tau_{b,x}}{\rho_w} = ghn^2 u \frac{\sqrt{u^2 + v^2}}{h^{4/3}}, \qquad \frac{\tau_{b,y}}{\rho_w} = ghn^2 v \frac{\sqrt{u^2 + v^2}}{h^{4/3}} \qquad (5)$$
3. Numerical scheme
The numerical resolution of system (1) can be obtained by means of a first order upwind finite volume scheme. Integrating over a volume or grid cell Ω, the numerical scheme can be expressed compactly as:
$$\frac{\partial}{\partial t}\int_{\Omega} \mathbf{U}\, d\Omega + \sum_{k=1}^{NE} (\delta\mathbf{E} - \mathbf{T})_k \cdot \mathbf{n}_k\, l_k = 0 \qquad (6)$$
It is possible to define a Jacobian matrix $\mathbf{J}_k$ of the normal flux at each edge as a result of the local linearization

$$\delta(\mathbf{E}\cdot\mathbf{n})_k = \mathbf{J}_k\, \delta\mathbf{U}_k \qquad (7)$$
and to diagonalize it in terms of matrices $\mathbf{P}$ and $\boldsymbol{\Lambda}$, formed by its eigenvectors $\mathbf{e}^m$ and eigenvalues $\lambda^m$ respectively:

$$\mathbf{P} = \begin{pmatrix} 1 & 0 & 1 \\ u - c\,n_x & -c\,n_y & u + c\,n_x \\ v - c\,n_y & c\,n_x & v + c\,n_y \end{pmatrix}, \qquad \boldsymbol{\Lambda} = \begin{pmatrix} \lambda^1 & 0 & 0 \\ 0 & \lambda^2 & 0 \\ 0 & 0 & \lambda^3 \end{pmatrix},$$

$$\mathbf{e}^1 = \begin{pmatrix} 1 \\ u - c\,n_x \\ v - c\,n_y \end{pmatrix}, \quad \mathbf{e}^2 = \begin{pmatrix} 0 \\ -c\,n_y \\ c\,n_x \end{pmatrix}, \quad \mathbf{e}^3 = \begin{pmatrix} 1 \\ u + c\,n_x \\ v + c\,n_y \end{pmatrix},$$

$$\lambda^1 = \mathbf{u}\cdot\mathbf{n} - c, \qquad \lambda^2 = \mathbf{u}\cdot\mathbf{n}, \qquad \lambda^3 = \mathbf{u}\cdot\mathbf{n} + c \qquad (8)$$
where $\mathbf{u}\cdot\mathbf{n} = u\,n_x + v\,n_y$. The definition of the averaged variables is as follows [19]:
$$u_k = \frac{u_i\sqrt{h_i} + u_j\sqrt{h_j}}{\sqrt{h_i} + \sqrt{h_j}}, \qquad v_k = \frac{v_i\sqrt{h_i} + v_j\sqrt{h_j}}{\sqrt{h_i} + \sqrt{h_j}}, \qquad c_k = \sqrt{g\,\frac{h_i + h_j}{2}} \qquad (9)$$
The difference across the edge k can be projected onto the eigenvector basis [18]:

$$\delta\mathbf{U}_k = \mathbf{U}_j - \mathbf{U}_i = \mathbf{P}_k \mathbf{A}_k \qquad (10)$$
where $\mathbf{A}_k = (\alpha^1, \alpha^2, \alpha^3)^T_k$ contains the set of wave strengths. Following the same procedure with the source terms [18],

$$(\mathbf{T}\,\mathbf{n})_k = \mathbf{P}_k \mathbf{B}_k \qquad (11)$$

where $\mathbf{B}_k = (\beta^1, \beta^2, \beta^3)^T_k$ contains the source strengths.
More information about the values of the wave and source strengths, as well as the entropy fix, can be found in [18]. The contributions due to the fluxes and the source terms are combined in a compact expression:

$$(\gamma^m)_k = \left(\lambda^m \alpha^m - \beta^m\right)_k \qquad (12)$$
The 2D numerical upwind explicit scheme is formulated using only the contributions that arrive at the cell:

$$(\gamma^{m,-})_k = \frac{1}{2}\left[1 - \mathrm{sign}\left(\lambda^m\right)\right]_k (\gamma^m)_k \qquad (13)$$
so that the finite volume approach for the updating of a single cell of area $\Omega_i$ is [20]:

$$\mathbf{U}^{n+1}_i = \mathbf{U}^n_i - \frac{\Delta t}{\Omega_i} \sum_{k=1}^{NE} \sum_{m=1}^{3} \left(\gamma^{m,-}\, \mathbf{e}^m\, l\right)^n_k \qquad (14)$$
Considering that in the explicit scheme (14) each cell edge k is used to deliver information between a pair of neighbouring cells of different size, the time step size compatible with numerical stability is limited by

$$\Delta t \leq \Delta t^{\lambda}, \qquad \Delta t^{\lambda} = \frac{\min(\chi_i, \chi_j)}{\max_{m=1,2,3} |\lambda^m|} \qquad (15)$$
so that the following dimensionless quantity is defined

$$\mathrm{CFL} = \frac{\Delta t}{\Delta t^{\lambda}} \leq 1 \qquad (16)$$

to control the numerical stability of the method.
Considering unstructured meshes, the distance $\chi_i$ in each cell i must account for the volume of the cell and the length of the shared edges k:

$$\chi_i = \frac{\Omega_i}{\max_{k=1,NE}\, l_k} \qquad (17)$$
For more details, see [21, 18].
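Eqs. (15) and (17) can be combined into a per-edge time-step bound. The C sketch below is illustrative only (the function names and the restriction to triangular cells are our assumptions); the eigenvalues are those of Eq. (8).

```c
#include <math.h>

/* chi of Eq. (17) for a triangular cell: area over longest edge.
 * Illustrative helper, assuming triangles (NE = 3). */
static double chi_triangle(double area, double l1, double l2, double l3)
{
    double lmax = l1;
    if (l2 > lmax) lmax = l2;
    if (l3 > lmax) lmax = l3;
    return area / lmax;
}

/* Time-step bound of Eq. (15) at one edge between cells i and j,
 * given the normal velocity u.n and the celerity c; the eigenvalues
 * of Eq. (8) are u.n - c, u.n and u.n + c. */
static double edge_dt(double chi_i, double chi_j, double un, double c)
{
    double lmax = fabs(un - c);
    if (fabs(un) > lmax)     lmax = fabs(un);
    if (fabs(un + c) > lmax) lmax = fabs(un + c);
    double chimin = chi_i < chi_j ? chi_i : chi_j;
    return chimin / lmax;
}
```

The global ∆t is the minimum of edge_dt over all edges; on the GPU this reduction is the role played by cublasIdamin in the implementation described below.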
4. Implementation on GPU
The GPU contains a large number of processors all working together applying the same operation over different elements. In order to program using this paradigm, NVIDIA has developed CUDA (Compute Unified Device Architecture) [22], which abstracts some aspects of the hardware, allowing programmers to develop general purpose programs efficiently.
There are two main points to understand the performance of GPUs by means of CUDA. The first is based on the way CUDA applications are developed. The basic element to be processed is called a Thread. Threads are identified by labels ranging between 0 and BlockDim. A group of Threads is called a Block, and it contains a number of Threads that is recommended to be a multiple of 32. Finally, any group of Blocks is called a Grid. The second aspect of interest is the hardware architecture. The minimum unit is the Streaming Processor (SP), where a single Thread is executed. A group of SPs forms the Streaming Multiprocessor (SM), typically with 32 SPs. Finally, a GPU is composed of between 2 and 16 SMs.
NVIDIA GPUs are generally formed by Streaming Multiprocessors (typically 14-16 in the Tesla series) which contain Streaming Processors (32 in the case of the Tesla) [23]. The GPU distributes the Blocks among the SMs. The SMs in turn assign the Threads to the SPs. All SPs inside the multiprocessor perform the same operations at the same time, but each of them applies them to a different element inside a vector. This GPU design is the reason for the recommendation to configure blockDim as a multiple of 32. The set of 32 threads processed in an SM is called a warp.
CUDA for C is an extension of the C standard programming language which includes the possibility of specifying three key abstractions in the execution: hierarchy of thread groups, shared memories and barrier synchronization. These are exposed to the programmer as a set of extensions that may be introduced in the code. The most significant change in the syntax is the necessary parameters for the execution of the functions. In particular, function calls include:
// Function call with blocksPerGrid blocks of size
// threadsPerBlock using streamId. If no stream is specified,
// the third and fourth parameters may be omitted.
functionName
<<<                   // This is always included in CUDA kernels
    blocksPerGrid,    // Number of Blocks
    threadsPerBlock,  // Number of Threads within a block
    0,                // This parameter must be included if a stream
                      // ID is specified. It is 0 by default.
    streamId          // Identifier of the execution stream
>>>                   // This is always included in CUDA kernels
(parameters);
where blocksPerGrid is the number of blocks of size threadsPerBlock launched in the stream streamId. Moreover, inside each function, each thread may establish its identifier taking into account its own threadIdx.x and blockIdx.x (in the case of 2D or 3D blocks, it is also possible to access the .y and .z IDs). This provides each thread with a way of establishing a unique access to, for instance, a position of a vector V:
...
float V[size];
...
int ID;
float velocity;

// Each thread may build its own unique identifier as
ID = threadIdx.x + (blockIdx.x * blockDim.x);
// Each thread writes one position of vector V
V[ID] = velocity;
More details of the standard can be found in [22].
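The identifier arithmetic above can be checked on the CPU. The plain-C sketch below (not CUDA code; the function name is our own) emulates how each (blockIdx, threadIdx) pair maps to a unique global index.

```c
/* CPU emulation of the CUDA unique-identifier computation
 * ID = threadIdx.x + blockIdx.x * blockDim.x: consecutive threads
 * of a block touch consecutive elements, and blocks tile the array
 * without overlap. Illustrative helper, not CUDA code. */
static int global_id(int threadIdx_x, int blockIdx_x, int blockDim_x)
{
    return threadIdx_x + blockIdx_x * blockDim_x;
}
```

For example, with blockDim = 32, thread 5 of block 2 owns element 69, and thread 31 of block 3 owns element 127, the last element covered by 4 blocks.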
The implementation of numerical models using GPUs requires taking into account four important aspects:
• Number of elements to be processed: the number of Blocks and the number of Threads within each Block are parameters to be tuned by the programmer. They determine the maximum number of elements the GPU can process (Blocks times Threads per Block). This number must be greater than or equal to the number of elements to be processed.
• Bottlenecks: in order to process all the operations following the GPU paradigm, special attention must be paid to the information shared between the processing elements.
• Data transfer reduction: the communication between CPU and GPU is very slow. In general, all the operations must take place inside the GPU; otherwise the overhead caused by data transfers may generate such a cost that the global performance of the implementation can be lower than on CPU.
• Floating point data precision: the GPU arithmetic performance is halved when using double precision data. Many applications require double precision because of numerical aspects, but there exist many others for which single precision is enough for the calculations. When single precision is acceptable, performance can be almost doubled on GPU.
The main loop of our implementation is shown in Listing 1, where the fundamentals of the programming and the general aspect of the simulation code are shown.
Listing 1: Overview of the CUDA implementation.
...
// Configuration of the parameters
Threads=256;
edgeBlocks=nEdge/Threads;
cellBlocks=nCell/Threads;
while(t<tmax){
    // Calculate the fluxes
    calculateEdgeFluxes<<<edgeBlocks,Threads,0,executionStream>>>(...);
    // Establish the minimum dt, obtaining the ID of the minimum dt
    // (*) Explained in Listing 3
    cublasIdamin(...,nEdge,vDt,1,id);
    // And assign it
    newDt<<<1,1,0,executionStream>>>(dt,vDt,id);
    // Update the elapsed time (in GPU)
    updateT<<<1,1,0,executionStream>>>(cuda_t,dt);
    // Retrieve the value of t to the CPU
    cudaMemcpy(t,cuda_t,sizeof(double),cudaMemcpyDeviceToHost);
    // Update the cell values
    assignFluxes<<<cellBlocks,Threads,0,executionStream>>>(...);
    // Verify whether it is necessary to dump data and,
    // if so, process it.
    // (*) Detailed in Listing 4
    if(t<t_dump){
        // Copy of cell variables to a GPU stored buffer
        cudaMemcpy(..., cudaMemcpyDeviceToDevice);
        // Establish the barrier to ensure the copy of the
        // data to the buffer
        cudaStreamSynchronize(copyStream);
        // Copy the data to the CPU buffer
        cudaMemcpyAsync(..., cudaMemcpyDeviceToHost,copyStream);
        // Create a parallel CPU thread in order to control
        // the disk transfer
        pthread_create(&diskThread, ...);
    }
}
First, the number of elements of the Blocks is defined statically at the beginning, attending to the criterion of occupancy of the streaming multiprocessors (see the CUDA GPU Occupancy Calculator [23]). For our purposes, 96 Threads/Block and 256 Threads/Block have been identified in general as the best configurations for meshes with fewer than 100000 elements and more than 100000 elements, respectively.
Bottlenecks appear when reduction patterns are present in the algorithms. The reduction functions needed during the computation can be implemented using the cublas library, which offers high-level functions that can return results to GPU or CPU memory. When interested in using them without taking the data out of the GPU, this must be specified as in Listing 2,
Listing 2: Cublas configuration
cublasSetPointerMode_v2(handle, CUBLAS_POINTER_MODE_DEVICE);
stating that all results have to be returned to the GPU memory.
Figure 1: Gathering minimum ∆t for all the domain
The proposed model formulated as in (14) requires the computation of the minimum global ∆t that verifies the stability condition (16) when running along all the cell edges. It is achieved by means of the function cublasIdamin. Figure 1 illustrates an example of the behaviour of this function, obtaining the minimum among all of them. Details are shown in Listing 3.
Listing 3: Gathering ∆t operation
__global__ void newDt(double *dt,double *vDt, int *id){
    // As cublasIdamin returns its value following
    // 1-based indexing, we must subtract 1
    *dt=vDt[*id-1];
}
....
cublasIdamin(handle,*npared,vDt,1,id);
newDt<<<1,1>>>(dt,vDt,d_id);
...
As the calculation is controlled by the CPU, it is necessary to transfer the updated t^{n+1}. After ∆t is calculated, the updating operation can be performed as in Listing 4, and then the updated value of t^{n+1} is transferred to the CPU.
Listing 4: Updating t
__global__ void updateT(double *dt,double *t){
    *t=*t+*dt;
}
∆t is the only variable that has to be transferred every time step, due to the control of the global calculation by the CPU. When data dumping is required, all cell variables must be transferred to the CPU. This transfer is slow. Although out of the scope of this work, this particular operation may be optimized by making use of concurrent execution and asynchronous data transfer.
5. Issues related to implementation on unstructured meshes
Structured meshes have proved suitable for GPU computations [11]. Indeed, GPUs have been developed optimizing the accesses to memory when this kind of structure (grids) is used. The main question when calculating with structured meshes to solve the shallow water equations is whether they provide good results when the topography is complicated. It is more difficult to apply local refinement of the mesh if necessary and, when solving river basins, the angularity of the mesh could lead to artificial viscosity near the shores. Moreover, when the simulation requires respecting some kind of structure, the flexibility of an unstructured mesh allows fitting it to represent that structure properly [15].
The main advantage of structured meshes is the order inherent in their creation. Neighbouring cells are usually near in memory. When the solver works by edges, this point is very important in order to achieve coalesced access to the main memory in the GPU [24].
The way calculations are made over unstructured meshes is not the same as when using structured meshes. The work in [25] provides a way to make the calculations using the NSEW scheme, implying that the order of the cells is inherent to the manner of accessing the data. However, when using unstructured meshes, the most common way of performing the calculation is by solving the fluxes by edges rather than by cells, because the number of operations is nearly halved. In the present work, three main aspects have been identified as relevant to construct an optimal solver on GPU when working with unstructured meshes:
• Cell ordering: the way cells are ordered is important when accessing data from two neighbouring cells.
• AoS vs SoA: the choice between Array of Structures and Structure of Arrays may improve the performance of the solver.
• Edge ordering: following from the previous two points, when calculating by edges, the ordering of the edges has a relevant impact on performance.
In addition, there is a limitation because of the variable wet-dry boundaries that appear in unsteady cases and may decrease the overall performance of the code. When the ordering of the cells is not proximal, warp divergence is probable. Warp divergence occurs when two threads evaluate a control flow structure differently. Within a warp, the hardware is not capable of executing if and else branches simultaneously, and serialization of the execution paths is required. More details about warp divergence can be found in [26]. This implies that when two elements to be processed must apply different operations by means of a flow control structure (i.e. if... else..., as in the case of the wet-dry frontier), first those threads that satisfy the condition apply the first operation and, after this, those Threads which enter the else branch apply the other operations. These issues may have an important impact on the performance of the application. Cell ordering partially avoids this limitation and is discussed next.
5.1. Cell Ordering
Cell ordering has a relevant influence on the way the cells are connected. This is controlled by the connectivity matrix, defined as:

$$m_{i,j} = \begin{cases} 1 & \text{if cell } i \text{ and cell } j \text{ are neighbours and } i < j \\ 0 & \text{otherwise} \end{cases} \qquad (18)$$
Coalesced memory access, or memory coalescing, refers to combining multiple memory accesses into a single transaction. Every successive 128 bytes of memory (32 single precision words) can be accessed by a warp (32 consecutive threads) in a single transaction. Among the conditions that may result in an uncoalesced load, i.e., serialized memory access, the most problematic is when memory access is sparse [23]. Structured meshes, in general, have a coalesced pattern implicit in their construction, and this allows not only knowing implicitly the index of the cell given the edge but also an ordered manner of performing the memory access. This is illustrated by the sketch in Figure 2.
Figure 2: Access pattern for most common structured meshes
On the other hand, unstructured meshes require an auxiliary index vector to perform the calculations by edges. Taking this into account, the coalescence of the access to the cells given the edge depends on its design. An example of the access pattern is described in Figure 3. In contrast to Figure 2, unstructured meshes require explicit connectivity between edges and cells.
Figure 3: Access pattern for unstructured meshes
Two 32-element triangular meshes are defined in Figure 4. The first is a triangular structured mesh and the second is a triangular unstructured mesh constructed using Triangle [27]. Figure 4 also shows the connectivity matrix of both meshes. That of the structured mesh is very close to a banded matrix, whereas that of the unstructured mesh does not have that pattern. This implies that, for the structured mesh, the memory will be more ordered and the memory transactions will be performed faster.
When dealing with unstructured meshes, some manipulations can be applied to the connectivity matrix to make it closer to a banded matrix. The RCM (Reverse Cuthill-McKee) algorithm transforms a sparse matrix into a banded matrix form with a small bandwidth. This bandwidth can be measured as:
$$\varphi = \max\{\,|f(v_i) - f(v_j)| : v_i v_j \in E\,\} \qquad (19)$$

Figure 4: Upper: structured mesh (left) and unstructured mesh (right). Lower: connectivity matrix for the structured mesh (left) and for the unstructured mesh (right). (Note that there exists a point if m(i, j) = 1.)
where $|f(v_i) - f(v_j)|$ defines the distance between elements $v_i$ and $v_j$. For our purpose, $f(v_i)$ is the index of the element $v_i$, and the difference of the indexes of two neighbouring elements $v_i$ and $v_j$ represents the distance in memory allocation for those elements.
In terms of matrices, the graph bandwidth is the bandwidth of the symmetric matrix which represents the adjacency matrix of the graph. Applying this transformation to the unstructured mesh of Figure 4, the bandwidth goes from ϕ = 29 (original bandwidth) to ϕ = 5 (Figure 5), so, in the worst case, two variables related to the same edge will be allocated at most 6 memory positions apart.
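With f(v_i) taken as the cell index, the bandwidth of Eq. (19) can be computed directly from the edge-to-cell connectivity. The following C helper is an illustrative sketch (the function name is our own; cid1 and cid2 hold the two cell indices of each edge).

```c
/* Bandwidth phi of Eq. (19): largest index distance between the
 * two cells of any edge. cid1[k] and cid2[k] are the neighbouring
 * cell indices of edge k. Illustrative helper. */
static int bandwidth(const int *cid1, const int *cid2, int nEdges)
{
    int phi = 0;
    for (int k = 0; k < nEdges; ++k) {
        int d = cid1[k] - cid2[k];
        if (d < 0) d = -d;      /* absolute index distance */
        if (d > phi) phi = d;
    }
    return phi;
}
```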
Figure 5: Unstructured RCM processed mesh (left) and connectivity matrix (right)
5.2. AoS vs SoA
A very frequent question when working with arrays is whether to use a Structure of Arrays (SoA) or an Array of Structures (AoS). This is even more important when working with a GPU. In Figure 6 the difference between both options is displayed. Taking into account that the considered variables are U = (h, hu, hv), it is necessary to store them ordered by cells. While SoA stores first all $h_{1..NC}$ for the NC mesh cells and then $hu_{1..NC}$ and $hv_{1..NC}$, AoS stores $(h, hu, hv)_1$, then $(h, hu, hv)_2$, and so on up to NC.
Figure 6: Example of configuration of GPU main memory for Structure of Arrays and Array of Structures for 5 cells
Contrary to the CPU, the GPU has a very small cache and, moreover, accesses are optimized in groups of 32 elements. Therefore, when a single Thread accesses any memory address, a group of 32 elements starting at that memory address is moved to the cache. If contiguous Threads access nearby memory addresses, there is no need to access main memory again, because the GPU has already brought the following 31 elements to the cache.
Listing 5: Cell access example function, detailing the access to the variables of interest
1  __global__ void cellAccess(double *h, double *hu, ..., double *w){
2      int i = threadIdx.x+(blockIdx.x*blockDim.x);
3      ...
4      // SoA
5      hc=h[i];
6      huc=hu[i];
7      hvc=hv[i];
8      ...
9      // AoS
10     hc=w[3*i];
11     huc=w[3*i+1];
12     hvc=w[3*i+2];
13     ...
14 }
When the GPU performs a memory load in order to obtain the value of h (for instance) for a given identifier, if the memory is ordered as AoS (Listing 5, lines 10-12), the access pattern is displaced by 3. This is so because, given a Thread accessing position i, the neighbouring Thread in the Warp will need to access position i+3. However, if the memory is ordered as SoA (Listing 5, lines 5-7), the access to position i will be made concurrently with the access to position i+1 by the next Thread, and coalesced access will be produced automatically. It is important to note that if, as in the example, the data types are double, two cycles are internally required for a Warp. This is so because the first 16 Threads will access the first 128 bytes and the second 16 the next 128 bytes, each group of 16 covering the amount of bandwidth allowed by the caching GPU load operation. For the case of AoS, the first 6 Threads will hit in the access, but the next 6 Threads will need to load 128 bytes again because of the structure of the memory space. Although AoS is conceptually clearer, SoA improves memory load operations in this type of problem and is thus recommended in GPU implementations.
Bearing in mind that the default mode of the load operation in the GPU is caching, it attempts to hit the load in the L1 and L2 caches with a load granularity of 128 bytes. If the data type used is double, 8 bytes are needed to store each datum and, for a Warp, even if the data are contiguous, it is necessary to access main memory twice. On the other hand, when using float, only one access is required. For our case, it was decided to write operations as appears in Listing 1. It is worth stressing that the cache mechanism will not be sufficient to compensate for bad coalescence.
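The difference between the two layouts can be made concrete through the address arithmetic. In the sketch below (illustrative helpers, assuming the three variables h, hu, hv of Figure 6), consecutive cells are 1 position apart under SoA but 3 positions apart under AoS, which is what breaks coalescence in the latter case.

```c
/* Linear memory offset of variable var (0: h, 1: hu, 2: hv) of a
 * given cell under each layout of Figure 6; nCells is the mesh
 * size. Illustrative helpers: under SoA consecutive cells are 1
 * position apart (coalesced access), under AoS they are 3 apart. */
static int soa_offset(int var, int cell, int nCells)
{
    return var * nCells + cell;
}

static int aos_offset(int var, int cell)
{
    return 3 * cell + var;
}
```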
5.3. Edge Ordering
The third point deals with edge ordering. Edges are defined by the labels of their neighbouring cells cid1 and cid2, and by their own label i. When operating on edges, functions like the one in Listing 6 are used. Here, the cell identifiers are stored in two vectors using an SoA approach to improve memory access. The way these accesses are made is the last point where coalescence is necessary.
Listing 6: Edge access example function, detailing the access to the variables of interest
__global__ void edgeAccess(int *connectivityMatrix1,
                           int *connectivityMatrix2, ...){
    int i = threadIdx.x+(blockIdx.x*blockDim.x);
    ...
    cid1=connectivityMatrix1[i];
    cid2=connectivityMatrix2[i];
    ...
    h1=h[cid1];
    h2=h[cid2];
    ...
}
The lack of coalescence happens because the Threads in a Block will be assigned consecutive edge identifiers, but will access variables which have been indexed by cells. See the example in Listing 6, where a function needs to operate on a variable at both sides of the edge. One of the Threads will access edge i, retrieve the neighbouring cells cid1[i] and cid2[i], and then access the cell-indexed depth vector: h[cid1[i]]. The next Thread, operating on i + 1, will retrieve cid1[i + 1] and cid2[i + 1] in a coalesced way. There is no guarantee, however, that cid1[i] and cid1[i + 1] are close to one another, and thus coalescence problems happen when this Thread requires access to h[cid1[i + 1]].
Listing 7 is presented as a way to force coalescence by means of edge ordering. Note that the connectivity matrix is symmetric, so an element (cid1, cid2) also appears as (cid2, cid1). To avoid this replication, it is necessary to store the matrix as a triangular matrix. Here only the element that verifies cid1 < cid2 has been kept.
Listing 7: Ordering method for the connectivity matrix based on the qsort algorithm

typedef struct {
    int cid1;
    int cid2;
} connectivityMatrix;

/* The comparison operator needed for the qsort operation:
   edges are ordered first by cid1 and then by cid2, returning
   the negative/zero/positive result that qsort requires. */
int cmp(const void *a, const void *b)
{
    const connectivityMatrix *idx1 = (const connectivityMatrix *)a;
    const connectivityMatrix *idx2 = (const connectivityMatrix *)b;
    if (idx1->cid1 != idx2->cid1)
        return (idx1->cid1 < idx2->cid1) ? -1 : 1;
    if (idx1->cid2 == idx2->cid2)
        return 0;
    return (idx1->cid2 < idx2->cid2) ? -1 : 1;
}

void createEdgeIdentifiers(int *connectivityMatrix1, int *connectivityMatrix2)
{
    /* It is very useful to first remove the elements under the
       main diagonal, i.e. the duplicated (cid2, cid1) entries,
       in order to avoid replication of the edges. After that
       operation there are nEdges non-replicated elements.
       NOTE: that operation has been omitted here. */
    connectivityMatrix *edges =
        (connectivityMatrix *)malloc(sizeof(connectivityMatrix) * nEdges);
    for (int i = 0; i < nEdges; i++) {
        edges[i].cid1 = connectivityMatrix1[i];
        edges[i].cid2 = connectivityMatrix2[i];
    }
    /* qsort is defined in stdlib; cmp is the function defined
       above, which sets the ordering criterion */
    qsort(edges, nEdges, sizeof(connectivityMatrix), cmp);
    for (int i = 0; i < nEdges; i++) {
        connectivityMatrix1[i] = edges[i].cid1;
        connectivityMatrix2[i] = edges[i].cid2;
    }
    free(edges);
}
With this ordering, defining the kth edge as wk = (cid1, cid2), 1 < k ≤
nEdges, with cid1 and cid2 being the identifiers of the neighbouring cells for that
edge, the following property is enforced:

wk−1 = (cida, cidb) | (cida < cid1) ∨ [(cida = cid1) ∧ (cidb ≤ cid2)]    (20)
Therefore, when Thread i in a Warp accesses position idx and Thread i + 1
accesses position idx + 1, the cell data that these Threads must load are as close as
possible. Applying this technique to the unstructured mesh presented earlier,
Figure 7 shows the proximity of the numbering that can be obtained.
Figure 7: Unstructured mesh RCM-processed edge ordering. Original ordering (left), qsort post-processed ordering (right). (Axes: cell identifiers; colour scale: edge identifier.)
These three optimization strategies, cell ordering, the use of SoA and edge
ordering, help to achieve proximity in the memory space, allowing Threads to
access data as close as possible and thus take advantage of coalescence within a
Warp.
6. Results
In order to analyze the performance of our strategy, two test cases have been
selected. Both test cases have been run on an Intel Core 2 Quad CPU Q9550 @
2.83 GHz using an OpenMP shared-memory parallelized implementation. In or-
der to exploit the multi-threading of the processor, the comparisons include the
simulation using 1 core and 4 cores. Furthermore, the GPU used for our tests is
an NVIDIA Tesla C2070. The code of the GPU version is the same for all the
combinations, implemented with SoA similarly to Listing 1. Four combinations
of GPU calculations have been tested for unstructured meshes.
The first one (called GPU) performs a standard preprocess of the mesh, allocating
data ordered by cell numbering and building the edges by node connectivity (i.e.
following the cell ordering). The second one (called GPU+RCM) has the cell num-
bering reordered by the RCM method. The third (GPU+edgeOrd) orders only the
edges. The fourth applies both RCM and edge ordering (GPU+RCM+edgeOrd).
For squared structured meshes there are two tests: one without ordering and a
second one applying the edge ordering. As the meshes have been built as presented
in the examples, the bandwidth, as defined in (19), is proportional to the square root
of the number of cells.
6.1. Test Case 1
Test Case 1 is oriented to demonstrate the influence of the mesh when solving
the shallow water equations and to analyze the behaviour of the GPU on squared
structured meshes and triangular unstructured meshes. This test case is fully ex-
plained in [28] and is related to the oscillating evolution of the free surface in
a radially symmetric frictionless paraboloid. The solution is defined with the
parameters r0 = 10 m, a = 25.0 m, h0 = 2.5 m and L = 100 m. Squared and
triangular unstructured meshes are processed using RCM. The main properties of
the triangular unstructured meshes and the reduction of their bandwidth are shown
in Table 1. The results for this case are displayed in Figure 8 and Figure 9
for t = T/4, T/2, 3T/4 and T, taking into account that ω = 2π/T and hence
T = 11.2192 s. The results show differences between both discretizations. The
quadrilateral mesh creates a preferential directionality along the axes. This is of
special interest for the evolution of the problem, and even more so when flooding
cases are simulated.
Table 3 and Table 4 show the total time t required by the complete simulation in
all the implementations, together with the time used per time step (t/ts). According
to Table 3, quadrilateral grids require fewer steps due to the CFL condition: in that
mesh, χ is three times larger than the smallest value in the unstructured mesh and
hence the time step is three times larger. This is the main reason why, for this
case, the absolute execution time t for the squared mesh is smaller than for the
unstructured mesh with the same number of cells. When the computational cost
of each time step is analyzed, there is another point to take into consideration:
for the same number of cells, the number of edges involved in the calculation is larger for the
Figure 8: Comparison of water depth for structured (lower-left) and unstructured (upper-right)
meshes at (a) t = T/4, (b) t = T/2, (c) t = 3T/4, (d) t = T
structured mesh.
Comparing CPU and GPU time-step execution cost, the speed-up goes from
Figure 9: Comparison of velocity for structured (lower-left) and unstructured (upper-right) meshes
at (a) t = T/4, (b) t = T/2, (c) t = 3T/4, (d) t = T
15x to 30x when compared with the mono-core implementation, and from 6x to
11x when compared with the 4-core version. As many authors mention (e.g. [29]),
squared meshes give better results because of the implicit order in the creation of
the mesh, but applying our strategies to unstructured meshes the performance may be doubled
compared against the non-optimized version, going from 30x to nearly 60x with
respect to the 1-core run time, and from 8x to 20x compared with the OpenMP
version using 4 cores.
Mesh        10000    40000    100000    200000    400000
nCells      10015    40030    100423    200797    400680
ϕ0           9525    39794    100401    200735    400283
ϕRCM          118      243       372       522       752
Table 1: Bandwidth analysis for the triangular unstructured meshes used in Test Case 1
Figure 10: Computational cost (log scale, s) of the different implementations using different mesh
sizes (upper) and speed-up (lower) for Test Case 1
It is very important to note that the size of the computational domain varies
along the simulation. In the code, there are some control flow statements that
skip the calculations when a pair of cells is dry. This implies that Thread divergence
exists, because no calculations are made when two neighbouring cells are dry and,
as explained in Section 5, this may reduce GPU performance. By means of the
proposed technique this limitation is minimized: ordering the cells improves
locality, making it more probable that the same calculation applies to all the
Threads in a Warp, so the Thread divergence limitation is strongly reduced. In
other words, the reordering leads the Warp to process cells with the same
properties, and then the if or else action is applied in the whole Warp.
6.2. Test Case 2
In the second test case, a dam break problem in a channel with two bumps is
considered, and numerical simulations are compared with experimental data. The
experimental data were obtained by Aureli et al. [30]. The geometry leads to
highly unsteady flow, including shock formation, strong surface curvatures and re-
verse flow. Therefore, this case is useful to simulate using a uniformly refined mesh;
because of its rapidly propagating character, a good compromise between the
different meshes is obtained, allowing good measurements in the comparison of a
fully loaded computational case. The flume was rectangular in section, 1.0 m wide,
0.5 m high and 7.0 m long. The dam was placed at x = 2.25 m from the head of
the reservoir (x = 0). The water surface elevation in the reservoir is 0.45 m. Dry
conditions are considered downstream of the gate. The slopes of both obstacles are
symmetric. Friction losses are computed using the Manning roughness coefficient
n = 0.007 s m−1/3. As in the previous case, the computational domain is repre-
sented with unstructured meshes with cell counts ranging from 10000 to 400000 cells,
setting CFL = 0.9 in all cases. Measurements of water depth versus time were
made based on video recordings of the flow. There are three gauging points: P1
at x = 1.40 m, P2 at x = 2.25 m and P3 at x = 4.5 m. Also, velocity measurements
were made at gauging point P1.
Mesh        10000    40000    100000    200000    400000
nCells      10171    40033    100138    200440    400186
ϕ0          10094    36613    100098    199658    400106
ϕRCM           39       86       129       204       262
Table 2: Bandwidth analysis for the triangular unstructured meshes used in Test Case 2
Computed and measured data are in good agreement (Figure 13). Differences
can be explained by the limits of the mathematical model, which considers purely
hydrostatic forces along the flow. On the other hand, the computational effort is also
reduced by means of these techniques. When applying RCM (Table 2, Figure 11), ϕ
is reduced by up to three orders of magnitude, leading to a well coalesced access
pattern for this mesh.
Figure 11: RCM method impact for the 100000 cells mesh used in Test Case 2. Original mesh
(upper) and RCM processed mesh (lower)
The behaviour of the technique is very similar to that found in the previous
test case. Domains containing fewer than 50000 cells offer gains of around 15%
when all the proposed techniques are applied. When using more than 100000
computational cells, the speed-up reaches values of around 60x, representing
an increase of around 23%. It is very important to note that, as in the previous
case, applying only RCM is not enough. In fact, the sole application of RCM
could lead to worse performance because, although the cell access pattern will
be coalesced, the edge access pattern could become less coalesced as a consequence
of RCM. In that case, the speed-up could be lower than when computing without
any modification of the mesh.
Contrary to the previous test case, where the wet domain changes all the time,
in this second test case almost the whole domain enters the calculation immediately
after the gate removal at the beginning of the simulation. This avoids Thread
divergence and hence the speed-up of the non-optimized version is higher.
7. Conclusions
A general implementation of a finite volume scheme on Graphics Processing
Units has been analyzed in this work. Unstructured meshes are useful in many
fields of engineering, but GPU design is oriented towards structured meshes. The
results have been compared with an OpenMP implementation. In this work, some
Figure 12: Computational cost (log scale, s) of the different implementations using different mesh
sizes (upper) and speed-up (lower) for Test Case 2
improvements in the way the meshes are preprocessed have demonstrated good
results, improving the performance of the GPU implementation when unstructured
meshes are used. Specifically, the use of SoA for the definition of variables and
the application of cell and edge ordering algorithms have been shown to have a
noticeable influence on GPU performance. When not all the domain is active, these
strategies provide coalesced memory access and help to avoid Thread divergence.
When almost all the domain is active, the benefit of coalesced memory access is
still obtained, increasing in both cases the performance by 10%–25% for small
meshes with no moving boundaries and by 60%–100% for more than 100000 cells
and a variable domain.
These results provide a very useful technique that leads to a very efficient
implementation on modern GPUs. It is worth noting that the OpenMP parallel
version is easy to implement in a reasonable time, while the CUDA version
requires more time as well as care with details related to the architecture.
The proposed strategy can be extended to other solvers based on calculations by
edges, and also to other methods where calculations and storage are performed by
nodes, to simulate unsteady flows on unstructured meshes.
Figure 13: Water level evolution in time. Experimental data (− • −) and numerical results (−−)
for Test Case 2 at gauging points P1 (x = 1.4 m), P2 (x = 2.25 m) and P3 (x = 4.5 m)
Figure 14: Evolution of Test Case 2 for t = 1 s, 2 s, 3 s, 4 s, 5 s and 8 s
8. Acknowledgments
This work has been partially supported by the European Social Fund (Aragon
Government). We also thank NVIDIA for the donation of hardware facilities.
References
[1] A. Danowitz, K. Kelley, J. Mao, J. P. Stevenson, M. Horowitz, CPU DB: Recording microprocessor history, Queue 10 (2012) 10:10–10:27.
[2] T. Tomczak, K. Zadarnowska, Z. Koza, M. Matyka, L. Miroslaw, Acceleration of iterative Navier-Stokes solvers on graphics processing units, International Journal of Computational Fluid Dynamics 27 (????).
[3] J. M. Domínguez, A. J. Crespo, M. Gómez-Gesteira, Optimization strategies for CPU and GPU implementations of a smoothed particle hydrodynamics method, Computer Physics Communications 184 (2013) 617–627.
[4] W.-Y. Liang, T.-J. Hsieh, M. T. Satria, Y.-L. Chang, J.-P. Fang, C.-C. Chen, C.-C. Han, A GPU-based simulation of tsunami propagation and inundation, in: Proceedings of the 9th International Conference on Algorithms and Architectures for Parallel Processing, ICA3PP '09, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 593–603.
[5] M. J. Castro, S. Ortega, M. de la Asunción, J. M. Mantas, J. M. Gallardo, GPU computing for shallow water flow simulation based on finite volume schemes, Comptes Rendus Mécanique 339 (2011) 165–184.
[6] M. Abouali, J. Timmermans, J. E. Castillo, B. Z. Su, A high performance GPU implementation of Surface Energy Balance System (SEBS) based on CUDA-C, Environmental Modelling & Software 41 (2013) 134–138.
[7] A. Lacasta, P. García-Navarro, J. Burguete, J. Murillo, Preprocess static subdomain decomposition in practical cases of 2D unsteady hydraulic simulation, Computers & Fluids 80 (2013) 225–232. Selected contributions of the 23rd International Conference on Parallel Fluid Dynamics ParCFD2011.
[8] B. F. Sanders, J. E. Schubert, R. L. Detwiler, ParBreZo: A parallel, unstructured grid, Godunov-type, shallow-water code for high-resolution flood inundation modeling at the regional scale, Advances in Water Resources 33 (2010) 1456–1467.
[9] M. Castro, J. García-Rodríguez, J. González-Vida, C. Parés, A parallel 2D finite volume scheme for solving systems of balance laws with nonconservative products: Application to shallow flows, Computer Methods in Applied Mechanics and Engineering 195 (2006) 2788–2815.
[10] A. J. Kalyanapu, S. Shankar, E. R. Pardyjak, D. R. Judi, S. J. Burian, Assessment of GPU computational enhancement to a 2D flood model, Environmental Modelling & Software 26 (2011) 1009–1016.
[11] A. R. Brodtkorb, M. L. Sætra, M. Altinakar, Efficient shallow water simulations on GPUs: Implementation, visualization, verification, and validation, Computers and Fluids 55 (2012) 1–12.
[12] S. Rostrup, H. D. Sterck, Parallel hyperbolic PDE simulation on clusters: Cell versus GPU, Computer Physics Communications 181 (2010) 2164–2179.
[13] D. Jacobsen, J. C. Thibault, I. Senocak, An MPI-CUDA implementation for massively parallel incompressible flow computations on multi-GPU clusters, in: American Institute of Aeronautics and Astronautics (AIAA) 48th Aerospace Science Meeting Proceedings.
[14] D. Caviedes-Voullième, P. García-Navarro, J. Murillo, Influence of mesh structure on 2D full shallow water equations and SCS Curve Number simulation of rainfall/runoff events, Journal of Hydrology 448–449 (2012) 39–59.
[15] J. E. Schubert, B. F. Sanders, M. J. Smith, N. G. Wright, Unstructured mesh generation and landcover-based resistance for hydrodynamic modeling of urban flooding, Advances in Water Resources 31 (2008) 1603–1621.
[16] N. Whitehead, A. Fit-Florea, Precision & performance: Floating point and IEEE 754 compliance for NVIDIA GPUs, technical white paper by NVIDIA, 2011.
[17] J. Murillo, P. García-Navarro, J. Burguete, P. Brufau, The influence of source terms on stability, accuracy and conservation in two-dimensional shallow flow simulation using triangular finite volumes, International Journal for Numerical Methods in Fluids 54 (2007) 543–590.
[18] J. Murillo, P. García-Navarro, Weak solutions for partial differential equations with source terms: Application to the shallow water equations, Journal of Computational Physics 229 (2010) 4327–4368.
[19] P. Roe, Approximate Riemann solvers, parameter vectors, and difference schemes, Journal of Computational Physics 43 (1981) 357–372.
[20] M. Morales-Hernández, P. García-Navarro, J. Burguete, P. Brufau, A conservative strategy to couple 1D and 2D models for shallow water flow simulation, Computers & Fluids 81 (2013) 26–44.
[21] J. Murillo, P. García-Navarro, J. Burguete, Time step restrictions for well-balanced shallow water solutions in non-zero velocity steady states, International Journal for Numerical Methods in Fluids 60 (2009) 1351–1377.
[22] NVIDIA Corporation, NVIDIA CUDA C Programming Guide, 2011.
[23] P. N. Glaskowsky, NVIDIA's Fermi: The first complete GPU computing architecture, white paper prepared under contract with NVIDIA Corporation (2009) 1–26.
[24] F. De Vuyst, C. Labourdette, Some examples for CFD instant computations on GPU: LBM for incompressible fluids, VFM for compressible fluids, in: Workshop Trends in Physical and Numerical Modeling of Multiphase Flows, CFD and its Experimental Validation for Multiphase Flows.
[25] A. Brodtkorb, T. Hagen, M. Saetra, GPU programming strategies and trends in GPU computing, Journal of Parallel and Distributed Computing (2012), in press.
[26] M. Pharr, R. Fernando, GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation (GPU Gems), Addison-Wesley Professional, 2005.
[27] J. R. Shewchuk, Triangle: Engineering a 2D quality mesh generator and Delaunay triangulator, in: M. C. Lin, D. Manocha (Eds.), Applied Computational Geometry: Towards Geometric Engineering, volume 1148 of Lecture Notes in Computer Science, Springer-Verlag, 1996, pp. 203–222. From the First ACM Workshop on Applied Computational Geometry.
[28] W. C. Thacker, Some exact solutions to the nonlinear shallow-water wave equations, Journal of Fluid Mechanics 107 (1981) 499–508.
[29] D. Komatitsch, G. Erlebacher, D. Göddeke, D. Michéa, High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster, Journal of Computational Physics 229 (2010) 7692–7714.
[30] F. Aureli, P. Mignosa, M. Tomirotti, Dam-break flows in presence of abrupt bottom variations, in: Proceedings of the XXVIII IAHR International Congress.
Configuration                         t (ms)     t/ts   Speed-up vs. CPU1Core   Speed-up vs. CPU4Core
10000 Cells
CPU1Core - Unst. Mesh 4991.81 5.57
CPU1Core - Sq. Mesh 3953.16 7.40
CPU4Core - Unst. Mesh 1783.99 1.99 2.79
CPU4Core - Sq. Mesh 1470.98 2.75 2.69
GPU - Unst. Mesh 295.56 0.33 16.89 5.43
GPU+RCM - Unst. Mesh 306.89 0.34 16.27 5.23
GPU+edgeOrd - Unst. Mesh 283.04 0.32 17.64 5.67
GPU+RCM+edgeOrd - Unst. Mesh 267.63 0.30 18.65 6.00
GPU - Sq. Mesh 178.28 0.33 22.17 8.25
GPU+edgeOrd - Sq. Mesh 177.29 0.33 22.30 8.67
40000 Cells
CPU1Core - Unst. Mesh 58105.63 28.98
CPU1Core - Sq. Mesh 35323.78 32.53
CPU4Core - Unst. Mesh 21599.27 10.77 2.69
CPU4Core - Sq. Mesh 14054.87 12.94 2.51
GPU - Unst. Mesh 1921.82 0.96 30.23 8.99
GPU+RCM - Unst. Mesh 2120.12 1.06 27.41 8.15
GPU+edgeOrd - Unst. Mesh 1656.17 0.83 35.08 10.43
GPU+RCM+edgeOrd - Unst. Mesh 1401.03 0.70 41.47 12.33
GPU - Sq. Mesh 839.92 0.77 42.06 16.74
GPU+edgeOrd - Sq. Mesh 841.08 0.77 42.00 16.71
100000 Cells
CPU1Core - Unst. Mesh 264078.60 80.05
CPU1Core - Sq. Mesh 160182.18 88.21
CPU4Core - Unst. Mesh 98068.16 29.72 2.69
CPU4Core - Sq. Mesh 66227.14 36.47 2.42
GPU - Unst. Mesh 8349.61 2.53 31.63 10.57
GPU+RCM - Unst. Mesh 9174.20 2.78 28.78 9.62
GPU+edgeOrd - Unst. Mesh 6750.18 2.05 39.12 13.08
GPU+RCM+edgeOrd - Unst. Mesh 5267.17 1.60 50.14 16.76
GPU - Sq. Mesh 3392.74 1.87 47.21 19.52
GPU+edgeOrd - Sq. Mesh 3384.75 1.86 47.32 19.56
200000 Cells
CPU1Core - Unst. Mesh 840913.36 162.46
CPU1Core - Sq. Mesh 455589.91 176.59
CPU4Core - Unst. Mesh 332028.52 64.14 2.53
CPU4Core - Sq. Mesh 170974.87 66.27 2.66
GPU - Unst. Mesh 26533.18 5.13 31.69 11.26
GPU+RCM - Unst. Mesh 28943.84 5.59 29.05 10.32
GPU+edgeOrd - Unst. Mesh 20562.27 3.97 40.90 14.53
GPU+RCM+edgeOrd - Unst. Mesh 15639.76 3.02 53.77 19.11
GPU - Sq. Mesh 8998.99 3.49 50.63 19.00
GPU+edgeOrd - Sq. Mesh 8992.23 3.49 50.66 19.01
400000 Cells
CPU1Core - Unst. Mesh 3322224.69 337.04
CPU1Core - Sq. Mesh 1299827.96 352.73
CPU4Core - Unst. Mesh 1411913.42 10.02 2.35
CPU4Core - Sq. Mesh 525174.53 142.51 2.48
GPU - Unst. Mesh 109699.40 11.13 30.28 11.58
GPU+RCM - Unst. Mesh 117526.78 11.92 28.27 10.81
GPU+edgeOrd - Unst. Mesh 83056.90 8.43 40.00 15.30
GPU+RCM+edgeOrd - Unst. Mesh 56543.50 5.74 58.76 22.47
GPU - Sq. Mesh 24195.92 6.57 53.72 21.70
GPU+edgeOrd - Sq. Mesh 24180.67 6.56 53.75 21.72
Table 3: Test Case 1: CPU and GPU total execution time (t) and time per step (t/ts), together with
the performance of the GPU version using different configurations. CPU1Core is a sequential version
on an Intel Core 2 Quad CPU Q9550 @ 2.83 GHz and CPU4Core is the OpenMP implementation
using 4 cores
Configuration                         t (ms)     t/ts   Speed-up vs. CPU1Core   Speed-up vs. CPU4Core
10000 Cells
CPU1Core 41532.30 7.49
CPU4Core 36165.99 2.94 2.51
GPU 1969.88 0.36 21.08 7.36
GPU+RCM 2013.51 0.36 20.63 7.20
GPU+edgeOrd 1882.64 0.34 22.06 7.70
GPU+RCM+edgeOrd 1738.89 0.31 23.88 8.34
40000 Cells
CPU1Core 505401.30 33.40
CPU4Core 417096.00 15.51 2.15
GPU 13762.87 0.91 36.72 11.63
GPU+RCM 15138.44 1.00 33.39 10.58
GPU+edgeOrd 12613.27 0.83 40.07 12.69
GPU+RCM+edgeOrd 11102.60 0.73 45.52 14.42
100000 Cells
CPU1Core 2295837.22 88.48
CPU4Core 1835517.78 36.50 2.42
GPU 51589.11 1.99 44.50 12.46
GPU+RCM 58088.87 2.24 39.52 11.07
GPU+edgeOrd 46221.68 1.78 49.67 13.91
GPU+RCM+edgeOrd 41278.49 15.57 55.62 44.47
200000 Cells
CPU1Core 4949436.05 180.39
CPU4Core 3912997.63 79.13 2.27
GPU 103658.95 3.78 47.75 17.65
GPU+RCM 120357.69 4.39 41.12 15.20
GPU+edgeOrd 93125.04 3.39 53.15 19.64
GPU+RCM+edgeOrd 82877.25 3.02 59.72 22.06
400000 Cells
CPU1Core 14448850.66 359.22
CPU4Core 11717537.88 169.62 2.11
GPU 306335.43 7.62 47.17 17.43
GPU+RCM 350376.72 8.71 41.24 15.24
GPU+edgeOrd 269668.53 6.70 53.58 19.80
GPU+RCM+edgeOrd 234851.10 5.84 61.52 22.74
Table 4: Test Case 2: CPU and GPU total execution time (t) and time per step (t/ts), together with
the performance of the GPU version using different configurations. CPU1Core is a sequential version
on an Intel Core 2 Quad CPU Q9550 @ 2.83 GHz and CPU4Core is the OpenMP implementation
using 4 cores