Multiphase LBM Distributed Over Multiple GPUs
Carlos Rosales 1
1 Texas Advanced Computing Center, The University of Texas at Austin
Research Office Complex 1.101, J.J. Pickle Research Campus, Building 196
10100 Burnet Road, Austin, Texas
[email protected]
Abstract—A parallel distributed CUDA implementation of a Lattice Boltzmann Method for multiphase flows with large density ratios is described in this paper. Validation runs studying the terminal velocity of a rising bubble under the effect of gravity show good agreement with the expected theoretical values. The code is benchmarked against the performance of a typical CPU implementation of the same algorithm on both AMD and Intel platforms, and a single GPU is observed to perform up to 10X faster than a quad-core CPU socket, a 40X speedup with respect to a single core. The code is shown to scale well when executed on multiple GPUs, which makes the port to CUDA valuable even when compared to parallel CPU implementations.
I. INTRODUCTION
Multiphase flows have application in most areas of daily
life, and their understanding is of great importance in both
academia and industry. Computational methods are particularly
suited for the study of multiphase flows because they
allow us to analyze the effect of the many variables involved
in the problem in an independent manner, which is very
challenging to do in classical experimental work. Even though
a tremendous effort has been put into improving models and
numerical techniques to investigate multiphase flows, they
remain one of the most challenging subjects in computational
physics and engineering.
One of the most challenging issues in the study of multiphase
flows is the range of spatial and temporal scales that
need to be analyzed. Interfacial phenomena can be localized in
small volumes, but for the models to be useful the calculations
often need to extend far beyond the interfacial regions. Also
the timescales involved in interfacial phenomena are short, but
at the same time one needs to be able to simulate the evolution
of the multiphase system over significant periods of time.
Traditional Computational Fluid Dynamics (CFD) methods
are capable of large system evolution simulations, but tend to
use high level models to represent interactions at the interface
level. Kinetic techniques like the Lattice Boltzmann Method
[1] are better suited to more detailed analysis of short time
scale evolution, but are hindered by their explicit nature and
require many time steps to produce useful results. Parallel
implementations [2], [3], adaptive meshing algorithms [4]
and multiple meshing levels [5], have helped improve the
performance of LBM so that more simulation steps can be
taken per second, but large scale simulations over long periods
of time remain a challenge.
Although graphics cards have been attractive in terms of raw
computing power for some time, it has only been recently that
user-friendly application programming interfaces like CUDA
or OpenCL made them practical for the general research
community. Over the past few years implementations of single
phase LBM on GPUs have shown excellent performance when
compared to their CPU equivalent [6], and in this work
we explore the extension of multiphase LBM to the CUDA
framework. In particular we present an implementation of the
Zheng-Shu-Chew (ZSC) multiphase model [7] in CUDA, and show
that execution on a GPU offers excellent speedup for the
simulation of large density ratio multiphase flows.
In Section 2 we describe the LBM model used and the
CUDA implementation details. Section 3 contains validation
and performance data, and Section 4 offers our conclusions
and suggestions for further improvements.
II. METHODOLOGY
A. The Zheng-Shu-Chew Multiphase Model
We model a flow with two immiscible fluids using the mass
conservation equation, the Navier Stokes equations, and an
interface evolution equation as in reference [7]:
\frac{\partial n}{\partial t} + \nabla \cdot (n\mathbf{u}) = 0,   (1)

\frac{\partial (n\mathbf{u})}{\partial t} + \nabla \cdot (n\mathbf{u}\mathbf{u}) = -\nabla \cdot \mathbf{P} + \mu_n \nabla^2 \mathbf{u} + \mathbf{F}_b,   (2)

\frac{\partial \phi}{\partial t} + \nabla \cdot (\phi\mathbf{u}) = \theta_M \nabla^2 \mu_\phi,   (3)
where μn is the dynamic viscosity of the fluid, μφ is
the chemical potential, θM is the mobility of the interface
(molecular diffusion mobility), P is the pressure tensor, Fb is
the body force, and n and φ are defined as:
n = \frac{\rho_A + \rho_B}{2}, \qquad \phi = \frac{\rho_A - \rho_B}{2},   (4)
with A and B indicating each of the fluids. This model uses
an average density n that is the same for every computational
node. As in other Free-Energy methods, the interfacial force
originates from the derivatives of φ as described below.
The advantage of this model is that the interface between
the two phases is captured using a convective Cahn-Hilliard
equation with second order accuracy without the inclusion of
a pressure correction term. This is a significant step forward
in the efficiency of the method, since the pressure correction
step is computationally very expensive and typically forces
simulations to be limited to density ratios below one hundred
[8]. This is achieved by using a standard lattice Boltzmann
equation for the momentum distribution function g, but introducing
an over-relaxation term in the equation for the order
parameter f :
g_i(\mathbf{x} + \mathbf{c}_i\delta t, t + \delta t) = g_i(\mathbf{x}, t) + \Omega_i^g(\mathbf{x}, t)   (5)

f_i(\mathbf{x} + \mathbf{c}_i\delta t, t + \delta t) = f_i(\mathbf{x}, t) + \Omega_i^f(\mathbf{x}, t) + (1 - \eta)\left[ f_i(\mathbf{x} + \mathbf{c}_i\delta t, t) - f_i(\mathbf{x}, t) \right]   (6)
Here Ωi is the collision term in the BGK [9] approximation:
\Omega_i^g = \frac{g_i^{eq}(\mathbf{x}, t) - g_i(\mathbf{x}, t)}{\tau_n}   (7)

\Omega_i^f = \frac{f_i^{eq}(\mathbf{x}, t) - f_i(\mathbf{x}, t)}{\tau_\phi}   (8)
where gi and fi are the distribution functions for the
momentum and the phase, τn and τφ are their respective
relaxation times, ci is the lattice velocity, and η is the over-
relaxation constant coefficient. This scheme reduces to the
standard lattice Boltzmann equation when η is unity. The
detailed expressions of the equilibrium distribution functions
geq and feq are given in the Appendix. The index i in
these equations refers to each of the chosen directions in the
discretized velocity space. This implementation uses a D3Q19
discretization for g, but requires only a D3Q7 discretization for
f . The velocity directions involved in these two discretization
modes are shown in Figure 1.
Fig. 1. D3Q19 discretization scheme. Red arrows (color online, numbered) mark the subset of directions that defines the D3Q7 scheme.
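For reference, a minimal sketch of one common D3Q19 velocity ordering is given below; the first seven directions (the rest particle plus the six axis-aligned links) form the D3Q7 subset used for f. The specific ordering shown is an assumption for illustration only and may differ from the ordering used in the actual code.

// One common D3Q19 velocity ordering (illustrative; the ordering used in
// the actual code may differ). Directions 0-6 form the D3Q7 subset.
static const int cx[19] = { 0, 1,-1, 0, 0, 0, 0, 1,-1, 1,-1, 1,-1, 1,-1, 0, 0, 0, 0 };
static const int cy[19] = { 0, 0, 0, 1,-1, 0, 0, 1,-1,-1, 1, 0, 0, 0, 0, 1,-1, 1,-1 };
static const int cz[19] = { 0, 0, 0, 0, 0, 1,-1, 0, 0, 0, 0, 1,-1,-1, 1, 1,-1,-1, 1 };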
The macroscopic quantities corresponding to the order parameter
φ, the density of the fluid n, and its velocity u are
defined in terms of the distribution functions f and g:
\phi(\mathbf{x}, t) = \sum_i f_i(\mathbf{x}, t)   (9)

n(\mathbf{x}, t) = \sum_i g_i(\mathbf{x}, t)   (10)

\mathbf{u}(\mathbf{x}, t) = \frac{1}{n}\left[ \sum_i \mathbf{c}_i g_i + \frac{1}{2}\left( \mu_\phi \nabla\phi + \mathbf{F}_b \right) \right]   (11)
where the body force representing gravity is given by
Fb = 2φ∗g, and the chemical potential is:
\mu_\phi = 4A(\phi^3 - \phi^{*2}\phi) - \kappa\nabla^2\phi,   (12)
with φ* defined as the constant φ* = (ρA − ρB)/2 using
the initially given values of ρA and ρB. In this expression the
parameters A and κ depend on the surface tension, σ, and the
interface width, W:
A = \frac{3\sigma}{W\phi^{*4}},   (13)

\kappa = \frac{1}{2}A(W\phi^*)^2.   (14)
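As a purely illustrative example of this arithmetic (the densities, and hence φ*, are not quoted in the paper, so φ* = 1 is taken here only as a placeholder), the parameters used later in Section III, σ = 0.01 and W = 4, would give A = 3(0.01)/(4 · 1) = 7.5 × 10⁻³ and κ = 0.5 · 7.5 × 10⁻³ · (4 · 1)² = 6 × 10⁻².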
The Chapman-Enskog expansion, together with the method
of successive approximations [10], can be used to recover
the convective Cahn-Hilliard equation (3) to second order
accuracy from equation (6) using the following definitions of
the coefficient η and the mobility θM :
\eta = \frac{1}{\tau_\phi + 0.5}   (15)

\theta_M = \eta\left( \tau_\phi\eta - \frac{1}{2} \right)\Gamma\,\delta t   (16)
B. MPI-CUDA Implementation
From the point of view of a programmer, using CUDA on
a GPU is essentially equivalent to working with a massively
threaded processor which has minimal overhead in the thread
spawning process. The sheer number of threads can hide
memory latencies, but it is important that the programmer be
aware of the memory access patterns of the algorithm in use.
At the logical level the CUDA framework consists of a
regular grid of blocks, with each block being assigned a fixed
number of threads that can be indexed in three dimensions.
In this work we define a grid of blocks in the X and Y
directions, and assign each block a number of threads in each
direction corresponding to the number of points assigned to
it. Each of these threads will work on all points along the Z
direction for a given (x,y) value. This is better explained with
a concrete example. Let’s consider a simulation domain of size
64x64x64. If we wish to have 256 threads per block we could
define BLOCK_SIZE_X = 32 and BLOCK_SIZE_Y = 8. This
corresponds to having 64/32 = 2 blocks in the X direction and
64/8 = 8 blocks in the Y direction for a total number of 16
blocks, as shown in Figure 2.
Fig. 2. Domain distribution in CUDA framework for a 64x64x64 simulation domain. a) Physical node assignment. Each block contains 32x8x64 = 16384 simulation points. b) Logical block/thread partition. Each thread is assigned to a single (x,y) value and works on all z values.
The computational volume is divided into a regular grid of
blocks in the X and Y directions. Each computational block
consists of a number of threads in the X and the Y directions,
which we define as BLOCK_SIZE_X and BLOCK_SIZE_Y, so that our partitioning in CUDA will look like:

// Define number of threads in each block
#define BLOCK_SIZE_X 32
#define BLOCK_SIZE_Y 8
...
// Define the grid and specify the number
// of threads per block
dim3 dimblock(BLOCK_SIZE_X,BLOCK_SIZE_Y,1);
dim3 dimgrid(NX/BLOCK_SIZE_X,NY/BLOCK_SIZE_Y,1);
...
// Call a kernel with the previously
// defined grid
stream_f <<<dimgrid,dimblock>>> ( ... );
It is important to notice that in the CUDA framework
threads do not access the memory individually, but in groups
of 32, known as warps. The GPU will coalesce accesses to
contiguous memory elements in single access groups of up to
128 bytes in length. This makes contiguous access for adjacent
threads critical for high performance computing on GPUs, as
non-contiguous access will lead to a significantly reduced
effective memory bandwidth. The vectorized code used in this
work was written in such a way that variables corresponding to
contiguous values in x were adjacent in memory in order to
improve coalesced memory access on the GPU.
The following steps were taken to implement the ZSC model
in the CUDA framework:
• Break down all major steps into kernel subroutines
– Hydrodynamics update (u, ρ, φ)
– Collision
– Stream
• Use one array per LBM direction for the distribution
functions
• Use one-dimensional arrays for all quantities
• Index arrays so that warps of threads access contiguous
memory positions
• Minimize device/host data exchange (MPI exchange data
and program output)
Having multiple kernels, each doing a relatively simple
task, helps to construct an efficient CUDA code because the
resources required by each thread will be less than those of
a very complex kernel. This is particularly important when
considering that there is a very limited amount of registers
available in a GPU, and that after these registers are exhausted
the code will have to allocate variables local to a kernel in the
local memory space, which is physically located on the main
GPU memory and thus has a much longer access latency than
the registers available to each streaming multiprocessor. A
balance must be struck between simplifying the kernels and
having many of them, because of the overhead associated with
launching each kernel. We found that creating tasks for
logically independent actions (velocity update, density update,
collision step, etc.) provided sufficient performance without
the need to generate cumbersome code.
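As a rough, self-contained sketch of this structure (kernel bodies are empty placeholders, update_hydro and stream_g are hypothetical names added here for illustration, and the ghost-cell exchanges discussed later are omitted), one time step then reduces to a short sequence of small kernel launches:

// Illustrative sketch only: each logically independent action gets its own
// small kernel. Only collision_f, collision_g and stream_f are names that
// appear in the paper; the remaining names and empty bodies are placeholders.
__global__ void update_hydro( void ) { /* update u, rho and phi        */ }
__global__ void collision_f ( void ) { /* collide phase function f     */ }
__global__ void collision_g ( void ) { /* collide momentum function g  */ }
__global__ void stream_f    ( void ) { /* stream phase function f      */ }
__global__ void stream_g    ( void ) { /* stream momentum function g   */ }

void lbmStep( dim3 dimgrid, dim3 dimblock )
{
    update_hydro <<<dimgrid,dimblock>>> ();
    collision_f  <<<dimgrid,dimblock>>> ();
    collision_g  <<<dimgrid,dimblock>>> ();
    stream_f     <<<dimgrid,dimblock>>> ();
    stream_g     <<<dimgrid,dimblock>>> ();
}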
We chose to employ one-dimensional arrays to store data
in order to use the native CUDA memory allocation and
deallocation routines cudaMalloc() and cudaFree() as
well as the device/host transfer function cudaMemcpy().
The use of one array for each of the LBM directions was
arbitrary. One could have equally chosen to employ a single
array to hold them all.
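A minimal sketch of this allocation pattern is shown below, assuming the domain sizes NX, NY, NZ from the earlier snippets; the host array name f_0_h and the explicit factor of two (current plus post-collision copies, as described below) are illustrative, not the actual variable names in the code.

#include <stdlib.h>
#include <cuda_runtime.h>

#define NX 64
#define NY 64
#define NZ 64

int main( void )
{
    // One 1D device array per distribution function direction; each array
    // holds two full copies of the grid (current and post-collision values).
    size_t gridSize  = (size_t)NX * NY * NZ;
    size_t arraySize = 2 * gridSize * sizeof(float);

    float *f_0_h = (float*) calloc( 2 * gridSize, sizeof(float) ); // host copy
    float *f_0_d = NULL;                                           // device copy

    cudaMalloc( (void**)&f_0_d, arraySize );

    // Initial condition to the device, results back to the host at the end
    cudaMemcpy( f_0_d, f_0_h, arraySize, cudaMemcpyHostToDevice );
    // ... time stepping kernels ...
    cudaMemcpy( f_0_h, f_0_d, arraySize, cudaMemcpyDeviceToHost );

    cudaFree( f_0_d );
    free( f_0_h );
    return 0;
}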
The indexing of the arrays was done in such a way that
the values along x were contiguous in memory. We were also
careful to request block sizes that were 32 threads wide along
the x direction, so that during runtime the request for 32
values of a particular array from each thread warp results
in one coalesced memory access. Failure to take such
measures results in significantly degraded performance,
particularly when using algorithms that perform many memory
accesses and have little data reuse.
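The paper does not show gridId itself; a form consistent with the description above (an assumption, with NX and NY the domain sizes used in the earlier snippets) would make x the fastest-running index:

#define NX 64   // domain size along x (as in the earlier snippets)
#define NY 64   // domain size along y

// Assumed form of the 1D index: x runs fastest, so the 32 threads of a warp,
// which differ only in i when BLOCK_SIZE_X = 32, touch 32 consecutive floats
// and their loads and stores coalesce into few memory transactions.
__host__ __device__ inline int gridId( int i, int j, int k )
{
    return i + j * NX + k * NX * NY;
}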
The data exchange between device and host required for
every MPI exchange is the most limiting factor when using
multiphase LBM algorithms. Instead of exchanging all the
distribution function values to the ghost cells every time step
we only exchange the values that are absolutely necessary.
This corresponds to 5 elements of g, 1 of f, and the value of
φ per cell across task face boundaries, plus one element of g and the value of φ per cell for every task edge boundary.
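As a rough illustration of the savings (assuming the 128x128 task-face cross-section used in Section III and 4-byte floats), each face exchange then moves about (5 + 1 + 1) × 128 × 128 × 4 B ≈ 0.46 MB per time step, instead of the (19 + 7 + 1) × 128 × 128 × 4 B ≈ 1.8 MB that a full ghost-layer update of g, f and φ would require.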
In our implementation we allocate one array twice the
size of the grid to hold the current and collision values
for each of the distribution function directions. This means
that there are a total of 19 + 7 = 26 arrays to be passed
to the collision and stream functions (from the D3Q19 and
D3Q7 discretizations), plus the nearest neighbor information
and the macroscopic quantities. Because of the limitation in
the number of arguments that can be passed on to a kernel
function in CUDA (a maximum of 32 pointer-sized arguments)
the collision and the stream steps were divided into separate
kernels for the f and g distribution functions.
The code section below shows most of the GPU kernel
code for the collision of the phase distribution function,
f . The code shows that few changes are required from a
standard CPU implementation besides identifying the thread
numbers along the X and Y directions and then calculating
the correct index for the current Z value – and its neighboring
points, needed for the gradient and laplacian terms – before
proceeding with the update. Notice that some of the constants
below are standard quantities defined by CUDA at run time
(blockIdx, blockDim, threadIdx) while others have
been defined previously in the code (gridSize, alpha4,
phiStarSq_d, kappa_d, invTauPhi_d, invEta2_d,
invTauPhiOne_d) and reside in the GPU memory as
symbols.
__global__ void collision_f( float *f_0_d, float *f_1_d, float *f_2_d,
                             float *f_3_d, float *f_4_d, float *f_5_d,
                             float *f_6_d,
                             int *nb_east_d, int *nb_west_d,
                             int *nb_north_d, int *nb_south_d,
                             float *phi_d, float *rho_d,
                             float *ux_d, float *uy_d, float *uz_d )
{
    // 23 local variables
    int   i, idx, idx2, ie, iw, j, jn, js, k, kt, kb;
    float phin, muPhi, invRho;
    float Af, Cf, Fx, Fy, Fz, lapPhi;

    // Identify current thread
    i = blockIdx.x * blockDim.x + threadIdx.x;
    j = blockIdx.y * blockDim.y + threadIdx.y;

    // Main collision loop
    for( k = 1; k < NZ-1; k++ ){

        // Define index of current point in old (idx)
        // and new (idx2) f arrays
        idx  = gridId( i, j, k );
        idx2 = gridId( i, j, k ) + gridSize;

        // Define for later reuse
        phin   = phi_d[idx];
        invRho = 1.f / rho_d[idx];

        // Near neighbors
        ie = nb_east_d[idx];
        iw = nb_west_d[idx];
        ...

        // Laplacian of the order parameter Phi
        lapPhi = ( phi_d[ gridId(ie,jn,k ) ] + ...

        // Chemical potential
        muPhi = alpha4_d*phin*( phin*phin - phiStarSq_d ) - kappa_d*lapPhi;

        // Interfacial and gravity forces
        Fx = muPhi*(2.f*(phi_d[gridId(ie,j ,k )] - ...
        Fy = muPhi*(2.f*(phi_d[gridId(i ,jn,k )] - ...
        Fz = muPhi*(2.f*(phi_d[gridId(i ,j ,kt)] - ...
        if( phin > 0.f ) Fz = Fz + grav_d;

        // Equilibrium coefficients
        Af = 0.5f * Gamma_d * invTauPhi_d * muPhi;
        Cf = invTauPhi_d * invEta2_d * phin;

        // Collision update
        f_0_d[idx2] = invTauPhiOne_d * f_0_d[idx] - 6.0f*Af + invTauPhi_d*phin;
        f_1_d[idx2] = invTauPhiOne_d * f_1_d[idx] + Af + Cf * ux_d[idx];
        f_2_d[idx2] = invTauPhiOne_d * f_2_d[idx] + Af - Cf * ux_d[idx];
        f_3_d[idx2] = invTauPhiOne_d * f_3_d[idx] + Af + Cf * uy_d[idx];
        f_4_d[idx2] = invTauPhiOne_d * f_4_d[idx] + Af - Cf * uy_d[idx];
        f_5_d[idx2] = invTauPhiOne_d * f_5_d[idx] + Af + Cf * uz_d[idx];
        f_6_d[idx2] = invTauPhiOne_d * f_6_d[idx] + Af - Cf * uz_d[idx];
    }
}
The MPI implementation is fairly straightforward, and consists
of a simple domain decomposition along the Z direction.
Each GPU is assigned a subdomain of the total computational
volume and uses one additional layer of ghost cells which are
synchronized every time step to ensure the correct calculation
of the gradient and laplacian of the phase, φ, and the correct
update of the phase and momentum distribution functions
f and g. Each MPI synchronization requires communication
between device and host, which is actually the more expensive
of the two operations. This scheme is illustrated in Figure
3. This 1D partitioning scheme was chosen to avoid additional
synchronization requirements due to the block/thread
decomposition in each card. A 1D partitioning scheme is not
scalable to a very large number of cards, and in the future
we will consider using a 2D or 3D partitioning scheme that
provides more flexibility. Adding synchronization and host/device
data exchange steps is expensive, and not necessary when using
a relatively small number of GPUs; however, there will be cases
where the number of GPUs and the domain size and shape are
such that the actual bottleneck of the application is the MPI
exchange, and in those cases a more flexible partitioning scheme
will improve scalability.
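A minimal sketch of the 1D decomposition, assuming NZ is evenly divisible by the number of MPI tasks and non-periodic ends (variable and function names are illustrative, not those of the actual code):

#include <mpi.h>

#define NZ 768   // global extent along Z (as in the strong scaling tests)

// Each MPI task (one per GPU) owns NZ/nTasks planes plus one ghost layer
// on each side, synchronized every time step.
void zDecompose( int *nzLocal, int *zFirst, int *below, int *above )
{
    int nTasks, rank;
    MPI_Comm_size( MPI_COMM_WORLD, &nTasks );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    *nzLocal = NZ / nTasks;            // interior planes owned by this task
    *zFirst  = rank * (*nzLocal);      // first global z plane of this task

    // Neighbors for the ghost-cell exchange along Z
    *below = ( rank == 0 )          ? MPI_PROC_NULL : rank - 1;
    *above = ( rank == nTasks - 1 ) ? MPI_PROC_NULL : rank + 1;
}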
Fig. 3. Communication scheme showing data flow from GPUs to hosts and MPI exchange to update ghost cells. Each row of squares represents the work done by a single CUDA thread, and ghost cells are represented by black filled squares.
Due to the additional term used in the streaming of the
phase distribution function (6) and to the need of nearest
neighbor data for the gradient and laplacian calculations in
the velocity equation (11) and chemical potential equation
(12), a multiphase implementation of LBM requires two extra
synchronization and communication points per time step. This
makes the strong scaling of multiphase LBM codes significantly
more challenging than that of single phase LBM codes.
As an example, the update of the phase on the ghost nodes
prior to the collision step looks like:
// Pack task-boundary Phi values
pack_phi <<<dimgrid,dimblock>>> ( top_snd_d, bot_snd_d, phi_d );

// Synchronize to ensure packing is complete
// and copy data to host
cudaThreadSynchronize();
cudaMemcpy( top_snd, top_snd_d, SIZE, cudaMemcpyDeviceToHost );
cudaMemcpy( bot_snd, bot_snd_d, SIZE, cudaMemcpyDeviceToHost );

// Carry out MPI update using
// Isend/Irecv/Waitall MPI calls
mpiUpdate_phi( top_snd, bot_snd, top_rcv, bot_rcv );

// Copy exchanged data to device and unpack
cudaMemcpy( top_rcv_d, top_rcv, SIZE, cudaMemcpyHostToDevice );
cudaMemcpy( bot_rcv_d, bot_rcv, SIZE, cudaMemcpyHostToDevice );
unpack_phi <<<dimgrid,dimblock>>> ( top_rcv_d, bot_rcv_d, phi_d );

// Carry out collision step
collision_f <<<dimgrid,dimblock>>> ( f_0, ... );
collision_g <<<dimgrid,dimblock>>> ( g_0, ... );
Similar code is used for the data exchange between GPUs
required before the streaming of f and after the streaming step
for both f and g.
III. RESULTS
A. Validation Tests
To validate the GPU code we chose to compare the calculated
value of the terminal velocity of a rising bubble of
spherical shape with its theoretical value in low Reynolds
number flow. Assuming no deformation occurs, the gravitational
and drag forces in this case are described by the following
equations:
F_{grav} = \frac{4}{3}\pi R^3 (\rho_H - \rho_L)\, g   (17)

F_{drag} = 6\pi R U \mu_H   (18)
where R is the radius of the sphere, U its velocity along
the vertical direction, ρL its density, and ρH and μH are
the density and dynamic viscosity of the surrounding fluid
respectively. The terminal velocity, Ut, is obtained by equating
these two expressions:
U_t = \frac{8R^2(\rho_H - \rho_L)\,g}{6(\rho_H + \rho_L)(\tau_n - 1/2)} = \frac{Eo\,\sigma}{3(\rho_H + \rho_L)(\tau_n - 1/2)}   (19)
with Eo the Eötvös number.
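Equation (19) follows from equating (17) and (18) if one assumes the usual lattice-unit relation between dynamic viscosity and relaxation time, \mu_H = n(\tau_n - 1/2)/3 with n = (\rho_H + \rho_L)/2, and the Eötvös number based on the bubble diameter, Eo = g(\rho_H - \rho_L)(2R)^2/\sigma (both standard conventions, not stated explicitly in the paper):

U_t = \frac{\tfrac{4}{3}\pi R^3 (\rho_H - \rho_L)\,g}{6\pi R \mu_H}
    = \frac{8R^2(\rho_H - \rho_L)\,g}{6(\rho_H + \rho_L)(\tau_n - 1/2)}
    = \frac{Eo\,\sigma}{3(\rho_H + \rho_L)(\tau_n - 1/2)}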
All the calculations were done using σ = 0.01 for a sphere
of radius R = 20 in a domain of size 128x128x512. The
interface width was set to 4 units, the interface mobility was
set to Γ = 400, and the relaxation parameters used were τn = 1.0
and τφ = 0.7. The simulations were run for 20000 time steps in
order to ensure that the terminal velocity had been reached.
These parameters ensure that the Reynolds number remains
well below 100, so that the analytical expression (19) can be
used for the comparison. Table I shows that the computed values
for the terminal velocity differ from the theoretical prediction
by approximately 3% only.
TABLE I
TERMINAL VELOCITY OF RISING BUBBLE

Eo   Usim           Utheory        Error (%)
10   6.48 · 10^−5   6.66 · 10^−5   2.7
15   9.69 · 10^−5   1.00 · 10^−4   3.0
20   1.29 · 10^−4   1.33 · 10^−4   3.0
25   1.61 · 10^−4   1.66 · 10^−4   3.1
30   1.94 · 10^−4   2.00 · 10^−4   2.5
35   2.26 · 10^−4   2.33 · 10^−4   3.0
40   2.58 · 10^−4   2.66 · 10^−4   3.0
B. Performance Evaluation
A series of test with increasingly large computational do-
mains were run to investigate the performance of the GPU
code when compared to a similar CPU version. The CPU
version was run on both AMD Barcelona and Intel Nehalem
architectures in the Ranger and Longhorn clusters at TACC.
Both CPU and GPU versions of the code use single precision
floating point variables. The CPU code used is a single
precision version of the open source package MPLABS [2].
For the comparison a 1D decomposition along the Z direction
was also used in the CPU code, using all four cores in a
socket.
Results show that the GPU code is up to 10 times faster
than the CPU code running on a quad-core AMD Barcelona
socket, and up to 4 times faster than running on a quad-
core Intel Nehalem socket. This corresponds to 40x and 16x
speedups with respect to single-core CPU executions, which is
an excellent result when considering that it corresponds to the
complete application execution and not that of a hand-picked
kernel. Figure 4 shows the speedup factor as a function of the
domain size for tests of sizes between 64^3 and 224^3 mesh
points. It is important to note that regular runs were used
for these calculations, with statistical data (average speed of
the secondary phase, maximum velocity in the computational
domain, and phase conservation) being calculated, transferred
from device to host, and written to file every 100 iterations
out of a total of 1000 iterations for the test. Tests were run
three times, and results presented correspond to the average
of the recorded execution times. No significant variation was
observed between runs.
Fig. 4. Speedup factor as a function of the computational domain size. Values correspond to 1 GPU compared to 1 CPU quad-core socket.
Strong scalability tests were performed on a domain of size
128x128x768 using up to 64 GPUs. An elongated domain
was chosen to avoid work starvation when increasing the
number of GPUs. Figure 5 shows that the code scales well
up to 32 GPUs, where the parallel efficiency degradation
starts becoming obvious. The parallel efficiency when using
32 GPUs is still in the 70% range, but it drops quickly to
55% by the time the code is using 64 GPUs. The measure of
performance used in these plots is MLUPS (Millions of Lattice
site UPdates per Second), which is commonly used by LBM
researchers as a measure of efficiency. The scalability can
be further improved by using asynchronous memory copies
between device and host, as well as reducing some of the
communication overhead by employing one-sided differentiation
schemes for the derivatives of the phase parameter, φ,
calculated at the task boundary limit. The author is currently
working on these improvements.
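For reference, the MLUPS values and the parallel efficiencies quoted above follow the standard definitions (stated here for completeness; the paper does not spell them out):

\mathrm{MLUPS} = \frac{N_x N_y N_z \times n_{\mathrm{steps}}}{10^6 \times t_{\mathrm{wall}}},
\qquad
E(N_{\mathrm{GPU}}) = \frac{\mathrm{MLUPS}(N_{\mathrm{GPU}})}{N_{\mathrm{GPU}} \times \mathrm{MLUPS}(1)}.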
Fig. 5. Strong scaling as a function of the number of GPUs.
Lattice Boltzmann calculations require very large meshes
in order to provide accurate results due to their mesoscopic
nature, so a more interesting measure of scalability from the
application point of view is how efficient the parallelization is
with regards to increasing volume sizes. Figure 6 shows the
scalability of the code using a domain of size 128x128x768
per GPU. The efficiency of this test is never below 95% for
the cases studied. The excellent weak scaling behaviour is due
to the reliance on point-to-point near-neighbor communication
amongst the hosts.
Fig. 6. Weak scaling as a function of the number of GPUs.
IV. CONCLUSIONS
An MPI-CUDA implementation of a multiphase LBM suitable
for large density differences has been developed. The
implementation has been validated against a classic bubble
rising test, and the performance of the code has been shown
to provide a speedup of up to 10 times with respect to the
CPU version of the same code running on all four cores of a
CPU socket. Furthermore, the implementation shows good
scalability over multiple GPUs and should allow researchers to
exploit the increasing number of available clusters with hybrid
CPU/GPU architectures. The significant speedup gained by
implementing this multiphase LBM code on GPUs allows for
longer simulations, and this opens the possibility of using more
realistic parameters in LBM multiphase flow simulations.
Future work will be done in two areas: from the Physics
point of view we would like to carry out detailed bubble
clustering investigations using this code; from the Performance
point of view we would like to include 2D partitioning and
asynchronous device/host data exchanges. Given the excellent
weak scaling of the code we will be able to carry out bubble
clustering studies using LBM with a level of detail not achieved
before, and hopefully shed some light on the dynamics of this
process. On the other hand, we are aware of a few changes that
would make this implementation even more scalable, and we
hope to explore those in parallel with our clustering research.
ACKNOWLEDGMENTS
The author would like to thank Dr. Paul Navratil and Dr.
John Cazes from TACC for useful discussions on this subject,
as well as David Carver from the Systems Group at TACC for
assistance running on the Longhorn cluster.
APPENDIX A
The f distribution function is discretized using D3Q7 and
the g distribution using D3Q19. Using these discretization
schemes the equilibrium values for the distribution functions
are given by,
g^{eq}_{i=0,\ldots,18} = w_i A^g_i + w_i n\left( 3 c_{i\alpha}u_\alpha + \frac{9}{2}u_\alpha u_\beta c_{i\alpha}c_{i\beta} - \frac{3}{2}u^2 \right)   (20)

f^{eq}_{i=0,\ldots,6} = A^f_i + B^f_i \phi + C^f_i \phi\, \mathbf{c}_i \cdot \mathbf{u}   (21)
The equilibrium coefficients (A^f_i, B^f_i, C^f_i) are given by,
A^f_0 = -2\Gamma\mu_\phi   (22)

A^f_{i=1,\ldots,6} = \frac{1}{2}\Gamma\mu_\phi   (23)

B^f_0 = 1   (24)

B^f_{i=1,\ldots,6} = 0   (25)

C^f_{i=0,\ldots,6} = \frac{1}{2\eta}   (26)
and the equilibrium coefficients (A^g_i, w_i) are defined as,
A^g_0 = \frac{1}{4}\left[ 9n - 15\left( \phi\mu_\phi + \frac{n}{3} \right) \right]   (27)

A^g_{i=1,\ldots,18} = 3\phi\mu_\phi + n   (28)

w_0 = \frac{4}{9}   (29)

w_{i=1,\ldots,6} = \frac{1}{9}   (30)

w_{i=7,\ldots,18} = \frac{1}{36}   (31)
REFERENCES
[1] S. Succi, The Lattice Boltzmann Equation for Fluid Dynamics and Beyond. New York, NY: Oxford University Press, 2001.
[2] C. Rosales. (2007) Multi-Phase Lattice Boltzmann Suite. [Online]. Available: http://code.google.com/p/mplabs
[3] J. Desplat, I. Pagonabarraga, and P. Bladon, "LUDWIG: A parallel Lattice-Boltzmann code for complex fluids," Computer Physics Communications, vol. 134, pp. 273–290, 2001.
[4] O. Filippova and D. Hanel, "Grid Refinement for Lattice-BGK Models," Journal of Computational Physics, vol. 147, pp. 219–228, 1998.
[5] C. Rosales and D. Whyte, "Dual Grid Lattice Boltzmann Method for Multiphase Flows," International Journal for Numerical Methods in Engineering, vol. 84, pp. 1068–1084, 2010.
[6] J. Tolke and M. Krafczyk, "TeraFLOP computing on a desktop PC with GPUs for 3D CFD," International Journal of Computational Fluid Dynamics, vol. 22, pp. 443–456, 2008.
[7] H. Zheng, C. Shu, and Y. Chew, "A lattice Boltzmann model for multiphase flows with large density ratio," Journal of Computational Physics, vol. 218, pp. 353–371, 2006.
[8] T. Inamuro, S. Tajima, and F. Ogino, "Lattice Boltzmann simulation of droplet collision dynamics," International Journal of Heat and Mass Transfer, vol. 47, pp. 4649–4657, 2004.
[9] P. Bhatnagar, E. Gross, and M. Krook, "A model for collision processes in gases. I. Small amplitude processes in charged and neutral one-component systems," Physical Review, vol. 94, pp. 511–525, 1954.
[10] R. Nourgaliev, T. Dinh, T. Theofanous, and D. Joseph, "The lattice Boltzmann equation method: theoretical interpretation, numerics and implications," International Journal of Multiphase Flow, vol. 29, pp. 117–, 2003.