
Multiphase LBM Distributed Over Multiple GPUs

Carlos Rosales

Texas Advanced Computing Center, The University of Texas at Austin
Research Office Complex 1.101, J.J. Pickle Research Campus, Building 196
10100 Burnet Road, Austin, Texas 78758-4497
[email protected]

Abstract—A parallel distributed CUDA implementation of a Lattice Boltzmann Method for multiphase flows with large density ratios is described in this paper. Validation runs studying the terminal velocity of a rising bubble under the effect of gravity show good agreement with the expected theoretical values. The code is benchmarked against the performance of a typical CPU implementation of the same algorithm on both AMD and Intel platforms, and a single GPU is observed to perform up to 10X faster than a quad-core CPU socket, a 40X speedup with respect to a single core. The code is shown to scale well when executed on multiple GPUs, which makes the port to CUDA valuable even when compared to parallel CPU implementations.

I. INTRODUCTION

Multiphase flows have application in most areas of daily life, and their understanding is of great importance in both academia and industry. Computational methods are particularly suited for the study of multiphase flows because they allow us to analyze the effect of the many variables involved in the problem in an independent manner, which is very challenging to do in classical experimental work. Even though a tremendous effort has been put into improving models and numerical techniques to investigate multiphase flows, they remain one of the most challenging subjects in computational physics and engineering.

One of the most challenging issues in the study of multiphase flows is the range of spatial and temporal scales that need to be analyzed. Interfacial phenomena can be localized in small volumes, but for the models to be useful the calculations often need to extend far beyond the interfacial regions. Also, the timescales involved in interfacial phenomena are short, but at the same time one needs to be able to simulate the evolution of the multiphase system over significant periods of time. Traditional Computational Fluid Dynamics (CFD) methods are capable of large system evolution simulations, but tend to use high level models to represent interactions at the interface level. Kinetic techniques like the Lattice Boltzmann Method [1] are better suited to more detailed analysis of short time scale evolution, but are hindered by their explicit nature and require many time steps to produce useful results. Parallel implementations [2], [3], adaptive meshing algorithms [4] and multiple meshing levels [5] have helped improve the performance of LBM so that more simulation steps can be taken per second, but large scale simulations over long periods of time remain a challenge.

Although graphics cards have been attractive in terms of raw computing power for some time, it has only been recently that user-friendly application programming interfaces like CUDA or OpenCL made them practical for the general research community. Over the past few years implementations of single phase LBM on GPUs have shown excellent performance when compared to their CPU equivalents [6], and in this work we explore the extension of multiphase LBM to the CUDA framework. In particular we present an implementation of the Zheng-Shu-Chew multiphase model [7] in CUDA, and show that execution on a GPU offers excellent speedup for the simulation of large density ratio multiphase flows.

In Section 2 we describe the LBM model used and the CUDA implementation details. Section 3 contains validation and performance data, and Section 4 offers our conclusions and suggestions for further improvements.

II. METHODOLOGY

A. The Zheng-Shu-Chew Multiphase Model

We model a flow with two immiscible fluids using the mass conservation equation, the Navier-Stokes equations, and an interface evolution equation as in reference [7]:

\frac{\partial n}{\partial t} + \nabla \cdot (n \mathbf{u}) = 0,    (1)

\frac{\partial (n \mathbf{u})}{\partial t} + \nabla \cdot (n \mathbf{u}\mathbf{u}) = -\nabla \cdot \mathbf{P} + \mu_n \nabla^2 \mathbf{u} + \mathbf{F}_b,    (2)

\frac{\partial \phi}{\partial t} + \nabla \cdot (\phi \mathbf{u}) = \theta_M \nabla^2 \mu_\phi,    (3)

where \mu_n is the dynamic viscosity of the fluid, \mu_\phi is the chemical potential, \theta_M is the mobility of the interface (molecular diffusion mobility), \mathbf{P} is the pressure tensor, \mathbf{F}_b is the body force, and n and \phi are defined as:

n = \frac{\rho_A + \rho_B}{2}, \qquad \phi = \frac{\rho_A - \rho_B}{2},    (4)

with A and B indicating each of the fluids. This model uses an average density n that is the same for every computational node. As in other Free-Energy methods, the interfacial force originates from the derivatives of \phi as described below.

The advantage of this model is that the interface between the two phases is captured using a convective Cahn-Hilliard equation with second order accuracy without the inclusion of a pressure correction term. This is a significant step forward in the efficiency of the method, since the pressure correction step is computationally very expensive and typically forces simulations to be limited to density ratios below one hundred [8]. This is achieved by using a standard lattice Boltzmann equation for the momentum distribution function g, but introducing an over-relaxation term in the equation for the order parameter f:

g_i(\mathbf{x} + \mathbf{c}_i \delta t, t + \delta t) = g_i(\mathbf{x}, t) + \Omega_i^g(\mathbf{x}, t)    (5)

f_i(\mathbf{x} + \mathbf{c}_i \delta t, t + \delta t) = f_i(\mathbf{x}, t) + \Omega_i^f(\mathbf{x}, t) + (1 - \eta)\left[ f_i(\mathbf{x} + \mathbf{c}_i \delta t, t) - f_i(\mathbf{x}, t) \right]    (6)

Here \Omega_i is the collision term in the BGK [9] approximation:

\Omega_i^g = \frac{g_i^{eq}(\mathbf{x}, t) - g_i(\mathbf{x}, t)}{\tau_n}    (7)

\Omega_i^f = \frac{f_i^{eq}(\mathbf{x}, t) - f_i(\mathbf{x}, t)}{\tau_\phi}    (8)

where g_i and f_i are the distribution functions for the momentum and the phase, \tau_n and \tau_\phi are their respective relaxation times, \mathbf{c}_i is the lattice velocity, and \eta is the over-relaxation constant coefficient. This scheme reduces to the standard lattice Boltzmann equation when \eta is unity. The detailed expressions of the equilibrium distribution functions g^{eq} and f^{eq} are given in the Appendix. The index i in these equations refers to each of the chosen directions in the discretized velocity space. This implementation uses a D3Q19 discretization for g, but requires only a D3Q7 discretization for f. The velocity directions involved in these two discretization modes are shown in Figure 1.

Fig. 1. D3Q19 discretization scheme. Red arrows (color online, numbered) mark the subset of directions that defines the D3Q7 scheme.

The macroscopic quantities corresponding to the order parameter \phi, the density of the fluid n, and its velocity \mathbf{u} are defined in terms of the distribution functions f and g:

\phi(\mathbf{x}, t) = \sum_i f_i(\mathbf{x}, t)    (9)

n(\mathbf{x}, t) = \sum_i g_i(\mathbf{x}, t)    (10)

\mathbf{u}(\mathbf{x}, t) = \frac{1}{n}\left[ \sum_i \mathbf{c}_i g_i + \frac{1}{2}\left( \mu_\phi \nabla\phi + \mathbf{F}_b \right) \right]    (11)

where the body force representing gravity is given by \mathbf{F}_b = 2\phi^* \mathbf{g}, and the chemical potential is:

\mu_\phi = 4A\left(\phi^3 - \phi^{*2}\phi\right) - \kappa \nabla^2 \phi,    (12)

with \phi^* defined as the constant \phi^* = (\rho_A - \rho_B)/2 using the initially given values of \rho_A and \rho_B. In this expression the parameters A and \kappa depend on the surface tension, \sigma, and the interface width, W:

A = \frac{3\sigma}{W\phi^{*4}},    (13)

\kappa = \frac{1}{2}A\left(W\phi^*\right)^2.    (14)

The Chapman-Enskog expansion, together with the method of successive approximations [10], can be used to recover the convective Cahn-Hilliard equation (3) to second order accuracy from equation (6) using the following definitions of the coefficient \eta and the mobility \theta_M:

\eta = \frac{1}{\tau_\phi + 0.5}    (15)

\theta_M = \eta\left(\tau_\phi\,\eta - \frac{1}{2}\right)\delta\,\Gamma    (16)

B. MPI-CUDA Implementation

From the point of view of a programmer, using CUDA on a GPU is essentially equivalent to working with a massively threaded processor which has minimal overhead in the thread spawning process. The sheer number of threads can hide memory latencies, but it is important that the programmer be aware of the memory access patterns of the algorithm in use.

At the logical level the CUDA framework consists of a regular grid of blocks, with each block being assigned a fixed number of threads that can be indexed in three dimensions. In this work we define a grid of blocks in the X and Y directions, and assign each block a number of threads in each direction corresponding to the number of points assigned to it. Each of these threads will work on all points along the Z direction for a given (x,y) value. This is better explained with a concrete example. Let us consider a simulation domain of size 64x64x64. If we wish to have 256 threads per block we could define BLOCK_SIZE_X = 32 and BLOCK_SIZE_Y = 8. This corresponds to having 64/32 = 2 blocks in the X direction and 64/8 = 8 blocks in the Y direction, for a total of 16 blocks, as shown in Figure 2.


Fig. 2. Domain distribution in CUDA framework for a 64x64x64 simulation domain. a) Physical node assignment. Each block contains 32x8x64 = 16384 simulation points. b) Logical block/thread partition. Each thread is assigned to a single (x,y) value and works on all z values.

The computational volume is divided in a regular grid of blocks in the X and Y directions. Each computational block consists of a number of threads in the X and the Y directions, which we define as BLOCK_SIZE_X and BLOCK_SIZE_Y, so that our partitioning in CUDA will look like:

// Define number of threads in each block
#define BLOCK_SIZE_X 32
#define BLOCK_SIZE_Y 8
...
// Define the grid and specify the number
// of threads per block
dim3 dimblock(BLOCK_SIZE_X,BLOCK_SIZE_Y,1);
dim3 dimgrid(NX/BLOCK_SIZE_X,NY/BLOCK_SIZE_Y,1);
...
// Call a kernel with the previously
// defined grid
stream_f <<<dimgrid,dimblock>>> ( ... );

It is important to notice that in the CUDA framework threads do not access the memory individually, but in groups of 32, known as warps. The GPU will coalesce accesses to contiguous memory elements in single access groups of up to 128 bytes in length. This makes contiguous access for adjacent threads critical for high performance computing on GPUs, as non-contiguous access will lead to a significantly reduced effective memory bandwidth. The vectorized code used in this work was written in such a way that variables corresponding to contiguous values in x were adjacent in memory in order to improve coalesced memory access on the GPU.

The following steps were taken to implement the ZSC model in the CUDA framework:

• Break down all major steps into kernel subroutines
  – Hydrodynamics update (u, ρ, φ); a sketch of such a kernel is given after this list
  – Collision
  – Stream
• Use one array per LBM direction for the distribution functions
• Use one-dimensional arrays for all quantities
• Index arrays so that warps of threads access contiguous memory positions
• Minimize device/host data exchange (MPI exchange data and program output)
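The hydrodynamics update kernel itself is not listed in the paper. The following is a minimal sketch of what such a kernel could look like under the conventions used in the collision kernel listed later (one thread per (x,y) column, array names f_*_d and g_*_d, compile-time domain sizes NX, NY, NZ); it only accumulates φ and n according to Eqs. (9) and (10), while the velocity update of Eq. (11) additionally needs the gradient of φ and the body force.

// Illustrative sketch only (not the paper's actual kernel): hydrodynamics
// update computing phi = sum_i f_i (Eq. 9) and n = sum_i g_i (Eq. 10).
// NX, NY, NZ are assumed compile-time domain sizes, and the linear index
// uses x as the fastest-varying coordinate, as described in the text.
__global__ void update_phi_rho_sketch( float *phi_d, float *rho_d,
    float *f_0_d, float *f_1_d, float *f_2_d, float *f_3_d,
    float *f_4_d, float *f_5_d, float *f_6_d,
    float *g_0_d, float *g_1_d /* ..., remaining g direction arrays */ )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    for( int k = 1; k < NZ-1; k++ ){
        int idx = i + j*NX + k*NX*NY;

        // Order parameter: sum over the 7 directions of f (Eq. 9)
        phi_d[idx] = f_0_d[idx] + f_1_d[idx] + f_2_d[idx] + f_3_d[idx]
                   + f_4_d[idx] + f_5_d[idx] + f_6_d[idx];

        // Fluid density: sum over the 19 directions of g (Eq. 10);
        // the remaining g terms are elided here for brevity
        rho_d[idx] = g_0_d[idx] + g_1_d[idx] /* + ... */;
    }
}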

Having multiple kernels, each doing a relatively simple task, helps to construct an efficient CUDA code because the resources required by each thread will be less than those of a very complex kernel. This is particularly important when considering that there is a very limited amount of registers available in a GPU, and that after these registers are exhausted the code will have to allocate variables local to a kernel in the local memory space, which is physically located on the main GPU memory and thus has a much longer access latency than the registers available to each streaming multiprocessor. There is a need to reach a balance between simplifying the kernels and having many of them because of the overhead related to launching each kernel. We found that creating tasks for logically independent actions (velocity update, density update, collision step, etc.) provided sufficient performance without the need to generate cumbersome code.

We chose to employ one-dimensional arrays to store data in order to use the native CUDA memory allocation and deallocation routines cudaMalloc() and cudaFree() as well as the device/host transfer function cudaMemcpy(). The use of one array for each of the LBM directions was arbitrary. One could have equally chosen to employ a single array to hold them all.

The indexing of the arrays was done in such a way that the values along x were contiguous in memory. We were also careful to request block sizes that were 32 threads wide along the x direction, so that during runtime the request for 32 values of a particular array from each thread warp results in one coalesced memory access. Failure to take such measures results in significantly degraded performance, particularly when using algorithms that perform many memory accesses and have little data reuse.
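The gridId() helper used in the kernels below is not listed in the paper; a minimal sketch consistent with the layout just described (one-dimensional storage with x contiguous) could look as follows, although the definition in the original code may differ.

// Hypothetical indexing helper (the original gridId() is not shown in the
// paper): maps (i,j,k) to a linear offset with x as the fastest-varying
// coordinate, so that a 32-thread warp spanning consecutive i values
// touches 32 consecutive array elements and the access coalesces.
__device__ int gridId( int i, int j, int k )
{
    return i + j*NX + k*NX*NY;   // NX, NY assumed to be compile-time constants
}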

The data exchange between device and host required for every MPI exchange is the most limiting factor when using multiphase LBM algorithms. Instead of exchanging all the distribution function values to the ghost cells every time step we only exchange the values that are absolutely necessary. This corresponds to 5 elements of g, 1 of f and the value of φ per cell across task face boundaries, plus one element of g and the value of φ per cell for every task edge boundary.

In our implementation we allocate one array twice the size of the grid to hold the current and collision values for each of the distribution function directions. This means that there are a total of 19 + 7 = 26 arrays to be passed to the collision and stream functions (from the D3Q19 and D3Q7 discretizations), plus the nearest neighbor information and the macroscopic quantities. Because of the limitation in the number of arguments that can be passed on to a kernel function in CUDA (a maximum of 32 pointer-sized arguments) the collision and the stream steps were divided into separate kernels for the f and g distribution functions.
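As an illustration of this storage scheme, the allocation of one phase-distribution direction might look like the following sketch; only cudaMalloc() and cudaFree() are taken from the text, and the variable names are assumptions.

// Sketch: one array per LBM direction, sized to hold both the current
// values (first gridSize entries) and the post-collision values written
// at an offset of gridSize, matching the idx/idx2 offsets in collision_f.
size_t gridSize = (size_t)NX * NY * NZ;   // lattice sites in this GPU subdomain
float *f_1_d = NULL;
cudaMalloc( (void**)&f_1_d, 2 * gridSize * sizeof(float) );
// ... the remaining f and g direction arrays are allocated in the same way
cudaFree( f_1_d );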

The code section below shows most of the GPU kernel code for the collision of the phase distribution function, f. The code shows that few changes are required from a standard CPU implementation besides identifying the thread numbers along the X and Y directions and then calculating the correct index for the current Z value, and its neighboring points needed for the gradient and laplacian terms, before proceeding with the update. Notice that some of the constants below are standard quantities defined by CUDA at run time (blockIdx, blockDim, threadIdx) while others have been defined previously in the code (gridSize, alpha4_d, phiStarSq_d, kappa_d, invTauPhi_d, invEta2_d, invTauPhiOne_d) and reside in the GPU memory as symbols.

__global__ void collision_f( float *f_0_d,
    float *f_1_d, float *f_2_d, float *f_3_d,
    float *f_4_d, float *f_5_d, float *f_6_d,
    int *nb_east_d, int *nb_west_d,
    int *nb_north_d, int *nb_south_d,
    float *phi_d, float *rho_d,
    float *ux_d, float *uy_d, float *uz_d )
{
    // 23 Local Variables
    int i, idx, idx2, ie, iw, j, jn, js, k, kt, kb;
    float phin, muPhi, invRho;
    float Af, Cf, Fx, Fy, Fz, lapPhi;

    // Identify current thread
    i = blockIdx.x * blockDim.x + threadIdx.x;
    j = blockIdx.y * blockDim.y + threadIdx.y;

    // Main collision loop
    for( k = 1; k < NZ-1; k++ ){

        // Define index of current point in old (idx)
        // and new (idx2) f arrays
        idx  = gridId( i, j, k );
        idx2 = gridId( i, j, k ) + gridSize;

        // Define for later reuse
        phin   = phi_d[idx];
        invRho = 1.f / rho_d[idx];

        // Near neighbors
        ie = nb_east_d[idx];
        iw = nb_west_d[idx];
        ...

        // Laplacian of the order parameter Phi
        lapPhi = ( phi_d[ gridId(ie,jn,k ) ] + ...

        // Chemical potential
        muPhi = alpha4_d*phin*( phin*phin - phiStarSq_d ) - kappa_d*lapPhi;

        // Interfacial and gravity forces
        Fx = muPhi*(2.f*(phi_d[gridId(ie,j ,k )] - ...
        Fy = muPhi*(2.f*(phi_d[gridId(i ,jn,k )] - ...
        Fz = muPhi*(2.f*(phi_d[gridId(i ,j ,kt)] - ...
        if( phin > 0.f ) Fz = Fz + grav_d;

        // Equilibrium coefficients
        Af = 0.5f * Gamma_d * invTauPhi_d * muPhi;
        Cf = invTauPhi_d * invEta2_d * phin;

        // Collision update
        f_0_d[idx2] = invTauPhiOne_d * f_0_d[idx] - 6.0f*Af + invTauPhi_d*phin;
        f_1_d[idx2] = invTauPhiOne_d * f_1_d[idx] + Af + Cf * ux_d[idx];
        f_2_d[idx2] = invTauPhiOne_d * f_2_d[idx] + Af - Cf * ux_d[idx];
        f_3_d[idx2] = invTauPhiOne_d * f_3_d[idx] + Af + Cf * uy_d[idx];
        f_4_d[idx2] = invTauPhiOne_d * f_4_d[idx] + Af - Cf * uy_d[idx];
        f_5_d[idx2] = invTauPhiOne_d * f_5_d[idx] + Af + Cf * uz_d[idx];
        f_6_d[idx2] = invTauPhiOne_d * f_6_d[idx] + Af - Cf * uz_d[idx];
    }
}
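The streaming kernels are not listed in the paper. As an illustration of how the over-relaxed streaming of Eq. (6) might be written in the same style, the following sketch pushes a single f direction (+x) to its east neighbor; apart from gridId, nb_east_d and NZ, which appear in the listing above, all names are hypothetical and the actual stream_f kernel may be organized differently.

// Illustrative sketch only (not the paper's stream_f): over-relaxed
// streaming of one f direction along +x, following Eq. (6):
//   f_1(x + c_1, t + dt) = [f_1(x,t) + Omega_1^f]
//                        + (1 - eta) * [ f_1(x + c_1, t) - f_1(x, t) ]
// f1_old holds pre-collision values, f1_post the collision output, and
// f1_new receives the streamed values; eta_d is the constant of Eq. (15).
__global__ void stream_f1_sketch( float *f1_new, const float *f1_post,
                                  const float *f1_old, const int *nb_east_d,
                                  float eta_d )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    for( int k = 1; k < NZ-1; k++ ){
        int idx  = gridId( i, j, k );      // source node x
        int ie   = nb_east_d[idx];         // x coordinate of east neighbor
        int idxe = gridId( ie, j, k );     // destination node x + c_1*dt

        f1_new[idxe] = f1_post[idx]
                     + ( 1.f - eta_d ) * ( f1_old[idxe] - f1_old[idx] );
    }
}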

The MPI implementation is fairly straightforward, and consists of a simple domain decomposition along the Z direction. Each GPU is assigned a subdomain of the total computational volume and uses one additional layer of ghost cells which are synchronized every time step to ensure the correct calculation of the gradient and laplacian of the phase, φ, and the correct update of the phase and momentum distribution functions f and g. Each MPI synchronization requires communication between device and host, which is actually the more expensive of the two operations. This scheme is illustrated in Figure 3. This 1D partitioning scheme was chosen to avoid additional synchronization requirements due to the block/thread decomposition in each card. A 1D partitioning scheme is not scalable to a very large number of cards, and in the future we will consider using a 2D or 3D partitioning scheme that provides more flexibility. Although adding synchronization and host/device data exchange steps is expensive, and not necessary when using a relatively small number of GPUs, there will be cases where the number of GPUs and the domain size and shape are such that the actual bottleneck of the application will be the MPI exchange, and in those cases a more flexible partitioning scheme will improve scalability.

Fig. 3. Communication scheme showing data flow from GPUs to hosts and MPI exchange to update ghost cells. Each row of squares represents the work done by a single CUDA thread, and ghost cells are represented by black filled squares.


Due to the additional term used in the streaming of the phase distribution function (6) and to the need for nearest neighbor data in the gradient and laplacian calculations in the velocity equation (11) and chemical potential equation (12), a multiphase implementation of LBM requires two extra synchronization and communication points per time step. This makes the strong scaling of multiphase LBM codes significantly more challenging than that of single phase LBM codes. As an example, the update of the phase on the ghost nodes prior to the collision step looks like:

// Pack task-boundary Phi values
pack_phi <<<dimgrid,dimblock>>> ( top_snd_d, bot_snd_d, phi_d );

// Synchronize to ensure packing is complete
// and copy data to host
cudaThreadSynchronize();
cudaMemcpy( top_snd, top_snd_d, SIZE, cudaMemcpyDeviceToHost );
cudaMemcpy( bot_snd, bot_snd_d, SIZE, cudaMemcpyDeviceToHost );

// Carry out MPI update using
// Isend/Irecv/Waitall MPI calls
mpiUpdate_phi( top_snd, bot_snd, top_rcv, bot_rcv );

// Copy exchanged data to device and unpack
cudaMemcpy( top_rcv_d, top_rcv, SIZE, cudaMemcpyHostToDevice );
cudaMemcpy( bot_rcv_d, bot_rcv, SIZE, cudaMemcpyHostToDevice );
unpack_phi <<<dimgrid,dimblock>>> ( top_rcv_d, bot_rcv_d, phi_d );

// Carry out collision step
collision_f <<<dimgrid,dimblock>>> ( f_0, ... );
collision_g <<<dimgrid,dimblock>>> ( g_0, ... );

Similar code is used for the data exchange between GPUs required before the streaming of f and after the streaming step for both f and g.
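The body of mpiUpdate_phi is not shown in the paper; a minimal sketch of the nonblocking exchange it describes (Isend/Irecv/Waitall with the two neighbors of the 1D Z decomposition) could look as follows. The rank variables and the COUNT message size are assumptions for illustration.

// Hypothetical sketch of the host-side ghost exchange (not the paper's code).
// Requires <mpi.h>. topRank/botRank are the neighboring MPI tasks in the 1D
// Z decomposition and COUNT is the number of floats in one boundary plane,
// both assumed to be available as globals here.
void mpiUpdate_phi( float *top_snd, float *bot_snd,
                    float *top_rcv, float *bot_rcv )
{
    MPI_Request req[4];

    // Post receives first, then sends; tags pair each send with the
    // matching receive on the neighboring rank.
    MPI_Irecv( top_rcv, COUNT, MPI_FLOAT, topRank, 0, MPI_COMM_WORLD, &req[0] );
    MPI_Irecv( bot_rcv, COUNT, MPI_FLOAT, botRank, 1, MPI_COMM_WORLD, &req[1] );
    MPI_Isend( top_snd, COUNT, MPI_FLOAT, topRank, 1, MPI_COMM_WORLD, &req[2] );
    MPI_Isend( bot_snd, COUNT, MPI_FLOAT, botRank, 0, MPI_COMM_WORLD, &req[3] );

    MPI_Waitall( 4, req, MPI_STATUSES_IGNORE );
}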

III. RESULTS

A. Validation Tests

To validate the GPU code we chose to compare the calculated value of the terminal velocity of a rising bubble of spherical shape with its theoretical value in low Reynolds number flow. Assuming no deformation occurs, the gravitational and drag forces in this case are described by the following equations:

F_{grav} = \frac{4}{3}\pi R^3 \left(\rho_H - \rho_L\right) g    (17)

F_{drag} = 6\pi R U \mu_H    (18)

where R is the radius of the sphere, U its velocity along the vertical direction, \rho_L its density, and \rho_H and \mu_H are the density and dynamic viscosity of the surrounding fluid respectively. The terminal velocity, U_t, is obtained by equating these two expressions:

U_t = \frac{8 R^2 \left(\rho_H - \rho_L\right) g}{6 \left(\rho_H + \rho_L\right)\left(\tau_n - 1/2\right)} = \frac{Eo\,\sigma}{3 \left(\rho_H + \rho_L\right)\left(\tau_n - 1/2\right)}    (19)

with Eo the Eotvos number.
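For completeness, Eq. (19) follows from equating (17) and (18) and expressing the dynamic viscosity of the surrounding fluid in lattice units. The short derivation below assumes the usual LBM relations c_s^2 = 1/3 and \delta t = 1, the mean density n = (\rho_H + \rho_L)/2 used by the model, and an Eotvos number defined with the bubble diameter, Eo = 4(\rho_H - \rho_L) g R^2 / \sigma; none of these conventions are stated explicitly in the paper.

6\pi R U_t \mu_H = \frac{4}{3}\pi R^3 \left(\rho_H - \rho_L\right) g
\;\Rightarrow\;
U_t = \frac{2 R^2 \left(\rho_H - \rho_L\right) g}{9 \mu_H},
\qquad
\mu_H \approx \frac{\rho_H + \rho_L}{2}\, c_s^2 \left(\tau_n - \tfrac{1}{2}\right)\delta t
= \frac{\left(\rho_H + \rho_L\right)\left(\tau_n - 1/2\right)}{6},

so that

U_t = \frac{12 R^2 \left(\rho_H - \rho_L\right) g}{9\left(\rho_H + \rho_L\right)\left(\tau_n - 1/2\right)}
= \frac{8 R^2 \left(\rho_H - \rho_L\right) g}{6\left(\rho_H + \rho_L\right)\left(\tau_n - 1/2\right)}
= \frac{Eo\,\sigma}{3\left(\rho_H + \rho_L\right)\left(\tau_n - 1/2\right)}.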

All the calculations were done using σ = 0.01 for a sphere of radius R = 20 in a domain of size 128x128x512. The interface width was set to 4 units, the interface mobility was set to Γ = 400, and the relaxation parameters used were τ_n = 1.0 and τ_φ = 0.7. The simulations were run for 20000 time steps in order to ensure that the terminal velocity had been reached. These parameters ensure that the Reynolds number remains well below 100 so that the analytical expression (19) can be used for the comparison. Table I shows that the computed values for the terminal velocity differ from the theoretical prediction by approximately 3% only.

TABLE I
TERMINAL VELOCITY OF RISING BUBBLE

Eo    U_sim          U_theory       Error (%)
10    6.48 · 10^-5   6.66 · 10^-5   2.7
15    9.69 · 10^-5   1.00 · 10^-4   3.0
20    1.29 · 10^-4   1.33 · 10^-4   3.0
25    1.61 · 10^-4   1.66 · 10^-4   3.1
30    1.94 · 10^-4   2.00 · 10^-4   2.5
35    2.26 · 10^-4   2.33 · 10^-4   3.0
40    2.58 · 10^-4   2.66 · 10^-4   3.0

B. Performance Evaluation

A series of tests with increasingly large computational domains were run to investigate the performance of the GPU code when compared to a similar CPU version. The CPU version was run on both AMD Barcelona and Intel Nehalem architectures in the Ranger and Longhorn clusters at TACC. Both CPU and GPU versions of the code use single precision floating point variables. The CPU code used is a single precision version of the open source package MPLABS [2]. For the comparison a 1D decomposition along the Z direction was also used in the CPU code using all four cores in a socket.

Results show that the GPU code is up to 10 times faster than the CPU code running on a quad-core AMD Barcelona socket, and up to 4 times faster than running on a quad-core Intel Nehalem socket. This corresponds to 40x and 16x speedups with respect to single-core CPU executions, which is an excellent result when considering that it corresponds to the complete application execution and not that of a hand-picked kernel. Figure 4 shows the speedup factor as a function of the domain size for tests of sizes between 64^3 and 224^3 mesh points. It is important to note that regular runs were used for these calculations, with statistical data (average speed of the secondary phase, maximum velocity in the computational domain, and phase conservation) being calculated, transferred from device to host, and written to file every 100 iterations out of a total of 1000 iterations for the test. Tests were run three times, and the results presented correspond to the average of the recorded execution times. No significant variation was observed between runs.

Fig. 4. Speedup factor as a function of the computational domain size. Values correspond to 1 GPU compared to 1 CPU quad-core socket.

Strong scalability tests were performed on a domain of size 128x128x768 using up to 64 GPUs. An elongated domain was chosen to avoid work starvation when increasing the number of GPUs. Figure 5 shows that the code scales well up to 32 GPUs, where the parallel efficiency degradation starts becoming obvious. The parallel efficiency when using 32 GPUs is still in the 70% range, but it drops quickly to 55% by the time the code is using 64 GPUs. The measure of performance used in these plots is MLUPS (Millions of Lattice site UPdates per Second), which is commonly used by LBM researchers as a measure of efficiency. The scalability can be further improved by using asynchronous memory copies between device and host, as well as by reducing some of the communication overhead by employing one-sided differentiation schemes for the derivatives of the phase parameter, φ, calculated at the task boundary limit. The author is currently working on these improvements.
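For reference, MLUPS is simply the number of lattice sites updated per second divided by 10^6; a trivial helper to compute it from a wall-clock measurement might look like the following sketch, where the function name and arguments are illustrative and not taken from the paper.

// Hypothetical helper: millions of lattice site updates per second (MLUPS)
// for a run of nSteps time steps over an nx*ny*nz lattice taking
// elapsedSeconds of wall-clock time.
double mlups( int nx, int ny, int nz, int nSteps, double elapsedSeconds )
{
    double siteUpdates = (double)nx * ny * nz * nSteps;
    return siteUpdates / ( elapsedSeconds * 1.0e6 );
}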

Fig. 5. Strong scaling as a function of the number of GPUs.

Lattice Boltzmann calculations require very large meshes in order to provide accurate results due to their mesoscopic nature, so a more interesting measure of scalability from the application point of view is how efficient the parallelization is with regards to increasing volume sizes. Figure 6 shows the scalability of the code using a domain of size 128x128x768 per GPU. The efficiency of this test is never below 95% for the cases studied. The excellent weak scaling behaviour is due to the reliance on point-to-point near-neighbor communication amongst the hosts.

Fig. 6. Weak scaling as a function of the number of GPUs.

IV. CONCLUSIONS

An MPI-CUDA implementation of a multiphase LBM suitable for large density differences has been developed. The implementation has been validated against a classic bubble rising test, and the performance of the code has been shown to provide a speedup of up to 10 times with respect to a serial version of the same code running on a CPU socket with four cores. Furthermore, the implementation shows good scalability over multiple GPUs and should allow researchers to exploit the increasing number of available clusters with hybrid CPU/GPU architectures. The significant speedup gained by implementing this multiphase LBM code on GPUs allows for longer simulations, and this opens the possibility of using more realistic parameters in LBM multiphase flow simulations.

Future work will be done in two areas: from the Physics point of view we would like to carry out detailed bubble clustering investigations using this code; from the Performance point of view we would like to include 2D partitioning and asynchronous device/host data exchanges. Given the excellent weak scaling of the code we will be able to carry out bubble clustering studies using LBM with a level of detail not achieved before, and hopefully shed some light on the dynamics of this process. On the other hand, we are aware of a few changes that would make this implementation even more scalable, and we hope to explore those in parallel with our clustering research.

ACKNOWLEDGMENTS

The author would like to thank Dr. Paul Navratil and Dr. John Cazes from TACC for useful discussions on this subject, as well as David Carver from the Systems Group at TACC for assistance running on the Longhorn cluster.


APPENDIX A

The f distribution function is discretized using D3Q7 and the g distribution using D3Q19. Using these discretization schemes the equilibrium values for the distribution functions are given by,

g_i^{eq}\big|_{i=0,\ldots,18} = w_i A_i^g + w_i n \left( 3 c_{i\alpha} u_\alpha + \frac{9}{2} u_\alpha u_\beta c_{i\alpha} c_{i\beta} - \frac{3}{2} u^2 \right)    (20)

f_i^{eq}\big|_{i=0,\ldots,6} = A_i^f + B_i^f \phi + C_i^f \phi\, \mathbf{c}_i \cdot \mathbf{u}    (21)

The equilibrium coefficients (A_i^f, B_i^f, C_i^f) are given by,

A_0^f = -2\Gamma\mu_\phi    (22)

A_i^f\big|_{i=1,\ldots,6} = \frac{1}{2}\Gamma\mu_\phi    (23)

B_0^f = 1    (24)

B_i^f\big|_{i=1,\ldots,6} = 0    (25)

C_i^f\big|_{i=0,\ldots,6} = \frac{1}{2\eta}    (26)

and the equilibrium coefficients (A_i^g, w_i) are defined as,

A_0^g = \frac{1}{4}\left[ 9n - 15\left( \phi\mu_\phi + \frac{n}{3} \right) \right]    (27)

A_i^g\big|_{i=1,\ldots,18} = 3\phi\mu_\phi + n    (28)

w_0 = \frac{4}{9}    (29)

w_i\big|_{i=1,\ldots,6} = \frac{1}{9}    (30)

w_i\big|_{i=7,\ldots,18} = \frac{1}{36}    (31)

REFERENCES

[1] S. Succi, The Lattice Boltzmann Equation for Fluid Dynamics and Beyond. New York, NY: Oxford University Press, 2001.

[2] C. Rosales. (2007) Multi-Phase Lattice Boltzmann Suite. [Online]. Available: http://code.google.com/p/mplabs

[3] J. Desplat, I. Pagonabarraga, and P. Bladon, "LUDWIG: A parallel Lattice-Boltzmann code for complex fluids," Computer Physics Communications, vol. 134, pp. 273–290, 2001.

[4] O. Filippova and D. Hanel, "Grid Refinement for Lattice-BGK Models," Journal of Computational Physics, vol. 147, pp. 219–228, 1998.

[5] C. Rosales and D. Whyte, "Dual Grid Lattice Boltzmann Method for Multiphase Flows," International Journal for Numerical Methods in Engineering, vol. 84, pp. 1068–1084, 2010.

[6] J. Tolke and M. Krafczyk, "TeraFLOP computing on a desktop PC with GPUs for 3D CFD," International Journal of Computational Fluid Dynamics, vol. 22, pp. 443–456, 2008.

[7] H. Zheng, C. Shu, and Y. Chew, "A lattice Boltzmann model for multiphase flows with large density ratio," Journal of Computational Physics, vol. 218, pp. 353–371, 2006.

[8] T. Inamuro, S. Tajima, and F. Ogino, "Lattice Boltzmann simulation of droplet collision dynamics," International Journal of Heat and Mass Transfer, vol. 47, pp. 4649–4657, 2004.

[9] P. Bhatnagar, E. Gross, and M. Krook, "A model for collision processes in gases. I. Small amplitude processes in charged and neutral one-component systems," Physical Review, vol. 94, pp. 511–525, 1954.

[10] R. Nourgaliev, T. Dinh, T. Theofanous, and D. Joseph, "The lattice Boltzmann equation method: theoretical interpretation, numerics and implications," International Journal of Multiphase Flow, vol. 29, pp. 117–, 2003.
