GPU Cluster for Scientific Computing and Large-Scale Simulation

Zhe Fan, Feng Qiu, Arie Kaufman, Suzanne Yoakum-Stover
Stony Brook University, Stony Brook, NY 11794-4400

http://www.cs.sunysb.edu/~vislab/projects/gpgpu/GPU_Cluster/GPU_Cluster.html

The Stony Brook Visual Computing Cluster.

In this work, we use a cluster of GPUs for high performance scientific computing and large-scale simulation. Our scalable GPU cluster, named The Stony Brook Visual Computing Cluster, currently has 32 nodes connected by a 1 Gigabit network switch. Each node consists of an HP PC with an Nvidia GeForce FX 5800 Ultra (128 MB video memory) for GPU computation.

As the first example application, we have implemented on the GPU cluster a large-scale flow simulation using the computational fluid dynamics (CFD) model known as the Lattice Boltzmann Model (LBM). The LBM is second-order accurate in both time and space. The numerical method is highly parallelizable due to its linear nature, and most notably it can easily accommodate complex-shaped boundaries. The LBM models the Boltzmann dynamics of “flow particles” on a 3D lattice. These dynamics can be represented as a two-step process: (1) particle distributions stream synchronously along lattice links associated with different velocity directions; (2) between streaming steps, collisions locally drive the system toward equilibrium while conserving mass and momentum.
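
To make the two-step process concrete, here is a minimal sketch of one stream-and-collide cycle in Python/NumPy. It uses a 2D (D2Q9) lattice with a BGK collision purely for brevity; the simulation in this work runs on a 3D lattice, and every name and parameter below is illustrative rather than taken from the actual implementation.

```python
import numpy as np

# Illustrative 2D (D2Q9) lattice; the paper's simulation uses a 3D lattice.
NX, NY = 64, 64
c = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])          # lattice velocities
w = np.array([4/9] + [1/9] * 4 + [1/36] * 4)                # lattice weights
tau = 0.6                                                   # BGK relaxation time

# particle distributions f_i(x, y), initialized to the rest-state equilibrium
f = np.ones((9, NX, NY)) * w[:, None, None]

def equilibrium(rho, ux, uy):
    """Second-order equilibrium distribution for the BGK collision."""
    cu = c[:, 0, None, None] * ux + c[:, 1, None, None] * uy
    return w[:, None, None] * rho * (1 + 3 * cu + 4.5 * cu**2 - 1.5 * (ux**2 + uy**2))

def step(f):
    # (1) streaming: each distribution moves one cell along its velocity direction
    for i in range(9):
        f[i] = np.roll(np.roll(f[i], c[i, 0], axis=0), c[i, 1], axis=1)
    # (2) collision: relax toward local equilibrium (conserves mass and momentum)
    rho = f.sum(axis=0)
    ux = (c[:, 0, None, None] * f).sum(axis=0) / rho
    uy = (c[:, 1, None, None] * f).sum(axis=0) / rho
    f += (equilibrium(rho, ux, uy) - f) / tau
    return f

for _ in range(100):
    f = step(f)
```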

We have previously implemented the small-scale LBM computation on a single GPU. The particle distributions, velocity, and density are grouped into volumes and packed into stacks of textures. The streaming and collision operations are implemented by fragment programs that fetch any required neighbor state information from the appropriate textures, perform the computation, and then render the results to textures.
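
As a rough illustration of this packing, assuming a lattice size and a distribution count padded to a multiple of four (neither taken from the actual implementation), the snippet below groups four distribution components into the RGBA channels of each texture in a stack of 2D slices:

```python
import numpy as np

# Assumed lattice size and distribution count (e.g. a D3Q19-style model padded
# to 20 components so the distributions split evenly into RGBA groups).
NX, NY, NZ, Q = 64, 64, 32, 20

f = np.zeros((Q, NZ, NY, NX), dtype=np.float32)   # f_i(z, y, x)

# Pack four distributions at a time into the RGBA channels of a "texture":
# each texture is a stack of NZ slices of size NY x NX with 4 channels.
textures = [np.stack([f[i + ch] for ch in range(4)], axis=-1)   # (NZ, NY, NX, 4)
            for i in range(0, Q, 4)]

print(len(textures), textures[0].shape)   # 5 texture stacks of shape (32, 64, 64, 4)
```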

The main challenge in scaling the LBM computation to the GPU cluster is conducting the communication between GPU nodes. This communication arises because, in every computation step, particle distributions at border sites of a sub-domain may need to stream to adjacent nodes. We need to read the outward-streaming distributions from different channels of different textures in the GPU, transfer them through the network to the appropriate neighboring nodes, and finally write them into the appropriate texels of multiple textures in the GPUs of those neighboring nodes. Currently, moving data upstream from the GPU to PC memory through the AGP bus is slow and slows down the whole communication. To minimize the overhead of reading data from the GPU, we gather together in a single texture all the data that a node needs to send out and then read it out in one read operation.
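
The sketch below illustrates the gather-then-read idea using the same assumed array layout as above: all outgoing boundary faces are copied into one contiguous buffer so that a single slow read-back suffices. For simplicity it copies entire faces; the real system only needs the distribution components that actually stream outward, and it performs the gather in a texture on the GPU before the single read operation.

```python
import numpy as np

def gather_outgoing(f, axis):
    """Collect the two boundary faces of the sub-domain perpendicular to `axis`.
    f has shape (Q, NZ, NY, NX); the spatial axes are 1..3."""
    lo = np.take(f, 0, axis=axis)     # face that streams to the lower neighbor
    hi = np.take(f, -1, axis=axis)    # face that streams to the upper neighbor
    return np.ascontiguousarray(np.stack([lo, hi]))

# One contiguous send buffer per axis; on the GPU this packing is done into a
# single texture that is then read out in one read operation per step.
f = np.zeros((20, 32, 64, 64), dtype=np.float32)
send_buffers = [gather_outgoing(f, ax) for ax in (1, 2, 3)]
```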

LBM flow simulation in the Times Square area of NYC, visualized by streamlines. (480 × 400 × 80 lattice size, 1.66 km × 1.13 km area, 91 blocks and roughly 850 buildings.)

Overlapping the network communication time with the computation time is feasible, since the CPU and the network card stand idle while the GPU is computing. However, because each GPU computes its sub-domain quickly, optimizing network performance so that the communication time does not become the bottleneck is still necessary. We have designed communication schedules to reduce the likelihood that the smooth data transfer between two GPU nodes is interrupted by a third GPU node. We have also designed a technique to simplify the communication pattern so that each GPU node has fewer direct connections with others, which greatly improves network performance. Furthermore, we make the shape of each lattice sub-domain close to a cube to minimize the size of the transferred data.
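
One way to see the effect of the cube-like sub-domains is the following sketch, which enumerates 3D decompositions of the lattice over a given number of nodes and picks the one with the smallest boundary area per sub-domain. The brute-force search is purely illustrative and is not the partitioning or scheduling code used in the system.

```python
from itertools import product

def best_decomposition(nx, ny, nz, p):
    """Pick the (px, py, pz) split of an nx x ny x nz lattice into p sub-domains
    whose per-node boundary area (data to exchange) is smallest."""
    best, best_area = None, float("inf")
    for px, py, pz in product(range(1, p + 1), repeat=3):
        if px * py * pz != p:
            continue
        sx, sy, sz = nx / px, ny / py, nz / pz          # sub-domain extent
        area = 2 * (sx * sy + sy * sz + sz * sx)        # faces exchanged per node
        if area < best_area:
            best, best_area = (px, py, pz), area
    return best

# For a 480 x 400 x 80 lattice on 30 nodes this picks (6, 5, 1),
# i.e. exactly cubic 80 x 80 x 80 sub-domains.
print(best_decomposition(480, 400, 80, 30))
```

For the lattice and node count reported below, the minimum-area split happens to yield exactly cubic 80 × 80 × 80 sub-domains.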

Using 30 GPU nodes, we have simulated wind flow in the Times Square area of New York City with a 480 × 400 × 80 LBM lattice. The simulation runs at 0.31 seconds per step. Compared with our alternative implementation on a cluster of 30 CPUs (Pentium Xeon 2.4 GHz, without using SSE) for the computation, the speed-up is above 4.6.
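
For a rough sense of scale, and assuming the 0.31 s figure covers one full update of the entire lattice, this corresponds to roughly 50 million lattice-site updates per second:

```python
sites = 480 * 400 * 80            # lattice sites
step_time = 0.31                  # seconds per step on 30 GPU nodes
print(sites / step_time / 1e6)    # ~49.5 million site updates per second
```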

Recent news indicates that the performance of the GPU cluster will be further improved with the PCI-Express bus, which becomes available this year. Connected through a PCI-Express slot, a GPU can communicate with the system at 4 GB/sec in both the upstream and downstream directions. Moreover, PCI-Express will allow multiple GPUs to be plugged into a single PC. The interconnection of these GPUs will greatly reduce the network load.

Although this work has focused on LBM simulation, many numerical methods, such as cellular automata, finite differences, and finite elements, can be scaled on the GPU cluster as well (possibly with some help from the CPUs). We plan to work on this in the future. Considering the attractive Flops/$ ratio and the rapid evolution of GPUs, we believe that the GPU cluster is a very promising tool for scientific computation and large-scale simulation.

ACM Workshop on General Purpose Computing on Graphics Processors, 2004