GPU Accelerated Domain Decomposition
Outline: Introduction, Domain Decomposition, DD on the GPU, Conclusions
How To Use Your Desktop Supercomputer: GPU Accelerated Domain Decomposition
Richard Southern
GPGPU DDM (Richard Southern) 1
Overview
Purpose of this talk: to demonstrate by example how the GPU can be used for solving general computing problems.
The example: a Domain Decomposition Method for solving common Boundary Value Problems.
An Introduction to General Purpose GPU programming
The Evolution of the Desktop Supercomputer
In the 1970s, most supercomputers used parallel vector processors.
Single Instruction, Multiple Data (SIMD).
The same was true of real-time graphics systems.
March 2001: NVIDIA releases the GeForce 3, a vector-processing, SIMD-programmable graphics chip.
The NVIDIA nfiniteFX Engine!
From the original press release: “With the GeForce3 and its nfiniteFX™ engine, NVIDIA introduces the world’s first programmable 3D graphics chip architecture. By combining programmable vertex and pixel shading capabilities, and 3D texture technology, the nfiniteFX engine delivers unprecedented visual realism on your PC.”
Graphics Cards pre-GeForce 3
The GeForce3
Result: Ugly Zombies
Sine wave water effect
/* Vertex shader */
uniform float waveTime;
uniform float waveWidth;
uniform float waveHeight;

void main() {
    vec4 v = vec4(gl_Vertex);
    v.z = sin(waveWidth * v.x + waveTime) *
          cos(waveWidth * v.y + waveTime) * waveHeight;
    gl_Position = gl_ModelViewProjectionMatrix * v;
}

/* Fragment shader */
void main() {
    gl_FragColor[0] = gl_FragCoord[0] / 400.0;
    gl_FragColor[1] = gl_FragCoord[1] / 400.0;
    gl_FragColor[2] = 1.0;
}
Sine wave water effect
General Purpose GPU programming
This commodity chip can be used for solving traditional supercomputing problems.
• Fast Fourier Transform
• Bioinformatics (database queries, visualization, etc.)
• Neural networks
• Video processing
• ...
Advantages: cheap, ubiquitous.
Changes needed to be made to the hardware / drivers, especially texture read / write.
Result: Tesla, the dedicated GPGPU card
DGEMM Performance
Folding@Home Results
OS Type             Native TFLOPS   x86 TFLOPS   Active CPUs   Total CPUs
Windows                       190          190        199965      3405324
Mac OS X/PowerPC                3            3          4237       139764
Mac OS X/Intel                 20           20          6592       129084
Linux                          59           59         34492       508701
ATI GPU                       642          677          6296       134815
NVIDIA GPU                   1325         2796         11131       211480
PLAYSTATION 3                 755         1593         26784      1006896
Programming on the GPU
Originally assembler (a nightmare).
Shader language APIs:
• Cg (NVIDIA),
• GLSL (OpenGL),
• HLSL (Microsoft)
General GPU programming languages:
• CUDA (NVIDIA),
• Brook (Stanford),
• OpenCL (Khronos)
About CUDA
Stands for Compute Unified Device Architecture.
Abstracts stream processing and memory access.
Data can be passed around using C pointers.
Some graphics concepts still need to be understood: textures, surfaces, vector types.
Only works on NVIDIA cards.
Simplest CUDA example: Add two vectors in parallel.
__global__ void VecAdd(float* A, float* B, float* C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {
    ...
    // Invoke the kernel from the main program
    VecAdd<<<1, N>>>(A, B, C);
    ...
}
Domain Decomposition for Boundary Value Problems
Discrete Boundary Value Problems
Boundary Value Problems have numerous applications:
• Solutions to general PDEs,
• Finite Element Methods (solving heat equations, antennae simulations, deformation),
• Fluid simulations (Navier-Stokes equations, Smoothed Particle Hydrodynamics),
• Radial Basis Functions (smooth data interpolation),
• ...
Basic form:
f(x) = ∑ᵢ qᵢ Φ(‖x − xᵢ‖)
• Φ(r) is some smooth kernel function.
• xᵢ is an interpolation center / observation site.
Solving Boundary Value Problems
Given a set of observation sites and observed valuesf (xi ) = bi , compute coefficients qi .
Define matrix A with Ai ,j = Φ(‖xi − xj‖).
Solve for q inAq = b.
Good for about 5000 observations. What about 1, 000, 000?
Our method
The method of Yokota et al. 2010 (PetRBF).
GMRES: Generalised Minimal Residual method.
Solves the following problem:
min_q ‖b − Aq‖.
Given a preconditioner M, compute iteratively
qₙ₊₁ = qₙ + M(b − Aqₙ),  q₀ = 0.
A perfect preconditioner would be
M = A⁻¹ ⟹ q₁ = A⁻¹b = q.
We derive an approximation of A⁻¹ using the Schwarz method.
Kernel properties
Φ(r) must be a compactly supported (rapidly decaying) function.
Gaussian: Φ(r) = exp(−r²/σ²)
Set any value Φ(r) < ε to 0.
Now A is a sparse matrix (store index and value of non-zero entries).
[Plot: the Gaussian kernel Φ(r) for r ∈ [−3, 3], values from 0 to 1.]
Break the problem down
Break the problem down
Break the problem down
Break the problem down
Each sub-matrix A_Ωᵢ is the matrix A computed only from the subset of points in Ωᵢ.
Construct sparse restriction matrices
Rᵢx = [I 0] [x_Ωᵢ ; x_Ω∖Ωᵢ]   and   R̃ᵢx = [I 0] [x_Ω̃ᵢ ; x_Ω∖Ω̃ᵢ]
(Rᵢ selects the entries of x inside Ωᵢ; R̃ᵢ does the same for the padded domain Ω̃ᵢ.)
Additive Schwarz Method (ASM)
Compute the inverse of each sub-matrix, and add them together.
The resulting matrix is symmetric, but convergence is slower and less stable.
Preconditioner:
M = A⁻¹_ASM = ∑ᵢ R̃ᵢᵀ Aᵢ⁻¹ R̃ᵢ.
Restricted Additive Schwarz Method (RASM)
Compute the inverse of each sub-matrix, but restrict rows to the original domain Ωᵢ.
The matrix is non-symmetric, but convergence time is improved.
Makes it a bit more fiddly to calculate.
Preconditioner:
M = A⁻¹_RASM = ∑ᵢ Rᵢᵀ Aᵢ⁻¹ R̃ᵢ.
Domain Decomposition on the GPU
Summary of problem components
1. Domain decomposition of the input points into Ωᵢ and Ω̃ᵢ.
2. Compute kernel matrix A.
3. Compute Schwarz preconditioner M = A⁻¹_RASM.
4. Solve GMRES:
qₙ₊₁ = qₙ + M(b − Aqₙ).
1. Decompose the domain
Given points x and domains Ωᵢ:
Define Ω̃ᵢ = Ωᵢ + ∆ᵢ, where ∆ᵢ is some padding applied to the domain.
For each point, classify it as either OUTSIDE BOTH, INSIDE OVERLAP, or INSIDE BOTH.
Performance is terrible: about 45 s to sort 100,000 points.
1. Fast domain decomposition
Alternative overlap:
Ω̃ᵢ = Ωᵢ + ∑_{j∈Nᵢ} Ωⱼ.
Given x in dimension d and a resolution vector res ∈ ℕᵈ, bucket sort the points into a grid.
Consists of two GPU passes:
• pointHash(): called per point, determines which cell each point is in, and
• buildGrid(): called per cell, inverts the point hash structure into a grid.
2.5 million points sorted in 1.25s!
2. Compute the kernel matrix
The matrix is normally too big for main memory.
Solution: compute each row (in parallel) and pack it into a sparse matrix structure.
A is stored in Compressed Sparse Row (CSR) format.
CSR makes the pre-multiply fast.
Improve computation using the existing domain decomposition.
3. Compute the preconditioner
Construct each kernel sub-matrix Aᵢ from the restricted point set Rᵢx.
Compute each matrix inverse Aᵢ⁻¹ using the CUDA-accelerated library CULA.
Combine restricted rows into the preconditioner.
M is packed in CSR format.
Each matrix Aᵢ can be inverted in parallel on multiple CPUs.
4. Solve GMRES
Observe that
qₙ₊₁ = qₙ + M(b − Aqₙ)
can be simplified.
Define g(A, x, b, α) = b + αAx.
Then GMRES becomes a two-step process:
v = g(A, qₙ, b, −1)
qₙ₊₁ = g(M, v, qₙ, 1)
Results and Conclusions
Smooth image scaling
x is the vector of pixel positions. b_{r,g,b} is the colour value vector.
Solve for the RBF coefficients of each colour channel, q_{r,g,b}.
[Images: 100 × 100, 256 × 256 (original), 500 × 500]
Lagergren et al. 2010, about 1 fps
Results
On my Quadro FX 3700, maxThreadsPerBlock=512.
100,000 random vertices in 3D, computed in 102.83 s.

Task                                Properties                          Time (s)
Segmentation Ωᵢ                     Old method                          ≈ 45
Constructing coefficient matrix A   Row occupancy 0.001%                46.98
Constructing preconditioner M       2744 submatrices, average 40 × 40   16.39
Running GMRES                       RMS < 0.00001 in 5 steps            0.01

Not particularly impressive performance.
Should expect 1,000,000 at ≈ 1 fps.
The problem is still limited by hardware constraints.
Solution: Throw more hardware at the problem
The problem is ridiculously parallel.
• Grid partition Ωᵢ: bucket sorting on multiple GPUs,
• Kernel matrix A: in chunks on separate GPUs,
• Matrix inversion Aᵢ⁻¹: on multiple GPUs,
• ...
Conclusions
Parallelized domain decomposition problems are a good use of graphics hardware.
Can expect an exponential performance improvement.
The interface is simple...
BUT memory management is very difficult.
Debugging is a nightmare.