GPU Accelerated Domain Decomposition
Outline: Introduction, Domain Decomposition, DD on the GPU, Conclusions
How To Use Your Desktop Supercomputer: GPU Accelerated Domain Decomposition
Richard Southern
GPGPU DDM (Richard Southern) 1
Overview
Purpose of this talk: to demonstrate by example how the GPU can be used for solving general computing problems.
The example: a Domain Decomposition Method for solving common Boundary Value Problems.
An Introduction to General Purpose GPU programming
The Evolution of the Desktop Supercomputer
In the 1970s, most supercomputers used parallel vector processors.
Single Instruction, Multiple Data (SIMD).
The same was true of real-time graphics systems.
March 2001: NVIDIA releases the GeForce 3, a vector-processing, SIMD-programmable graphics chip.
The NVIDIA nfiniteFX Engine!
From the original press release: “With the GeForce3 and its nfiniteFX™ engine, NVIDIA introduces the world’s first programmable 3D graphics chip architecture. By combining programmable vertex and pixel shading capabilities, and 3D texture technology, the nfiniteFX engine delivers unprecedented visual realism on your PC.”
Graphics Cards pre-GeForce 3
The GeForce3
Result: Ugly Zombies
Sine wave water effect
/* Vertex shader */
uniform float waveTime;
uniform float waveWidth;
uniform float waveHeight;

void main() {
    vec4 v = vec4(gl_Vertex);
    v.z = sin(waveWidth * v.x + waveTime) *
          cos(waveWidth * v.y + waveTime) * waveHeight;
    gl_Position = gl_ModelViewProjectionMatrix * v;
}

/* Fragment shader */
void main() {
    gl_FragColor[0] = gl_FragCoord[0] / 400.0;
    gl_FragColor[1] = gl_FragCoord[1] / 400.0;
    gl_FragColor[2] = 1.0;
}
Sine wave water effect
General Purpose GPU programming
This commodity chip can be used for solving traditional supercomputing problems.
• Fast Fourier Transform
• Bioinformatics (database queries, visualization, etc.)
• Neural networks
• Video processing
• ...
Advantages: cheap, ubiquitous.
Changes needed to be made to the hardware / drivers, especially texture read / write.
Result: Tesla, the dedicated GPGPU card
DGEMM Performance
Folding@Home Results
OS Type             Native TFLOPS   x86 TFLOPS   Active CPUs   Total CPUs
Windows                       190          190        199965      3405324
Mac OS X/PowerPC                3            3          4237       139764
Mac OS X/Intel                 20           20          6592       129084
Linux                          59           59         34492       508701
ATI GPU                       642          677          6296       134815
NVIDIA GPU                   1325         2796         11131       211480
PLAYSTATION 3                 755         1593         26784      1006896
Programming on the GPU
Originally assembler (a nightmare).
Shader language APIs:
• Cg (NVIDIA),
• GLSL (OpenGL),
• HLSL (Microsoft)
General GPU programming languages:
• CUDA (NVIDIA),
• Brook (Stanford),
• OpenCL (Khronos)
About CUDA
Stands for Compute Unified Device Architecture.
Abstracts stream processing and memory access.
Data can be passed around using C pointers.
Some graphics concepts still need to be understood: textures, surfaces, vector types.
Only works on NVIDIA cards.
Simplest CUDA example: Add two vectors in parallel.
__global__ void VecAdd(float* A, float* B, float* C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {
    ...
    // Invoke the kernel from the main program
    VecAdd<<<1, N>>>(A, B, C);
    ...
}
Domain Decomposition for Boundary Value Problems
Discrete Boundary Value Problems
Boundary Value Problems have numerous applications:
• Solutions to general PDEs,
• Finite Element Methods (solving heat equations, antennae simulations, deformation),
• Fluid simulations (Navier-Stokes equations, Smoothed Particle Hydrodynamics),
• Radial Basis Functions (smooth data interpolation),
• ...
Basic form:
f(x) = ∑ᵢ qᵢ Φ(‖x − xᵢ‖)
• Φ(r) is some smooth kernel function.
• xᵢ is an interpolation center / observation site.
Solving Boundary Value Problems
Given a set of observation sites and observed valuesf (xi ) = bi , compute coefficients qi .
Define matrix A with Ai ,j = Φ(‖xi − xj‖).
Solve for q inAq = b.
Good for about 5000 observations. What about 1, 000, 000?
Our method
The method of Yokota et al. 2010 (PetRBF).
GMRES: Generalised Minimal Residual method.
Solves the following problem:
min_q ‖b − Aq‖.
Given a preconditioner M, compute iteratively
qₙ₊₁ = qₙ + M(b − Aqₙ),  q₀ = 0.
A perfect preconditioner would be
M = A⁻¹ ⟹ q₁ = A⁻¹b = q.
We derive an approximation of A⁻¹ using the Schwarz method.
Kernel properties
Φ(r) must be a compactly supported (rapidly decaying) function.
Gaussian: Φ(r) = exp(−r²/σ²)
Set any value Φ(r) < ε to 0.
Now A is a sparse matrix (store index and value of non-zero entries).
[Plot: the Gaussian kernel Φ(r) for r ∈ [−3, 3], values from 0 to 1.]
Break the problem down
Break the problem down
Break the problem down
Break the problem down
Each sub-matrix A_Ωᵢ is the matrix A computed only from the subset of points in Ωᵢ.
Construct sparse restriction matrices
Rᵢx = [I 0] [x_Ωᵢ ; x_Ω∖Ωᵢ]   and   R̃ᵢx = [I 0] [x_Ω̃ᵢ ; x_Ω∖Ω̃ᵢ]
(Rᵢ selects the entries of x inside Ωᵢ; R̃ᵢ does the same for the padded domain Ω̃ᵢ.)
Additive Schwarz Method (ASM)
Compute the inverse of each sub-matrix, and add them together.
The resulting matrix is symmetric, but convergence is slower and less stable.
Preconditioner:
M = A⁻¹_ASM = ∑ᵢ R̃ᵢᵀ Aᵢ⁻¹ R̃ᵢ.
Restricted Additive Schwarz Method (RASM)
Compute the inverse of each sub-matrix, but restrict rows to the original domain Ωᵢ.
The matrix is non-symmetric, but convergence time is improved.
Makes it a bit more fiddly to calculate.
Preconditioner:
M = A⁻¹_RASM = ∑ᵢ Rᵢᵀ Aᵢ⁻¹ R̃ᵢ.
Domain Decomposition on the GPU
Summary of problem components
1. Domain decomposition of the input points into Ωᵢ and Ω̃ᵢ.
2. Compute kernel matrix A.
3. Compute Schwarz preconditioner M = A⁻¹_RASM.
4. Solve GMRES:
qₙ₊₁ = qₙ + M(b − Aqₙ).
1. Decompose the domain
Given points x and domains Ωᵢ:
Define Ω̃ᵢ = Ωᵢ + ∆ᵢ, where ∆ᵢ is some padding applied to the domain.
For each point, classify it as either OUTSIDE BOTH, INSIDE OVERLAP, or INSIDE BOTH.
Performance is terrible: about 45 s to sort 100,000 points.
1. Fast domain decomposition
Alternative overlap:
Ω̃ᵢ = Ωᵢ + ∑_{j∈Nᵢ} Ωⱼ.
Given x in dimension d and a resolution vector res ∈ ℕᵈ, bucket sort the points into a grid.
Consists of two GPU passes:
• pointHash(): called per point, determines which cell each point is in, and
• buildGrid(): called per cell, inverts the point hash structure into a grid.
2.5 million points sorted in 1.25s!
2. Compute the kernel matrix
The matrix is normally too big for main memory.
Solution: compute each row (in parallel) and pack it into a sparse matrix structure.
A is stored in Compressed Sparse Row (CSR) format.
CSR makes the pre-multiply fast.
Improve computation using the existing domain decomposition.
3. Compute the preconditioner
Construct each kernel sub-matrix Aᵢ from the restricted point set Rᵢx.
Compute each matrix inverse Aᵢ⁻¹ using the CUDA-accelerated library CULA.
Combine restricted rows into the preconditioner.
M is packed in CSR format.
Each matrix Aᵢ can be inverted in parallel on multiple CPUs.
4. Solve GMRES
Observe that
qₙ₊₁ = qₙ + M(b − Aqₙ)
can be simplified.
Define g(A, x, b, α) = b + αAx.
Then GMRES becomes a two-step process:
v = g(A, qₙ, b, −1)
qₙ₊₁ = g(M, v, qₙ, 1)
Results and Conclusions
Smooth image scaling
x is the vector of pixel positions. b_{r,g,b} is the colour value vector.
Solve for the RBF coefficients of each colour channel, q_{r,g,b}.
[Images: 100 × 100, 256 × 256 (original), 500 × 500]
Lagergren et al. 2010, about 1 fps
Results
On my Quadro FX 3700, maxThreadsPerBlock=512.
100,000 random vertices in 3D, computed in 102.83 s.

Task                                Properties                          Time (s)
Segmentation Ωᵢ                     Old method                          ≈ 45
Constructing coefficient matrix A   Row occupancy 0.001%                46.98
Constructing preconditioner M       2744 submatrices, average 40 × 40   16.39
Running GMRES                       RMS < 0.00001 in 5 steps            0.01

Not particularly impressive performance.
Should expect 1,000,000 at ≈ 1 fps.
The problem is still limited by hardware constraints.
Solution: Throw more hardware at the problem
The problem is ridiculously parallel.
• Grid partition Ωᵢ: bucket sorting on multiple GPUs,
• Kernel matrix A: in chunks on separate GPUs,
• Matrix inversion Aᵢ⁻¹: on multiple GPUs,
• ...
Conclusions
Parallelized domain decomposition problems are a good use of graphics hardware.
Can expect an exponential performance improvement.
The interface is simple...
BUT memory management is very difficult.
Debugging is a nightmare.