A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware
Nolan Goodnight, Cliff Woolley, Gregory Lewin, David Luebke, Greg Humphreys
University of Virginia
Graphics Hardware 2003
July 26-27 – San Diego, CA
General-Purpose GPU Programming
Why do we port algorithms to the GPU?
How much faster can we expect it to be, really?
What is the challenge in porting?
Case Study
Problem: Implement a Boundary Value Problem (BVP) solver using the GPU
Could benefit an entire class of scientific and engineering applications, e.g.:
Heat transfer
Fluid flow
Related Work
Krüger and Westermann: Linear Algebra Operators for GPU Implementation of Numerical Algorithms
Bolz et al.: Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid
Very similar to our system
Developed concurrently
Complementary approach
Driving problem: Fluid mechanics sim
Problem domain is a warped disc:
[Figure: warped disc and regular grid]
BVPs: Background
Boundary value problems are sometimes governed by PDEs of the form:
$L\phi = f$
$L$ is some operator
$\phi$ is the unknown, defined over the problem domain
$f$ is a forcing function (source term)
Given $L$ and $f$, solve for $\phi$.
BVPs: Example
Heat transfer
Find a steady-state temperature distribution $T$ in a solid of thermal conductivity $k$ with thermal source $S$
This requires solving a Poisson equation of the form:
$k\nabla^2 T = -S$
This is a BVP where the operator is the Laplacian $\nabla^2$
All our applications require a Poisson solver.
BVPs: Solving
Most such problems cannot be solved analytically
Instead, discretize onto a grid to form a set of linear equations, then solve:
Direct elimination
Gauss-Seidel iteration
Conjugate-gradient
Strongly implicit procedures
Multigrid method
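As an illustration of the discretization step, assume a uniform 2D grid with spacing $h$ and a second-order central difference (both are assumptions; the slides do not state the grid or stencil here). The heat-transfer Poisson equation $k\nabla^2 T = -S$ then becomes one linear equation per interior grid point $(i, j)$:

```latex
\frac{T_{i-1,j} + T_{i+1,j} + T_{i,j-1} + T_{i,j+1} - 4\,T_{i,j}}{h^2} = -\frac{S_{i,j}}{k}
\quad\Longrightarrow\quad
T_{i,j} = \frac{1}{4}\left( T_{i-1,j} + T_{i+1,j} + T_{i,j-1} + T_{i,j+1} + \frac{h^2}{k}\,S_{i,j} \right)
```

The right-hand form is the per-point update that an iterative method such as Gauss-Seidel applies repeatedly.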
Multigrid method
Iteratively corrects an approximation to the solution
Operates at multiple grid resolutions
Low-resolution grids are used to correct higher-resolution grids recursively
Very fast, especially for large grids: O(n)
Multigrid method
Use coarser grid levels to recursively correct an approximation to the solution
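Algorithm:
smooth
residual
restrict
recurse
interpolate
[Figure: stencil weights for each step. Five-point Laplacian: 1 on each of the four neighbors, -4 at the center. Restrict: 1/4 at the center, 1/8 on the edge neighbors, 1/16 on the corner neighbors. Interpolate: 1 at the center, 1/2 on the edge neighbors, 1/4 on the corner neighbors. Residual: $L\phi_i - f$, where $L$ is the operator and $\phi_i$ is the current approximation.]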
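The five steps above can be sketched on the CPU as follows. This is a minimal illustration, not the authors' GPU code: it assumes a square grid of side 2^k + 1, zero Dirichlet boundaries, Gauss-Seidel smoothing, full-weighting restriction, and bilinear interpolation. All function names are hypothetical.

```python
def smooth(u, f, h, sweeps):
    """Gauss-Seidel sweeps for the five-point Laplacian; boundary stays fixed."""
    n = len(u)
    for _ in range(sweeps):
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                  + u[i][j - 1] + u[i][j + 1] - h * h * f[i][j])

def residual(u, f, h):
    """r = L u - f at interior points (same sign convention as the slides)."""
    n = len(u)
    r = [[0.0] * n for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            lap = (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                   - 4.0 * u[i][j]) / (h * h)
            r[i][j] = lap - f[i][j]
    return r

def restrict(r):
    """Full weighting: 1/4 center, 1/8 edge neighbors, 1/16 corner neighbors."""
    nc = (len(r) - 1) // 2 + 1
    rc = [[0.0] * nc for _ in range(nc)]
    for ci in range(1, nc - 1):
        for cj in range(1, nc - 1):
            i, j = 2 * ci, 2 * cj
            rc[ci][cj] = (0.25 * r[i][j]
                          + 0.125 * (r[i - 1][j] + r[i + 1][j] + r[i][j - 1] + r[i][j + 1])
                          + 0.0625 * (r[i - 1][j - 1] + r[i - 1][j + 1]
                                      + r[i + 1][j - 1] + r[i + 1][j + 1]))
    return rc

def interpolate(ec):
    """Bilinear interpolation: weights 1, 1/2, 1/4 depending on fine-point parity."""
    nc = len(ec)
    n = 2 * (nc - 1) + 1
    e = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            i0, j0 = i // 2, j // 2
            i1, j1 = i0 + (i % 2), j0 + (j % 2)
            e[i][j] = 0.25 * (ec[i0][j0] + ec[i1][j0] + ec[i0][j1] + ec[i1][j1])
    return e

def v_cycle(u, f, h):
    """One V-cycle, updating u in place."""
    n = len(u)
    if n <= 3:
        smooth(u, f, h, 1)  # a single interior point: one sweep solves it exactly
        return u
    smooth(u, f, h, 2)
    rc = restrict(residual(u, f, h))
    # The error e = exact - u satisfies L e = -r, so the coarse problem uses -r.
    fc = [[-v for v in row] for row in rc]
    ec = [[0.0] * len(rc) for _ in rc]
    v_cycle(ec, fc, 2 * h)
    e = interpolate(ec)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            u[i][j] += e[i][j]
    smooth(u, f, h, 2)
    return u
```

Calling `v_cycle` repeatedly on the same `u` drives the residual toward zero; the number of pre- and post-smoothing sweeps (two each here) is an arbitrary choice for the sketch.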
Implementation
For each step of the algorithm:
Bind as texture maps the buffers that contain the necessary data
Set the target buffer for rendering
Activate a fragment program that performs the necessary kernel computation
Render a grid-sized quad with multitexturing
[Figure: source buffer textures feed the fragment program, which writes to the render target buffer]
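The pass structure above can be mimicked on the CPU to show the data flow. This is only an analogy under stated assumptions: `render_pass`, `jacobi_kernel`, and `iterate` are hypothetical names, the grid spacing is taken as 1, and a Jacobi-style smoothing kernel stands in for whatever fragment program is active. No graphics API is called.

```python
def render_pass(kernel, sources, width, height):
    """CPU stand-in for one render pass: evaluate `kernel` once per output
    texel and return a newly written target buffer. The kernel reads only
    the bound source buffers, never the target."""
    return [[kernel(x, y, sources) for x in range(width)] for y in range(height)]

def jacobi_kernel(x, y, sources):
    """Stand-in for a fragment program: one Jacobi update of L u = f."""
    u, f = sources  # the two bound "textures"
    last_row, last_col = len(u) - 1, len(u[0]) - 1
    if x == 0 or y == 0 or x == last_col or y == last_row:
        return u[y][x]  # boundary texels pass through unchanged
    return 0.25 * (u[y - 1][x] + u[y + 1][x] + u[y][x - 1] + u[y][x + 1] - f[y][x])

def iterate(u, f, passes):
    """Ping-pong: the target of one pass becomes the source of the next."""
    source = u
    for _ in range(passes):
        source = render_pass(jacobi_kernel, (source, f), len(u[0]), len(u))
    return source
```

Because each pass writes a fresh buffer, the input grid is never modified while it is being read, which is the same constraint the render-target setup imposes.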
Optimizing the Solver
Detect steady-state natively on GPU
Minimize shader length
Special-case whenever possible
Avoid context-switching
Optimizing the Solver: Steady-state
How to detect convergence?
L1 norm – average error
L2 norm – RMS error (common in visual sim)
L∞ norm – max error (common in sci/eng apps)
Can use occlusion query!
[Figure: seconds to steady state vs. grid size]
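The three norms can be written out as below. This is a CPU sketch with hypothetical names. The `converged` function shows one plausible reading of the occlusion-query idea: count the cells whose error exceeds a tolerance and declare steady state when that count is zero. The slides do not spell out the exact GPU mechanism.

```python
import math

def error_norms(errors):
    """Return (L1, L2, L-infinity) of a 2D grid of per-cell errors."""
    values = [abs(v) for row in errors for v in row]
    l1 = sum(values) / len(values)                          # average error
    l2 = math.sqrt(sum(v * v for v in values) / len(values))  # RMS error
    linf = max(values)                                      # max error
    return l1, l2, linf

def converged(errors, tolerance):
    """L-infinity test phrased as a count, in the spirit of an occlusion
    query: zero cells above tolerance means the max error is within it."""
    failing = sum(1 for row in errors for v in row if abs(v) > tolerance)
    return failing == 0
```

The count-based form needs no reduction over the grid values themselves, only a tally of cells that fail the test.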
Optimizing the Solver: Shader length
Minimize number of registers used
Vectorize as much as possible
Use the rasterizer to perform computations of linearly-varying values
Pre-compute invariants on CPU
shader        original fp   fastpath fp   fastpath vp
smooth        79-6-1        20-4-1        12-2
residual      45-7-0        16-4-0        11-1
restrict      66-6-1        21-3-0        11-1
interpolate   93-6-1        25-3-0        13-2
Optimizing the Solver: Special-case
Fast-path vs. slow-path
write several variants of each fragment program to handle boundary cases
eliminates conditionals in the fragment program
equivalent to avoiding CPU inner-loop branching
[Figure: slow path (with boundaries) vs. fast path (no boundaries)]
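A CPU sketch of the fast-path/slow-path split follows. The names are hypothetical, and the slow path's clamped neighbor lookup (a zero-gradient edge) is an assumed boundary treatment chosen only for illustration. The interior kernel contains no bounds checks; the checks are paid only on the one-texel border.

```python
def kernel_general(u, f, x, y):
    """Slow path: clamps every neighbor lookup at the domain edge."""
    h, w = len(u), len(u[0])
    left = u[y][max(x - 1, 0)]
    right = u[y][min(x + 1, w - 1)]
    down = u[max(y - 1, 0)][x]
    up = u[min(y + 1, h - 1)][x]
    return 0.25 * (left + right + down + up - f[y][x])

def kernel_interior(u, f, x, y):
    """Fast path: all four neighbors are known to exist, so no conditionals."""
    return 0.25 * (u[y][x - 1] + u[y][x + 1] + u[y - 1][x] + u[y + 1][x] - f[y][x])

def smooth_pass(u, f):
    """Run the fast kernel over the interior and the slow kernel on the border."""
    h, w = len(u), len(u[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):          # interior region
        for x in range(1, w - 1):
            out[y][x] = kernel_interior(u, f, x, y)
    for x in range(w):                 # top and bottom rows
        for y in (0, h - 1):
            out[y][x] = kernel_general(u, f, x, y)
    for y in range(1, h - 1):          # left and right columns
        for x in (0, w - 1):
            out[y][x] = kernel_general(u, f, x, y)
    return out
```

The result matches running `kernel_general` on every texel, but the conditional work scales with the perimeter rather than the area of the grid.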
[Figure: seconds per V-cycle vs. grid size]
Optimizing the Solver: Context-switching
Find the best packing of data from multiple grid levels into the pbuffer surfaces
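To make the packing idea concrete, here is one hypothetical layout; the slides do not say which packing the authors chose. The finest level sits at the left of the surface and the coarser levels are stacked in a column beside it, so every level lives in the same surface and switching levels means changing a rectangle rather than a rendering context. The function name and return shape are illustrative.

```python
def pack_levels(sizes):
    """Place square grid levels (side lengths, finest first) in one surface.

    Returns (rects, (surface_width, surface_height)), where each rect is
    (x, y, size) giving the level's lower-left corner and side length.
    """
    rects = [(0, 0, sizes[0])]
    x = sizes[0]          # coarser levels start to the right of the finest
    y = 0
    for size in sizes[1:]:
        rects.append((x, y, size))
        y += size         # stack the next coarser level above this one
    width = sizes[0] + (sizes[1] if len(sizes) > 1 else 0)
    height = max(sizes[0], y)
    return rects, (width, height)
```

For example, levels of side 9, 5, and 3 fit in a 14 x 9 surface with this layout.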
Optimizing the Solver: Context-switching
Remove context switching
Can introduce operations with undefined results: reading/writing same surface
Why do we need to do this?
Can we get away with it?
What about superbuffers?
Data Layout
[Figure: performance, seconds to steady state vs. grid size]
Data Layout
Possible additional vectorization:
Compute 4 values at a time
Requires source, residual, solution values to be in different buffers
Complicates boundary calculations
Adds setup and teardown overhead
[Figure: stacked domain]
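One way to picture a stacked domain is sketched below, assuming the grid is split into four equal horizontal bands and band k is stored in channel k of a four-component texel. This layout and all names are assumptions for illustration; the slides only name the idea.

```python
def pack_stacked(grid):
    """Pack a grid whose row count is a multiple of 4 into 4-channel texels.

    Texel (x, y) holds the values from rows y, y + q, y + 2q, y + 3q,
    where q is a quarter of the original height.
    """
    rows, cols = len(grid), len(grid[0])
    if rows % 4 != 0:
        raise ValueError("row count must be a multiple of 4")
    q = rows // 4
    return [[tuple(grid[y + k * q][x] for k in range(4)) for x in range(cols)]
            for y in range(q)]

def unpack_stacked(packed):
    """Inverse of pack_stacked."""
    q, cols = len(packed), len(packed[0])
    return [[packed[y % q][x][y // q] for x in range(cols)] for y in range(4 * q)]

def scale_packed(packed, factor):
    """One 'vector' operation per texel updates four grid values at once."""
    return [[tuple(factor * c for c in texel) for texel in row] for row in packed]
```

Rows at the seams between bands have neighbors stored in a different channel, which is one way to see why this layout complicates boundary calculations.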
Results: CPU vs. GPU
[Figure: performance, seconds to steady state vs. grid size]
Conclusions
What we need going forward:
Superbuffers
or: Universal support for multiple-surface pbuffers
or: Cheap context switching
Developer tools
Debugging tools
Documentation
Global accumulator
Ever increasing amounts of precision, memory
Textures bigger than 2048 on a side
Acknowledgements
Hardware
David Kirk
Matt Papakipos
Driver Support
Nick Triantos
Pat Brown
Stephen Ehmann
Fragment Programming
James Percy
Matt Pharr
General-purpose GPU
Mark Harris
Aaron Lefohn
Ian Buck
Funding
NSF Award #0092793