A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware Nolan Goodnight Cliff Woolley Gregory Lewin David Luebke Greg Humphreys University of Virginia Graphics Hardware 2003 July 26-27 – San Diego, CA augmented by Klaus Mueller, Stony Brook University

Page 1: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Nolan Goodnight, Cliff Woolley, Gregory Lewin, David Luebke, Greg Humphreys

University of Virginia

Graphics Hardware 2003, July 26-27, San Diego, CA

augmented by Klaus Mueller, Stony Brook University

Page 2: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

General-Purpose GPU Programming

Why do we port algorithms to the GPU?

How much faster can we expect it to be, really?

What is the challenge in porting?

Page 3: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Case Study

Problem: Implement a Boundary Value Problem (BVP) solver using the GPU

Could benefit an entire class of scientific and engineering applications, e.g.:

Heat transfer

Fluid flow

Page 4: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Related Work

Krüger and Westermann: Linear Algebra Operators for GPU Implementation of Numerical Algorithms

Bolz et al.: Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid

Very similar to our system

Developed concurrently

Complementary approach

Page 5: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Driving problem: fluid mechanics simulation

Problem domain is a warped disc:

[Figure: warped disc domain; regular grid]

Page 6: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

BVPs: Background

Boundary value problems are sometimes governed by PDEs of the form:

$L\phi = f$

$L$ is some operator

$\phi$ is the unknown, defined over the problem domain

$f$ is a forcing function (source term)

Given $L$ and $f$, solve for $\phi$.

Page 7: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

BVPs: Example

Heat Transfer

Find a steady-state temperature distribution T in a solid of thermal conductivity k with thermal source S

This requires solving a Poisson equation of the form:

$k \nabla^2 T = -S$

This is a BVP where the operator $L$ is the Laplacian $\nabla^2$

All our applications require a Poisson solver.

Page 8: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

BVPs: Solving

Most such problems cannot be solved analytically

Instead, discretize onto a grid to form a set of linear equations, then solve:

Direct elimination

Gauss-Seidel iteration

Conjugate-gradient

Strongly implicit procedures

Multigrid method

Page 9: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Multigrid method

Iteratively corrects an approximation to the solution

Operates at multiple grid resolutions

Low-resolution grids are used to correct higher-resolution grids recursively

Very fast, especially for large grids: O(n)

Page 10: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Multigrid method

Use coarser grid levels to recursively correct an approximation to the solution

A solver may converge slowly on the fine grid -> restrict to a coarse grid

Coarse grids push out long-wavelength errors quickly (single-grid solvers only smooth out high-frequency errors)

Algorithm:

smooth

residual $= L\phi_i - f$ (operator $L$ applied to the current solution $\phi_i$, minus the source term $f$)

restrict

recurse

interpolate
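The five steps above can be sketched as a recursive V-cycle. This is a minimal 1D illustration in plain Python, not the GPU implementation from the talk: it assumes the model problem u'' = f on [0, 1] with zero boundary values, Gauss-Seidel smoothing, full-weighting restriction and linear interpolation. All function names are hypothetical.

```python
def smooth(u, f, h, sweeps):
    """Gauss-Seidel sweeps for (u[i-1] - 2u[i] + u[i+1]) / h^2 = f[i]."""
    for _ in range(sweeps):
        for i in range(1, len(u) - 1):
            u[i] = 0.5 * (u[i - 1] + u[i + 1] - h * h * f[i])


def residual(u, f, h):
    """Residual r = L u - f at interior points (zero on the boundary)."""
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = (u[i - 1] - 2.0 * u[i] + u[i + 1]) / (h * h) - f[i]
    return r


def restrict(r):
    """Full-weighting average of a fine-grid vector onto the coarse grid."""
    nc = (len(r) - 1) // 2
    rc = [0.0] * (nc + 1)
    for j in range(1, nc):
        rc[j] = 0.25 * r[2 * j - 1] + 0.5 * r[2 * j] + 0.25 * r[2 * j + 1]
    return rc


def interpolate(ec):
    """Linear interpolation of a coarse-grid vector onto the fine grid."""
    ef = [0.0] * (2 * (len(ec) - 1) + 1)
    for j in range(len(ec)):
        ef[2 * j] = ec[j]
    for j in range(len(ec) - 1):
        ef[2 * j + 1] = 0.5 * (ec[j] + ec[j + 1])
    return ef


def v_cycle(u, f, h):
    """One V-cycle; len(u) - 1 must be a power of two (at least 2)."""
    n = len(u) - 1
    if n == 2:
        # Coarsest grid: a single unknown, solved exactly.
        u[1] = 0.5 * (u[0] + u[2] - h * h * f[1])
        return u
    smooth(u, f, h, 2)
    rc = restrict(residual(u, f, h))
    # The correction e satisfies L e = -r, with zero boundary values.
    ec = v_cycle([0.0] * len(rc), [-x for x in rc], 2.0 * h)
    ef = interpolate(ec)
    for i in range(1, n):
        u[i] += ef[i]
    smooth(u, f, h, 2)
    return u


# Usage: u'' = 2 with u(0) = u(1) = 0 has the exact solution u = x(x - 1).
n = 64
h = 1.0 / n
u = [0.0] * (n + 1)
f = [2.0] * (n + 1)
for _ in range(15):
    v_cycle(u, f, h)
```

Each level does work proportional to its grid size and the sizes halve per level, which is where the O(n) cost quoted on the previous slide comes from.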

Page 11: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Implementation - Overview

For each step of the algorithm:

Bind as texture maps the buffers that contain the necessary data (current solution, residual, source terms, etc.)

Set the target buffer for rendering

Activate a fragment program that performs the necessary kernel computation (smoothing, residual calculation, restriction, interpolation)

Render a grid-sized quad with multitexturing

[Diagram: source buffer textures -> fragment program -> render target buffer]
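The four steps of a pass can be mimicked on the CPU to make the data flow concrete. This is only an analogy in plain Python, assuming a dictionary of 2D lists stands in for the bound textures and a per-cell function stands in for the fragment program; `render_pass` and `average_kernel` are hypothetical names, not part of any graphics API.

```python
def render_pass(kernel, sources, width, height):
    """Run `kernel` once per cell (fragment) and write into a new target."""
    target = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            target[y][x] = kernel(sources, x, y)
    return target


def average_kernel(sources, x, y):
    """Hypothetical kernel: 4-neighbour average with clamp-to-edge reads."""
    sol = sources["solution"]
    h, w = len(sol), len(sol[0])

    def tex(xx, yy):
        return sol[min(max(yy, 0), h - 1)][min(max(xx, 0), w - 1)]

    return 0.25 * (tex(x - 1, y) + tex(x + 1, y) + tex(x, y - 1) + tex(x, y + 1))


# One pass: bind sources, set target, activate kernel, "render" the grid.
front = [[0.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 0.0]]
back = render_pass(average_kernel, {"solution": front}, 3, 3)
front, back = back, front  # toggle old and new surfaces for the next pass
```

The kernel never writes to the buffer it reads from, which mirrors the separate source textures and render target in the slide.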

Page 12: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Implementation - Overview

Page 13: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Input buffers

Solution buffer: four-channel floating point pixel buffer (p-buffer)

one channel each for solution, residual, source term, and a debugging term

toggle front and back surfaces used to hold old and new solution

Operator map: contains the discretized operator (e.g., Laplacian)

Red-black map: accelerates odd-even tests (see later)

Page 14: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Smoothing

Jacobi method

one matrix row:

calculate new value for each solution vector element:

in our application, the aij are the Laplacian (sparse matrix):
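For reference, the three items above correspond to the standard textbook form of the Jacobi method, written out below. The symbols $x$, $b$ and the uniform grid spacing $h$ are assumptions for illustration; the five-point Laplacian shown is the usual one for a regular grid, and the discretized operator actually stored in the operator map may differ.

```latex
% One matrix row of the linear system A x = b:
\sum_{j} a_{ij}\, x_j = b_i

% Jacobi update for each solution vector element:
x_i^{(k+1)} = \frac{1}{a_{ii}} \Big( b_i - \sum_{j \neq i} a_{ij}\, x_j^{(k)} \Big)

% Five-point Laplacian on a regular grid with spacing h,
% so each row has only five non-zero a_{ij}:
\nabla^2 \phi_{i,j} \approx
  \frac{\phi_{i-1,j} + \phi_{i+1,j} + \phi_{i,j-1} + \phi_{i,j+1} - 4\,\phi_{i,j}}{h^2}
```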

Page 15: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Smoothing

Also factor in the source term

Use Red-black map to update only half of the grid cells in each pass

converges faster in practice

known as red-black iteration

requires two passes per iteration
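A red-black iteration can be sketched as follows. This is a CPU illustration in plain Python, assuming the heat equation $k \nabla^2 T = -S$ on a square grid with spacing h and fixed (Dirichlet) boundary values; the parity test `(i + j) % 2` plays the role of the red-black map, and the function names are hypothetical.

```python
def red_black_pass(T, S, k, h, colour):
    """Update only interior cells whose (i + j) parity equals `colour`."""
    n = len(T)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            if (i + j) % 2 == colour:
                neighbours = T[i - 1][j] + T[i + 1][j] + T[i][j - 1] + T[i][j + 1]
                T[i][j] = 0.25 * (neighbours + h * h * S[i][j] / k)


def red_black_iteration(T, S, k, h):
    """One full iteration = two passes, one per colour."""
    red_black_pass(T, S, k, h, 0)
    red_black_pass(T, S, k, h, 1)


# Usage: boundary held at 1, no source, so the steady state is T = 1.
n = 5
T = [[1.0 if i in (0, n - 1) or j in (0, n - 1) else 0.0 for j in range(n)]
     for i in range(n)]
S = [[0.0] * n for _ in range(n)]
for _ in range(100):
    red_black_iteration(T, S, k=1.0, h=0.25)
```

Cells of one colour read only cells of the other colour, so every update within a pass is independent of the others. That is what makes the scheme fit a one-kernel-per-pass model.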

Page 16: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Calculate residual

Apply operator (Laplacian) and source term to the current solution

residual $= k \nabla^2 T + S$

Store result in the target surface

Use occlusion query to determine if all solution fragments are below threshold (residual < threshold)

occlusion query = true means all fragments are below threshold

this is an $L_\infty$ norm, which may be too strict

less strict norms ($L_1$, $L_2$) require a reduction or a fragment accumulation register (not available yet); they could run on the CPU instead
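The residual pass and the threshold test can be sketched on the CPU as below. This assumes a five-point Laplacian with spacing h, and models the occlusion query as a count of cells whose residual magnitude exceeds the threshold, with a count of zero meaning converged. That mapping is an assumption about how the query is used, and the function names are hypothetical.

```python
def residual_2d(T, S, k, h):
    """residual = k * laplacian(T) + S at interior cells, zero on the border."""
    n = len(T)
    r = [[0.0] * n for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            lap = (T[i - 1][j] + T[i + 1][j] + T[i][j - 1] + T[i][j + 1]
                   - 4.0 * T[i][j]) / (h * h)
            r[i][j] = k * lap + S[i][j]
    return r


def fragments_above(r, threshold):
    """Occlusion-query analogue: number of cells still above the threshold."""
    return sum(1 for row in r for v in row if abs(v) > threshold)


def converged(r, threshold):
    """Max-norm test: converged only if no cell exceeds the threshold."""
    return fragments_above(r, threshold) == 0
```

Because a single cell above the threshold blocks convergence, this is the strict max-norm test described above.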

Page 17: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Multigrid reduction and refinement

Average (restrict) current residual into coarser grid

Iterate/smooth on coarser grid, solving $k \nabla^2 \delta = -S$ for the correction $\delta$

or restrict once more -> recursion

Interpolate correction back into finer grid

use bilinear interpolation

Update grid with this correction

Iterate/smooth on fine grid
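The restriction and interpolation steps above can be sketched in 2D as follows. This assumes a cell-centred layout in which each coarse cell covers a 2x2 block of fine cells, which gives the familiar 9/16, 3/16, 3/16, 1/16 bilinear weights. The layout and the function names are assumptions, and the stencils used in the talk may differ.

```python
def restrict_average(fine):
    """Average each 2x2 block of fine cells into one coarse cell."""
    n = len(fine) // 2
    return [[0.25 * (fine[2 * i][2 * j] + fine[2 * i + 1][2 * j]
                     + fine[2 * i][2 * j + 1] + fine[2 * i + 1][2 * j + 1])
             for j in range(n)] for i in range(n)]


def interpolate_bilinear(coarse):
    """Bilinear interpolation to a grid twice as fine, clamping at the edges."""
    n = len(coarse)

    def at(i, j):
        return coarse[min(max(i, 0), n - 1)][min(max(j, 0), n - 1)]

    fine = [[0.0] * (2 * n) for _ in range(2 * n)]
    for fi in range(2 * n):
        for fj in range(2 * n):
            i, j = fi // 2, fj // 2
            # Direction of the nearer coarse neighbour along each axis.
            di = 1 if fi % 2 else -1
            dj = 1 if fj % 2 else -1
            fine[fi][fj] = (9.0 * at(i, j) + 3.0 * at(i + di, j)
                            + 3.0 * at(i, j + dj) + at(i + di, j + dj)) / 16.0
    return fine
```

On a GPU the bilinear weights would typically come from the texture filtering hardware or a short fragment program rather than explicit arithmetic like this.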

Page 18: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Boundary conditions

Dirichlet (prescribed)

Neumann (prescribed derivative)

Mixed (coupled value and derivative)

Uk: value at grid point k

nk: normal at grid point k

Periodic boundaries result in toroidal mapping

Apply boundary conditions in smoothing pass
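For reference, the three boundary condition types listed above are usually written as below. Reading $U_k$ as the prescribed boundary data at grid point $k$ is an assumption, and $\alpha_k$ is a hypothetical coupling coefficient introduced here only to illustrate the mixed case.

```latex
\begin{aligned}
\text{Dirichlet:}\quad & \phi_k = U_k \\
\text{Neumann:}\quad   & \frac{\partial \phi_k}{\partial n_k} = U_k \\
\text{Mixed:}\quad     & \frac{\partial \phi_k}{\partial n_k} + \alpha_k\, \phi_k = U_k
\end{aligned}
```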

Page 19: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Boundary conditions

Only need to compute at boundaries

boundaries need significantly more computations

restrict computations to boundaries

GPUs do not allow branching

or rather, both branches are executed and the invalid result is discarded

even more wasteful

decompose domain into boundary and interior areas

use general (boundary) and fastpath (interior) shaders

run these in two separate passes, on respective domains
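The two-pass decomposition can be illustrated on the CPU as follows. This sketch assumes a simple 4-neighbour averaging kernel and clamp-to-edge reads as the stand-in for boundary handling. The kernel names are hypothetical and the boundary logic is far simpler than a real general shader.

```python
def general_kernel(src, i, j):
    """Slow path: clamps every read, so it is valid on boundary cells."""
    n = len(src)

    def at(a, b):
        return src[min(max(a, 0), n - 1)][min(max(b, 0), n - 1)]

    return 0.25 * (at(i - 1, j) + at(i + 1, j) + at(i, j - 1) + at(i, j + 1))


def fastpath_kernel(src, i, j):
    """Fast path: no boundary handling; only valid on interior cells."""
    return 0.25 * (src[i - 1][j] + src[i + 1][j] + src[i][j - 1] + src[i][j + 1])


def two_pass_update(src):
    n = len(src)
    dst = [[0.0] * n for _ in range(n)]
    # Pass 1: interior region, cheap kernel.
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            dst[i][j] = fastpath_kernel(src, i, j)
    # Pass 2: one-cell boundary frame, general kernel.
    for i in range(n):
        for j in range(n):
            if i in (0, n - 1) or j in (0, n - 1):
                dst[i][j] = general_kernel(src, i, j)
    return dst
```

The result matches running the general kernel everywhere, but the expensive path touches only the boundary frame, which is a small fraction of a large grid.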

Page 20: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Optimizing the Solver

Detect steady-state natively on GPU

Minimize shader length

Use special-casing whenever possible

Limit context switches

Page 21: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Optimizing the Solver: Steady-state

How to detect convergence?

$L_1$ norm - average error

$L_2$ norm - RMS error (common in visual sim)

$L_\infty$ norm - max error (common in sci/eng apps)

Can use occlusion query!

[Plot: seconds to steady state vs. grid size]
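The three norms can be written out directly, following the definitions on this slide (average, RMS, max). This is a plain Python sketch over a 2D list of residual values, and the function names are hypothetical.

```python
import math


def l1_norm(r):
    """Average absolute error."""
    vals = [abs(v) for row in r for v in row]
    return sum(vals) / len(vals)


def l2_norm(r):
    """Root-mean-square error."""
    vals = [v * v for row in r for v in row]
    return math.sqrt(sum(vals) / len(vals))


def linf_norm(r):
    """Maximum absolute error."""
    return max(abs(v) for row in r for v in row)
```

The first two need a sum over all cells (a reduction), while the max norm only needs to know whether any cell exceeds the threshold, which is why it maps onto an occlusion query.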

Page 22: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Optimizing the Solver: Shader length

Minimize number of registers used

Vectorize as much as possible

Use the rasterizer to perform computations of linearly-varying values

Pre-compute invariants on CPU

Compute texture coordinate offsets in vertex shader

shader | original fp | fastpath fp | fastpath vp
smooth | 79-6-1 | 20-4-1 | 12-2
residual | 45-7-0 | 16-4-0 | 11-1
restrict | 66-6-1 | 21-3-0 | 11-1
interpolate | 93-6-1 | 25-3-0 | 13-2

Page 23: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Optimizing the Solver: Special-case

Fast-path vs. slow-path

write several variants of each fragment program to handle boundary cases

eliminates conditionals in the fragment program

equivalent to avoiding CPU inner-loop branching

slow path with boundaries

fast path, no boundaries

Page 24: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Optimizing the Solver: Special-case

[Plot: seconds per V-cycle vs. grid size]

Page 25: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Optimizing the Solver: Context-switching

Find the best packing of data from multiple grid levels into the p-buffer surfaces: many p-buffers

Page 26: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Optimizing the Solver: Context-switching

Find the best packing of data from multiple grid levels into the p-buffer surfaces: two p-buffers

Page 27: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Optimizing the Solver: Context-switching

Find the best packing of data from multiple grid levels into the p-buffer surfaces: a single p-buffer

Still one front and one back surface for iterative smoothing

Page 28: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Optimizing the Solver: Context-switching

Remove context switching

Can introduce operations with undefined results: reading/writing same surface

Why do we need to do this? There is a chance that we write to and read from the same surface at the same time

Can we get away with it? Yes, we can; we just need to be careful to avoid these conflicts

What about RGBA parallelism? It was not used in this implementation; it may give another boost of a factor of 4

Page 29: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Data Layout

Performance:

[Plot: seconds to steady state vs. grid size]

Page 30: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Data Layout

Possible additional vectorization:

Compute 4 values at a time

Requires source, residual, solution values to be in different buffers

Complicates boundary calculations

Adds setup and teardown overhead

[Figure: stacked domain]

Page 31: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Results: CPU vs. GPU

Performance:

[Plot: seconds to steady state vs. grid size]

Page 32: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Applications – Flow Simulation

Page 33: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Applications – High Dynamic Range

CPU GPU

Page 34: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Conclusions

What we need going forward:

Superbuffers

or: Universal support for multiple-surface pbuffers

or: Cheap context switching

Developer tools

Debugging tools

Documentation

Global accumulator

Ever increasing amounts of precision, memory

Textures bigger than 2048 on a side

Page 35: A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Acknowledgements

Hardware

David Kirk

Matt Papakipos

Driver Support

Nick Triantos

Pat Brown

Stephen Ehmann

Fragment Programming

James Percy

Matt Pharr

General-purpose GPU

Mark Harris

Aaron Lefohn

Ian Buck

Funding

NSF Award #0092793