Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid

Jeffrey Bolz, Ian Farmer, Eitan Grinspun, Peter Schröder

Caltech ASCI Center

Why Use the GPU?

• Semiconductor trends
  – cost
  – wires vs. compute
  – Stanford streaming supercomputer

• Parallelism
  – many functional units
  – graphics is the prime example

• Harvesting this power
  – what applications are suitable?
  – what abstractions are useful?

• History
  – massively parallel SIMD machines
  – media processing

[Figure: Perf (ps/Inst) versus year, 1980 to 2020, with Linear, Possible, and Actual curves. Photos: Imagine stream processor (Bill Dally, Stanford) and Connection Machine CM2 (Thinking Machines).]

Contributions and Related Work

• Contributions
  – numerical algorithms on the GPU
    • unstructured grids: conjugate gradients
    • regular grids: multigrid
  – what abstractions are needed?

• Numerical algorithms
  – Goodnight et al. 2003 (MG)
  – Hall et al. 2003 (cache)
  – Harris et al. 2002 (FD sim.)
  – Hillesland et al. 2003 (optimization)
  – Krueger & Westermann 2003 (NLA)
  – Strzodka (PDEs)

Streaming Model

• Abstract model
  – Purcell et al. 2002
  – data structures: streams
  – algorithms: kernels

• Concrete model
  – render a rectangle
  – data structures: textures
  – algorithms: fragment programs

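[Figure: the abstract model, a kernel mapping an input record stream to an output record stream with access to globals, next to its concrete counterpart. The rasterizer sets up texture indices and all associated data. The fragment program runs for all pixels in parallel and uses textures as read-only memory. Output goes to a texture, and that buffer is bound to a texture for the next pass.]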
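The concrete model can be made explicit with a CPU emulation. The following is a minimal Python sketch, not GPU code. The names `run_kernel` and `saxpy` are hypothetical. `run_kernel` stands in for rendering a rectangle (one kernel call per output pixel) and the kernel stands in for the fragment program. Input textures are read-only nested lists, and globals are a plain dictionary.

```python
# Hypothetical CPU emulation of the GPU streaming model (illustration only).
from typing import Callable, Dict, List

Texture = List[List[float]]
Kernel = Callable[[int, int, Dict[str, Texture], Dict[str, float]], float]


def run_kernel(kernel: Kernel, width: int, height: int,
               textures: Dict[str, Texture], globals_: Dict[str, float]) -> Texture:
    """'Render a rectangle': invoke the kernel once per output pixel.

    Input textures are only read, never written, so every pixel could be
    evaluated in parallel; the result goes to a new output texture.
    """
    return [[kernel(x, y, textures, globals_) for x in range(width)]
            for y in range(height)]


def saxpy(x: int, y: int, tex: Dict[str, Texture], g: Dict[str, float]) -> float:
    """Example fragment program: out = alpha * X + Y at pixel (x, y)."""
    return g["alpha"] * tex["X"][y][x] + tex["Y"][y][x]


if __name__ == "__main__":
    X = [[1.0, 2.0], [3.0, 4.0]]
    Y = [[10.0, 20.0], [30.0, 40.0]]
    first = run_kernel(saxpy, 2, 2, {"X": X, "Y": Y}, {"alpha": 2.0})
    # "Bind buffer to texture": the output of one pass feeds the next pass.
    second = run_kernel(saxpy, 2, 2, {"X": first, "Y": Y}, {"alpha": 1.0})
    print(first)   # [[12.0, 24.0], [36.0, 48.0]]
    print(second)  # [[22.0, 44.0], [66.0, 88.0]]
```

Each pixel reads only from input textures and writes only its own output, so the loop order does not matter. That independence is what lets the GPU evaluate all pixels at once.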

Sparse Matrices: Geometric Flow

• Ubiquitous in numerical computing
  – discretization of PDEs: animation
    • finite elements, finite differences, finite volumes
  – optimization, editing, etc.

• Example here:
  – processing of surfaces

• Canonical non-linear problem
  – mean curvature flow
  – implicit time discretization
    • solve a sequence of SPD systems

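Mean curvature normal at vertex $i$ (cotangent formula; $\alpha_{ij}$ and $\beta_{ij}$ are the two angles opposite edge $ij$, and $A_i$ is the area associated with vertex $i$):

$$H_i n_i = \frac{1}{4A_i}\sum_{j \in N(i)} (\cot\alpha_{ij} + \cot\beta_{ij})\,(x_i - x_j)$$

Velocity opposite the mean curvature normal:

$$\partial_t x_i(t) = -H_i(t)\, n_i(t)$$

Implicit time step, one SPD solve per step $k$ ($x$ stacks all vertex positions):

$$A\,x^{k+1} = x^{k}$$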
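The cotangent formula takes only a few lines to sketch. The hypothetical Python helpers below (`cot_at`, `cotan_weights`, `implicit_matrix`) accumulate the edge weights $w_{ij} = \cot\alpha_{ij} + \cot\beta_{ij}$. They then assemble $I + \tau L$ for one implicit step, where $L$ is the cotangent Laplacian. To keep the code short it assumes unit vertex areas and folds the constant $1/(4A_i)$ into the step size $\tau$. It illustrates the idea and is not the authors' assembly code.

```python
# Sketch: cotangent edge weights and a simplified implicit-step matrix.
# Hypothetical helper names; vertex areas A_i are assumed to be 1 and the
# constant 1/(4 A_i) is folded into tau.
import math
from typing import Dict, List, Sequence, Tuple

Vec3 = Sequence[float]


def cot_at(a: Vec3, b: Vec3, c: Vec3) -> float:
    """Cotangent of the triangle angle at corner a (between edges ab and ac)."""
    u = [b[k] - a[k] for k in range(3)]
    v = [c[k] - a[k] for k in range(3)]
    dot = sum(u[k] * v[k] for k in range(3))
    cross = (u[1] * v[2] - u[2] * v[1],
             u[2] * v[0] - u[0] * v[2],
             u[0] * v[1] - u[1] * v[0])
    return dot / math.sqrt(sum(t * t for t in cross))


def cotan_weights(verts: List[Vec3],
                  tris: List[Tuple[int, int, int]]) -> Dict[Tuple[int, int], float]:
    """w_ij = cot(alpha_ij) + cot(beta_ij), summed over the (at most two)
    triangles sharing edge ij; each corner angle is opposite one edge."""
    w: Dict[Tuple[int, int], float] = {}
    for i, j, k in tris:
        for a, b, c in ((i, j, k), (j, k, i), (k, i, j)):
            edge = (min(b, c), max(b, c))
            w[edge] = w.get(edge, 0.0) + cot_at(verts[a], verts[b], verts[c])
    return w


def implicit_matrix(n: int, w: Dict[Tuple[int, int], float], tau: float):
    """Entries of A = I + tau * L for one step A x_new = x_old, where
    L_ii = sum_j w_ij and L_ij = -w_ij. Returns the diagonal and the
    symmetric off-diagonal entries separately."""
    diag = [1.0] * n
    off: Dict[Tuple[int, int], float] = {}
    for (i, j), wij in w.items():
        diag[i] += tau * wij
        diag[j] += tau * wij
        off[(i, j)] = off[(j, i)] = -tau * wij
    return diag, off
```

For $\tau > 0$ the resulting matrix is symmetric positive definite, because the cotangent Laplacian is positive semidefinite. Each time step is therefore a valid target for conjugate gradients. The weights depend on vertex positions and change every step, but the sparsity pattern does not.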

Conjugate Gradients

• High-level code
  – inner loop
    • matrix-vector multiply
    • sum-reduction
    • scalar-vector operations

• Inner product
  – fragment-wise multiply
  – followed by sum-reduction
  – odd dimensions can be handled
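The loop can be built from exactly the three primitives listed: matrix-vector multiply, sum-reduction, and scalar-vector updates. The code is textbook conjugate gradients in plain Python, and the helper names are hypothetical. `reduce_sum` mimics a multi-pass GPU reduction that halves the array on each pass. It carries an odd leftover element through unchanged, which is one way (assumed here) to handle odd dimensions.

```python
# Sketch of CG using only GPU-style primitives: matrix-vector multiply,
# sum-reduction, and scalar-vector updates (pure-Python stand-ins).
import math
from typing import Callable, List

Vector = List[float]


def reduce_sum(values: Vector) -> float:
    """Sum in ceil(log2 n) passes; each pass adds neighbouring pairs and
    halves the array. An odd leftover element is copied to the next pass."""
    data = list(values)
    if not data:
        return 0.0
    while len(data) > 1:
        nxt = [data[2 * k] + data[2 * k + 1] for k in range(len(data) // 2)]
        if len(data) % 2 == 1:
            nxt.append(data[-1])
        data = nxt
    return data[0]


def dot(a: Vector, b: Vector) -> float:
    """Inner product: fragment-wise multiply, then sum-reduction."""
    return reduce_sum([x * y for x, y in zip(a, b)])


def axpy(alpha: float, x: Vector, y: Vector) -> Vector:
    """Scalar-vector update: alpha * x + y, element-wise."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def conjugate_gradient(matvec: Callable[[Vector], Vector], b: Vector,
                       x0: Vector, tol: float = 1e-10,
                       max_iter: int = 1000) -> Vector:
    """Standard CG for a symmetric positive definite system A x = b."""
    x = list(x0)
    r = axpy(-1.0, matvec(x), b)          # r = b - A x
    p = list(r)
    rr = dot(r, r)
    for _ in range(max_iter):
        if math.sqrt(rr) < tol:
            break
        Ap = matvec(p)                     # matrix-vector multiply
        alpha = rr / dot(p, Ap)            # first sum-reduction
        x = axpy(alpha, p, x)
        r = axpy(-alpha, Ap, r)
        rr_new = dot(r, r)                 # second sum-reduction
        p = axpy(rr_new / rr, p, r)        # p = r + beta * p
        rr = rr_new
    return x
```

Each iteration needs one matrix-vector multiply, two inner products (each a multi-pass reduction), and three scalar-vector updates.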

Row-Vector Product

Data read by the fragment program:
  – Ai: diagonal matrix elements
  – Aj: off-diagonal matrix elements
  – R: pointers to segments
  – J: pointers to x_j
  – X: vector elements
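Here is a CPU sketch of the row-vector product over that layout. It makes two hypothetical assumptions: the textures are flattened to 1D lists, and `R` holds n + 1 offsets. Row i's off-diagonal entries are then `Aj[R[i]:R[i+1]]`, with column indices `J[R[i]:R[i+1]]`. On the GPU, each outer iteration would be one fragment.

```python
# Sketch of the row-vector product over the Ai / Aj / J / R / X layout,
# flattened to 1D lists. Assumption: R holds n + 1 segment offsets.
from typing import List


def row_vector_product(Ai: List[float], Aj: List[float], J: List[int],
                       R: List[int], X: List[float]) -> List[float]:
    y = []
    for i in range(len(Ai)):               # one fragment per row
        acc = Ai[i] * X[i]                 # diagonal term
        for k in range(R[i], R[i + 1]):    # walk this row's segment
            acc += Aj[k] * X[J[k]]         # J[k] is the indirection to x_j
        y.append(acc)
    return y


# Example: A = [[4, 1, 0], [1, 3, 1], [0, 1, 2]], x = [1, 2, 3]
Ai = [4.0, 3.0, 2.0]
Aj = [1.0, 1.0, 1.0, 1.0]
J = [1, 0, 2, 1]
R = [0, 1, 3, 4]
print(row_vector_product(Ai, Aj, J, R, [1.0, 2.0, 3.0]))  # [6.0, 10.0, 8.0]
```

The inner loop's trip count varies from row to row. On a SIMD fragment pipe, all fragments in a pass run the same program, so when rows of very different lengths share a pass, every row pays for the longest one.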

Apply to All Pixels

• Two extremes
  – one row at a time: setup overhead
  – all rows at once: limited by the worst row

• Middle ground
  – organize “batches” of work

• How to arrange batches?
  – order rows by number of non-zero entries
    • optimal packing is NP-hard

• We choose fixed-size rectangles
  – fragment pipe is quantized
  – simple experiments reveal the best size
    • 26 x 18: 91% efficient
    • wasted fragments on diagonal

[Figure: greedy packing of rows into rectangles (area in pixels), with rows ordered by non-zero entries per row, e.g. 15, 13, 13, 12, 12, 11, 10, 9, 9, ...; each batch is bound to an appropriate fragment program.]

All this setup is done only once, at the beginning of time; it depends only on mesh connectivity.
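The batching idea can be sketched in simplified form. The sketch sorts rows by non-zero count and cuts the sorted order into fixed-size batches, for example 26 x 18 = 468 rows per rectangle. Each batch then gets a fragment program that loops as many times as its widest row needs. This is an assumption-laden stand-in, not the authors' packer. The function names are hypothetical, and the efficiency model simply charges every batch a full rectangle at its widest loop.

```python
# Simplified batching sketch (assumption, not the authors' exact packer).
from typing import List, Tuple

Batch = Tuple[int, List[int]]  # (loop count for this batch, row indices)


def make_batches(nnz_per_row: List[int], batch_size: int) -> List[Batch]:
    """Sort rows by non-zero count (descending) and cut into fixed-size batches."""
    order = sorted(range(len(nnz_per_row)), key=lambda r: -nnz_per_row[r])
    batches = []
    for start in range(0, len(order), batch_size):
        rows = order[start:start + batch_size]
        batches.append((nnz_per_row[rows[0]], rows))  # first row is widest
    return batches


def efficiency(nnz_per_row: List[int], batches: List[Batch],
               batch_size: int) -> float:
    """Useful work divided by issued work. Assumes every batch is charged a
    full rectangle (the fragment pipe is quantized) at its widest loop."""
    useful = sum(nnz_per_row)
    issued = sum(width * batch_size for width, _ in batches)
    return useful / issued


if __name__ == "__main__":
    nnz = [9, 9, 8, 8, 7, 15, 13, 13]
    batches = make_batches(nnz, 3)
    print([w for w, _ in batches])               # [15, 9, 8]
    print(round(efficiency(nnz, batches, 3), 3))  # 0.854
```

Sorting places rows of similar length together, so each batch's loop count stays close to the count of each row in it. That keeps the wasted fragments low.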

Recomputing Matrix

• Matrix entries depend on surface
  – must “render” into matrix
  – two additional indirection textures
    • previous and next

Results (NV30@500MHz)

• 37k elements
  – matrix multiply
    • 33 instructions, 120 per second
    • only 13 flops
    • latency limited
  – reduction
    • 7 inst/frag/pass, 3400 per second
  – CG solve: 20 per second

Regular Grids

• Poisson solver as example
  – multigrid approach
  – this time, variables on a “pixel grid”
    • e.g., Navier-Stokes

After discretization, solve a Poisson equation for the pressure at each time step:

$$\nabla^2 p = \nabla \cdot u$$

Poisson Equation

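• Appears all over the place
  – easy to discretize on a regular grid
  – matrix multiply is stencil application
  – FD Laplace stencil (weight -4 at $(i,j)$, weight 1 at its four neighbours, grid spacing $h$):

$$(\nabla^2 u)_{i,j} \approx \frac{u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4\,u_{i,j}}{h^2}$$

• Use iterative matrix solver
  – just need application of stencil
    • easy: just like filtering
    • incorporate geometry (Jacobian)
    • variable coefficients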
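Two of the points above can be shown in one short Python sketch: stencil application acts as a filter, and an iterative solver needs nothing else. The sketch applies the five-point stencil (weight -4 at $(i,j)$, weight 1 at the four neighbours, divided by $h^2$). It then runs Jacobi iteration for $\nabla^2 u = f$ with fixed boundary values. Jacobi is chosen here only because it is the simplest stencil-only solver, and the helper names are hypothetical.

```python
# Sketch: FD Laplace stencil as a "filter" over a 2D grid, plus Jacobi
# iteration for lap(u) = f using only stencil applications.
# Boundary values are held fixed (Dirichlet).
from typing import List

Grid = List[List[float]]


def apply_laplace(u: Grid, h: float) -> Grid:
    """Five-point stencil: neighbours weighted 1, centre weighted -4, over h^2."""
    n, m = len(u), len(u[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = (u[i + 1][j] + u[i - 1][j] + u[i][j + 1]
                         + u[i][j - 1] - 4.0 * u[i][j]) / (h * h)
    return out


def jacobi(f: Grid, u: Grid, h: float, iters: int) -> Grid:
    """Each sweep solves the stencil equation for the centre value using the
    previous iterate only, so all interior points update independently."""
    n, m = len(u), len(u[0])
    for _ in range(iters):
        new = [row[:] for row in u]
        for i in range(1, n - 1):
            for j in range(1, m - 1):
                new[i][j] = (u[i + 1][j] + u[i - 1][j] + u[i][j + 1]
                             + u[i][j - 1] - h * h * f[i][j]) / 4.0
        u = new
    return u
```

Each Jacobi sweep reads only the previous iterate. On the GPU it is therefore a single rendering pass with the old grid bound as a texture. Jacobi converges slowly on smooth error, and that weakness motivates multigrid.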

Multigrid


• Fine to coarse to fine cycle
  – high-frequency error removed quickly
  – lower-frequency error takes longer

Relax, Project, Interpolate
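A 1D model problem keeps the cycle readable. The sketch below solves -u'' = f on a vertex-centred grid with 2^k + 1 points (the same shape as 257 = 2^8 + 1) and zero boundary values. It uses weighted Jacobi to relax, full weighting to project, and linear interpolation. These are standard textbook choices and not necessarily those of the 2D GPU solver, and all function names are hypothetical.

```python
# 1D multigrid V-cycle sketch for -u'' = f with zero boundary values.
# Assumed components: weighted Jacobi (relax), full weighting (project),
# linear interpolation (interpolate). Grids have 2^k + 1 points.
import math
from typing import List

Vector = List[float]


def residual(u: Vector, f: Vector, h: float) -> Vector:
    """r = f - A u with A u = (2 u_i - u_{i-1} - u_{i+1}) / h^2."""
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r


def relax(u: Vector, f: Vector, h: float, sweeps: int = 2,
          omega: float = 2.0 / 3.0) -> Vector:
    """Weighted Jacobi: damps high-frequency error in a few sweeps."""
    for _ in range(sweeps):
        new = u[:]
        for i in range(1, len(u) - 1):
            jac = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
            new[i] = (1.0 - omega) * u[i] + omega * jac
        u = new
    return u


def project(r: Vector) -> Vector:
    """Full weighting, fine (2m + 1 points) -> coarse (m + 1 points)."""
    m = (len(r) - 1) // 2
    rc = [0.0] * (m + 1)
    for I in range(1, m):
        rc[I] = 0.25 * r[2 * I - 1] + 0.5 * r[2 * I] + 0.25 * r[2 * I + 1]
    return rc


def interpolate(ec: Vector) -> Vector:
    """Linear interpolation, coarse (m + 1 points) -> fine (2m + 1 points)."""
    m = len(ec) - 1
    e = [0.0] * (2 * m + 1)
    for I in range(m + 1):
        e[2 * I] = ec[I]
    for I in range(m):
        e[2 * I + 1] = 0.5 * (ec[I] + ec[I + 1])
    return e


def v_cycle(u: Vector, f: Vector, h: float) -> Vector:
    if len(u) == 3:                      # coarsest grid: one unknown, solve exactly
        return [u[0], 0.5 * (u[0] + u[2] + h * h * f[1]), u[2]]
    u = relax(u, f, h)                   # pre-smoothing
    rc = project(residual(u, f, h))      # fine -> coarse
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h)  # coarse-grid correction
    u = [ui + ei for ui, ei in zip(u, interpolate(ec))]  # coarse -> fine
    return relax(u, f, h)                # post-smoothing


if __name__ == "__main__":
    n, h = 33, 1.0 / 32
    f = [math.pi ** 2 * math.sin(math.pi * i * h) for i in range(n)]
    u = [0.0] * n
    for cycle in range(10):
        u = v_cycle(u, f, h)
        print(cycle, max(abs(r) for r in residual(u, f, h)))
```

Relaxation only needs to remove error that oscillates on the current grid. Smooth error becomes oscillatory after projection and is handled on a coarser level. This division of labour is why the cycle converges at a rate largely independent of grid size.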

Computations and Storage Layout

• Lots of stencil applications
  – matrix multiply: 3x3 stencil
  – projection: 3x3 stencil
  – interpolation: 2x2 (!)
    • floor op in indexing

• Storage for matrices and DOFs
  – variables in one texture
  – matrices in 9 (= 3x3) textures
  – all textures packed
    • exploit 4 channels
    • domain decomposition
    • padded boundary

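Interpolation, with $i$ a 2D fine-grid index:

$$v^{h}_{i} = \frac{1}{4}\sum_{d \in \{0,1\}^2} v^{2h}_{\lfloor (i+d)/2 \rfloor}$$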
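Two of these operations are sketched below in Python, with nested lists standing in for textures. The first is the 2x2 interpolation with its floor-based indexing, $v^h_i = \frac14 \sum_{d\in\{0,1\}^2} v^{2h}_{\lfloor (i+d)/2 \rfloor}$, on a vertex-centred grid. The second is a matrix-vector product with the operator stored as nine coefficient planes, one per 3x3 stencil offset. Channel packing, domain decomposition, and boundary padding are not modelled, and the function names are hypothetical.

```python
# Sketch: floor-indexed 2x2 interpolation and a 9-plane stencil matvec on
# a vertex-centred grid; nested lists stand in for textures.
from typing import List

Grid = List[List[float]]
OFFSETS = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]


def interpolate(coarse: Grid) -> Grid:
    """v^h_i = 1/4 * sum over d in {0,1}^2 of v^{2h}_{floor((i + d) / 2)}.

    Even fine indices map twice onto the same coarse node and odd ones onto
    two neighbours, so this is bilinear interpolation with a 2x2 footprint."""
    m = len(coarse) - 1
    n = 2 * m + 1
    fine = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for di in (0, 1):
                for dj in (0, 1):
                    total += coarse[(i + di) // 2][(j + dj) // 2]  # floor op
            fine[i][j] = 0.25 * total
    return fine


def stencil_matvec(planes: List[Grid], v: Grid) -> Grid:
    """y = A v with A stored as 9 coefficient planes, one per offset in
    OFFSETS: planes[k][i][j] multiplies v[i + di][j + dj]. Boundary entries
    are left at zero (fixed boundary assumed)."""
    n, m = len(v), len(v[0])
    y = [[0.0] * m for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            y[i][j] = sum(planes[k][i][j] * v[i + di][j + dj]
                          for k, (di, dj) in enumerate(OFFSETS))
    return y
```

Storing one plane per offset means every coefficient can vary per node, which accommodates geometry (Jacobian) terms and variable coefficients without changing the kernel.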

Coarser Matrices

• Operator at coarser level
  – needed for relaxation at all levels

• Triple matrix product…
  – work out terms and map to stencils
    • exploit local support of stencils
    • straightforward but t-e-d-i-o-u-s

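[Equation: the coarse operator $A^{2h}$ as the triple product of the projection stencil, $A^{h}$, and the interpolation stencil ($S$, $S'$), written out as sums over stencil offsets $d, e, g \in \{-1, 0, 1\}$ with a factor of $1/4$.]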
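What "work out terms" means can be checked numerically on a 1D model problem. The sketch below forms the triple product with small dense matrices. It assumes interior unknowns only, linear interpolation $P$, projection $R = P^T/2$ (1D full weighting), and $A^h = \mathrm{tridiag}(-1, 2, -1)/h^2$. The helper names are hypothetical. The dense products serve only as a reference against which stencil-by-stencil formulas could be tested.

```python
# Sketch: checking the Galerkin triple product on a 1D model problem with
# small dense matrices. Assumptions: interior unknowns only, linear
# interpolation P, projection R = P^T / 2 (1D full weighting), and
# A^h = tridiag(-1, 2, -1) / h^2.
from typing import List

Matrix = List[List[float]]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def laplace_1d(n: int, h: float) -> Matrix:
    """Dense tridiag(-1, 2, -1) / h^2 on n interior unknowns."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 2.0 / (h * h)
        if i > 0:
            A[i][i - 1] = -1.0 / (h * h)
        if i < n - 1:
            A[i][i + 1] = -1.0 / (h * h)
    return A


def interpolation_1d(m: int) -> Matrix:
    """(2m + 1) x m linear interpolation; coarse node I sits at fine node 2I + 1."""
    P = [[0.0] * m for _ in range(2 * m + 1)]
    for I in range(m):
        P[2 * I][I] = 0.5
        P[2 * I + 1][I] = 1.0
        P[2 * I + 2][I] = 0.5
    return P


def galerkin_coarse(A: Matrix, P: Matrix) -> Matrix:
    """A^{2h} = R A^h P with R = P^T / 2."""
    R = [[0.5 * x for x in row] for row in zip(*P)]
    return matmul(R, matmul(A, P))


if __name__ == "__main__":
    Ac = galerkin_coarse(laplace_1d(7, 1.0), interpolation_1d(3))
    for row in Ac:
        print(row)
```

In this constant-coefficient case, the Galerkin operator equals the fine operator rediscretized at spacing $2h$. With variable coefficients or geometry, that is generally not true, so the terms must be derived explicitly.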

Results (NV30@500MHz)

• 257x257 grid
  – matrix multiply: 27 instructions
    • 1370 per second
  – interpolation: 10 inst.
  – projection: 19 inst.

• Overall performance
  – 257x257 at 80 fps!

Conclusions

• Enhancements
  – global registers for reductions
  – texture fetch with offset
  – rectangular texture border
  – scalar versus vector problems

• Where are we now?
  – good streaming processor
  – twice as fast as CPU implementation
  – lots of room for improvement

• Scientific computing compiler
  – better languages! Brook? C*?
  – manage layout in a buffer
