Examples of Parallel Algorithms

Ioan Lucian Muntean
Department of Computer Science – Chair V
Technische Universität München, Germany

Fifth SimLab Short Course on Parallel Numerical Simulation
Belgrade, October 1-7, 2006
October 5, 2006
6.1. The Jacobi and Gauß-Seidel Iterations

• scenario:
  – solve an elliptic partial differential equation (PDE) with Dirichlet boundary conditions on a given domain Ω
  – simple example: Poisson's equation ∆u = f on the unit square Ω = ]0, 1[², with u given on Ω's boundary:

        ∆u(x, y) = ∂²u(x, y)/∂x² + ∂²u(x, y)/∂y² = f(x, y)   for (x, y) ∈ Ω,
        u(x, y) = g(x, y)   for (x, y) ∈ ∂Ω

  – the function u(x, y) (or an approximation to it) has to be found
  – occurrences: a fitted membrane, the stationary heat equation, ...
• discretization:
  – for its solution, the PDE has to be discretized
  – again a simple example: the finite difference discretization for mesh width h:

        ∂²u(x, y)/∂x² ≈ (u(x−h, y) − 2u(x, y) + u(x+h, y)) / h²,
        ∂²u(x, y)/∂y² ≈ (u(x, y−h) − 2u(x, y) + u(x, y+h)) / h²
The Jacobi and Gauß-Seidel Iterations (cont'd)

• discretization (cont'd):
  – introduce an equidistant grid of (N + 1)² grid points u_{i,j} ≈ u(ih, jh), i = 0, ..., N, j = 0, ..., N, N = 1/h
  – resulting discrete equation in the interior:

        u_{i,j−1} + u_{i−1,j} − 4u_{i,j} + u_{i+1,j} + u_{i,j+1} = h² f_{i,j},   0 < i, j < N

    this scheme is called the five-point difference star
  – resulting equation on the boundary:

        u_{i,j} = g(ih, jh)   for i = 0, i = N, j = 0, or j = N
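The five-point star above can be checked on a function with known Laplacian. A minimal sketch in Python: for u(x, y) = x² + y² we have ∆u = 4, and the stencil reproduces this exactly, since the truncation error vanishes for quadratics (the grid size is an illustrative choice):

```python
# Apply the five-point difference star to u(x, y) = x^2 + y^2,
# whose Laplacian is exactly 4 everywhere.

def five_point_laplacian(u, i, j, h):
    """Discrete Laplacian at interior grid point (i, j) via the five-point star."""
    return (u[i][j - 1] + u[i - 1][j] - 4 * u[i][j]
            + u[i + 1][j] + u[i][j + 1]) / h**2

N = 8
h = 1.0 / N
# grid values u_ij = u(ih, jh)
u = [[(i * h)**2 + (j * h)**2 for j in range(N + 1)] for i in range(N + 1)]

print(five_point_laplacian(u, 3, 5, h))  # -> 4.0 up to round-off
```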
The Resulting System of Linear Equations

• for each inner point, one linear equation in the unknowns u_{i,j}
• equations in points next to the boundary (i.e. i = 1 or i = N−1 or j = 1 or j = N−1) access the boundary values
  – these are shifted to the right-hand side of the equation
  – hence, all unknowns are located to the left of the '=' sign, all known quantities to its right
• assemble the overall vector of unknowns by lexicographic row-wise ordering
• result: a system Ax = b of (N − 1)² linear equations in (N − 1)² unknowns
• the matrix A is block-tridiagonal with identity or tridiagonal blocks I and T, respectively:
  A =
      ( T  I           )
      ( I  T  I        )
      (    I  ⋱  ⋱     )
      (       ⋱  ⋱  I  )
      (          I  T  )

  T =
      ( −4   1            )
      (  1  −4   1        )
      (      1   ⋱   ⋱    )
      (          ⋱   ⋱  1 )
      (             1 −4  )   ∈ R^(N−1)×(N−1)
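The block-tridiagonal structure can be made concrete by assembling A for a small grid. A minimal sketch in Python, using a dense list of lists purely to make the pattern visible (real codes would of course use a sparse format); the grid size N = 4 is an illustrative choice:

```python
# Assemble the block-tridiagonal Poisson matrix A for the unknowns
# u_{i,j}, 0 < i, j < N, in lexicographic row-wise ordering.

def assemble_poisson_matrix(N):
    n = N - 1                      # unknowns per grid row (block size)
    M = n * n                      # total number of unknowns
    A = [[0] * M for _ in range(M)]
    for j in range(n):
        for i in range(n):
            k = j * n + i          # lexicographic index of grid point (i+1, j+1)
            A[k][k] = -4           # diagonal of the T blocks
            if i > 0:
                A[k][k - 1] = 1    # left neighbour (within T)
            if i < n - 1:
                A[k][k + 1] = 1    # right neighbour (within T)
            if j > 0:
                A[k][k - n] = 1    # lower neighbour (identity block I)
            if j < n - 1:
                A[k][k + n] = 1    # upper neighbour (identity block I)
    return A

A = assemble_poisson_matrix(4)
print(len(A))   # 9 equations in 9 unknowns for N = 4
print(A[4])     # the middle grid point couples to all four neighbours
```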
Solving Large Sparse Systems of Linear Equations

• the standard textbook method is Gaussian elimination
• this is a so-called direct solver, which provides the exact solution of the system (apart from round-off errors)
• drawbacks of Gaussian elimination:
  – for M unknowns, one needs O(M³) arithmetic operations (not acceptable for really large M, as they are standard in modern simulation problems)
  – the algorithm does not exploit the sparsity of the matrix: existing zeroes are "destroyed" (turned into non-zeroes), which produces more computational work and more storage requirements
• therefore: use iterative methods instead
  – they approach the exact solution and approximate it, but typically do not reach it
  – one step of the iteration costs O(M) operations
  – typically much fewer than O(M²) steps are needed (the gain)
  – ideal case (multigrid or multilevel methods): only O(1) steps needed
  – basic (and not that sophisticated) methods (number of steps still depending on M):
    * relaxation methods: Jacobi, Gauß-Seidel, SOR
    * minimization methods: steepest descent, conjugate gradients
The Jacobi Iteration

• decompose A into its diagonal part D_A, its upper triangular part U_A, and its lower triangular part L_A:

        A = L_A + D_A + U_A

• starting point: b = Ax = D_A x + (L_A + U_A) x
• writing b = D_A x^(it+1) + (L_A + U_A) x^(it), with x^(it) denoting the approximation to x after it steps of the iteration, leads to the following iterative scheme:

        x^(it+1) := −D_A⁻¹ (L_A + U_A) x^(it) + D_A⁻¹ b = x^(it) + D_A⁻¹ r^(it),

  where the residual is defined as r^(it) = b − A x^(it)
• or in a more explicit algorithmic form:

        for it = 0, 1, 2, ...:
            for k = 1, ..., M:
                x_k^(it+1) = (1 / a_{k,k}) (b_k − Σ_{j≠k} a_{k,j} x_j^(it))

• for our special A resulting from the finite difference discretization of the Poisson equation, this means (pay attention to the indices!):

        for it = 0, 1, 2, ...:
            for j = 1, ..., N−1:
                for i = 1, ..., N−1:
                    u_{i,j}^(it+1) = (1/4) (u_{i,j−1}^(it) + u_{i−1,j}^(it) + u_{i,j+1}^(it) + u_{i+1,j}^(it) − h² f_{i,j})

• remember that the boundary values are fixed
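The Jacobi sweep for the Poisson problem can be sketched directly in Python. The model problem (f = 0, boundary values g ≡ 1, so the exact solution is u ≡ 1 everywhere) and the grid size and step count are illustrative choices:

```python
# Jacobi iteration for the discrete Poisson equation on the unit square
# with f = 0 and Dirichlet boundary values 1; the iterates approach the
# exact solution u = 1.

N = 8
h = 1.0 / N
f = [[0.0] * (N + 1) for _ in range(N + 1)]
u = [[0.0] * (N + 1) for _ in range(N + 1)]
for k in range(N + 1):             # fixed Dirichlet boundary values g = 1
    u[k][0] = u[k][N] = u[0][k] = u[N][k] = 1.0

for it in range(200):
    u_new = [row[:] for row in u]  # Jacobi: update from the old iterate only
    for j in range(1, N):
        for i in range(1, N):
            u_new[i][j] = 0.25 * (u[i][j - 1] + u[i - 1][j]
                                  + u[i][j + 1] + u[i + 1][j]
                                  - h * h * f[i][j])
    u = u_new

print(abs(u[N // 2][N // 2] - 1.0))  # small: iterates converge towards 1
```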
The Gauß-Seidel Iteration

• take the same decomposition A = L_A + D_A + U_A
• new starting point: b = Ax = (D_A + L_A) x + U_A x
• writing b = (D_A + L_A) x^(it+1) + U_A x^(it) leads to the following iterative scheme:

        x^(it+1) := −(D_A + L_A)⁻¹ U_A x^(it) + (D_A + L_A)⁻¹ b = x^(it) + (D_A + L_A)⁻¹ r^(it)

• or in a more explicit algorithmic form:

        for it = 0, 1, 2, ...:
            for k = 1, ..., M:
                x_k^(it+1) = (1 / a_{k,k}) (b_k − Σ_{j=1}^{k−1} a_{k,j} x_j^(it+1) − Σ_{j=k+1}^{M} a_{k,j} x_j^(it))

• for our special A resulting from the finite difference discretization of the Poisson equation, this means (pay attention to the indices!):

        for it = 0, 1, 2, ...:
            for j = 1, ..., N−1:
                for i = 1, ..., N−1:
                    u_{i,j}^(it+1) = (1/4) (u_{i,j−1}^(it+1) + u_{i−1,j}^(it+1) + u_{i,j+1}^(it) + u_{i+1,j}^(it) − h² f_{i,j})

• remember again that the boundary values are fixed
• there is no general superiority of Gauß-Seidel over Jacobi; in the case discussed here, however, Gauß-Seidel converges twice as fast as Jacobi
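The Gauß-Seidel sweep can be sketched for the same model problem (f = 0, boundary values 1, exact solution u ≡ 1). Note that a single array suffices, since new values are used as soon as they are available; the step count is an illustrative choice, roughly half of what the Jacobi sketch above needed:

```python
# Gauss-Seidel iteration for the discrete Poisson equation: updated
# values u^(it+1) are used immediately, so no copy of the grid is kept.

N = 8
h = 1.0 / N
f = [[0.0] * (N + 1) for _ in range(N + 1)]
u = [[0.0] * (N + 1) for _ in range(N + 1)]
for k in range(N + 1):             # fixed Dirichlet boundary values g = 1
    u[k][0] = u[k][N] = u[0][k] = u[N][k] = 1.0

for it in range(100):
    for j in range(1, N):
        for i in range(1, N):
            # u[i][j-1] and u[i-1][j] already hold the new iterate here
            u[i][j] = 0.25 * (u[i][j - 1] + u[i - 1][j]
                              + u[i][j + 1] + u[i + 1][j]
                              - h * h * f[i][j])

print(abs(u[N // 2][N // 2] - 1.0))
```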
Parallelizing Jacobi

• note that neither Jacobi nor Gauß-Seidel is used today any more – they are too slow; nevertheless, the algorithmic aspects are still of interest
• a parallel Jacobi algorithm is quite straightforward:
  – in the current iteration step, only values from the previous step are used
  – hence, all updates of one iteration step can be made in parallel (if that many processors are available)
  – more realistic scenario: subdivide the domain into strips or squares, for example (which is better with respect to a good communication-computation ratio?)
Parallelizing Jacobi (cont'd)

• each processor needs for its calculations:
  – if adjacent to the boundary: a subset of the boundary values
  – one row or one column of values from the processors dealing with the neighbouring subdomains
  – some hint when to stop
• the above considerations lead to the following algorithm, which each processor has to execute:
  1. update all local approximate values u_{i,j}^(it) to u_{i,j}^(it+1)
  2. send all updates in points next to interior boundaries to the respective processors
  3. receive all necessary updates from the "neighbouring" processors
  4. compute the local residual values and provide them via a reduce operation
  5. receive the overall residual as the reduce operation's result and go back to 1. if this value is larger than some given threshold
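The five steps above can be simulated sequentially in Python: the grid is cut into horizontal strips of interior rows, each "processor" updates its strip from the previous iterate (the rows just outside a strip play the role of the ghost rows received from the neighbours), and summing the local residuals stands in for the reduce operation. P, N, and the threshold are illustrative choices; a real implementation would use message passing (e.g. MPI) instead of shared lists:

```python
import math

N = 8
h = 1.0 / N
P = 2                                # number of "processors" (strips)
f = [[0.0] * (N + 1) for _ in range(N + 1)]
u = [[0.0] * (N + 1) for _ in range(N + 1)]
for k in range(N + 1):               # Dirichlet boundary g = 1, solution u = 1
    u[k][0] = u[k][N] = u[0][k] = u[N][k] = 1.0

interior = list(range(1, N))
strips = [interior[p * len(interior) // P:(p + 1) * len(interior) // P]
          for p in range(P)]         # strip decomposition of the interior rows

steps = 0
while True:
    u_new = [row[:] for row in u]    # step 1: local Jacobi updates
    local_res = []                   # (the strip loops would run in parallel)
    for my_rows in strips:
        r2 = 0.0
        for i in my_rows:
            for j in range(1, N):
                u_new[i][j] = 0.25 * (u[i][j - 1] + u[i - 1][j]
                                      + u[i][j + 1] + u[i + 1][j]
                                      - h * h * f[i][j])
                # residual of the old iterate: r_ij = -4 (u_new - u)
                r2 += (4.0 * (u_new[i][j] - u[i][j])) ** 2
        local_res.append(r2)         # step 4: local residual contribution
    u = u_new                        # steps 2/3: updates become visible
    steps += 1
    res = math.sqrt(sum(local_res))  # step 5: the reduce operation's result
    if res < 1e-6:
        break

print(steps, abs(u[N // 2][N // 2] - 1.0))
```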
Parallelizing Gauß-Seidel

• at first glance, there seems to be an enforced sequential order, since the updated values are used immediately where available
• remedy: change the order of visiting and updating the grid points
• first possibility: wavefront ordering
  – diagonal order of updating
  – all values along a diagonal line can be updated in parallel
  – the single diagonal lines have to be processed sequentially, however
  – problem: suppose we have P = N − 1 processors; then there are P² overall updates that can be organized in 2P − 1 sequential steps (diagonals), which restricts the speed-up to roughly P/2
  – better: use only P = (N − 1)/k processors; then we get k sequential strips of kP² updates each and kP + P − 1 sequential internal steps; now, the speed-up is given by k·kP²/(k(kP + P − 1)), which is roughly kP/(k + 1)
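The wavefront ordering can be sketched in Python: the interior points are visited diagonal by diagonal. Within one diagonal d, the points (i, j) with i + j = d depend only on already-updated values with smaller i + j, so they could be updated concurrently; only the diagonals themselves are sequential. Grid size and step count are illustrative:

```python
# Gauss-Seidel sweep in wavefront (diagonal) order for the discrete
# Poisson equation with f = 0 and boundary values 1 (solution u = 1).

N = 8
h = 1.0 / N
f = [[0.0] * (N + 1) for _ in range(N + 1)]
u = [[0.0] * (N + 1) for _ in range(N + 1)]
for k in range(N + 1):
    u[k][0] = u[k][N] = u[0][k] = u[N][k] = 1.0

for it in range(100):
    for d in range(2, 2 * N - 1):            # the 2P - 1 sequential diagonals
        # all points (i, j) with i + j = d are independent of each other
        for i in range(max(1, d - N + 1), min(N, d)):
            j = d - i
            u[i][j] = 0.25 * (u[i][j - 1] + u[i - 1][j]
                              + u[i][j + 1] + u[i + 1][j]
                              - h * h * f[i][j])

print(abs(u[N // 2][N // 2] - 1.0))
```

For the five-point star this ordering produces exactly the same iterates as the lexicographic sweep, since each point still sees new values from its south and west neighbours and old values from the other two.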
Parallelizing Gauß-Seidel (cont'd)

• second possibility: red-black or checkerboard ordering
  – give the grid points a checkerboard colouring of red (!) and black
  – order of visiting and updating: first lexicographically the red ones, then lexicographically the black ones
  – no dependencies within the red set nor within the black set
  – subdivide the grid such that each processor has some red and some black points (roughly the same number)
  – the result: two necessarily sequential steps (red and black), but perfect parallelism within each of them
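The red-black ordering can be sketched in Python: points with even i + j form one colour, odd i + j the other. Every red update reads only black values and vice versa, so each half-sweep is perfectly parallel; only the two half-sweeps are sequential. Same model problem as above (f = 0, boundary values 1, exact solution u ≡ 1); the parity choice and step count are illustrative:

```python
# Red-black Gauss-Seidel sweep for the discrete Poisson equation.

N = 8
h = 1.0 / N
f = [[0.0] * (N + 1) for _ in range(N + 1)]
u = [[0.0] * (N + 1) for _ in range(N + 1)]
for k in range(N + 1):
    u[k][0] = u[k][N] = u[0][k] = u[N][k] = 1.0

for it in range(100):
    for colour in (0, 1):                   # red half-sweep, then black
        for i in range(1, N):
            for j in range(1, N):
                if (i + j) % 2 == colour:   # updates within one colour are
                    u[i][j] = 0.25 * (u[i][j - 1] + u[i - 1][j]    # independent
                                      + u[i][j + 1] + u[i + 1][j]
                                      - h * h * f[i][j])

print(abs(u[N // 2][N // 2] - 1.0))
```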
6.2. CG Algorithm

Conjugate Gradients

• above method + efficient construction of the conjugate directions
• principle of construction: Gram-Schmidt conjugation of the residuals r^(i)
• no detailed derivation here, just the algorithm:

        repeat for i = 0, 1, 2, ...:
            α_i = (d^(i)T r^(i)) / (d^(i)T A d^(i))
            x^(i+1) = x^(i) + α_i d^(i)
            r^(i+1) = r^(i) − α_i A d^(i)
            β_{i+1} = (r^(i+1)T r^(i+1)) / (r^(i)T r^(i))
            d^(i+1) = r^(i+1) + β_{i+1} d^(i)

• faster than steepest descent, but the number of steps still depends on the number of unknowns!
• the search spaces form a so-called Krylov sequence:

        span{d^(0), ..., d^(i−1)} = span{d^(0), A d^(0), ..., A^(i−1) d^(0)}
                                  = span{r^(0), A r^(0), ..., A^(i−1) r^(0)}

• other famous Krylov methods: GMRES, Bi-CGSTAB
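The algorithm above can be sketched in pure Python. The small test system (a symmetric positive definite tridiagonal matrix with diagonal 4 and off-diagonals −1) is an illustrative choice, not the Poisson matrix from the earlier slides; in exact arithmetic CG reaches the exact solution after at most n steps:

```python
# Conjugate gradients for a symmetric positive definite system A x = b,
# following the update formulas above (x^(0) = 0, d^(0) = r^(0) = b).

def cg(A, b, tol=1e-12):
    n = len(b)
    x = [0.0] * n
    r = b[:]                                  # r^(0) = b - A x^(0) = b
    d = r[:]                                  # d^(0) = r^(0)
    rr = sum(ri * ri for ri in r)
    for _ in range(n):                        # exact after at most n steps
        if rr <= tol * tol:
            break
        Ad = [sum(A[i][j] * d[j] for j in range(n)) for i in range(n)]
        alpha = (sum(di * ri for di, ri in zip(d, r))
                 / sum(di * adi for di, adi in zip(d, Ad)))
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * adi for ri, adi in zip(r, Ad)]
        rr_new = sum(ri * ri for ri in r)
        beta = rr_new / rr                    # Gram-Schmidt conjugation of r's
        d = [ri + beta * di for ri, di in zip(r, d)]
        rr = rr_new
    return x

n = 5
A = [[4.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0)
      for j in range(n)] for i in range(n)]
b = [1.0] * n
x = cg(A, b)
residual = max(abs(b[i] - sum(A[i][j] * x[j] for j in range(n)))
               for i in range(n))
print(residual)   # essentially zero
```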
6.3. Other Algorithms

• just some name dropping, due to lack of time
• graph partitioning:
  – take a graph and try to define P subsets of points such that the number of connections (edges) between the subsets becomes as small as possible
  – example 1: an arbitrary sparse matrix; the unknowns are the points of the graph, non-zero matrix entries are the edges; how to parallelize an iterative algorithm?
  – example 2 (and closely related): a finite element mesh; grid points are the points of the graph, neighbourship relations are the edges; how to define subdomains in an optimal way?
• domain decomposition methods
• ...