Chris Ding and Yun (Helen) He Lawrence Berkeley National Laboratory

21
10/10/2002 Yun (Helen) He, GHC2002 1 Effective Methods in Reducing Communication Overheads in Solving PDE Problems on Distributed- Memory Computer Architectures Chris Ding and Yun (Helen) He Lawrence Berkeley National Laboratory

description

Effective Methods in Reducing Communication Overheads in Solving PDE Problems on Distributed-Memory Computer Architectures. Chris Ding and Yun (Helen) He Lawrence Berkeley National Laboratory. Outline. Introduction Traditional Method - PowerPoint PPT Presentation

Transcript of Chris Ding and Yun (Helen) He Lawrence Berkeley National Laboratory

Page 1: Chris Ding and Yun (Helen) He Lawrence Berkeley National Laboratory

10/10/2002 Yun (Helen) He, GHC2002 1

Effective Methods in Reducing Communication Overheads in Solving PDE Problems on Distributed-Memory Computer Architectures

Chris Ding and Yun (Helen) HeLawrence Berkeley National Laboratory

Page 2: Chris Ding and Yun (Helen) He Lawrence Berkeley National Laboratory

10/10/2002 Yun (Helen) He, GHC2002 2

Outline Introduction

Traditional Method Ghost Cell Expansion (GCE) Method GCE Algorithm Diagonal Communication Elimination (DCE) Technique

Analysis of GCE Method Message Volume Communication Time Memory Usage and Computational Cost

Performance Test Problem Test Results

Conclusions

Page 3: Chris Ding and Yun (Helen) He Lawrence Berkeley National Laboratory

10/10/2002 Yun (Helen) He, GHC2002 3

Background On a distributed system, each processor holds a problem

subdomain, which includes one or several boundary layers (usually called ghost cells).

Ghost cells contain the most recent values of the neighboring processors. They must be updated at each time step. And it is done so in traditional approach. The message sizes exchanged between processors are usually very small.

To speedup the communication, one idea is to combine messages for different time steps and exchange the bigger messages less frequently to reduce the communication latency.

Page 4: Chris Ding and Yun (Helen) He Lawrence Berkeley National Laboratory

10/10/2002 Yun (Helen) He, GHC2002 4

Traditional Field Update

L = 2 L = 2 ghost | active | ghostghost | active | ghostt: O O | O O O O | O O t: O O | O O O O | O O t+1: | x x x x | t+1: | x x x x |

L = 1 L = 1 ghost | active | ghostghost | active | ghost t: O | O O O O | O t: O | O O O O | O t+1: | x x x x | t+1: | x x x x |

Where L is the number of layers required Where L is the number of layers required depends on depends on the order of accuracy of numerical discretizationthe order of accuracy of numerical discretization

Page 5: Chris Ding and Yun (Helen) He Lawrence Berkeley National Laboratory

10/10/2002 Yun (Helen) He, GHC2002 5

Ghost Cell Expansion (GCE) Method

L = 2 , e= 3 L = 2 , e= 3 ghost | active | ghostghost | active | ghost t: O O O O O | O O O O | O O O O Ot: O O O O O | O O O O | O O O O O t+1: S x x x | x x x x | x x x St+1: S x x x | x x x x | x x x S t+2: S x x | x x x x | x x St+2: S x x | x x x x | x x S t+3: S x | x x x x | x St+3: S x | x x x x | x S t+4: | x x x x | t+4: | x x x x |

L = 1 , e= 3 L = 1 , e= 3 ghost | active | ghostghost | active | ghost t: O O O O | O O O O | O O O O t: O O O O | O O O O | O O O O t+1: x x x | x x x x | x x x t+1: x x x | x x x x | x x x t+2: x x | x x x x | x x t+2: x x | x x x x | x x t+3: x | x x x x | x t+3: x | x x x x | x t+4: | x x x x | t+4: | x x x x |

where e is the where e is the expansion levelexpansion level

Page 6: Chris Ding and Yun (Helen) He Lawrence Berkeley National Laboratory

10/10/2002 Yun (Helen) He, GHC2002 6

Algorithm for GCE Method do istep = 1, total_stepsdo istep = 1, total_steps j = mod (istep-1,e+1)j = mod (istep-1,e+1) if (j == 0) update ghost_cellsif (j == 0) update ghost_cells y_start = 1 - e + jy_start = 1 - e + j y_end = ny + e - jy_end = ny + e - j x_start = 1 - e + jx_start = 1 - e + j x_end = nx + e – j x_end = nx + e – j if subdomain touches real boundariesif subdomain touches real boundaries set y_start, y_end to 1 or ny set y_start, y_end to 1 or ny set x_start, x_end to 1 or nxset x_start, x_end to 1 or nx endifendif do iy = y_start, y_enddo iy = y_start, y_end do ix = x_start, x_enddo ix = x_start, x_end update field (ix, iy)update field (ix, iy) enddoenddo enddoenddo enddoenddo

Page 7: Chris Ding and Yun (Helen) He Lawrence Berkeley National Laboratory

10/10/2002 Yun (Helen) He, GHC2002 7

Diagonal Communication Elimination (DCE) Technique

A regular grid in 2D decomposition has 4 immediate neighbors (left and right), and 4 second nearest neighbors (diagonal).

A regular grid in 3D decomposition has 6 immediate neighbors (horizontal and vertical), 12 second nearest neighbors (planar corner), and 8 third nearest neighbors (cubic corner).

Page 8: Chris Ding and Yun (Helen) He Lawrence Berkeley National Laboratory

10/10/2002 Yun (Helen) He, GHC2002 8

Diagonal Communication Elimination (DCE) Technique (cont’d)

55

44

11

33

66

77 22 88

Send blocks 5, 7 and 3 right;Send blocks 5, 7 and 3 right;Send blocks 6, 8 and 4 left;Send blocks 6, 8 and 4 left;Processor owns block 1 also has Processor owns block 1 also has blocks 5 and 6;blocks 5 and 6;Processors owns block 2 also has Processors owns block 2 also has blocks 7 and 8;blocks 7 and 8;Send blocks 5, 6 and 1 down;Send blocks 5, 6 and 1 down;Send blocks 7, 8 and 2 up;Send blocks 7, 8 and 2 up;

Active domain updated correctly!Active domain updated correctly!

Page 9: Chris Ding and Yun (Helen) He Lawrence Berkeley National Laboratory

10/10/2002 Yun (Helen) He, GHC2002 9

Message VolumeTraditional:Traditional:

GCE Method:

Ratio:Ratio:

L = 2, e = 4 L = 2, e = 4 ratio = 3/5 ratio = 3/5

)1(

eL

eLVV

old

new

)2(2 LNyNxLV old

)1/()22)(22()( eeLNyNxeLeV new

Page 10: Chris Ding and Yun (Helen) He Lawrence Berkeley National Laboratory

10/10/2002 Yun (Helen) He, GHC2002 10

Communication Time Traditional:Traditional:

GCE Method:

Lold TBLNyNxLT 4/8)2(2

)1/()4/8)22)(22[()( eTBeLNyNxeLeT Lnew

L=2, e=8, Nx = Ny = 800:L=2, e=8, Nx = Ny = 800:

IBM SP: B=133 MB/sec, TIBM SP: B=133 MB/sec, TLL=26 =26 secsec ratio = 2.15 ratio = 2.15

Cray T3E: B=300 MB/sec, TCray T3E: B=300 MB/sec, TLL=17 =17 secsec ratio = 2.31 ratio = 2.31

Page 11: Chris Ding and Yun (Helen) He Lawrence Berkeley National Laboratory

10/10/2002 Yun (Helen) He, GHC2002 11

Theoretical Speedup

Page 12: Chris Ding and Yun (Helen) He Lawrence Berkeley National Laboratory

10/10/2002 Yun (Helen) He, GHC2002 12

Memory Usage and Computational Cost

Decomposition M/M C/C

1D 2e/Nx e/2Nx

2D 2e/Nx+2e/Ny e/2Nx+e/2Ny

3D 2e/Nx+2e/Ny+2e/Nz e/2Nx+e/2Ny+e/2Nz

e = 4, Nx = Ny = 800 e = 4, Nx = Ny = 800 M/M = 2%, M/M = 2%, C/C = 0.5% C/C = 0.5%

Page 13: Chris Ding and Yun (Helen) He Lawrence Berkeley National Laboratory

10/10/2002 Yun (Helen) He, GHC2002 13

Test Problem 2D Laplacian Equation with Dirichlet boundary conditions:

02

2

2

2

yu

xu

With 2nd-order accuracy, using 5-point stencils:

With 4th-order accuracy, using 9-point stencils:

)]1,()1,(),1(),1([41),( yxuyxuyxuyxuyxu

)]2,()2,(),2(),2(

)1,(16)1,(16),1(16),1(16[601),(

yxuyxuyxuyxu

yxuyxuyxuyxuyxu

Page 14: Chris Ding and Yun (Helen) He Lawrence Berkeley National Laboratory

10/10/2002 Yun (Helen) He, GHC2002 14

Performance Analysis

Test 1: Global size 3200x3200, P=16, L=1 Test 2: Global size 3200x3200, P=16, L=2 Test 3: Global size 6400x6400, P=64, L=1 Test 4: Global size 6400x6400, P=64, L=2

Local domain size is always 800x800Local domain size is always 800x800

Page 15: Chris Ding and Yun (Helen) He Lawrence Berkeley National Laboratory

10/10/2002 Yun (Helen) He, GHC2002 15

Performance on IBM SP

Page 16: Chris Ding and Yun (Helen) He Lawrence Berkeley National Laboratory

10/10/2002 Yun (Helen) He, GHC2002 16

Performance on IBM SP (cont’d)

Page 17: Chris Ding and Yun (Helen) He Lawrence Berkeley National Laboratory

10/10/2002 Yun (Helen) He, GHC2002 17

Performance on IBM SP (cont’d)

Page 18: Chris Ding and Yun (Helen) He Lawrence Berkeley National Laboratory

10/10/2002 Yun (Helen) He, GHC2002 18

Performance on Cray T3E

Page 19: Chris Ding and Yun (Helen) He Lawrence Berkeley National Laboratory

10/10/2002 Yun (Helen) He, GHC2002 19

Performance on Cray T3E (cont’d)

Page 20: Chris Ding and Yun (Helen) He Lawrence Berkeley National Laboratory

10/10/2002 Yun (Helen) He, GHC2002 20

Performance on Cray T3E (cont’d)

Page 21: Chris Ding and Yun (Helen) He Lawrence Berkeley National Laboratory

10/10/2002 Yun (Helen) He, GHC2002 21

Conclusions With the new ghost cell expansion method, the frequency

to update processor subdomain boundaries is reduced from every time step to once per e+1 time steps.

Both the total message volume and total number of messages are reduced, thus the total communication time. The overhead of memory usage and computational cost are minimal.

Diagonal Communication Elimination technique reduces the total number of messages exchanged between processors from 8 to 4 in 2D domain decomposition and 26 to 6 in 3D domain decomposition.

The systematic experiments on IBM SP and Cray T3E both have communication speedup up to 170%.