Transcript of:
Parallel Multistage Preconditioners by Hierarchical Interface Decomposition on the “T2K Open Super Computer (Todai Combined Cluster)” with Hybrid Parallel Programming Models
Kengo Nakajima
Information Technology Center, The University of Tokyo
JAMSTEC (Japan Agency for Marine-Earth Science & Technology)
CREST/JST (Japan Science & Technology Agency)
Work-in-Progress Presentation (WP-7), IEEE Cluster 2008
September 30th, 2008, Tsukuba, Japan
Cluster2008 2
Motivations
• Parallel Preconditioners for Ill-Conditioned Problems
• T2K Open Super Computer
• Hybrid vs. Flat MPI
Cluster2008 3
• Simple problems converge easily with simple preconditioners (e.g. Localized Block Jacobi) and excellent parallel efficiency.
• Difficult (ill-conditioned) problems do not converge easily.
  – The effect of domain decomposition on convergence is significant.
  – More domains mean more iterations for Localized Block Jacobi.
Technical Issues of “Parallel” Preconditioners for Iterative Solvers
Cluster2008 4
Technical Issues of “Parallel” Preconditioners for Iterative Solvers
[Figure: 3D solid mechanics example with heterogeneous Young's modulus E (regions with E=10^0 and E=10^3)]
• If domain boundaries lie on “harder” elements, convergence is very bad.
Cluster2008 5
• Simple problems converge easily with simple preconditioners (e.g. Localized Block Jacobi) and excellent parallel efficiency.
• Difficult (ill-conditioned) problems do not converge easily.
  – The effect of domain decomposition on convergence is significant.
  – More domains mean more iterations for Localized Block Jacobi.
• Remedies
  – Deep fill-ins, deep overlapping
  – HID: Hierarchical Interface Decomposition [Henon & Saad 2007]
Technical Issues of “Parallel” Preconditioners for Iterative Solvers
Cluster2008 6
T2K/Tokyo
• “T2K Open Super Computer (Todai Combined Cluster)”
  – by Hitachi
  – Total 952 nodes (15,232 cores), 141 TFLOPS peak
  – 16th in TOP500 (June 2008), fastest in Japan
• Each node = 4x quad-core AMD Opteron (Barcelona), 2.3 GHz
• Up to 32 nodes (512 cores) are used in this work.
[Figure: T2K/Tokyo node architecture with four sockets, each containing four cores with private L1/L2 caches, a shared L3 cache per socket, and locally attached memory]
Cluster2008 7
Flat MPI vs. Hybrid
• Hybrid: hierarchical structure
• Flat MPI: each PE (core) is an independent MPI process
[Figure: assignment of cores to memory under the Hybrid and Flat MPI programming models]
Cluster2008 8
• HID for Ill-Conditioned Problems on T2K/Tokyo
  – Parallel FEM Applications
• Hybrid vs. Flat MPI Parallel Programming Models
• Effect of NUMA Control
Goal of this Study
Cluster2008 9
Outline
• HID (Hierarchical Interface Decomposition)
• Hybrid Parallel Programming Model
  – Reordering
• Preliminary Results
• Summary & Future Works
Cluster2008 10
• Multilevel Domain Decomposition
  – Extension of Nested Dissection
• Non-overlapping Approach: Connectors, Separators
• Suitable for Parallel Preconditioning Methods (a schematic block form is sketched after the figure below)
HID: Hierarchical Interface Decomposition [Henon & Saad 2007]
[Figure: HID example with four domains; interior vertices (level-1) belong to a single domain (0, 1, 2, 3), vertices shared by two domains form level-2 connectors (0,1 / 2,3 / 0,2 / 1,3), and vertices shared by all four domains form the level-4 connector (0,1,2,3)]
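As a rough illustration of why this decomposition helps (my own sketch, not from the slides; the block symbols are assumed notation), consider just two subdomains whose interior unknowns are ordered first and whose shared connector is ordered last. The matrix then takes the block-arrow form

$$
A \;=\;
\begin{pmatrix}
A_{11} & 0      & A_{1C} \\
0      & A_{22} & A_{2C} \\
A_{C1} & A_{C2} & A_{CC}
\end{pmatrix},
$$

so the interior blocks $A_{11}$ and $A_{22}$ can be factorized independently on their own processes, and only the much smaller connector block couples them. HID generalizes this to a hierarchy of connectors shared by 2, 4, ... domains.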
Cluster2008 11
Parallel ILU for each Connector at each LEVEL
• The unknowns are reordered according to their level numbers, from lowest to highest (a renumbering sketch follows the figure below).
• The block structure of the reordered matrix leads to natural parallelism when ILU/IC decompositions or forward/backward substitutions are applied.
[Figure: block structure of the HID-reordered matrix; level-1 diagonal blocks for domains 0, 1, 2, 3, level-2 connector blocks 0,1 / 0,2 / 2,3 / 1,3, and the level-4 connector block 0,1,2,3]
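A minimal sketch of the level-wise renumbering (my own illustration, not the presenter's code; it assumes each vertex i already carries its HID level in LEVELid(i)):

      subroutine hid_renumber (N, LEVELid, OLDtoNEW, NEWtoOLD)
      implicit none
      integer, intent(in)  :: N
      integer, intent(in)  :: LEVELid (N)
      integer, intent(out) :: OLDtoNEW(N), NEWtoOLD(N)
      integer :: i, lev, icou, maxlev

      maxlev= maxval(LEVELid)
      icou  = 0
!     interior vertices (level-1) come first,
!     higher-level connectors are appended afterwards
      do lev= 1, maxlev
        do i= 1, N
          if (LEVELid(i).eq.lev) then
            icou           = icou + 1
            OLDtoNEW(i)    = icou
            NEWtoOLD(icou) = i
          endif
        enddo
      enddo
      end subroutine hid_renumber

With this numbering, unknowns of the same level occupy a contiguous index range, which is what the level loop of the forward-substitution kernel shown later relies on.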
Cluster2008 12
• HID (Hierarchical Interface Decomposition)
• Hybrid Parallel Programming Model
  – Reordering
• Preliminary Results
• Summary & Future Works
Cluster2008 13
• DAXPY, SMVP, Dot Products
  – Easy (a minimal OpenMP sketch follows below)
• Factorization, Forward/Backward Substitutions in Preconditioning Processes
  – Global dependency
  – Reordering for parallelism is required: forming independent sets
  – Multicolor Ordering (MC), Reverse Cuthill-McKee (RCM)
  – Works on the “Earth Simulator” [KN 2002, 2003], both for parallel and vector performance
Parallel Preconditioned Iterative Solvers on an SMP/Multicore Node by OpenMP
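For the “easy” kernels a single OpenMP directive is enough because the rows are independent. A minimal sketch of a sparse matrix-vector product in CRS format (my own illustration; the names INDEX, ITEM, A, X, Y are assumptions, not the presenter's data structures):

      subroutine smvp_crs (N, NPL, INDEX, ITEM, A, X, Y)
      implicit none
      integer, intent(in)  :: N, NPL
      integer, intent(in)  :: INDEX(0:N), ITEM(NPL)
      real(kind=8), intent(in)  :: A(NPL), X(N)
      real(kind=8), intent(out) :: Y(N)
      integer :: i, j
      real(kind=8) :: s

!     row i holds nonzeros INDEX(i-1)+1 .. INDEX(i); rows are independent
!$omp parallel do private(i,j,s)
      do i= 1, N
        s= 0.d0
        do j= INDEX(i-1)+1, INDEX(i)
          s= s + A(j)*X(ITEM(j))
        enddo
        Y(i)= s
      enddo
!$omp end parallel do
      end subroutine smvp_crs

Factorization and forward/backward substitution, in contrast, carry dependencies between rows, which is why the reordering discussed next is needed before the same directive can be applied.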
Cluster2008 14
• Multicoloring (MC)
  – Efficient
  – NOT robust for ill-conditioned problems
• RCM (Reverse Cuthill-McKee)
  – Robust
  – Load balancing is bad for OpenMP on an SMP/multicore node
• CM-RCM (see the sketch after the figure below)
  – Robust
  – Efficient
  – Excellent load balancing
Cyclic Multicoloring + RCM (CM-RCM)
[Figure: iterations vs. number of colors (1 to 1000) for MC (▲) and CM-RCM (●)]
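A minimal sketch of the CM-RCM idea (my own illustration, not the presenter's code). It assumes an RCM ordering has already been computed and that RCMlevel(i) holds the RCM level (hyperplane) of each unknown; the levels are then assigned to a prescribed number of colors cyclically, which evens out the number of unknowns per color:

      subroutine cm_rcm_color (N, NCOLORtot, RCMlevel, COLORid)
      implicit none
      integer, intent(in)  :: N, NCOLORtot
      integer, intent(in)  :: RCMlevel(N)
      integer, intent(out) :: COLORid (N)
      integer :: i

!     color c collects RCM levels c, c+NCOLORtot, c+2*NCOLORtot, ...
!     NCOLORtot must be chosen large enough that unknowns sharing a
!     color remain independent
      do i= 1, N
        COLORid(i)= mod(RCMlevel(i)-1, NCOLORtot) + 1
      enddo
      end subroutine cm_rcm_color

This keeps the robustness of RCM while giving each color a similar number of unknowns, hence the good OpenMP load balance.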
Cluster2008 15
Ordering Methods
[Figure: an example grid numbered/colored by RCM, MC (Color#=4), and CM-RCM (Color#=4)]
Cluster2008 16
• CM-RCM with 10 colors is used.
• Ordering is applied to the vertices of each “level” on each “domain”.
• If level > 1, the number of vertices is relatively small.
  – The number of colors in CM-RCM is controlled so that each color contains more than 10 vertices (a sketch of this rule follows below).
Ordering Method: In this Work
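A minimal sketch of that control rule (my own reading of the slide, not the presenter's code, simplified to use the average count per color; the function and argument names are assumptions):

      integer function choose_color_count (NVERT, NCOLORtarget, NVERTmin)
!     NVERT        : number of vertices in this level on this domain
!     NCOLORtarget : desired number of colors (10 in this work)
!     NVERTmin     : minimum number of vertices per color (10 in this work)
      implicit none
      integer, intent(in) :: NVERT, NCOLORtarget, NVERTmin
      integer :: nc

      nc= NCOLORtarget
!     reduce the color count while an average color would be too small
      do while (nc.gt.1 .and. NVERT/nc.le.NVERTmin)
        nc= nc - 1
      enddo
      choose_color_count= nc
      end function choose_color_count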
Cluster2008 17
Communications at Each Level: Forward Substitutions
(loop structure: LEVEL, then COLOR, then THREAD; an additional communication is issued at the end of each level)
do lev= 1, LEVELtot                                   ! HID levels, lowest first
  do ic= 1, COLORtot(lev)                             ! colors within this level
!$omp parallel do private(ip,i,SW1,SW2,SW3,isL,ieL,j,k,X1,X2,X3)
    do ip= 1, PEsmpTOT                                ! OpenMP threads
      do i= STACKmc(ip-1,ic,lev)+1, STACKmc(ip,ic,lev)
        SW1= WW(3*i-2,R); SW2= WW(3*i-1,R); SW3= WW(3*i  ,R)
        isL= INL(i-1)+1;  ieL= INL(i)
        do j= isL, ieL                                ! lower-triangular part, 3x3 blocks
          k= IAL(j)
          X1= WW(3*k-2,R); X2= WW(3*k-1,R); X3= WW(3*k  ,R)
          SW1= SW1 - AL(9*j-8)*X1 - AL(9*j-7)*X2 - AL(9*j-6)*X3
          SW2= SW2 - AL(9*j-5)*X1 - AL(9*j-4)*X2 - AL(9*j-3)*X3
          SW3= SW3 - AL(9*j-2)*X1 - AL(9*j-1)*X2 - AL(9*j  )*X3
        enddo
        X1= SW1; X2= SW2; X3= SW3                     ! solve with the 3x3 diagonal block
        X2= X2 - ALU(9*i-5)*X1
        X3= X3 - ALU(9*i-2)*X1 - ALU(9*i-1)*X2
        X3= ALU(9*i  )* X3
        X2= ALU(9*i-4)*( X2 - ALU(9*i-3)*X3 )
        X1= ALU(9*i-8)*( X1 - ALU(9*i-6)*X3 - ALU(9*i-7)*X2 )
        WW(3*i-2,R)= X1; WW(3*i-1,R)= X2; WW(3*i  ,R)= X3
      enddo
    enddo
!$omp end parallel do
  enddo

  call SOLVER_SEND_RECV_3_LEV (lev, …)                ! communications using hierarchical comm. tables
enddo
Cluster2008 18
• HID (Hierarchical Interface Decomposition)
• Hybrid Parallel Programming Model
  – Reordering
• Preliminary Results
• Summary & Future Works
Cluster2008 19
• 3D elastic problems with heterogeneous material properties
  – Emax = 10^3, Emin = 10^-3, ν = 0.25
  – Distribution generated by the “sequential Gauss” algorithm for geo-statistics [Deutsch & Journel, 1998]
  – 128^3 tri-linear hexahedral elements, 6,291,456 DOF
• (SGS+GPBiCG) iterative solvers
  – Symmetric Gauss-Seidel preconditioning (see the note below)
  – Original Localized Block Jacobi vs. HID
• T2K/Tokyo
  – Up to 512 cores (32 nodes)
• FORTRAN90 (Hitachi) + MPI
  – Flat MPI, Hybrid (4x4, 8x2)
• Effect of NUMA Control
Target Application
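For reference (a standard textbook definition, not stated on the slides): with the splitting $A = L + D + U$ into strictly lower, diagonal, and strictly upper parts, the Symmetric Gauss-Seidel (SGS) preconditioner is

$$
M_{\mathrm{SGS}} = (D + L)\, D^{-1}\, (D + U),
$$

and the forward-substitution kernel shown earlier applies it block-wise, with one 3x3 diagonal block per mesh node (three displacement components).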
Cluster2008 20
Flat MPI, Hybrid (4x4, 8x2)
[Figure: placement of MPI processes and OpenMP threads on the four sockets (0-3) of a node for Flat MPI, Hybrid 4x4, and Hybrid 8x2]
Cluster2008 21
Effect of NUMA Policy
Policy ID   Command line switches
0           no command line switches
1           --cpunodebind=$SOCKET --interleave=all
2           --cpunodebind=$SOCKET --interleave=$SOCKET
3           --cpunodebind=$SOCKET --membind=$SOCKET
4           --cpunodebind=$SOCKET --localalloc
5           --localalloc

[Figure: relative performance vs. NUMA policy (0-5) for Flat MPI, Hybrid 4x4, and Hybrid 8x2; 32 nodes, 512 cores, 12.6K DOF/core]
Cluster2008 22
Effect of NUMA Policy: Best Cases
[Figure: relative performance of the best NUMA policy for each programming model, Flat MPI (policy 3), Hybrid 4x4 (policy 3), and Hybrid 8x2 (policy 2), for 8 nodes (128 cores, 50.3K DOF/core) and 32 nodes (512 cores, 12.6K DOF/core)]
Cluster2008 23
Relative Performance: HID vs. Block Jacobi (normalized by Block Jacobi)
[Figure: relative performance of HID over Localized Block Jacobi vs. core count (32 to 512 cores), measured both in iterations and in elapsed time (sec.), for Flat MPI, Hybrid 4x4, and Hybrid 8x2]
Cluster2008 24
Scalability: HID vs. Block Jacobi
[Figure: speed-up vs. core count (up to 512 cores) for HID and for Localized Block Jacobi, with Flat MPI, Hybrid 4x4, Hybrid 8x2, and the ideal speed-up]
Cluster2008 25
• HID (Hierarchical Interface Decomposition)
• Hybrid Parallel Programming Model
  – Reordering
• Preliminary Results
• Summary & Future Works
Cluster2008 26
• HID for Ill-Conditioned Problems on T2K/Tokyo
  – Hybrid/Flat MPI parallel programming models
  – CM-RCM reordering
  – Comparison with Localized Block Jacobi
• Generally speaking, HID is better than Localized Block Jacobi.
  – More robust, more efficient
• Hybrid 4x4 and Flat MPI are competitive.
  – The effect of the NUMA policy is significant (if the problem size per core is large).
• Future Works
  – Effect of higher-order fill-ins for more robust convergence
  – Nested dissection ordering within each node for the hybrid programming model
Summary & Future Works
Cluster2008 27