Transcript of:
Parallel Multistage Preconditioners by Hierarchical Interface Decomposition on the “T2K Open Super Computer (Todai Combined Cluster)” with Hybrid Parallel Programming Models
Kengo Nakajima
Information Technology Center, The University of Tokyo
JAMSTEC (Japan Agency for Marine-Earth Science & Technology)
CREST/JST (Japan Science & Technology Agency)
Work-in-Progress Presentation (WP-7), IEEE Cluster 2008
September 30th, 2008, Tsukuba, Japan
Cluster2008 2
Motivations
• Parallel Preconditioners for Ill-Conditioned Problems
• T2K Open Super Computer
• Hybrid vs. Flat MPI
Cluster2008 3
• Simple problems converge easily with simple preconditioners (e.g. Localized Block Jacobi) and excellent parallel efficiency.
• Difficult (ill-conditioned) problems do not converge easily.
  – The effect of domain decomposition on convergence is significant.
  – More domains mean more iterations for Localized Block Jacobi.
Technical Issues of “Parallel” Preconditioners for Iterative Solvers
Cluster2008 4
Technical Issues of “Parallel” Preconditioners for Iterative Solvers
[Figure: 3D solid mechanics example with heterogeneous Young's modulus E (regions with E=10^0 and E=10^3)]
• If domain boundaries lie on “harder” elements, convergence is very bad.
Cluster2008 5
• Simple problems converge easily with simple preconditioners (e.g. Localized Block Jacobi) and excellent parallel efficiency.
• Difficult (ill-conditioned) problems do not converge easily.
  – The effect of domain decomposition on convergence is significant.
  – More domains mean more iterations for Localized Block Jacobi.
• Remedies
  – Deep fill-ins, deep overlapping
  – HID: Hierarchical Interface Decomposition [Henon & Saad 2007]
Technical Issues of “Parallel” Preconditioners for Iterative Solvers
Cluster2008 6
T2K/Tokyo
• “T2K Open Super Computer (Todai Combined Cluster)”
  – by Hitachi
  – Total 952 nodes (15,232 cores), 141 TFLOPS peak
  – 16th in TOP500 (June 2008), fastest in Japan
• Each node = 4x quad-core AMD Opteron (Barcelona), 2.3 GHz
• Up to 32 nodes (512 cores) are used in this work.
[Figure: T2K/Tokyo node architecture with four sockets, each containing four cores with private L1/L2 caches, a shared L3 cache per socket, and locally attached memory]
Cluster2008 7
Flat MPI vs. Hybrid
• Hybrid: hierarchical structure
• Flat MPI: each PE (core) is an independent MPI process
[Figure: assignment of cores to memory under the Hybrid and Flat MPI programming models]
Cluster2008 8
• HID for Ill-Conditioned Problems on T2K/Tokyo
  – Parallel FEM Applications
• Hybrid vs. Flat MPI Parallel Programming Models
• Effect of NUMA Control
Goal of this Study
Cluster2008 9
Outline
• HID (Hierarchical Interface Decomposition)
• Hybrid Parallel Programming Model
  – Reordering
• Preliminary Results
• Summary & Future Works
Cluster2008 10
• Multilevel Domain Decomposition
  – Extension of Nested Dissection
• Non-overlapping Approach: Connectors, Separators
• Suitable for Parallel Preconditioning Methods (a schematic block form is sketched after the figure below)
HID: Hierarchical Interface Decomposition [Henon & Saad 2007]
[Figure: HID example with four domains; interior vertices (level-1) belong to a single domain (0, 1, 2, 3), vertices shared by two domains form level-2 connectors (0,1 / 2,3 / 0,2 / 1,3), and vertices shared by all four domains form the level-4 connector (0,1,2,3)]
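As a rough illustration of why this decomposition helps (my own sketch, not from the slides; the block symbols are assumed notation), consider just two subdomains whose interior unknowns are ordered first and whose shared connector is ordered last. The matrix then takes the block-arrow form

$$
A \;=\;
\begin{pmatrix}
A_{11} & 0      & A_{1C} \\
0      & A_{22} & A_{2C} \\
A_{C1} & A_{C2} & A_{CC}
\end{pmatrix},
$$

so the interior blocks $A_{11}$ and $A_{22}$ can be factorized independently on their own processes, and only the much smaller connector block couples them. HID generalizes this to a hierarchy of connectors shared by 2, 4, ... domains.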
Cluster2008 11
Parallel ILU for each Connector at each LEVEL
• The unknowns are reordered according to their level numbers, from lowest to highest (a renumbering sketch follows the figure below).
• The block structure of the reordered matrix leads to natural parallelism when ILU/IC decompositions or forward/backward substitutions are applied.
[Figure: block structure of the HID-reordered matrix; level-1 diagonal blocks for domains 0, 1, 2, 3, level-2 connector blocks 0,1 / 0,2 / 2,3 / 1,3, and the level-4 connector block 0,1,2,3]
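A minimal sketch of the level-wise renumbering (my own illustration, not the presenter's code; it assumes each vertex i already carries its HID level in LEVELid(i)):

      subroutine hid_renumber (N, LEVELid, OLDtoNEW, NEWtoOLD)
      implicit none
      integer, intent(in)  :: N
      integer, intent(in)  :: LEVELid (N)
      integer, intent(out) :: OLDtoNEW(N), NEWtoOLD(N)
      integer :: i, lev, icou, maxlev

      maxlev= maxval(LEVELid)
      icou  = 0
!     interior vertices (level-1) come first,
!     higher-level connectors are appended afterwards
      do lev= 1, maxlev
        do i= 1, N
          if (LEVELid(i).eq.lev) then
            icou           = icou + 1
            OLDtoNEW(i)    = icou
            NEWtoOLD(icou) = i
          endif
        enddo
      enddo
      end subroutine hid_renumber

With this numbering, unknowns of the same level occupy a contiguous index range, which is what the level loop of the forward-substitution kernel shown later relies on.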
Cluster2008 12
• HID (Hierarchical Interface Decomposition)
• Hybrid Parallel Programming Model
  – Reordering
• Preliminary Results
• Summary & Future Works
Cluster2008 13
• DAXPY, SMVP, Dot Products
  – Easy (a minimal OpenMP sketch follows below)
• Factorization, Forward/Backward Substitutions in Preconditioning Processes
  – Global dependency
  – Reordering for parallelism is required: forming independent sets
  – Multicolor Ordering (MC), Reverse Cuthill-McKee (RCM)
  – Works on the “Earth Simulator” [KN 2002, 2003], both for parallel and vector performance
Parallel Preconditioned Iterative Solvers on an SMP/Multicore Node by OpenMP
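For the “easy” kernels a single OpenMP directive is enough because the rows are independent. A minimal sketch of a sparse matrix-vector product in CRS format (my own illustration; the names INDEX, ITEM, A, X, Y are assumptions, not the presenter's data structures):

      subroutine smvp_crs (N, NPL, INDEX, ITEM, A, X, Y)
      implicit none
      integer, intent(in)  :: N, NPL
      integer, intent(in)  :: INDEX(0:N), ITEM(NPL)
      real(kind=8), intent(in)  :: A(NPL), X(N)
      real(kind=8), intent(out) :: Y(N)
      integer :: i, j
      real(kind=8) :: s

!     row i holds nonzeros INDEX(i-1)+1 .. INDEX(i); rows are independent
!$omp parallel do private(i,j,s)
      do i= 1, N
        s= 0.d0
        do j= INDEX(i-1)+1, INDEX(i)
          s= s + A(j)*X(ITEM(j))
        enddo
        Y(i)= s
      enddo
!$omp end parallel do
      end subroutine smvp_crs

Factorization and forward/backward substitution, in contrast, carry dependencies between rows, which is why the reordering discussed next is needed before the same directive can be applied.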
Cluster2008 14
• Multicoloring (MC)
  – Efficient
  – NOT robust for ill-conditioned problems
• RCM (Reverse Cuthill-McKee)
  – Robust
  – Load balancing is bad for OpenMP on an SMP/multicore node
• CM-RCM (see the sketch after the figure below)
  – Robust
  – Efficient
  – Excellent load balancing
Cyclic Multicoloring + RCM (CM-RCM)
[Figure: iterations vs. number of colors (1 to 1000) for MC (▲) and CM-RCM (●)]
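A minimal sketch of the CM-RCM idea (my own illustration, not the presenter's code). It assumes an RCM ordering has already been computed and that RCMlevel(i) holds the RCM level (hyperplane) of each unknown; the levels are then assigned to a prescribed number of colors cyclically, which evens out the number of unknowns per color:

      subroutine cm_rcm_color (N, NCOLORtot, RCMlevel, COLORid)
      implicit none
      integer, intent(in)  :: N, NCOLORtot
      integer, intent(in)  :: RCMlevel(N)
      integer, intent(out) :: COLORid (N)
      integer :: i

!     color c collects RCM levels c, c+NCOLORtot, c+2*NCOLORtot, ...
!     NCOLORtot must be chosen large enough that unknowns sharing a
!     color remain independent
      do i= 1, N
        COLORid(i)= mod(RCMlevel(i)-1, NCOLORtot) + 1
      enddo
      end subroutine cm_rcm_color

This keeps the robustness of RCM while giving each color a similar number of unknowns, hence the good OpenMP load balance.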
Cluster2008 15
Ordering Methods
[Figure: an example grid numbered/colored by RCM, MC (Color#=4), and CM-RCM (Color#=4)]
Cluster2008 16
• CM-RCM with 10 colors is used.
• Ordering is applied to the vertices of each “level” on each “domain”.
• If level > 1, the number of vertices is relatively small.
  – The number of colors in CM-RCM is controlled so that each color contains more than 10 vertices (a sketch of this rule follows below).
Ordering Method: In this Work
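A minimal sketch of that control rule (my own reading of the slide, not the presenter's code, simplified to use the average count per color; the function and argument names are assumptions):

      integer function choose_color_count (NVERT, NCOLORtarget, NVERTmin)
!     NVERT        : number of vertices in this level on this domain
!     NCOLORtarget : desired number of colors (10 in this work)
!     NVERTmin     : minimum number of vertices per color (10 in this work)
      implicit none
      integer, intent(in) :: NVERT, NCOLORtarget, NVERTmin
      integer :: nc

      nc= NCOLORtarget
!     reduce the color count while an average color would be too small
      do while (nc.gt.1 .and. NVERT/nc.le.NVERTmin)
        nc= nc - 1
      enddo
      choose_color_count= nc
      end function choose_color_count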
Cluster2008 17
Communications at Each Level: Forward Substitutions
(loop structure: LEVEL, then COLOR, then THREAD; an additional communication is issued at the end of each level)
do lev= 1, LEVELtot                                   ! HID levels, lowest first
  do ic= 1, COLORtot(lev)                             ! colors within this level
!$omp parallel do private(ip,i,SW1,SW2,SW3,isL,ieL,j,k,X1,X2,X3)
    do ip= 1, PEsmpTOT                                ! OpenMP threads
      do i= STACKmc(ip-1,ic,lev)+1, STACKmc(ip,ic,lev)
        SW1= WW(3*i-2,R); SW2= WW(3*i-1,R); SW3= WW(3*i  ,R)
        isL= INL(i-1)+1;  ieL= INL(i)
        do j= isL, ieL                                ! lower-triangular part, 3x3 blocks
          k= IAL(j)
          X1= WW(3*k-2,R); X2= WW(3*k-1,R); X3= WW(3*k  ,R)
          SW1= SW1 - AL(9*j-8)*X1 - AL(9*j-7)*X2 - AL(9*j-6)*X3
          SW2= SW2 - AL(9*j-5)*X1 - AL(9*j-4)*X2 - AL(9*j-3)*X3
          SW3= SW3 - AL(9*j-2)*X1 - AL(9*j-1)*X2 - AL(9*j  )*X3
        enddo
        X1= SW1; X2= SW2; X3= SW3                     ! solve with the 3x3 diagonal block
        X2= X2 - ALU(9*i-5)*X1
        X3= X3 - ALU(9*i-2)*X1 - ALU(9*i-1)*X2
        X3= ALU(9*i  )* X3
        X2= ALU(9*i-4)*( X2 - ALU(9*i-3)*X3 )
        X1= ALU(9*i-8)*( X1 - ALU(9*i-6)*X3 - ALU(9*i-7)*X2 )
        WW(3*i-2,R)= X1; WW(3*i-1,R)= X2; WW(3*i  ,R)= X3
      enddo
    enddo
!$omp end parallel do
  enddo

  call SOLVER_SEND_RECV_3_LEV (lev, …)                ! communications using hierarchical comm. tables
enddo
Cluster2008 18
• HID (Hierarchical Interface Decomposition)
• Hybrid Parallel Programming Model
  – Reordering
• Preliminary Results
• Summary & Future Works
Cluster2008 19
• 3D elastic problems with heterogeneous material properties
  – Emax = 10^3, Emin = 10^-3, ν = 0.25
  – Distribution generated by the “sequential Gauss” algorithm for geo-statistics [Deutsch & Journel, 1998]
  – 128^3 tri-linear hexahedral elements, 6,291,456 DOF
• (SGS+GPBiCG) iterative solvers
  – Symmetric Gauss-Seidel preconditioning (see the note below)
  – Original Localized Block Jacobi vs. HID
• T2K/Tokyo
  – Up to 512 cores (32 nodes)
• FORTRAN90 (Hitachi) + MPI
  – Flat MPI, Hybrid (4x4, 8x2)
• Effect of NUMA Control
Target Application
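For reference (a standard textbook definition, not stated on the slides): with the splitting $A = L + D + U$ into strictly lower, diagonal, and strictly upper parts, the Symmetric Gauss-Seidel (SGS) preconditioner is

$$
M_{\mathrm{SGS}} = (D + L)\, D^{-1}\, (D + U),
$$

and the forward-substitution kernel shown earlier applies it block-wise, with one 3x3 diagonal block per mesh node (three displacement components).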
Cluster2008 20
Flat MPI, Hybrid (4x4, 8x2)
[Figure: placement of MPI processes and OpenMP threads on the four sockets (0-3) of a node for Flat MPI, Hybrid 4x4, and Hybrid 8x2]
Cluster2008 21
Effect of NUMA Policy
Policy ID   Command line switches
0           no command line switches
1           --cpunodebind=$SOCKET --interleave=all
2           --cpunodebind=$SOCKET --interleave=$SOCKET
3           --cpunodebind=$SOCKET --membind=$SOCKET
4           --cpunodebind=$SOCKET --localalloc
5           --localalloc

[Figure: relative performance vs. NUMA policy (0-5) for Flat MPI, Hybrid 4x4, and Hybrid 8x2; 32 nodes, 512 cores, 12.6K DOF/core]
Cluster2008 22
Effect of NUMA Policy: Best Cases
[Figure: relative performance of the best NUMA policy for each programming model, Flat MPI (policy 3), Hybrid 4x4 (policy 3), and Hybrid 8x2 (policy 2), for 8 nodes (128 cores, 50.3K DOF/core) and 32 nodes (512 cores, 12.6K DOF/core)]
Cluster2008 23
Relative Performance: HID vs. Block Jacobi (normalized by Block Jacobi)
[Figure: relative performance of HID over Localized Block Jacobi vs. core count (32 to 512 cores), measured both in iterations and in elapsed time (sec.), for Flat MPI, Hybrid 4x4, and Hybrid 8x2]
Cluster2008 24
Scalability: HID vs. Block Jacobi
[Figure: speed-up vs. core count (up to 512 cores) for HID and for Localized Block Jacobi, with Flat MPI, Hybrid 4x4, Hybrid 8x2, and the ideal speed-up]
Cluster2008 25
• HID (Hierarchical Interface Decomposition)
• Hybrid Parallel Programming Model
  – Reordering
• Preliminary Results
• Summary & Future Works
Cluster2008 26
• HID for Ill-Conditioned Problems on T2K/Tokyo
  – Hybrid/Flat MPI parallel programming models
  – CM-RCM reordering
  – Comparison with Localized Block Jacobi
• Generally speaking, HID is better than Localized Block Jacobi.
  – More robust, more efficient
• Hybrid 4x4 and Flat MPI are competitive.
  – The effect of the NUMA policy is significant (if the problem size per core is large).
• Future Works
  – Effect of higher-order fill-ins for more robust convergence
  – Nested dissection ordering within each node for the hybrid programming model
Summary & Future Works
Cluster2008 27