Towards Auto‐tuning for the Finite Difference Method in Era of 200+ Thread Parallelisms
Takahiro Katagiri, Satoshi Ohshima, Masaharu Matsumoto
(Information Technology Center, The University of Tokyo)
ACSI2015, Tsukuba, Application Session, January 28 (Wed) 9:45‐10:15
Outline
1. Background and ppOpen‐AT Functions
2. Code Optimization Strategy
3. Performance Evaluation
4. Conclusion
Background
• High‐Thread Parallelism (HTP)
  – Multi‐core and many‐core processors are pervasive.
    • Multi-core CPUs: 16‐32 cores, 32‐64 threads with Hyper-Threading (HT) or Simultaneous Multithreading (SMT).
    • Many-core CPU: the Xeon Phi, with 60 cores and 240 threads with HT.
  – Utilizing full‐thread parallelism is important.
• Performance Portability (PP)
  – Keeping high performance across multiple computing environments: not only multiple CPUs, but also multiple compilers. Run‐time information, such as loop lengths and the number of threads, is important.
  – Auto‐tuning (AT) is one candidate technology for establishing PP across multiple computing environments.
ppOpen‐HPC Project
• Middleware for HPC and its AT technology
  – Supported by JST CREST, from FY2011 to FY2015.
  – PI: Professor Kengo Nakajima (U. Tokyo)
• ppOpen‐HPC
  – An open-source infrastructure for reliable simulation codes on post‐peta (pp) scale parallel computers.
  – Consists of various libraries covering five kinds of discretization methods for scientific computation.
• ppOpen‐AT
  – An auto‐tuning language for ppOpen‐HPC codes.
  – Uses knowledge from a previous project: the ABCLibScript Project.
  – A directive-based auto‐tuning language.
Software Architecture of ppOpen‐HPC
[Figure: software stack. User programs call ppOpen‐APPL components (FEM, FDM, FVM, BEM, DEM), ppOpen‐MATH components (MG, GRAPH, VIS, MP), and ppOpen‐SYS components (COMM, FT), running on multi‐core CPUs, many‐core CPUs, and GPUs. The ppOpen‐AT auto‐tuning facility (STATIC and DYNAMIC) generates code for optimization candidates, searches for the best candidate, executes the optimization automatically, and optimizes memory accesses.]
Design Philosophy of ppOpen‐AT
• A directive‐based AT language
  – Multiple regions can be specified.
  – Low-cost description with directives.
  – Does not prevent execution of the original code.
• An AT framework without script languages or OS daemons (static code generation)
  – Simple: only a minimal software stack is required.
  – Useful: our targets are supercomputers in operation, or just after development.
  – Low overhead: we do NOT use a code generator with run‐time feedback.
  – Reasons:
    • Supercomputer centers do not accept OS kernel modifications or daemons on login or compute nodes.
    • The additional load on the nodes must be kept low.
    • Environmental issues, such as batch job scripts, etc.
ppOpen‐AT System
[Figure: workflow of the ppOpen‐AT system.
① Before release time, the library developer annotates ppOpen‐APPL/* with ppOpen‐AT directives, encoding user knowledge.
② The ppOpen‐AT auto‐tuner automatically generates candidate codes (Candidate 1 … Candidate n) for ppOpen‐APPL/*.
③ The library user calls the library on the target computers.
④ The execution times of the candidates are measured.
⑤ The best candidate is selected.
⑥ The auto‐tuned kernel is executed at run time.]
AT Timings of ppOpen‐AT (FIBER Framework)
[Figure: AT before execute-time. In OAT_ATexec(), each target kernel (Target_kernel_k(), Target_kernel_m()) is executed with varying parameters, and the best parameter is stored. This is a one-time execution (repeated only when the problem size or the number of MPI processes changes). In the main iteration loop (do i = 1, MAX_ITER), each target kernel checks "Is this first call?" and, if so, reads its stored best parameter.]
Outline
1. Background and ppOpen‐AT Functions
2. Code Optimization Strategy
3. Performance Evaluation
4. Conclusion
Target Application
• Seism_3D: a simulation for seismic wave analysis.
• Developed by Professor Furumura at the University of Tokyo.
  – The code has been restructured as ppOpen‐APPL/FDM.
• Finite Difference Method (FDM)
• 3D simulation
  – 3D arrays are allocated.
  – Data type: single precision (real*4)
Flow Diagram of ppOpen‐APPL/FDM
[Equations, garbled in extraction: the space difference is computed by FDM; the time expansion is an explicit central difference.]
Initialization
→ Velocity Derivative (def_vel)
→ Velocity Update (update_vel)
→ Velocity PML condition (update_vel_sponge)
→ Velocity Passing (MPI) (passing_vel)
→ Stress Derivative (def_stress)
→ Stress Update (update_stress)
→ Stress PML condition (update_stress_sponge)
→ Stress Passing (MPI) (passing_stress)
→ Stop iteration? NO: repeat from the velocity derivative. YES: End.
Target Loop Characteristics
• Triple‐nested loops:

  !$omp parallel do
  do k = NZ00, NZ01
    do j = NY00, NY01
      do i = NX00, NX01
        <Codes from FDM>
      end do
    end do
  end do
  !$omp end parallel do

• The OpenMP directive is applied to the outer loop (Z‐axis).
• Loop lengths vary according to the problem size, the number of MPI processes, and the number of OpenMP threads.
• The codes are separable by loop split.
What are separable codes?
Variable definitions and references are separated: there is a flow dependency between the definitions (d1 = …, d2 = …, …, dk = …) and the references (… = … d1 …, … = … d2 …, …, … = … dk …), but no other data dependency between the two groups. The body can therefore be split into two parts, one holding the definitions and one holding the references.
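As a minimal illustration (plain Python with hypothetical arrays, not the ppOpen‐APPL code), a separable loop body can be split into a definitions loop and a references loop once the scalar temporary is promoted to an array:

```python
def fused(a, b):
    # original loop: definition of d, then reference of d in the same iteration
    out = []
    for i in range(len(a)):
        d = 2.0 * a[i]          # variable definition
        out.append(d + b[i])    # variable reference (flow dependency on d)
    return out

def split(a, b):
    # the same computation split into two loops; d becomes an array so the
    # flow dependency is carried between the loops
    n = len(a)
    d = [2.0 * a[i] for i in range(n)]      # definitions only
    return [d[i] + b[i] for i in range(n)]  # references only
```

Both versions compute identical results; the split form trades extra memory traffic for shorter, simpler loop bodies.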
ppOpen‐AT Directives: Loop Split & Fusion with Data‐flow Dependence

!oat$ install LoopFusionSplit region start        ← specify loop split and loop fusion
!$omp parallel do private(k,j,i,STMP1,STMP2,STMP3,STMP4,RL,RM,RM2,RMAXY,RMAXZ,RMAYZ,RLTHETA,QG)
DO K = 1, NZ
 DO J = 1, NY
  DO I = 1, NX
   RL = LAM(I,J,K); RM = RIG(I,J,K); RM2 = RM + RM
   RLTHETA = (DXVX(I,J,K)+DYVY(I,J,K)+DZVZ(I,J,K))*RL
!oat$ SplitPointCopyDef region start              ← the re‐calculation is defined
   QG = ABSX(I)*ABSY(J)*ABSZ(K)*Q(I,J,K)
!oat$ SplitPointCopyDef region end
   SXX(I,J,K) = ( SXX(I,J,K) + (RLTHETA + RM2*DXVX(I,J,K))*DT )*QG
   SYY(I,J,K) = ( SYY(I,J,K) + (RLTHETA + RM2*DYVY(I,J,K))*DT )*QG
   SZZ(I,J,K) = ( SZZ(I,J,K) + (RLTHETA + RM2*DZVZ(I,J,K))*DT )*QG
!oat$ SplitPoint (K, J, I)                        ← loop split point
   STMP1 = 1.0/RIG(I,J,K); STMP2 = 1.0/RIG(I+1,J,K); STMP4 = 1.0/RIG(I,J,K+1)
   STMP3 = STMP1 + STMP2
   RMAXY = 4.0/(STMP3 + 1.0/RIG(I,J+1,K) + 1.0/RIG(I+1,J+1,K))
   RMAXZ = 4.0/(STMP3 + STMP4 + 1.0/RIG(I+1,J,K+1))
   RMAYZ = 4.0/(STMP3 + STMP4 + 1.0/RIG(I,J+1,K+1))
!oat$ SplitPointCopyInsert                        ← use of the re‐calculation
   SXY(I,J,K) = ( SXY(I,J,K) + (RMAXY*(DXVY(I,J,K)+DYVX(I,J,K)))*DT )*QG
   SXZ(I,J,K) = ( SXZ(I,J,K) + (RMAXZ*(DXVZ(I,J,K)+DZVX(I,J,K)))*DT )*QG
   SYZ(I,J,K) = ( SYZ(I,J,K) + (RMAYZ*(DYVZ(I,J,K)+DZVY(I,J,K)))*DT )*QG
  END DO
 END DO
END DO
!$omp end parallel do
!oat$ install LoopFusionSplit region end
Re‐ordering of Statements: increases the chance for the compiler to optimize register allocation.

Source:
!OAT$ RotationOrder sub region start
  Sentence i
  Sentence ii
!OAT$ RotationOrder sub region end
!OAT$ RotationOrder sub region start
  Sentence 1
  Sentence 2
!OAT$ RotationOrder sub region end

Generated code:
  Sentence i
  Sentence 1
  Sentence ii
  Sentence 2
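The effect of the transformation can be mimicked in Python (illustrative only; the sentence names follow the slide, everything else is hypothetical). Interleaving the two statement groups shortens the distance between where a temporary is defined and where it is used:

```python
def grouped(a, b):
    # original order: sentences i, ii first, then sentences 1, 2
    x, y = [], []
    for i in range(len(a)):
        r = 1.0 / (a[i] + 1.0)   # sentence i
        s = 1.0 / (b[i] + 1.0)   # sentence ii
        x.append(a[i] * r)       # sentence 1 (uses r)
        y.append(b[i] * s)       # sentence 2 (uses s)
    return x, y

def reordered(a, b):
    # generated order: i, 1, ii, 2 -- each temporary is used right away,
    # so fewer values are live at the same time
    x, y = [], []
    for i in range(len(a)):
        r = 1.0 / (a[i] + 1.0)   # sentence i
        x.append(a[i] * r)       # sentence 1
        s = 1.0 / (b[i] + 1.0)   # sentence ii
        y.append(b[i] * s)       # sentence 2
    return x, y
```

Both orders compute identical results; in compiled code the shorter live ranges give the register allocator more freedom.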
The Kernel 1 (update_stress)
• m_stress.f90 (ppohFDM_update_stress); measured B/F = 3.2

!OAT$ call OAT_BPset("NZ01")
!OAT$ install LoopFusionSplit region start
!OAT$ name ppohFDMupdate_stress
!OAT$ debug (pp)
!$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1,DXVZDZVX1,DYVZDZVY1)
do k = NZ00, NZ01
 do j = NY00, NY01
  do i = NX00, NX01
   RL1 = LAM(I,J,K)
!OAT$ SplitPointCopyDef sub region start
   RM1 = RIG(I,J,K)
!OAT$ SplitPointCopyDef sub region end
   RM2 = RM1 + RM1; RLRM2 = RL1 + RM2
   DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K); DZVZ1 = DZVZ(I,J,K)
   D3V3 = DXVX1 + DYVY1 + DZVZ1
   SXX(I,J,K) = SXX(I,J,K) + ( RLRM2*D3V3 - RM2*(DZVZ1+DYVY1) ) * DT
   SYY(I,J,K) = SYY(I,J,K) + ( RLRM2*D3V3 - RM2*(DXVX1+DZVZ1) ) * DT
   SZZ(I,J,K) = SZZ(I,J,K) + ( RLRM2*D3V3 - RM2*(DXVX1+DYVY1) ) * DT
The Kernel 1 (update_stress), continued

!OAT$ SplitPoint (K,J,I)
!OAT$ SplitPointCopyInsert
   DXVYDYVX1 = DXVY(I,J,K)+DYVX(I,J,K)
   DXVZDZVX1 = DXVZ(I,J,K)+DZVX(I,J,K)
   DYVZDZVY1 = DYVZ(I,J,K)+DZVY(I,J,K)
   SXY(I,J,K) = SXY(I,J,K) + RM1 * DXVYDYVX1 * DT
   SXZ(I,J,K) = SXZ(I,J,K) + RM1 * DXVZDZVX1 * DT
   SYZ(I,J,K) = SYZ(I,J,K) + RM1 * DYVZDZVY1 * DT
  end do
 end do
end do
!$omp end parallel do
!OAT$ install LoopFusionSplit region end
Automatically Generated Codes for Kernel 1 (ppohFDM_update_stress)
#1 [Baseline]: original 3‐nested loop
#2 [Split]: loop split at the K‐loop (separated into two 3‐nested loops)
#3 [Split]: loop split at the J‐loop
#4 [Split]: loop split at the I‐loop
#5 [Split&Fusion]: loop fusion of #1 for the K and J‐loops (2‐nested loop)
#6 [Split&Fusion]: loop fusion of #2 for the K and J‐loops (2‐nested loop)
#7 [Fusion]: loop fusion of #1 (loop collapse)
#8 [Split&Fusion]: loop fusion of #2 (loop collapse, two 1‐nested loops)
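The difference between variant #1 and the collapsed variant #7 can be sketched as follows (Python over a hypothetical flat array; the real kernels update stress arrays rather than summing):

```python
def baseline(v, NX, NY, NZ):
    # variant 1 [Baseline]: triple-nested loop over a flat (K, J, I)-ordered array
    total = 0.0
    for k in range(NZ):
        for j in range(NY):
            for i in range(NX):
                total += v[(k * NY + j) * NX + i]
    return total

def collapsed(v, NX, NY, NZ):
    # variant 7 [Fusion]: the three loops collapsed into one; the single long
    # loop is easier to spread over 200+ threads than a short outer Z-loop
    total = 0.0
    for idx in range(NZ * NY * NX):
        total += v[idx]
    return total
```

Collapsing pays off when the outer loop length per thread drops below one, as happens on the Xeon Phi configurations evaluated later.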
The Kernel 2 (update_vel)
• m_velocity.f90 (ppohFDM_update_vel); measured B/F = 1.7

!OAT$ install LoopFusion region start
!OAT$ name ppohFDMupdate_vel
!OAT$ debug (pp)
!$omp parallel do private(k,j,i,ROX,ROY,ROZ)
do k = NZ00, NZ01
 do j = NY00, NY01
  do i = NX00, NX01
! Effective density
!OAT$ RotationOrder sub region start
   ROX = 2.0_PN/( DEN(I,J,K) + DEN(I+1,J,K) )
   ROY = 2.0_PN/( DEN(I,J,K) + DEN(I,J+1,K) )
   ROZ = 2.0_PN/( DEN(I,J,K) + DEN(I,J,K+1) )
!OAT$ RotationOrder sub region end
!OAT$ RotationOrder sub region start
   VX(I,J,K) = VX(I,J,K) + ( DXSXX(I,J,K)+DYSXY(I,J,K)+DZSXZ(I,J,K) )*ROX*DT
   VY(I,J,K) = VY(I,J,K) + ( DXSXY(I,J,K)+DYSYY(I,J,K)+DZSYZ(I,J,K) )*ROY*DT
   VZ(I,J,K) = VZ(I,J,K) + ( DXSXZ(I,J,K)+DYSYZ(I,J,K)+DZSZZ(I,J,K) )*ROZ*DT
!OAT$ RotationOrder sub region end
  end do
 end do
end do
!$omp end parallel do
!OAT$ install LoopFusion region end
Automatically Generated Codes for Kernel 2 (ppohFDM_update_vel)
• #1 [Baseline]: original 3‐nested loop
• #2 [Fusion]: loop fusion for the K and J‐loops (2‐nested loop)
• #3 [Fusion]: loop fusion for the K, J, and I‐loops (loop collapse)
• #4 [Fusion&Re‐order]: re‐ordering of statements applied to #1
• #5 [Fusion&Re‐order]: re‐ordering of statements applied to #2
• #6 [Fusion&Re‐order]: re‐ordering of statements applied to #3
The Kernel 3 (update_stress_sponge)
• m_stress.f90 (ppohFDM_update_stress_sponge); measured B/F = 6.8

!OAT$ call OAT_BPset("NZ01")
!OAT$ install LoopFusion region start
!OAT$ name ppohFDMupdate_sponge
!OAT$ debug (pp)
!$omp parallel do private(k,gg_z,j,gg_y,gg_yz,i,gg_x,gg_xyz)
do k = NZ00, NZ01
 gg_z = gz(k)
 do j = NY00, NY01
  gg_y = gy(j); gg_yz = gg_y * gg_z
  do i = NX00, NX01
   gg_x = gx(i); gg_xyz = gg_x * gg_yz
   SXX(I,J,K) = SXX(I,J,K) * gg_xyz; SYY(I,J,K) = SYY(I,J,K) * gg_xyz
   SZZ(I,J,K) = SZZ(I,J,K) * gg_xyz; SXY(I,J,K) = SXY(I,J,K) * gg_xyz
   SXZ(I,J,K) = SXZ(I,J,K) * gg_xyz; SYZ(I,J,K) = SYZ(I,J,K) * gg_xyz
  end do
 end do
end do
!$omp end parallel do
!OAT$ install LoopFusion region end
Automatically Generated Codes for Kernel 3 (ppohFDM_update_sponge)
• #1 [Baseline]: original 3‐nested loop
• #2 [Fusion]: loop fusion for the K and J‐loops (2‐nested loop)
• #3 [Fusion]: loop fusion for the K, J, and I‐loops (loop collapse)
Outline
1. Background and ppOpen‐AT Functions
2. Code Optimization Strategy
3. Performance Evaluation
4. Conclusion
An Example of Seism_3D Simulation
The western Tottori prefecture earthquake in Japan, in the year 2000 ([1], p. 14). The region of 820 km x 410 km x 128 km is discretized at 0.4 km: NX x NY x NZ = 2050 x 1025 x 320 (a ratio of roughly 6.4 : 3.2 : 1).

[1] T. Furumura, "Large‐scale Parallel FDM Simulation for Seismic Waves and Strong Shaking", Supercomputing News, Information Technology Center, The University of Tokyo, Vol. 11, Special Edition 1, 2009. In Japanese.

Figure: seismic wave propagation in the western Tottori earthquake. (a) Measured waves; (b) simulation results. (Reference: [1], p. 13)
AT Candidates in This Experiment
1. Kernel update_stress: 8 candidates with loop fusion and loop split.
2. Kernel update_vel: 6 candidates with loop fusion and re‐ordering of statements.
3‐10. Kernels update_stress_sponge, update_vel_sponge, ppohFDM_pdiffx3_p4, ppohFDM_pdiffx3_m4, ppohFDM_pdiffy3_p4, ppohFDM_pdiffy3_m4, ppohFDM_pdiffz3_p4, and ppohFDM_pdiffz3_m4: 3 candidates each with loop fusion.
AT Candidates in This Experiment (Cont'd)
11‐14. Kernels ppohFDM_ps_pack, ppohFDM_ps_unpack, ppohFDM_pv_pack, and ppohFDM_pv_unpack: 3 candidates each with loop fusion for data packing and data unpacking.
ppohFDM_ps_pack (ppOpen‐APPL/FDM)
• m_stress.f90 (ppohFDM_ps_pack)

!OAT$ call OAT_BPset("NZ01")
!OAT$ install LoopFusion region start
!OAT$ name ppohFDMupdate_sponge
!OAT$ debug (pp)
!$omp parallel do private(k,j,iptr,i)
do k = 1, NZP
 iptr = (k-1)*3*NYP*NL2
 do j = 1, NYP
  do i = 1, NL2
   i1_sbuff(iptr+1) = SXX(NXP-NL2+i,j,k)
   i1_sbuff(iptr+2) = SXY(NXP-NL2+i,j,k)
   i1_sbuff(iptr+3) = SXZ(NXP-NL2+i,j,k)
   i2_sbuff(iptr+1) = SXX(i,j,k)
   i2_sbuff(iptr+2) = SXY(i,j,k)
   i2_sbuff(iptr+3) = SXZ(i,j,k)
   iptr = iptr + 3
  end do
 end do
end do
!$omp end parallel do
ppohFDM_ps_pack (ppOpen‐APPL/FDM), continued

!$omp parallel do private(k,j,iptr,i)
do k = 1, NZP
 iptr = (k-1)*3*NL2*NXP
 do j = 1, NL2
  do i = 1, NXP
   j1_sbuff(iptr+1) = SXY(i,NYP-NL2+j,k); j1_sbuff(iptr+2) = SYY(i,NYP-NL2+j,k)
   j1_sbuff(iptr+3) = SYZ(i,NYP-NL2+j,k)
   j2_sbuff(iptr+1) = SXY(i,j,k); j2_sbuff(iptr+2) = SYY(i,j,k); j2_sbuff(iptr+3) = SYZ(i,j,k)
   iptr = iptr + 3
  end do
 end do
end do
!$omp end parallel do
!$omp parallel do private(k,j,iptr,i)
do k = 1, NL2
 iptr = (k-1)*3*NYP*NXP
 do j = 1, NYP
  do i = 1, NXP
   k1_sbuff(iptr+1) = SXZ(i,j,NZP-NL2+k); k1_sbuff(iptr+2) = SYZ(i,j,NZP-NL2+k)
   k1_sbuff(iptr+3) = SZZ(i,j,NZP-NL2+k)
   k2_sbuff(iptr+1) = SXZ(i,j,k); k2_sbuff(iptr+2) = SYZ(i,j,k); k2_sbuff(iptr+3) = SZZ(i,j,k)
   iptr = iptr + 3
  end do
 end do
end do
!$omp end parallel do
!OAT$ install LoopFusion region end
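The packing pattern above (gather boundary planes of three stress components into contiguous send buffers) can be sketched in Python. This is an illustrative model using 0-based flat arrays, not the Fortran code:

```python
def pack_x(fields, NXP, NYP, NZP, NL2):
    """Pack the last NL2 X-planes (for the +X neighbour) and the first
    NL2 X-planes (for the -X neighbour) of each field, interleaving the
    components (e.g. SXX, SXY, SXZ) as the i1_sbuff/i2_sbuff loops do."""
    def at(f, i, j, k):                 # flat (i, j, k) indexing, 0-based
        return f[(k * NYP + j) * NXP + i]
    sbuf1, sbuf2 = [], []
    for k in range(NZP):
        for j in range(NYP):
            for i in range(NL2):
                for f in fields:
                    sbuf1.append(at(f, NXP - NL2 + i, j, k))
                    sbuf2.append(at(f, i, j, k))
    return sbuf1, sbuf2
```

Each buffer ends up with NZP * NYP * NL2 entries per field, laid out contiguously so a single MPI send can ship the whole halo.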
Machine Environment (8 nodes of the Xeon Phi cluster)
• Intel Xeon Phi 5110P (1.053 GHz), 60 cores
  – Memory: 8 GB (GDDR5)
  – Theoretical peak performance: 1.01 TFLOPS
  – One board per node of the Xeon Phi cluster
• InfiniBand FDR x 2 ports
  – Mellanox Connect‐IB, PCI‐E Gen3 x16, 56 Gbps x 2
  – Theoretical peak bandwidth: 13.6 GB/s, full bisection
• Intel MPI
  – Based on MPICH2, MVAPICH2; 4.1 Update 3 (build 048)
• Compiler: Intel Fortran version 14.0.0.080 Build 20130728
• Compiler options: ‐ipo20 ‐O3 ‐warn all ‐openmp ‐mcmodel=medium ‐shared‐intel ‐mmic ‐align array64byte
• KMP_AFFINITY=granularity=fine,balanced (uniform distribution of threads between sockets)
Execution Details
• ppOpen‐APPL/FDM ver. 0.2
• ppOpen‐AT ver. 0.2
• Number of time steps: 2000
• Number of nodes: 8
• Native mode execution
• Target problem size (almost the maximum with 8 GB/node):
  – NX x NY x NZ = 1536 x 768 x 240 for 8 nodes
  – NX x NY x NZ = 768 x 384 x 120 per node (not per MPI process)
• Number of iterations for kernels during auto‐tuning: 100
Execution Details of Hybrid MPI/OpenMP
• Target MPI processes and OpenMP threads on the Xeon Phi
  – The Xeon Phi runs 4 hardware threads per core (HT).
  – PX TY denotes X MPI processes and Y threads per process.
  – P8T240: the minimum hybrid MPI/OpenMP execution for ppOpen‐APPL/FDM, since it needs at least 8 MPI processes.
  – Also evaluated: P16T120, P32T60, P64T30, P128T15, P240T8, P480T4.
  – P960T2 and smaller thread counts cause an MPI error in this environment.
[Figure: example core assignments on 16 logical cores (#0‐#15). P2T8: 2 MPI processes with 8 threads each; P4T4: 4 MPI processes with 4 threads each. The shading marks the cores targeted by one MPI process.]
Loop Length per Thread (Z‐axis) (8 nodes, 1536 x 768 x 240 / 8 nodes)
[Figure: loop length per thread on the Z‐axis for each hybrid configuration; values range from 0.5 to 24. The number of MPI processes on the Z‐axis differs between the MPI/OpenMP executions due to a software restriction.]
PERFORMANCE OF THE WHOLE EXECUTION TIME
Maximum Speedups by AT (Xeon Phi, 8 Nodes)
[Figure: maximum speedup [%] by AT for each kind of kernel; the largest observed values are 558%, 200%, 171%, 30%, 20%, and 51%.]
Speedup = max( execution time of original code / execution time with AT ), taken over all combinations of hybrid MPI/OpenMP executions (PXTY).
NX x NY x NZ = 1536 x 768 x 240 / 8 nodes.
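The speedup definition on this slide can be written as a small helper (Python; the function name and the sample times below are hypothetical, not measured values):

```python
def max_speedup(orig_times, at_times):
    """Speedup = max over all PXTY configurations of
    (execution time of the original code / execution time with AT).
    orig_times[i] and at_times[i] belong to the same configuration."""
    return max(o / t for o, t in zip(orig_times, at_times))
```

For example, original times of 2.0 s and 3.0 s against tuned times of 1.0 s and 3.0 s give a maximum speedup of 2.0.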
BREAKDOWN OF TIMINGS
HYPER-THREADING (HT) EFFECT
COMPARISON WITH DIFFERENT CPU ARCHITECTURES
AUTO‐TUNING TIME AND THE BEST IMPLEMENTATION
RELATED WORK
Originality (AT Languages)
Comparison items: #1 method for supporting multi‐computer environments; #2 obtaining loop length at run time; #3 loop split with increased computation, and loop fusion of the split loop; #4 re‐ordering of inner‐loop statements; #5 algorithm selection; #6 code generation with execution feedback; #7 software requirement.

• ppOpen‐AT (OAT directives): four items supported; software requirement: none.
• Vendor compilers: out of target; limited support.
• Transformation recipes (recipe descriptions): two items; requires ChiLL.
• POET (Xform description): two items; requires the POET translator and ROSE.
• X language (Xlang pragmas): two items; requires X translation, 'C, and tcc.
• SPL (SPL expressions): three items; requires a script language.
• ADAPT (ADAPT language): two items; requires the Polaris compiler infrastructure and Remote Procedure Call (RPC).
• Atune‐IL (Atune pragmas): one item; requires a monitoring daemon.
• PEPPHER (PEPPHER pragmas, interface): three items; requires the PEPPHER task graph and run time.
• Xevolver (directive extension, recipe descriptions): partial support (users need to define the rules); requires ROSE and an XSLT translator.
Outline
1. Background and ppOpen‐AT Functions
2. Code Optimization Strategy
3. Performance Evaluation
4. Conclusion
Concluding Remarks
• Loop transformation with loop fusion and loop split is a key technology for FDM codes to obtain high performance on many‐core architectures.
• AT with static code generation (and dynamic code selection) is a key technology for auto‐tuning with a minimal software stack on supercomputers in operation.

The code is freely available (MIT license): http://ppopenhpc.cc.u‐tokyo.ac.jp/
Future Work
• Improving the search algorithm
  – The current implementation uses a brute‐force search, which is sufficient when knowledge of the application is applied.
  – We have implemented a new search algorithm: a d‐Spline based search (a performance model), in collaboration with Prof. Tanaka (Kogakuin U.).
• Code selection between different architectures
  – Selection between vector machine and scalar machine implementations.
  – Whole‐code selection in the main routine is needed.
• Off‐loading implementation selection (for the Xeon Phi)
  – If the problem size is too small for off‐loading, the target execution is performed on the CPU automatically.
Thank you for your attention! Questions?
http://ppopenhpc.cc.u‐tokyo.ac.jp/