The Application of POSIX Threads and OpenMP to the U.S. NRC Neutron Kinetics Code PARCS

D.J. Lee and T.J. Downar
School of Nuclear Engineering, Purdue University
July, 2001
2
Contents
• Introduction
• Parallelism in PARCS
• Parallel Performance of PARCS
• Cache Analysis
• Conclusions
3
Introduction
4
PARCS
• "Purdue Advanced Reactor Core Simulator"
• U.S. NRC (Nuclear Regulatory Commission) Code for Nuclear Reactor Safety Analysis
• Developed at the School of Nuclear Engineering of Purdue University
• A Multi-Dimensional, Multi-Group Reactor Kinetics Code Based on the Nonlinear Nodal Method
5
Nuclear Power Plant

[Figure: schematic of a nuclear power plant, highlighting the nuclear reactor core]
6
Equations Solved in PARCS
• Time-Dependent Boltzmann Transport Equation

  \[
  \frac{1}{v}\frac{\partial \phi(\vec r,E,\hat\Omega,t)}{\partial t}
  + \hat\Omega\cdot\nabla\phi(\vec r,E,\hat\Omega,t)
  + \Sigma_t(\vec r,E)\,\phi(\vec r,E,\hat\Omega,t)
  = \int dE'\!\int d\hat\Omega'\;\Sigma_s(\vec r,E'\!\rightarrow\!E,\hat\Omega'\!\rightarrow\!\hat\Omega)\,\phi(\vec r,E',\hat\Omega',t)
  + S(\vec r,E,\hat\Omega,t)
  \]

• T/H Field Equations
  – Heat Conduction Equation
  – Heat Convection Equation
7
Spatial Coupling

Thermal-Hydraulics:
• Computes new coolant/fuel properties
• Sends moderator temp., vapor and liquid densities, void fraction, boron conc., and average, centerline, and surface fuel temp.
• Uses neutronic power as heat source for conduction

Neutronics:
• Uses coolant and fuel properties for local node conditions
• Updates macroscopic cross sections based on local node conditions
• Computes 3-D flux
• Sends node-wise power distribution
8
High Necessity of HPC for PARCS
• Acceleration Techniques in PARCS
  – Nonlinear CMFD Method: Global (Low Order) + Local (High Order)
  – BILU3D Preconditioned BICGSTAB
  – Wielandt Shift Method
• Still, the Computational Burden of PARCS is Very Large
  – Typically, the Calculation Speed is More Than an Order of Magnitude Slower Than Real Time
  – Examples
    • NEACRP Benchmark: Several Tens of Seconds for a 0.5 sec. Simulation
    • PARCS/TRAC Coupled Run: 4 Hours for a 100 sec. Simulation
9
Parallelism in PARCS
10
PARCS Computational Modules
• CMFD: Solves the "Global" Coarse Mesh Finite Difference Equation
• NODAL: Solves "Local" Higher Order Differenced Equations
• XSEC: Provides Temperature/Fluid Feedback through Cross Sections (Coefficients of the Boltzmann Equation)
• T/H: Solution of Temperature/Fluid Field Equations
11
Parallelism in PARCS
• NODAL and XSEC Modules:
  – Node-by-Node Calculation
  – Naturally Parallelizable
• T/H Module:
  – Channel-by-Channel Calculation
  – Naturally Parallelizable
• CMFD Module:
  – Domain Decomposition Preconditioning
  – Example: Split the Reactor into Two Halves
  – The Number of Iterations Depends on the Number of Domains
12
Why Multi-Threaded Programming?
• Coupling of Domains
  – The Information of One Plane at the Interface of Two Domains Must Be Transferred to Each Other
  – The Size of the Information to Be Exchanged is NOT SMALL Compared with the Amount of Calculation for Each Domain
• Message Passing
  – Large Communication Overhead
• Multi-Threading
  – Shared Address Space
  – Negligible Communication Overhead
13
Multi-threaded Programming
• OpenMP
  – FORTRAN, C, C++
  – Simple Implementation Based on Directives
• POSIX Threads
  – No Interface to FORTRAN
  – Developed a FORTRAN-to-C Wrapper
  – Much Caution Required to Avoid Race Conditions
14
POSIX Threads with FORTRAN: nuc_threads
• Mixed-language interface accessible to both the Fortran and C sections of the code
• Minimal set of threads functions:
  – nuc_init(*ncpu): initializes mutex and condition variables.
  – nuc_frk(*func_name, *nuc_arg, *arg): creates the POSIX threads.
  – nuc_bar(*iam): used for synchronization.
  – nuc_gsum(*iam, *A, *globsum): used to get a global sum of an array updated by each thread.
15
Implementation of OpenMP and Pthreads

[Figure: fork/join diagrams for Pthreads and OpenMP with two threads each. In the Pthreads implementation, threads are forked once at the beginning and joined once at the end, with synchronization points in between; in the OpenMP implementation, threads are forked and joined around each parallel region, idling between synchronizations.]
16
Parallel Performance of PARCS
17
Applications
• Matrix-Vector Multiplication
  – Subroutine "MatVec" of PARCS
  – Size of the Matrix is the Same as the NEACRP Benchmark
• NEACRP Reactor Transient Benchmark
  – Control Rod Ejection from Hot Zero Power Condition
  – Full 3-Dimensional Transient
18
Specification of Machine

Platform        | SUN ULTRA-80                     | SGI ORIGIN 2000
Number of CPUs  | 2                                | 32
CPU Type        | ULTRA SPARC II, 450 MHz          | MIPS R10000, 250 MHz, 4-way superscalar
L1 Cache        | 16 KB D-cache, 16 KB I-cache,    | 32 KB D-cache, 32 KB I-cache,
                | cache line size: 32 bytes        | cache line size: 32 bytes
L2 Cache        | 4 MB                             | 4 MB per CPU, cache line size: 128 bytes
Main Memory     | 1 GB                             | 16 GB
Compiler        | SUN Workshop 6 (FORTRAN 90 6.1)  | MIPSpro Compiler 7.2.1 (FORTRAN 90)
19
Specification of Machine

Platform        | LINUX Machine
Number of CPUs  | 4
CPU Type        | Intel Pentium-III, 550 MHz
L1 Cache        | 16 KB D-cache, 16 KB I-cache, cache line size: ? bytes
L2 Cache        | 512 KB
Main Memory     | 1 GB
Compiler        | NAGWare FORTRAN 90 Version 4.2

ftp://download.intel.com/design/PentiumIII/xeon/datashts/24509402.pdf
Slot 2 technology, 100 MHz bus, non-blocking cache
20
Matrix-Vector Multiplication (MatVec Subroutine of PARCS)

Time in seconds *2) with speedup in parentheses *3).

Machine (Serial) | OpenMP: 1 *1)  | 2            | 4                | 8           | Pthreads: 1 | 2           | 4           | 8
SGI (1.73)       | 1.73 (1.00)   | 0.92 (1.89)  | 0.52 (3.30) *4)  | 0.37 (4.72) | 1.72 (1.01) | 1.80 (0.96) | 1.91 (0.91) | 1.96 (0.88)
SUN (3.76)       | 23.43 (0.16)  | 13.26 (0.28) | --               | --          | 3.71 (1.02) | 1.93 (1.95) | --          | --

*1) Number of Threads  *2) Time (seconds)  *3) Speedup  *4) Core is Divided into 18 Planes
21
Matrix-Vector Multiplication (MatVec Subroutine of PARCS)

[Figure: speedup bar charts comparing OpenMP and Pthreads on a 0–5 speedup axis, for 1, 2, 4, and 8 threads on the SGI (serial run time: 1.73 s) and 1 and 2 threads on the SUN (serial run time: 3.76 s).]
22
NEACRP Benchmark (Simulation with Multiple Threads)

[Figure: Transient Power — Power (%) vs. TIME (sec) over 0–0.5 sec on a 0–500% axis, for the serial, 2-thread, 4-thread, and 8-thread runs.]
23
Parallel Performance (SUN)

Time (sec):
Module | Serial | Pthreads 1 *) | Pthreads 2 | Speedup
CMFD   | 36.7   | 32.1          | 20.8       | 1.77
Nodal  | 11.5   | 11.3          |  6.4       | 1.78
T/H    | 29.6   | 27.9          | 14.5       | 2.04
Xsec   |  7.6   |  7.1          |  3.7       | 2.04
Total  | 85.4   | 78.5          | 45.5       | 1.88

# of Updates:
Module | Serial | Pthreads 1 *) | Pthreads 2
CMFD   | 445    | 445           | 456
Nodal  |  31    |  31           |  33
T/H    | 216    | 216           | 216
Xsec   | 225    | 225           | 226

*) Number of Threads
24
Parallel Performance (SGI)

Time (sec) and Speedup:
Module | Serial | OpenMP 1 *1) | 2    | Speedup | 4    | Speedup | 8    | Speedup
CMFD   | 19.8   | 19.3         | 12.1 | 1.63    | 8.93 | 2.21    | 8.85 | 2.23
Nodal  |  9.0   |  9.2         |  5.8 | 1.55    | 3.56 | 2.53    | 2.87 | 3.14
T/H    | 26.6   | 25.3         | 12.3 | 2.17    | 8.92 | 2.99    | 7.14 | 3.73
Xsec   |  4.8   |  4.4         |  2.4 | 2.01    | 1.37 | 3.53    | 1.11 | 4.35
Total  | 60.2   | 58.1         | 32.6 | 1.85    | 22.8 | 2.64 *2)| 20.0 | 3.02 *2)

# of Updates:
Module | Serial | OpenMP 1 | 2   | 4   | 8
CMFD   | 445    | 445      | 456 | 497 | 565
Nodal  |  31    |  31      |  33 |  38 |  39
T/H    | 216    | 216      | 216 | 216 | 217
Xsec   | 225    | 225      | 226 | 228 | 227

*1) Number of Threads  *2) Core is divided into 18 planes
25
Cache Analysis
26
Memory Access Time

[Figure: memory hierarchy — CPU, L1 Cache, L2 Cache, Memory]

Typical Memory Access Cycles (SGI)

Memory Access Type                        | Cycles
L1 cache hit                              | 2
L1 cache miss satisfied by L2 cache hit   | 8
L2 cache miss satisfied from memory       | 75
27
Cache Miss Measurements (SGI)

Module      | Cache | Serial  | OpenMP 1 *1) | 2       | 4       | 8
CMFD (BICG) | L1    | 477,691 | 479,474      | 258,027 | 156,461 | 105,733
CMFD (BICG) | L2    |  28,242 |  29,650      |  17,007 |  11,751 |   9,309
Nodal       | L1    | 857,744 | 853,866      | 444,849 | 249,507 | 160,699
Nodal       | L2    |  54,163 |  55,534      |  33,846 |  19,016 |  12,848
T/H (TRTH)  | L1    | 165,133 |  60,587      |  39,419 |  25,850 |  19,816
T/H (TRTH)  | L2    |   9,551 |   9,512      |   9,673 |   6,451 |   4,620
XSEC        | L1    |  62,324 |  57,462      |  29,845 |  17,715 |  11,344
XSEC        | L2    |   9,456 |   9,518      |   5,517 |   3,737 |   2,578

*1) Number of Threads
28
Cache Miss & Speedup of XSEC Module (SGI)

[Figure: L2 misses (0–10,000) and speedup (0–5) vs. number of CPUs (0–10) for the XSEC module; L2 misses fall as the speedup rises with the number of CPUs.]
29
Cache Miss Ratio (SGI)

Module      | Cache | Serial | OpenMP 1 *1) | 2    | 4    | 8
CMFD (BICG) | L1    | 1.00   | 1.00         | 1.85 | 3.05 | 4.52
CMFD (BICG) | L2    | 1.00   | 0.95         | 1.66 | 2.40 | 3.03
Nodal       | L1    | 1.00   | 1.00         | 1.93 | 3.44 | 5.34
Nodal       | L2    | 1.00   | 0.98         | 1.60 | 2.85 | 4.22
T/H (TRTH)  | L1    | 1.00   | 2.73         | 4.19 | 6.39 | 8.33
T/H (TRTH)  | L2    | 1.00   | 1.00         | 0.99 | 1.48 | 2.07
XSEC        | L1    | 1.00   | 1.08         | 2.09 | 3.52 | 5.49
XSEC        | L2    | 1.00   | 0.99         | 1.71 | 2.53 | 3.67

Cache Miss Ratio = (Cache Misses of Serial Execution) / (Cache Misses of Parallel Execution)

*1) Number of Threads
30
Speedup Estimation Using Cache Misses

• Speedup

  S = T_total^serial / T_total^2th

  where
  T_total^serial = Total data access time for serial execution
  T_total^2th = Total data access time for 2-thread execution.

• Data Access Time

  T_total = T_L2 + T_mem = n_L2 * t_L2 + n_Mem * t_Mem

  where
  T_L2 = Total L2 cache access time
  T_mem = Total memory access time
  n_L2 = Number of L1 data cache misses satisfied by L2 cache hit
  n_Mem = Number of L2 data cache misses satisfied from main memory
  t_L2 = L2 cache access time for 1 word
  t_Mem = Main memory access time for 1 word.
31
Estimated 2-thread Speedup Based on Data Cache Misses for OpenMP on SGI

Module      | Measured Speedup | Predicted Speedup
CMFD (BICG) | 1.63             | 1.78
Nodal       | 1.55             | 1.80
T/H (TRTH)  | 2.17             | 2.04
XSEC        | 2.01             | 1.86
32
Conclusions
33
Conclusions
• Comparison of OpenMP and POSIX Threads
  – OpenMP is Comparable to POSIX Threads in Terms of Parallel Performance
  – OpenMP is Much Easier to Implement than POSIX Threads due to its Directive-Based Nature
• Cache Analysis
  – The Prediction of Speedup Based on Data Cache Misses Agrees Well with the Measured Speedup
34
Continuing Work
• Algorithmic
  – 3-D Domain Decomposition
• Software
  – SUN Compiler
  – Pthreads Scheduling on SGI
• Alternate Platforms