S3D: Comparing Performance of XT3+XT4 with XT4
Sameer Shende
TAU Performance SystemS3D Scalability Study 2
Acknowledgements
• Alan Morris [UO]
• Kevin Huck [UO]
• Allen D. Malony [UO]
• Kenneth Roche [ORNL]
• Bronis R. de Supinski [LLNL]
• John Mellor-Crummey [Rice]
• Nick Wright [SDSC]
• Jeff Larkin [Cray, Inc.]
The performance data presented here is available at:
http://www.cs.uoregon.edu/research/tau/s3d
TAU Parallel Performance System
• http://www.cs.uoregon.edu/research/tau/
• Multi-level performance instrumentation
  • Multi-language automatic source instrumentation
• Flexible and configurable performance measurement
• Widely-ported parallel performance profiling system
  • Computer system architectures and operating systems
  • Different programming languages and compilers
• Support for multiple parallel programming paradigms
  • Multi-threading, message passing, mixed-mode, hybrid
The Story So Far...
• Scalability study of S3D using TAU
• 3D scatter plots and the mapping of ranks to physical processors point to partitioning in XT3/XT4
• Memory and network on the XT3 partition cause the rest of the application to slow down
• Hypothesis: Running S3D on a ‘pure’ XT4 system will help improve the performance significantly
• Ran a 6400-core simulation on an XT4 partition to compare with XT3+XT4 (used #PBS -lfeature=xt4)...
3D Scatter Plots
• Plot four routines along X, Y, Z, and Color axes
• Each routine has a range (max, min)
• Each process (rank) has a unique position along the three axes and a unique color
• Allows us to examine the distribution of nodes (clusters)
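The layout described on this slide can be sketched as follows: each of the four routines is scaled to its own (min, max) range, and every rank becomes one (x, y, z, color) point. This is only an illustrative sketch, not the code behind the TAU display; the routine names ROUTINE_A to ROUTINE_C and all timing values are made up.

```python
def normalize(values):
    """Scale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All ranks spent the same time: place them at the origin of this axis.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def scatter_points(times_by_routine, routines):
    """Return one (x, y, z, color) tuple per rank for four chosen routines."""
    if len(routines) != 4:
        raise ValueError("need exactly four routines: X, Y, Z and color")
    axes = [normalize(times_by_routine[name]) for name in routines]
    return list(zip(*axes))


# Hypothetical exclusive times in seconds for four ranks (not measured data).
times = {
    "MPI_Wait": [15.0, 20.0, 400.0, 435.0],
    "ROUTINE_A": [300.0, 290.0, 100.0, 110.0],
    "ROUTINE_B": [50.0, 55.0, 30.0, 25.0],
    "ROUTINE_C": [10.0, 12.0, 8.0, 9.0],
}
points = scatter_points(times, ["MPI_Wait", "ROUTINE_A", "ROUTINE_B", "ROUTINE_C"])
for rank, point in enumerate(points):
    print(rank, point)
```

In this toy data, ranks 0 and 1 land close together and far from ranks 2 and 3, which is the kind of clustering the plot is meant to expose.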
Scatter Plot: 6400 cores XT3/XT4 - 2 Clusters!
Previous work proved: Blue nodes are XT3, Red are XT4
3D Triangle Mesh Display
• Plot MPI rank, routine name, and exclusive time along X, Y and Z axes
• Color can be shown by a fourth metric
• Scalable view
• Suitable for very large number of processors
XT3+XT4: MPI_Wait
• Gap represents XT3 nodes
3D View: Large MPI_Wait times on most CPUs
• To improve performance, we must reduce MPI_Wait time on the other CPUs
3D View: XT3 Partition, Imbalance
On XT3: MPI_Wait takes less time, other routines take more time!
Getting Back to MPI_Wait()
• MPI_Wait takes less time on XT3 nodes
• Other routines take longer
XT3+XT4: MPI_Wait - Sorted by Exclusive Time
• MPI_Wait takes 435.84 seconds on rank 3101
• It takes 15.49 seconds on rank 0!
• Rank 3101 is on XT4, rank 0 is on XT3
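To put the two quoted ranks in proportion, the sketch below sorts ranks by exclusive MPI_Wait time and computes the ratio between the largest and smallest value. Only the two data points given on this slide are used; a real profile would contain all 6400 ranks, and the variable names are illustrative.

```python
# Exclusive MPI_Wait time in seconds, keyed by MPI rank (values from the slide).
wait_time = {0: 15.49, 3101: 435.84}

# Sort ranks from most to least time spent in MPI_Wait.
ranked = sorted(wait_time.items(), key=lambda item: item[1], reverse=True)

worst_rank, worst_time = ranked[0]
best_rank, best_time = ranked[-1]
ratio = worst_time / best_time

print(f"rank {worst_rank} waits {ratio:.1f}x longer than rank {best_rank}")
```

The XT4 rank spends roughly 28 times longer waiting than the XT3 rank, which is consistent with the faster nodes idling until the slower ones catch up.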
Comparing XT4 and XT3 ranks (Best vs worst)
Improving S3D Performance
• Hypothesis: Running S3D on a ‘pure’ XT4 system will help improve the performance significantly and reduce the time spent idling in MPI_Wait
XT4 Profile: Main Window
XT4: Mean Profile Sorted by Exclusive Time
• MPI_Wait has moved down!
XT4: Mean Profile Sorted by Inclusive Time
Comparing XT4 with XT3+XT4
• On XT4, MPI_Wait takes about 26% of the time it takes on the combined XT3+XT4 system!
Comparing Mean Inclusive Time
XT4: 3D View
• The “exp” loop [~1GFlop] takes most time now!
XT3+XT4: Scatter Plot (Before)
XT4 Scatter Plot (After)
• MPI_Wait takes from 78 to 121 s now!
Comparing Performance
• Hypothesis confirmed: XT4 is faster than XT3+XT4
  • Inclusive time down from 1935 to 1702 s
  • 12% improvement
  • Saved 24853.3 minutes (414 hours) of wallclock time, summed over all 6400 cores!
• Reduction in MPI_Wait time is most significant
  • 390 s (mean) down to 104 s (mean)
• Lessons learned:
  • Slower XT3 nodes can have a significant impact on a large scale S3D run
  • S3D harness testcase does not perform well on non-homogeneous nodes
  • We recommend running S3D on XT4 partition only!
    #PBS -lfeature=xt4
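The numbers on this slide can be cross-checked with a little arithmetic. The sketch below uses only figures quoted in this deck (1935 s, 1702 s, 390 s, 104 s, 6400 cores); the variable names are illustrative. It suggests that the 24853.3-minute figure is the per-run saving multiplied by the core count, that is, an aggregate over cores rather than elapsed time of a single run.

```python
CORES = 6400               # cores used in the simulation

t_mixed = 1935.0           # inclusive time on XT3+XT4, seconds
t_xt4 = 1702.0             # inclusive time on pure XT4, seconds

saved_per_run = t_mixed - t_xt4                     # seconds saved per run
improvement_pct = 100.0 * saved_per_run / t_mixed   # relative improvement

# Aggregate saving when the per-run saving is counted once per core.
core_seconds = saved_per_run * CORES
core_minutes = core_seconds / 60.0
core_hours = core_seconds / 3600.0

# Mean MPI_Wait time on XT4 as a percentage of the XT3+XT4 value.
wait_pct = 100.0 * 104.0 / 390.0

print(f"saved per run:  {saved_per_run:.0f} s ({improvement_pct:.1f}%)")
print(f"aggregate:      {core_minutes:.1f} core-minutes, {core_hours:.1f} core-hours")
print(f"MPI_Wait ratio: {wait_pct:.1f}%")
```

The 12% improvement, the 24853.3 minutes, the 414 hours, and the roughly 26% MPI_Wait ratio quoted in the deck all agree with these values.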
Discussion
• Did we get optimal performance on XT4 nodes?
• Are the nodes performing at similar rates uniformly now?
• Let us see the std. deviation plot of all routines...
XT4: Standard Deviation
• IO routines!
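The idea behind the standard deviation view can be sketched in a few lines: compute, for each routine, the spread of its time across ranks, then list the routines with the largest spread first. This is a minimal sketch and not how the TAU display is implemented; COMPUTE_LOOP is a made-up routine name and all timing values are invented for illustration.

```python
import statistics


def rank_by_stddev(times_by_routine):
    """Return (routine, stddev) pairs, largest spread across ranks first."""
    spread = {
        name: statistics.pstdev(values)
        for name, values in times_by_routine.items()
    }
    return sorted(spread.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-rank times in seconds for four ranks (not measured data).
times = {
    "WRITE_SAVEFILE": [2.0, 40.0, 38.0, 42.0],
    "MPI_Barrier": [35.0, 5.0, 6.0, 4.0],
    "COMPUTE_LOOP": [100.0, 101.0, 99.0, 100.0],
}
ordered = rank_by_stddev(times)
for name, stddev in ordered:
    print(f"{name:16s} {stddev:8.2f}")
```

Routines whose time varies most between ranks rise to the top of the list, while evenly balanced compute routines sink to the bottom.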
Scatter Plot: One CPU... WRITE_SAVEFILE
WRITE_SAVEFILE
• Rank 0 is quicker!
MPI_Barrier
I/O is not performed uniformly
I/O Becomes a Bottleneck: XT3, XT3+XT4...
MPI_Wait
WRITE_SAVEFILE
Conclusions
• Using pure XT4 improved performance by 12%
• Need to investigate I/O in XT4/Lustre further to achieve better performance...
• Discuss I/O issues with S3D developers
S3D - Building with TAU
• Change name of compiler in build/make.XT3
  • ftn => tau_f90.sh
  • cc => tau_cc.sh
• Set compile time environment variables
  • setenv TAU_MAKEFILE /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.tau-nocomm-multiplecounters-mpi-papi-pdt-pgi
    • Disabled tracking message communication statistics in TAU
    • MPI_Comm_compare() is not called inside TAU’s MPI wrapper
    • Choose callpath, PAPI counters, MPI profiling, PDT for source instrumentation
  • setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optPreProcess'
    • Selective instrumentation file eliminates instrumentation in lightweight routines
    • Pre-process Fortran source code using cpp before compiling
• Set runtime environment variables for instrumentation control and PAPI event counter selection in job submission script:
    export TAU_THROTTLE=1
    export COUNTER1=GET_TIME_OF_DAY
    export COUNTER2=PAPI_FP_INS
    export COUNTER3=PAPI_L1_DCM
    export COUNTER4=PAPI_TOT_INS
    export COUNTER5=PAPI_L2_DCM
Selective Instrumentation in TAU
% cat select.tau
BEGIN_EXCLUDE_LIST
MCADIF
GETRATES
TRANSPORT_M::MCAVIS_NEW
MCEDIF
MCACON
CKYTCP
THERMCHEM_M::MIXCP
THERMCHEM_M::MIXENTH
THERMCHEM_M::GIBBSENRG_ALL_DIMT
CKRHOY
MCEVAL4
THERMCHEM_M::HIS
THERMCHEM_M::CPS
THERMCHEM_M::ENTROPY
END_EXCLUDE_LIST
BEGIN_INSTRUMENT_SECTION
loops routine="#"
END_INSTRUMENT_SECTION
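When the list of lightweight routines changes often, a file with the same layout as the one above can be generated from a script. The helper below is a hypothetical convenience, not part of TAU; it only reproduces the section markers and the loops directive shown on this slide, and the short routine list is an excerpt for illustration.

```python
def make_select_file(excluded, instrument_loops=True):
    """Build the text of a selective instrumentation file.

    excluded: routine names to leave uninstrumented.
    instrument_loops: if True, append the loop-level instrumentation section.
    """
    lines = ["BEGIN_EXCLUDE_LIST"]
    lines.extend(excluded)
    lines.append("END_EXCLUDE_LIST")
    if instrument_loops:
        lines.append("BEGIN_INSTRUMENT_SECTION")
        lines.append('loops routine="#"')
        lines.append("END_INSTRUMENT_SECTION")
    return "\n".join(lines) + "\n"


# Excerpt of the exclude list from the slide.
text = make_select_file(["MCADIF", "GETRATES", "TRANSPORT_M::MCAVIS_NEW"])
print(text, end="")

# To write the file next to the build, one could do:
# with open("select.tau", "w") as handle:
#     handle.write(text)
```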
Getting Access to TAU on Jaguar
• set path=(/spin/proj/perc/TOOLS/tau_latest/x86_64/bin $path)
• Choose Stub Makefiles (TAU_MAKEFILE env. var.) from /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.*
  • Makefile.tau-mpi-pdt-pgi (flat profile)
  • Makefile.tau-mpi-pdt-pgi-trace (event trace, for use with Vampir)
  • Makefile.tau-callpath-mpi-pdt-pgi (single metric, callpath profile)
• Binaries of S3D can be found in: ~sameer/scratch/S3D-BINARIES
  • withtau
    » papi, multiplecounters, mpi, pdt, pgi options
  • without_tau
Concluding Discussion
• Performance tools must be used effectively
• More intelligent performance systems for productive use
  • Evolve to application-specific performance technology
  • Deal with scale by “full range” performance exploration
  • Autonomic and integrated tools
  • Knowledge-based and knowledge-driven process
• Performance observation methods do not necessarily need to change in a fundamental sense
  • More automatically controlled and efficiently used
• Develop next-generation tools and deliver to community
• Open source with support by ParaTools, Inc.
• http://www.cs.uoregon.edu/research/tau
Support Acknowledgements
Department of Energy (DOE)
Office of Science
LLNL, LANL, ORNL, ASC
PERI