S3D: Comparing Performance of XT3+XT4 with XT4

Sameer Shende


TAU Performance System: S3D Scalability Study

Acknowledgements

• Alan Morris [UO]
• Kevin Huck [UO]
• Allen D. Malony [UO]
• Kenneth Roche [ORNL]
• Bronis R. de Supinski [LLNL]
• John Mellor-Crummey [Rice]
• Nick Wright [SDSC]
• Jeff Larkin [Cray, Inc.]

The performance data presented here is available at:

http://www.cs.uoregon.edu/research/tau/s3d


TAU Parallel Performance System

• http://www.cs.uoregon.edu/research/tau/
• Multi-level performance instrumentation
  • Multi-language automatic source instrumentation
• Flexible and configurable performance measurement
• Widely-ported parallel performance profiling system
  • Computer system architectures and operating systems
  • Different programming languages and compilers
• Support for multiple parallel programming paradigms
  • Multi-threading, message passing, mixed-mode, hybrid


The Story So Far...

• Scalability study of S3D using TAU
• 3D scatter plots and mapping of ranks to physical processors point to partitioning in XT3/XT4
• Memory and network on the XT3 partition cause the rest of the application to slow down
• Hypothesis: Running S3D on a ‘pure’ XT4 system will help improve the performance significantly
• Ran a 6400-core simulation on an XT4 partition to compare with XT3+XT4 (used #PBS -lfeature=xt4)...


3D Scatter Plots

• Plot four routines along X, Y, Z, and Color axes
• Each routine has a range (max, min)
• Each process (rank) has a unique position along the three axes and a unique color
• Allows us to examine the distribution of nodes (clusters)
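A minimal sketch of how such a plot could place each rank, assuming every axis is scaled linearly between that routine's minimum and maximum over all ranks. The function names, the routine names other than MPI_Wait, and the sample timings are hypothetical; this is not the profile browser's actual code.

```python
def normalize(values):
    """Scale values linearly to [0, 1] using their own (min, max) range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A routine that takes the same time on every rank collapses to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def scatter_coords(per_routine):
    """Map four routines to (x, y, z, color) tuples, one tuple per rank.

    per_routine: dict of routine name -> list of per-rank times, all lists
    the same length. Insertion order selects the X, Y, Z and color axes.
    """
    if len(per_routine) != 4:
        raise ValueError("expected exactly four routines")
    columns = [normalize(times) for times in per_routine.values()]
    return list(zip(*columns))


# Hypothetical timings (seconds) for three ranks, for illustration only.
sample = {
    "MPI_Wait": [10.0, 20.0, 30.0],
    "ROUTINE_A": [5.0, 5.0, 5.0],
    "ROUTINE_B": [1.0, 3.0, 2.0],
    "ROUTINE_C": [0.0, 4.0, 8.0],
}
coords = scatter_coords(sample)
for rank, point in enumerate(coords):
    print(rank, point)
```

Ranks that behave alike end up close together in this normalized space, which is why two node populations show up as two separate clusters.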


Scatter Plot: 6400 cores XT3/XT4 - 2 Clusters!

Previous work proved: Blue nodes are XT3, Red are XT4


3D Triangle Mesh Display

• Plot MPI rank, routine name, and exclusive time along X, Y and Z axes
• Color can be shown by a fourth metric
• Scalable view
• Suitable for very large number of processors


XT3+XT4: MPI_Wait

• Gap represents XT3 nodes


3D View: Large MPI_Wait times on most CPUs

• To improve performance, we must reduce MPI_Wait time on other CPUs


3D View: XT3 Partition, Imbalance

On XT3: MPI_Wait takes less time, other routines take more time!


Getting Back to MPI_Wait()

• MPI_Wait takes less time on XT3 nodes
• Other routines take longer


XT3+XT4: MPI_Wait - Sorted by Exclusive Time

• MPI_Wait takes 435.84 seconds on rank 3101
• It takes 15.49 seconds on rank 0!
• Rank 3101 is on XT4, rank 0 is on XT3
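To put the two quoted ranks side by side, here is a small sketch that sorts them by exclusive time and computes the ratio. Only the two timings come from the slide; the variable names are illustrative, and a real profile would hold one entry per rank.

```python
# MPI_Wait exclusive time in seconds, keyed by MPI rank (values from the slide).
mpi_wait_s = {3101: 435.84, 0: 15.49}

# Sort ranks from longest to shortest wait, as in the sorted profile view.
by_time = sorted(mpi_wait_s.items(), key=lambda item: item[1], reverse=True)
slow_rank, slow_s = by_time[0]
fast_rank, fast_s = by_time[-1]

# How many times longer the XT4 rank idles than the XT3 rank.
ratio = slow_s / fast_s
print(f"rank {slow_rank} waits {ratio:.1f}x longer than rank {fast_rank}")
```

The XT4 rank spends roughly 28 times longer in MPI_Wait than the XT3 rank, consistent with faster nodes idling while slower ones catch up.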


Comparing XT4 and XT3 ranks (Best vs worst)


Improving S3D Performance

• Hypothesis: Running S3D on a ‘pure’ XT4 system will help improve the performance significantly and reduce the time spent idling in MPI_Wait


XT4 Profile: Main Window


XT4: Mean Profile Sorted by Exclusive Time

• MPI_Wait has moved down!


XT4: Mean Profile Sorted by Inclusive Time


Comparing XT4 with XT3+XT4

• MPI_Wait takes about 26% of the time it took on the combined XT3+XT4 run!


Comparing Mean Inclusive Time


XT4: 3D View

• The “exp” loop [~1GFlop] takes most time now!


XT3+XT4: Scatter Plot (Before)


XT4 Scatter Plot (After)

• MPI_Wait takes from 78 to 121 s now!


Comparing Performance

• Hypothesis confirmed: XT4 is faster than XT3+XT4
  • Inclusive time down from 1935 to 1702 s
  • 12% improvement
  • Saved 24853.3 minutes (414 hours) of wallclock time, summed over all 6400 cores (233 s × 6400)!
• Reduction in MPI_Wait time is most significant
  • 390 s (mean) down to 104 s (mean)
• Lessons learned:
  • Slower XT3 nodes can have a significant impact on a large scale S3D run
  • S3D harness testcase does not perform well on non-homogeneous nodes
  • We recommend running S3D on XT4 partition only!
    • #PBS -lfeature=xt4
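The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes the quoted inclusive times are per-core values and that the saving is summed over the 6400 cores of the run; the variable names are illustrative.

```python
# Values quoted in the slides (seconds).
t_mixed = 1935.0    # inclusive time, XT3+XT4 run
t_xt4 = 1702.0      # inclusive time, XT4-only run
cores = 6400        # size of both runs

saved_per_core = t_mixed - t_xt4                    # seconds saved per core
improvement_pct = 100.0 * saved_per_core / t_mixed  # relative improvement
saved_minutes = saved_per_core * cores / 60.0       # summed over all cores
saved_hours = saved_minutes / 60.0

wait_mixed = 390.0  # mean MPI_Wait time, XT3+XT4 run
wait_xt4 = 104.0    # mean MPI_Wait time, XT4-only run
wait_ratio_pct = 100.0 * wait_xt4 / wait_mixed      # XT4 wait as % of mixed

print(f"improvement: {improvement_pct:.1f}%")
print(f"saved: {saved_minutes:.1f} min ({saved_hours:.0f} h) over {cores} cores")
print(f"MPI_Wait on XT4 is {wait_ratio_pct:.1f}% of the XT3+XT4 value")
```

Under these assumptions the numbers reproduce the 12% improvement, the 24853.3 minutes saved, and the roughly 26% MPI_Wait figure.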


Discussion

• Did we get optimal performance on XT4 nodes?
• Are the nodes performing at similar rates uniformly now?
• Let us see the std. deviation plot of all routines...
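One way to ask the uniformity question of a profile is to rank routines by the standard deviation of their per-rank times. The sketch below does this with Python's statistics module; the timings and the COMPUTE_LOOP routine name are hypothetical, invented only to illustrate the ranking, and are not measurements from the S3D runs.

```python
from statistics import pstdev


def rank_by_stddev(profile):
    """Return (routine, std. deviation) pairs, most variable routine first.

    profile: dict of routine name -> list of per-rank times in seconds.
    """
    spread = [(name, pstdev(times)) for name, times in profile.items()]
    return sorted(spread, key=lambda pair: pair[1], reverse=True)


# Hypothetical per-rank times for four ranks, for illustration only.
profile = {
    "WRITE_SAVEFILE": [2.0, 40.0, 42.0, 44.0],
    "MPI_Wait": [100.0, 104.0, 108.0, 104.0],
    "COMPUTE_LOOP": [50.0, 50.0, 51.0, 49.0],
}

ranking = rank_by_stddev(profile)
for name, sd in ranking:
    print(f"{name:16s} {sd:8.3f}")
```

A routine at the top of this list takes very different amounts of time on different ranks, which makes it a candidate for the next round of investigation.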


XT4: Standard Deviation

• IO routines!


Scatter Plot: One CPU... WRITE_SAVEFILE


WRITE_SAVEFILE

• Rank 0 is quicker!


MPI_Barrier


I/O is not performed uniformly


I/O Becomes a Bottleneck: XT3, XT3+XT4...

MPI_Wait

WRITE_SAVEFILE


Conclusions

• Using pure XT4 improved performance by 12%
• Need to investigate I/O in XT4/Lustre further to achieve better performance...
• Discuss I/O issues with S3D developers


S3D - Building with TAU

• Change name of compiler in build/make.XT3
  • ftn => tau_f90.sh
  • cc => tau_cc.sh
• Set compile time environment variables
  • setenv TAU_MAKEFILE /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.tau-nocomm-multiplecounters-mpi-papi-pdt-pgi
    • Disabled tracking message communication statistics in TAU
    • MPI_Comm_compare() is not called inside TAU’s MPI wrapper
    • Choose callpath, PAPI counters, MPI profiling, PDT for source instrumentation
  • setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optPreProcess'
    • Selective instrumentation file eliminates instrumentation in lightweight routines
    • Pre-process Fortran source code using cpp before compiling
• Set runtime environment variables for instrumentation control and PAPI event counter selection in job submission script:
  • export TAU_THROTTLE=1
  • export COUNTER1=GET_TIME_OF_DAY
  • export COUNTER2=PAPI_FP_INS
  • export COUNTER3=PAPI_L1_DCM
  • export COUNTER4=PAPI_TOT_INS
  • export COUNTER5=PAPI_L2_DCM


Selective Instrumentation in TAU

% cat select.tau

BEGIN_EXCLUDE_LIST

MCADIF

GETRATES

TRANSPORT_M::MCAVIS_NEW

MCEDIF

MCACON

CKYTCP

THERMCHEM_M::MIXCP

THERMCHEM_M::MIXENTH

THERMCHEM_M::GIBBSENRG_ALL_DIMT

CKRHOY

MCEVAL4

THERMCHEM_M::HIS

THERMCHEM_M::CPS

THERMCHEM_M::ENTROPY

END_EXCLUDE_LIST

BEGIN_INSTRUMENT_SECTION

loops routine="#"

END_INSTRUMENT_SECTION


Getting Access to TAU on Jaguar

• set path=(/spin/proj/perc/TOOLS/tau_latest/x86_64/bin $path)
• Choose Stub Makefiles (TAU_MAKEFILE env. var.) from /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.*
  • Makefile.tau-mpi-pdt-pgi (flat profile)
  • Makefile.tau-mpi-pdt-pgi-trace (event trace, for use with Vampir)
  • Makefile.tau-callpath-mpi-pdt-pgi (single metric, callpath profile)
• Binaries of S3D can be found in: ~sameer/scratch/S3D-BINARIES
  • withtau
    • papi, multiplecounters, mpi, pdt, pgi options
  • without_tau


Concluding Discussion

• Performance tools must be used effectively
• More intelligent performance systems for productive use
  • Evolve to application-specific performance technology
  • Deal with scale by “full range” performance exploration
  • Autonomic and integrated tools
  • Knowledge-based and knowledge-driven process
• Performance observation methods do not necessarily need to change in a fundamental sense
  • More automatically controlled and efficiently used
• Develop next-generation tools and deliver to community
• Open source with support by ParaTools, Inc.
• http://www.cs.uoregon.edu/research/tau


Support Acknowledgements

Department of Energy (DOE)

• Office of Science
• LLNL, LANL, ORNL, ASC
• PERI
