Extreme Performance Engineering: Petascale and Heterogeneous Systems
Allen D. Malony
Department of Computer and Information Science, University of Oregon
UVSQ, October 21, 2010
Outline
Part 1: Scalable systems and productivity; role of performance knowledge; performance engineering claims; model-based performance engineering and expectations
Part 2: Introduction to the TAU performance system
Part 3: Performance data mining; heterogeneous performance measurement (TAUcuda); parallel performance monitoring (TAUmon)
Parallel Performance Engineering and Productivity
Scalable, optimized applications deliver on the promise of HPC; optimization comes through a performance engineering process
Understand performance complexity and inefficiencies Tune application to run optimally at scale
How to make the process more effective and productive? What performance technology should be used?
Performance technology part of larger environment Programmability, reusability, portability, robustness Application development and optimization productivity
Process, performance technology, and use will change as parallel systems evolve and grow to extreme scale
Goal is to deliver effective performance with high productivity value now and in the future
Parallel Performance Engineering Process

Traditionally an empirically-based approach: observation → experimentation → diagnosis → tuning (with characterization, hypotheses, and properties flowing between the stages)
Performance technology developed for each level:
Performance observation: instrumentation, measurement, analysis, visualization
Performance experimentation: experiment management, performance storage
Performance diagnosis: data mining, models, expert systems
Performance Technology and Scale
What does it mean for performance technology to scale? Instrumentation, measurement, analysis, tuning
Scaling complicates observation and analysis Performance data size and value
Standard approaches deliver a lot of data with little value
Measurement overhead, intrusion, and perturbation trade off with analysis accuracy; "noise" in the system distorts performance
Analysis complexity increases More events, larger data, more parallelism
Traditional empirical process breaks down at extremes
Performance Engineering Process and Scale
What will enhance productive application development? Address application (developer) requirements at scale Support application-specific performance views
What are the important events and performance metrics? Tied to application structure and computational model; tied to application domain and algorithms; process and tools must be more application-aware
Support online, dynamic performance analysis Complement static, offline assessment and adjustment Introspection of execution (performance) dynamics Requires scalable performance monitoring Integration with runtime infrastructure
Whole Performance Evaluation
Extreme scale performance is an optimized orchestration Application, processor, memory, network, I/O
Reductionist approaches to performance will be unable to support optimization and productivity objectives
An application-level-only performance view is myopic; the interplay of hardware, software, and system components ultimately determines how performance is delivered
Performance should be evaluated in toto Application and system components Understand effects of performance interactions Identify opportunities for optimization across levels
Need whole performance evaluation practice
Role of Intelligence, Automation, and Knowledge
Scale forces the process to become more intelligent
Even with intelligent, application-specific tools, deciding what to analyze is difficult
More automation and knowledge-based decision making
Build these capabilities into performance tools
Support broader experimentation methods and refinement Access and correlate data from several sources Automate performance data analysis / mining / learning Include predictive features and experiment refinement
Knowledge-driven adaptation and optimization guidance Address scale issues through increased expertise
Role of Structure in Performance Understanding
Dealing with the increased complexity of the problem (measurement and analysis) should not be the only focus
There is fundamental structure in the computational model and an application's behavior
(Shigeo Fukuda, "Lunch with a Helmet On," 1987)
Performance is a consequence of structural forms and behaviors
It is the relationship between performance data and application semantics that is important to understand; this need not be complex (not as complex as it seems)
Parallel Performance Diagnosis
Extreme Performance Engineering Claims
Claim 1: Current performance measurement and analysis techniques, merely "scaled up," are insufficient for extreme-scale (ES) performance diagnosis and tuning.
Claim 2: Performance of ES applications and systems should be evaluated in toto, to understand the effects of performance interactions and identify opportunities for optimization based on whole-performance engineering.
Claim 3: Adaptive ES applications might require performance monitoring, dynamic control, and tight coupling between the application and system at run time.
Claim 4: Optimization of ES applications requires model-level knowledge of parallel algorithms, parallel computation, system, and performance.
Model-based Performance Engineering
Performance engineering tools and practice must incorporate a performance knowledge discovery process
Focus on and automate performance problem identification and use to guide tuning decisions
Model-oriented knowledge Computational semantics of the application Symbolic models for algorithms Performance models for system architectures / components Use to define performance expectations
Higher-level abstraction for performance investigation: application developers can be more directly involved in the performance engineering process
Performance Expectations
Context for understanding the performance behavior of the application and system
Traditional performance analysis implicitly requires an expectation
Users are forced to reason from the perspective of absolute performance for every performance experiment and every application operation
Peak measures provide an absolute upper bound
Empirical data alone does not lead to performance understanding and is insufficient for optimization
Empirical measurements lack a context for determining whether the operations under consideration are performing well or poorly
Sources of Performance Expectations
Computation Model – The operational semantics of a parallel application, specified by its algorithms and structure, define a space of relevant performance behavior
Symbolic Performance Model – Symbolic or analytical model provides the relevant parameters for the performance model, which generate expectations
Historical Performance Data – Provides real execution history and empirical data on similar platforms
Relative Performance Data – Used to compare similar application operations across architectural components
Architectural parameters – Based on what is known of processor, memory, and communications performance
Model Use
Performance models derived from application knowledge, performance experiments, and symbolic analysis
They will be used to predict the performance of individual system components, such as communication and computation, as well as the application as a whole.
The models can then be compared to empirical measurements, manually or automatically, to pinpoint or dynamically rectify performance problems.
Further, these models can be evaluated at runtime in order to reduce perturbation and data management burdens, and to dynamically reconfigure system software.
Where are we now?
Large-scale performance measurement and analysis: TAU performance system, HPCToolkit; running on LCF machines
SciDAC PERI: automatic performance tuning, application / architecture modeling, tiger teams
Performance database: PERI-DB project, TAU PerfDMF
Where are we now? (2)
Performance data mining: TAU PerfExplorer (multi-experiment data mining analysis, scripting, inference)
SciDAC TASCS: Computational Quality of Service (CQoS), configuration, adaptation
DOE FastOS (with Argonne): Linux (ZeptoOS) performance measurement (KTAU), scalable performance monitoring
Where are we now? (3)
Language integration: OpenMP, UPC, Charm++
SciDAC FACETS: load balance modeling, automated performance testing
What do we need?
Performance knowledge engineering at all levels; support the building of models, represented in a form the tools can reason about
Understanding of “performance” interactions Between integrated components Control and data interactions
Properties of concurrency and dependencies With respect to scientific problem formulation Translate to performance requirements
More robust tools that are being used more broadly
TAU Performance System Components

[TAU architecture diagram: program analysis (PDT), parallel profile analysis (ParaProf, PerfDMF), performance data mining (PerfExplorer), performance monitoring (TAU over Supermon)]
Performance Data Management
Provide an open, flexible framework to support common data management tasks; foster multi-experiment performance evaluation
Extensible toolkit to promote integration and reuse across available performance tools (PerfDMF): originally designed to address critical TAU requirements; can import multiple profile formats; supports multiple DBMS (PostgreSQL, MySQL, ...); profile query and analysis API
Reference implementation for PERI-DB project
Metadata Collection
Integration of XML metadata for each parallel profile; three ways to incorporate metadata:
Measured hardware/system information (TAU, PERI-DB): CPU speed, memory in GB, MPI node IDs, ...
Application instrumentation (application-specific): TAU_METADATA() used to insert any name/value pair; application parameters, input data, domain decomposition
PerfDMF data management tools can incorporate an XML file of additional metadata: compiler flags, submission scripts, input files, ...
Metadata can be imported from / exported to PERI-DB
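As a sketch of the name/value metadata idea, a small script can generate such an XML file; the element names below are illustrative, not the actual PerfDMF schema:

```python
import xml.etree.ElementTree as ET

def make_metadata_xml(pairs):
    """Build a small XML document of name/value metadata pairs.
    Element names here are illustrative, not the PerfDMF schema."""
    root = ET.Element("metadata")
    for name, value in pairs.items():
        attr = ET.SubElement(root, "attribute")
        ET.SubElement(attr, "name").text = name
        ET.SubElement(attr, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")

# hypothetical metadata pairs like those named on the slide
xml_doc = make_metadata_xml({"Compiler Flags": "-O3", "Input File": "c2h4.in"})
```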
Performance Data Mining / Analytics
Conduct systematic and scalable analysis process Multi-experiment performance analysis Support automation, collaboration, and reuse
Performance knowledge discovery framework: data mining analysis applied to parallel performance data (comparative, clustering, correlation, dimension reduction, ...); uses the existing TAU infrastructure
PerfExplorer v1 performance data mining framework: multiple experiments and parametric studies; integrates available statistics and data mining packages (Weka, R, Matlab / Octave); apply data mining operations in an interactive environment
How to explain performance?
Should not just redescribe the performance results; should explain performance phenomena
What are the causes for the performance observed? What are the factors and how do they interrelate? Performance analytics, forensics, and decision support
Need to add knowledge to do more intelligent things
Automated analysis needs good, informed feedback: iterative tuning, performance regression testing; performance model generation requires interpretation
We need better methods and tools for Integrating meta-information Knowledge-based performance problem solving
PerfExplorer 2.0
Performance data mining framework Parallel profile performance data and metadata
Programmable, extensible workflow automation Rule-based inference for expert system analysis
Automation & Knowledge Engineering
S3D Scalability Study
S3D flow solver for simulation of turbulent combustion; target application of the DOE SciDAC PERI tiger team
Performance scaling study (C2H4 benchmark) Cray XT3+XT4 hybrid, XT4 (ORNL, Jaguar) IBM BG/P (ANL, Intrepid) Weak scaling (1 to 12000 cores) Evaluate scaling of code regions and MPI
Demonstration of scalable performance measurement, analysis, and visualization
Understanding scalability Requires environment for performance investigation Performance factors relative to main events
S3D Computational Structure

Domain decomposition with wavefront evaluation and recursion dependences in all 3 grid directions
Communication affected by cell location
4x4 example: corner cells have 2 neighbors, edge cells 3, center cells 4
(Data: Sweep3D on Linux cluster, 16 processes)
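The cell-location effect can be made concrete with a short sketch (a hypothetical helper, not part of TAU or S3D) that counts nearest-neighbor communication partners in a 2D decomposition:

```python
def neighbor_count(rank, nx, ny):
    """Number of nearest-neighbor communication partners for a cell
    in an nx-by-ny 2D domain decomposition (no periodic boundaries)."""
    x, y = rank % nx, rank // nx
    count = 0
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= x + dx < nx and 0 <= y + dy < ny:
            count += 1
    return count

# the slide's 4x4 example: corners have 2 neighbors, edges 3, center cells 4
counts = [neighbor_count(r, 4, 4) for r in range(16)]
```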
S3D on a Hybrid System (Cray XT3 + XT4)

6400-core execution on the transitional Jaguar configuration (ORNL Jaguar, Cray XT3/XT4)
[Full parallel profile view; MPI_Wait highlighted]
Zoom View of Parallel Profile (S3D, XT3+XT4)
Gap represents XT3 nodes; MPI_Wait takes less time, other routines take more time
Scatterplot (S3D, XT3+XT4, 6400 cores)

Scatterplot of top three routines
Process metadata is used to color performance by machine type (2 clusters)
Memory speed accounts for the performance difference!
S3D Run on XT4 Only
Better balance across nodes More uniform performance
Capturing Analysis and Inference Knowledge
Create a PerfExplorer workflow for the S3D case study:
Load data → extract non-callpath data → extract top 5 events → load metadata → correlate events with metadata → process inference rules
PerfExplorer Workflow Applied to S3D Performance
--------------- JPython test script start ------------
doing single trial analysis for sweep3d on jaguar
Loading Rules...
Reading rules: rules/GeneralRules.drl... done.
Reading rules: rules/ApplicationRules.drl... done.
Reading rules: rules/MachineRules.drl... done.
loading the data...
Getting top 10 events (sorted by exclusive time)...
Firing rules...
MPI_Recv(): "CALLS" metric is correlated with the metadata field "total neighbors". The correlation is 1.0 (direct).
MPI_Send(): "CALLS" metric is correlated with the metadata field "total neighbors". The correlation is 1.0 (direct).
MPI_Send(): "P_WALL_CLOCK_TIME:EXCLUSIVE" metric is correlated with the metadata field "total neighbors". The correlation is 0.8596 (moderate).
SOURCE [{source.f} {2,18}]: "PAPI_FP_INS:EXCLUSIVE" metric is inversely correlated with the metadata field "Memory Speed (MB/s)". The correlation is -0.9792 (very high).
SOURCE [{source.f} {2,18}]: "PAPI_FP_INS:EXCLUSIVE" metric is inversely correlated with the metadata field "Seastar Speed (MB/s)". The correlation is -0.9785 (very high).
SOURCE [{source.f} {2,18}]: "PAPI_L1_TCA:EXCLUSIVE" metric is inversely correlated with the metadata field "Memory Speed (MB/s)". The correlation is -0.9819 (very high).
SOURCE [{source.f} {2,18}]: "PAPI_L1_TCA:EXCLUSIVE" metric is inversely correlated with the metadata field "Seastar Speed (MB/s)". The correlation is -0.9810 (very high).
SOURCE [{source.f} {2,18}]: "PAPI_L2_TCM:EXCLUSIVE" metric is correlated with the metadata field "Memory Speed (MB/s)". The correlation is 0.9985 (very high).
SOURCE [{source.f} {2,18}]: "PAPI_L2_TCM:EXCLUSIVE" metric is correlated with the metadata field "Seastar Speed (MB/s)". The correlation is 0.9964 (very high).
SOURCE [{source.f} {2,18}]: "P_WALL_CLOCK_TIME:EXCLUSIVE" metric is inversely correlated with the metadata field "Memory Speed (MB/s)". The correlation is -0.9980 (very high).
SOURCE [{source.f} {2,18}]: "P_WALL_CLOCK_TIME:EXCLUSIVE" metric is inversely correlated with the metadata field "Seastar Speed (MB/s)". The correlation is -0.9960 (very high).
...done with rules.
---------------- JPython test script end -------------
Identified hardware differences
Correlated communication behavior with metadata
Cray-specific inference
S3D Weak Scaling (Cray XT4, Jaguar)
C2H4 benchmark
S3D Weak Scaling (IBM BG/P, Intrepid)

C2H4 benchmark (Jaguar vs. Intrepid comparison)
S3D Total Runtime Breakdown by Events (Jaguar)
S3D FP Instructions / L1 Data Cache Miss (Jaguar)
S3D Total Runtime Breakdown by Events (Intrepid)
S3D Relative Efficiency / Speedup by Event (Jaguar)
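For reference, one common way to compute the relative metrics plotted in these weak-scaling slides; this is an assumption here, since the slides do not spell out the formula:

```python
def relative_efficiency(base_time, time_at_p):
    """Relative efficiency against the smallest run: under weak scaling,
    ideal behavior keeps the time constant, so efficiency = T(base)/T(p)."""
    return base_time / time_at_p

def relative_speedup(base_cores, cores, base_time, time_at_p):
    """Relative speedup normalized to the base core count."""
    return (cores / base_cores) * relative_efficiency(base_time, time_at_p)

# hypothetical per-event times (seconds) at 1, 1000, and 12000 cores
times = {1: 100.0, 1000: 110.0, 12000: 125.0}
eff = {p: relative_efficiency(times[1], t) for p, t in times.items()}
```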
S3D Relative Efficiency by Event (Intrepid)
Relative Speedup by Event (Intrepid)
S3D Event Correlation to Total Time (Jaguar)
r = 1 implies direct correlation
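The correlation values in these slides are Pearson coefficients; a minimal sketch of the computation (an illustration, not PerfExplorer's code):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient: r = 1 is a perfect direct
    correlation, r = -1 a perfect inverse one."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```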
S3D Event Correlation to Total Time (Intrepid)
S3D 3D Correlation Cube (Intrepid, MPI_Wait)
GETRATES_I vs. RATI_I vs. RATX_I vs. MPI_Wait
Heterogeneous Parallel Systems and Performance
Heterogeneous parallel systems are highly relevant today Heterogeneous hardware technology more accessible
Multicore processors (e.g., 4-core, 6-core, 8-core, 16-core) Manycore (throughput) accelerators (e.g., Tesla, Fermi) High-performance engines (e.g., Intel) Special purpose components (e.g., FPGAs)
Performance is the main driving concern Heterogeneity is an important (the?) path to extreme scale
Heterogeneous software technology to get performance: more sophisticated parallel programming environments; integrated parallel performance tools that support heterogeneous performance models and perspectives
Implications for Parallel Performance Tools
The status quo is somewhat comfortable: mostly homogeneous parallel systems and software; shared-memory multithreading (OpenMP); distributed-memory message passing (MPI)
Parallel computational models are relatively stable (simple) Corresponding performance models are relatively tractable Parallel performance tools can keep up and evolve
Heterogeneity creates richer computational potential Results in greater performance diversity and complexity
Heterogeneous systems will utilize more sophisticated programming and runtime environments
Performance tools have to support richer computation models and more versatile performance perspectives
Heterogeneous Performance Views
Want to create performance views that capture heterogeneous concurrency and execution behavior: reflect interactions between heterogeneous components; capture performance semantics relative to the computation model; assimilate performance from all execution paths into a shared view
Existing parallel performance tools are CPU (host)-centric: event-based sampling (not appropriate for accelerators), direct measurement (through instrumentation of events)
What perspective does the host have of other components? It determines the semantics of the measurement data and the assumptions about behavior and interactions
Performance views may have to work with reduced data
Task-based Performance View
Consider the "task" abstraction for the GPU accelerator scenario
Host regards external execution as a task
Tasks operate concurrently with respect to the host
Requires support for tracking asynchronous execution
Host creates a measurement perspective for the external task: maintains local and remote performance data; tasks may have limited measurement support and may depend on the host for performance data I/O; performance data might be received from the external task
How to create a view of heterogeneous external performance?
TAUcuda Performance Measurement (Version 2)
Performance measurement of heterogeneous applications using CUDA
CUDA system architecture implemented by CUDA libraries: driver and device (cuXXX) libraries, runtime (cudaYYY) library
Tool support (Parallel Nsight (Nexus), CUDA Profiler) is not intended to integrate with other HPC performance tools
TAUcuda (v2) built on an experimental Linux CUDA driver; the Linux CUDA driver R190.86 supports a callback interface!
Currently working with NVIDIA to develop version with production performance tools API
TAUcuda Architecture
[Diagram: TAU events and TAUcuda events in the TAUcuda architecture]
TAUcuda Instrumentation
[Figure: TAUcuda events and TAU events in instrumented code]
TAUcuda Experimentation Environments
University of Oregon Linux workstation: dual quad-core Intel Xeon, GTX 280
GPU cluster (Mist): four dual quad-core Intel Xeon server nodes, two NVIDIA S1070 Tesla servers (4 Tesla GPUs per S1070)
Argonne National Laboratory (Eureka): 100 dual quad-core nodes with NVIDIA Quadro Plex S4, 200 Quadro FX5600 (2 per S4)
University of Illinois at Urbana-Champaign GPU cluster (AC cluster): 32 nodes, each with one S1070 (4 GPUs per node)
CUDA Linpack Profile (4 processes, 4 GPUs)
GPU-accelerated Linpack benchmark (NVIDIA)
CUDA Linpack Trace
MPI communication (yellow), CUDA memory transfer (white)
SHOC Stencil2D (512 iterations, 4 CPUxGPU)
Scalable HeterOgeneous Computing (SHOC) benchmark suite: CUDA / OpenCL kernels and microbenchmarks (ORNL)
CUDA memory transfer (white)
Scalable Parallel Performance Monitoring
Performance problem analysis is increasingly complex Multi-core, heterogeneous, and extreme scale computing Adaptive algorithms and runtime application tuning Performance dynamics variability within/between executions
Neo-performance measurement and analysis perspective: static, offline analysis → dynamic, online analysis; scalable runtime analysis of parallel performance data; performance feedback to the application for adaptive control; integrated performance monitoring (measurement + query)
Co-allocation of additional (tool specific) system resources
Goal: scalable, integrated parallel performance monitoring
Parallel Performance Measurement and Data
Parallel performance tools measure concurrently Scaling dictates “local” measurements (profile, trace)
save data with "local context" (processes or threads) Done without synchronization or central control
Parallel performance state is globally distributed as a result Logically part of application’s global data space Offline: outputs data at end for post-mortem analysis Online: access to performance state for runtime analysis
Definition: monitoring is online access to the parallel performance (data) state; it may or may not involve runtime analysis
Monitoring for Performance Dynamics
Runtime access to parallel performance data Scalable and lightweight Raises concerns of overhead and intrusion Support for performance-adaptive, dynamic applications
Alternative 1: Extend existing performance measurement Create own integrated monitoring infrastructure Disadvantage: maintain own monitoring framework
Alternative 2: Couple with other monitoring infrastructure Leverage scalable middleware from other supported projects Challenge: measurement system / monitor integration
Performance Dynamics: Parallel Profile Snapshots
Profile snapshots are parallel profiles recorded at runtime Shows performance profile dynamics (all types allowed)
[Chart: information vs. overhead; traces provide the most information at the highest overhead, profiles the least, with profile snapshots in between]
A. Morris, W. Spear, A. Malony, and S. Shende, “Observing Performance Dynamics using Parallel Profile Snapshots,” European Conference on Parallel Processing (EuroPar), 2008.
Parallel Profile Snapshots of FLASH 3.0 (UIC)
Simulation of astrophysical thermonuclear flashes Snapshots show profile differences since last snapshot
Captures all events since the beginning, per thread
Mean profile calculated post-mortem
Highlights changes in performance per iteration and at checkpointing
(Figure phases: initialization, checkpointing, finalization)
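Since snapshots accumulate from the beginning of the run, per-interval behavior comes from differencing consecutive snapshots; a minimal sketch, assuming cumulative per-event times (the data layout is illustrative, not TAU's):

```python
def snapshot_deltas(snapshots):
    """Given cumulative per-event times for a sequence of profile
    snapshots, return the per-interval deltas (what happened between
    consecutive snapshots)."""
    deltas = []
    prev = {}
    for snap in snapshots:
        deltas.append({ev: t - prev.get(ev, 0.0) for ev, t in snap.items()})
        prev = snap
    return deltas

# hypothetical cumulative times for two events over three snapshots
snaps = [{"MPI_Wait": 1.0, "solve": 4.0},
         {"MPI_Wait": 1.5, "solve": 9.0},
         {"MPI_Wait": 3.5, "solve": 13.0}]
d = snapshot_deltas(snaps)
```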
FLASH 3.0 Performance Dynamics (Periodic)
INTRFC
TAUmon: Design
Design a transport-neutral application monitoring framework
Based on prior / existing work with various transport systems: Supermon, MRNet, MPI
Enable efficient development of monitoring functionality
Objectives: scalable access to a running application's performance, both at the end of the application (before parallel teardown) and while the application is still running
Support for scalable performance data analysis: reduction, statistical evaluation
Feedback (data, control) to the application
Monitoring engineering and performance efficiency issues
TAUmon: Architecture
[Architecture diagram: MPI processes 0 .. P-1, each with per-thread TAU profiles, connected through MPI to the TAUmon monitoring infrastructure]
TAUmon: Current Usage
TAU_ONLINE_DUMP() collective operation in the application: called by all threads / processes (originally to output profiles); arguments specify the data analysis operation (future)
Appropriate version of TAU selected for transport system TAUmonMRnet: TAUmon using MRNet infrastructure TAUmonMPI: TAUmon using MPI infrastructure
User instruments application with TAU support for desired monitoring transport system (temporary)
User submits the instrumented application to the parallel job system
Other launch systems must be submitted along with the application to the job scheduler as needed (different machine-specific job-submission scripts)
TAUmon: Parallel Profile Data Analysis
Total parallel profile data size depends on: # events × size of event × # execution threads; event size depends on # metrics. Example: 200 events × 100 bytes × 64,000 threads = 1.28 GB
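The arithmetic above can be sketched directly (an illustrative helper, not a TAU API):

```python
def profile_data_size(num_events, bytes_per_event, num_threads, num_metrics=1):
    """Total parallel profile size in bytes: events x per-event size x
    threads, with the per-event size scaling in the number of metrics."""
    return num_events * bytes_per_event * num_metrics * num_threads

# the slide's example: 200 events * 100 bytes * 64,000 threads = 1.28 GB
size = profile_data_size(200, 100, 64_000)
```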
Monitoring operations: periodic profile data output (à la profile snapshots); event unification; basic statistics: mean, min/max, standard deviation, ...; advanced statistics: histogram, clustering, ...
Strong motivation to implement the operations in parallel
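As a serial sketch of the basic statistics named above (TAUmon computes these as reductions over its transport; this illustration is not its implementation):

```python
from math import sqrt

def profile_statistics(per_rank_times):
    """Mean, min, max, and (population) standard deviation of one
    event's time across ranks -- the kind of reduction a monitoring
    transport would compute in parallel, done serially here."""
    n = len(per_rank_times)
    mean = sum(per_rank_times) / n
    var = sum((t - mean) ** 2 for t in per_rank_times) / n
    return {"mean": mean, "min": min(per_rank_times),
            "max": max(per_rank_times), "stddev": sqrt(var)}
```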
TAUmonMRNet (a.k.a. ToM Revisited)
TAU over MRNet (ToM) Previously working with MRNet 2.1 (Cluster 2008 paper) 1-phase and 3-phase filters Explore overlay network with different span out (nodes)
TAUmon re-engineered for MRNet 3.0 (just released!)
Re-implements ToM functionality using new MRNet support
Current implementation uses a pre-release MRNet 3.0; testing with the released version
Monitoring Operation with MRNet
Application collectively invokes TAU_ONLINE_DUMP() to start monitoring operations using current performance information
TAU data is accessed and sent through MRNet's communication API via streams and filters
Filters perform appropriate aggregation operations on the data
The TAU frontend is responsible for collecting the data, storing it, and eventual delivery to a consumer
TAUmonMPI
Use MPI-based transport: no separate launch mechanisms
Parallel gather operations implemented as a binomial heap with staged MPI point-to-point calls (rank 0 serves as root)
Current limitations: application shares parallel resources with the monitoring transport; monitoring operations may cause performance intrusion; no user control of transport network configuration
Potential advantages: easy to use; could be more robust overall
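The staged point-to-point pattern can be sketched as a communication schedule; this is an illustration of the binomial-tree idea, not TAUmonMPI's actual code:

```python
def gather_schedule(nranks):
    """Staged point-to-point gather schedule toward rank 0 (binomial
    tree): at stage s, every rank whose lowest set bit is s sends its
    accumulated data to the rank with that bit cleared."""
    stages = []
    s = 1
    while s < nranks:
        sends = [(r, r - s) for r in range(nranks)
                 if r & s and not r & (s - 1)]
        stages.append(sends)
        s <<= 1
    return stages
```

With 8 ranks this yields log2(8) = 3 stages, and every rank's data reaches rank 0.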
TAUmon Experiments: PFLOTRAN
Predictive modeling of subsurface reactive flows
Machines: ORNL Jaguar and UTK Kraken (Cray XT5)
Processor counts: 16,380 cores and 131K cores; 12K (interactive) scaling
Instrumentation (source, PMPI): full: 1131 events total, lots of small routines; partial: 1% exclusive + all MPI, 68 events total (44 MPI, 19 PETSc), with and without callpaths
Measurements (PAPI): execution time (TOT_CYC); counters: FP_OPS, TOT_INS, L1 DCA/DCM, L2 TCA/TCM, RES_STL
PFLOTRAN Profile (Exclusive, 16,380 cores)
[TAU ParaProf profile view; top events include MPI_Allreduce, MPI_Waitany, KSPSolve, oursnesjacobian, ...]
TAUmonMPI Event Unification (Cray XT5)
TAU unification and merge time
TAUmonMPI Scaling (PFLOTRAN, Cray XT5)
New histogram timings: 12288 cores: 0.8643 seconds; 24576 cores: 0.6238 seconds
TAUmonMRnet Scaling (PFLOTRAN, Cray XT5)
TAUmonMPI Scaling (PFLOTRAN, BG/P)
TAUmonMRnet: Snapshot (PFLOTRAN, Cray XT5)
4104 cores running with 374 extra cores for MRNet Each line bar shows the mean profile of an iteration
TAUmonMRnet: Snapshot (PFLOTRAN, Cray XT5)
Frames (iterations) 12, 17, 21 of a 12K-core PFLOTRAN execution
Shifts in the order of events sorted by average value over time
TAUmonMRnet Snapshot (FLASH, Cray XT5)
Sod 2D, 1,536 Cray XT5 cores, over 200 iterations, 15 maximum levels of refinement
MPI_Alltoall plateaus correspond to AMR refinement
TAUmonMRnet Clustering (FLASH, Cray XT5)
[Cluster mean profiles; dominant events per cluster include MPI_Init, MPI_Alltoall, MPI_Allreduce, COMPRESS_LIST, DRIVER_COMPUTEDT]
High-Productivity Performance Engineering (POINT)
Testbed apps: ENZO, NAMD, NEMO3D
Model Oriented Global Optimization (MOGO)
Empirical performance data evaluated with respect to performance expectations at various levels of abstraction
Performance Refactoring (PRIMA) (UO, Juelich)
Integration of instrumentation and measurement core infrastructure
Focus on TAU and Scalasca (University of Oregon, Research Centre Juelich)
Refactor instrumentation, measurement, and analysis; build next-generation tools on a new common foundation
Extend to involve the SILC project (Juelich, TU Dresden, TU Munich)
Heterogeneous Exascale Software (Vancouver)
DOE X-stack program
Partners: Oak Ridge National Laboratory, University of Oregon, University of Illinois, Georgia Institute of Technology
Components: compilers; scheduling and runtime resource management; libraries; performance measurement, analysis, and modeling
Call for “Extreme” Performance Engineering
Improve the performance problem-solving process
Strategy to respond to scaling challenges and technology changes / disruptions
Evolve process and technology; build on robust, integrated performance infrastructure; carry forward performance expertise and knowledge
Model-oriented with knowledge-based reasoning Community-driven knowledge engineering Automated data / decision analytics
Closer integration with application development and execution environment
Support Acknowledgements
Department of Energy (DOE): Office of Science, ASC/NNSA
Department of Defense (DoD): HPC Modernization Office (HPCMO)
NSF: Software Development for Cyberinfrastructure (SDCI)
Research Centre Juelich
Argonne National Laboratory
Technical University Dresden
ParaTools, Inc.
NVIDIA