Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of...

85
Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon

Transcript of Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of...

Page 1: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

Extreme Performance Engineering:Petascale and Heterogeneous Systems

Allen D. Malony

Department of Computer and Information ScienceUniversity of Oregon

Page 2: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Outline

Part 1 Scalable systems and productivity Role of performance knowledge Performance engineering claims Model-based performance engineering and expectations

Part 2 Introduction to the TAU performance system

Part 3 Performance data mining Heterogeneous performance measurement (TAUcuda) Parallel performance monitoring (TAUmon)

Page 3: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Parallel Performance Engineering and Productivity

Scalable, optimized applications deliver HPC promise Optimization through performance engineering process

Understand performance complexity and inefficiencies Tune application to run optimally at scale

How to make the process more effective and productive? What performance technology should be used?

Performance technology part of larger environment Programmability, reusability, portability, robustness Application development and optimization productivity

Process, performance technology, and use will change as parallel systems evolve and grow to extreme scale

Goal is to deliver effective performance with high productivity value now and in the future

Page 4: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Traditionally an empirically-based approachobservation experimentation diagnosis tuning

Performance technology developed for each level

characterization

PerformanceTuning

PerformanceDiagnosis

PerformanceExperimentation

PerformanceObservation

hypotheses

properties

• Instrumentation• Measurement• Analysis• Visualization

PerformanceTechnology

• Experimentmanagement

• Performancestorage

PerformanceTechnology

• Data mining• Models• Expert systems

PerformanceTechnology

Parallel Performance Engineering Process

Page 5: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Performance Technology and Scale

What does it mean for performance technology to scale? Instrumentation, measurement, analysis, tuning

Scaling complicates observation and analysis Performance data size and value

standard approaches deliver a lot of data with little value Measurement overhead, intrusion, perturbation, noise

measurement overhead intrusion perturbation tradeoff with analysis accuracy “noise” in the system distorts performance

Analysis complexity increases More events, larger data, more parallelism

Traditional empirical process breaks down at extremes

Page 6: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Performance Engineering Process and Scale

What will enhance productive application development? Address application (developer) requirements at scale Support application-specific performance views

What are the important events and performance metrics Tied to application structure and computational model Tied to application domain and algorithms Process and tools must be more application-aware

Support online, dynamic performance analysis Complement static, offline assessment and adjustment Introspection of execution (performance) dynamics Requires scalable performance monitoring Integration with runtime infrastructure

Page 7: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Whole Performance Evaluation

Extreme scale performance is an optimized orchestration Application, processor, memory, network, I/O

Reductionist approaches to performance will be unable to support optimization and productivity objectives

Application-level only performance view is myopic Interplay of hardware, software, and system components Ultimately determines how performance is delivered

Performance should be evaluated in toto Application and system components Understand effects of performance interactions Identify opportunities for optimization across levels

Need whole performance evaluation practice

Page 8: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Role of Intelligence, Automation, and Knowledge

Scale forces the process to become more intelligent Even with intelligent and application-specific tools, the

decisions of what to analyze is difficult More automation and knowledge-based decision making Build these capabilities into performance tools

Support broader experimentation methods and refinement Access and correlate data from several sources Automate performance data analysis / mining / learning Include predictive features and experiment refinement

Knowledge-driven adaptation and optimization guidance Address scale issues through increased expertise

Page 9: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Role of Structure in Performance Understanding

Dealing with increasedcomplexity of the problem(measurement and analysis)should not be the only focus

There is fundamental structurein the computational modeland an application's behavior

Shigeo Fukuda, 1987Lunch with a Helmut On

Performance is a consequenceof structural forms and behaviors

It is the relationship between performance data and application semantics that is important to understand This need not be complex (not as complex as it seems)

Page 10: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Parallel Performance Diagnosis

Page 11: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Extreme Performance Engineering Claims

Claim 1: “Scaling up” current performance measurement and analysis techniques are insufficient for extreme scale (ES) performance diagnosis and tuning.

Claim 2: Performance of ES applications and systems should be evaluated in toto, to understand the effects of performance interactions and identify opportunities for optimization based on whole-performance engineering.

Claim 3: Adaptive ES applications might require performance monitoring, dynamic control, and tight coupling between the application and system at run time.

Claim 4: Optimization of ES applications requires model-level knowledge of parallel algorithms, parallel computation, system, and performance.

Page 12: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Models-based Performance Engineering

Performance engineering tools and practice must incorporate a performance knowledge discovery process

Focus on and automate performance problem identification and use to guide tuning decisions

Model-oriented knowledge Computational semantics of the application Symbolic models for algorithms Performance models for system architectures / components Use to define performance expectations

Higher-level abstraction for performance investigation Application developers can be more directly involved in

the performance engineering process

Page 13: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Performance Expectations

Context for understanding the performance behavior of the application and system

Traditional performance implicitly requires an expectation Users are forced to reason from the perspective of absolute

performance for every performance experiment and every application operation

Peak measures provide an absolute upper bound Empirical data alone does not lead to performance

understanding and is insufficient for optimization Empirical measurements lack a context for determining

whether the operations under consideration are performing well or performing poorly

Page 14: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Sources of Performance Expectations

Computation Model – Operational semantics of a parallel application specified by its algorithms and structure define a space of relevant performance behavior

Symbolic Performance Model – Symbolic or analytical model provides the relevant parameters for the performance model, which generate expectations

Historical Performance Data – Provides real execution history and empirical data on similar platforms

Relative Performance Data – Used to compare similar application operations across architectural components

Architectural parameters – Based on what is known of processor, memory, and communications performance

Page 15: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Model Use

Performance models derived from application knowledge, performance experiments, and symbolic analysis

They will be used to predict the performance of individual system components, such as communication and computation, as well as the application as a whole.

The models can then be compared to empirical measurements—manually or automatically—to pin point or dynamically rectify performance problems.

Further, these models can be evaluated at runtime in order to reduce perturbation and data management burdens, and to dynamically reconfigure system software.

Page 16: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Where are we now?

Large-scale performance measurement and analysis TAU performance system HPCToolkit Running on LCF machines

SciDAC PERI Automatic performance tuning Application / architecture modeling Tiger teams

Performance database PERI DB project TAU PerfDMF

Page 17: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Where are we now? (2)

Performance data mining TAU PerfExplorer

multi-experiment data mining analysis scripting, inference

SciDAC TASCS Computational Quality

of Service (CQoS) Configuration, adaptation

DOE FastOS (with Argonne) Linux (ZeptoOS) performance

measurement (KTAU) Scalable performance monitoring

Page 18: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Where are we now? (3)

Language integration OpenMP UPC Charm++

SciDAC FACETS Load balance modeling Automated performance testing

Projections TAU

Page 19: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

What do we need?

Performance knowledge engineering At all levels Support the building of models Represented in form the tools can reason about

Understanding of “performance” interactions Between integrated components Control and data interactions

Properties of concurrency and dependencies With respect to scientific problem formulation Translate to performance requirements

More robust tools that are being used more broadly

Page 20: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

TAU Performance System ComponentsTAU Architecture Program Analysis

Parallel Profile Analysis

PD

TP

erfD

MF

Par

aPro

f

Performance Data Mining

Performance Monitoring

TA

Uov

erS

upe

rmon

PerfExplorer

Page 21: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Performance Data Management

Provide an open, flexible framework to support common data management tasks Foster multi-experiment performance evaluation

Extensible toolkit to promote integration and reuse across available performance tools (PerfDMF) Originally designed to address critical TAU requirements Can import multiple profile formats Supports multiple DBMS: PostgreSQL, MySQL, ... Profile query and analysis API

Reference implementation for PERI-DB project

Page 22: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Metadata Collection

Integration of XML metadata for each parallel profile Three ways to incorporate metadata

Measured hardware/system information (TAU, PERI-DB) CPU speed, memory in GB, MPI node IDs, …

Application instrumentation (application-specific) TAU_METADATA() used to insert any name/value pair Application parameters, input data, domain decomposition

PerfDMF data management tools can incorporate an XML file of additional metadata Compiler flags, submission scripts, input files, …

Metadata can be imported from / exported to PERI-DB

Page 23: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Performance Data Mining / Analytics

Conduct systematic and scalable analysis process Multi-experiment performance analysis Support automation, collaboration, and reuse

Performance knowledge discovery framework Data mining analysis applied to parallel performance data

comparative, clustering, correlation, dimension reduction, … Use the existing TAU infrastructure

PerfExplorer v1 performance data mining framework Multiple experiments and parametric studies Integrate available statistics and data mining packages

Weka, R, Matlab / Octave Apply data mining operations in interactive enviroment

Page 24: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

How to explain performance?

Should not just redescribe the performance results Should explain performance phenomena

What are the causes for performance observed? What are the factors and how do they interrelate? Performance analytics, forensics, and decision support

Need to add knowledge to do more intelligent things Automated analysis needs good informed feedback

iterative tuning, performance regression testing Performance model generation requires interpretation

We need better methods and tools for Integrating meta-information Knowledge-based performance problem solving

Page 25: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

PerfExplorer 2.0

Performance data mining framework Parallel profile performance data and metadata

Programmable, extensible workflow automation Rule-based inference for expert system analysis

Page 26: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Automation & Knowledge Engineering

Page 27: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

S3D Scalability Study

S3D flow solver for simulation of turbulent combustion Targeted application by DOE SciDAC PERI tiger team

Performance scaling study (C2H4 benchmark) Cray XT3+XT4 hybrid, XT4 (ORNL, Jaguar) IBM BG/P (ANL, Intrepid) Weak scaling (1 to 12000 cores) Evaluate scaling of code regions and MPI

Demonstration of scalable performance measurement, analysis, and visualization

Understanding scalability Requires environment for performance investigation Performance factors relative to main events

Page 28: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

1212 1313 1414 1515

88 99 1010 1111

44 55 66 77

00 11 22 33

2 neighbors

3 neighbors

4 neighbors

4x4 example:

Center cells

Corner cellsData: Sweep3D on Linux Cluster, 16 processes

S3D Computational Structure

Domain decompositionwith wavefront evaluationand recursion dependencesin all 3 grid directions

Communication affectedby cell location

Page 29: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

ORNL Jaguar* Cray XT3/XT4* 6400 cores

S3D on a Hybrid System (Cray XT3 + XT4)

6400 core execution on transitional Jaguar configuration

MPI_Wait

Page 30: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Zoom View of Parallel Profile (S3D, XT3+XT4)

Gap represents XT3 nodes MPI_Wait takes less time, other routines take more time

Page 31: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

6400 cores

Scatterplot (S3D, XT3+XT4)

Scatterplot oftop threeroutines

Processmetadata isused to colorperformanceto machinetype (2 clusters)

Memory speedaccounts forperformancedifference!!!

Page 32: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

S3D Run on XT4 Only

Better balance across nodes More uniform performance

Page 33: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Capturing Analysis and Inference Knowledge

Create PerfExplorer workflow for S3D case study?

Load DataExtractNon-callpath data

Extract top5 events

Load metadata

Correlate eventsWith metadata

Processinference rules

Page 34: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

PerfExplorer Workflow Applied to S3D Performance

--------------- JPython test script start ------------doing single trial analysis for sweep3d on jaguarLoading Rules...Reading rules: rules/GeneralRules.drl... done.Reading rules: rules/ApplicationRules.drl... done.Reading rules: rules/MachineRules.drl... done.loading the data...Getting top 10 events (sorted by exclusive time)...Firing rules...MPI_Recv(): "CALLS" metric is correlated with the metadata field "total neighbors". The correlation is 1.0 (direct).MPI_Send(): "CALLS"metric is correlated with the metadata field "total Neighbors". The correlation is 1.0 (direct).MPI_Send(): "P_WALL_CLOCK_TIME:EXCLUSIVE" metric is correlated with the metadata field "total neighbors". The correlation is 0.8596 (moderate).SOURCE [{source.f} {2,18}]: "PAPI_FP_INS:EXCLUSIVE" metric is inversely correlated with the metadata field "Memory Speed (MB/s)". The correlation is -0.9792 (very high).SOURCE [{source.f} {2,18}]: "PAPI_FP_INS:EXCLUSIVE" metric is inversely correlated with the metadata field "Seastar Speed (MB/s)". The correlation is -0.9785258663321764 (very high).SOURCE [{source.f} {2,18}]: "PAPI_L1_TCA:EXCLUSIVE" metric is inversely correlated with the metadata field "Memory Speed (MB/s)”. The correlation is -0.9818810020169854 (very high).SOURCE [{source.f} {2,18}]: "PAPI_L1_TCA:EXCLUSIVE" metric is inversely correlated with the metadata field "Seastar Speed (MB/s)”. The correlation is -0.9810373923601381 (very high).SOURCE [{source.f} {2,18}]: "PAPI_L2_TCM:EXCLUSIVE" metric is inversely correlated with the metadata field "Memory Speed (MB/s)”. The correlation is 0.9985297567878844 (very high).SOURCE [{source.f} {2,18}]: "PAPI_L2_TCM:EXCLUSIVE" metric is inversely correlated with the metadata field "Seastar Speed (MB/s)”. The correlation is 0.996415213842904 (very high).SOURCE [{source.f} {2,18}]: "P_WALL_CLOCK_TIME:EXCLUSIVE" metric is inversely correlated with the metadata field "Memory Speed (MB/s)”. The correlation is -0.9980107779462387 (very high).SOURCE [{source.f} {2,18}]: "P_WALL_CLOCK_TIME:EXCLUSIVE" metric is inversely correlated with the metadata field "Seastar Speed (MB/s)”. The correlation is -0.9959749668655212 (very high)....done with rules.---------------- JPython test script end -------------

Identifiedhardwaredifferences

Correlated communicationbehavior with metadata

Cray-specific inference

Page 35: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

S3D Weak Scaling (Cray XT4, Jaguar)

C2H4 benchmark

Page 36: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

S3D Weak Scaling (IBM BG/P, Intrepid)

C2H4 benchmark

JaguarIntrepid

Page 37: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

S3D Total Runtime Breakdown by Events (Jaguar)

Page 38: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

S3D FP Instructions / L1 Data Cache Miss (Jaguar)

Page 39: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

S3D Total Runtime Breakdown by Events (Intrepid)

Page 40: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

S3D Relative Efficiency / Speedup by Event (Jaguar)

Page 41: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

S3D Relative Efficiency by Event (Intrepid)

Page 42: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Relative Speedup by Event (Intrepid)

Page 43: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

S3D Event Correlation to Total Time (Jaguar)

r = 1 impliesdirect correlation

Page 44: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

S3D Event Correlation to Total Time (Intrepid)

Page 45: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

S3D 3D Correlation Cube (Intrepid, MPI_Wait)

GETRATES_I vs. RATI_I vs. RATX_I vs. MPI_Wait

Page 46: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Heterogeneous Parallel Systems and Performance

Heterogeneous parallel systems are highly relevant today Heterogeneous hardware technology more accessible

Multicore processors (e.g., 4-core, 6-core, 8-core, 16-core) Manycore (throughput) accelerators (e.g., Tesla, Fermi) High-performance engines (e.g., Intel) Special purpose components (e.g., FPGAs)

Performance is the main driving concern Heterogeneity is an important (the?) path to extreme scale

Heterogeneous software technology to get performance More sophisticated parallel programming environments Integrated parallel performance tools

support heterogeneous performance model and perspectives

Page 47: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Implications for Parallel Performance Tools

Current status quo is somewhat comfortable Mostly homogeneous parallel systems and software Shared-memory multithreading – OpenMP Distributed-memory message passing – MPI

Parallel computational models are relatively stable (simple) Corresponding performance models are relatively tractable Parallel performance tools can keep up and evolve

Heterogeneity creates richer computational potential Results in greater performance diversity and complexity

Heterogeneous systems will utilize more sophisticated programming and runtime environments

Performance tools have to support richer computation models and more versatile performance perspectives

Page 48: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Heterogeneous Performance Views

Want to create performance views that capture heterogeneous concurrency and execution behavior Reflect interactions between heterogeneous components Capture performance semantics relative to computation model Assimilate performance for all execution paths for shared view

Existing parallel performance tools are CPU (host)-centric Event-based sampling (not appropriate for accelerators) Direct measurement (through instrumentation of events)

What perspective does the host have of other components? Determines the semantics of the measurement data Determines assumptions about behavior and interactions

Performance views may have to work with reduced data

Page 49: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Task-based Performance View

Consider the “task” abstraction for GPU accelerator scenario Host regards external execution as a task

Tasks operate concurrently withrespect to the host

Requires support for trackingasynchronous execution

Host creates measurementperspective for external task Maintains local and remote performance data Tasks may have limited measurement support May depend on host for performance data I/O Performance data might be received from external task

How to create a view of heterogeneous external performance?

Page 50: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

TAUcuda Performance Measurement (Version 2)

Performance measurement of heterogeneous applications using CUDA

CUDA system architecture Implemented by CUDA libraries

driver and device (cuXXX) libraries runtime (cudaYYY) library

Tools support (Parallel Nsight (Nexus), CUDA Profiler) not intended to integrate with other HPC performance tools

TAUcuda (v2) built on experimental Linux CUDA driver Linux CUDA driver R190.86 supports a callback interface!!!

Currently working with NVIDIA to develop version with production performance tools API

Page 51: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

TAUcuda Architecture

TAUevents

TAUcudaevents

Page 52: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

TAUcuda Instrumentation

TAUcuda events TAU events

Page 53: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

TAUcuda Experimentation Environments

University of Oregon Linux workstation

Dual quad core Intel Xeon GTX 280

GPU cluster (Mist) Four dual quad core Intel Xeon server nodes Two NVIDIA S1070 Tesla servers (4 Tesla GPUs per S1070)

Argonne National Laboratory (Eureka) 100 dual quad core NVIDIA Quadro Plex S4 200 Quadro FX5600 (2 per S4)

University of Illinois at Urbana-Champaign GPU cluster (AC cluster)

32 nodes with one S1070 (4 GPUs per node)

Page 54: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

CUDA Linpack Profile (4 processes, 4 GPUs)

GPU-accelerated Linpack benchmark (NVIDIA)

Page 55: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

CUDA Linpack Trace

MPI communication (yellow)CUDA memory transfer (white)

Page 56: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

SHOC Stencil2D (512 iterations, 4 CPUxGPU)

Scalable HeterOgenerous Computing benchmark suite CUDA / OpenCL kernels and microbenchmarks (ORNL)

CUDA memory transfer (white)

Page 57: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Scalable Parallel Performance Monitoring

Performance problem analysis is increasingly complex Multi-core, heterogeneous, and extreme scale computing Adaptive algorithms and runtime application tuning Performance dynamics variability within/between executions

Neo-performance measurement and analysis perspective Static, offline analysis dynamic, online analysis Scalable runtime analysis of parallel performance data Performance feedback to application for adaptive control Integrated performance monitoring (measurement + query)

Co-allocation of additional (tool specific) system resources

Goal Scalable, integrated parallel performance monitoring

Page 58: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Parallel Performance Measurement and Data

Parallel performance tools measure concurrently Scaling dictates “local” measurements (profile, trace)

save data with "local context" (processes or threads) Done without synchronization or central control

Parallel performance state is globally distributed as a result Logically part of application’s global data space Offline: outputs data at end for post-mortem analysis Online: access to performance state for runtime analysis

Definition: Monitoring Online access to parallel performance (data) state May or may not involve runtime analysis

Page 59: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Monitoring for Performance Dynamics

Runtime access to parallel performance data Scalable and lightweight Raises concerns of overhead and intrusion Support for performance-adaptive, dynamic applications

Alternative 1: Extend existing performance measurement Create own integrated monitoring infrastructure Disadvantage: maintain own monitoring framework

Alternative 2: Couple with other monitoring infrastructure Leverage scalable middleware from other supported projects Challenge: measurement system / monitor integration

Page 60: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Performance Dynamics: Parallel Profile Snapshots

Profile snapshots are parallel profiles recorded at runtime Shows performance profile dynamics (all types allowed)

Information

Ove

rhea

d Traces

Profile Snapshots

Profiles

A. Morris, W. Spear, A. Malony, and S. Shende, “Observing Performance Dynamics using Parallel Profile Snapshots,” European Conference on Parallel Processing (EuroPar), 2008.

Page 61: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Parallel Profile Snapshots of FLASH 3.0 (UIC)

Simulation of astrophysical thermonuclear flashes Snapshots show profile differences since last snapshot

Captures all eventssince beginningper thread

Mean profilecalculated post-mortem

Highlight changein performanceper iteration andat checkpointing

Initialization

Checkpointing

Finalization

Page 62: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

FLASH 3.0 Performance Dynamics (Periodic)

INTRFC

Page 63: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

TAUmon: Design

Design a transport-neutral application monitoring framework Base on prior / existing work with various transport systems

Supermon, MRNet, MPI Enable efficient development of monitoring functionality

Objectives Scalable access to a running application’s performance

at end of the application (before parallel teardown) while the application is still running

Support for scalable performance data analysis reduction statistical evaluation

Feedback (data, control) to application Monitoring engineering and performance efficiency issues

Page 64: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

TAUmon: Architecture

... ... ...

... ...

MPI process 0 MPI process k MPI process P-1

TAUmon

TAUprofiles threads

MPI

monitoring infrastructure

Page 65: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

TAUmon: Current Usage

TAU_ONLINE_DUMP() collective operations in application Called by all thread / processes (originally to output profiles) Arguments specify data analysis operation (future)

Appropriate version of TAU selected for transport system TAUmonMRnet: TAUmon using MRNet infrastructure TAUmonMPI: TAUmon using MPI infrastructure

User instruments application with TAU support for desired monitoring transport system (temporary)

User submits instrumented application to parallel job system Other launch systems must be submitted along with the

application to the job scheduler as needed different machine-specific job-submission scripts

Page 66: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

TAUmon: Parallel Profile Data Analysis

Total parallel profile data size depends on: # events * size of event * # execution threads * Event size depends on # metrics Example: 200 events * 100 bytes * 64,000 threads = 1.28 G

Monitoring operations Periodic profile data output (à la profile snapshorts) Events unification Basic statistics: mean, min/max, standard deviation, ... Advanced statistics: histogram, clustering, ...

Strong motivation to implement the operations in parallel

Page 67: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

TAUmonMRNet (a.k.a. ToM Revisited)

TAU over MRNet (ToM) Previously working with MRNet 2.1 (Cluster 2008 paper) 1-phase and 3-phase filters Explore overlay network with different span out (nodes)

TAUmon re-engineered for MRNet 3.0 (just released!) Re-implement ToM functionality Use new MRNet support Current implementation uses pre-released MRNet 3.0

version Testing with released version

Page 68: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Monitoring Operation with MRNet

Application collectively invokesTAU_ONLINE_DUMP() to start monitoringoperations using current performanceinformation

TAU data is accessed andsent through MRNet’scommunication API viastreams and filters

Filters perform appropriateaggregation operations on data

TAU frontend is responsible forcollecting the data, storing it, andeventual delivery to a consumer

Page 69: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

TAUmonMPI

Use MPI-based transport No separate launch mechanisms Parallel gather operations implemented

as a binomial heap with staged MPIpoint-to-point calls (Rank 0 serves as root)

Current limitations: Application shares parallel resources with monitoring

transport Monitoring operations may cause performance intrusion No user control of transport network configuration

Potential advantages Easy to use Could be more robust overall

. . .

rank 0

Page 70: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

TAUmon Experiments: PFLOTRAN

Predictive modeling of subsurface reactive flows Machines

ORNL Jaguar and UTK Kraken, Cray XT5 Processor counts

16,380 cores and 131Kcores, 12K (interactive) Scaling

Instrumentation (Source, PMPI) Full: 1131 events total, lots of small routines Partial: 1% exclusive + all MPI, 68 events total (44 MPI, 19

PETSc) with and without callpaths

Measurements (PAPI) Execution time: TOT CYC Counters: FP OPS, TOT IN, L1 DCA/DCM, L2 TCA/TCM, RES STL

Page 71: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

PFLOTRAN Profile (Exclusive, 16,380 cores)

MPI_Allreduce

MPI_Waitany

KSPSolve

oursnesjacobian

TAU ParaProf profile analyzer

...

Page 72: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

TAUmonMPI Event Unification (Cray XT5)

TAU unificationand merge time

Page 73: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

TAUmonMPI Scaling (PFLOTRAN, Cray XT5)

New histogram timings12288: 0.8643 seconds24576: 0.6238 seconds

Page 74: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

TAUmonMRnet Scaling (PFLOTRAN, Cray XT5)

Page 75: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

TAUmonMPI Scaling (PFLOTRAN, BG/P)

Page 76: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

TAUmonMRnet: Snapshot (PFLOTRAN, Cray XT5)

4104 cores running with 374 extra cores for MRNet Each line bar shows the mean profile of an iteration

Page 77: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

TAUmonMRnet: Snapshot (PFLOTRAN, Cray XT5)

Frames (iteration) 12, 17, 21 12k PFLOTRAN execution Shifts in order of events sorted by average value over time

Page 78: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

TAUmonMRnet Snapshot (FLASH, Cray XT5)

Sod 2D, 1,536 Cray XT5 cores Over 200 iterations. 15 maximum levels of refinement. MPI_Alltoall plateaus correspond to AMR refinement

Page 79: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

TAUmonMRnet Clustering (FLASH, Cray XT5)

MPI_Init

MPI_Alltoall COMPRESS_LIST MPI_Allreduce

DRIVER_COMPUTEDT

MPI_Init MPI_Alltoall MPI_Allreduce

DRIVER_COMPUTEDT

MPI_Alltoall MPI_Init MPI_Allreduce

DRIVER_COMPUTEDT

MPI_Alltoall COMPRESS_LIST MPI_Allreduce

Page 80: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

High-Productivity Performance Engineering (POINT)

Testbed AppsENZONAMDNEMO3D

Page 81: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Model Oriented Global Optimization (MOGO)

Empirical performance data evaluated with respect to performance expectations at various levels of abstraction

Page 82: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Performance Refactoring (PRIMA) (UO, Juelich)

Integration of instrumentation and measurement Core infrastructure

Focus on TAU and Scalasca University of Oregon, Research Centre Juelich Refactor instrumentation, measurement, and analysis Build next-generation tools on new common foundation

Extend to involve the SILC project Juelich, TU Dresden, TU Munich

Page 83: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Heterogeneous Exascale Software (Vancouver)

DOE X-stack program Partners:

Oak Ridge National LaboratoryUniversity of OregonUniversity of IllinoisGeorgia Institute of Technology

Components Compilers Scheduling and runtime

resource management Libraries Performance measurement, analysis, modeling

Page 84: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Call for “Extreme” Performance Engineering

Improve performance problem solving process Strategy to respond to scaling challenges and technology

changes / disruptions Evolve process and technology Build on robust, integrated performance infrastructure Carry forward performance expertise and knowledge

Model-oriented with knowledge-based reasoning Community-driven knowledge engineering Automated data / decision analytics

Closer integration with application development and execution environment

Page 85: Extreme Performance Engineering: Petascale and Heterogeneous Systems Allen D. Malony Department of Computer and Information Science University of Oregon.

UVSQ October 21, 2010Extreme Performance Engineering

Support Acknowledgements

Department of Energy (DOE) Office of Science ASC/NNSA

Department of Defense (DoD) HPC Modernization Office (HPCMO)

NSF Software Development for Cyberinfrastructure (SDCI) Research Centre Juelich Argonne National Laboratory Technical University Dresden ParaTools, Inc. NVIDIA

85