Performance Analysis and Optimization through Run-time
Simulation and Statistics
Philip J. Mucci
University of Tennessee
[email protected]
http://www.cs.utk.edu/~mucci
Motivation
• Tuning real DOD and DOE applications!
– Performance on most codes is low.
– Poor overall efficiency due to poor single-node performance.
– Codes show good scalability because of the above and faster interconnects.
– The expertise is not there, nor should it be.
Description
• To use data available at run time to improve compilation and optimization technology.
• Empirically determine how well the code maps to the underlying architecture.
• Bottlenecks can be identified and possibly corrected by an explicit set of rules and transformations.
Information not being used
• Hardware statistics gathered through simulation or monitoring can identify the problem. A sample listing:
– Cache and branching behavior
– Cycle/Load/Store/FLOP counts
– Bottleneck determination
– Reference pattern
– Dynamic memory placement
Problem Areas
• Efficient use of the memory hierarchy
• Register re-use
• Aliasing
• Inlining
• Demotion
• Algorithms (iterative vs. direct)
Solutions
• Understanding (tutorials, reference material)
• Tools
• Preprocessors
• Compilers
• Manpower
Increasing Cache Performance
• How do we make better use of the memory hierarchy?
• For computer scientists, it’s not that hard. We need the right tools.
• How much can we automate?
• Through available tools and source analysis, we can usually narrow a problem down to the function level.
Cache Simulation
• Instrumentation of routines
• Run of the executable
• Analysis and correlation with source code!
• Old idea, new implementation. (A minimal sketch of the core follows.)
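To make the idea concrete, here is a minimal sketch of what the core of such a simulator might look like: a direct-mapped cache with made-up parameters, consulted on every memory reference. This is an illustration under assumed cache geometry, not the tool's actual implementation.

    #include <stdint.h>

    #define LINE_SIZE 32                  /* bytes per line (assumed)      */
    #define NUM_LINES 1024                /* 32 KB direct-mapped (assumed) */

    static uintptr_t tags[NUM_LINES];     /* tag stored per cache line     */
    static int       valid[NUM_LINES];
    static long      refs, misses;

    /* Record one memory reference; return 1 on a miss, 0 on a hit. */
    int cache_ref(const void *addr)
    {
        uintptr_t line = (uintptr_t)addr / LINE_SIZE;
        unsigned  set  = (unsigned)(line % NUM_LINES); /* direct-mapped index */
        uintptr_t tag  = line / NUM_LINES;

        refs++;
        if (valid[set] && tags[set] == tag)
            return 0;                     /* hit                           */
        tags[set]  = tag;                 /* miss: install the new line    */
        valid[set] = 1;
        misses++;
        return 1;
    }

The instrumentation pass would then rewrite each source-level reference such as x = a[i] to first call cache_ref(&a[i]).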
Cache Simulation
• Hardware independence
• Information on:– Locality– Placement– Reference pattern and Reuse– Line usage
Locality
• Spatial and Temporal
– misses/memory reference
– misses/re-use
• Conflict vs. Capacity (one way to tell them apart is sketched below)
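The slides do not say how the tool separates conflict from capacity misses, but a standard technique (an assumption here, not a claim about this tool) is to run a second, fully associative LRU cache of the same total size alongside the real geometry: a reference that misses the real cache but hits the associative one only missed because of where it mapped.

    /* Sketch: classify a miss given the outcome of both simulated caches. */
    typedef enum { HIT, COLD_MISS, CONFLICT_MISS, CAPACITY_MISS } ref_kind;

    ref_kind classify(int real_miss, int assoc_miss, int seen_before)
    {
        if (!real_miss)   return HIT;
        if (!seen_before) return COLD_MISS;      /* first touch of the line   */
        if (!assoc_miss)  return CONFLICT_MISS;  /* only the mapping hurt us  */
        return CAPACITY_MISS;                    /* cache is simply too small */
    }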
Placement
• Padding can be very important
• Not always possible to do during the static analysis phase.
• The reference pattern can affect the padding needed (see the sketch below).
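As an illustration of why placement matters, consider two arrays that each fill a hypothetical 32 KB direct-mapped cache. Assuming the compiler lays the globals out contiguously (not guaranteed), every pair a[i], b[i] maps to the same cache line and the loop thrashes; the small pad array shifts b's mapping. The sizes here are assumptions for the sake of the example.

    #include <stddef.h>

    #define N (32 * 1024 / sizeof(double))   /* each array spans 32 KB */

    double a[N];
    double pad[8];   /* two 32-byte lines of padding shift b's cache mapping */
    double b[N];

    void axpy(double s)
    {
        for (size_t i = 0; i < N; i++)
            a[i] += s * b[i];   /* without pad[], a[i] and b[i] conflict on
                                   every iteration in a 32 KB direct-mapped
                                   cache                                     */
    }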
Reference Pattern
• Again, not always possible to do during static analysis.
• Even harder to analyze when dealing with pseudo-optimized code.
• Examples: stencils, sparse solvers, etc. (a stencil sketch follows).
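A 5-point stencil of the kind the slide mentions: each b[i][j] touches three rows of a, so whether those rows stay resident between iterations depends on N relative to the cache size, which is exactly what static analysis struggles to decide. N here is an arbitrary example size.

    #define N 1024

    void stencil(double a[N][N], double b[N][N])
    {
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                b[i][j] = 0.25 * (a[i-1][j] + a[i+1][j]
                                + a[i][j-1] + a[i][j+1]);
    }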
Reuse
• Blocking is critical to applications where there is re-use.
• We need to identify re-use potential, to spot the areas where blocking and register allocation effort should be focused (see the blocked loop below).
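The classic transformation this analysis motivates is loop blocking (tiling). A sketch for matrix multiply, with an assumed tile size that would in practice be chosen from the cache geometry the simulator reports (and assuming N is a multiple of TILE):

    #define N    512
    #define TILE 64     /* hypothetical tile size, cache-dependent */

    void matmul_blocked(double c[N][N], double a[N][N], double b[N][N])
    {
        for (int ii = 0; ii < N; ii += TILE)
          for (int kk = 0; kk < N; kk += TILE)
            for (int jj = 0; jj < N; jj += TILE)
              /* each TILE x TILE block of a, b, c now fits in cache
                 and is re-used many times before being evicted      */
              for (int i = ii; i < ii + TILE; i++)
                for (int k = kk; k < kk + TILE; k++)
                  for (int j = jj; j < jj + TILE; j++)
                    c[i][j] += a[i][k] * b[k][j];
    }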
Source Code Mapping
• Most cache tools are hard to use and hard to relate back to the source code.
• This tool simulates the cache(s) on each memory reference and can therefore easily correlate the data with the source.
• Instrumentation is at the source level, not object code.
Statistics
• Global, per file, per statement, per reference
• References, misses, cold misses, re-used references
• Conflict/re-use matrix (update sketched below)
– M(A,B) = x means some element of A ejected some element of B from the cache x times, counted only when that element of A has been in the cache before.
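A sketch of how such a matrix might be maintained inside the simulator; the names and the per-array ownership tracking are hypothetical, but the update follows the definition above:

    #define NUM_ARRAYS 64          /* assumed bound on tracked arrays */
    static long M[NUM_ARRAYS][NUM_ARRAYS];

    /* Called when a reference owned by array a evicts a line owned by
       array b; counted only if a's incoming element had been cached
       before, per the definition on the slide.                        */
    void record_eviction(int a, int b, int a_seen_before)
    {
        if (a_seen_before)
            M[a][b]++;
    }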
Development status
• GUI for selective instrumentation
• Real parsers (F90, C, C++)
• Better report generation
Implementation
• Simulator written in C
• Instrumentation in Perl
• GUI in Java
• Report generator in Perl
Relevance
• Why shouldn't this technology be part of a feedback loop?
– Compile with instrumentation
– Run
– Recompile with information from the run
– Watch for input-sensitivity issues.
Integration
• Identifying and correcting poor cache behavior can be made explicit and made part of a compiler, ideally a source-to-source transformer or preprocessor (a sketch of such a transformation follows this slide).
• Simulator can stand alone for detailed analysis and optimization by CS folks.
• Our knowledge and expertise made available through the tools.
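The kind of mechanical rewrite such a source-to-source transformer could apply, once the simulator flags a column-wise traversal of a row-major C array (function names are illustrative):

    #define N 1024

    /* Before: stride-N references, poor spatial locality. */
    void scale_bad(double a[N][N])
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] *= 2.0;
    }

    /* After interchange: stride-1 references, roughly one miss per
       cache line instead of one per element.                        */
    void scale_good(double a[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] *= 2.0;
    }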
Hardware Counters
• Virtually every processor available has hardware counters
• The interfaces and documentation are poor or non-existent.
• Hardware differs greatly, as do the semantics of the counters.
• Useful for measurement, analysis, optimization, modeling and benchmarking.
Performance Data Standard
• Standardize an API to obtain hardware performance counters
• Standardize the definitions of what those counters mean
• API is lightweight and portable
Performance Data Standard
• Target platforms
– R10K, R12K
– P2SC, PowerPC 604e, POWER3
– Sun Ultra 2/3
– Intel PII, Katmai, Merced
– Alpha 21164, 21264
Performance Data Standard
• Motivation
– Portable performance tools
– Optimization through feedback
– Developers wanting simple and accurate timing and statistics
– Modeling, evaluation
Performance Data Standard
• Small number of useful measurement points
– Timing: cycles, microseconds
– I/D cache misses, invalidations
– Branch mispredictions
– Load, store, FLOP, instruction counts
– I/D TLB misses
Performance Data Standard API
• Efficient counter multiplexing
• Thread safety
• Functions for:
– start, stop, reset, get, accumulate, query, control
• Use the best available vendor supported interface or API
• Possible pairing with DAIS, Dyninst for naming (an interface sketch follows)
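A minimal sketch of what the proposed standardized counter API might look like. Every name below is a hypothetical illustration; the actual specification had not yet appeared when this talk was given.

    #include <stdint.h>

    typedef enum {                 /* the "small number of useful points" */
        CTR_CYCLES,
        CTR_FLOPS,
        CTR_L1_DCACHE_MISSES,
        CTR_TLB_MISSES,
        CTR_BR_MISPREDICTS
    } ctr_event;

    /* Stubs: a real implementation would map each call onto the best
       available vendor-supported interface, as the slide says.        */
    int ctr_start(ctr_event ev)              { (void)ev; return 0; }
    int ctr_stop (ctr_event ev, uint64_t *v) { (void)ev; *v = 0; return 0; }
    int ctr_read (ctr_event ev, uint64_t *v) { (void)ev; *v = 0; return 0; }
    int ctr_reset(ctr_event ev)              { (void)ev; return 0; }

    /* Typical use: bracket the region of interest. */
    void measured_region(void (*compute)(void))
    {
        uint64_t flops;
        ctr_start(CTR_FLOPS);
        compute();
        ctr_stop(CTR_FLOPS, &flops);
    }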
Development status
• Research on the hardware and interfaces available on the various machines
• Compilation of findings, web page and mailing list
• API specification to appear in mid-August for discussion
• Vendors are lurking
• http://www.cs.utk.edu/~mucci/pdsa
Deliverables
• API for O2K, T3E, SP
• Portable prof implementation
People
• Shirley Browne (UT)
• Jeff Brown (LANL)
• Jeff Durachta (IBM, LANL)
• Christopher Kerr (IBM, LANL)
• George Ho (UT)
• Kevin London (UT)
• Philip Mucci (UT, Sandia)
Rice/UTK Collaboration
[Slide diagram: Rice and UT collaboration, with DOE and DOD sponsorship, spanning low-level support, optimization technology, tools, and applications]
Deliverables
• F90, C++ preprocessor with feedback
• Cache tool
• Analysis and optimization of poorly performing codes
• Performance API