Performance Analysis and Optimization through Run-time
Simulation and Statistics
Philip J. Mucci
University of Tennessee
[email protected]
http://www.cs.utk.edu/~mucci
Motivation
• Tuning real DOD and DOE applications!
– Performance on most codes is low.
– Poor overall efficiency due to poor single-node performance.
– Codes show good scalability because of the above and faster interconnects.
– The expertise is not there, nor should it be.
Description
• To use data available at run time to improve compilation and optimization technology.
• Empirically determine how well the code maps to the underlying architecture.
• Bottlenecks can be identified and possibly corrected by an explicit set of rules and transformations.
Information not being used
• Hardware statistics gathered through simulation or monitoring can identify the problem. A sample listing:
– Cache and branching behavior
– Cycle/Load/Store/FLOP counts
– Bottleneck determination
– Reference pattern
– Dynamic memory placement
Problem Areas
• Efficient use of the memory hierarchy
• Register re-use
• Aliasing
• Inlining
• Demotion
• Algorithms (iterative vs. direct)
Solutions
• Understanding (tutorials, reference material)
• Tools
• Preprocessors
• Compilers
• Manpower
Increasing Cache Performance
• How do we make better use of the memory hierarchy?
• For computer scientists, it’s not that hard. We need the right tools.
• How much can we automate?
• Through available tools and source analysis, we can usually narrow a problem down to the function level.
Cache Simulation
• Instrumentation of routines
• Run of the executable
• Analysis and correlation with source code!
• Old idea, new implementation. (A minimal sketch of the core follows.)
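To make the idea concrete, here is a minimal sketch of what the core of such a simulator might look like: a direct-mapped cache with made-up parameters, consulted on every memory reference. This is an illustration under assumed cache geometry, not the tool's actual implementation.

    #include <stdint.h>

    #define LINE_SIZE 32                  /* bytes per line (assumed)      */
    #define NUM_LINES 1024                /* 32 KB direct-mapped (assumed) */

    static uintptr_t tags[NUM_LINES];     /* tag stored per cache line     */
    static int       valid[NUM_LINES];
    static long      refs, misses;

    /* Record one memory reference; return 1 on a miss, 0 on a hit. */
    int cache_ref(const void *addr)
    {
        uintptr_t line = (uintptr_t)addr / LINE_SIZE;
        unsigned  set  = (unsigned)(line % NUM_LINES); /* direct-mapped index */
        uintptr_t tag  = line / NUM_LINES;

        refs++;
        if (valid[set] && tags[set] == tag)
            return 0;                     /* hit                           */
        tags[set]  = tag;                 /* miss: install the new line    */
        valid[set] = 1;
        misses++;
        return 1;
    }

The instrumentation pass would then rewrite each source-level reference such as x = a[i] to first call cache_ref(&a[i]).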
Cache Simulation
• Hardware independence
• Information on:– Locality– Placement– Reference pattern and Reuse– Line usage
Locality
• Spatial and Temporal
– misses/memory reference
– misses/re-use
• Conflict vs. Capacity (one way to tell them apart is sketched below)
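The slides do not say how the tool separates conflict from capacity misses, but a standard technique (an assumption here, not a claim about this tool) is to run a second, fully associative LRU cache of the same total size alongside the real geometry: a reference that misses the real cache but hits the associative one only missed because of where it mapped.

    /* Sketch: classify a miss given the outcome of both simulated caches. */
    typedef enum { HIT, COLD_MISS, CONFLICT_MISS, CAPACITY_MISS } ref_kind;

    ref_kind classify(int real_miss, int assoc_miss, int seen_before)
    {
        if (!real_miss)   return HIT;
        if (!seen_before) return COLD_MISS;      /* first touch of the line   */
        if (!assoc_miss)  return CONFLICT_MISS;  /* only the mapping hurt us  */
        return CAPACITY_MISS;                    /* cache is simply too small */
    }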
Placement
• Padding can be very important
• Not always possible to do during the static analysis phase.
• The reference pattern can affect the padding needed (see the sketch below).
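As an illustration of why placement matters, consider two arrays that each fill a hypothetical 32 KB direct-mapped cache. Assuming the compiler lays the globals out contiguously (not guaranteed), every pair a[i], b[i] maps to the same cache line and the loop thrashes; the small pad array shifts b's mapping. The sizes here are assumptions for the sake of the example.

    #include <stddef.h>

    #define N (32 * 1024 / sizeof(double))   /* each array spans 32 KB */

    double a[N];
    double pad[8];   /* two 32-byte lines of padding shift b's cache mapping */
    double b[N];

    void axpy(double s)
    {
        for (size_t i = 0; i < N; i++)
            a[i] += s * b[i];   /* without pad[], a[i] and b[i] conflict on
                                   every iteration in a 32 KB direct-mapped
                                   cache                                     */
    }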
Reference Pattern
• Again, not always possible to do during static analysis.
• Even harder to analyze when dealing with pseudo-optimized code.
• Examples: stencils, sparse solvers, etc. (a stencil sketch follows).
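A 5-point stencil of the kind the slide mentions: each b[i][j] touches three rows of a, so whether those rows stay resident between iterations depends on N relative to the cache size, which is exactly what static analysis struggles to decide. N here is an arbitrary example size.

    #define N 1024

    void stencil(double a[N][N], double b[N][N])
    {
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                b[i][j] = 0.25 * (a[i-1][j] + a[i+1][j]
                                + a[i][j-1] + a[i][j+1]);
    }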
Reuse
• Blocking is critical to applications where there is re-use.
• We need to identify re-use potential, to spot the areas where blocking and register allocation effort should be focused (see the blocked loop below).
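The classic transformation this analysis motivates is loop blocking (tiling). A sketch for matrix multiply, with an assumed tile size that would in practice be chosen from the cache geometry the simulator reports (and assuming N is a multiple of TILE):

    #define N    512
    #define TILE 64     /* hypothetical tile size, cache-dependent */

    void matmul_blocked(double c[N][N], double a[N][N], double b[N][N])
    {
        for (int ii = 0; ii < N; ii += TILE)
          for (int kk = 0; kk < N; kk += TILE)
            for (int jj = 0; jj < N; jj += TILE)
              /* each TILE x TILE block of a, b, c now fits in cache
                 and is re-used many times before being evicted      */
              for (int i = ii; i < ii + TILE; i++)
                for (int k = kk; k < kk + TILE; k++)
                  for (int j = jj; j < jj + TILE; j++)
                    c[i][j] += a[i][k] * b[k][j];
    }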
Source Code Mapping
• Most cache tools are hard to use and hard to relate back to the source code.
• This tool simulates the cache(s) on each memory reference and can therefore easily correlate the data with the source.
• Instrumentation is at the source level, not object code.
Statistics
• Global, per file, per statement, per reference
• References, misses, cold misses, re-used references
• Conflict/re-use matrix (update sketched below)
– M(A,B) = x means some element of A ejected some element of B from the cache x times, counted only when that element of A has been in the cache before.
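A sketch of how such a matrix might be maintained inside the simulator; the names and the per-array ownership tracking are hypothetical, but the update follows the definition above:

    #define NUM_ARRAYS 64          /* assumed bound on tracked arrays */
    static long M[NUM_ARRAYS][NUM_ARRAYS];

    /* Called when a reference owned by array a evicts a line owned by
       array b; counted only if a's incoming element had been cached
       before, per the definition on the slide.                        */
    void record_eviction(int a, int b, int a_seen_before)
    {
        if (a_seen_before)
            M[a][b]++;
    }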
Development status
• GUI for selective instrumentation
• Real parsers (F90, C, C++)
• Better report generation
Implementation
• Simulator written in C
• Instrumentation in Perl
• GUI in Java
• Report generator in Perl
Relevance
• Why shouldn't this technology be part of a feedback loop?
– Compile with instrumentation
– Run
– Recompile with information from the run
– Watch for input-sensitivity issues.
Integration
• Identifying and correcting poor cache behavior can be made explicit and made part of a compiler, ideally a source-to-source transformer or preprocessor (a sketch of such a transformation follows this slide).
• Simulator can stand alone for detailed analysis and optimization by CS folks.
• Our knowledge and expertise made available through the tools.
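The kind of mechanical rewrite such a source-to-source transformer could apply, once the simulator flags a column-wise traversal of a row-major C array (function names are illustrative):

    #define N 1024

    /* Before: stride-N references, poor spatial locality. */
    void scale_bad(double a[N][N])
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] *= 2.0;
    }

    /* After interchange: stride-1 references, roughly one miss per
       cache line instead of one per element.                        */
    void scale_good(double a[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] *= 2.0;
    }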
Hardware Counters
• Virtually every processor available has hardware counters
• The interfaces and documentation are poor or non-existent.
• Hardware differs greatly, as do the semantics of the counters.
• Useful for measurement, analysis, optimization, modeling and benchmarking.
Performance Data Standard
• Standardize an API to obtain hardware performance counters
• Standardize the definitions of what those counters mean
• API is lightweight and portable
Performance Data Standard
• Target platforms
– R10K, R12K
– P2SC, PowerPC 604e, POWER3
– Sun Ultra 2/3
– Intel PII, Katmai, Merced
– Alpha 21164, 21264
Performance Data Standard
• Motivation
– Portable performance tools
– Optimization through feedback
– Developers wanting simple and accurate timing and statistics
– Modeling, evaluation
Performance Data Standard
• Small number of useful measurement points
– Timing: cycles, microseconds
– I/D cache misses, invalidations
– Branch mispredictions
– Load, store, FLOP, instruction counts
– I/D TLB misses
Performance Data Standard API
• Efficient counter multiplexing
• Thread safety
• Functions for:
– start, stop, reset, get, accumulate, query, control
• Use the best available vendor supported interface or API
• Possible pairing with DAIS, Dyninst for naming (an interface sketch follows)
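A minimal sketch of what the proposed standardized counter API might look like. Every name below is a hypothetical illustration; the actual specification had not yet appeared when this talk was given.

    #include <stdint.h>

    typedef enum {                 /* the "small number of useful points" */
        CTR_CYCLES,
        CTR_FLOPS,
        CTR_L1_DCACHE_MISSES,
        CTR_TLB_MISSES,
        CTR_BR_MISPREDICTS
    } ctr_event;

    /* Stubs: a real implementation would map each call onto the best
       available vendor-supported interface, as the slide says.        */
    int ctr_start(ctr_event ev)              { (void)ev; return 0; }
    int ctr_stop (ctr_event ev, uint64_t *v) { (void)ev; *v = 0; return 0; }
    int ctr_read (ctr_event ev, uint64_t *v) { (void)ev; *v = 0; return 0; }
    int ctr_reset(ctr_event ev)              { (void)ev; return 0; }

    /* Typical use: bracket the region of interest. */
    void measured_region(void (*compute)(void))
    {
        uint64_t flops;
        ctr_start(CTR_FLOPS);
        compute();
        ctr_stop(CTR_FLOPS, &flops);
    }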
Development status
• Research on the hardware and interfaces available on the various machines
• Compilation of findings, web page and mailing list
• API specification to appear in mid-August for discussion
• Vendors are lurking
• http://www.cs.utk.edu/~mucci/pdsa
Deliverables
• API for O2K, T3E, SP
• Portable prof implementation
People
• Shirley Browne (UT)
• Jeff Brown (LANL)
• Jeff Durachta (IBM, LANL)
• Christopher Kerr (IBM, LANL)
• George Ho (UT)
• Kevin London (UT)
• Philip Mucci (UT, Sandia)
Rice/UTK Collaboration
[Slide diagram: Rice and UT collaboration, with DOE and DOD sponsorship, spanning low-level support, optimization technology, tools, and applications]
Deliverables
• F90, C++ preprocessor with feedback
• Cache tool
• Analysis and optimization of poorly performing codes
• Performance API