Stochastic Program Execution Tracing
Jeff Odom, UMD
University of Maryland
SIGMA Goals
IBM/UMD tools to understand caches
  – Focus on detailed statistics
  – Complement existing hardware counters
Ability to handle real applications
  – MPI and OpenMP programs
  – Fortran and C
Provide hints about restructuring
  – Padding (both inter and intra data structures)
  – Blocking
UMD effort funded by PERC2
Original SIGMA Approach
Static instrumentation
  – Capture full information about memory use
  – Produce compact trace
    • Extracts loops and memory strides
Post execution tools
  – Detailed simulator
    • Full discrete event simulator
  – Memory profiler
    • Portion of accesses attributed to each data structure
Representing Program Execution
Capture full execution behavior
  – Record all basic blocks and memory addresses
  – Produces large traces (due to looping)
Trace compression
  – Maintain pattern buffer
  – Scan for repeating patterns
    • Extract memory strides
  – Repeat algorithms for nested loops
[Figure: compressed trace layout built from basic-block records (BLK1, BLK2, BLK3), address records (ADR), and a repeat record (RPT), with the fields Count, Length, Base, and Stride.]
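The stride-extraction step of trace compression can be illustrated with a minimal sketch. It is not SIGMA's actual compressor: the `StrideRun` record and the `compressStrides` function are hypothetical names, and the sketch handles only a single loop level (no pattern buffer, no nested-loop repetition).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record: `count` addresses starting at `base`, each `stride` apart.
struct StrideRun {
    std::uint64_t base;
    std::int64_t stride;
    std::size_t count;
};

// Greedily fold a raw address stream into constant-stride runs.
std::vector<StrideRun> compressStrides(const std::vector<std::uint64_t>& addrs) {
    std::vector<StrideRun> runs;
    std::size_t i = 0;
    while (i < addrs.size()) {
        StrideRun run{addrs[i], 0, 1};
        if (i + 1 < addrs.size()) {
            // The first difference fixes the stride for this run.
            run.stride = static_cast<std::int64_t>(addrs[i + 1] - addrs[i]);
            std::size_t j = i + 1;
            // Extend the run while consecutive differences match the stride.
            while (j < addrs.size() &&
                   static_cast<std::int64_t>(addrs[j] - addrs[j - 1]) == run.stride) {
                ++run.count;
                ++j;
            }
        }
        i += run.count;
        runs.push_back(run);
    }
    return runs;
}
```

A regular loop that touches many addresses collapses to one record, which is why traces of regular codes compress well and irregular ones do not.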
Trace Compression Isn’t Enough
A few seconds…
  – Slows execution considerably
  – Generates gigabytes

Application   Orig Time (s)   Slowdown   Trace Size (KB)
seis          8               4463x      1,934,667
BT            8               6000x      74,221
swim          396             777x       29
Sampling
We want…
  – Shorter execution times
  – Smaller traces
We need…
  – Representative traces
  – Where to sample?
Timestep boundary
  – Outermost loop
  – Requires manual identification (for now)
Dyninst + SIGMA = dynSIGMA
Dyninst adds flexibility
  – Vary sample rate without recompilation
  – Adaptive/progressive rate during execution
  – Target application runs at native speed when instrumentation turned off
Leverage existing SIGMA infrastructure
  – Only generate trace
  – Offline simulation/profiling steps unchanged
Dual application framework
  – Mutatee generates trace
  – Mutator toggles instrumentation
Memtime
Simple but effective metric of application memory performance
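$$\mathit{memtime} = \sum_{i=1}^{n} (h_i \cdot l_i) + T_{miss} \cdot T_{latency}$$

where
  $n$ = number of cache levels
  $h_i$ = hits at level $i$
  $l_i$ = cache latency at level $i$
  $T_{miss}$ = number of TLB misses
  $T_{latency}$ = penalty of a TLB miss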
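The memtime metric sums, over all cache levels, the hits at each level times that level's latency, and then adds the TLB misses times the TLB miss penalty. The sketch below computes it directly. The function name and parameter names are illustrative, and the latency unit (cycles or seconds) is an assumption left to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// memtime = sum over cache levels of (hits * latency) + TLB misses * TLB miss penalty.
// hits[i] and latencies[i] describe the same cache level; units are the caller's choice.
double memtime(const std::vector<double>& hits,
               const std::vector<double>& latencies,
               double tlbMisses,
               double tlbPenalty) {
    assert(hits.size() == latencies.size());
    double total = 0.0;
    for (std::size_t i = 0; i < hits.size(); ++i) {
        total += hits[i] * latencies[i];
    }
    return total + tlbMisses * tlbPenalty;
}
```

For instance, two cache levels with 100 hits at latency 1 and 10 hits at latency 10, plus 2 TLB misses at a penalty of 50, give 100 + 100 + 100 = 300 in whatever unit the latencies use.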
Characteristic Pattern
Local and global data objects given canonical name
Vector of objects’ memtime is characteristic data pattern
Comparison of characteristic patterns done with simple linear correlation
Can also be applied for function objects
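Comparing two characteristic patterns can be sketched as follows, assuming that "simple linear correlation" means the Pearson correlation coefficient (the slides do not name it). Each vector holds one memtime value per canonically named data object, in the same object order. The function name is illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation of two per-object memtime vectors of equal length.
// Returns NaN if either vector is constant (zero variance).
double correlation(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size() && !a.empty());
    const double n = static_cast<double>(a.size());
    double meanA = 0.0, meanB = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        meanA += a[i];
        meanB += b[i];
    }
    meanA /= n;
    meanB /= n;
    double cov = 0.0, varA = 0.0, varB = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double da = a[i] - meanA;
        const double db = b[i] - meanB;
        cov += da * db;
        varA += da * da;
        varB += db * db;
    }
    return cov / std::sqrt(varA * varB);
}
```

A value close to 1 indicates that a sampled trace attributes memory time to data objects in nearly the same proportions as the full trace.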
Example Application: seis
Seismic simulation from SPEChpc2002
  – Models multiple seismic processes
  – Process results pipelined
Variable timesteps
  – Different data pattern for each process
C & Fortran
  – Fortran: data processing
  – C: dynamic memory management, IO
Space & Time Gains From Sampling
Sample Rate     Trace Size (MB)   Time (h:m:s)   Correlation
1.00%           13.51             9:04           0.996139
2.50%           33.14             40:00          0.997124
5.00%           66.33             1:12:48        0.997307
10.00%          133.17            2:16:00        0.997131
Full (SIGMA)    1,889.32          9:55:04
Original seis                     0:08
Includes 0:12 instrumentation overhead
Challenge of Irregularity
Compression requires regular accesses
Sampling may hide poor compression
  – Each sample may compress poorly
  – Offset by low sampling rate
Sampling may not be accurate enough
  – Control flow sampled as well
  – Sample boundary requires manual definition
Hybrid Traces
Accuracy may be more important than execution time, but storage capacity may be limited
Modeling data access at particular points can be more accurate than timestep sampling
Many codes are mostly regular, but irregular patterns spoil compression
Modified Linear Regression
Establish linear pattern (min 3 points) at each memory access location
Look for repetitions of pattern with higher-level strides
Once input no longer matches pattern, treat further input as irregular until new pattern discovered
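The first step, establishing a linear pattern from at least three points, can be sketched as below. This covers only the initial pattern detection at one access point, not the search for higher-level strides or the switch to irregular mode. `LinearPattern` and `establishPattern` are hypothetical names, not part of SIGMA or Dyninst.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: length == 0 means no pattern was established.
struct LinearPattern {
    std::int64_t base;
    std::int64_t stride;
    std::size_t length;
};

// Try to establish a linear (constant-stride) pattern beginning at values[start].
// At least three consecutive points must agree on the stride.
LinearPattern establishPattern(const std::vector<std::int64_t>& values,
                               std::size_t start) {
    const std::size_t kMinPoints = 3;
    if (start + kMinPoints > values.size()) {
        return {0, 0, 0};
    }
    const std::int64_t stride = values[start + 1] - values[start];
    std::size_t end = start + 1;  // index of the last point known to fit
    while (end + 1 < values.size() && values[end + 1] - values[end] == stride) {
        ++end;
    }
    const std::size_t length = end - start + 1;
    if (length < kMinPoints) {
        return {0, 0, 0};
    }
    return {values[start], stride, length};
}
```

On the stream 0, 1, 2, 5, 9, ... a call with `start` 0 yields base 0, stride 1, length 3. A call with `start` 3 finds no pattern, because 5, 9, 10 do not share a stride.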
Modified Linear Regression
Irregular sequence modeled using uniform distribution
Pattern matching done local to each instrumentation (memory access) point
  – Original SIGMA pattern matches globally
Modified Linear Regression
Example: 0, 1, 2, 5, 9, 10, 11, 12, 2, 5
Becomes: 0 + x + 10y + {5,9,2,5}
Becomes: 0 + x + 10y + {l:2, h:9}
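The sketch below shows how the compressed form relates back to the example stream. It assumes x ranges over 0..2 and y over 0..1, which the slide implies but does not state. The regular part `0 + x + 10y` then regenerates 0, 1, 2, 10, 11, 12, while the leftover irregular values {5, 9, 2, 5} are reduced to low and high bounds for a uniform distribution. Both function names are illustrative.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Expand base + x*innerStride + y*outerStride
// for x in [0, innerCount) and y in [0, outerCount).
std::vector<std::int64_t> expandPattern(std::int64_t base,
                                        std::int64_t innerStride, int innerCount,
                                        std::int64_t outerStride, int outerCount) {
    std::vector<std::int64_t> out;
    for (int y = 0; y < outerCount; ++y) {
        for (int x = 0; x < innerCount; ++x) {
            out.push_back(base + x * innerStride + y * outerStride);
        }
    }
    return out;
}

// Reduce an irregular sequence to its {low, high} bounds, the two parameters
// needed to model it with a uniform distribution.
// Precondition: values is non-empty.
std::pair<std::int64_t, std::int64_t> summarizeIrregular(
        const std::vector<std::int64_t>& values) {
    auto bounds = std::minmax_element(values.begin(), values.end());
    return {*bounds.first, *bounds.second};
}
```

Storing only the bounds discards the order and exact values of the irregular accesses, which is the trade of accuracy for trace size that the hybrid approach makes.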
Experiment Setup
NAS Parallel Benchmarks 3.2 Serial Version, Class S
IBM XL C 8.0, XL Fortran 10.1
DyninstAPI 5.0, including
  – Liveness analysis
    • Up to 90% runtime reduction by excluding one SPR (MQ)
    • Additional 3% improvement with other GPR/FPR
  – Transactional instrumentation
Instrumentation always on (no sampling)
Transactional Instrumentation
Reduces
  – Memory allocation
  – Insertion time
Atomic operation
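// thr is assumed to refer to a thread of an already created or attached mutatee
BPatch_thread *thr;
BPatch_process *proc;
proc = thr->getProcess();
// Start a batch: subsequent snippet insertions are queued
proc->beginInsertionSet();
…
thr->insertSnippet(…);
thr->insertSnippet(…);
…
// Apply the queued insertions together; true requests atomic insertion
proc->finalizeInsertionSet(true);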
Trace Size
                               BT       CG        EP        FT    LU      MG     SP
Original Size (KB)             16,732   489,817   648,323   344   1,011   495    1,405
Reduction w/ Irreg Comp (KB)   (20)     289,551   98,620    0     (53)    (90)   78
[Figure: bar chart of percentage trace-size reduction for BT, CG, EP, FT, LU, MG, and SP; vertical axis from -30.0% to 70.0%.]
Accuracy
Benchmark   Memtime (s), Original   Memtime (s), New   1 - Correlation
BT          1.2139                  1.2139             2.3E-8
CG          0.2442                  0.2403             5.7E-8
EP          2.2881                  2.2898             9.4E-7
LU          0.3205                  0.3206             8.2E-8
MG          0.0558                  0.0558             1.3E-5
SP          0.5162                  0.5161             4.0E-8
Future Work
Larger datasets (NPB Class B, C)
  – Some results already gathered for W
Distributions other than uniform
Irregular control flow
  – Example: Upper triangular matrix does not need to iterate all MxN values
  – Uses edge instrumentation
    • BPatch_basicBlock::getIncomingEdges
    • BPatch_basicBlock::getOutgoingEdges
    • BPatch_edge::getPoint
Questions?