Post on 25-Feb-2016
Dynamic Analysis: Looking Back and the Road Ahead
Trishul Chilimbi
Runtime Analysis & Design (RAD)
Research in Software Engineering (RiSE)
Microsoft Research
Dynamic Analysis Breakdown
Measurement
Representation
Analysis
WODA '09
Measurement Methodology
Program -> Compiler -> Executable -> Machine
• Source instrumentation
• Binary instrumentation: ATOM (PLDI’94), EEL (PLDI’95)
• Instruction emulation: Dynamo (PLDI’00)
• Hardware instrumentation: DCPI (SOSP’97)
Measurement Efficiency
• Hardware performance counters: DCPI (SOSP’97)
• Sampling: Bursty Tracing (PLDI’01, FDDO’01)
• Program analysis: Path Profiling (MICRO’96)
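Sampling keeps measurement cheap by running instrumented code only part of the time. The sketch below is a simplified illustration of the counter-based idea behind bursty tracing: execution alternates between a long uninstrumented phase and a short instrumented burst. The names `BurstySampler`, `n_check`, and `n_instr` are illustrative assumptions, not taken from the slides.

```python
class BurstySampler:
    """Alternate between n_check unprofiled checks and n_instr profiled ones."""

    def __init__(self, n_check, n_instr):
        self.n_check = n_check      # checks spent in uninstrumented code
        self.n_instr = n_instr      # checks spent in instrumented code (one burst)
        self.remaining = n_check
        self.profiling = False
        self.trace = []

    def check(self, event):
        """Call at each check point; the event is recorded only during a burst."""
        if self.profiling:
            self.trace.append(event)
        self.remaining -= 1
        if self.remaining == 0:
            # Flip phases and reload the counter for the new phase.
            self.profiling = not self.profiling
            self.remaining = self.n_instr if self.profiling else self.n_check


sampler = BurstySampler(n_check=8, n_instr=2)
for i in range(30):
    sampler.check(i)
print(sampler.trace)  # events seen during the bursts
```

With these settings the sampler records 2 of every 10 events, so the sampling rate is n_instr / (n_check + n_instr).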
Representation
• Raw: trace
• Structured: Path Profile (MICRO’96), Whole Program Paths (PLDI’99), Whole Program Data Accesses (PLDI’01)
• Custom: Eraser’s Lock Set (SOSP’97)
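As an example of a custom representation, a lock set summarizes a whole trace as one set of candidate locks per shared variable. The sketch below is a deliberately simplified version of the lock-set refinement idea: it omits the per-variable state machine a real checker would use to suppress warnings during initialization. `LocksetChecker` and its method names are hypothetical.

```python
class LocksetChecker:
    """Keep, per variable, the locks held on every access seen so far."""

    def __init__(self):
        self.candidates = {}   # variable -> set of locks common to all accesses
        self.warnings = []     # (thread, variable) pairs with no common lock left

    def access(self, thread, var, locks_held):
        held = set(locks_held)
        if var not in self.candidates:
            self.candidates[var] = held
        else:
            # Refine the candidate set by intersecting with the locks held now.
            self.candidates[var] &= held
        if not self.candidates[var]:
            # No single lock protects every access: possible data race.
            self.warnings.append((thread, var))


checker = LocksetChecker()
checker.access("T1", "x", {"mu1", "mu2"})
checker.access("T2", "x", {"mu2"})
checker.access("T3", "x", {"mu1"})   # candidate set becomes empty here
print(checker.warnings)
```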
Analysis
• Performance: profiling and profile-driven optimization
• Correctness: bug detection, heap and concurrency checkers
• Security: security monitors, taint analysis
Dynamic Analysis: The Road Ahead
• Industrial-strength dynamic analysis
• Scaling dynamic analysis to process and analyze large quantities of data
• System level
• Data centers, multi-core
Scaling Dynamic Analyses
• System-level analysis
  • Instrumentation: Event Tracing for Windows (ETW)
  • Data volume
    • Statistical analysis
    • Visualization
ETW Tracing Infrastructure
• General-purpose real-time event logging facility
• Core component of Windows operating systems starting with Windows 2000, continually extended and improved
• High speed: 1200 to 2000 cycles per logged event
• Low overhead: less than 5% of total CPU cycles at 20,000 events/sec
• Works for both user-mode applications and drivers
• Immune to app crashes and hangs
• Writes to a file or to a real-time listener
• Dynamically enabled or disabled: no recompile, no reboots, no app restarts, …
• Designed for app tracing in production mode
• Scalable
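The speed and overhead figures above can be cross-checked with a little arithmetic. The sketch below assumes a 1 GHz core; that clock rate is an assumption for illustration only and does not appear on the slide.

```python
EVENTS_PER_SEC = 20_000
CYCLES_PER_EVENT = (1200, 2000)   # range quoted on the slide
CLOCK_HZ = 1_000_000_000          # assumption: 1 GHz core, not from the slide


def logging_overhead(events_per_sec, cycles_per_event, clock_hz):
    """Fraction of CPU cycles spent logging events."""
    return events_per_sec * cycles_per_event / clock_hz


low, high = (logging_overhead(EVENTS_PER_SEC, c, CLOCK_HZ) for c in CYCLES_PER_EVENT)
print(f"{low:.1%} to {high:.1%}")
```

Under that assumption the logging cost is 2.4% to 4.0% of the cycles, consistent with the "less than 5%" claim.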
ETW Architecture
[Diagram: providers (Provider A, Provider B, Provider C, …) send events to event tracing sessions (Session 1, Session 2, …, Session 64), each with its own buffers. A controller enables and disables providers and handles session control. Logged events are written to trace files or delivered in real time to consumers.]
ETW Performance Diagnostics
• OS: process/thread activity, module load, disk and file I/O, TCP/IP and UDP/IP, page faults, registry, context switches, heap and critical sections
• Server applications: Active Directory, IIS6, File Server, Print Server, Exchange Server
ETW Statistics
• Kernel logger outputs
  • ~100K events in minutes
  • ~200KB binary file
  • ~100MB text dump
  • Multiple traces/day
• Expert analysis
  • Processing a trace file: a few minutes
  • Manual diagnosis time: sometimes minutes, sometimes hours
• Manual diagnosis cannot keep up with the rate of trace collection
Scaling Dynamic Analyses
• System-level analysis
  • Instrumentation
  • Data volume
    • Statistical analysis
    • Visualization
HangViz
• Lock/resource contention lies at the root of many performance problems
• Kernel manages most resources, which are not visible to the application developer
• Our solution
  1. Start from an observed hang
  2. Pull out all relevant lock-related waits, represented as a directed acyclic graph (DAG)
  3. Highlight the critical path
  4. Provide a visualization tool for further exploration
  5. Iterative feedback cycle
Joint work with Alice Zheng, Steve Hsaio, David Andrzewejski
HangViz Outline
• Constructing the Ready DAG
• Finding the critical path
• Visualization
Constructing a Ready DAG
• Relevant ETW events
  • CSwitch: context switches
  • ReadyThread: thread releasing resource
  • Stack: lock functions
• Currently ETW does not track lock object ID
  • Stack functions are used to differentiate between different locks, but the signature is not perfect
• The sequence of wait and run intervals and ReadyThread signals can be represented as a directed acyclic graph (DAG)
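A minimal sketch of how wait intervals and ReadyThread signals might be turned into a DAG follows. The `Wait` record is a hypothetical, pre-digested form of the CSwitch and ReadyThread events (it is not the ETW event schema), and `build_ready_dag` is an illustrative name. Each wait points to the waits of the thread that readied it, provided those waits ended while it was still waiting.

```python
from collections import namedtuple

# Hypothetical record: one wait interval of a thread and who ended it.
Wait = namedtuple("Wait", "thread start end readied_by")


def build_ready_dag(waits):
    """Map each wait to the earlier waits of the thread that readied it."""
    by_thread = {}
    for w in waits:
        by_thread.setdefault(w.thread, []).append(w)
    dag = {}
    for w in waits:
        # Edges only go to waits that ended strictly earlier, so no cycles.
        dag[w] = [p for p in by_thread.get(w.readied_by, [])
                  if w.start <= p.end < w.end]
    return dag


si_wait = Wait("SearchIndexer", 0, 50, "eTrust")
ui_wait = Wait("Outlook UI", 10, 60, "SearchIndexer")
dag = build_ready_dag([si_wait, ui_wait])
print(dag[ui_wait])
```

In this example the Outlook UI wait depends on the SearchIndexer wait, which in turn was ended by eTrust.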
Example: Simple Ready Chain
[Diagram: Outlook UI (waiting) becomes Outlook UI (running) when SearchIndexer (running) signals ReadyThread on a file lock; SearchIndexer (waiting) in turn became running when eTrust (running) signaled ReadyThread on a file lock.]
Complications: Non-Immediate Waits
• The immediate ready chain may not be the root cause of the problem
Example: Ready Tree
[Diagram: ready tree rooted at Outlook UI (waiting), which is readied by SearchIndexer (SI) via a file lock. SI alternates between running and waiting; its waits are ended by ReadyThread signals from eTrust (file lock) and from a system thread (registry lock).]
Solution: Follow Overlapping Waits
• Look at all ready chains during the long wait
• Follow any wait of the parent thread (e.g., SearchIndexer) that overlaps with the child wait
• Repeat on the parent thread
• Optional search depth to limit the branching factor
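The overlapping-wait search can be sketched as a depth-limited recursion. As before, the `Wait` record and the function names are hypothetical, and the intervals in the example are invented for illustration.

```python
from collections import namedtuple

# Hypothetical record: one wait interval of a thread and who ended it.
Wait = namedtuple("Wait", "thread start end readied_by")


def overlaps(a, b):
    """True if the two wait intervals share any time."""
    return a.start < b.end and b.start < a.end


def ready_tree(wait, waits, max_depth=3):
    """Return (wait, subtrees), following overlapping waits of the readying thread."""
    if max_depth == 0:
        return (wait, [])
    parents = [w for w in waits
               if w.thread == wait.readied_by and overlaps(w, wait)]
    return (wait, [ready_tree(p, waits, max_depth - 1) for p in parents])


ui = Wait("Outlook UI", 0, 100, "SearchIndexer")
si_registry = Wait("SearchIndexer", 5, 40, "System thread")
si_file = Wait("SearchIndexer", 50, 90, "eTrust")
tree = ready_tree(ui, [ui, si_registry, si_file])
print(tree)
```

Both SearchIndexer waits overlap the long Outlook UI wait, so both become branches of the tree; `max_depth` bounds how far the search follows parent threads.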
More Complications: False Runs
• The thread runs, but not because the resource has been released
  • Timer wake-up: thread wakes up, lock is still not available, thread goes back to sleep
  • APC: thread is woken up to execute code for someone else
• Bottom line: timer wake-ups and APCs do NOT terminate the wait and should be counted towards total wait time
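One way to account for false runs is to merge consecutive waits until a genuine release is seen. The sketch below assumes each wait has already been labeled with how it ended; the labels `'ready'`, `'timer'`, and `'apc'` and the function name are illustrative assumptions.

```python
def merge_false_runs(intervals):
    """Merge waits of one thread that were only interrupted by false runs.

    intervals: time-ordered list of (start, end, wake_reason), where
    wake_reason is 'ready' (resource released), 'timer', or 'apc'.
    Returns logical waits as (start, end); each spans the brief false runs.
    """
    merged = []
    pending_start = None
    for start, end, reason in intervals:
        if pending_start is None:
            pending_start = start
        if reason == "ready":
            # Only a real release terminates the logical wait.
            merged.append((pending_start, end))
            pending_start = None
    # A wait still pending at the end of the trace is not reported.
    return merged


waits = [(0, 10, "timer"), (11, 20, "apc"), (22, 30, "ready"), (40, 45, "ready")]
print(merge_false_runs(waits))
```

The first three intervals collapse into one logical wait from 0 to 30, which is the duration that should be compared against the wait-time model.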
Example: APCs
[Diagram: Outlook UI (waiting) is readied by SearchIndexer (SI) via a file lock. While SI waits, a ReadyThread APC from a system thread briefly runs it before it resumes waiting; a ReadyThread signal from IExplorer (running) also appears in the chain.]
Finding Individual Critical Waits
• Algorithm for finding individual critical waits
  • Bucket wait times by their lock set (set of lock-related functions on the stack)
  • For each lock set, build a probabilistic model of wait time: Gaussian, exponential, Gamma, or mixture of Gaussians
  • Select the best model for each lock set
• A long wait is critical if it has extremely low probability under the model
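The steps above can be sketched with the standard library alone. This is a reduced illustration, not the tool's implementation: it considers only the Gaussian and exponential candidates, selects between them by log-likelihood, and scores new waits against a model fitted on earlier ("baseline") waits. The baseline/observed split, the threshold value, and all names are assumptions.

```python
import math
from statistics import NormalDist, fmean, pstdev


def fit_best_model(samples):
    """Return (name, survival_fn) for the better-fitting of two candidate models."""
    mu, sigma = fmean(samples), pstdev(samples)
    rate = 1.0 / mu                      # assumes positive wait times
    ll_exp = sum(math.log(rate) - rate * x for x in samples)
    ll_normal = -math.inf
    if sigma > 0:
        normal = NormalDist(mu, sigma)
        ll_normal = sum(math.log(max(normal.pdf(x), 1e-300)) for x in samples)
    if ll_normal >= ll_exp:
        return "gaussian", lambda x: 1.0 - normal.cdf(x)
    return "exponential", lambda x: math.exp(-rate * x)


def critical_waits(baseline, observed, threshold=1e-6):
    """Flag observed waits that are extremely unlikely under their lock set's model.

    baseline, observed: dict mapping a lock set (frozenset of function names)
    to a list of wait times in microseconds.
    """
    critical = []
    for lock_set, waits in observed.items():
        _, survival = fit_best_model(baseline[lock_set])
        critical += [(lock_set, w) for w in waits if survival(w) < threshold]
    return critical


locks = frozenset({"EnterCriticalRegion"})
baseline = {locks: [8, 9, 10, 11, 12]}
observed = {locks: [11, 41_000_000]}     # the second wait is 41 s
result = critical_waits(baseline, observed)
print(result)
```

Only the 41 s wait is flagged; the 11 us wait is well within the fitted distribution.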
Probabilistic Model of Wait Times
[Figure: EnterCriticalRegion wait-time histogram with a fitted mixture-of-Gaussians model. Marked wait times: 7 us, 10829 us, and 41 s; the 41 s wait is labeled "Low probability!"]
Finding the Critical Path
• Ready DAGs can be complex
• But there should be only one critical path
  • One resource holding up the entire chain (for example, network or I/O)
  • Multiple threads on the chain are experiencing long waits
• Critical path probably has the longest average wait time
• Other possible metrics
  • Maximum wait time: might be shared among multiple paths
  • Longest chain: could have many short waits
  • Longest chain with longest average wait time
• Possible expansion to cross-trace analysis
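A minimal sketch of the "longest average wait time" heuristic is shown below. It enumerates every root-to-leaf path, which is fine for small DAGs but exponential in the worst case. The graph, the wait times, and the function names are invented for illustration.

```python
def all_paths(dag, node):
    """Enumerate every path from node down to a leaf of the DAG."""
    children = dag.get(node, [])
    if not children:
        return [[node]]
    return [[node] + rest for child in children for rest in all_paths(dag, child)]


def critical_path(dag, root, wait_us):
    """Pick the path whose nodes have the longest average wait time."""
    return max(all_paths(dag, root),
               key=lambda path: sum(wait_us[n] for n in path) / len(path))


dag = {"UI": ["SI-file", "SI-registry"], "SI-file": ["eTrust"], "SI-registry": []}
wait_us = {"UI": 900, "SI-file": 800, "eTrust": 10, "SI-registry": 700}
print(critical_path(dag, "UI", wait_us))
```

Here the longer chain through eTrust averages 570 us while the shorter chain averages 800 us, so the shorter chain is selected. This illustrates why chain length alone is a weaker metric.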
Screen Shot I
• Generated ReadyTree (anomalous waits highlighted in red)
Screen Shot (close-up)
Screen Shot (Annotation)
• Changing anomaly annotation
AllocRay (with George Robertson, VIBE)
• A picture is worth a “million words” [of trace data]
• Heap allocation “movies” expose problems
• Easy to use and supports deep exploration
• Observe instantaneous program behavior
• Investigate memory footprint, working set (WS), fragmentation, leaks
[Figure: example allocation pictures labeled BAD and GOOD]
AllocRay: Heap Allocation “Movies”
• Colors and filters help focus on different behaviors
  • Memory footprint
  • Fragmentation
• Pixels are tied to events and call stacks to facilitate investigation
Scaling Dynamic Analyses
• Data centers
  • 10,000+ machines running web services such as search, mail, online shopping
  • Large opportunity for dynamic analyses to reduce data center operations cost
  • 10,000 x 100 metrics/minute -> 10+GB/day
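The data-volume estimate can be reproduced with simple arithmetic. The slide does not state the size of a metric sample; the sketch assumes 8 bytes per sample, which is an assumption chosen for illustration.

```python
MACHINES = 10_000
METRICS_PER_MINUTE = 100
BYTES_PER_SAMPLE = 8          # assumption: one 8-byte value per metric sample

samples_per_day = MACHINES * METRICS_PER_MINUTE * 60 * 24
gb_per_day = samples_per_day * BYTES_PER_SAMPLE / 1e9
print(samples_per_day, gb_per_day)
```

That is about 1.44 billion samples and roughly 11.5 GB per day under the stated assumption, in line with the "10+GB/day" figure.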
Statistical Debugging (Liblit et al. PLDI’03)
• Algorithm sketch
  1) Collect code profiles for a large number of successful and failing runs of the program
  2) Find code fragments that strongly correlate with failure
• Cause & correlation
  • “Correlation implies causation” is a logical fallacy!
  • Example: error handling code
• Statistical debugging: build a statistical model of program outcome that discriminates cause from correlation
Holmes (Chilimbi et al. ICSE’09)
[Diagram: path profiles from successful and failing runs feed a statistical analysis, which builds a statistical model and outputs bug predictors (likely root cause), illustrated by the fragment … if (y=0) { x = x + 1} …]
Statistical analysis
• Differentiate cause from correlation
• Key idea: find path fragments that strongly correlate with failure while the context in which the fragment occurs does not
[Diagram: control-flow graph of foo(x, y) with nodes a, b, c, d, e, f, illustrating the context of a path]
Statistical model
• Inputs
  • A set of path profiles, one for each run
  • Each run’s outcome (success/failure)
• Compute four statistics for each path
  • So(p), Fo(p): number of successful/failing runs in which the context of path p was executed
  • Se(p), Fe(p): number of successful/failing runs in which path p was executed
Statistical model
• Context: how much is the context of a path correlated with failure?
• Sensitivity: in how many failures does a path occur?
• Increase: how much more is the path correlated with failure?
• Importance: overall measure that combines sensitivity and increase (specificity)
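The formulas for these scores did not survive on the slide. The sketch below assumes the scoring scheme commonly used in the statistical-debugging literature: Context = Fo/(So+Fo), Increase = Fe/(Se+Fe) - Context, Sensitivity = log(Fe)/log(number of failing runs), and Importance as the harmonic mean of Increase and Sensitivity. Treat these definitions, and the function and parameter names, as assumptions rather than the slide's exact formulas.

```python
import math


def path_scores(s_obs, f_obs, s_exec, f_exec, num_failing):
    """Score one path from its four run counts (assumed definitions, see lead-in).

    s_obs, f_obs:   successful/failing runs in which the path's context executed
    s_exec, f_exec: successful/failing runs in which the path itself executed
    Assumes the context and the path were each seen in at least one run.
    """
    context = f_obs / (s_obs + f_obs)
    failure = f_exec / (s_exec + f_exec)
    increase = failure - context
    if f_exec > 0 and num_failing > 1:
        sensitivity = math.log(f_exec) / math.log(num_failing)
    else:
        sensitivity = 0.0
    if increase <= 0 or sensitivity <= 0:
        importance = 0.0
    else:
        # Harmonic mean: high only when both components are high.
        importance = 2 / (1 / increase + 1 / sensitivity)
    return {"context": context, "increase": increase,
            "sensitivity": sensitivity, "importance": importance}


predictor = path_scores(s_obs=90, f_obs=10, s_exec=0, f_exec=10, num_failing=10)
error_handler = path_scores(s_obs=0, f_obs=10, s_exec=0, f_exec=10, num_failing=10)
print(predictor["importance"], error_handler["importance"])
```

The second call models error-handling code: its context is already only reached in failing runs, so Increase is zero and the path is not reported as a cause even though it correlates perfectly with failure.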
Holmes in action: EDG C++ compiler
[Screenshot: Holmes output with Importance, Context, and Increase scores]
Branches, predicates AND paths: How close do they get you?
• Study of 45 bugs in 6 applications from the SIR benchmark suite
• Path profiles take you down the right path!
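Bug-directed Adaptive Profiling
[Diagram: in the production environment, Holmes profiling tools instrument myapp.dll and collect profiles and bug reports. The Holmes backend runs statistical analysis on them to produce bug predictors, and static analysis of myapp.cpp feeds the profiling tools. The loop leads to a root cause, illustrated by the fragments while (is_eof_token(ch) {} and if (id == 1) {}.]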
Adaptive Profiling
• Bootstrapping
  • Stack traces
  • Branch profiles
• Iterative profiling
  • Additional function selection using coupling
  • Strengthening weak predictors with richer profiles
HOLMES: Non-Adaptive vs. Adaptive
Benchmark      Holmes (Non-Adaptive)  Holmes1 (Adaptive)  Holmes2 (Adaptive)  Holmes3 (Adaptive)
print_tokens   0.68 / 100%            0.42 / 100%
replace        0.57 / 99%             0.27 / 96%          0.53 / 98%
gcc            0.68 / 67%             0.58 / 67%          0.68 / 67%
translate.v1   0.53 / 58%             0.24 / 67%          0.47 / 25%          0.53 / 58%
translate.v2   0.83 / 93%             0.47 / 27%          0.89 / 80%
edg            0.65 / 98%             0.65 / 97%          0.64 / 96%          0.66 / 96%
HOLMES: Adaptive Overheads
Time Overhead (%)
Benchmark      Branch Profiles  Holmes (Non-Adaptive)  Holmes1 (Adaptive)  Holmes2 (Adaptive)  Holmes3 (Adaptive)
gcc            75               181.3                  2.6                 9.6
translate.v1   3.5              4.7                    0.3                 2.1                 3.5
translate.v2   8.8              2.8                    0.8                 0.0
Space Overhead (%)
Benchmark      Branch Profiles  Holmes (Non-Adaptive)  Holmes1 (Adaptive)  Holmes2 (Adaptive)  Holmes3 (Adaptive)
gcc            84.1             170.2                  7.3                 46.5
translate.v1   25.2             41.1                   4.1                 3.0                 21.6
translate.v2   25.1             43.2                   3.1                 1.8
Dynamic Analysis & Data Centers
• Data center environment is more controlled
• System-level vs. application-level metrics
• What is the analogue of paths that provides context?
• Need predictive capability to take action
  • Reboot, reimage, notify operator
Conclusion
• Dynamic analyses have been successfully used to improve program performance, reliability, and security
  • Efficient measurement
• Need to scale dynamic analysis to industrial strength to address challenges posed by system-level analysis, multi-core, and data centers
  • Efficient data management and analysis
  • Data management: database/MapReduce-style processing
  • Statistical analysis techniques