Dynamic Analysis: Looking Back and the Road Ahead

Trishul Chilimbi

Runtime Analysis & Design (RAD)

Research in Software Engineering (RiSE)

Microsoft Research

Dynamic Analysis Breakdown

Measurement

Representation

Analysis

WODA '09

Measurement Methodology

Program -> Compiler -> Executable -> Machine

• Source instrumentation
• Binary instrumentation: ATOM (PLDI’94), EEL (PLDI’95)
• Instruction emulation: Dynamo (PLDI’00)
• Hardware instrumentation: DCPI (SOSP’97)

Measurement Efficiency

• Hardware performance counters: DCPI (SOSP’97)
• Sampling: Bursty Tracing (PLDI’01, FDDO’01)
• Program analysis: Path Profiling (MICRO’96)
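
The sampling idea behind bursty tracing can be sketched with two counters: the program runs unprofiled for a long stretch, then a short burst of checks is profiled. This is a minimal sketch, not the published implementation; the class name and the counter names `n_check` and `n_instr` are assumptions made for illustration.

```python
class BurstySampler:
    """Alternates long unprofiled stretches with short profiled bursts."""

    def __init__(self, n_check: int, n_instr: int):
        self.n_check0 = n_check    # checks skipped between bursts
        self.n_instr0 = n_instr    # checks profiled in each burst
        self.remaining = n_check   # checks left in the current phase
        self.profiling = False

    def at_check(self) -> bool:
        """Call at each check point; True means this one is profiled."""
        if self.remaining == 0:
            # Phase exhausted: flip between checking and profiling.
            self.profiling = not self.profiling
            self.remaining = self.n_instr0 if self.profiling else self.n_check0
        self.remaining -= 1
        return self.profiling


sampler = BurstySampler(n_check=3, n_instr=1)
pattern = [sampler.at_check() for _ in range(8)]
```

With these settings one check in four is profiled, so the sampling rate is n_instr / (n_check + n_instr).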

Representation

• Raw: trace
• Structured: Path Profile (MICRO’96), Whole Program Paths (PLDI’99), Whole Program Data Accesses (PLDI’01)
• Custom: Eraser’s Lock Set (SOSP’97)
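
As an illustration of a custom representation, a lock set can be kept per shared variable and narrowed on every access. The sketch below shows only the basic lockset refinement and leaves out the state machine that Eraser uses for initialization and read sharing; the class and method names are hypothetical.

```python
class LocksetChecker:
    """Tracks, per variable, the locks held on every access seen so far."""

    def __init__(self):
        self.candidates = {}  # variable name -> frozenset of candidate locks

    def access(self, var, locks_held):
        """Record an access; returns False when no common lock remains."""
        held = frozenset(locks_held)
        if var not in self.candidates:
            self.candidates[var] = held
        else:
            # Keep only the locks held on this access and all earlier ones.
            self.candidates[var] = self.candidates[var] & held
        return len(self.candidates[var]) > 0


checker = LocksetChecker()
first = checker.access("x", {"mu1", "mu2"})   # candidates: {mu1, mu2}
second = checker.access("x", {"mu1"})         # candidates: {mu1}
third = checker.access("x", {"mu2"})          # candidates: empty, possible race
```

An empty candidate set means no single lock protected every access to the variable, which is the condition reported as a potential data race.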

Analysis

• Performance: profiling and profile-driven optimization
• Correctness: bug detection, heap and concurrency checkers
• Security: security monitors, taint analysis

Dynamic Analysis: The Road Ahead

• Industrial-strength dynamic analysis
• Scaling dynamic analysis to process and analyze large quantities of data
• System level
• Data centers, multi-core

Scaling Dynamic Analyses

• System level analysis
  • Instrumentation: Event Tracing for Windows (ETW)
  • Data volume: statistical analysis, visualization

ETW Tracing Infrastructure

• General-purpose real-time event logging facility
• Core component of Windows operating systems starting with Windows 2000, continually extended and improved
• High speed: 1200 to 2000 cycles per logged event
• Low overhead: less than 5% of total CPU cycles at 20,000 events/sec
• Works for both user-mode applications and drivers
• Immune to app crashes and hangs
• Writes to a file or to a real-time listener
• Dynamically enabled or disabled: no recompile, no reboots, no app restarts, …
• Designed for app tracing in production mode
• Scalable
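
The speed and overhead figures above can be cross-checked with a little arithmetic. The 2 GHz clock rate below is an assumption chosen for illustration; only the cycle counts and the event rate come from the slide.

```python
def logging_cpu_fraction(events_per_sec, cycles_per_event, clock_hz):
    """Fraction of one core's cycles spent logging events."""
    return events_per_sec * cycles_per_event / clock_hz


CLOCK_HZ = 2_000_000_000  # assumed 2 GHz core, not stated in the talk

low = logging_cpu_fraction(20_000, 1200, CLOCK_HZ)
high = logging_cpu_fraction(20_000, 2000, CLOCK_HZ)

# Slowest clock at which the worst case still stays under 5%.
min_clock_hz = 20_000 * 2000 / 0.05
```

Under that assumption logging costs between 1.2% and 2% of the core, and the 5% bound holds for any clock of roughly 800 MHz or faster.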

ETW Architecture

[Figure: Providers A, B, and C write events into the buffers of event tracing sessions (Session 1, Session 2, … Session 64). A controller enables and disables providers and controls sessions. Logged events go to trace files or are delivered in real time to consumers.]

ETW Performance Diagnostics

• OS: process/thread activity, module load, disk and file I/O, TCP/IP and UDP/IP, page fault, registry, context switch, heap and critical section
• Server applications: Active Directory, IIS6, File Server, Print Server, Exchange Server

ETW Statistics

• Kernel logger outputs
  • ~100K events in minutes
  • ~200KB binary file
  • ~100MB text dump
  • Multiple traces/day
• Expert analysis
  • Processing a trace file: a few minutes
  • Manual diagnosis time: sometimes minutes, sometimes hours
• Manual diagnosis cannot keep up with the rate of trace collection

Scaling Dynamic Analyses

• System level analysis
  • Instrumentation
  • Data volume: statistical analysis, visualization

HangViz

• Lock/resource contention lies at the root of many performance problems
• The kernel manages most resources, which are not visible to the application developer
• Our solution
  1. Start from an observed hang
  2. Pull out all relevant lock-related waits, represented as a directed acyclic graph (DAG)
  3. Highlight the critical path
  4. Provide a visualization tool for further exploration
  5. Iterative feedback cycle

Joint work with Alice Zheng, Steve Hsaio, David Andrzejewski

HangViz Outline

• Constructing the Ready DAG
• Finding the critical path
• Visualization

Constructing a Ready DAG

• Relevant ETW events
  • CSwitch: context switches
  • ReadyThread: thread releasing resource
  • Stack: lock functions
• Currently ETW does not track lock object IDs
• Stack functions are used to differentiate between different locks, but the signature is not perfect
• The sequence of wait and run intervals and ReadyThread signals can be represented as a directed acyclic graph (DAG)
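
A minimal sketch of how wait intervals and ReadyThread signals might be paired up follows. The event tuples are a simplified, hypothetical encoding, not the real ETW record layout: "wait" and "run" stand in for the two sides of a CSwitch, and "ready" stands in for ReadyThread. Thread names and timestamps are made up.

```python
from collections import namedtuple

Wait = namedtuple("Wait", "thread start end readied_by")


def build_waits(events):
    """Pair each wait interval with the thread that readied it.

    events is a time-ordered list of tuples:
      ("wait", t, thread)            thread is switched out and blocks
      ("ready", t, readier, thread)  readier signals the waiting thread
      ("run", t, thread)             thread is switched back in
    """
    open_waits = {}  # thread -> [start time, readier or None]
    waits = []
    for ev in events:
        kind, t = ev[0], ev[1]
        if kind == "wait":
            open_waits[ev[2]] = [t, None]
        elif kind == "ready" and ev[3] in open_waits:
            open_waits[ev[3]][1] = ev[2]
        elif kind == "run" and ev[2] in open_waits:
            start, readier = open_waits.pop(ev[2])
            waits.append(Wait(ev[2], start, t, readier))
    return waits


events = [
    ("wait", 0, "OutlookUI"),
    ("wait", 1, "SearchIndexer"),
    ("ready", 5, "eTrust", "SearchIndexer"),
    ("run", 5, "SearchIndexer"),
    ("ready", 9, "SearchIndexer", "OutlookUI"),
    ("run", 9, "OutlookUI"),
]
waits = build_waits(events)
edges = [(w.thread, w.readied_by) for w in waits]  # waiter -> readier
```

Each edge points from a waiting thread to the thread that released it, which gives the chain OutlookUI -> SearchIndexer -> eTrust for this input.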

Example: Simple Ready Chain

[Figure: Outlook UI (waiting, then running) is readied by SearchIndexer (running) through a ReadyThread signal on a file lock; SearchIndexer (waiting) is in turn readied by eTrust (running) through a ReadyThread signal on a file lock.]

Complications: Non-Immediate Waits

• The immediate ready chain may not be the root cause of the problem

Example: Ready Tree

[Figure: ready tree rooted at Outlook UI (waiting, then running). Labels: SI / SearchIndexer (running, waiting), eTrust (running), Systems thread (running); ReadyThread edges labeled file lock (twice) and registry.]

Solution: Follow Overlapping Waits

• Look at all ready chains during the long wait
• Follow any wait of the parent thread (e.g., SearchIndexer) that overlaps with the child wait
• Repeat on the parent thread
• Optional search depth to limit branching factor
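
The overlap rule can be sketched as a depth-limited search over wait intervals. The record fields (`id`, `thread`, `start`, `end`, `readied_by`) and the sample values are hypothetical; the talk does not describe a data layout.

```python
def overlapping_parents(root, waits, max_depth=3):
    """Return (child id, parent wait id) edges reachable from root."""
    edges = []
    frontier = [(root, 0)]
    while frontier:
        child, depth = frontier.pop()
        if depth >= max_depth or child["readied_by"] is None:
            continue
        for w in waits:
            # A parent wait counts only if it overlaps the child's wait.
            overlaps = w["start"] < child["end"] and w["end"] > child["start"]
            if w["thread"] == child["readied_by"] and overlaps:
                edges.append((child["id"], w["id"]))
                frontier.append((w, depth + 1))
    return edges


waits = [
    {"id": "ui", "thread": "OutlookUI", "start": 0, "end": 100,
     "readied_by": "SearchIndexer"},
    {"id": "si1", "thread": "SearchIndexer", "start": 10, "end": 30,
     "readied_by": "eTrust"},
    {"id": "si2", "thread": "SearchIndexer", "start": 40, "end": 90,
     "readied_by": "System"},
    {"id": "si3", "thread": "SearchIndexer", "start": 200, "end": 210,
     "readied_by": "System"},
]
edges = overlapping_parents(waits[0], waits)
```

The wait `si3` is skipped because it falls outside the Outlook UI wait, while both overlapping SearchIndexer waits become children of the root.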

More Complications: False Runs

• The thread runs, but not because the resource has been released
  • Timer wake-up: thread wakes up, lock is still not available, thread goes back to sleep
  • APC: thread is woken up to execute code for someone else
• Bottom line: timer wake-ups and APCs do NOT terminate the wait and should be counted towards total wait time
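
One way to apply this rule is to merge consecutive wait intervals of a thread until a wake-up that was actually caused by the resource being released. This is a sketch under assumed inputs: the wake reasons "resource", "timer", and "apc" are hypothetical labels, not ETW field values.

```python
def merge_false_runs(intervals):
    """Merge waits that were only interrupted by timer wake-ups or APCs.

    intervals: time-ordered (start, end, wake_reason) waits of one thread.
    A trailing wait that never ends with "resource" is left out because
    the thread is still waiting.
    """
    merged = []
    pending_start = None
    for start, end, reason in intervals:
        if pending_start is None:
            pending_start = start
        if reason == "resource":
            # The real wait ends here; short false runs inside are absorbed.
            merged.append((pending_start, end))
            pending_start = None
    return merged


merged = merge_false_runs([
    (0, 10, "timer"),
    (11, 25, "apc"),
    (26, 40, "resource"),
    (50, 55, "resource"),
])
```

The first three intervals collapse into a single wait from 0 to 40, so the brief false runs count towards the total wait time.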

Example: APCs

[Figure: ready tree rooted at Outlook UI (waiting, then running), readied via ReadyThread on a file lock. Labels: SI / SearchIndexer alternating between waiting and running, a ReadyThread edge labeled APC, a Systems thread, and a ReadyThread edge from IExplorer (running).]

Finding Individual Critical Waits

• Algorithm for finding individual critical waits
  • Bucket wait times by their lock set (set of lock-related functions on the stack)
  • For each lock set, build a probabilistic model of wait time: Gaussian, exponential, Gamma, or mixture of Gaussians
  • Select the best model for each lock set
• A long wait is critical if it has extremely low probability under the model
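
The steps above can be sketched with the standard library alone. This simplified version compares only a Gaussian and an exponential model (Gamma and mixtures are omitted), picks the one with the higher log-likelihood, and flags waits whose tail probability falls below a threshold. The tail-probability criterion, the threshold value, and the function names are assumptions made for illustration.

```python
import math
from collections import defaultdict
from statistics import NormalDist, fmean, pstdev


def fit_best_model(samples):
    """Return (model name, tail function) for the better of two fits."""
    mean = fmean(samples)
    sigma = pstdev(samples) or 1e-9  # avoid a zero-width Gaussian
    normal = NormalDist(mean, sigma)
    ll_normal = sum(math.log(max(normal.pdf(x), 1e-300)) for x in samples)
    rate = 1.0 / mean  # maximum-likelihood rate; assumes positive waits
    ll_expo = sum(math.log(rate) - rate * x for x in samples)
    if ll_normal >= ll_expo:
        return "gaussian", lambda x: 1.0 - normal.cdf(x)
    return "exponential", lambda x: math.exp(-rate * x)


def critical_waits(waits, threshold=1e-3):
    """waits: iterable of (lock_set, wait_us) pairs."""
    buckets = defaultdict(list)
    for lock_set, wait_us in waits:
        buckets[frozenset(lock_set)].append(wait_us)
    flagged = []
    for lock_set, samples in buckets.items():
        _, tail = fit_best_model(samples)
        # Critical: a wait this long is very unlikely under the model.
        flagged += [(lock_set, x) for x in samples if tail(x) < threshold]
    return flagged


lock = {"EnterCriticalRegion"}
observed = [(lock, 90 + (i % 21)) for i in range(50)] + [(lock, 10829)]
flagged = critical_waits(observed)
```

Here fifty waits of about 100 us and one wait of 10829 us share a lock set, and only the long wait is flagged. Fitting the model on data that includes the outlier is a shortcut; a more careful version would fit on historical traces.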

Probabilistic Model of Wait Times

[Figure: EnterCriticalRegion wait time histogram with a fitted mixture-of-Gaussians model; a 10829 us wait is marked as low probability.]

Finding the Critical Path

• Ready DAGs can be complex
• But there should be only one critical path
  • One resource holding up the entire chain (for example, network or I/O)
  • Multiple threads on the chain are experiencing long waits
• Critical path probably has the longest average wait time
• Other possible metrics
  • Maximum wait time: might be shared among multiple paths
  • Longest chain: could have many short waits
  • Longest chain with longest average wait time
• Possible expansion to cross-trace analysis
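
The "longest average wait time" metric can be sketched by enumerating root-to-leaf paths of the Ready DAG. The node names and wait times below are made up, and path enumeration is exponential in the worst case, so this is an illustration rather than a scalable implementation.

```python
def critical_path(children, wait_us, root):
    """Return (best average wait, path) over all root-to-leaf paths.

    children: node -> list of successor nodes in the Ready DAG
    wait_us:  node -> wait time in microseconds
    """
    best = (float("-inf"), [])
    stack = [[root]]
    while stack:
        path = stack.pop()
        successors = children.get(path[-1], [])
        if not successors:
            # Leaf reached: score the whole path by its mean wait.
            average = sum(wait_us[n] for n in path) / len(path)
            best = max(best, (average, path))
        for nxt in successors:
            stack.append(path + [nxt])
    return best


children = {
    "ui": ["si_file", "si_registry"],
    "si_file": ["etrust"],
    "si_registry": ["system"],
}
wait_us = {"ui": 9000, "si_file": 200, "etrust": 100,
           "si_registry": 8000, "system": 7000}
average, path = critical_path(children, wait_us, "ui")
```

The registry branch wins with an average of 8000 us, against 3100 us for the file-lock branch, even though both branches have the same length.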

Screen Shot I

Generated ReadyTree (anomalous waits highlighted in red)

Screen Shot (close-up)

Screen Shot (Annotation) Changing anomaly annotation

AllocRay (with George Robertson, VIBE)

• A picture is worth a “million words” [of trace data]
• Heap allocation “movies” expose problems
• Easy to use and supports deep exploration
• Observe instantaneous program behavior
• Investigate memory footprint, working set (WS), fragmentation, leaks

[Figure: BAD vs. GOOD allocation patterns]

AllocRay: Heap Allocation “Movies”

• Colors and filters help focus on different behaviors
  • Memory footprint
  • Fragmentation
• Pixels are tied to events and call stacks to facilitate investigation

Scaling Dynamic Analyses

• Data centers
  • 10,000+ machines running web services such as search, mail, online shopping
  • Large opportunity for dynamic analyses to reduce data center operations cost
  • 10,000 machines x 100 metrics/minute -> 10+GB/day
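
The data-volume estimate is easy to reproduce. The sample size of 8 bytes is an assumption (one 64-bit value per metric, with no timestamps or metadata); the machine and metric counts come from the slide.

```python
MACHINES = 10_000
METRICS_PER_MINUTE = 100
BYTES_PER_SAMPLE = 8  # assumed: one 64-bit value, no timestamp or metadata

samples_per_day = MACHINES * METRICS_PER_MINUTE * 60 * 24
gigabytes_per_day = samples_per_day * BYTES_PER_SAMPLE / 1e9
```

That is 1.44 billion samples and about 11.5 GB per day, consistent with the 10+GB/day figure; any per-sample metadata pushes the total higher.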

Statistical Debugging (Liblit et al. PLDI’03)

• Algorithm sketch
  1) Collect code profiles for a large number of successful and failing runs of the program
  2) Find code fragments that strongly correlate with failure
• Cause & correlation
  • “Correlation implies causation” is a logical fallacy!
  • Example: error handling code correlates with failure but does not cause it
• Statistical debugging: build a statistical model of program outcome that discriminates cause from correlation

Holmes (Chilimbi et al. ICSE’09)

[Figure: path profiles from successful and failing runs feed a statistical analysis, which outputs bug predictors (likely root cause), e.g. …if (y=0) { x = x + 1}…]

Differentiate cause from correlation

• Key idea: find path fragments that strongly correlate with failure, while the context in which the fragment occurs does not
• Context of a path

[Figure: context of a path, illustrated with foo(x, y)]

Statistical model

• Inputs
  • A set of path profiles, one for each run
  • Each run’s outcome (success/failure)
• Compute four statistics for each path
  • So(p), Fo(p): number of successful/failing runs in which the context of path p was executed
  • Se(p), Fe(p): number of successful/failing runs in which path p was executed

Statistical model

• Context: how much is the context of a path correlated with failure?
• Sensitivity: in how many failures does a path occur?
• Increase: how much more is the path correlated with failure than its context?
• Importance: overall measure that combines sensitivity and increase (specificity)
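
The formulas for these measures did not survive in the transcript. The sketch below assumes the scores commonly used in statistical bug isolation: context and failure rates as ratios of failing runs, increase as their difference, a log-scaled sensitivity, and importance as the harmonic mean of increase and sensitivity. Treat the exact definitions, the function name, and the example counts as assumptions rather than the definitions shown on the slide.

```python
import math


def path_scores(Se, Fe, So, Fo, num_failing):
    """Score a path p from run counts.

    Se, Fe: successful/failing runs in which p was executed
    So, Fo: successful/failing runs in which p's context was executed
    num_failing: total number of failing runs
    Assumes the path and its context were each observed at least once.
    """
    context = Fo / (So + Fo)
    increase = Fe / (Se + Fe) - context
    if Fe > 0 and num_failing > 1:
        sensitivity = math.log(Fe) / math.log(num_failing)
    else:
        sensitivity = 0.0
    if increase <= 0 or sensitivity <= 0:
        importance = 0.0  # path is no better a predictor than its context
    else:
        importance = 2.0 / (1.0 / increase + 1.0 / sensitivity)
    return {"context": context, "increase": increase,
            "sensitivity": sensitivity, "importance": importance}


scores = path_scores(Se=5, Fe=45, So=60, Fo=50, num_failing=50)
```

In this made-up example the path fails in 90% of the runs that execute it while its context fails in about 45%, so the increase is about 0.45 and the importance about 0.61.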

Holmes in action: EDG C++ compiler

[Figure labels: Importance, Context, Increase]

Branches, predicates AND paths: How close do they get you?

• Study of 45 bugs in 6 applications from the SIR benchmark suite
• Path profiles take you down the right path!

Bug-directed Adaptive Profiling

[Figure labels: Bug report, Static analysis, myapp.cpp, myapp.dll, Holmes profiling tools, Profiles, Holmes backend, Bug predictors, Root cause; example predictors: while (is_eof_token(ch) {} and if (id == 1) {}]

Adaptive Profiling

• Bootstrapping
  • Stack traces
  • Branch profiles
• Iterative profiling
  • Additional function selection using coupling
  • Strengthening weak predictors with richer profiles

HOLMES: Non-Adaptive Vs Adaptive

Benchmark      Holmes (Non-Adaptive)  Holmes1 (Adaptive)  Holmes2 (Adaptive)  Holmes3 (Adaptive)
print_tokens   0.68 / 100%            0.42 / 100%         -                   -
replace        0.57 / 99%             0.27 / 96%          0.53 / 98%          -
gcc            0.68 / 67%             0.58 / 67%          0.68 / 67%          -
translate.v1   0.53 / 58%             0.24 / 67%          0.47 / 25%          0.53 / 58%
translate.v2   0.83 / 93%             0.47 / 27%          0.89 / 80%          -
edg            0.65 / 98%             0.65 / 97%          0.64 / 96%          0.66 / 96%

HOLMES: ADAPTIVE OVERHEADS

Time Overhead (%)

Benchmark      Branch Profiles  Holmes (Non-Adaptive)  Holmes1 (Adaptive)  Holmes2 (Adaptive)  Holmes3 (Adaptive)
gcc            75               181.3                  2.6                 9.6                 -
translate.v1   3.5              4.7                    0.3                 2.1                 3.5
translate.v2   8.8              2.8                    0.8                 0.0                 -

Space Overhead (%)

Benchmark      Branch Profiles  Holmes (Non-Adaptive)  Holmes1 (Adaptive)  Holmes2 (Adaptive)  Holmes3 (Adaptive)
gcc            84.1             170.2                  7.3                 46.5                -
translate.v1   25.2             41.1                   4.1                 3.0                 21.6
translate.v2   25.1             43.2                   3.1                 1.8                 -

Dynamic Analysis & Data Centers

• Data center environment is more controlled
• System level vs. application level metrics
• What is the analogue of paths that provides context?
• Need predictive capability to take action: reboot, reimage, notify operator

Conclusion

• Dynamic analyses have been successfully used to improve program performance, reliability, and security
  • Efficient measurement
• Need to scale dynamic analysis to industrial strength to address challenges posed by system-level analysis, multi-core, and data centers
  • Efficient data management and analysis
  • Data management: database/Map-Reduce style processing
  • Statistical analysis techniques
