Root Cause Analysis of Failures in Large-Scale Computing Environments
Alex Mirgorodskiy, University of Wisconsin ([email protected])
Naoya Maruyama, Tokyo Institute of Technology ([email protected])
Barton P. Miller, University of Wisconsin ([email protected])
http://www.paradyn.org/
Motivation
• Systems are complex and non-transparent
  – Many components, different vendors
• Anomalies are common
  – Intermittent
  – Environment-specific
• Users have little debugging expertise
Finding the causes of bugs and performance problems in production systems is hard
Vision
Autonomous, detailed, low-overhead analysis:
• The user specifies a perceived problem cause
• The agent finds the actual cause
[Figure: a tracing agent propagates from process P on Host A across the network to processes Q and R on Host B]
Applications
• Diagnostics of E-commerce systems
  – Trace the path each request takes through a system
  – Identify unusual paths
  – Find out why they are different from the norm
• Diagnostics of Cluster and Grid systems
  – Monitor the behavior of different nodes in the system
  – Identify nodes with unusual behavior
  – Find out why they are different from the norm
  – Example: found problems in the SCore middleware
• Diagnostics of Real-time and Interactive systems
  – Trace words through the phone network
  – Find out why some words were dropped
Key Components
• Data collection: self-propelled instrumentation
  – Works for a single process
  – Can cross the user-kernel boundary
  – Can be deployed on multiple nodes at the same time
  – Ongoing work: crossing process and host boundaries
• Data analysis: use repetitiveness to find anomalies
  – Repetitive execution of the same high-level action, OR
  – Repetitiveness among identical processes (e.g., cluster-management tools, parallel codes, Web server farms)
Focus on Control Flow Anomalies
• Unusual statements executed
  – Corner cases are more likely to have bugs
• Statements executed in an unusual order
  – Race conditions
• Functions taking unusually long to complete
  – Sporadic performance problems
  – Deadlocks, livelocks
Current Framework
1. Traces the control flow of all processes
   • Begins at process startup
   • Stops upon a failure or performance degradation
2. Identifies anomalies: unusual traces
   • Problems on a small number of nodes
   • Both fail-stop and not
3. Identifies the causes of the anomalies
   • The function responsible for the problem
[Figure: self-propelled instrumentation in action. The agent (instrumenter.so) patches call instructions in the application binary (functions foo and bar of a.out) so that each call first jumps to instrument() and then continues to its original target; indirect calls (call *%eax) are patched the same way, and the agent crosses into the OS kernel at sys_call through /dev/instrumenter. Steps: Inject, Activate, Propagate, Analyze (build the call graph/CFG with Dyninst). Traces are collected for processes P1-P4.]
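The deck's mechanism is run-time binary patching driven by Dyninst; as a rough compile-time stand-in only (not the actual self-propelled instrumentation), GCC's -finstrument-functions can generate comparable entry/exit events:

// trace_hooks.cpp: illustrative stand-in for the run-time call-site patching.
// Build: g++ -finstrument-functions app.cpp trace_hooks.cpp
#include <cstdio>

extern "C" {

// GCC calls this hook on entry to every instrumented function.
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void* fn, void* call_site) {
    std::fprintf(stderr, "enter %p (called from %p)\n", fn, call_site);
}

// ...and this one on every return.
__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void* fn, void* call_site) {
    std::fprintf(stderr, "exit  %p\n", fn);
}

}  // extern "C"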
Data Collection: Trace Management
[Figure: the tracer records call and return events for function foo in process P]

• The trace is kept in a fixed-size circular buffer
  – New entries overwrite the oldest entries
  – Retains the most recent events leading to the problem
• The buffer is located in a shared-memory segment
  – Does not disappear if the process crashes
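A minimal sketch of such a crash-surviving buffer, assuming POSIX shared memory; the entry layout, capacity, and segment name are illustrative choices, not taken from the paper:

// trace_buffer.cpp: fixed-size circular trace buffer in a shared-memory
// segment, so the trace survives a crash of the traced process.
// Build: g++ trace_buffer.cpp -lrt   (-lrt needed on older glibc)
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>

struct TraceEntry {
    uint64_t timestamp;   // when the event happened
    void*    fn;          // which function
    uint8_t  is_entry;    // entry (1) or exit (0)
};

constexpr uint64_t BUF_SLOTS = 1 << 16;   // fixed capacity

struct TraceBuffer {
    uint64_t   next;                      // ever-increasing write cursor
    TraceEntry slots[BUF_SLOTS];
};

static TraceBuffer* buf = nullptr;

// Map (or create) the segment; it persists in /dev/shm after a crash.
bool open_trace_buffer(const char* name) {
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) return false;
    if (ftruncate(fd, sizeof(TraceBuffer)) != 0) { close(fd); return false; }
    void* p = mmap(nullptr, sizeof(TraceBuffer),
                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return false;
    buf = static_cast<TraceBuffer*>(p);
    return true;
}

// Append one event; once the buffer wraps, the oldest entry is overwritten,
// so the buffer always holds the most recent events leading up to a failure.
void record(void* fn, bool is_entry, uint64_t now) {
    buf->slots[buf->next % BUF_SLOTS] =
        TraceEntry{now, fn, static_cast<uint8_t>(is_entry)};
    buf->next++;
}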
Data Analysis: Find Anomalous Host
• Check whether the anomaly was fail-stop or not:
  – One of the traces ends substantially earlier than the others -> fail-stop; the corresponding host is the anomaly
  – Traces end at similar times -> non-fail-stop; look at differences in behavior across the traces

[Figure: per-trace end times for P1-P4; the trace that ends early marks the fail-stop host]
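A small sketch of this classification step; the gap threshold and the function name are assumptions, not from the paper:

// Classify the run as fail-stop if some trace ends substantially earlier
// than the rest. end_times[i] is the timestamp of the last event in host
// i's circular buffer; 'gap' is an assumed threshold.
#include <algorithm>
#include <cstddef>
#include <vector>

int find_failstop_host(const std::vector<double>& end_times, double gap) {
    double latest = *std::max_element(end_times.begin(), end_times.end());
    for (std::size_t i = 0; i < end_times.size(); ++i)
        if (latest - end_times[i] > gap)
            return static_cast<int>(i);   // fail-stop: this host is the anomaly
    return -1;  // all traces end at similar times: fall through to outlier analysis
}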
Data Analysis: Non-fail-stop Host
Find outliers (traces that differ from the rest):
• Define a distance metric between two traces
  – d(g,h) = measure of the similarity of traces g and h
• Define a trace suspect score
  – σ(h) = similarity of h to the common behavior
• Report traces with high suspect scores
  – These are the most distant from the common behavior
Defining the Distance Metric
• Compute the time profile for each host h:
  – p(h) = (t1, …, tF)
  – ti = normalized time spent in function fi on host h
  – Profiles are less sensitive to noise than raw traces
• Delta vector of two profiles: δ(g,h) = p(g) – p(h)
• Distance metric: d(g,h) = Manhattan norm of δ(g,h)

[Figure: profiles p(g) and p(h) plotted in the (t(bar), t(foo)) plane, with δ(g,h) the vector between them]
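A direct transcription of these definitions into code; the Profile alias and function names are illustrative:

// The time profile p(h) = (t1, ..., tF) and the Manhattan distance
// d(g,h) = |p(g) - p(h)|, as defined above.
#include <cmath>
#include <cstddef>
#include <vector>

using Profile = std::vector<double>;   // one normalized entry per function

// Normalize raw per-function times so the entries of p(h) sum to 1.
Profile make_profile(std::vector<double> time_in_fn) {
    double total = 0;
    for (double t : time_in_fn) total += t;
    for (double& t : time_in_fn) t /= total;   // assumes total > 0
    return time_in_fn;
}

// Manhattan norm of the delta vector delta(g,h) = p(g) - p(h).
double distance(const Profile& g, const Profile& h) {
    double d = 0;
    for (std::size_t i = 0; i < g.size(); ++i)
        d += std::fabs(g[i] - h[i]);
    return d;
}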
Defining the Suspect Score
• Common behavior = normal
• Suspect score: σ(h) = distance to the nearest neighbor
  – Report the host with the highest σ to the analyst
  – h is in the big mass, σ(h) is low, h is normal
  – g is a single outlier, σ(g) is high, g is an anomaly
• What if there is more than one anomaly?

[Figure: outlier g far from the cluster of hosts around h, with σ(g) much larger than σ(h)]
Defining the Suspect Score
• Suspect score: σk(h) = distance to the kth neighbor
  – Exclude the (k-1) closest neighbors
  – Sensitivity study: k = NumHosts/4 works well
• Represents the distance to the “big mass”:
  – h is in the big mass, its kth neighbor is close, σk(h) is low
  – g is an outlier, its kth neighbor is far, σk(g) is high

[Figure: computing the score using k = 2]
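A sketch of σk(h), reusing Profile and distance() from the sketch above; suspect_score is an illustrative name:

// Suspect score sigma_k(h): distance from h to its k-th nearest neighbor
// among the other traces (the deck suggests k = NumHosts/4). Assumes
// 1 <= k <= profiles.size() - 1.
#include <algorithm>
#include <cstddef>
#include <vector>

double suspect_score(std::size_t h, const std::vector<Profile>& profiles,
                     std::size_t k) {
    std::vector<double> dists;
    for (std::size_t g = 0; g < profiles.size(); ++g)
        if (g != h) dists.push_back(distance(profiles[g], profiles[h]));
    // Partition so that dists[k-1] holds the k-th smallest distance.
    std::nth_element(dists.begin(), dists.begin() + (k - 1), dists.end());
    return dists[k - 1];   // low inside the "big mass", high for an outlier
}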
Defining the Suspect Score
• Anomalous means unusual, but unusual does not always mean anomalous!
  – E.g., the MPI master is different from all the workers
  – It would be reported as an anomaly (a false positive)
• Distinguish false positives from true anomalies:
  – With knowledge of system internals: manual effort
  – With previous execution history: can be automated

[Figure: master g is far from the worker cluster around h, so σk(g) is high even though g is normal]
Defining the Suspect Score
• Add traces from a known-normal previous run
  – One-class classification
• Suspect score: σk(h) = distance to the kth trial neighbor or the 1st known-normal neighbor, whichever is closer
• Distance to the big mass or to known-normal behavior:
  – h is in the big mass, its kth neighbor is close, σk(h) is low
  – g is an outlier, but normal node n is close, so σk(g) is low

[Figure: outlier g lies next to known-normal trace n, so it is not flagged]
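A sketch of the combined score, reusing suspect_score() and distance() from the sketches above; taking the minimum of the two distances encodes "close to either the big mass or known-normal behavior means normal":

// Score with history: h is suspect only if it is far from BOTH its k-th
// neighbor in the current (trial) run and the nearest known-normal trace.
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

double suspect_score_with_history(std::size_t h,
                                  const std::vector<Profile>& trial,
                                  const std::vector<Profile>& normal,
                                  std::size_t k) {
    double nearest_normal = std::numeric_limits<double>::infinity();
    for (const Profile& n : normal)
        nearest_normal = std::min(nearest_normal, distance(n, trial[h]));
    return std::min(suspect_score(h, trial, k), nearest_normal);
}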
Finding Anomalous Function
• Fail-stop problems
  – The failure is in the last function invoked
• Non-fail-stop problems
  – Find why host h was marked as an anomaly
  – Report the function with the highest contribution to σ(h):
    • σ(h) = |δ(h,g)|, where g is the chosen neighbor
    • anomFn = arg max_i |δi|
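A sketch of this selection, reusing the Profile type from the earlier sketch:

// The function blamed for a non-fail-stop anomaly: the coordinate of the
// delta vector with the largest contribution, anomFn = arg max_i |delta_i|.
#include <cmath>
#include <cstddef>

std::size_t anomalous_function(const Profile& h, const Profile& g) {
    std::size_t best_i = 0;
    double best = -1;
    for (std::size_t i = 0; i < h.size(); ++i) {
        double contribution = std::fabs(h[i] - g[i]);
        if (contribution > best) { best = contribution; best_i = i; }
    }
    return best_i;   // index of the function that made h look anomalous
}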
Experimental Study: SCore
• SCore: a cluster-management framework
  – Job scheduling, checkpointing, migration
  – Supports MPI, PVM, Cluster-enabled OpenMP
• Implemented as a ring of daemons, scored
  – One daemon per host for monitoring jobs
  – Daemons exchange keep-alive patrol messages
  – If no patrol message traverses the ring in 10 minutes, sc_watch kills and restarts all daemons

[Figure: sc_watch watching a ring of scored daemons that pass a patrol message around]
Debugging SCore

[Figure: the same ring of scored daemons under sc_watch, now with tracing agents injected]

• Inject tracing agents into all scoreds
• Instrument sc_watch to find when the daemons are being killed
• Identify the anomalous trace
• Identify the anomalous function/call path
Finding the Host
• Host n129 is unusual: different from the others
• Host n129 is anomalous: not present in previous known-normal runs
• Host n129 is a new anomaly: not present in previous known-faulty runs
Finding the Cause
• Call chain with the highest contribution to the suspect score: (output_job_status -> score_write_short -> score_write -> __libc_write)
  – Tries to output a log message to the scbcast process
• Writes to the scbcast process kept blocking for 10 minutes
  – scbcast stopped reading data from its socket: bug!
  – scored did not handle this well (it spun in an infinite loop): bug!
Ongoing work

[Figure: the agent crosses from process P on Host A over the network to processes Q and R on Host B]

• Cross process and host boundaries
  – Propagate upon communication
• Reconstruct system-wide flows
• Compare flows to identify anomalies
Ongoing work
• Propagate upon communication
  – Notice the act of communication
  – Identify the peer
  – Inject the agent into the peer
  – Trace the peer after it receives the data
• Reconstruct system-wide flows
  – Separate concurrent interleaved flows
• Compare flows
  – Identify common flows and anomalies
Conclusion
• Data collection: acquire call traces from all nodes
  – Self-propelled instrumentation: autonomous, dynamic, and low-overhead
• Data analysis: identify unusual traces and find what made them unusual
  – Fine-grained: identifies individual suspect functions
  – Highly accurate: reduces the rate of false positives using past history
• Come see the demo!
Relevant Publications
• A.V. Mirgorodskiy, N. Maruyama, and B.P. Miller, "Root Cause Analysis of Failures in Large-Scale Computing Environments", submitted for publication. ftp://ftp.cs.wisc.edu/paradyn/papers/Mirgorodskiy05Root.pdf
• A.V. Mirgorodskiy and B.P. Miller, "Autonomous Analysis of Interactive Systems with Self-Propelled Instrumentation", 12th Multimedia Computing and Networking (MMCN 2005), San Jose, CA, January 2005. ftp://ftp.cs.wisc.edu/paradyn/papers/Mirgorodskiy04SelfProp.pdf