Transcript: Detailed and understandable network diagnosis
Ratul Mahajan
With Srikanth Kandula, Bongshin Lee, Zhicheng Liu (GaTech), Patrick Verkaik (UCSD),
Sharad Agarwal, Jitu Padhye, Victor Bahl
Network diagnosis explains faulty behavior
ratul | gatech | '09
Starts with problem symptoms and ends at likely culprits
[Diagram: a photo viewer fails to access a remote folder on a file server; a configuration change denies permission]
Current landscape of network diagnosis systems
[Chart: diagnosis systems across network sizes; big enterprises and large ISPs are well covered, small enterprises are an open question]
Why study small enterprise networks separately?
Compared to big enterprises and large ISPs, small enterprises have:
• Less sophisticated admins
• Less rich connectivity
• Many shared components (IIS, SQL, Exchange, …)
Our work
1. Uncovers the need for detailed and understandable diagnosis
2. Develops NetMedic for detailed diagnosis
• Diagnoses application faults without application knowledge
3. Develops NetClinic for explaining diagnostic analysis
Understanding problems in small enterprises
Surveyed 100+ cases: symptoms and root causes

Symptoms:
• App-specific: 60%
• Failed initialization: 13%
• Poor performance: 10%
• Hang or crash: 10%
• Unreachability: 7%

Identified causes:
• Non-app config (e.g., firewall): 30%
• Software/driver bug: 21%
• App config: 19%
• Overload: 4%
• Hardware fault: 2%
• Unknown: 25%
And the survey says …
Detailed diagnosis
Handle app-specific as well as generic faults
Identify culprits at a fine granularity
Example problem 1: Server misconfig
[Diagram: web server with its server config; two browsers as clients]
Example problem 2: Buggy client
[Diagram: SQL server; SQL clients C1 and C2 sending requests]
Example problem 3: Client misconfig
[Diagram: Exchange server with its config; Outlook client with its Outlook config]
Current formulations sacrifice detail (to scale)
Dependency-graph-based formulations (e.g., Sherlock [SIGCOMM 2007])
• Model the network as a dependency graph at a coarse level
• Use a simple dependency model
Example problem 1: Server misconfig
[Diagram repeated: web server with its server config; two browsers]
The network model is too coarse in current formulations
Example problem 2: Buggy client
[Diagram repeated: SQL server; SQL clients C1 and C2 sending requests]
The dependency model is too simple in current formulations
Example problem 3: Client misconfig
[Diagram repeated: Exchange server and its config; Outlook and its Outlook config]
The failure model is too simple in current formulations
A formulation for detailed diagnosis
Dependency graph of fine-grained components
Component state is a multi-dimensional vector
[Diagram: dependency graph of fine-grained components (SQL svr, Exch. svr, IIS svr, IIS config, process, OS, config, SQL clients C1 and C2); each component's state is a vector of variables such as % CPU time, IO bytes/sec, connections/sec, and 404 errors/sec]
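To make the formulation concrete, a minimal Python sketch (the class shape is illustrative, not NetMedic's actual data structure; the variable names are the ones on the slide):

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    """A fine-grained component (process, OS, config, ...) whose state
    is a generic multi-dimensional vector; the diagnosis never
    interprets what the individual variables mean."""
    name: str
    neighbors: list = field(default_factory=list)  # dependency edges
    state: dict = field(default_factory=dict)      # variable -> value

# Illustrative state variables taken from the slide.
iis = Component("IIS svr", state={
    "% CPU time": 12.0,
    "IO bytes/sec": 5.3e4,
    "Connections/sec": 40.0,
    "404 errors/sec": 0.2,
})
sql_c1 = Component("SQL client C1", neighbors=[iis])
```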
The goal of diagnosis
[Diagram: server and clients C1, C2, each comprising process, OS, and config components]
Identify likely culprits for components of interest, without using the semantics of state variables (no application knowledge).
Using joint historical behavior to estimate impact
[Table: joint history of components D and S — at each time step i, state vectors (d_i, s_i), each with variables a, b, c, d]
Identify time periods when the state of S was “similar” to its current state.
Then measure how “similar”, on average, the states of D are at those times.
[Example: the server's response time is high; clients C1 and C2 differ in request rate, and their historical states yield high (H) vs. low (L) impact estimates]
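A minimal Python sketch of this history-based impact estimate, assuming Euclidean distance over state vectors and a fixed number of similar time bins (both choices are assumptions for illustration, not NetMedic's exact procedure):

```python
import numpy as np

def edge_impact(s_hist, d_hist, s_now, d_now, k=10):
    """Estimate the impact of component S on neighbor D:
    find historical times when S's state resembled its current state,
    then check how much D's state at those times resembles D's current
    state. High resemblance suggests S's state explains D's state.

    s_hist, d_hist: (T, n) arrays of state vectors over T time bins.
    s_now, d_now: current state vectors.
    """
    # Distance of each historical state of S from its current state.
    s_dist = np.linalg.norm(s_hist - s_now, axis=1)
    # The k times when S looked most like it does now.
    similar_times = np.argsort(s_dist)[:k]
    # How far D's state was from its current state at those times,
    # normalized by its typical distance over the whole history.
    d_dist = np.linalg.norm(d_hist[similar_times] - d_now, axis=1)
    scale = np.linalg.norm(d_hist - d_now, axis=1).mean() + 1e-9
    # Impact in [0, 1]: near 1 when D also looked the way it does now.
    return float(np.clip(1.0 - d_dist.mean() / scale, 0.0, 1.0))
```

On synthetic data, a neighbor whose state tracks S receives a much higher impact than an unrelated neighbor.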
Robust impact estimation
• Ignore state variables that represent redundant info
• Place higher weight on state variables likely related to the fault being diagnosed
• Ignore state variables irrelevant to the interaction with the neighbor
• Account for aggregate relationships among state variables of neighboring components
• Account for disparate ranges of state variables
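Two of these heuristics — dropping redundant variables and handling disparate ranges — can be sketched as follows (the z-score normalization and the correlation threshold are assumptions for illustration, not NetMedic's exact choices):

```python
import numpy as np

def preprocess_states(hist, corr_thresh=0.95):
    """(a) Normalize each state variable so disparate ranges (e.g.,
    CPU % vs. IO bytes/sec) weigh equally in distance computations,
    and (b) drop variables that are near-duplicates of an earlier
    one (redundant info).

    hist: (T, n) array of a component's state vectors over time.
    Returns the normalized history and the indices of kept variables.
    """
    mu = hist.mean(axis=0)
    sigma = hist.std(axis=0) + 1e-9
    z = (hist - mu) / sigma              # comparable ranges

    keep = []
    for j in range(z.shape[1]):
        redundant = any(
            abs(np.corrcoef(z[:, j], z[:, i])[0, 1]) > corr_thresh
            for i in keep
        )
        if not redundant:
            keep.append(j)
    return z[:, keep], keep
```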
Ranking likely culprits
[Diagram: components A, B, C, D with edge impacts of 0.8 and 0.2; the weight of a path to the target is the product of its edge impacts, and a component's global impact is the sum of its path weights (values such as 1.8, 0.8, 2.6, 0.4), yielding the ranked list C, B, A, D]
Diagnose: (a) edge impact, (b) path impact
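A toy sketch of this aggregation, using exhaustive path enumeration on a small graph (NetMedic's actual search is more careful; the edge values in the example below are illustrative, not the slide's):

```python
def rank_culprits(edge_impact, target, max_hops=4):
    """Rank components by global impact on the target: a path's
    weight is the product of edge impacts along it, and a
    component's global impact is the sum of its path weights.

    edge_impact: dict mapping (src, dst) -> impact in [0, 1].
    """
    nodes = {n for edge in edge_impact for n in edge}

    def paths(src):
        # DFS over simple paths src -> ... -> target (tiny graphs only).
        found, stack = [], [(src, [src])]
        while stack:
            node, path = stack.pop()
            if node == target:
                found.append(path)
                continue
            if len(path) > max_hops:
                continue
            for (u, v) in edge_impact:
                if u == node and v not in path:
                    stack.append((v, path + [v]))
        return found

    scores = {}
    for src in nodes - {target}:
        total = 0.0
        for p in paths(src):
            weight = 1.0
            for u, v in zip(p, p[1:]):
                weight *= edge_impact[(u, v)]
            total += weight
        scores[src] = total
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

For example, with edges B→A (0.8), C→B (0.8), C→A (0.2), and D→A (0.2), C's global impact on A is 0.2 + 0.8·0.8 = 0.84, so C ranks first.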
Implementation of NetMedic
Inputs: target components, diagnosis time, reference time
Pipeline: monitor components → capture component states → output a ranked list of likely culprits
Evaluation setup
Testbed: 10 actively used desktops, with shared servers (IIS, SQL, Exchange, …)
Faults: a diverse set observed in the logs
#components: ~1000
#dimensions per component (avg): 35
NetMedic assigns low ranks to actual culprits
[Chart: cumulative % of faults vs. rank of actual culprit, comparing NetMedic and a coarse-grained method; NetMedic ranks the actual culprit lower (better) for most faults]
NetMedic handles concurrent faults well
With 2 simultaneous faults:
[Chart: cumulative % of faults vs. rank of actual culprit, NetMedic vs. the coarse-grained method]
Other empirical results
NetMedic needs a modest amount (~60 mins) of history
The key to its effectiveness is correctly identifying many low-impact edges
It compares favorably with a method that understands variable semantics
Unleashing (systems like) NetMedic on admins
How to present the analysis results?
• Need human verification
(Fundamental?) trade-off between coverage and accuracy
[Chart: accuracy vs. fault coverage; rule-based systems (the state of the practice) vs. inference-based systems (the focus of research activity)]
The understandability challenge
Admins should be able to verify the correctness of the analysis
• Identify culprits themselves if the analysis is incorrect
Two sub-problems at the intersection with HCI
• Visualizing complex analysis (NetClinic)
• Intuitiveness of analysis (ongoing work)
NetClinic: Visualizing diagnostic analysis
Underlying assumption: admins can verify the analysis if information is presented appropriately (they have expert, out-of-band information)
• Views diagnosis as multi-level analysis
• Makes results at all levels accessible on top of a semantic graph layout
• Allows top-down and bottom-up navigation across levels while retaining context
NetClinic user study
11 participants with knowledge of computer networks but not of NetMedic
Given 3 diagnostic tasks each after training
• 88% task completion rate
Uncovered a rich mix of user strategies that the visualization must support
Intuitiveness of analysis
What if you could modify the analysis itself to make it more accessible to humans?
• Counters the tendency to “optimize” for incremental gains in accuracy
[Chart: understandability vs. accuracy]
Intuitiveness of analysis (2)
Goal: go from mechanical measures to more human-centric measures
• Example: the MOS (Mean Opinion Score) measure for VoIP
Factors to consider:
• What information is used? E.g., local vs. global
• What operations are used? E.g., arithmetic vs. geometric means
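A small illustration of why the choice of operation matters (the scores below are made up): a geometric mean is pulled down far more by a single bad score than an arithmetic mean, which can better match how humans judge overall quality.

```python
import math

def arithmetic_mean(xs):
    """Plain average."""
    return sum(xs) / len(xs)

def geometric_mean(xs):
    """n-th root of the product; sensitive to any single low score."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical per-call quality scores in (0, 1]: one bad call among good ones.
scores = [0.9, 0.9, 0.9, 0.1]
print(arithmetic_mean(scores))  # ~0.70
print(geometric_mean(scores))   # ~0.52, dragged down by the one bad call
```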
Conclusions
NetClinic enables admins to understand and verify complex diagnostic analyses
[Diagram: the trade-off space among detail, coverage, accuracy, and understandability]
NetMedic enables detailed diagnosis in enterprise networks without application knowledge
Thinking small (networks) can provide new perspectives