Detailed and understandable network diagnosis
Ratul Mahajan
With Srikanth Kandula, Bongshin Lee, Zhicheng Liu (GaTech), Patrick Verkaik (UCSD), Sharad Agarwal, Jitu Padhye, Victor Bahl
Slide 2
Network diagnosis explains faulty behavior
ratul | gatech | '09
Diagnosis starts with problem symptoms and ends at likely culprits.
[Diagram: a user's photo viewer cannot access a remote folder on the file server; the culprit is a configuration change that denies permission]
Slide 3
Current landscape of network diagnosis systems
[Chart: diagnosis systems plotted by network size; big enterprises and large ISPs are well covered, small enterprises are marked with question marks]
Slide 4
Why study small enterprise networks separately?
- Less sophisticated admins
- Less rich connectivity
- Many shared components (IIS, SQL, Exchange, ...)
Slide 5
Our work
1. Uncovers the need for detailed and understandable diagnosis
2. Develops NetMedic for detailed diagnosis, which diagnoses application faults without application knowledge
3. Develops NetClinic for explaining diagnostic analysis
Slide 6
Understanding problems in small enterprises
Studied 100+ cases: symptoms and root causes
Slide 7
And the survey says...

Symptom:
- App-specific: 60%
- Failed initialization: 13%
- Poor performance: 10%
- Hang or crash: 10%
- Unreachability: 7%

Identified cause:
- Non-app config (e.g., firewall): 30%
- Software/driver bug: 21%
- App config: 19%
- Overload: 4%
- Hardware fault: 2%
- Unknown: 25%

Implication: detailed diagnosis must handle app-specific as well as generic faults and identify culprits at a fine granularity.
Slide 8
Example problem 1: Server misconfig
[Diagram: browser talking to a web server; the culprit is the server config]
Slide 9
Example problem 2: Buggy client
[Diagram: SQL clients C1 and C2 sending requests to a SQL server]
Slide 10
Example problem 3: Client misconfig
[Diagram: Outlook talking to an Exchange server; the culprit is the Outlook config]
Slide 11
Current formulations sacrifice detail (to scale)
Dependency-graph-based formulations (e.g., Sherlock [SIGCOMM 2007]):
- Model the network as a dependency graph at a coarse level
- Use a simple dependency model
Slide 12
Example problem 1: Server misconfig
[Diagram: browser, web server, server config]
The network model is too coarse in current formulations.
Slide 13
Example problem 2: Buggy client
[Diagram: SQL clients C1 and C2 sending requests to a SQL server]
The dependency model is too simple in current formulations.
Slide 14
Example problem 3: Client misconfig
[Diagram: Outlook, Outlook config, Exchange server]
The failure model is too simple in current formulations.
Slide 15
A formulation for detailed diagnosis
- Dependency graph of fine-grained components (process, OS, config)
- Component state is a multi-dimensional vector
[Diagram: SQL server, Exchange server, IIS server, IIS config, and SQL clients C1 and C2 as graph nodes; example state variables: % CPU time, IO bytes/sec, Connections/sec, 404 errors/sec]
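The fine-grained, multi-dimensional state above can be sketched as follows. This is a hypothetical illustration (the `Component` class and example values are mine, not NetMedic's actual data structures):

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    """A fine-grained component (process, OS, or config piece) whose
    state is a multi-dimensional vector sampled over time."""
    name: str
    history: list = field(default_factory=list)  # list of dict snapshots

    def record(self, snapshot: dict):
        """Append one state sample: variable name -> value."""
        self.history.append(snapshot)

# Hypothetical example: a SQL client process with a few state variables.
c1 = Component("SQL client C1")
c1.record({"% CPU time": 12.5, "IO bytes/sec": 4096.0, "Connections/sec": 3.0})
c1.record({"% CPU time": 85.0, "IO bytes/sec": 120000.0, "Connections/sec": 40.0})
print(len(c1.history))  # → 2
```

Keeping state as a generic name-to-value mapping is what lets the diagnosis stay agnostic to variable semantics.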
Slide 16
The goal of diagnosis
Identify likely culprits for components of interest:
- Without using semantics of state variables
- With no application knowledge
[Diagram: server and clients C1, C2; each modeled as process, OS, and config components]
Slide 17
Using joint historical behavior to estimate impact
- Identify time periods when the state of source component S was similar to its current state
- Measure how similar, on average, the states of destination component D were at those times
[Diagram: state histories of S and D over time; example with a server and clients C1 and C2, comparing request rate and response time and labeling impact as high (H) or low (L)]
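The history-based estimate above can be sketched as below. This is a minimal illustration that assumes state variables are already normalized to [0, 1] and uses one simple similarity function; NetMedic's actual metric differs:

```python
def similarity(a: dict, b: dict) -> float:
    """Similarity in [0, 1]: one minus the mean absolute difference
    over shared variables (assumes values already in [0, 1])."""
    keys = a.keys() & b.keys()
    if not keys:
        return 0.0
    diff = sum(abs(a[k] - b[k]) for k in keys) / len(keys)
    return max(0.0, 1.0 - diff)

def impact(src_hist, dst_hist, src_now, dst_now, top_k=3):
    """Impact of source S on destination D: find the top_k past times
    when S's state was most like its current state, then average how
    similar D's state at those times was to D's current state."""
    times = sorted(range(len(src_hist)),
                   key=lambda t: similarity(src_hist[t], src_now),
                   reverse=True)[:top_k]
    return sum(similarity(dst_hist[t], dst_now) for t in times) / len(times)

# Hypothetical example: when the server's request rate looked like it
# does now, client C1's response time also looked like it does now.
src_hist = [{"request rate": 0.9}, {"request rate": 0.1}]
dst_hist = [{"response time": 0.8}, {"response time": 0.2}]
print(impact(src_hist, dst_hist,
             {"request rate": 0.9}, {"response time": 0.8}, top_k=1))  # → 1.0
```

A high value suggests D's current (possibly faulty) state tends to co-occur with S's current state, i.e., S plausibly impacts D.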
Slide 18
Robust impact estimation
- Ignore state variables that represent redundant info
- Place higher weight on state variables likely related to the fault being diagnosed
- Ignore state variables irrelevant to the interaction with the neighbor
- Account for aggregate relationships among state variables of neighboring components
- Account for disparate ranges of state variables
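Two of the points above, disparate ranges and fault-related weighting, can be sketched as a preprocessing step. The min-max rescaling and deviation-based weights here are illustrative assumptions, not NetMedic's exact heuristics:

```python
def normalize(history):
    """Rescale each variable to [0, 1] using its historical min/max,
    so variables with disparate ranges contribute comparably."""
    keys = history[0].keys()
    lo = {k: min(s[k] for s in history) for k in keys}
    hi = {k: max(s[k] for s in history) for k in keys}
    return [{k: (s[k] - lo[k]) / (hi[k] - lo[k]) if hi[k] > lo[k] else 0.0
             for k in keys} for s in history]

def fault_weights(history, now):
    """Weight each variable by how far its current value deviates from
    its historical mean: large deviations suggest fault relevance."""
    return {k: abs(now[k] - sum(s[k] for s in history) / len(history))
            for k in now}

history = [{"IO bytes/sec": 0.0}, {"IO bytes/sec": 10.0}]
print(normalize(history))  # each variable rescaled into [0, 1]
print(fault_weights([{"x": 0.0}, {"x": 1.0}], {"x": 1.0}))  # {'x': 0.5}
```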
Slide 19
Ranking likely culprits
[Diagram: example dependency graph over components A, B, C, and D with edge impacts such as 0.8 and 0.2; edge impacts combine into path weights, which are summed into a global impact per component to produce a ranked list]
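Combining per-edge impacts into a ranking can be sketched as below, assuming (as an illustration only) that a path's weight is the product of its edge impacts and a component's global impact is the sum over acyclic paths to the affected component; the edge values are made up:

```python
def global_impact(edges, src, dst, seen=None):
    """Global impact of src on dst: sum of path weights over all acyclic
    paths from src to dst, where a path's weight is the product of its
    edge impacts (an illustrative combination rule)."""
    if src == dst:
        return 1.0
    seen = (seen or set()) | {src}
    return sum(w * global_impact(edges, nxt, dst, seen)
               for (a, nxt), w in edges.items()
               if a == src and nxt not in seen)

# Hypothetical edge impacts between components (values in [0, 1]).
edges = {("A", "B"): 0.8, ("A", "C"): 0.2, ("C", "A"): 0.8, ("B", "D"): 0.5}

# Rank candidate culprits by their global impact on affected component D.
ranked = sorted(["A", "B", "C"],
                key=lambda c: global_impact(edges, c, "D"), reverse=True)
print(ranked)  # → ['B', 'A', 'C']
```

Components whose state changes plausibly propagate to the affected component through high-impact paths end up at the top of the list.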
Slide 20
Implementation of NetMedic
Inputs: component states from monitored components, target components, diagnosis time, reference time
Diagnose:
a. Edge impact
b. Path impact
Output: ranked list of likely culprits
Slide 21
Evaluation setup
- 10 actively used desktops running IIS, SQL, Exchange, ...
- Diverse set of faults observed in the logs
- ~1000 components; ~35 dimensions per component on average
Slide 22
NetMedic assigns low ranks to actual culprits
Other empirical results
- NetMedic needs a modest amount (~60 min) of history
- The key to effectiveness is correctly identifying many low-impact edges
- It compares favorably with a method that understands variable semantics
Slide 25
Unleashing (systems like) NetMedic on admins
- How to present the analysis results?
- Human verification is needed
- A (fundamental?) trade-off between coverage and accuracy
[Chart: accuracy vs. fault coverage, placing rule-based systems (the state of the practice) and inference-based systems (research activity)]
Slide 26
The understandability challenge
Admins should be able to:
- Verify the correctness of the analysis
- Identify culprits themselves if the analysis is incorrect
Two sub-problems at the intersection with HCI:
- Visualizing complex analysis (NetClinic)
- Intuitiveness of analysis (ongoing work)
Slide 27
NetClinic: Visualizing diagnostic analysis
Underlying assumption: admins can verify the analysis if information is presented appropriately, because they have expert, out-of-band information.
- Views diagnosis as multi-level analysis
- Makes results at all levels accessible on top of a semantic graph layout
- Allows top-down and bottom-up navigation across levels while retaining context
Slide 28
Slide 29
Slide 30
Slide 31
Slide 32
NetClinic user study
- 11 participants with knowledge of computer networks but not of NetMedic
- Given 3 diagnostic tasks each, after training
- 88% task completion rate
- Uncovered a rich mix of user strategies that the visualization must support
Slide 33
Intuitiveness of analysis
What if you could modify the analysis itself to make it more accessible to humans? This counters the tendency to optimize for incremental gains in accuracy.
[Chart: accuracy vs. understandability]
Slide 34
Intuitiveness of analysis (2)
Goal: go from mechanical measures to more human-centric measures (example: the MOS measure for VoIP).
Factors to consider:
- What information is used? E.g., local vs. global
- What operations are used? E.g., arithmetic vs. geometric means
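The arithmetic-vs-geometric distinction above matters because a geometric mean penalizes a single very poor component score much more sharply; a quick illustration with made-up scores:

```python
import math

scores = [0.9, 0.9, 0.1]  # two healthy components, one badly degraded

arith = sum(scores) / len(scores)
geo = math.prod(scores) ** (1 / len(scores))

# The geometric mean drags the aggregate down far more than the
# arithmetic mean when one score is very poor.
print(round(arith, 3), round(geo, 3))  # → 0.633 0.433
```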
Slide 35
Conclusions
- NetMedic enables detailed diagnosis in enterprise networks without application knowledge
- NetClinic enables admins to understand and verify complex diagnostic analyses
- Thinking small (networks) can provide new perspectives