Detailed and understandable network diagnosis

Detailed and understandable network diagnosis Ratul Mahajan With Srikanth Kandula, Bongshin Lee, Zhicheng Liu (GaTech), Patrick Verkaik (UCSD), Sharad Agarwal, Jitu Padhye, Victor Bahl

Transcript of Detailed and understandable network diagnosis

Page 1: Detailed and understandable network diagnosis

Detailed and understandable network diagnosis

Ratul Mahajan

With Srikanth Kandula, Bongshin Lee, Zhicheng Liu (GaTech), Patrick Verkaik (UCSD),

Sharad Agarwal, Jitu Padhye, Victor Bahl

Page 2: Detailed and understandable network diagnosis

Network diagnosis explains faulty behavior

ratul | gatech | '09

Starts with problem symptoms and ends at likely culprits

User cannot access a remote folder → Configuration change denies permission

(Diagram labels: Configuration, File server, Photo viewer)

Page 3: Detailed and understandable network diagnosis

Current landscape of network diagnosis systems

(Diagram: diagnosis systems placed along a network-size axis. Big enterprises and large ISPs are marked; small enterprises are marked "?")

Page 4: Detailed and understandable network diagnosis

Why study small enterprise networks separately?

Big enterprises, large ISPs vs. small enterprises

• Less sophisticated admins
• Less rich connectivity
• Many shared components

IIS, SQL, Exchange, …

Page 5: Detailed and understandable network diagnosis

Our work

1. Uncovers the need for detailed and understandable diagnosis

2. Develops NetMedic for detailed diagnosis
• Diagnoses application faults without application knowledge

3. Develops NetClinic for explaining diagnostic analysis

Page 6: Detailed and understandable network diagnosis

Understanding problems in small enterprises

100+ cases

Symptoms, root causes

Page 7: Detailed and understandable network diagnosis

And the survey says …

Symptom                              Share
App-specific                         60 %
Failed initialization                13 %
Poor performance                     10 %
Hang or crash                        10 %
Unreachability                        7 %

Identified cause                     Share
Non-app config (e.g., firewall)      30 %
Software/driver bug                  21 %
App config                           19 %
Overload                              4 %
Hardware fault                        2 %
Unknown                              25 %

Detailed diagnosis
• Handle app-specific as well as generic faults
• Identify culprits at a fine granularity

Page 8: Detailed and understandable network diagnosis

Example problem 1: Server misconfig

(Diagram labels: Web server, two Browsers, Server config)

Page 9: Detailed and understandable network diagnosis

Example problem 2: Buggy client

(Diagram labels: SQL server, SQL client C1, SQL client C2, Requests)

Page 10: Detailed and understandable network diagnosis

Example problem 3: Client misconfig

(Diagram labels: Exchange server; two Outlook clients, each with its own config)

ratul | sigcomm | '09

Page 11: Detailed and understandable network diagnosis

Current formulations sacrifice detail (to scale)

Dependency graph based formulations (e.g., Sherlock [SIGCOMM 2007])
• Model the network as a dependency graph at a coarse level
• Simple dependency model

Page 12: Detailed and understandable network diagnosis

Example problem 1: Server misconfig

(Diagram labels: Web server, two Browsers, Server config)

The network model is too coarse in current formulations

Page 13: Detailed and understandable network diagnosis

Example problem 2: Buggy client

(Diagram labels: SQL server, SQL client C1, SQL client C2, Requests)

The dependency model is too simple in current formulations

Page 14: Detailed and understandable network diagnosis

Example problem 3: Client misconfig

(Diagram labels: Exchange server; two Outlook clients, each with its own config)

The failure model is too simple in current formulations

Page 15: Detailed and understandable network diagnosis

A formulation for detailed diagnosis

Dependency graph of fine-grained components

Component state is a multi-dimensional vector

(Diagram labels: SQL svr, Exch. svr, IIS svr, IIS config, Process, OS, Config, SQL client C1, SQL client C2)

Example state variables: % CPU time, IO bytes/sec, Connections/sec, 404 errors/sec
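The formulation on this slide can be sketched as a small data structure. This is a minimal illustration, not NetMedic's actual code: the class names, method names, variable keys, and all numeric values below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """A fine-grained component whose state is a vector of named variables."""
    name: str
    state: dict = field(default_factory=dict)


@dataclass
class DependencyGraph:
    """Components plus directed edges (source, destination)."""
    components: dict = field(default_factory=dict)
    edges: set = field(default_factory=set)

    def add(self, name, state):
        self.components[name] = Component(name, dict(state))

    def depends_on(self, dst, src):
        """Record that the state of dst may be affected by src."""
        self.edges.add((src, dst))

    def sources_of(self, dst):
        return sorted(s for s, d in self.edges if d == dst)


# Hypothetical values; only the variable kinds come from the slide.
graph = DependencyGraph()
graph.add("IIS svr", {
    "cpu_time_pct": 12.0,
    "io_bytes_per_sec": 4.0e5,
    "connections_per_sec": 30.0,
    "http_404_per_sec": 0.2,
})
graph.add("IIS config", {"last_change_age_sec": 3600.0})
graph.add("OS", {"cpu_time_pct": 35.0})
graph.depends_on("IIS svr", "IIS config")
graph.depends_on("IIS svr", "OS")
```

Nothing in the structure knows what a variable means, which matches the goal of diagnosing without application knowledge.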

Page 16: Detailed and understandable network diagnosis

The goal of diagnosis

Identify likely culprits for components of interest
• Without using semantics of state variables
• No application knowledge

(Diagram labels: Svr, C1, C2, Process, OS, Config)

Page 17: Detailed and understandable network diagnosis

Using joint historical behavior to estimate impact

Joint history of the states of D and S:

Time    State of D                State of S
0       (d0_a, d0_b, d0_c)        (s0_a, s0_b, s0_c, s0_d)
1       (d1_a, d1_b, d1_c)        (s1_a, s1_b, s1_c, s1_d)
...     ...                       ...
n       (dn_a, dn_b, dn_c)        (sn_a, sn_b, sn_c, sn_d)

• Identify time periods when state of S was “similar”
• How “similar” on average states of D are at those times

(Diagram: Svr with clients C1 and C2. Labels: Request rate (low), Response time (high); Request rate (high), Response time (high); Request rate (high); H, H, L)
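The two steps above can be sketched in a few lines. This is a simplified reading of the slide, not the authors' implementation: the function names, the parameter `k`, the mean-absolute-difference distance, and the assumption that every state variable is already scaled to [0, 1] are all choices made for this example.

```python
def distance(u, v):
    """Mean absolute difference between two state vectors scaled to [0, 1]."""
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)


def estimate_impact(s_hist, d_hist, s_now, d_now, k=2):
    """Estimate how well S's current state explains D's current state."""
    # Step 1: find the k past periods in which S looked most like it does now.
    similar = sorted(range(len(s_hist)),
                     key=lambda t: distance(s_hist[t], s_now))[:k]
    # Step 2: average how close D was, in those periods, to D's current state.
    return sum(1.0 - distance(d_hist[t], d_now) for t in similar) / len(similar)


# Hypothetical one-variable histories: S is a request rate, D a response time.
s_hist = [[0.10], [0.15], [0.90], [0.95]]
d_hist = [[0.10], [0.12], [0.90], [0.92]]

# D behaves now as it did whenever S was busy: the estimate is high.
high = estimate_impact(s_hist, d_hist, s_now=[0.92], d_now=[0.91])
# D behaves differently from those periods: the estimate is low.
low = estimate_impact(s_hist, d_hist, s_now=[0.92], d_now=[0.10])
```

The estimate never looks at what a variable means, only at how the two histories move together.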

Page 18: Detailed and understandable network diagnosis

Robust impact estimation

• Ignore state variables that represent redundant info
• Place higher weight on state variables likely related to fault being diagnosed
• Ignore state variables irrelevant to interaction with neighbor
• Account for aggregate relationships among state variables of neighboring components
• Account for disparate ranges of state variables
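Three of these points (redundant variables, per-variable weights, disparate ranges) can be illustrated with a short sketch. The techniques below are crude stand-ins chosen for the example, not the ones NetMedic uses: min-max scaling for ranges, exact-duplicate detection for redundancy, and hand-picked weights. All names and values are hypothetical.

```python
def normalize(history):
    """Rescale each variable to [0, 1] using its observed min and max."""
    cols = list(zip(*history))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [[(x - lo) / sp if sp > 0 else 0.0
             for x, lo, sp in zip(row, lows, spans)]
            for row in history]


def drop_redundant(history):
    """Indices of variables to keep: the first of any columns that are
    identical after scaling (a crude stand-in for redundancy detection)."""
    keep, seen = [], set()
    for i, col in enumerate(zip(*normalize(history))):
        key = tuple(round(x, 6) for x in col)
        if key not in seen:
            seen.add(key)
            keep.append(i)
    return keep


def weighted_distance(u, v, weights):
    """Weighted mean absolute difference between two scaled state vectors."""
    return sum(w * abs(a - b) for a, b, w in zip(u, v, weights)) / sum(weights)


# Hypothetical columns: CPU %, CPU fraction (same information), bytes/sec.
history = [[10.0, 0.10, 1000.0],
           [50.0, 0.50, 250000.0],
           [90.0, 0.90, 500000.0]]
scaled = normalize(history)
keep = drop_redundant(history)
past = [scaled[1][i] for i in keep]
now = [scaled[2][i] for i in keep]
# Assumed weights: the first kept variable counts three times as much.
dist = weighted_distance(now, past, weights=[3.0, 1.0])
```

Without scaling, the bytes/sec column would dominate any distance simply because its raw numbers are larger.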

Page 19: Detailed and understandable network diagnosis

Ranking likely culprits

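(Figure: example dependency graph over components A, B, C, D with edge weights 0.8, 0.8, 0.2, 0.2. Each of A, B, C, D is paired with A and assigned a path weight and a global impact, and the components are then ranked.)

Path weight values shown: 0.8, 0.8, 0.8, 0.2
Global impact values shown: 1.8, 0.8, 2.6, 0.4
Component order shown: C, B, A, D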
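One way to combine a path weight and a global impact into a ranking is sketched below. The slide does not spell out the formulas, so everything here is an assumption made for illustration: path weight is taken as the best geometric mean of edge weights over acyclic paths, global impact as the sum of a component's path weights to all components, and the score as their product. The edge set is hypothetical and is not the graph from the figure.

```python
import math


def path_weight(edges, src, dst):
    """Best geometric mean of edge weights over acyclic paths src -> dst."""
    if src == dst:
        return 1.0  # assumption: a component fully impacts itself
    best = 0.0

    def walk(node, visited, weights):
        nonlocal best
        if node == dst:
            log_mean = sum(math.log(w) for w in weights) / len(weights)
            best = max(best, math.exp(log_mean))
            return
        for (u, v), w in edges.items():
            if u == node and v not in visited and w > 0:
                walk(v, visited | {v}, weights + [w])

    walk(src, {src}, [])
    return best


def rank_culprits(edges, target):
    """Rank components by (path weight to target) * (global impact)."""
    components = sorted({c for edge in edges for c in edge})
    scores = {}
    for c in components:
        global_impact = sum(path_weight(edges, c, e) for e in components)
        scores[c] = path_weight(edges, c, target) * global_impact
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical edge weights, keyed by (source, destination).
edges = {("C", "A"): 0.8, ("C", "B"): 0.8, ("B", "A"): 0.8,
         ("D", "A"): 0.2, ("C", "D"): 0.2}
ranking = rank_culprits(edges, "A")
order = [name for name, _ in ranking]
```

In this toy graph C comes first because it reaches A strongly and also influences the most other components.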

Page 20: Detailed and understandable network diagnosis

Implementation of NetMedic

(Pipeline diagram)
Monitor components → Component states
Inputs: Target components, Diagnosis time, Reference time
Diagnose: a. edge impact, b. path impact
Output: Ranked list of likely culprits

Page 21: Detailed and understandable network diagnosis

Evaluation setup

IIS, SQL, Exchange, …

10 actively used desktops

Diverse set of faults observed in the logs

#components: ~1000

#dimensions per component (avg): 35

Page 22: Detailed and understandable network diagnosis

NetMedic assigns low ranks to actual culprits

(Plot: cumulative % of faults vs. rank of actual culprit, both axes 0 to 100. Curves: NetMedic, Coarse)
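For readers unfamiliar with this kind of plot, the curve is the share of faults whose actual culprit appears at or before each rank. A minimal sketch of the computation, with made-up ranks that are not the talk's data:

```python
def cumulative_pct(ranks, max_rank):
    """For each cutoff 1..max_rank, the percentage of faults whose actual
    culprit was ranked at or before that cutoff."""
    n = len(ranks)
    return [100.0 * sum(r <= cutoff for r in ranks) / n
            for cutoff in range(1, max_rank + 1)]


# Hypothetical ranks of the actual culprit in eight diagnosed faults.
ranks = [1, 1, 2, 1, 5, 3, 1, 2]
curve = cumulative_pct(ranks, max_rank=5)
```

A curve that rises quickly means the actual culprit is usually near the top of the list, which is what "low ranks" means in the slide title.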

Page 23: Detailed and understandable network diagnosis

NetMedic handles concurrent faults well

2 simultaneous faults

(Plot: cumulative % of faults vs. rank of actual culprit, both axes 0 to 100. Curves: NetMedic, Coarse)

Page 24: Detailed and understandable network diagnosis

Other empirical results

NetMedic needs a modest amount (~60 mins) of history

The key to effectiveness is correctly identifying many low impact edges

It compares favorably with a method that understands variable semantics

Page 25: Detailed and understandable network diagnosis

Unleashing (systems like) NetMedic on admins

How to present the analysis results?
• Need human verification

(Fundamental?) trade-off between coverage and accuracy

(Plot: accuracy vs. fault coverage. Labels: Rule based, Inference based, State of the practice, Research activity)

Page 26: Detailed and understandable network diagnosis

The understandability challenge

Admins should be able to verify the correctness of the analysis
• Identify culprits themselves if analysis is incorrect

Two sub-problems at the intersection with HCI
• Visualizing complex analysis (NetClinic)
• Intuitiveness of analysis (ongoing work)

Page 27: Detailed and understandable network diagnosis

NetClinic: Visualizing diagnostic analysis

Underlying assumption: Admins can verify analysis if information is presented appropriately
• They have expert, out-of-band information

Views diagnosis as multi-level analysis

Makes results at all levels accessible on top of a semantic graph layout

Allows top-down and bottom-up navigation across levels while retaining context

Page 28: Detailed and understandable network diagnosis

Page 29: Detailed and understandable network diagnosis

Page 30: Detailed and understandable network diagnosis

Page 31: Detailed and understandable network diagnosis

Page 32: Detailed and understandable network diagnosis

NetClinic user study

11 participants with knowledge of computer networks but not of NetMedic

Given 3 diagnostic tasks each after training
• 88% task completion rate

Uncovered a rich mix of user strategies that the visualization must support

Page 33: Detailed and understandable network diagnosis

Intuitiveness of analysis

What if you could modify the analysis itself to make it more accessible to humans?
• Counters the tendency to “optimize” for incremental gains in accuracy

(Plot axes: Accuracy, Understandability)

Page 34: Detailed and understandable network diagnosis

Intuitiveness of analysis (2)

Goal: Go from mechanical measures to more human-centric measures
• Example: MoS measure for VoIP

Factors to consider
• What information is used? E.g., Local vs. global
• What operations are used? E.g., Arithmetic vs. geometric means
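The "arithmetic vs. geometric means" factor can be made concrete with a small comparison. The edge weights are invented for the example and the helper names are hypothetical.

```python
import math


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def geometric_mean(xs):
    """Geometric mean of positive numbers, computed via logarithms."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


# Hypothetical weights along one path: two strong edges and one weak edge.
path = [0.9, 0.9, 0.1]
am = arithmetic_mean(path)   # about 0.633
gm = geometric_mean(path)    # about 0.433
```

The geometric mean penalizes the single weak edge more heavily. Which of the two an admin finds easier to reason about is the open question the slide raises.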

Page 35: Detailed and understandable network diagnosis

Conclusions

NetClinic enables admins to understand and verify complex diagnostic analyses

NetMedic enables detailed diagnosis in enterprise networks w/o application knowledge

Thinking small (networks) can provide new perspectives

(Diagram axis labels: Accuracy, Detail, Understandability, Coverage)