Detailed and understandable network diagnosis


Detailed and understandable network diagnosis

Ratul Mahajan

With Srikanth Kandula, Bongshin Lee, Zhicheng Liu (GaTech), Patrick Verkaik (UCSD),

Sharad Agarwal, Jitu Padhye, Victor Bahl

Network diagnosis explains faulty behavior

ratul | gatech | '09

Starts with problem symptoms and ends at likely culprits

Example: a photo viewer on a user's machine cannot access a remote folder on a file server; a configuration change on the server denies permission.

Current landscape of network diagnosis systems

Diagram: diagnosis systems plotted against network size. Big enterprises and large ISPs are well covered; small enterprises are an open question.

Why study small enterprise networks separately?

Compared to big enterprises and large ISPs, small enterprises have:

• Less sophisticated admins
• Less rich connectivity
• Many shared components (IIS, SQL, Exchange, …)

Our work

1. Uncovers the need for detailed and understandable diagnosis

2. Develops NetMedic for detailed diagnosis
• Diagnoses application faults without application knowledge

3. Develops NetClinic for explaining diagnostic analysis

Understanding problems in small enterprises

100+ cases

Symptoms and root causes:

Symptom
• App-specific: 60%
• Failed initialization: 13%
• Poor performance: 10%
• Hang or crash: 10%
• Unreachability: 7%

Identified cause
• Non-app config (e.g., firewall): 30%
• Software/driver bug: 21%
• App config: 19%
• Overload: 4%
• Hardware fault: 2%
• Unknown: 25%

And the survey says …..

Detailed diagnosis

Handle app-specific as well as generic faults

Identify culprits at a fine granularity

Example problem 1: Server misconfig

Diagram: two browsers accessing a web server; the server's configuration is the culprit.

Example problem 2: Buggy client

Diagram: SQL clients C1 and C2 sending requests to a SQL server; a buggy client is the culprit.

Example problem 3: Client misconfig

Diagram: an Outlook client connected to an Exchange server; the Outlook configuration is the culprit.

Current formulations sacrifice detail (to scale)

Dependency graph based formulations (e.g., Sherlock [SIGCOMM2007])

• Model the network as a dependency graph at a coarse level
• Simple dependency model

Example problem 1: Server misconfig


The network model is too coarse in current formulations

Example problem 2: Buggy client


The dependency model is too simple in current formulations

Example problem 3: Client misconfig


The failure model is too simple in current formulations

A formulation for detailed diagnosis

Dependency graph of fine-grained components

Component state is a multi-dimensional vector

Diagram: fine-grained components include processes (SQL server, Exchange server, IIS server, SQL clients C1 and C2), the OS, and configuration (e.g., IIS config). Example state variables: % CPU time, IO bytes/sec, connections/sec, 404 errors/sec.
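A component's state can be sketched as a plain mapping from variable name to reading; the variable names below are illustrative placeholders, not NetMedic's actual counter names:

```python
# One component's multi-dimensional state: a reading per state variable.
# Variable names are illustrative placeholders, not real counter names.
state = {
    "pct_cpu_time": 12.5,          # % CPU time
    "io_bytes_per_sec": 48000.0,   # IO bytes/sec
    "connections_per_sec": 3.2,    # connections/sec
    "errors_404_per_sec": 0.0,     # 404 errors/sec
}

# Diagnosis treats this as an opaque vector: no variable semantics needed.
vector = [state[k] for k in sorted(state)]
```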

The goal of diagnosis

Identify likely culprits for components of interest

Without using semantics of state variables (no application knowledge)

Diagram: a server and clients C1, C2; candidate culprits include a process, the OS, and a config.

Using joint historical behavior to estimate impact

Diagram: historical time series of multi-dimensional states for a source component S (s0 … sn) and a destination component D (d0 … dn); each state is a vector of variables (a, b, c, …).

To estimate the impact of S on D:
1. Identify time periods in the history when the state of S was "similar" to its current state
2. Compute how "similar", on average, the states of D were at those times

Diagram: a server and clients C1, C2, each annotated with request rate and response time (high/low) at diagnosis and historical times.
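The two steps above can be sketched as follows. The similarity function and the choice of the k most similar historical states are assumptions of this sketch, and state variables are assumed pre-normalized to [0, 1]:

```python
import numpy as np

def edge_impact(s_hist, d_hist, s_now, d_now, k=5):
    """Sketch of history-based impact estimation for one edge S -> D.

    s_hist, d_hist: (T, n_vars) arrays of historical state vectors.
    s_now, d_now:   state vectors at diagnosis time.
    Step 1: find the k historical times when S's state was most similar
    to s_now. Step 2: return the average similarity of D's state at
    those times to d_now; a high value suggests S explains D's state.
    """
    def similarity(a, b):
        # 1 - mean absolute difference, clipped to [0, 1];
        # assumes variables are normalized to [0, 1].
        return float(np.clip(1.0 - np.mean(np.abs(a - b)), 0.0, 1.0))

    s_sims = np.array([similarity(s, s_now) for s in s_hist])
    closest = np.argsort(s_sims)[-k:]   # k most similar states of S
    return float(np.mean([similarity(d_hist[t], d_now) for t in closest]))
```

If D tracked S historically (similar S states coincide with similar D states), the estimated impact is high; if D moved independently of S, it is low.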

Robust impact estimation

• Ignore state variables that represent redundant info
• Place higher weight on state variables likely related to the fault being diagnosed
• Ignore state variables irrelevant to interaction with the neighbor
• Account for aggregate relationships among state variables of neighboring components
• Account for disparate ranges of state variables
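The last point can be illustrated with a range-normalization pass; this is a sketch of one plausible heuristic, not NetMedic's exact mechanism:

```python
import numpy as np

def normalize_history(hist):
    """Rescale each state variable by its observed historical range so
    that variables with disparate ranges (e.g., IO bytes/sec in the
    tens of thousands vs. error rates near zero) contribute comparably
    to similarity computations. hist: (T, n_vars) float array."""
    lo = hist.min(axis=0)
    span = hist.max(axis=0) - lo
    span[span == 0] = 1.0   # constant variables: avoid divide-by-zero
    return (hist - lo) / span
```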

Ranking likely culprits

Diagnosis proceeds in two steps: (a) compute edge impacts, (b) compute path impacts.

Diagram: a dependency graph over components A, B, C, D with edge impacts of 0.8 and 0.2. Each path from a candidate culprit to the affected component is weighted using the edge impacts along it, and each candidate's global impact combines its path weights (example values: 1.8, 0.8, 2.6, 0.4). Candidates are ranked by global impact (example ranking: C, B, A, D).
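A minimal sketch of this ranking, assuming a path's weight is the product of the edge impacts along it and a candidate's global impact is the sum over acyclic paths (a plausible combination rule, not necessarily NetMedic's exact one):

```python
def global_impacts(edges, target):
    """Rank candidate culprits by global impact on `target`.

    edges: dict mapping (src, dst) -> edge impact in [0, 1].
    A path's weight is the product of its edge impacts; a candidate's
    global impact is the sum of weights over all acyclic paths from
    the candidate to the target. Returns (component, score) pairs,
    highest score first.
    """
    graph = {}
    for (src, dst), w in edges.items():
        graph.setdefault(src, []).append((dst, w))

    def paths(node, seen):
        # Yield the weight of every acyclic path from `node` to `target`.
        if node == target:
            yield 1.0
            return
        for nxt, w in graph.get(node, []):
            if nxt not in seen:
                for rest in paths(nxt, seen | {nxt}):
                    yield w * rest

    nodes = {n for edge in edges for n in edge}
    scores = {n: sum(paths(n, {n})) for n in nodes if n != target}
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

For example, with edges C→B (0.8), B→A (0.8), D→A (0.2) and target A, B ranks first (0.8), then C (0.8 × 0.8 = 0.64), then D (0.2).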

Implementation of NetMedic

Inputs: target components, diagnosis time, reference time

Pipeline: monitor components → capture component states → output a ranked list of likely culprits

Evaluation setup

Testbed: 10 actively used desktops running shared components (IIS, SQL, Exchange, …)

Faults: a diverse set observed in the logs

#components: ~1000
#dimensions per component (avg): 35

NetMedic assigns low ranks to actual culprits

Figure: cumulative % of faults (y-axis) vs. rank of actual culprit (x-axis), comparing NetMedic against a coarse-grained baseline.

NetMedic handles concurrent faults well

2 simultaneous faults

Figure: cumulative % of faults vs. rank of actual culprit, comparing NetMedic against the coarse-grained baseline.

Other empirical results

NetMedic needs a modest amount (~60 mins) of history

The key to effectiveness is correctly identifying many low impact edges

It compares favorably with a method that understands variable semantics


Unleashing (systems like) NetMedic on admins

How to present the analysis results?
• Need human verification

(Fundamental?) trade-off between coverage and accuracy

Figure: accuracy vs. fault coverage. Rule-based systems (the state of the practice) sit at high accuracy but low coverage; inference-based systems (the focus of research activity) cover more faults at lower accuracy.

The understandability challenge

Admins should be able to verify the correctness of the analysis
• Identify culprits themselves if the analysis is incorrect

Two sub-problems at the intersection with HCI
• Visualizing complex analysis (NetClinic)
• Intuitiveness of analysis (ongoing work)

NetClinic: Visualizing diagnostic analysis

Underlying assumption: admins can verify the analysis if information is presented appropriately
• They have expert, out-of-band information

• Views diagnosis as multi-level analysis
• Makes results at all levels accessible on top of a semantic graph layout
• Allows top-down and bottom-up navigation across levels while retaining context

NetClinic user study

11 participants with knowledge of computer networks but not of NetMedic

Given 3 diagnostic tasks each after training
• 88% task completion rate

Uncovered a rich mix of user strategies that the visualization must support


Intuitiveness of analysis

What if you could modify the analysis itself to make it more accessible to humans?
• Counters the tendency to "optimize" for incremental gains in accuracy

Figure: understandability vs. accuracy trade-off.

Intuitiveness of analysis (2)

Goal: go from mechanical measures to more human-centric measures
• Example: the MOS (mean opinion score) measure for VoIP

Factors to consider
• What information is used? E.g., local vs. global
• What operations are used? E.g., arithmetic vs. geometric means

Conclusions

NetClinic enables admins to understand and verify complex diagnostic analyses

Diagram: trade-offs among accuracy, detail, coverage, and understandability

NetMedic enables detailed diagnosis in enterprise networks without application knowledge

Thinking small (networks) can provide new perspectives