Detailed and understandable network diagnosis. Ratul Mahajan, with Srikanth Kandula, Bongshin Lee, Zhicheng Liu (GaTech), Patrick Verkaik (UCSD), Sharad Agarwal, Jitu Padhye, and Victor Bahl.


Transcript

  • Slide 1
  • Detailed and understandable network diagnosis. Ratul Mahajan, with Srikanth Kandula, Bongshin Lee, Zhicheng Liu (GaTech), Patrick Verkaik (UCSD), Sharad Agarwal, Jitu Padhye, and Victor Bahl.
  • Slide 2
  • Network diagnosis explains faulty behavior (ratul | gatech | '09). It starts with problem symptoms and ends at likely culprits. Example: a user cannot access a remote folder because a configuration change on the file server denies permission. [Diagram: photo viewer, file server, and its configuration.]
  • Slide 3
  • Current landscape of network diagnosis systems. [Diagram: along the network-size axis, big enterprises and large ISPs are well served by existing systems, while small enterprises are marked with question marks.]
  • Slide 4
  • Why study small enterprise networks separately? Compared to big enterprises and large ISPs, they have less sophisticated admins, less rich connectivity, and many shared components (IIS, SQL, Exchange, ...).
  • Slide 5
  • Our work: 1. uncovers the need for detailed and understandable diagnosis; 2. develops NetMedic for detailed diagnosis, which diagnoses application faults without application knowledge; 3. develops NetClinic for explaining diagnostic analysis.
  • Slide 6
  • Understanding problems in small enterprises: we studied 100+ cases, looking at their symptoms and root causes.
  • Slide 7
  • And the survey says... Symptoms: app-specific 60%, failed initialization 13%, poor performance 10%, hang or crash 10%, unreachability 7%. Identified causes: non-app config (e.g., firewall) 30%, software/driver bug 21%, app config 19%, overload 4%, hardware fault 2%, unknown 25%. Implication: detailed diagnosis should handle app-specific as well as generic faults and identify culprits at a fine granularity.
  • Slide 8
  • Example problem 1: server misconfiguration. [Diagram: browser, web server, and the server's configuration.]
  • Slide 9
  • Example problem 2: buggy client. [Diagram: SQL clients C1 and C2 sending requests to a SQL server.]
  • Slide 10
  • Example problem 3: client misconfiguration (ratul | sigcomm | '09). [Diagram: Outlook, Outlook's configuration, and an Exchange server.]
  • Slide 11
  • Current formulations sacrifice detail (to scale). Dependency-graph-based formulations (e.g., Sherlock [SIGCOMM 2007]) model the network as a dependency graph at a coarse level, with a simple dependency model.
  • Slide 12
  • Example problem 1, revisited: server misconfiguration. The network model is too coarse in current formulations.
  • Slide 13
  • Example problem 2, revisited: buggy client. The dependency model is too simple in current formulations.
  • Slide 14
  • Example problem 3, revisited: client misconfiguration. The failure model is too simple in current formulations.
  • Slide 15
  • A formulation for detailed diagnosis: a dependency graph of fine-grained components (processes, OS, configuration), where each component's state is a multi-dimensional vector (e.g., % CPU time, IO bytes/sec, connections/sec, 404 errors/sec). [Diagram: SQL server, Exchange server, IIS server and its config, and SQL clients C1 and C2.]
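The formulation on this slide can be sketched as a small data structure; this is a minimal illustration, not NetMedic's actual implementation, and the class and variable names are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    """One fine-grained component (a process, the OS, a config file, ...)."""
    name: str
    neighbors: list = field(default_factory=list)   # dependency edges
    history: list = field(default_factory=list)     # state vectors over time

    def observe(self, state: dict) -> None:
        # Record one multi-dimensional state sample, keyed by variable name.
        self.history.append(state)

# Build a tiny dependency graph: a client process depends on a server.
sql_server = Component("SQL svr")
client = Component("SQL client C1")
client.neighbors.append(sql_server)
client.observe({"% CPU time": 5.0, "Connections/sec": 40.0})
```

Note that the diagnosis engine never interprets the variable names; they are opaque dimensions of the state vector, which is what lets the approach work without application knowledge.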
  • Slide 16
  • The goal of diagnosis: identify likely culprits for the components of interest, without using the semantics of state variables, i.e., with no application knowledge. [Diagram: a server and clients C1 and C2, each composed of process, OS, and config components.]
  • Slide 17
  • Using joint historical behavior to estimate impact: identify the time periods when the state of the source component S was similar to its current state, and measure how similar, on average, the states of the destination component D were at those times. [Diagram: state histories of S and D over time, e.g., the server's request rate (high/low) against a client's response time (high/low).]
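The history-based estimate above can be sketched as follows; the similarity metric here is an illustrative per-variable closeness score, not NetMedic's exact formula:

```python
def similarity(u: dict, v: dict) -> float:
    """Closeness of two state vectors in [0, 1], averaged over shared
    variables (an illustrative metric, not NetMedic's exact one)."""
    keys = u.keys() & v.keys()
    if not keys:
        return 0.0
    eps = 1e-9
    return sum(
        1.0 - abs(u[k] - v[k]) / max(abs(u[k]), abs(v[k]), eps)
        for k in keys
    ) / len(keys)

def impact(src_hist, dst_hist, src_now, dst_now, k=2):
    """Impact of source on destination: among the k historical times when
    the source's state was most similar to its current state, how similar
    on average was the destination's state to its current state?"""
    times = sorted(range(len(src_hist)),
                   key=lambda t: similarity(src_hist[t], src_now),
                   reverse=True)[:k]
    return sum(similarity(dst_hist[t], dst_now) for t in times) / len(times)
```

Intuitively, if D looked the same in the past whenever S looked the way it does now, S is a plausible cause of D's current state; if D's past states at those times differ from its current one, the estimated impact is low.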
  • Slide 18
  • Robust impact estimation: ignore state variables that represent redundant information; place higher weight on state variables likely related to the fault being diagnosed; ignore state variables irrelevant to the interaction with a neighbor; account for aggregate relationships among state variables of neighboring components; and account for the disparate ranges of state variables.
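The last item, handling disparate variable ranges, can be sketched with standard-score normalization; this is one plausible realization of the idea under stated assumptions, not the talk's exact mechanism:

```python
import statistics

def normalize_history(history):
    """Rescale each variable by its historical mean and standard deviation
    so variables with disparate ranges (e.g. IO bytes/sec vs. error counts)
    contribute comparably to state similarity (a sketch; variable names
    are illustrative)."""
    names = {k for state in history for k in state}
    stats = {}
    for k in names:
        vals = [s[k] for s in history if k in s]
        mu = statistics.fmean(vals)
        sd = statistics.pstdev(vals) or 1.0   # guard constant variables
        stats[k] = (mu, sd)
    return [{k: (v - stats[k][0]) / stats[k][1] for k, v in s.items()}
            for s in history]
```

After this rescaling, a one-standard-deviation swing in a bytes/sec counter and in an errors/sec counter move a distance-based similarity score by the same amount.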
  • Slide 19
  • Ranking likely culprits: the weight of a path from a candidate component to the affected component is derived from its edge impacts, and path weights are combined into a global impact score used to rank candidates. [Diagram: components A, B, C, and D with edge weights such as 0.8 and 0.2, and the resulting global impact scores.]
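One simplified reading of this ranking step: take a path's weight to be the product of its edge impacts and a component's global impact on the target to be its best acyclic-path weight. NetMedic's exact combination may differ, and the graph below is hypothetical:

```python
def global_impact(edges, src, target):
    """A path's weight is the product of its edge impacts; a component's
    global impact on the target is taken here as the best acyclic-path
    weight (a simplified sketch of path-based ranking)."""
    best = 0.0
    def dfs(node, weight, seen):
        nonlocal best
        if node == target:
            best = max(best, weight)
            return
        for nxt, w in edges.get(node, []):
            if nxt not in seen:
                dfs(nxt, weight * w, seen | {nxt})
    dfs(src, 1.0, {src})
    return best

# Hypothetical edge impacts (component names are illustrative).
edges = {"A": [("B", 0.8), ("C", 0.2)],
         "B": [("D", 0.9)],
         "C": [("D", 0.5)]}
ranked = sorted(["A", "B", "C"],
                key=lambda c: global_impact(edges, c, "D"),
                reverse=True)
```

Here B outranks A and C as a culprit for D because its direct edge impact (0.9) exceeds A's best path (0.8 × 0.9 = 0.72) and C's (0.5).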
  • Slide 20
  • Implementation of NetMedic: monitor components and capture component states; then, given the target components, a diagnosis time, and a reference time, diagnose by computing (a) edge impacts and (b) path impacts; the output is a ranked list of likely culprits.
  • Slide 21
  • Evaluation setup: IIS, SQL, Exchange, ...; 10 actively used desktops; a diverse set of faults observed in the logs; ~1000 components, with 35 dimensions per component on average.
  • Slide 22
  • NetMedic assigns low ranks to actual culprits
  • Slide 23
  • NetMedic handles concurrent faults well, evaluated with 2 simultaneous faults.
  • Slide 24
  • Other empirical results: NetMedic needs a modest amount (~60 minutes) of history; the key to its effectiveness is correctly identifying many low-impact edges; and it compares favorably with a method that understands variable semantics.
  • Slide 25
  • Unleashing (systems like) NetMedic on admins: how should the analysis results be presented? Human verification is needed, and there is a (fundamental?) trade-off between coverage and accuracy. [Diagram: accuracy vs. fault coverage; rule-based systems are the state of the practice, while inference-based systems are the focus of research activity.]
  • Slide 26
  • The understandability challenge: admins should be able to verify the correctness of the analysis, and to identify culprits themselves if the analysis is incorrect. Two sub-problems sit at the intersection with HCI: visualizing complex analysis (NetClinic) and the intuitiveness of analysis (ongoing work).
  • Slide 27
  • NetClinic: visualizing diagnostic analysis. Underlying assumption: admins can verify the analysis if the information is presented appropriately, because they have expert, out-of-band information. NetClinic views diagnosis as a multi-level analysis, makes results at all levels accessible on top of a semantic graph layout, and allows top-down and bottom-up navigation across levels while retaining context.
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • NetClinic user study: 11 participants with knowledge of computer networks but not of NetMedic were each given 3 diagnostic tasks after training. The task completion rate was 88%, and the study uncovered a rich mix of user strategies that the visualization must support.
  • Slide 33
  • Intuitiveness of analysis: what if you could modify the analysis itself to make it more accessible to humans? This counters the tendency to optimize for incremental gains in accuracy. [Diagram: accuracy vs. understandability.]
  • Slide 34
  • Intuitiveness of analysis (2). Goal: go from mechanical measures to more human-centric measures; an example is the MOS measure for VoIP. Factors to consider: what information is used (e.g., local vs. global), and what operations are used (e.g., arithmetic vs. geometric means)?
  • Slide 35
  • Conclusions: NetMedic enables detailed diagnosis in enterprise networks without application knowledge; NetClinic enables admins to understand and verify complex diagnostic analyses; and thinking small (networks) can provide new perspectives. [Diagram: trade-off triangles over detail, accuracy, coverage, and understandability.]