George J. Lee Advanced Network Architecture Group


CAPRI: A Common Architecture for Autonomous, Distributed Internet Fault Diagnosis using Probabilistic Relational Models. George J. Lee, Advanced Network Architecture Group, Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology.


Page 1: George J. Lee  Advanced Network Architecture Group

04/19/23

CAPRI: A Common Architecture for Autonomous, Distributed Internet Fault Diagnosis using Probabilistic Relational Models

George J. Lee <[email protected]>
Advanced Network Architecture Group
Computer Science and Artificial Intelligence Lab
Massachusetts Institute of Technology

Page 2: George J. Lee  Advanced Network Architecture Group

Automated Internet fault diagnosis is difficult

[Figure: diagnostic agents (DAs) with labels Failure Report, Knowledge, Data, Diagnosis, Reasoning. DA = Diagnostic Agent]

Knowledge, data, and reasoning are distributed
– Agents need a common extensible language for expressing knowledge & data

Agents have incomplete information
– Agents must perform probabilistic diagnosis when evidence is unavailable

Distributed diagnosis is costly
– Agents must minimize probing and communication cost

We need a Common Architecture for Probabilistic Reasoning in the Internet (CAPRI)

Page 3: George J. Lee  Advanced Network Architecture Group

Overview

An extensible language for expressing diagnostic data & knowledge
– Based on Bayes nets and Probabilistic Relational Models

Distributed probabilistic reasoning while minimizing probing and communication cost
– Trading off accuracy and cost
– Incorporating past evidence
– Propagating evidence to other agents
– Simulations: accuracy vs. cost

Learning diagnostic knowledge for real-world diagnosis
– Passive diagnosis of HTTP proxy connections
– Evaluation: accuracy using learned knowledge

Page 4: George J. Lee  Advanced Network Architecture Group

Bayes nets can express diagnostic data

Data = evidence about a particular failure
– Diagnostic test results
– Component status

Diagnosis without domain-specific knowledge

Allows distributed inference

[Figure: Bayes net for the IP path A, B, C, …, N with nodes A-B Link, B-C Link, A→N Path, B→N Path, C→N Path, and A→N Probe; label: A-B Link=FAIL]

A-B Link   B→N Path   P(A→N Path=OK)
FAIL       FAIL       0
FAIL       OK         0
OK         FAIL       0
OK         OK         1

A→N Path   P(A→N Probe=OK)
OK         0.95
FAIL       0
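As a rough illustration of how the two tables on this slide support diagnosis, the sketch below computes the posterior probability that each link has failed after a failed A→N probe, using brute-force enumeration. It assumes a three-link path (A-B, B-C, C-N), independent links, and a 0.99 prior that each link is OK; the path table is flattened to "the path is OK only if every link is OK". The names `LINKS`, `P_LINK_OK`, and `posterior_link_fail` are hypothetical and not part of CAPRI.

```python
from itertools import product

LINKS = ["A-B", "B-C", "C-N"]        # ASSUMPTION: a three-link path
P_LINK_OK = 0.99                     # ASSUMPTION: same prior for every link
P_PROBE_OK_GIVEN_PATH_OK = 0.95      # from the probe table on the slide


def posterior_link_fail(probe_ok: bool) -> dict:
    """Return P(link = FAIL | probe result) for each link by enumeration."""
    joint_fail = {link: 0.0 for link in LINKS}
    evidence = 0.0
    for states in product([True, False], repeat=len(LINKS)):
        # Prior probability of this joint assignment of link states.
        p = 1.0
        for ok in states:
            p *= P_LINK_OK if ok else 1.0 - P_LINK_OK
        # Deterministic path table: the path is OK only if all links are OK.
        path_ok = all(states)
        p_probe_ok = P_PROBE_OK_GIVEN_PATH_OK if path_ok else 0.0
        # Weight by the likelihood of the observed probe result.
        p *= p_probe_ok if probe_ok else 1.0 - p_probe_ok
        evidence += p
        for link, ok in zip(LINKS, states):
            if not ok:
                joint_fail[link] += p
    return {link: joint_fail[link] / evidence for link in LINKS}


posterior = posterior_link_fail(probe_ok=False)
```

Under these assumptions a single failed probe leaves each link with a fairly low failure probability, because a probe can also fail on a healthy path (1 - 0.95); more evidence is needed before any one link stands out.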

Page 5: George J. Lee  Advanced Network Architecture Group

Probabilistic Relational Models (PRMs) can express diagnostic knowledge

Knowledge = shared knowledge about component and test classes
– Class dependencies
– Diagnostic tests

Agents generate Bayes net using PRM

Provided by experts or learned by agents

Extensible
– New component and test classes
– Subclassing (e.g. Wireless Link)

[Figure: PRM with classes Link (attribute Status), IP Path (attribute Status; references First and Rest), and Ping Test (attribute Result; reference Path)]

Link
P(Status=OK)
0.99

IP Path
First   Rest   P(Status=OK)
FAIL    FAIL   0
FAIL    OK     0
OK      FAIL   0
OK      OK     1

Ping Test
Path   P(Result=OK)
FAIL   0
OK     0.95
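One way to picture "agents generate Bayes net using PRM" is to treat each class as a template and instantiate it once per concrete component. The sketch below grounds the Link, IP Path, and Ping Test classes for one hop sequence and records which class supplies each node's conditional probability table. The `Node` type, the `ground_ip_path` function, and the choice to give the last path segment a single Link parent are assumptions made for illustration, not CAPRI's actual representation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str                      # e.g. "A-B Link"
    cls: str                       # PRM class that supplies this node's CPT
    parents: list = field(default_factory=list)


def ground_ip_path(hops):
    """Instantiate the PRM classes for one hop sequence (at least two hops)."""
    net = {}

    def add(name, cls, parents=()):
        net[name] = Node(name, cls, list(parents))
        return name

    dest = hops[-1]
    rest = None
    # Build from the destination backwards so each path can refer to its Rest.
    for a, b in reversed(list(zip(hops, hops[1:]))):
        link = add(f"{a}-{b} Link", "Link")
        # ASSUMPTION: the final segment depends only on its one link.
        parents = [link] if rest is None else [link, rest]
        rest = add(f"{a}->{dest} Path", "IP Path", parents)
    add(f"{hops[0]}->{dest} Probe", "Ping Test", [rest])
    return net


net = ground_ip_path(["A", "B", "C", "N"])
for node in net.values():
    print(node.cls, node.name, "<-", node.parents)
```

Because the dependency structure lives in the classes, a new component type (for example a hypothetical Wireless Link subclass) would only need its own table; the grounding step stays the same.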

Page 6: George J. Lee  Advanced Network Architecture Group

Probabilistic models enable agents to reduce diagnosis cost

Diagnosis Procedure:
1. Receive failure report
2. Construct Bayes net from PRM
3. Incorporate current and past evidence using a Dynamic Bayes Net (DBN)
4. Infer most probable explanation (MPE) for failure
5. While mpe_confidence < confThresh:
   1. Perform local tests or request diagnosis from other agents to maximize relevance/cost
6. Propagate evidence to other agents
7. Return diagnosis

Architectural points:
– Agents can trade off accuracy vs. cost using a confidence threshold
– Agents can infer current status from past evidence given a temporal failure model
– Agents can reduce load and improve robustness by propagating evidence

Diagnosis cost = probing + communication cost
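The loop in step 5 of the diagnosis procedure can be sketched in a few lines. The slide does not define "relevance", so this example makes simplifying assumptions that are not from the source: exactly one component has failed, each test gives a definitive answer about one component, and relevance is approximated by the current belief that the tested component is the failed one. The function `diagnose` and its parameters are hypothetical.

```python
def diagnose(prior, costs, run_test, conf_thresh):
    """Greedy test selection until the best explanation is confident enough.

    prior:     dict component -> P(component is the failed one); sums to 1
    costs:     dict component -> cost of testing that component
    run_test:  callable(component) -> True if the component has failed
    Returns (most probable explanation, its confidence, total cost spent).
    """
    belief = dict(prior)
    untested = set(belief)
    total_cost = 0.0
    while max(belief.values()) < conf_thresh and untested:
        # ASSUMPTION: relevance is approximated by the current belief;
        # sorting first makes tie-breaking deterministic.
        comp = max(sorted(untested), key=lambda c: belief[c] / costs[c])
        untested.discard(comp)
        total_cost += costs[comp]
        if run_test(comp):
            belief = {c: 1.0 if c == comp else 0.0 for c in belief}
        else:
            belief[comp] = 0.0
            z = sum(belief.values())
            if z == 0.0:
                break  # evidence contradicts the single-failure assumption
            belief = {c: p / z for c, p in belief.items()}
    mpe = max(belief, key=belief.get)
    return mpe, belief[mpe], total_cost


prior = {"A-B": 0.5, "B-N": 0.3, "N-Dest": 0.2}   # made-up numbers
costs = {"A-B": 1.0, "B-N": 1.0, "N-Dest": 1.0}   # made-up numbers
truth = "B-N"
careful = diagnose(prior, costs, lambda c: c == truth, conf_thresh=0.9)
cheap = diagnose(prior, costs, lambda c: c == truth, conf_thresh=0.5)
```

With the made-up numbers above, the higher threshold spends two tests and finds the true failure, while the lower threshold spends nothing and returns the prior's best guess, which happens to be wrong here. That is the accuracy versus cost tradeoff the confidence threshold controls.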

Page 7: George J. Lee  Advanced Network Architecture Group

Minimizing cost for IP path diagnosis

IP path diagnosis: ISP (A→B), rest of path (B→N), or destination (N→Dest)
Simulated topology of 6000 Autonomous Systems (ASes)
1 DA per AS that can test links and destinations associated with that AS
All diagnostic agents have knowledge of prior link failure probabilities
Diagnostic agents are reachable up to the point of failure
Status of inter-AS links and destination hosts drawn from prior probabilities
Evidence collection and propagation follow DAs in the AS path

[Figure: DA 1 (User, AS A), DA 2 (AS B), …, DA k (AS K), …, DA n (Dest, AS N) along the IP path User, A, B, …, K, …, N, Dest; arrows labeled Failure report, Evidence collection, Evidence propagation, and Diagnosis]

Page 8: George J. Lee  Advanced Network Architecture Group

Agents can trade off accuracy and cost

[Figure: results plot with points labeled by confThresh (0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)]

13 confidence thresholds, 500 users, 5 trials

Page 9: George J. Lee  Advanced Network Architecture Group

Incorporating past evidence reduces probing costs

cache duration = number of past time steps of evidence to consider
Inter-AS link failures modeled as a Markov chain (Gilbert model)
100 users, 5 trials, 30 time steps
>95% accuracy

Page 10: George J. Lee  Advanced Network Architecture Group

Evidence propagation reduces probing and communication costs

[Figure: results for 100,000 failures and 50,000 failures]

10,000 users
5 trials
1 time step
>95% accuracy

Page 11: George J. Lee  Advanced Network Architecture Group

Agents can learn probabilistic models for TCP overlay connection diagnosis

1. Learn inter-AS TCP failure probabilities from PlanetSeer (28.3 million TCP connections from 196 hosts over 10 hours)
2. Diagnose HTTP proxy connections on CoDeeN without using probes

[Figure: TCP overlay path User, Proxy, Server. HTTP Proxy Conn. (User→Server) depends on TCP Conn. (User→Proxy) and TCP Conn. (Proxy→Server); each TCP Conn. depends on SrcAS, DstAS, and Hour]

TCP Conn.
Src AS   Dst AS   Hour   P(Status=OK)
1        1        1      0.99
1        2        1      0.87
…        …        …      …

HTTP Proxy Conn.
User→Proxy   Proxy→Server   P(Status=OK)
FAIL         FAIL           0
FAIL         OK             0
OK           FAIL           0
OK           OK             1
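The slide does not say how the TCP Conn. table is estimated. A minimal sketch, assuming simple frequency counting over observed connections, could look like the following. The record layout `(src_as, dst_as, hour, ok)`, the function names, and the sample records are all made up for illustration and are not PlanetSeer or CoDeeN data.

```python
from collections import defaultdict


def learn_tcp_cpt(records):
    """Estimate P(Status=OK | SrcAS, DstAS, Hour) by counting.

    records: iterable of (src_as, dst_as, hour, ok) tuples.
    """
    ok_count = defaultdict(int)
    total = defaultdict(int)
    for src_as, dst_as, hour, ok in records:
        key = (src_as, dst_as, hour)
        total[key] += 1
        ok_count[key] += 1 if ok else 0
    return {key: ok_count[key] / total[key] for key in total}


def p_proxy_conn_ok(cpt, user_proxy_key, proxy_server_key):
    """P(HTTP Proxy Conn. = OK): both TCP connections must be OK.

    ASSUMPTION: the two TCP connections are independent given their parents.
    """
    return cpt[user_proxy_key] * cpt[proxy_server_key]


# Made-up observations: (src_as, dst_as, hour, ok)
records = (
    [(1, 1, 1, True)] * 3 + [(1, 1, 1, False)]
    + [(1, 2, 1, True), (1, 2, 1, False)]
)
cpt = learn_tcp_cpt(records)
p_ok = p_proxy_conn_ok(cpt, (1, 1, 1), (1, 2, 1))
```

A key with no observations simply has no entry here; a real agent would need some fallback (for instance a coarser table or a smoothed estimate), which this sketch leaves out.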

Page 12: George J. Lee  Advanced Network Architecture Group

Learned diagnostic knowledge improves accuracy

Accuracy: 80% vs. 53%
– Train on hour x
– Test on hour x + 1

Accuracy improves as training interval increases
– Train on first x hours, test on hour x + 1

Accuracy remains high as training set age increases
– Train on hour 1, test on hour x > 1

Page 13: George J. Lee  Advanced Network Architecture Group

Benefits of CAPRI

An extensible language for diagnostic data and knowledge
– Based on Bayes nets and PRMs

Distributed diagnosis while minimizing probing and communication cost
– accuracy/cost tradeoff
– incorporating past evidence
– evidence propagation

Robustness to missing data
– probabilistic inference using cached data

Ability to learn diagnostic knowledge
– learn conditional failure probabilities using PRMs

Page 14: George J. Lee  Advanced Network Architecture Group

Future Work

Costs and incentives
– Learning the true network costs of diagnostic tests
– Dynamically adjusting cost
– Incentives for agent to reveal evidence

Intelligent routing of diagnostic queries

Temporal failure models
– Learning temporal failure models
– Predicting failure duration

Diagnosis using data from end users

Page 15: George J. Lee  Advanced Network Architecture Group

[Figure: Accuracy (axis ticks 76% to 86%) vs. Training Set Age (Hours) (axis ticks 1 to 9)]

Page 16: George J. Lee  Advanced Network Architecture Group

Modeling dynamic networks

Model network component state as a Markov chain (Gilbert model)

[Figure: two-state Markov chain with states OK and FAIL and transition probabilities 0.97, 0.03, 0.71, and 0.29]

Dynamic Bayes net (DBN): s1, s2, s3

P(s3=OK | s1=FAIL) = P(s3=OK | s2=OK) P(s2=OK | s1=FAIL)
                   + P(s3=OK | s2=FAIL) P(s2=FAIL | s1=FAIL)
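The two-step formula above is one application of the Markov chain's transition table, repeated once per time step. The sketch below evaluates it numerically. The transcript does not preserve which probability labels which arrow in the figure, so the assignment used here (0.97 for OK staying OK, 0.71 for FAIL staying FAIL) is an assumption chosen so that each row sums to 1; the names `T`, `step`, and `predict` are hypothetical.

```python
# ASSUMPTION: which value labels which transition is inferred, not stated.
T = {
    "OK":   {"OK": 0.97, "FAIL": 0.03},
    "FAIL": {"OK": 0.29, "FAIL": 0.71},
}


def step(dist):
    """Advance a distribution over {OK, FAIL} by one time step."""
    return {nxt: sum(dist[cur] * T[cur][nxt] for cur in T) for nxt in T}


def predict(start_state, steps):
    """Distribution over states after `steps` transitions from a known state."""
    dist = {s: 1.0 if s == start_state else 0.0 for s in T}
    for _ in range(steps):
        dist = step(dist)
    return dist


# P(s3=OK | s1=FAIL): two transitions after an observed failure.
p_s3_ok = predict("FAIL", 2)["OK"]   # 0.97*0.29 + 0.29*0.71 = 0.4872
```

Under the assumed table, a component seen failing two steps ago is a little less likely than not to be OK now, which is the kind of inference that lets an agent reuse cached evidence instead of probing again.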