George J. Lee Advanced Network Architecture Group
04/19/23
CAPRI: A Common Architecture for Autonomous, Distributed Internet Fault Diagnosis using Probabilistic Relational Models
George J. Lee <[email protected]>
Advanced Network Architecture Group
Computer Science and Artificial Intelligence Lab
Massachusetts Institute of Technology

Automated Internet fault diagnosis is difficult
[Figure: three DAs (DA = Diagnostic Agent); labels: Failure Report, Knowledge, Data, Reasoning, Diagnosis]
Knowledge, data, and reasoning are distributed
– Agents need a common extensible language for expressing knowledge & data
Agents have incomplete information
– Agents must perform probabilistic diagnosis when evidence is unavailable
Distributed diagnosis is costly
– Agents must minimize probing and communication cost
We need a Common Architecture for Probabilistic Reasoning in the Internet (CAPRI)

Overview
An extensible language for expressing diagnostic data & knowledge
– Based on Bayes nets and Probabilistic Relational Models
Distributed probabilistic reasoning while minimizing probing and communication cost
– Trading off accuracy and cost
– Incorporating past evidence
– Propagating evidence to other agents
– Simulations: accuracy vs. cost
Learning diagnostic knowledge for real-world diagnosis
– Passive diagnosis of HTTP proxy connections
– Evaluation: accuracy using learned knowledge

Bayes nets can express diagnostic data
Data = evidence about a particular failure
– Diagnostic test results
– Component status
Diagnosis without domain-specific knowledge
Allows distributed inference

[Figure: Bayes net for the IP path A, B, C, …, N with nodes A-B Link, B-C Link, A→N Path, B→N Path, C→N Path, …, and A→N Probe; evidence shown: A-B Link=FAIL]

A-B Link   B→N Path   P(A→N Path=OK)
FAIL       FAIL       0
FAIL       OK         0
OK         FAIL       0
OK         OK         1

A→N Path   P(A→N Probe=OK)
OK         0.95
FAIL       0
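
The two conditional probability tables on this slide are enough to run a small inference by enumeration. The sketch below computes the posterior probability that the A-B link has failed given the outcome of the A->N probe. The prior values in `PRIOR_OK` and every function name are assumptions made for illustration; only the two tables come from the slide.

```python
from itertools import product

# Hypothetical priors P(component = OK); these values are NOT from the slide.
PRIOR_OK = {"ab_link": 0.99, "bn_path": 0.95}


def p_path_ok(ab_link_ok: bool, bn_path_ok: bool) -> float:
    """Slide CPT: the A->N path is OK only if both of its parts are OK."""
    return 1.0 if (ab_link_ok and bn_path_ok) else 0.0


def p_probe_ok(an_path_ok: bool) -> float:
    """Slide CPT: a probe over a working path succeeds with probability 0.95."""
    return 0.95 if an_path_ok else 0.0


def joint(ab_ok: bool, bn_ok: bool, path_ok: bool, probe_ok: bool) -> float:
    """Joint probability of one full assignment, following the net's structure."""
    p = PRIOR_OK["ab_link"] if ab_ok else 1.0 - PRIOR_OK["ab_link"]
    p *= PRIOR_OK["bn_path"] if bn_ok else 1.0 - PRIOR_OK["bn_path"]
    q = p_path_ok(ab_ok, bn_ok)
    p *= q if path_ok else 1.0 - q
    r = p_probe_ok(path_ok)
    p *= r if probe_ok else 1.0 - r
    return p


def posterior_link_fail(probe_ok: bool) -> float:
    """P(A-B Link = FAIL | A->N Probe = probe_ok), summing out hidden nodes."""
    num = den = 0.0
    for ab_ok, bn_ok, path_ok in product([True, False], repeat=3):
        p = joint(ab_ok, bn_ok, path_ok, probe_ok)
        den += p
        if not ab_ok:
            num += p
    return num / den


if __name__ == "__main__":
    print("P(link FAIL | probe FAIL) =", posterior_link_fail(False))
    print("P(link FAIL | probe OK)   =", posterior_link_fail(True))
```

Brute-force enumeration is used here only because the fragment has four nodes; it does not scale to the full path and is not presented as the inference method used in CAPRI.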

Probabilistic Relational Models (PRMs) can express diagnostic knowledge
Knowledge = shared knowledge about component and test classes
– Class dependencies
– Diagnostic tests
Agents generate Bayes net using PRM
Provided by experts or learned by agents
Extensible
– New component and test classes
– Subclassing (e.g. Wireless Link)

[Figure: PRM with classes Link (attribute Status), IP Path (attribute Status; references First and Rest), and Ping Test (attribute Result; reference Path)]

Link:
P(Status=OK)
0.99

IP Path:
First   Rest   P(Status=OK)
FAIL    FAIL   0
FAIL    OK     0
OK      FAIL   0
OK      OK     1

Ping Test:
Path   P(Result=OK)
FAIL   0
OK     0.95
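
As a rough illustration of "agents generate Bayes net using PRM", the sketch below instantiates the three classes on this slide for one concrete list of hops. The `Node` type, the function names, and the handling of the final hop (a two-hop path has only a First link and no Rest) are assumptions of this sketch, not details given on the slide.

```python
from dataclasses import dataclass, field

# Class-level knowledge from the slide, shared by every instance of a class.
LINK_PRIOR_OK = 0.99          # P(Link.Status = OK)
PING_OK_GIVEN_PATH_OK = 0.95  # P(Ping Test.Result = OK | Path.Status = OK)


@dataclass
class Node:
    name: str
    cls: str                              # "Link", "IP Path", or "Ping Test"
    parents: list = field(default_factory=list)


def build_path_net(hops):
    """Instantiate the PRM for one concrete IP path, e.g. ["A", "B", "C", "N"]."""
    nodes = {}

    def link(a, b):
        name = f"{a}-{b} Link.Status"
        nodes.setdefault(name, Node(name, "Link"))
        return name

    def path(seq):
        name = f"{seq[0]}->{seq[-1]} Path.Status"
        if name not in nodes:
            parents = [link(seq[0], seq[1])]      # First
            if len(seq) > 2:                      # Rest (assumed absent on last hop)
                parents.append(path(seq[1:]))
            nodes[name] = Node(name, "IP Path", parents)
        return name

    top = path(hops)
    probe = f"{hops[0]}->{hops[-1]} Ping Test.Result"
    nodes[probe] = Node(probe, "Ping Test", [top])
    return nodes


def p_ok(node, parents_ok):
    """Class-level CPT lookup: P(node = OK | OK/FAIL state of each parent)."""
    if node.cls == "Link":
        return LINK_PRIOR_OK
    if node.cls == "IP Path":
        return 1.0 if all(parents_ok) else 0.0
    return PING_OK_GIVEN_PATH_OK if parents_ok[0] else 0.0


if __name__ == "__main__":
    for node in build_path_net(["A", "B", "C", "N"]).values():
        print(node.cls, "|", node.name, "<-", node.parents)
```

Because the probabilities live on the classes rather than on individual nodes, adding a subclass such as Wireless Link would only require a new branch in the class-level lookup.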

Probabilistic models enable agents to reduce diagnosis cost
Diagnosis Procedure:
1. Receive failure report
2. Construct Bayes net from PRM
3. Incorporate current and past evidence using a Dynamic Bayes Net (DBN)
4. Infer most probable explanation (MPE) for failure
5. While mpe_confidence < confThresh:
   1. Perform local tests or request diagnosis from other agents to maximize relevance/cost
6. Propagate evidence to other agents
7. Return diagnosis
Architectural points:
– Agents can trade off accuracy vs. cost using a confidence threshold
– Agents can infer current status from past evidence given a temporal failure model
– Agents can reduce load and improve robustness by propagating evidence
Diagnosis cost = probing + communication cost
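
The loop in step 5 can be sketched for a single agent as follows. This is a simplified illustration under strong assumptions that are not on the slide: exactly one component has failed, tests are perfectly accurate, and "relevance" is taken to be the current belief that a component is the cause. The function and parameter names are hypothetical.

```python
def diagnose(beliefs, test_cost, run_test, conf_thresh=0.9):
    """Test components until the most probable explanation is confident enough.

    beliefs:   component -> P(component is the cause); assumed to sum to 1
    test_cost: component -> cost of testing that component
    run_test:  callable(component) -> True if the component tests OK
    Returns (mpe, mpe_confidence, total_cost).
    """
    beliefs = dict(beliefs)
    untested = list(beliefs)
    total_cost = 0.0
    while True:
        mpe = max(beliefs, key=beliefs.get)
        if beliefs[mpe] >= conf_thresh or not untested:
            return mpe, beliefs[mpe], total_cost
        # Choose the test with the best relevance/cost ratio (assumed metric).
        target = max(untested, key=lambda c: beliefs[c] / test_cost[c])
        untested.remove(target)
        total_cost += test_cost[target]
        if run_test(target):
            beliefs[target] = 0.0                  # ruled out
        else:
            beliefs = {c: 0.0 for c in beliefs}    # single-fault assumption
            beliefs[target] = 1.0
        z = sum(beliefs.values())
        if z == 0.0:
            raise ValueError("evidence contradicts the single-fault assumption")
        beliefs = {c: p / z for c, p in beliefs.items()}


if __name__ == "__main__":
    priors = {"A-B link": 0.5, "B-N path": 0.3, "dest": 0.2}
    costs = {"A-B link": 1, "B-N path": 4, "dest": 1}
    truth = "dest"
    for thresh in (0.5, 0.9):
        print(thresh, diagnose(priors, costs, lambda c: c != truth, thresh))
```

In this toy run a low threshold returns immediately at zero cost but names the wrong component, while a higher threshold pays for two tests and finds the actual fault, which is the accuracy/cost trade-off the slide refers to.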

Minimizing cost for IP path diagnosis
IP path diagnosis: ISP (A→B), rest of path (B→N), or destination (N→Dest)
Simulated topology of 6000 Autonomous Systems (ASes)
1 DA per AS that can test links and destinations associated with that AS
All diagnostic agents have knowledge of prior link failure probabilities
Diagnostic agents are reachable up to the point of failure
Status of inter-AS links and destination hosts drawn from prior probabilities
Evidence collection and propagation follow DAs in the AS path
[Figure: IP path User, A, B, …, K, …, N, Dest with one DA per AS (DA 1 in the user's AS A, DA 2 in AS B, DA k in AS K, DA n in the destination's AS N); arrows labeled Failure report, Evidence collection, Evidence propagation, Diagnosis]
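
One way to picture "status drawn from prior probabilities" is the sketch below, which samples the state of each inter-AS link and the destination host for a single path and labels the ground-truth diagnosis. The function name, the parameters, and the rule of labeling by the first failure along the path are assumptions of this sketch; the slide does not specify how the simulator assigns labels.

```python
import random


def draw_failure(num_links, p_link_fail, p_dest_fail, rng):
    """Sample one path's component status and return the ground-truth label.

    num_links: number of inter-AS links on the path (first one is A->B)
    Returns "ISP (A->B)", "rest of path (B->N)", "destination (N->Dest)",
    or None when nothing on the path has failed.
    """
    links_ok = [rng.random() >= p_link_fail for _ in range(num_links)]
    dest_ok = rng.random() >= p_dest_fail
    if not links_ok[0]:
        return "ISP (A->B)"
    if not all(links_ok[1:]):
        return "rest of path (B->N)"
    if not dest_ok:
        return "destination (N->Dest)"
    return None


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs draw the same statuses
    labels = [draw_failure(4, 0.05, 0.02, rng) for _ in range(1000)]
    for label in ("ISP (A->B)", "rest of path (B->N)", "destination (N->Dest)", None):
        print(label, labels.count(label))
```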

Agents can trade off accuracy and cost
[Figure: accuracy/cost plot with points labeled by confThresh: 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
13 confidence thresholds, 500 users, 5 trials

Incorporating past evidence reduces probing costs
cache duration = number of past time steps of evidence to consider
Inter-AS link failures modeled as a Markov chain (Gilbert model)
100 users, 5 trials, 30 time steps
>95% accuracy
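
To make the role of cached evidence concrete, the sketch below propagates an old observation forward through a two-state Markov chain and falls back to the chain's stationary probability once the observation is older than the cache duration. The transition probabilities are illustrative defaults, and the fallback rule and function names are assumptions of this sketch rather than details from the slide.

```python
def p_ok_after(steps, observed_ok, p_stay_ok=0.97, p_recover=0.29):
    """P(link is OK now) given its status observed `steps` time steps ago.

    p_stay_ok: P(OK at t+1 | OK at t); p_recover: P(OK at t+1 | FAIL at t).
    """
    p_ok = 1.0 if observed_ok else 0.0
    for _ in range(steps):
        p_ok = p_ok * p_stay_ok + (1.0 - p_ok) * p_recover
    return p_ok


def stationary_p_ok(p_stay_ok=0.97, p_recover=0.29):
    """Long-run P(OK) of the chain, used when no cached evidence applies."""
    return p_recover / (1.0 - p_stay_ok + p_recover)


def belief_from_cache(age, observed_ok, cache_duration):
    """Use cached evidence only if it is at most `cache_duration` steps old."""
    if age > cache_duration:
        return stationary_p_ok()
    return p_ok_after(age, observed_ok)


if __name__ == "__main__":
    for age in (0, 1, 2, 5, 10, 40):
        print(age, belief_from_cache(age, observed_ok=False, cache_duration=30))
```

An observation of FAIL loses influence as it ages, so a longer cache lets an agent skip probes only while the cached value still shifts the belief away from the prior.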

Evidence propagation reduces probing and communication costs
[Figure: plot with curves labeled 50,000 failures and 100,000 failures]
10,000 users, 5 trials, 1 time step
>95% accuracy

Agents can learn probabilistic models for TCP overlay connection diagnosis
1. Learn inter-AS TCP failure probabilities from PlanetSeer (28.3 million TCP connections from 196 hosts over 10 hours)
2. Diagnose HTTP proxy connections on CoDeeN without using probes

[Figure: TCP overlay path User, Proxy, Server; the HTTP Proxy Conn. User→Server depends on TCP Conn. User→Proxy and TCP Conn. Proxy→Server; each TCP Conn. depends on Src AS, Dst AS, and Hour]

TCP Conn.:
Src AS   Dst AS   Hour   P(Status=OK)
1        1        1      0.99
1        2        1      0.87
…        …        …      …

HTTP Proxy Conn.:
User→Proxy   Proxy→Server   P(Status=OK)
FAIL         FAIL           0
FAIL         OK             0
OK           FAIL           0
OK           OK             1
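
A minimal sketch of the two steps on this slide: estimate P(Status=OK) for each (Src AS, Dst AS, Hour) combination by counting observed connections, then use those estimates to decide which TCP segment most likely caused a failed HTTP proxy connection. The record layout, the function names, and the plain frequency estimate are assumptions of this sketch; the slide does not describe the learning procedure in this detail.

```python
from collections import defaultdict


def learn_tcp_cpt(records):
    """Estimate P(TCP Conn. Status = OK | Src AS, Dst AS, Hour) by counting.

    records: iterable of (src_as, dst_as, hour, ok) tuples (assumed format).
    """
    ok_count = defaultdict(int)
    total = defaultdict(int)
    for src_as, dst_as, hour, ok in records:
        key = (src_as, dst_as, hour)
        total[key] += 1
        ok_count[key] += 1 if ok else 0
    return {key: ok_count[key] / total[key] for key in total}


def diagnose_proxy_failure(p_user_proxy_ok, p_proxy_server_ok):
    """Posterior that each TCP segment failed, given the HTTP connection failed.

    Uses the slide's deterministic table: HTTP is OK only if both segments are.
    The two segments are assumed to fail independently.
    """
    p_http_fail = 1.0 - p_user_proxy_ok * p_proxy_server_ok
    if p_http_fail == 0.0:
        raise ValueError("a failure is impossible under these probabilities")
    return {
        "User->Proxy": (1.0 - p_user_proxy_ok) / p_http_fail,
        "Proxy->Server": (1.0 - p_proxy_server_ok) / p_http_fail,
    }


if __name__ == "__main__":
    # Synthetic records chosen to reproduce the two example rows on the slide.
    records = ([(1, 1, 1, True)] * 99 + [(1, 1, 1, False)] * 1
               + [(1, 2, 1, True)] * 87 + [(1, 2, 1, False)] * 13)
    cpt = learn_tcp_cpt(records)
    print(diagnose_proxy_failure(cpt[(1, 1, 1)], cpt[(1, 2, 1)]))
```

No probes are sent at diagnosis time; the only inputs are the learned table and the AS numbers and hour of the failed connection.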

Learned diagnostic knowledge improves accuracy
Accuracy: 80% vs. 53%
– Train on hour x
– Test on hour x + 1
Accuracy improves as training interval increases
– Train on first x hours, test on hour x + 1
Accuracy remains high as training set age increases
– Train on hour 1, test on hour x > 1

Benefits of CAPRI
An extensible language for diagnostic data and knowledge
– Based on Bayes nets and PRMs
Distributed diagnosis while minimizing probing and communication cost
– accuracy/cost tradeoff
– incorporating past evidence
– evidence propagation
Robustness to missing data
– probabilistic inference using cached data
Ability to learn diagnostic knowledge
– learn conditional failure probabilities using PRMs

Future Work
Costs and incentives
– Learning the true network costs of diagnostic tests
– Dynamically adjusting cost
– Incentives for agents to reveal evidence
Intelligent routing of diagnostic queries
Temporal failure models
– Learning temporal failure models
– Predicting failure duration
Diagnosis using data from end users

[Figure: Accuracy (axis from 76% to 86%) vs. Training Set Age (Hours, 1 to 9)]

Modeling Dynamic networks
Model network component state as a Markov chain (Gilbert model)
[Figure: two-state chain with states OK and FAIL; transition probabilities 0.97, 0.03, 0.71, 0.29]
Dynamic Bayes net (DBN):
[Figure: chain of states s1, s2, s3]
P(s3=OK | s1=FAIL) = P(s3=OK | s2=OK) P(s2=OK | s1=FAIL)
                   + P(s3=OK | s2=FAIL) P(s2=FAIL | s1=FAIL)
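
As a worked instance of this sum, assume the four numbers on the diagram pair up as P(OK→OK) = 0.97, P(OK→FAIL) = 0.03, P(FAIL→FAIL) = 0.71, and P(FAIL→OK) = 0.29. The pairing is inferred from the fact that each pair sums to 1; which pair belongs to which state is an assumption, since the extracted slide does not show the arrows.

```latex
\begin{align*}
P(s_3=\mathrm{OK}\mid s_1=\mathrm{FAIL})
  &= P(s_3=\mathrm{OK}\mid s_2=\mathrm{OK})\,P(s_2=\mathrm{OK}\mid s_1=\mathrm{FAIL}) \\
  &\quad + P(s_3=\mathrm{OK}\mid s_2=\mathrm{FAIL})\,P(s_2=\mathrm{FAIL}\mid s_1=\mathrm{FAIL}) \\
  &= 0.97 \times 0.29 + 0.29 \times 0.71 \\
  &= 0.2813 + 0.2059 = 0.4872
\end{align*}
```

Under that assumption, a component seen as FAIL two time steps ago is OK now with probability a little under one half.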