Bgp Trouble

download Bgp Trouble

of 22

Transcript of Bgp Trouble

  • 8/6/2019 Bgp Trouble

    1/22

    1

    BGP Anomaly Detection in an ISP

    Jian Wu (U. Michigan)Z. Morley Mao (U. Michigan)

    Jennifer Rexford (Princeton)Jia Wang (AT&T Labs)

    http://www.cs.princeton.edu/~jrex/papers/nsdi05-jian.pdf

  • 8/6/2019 Bgp Trouble

    2/22

    2

    Goal

    Identify important anomalies Lost reachability

    Persistent flapping Large traffic shifts

    Contributions:

    Build a tool to identify a small number ofimportant routing disruptions from a largevolume of raw BGP updates in real time.

    Use the tool to characterize routingdisruptions in an operational network

  • 8/6/2019 Bgp Trouble

    3/22

    3

    Capturing Routing Changes

    CBR

    CBR

    CPE

    BGP

    Monit

    or

    CBR

    CBR

    CBR

    CBR

    CBR

    CBR

    CBR

    CBR

    CBR

    CBR

    iBGP

    iBGP

    iBG

    P

    iB

    GP

    iBGP

    iBGP

    eBGP

    eB

    GP

    eBGP

    eBGP

    eBGP

    eBGP

    Updates

    Update

    s

    BestroutesBe

    stro

    utes

    Large operational network(8/16/2004 10/10-2004)

  • 8/6/2019 Bgp Trouble

    4/22

    4

    Challenges

    Large volume of BGP updates Millions daily, very bursty

    Too much for an operator to manage Different than root-cause analysis

    Identify changes and their effects

    Focus on actionable events

    Diagnose causes only in/near the AS

  • 8/6/2019 Bgp Trouble

    5/22

    5

    System Architecture

    EventClassification

    EventClassification

    Typed

    Events

    E

    EBR

    E

    EBR

    E

    EBR

    BGP

    Updates(106)

    BGP UpdateGrouping

    BGP UpdateGrouping

    Events

    PersistentFlapping

    Prefixes

    (101)

    (105)

    EventCorrelation

    EventCorrelation

    Clusters

    FrequentFlapping

    Prefixes

    (103

    )

    (101)

    Traffic ImpactPrediction

    Traffic ImpactPrediction

    E

    EBRE

    EBR E

    EBR

    Large

    Disruptions

    Netflow

    Data

    (101)

  • 8/6/2019 Bgp Trouble

    6/22

    6

    Grouping BGP Update into Events

    Challenge: A single routing change leads to multiple update messages

    affects routing decisions at multiple routers

    Solution:Group all updates fora prefix with inter-arrival < 70 secondsFlag prefixes withchanges lasting > 10minutes.

    BGP UpdateGrouping

    BGP UpdateGrouping

    E

    EBR

    E

    EBR

    E

    EBR

    BGP

    Updates

    Events

    Persistent

    Flapping

    Prefixes

  • 8/6/2019 Bgp Trouble

    7/22

    7

    Grouping Thresholds

    Based on data analysis and ourunderstanding of BGP

    Event timeout: 70 seconds 2 * MRAI timer + 10 seconds

    98% inter-arrival time < 70 seconds

    Convergence timeout: 10 minutes BGP usually converges within minutes

    99.9% events < 10 minutes

  • 8/6/2019 Bgp Trouble

    8/22

    8

    Persistent Flapping Prefixes

    Causes of persistent flapping Conservative damping parameters (78.6%)

    Protocol oscillations due to MED (18.3%) Unstable interface or BGP session (3.0%)

    Surprising finding: 15.2% of updates werecaused by persistent flapping prefixes, eventhough flap damping was enabled!

  • 8/6/2019 Bgp Trouble

    9/22

    9

    Example: Unstable eBGP Session

    ISP Peer

    CustomerEC

    EB

    EA ED

    p

    Flap damping parameters are session-based Damping not implemented for iBGP sessions

  • 8/6/2019 Bgp Trouble

    10/22

    10

    Event Classification

    Challenge: Major concerns in network management Changes in reachability Heavy load of routing messages on the routers

    Change of flow of traffic through the network

    EventClassification

    EventClassificationEvents

    Typed

    Events

    Solution: classify events by severity of their impacts

  • 8/6/2019 Bgp Trouble

    11/22

    11

    Event Category No Disruption

    ISP

    EA

    p

    EB

    EC

    EE

    AS2

    ED

    AS1

    No Traffic Shift

    No Disruption: each of the border routers hasno traffic shift. (50.3%)

  • 8/6/2019 Bgp Trouble

    12/22

    12

    Event Category Internal Disruption

    ISP

    EA

    p

    EB

    EC

    EE

    AS2

    ED

    AS1

    Internal Traffic Shift

    Internal Disruption: all of the traffic shifts areinternal traffic shift. (15.6%)

  • 8/6/2019 Bgp Trouble

    13/22

    13

    Event Category Single ExternalDisruption

    ISP

    EA

    p

    EB

    EC

    EE

    AS2

    ED

    AS1

    external Traffic Shift

    Single External Disruption: only one of thetraffic shifts is external traffic shift. (20.7%)

  • 8/6/2019 Bgp Trouble

    14/22

    14

    Statistics on Event Classification

    Events Updates

    No Disruption 50.3% 48.6%

    Internal Disruption 15.6% 3.4%

    Single External Disruption 20.7% 7.9%Multiple External Disruption 7.4% 18.2%

    Loss/Gain of Reachability 6.0% 21.9%

    First 3 categories have significant variations fromday to day Updates per event depends on the type of events

    and the number of affected routers

  • 8/6/2019 Bgp Trouble

    15/22

    15

    Event Correlation

    Challenge: A single routing change affects multiple destination prefixes

    EventCorrelation

    EventCorrelation

    Typed

    EventsClusters

    Solution: group events of same type that occur close in time

  • 8/6/2019 Bgp Trouble

    16/22

    16

    EBGP Session Reset

    Caused most single external disruption events Check if the number of prefixes using that session as

    the best route changes dramatically

    Validation with Syslog router report (95%)

    time

    Number of prefixes

    session

    failure

    sessionrecovery

  • 8/6/2019 Bgp Trouble

    17/22

    17

    Hot-Potato Changes

    Hot-Potato Changes

    Caused internal disruption events Validation with OSPF measurement (95%)

    [Teixeira et al SIGMETRICS 04]

    ISP

    P

    EA EB

    EC

    10119

    Hot-potato routing =route to closest egress point

  • 8/6/2019 Bgp Trouble

    18/22

    18

    Traffic Impact Prediction

    Challenge: Routing changes have differentimpacts on the network which depends onthe popularity of the destinations

    Traffic ImpactPrediction

    Traffic ImpactPrediction

    EEBR

    Clusters Large

    Disruptions

    Netflow

    DataEEBR EEBR

    Solution: weigh each cluster by traffic volume

  • 8/6/2019 Bgp Trouble

    19/22

    19

    Traffic Impact Prediction

    Traffic weight Per-prefix measurement from Netflow

    10% prefixes accounts for 90% of traffic

    Traffic weight of a cluster Sum of traffic weight of the prefixes

    A few clusters have large traffic weight

    Mostly session resets & hot-potato changes

  • 8/6/2019 Bgp Trouble

    20/22

    20

    Performance Evaluation

    Memory Static memory: current routes, 600 MB

    Dynamic memory: clusters, 300 MB

    Speed 99% of intervals of 1 second of updates can

    be process within 1 second

    Occasional execution lag

    Every interval of 70 seconds of updates canbe processed within 70 seconds

    Measurements were based on 900MHz CPU

  • 8/6/2019 Bgp Trouble

    21/22

    21

    Conclusion

    BGP anomaly detection Fast, online fashion

    Operator concerns (reachability, flapping,traffic)

    Significant information reduction

    Uncovered important network behaviors

    Persistent flapping prefixes Hot-potato changes

    Session resets and interface failures

  • 8/6/2019 Bgp Trouble

    22/22

    22

    Detecting Peering Violations

    Consistent export requirement Peer should advertise prefixes at all peering points,

    with the same AS path length

    Allows the AS to do hot-potato routing

    Detecting violations Using iBGP feeds from the border routers

    Some inference tricks to identify inconsistencies

    Results of the study http://www.nanog.org/mtg-0410/feamster.html http://www.cs.princeton.edu/~jrex/papers/imc04.pdf