Routing Protocol Convergence and Availability

Alvaro Retana ([email protected]), Principal Engineer, Core IP Technology Architecture
© 2009 Cisco Systems, Inc. All rights reserved. Cisco Public. LACNOG2010


Page 2

High Availability Overview

Page 3

Availability Definitions

• The probability that a service (or network, etc.) is operational, and functional as needed, at any point in time
• Availability = (MTBF - MTTR)/MTBF
    Useful definition for theoretical and practical purposes
• MTBF is mean time between failures
    What, when, why and how does it fail?
• MTTR is mean time to repair
    How long does it take to fix?
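
As a small worked example of the definition on this slide, the sketch below evaluates (MTBF - MTTR)/MTBF in Python. The function name and the sample figures (10,000 hours between failures, 4 hours to repair) are illustrative assumptions, not values from the presentation.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability as defined on this slide: (MTBF - MTTR) / MTBF."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return (mtbf_hours - mttr_hours) / mtbf_hours


# Hypothetical device: one failure every 10,000 hours, 4 hours to repair.
example = availability(10_000, 4)
print(f"availability = {example:.4%}")
```

Halving the repair time in this example removes half of the unavailability, which is why the MTTR question ("how long does it take to fix?") matters as much as the MTBF one.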

Page 4

What Is High Availability?

Availability    Downtime Per Year (24x365)     DPM
99.000%         3 Days 15 Hours 36 Minutes     10000
99.500%         1 Day 19 Hours 48 Minutes      5000
99.900%         8 Hours 46 Minutes             1000
99.950%         4 Hours 23 Minutes             500
99.990%         53 Minutes                     100
99.999%         5 Minutes                      10
99.9999%        30 Seconds                     1

DPM = Defects per Million (Hours of Running Time)

Chart label: "High Availability"
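
The downtime and DPM columns follow directly from the availability percentage. The sketch below recomputes them for a 24x365 year; the function names are illustrative, and the slide rounds its figures (for example, about 31.5 seconds is shown as 30 seconds).

```python
HOURS_PER_YEAR = 24 * 365  # the slide's "24x365" year


def downtime_hours_per_year(availability_pct: float) -> float:
    """Hours of downtime per year implied by an availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


def dpm(availability_pct: float) -> float:
    """Defects per million hours of running time."""
    return (1 - availability_pct / 100) * 1_000_000


for pct in (99.0, 99.5, 99.9, 99.95, 99.99, 99.999, 99.9999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}%: {hours:.4f} h/year ({hours * 60:.2f} min), DPM {dpm(pct):.0f}")
```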

Page 5

Causes of Unscheduled Downtime

Cause                                   % of Respondents
Network Operations Failures             87%
Physical Link Failures                  87%
Network Hardware Failures               79%
Network Software Failures               67%
Customer Premises Equipment Failure     67%
Physical Environment Failures           62%
Congestion/Overload                     44%
Unknown                                 37%
Acts of Nature                          37%
Malicious Damage                        25%

Source: Sage Research, IP Service Provider Downtime Study: Analysis of Downtime Causes, Costs and Containment Strategies, August 17, 2001, Prepared for Cisco SPLOB

Page 6

Network Convergence

• Network convergence is the time needed for traffic to be rerouted to the alternative or more optimal path after a network event
• Network convergence requires all affected routers to process the event and update the appropriate data structures used for forwarding
• Network convergence is the time required to:
    Detect that the event has occurred
    Propagate the event
    Process the event
    Update related forwarding structures
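
One way to reason about the four components above is as a time budget. The sketch below simply adds them up and reports the largest contributor. The class, its field names and every number are hypothetical, and treating the stages as strictly sequential is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class ConvergenceBudget:
    """Per-stage convergence times in milliseconds (illustrative only)."""
    detect_ms: float
    propagate_ms: float
    process_ms: float
    update_ms: float

    def total_ms(self) -> float:
        # Assumes the four stages run one after another.
        return self.detect_ms + self.propagate_ms + self.process_ms + self.update_ms

    def dominant_stage(self) -> str:
        stages = {
            "detect": self.detect_ms,
            "propagate": self.propagate_ms,
            "process": self.process_ms,
            "update": self.update_ms,
        }
        return max(stages, key=stages.get)


# Made-up figures for a single link failure.
budget = ConvergenceBudget(detect_ms=150, propagate_ms=20, process_ms=60, update_ms=200)
print(budget.total_ms(), budget.dominant_stage())
```

Looking at the dominant stage first is a reasonable way to decide where tuning effort pays off.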

Page 7

Network Convergence (2)

• Network Design and Operational Considerations
    Processes for fault, configuration, performance and security
    No Single Points of Failure (except at edge) / Failure Domain Size
    Excellent consistency (HW, SW, config, design)
    Redundancy, Hierarchy, Summarization, Modularity
• Detection
    Physical Failure (light!)
    Fast Hellos
    Bidirectional Forwarding Detection (BFD)
• Hiding
    Interface Dampening (for flapping links)
    Graceful Restart
• Propagation and Processing
    Link State Exponential Back off
    Prefix Prioritization
    BGP Prefix Independent Convergence (PIC)
    IP Fast ReRoute
    IGP/BGP Interaction


Page 9

Graceful Restart

Page 10

NSF/SSO

• Standby Route Processor (RP) takes control of the router after a hardware or software fault on the Active RP
• SSO (Stateful Switchover) allows the standby RP to take immediate control and maintain connectivity protocols
• NSF (Nonstop Forwarding) continues to forward packets until route convergence is complete

[Diagram labels: Active RP, Standby RP, State Information, Line Cards]

Page 11

NSF/SSO Design Goals

• Provide a scalable solution
    Architecture must scale with workloads and features and meet network requirements
• Minimize state that must be synchronized
    Minimize impact of HA on service
• Detect and react to failures quickly
    Continuously monitor Active components
    Continuously verify operation of Standby components

Page 12

Graceful Restart

• When the BGP peering session is brought up, the graceful restart capability is negotiated. If both peers state they are capable of GR, it is enabled on the peering session.
• When A restarts, it opens a new TCP session to B, using the same router ID.
• B interprets this as a restart, and closes the old TCP session.

[Diagram: routers A and B, each with a control plane (BGP) and a data plane. Message labels: GR capability; New TCP Session; Restart, close old session]

Page 13

Graceful Restart

• B transmits updates containing its BGP table (its local RIB out).
• A goes into read-only mode, and does not run the bestpath calculations until B has finished sending updates.
• When B has finished sending updates, it sends an End-of-RIB marker, which (for IPv4 unicast, per RFC 4724) is an UPDATE message with no reachable NLRI and empty withdrawn NLRI.

[Diagram: B sends Updates and then the End of RIB Marker to A; A is in read-only mode]

Page 14

Graceful Restart

• When A receives the End-of-RIB marker, it runs bestpath, and installs the best routes in the routing table.
• After the local routing table is updated, BGP notifies CEF.
• CEF then updates the forwarding tables, and removes all information marked as stale.

[Diagram: routers A and B, each with a control plane (BGP) and a data plane]
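
The restart sequence described on the Graceful Restart slides can be sketched as a toy state machine for the restarting router A. This is an illustrative model only: the class, its attributes and the `END_OF_RIB` sentinel are assumptions made for the example, there is a single peer (so "bestpath" is trivial), and nothing here reflects an actual BGP or CEF implementation.

```python
from dataclasses import dataclass, field

END_OF_RIB = object()  # stand-in for the End-of-RIB marker sent by peer B


@dataclass
class RestartingRouter:
    fib: dict                                  # prefix -> next hop, kept across the restart
    stale: set = field(default_factory=set)    # FIB entries not yet refreshed
    received: dict = field(default_factory=dict)
    read_only: bool = False

    def restart(self) -> None:
        # The control plane restarts; forwarding state is preserved but marked stale.
        self.stale = set(self.fib)
        self.received.clear()
        self.read_only = True

    def receive(self, message) -> None:
        if message is END_OF_RIB:
            # Peer has finished: leave read-only mode, run bestpath, update the FIB.
            self.read_only = False
            self._install_and_flush()
        else:
            prefix, next_hop = message
            self.received[prefix] = next_hop   # stored only, no bestpath yet

    def _install_and_flush(self) -> None:
        for prefix, next_hop in self.received.items():
            self.fib[prefix] = next_hop
            self.stale.discard(prefix)
        for prefix in self.stale:              # anything not refreshed is removed
            del self.fib[prefix]
        self.stale.clear()


a = RestartingRouter(fib={"10.0.0.0/8": "B", "192.0.2.0/24": "B"})
a.restart()
a.receive(("10.0.0.0/8", "B"))
print(sorted(a.fib))   # both prefixes still forward while A is read-only
a.receive(END_OF_RIB)
print(sorted(a.fib))   # only the refreshed prefix remains
```

The point of the sketch is the ordering: forwarding continues on stale entries, and stale state is cleaned up only after the marker arrives.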

Page 15

Graceful Restart References

• RFC 4724: Graceful Restart Mechanism for BGP
• RFC 5306: Restart Signaling for IS-IS
• RFC 4811: OSPF Out-of-Band Link State Database (LSDB) Resynchronization
• RFC 5613: OSPF Link-Local Signaling
• RFC 4812: OSPF Restart Signaling
• RFC 3623: Graceful OSPF Restart

Page 16

Fast Convergence

Page 17

OSPF Architectural Constants

• Initial LSA Generation Delay = 500 ms
• Recurring LSA Origination Delay = 5 s
• LSA Arrival Throttling = 1 s
• LSA Flooding Pacing = 33 ms
• LSA Retransmission = 66 ms
• SPF Execution Delay = 500 ms
• SPF Holdtime = 5 s

Page 18

Event Propagation: OSPF Exponential Backoff

• Fast LSA Generation after Initial Event
• Repeated events increase regeneration delay
• Configuration:
    timers throttle lsa all <lsa-start> <lsa-hold> <lsa-max>
• Similar Configuration for Event Processing (SPF Runs):
    timers throttle spf <spf-start> <spf-hold> <spf-max>

Page 19

Event Propagation: OSPF Exponential Backoff

timers throttle lsa all 10 500 5000

[Figure: three timelines in ms: events causing LSA generation, LSA generation, and LSA generation with the back-off algorithm. The previous LSA generation was at t0, with (t1 - t0) > 5000 ms. The first LSA is generated at t1+10; the labelled back-off intervals include 500, 1000, 2000, 4000 and 5000 ms.]
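
The back-off behaviour configured by `timers throttle lsa all 10 500 5000` can be approximated with a few lines of Python. This is a simplified model written for illustration (the function name is hypothetical, and it ignores the rule that the timers reset after a quiet period): the first event waits the start interval, and each further event waits a hold interval that doubles until it reaches the maximum.

```python
def throttle_delays(start_ms: int, hold_ms: int, max_ms: int, count: int) -> list[int]:
    """Waits applied to `count` back-to-back events under exponential back-off."""
    delays = []
    wait = hold_ms
    for i in range(count):
        if i == 0:
            delays.append(start_ms)           # fast reaction to the first event
        else:
            delays.append(min(wait, max_ms))  # capped at the maximum interval
            wait *= 2                         # doubles for each further event
    return delays


# Values from the slide: start 10 ms, hold 500 ms, max 5000 ms.
print(throttle_delays(10, 500, 5000, 7))
```

The same model applies to `timers throttle spf`, with the SPF start, hold and maximum values in place of the LSA ones.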

Page 20

Link State Prefix Priority

• Prefix Prioritization
    4 priorities: Critical, High, Medium, Low
    /32 IPv4 and /128 IPv6 prefixes are classified by default in Medium Priority
    Rest is classified by default in Low Priority
• Prefix Prioritization is THE key behavior; for example
    CRITICAL: IPTV SSM sources
    HIGH: Most Important PEs
    MEDIUM: All other PEs
    LOW: All other prefixes
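
The default classification and the resulting install order can be sketched as follows. The enum, function names and the `overrides` mapping are hypothetical; only the default rule (host routes are Medium, everything else Low) comes from the slide, and the addresses are documentation prefixes chosen for the example.

```python
import ipaddress
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def default_priority(prefix: str) -> Priority:
    """Host routes (/32 IPv4, /128 IPv6) default to MEDIUM, the rest to LOW."""
    net = ipaddress.ip_network(prefix)
    if net.prefixlen == net.max_prefixlen:
        return Priority.MEDIUM
    return Priority.LOW


def install_order(prefixes, overrides=None):
    """Sort prefixes so higher-priority ones are processed first (stable sort)."""
    overrides = overrides or {}
    return sorted(prefixes, key=lambda p: overrides.get(p, default_priority(p)))


prefixes = ["10.1.0.0/16", "192.0.2.1/32", "198.51.100.10/32", "2001:db8::1/128"]
# Hypothetical operator policy: one host route is an IPTV source.
policy = {"198.51.100.10/32": Priority.CRITICAL}
print(install_order(prefixes, policy))
```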

Page 21

BGP PIC Edge: PE-CE link failure (fast repair)

1. Link PE2-CE2 fails
    If BGP PIC Edge is implemented, then traffic goes PE1, PE2, PE3, CE2

[Diagram: VPN 1 site A (CE1) and VPN 1 site B (CE2, x.x.x.x/y); PE1, PE2 and PE3 with RD 1:1, RD 2:1 and RD 3:1; route reflectors RR1, RR2, RR3 and RR4]

Page 22

BGP PIC Edge: PE-CE link failure (re-optimization)

1. Link PE2-CE2 fails
    If BGP PIC Edge is implemented, then traffic goes PE1, PE2, PE3, CE2
2. Fast External Fallover scans BGP table, calculating new bestpaths
3. PE2 withdraws paths
4. RR2 and RR4 propagate withdraws
5. RR1 and RR3 propagate withdraws
6. PE1 deletes path via PE2, now going via PE3

[Diagram: VPN 1 site A (CE1) and VPN 1 site B (CE2, x.x.x.x/y); PE1, PE2 and PE3 with RD 1:1, RD 2:1 and RD 3:1; route reflectors RR1, RR2, RR3 and RR4]

Page 23

BGP PIC Edge: PE node failure (fast repair)

1. PE2 fails
2. The IGP propagates the BGP next-hop (NH) failure
3. PE1 withdraws paths
    If BGP PIC Edge is implemented, then traffic goes PE1, PE3, CE2

[Diagram: VPN 1 site A (CE1) and VPN 1 site B (CE2, x.x.x.x/y); PE1, PE2 and PE3 with RD 1:1, RD 2:1 and RD 3:1; route reflectors RR1, RR2, RR3 and RR4]
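
The "prefix independent" part of BGP PIC can be illustrated with a shared path object: many prefixes point at one primary/backup pair, so a failure is repaired by changing that one object rather than touching each prefix. The sketch below is a conceptual model under that assumption; the `PathList` class and the generated prefixes are hypothetical and do not describe any vendor's forwarding-table layout.

```python
class PathList:
    """A primary/backup next-hop pair shared by many prefixes."""

    def __init__(self, primary: str, backup: str) -> None:
        self.primary = primary
        self.backup = backup
        self.active = primary

    def fail(self, next_hop: str) -> None:
        # One update repairs every prefix that references this path list.
        if self.active == next_hop:
            self.active = self.backup


# 1000 hypothetical VPN prefixes, all reachable via PE2 with PE3 as backup.
shared = PathList(primary="PE2", backup="PE3")
fib = {f"10.{i // 256}.{i % 256}.0/24": shared for i in range(1000)}

shared.fail("PE2")  # PE2 (or its PE-CE link) goes away
print({entry.active for entry in fib.values()})
```

The repair cost in this model does not grow with the number of prefixes, which is the property the slides call fast repair; BGP then re-optimizes in the background.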

Page 24

BGP PIC Edge sample

[Chart: time in msec (logarithmic axis, 1 to 1000000) plotted against Prefix (0 to 500000) for four series: 250k PIC, 250k no PIC, 500k PIC, 500k no PIC]

Page 25

IP Fast ReRoute

Page 26

Objective

• Provide fast re-route in pure IP networks and MPLS/LDP networks without deploying RSVP-TE.
• Restore productive forwarding to all reachable addresses within 50 ms.
• Control the transition of the network from repair to normal forwarding without further packet loss or micro-looping.

Page 27

The Four Stages of IPFRR

1. Pre-computation of repair paths
2. Detection of failure
3. Invocation of appropriate repair
4. Controlled re-convergence of network

Page 28

Basic Repair

• Uses ECMP and Loop Free Alternates (LFA) where available
• LFAs easily computed in OSPF and IS-IS
• Analogous to feasible successors in EIGRP
• Properties:
    In general topologies around 80% of failures allow all destinations to be repaired
    For the remaining 20%, only a subset of destinations can be repaired
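
To make "easily computed" concrete, the sketch below checks the basic loop-free condition from RFC 5286 (not cited in the slides): a neighbor N of S is a loop-free alternate for destination D when dist(N, D) < dist(N, S) + dist(S, D). The graph encoding, function names, node names and link costs are all assumptions made for this example.

```python
import heapq


def dijkstra(graph: dict, source: str) -> dict:
    """Shortest-path distances from `source` in a weighted graph."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # outdated heap entry
        for neigh, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(neigh, float("inf")):
                dist[neigh] = nd
                heapq.heappush(heap, (nd, neigh))
    return dist


def loop_free_alternates(graph: dict, s: str, dest: str, primary: str) -> list:
    """Neighbors of `s` (other than the primary next hop) that are LFAs for `dest`."""
    dist_s = dijkstra(graph, s)
    result = []
    for n in graph[s]:
        if n == primary:
            continue
        dist_n = dijkstra(graph, n)
        # Loop-free if N's best path to dest does not lead back through S.
        if dist_n[dest] < dist_n[s] + dist_s[dest]:
            result.append(n)
    return result


# Triangle with unit costs: N protects the S-E link for destination E.
triangle = {"S": {"E": 1, "N": 1}, "E": {"S": 1, "N": 1}, "N": {"S": 1, "E": 1}}
print(loop_free_alternates(triangle, "S", "E", primary="E"))

# Square with unit costs: N is as close to E via S as via X, so it is not an LFA.
square = {"S": {"E": 1, "N": 1}, "E": {"S": 1, "X": 1},
          "X": {"E": 1, "N": 1}, "N": {"S": 1, "X": 1}}
print(loop_free_alternates(square, "S", "E", primary="E"))
```

Raising the S-N cost in the square (for example to 3) makes N loop-free again, which shows how LFA coverage depends on both topology and metrics.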

Page 29

Triangle topology - ECMP

[Diagram: node labels S, N, A, B, P, O]

Page 30

Square topology - LFA

[Diagram: node labels S, N, A, B, P]

Page 31

More complex topology - no LFA available

[Diagram: node labels S, N, M, A, B, P]

Page 32

Complex topology

Final Solution in Process

[Diagram: node labels S, N, M, A, B, P, Ap]

Page 33

Designing for Fast Convergence

• Designing for FC is more than tuning a few timers
• Designers need to look at all network layers
    Layer 1 and Layer 2 for failure detection properties and physical topology (shared-risk link groups)
    Layer 3 protocol behaviour, interactions between different protocols
    Layer 4-7 for application requirements and behaviour
• The base must be a solid network design!
• Balance must be achieved between engineering complexity and gain.

Page 34