Network Failure Detection · d2zmdbbm9feqrf.cloudfront.net/2013/eur/pdf/BRKRST-2333.pdf · Ver 1.8


© 2013 Cisco and/or its affiliates. All rights reserved. BRKRST-2333 Cisco Public

Network Failure Detection
BRKRST-2333, Ver 1.8

Arkadiy Shapiro
Technical Marketing Engineer, NX-OS and Nexus 7000


Why am I here?

[Figure: network topology spanning Access and Campus Core (Catalyst 6500), DC Access (Nexus 2000 / 3000 / 3500 / 5000, Catalyst 6500), Routing Core (CRS-3) and SP Edge (ASR 9000), interconnected by 40G and 100G links]


Session Goals

At the end of the session, the participants should:

Understand where failure detection fits in achieving fast network convergence

Be able to identify which failure detection technologies are needed to meet business needs and required SLAs

Understand future advances in network failure detection technologies


Session Non-goals

This session does not include:

Discussion on other aspects of fast convergence

Details on software or hardware architectures of related Cisco products

Detailed roadmap discussion for related Cisco products

Detailed discussion on service / end-to-end failure technologies

Discussion on user-driven failure detection methods (ping, traceroute etc.) and using scripts / EEM to automate those


Agenda

Overview

Layer 1 Failure Detection

Layer 2 Failure Detection

Layer 3 Failure Detection

Summary


Routing Convergence in Action

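[Figure: routers A, B and C with an alternate path via D, and a timeline from t0 to t4. The link between B and C fails ("Ooops.. Problem"); the adjacent routers announce "Folks: my link to B is down" and "Folks: my link to C is down"; one router reacts with "Ok, fine, will use path via D", another with "I don’t care, nothing changes for me"]

Loss of Connectivity = t4 – t0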


Routing Convergence Components

1. Failure Detection

2. Failure Propagation (flooding, etc.)

3. Topology/Routing Recalculation

4. Update of the routing and forwarding table (RIB & FIB)
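The four components above add up to the total loss of connectivity (t4 – t0). A minimal Python sketch of that budget follows; the class name and all millisecond values are hypothetical, chosen only to illustrate how the detection phase can dominate the total.

```python
from dataclasses import dataclass


@dataclass
class ConvergenceBudget:
    """Time spent in each routing convergence component, in milliseconds."""
    detection_ms: float       # 1. Failure detection
    propagation_ms: float     # 2. Failure propagation (flooding, etc.)
    recalculation_ms: float   # 3. Topology/routing recalculation
    table_update_ms: float    # 4. RIB & FIB update

    def loss_of_connectivity_ms(self) -> float:
        # Loss of connectivity = t4 - t0 = sum of the four phases
        return (self.detection_ms + self.propagation_ms
                + self.recalculation_ms + self.table_update_ms)

    def detection_share(self) -> float:
        # Fraction of the outage spent just noticing the failure
        return self.detection_ms / self.loss_of_connectivity_ms()


# Hypothetical numbers, not measurements from any platform
budget = ConvergenceBudget(detection_ms=2000.0, propagation_ms=50.0,
                           recalculation_ms=100.0, table_update_ms=350.0)
print(budget.loss_of_connectivity_ms())    # 2500.0
print(round(budget.detection_share(), 2))  # 0.8
```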

[Figure: timeline from t0 to t4 divided into phases 1 to 4, labelled "IGP and BGP Reaction"]

Failure Detection Overview

Detecting the failure is the most critical, but also the most challenging, part of network convergence

Failure Detection can occur on different levels / layers:

Physical Layer (1)

Data link Layer (2)

Network Layer (3)

Service / Application (not covered here)

Do you really need to touch all the layers?


Interconnection Options

[Figure: layer stack with IP/MPLS at L3, Ethernet/FR/ATM … at L2, and SONET/SDH, OTN and DWDM at L1; options A to D are drawn across the layers]

A. Layer 3 p2p

B. Layer 3 with a Layer 1 (DWDM) “bump” in wire

C. Layer 3 with a Layer 2 (Ethernet / Frame Relay / ATM switch) “bump” in wire

D. Layer 3 with a Layer 3 (Firewall / router) “bump” in wire

Failure Detection Tools

Layered Approach

Service / Application: 802.1ag CFM; Y.1731 PM; BFD for VCCV, GRE; FabricPath/TRILL OAM

Layer 3: BFD for BGP, OSPF, IS-IS, EIGRP, FHRPs and static

MPLS: BFD for MPLS LSPs / TE-FRR

Layer 2: UDLD; LACP; 802.1ag CFM / Y.1731 FM; 802.3ah Link OAM

Layer 1: Bit transmission; Signaling: Auto-negotiation / FEFI / Remote Fault Indication; Other: Carrier Delay / Debounce

Engineering Complexity vs. Gain: K.I.S.S.

[Figure: cost and complexity plotted against loss (impairments/time), with three regions: Potential Over-Engineering, Viable Engineering, and Re-engineering Required]

Number of possible approaches, or combinations of approaches.

Range of viable engineering options may vary by type of application

Layer 1 Failure Detection

Layer 1 – IPoDWDM Proactive Protection

IP / optical integration makes it possible to identify a degraded link from optical data (pre-FEC BER) and to start protection (e.g. by signaling to the IGP/FRR) before traffic starts failing, achieving hitless failover in many cases

[Figure: reactive vs. proactive protection, both plotting BER and corrected bits against optical impairments with the FEC limit marked. Reactive protection (transponder, optical port on router): the working path runs until LOF, and data is lost during the switchover to the protected path. Proactive protection (WDM port on router, FEC on the router): a protection trigger moves traffic from the working path to the protect path in a near-hitless switch]

HW Support: CRS, ASR 9000, XR 12000, 7600. Check specific interface types!

Layer 1 Failure Detection – Ethernet

Ethernet mechanisms such as auto-negotiation (1 GigE), FEFI (100FX) or link fault signalling (802.3ae/ba) can signal local failures to the remote end

Getting this signal across an Eth-over-SDH/OTN cloud is a challenge, as relaying the fault information to the other end is not always possible

Link Fault Signaling

[Figure: R1 and R2 connected back to back (tx/rx pairs) with a failure on one fiber; and R1 and R2 connected through an optical transport network (MUX-A, MUX-B), a “bump” in the Layer 1 link, with a failure inside the transport]

Link Down Detection

Link-down / interface-down event detection is hardware-dependent

Catalyst 6500 and Cisco 7600 OSM, SIP, 6708-10GE and more recent I/O modules use interrupt-driven notification, offering <10 ms detection

6704 offers <30 ms with optimized polling

All other, older I/O modules are polled in order, 20 ms per port: worst case 48 * 20 ms = 960 ms to detect a failure!

Enhancement with CSCsr21196 (SXI, SRD2, SRC3): 60 msec for fiber ports

Nexus switches / CRS / ASR 9000: interrupt-driven notification
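The worst-case figure for sequentially polled modules is simple arithmetic. A short Python sketch, using only the port count and per-port polling time quoted above (the function name is illustrative):

```python
def worst_case_polling_detection_ms(ports: int, poll_ms_per_port: int) -> int:
    """Worst case: the failed port was polled just before it went down,
    so every port on the module is visited once before it is seen again."""
    return ports * poll_ms_per_port


# 48-port module polled in order at 20 ms per port
print(worst_case_polling_detection_ms(48, 20))  # 960
```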

How Fast?


Carrier Delay

Timer running in software

Filters link up and down events, then notifies protocols

By default, most IOS versions set the timer to 2 seconds to suppress short flaps

This behaviour is not desirable for fast convergence

Setting carrier-delay to 0 on an SVI is not recommended

Standard routing platform feature
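To illustrate the filtering effect, here is a simplified Python model of a carrier-delay timer. It is a sketch under stated assumptions, not the IOS implementation: a state change is reported to protocols only if the link stays in the new state for the whole delay, and the same delay applies to up and down events. The function name and the event timings are hypothetical.

```python
from typing import List, Tuple

Transition = Tuple[int, str]  # (time in ms, "up" or "down")


def filter_with_carrier_delay(transitions: List[Transition],
                              delay_ms: int,
                              initial: str = "up") -> List[Transition]:
    """Return the state changes that protocols would be notified of."""
    reported: List[Transition] = []
    reported_state = initial
    for i, (t, state) in enumerate(transitions):
        next_t = transitions[i + 1][0] if i + 1 < len(transitions) else None
        # The new state "sticks" if nothing changes before the timer expires
        stable = next_t is None or next_t - t >= delay_ms
        if stable and state != reported_state:
            reported.append((t + delay_ms, state))
            reported_state = state
    return reported


# A 300 ms flap followed by a real failure
events = [(1000, "down"), (1300, "up"), (5000, "down")]

# 2-second timer: the flap is hidden, but the real failure is reported 2 s late
print(filter_with_carrier_delay(events, 2000))  # [(7000, 'down')]

# Timer of 0: every event is reported immediately
print(filter_with_carrier_delay(events, 0))
# [(1000, 'down'), (1300, 'up'), (5000, 'down')]
```

The trade-off is visible in the two runs: the longer timer suppresses the short flap at the cost of adding the full delay to detection of the real failure.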

interface …
 carrier-delay msec 0

Asymmetric Carrier Delay

When connecting to an Ethernet Layer 2 cloud, it may be desirable to delay link-up slightly, without changing the link-down carrier delay

Otherwise, the initial ARP request could get dropped in the L2 cloud, which can create a short black hole (due to incomplete adjacency)

Some device drivers have a built-in up-delay

POS: Generally 10 seconds

7600 ES20/40 WAN ports: 4 seconds

interface …
 carrier-delay up 20

interface …
 carrier-delay up msec 20

SW Support: IOS 12.0(32)SY2, 12.2SRD; IOS XR 3.4.0

Debounce Timer

Delays link-down notification only

Runs in firmware

100 msec default in NX-OS

300 msec default on IOS on copper, 10 msec on fiber

In most cases it is recommended to keep it at the default

Standard switching platform feature

NX-OS:

switch(config)# interface …
switch(config-if)# link debounce time ?
  <0-5000>  Timer value (in milliseconds)

Carrier Delay vs Debounce timer

Carrier Delay / Asymmetric Carrier Delay:

• Runs in software
• Filters link down and up events
• Not applicable to:
  - Switches, except WAN interfaces (e.g. ES+ or SIP/SPA on Catalyst 6500)
  - Ethernet LAN switching interfaces on routers (e.g. Cisco 7600 with WS-X6708 card)

Debounce timer:

• Runs in firmware
• Filters link down events only
• Not applicable to:
  - Routers, except Ethernet LAN switching interfaces (e.g. Cisco 7600 with WS-X6708 card)
  - WAN interfaces on switches (e.g. ES+ or SIP/SPA on Catalyst 6500)
  - SVIs

Make sure to test before implementing!

Link Isolation - IP Event Dampening

Logical Diagram

[Figure: actual interface state over time; the accumulated penalty plotted against the maximum penalty, suppress threshold and reuse threshold; and the resulting interface state seen by routing protocols]

SW Support: IOS, IOS XE, IOS XR
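The quantities named in the diagram (accumulated penalty, suppress threshold, reuse threshold, maximum penalty) can be modelled with a small Python sketch. This is an illustration only: it assumes a fixed penalty per flap and exponential decay with a half-life, and every default value below is an assumption for the example rather than a figure taken from this session.

```python
class DampenedInterface:
    """Toy model of penalty-based interface dampening (assumed behaviour)."""

    def __init__(self, half_life_s: float = 5.0, reuse: float = 1000.0,
                 suppress: float = 2000.0, max_penalty: float = 16000.0,
                 flap_penalty: float = 1000.0) -> None:
        self.half_life_s = half_life_s
        self.reuse = reuse
        self.suppress = suppress
        self.max_penalty = max_penalty
        self.flap_penalty = flap_penalty
        self.penalty = 0.0
        self.last_t = 0.0
        self.suppressed = False

    def _decay(self, now_s: float) -> None:
        # Penalty halves every half_life_s seconds
        elapsed = now_s - self.last_t
        self.penalty *= 0.5 ** (elapsed / self.half_life_s)
        self.last_t = now_s
        # Interface is shown to routing protocols again below the reuse threshold
        if self.suppressed and self.penalty < self.reuse:
            self.suppressed = False

    def flap(self, now_s: float) -> None:
        """Record one interface flap at time now_s (times must not go backwards)."""
        self._decay(now_s)
        self.penalty = min(self.penalty + self.flap_penalty, self.max_penalty)
        if self.penalty > self.suppress:
            self.suppressed = True

    def is_suppressed(self, now_s: float) -> bool:
        self._decay(now_s)
        return self.suppressed


iface = DampenedInterface()
iface.flap(0.0)
iface.flap(1.0)
print(iface.is_suppressed(1.0))   # False: penalty still under the suppress threshold
iface.flap(2.0)
print(iface.is_suppressed(5.0))   # True: third flap pushed the penalty over it
print(iface.is_suppressed(10.0))  # False: penalty has decayed below the reuse threshold
```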

Layer 2 Failure Detection

Technology Analysis

What layer?

Keepalive message interval and timeout?

Types of failures detected?

Reaction to failures?

Methods to support ISSU?

Scale?

Protocol offload?

Standardization?

Types of interfaces supported?

Layer 2 and Layer 3 Failure Detection


Network Scenarios

Classical Ethernet Layer 2

Single p2p link

Bundle

FabricPath / TRILL

Single p2p link

Bundle

Layer 3

Single p2p link

Bundle

SVI on top of Classical Ethernet

SVI on top of FabricPath / TRILL


Summary


Layer 2 – Data Link Layer

Generally only applicable to L2 transports using some form of keepalive mechanism

PPP or HDLC keepalives

Frame-Relay LMI

ATM OAM

Ethernet OAM, LACP (bundles), UDLD

Sub-second failure detection at scale is typically not a goal of the features mentioned above

‒ Ethernet OAM / CFM is getting there…

‒ Fast UDLD

Tuning keepalives down to the minimum is NOT recommended: it can lead to false positives, as keepalive processing may not be optimized


Unidirectional Link Detection (UDLD)

Light-weight Layer 2 failure detection protocol

Designed for detecting:

One-way connections due to physical failure

One-way connections due to soft failure

Mis-wiring detection (loopback or triangle)

Cisco proprietary, but listed in informational RFC 5171

Runs on any single Ethernet link, even inside a bundle

Typically a centralized implementation (hellos sent from the supervisor, not from the LC)

Message interval: 7-90 sec (default: 15 seconds)

Detection: 2.5 x interval + timeout value (4 sec), i.e. ~21 sec at the minimum 7-second interval and ~41 sec at the 15-second default
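The detection formula can be evaluated for any configured message interval. A small Python sketch using only the constants quoted on this slide (function and constant names are illustrative):

```python
UDLD_TIMEOUT_S = 4.0  # timeout value quoted above


def udld_detection_time_s(message_interval_s: float) -> float:
    """Detection time = 2.5 x message interval + timeout value."""
    if not 7 <= message_interval_s <= 90:
        raise ValueError("message interval must be within 7-90 seconds")
    return 2.5 * message_interval_s + UDLD_TIMEOUT_S


for interval in (7, 15, 90):
    print(interval, udld_detection_time_s(interval))
# 7 21.5
# 15 41.5
# 90 229.0
```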


UDLD Basics of Operation

With ECHO messages, each device learns:

What it is connected to, and the peer’s message interval

What its neighbors think they are connected to!

This information can then be used to detect faults

A FLUSH message is sent when UDLD is disabled

Aging mechanism with PROBE messages:

Information from neighbors that is not periodically refreshed is eventually timed out

This can also be used for fault detection

Peer Discovery and Relationship


UDLD Scenario 1

Transmit path failure from A to B

The echo packet from A to B has “My Switch-ID A, My Port-ID e x/y”

When B sends the echo-reply back, it is expected to have “My Switch-ID B, My Port-ID e w/z” AND “Your Switch-ID A, Your Port-ID e x/y”

With the transmit path from A to B failed, B’s echo-reply has only “My Switch-ID B, My Port-ID e w/z”. B timed out!

Empty-Echo condition or age out
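The check that A performs on B’s reply can be sketched in Python. This is a simplified model, not the UDLD packet format: the `UdldMessage` type, its field names and the result strings are hypothetical and only mirror the “My ID” / “Your ID” pairs described above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

PortId = Tuple[str, str]  # (switch ID, port ID)


@dataclass
class UdldMessage:
    sender: PortId                                   # "My Switch-ID, My Port-ID"
    echoed_neighbors: List[PortId] = field(default_factory=list)  # "Your ..." pairs


def classify_reply(local: PortId, reply: UdldMessage) -> str:
    """Decide what the local port learns from a neighbor's echo-reply."""
    if not reply.echoed_neighbors:
        return "empty-echo"          # neighbor hears nobody: our Tx path is suspect
    if local in reply.echoed_neighbors:
        return "bidirectional"       # neighbor hears us, and we hear it
    return "neighbor-mismatch"       # neighbor hears someone else


a = ("A", "e x/y")
b = ("B", "e w/z")

print(classify_reply(a, UdldMessage(sender=b, echoed_neighbors=[a])))  # bidirectional
print(classify_reply(a, UdldMessage(sender=b)))                        # empty-echo
```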

[Figure: Switch A (port e x/y) connected to Switch B (port e w/z), each running UDLD on top of its packet manager (PktMgr); the path from A to B is marked as failed]

UDLD Scenario 2

Miswiring Detection

Caused by packets flowing in only one (uni) direction

Key differentiating factor of UDLD!

With SFP-type fiber connections, this error is less common

[Figure: miswired fiber between Switch A (e x/y), Switch B (e w/z) and Switch C (e s/t)]

Fast UDLD

UDLD message interval to achieve sub-second detection

New Fast Hello TLV for backward compatibility

Message interval: 200 msec – 1 sec

Similar considerations as Layer 3 timer tuning:

CPU usage (false positives) and scale (not designed for this)

SSO / ISSU support

SW Support: IOS 12.2.33 SXI4, 12.2(54)SG

IOS:

switch(config)#interface GigabitEthernet1/1
switch(config-if)#udld fast-hello ?
  <200-1000>  Time in milliseconds between sending of messages in steady state

switch#show udld fast-hello
Total ports with fast hello configured: 10
Total ports with fast hello operational: 5
Total ports with fast hello non-operational: 5
Fast hello configuration setting (millisecond):
Interface Gi1/1 200 operational
Interface Gi1/6 500 configured

UDLD Failure Reaction

Normal vs. Aggressive mode

Normal:

• Sets the port to err-disable state in case of a uni-direction condition: Empty Echo packet, Uni-direction, TX/RX loop, and Neighbor Mismatch
• Does NOT err-disable the port in case of sudden cessation of UDLD packets

Aggressive:

• Sets the port to err-disable state in case of a uni-direction condition: Empty Echo packet, Uni-direction, TX/RX loop, and Neighbor Mismatch
• Sets the port to err-disable state in case of sudden cessation of UDLD packets: the port is put in err-disable mode if no UDLD packets are received for 3 x hello-time + 5 sec (= 50 sec by default)
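The difference between the two modes reduces to a small decision table plus one timer formula. A Python sketch of both follows; the condition strings and function names are illustrative labels for the cases listed above, not UDLD identifiers.

```python
UNIDIRECTIONAL_CONDITIONS = {
    "empty-echo", "uni-direction", "tx-rx-loop", "neighbor-mismatch",
}


def err_disables_port(mode: str, condition: str) -> bool:
    """Return True if UDLD in the given mode err-disables the port."""
    if mode not in ("normal", "aggressive"):
        raise ValueError(f"unknown mode: {mode}")
    if condition in UNIDIRECTIONAL_CONDITIONS:
        return True                      # both modes react the same way
    if condition == "packets-ceased":
        return mode == "aggressive"      # only aggressive mode reacts
    raise ValueError(f"unknown condition: {condition}")


def aggressive_timeout_s(hello_time_s: float) -> float:
    """Aggressive mode: err-disable after 3 x hello-time + 5 seconds of silence."""
    return 3 * hello_time_s + 5


print(err_disables_port("normal", "packets-ceased"))      # False
print(err_disables_port("aggressive", "packets-ceased"))  # True
print(aggressive_timeout_s(15))                           # 50
```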


Spanning Tree Bridge Assurance

Turns STP into a bidirectional protocol

Ensures spanning tree fails “closed” rather than “open”

All ports with “network” port type send BPDUs regardless of state

If a network port stops receiving BPDUs, the port is placed in BA-Inconsistent state (blocked)

Caveats:

Not recommended on VPC ports

ISSU on Nexus 5000 not supported with STP BA (VPC peer-link is exception)

NX-OS:

%STP-2-BRIDGE_ASSURANCE_BLOCK: Bridge Assurance blocking port Ethernet2/48 VLAN0700.
switch# sh spanning vl 700 | in -i bkn
Eth2/48 Desg BKN*4 128.304 Network P2p *BA_Inc

SW Support: IOS 12.2.33 SXI, 12.2.50SY; NX-OS 4.0(1)

With Bridge Assurance

[Figure: spanning tree topology with a root switch, Network ports exchanging BPDUs, Edge ports and one blocked port. A malfunctioning switch stops sending BPDUs; its neighbors' Network ports ("Stopped receiving BPDUS!") become BA Inconsistent]

%STP-2-BRIDGE_ASSURANCE_BLOCK: Bridge Assurance blocking port Ethernet2/48 VLAN0700.

switch# show spanning vl 700 | in -i bkn

Eth2/48 Altn BKN*4 128.304 Network P2p *BA_Inc


UDLD “Original” Deployment Scenarios

Assist unidirectional Layer 2 protocols

[Figure 1: Spanning Tree Loop Prevention. Root switch A with switches B and C, links 1 to 3, alternate port blocking]

[Figure 2: Spanning Tree Fast Convergence. Root switch A with switches B and C, links 1 and 2, alternate port blocking]

[Figure 3: Ether-channel Convergence. Channel group 1 mode on]

Labels overlaid on the figures: RSTP 802.1w, STP Bridge Assurance, LACP

UDLD Best Practices

How much do you really need UDLD?

Physical uni-directional failures are communicated by Layer 1 mechanisms

STP Bridge Assurance accounts for soft failures in either direction

LACP accounts for failures on bundle members

The chance of mis-wiring may be small

Are you on a Layer 3 / FabricPath p2p link that already runs a bidirectional protocol?

If UDLD is needed:

Use normal mode

Use default timers

Choose only a few interfaces for Fast UDLD


OAM

Link OAM - Any point-to-point 802.3 link

CFM / Y.1731 - End-to-End UNI to UNI

E-LMI - User to Network Interface (UNI)

MPLS OAM - within MPLS cloud

Current Protocol Positioning

[Figure: end-to-end service path from Customer (business / residential) through Access (provider bridges), Core (IP/MPLS, backbone bridges, MSE/BNG) and Access back to Customer, with UNI and NNI boundaries marked. Ethernet Link OAM and E-LMI are shown at the access, MPLS OAM in the core, and Connectivity Fault Management and Y.1731 Performance Management end to end]

Ethernet OAM

IEEE 802.3ah (clause 57)

Ethernet Link OAM

Also referred to as 802.3 OAM or Link OAM

IEEE 802.1ag

Connectivity Fault Management (CFM)

Also referred to as Service OAM

ITU-T Y.1731

OAM functions and mechanisms for Ethernet-based networks

MEF E-LMI

Ethernet Local Management Interface

Building Blocks


Link OAM

Provides mechanisms for “monitoring link operation”

Runs on any single point-to-point Ethernet link

Uses “Slow Protocol” (1) frames called OAMPDUs

OAMPDU interval: 100 msec – 1 sec (1-10 pps)

Minimum Timeout: 200 msec (IOS XR), 2 sec (IOS)

Extensible and flexible protocol

Support mainly on Carrier Ethernet platforms:

Cisco 7600, ASR 9000, ASR 901, ASR 903, ME switches

IEEE 802.3ah, Clause 57 (IEEE 802.3-2008)

(1) No more than 10 frames transmitted in any one-second period

IEEE 802.3ah

OAM Discovery

Discover OAM support, peer identity and capabilities per device

Link Monitoring

Basic error definitions for Ethernet so entities can detect degraded links and isolate them

Remote Failure Indication

Mechanisms for one entity to signal another that it has detected an error

Remote Loopback

Used to troubleshoot networks; allows one station to put the other station into a state whereby all inbound traffic is immediately reflected back onto the link

Remote MIB Variable Retrieval

Ability to read one or more MIB variables from the remote DTE

Key Functions


Link OAM Discovery

First phase of Ethernet OAM

Discovery has a simple state machine:

Send Information OAMPDU in a periodic fashion

Discover peer device and its OAM configuration and capabilities

Decide whether OAM clients can be fully operational on the link

Detect timeout based on lack of OAMPDUs from peer

No message interval exchange or negotiation!

switch#show ethernet oam discovery interface fas 1/1
FastEthernet1/1
Local client
------------
  Administrative configurations:
    Mode:              active
    Unidirection:      not supported
    Link monitor:      supported (on)
    Remote loopback:   not supported
    MIB retrieval:     not supported
    Mtu size:          1500
  Operational status:
    Port status:       operational
    Loopback status:   no loopback
    PDU revision:      0
Remote client
-------------
  MAC address: 0011.9321.1640
  Vendor(oui): 00000C(cisco)
  Administrative configurations:
    PDU revision:      1
    Mode:              active
    Unidirection:      not supported
    Link monitor:      supported
    Remote loopback:   not supported
    MIB retrieval:     not supported
    Mtu size:          1500

Link OAM scale and ISSU

• Scale

Slow protocol, but a 100 msec interval for all ports on a linecard is not slow!

Protocol offload to I/O module CPU helps

Protocol offload to FPGA (ME 3400) helps even more!

• ISSU (the “zero service disruption one”)

Need graceful protocol mechanisms to support SSO / ISSU; the standard does not specify any

Not possible to inflate timers since timers are not negotiated!

Link OAM Basic Configuration

IOS and IOS XR

IOS XR (TenGigE 0/1/0/0):

interface TenGigE 0/1/0/0
 ethernet oam
  ! hello-interval: value in msec or sec
  hello-interval 100ms
  ! connection timeout: local hello multiplier
  connection timeout 2

IOS (TenGigEthernet4/1):

interface TenGigEthernet4/1
 ethernet oam
 ! max-rate: value in pps
 ethernet oam max-rate 10
 ! timeout: value in seconds
 ethernet oam timeout 2
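The two configurations express their timers in different units, which is easy to misread. A Python sketch of the conversions, assuming (as the slide annotations indicate) that the IOS XR timeout is a multiplier of the local hello interval and the IOS timeout is given in seconds; the function names are illustrative.

```python
def oampdu_rate_pps(hello_interval_ms: int) -> float:
    """OAMPDUs per second for a given hello interval."""
    return 1000 / hello_interval_ms


def xr_timeout_ms(hello_interval_ms: int, multiplier: int) -> int:
    """IOS XR style: timeout = local hello interval x multiplier."""
    return hello_interval_ms * multiplier


def ios_timeout_ms(timeout_s: int) -> int:
    """IOS style: timeout configured directly in seconds."""
    return timeout_s * 1000


# Values from the configuration example above
print(oampdu_rate_pps(100))   # 10.0 pps, the same rate as "max-rate 10"
print(xr_timeout_ms(100, 2))  # 200 ms
print(ios_timeout_ms(2))      # 2000 ms
```

With these values both sides send 10 OAMPDUs per second, but the peer timeout differs by a factor of ten (200 msec vs. 2 sec).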


Link OAM - Link Monitoring

Monitor link quality every 1 sec (min)

Conditions monitored:

Errored Symbol Period

Errored Frame

Errored Frame Period

Errored Frame Seconds

Receive CRC (Cisco defined – IOS only)

Transmit CRC (Cisco defined – IOS only)

Configure error condition thresholds to:

Signal peer with “Event Notification” OAMPDU

Syslog / SNMP trap

Isolate the link


Link OAM – Link Monitoring

Problem

Ensure CRCs injected by devices don’t propagate through the network

Need to operate with or without neighbor discovery

Solution

IEEE 802.3ah for link monitoring and error-disable

Example: CRC Detection and Link Isolation (IOS)

interface GigabitEthernet1/1
 ethernet oam
 ethernet oam link-monitor receive-crc window 1
 ethernet oam link-monitor receive-crc threshold high 10
 ethernet oam link-monitor high-threshold action error-disable-interface
……
Nov 10 09:56:08.643: EOAM LM(Gi1/1): sending an EventTLV!
Nov 10 09:56:09.643: %ETHERNET_OAM-5-LINK_MONITOR: 94 rx CRC errors detected over the last 1 seconds on interface Gi1/1.
Nov 10 09:56:09.643: EOAM LM(Gi1/1): sending an EventTLV!
Nov 10 09:56:09.647: %PM-SP-4-ERR_DISABLE: link-monitor-failure error detected on Gi1/1, putting Gi1/1 in err-disable state
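The window / high-threshold logic in this example can be sketched in Python. This is an illustrative model, not the IOS implementation: it assumes one error counter sample per second, non-overlapping windows measured in samples, and that the action fires when the window total exceeds the threshold. The per-second error counts other than 94 are hypothetical.

```python
from typing import Iterable, List, Tuple


def windows_over_threshold(errors_per_second: Iterable[int],
                           window_s: int,
                           high_threshold: int) -> List[Tuple[int, int]]:
    """Return (start second, error total) for each window above the threshold."""
    samples = list(errors_per_second)
    hits: List[Tuple[int, int]] = []
    for start in range(0, len(samples) - window_s + 1, window_s):
        total = sum(samples[start:start + window_s])
        if total > high_threshold:
            hits.append((start, total))  # here the port would be error-disabled
    return hits


# 94 rx CRC errors in one second against a high threshold of 10
rx_crc = [0, 3, 94, 0]
print(windows_over_threshold(rx_crc, window_s=1, high_threshold=10))  # [(2, 94)]
```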


Link OAM Miswiring Detection (IOS XR only)

Mechanism to detect miswiring of Ethernet

ports

Similar to UDLD, but using standard protocol

with Cisco vendor extension

Uses existing 4-byte field in periodic

OAMPU (Information OAMPDU Vendor

TLV ‘Vendor Information’ field)

Vendor Information is copied back by the

peer, allowing for MWD

Interoperates with other 802.3ah-compliant

vendors

Closing the gap with UDLD

SW Support: IOS XR 3.9

[Figure: ports X, Y and Z exchanging OAMPDUs. X announces "I am X"; Y announces "I am Y, I know X"; Z announces "I am Z, I know Y".]

interface TenGigE 0/1/0/0

ethernet oam

action wiring-conflict

error-disable-interface
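
The copy-back check behind miswiring detection can be modelled in a few lines of Python. This is a conceptual sketch only: the function name and the single-letter port identifiers are hypothetical, and the real mechanism carries the identifier in the 4-byte Vendor Information field.

```python
from typing import Optional


def wiring_conflict(my_id: str, echoed_id: Optional[str]) -> bool:
    """Return True when the peer echoes back an identifier that is not ours."""
    if echoed_id is None:
        # The peer has not echoed anything yet, so nothing can be concluded.
        return False
    return echoed_id != my_id


# Correct wiring: X hears "I am Y, I know X".
print(wiring_conflict("X", "X"))
# Miswired: X hears "I am Z, I know Y", so its Rx comes from a port
# that is listening to someone else.
print(wiring_conflict("X", "Y"))
```

A conflict is what the `action wiring-conflict` line in the configuration above would react to.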


Link OAM Failure Reaction

No standards that define this!

Depending on implementation, available options for

failure reaction / path isolation:

Syslog / SNMP trap

Signal peer using specific OAMPDU

Error-disable

Error-block

Error-disable – operates at Layer 1, useful when manual

intervention must be forced after an error (such as

mis-wiring)

Today, only IOS XR can isolate path based on peer

timeout or received notification OAMPDU!

Path Isolation


Link OAM Failure Reaction

Mechanism for OAM protocol to bring down interface “line protocol” state

when a problem is detected

Interface / sub-interface / bundle is “down” to routing / switching protocols

(MSTP, ARP, IGPs, BGP) – will trigger reconvergence

E-OAM protocols continue to operate

Automatic recovery when fault is resolved

IOS XR only, IOS supports error-block

Benefits:

Reduced interface up/down churn

Deterministic recovery

Path Isolation with Ethernet Failure Detection (EFD)


interface TenGigE 0/1/0/0

ethernet oam

action link-fault error-disable-interface

action link-fault efd

action discovery-timeout error-disable-interface

action discovery-timeout efd


Ethernet Failure Detection (EFD)

Logical Diagram

[Figure: interface stack with the MAC layer and packet I/O below the L2VPN, IPv4, IPv6 and MPLS clients; Link OAM and EFD sit beside the stack, and state labels change from UP to DOWN once a failure is detected. Other label: CDM.]

SW Support: IOS XR 3.9


Link OAM vs UDLD

Link OAM adoption is growing, could be adopted in

enterprises / DC in future

Stick with UDLD (at least for now):

Link OAM mis-wiring detection only on IOS XR as

proprietary extension

Link OAM path isolation based on timeout only in IOS XR

Consider Link OAM today:

Must adhere to standard protocols

Link Monitoring capabilities

Who Wins?


Link Aggregation Control Protocol (LACP)

Protocol used to:

‒ Ensure configuration consistency across bundle

members on both ends

‒ Ensure wiring consistency (bundle members

between 2 chassis)

‒ Detect unidirectional links

‒ Bundle member keepalive

Peers negotiate requested send rate among

other things through LACPDUs

Loss of heartbeat typically triggers port

suspend

IEEE 802.1ax (formerly 802.3ad)


interface gig 0/1/2/3

bundle id <n> mode active

lacp period 100

interface Bundle-Ether 1

lacp cisco enable

LACP Slow, Fast and Super Fast Hellos

Traditional LACP heartbeat intervals

Long interval: 30 sec → 90 sec failure detection

Short interval: 1 sec → 3 sec failure detection

IOS / IOS-XE / IOS XR / NX-OS

Heartbeats typically sent from supervisor, so SSO /

ISSU will not work with aggressive timers

Very fast LACP hellos sent from ASR 9K / CRS

linecard

Proprietary Cisco extension on IOS-XR allows for:

Signalling at 100 msec with 300 msec failure detection

ISSU support with fast timers (from IOS XR 4.1)

Use only if you can't do per-link BFD or Fast UDLD and

need sub-second detection!
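
The interval and detection-time pairs on this slide all follow the same 3x relationship. A minimal Python sketch of that arithmetic (the constant and function names are hypothetical):

```python
# The slide pairs 30 s with 90 s, 1 s with 3 s and 100 ms with 300 ms,
# i.e. a peer is timed out after three missed LACPDUs.
LACP_TIMEOUT_MULTIPLIER = 3


def lacp_detection_ms(period_ms: int) -> int:
    """Failure detection time for a given LACPDU transmit period."""
    return period_ms * LACP_TIMEOUT_MULTIPLIER


for name, period_ms in [("long", 30_000), ("short", 1_000), ("very fast (IOS XR)", 100)]:
    print(f"{name}: period {period_ms} ms, detection {lacp_detection_ms(period_ms)} ms")
```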

interface gig 0/1/2/3

bundle id <n> mode active

lacp period short

SW Support: IOS XR 3.9

interface Ethernet1/7

lacp rate fast

IOS / NX-OS

IOS XR


Agenda

Overview

Layer 1 Failure Detection

Layer 2 Failure Detection

Layer 3 Failure Detection

Summary


Failure Detection at Layer 3

In some cases, failure detection relies on checks at Layer 3

How quickly can I detect a failure (neighbor down event)?

L2 bridged network

DWDM/X without LoS propagation

Tunnels (GRE, IPsec, etc.)

[Figure: a failure (X) in the path between two routers; one end reports "Something just happened!", the other "Something happened a while ago!"]


Is Layer 3 Failure Detection Tuning Necessary?

Needed when:

Intermediate L2 hop over L3 link

Concerns over any protocol software failures

Concerns over unidirectional failures on point-to-point physical L3 links

May not be needed when:

Point-to-point physical L3 links with no concerns over unidirectional failures

Enough software redundancy to account for protocol software failures

FHRPs are running in active-active mode (VPC/VPC+ in Nexus 5000 / 7000)


FHRPs with vPC / vPC+ in NX-OS

HSRP, VRRP and GLBP in vPC / vPC+

environment operate in Active/Active mode

No additional configuration required

General best practices still apply, except:

Since running in active/active mode,

aggressive timers can be relaxed

No need to manipulate priorities / preemption

on different devices to achieve load-balancing

Active/Active Mode

[Figure: vPC peer pair at the L3 / L2 boundary]

HSRP/VRRP “Active”: Active for shared L3 MAC

HSRP/VRRP “Standby”: Active for shared L3 MAC


Layer 3 Failure Detection

All Layer 3 protocols (FHRPs, BGP, EIGRP, OSPF etc) use HELLOs to:

Maintain adjacencies (pass protocol specific info)

Check neighbour reachability and detect failure

Hello/Keepalive and Dead/Hold timers can be tuned down, however it is

not recommended:

Each interface may have 2-3+ protocols establishing adjacency (e.g. HSRP, PIM,

OSPF on SVI)

Increased supervisor CPU utilization → false positives

Configuration complexity and waste of link bandwidth

Challenges supporting ISSU / SSO

Challenges achieving sub-second detection

Having said this: works reasonably well in small & controlled environments
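
To see why tuned-down protocol timers load the supervisor, a rough back-of-the-envelope model helps. The sketch below is illustrative only: the function name, the interface count and the hello intervals are assumptions, not figures from this session.

```python
def hello_rate_pps(interfaces: int, protocols_per_interface: int, hello_interval_ms: int) -> float:
    """Hellos per second the control plane must send (and roughly also receive)."""
    return interfaces * protocols_per_interface * 1000 / hello_interval_ms


# Hypothetical box: 500 SVIs, each running 3 protocols (e.g. HSRP, PIM, OSPF).
print(hello_rate_pps(500, 3, 1000))  # 1-second hellos
print(hello_rate_pps(500, 3, 250))   # 250 ms hellos: four times the load
```

The load grows with every protocol and every interface, which is the case for a single shared detection mechanism.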

Protocol Timers


Bidirectional Forwarding Detection (BFD)

Lightweight hello protocol designed to run over

multiple transport protocols:

‒IPv4, IPv6, MPLS, TRILL

Designed for sub-second Layer 3 failure

detection

Any interested client (OSPF, BGP, HSRP etc.)

registers with BFD and is notified as soon as BFD

detects a neighbor loss

All registered clients benefit from uniform failure

detection

Runs on physical, virtual and bundle interfaces

Uses UDP port 3784 / 3785 (for echo)

RFC 5880 / 5881


Layer 3 Failure Detection with BFD

Bidirectional Forwarding Detection (BFD) – recommended Layer 3 failure

detection mechanism over lowered protocol timers

BFD general advantages:

Reduced control plane load and link bandwidth usage

Sub-second failure detection

In-flight timer negotiation

BFD platform-specific advantages:

Stateful restart, SSO and ISSU support

Protocol off-load / distributed implementation – I/O module transmits / receives

BFD packets

Per-link implementations with bundles


BFD Peer Establishment

• No discovery – peer IP provided by client!

• Neighbors continuously negotiate their desired transmit and receive rates

in terms of microseconds.

• The system reporting the slower rate determines the transmission rate.

Timer Negotiation

Example:

Green: Desired Receive rate = 50 ms, Desired Transmit rate = 100 ms

Orange: Desired Receive rate = 60 ms, Desired Transmit rate = 40 ms

Negotiated rates: Green transmits at 100 ms, Orange transmits at 50 ms

interface <name>

bfd interval <msec> min_rx <msec> multiplier <n>
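
The negotiation rule can be checked with a short Python sketch that reproduces the green / orange numbers from this slide. The class and function names are hypothetical, and the multiplier of 3 is an assumption. The detection-time rule follows the asynchronous-mode definition in RFC 5880: the remote multiplier times the remote system's agreed transmit interval.

```python
from dataclasses import dataclass


@dataclass
class BfdEndpoint:
    desired_min_tx_ms: int    # how fast this side would like to send
    required_min_rx_ms: int   # how fast this side is willing to receive
    detect_mult: int = 3      # assumed multiplier, not given on the slide


def tx_interval_ms(local: BfdEndpoint, remote: BfdEndpoint) -> int:
    """The slower of 'what I want to send' and 'what the peer can receive' wins."""
    return max(local.desired_min_tx_ms, remote.required_min_rx_ms)


def detection_time_ms(local: BfdEndpoint, remote: BfdEndpoint) -> int:
    """Async mode: remote multiplier times the interval the remote transmits at."""
    return remote.detect_mult * tx_interval_ms(remote, local)


green = BfdEndpoint(desired_min_tx_ms=100, required_min_rx_ms=50)
orange = BfdEndpoint(desired_min_tx_ms=40, required_min_rx_ms=60)

print("green transmits every", tx_interval_ms(green, orange), "ms")
print("orange transmits every", tx_interval_ms(orange, green), "ms")
print("green declares orange down after", detection_time_ms(green, orange), "ms")
print("orange declares green down after", detection_time_ms(orange, green), "ms")
```

Because each direction is negotiated separately, the two ends can have different detection times.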


BFD Operation Modes

Session established using

asynchronous control packets

Asynchronous mode (no echo):

Control packets sent at negotiated rate

Independent session

Neighbour declared dead if no packet is

received for <interval * multiplier> period

Additionally, if echo is negotiated:

Control packets sent at slow rate

Self-directed echo packets sent at fast

negotiated rate (min Rx interval), used

for failure detection

[Figure: two panels, "Async Mode" and "Async Mode + Echo", showing packets labelled "green is alive" and "orange is alive" exchanged between the green and orange peers]


BFD – OSPF Interaction Example

[Figure: routers R1 and R2, each running OSPF on top of BFD, with an OSPF peering and a BFD session between them]

1. OSPF registers with BFD (on R1 and on R2)

2. Forwarding plane failure between R1 and R2

3. BFD detects failure between R1 and R2

4. BFD notifies OSPF (on R1 and on R2)

5. OSPF adjacency reset between R1 and R2
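
The register / notify relationship between BFD and its clients is essentially an observer pattern. A minimal Python sketch follows; the class, method names and the peer address 192.0.2.2 are hypothetical and do not reflect any platform's internal API.

```python
from typing import Callable, Dict, List


class BfdSession:
    """Toy BFD session that notifies registered clients when the peer is lost."""

    def __init__(self, peer_ip: str) -> None:
        self.peer_ip = peer_ip
        self.up = True
        self._clients: Dict[str, Callable[[str], None]] = {}

    def register(self, client: str, on_down: Callable[[str], None]) -> None:
        """A protocol (OSPF, BGP, HSRP, ...) asks to be told about failures."""
        self._clients[client] = on_down

    def forwarding_failure(self) -> None:
        """BFD detects the failure once and fans the event out to every client."""
        if not self.up:
            return
        self.up = False
        for on_down in self._clients.values():
            on_down(self.peer_ip)


events: List[str] = []
session = BfdSession("192.0.2.2")
session.register("OSPF", lambda peer: events.append(f"OSPF: reset adjacency with {peer}"))
session.register("HSRP", lambda peer: events.append(f"HSRP: peer {peer} lost"))
session.forwarding_failure()
print(events)
```

Every registered client sees the same event at the same moment, which is what gives all of them uniform failure detection.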


BFD Off-load / Distributed Processing

Helps achieve higher BFD scale

SUP-BFD - BFD process running on

Supervisor Engine

Interfaces with LC-BFD processes

Interfaces with BFD clients

LC-BFD – BFD process running on CPU

of each I/O module

Communicates with SUP-BFD process

Generates BFD hellos (echo and async)

Receives BFD hellos from peer (async)

Support for stateful restart, SSO and

ISSU

Nexus 7000 Architecture Example

[Figure: Nexus 7000 example. BFD clients (OSPF, IS-IS, HSRP, PIM, BGP, etc.) and SUP-BFD run on the Supervisor Engine; one LC-BFD process runs on each I/O module above its hardware. Other labels: EOBC, Module Inband.]

Similar Architectures:

CRS-1

ASR 9000

12000 / XR12000

ASR 1K (from IOS XE 3.6)

7600 with ES+ I/O modules


Layer 3 Fast Failure Detection and Link Bundles

• Scenarios:

1. Layer 2 bundle between 2 SVIs

2. Layer 3 bundle

• Each node uses a hash algorithm to distribute the load across bundle members

• Chances are high that control plane packets are only carried on a single link:

‒ Can’t reliably test all links

‒ Single bundle member malfunction can cause black holes which remain undetected

‒ Rely on Layer 1 or Layer 2 (LACP/PaGP/UDLD/OAM) detection

• Can use parallel Layer 3 links instead, load-sharing properties are often similar

• Two approaches for BFD:

1. Single session

2. Per-link sessions
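
Why a single control-plane session cannot test every bundle member can be shown with a toy hash. This Python sketch uses CRC32 as a stand-in for the platform's real (hardware) load-balancing hash; the interface names, addresses and source port are hypothetical.

```python
import zlib
from typing import Sequence, Tuple

Flow = Tuple[str, str, str, int, int]  # src IP, dst IP, protocol, src port, dst port


def pick_member(flow: Flow, members: Sequence[str]) -> str:
    """Map a flow to one bundle member; the same flow always gets the same link."""
    key = "|".join(str(field) for field in flow).encode()
    return members[zlib.crc32(key) % len(members)]


members = ["Eth1/1", "Eth1/2", "Eth1/3", "Eth1/4"]
bfd_flow: Flow = ("10.0.0.1", "10.0.0.2", "udp", 49152, 3784)

# A thousand packets of the same session never leave the one hashed link.
links_used = {pick_member(bfd_flow, members) for _ in range(1000)}
print(links_used)
```

The other three members carry user traffic but are never exercised by this session, so a fault on one of them goes unnoticed unless each link gets its own session.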

Challenges


BFD over Bundle Members (BOB)

IPv4 BFD session per bundle member

IPv6 relies on IPv4 session state

Verify every member link forwarding state by

establishing BFD session before its added to bundle

Master session on RP consolidates member states

and communicates with clients

Async + echo

Ethernet and POS bundles

IOS XR proprietary, close to proposed standard

CRS / ASR 9000 / XR 12000


interface bundle-ether 1

bfd

address-family ipv4

fast-detect

minimum-interval 15

multiplier 3

destination 10.11.12.13

SW Support

IOS XR 4.0.1 for CRS / ASR 9000

IOS XR 4.1 for XR 12000

Per-link

Sessions


BFD Per-link Mode

BFD session per port-channel member

Master session on SUP consolidates member states and communicates

with clients

LACP is required for port-channels

Async only, no echo

Layer 3 port-channel / sub-interface only

NX-OS proprietary

Minimum interval: 50 msec x 3

Nexus 7000


SW Support

NX-OS

• 5.0(2a)

Per-link

Sessions


BFD Logical Bundles (BLB)

• Single BFD session per L3 destination address

• Internal algorithm to decide which I/O module hosts BFD session

• BFD packet distribution - Tx and Rx packets are polarized on one

bundle link per session

• IPv4 and IPv6 sessions

• Async only

• Replaces BVLAN mode but backward compatible!

• Verified interoperability with IOS and NX-OS single session modes

• Minimum interval: 50 msec x 3 (depends on linecard)

CRS / ASR 9000


SW Support

IOS XR 4.2.3 (CRS)

IOS XR 4.3 (ASR9K)


Single

Session


BFD Logical Mode

Single BFD session per L3 destination address

Internal algorithm to determine which I/O module hosts BFD session

BFD packet distribution:

‒ Prior to NX-OS 5.2(1) – Tx packets are polarized on one bundle link per session

‒ From NX-OS 5.2(1) – Tx packets are round-robin load-balanced on all bundle links

‒ Rx packets are always polarized on one bundle link per session

• Async + echo

• Verified interoperability with IOS XR BLB mode

• Minimum interval is 250 msec x 3

Nexus 3000 / 7000


SW Support

NX-OS

• 5.0(2a)

• 5.0(3)U2(2)

Single

Session


BFD Interoperability with Bundles

Current standards do not address this!

Single session

‒ Easiest to achieve with current standards and

implementations

‒ Verified interoperability between IOS XR BLB

mode, IOS and NX-OS single session mode

Per-link sessions

‒ Most recommended, but solutions are platform

proprietary

‒ IETF draft-mmm-bfd-on-lags-03 will address

interoperability!


BFD and FabricPath / TRILL

Use-case: peer switch path failure detection

Not supported for TRILL / FabricPath yet

Proposed standard:

draft-ietf-trill-rbridge-bfd-07

‒ Does not cover bundle per-link

‒ IS-IS notifies BFD of Rbridge IDs

Link OAM could be adopted in future

FP / TRILL OAM in the works for service /

end-to-end failure detection

Scenario 1 – FabricPath as BFD client


BFD and FabricPath / TRILL

TRILL supports shared Ethernet segments with several peers

FabricPath can only peer on point-to-point links

BFD may be needed more for TRILL than for FabricPath, except…

Point-to-Point vs. Shared Ethernet segment


FabricPath Design Perspective for Failure Detection

DCI may require BFD for FabricPath

Point-to-Point Leaf-Spine vs Data Center Interconnect

[Figure labels: FabricPath active DC1, FabricPath active DC2, fat spine]


BFD and FabricPath / TRILL

• Routing protocol / FHRP peering over FabricPath network

Scenario 2 – BFD client using FabricPath / TRILL as transport


BFD for Static Routes

Next-hop liveness detection

Fail-close solution (remove static route and not reinstate until BFD is up)

Must be configured on both ends
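
The fail-close behaviour can be modelled as a route whose presence in the routing table simply tracks BFD state. A minimal Python sketch, with a hypothetical class name; treating the route as absent before the first BFD "up" is an assumption.

```python
class TrackedStaticRoute:
    """Toy static route that is only installed while its BFD session is up."""

    def __init__(self, prefix: str, next_hop: str) -> None:
        self.prefix = prefix
        self.next_hop = next_hop
        # Assumption: fail-close also applies at start-up, before BFD comes up.
        self.installed = False

    def on_bfd_state(self, up: bool) -> None:
        """Install on BFD up, withdraw on BFD down, never reinstate in between."""
        self.installed = up


route = TrackedStaticRoute("30.0.0.0/24", "10.0.0.1")
route.on_bfd_state(up=True)
print(route.installed)   # route is in the table
route.on_bfd_state(up=False)
print(route.installed)   # route withdrawn until BFD recovers
```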


ip route 30.0.0.0/24 Vlan20 10.0.0.1

ip route static bfd Vlan20 10.0.0.1

ip route 0.0.0.0/0 Vlan10 20.0.0.1

ip route static bfd Vlan10 20.0.0.1

SVI 20

20.0.0.1

SVI 10

10.0.0.1

switch# sh ip route

0.0.0.0/0, ubest/mbest: 1/0

*via 20.0.0.1, Vlan 10, [1/0], static

switch# sh ip route

30.0.0.0/24, ubest/mbest: 1/0

*via 10.0.0.1, Vlan 20, [1/0], static

30.0.0.2

Internet A B


BFD Multi-hop

• Single-hop BFD packets are sent with TTL=255 and are accepted only if they arrive with TTL=255 (RFC 5881)

• If packets go through a device that decrements TTL, multi-hop BFD is needed

• Use-case 1: static route or PBR through routed firewalls / NAT

• Use-case 2: eBGP multi-hop

RFC 5883


ip route 30.0.0.0/24 Vlan 20 12.0.0.1

ip route static bfd Vlan20 10.0.0.1

ip route 0.0.0.0/0 Vlan10 11.0.0.1

ip route static bfd Vlan10 20.0.0.1

switch# sh ip route

0.0.0.0/0, ubest/mbest: 1/0

*via 20.0.0.1, Vlan 10, [1/0], static

switch# sh ip route

30.0.0.0/24, ubest/mbest: 1/0

*via 10.0.0.1, Vlan 20, [1/0], static

30.0.0.2


SW Support

IOS

IOS XR

11.0.0.1 12.0.0.1 SVI 20

20.0.0.1

SVI 10

10.0.0.1

Internet A B


BFD and Security


Support for SHA-1 (NX-OS / IOS) and MD5 (IOS) authentication

Disable platform hardware security mechanisms for BFD echo to function:

uRPF (per interface): no [ip|ipv6] verify unicast source reachable-via [any|rx]

IDS checks (global): no hardware ip verify address identical

IP redirects (per interface): no ip redirects

Open rules to allow echo packets through the firewall, or enable a loopback as the source IP (default on IOS XR): bfd echo-interface <a_loop_back_interface>


BFD Best Practices and Recommendations

1. If Layer 3 fast failure detection is needed, use BFD for all protocols

2. If you can't use BFD, check specific platform support for aggressive protocol

timers

3. Always plan your BFD scale and check with platform capabilities

(centralized vs distributed architecture, interface and client support locally

and on peer)

4. Use BFD echo (default on many platforms) whenever possible, check security

5. On Layer 3 port-channels, use per-link mode and prefer that over echo

6. BFD single-hop for BGP – make sure neighbor update source is a directly

connected interface

7. Make sure BFD packets are prioritized appropriately (Marked with IP precedence 6 /

DSCP CS6 / CoS 6, can also be classified by udp 3784+3785)

8. Make sure neighbours support same BFD version (ver 0 / 1)
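
Recommendation 7 amounts to a simple classification rule. The Python sketch below is illustrative only: the function names are hypothetical and a real deployment would express this in the platform's QoS policy. It matches the two UDP ports named above and maps them to CS6 (DSCP 48).

```python
BFD_UDP_PORTS = {3784, 3785}   # control and echo ports listed on the BFD slide
DSCP_CS6 = 48                  # CS6, equivalent to IP precedence 6


def is_bfd(protocol: str, dst_port: int) -> bool:
    """Classify a packet as BFD by transport protocol and destination port."""
    return protocol == "udp" and dst_port in BFD_UDP_PORTS


def dscp_for(protocol: str, dst_port: int, default_dscp: int = 0) -> int:
    """Return the marking a packet should carry; BFD is lifted to CS6."""
    return DSCP_CS6 if is_bfd(protocol, dst_port) else default_dscp


print(dscp_for("udp", 3784))  # BFD control
print(dscp_for("udp", 3785))  # BFD echo
print(dscp_for("tcp", 179))   # not BFD, keeps the default marking
```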


Agenda

Overview

Layer 1 Failure Detection

Layer 2 Failure Detection

Layer 3 Failure Detection

Summary


Protocol Comparison

Key Decision Criteria

For Your Reference

OSI Layer
BFD: L3
UDLD: L2
Link OAM: L2

Standard
BFD: IETF RFC 5880 / 5881 (with some Cisco enhancements)
UDLD: Cisco proprietary
Link OAM: IEEE 802.3ah (with some Cisco enhancements)

Failures Detected
BFD: Uni-directional soft failures; Bidirectional soft failures
UDLD: Uni-directional soft failures; Bidirectional soft failures; Mis-wiring Detection
Link OAM: Uni-directional soft failures; Bidirectional soft failures; Mis-wiring Detection (IOS XR); Link Degradation

Failure Reaction
BFD: Notify peer and clients; Remove link from bundle (IOS XR, IETF standard in future); BFD dampening (IOS XR)
UDLD: Error-disable (depending on mode)
Link OAM: Notify peer; Error-disable (depending on error type and platform); Error-block; Ethernet Failure Detection (IOS XR)

Bundles and Virtual Interfaces
BFD: Bundle logical, bundle per-link, SVI, sub-interface
UDLD: Single L2 links
Link OAM: Single L2 links

Message Interval and Timeout
BFD: Configurable, exchanged and negotiated; Timeout generally in msec
UDLD: Configurable and exchanged; Timeout generally in 20+ seconds
Link OAM: Configurable, not exchanged; Timeout generally in 2+ seconds

ISSU
BFD: Timer inflation
UDLD: Flush message sent (IOS XR)
Link OAM: No (can be extended in future)


Summary of Network Scenarios and Recommendations


Classical Ethernet Layer 2

Single p2p link

Bundle

FabricPath / TRILL

Single p2p link

Bundle

Layer 3

Single p2p link

Bundle

SVI on top of Classical Ethernet

SVI on top of FabricPath / TRILL


Summary


Fast Failure Detection is Key to Fast Convergence

Business requirements and SLAs to drive technology and protocol choice

One protocol may be enough – keep it simple!

Evolving field with IETF / IEEE / MEF and Cisco innovations

Design your network to take advantage of best practices


Related Cisco Live London 2013 events


Session-ID Session Name

BRKIPM-2265 Deploying BGP Fast Convergence / BGP PIC

BRKCRS-2041 Highly Available Wide Area Network Design

Related Past Cisco Live events

Session-ID Session Name

TECRST-3190 IP Routing Fast Convergence

BRKNMS-2202 Ethernet OAM – Technical Overview and Deployment Scenarios

BRKRST-2032 Highly Available Wide Area Network Design
