BRKCRS-3145 Troubleshooting the Cisco Nexus 5000 / 2000 Series Switches


© 2011 Cisco and/or its affiliates. All rights reserved. Cisco Public

Objectives

Be able to quickly isolate problematic nodes in the datacenter

Become familiar with troubleshooting in NX-OS

Understand Nexus 5000 and Nexus 2000 platform details

Gain comfort using Nexus 5000 and Nexus 2000 day to day


Troubleshooting Nexus 5000 / 2000

Problem Isolation

Network Diagrams

Types of logging

Outputs

When to call TAC

Platform Overview and troubleshooting

Redundancy operation and troubleshooting


Problem Isolation

“A problem well stated is a problem half solved”

Source: Charles F. Kettering, Engineer and Inventor


Troubleshooting Tool #1

A current, accurate diagram

Physical ports

Logical ports

Spanning-tree root and blocked ports

Helpful to use standard formats

.jpg, .bmp, .pdf

If you cannot describe how your network should be operating, time may be wasted

[Figure: example network diagram. N7k-1 and N7k-2 (RSTP root) form vPC domain 100 with peer-link Po100 (e1/2, 2/2) and a vPC peer-keepalive link (e1/1 - e1/1). N5k-1 through N5k-4 form vPC domains 101 and 102 with peer-links Po101 and Po102 (e1/1, 1/2 each) and connect upstream over vPCs Po1 and Po2 (ports e1/30 and e1/31 toward e3/1, e4/1, e3/2, e4/2). N5k-5 attaches over e1/10 and e1/12, with one port in STP blocking state.]

Grab a “show tech-support”

Sometimes too general

Large file, time consuming

If time permits, use targeted outputs or a specific show tech

If there is no time, use tac-pac and copy off

Much quicker than transmitting to terminal

Zips entire output to file in volatile:

Copy file off of switch for analysis

Or not…

N5k-1# tac-pac

N5k-1# dir volatile:

180242 Jan 28 4:37:26 2011 show_tech_out.gz

Which show tech?

As of 5.0(3), there are 68 options

N5k-1# show tech-support ?

aaa Display aaa information

aclmgr ACL commands

adjmgr Display Adjmgr information

arp Display ARP information

ascii-cfg Show ascii-cfg information for technical support personnel

assoc_mgr Gather detailed information for assoc_mgr troubleshooting

bcm-usd Gather detailed information for BCM USD troubleshooting

bootvar Gather detailed information for bootvar troubleshooting

brief Display the switch summary

btcm Gather detailed information for BTCM component

callhome Callhome troubleshooting information

cdp Gather information for CDP trouble shooting

...

session-mgr Gather information for troubleshooting session manager

snmp Gather info related to snmp

sockets Display sockets status and configuration

spm Service Policy Manager

stp Gather detailed information for STP troubleshooting

sysmgr Gather detailed information for sysmgr troubleshooting

time-optimized Gather tech-support faster, requires more memory & disk space

track Show track tech-support information

vdc Gather detailed information for VDC troubleshooting

vpc Gather detailed information for VPC troubleshooting

vtp Gather detailed information for vtp troubleshooting

xml Gather information for xml trouble shooting

Log your output: Redirect and Append

N5k-1# show clock > bootflash:debug-file.txt

N5k-1# show mac address-table >> bootflash:debug-file.txt

N5k-1# show running-config | count >> bootflash:debug-file.txt

N5k-1# show file bootflash:debug-file.txt

Mon Apr 4 02:39:41 UTC 2011 <==== output from show clock

Legend: <==== output from show mac address-table

* - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC

age - seconds since last seen,+ - primary entry using vPC Peer-Link

VLAN MAC Address Type age Secure NTFY Ports

---------+-----------------+--------+---------+------+---+-----------

+ 99 0021.5ad8.c424 dynamic 0 F F Po500

* 1 0021.5ad8.c424 dynamic 250 F F Eth101/1/2

845 <==== output from show running-config | count


Logging

show logging logfile

Basis for tracing events chronologically

Try using start-time or last

show accounting log

Basis for tracing configuration changes

terminal log-all to also log show commands

All commands end with (SUCCESS) or (FAILURE)

Often overlooked, but very important

N5k-1# show logging logfile start-time 2011 Mar 9 20:00:00

2011 Mar 9 20:17:18 esc-n5548-1 %ETHPORT-5-IF_DOWN_NONE: Interface Ethernet1/1 is

down (None)

2011 Mar 9 20:17:18 esc-n5548-1 %ETHPORT-5-IF_DOWN_NONE: Interface Ethernet1/3 is

down (None)
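Logfile lines like the ones above have a regular shape (timestamp, hostname, %FACILITY-SEVERITY-MNEMONIC, text), so a copied-off log can be filtered offline in the same spirit as start-time. The following is a minimal sketch only: the regular expression is an assumption fitted to the two sample lines above (wrapped lines rejoined), and the function names are hypothetical, not part of NX-OS.

```python
import re
from datetime import datetime

# Sample lines copied from the output above, with wrapped lines rejoined.
LOG = """\
2011 Mar 9 20:17:18 esc-n5548-1 %ETHPORT-5-IF_DOWN_NONE: Interface Ethernet1/1 is down (None)
2011 Mar 9 20:17:18 esc-n5548-1 %ETHPORT-5-IF_DOWN_NONE: Interface Ethernet1/3 is down (None)
"""

# Assumed layout: timestamp, hostname, %FACILITY-SEVERITY-MNEMONIC: text
LINE_RE = re.compile(
    r"^(\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) (\S+) "
    r"%(\S+?)-(\d)-([A-Z0-9_]+): (.*)$"
)


def parse_line(line):
    """Return a dict for one syslog line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp, host, facility, severity, mnemonic, text = match.groups()
    # Collapse repeated spaces so single-digit days parse cleanly.
    when = datetime.strptime(" ".join(stamp.split()), "%Y %b %d %H:%M:%S")
    return {"when": when, "host": host, "facility": facility,
            "severity": int(severity), "mnemonic": mnemonic, "text": text}


def events_since(log_text, start):
    """Keep only parsed events at or after `start` (similar to start-time)."""
    events = (parse_line(line) for line in log_text.splitlines())
    return [e for e in events if e is not None and e["when"] >= start]


for event in events_since(LOG, datetime(2011, 3, 9, 20, 0, 0)):
    print(event["when"], event["mnemonic"], event["text"])
```

Lines that do not fit the assumed layout (for example the "%$ VDC-1 %$" form seen in NVRAM logs) are skipped rather than guessed at.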

N5k-1# show logging last ?

<1-9999> Enter number of lines to display

Other System Logs: show logging nvram

Persistent logging survives reloads – helpful for crash or reload issues.

esc-n5020-1# show logging nvram

2011 Jan 26 14:58:10 esc-n5020-1 %$ VDC-1 %$ %PFMA-2-FEX_STATUS: Fex 124 is online

2011 Jan 28 02:47:38 esc-n5020-1 %$ VDC-1 %$ %PFMA-2-PFM_SYSTEM_RESET: Manual system restart from Command Line Interface

2011 Jan 28 02:47:38 esc-n5020-1 %$ VDC-1 %$ %KERN-0-SYSTEM_MSG: Shutdown Ports.. - kernel

2011 Jan 28 02:47:38 esc-n5020-1 %$ VDC-1 %$ %KERN-0-SYSTEM_MSG: writing reset reason 9, - kernel

2011 Jan 28 02:47:40 esc-n5020-1 %$ VDC-1 %$ %NOHMS-2-NOHMS_ENV_FEX_OFFLINE: FEX-101 Off-line (Serial Number JAF132XXXXX)

2011 Jan 28 02:47:40 esc-n5020-1 %$ VDC-1 %$ %PFMA-2-FEX_STATUS: Fex 101 is offline

2011 Jan 28 02:47:40 esc-n5020-1 %$ VDC-1 %$ %NOHMS-2-NOHMS_ENV_FEX_OFFLINE: FEX-124 Off-line (Serial Number JAF140XXXXX)

2011 Jan 28 02:47:40 esc-n5020-1 %$ VDC-1 %$ %PFMA-2-FEX_STATUS: Fex 124 is offline

2011 Jan 28 02:47:43 esc-n5020-1 %$ VDC-1 %$ %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL: In domain 500, VPC peer keep-alive receive has failed


When to call TAC

You think you have found a bug, but a quick search of defects or release notes on cisco.com may be faster

Most efficient if you have the following:

A description of the problem observed, with evidence / clues, along with time and scope

A current network diagram

All parties involved in the problem

show tech is not necessary, but if you must make drastic changes such as reloading or replacing hardware, grab this first

Any targeted outputs, especially around the time of the event in question

Troubleshooting Nexus 5000 / 2000

Problem Isolation

Platform Overview

NX-OS Operation

FSM

MTS

Crashes

Nexus 5000

Nexus 2000

Platform Overview and troubleshooting

Redundancy operation and troubleshooting

NX-OS Operation Tips

Tab auto-complete is supported within the current context, and commands from higher levels still execute if available.

Filesystems dynamically auto-complete

N5k-3(config-if)# switch?

switchport Configure switchport parameters <=== matching in config-if mode

N5k-3(config-if)# switchn?

switchname Configure system's host name <=== matching in config mode

N5k-3(config)# show file bootflash:s?

bootflash:stp.log.1

N5k-3(config)# install all system bootflash:n5<tab>

bootflash:n5000-uk9.5.0.3.N1.1.bin

bootflash:n5000-uk9.5.0.2.N2.1.bin

NX-OS Operation Tips

CLI list and grep

ctrl-c terminates output

N5k-3# show cli list | grep switchport

show system default switchport san

show interface switchport

show interface <if-mr> switchport

N5k-3# show tech-support

---- show tech-support ----

ctrl-c

N5k-3#

NX-OS File Structure

Mounts can fill up. Watch /var/tmp: it is cleared only by a reload or with TAC assistance.

A full /var/tmp can cause upgrade errors and unexpected log messages

N5k-1# show system internal flash

Mount-on 1K-blocks Used Available Use% Filesystem

/ 204800 111460 93340 55 /dev/root

/proc 0 0 0 0 proc

/sys 0 0 0 0 none

/isan 1536000 453760 1082240 30 none

/var/tmp 131072 108 130964 1 none

/var/sysmgr 512000 4700 507300 1 none

/var/sysmgr/ftp 204800 48604 156196 24 none

/var/sysmgr/ftp/cores 20480 0 20480 0 none

/callhome 32768 0 32768 0 none

/dev/shm 262144 95936 166208 37 none

/volatile 61440 0 61440 0 none

/debug 2048 4 2044 1 none

/dev/mqueue 0 0 0 0 none

/mnt/cfg/0 39257 4332 32898 12 /dev/sda5

/mnt/cfg/1 37242 4332 30987 13 /dev/sda6

/var/sysmgr/startup-cfg 102400 3112 99288 4 none

/dev/pts 0 0 0 0 devpts

/mnt/plog 56192 1784 54408 4 /dev/mtdblock2

/mnt/pss 39273 6058 31187 17 /dev/sda4

/bootflash 859848 768664 47504 95 /dev/sda3
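When this output is captured to a file, nearly full mounts can be picked out mechanically. A minimal sketch, assuming the six-column layout shown above; the 90% threshold and the function name are illustrative choices, not values from the session.

```python
# A few rows copied from the `show system internal flash` output above.
SAMPLE = """\
Mount-on 1K-blocks Used Available Use% Filesystem
/ 204800 111460 93340 55 /dev/root
/var/tmp 131072 108 130964 1 none
/bootflash 859848 768664 47504 95 /dev/sda3
"""


def full_mounts(text, threshold=90):
    """Return (mount, use_percent) for mounts at or above `threshold`.

    The threshold is an assumed value chosen for illustration only.
    """
    hits = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have six columns and start with a mount path.
        if len(fields) != 6 or not fields[0].startswith("/"):
            continue
        mount, use_pct = fields[0], int(fields[4])
        if use_pct >= threshold:
            hits.append((mount, use_pct))
    return hits


print(full_mounts(SAMPLE))
```

With the rows above, only /bootflash (95% used) crosses the assumed threshold; /var/tmp is nearly empty in this capture.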

NX-OS File Structure

volatile: filesystem is virtual; use it as scratch space if needed

As the name implies, its contents will not survive a reload

log: filesystem is in root /

N5k-1# debug logfile CiscoLive_debugs

N5k-1# show debug

Output forwarded to file CiscoLive_debugs (size: 4194304 bytes)

Debug level is set to Minor(1)

N5k-1# dir log:

0 Apr 04 01:14:01 2011 CiscoLive_debugs

31 Mar 11 11:38:35 2011 dmesg

0 Mar 11 11:38:57 2011 libfipf.4365

79101 Apr 04 00:34:02 2011 messages

6670 Apr 04 00:06:01 2011 startupdebug

N5k-1# copy log:CiscoLive_debugs tftp:

Enter vrf: management

Enter hostname for the tftp server: 10.91.42.134

Trying to connect to tftp server......

Connection to Server Established.

|

TFTP put operation was successful

N5k-1# clear debug-logfile CiscoLive_debugs

-OR-

N5k-1# undebug all


NX-OS FSM

NX-OS records the finite state machine (FSM) for many important processes

Using this event-history of FSM states and triggers, debugging can be done after a problem has occurred.

Some common processes:

ethpc – Ethernet port client: responsible for talking to the MAC and PHY

ethpm – Ethernet port manager: responsible for translating between configuration and ethpc. For example, ethpc informs ethpm that the link is up, and ethpm then gives instructions on how the port is configured

port-channel – port-channeling process responsible for aggregating physical links into logical channels

lacp – Link Aggregation Control Protocol, from the 802.3ad standard for aggregating links

fwm – forwarding manager: responsible for programming hardware according to the software configuration

Important to compare timestamps and watch for inter-process communication.

NX-OS FSM

Sometimes it is enough to look at one process FSM, other times you are looking for related events.

Timestamps should line up when there is causality.

Example: A fex comes online after e1/3 is brought up

N5k-1# show logg

2005 Feb 2 13:16:49 esc-n5020-1 %ETHPORT-5-IF_UP: Interface Ethernet1/3 is up in mode Fex Fabric

2005 Feb 2 13:16:47 esc-n5020-1 %SYSMGR-FEX100-5-MODULE_ONLINE: System Manager has received notification of local module becoming online.

2005 Feb 2 13:16:47 esc-n5020-1 %SATCTRL-FEX100-2-SATCTRL: FEX-100 Module 1: Cold boot

NX-OS FSM

N5k-1# show platform software ethpc event-history interface e100/1/4

1) Event IF_PCFG_RSP, len: 8, at 243054 usecs after Wed Feb 2 13:16:54 2011

Sent port cfg message response to ethpm - Id: 0x2cc1819, Status: success

N5k-1# show port-channel internal event-history interface e100/1/4

>>>>FSM: <Ethernet100/1/4> has 1 logged transitions<<<<<

1) FSM:<Ethernet100/1/4> Transition at 447889 usecs after Wed Feb 2 13:16:54 2011

Previous state: [PCM_ETH_PORT_ST_INIT_DOWN]

Triggered event: [PCM_PORT_EV_IF_CREATE]

Next state: [FSM_ST_NO_CHANGE]

Curr state: [PCM_ETH_PORT_ST_INIT_DOWN]

A given fex host interface shows “port cfg” message

Indicates preparation to enable the interface

port-channel history shows an IF_CREATE event near this time

This is all related to a fex coming online, while e100/1/4 is configured as a port-channel member and is coming up

*e1/3 up at 13:16:49
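Event-history entries carry microsecond offsets ("N usecs after <date>"), so entries from different processes can be merged into one timeline to check causality. A minimal sketch, assuming that timestamp format as it appears in the two outputs above; the helper names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Matches "<N> usecs after <Day Mon D HH:MM:SS YYYY>" as seen above.
STAMP_RE = re.compile(
    r"(\d+) usecs after (\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4})"
)

# (process, event-history line) pairs taken from the outputs above.
EVENTS = [
    ("port-channel",
     "Transition at 447889 usecs after Wed Feb 2 13:16:54 2011"),
    ("ethpc",
     "Event IF_PCFG_RSP, len: 8, at 243054 usecs after Wed Feb 2 13:16:54 2011"),
]


def parse_stamp(text):
    """Convert an event-history timestamp into a datetime."""
    match = STAMP_RE.search(text)
    if match is None:
        raise ValueError("no event-history timestamp in: " + text)
    usecs, wall_clock = match.groups()
    base = datetime.strptime(" ".join(wall_clock.split()),
                             "%a %b %d %H:%M:%S %Y")
    return base + timedelta(microseconds=int(usecs))


def timeline(events):
    """Sort (process, line) pairs chronologically."""
    return sorted((parse_stamp(line), process) for process, line in events)


for when, process in timeline(EVENTS):
    print(when.isoformat(), process)
```

In this sample the ethpc response precedes the port-channel IF_CREATE transition by roughly 0.2 seconds, which is the kind of ordering to look for when relating events across processes.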


NX-OS MTS

NX-OS uses Message and Transaction Service (MTS) to communicate between processes.

When troubleshooting CPU issues, we can check MTS for a large queue of messages.

When troubleshooting a specific process, we may see specific MTS messages queued.

NX-OS MTS


Useful to check when troubleshooting

high CPU

unresponsive CLI / timeout

control-plane disruption

When troubleshooting a process, we may look for specific MTS messages queued.

MTS messages may be coming in too fast, or there could be a message stuck at the top of the queue

NX-OS MTS

The persistent (pers) queue is allowed to grow old

N5k-1# show system internal mts buffers details

Node/Sap/queue Age(ms) SrcNode SrcSAP DstNode DstSAP OPC MsgId MsgSize

sup/284/pers 2387380 0x101 1231 0x101 284 86017 1301448368 868

sup/284/pers 14398 0x101 1238 0x101 284 86017 1301470493 868

sup/284/pers 3028 0x101 1897 0x101 284 86017 1301473115 868

sup/284/pers 818 0x101 1328 0x101 284 86017 1301473633 868

sup/284/pers 577 0x101 1236 0x101 284 86017 1301473693 868

sup/284/pers 42 0x101 32562 0x101 284 86017 1301473831 868

N5k-1# sh system internal mts sup sap 284 description

TCPUDP process client MTS queue

N5k-1# sh system internal mts sup sap 1231 description

dcos-xinetd

N5k-1# sh system internal mts opcodes | grep 86017

86017 MTS_OPC_TCP:

The first entry comes from dcos-xinetd (internet services). It makes sense for it to be old, since it's a server that is always running (for Fabric Manager).

NX-OS MTS

recv queue should not grow old

SAP 0 is an invalid identifier and causes 300 messages to queue, and the count keeps growing.

Observed impact: various show commands time out, such as show log and show run

N5k-1# show system internal mts buffers details

Node/Sap/queue Age(ms) SrcNode SrcSAP DstNode DstSAP OPC MsgId MsgSize

sup/32/recv 319672424 0x101 25330 0x101 0 7662 1221952768 192

sup/32/recv 319669986 0x101 25336 0x101 32 188 1221953842 328

sup/32/recv 319609082 0x101 25344 0x101 0 7663 1221971222 2452

...

sup/32/recv 227324 0x101 32550 0x101 32 188 1301415915 328

sup/32/recv 165509 0x101 32560 0x101 0 7663 1301432732 2452

sup/32/recv 101893 0x101 32565 0x101 0 7662 1301448663 192
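Since a recv queue should not grow old while a pers queue may, a captured buffers listing can be screened for old recv entries. A minimal sketch, assuming the nine-column layout shown above; the 60-second age limit is an arbitrary illustration, not a Cisco-documented threshold, and the function name is hypothetical.

```python
# Rows copied from the two `show system internal mts buffers details`
# outputs in this section.
SAMPLE = """\
Node/Sap/queue Age(ms) SrcNode SrcSAP DstNode DstSAP OPC MsgId MsgSize
sup/284/pers 2387380 0x101 1231 0x101 284 86017 1301448368 868
sup/32/recv 319672424 0x101 25330 0x101 0 7662 1221952768 192
sup/32/recv 101893 0x101 32565 0x101 0 7662 1301448663 192
"""


def stale_recv_messages(text, max_age_ms=60_000):
    """Return recv-queue entries older than `max_age_ms` (assumed limit)."""
    stale = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a nine-column data row.
        if len(fields) != 9 or not fields[1].isdigit():
            continue
        node, sap, queue = fields[0].split("/")
        age_ms = int(fields[1])
        if queue == "recv" and age_ms > max_age_ms:
            stale.append({"sap": int(sap), "age_ms": age_ms,
                          "dst_sap": int(fields[5]), "opcode": int(fields[6])})
    return stale


for entry in stale_recv_messages(SAMPLE):
    print(entry)
```

Any SAP that shows up here can then be looked up with the sap description command to find the owning process; the old pers entry for SAP 284 is deliberately ignored.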

NX-OS MTS

MTS messages have been addressed to SAP 0 due to a bug.

Reload was needed to clear this scenario

N5k-1# sh system internal mts sup sap 0 description

Not implemented

N5k-1# sh system internal mts sup sap 32 description

Syslog Sup Node Cfg

N5k-1# show system internal sysmgr service name syslogd

Service "syslogd" ("syslogd", 75):

UUID = 0x21, PID = 3924, SAP = 32

State: SRV_STATE_HANDSHAKED (entered at time Sat May 15 05:01:20 2010). Restart count: 1

Time of last restart: Sat May 15 05:01:20 2010. The service never crashed since the last reboot.

Tag = N/A

Plugin ID: 0


NX-OS Crashes

NX-OS attempts to create a core file with information that helps in finding and fixing the problem

stack trace

memory contents

Some processes in NX-OS can be restarted in a stateful manner.

Nexus 5000 is a single-supervisor platform; critical processes require a system restart upon a crash.

A syslog message is sent just before crash and system restart

2010 Sep 10 16:19:27.411 N5k-1 %$ VDC-1 %$ %SYSMGR-2-SERVICE_CRASHED: Service "fwm" (PID 2723) hasn't caught signal 6 (core will be saved).

NX-OS Crashes

show process log

View status of all processes, including whether a core was created

N5k-1# show process log

Process PID Normal-exit Stack Core Log-create-time

--------------- ------ ----------- ----- ----- ---------------

eth_port_channel 2743 N Y N Wed Mar 17 17:20:57 2010

eth_port_channel 2761 N Y N Tue Aug 3 19:14:58 2010

fwm 2703 N Y N Fri Oct 8 19:24:12 2010

...

N5k-1# show process log pid 2703

======================================================

Service: fwm

Description: Forwarding manager Daemon

Started at Thu Oct 7 14:51:51 2010 (151707 us)

Stopped at Fri Oct 8 19:24:12 2010 (203577 us)

Uptime: 1 days 4 hours 32 minutes 21 seconds

Start type: SRV_OPTION_RESTART_STATELESS (23)

Death reason: SYSMGR_DEATH_REASON_FAILURE_SIGNAL (2)

...

NX-OS Crashes

When NX-OS system manager “sysmanager” resets the switch, a core file for the offending process is often generated.

Copy off core file for TAC analysis

N5k-1# show cores

Module-num Instance-num Process-name PID Core-create-time

---------- ------------ ------------ --- ----------------

1 1 fwm 2723 Sep 17 16:34

N5k-1# copy core://1/fwm/1/ ?

bootflash: Select destination filesystem

ftp: Select destination filesystem

scp: Select destination filesystem

sftp: Select destination filesystem

tftp: Select destination filesystem


NX-OS Crashes

Sometimes a core file does not exist

not enough room in the file system

kernel crashes

third-party processes; ntpd, telnetd, others...

OBFL is used to capture information related to hardware, bootup, and environmental conditions. Onboard failure logging is non-volatile.

show logging onboard obfl-logs

show logging onboard exception log

show logging onboard kernel-trace

obfl-logs – per module; tracks environmental logs, bootup-records, uptime at bootup, version at each boot, stack trace if applicable

exception log – crash/exception history and details

kernel-trace – display stack of last kernel exception

NX-OS Crashes

In addition to the core file, circumstantial evidence around the time of the crash is helpful:

Was there a configuration change?

Was there a physical topology change?

Can this be reproduced?

Was there a recent upgrade?

Are you using an uncommon configuration? – less likely to have been tested or seen by other customers

The more details pointing to a root cause, the more feasible it is to find the problem and provide a workaround and a fix.

Additional detail regarding NX-OS:

BRKARC-3471 Cisco NXOS Software - Architecture


Troubleshooting Nexus 5000 / 2000

Problem Isolation

Platform Overview and troubleshooting

NX-OS Operation

Nexus 5000

CRC errors

Ethanalyzer / CPU

Queuing and forwarding

SPAN

Spanning-tree

Nexus 2000

Redundancy operation and troubleshooting

Hardware overview

Any discussion of forwarding errors and troubleshooting usually involves drops

We have to know the basic hardware layout in order to know where to look for problems

The following hardware overview is a preview of BRKARC-3452 – Cisco Nexus 5000/5500 and 2000 Switch Architecture

Nexus 5000 Hardware Overview: Data Plane Elements

Nexus 5000 is a distributed forwarding architecture

Unified Port Controller (UPC) ASICs interconnected by a single-stage Unified Crossbar Fabric (UCF)

Unified Port Controllers provide distributed packet forwarding capabilities

All port-to-port traffic passes through the UCF (Fabric)

Four switch ports managed by each UPC

14 UPCs in Nexus 5020

7 UPCs in Nexus 5010

[Figure: Unified Port Controllers, each attached to four SFP ports, interconnected by the Unified Crossbar Fabric]

Nexus 5500 Hardware Overview: Data and Control Plane Elements

[Figure: Nexus 5500 block diagram. Gen 2 UPCs attach to the Gen 2 Unified Crossbar Fabric (links labeled 10 Gig and 12 Gig). The control plane shows an Intel Jasper Forest CPU with DDR3 DRAM, a south bridge, a PEX 8525 4-port PCIe switch (PCIe x4 and x8 links), PCIe dual-Gig NICs (ports 0 and 1), serial flash, memory, NVRAM, and the Mgmt 0, Console, L1, and L2 ports. An expansion module is also shown.]

Nexus 5000/5500 Hardware Overview: Data Plane Elements - Unified Crossbar Fabric

Nexus 5000 (Gen-1)

58-port packet-based crossbar and scheduler

Three unicast and one multicast crosspoint per egress port

Nexus 5500 (Gen-2)

100-port packet-based crossbar and new schedulers

4 crosspoints per egress port, dynamically configurable between multicast and unicast traffic

Central tightly coupled scheduler

Request, propose, accept, grant, and acknowledge semantics

Packet-enhanced iSLIP scheduler

Distinct unicast and multicast schedulers (see slides later for differences in Gen-1 vs. Gen-2 multicast schedulers)

Eight classes of service within the Fabric

[Figure: Unified Crossbar Fabric with a unicast iSLIP scheduler and a multicast scheduler]

Nexus 5000 Hardware Overview: Unified Port Controller

Each UPC supports four ports and contains:

Multimode Media Access Controllers (MAC)

Support 1/10 G Ethernet and 1/2/4 G Fibre Channel

1G is available on first 8 ports of the 5010 and first 16 ports of the 5020

(2/4/8 G Fibre Channel MAC is located on the Expansion Module)

Packet buffering and queuing

480 KB of buffering per port

Forwarding controller

Ethernet and Fibre Channel Forwarding and Policy

[Figure: Unified Port Controller containing four "MMAC + Buffer + Forwarding" blocks, one per port]

Nexus 5500 Hardware Overview: Data Plane Elements - Unified Port Controller (Gen 2)

Each UPC supports eight ports and contains:

Multimode Media Access Controllers (MAC)

Support 1/10 G Ethernet and 1/2/4/8 G Fibre Channel

All MAC/PHY functions supported on the UPC (5548UP and 5596UP)

Packet buffering and queuing

640 KB of buffering per port

Forwarding controller

Ethernet (Layer 2 and FabricPath) and Fibre Channel Forwarding and Policy (L2/L3/L4 + all FC zoning)

[Figure: Unified Port Controller 2 containing eight "MMAC + Buffer + Forwarding" blocks, one per port]

Nexus 5000/5500 Hardware Overview: Control Plane Elements

[Figure: Unified Port Controller connected to the CPU complex (CPU, south bridge, NICs) through eth3 and eth4; mgmt0 attaches through a separate NIC]

In-band traffic is identified by the UPC and punted to the CPU via two dedicated UPC interfaces, 5/0 and 5/1, which are in turn connected to the eth3 and eth4 interfaces in the CPU complex

Eth3 handles Rx and Tx of low-priority control packets

IGMP, CDP, TCP/UDP/IP/ARP (for management purposes only)

Eth4 handles Rx and Tx of high-priority control packets

STP, LACP, DCBX, FC and FCoE control frames (FC packets come to the switch CPU as FCoE packets)

There is a built-in control-plane policer to limit the amount of traffic punted to the CPU

Nexus 5000/5500 Hardware Overview: Control Plane Elements

[Figure: CPU (Intel LV Xeon, 1.66 GHz), south bridge, and NIC]

CPU queuing structure provides strict protection and prioritization of inbound traffic

Each of the two in-band ports has 8 queues and traffic is scheduled for those queues based on control plane priority (traffic CoS value)

Prioritization of traffic between queues on each in-band interface

CLASS 7 is configured for strict priority scheduling (e.g. BPDU)

CLASS 6 is configured for DRR scheduling with 50% weight

Default classes (0 to 5) are configured for DRR scheduling with 10% weight

Additionally each of the two in-band interfaces has a priority service order from the CPU

Eth4 interface has high priority to service packets (no interrupt moderation)

Eth3 interface has low priority (interrupt moderation)
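The per-interface queue scheduling described above (class 7 strict priority, class 6 and classes 0 to 5 served by DRR) can be illustrated with a generic deficit round robin model. This is a textbook DRR sketch, not the actual hardware algorithm: the weights are treated as relative ratios (an assumption, since the listed percentages do not need to sum to 100 here), and QUANTUM_PER_WEIGHT and the function name are hypothetical.

```python
from collections import deque

STRICT_CLASS = 7                      # served before everything else
WEIGHTS = {6: 50, 5: 10, 4: 10, 3: 10, 2: 10, 1: 10, 0: 10}
QUANTUM_PER_WEIGHT = 30               # bytes per weight unit per round (assumed)


def schedule(queues):
    """Drain `queues` (class -> deque of packet sizes) and return the
    service order as a list of (class, size) tuples.

    Static model: no packets arrive while the queues are being drained.
    """
    order = []
    deficit = {cls: 0 for cls in WEIGHTS}
    while any(queues.get(cls) for cls in range(8)):
        # Strict priority: class 7 is emptied before any DRR class runs.
        strict = queues.get(STRICT_CLASS)
        while strict:
            order.append((STRICT_CLASS, strict.popleft()))
        # One DRR round over the weighted classes.
        for cls in sorted(WEIGHTS, reverse=True):
            queue = queues.get(cls)
            if not queue:
                deficit[cls] = 0      # idle classes do not bank credit
                continue
            deficit[cls] += WEIGHTS[cls] * QUANTUM_PER_WEIGHT
            while queue and queue[0] <= deficit[cls]:
                size = queue.popleft()
                deficit[cls] -= size
                order.append((cls, size))
            if not queue:
                deficit[cls] = 0
    return order


demo = {7: deque([64, 64]), 6: deque([1500] * 10), 0: deque([1500] * 2)}
print([cls for cls, _ in schedule(demo)])
```

With equal-sized packets, class 6 is served five packets for every one packet of a default class, matching the 50:10 weight ratio, while class 7 traffic such as BPDUs always goes first.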

[Figure: eth3 and eth4 in-band interface queues, with BPDU, ICMP, and CFS shown as example traffic]

Nexus 5000 Hardware Overview: Control Plane Elements

[Figure: CPU (Intel LV Xeon, 1.66 GHz), south bridge, and NIC connected to the Unified Port Controller through eth3 and eth4]

Monitoring of in-band traffic via the NX-OS built-in ethanalyzer (sniffer)

Eth3 is equivalent to 'inbound-low'

Eth4 is equivalent to 'inbound-hi'

N5k-2# ethanalyzer local sniff-interface ?

inbound-hi Inbound(high priority) interface

inbound-low Inbound(low priority) interface

mgmt Management interface

N5k-2# sh hardware internal cpu-mac inband counters

eth3 Link encap:Ethernet HWaddr 00:0D:EC:B2:0C:83

UP BROADCAST RUNNING PROMISC ALLMULTI MULTICAST MTU:2200 Metric:1

RX packets:3 errors:0 dropped:0 overruns:0 frame:0

TX packets:630 errors:0 dropped:0 overruns:0 carrier:0

collisions:0 txqueuelen:1000

RX bytes:252 (252.0 b) TX bytes:213773 (208.7 KiB)

Base address:0x6020 Memory:fa4a0000-fa4c0000

eth4 Link encap:Ethernet HWaddr 00:0D:EC:B2:0C:84

UP BROADCAST RUNNING PROMISC ALLMULTI MULTICAST MTU:2200 Metric:1

RX packets:85379 errors:0 dropped:0 overruns:0 frame:0

TX packets:92039 errors:0 dropped:0 overruns:0 carrier:0

collisions:0 txqueuelen:1000

RX bytes:33960760 (32.3 MiB) TX bytes:25825826 (24.6 MiB)

Base address:0x6000 Memory:fa440000-fa460000

CLI view of in-band control plane data

Nexus 5000 Hardware Overview: Packet Forwarding Overview

1. Ingress MAC - MAC decoding, MACSEC processing (not supported currently), synchronize bytes

2. Ingress Forwarding Logic - Parse frame and perform forwarding and filtering searches, perform learning, apply internal DCE header

3. Ingress Buffer (VoQ) - Queue frames, request service of fabric, dequeue frames to fabric and monitor queue usage to trigger congestion control

4. Cross Bar Fabric - Scheduler determines fairness of access to fabric and determines when frame is de-queued across the fabric

5. Egress Buffers - Landing spot for frames in flight when egress is paused

6. Egress Forwarding Logic - Parse, extract fields, learning and filtering searches, perform learning and finally convert to desired egress format

7. Egress MAC - MAC encoding, pack, synchronize bytes and transmit

[Figure: packet path through the ingress UPC (steps 1 to 3), the Unified Crossbar Fabric (step 4), and the egress UPC (steps 5 to 7)]

Nexus 5000 Forwarding: cut-through vs. store-and-forward

Store and forward switching is still utilized when the ingress data rate is slower than the egress data rate.

Cut-through switching is utilized to achieve low latency through the switch fabric.

Bits are serialized in from the ingress port until enough of the packet header has been received to perform a forwarding and policy lookup

Once a lookup decision has been made and the fabric has granted access to the egress port, bits are forwarded through the fabric

Egress port performs any header rewrite (e.g. CoS marking) and MAC begins serialization of bits out the egress port

A drop cannot happen on ingress due to any switching logic or even a CRC error. Only faulty hardware or connections can cause a drop on ingress.

Discards can occur on ingress due to queuing configuration and traffic patterns.

Nexus 5000 Forwarding: cut-through vs. store-and-forward

Source Interface      Destination Interface   Switching Mode
10 GigabitEthernet    10 GigabitEthernet      Cut-Through
10 GigabitEthernet    1 GigabitEthernet       Cut-Through
1 GigabitEthernet     1 GigabitEthernet       Store-and-Forward
1 GigabitEthernet     10 GigabitEthernet      Store-and-Forward
FCoE                  Fibre Channel           Cut-Through
Fibre Channel         FCoE                    Store-and-Forward
Fibre Channel         Fibre Channel           Store-and-Forward
FCoE                  FCoE                    Cut-Through

Simple way to remember: 10G ingress interfaces are always cut-through

Note: 10G interfaces can be configured for Ethernet or FCoE
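The table and the rule of thumb can be expressed as a lookup, which also makes it easy to confirm that the rule agrees with every row. A minimal sketch: the short interface labels ("10GE", "1GE", "FC", "FCoE") and function names are shorthand introduced here, and FCoE is treated as a 10G ingress per the note above.

```python
CUT_THROUGH = "Cut-Through"
STORE_AND_FORWARD = "Store-and-Forward"

# (source, destination) -> switching mode, transcribed from the table.
MODES = {
    ("10GE", "10GE"): CUT_THROUGH,
    ("10GE", "1GE"): CUT_THROUGH,
    ("1GE", "1GE"): STORE_AND_FORWARD,
    ("1GE", "10GE"): STORE_AND_FORWARD,
    ("FCoE", "FC"): CUT_THROUGH,
    ("FC", "FCoE"): STORE_AND_FORWARD,
    ("FC", "FC"): STORE_AND_FORWARD,
    ("FCoE", "FCoE"): CUT_THROUGH,
}


def switching_mode(source, destination):
    """Look up the switching mode for a source/destination pair."""
    return MODES[(source, destination)]


def rule_of_thumb(source):
    """10G ingress (Ethernet or FCoE) is cut-through; everything else
    in the table is store-and-forward."""
    return CUT_THROUGH if source in ("10GE", "FCoE") else STORE_AND_FORWARD


# The rule of thumb should reproduce every row of the table.
for (src, dst), mode in MODES.items():
    assert rule_of_thumb(src) == mode, (src, dst)

print(switching_mode("1GE", "10GE"))
```

The check shows that only the ingress side matters for the combinations listed; the egress interface type does not change the mode in this table.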


Cut-through mode and CRC errors: Received errors

Cut-through switching changes how we troubleshoot problems in the switch.

Ethernet CRC is at the end of the frame, so even a CRC error cannot cause a drop on a cut-through port.

We are already forwarding the frame by the time the ingress mac can read the CRC value.
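The point that the CRC verdict arrives only after the last byte can be shown with a small byte-stream model. This is an illustrative sketch, not the UPC implementation: it assumes the Ethernet FCS can be modeled with zlib's CRC-32 stored little-endian, and it assumes the forwarding decision is made after a 14-byte untagged Ethernet header; the real lookup may need more of the frame.

```python
import struct
import zlib


def append_fcs(frame: bytes) -> bytes:
    """Append a CRC-32 FCS (modeled with zlib.crc32, little-endian)."""
    return frame + struct.pack("<I", zlib.crc32(frame) & 0xFFFFFFFF)


def cut_through_receive(wire: bytes, header_len: int = 14):
    """Stream `wire` byte by byte.

    Returns (offset where forwarding starts, offset where the CRC verdict
    is known, whether the CRC is good).
    """
    crc = 0
    forward_started_at = None
    for offset in range(len(wire) - 4):
        crc = zlib.crc32(wire[offset:offset + 1], crc)   # running CRC
        if offset + 1 == header_len:
            forward_started_at = offset + 1               # lookup done
    crc_good = struct.pack("<I", crc & 0xFFFFFFFF) == wire[-4:]
    return forward_started_at, len(wire), crc_good


# 12 bytes of MAC addresses, an IPv4 EtherType, and a dummy payload.
good = append_fcs(bytes(12) + b"\x08\x00" + b"payload" * 10)
corrupted = bytearray(good)
corrupted[40] ^= 0xFF                                     # damage the payload

print(cut_through_receive(good))
print(cut_through_receive(bytes(corrupted)))
```

In both runs forwarding starts at byte 14, long before the verdict at the end of the frame, so the corrupted frame is already on its way out when the bad CRC is detected.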

[Figure: frame layout with Ethernet Header, IPv4 Header, IP Payload, and FCS. Parsing and the forward decision happen while the headers arrive; the corruption is only flagged as "CRC Bad" once the FCS is read.]

Cut-through mode and CRC errors: Received errors

The corrupted frame must be forwarded, but is accounted for as an output error.

N5k-1# show interface e1/1

...

TX

10157 unicast packets 105 multicast packets 52 broadcast packets

11314 output packets 5317822 bytes

0 jumbo packets

1000 output errors 0 collision 0 deferred 0 late collision

0 lost carrier 0 no carrier 0 babble 0 Tx pause

0 interface resets

Animation frames for printouts

[Figure: frame layout (Ethernet Header, IPv4 Header, IP Payload, FCS); parsing begins at the start of the frame, with corruption later in the frame]

A frame arrives to be parsed but is corrupted.

[Figure: the same frame; parsing has reached the headers and the forward decision is made]

Cut-through switching reads only the destination MAC address to make the forwarding decision.

Here, the decision to forward is made while still unaware of the corruption to follow.

[Figure: parsing has reached the FCS at the end of the frame; CRC Bad]

It is not until the FCS field in the Ethernet trailer that we can calculate the CRC value.

Cut-through mode and CRC “stomping”: Originated errors

In addition to receiving errored frames, the Nexus 5000 can generate a bad CRC for several reasons:

MTU violation

IP length error

Ethernet length error

when the EtherType/length field is ≤ 1500 (0x05DC), it is interpreted as a length

Invalid Ethernet preamble

Received and originated errors will count as TX output errors.

Only received errors will count as RX CRC errors.

You are more likely to see CRC errors in a network with a cut-through switch.

The errors will pass through all cut-through switches and finally drop at the first store-and-forward buffer.
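The Ethernet length check depends on how the two-byte type/length field is read. A minimal sketch of that classification, using the IEEE 802.3 boundaries (values up to 1500 are lengths, values of 1536 / 0x0600 and above are EtherTypes); the function name is illustrative:

```python
def classify_type_length(value: int) -> str:
    """Classify the 16-bit field that follows the source MAC address."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("field is 16 bits wide")
    if value <= 1500:          # 0x05DC and below: payload length in bytes
        return "length"
    if value >= 0x0600:        # 1536 and above: EtherType
        return "ethertype"
    return "undefined"         # 1501..1535 is not assigned either meaning


print(classify_type_length(0x0800))  # ethertype
print(classify_type_length(0x05DC))  # length
print(classify_type_length(100))     # length
```

When the field is read as a length and does not match the frame actually received, the switch has grounds to flag an Ethernet length error.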

Finding the source of CRC errors

CRC errors are introduced in 3 ways:

Bad physical connection

copper, fiber, transceiver, phy

“stomping” due to intentionally originated errors

Received bad CRC “stomped” from neighboring cut-through switch.

Start by finding any RX CRC counters.

If none, then this switch is responsible for originating the errors

Use interrupt counters to find the reason and port, if intentional

Log in to next switch upstream of CRC counters, check for RX CRC there.

Use the above logic to determine if this switch is originating any errors.

Finally, inspect optics/pluggables, fiber/cables and troubleshoot as a Layer 1 issue. Change cable and port to find where the problem follows.
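The per-switch decision in this procedure can be sketched as a small function. The `PortCounters` type and the result strings are hypothetical; the counter values in the usage lines are taken from the two scenarios in this section.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PortCounters:
    """Hypothetical holder for two counters read from 'show interface'."""
    rx_crc: int = 0
    tx_output_errors: int = 0


def triage(ports: Dict[str, PortCounters]) -> str:
    """Apply the slide's logic to one switch's counters."""
    rx_ports = [name for name, c in ports.items() if c.rx_crc > 0]
    tx_ports = [name for name, c in ports.items() if c.tx_output_errors > 0]
    if rx_ports:
        # Bad frames arrived from outside: go upstream, then treat as Layer 1.
        return "received on " + ", ".join(rx_ports)
    if tx_ports:
        # Output errors with no RX CRC: this switch stomped the frames itself.
        return "originated here, leaving " + ", ".join(tx_ports)
    return "clean"


# Scenario #1 on N5k-1: 1 CRC on e1/1 RX, 1 output error on e1/5 TX.
print(triage({"e1/1": PortCounters(rx_crc=1),
              "e1/5": PortCounters(tx_output_errors=1)}))
# Scenario #2 on N5k-1: no RX CRC anywhere, 1 output error on e1/7 TX.
print(triage({"e1/1": PortCounters(),
              "e1/7": PortCounters(tx_output_errors=1)}))
```

In the "originated here" case the next step is the interrupt counters, which give the reason and the port.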

Finding the source of CRC errors: Observations, scenario #1

[Topology diagram: N7k-1, N5k-1 and N5k-2; port labels e1/1, e1/3, e1/4, e1/5, e1/7, e1/11, e1/12; VLAN 7 and VLAN 8]

N5k-1# show interface e1/1

RX

20949142 unicast packets 1147746 multicast packets 6 broadcast packets

22096894 input packets 30452432662 bytes

18967009 jumbo packets 0 storm suppression packets

0 runts 0 giants 1 CRC 0 no buffer


N5k-1# show interface e1/5

TX

1266 unicast packets 1147746 multicast packets 6 broadcast packets

0 output packets 0 bytes

0 jumbo packets

1 output errors 0 collision 0 deferred 0 late collision


N5k-2# show interface e1/5

RX

1266 unicast packets 1147746 multicast packets 6 broadcast packets

0 input packets 0 bytes

0 jumbo packets 0 storm suppression packets

0 runts 0 giants 1 CRC 0 no buffer


N5k-2# show interface e1/3

TX

1266 unicast packets 1147746 multicast packets 6 broadcast packets

0 output packets 0 bytes

0 jumbo packets

1 output errors 0 collision 0 deferred 0 late collision

Finding the source of CRC errors: Scenario #1, physical issue

[Topology diagram: N7k-1, N5k-1 and N5k-2, with one link marked "bad fiber"]

N5k-1# show interface e1/1

RX

20949142 unicast packets 1147746 multicast packets 6 broadcast packets

22096894 input packets 30452432662 bytes

18967009 jumbo packets 0 storm suppression packets

0 runts 0 giants 1 CRC 0 no buffer

The frame enters the switch as a CRC error.

N5k-1# show hardware internal gatos all-ports | egrep name|1/1

name |log|gat|mac|flag|adm|opr|c:m:s:l|ipt|fab|xgat|xpt|if_index|diag

xgb1/1 |0 |7 |2 |b7 |en |up |1:2:2:f|2 |6 |7 |4 |1a000000|pass

Look up the internal ASIC port:

Front Panel   Internal

e1/1          7:2
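The lookup can be scripted by splitting the header and the row on the pipe character. This sketch assumes the `gat` column is the ASIC instance and the `ipt` column is the port on that ASIC (matching "gatos instance" and "gatos iport" elsewhere in this deck); in this sample the `mac` column holds the same value, so treat the column choice as an assumption.

```python
# Header and row copied from the 'show hardware internal gatos all-ports' output.
HEADER = "name |log|gat|mac|flag|adm|opr|c:m:s:l|ipt|fab|xgat|xpt|if_index|diag"
ROW = "xgb1/1 |0 |7 |2 |b7 |en |up |1:2:2:f|2 |6 |7 |4 |1a000000|pass"


def internal_port(header: str, row: str) -> str:
    """Return 'asic:port' for one row of the all-ports table."""
    names = [h.strip() for h in header.split("|")]
    values = [v.strip() for v in row.split("|")]
    fields = dict(zip(names, values))
    # Assumption: 'gat' = ASIC instance, 'ipt' = port number on that ASIC.
    return f"{fields['gat']}:{fields['ipt']}"


print(internal_port(HEADER, ROW))  # 7:2
```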


N5k-1# show hardware internal gatos asic 7 counters interrupt

Gatos 7 interrupt statistics:

Interrupt name |Count |ThresRch|ThresCnt|Ivls

-----------------------------------------------+--------+--------+--------+----

gat_fw2_INT_ig_pkt_err_cb_bm_eof_err |1 |0 |1 |0

gat_fw2_INT_ig_pkt_err_eth_crc_stomp |1 |0 |1 |0

gat_fw2_INT_ig_pkt_err_e802_3_len_err |1 |0 |1 |0

gat_mm0_INT_rlp_rx_pkt_crc_err |1 |0 |1 |0

gat_mm0_INT_rlp_rx_pkt_crc_stomped |1 |0 |1 |0

Front Panel   Internal

e1/1          7:2

Interrupt counters will increment on receipt of a bad CRC.

N5k-1# show interface e1/5

TX

1266 unicast packets 1147746 multicast packets 6 broadcast packets

0 output packets 0 bytes

0 jumbo packets

1 output errors 0 collision 0 deferred 0 late collision

Front Panel   Internal

e1/1          7:2

e1/5          7:1

10Gb/s interfaces will cut-through switch these bad frames and increment an output error at the egress port.

N5k-1# show hardware internal gatos asic 7 counters interrupt

Gatos 7 interrupt statistics:

Interrupt name |Count |ThresRch|ThresCnt|Ivls

-----------------------------------------------+--------+--------+--------+----

gat_fw1_INT_eg_pkt_err_cb_bm_eof_err |1 |0 |0 |0

gat_fw1_INT_eg_pkt_err_eth_crc_stomp |1 |0 |0 |0

gat_fw1_INT_eg_pkt_err_e802_3_len_err |1 |0 |0 |0

Front Panel   Internal

e1/1          7:2

e1/5          7:1

Interrupt counters increment upon transmit of the errored frame.
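When an ASIC has many non-zero interrupts, it can help to split them by direction. The sketch below assumes that `_ig_` in a counter name marks an ingress event and `_eg_` an egress event, which is consistent with the annotations on these slides but is not documented here. The sample rows are copied from the N5k-1 outputs.

```python
import re

# Rows copied from 'show hardware internal gatos asic 7 counters interrupt'.
SAMPLE = """\
gat_fw2_INT_ig_pkt_err_eth_crc_stomp |1 |0 |1 |0
gat_mm0_INT_rlp_rx_pkt_crc_err |1 |0 |1 |0
gat_fw1_INT_eg_pkt_err_eth_crc_stomp |1 |0 |0 |0
"""

ROW = re.compile(r"^(gat_\w+)\s*\|\s*(\d+)")


def nonzero_by_direction(text: str) -> dict:
    """Group non-zero interrupt counts into ingress, egress and other."""
    result = {"ingress": {}, "egress": {}, "other": {}}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue
        name, count = match.group(1), int(match.group(2))
        if count == 0:
            continue
        if "_ig_" in name:      # assumed to mean ingress
            key = "ingress"
        elif "_eg_" in name:    # assumed to mean egress
            key = "egress"
        else:
            key = "other"
        result[key][name] = count
    return result


for direction, counters in nonzero_by_direction(SAMPLE).items():
    print(direction, sorted(counters))
```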


N5k-2# show interface e1/5

RX

1266 unicast packets 1147746 multicast packets 6 broadcast packets

0 input packets 0 bytes

0 jumbo packets 0 storm suppression packets

0 runts 0 giants 1 CRC 0 no buffer

Front Panel   Internal

e1/1          7:2

e1/5          7:1

Another cut-through port receives the bad frame.

N5k-2# show hardware internal gatos asic 7 counters interrupt

Gatos 7 interrupt statistics:

Interrupt name |Count |ThresRch|ThresCnt|Ivls

-----------------------------------------------+--------+--------+--------+----

gat_fw1_INT_ig_pkt_err_cb_bm_eof_err |1 |0 |1 |0

gat_fw1_INT_ig_pkt_err_eth_crc_stomp |1 |0 |1 |0

gat_fw1_INT_ig_pkt_err_e802_3_len_err |1 |0 |1 |0

gat_mm0_INT_rlp_rx_pkt_crc_err |1 |0 |1 |0

gat_mm0_INT_rlp_rx_pkt_crc_stomped |1 |0 |1 |0

Front Panel   Internal

e1/1          7:2

e1/5          7:1

Interrupt counters will increment on receipt of a bad CRC.

N5k-2# show interface e1/3

TX

1266 unicast packets 1147746 multicast packets 6 broadcast packets

0 output packets 0 bytes

0 jumbo packets

1 output errors 0 collision 0 deferred 0 late collision

Front Panel   Internal

e1/1          7:2

e1/5          7:1

N5k-2# show hardware internal gatos asic 0 counters interrupt

Gatos 0 interrupt statistics:

Interrupt name |Count |ThresRch|ThresCnt|Ivls

-----------------------------------------------+--------+--------+--------+----

gat_fw2_INT_eg_pkt_err_cb_bm_eof_err |1 |0 |0 |0

gat_fw2_INT_eg_pkt_err_eth_crc_stomp |1 |0 |0 |0

gat_fw2_INT_eg_pkt_err_e802_3_len_err |1 |0 |0 |0

Front Panel   Internal

e1/1          7:2

e1/5          7:1

e1/3          0:2

Interrupt counters increment upon transmit of the errored frame.

The host will drop the bad frame in its Rx buffer.

Finding the source of CRC errors: Observations, scenario #2

[Topology diagram: N7k-1, N5k-1 and N5k-2; port labels e1/1, e1/3, e1/4, e1/5, e1/7, e1/11, e1/12; VLAN 7 and VLAN 8]

N5k-1# show interface e1/1

RX

20995002 unicast packets 1150262 multicast packets 6 broadcast packets

22145270 input packets 30519119563 bytes

1 jumbo packets 0 storm suppression packets


N5k-1# show interface e1/7

TX

1266 unicast packets 1147746 multicast packets 6 broadcast packets

0 output packets 0 bytes

0 jumbo packets

1 output errors 0 collision 0 deferred 0 late collision


N7k-1# show interface e1/11

RX

4 unicast packets 0 multicast packets 0 broadcast packets

4 input packets 5672 bytes

0 jumbo packets 0 storm suppression packets

0 runts 0 giants 1 CRC 0 no buffer

1 input error 0 short frame 0 overrun 0 underrun 0 ignored

Finding the source of CRC errors: Scenario #2, MTU exceeded

Front Panel   Internal

e1/1          7:2

A 4000-byte frame is transmitted.

N5k-1# show interface e1/1

RX

20995002 unicast packets 1150262 multicast packets 6 broadcast packets

22145270 input packets 30519119563 bytes

1 jumbo packets 0 storm suppression packets

Jumbo packets increment whenever the Ethernet payload is greater than 1500 bytes. This is not always an error!

N5k-1# show hardware internal gatos port e1/1 counters rx

RX_PKT_SIZE_IS_1519_TO_2047 | 0

RX_PKT_SIZE_IS_2048_TO_4095 | 1

RX_PKT_SIZE_IS_4096_TO_8191 | 0

RX_PKT_SIZE_IS_8192_TO_9216 | 0

RX_PKT_SIZE_GT_9216 | 0

Hardware counters keep track of size ranges.

N5k-1# show hardware internal gatos asic 7 counters interrupt

Gatos 7 interrupt statistics:

Interrupt name |Count |ThresRch|ThresCnt|Ivls

-----------------------------------------------+--------+--------+--------+----

gat_bm_port2_INT_err_ig_mtu_vio |1 | | |

In this case, the class-based MTU is set to the default of 1500 in class-default. The 4000-byte frame exceeds it, so we enter an error condition.

MTU is configured per class, under network-qos.

This allows for a separate FCoE MTU and Ethernet MTU.

N5k-1# show policy-map type network-qos

Type network-qos policy-maps

===============================

policy-map type network-qos default-nq-policy

class type network-qos class-fcoe

pause no-drop

mtu 2158

class type network-qos class-default

mtu 1500
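Because MTU is per class, a quick check of a frame size against each class is a useful sanity test. The sketch below parses the policy-map text shown above and compares the 4000-byte frame from this scenario against each class MTU. The comparison is deliberately simplified: it ignores header overhead and any hardware MTU padding.

```python
import re

# Text copied from 'show policy-map type network-qos' in this scenario.
POLICY = """\
policy-map type network-qos default-nq-policy
  class type network-qos class-fcoe
    pause no-drop
    mtu 2158
  class type network-qos class-default
    mtu 1500
"""


def class_mtus(text: str) -> dict:
    """Map each network-qos class name to its configured MTU."""
    mtus = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        found_class = re.match(r"class type network-qos (\S+)", line)
        if found_class:
            current = found_class.group(1)
            continue
        found_mtu = re.match(r"mtu (\d+)", line)
        if found_mtu and current is not None:
            mtus[current] = int(found_mtu.group(1))
    return mtus


FRAME_SIZE = 4000  # the frame size used in this scenario
for name, mtu in class_mtus(POLICY).items():
    verdict = "MTU violation" if FRAME_SIZE > mtu else "ok"
    print(f"{name}: mtu {mtu}, {FRAME_SIZE}-byte frame -> {verdict}")
```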


N5k-1# show hardware internal gatos asic 0 counters interrupt

Gatos 0 interrupt statistics:

Interrupt name |Count |ThresRch|ThresCnt|Ivls

-----------------------------------------------+--------+--------+--------+----

gat_fw1_INT_eg_pkt_err_cb_bm_eof_err |1 |0 |1 |0

gat_fw1_INT_eg_pkt_err_eth_crc_stomp |1 |0 |1 |0

gat_fw1_INT_eg_pkt_err_ip_pyld_len_err |1 |0 |1 |0

gat_mm1_INT_rlp_tx_pkt_crc_err |1 |0 |1 |0

Front Panel   Internal

e1/1          7:2

e1/7          0:1

Leaving the egress interface, the CRC has been stomped and other interrupts have fired.

Note that the egress interface aggregates frames from various source interfaces, so adding up counters can be tricky.

N7k-1# show interface e1/11

RX

4 unicast packets 0 multicast packets 0 broadcast packets

4 input packets 5672 bytes

0 jumbo packets 0 storm suppression packets

0 runts 0 giants 1 CRC 0 no buffer

1 input error 0 short frame 0 overrun 0 underrun 0 ignored

The store-and-forward card on the Nexus 7000 parses the entire frame and finds a bad CRC value. A drop occurs on N7k-1, so the frame never makes it to N5k-2.

Troubleshooting Nexus 5000 / 2000

Problem Isolation

Platform Overview and troubleshooting

NX-OS Operation

Crashes

Nexus 5000

CRC errors

Ethanalyzer / CPU

Queuing and forwarding

Spanning-tree

Nexus 2000

Redundancy operation and troubleshooting


Hardware accelerated switches do not rely on the CPU for frame forwarding and processing.

*Some L3 paths do require CPU path if hw entries are missing – “punt”

CPU is critical for control-plane activities:

LACP – without keeping up with LACPDUs, 802.3ad portchannels would go down

STP and STP Bridge Assurance – A downstream switch that stops receiving BPDUs will move a blocked port to forwarding. If the CPU cannot keep up with sending BPDUs, loops can form. Bridge Assurance helps here: instead of going to forwarding, a BA-enabled switch blocks the port (it is placed in a BA-inconsistent state).

vPC programming – mac addresses learned on vPC interfaces must be installed on both switches in order to prevent flooding as well as deliver frames to their destination

Redundancy – in the event of a switch outage, the CPU needs to reprogram state information for all processes, configure mac addresses on interfaces in their respective VLANs.

configuration and management – An unresponsive switch is not useful as a troubleshooting tool, and you are blind without a reliable interface with the network

NX-OS: High CPU

N5k-1# show process cpu sort | exclude 0.0

PID Runtime(ms) Invoked uSecs 1Sec Process

----- ----------- -------- ----- ------ -----------

4120 1137 10931494 0 17.5% pfma

4204 1477 84831831 0 1.9% gatosusd

N5k-1# show system resources

Load average: 1 minute: 0.63 5 minutes: 1.35 15 minutes: 1.41

Processes : 281 total, 1 running

CPU states : 1.0% user, 8.9% kernel, 90.1% idle

Memory usage: 2073408K total, 1412108K used, 661300K free

Hopefully you have a baseline to compare the current CPU trends with a known nominal state

Always gather these 3 commands, repeated at frequent intervals:

show process cpu sort | exclude 0.0

show system resources

show process cpu history


N5k-1# show process cpu history

1 1 1 1 1 1 11

789509607796857706878950694778698849688895079850886958858500

753105000482598603786430941227125016911055026100692801248500

100 ** * * * * * * * * * * **

90 ** ** * * * * * ** * * * ** * * * * *** * * **

80 *** ** * * * *** **** * * * *** * **** * ** *** * ** * **

70 *** ** **** * *** **** *** *** *** ****** **** *** * ** * **

60 *** ****************** *** ******* *********** ***** ** ****

50 ************************** ******* *************************

40 ************************************************************

30 ***********************************************************#

20 *##**#*******#***********#*#*#**#**##*###*###**##****#****##

10 ############################################################

0....5....1....1....2....2....3....3....4....4....5....5....

0 5 0 5 0 5 0 5 0 5

CPU% per minute (last 60 minutes)

* = maximum CPU% # = average CPU%

Note the difference between * (maximum CPU) and # (average CPU).

This is a completely normal-looking graph. Focus on extended periods of high average CPU.

NX-OS: Ethanalyzer

Displaying and capturing control-plane frames with the built-in Ethanalyzer utility

Based on the Wireshark project, with an NX-OS command front end

Can display like tshark, or capture to a .pcap file to analyze elsewhere

Can be used on mgmt0 as well as eth3 or eth4, the low- and high-priority CPU queues

[Diagram labels: CPU, South Bridge, NIC eth3 (low), NIC eth4, UPC, MGMT0 (eth0); example control-plane traffic: ICMP, CFS, BPDU, CDP, LACPDU, ARP, DCBX]

N5k-1# ethanalyzer local interface mgmt write bootflash:managementCAP

Program exited with status 0.

N5k-1# dir bootflash: | inc management

1224 Apr 04 16:56:33 2011 managementCAP

N5k-1# ethanalyzer local read bootflash:managementCAP

2011-04-04 16:56:33.763150 172.18.118.165 -> 64.102.131.28 SSH Encrypted response packet len=68

2011-04-04 16:56:33.763527 172.18.118.165 -> 64.102.131.28 SSH Encrypted response packet len=52

2011-04-04 16:56:33.763968 172.18.118.165 -> 64.102.131.28 SSH Encrypted response packet len=52

2011-04-04 16:56:33.764391 172.18.118.165 -> 64.102.131.28 SSH Encrypted response packet len=52

2011-04-04 16:56:33.764811 172.18.118.165 -> 64.102.131.28 SSH Encrypted response packet len=52

2011-04-04 16:56:33.765230 172.18.118.165 -> 64.102.131.28 SSH Encrypted response packet len=52

2011-04-04 16:56:33.765649 172.18.118.165 -> 64.102.131.28 SSH Encrypted response packet len=52

2011-04-04 16:56:33.765928 64.102.131.28 -> 172.18.118.165 TCP 53538 > ssh [ACK] Seq=0 Ack=68 Win=65535 Len=0 TSV=597611264 TSER=19040186

2011-04-04 16:56:33.765930 64.102.131.28 -> 172.18.118.165 TCP 53538 > ssh [ACK] Seq=0 Ack=120 Win=65535 Len=0 TSV=597611264 TSER=19040186

2011-04-04 16:56:33.765932 64.102.131.28 -> 172.18.118.165 TCP 53538 > ssh [ACK] Seq=0 Ack=172 Win=65535 Len=0 TSV=597611264 TSER=19040186

NX-OS: Ethanalyzer example

Capture mgmt0 traffic and save it to a file on bootflash

View capture files

Copy off for further analysis

N5k-1# ethanalyzer local interface inbound-hi capture-filter "not ip"

Capturing on eth4

wireshark-broadcom-rcpu-dissector: ethertype=0xde08, devicetype=0x0

2005-02-11 20:36:50.251412 00:0d:ec:d6:02:e4 -> 01:80:c2:00:00:00 STP RST. Root = 8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x809d

2005-02-11 20:36:50.252075 00:0d:ec:d6:02:e0 -> 01:80:c2:00:00:00 STP RST. Root = 8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x8099

2005-02-11 20:36:50.252204 00:0d:ec:d6:02:e1 -> 01:80:c2:00:00:00 STP RST. Root = 8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x809a

2005-02-11 20:36:50.252317 00:0d:ec:d6:02:e9 -> 01:80:c2:00:00:00 STP Conf. Root = 8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x80a2

2005-02-11 20:36:50.252426 00:0d:ec:d6:02:e8 -> 01:80:c2:00:00:00 STP RST. Root = 8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x80a1

2005-02-11 20:36:50.391691 00:0d:ec:d3:b5:f4 -> 01:80:c2:00:00:0e LLC U, func=UI; SNAP, OUI 0x00000C (Cisco), PID 0x0134

2005-02-11 20:36:50.803069 00:12:43:01:b0:98 -> 01:80:c2:00:00:00 STP Conf. Root = 8291/00:d0:03:62:4c:00 Cost = 0 Port = 0x8081

2005-02-11 20:36:52.251349 00:0d:ec:d6:02:e4 -> 01:80:c2:00:00:00 STP RST. Root = 8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x809d

2005-02-11 20:36:52.251366 00:0d:ec:d6:02:e0 -> 01:80:c2:00:00:00 STP RST. Root = 8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x8099

2005-02-11 20:36:52.251373 00:0d:ec:d6:02:e1 -> 01:80:c2:00:00:00 STP RST. Root = 8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x809a

NX-OS: Ethanalyzer example

Capture high-priority traffic with a capture-filter and display it on the terminal

N5k-1# show system resources

Load average: 1 minute: 0.95 5 minutes: 1.54 15 minutes: 1.46

Processes : 281 total, 4 running

CPU states : 26.7% user, 26.7% kernel, 46.5% idle

Memory usage: 2073408K total, 1412172K used, 661236K free

N5k-1# show process cpu sort | exclude 0.0

PID Runtime(ms) Invoked uSecs 1Sec Process

----- ----------- -------- ----- ------ -----------

4230 398 5011881 0 22.0% snmpd

4204 1467 84869127 0 20.2% gatosusd

4226 433 5601856 0 5.5% statsclient

4264 1380 391510 3 3.7% ethpm

4302 254 103 2468 1.8% netstack

NX-OS: Ethanalyzer and CPU

Use Ethanalyzer to help identify external causes of high CPU utilization
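Comparing a sample against the baseline is easier with the numbers pulled out of the text. This sketch parses the "CPU states" line of `show system resources`; both lines are copied from the outputs in this deck, and the helper names are illustrative.

```python
import re

# 'CPU states' lines copied from the two 'show system resources' outputs.
BASELINE = "CPU states : 1.0% user, 8.9% kernel, 90.1% idle"
CURRENT = "CPU states : 26.7% user, 26.7% kernel, 46.5% idle"


def cpu_states(line: str) -> dict:
    """Return {'user': ..., 'kernel': ..., 'idle': ...} as floats."""
    return {name: float(pct) for pct, name in re.findall(r"([\d.]+)% (\w+)", line)}


def busy(line: str) -> float:
    """Non-idle CPU percentage, rounded to one decimal place."""
    return round(100.0 - cpu_states(line)["idle"], 1)


print("baseline busy:", busy(BASELINE))
print("current busy: ", busy(CURRENT))
```

Run against periodic captures, the same two helpers give a simple trend to set next to `show process cpu history`.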


esc-n5020-1# show process cpu history

211111111131111111111121111111131111111114111111831112111111

002244240786947901001225201001390000110010000902910013010023

100

90 #

80 #

70 #

60 #

50 #

40 # # # #

30 # # # ##

20 # #### ## ## # # # ## #

10 ############################################################

0....5....1....1....2....2....3....3....4....4....5....5....

0 5 0 5 0 5 0 5 0 5

CPU% per second (last 60 seconds)

# = average CPU%

Baseline per second


N5k-1# show process cpu history

1 1

754669098990899966777977656766876775178734455655456466545645

006186077990796258300801881187120477641015900150830621684070

100 ### ### ## #

90 ########### #

80 ########### # # # #

70 # ##################### ##### ## ###

60 # ################################# ### ## # ### #

50 #################################### ### ###################

40 #################################### ### ###################

30 #################################### #######################

20 ############################################################

10 ############################################################

0....5....1....1....2....2....3....3....4....4....5....5....

0 5 0 5 0 5 0 5 0 5

CPU% per second (last 60 seconds)

# = average CPU%

<continued>

Observed spike in CPU (per second)

Baseline per minute

N5k-1# show process cpu history

1 1 1 1 1 1 11

789509607796857706878950694778698849688895079850886958858500

753105000482598603786430941227125016911055026100692801248500

100 ** * * * * * * * * * * **

90 ** ** * * * * * ** * * * ** * * * * *** * * **

80 *** ** * * * *** **** * * * *** * **** * ** *** * ** * **

70 *** ** **** * *** **** *** *** *** ****** **** *** * ** * **

60 *** ****************** *** ******* *********** ***** ** ****

50 ************************** ******* *************************

40 ************************************************************

30 ***********************************************************#

20 *##**#*******#***********#*#*#**#**##*###*###**##****#****##

10 ############################################################

0....5....1....1....2....2....3....3....4....4....5....5....

0 5 0 5 0 5 0 5 0 5

CPU% per minute (last 60 minutes)

* = maximum CPU% # = average CPU%


1 1 1 1 1 1 1

899074676686870687895096077968577068789506947786988496888950

189068779462040167531050004825986037864309412271250169110550

100 *** * ** * * * * * * *

90 *** * * * ** ** * * * * * ** * * * ** * * *

80 ***** * * * * **** ** * * * *** **** * * * *** * **** *

70 ***** *** * *** **** ** **** * *** **** *** *** *** ****** *

60 **#** ************** ****************** *** ******* ********

50 *##**************************************** ******* ********

40 ###*#*******************************************************

30 ######******************************************************

20 #######******#****##**#*******#***********#*#*#**#**##*###*#

10 ############################################################

0....5....1....1....2....2....3....3....4....4....5....5....

0 5 0 5 0 5 0 5 0 5

CPU% per minute (last 60 minutes)

* = maximum CPU% # = average CPU%

We also notice a spike in average CPU over the past 5 minutes.

N5k-1# ethanalyzer local interface mgmt capture-filter "not host 10.116.114.157"

Capturing on eth0

wireshark-broadcom-rcpu-dissector: ethertype=0xde08, devicetype=0x0

2005-02-11 21:25:48.452632 172.18.118.162 -> 172.18.118.34 SNMP get-response

2005-02-11 21:25:48.455871 172.18.118.34 -> 172.18.118.162 SNMP get-next-request

2005-02-11 21:25:48.458120 172.18.118.162 -> 172.18.118.34 SNMP get-response

2005-02-11 21:25:48.459968 172.18.118.34 -> 172.18.118.162 SNMP get-next-request

2005-02-11 21:25:48.462428 172.18.118.162 -> 172.18.118.34 SNMP get-response

2005-02-11 21:25:48.464066 172.18.118.34 -> 172.18.118.162 SNMP get-next-request

2005-02-11 21:25:48.466903 172.18.118.162 -> 172.18.118.34 SNMP get-response

2005-02-11 21:25:48.468165 172.18.118.34 -> 172.18.118.162 SNMP get-next-request

2005-02-11 21:25:48.471662 172.18.118.162 -> 172.18.118.34 SNMP get-response

2005-02-11 21:25:48.472263 172.18.118.34 -> 172.18.118.162 SNMP get-next-request

Capturing on mgmt, we see that an snmpwalk is occurring.

This should be a temporary condition and should not affect switching performance, but perhaps you can “feel” latency on the terminal

Could affect other control-plane transactions like configuration backups, collection scripts, etc.

Now you can check with your network management team to work out when this is appropriate, or whether it is a mistake. A full walk is not very efficient to run regularly.
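On a longer capture, counting conversations makes a pattern like this walk stand out quickly. The sketch below tallies (source, destination, protocol) tuples from Ethanalyzer's text output; the sample lines are copied from the capture above, and the regular expression assumes the "date time src -> dst PROTO ..." layout shown there.

```python
import re
from collections import Counter

# Lines copied from the Ethanalyzer capture on mgmt.
CAPTURE = """\
2005-02-11 21:25:48.452632 172.18.118.162 -> 172.18.118.34 SNMP get-response
2005-02-11 21:25:48.455871 172.18.118.34 -> 172.18.118.162 SNMP get-next-request
2005-02-11 21:25:48.458120 172.18.118.162 -> 172.18.118.34 SNMP get-response
2005-02-11 21:25:48.459968 172.18.118.34 -> 172.18.118.162 SNMP get-next-request
"""

LINE = re.compile(r"^\S+ \S+ (\S+) -> (\S+) (\S+)")


def conversations(text: str) -> Counter:
    """Count frames per (source, destination, protocol)."""
    counts = Counter()
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            counts[match.groups()] += 1
    return counts


for (src, dst, proto), frames in conversations(CAPTURE).most_common():
    print(f"{frames:>3} {proto:<5} {src} -> {dst}")
```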


Troubleshooting Nexus 5000 / 2000

Problem Isolation

Platform Overview and troubleshooting

NX-OS Operation

Crashes

Nexus 5000

CRC errors

Ethanalyzer / CPU

Queuing and forwarding

Spanning-tree

Nexus 2000

Redundancy operation and troubleshooting


Nexus 5000/5500 Queuing

Nexus 5000/5500 utilize ingress queuing

Ingress queuing is helpful for data flows where many ports talk to few, the load is spread across the sources

Simple flowcontrol mechanism can be implemented

end-to-end flowcontrol is necessary for FCoE

Ingress queuing is implemented by Virtual Output Queuing (VOQ)

VOQ prevents head of line blocking

One egress interface can be congested, but the ingress buffer still accepts frames into the other queues

8 class-based unicast VOQ per egress interface on every ingress interface

8 class-based multicast VOQ per ingress interface
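Head-of-line blocking is easiest to see side by side. The toy model below is not how the hardware is implemented; it only contrasts a single ingress FIFO with one queue per egress port, using port names from this deck as labels.

```python
from collections import deque


def drain_fifo(frames, congested):
    """Single ingress FIFO: the frame at the head blocks everything behind it."""
    queue = deque(frames)
    sent = []
    while queue and queue[0] not in congested:
        sent.append(queue.popleft())
    return sent


def drain_voq(frames, congested):
    """One virtual output queue per egress: only the congested queue waits."""
    voqs = {}
    for egress in frames:
        voqs.setdefault(egress, deque()).append(egress)
    sent = []
    for egress, queue in voqs.items():
        if egress not in congested:
            sent.extend(queue)
    return sent


# Each entry is the egress port of a frame waiting at one ingress port.
frames = ["e1/5", "e1/3", "e1/3", "e1/7"]
congested = {"e1/5"}

print(drain_fifo(frames, congested))  # []
print(drain_voq(frames, congested))   # ['e1/3', 'e1/3', 'e1/7']
```

With the FIFO, the frame for the congested e1/5 stalls traffic to e1/3 and e1/7. With per-egress queues, only the e1/5 traffic waits.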


Ingress queuing implication on troubleshooting:

Drops occur at INGRESS!

You must think about where the flow originates on the switch to determine where you would like to look for drops.

N5k-1# show queuing interface e1/5

Ethernet1/5 queuing information:

TX Queuing

qos-group sched-type oper-bandwidth

0 WRR 50

1 WRR 50

RX Queuing

qos-group 0

q-size: 243200, HW MTU: 1600 (1500 configured)

drop-type: drop, xon: 0, xoff: 1520

Statistics:

Pkts received over the port : 100882627

Ucast pkts sent to the cross-bar : 100877529

Mcast pkts sent to the cross-bar : 0

Ucast pkts received from the cross-bar : 786990

Pkts sent to the port : 692821

Pkts discarded on ingress : 5098

Per-priority-pause status : Rx (Inactive), Tx (Inactive)

Ingress discards are present when buffering is not sufficient for the traffic flow.

For example – 2 interfaces transmitting toward 1 interface in sustained oversubscription.
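The RX statistics above are self-consistent: packets received minus packets sent to the cross-bar equals the ingress discards. The short check below uses the values from that output. Whether this identity holds for every traffic mix is an assumption; it does hold for this sample.

```python
# Values copied from 'show queuing interface e1/5' (RX Queuing, qos-group 0).
received_over_port = 100882627
ucast_to_crossbar = 100877529
mcast_to_crossbar = 0
discarded_on_ingress = 5098

# Frames that arrived but never reached the cross-bar.
unaccounted = received_over_port - ucast_to_crossbar - mcast_to_crossbar
assert unaccounted == discarded_on_ingress

discard_pct = 100.0 * discarded_on_ingress / received_over_port
print(f"ingress discards: {discarded_on_ingress} ({discard_pct:.4f}% of received)")
```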

Nexus 5000/5500 Queuing: Scenario

[Diagram: Server A and Server B attached through N5k-1 and N5k-2, joined by a trunk; port labels e1/1, e1/3, and e1/5 on each side of the trunk]

Server A is sending some traffic toward Server B

Both servers have had static ARP entries applied for troubleshooting

Server B does not see traffic from Server A when sniffing locally

They are both configured to be in the same VLAN


Start at the ingress interface facing Server A

N5k-1# show hardware internal gatos port e1/1 | grep "gatos i"

gatos instance : 7

gatos iport : 2

-----------------------------------------------------------------

N55k-1# show hardware internal carmel port e1/1 | grep "carmel i"

carmel instance : 0

carmel iport : 1

Nexus 5000: “gatos”

Nexus 5500: “carmel”

This example uses Nexus 5000 outputs. On a Nexus 5500, substitute carmel for gatos in the commands; the two ASICs are laid out in a similar architecture.

The actual counters and errors may vary, but the methodology does not.

Front Panel   Internal

e1/1          7:2

e1/5          7:1


N5k-1# show platform fwm info pif e1/1 | grep stats

Eth1/1 pd: tx stats: bytes 147694477 frames 0 discard 0 drop 0

Eth1/1 pd: rx stats: bytes 26022500 frames 0 discard 0 drop 0

Eth1/1 pd fcoe: tx stats: bytes 0 frames 0 discard 0 drop 0

Eth1/1 pd fcoe: rx stats: bytes 0 frames 0 discard 0 drop 0


These outputs are clean


N5k-1# show platform fwm info asic-errors 7

Printing non zero Gatos error registers:

N5k-1# show hardware internal gatos asic 7 counters interrupt

Gatos 7 interrupt statistics:

Interrupt name |Count |ThresRch|ThresCnt|Ivls


These outputs are also clean

Move on to the egress interface e1/5

In this case, e1/5 is on the same ASIC, so we have already gathered the output needed


N5k-1# show platform fwm info pif e1/5 | grep stats

Eth1/5 pd: tx stats: bytes 476497477 frames 0 discard 0 drop 0

Eth1/5 pd: rx stats: bytes 232322392 frames 0 discard 0 drop 0

Eth1/5 pd fcoe: tx stats: bytes 0 frames 0 discard 0 drop 0

Eth1/5 pd fcoe: rx stats: bytes 0 frames 0 discard 0 drop 0


These outputs are clean


N5k-1# show platform fwm info pif e1/5 | grep stats

Eth1/5 pd: tx stats: bytes 332298390 frames 0 discard 0 drop 0

Eth1/5 pd: rx stats: bytes 176797274 frames 0 discard 0 drop 208

Eth1/5 pd fcoe: tx stats: bytes 0 frames 0 discard 0 drop 0

Eth1/5 pd fcoe: rx stats: bytes 0 frames 0 discard 0 drop 0


208 drops are seen in the receive (rx) direction on port e1/5

Next we try to find the reason for these drops
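Scanning the pif stats by eye works for one port. For many ports, a small parser can flag the lines that are not clean. This is a sketch that assumes the line layout shown in the outputs above; the regular expression and function name are illustrative, not part of NX-OS.

```python
import re

STATS_RE = re.compile(
    r"^(?P<intf>\S+) pd(?P<fcoe> fcoe)?: (?P<dir>tx|rx) stats: "
    r"bytes (?P<bytes>\d+) frames (?P<frames>\d+) "
    r"discard (?P<discard>\d+) drop (?P<drop>\d+)$"
)


def find_drops(output):
    """Return (interface, traffic type, direction, discard, drop) for dirty lines."""
    hits = []
    for line in output.splitlines():
        m = STATS_RE.match(line.strip())
        if not m:
            continue  # not a stats line
        discard, drop = int(m["discard"]), int(m["drop"])
        if discard or drop:
            kind = "fcoe" if m["fcoe"] else "ethernet"
            hits.append((m["intf"], kind, m["dir"], discard, drop))
    return hits


# Output copied from the e1/5 example above.
sample = """Eth1/5 pd: tx stats: bytes 332298390 frames 0 discard 0 drop 0
Eth1/5 pd: rx stats: bytes 176797274 frames 0 discard 0 drop 208
Eth1/5 pd fcoe: tx stats: bytes 0 frames 0 discard 0 drop 0
Eth1/5 pd fcoe: rx stats: bytes 0 frames 0 discard 0 drop 0"""

for hit in find_drops(sample):
    print(hit)
```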


N5k-1# show platform fwm info asic-errors 7

Printing non zero Gatos error registers:

DROP_SRC_VLAN_MBR: res0 = 624 res1 = 0

DROP_SRC_VLAN_MBR is 624

This counter is 3x the number of frame drops - hardware caveat


N5k-1# show hardware internal gatos asic 7 counters interrupt

...

gat_lu_lkup1_INT_func_lo_drop_src_vlan_mbr|74 |

...

Interrupt counters confirm that a given error has fired in hardware, but the count is displayed in hex, and not every interrupt is recorded because of the rate at which interrupts can hit the CPU. This number will generally be somewhat lower than the drop count from show platform fwm info pif.
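The three numbers gathered so far (208 pif drops, a register value of 624, and an interrupt count of 74 hex) can be cross-checked with a few lines of arithmetic. This sketch uses the 3x multiplier stated on the slide; the function and key names are hypothetical.

```python
def cross_check(pif_drops, asic_register, interrupt_hex, register_multiplier=3):
    """Compare the pif drop count with the ASIC register and interrupt counter."""
    register_frames = asic_register // register_multiplier
    interrupts = int(interrupt_hex, 16)  # the interrupt count is displayed in hex
    return {
        "register_frames": register_frames,
        "register_matches_pif": register_frames == pif_drops,
        "interrupts": interrupts,
        "interrupts_not_above_pif": interrupts <= pif_drops,
    }


result = cross_check(pif_drops=208, asic_register=624, interrupt_hex="74")
for key, value in result.items():
    print(f"{key}: {value}")
```

Here 624 / 3 gives the 208 frames seen on the port, and 74 hex is 116 decimal, which is lower than 208 as expected when not every interrupt is recorded.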


From the outputs gathered, we can say either STP is blocking or the VLAN is not allowed

The configs confirm VLAN is not allowed

Use this same methodology to find counters incrementing with your dropped traffic. Where the numbers increment, you can find a reason

Various scenarios cause drops, and the register list is not publicly available. Open a TAC case for scenarios with conflicting or confusing output.


N5k-1# interface Ethernet1/5

switchport mode trunk

switchport trunk allowed vlan 100-102

N5k-1# interface Ethernet1/5

switchport mode trunk

switchport trunk allowed vlan 100-103
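The two allowed-VLAN lists above differ by a single VLAN, which is easy to miss by eye. A minimal sketch for comparing such lists follows; it handles only numeric values and ranges, not keywords such as "all" or "none", and the function name is hypothetical.

```python
def parse_allowed_vlans(spec):
    """Expand a 'switchport trunk allowed vlan' list such as '100-102,200'."""
    vlans = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-")
            vlans.update(range(int(low), int(high) + 1))
        elif part:
            vlans.add(int(part))
    return vlans


first = parse_allowed_vlans("100-102")
second = parse_allowed_vlans("100-103")

# VLANs carried by one configuration but not the other.
print(sorted(second - first))
```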


Troubleshooting Nexus 5000 / 2000

Problem Isolation

Platform Overview and troubleshooting

NX-OS Operation

Crashes

Nexus 5000

CRC errors

Ethanalyzer / CPU

Queuing and forwarding

Spanning-tree

Nexus 2000

Redundancy operation and troubleshooting


Spanning-tree

NX-OS keeps a long history of STP states

Usually you can trace back the change that caused an outage, as long as it has not wrapped in the logs.

STP logs should not normally wrap unless there are constant topology changes.

It is also a good idea to log STP at level 6:

N5k-2(config)# logging level spanning-tree 6

N5k-2# 2011 Jan 21 01:58:23 N5k-2 %STP-6-PORT_ROLE: Port port-channel14 instance VLAN007 role changed to designated
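With level 6 logging enabled, role changes land in syslog and can be pulled out with a simple pattern. The sketch below assumes the message layout matches the sample line above; the regular expression is illustrative and covers only the PORT_ROLE message.

```python
import re

PORT_ROLE_RE = re.compile(
    r"%STP-6-PORT_ROLE: Port (?P<port>\S+) instance (?P<instance>\S+) "
    r"role changed to (?P<role>\w+)"
)


def parse_port_role(line):
    """Return port, instance and new role from a PORT_ROLE syslog line, or None."""
    m = PORT_ROLE_RE.search(line)
    return m.groupdict() if m else None


line = (
    "2011 Jan 21 01:58:23 N5k-2 %STP-6-PORT_ROLE: Port port-channel14 "
    "instance VLAN007 role changed to designated"
)
print(parse_port_role(line))
```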


N5k-1# show spanning-tree internal event-history all

-------------------- All the active STPs -----------

VDC01 VLAN0001

0) Transition at 848207 usecs after Thu Jan 13 05:05:54 2005

Root: 0000.0000.0000.0000 Cost: 0 Age: 0 Root Port: none Port: none [STP_TREE_EV_UP]

1) Transition at 367168 usecs after Thu Jan 13 05:05:57 2005

Root: 8001.000d.ecd6.02fc Cost: 0 Age: 0 Root Port: none Port: Ethernet1/15 [STP_TREE_EV_UPDATE_TOPO_RCVD_SUP_BPDU]

2) Transition at 373395 usecs after Thu Jan 13 05:05:57 2005

Root: 2063.00d0.0362.4c00 Cost: 2 Age: 1 Root Port: Ethernet1/15 Port: none [STP_TREE_EV_MULTI_FLUSH_LOCAL]

3) Transition at 434563 usecs after Thu Jan 13 05:06:00 2005

Root: 2063.00d0.0362.4c00 Cost: 2 Age: 1 Root Port: Ethernet1/15 Port: Ethernet1/15 [STP_TREE_EV_MULTI_FLUSH_RCVD]

Checking all trees


N5k-1# show spanning-tree internal event-history tree 1 brief

2005:01:13 05h:05m:54s:848207us T_EV_UP VLAN0001 [0000.0000.0000.0000 C 0 A 0 R none P none]

2005:01:13 05h:05m:57s:367168us T_UT_SBPDU VLAN0001 [8001.000d.ecd6.02fc C 0 A 0 R none P Eth1/15]

2005:01:13 05h:05m:57s:373395us T_EV_M_FLUSH_L VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P none]

2005:01:13 05h:06m:00s:434563us T_EV_M_FLUSH_R VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P Eth1/15]

2005:01:13 05h:06m:01s:407259us T_EV_M_FLUSH_R VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P Eth1/15]

2005:01:13 05h:06m:02s:947220us T_EV_M_FLUSH_R VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P Eth1/15]

2005:01:13 05h:06m:04s:947216us T_EV_M_FLUSH_R VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P Eth1/15]

2005:01:13 05h:06m:06s:947457us T_EV_M_FLUSH_R VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P Eth1/15]

2005:01:13 05h:06m:08s:837586us T_EV_M_FLUSH_R VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P Eth1/15]

... or just the tree you are interested in
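When the brief history is long, counting events per tree shows quickly whether one event type dominates. This is a sketch that assumes whitespace-separated fields in the order shown above (date, time, event code, tree name); the function name is hypothetical.

```python
from collections import Counter


def count_events(brief_output):
    """Count (tree, event code) pairs from 'event-history tree <n> brief' lines."""
    counts = Counter()
    for line in brief_output.splitlines():
        fields = line.split()
        # Expected layout: date, time, event code, tree name, bracketed details.
        if len(fields) >= 4 and fields[0].count(":") == 2:
            counts[(fields[3], fields[2])] += 1
    return counts


# A few lines copied from the VLAN0001 history above.
sample = """2005:01:13 05h:05m:54s:848207us T_EV_UP VLAN0001 [0000.0000.0000.0000 C 0 A 0 R none P none]
2005:01:13 05h:05m:57s:367168us T_UT_SBPDU VLAN0001 [8001.000d.ecd6.02fc C 0 A 0 R none P Eth1/15]
2005:01:13 05h:05m:57s:373395us T_EV_M_FLUSH_L VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P none]
2005:01:13 05h:06m:00s:434563us T_EV_M_FLUSH_R VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P Eth1/15]
2005:01:13 05h:06m:01s:407259us T_EV_M_FLUSH_R VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P Eth1/15]"""

for (tree, event), n in count_events(sample).most_common():
    print(tree, event, n)
```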


Troubleshooting Nexus 5000 / 2000

Problem Isolation

Platform Overview and troubleshooting

NX-OS Operation

Crashes

Nexus 5000

Nexus 2000

Management

Queuing and forwarding

Logs


FEX Management

FEX fabric interfaces run SDP – satellite discovery protocol

You can view the status of a FEX and see some logs from the N5k:

N5k-1# show fex 100

FEX: 100 Description: FEX0100 state: Online

FEX version: 5.0(3)N1(1b) [Switch version: 5.0(3)N1(1b)]

Extender Model: N2K-C2148T-1GE, Extender Serial: JAF1326BBRC

Part No: 73-12009-05

pinning-mode: static Max-links: 1

Fabric port for control traffic: Eth1/3

Fabric interface state:

Eth1/3 - Interface Up. State: Active


N5k-1# show fex 100 detail

FEX: 100 Description: FEX0100 state: Online

FEX version: 5.0(3)N1(1b) [Switch version: 5.0(3)N1(1b)]

FEX Interim version: 5.0(3)N1(1b)

Switch Interim version: 5.0(3)N1(1b)

Extender Model: N2K-C2148T-1GE, Extender Serial: JAF1326BBRC

Part No: 73-12009-05

Card Id: 70, Mac Addr: 00:0d:ec:d3:b5:c2, Num Macs: 64

Module Sw Gen: 21 [Switch Sw Gen: 21]

post level: complete

...

Logs:

02/02/2005 13:09:06.946120: Module register received

02/02/2005 13:09:06.947614: Image Version Mismatch

02/02/2005 13:09:06.947960: Registration response sent

02/02/2005 13:09:06.948392: Requesting satellite to download image

02/02/2005 13:14:54.149480: Image preload successful.

02/02/2005 13:14:55.375447: Deleting route to FEX

02/02/2005 13:14:55.384270: Module disconnected

02/02/2005 13:14:55.386372: Module Offline

02/02/2005 13:16:52.847574: Module register received

02/02/2005 13:16:52.849146: Registration response sent

02/02/2005 13:16:53.419079: Module Online Sequence

02/02/2005 13:17:09.507541: Module Online
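The timestamps in these logs make it possible to measure how long the FEX was offline during the image download. The sketch below assumes the timestamps are in month/day/year order (the sample date does not disambiguate this) and that the messages read exactly as shown; the function names are hypothetical.

```python
from datetime import datetime


def parse_fex_log(text):
    """Return (timestamp, message) pairs from 'show fex <n> detail' log lines."""
    events = []
    for line in text.splitlines():
        stamp, sep, message = line.strip().partition(": ")
        if not sep:
            continue
        try:
            # Assumption: month/day/year ordering.
            when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S.%f")
        except ValueError:
            continue  # not a timestamped log line
        events.append((when, message))
    return events


def offline_seconds(events):
    """Seconds between the first 'Module Offline' and the next 'Module Online'."""
    offline = next((t for t, msg in events if msg == "Module Offline"), None)
    if offline is None:
        return None
    online = next(
        (t for t, msg in events if t > offline and msg == "Module Online"), None
    )
    return None if online is None else (online - offline).total_seconds()


# Lines copied from the log above.
sample = """02/02/2005 13:14:55.386372: Module Offline
02/02/2005 13:16:52.847574: Module register received
02/02/2005 13:16:53.419079: Module Online Sequence
02/02/2005 13:17:09.507541: Module Online"""

print(f"FEX offline for {offline_seconds(parse_fex_log(sample)):.1f} seconds")
```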


N5k-1# show system internal fex log fport e1/3

Satmgr debug messages for If 0x1a002000:

[19952]02/02/2005 13:08:32.191646: if [0x1a002000]:Phy cleanup rcvd

[19956]02/02/2005 13:08:32.192257: fport [0x1a002000]:Log - Interface Down

[19957]02/02/2005 13:08:32.192266: fport [0x1a002000]:satmgr_fport_fsm: even:t Port Down. curr state: Discovered

[19958]02/02/2005 13:08:32.192654: fport [0x1a002000]:Log - State changed to: Created

[19962]02/02/2005 13:08:32.192853: fport [0x1a002000]:satmgr_fport_fsm: new state: Created

[19967]02/02/2005 13:08:32.193991: fport [0x1a002000]:Log - fport phy cleanup retry end: sending out resp

[19970]02/02/2005 13:08:32.206315: if [0x1a002000]:Pre Cfg rcvd

[19971]02/02/2005 13:08:32.206606: fport [0x1a002000]:Log - pre config: is not a port-channel member

[19977]02/02/2005 13:08:33.727893: fport [0x1a002000]:Log - Interface Up

[19978]02/02/2005 13:08:33.727904: fport [0x1a002000]:satmgr_fport_fsm: even:t Port Down. curr state: Created

[19982]02/02/2005 13:08:33.729944: fport [0x1a002000]:Log - Port Bringup rcvd

[19986]02/02/2005 13:08:33.731201: fport [0x1a002000]:Log - Suspending Fabric port. reason: Fex not configured

[19987]02/02/2005 13:08:33.731216: fport [0x1a002000]:Log - fport bringup retry end: sending out resp

[19997]02/02/2005 13:08:34.120031: fport [0x1a002000]:Log - Fcot message sent to Ethpm

[19998]02/02/2005 13:08:34.120092: fport [0x1a002000]:Log - Satellite discovered msg sent

[19999]02/02/2005 13:08:34.120459: fport [0x1a002000]:Log - State changed to: Discovered


Troubleshooting Nexus 5000 / 2000

Problem Isolation

Platform Overview and troubleshooting

NX-OS Operation

Crashes

Nexus 5000

Nexus 2000

Management

Queuing and forwarding

Logs


FEX Drops

Network interface drops can be seen from N5k “show queuing interface” as of 5.0(3)N1(1)

Best to “attach” to FEX to get detailed logs

Similar to Cat 6k or Nexus 7k linecard commands

Important to check here as FEX also have crash logs, have their own CPU, and are responsible for communicating link state and offloading some protocols like CDP.

N5k-1# attach fex 100

Attaching to FEX 100 ...

To exit type 'exit', to abort type '$.'

fex-100#


The scenario we are looking for is big pipe to little pipe or many to one.

Know the flow of traffic! If you know the pattern, finding where it is likely to stress the network will be easier.

10G to 1G is especially difficult to buffer. The FEX is the last place where 10G traffic can be buffered for your 1G hosts, so drops are likely to appear here rather than elsewhere in your 10G network.

Fex queue-limit and buffer-threshold can be adjusted globally, per fex-type, or per fex


FEX Drops: 2148

fex-100# dbgexec rw

rw> show ints <0-6>

ASIC: 0:

+-------+--------------------------+--------------+-----------+-----------+-----------+

| ASIC | Interrupt Bit Field | Count1 | Thresh1 | Count2 | Thresh2 |

| Port | | | | | |

+-------+--------------------------+--------------+-----------+-----------+-----------+

| 0-NI1 | not_synced_lane_3 | 1 | 0 | 0 | 1 |

| 0-NI1 | not_synced_lane_2 | 1 | 0 | 0 | 1 |

| 0-NI1 | not_synced_lane_0 | 1 | 0 | 0 | 1 |

| 0-NI1 | synced_lane_3 | 1 | 0 | 0 | 1 |

| 0-NI1 | synced_lane_2 | 1 | 0 | 0 | 1 |

| 0-NI1 | synced_lane_1 | 1 | 0 | 0 | 1 |

| 0-NI1 | synced_lane_0 | 1 | 0 | 0 | 1 |

| 0-NI1 | loc_fault | 1 | 0 | 0 | 1 |

| 0-NI1 | not_aligned | 1 | 0 | 0 | 1 |

| 0-NI1 | aligned | 1 | 0 | 0 | 1 |

+-------+--------------------------+--------------+-----------+-----------+-----------+

This output is clean: there are no wo_cr counters. *Only non-zero counters are shown.

wo_cr indicates the buffer is “without credit”

rw> drops <0-6> hi<0-8>

Dropped packet counters for 0-HI0:

red_hix_cnt_rx_allow_vntag_drop : 0

red_hix_cnt_rx_echannel_drop : 0

red_hix_cnt_rx_fwd_drop : 0

red_hix_cnt_rx_mc_drop : 0

red_hix_cnt_rx_runt_pkt_drop : 0

red_hix_cnt_rx_src_vif_out_of_range_drop: 0

red_hix_cnt_tx_lb_drop : 11892

0-SS0 DDROP counters:

OQ0: Class0: 0 Class1: 0 Class2: 0 Class3: 0

OQ1: Class0: 0 Class1: 0 Class2: 0 Class3: 0

OQ2: Class0: 0 Class1: 0 Class2: 0 Class3: 0

OQ3: Class0: 0 Class1: 0 Class2: 0 Class3: 0

OQ4: Class0: 0 Class1: 0 Class2: 0 Class3: 0

0-SS0 ECC1: 0 ECC2: 0

0-SS0 wo_cr: 0 no cells: 0 mtu_vio: 0
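Most of these counters are zero, so filtering for non-zero values makes the interesting line stand out. This is a sketch that only handles lines of the form "name : value"; the multi-value lines (OQ, ECC, wo_cr) are deliberately skipped, and the function name is hypothetical.

```python
def nonzero_counters(output):
    """Return {counter name: value} for single-value lines with a non-zero count."""
    hits = {}
    for line in output.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # Lines carrying several counters leave non-digit text in 'value'.
        if sep and value.isdigit() and int(value) != 0:
            hits[name.strip()] = int(value)
    return hits


# Lines copied from the drops output above.
sample = """Dropped packet counters for 0-HI0:
red_hix_cnt_rx_fwd_drop : 0
red_hix_cnt_rx_src_vif_out_of_range_drop: 0
red_hix_cnt_tx_lb_drop : 11892
OQ0: Class0: 0 Class1: 0 Class2: 0 Class3: 0
0-SS0 wo_cr: 0 no cells: 0 mtu_vio: 0"""

print(nonzero_counters(sample))
```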


FEX Drops: 2248

N5k-1# attach fex 130

fex-130# dbgexec satctrl

satctrl/qosctrl> show port 0 0 2 <0-3>

*uplink interfaces queue on ingress

...

Rx Discard (WR_DISC): 0

Rx Multicast Discard (WR_DISC_MC): 0

Rx Error (WR_RCV_ERR): 0

...

This output is clean: no wr_disc or wr_rcv_err counts.


satctrl/qosctrl> show asic 0 0

SS Statistics:

SS No Credit* No Cells MTU Error OQ Discard Free Cells

---+-----------+-----------+-----------+-----------+----------

0 0 0 0 0 10213

1 0 0 0 0 10213

...

Dropped packets per CoS due to OQ head-drop, OQ is per 8 port group:

OQ CoS 0 CoS 1 CoS 2 CoS 3 CoS 4 CoS 5 CoS 6 CoS 7

----+----------+----------+----------+----------+----------+----------+----------+-----------

NR0 0 0 0 0 0 0 0 0

NR1 0 0 0 0 0 0 0 0

NR2 0 0 0 0 0 0 0 0

NR3 0 0 0 0 0 0 0 0

NR4 0 0 0 0 0 0 0 0

NR5 0 0 0 0 0 0 0 0

----+----------+----------+----------+----------+----------+----------+----------+-----------

HR0 0 0 0 0 0 0 0 0

HR1 0 0 0 0 0 0 0 0

HR2 0 0 0 0 0 0 0 0

HR3 0 0 0 0 0 0 0 0

HR4 0 0 0 0 0 0 0 0

HR5 0 0 0 0 0 0 0 0


fex-130# dbgexec prt

prt> drops

PRT_SS_CNT_TAIL_DROP8 : 2 SS0

prt> show rmon 0 ni<0-3>

+----------------------+----------------------+-----------------+----------------------+----------------------+-----------------+

| TX | Current | Diff | RX | Current | Diff |

+----------------------+----------------------+-----------------+----------------------+----------------------+-----------------+

| TX_PKT_LT64 | 0| 0| RX_PKT_LT64 | 0| 0|

| TX_PKT_64 | 5| 1| RX_PKT_64 | 8| 0|

| TX_PKT_65 | 2062219| 264039| RX_PKT_65 | 4073560| 521532|

| TX_PKT_128 | 2149866| 274780| RX_PKT_128 | 2060397| 263419|

| TX_PKT_256 | 1920669| 245601| RX_PKT_256

...

rmon counters are similar to the “counters detailed” on the N5k ports, helpful for error tracking and finding packets of a certain size

updates immediately – “show counters” on n5k waits for the statsclient


Troubleshooting Nexus 5000 / 2000

Problem Isolation

Platform Overview and troubleshooting

NX-OS Operation

Crashes

Nexus 5000

Nexus 2000

Management

Queuing and forwarding

Logs


FEX Logs

attach fex <n>

dbgexec rw/prt (rw=2148, prt=2248)

Show ctx – driver information

Show oper – link states for L1 status

Show elog – event log chronicling hardware and software interaction, helpful for L1 issues

Show ints – interrupt counters

Show bootlog – bootup messages

Show log – any other logs


Thank you.