Switch Fabric- Troubleshooting Tips DOC-18210
Postings may contain unverified user-created content and change frequently. The content is provided as-is and is not warrantied by Cisco.
Switch Fabric- Troubleshooting tips
Introduction Requirements Troubleshooting Tips Troubleshooting Example Related Information
Introduction
The switch fabric is a daughter card installed on the Sup720. In its first implementation, back in the days of the Sup2, it was a separate module, the Switch Fabric Module. It provides backplane connectivity between line cards. The default bandwidth available on the backplane of the 6500 is 32 Gbps. This 32 Gbps is shared by all slots for serial transmission of data, so at any instant only two ports can be communicating.
With the addition of the switch fabric, the switch's backplane changes from a serially accessed bus to a crossbar fabric. With a crossbar fabric, many ports can transmit and receive data simultaneously, which provides much higher throughput.
The crossbar fabric consists of 18 fabric channels and provides each line card with two fabric channels into the
crossbar fabric. These channels run at 8 Gbps or 20 Gbps, depending on the line card used. The CEF256
and dCEF256 series modules connect to the fabric at 8 Gbps per channel, and the CEF720 series modules connect
to it at 20 Gbps per channel.
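The channel figures above can be turned into a quick back-of-the-envelope calculation. The following is a minimal sketch, not a Cisco tool: the function names are hypothetical, it assumes a card uses both of its channels, and counting each channel twice (once per direction) is an assumption about how aggregate fabric capacity is usually quoted.

```python
# Figures from the text: 18 fabric channels, two channels per line card,
# 8 Gbps or 20 Gbps per channel.
FABRIC_CHANNELS = 18
CHANNELS_PER_LINECARD = 2


def per_slot_gbps(channel_gbps, channels=CHANNELS_PER_LINECARD):
    """One-direction fabric bandwidth for a slot (assumes all channels are used)."""
    return channel_gbps * channels


def fabric_total_gbps(channel_gbps, full_duplex=True):
    """Aggregate fabric capacity; doubling for full duplex is an assumption."""
    one_way = FABRIC_CHANNELS * channel_gbps
    return one_way * 2 if full_duplex else one_way


print(per_slot_gbps(20))      # 40
print(fabric_total_gbps(20))  # 720
```

With 20 Gbps channels the aggregate comes to 720 Gbps, which lines up with the "720" in the Sup720 name.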
Requirements
For a module to use the switch fabric, it must be a fabric-enabled module.
Troubleshooting Tips
1. If the Switch Fabric Module does not work as expected, check the following:
a) Check whether the switch fabric status is Active. To do this, use the show fabric active command. This command displays the current status of the switch fabric. Here is an example:
    Switch# show fabric active
    Active fabric card in slot 5
    No backup fabric card in the system
If the system has a backup fabric card, the output is:
    Switch# show fabric active
    Active fabric card in slot 5
    Backup fabric card in slot 6
b) Check the fabric status of the switching modules in the device. To do this, use the show fabric status [slot_number | all] command. This command displays the fabric status of one or all switching modules. Here is an example:
    Switch# show fabric status
    slot  channel  speed  module          fabric
                          status          status
    1     0        8G     OK              OK
    5     0        8G     OK              Up- Timeout
    6     0        20G    OK              Up- BufError
    8     0        8G     OK              OK
    8     1        8G     OK              OK
    9     0        8G     Down- DDRsync   OK
    Switch#
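When the status table is long, the rows that are not OK can be picked out with a short script. The following is a minimal sketch, assuming the output has been saved as text in the layout of the example above; the function names are hypothetical, and the regular expression is an assumption about the column layout rather than a documented format.

```python
import re

# Rows copied from the "show fabric status" example in this document.
SAMPLE = """\
slot channel speed module fabric
                   status status
1 0 8G OK OK
5 0 8G OK Up- Timeout
6 0 20G OK Up- BufError
8 0 8G OK OK
8 1 8G OK OK
9 0 8G Down- DDRsync OK
"""

# slot, channel, speed, module status, fabric status
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+G)\s+(\S+)\s+(\S+)")


def parse_fabric_status(text):
    """Return a list of (slot, channel, speed, module_status, fabric_status)."""
    rows = []
    for line in text.splitlines():
        # States are printed as "Up- Timeout"; join them into one token.
        line = re.sub(r"-\s+", "-", line)
        match = ROW.match(line)
        if match:
            slot, chan, speed, mod, fab = match.groups()
            rows.append((int(slot), int(chan), speed, mod, fab))
    return rows


def unhealthy(rows):
    """Keep only rows where either status column is not OK."""
    return [r for r in rows if r[3] != "OK" or r[4] != "OK"]


for row in unhealthy(parse_fabric_status(SAMPLE)):
    print(row)
```

On the sample, this reports slots 5, 6 and 9, the three rows whose module or fabric status is something other than OK.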
c) Check the fabric utilization of the switching modules. To do this, use the show fabric utilization [slot_number | all] command. This command displays the fabric utilization of one or all modules.
Here is an example:
    Switch# show fabric utilization all
    slot  channel  speed  Ingress %  Egress %
    1     0        20G    0          0
    1     1        20G    0          0
    2     0        20G    0          24
    2     1        20G    0          24
    3     0        20G    48         0
    4     0        20G    0          0
    4     1        20G    0          0
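To spot heavily loaded channels in this output, the percentages can be compared against a limit. The following is a minimal sketch; the function name is hypothetical and the 40 percent default is an arbitrary value chosen for illustration, not a Cisco recommendation.

```python
import re

# Rows copied from the "show fabric utilization all" example in this document.
SAMPLE = """\
slot channel speed Ingress % Egress %
1 0 20G 0 0
1 1 20G 0 0
2 0 20G 0 24
2 1 20G 0 24
3 0 20G 48 0
4 0 20G 0 0
4 1 20G 0 0
"""

ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+\d+G\s+(\d+)\s+(\d+)\s*$")


def busy_channels(text, threshold_pct=40):
    """Return (slot, channel, ingress, egress) for channels at or above the threshold."""
    hits = []
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header or prompt line
        slot, chan, ingress, egress = (int(v) for v in match.groups())
        if max(ingress, egress) >= threshold_pct:
            hits.append((slot, chan, ingress, egress))
    return hits


print(busy_channels(SAMPLE))      # [(3, 0, 48, 0)]
print(busy_channels(SAMPLE, 20))  # [(2, 0, 0, 24), (2, 1, 0, 24), (3, 0, 48, 0)]
```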
2. In certain rare conditions, the output of show fabric channel-counters may show an incrementing number of rxErrors.
    Switch# show fabric channel-counters
    slot  channel  rxErrors  txErrors  txDrops  lbusDrops
    1     1        0         0         0        0
    3     0        0         0         0        0
    3     1        0         0         0        0
    5     0        5         0         0        0
    8     0        39        0         0        0
    8     1        0         0         0        0
a) An rxError indicates that the module received corrupted packet(s) and dropped them.
b) The fabric does NOT check the CRC when forwarding frames between different fabric ports/channels.
c) The errors could therefore be due to the receiving module corrupting the frames, or to it receiving corrupted frames from any fabric-enabled module in the switch.
The following actions can be taken to resolve these errors:
a) Reseat the module with rxErrors. Reloading the line card in question might stop the errors for some time, but the errors might eventually come back.
b) If an empty slot is available in the chassis, move the affected line card to the empty slot.
c) If no empty slot is available, swap the line card that counts rxErrors with another line card in the chassis that shows no issue, or with a known-good line card.
d) Swap the active and standby supervisors (that is, move the supervisor from slot 5 to slot 6 and vice versa; Sup failover).
e) Replace the affected line card.
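Because the symptom is an incrementing counter, a single capture is not enough; two captures taken some time apart have to be compared. The following is a minimal sketch of that comparison. The function names are hypothetical, the first capture is the example from this document, and the second capture is invented purely to illustrate the comparison.

```python
# First capture: the "show fabric channel-counters" example from this document.
FIRST = """\
slot channel rxErrors txErrors txDrops lbusDrops
1 1 0 0 0 0
3 0 0 0 0 0
3 1 0 0 0 0
5 0 5 0 0 0
8 0 39 0 0 0
8 1 0 0 0 0
"""

# Second capture: hypothetical values, made up for this illustration only.
SECOND = """\
slot channel rxErrors txErrors txDrops lbusDrops
1 1 0 0 0 0
3 0 0 0 0 0
3 1 0 0 0 0
5 0 5 0 0 0
8 0 52 0 0 0
8 1 0 0 0 0
"""


def parse_rx_errors(text):
    """Return {(slot, channel): rxErrors} from channel-counters text."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 6 and all(f.isdigit() for f in fields):
            counters[(int(fields[0]), int(fields[1]))] = int(fields[2])
    return counters


def incrementing(before, after):
    """Return {(slot, channel): increase} for counters that grew between captures."""
    return {
        key: after[key] - before[key]
        for key in after
        if key in before and after[key] > before[key]
    }


print(incrementing(parse_rx_errors(FIRST), parse_rx_errors(SECOND)))  # {(8, 0): 13}
```

A channel such as slot 5 channel 0, whose count is non-zero but unchanged, is not reported; only counters that are still growing are.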
If the output of the "show fabric status" command shows "not-hot" for line cards under hotStandby support:
    Switch# show fabric status
    slot  channel  speed  module  fabric  hotStandby  Standby  Standby
                          status  status  support     module   fabric
    3     0        20G    OK      OK      Y(not-hot)
    3     1        20G    OK      OK      Y(not-hot)
    4     0        20G    OK      OK      Y(not-hot)
    4     1        20G    OK      OK      Y(not-hot)
    5     0        20G    OK      OK      N/A
    5     1        20G    OK      OK      N/A
    6     0        20G    OK      OK      N/A
    6     1        20G    OK      OK      N/A
    7     0        20G    OK      OK      Y(not-hot)
    7     1        20G    OK      OK      Y(not-hot)
    8     0        20G    OK      OK      Y(not-hot)
    8     1        20G    OK      OK      Y(not-hot)
    9     0        20G    OK      OK      Y(not-hot)
    9     1        20G    OK      OK      Y(not-hot)
Reason: The standby fabric hot sync feature is only supported on the E version of the 6500 chassis, and this system has a non-E version.
3. If you see the error message “SP: Linecard endpoint of Channel 7 lost Sync. To Lower fabric and trying to recover now!”:
Reason: The message is caused by a line card that is not fully or properly seated. To identify this line card, analyze the output of the show fabric fpoe map command. Here is an example:
    Switch# show fabric fpoe map
    slot  channel  fpoe
    1     0        0
    1     1        9
    2     0        1
    2     1        10
    3     0        2
    3     1        11
    4     0        3
    4     1        12
    5     0        4
    6     0        5
    6     1        14
    7     0        6
    7     1        15
    8     0        7
    8     1        16
    9     0        8
    9     1        17
Workaround: Each fpoe is mapped to a specific line card slot. In the output of show fabric fpoe map, fpoe 7 points to the line card in slot 8, and that is the card causing the error messages. Once the suspect line card is identified, your next action should be to schedule a removal and reinsertion of that card to try to stop the message from recurring.
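The lookup from the channel number in the error message to a slot can be sketched in a few lines. This is a minimal sketch that assumes the fpoe map has been saved as text; the function name is hypothetical, and the rows are the ones from the example in this document.

```python
# Rows copied from the "show fabric fpoe map" example in this document.
FPOE_MAP = """\
slot channel fpoe
1 0 0
1 1 9
2 0 1
2 1 10
3 0 2
3 1 11
4 0 3
4 1 12
5 0 4
6 0 5
6 1 14
7 0 6
7 1 15
8 0 7
8 1 16
9 0 8
9 1 17
"""


def slot_for_fpoe(text, fpoe):
    """Return (slot, channel) mapped to the given fpoe, or None if absent."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3 and all(f.isdigit() for f in fields):
            slot, channel, value = (int(f) for f in fields)
            if value == fpoe:
                return slot, channel
    return None


print(slot_for_fpoe(FPOE_MAP, 7))  # (8, 0)
```

For the message about Channel 7, the lookup returns slot 8, channel 0, which matches the card identified in the workaround.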
4. If the system switching performance drops from 30 Mpps to 15 Mpps:
Reason: When classic and fabric-enabled modules are mixed in a chassis, the system switching performance drops from 30 Mpps to 15 Mpps.
Older "Classic" modules in the 6500 (models 61xx, 62xx, 63xx, 64xx) send all traffic over the switch bus backplane, to be forwarded by the supervisor. Fabric-enabled modules send only the packet headers over the bus, and the switch fabric can be used to forward the data portion of the packet.
Workaround: Consider replacing any "Classic" modules with fabric-enabled modules in order to increase system performance.
5. To troubleshoot further, collect the following show command output before opening a TAC case.
Step 1. Turn on service internal.
    switch# configure terminal
    switch(config)# service internal
Step 2. Collect the requested logs.
    terminal length 0
    show fabric active
    show fabric channel-counters
    show fabric drop
    show fabric errors
    show fabric errors threshold
    show fabric fpoe map
    show fabric status
    show fabric utilization
    show tech-support
    remote login switch
    terminal length 0
    show fabric error
    show fabric state-machine channel state
    show fabric state-machine channel event_trace 11
    show fabric resync
    show fabric timeout
    show platform hardware capacity fabric
    exit
Step 3. Turn off service internal.
    switch# configure terminal
    switch(config)# no service internal
Troubleshooting Example
1. Fabric Timeout Error Message:
%FABRIC-SP-[module-number]-TIMEOUT_ERR: Fabric in slot [dec] reported timeout error for
channel [dec] (Module [dec], fabric connection [dec])
Description
The error message indicates that firmware code on the fabric detected that the input or
output buffer was not moving. To recover from this condition, the system will automatically
resynchronize the fabric channel.
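When scanning a log for this message, the slot, channel, and module numbers can be pulled out with a regular expression built from the template above. The following is a minimal sketch: the function name is hypothetical, and the sample line is not taken from a real switch. It simply fills the template's placeholders with arbitrary numbers for illustration.

```python
import re

# Pattern derived from the message template in this document.
TIMEOUT_ERR = re.compile(
    r"%FABRIC-SP-\S+?-TIMEOUT_ERR: Fabric in slot (?P<slot>\d+) "
    r"reported timeout error for\s+channel (?P<channel>\d+) "
    r"\(Module (?P<module>\d+), fabric connection (?P<connection>\d+)\)"
)


def parse_timeout_err(line):
    """Return the numeric fields of a TIMEOUT_ERR message, or None if no match."""
    match = TIMEOUT_ERR.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


# Hypothetical line: placeholders filled with made-up numbers.
sample = (
    "%FABRIC-SP-3-TIMEOUT_ERR: Fabric in slot 5 reported timeout error for "
    "channel 7 (Module 8, fabric connection 0)"
)
print(parse_timeout_err(sample))
# {'slot': 5, 'channel': 7, 'module': 8, 'connection': 0}
```

The module number extracted this way is the one to reset and check in the troubleshooting steps.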
Troubleshooting Steps
1. Issue the command “hw-module reset” (on Catalyst 6500 IOS this is typically entered as hw-module module slot_number reset) to soft-reset the module.
2. After the module is up again, capture the output of the command “show module” and the command “show diagnostic module all”.
Sample Output Of “show module”
Show Module
Mod Ports Card Type                              Model              Serial No.
--- ----- -------------------------------------- ------------------ -----------
  2    24 CEF720 24 port 1000mb SFP              WS-X6724-SFP       SAL0AAAAAAA
  3    24 CEF720 24 port 1000mb SFP              WS-X6724-SFP       SAD0AAAAAAA
  5     2 Supervisor Engine 720 (Hot)            WS-SUP720-3B       SAD0AAAAAAA
  6     2 Supervisor Engine 720 (Active)         WS-SUP720-3B       SAD0AAAAAAA
  7     4 CEF720 4 port 10-Gigabit Ethernet      WS-X6704-10GE      SAL1AAAAAAA
  8     4 CEF720 4 port 10-Gigabit Ethernet      WS-X6704-10GE      SAL1AAAAAAA
Sample Output of “show diagnostic module all”
Switch#show diagnostic module all
Current bootup diagnostic level: minimal
Module 6: Supervisor Engine 720 (Active)
Overall Diagnostic Result for Module 6 : PASS
Diagnostic level at card bootup: minimal
Test results: (. = Pass, F = Fail, U = Untested)
1) TestScratchRegister -------------> .
2) TestSPRPInbandPing --------------> .
3) TestTransceiverIntegrity:
Port 1 2
----------
U U
4) TestActiveToStandbyLoopback:
Port 1 2
----------
U U
5) TestLoopback:
Port 1 2
----------
. .
6) TestNewIndexLearn ---------------> .
7) TestDontConditionalLearn --------> .
8) TestBadBpduTrap -----------------> .
9) TestMatchCapture ----------------> .
10) TestProtocolMatchChannel --------> .
11) TestFibDevices ------------------> .
12) TestIPv4FibShortcut -------------> .
13) TestL3Capture2 ------------------> .
14) TestIPv6FibShortcut -------------> .
15) TestMPLSFibShortcut -------------> .
16) TestNATFibShortcut --------------> .
17) TestAclPermit -------------------> .
18) TestAclDeny ---------------------> .
19) TestQoSTcam ---------------------> .
20) TestL3VlanMet -------------------> .
21) TestIngressSpan -----------------> .
22) TestEgressSpan ------------------> .
23) TestNetflowInlineRewrite:
Port 1 2
----------
U U
24) TestFabricSnakeForward ----------> .
25) TestFabricSnakeBackward ---------> .
26) TestTrafficStress ---------------> U
27) TestFibTcamSSRAM ----------------> U
28) TestAsicMemory ------------------> U
29) TestAclQosTcam ------------------> U
30) TestNetflowTcam -----------------> U
31) ScheduleSwitchover --------------> U
32) TestFirmwareDiagStatus ----------> .
If the output is not as expected, physically pull out the module and reseat it firmly in
the chassis to hard-reset it. After the module is up again, capture the output of the commands “show module” and “show diagnostic module all”.
Here is an example of a failed diagnostic test for module 1:
    Module 1: Catalyst 6000 supervisor 2 (Active)
    SerialNo :
    Overall Diagnostic Result for Module 1 : MINOR ERROR
    Diagnostic level at card bootup: minimal
    Test results: (. = Pass, F = Fail, U = Untested)
    1) TestSPRPInbandPing --------------> F
    2) TestTransceiverIntegrity:
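With more than thirty tests per module, failed tests are easy to miss by eye. The following is a minimal sketch that lists them per module, assuming the diagnostic output has been saved as text. The function name is hypothetical, and the sketch only handles single-result lines of the form "N) TestName ---> F"; per-port results (the "Port 1 2" blocks) are ignored.

```python
import re

# Excerpt copied from the failed-diagnostic example in this document.
SAMPLE = """\
Module 1: Catalyst 6000 supervisor 2 (Active)
SerialNo :
Overall Diagnostic Result for Module 1 : MINOR ERROR
Diagnostic level at card bootup: minimal
Test results: (. = Pass, F = Fail, U = Untested)
1) TestSPRPInbandPing --------------> F
2) TestTransceiverIntegrity:
"""

MODULE = re.compile(r"^\s*Module (\d+):")
RESULT = re.compile(r"^\s*\d+\)\s+(\S+)\s+-+>\s+([.FU])\s*$")


def failed_tests(text):
    """Return {module_number: [names of tests marked F]}."""
    failures = {}
    module = None
    for line in text.splitlines():
        header = MODULE.match(line)
        if header:
            module = int(header.group(1))
            continue
        result = RESULT.match(line)
        if result and result.group(2) == "F":
            failures.setdefault(module, []).append(result.group(1))
    return failures


print(failed_tests(SAMPLE))  # {1: ['TestSPRPInbandPing']}
```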
2. Overruns on some ports on Card 5 (WS-X6548-GE-TX)
"Overruns" were noticed on four interfaces. The counters were not incrementing.
Switch1# show interface counters
GigabitEthernet5/1 is up, line protocol is up (connected)
Full-duplex, 1000Mb/s, media type is 10/100/1000BaseT
0 input errors, 0 CRC, 0 frame, 26 overrun, 0 ignored
GigabitEthernet5/6 is up, line protocol is up (connected)
Full-duplex, 1000Mb/s, media type is 10/100/1000BaseT
0 input errors, 0 CRC, 0 frame, 14 overrun, 0 ignored
GigabitEthernet5/9 is up, line protocol is up (connected)
Full-duplex, 1000Mb/s, media type is 10/100/1000BaseT
0 input errors, 0 CRC, 0 frame, 159 overrun, 0 ignored
GigabitEthernet5/12 is up, line protocol is up (connected)
Full-duplex, 1000Mb/s, media type is 10/100/1000BaseT
0 input errors, 0 CRC, 0 frame, 269 overrun, 0 ignored
Will enabling the switch fabric also rectify the overruns?
Solution:
Enabling switch fabric will not rectify overruns because after the installation of a Switch Fabric Module in
a Cisco Catalyst 6500 series switch, the traffic is forwarded to and from modules in different modes which
doesn't necessarily facilitate resolution for overruns. The traffic is forwarded in one of these of these modes:
a) Flow-through mode: In this mode, data passes between the local bus and the supervisor engine bus. This
mode is used for traffic to or from modules that are not fabric-enabled.
b) Truncated mode: Only truncated data (the first 64 bytes of the frame) goes over the switch fabric channel
if both the destination and the source are fabric-enabled modules. If either the source or destination is not a
fabric-enabled module, the data goes through the switch fabric channel and the data bus. The Switch Fabric
Module does not get involved when traffic is forwarded between modules that are not fabric-enabled.
c) Compact mode: A compact version of the DBus header is forwarded over the switch fabric channel, which
delivers the best possible switching rate. Modules that are not fabric-enabled do not support the compact mode
and generate cyclic redundancy check (CRC) errors upon receipt of frames in compact mode. This mode is
used only when no such modules are installed in the chassis.
Let’s look at what an overrun is:
Overrun - The number of times the receiver hardware was unable to hand received data to a hardware buffer.
Common Cause - The input rate of traffic exceeded the ability of the receiver to handle the data.
From the given example, the module used is WS-X6548-GE-TX:
This module is 8:1 oversubscribed. The ports on this module go to servers. On these modules there is a
single 1-Gigabit Ethernet uplink from the port ASIC that supports eight ports. These cards share a 1 Mb buffer
between a group of ports (1-8, 9-16, 17-24, 25-32, 33-40, and 41-48) since each block of eight ports is 8:1
oversubscribed. The aggregate throughput of each block of eight ports cannot exceed 1 Gbps. These line
cards are oversubscription cards that are designed to extend gigabit to the desktop and might not be ideal for
server farm connectivity. For more information, refer to:
Troubleshooting Switch Port and Interface Problems
To resolve the overruns, move the high-volume servers to ports on different ASIC groups, so that the traffic flow through the eight ports of each ASIC group does not exceed 1 Gbps. Alternatively, look for design recommendations on line cards that have a better oversubscription ratio.
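The port grouping described above (1-8, 9-16, and so on, each sharing a 1 Gbps uplink) can be checked with a small calculation before moving servers. The following is a minimal sketch: the function names are hypothetical, and the per-port traffic rates are invented for illustration. Only the port numbers (5/1, 5/6, 5/9, 5/12) come from the example.

```python
PORTS_PER_GROUP = 8       # each port ASIC serves eight ports (from the text)
GROUP_UPLINK_MBPS = 1000  # single 1-Gigabit uplink per group (from the text)


def port_group(port):
    """Map a 1-based port number to the (first, last) ports of its group."""
    first = ((port - 1) // PORTS_PER_GROUP) * PORTS_PER_GROUP + 1
    return first, first + PORTS_PER_GROUP - 1


def oversubscribed_groups(port_rates_mbps):
    """Return {group: total Mbps} for groups whose total exceeds the uplink."""
    totals = {}
    for port, rate in port_rates_mbps.items():
        group = port_group(port)
        totals[group] = totals.get(group, 0) + rate
    return {g: t for g, t in totals.items() if t > GROUP_UPLINK_MBPS}


# Hypothetical offered load in Mbps on the four ports that showed overruns.
rates = {1: 600, 6: 500, 9: 400, 12: 300}
print(oversubscribed_groups(rates))  # {(1, 8): 1100}
```

In this made-up case, ports 1 and 6 share the 1-8 group and together exceed 1 Gbps, so moving one of those servers to a port in a lightly loaded group would bring the group back under its uplink limit.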
For Best Practices, please refer to Oversubscription and Density Best Practices.
Related Information
• Introduction to Switch Fabric
• Switch Fabric Functionality
• Cisco Catalyst 6500 Series Switches
• Configuring Online Diagnostic Tests