Cisco cBR Router Thermal Monitoring and Troubleshooting
Overview of the Cisco cBR Router Thermal Monitoring
Enabling Thermal Shutdown
Recovering a Card After a Thermal Shutdown
Viewing the System Component Temperatures
Viewing Temperature Sensor Alarm Status
Setting Up SNMP Traps for Temperature Alarms
SUP_dSUM Alarm
Critical Thermal Sensor Identifiers and Temperature Limit Set-Points for Specific Line Cards
Revised: June 7, 2018
Overview of the Cisco cBR Router Thermal Monitoring
The Cisco cBR routers are equipped with a comprehensive thermal monitoring system. You can enable the thermal protection shutdown of the system for critical thermal sensors. The routers have preconfigured thermal shutdown levels and general alarm levels for all the sensors. The values indicated in this document are specific to the product IDs referenced and apply to Cisco IOS-XE 3.18.0S and later.
The main cooling system of the Cisco cBR comprises five fan modules at the rear of the chassis, with two fans in each module. Additionally, the power modules have two internal fans each for self-cooling.
The Cisco cBR does not power up unless all five fan modules are installed. The fan modules do not have to be 100% functional, but they must all be present. The fan modules (functional or not) must be installed in the chassis at startup to seal the chassis fan bay slots and prevent recirculation to the functional fans.
When an individual fan fails or a complete fan module is removed, the system continues to run and posts alarms. The system does not power down because of missing or failed fans. Alarms are posted for failed or removed fan modules and for any thermal event that occurs at the line card level. The system relies on the line card level thermal monitoring of the critical sensors and any enabled system thermal protection to shut down a line card.
Overview of the Cisco cBR AC and DC Power Supplies Thermal Sensors
The Cisco cBR AC Power Supply (cBR-AC-PS) and Cisco cBR DC Power Supply (cBR-DC-PS) (CBR8 power supplies) are equipped with over-temperature protection shutdown. This safety feature CANNOT be turned off and is configured by the manufacturer of the power supply. The power supplies have numerous sensors placed throughout to prevent thermal runaway. If a power supply sensor reaches its over-temperature limit, the power supply shuts itself down.
In the released power supplies, only the inlet and outlet temperature sensors are read by IOS. The power supply inlet sensor has a preconfigured over-temperature protection shutdown limit of 65°C. Allowing for sensor tolerances, an over-temperature shutdown can occur at a true facility inlet temperature as low as 60°C. This setting is above the recommended operational range for the CBR8 product.
If too few power supply modules are functioning, the system shuts down the line cards first and then the entire system.
Enabling Thermal Shutdown
You can configure the Cisco cBR system to protect the major power consuming chips in the chassis that reside on the front line cards. When you enable the thermal shutdown configuration, the chassis shuts down the line cards when the major heat generating chips reach their design limits listed in the tables in the section Critical Thermal Sensor Identifiers and Temperature Limit Set-Points for Specific Line Cards. Only the line card with a thermal event is shut down.
To enable the thermal shutdown feature, complete the following procedure:
router#configure terminal
router(config)#facility-alarm critical exceed-action shutdown
The primary supervisor is an exception to the thermal shutdown configuration. When the thermal shutdown feature is enabled and the primary supervisor has a thermal event that exceeds the shutdown limit, all the front line cards in the chassis are shut down but the primary supervisor continues to run and provide telemetry until the failure or event is cleared or the system completely shuts down due to the thermal event.
The thermal shutdown configuration affects only the front line cards and does not affect the rear physical interface card (PIC), power supplies, or fan modules.
Recovering a Card After a Thermal Shutdown
You must clear the thermal events on the line card and the primary supervisor before the line card comes back online.
After a thermal shutdown event, you must review the alarms, system and facility temperatures, and failure logs to determine the root cause of the thermal event and correct it.
Once a line card is placed into the thermal shutdown state, there are three ways to recover the line card:
• Issue the hw-module slot x reload command: You can issue this command to try to bring the line card back online. This command must be issued twice. The first time, the command resets the line card, which then defaults to offline. The second time, the command allows the card to boot up if the thermal event alarm is cleared.
• Online insertion and removal (OIR) of the card or a complete card replacement: This procedure allows the line card slot to boot up. You must perform this procedure twice, or perform it along with the hw-module slot x reload command. Before OIR or card replacement, you must issue the hw-module slot x reload command. After the card is reset, you can OIR or replace the card. On reinsertion, the card boots up normally if the thermal event is cleared.
• Reboot or power cycle of the entire chassis: This procedure also clears the thermal shutdown alarm. If the thermal conditions persist after rebooting, the line cards return to the shutdown state.
Viewing the System Component Temperatures
To view the system component temperatures, use the show env | inc Temp command.
The following example shows the output for the show env | inc Temp command.
router#show env | inc Temp
5/1 Temp: RTMAC Normal 40 Celsius
5/1 Temp: INLET Normal 31 Celsius
5/1 Temp: OUTLET Normal 31 Celsius
5/1 Temp: MAX6697 Normal 54 Celsius
5/1 Temp: TCXO Normal 37 Celsius
5/1 Temp: SUP_OUT Normal 54 Celsius
5/1 Temp: 3882_1 P Normal 46 Celsius
5/1 Temp: 3882_2 P Normal 40 Celsius
5/1 Temp: 3882_3 P Normal 46 Celsius
5/1 Temp: INLET PD Normal 28 Celsius
The output displays the slot location, sensor name, status, and temperature.
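If you collect this output from a script, the four fields described above can be pulled apart with a regular expression. The following is a minimal Python sketch, not a Cisco-provided tool. The function name parse_env_temps and the exact line layout are assumptions based only on the sample output shown in this document; other software releases may format the lines differently.

```python
import re

# One line of `show env | inc Temp` output, for example:
#   "5/1 Temp: RTMAC Normal 40 Celsius"
# The sensor name can contain a space ("INLET PD"), so it is matched lazily.
ENV_LINE = re.compile(
    r"^(?P<slot>\S+)\s+Temp:\s+(?P<sensor>.+?)\s+"
    r"(?P<status>\S+)\s+(?P<temp>-?\d+)\s+Celsius$"
)

def parse_env_temps(text):
    """Return a list of (slot, sensor, status, temp_celsius) tuples.

    Hypothetical helper: lines that do not match the layout are skipped.
    """
    readings = []
    for line in text.splitlines():
        match = ENV_LINE.match(line.strip())
        if match:
            readings.append(
                (match["slot"], match["sensor"], match["status"], int(match["temp"]))
            )
    return readings

sample = """\
router#show env | inc Temp
5/1 Temp: RTMAC Normal 40 Celsius
5/1 Temp: INLET PD Normal 28 Celsius
"""

for reading in parse_env_temps(sample):
    print(reading)
```

The prompt line is ignored because it does not match the pattern, so the raw session capture can be passed in unchanged.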
Viewing Temperature Sensor Alarm Status
To view the posted temperature sensor alarms, use the show facility-alarm status command to show all the alarms in the system. This command also shows if the fan modules are installed and functioning properly and other important alarm states related to a thermal event.
The following example shows the output for the show facility-alarm status command.
router#show facility-alarm status
System Totals Critical: 11 Major: 1 Minor: 0
Source Time Severity Description [Index]
------ ------ -------- -------------------
Temp: Outlet P0/4 Apr 18 2016 13:05:12 INFO Temp Above Normal [4]
Temp: Outlet P1/4 Apr 18 2016 13:05:12 INFO Temp Above Normal [4]
Temp: Outlet P2/4 Apr 18 2016 13:05:12 INFO Temp Above Normal [4]
Power Supply Bay 1 Apr 18 2016 13:05:12 INFO Power Supply/FAN Module Missing [2]
Power Supply Bay 3 Apr 18 2016 13:05:12 INFO Power Supply/FAN Module Missing [2]
Power Supply Bay 4 Apr 18 2016 13:05:12 INFO Power Supply/FAN Module Missing [2]
Power Supply Bay 5 Apr 18 2016 13:05:12 INFO Power Supply/FAN Module Missing [2]
Fan Slot 0 Apr 18 2016 13:05:12 CRITICAL Fan Tray Module Missing [0]
Fan Slot 0 Apr 18 2016 13:05:12 CRITICAL System shutdown will occur in few min [1]
Fan Slot 1 Apr 18 2016 13:05:12 CRITICAL Fan Tray Module Missing [0]
Fan Slot 1 Apr 18 2016 13:05:12 CRITICAL System shutdown will occur in few min [1]
Fan Slot 2 Apr 18 2016 13:05:12 CRITICAL Fan Tray Module Missing [0]
Fan Slot 2 Apr 18 2016 13:05:12 CRITICAL System shutdown will occur in few min [1]
Fan Slot 3 Apr 18 2016 13:05:12 CRITICAL Fan Tray Module Missing [0]
Fan Slot 3 Apr 18 2016 13:05:12 CRITICAL System shutdown will occur in few min [1]
Fan Slot 4 Apr 18 2016 13:05:12 CRITICAL Fan Tray Module Missing [0]
Fan Slot 4 Apr 18 2016 13:05:12 CRITICAL System shutdown will occur in few min [1]
Cable3/0-MAC2 Apr 18 2016 13:07:05 INFO Physical Port Administrative State Down [1]
Cable3/0-MAC4 Apr 18 2016 13:07:05 INFO Physical Port Administrative State Down [1]
sup 0 Apr 18 2016 13:05:13 MAJOR Unknown state [0]
TenGigabitEthernet5/1/3 Apr 18 2016 19:13:17 CRITICAL Physical Port Link Down [35]
SFP+ container 5/1/4 Apr 18 2016 13:05:23 INFO Transceiver Missing [0]
SFP+ container 5/1/5 Apr 18 2016 13:05:23 INFO Transceiver Missing [0]
SFP+ container 5/1/6 Apr 18 2016 13:05:23 INFO Transceiver Missing [0]
SFP+ container 5/1/7 Apr 18 2016 13:05:23 INFO Transceiver Missing [0]
To view only the thermal alarms, use the show facility-alarm status | inc Temp command.
The following example shows the output for the show facility-alarm status | inc Temp command.
router#show facility-alarm status | inc Temp
Temp: Outlet P0/4 Apr 18 2016 13:05:12 INFO Temp Above Normal [4]
Temp: Outlet P1/4 Apr 18 2016 13:05:12 INFO Temp Above Normal [4]
Temp: Outlet P2/4 Apr 18 2016 13:04:12 INFO Temp Above Normal [4]
Temp: U18 P10/1 Apr 18 2016 13:03:12 INFO Temp Above Normal [4]
Temp: U17 P11/1 Apr 18 2016 13:01:12 INFO Temp Above Normal [4]
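When the alarm list is processed off-box, each row can be split into source, time, severity, description, and index. The following Python sketch is illustrative only. The helper names parse_alarms and thermal_alarms are hypothetical, and the pattern assumes the single-line row layout and the "Mon DD YYYY HH:MM:SS" timestamp seen in the sample output in this document.

```python
import re

# One alarm row, for example:
#   "Temp: Outlet P0/4 Apr 18 2016 13:05:12 INFO Temp Above Normal [4]"
# The timestamp anchors the split between the source and the severity.
ALARM_LINE = re.compile(
    r"^(?P<source>.+?)\s+"
    r"(?P<time>[A-Z][a-z]{2} +\d{1,2} \d{4} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<severity>[A-Za-z]+)\s+"
    r"(?P<description>.+?)\s+\[(?P<index>\d+)\]$"
)

def parse_alarms(text):
    """Return one dict per alarm row; header and totals lines are skipped."""
    alarms = []
    for line in text.splitlines():
        match = ALARM_LINE.match(line.strip())
        if match:
            alarms.append(match.groupdict())
    return alarms

def thermal_alarms(alarms):
    """Keep only rows whose source starts with "Temp:" (same idea as | inc Temp)."""
    return [alarm for alarm in alarms if alarm["source"].startswith("Temp:")]

sample = """\
Source Time Severity Description [Index]
Temp: Outlet P0/4 Apr 18 2016 13:05:12 INFO Temp Above Normal [4]
Fan Slot 4 Apr 18 2016 13:05:12 CRITICAL Fan Tray Module Missing [0]
"""

for alarm in thermal_alarms(parse_alarms(sample)):
    print(alarm["source"], alarm["severity"], alarm["description"])
```

Filtering on the parsed source field rather than on the raw text avoids matching rows that merely contain the word Temp in their description.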
Setting Up SNMP Traps for Temperature Alarms
You can send alarms to a trap server and capture them by using the snmp-server enable traps alarms informational command. This command enables traps for all alarms, not only the thermal alarms. There is no command to enable traps only for thermal alarms. You cannot configure the temperature alarm values because the thresholds are preprogrammed on the line cards.
Below is an example of a thermal alarm trap. This example shows a critical alarm for slot 5 BB_DIE on the CBR-SUP-160G card.
Received SNMPv1 Trap:
Community: public
Enterprise: ciscoEntityAlarmMIBNotificationsPrefix
Agent-addr: 10.0.10.10
Enterprise Specific trap.
Enterprise Specific trap: 1
Time Ticks: 362693
ceAlarmHistEntPhysicalIndex.74 = 60142
ceAlarmHistAlarmType.74 = 2
ceAlarmHistSeverity.74 = critical(1)
ceAlarmHistTimeStamp.74 = 362692
ceAlarmDescrText.9.2 = Temp Above Normal
In the example above, the trap contains “ceAlarmHistEntPhysicalIndex.74 = 60142”. Using this index, you can look up the detailed descriptor in the SNMP table “entPhysicalDescr”.
For example, entPhysicalDescr.60142 = Temp: BB_DIE.
Each thermal alarm trap contains its PhysicalIndex. For example, for BB_DIE, ceAlarmHistEntPhysicalIndex.74 = 60142, so its index is of the form 6xxxx. For the front line cards, cylons, and SUP, the PhysicalIndex starts from (slot+1)*10,000.
In this example, the equation is (5+1)*10,000 = 60,000, which means the sensor that sent the alarm is in slot 5.
Once you have the sensor's PhysicalIndex, which starts from (slot+1)*10,000, use the above formula to get the slot number, and search for the PhysicalIndex in the entPhysicalDescr table to get its descriptor.
Each line card and card type has a different entPhysicalDescr table. Always reference the entPhysicalDescr table for the specific line card from which an alarm originates.
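The slot calculation can be inverted with integer division. The following Python sketch is illustrative only. The function name slot_from_physical_index is hypothetical, and the lookup dictionary holds just the single entPhysicalDescr entry quoted in the trap example rather than a table read from a live device.

```python
def slot_from_physical_index(physical_index):
    """Invert PhysicalIndex = (slot + 1) * 10,000 + offset.

    The rule is documented here for front cards and the SUP only, so
    indexes below 10,000 (for example, power supply and fan entries)
    are rejected rather than mapped to a slot.
    """
    if physical_index < 10_000:
        raise ValueError("index does not follow the (slot+1)*10,000 rule")
    return physical_index // 10_000 - 1

# Excerpt of entPhysicalDescr, taken from the trap example in this document.
ent_physical_descr = {60142: "Temp: BB_DIE"}

index = 60142  # value of ceAlarmHistEntPhysicalIndex.74 in the sample trap
print(slot_from_physical_index(index), ent_physical_descr[index])
# prints: 5 Temp: BB_DIE
```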
The following are some examples of entPhysicalDescr tables.
• Example entPhysicalDescr for temperature sensors for power supplies installed in bay P0 and P5. P1-P4 bays are empty.
entPhysicalDescr.1000 = Power Supply Bay
entPhysicalDescr.1001 = Cisco cBR CCAP AC Power Supply
entPhysicalDescr.1002 = PEM Iout
entPhysicalDescr.1003 = PEM Vout
entPhysicalDescr.1004 = PEM Vin
entPhysicalDescr.1005 = Temp: INLET
entPhysicalDescr.1006 = Temp: OUTLET
entPhysicalDescr.1020 = Power Supply Bay
entPhysicalDescr.1040 = Power Supply Bay
entPhysicalDescr.1060 = Power Supply Bay
entPhysicalDescr.1080 = Power Supply Bay
entPhysicalDescr.1100 = Power Supply Bay
entPhysicalDescr.1101 = Cisco cBR CCAP AC Power Supply
entPhysicalDescr.1102 = PEM Iout
entPhysicalDescr.1103 = PEM Vout
entPhysicalDescr.1104 = PEM Vin
entPhysicalDescr.1105 = Temp: INLET
entPhysicalDescr.1106 = Temp: OUTLET
• Example entPhysicalDescr for temperature sensors for fan modules installed in P10-P14 bays.
entPhysicalDescr.2000 = Fan Slot
entPhysicalDescr.2001 = Cisco cBR Fan Assembly
entPhysicalDescr.2002 = Temp: U17
entPhysicalDescr.2003 = Temp: U18
entPhysicalDescr.2004 = Temp: FC
entPhysicalDescr.2005 = MPL115A
entPhysicalDescr.2012 = Fan
entPhysicalDescr.2013 = Fan
entPhysicalDescr.2020 = Fan Slot
entPhysicalDescr.2021 = Cisco cBR Fan Assembly
entPhysicalDescr.2022 = Temp: U17
entPhysicalDescr.2023 = Temp: U18
entPhysicalDescr.2024 = Temp: FC
entPhysicalDescr.2025 = MPL115A
entPhysicalDescr.2032 = Fan
entPhysicalDescr.2033 = Fan
entPhysicalDescr.2040 = Fan Slot
entPhysicalDescr.2041 = Cisco cBR Fan Assembly
entPhysicalDescr.2042 = Temp: U17
entPhysicalDescr.2043 = Temp: U18
entPhysicalDescr.2044 = Temp: FC
entPhysicalDescr.2045 = MPL115A
entPhysicalDescr.2052 = Fan
entPhysicalDescr.2053 = Fan
entPhysicalDescr.2060 = Fan Slot
entPhysicalDescr.2061 = Cisco cBR Fan Assembly
entPhysicalDescr.2062 = Temp: U17
entPhysicalDescr.2063 = Temp: U18
entPhysicalDescr.2064 = Temp: FC
entPhysicalDescr.2065 = MPL115A
entPhysicalDescr.2072 = Fan
entPhysicalDescr.2073 = Fan
entPhysicalDescr.2080 = Fan Slot
entPhysicalDescr.2081 = Cisco cBR Fan Assembly
entPhysicalDescr.2082 = Temp: U17
entPhysicalDescr.2083 = Temp: U18
entPhysicalDescr.2084 = Temp: FC
entPhysicalDescr.2085 = MPL115A
entPhysicalDescr.2092 = Fan
entPhysicalDescr.2093 = Fan
• Example entPhysicalDescr for temperature sensors for CBR-SUP-160G installed in slot 4.
entPhysicalDescr.50000 = Cisco cBR CCAP Supervisor Card
entPhysicalDescr.50001 = Cisco cBR CCAP Supervisor MB
entPhysicalDescr.50141 = Temp: Y0_DIE
entPhysicalDescr.50142 = Temp: BB_DIE
entPhysicalDescr.50143 = Temp: VP_DIE
entPhysicalDescr.50144 = Temp: RT-E_DIE
entPhysicalDescr.50145 = Temp: INLET_1
entPhysicalDescr.50146 = Temp: INLET_2
entPhysicalDescr.50147 = Temp: OUTLET_1
entPhysicalDescr.50148 = Temp: 3882_1
entPhysicalDescr.50149 = Temp: 3882_2
entPhysicalDescr.50150 = Temp: 3882_2A
entPhysicalDescr.50151 = Temp: 3882_2B
entPhysicalDescr.50152 = Temp: 3882_3
entPhysicalDescr.50153 = Temp: 3882_3A
entPhysicalDescr.50154 = Temp: 3882_3B
entPhysicalDescr.50155 = Temp: 3882_4
entPhysicalDescr.50156 = Temp: 3882_4A
entPhysicalDescr.50157 = Temp: 3882_4B
entPhysicalDescr.50158 = Temp: 3882_5
entPhysicalDescr.50159 = Temp: 3882_5A
entPhysicalDescr.50160 = Temp: 3882_5B
entPhysicalDescr.50161 = Temp: 3882_6
entPhysicalDescr.50162 = Temp: 3882_6A
entPhysicalDescr.50163 = Temp: 3882_6B
entPhysicalDescr.50164 = Temp: 3882_7
entPhysicalDescr.50165 = Temp: 3882_8
entPhysicalDescr.50166 = Temp: 3882_9
entPhysicalDescr.50167 = Temp: 3882_9A
entPhysicalDescr.50168 = Temp: 3882_9B
entPhysicalDescr.50169 = Temp: 3882_10
entPhysicalDescr.50170 = Temp: 3882_10A
entPhysicalDescr.50171 = Temp: 3882_10B
entPhysicalDescr.50172 = Temp: 3882_11
entPhysicalDescr.50173 = Temp: 3882_11A
entPhysicalDescr.50174 = Temp: 3882_11B
entPhysicalDescr.50175 = Temp: 8314_1
entPhysicalDescr.50176 = Temp: 8314_2
entPhysicalDescr.50177 = Temp: 3536_1A
entPhysicalDescr.50178 = Temp: 3536_1B
entPhysicalDescr.50179 = Temp: AS_DIE
entPhysicalDescr.50182 = SUP_dSUM
• Example entPhysicalDescr for temperature sensors for CBR-LC-8D30-16U30 installed in slot 7.
entPhysicalDescr.80000 = Cisco cBR CCAP Line Card
entPhysicalDescr.80001 = Cisco cBR CCAP Line Card
entPhysicalDescr.80014 = Temp: CAPRICA
entPhysicalDescr.80015 = Temp: BASESTAR
entPhysicalDescr.80016 = Temp: RAIDER
entPhysicalDescr.80017 = Temp: CPU
entPhysicalDescr.80018 = Temp: INLET
entPhysicalDescr.80019 = Temp: OUTLET
entPhysicalDescr.80020 = Temp: DIGITAL
entPhysicalDescr.80021 = Temp: UPX
entPhysicalDescr.80022 = Temp: LEOBEN1
entPhysicalDescr.80023 = Temp: LEOBEN2
Supported SNMP MIBs
The following SNMP MIBs are supported for thermal sensors:
• CISCO-ENVMON-MIB
• CISCO-ENTITY-ALARM-MIB
• ENTITY-SENSOR-MIB
• ENTITY-MIB
SUP_dSUM Alarm
The SUP_dSUM alarm is an alarm on the supervisor motherboard of CBR-SUP-250G, CBR-CCAP-SUP-160G, and CBR-CCAP-SUP-60G line cards. This alarm is a summation of values of the sensors spread across the motherboard. The SUP_dSUM alarm is triggered in cases such as open slots in a chassis that have not been filled properly after the cards were removed. This alarm is not a temperature sensor alarm but a warning to inspect the system. The SUP_dSUM alarm is displayed in the facility alarm status output and in the SNMP trap notifications.
Below is an example of the SUP_dSUM alarm:
router#sho facility-alarm status | inc SUP
SUP_dSUM R0/192 Jan 19 2016 10:48:32 Critical CHECK FOR OPEN SLOTS & BLOCKED AIR INTAKE [9]
Critical Thermal Sensor Identifiers and Temperature Limit Set-Points for Specific Line Cards
Note: This section contains only the important sensors and their alarm limits. It does not list all sensors that are displayed when you run a query with temperature as the criterion. All limits are in degrees Celsius.
CBR-SUP-250G: Specific Thermal Sensor Identifier and Temperature Limit Set-Point
Sensor Name          Minor  Major  Critical  Shutdown  System Response
Temp: VP_DIE         NA     60     65        72        Power Down Card
Temp: MB_IN_1        42     47     52        57        Power Down Card
Temp: MB_IN_2        42     47     52        57        Power Down Card
Temp: AS_DIE         82     90     95        103       Power Down Card
Temp: Y1_DIE         NA     60     65        72        Power Down Card
Temp: Y2_DIE         NA     60     65        72        Power Down Card
Temp: Y3_DIE         NA     60     65        72        Power Down Card
Temp: Y0_DIE         NA     60     65        72        Power Down Card
Temp: Falcon_DIE     NA     63     70        77        Just Alarm
SUP_dSUM             NA     NA     75        NA        Alarm/UI
Temp: MB_OUT_1       75     80     85        NA        Just Alarm
Temp: VP_CHIP        80     85     90        98        Power Down Card
Temp: Falcon_CHIP    80     85     90        98        Power Down Card
Temp: CPU_C0         NA     75     85        91        Power Down Card
Temp: CPU_C1         NA     75     85        91        Power Down Card
Temp: CPU_C2         NA     75     85        91        Power Down Card
Temp: CPU_C3         NA     75     85        91        Power Down Card
Temp: CPU_C4         NA     75     85        91        Power Down Card
Temp: CPU_C5         NA     75     85        91        Power Down Card
Temp: CPU_C6         NA     75     85        91        Power Down Card
Temp: CPU_C7         NA     75     85        91        Power Down Card
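To relate a measured temperature to these set-points, a monitoring script can compare the reading against each limit in turn. The following Python sketch is illustrative only. It assumes that an alarm level applies once the reading reaches or exceeds the limit, which this document does not state, and it represents NA as None. The function name classify is hypothetical; the limit values are the AS_DIE and VP_DIE rows for CBR-SUP-250G.

```python
# Alarm levels in increasing order of severity.
LEVELS = ("minor", "major", "critical", "shutdown")

def classify(temp_c, limits):
    """Return the highest level whose limit the reading has reached.

    Assumption: a level applies when temp_c >= limit. Limits that are
    NA in the table are given as None and are skipped.
    """
    level = "normal"
    for name in LEVELS:
        limit = limits.get(name)
        if limit is not None and temp_c >= limit:
            level = name
    return level

# Set-points for two CBR-SUP-250G sensors.
AS_DIE = {"minor": 82, "major": 90, "critical": 95, "shutdown": 103}
VP_DIE = {"minor": None, "major": 60, "critical": 65, "shutdown": 72}

print(classify(91, AS_DIE))  # major
print(classify(66, VP_DIE))  # critical
print(classify(50, VP_DIE))  # normal
```

Whether reaching the shutdown limit actually powers down a card still depends on the System Response for that sensor and on the thermal shutdown feature being enabled.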
CBR-CCAP-SUP-160G/CBR-CCAP-SUP-60G Motherboard Specific Thermal Sensor Identifier and Temperature Limit Set-Point
Card: CBR-CCAP-SUP-160G/CBR-CCAP-SUP-60G
IOS Sensor Name      Minor Limit  Major Limit  Critical Limit  Shutdown Limit  System Response at Shutdown Limit (If Enabled)
Temp: VP_DIE         NA           60           65              72              Power Down Card
Temp: MB_IN_1        42           47           52              57              Power Down Card
Temp: MB_IN_2        42           47           52              57              Power Down Card
Temp: AS_DIE         82           90           95              103             Power Down Card
Temp: Y0_DIE         NA           60           65              72              Power Down Card
Temp: BB_DIE         NA           60           65              72              Power Down Card
Temp: RT-E_DIE       NA           63           70              77              Power Down Card
Temp: MB_OUT_1       75           80           87              NA              Alarm Only
CBR-CCAP-SUP-160G Daughterboard Specific Thermal Sensor Identifier and Temperature Limit Set-Point
Card: CBR-CCAP-SUP-160G
IOS Sensor Name      Minor Limit  Major Limit  Critical Limit  Shutdown Limit  System Response at Shutdown Limit (If Enabled)
Temp: Y1_DIE         NA           60           65              72              Power Down Card
Temp: Y2_DIE         NA           60           65              72              Power Down Card
Temp: Y3_DIE         NA           60           65              72              Power Down Card
CBR-2X100G-PIC: Specific Thermal Sensor Identifier and Temperature Limit Set-Point
Sensor Name          Minor  Major  Critical  Shutdown  System Response
Temp: INLET          50     60     70        NA        Just alarm
Temp: RT_OUT         70     80     90        NA        Just alarm
Temp: MB_OUT         85     95     105       NA        Just alarm
Temp: SUP_OUT        55     65     75        NA        Just alarm
CBR-CCAP-LC-40G Specific Thermal Sensor Identifier and Temperature Limit Set-Point
Card: CBR-CCAP-LC-40G
IOS Sensor Name      Minor Limit  Major Limit  Critical Limit  Shutdown Limit  System Response at Shutdown Limit (If Enabled)
Temp: CAPRICA        NA           75           80              85              Power Down Card
Temp: BASESTAR       80           85           90              98              Power Down Card
Temp: RAIDER         80           85           90              98              Power Down Card
Temp: CPU            80           85           90              95              Power Down Card
Temp: INLET          42           47           52              57              Power Down Card
Temp: UPX            NA           70           75              NA              Alarm Only
Temp: LEOBEN1        NA           70           75              NA              Alarm Only
Temp: LEOBEN2        NA           73           78              85              Power Down Card
CBR-FAN-ASSEMBLY Specific Thermal Sensor Identifier and Temperature Limit Set-Point
Card: CBR-FAN-ASSEMBLY
IOS Sensor Name      Minor Limit  Major Limit  Critical Limit  Shutdown Limit  System Response at Shutdown Limit (If Enabled)
Temp: U17            60           65           70              NA              Alarm Only
Temp: U18            60           65           70              NA              Alarm Only
CBR-XX-PS Specific Thermal Sensor Identifier and Temperature Limit Set-Point
Card: CBR-XX-PS
IOS Sensor Name      Minor Limit  Major Limit  Critical Limit  Shutdown Limit  System Response at Shutdown Limit (If Enabled)
Temp: INLET          50           55           60              65              Alarm Only
Temp: OUTLET         60           65           70              NA              Alarm Only
CBR-DPIC-8X10G Specific Thermal Sensor Identifier and Temperature Limit Set-Point
Card: CBR-DPIC-8X10G
IOS Sensor Name            Minor Limit  Major Limit  Critical Limit  Shutdown Limit  System Response at Shutdown Limit (If Enabled)
Temp: INLET                55           60           75              NA              Alarm only
Temp: ZYNQ_OUTLET          70           80           90              NA              Alarm only
Temp: SWT_OUTLET           70           80           90              NA              Alarm only
Temp: PHY_OUTLET           70           80           90              NA              Alarm only
Temp: SFP_OUTLET/OUTLET    70           80           90              NA              Alarm only
© 2018 Cisco Systems, Inc. All rights reserved.