Troubleshooting High Availability - Cisco · Troubleshooting High Availability •...

Troubleshooting High Availability

• Node State Definitions

• Node States, Causes and Recommended Actions

Node State Definitions

The following table describes the different node states and the associated reasons. You can view the state of an existing node by viewing either the node details or the subcluster details on the Cluster Topology interface.

Note: These fields are displayed on the Cluster Topology interface only if you turn on High Availability in a subcluster.

State: Initializing
Description: This is the initial (transition) state when the Cisco Server Recovery Manager service starts; it is a temporary state.

State: Idle
Description: IM and Presence Service is in Idle state when failover occurs and services are stopped. In Idle state, the IM and Presence Service node does not provide any availability or Instant Messaging services. In Idle state, you can manually initiate a fallback to this node from the Cluster Topology interface.

State: Normal
Description: This is a stable state. The IM and Presence Service node is operating normally. In this state, you can manually initiate a failover to this node from the Cluster Topology interface.

State: Running in Backup Mode
Description: This is a stable state. The IM and Presence Service node is acting as the backup for its peer node. Users have moved to this (backup) node.

State: Taking Over
Description: This is a transition state. The IM and Presence Service node is taking over for its peer node.

Configuration and Administration of IM and Presence Service on Cisco Unified Communications Manager, Release 9.0(1)


State: Failing Over
Description: This is a transition state. The IM and Presence Service node is being taken over by its peer node.

State: Failed Over
Description: This is a stable state. The IM and Presence Service node has failed over, but no critical services are down. In this state, you can manually initiate a fallback to this node from the Cluster Topology interface.

State: Failed Over with Critical Services Not Running
Description: This is a stable state. Some of the critical services on the IM and Presence Service node have either stopped or failed.

State: Falling Back
Description: This is a transition state. The system is falling back to this IM and Presence Service node from the node running in Backup Mode.

State: Taking Back
Description: This is a transition state. The failed IM and Presence Service node is taking back over from its peer.

State: Running in Failed Mode
Description: An error occurs during the transition states or Running in Backup Mode state.

State: Unknown
Description: State unknown.
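The state definitions above can be captured as a small lookup. The following is a minimal Python sketch for illustration only, not part of the product: the names `NodeState`, `STABLE`, `TRANSITION`, and `manual_action` are hypothetical. It records only what the table states explicitly, so Idle, Running in Failed Mode, and Unknown, which the table does not label as stable or transition, are left unclassified.

```python
from enum import Enum
from typing import Optional


class NodeState(Enum):
    """Node states, with labels copied from the table above."""
    INITIALIZING = "Initializing"
    IDLE = "Idle"
    NORMAL = "Normal"
    RUNNING_IN_BACKUP_MODE = "Running in Backup Mode"
    TAKING_OVER = "Taking Over"
    FAILING_OVER = "Failing Over"
    FAILED_OVER = "Failed Over"
    FAILED_OVER_CRITICAL_DOWN = "Failed Over with Critical Services Not Running"
    FALLING_BACK = "Falling Back"
    TAKING_BACK = "Taking Back"
    RUNNING_IN_FAILED_MODE = "Running in Failed Mode"
    UNKNOWN = "Unknown"


# States the table explicitly calls "stable".
STABLE = frozenset({
    NodeState.NORMAL,
    NodeState.RUNNING_IN_BACKUP_MODE,
    NodeState.FAILED_OVER,
    NodeState.FAILED_OVER_CRITICAL_DOWN,
})

# States the table explicitly calls "transition" (Initializing is described
# as the initial, temporary transition state).
TRANSITION = frozenset({
    NodeState.INITIALIZING,
    NodeState.TAKING_OVER,
    NodeState.FAILING_OVER,
    NodeState.FALLING_BACK,
    NodeState.TAKING_BACK,
})


def manual_action(state: NodeState) -> Optional[str]:
    """Return the manual operation the table says is available in this state."""
    if state is NodeState.NORMAL:
        return "failover"
    if state in (NodeState.IDLE, NodeState.FAILED_OVER):
        return "fallback"
    return None  # the table mentions no manual operation for other states
```

For example, `manual_action(NodeState.IDLE)` yields `"fallback"`, matching the Idle description, while a transition state such as Taking Over yields `None`.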

Node States, Causes and Recommended Actions

The following table describes the node states, reasons, causes, and recommended actions for failed states.

Table 1: Node High Availability States, Causes and Recommended Actions

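Columns: Node 1 State, Node 1 Reason, Node 2 State, Node 2 Reason, Cause/Recommended Actions

Node 1 state: Normal
Node 1 reason: Normal
Node 2 state: Normal
Node 2 reason: Normal
Cause/Recommended Actions: High Availability is running on both nodes in the subcluster. The subcluster is running normally (it is in non-failover mode). The critical services on both nodes in the subcluster are running.

Node 1 state: Failing Over
Node 1 reason: On Admin Request
Node 2 state: Taking Over
Node 2 reason: On Admin Request
Cause/Recommended Actions: The administrator initiates a manual failover from node 1 to node 2. The manual failover is in progress.

Node 1 state: Idle
Node 1 reason: On Admin Request
Node 2 state: Running in Backup Mode
Node 2 reason: On Admin Request
Cause/Recommended Actions: The manual failover from node 1 to node 2 (initiated by the administrator) is complete.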



Node 1 state: Taking Back
Node 1 reason: On Admin Request
Node 2 state: Falling Back
Node 2 reason: On Admin Request
Cause/Recommended Actions: The administrator initiates a manual fallback from node 2 to node 1. The manual fallback is in progress.
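The manual failover and fallback rows form a cycle of (node 1, node 2) state pairs. The sketch below is an illustrative Python model of that cycle, not product code: `MANUAL_CYCLE` and `next_pair` are hypothetical names, and the return to Normal on both nodes after the fallback relies on the statement elsewhere in this table that both nodes are in Normal state once the taking back process is complete.

```python
from typing import List, Tuple

StatePair = Tuple[str, str]  # (node 1 state, node 2 state)

# State pairs in the order the manual failover and fallback rows describe them.
MANUAL_CYCLE: List[StatePair] = [
    ("Normal", "Normal"),                 # subcluster running normally
    ("Failing Over", "Taking Over"),      # manual failover in progress
    ("Idle", "Running in Backup Mode"),   # manual failover complete
    ("Taking Back", "Falling Back"),      # manual fallback in progress
]


def next_pair(current: StatePair) -> StatePair:
    """Return the pair that follows `current` in the manual cycle.

    Raises ValueError if `current` is not one of the pairs in the cycle.
    """
    index = MANUAL_CYCLE.index(current)
    return MANUAL_CYCLE[(index + 1) % len(MANUAL_CYCLE)]
```

Stepping from `("Taking Back", "Falling Back")` wraps around to `("Normal", "Normal")`. Any pair outside the cycle, such as the failed states later in the table, is rejected instead of being guessed at.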

Node 1 state: Idle
Node 1 reason: Initialization
Node 2 reason: On Admin Request
Cause/Recommended Actions: The administrator restarts the SRM service on node 1 while node 1 is in Idle state.

Node 1 state: Idle
Node 1 reason: Initialization
Node 2 state: Running in Backup Mode
Node 2 reason: Initialization
Cause/Recommended Actions: The administrator restarts both nodes in the subcluster, or restarts the SRM service on both nodes in the subcluster, while the subcluster was in manual failover mode (failover initiated by the administrator).

Node 1 state: Idle
Node 1 reason: On Admin Request
Node 2 state: Running in Backup Mode
Node 2 reason: Initialization
Cause/Recommended Actions: The administrator restarts the SRM service on node 2 while node 2 is running in backup mode, but before the heartbeat on node 1 times out.

Node 1 state: Failing Over
Node 1 reason: On Admin Request
Node 2 state: Taking Over
Node 2 reason: Initialization
Cause/Recommended Actions: The administrator restarts the SRM service on node 2 while node 2 is taking over, but before the heartbeat on node 1 times out.

Node 1 state: Taking Back
Node 1 reason: Initialization
Node 2 state: Falling Back
Node 2 reason: On Admin Request
Cause/Recommended Actions: The administrator restarts the SRM service on node 1 while node 1 is taking back, but before the heartbeat on node 2 times out. After the taking back process is complete, both nodes are in Normal state.

Node 1 state: Taking Back
Node 1 reason: Automatic Fallback
Node 2 state: Falling Back
Node 2 reason: Automatic Fallback
Cause/Recommended Actions: Automatic Fallback has been initiated from node 2 to node 1 and is currently in progress.

Node 1 state: Failed Over
Node 1 reason: Initialization or Critical Services Down
Node 2 reason: Critical Service Down
Cause/Recommended Actions: Node 1 transitions to Failed Over state when:
• Critical service(s) come back up due to a reboot of node 1, or
• The administrator starts critical service(s) on node 1 while node 1 is in "Failed Over with Critical Services Not Running" state
When node 1 transitions to Failed Over state, the node is ready for the administrator to perform a manual fallback to restore the nodes in the subcluster to Normal state.




Node 1 state: Failed Over with Critical Services Not Running
Node 1 reason: Critical Service Down
Cause/Recommended Actions: A critical service is down on node 1. IM and Presence performs an automatic failover to node 2.
Recommended Actions:
1. Check which critical services are down on node 1, and try to start these services manually.
2. If the critical services on node 1 do not start, reboot node 1.
3. After the reboot, and when all the critical services are running, perform a manual fallback to restore the nodes in the subcluster to Normal state.

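Node 1 state: Failed Over with Critical Services Not Running
Node 1 reason: Database Failure
Node 2 state: Running in Backup Mode
Node 2 reason: Database Failure
Cause/Recommended Actions: A database service is down on node 1. IM and Presence performs an automatic failover to node 2.
Recommended Action:
1. Reboot node 1.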

Node 1 state: Running in Failed Mode
Node 1 reason: Start of Critical Services Failed
Node 2 state: Running in Failed Mode
Node 2 reason: Start of Critical Services Failed
Cause/Recommended Actions: Critical services fail to start while a node in the subcluster is taking back from the other node.
Recommended Actions (on the node that is taking back):
1. Check which critical services are down on the node. To start these services manually, select Recovery on the subcluster details screen.
2. If the critical services do not start, reboot the node.




Reason: Critical Service Down
Cause/Recommended Actions: Critical services go down while a node in the subcluster is running in backup mode for the other node.
1. Check which critical services are down on the backup node. To start these services manually, select Recovery on the subcluster details screen.
2. If the critical services do not start, reboot the subcluster.


Node 1: Node 1 is down due to loss of network connectivity, or the SRM service is not running.
Node 2 state: Running in Backup Mode
Node 2 reason: Peer Down
Cause/Recommended Actions: Node 2 has lost its heartbeat with node 1. IM and Presence performs an automatic failover to node 2.
Recommended Action:
(If node 1 is up)
1. Check and repair the network connectivity between the nodes in the subcluster. When you reestablish the network connection between the nodes, the node may go into a failed state. Select Recovery on the subcluster details screen to restore the nodes in the subcluster to Normal state.
2. Start the SRM service, and perform a manual fallback to restore the nodes in the subcluster to Normal state.
(If node 1 is down)
3. Repair or power up node 1.
4. When the node is up and all critical services are running, perform a manual fallback to restore the nodes in the subcluster to Normal state.
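The Peer Down actions branch on whether node 1 is still up. The following Python sketch restates that branch as an ordered checklist. It is an illustration under stated assumptions, not a tool that talks to the product: `peer_down_steps` and its `node1_is_up` parameter are hypothetical, and the step text is paraphrased from the row above.

```python
from typing import List

# Final step shared by both branches of the Peer Down row.
FALLBACK = ("Perform a manual fallback to restore the nodes in the "
            "subcluster to Normal state.")


def peer_down_steps(node1_is_up: bool) -> List[str]:
    """Return the recommended steps for the Peer Down reason, in order."""
    if node1_is_up:
        # Node 1 is running, so the likely problem is the network or the SRM service.
        return [
            "Check and repair the network connectivity between the nodes.",
            "If a node goes into a failed state after the connection is "
            "reestablished, select Recovery on the subcluster details screen.",
            "Start the SRM service.",
            FALLBACK,
        ]
    # Node 1 is down, so bring it back before falling back.
    return [
        "Repair or power up node 1.",
        "Wait until node 1 is up and all critical services are running.",
        FALLBACK,
    ]
```

Both branches end with the same manual fallback, which is why it is factored into a single constant.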




Node 1: Node 1 is down (due to possible power down, hardware failure, shutdown, or reboot).
Node 2 state: Running in Backup Mode
Node 2 reason: Peer Reboot
Cause/Recommended Actions: IM and Presence performs an automatic failover to node 2 due to a possible hardware failure, power down, restart, or shutdown of node 1.
Recommended Action:
1. Repair or power up node 1.
2. When the node is up and all critical services are running, perform a manual fallback to restore the nodes in the subcluster to Normal state.

Node 1 state: Failed Over with Critical Services Not Running, or Failed Over
Node 1 reason: Initialization
Node 2 state: Backup Mode
Node 2 reason: Peer Down During Initialization
Cause/Recommended Actions: Node 2 does not see node 1 during startup.
Recommended Action: When node 1 is up and all critical services are running, perform a manual fallback to restore the nodes in the subcluster to Normal state.

Node 1 reason: Cisco Server Recovery Manager Take Over Users Failed
Node 2 reason: Cisco Server Recovery Manager Take Over Users Failed
Cause/Recommended Actions: User move fails during the taking over process.
Recommended Action: Possible database error. Select Recovery on the subcluster details screen. If that does not resolve the issue, reboot the subcluster.


Node 1 reason: Cisco Server Recovery Manager Take Back Users Failed
Node 2 reason: Cisco Server Recovery Manager Take Back Users Failed
Cause/Recommended Actions: User move fails during the falling back process.
Recommended Action: Possible database error. Select Recovery on the subcluster details screen. If that does not resolve the issue, reboot the subcluster.


Node 1 state: Running in Failed Mode
Node 1 reason: Unknown
Node 2 state: Running in Failed Mode
Node 2 reason: Unknown
Cause/Recommended Actions: The SRM on a node restarts while the SRM on the other node is in a failed state, or an internal system error occurs.
Recommended Action: Select Recovery on the subcluster details screen. If that does not resolve the issue, reboot the subcluster.
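The three rows above (Take Over Users Failed, Take Back Users Failed, and Unknown) share the same two-step escalation: try Recovery first, then reboot the subcluster. The Python sketch below records that shared pattern. It is illustrative only: `RECOVERY_THEN_REBOOT` and `escalation_steps` are hypothetical names, and reasons from other rows are deliberately rejected because their recommended actions differ.

```python
from typing import List

# Reasons whose recommended action in the table is "Recovery, then reboot
# the subcluster". Reason strings are copied from the rows above.
RECOVERY_THEN_REBOOT = frozenset({
    "Cisco Server Recovery Manager Take Over Users Failed",
    "Cisco Server Recovery Manager Take Back Users Failed",
    "Unknown",
})


def escalation_steps(reason: str) -> List[str]:
    """Return the two-step escalation for a reason covered by this pattern.

    Raises KeyError for any other reason, since other rows in the table
    recommend different actions.
    """
    if reason not in RECOVERY_THEN_REBOOT:
        raise KeyError(reason)
    return [
        "Select Recovery on the subcluster details screen.",
        "If that does not resolve the issue, reboot the subcluster.",
    ]
```

Raising on unlisted reasons keeps the sketch from suggesting a subcluster reboot where the table calls for a narrower action, such as rebooting only node 1.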




Node 1 state: Backup Activated
Node 1 reason: Auto Recover Database Failure
Node 2 state: Failover Affected Services
Node 2 reason: Auto Recovery Database Failure
Cause/Recommended Actions: The database goes down on the backup node. The peer node is in failover mode and can take over for all users in the subcluster. An auto-recovery operation occurs automatically, and all users are moved over to the primary node.

Node 1 state: Backup Activated
Node 1 reason: Auto Recover Database Failure
Node 2 state: Failover Affected Services
Node 2 reason: Auto Recover Critical Service Down
Cause/Recommended Actions: A critical service goes down on the backup node. The peer node is in failover mode and can take over for all users in the subcluster. An auto-recovery operation occurs automatically, and all users are moved over to the peer node.

