GLOBE ISIT OASIS2 Cluster Administration and Troubleshooting



Cluster Administration and Troubleshooting Guide

Version Number (1.0) January 5, 2004

Author: Fabian SIRACH, Microsoft Services, France

GLOBE ISIT OASIS2


2001 Nestec Ltd. – GLOBE – Global Business Excellence. http://veviis01.nestec.ch/GLOBE/

Proprietary document not to be divulged outside the Company.

Printed by Nestec Ltd., CH-1800 Vevey, Switzerland.

GL-GLOBE


Document Control

Document Owner: Fabian SIRACH

Review Cycle in months: 12

Date of update (dd/mm/yyyy) | Updated by (author name) | Changes made (section numbers and description) | Version # | Status of document
05/01/2004 | Fabian SIRACH | Draft Creation | 1.0 | Draft
03/02/2004 | Fabian SIRACH | Revision | 1.1 | Final


Table of Contents

DOCUMENT CONTROL

TABLE OF CONTENTS

1 PREFACE

1.1 DOCUMENT AUDIENCE
1.2 PREREQUISITES AND RELATED DOCUMENTATION

2 CLUSTER ADMINISTRATION

2.1 RESOURCES MANAGEMENT
2.2 GROUP MANAGEMENT
2.3 NODE MANAGEMENT
2.4 APPLYING SERVICE PACK AND HOTFIX
2.5 CHKDSK AND AUTOCHK
2.6 CLUSTER COMMAND LINE

3 TROUBLESHOOTING

3.1 ONE NODE IS DOWN
3.2 ENTIRE CLUSTER IS DOWN
3.3 ONE OR MORE SERVERS QUIT RESPONDING
3.4 CLUSTER SERVICE DOES NOT START
3.5 CLUSTER SERVICE STARTS BUT CLUSTER ADMINISTRATOR WILL NOT CONNECT
3.6 CLUSTER ADMINISTRATOR STOPS RESPONDING ON FAILOVER
3.7 GROUP/RESOURCES FAILOVER PROBLEMS
3.8 QUORUM RESOURCES FAILURE
3.9 NETWORK NAME RESOURCE DOES NOT GO ONLINE
3.10 PHYSICAL DISK RESOURCE PROBLEM
3.11 CLIENT CONNECTIVITY PROBLEM

3.11.1 Clients have intermittent Connectivity Based on Group Ownership
3.11.2 Clients do not Have any Connection with the Cluster
3.11.3 Clients have Problems Accessing Data Through a File Share
3.11.4 Client Experience Intermittent Access

4 APPENDIX: MSCS EVENT MESSAGES

4.1 EVENT ID 1000
4.2 EVENT ID 1002
4.3 EVENT ID 1006
4.4 EVENT ID 1007
4.5 EVENT ID 1009
4.6 EVENT ID 1010
4.7 EVENT ID 1011
4.8 EVENT ID 1012
4.9 EVENT ID 1015


4.10 EVENT ID 1016
4.11 EVENT ID 1019
4.12 EVENT ID 1021
4.13 EVENT ID 1022
4.14 EVENT ID 1023
4.15 EVENT ID 1024
4.16 EVENT ID 1034
4.17 EVENT ID 1035
4.18 EVENT ID 1036
4.19 EVENT ID 1037
4.20 EVENT ID 1038
4.21 EVENT ID 1040
4.22 EVENT ID 1041
4.23 EVENT ID 1042
4.24 EVENT ID 1043
4.25 EVENT ID 1044
4.26 EVENT ID 1045
4.27 EVENT ID 1046
4.28 EVENT ID 1047
4.29 EVENT ID 1048
4.30 EVENT ID 1049
4.31 EVENT ID 1050
4.32 EVENT ID 1051
4.33 EVENT ID 1052
4.34 EVENT ID 1053
4.35 EVENT ID 1054
4.36 EVENT ID 1055
4.37 EVENT ID 1056
4.38 EVENT ID 1057
4.39 EVENT ID 1058
4.40 EVENT ID 1059
4.41 EVENT ID 1061
4.42 EVENT ID 1062
4.43 EVENT ID 1063
4.44 EVENT ID 1064
4.45 EVENT ID 1065
4.46 EVENT ID 1066
4.47 EVENT ID 1067
4.48 EVENT ID 1068
4.49 EVENT ID 1069
4.50 EVENT ID 1070
4.51 EVENT ID 1071
4.52 EVENT ID 1073
4.53 EVENT ID 1077
4.54 EVENT ID 1080


4.55 EVENT ID 1093
4.56 EVENT ID 1096
4.57 EVENT ID 1097
4.58 EVENT ID 1098
4.59 EVENT ID 1100
4.60 EVENT ID 1102
4.61 EVENT ID 1104
4.62 EVENT ID 1105
4.63 EVENT ID 1107
4.64 EVENT ID 1109
4.65 EVENT ID 1115

5 APPENDIX: RELATED EVENT MESSAGES

5.1 EVENT ID 9
5.2 EVENT ID 101
5.3 EVENT ID 1004
5.4 EVENT ID 1005
5.5 EVENT ID 2511
5.6 EVENT ID 4199
5.7 EVENT ID 5719
5.8 EVENT ID 7000
5.9 EVENT ID 7013
5.10 EVENT ID 7023

6 APPENDIX: MAINTENANCE TOOLS

6.1 WINDOWS 2000 TOOLS
6.2 WINDOWS 2000 RESOURCE KIT TOOLS

7 APPENDIX: USING AND READING THE CLUSTER LOGFILE

7.1 CLUSTERLOG ENVIRONMENT VARIABLE
7.2 OPERATING SYSTEM VERSION NUMBER AND SERVICE PACK LEVEL
7.3 CLUSTER SERVICE STARTUP
7.4 LOGFILE ENTRIES FOR COMMON FAILURES

8 Appendix: Q258078 Cluster Service Startup Options


1 Preface

This document describes the general operations required to administer and troubleshoot the Windows 2000 Cluster Service for the NESTLE Enterprise Portal on the Windows 2000 operating system.

1.1 Document Audience

Windows 2000 System Administrators

1.2 Prerequisites and Related Documentation

Good knowledge of Microsoft Windows 2000 Advanced Server

Good knowledge of Microsoft Windows 2000 Cluster Service

W2K_GEO_SERVICE_CHECK Procedure

W2K_GEO_SERVICE_FAILOVER Procedure

W2K_GEO_SERVICE_SHUTDOWN Procedure

W2K_GEO_SERVICE_TAKEOVER Procedure

W2K_Cluster Server_Manual Takeover Procedure

W2K_Operating System_Restart Service Procedure

W2K_Server Status_Check Patch Level Procedure

W2K_Software_Apply Hotfix Procedure

Operation Runbook W2K GeoCluster Document

Operation Runbook W2K Document

document.doc

Author: Fabian SIRACH

Page 7 of 69 Created: 06.01.2004

Printed: 22.08.2002 17:39


2 Cluster Administration

2.1 Resources Management

Bring a resource online:

1. Open Cluster Administrator (click Start, point to Programs, point to Administrative Tools, and then click Cluster Administrator).

2. In the console tree, click the Resources folder.

3. In the details pane, click the resource you want.

4. On the File menu, click Bring Online.

Refer to

W2K_GEO_SERVICE_FAILOVER Procedure

W2K_GEO_SERVICE_TAKEOVER Procedure

Take a resource offline:

1. Open Cluster Administrator (click Start, point to Programs, point to Administrative Tools, and then click Cluster Administrator).

2. In the console tree, click the Resources folder.

3. In the details pane, click the resource you want.

4. On the File menu, click Take Offline.

Remark: taking a resource offline causes all resources that depend on that resource to be taken offline.

Refer to

W2K_GEO_SERVICE_FAILOVER Procedure

W2K_GEO_SERVICE_TAKEOVER Procedure
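
The remark about dependent resources can be illustrated with a small Python sketch. It is not part of the cluster tooling: the resource names and the dependency map are hypothetical, and the function only models the rule stated in the remark (taking a resource offline also takes offline everything that depends on it, directly or indirectly).

```python
from collections import deque


def dependents_of(resource, depends_on):
    """Return every resource that depends on `resource`, directly or indirectly.

    `depends_on` maps a resource name to the set of resources it depends on.
    """
    # Invert the map so we can look up "who depends on X".
    dependents = {}
    for res, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(res)

    found = set()
    queue = deque([resource])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, ()):
            if child not in found:
                found.add(child)
                queue.append(child)
    return found


# Hypothetical group: a file share depends on a network name and a disk,
# and the network name depends on an IP address.
EXAMPLE = {
    "File Share": {"Network Name", "Disk S:"},
    "Network Name": {"IP Address"},
    "IP Address": set(),
    "Disk S:": set(),
}

print(sorted(dependents_of("IP Address", EXAMPLE)))
# prints ['File Share', 'Network Name']
```

In this example, taking the IP address offline would also take the network name and the file share offline, while the disk would stay online.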


2.2 Group Management

Bring a group online:

1. Open Cluster Administrator (click Start, point to Programs, point to Administrative Tools, and then click Cluster Administrator).

2. In the console tree, double-click the Groups folder.

3. In the details pane, click the group you want.

4. On the File menu, click Bring Online.

Take a group offline:

1. Open Cluster Administrator (click Start, point to Programs, point to Administrative Tools, and then click Cluster Administrator).

2. In the console tree, double-click the Groups folder.

3. In the details pane, click the group you want.

4. On the File menu, click Take Offline.

Remark: resources in a group go offline in the order of their dependencies.

Move a group to another node:

1. Open Cluster Administrator (click Start, point to Programs, point to Administrative Tools, and then click Cluster Administrator).

2. In the console tree, double-click the Groups folder.

3. In the details pane, click the group you want.

4. On the File menu, click Move Group.

Remark: after the transfer, the new node owns all resources in the group, as the Owner column in the details pane should reflect.
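
The group operations above can also be issued from the command line with the cluster tool described in section 2.6. The following Python sketch only builds the argument lists and executes nothing. The cluster, group, and node names are hypothetical placeholders, and the helper name group_command is an assumption made for illustration.

```python
def group_command(cluster, group, action, node=None):
    """Build a cluster.exe argument list for a group operation.

    action is "online", "offline" or "move". For "move", an optional
    target node can be given; without it the group is simply moved.
    """
    args = ["cluster", cluster, "group", group]
    if action == "online":
        args.append("/online")
    elif action == "offline":
        args.append("/offline")
    elif action == "move":
        args.append("/moveto:" + node if node else "/move")
    else:
        raise ValueError("unknown action: " + action)
    return args


# Hypothetical names, for illustration only.
print(group_command("CLUSTER1", "Portal Group", "move", node="NODE2"))
# prints ['cluster', 'CLUSTER1', 'group', 'Portal Group', '/moveto:NODE2']
```

Passing the list to subprocess.run on a cluster node would run the command. Keeping each argument as a separate list element avoids quoting problems with group names that contain spaces.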


Specify preferred owners of a group:

1. Open Cluster Administrator (click Start, point to Programs, point to Administrative Tools, and then click Cluster Administrator).

2. In the console tree, click the Groups folder.

3. In the details pane, click the group you want.

4. On the File menu, click Properties.

5. On the General tab, next to Preferred owners, click Modify.

6. In the Modify Preferred Owners dialog box, enter any changes you want to make:

1. To add one or more preferred owners, under Available nodes (not under Preferred owners), click the nodes you want to add, and then click the right arrow.

2. To remove a preferred owner, under Preferred owners, click the nodes you want to remove, and then click the left arrow.

3. To change the priority of a preferred owner, click the node, and then click the up or down arrow.

Set group failover policy:

1. Open Cluster Administrator (click Start, point to Programs, point to Administrative Tools, and then click Cluster Administrator).

2. In the console tree, click the Groups folder.

3. In the details pane, click the group you want.

4. On the File menu, click Properties.

5. On the Failover tab, type values for Threshold and Period.

Remark: the failover policy for a group is the maximum number of times (Threshold) that the group is allowed to fail over in the specified number of hours (Period) before it is taken completely offline. If a group fails over more often than this, the Cluster service leaves it offline.
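
The interaction between Threshold and Period can be made concrete with a small Python sketch. This is a simplified model of the rule stated in the remark, not the Cluster service's actual accounting: it assumes failover times are given in hours and that the group is left offline as soon as more than Threshold failovers fall within any window of Period hours.

```python
def stays_online(failover_hours, threshold, period_hours):
    """Return False if more than `threshold` failovers occur within
    any sliding window of `period_hours` hours (simplified model)."""
    times = sorted(failover_hours)
    start = 0
    for end, t in enumerate(times):
        # Drop failovers that are no longer inside the current window.
        while t - times[start] >= period_hours:
            start += 1
        if end - start + 1 > threshold:
            return False
    return True


# Threshold 2, Period 6 hours (illustrative values).
print(stays_online([0, 1, 2], threshold=2, period_hours=6))  # prints False
print(stays_online([0, 3, 7], threshold=2, period_hours=6))  # prints True
```

In the first call, three failovers happen within two hours, which exceeds the threshold, so the group would be left offline. In the second call, no six-hour window contains more than two failovers.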


Set group failback policy:

1. Open Cluster Administrator (click Start, point to Programs, point to Administrative Tools, and then click Cluster Administrator).

2. In the console tree, click the Groups folder.

3. In the details pane, click the group you want.

4. On the File menu, click Properties.

5. On the Failback tab, click Prevent failback or Allow failback.

If you click Allow failback, then either click Immediately, or click Failback between and set the time interval.

Remark: to set the time interval for Failback between, enter numbers between 0 and 23 for the beginning and end of the interval. If the first number is greater than the second, the interval will end on the following day. The numbers correspond to the local time of the cluster group, as read on a 24-hour clock.
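
The wrap-around rule for the Failback between interval can be sketched in Python as follows. The function name is hypothetical, and the treatment of the end hour is an assumption: the sketch includes the start hour and excludes the end hour, which the text above does not specify.

```python
def in_failback_window(hour, start, end):
    """Return True if `hour` (0-23, local time) lies in the failback window.

    Assumption: the window includes `start` and excludes `end`.
    If start > end, the window runs past midnight into the next day.
    """
    for value in (hour, start, end):
        if not 0 <= value <= 23:
            raise ValueError("hours must be between 0 and 23")
    if start <= end:
        return start <= hour < end
    # Window wraps past midnight, for example 22 to 4.
    return hour >= start or hour < end


print(in_failback_window(23, 22, 4))  # prints True
print(in_failback_window(12, 22, 4))  # prints False
```

With an interval of 22 to 4, failback would be allowed at 23:00 and at 03:00 but not at noon.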

2.3 Node Management

Stopping the Cluster service:

1. Open Cluster Administrator (click Start, point to Programs, point to Administrative Tools, and then click Cluster Administrator).

2. In the console tree, click the node.

3. On the File menu, click Stop Cluster Service.

Remark: stopping the Cluster service on a node prevents clients from accessing cluster resources through that node, and all groups move to the other node (if the failover policies allow it).

Refer to

W2K_GEO_SERVICE_CHECK Procedure

W2K_GEO_SERVICE_FAILOVER Procedure

W2K_GEO_SERVICE_SHUTDOWN Procedure

Starting the Cluster service:


1. Open Cluster Administrator (click Start, point to Programs, point to Administrative Tools, and then click Cluster Administrator).

2. In the console tree, click the node.

3. On the File menu, click Start Cluster Service.
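
On the local node, the Cluster service can also be stopped and started from a command prompt with net stop and net start, using its service name ClusSvc. The Python sketch below wraps these two commands. The function names are hypothetical, and the runner is injectable so that the logic can be tried without touching a real service.

```python
import subprocess


def service_command(action):
    """Build the net.exe command that starts or stops the Cluster service."""
    if action not in ("start", "stop"):
        raise ValueError("action must be 'start' or 'stop'")
    return ["net", action, "clussvc"]


def restart_cluster_service(run=subprocess.run):
    """Stop and then start the Cluster service on the local node.

    With the default runner, check=True raises an error if either
    command fails, so a failed stop is not followed by a start.
    """
    for action in ("stop", "start"):
        run(service_command(action), check=True)


print(service_command("stop"))  # prints ['net', 'stop', 'clussvc']
```

As with Cluster Administrator, stopping the service this way moves the groups to the other node if the failover policies allow it, so the same precautions apply.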

Installing Cluster Service when the other node is online:

1. Open Add/Remove Programs (click Start, point to Settings, click Control Panel, and then double-click Add/Remove Programs).

2. Click Add/Remove Windows Components.

3. The Welcome to the Windows Components wizard will begin.

4. In Components, select Cluster Service.

5. Click Next.

6. Cluster Service files are located on the Windows 2000 Advanced Server CD-ROM. Enter Z:\i386.

7. Click OK.

8. Click Next.

9. Click I Understand to accept the condition that Cluster Service is supported on hardware from the Hardware Compatibility List only.

10. In the Create or Join a Cluster dialog, select The second or next node in the cluster, and click Next.

11. Enter the cluster name and click Next.

12. Leave Connect to cluster unchecked. The Cluster Service Configuration wizard will automatically supply the name of the user account selected during the installation of the first node. Always use the same account as you used when setting up the first cluster node.

13. Enter the password for the account and click Next.

14. At the next dialog box, click Finish to complete configuration.

15. The Cluster Service will start. Click OK.

16. Close Add/Remove Programs.


To validate the cluster installation, do the following:

1. Click Start, click Programs, click Administrative Tools, and click Cluster Administrator.

2. The presence of two nodes shows that a cluster exists and is in operation.

Removing Cluster Service:

1. Open Add/Remove Programs (click Start, point to Settings, click Control Panel, and then double-click Add/Remove Programs).

2. Click Add/Remove Windows Components.

3. The Welcome to the Windows Components wizard will begin.

4. Click Next.

5. In Components, click to clear Cluster Service, and then click Next.

Enable diagnostic logging:

1. Open Control Panel (click Start, point to Settings, and then click Control Panel).

2. Double-click System.

3. On the Advanced tab, click Environment Variables.

4. Under System variables, click New.

5. In Variable Name, specify the name of the variable. In Variable Value, specify the name of the diagnostic log file.

For example, set Variable Name to Clusterlog and Variable Value to C:\Temp\Cluster.log.

6. Click OK, and then click OK again. Close Control Panel.

7. Stop and restart the Cluster service.
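
To confirm that the variable is visible to newly started processes, a short Python check can look it up. This is only a sketch: the function name is hypothetical, and it assumes the variable is named Clusterlog as in the example above. Windows treats environment variable names as case-insensitive, so the lookup ignores case.

```python
import os


def cluster_log_path(environ=None):
    """Return the value of the Clusterlog variable, or None if it is not set.

    `environ` defaults to the environment of the current process.
    """
    env = os.environ if environ is None else environ
    for name, value in env.items():
        if name.lower() == "clusterlog" and value:
            return value
    return None


# Illustrative environment using the example value from the steps above.
print(cluster_log_path({"Clusterlog": r"C:\Temp\Cluster.log"}))
# prints C:\Temp\Cluster.log
```

A result of None from a command prompt opened after step 6 would suggest that the variable was not saved under System variables.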

Disable diagnostic logging:


1. Open Control Panel (click Start, point to Settings, and then click Control Panel).

2. Double-click System.

3. On the Advanced tab, click Environment Variables.

4. Under System variables, select Clusterlog.

5. Click Delete, click OK, and then click OK again. Close Control Panel.

6. Stop and restart the Cluster service.

2.4 Applying Service Pack and Hotfix

Refer to document “GLOBE ISIT OASIS2 Applying Hotfixes”

2.5 Chkdsk and Autochk

Disks that are attached to the shared bus interact differently with Chkdsk.exe than with Autochk.exe. Autochk.exe, the system startup version of Chkdsk.exe, does not perform file system checks on shared drives when the system starts, even if the operations are required. Instead, the Cluster service performs a file system integrity test for each drive when it brings a physical disk online, and it starts Chkdsk automatically if necessary. If you have to run Chkdsk on a drive, see the following articles in the Microsoft Knowledge Base:

174617 CHKDSK Runs While Running Microsoft Cluster Server Setup

176970 How to Run the CHKDSK /F Command on a Shared Cluster Disk


2.6 Cluster Command Line

Using Cluster tool:

With this tool, you can perform every operation that can be done with Cluster Administrator.

To verify the state of the nodes: type cluster <cluster name> node

To verify the state of the resources: type cluster <cluster name> resource (append <resource name> to check a single resource)

To verify the state of the network adapters: type cluster <cluster name> network

To move a group: type cluster <cluster name> group <group name> /moveto:<node name>
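
The status commands above lend themselves to a small wrapper that gathers the three views in one pass. The Python sketch below is an illustration under stated assumptions: the cluster name is a placeholder, the function names are hypothetical, and the output of cluster.exe is returned as raw text because its exact layout is not described in this guide.

```python
import subprocess


def status_commands(cluster):
    """Return the cluster.exe argument lists for the three status checks."""
    return {
        "nodes": ["cluster", cluster, "node"],
        "resources": ["cluster", cluster, "resource"],
        "networks": ["cluster", cluster, "network"],
    }


def run_cli(args):
    """Run a command and return its standard output as text."""
    result = subprocess.run(args, capture_output=True, text=True)
    return result.stdout


def collect_status(cluster, run=run_cli):
    """Run each status command and return the raw output per label."""
    return {label: run(args) for label, args in status_commands(cluster).items()}


# Hypothetical cluster name, for illustration only.
print(status_commands("CLUSTER1")["nodes"])
# prints ['cluster', 'CLUSTER1', 'node']
```

Calling collect_status on a machine where cluster.exe is installed would return the text of each report. The run parameter allows a stand-in function to be supplied when trying the sketch elsewhere.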


3 Troubleshooting

3.1 One node is Down

General Information:

Before troubleshooting, if a single node is unavailable, make sure that resources and groups are available on the other node.

If a node is online, gather information about the failure:

1. Check the event logs on the online node (the meaning of the event log messages is given in the Appendix).

2. Check the cluster diagnostic logfile.

3. Check for the existence of a recent Memory.dmp file that may have been created from a recent crash. If necessary, contact Microsoft Product Support Services for assistance with this file.

4. Go to the paragraph corresponding to the failure.

Symptoms and solutions:

Symptom: Second node cannot join the cluster.

Cause: You may not be using the proper cluster name, node name, or IP address.
Solution: Confirm that you are using the proper cluster name, node name, or IP address.

Cause: The Cluster Name resource may not have started.
Solution: Confirm that the Cluster Name resource started.

Cause: The Cluster service may not be running on the first node.
Solution: Confirm that the Cluster service is running on the first node and that all resources within the Cluster Group are online before installing the second node.

Cause: Network connectivity may not exist between the two nodes.
Solution: Confirm that network connectivity exists between the two nodes.

Cause: You may not have IP connectivity to the cluster address.
Solution: Confirm that you have IP connectivity to the cluster address and that the IP address is assigned to the correct network.


Symptom: Second node cannot connect to the cluster drives.

Cause: The same drive letters may not have been assigned to the cluster drives on all nodes.
Solution: Confirm that the cluster drives are assigned the same drive letters on all nodes.

Cause: The SCSI devices may not have unique IDs.
Solution: Verify that each SCSI device has a unique ID. SCSI controller IDs are preset to seven. Reset one SCSI controller ID to six.

Cause: The second node may not be physically connected to the cluster drive.
Solution: Confirm that the second node is physically connected to the cluster drive. If it is not, shut down both nodes and the cluster drive. Connect the nodes to the shared SCSI bus. Then, start the cluster drive and start the first node. After the Cluster service starts on the first node, start the second node, and attempt to connect to the cluster drive.

Cause: The SCSI controllers on the shared SCSI bus may not be correctly configured.
Solution: Confirm that the SCSI controllers on the shared SCSI bus are correctly configured (with both cards configured to transfer data at the same rate).

Cause: The devices and controllers may not match.
Solution: Confirm that your devices and controllers match.
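
The unique-ID requirement on the shared SCSI bus can be checked mechanically once the IDs have been noted down. The Python sketch below assumes the IDs were collected by hand into a dictionary. The device names are hypothetical, and the function does not query any hardware.

```python
from collections import defaultdict


def scsi_id_conflicts(devices):
    """Return {scsi_id: [device names]} for every ID used more than once.

    `devices` maps a device name to its SCSI ID on the shared bus.
    """
    by_id = defaultdict(list)
    for name, scsi_id in devices.items():
        by_id[scsi_id].append(name)
    return {i: sorted(names) for i, names in by_id.items() if len(names) > 1}


# Both controllers left at the preset ID of seven (hypothetical inventory).
shared_bus = {"Node1 controller": 7, "Node2 controller": 7, "Shared disk": 0}
print(scsi_id_conflicts(shared_bus))
# prints {7: ['Node1 controller', 'Node2 controller']}

# After resetting one controller to six, no conflict remains.
shared_bus["Node2 controller"] = 6
print(scsi_id_conflicts(shared_bus))
# prints {}
```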


3.2 Entire Cluster is Down

Before troubleshooting, try to bring at least one node online. If you can achieve this goal, the effect on users may be substantially reduced.

If a node is online or at least one server can be started, gather information about the failure:

1. Check the event logs on the online node (the meaning of the event log messages is given in the Appendix).

2. Check the cluster diagnostic logfile.

3. Check for the existence of a recent Memory.dmp file that may have been created by a recent crash. If necessary, contact Microsoft Product Support Services for assistance with this file.

4. Go to the paragraph corresponding to the identified failure.

If no server can be restarted, restore both nodes as a last resort (this procedure is defined in the Disaster Recovery Guide).
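
The check for a recent Memory.dmp file (step 3 above) can be sketched as follows. The seven-day window and the dump location under %SystemRoot% are assumptions; the dump path is configurable in Windows and may differ on the OASIS2 nodes.

```python
import os
import time


def is_recent_file(path, max_age_days=7, now=None):
    """Return True if path exists and was modified within max_age_days."""
    if not os.path.isfile(path):
        return False
    now = time.time() if now is None else now
    age_seconds = now - os.path.getmtime(path)
    return age_seconds <= max_age_days * 86400


# Assumed dump location; adjust if the node writes its dump elsewhere.
system_root = os.environ.get("SystemRoot", r"C:\WINNT")
dump_path = os.path.join(system_root, "Memory.dmp")
print(is_recent_file(dump_path))
```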

3.3 One or More Servers Quit Responding

If one or more servers are not responding but have not crashed or otherwise failed, the problem may be related to:

- Domain controller connectivity: check network connectivity with the domain controllers and look for other network problems (use the ping command with the DC name and the DC IP address).

- Configuration, software, or driver issues.

- Fiber Channel connectivity.

- Connected disk devices.

For these problems, go to the corresponding chapter.
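
Pinging a domain controller both by name and by IP address helps separate name resolution problems from plain network problems. The sketch below shows one way to read the two results; the interpretation is an assumption based on general TCP/IP troubleshooting practice, not a rule stated in this guide.

```python
def interpret_dc_ping(name_reachable, ip_reachable):
    """Suggest where to look next from two ping results (by name, by IP)."""
    if name_reachable and ip_reachable:
        return "DC reachable by name and IP address: look at the other causes"
    if ip_reachable:
        return "IP address answers but name does not: suspect name resolution"
    if name_reachable:
        return "name answers but IP address does not: verify the IP address used"
    return "no answer by name or IP address: suspect network connectivity"


# Hypothetical outcome: the DC answers on its IP address only.
print(interpret_dc_ping(name_reachable=False, ip_reachable=True))
```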

3.4 Cluster Service Does Not Start

1. Check the event log messages and look them up in the Appendix.

2. Determine if the issue comes from the service account used for Cluster Service.

Information: failures related to the service account may result in Event ID 7000 or Event ID 7013 errors in the event log. In addition, you may receive the following error message:


"Could not start the Cluster Service on \\computername. Error 1069: The service did not start because of a logon failure."

Make sure the account is not disabled and that password expiration is not a factor.

This domain account needs to be a member of the local Administrators group on each server.

The account needs the Log on as a service and Lock pages in memory rights.

Make sure the password specified for the Cluster Service account is correct. (Retype it, click Apply, and then try to restart the service.)

3. Check that the quorum disk is online and that the Fiber Channel connection is properly terminated and functioning.

Information: if the quorum disk is not accessible during startup, the following error message may occur:

"Could not start the Cluster Service on \\computername. Error 0021: The device is not ready."

If the Cluster Service is running on the other cluster node, check the cluster logfile on that system for indications of whether or not the other node attempted to join the cluster. If the cluster node did try to join the cluster, and the request was denied, the logfile may contain details of the event. For example, if you evict a node from the cluster, but do not remove and reinstall MSCS on that node, when the server attempts to join the cluster, the request to join will be denied.
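
The two startup errors quoted above point to different checks. A minimal sketch of that triage is shown below; the function name and the wording of the hints are illustrative, and only the two error codes mentioned in this section are covered.

```python
import re

# Hints drawn from this section; other error codes are not covered here.
START_ERROR_HINTS = {
    1069: "Logon failure: check the Cluster Service account (enabled, "
          "password, local Administrators membership, user rights).",
    21: "Device not ready: check that the quorum disk is online and reachable.",
}


def triage_start_error(message):
    """Extract the numeric error code from a service start message.

    Returns (code, hint), or None if no 'Error NNNN:' pattern is found.
    """
    match = re.search(r"Error\s+0*(\d+):", message)
    if match is None:
        return None
    code = int(match.group(1))
    return code, START_ERROR_HINTS.get(code, "Not covered in this section.")


msg = ("Could not start the Cluster Service on \\\\computername. "
       "Error 0021: The device is not ready.")
print(triage_start_error(msg))
```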

To start a cluster node when the cluster service does not start and no cluster.log file exists, use the -debug option:

1. Open a console window by typing CMD in the Run dialog box.

2. Change to the %systemroot%\cluster directory (cd \winnt\cluster).

3. Type CLUSSVC -debug

The debug information is then sent to the console.

To stop the service, press CTRL+C.

3.5 Cluster Service Starts but Cluster Administrator will not connect

If the Services utility in Administrative Tools indicates that the service is running, and you cannot connect with Cluster Administrator to administer the cluster, the problem may be related to:


- the cluster network name resource,

- the cluster IP address resource, or

- RPC communication.

1. Check to make sure the RPC Service is running on both nodes.

2. If it is, try to connect to a known running cluster node by its computer name. If you are running Cluster Administrator on the local node, you may specify a period (.) in place of the name when prompted. This creates a local connection and does not require name resolution.

3. If you can connect through the computer name or using the period, check the cluster network name and cluster IP address resources. Make sure that these and other resources in the cluster group are online. These resources may fail if a duplicate name or IP address on the network conflicts with either of these resources.

4. A duplicate IP address on the network may cause the network adapter to shut down. Check the system event log for errors.

3.6 Cluster Administrator Stops Responding On Failover

The Cluster Administrator application uses RPC communication to connect with the cluster. If you use the cluster name to establish the connection, Cluster Administrator may appear to stop responding during a failover of the Cluster group and its resources. This ordinary delay occurs during the registration of the IP address and network name resources in the group and the establishment of a new RPC connection. If a problem occurs with the registration of these resources, the process may take extended time until these resources become available. The first RPC connection must time out before the application tries to establish another connection. As a result, Cluster Administrator may eventually time out if problems occur when the IP address or network name resources are brought online in the Cluster group. In this situation, try to connect by using the computer name of one of the cluster nodes instead of the cluster name. Doing so typically allows a more real-time display of resource and group transitions without delay.


3.7 Group/Resources Failover Problems

General Information:

A group usually fails to fail over properly because of problems with resources within the group. For example, if you elect to move a group from one node to another, the resources within the group will be taken offline, and ownership of the group will be transferred to the other node. On receiving ownership, the node will attempt to bring resources online, according to dependencies defined for the resources. If resources fail to go online, MSCS attempts again to bring them online. After repeated failures, the failing resource or resources may affect the group and cause the group to transition back to the previous node. Eventually, if failures continue, the group or affected resources may be taken offline. You can configure the number of attempts and allowed failures through resource and group properties.

1. When you experience problems with group or resource failover, evaluate which resource or resources may be failing. Determine why the resource won't go online.

2. Check resource dependencies for proper configuration and make sure they are available.

3. Also, make sure that the "Possible Owners" list includes both nodes.

4. If resource properties do not appear to be part of the problem, check the event log or cluster logfile for details.

Information: the "Preferred Owners" list is designed for automatic failback or initial group placement within the cluster. In a two-node cluster, this list should only contain the name of the preferred node for the group, and should not contain multiple entries.

To move a group from one node to another:

1. You must have administrative rights to run Cluster Administrator.

2. The destination node must be online and the cluster service started.

3. The state of the node must be online and not Paused.

4. Both cluster nodes should be listed in the Possible Owners list for the resources within the group.

5. Also, to move a group, resources within the group cannot be in a pending state. To initiate a Move Group request, resources must be in one of the following three states: online, offline, or failed.
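
The conditions above can be expressed as a pre-check to run through before a Move Group request. The sketch below is illustrative only: the data structures, node names and resource names are hypothetical, and the states would have to be read from Cluster Administrator.

```python
MOVABLE_STATES = {"online", "offline", "failed"}


def move_group_blockers(destination, node_states, resources):
    """Return the reasons a Move Group request would be refused.

    node_states: node name -> "online", "paused" or "offline"
    resources:   resource name -> (state, set of possible owner nodes)
    """
    blockers = []
    state = node_states.get(destination, "offline")
    if state != "online":
        blockers.append(f"destination node {destination} is {state}")
    for name, (res_state, owners) in sorted(resources.items()):
        if res_state not in MOVABLE_STATES:
            blockers.append(f"resource {name} is in state {res_state}")
        if destination not in owners:
            blockers.append(f"{destination} is not a possible owner of {name}")
    return blockers


# Hypothetical two-node cluster and resource group.
nodes = {"NODE1": "online", "NODE2": "paused"}
group = {
    "Disk Q:": ("online", {"NODE1", "NODE2"}),
    "File Share": ("online pending", {"NODE1"}),
}
for reason in move_group_blockers("NODE2", nodes, group):
    print(reason)
```

An empty list means none of the listed conditions blocks the move; administrative rights still have to be verified separately.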


Symptoms and causes:

Symptom: A resource fails, but is not brought back online.

Cause: A resource may depend on another resource that has failed.
Solution: In the resource Properties dialog box, make sure that the Do not restart option is clear. If the resource needs another resource in order to function, and if the second resource fails, confirm that the dependencies are correctly configured.

Symptom: You cannot bring a resource online.

Cause: The resource is not properly installed.
Solution: Make sure the application or service associated with the resource is properly installed.

Cause: The resource is not properly configured.
Solution: Make sure the properties are set correctly for the resource.

Symptom: You cannot bring the default physical disk resource online in Cluster Administrator.

Cause: You may not have restarted the servers after installing the Cluster service.
Solution: Make sure that you restarted all servers after installing the Cluster service.

Cause: There may be hardware errors or transport problems.
Solution: Make sure that there are no hardware errors or transport problems. Using Event Viewer, look in the event log for disk I/O error messages or indications of problems with the communications transport.

Cause: One or more SCSI adapters on the shared SCSI bus are configured incorrectly.
Solution: Make sure that the SCSI adapters are configured correctly.

Cause: The shared SCSI bus exceeds the maximum cable length.
Solution: Make sure that the shared SCSI bus does not exceed the maximum cable length.

Cause: The shared SCSI bus is improperly terminated.
Solution: Make sure that the shared SCSI bus is properly terminated.

Cause: The disk is not supported.
Solution: Make sure that the disk hardware or firmware revision level is not outdated.

Cause: Duplicate SCSI IDs have been specified on the shared SCSI bus.
Solution: Verify that each SCSI device has a unique ID. SCSI controller IDs are preset to seven. Reset one SCSI controller ID to six.


Cause: If you move your SCSI bus adapter to another I/O slot, add or remove bus adapters, or install a new version of the bus adapter driver, the cluster software may not be able to access disks on your shared SCSI bus.
Solution: In order to accommodate these changes, make sure that your shared SCSI bus adapter has been properly reconfigured.

Cause: Windows 2000 is incorrectly configured to access the shared SCSI bus.
Solution: Verify that Windows 2000 can detect the shared SCSI bus adapter and that the SCSI IDs for the adapter and disks are listed (open Control Panel and double-click SCSI Adapters).

Cause: A resource in the group may be continually failing.
Solution: Determine if a resource in the group is continually failing. If the node can, it will bring the resource back up without failing over the group. If the resource continually fails but does not fail over, make sure that the resource property Restart and affect the group is selected. Also, check the Restart Threshold and Restart Period settings, which are also in the resource Properties dialog box.

Symptom: A group failed over but did not fail back.

Cause: The failback policies of both the group and the resources may not be properly configured.
Solution: Make sure that the Prevent failback check box is clear in the group Properties dialog box. If the Allow failback check box is selected, be sure to wait long enough for the group to fail back. Check these settings for all affected resources within a group. Because groups fail over as a whole, one resource that is prevented from failing back affects the entire group.

Cause: The node to which you want the group to fail back is not configured as the preferred owner of the group.
Solution: Make sure that the node to which you want the group to fail back is configured as the preferred owner of the group. If not, the Cluster service leaves the group on the node to which it failed over.

Symptom: The entire group failed and has not restarted.

Cause: A node is offline.
Solution: Make sure that the node is not offline.


Symptom: The group has failed repeatedly.

Cause: The group may have exceeded its failover threshold or its failover period.
Solution: Try to bring the resources online individually (following the correct sequence of dependencies) to determine which resource is causing the problem. Or, create a temporary resource group (for testing purposes) and move the resources to it, one at a time.
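
To reason about whether a group may have exceeded its failover threshold within its failover period, a simple sliding-window count can help. The sketch below is a simplified reading of "threshold within period", with the period expressed in hours as an assumption; the actual bookkeeping done by the Cluster service may differ, and the failure times would come from the event log.

```python
def threshold_exceeded(failure_times, threshold, period_hours, now):
    """True if more than `threshold` failures fall inside the last `period_hours`.

    failure_times and now are in seconds on the same clock.
    """
    window_start = now - period_hours * 3600
    recent = [t for t in failure_times if t >= window_start]
    return len(recent) > threshold


# Hypothetical failure timestamps, in seconds since an arbitrary start.
failures = [0, 1800, 3000, 3500]
print(threshold_exceeded(failures, threshold=3, period_hours=1, now=3600))
```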

3.8 Quorum Resources Failure

Symptoms and solutions:

Symptom: Quorum resource does not start.

Cause: The resource is not physically connected to the server.
Solution: Make sure that the resource is physically connected to the server.

Cause: The devices are not properly terminated.
Solution: Make sure that the devices are properly terminated.

Cause: The problem is with the hardware configuration.
Solution: Turn off the SCSI devices and check the SCSI IDs of the devices. Make sure that the IDs are not both set to 7 (the default).

Symptom: Quorum resource fails.

Cause: The disk on the shared bus holding the quorum resource has failed.
Solution: If the disk on the shared bus holding the quorum resource fails and cannot be brought online, the Cluster service cannot start. To correct this situation, use the fixquorum option. For more information, read the paragraph below.

Symptom: Quorum log becomes corrupted.

Cause: This may occur for a variety of reasons.
Solution: If the quorum log is corrupted, the Cluster service attempts to correct the problem by resetting the log file. In this case, the Cluster service writes the following message in the Windows 2000 system log:

The log file [name] was found to be corrupt. An attempt will be made to reset it.

If the quorum log cannot be reset, the Cluster service cannot start.


If the Cluster service fails to detect that the quorum log is corrupted, the Cluster service may fail to start. In this case, there may be an "ERROR_CLUSTERLOG_CORRUPT" message in the system log. To correct this, you must use the noquorumlogging option. For more information, read the paragraph below.

If the Cluster Service won't start because of a quorum disk failure, check the corresponding device. A quorum access problem is usually caused by connectivity or authentication issues. If this is not the case, execute the procedure below.

This operation is complex and has a significant impact on the cluster, so it is recommended that you perform it with the assistance of Microsoft Product Support Services.

Recover a failed quorum disk:

You can check the status of the quorum device by starting the service with the -fixquorum switch, and then attempt to bring the quorum disk online or change the quorum location for the service.

When you use the fixquorum option to start the Cluster service, only the cluster name and cluster IP address resources are set online.

To recover a failed quorum resource:

1. Start the Cluster Service with the -fixquorum option on a single node:

a. Start the Services snap-in. Click Start, point to Programs, click Administrative Tools, and then click Services.

b. Right-click the Cluster Service and select Properties.

c. In the Start Parameters box, type: /fixquorum

d. Click the Start button.

2. Use Cluster Administrator to configure the Cluster Service to use a different disk on the shared bus for the quorum resource.

3. To view or change the quorum drive settings, right-click the cluster name at the top of the tree, listed on the left portion of the Cluster Administrator window, and select Properties.

4. The Cluster Properties window contains three different tabs, one of which is for the quorum disk.

5. From this tab, you may view or change quorum disk settings. You may also redesignate the quorum resource.

Recover a failed quorum log:


This operation is complex and has a significant impact on the cluster, so it is recommended that you perform it with the assistance of Microsoft Product Support Services.

When the quorum log is corrupted, a message is logged in the event log:

The log file D:\MSCS\quolog.log was found to be corrupt. An attempt will be made to reset it, or you should use the Cluster Administrator utility to adjust the maximum size.

Remark: If the error message occurs after you restore the system state on a computer that has lost the quorum log, the quorum information is copied to %SystemRoot%\Cluster\Cluster_backup. You can then use the Clusrest.exe tool from the Resource Kit to restore this information to the quorum disk.

If you have a backup of the system state on one of the computers, taken after the last changes were made to the cluster, you can restore the quorum by restoring this information.

If you do not have a backup of the quorum log file, create a new quorum log file based on the cluster configuration information in the local system's cluster hive by starting the Cluster Service with the ResetQuorumLog switch:

1. Start the Services snap-in. Click Start, point to Programs, click Administrative Tools, and then click Services.

2. Right-click the Cluster Service and select Properties.

3. In the Start Parameters box, type: /ResetQuorumLog

4. Click the Start button.
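
The start parameters used in this section can also be passed on the command line instead of through the Services snap-in. The sketch below only builds the command; it assumes the Cluster service is registered under the service name clussvc and that net start passes the switch through to the service, and it does not execute anything.

```python
# Quorum-related start parameters mentioned in this guide.
QUORUM_SWITCHES = ("fixquorum", "resetquorumlog", "noquorumlogging")


def cluster_start_command(switch):
    """Build a 'net start' argument list for one quorum recovery switch.

    Accepts the switch with or without a leading '/' or '-'.
    """
    key = switch.strip().lstrip("/-").lower()
    if key not in QUORUM_SWITCHES:
        raise ValueError(f"unknown start parameter: {switch!r}")
    return ["net", "start", "clussvc", "/" + key]


print(" ".join(cluster_start_command("-fixquorum")))
```

Rejecting anything outside the known list is a deliberate choice here: these switches change how the quorum is handled, so a mistyped parameter should stop the procedure rather than start the service normally.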

3.9 Network Name Resource Does Not Go Online

Various causes can prevent a network name resource from going online. Many causes may be clearly indicated in the system event log. Potential causes may include:

- A duplicate name on the network from an offending computer.

- Static WINS entries for the network name.

- A malfunctioning switch or router.

- An incorrect TCP/IP configuration for one or more network adapters.

- An incorrect setting for the RequireDNS property.

3.10 Physical Disk Resource Problem

Problems with physical disk resources are usually hardware related. Cables, termination, or SCSI host adapter configuration may cause problems with failover, or may cause premature failure of the resource. The system event log may often show events related to physical disk or controller problems. However, some cable or termination problems may not yield such helpful information.

1. It is important to verify the configuration of the Fiber Channel and attached devices, whenever you detect trouble with one of these devices.

2. BIOS or firmware problems might also be factors.

3.11 Client Connectivity Problem

3.11.1 Clients have intermittent Connectivity Based on Group Ownership

If clients successfully connect to clustered resources only when a specific node is the owner, a few possible problems could lead to this condition. To narrow down the problem:

1. Check the system event log on each server for possible errors.

2. Check to make sure that the group has at least one IP address resource and one network name resource.

3. Check that clients use one of these to access the resource or resources within the group. If clients connect with any other network name or IP address, they may not be accessing the correct server in the event that ownership of the resources changes. As a result of improper addressing, access to these resources may appear limited to a particular node.

4. If you are able to confirm that clients use proper addressing for the resource or resources, check the IP address and network name resources to see that they are online.

5. Check network connectivity with the server that owns the resources. For example, try some of the following techniques:

From the server:

PING server's primary adapter IP address (on client network)
PING IP address of the group
PING Network Name of the group
PING Router/Gateway between client and server (if any)
PING Client IP address

If the above tests work correctly up to the router/gateway check, the problem may be elsewhere on the network because you have connectivity with the other server and local addresses. If tests complete up to the client IP address test, there may be a client configuration or routing problem.

From the client:

PING Client IP address
PING Router/Gateway between client and server (if any)
PING server's primary adapter IP address (on client network)
PING IP address of the group
PING Network Name of the group

If the tests from the server all pass, but you experience failures performing tests from the client, there may be client configuration problems. If all tests complete except the test using the network name of the group, there may be a name resolution problem. This may be related to client configuration, or it may be a problem with the client's designated DNS server.
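
The ping sequences above amount to finding the first test that fails and reading the result accordingly. A minimal sketch of that reasoning follows; the test labels and the function names are illustrative, and the pass/fail values would be entered by hand after running the PING commands.

```python
SERVER_TESTS = (
    "server primary adapter IP", "group IP address", "group network name",
    "router/gateway", "client IP address",
)
CLIENT_TESTS = (
    "client IP address", "router/gateway", "server primary adapter IP",
    "group IP address", "group network name",
)


def first_failed(tests, results):
    """Return the name of the first failing test, or None if all passed."""
    for name, ok in zip(tests, results):
        if not ok:
            return name
    return None


def diagnose(server_results, client_results):
    """Apply the rules of this section to two ordered lists of ping results."""
    failed = first_failed(SERVER_TESTS, server_results)
    if failed == "router/gateway":
        return "problem may be elsewhere on the network"
    if failed == "client IP address":
        return "possible client configuration or routing problem"
    if failed is not None:
        return f"investigate the server side, first failure: {failed}"
    failed = first_failed(CLIENT_TESTS, client_results)
    if failed is None:
        return "all tests passed"
    if failed == "group network name":
        return "possible name resolution problem (client configuration or DNS)"
    return "possible client configuration problem"


# Hypothetical run: everything passes except the group name from the client.
print(diagnose([True] * 5, [True, True, True, True, False]))
```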

3.11.2 Clients do not Have any Connection with the Cluster

If clients lose connectivity with both cluster nodes:

1. Check to make sure that the Cluster Service is running on each node.

2. Check the system event log for possible errors.

3. Check network connectivity between cluster nodes, and with other network devices, by using the procedure in the previous section.

4. If the Cluster Service is running, and there are no apparent connectivity problems between the two servers, there is likely a network or client configuration problem that does not directly involve the cluster. Check that the client uses the TCP/IP protocol and has a valid IP address on the network.

5. Make sure that the client is using the correct network name or IP address to access the cluster.

3.11.3 Clients have Problems Accessing Data Through a File Share

If clients experience problems accessing cluster file shares:

1. Check the resource and make sure it is online, and that any dependent resources (disks, network names, and so on) are online.

2. Check the system event log for possible errors.

3. Check network connectivity between the client and the server that owns the resource.

4. If the data for the share is on a shared drive (using a physical disk resource), make sure that the file share resource has a dependency declared for the physical disk resource.

5. You can reset the file share by toggling the file share resource offline and back online again.

6. Cluster file shares behave essentially the same as standard file shares. So, make sure that clients have appropriate access at both the file system level and the share level.

7. Make sure that the server has the proper number of client access licenses loaded for the clients connecting, in the event that the client cannot connect because of insufficient available connections.
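
The dependency check in step 4 can be sketched as a walk over the declared dependencies of the file share resource. The resource names and the dependency map below are hypothetical, and accepting an indirect dependency (through another resource) is an assumption of this sketch rather than a statement from this guide.

```python
def depends_on(resource, target, dependencies):
    """True if `resource` depends on `target`, directly or through other resources."""
    seen = set()
    stack = [resource]
    while stack:
        current = stack.pop()
        for dep in dependencies.get(current, ()):
            if dep == target:
                return True
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return False


# Hypothetical group: resource -> resources it depends on.
deps = {
    "Users Share": ["Network Name"],
    "Network Name": ["IP Address"],
    "Disk S:": [],
}
print(depends_on("Users Share", "Disk S:", deps))
```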


3.11.4 Clients Experience Intermittent Access

Network adapter configuration is one possible cause of intermittent access to the cluster, and of premature failover:

1. Some autosense settings for network speed can spontaneously redetect network speed. During the detection, network traffic through the adapter may be compromised. For best results, set the network speed manually to avoid the recalibration.

2. Make sure to use the correct network adapter drivers. Some adapters may require special drivers, although they may be detected as a similar device.


4 Appendix: MSCS Event Messages

4.1 Event ID 1000

Source ClusSvc

Description Microsoft Cluster Server suffered an unexpected fatal error at line ### of source module %path%. The error code was 1006.

Problem Messages similar to this may occur in the event of a fatal error that may cause the Cluster Service to terminate on the node that experienced the error.

Solution Check the system event log and the cluster diagnostic logfile for additional information. It is possible that the cluster service may restart itself after the error. This event message may indicate serious problems that may be related to hardware or other causes.

4.2 Event ID 1002

Source ClusSvc

Description Microsoft Cluster Server handled an unexpected error at line 528 of source module G:\Nt\Private\Cluster\Resmon\Rmapi.c. The error code was 5007.

Problem Messages similar to this may occur after installation of Microsoft Cluster Server. If the cluster service starts and successfully forms or joins the cluster, they may be ignored. Otherwise, these errors may indicate a corrupt quorum logfile or other problem.

Solution Ignore the error if the cluster appears to be working properly. Otherwise, you may want to try creating a new quorum logfile using the -noquorumlogging or -fixquorum parameters as documented in the Microsoft Cluster Server Administrator's Guide.


4.3 Event ID 1006

Source ClusSvc

Description Microsoft Cluster Server was halted because of a cluster membership or communications error. The error code was 4.

Problem An error may have occurred between communicating cluster nodes that affected cluster membership. This error may occur if nodes lose the ability to communicate with each other.

Solution Check network adapters and connections between nodes. Check the system event log for errors. There may be a network problem preventing reliable communication between cluster nodes.

4.4 Event ID 1007

Source ClusSvc

Description A new node, "ComputerName", has been added to the cluster.

Information The Microsoft Cluster Server Setup program ran on an adjacent computer. The setup process completed, and the node was admitted for cluster membership. No action required.

4.5 Event ID 1009

Source ClusSvc

Description Microsoft Cluster Server could not join an existing cluster and could not form a new cluster. Microsoft Cluster Server has terminated.

Problem The cluster service started and attempted to join a cluster. The node may not be a member of an existing cluster because of eviction by an administrator. After a cluster node has been evicted from the cluster, the cluster software must be removed and reinstalled if you want it to rejoin the cluster. And, because a cluster already exists with the same cluster name, the node could not form a new cluster with the same name.

Solution Remove MSCS from the affected node, and reinstall MSCS on that system if desired.


4.6 Event ID 1010

Source ClusSvc

Description Microsoft Cluster Server is shutting down because the current node is not a member of any cluster. Microsoft Cluster Server must be reinstalled to make this node a member of a cluster.

Problem The cluster service attempted to run but found that it is not a member of an existing cluster. This may be due to eviction by an administrator or incomplete attempt to join a cluster. This error indicates a need to remove and reinstall the cluster software.

Solution Remove MSCS from the affected node, and reinstall MSCS on that server if desired.

4.7 Event ID 1011

Source ClusSvc

Description Cluster Node "ComputerName" has been evicted from the cluster.

Information A cluster administrator evicted the specified node from the cluster.

4.8 Event ID 1012

Source ClusSvc

Description Microsoft Cluster Server did not start because the current version of Windows is not correct.

4.9 Event ID 1015

Source ClusSvc

Description No checkpoint record was found in the logfile W:\Mscs\Quolog.log; the checkpoint file is invalid or was deleted.

Problem The Cluster Service experienced difficulty reading data from the quorum logfile. The logfile could be corrupted.

Solution If the Cluster Service fails to start because of this problem, try manually starting the cluster service with the -noquorumlogging parameter. If you need to adjust the quorum disk designation, use the -fixquorum startup parameter when starting the cluster service. Both of these parameters are covered in the MSCS Administrator's Guide.


4.10 Event ID 1016

Source ClusSvc

Description Microsoft Cluster Server failed to obtain a checkpoint from the cluster database for log file W:\Mscs\Quolog.log.

Problem The cluster service experienced difficulty establishing a checkpoint for the quorum logfile. The logfile could be corrupt, or there may be a disk problem.

Solution You may need to use procedures to recover from a corrupt quorum logfile. You may also need to run chkdsk on the volume to ensure against file system corruption.

4.11 Event ID 1019

Source ClusSvc

Description The log file D:\MSCS\Quolog.log was found to be corrupt. An attempt will be made to reset it, or you should use the Cluster Administrator utility to adjust the maximum size.

Problem The quorum logfile for the cluster was found to be corrupt. The system will attempt to resolve the problem.

Solution The system will attempt to resolve this problem. This error may also be an indication that the cluster property for maximum size should be increased through the Quorum tab. You can manually resolve this problem by using the -noquorumlogging parameter.

4.12 Event ID 1021

Source ClusSvc

Description There is insufficient disk space remaining on the quorum device. Please free up some space on the quorum device. If there is no space on the disk for the quorum log files then changes to the cluster registry will be prevented.

Problem Available disk space is low on the quorum disk and must be resolved.

Solution Remove data or unnecessary files from the quorum disk so that sufficient free space exists for the cluster to operate. If necessary, designate another disk with adequate free space as the quorum device.


4.13 Event ID 1022

Source ClusSvc

Description There is insufficient space left on the quorum device. The Microsoft Cluster Server cannot start.

Problem Available disk space is low on the quorum disk and is preventing the startup of the cluster service.

Solution Remove data or unnecessary files from the quorum disk so that sufficient free space exists for the cluster to operate. If necessary, use the -fixquorum startup option to start one node. Bring the quorum resource online and adjust free space or designate another disk with adequate free space as the quorum device.

4.14 Event ID 1023

Source ClusSvc

Description The quorum resource was not found. The Microsoft Cluster Server has terminated.

Problem The device designated as the quorum resource could not be found. The device may have failed at the hardware level, or the disk resource corresponding to the quorum drive letter may not match or may no longer exist.

Solution Use the -fixquorum startup option for the cluster service. Investigate and resolve the problem with the quorum disk. If necessary, designate another disk as the quorum device and restart the cluster service before starting other nodes.

4.15 Event ID 1024

Source ClusSvc

Description The registry checkpoint for cluster resource "resourcename" could not be restored to registry key registrykeyname. The resource may not function correctly. Make sure that no other processes have open handles to registry keys in this registry subkey.

Problem The registry key checkpoint imposed by the cluster service failed because an application or process has an open handle to the registry key or subkey.

Solution Close any applications that may have an open handle to the registry key so that it may be replicated as configured with the resource properties. If necessary, contact the application vendor about this problem.

4.16 Event ID 1034

Source ClusSvc

Description The disk associated with cluster disk resource resource name could not be found. The expected signature of the disk was signature. If the disk was removed from the cluster, the resource should be deleted. If the disk was replaced, the resource must be deleted and created again to bring the disk online. If the disk has not been removed or replaced, it may be inaccessible at this time because it is reserved by another cluster node.

Problem The cluster service attempted to mount a physical disk resource in the cluster. The cluster disk driver could not locate a disk with this signature. The disk may be offline or may have failed. This error may also occur if the drive has been replaced or reformatted. This error may also occur if another system continues to hold a reservation for the disk.

Solution Determine why the disk is offline or nonoperational. Check cables, termination, and power for the device. If the drive has failed, replace the drive and restore the resource to the same group as the old drive. Remove the old resource. Restore data from a backup and adjust resource dependencies within the group to point to the new disk resource.

4.17 Event ID 1035

Source ClusSvc

Description Cluster disk resource %1 could not be mounted.

Problem The cluster service attempted to mount a disk resource in the cluster and could not complete the operation. This could be due to a file system problem, hardware issue, or drive letter conflict.

Solution Check for drive letter conflicts, evidence of file system issues in the system event log, and for hardware problems.

4.18 Event ID 1036

Source ClusSvc

Description Cluster disk resource "resourcename" did not respond to a SCSI inquiry command.

Problem The disk did not respond to the issued SCSI command. This usually indicates a hardware problem.

Solution Check SCSI bus configuration. Check the configuration of SCSI adapters and devices. This may indicate a misconfigured or failing device.

4.19 Event ID 1037

Source ClusSvc

Description Cluster disk resource %1 has failed a filesystem check. Please check your disk configuration.

Problem The cluster service attempted to mount a disk resource in the cluster. A filesystem check was necessary and failed during the process.

Solution Check cables, termination, and device configuration. If the drive has failed, replace the drive and restore data. This may also indicate a need to reformat the partition and restore data from a current backup.

4.20 Event ID 1038

Source ClusSvc

Description Reservation of cluster disk "Disk W:" has been lost. Please check your system and disk configuration.

Problem The cluster service had exclusive use of the disk, and lost the reservation of the device on the shared SCSI bus.

Solution The disk may have gone offline or failed. Another node may have taken control of the disk, or a SCSI bus reset command was issued on the bus that caused a loss of reservation.

4.21 Event ID 1040

Source ClusSvc

Description Cluster generic service "ServiceName" could not be found.

Problem The cluster service attempted to bring the specified generic service resource online. The service could not be located and could not be managed by the Cluster Service.

Solution Remove the generic service resource if this service is no longer installed. The parameters for the resource may be invalid. Check the generic service resource properties and confirm correct configuration.

4.22 Event ID 1041

Source ClusSvc

Description Cluster generic service "ServiceName" could not be started.

Problem The cluster service attempted to bring the specified generic service resource online. The service could not be started at the operating system level.

Solution Remove the generic service resource if this service is no longer installed. The parameters for the resource may be invalid. Check the generic service resource properties and confirm correct configuration. Make sure that the service account has not expired, that it has the correct password, and that it has the rights necessary for the service to start. Check the system event log for any related errors.

4.23 Event ID 1042

Source ClusSvc

Description Cluster generic service "resourcename" failed.

Problem The service associated with the mentioned generic service resource failed.

Solution Check the generic service properties and service configuration for errors. Check system and application event logs for errors.

4.24 Event ID 1043

Source ClusSvc

Description The NetBIOS interface for "IP Address" resource has failed.

Problem The network adapter for the specified IP address resource has experienced a failure. As a result, the IP address is either offline, or the group has moved to a surviving node in the cluster.

Solution Check the network adapter and network connection for problems. Resolve the network-related problem.

4.25 Event ID 1044

Source ClusSvc

Description Cluster IP Address resource %1 could not create the required NetBios interface.

Problem The cluster service attempted to initialize an IP Address resource and could not establish a context with NetBios.

Solution This could be an issue with the network adapter or its driver. Make sure the adapter is using a current driver that is correct for the adapter. If this is an embedded adapter, check with the OEM to determine whether a specific OEM version of the driver is required. If you already have many IP Address resources defined, make sure you have not reached the NetBIOS limit of 64 addresses. If you have IP Address resources that do not need NetBIOS affiliation, use the IP Address private property to disable NetBIOS for the address. This option is available in SP4 and helps to conserve NetBIOS address slots.
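
To judge how close a configuration is to the 64-address limit mentioned above, the IP Address resources can be tallied. The Python sketch below works on a hand-built list of dictionaries; the "netbios_enabled" key is a hypothetical stand-in for the private property, not its real name, and the data layout is an assumption.

```python
NETBIOS_ADDRESS_LIMIT = 64  # limit quoted in the Solution text above


def netbios_slots(ip_resources):
    """Count IP Address resources that still consume a NetBIOS slot.

    Each resource is a dict; the "netbios_enabled" key is a hypothetical
    stand-in for the private property mentioned above. A resource without
    the key is assumed to have NetBIOS enabled.
    """
    used = sum(1 for res in ip_resources if res.get("netbios_enabled", True))
    return {"used": used, "remaining": NETBIOS_ADDRESS_LIMIT - used}
```

A "remaining" value at or near zero suggests disabling NetBIOS on addresses that do not need it before adding further IP Address resources.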

4.26 Event ID 1045

Source ClusSvc

Description Cluster IP address "IP address" could not create the required TCP/IP Interface.

Problem The cluster service tried to bring an IP address online. The resource properties may specify an invalid network or malfunctioning adapter. This error may occur if you replace a network adapter with a different model and continue to use the old or inappropriate driver. As a result, the IP address resource cannot be bound to the specified network.

Solution Resolve the network adapter problem or change the properties of the IP address resource to reflect the proper network for the resource.

4.27 Event ID 1046

Source ClusSvc

Description Cluster IP Address resource %1 cannot be brought online because the subnet mask parameter is invalid. Please check your network configuration.

Problem The cluster service tried to bring an IP address resource online but could not do so. The subnet mask for the resource is either blank or otherwise invalid.

Solution Correct the subnet mask for the resource.
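
A candidate subnet mask can be checked before it is entered in the resource properties. The following Python sketch is one possible check, not the validation the Cluster Service itself performs; the function name is an assumption, and treating 0.0.0.0 as invalid is a choice made here for illustration.

```python
import ipaddress


def is_valid_subnet_mask(mask):
    """Return True if `mask` is a dotted-decimal mask with contiguous 1 bits."""
    if not mask or not mask.strip():
        return False  # a blank mask is one of the cases named for Event 1046
    try:
        value = int(ipaddress.IPv4Address(mask.strip()))
    except ValueError:
        return False  # not a parseable dotted-decimal value
    if value == 0:
        return False  # 0.0.0.0 is treated as invalid here (assumption)
    inverted = ~value & 0xFFFFFFFF
    # A contiguous mask inverts to 2**k - 1, so adding 1 shares no bits with it.
    return (inverted & (inverted + 1)) == 0
```

For example, 255.255.255.0 and 255.255.252.0 pass this check, while a blank string, 255.0.255.0, and 0.0.0.255 do not.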

4.28 Event ID 1047

Source ClusSvc

Description Cluster IP Address resource %1 cannot be brought online because the IP address parameter is invalid. Please check your network configuration.

Problem The cluster service tried to bring an IP address resource online but could not do so. The IP address property contains an invalid value. This may be caused by incorrectly creating the resource through an API or the command line interface.

Solution Correct the IP address properties for the resource.
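
When resources are created through a script or the command line, it may help to validate the address before it is written to the resource. The Python sketch below shows one way to do so. The function name and the specific checks (rejecting the network and broadcast addresses of the subnet) are assumptions; the guide does not list which values the Cluster Service rejects.

```python
import ipaddress


def check_ip_address(address, subnet_mask):
    """Return a list of problems found; an empty list means none detected."""
    try:
        iface = ipaddress.IPv4Interface(f"{address}/{subnet_mask}")
    except ValueError as exc:
        return [f"cannot parse address or mask: {exc}"]
    problems = []
    network = iface.network
    if network.prefixlen < 31:  # /31 and /32 have no separate network/broadcast
        if iface.ip == network.network_address:
            problems.append("address is the network address of its subnet")
        if iface.ip == network.broadcast_address:
            problems.append("address is the broadcast address of its subnet")
    return problems
```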

4.29 Event ID 1048

Source ClusSvc

Description Cluster IP address, "IP address," cannot be brought online because the specified adapter name is invalid.

Problem The cluster service tried to bring an IP address online. The resource properties may specify an invalid network or a malfunctioning adapter. This error may occur if you replace a network adapter with a different model. As a result, the IP address resource cannot be bound to the specified network.

Solution Resolve the network adapter problem or change the properties of the IP address resource to reflect the proper network for the resource.

4.30 Event ID 1049

Source ClusSvc

Description Cluster IP address "IP address" cannot be brought online because the address IP address is already present on the network. Please check your network configuration.

Problem The cluster service tried to bring an IP address online. The address is already in use on the network and cannot be registered. Therefore, the resource cannot be brought online.

Solution Resolve the IP address conflict, or choose another address for the resource.

4.31 Event ID 1050

Source ClusSvc

Description Cluster Network Name resource %1 cannot be brought online because the name %2 is already present on the network. Please check your network configuration.

Problem The cluster service tried to bring a Network Name resource online. The name is already in use on the network and cannot be registered. Therefore, the resource cannot be brought online.

Solution Resolve the conflict, or choose another network name.

4.32 Event ID 1051

Source ClusSvc

Description Cluster Network Name resource "resourcename" cannot be brought online because it does not depend on an IP address resource. Please add an IP address dependency.

Problem The cluster service attempted to bring the network name resource online, and found that a required dependency was missing.

Solution Microsoft Cluster Server requires an IP address dependency for network name resource types. Cluster Administrator presents a pop-up message if you attempt to remove this dependency without specifying another like dependency. To resolve this error, replace the IP address dependency for this resource. Because it is difficult to remove this dependency, Event 1051 may be an indication of problems within the cluster registry. Check other resources for possible dependency problems.
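
The dependency rule described above can be checked across an inventory of resources. The Python sketch below uses a hypothetical in-memory model (a dictionary keyed by resource name); it does not read the cluster registry, and the type strings are assumptions chosen to match the wording in this guide.

```python
def network_names_missing_ip_dependency(resources):
    """List Network Name resources with no IP Address dependency.

    `resources` maps a resource name to a dict with "type" and "depends_on"
    keys. This in-memory layout is hypothetical; it is not the format of
    the cluster registry.
    """
    missing = []
    for name, res in resources.items():
        if res["type"] != "Network Name":
            continue
        dep_types = {
            resources[dep]["type"]
            for dep in res.get("depends_on", [])
            if dep in resources
        }
        if "IP Address" not in dep_types:
            missing.append(name)
    return sorted(missing)
```

Any name returned by this function would be a candidate for the Event 1051 condition and for a wider review of dependencies.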

4.33 Event ID 1052

Source ClusSvc

Description Cluster Network Name resource "resourcename" cannot be brought online because the name could not be added to the system.

Problem The cluster service attempted to bring the network name resource online but the attempt failed.

Solution Check the system event log for errors. Check network adapter configuration and operation. Check TCP/IP configuration and name resolution methods. Check DNS servers for possible database problems or invalid static mappings.

4.34 Event ID 1053

Source ClusSvc

Description Cluster File Share "resourcename" cannot be brought online because the share could not be created.

Problem The cluster service attempted to bring the share online, but the attempt to create the share failed.

Solution Make sure the Server service is started and functioning properly. Check the path for the share. Check ownership and permissions on the directory. Check the system event log for details. Also, if diagnostic logging is enabled, check the log for an entry related to this failure. Use the net helpmsg errornumber command with the error code found in the log entry.
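
The net helpmsg lookup can be wrapped in a small helper when many log entries need to be decoded. This Python sketch is a convenience wrapper under stated assumptions: the function names are invented for illustration, and the second function only works on a Windows system where the net command is available.

```python
import subprocess


def helpmsg_command(error_number):
    """Build the argument list for `net helpmsg <errornumber>`."""
    code = int(error_number)  # rejects non-numeric input early
    if code < 0:
        raise ValueError("error number must be non-negative")
    return ["net", "helpmsg", str(code)]


def explain_error(error_number):
    """Run the command and return its output text (Windows only)."""
    result = subprocess.run(
        helpmsg_command(error_number),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()
```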

4.35 Event ID 1054

Source ClusSvc

Description Cluster File Share %1 could not be found.

Problem The share corresponding to the named File Share resource was deleted using a mechanism other than Cluster Administrator. This may occur if you select the share with Explorer and choose 'Not Shared'.

Solution Delete shares or take them offline via Cluster Administrator or the command line program CLUSTER.EXE.

4.36 Event ID 1055

Source ClusSvc

Description Cluster File Share "sharename" has failed a status check.

Problem The cluster service (through resource monitors) periodically monitors the status of cluster resources. In this case, a file share failed a status check. This could mean that someone attempted to delete the share through Windows NT Explorer or Server Manager, instead of through Cluster Administrator. This event could also indicate a problem with the Server service, or access to the shared directory.

Solution Check the system event log for errors. Check the cluster diagnostic log (if it is enabled) for status codes that may be related to this event. Check the resource properties for proper configuration. Also, make sure the file share has proper dependencies defined for related resources.

4.37 Event ID 1056

Source ClusSvc

Description The cluster database on the local node is in an invalid state. Please start another node before starting this node.

Problem The cluster database on the local node may be in a default state from the installation process and the node has not properly joined with an existing node.

Solution Make sure another node of the same cluster is online before starting this node. Upon joining with another cluster node, the node will receive an updated copy of the official cluster database, which should resolve this error.

4.38 Event ID 1057

Source ClusSvc

Description The cluster service CLUSDB could not be opened.

Problem The Cluster Service tried to open the CLUSDB registry hive and could not do so. As a result, the cluster service cannot be brought online.

Solution Check the cluster installation directory for the existence of a file called CLUSDB. Make sure the registry file is not held open by any applications, and that permissions on the file allow the cluster service access to this file and directory.

4.39 Event ID 1058

Source ClusSvc

Description The Cluster Resource Monitor could not load the DLL %1 for resource type %2.

Problem The Cluster Service tried to bring a resource online that requires a specific resource DLL for the resource type. The DLL is either missing, corrupt, or an incompatible version. As a result, the resource cannot be brought online.

Solution Check the cluster installation directory for the existence of the named resource DLL. Make sure the DLL exists in the proper directory on both nodes.

4.40 Event ID 1059

Source ClusSvc

Description The Cluster Resource DLL %1 for resource type %2 failed to initialize.

Problem The Cluster Service tried to load the named resource DLL and it failed to initialize. The DLL could be corrupt, or an incompatible version. As a result, the resource cannot be brought online.

Solution Check the cluster installation directory for the existence of the named resource DLL. Make sure the DLL exists in the proper directory on both nodes and is of proper version. If the DLL is clusres.dll, this is the default resource DLL that comes with MSCS. Check to make sure the version/date stamp is equivalent to or has a later date than the version contained in the service pack in use.

4.41 Event ID 1061

Source ClusSvc

Description Microsoft Cluster Server successfully formed a cluster on this node.

Information This informational message indicates that an existing cluster of the same name was not detected on the network, and that this node elected to form the cluster and own access to the quorum disk.

4.42 Event ID 1062

Source ClusSvc

Description Microsoft Cluster Server successfully joined the cluster.

Information When the Cluster Service started, it detected an existing cluster on the network and was able to successfully join the cluster. No action needed.

4.43 Event ID 1063

Source ClusSvc

Description Microsoft Cluster Server was successfully stopped.

Information The Cluster Service was stopped manually by the administrator.

4.44 Event ID 1064

Source ClusSvc

Description The quorum resource was changed. The old quorum resource could not be marked as obsolete. If there is a partition in time, you may lose changes to your database, because the node that is down will not be able to get to the new quorum resource.

Problem The administrator changed the quorum disk designation without all cluster nodes present.

Solution When other cluster nodes attempt to join the existing cluster, they may not be able to connect to the quorum disk and may not participate in the cluster, because their configuration indicates a different quorum device. For any nodes that meet this criterion, you may need to use the -fixquorum option to start the Cluster Service on these nodes and make configuration changes.

4.45 Event ID 1065

Source ClusSvc

Description Cluster resource %1 failed to come online.

Problem The cluster service attempted to bring the resource online, but the resource could not reach an online status. The resource may have exhausted the timeout period allotted for the resource to reach an online state.

Solution Check any parameters related to the resource and check the event log for details.
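
The timeout behaviour described for this event can be pictured as a polling loop that gives up after a fixed period. The Python sketch below is conceptual only and is not how the Cluster Service is implemented; `get_state`, the "Online" state string, and the timing parameters are hypothetical.

```python
import time


def wait_until_online(get_state, timeout_s, poll_s=0.5,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll `get_state()` until it returns "Online" or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if get_state() == "Online":
            return True
        if clock() >= deadline:
            return False  # timeout exhausted, analogous to the Event 1065 case
        sleep(poll_s)
```

A resource that regularly needs longer than its allotted period to start would return False here even though nothing is broken, which is why the resource parameters are worth checking alongside the event log.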

4.46 Event ID 1066

Source ClusSvc

Description Cluster disk resource resourcename is corrupted. Running Chkdsk /F to repair problems.

Problem The Cluster Service detected corruption on the indicated disk resource and started Chkdsk /f on the volume to repair the structure. The Cluster Service will automatically perform this operation, but only for cluster-defined disk resources (not local disks).

Solution Scan the event log for additional errors. The disk corruption could be indicative of other problems. Check related hardware and devices on the shared bus and ensure proper cables and termination. This error may be a symptom of failing hardware or a deteriorating drive.

4.47 Event ID 1067

Source ClusSvc

Description Cluster disk resource %1 has corrupt files. Running Chkdsk /F to repair problems.

Problem The Cluster Service detected corruption on the indicated disk resource and started Chkdsk /f on the volume to repair the structure. The Cluster Service will automatically perform this operation, but only for cluster-defined disk resources (not local disks).

Solution Scan the event log for additional errors. The disk corruption could be indicative of other problems. Check related hardware and devices on the shared bus and ensure proper cables and termination. This error may be a symptom of failing hardware or a deteriorating drive.

4.48 Event ID 1068

Source ClusSvc

Description The cluster file share resource resourcename failed to start. Error 5.

Problem The file share cannot be brought online. The problem may be caused by permissions to the directory or disk in which the directory resides. This may also be related to permission problems within the domain.

Solution Check to make sure that the Cluster Service account has rights to the directory to be shared. Make sure a domain controller is accessible on the network. Make sure dependencies for the share and for other resources in the group are set correctly. Error 5 translates to "Access Denied."

4.49 Event ID 1069

Source ClusSvc

Description Cluster resource "Disk G:" failed.

Problem The named resource failed and the cluster service logged the event. In this example, a disk resource failed.

Solution For disk resources, check the device for proper operation. Check cables, termination, and logfiles on both cluster nodes. For other resources, check resource properties for proper configuration, and check to make sure dependencies are configured correctly. Check the diagnostic log (if it is enabled) for status codes corresponding to the failure.

4.50 Event ID 1070

Source ClusSvc

Description Cluster node attempted to join the cluster but failed with error 5052.

Problem The cluster node attempted to join an existing cluster but was unable to complete the process. This problem may occur if the node was previously evicted from the cluster.

Solution If the node was previously evicted from the cluster, you must remove and reinstall MSCS on the affected server.

4.51 Event ID 1071

Source ClusSvc

Description Cluster node 2 attempted to join but was refused. Error 5052.

Problem Another node attempted to join the cluster and this node refused the request.

Solution If the node was previously evicted from the cluster, you must remove and reinstall MSCS on the affected server. Look in Cluster Administrator to see if the other node is listed as a possible cluster member.

4.52 Event ID 1073

Source ClusSvc

Description Microsoft Cluster Server was halted to prevent an inconsistency within the cluster. The error code was 5028.

Problem The cluster service on the affected node was halted because of some kind of inconsistency between cluster nodes.

Solution Check connectivity between systems. This error may be an indication of configuration or hardware problems.

4.53 Event ID 1077

Source ClusSvc

Description The TCP/IP interface for cluster IP address resourcename has failed.

Problem The IP address resource depends on the proper operation of a specific network interface as configured in the resource properties. The network interface failed.

Solution Check the system event log for errors. Check the network adapter for proper operation and replace the adapter if necessary. Check to make sure the proper adapter driver is loaded for the device and check for newer versions of the driver.

4.54 Event ID 1080

Source ClusSvc

Description The Microsoft Cluster Server could not write file W:\MSCS\Chk7f5.tmp. The disk may be low on disk space, or some other serious condition exists.

Problem The cluster service attempted to create a temporary file in the MSCS directory on the quorum disk. Lack of disk space or other factors prevented successful completion of the operation.

Solution Check the quorum drive for available disk space. The file system may be corrupted or the device may be failing. Check file system permissions to ensure that the cluster service account has full access to the drive and directory.

4.55 Event ID 1093

Source ClusSvc

Description Node %1 is not a member of cluster %2. If the name of the node has changed, Microsoft Cluster Server must be reinstalled.

Problem The cluster service attempted to start but found that it was not a valid member of the cluster.

Solution Microsoft Cluster Server may need to be reinstalled on this node. If this is the result of a server name change, be sure to evict the node from the cluster (from an operational node) prior to reinstallation.

4.56 Event ID 1096

Source ClusSvc

Description Microsoft Cluster Server cannot use network adapter %1 because it does not have a valid IP address assigned to it.

Problem The network configuration for the adapter has changed and the cluster service cannot make use of the adapter for the network that was assigned to it.

Solution Check the network configuration. If a DHCP address was used for the primary address of the adapter, the address may have been lost. For best results, use a static address.

4.57 Event ID 1097

Source ClusSvc

Description Microsoft Cluster Server did not find any network adapters with valid IP addresses installed in the system. The node will not be able to join a cluster.

Problem The network configuration for the system needs to be corrected to match the same connected networks as the other node of the cluster.

Solution Check the network configuration and make sure it agrees with the working node of the cluster. Make sure the same networks are accessible from all systems in the cluster.

4.58 Event ID 1098

Source ClusSvc

Description The node is no longer attached to cluster network network_id by adapter adapter. Microsoft Cluster Server will delete network interface interface from the cluster configuration.

Information The Cluster Service observed a change in network configuration that might be induced by a change of adapter type or by removal of a network. The network will be removed from the list of available networks.

4.59 Event ID 1100

Source ClusSvc

Description Microsoft Cluster Server discovered that the node is now attached to cluster network network_id by adapter adapter. A new cluster network interface will be added to the cluster configuration.

Information The Cluster Service noticed a new network accessible by the cluster nodes, and has added the new network to the list of accessible networks.

4.60 Event ID 1102

Source ClusSvc

Description Microsoft Cluster Server discovered that the node is attached to a new network by adapter adapter. A new network and network interface will be added to the cluster configuration.

Information The cluster service noticed the addition of a new network. The network will be added to the list of available networks.

4.61 Event ID 1104

Source ClusSvc

Description Microsoft Cluster Server failed to update the configuration for one of the node's network interfaces. The error code was errorcode.

Problem The cluster service attempted to update a cluster node and could not perform the operation.

Solution Use the net helpmsg errorcode command to find an explanation of the underlying error. For example, error 1393 indicates that a corrupted disk caused the operation to fail.

4.62 Event ID 1105

Source ClusSvc

Description Microsoft Cluster Server failed to initialize the RPC services. The error code was %1.

Problem The cluster service attempted to utilize required RPC services and could not successfully perform the operation.

Solution Use the net helpmsg errorcode command to find an explanation of the underlying error. Check the system event log for other RPC related errors or performance problems.

4.63 Event ID 1107

Source ClusSvc

Description Cluster node node name failed to make a connection to the node over network network name. The error code was 1715.

Problem The cluster service attempted to connect to another cluster node over a specific network and could not establish a connection. This error is a warning message.

Solution Check to make sure that the specified network is available and functioning correctly. If the node experiences this problem, it may try other available networks to establish the desired connection.
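
The fallback behaviour mentioned above, where a node tries other available networks, can be sketched as a simple ordered search. This Python example is a conceptual illustration; `networks` and `try_connect` are hypothetical stand-ins and do not correspond to Cluster Service interfaces.

```python
def connect_over_first_working_network(networks, try_connect):
    """Try each cluster network in order; return the first that connects.

    `networks` is an ordered list of network names and `try_connect` is a
    caller-supplied function returning True on success. Both are
    hypothetical stand-ins, not Cluster Service interfaces.
    """
    failed = []
    for network in networks:
        if try_connect(network):
            return network, failed
        failed.append(network)  # each failure here would correspond to a warning
    return None, failed
```

In this model a connection that succeeds on a later network still leaves the earlier ones in the failed list, which mirrors why Event 1107 is a warning to investigate rather than proof that the node is isolated.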

4.64 Event ID 1109

Source ClusSvc

Description The node was unable to secure its connection to cluster node %1. The error code was %2. Check that both nodes can communicate with their domain controllers.

Problem The cluster service attempted to connect to another cluster node and could not establish a secure connection. This could indicate domain connectivity problems.

Solution Check to make sure that the networks are available and functioning correctly. This may be a symptom of larger network problems or domain security issues.

4.65 Event ID 1115

Source ClusSvc

Description An unrecoverable error caused the join of node nodename to the cluster to be aborted. The error code was errorcode.

Problem A node attempted to join the cluster but was unable to obtain successful membership.

Solution Use the NET HELPMSG errorcode command to obtain further description of the error that prevented the join operation. For example, error code 1393 indicates that a disk structure is corrupted and nonreadable. An error code like this could indicate a corrupted quorum disk.
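
For quick triage, the error codes that this guide explains can be kept in a small lookup table, with everything else referred to NET HELPMSG. The Python sketch below includes only the two codes whose meaning is stated in this guide; the table and function names are assumptions.

```python
# Error codes and meanings taken from this guide only; any other code
# should be looked up with `net helpmsg`.
KNOWN_ERROR_HINTS = {
    5: "Access denied: check Cluster Service account rights and domain access.",
    1393: "Disk structure is corrupted and nonreadable: may indicate a corrupted quorum disk.",
}


def triage_error_code(code):
    """Return a hint for codes this guide explains, else a lookup reminder."""
    number = int(code)
    return KNOWN_ERROR_HINTS.get(number, f"Unknown here: run 'net helpmsg {number}'.")
```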

5 Appendix: Related Event Messages

5.1 Event ID 9

Source Disk

Description The device \Device\ScsiPort2 did not respond within the timeout period.

Problem An I/O request was sent to a SCSI device and was not serviced within acceptable time. The device timeout was logged by this event.

Solution You may have a device or controller problem. Check SCSI cables, termination, and adapter configuration. Excessive recurrence of this event message may indicate a serious problem with potential for data loss or corruption. If necessary, contact your hardware vendor for help troubleshooting this problem.
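
Whether Event ID 9 is recurring excessively can be estimated from an exported copy of the system log. The Python sketch below counts occurrences inside a sliding time window. The input format, the one-hour window, and the threshold of five are arbitrary placeholders and are not values given by this guide.

```python
from datetime import datetime, timedelta


def excessive_timeouts(events, window=timedelta(hours=1), threshold=5):
    """Return True if `threshold` or more Event ID 9 records fall in any window.

    `events` is a list of (timestamp, event_id) tuples, for example parsed
    from an exported system log. The window and threshold defaults are
    arbitrary placeholders, not values from the guide.
    """
    times = sorted(ts for ts, event_id in events if event_id == 9)
    start = 0
    for end, ts in enumerate(times):
        while ts - times[start] > window:
            start += 1  # slide the window forward
        if end - start + 1 >= threshold:
            return True
    return False
```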

5.2 Event ID 101

Source W3SVC

Description The server was unable to add the virtual root "/" for the directory "path" because of the following error: The system cannot find the path specified. The data is the error.

Problem The World Wide Web Publishing service could not create a virtual root for the IIS Virtual Root resource. The directory path may have been deleted.

Solution Re-create or restore the directory and contents. Check the resource properties for the IIS Virtual Root resource and ensure that the path is correct. This problem may occur if you had an IIS Virtual Root resource defined and then uninstalled Microsoft Cluster Server without first deleting the resource. In this case, you may evaluate and change virtual root properties by using the Internet Service Manager.

5.3 Event ID 1004

Source DHCP

Description DHCP IP address lease "IP address" for the card with network address "media access control Address" has been denied.

Problem This system uses a DHCP-assigned IP address for a network adapter. The system attempted to renew the leased address and the DHCP server denied the request. The address may already be allocated to another system. The DHCP server may also have a problem. Network connectivity may be affected by this problem.

Solution Resolve the problem by correcting DHCP server problems or assigning a static IP address. For best results within a cluster, use statically assigned IP addresses.

5.4 Event ID 1005

Source DHCP

Description DHCP failed to renew a lease for the card with network address "MAC Address." The following error occurred: The semaphore timeout period has expired.

Problem This system uses a DHCP assigned IP address for a network adapter. The system attempted to renew the leased address and was unable to renew the lease. Network operations on this system may be affected.

Solution There may be a connectivity problem preventing access to the DHCP server that leased the address, or the DHCP server may be offline. For best results within a cluster, use statically assigned IP addresses.

5.5 Event ID 2511

Source Server

Description The server service was unable to recreate the share "Sharename" because the directory "path" no longer exists.

Problem The Server service attempted to create a share using the specified directory path. This problem may occur if you create a share (outside of Cluster Administrator) on a cluster shared device. If the device is not exclusively available to this computer, the server service cannot create the share. Also, the directory may no longer exist or there may be RPC related issues.

Solution Correct the problem by creating a shared resource through Cluster Administrator, or correct the problem with the missing directory. Check dates of RPC files in the system32 directory. Make sure they concur with those contained in the service pack in use, or any hotfixes applied.

5.6 Event ID 4199

Source TCPIP

Description The system detected an address conflict for IP address "IP address" with the system having network hardware address "media access control address." Network operations on this system may be disrupted as a result.

Problem Another system on the network may be using one of the addresses configured on this computer.

Solution Resolve the IP address conflict. Check network adapter configuration and any IP address resources defined within the cluster.

5.7 Event ID 5719

Source Netlogon

Description No Windows NT Domain controller is available for domain "domain." (This event is expected and can be ignored when booting with the "No Net" hardware profile.) The following error occurred: There are currently no logon servers available to service the logon request.

Problem A domain controller for the domain could not be contacted. As a result, proper authentication of accounts could not be completed. This may occur if the network is disconnected or disabled through system configuration.

Solution Resolve the connectivity problem with the domain controller and restart the system.

5.8 Event ID 7000

Source Service Control Manager

Description The Cluster Service failed to start because of the following error: The service did not start because of a logon failure.

Problem The service control manager attempted to start a service (possibly ClusSvc). It could not authenticate the service account. This error may be seen with Event 7013.

Solution The service account could not be authenticated. This may be because of a failure contacting a domain controller, or because account credentials are invalid. Check the service account name and password and ensure that the account is available and that credentials are correct. You may also try running the cluster service from a command prompt (if currently logged on as an administrator) by changing to the %systemroot%\Cluster directory (or where you installed the software) and typing ClusSvc -debug. If the service starts and runs correctly, stop it by pressing CTRL+C and troubleshoot the service account problem. This error may also occur if network connectivity is disabled through the system configuration or hardware profile. Microsoft Cluster Server requires network connectivity.

5.9 Event ID 7013

Source Service Control Manager

Description Logon attempt with current password failed with the following error: There are currently no logon servers available to service the logon request.

More Info The description for this error message may vary somewhat based on the actual error. For example, another error that may be listed in the event detail might be: "Logon Failure: unknown username or bad password."

Problem The service control manager attempted to start a service (possibly ClusSvc). It could not authenticate the service account with a domain controller.

Solution The service account may be in another domain, or this system may not be a domain controller. A node does not have to be a domain controller, but it needs access to a domain controller in its own domain as well as in the domain that the service account belongs to. Inability to contact the domain controller may occur because of a problem with the server, the network, or other factors. This problem is not related to the cluster software and must be resolved before you start the cluster software. This error may also occur if network connectivity is disabled through the system configuration or hardware profile. Microsoft Cluster Server requires network connectivity.

5.10 Event ID 7023

Source Service Control Manager

Description The Cluster Server service terminated with the following error: The quorum log could not be created or mounted successfully

Problem The Cluster Service attempted to start but could not gain access to the quorum log on the quorum disk. This may be because of problems gaining access to the disk or problems joining a cluster that has already formed.

Solution Check the disk and quorum log for problems. If necessary, check the cluster logfile for more information. There may be other events in the system event log that may give more information.

6 Appendix: Maintenance Tools

6.1 Windows 2000 tools

The tools presented below can be used for troubleshooting:

Tool: Use to

Disk Management (compmgmt.msc): Determine whether a disk is available to a particular node. If the disk can be selected under Disk Management, it is online to the local system. If the disk object appears dimmed, it is not available for that node.

Services option in Administrative Tools: Verify that the Cluster Service is running.

Windows 2000 Explorer, My Computer, or the Net View command: Verify that a particular share has been exported from the server you expected.

Event Viewer: View and manage System, Security, and Application event logs.

Dr. Watson: Detect, log, and diagnose application errors.

Task Manager: Monitor applications, tasks, and key performance metrics; and view detailed information on memory and CPU usage on each application and process.

Performance Monitor: Monitor system details of application and system behaviours, and monitor performance.

Network Monitor: Monitor and troubleshoot network connectivity by capturing and analyzing network traffic.

Windows Diagnostics (Winmsd.exe): Easily examine your system information on device drivers, network usage, and system resources, such as IRQ, DMA, and I/O addresses.

6.2 Windows 2000 Resource Kit tools

The tools presented below can be used for troubleshooting:

Tool: Use to

Diskmap: This command-line utility produces a detailed report on the configuration of the hard disk that you specify. It provides information from the registry about disk characteristics and geometry, and reads and displays data about all of the partitions and logical drives defined on the disk. It also shows Disk Signatures.

Dumpel: Dump Event Log is a command-line utility that dumps an event log for a local or remote system into a tab-separated text file. This utility can also be used to filter for or filter out certain event types.

Filever: This command-line tool examines the version resource structure of a file or a directory of files on either a local or remote computer, and displays information on the versions of executable files, such as .exe and .dll files.

Getmac: GetMAC provides a quick method for obtaining the MAC (Ethernet) layer address and binding order for a computer running Windows 2000, locally or across a network. This can be useful when you want to enter the address into a sniffer, or if you need to know what protocols are currently in use on a computer.

Netcons: This GUI tool monitors and displays current net connections, taking the place of the Windows command-line command net use.

Clustool: This tool permits backup and restore of the cluster configuration.

7 Appendix: Using and Reading the Cluster Logfile

7.1 CLUSTERLOG Environment Variable

If you set the CLUSTERLOG environment variable, the cluster will create a logfile that contains diagnostic information using the path specified. Important events during the operation of the Cluster Service will be logged in this file. Because so many different events occur, the logfile may be somewhat cryptic or hard to read. This document gives some hints about how to read the logfile and information about what items to look for.
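As a minimal sketch of how a script might check whether logging is configured, the following Python function reads the CLUSTERLOG variable and reports the size of the file it names. The function name `cluster_log_status` is hypothetical, and the sketch only inspects the environment of the process that runs it, which is not necessarily the environment the Cluster Service sees.

```python
import os


def cluster_log_status(environ=os.environ):
    """Return (path, size_in_bytes) for the logfile named by CLUSTERLOG.

    Returns None when the variable is not set. The size is None when the
    variable is set but no file exists at that path yet.
    """
    path = environ.get("CLUSTERLOG")
    if not path:
        return None
    size = os.path.getsize(path) if os.path.isfile(path) else None
    return path, size


if __name__ == "__main__":
    print(cluster_log_status())
```

Passing the environment as a parameter keeps the function easy to exercise with a plain dictionary.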

Note: Each time you attempt to start the Cluster Service, the log will be cleared and a new logfile started. Each component of MSCS that places an entry in the logfile will indicate itself by abbreviation in square brackets. For example, the Node Manager component would be abbreviated [NM]. Logfile entries will vary from one cluster to another. As a result, other logfiles may vary from excerpts referenced in this document.

Note: Log entry lines in the following sections have been wrapped for space constraints in this document. The lines do not normally wrap.
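The entry layout used in the excerpts of this appendix (a thread identifier, a timestamp, an optional component abbreviation in square brackets, then the message) can be picked apart with a regular expression. The sketch below is illustrative only: the names `LogEntry` and `parse_line` are hypothetical, and reading the two digits before the hyphen as the day of the month is an assumption, not something this guide states.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: <thread id>::<day>-<hh:mm:ss.mmm> [COMPONENT] message
LINE_RE = re.compile(
    r"^(?P<thread>[0-9a-fA-F]+)::"
    r"(?P<day>\d{2})-(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?:\[(?P<component>[^\]]+)\]:?\s*)?"
    r"(?P<message>.*)$"
)


class LogEntry(NamedTuple):
    thread: str               # hexadecimal thread identifier, e.g. "0b5"
    day: int                  # assumed to be the day of the month
    time: str                 # hh:mm:ss.mmm
    component: Optional[str]  # e.g. "NM", or None when no [..] tag is present
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one logfile entry; return None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return LogEntry(m["thread"], int(m["day"]), m["time"],
                    m["component"], m["message"])


if __name__ == "__main__":
    print(parse_line("0b5::12-20:15:23.546 [NM] Creating object for node 1 (NODEA)"))
```

Entries written by resource DLLs (for example the Physical Disk lines) carry no bracketed abbreviation, so the component field is optional.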

7.2 Operating System Version Number and Service Pack Level

Near the beginning of the logfile, notice the build number of MSCS, followed by the operating system version number and service pack level. If you call for support, engineers may ask for this information:

082::14-21:29:26.625 Cluster Service started - Cluster Version 1.224.
082::14-21:29:26.625 OS Version 4.0.1381 - Service Pack 3.
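If you need to collect this information from several nodes, a short script can pull it out of the first lines of each logfile. The following Python sketch assumes the two header entries have exactly the wording shown above; the function name `read_versions` is hypothetical, and the wording on systems without a service pack is not covered here.

```python
import re

CLUSTER_RE = re.compile(r"Cluster Service started - Cluster Version (\d+(?:\.\d+)*)")
OS_RE = re.compile(r"OS Version (\d+(?:\.\d+)*) - Service Pack (\d+)")


def read_versions(lines):
    """Collect cluster build, OS version and service pack from header entries."""
    info = {}
    for line in lines:
        m = CLUSTER_RE.search(line)
        if m:
            info["cluster_version"] = m.group(1)
        m = OS_RE.search(line)
        if m:
            info["os_version"] = m.group(1)
            info["service_pack"] = int(m.group(2))
    return info


if __name__ == "__main__":
    header = [
        "082::14-21:29:26.625 Cluster Service started - Cluster Version 1.224.",
        "082::14-21:29:26.625 OS Version 4.0.1381 - Service Pack 3.",
    ]
    print(read_versions(header))
```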

7.3 Cluster Service Startup

Following the version information, some initialization steps occur. Those steps are followed by an attempt to join the cluster, if one node already exists in a running state. If the Cluster Service could not detect any other cluster members, it will attempt to form the cluster. Consider the following log entries:

0b5::12-20:15:23.531 We're initing Ep...
0b5::12-20:15:23.531 [DM]: Initialization
0b5::12-20:15:23.531 [DM] DmpRestartFlusher: Entry
0b5::12-20:15:23.531 [DM] DmpStartFlusher: Entry
0b5::12-20:15:23.531 [DM] DmpStartFlusher: thread created
0b5::12-20:15:23.531 [NMINIT] Initializing the Node Manager...
0b5::12-20:15:23.546 [NMINIT] Local node name = NODEA.
0b5::12-20:15:23.546 [NMINIT] Local node ID = 1.
0b5::12-20:15:23.546 [NM] Creating object for node 1 (NODEA)
0b5::12-20:15:23.546 [NM] node 1 state 1

0b5::12-20:15:23.546 [NM] Initializing networks.
0b5::12-20:15:23.546 [NM] Initializing network interface facilities.
0b5::12-20:15:23.546 [NMINIT] Initialization complete.
0b5::12-20:15:23.546 [FM] Starting worker thread...
0b5::12-20:15:23.546 [API] Initializing
0a9::12-20:15:23.546 [FM] Worker thread running
0b5::12-20:15:23.546 [lm] :LmInitialize Entry.
0b5::12-20:15:23.546 [lm] :TimerActInitialize Entry.
0b5::12-20:15:23.546 [CS] Initializing RPC server.
0b5::12-20:15:23.609 [INIT] Attempting to join cluster MDLCLUSTER
0b5::12-20:15:23.609 [JOIN] Spawning thread to connect to sponsor 192.88.80.114
06c::12-20:15:23.609 [JOIN] Asking 192.88.80.114 to sponsor us.
0b5::12-20:15:23.609 [JOIN] Waiting for all connect threads to terminate.
06c::12-20:15:32.750 [JOIN] Sponsor 192.88.80.114 is not available, status=1722.
0b5::12-20:15:32.750 [JOIN] All connect threads have terminated.
0b5::12-20:15:32.750 [JOIN] Unable to connect to any sponsor node.
0b5::12-20:15:32.750 [INIT] Failed to join cluster, status 53
0b5::12-20:15:32.750 [INIT] Attempting to form cluster MDLCLUSTER
0b5::12-20:15:32.750 [Ep]: EpInitPhase1
0b5::12-20:15:32.750 [API] Online read only
04b::12-20:15:32.765 [RM] Main: Initializing.

Note that the cluster service attempts to join the cluster. If it cannot connect with an existing member, the software decides to form the cluster. The next series of steps attempts to form groups and resources necessary to accomplish this task. It is important to note that the cluster service must arbitrate control of the quorum disk.

0b5::12-20:15:32.781 [FM] Creating group a1a13a86-0eaf-11d1-8427-0000f8034599
0b5::12-20:15:32.781 [FM] Group a1a13a86-0eaf-11d1-8427-0000f8034599 contains a1a13a87-0eaf-11d1-8427-0000f8034599.
0b5::12-20:15:32.781 [FM] Creating resource a1a13a87-0eaf-11d1-8427-0000f8034599
0b5::12-20:15:32.781 [FM] FmpAddPossibleEntry adding 1 to a1a13a87-0eaf-11d1-8427-0000f8034599 possible node list
0b5::12-20:15:32.781 [FMX] Found the quorum resource a1a13a87-0eaf-11d1-8427-0000f8034599.
0b5::12-20:15:32.781 [FM] All dependencies for a1a13a87-0eaf-11d1-8427-0000f8034599 created
0b5::12-20:15:32.781 [FM] arbitrate for quorum resource id a1a13a87-0eaf-11d1-8427-0000f8034599.
0b5::12-20:15:32.781 FmpRmCreateResource: creating resource a1a13a87-0eaf-11d1-8427-0000f8034599 in shared resource monitor
0b5::12-20:15:32.812 FmpRmCreateResource: created resource a1a13a87-0eaf-11d1-8427-0000f8034599, resid 1363016
0dc::12-20:15:32.828 Physical Disk <Disk D:>: Arbitrate returned status 0.
0b5::12-20:15:32.828 [FM] FmGetQuorumResource successful
0b5::12-20:15:32.828 FmpRmOnlineResource: bringing resource a1a13a87-0eaf-11d1-8427-0000f8034599 (resid 1363016) online.
0b5::12-20:15:32.843 [CP] CppResourceNotify for resource Disk D:
0b5::12-20:15:32.843 [GUM] GumSendUpdate: Locker waiting type 0 context 8
0b5::12-20:15:32.843 [GUM] Thread 0xb5 UpdateLock wait on Type 0
0b5::12-20:15:32.843 [GUM] DoLockingUpdate successful, lock granted to 1
0b5::12-20:15:32.843 [GUM] GumSendUpdate: Locker dispatching seq 388 type 0 context 8
0b5::12-20:15:32.843 [GUM] GumpDoUnlockingUpdate releasing lock ownership
0b5::12-20:15:32.843 [GUM] GumSendUpdate: completed update seq 388 type 0 context 8
0b5::12-20:15:32.843 [GUM] GumSendUpdate: Locker waiting type 0 context 9
0b5::12-20:15:32.843 [GUM] Thread 0xb5 UpdateLock wait on Type 0
0b5::12-20:15:32.843 [GUM] DoLockingUpdate successful, lock granted to 1
0b5::12-20:15:32.843 [GUM] GumSendUpdate: Locker dispatching seq 389 type 0 context 9
0b5::12-20:15:32.843 [GUM] GumpDoUnlockingUpdate releasing lock ownership
0b5::12-20:15:32.843 [GUM] GumSendUpdate: completed update seq 389 type 0 context 9
0b5::12-20:15:32.843 FmpRmOnlineResource: Resource a1a13a87-0eaf-11d1-8427-0000f8034599 pending

0e1::12-20:15:33.359 Physical Disk <Disk D:>: Online, created registry watcher thread.
090::12-20:15:33.359 [FM] NotifyCallBackRoutine: enqueuing event
04d::12-20:15:33.359 [FM] WorkerThread, processing transition event for a1a13a87-0eaf-11d1-8427-0000f8034599, oldState = 129, newState = 2.
04d::12-20:15:33.359 [FM] HandleResourceTransition: Resource Name = a1a13a87-0eaf-11d1-8427-0000f8034599 old state=129 new state=2
04d::12-20:15:33.359 [DM] DmpQuoObjNotifyCb: Quorum resource is online
04d::12-20:15:33.375 [DM] DmpQuoObjNotifyCb: Own quorum resource, try open the quorum log
04d::12-20:15:33.375 [DM] DmpQuoObjNotifyCb: the name of the quorum file is D:\MSCS\quolog.log
04d::12-20:15:33.375 [lm] LogCreate : Entry FileName=D:\MSCS\quolog.log MaxFileSize=0x00010000
04d::12-20:15:33.375 [lm] LogpCreate : Entry

In this case, the node forms the cluster group and quorum disk resource, gains control of the disk, and opens the quorum logfile. From here, the cluster performs operations with the logfile, and proceeds to form the cluster. This involves configuring network interfaces and bringing them online.

0b5::12-20:15:33.718 [NM] Beginning form process.
0b5::12-20:15:33.718 [NM] Synchronizing node information.
0b5::12-20:15:33.718 [NM] Creating node objects.
0b5::12-20:15:33.718 [NM] Configuring networks & interfaces.
0b5::12-20:15:33.718 [NM] Synchronizing network information.
0b5::12-20:15:33.718 [NM] Synchronizing interface information.
0b5::12-20:15:33.718 [dm] DmBeginLocalUpdate Entry
0b5::12-20:15:33.718 [dm] DmBeginLocalUpdate Exit, pLocalXsaction=0x00151c20 dwError=0x00000000
0b5::12-20:15:33.718 [NM] Setting database entry for interface a1a13a7f-0eaf-11d1-8427-0000f8034599
0b5::12-20:15:33.718 [dm] DmCommitLocalUpdate Entry
0b5::12-20:15:33.718 [dm] DmCommitLocalUpdate Exit, dwError=0x00000000
0b5::12-20:15:33.718 [dm] DmBeginLocalUpdate Entry
0b5::12-20:15:33.875 [dm] DmBeginLocalUpdate Exit, pLocalXsaction=0x00151c20 dwError=0x00000000
0b5::12-20:15:33.875 [NM] Setting database entry for interface a1a13a81-0eaf-11d1-8427-0000f8034599
0b5::12-20:15:33.875 [dm] DmCommitLocalUpdate Entry
0b5::12-20:15:33.875 [dm] DmCommitLocalUpdate Exit, dwError=0x00000000
0b5::12-20:15:33.875 [NM] Matched 2 networks, created 0 new networks.
0b5::12-20:15:33.875 [NM] Resynchronizing network information.
0b5::12-20:15:33.875 [NM] Resynchronizing interface information.
0b5::12-20:15:33.875 [NM] Creating network objects.
0b5::12-20:15:33.875 [NM] Creating object for network a1a13a7e-0eaf-11d1-8427-0000f8034599
0b5::12-20:15:33.875 [NM] Creating object for network a1a13a80-0eaf-11d1-8427-0000f8034599
0b5::12-20:15:33.875 [NM] Creating interface objects.
0b5::12-20:15:33.875 [NM] Creating object for interface a1a13a7f-0eaf-11d1-8427-0000f8034599.
0b5::12-20:15:33.875 [NM] Registering network a1a13a7e-0eaf-11d1-8427-0000f8034599 with cluster transport.
0b5::12-20:15:33.875 [NM] Registering interfaces for network a1a13a7e-0eaf-11d1-8427-0000f8034599 with cluster transport.
0b5::12-20:15:33.875 [NM] Registering interface a1a13a7f-0eaf-11d1-8427-0000f8034599 with cluster transport, addr 9.9.9.2, endpoint 3003.
0b5::12-20:15:33.890 [NM] Instructing cluster transport to bring network a1a13a7e-0eaf-11d1-8427-0000f8034599 online.
0b5::12-20:15:33.890 [NM] Creating object for interface a1a13a81-0eaf-11d1-8427-0000f8034599.
0b5::12-20:15:33.890 [NM] Registering network a1a13a80-0eaf-11d1-8427-0000f8034599 with cluster transport.

0b5::12-20:15:33.890 [NM] Registering interfaces for network a1a13a80-0eaf-11d1-8427-0000f8034599 with cluster transport.
0b5::12-20:15:33.890 [NM] Registering interface a1a13a81-0eaf-11d1-8427-0000f8034599 with cluster transport, addr 192.88.80.190, endpoint 3003.
0b5::12-20:15:33.890 [NM] Instructing cluster transport to bring network a1a13a80-0eaf-11d1-8427-0000f8034599 online.

After initializing network interfaces, the cluster will continue formation with the enumeration of cluster nodes. In this case, as a newly formed cluster, the cluster will contain only one node. If this session had been joining an existing cluster, the node enumeration would show two nodes. Next, the cluster will bring the Cluster IP address and Cluster Name resources online.

0b5::12-20:15:34.015 [FM] OnlineGroup: setting group state to Online for f901aa29-0eaf-11d1-8427-0000f8034599
069::12-20:15:34.015 IP address <Cluster IP address>: Created NBT interface \Device\NetBt_If6 (instance 355833456).
0b5::12-20:15:34.015 [FM] FmpAddPossibleEntry adding 1 to a1a13a87-0eaf-11d1-8427-0000f8034599 possible node list
0b5::12-20:15:34.015 [FM] FmFormNewClusterPhase2 complete.
...
0b5::12-20:15:34.281 [INIT] Successfully formed a cluster.
09c::12-20:15:34.281 [lm] :ReSyncTimerHandles Entry.
09c::12-20:15:34.281 [lm] :ReSyncTimerHandles Exit gdwNumHandles=3
0b5::12-20:15:34.281 [INIT] Cluster Started! Original Min WS is 204800, Max WS is 1413120.
08c::12-20:15:34.296 [CPROXY] clussvc initialized
069::12-20:15:40.421 IP address <Cluster IP Address>: IP Address 192.88.80.114 on adapter DC21X41 online...
04d::12-20:15:40.421 [FM] OnlineWaitingTree, a1a13a84-0eaf-11d1-8427-0000f8034599 depends on a1a13a83-0eaf-11d1-8427-0000f8034599. Start first
04d::12-20:15:40.421 [FM] OnlineWaitingTree, Start resource a1a13a84-0eaf-11d1-8427-0000f8034599
04d::12-20:15:40.421 [FM] OnlineResource: a1a13a84-0eaf-11d1-8427-0000f8034599 depends on a1a13a83-0eaf-11d1-8427-0000f8034599. Bring online first.
04d::12-20:15:40.421 FmpRmOnlineResource: bringing resource a1a13a84-0eaf-11d1-8427-0000f8034599 (resid 1391032) online.
04d::12-20:15:40.421 [CP] CppResourceNotify for resource Cluster Name
04d::12-20:15:40.421 [GUM] GumSendUpdate: Locker waiting type 0 context 8
04d::12-20:15:40.437 [GUM] Thread 0x4d UpdateLock wait on Type 0
04d::12-20:15:40.437 [GUM] DoLockingUpdate successful, lock granted to 1
076::12-20:15:40.437 Network Name <Cluster Name>: Bringing resource online...
04d::12-20:15:40.437 [GUM] GumSendUpdate: Locker dispatching seq 411 type 0 context 8
04d::12-20:15:40.437 [GUM] GumpDoUnlockingUpdate releasing lock ownership
04d::12-20:15:40.437 [GUM] GumSendUpdate: completed update seq 411 type 0 context 8
04d::12-20:15:40.437 [GUM] GumSendUpdate: Locker waiting type 0 context 11
...
076::12-20:15:43.515 Network Name <Cluster Name>: Registered server name MDLCLUSTER on transport \Device\NetBt_If6.
076::12-20:15:46.578 Network Name <Cluster Name>: Registered workstation name MDLCLUSTER on transport \Device\NetBt_If6.
076::12-20:15:46.578 Network Name <Cluster Name>: Network Name MDLCLUSTER is now online

Following these steps, the cluster will attempt to bring other resources and groups online. The logfile will continue to increase in size as the cluster service runs. Therefore, consider enabling cluster logging only while you are troubleshooting a problem, rather than leaving it on for days or weeks at a time.
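The join-then-form sequence walked through in this section can be summarized from the [INIT] entries alone. The Python sketch below is a hedged illustration: it recognizes only the [INIT] messages that appear in the excerpts of this appendix, the function name `startup_milestones` is hypothetical, and the wording of other outcomes (for example a successful join) is not covered.

```python
import re

# Only the [INIT] messages quoted in this appendix are recognized.
PATTERNS = [
    ("join_attempt", re.compile(r"\[INIT\] Attempting to join cluster (\S+)")),
    ("join_failed", re.compile(r"\[INIT\] Failed to join cluster, status (\d+)")),
    ("form_attempt", re.compile(r"\[INIT\] Attempting to form cluster (\S+)")),
    ("form_failed", re.compile(r"\[INIT\] Failed to form cluster, status (\d+)")),
    ("formed", re.compile(r"\[INIT\] Successfully formed a cluster")),
]


def startup_milestones(lines):
    """Return (milestone, detail) pairs in the order they appear in the log."""
    events = []
    for line in lines:
        for name, pattern in PATTERNS:
            m = pattern.search(line)
            if m:
                # detail is the cluster name or status code, when captured
                detail = m.group(1) if pattern.groups else None
                events.append((name, detail))
                break
    return events


if __name__ == "__main__":
    sample = [
        "0b5::12-20:15:23.609 [INIT] Attempting to join cluster MDLCLUSTER",
        "0b5::12-20:15:32.750 [INIT] Failed to join cluster, status 53",
        "0b5::12-20:15:32.750 [INIT] Attempting to form cluster MDLCLUSTER",
        "0b5::12-20:15:34.281 [INIT] Successfully formed a cluster.",
    ]
    for event in startup_milestones(sample):
        print(event)
```

A log that ends with a form_failed milestone and no formed milestone corresponds to the failure cases where the Cluster Service terminates.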

7.4 Logfile Entries for Common Failures

After reviewing a successful startup of the Cluster Service, you may want to examine some errors that may appear because of various failures. The following examples illustrate possible log entries for four different failures.

Example 1: Quorum Disk Turned Off

If the cluster attempts to form and cannot connect to the quorum disk, entries similar to the following may appear in the logfile. Because of the failure, the cluster cannot form, and the Cluster Service terminates.

0b9::14-20:59:42.921 [RM] Main: Initializing.
08f::14-20:59:42.937 [FM] Creating group a1a13a86-0eaf-11d1-8427-0000f8034599
08f::14-20:59:42.937 [FM] Group a1a13a86-0eaf-11d1-8427-0000f8034599 contains a1a13a87-0eaf-11d1-8427-0000f8034599.
08f::14-20:59:42.937 [FM] Creating resource a1a13a87-0eaf-11d1-8427-0000f8034599
08f::14-20:59:42.937 [FM] FmpAddPossibleEntry adding 1 to a1a13a87-0eaf-11d1-8427-0000f8034599 possible node list
08f::14-20:59:42.937 [FMX] Found the quorum resource a1a13a87-0eaf-11d1-8427-0000f8034599.
08f::14-20:59:42.937 [FM] All dependencies for a1a13a87-0eaf-11d1-8427-0000f8034599 created
08f::14-20:59:42.937 [FM] arbitrate for quorum resource id a1a13a87-0eaf-11d1-8427-0000f8034599.
08f::14-20:59:42.937 FmpRmCreateResource: creating resource a1a13a87-0eaf-11d1-8427-0000f8034599 in shared resource monitor
08f::14-20:59:42.968 FmpRmCreateResource: created resource a1a13a87-0eaf-11d1-8427-0000f8034599, resid 1362616
0e9::14-20:59:43.765 Physical Disk <Disk D:>: SCSI, error reserving disk, error 21.
0e9::14-20:59:54.125 Physical Disk <Disk D:>: SCSI, error reserving disk, error 21.
0e9::14-20:59:54.140 Physical Disk <Disk D:>: Arbitrate returned status 21.
08f::14-20:59:54.140 [FM] FmGetQuorumResource failed, error 21.
08f::14-20:59:54.140 [INIT] Cleaning up failed form attempt.
08f::14-20:59:54.140 [INIT] Failed to form cluster, status 3213068.
08f::14-20:59:54.140 [CS] ClusterInitialize failed 21
08f::14-20:59:54.140 [INIT] The cluster service is shutting down.
08f::14-20:59:54.140 [evt] EvShutdown
08f::14-20:59:54.140 [FM] Shutdown: Failover Manager requested to shutdown groups.
08f::14-20:59:54.140 [FM] DestroyGroup: destroying a1a13a86-0eaf-11d1-8427-0000f8034599
08f::14-20:59:54.140 [FM] DestroyResource: destroying a1a13a87-0eaf-11d1-8427-0000f8034599
08f::14-20:59:54.140 [OM] Deleting object Physical Disk
08f::14-20:59:54.140 [FM] Resource a1a13a87-0eaf-11d1-8427-0000f8034599 destroyed.
08f::14-20:59:54.140 [FM] Group a1a13a86-0eaf-11d1-8427-0000f8034599 destroyed.
08f::14-20:59:54.140 [Dm] DmShutdown
08f::14-20:59:54.140 [DM] DmpShutdownFlusher: Entry
08f::14-20:59:54.156 [DM] DmpShutdownFlusher: Setting event
062::14-20:59:54.156 [DM] DmpRegistryFlusher: got 0
062::14-20:59:54.156 [DM] DmpRegistryFlusher: exiting
0ca::14-20:59:54.156 [FM] WorkItem, delete resource <Disk D:> status 0
0ca::14-20:59:54.156 [OM] Deleting object Disk Group 1 (a1a13a86-0eaf-11d1-8427-0000f8034599)
0e7::14-20:59:54.375 [CPROXY] clussvc terminated, error 0.

0e7::14-20:59:54.375 [CPROXY] Service Stopping...
0b9::14-20:59:54.375 [RM] Going away, Status = 1, Shutdown = 0.
02c::14-20:59:54.375 [RM] PollerThread stopping. Shutdown = 1, Status = 0, WaitFailed = 0, NotifyEvent address = 196.
0e7::14-20:59:54.375 [CPROXY] Cleaning up
0b9::14-20:59:54.375 [RM] RundownResources posting shutdown notification.
0e7::14-20:59:54.375 [CPROXY] Cleanup complete.
0e3::14-20:59:54.375 [RM] NotifyChanges shutting down.
0e7::14-20:59:54.375 [CPROXY] Service Stopped.

Perhaps the most meaningful lines from above are:

0e9::14-20:59:43.765 Physical Disk <Disk D:>: SCSI, error reserving disk, error 21.
0e9::14-20:59:54.125 Physical Disk <Disk D:>: SCSI, error reserving disk, error 21.
0e9::14-20:59:54.140 Physical Disk <Disk D:>: Arbitrate returned status 21.

Note: The error code on these logfile entries is 21. You can issue net helpmsg 21 from the command line to display the explanation of the status code. Status code 21 means, "The device is not ready." This indicates a possible problem with the device. In this case, the device was turned off, and the error status correctly indicates the problem.
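When many entries need to be reviewed, the status codes can be pulled out of the log text and matched against a small table. The Python sketch below is illustrative only: the texts for codes 2 and 21 are the ones quoted in this guide, the texts for 53, 1722 and 5057 are taken from the standard Win32 system error code list, and the names `KNOWN_CODES` and `explain_codes` are hypothetical. For any code not in the table, net helpmsg on the node remains the authoritative lookup.

```python
import re

# Codes that appear in the excerpts of this appendix.
KNOWN_CODES = {
    2: "The system cannot find the file specified.",
    21: "The device is not ready.",
    53: "The network path was not found.",
    1722: "The RPC server is unavailable.",
    5057: "The cluster IP address is already in use.",
}

# Matches "error 21", "status 21", "status=1722" and "Error: 2".
CODE_RE = re.compile(r"\b(?:error|status)\b[\s=:]*(\d+)", re.IGNORECASE)


def explain_codes(line):
    """Return (code, text) pairs; text is None for codes not in the table."""
    return [(int(c), KNOWN_CODES.get(int(c))) for c in CODE_RE.findall(line)]


if __name__ == "__main__":
    print(explain_codes("08f::14-20:59:54.140 [FM] FmGetQuorumResource failed, error 21."))
```

Codes that the table does not know, such as the status 3213068 in the failed form attempt, come back with None so that the caller can decide how to report them.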

Example 2: Quorum Disk Failure

In this example, the drive has failed or has been reformatted from the SCSI controller. As a result, the cluster service cannot locate a drive with the specific signature it is looking for.

0b8::14-21:11:46.515 [RM] Main: Initializing.
074::14-21:11:46.531 [FM] Creating group a1a13a86-0eaf-11d1-8427-0000f8034599
074::14-21:11:46.531 [FM] Group a1a13a86-0eaf-11d1-8427-0000f8034599 contains a1a13a87-0eaf-11d1-8427-0000f8034599.
074::14-21:11:46.531 [FM] Creating resource a1a13a87-0eaf-11d1-8427-0000f8034599
074::14-21:11:46.531 [FM] FmpAddPossibleEntry adding 1 to a1a13a87-0eaf-11d1-8427-0000f8034599 possible node list
074::14-21:11:46.531 [FMX] Found the quorum resource a1a13a87-0eaf-11d1-8427-0000f8034599.
074::14-21:11:46.531 [FM] All dependencies for a1a13a87-0eaf-11d1-8427-0000f8034599 created
074::14-21:11:46.531 [FM] arbitrate for quorum resource id a1a13a87-0eaf-11d1-8427-0000f8034599.
074::14-21:11:46.531 FmpRmCreateResource: creating resource a1a13a87-0eaf-11d1-8427-0000f8034599 in shared resource monitor
074::14-21:11:46.562 FmpRmCreateResource: created resource a1a13a87-0eaf-11d1-8427-0000f8034599, resid 1362696
075::14-21:11:46.671 Physical Disk <Disk D:>: SCSI, Performing bus rescan.
075::14-21:11:51.843 Physical Disk <Disk D:>: SCSI, error attaching to signature 71cd0549, error 2.
075::14-21:11:51.843 Physical Disk <Disk D:>: Unable to attach to signature 71cd0549. Error: 2.
074::14-21:11:51.859 [FM] FmGetQuorumResource failed, error 2.
074::14-21:11:51.859 [INIT] Cleaning up failed form attempt.

In this case, the most important logfile entries are:

075::14-21:11:51.843 Physical Disk <Disk D:>: SCSI, error attaching to signature 71cd0549, error 2.
075::14-21:11:51.843 Physical Disk <Disk D:>: Unable to attach to signature 71cd0549. Error: 2.

Status code 2 means, "The system cannot find the file specified." The error in this case may mean that it cannot find the disk, or that, because of some kind of problem, it cannot locate the quorum logfile that should be on the disk.

Example 3: Duplicate Cluster IP Address

If another computer on the network has the same IP address as the cluster IP address resource, the resource will be prevented from going online. Further, the cluster name will not be registered on the network, as it depends on the IP address resource. Because this name is the network name used for cluster administration, you will not be able to administer the cluster using this name, in this type of failure. However, you may be able to use the computer name of the cluster node to connect with Cluster Administrator. Additionally, you may be able to connect locally from the console using the loopback address. The following sample entries are from a cluster logfile during this type of failure:

0b9::14-21:32:59.968 IP Address <Cluster IP Address>: The IP address is already in use on the network, status 5057.
0d2::14-21:32:59.984 [FM] NotifyCallBackRoutine: enqueuing event
03e::14-21:32:59.984 [FM] WorkerThread, processing transition event for a1a13a83-0eaf-11d1-8427-0000f8034599, oldState = 129, newState = 4.
03e...
03e::14-21:32:59.984 FmpHandleResourceFailure: taking resource a1a13a83-0eaf-11d1-8427-0000f8034599 and dependents offline
03e::14-21:32:59.984 [FM] TerminateResource: a1a13a84-0eaf-11d1-8427-0000f8034599 depends on a1a13a83-0eaf-11d1-8427-0000f8034599. Terminating first
0d3::14-21:32:59.984 Network Name <Cluster Name>: Terminating name MDLCLUSTER...
0d3::14-21:32:59.984 Network Name <Cluster Name>: Name MDLCLUSTER is already offline.
...
03e::14-21:33:00.000 FmpRmTerminateResource: a1a13a84-0eaf-11d1-8427-0000f8034599 is now offline
0c7::14-21:33:00.000 IP Address <Cluster IP Address>: Terminating resource...
0c7::14-21:33:00.000 IP Address <Cluster IP Address>: Address 192.88.80.114 on adapter DC21X41 offline.
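The order of the entries above (the Network Name resource is terminated before the IP Address resource that failed) follows from the dependency between the two resources. The following Python sketch models that ordering only as an illustration. The function termination_order and the DEPENDS_ON table are hypothetical and are not part of the Cluster Service. The resource names are taken from the log entries.

```python
def termination_order(failed, depends_on):
    """Return the resources taken offline when `failed` fails.

    Dependents come first and the failed resource comes last, mirroring the
    "depends on ... Terminating first" entry in the sample log.
    `depends_on` maps each resource to the list of resources it depends on.
    """
    order, seen = [], set()

    def visit(resource):
        seen.add(resource)
        # Anything that depends on this resource must go offline before it.
        for other, needs in depends_on.items():
            if resource in needs and other not in seen:
                visit(other)
        order.append(resource)

    visit(failed)
    return order


# Dependency described in this example: the cluster name depends on the IP address.
DEPENDS_ON = {
    "Cluster Name": ["Cluster IP Address"],
    "Cluster IP Address": [],
}

print(termination_order("Cluster IP Address", DEPENDS_ON))
```

For the table above, the cluster name is listed before the cluster IP address, which is why the cluster cannot be administered by name while the address conflict exists.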

Example 4: Evicted Node Attempts to Join Existing Cluster

If you evict a node from a cluster, the cluster software on that node must be reinstalled before the node can rejoin the cluster. If you start the evicted node and the Cluster Service attempts to join the cluster, entries similar to the following may appear in the cluster logfile:

032::26-16:11:45.109 [INIT] Attempting to join cluster MDLCLUSTER
032::26-16:11:45.109 [JOIN] Spawning thread to connect to sponsor 192.88.80.115
040::26-16:11:45.109 [JOIN] Asking 192.88.80.115 to sponsor us.
032::26-16:11:45.109 [JOIN] Spawning thread to connect to sponsor 9.9.9.2
032::26-16:11:45.109 [JOIN] Spawning thread to connect to sponsor 192.88.80.190
099::26-16:11:45.109 [JOIN] Asking 9.9.9.2 to sponsor us.
032::26-16:11:45.109 [JOIN] Spawning thread to connect to sponsor NODEA
098::26-16:11:45.109 [JOIN] Asking 192.88.80.190 to sponsor us.
032::26-16:11:45.125 [JOIN] Waiting for all connect threads to terminate.
092::26-16:11:45.125 [JOIN] Asking NODEA to sponsor us.
040::26-16:12:18.640 [JOIN] Sponsor 192.88.80.115 is not available (JoinVersion), status=1722.
098::26-16:12:18.640 [JOIN] Sponsor 192.88.80.190 is not available (JoinVersion), status=1722.
099::26-16:12:18.640 [JOIN] Sponsor 9.9.9.2 is not available (JoinVersion), status=1722.
098::26-16:12:18.640 [JOIN] JoinVersion data for sponsor 157.57.224.190 is invalid, status 1722.
099::26-16:12:18.640 [JOIN] JoinVersion data for sponsor 9.9.9.2 is invalid, status 1722.
040::26-16:12:18.640 [JOIN] JoinVersion data for sponsor 157.58.80.115 is invalid, status 1722.
092::26-16:12:18.703 [JOIN] Sponsor NODEA is not available (JoinVersion), status=1722.
092::26-16:12:18.703 [JOIN] JoinVersion data for sponsor NODEA is invalid, status 1722.
032::26-16:12:18.703 [JOIN] All connect threads have terminated.
032::26-16:12:18.703 [JOIN] Unable to connect to any sponsor node.
032::26-16:12:18.703 [INIT] Failed to join cluster, status 0
032::26-16:12:18.703 [INIT] Attempting to form cluster MDLCLUSTER
...
032::26-16:12:18.734 [FM] arbitrate for quorum resource id 24acc093-1e28-11d1-9e5d-0000f8034599.
032::26-16:12:18.734 [FM] FmpQueryResourceInfo:initialize the resource with the registry information
032::26-16:12:18.734 FmpRmCreateResource: creating resource 24acc093-1e28-11d1-9e5d-0000f8034599 in shared resource monitor
032::26-16:12:18.765 FmpRmCreateResource: created resource 24acc093-1e28-11d1-9e5d-0000f8034599, resid 1360000
06d::26-16:12:18.812 Physical Disk <Disk G:>: SCSI, error attaching to signature b2320a9b, error 2.
06d::26-16:12:18.812 Physical Disk <Disk G:>: Unable to attach to signature b2320a9b. Error: 2.
032::26-16:12:18.812 [FM] FmGetQuorumResource failed, error 2.
032::26-16:12:18.812 [INIT] Cleaning up failed form attempt.
032::26-16:12:18.812 [INIT] Failed to form cluster, status 2.
032::26-16:12:18.828 [CS] ClusterInitialize failed 2

The node attempts to join the existing cluster, but has invalid credentials, because it was previously evicted. Therefore, the existing node refuses to communicate with it. The node may attempt to form its own version of the cluster, but cannot gain control of the quorum disk, because the existing cluster node maintains ownership. Examination of the logfile on the existing cluster node reveals that the Cluster Service posted entries to reflect the failed attempt to join:

0c4::29-18:13:31.035 [NMJOIN] Processing request by node 2 to begin joining.
0c4::29-18:13:31.035 [NMJOIN] Node 2 is not a member of this cluster. Cannot join.
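The pattern in Example 4 (every sponsor refuses the join, then the attempt to form a cluster fails as well) can be recognised by scanning the evicted node's logfile for three kinds of entries. The following Python sketch assumes the entry wording shown in the sample above. The function summarize_join and its result keys are hypothetical names chosen for illustration.

```python
import re

# Wording taken from the sample entries of the evicted node.
SPONSOR_DOWN = re.compile(
    r"\[JOIN\] Sponsor (\S+) is not available \(JoinVersion\), status=(\d+)\."
)


def summarize_join(lines):
    """Collect the sponsors that refused the join and whether join and form failed."""
    summary = {"unavailable": {}, "join_failed": False, "form_failed": False}
    for line in lines:
        match = SPONSOR_DOWN.search(line)
        if match:
            # Map sponsor name or address to the status code it returned.
            summary["unavailable"][match.group(1)] = int(match.group(2))
        elif "[JOIN] Unable to connect to any sponsor node." in line:
            summary["join_failed"] = True
        elif "[INIT] Failed to form cluster" in line:
            summary["form_failed"] = True
    return summary
```

When both join_failed and form_failed are true and the logfile on the existing node contains the "is not a member of this cluster" entry, the situation matches the evicted-node case described in this example.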

8 Appendix: Q258078 Cluster Service Startup Options

PSS ID Number: Q258078
Article last modified on 08-08-2001

:2000,4.0

======================================================================
-------------------------------------------------------------------------------
The information in this article applies to:

- Microsoft Windows 2000 Advanced Server
- Microsoft Windows 2000 Datacenter Server
- Microsoft Windows NT Server, Enterprise Edition version 4.0
-------------------------------------------------------------------------------

SUMMARY
=======

This is a list of all the available switches that can be used as startup parameters to start the Cluster service.

To use a switch, go to the properties of the service, put the appropriate switch in the Start Parameters box, and then click Start.

NOTE: You must include a forward slash (/) at the beginning of the switch.

You can also use the desired switch when starting the Cluster service from the command line:

net start clussvc /<switch>

NOTE: The Debug command has special startup parameters. See the Debug section below for proper usage.

Valid option switches are:
- FixQuorum -- No quorum device, no quorum logging
- DebugResMon -- Enable debugging of resrcmon process
- Debug -- Displays events during the start of Cluster Service. See below for special syntax

Windows 2000 and later only switches:
- ResetQuorumLog -- Dynamically re-creates the quorum log and checkpoint files (this functionality is automatic in Windows NT 4.0)
- NoRepEvtLogging -- No replication of Event Log entries

MORE INFORMATION
================

Explanation of some of the switches:

- Debug

Function: The cluster log may not contain any helpful information for diagnosing failures of the Cluster service to start, because the Cluster service may fail before Cluster.log is started. Starting the Cluster service with this switch displays the initialization of the Cluster service and can help identify these early problems.

Requirements: This switch is intended for temporary diagnostic use only. If the Cluster service fails to start because of a logon error of the service account, or another system-related error, the service may not have a chance to run. As a result, a cluster.log file may not be created. This method runs the service outside of the normal environment given by the Service Control Manager. To use this switch, you must be logged on locally with administrative rights and must launch the service from the command prompt. Do not use the /debug option for normal use or for any length of time. The service does not run as efficiently with the option set.

Usage scenarios: This switch must be used only when the Cluster service fails to start up. It displays on the screen the operation of the Cluster Service as it attempts to start. It can be used only when starting the service from the command prompt, and you must be in the directory in which the Cluster Service is installed (by default, %SystemRoot%\Cluster). This is also the only switch for which you do not use the NET START command to start the service.

Operation: Open a Command Prompt and change your current directory to the %SystemRoot%\cluster directory. Then type:

"CLUSSVC /debug"

The cluster service will send output to the window similar to what would normally be seen in the cluster.log. You may also capture this information to a file by using the following command syntax instead:

"CLUSSVC /debug > c:\debug.log"

At the point that you are satisfied that the Cluster service is running
properly, use CTRL+C from the keyboard to stop the service.

Note: You may wish to use the ClusterLogLevel environment variable to control the output level when using the Debug switch. For additional information, see this article:

Q168801 How to Enable Cluster Logging in Microsoft Cluster Server

- FixQuorum

Function: Lets the cluster service start up despite problems with the quorum device. The only resources that will be brought online once the service is started are the Cluster IP Address and the Cluster Name. You can open Cluster Administrator and bring other resources online manually.

Requirements: This switch MUST be used only in diagnosis mode on a very temporary basis and not during normal operation. Start only one node with this switch, and do not attempt to join a second node to the node that was started with it. Typically, this switch is used alone.

Usage scenarios: If the cluster service is unable to start up in the normal way due to the failure of the quorum resource, users can start up the cluster service in this mode and attempt to diagnose the failure.

Operation: After the cluster service is started up, all resources including the quorum resource remain offline. Users can then manually try to bring the quorum resource online, monitor the cluster log entries as well as the new event log entries, and attempt to diagnose any problems with the quorum resource.

NET START ClusSvc /FixQuorum

- ResetQuorumLog

Function: If the quorum log and checkpoint files are not found or are corrupt, this switch can be used to create them from the information in the local node's %SystemRoot%\Cluster\CLUSDB registry hive. If the quorum log file is found to be in proper order, this switch has no effect.

Requirements: Typically, only one node is started up using this switch and this switch is used alone. It must be used only by experienced users who understand the consequences of using information that is potentially out of date to create a new quorum log file.

Usage scenarios: This switch must be used only when the Cluster service fails to start up on a Windows 2000 or later machine due to a missing/corrupt
quorum log QUOLOG.LOG and CHKxxx.TMP files. Windows NT 4.0 automatically recreates these files if they do not exist; this switch was added in Windows 2000 to give more control over the start of the Cluster service.

Operation: The Cluster service does an auto-reset of the quorum log file if it is found missing or corrupt, using the information in the currently loaded cluster hive from the file %systemroot%\Cluster\CLUSDB.

NET START ClusSvc /ResetQuorumLog

- DebugResMon

Function: Helps you to debug the resource monitor process and, therefore, the resource dynamic-link libraries (DLLs) that are loaded by the resource monitor. You can use any standard Windows-based debugger.

Requirements: Can be used only when the cluster service is started from the command prompt with the "/debug" option. There is no equivalent registry setting that could be used when the cluster service is run as a service. A debugger must be available for attaching to the resource monitor when it starts up. Typically, this switch is used alone.

Usage scenarios: Developers use it to debug the resource monitor process and resource DLLs. This option is extremely useful if a bug in a resource DLL causes the resource monitor process to crash soon after it is started up by the cluster service and before users can manually attach a debugger to the resource monitor process.

Operation: Just before the resource monitor process is started up, the cluster service process waits with a message "Waiting for debugger to connect to the resmon process X" where X is the PID (Process ID) of the resource monitor process. The cluster service does this waiting for all resource monitor processes created by it. Once the user attaches a debugger to the resource monitor process, and the resource monitor process starts up, the cluster service continues with its initialization.

- NoRepEvtLogging

Function: The norepevtlogging command prevents replication of those events recorded in the event log. This command is useful in reducing the amount of information displayed in the command window by filtering out events already recorded in the event log. Event log replication is a new feature added in Windows 2000.

Usage scenarios: For example, to start the cluster service and log those events not recorded in the event log to a local file, Debugnorep.log:
clussvc /debug /norepevtlogging > c:\DEBUGNOREP.log

Operation: The norepevtlogging command can be set as a start parameter when starting the cluster service from the Computer Management console. The command line syntax is:

NET START ClusSvc /NoRepEvtLogging

This prevents the node that was started with this switch from replicating its information to other nodes, but the node still receives information from other nodes that were started normally.

Additional query words: MSCS

======================================================================
Keywords : kbenv kbtool w2000mscs kbClustering
Technology : kbWinNTsearch kbWinNT400search kbwin2000AdvServ kbwin2000AdvServSearch kbwin2000DataServ kbwin2000DataServSearch kbWinNTSsearch kbWinNTSEntSearch kbWinNTSEnt400 kbWinNTS400search kbwin2000Search kbWinAdvServSearch kbWinDataServSearch
Version : :2000,4.0
Issue type : kbinfo
=============================================================================
Copyright Microsoft Corporation 2001.
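As a summary of the article above, the command lines it gives can be collected in a small lookup helper. The following Python sketch is only an illustration: startup_command, NET_START and CLUSTER_DIR are hypothetical names, and the helper covers only the switches for which the article states an explicit command line. It returns the command text and does not run anything.

```python
# Switches for which the article gives an explicit NET START command line.
NET_START = {
    "fixquorum": "FixQuorum",
    "resetquorumlog": "ResetQuorumLog",
    "norepevtlogging": "NoRepEvtLogging",
}

# Default installation directory from which the article says CLUSSVC /debug is run.
CLUSTER_DIR = r"%SystemRoot%\Cluster"


def startup_command(switch):
    """Return (command, working_directory) for a startup switch.

    working_directory is None when the article names no required directory.
    """
    key = switch.lstrip("/").lower()
    if key in NET_START:
        return f"NET START ClusSvc /{NET_START[key]}", None
    if key == "debug":
        # Debug is the one switch that is not started through NET START.
        return "CLUSSVC /debug", CLUSTER_DIR
    raise ValueError(f"no command line is given in the article for switch {switch!r}")


print(startup_command("/FixQuorum")[0])
```

DebugResMon is deliberately left out, because the article states only that it requires the /debug option and does not show a complete command line for it.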
