HACMP Full Notes

Elements in System Availability:

- Well planned and implemented concept

- "no single point of failure"

- recovery without user intervention, using scripting

High availability is:

- The masking or elimination of both planned and unplanned downtime.

- The elimination of single points of failure (SPOFs)

- Fault resilience, but not fault tolerant.

The failure of any component of the solution, be it hardware, software or system management, will not cause the application and its data to be inaccessible to the user community. High availability solutions can fail; fault tolerant solutions should not fail.

The goal of a high availability solution is to achieve continuous availability, i.e., no downtime. We must not only implement a High Availability solution, but also reduce planned downtime through disciplined and documented change management.

The causes of downtime:

Planned downtime:

Hardware upgrades

Repairs

Software updates

Backups

Testing

Development

Unplanned downtime:

User Error

Application failure

Hardware faults

Environmental Disasters

The standalone system may offer limited availability benefits:

Journalled Filesystem

Dynamic CPU Deallocation

Service Processor

Redundant Power

Redundant Cooling

ECC Memory

Hot Swap Adapters

Dynamic Kernel

Single points of failure:

Operating System

Network

Network Adapter

Node

Disk

Application

Site Failure

The enhanced system may offer increased availability benefits:

Redundant Data Paths

Data Mirroring

Hot Swap Storage

Redundant Power for Storage Arrays

Redundant Cooling for Storage Arrays

Hot Spare Storage

Single points of failure:

Operating System

Application

Network

Network Adapter

Node

Site Failure

Clustering technologies offer High Availability:

Redundant Servers

Redundant Networks

Redundant Network Adapters

Heartbeat Monitoring

Failure Detection

Failure Diagnosis

Automated Fallover

Automated Reintegration

Single points of failure:

Site Failure

Application

Benefits of High Availability Solutions:

Standard components (no specialized hardware)

Can be built from existing hardware (no need to invest in new kit)

Works with just about any application

Works with wide range of disk and network types

No specialized operating system or microcode

Excellent availability at low cost

HACMP is largely independent of the disk type, network and application chosen.

High Availability solutions require the following :

Thorough design and detailed planning

Selection of appropriate hardware

Disciplined system administration practices

Documented operational procedures

Comprehensive testing

A High Availability solution based upon HACMP provides automated failure detection, diagnosis, recovery and reintegration.

The highly available solution will include: the AIX operating system, HACMP for AIX, customized enhancements, Cluster Proven applications and, of course, a plan for the design and testing.

AIX's contribution to High Availability:

Object Data Manager (ODM)

System Resource Controller (SRC)

Logical Volume Manager (LVM)

Journalled File System (JFS)

Online JFS Backup (splitlvcopy)

Work Load Manager (WLM)

Quality of Service (QoS)

External Boot

Software Installation Management (installp)

Reliable Scalable Cluster Technology (RSCT)

Hardware Prerequisites:

All pSeries systems will work with high availability, in any combination of nodes within a cluster; however, a minimum of 4 free adapter slots is recommended (2 for network adapters and 2 for disk adapters). Any other adapters (e.g., graphics adapters) will occupy additional slots. The internal Ethernet adapter should not be included in the calculations. Even with 4 adapter slots free, there will be a single point of failure, as the cluster will only be able to accommodate a single TCP/IP local area network between the nodes.

HACMP Features:

1)Availability using:

- Cluster concept

- Redundancy at component level (standby adapters)

- AIX: LVM (JFS, disk mirroring) , SRC, Error Notify

2)Event (fault) Detection

- network adapter, network, or node

3)Automatically triggered or customized Event Recovery

- adapter swap, fallover or notification of network down

4)CSPOC: Tools for global changes across all nodes

- create AIX users, passwords, LVM components (VG, LV, JFS)

5)DARE (Dynamic Automatic Reconfiguration Event)

- Make HACMP changes without stopping the application

6)Monitoring using:

- HACMP commands, HAview, or HAtivoli, pager support

HACMP is Not the Right Solution If...

1)You cannot suffer any downtime

- Fault tolerance is required

- 7 x 24 operation is required

- Life critical systems

2)Your environment is insecure

- Users have access to root password

- Network security has not been implemented

3)Your environment is unstable

- Change management is not respected

- You do not have trained administrators

- Environment is prone to 'user fiddle factor'

HACMP will never be an out-of-the-box solution to availability. A certain degree of skill will be always required.

HACMP Basic Terms:

Cluster

HACMP's Resource Components:

Resources are logical components of the cluster configuration that can be moved from one node to another. Because they are logical components, they can be moved without human intervention. Resource components include: Resource Groups, IP labels, filesystems, NFS exports and mounts, Application Servers, Volume Groups and other items. All the logical resources are collected together into a resource group. All components in a resource group move together from one node to another in the event of a node failure.

The difference between topology and resources is that the topology components are physical, i.e., nodes, networks and network adapters, which would require manual intervention to move from one place to another.

Failures detected by HACMP

1) Node failure - processor hardware or OS failure.

2) Network adapter failure - the IP address is moved to the standby adapter.

3) Network failure - a message is displayed on the console and the event is logged.

HACMP/ES can also monitor applications, processor load and available disk capacity.

Other Failures:

1) Disk drive failure - LVM mirroring, RAID.

2) Other hardware failures

No direct HACMP support. HACMP for AIX provides a SMIT interface to the AIX Error Notification Facility (an example errnotify stanza is sketched after this list).

Trap on specific errors.

Execute a command in response to an error.

3) Application failures

4) HACMP failure - promoted to node failure.

5) Power failure - avoid common power supplies across replicated devices; use a UPS.
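
As a hedged illustration of the Error Notification Facility mentioned in item 2 (the stanza name, notification script and disk name below are hypothetical), a stanza like the following could be added to the AIX errnotify ODM class so that a script runs whenever a permanent hardware error is logged against hdisk0; the SMIT panels provided by HACMP build an equivalent entry:

errnotify:
        en_name = "hacmp_disk_notify"
        en_persistenceflg = 1
        en_class = "H"
        en_type = "PERM"
        en_resource = "hdisk0"
        en_method = "/usr/local/bin/notify_admin.sh $1"

The stanza would be loaded with odmadd (for example, odmadd /tmp/disk_notify.add); the method receives error log fields as arguments, $1 being the sequence number of the error log entry.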

Cluster Resources :

- Applications

- Disk Drives

- Volume Groups

- File Systems

- NFS File Systems

- IP Addresses

Disk Crash

1) Data replicated through LVM mirroring

2) Data replicated on RAID.

Disk adapter Failure:

1) Data replicated through LVM mirroring across buses.

2) If the RAID array uses multiple buses, the data remains available.

Network Adapter Failure:

1) Move IP address to standby adapter.

- The standby adapter takes over IP and, where applicable, MAC address.

- Duration between 5 and about 25 seconds

- If the standby adapter fails, no fallover action is taken.

- If the failed adapter comes back to life, it rejoins as a standby adapter.

- No significant effect on applications

Network Fails:

1) HACMP provides notification and runs any user defined scripts.

- HACMP detects a fault.

- Standard event script does not contain any actions.

- Network takeover only possible with customizing.

- Behavior of application depends on infrastructure.

Machine Fails:

1) Workload (resources) moved to the surviving node.

2) TCP/IP address moved from the failed to the surviving node.

3) Users log in again, using the same host name.

What you lose:

1) Work in progress.

2) Any data not cached to disk.

3) All process state.

Types of HACMP Resource Groups

1) Cascading - computers have a fixed priority.

- resources move through an ordered list of nodes in the event that a node fails.

- resources automatically revert to the top level computer that is active.

A cascading resource group can be modified to prevent the resources moving back to a higher priority node upon reintegration of the previously failed node; this is called cascading without fallback.

2) Rotating

- all computers have equal priority

- takeover only happens when there is a defect, not with reintegration

- When a previously failed node rejoins the cluster, the resources do not move back.

A limitation with rotating resource groups is that n-1 rotating resource groups are supported for each cluster, where n is the number of nodes in the cluster (for example, a four-node cluster supports at most three rotating resource groups). Another consideration for rotating resource groups is that all nodes should ideally be equally sized.

3) Concurrent - all computers work on the data simultaneously

Resource Groups, points to ponder :

You may have hundreds of resource groups in a single cluster.

You may mix and match all three types of resource group in a single cluster.

Simply because you have 32 nodes in a cluster does not mean that all nodes must service all resource groups. The only requirement is that each resource group is serviced by at least two nodes.

Any given node may service one or more resource groups (of the same or differing types).

You may manually move resource groups about the cluster to perform a type of load balancing.

You may choose which node has which resource group at any point in time or following failure.

LVM and HACMP Considerations:

All LVM constructs must have unique names in the cluster.

- like httplv, httploglv, httpfs and httpvg.

Mirror all critical logical volumes.

- Don't forget the jfslog.

The VG major device numbers should be the same.

- Mandatory for clusters exporting NFS filesystems, but it is a good habit for any cluster.

Shared data on internal disks is a bad idea.
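
A minimal command sketch for these checks, using the example names above (httpvg, httplv, httploglv) and assuming the volume group already contains a second disk to mirror onto:

lvlstmajor                  # list the free major numbers; run on every node and pick one free everywhere
ls -l /dev/httpvg           # shows the major number the volume group is actually using
mklvcopy httplv 2           # add a second copy of the data logical volume
mklvcopy httploglv 2        # don't forget the jfslog
syncvg -v httpvg            # synchronise the new copies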

HACMP and Networks:

HACMP clusters must contain two or more nodes and two or more networks, one being IP based, one being non-IP based. The non-IP networks are there to allow HACMP to determine if a failure is a network related failure or a node failure.

LANs carry the following traffic:

HACMP heartbeat or keepalive packets.

HACMP messages, used for cluster communications.

HACMP Lock Manager traffic used in concurrent access configurations.

Client communications, e.g., Telnet, FTP, NFS, sqlnet.

Serial networks carry:

HACMP heartbeats.

HACMP messages.

HACMP Network Components Terminology:

Public Network - Any TCP/IP LAN that supports HACMP client network traffic.

Private Network - Any TCP/IP LAN that carries only HACMP traffic.

Serial Network - An RS232/RS422 or tmscsi/tmssa network used only for HACMP traffic.

Adapter - The HACMP ODM definition associated with a TCP/IP or serial network interface.

Adapter IP Label - The name in /etc/hosts that maps to an IP address.

Nodename - The name associated with a cluster node, not to be confused with the hostname.

Network Type - Identifies the physical media type to HACMP, e.g., FDDI, ether, ATM, HPS.

Adapter Function - Service, Standby or Boot.

Adapter Hardware Address - The LAN adapter's Locally Administered Address (LAA).

All nodes must have at least one standby and one service adapter per network.

Keep the following items in mind when you do your network planning:

All service and boot adapters must be in the same subnet.

All standby adapters must be in the same subnet.

Service and Standby adapters must be in different logical subnets.

All adapters in the cluster must have the same subnet mask.

Do not edit the route entries for service and standby adapters in SMIT.
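
As a worked example of these rules, using the node1 addresses that appear later in these notes plus hypothetical addresses for node3, and a 255.255.255.0 mask everywhere: the boot and service labels share the 193.9.200 subnet, while the standby labels sit in a separate 193.9.201 subnet. The corresponding /etc/hosts entries would look like this:

193.9.200.225   node1_boot
193.9.200.226   node1_svc
193.9.201.1     node1_stby
193.9.200.227   node3_boot
193.9.200.228   node3_svc
193.9.201.2     node3_stby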

IP Address Takeover (IPAT):

The service IP address of a failed node may be taken over by a surviving node.

This behaviour is known as IP Address Takeover or IPAT for short.

Standby adapter on a surviving node acquires the service address of failed node.

This is an optional behaviour that must be configured.

Requires the configuration of a 'boot' adapter.

If more than one standby adapter is available, the surviving node may take over the workload of one or more failed nodes.

When a cluster is configured to use IPAT , an additional network adapter must be defined. This is known as a Boot Adapter. When a failed node recovers, it cannot boot on the Service IP address if this has been acquired by another node in the cluster. For this reason, the failed node needs to boot on a unique IP address which is not used elsewhere in the cluster. This ensures that there is no IP address duplication during reintegration.

Configuring IPAT:

IPAT is required for rotating resource groups, optional for cascading resource groups, and not supported on concurrent resource groups.

On all nodes, prepare security and name resolution:

1. Add an entry for the boot IP label in to /etc/hosts on each node.

2. Add the boot IP label to /.rhosts on each node.

3. Use FTP or rdist to keep these files in sync and minimise human error.

On the node that will have its service IP address taken over:

4. Change the IP address that is held in the ODM to that of the boot IP address by using smit chinet (a command-line sketch follows this list). This will cause cfgmgr to read the 'boot' address at system startup.

On any node, update cluster configuration:

5. Add the boot adapter definition to the cluster topology, for that node which will have its service IP address taken over.

6. Synchronise the topology (you will get a warning message).

7. Now add the service IP label of the node to be taken over to a resource group.

8. Take a snapshot of your modified topology and update your cluster planning worksheets.
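
A command-line sketch of the change described in step 4, assuming the service/boot interface is en1 and using the node1 boot address from the adapter examples in these notes (smit chinet performs an equivalent chdev on the interface):

chdev -l en1 -a netaddr=193.9.200.225 -a netmask=255.255.255.0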

Configuring Hardware Address Takeover:

Do not enable the "ALTERNATE" hardware address field in smit devices.

This would cause the adapter to boot on its locally administered address rather than the burned-in ROM address, causing serious communications problems and putting the cluster into an unstable state.

The correct method is to enter your chosen LAA into the smit hacmp menus.

Some adapter types are very specific about the numbering of the first two digits in an LAA.

Token-ring and FDDI in particular.

For Token-Ring the LAA must start with 42; for FDDI the first digit of the first byte must be 4, 5, 6 or 7.

Always check the documentation provided with the adapter and the HACMP manuals.

Token-Ring adapters will not release the LAA if AIX crashes.

AIX must be set to reboot automatically after a system crash (smit environment).

Installation of the software

Install the HACMP software from the HACMP CD

- cluster.adt

- cluster.base

- cluster.cspoc

- cluster.man

- cluster.taskguides

- cluster.vsm

HACMP Daemons:

1) clstrmgr

2) clsmuxpd (works with SNMP)

3) cllockd - for concurrent access

4) clinfo - required for IPAT (hardware MAC takeover)

The clstrmgr and clsmuxpd daemons are mandatory. The other two are optional.
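
To see which of these daemons are currently running on a node, query their SRC group (the same check is used later in these notes after starting cluster services):

lssrc -g cluster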

1) Cluster Manager (clstrmgr):

- Runs on all cluster nodes.

- Tracks cluster topology.

- Tracks network status.

- Externalizes failure events.

The Cluster Manager has four functional pieces: the Cluster Controller (CC), the Event Manager (EM), the NIM Interface Layer (NIL) and the Network Interface Modules (NIMs).

CC, EM and NIL are all part of the clstrmgr executable. The NIMs are separate executables, one for each network.

At startup, the Cluster Controller reads the configuration information from the ODM. The NIM Interface Layer controls the hardware using the Network Interface Modules. The Event Manager handles the event scripts and communicates with clsmuxpd and cllockd.

The Cluster Controller performs a number of coordinating functions:

Retrieves cluster configuration information from the HACMP ODM object classes at startup and during a refresh or DARE operation.

Establishes the ordering of cluster neighbours for the purpose of sending keep alive packets.

Tracks changes to the cluster topology.

Receives information about cluster status changes from the NIM via the NIL.

Queues events in response to status changes in the cluster.

Handles node isolation and partitioned clusters.

The NIL provides a common interface between the Cluster Controller and one or more NIMs. This allows NIMs to be developed for new adapter hardware without rewriting the cluster manager.

It tells the NIMs the appropriate keep alive and failure detection rates for each network type as mentioned in the ODM.

It starts the appropriate NIMs for the network types that have been defined in the HACMP classes of the ODM.

It gives the NIMs a list of the IP addresses or /dev files to send keep alive to.

It restarts the NIMs if they hang or exit.

The NIMs are the contact point between HACMP and the network interfaces.

The NIMs send and receive keep-alive and message information.

They detect network related failures.

They are provided for each supported network type including a generic one.

The Event manager performs the following functions:

It starts the appropriate event scripts in response to status change in the cluster.

It sets the required environment variables.

It communicates with clsmuxpd and cllockd when required.

It starts the config_too_long event if any event does not exit 0 within 6 minutes.

The Event Manager causes event scripts to execute. Primary events (such as node_up, node_up_complete, node_down, node_down_complete, etc.) are called directly by the cluster manager. Sub events (such as node_up_local, node_up_remote, node_down_remote, node_down_local, etc.) are called by primary events.

2) Cluster SNMP Agent (clsmuxpd):

- Receives information from Cluster Manager.

- Maintains the HACMP enterprise specific MIB.

- Provides information to SNMP.

3) Cluster Lock Manager (cllockd):

- Cluster wide advisory level locking.

- CLM locking API.

- Unix locking API.

- Only for processes running on cluster.

4) Cluster Information Services (clinfo):

- Optional on both cluster nodes and cluster clients.

- Provides cluster status information to clients.

- Clinfo API allows for cluster aware applications.

Characteristics of keep-alive (KA) packets:

- Transmitted over all interfaces known to HACMP

- Direct I/O (non-IP networks) or UDP packets

- Three adjustable transmission rates (fast, normal, slow)

- If a failure rate is exceeded, an event is triggered

Dead Man's Switch (DMS):

If the LED on one of your cluster nodes flashes 888, you may have experienced a DMS time-out.

The cause is that clstrmgr could not send a heartbeat to itself because of excessive I/O traffic. The other nodes in the cluster would then start the node_down event, which could cause data corruption on the shared disks. To avoid this, the DMS is triggered and the operating system (AIX) panics (888).

The deadman switch is a kernel extension to AIX. The cluster manager tries to reset the DMS frequently (every 0.5 seconds).

If DMS is not reset for n-1 seconds - PANIC!

n = (KA rate) x (Missed KAs) for slowest network
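
As a hypothetical illustration: if the slowest network sends keepalives every 2 seconds and its failure threshold is 5 missed keepalives, then n = 2 x 5 = 10, so the kernel panics if the cluster manager fails to reset the DMS for 9 (n-1) seconds.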

Cluster Single Point of Control (C-SPOC):

C-SPOC provides facilities for performing common cluster wide administration tasks from any node within the cluster.

Requires either /.rhosts or kerberos to be configured on all nodes.

C-SPOC operations fail if any target node is down at the time of execution or selected resource is not available.

Any change to a shared VGDA is synchronised automatically if C-SPOC is used to change shared LVM components (VGs, LVs, JFS).

C-SPOC uses a script parser called the "command execution language"

The cluster snapshot utility:

HACMP's cluster snapshot utility records the HACMP ODM configuration information, both cluster topology and resources.

When a new snapshot is created, two files are generated.

.odm (contains ALL cluster topology and resource information)

.info (a printable report, can be extended)

By default, snapshots are stored in the directory /usr/sbin/cluster/snapshots

The SNAPSHOTPATH environment variable can be used to specify an alternative location for storing snapshots.

The documentary report that a snapshot creates can be customised to include information specific to a given cluster (application configuration).

Snapshots can be applied to a running cluster.
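
For example, to keep snapshots on a different filesystem (the path below is hypothetical), export the variable before creating the snapshot:

export SNAPSHOTPATH=/hacmp/snapshots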

HACMP Log Files:

1) /usr/adm/cluster.log - output of cluster scripts and daemons.

2) /usr/sbin/cluster/history/cluster. - history files are created every day.

3) /tmp/cspoc.log - output of all commands executed by C-SPOC.

4) /tmp/cm.log - log for the Cluster Manager daemon.

5) /tmp/emuhacmp.out - output of emulation scripts.

6) /tmp/hacmp.out - detailed output of all scripts.
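
When testing fallover it is common practice to watch the detailed script output in real time and to scan the main log for errors, for example:

tail -f /tmp/hacmp.out
grep -i error /usr/adm/cluster.log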

Dynamic Reconfiguration:

HACMP provides a facility that allows changes to cluster topology and resources to be made while the cluster is active. This facility is known as DARE, or to give it its full name, "Dynamic Automatic Reconfiguration Event". This requires three copies of the HACMP ODM.

Default Configuration Directory (DCD): /etc/objrepos - updated by SMIT/the command line.

Staging Configuration Directory (SCD): /usr/sbin/cluster/etc/objrepos/staging - used during reconfiguration.

Active Configuration Directory (ACD): /usr/sbin/cluster/etc/objrepos/active - from which clstrmgr reads the cluster configuration.

DARE allows changes to be made to most cluster topology and nearly all resource group components without the need to stop HACMP, take the application offline or reboot a node. All changes must be synchronised in order to take effect.

Pre-configuration steps

1) Add the boot, service and standby entries to the /etc/hosts file.

2) Make these entries in the /.rhosts file (required at the time of synchronization). HACMP uses rsh on the remote machines to perform the updates.

3) Create the volume group on node1 with a major number that is free on both nodes (use the lvlstmajor command to check the free major numbers).

4) Turn off the auto-varyon feature for the volume group with the command

chvg -a n <vgname>

5) Create a jfslog logical volume

mklv -t jfslog -y <loglvname> <vgname> 1

6) Format the log logical volume

logform /dev/<loglvname>

7) Create a logical volume

mklv -t jfs -y <lvname> <vgname> <number_of_LPs>

8) Create the file system

crfs -v jfs -d /dev/<lvname> -m /<mountpoint>

9) Vary off the volume group using

varyoffvg <vgname>

10) Import the VG on the other node with the same major number

importvg -V 44 -y <vgname> <hdisk#>

11) Turn off the auto-varyon feature with the command

chvg -a n <vgname>

12) Vary off the VG using

varyoffvg <vgname>
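
Putting steps 3 to 12 together, a worked end-to-end sketch with hypothetical names (sharedvg, loglv00, datalv, /shared, hdisk2; the major number 44 and disk names may differ on your systems):

# --- on node1 ---
lvlstmajor                                    # find a major number free on both nodes
mkvg -y sharedvg -V 44 hdisk2                 # create the shared VG with that major number
chvg -a n sharedvg                            # disable automatic varyon at boot
mklv -t jfslog -y loglv00 sharedvg 1          # jfslog logical volume
logform /dev/loglv00                          # format the log
mklv -t jfs -y datalv sharedvg 10             # data logical volume
crfs -v jfs -d /dev/datalv -m /shared -A no   # file system, not mounted at boot
varyoffvg sharedvg

# --- on node3 ---
importvg -V 44 -y sharedvg hdisk2             # import with the same major number
chvg -a n sharedvg
varyoffvg sharedvg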

Configuration steps (short)

1) Define the cluster ID and cluster name

2) Configure nodes

3) Configure network adapters

4) Sync the topology

5) Create resource groups

6) Create resources

7) Sync the resources

Steps in Configuring Clusters:

Step 1 - Plan your cluster

- Use the planning worksheets and the documentation

Step 2 - Configure TCP/IP and LVM - /etc/hosts, /.rhosts, jfs mirroring and layout

Step 3 - Install the HACMP software- Select the necessary filesets

Step 4 - Define the cluster topology - Nodes, networks and network adapters

Step 5 - Synchronize the cluster topology - Verification performed automatically

Step 6 - Configure the application start and stop scripts - Application servers

Step 7 - Define the cluster resources and resource groups - File systems, IP addresses, exports and NFS mounts

Step 8 - Synchronize the cluster resources -Verification performed automatically

Step 9 - Test the cluster

- Including application tests

HACMP Enhanced Scalability (HACMP/ES):

- extended functionality compared with classic HACMP

- up to 32 nodes in one HACMP cluster.

- Based on IBM High Availability Infrastructure.

- Includes all features of classic HACMP.

Difference between HACMP and HACMP/ES:

- HACMP/ES uses RSCT (Reliable Scalable Cluster Technology)

- User defined events based on RSCT

- Application Monitoring

- Recovery from resource group acquisition failure.

- Dynamic node policy

- Selective Fallover

- Plugins

In 4.4.1, plugins are provided to help configure the following services:

- DNS

- DHCP

- Print services

The plugins add the server application and an application monitor to an existing resource group.

- Process Monitoring via provided plugin scripts

- Users must create their own scripts/programs for Custom Monitoring

What Plugins provide:

- The resource group now contains an application server to start and stop the relevant daemons.

- The resource group also contains the shared volume group and filesystem needed by the daemons.

- An application monitor is configured to watch the daemons.

- When the resource group comes online, the daemon(s) will be activated.

- If a daemon should fail, the application monitor will detect it.

- the daemon will be restarted or the resource group will move to another node.

- Note that the limitations of application monitoring still exist!

- only one monitored application per resource group.

- the monitored application can be included in at most one resource group.

HACMP Commands

rdist -b -f /etc/disfile1

To distribute the files in disfile1, in binary mode, to all hosts listed in disfile1

Sample entry for disfile1

HOSTS = ( root@node1 root@node3 )

FILES = ( /etc/passwd /etc/security/passwd)

${FILES} -> ${HOSTS}

clstart -m -s -b -i -l

To start the cluster daemons (-m clstrmgr, -s clsmuxpd, -b broadcast message, -i clinfo, -l cllockd)

clstop -f -N

To force an immediate cluster shutdown without releasing resources

clstop -g -N

To do graceful shutdown immediately with no takeover

clstop -gr -N

To do graceful shutdown immediately with takeover

cldare -t

To sync the cluster topology

cldare -t -f

To do the mock sync of topology

cldare -r

To sync the cluster resources

cldare -r -f

To do the mock sync of resources

clverify

cluster verification utility

cllscf

To list cluster topology information

cllsclstr

To list the name and security level of the cluster

cllsnode

To list the info about the cluster nodes

cllsnode -i node1

To list info about node1

cllsdisk -g shrg

To list the PVID of the shared hard disk for resource group shrg

cllsnw

To list all cluster networks

cllsnw -n ether1

To list the details of network ether1

cllsif

To list the details by network adapter

cllsif -n node1_service

To list the details of network adapter node1_service

cllsvg

To list the shared vgs which can be accessed by all nodes

cllsvg -g sh1

To list the shared vgs in resource group sh1

cllslv

To list the shared lvs

cllslv -g sh1

To list the shared lvs in the resource group sh1

cllsdisk -g sh1

To list the PVID of disks in the resource group sh1

cllsfs

To list the shared file systems

cllsfs -g sh1

To list the shared file systems in the resource group sh1

cllsnim

Show info about all network modules

cllsnim -n ether

Show info about ether network module

cllsparam -n node1

To list the runtime parameters for the node node1

cllsserv

To list all the application servers

claddclstr -i 3 -n dcm

To add a cluster definition with name dcm and id 3

claddnode

To add an adapter

claddnim

To add network interface module

claddgrp -g sh1 -r cascading -n n1 n2

To create resource group sh1 with nodes n1,n2 in cascading mode

claddserv -s ser1 -b /usr/start -e /usr/stop

Creates an application server ser1 with start script /usr/start and stop script /usr/stop

clchclstr -i 2 -n dcmds

To change the cluster definition name to dcmds and ID to 2

clchclstr -s enhanced

To change the cluster security to enhanced

clchnode

To change the adapter parameters

clchgrp

To change the resource group name or node relationship

clchparam

To change the run time parameters (like verbose logging)

clchserv

To change the name of an application server or change the start/stop scripts

clrmclstr

To remove the cluster definition

clrmgrp -g sh1

To delete the resource group sh1 and related resources

clrmnim ether

To remove the network interface module ether

clrmnode -n node1

To remove the node node1

clrmnode -a node1_svc

To remove the adapter named node1_svc

clrmres -g sh1

To remove all resources from resource group sh1

clrmserv app1

To remove the application server app1

clrmserv ALL

To remove all application servers

clgetactivenodes -n node1

To list the nodes with active cluster manager processes, as seen by the cluster manager on node node1

clgetaddr node1

returns a pingable address from node node1

clgetgrp -g sh1

To list the info about resource group sh1

clgetgrp -g sh1 -f nodes

To list the participating nodes in the resource group sh1

clgetif

To list the interface name/interface device name/netmask associated with a specified IP label / IP address of a specific node

clgetip sh1

To get the ip label associated to the resource group

clgetnet 193.9.200.2 255.255.255.0

To list the network for IP 193.9.200.2, netmask 255.255.255.0

clgetvg -l nodelv

To list the VG of LV nodelv

cllistlogs

To list the logs

clnodename -a node5

To add node5 to the cluster

clnodename -o node5 -n node3

To change the cluster node name node5 to node3

clshowres

Lists the resources defined for all resource groups

clfindres

To find the resource group within a cluster

xclconfig

X utility for cluster configuration

xhacmpm

X utility for hacmp management

xclstat

X utility for cluster status
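
As a usage sketch combining several of the commands above (the script name and output path are hypothetical, and the cluster utilities directory is assumed to be in PATH), a simple report of the cluster definition could be captured like this:

#!/bin/ksh
# capture_cluster.sh - write the current cluster definition to a dated report file
OUT=/tmp/cluster_report.`date +%Y%m%d`
{
        echo "=== Cluster ==="         ; cllsclstr
        echo "=== Topology ==="        ; cllscf
        echo "=== Adapters ==="        ; cllsif
        echo "=== Resources ==="       ; clshowres
        echo "=== Resource groups ===" ; clfindres
} > $OUT 2>&1
echo "Report written to $OUT"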

Setup for Cascading mode:

HACMP Configuration Exercise :

Scenario: Connecting machines in a cascading resource group, so as to provide IP address takeover, NFS availability and application takeover.

1. Create a /.rhosts file on all nodes which are going to be part of the HACMP cluster. The file should be in the root directory and contain the names of the boot, standby and service adapters.

# cat .rhosts

cws

node3

node1

# End of generated entries by updauthfiles script

node1_boot

node1_svc

node1_stby

node3_boot

node3_svc

node3_stby

2. # smit hacmp

A.) Define a cluster

HACMP for AIX

Move cursor to desired item and press Enter.

Cluster Configuration

Cluster Services

Cluster System Management

Cluster Recovery Aids

RAS Support

Cluster Configuration

Move cursor to desired item and press Enter.

Cluster Topology

Cluster Security

Cluster Resources

Cluster Snapshots

Cluster Verification

Cluster Custom Modification

Restore System Default Configuration from Active Configuration

Cluster Topology

Move cursor to desired item and press Enter.

Configure Cluster

Configure Nodes

Configure Adapters

Configure Network Modules

Show Cluster Topology

Synchronize Cluster Topology

Configure Cluster

Move cursor to desired item and press Enter.

Add a Cluster Definition

Change / Show Cluster Definition

Remove Cluster Definition

Add a Cluster Definition

Type or select values in entry fields.

Press Enter AFTER making all desired changes.

[Entry Fields]

* Cluster ID [10] #

* Cluster Name [dcm]

B.) Add Participating nodes

# smit hacmp

- Cluster Configuration

- Cluster Topology

- Configure Nodes

- Add Cluster Nodes

Make entries for the participating nodes. The names need not be related to the /etc/hosts file; they can be any name.

Add Cluster Nodes

Type or select values in entry fields.

Press Enter AFTER making all desired changes.

[Entry Fields]

* Node Names [node1 node3]

C.) Create entries for all IP addresses

First confirm that the system has booted through the boot IP given. Check using

# lsattr -El en0

Make entries for all adapters like node1_boot , node1_svc, node1_stby , node3_boot, node3_svc and node3_stby.

# smit hacmp

- Cluster Configuration

- Cluster Topology

- Configure Adapters

- Add an adapter

Add an Adapter

Type or select values in entry fields.

Press Enter AFTER making all desired changes.

[Entry Fields]

* Adapter IP Label [node1_boot]

* Network Type [ether] +

* Network Name [ether1] +

* Network Attribute public +

* Adapter Function boot +

Adapter Identifier [193.9.200.225]

Adapter Hardware Address []

Node Name [node1] +

D.) Check the Cluster Topology

#smit hacmp

- Cluster Configuration

- Cluster Topology

- Show Cluster Topology

- Check the cluster topology.

E.) Synchronize Cluster Topology

#smit hacmp

- Cluster Configuration

- Cluster Topology

- Synchronize Cluster Topology

The topology is copied to all participating nodes.

F.) Create a Resource Group

#smit hacmp

- Cluster Configuration

- Cluster Resources

- Define resource Group

- Add a resource Group

Add a Resource Group

Type or select values in entry fields.

Press Enter AFTER making all desired changes.

[Entry Fields]

* Resource Group Name [rg1]

* Node Relationship cascading +

* Participating Node Names [node1 node3] +

Give the Resource Group Name, Node Relationship and Participating Node Names.

G.) Define Resources for a Resource Group

#smit hacmp

- Cluster Configuration

- Cluster resources

- Change/Show Resources for a RG

Select a Resource Group

Move cursor to desired item and press Enter.

rg1

Configure a Resource Group

Type or select values in entry fields.

Press Enter AFTER making all desired changes.

[TOP] [Entry Fields]

Resource Group Name rg1

Node Relationship cascading

Participating Node Names node1 node3

Service IP Label [node1_svc] +

Filesystems [] +

Filesystems Consistency Check fsck +

Filesystems Recovery Method sequential +

Filesystems to Export [] +

Filesystems to NFS Mount [] +

Volume Groups [] +

Concurrent Volume Groups [] +

Raw Disk PVIDs [] +

AIX Connections Services [] +

AIX Fast Connect Services [] +

Application Servers [] +

Highly Available Communication Links [] +

Miscellaneous Data []

Inactive Takeover Activated false +

9333 Disk Fencing Activated false +

SSA Disk Fencing Activated false +

Filesystems mounted before IP configured false +

[BOTTOM]

An entry is made for node1_svc in rg1 so that it can be taken over in case of adapter failure or network failure.

Similarly create another Resource Group rg2 for node3_svc.

Entries for NFS and Application Servers if required, also have to be made in the above screen.

H.) Copy the Resource information to all participating nodes

#smit hacmp

- Cluster Configuration

- Cluster resources

- Synchronize Cluster resources.

The Resources configuration gets copied to all participating nodes.

I.) Start HACMP on all participating nodes.

#smit hacmp

- Cluster Services

- Start Cluster Services

Started on individual machine.

We can use C-SPOC (Cluster Single Point of Control) for all machines. However, it has not been enabled on the SP due to security reasons.

For C-SPOC

# smit hacmp

- Cluster system management

- HACMP for AIX Cluster Services.

- Start Cluster Services ( It takes time )

J.) Check cluster services are started

# lssrc -g cluster

Now the system has been configured for high availability of the IP address. It can be tested by stopping the HACMP services on one of the nodes. For this, follow these steps:

a.) On both nodes run the following command and check that the service IP is being used on en1. en2 should be using the standby IP.

# netstat -in

Name  Mtu    Network     Address            Ipkts Ierrs Opkts Oerrs Coll
lo0   16896  link#1                           210     0   229     0    0
lo0   16896  127         127.0.0.1            210     0   229     0    0
lo0   16896  ::1                              210     0   229     0    0
en0   1500   link#2      0.60.94.e9.56.e3   40117     0 38081     0    0
en0   1500   192.9.200   192.9.200.2        40117     0 38081     0    0
en1   1500   link#3      0.6.29.ac.ca.66    63612     0  1136     0    0
en1   1500   193.9.200   193.9.200.226      63612     0  1136     0    0
en2   1500   link#4      0.6.29.ac.f2.f6        0     0     3     3    0
en2   1500   193.9.201   193.9.201.1            0     0     3     3    0

The above display is obtained on node1.

b.) Stop the cluster services on node1.

# smit hacmp

- Cluster Services

- Stop Cluster Services

Stop Cluster Services

Type or select values in entry fields.

Press Enter AFTER making all desired changes.

[Entry Fields]

* Stop now, on system restart or both now +

BROADCAST cluster shutdown? true +

* Shutdown mode graceful with takeover +

(graceful, graceful with takeover, forced)

c.) Now check on node3 that the node1_svc IP has shifted to its standby adapter (en2) using the netstat command.

Adding serial links obtained through SSA:

Check the device addresses on each node using

# lsdev -C | grep tmssa

# smit hacmp

- Cluster Configuration

- Cluster Topology

- Configure Adapters

- Add an adapter

Add an Adapter

Type or select values in entry fields.

Press Enter AFTER making all desired changes.

[Entry Fields]

* Adapter IP Label [node1_tmssa]

* Network Type [tmssa] +

* Network Name [ssa1] +

* Network Attribute serial +

* Adapter Function service +

Adapter Identifier [/dev/tmssa3]

Adapter Hardware Address []

Node Name [ node1 ] +

Similarly, make the entry for node3 (/dev/tmssa2).

Create Application server :

# smit hacmp

- Cluster Configuration

- Cluster Resources

- Define Application Servers

Define Application Servers

Move cursor to desired item and press Enter.

Add an Application Server

Change / Show an Application Server

Remove an Application Server

Add Application Server

Type or select values in entry fields.

Press Enter AFTER making all desired changes.

[Entry Fields]

* Server Name [tsm]

* Start Script [/usr/bin/tsmstart]

* Stop Script [/usr/bin/tsmstop]

The same scripts have to be copied to all participating nodes. The entry for the Application Server has to be made in the Resource Group (Step 2.G).

Sample tsmstart script

SERVICEIP=1

# wait until the service IP label is active on this node
while [ $SERVICEIP -ne 0 ]
do
    x=`netstat -i | grep -c node1_svc`
    if [ $x -eq 1 ]
    then
        SERVICEIP=0
        echo "Exiting with SERVICEIP"
    else
        echo "Executing IP Take over"
        sleep 2
    fi
done

sleep 15
# start the TSM server once the service address is in place
/usr/tivoli/tsm/server/bin/rc.adsmserv

Sample tsmstop script

cd /usr/tivoli/tsm/client/ba/bin

dsmadmc -id=admin -password=support halt

sleep 15

Cluster snapshot creation :

# smit hacmp

Cluster Snapshots

Add a Cluster Snapshot

We are required to provide a Cluster Snapshot Name, a Custom Defined Snapshot Method and a Cluster Snapshot Description.

The snapshot is created in the directory /usr/sbin/cluster/snapshots. Two files are created: .odm and .info.

Testing the non-IP serial communication (SSA):

Node1 : # cat < /dev/tmssa3.tm

Node3 : # cat > /dev/tmssa2.im

Scenario :

Connecting the nodes in a rotating resource group. In rotating mode, a resource does not belong to any node. Therefore, when creating a resource such as an IP address, the node name is not required.

Setup in Rotating mode:

The configuration of the resource group is exactly the same as was done for cascading mode. However, while adding the adapter for node1_svc, do not provide the Node Name. The other adapters (node1_boot, node1_stby, node3_boot and node3_stby) are added as in Step 2.C.

# smit hacmp

- Cluster Configuration

- Cluster Topology

- Configure Adapters

- Add an adapter

Add an Adapter

Type or select values in entry fields.

Press Enter AFTER making all desired changes.

[Entry Fields]

* Adapter IP Label [node1_svc]

* Network Type [ether] +

* Network Name [ether1] +

* Network Attribute public +

* Adapter Function service +

Adapter Identifier [193.9.200.226]

Adapter Hardware Address []

Node Name [ ] +

After adding the adapter definitions, synchronize the topology. Then create a resource group rg3 with the node relationship set to rotating. After making entries for the resource group, synchronize the resources. Start HACMP on each node.

The service IP is allocated on the boot adapter of node1. To test IP takeover, stop the HACMP services on node1 (Step 2.J b). The service IP label moves to the boot adapter of node3.

Note: It has been observed that the tmssa serial link should be configured in a rotating resource group; the test for IP takeover succeeds only when tmssa is configured.
