HACMP Full Notes
HACMP
Elements in System Availability:
- Well planned and implemented concept
- "no single point of failure"
- recovery without user intervention, using scripting
High availability is:
- The masking or elimination of both planned and unplanned downtime.
- The elimination of single points of failure (SPOFs)
- Fault resilience, but not fault tolerance.
The failure of any component of the solution, be it hardware, software or system management, must not cause the application and its data to be inaccessible to the user community. High availability solutions do fail; fault tolerant solutions should not fail.
The goal of a high availability solution is to approach continuous availability, i.e., no downtime. We must not only implement a high availability solution, but also reduce planned downtime through disciplined and documented change management.
The causes of downtime:
Planned downtime:
Hardware upgrades
Repairs
Software updates
Backups
Testing
Development
Unplanned downtime:
User Error
Application failure
Hardware faults
Environmental Disasters
The standalone system may offer limited availability benefits:
Journalled Filesystem
Dynamic CPU Deallocation
Service Processor
Redundant Power
Redundant Cooling
ECC Memory
Hot Swap Adapters
Dynamic Kernel
Single points of failure:
Operating System
Network
Network Adapter
Node
Disk
Application
Site Failure
The enhanced system may offer increased availability benefits:
Redundant Data Paths
Data Mirroring
Hot Swap Storage
Redundant Power for Storage Arrays
Redundant Cooling for Storage Arrays
Hot Spare Storage
Single points of failure:
Operating System
Application
Network
Network Adapter
Node
Site Failure
Clustering technologies offer High Availability:
Redundant Servers
Redundant Networks
Redundant Network Adapters
Heartbeat Monitoring
Failure Detection
Failure Diagnosis
Automated Fallover
Automated Reintegration
Single points of failure:
Site Failure
Application
Benefits of High Availability Solutions: Standard Components (no specialized hardware)
Can be built from existing hardware (no need to invest in new kit)
Works with just about any application
Works with wide range of disk and network types
No specialized operating system or microcode
Excellent availability at low cost
HACMP is largely independent of the disk type , network and application chosen.
High Availability solutions require the following :
Thorough design and detailed planning
Selection of appropriate hardware
Disciplined system administration practices
Documented operational procedures
Comprehensive testing
A High Availability solution based upon HACMP provides automated failure detection , diagnosis , recovery and reintegration.
The highly available solution will include: AIX Operating System, HACMP for
AIX , customized enhancements, Cluster Proven applications and of course a plan for the design and testing.
AIX's contribution to High Availability : Object Data Manager (ODM)
System Resource Controller (SRC)
Logical Volume Manager (LVM)
Journalled File System (JFS)
Online JFS Backup (splitlvcopy)
Work Load Manager (WLM)
Quality of Service (QoS)
External Boot
Software Installation Management (installp)
Reliable Scalable Cluster Technology (RSCT)
Hardware Prerequisites:
All pSeries systems will work with high availability, in any combination of nodes within a cluster; however, a minimum of 4 free adapter slots is recommended (2 for network adapters and 2 for disk adapters). Any other adapters (e.g., graphics adapters) will occupy additional slots. The internal Ethernet adapter should not be included in the calculations. Even with 4 free adapter slots there will be a single point of failure, as the cluster will only be able to accommodate a single TCP/IP local area network between the nodes.
HACMP Features:
1)Availability using:
- Cluster concept
- Redundancy at component level (standby adapters)
- AIX: LVM (JFS, disk mirroring) , SRC, Error Notify
2)Event (fault) Detection
- network adapter, network, or node
3)Automatically triggered or customized Event Recovery
- adapter swap, fallover or notification of network down
4)CSPOC: Tools for global changes across all nodes
- create AIX users, passwords, LVM components (VG, LV, JFS)
5)DARE (Dynamic Automatic Reconfiguration Event)
- Make HACMP changes without stopping the application
6)Monitoring using:
- HACMP commands, HAview, or HAtivoli, pager support
HACMP is Not the Right Solution If...
1)You cannot suffer any downtime
- Fault tolerance is required
- 7 x 24 operation is required
- Life critical systems
2)Your environment is insecure
- Users have access to root password
- Network security has not been implemented
3)Your environment is unstable
- Change management is not respected
- You do not have trained administrators
- Environment is prone to 'user fiddle factor'
HACMP will never be an out-of-the-box solution to availability. A certain degree of skill will always be required.
HACMP Basic Terms:
Cluster
HACMP's Resource Components:
Resources are logical components of the cluster configuration that can be moved from one node to another. Because they are logical components, they can be moved without human intervention. Resource components include: Resource Groups, IP labels, filesystems, NFS exports and mounts, Application Servers, Volume Groups and other
items. All the logical resources are collected together into a resource group. All components in a resource group move together from one node to another in the event of a node failure.
The difference between topology and resources is that the topology components are physical, i.e., nodes, networks and network adapters, which would require manual intervention to move from one place to another.
Failures detected by HACMP:
1) Node failure - processor hardware or OS failure.
2) Network adapter failure - moves the IP address to the standby adapter.
3) Network failure - message displayed on console, event is logged.
HACMP/ES can also monitor on applications, processor load and available disk capacity.
Other Failures:
1) Disk drive failure - protected by LVM mirroring or RAID.
2) Other hardware failures - no direct HACMP support. HACMP for AIX provides a SMIT interface to the AIX Error Notification Facility: trap on specific errors and execute a command in response to the error.
3) Application failures.
4) HACMP failure - promoted to a node failure.
5) Power failure - avoid common power supplies across replicated devices; use a UPS.
Cluster Resources :
- Applications
- Disk Drives
- Volume Groups
- File Systems
- NFS File Systems
- IP Addresses
Disk Crash
1) Data replicated through LVM mirroring
2) Data replicate on RAID.
Disk adapter Failure:
1) Data replicated through LVM mirroring across buses.
2) If RAID uses multiple buses , data available.
Network Adapter Failure:
1) Move IP address to standby adapter.
- The standby adapter takes over IP and, where applicable, MAC address.
- Duration between 5 and about 25 seconds
- If the standby adapter fails , no fallover action is taken.
- If the failed adapter comes back to life, it rejoins as a standby adapter.
- No significant effect on applications
Network Fails:
1) HACMP provides notification and runs any user defined scripts.
- HACMP detects a fault.
- Standard event script does not contain any actions.
- Network takeover only possible with customizing.
- Behavior of application depends on infrastructure.
Machine Fails:
1) Workload ( resources ) moved to surviving node.
2) TCP/IP address moved from failed to surviving node.
3) Users login again , using same host name.
What you lose:
1) Work in progress.
2) Any data not cached to disk.
3) All process state.
Types of HACMP Resource Groups
1) Cascading- computers have a fixed priority.
- resources move through an ordered list of nodes in the event that a node fails.
- resources automatically revert to the top level computer that is active.
A cascading resource group can be modified to prevent the resources moving back to a higher priority node upon reintegration of the previously failed node , this is called cascading without fallback.
2) Rotating
- all computers have equal priority
- takeover only happens when there is a defect, not with reintegration
- When a previously failed node rejoins cluster , the resources do not move back.
A limitation with rotating resource groups is that n-1 rotating resource groups are supported per cluster, where n is the number of nodes in the cluster. Another consideration for rotating resource groups is that all nodes should ideally be equally sized.
3) Concurrent- all computers work on the data simultaneously
Resource Groups, points to ponder :
You may have hundreds of resource groups in a single cluster.
You may mix and match all three types of resource group in a single cluster.
Simply because you have 32 nodes in a cluster does not mean that all nodes must service all resource groups. The only requirement is that each resource group is serviced by at least two nodes.
Any given node may service one or more resource groups (of the same or differing types).
You may manually move resource groups about the cluster to perform a type of load balancing.
You may choose which node has which resource group at any point in time or following a failure.
LVM and HACMP Considerations: All LVM constructs must have unique names in the cluster.
- like httplv, httploglv, httpfs and httpvg.
Mirror all critical logical volumes.
- Don't forget the jfslog.
The VG major device numbers should be the same.
- Mandatory for clusters exporting NFS filesystems, but it is a good habit for any cluster.
Shared data on internal disks is a bad idea.
HACMP and Networks:
HACMP clusters must contain two or more nodes and two or more networks, one being IP based, one being non-IP based. The non-IP networks are there to allow HACMP to determine if a failure is a network related failure or a node failure.
LANs carry the following traffic:
HACMP heartbeat or keepalive packets.
HACMP messages, used for cluster communications.
HACMP Lock Manager traffic used in concurrent access configurations.
Client communications, e.g., Telnet, FTP, NFS, sqlnet.
Serial networks carry:
HACMP heartbeats.
HACMP messages.
HACMP Network Components Terminology:
Public Network
Any TCP/IP LAN that supports HACMP client
network traffic
Private Network
Any TCP/IP LAN that carries only HACMP traffic.
Serial Network
An RS232/RS422 or tmscsi/tmssa network used
only for HACMP traffic.
Adapter
The HACMP ODM definition associated with a
TCP/IP or Serial network interface.
Adapter IP label
The name in /etc/hosts that maps to an IP address.
Nodename
The name associated with a cluster node , not to be
confused with hostname.
Network Type
Identifies the physical media type to HACMP , e.g.,
FDDI , ether, ATM, HPS.
Adapter Function
Service, Standby or Boot.
Adapter Hardware Address
This is the LAN adapter's Locally Administered Address (LAA).
All nodes must have at least one standby and one service adapter per network.
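The adapter IP labels above are just names in /etc/hosts. A minimal sketch of such a fragment, reusing the hypothetical labels and addresses from the configuration exercise later in these notes (written to /tmp here rather than the real /etc/hosts):

```shell
# Hypothetical /etc/hosts fragment for a two-node cluster; on a real
# cluster these lines are added to /etc/hosts on every node.
cat > /tmp/hosts.example <<'EOF'
193.9.200.225  node1_boot
193.9.200.1    node1_svc
193.9.201.1    node1_stby
193.9.200.226  node3_boot
193.9.200.3    node3_svc
193.9.201.3    node3_stby
EOF
# Each adapter IP label must map to exactly one address.
dups=$(awk '{print $2}' /tmp/hosts.example | sort | uniq -d)
[ -z "$dups" ] && echo "labels unique"
```

Note that the service and boot labels share the 193.9.200 subnet while the standby labels sit in 193.9.201, matching the subnet rules that follow.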
Keep the following items in mind when you do your network planning:
All service and boot adapters must be in the same subnet.
All standby adapters must be in the same subnet.
Service and Standby adapters must be in different logical subnets.
All adapters in the cluster must have the same subnet mask.
Do not edit the route entries for service and standby adapters in SMIT.
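The subnet rules above can be sanity-checked before configuration. A sketch in plain shell, with hypothetical addresses and a 255.255.255.0 mask:

```shell
#!/bin/sh
# Compute the network part of a dotted-quad IP under a given mask, to
# check HACMP's subnet rules. Addresses below are hypothetical.
network() {                        # usage: network <ip> <mask>
    ip=$1 mask=$2
    set -- $(IFS=.; echo $ip);   i1=$1 i2=$2 i3=$3 i4=$4
    set -- $(IFS=.; echo $mask); m1=$1 m2=$2 m3=$3 m4=$4
    echo "$((i1 & m1)).$((i2 & m2)).$((i3 & m3)).$((i4 & m4))"
}

MASK=255.255.255.0
svc=$(network  193.9.200.1   $MASK)   # service adapter
boot=$(network 193.9.200.225 $MASK)   # boot adapter
stby=$(network 193.9.201.1   $MASK)   # standby adapter

# service and boot must share a subnet; standby must not
[ "$svc" = "$boot" ]  && echo "service/boot: same subnet (ok)"
[ "$svc" != "$stby" ] && echo "service/standby: different subnets (ok)"
```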
IP Address Takeover (IPAT): Service IP address of failed node may be taken over by a surviving node.
This behaviour is known as IP Address Takeover or IPAT for short.
Standby adapter on a surviving node acquires the service address of failed node.
This is an optional behaviour that must be configured.
Requires the configuration of a 'boot' adapter.
If more than one standby adapter is available the surviving node may takeover the
workload of one or more failed nodes.
When a cluster is configured to use IPAT , an additional network adapter must be defined. This is known as a Boot Adapter. When a failed node recovers, it cannot boot on the Service IP address if this has been acquired by another node in the cluster. For this reason, the failed node needs to boot on a unique IP address which is not used elsewhere in the cluster. This ensures that there is no IP address duplication during reintegration.
Configuring IPAT:
IPAT is required for rotating resource groups and is optional for cascading resource groups. It is not supported for concurrent resource groups.
On all nodes, prepare security and name resolution:
1. Add an entry for the boot IP label in to /etc/hosts on each node.
2. Add the boot IP label to /.rhosts on each node.
3. Use FTP or rdist to keep these files in sync and minimise human error.
On the node that will have its service IP address taken over:
4. Change the IP address that is held in the ODM to that of the boot IP address by using smit chinet. This will cause cfgmgr to read the 'boot' address at system startup.
On any node, update cluster configuration:
5. Add the boot adapter definition to the cluster topology, for that node which will have its service IP address taken over.
6. Synchronise the topology (you will get a warning message).
7. Now add the service IP label of the node to be taken over to a resource group.
8. Take a snapshot of your modified topology and update your cluster planning worksheets.
Configuring Hardware Address Takeover: Do not enable the "ALTERNATE" hardware address field in smit devices.
This will cause the adapter to boot on its locally administered address rather than the burned-in ROM address.
This causes serious communications problems and will put the cluster into an unstable state.
The correct method is to enter your chosen LAA into the smit hacmp menus.
Some adapter types are very specific about the numbering of the first two digits in an LAA.
Token-ring and FDDI in particular.
The first byte of the LAA must be 42 for Token-Ring; for FDDI, the first digit must be 4, 5, 6 or 7.
Always check the documentation provided with the adapter and the HACMP manuals.
Token-Ring adapters will not release the LAA if AIX crashes.
AIX must be set to reboot automatically after a system crash (smit environment).
Installation of the software
Install the HACMP software from the HACMP CD
- cluster.adt
- cluster.base
- cluster.cspoc
- cluster.man
- cluster.taskguides
- cluster.vsm
HACMP Daemons:
1) clstrmgr
2) clsmuxpd (works over SNMP)
3) cllockd - for concurrent access
4) clinfo - required for IPAT with hardware (MAC) address takeover
The clstrmgr and clsmuxpd daemons are mandatory; the other two are optional.
1)Cluster Manager (clstrmgr) :
- Runs on all cluster nodes.
- Tracks cluster topology.
- Tracks network status.
- Externalizes failure events.
The Cluster Manager has four functional pieces: the Cluster Controller (CC), the Event Manager (EM), the NIM Interface Layer (NIL) and the Network Interface Modules (NIMs).
CC, EM and NIL are all part of the clstrmgr executable. The NIMs are separate executables, one for each network.
At startup the Cluster Controller reads the configuration information from the ODM. The NIM Interface Layer controls the hardware using the Network Interface Modules. The Event Manager handles the event scripts and communicates with clsmuxpd and cllockd.
The Cluster Controller performs a number of coordinating functions:
Retrieves cluster configuration information from the HACMP ODM object classes at startup and during a refresh or DARE operation.
Establishes the ordering of cluster neighbours for the purpose of sending keep alive packets.
Tracks changes to the cluster topology.
Receives information about cluster status changes from the NIM via the NIL.
Queues events in response to status changes in the cluster.
Handles node isolation and partitioned clusters.
The NIL provides a common interface between the Cluster Controller and one or more NIMs. This allows NIMs to be developed for new adapter hardware without rewriting the cluster manager.
It tells the NIMs the appropriate keep alive and failure detection rates for each network type as mentioned in the ODM.
It starts the appropriate NIMs for the network types that have been defined in the HACMP classes of the ODM.
It gives the NIMs a list of the IP addresses or /dev files to send keepalives to.
It restarts the NIMs if they hang or exit.
The NIMs are the contact point between HACMP and the network interfaces.
The NIMs send and receive keep-alive and message information.
They detect network related failures.
They are provided for each supported network type including a generic one.
The Event manager performs the following functions:
It starts the appropriate event scripts in response to status change in the cluster.
It sets the required environment variables.
It communicates with clsmuxpd and cllockd when required.
It starts the config_too_long event if any event script does not exit with status 0 within 6 minutes.
The Event Manager causes event scripts to execute. Primary events (such as node_up, node_up_complete, node_down, node_down_complete) are called directly by the Cluster Manager. Sub-events (such as node_up_local, node_up_remote, node_down_remote, node_down_local) are called by primary events.
1) Cluster SNMP Agent (clsmuxpd):
- Receives information from Cluster Manager.
- Maintains the HACMP enterprise specific MIB.
- Provides information to SNMP.
1) Cluster Lock Manager (cllockd):
- Cluster wide advisory level locking.
- CLM locking API.
- Unix locking API.
- Only for processes running on cluster.
1) Cluster Information Services (clinfo):
- Optional on both cluster nodes and cluster clients.
- Provides cluster status information to clients.
- Clinfo API allows for cluster aware applications.
Characteristics of keep-alive (KA) packets:
- Transmitted over all interfaces known to HACMP
- Direct I/O (non-IP networks) or UDP packets
- Three adjustable transmission rates (fast, normal, slow)
- If a failure rate is exceeded, an event is triggered
Dead Man's Switch (DMS):
If the LED on one of your cluster nodes flashes 888, you may have experienced a DMS time-out.
The cause is that clstrmgr could not send a heartbeat to itself, usually due to excessive I/O traffic. The other nodes in the cluster then start the node_down event, and if the hung node is still writing to the shared disks, data corruption can result. The DMS exists to prevent this: the operating system (AIX) panics (888) the hung node first.
The deadman switch is a kernel extension to AIX. The cluster manager tries to reset the DMS frequently (every 0.5 seconds).
If the DMS is not reset for n-1 seconds - PANIC!
n = (KA rate) x (Missed KAs) for the slowest network
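A worked example of this formula. The rates below are hypothetical; the real values come from the network module settings in the HACMP ODM (see cllsnim):

```shell
#!/bin/sh
# DMS timeout sketch: n = (KA rate) x (missed KAs) for the slowest network.
# KA_RATE and MISSED_KAS are hypothetical example values.
KA_RATE=2        # seconds between keepalives on the slowest network
MISSED_KAS=6     # missed keepalives before a failure is declared

n=$((KA_RATE * MISSED_KAS))
echo "failure detection time: $n seconds"        # 12 seconds
echo "DMS panics if not reset for: $((n - 1)) seconds"   # 11 seconds
```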
Cluster Single Point of Control (C-SPOC):
C-SPOC provides facilities for performing common cluster wide administration tasks from any node within the cluster.
Requires either /.rhosts or kerberos to be configured on all nodes.
C-SPOC operations fail if any target node is down at the time of execution or selected resource is not available.
Any change to a shared VGDA is synchronised automatically if C-SPOC is used to change shared LVM components (VGs, LVs, JFS).
C-SPOC uses a script parser called the "command execution language"
The cluster snapshot utility:
HACMP's cluster snapshot utility records the HACMP ODM configuration information, both cluster topology and resources.
When a new snapshot is created, two files are generated.
.odm (contains ALL cluster topology and resource information)
.info (a printable report, can be extended)
By default, snapshots are stored in the directory /usr/sbin/cluster/snapshots
The SNAPSHOTPATH environment variable can be used to specify an
alternative location for storing snapshots.
The documentary report that a snapshot creates can be customised to include information specific to a given cluster (application configuration).
Snapshots can be applied to a running cluster.
HACMP Log Files :
1) /usr/adm/cluster.log - output of cluster scripts and daemons.
2) /usr/sbin/cluster/history/cluster.mmdd - history files, created daily.
3) /tmp/cspoc.log - output of all commands executed by C-SPOC.
4) /tmp/cm.log - log of the Cluster Manager daemon.
5) /tmp/emuhacmp.out - output of event emulation scripts.
6) /tmp/hacmp.out - detailed output of all event scripts.
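A quick diagnostic habit is to pull the event history out of /tmp/hacmp.out with grep. A sketch against a mocked-up log (real hacmp.out lines carry timestamps before the EVENT markers, which are omitted here):

```shell
#!/bin/sh
# Mock a few hacmp.out-style event lines (simplified: no timestamps).
cat > /tmp/hacmp.out.sample <<'EOF'
EVENT START: node_up node1
EVENT COMPLETED: node_up node1
EVENT START: node_up_complete node1
EVENT COMPLETED: node_up_complete node1
EOF
# Count started vs completed events; a mismatch hints at an event
# script that never finished (see config_too_long).
started=$(grep -c 'EVENT START:' /tmp/hacmp.out.sample)
done_=$(grep -c 'EVENT COMPLETED:' /tmp/hacmp.out.sample)
echo "started=$started completed=$done_"
```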
Dynamic Reconfiguration :
HACMP provides a facility that allows changes to cluster topology and resources to be made while the cluster is active. This facility is known as DARE or, to give it its full name, "Dynamic Automatic Reconfiguration Event". It requires 3 copies of the HACMP ODM:
Default Configuration Directory (DCD), which is updated by SMIT/command line: /etc/objrepos
Staging Configuration Directory (SCD), which is used during reconfiguration: /usr/sbin/cluster/etc/objrepos/staging
Active Configuration Directory (ACD), from which clstrmgr reads the cluster configuration: /usr/sbin/cluster/etc/objrepos/active
DARE allows changes to be made to most cluster topology and nearly all resource group components without the need to stop HACMP, take the application offline or reboot a node. All changes must be synchronised in order to take effect.
Pre-configuration steps
1) Add the boot, service and standby entries in the /etc/hosts file.
2) Make these entries in the .rhosts file (required at the time of sync). HACMP uses rsh on the remote machine for updates.
3) Create the volume group on node1 with a major number which is free on both nodes (use the lvlstmajor command to check the free major numbers).
4) Turn off the auto-varyon feature of that volume group:
chvg -a n <vgname>
5) Create a jfslog logical volume:
# mklv -t jfslog -y <loglvname> <vgname> <#LPs>
6) Format the log logical volume:
logform /dev/<loglvname>
7) Create a logical volume:
mklv -t jfs -y <lvname> <vgname> <#LPs>
8) Create the file system:
crfs -v jfs -d /dev/<lvname> -m /<mountpoint>
9) Vary off the VG:
varyoffvg <vgname>
10) Import the VG on the other node with the same major number:
importvg -V 44 -y <vgname> <hdisk#>
11) Turn off the auto-varyon feature:
chvg -a n <vgname>
12) Vary off the VG:
varyoffvg <vgname>
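Step 3 depends on finding a major number that is free on every node. A sketch of that selection, assuming the lvlstmajor output from each node has already been expanded into a plain list of free numbers (the lists below are hypothetical):

```shell
#!/bin/sh
# Pick the lowest major number free on both nodes, for use with
# "importvg -V <major#>". The free lists are hypothetical stand-ins
# for (expanded) lvlstmajor output from each node.
node1_free="39 40 41 44 45"      # hypothetical: lvlstmajor on node1
node3_free="40 42 44 46"         # hypothetical: lvlstmajor on node3

common_major() {                 # usage: common_major "<list1>" "<list2>"
    for m in $1; do
        for n in $2; do
            [ "$m" = "$n" ] && { echo "$m"; return 0; }
        done
    done
    return 1                     # no common free major number
}

major=$(common_major "$node1_free" "$node3_free")
echo "use major number: $major"  # here: 40
```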
Configuration steps ( short )
1)Define cluster Id and cluster name
2)Configure nodes
3)Configure net adapters
4)Sync the topology
5)Create resource group
6)Create resources
7)Sync the resources
Steps in Configuring Clusters:
Step 1 - Plan your cluster
- Use the planning worksheets and the documentation
Step 2 - Configure TCP/IP and LVM - /etc/hosts, /.rhosts, jfs mirroring and layout
Step 3 - Install the HACMP software- Select the necessary filesets
Step 4 - Define the cluster topology - Nodes, networks and network adapters
Step 5 - Synchronize the cluster topology - Verification performed automatically
Step 6 - Configure the application start and stop scripts- Application servers
Step 7 - Define the cluster resources and resource groups-File systems, IP addresses,
exports and NFS mounts
Step 8 - Synchronize the cluster resources -Verification performed automatically
Step 9 - Test the cluster
- Including application tests
HACMP Enhanced Scalability (HACMP/ES):
- extended functionality compared with classic HACMP
- up to 32 nodes in one HACMP cluster.
- Based on IBM High Availability Infrastructure.
- Includes all features of classic HACMP.
Difference between HACMP and HACMP/ES:
- HACMP/ES uses RSCT (Reliable Scalable Cluster Technology)
- User defined events based on RSCT
- Application Monitoring
- Recovery from resource group acquisition failure.
- Dynamic node policy
- Selective Fallover
- Plugins
In 4.4.1, plugins are provided to help configure the following services:
- DNS
- DHCP
- Print services
The plugins add the server application and an application monitor to an existing resource group.
- Process Monitoring via provided plugin scripts
- Users must create their own scripts/programs for Custom Monitoring
What Plugins provide:
- The resource group now contains an application server to start and stop the relevant daemons.
- The resource group also contains the shared volume group and filesystem needed by the daemons.
- An application monitor is configured to watch the daemons.
- When the resource group comes online, the daemon(s) will be activated.
- If a daemon should fail, the application monitor will detect it.
- the daemon will be restarted or the resource group will move to another
node.
- Note that the limitations of application monitoring still exist!
- only one monitored application per resource group.
- the monitored application can be included in at most one resource group.
HACMP Commands
rdist -b -f /etc/disfile1
To distribute the files listed in disfile1 to all nodes in disfile1, in binary mode
Sample entry for disfile1:
HOSTS = ( root@node1 root@node3 )
FILES = ( /etc/passwd /etc/security/passwd)
${FILES} -> ${HOSTS}
clstart -m -s -b -i -l
To start the cluster daemons (-m clstrmgr, -s clsmuxpd, -b broadcast message, -i clinfo, -l cllockd)
clstop -f -N
To force shutdown of the cluster immediately, without releasing resources
clstop -g -N
To do graceful shutdown immediately with no takeover
clstop -gr -N
To do graceful shutdown immediately with takeover
cldare -t
To sync the cluster topology
cldare -t -f
To do the mock sync of topology
cldare -r
To sync the cluster resources
cldare -r -f
To do the mock sync of resources
clverify
cluster verification utility
cllscf
To list cluster topology information
cllsclstr
To list the name and security level of the cluster
cllsnode
To list the info about the cluster nodes
cllsnode -i node1
To list info about node1
cllsdisk -g shrg
To list the PVID of the shared hard disk for resource group shrg
cllsnw
To list all cluster networks
cllsnw -n ether1
To list the details of network ether1
cllsif
To list the details by network adapter
cllsif -n node1_service
To list the details of network adapter node1_service
cllsvg
To list the shared vgs which can be accessed by all nodes
cllsvg -g sh1
To list the shared vgs in resource group sh1
cllslv
To list the shared lvs
cllslv -g sh1
To list the shared lvs in the resource group sh1
cllsdisk -g sh1
To list the PVID of disks in the resource group sh1
cllsfs
To list the shared file systems
cllsfs -g sh1
To list the shared file systems in the resource group sh1
cllsnim
Show info about all network modules
cllsnim -n ether
Show info about ether network module
cllsparam -n node1
To list the runtime parameters for the node node1
cllsserv
To list all the application servers
claddclstr -i 3 -n dcm
To add a cluster definition with name dcm and id 3
claddnode
To add an adapter
claddnim
To add network interface module
claddgrp -g sh1 -r cascading -n n1 n2
To create resource group sh1 with nodes n1 and n2 in cascading mode
claddserv -s ser1 -b /usr/start -e /usr/stop
To create an application server ser1 with start script /usr/start and stop script /usr/stop
clchclstr -i 2 -n dcmds
To change cluster definitions name to dcmds and id to 2
clchclstr -s enhanced
To change the cluster security level to enhanced
clchnode
To change the adapter parameters
clchgrp
To change the resource group name or node relationship
clchparam
To change the run time parameters (like verbose logging)
clchserv
To change the name of an application server or change the start/stop scripts
clrmclstr
To remove the cluster definition
clrmgrp -g sh1
To delete the resource group sh1 and related resources
clrmnim ether
To remove the network interface module ether
clrmnode -n node1
To remove the node node1
clrmnode -a node1_svc
To remove the adapter named node1_svc
clrmres -g sh1
To remove all resources from resource group sh1
clrmserv app1
To remove the application server app1
clrmserv ALL
To remove all application servers
clgetactivenodes -n node1
To list the nodes with active cluster manager processes, as known by the cluster manager on node node1
clgetaddr node1
To return a pingable address from node node1
clgetgrp -g sh1
To list the info about resource group sh1
clgetgrp -g sh1 -f nodes
To list the participating nodes in the resource group sh1
clgetif
To list the interface name/interface device name/netmask associated with a specified IP label or IP address on a specific node
clgetip sh1
To get the ip label associated to the resource group
clgetnet 193.9.200.2 255.255.255.0
To list the network for IP 193.9.200.2 with netmask 255.255.255.0
clgetvg -l nodelv
To list the VG of LV nodelv
cllistlogs
To list the logs
clnodename -a node5
To add node5 to the cluster
clnodename -o node5 -n node3
To change the cluster node name node5 to node3
clshowres
To list the resources defined for all resource groups
clfindres
To find the resource group within a cluster
xclconfig
X utility for cluster configuration
xhacmpm
X utility for hacmp management
xclstat
X utility for cluster status
Setup for Cascading mode :
HACMP Configuration Exercise :
Scenario: Connecting machines in a cascading resource group, so as to provide IP address takeover, NFS availability and application takeover.
1. Create a .rhosts file on all nodes which are going to be part of HACMP. The file should be in the root directory and contain the names of the boot, standby and service adapters.
# cat .rhosts
cws
node3
node1
# End of generated entries by updauthfiles script
node1_boot
node1_svc
node1_stby
node3_boot
node3_svc
node3_stby
2. # smit hacmp
A.) Define a cluster
HACMP for AIX
Move cursor to desired item and press Enter.
Cluster Configuration
Cluster Services
Cluster System Management
Cluster Recovery Aids
RAS Support
Cluster Configuration
Move cursor to desired item and press Enter.
Cluster Topology
Cluster Security
Cluster Resources
Cluster Snapshots
Cluster Verification
Cluster Custom Modification
Restore System Default Configuration from Active Configuration
Cluster Topology
Move cursor to desired item and press Enter.
Configure Cluster
Configure Nodes
Configure Adapters
Configure Network Modules
Show Cluster Topology
Synchronize Cluster Topology
Configure Cluster
Move cursor to desired item and press Enter.
Add a Cluster Definition
Change / Show Cluster Definition
Remove Cluster Definition
Add a Cluster Definition
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[Entry Fields]
* Cluster ID [10] #
* Cluster Name [dcm]
B.) Add Participating nodes
# smit hacmp
- Cluster Configuration
- Cluster Topology
- Configure Nodes
- Add Cluster Nodes
Make entries for the participating nodes. The names need not be related to the /etc/hosts file; any name can be used.
Add Cluster Nodes
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[Entry Fields]
* Node Names [node1 node3]
C.) Create entries for all IP addresses
First confirm that the system has booted through the boot IP given. Check using
# lsattr -El en0
Make entries for all adapters like node1_boot , node1_svc, node1_stby , node3_boot, node3_svc and node3_stby.
# smit hacmp
- Cluster Configuration
- Cluster Topology
- Configure Adapters
- Add an adapter
Add an Adapter
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[Entry Fields]
* Adapter IP Label [node1_boot]
* Network Type [ether] +
* Network Name [ether1] +
* Network Attribute public +
* Adapter Function boot +
Adapter Identifier [193.9.200.225]
Adapter Hardware Address []
Node Name [node1] +
D.) Check the Cluster Topology
#smit hacmp
- Cluster Configuration
- Cluster Topology
- Show Cluster Topology
- Check the cluster topology.
E.) Synchronize Cluster Topology
#smit hacmp
- Cluster Configuration
- Cluster Topology
- Synchronize Cluster Topology
The topology is copied to all participating nodes.
F.) Create a Resource Group
#smit hacmp
- Cluster Configuration
- Cluster Resources
- Define resource Group
- Add a resource Group
Add a Resource Group
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[Entry Fields]
* Resource Group Name [rg1]
* Node Relationship cascading +
* Participating Node Names [node1 node3] +
Give Resource Group Name , Node Relationship and Participating Node Name.
G.) Define Resources for a Resource Group
#smit hacmp
- Cluster Configuration
- Cluster resources
- Change/Show Resources for a RG
Select a Resource Group
Move cursor to desired item and press Enter.
rg1
Configure a Resource Group
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[TOP] [Entry Fields]
Resource Group Name rg1
Node Relationship cascading
Participating Node Names node1 node3
Service IP Label [node1_svc] +
Filesystems [] +
Filesystems Consistency Check fsck +
Filesystems Recovery Method sequential +
Filesystems to Export [] +
Filesystems to NFS Mount [] +
Volume Groups [] +
Concurrent Volume Groups [] +
Raw Disk PVIDs [] +
AIX Connections Services [] +
AIX Fast Connect Services [] +
Application Servers [] +
Highly Available Communication Links [] +
Miscellaneous Data []
Inactive Takeover Activated false +
9333 Disk Fencing Activated false +
SSA Disk Fencing Activated false +
Filesystems mounted before IP configured false +
[BOTTOM]
An entry is made for node1_svc in rg1 so that the address can be taken over in case of an adapter failure or a network failure.
Similarly create another Resource Group rg2 for node3_svc.
Entries for NFS and Application Servers, if required, are also made on the above screen.
H.) Copy the Resource information to all participating nodes
#smit hacmp
- Cluster Configuration
- Cluster resources
- Synchronize Cluster resources.
The resource configuration is copied to all participating nodes.
I.) Start HACMP on all participating nodes.
#smit hacmp
- Cluster Services
- Start Cluster Services
HACMP is started on each machine individually.
We can use C-SPOC (Cluster Single Point Of Control) to start the services on all machines at once. However, it has not been enabled on the SP for security reasons.
For C-SPOC
# smit hacmp
- Cluster system management
- HACMP for AIX Cluster Services.
- Start Cluster Services ( It takes time )
J.) Check that the cluster services are started
# lssrc -g cluster
Now the system has been configured for high availability of the IP address. It can be tested by stopping the HACMP services on one of the nodes. Follow these steps:
a.) On both nodes run the following command and check that the service IP is in use on en1; en2 should be using the standby IP.
# netstat -in
Name  Mtu   Network    Address           Ipkts  Ierrs  Opkts  Oerrs  Coll
lo0   16896 link#1                       210    0      229    0      0
lo0   16896 127        127.0.0.1         210    0      229    0      0
lo0   16896 ::1                          210    0      229    0      0
en0   1500  link#2     0.60.94.e9.56.e3  40117  0      38081  0      0
en0   1500  192.9.200  192.9.200.2       40117  0      38081  0      0
en1   1500  link#3     0.6.29.ac.ca.66   63612  0      1136   0      0
en1   1500  193.9.200  193.9.200.226     63612  0      1136   0      0
en2   1500  link#4     0.6.29.ac.f2.f6   0      0      3      3      0
en2   1500  193.9.201  193.9.201.1       0      0      3      3      0
The above display is obtained on node1.
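The check in step a.) can be scripted. The sketch below parses saved netstat -in style output and reports which interface carries a given address; the helper name find_iface and the sample file path are illustrative, not part of HACMP:

```shell
#!/bin/sh
# find_iface: print the interface whose Address column ($4) matches
# the given IP in saved `netstat -in` output. Hypothetical helper for
# checks such as "is the service IP active on en1?".
find_iface() {
    awk -v ip="$2" '$4 == ip { print $1 }' "$1"
}

# Sample rows as captured on node1 (Name Mtu Network Address ...):
cat > /tmp/netstat.sample <<'EOF'
en0 1500 192.9.200 192.9.200.2 40117 0 38081 0 0
en1 1500 193.9.200 193.9.200.226 63612 0 1136 0 0
en2 1500 193.9.201 193.9.201.1 0 0 3 3 0
EOF

find_iface /tmp/netstat.sample 193.9.200.226   # prints: en1
```

The same check works unchanged on node3, where the service address is expected to appear on the standby adapter after a takeover.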
b.) Stop the cluster services on node1.
# smit hacmp
- Cluster Services
- Stop Cluster Services
Stop Cluster Services
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[Entry Fields]
* Stop now, on system restart or both now +
BROADCAST cluster shutdown? true +
* Shutdown mode graceful with takeover +
(graceful, graceful with takeover, forced)
c.) Now check on node3 that the node1_svc IP has shifted to its standby adapter (en2), using the netstat -in command.
Adding serial links over target mode SSA (tmssa):
Check the device addresses on each node using
# lsdev -C | grep tmssa
# smit hacmp
- Cluster Configuration
- Cluster Topology
- Configure Adapters
- Add an adapter
Add an Adapter
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[Entry Fields]
* Adapter IP Label [node1_tmssa]
* Network Type [tmssa] +
* Network Name [ssa1] +
* Network Attribute serial +
* Adapter Function service +
Adapter Identifier [/dev/tmssa3]
Adapter Hardware Address []
Node Name [node1] +
Similarly make entry for node3 ( /dev/tmssa2).
Create Application server :
# smit hacmp
- Cluster Configuration
- Cluster Resources
- Define Application Servers
Define Application Servers
Move cursor to desired item and press Enter.
Add an Application Server
Change / Show an Application Server
Remove an Application Server
Add Application Server
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[Entry Fields]
* Server Name [tsm]
* Start Script [/usr/bin/tsmstart]
* Stop Script [/usr/bin/tsmstop]
The same scripts have to be copied to all participating nodes. The entry for the Application Server has to be made in the Resource Group (Step 2.G).
Sample tsmstart script
#!/bin/sh
# Wait until the service IP label is active on this node,
# then start the TSM server.
SERVICEIP=1
while [ $SERVICEIP -ne 0 ]
do
    x=`netstat -i | grep -c node1_svc`
    if [ $x -eq 1 ]
    then
        SERVICEIP=0
        echo "Exiting with SERVICEIP"
    else
        echo "Waiting for IP takeover"
        sleep 2
    fi
done
sleep 15
/usr/tivoli/tsm/server/bin/rc.adsmserv
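The wait loop in tsmstart will spin forever if IP takeover never completes. A hedged variant with a bounded retry count is sketched below; check_label is stubbed with a file here so the logic can run anywhere, whereas on the cluster it would be `netstat -i | grep -c node1_svc`:

```shell
#!/bin/sh
# Bounded version of the tsmstart wait loop: give up instead of
# looping forever when the service IP never arrives.
# check_label is a test stub; on the cluster use:
#   check_label() { netstat -i | grep -c node1_svc; }
check_label() { grep -c node1_svc /tmp/ifaces.txt; }

wait_for_service_ip() {
    retries=0
    while [ "$(check_label)" -eq 0 ]
    do
        retries=$((retries + 1))
        [ "$retries" -ge 30 ] && return 1   # ~60s at 2s per retry
        sleep 2
    done
    return 0
}

echo "en1 1500 193.9.200 node1_svc" > /tmp/ifaces.txt  # simulate takeover done
wait_for_service_ip && echo "service IP present"
```

Returning a failure code instead of hanging lets HACMP report the application server start as failed rather than blocking the event scripts indefinitely.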
Sample tsmstop script
#!/bin/sh
# Halt the TSM server through the administrative client.
cd /usr/tivoli/tsm/client/ba/bin
dsmadmc -id=admin -password=support halt
sleep 15
Cluster snapshot creation :
# smit hacmp
Cluster Snapshots
Add a Cluster Snapshot
We are required to provide a Cluster Snapshot Name, an optional Custom-Defined Snapshot Method and a Cluster Snapshot Description.
The snapshot is created in the directory /usr/bin/cluster/snapshots. Two files are created: <name>.odm and <name>.info.
Testing the non-IP serial communication (tmssa):
Node1 : # cat < /dev/tmssa3.tm
Node3 : # cat > /dev/tmssa2.im
Text typed on node3 should appear on node1, confirming that the serial link is working.
Scenario :
Connecting the nodes in a Rotating Resource Group. In rotating mode, a resource does not belong to any particular node. Therefore, when creating a resource such as a service IP address, no node name is given.
Setup in Rotating mode :
The configuration of the resource group is exactly the same as for cascading mode. However, when adding the adapter for node1_svc, do not provide the Node Name. The other adapters (node1_boot, node1_stby, node3_boot and node3_stby) are added as in Step 2.C.
# smit hacmp
- Cluster Configuration
- Cluster Topology
- Configure Adapters
- Add an adapter
Add an Adapter
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[Entry Fields]
* Adapter IP Label [node1_svc]
* Network Type [ether] +
* Network Name [ether1] +
* Network Attribute public +
* Adapter Function service +
Adapter Identifier [193.9.200.226]
Adapter Hardware Address []
Node Name [ ] +
After adding the adapter definitions, synchronize the topology. Then create a resource group rg3 with the node relationship set to rotating. After making the entries for the resource group, synchronize the cluster resources. Start HACMP on each node.
The service IP is allocated on the boot adapter of node1. To test IP takeover, stop the HACMP services on node1 (Step 2.J b). The service IP label moves to the boot adapter of node3.
Note : It has been observed that the tmssa serial link must be configured in the rotating RG; the IP takeover test succeeds only when tmssa is configured.