HACMP Course Module 1

HACMP and HACMP/ES

HACMP 5.2 includes all the features of the product that was previously referred to as HACMP/ES (Enhanced Scalability). The product previously referred to as HACMP (Classic) is no longer available.

Note: Prior to version 5.1, the HACMP for AIX software included four features: HAS and CRM with core filesets named cluster.base*; and ES and ESCRM with core filesets named cluster.es*. Starting with HACMP 5.1, the HAS, CRM and ES features are no longer available, and the ESCRM feature is now called HACMP.

To summarize, HACMP 5.2 has the following characteristics:

• Includes all the features of ESCRM 4.5 in addition to the new functionality added in HACMP versions 5.1 and 5.2.
• Takes advantage of the enhanced Reliable Scalable Cluster Technology (RSCT). RSCT provides facilities for monitoring node membership; network interface and communication interface health; and event notification, synchronization and coordination via reliable messaging.
• HACMP clusters with both non-concurrent and concurrent access can have up to 32 nodes.
• Is supported on AIX 5.1 and AIX 5.2.

HACMP

Chapter 1 1.1 Introduction

Hacmp – 1.1 Introduction Chapter – 1

High Availability Cluster Multi-Processing for AIX

The IBM tool for building UNIX-based mission-critical computing platforms is the HACMP software. The HACMP software ensures that critical resources, such as applications, are available for processing. HACMP has two major components: high availability (HA) and cluster multi-processing (CMP).

The primary reason to create HACMP clusters is to provide a highly available environment for mission-critical applications. For example, an HACMP cluster could run a database server program which services client applications. The clients send queries to the server program, which responds to their requests by accessing a database stored on a shared external disk.

In an HACMP cluster, to ensure the availability of these applications, the applications are put under HACMP control. HACMP takes measures to ensure that the applications remain available to client processes even if a component in a cluster fails. To ensure availability in case of a component failure, HACMP moves the application (along with resources that ensure access to the application) to another node in the cluster.

High Availability and Hardware Availability

High availability is sometimes confused with simple hardware availability. Fault tolerant, redundant systems (such as RAID), and dynamic switching technologies (such as DLPAR) provide recovery of certain hardware failures, but do not provide the full scope of error detection and recovery required to keep a complex application highly available.

A modern, complex application requires access to all of these elements:
• Nodes (CPU, memory)
• Network interfaces (including external devices in the network topology)
• Disk or storage devices.

Recent surveys of the causes of downtime show that actual hardware failures account for only a small percentage of unplanned outages. Other contributing factors include:
• Operator errors
• Environmental problems
• Application and operating system errors.

Reliable and recoverable hardware simply cannot protect against failures of all these different aspects of the configuration. Keeping these varied elements, and therefore the application, highly available requires:


High Availability vs. Fault Tolerance

Fault tolerance relies on specialized hardware to detect a hardware fault and instantaneously switch to a redundant hardware component, whether the failed component is a processor, memory board, power supply, I/O subsystem, or storage subsystem. Although this cutover is apparently seamless and offers non-stop service, a high premium is paid in both hardware cost and performance because the redundant components do no processing. More importantly, the fault tolerant model does not address software failures, by far the most common reason for downtime.

High availability views availability not as a series of replicated physical components, but rather as a set of system-wide, shared resources that cooperate to guarantee essential services. High availability combines software with industry-standard hardware to minimize downtime by quickly restoring essential services when a system, component, or application fails. While not instantaneous, services are restored rapidly, often in less than a minute.

The difference between fault tolerance and high availability, then, is this: a fault tolerant environment has no service interruption, while a highly available environment has a minimal service interruption. Many sites are willing to absorb a small amount of downtime with high availability rather than pay the much higher cost of providing fault tolerance. Additionally, in most highly available configurations, the backup processors are available for use during normal operation.
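To put "a small amount of downtime" into numbers, the following Python sketch converts an assumed number of fallovers per year and an assumed recovery time into yearly downtime and an availability percentage. The function names and the figures used (four fallovers of 60 seconds each) are illustrative assumptions, not values taken from HACMP or from this course.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # non-leap year


def yearly_downtime_seconds(failures_per_year, recovery_seconds):
    """Total service interruption per year, assuming every failure
    costs the same recovery time (a simplification)."""
    return failures_per_year * recovery_seconds


def availability_percent(downtime_seconds):
    """Share of the year, in percent, during which the service is up."""
    return 100.0 * (1 - downtime_seconds / SECONDS_PER_YEAR)


if __name__ == "__main__":
    # Hypothetical site: four fallovers a year, one minute each.
    downtime = yearly_downtime_seconds(4, 60)
    print(downtime, "seconds of downtime per year")
    print(round(availability_percent(downtime), 4), "percent available")
```

A fault tolerant system would aim for zero seconds in the same calculation, at the hardware cost described above.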

History and evolution

IBM High Availability Cluster Multi-Processing goes back to the early 1990s. HACMP development started in 1990 to provide a high availability solution for applications running on RS/6000 servers.

HACMP V4.2.2 -- Along with HACMP Classic (HAS), this version introduced the enhanced scalability version (ES) based on RSCT (Reliable Scalable Cluster Technology) topology, group, and event management services, derived from PSSP (Parallel Systems Support Program).

HACMP V4.3.X -- This version introduced, among other aspects, 32 node support for HACMP/ES, C-SPOC enhancements, ATM network support, HACMP Task guides (GUI for simplifying cluster configuration), multiple pre- and post-event scripts, FDDI MAC address takeover, monitoring and administration support enhancements, node by node migration, and AIX Fast Connect support.

HACMP V4.4.X -- New items in this version are integration with Tivoli®, application monitoring, cascading without fallback, C-SPOC enhancements, improved migration support, integration of HA-NFS functionality, and soft copy documentation (HTML and PDF).

HACMP V4.5 -- In this version, AIX 5L is required, and there is an automated configuration discovery feature, multiple service labels on each network adapter (through the use of IP aliasing), persistent IP address support, 64-bit-capable APIs, and monitoring and recovery from loss of volume group quorum.

HACMP V5.1 -- This is the version that introduced major changes, from configuration simplification and performance enhancements to changing HACMP terminology. Some of the important new features in HACMP V5.1 were:
• SMIT “Standard” and “Extended” configuration paths (procedures)
• Automated configuration discovery
• Custom resource groups
• Non IP networks based on heartbeating over disks
• Fast disk takeover
• Forced varyon of volume groups
• Heartbeating over IP aliases
• HACMP “classic” (HAS) has been dropped; now there is only HACMP/ES, based on IBM Reliable Scalable Cluster Technology
• Heartbeat monitoring of service IP addresses/labels on takeover node(s)
• Heartbeating over disks
• Various C-SPOC enhancements
• GPFS integration
• Cluster verification enhancements
• Improved resource group management

HACMP V5.2 -- Starting July 2004, the new HACMP V5.2 added more improvements in management, configuration simplification, automation, and performance areas. Here is a summary of the improvements in HACMP V5.2:
• Two-Node Configuration Assistant, with both SMIT menus and a Java™ interface (in addition to the SMIT “Standard” and “Extended” configuration paths).
• File collections.
• User password management.
• Classic resource groups are not used anymore, having been replaced by custom resource groups.
• Automated test procedure.
• Automatic cluster verification.
• Improved Online Planning Worksheets (OLPW) can now import a configuration from an existing HACMP cluster.
• Event management (EM) has been replaced by the Resource Monitoring and Control (RMC) subsystem (standard in AIX).
• Enhanced security.
• Resource group dependencies.
• Self-healing clusters.


High availability concepts

What needs to be protected? Ultimately, the goal of any IT solution in a critical environment is to provide continuous service and data protection.

High availability is just one building block in achieving the continuous operation goal. It is based on the availability of the hardware, software (operating system and its components), application, and network components.

For a high availability solution you need:
• Redundant servers
• Redundant networks
• Redundant network adapters
• Monitoring
• Failure detection
• Failure diagnosis
• Automated fallover
• Automated reintegration
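The monitoring and failure detection items can be illustrated with a minimal heartbeat sketch in Python. This is not how RSCT or HACMP implement detection; the class name, the timeout value, and the node names are hypothetical and chosen only to show the general idea: a node that has not been heard from within a timeout is treated as failed.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Toy failure detector: a node is considered failed when no
    heartbeat has been recorded for more than `timeout` seconds."""
    timeout: float
    last_seen: dict = field(default_factory=dict)

    def record_heartbeat(self, node, now):
        # Remember the most recent time a heartbeat arrived from this node.
        self.last_seen[node] = now

    def failed_nodes(self, now):
        # Nodes whose last heartbeat is older than the timeout.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout=10.0)
    monitor.record_heartbeat("nodeA", now=0.0)
    monitor.record_heartbeat("nodeB", now=0.0)
    monitor.record_heartbeat("nodeB", now=8.0)  # nodeA stays silent
    print(monitor.failed_nodes(now=12.0))
```

Timestamps are passed in explicitly so the behaviour is easy to check; a real monitor would read a clock and would need redundant networks to tell a failed node from a failed link.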

HACMP

Chapter 1 1.2 – Concepts

Hacmp – 1.2 Concepts Chapter – 1

HACMP concepts

The basic concepts of HACMP can be classified as follows:
• Cluster topology: Contains the basic cluster components: nodes, networks, communication interfaces, communication devices, and communication adapters.
• Cluster resources: Entities that are being made highly available (for example, file systems, raw devices, service IP labels, and applications). Resources are grouped together in resource groups (RGs), which HACMP keeps highly available as a single entity. Resource groups can be available from a single node or, in the case of concurrent applications, available simultaneously from multiple nodes.
• Fallover: Represents the movement of a resource group from one active node to another node (backup node) in response to a failure on that active node.
• Fallback: Represents the movement of a resource group back from the backup node to the previous node, when it becomes available. This movement is typically in response to the reintegration of the previously failed node.
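Fallover and fallback can be sketched as a small Python model. It is a simplified illustration under stated assumptions, not HACMP's actual behaviour or interface: the Cluster class, its method names, and the node and resource group names are hypothetical, each resource group has exactly one home node and one backup node, and fallback always happens as soon as the home node rejoins (in HACMP this depends on the configured policy).

```python
class Cluster:
    """Toy model of resource group placement in a two-role cluster."""

    def __init__(self, nodes):
        self.up = {node: True for node in nodes}
        self.groups = {}  # group name -> {"home", "backup", "current"}

    def add_resource_group(self, name, home, backup):
        # A resource group starts on its home node.
        self.groups[name] = {"home": home, "backup": backup, "current": home}

    def location(self, name):
        # Node currently hosting the group, or None if it is offline.
        return self.groups[name]["current"]

    def node_failed(self, node):
        self.up[node] = False
        for group in self.groups.values():
            if group["current"] == node:
                # Fallover: move the whole group to the backup node if it is up.
                backup = group["backup"]
                group["current"] = backup if self.up[backup] else None

    def node_rejoined(self, node):
        self.up[node] = True
        for group in self.groups.values():
            if group["home"] == node and group["current"] != node:
                # Fallback: return the group to its reintegrated home node.
                group["current"] = node


if __name__ == "__main__":
    cluster = Cluster(["nodeA", "nodeB"])
    cluster.add_resource_group("db_rg", home="nodeA", backup="nodeB")
    cluster.node_failed("nodeA")
    print(cluster.location("db_rg"))   # group now runs on the backup node
    cluster.node_rejoined("nodeA")
    print(cluster.location("db_rg"))   # group is back on its home node
```

The group is moved as one unit, which mirrors the idea that HACMP keeps a resource group highly available as a single entity.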

HACMP terminology

To understand the correct functionality and utilization of HACMP, it is necessary to know some important terms:
• Cluster: Loosely-coupled collection of independent systems (nodes) or LPARs organized into a network for the purpose of sharing resources and communicating with each other. HACMP defines relationships among cooperating systems where peer cluster nodes provide the services offered by a cluster node should that node be unable to do so. These individual nodes are together responsible for maintaining the functionality of one or more applications in case of a failure of any cluster component.
• Node: An IBM Eserver pSeries machine (or LPAR) running AIX and HACMP that is defined as part of a cluster. Each node has a collection of resources (disks, file systems, IP address(es), and applications) that can be transferred to another node in the cluster in case the node fails.
• Resource: Resources are logical components of the cluster configuration that can be moved from one node to another. All the logical resources necessary to provide a highly available application or service are grouped together in a resource group (RG). The components in a resource group move together from one node to another in the event of a node failure. A cluster may have more than one resource group, thus allowing for efficient use of the cluster nodes (thus the “Multi-Processing” in HACMP).

• Takeover: The operation of transferring resources between nodes inside the cluster. If one node fails due to a hardware problem or a crash of AIX, its resources and applications are moved to another node.
• Clients: A client is a system that can access the application running on the cluster nodes over a local area network. Clients run a client application that connects to the server (node) where the application runs.

HACMP/XD (extended distance)

The High Availability Cluster Multi-Processing for AIX (HACMP) base software product addresses part of the continuous operation problem. It addresses recovery from the failure of a computer, an adapter, or a local area network within a computing complex at a single site.

A typical HACMP/XD High Availability Geographic Cluster (HAGEO) is presented in Figure 1-2.

For protecting an application in case of a major disaster (site failure), additional software is needed. HAGEO provides:
• Ability to configure a cluster with geographically separate sites. HAGEO extends HACMP to encompass two geographically distant data centers or sites. This extension prevents an individual site from being a single point of failure within the cluster. The geo-mirroring process supplies each site with an updated copy of essential data. Either site can run key applications, ensuring that mission-critical computing resources remain continuously available at a geographically separate site if a failure or disaster disables one site.
• Automatic failure detection and notification. HAGEO works with HACMP to provide automatic detection of a site or geographic network failure. It initiates the recovery process and notifies the system administrator about all failures it detects and actions it takes in response.
• Automatic fallover. HAGEO includes event scripts to handle recovery from a site or geographic network failure. These scripts are integrated with the standard HACMP event scripts. You can customize the behavior for your configuration by adding pre- or post-event scripts, just as you can for HACMP.
• Fast recovery from a disaster. HAGEO also provides fast recovery of data and applications at the operable site. The geo-mirroring process ensures that the data is already available at the second site when a disaster strikes. Recovery time typically takes minutes, not including the application recovery time.
• Automatic resynchronization of data during site recovery. HAGEO handles the resynchronization of the mirrors on each site as an integral part of the site recovery process. The nodes at the rejoining site are automatically updated with the data received while the site was in failure.
• Reliable data integrity and consistency. HAGEO’s geographic mirroring and geographic messaging components ensure that if a site fails, the surviving site’s data is consistent with the failed site’s data.

HACMP

Chapter 2 2.1 Planning & Design
