© 2015 IBM Corporation
IBM MQ: High Availability
Session AME-2269
Andrew Schofield
Introduction
• Availability is a very large subject
• You can have the best technology in the world, but you have to
manage it correctly
• Technology is not a substitute for good planning and testing!
2
What is Disaster Recovery?
• Disaster recovery is the process, policies and procedures
related to preparing for recovery or continuation of technology
infrastructure critical to an organization after a natural or
human-induced disaster. Disaster recovery is a subset of
business continuity. While business continuity involves planning
for keeping all aspects of a business functioning in the midst of
disruptive events, disaster recovery focuses on the IT or
technology systems that support business functions
3
What is DR?
• Getting applications running after a major (often whole-site) failure or loss
• It is not about High Availability, although the two are often related and share design and implementation choices
• “HA is having 2, DR is having them a long way apart”
• More seriously, HA is about keeping things running, while DR is about recovering when HA has failed.
• Requirements driven by business, and often by regulators
• Data integrity, timescales, geography …
• One major decision point: cost
• How much does DR cost you, even if it’s never used?
• How much are you prepared to lose?
4
High Availability vs Disaster Recovery
• Designs for HA typically involve a single site for each
component of the overall architecture
• Designs for DR typically involve separate sites
• Designs for HA (and CA) typically require no data loss
• Designs for DR typically can have limited data loss
• Designs for HA typically involve high-speed takeover
• Designs for DR typically can permit several hours down-time
5
High Availability
Single Points of Failure
• With no redundancy or fault tolerance, a failure of any component can lead to a loss of availability
• Every component is critical. The system relies on the:
• Power supply, system unit, CPU, memory
• Disk controller, disks, network adapter, network cable
• ...and so on
• Various techniques have been developed to tolerate failures:
• UPS or dual supplies for power loss
• RAID for disk failure
• Fault-tolerant architectures for CPU/memory failure
• ...etc
• Elimination of SPOFs is important to achieve HA
7
MQ HA technologies
• Queue manager clusters
• Queue-sharing groups
• Multi-instance queue managers
• HA clusters
• MQ Appliance HA groups
• Client reconnection
8
Comparison of Technologies
9
                                   Access to existing   Access for
                                   messages             new messages
  No special support               none                 none
  MQ Clusters                      none                 continuous
  HA Clusters, Multi-instance,
  HA Groups                        automatic            automatic
  Queue-sharing groups             continuous           continuous
Queue manager clusters
• Sharing cluster queues on
multiple queue managers
prevents a queue from
being a SPOF
• Cluster workload algorithm
automatically routes traffic
away from failed queue
managers
10
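As an illustration, defining an instance of the same cluster queue on each hosting queue manager is what removes the SPOF. A sketch in MQSC, assuming a hypothetical cluster named MYCLUS; DEFBIND(NOTFIXED) lets the workload algorithm choose a destination per message:
* Repeat on each queue manager that should host the queue
DEFINE QLOCAL(Q1) CLUSTER(MYCLUS) DEFBIND(NOTFIXED)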
Queue-sharing groups
• On z/OS, queue managers can be members of a queue-sharing group
• Shared queues are held in a coupling facility
• All queue managers in the QSG can access the messages
• Benefits:
• Messages remain available even if a queue manager fails
• Pull workload balancing
• Apps can connect to the group
11
[Diagram: three queue managers, each with its own private queues, all accessing shared queues held in the coupling facility]
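As an illustration, a shared queue is defined once with a shared disposition and a coupling facility structure. A sketch with hypothetical names:
* Visible to every queue manager in the queue-sharing group
DEFINE QLOCAL(APP.SHARED.Q) QSGDISP(SHARED) CFSTRUCT(APP1)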
Introduction to Failover and MQ
• Failover is the automatic switching of availability of a service
• For MQ, the “service” is a queue manager
• Traditionally the preserve of an HA cluster, such as HACMP
• Requires:
  • Data accessible on all servers
  • Equivalent or at least compatible servers
    – Common software levels and environment
  • Sufficient capacity to handle workload after failure
    – Workload may be rebalanced after failover, requiring spare capacity
  • Startup processing of queue manager following the failure
• MQ offers three ways of configuring for failover:
  • Multi-instance queue managers
  • HA clusters
  • MQ Appliance HA groups
12
Failover considerations
• Failover times are made up of three parts:
• Time taken to notice the failure
– Heartbeat missed
– Bad result from status query
• Time taken to establish the environment before activating the service
– Switching IP addresses and disks, and so on
• Time taken to activate the service
– This is queue manager restart
• Failover involves a queue manager restart
• Nonpersistent messages, nondurable subscriptions discarded
• For fastest times, ensure that queue manager restart is fast
• No long running transactions, for example
13
Multi-instance Queue Managers
Multi-instance Queue Managers
• Basic failover support without HA cluster
• Two instances of a queue manager on different machines
• One is the “active” instance, other is the “standby” instance
• Active instance “owns” the queue manager’s files
– Accepts connections from applications
• Standby instance monitors the active instance
– Applications cannot connect to the standby instance
– If active instance fails, standby restarts queue manager and
becomes active
• Instances are the SAME queue manager – only one set of data
files
• Queue manager data is held in networked storage
15
Setting up Multi-instance Queue Manager
• Set up shared filesystems for QM data and logs
• Create the queue manager on machine1
  crtmqm -md /shared/qmdata -ld /shared/qmlog QM1
• Define the queue manager on machine2 (or edit mqs.ini)
  addmqinf -s QueueManager -v Name=QM1 -v Directory=QM1 -v Prefix=/var/mqm -v DataPath=/shared/qmdata/QM1
• Start an instance on machine1 – it becomes active
  strmqm -x QM1
• Start another instance on machine2 – it becomes standby
  strmqm -x QM1
• That’s it. If the queue manager instance on machine1 fails, the standby instance on machine2 takes over and becomes active
16
Multi-instance Queue Managers
17
[Diagram: 1. Normal execution – the active instance of QM1 on Machine A (168.0.0.1) owns the queue manager data on networked storage; the standby instance on Machine B (168.0.0.2) can fail over; MQ clients are connected over the network]
Multi-instance Queue Managers
18
[Diagram: 2. Disaster strikes – Machine A fails, client connections are broken, and the active instance’s locks on the networked storage are freed]
Multi-instance Queue Managers
19
[Diagram: 3. Failover – the standby instance on Machine B becomes active and now owns the queue manager data; client connections are still broken]
Multi-instance Queue Managers
20
[Diagram: 4. Recovery complete – QM1 runs on Machine B (168.0.0.2) and MQ client connections reconnect]
Multi-instance Queue Managers
• MQ is NOT becoming an HA cluster
• If other resources need to be coordinated, you need an HA
cluster
• IBM Integration Bus (Message Broker) integrates with multi-instance QM
• Queue manager services can be automatically started, but with
limited control
• System administrator is responsible for restarting another
standby instance when failover has occurred
21
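For example, once the failed machine is repaired, restarting a standby instance there re-arms the protection, using the same command as in the setup above:
strmqm -x QM1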
Dealing with multiple IP addresses
• The IP address of the queue manager changes when it moves
  • So MQ channel configuration needs a way to select the address
• Connection name syntax permits a comma-separated list
  • CONNAME('168.0.0.1,168.0.0.2')
  • Unless you use external IPAT or an intelligent router or MR01
• WAS 8 admin panels understand this syntax
• For earlier levels of WAS
  • Connection Factories:
    – Set a custom property called XMSC_WMQ_CONNECTION_NAME_LIST to the list of host/port names that you wish to connect to
    – Make sure that the existing host and port values defined on the connection factory match the first entry in this property
  • Activation Specs:
    – Set a custom property called connectionNameList on the activation spec with the same format
22
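A sketch of a client-connection channel using the list syntax (channel name and ports here are hypothetical; the client tries the first address, then the second):
DEFINE CHANNEL(QM1.CLIENTS) CHLTYPE(CLNTCONN) +
       CONNAME('168.0.0.1(1414),168.0.0.2(1414)') QMNAME(QM1)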
Administering Multi-instance Queue Managers
• All queue manager administration must be performed on active
instance
• dspmq enhanced to display instance information
• dspmq issued on “staravia”
• On “staravia”, there’s a standby instance
• The active instance is on “starly”
23
$ hostname
staravia
$ dspmq -x
QMNAME(MIQM) STATUS(Running as standby)
INSTANCE(starly) MODE(Active)
INSTANCE(staravia) MODE(Standby)
Multi-instance Queue Manager in MQ Explorer
24
MQ Explorer automatically switches to the active instance
High Availability Clusters
HA clusters
• MQ traditionally made highly available using an HA cluster
• IBM PowerHA for AIX (formerly HACMP), Veritas Cluster
Server, Microsoft Cluster Server, HP Serviceguard, Linux HA …
• HA clusters can:
• Coordinate multiple resources such as application server,
database
• Consist of more than two machines
• Failover more than once without operator intervention
• Take over the IP address as part of failover
• Likely to be more resilient in cases of MQ and OS defects
26
HA clusters
• In HA clusters, queue manager data and logs are placed on a
shared disk
• Disk is switched between machines during failover
• The queue manager has its own “service” IP address
• IP address is switched between machines during failover
• Queue manager’s IP address remains the same after failover
• The queue manager is defined to the HA cluster as a resource
dependent on the shared disk and the IP address
• During failover, the HA cluster will switch the disk, take over the
IP address and then start the queue manager
27
MQ in an HA cluster – Active/active
28
[Diagram: Normal execution – QM1 active on Machine A (168.0.0.1) and QM2 active on Machine B (168.0.0.2); each queue manager’s data and logs are on the shared disk, and both machines are members of the HA cluster]
MQ in an HA cluster – Active/active
29
[Diagram: Failover – Machine A fails; the shared disk is switched, the IP address 168.0.0.1 is taken over by Machine B, and QM1 is restarted there alongside QM2]
Multi-instance QM or HA cluster?
30
• Multi-instance queue manager
  – Integrated into the WebSphere MQ product
  – Faster failover than HA cluster: the delay before queue manager restart is much shorter
  – Runtime performance of networked storage
  – Suitable storage can sometimes be a challenge
• HA cluster
  – Capable of handling a wider range of failures
  – Failover historically rather slow, but some HA clusters are improving
  – Capable of more flexible configurations (e.g. N+1)
  – Requires MC91 SupportPac or equivalent configuration
  – Extra product purchase and skills required
• Storage distinction
  – Multi-instance queue manager typically uses NAS
  – HA clustered queue manager typically uses SAN
Configuring a queue manager for failover
• Create filesystems on the shared disk, for example
• /MQHA/QM1/data for the queue manager data
• /MQHA/QM1/log for the queue manager logs
• On one of the nodes:
• Mount the filesystems
• Create the queue manager
crtmqm -md /MQHA/QM1/data -ld /MQHA/QM1/log QM1
• Print out the configuration information for use on the other nodes
dspmqinf -o command QM1
• On the other nodes:
• Mount the filesystems
• Add the queue manager’s configuration information
addmqinf -s QueueManager -v Name=QM1 -v Prefix=/var/mqm -v DataPath=/MQHA/QM1/data/QM1 -v Directory=QM1
31
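The HA cluster then needs start, stop and monitor actions for the queue manager resource. A minimal sketch of the three actions (the MC91 SupportPac provided fuller scripts; your cluster product’s calling conventions will differ):
# Start action: bring the queue manager up on this node
strmqm QM1

# Stop action: immediate but orderly shutdown during failover
endmqm -i QM1

# Monitor action: exits 0 only while the queue manager reports Running
dspmq -m QM1 | grep -q 'STATUS(Running)'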
Virtual Machine Failover
• Another mechanism being regularly used
• When MQ is in a virtual machine … simply restart the VM
• “Turning it off and back on again”
• Can be faster than any other kind of failover
32
MQ Appliance HA Groups
MQ Appliance HA groups
• MQ Appliance introduces a new appliance-only HA feature
• HA groups of appliances running HA queue managers
• An HA group consists of a pair of MQ appliances
• Controls the availability of HA queue managers in the group
• The appliances are connected together using 3 interfaces
– HA group primary interface
– HA group alternate interface
– Replication interface
• In an HA group, you can create HA queue managers
• Data is replicated synchronously between the appliances
• Shared-nothing, appliance-only HA
34
Setting up a queue manager in an HA group
• On appliance2, prepare to receive a security key
  prepareha -s your_secret -a appliance1.haprimary.ipaddr
• Create the HA group on appliance1
  crthagrp -s your_secret -a appliance2.haprimary.ipaddr
• Create the queue manager on appliance1
  crtmqm -sx QM1
• Once created, the queue manager will start automatically
• That’s it. We have:
• QM1 under control of the HA group
• It prefers to run on appliance1
• If appliance1 fails, it automatically restarts on appliance2
35
MQ in an HA group – Active/active
36
[Diagram: Normal execution – QM1 active on the first appliance (168.0.1.1) and QM2 active on the second (168.0.2.2); each appliance holds its own queue manager’s data and logs plus a disk-replicated copy of the other’s]
MQ in an HA group – Active/active
37
[Diagram: Failover – the first appliance fails; replication is temporarily stopped, QM1’s IP address changes (168.0.1.1 to 168.0.1.2), and QM1 is restarted on the surviving appliance alongside QM2]
MQ in an HA group – Active/active
38
[Diagram: Recovery begins – the failed appliance returns and replication resynchronizes the appliances; the queue manager is restarted and its IP address changes as it moves back]
MQ in an HA group – Active/active
39
[Diagram: Recovery complete – QM1 is active again on the first appliance (168.0.1.1) and QM2 on the second (168.0.2.2); the direction of replication switches back]
Applications and Client Reconnection
High availability application connectivity
• If an application loses connection to a queue manager, what
does it do?
• End abnormally
• Handle the failure and retry the connection
• Reconnect automatically thanks to application container
– WebSphere Application Server contains logic to reconnect JMS
clients
• Use MQ automatic client reconnection
41
Automatic client reconnection
• MQ client automatically reconnects when connection broken
• MQI C clients and standalone JMS clients
• JMS in app servers (EJB, MDB) does not need auto-reconnect
• Reconnection includes reopening queues, remaking subscriptions
• All MQI handles keep their original values
• Can reconnect to same queue manager or another, equivalent queue manager
• MQI or JMS calls block until connection is remade
• By default, will wait for up to 30 minutes
• Long enough for a queue manager failover (even a really slow one)
42
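Applications can request reconnection themselves with the MQCNO_RECONNECT option on MQCONNX. Where the application cannot be changed, it can be switched on in the client configuration file instead. A sketch of an mqclient.ini fragment (DefRecon=YES allows reconnection to any eligible queue manager; QMGR restricts it to the same one):
CHANNELS:
   DefRecon=YES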
Automatic client reconnection
• Can register event handler to observe reconnection
• Not all MQI is seamless, but majority repaired transparently
• Browse cursors revert to the top of the queue
• Nonpersistent messages are discarded during restart
• Nondurable subscriptions are remade and may miss some
messages
• In-flight transactions backed out
• Tries to keep dynamic queues with same name
• If queue manager doesn’t restart, reconnecting client’s TDQs
are kept for a while in case it reconnects
• If queue manager does restart, TDQs are recreated when it
reconnects
43
Client Configurations for Availability
1. Use wildcarded queue manager names in CCDT
   • Gets weighted distribution of connections
   • Selects a “random” queue manager from an equivalent set
2. Use multiple addresses in a CONNAME
   • Could potentially point at different queue managers
   • More likely pointing at the same queue manager in a multi-instance setup
3. Use automatic reconnection
4. Pre-connect exit
5. Use IP routers to select address from a list
   • Based on workload or anything else known to the router
• Can use all of these in combination!
44
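For option 1, the CCDT entries might look like this sketch (all names hypothetical): each queue manager in the equivalent set defines a client channel with the same QMNAME, a client weight and no affinity, and the application connects specifying the group name prefixed by an asterisk (for example '*GROUP1') so any member can be chosen:
DEFINE CHANNEL(GROUP1.HOSTA) CHLTYPE(CLNTCONN) +
       CONNAME('hosta(1414)') QMNAME(GROUP1) +
       CLNTWGHT(50) AFFINITY(NONE)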
Application Patterns for Availability
• Article describing how to build a hub topology supporting:
• Continuous availability to send messages, with no single point of failure
• Linear horizontal scale of throughput, for both MQ and the applications
• Exactly once delivery, with high availability of persistent messages
• Three messaging styles: Request/response, fire-and-forget, and pub/sub
• http://www.ibm.com/developerworks/websphere/library/techarticles/1303_broadhurst/1303_broadhurst.html
45
Disaster Recovery – Local Recovery
“MQ has failed”
• Don’t restart the queue manager until you know why it failed
  • You can probably do one restart attempt safely, but don’t continually retry
  • When running under an HA coordinator, have retry counts
• At least ensure you take a backup of the queue manager data, log files, and any error logs/FDCs/dumps
  • So it can be investigated later
  • Might be possible for IBM to recover messages
  • Consider taking these copies as part of an HA failover procedure
• While trying restart/recovery procedures, consider a PMR
  • Often see “cold start” as the first reaction at some customers
  • If you have a support contract, open a PMR before trying a cold start – IBM may have an alternative
    – IBM may ask you to start collecting documentation
  • Do not make everything a Sev1
47
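A sketch of capturing that evidence before any restart attempt (paths assume a default Unix installation and a queue manager named QM1):
tar -czf /tmp/QM1.diagnostics.tar.gz \
    /var/mqm/qmgrs/QM1 \
    /var/mqm/log/QM1 \
    /var/mqm/errors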
First Manual Restart
• Restart your queue manager
• Only clean the IPC (shared memory/semaphores) if IBM
requests it
– This should never be necessary
– Remove calls to ipcrm or amqiclen from any startup/failover scripts
• Start as simply as possible
  – strmqm -ns QM1
• Monitor the restart, look for FDCs
• If OK, then end the qmgr and restart normally
• What if the restart fails?
• Option to escalate to cold start
• Further escalation to rebuilding queue manager
48
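Putting that together, a cautious first restart might look like this (a sketch for a queue manager named QM1):
strmqm -ns QM1             # start without command server, channel initiator or listeners
ls /var/mqm/errors/*.FDC   # check whether the restart produced new FDC files
endmqm -w QM1              # if all looks clean, end it...
strmqm QM1                 # ...and restart normally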
Cold Starting MQ
• Typical reason: hardware (most likely disk) failure, or logs deleted by a mistaken administrator
• Symptoms: cannot start the queue manager because the logs are unavailable or files are corrupt
• “Cold start” is a technique to restart without needing logs
• What does a cold start cost you?
• In-flight transactions will not be automatically rolled-back
• In-doubt transactions will be forgotten
• Ability to recover messages from the logs
• Possible loss of messages
• Possible duplication of already-processed messages
49
Cold Starts on Distributed Platforms
• Basic idea is to replace your ‘bad’ logs with ‘good’ logs
• By creating a dummy queue manager and using its logs instead
• Logs do not contain queue manager name
• Other considerations
• Is this queue manager part of an MQ cluster?
– This shouldn’t matter, but may want to resynchronise repositories
– In case any updates were in-flight when system failed
• Is this queue manager under the control of an HA cluster?
– Failover will not help if the shared disks/files are corrupt
– Disable failover in the HA system until recovery complete
50
Cold Start Procedure (1)
• Create a queue manager with the same log parameters as the
failed one
• Use qm.ini to find the values required
Log:
   LogPrimaryFiles=10
   LogSecondaryFiles=10
   LogFilePages=16384
   LogType=CIRCULAR
• Issue the crtmqm command with the same parameters
crtmqm -lc -lp 10 -ls 10 -lf 16384 -ld /tmp/mqlogs TEMP.QM1
• This queue manager is just used to generate the log files
51
Cold Start Procedure (2)
• Do not start this temporary queue manager, just create it
• Replace the failed queue manager’s log extents and control file
with the new ones from the temporary queue manager
cd /var/mqm/log
mv QM1 QM1.SAVE
mv /tmp/mqlogs/TEMP!QM1 QM1
• Persistent message data in the queues is preserved
• Object definitions are also preserved
52
Disaster Recovery – Remote Recovery
Topologies
• Lots of different topologies in practice
• 2 data centers geographically separated
• Both sites in active use, providing DR back for each other
• 2 data centers close to each other, 3rd geographically separated
• Nearby replication is synchronous => no data loss
• Distant replication is asynchronous => some data loss
• And a lot of other variations
54
Disk replication
• Disk replication can be used for MQ disaster recovery
• Either synchronous or asynchronous is OK
• Synchronous
– No data loss if disaster occurs
– Performance of workload impacted by replication latency
– Limited by distance to approximately 100km
• Asynchronous
– Some limited data loss if disaster occurs
– If a queue manager is recovered from an async replica, additional actions may be required
– QM data and logs MUST be in the same consistency group
• Disk replication is used for MQ Appliance HA Groups
• Cannot be used between the instances of a multi-instance queue manager (MIQM)
55
Combining HA and DR
56
[Diagram: QM1’s active instance (Machine A) and standby instance (Machine B) form an HA pair on shared storage at the primary site; replication maintains a backup instance on Machine C at the backup site]
Combining HA and DR – Active/Active
57
[Diagram: Site 1 runs QM1’s HA pair (active on Machine A, standby on Machine B) with a backup instance on Machine C at Site 2; Site 2 runs QM2’s HA pair (active on Machine C, standby on Machine D) with a backup instance on Machine A at Site 1 – each site provides DR for the other]
Planning and Testing
Planning for Recovery
• Write a DR plan
• Document everything – to tedious levels of detail
• Include actual commands, not just a description of the operation
– Not “Stop MQ”, but “as mqm, run /usr/local/bin/stopmq.sh US.PROD.01”
• And test it frequently
• Recommend twice a year
• Record time taken for each task
• Remember that the person executing the plan in a real emergency might be under-skilled and over-pressured
• Plan for no access to phones, email, online docs …
• Each test is likely to show something you’ve forgotten
• Update the plan to match
• You’re likely to have new applications, hardware, software …
59
Networking Considerations
• DNS
• You’ll probably redirect hostnames to the DR site
• Will you keep the same IP addresses?
• Would a change in IP address be significant?
– E.g. Third-party firewalls that do not recognise DR servers
• LOCLADDR
• Not often used, but firewalls, IPT, channel exits may inspect it
• QM clusters
• Don’t accidentally join your DR servers into the production cluster
  – Can be managed with CHLAUTH rules, by not starting the channel initiator, or with certificates
• Backup will be out of sync with the cluster
  – REFRESH CLUSTER can be used to resolve updates
60
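For example, a CHLAUTH rule on the production queue managers can block the DR servers’ addresses so a restored backup cannot accidentally rejoin the live cluster (channel profile and subnet here are hypothetical):
SET CHLAUTH('MYCLUS.*') TYPE(ADDRESSMAP) +
    ADDRESS('10.99.*') USERSRC(NOACCESS) +
    DESCR('Keep DR servers out of the production cluster')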
Summary
Summary
• MQ is a mature product with a wide variety of HA and DR
options
• Plan what you need to recover MQ
• Test your plan
62
Notices and Disclaimers
Copyright © 2015 by International Business Machines Corporation (IBM). No part of this document may be reproduced or
transmitted in any form without written permission from IBM.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with
IBM.
Information in these presentations (including information relating to products that have not yet been announced by IBM) has been
reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM
shall have no responsibility to update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY,
EITHER EXPRESS OR IMPLIED. IN NO EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF
THIS INFORMATION, INCLUDING BUT NOT LIMITED TO, LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT
OR LOSS OF OPPORTUNITY. IBM products and services are warranted according to the terms and conditions of the
agreements under which they are provided.
Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without
notice.
Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are
presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual
performance, cost, savings or other results in other operating environments may vary.
References in this document to IBM products, programs, or services do not imply that IBM intends to make such products,
programs or services available in all countries in which IBM operates or does business.
Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not
necessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither
intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation.
It is the customer’s responsibility to ensure its own compliance with legal requirements and to obtain advice of competent legal
counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s
business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or
represent or warrant that its services or products will ensure that the customer is in compliance with any law.
Notices and Disclaimers (con’t)
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products in connection with this
publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM
products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to
interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED,
INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any
IBM patents, copyrights, trademarks or other intellectual property right.
• IBM, the IBM logo, ibm.com, Bluemix, Blueworks Live, CICS, Clearcase, DOORS®, Enterprise Document
Management System™, Global Business Services ®, Global Technology Services ®, Information on Demand,
ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™,
PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®,
pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, SoDA, SPSS, StoredIQ, Tivoli®, Trusteer®,
urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of
International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and
service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on
the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.
Thank You
Your Feedback is Important!
Access the InterConnect 2015 Conference CONNECT Attendee Portal to complete your session surveys from your smartphone, laptop or conference kiosk.