© 2015 IBM Corporation
IBM MQ: High Availability
Session AME-2269
Andrew Schofield
Introduction
• Availability is a very large subject
• You can have the best technology in the world, but you have to
manage it correctly
• Technology is not a substitute for good planning and testing!
2
What is Disaster Recovery?
• Disaster recovery is the process, policies and procedures
related to preparing for recovery or continuation of technology
infrastructure critical to an organization after a natural or
human-induced disaster. Disaster recovery is a subset of
business continuity. While business continuity involves planning
for keeping all aspects of a business functioning in the midst of
disruptive events, disaster recovery focuses on the IT or
technology systems that support business functions
3
What is DR?
• Getting applications running after a major (often whole-site) failure or loss
• It is not about High Availability, although the two are often related and share design and implementation choices
• “HA is having 2, DR is having them a long way apart”
• More seriously, HA is about keeping things running, while DR is about recovering when HA has failed.
• Requirements driven by business, and often by regulators
• Data integrity, timescales, geography …
• One major decision point: cost
• How much does DR cost you, even if it’s never used?
• How much are you prepared to lose?
4
High Availability vs Disaster Recovery
• Designs for HA typically involve a single site for each
component of the overall architecture
• Designs for DR typically involve separate sites
• Designs for HA (and CA) typically require no data loss
• Designs for DR typically can have limited data loss
• Designs for HA typically involve high-speed takeover
• Designs for DR typically can permit several hours down-time
5
High Availability
Single Points of Failure
• With no redundancy or fault tolerance, a failure of any component can lead to a loss of availability
• Every component is critical. The system relies on the:
• Power supply, system unit, CPU, memory
• Disk controller, disks, network adapter, network cable
• ...and so on
• Various techniques have been developed to tolerate failures:
• UPS or dual supplies for power loss
• RAID for disk failure
• Fault-tolerant architectures for CPU/memory failure
• ...etc
• Elimination of SPOFs is important to achieve HA
7
MQ HA technologies
• Queue manager clusters
• Queue-sharing groups
• Multi-instance queue managers
• HA clusters
• MQ Appliance HA groups
• Client reconnection
8
Comparison of Technologies
9
                                   Access to existing   Access for
                                   messages             new messages
  No special support               none                 none
  MQ Clusters                      none                 continuous
  HA Clusters, Multi-instance,
  HA Groups                        automatic            automatic
  Queue-sharing groups             continuous           continuous
Queue manager clusters
• Sharing cluster queues on
multiple queue managers
prevents a queue from
being a SPOF
• Cluster workload algorithm
automatically routes traffic
away from failed queue
managers
10
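As an illustration, defining an instance of the same cluster queue on each hosting queue manager is what removes the SPOF. A sketch in MQSC, assuming a hypothetical cluster named MYCLUS; DEFBIND(NOTFIXED) lets the workload algorithm choose a destination per message:
* Repeat on each queue manager that should host the queue
DEFINE QLOCAL(Q1) CLUSTER(MYCLUS) DEFBIND(NOTFIXED)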
Queue-sharing groups
• On z/OS, queue managers can be members of a queue-sharing group
• Shared queues are held in a coupling facility
• All queue managers in the QSG can access the messages
• Benefits:
• Messages remain available even if a queue manager fails
• Pull workload balancing
• Apps can connect to the group
11
[Diagram: three queue managers, each with its own private queues, all accessing shared queues held in the coupling facility]
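As an illustration, a shared queue is defined once with a shared disposition and a coupling facility structure. A sketch with hypothetical names:
* Visible to every queue manager in the queue-sharing group
DEFINE QLOCAL(APP.SHARED.Q) QSGDISP(SHARED) CFSTRUCT(APP1)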
Introduction to Failover and MQ
• Failover is the automatic switching of availability of a service
• For MQ, the “service” is a queue manager
• Traditionally the preserve of an HA cluster, such as HACMP
• Requires:
  • Data accessible on all servers
  • Equivalent or at least compatible servers
    – Common software levels and environment
  • Sufficient capacity to handle workload after failure
    – Workload may be rebalanced after failover, requiring spare capacity
  • Startup processing of queue manager following the failure
• MQ offers three ways of configuring for failover:
  • Multi-instance queue managers
  • HA clusters
  • MQ Appliance HA groups
12
Failover considerations
• Failover times are made up of three parts:
• Time taken to notice the failure
– Heartbeat missed
– Bad result from status query
• Time taken to establish the environment before activating the service
– Switching IP addresses and disks, and so on
• Time taken to activate the service
– This is queue manager restart
• Failover involves a queue manager restart
• Nonpersistent messages, nondurable subscriptions discarded
• For fastest times, ensure that queue manager restart is fast
• No long running transactions, for example
13
Multi-instance Queue Managers
Multi-instance Queue Managers
• Basic failover support without HA cluster
• Two instances of a queue manager on different machines
• One is the “active” instance, other is the “standby” instance
• Active instance “owns” the queue manager’s files
– Accepts connections from applications
• Standby instance monitors the active instance
– Applications cannot connect to the standby instance
– If active instance fails, standby restarts queue manager and
becomes active
• Instances are the SAME queue manager – only one set of data
files
• Queue manager data is held in networked storage
15
Setting up Multi-instance Queue Manager
• Set up shared filesystems for QM data and logs
• Create the queue manager on machine1
  crtmqm -md /shared/qmdata -ld /shared/qmlog QM1
• Define the queue manager on machine2 (or edit mqs.ini)
  addmqinf -s QueueManager -v Name=QM1 -v Directory=QM1 -v Prefix=/var/mqm -v DataPath=/shared/qmdata/QM1
• Start an instance on machine1 – it becomes active
  strmqm -x QM1
• Start another instance on machine2 – it becomes standby
  strmqm -x QM1
• That’s it. If the queue manager instance on machine1 fails, the standby instance on machine2 takes over and becomes active
16
Multi-instance Queue Managers
17
[Diagram: 1. Normal execution – the active instance of QM1 on Machine A (168.0.0.1) owns the queue manager data on networked storage; the standby instance on Machine B (168.0.0.2) can fail over; MQ clients are connected over the network]
Multi-instance Queue Managers
18
[Diagram: 2. Disaster strikes – Machine A fails, client connections are broken, and the active instance’s locks on the networked storage are freed]
Multi-instance Queue Managers
19
[Diagram: 3. Failover – the standby instance on Machine B becomes active and now owns the queue manager data; client connections are still broken]
Multi-instance Queue Managers
20
[Diagram: 4. Recovery complete – QM1 runs on Machine B (168.0.0.2) and MQ client connections reconnect]
Multi-instance Queue Managers
• MQ is NOT becoming an HA cluster
• If other resources need to be coordinated, you need an HA
cluster
• IBM Integration Bus (Message Broker) integrates with multi-instance QM
• Queue manager services can be automatically started, but with
limited control
• System administrator is responsible for restarting another
standby instance when failover has occurred
21
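For example, once the failed machine is repaired, restarting a standby instance there re-arms the protection, using the same command as in the setup above:
strmqm -x QM1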
Dealing with multiple IP addresses
• The IP address of the queue manager changes when it moves
  • So MQ channel configuration needs a way to select the address
• Connection name syntax permits a comma-separated list
  • CONNAME('168.0.0.1,168.0.0.2')
  • Unless you use external IPAT or an intelligent router or MR01
• WAS 8 admin panels understand this syntax
• For earlier levels of WAS
  • Connection Factories:
    – Set a custom property called XMSC_WMQ_CONNECTION_NAME_LIST to the list of host/port names that you wish to connect to
    – Make sure that the existing host and port values defined on the connection factory match the first entry in this property
  • Activation Specs:
    – Set a custom property called connectionNameList on the activation spec with the same format
22
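A sketch of a client-connection channel using the list syntax (channel name and ports here are hypothetical; the client tries the first address, then the second):
DEFINE CHANNEL(QM1.CLIENTS) CHLTYPE(CLNTCONN) +
       CONNAME('168.0.0.1(1414),168.0.0.2(1414)') QMNAME(QM1)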
Administering Multi-instance Queue Managers
• All queue manager administration must be performed on active
instance
• dspmq enhanced to display instance information
• dspmq issued on “staravia”
• On “staravia”, there’s a standby instance
• The active instance is on “starly”
23
$ hostname
staravia
$ dspmq -x
QMNAME(MIQM) STATUS(Running as standby)
INSTANCE(starly) MODE(Active)
INSTANCE(staravia) MODE(Standby)
Multi-instance Queue Manager in MQ Explorer
24
MQ Explorer automatically switches to the active instance
High Availability Clusters
HA clusters
• MQ traditionally made highly available using an HA cluster
• IBM PowerHA for AIX (formerly HACMP), Veritas Cluster
Server, Microsoft Cluster Server, HP Serviceguard, Linux HA …
• HA clusters can:
• Coordinate multiple resources such as application server,
database
• Consist of more than two machines
• Failover more than once without operator intervention
• Take over the IP address as part of failover
• Likely to be more resilient in cases of MQ and OS defects
26
HA clusters
• In HA clusters, queue manager data and logs are placed on a
shared disk
• Disk is switched between machines during failover
• The queue manager has its own “service” IP address
• IP address is switched between machines during failover
• Queue manager’s IP address remains the same after failover
• The queue manager is defined to the HA cluster as a resource
dependent on the shared disk and the IP address
• During failover, the HA cluster will switch the disk, take over the
IP address and then start the queue manager
27
MQ in an HA cluster – Active/active
28
[Diagram: Normal execution – QM1 active on Machine A (168.0.0.1) and QM2 active on Machine B (168.0.0.2); each queue manager’s data and logs are on the shared disk, and both machines are members of the HA cluster]
MQ in an HA cluster – Active/active
29
[Diagram: Failover – Machine A fails; the shared disk is switched, the IP address 168.0.0.1 is taken over by Machine B, and QM1 is restarted there alongside QM2]
Multi-instance QM or HA cluster?
30
• Multi-instance queue manager
  – Integrated into the WebSphere MQ product
  – Faster failover than HA cluster: the delay before queue manager restart is much shorter
  – Runtime performance of networked storage
  – Suitable storage can sometimes be a challenge
• HA cluster
  – Capable of handling a wider range of failures
  – Failover historically rather slow, but some HA clusters are improving
  – Capable of more flexible configurations (e.g. N+1)
  – Requires MC91 SupportPac or equivalent configuration
  – Extra product purchase and skills required
• Storage distinction
  – Multi-instance queue manager typically uses NAS
  – HA clustered queue manager typically uses SAN
Configuring a queue manager for failover
• Create filesystems on the shared disk, for example
• /MQHA/QM1/data for the queue manager data
• /MQHA/QM1/log for the queue manager logs
• On one of the nodes:
• Mount the filesystems
• Create the queue manager
crtmqm -md /MQHA/QM1/data -ld /MQHA/QM1/log QM1
• Print out the configuration information for use on the other nodes
dspmqinf -o command QM1
• On the other nodes:
• Mount the filesystems
• Add the queue manager’s configuration information
addmqinf -s QueueManager -v Name=QM1 -v Prefix=/var/mqm -v DataPath=/MQHA/QM1/data/QM1 -v Directory=QM1
31
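The HA cluster then needs start, stop and monitor actions for the queue manager resource. A minimal sketch of the three actions (the MC91 SupportPac provided fuller scripts; your cluster product’s calling conventions will differ):
# Start action: bring the queue manager up on this node
strmqm QM1

# Stop action: immediate but orderly shutdown during failover
endmqm -i QM1

# Monitor action: exits 0 only while the queue manager reports Running
dspmq -m QM1 | grep -q 'STATUS(Running)'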
Virtual Machine Failover
• Another mechanism being regularly used
• When MQ is in a virtual machine … simply restart the VM
• “Turning it off and back on again”
• Can be faster than any other kind of failover
32
MQ Appliance HA Groups
MQ Appliance HA groups
• MQ Appliance introduces a new appliance-only HA feature
• HA groups of appliances running HA queue managers
• An HA group consists of a pair of MQ appliances
• Controls the availability of HA queue managers in the group
• The appliances are connected together using 3 interfaces
– HA group primary interface
– HA group alternate interface
– Replication interface
• In an HA group, you can create HA queue managers
• Data is replicated synchronously between the appliances
• Shared-nothing, appliance-only HA
34
Setting up a queue manager in an HA group
• On appliance2, prepare to receive a security key
  prepareha -s your_secret -a appliance1.haprimary.ipaddr
• Create the HA group on appliance1
  crthagrp -s your_secret -a appliance2.haprimary.ipaddr
• Create the queue manager on appliance1
  crtmqm -sx QM1
• Once created, the queue manager will start automatically
• That’s it. We have:
• QM1 under control of the HA group
• It prefers to run on appliance1
• If appliance1 fails, it automatically restarts on appliance2
35
MQ in an HA group – Active/active
36
[Diagram: Normal execution – QM1 active on the first appliance (168.0.1.1) and QM2 active on the second (168.0.2.2); each appliance holds its own queue manager’s data and logs plus a disk-replicated copy of the other’s]
MQ in an HA group – Active/active
37
[Diagram: Failover – the first appliance fails; replication is temporarily stopped, QM1’s IP address changes (168.0.1.1 to 168.0.1.2), and QM1 is restarted on the surviving appliance alongside QM2]
MQ in an HA group – Active/active
38
[Diagram: Recovery begins – the failed appliance returns and replication resynchronizes the appliances; the queue manager is restarted and its IP address changes as it moves back]
MQ in an HA group – Active/active
39
[Diagram: Recovery complete – QM1 is active again on the first appliance (168.0.1.1) and QM2 on the second (168.0.2.2); the direction of replication switches back]
Applications and Client Reconnection
High availability application connectivity
• If an application loses connection to a queue manager, what
does it do?
• End abnormally
• Handle the failure and retry the connection
• Reconnect automatically thanks to application container
– WebSphere Application Server contains logic to reconnect JMS
clients
• Use MQ automatic client reconnection
41
Automatic client reconnection
• MQ client automatically reconnects when connection broken
• MQI C clients and standalone JMS clients
• JMS in app servers (EJB, MDB) does not need auto-reconnect
• Reconnection includes reopening queues, remaking subscriptions
• All MQI handles keep their original values
• Can reconnect to same queue manager or another, equivalent queue manager
• MQI or JMS calls block until connection is remade
• By default, will wait for up to 30 minutes
• Long enough for a queue manager failover (even a really slow one)
42
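Applications can request reconnection themselves with the MQCNO_RECONNECT option on MQCONNX. Where the application cannot be changed, it can be switched on in the client configuration file instead. A sketch of an mqclient.ini fragment (DefRecon=YES allows reconnection to any eligible queue manager; QMGR restricts it to the same one):
CHANNELS:
   DefRecon=YES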
Automatic client reconnection
• Can register event handler to observe reconnection
• Not all MQI is seamless, but majority repaired transparently
• Browse cursors revert to the top of the queue
• Nonpersistent messages are discarded during restart
• Nondurable subscriptions are remade and may miss some
messages
• In-flight transactions backed out
• Tries to keep dynamic queues with same name
• If queue manager doesn’t restart, reconnecting client’s TDQs
are kept for a while in case it reconnects
• If queue manager does restart, TDQs are recreated when it
reconnects
43
Client Configurations for Availability
1. Use wildcarded queue manager names in CCDT
   • Gets weighted distribution of connections
   • Selects a “random” queue manager from an equivalent set
2. Use multiple addresses in a CONNAME
   • Could potentially point at different queue managers
   • More likely pointing at the same queue manager in a multi-instance setup
3. Use automatic reconnection
4. Pre-connect exit
5. Use IP routers to select address from a list
   • Based on workload or anything else known to the router
• Can use all of these in combination!
44
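For option 1, the CCDT entries might look like this sketch (all names hypothetical): each queue manager in the equivalent set defines a client channel with the same QMNAME, a client weight and no affinity, and the application connects specifying the group name prefixed by an asterisk (for example '*GROUP1') so any member can be chosen:
DEFINE CHANNEL(GROUP1.HOSTA) CHLTYPE(CLNTCONN) +
       CONNAME('hosta(1414)') QMNAME(GROUP1) +
       CLNTWGHT(50) AFFINITY(NONE)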
Application Patterns for Availability
• Article describing how to build a hub topology supporting:
• Continuous availability to send messages, with no single point of failure
• Linear horizontal scale of throughput, for both MQ and the applications
• Exactly once delivery, with high availability of persistent messages
• Three messaging styles: Request/response, fire-and-forget, and pub/sub
• http://www.ibm.com/developerworks/websphere/library/techarticles/1303_broadhurst/1303_broadhurst.html
45
Disaster Recovery – Local Recovery
“MQ has failed”
• Don’t restart the queue manager until you know why it failed
  • You can probably do one restart attempt safely, but don’t continually retry
  • When running under an HA coordinator, have retry counts
• At least ensure you take a backup of the queue manager data, log files, and any error logs/FDCs/dumps
  • So it can be investigated later
  • Might be possible for IBM to recover messages
  • Consider taking these copies as part of an HA failover procedure
• While trying restart/recovery procedures, consider a PMR
  • Often see “cold start” as the first reaction at some customers
  • If you have a support contract, open a PMR before trying a cold start – IBM may have an alternative
    – IBM may ask you to start collecting documentation
  • Do not make everything a Sev1
47
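A sketch of capturing that evidence before any restart attempt (paths assume a default Unix installation and a queue manager named QM1):
tar -czf /tmp/QM1.diagnostics.tar.gz \
    /var/mqm/qmgrs/QM1 \
    /var/mqm/log/QM1 \
    /var/mqm/errors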
First Manual Restart
• Restart your queue manager
• Only clean the IPC (shared memory/semaphores) if IBM
requests it
– This should never be necessary
– Remove calls to ipcrm or amqiclen from any startup/failover scripts
• Start as simply as possible
  – strmqm -ns QM1
• Monitor the restart, look for FDCs
• If OK, then end the qmgr and restart normally
• What if the restart fails?
• Option to escalate to cold start
• Further escalation to rebuilding queue manager
48
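Putting that together, a cautious first restart might look like this (a sketch for a queue manager named QM1):
strmqm -ns QM1             # start without command server, channel initiator or listeners
ls /var/mqm/errors/*.FDC   # check whether the restart produced new FDC files
endmqm -w QM1              # if all looks clean, end it...
strmqm QM1                 # ...and restart normally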
Cold Starting MQ
• Typical reason: hardware (most likely disk) failure, or logs deleted by a mistaken administrator
• Symptoms: cannot start the queue manager because the logs are unavailable or files are corrupt
• “Cold start” is a technique to restart without needing logs
• What does a cold start cost you?
• In-flight transactions will not be automatically rolled-back
• In-doubt transactions will be forgotten
• Ability to recover messages from the logs
• Possible loss of messages
• Possible duplication of already-processed messages
49
Cold Starts on Distributed Platforms
• Basic idea is to replace your ‘bad’ logs with ‘good’ logs
• By creating a dummy queue manager and using its logs instead
• Logs do not contain queue manager name
• Other considerations
• Is this queue manager part of an MQ cluster?
– This shouldn’t matter, but may want to resynchronise repositories
– In case any updates were in-flight when system failed
• Is this queue manager under the control of an HA cluster?
– Failover will not help if the shared disks/files are corrupt
– Disable failover in the HA system until recovery complete
50
Cold Start Procedure (1)
• Create a queue manager with the same log parameters as the
failed one
• Use qm.ini to find the values required
Log:
   LogPrimaryFiles=10
   LogSecondaryFiles=10
   LogFilePages=16384
   LogType=CIRCULAR
• Issue the crtmqm command with the same parameters
crtmqm -lc -lp 10 -ls 10 -lf 16384 -ld /tmp/mqlogs TEMP.QM1
• This queue manager is just used to generate the log files
51
Cold Start Procedure (2)
• Do not start this temporary queue manager, just create it
• Replace the failed queue manager’s log extents and control file
with the new ones from the temporary queue manager
cd /var/mqm/log
mv QM1 QM1.SAVE
mv /tmp/mqlogs/TEMP!QM1 QM1
• Persistent message data in the queues is preserved
• Object definitions are also preserved
52
Disaster Recovery – Remote Recovery
Topologies
• Lots of different topologies in practice
• 2 data centers geographically separated
• Both sites in active use, providing DR back for each other
• 2 data centers close to each other, 3rd geographically separated
• Nearby replication is synchronous => no data loss
• Distant replication is asynchronous => some data loss
• And a lot of other variations
54
Disk replication
• Disk replication can be used for MQ disaster recovery
• Either synchronous or asynchronous is OK
• Synchronous
– No data loss if disaster occurs
– Performance of workload impacted by replication latency
– Limited by distance to approximately 100km
• Asynchronous
– Some limited data loss if disaster occurs
– If a queue manager is recovered from an async replica, additional actions may be required
– QM data and logs MUST be in the same consistency group
• Disk replication is used for MQ Appliance HA Groups
• Cannot be used between the instances of a multi-instance queue manager (MIQM)
55
Combining HA and DR
56
[Diagram: QM1’s active instance (Machine A) and standby instance (Machine B) form an HA pair on shared storage at the primary site; replication maintains a backup instance on Machine C at the backup site]
Combining HA and DR – Active/Active
57
[Diagram: Site 1 runs QM1’s HA pair (active on Machine A, standby on Machine B) with a backup instance on Machine C at Site 2; Site 2 runs QM2’s HA pair (active on Machine C, standby on Machine D) with a backup instance on Machine A at Site 1 – each site provides DR for the other]
Planning and Testing
Planning for Recovery
• Write a DR plan
• Document everything – to tedious levels of detail
• Include actual commands, not just a description of the operation
– Not “Stop MQ”, but “as mqm, run /usr/local/bin/stopmq.sh US.PROD.01”
• And test it frequently
• Recommend twice a year
• Record time taken for each task
• Remember that the person executing the plan in a real emergency might be under-skilled and over-pressured
• Plan for no access to phones, email, online docs …
• Each test is likely to show something you’ve forgotten
• Update the plan to match
• You’re likely to have new applications, hardware, software …
59
Networking Considerations
• DNS
• You’ll probably redirect hostnames to the DR site
• Will you keep the same IP addresses?
• Would a change in IP address be significant?
– E.g. Third-party firewalls that do not recognise DR servers
• LOCLADDR
• Not often used, but firewalls, IPT, channel exits may inspect it
• QM clusters
• Don’t accidentally join your DR servers into the production cluster
  – Can be managed with CHLAUTH rules, by not starting the channel initiator, or with certificates
• Backup will be out of sync with the cluster
  – REFRESH CLUSTER can be used to resolve updates
60
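For example, a CHLAUTH rule on the production queue managers can block the DR servers’ addresses so a restored backup cannot accidentally rejoin the live cluster (channel profile and subnet here are hypothetical):
SET CHLAUTH('MYCLUS.*') TYPE(ADDRESSMAP) +
    ADDRESS('10.99.*') USERSRC(NOACCESS) +
    DESCR('Keep DR servers out of the production cluster')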
Summary
Summary
• MQ is a mature product with a wide variety of HA and DR
options
• Plan what you need to recover MQ
• Test your plan
62
Notices and Disclaimers
Copyright © 2015 by International Business Machines Corporation (IBM). No part of this document may be reproduced or
transmitted in any form without written permission from IBM.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with
IBM.
Information in these presentations (including information relating to products that have not yet been announced by IBM) has been
reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM
shall have no responsibility to update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY,
EITHER EXPRESS OR IMPLIED. IN NO EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF
THIS INFORMATION, INCLUDING BUT NOT LIMITED TO, LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT
OR LOSS OF OPPORTUNITY. IBM products and services are warranted according to the terms and conditions of the
agreements under which they are provided.
Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without
notice.
Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are
presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual
performance, cost, savings or other results in other operating environments may vary.
References in this document to IBM products, programs, or services do not imply that IBM intends to make such products,
programs or services available in all countries in which IBM operates or does business.
Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not
necessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither
intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation.
It is the customer’s responsibility to ensure its own compliance with legal requirements and to obtain advice of competent legal
counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s
business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or
represent or warrant that its services or products will ensure that the customer is in compliance with any law.
Notices and Disclaimers (con’t)
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products in connection with this
publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM
products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to
interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED,
INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any
IBM patents, copyrights, trademarks or other intellectual property right.
• IBM, the IBM logo, ibm.com, Bluemix, Blueworks Live, CICS, Clearcase, DOORS®, Enterprise Document
Management System™, Global Business Services ®, Global Technology Services ®, Information on Demand,
ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™,
PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®,
pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, SoDA, SPSS, StoredIQ, Tivoli®, Trusteer®,
urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of
International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and
service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on
the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.
Thank You
Your Feedback is Important!
Access the InterConnect 2015 Conference CONNECT Attendee Portal to complete your session surveys from your smartphone, laptop or conference kiosk.