HA & DR Strategy
Giles Gamon of High-Availability.Com
Practical Approaches
July 2007
Business Continuity
A system of planning for, recovering and maintaining both the IT and business environments within an organisation regardless of the type of interruption. In addition to the IT infrastructure, it covers people, facilities, workplaces, equipment, business processes, and more.
Defining High-Availability
- Provision of end-to-end access to a service and data without interruption
- The elimination of all Single Points Of Failure (SPOF)
- Objective - Zero/Near Zero downtime
  - Includes handling scheduled downtime
Defining Disaster Recovery
The process of restoring and maintaining the data, equipment, applications and other technical resources on which a business depends
- Response to complete loss of a facility
- May include dealing with loss of key staff
- Disaster may also affect alternate facilities that were assumed to be available
Achieving Business Continuity
- Identification of threats to service
  - Systems failures, human errors, sabotage, software bugs, acts of God etc.
- Management of risk
  - Building in redundancy, taking backups, training staff, testing systems, active management solutions
Causes of Down Time
Source - IEEE
Causes - Disaster
- Planning to cope with disasters is an important component of a High-Availability strategy
  - Flood, fire, power grid failure, terrorism etc.
- Most ‘disasters’ are classified as environmental causes of downtime
  - Collectively, environmental causes account for approximately 5% of downtime
Causes - Environmental
- Power cuts and brownouts
  - UPS & Generator
    - What do they power?
- Cooling systems error
  - Humidification regulation errors can cause hardware failures
Southampton University 2005
UK – Jan 2005 & June 2007
Causes – Hardware Failure
- Probably the most recognised cause of downtime
- Server failures
  - Disk, CPU, internal cooling fans, memory faults, …
- Network failures
  - DNS, DHCP, router, ISP, switches, cables cut, …
- Other
  - Tape backup corruption, client hardware, …
Causes - Planned
- Hardware upgrades
- OS version upgrades
- Software version upgrades
- Data migration / transformation
- Backups
- Batch processing
- Preventative maintenance
- Testing
Causes – Human Factor
- Failure to maintain
  - File systems full
  - Database tables full
  - Patches for known bugs not applied
- Accidents
  - root # rm -rf / tmp/tempstuff (note the stray space after the first /)
  - Network mis-configuration
  - Incorrect cable removed
- Inexperience
  - root# reboot
  - Cleaner knocks cables out
- Malice
  - root# uadmin 1 5 or halt
  - Physical sabotage
Causes – Software Error
- Code crashes
  - Application suddenly stops with a core dump
- Memory leaks
  - Slowly consumes all memory until system crash
- Runaway code
  - Taking all CPU time in a loop
- Hanging code
  - Code pauses waiting for reply that never comes
- Resource shortfalls
  - Overflowing logs, failure to allocate memory or process
- Buffer overflows
  - Possibly exploited or just bad code
Managing Risks
- Identify critical services
- Describe service level targets
- Map risks to services
- Quantify the level of threat
- Design and cost solutions
- Compromise in a rational way
Identify Critical Services
- How long can the web server be down?
  - Think – internal & public
- How about Email?
  - Can some Emails be lost?
- How about finance, HR, …?
  - How much downtime is acceptable?
- Who will be affected?
  - Admin, public, suppliers …
- What is the impact on the ‘business’?
  - Reputation, income, disruption, political …
Describe Service Level Targets
- Email, Web (external)
  - Downtime < 2 hours per month, 8 a.m. – 2 a.m.
- Housing Server
  - Downtime < 30 mins per month – 24x7
- Revenue & Benefits
  - Downtime < 5 mins per year – 24x7
- Statistical Server
  - Fix when you can – not really required
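The targets above can be turned into availability percentages, which makes them easier to compare. The sketch below is illustrative only: it assumes a 30-day month, a 365-day year, and that the 8 a.m. to 2 a.m. window means 18 service hours per day. The function name `availability` and the `targets` table are made up for this example and are not from the slides.

```python
def availability(max_downtime_min: float, window_min: float) -> float:
    """Return the minimum availability (in percent) implied by a downtime budget."""
    return 100.0 * (1.0 - max_downtime_min / window_min)


# Assumed calendar lengths (not stated on the slides).
DAYS_PER_MONTH = 30
DAYS_PER_YEAR = 365

targets = {
    # service: (allowed downtime in minutes, service window in minutes)
    "Email, Web (external)": (120, 18 * 60 * DAYS_PER_MONTH),  # 18 service hours per day
    "Housing Server": (30, 24 * 60 * DAYS_PER_MONTH),
    "Revenue & Benefits": (5, 24 * 60 * DAYS_PER_YEAR),
}

for service, (downtime, window) in targets.items():
    print(f"{service}: {availability(downtime, window):.4f}%")
```

Under these assumptions the three targets work out to roughly 99.63%, 99.93% and 99.999% availability, which shows how much stricter the Revenue & Benefits target is than the others.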
Balancing Risk and Reward
- Unless you have an infinite budget you will have to make ‘trade-offs’
- Identify and remove SPoFs for critical services
  - SPoF = Single Point of Failure
- Identify the least reliable components – check MTBFs (mean time between failures)
  - Moving parts typically have the lowest MTBF
- Identify the most difficult components to repair/rebuild
  - e.g. security server, database
- Identify what will have the biggest impact on failure
  - Usually a core server
    - Database, Email, Web, authentication server etc.
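To make the MTBF point concrete, a common rule of thumb estimates availability as MTBF / (MTBF + MTTR), where MTTR (mean time to repair) is a term not used on the slide. The sketch below also shows why removing a SPoF helps: components in series multiply their availabilities, while redundant copies only take the service down if they all fail together. All figures are hypothetical, and the redundancy formula assumes failures are independent.

```python
def component_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one component as a fraction between 0 and 1."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series(*avail: float) -> float:
    """Every component must work, so each one is a single point of failure."""
    result = 1.0
    for a in avail:
        result *= a
    return result


def redundant(a: float, copies: int) -> float:
    """The service works if at least one of `copies` independent components works."""
    return 1.0 - (1.0 - a) ** copies


# Hypothetical figures, for illustration only.
disk = component_availability(mtbf_hours=50_000, mttr_hours=24)
server = component_availability(mtbf_hours=100_000, mttr_hours=8)

single = series(disk, server)                   # one disk, one server
mirrored = series(redundant(disk, 2), server)   # mirrored disks, one server

print(f"single disk:    {single:.6f}")
print(f"mirrored disks: {mirrored:.6f}")
```

In this toy model mirroring the disk raises overall availability, but the single server then becomes the limiting component, which is the kind of trade-off the slide describes.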
Technical Approaches
- Clustering
- Replication
  - Transaction / block level
- Emerging technologies
  - iSCSI
  - Multi-domain clusters
  - Oracle RAC
Typical Multi-Tier Architecture
- View the service in a holistic fashion
- List all SPoFs
  - Network
  - Load balancers
  - Switches
  - Application server
  - Database server
  - Data disks
  - Etc.
- Design in redundancy where possible
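One simple way to list SPoFs, sketched below, is to record how many independent instances of each component the service has and flag anything with fewer than two. The component names and counts are hypothetical, and a real audit would also need to look at shared dependencies such as power and cabling.

```python
def single_points_of_failure(instances: dict[str, int]) -> list[str]:
    """Return the components that have no redundant partner."""
    return [name for name, count in instances.items() if count < 2]


# Hypothetical inventory of one multi-tier service.
inventory = {
    "network link": 1,
    "load balancer": 2,
    "switch": 2,
    "application server": 3,
    "database server": 1,
    "data disks": 2,
}

print(single_points_of_failure(inventory))
```

For this made-up inventory the network link and the database server are flagged, so they would be the first candidates for added redundancy.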
Resilient Architecture
- Multi-site solution
  - Replication to remote site
  - Load balancers shown actually provide each other with redundant functionality
  - Multiple switches used but not shown
- SPoFs reduced near to zero
  - Multiple active blade centres
  - Multiple active application servers
  - Clustered database servers
- This architecture is resilient to almost every conceivable fault
Resilient Architecture
Resilient Architecture
High-Availability Clustering
- Intelligent management solution
- Software only
- Deployed on critical servers
- Can be active-active or active-passive
- Constant monitoring
  - Application availability
  - Server health
  - Network availability
  - Other defined components
- Automated restart / move in the event of a fault
- Notifications to administrative staff
  - GUI, Email, SMS
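The monitor-and-move behaviour described above can be sketched as a single monitoring pass over an active-passive pair. This is not how any particular cluster product works; the `Node` and `Cluster` classes, the `service_ok` probe and the `notify` callback are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Node:
    name: str
    healthy: bool = True  # stand-in for server and network health checks


@dataclass
class Cluster:
    nodes: list[Node]
    active: int = 0                         # index of the node running the service
    notify: Callable[[str], None] = print   # stand-in for GUI, email or SMS alerts

    def check(self, service_ok: Callable[[Node], bool]) -> str:
        """Run one monitoring pass; return the name of the node hosting the service."""
        current = self.nodes[self.active]
        if current.healthy and service_ok(current):
            return current.name
        # Fault detected: move the service to the first healthy standby node.
        for i, node in enumerate(self.nodes):
            if i != self.active and node.healthy:
                self.notify(f"failover: {current.name} -> {node.name}")
                self.active = i
                return node.name
        self.notify(f"service down: no healthy standby for {current.name}")
        return current.name


cluster = Cluster([Node("node-a"), Node("node-b")])
cluster.nodes[0].healthy = False            # simulate a server fault
print(cluster.check(lambda node: True))     # the service moves to node-b
```

A real cluster manager would run such a pass continuously and would also try restarting the application in place before moving it; the sketch only shows the move decision.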
High-Availability Clustering
- Active-Passive
  - Simple setup
- Externalise ‘shared’ data
- Use RAID and/or mirroring
- Low cost, fast and simple
- Very reliable
High-Availability Replication
- Traditional cluster locally
- Replicate to remote node
- Replication at transaction level
- Remote node probably included in cluster
  - Automatic locally
  - Manual remotely
High-Availability Replication
- Typically replication does a ‘log scrape’
  - Although some newer versions have closer integration
- Takes committed transactions and copies them across to the other node(s)
- Other nodes ‘apply’ the transactions to a read-only copy of the database
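A rough sketch of the log-scrape idea follows, assuming a toy log format of `("write", txid, key, value)` and `("commit", txid)` records. Real database logs and replication tools are far more involved; the names `committed_transactions` and `Replica` are invented for this example.

```python
from types import MappingProxyType


def committed_transactions(log):
    """Scan the log and return the changes of each committed transaction, in commit order."""
    pending = {}   # txid -> list of (key, value) changes not yet committed
    done = []
    for record in log:
        if record[0] == "write":
            _, txid, key, value = record
            pending.setdefault(txid, []).append((key, value))
        elif record[0] == "commit":
            done.append(pending.pop(record[1], []))
    return done


class Replica:
    def __init__(self):
        self._data = {}
        self.applied = 0   # number of committed transactions applied so far

    def catch_up(self, log):
        """Apply any committed transactions this replica has not seen yet."""
        for changes in committed_transactions(log)[self.applied:]:
            for key, value in changes:
                self._data[key] = value
            self.applied += 1

    @property
    def view(self):
        """Read-only view of the replicated data."""
        return MappingProxyType(self._data)


log = [("write", 1, "a", 10), ("write", 2, "b", 20), ("commit", 1)]
replica = Replica()
replica.catch_up(log)
print(dict(replica.view))   # transaction 2 is not committed, so only "a" appears
```

Because only committed transactions are shipped, work that is still in flight on the primary never reaches the replica, which keeps the read-only copy consistent.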
High-Availability Replication
- Block level replication
  - Suitable for user files
  - Not ideal for databases
    - Many better approaches that understand DB data
  - Available in different guises, such as
    - Sun’s SNDR (remote mirror) – in kernel
      - Sync / async
      - Streams type module
    - Rsync – user space
      - Periodic checking and copy
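The 'periodic checking and copy' approach can be sketched in user space as below. This is a simplified stand-in that compares whole-file SHA-256 digests; it is not rsync's actual delta-transfer algorithm, and the function names are hypothetical. A scheduler such as cron would be expected to call `sync_once` at intervals.

```python
import hashlib
import shutil
from pathlib import Path


def digest(path: Path) -> str:
    """Return the SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def sync_once(src: Path, dst: Path) -> list[str]:
    """Copy files that are missing or different under dst; return their relative paths."""
    copied = []
    for file in sorted(p for p in src.rglob("*") if p.is_file()):
        rel = file.relative_to(src)
        target = dst / rel
        if not target.exists() or digest(target) != digest(file):
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(file, target)   # copy contents and timestamps
            copied.append(rel.as_posix())
    return copied
```

Anything written between two runs is lost if the source fails before the next pass, which is one reason this style suits user files better than databases.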
High-Availability Replication
- Use DB replication for databases when possible
- Use block level for other file types and legacy applications that have no replication option available
Practical Examples
- Carlisle
  - Some lessons learned
- Surrey Ambulance
  - 999 call handling centre
- North Yorkshire Police
  - Tasking & operational management
Carlisle – Jan 2005
- Extensive flooding Jan 2005
  - Civic centre hub of all operations hit
- Backup generators in basement (flooded 1st)
- Guardian IT ‘insurance’ not used
- All major systems down for a week
- Flooded in Jan 2005 and still dealing with substantial issues today
Carlisle - Lessons
- Don’t assume just because you have ‘a plan’ it will actually work
  - Guardian IT / Sun Guard provide a warm feeling but not useful – Carlisle terminating
  - Test it
  - Keep testing and updating
- Recovery takes longer than you imagine
  - Administration relating to recovery and the process of recovery itself are a huge drain on resources
Surrey Ambulance Service
- 999 call centre
- 24x7 live operations environment
- Handling calls from the public
- Live feeds from ambulance GPS devices
- Automatic escalation and logging
North Yorkshire Police
- 24x7 live CAD system
  - Command and control
  - Custody management
  - Crime management
  - Duty rostering
  - Imaging and biometrics
- Oracle backend to ‘STORM’ application
- Highly integrated systems
  - Mapping systems
  - PNC links
  - DVLA links
  - Firearms database
  - Neighbouring force systems
North Yorkshire Police
Contacts
Giles Gamon
High-Availability.Com
[email protected]@High-Availability.Com
01565 754 459