Breda Development Meetup 2016-06-08 - High Availability


High Availability - Breda Development Meetup - Bas Peters, June 8, 2016

Uptime target    Max downtime per year

90%              36.5 days
99%              3.65 days
99.5%            1.83 days
99.9%            8.76 hours
99.99%           52.56 minutes
99.999%          5.26 minutes
99.9999%         31.5 seconds
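The downtime figures above follow directly from the uptime percentage. A minimal sketch of the arithmetic (the function name is my own):

```python
def max_downtime_per_year(uptime_percent: float) -> float:
    """Maximum allowed downtime per year, in hours, for a given uptime target."""
    hours_per_year = 365 * 24  # 8760 hours in a (non-leap) year
    return hours_per_year * (1 - uptime_percent / 100)

# "Three nines" allow roughly 8.76 hours of downtime per year.
print(round(max_downtime_per_year(99.9), 2))  # prints 8.76
```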

HA is Redundancy

✓ RAID: Disk crash? Another disk still works!

✓ Virtualization: Physical host crashes? VM available on another physical host!

✓ Clustering: Server crashes? Another server still works!

✓ Power: Power outage? Redundant power supply!

✓ Network: Switch or NIC crashes? A second network route is available!

✓ Geographical: Datacenter offline? Another DC is available to perform the work!

Traditional setup

Diagram: end user → router → server

Traditional setup - enhanced

Diagram: end user → router → application server → database server

Adding redundancy

Diagram: end user → router → load balancer → application server 1 / application server 2 → database server

Enhanced redundancy

Diagram: end user → router → load balancer → application server 1 / application server 2 → database server, now with a backup router and a backup load balancer

Database redundancy

Diagram: end user → router (plus backup) → load balancer (plus backup) → application server 1 / application server 2 → database server 1 / database server 2

Datacenter redundancy

Diagram: the full redundant stack (router plus backup, load balancer plus backup, application servers 1-2, database servers 1-2) spread across datacenter 1 and datacenter 2

States and sessions

o Multiple requests can be served by different backend servers
o Store the session in a database or a NoSQL cache
o The load balancer can "stick" a single backend server to a user…
o …but not in all cases!

Diagram: requests 1, 2, 3 from the same user spread across app1 through app4
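The "store the session centrally" point can be sketched as follows; a plain dict stands in for the shared database or NoSQL cache, and all function names are illustrative:

```python
import uuid

# A dict stands in for a shared store such as Redis or a database table;
# any backend server that can reach the store can serve any request.
session_store = {}

def create_session(user_id: str) -> str:
    """Create a server-independent session and return its ID."""
    session_id = str(uuid.uuid4())
    session_store[session_id] = {"user_id": user_id, "cart": []}
    return session_id

def handle_request(session_id: str, item: str) -> dict:
    # Any of app1..app4 can run this: the state lives in the shared
    # store, not in server-local memory, so no stickiness is needed.
    session = session_store[session_id]
    session["cart"].append(item)
    return session
```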

Local storage

o Avoid storing meaningful persistent user content on a local server
o Application-level caching is useful as long as it is not destructive
o Synchronizing contents between backend servers is a pain
o Use the database for storage where possible

…There are possibilities to share storage amongst backend servers

Shared storage - NAS

o Network Attached Storage
o A NAS handles the complete filesystem
o Relies on protocols like:
  NFS: Network File System
  SMB/CIFS: Windows File Sharing
o Simple to implement
o Redundancy is very hard to achieve; often a single point of failure
o Performance is mediocre and bottlenecks can occur

Shared storage - SAN

o Storage Area Network
o A SAN handles only the "block level" part of the filesystem
o Relies on protocols like:
  iSCSI: IP-based SCSI
  Fibre Channel: optical fiber transport protocol
  AoE: ATA over Ethernet
o Hard to implement, expensive
o Redundancy can be achieved to avoid a single point of failure
o Performance and scalability are (reasonably) good

Shared storage – Cluster Filesystem

o Filesystem shared on multiple servers using special software/drivers
o Windows implementation:
  DFS: Windows Distributed File System
o Linux implementations:
  HDFS: Hadoop Distributed Filesystem
  Ceph: object storage platform
  GlusterFS: Red Hat cluster filesystem
o Relatively easy to implement
o Redundancy can easily be achieved
o Performance and scalability are (reasonably) good

Database High Availability

o High availability on an RDBMS (relational database management system) is often the most difficult part of a highly available setup
o Hardware resources and data need to be redundant
o Remember that it isn't just data, it is constantly changing data
o High availability means the operation can continue uninterrupted, not by restoring a new/backup server

Database HA - Replication

o Asynchronous by default
o One master, many slaves
o No write scale-out possible
o Difficult to recover from a failover situation
o Prone to inconsistency when not used properly

Database HA - Sharding

o Separate data over multiple database back-ends using keyed distribution
o Multi-master setup possible
o Excellent scalability
o Redundancy needs to be obtained through a complementary methodology
o Requires more complex application logic
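Keyed distribution can be sketched with a hash over the record key; the shard names below are hypothetical placeholders for real connection targets:

```python
import hashlib

# Hypothetical shard names; real deployments map these to connection strings.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(key: str) -> str:
    """Deterministically map a record key onto one shard."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# All reads and writes for this key always land on the same backend:
target = shard_for("user:42")
```

This is the "more complex application logic" from the slide: every query must first compute its shard, and cross-shard operations need extra handling.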

Database HA – Clustering I

o Synchronous by default
o Multi-master setup possible
o Write scale-out possible
o Near-automatic fault recovery
o Requires code-level replication conflict resolving

Database HA – Clustering II

Clustering for Microsoft SQL Server (from 2012):

o AlwaysOn Availability Groups
o Each node requires WSFC (Windows Server Failover Clustering)
o Asynchronous and synchronous commit modes supported
o Up to 8 "warm" availability replicas can be set up
o These replicas can be used for read transactions and backups
o An availability group listener automatically redirects clients to the best available server
o Not a "real" cluster; no master-master replication possible

Database HA – Clustering III

Clustering for MySQL (MariaDB):

o Galera (wsrep) plugin to enable clustering (included in MariaDB 10.1 by default)
o Asynchronous and synchronous commit modes supported
o Multi-master synchronous replication
o Read and write scalability
o Automatic membership control, node joining and dropping
o No listener functionality that redirects clients to available nodes
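Galera is switched on through a handful of wsrep settings in the server configuration. A minimal sketch (the addresses and cluster name are placeholders, and the provider path varies per distribution):

```ini
[mysqld]
binlog_format = ROW                  # Galera requires row-based replication
default_storage_engine = InnoDB      # only InnoDB tables are replicated
innodb_autoinc_lock_mode = 2

wsrep_on = ON
wsrep_provider = /usr/lib/galera/libgalera_smm.so    # path varies per distro
wsrep_cluster_name = "example_cluster"
wsrep_cluster_address = "gcomm://10.0.0.1,10.0.0.2,10.0.0.3"
wsrep_node_address = "10.0.0.1"      # this node's own address
```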

Clustering – Quorum I

"A quorum is the minimum number of members of a deliberative assembly necessary to conduct the business of that group"

- Wikipedia

Clustering – Quorum II

o Node majority: each node that is available and in communication can vote. The cluster functions only with a majority of the votes.
o When a network partition occurs, the nodes in the minority part go into lockdown to avoid a "split brain" situation
o When a network partition resolves, the minority part rejoins the active cluster after a state transfer to retrieve the data that changed in the meantime
o A cluster should contain an odd number of nodes to prevent a total lockdown during a node failure or network partition
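The majority rule used in the scenarios that follow reduces to a one-line check. A sketch (real cluster software also accounts for graceful leaves, which shrink the expected cluster size):

```python
def has_quorum(reachable: int, cluster_size: int) -> bool:
    """True when this partition holds a strict majority of the votes."""
    return reachable > cluster_size / 2

# Crash of one node in a 3-node cluster: survivors keep quorum (2 of 3).
# A 3/3 network split of a 6-node cluster: neither side has a majority,
# so all nodes go into lockdown -- hence the advice to use an odd count.
```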

Clustering – Scenario 1

o Node A is gracefully stopped
o The other nodes receive a "leave" message and the quorum is reduced by 1
o Cluster is online
o Nodes B and C continue to serve requests because they have the majority of votes (2 of 2)

Clustering – Scenario 2

o Nodes A and B are gracefully stopped
o Node C receives "leave" messages from A and B and the quorum is reduced by 2
o Cluster is online
o Node C continues to serve clients since it has the majority of votes in the quorum (1 of 1)

Clustering – Scenario 3

o All nodes are gracefully stopped
o Cluster is offline
o There is a potential problem in starting the cluster again: the most recent (last stopped) node should be used to bootstrap the cluster, or there is potential data loss

Clustering – Scenario 4

o Node A disappears from the cluster due to unforeseen circumstances
o Nodes B and C will try to reconnect to A but will eventually remove A from the cluster, maintaining the quorum (3)
o Cluster is online
o Nodes B and C continue to serve requests because they have the majority of votes (2 of 3)

Clustering – Scenario 5

o Nodes A and B disappear from the cluster due to unforeseen circumstances
o Node C will try to reconnect to A and B but will eventually remove both from the cluster, maintaining the quorum (3)
o Cluster is offline
o The cluster is offline because Node C cannot acquire a majority of the votes (1 of 3) and will remain in lockdown

Clustering – Scenario 6

o All nodes disappear from the cluster due to unforeseen circumstances
o Cluster is offline (obviously)
o This is a potential problem, as the node with the most recent data should be used to bootstrap the cluster again to avoid data loss

Clustering – Scenario 7

o A network split causes Nodes A, B and C to lose connectivity with Nodes D, E and F
o Cluster is offline
o Nodes A, B and C have no majority (3 of 6) and Nodes D, E and F also have no majority (3 of 6). All nodes go into lockdown

Clustering – Multiple Datacenters I

Diagram: DC1 holds node1 and node2, DC2 holds node3

Clustering – Multiple Datacenters II

Diagram: DC1 holds node1 and node2, DC2 holds node3 and node4

Clustering – Multiple Datacenters III

Diagram: node1 in DC1, node2 in DC2, node3 in DC3

Clustering – Multiple Datacenters IV

Diagram: DC1 holds node1 and node2, DC2 holds node3 and node4, DC3 holds node5 and node6

Health Endpoint Monitoring

o Monitor applications for availability in an HA pool
o Monitor middle-tier services for availability
o Automatic removal of misbehaving endpoints from the pool
o Endpoints that are healthy again after a service interruption are automatically re-added

Application Health Check

Diagram: the load balancer sends a status request to an application node; the node verifies that storage is available, code can be executed, the database is reachable, and services A and B are running, then answers 200 (OK) with a response time of 50 ms
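The status endpoint boils down to running every check and answering 200 only when all of them pass. A sketch, with hypothetical lambda checks standing in for the real storage/database/service probes:

```python
import time

def check_health(checks: dict) -> tuple:
    """Run each named check; the node is healthy only if all pass.
    Returns (http_status, body), like the status endpoint on the slide."""
    start = time.monotonic()
    failed = [name for name, check in checks.items() if not check()]
    elapsed_ms = (time.monotonic() - start) * 1000
    if failed:
        return 503, {"status": "unhealthy", "failed": failed}
    return 200, {"status": "ok", "response_time_ms": round(elapsed_ms, 1)}

# Hypothetical checks mirroring the slide; real ones would touch disk,
# open a database connection, and probe the dependent services.
checks = {
    "storage_available": lambda: True,
    "database_reachable": lambda: True,
    "service_a_running": lambda: True,
}
status, body = check_health(checks)
```

The load balancer only needs the status code: 200 keeps the node in the pool, anything else removes it until the checks pass again.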

Database Health Check

Diagram: the load balancer sends a status request to a database cluster node; the node verifies that the database is running, a simple query can be executed, and the local database node is a healthy cluster node, then answers 200 (OK) with a response time of 50 ms

Monitoring Strategy

Diagram: the front load balancer health-checks application servers 1 and 2; each application server reaches the database (db nodes 1-3) through a DB load balancer, which in turn health-checks the individual db nodes

Design Patterns for HA environments

o Safeguard performance
o Increase fault tolerance
o Improve consistency

Queue based load leveling pattern I

o Temporal decoupling
o Load leveling
o Load balancing
o Loose coupling

Diagram: tasks put messages on a message queue; requests are received at a variable rate, while the service processes the messages at a more consistent rate

Queue based load leveling pattern II

When to use?
o Any type of application or service that is subject to overloading

When not to use?
o Not suitable if a response with minimal latency is expected from the application or service
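The pattern can be sketched with a queue between the request producers and a worker that drains it at its own pace; `process` is a hypothetical handler standing in for the real service:

```python
import queue
import threading

task_queue = queue.Queue()  # the message queue decouples producers from the service

processed = []

def process(task):
    # Hypothetical handler: record the task as handled.
    processed.append(task)

def service_worker():
    # The service drains the queue at its own steady rate,
    # regardless of how bursty the incoming requests are.
    while True:
        task = task_queue.get()
        if task is None:          # sentinel: shut down
            break
        process(task)
        task_queue.task_done()

worker = threading.Thread(target=service_worker)
worker.start()

for i in range(100):              # a burst of requests at a variable rate
    task_queue.put(i)
task_queue.put(None)
worker.join()
```

Producers return as soon as the message is enqueued, which is exactly why the pattern is unsuitable when the caller expects a minimal-latency response.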

Throttling pattern I

o Reject or delay requests to the application when a certain number of requests in a certain amount of time is reached
o Disable or degrade functionality of selected nonessential services so that essential services can run unimpeded with sufficient resources

Throttling pattern II

When to use?
o To ensure that a system continues to meet service level agreements
o To prevent a single tenant from monopolizing the resources provided by an application
o To handle bursts in activity
o To help cost-optimize a system by limiting the maximum resource levels needed to keep it functioning
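"A certain number of requests in a certain amount of time" is a per-tenant sliding-window count. A sketch (class and method names are my own; a token bucket is a common alternative):

```python
import time
from collections import defaultdict, deque

class Throttle:
    """Reject requests beyond `limit` per `window` seconds, per tenant."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)   # tenant -> recent request timestamps

    def allow(self, tenant: str, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        q = self.hits[tenant]
        while q and q[0] <= now - self.window:
            q.popleft()                  # drop timestamps outside the window
        if len(q) >= self.limit:
            return False                 # over the limit: reject (or delay)
        q.append(now)
        return True
```

Because the counts are kept per tenant, one tenant hitting its limit does not affect the others, which addresses the monopolization bullet above.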

Retry pattern

o Enable the application to handle anticipated, temporary failures
o Transparently retry an operation that has previously failed, in the expectation that the cause of the failure is transient
o Especially useful in microservice and cloud architectures
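A minimal retry-with-exponential-backoff sketch; the attempt count and delays are illustrative, and production code would usually retry only on specific transient exception types:

```python
import time

def retry(operation, attempts: int = 3, base_delay: float = 0.1):
    """Retry an operation that may fail transiently, backing off between tries."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise                          # out of attempts: surface the failure
            time.sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...
```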

Deployments

Highly available environments bring additional challenges to software deployments:

o How to perform atomic releases?
o How to roll back a faulty release quickly?
o How to release new software without any downtime?

Basic deployment

Setup: load balancer → application server 1 / application server 2 → database cluster

1. Replace the application code on app server 1
2. Replace the application code on app server 2
3. Apply database changes

DONE!

Enhanced deployment

Setup: load balancer → application server 1 / application server 2 → database cluster

1. Remove app server 1 from the pool
2. Replace the application code on app server 1
3. Enable app server 1 in the pool and disable app server 2
4. Replace the application code on app server 2
5. Enable app server 2 in the pool
6. Apply database changes

DONE!
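The enhanced sequence is essentially a rolling deploy: drain one server, upgrade it, re-enable it, repeat. A sketch with in-memory stand-ins (the real drain/enable calls depend on your load balancer's API):

```python
# In-memory stand-ins: `pool` maps server -> deployed version,
# `enabled` marks whether the load balancer routes traffic to it.
pool = {"appserver1": "v1", "appserver2": "v1"}
enabled = {"appserver1": True, "appserver2": True}

def rolling_deploy(new_version: str) -> None:
    """Upgrade servers one at a time so the pool always has capacity."""
    for server in list(pool):
        enabled[server] = False      # drain: stop routing traffic here
        pool[server] = new_version   # replace the application code
        enabled[server] = True       # healthy again: back into the pool
    # Database changes are applied last, once all servers run compatible code.

rolling_deploy("v2")
```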

A/B Deployments I

Diagram: the load balancer fronts application server 1 and application server 2; each server runs webserver A (code at /deploy/A) and webserver B (code at /deploy/B)

www.live.nl → pool A (app server 1 - A, app server 2 - A)
www.shadow.nl → pool B (app server 1 - B, app server 2 - B)

A/B Deployments II

Diagram: a request for www.live.nl is served by pool A; on the application server, webserver A's code resides at /deploy/A. A request for www.shadow.nl is served by pool B; webserver B's code resides at /deploy/B

A/B Deployments III

www.live.nl: pool A → pool B
www.shadow.nl: pool B → pool A

By swapping pool A and pool B in the load balancer, the entire backends are switched instantaneously. This enables seamless deployment without downtime.

Deployment best practices

o Never introduce backwards-breaking changes to the database
o Thoroughly test the shadow-live environment, as it is the closest to the real live deployment
o Maintain tight release versioning, based on semantic versioning
o Releasing at the end of the day or on a Friday is not recommended

Questions?

WWW.CMTELECOM.COM

Thanks for listening!