Posted on 09-Apr-2017
High Availability
Breda Development Meetup, Bas Peters - June 8, 2016
Uptime percentile target    Max downtime per year
90%                         36.5 days
99%                         3.65 days
99.5%                       1.83 days
99.9%                       8.76 hours
99.99%                      52.56 minutes
99.999%                     5.25 minutes
99.9999%                    31.5 seconds
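In code, those figures follow directly from the uptime percentage; a quick sketch:

```python
def max_downtime_per_year(uptime_percent: float) -> float:
    """Return the maximum allowed downtime per year, in seconds."""
    seconds_per_year = 365 * 24 * 3600
    return seconds_per_year * (1 - uptime_percent / 100)

# 99.99% uptime leaves roughly 52.6 minutes of downtime per year
print(round(max_downtime_per_year(99.99) / 60, 1))  # prints 52.6
```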
HA is Redundancy
✓ RAID: Disk crash? Another disk still works!
✓ Virtualization: Physical host crashes? VM available on another physical host!
✓ Clustering: Server crashes? Another server still works!
✓ Power: Power outage? Redundant power supply!
✓ Network: Switch or NIC crashes? A 2nd network route is available!
✓ Geographical: Datacenter offline? Another DC available to perform the work!
Traditional setup
[diagram: end user → router → server]
Traditional setup - enhanced
[diagram: end user → router → application server → database server]
Adding redundancy
[diagram: end user → router → load balancer → application server 1 / application server 2 → database server]
Enhanced redundancy
[diagram: end user → router (+ backup router) → load balancer (+ backup load balancer) → application server 1 / application server 2 → database server]
Database redundancy
[diagram: end user → router (+ backup) → load balancer (+ backup) → application server 1 / application server 2 → database server 1 / database server 2]
Datacenter redundancy
[diagram: end user → router (+ backup) → load balancer (+ backup) → application servers → database servers, duplicated across datacenter 1 and datacenter 2]
States and sessions
o Multiple requests can be served by different backend servers
o Store the session in a database or NoSQL cache
o The load balancer can "stick" a single backend server to a user...
o ...but not in all cases!
[diagram: requests 1-3 distributed over app1-app4]
Local storage
o Avoid storing meaningful persistent user content on a local server
o Application-level caching is useful as long as it is not destructive
o Synchronization of contents between backend servers is a pain
o Use the database for storage where possible
...There are possibilities to share storage amongst backend servers
Shared storage - NAS
o Network Attached Storage
o A NAS handles the complete filesystem
o Relies on protocols like:
  NFS: Network File System
  SMB/CIFS: Windows File Sharing
o Simple to implement
o Redundancy is very hard to achieve; often a single point of failure
o Performance is mediocre and bottlenecks can occur
Shared storage - SAN
o Storage Area Network
o A SAN handles only the "block level" part of the filesystem
o Relies on protocols like:
  iSCSI: IP-based SCSI
  Fibre Channel: optical fiber transport protocol
  AoE: ATA over Ethernet
o Hard to implement, expensive
o Redundancy can be achieved to avoid a single point of failure
o Performance and scalability are (reasonably) good
Shared storage – Cluster Filesystem
o Filesystem shared on multiple servers using special software/drivers
o Windows implementation:
  DFS: Windows Distributed File System
o Linux implementations:
  HDFS: Hadoop Distributed Filesystem
  Ceph: object storage platform
  GlusterFS: Red Hat cluster filesystem
o Relatively easy to implement
o Redundancy can easily be achieved
o Performance and scalability are (reasonably) good
Database High Availability
o High availability on an RDBMS (relational database management system) is often the most difficult part of a highly available setup
o Hardware resources and the data need to be redundant
o Remember that it isn't just data; it is constantly changing data
o High availability means that operation continues uninterrupted, not that service is restored from a new/backup server
Database HA - Replication
o Asynchronous by default
o One master, many slaves
o No write scale-out possible
o Difficult to recover from a failover situation
o Prone to inconsistency when not used properly
Database HA - Sharding
o Separate data over multiple database backends using keyed distribution
o Multi-master setup possible
o Excellent scalability
o Redundancy needs to be obtained through a complementary methodology
o Requires more complex application logic
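A minimal sketch of keyed distribution as described above; the shard names and the hash choice are illustrative assumptions, not part of the talk:

```python
import hashlib

SHARDS = ["db1", "db2", "db3", "db4"]  # hypothetical backend names

def shard_for(key: str) -> str:
    """Route a key to a shard using a stable hash (not Python's
    randomized hash(), so the mapping survives process restarts)."""
    digest = hashlib.md5(key.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(SHARDS)
    return SHARDS[index]

# every request for the same user lands on the same backend
assert shard_for("user:42") == shard_for("user:42")
```

Note the extra application logic this implies: a naive modulo mapping re-shuffles most keys when the shard count changes, which is why schemes like consistent hashing are often layered on top.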
Database HA – Clustering I
o Synchronous by default
o Multi-master setup possible
o Write scale-out possible
o Near-automatic fault recovery
o Requires code-level replication conflict resolving
Database HA – Clustering II
Clustering for Microsoft SQL Server (from 2012):
o AlwaysOn Availability Groups
o Each node requires WSFC (Windows Server Failover Clustering)
o Asynchronous and synchronous commit modes supported
o Up to 8 "warm" availability replicas can be set up
o These replicas can be used for read transactions and backups
o An availability group listener automatically redirects clients to the best available server
o Not a "real" cluster; no master-master replication is possible
Database HA – Clustering III
Clustering for MySQL (MariaDB):
o Galera (wsrep) plugin to enable clustering (included in MariaDB 10.1 by default)
o Asynchronous and synchronous commit modes supported
o Multi-master synchronous replication
o Read and write scalability
o Automatic membership control, node joining and dropping
o No listener functionality that redirects clients to available nodes
Clustering – Quorum I
"A quorum is the minimum number of members of a deliberative assembly necessary to conduct the business of that group"
- Wikipedia
Clustering – Quorum II
o Node majority: each node that is available and in communication can vote. The cluster functions only with a majority of the votes.
o When a network partition occurs, the nodes in the minority part will go into lockdown to avoid a "split brain" situation
o When the network partition resolves, the minority part will rejoin the active cluster after a state transfer to retrieve the data that was changed in the meantime
o A cluster should contain an odd number of nodes to prevent a total lockdown during a node failure or network partition
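The node-majority rule itself fits in a few lines; a simplified sketch (real clusters also adjust the total vote count on graceful leaves, as the scenarios that follow show):

```python
def has_quorum(votes_present: int, total_votes: int) -> bool:
    """A partition keeps quorum only with a strict majority of votes."""
    return votes_present > total_votes / 2

# an even 3-vs-3 split of a 6-node cluster: neither side has a majority
print(has_quorum(3, 6))  # prints False

# with an odd node count, one side always wins the vote
print(has_quorum(2, 3))  # prints True
```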
Clustering – Scenario 1
o Node A is gracefully stopped
o The other nodes receive a "leave" message and the quorum is reduced by 1
o Cluster is online
o Nodes B and C continue to serve requests because they have the majority of votes (2 of 2)
Clustering – Scenario 2
o Nodes A and B are gracefully stopped
o Node C receives "leave" messages from A and B and the quorum is reduced by 2
o Cluster is online
o Node C continues to serve clients since it has the majority of votes in the quorum (1 of 1)
Clustering – Scenario 3
o All nodes are gracefully stopped
o Cluster is offline
o There is a potential problem in starting the cluster again: the most recent (last stopped) node should be used to bootstrap the cluster, or there is potential data loss
Clustering – Scenario 4
o Node A disappears from the cluster due to unforeseen circumstances
o Nodes B and C will try to reconnect to A but will eventually remove A from the cluster, maintaining the quorum (3)
o Cluster is online
o Nodes B and C continue to serve requests because they have the majority of votes (2 of 3)
Clustering – Scenario 5
o Nodes A and B disappear from the cluster due to unforeseen circumstances
o Node C will try to reconnect to A and B but will eventually remove both from the cluster, maintaining the quorum (3)
o Cluster is offline
o The cluster is offline because node C cannot acquire a majority of the votes (1 of 3) and will remain in lockdown
Clustering – Scenario 6
o All nodes disappear from the cluster due to unforeseen circumstances
o Cluster is offline (obviously)
o This is a potential problem, as the node with the most recent data should be used to bootstrap the cluster again to avoid data loss
Clustering – Scenario 7
o A network split causes nodes A, B and C to lose connectivity with nodes D, E and F
o Cluster is offline
o Nodes A, B and C have no majority (3 of 6), and nodes D, E and F also have no majority (3 of 6). All nodes go into lockdown
Clustering – Multiple Datacenters I
[diagram: nodes 1-3 spread over two datacenters, DC1 and DC2]
Clustering – Multiple Datacenters II
[diagram: nodes 1-4 spread over two datacenters, DC1 and DC2]
Clustering – Multiple Datacenters III
[diagram: node1 in DC1, node2 in DC2, node3 in DC3]
Clustering – Multiple Datacenters IV
[diagram: nodes 1-4 spread over DC1 and DC2, node5 and node6 in DC3]
Health Endpoint Monitoring
o Monitor applications for availability in an HA pool
o Monitor middle-tier services for availability
o Automatic removal of misbehaving endpoints from the pool
o Endpoints that are healthy again after a service interruption are automatically re-added
Application Health Check
[diagram: the load balancer sends a status request to an application node, which verifies: storage available, code can be executed, database reachable, service A running, service B running; it responds 200 (OK), response time 50 ms]
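A status endpoint of roughly this shape can be sketched with the standard library alone; the individual checks are stubs here and would probe the real dependencies (disk, database, services) in practice:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-ins for the checks the slide lists; each would probe its
# own dependency in a real node.
def storage_available() -> bool:
    return True

def database_reachable() -> bool:
    return True

def health_status() -> tuple:
    """Aggregate the individual checks into an HTTP response."""
    healthy = storage_available() and database_reachable()
    return (200, b"OK") if healthy else (503, b"FAIL")

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        code, body = (health_status() if self.path == "/status"
                      else (404, b"not found"))
        # 200 keeps this node in the load balancer pool;
        # 503 takes it out of rotation until it recovers
        self.send_response(code)
        self.end_headers()
        self.wfile.write(body)

# To expose the endpoint on a node:
# HTTPServer(("", 8080), HealthHandler).serve_forever()
```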
Database Health Check
[diagram: the load balancer (in front of app servers 1-3) sends a status request to a database node, which verifies: database running, a simple query can be executed, the local database node is a healthy cluster node; it responds 200 (OK), response time 50 ms]
Monitoring Strategy
[diagram: a load balancer in front of application servers 1-2; each app server talks to a DB load balancer that health-checks db nodes 1-3 and routes queries to healthy nodes only]
Design Patterns for HA environments
o Safeguard performance
o Increase fault tolerance
o Improve consistency
Queue based load leveling pattern I
o Temporal decoupling
o Load leveling
o Load balancing
o Loose coupling
[diagram: tasks post messages to a message queue; requests are received at a variable rate, while the service processes messages at a more consistent rate]
Queue based load leveling pattern II
When to use?
o Any type of application or service that is subject to overloading
When not to use?
o Not suitable if a response with minimal latency is expected from the application or service
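An in-process sketch of the pattern, with a bounded queue standing in for the message broker and a single worker thread for the service; the names and sizes are illustrative:

```python
import queue
import threading

task_queue = queue.Queue(maxsize=100)  # the "message queue"
processed = []

def service():
    """Consumes messages at its own steady pace, regardless of
    how bursty the producers are."""
    while True:
        msg = task_queue.get()
        if msg is None:          # sentinel: shut down
            break
        processed.append(msg)
        task_queue.task_done()

worker = threading.Thread(target=service)
worker.start()

# a burst of requests arrives at a variable rate...
for i in range(10):
    task_queue.put(f"request-{i}")

task_queue.put(None)
worker.join()
print(len(processed))  # prints 10
```

The temporal decoupling is the point: producers return as soon as the message is enqueued, and the bounded queue absorbs the burst instead of the service.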
Throttling pattern I
o Reject or delay requests to the application when a certain number of requests within a certain amount of time is reached
o Disable or degrade the functionality of selected nonessential services, so that essential services can run unimpeded with sufficient resources
Throttling pattern II
When to use?
o To ensure that a system continues to meet service level agreements
o To prevent a single tenant from monopolizing the resources provided by an application
o To handle bursts in activity
o To help cost-optimize a system by limiting the maximum resource levels needed to keep it functioning
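One common way to implement the "reject when a certain number of requests in a certain amount of time is reached" rule is a token bucket; a minimal sketch, with illustrative rate and capacity:

```python
import time

class TokenBucket:
    """Allow at most `rate` requests per second on average,
    with bursts of up to `capacity` requests."""
    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # refill tokens for the time elapsed since the last call
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should reject or delay the request

bucket = TokenBucket(rate=5, capacity=10)
# a burst of 12 requests: the first 10 pass, the rest are throttled
results = [bucket.allow() for _ in range(12)]
print(results.count(True))
```

Returning False maps naturally onto an HTTP 429 (Too Many Requests) response, or onto queueing the request for later.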
Retry pattern
o Enable the application to handle anticipated, temporary failures
o Transparently retry an operation that has previously failed, in the expectation that the cause of the failure is transient
o Especially useful in microservice and cloud architectures
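A minimal sketch of the pattern with exponential backoff and jitter; the flaky operation, the exception type, and the attempt count are illustrative assumptions:

```python
import random
import time

def retry(operation, attempts=5, base_delay=0.1):
    """Retry an operation that may fail transiently, backing off
    exponentially with jitter; re-raise once attempts run out."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # give up: the failure is apparently not transient
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)

# hypothetical flaky call: fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return "ok"

print(retry(flaky))  # prints ok
```

The jitter matters in an HA pool: without it, many clients that failed at the same moment would all retry at the same moment, hammering the recovering service in lockstep.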
Deployments
Highly available environments bring additional challenges to software deployments:
o How to perform atomic releases?
o How to roll back a faulty release quickly?
o How to release new software without any downtime?
Basic deployment
[diagram: load balancer → application server 1 / application server 2 → database cluster]
1. Replace the application code on app server 1
2. Replace the application code on app server 2
3. Apply database changes
Done!
Enhanced deployment
[diagram: load balancer → application server 1 / application server 2 → database cluster]
1. Remove app server 1 from the pool
2. Replace the application code on app server 1
3. Enable app server 1 in the pool and disable app server 2
4. Replace the application code on app server 2
5. Enable app server 2 in the pool
6. Apply database changes
Done!
A/B Deployments I
[diagram: load balancer in front of application server 1 and application server 2; www.live.nl is served by pool A (webserver A, code in /deploy/A on both servers), www.shadow.nl by pool B (webserver B, code in /deploy/B on both servers)]
A/B Deployments II
[diagram: a request for www.live.nl is served by pool A (webserver A, code resides at /deploy/A); a request for www.shadow.nl is served by pool B (webserver B, code resides at /deploy/B)]
A/B Deployments III
[diagram: the load balancer remaps www.live.nl from pool A to pool B, and www.shadow.nl from pool B to pool A]
By swapping pool A with pool B in the load balancer, the entire backend is switched instantaneously. This enables seamless deployment without downtime.
Deployment best practices
o Never introduce backwards-breaking changes to the database
o Thoroughly test the shadow-live environment, as it is the closest to the real live deployment
o Maintain tight release versioning, based on semantic versioning
o Releasing at the end of the day, or on a Friday, is not recommended
Questions?
WWW.CMTELECOM.COM
THANKS FOR LISTENING!