Failure Happens - Reliability and how to run large websites.

Failure Happens

F***, the f*****g thing is f****d

What broke and what we learned

Redundancy

Redundancy, in general terms, refers to the quality or state of being redundant, that is: exceeding what is necessary or normal; or duplication. This can have a negative connotation, especially in rhetoric: superfluous or repetitive; or a positive implication, especially in engineering: serving as a duplicate for preventing failure of an entire system.

Jesse Robbins Artur Bergman

Artur Bergman Jesse Robbins

• Jesse
  – Runs ops for Etelos
  – Firefighter/EMT
  – Emergency Manager
• Katrina
  – Experience running large websites
  – Had the best title ever “Master of Disaster”
• Artur
  – Runs ops & engineering for Wikia
  – Experience of running large websites, enterprise (boring) and stock exchanges
  – Core Perl developer, long development background
• Both of us
  – Write for O’Reilly Radar
  – Speak at conferences
  – Annoy our peers and coworkers
  – Agree on nearly everything

Redundant

Jesse is sick

• Thankfully, we have high availability
  – Hence this talk

• Jesse has 98% availability
• I am more honest, probably more like 90%, excluding the time I sleep
• Our combined availability is about 99.8%
• His war stories will be missing
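The combined figure can be checked with a minimal sketch, assuming the two speakers fail independently of each other (an assumption the slide does not state). A redundant pair is only unavailable when every member is unavailable at once. The function name is made up for illustration.

```python
def combined_availability(*availabilities: float) -> float:
    """Availability of a redundant group that is up while any one member is up.

    Assumes the members fail independently of each other.
    """
    all_down = 1.0
    for availability in availabilities:
        all_down *= 1.0 - availability  # probability this member is down as well
    return 1.0 - all_down


if __name__ == "__main__":
    # 98% and 90% are the figures quoted on the slide.
    print(f"{combined_availability(0.98, 0.90):.2%}")
```

With 98% and 90% the group is down 2% of 10% of the time, which works out to roughly 99.8% combined.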

June 23-24, 2008 Jesse & Steve

364.96 Main

• San Francisco data center
• Hosts a lot of Web 2.0 companies
• Power outage
• 24 July 2007
  – A day I am sure a lot of people remember fondly

Mistakes

• Generator 3 took down 1 and 4
  – 200% more outage than needed
• But really?
  – Not 365 Main’s fault

Failure happens

• A single datacenter is the problem
  – Since they all fail at some point
• Recovery procedures after failure
  – Power was gone ~45 minutes
  – Most services took hours to come back
  – Some unnamed ones more than 12 hours
• Communication
  – All DNS servers in the same datacenter!

Radar article

• Disaster recovery plans exist on a different continuum, affecting not just operations but also your entire organisation's response to disasters.
• An earthquake is a question of when, not if. Are the startups ready for this? How long will we expect them to be gone? Several of the world's largest websites went down. None of them were ready for a datacenter outage. None of them had backup datacenters or failover that worked.
• None even had a coherent strategy for communicating the situation to the rest of the world.

Futility of MTBF

• Mean time between failures
  – Vendors quote you this all the time
• Irrelevant!
• Failure is inevitable
• 365 Main probably had an excellent aggregated MTBF
  – But when something fails, the mean time to the next failure is hardly going to make you feel better

MTTR

• Mean time to recovery
• A short MTTR would have drastically reduced the severity of the power outage, even without hot standby
• No one cares if you fail once a minute
  – If you recover in 50 ms
• If you are down 1 minute a week, you are still going to hit 4 nines (99.99%)
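A minimal sketch of the arithmetic behind those two claims, using the standard steady-state formula availability = MTBF / (MTBF + MTTR). The function and constant names are made up for illustration, and MTBF is taken here to mean the average time spent up between failures.

```python
def availability(mtbf_seconds: float, mttr_seconds: float) -> float:
    """Steady-state availability: share of each fail/recover cycle spent up."""
    return mtbf_seconds / (mtbf_seconds + mttr_seconds)


WEEK = 7 * 24 * 60 * 60  # seconds in a week

if __name__ == "__main__":
    # Down for one minute out of every week.
    print(f"{availability(WEEK - 60, 60):.4%}")
    # Failing once a minute, but recovering in 50 ms each time.
    print(f"{availability(60 - 0.05, 0.05):.4%}")
```

One minute a week lands just above 99.99%. Failing every minute with a 50 ms recovery costs more in percentage terms, but each individual interruption is short enough that users are unlikely to notice it.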

Nines (roughly)

• 99%        5000 Min / Year (3.5 days)
• 99.9%      500 Min / Year (8 hours)
• 99.99%     50 Min / Year
• 99.999%    5 Min / Year
• 99.9999%   30 Seconds / Year
• 99.99999%  3 Seconds / Year
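The rounded figures above can be reproduced with a short sketch, assuming a 365-day year. The names are made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Seconds of allowed downtime in a 365-day year at the given availability."""
    return SECONDS_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for nines in ("99", "99.9", "99.99", "99.999", "99.9999", "99.99999"):
        seconds = downtime_seconds_per_year(float(nines))
        print(f"{nines:>9}%  {seconds / 60:10.2f} min/year  ({seconds:12.1f} s)")
```

The exact values are slightly higher than the slide's round numbers (about 5256 minutes a year at 99%, for example), which is why the slide says "roughly".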

Irrelevance of the nines

• Blizzard
  – $520 million in profit last year
• World of Warcraft
  – 10 million players
• 98-99%
  – By design

Train your users

• Scheduled downtime each week
• Very little redundancy
• Server failure
  – Up to 10 minutes of data loss
• Been like this from the beginning

“We pay them money, so we have to accept the downtime.”

Reliability

• Don’t aim too high unless you run
  – Banks
  – Space shuttles
  – Lung/heart machines
• The higher you aim
  – The more complexity increases (exponentially)
  – The harder you fail

Complexity killed the cat

Downtime   Host                  Site
5m         360.yahoo.com         Yahoo! 360
10m        www.livejournal.com   LiveJournal
25m        www.myspace.com       MySpace
45m        www.xanga.com         Xanga
1h 10m     www.last.fm           Last.fm
1h 10m     www.orkut.com         Orkut
1h 35m     www.facebook.com      Facebook
2h 5m      www.classmates.com    Classmates.com
4h 0m      www.linkedin.com      LinkedIn
2h 55m     www.reunion.com       Reunion.com
5h 5m      www.hi5.com           hi5
6h 0m      www.friendster.com    Friendster
7h 25m     spaces.live.com       Windows Live Spaces
12h 28m    www.bebo.com          Bebo

Jan-Feb 2008 - Source pingdom.com

$800 MM

Measurement

• How do you measure uptime?
• Ping doesn’t work
• Connect
• Your view is limited from your monitoring stations
• Network problems outside your control
  – Hello Cogent

Measurement

• Look at the traffic
  – The data is there
  – HTML delivery time
  – Image delivery time
  – TCP packet loss
  – Use an image call to collect end user performance metrics
• Calculate expected traffic rates
  – Benchmark against that (bandwidth curves should be smooth!)
  – I always watch the bandwidth
• Wikipedia method
  – How many people complain on IRC?
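Benchmarking against expected traffic can be sketched as follows. This is a minimal illustration, not a description of any particular monitoring tool. The function name, the 30% tolerance and the sample numbers are all assumptions, and the expected curve would in practice come from your own historical data.

```python
def traffic_anomalies(observed, expected, tolerance=0.3):
    """Return the sample indices where observed traffic strays from expected.

    `tolerance` is the allowed relative deviation (0.3 means 30%).
    Samples with a non-positive expected rate are skipped.
    """
    flagged = []
    for index, (seen, wanted) in enumerate(zip(observed, expected)):
        if wanted > 0 and abs(seen - wanted) / wanted > tolerance:
            flagged.append(index)
    return flagged


if __name__ == "__main__":
    expected_mbps = [100, 120, 150, 170]  # hypothetical curve for this time of day
    observed_mbps = [98, 118, 60, 165]    # the third sample has a sudden dip
    print(traffic_anomalies(observed_mbps, expected_mbps))
```

A smooth curve that suddenly dips or spikes relative to the expected rate is worth a look even when every ping and connect check is still green.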

Outage?

Outage!

YouTube vs BGP vs Pakistan

• BGP runs your internet
  – Protocol for routers to share routing data
  – How to get from me to somewhere else
• Each organization has an AS number
• Each router keeps track of the number of AS numbers to the destination over different routes
• Chooses the shortest one
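The "shortest one" rule can be sketched as picking the route with the fewest AS hops. This is a deliberate simplification: real BGP implementations compare other attributes (such as local preference) before AS path length. The AS numbers below are from the private-use range and the route data is invented for illustration.

```python
def best_route(routes):
    """Pick the route whose AS path crosses the fewest autonomous systems."""
    return min(routes, key=lambda route: len(route["as_path"]))


if __name__ == "__main__":
    # Hypothetical routes to the same destination learned from two upstreams.
    routes = [
        {"via": "ISP A", "as_path": [64512, 64513, 64520]},
        {"via": "ISP B", "as_path": [64514, 64520]},
    ]
    print(best_route(routes)["via"])
```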

Anycast / Multihoming

• BGP allows you to tell multiple ISPs that you are capable of handling a network
• Traffic will follow the “shortest” path
• If a link goes down, that router-router BGP session goes away and the route is then withdrawn through the system
• “BGP Convergence”
  – Don’t ask what it really means

Networks and prefixes

• Each netblock is subdivided and has a prefix.
• People mostly know /24, which is 256 addresses
• /23 is twice that
• /8 is a vast quantity
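Prefix sizes are easy to check with Python's standard `ipaddress` module. The 10.0.0.0 networks below are private address space picked purely for illustration.

```python
import ipaddress


def address_count(prefix: str) -> int:
    """Number of addresses covered by a prefix such as '10.0.0.0/24'."""
    return ipaddress.ip_network(prefix).num_addresses


if __name__ == "__main__":
    for prefix in ("10.0.0.0/24", "10.0.0.0/23", "10.0.0.0/8"):
        print(prefix, address_count(prefix))
```

Each bit removed from the prefix length doubles the block: a /24 covers 256 addresses, a /23 covers 512, and a /8 covers more than 16 million.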

IP Conservation vs Routing table conservation

• We are running out of IPs
• Our routing table is growing fast
• To limit the growth of the routing table, routers will usually block any routes more specific than /24
• YouTube was being a good citizen and broadcasting one /22 instead of four /24s

Pakistan Telecom

• Government orders ban of YouTube
• PT achieves this by broadcasting a BGP route for one of YouTube’s IP ranges using a /24 prefix
  – Sadly, they did this to the entire world
• Routers choose the most specific route first, so /24 wins over /22
• All of YouTube’s traffic went to Pakistan
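Why the /24 wins can be shown with a minimal longest-prefix-match sketch. The prefixes are private address space standing in for the real ones, and the origin labels are invented. A real router does this lookup in specialised data structures rather than a linear scan.

```python
import ipaddress


def longest_prefix_match(destination: str, routes: dict):
    """Return the origin of the most specific route covering `destination`.

    `routes` maps prefix strings to an origin label. Returns None when no
    route covers the destination.
    """
    address = ipaddress.ip_address(destination)
    matching = [
        (ipaddress.ip_network(prefix), origin)
        for prefix, origin in routes.items()
        if address in ipaddress.ip_network(prefix)
    ]
    if not matching:
        return None
    # The longest prefix length is the most specific route.
    return max(matching, key=lambda item: item[0].prefixlen)[1]


if __name__ == "__main__":
    routes = {
        "10.1.152.0/22": "legitimate origin",  # the aggregate announcement
        "10.1.153.0/24": "hijacker",           # a more specific announcement
    }
    print(longest_prefix_match("10.1.153.10", routes))
    print(longest_prefix_match("10.1.152.10", routes))
```

Addresses inside the hijacked /24 follow the more specific route, while the rest of the /22 still reaches the legitimate origin. That is also why announcing your own /24s is the only partial defence.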

Try reaching for 4 nines

• A BGP error anywhere can quickly bring you down
• Thank the souls running the large ISPs’ core networking.
  – They are the reason it works
• The only way to solve this is to be a bad citizen and spam the table with more routes. But even that doesn’t fully protect you from local outages

Value of reliability (operations and performance)

• Bad reliability is a waste of R&D
• Why develop if you can’t deliver?
• Operations is always treated as the stepchild of Engineering
• But with no reliability, no company
• Fixed amount of time + faster site = more page views

Speed / Reliability

• Important
• Direct correlation between speed and user interaction
• Brand name relies on reliability

[Graph: requests/sec and response time]

Nothing matters

• This entire conference!
• Any cool features!

• Unless it works

Cost benefit

• Cost of delivery
• Revenue earned

• Increase cost for more complexity

Metrics you need

• Cost per page view
• Cost per specific feature/page
• This is key: what you should prioritize and what you should do depend on these numbers
• How else can you value it?
• Don’t always go for cheap; sometimes it is better to buy time using money, sometimes not.
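A minimal sketch of the two metrics, with entirely hypothetical costs and page-view counts. How you attribute infrastructure and staff cost to a specific feature is the hard part and is not shown here.

```python
def cost_per_page_view(total_cost: float, page_views: int) -> float:
    """Cost of delivering one page view over the period being measured."""
    if page_views <= 0:
        raise ValueError("page_views must be positive")
    return total_cost / page_views


if __name__ == "__main__":
    # Hypothetical monthly figures: (cost attributed to the feature, page views).
    features = {
        "article pages": (30_000, 90_000_000),
        "search": (15_000, 5_000_000),
        "image uploads": (5_000, 500_000),
    }
    for name, (cost, views) in features.items():
        print(f"{name:>14}: {cost_per_page_view(cost, views):.5f} per view")
```

Seen this way, a feature that looks cheap in absolute terms can turn out to be the most expensive one per view, which is the kind of number that should drive prioritization.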

Operational Engineers

• Ops stepchild of development?
  – Ops is staffed with failed developers
• Fire them
• Hire good ones
• Who are passionate about learning and exploring the entire stack

My story

• Software developer
• Interested in ops
• I always get transferred to ops
  – Fixing the same problems every time
• (Save me, go to Velocity and learn!)
• I bring engineering to ops, and a way to look at the entire system

Pyromaniac

Paranoid

Backups / High Availability

• Don’t confuse them
• Backups protect your data
• High Availability keeps your site running
• MySQL replication is a valid HA solution
• But it won’t help you with
  – DROP TABLE;

Debugging

• 9 Rules of debugging
• http://www.debuggingrules.com/Poster_download.html
  – Yes the font is horrible

Rule 1: Understand the system

• Complexity kills
• No excuse
• If you write it, you must know it
• If you run it, you must know it
• If you buy it, you must know it

Rule 3: Quit thinking and look

• “It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”

Rule 3: Quit thinking and look

• What do you look at?
• The importance of monitoring
• Monitoring
• Monitoring
• Monitoring

My my, confusing term

• Monitoring
• Alerting
• Trending

Alerting

• Acts on monitoring data
• Severe alerts
  – Active
  – Needs action
• Passive alerts
  – Things that need to be done but not right now
• DO NOT OVER ALERT
• DO NOT CRY WOLF
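One common way to avoid crying wolf is to page only after several consecutive failed checks, and only once per incident. This is a sketch of that idea, not a description of Wikia's setup. The class name and the threshold of three are assumptions.

```python
class AlertDamper:
    """Fire a severe alert once a check has failed `threshold` times in a row."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def observe(self, check_ok: bool) -> bool:
        """Record one check result; return True when a page should be sent."""
        if check_ok:
            self.consecutive_failures = 0  # recovery resets the count
            return False
        self.consecutive_failures += 1
        # Fire exactly once, at the moment the threshold is reached.
        return self.consecutive_failures == self.threshold


if __name__ == "__main__":
    damper = AlertDamper(threshold=3)
    results = [False, True, False, False, False, False]  # False = check failed
    print([damper.observe(ok) for ok in results])
```

A single blip never pages anyone, and a sustained failure pages once rather than on every check. The trade-off is a slower first alert, so the threshold should be tuned against how often checks run.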

Wikia alerting strategy

• When the site is slow
• Or down
• We send emails and do phone calls
• Europe and US West coast
• Looking to hire in East Asia
• No night time

Trending

• Long term
• Capacity planning

Ganglia

• We love Ganglia
• Automatically graphs everything you want - just works
• Large scale clusters
• Multicast
• Zero config
• RRD

http://ganglia.wikimedia.org/

• 270 hosts
• 880 CPU
• 2 clusters
• 1.2 TB of memory

http://ganglia.wikimedia.org

Something is wrong

• Don’t worry, data warehouse

Problem found

• If it is critical, start a phone conversation
• Use IRC to communicate technical data
• One person liaises with non-technical staff
• One person specifically in command
• Sleep scheduling (audit log important)

Post crisis

• Root cause analysis
  – Just find out what went wrong
  – And how to avoid it
  – Or fix it faster next time if you can’t
• Keep track of your uptime

Automation

• All machines are created equal
• Seriously
• If you manually make changes
• You are wrong
  – Unless you know what you are doing

Best practices

• Version control
• Gold images
• Centralised authentication
• Time sync (NTP)
• Central logging
• (All of this applies to virtual machines too!)

Puppet

• New hip kid on the block
• Written in Ruby
• Better support?
• Much nicer syntax
• Easier to extend

tcpdump / wireshark

• If you suspect the network
• Don’t just suspect
• LOOK AT IT
• tcpdump / wireshark will tell you
  – If your packets are lost, delayed or corrupted
  – If your windowing is wrong

Puppet

• Automated machine configuration
• Automation is key
• Our MOTD states
  “If you change anything locally, I will hunt down and kill you”

Rule 4: Divide and Conquer

• Look at the problems in turn
• Split between people
• Go in the order you suspect is the most likely

Rule 5: Change one thing at a time

• I cannot stress this enough
• IF YOU DO NOT THEN YOU HAVE FAILED TO IDENTIFY THE PROBLEM

Rule 6: Keep an audit trail

• You might be making things worse
• Good for the root cause analysis
• Have your shell log all commands
  – Good practice anyway
• Version control

Rule 9: If you didn’t fix it, it ain’t fixed

• You must do something to fix a problem
• Or it will bite you again
• And again
• And again
• They don’t just appear and disappear
• Except BGP route convergence :)

Good Book!

“multiple and unexpected interactions of failures are inevitable”
- Charles Perrow

shit happens.

[email protected]@oreilly.com