Failure Happens
F***, the f*****g thing is f****d
What broke and what we learned
Redundancy
Redundancy, in general terms, refers to the quality or state of being redundant, that is: exceeding what is necessary or normal; or duplication. This can have a negative connotation, especially in rhetoric: superfluous or repetitive; or a positive implication, especially in engineering: serving as a duplicate for preventing failure of an entire system.
Jesse Robbins Artur Bergman
Artur Bergman Jesse Robbins
• Jesse
  – Runs ops for Etelos
  – Firefighter/EMT
  – Emergency Manager
• Katrina
  – Experience running large websites
  – Had the best title ever “Master of Disaster”
• Artur
  – Runs ops & engineering for Wikia
  – Experience of running large websites, enterprise (boring) and stock exchanges
  – Core Perl developer, long development background
• Both of us
  – Write for O’Reilly Radar
  – Speak at conferences
  – Annoy our peers and coworkers
  – Agree on nearly everything
Redundant
Jesse is sick
• Thankfully, we have high availability
  – Hence this talk
• Jesse has a 98% availability
• I am more honest, probably more like 90% excluding the time I sleep
• Our combined availability is 99.8%
• His war stories will be missing
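The arithmetic behind a combined availability figure can be sketched as follows. This is a minimal illustration, assuming the redundant parts fail independently and that the pair is only down when both are down; the function name is made up for this example.

```python
def combined_availability(*availabilities: float) -> float:
    """Availability of redundant (parallel) components.

    Assumes failures are independent: the system is unavailable
    only when every component is unavailable at the same time.
    """
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a  # probability that this component is down
    return 1.0 - all_down


# Two redundant speakers, one available 98% of the time, the other 90%.
print(f"{combined_availability(0.98, 0.90):.2%}")
```

In practice failures are rarely independent (a shared datacenter, a shared flu), so the real figure is usually lower than this formula suggests.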
June 23-24, 2008 Jesse & Steve
364.96 Main
• San Francisco data center
• Hosts a lot of Web 2.0 companies
• Power outage
• 24 July 2007
  – A day I am sure a lot of people remember fondly
Mistakes
• Generator 3 took down 1 and 4
  – 200% more outage than needed
• But really?
  – Not 365 Main's fault
Failure happens
• A single datacenter is the problem
  – Since they all fail at some point
• Recovery procedures after failure
  – Power was gone ~45 minutes
  – Most services took hours to come back
  – Some unnamed ones more than 12 hours
• Communication
  – All DNS servers in the same datacenter!
Radar article
• Disaster recovery plans exist on a different continuum, affecting not just operations but also your entire organisation's response to disasters.
• An earthquake is a question of when, not if. Are the startups ready for this? How long will we expect them to be gone? Several of the world's largest websites went down. None of them were ready for a datacenter outage. None of them had backup datacenters or failover that worked.
• None even had a coherent strategy for communicating the situation to the rest of the world.
Futility of MTBF
• Mean time between failures
  – Vendors quote you this all the time
• Irrelevant!
• Failure is inevitable
• 365 Main probably had an excellent aggregated MTBF
  – But when something fails, the mean time to the next failure is hardly going to make you feel better
MTTR
• Mean time to recovery
• Fast recovery would have drastically reduced the severity of the power outage, even without hot standby
• No one cares if you fail once a minute
  – If you recover in 50 ms
• If you are down 1 minute a week, you are still going to hit 4 nines (99.99%)
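A rough sketch of the arithmetic behind these bullets, assuming availability is simply the fraction of a period not spent recovering. The function name and constants are illustrative, not from the talk.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600


def availability(downtime_seconds: float, period_seconds: float) -> float:
    """Fraction of the period during which the service was up."""
    return 1.0 - downtime_seconds / period_seconds


# Down for one minute every week.
one_minute_a_week = availability(60, SECONDS_PER_WEEK)

# Failing once a minute, but recovering in 50 ms each time.
fail_every_minute = availability(0.050, 60)

print(f"1 minute/week down:     {one_minute_a_week:.4%}")
print(f"50 ms recovery/minute:  {fail_every_minute:.4%}")
```

The point of the comparison is that the time to recover, not the number of failures, is what drives the result.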
Nines (roughly)
• 99%        5000 Min / Year (3.5 days)
• 99.9%      500 Min / Year (8 hours)
• 99.99%     50 Min / Year
• 99.999%    5 Min / Year
• 99.9999%   30 Seconds / Year
• 99.99999%  3 Seconds / Year
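The rounded figures above come from a simple calculation, sketched here assuming a 365-day year; the helper name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Allowed downtime per year for a given availability percentage."""
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


for pct in (99, 99.9, 99.99, 99.999, 99.9999, 99.99999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}%: {minutes:.2f} min/year ({minutes * 60:.1f} seconds)")
```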
Irrelevance of the nines
• Blizzard
  – $520 million in profit last year
• World of Warcraft
  – 10 million players
• 98-99%
  – By design
Train your users
• Scheduled Downtime each week
• Very little redundancy
• Server failure
  – Up to 10 minutes of data loss
• Been like this from the beginning
“We pay them money, so we have to accept the downtime.”
Reliability
• Don’t aim too high unless you are running
  – Banks
  – Space shuttles
  – Lung/heart machines
• The higher you aim
  – The more complexity increases (exponentially)
  – The harder you fail
Complexity killed the cat
Downtime   URL                   Site
5m         360.yahoo.com         Yahoo! 360
10m        www.livejournal.com   LiveJournal
25m        www.myspace.com       MySpace
45m        www.xanga.com         Xanga
1h 10m     www.last.fm           Last.fm
1h 10m     www.orkut.com         Orkut
1h 35m     www.facebook.com      Facebook
2h 5m      www.classmates.com    Classmates.com
4h 0m      www.linkedin.com      LinkedIn
2h 55m     www.reunion.com       Reunion.com
5h 5m      www.hi5.com           hi5
6h 0m      www.friendster.com    Friendster
7h 25m     spaces.live.com       Windows Live Spaces
12h 28m    www.bebo.com          Bebo
Jan-Feb 2008 - Source pingdom.com
$800 MM
Measurement
• How do you measure uptime?
• Ping doesn’t work
• Connect
• Your view is limited from your monitoring stations
• Network problems outside your control
  – Hello Cogent
Measurement
• Look at the traffic
  – The data is there
  – HTML delivery time
  – Image delivery time
  – TCP packet loss
  – Use an image call to collect end user performance metrics
• Calculate expected traffic rates
  – Benchmark against that (bandwidth curves should be smooth!)
  – I always watch the bandwidth
• Wikipedia method
  – How many people complain on IRC?
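One way to benchmark against expected traffic is sketched below. It assumes you already have an expected bandwidth curve (for example, the same hour last week) and flags samples that deviate from it by more than a chosen fraction. The function name, the 30% tolerance and the sample numbers are all hypothetical.

```python
def traffic_anomalies(observed, expected, tolerance=0.3):
    """Return the indices of samples where observed traffic deviates
    from the expected value by more than `tolerance` (as a fraction)."""
    anomalies = []
    for i, (obs, exp) in enumerate(zip(observed, expected)):
        if exp > 0 and abs(obs - exp) / exp > tolerance:
            anomalies.append(i)
    return anomalies


# Hypothetical bandwidth samples in Mbit/s, one per five-minute interval.
expected_mbps = [400, 420, 450, 470, 460]
observed_mbps = [405, 415, 180, 460, 455]

print(traffic_anomalies(observed_mbps, expected_mbps))
```

A sudden dent in an otherwise smooth curve, like the third sample here, is the kind of signal that ping and connect checks from your own monitoring stations can miss.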
Outage?
Outage!
YouTube vs BGP vs Pakistan
• BGP runs your internet
  – Protocol for routers to share routing data
  – How to get from me to somewhere else
• Each organization has an AS number
• Each router keeps track of the number of AS numbers to the destination over different routes
• Chooses the shortest one
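The "shortest one" rule can be illustrated with a toy route table. This is a simplified sketch: real BGP implementations compare several other attributes (such as local preference) before AS path length, and the neighbor names and AS numbers below are made up, taken from the range reserved for documentation.

```python
def best_route(routes):
    """Pick the route with the fewest AS hops.

    `routes` maps a neighbor name to the AS path (list of AS numbers)
    advertised by that neighbor for the same destination.
    """
    return min(routes.items(), key=lambda item: len(item[1]))


# Two hypothetical paths to the same destination AS (64500).
routes = {
    "isp_a": [64496, 64497, 64500],  # three AS hops
    "isp_b": [64498, 64500],         # two AS hops
}

next_hop, as_path = best_route(routes)
print(next_hop, as_path)
```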
Anycast / Multihoming
• BGP allows you to tell multiple ISPs that you are capable of handling a network
• Traffic will flow the “shortest” path
• If a link goes down, that router-router BGP session goes away and the route is then withdrawn through the system
• “BGP Convergence”
  – Don’t ask what it really means
Networks and prefixes
• Each netblock is subclassed and has a prefix.
• People mostly know /24, which is 256 addresses
• /23 is twice that
• /8 is a vast quantity
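Prefix sizes are easy to check with Python's standard `ipaddress` module. The networks below are private example ranges chosen for illustration, and the helper name is made up.

```python
import ipaddress


def addresses_in(prefix: str) -> int:
    """Number of addresses covered by a CIDR prefix."""
    return ipaddress.ip_network(prefix).num_addresses


for prefix in ("10.0.0.0/24", "10.0.0.0/23", "10.0.0.0/22", "10.0.0.0/8"):
    print(f"{prefix}: {addresses_in(prefix)} addresses")
```

Each bit removed from the prefix length doubles the size of the block.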
IP Conservation vs Routing table conservation
• We are running out of IPs
• Our routing table is growing fast
• To limit the growth of the routing table, routers will usually block any routes more specific than /24
• YouTube was being a good citizen and broadcasting one /22 instead of four /24s
Pakistan Telecom
• Government orders ban of YouTube
• PT achieves this by broadcasting a BGP route for one of YouTube's IP ranges using a /24 prefix
  – Sadly, they did this to the entire world
• Routers choose the most specific route first, so /24 wins over /22
• All of YouTube's traffic went to Pakistan
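Why the more specific announcement wins can be shown with a small longest-prefix-match sketch. The prefixes are private example ranges, not the ones involved in the incident, and the origin labels and function name are hypothetical.

```python
import ipaddress


def select_route(destination, routing_table):
    """Return (network, origin) for the most specific prefix that
    contains `destination`, or None if no prefix matches."""
    dest = ipaddress.ip_address(destination)
    matches = []
    for prefix, origin in routing_table.items():
        network = ipaddress.ip_network(prefix)
        if dest in network:
            matches.append((network, origin))
    if not matches:
        return None
    # Longest prefix (largest prefix length) wins.
    return max(matches, key=lambda match: match[0].prefixlen)


routing_table = {
    "10.1.0.0/22": "legitimate origin",  # the well-behaved aggregate
    "10.1.1.0/24": "hijacker",           # a more specific announcement
}

print(select_route("10.1.1.25", routing_table))  # inside the /24
print(select_route("10.1.2.25", routing_table))  # only inside the /22
```

Addresses covered by the /24 follow the hijacker's route even though the legitimate /22 is still being announced.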
Try reaching for 4 nines
• A BGP error anywhere can quickly bring you down
• Thank the souls running the large ISPs' core networking.
  – They are the reason it works
• The only way to solve this is to be a bad citizen and spam the table with more routes. But even that doesn’t fully protect you from local outages
Value of reliability (operations and performance)
• Bad reliability is a waste of R&D
• Why develop if you can’t deliver?
• Operations is always treated as the stepchild of Engineering
• But with no reliability, no company
• Fixed amount of time + faster site = more page views
Speed / Reliability
• Important
• Direct correlation between speed and user interaction
• Brand name relies on reliability
[Graphs: requests/sec and response time]
Nothing matters
• This entire conference!
• Any cool features!
• Unless it works
Cost benefit
• Cost of delivery
• Revenue earned
• Increase cost for more complexity
Metrics you need
• Cost per page view
• Cost per specific feature/page
• This is key: what you should prioritize and what you should do depend on these numbers
• How else can you value it?
• Don’t always go for cheap; sometimes it is better to buy time using money, sometimes not.
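A minimal sketch of the first metric, assuming you can attribute a monthly operating cost to the site. All figures and names are hypothetical.

```python
def cost_per_page_view(monthly_cost: float, monthly_page_views: int) -> float:
    """Operating cost divided by the page views served in the same period."""
    return monthly_cost / monthly_page_views


# Hypothetical numbers: $50,000 a month to serve 100 million page views.
cost = cost_per_page_view(50_000, 100_000_000)
print(f"${cost * 1000:.2f} per thousand page views")
```

The same division works per feature once costs and page views are broken down by feature, which is what makes the numbers usable for prioritization.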
Operational Engineers
• Ops stepchild of development?
  – Ops is staffed with failed developers
• Fire them
• Hire good ones
• Who are passionate about learning and exploring the entire stack
My story
• Software developer
• Interested in ops
• I always get transferred to ops
  – Fixing the same problems every time
• (Save me, go to Velocity and learn!)
• I bring engineering to ops, and a way to look at the entire system
Pyromaniac
Paranoid
Backups / High Availability
• Don’t confuse them
• Backups protect your data
• High Availability keeps your site running
• MySQL replication is a valid HA solution
• But it won’t help you with
  – DROP TABLE;
Debugging
• 9 Rules of debugging
• http://www.debuggingrules.com/Poster_download.html
  – Yes the font is horrible
Rule 1: Understand the system
• Complexity Kills
• No excuse
• If you write it, you must know it
• If you run it, you must know it
• If you buy it, you must know it
Rule 3: Quit thinking and look
• “It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”
Rule 3: Quit thinking and look
• What do you look at?
• The importance of monitoring
• Monitoring
• Monitoring
• Monitoring
My my, confusing term
• Monitoring
• Alerting
• Trending
Alerting
• Acts on monitoring data
• Severe alerts
  – Active
  – Needs action
• Passive alerts
  – Things that need to be done but not right now
• DO NOT OVER ALERT
• DO NOT CRY WOLF
Wikia alerting strategy
• When the site is slow
• Or down
• We send emails and do phone calls
• Europe and US West coast
• Looking to hire in East Asia
• No night time
Trending
• Long term
• Capacity planning
Ganglia
• We love Ganglia
• Automatically graphs everything you want - just works
• Large scale clusters
• Multicast
• Zero config
• RRD
http://ganglia.wikimedia.org/
• 270 hosts
• 880 CPU
• 2 clusters
• 1.2 TB of Memory
Custom Ganglia Gmetrics
• Write your own
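# Report to Ganglia how long (in seconds) the first non-sleeping MySQL query has been running
gmetric --name='Oldest query' --type=int32 --units='sec' --dmax=65 \
  --value="$(echo 'show processlist' | mysql -uroot -ppass |
    grep -v Sleep | grep -v 'system user' | head -2 | tail -1 | cut -f 6)"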
Something is wrong
• Don’t worry, data warehouse
Problem found
• If it is critical, start a phone conversation
• Use IRC to communicate technical data
• One person liaises with non-technical staff
• One person specifically in command
• Sleep scheduling (audit log important)
Post crisis
• Root cause analysis
  – Just find out what went wrong
  – And how to avoid it
  – Or fix it faster next time if you can’t
• Keep track of your uptime
Automation
• All machines are created equal
• Seriously
• If you manually make changes
• You are wrong
  – Unless you know what you are doing
Best practices
• Version control
• Gold images
• Centralised authentication
• Time Sync (NTP)
• Central logging
• (All of this applies for virtual machines too!)
Puppet
• New hip kid on the block
• Written in Ruby
• Better support?
• Much nicer syntax
• Easier to extend
tcpdump / wireshark
• If you suspect the network
• Don’t just suspect
• LOOK AT IT
• tcpdump / wireshark will tell you
  – If your packets are lost, delayed or corrupted
  – If your windowing is wrong
Puppet
• Automated machine configuration
• Automation is key
• Our MOTD states
“If you change anything locally, I will hunt down and kill you”
Rule 4: Divide and Conquer
• Look at the problems in turn
• Split between people
• Go in the order you suspect is the most likely
Rule 5: Change one thing at a time
• I cannot stress this enough
• IF YOU DO NOT THEN YOU HAVE FAILED TO IDENTIFY THE PROBLEM
Rule 6: Keep an audit trail
• You might be making things worse
• Good for the root cause analysis
• Have your shell log all commands
  – Good practice anyway
• Version control
Rule 9: If you didn’t fix it, it ain’t fixed
• You must do something to fix a problem
• Or it will bite you again
• And again
• And again
• They don’t just appear and disappear
• Except BGP route convergence :)
Good Book!
“multiple and unexpected interactions of failures are inevitable”
- Charles Perrow
shit happens.