Architecting for Failure in AWS - PuppetConf 2013

Post on 12-Nov-2014

"Architecting for Failure in AWS" by Jos Boumans, VP of Operations, Krux Digital.

Presentation Overview: Krux is an infrastructure provider for many of the websites you use online today, like NYTimes.com, WSJ.com, Wikia and NBCU. For every request on those properties, Krux will get one or more as well. We grew from zero traffic to several billion requests per day in the span of 2 years, and we did so exclusively in AWS. As anyone using AWS will be able to tell you, there are good parts and there are bad ones. This is the story of all the pitfalls we encountered, and how, through architecture, convention and common sense, we managed to build an infrastructure that is "Always Up" from the end user perspective and incredibly economical to build, scale & operate.

Speaker Bio: Jos is the VP of Operations at Krux, supporting a platform with over 4 billion requests per day with a tiny Ops team. Every bit of the AWS stack is automated, monitored & graphed, with maximized resilience and minimized cost. In a previous life he ran the Ubuntu Server group at Canonical and the Database group at RIPE, which is responsible for all the authoritative IP address data in Europe, the Middle East & Asia. Jos is a regular speaker at conferences like OSCON, Devoxx and PuppetConf, where he mostly speaks on dealing with AWS operations from all angles.

Transcript of Architecting for Failure in AWS - PuppetConf 2013

ARCHITECTING IN AWS

for resilience & cost at scale

Jos Boumans - @jiboumans

http://rafaykhan619.wix.com/downhouse

Thursday 22 August 13

CANONICAL

http://lukeroberts.deviantart.com/art/Destroy-Ubuntu-93235775

Engineering manager for Ubuntu Server 10.04 & 10.10

http://www.ubuntu.com/business/server/overview

SOME OF OUR CUSTOMERS

LOTS OF TRAFFIC

http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html

AVERAGE REQUESTS* / SEC

http://mashable.com/2013/03/21/happy-7th-birthday-twitter/

http://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm

* Twitter: new tweets. Wikipedia: articles read. Krux: new data points.

[Bar chart; axis runs from 0 to 15,000 requests per second]

MONTHLY UNIQUE USERS

[Bar chart; axis runs from 0 to 800,000,000 monthly unique users]

http://en.wikipedia.org/wiki/Wikipedia

http://mashable.com/2013/03/21/happy-7th-birthday-twitter/

THERE ARE DOWNSIDES

http://modernsavage.hubpages.com/hub/10-springfield-shopper-headlines

RESILIENCE & COST AT SCALE

ROOT CAUSE CATEGORIES

[Chart: "Amazon Cloud Major Outage - Issues Categories", # of issues: Software, 8; Automation, 4; Process, 14]

http://www.slideshare.net/rahultyagi50999/amazon-cloud-major-outages-analysis

Software bugs & human error

RESILIENCE @ SCALE

Embrace Failure: Hardware will fail. Humans will make errors. Nature will produce thunderstorms.

http://blabitcanada.com/category/twitter-2/

DEFINE 'AVAILABLE'

Things will break, so choose your degraded state.

http://libcom.org/library/occupied-wall-street-some-tactical-thoughts-malcolm-harris

BASIC API CALL

3 potential points of failure

FALLBACK PATTERNS

The cost of resilience should be accuracy or latency

http://redis.io/

http://memcached.org/

http://varnish-cache.org/
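
The fallback idea on this slide can be sketched as a chain of lookups: try the authoritative source first, then a possibly stale cache, then a safe default. This is only an illustration under stated assumptions, not the speaker's implementation. The function `fetch_with_fallback`, the tier names and the sample key are all hypothetical, and a plain dict stands in for a cache such as Redis or memcached.

```python
from typing import Callable, Optional, Sequence, Tuple

Lookup = Callable[[str], Optional[str]]


def fetch_with_fallback(
    sources: Sequence[Tuple[str, Lookup]],
    key: str,
    default: str,
) -> Tuple[str, str]:
    """Try each tier in order and return (value, name_of_tier_that_answered)."""
    for name, lookup in sources:
        try:
            value = lookup(key)
        except Exception:
            # A failing tier must not fail the request; move on to the next one.
            continue
        if value is not None:
            return value, name
    # Every tier failed or missed: answer with a degraded but valid default.
    return default, "default"


def broken_database(key: str) -> Optional[str]:
    """Hypothetical primary store that is currently down."""
    raise ConnectionError("database unreachable")


# Hypothetical stale cache; a dict stands in for Redis or memcached here.
stale_cache = {"user:42": "segments=a,b"}

sources = [("database", broken_database), ("cache", stale_cache.get)]
result = fetch_with_fallback(sources, "user:42", default="segments=")
print(result)  # ('segments=a,b', 'cache')
```

The caller still gets an answer while the database is down. What it pays is accuracy (the cached value may be stale) or latency (each failed tier adds time), which is the trade the slide asks for.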

USER EXPERIENCE

My tweet got posted

RESILIENCE TOOLS

Storage, Network & ACL

http://wordyou.ru/kolonki/my-teper-ne-na-avrore-a-na-titanike.html

MANY SMALL NODES VERSUS A FEW LARGER NODES

The benefits of the many outweigh the benefits of the few

http://www.stealingfaith.com/2012/07/08/throw-off-the-tiny-ropes/
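
One way to see the benefit: with N equally sized nodes behind a load balancer, losing one node removes 1/N of total capacity. A small sketch of that arithmetic follows. The node counts are illustrative and not taken from the talk, and `capacity_after_failures` is a hypothetical helper.

```python
def capacity_after_failures(node_count: int, failed: int = 1) -> float:
    """Fraction of total capacity left when `failed` of `node_count` equal nodes die."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    if not 0 <= failed <= node_count:
        raise ValueError("failed must be between 0 and node_count")
    return (node_count - failed) / node_count


# Illustrative fleet sizes serving the same total load.
for nodes in (2, 4, 20):
    remaining = capacity_after_failures(nodes)
    print(f"{nodes:>2} nodes, 1 failure: {remaining:.0%} capacity left")
```

With two large nodes a single failure halves capacity; with twenty small ones the same failure costs five percent.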

DATABASES

CAP Theorem applies. Your choice: sacrifice availability or consistency. Orange is a lie.

[Diagram labels: RDBMS; BigTable based; Master / Slave based; CouchDB; Dynamo based]

http://ferd.ca/beating-the-cap-theorem-checklist.html

SIMPLE STORAGE SERVICE

S3: Arguably AWS' best feature

http://www.iwallpaper.us/gold-star-fo-christmas-wallpaper-140/

http://aws.amazon.com/s3/

https://forums.aws.amazon.com/message.jspa?messageID=182919#182919

CACHE WHAT YOU CAN

HTTP Responses, DB Queries, User content

Browsers have caches too!

http://cruncht.com/95/drupal-caching/

http://redis.io/

http://memcached.org/

http://varnish-cache.org/
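
As a rough illustration of caching a DB query or HTTP response, the sketch below keeps results in memory for a fixed time-to-live. It is a stand-in for what Redis, memcached or Varnish would do in production, not a replacement for them. The class `TTLCache`, its method `get_or_compute` and the 60-second TTL are assumptions made for this example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny in-process cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        """Return the cached value for `key`, calling `compute` on a miss or expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: skip the expensive call
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value


calls = []


def expensive_query() -> str:
    """Hypothetical slow DB query; records each real execution."""
    calls.append(1)
    return "result"


cache = TTLCache(ttl_seconds=60)
first = cache.get_or_compute("report", expensive_query)
second = cache.get_or_compute("report", expensive_query)
print(first, second, len(calls))  # result result 1
```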

CLIENT SIDE STORAGE

Keep a copy of your users' data locally

http://www.w3.org/2001/tag/2010/09/ClientSideStorage.html

http://www.wired.com/gadgetlab/2012/03/badass-gadget-ammo-lunch-box/

USE ELASTIC LOAD BALANCERS

They will save you more than once

http://wallpapers5.com/wallpaper/Balance-Green-Tree-Frog/

USE GLOBAL LOAD BALANCING

Fail over to the closest data center on region failure
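
The routing decision behind this slide can be sketched as "pick the nearest region that is still healthy". The example below is a simplified model, not how any particular DNS provider implements it. The region coordinates are rough approximations chosen for illustration, the health map is assumed to come from external health checks, and `pick_region` is a hypothetical helper.

```python
import math
from typing import Dict, Optional, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Coordinates, b: Coordinates) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def pick_region(client: Coordinates,
                regions: Dict[str, Coordinates],
                healthy: Dict[str, bool]) -> Optional[str]:
    """Return the closest healthy region, or None if every region is down."""
    candidates = [name for name in regions if healthy.get(name, False)]
    if not candidates:
        return None
    return min(candidates, key=lambda name: haversine_km(client, regions[name]))


# Approximate, illustrative coordinates; not authoritative data center locations.
REGIONS = {
    "us-east-1": (38.9, -77.4),
    "us-west-2": (45.8, -119.7),
    "eu-west-1": (53.3, -6.3),
}
client_in_new_york = (40.7, -74.0)

all_up = {name: True for name in REGIONS}
east_down = dict(all_up, **{"us-east-1": False})

print(pick_region(client_in_new_york, REGIONS, all_up))     # us-east-1
print(pick_region(client_in_new_york, REGIONS, east_down))  # us-west-2
```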

SHOUT OUT: DYN

DNS for Bit.ly, Quora, Twitter, Wikia, Fastly, etc.

http://dyn.com

USE IAM ROLES FOR ACCESS

Humans make mistakes, including your humans

COST @ SCALE

Scaling without breaking the bank

http://mgx.com/blogs/wp-content/uploads/2013/07/piggybank.jpg

EMR + SPOT INSTANCES

On demand rate: $0.165 / hour

http://aws.amazon.com/ec2/spot-instances/
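
The on-demand rate only turns into a saving once it is compared with a spot price, which this transcript does not record. A minimal sketch of that comparison follows. The spot rate, instance count and job duration are hypothetical values picked for illustration, and only the $0.165 figure comes from the slide.

```python
ON_DEMAND_RATE = 0.165  # USD per instance-hour, as quoted on the slide


def cluster_cost(instances: int, hours: float, hourly_rate: float) -> float:
    """Total cost in USD of running `instances` nodes for `hours` hours."""
    return instances * hours * hourly_rate


def savings_fraction(on_demand_rate: float, spot_rate: float) -> float:
    """Fraction of the on-demand bill saved by paying the spot rate instead."""
    return 1 - spot_rate / on_demand_rate


# Hypothetical inputs: NOT from the talk. Real spot prices vary by time and zone.
hypothetical_spot_rate = 0.0165
instances, hours = 20, 6

on_demand = cluster_cost(instances, hours, ON_DEMAND_RATE)
spot = cluster_cost(instances, hours, hypothetical_spot_rate)
saved = savings_fraction(ON_DEMAND_RATE, hypothetical_spot_rate)
print(f"on-demand: ${on_demand:.2f}, spot: ${spot:.2f}, saved: {saved:.0%}")
```

Spot capacity can be reclaimed when the market price rises above the bid, so this pricing suits batch work such as EMR jobs that can tolerate losing and replacing nodes.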

AMAZON REDSHIFT

Economical Business Intelligence

Scales with data size

http://www.flitemedia.com/music.php

http://aws.amazon.com/redshift

http://www.tableausoftware.com/

AMAZON GLACIER

"Tapes for the Cloud Era"

Writes vastly cheaper than reads

http://aws.amazon.com/glacier/

http://www.gorp.com/parks-guide/glacier-national-park-outdoor-pp2-guide-cid350021.html

AWS SIMPLE EMAIL SERVICE

Dealing with email is boring and time consuming

http://aws.amazon.com/ses/

http://bfsdaniels.copycop.com/blog/all-about-printing/hypertargeting-with-direct-mail/

AWS SIMPLE QUEUE SERVICE

Excellent for latency-insensitive, small-volume queues

http://www.toledoblade.com/Retail/2013/01/13/Disney-s-magic-bracelet-new-key-to-its-kingdom.html

http://aws.amazon.com/sqs/

http://colby.id.au/benchmarking-sqs

INSTANCE MARKETPLACE

Buy & sell reserved instances

http://commons.wikimedia.org/wiki/File:Javanese_market_place.jpg

http://aws.amazon.com/ec2/reserved-instances/marketplace/

AWS DYNAMO DB

Excellent for small keys & high read rates at known & consistent IOPS

http://hlbike.en.ecplaza.net/2.jpg

http://aws.amazon.com/dynamodb/

MAXIMIZE IOPS

RAID 0 Ephemeral drives

Use m1.xlarge or c1.xlarge, or use SSDs if you need >20k IOPS

http://calculator.s3.amazonaws.com/calc5.html

http://blog.scalyr.com/2012/10/16/a-systematic-look-at-ec2-io/

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html#disk-performance
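
Why striping raises IOPS can be shown with the block mapping RAID 0 uses: consecutive chunks land on different disks, so independent requests are spread across all spindles. The sketch below models that mapping only. It is not a storage driver, `locate_block` is a hypothetical helper, and the disk count of 4 and the chunk sizes are illustrative.

```python
from typing import Tuple


def locate_block(logical_block: int, disk_count: int,
                 chunk_blocks: int = 1) -> Tuple[int, int]:
    """Map a logical block to (disk index, block offset on that disk) under RAID 0."""
    if logical_block < 0 or disk_count <= 0 or chunk_blocks <= 0:
        raise ValueError("arguments must be non-negative / positive")
    # Which chunk the block falls in, and where inside that chunk.
    chunk_index, offset_in_chunk = divmod(logical_block, chunk_blocks)
    # Chunks are dealt round-robin across disks; a full round is one stripe.
    stripe, disk = divmod(chunk_index, disk_count)
    return disk, stripe * chunk_blocks + offset_in_chunk


# Eight consecutive logical blocks over four disks, one block per chunk.
for block in range(8):
    disk, offset = locate_block(block, disk_count=4)
    print(f"logical block {block} -> disk {disk}, offset {offset}")
```

RAID 0 has no redundancy and ephemeral drives do not survive the instance, so this layout only fits data that can be rebuilt or is replicated elsewhere.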

RED FLAGS

Anti-patterns to watch out for

http://grandprix247.com/2012/09/03/spa-pile-up-renews-focus-on-formula-1-safety-matters/

PROVISIONED IOPS EBS

Ephemeral storage on c1/m1.xlarge or SSD is better

If you must: m*large or c1.xlarge for dedicated NIC

http://www.slideshare.net/AmazonWebServices/ebs-mongo-dbwebinarfinal-nn

http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html

http://navidoo.ru/interest/Nasha_jizn/17676.html

AWS DYNAMO DB

For high write rates or large/variable keys

http://aws.amazon.com/dynamodb/

http://www.walltowall.co.uk/program/standing-tall-worlds-tallest-people_93.aspx

HIGH IO/DISK/RAM NODES

Use them deliberately

http://elledecoration.co.za/2010/07/gigantic/

AWS CLOUDWATCH

Metric collection, Amazon style

Cost prohibitive & resolution too low

http://www.flickr.com/photos/65683080@N08/6893582132/

http://aws.amazon.com/cloudwatch/

LOWER COST PER METRIC

Use Graphite & StatsD

http://graphite.wikidot.com/

https://github.com/etsy/statsd
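
Part of what keeps StatsD cheap per metric is its wire format: each sample is a short `name:value|type` text line sent over UDP, with no connection and no reply. The sketch below formats and sends such lines using only the standard library. The metric names, the host and the helpers `format_metric` and `send_metric` are assumptions for this example; port 8125 is the StatsD default.

```python
import socket


def format_metric(name: str, value: float, metric_type: str) -> bytes:
    """Build a StatsD line: 'c' is a counter, 'ms' a timer, 'g' a gauge."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; a metrics failure must never break the caller."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload, (host, port))
    except OSError:
        pass  # dropping a sample is acceptable; failing the request is not


# Hypothetical metric names for illustration.
counter = format_metric("api.requests", 1, "c")
timer = format_metric("api.latency", 12.5, "ms")
send_metric(counter)
send_metric(timer)
print(counter, timer)  # b'api.requests:1|c' b'api.latency:12.5|ms'
```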

HOSTED ALTERNATIVES

Circonus: All the insights you ever wanted

StackDriver: Optimized for AWS

http://circonus.com

http://stackdriver.com

AWS CLOUDFORMATION

Templatize your entire stack

Harder to use as complexity increases

http://aws.amazon.com/cloudwatch/

http://fullnfenil7.blogspot.com/2012/05/amazing-cloud-shapes-photos.html#.UhKrZmRgZHg

RDS FOR ANALYTICS/REPORTS

Paying OLTP prices for BI usage

Sharding will be a matter of time

http://nerds.airbnb.com/redshift-performance-cost

http://business901.com/blog1/understanding-your-customer-problem/

Q & A

http://vickicaruana.blogspot.com/2011/01/are-you-afraid-to-raise-your-hand.html

@jiboumans

http://slideshare.net/jiboumans
