
Transcript of Scalable Web Architectures

Scalable Web Architectures

Common Patterns & Approaches

Cal Henderson

Web 2.0 Expo NYC, 16th September 2008

Hello


Flickr

• Large scale kitten-sharing

• Almost 5 years old
  – 4.5 since public launch

• Open source stack
  – An open attitude towards architecture

Flickr

• Large scale

• In a single second:
  – Serve 40,000 photos
  – Handle 100,000 cache operations
  – Process 130,000 database queries

Scalable Web Architectures?

What does scalable mean?

What’s an architecture?

An answer in 12 parts


1.

Scaling


Scalability – myths and lies

• What is scalability?


Scalability – myths and lies

• What is scalability not?

Scalability – myths and lies

• What is scalability not?
  – Raw Speed / Performance
  – HA / BCP
  – Technology X
  – Protocol Y

Scalability – myths and lies

• So what is scalability?


Scalability – myths and lies

• So what is scalability?
  – Traffic growth
  – Dataset growth
  – Maintainability

Today

• Two goals of application architecture:

Scale

HA


Today

• Three goals of application architecture:

Scale

HA

Performance


Scalability

• Two kinds:
  – Vertical (get bigger)
  – Horizontal (get more)

Big Irons

Sunfire E20k
  $450,000 - $2,500,000
  36x 1.8GHz processors

PowerEdge SC1435
  Dual-core 1.8GHz processor
  Around $1,500

Cost vs Cost


That’s OK

• Sometimes vertical scaling is right

• Buying a bigger box is quick (ish)

• Redesigning software is not

• Running out of MySQL performance?
  – Spend months on data federation
  – Or, just buy a ton more RAM

The H & the V

• But we’ll mostly talk horizontal
  – Else this is going to be boring

2.

Architecture


Architectures then?

• The way the bits fit together

• What grows where

• The trade-offs between good/fast/cheap


LAMP

• We’re mostly talking about LAMP
  – Linux
  – Apache (or lighttpd)
  – MySQL (or Postgres)
  – PHP (or Perl, Python, Ruby)

• All open source
• All well supported
• All used in large operations
• (Same rules apply elsewhere)

Simple web apps

• A Web Application
  – Or “Web Site” in Web 1.0 terminology

Interwebnet → App server → Database

Simple web apps

• A Web Application
  – Or “Web Site” in Web 1.0 terminology

Interwobnet → App server → Database
  – Plus a cache, a storage array, and AJAX!!!1

App servers

• App servers scale in two ways:


App servers

• App servers scale in two ways:

– Really well


App servers

• App servers scale in two ways:

– Really well

– Quite badly


App servers

• Sessions!
  – (State)
  – Local sessions == bad
    • When they move == quite bad
  – Centralized sessions == good
  – No sessions at all == awesome!

Local sessions

• Stored on disk
  – PHP sessions

• Stored in memory
  – Shared memory block (APC)

• Bad!
  – Can’t move users
  – Can’t avoid hotspots
  – Not fault tolerant

Mobile local sessions

• Custom built
  – Store last session location in cookie
  – If we hit a different server, pull our session information across

• If your load balancer has sticky sessions, you can still get hotspots
  – Depends on volume – fewer heavier users hurt more

Remote centralized sessions

• Store in a central database
  – Or an in-memory cache

• No porting around of session data
• No need for sticky sessions
• No hot spots

• Need to be able to scale the data store
  – But we’ve pushed the issue down the stack

No sessions

• Stash it all in a cookie!

• Sign it for safety:
  – $data = $user_id . '-' . $user_name;
  – $time = time();
  – $sig = sha1($secret . $time . $data);
  – $cookie = base64_encode("$sig-$time-$data");

  – Timestamp means it’s simple to expire it
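Reading that cookie back is the mirror image of the signing above; a minimal sketch of the verify-and-expire side, assuming the same $secret and an illustrative max age (the slides only show the signing half):

<?php
// Rebuild the signature and reject anything tampered with or too old.
// $secret matches the signing side; $max_age is an illustrative choice.
function read_signed_cookie($cookie, $secret, $max_age = 86400) {
    $parts = explode('-', base64_decode($cookie), 3);
    if (count($parts) != 3) return null;
    list($sig, $time, $data) = $parts;

    if (sha1($secret . $time . $data) !== $sig) return null; // forged
    if (time() - (int)$time > $max_age) return null;         // expired
    return $data;                                             // e.g. "1234-cal"
}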

Super slim sessions

• If you need more than the cookie (login status, user id, username), then pull their account row from the DB
  – Or from the account cache

• None of the drawbacks of sessions
• Avoids the overhead of a query per page
  – Great for high-volume pages which need little personalization
  – Turns out you can stick quite a lot in a cookie too
  – Pack with base64 and it’s easy to delimit fields

App servers

• The Rasmus way
  – App server has ‘shared nothing’
  – Responsibility pushed down the stack

• Ooh, the stack

Trifle

Sponge / Database
Jelly / Business Logic
Custard / Page Logic
Cream / Markup
Fruit / Presentation

App servers

Well, that was easy

• Scaling the web app server part is easy

• The rest is the trickier part
  – Database
  – Serving static content
  – Storing static content

The others

• Other services scale similarly to web apps
  – That is, horizontally

• The canonical examples:
  – Image conversion
  – Audio transcoding
  – Video transcoding
  – Web crawling
  – Compute!

Amazon

• Let’s talk about Amazon
  – S3 – Storage
  – EC2 – Compute! (Xen based)
  – SQS – Queueing

• All horizontal

• Cheap when small
  – Not cheap at scale

3.

Load Balancing


Load balancing

• If we have multiple nodes in a class, we need to balance between them

• Hardware or software

• Layer 4 or 7


Hardware LB

• A hardware appliance
  – Often a pair with heartbeats for HA

• Expensive!
  – But offers high performance
  – Easy to do > 1Gbps

• Many brands
  – Alteon, Cisco, Netscaler, Foundry, etc
  – L7 – web switches, content switches, etc

Software LB

• Just some software
  – Still needs hardware to run on
  – But can run on existing servers

• Harder to have HA
  – Often people stick hardware LBs in front
  – But Wackamole helps here

Software LB

• Lots of options
  – Pound
  – Perlbal
  – Apache with mod_proxy

• Wackamole with mod_backhand
  – http://backhand.org/wackamole/
  – http://backhand.org/mod_backhand/


The layers

• Layer 4
  – A ‘dumb’ balance

• Layer 7
  – A ‘smart’ balance

• OSI stack, routers, etc

4.

Queuing


Parallelizable == easy!

• If we can transcode/crawl in parallel, it’s easy
  – But think about queuing
  – And asynchronous systems
  – The web ain’t built for slow things
  – But still, a simple problem
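The usual pattern is to accept the request, drop a job on a queue, and let workers do the slow work out of band. A minimal sketch, assuming a MySQL-backed jobs table; the table, column, and function names are illustrative, not a specific system from the talk:

<?php
// Stand-in for the actual slow work (video transcoding, crawling, etc.)
function transcode_video($video_id) { /* the slow part happens here */ }

// Producer: the web request records the job and returns immediately.
function enqueue_transcode(PDO $db, $video_id) {
    $stmt = $db->prepare("INSERT INTO jobs (type, ref_id, status) VALUES ('transcode', ?, 'pending')");
    $stmt->execute(array($video_id));
}

// Consumer: a worker process polls the queue and processes jobs asynchronously.
function work(PDO $db) {
    while (true) {
        $job = $db->query("SELECT id, ref_id FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1")
                  ->fetch(PDO::FETCH_ASSOC);
        if (!$job) { sleep(1); continue; }   // nothing pending; back off

        $db->prepare("UPDATE jobs SET status = 'working' WHERE id = ?")->execute(array($job['id']));
        transcode_video($job['ref_id']);
        $db->prepare("UPDATE jobs SET status = 'done' WHERE id = ?")->execute(array($job['id']));
    }
}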

Synchronous systems

Asynchronous systems

Helps with peak periods

5.

Relational Data


Databases

• Unless we’re doing a lot of file serving, the database is the toughest part to scale

• If we can, best to avoid the issue altogether and just buy bigger hardware

• Dual-Quad Opteron/Intel64 systems with 16+GB of RAM can get you a long way


More read power

• Web apps typically have a read/write ratio of somewhere between 80/20 and 90/10

• If we can scale read capacity, we can solve a lot of situations

• MySQL replication!


Master-Slave Replication

Master: reads and writes
Slaves: reads
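In application code, scaling reads this way just means splitting connections: writes go to the master, reads go to a replica. A minimal sketch, with host names and the selection logic made up for illustration:

<?php
// Writes hit the master (so they replicate everywhere); reads spread across slaves.
// Hosts and credentials here are placeholders.
$user_id = 1234;
$name    = 'Cal';

$master = new mysqli('db-master', 'user', 'pass', 'app');
$slaves = array('db-slave1', 'db-slave2', 'db-slave3');
$slave  = new mysqli($slaves[array_rand($slaves)], 'user', 'pass', 'app');

// Write: always to the master.
$stmt = $master->prepare("UPDATE users SET name = ? WHERE id = ?");
$stmt->bind_param('si', $name, $user_id);
$stmt->execute();

// Read: from a slave. Replication lag means a read immediately after
// a write may not see it yet.
$stmt = $slave->prepare("SELECT name FROM users WHERE id = ?");
$stmt->bind_param('i', $user_id);
$stmt->execute();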

6.

Caching


Caching

• Caching avoids needing to scale!
  – Or makes it cheaper

• Simple stuff
  – mod_perl / shared memory
  – MySQL query cache
    • Bad performance (in most cases)

• Invalidation is hard

Caching

• Getting more complicated…
  – Write-through cache
  – Write-back cache
  – Sideline cache


Sideline cache

• Easy to implement
  – Just add app logic

• Need to manually invalidate cache
  – Well designed code makes it easy

• Memcached
  – From Danga (LiveJournal)
  – http://www.danga.com/memcached/
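A sideline cache in PHP looks roughly like this, using the pecl Memcache client; the key naming, TTL, and invalidate-on-write step are illustrative choices, not Flickr's actual code:

<?php
// Sideline cache: check the cache, fall back to the DB on a miss,
// and invalidate the key on writes. Hosts/credentials are placeholders.
$db    = new PDO('mysql:host=db-master;dbname=app', 'user', 'pass');
$cache = new Memcache();
$cache->connect('memcache-host', 11211);

function get_user($db, $cache, $user_id) {
    $key  = "user:$user_id";
    $user = $cache->get($key);
    if ($user !== false) return $user;              // cache hit

    $stmt = $db->prepare("SELECT * FROM users WHERE id = ?");
    $stmt->execute(array($user_id));
    $user = $stmt->fetch(PDO::FETCH_ASSOC);
    $cache->set($key, $user, 0, 3600);              // cache for an hour
    return $user;
}

function update_user_name($db, $cache, $user_id, $name) {
    $db->prepare("UPDATE users SET name = ? WHERE id = ?")->execute(array($name, $user_id));
    $cache->delete("user:$user_id");                // manual invalidation
}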

Memcache schemes

• Layer 4
  – Good: Cache can be local on a machine
  – Bad: Invalidation gets more expensive with node count
  – Bad: Cache space wasted by duplicate objects

Memcache schemes

• Layer 7
  – Good: No wasted space
  – Good: Linearly scaling invalidation
  – Bad: Multiple, remote connections
    • Can be avoided with a proxy layer
      – Gets more complicated
        » Last indentation level!

7.

HA Data


But what about HA?

SPOF!

• The key to HA is avoiding SPOFs
  – Identify
  – Eliminate

• Some stuff is hard to solve
  – Fix it further up the tree

• Dual DCs solve the router/switch SPOF


Master-Master

• Either hot/warm or hot/hot

• Writes can go to either
  – But avoid collisions
  – No auto-inc columns for hot/hot
    • Bad for hot/warm too
    • Unless you have MySQL 5
      – But you can’t rely on the ordering!
  – Design schema/access to avoid collisions
    • Hashing users to servers

Rings

• Master-master is just a small ring
  – With 2 nodes

• Bigger rings are possible
  – But not a mesh!
  – Each slave may only have a single master
  – Unless you build some kind of manual replication


Dual trees

• Master-master is good for HA
  – But we can’t scale out the reads (or writes!)

• We often need to combine the read scaling with HA

• We can simply combine the two models


Cost models

• There’s a problem here
  – We need to always have 200% capacity to avoid a SPOF
    • 400% for dual sites!
  – This costs too much

• Solution is straightforward
  – Make sure clusters are bigger than 2

N+M

• N+M
  – N = nodes needed to run the system
  – M = nodes we can afford to lose

• Having M as big as N starts to suck
  – If we can make each node smaller, we can increase N while M stays constant
  – (We assume smaller nodes are cheaper)

1+1 = 200% hardware


3+1 = 133% hardware


Meshed masters

• Not possible with regular MySQL out-of-the-box today

• But there is hope!
  – NDB (MySQL Cluster) allows a mesh
  – Support for replication out to slaves in a coming version
    • RSN!

8.

Federation


Data federation

• At some point, you need more writes
  – This is tough
  – Each cluster of servers has limited write capacity

• Just add more clusters!

Simple things first

• Vertical partitioning
  – Divide tables into sets that never get joined
  – Split these sets onto different server clusters
  – Voila!

• Logical limits
  – When you run out of non-joining groups
  – When a single table grows too large

Data federation

• Split up large tables, organized by some primary object
  – Usually users

• Put all of a user’s data on one ‘cluster’
  – Or shard, or cell

• Have one central cluster for lookups
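In code, the central lookup just maps the primary object to its shard before you open a connection. A minimal sketch; the table, hosts, and helper names are made up for illustration:

<?php
// Central lookup cluster: a small table mapping each user to the shard
// ('cluster') that owns all of their data.
$lookup_db = new PDO('mysql:host=db-lookup;dbname=lookup', 'user', 'pass');
$user_id   = 1234;

function get_user_shard(PDO $lookup_db, $user_id) {
    $stmt = $lookup_db->prepare("SELECT shard_id FROM user_shards WHERE user_id = ?");
    $stmt->execute(array($user_id));
    return $stmt->fetchColumn();
}

function connect_to_shard($shard_id) {
    // One cluster of databases per shard, e.g. shard 7 lives on "db-shard-7"
    return new PDO("mysql:host=db-shard-$shard_id;dbname=app", 'user', 'pass');
}

// Every query for this user's data now goes to their shard.
$shard_db = connect_to_shard(get_user_shard($lookup_db, $user_id));
$stmt = $shard_db->prepare("SELECT * FROM photos WHERE owner_id = ?");
$stmt->execute(array($user_id));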


Data federation

• Need more capacity?
  – Just add shards!
  – Don’t assign to shards based on user_id!

• For resource leveling as time goes on, we want to be able to move objects between shards
  – Maybe – not everyone does this
  – ‘Lockable’ objects

The wordpress.com approach

• Hash users into one of n buckets
  – Where n is a power of 2

• Put all the buckets on one server

• When you run out of capacity, split the buckets across two servers

• When you run out of capacity again, split the buckets across four servers

• Etc
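A sketch of that bucketing scheme; the hash choice and the bucket-to-server mapping are illustrative, the point being that a user's bucket never changes while the bucket-to-server mapping can:

<?php
// Users hash permanently into one of $n buckets (n a power of 2).
// Doubling the server count only re-points half the buckets; no user
// ever changes bucket. Values below are illustrative.
$n = 256;

function bucket_for_user($user_id, $n) {
    // Stable user -> bucket mapping (md5 keeps it non-negative and portable)
    return hexdec(substr(md5((string)$user_id), 0, 4)) % $n;
}

function server_for_bucket($bucket, $num_servers) {
    // With 1, 2, 4, ... servers, each doubling splits the buckets again.
    return 'db' . ($bucket % $num_servers);   // e.g. "db0", "db1", ...
}

$bucket = bucket_for_user(1234, $n);
$server = server_for_bucket($bucket, 4);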

Data federation

• Heterogeneous hardware is fine
  – Just give a larger/smaller proportion of objects depending on hardware

• Bigger/faster hardware for paying users
  – A common approach
  – Can also allocate faster app servers via magic cookies at the LB

Downsides

• Need to keep stuff in the right place
• App logic gets more complicated
• More clusters to manage
  – Backups, etc

• More database connections needed per page
  – Proxy can solve this, but complicated

• The dual table issue
  – Avoid walking the shards!

Bottom line

Data federation is how large applications are scaled

Bottom line

• It’s hard, but not impossible

• Good software design makes it easier
  – Abstraction!

• Master-master pairs for shards give us HA

• Master-master trees work for the central cluster (many reads, few writes)

9.

Multi-site HA


Multiple Datacenters

• Having multiple datacenters is hard
  – Not just with MySQL

• Hot/warm with MySQL slaved setup
  – But manual (reconfig on failure)

• Hot/hot with master-master
  – But dangerous (each site has a SPOF)

• Hot/hot with sync/async manual replication
  – But tough (big engineering task)


GSLB

• Multiple sites need to be balanced
  – Global Server Load Balancing

• Easiest are AkaDNS-like services
  – Performance rotations
  – Balance rotations

10.

Serving Files


Serving lots of files

• Serving lots of files is not too tough
  – Just buy lots of machines and load balance!

• We’re IO bound – need more spindles!
  – But keeping many copies of data in sync is hard
  – And sometimes we have other per-request overhead (like auth)


Reverse proxy

• Serving out of memory is fast!
  – And our caching proxies can have disks too
  – Fast or otherwise

• More spindles is better

• We can parallelize it!
  – 50 cache servers gives us 50 times the serving rate of the origin server
  – Assuming the working set is small enough to fit in memory in the cache cluster

Invalidation

• Dealing with invalidation is tricky

• We can prod the cache servers directly to clear stuff out
  – Scales badly – need to clear the asset from every server – doesn’t work well for 100 caches

Invalidation

• We can change the URLs of modified resources
  – And let the old ones drop out of cache naturally
  – Or poke them out, for sensitive data

• Good approach!
  – Avoids browser cache staleness
  – Hello Akamai (and other CDNs)
  – Read more: http://www.thinkvitamin.com/features/webapps/serving-javascript-fast

CDN – Content Delivery Network

• Akamai, Savvis, Mirror Image Internet, etc

• Caches operated by other people
  – Already in place
  – In lots of places

• GSLB/DNS balancing

Edge networks

Origin server, with many edge caches

CDN Models

• Simple model
  – You push content to them, they serve it

• Reverse proxy model
  – You publish content on an origin, they proxy and cache it

CDN Invalidation

• You don’t control the caches
  – Just like those awful ISP ones

• Once something is cached by a CDN, assume it can never change
  – Nothing can be deleted
  – Nothing can be modified

Versioning

• When you start to cache things, you need to care about versioning

  – Invalidation & Expiry
  – Naming & Sync

Cache Invalidation

• If you control the caches, invalidation is possible

• But remember ISP and client caches

• Remove deleted content explicitly
  – Avoid users finding old content
  – Save cache space

Cache versioning

• Simple rule of thumb:
  – If an item is modified, change its name (URL)

• This can be independent of the file system!

Virtual versioning

• Database indicates version 3 of file

• Web app writes version number into URL

• Request comes through cache and is cached with the versioned URL

• mod_rewrite converts versioned URL to path

  – DB: version 3
  – URL: example.com/foo_3.jpg
  – Cached as: foo_3.jpg
  – mod_rewrite: foo_3.jpg -> foo.jpg
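The web-app side of this is tiny; a sketch, where the URL layout and the rewrite rule are assumptions about how you might wire it up rather than Flickr's actual code:

<?php
// Build a versioned URL from the version number stored in the database.
// Any edit bumps the version, which changes the URL, which busts caches.
function asset_url($name, $version) {
    $dot = strrpos($name, '.');
    return 'http://example.com/' . substr($name, 0, $dot) . '_' . $version . substr($name, $dot);
}

echo asset_url('foo.jpg', 3);   // http://example.com/foo_3.jpg

// On the origin, something like this Apache mod_rewrite rule (illustrative)
// maps the versioned URL back onto the single real file:
//   RewriteRule ^/(.*)_\d+\.(jpg|gif|png)$ /$1.$2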

Reverse proxy

• Choices
  – L7 load balancer & Squid
    • http://www.squid-cache.org/
  – mod_proxy & mod_cache
    • http://www.apache.org/
  – Perlbal and Memcache?
    • http://www.danga.com/

More reverse proxy

• HAProxy

• Squid & CARP

• Varnish

High overhead serving

• What if you need to authenticate your asset serving?
  – Private photos
  – Private data
  – Subscriber-only files

• Two main approaches
  – Proxies w/ tokens
  – Path translation

Perlbal Re-proxying

• Perlbal can do redirection magic
  – Client sends request to Perlbal
  – Perlbal plugin verifies user credentials
    • token, cookies, whatever
    • tokens avoid data-store access
  – Perlbal goes to pick up the file from elsewhere
  – Transparent to user
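Perlbal's re-proxying is driven by a special response header from the backend telling it where to fetch the file. A sketch of the app side, assuming re-proxying is enabled in the Perlbal config; the auth check and the storage URL are illustrative stand-ins:

<?php
// App server: verify credentials, then hand file serving back to Perlbal
// via a re-proxy header instead of streaming the bytes through PHP.
function user_can_see_photo($user_id, $photo_id) {
    // Stand-in for a real permission check (token, cookie, DB lookup...)
    return true;
}

if (!user_can_see_photo(1234, 5678)) {
    header('HTTP/1.0 403 Forbidden');
    exit;
}

// Perlbal fetches this internal URL itself and streams it to the client,
// so the client never learns how to reach the storage host directly.
header('X-REPROXY-URL: http://storage-host/photos/1234/secret_5678.jpg');
exit;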


Perlbal Re-proxying

• Doesn’t keep database around while serving

• Doesn’t keep app server around while serving

• User doesn’t find out how to access asset directly


Permission URLs

• But why bother!?

• If we bake the auth into the URL then it saves the auth step

• We can do the auth on the web app servers when creating HTML

• Just need some magic to translate to paths

• We don’t want paths to be guessable



Permission URLs

• Downsides
  – URL gives permission for life
  – Unless you bake in tokens
    • Tokens tend to be non-expirable
      – We don’t want to track every token
        » Too much overhead
    • But can still expire
      – But you choose at time of issue, not later

• Upsides
  – It works
  – Scales very nicely
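A sketch of baking the auth into the URL with an expiry chosen at issue time; the secret, URL layout, and parameter names are assumptions for illustration:

<?php
// Issue a URL that carries its own permission: an expiry time plus a
// signature over the path and expiry.
$secret = 'server-side-secret';   // illustrative

function permission_url($path, $lifetime, $secret) {
    $expires = time() + $lifetime;                    // chosen at time of issue
    $sig = sha1($secret . $path . $expires);
    return "http://example.com$path?e=$expires&s=$sig";
}

// On the serving side (web server module or a tiny script), re-check it:
function permission_ok($path, $expires, $sig, $secret) {
    if (time() > (int)$expires) return false;         // expired
    return sha1($secret . $path . $expires) === $sig; // untampered
}

echo permission_url('/photos/1234/secret_5678.jpg', 3600, $secret);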

11.

Storing Files


Storing lots of files

• Storing files is easy!
  – Get a big disk
  – Get a bigger disk
  – Uh oh!

• Horizontal scaling is the key
  – Again

Connecting to storage

• NFS
  – Stateful == Sucks
  – Hard mounts vs Soft mounts, INTR

• SMB / CIFS / Samba
  – Turn off MSRPC & WINS (NetBIOS NS)
  – Stateful but degrades gracefully

• HTTP
  – Stateless == Yay!
  – Just use Apache

Multiple volumes

• Volumes are limited in total size
  – Except (in theory) under ZFS & others

• Sometimes we need multiple volumes for performance reasons
  – When using RAID with single/dual parity

• At some point, we need multiple volumes


Multiple hosts

• Further down the road, a single host will be too small

• Total throughput of machine becomes an issue

• Even physical space can start to matter

• So we need to be able to use multiple hosts



HA Storage

• HA is important for file assets too
  – We can back stuff up
  – But we tend to want hot redundancy

• RAID is good
  – RAID 5 is cheap, RAID 10 is fast

HA Storage

• But whole machines can fail

• So we stick assets on multiple machines

• In this case, we can ignore RAID
  – In a failure case, we serve from an alternative source
  – But need to weigh up the rebuild time and effort against the risk
  – Store more than 2 copies?


HA Storage

• But whole colos can fail

• So we stick assets in multiple colos

• Again, we can ignore RAID
  – Or even multi-machine per colo
  – But do we want to lose a whole colo because of a single disk failure?
  – Redundancy is always a balance

Self repairing systems

• When something fails, repairing can be a pain
  – RAID rebuilds by itself, but machine replication doesn’t

• The big appliances self-heal
  – NetApp, StorEdge, etc

• So does MogileFS (reaper)

Real Life

Case Studies


GFS – Google File System

• Developed by … Google

• Proprietary

• Everything we know about it is based on talks they’ve given

• Designed to store huge files for fast access


GFS – Google File System

• Single ‘Master’ node holds metadata
  – SPOF – Shadow master allows warm swap

• Grid of ‘chunkservers’
  – 64-bit filenames
  – 64MB file chunks


GFS – Google File System

• Client reads metadata from master then file parts from multiple chunkservers

• Designed for big files (>100MB)

• Master server allocates access leases

• Replication is automatic and self-repairing
  – Synchronously, for atomicity

GFS – Google File System

• Reading is fast (parallelizable)
  – But requires a lease

• Master server is required for all reads and writes

MogileFS – OMG Files

• Developed by Danga / SixApart

• Open source

• Designed for scalable web app storage


MogileFS – OMG Files

• Single metadata store (MySQL)
  – MySQL Cluster avoids the SPOF

• Multiple ‘tracker’ nodes locate files

• Multiple ‘storage’ nodes store files


MogileFS – OMG Files

• Replication of file ‘classes’ happens transparently

• Storage nodes are not mirrored – replication is piecemeal

• Reading and writing go through trackers, but are performed directly upon storage nodes


Flickr File System

• Developed by Flickr

• Proprietary

• Designed for very large scalable web app storage


Flickr File System

• No metadata store
  – Deal with it yourself

• Multiple ‘StorageMaster’ nodes

• Multiple storage nodes with virtual volumes


Flickr File System

• Metadata stored by app
  – Just a virtual volume number
  – App chooses a path

• Virtual nodes are mirrored
  – Locally and remotely

• Reading is done directly from nodes

Flickr File System

• StorageMaster nodes only used for write operations

• Reading and writing can scale separately


Amazon S3

• A big disk in the sky

• Multiple ‘buckets’

• Files have user-defined keys

• Data + metadata


Amazon S3

Servers → Amazon → Users

The cost

• Fixed price, by the GB

• Store: $0.15 per GB per month

• Serve: $0.20 per GB


The cost

S3 vs regular bandwidth

End costs

• ~$2k to store 1TB for a year

• ~$63 a month for 1Mb/s

• ~$65k a month for 1Gb/s
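Those serving numbers fall straight out of the per-GB price; a quick back-of-the-envelope check, assuming a 30-day month and the $0.20/GB rate above:

<?php
// Rough check of serving costs at $0.20 per GB transferred.
$per_gb   = 0.20;
$seconds  = 86400 * 30;              // one 30-day month

$mb_total = (1 / 8) * $seconds;      // 1 Mb/s = 0.125 MB/s, sustained
$gb_total = $mb_total / 1024;        // ~316 GB in a month

echo round($gb_total * $per_gb);     // ~63 dollars/month for 1 Mb/s
// 1 Gb/s is 1000x the traffic: ~$63k/month, the same ballpark as the
// ~$65k figure above.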

12.

Field Work


Real world examples

• Flickr
  – Because I know it

• LiveJournal
  – Because everyone copies it

Flickr Architecture

LiveJournal Architecture

Buy my book!


Or buy Theo’s


The end!


Awesome!

These slides are available online:

iamcal.com/talks/