Handling Data in Mega Scale Web Systems
![Page 1: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/1.jpg)
Vineet Gupta | GM – Software Engineering | Directi
http://www.vineetgupta.com
Licensed under Creative Commons Attribution Sharealike Noncommercial
Intelligent People. Uncommon Ideas.
![Page 2: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/2.jpg)
• 22M+ users
• Dozens of DB servers
• Dozens of Web servers
• Six specialized graph database servers to run recommendations engine
Source: http://highscalability.com/digg-architecture
![Page 3: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/3.jpg)
• 1 TB / Day
• 100 M blogs indexed / day
• 10 B objects indexed / day
• 0.5 B photos and videos
• Data doubles in 6 months
• Users double in 6 months
Source: http://www.royans.net/arch/2007/10/25/scaling-technorati-100-million-blogs-indexed-everyday/
![Page 4: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/4.jpg)
• 2 PB Raw Storage
• 470 M photos, 4-5 sizes each
• 400 k photos added / day
• 35 M photos in Squid cache (total)
• 2 M photos in Squid RAM
• 38k reqs / sec to Memcached
• 4 B queries / day
Source: http://mysqldba.blogspot.com/2008/04/mysql-uc-2007-presentation-file.html
![Page 5: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/5.jpg)
• Virtualized database spans 600 production instances residing in 100+ server clusters distributed over 8 datacenters
• 2 PB of data
• 26 B queries / day
• 1 B page views / day
• 3 B API calls / month
• 15,000 App servers
Source: http://highscalability.com/ebay-architecture/
![Page 6: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/6.jpg)
• 450,000 low cost commodity servers in 2006
• Indexed 8 B web-pages in 2005
• 200 GFS clusters (1 cluster = 1,000 – 5,000 machines)
• Read / write throughput = 40 GB / sec across a cluster
• Map-Reduce
  o 100k jobs / day
  o 20 PB of data processed / day
  o 10k MapReduce programs
Source: http://highscalability.com/google-architecture/
![Page 7: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/7.jpg)
• Data Size ~ PB
• Data Growth ~ TB / day
• No of servers – 10s to 10,000
• No of datacenters – 1 to 10
• Queries – B+ / day
![Page 8: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/8.jpg)
[Diagram: a Host containing an App Server and a DB Server, with CPU and RAM components]
![Page 9: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/9.jpg)
Sunfire X4640 M2
8 x 6-core 2.6 GHz
$ 27k to $ 170k

PowerEdge R200
Dual core 2.8 GHz
Around $ 550
![Page 10: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/10.jpg)
• Increasing the hardware resources on a host
• Pros
  o Simple to implement
  o Fast turnaround time
• Cons
  o Finite limit
  o Hardware does not scale linearly (diminishing returns for each incremental unit)
  o Requires downtime
  o Increases Downtime Impact
  o Incremental costs increase exponentially
![Page 11: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/11.jpg)
[Diagram: App Layer over a single database node holding tables T1, T2, T3, T4]
![Page 12: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/12.jpg)
[Diagram: App Layer over several database nodes, each holding tables T1, T2, T3, T4]
• Each node has its own copy of data
• Shared Nothing Cluster
![Page 13: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/13.jpg)
• Read : Write = 4:1
  o Scale reads at cost of writes!
• Duplicate Data – each node has its own copy
• Master Slave
  o Writes sent to one node, cascaded to others
• Multi-Master
  o Writes can be sent to multiple nodes
  o Can lead to deadlocks
  o Requires conflict management
![Page 14: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/14.jpg)
[Diagram: App Layer over one Master and four Slaves]
• n x Writes – Async vs. Sync
• SPOF
• Async - Critical Reads from Master!
![Page 15: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/15.jpg)
[Diagram: App Layer over two Masters and three Slaves]
• n x Writes – Async vs. Sync
• No SPOF
• Conflicts! O(N^2) or O(N^3) resolution
![Page 16: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/16.jpg)
[Diagram: Write and Read load on each server as replicas are added]
Per Server:
• 4R, 1W
• 2R, 1W
• 1R, 1W
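The per-server figures above follow from simple arithmetic: reads are divided among the replicas, but every write has to be applied on every replica. A minimal sketch, assuming the 4:1 read-to-write mix from the earlier slide and an even spread of reads (the function name is illustrative, not from the talk):

```python
def per_server_load(reads, writes, replicas):
    """Return (reads, writes) handled by each replica.

    Reads are spread evenly across replicas; every write is
    applied on every replica, so write load does not shrink.
    """
    return reads / replicas, writes


for n in (1, 2, 4):
    r, w = per_server_load(reads=4, writes=1, replicas=n)
    print(f"{n} server(s): {r:g}R, {w}W each")
```

Once each server does as much write work as read work, adding further replicas buys very little, which is the motivation for partitioning.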
![Page 17: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/17.jpg)
• Vertical Partitioning
  o Divide data on tables / columns
  o Scale to as many boxes as there are tables or columns
  o Finite
• Horizontal Partitioning
  o Divide data on rows
  o Scale to as many boxes as there are rows!
  o Limitless scaling
![Page 18: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/18.jpg)
[Diagram: App Layer over a single node holding tables T1, T2, T3, T4, T5]
![Page 19: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/19.jpg)
[Diagram: App Layer over separate nodes, one for each of the tables T1, T2, T3, T4, T5]
• Facebook - User table, posts table can be on separate nodes
• Joins need to be done in code (Why have them?)
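To illustrate what "joins in code" means once tables live on separate nodes, here is a small sketch. The two nodes are modeled as in-memory structures and all names and rows are hypothetical; a real system would issue one query per node instead.

```python
# Two tables living on separate nodes, modeled as plain Python data
# (hypothetical rows, for illustration only).
users_node = {1: {"name": "Joe"}, 2: {"name": "Sally"}}
posts_node = [
    {"post_id": 10, "user_id": 1, "text": "hello"},
    {"post_id": 11, "user_id": 2, "text": "hi"},
    {"post_id": 12, "user_id": 1, "text": "again"},
]


def posts_with_author(posts, users):
    """Join posts to users in application code: one user lookup per post."""
    return [
        {"text": p["text"], "author": users[p["user_id"]]["name"]}
        for p in posts
    ]


for row in posts_with_author(posts_node, users_node):
    print(row["author"], "wrote", row["text"])
```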
![Page 20: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/20.jpg)
[Diagram: App Layer over three clusters, each holding tables T1 to T5 for one range of rows: first million rows, second million rows, third million rows]
![Page 21: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/21.jpg)
• Value Based
  o Split on timestamp of posts
  o Split on first letter of user name
• Hash Based
  o Use a hash function to determine cluster
• Lookup Map
  o First Come First Serve
  o Round Robin
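The three placement schemes above can be sketched as small routing functions. This is only an illustration: the shard count, the choice of SHA-256, and every name below are assumptions, not details from the talk.

```python
import hashlib

NUM_SHARDS = 4  # assumed cluster count for illustration


def value_based(user_name):
    """Split on the first letter of the user name (a-z into NUM_SHARDS ranges)."""
    first = user_name[0].lower()
    if not "a" <= first <= "z":
        return 0  # non-alphabetic names fall into the first shard
    return (ord(first) - ord("a")) * NUM_SHARDS // 26


def hash_based(key):
    """Use a stable hash of the key to pick a shard."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


class LookupMap:
    """Directory that assigns each new key a shard round-robin and remembers it."""

    def __init__(self, num_shards):
        self.num_shards = num_shards
        self.assignments = {}
        self.next_shard = 0

    def shard_for(self, key):
        if key not in self.assignments:
            self.assignments[key] = self.next_shard
            self.next_shard = (self.next_shard + 1) % self.num_shards
        return self.assignments[key]
```

Value-based splits keep ranges together but can skew load; hashing spreads keys evenly but loses range locality; a lookup map allows arbitrary placement at the cost of an extra directory to maintain.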
![Page 22: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/22.jpg)
Source: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.1495
![Page 23: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/23.jpg)
• In distributed systems, much weaker forms of consistency are often acceptable, e.g.,
  o Only a few (or even one) possible writers of data, and/or
  o Read-mostly data (seldom modified), and/or
  o Stale data may be acceptable
• Eventual consistency
  o If no updates take place for a long time, all replicas will eventually become consistent
• Implementation
  o Need only ensure updates eventually reach all of the replicated copies of the data
![Page 24: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/24.jpg)
• Monotonic Reads
  o If a node sees a version x at time t, it will never see an older version at a later time
• Monotonic Writes
  o A write operation by a process on a data item x is completed before any successive write operation on x by the same process
• Read your writes
  o The effect of a write operation by a process on data item x will always be seen by a successive read operation on x by the same process
• Writes follow Reads
  o Write occurs on a copy of x that is at least as recent as the last copy read by the process
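One way to picture two of these guarantees (monotonic reads and read your writes) is a client session that remembers the newest version it has seen and refuses to read from a replica that lags behind it. The sketch below is a simplification under stated assumptions: a single data item, a single integer version number, and hypothetical `Replica` and `Session` classes.

```python
class Replica:
    def __init__(self):
        self.version = 0
        self.value = None

    def apply(self, version, value):
        # Keep only the newest version this replica has heard about.
        if version > self.version:
            self.version, self.value = version, value


class Session:
    """Client session enforcing monotonic reads and read-your-writes."""

    def __init__(self):
        self.min_version = 0  # newest version this session has read or written

    def write(self, replica, value):
        version = max(replica.version, self.min_version) + 1
        replica.apply(version, value)
        self.min_version = version

    def read(self, replica):
        if replica.version < self.min_version:
            raise LookupError("replica is stale for this session; try another")
        self.min_version = replica.version
        return replica.value


a, b = Replica(), Replica()
session = Session()
session.write(a, "v1")
print(session.read(a))       # replica a has the session's own write
b.apply(a.version, a.value)  # replication catches up
print(session.read(b))       # now b is safe to read from as well
```

Reading from `b` before replication catches up would raise `LookupError`, which is the session refusing to go back in time.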
![Page 25: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/25.jpg)
• Many Kinds of Computing are “Append-Only”
  o Lots of observations are made about the world
    - Debits, credits, Purchase-Orders, Customer-Change-Requests, etc
  o As time moves on, more observations are added
    - You can’t change the history but you can add new observations
• Derived Results May Be Calculated
  o Estimate of the “current” inventory
  o Frequently inaccurate
• Historic Rollups Are Calculated
  o Monthly bank statements
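The append-only idea can be sketched with a tiny ledger: observations are only ever added, the "current" balance is derived from the history seen so far, and a monthly statement is a rollup. The account name, months, and amounts are made up for illustration.

```python
from collections import defaultdict

ledger = []  # append-only list of (month, account, amount) observations


def record(month, account, amount):
    """Add a new observation; existing entries are never modified."""
    ledger.append((month, account, amount))


def current_balance(account):
    """Derived result: recomputed from the history seen so far."""
    return sum(amt for _, acct, amt in ledger if acct == account)


def monthly_statement(account):
    """Historic rollup: net change per month."""
    totals = defaultdict(int)
    for month, acct, amt in ledger:
        if acct == account:
            totals[month] += amt
    return dict(totals)


record("2009-01", "joe", 100)  # credit
record("2009-01", "joe", -30)  # debit
record("2009-02", "joe", 30)   # reversal of the debit, added as a new entry
print(current_balance("joe"))
print(monthly_statement("joe"))
```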
![Page 26: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/26.jpg)
![Page 27: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/27.jpg)
• 5 joins for 1 query!
  o Do you think FB would do this?
  o And how would you do joins with partitioned data?
• De-normalization removes joins
• But increases data volume
  o However disk is cheap and getting cheaper
• And can lead to inconsistent data
  o But only if we do UPDATEs and DELETEs
![Page 28: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/28.jpg)
• Normalization’s Goal Is Eliminating Update Anomalies
  o Can Be Changed Without “Funny Behavior”
  o Each Data Item Lives in One Place

| Emp # | Emp Name | Mgr # | Mgr Name | Emp Phone | Mgr Phone |
|---|---|---|---|---|---|
| 47 | Joe | 13 | Sam | 5-1234 | 6-9876 |
| 18 | Sally | 38 | Harry | 3-3123 | 5-6782 |
| 91 | Pete | 13 | Sam | 2-1112 | 6-9876 |
| 66 | Mary | 02 | Betty | 5-7349 | 4-0101 |

Classic problem with de-normalization: can’t update Sam’s phone # since there are many copies.
De-normalization is OK if you aren’t going to update!
Source: http://blogs.msdn.com/pathelland/
![Page 29: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/29.jpg)
• Partitioning for scaling
  o Replication for availability
• No ACID transactions
• No JOINs
• Immutable data
  o No cascaded UPDATEs and DELETEs
![Page 30: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/30.jpg)
![Page 31: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/31.jpg)
• Partitioning – for R/W scaling
• Replication – for availability
• Versioning – for immutable data
• Eventual Consistency
• Error detection and handling
![Page 32: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/32.jpg)
• Google – BigTable
• Amazon – Dynamo
• Facebook – Cassandra (BigTable + Dynamo)
• LinkedIn – Voldemort (similar to Dynamo)
• Many more
![Page 33: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/33.jpg)
• Tens of millions of customers served at peak times
• Tens of thousands of servers
• Both customers and servers distributed world wide
![Page 34: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/34.jpg)
http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
• Eventually consistent data store
• Always writable
• Decentralized
• All nodes have the same responsibilities
![Page 35: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/35.jpg)
![Page 36: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/36.jpg)
• Similar to Chord
  o Each node gets an ID from the space of keys
  o Nodes are arranged in a ring
  o Data stored on the first node clockwise of the current placement of the data key
• Replication
  o Preference lists of N nodes following the associated node
![Page 37: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/37.jpg)
• A problem with the Chord scheme
  o Nodes placed randomly on ring
  o Leads to uneven data & load distribution
• In Dynamo
  o “Virtual” nodes
  o Each physical node has multiple virtual nodes
    - More powerful machines have more virtual nodes
  o Distribute virtual nodes across the ring
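A ring with virtual nodes and a preference list can be sketched in a few lines. This is not Dynamo's actual code: the hash function, the 64-bit ring size, the `name#i` token naming, and the class and method names are all assumptions made for illustration.

```python
import bisect
import hashlib


def ring_position(key):
    """Map a string onto the ring (a 64-bit integer space here)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    def __init__(self, nodes):
        # nodes: {physical node name: number of virtual nodes}
        self.tokens = sorted(
            (ring_position(f"{name}#{i}"), name)
            for name, count in nodes.items()
            for i in range(count)
        )
        self.positions = [pos for pos, _ in self.tokens]

    def preference_list(self, key, n):
        """First n distinct physical nodes clockwise from the key."""
        start = bisect.bisect_left(self.positions, ring_position(key))
        owners = []
        for step in range(len(self.tokens)):
            _, name = self.tokens[(start + step) % len(self.tokens)]
            if name not in owners:
                owners.append(name)
            if len(owners) == n:
                break
        return owners


# A more powerful machine ("C") is given more virtual nodes.
ring = Ring({"A": 8, "B": 8, "C": 16})
print(ring.preference_list("cart:42", 3))
```

Skipping tokens that belong to an already chosen physical machine is what keeps the N replicas on N distinct hosts even though each host appears many times on the ring.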
![Page 38: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/38.jpg)
• Updates generate a new timestamp
  o Vector clocks are used
• Eventual consistency
  o Multiple versions of the same object might co-exist
• Syntactic Reconciliation
  o System might be able to resolve conflicts automatically
• Semantic Reconciliation
  o Conflict resolution pushed to application
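The difference between the two kinds of reconciliation can be sketched with vector clocks represented as dictionaries from node name to counter. The helper names and the node names (Sx, Sy, Sz) are illustrative; this is a sketch of the general technique, not Dynamo's implementation.

```python
def descends(a, b):
    """True if clock a has seen everything clock b has (a >= b for every node)."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def increment(clock, node):
    """Return a copy of clock with this node's counter advanced by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def reconcile(versions):
    """Syntactic reconciliation: drop versions that another version supersedes.

    versions is a list of (clock, value) pairs. More than one survivor means
    the versions are concurrent and the application must merge them.
    """
    survivors = []
    for i, (clock, value) in enumerate(versions):
        dominated = any(
            j != i and descends(other, clock) and other != clock
            for j, (other, _) in enumerate(versions)
        )
        if not dominated and (clock, value) not in survivors:
            survivors.append((clock, value))
    return survivors


v1 = increment({}, "Sx")
v2 = increment(v1, "Sx")  # descends from v1: the system can discard v1
v3 = increment(v2, "Sy")  # written via node Sy
v4 = increment(v2, "Sz")  # written concurrently via node Sz
print(len(reconcile([(v1, "a"), (v2, "b")])))  # one survivor
print(len(reconcile([(v3, "c"), (v4, "d")])))  # two survivors: app must merge
```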
![Page 39: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/39.jpg)
![Page 40: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/40.jpg)
• Request arrives at a node (coordinator)
  o Ideally the node responsible for the particular key
  o Else forwards request to the node responsible for that key and that node will become the coordinator
• The first N healthy and distinct nodes following the key position are considered for the request
• Application defines
  o N = total number of participating nodes
  o R = number of nodes required for successful read
  o W = number of nodes required for successful write
• R + W > N gives quorum
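The reason R + W > N matters is that any set of R nodes answering a read must then share at least one node with any set of W nodes that acknowledged a write. A brute-force check over small values makes this concrete (the function name is illustrative):

```python
from itertools import combinations


def overlaps(n, r, w):
    """True if every R-node read set shares a node with every W-node write set."""
    nodes = range(n)
    return all(
        set(reads) & set(writes)
        for reads in combinations(nodes, r)
        for writes in combinations(nodes, w)
    )


for n, r, w in [(3, 2, 2), (3, 1, 1), (3, 1, 3)]:
    print((n, r, w), "overlap guaranteed:", overlaps(n, r, w))
```

With (3, 1, 1) a read can land on a node that missed the latest write, so lower latency is bought with weaker consistency.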
![Page 41: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/41.jpg)
• Writes
  o Requires generation of a new vector clock by coordinator
  o Coordinator writes locally
  o Forwards to N nodes, if W-1 respond then the write was successful
• Reads
  o Forwards to N nodes, if R-1 respond then forwards to user
  o Only unique responses forwarded
  o User handles merging if multiple versions exist
![Page 42: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/42.jpg)
• Sloppy Quorum
  o Read write ops performed on first N healthy nodes
  o Increases availability
• Hinted Handoff
  o If node in preference list is not available, send replica to a node further down in the list
  o With a hint containing the identity of the original node
  o The receiving node keeps checking for the original
  o If the original becomes available, transfers replica to it
![Page 43: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/43.jpg)
• Replica Synchronization
  o Synchronize with another node
  o Each node maintains a separate Merkle tree for each key range it hosts
  o Nodes exchange roots of trees for common key-ranges
  o Quickly determine divergent keys by comparing hashes
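The hash comparison can be sketched with a small Merkle tree over one key range. This sketch assumes both replicas hold the same, non-empty set of keys (so the trees have the same shape) and uses SHA-256; the tuple layout and function names are illustrative only.

```python
import hashlib


def h(data):
    return hashlib.sha256(data).hexdigest()


def build(items):
    """Build a Merkle tree over the (key, value) pairs, sorted by key.

    Returns nested tuples: (hash, key, None, None) for a leaf,
    (hash, None, left, right) for an inner node.
    """
    level = [(h(f"{k}={v}".encode()), k, None, None) for k, v in sorted(items.items())]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            left = level[i]
            right = level[i + 1] if i + 1 < len(level) else None
            combined = left[0] + (right[0] if right else "")
            nxt.append((h(combined.encode()), None, left, right))
        level = nxt
    return level[0]


def diff(a, b):
    """Keys whose leaves differ, descending only where hashes disagree."""
    if a is None or b is None or a[0] == b[0]:
        return set()
    if a[1] is not None:  # leaf
        return {a[1]}
    return diff(a[2], b[2]) | diff(a[3], b[3])


replica_a = {"k1": "x", "k2": "y", "k3": "z"}
replica_b = dict(replica_a, k2="stale")
print(diff(build(replica_a), build(replica_b)))
```

If the root hashes match, the replicas agree on the whole range and nothing else needs to be exchanged; otherwise only the disagreeing subtrees are walked.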
![Page 44: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/44.jpg)
• Ring Membership
  o Membership is explicit to avoid re-balancing of partition assignment
  o Use background gossip to build 1-hop DHT
  o Use external entity to bootstrap the system to avoid partitioned rings
• Failure Detection
  o Node A finds node B unreachable (for servicing a request)
  o A uses other nodes to service requests and periodically checks B
  o A does not assume B to have failed
  o No globally consistent view of failure (because of explicit ring membership)
![Page 45: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/45.jpg)
• Application Configurable (N, R, W)
• Every node is aware of the data hosted by its peers
  o requiring the gossiping of the full routing table with other nodes
  o scalability is limited by this to a few hundred nodes
  o hierarchy may help to overcome the limitation
![Page 46: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/46.jpg)
• Typical configuration for the Dynamo (N, R, W) is (3, 2, 2)
• Some implementations vary (N, R, W)
  o Always write might have W=1 (Shopping Cart)
  o Product catalog might have R=1 and W=N
• Response requirement is 300ms for any request (read or write)
![Page 47: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/47.jpg)
• Consistency vs. Availability
  o 99.94% one version
  o 0.00057% two
  o 0.00047% three
  o 0.00009% four
• Server-driven or Client-driven coordination
  o Server-driven
    - uses load balancers
    - forwards requests to desired set of nodes
  o Client-driven
    - 50% faster
    - requires the polling of Dynamo membership updates
    - the client is responsible for determining the appropriate nodes to send the request to
• Successful responses (without time-out) 99.9995%
![Page 48: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/48.jpg)
![Page 49: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/49.jpg)
![Page 50: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/50.jpg)
![Page 51: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/51.jpg)
![Page 52: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/52.jpg)
• Enormous data (and high growth)
  o Traditional solutions don’t work
• Distributed databases
  o Lots of interesting work happening
• Great time for young programmers!
  o Problem solving ability
![Page 53: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/53.jpg)
![Page 54: Handling Data in Mega Scale Web Systems](https://reader033.fdocuments.us/reader033/viewer/2022060109/5558c42fd8b42a995d8b461e/html5/thumbnails/54.jpg)