Handling Data in Mega Scale Systems

Transcript of Handling Data in Mega Scale Systems

Page 1: Handling Data in Mega Scale Systems

Handling Data in Mega Scale Web Apps (lessons learnt @ Directi)

Vineet Gupta | GM – Software Engineering | Directi
http://vineetgupta.spaces.live.com

Licensed under Creative Commons Attribution Sharealike Noncommercial

Intelligent People. Uncommon Ideas.

Page 2: Handling Data in Mega Scale Systems

Outline

• Characteristics
• App Tier Scaling
• Replication
• Partitioning
• Consistency
• Normalization
• Caching
• Data Engine Types

Page 3: Handling Data in Mega Scale Systems

Not Covering

• Offline Processing (Batching / Queuing)
• Distributed Processing – Map Reduce
• Non-blocking IO
• Fault Detection, Tolerance and Recovery

Page 4: Handling Data in Mega Scale Systems

Outline

• Characteristics
• App Tier Scaling
• Replication
• Partitioning
• Consistency
• Normalization
• Caching
• Data Engine Types

Page 5: Handling Data in Mega Scale Systems

How Big Does it Get

• 22M+ users
• Dozens of DB servers
• Dozens of Web servers
• Six specialized graph database servers to run the recommendations engine

Source: http://highscalability.com/digg-architecture

Page 6: Handling Data in Mega Scale Systems

How Big Does it Get

• 1 TB / day
• 100 M blogs indexed / day
• 10 B objects indexed / day
• 0.5 B photos and videos
• Data doubles in 6 months
• Users double in 6 months

Source: http://www.royans.net/arch/2007/10/25/scaling-technorati-100-million-blogs-indexed-everyday/

Page 7: Handling Data in Mega Scale Systems

How Big Does it Get

• 2 PB raw storage
• 470 M photos, 4–5 sizes each
• 400 K photos added / day
• 35 M photos in Squid cache (total)
• 2 M photos in Squid RAM
• 38 K reqs / sec to Memcached
• 4 B queries / day

Source: http://mysqldba.blogspot.com/2008/04/mysql-uc-2007-presentation-file.html

Page 8: Handling Data in Mega Scale Systems

How Big Does it Get

• Virtualized database spans 600 production instances residing in 100+ server clusters distributed over 8 datacenters

• 2 PB of data
• 26 B SQL queries / day
• 1 B page views / day
• 3 B API calls / month
• 15,000 App servers

Source: http://highscalability.com/ebay-architecture/

Page 9: Handling Data in Mega Scale Systems

How Big Does it Get

• 450,000 low-cost commodity servers in 2006
• Indexed 8 B web pages in 2005
• 200 GFS clusters (1 cluster = 1,000 – 5,000 machines)
• Read / write throughput = 40 GB / sec across a cluster
• Map-Reduce
  – 100 K jobs / day
  – 20 PB of data processed / day
  – 10 K MapReduce programs

Source: http://highscalability.com/google-architecture/

Page 10: Handling Data in Mega Scale Systems

Key Trends

• Data size ~ PB
• Data growth ~ TB / day
• No. of servers – 10s to 10,000
• No. of datacenters – 1 to 10
• Queries – B+ / day
• Specialized needs – more / other than RDBMS

Page 11: Handling Data in Mega Scale Systems

Outline

• Characteristics
• App Tier Scaling
• Replication
• Partitioning
• Consistency
• Normalization
• Caching
• Data Engine Types

Page 12: Handling Data in Mega Scale Systems

Vertical Scaling (Scaling Up)

[Diagram: a single host running both the App Server and the DB Server; scaling up means adding more CPUs and RAM to that one host]

Page 13: Handling Data in Mega Scale Systems

Big Irons

• Sun Fire E20K: 36x 1.8 GHz processors, $450,000 – $2,500,000
• PowerEdge SC1435: dual-core 1.8 GHz processor, around $1,500

Page 14: Handling Data in Mega Scale Systems

Vertical Scaling (Scaling Up)

• Increasing the hardware resources on a host
• Pros
  – Simple to implement
  – Fast turnaround time
• Cons
  – Finite limit
  – Hardware does not scale linearly (diminishing returns for each incremental unit)
  – Requires downtime
  – Increases downtime impact
  – Incremental costs increase exponentially

Page 15: Handling Data in Mega Scale Systems

Vertical Partitioning of Services

[Diagram: the App Server and the DB Server split onto two separate hosts]

Page 16: Handling Data in Mega Scale Systems

Vertical Partitioning of Services

• Split services on separate nodes
  – Each node performs different tasks
• Pros
  – Increases per-application availability
  – Task-based specialization, optimization and tuning possible
  – Reduces context switching
  – Simple to implement for out-of-band processes
  – No changes to App required
  – Flexibility increases
• Cons
  – Sub-optimal resource utilization
  – May not increase overall availability
  – Finite scalability

Page 17: Handling Data in Mega Scale Systems

Horizontal Scaling of App Server

[Diagram: a Load Balancer distributing traffic across several identical Web Servers, all backed by a single DB Server]

Page 18: Handling Data in Mega Scale Systems

Horizontal Scaling of App Server

• Add more nodes for the same service
  – Identical, doing the same task
• Load Balancing
  – Hardware balancers are faster
  – Software balancers are more customizable

Page 19: Handling Data in Mega Scale Systems

The problem - State

[Diagram: User 1 and User 2 requests pass through the Load Balancer and may land on different Web Servers, so state held on one Web Server is not visible to the others]

Page 20: Handling Data in Mega Scale Systems

Sticky Sessions

[Diagram: the Load Balancer pins each of User 1 and User 2 to one particular Web Server for the life of their session]

• Asymmetrical load distribution
• Downtime

Page 21: Handling Data in Mega Scale Systems

Central Session Store

[Diagram: Web Servers behind the Load Balancer all read and write session data from a shared Session Store]

• SPOF
• Reads and writes generate network + disk IO

Page 22: Handling Data in Mega Scale Systems

Clustered Sessions

[Diagram: Web Servers behind the Load Balancer replicate session data to one another; there is no separate session store]

Page 23: Handling Data in Mega Scale Systems

Clustered Sessions

• Pros
  – No SPOF
  – Easier to set up
  – Fast reads
• Cons
  – n x writes
  – Increase in network IO with increase in nodes
  – Stale data (rare)

Page 24: Handling Data in Mega Scale Systems

Sticky Sessions with Central Store

[Diagram: the Load Balancer uses sticky sessions while sessions are also written to a central store (the DB Server), so another Web Server can recover them if the sticky node fails]

Page 25: Handling Data in Mega Scale Systems

More Session Management

• No Sessions
  – Stuff state in a cookie and sign it! (see the sketch below)
  – Cookie is sent with every request / response
• Super Slim Sessions
  – Keep a small amount of frequently used data in the cookie
  – Pull the rest from the DB (or central session store)
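The "sign it" step can be as small as an HMAC over the serialized state. A minimal Python sketch, assuming a server-side secret (SECRET_KEY here is a hypothetical placeholder) and a JSON-serializable state dict; real web frameworks ship equivalent signed-cookie helpers.

```python
# Keep state in a signed cookie instead of a server-side session.
# Signed, not encrypted: the client can read the state but cannot alter it.
import base64, hashlib, hmac, json

SECRET_KEY = b"replace-with-a-real-secret"   # hypothetical server-side secret

def encode_state(state):
    """Serialize the state and append an HMAC so tampering is detectable."""
    payload = base64.urlsafe_b64encode(json.dumps(state).encode())
    sig = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + sig

def decode_state(cookie):
    """Return the state only if the signature checks out, else None."""
    payload, _, sig = cookie.rpartition(".")
    expected = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None                          # tampered or corrupted cookie
    return json.loads(base64.urlsafe_b64decode(payload))

cookie = encode_state({"user_id": 12345, "theme": "dark"})
print(decode_state(cookie))                  # {'user_id': 12345, 'theme': 'dark'}
```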

Page 26: Handling Data in Mega Scale Systems

Sessions - Recommendation

• Bad
  – Sticky sessions
• Good
  – Clustered sessions for a small number of nodes and / or small write volume
  – Central sessions for a large number of nodes or large write volume
• Great
  – No Sessions!

Page 27: Handling Data in Mega Scale Systems

App Tier Scaling - More

• HTTP Accelerators / Reverse Proxy
  – Static content caching, redirect to lighter HTTP
  – Async NIO on the user side, keep-alive connection pool
• CDN
  – Get closer to your user
  – Akamai, Limelight
• IP Anycasting
• Async NIO

Page 28: Handling Data in Mega Scale Systems

Scaling a Web App

• App Layer
  – Add more nodes and load balance!
  – Avoid sticky sessions
  – Avoid sessions!!
• Data Store
  – Tricky! Very tricky!!!

Page 29: Handling Data in Mega Scale Systems

Outline

• Characteristics
• App Tier Scaling
• Replication
• Partitioning
• Consistency
• Normalization
• Caching
• Data Engine Types

Page 30: Handling Data in Mega Scale Systems

Replication = Scaling by Duplication

[Diagram: the App Layer talking to a single database node holding tables T1, T2, T3, T4]

Page 31: Handling Data in Mega Scale Systems

Replication = Scaling by Duplication

[Diagram: the App Layer talking to several database nodes, each holding its own full copy of T1, T2, T3, T4]

• Each node has its own copy of data
• Shared Nothing Cluster

Page 32: Handling Data in Mega Scale Systems

Replication

• Read : Write = 4:1
  – Scale reads at the cost of writes!
• Duplicate data – each node has its own copy
• Master-Slave
  – Writes sent to one node, cascaded to others
• Multi-Master
  – Writes can be sent to multiple nodes
  – Can lead to deadlocks
  – Requires conflict management

Page 33: Handling Data in Mega Scale Systems

Master-Slave

[Diagram: the App Layer sends writes to the Master, which replicates them to several Slaves]

• n x writes – async vs. sync
• SPOF
• Async – critical reads from Master!

Page 34: Handling Data in Mega Scale Systems

Multi-Master

[Diagram: the App Layer can send writes to either Master; the Masters replicate to each other and to the Slaves]

• n x writes – async vs. sync
• No SPOF
• Conflicts!

Page 35: Handling Data in Mega Scale Systems

Replication Considerations

• Asynchronous
  – Guaranteed, but out-of-band replication from Master to Slave
  – Master updates its own db and returns a response to the client
  – Replication from Master to Slave takes place asynchronously
  – Faster response to the client
  – Slave data is marginally behind the Master
  – Requires modification to the App to send critical reads and writes to the Master, and load-balance all other reads
• Synchronous
  – Guaranteed, in-band replication from Master to Slave
  – Master updates its own db, and confirms all slaves have updated their db before returning a response to the client
  – Slower response to the client
  – Slaves have the same data as the Master at all times
  – Requires modification to the App to send writes to the Master and load-balance all reads

Page 36: Handling Data in Mega Scale Systems

Replication Considerations

• Replication at RDBMS level
  – Support may exist in the RDBMS or through a 3rd-party tool
  – Faster and more reliable
  – App must send writes to the Master, reads to any db and critical reads to the Master
• Replication at Driver / DAO level (see the sketch below)
  – Driver / DAO layer ensures
    • Writes are performed on all connected DBs
    • Reads are load balanced
  – Critical reads are sent to a Master
  – In most cases RDBMS agnostic
  – Slower and in some cases less reliable
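A minimal sketch of the Driver / DAO approach described above, in Python. The connection objects and their execute() method are hypothetical stand-ins for whatever database driver is actually used; conflict handling, retries and per-node failure are ignored.

```python
# DAO-level replication: the layer itself performs every write on all
# connected DBs, load-balances ordinary reads, and sends critical reads to a
# designated master.
import itertools

class ReplicatingDAO:
    def __init__(self, connections, master_index=0):
        self.connections = list(connections)
        self.master = self.connections[master_index]
        self._read_cycle = itertools.cycle(self.connections)  # round-robin reads

    def write(self, sql, params=()):
        # Fan the write out to every connected database.
        return [conn.execute(sql, params) for conn in self.connections]

    def read(self, sql, params=(), critical=False):
        # Critical reads must see the latest write, so they go to the master;
        # everything else is load-balanced across all nodes.
        conn = self.master if critical else next(self._read_cycle)
        return conn.execute(sql, params)
```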

Page 37: Handling Data in Mega Scale Systems

Diminishing Returns

[Diagram: a growing row of replicas, each still handling both reads and writes]

Per server:
• 4R, 1W
• 2R, 1W
• 1R, 1W

(See the calculation below.)
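The per-server numbers fall out of a one-line calculation: replicas share the reads but each must still apply every write. A tiny Python illustration using the slide's 4:1 mix (the load figures are the slide's, the script is just arithmetic).

```python
# A fixed load of 4 reads and 1 write per unit time. Replication spreads the
# reads across servers but duplicates every write on every server, so the
# write cost per box never shrinks.
READS, WRITES = 4, 1
for servers in (1, 2, 4):
    per_server_reads = READS / servers
    print("%d server(s): %gR, %dW per server" % (servers, per_server_reads, WRITES))
# 1 server(s): 4R, 1W per server
# 2 server(s): 2R, 1W per server
# 4 server(s): 1R, 1W per server
```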

Page 38: Handling Data in Mega Scale Systems

Outline

• Characteristics
• App Tier Scaling
• Replication
• Partitioning
• Consistency
• Normalization
• Caching
• Data Engine Types

Page 39: Handling Data in Mega Scale Systems

Partitioning = Scaling by Division

• Vertical Partitioning
  – Divide data on tables / columns
  – Scale to as many boxes as there are tables or columns
  – Finite
• Horizontal Partitioning
  – Divide data on rows
  – Scale to as many boxes as there are rows!
  – Limitless scaling

Page 40: Handling Data in Mega Scale Systems

Vertical Partitioning

[Diagram: the App Layer talking to a single node holding tables T1, T2, T3, T4, T5]

• Note: A node here typically represents a shared nothing cluster

Page 41: Handling Data in Mega Scale Systems

Vertical Partitioning

[Diagram: the App Layer talking to five nodes, each holding one of the tables T1 – T5]

• Facebook – the user table and the posts table can be on separate nodes
• Joins need to be done in code (Why have them? See the sketch below)
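What "joins in code" looks like in practice: query each node separately and stitch the rows together in the application. A hedged Python sketch; users_db / posts_db and their query() method are hypothetical stand-ins for real connections that return rows as dicts.

```python
# The user table and the posts table live on different database nodes, so the
# join moves into application code: two queries, merged in memory.
def posts_with_author(posts_db, users_db, user_id):
    posts = posts_db.query(
        "SELECT post_id, body FROM posts WHERE user_id = %s", (user_id,))
    user = users_db.query(
        "SELECT user_id, first_name, last_name FROM users WHERE user_id = %s",
        (user_id,))[0]
    # The "join" happens here, in application code.
    author = "%s %s" % (user["first_name"], user["last_name"])
    return [dict(post, author=author) for post in posts]
```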

Page 42: Handling Data in Mega Scale Systems

Horizontal Partitioning

[Diagram: the App Layer talking to three nodes, each holding all of tables T1 – T5 but only a slice of the rows: the first million rows, the second million rows, the third million rows]

Page 43: Handling Data in Mega Scale Systems

Horizontal Partitioning Schemes

• Value Based
  – Split on timestamp of posts
  – Split on first alphabet of user name
• Hash Based
  – Use a hash function to determine the cluster
• Lookup Map
  – First Come First Serve
  – Round Robin

(A sketch of these schemes follows.)
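A compact Python sketch of the three schemes; the shard list, the md5 choice, and the two-way value split are illustrative, not prescriptive.

```python
import hashlib

SHARDS = ["db01", "db02", "db03", "db04"]   # illustrative shard names

def value_based_shard(username):
    # Value based: split on the first letter of the user name (two-way here).
    return SHARDS[0] if username[0].lower() < "n" else SHARDS[1]

def hash_based_shard(user_id):
    # Hash based: a hash of the key decides the cluster. md5 keeps placement
    # stable across processes (unlike Python's built-in hash()).
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Lookup map: an explicit table, filled first-come-first-serve / round-robin,
# usually kept in its own small database or cache.
lookup_map = {}
def lookup_shard(user_id):
    if user_id not in lookup_map:
        lookup_map[user_id] = SHARDS[len(lookup_map) % len(SHARDS)]  # round robin
    return lookup_map[user_id]
```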

Page 44: Handling Data in Mega Scale Systems

Outline

• Characteristics
• App Tier Scaling
• Replication
• Partitioning
• Consistency
• Normalization
• Caching
• Data Engine Types

Page 45: Handling Data in Mega Scale Systems

CAP Theorem

[Diagram: the CAP triangle – Consistency, Availability, Partition Tolerance]

Source: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.1495

Page 46: Handling Data in Mega Scale Systems

Transactions

• Transactions make you feel alone
  – No one else manipulates the data when you are
• Transactional serializability
  – The behavior is as if a serial order exists

[Diagram: transaction serializability – relative to a transaction Ti, some transactions precede Ti, some follow Ti, and the rest don’t know about Ti and Ti doesn’t know about them]

Source: http://blogs.msdn.com/pathelland/

Page 47: Handling Data in Mega Scale Systems

Life in the “Now”

• Transactions live in the “now” inside services
  – Time marches forward
  – Transactions commit, advancing time
  – Transactions see the committed transactions
• A service’s biz-logic lives in the “now”

[Diagram: within each service, each transaction only sees a simple advancing of time with a clear set of preceding transactions]

Source: http://blogs.msdn.com/pathelland/

Page 48: Handling Data in Mega Scale Systems

Sending Unlocked Data Isn’t “Now”

• Messages contain unlocked data
  – Assume no shared transactions
• Unlocked data may change
  – Unlocking it allows change
• Messages are not from the “now”
  – They are from the past

There is no simultaneity at a distance!
• Similar to the speed of light
• Knowledge travels at the speed of light
• By the time you see a distant object, it may have changed!
• By the time you see a message, the data may have changed!

Services, transactions, and locks bound simultaneity!
• Inside a transaction, things appear simultaneous (to others)
• Simultaneity only inside a transaction!
• Simultaneity only inside a service!

Source: http://blogs.msdn.com/pathelland/

Page 49: Handling Data in Mega Scale Systems

Outside Data: a Blast from the Past

• All data seen from a distant service is from the “past”
  – By the time you see it, it has been unlocked and may change
• Each service has its own perspective
  – Inside data is “now”; outside data is “past”
  – My inside is not your inside; my outside is not your outside

All data from distant stars is from the past
• 10 light years away; 10-year-old knowledge
• The sun may have blown up 5 minutes ago
  – We won’t know for 3 minutes more…

This is like going from Newtonian to Einsteinian physics
• Newton’s time marched forward uniformly
  – Instant knowledge
• Classic distributed computing: many systems look like one
  – RPC, 2-phase commit, remote method calls…
• In Einstein’s world, everything is “relative” to one’s perspective
• Today: no attempt to blur the boundary

Source: http://blogs.msdn.com/pathelland/

Page 50: Handling Data in Mega Scale Systems

Versions and Distributed Systems

• Can’t have “the same” data at many locations
  – Unless it is a snapshot
• Changing distributed data needs versions
  – Creates a snapshot…

[Diagram: a data-owning service publishes versioned snapshots of its Price-List (Monday’s, Tuesday’s, Wednesday’s) to listening partner services, each of which may hold a different version at any moment]

Source: http://blogs.msdn.com/pathelland/

Page 51: Handling Data in Mega Scale Systems

Subjective Consistency

Given the information I have at hand, make a decision and act on it! Remember the information at hand!

• Given what I know here and now, make a decision
  – Remember the versions of all the data used to make this decision
  – Record the decision as being predicated on these versions
• Other copies of the object may make divergent decisions
  – Try to sort out conflicts within the family
  – If necessary, programmatically apologize
  – Very rarely, whine and fuss for human help

Ambassadors had authority: back before radio, it could be months between communications with the king. Ambassadors would make treaties and much more; they had binding authority. The mess was sorted out later!

Source: http://blogs.msdn.com/pathelland/

Page 52: Handling Data in Mega Scale Systems

Eventual Consistency

• Eventually, all the copies of the object share their changes
  – “I’ll show you mine if you show me yours!”
• Now, apply subjective consistency:
  – “Given the information I have at hand, make a decision and act on it!”
  – Everyone has the same information, everyone comes to the same conclusion about the decisions to take…

This is NOT magic; it is a design requirement! Idempotence, commutativity, and associativity of the operations (decisions made) are all implied by this requirement (see the sketch below).

Eventual Consistency: given the same knowledge, produce the same result! Everyone sharing their knowledge leads to the same result…

Source: http://blogs.msdn.com/pathelland/
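A toy Python illustration of that design requirement: if replica state merges with an operation that is idempotent, commutative and associative (here, set union), then replicas that have exchanged all their changes converge to the same value regardless of delivery order. Real systems (CRDTs, for example) build on the same property; this is not any particular product's protocol.

```python
# Three replicas have each seen different changes.
replica_a = {"item-1", "item-2"}
replica_b = {"item-2", "item-3"}
replica_c = {"item-4"}

def merge(x, y):
    return x | y          # set union: idempotent, commutative, associative

# Any merge order (and repeated merges) yields the same converged state.
assert merge(merge(replica_a, replica_b), replica_c) == \
       merge(replica_c, merge(replica_b, replica_a)) == \
       {"item-1", "item-2", "item-3", "item-4"}
```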

Page 53: Handling Data in Mega Scale Systems

Outline

• Characteristics
• App Tier Scaling
• Replication
• Partitioning
• Consistency
• Normalization
• Caching
• Data Engine Types

Page 54: Handling Data in Mega Scale Systems

Why Normalize?

• Normalization’s Goal Is Eliminating Update Anomalies
  – Can Be Changed Without “Funny Behavior”
  – Each Data Item Lives in One Place

Emp # | Emp Name | Mgr # | Mgr Name | Emp Phone | Mgr Phone
47    | Joe      | 13    | Sam      | 5-1234    | 6-9876
18    | Sally    | 38    | Harry    | 3-3123    | 5-6782
91    | Pete     | 13    | Sam      | 2-1112    | 6-9876
66    | Mary     | 02    | Betty    | 5-7349    | 4-0101

Classic problem with de-normalization: can’t update Sam’s phone # since there are many copies. De-normalization is OK if you aren’t going to update!

Source: http://blogs.msdn.com/pathelland/

Page 55: Handling Data in Mega Scale Systems

Eliminate Joins

user table
  user_id | first_name | last_name | sex  | hometown    | relationship_status | interested_in | religious_views | political_views
  12345   | John       | Doe       | Male | Atlanta, GA | married             | women         | (null)          | (null)

user_affiliations table
  user_id (foreign key) | affiliation_id (foreign key)
  12345                 | 42
  12345                 | 598

affiliations table
  affiliation_id | description  | member_count
  42             | Microsoft    | 18,656
  598            | Georgia Tech | 23,488

user_phone_numbers table
  user_id (foreign key) | phone_number | phone_type
  12345                 | 425-555-1203 | Home
  12345                 | 425-555-6161 | Work
  12345                 | 206-555-0932 | Cell

user_screen_names table
  user_id (foreign key) | screen_name     | im_service
  12345                 | [email protected] | AIM
  12345                 | [email protected] | Skype

user_work_history table
  user_id (foreign key) | company_affiliation_id (foreign key) | company_name    | job_title
  12345                 | 42                                   | Microsoft       | Program Manager
  12345                 | 78                                   | i2 Technologies | Quality Assurance Engineer

Page 56: Handling Data in Mega Scale Systems

Eliminate Joins

• 6 joins for 1 query!
  – Do you think FB would do this?
  – And how would you do joins with partitioned data?
• De-normalization removes joins (see the sketch below)
• But increases data volume
  – But disk is cheap and getting cheaper
• And can lead to inconsistent data
  – If you are lazy
  – However this is not really an issue
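For illustration, a denormalized version of the profile from the previous slide as a single record; the Python dict stands in for a wide row, blob column or document, and the field names follow the tables shown earlier.

```python
# Everything needed to render the profile page lives in one record, so the
# read is a single lookup instead of six joins.
profile_12345 = {
    "user_id": 12345,
    "first_name": "John", "last_name": "Doe",
    "hometown": "Atlanta, GA",
    "affiliations": [
        {"id": 42, "description": "Microsoft"},
        {"id": 598, "description": "Georgia Tech"},
    ],
    "phone_numbers": [
        {"number": "425-555-1203", "type": "Home"},
        {"number": "206-555-0932", "type": "Cell"},
    ],
    "work_history": [{"company": "Microsoft", "title": "Program Manager"}],
}
# The price: "Microsoft" is now copied into every such profile, so renaming it
# means touching many records: exactly the update anomaly normalization
# exists to prevent.
```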

Page 57: Handling Data in Mega Scale Systems

“Append-Only” Data

• Many Kinds of Computing are “Append-Only”
  – Lots of observations are made about the world
    • Debits, credits, Purchase-Orders, Customer-Change-Requests, etc.
  – As time moves on, more observations are added
    • You can’t change the history, but you can add new observations
• Derived Results May Be Calculated (see the sketch below)
  – Estimate of the “current” inventory
  – Frequently inaccurate
• Historic Rollups Are Calculated
  – Monthly bank statements
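A small Python sketch of the append-only pattern: observations are only ever appended, and the "current" value is a derived result folded over them. The event shapes are illustrative.

```python
inventory_log = []   # the append-only history of observations

def record(sku, delta, note):
    inventory_log.append({"sku": sku, "delta": delta, "note": note})

record("widget", +100, "purchase order received")
record("widget", -3, "customer order #1")
record("widget", -5, "customer order #2")

def current_stock(sku):
    # A derived result: recomputable at any time, and only as accurate as the
    # observations that have arrived so far.
    return sum(e["delta"] for e in inventory_log if e["sku"] == sku)

print(current_stock("widget"))   # 92
```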

Page 58: Handling Data in Mega Scale Systems

Databases and Transaction Logs

• Transaction Logs Are the Truth
  – High-performance & write-only
  – Describe ALL the changes to the data
• Data-Base the Current Opinion
  – Describes the latest value of the data as perceived by the application

The database is a caching of the transaction log! It is the subset of the latest committed values represented in the transaction log…

Source: http://blogs.msdn.com/pathelland/

Page 59: Handling Data in Mega Scale Systems

We Are Swimming in a Sea of Immutable Data

[Diagram (repeated from slide 50): the data-owning service publishes versioned, immutable snapshots of its Price-List (Monday’s, Tuesday’s, Wednesday’s) to the listening partner services]

Source: http://blogs.msdn.com/pathelland/

Page 60: Handling Data in Mega Scale Systems

Outline

• Characteristics
• App Tier Scaling
• Replication
• Partitioning
• Consistency
• Normalization
• Caching
• Data Engine Types

Page 61: Handling Data in Mega Scale Systems

Caching

• Makes scaling easier (cheaper)

• Core Idea (see the sketch below)
  – Read data from the persistent store into memory
  – Store it in a hash table
  – Read first from the cache; if it isn’t there, load from the persistent store
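The core idea as a few lines of Python (a cache-aside read path). The cache and db objects and their methods (get, set, load_user) are hypothetical placeholders for a real cache client and data store.

```python
def get_user(cache, db, user_id):
    key = "user:%d" % user_id
    user = cache.get(key)
    if user is None:                      # cache miss
        user = db.load_user(user_id)      # read from the persistent store
        cache.set(key, user, 300)         # keep it in memory for next time
    return user
```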

Page 62: Handling Data in Mega Scale Systems

Write thru Cache

[Diagram: the App Server writes through the Cache, which updates the backing store synchronously on every write]

Page 63: Handling Data in Mega Scale Systems

Write back Cache

[Diagram: the App Server writes to the Cache, which flushes changes to the backing store later, asynchronously]

Page 64: Handling Data in Mega Scale Systems

Sideline Cache

[Diagram: the App Server talks to the Cache and to the backing store separately; the application itself keeps the cache up to date (cache-aside)]

Page 65: Handling Data in Mega Scale Systems

Memcached

Page 66: Handling Data in Mega Scale Systems

How does it work

• In-memory distributed hash table
• A Memcached instance manifests as a process (often on the same machine as the web server)
• The Memcached client maintains a hash table (see the sketch below)
  – Which item is stored on which instance
• The Memcached server maintains a hash table
  – Which item is stored in which memory location
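A sketch of the client-side half in Python: the client hashes the key to pick an instance, and the chosen server then uses its own internal hash table to locate the value in memory. The instance addresses are illustrative, and real clients typically use consistent hashing rather than a plain modulo so that adding or removing an instance remaps fewer keys.

```python
import hashlib

INSTANCES = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]

def instance_for(key):
    # The key alone decides which instance to talk to, so every web server
    # agrees on placement without any coordination.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return INSTANCES[digest % len(INSTANCES)]

print(instance_for("user:12345"))   # every caller picks the same instance
```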

Page 67: Handling Data in Mega Scale Systems

Outline

• Characteristics
• App Tier Scaling
• Replication
• Partitioning
• Consistency
• Normalization
• Caching
• Data Engine Types

Page 68: Handling Data in Mega Scale Systems

It’s not all Relational!

• Amazon – S3, SimpleDB, Dynamo
• Google – App Engine Datastore, BigTable
• Microsoft – SQL Data Services, Azure Storage
• Facebook – Cassandra
• LinkedIn – Project Voldemort
• Ringo, Scalaris, Kai, Dynomite, MemcacheDB, ThruDB, CouchDB, HBase, Hypertable

Page 69: Handling Data in Mega Scale Systems

Tuplespaces

• Basic Concepts
  – No tables – containers and entities
  – No schema – each tuple has its own set of properties
• Amazon SimpleDB
  – Strings only
• Microsoft Azure SQL Data Services
  – Strings, blob, datetime, bool, int, double, etc.
  – No cross-container joins as of now
• Google App Engine Datastore
  – Strings, blob, datetime, bool, int, double, etc.

Page 70: Handling Data in Mega Scale Systems

Key-Value Stores

• Google BigTable
  – Sparse, distributed, multi-dimensional sorted map
  – Indexed by row key, column key, timestamp
  – Each value is an uninterpreted array of bytes
• Amazon Dynamo
  – Data partitioned and replicated using consistent hashing
  – Decentralized replica sync protocol
  – Consistency through versioning
• Facebook Cassandra
  – Used for Inbox search
  – Open source
• Scalaris
  – Keys stored in lexicographical order
  – Improved Paxos to provide ACID
  – Memory resident, no persistence

Page 71: Handling Data in Mega Scale Systems

In Summary

• Real-life scaling requires trade-offs
• No silver bullet
• Need to learn new things
• Need to un-learn
• Balance!

Page 72: Handling Data in Mega Scale Systems

QUESTIONS?

Page 73: Handling Data in Mega Scale Systems

Intelligent People. Uncommon Ideas.

Licensed under Creative Commons Attribution Sharealike Noncommercial