Pnuts

PNUTS: Yahoo!’s Hosted Data Serving Platform
Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni
Yahoo! Research


Page 2: Pnuts

Motivation
• Web applications need:
o Scalability: architectural scalability, scale linearly
o Geographic scope: data replicas on multiple continents
o High availability: despite failures, apps will still be able to read data
o Relaxed consistency needs: tolerate stale or reordered data

Page 3: Pnuts

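Relaxed Consistency
• Not strict consistency
o Very expensive.
• Not eventual consistency
o Ex: a photo sharing application
o U1: Remove someone from the list of people who can view his photos
o U2: Post spring-break photos
o Under eventual consistency, a replica could apply U2 before U1, briefly exposing the new photos to the removed person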

Page 4: Pnuts

What is PNUTS?
• PNUTS: a massively parallel and geographically distributed database system for Yahoo!’s web applications.

• An architecture based on record-level, asynchronous geographic replication, and use of a guaranteed message-delivery service rather than a persistent log.

Page 5: Pnuts


System architecture

Page 6: Pnuts

System architecture
• Storage Units
o Each stores several hundred tablets; a tablet is usually several hundred megabytes.
• Routers
o The router stores an interval mapping, which defines the boundaries of each tablet and also maps each tablet to a storage unit.
• Tablet Controller
o Routers contain only a cached copy of the interval mapping. The mapping is owned by the tablet controller.
• YMB: Yahoo Message Broker
o Topic-based pub/sub system
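The router's interval mapping can be illustrated with a sorted list of tablet boundaries and a binary search. This is a minimal sketch, not the PNUTS implementation: the `Router` class, its method names, and the storage unit names are hypothetical, and keys are assumed to be plain strings.

```python
import bisect


class Router:
    """Hypothetical router holding a cached copy of the interval mapping."""

    def __init__(self, boundaries, tablet_to_su):
        # boundaries[i] is the inclusive lower bound of tablet i.
        # boundaries[0] is assumed to be the minimum key ("").
        self.boundaries = list(boundaries)
        # tablet_to_su[i] names the storage unit that holds tablet i.
        self.tablet_to_su = list(tablet_to_su)

    def tablet_for(self, key):
        # The tablet is the one with the rightmost boundary <= key.
        return bisect.bisect_right(self.boundaries, key) - 1

    def storage_unit_for(self, key):
        return self.tablet_to_su[self.tablet_for(key)]


router = Router(["", "g", "n", "t"], ["su1", "su2", "su1", "su3"])
print(router.storage_unit_for("grape"))
```

Because the lookup is a binary search over boundaries, it stays cheap even when the mapping covers many tablets.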

Page 7: Pnuts

Yahoo Message Broker
• Distributed publish-subscribe service.

• Guarantees delivery once a message is published.

• Published updates are asynchronously propagated to different regions and applied to their replicas.
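The topic-based publish-subscribe idea can be sketched with one queue per subscriber. This is an in-memory illustration only: the `Broker` class, its methods, and the topic and region names are assumptions, and nothing here models durability or cross-region networking.

```python
from collections import defaultdict, deque


class Broker:
    """Hypothetical in-memory topic-based pub/sub broker."""

    def __init__(self):
        # topic -> {subscriber name: queue of undelivered messages}
        self.queues = defaultdict(dict)

    def subscribe(self, topic, subscriber):
        self.queues[topic][subscriber] = deque()

    def publish(self, topic, message):
        # Once published, the message is queued for every subscriber,
        # so each one eventually receives it, in publish order.
        for queue in self.queues[topic].values():
            queue.append(message)

    def poll(self, topic, subscriber):
        # Subscribers consume at their own pace (asynchronously).
        queue = self.queues[topic][subscriber]
        return queue.popleft() if queue else None


broker = Broker()
broker.subscribe("tablet-7", "west")
broker.subscribe("tablet-7", "east")
broker.publish("tablet-7", "update-1")
broker.publish("tablet-7", "update-2")
print(broker.poll("tablet-7", "west"), broker.poll("tablet-7", "east"))
```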

Page 8: Pnuts


Types of Table

Page 9: Pnuts

Tablet splitting and balancing
• Each storage unit has many tablets (horizontal partitions of the table)
• Tablets may grow over time
• Overfull tablets split
• Storage unit may become a hotspot
• Shed load by moving tablets to other servers
[Figure labels: Storage unit, Tablet]
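Splitting an overfull tablet and moving half of it elsewhere amounts to adding one boundary to the interval mapping. The sketch below assumes the mapping is a sorted list of lower bounds plus a parallel list of storage unit names; `split_tablet` and its parameters are hypothetical names, not a PNUTS API.

```python
def split_tablet(boundaries, tablet_to_su, tablet, split_key, new_su=None):
    """Split `tablet` at `split_key`; the upper half optionally moves to `new_su`."""
    lower = boundaries[tablet]
    upper = boundaries[tablet + 1] if tablet + 1 < len(boundaries) else None
    if not (lower < split_key and (upper is None or split_key < upper)):
        raise ValueError("split key must fall strictly inside the tablet")
    # Insert the new boundary right after the tablet's lower bound.
    new_boundaries = boundaries[:tablet + 1] + [split_key] + boundaries[tablet + 1:]
    # The upper half stays on the same storage unit unless a target is given.
    target = new_su if new_su is not None else tablet_to_su[tablet]
    new_map = tablet_to_su[:tablet + 1] + [target] + tablet_to_su[tablet + 1:]
    return new_boundaries, new_map


bounds, owners = split_tablet(["", "g", "n"], ["su1", "su2", "su1"], 1, "k", new_su="su3")
print(bounds, owners)
```

Returning new lists instead of mutating the inputs keeps the old mapping usable until the new one is published, which is one plausible way to handle cached copies.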

Page 10: Pnuts


Query processing

Page 11: Pnuts

Accessing data
[Figure: request flow for a single-record read, with three storage units (SU)]
1. Get key k
2. Get key k
3. Record for key k
4. Record for key k

Page 12: Pnuts

Bulk read
[Figure: scatter/gather engine in front of three storage units (SU)]
1. {k1, k2, … kn}
2. Get k1, Get k2, Get k3
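The bulk read flow can be sketched as grouping keys by storage unit, fetching each group in parallel, and reassembling the results. This is an illustrative sketch: `bulk_get`, `locate`, and `fetch` are hypothetical names, and the in-memory `storage` dictionary stands in for real storage units.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def bulk_get(keys, locate, fetch):
    """Scatter keys to storage units, fetch in parallel, gather in request order."""
    by_su = defaultdict(list)
    for key in keys:
        by_su[locate(key)].append(key)          # scatter
    results = {}
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(fetch, su, ks) for su, ks in by_su.items()]
        for future in futures:
            results.update(future.result())     # gather
    return [results.get(key) for key in keys]


# Assumed stand-ins for storage units and the router lookup.
storage = {"su1": {"k1": "a", "k3": "c"}, "su2": {"k2": "b"}}


def locate(key):
    return "su2" if key == "k2" else "su1"


def fetch(su, keys):
    return {key: storage[su].get(key) for key in keys}


print(bulk_get(["k1", "k2", "k3"], locate, fetch))
```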

Page 13: Pnuts

Per-record timeline consistency
• All replicas of a given record apply all updates to the record in the same order.
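One way to picture this guarantee is a master that stamps each update with the next version number and replicas that apply updates strictly in version order. The sketch below is an assumption-laden illustration: the `Master` and `Replica` classes are hypothetical, and the buffering of out-of-order deliveries is added here for demonstration rather than taken from the slides.

```python
class Master:
    """Hypothetical record master: assigns one total order to updates."""

    def __init__(self):
        self.version = 0

    def write(self, value):
        self.version += 1
        return self.version, value


class Replica:
    """Hypothetical replica: applies updates only in version order."""

    def __init__(self):
        self.version = 0
        self.value = None
        self.applied = []      # history of versions, in the order applied
        self.pending = {}      # buffered updates that arrived early

    def deliver(self, version, value):
        if version <= self.version:
            return             # duplicate or already applied
        self.pending[version] = value
        # Apply every update that is next in sequence.
        while self.version + 1 in self.pending:
            self.version += 1
            self.value = self.pending.pop(self.version)
            self.applied.append(self.version)


master = Master()
updates = [master.write(v) for v in ("a", "b", "c")]
west, east = Replica(), Replica()
for update in updates:
    west.deliver(*update)
for update in (updates[1], updates[0], updates[2]):   # arrives out of order
    east.deliver(*update)
print(west.applied, east.applied)
```

Both replicas end with the same history, even though the second one received the messages in a different order.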

Page 14: Pnuts

Per-record timeline consistency
• An example sequence of updates to a record
• 3 events: insert, update and delete.
• One replica assigned as the master
• Generation: advances on each new insert. Version: advances on each update

Page 15: Pnuts

Consistency model
• Goal: make it easier for applications to reason about updates and cope with asynchrony
• Web applications typically manipulate one record at a time
[Figure: timeline of one record in Generation 1: record inserted, then a series of updates producing v. 1 through v. 8, then delete]

Page 16: Pnuts

Consistency model
[Figure: record timeline, v. 1 through v. 8 in Generation 1, with stale versions and the current version marked; Read-any]
Read-any: Returns a possibly stale version of the record.

Page 17: Pnuts

Consistency model
[Figure: record timeline, v. 1 through v. 8 in Generation 1, with stale versions and the current version marked; Read latest]
Read latest: Returns the latest copy of the record that reflects all writes that have succeeded.

Page 18: Pnuts

Consistency model
[Figure: record timeline, v. 1 through v. 8 in Generation 1, with stale versions and the current version marked; Read ≥ v.6]
Read-critical(required version): Returns a version of the record that is strictly newer than, or the same as, the required version.
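The three read calls differ only in how fresh a copy they require. The sketch below assumes a local replica copy and a master copy of one record, with the master always holding the latest version; `Copy` and the three function names are hypothetical stand-ins for the calls described on these slides.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    version: int
    value: str


def read_any(local, master):
    # Cheapest read: the local copy, which may be stale.
    return local


def read_latest(local, master):
    # Assumption: the master copy reflects all successful writes.
    return master


def read_critical(local, master, required_version):
    # Use the local copy if it is new enough, otherwise fall back to the master.
    if local.version >= required_version:
        return local
    if master.version < required_version:
        raise LookupError("required version does not exist yet")
    return master


local, master = Copy(5, "old"), Copy(8, "new")
print(read_any(local, master).version)
print(read_latest(local, master).version)
print(read_critical(local, master, 6).version)
```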

Page 19: Pnuts

Consistency model
[Figure: record timeline, v. 1 through v. 8 in Generation 1, with stale versions and the current version marked; "Write if = v.7" returns ERROR]
Test-and-set-write(required version): Performs the requested write to the record if and only if the present version of the record is the same as the required version.
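Test-and-set-write behaves like an optimistic compare-and-swap on the record's version. The following sketch is illustrative only: `MasterRecord`, `VersionMismatch`, and the method signature are assumptions, and a lock stands in for whatever serialization the record master provides.

```python
import threading


class VersionMismatch(Exception):
    """Raised when the record has moved past the version the caller read."""


class MasterRecord:
    """Hypothetical master copy of one record."""

    def __init__(self, value, version=1):
        self.value = value
        self.version = version
        self._lock = threading.Lock()

    def test_and_set_write(self, required_version, new_value):
        with self._lock:
            if self.version != required_version:
                raise VersionMismatch(
                    f"required v.{required_version}, present v.{self.version}"
                )
            self.value = new_value
            self.version += 1
            return self.version


record = MasterRecord("profile", version=8)
try:
    record.test_and_set_write(7, "stale edit")   # "Write if = v.7" fails
except VersionMismatch as error:
    print("ERROR:", error)
```

A caller that gets the error would typically re-read the record and retry against the newer version.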

Page 20: Pnuts

Consistency model
[Figure: record timeline, v. 1 through v. 8 in Generation 1, with stale versions and the current version marked; "Write if = v.7" returns ERROR]
Mechanism: per-record mastership

Page 21: Pnuts

Consistency levels
• Eventual consistency
o Transactions:
  • Alice changes status from “Sleeping” to “Awake”
  • Alice changes location from “Home” to “Work”
[Figure: the two regions apply the updates in different orders]
Region 1: (Alice, Home, Sleeping) → Awake → (Alice, Home, Awake) → Work → (Alice, Work, Awake)
Region 2: (Alice, Home, Sleeping) → Work → (Alice, Work, Sleeping) → Awake → (Alice, Work, Awake)
Final state consistent
“Invalid” state visible: (Alice, Work, Sleeping) in Region 2

Page 22: Pnuts

Consistency levels
• Timeline consistency
o Transactions:
  • Alice changes status from “Sleeping” to “Awake”
  • Alice changes location from “Home” to “Work”
[Figure: both regions follow the same update order]
Region 1: (Alice, Home, Sleeping) → Awake → (Alice, Home, Awake) → Work → (Alice, Work, Awake)
Region 2: (Alice, Home, Sleeping) → Awake, Work → (Alice, Work, Awake)
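The difference between the two consistency levels in the Alice example can be reproduced by replaying the same two updates in different orders and listing every state a reader could observe. This is a small sketch; `visible_states` and the field names are hypothetical.

```python
def visible_states(initial, updates):
    """Apply updates in order and return every state a reader could see."""
    state = dict(initial)
    states = [tuple(state.values())]
    for field, value in updates:
        state[field] = value
        states.append(tuple(state.values()))
    return states


initial = {"name": "Alice", "location": "Home", "status": "Sleeping"}
wake = ("status", "Awake")
commute = ("location", "Work")

region1 = visible_states(initial, [wake, commute])
# Eventual consistency: Region 2 may apply the updates in the other order.
region2_eventual = visible_states(initial, [commute, wake])
# Timeline consistency: Region 2 applies them in the same order as Region 1.
region2_timeline = visible_states(initial, [wake, commute])

print(region2_eventual)
print(region2_timeline)
```

With the reordered replay, the final state matches Region 1 but the intermediate state (Alice, Work, Sleeping) appears, which never existed in Region 1's history.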

Page 23: Pnuts


Experiments

Page 24: Pnuts

Experimental setup
• Production PNUTS code
o Enhanced with ordered table type
• Three PNUTS regions
o 2 west coast, 1 east coast
o 5 storage units, 2 message brokers, 1 router
• Workload parameters
o Request rate: 1200-3600 requests/second
o Read:write mix: 0-50% writes
o Locality: 80%

Page 25: Pnuts

Inserts
• Inserts

o required 75.6 ms per insert in West 1 (tablet master)

o 131.5 ms per insert into the non-master West 2, and

o 315.5 ms per insert into the non-master East.

o These results show the expected effect that the cost of inserting is significantly higher if the insert is initiated in a non-master region that is far away from the tablet master.

Page 26: Pnuts


10% writes by default

Page 27: Pnuts

Lessons learned (1)
• Simpler is better than clever

o Clever approaches are hard to implement, test, debug and maintain

• Incremental is better than big-bang

Page 28: Pnuts

Lessons learned (2)
• Non-algorithmic challenges can be hard

o Dealing with network config, legacy software and requirements, the “corporate way,” multiple stakeholders…

• Researchers should get dirty hands
o Being a part of shipping a real system can radically readjust your worldview
o Write some test cases to understand system complexity