amplab

1700: Multi Data Center Consistency

Tim Kraska, Gene Pang, Michael J. Franklin, Sam Madden

The Problem

AWS 2011 outage: Ultimately, 0.07% of the volumes in the affected Availability Zone could not be restored for customers in a consistent state.

Google AppEngine: 2 hours of downtime because of a power outage, Feb 24, 2010

Multi-DataCenter Deployments

Are Asynchronous Replicated Key/Value Stores Enough?

Account Alice

Account Bob

Transactions are not needed in Web 2.0

Web 2.0 Transactions

Gift: 100 eggplants for your farm, just click: http:://farmville.com/gift/AB234890o97

Figure labels: 100, US-East-DC, US-West-DC

Current Solutions: Yahoo PNUTS

Figure labels: Europe, US-West, Singapore, Tokyo, US-East

Step 1: 100ms (asynchronous, single key)

•  1 round-trip
•  No transactions
•  Either no availability or no consistency during failures
•  Risk of losing data

Current Solution: Amazon Multi-AZ RDS

Figure labels: Europe, US-West-1, US-West-2

Step 1: 100ms (synchronous)
Step 2: 2ms

•  2 round-trip times
•  1 partition per machine
•  Only same location, different availability zone

Current Solution: Google MegaStore

Figure labels: Europe, US-West, Singapore, Tokyo, US-East

Step 1: 100ms (synchronous)

•  Force everything into (very) tiny partitions
•  1 transaction at a time per partition
•  2 continent round-trip times

What is MDCC

Programming Model

•  SLO aware
•  Enables developers to handle the latencies across data-centers

    trx(300ms){
      val p1 = products.get("Harry1")
      p1.stock -= 1
      val o = new Order …
    }.onAccept{
      …
    }.onCommit{
      …
    }finally{
      …
    }

New Protocol

•  Only 1 round-trip per transaction in the normal case
•  No master required
•  No partitioning required
•  Read committed consistency

Guarantees

•  Stronger guarantees are possible
•  Optimistic approach
•  Local reads possible
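To make the callback style of the programming model more concrete, here is a minimal Python sketch. It is not the MDCC API: the class `Trx`, its methods, the `simulated_commit_ms` parameter, and the outcome strings are all hypothetical names introduced for illustration. The sketch assumes that the accept callback fires once the body has run, that the commit callback fires only when the (simulated) commit latency stays within the SLO, and that the final callback always fires. It does not model rollback or whatever the real system does once the deadline has passed.

```python
from typing import Callable, List, Optional


class Trx:
    """Hypothetical transaction handle with a latency SLO in milliseconds."""

    def __init__(self, slo_ms: int, body: Callable[[], None]) -> None:
        self.slo_ms = slo_ms
        self.body = body
        self._on_accept: Optional[Callable[[], None]] = None
        self._on_commit: Optional[Callable[[], None]] = None
        self._finally: Optional[Callable[[str], None]] = None

    def on_accept(self, callback: Callable[[], None]) -> "Trx":
        self._on_accept = callback
        return self

    def on_commit(self, callback: Callable[[], None]) -> "Trx":
        self._on_commit = callback
        return self

    def finally_(self, callback: Callable[[str], None]) -> "Trx":
        self._finally = callback
        return self

    def run(self, simulated_commit_ms: int) -> str:
        """Run the body, then fire callbacks based on a simulated latency."""
        self.body()
        if self._on_accept:
            self._on_accept()
        if simulated_commit_ms <= self.slo_ms:
            outcome = "committed"
            if self._on_commit:
                self._on_commit()
        else:
            # Assumption: a missed SLO only skips the commit callback here.
            outcome = "slo_missed"
        if self._finally:
            self._finally(outcome)
        return outcome


stock = {"Harry1": 5}
events: List[str] = []


def buy_one() -> None:
    stock["Harry1"] -= 1


def build() -> Trx:
    return (
        Trx(300, buy_one)
        .on_accept(lambda: events.append("accepted"))
        .on_commit(lambda: events.append("committed"))
        .finally_(lambda outcome: events.append("finally:" + outcome))
    )


fast_outcome = build().run(simulated_commit_ms=120)
slow_outcome = build().run(simulated_commit_ms=450)
```

The point of the chained callbacks is that application code can react at different stages (for example, show "order received" on accept and "order confirmed" on commit) instead of blocking for the full cross-data-center latency.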

MDCC

Figure labels: Europe, US-West, Tokyo, Singapore, US-East

Step 1: 100ms

•  1 round-trip (in the normal case)
•  No data loss
•  Transactions
•  Strong consistency
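The round-trip counts on the slides translate into a rough commit-latency comparison. The Python sketch below is back-of-the-envelope arithmetic only. It assumes that the "100ms" label is the cost of one wide-area round trip and that commit latency is simply round trips times that cost; neither assumption is stated on the slides, and the constant and function names are made up for illustration.

```python
# Assumption: "Step 1: 100ms" is the cost of one wide-area round trip.
WAN_ROUND_TRIP_MS = 100


def commit_latency_ms(wan_round_trips: int) -> int:
    """Estimate commit latency as (number of round trips) x (round-trip cost)."""
    return wan_round_trips * WAN_ROUND_TRIP_MS


# Round-trip counts as listed on the slides.
estimates = {
    "Yahoo PNUTS": commit_latency_ms(1),
    "Google MegaStore": commit_latency_ms(2),
    "MDCC (normal case)": commit_latency_ms(1),
}

for system, latency in estimates.items():
    print(f"{system}: ~{latency} ms")
```

Under these assumptions, MDCC matches the latency of the asynchronous single-key approach in the normal case while still offering transactions, and it halves the estimate for a protocol that needs two cross-continent round trips.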

How do we do it - key-observation

Conflicts are very rare unless we fight about a resource

•  Everybody only cares about their belongings
•  Example: I update my own profile (why should somebody else update it?)

But:

•  Updates commute up to a limit
•  Examples:
   •  Ticket reservation
   •  Crops
   •  Product stocks
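"Updates commute up to a limit" can be illustrated with a stock counter that must not drop below zero. The Python sketch below is a generic illustration, not code from MDCC; `BoundedCounter`, `try_add`, and `final_stock` are hypothetical names. It applies the same set of decrements in every possible order: while the total stays within the limit, every order yields the same result, but once the limit is reached the outcome depends on the order.

```python
from itertools import permutations
from typing import Iterable, Tuple


class BoundedCounter:
    """A counter whose value may never fall below lower_bound."""

    def __init__(self, value: int, lower_bound: int = 0) -> None:
        self.value = value
        self.lower_bound = lower_bound

    def try_add(self, delta: int) -> bool:
        """Apply delta if the bound still holds; otherwise reject it."""
        if self.value + delta < self.lower_bound:
            return False
        self.value += delta
        return True


def final_stock(initial: int, deltas: Iterable[int]) -> Tuple[int, bool]:
    """Return (final value, whether every update was accepted)."""
    counter = BoundedCounter(initial)
    accepted = [counter.try_add(d) for d in deltas]
    return counter.value, all(accepted)


orders = [-1, -2, -3]  # three purchases, six items in total

# Enough stock: every ordering gives the same final state.
within_limit = {final_stock(10, p) for p in permutations(orders)}

# Not enough stock: the final value depends on the ordering.
over_limit = {final_stock(4, p)[0] for p in permutations(orders)}

print(within_limit)  # a single outcome
print(over_limit)    # more than one outcome
```

This is why tickets, crops, and product stocks can be updated without coordination most of the time, and why extra care is needed only near the limit.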

MDCC Protocol

Multi-Paxos

Commutative Updates (Generalized Paxos)

Escrow

Demarcation Protocol

Stochastic Optimizations
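The slide only names these building blocks, so the following Python sketch is a loose illustration of the general idea behind escrow and demarcation-style limits, not MDCC's protocol. The assumption is that a global constraint (total sales must not exceed the stock) is split into per-site quotas, so that each data center can accept reservations locally without contacting the others as long as its own quota suffices. `Site`, `split_quota`, and `try_reserve` are hypothetical names.

```python
from typing import List


class Site:
    """A data center holding a local share of a global stock limit."""

    def __init__(self, name: str, quota: int) -> None:
        self.name = name
        self.quota = quota
        self.sold = 0

    def try_reserve(self, amount: int) -> bool:
        """Accept locally only if the local quota covers the request."""
        if amount > self.quota:
            return False  # would need quota from another site first
        self.quota -= amount
        self.sold += amount
        return True


def split_quota(total: int, site_names: List[str]) -> List[Site]:
    """Divide total as evenly as possible; quotas always sum to total."""
    base, extra = divmod(total, len(site_names))
    return [
        Site(name, base + (1 if i < extra else 0))
        for i, name in enumerate(site_names)
    ]


TOTAL_STOCK = 10
sites = split_quota(TOTAL_STOCK, ["US-West", "US-East", "Europe"])
initial_quotas = [s.quota for s in sites]

ok_west = sites[0].try_reserve(4)    # fits the local quota
ok_east = sites[1].try_reserve(2)    # fits the local quota
ok_europe = sites[2].try_reserve(5)  # exceeds the local quota, rejected

total_sold = sum(s.sold for s in sites)
```

Because the quotas never add up to more than the stock, the global constraint holds no matter how the local reservations interleave; the price is that a site may reject a request that the system as a whole could still have served.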

Like to know more

Tim Kraska [email protected]

Current Solution: Dynamo-Approach

Figure labels: Europe, US-West, Tokyo, Singapore, US-East

Step 1: 100ms

•  1 round-trip
•  No data loss
•  Monotonicity
•  No transactions