Intro to Databases

Databases

Sargun Dhillon

@Sargun

What is a database? A database is an organized collection of data

What are databases for?

Applications

Internet Applications: Experiencing exploding growth

[Figure: Internet Traffic vs. Penetration, 2000 to 2012. Series: IP Traffic (PB/mo), axis 0 to 40,000; Global Penetration (%), axis 0 to 100.]

Number of Internet Users in 2012

Average Distance to Every Human

Extrapolating: We have not yet reached Peak “Web” and we won’t see it for some time

Applications: How are they built?

Basic Application

Useful Application: Add Persistence

Scale Out

Scale Out with Correctness

What is a Transaction? A Unit of Work

Transaction Scheduling: Concurrent Operations

Non-Conflicting Concurrency: Parallel Execution

ACID = Atomicity: A transaction executes completely or not at all

ACID = Consistency: Correctness; the database must uphold a set of invariants

ACID = Isolation: Prevent inter-actor visibility during concurrent operations

ACID = Durability: Once you write, it will survive
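As a rough illustration of atomicity and consistency together, the sketch below uses Python's built-in sqlite3 module. The accounts table, the names alice and bob, and the transfer helper are made up for this example; the CHECK constraint plays the role of an invariant, and the failed transfer leaves no partial effect behind.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one unit of work."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            # Credit first, so a later failure has something to undo.
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The debit broke the invariant; the credit was rolled back with it.
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts ("
             "name TEXT PRIMARY KEY, "
             "balance INTEGER NOT NULL CHECK (balance >= 0))")  # the invariant
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

ok = transfer(conn, "alice", "bob", 60)        # fits within alice's balance
rejected = transfer(conn, "alice", "bob", 60)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the second call, bob's balance is unchanged even though his credit ran before the failing debit.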

Lifecycle of a Transaction

Vertical Scalability: Moore’s Law can take us places

Biggest AWS Database

• vCPUs: 32

• Memory: 244 GiB

• Storage: 3TB

• IOPS: 30,000

• Networking: 10 Gigabit

• Resiliency: Multi-AZ

• SLA: 99.95%

• Backend: PostgreSQL

$141,052.66/yr

Scaling Beyond

Sharding?

Do we have a natural sharding key?

Add a Coordinator?

Two-phase commit?
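For readers new to the idea, here is a toy sketch of the two phases, assuming an in-process coordinator and hypothetical Participant objects. It leaves out timeouts, logging, and coordinator failure, which is exactly where the real protocol gets hard.

```python
class Participant:
    """Hypothetical shard that votes on a transaction and then applies the outcome."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: promise to commit if asked, or vote no.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only on a unanimous yes, otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

healthy = [Participant("shard-1"), Participant("shard-2")]
mixed = [Participant("shard-1"), Participant("shard-2", can_commit=False)]
outcome_healthy = two_phase_commit(healthy)
outcome_mixed = two_phase_commit(mixed)
```

A single "no" vote aborts the transaction on every shard, including those that voted yes.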

Three-phase commit?

Paxos?

Enhanced Three-phase commit?

Wat?

Egalitarian Paxos?

Do we really want to run NxM databases?

Partial Availability

Failure detectors are hard

Database Failure

Cascading App Failure

Recovery

Hotspots? (The “Bieber” problem)

Scaling SSI databases is a hard problem

What if we want multi-datacenter?

No latency win for mutable data

Must sacrifice recency for latency win

Complex Routing Semantics

Multi-master requires at least 1 RTT

-F1: A Distributed SQL Database That Scales, Google

“Because the data is synchronously replicated across multiple datacenters, and because we’ve chosen widely distributed datacenters, the commit latencies are relatively high (50-150 ms).”

-Kohavi and Longbotham 2007

“Every 100 ms increase in load time of Amazon.com decreased sales by 1%.”

(~$120M of losses per 100 ms)

“Average partition duration ranged from 6 minutes for software-related failures to more than 8.2 hours for hardware-related failures (median 2.7 and 32 minutes; 95th percentile of 19.9 minutes and 3.7 days, respectively).” -The Network is Reliable

WANs Fail

Is there another way?

Eventually Consistent Systems

-F1: A Distributed SQL Database That Scales, Google

“We also have a lot of experience with eventual consistency systems at Google. In all such systems, we find developers spend a significant fraction of their time building extremely complex and error-prone mechanisms to cope with eventual consistency and handle data that may be out of date. We think this is an unacceptable burden to place on developers and that consistency problems should be solved at the database level.”

CAP Theorem

“A shared-data system can have at most two of the three following properties: Consistency, Availability, and tolerance to network Partitions.”

-Dr. Eric Brewer

On Consistency

• ACID Consistency: Any transaction or operation will bring the database from one valid state to another

• CAP Consistency: All nodes see the same data at the same time (synchrony)

On Partition Tolerance

• The network will be allowed to lose arbitrarily many messages sent from one node to another.

• Database systems, in order to be useful, must communicate over the network

• Clients count

There is no such thing as a 100% reliable network:

Can’t choose CA

http://codahale.com/you-cant-sacrifice-partition-tolerance


We Can Have Both* (*Just not at the same time)

PNUTS

• Paper released by Yahoo! Research in 2008

• Operations:

• Read-Any

• Read-Critical(Required-Version)*

• Read-Latest

• Write

• Test-and-set-write(Required-Version)

* Will fall back to CP operation
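The operations listed above can be sketched with a toy in-memory model. ToyRecord, its single lagging replica, and the replicate method are hypothetical and only meant to show how the per-call consistency choices differ; this is not the PNUTS API.

```python
class ToyRecord:
    """One record with a master copy and one lagging replica (hypothetical model)."""

    def __init__(self, value):
        self.master = (1, value)   # (version, value) at the record's master
        self.replica = (1, value)  # possibly stale copy at a remote replica

    def write(self, value):
        version = self.master[0] + 1
        self.master = (version, value)
        return version

    def test_and_set_write(self, required_version, value):
        # Apply the write only if the record is still at the version the caller read.
        if self.master[0] != required_version:
            return None
        return self.write(value)

    def replicate(self):
        # Asynchronous replication catching up.
        self.replica = self.master

    def read_any(self):
        return self.replica  # fast, may be stale

    def read_critical(self, required_version):
        # Any copy at least as new as required_version is acceptable.
        if self.replica[0] >= required_version:
            return self.replica
        return self.master  # otherwise fall back to the up-to-date copy

    def read_latest(self):
        return self.master

rec = ToyRecord("a")
v2 = rec.write("b")                          # replica has not caught up yet
stale = rec.read_any()
critical = rec.read_critical(v2)
latest = rec.read_latest()
lost_race = rec.test_and_set_write(1, "c")   # version 1 is no longer current
won_race = rec.test_and_set_write(v2, "c")
```

The caller picks, per request, how much recency it is willing to pay for.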

Weak Consistency

“This is a specific form of weak consistency; the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value.”

Definition of “Eventual Consistency” from “Eventually Consistent - Revisited” - Werner Vogels

Eventual Consistency in the LAN

Less Relevant Today

Good at Building LANs at Scale

Facebook Fabric

Microsoft VL2

Google Jupiter

Less Interesting

Eventual Consistency in the WAN

Low-latency everywhere

Write Anywhere: Beat the speed of light

Build for WAN locality

Typical Pattern with COTS EC Store

System Model

Use Case: Social Network

Models: Users, Posts, Friends

Schema

CREATE TABLE test.users (
    user_name text PRIMARY KEY,
    friends set<text>,
    posts set<text>
);

State

*****:test> SELECT * FROM users;

 user_name | friends  | posts
-----------+----------+-------
    sargun | {'BOSS'} |  null

Let’s Post! (But First)

Remove Boss

*****:test> UPDATE users SET friends = friends - {'BOSS'} WHERE user_name = 'sargun' ;

Hidden Failure

Dropped Unfriending

State at DC2 & DC3

*****:test> SELECT * FROM users;

 user_name | friends  | posts
-----------+----------+-------
    sargun | {'BOSS'} |  null

Post Message

*****:test> UPDATE users SET posts = posts + {'PARTY'} WHERE user_name = 'sargun' ;

State at DC2 & DC3

*****:test> SELECT * FROM users;

 user_name | friends  | posts
-----------+----------+-----------
    sargun | {'BOSS'} | {'PARTY'}

Worse Than Banking

Unbounded Financial Loss

No Happens-Before (h.b.) Relationship

Solution: Wait For Acks

Very Little Benefit Over CP System

Quorum Systems

RYOW at an Incredible Cost

Why not just do Paxos*?

* Single-Decree Paxos, or a variant such as EPaxos, Cheap Paxos, or Multi-Paxos

Quorum

Participating Quorums Must Overlap
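The overlap requirement is commonly written as R + W > N. The sketch below checks that rule by brute force over every possible read and write quorum; the function names are made up for this example.

```python
from itertools import combinations

def quorums_overlap(n, r, w):
    """True if every read quorum of size r shares a node with every write quorum of size w."""
    nodes = range(n)
    return all(set(read_q) & set(write_q)
               for read_q in combinations(nodes, r)
               for write_q in combinations(nodes, w))

def rule_of_thumb(n, r, w):
    """The usual shortcut for the same property."""
    return r + w > n

majority = quorums_overlap(3, 2, 2)   # any two majorities of 3 nodes intersect
weak = quorums_overlap(3, 1, 1)       # a read can miss the node that took the write
agrees = all(quorums_overlap(n, r, w) == rule_of_thumb(n, r, w)
             for n in range(1, 6)
             for r in range(1, n + 1)
             for w in range(1, n + 1))
```

With N = 3, R = 1, and W = 1 a read can land on a node that never saw the write, which is how read-your-own-writes is lost.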

Just Perform Paxos Reconfiguration to Recover from Failure

Is there an alternative?

Strong Eventual Consistency

“Any set of nodes that have received the same (unordered) set of updates will be in the same state.”

How do you even use this?

Vector Clocks

• Extension of Lamport Clocks

• Used to detect cause and effect in distributed systems

• Can determine concurrency of events, and causality violations

• Preserves h.b. relationships
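A minimal sketch of the comparison rule, assuming a clock is a dict from node name to counter. The node names dc1 and dc2 are made up, and the events mirror the unfriend-then-post example from earlier slides.

```python
def increment(clock, node):
    """Return a copy of `clock` with `node`'s entry advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated

def merge(a, b):
    """Element-wise maximum: the clock after learning about both histories."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

def compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent'."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"

unfriend = increment({}, "dc1")     # remove BOSS at dc1
post = increment(unfriend, "dc1")   # post PARTY after the unfriending
other = increment({}, "dc2")        # unrelated write at dc2

ordered = compare(unfriend, post)
unordered = compare(post, other)
joined = merge(post, other)
```

A replica holding `post` without `unfriend` can tell from the clock that it is missing an earlier event.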

• CRDTs:

• Convergent Replicated Data Types

• Commutative Replicated Data Types

• Enable data structures to remain writable on both sides of a partition and to replay updates once the partition heals

• Enable distributed computation across monotonic functions

• Two Types:

• CvRDTs

• CmRDTs

CRDTs

CvRDTs

• State / value based CRDTs

• Minimal state

• Don’t require active garbage collection

Set CvRDT
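As one possible reading of a set CvRDT, here is a grow-only set whose merge is set union. The GSet class is a sketch written for this example; because union is commutative, associative, and idempotent, replicas can exchange full state in any order and still converge.

```python
class GSet:
    """Grow-only set: state-based, merged by union (illustrative sketch)."""

    def __init__(self, items=()):
        self.items = frozenset(items)

    def add(self, item):
        return GSet(self.items | {item})

    def merge(self, other):
        # Join of two states; safe to apply repeatedly and in any order.
        return GSet(self.items | other.items)

    def value(self):
        return set(self.items)

dc1 = GSet().add("PARTY")
dc2 = GSet().add("BEACH")
dc3 = GSet().add("PARTY").add("WORK")

left = dc1.merge(dc2).merge(dc3)
right = dc3.merge(dc2.merge(dc1))
repeated = left.merge(left)
```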

CmRDTs

• Op / method based CRDTs

• Size grows monotonically

• Uses version vectors to determine order of operations

Counter CmRDT
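A rough sketch of an op-based counter, assuming every operation carries a unique (replica, sequence) id so that duplicates can be dropped. OpCounter and its method names are hypothetical. The set of seen ids only ever grows, which is the size cost noted above.

```python
class OpCounter:
    """Op-based counter replica (toy model; ops carry unique ids)."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.total = 0
        self.seen = set()   # ids of operations already applied
        self.next_seq = 0

    def local_update(self, delta):
        """Apply an increment or decrement locally and return the op to broadcast."""
        op = (self.replica_id, self.next_seq, delta)
        self.next_seq += 1
        self.apply(op)
        return op

    def apply(self, op):
        """Apply a local or remote op; redelivered ops are ignored."""
        replica_id, seq, delta = op
        if (replica_id, seq) in self.seen:
            return
        self.seen.add((replica_id, seq))
        self.total += delta

a = OpCounter("a")
b = OpCounter("b")
op1 = a.local_update(5)
op2 = b.local_update(-2)
op3 = a.local_update(1)

# Deliver out of order and with a duplicate; addition commutes, so order is irrelevant.
for op in (op3, op1, op1):
    b.apply(op)
a.apply(op2)
```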

CRDTs in the Wild

• Sets

• Observe-remove set

• Grow-only sets

• Counters

• Grow-only counters

• PN-Counters

• Flags

• Maps
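To make the observe-remove set less abstract, here is a small state-based sketch with tombstones. The ORSet class and the use of random UUIDs as tags are choices made for this example. A remove only cancels the add tags the replica has already observed, so a concurrent add of the same element survives the merge.

```python
import uuid

class ORSet:
    """Observed-remove set with tombstones (illustrative sketch)."""

    def __init__(self):
        self.adds = set()      # (element, unique_tag) pairs
        self.removes = set()   # pairs that were observed and then removed

    def add(self, element):
        self.adds.add((element, uuid.uuid4().hex))

    def remove(self, element):
        # Tombstone only the tags this replica has seen for the element.
        self.removes |= {pair for pair in self.adds if pair[0] == element}

    def merge(self, other):
        self.adds |= other.adds
        self.removes |= other.removes

    def value(self):
        return {element for (element, _tag) in self.adds - self.removes}

a = ORSet()
b = ORSet()
a.add("Boss")
b.merge(a)          # both replicas now observe the same add

a.remove("Boss")    # remove at replica a
b.add("Boss")       # concurrent re-add at replica b, with a fresh tag

a.merge(b)
b.merge(a)

c = ORSet()
c.add("Boss")
c.remove("Boss")    # sequential remove with nothing concurrent
```

Both replicas converge on the same answer, and the plain add-then-remove case behaves like an ordinary set.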

Data structures that are CRDTs

• Probabilistic, convergent data structures

• HyperLogLog

• Bloom filter

• Co-recursive folding functions

• Maximum-counter

• Running Average

• Operational Transform

CRDTs

• Incredibly powerful primitive

• Useful not only for in-database manipulation but also for client-database interaction

• You can compose them, and build your own

• Garbage collection is tricky

Riak In Action

Model

curl -s http://localhost:8098/types/test/buckets/test/datatypes/sargun | python -m json.tool
{
    "context": "g2wAAAABaAJtAAAACBjtDYuvG6A4YQpq",
    "type": "map",
    "value": {
        "friends_set": [
            "Boss"
        ],
        "posts_set": []
    }
}

“Primary Key”: the final path segment of the URL (sargun)

Causal Context: the "context" field of the response (g2wAAAABaAJtAAAACBjtDYuvG6A4YQpq)

Update

curl -XPOST http://localhost:8098/types/test/buckets/test/datatypes/sargun \
    -H "Content-Type: application/json" \
    -H "X-Riak-Vclock: g2wAAAABaAJtAAAACBjtDYuvG6A4YQpq" \
    -d '{ "update": { "friends_set": { "remove": "Boss" } } }'

Updated Entries (during partition)

{ "context": "g2wAAAABaAJtAAAACBjtDYuvG6A4YQpq", "type": "map", "value": { "friends_set": [ "Boss" ], "posts_set": [] } }

{ "context": "g2wAAAABaAJtAAAACBjtDYuvG6A4YQtq", "type": "map", "value": { "friends_set": [], "posts_set": [] } }

Update

curl -XPOST http://localhost:8098/types/test/buckets/test/datatypes/sargun \
    -H "Content-Type: application/json" \
    -H "X-Riak-Vclock: g2wAAAABaAJtAAAACBjtDYuvG6A4YQtq" \
    -d '{ "update": { "posts_set": { "add": "Party" } } }'

Updated Entries (After Healing)

{ "context": "g2wAAAABaAJtAAAACBjtDYuvG6A4YQ5q", "type": "map", "value": { "friends_set": [], "posts_set": [ "Party" ] } }

{ "context": "g2wAAAABaAJtAAAACBjtDYuvG6A4YQ5q", "type": "map", "value": { "friends_set": [], "posts_set": [ "Party" ] } }

Currently: Replicates entire value

Future Work: δ-CRDT

Ship only Deltas

Eventual Consistency In Summary

SEC Enables

Distributed

Scalable

Processors

Fault-Tolerant

Applications

Eventual Consistency (CAP) Without Consistency (ACID) Gives EC a Bad Name

 Invariant          | Operation | AP / CP
--------------------+-----------+---------
 Specify unique ID  | Any       | CP
 Generate unique ID | Any       | AP
 >                  | INCREMENT | AP
 >                  | DECREMENT | CP
 <                  | INCREMENT | CP
 <                  | DECREMENT | AP
 Secondary Index    | Any       | AP
 Materialized View  | Any       | AP
 AUTO_INCREMENT     | INSERT    | CP
 Linearizability    | CAS       | CP

Operations Requiring Weak Consistency vs. Strong Consistency
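One way to see why a "greater than" invariant tolerates uncoordinated increments but not decrements: each replica validates against its own view only, and the merged result can still break the invariant. The helper names and the starting value of 2 are assumptions made for this sketch.

```python
def locally_valid(start, deltas, lower_bound=0):
    """A replica checks `value > lower_bound` against only its own updates."""
    value = start
    for delta in deltas:
        value += delta
        if value <= lower_bound:
            return False
    return True

def merged_value(start, replica_deltas):
    """Value once every replica's locally accepted updates are combined."""
    return start + sum(sum(deltas) for deltas in replica_deltas)

start = 2  # invariant: value > 0

# Two replicas each accept one decrement during a partition.
decrements = [[-1], [-1]]
each_ok = all(locally_valid(start, deltas) for deltas in decrements)
after_decrements = merged_value(start, decrements)   # invariant broken after merge

# The same experiment with increments can never cross the lower bound.
increments = [[1], [1]]
after_increments = merged_value(start, increments)
```

Every replica saw a valid state, yet the merged decrements land on 0, so that row of the table needs coordination (CP) while the increment row does not (AP).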

BASE not ACID

• Basically Available: There will be a response per request (failure or success)

• Soft State: Any two reads against the system may yield different data (when measured against time)

• Eventually Consistent: The system will eventually become consistent when all failures have healed, and time goes to infinity

Brand New Technology: Still being invented

Technology Timeline

• 1996 - Log-structured merge tree

• 2000 - CAP Theorem

• 2007 - Amazon Dynamo Paper

• 2011 - INRIA CRDT Technical Report

• 2014 - Riak DT map: a composable, convergent replicated dictionary

Further Reading

• Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS

• PNUTS: Yahoo!’s Hosted Data Serving Platform

• F1: A Distributed SQL Database That Scales

• Spanner: Google's Globally-Distributed Database

• The Network is Reliable: An informal survey of real-world communications failures

• A comprehensive study of Convergent and Commutative Replicated Data Types

• Riak DT Map: A Composable, Convergent Replicated Dictionary

Get in Touch

• If you’re interested in cheating the speed of light

• Come use our software

• If you’re interested in solving today’s computer science problems

• Come work for us

• If you’d like to learn more about distributed systems at scale

• Maybe you have a better idea

Sargun Dhillon @Sargun


The Case for Eventual Consistency
