Lecture #23: Distributed Database Systems - Carnegie Mellon University


Transcript of Lecture #23: Distributed Database Systems - Carnegie Mellon University



Carnegie Mellon Univ. Dept. of Computer Science

15-415/615 - DB Applications

C. Faloutsos – A. Pavlo

Lecture #23: Distributed Database Systems

(R&G ch. 22)


Administrivia – Final Exam

• Who: You

• What: R&G Chapters 15-22

• When: Monday May 11th, 5:30pm–8:30pm

• Where: GHC 4401

• Why: Databases will help your love life.


Today’s Class

• High-level overview of distributed DBMSs.

• Not meant to be a detailed examination of all aspects of these systems.


Today’s Class

• Overview & Background

• Design Issues

• Distributed OLTP

• Distributed OLAP


Why Do We Need Parallel/Distributed DBMSs?

• PayPal in 2008…

• Single, monolithic Oracle installation.

• Had to manually move data every xmas.

• Legal restrictions.


Why Do We Need Parallel/Distributed DBMSs?

• Increased Performance.

• Increased Availability.

• Potentially Lower TCO (total cost of ownership).


Parallel/Distributed DBMS

• Database is spread out across multiple resources to improve parallelism.

• Appears as a single database instance to the application.

– A SQL query written for a single-node DBMS should generate the same result on a parallel or distributed DBMS.


Parallel vs. Distributed

• Parallel DBMSs:

– Nodes are physically close to each other.

– Nodes connected with high-speed LAN.

– Communication cost is assumed to be small.

• Distributed DBMSs:

– Nodes can be far from each other.

– Nodes connected using public network.

– Communication cost and problems cannot be ignored.


Database Architectures

• The goal is to parallelize operations across multiple resources.

– CPU

– Memory

– Network

– Disk


Database Architectures

[Diagram: the three architectures (Shared Memory, Shared Disk, Shared Nothing).]

Shared Memory

• CPUs and disks have access to common memory via a fast interconnect.

– Very efficient to send messages between processors.

– Sometimes called “shared everything”.


• Examples: All single-node DBMSs.


Shared Disk

• All CPUs can access all disks directly via an interconnect, but each has its own private memory.

– Easy fault tolerance.

– Easy consistency since there is a single copy of the DB.


• Examples: Oracle Exadata, ScaleDB.


Shared Nothing

• Each DBMS instance has its own CPU, memory, and disk.

• Nodes only communicate with each other via the network.

– Easy to increase capacity.

– Hard to ensure consistency.


• Examples: Vertica, Parallel DB2, MongoDB.


Early Systems

• MUFFIN – UC Berkeley (1979)

• SDD-1 – CCA (1980)

• System R* – IBM Research (1984)

• Gamma – Univ. of Wisconsin (1986)

• NonStop SQL – Tandem (1987)

[Photos: Bernstein, Mohan, DeWitt, Gray, Stonebraker]


Inter- vs. Intra-query Parallelism

• Inter-Query: Different queries or txns are executed concurrently.

– Increases throughput & reduces latency.

– Already discussed for shared-memory DBMSs.

• Intra-Query: Execute the operations of a single query in parallel.

– Decreases latency for long-running queries.


Parallel/Distributed DBMSs

• Advantages:

– Data sharing.

– Reliability and availability.

– Speed up of query processing.

• Disadvantages:

– May increase processing overhead.

– Harder to ensure ACID guarantees.

– More database design issues.


Today’s Class

• Overview & Background

• Design Issues

• Distributed OLTP

• Distributed OLAP


Design Issues

• How do we store data across nodes?

• How does the application find data?

• How to execute queries on distributed data?

– Push query to data.

– Pull data to query.

• How does the DBMS ensure correctness?


Database Partitioning

• Split the database across multiple resources:

– Disks, nodes, processors.

– Sometimes called “sharding”.

• The DBMS executes query fragments on each partition and then combines the results to produce a single answer.


Naïve Table Partitioning

• Each node stores one and only one table.

• Assumes that each node has enough storage space for an entire table.


Naïve Table Partitioning

[Diagram: Table1 (Tuple1–Tuple5) is stored whole on one partition, Table2 on another.]

Ideal Query: SELECT * FROM table


Horizontal Partitioning

• Split a table’s tuples into disjoint subsets.

– Choose column(s) that divide the database equally in terms of size, load, or usage.

– Each tuple contains all of its columns.

• Three main approaches (sketched below):

– Round-robin Partitioning.

– Hash Partitioning.

– Range Partitioning.
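A minimal sketch of the three approaches in Python, assuming four partitions and made-up range boundaries (nothing here comes from a real DBMS API):

import hashlib
from itertools import count

NUM_PARTITIONS = 4

# Round-robin: spread tuples evenly, ignoring their contents.
_next = count()
def round_robin_partition(tup):
    return next(_next) % NUM_PARTITIONS

# Hash: the same key always maps to the same partition,
# so equality lookups touch exactly one node.
def hash_partition(key):
    digest = hashlib.sha1(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# Range: contiguous key intervals, so range scans stay local.
UPPER_BOUNDS = [100, 200, 300]        # partition 0: key < 100, etc.
def range_partition(key):
    for part, bound in enumerate(UPPER_BOUNDS):
        if key < bound:
            return part
    return len(UPPER_BOUNDS)          # the last partition takes the rest

Hash partitioning makes the ideal single-partition query on the next slide cheap; range partitioning additionally keeps range scans on one node.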


Horizontal Partitioning

[Diagram: a partitioning key routes Tuple1–Tuple5 to different partitions.]

Ideal Query: SELECT * FROM table WHERE partitionKey = ?


Vertical Partitioning

• Split the columns of tuples into fragments:

– Each fragment contains all of the tuples’ values for its column(s).

• Need to include the primary key or a unique record id with each partition to ensure that the original tuple can be reconstructed.
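A tiny illustration with hypothetical columns; both fragments carry the primary key so the original tuple can be rejoined:

# Hypothetical row: (id, name, bio); 'bio' is wide and rarely read.
row = (42, "alice", "a very long biography ...")

frag1 = (row[0], row[1])   # hot columns: (id, name)
frag2 = (row[0], row[2])   # cold columns: (id, bio)

# Reconstruction is a join on the shared primary key.
assert frag1[0] == frag2[0]
original = (frag1[0], frag1[1], frag2[1])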


Vertical Partitioning

[Diagram: the columns of Tuple1–Tuple5 are split across separate partitions.]

Ideal Query: SELECT column FROM table


Replication

• Partition Replication: Store a copy of an entire partition in multiple locations.

– Master–Slave Replication.

• Table Replication: Store an entire copy of a table in each partition.

– Usually small, read-only tables.

• The DBMS ensures that updates are propagated to all replicas in either case (sketched below).
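A toy sketch of master-slave propagation, assuming synchronous, in-process replicas (plain dicts stand in for the slave copies; all names are made up):

class MasterPartition:
    def __init__(self, slaves):
        self.data = {}
        self.slaves = slaves            # replica copies of this partition

    def write(self, key, value):
        self.data[key] = value          # apply locally first
        for slave in self.slaves:       # then propagate to every replica
            slave[key] = value

replica_a, replica_b = {}, {}
master = MasterPartition([replica_a, replica_b])
master.write("A", 2)
assert replica_a["A"] == replica_b["A"] == 2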


Replication

[Diagram: Partition Replication: each master partition propagates its writes to its slave copies. Table Replication: Node 1 and Node 2 each hold a full copy of the table.]


Data Transparency

• Users should not be required to know where data is physically located, or how tables are partitioned or replicated.

• A SQL query that works on a single-node DBMS should work the same on a distributed DBMS.


OLTP vs. OLAP

• On-line Transaction Processing:

– Short-lived txns.

– Small footprint.

– Repetitive operations.

• On-line Analytical Processing:

– Long running queries.

– Complex joins.

– Exploratory queries.


Workload Characterization

[Chart: workload focus runs from writes to reads on one axis, operation complexity from simple to complex on the other. OLTP sits in the simple/write-heavy corner, OLAP in the complex/read-heavy corner, and social networks fall in between.]

Michael Stonebraker – “Ten Rules For Scalable Performance In ‘Simple Operation’ Datastores” http://cacm.acm.org/magazines/2011/6/108651


Today’s Class

• Overview & Background

• Design Issues

• Distributed OLTP

• Distributed OLAP


Distributed OLTP

• Execute txns on a distributed DBMS.

• Used for user-facing applications:

– Example: Credit card processing.

• Key Challenges:

– Consistency

– Availability


Single-Node vs. Distributed Transactions

• Single-node txns do not require the DBMS to coordinate behavior between nodes.

• A distributed txn is any txn that involves more than one node.

– Requires expensive coordination.


Simple Example

[Diagram: the application server begins a txn, executes queries against Node 1 and Node 2, then commits.]


Transaction Coordination

• Assuming that our DBMS supports multi-operation txns, we need some way to coordinate their execution in the system.

• Two different approaches:

– Centralized: Global “traffic cop”.

– Decentralized: Nodes organize themselves.


TP Monitors

• Example of a centralized coordinator.

• Originally developed in the 1970-80s to provide txns between terminals and mainframe databases.

– Examples: ATMs, Airline Reservations.

• Many DBMSs now support the same functionality internally.


Centralized Coordinator

[Diagram: the application server sends lock requests to the coordinator, which acknowledges them; on the commit request, the coordinator asks the partitions “Safe to commit?”]


Centralized Coordinator

[Diagram: a middleware layer sits between the application server and the partitions, routing query requests and asking “Safe to commit?” on the application’s behalf.]


Decentralized Coordinator

[Diagram: the application server sends the commit request to one partition, which itself asks the other partitions “Safe to commit?”]


Observation

• Q: How do we ensure that all nodes agree to commit a txn?

– What happens if a node fails?

– What happens if our messages show up late?


CAP Theorem

• Eric Brewer’s conjecture that it is impossible for a distributed system to always be:

– Consistent

– Always Available

– Network Partition Tolerant

• Proved by Gilbert and Lynch in 2002.

Pick Two!


CAP Theorem

[Venn diagram:

– Consistency = linearizability.

– Availability = all up nodes can satisfy all requests.

– Partition Tolerance = still operate correctly despite message loss.

The region where all three overlap is “No Man’s Land”.]


CAP – Consistency

[Diagram: an application server sets A=2, B=9 on the master (Node 1), which replicates to Node 2. A second application server reading A and B from the replica must see both changes or no changes.]


CAP – Availability

[Diagram: Node 2 has failed, but the system must still answer Read A and Read B from the surviving node (A=1, B=8).]


CAP – Partition Tolerance

[Diagram: a network partition separates Node 1 and Node 2, each acting as master. One application server sets A=2, B=9 on Node 1 while another sets A=3, B=6 on Node 2, so the copies diverge.]


CAP Theorem

• Relational DBMSs: CA/CP (these are essentially the same!)

– Examples: IBM DB2, MySQL Cluster, VoltDB

• NoSQL DBMSs: AP

– Examples: Cassandra, Riak, DynamoDB



Atomic Commit Protocol

• When a multi-node txn finishes, the DBMS needs to ask all of the nodes involved whether it is safe to commit.

– All nodes must agree on the outcome.

• Examples:

– Two-Phase Commit

– Three-Phase Commit

– Paxos


Two-Phase Commit

[Diagram: the application server sends the commit request to Node 1, the coordinator. Phase 1 (Prepare): the coordinator asks participants Node 2 and Node 3, and each replies OK. Phase 2 (Commit): the coordinator tells the participants to commit, and each acknowledges with OK.]


Two-Phase Commit

[Diagram: same setup, but one participant replies ABORT during the prepare phase, so Phase 2 becomes Abort.]


Two-Phase Commit

• Each node has to record the outcome of each phase in a stable storage log.

• Q: What happens if the coordinator crashes?

– Participants have to decide what to do.

• Q: What happens if a participant crashes?

– The coordinator assumes that it responded with an abort if it hasn’t sent an acknowledgement yet.

• The nodes have to block until they can figure out the correct action to take.
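A toy sketch of the coordinator’s half of the protocol, assuming in-process participant objects with prepare()/commit()/abort() methods and a force-write log (all hypothetical names, not a real API):

def two_phase_commit(participants, log):
    # Phase 1 (Prepare): every participant must force its vote to
    # its own stable log before replying OK (True) or ABORT (False).
    votes = [p.prepare() for p in participants]

    # The coordinator forces the global decision to stable storage
    # *before* telling anyone, so a crash here can be replayed.
    decision = "COMMIT" if all(votes) else "ABORT"
    log.force_write(decision)

    # Phase 2 (Commit/Abort): broadcast the decision to everyone.
    for p in participants:
        p.commit() if decision == "COMMIT" else p.abort()
    return decision

The blocking problem above lives between the two phases: a participant that has voted OK cannot unilaterally decide until it learns the coordinator’s logged decision.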


Three-Phase Commit

• The coordinator first tells the other nodes that it intends to commit the txn.

• If the coordinator fails, then the participants elect a new coordinator and finish the commit.

• Nodes do not have to block if there are no network partitions.


Failure doesn’t always mean a hard crash.


Paxos

• Consensus protocol where a coordinator proposes an outcome (e.g., commit or abort) and then the participants vote on whether that outcome should succeed.

• Does not block if a majority of participants are available, and has provably minimal message delays in the best case.

– First correct protocol that was provably resilient in the face of asynchronous networks.
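A heavily simplified single-value Paxos round, with in-process acceptors and no failures or retries modeled (all names are illustrative):

class Acceptor:
    def __init__(self):
        self.promised = 0              # highest proposal number promised
        self.accepted = (0, None)      # (proposal number, value)

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            return self.accepted       # promise, reporting any prior accept
        return None                    # reject

    def accept(self, n, value):
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False

def propose(acceptors, n, value):
    majority = len(acceptors) // 2 + 1
    # Phase 1: gather promises from a majority of acceptors.
    promises = [r for r in (a.prepare(n) for a in acceptors) if r is not None]
    if len(promises) < majority:
        return None                    # retry later with a higher n
    # Must adopt the highest-numbered value already accepted, if any.
    prev_n, prev_v = max(promises, key=lambda t: t[0])
    if prev_v is not None:
        value = prev_v
    # Phase 2: the value is chosen once a majority accepts it.
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None

For example, propose([Acceptor() for _ in range(3)], 1, "COMMIT") succeeds with any two of the three acceptors reachable.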


2PC vs. Paxos

• Two-Phase Commit: blocks if the coordinator fails after the prepare message is sent, until the coordinator recovers.

• Paxos: non-blocking as long as a majority of participants are alive, provided there is a sufficiently long period without further failures.


Distributed Concurrency Control

• Need to allow multiple txns to execute simultaneously across multiple nodes.

– Many of the same protocols from single-node DBMSs can be adapted.

• This is harder because of:

– Replication.

– Network Communication Overhead.

– Node Failures.


Distributed 2PL

[Diagram: one application server issues Set A=2, B=9 while another issues Set A=0, B=7; A lives on Node 1 and B on Node 2, so each txn must hold locks on both nodes across the network.]
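A minimal sketch of a txn acquiring locks on every node it touches before writing, as distributed 2PL requires; the in-process nodes and the fixed global lock order are assumptions of this sketch, not something from the slides:

import threading
from contextlib import ExitStack

class Node:
    """One shared-nothing node: its records plus a local lock table."""
    def __init__(self, name, data):
        self.name = name
        self.data = data
        self.locks = {key: threading.Lock() for key in data}

def distributed_write(writes):
    """writes: list of (node, key, value) tuples, possibly across nodes."""
    # Growing phase: take every lock before touching any data.
    # Locking in a fixed (node, key) order sidesteps the distributed
    # deadlock that the conflicting txns in the diagram could cause.
    writes = sorted(writes, key=lambda w: (w[0].name, w[1]))
    with ExitStack() as stack:
        for node, key, _ in writes:
            stack.enter_context(node.locks[key])
        for node, key, value in writes:
            node.data[key] = value
    # Shrinking phase: ExitStack releases all locks here, at "commit".

n1 = Node("node1", {"A": 1})
n2 = Node("node2", {"B": 8})
distributed_write([(n1, "A", 2), (n2, "B", 9)])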


Recovery

• Q: What do we do if a node crashes in a CA/CP DBMS?

• If the node is replicated, use Paxos to elect a new primary.

– If the node is the last replica, halt the DBMS.

• The node can recover from checkpoints + logs and then catch up with the primary.


Today’s Class

• Overview & Background

• Design Issues

• Distributed OLTP

• Distributed OLAP


Distributed OLAP

• Execute analytical queries that examine large portions of the database.

• Used for back-end data warehouses:

– Example: Data mining

• Key Challenges:

– Data movement.

– Query planning.


Distributed OLAP

[Diagram: the application server issues a single complex query that fans out across all partitions.]


Distributed Joins Are Hard

• Assume tables are horizontally partitioned:

– Table1 Partition Key → table1.key

– Table2 Partition Key → table2.key

• Q: How do we execute this query, given that the join is on val, not the partition key?

SELECT * FROM table1, table2
 WHERE table1.val = table2.val

• Naïve solution is to send all partitions to a single node and compute the join there.


Semi-Joins

• Main Idea: First distribute the join attributes between nodes and then recreate the full tuples in the final output (see the sketch below).

– Send just enough data from each table to compute which rows to include in the output.

• Lots of choices make this problem hard:

– What to materialize?

– Which table to send?
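A toy semi-join for the query above, with two in-process “nodes” holding one partition each (the tuple layouts and values are made up):

# (key, data, val) rows; the join column is val.
node1_table1 = [(1, "a", 10), (2, "b", 20), (3, "c", 30)]
node2_table2 = [(7, "x", 20), (8, "y", 40)]

# Step 1: Node 2 ships only its join-column values, not whole tuples.
table2_vals = {row[2] for row in node2_table2}

# Step 2: Node 1 filters locally, so only matching tuples travel.
matching = [row for row in node1_table1 if row[2] in table2_vals]

# Step 3: ship the (small) matching set to Node 2 and finish the join.
result = [(t1, t2) for t1 in matching
                   for t2 in node2_table2 if t1[2] == t2[2]]
print(result)   # [((2, 'b', 20), (7, 'x', 20))]

Here only one set of join values and one matching tuple cross the network, instead of an entire partition.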


Summary

• Everything is harder in a distributed setting:

– Concurrency Control

– Query Execution

– Recovery


Next Class

• Discuss distributed OLAP more.

– You’ll learn why MapReduce was a bad idea.

• Compare OldSQL vs. NoSQL vs. NewSQL

• Real-world Systems
