Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database

Page 1: Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database

Internet-scale Distributed Systems

Google Spanner: a Synchronously-Replicated, Globally-Distributed, Multi-Version Database

22.01.2013 Maciej Jozwiak Page 1

Presented by: Maciej Jozwiak

Page 2: Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database

Agenda

• Problem description

• Overview of available solutions

• Globally-distributed database

• Architecture

• How is data replicated?

• Data model

• TrueTime API

• Transactions

• Summary

Page 3: Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database

Problem – Need for Scalable MySQL

• Google’s advertising backend

– Based on MySQL

• Relations

• Query language

– Manually sharded

• Resharding is very costly

– Global distribution

SHARDING:

Sharding is another name for "horizontal partitioning" of a database. The rows of a table are held separately, and each group of rows forms a partition that can be located on a separate database server or in a separate physical location.
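To illustrate why resharding a manually sharded database is costly, here is a minimal sketch. It assumes a hypothetical placement rule, `key mod N`; the slides do not say how the MySQL shards were actually assigned, and the function names are made up for this example.

```python
def shard_for(key: int, num_shards: int) -> int:
    """Hypothetical placement rule: row key modulo the number of shards."""
    return key % num_shards


def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of rows that land on a different shard after resharding."""
    keys = list(keys)
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


if __name__ == "__main__":
    # Growing from 4 to 5 shards relocates most rows under this rule.
    print(moved_fraction(range(10_000), old_shards=4, new_shards=5))
```

Under this rule, adding a single shard forces most rows to move, which is one reason manual resharding is expensive.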

Page 4: Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database

Overview of Available Solutions

Google Megastore

• Replicated ACID transactions

• Schematized semi-relational tables

• Synchronous replication support across data-centers

• Performance

• Lack of query language

Google Bigtable

• Scalability

• Throughput

• Performance

• Eventually-consistent replication support across data-centers

Page 7: Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database

Bridging the gap between Megastore and Bigtable

Solution: Google Spanner

• Removes the need to manually partition data

• Synchronous replication and automatic failover

• Strong transactional semantics

• SQL-based query language

• Semi-relational, schematized tables

Page 8: Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database

Globally-Distributed Database

Future scale:

• one million to 10 million servers

• 100s to 1000s of locations around the world

• 10^13 directories

• 10^18 bytes of storage

Cross-datacenter replicated data management:

• high availability

• minimize latency of data reads and writes

• replication configuration dynamically controlled at a fine grain by applications
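As a rough sense of what these targets imply, the short calculation below divides the storage figure by the directory and server counts. It reads the figures as 10^13 directories and 10^18 bytes, and it ignores replication overhead, so the results are only order-of-magnitude averages.

```python
SERVERS_MIN, SERVERS_MAX = 10**6, 10**7   # one million to 10 million servers
DIRECTORIES = 10**13
STORAGE_BYTES = 10**18

# Average size of one directory if storage were spread evenly.
bytes_per_directory = STORAGE_BYTES // DIRECTORIES

# Average storage per server at both ends of the server range.
bytes_per_server_low = STORAGE_BYTES // SERVERS_MAX
bytes_per_server_high = STORAGE_BYTES // SERVERS_MIN

print(f"{bytes_per_directory / 1e3:.0f} KB per directory")
print(f"{bytes_per_server_low / 1e9:.0f} GB to "
      f"{bytes_per_server_high / 1e12:.0f} TB per server")
```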

Page 9: Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database

Spanner Deployment - Universe

Universe master (status + interactive debugging)

Placement driver (moves data across zones automatically)

Page 10: Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database

How Is Data Replicated?

Paxos: protocols for solving consensus in a network of unreliable processors. Consensus is the process of agreeing on one result among a group of participants. This problem becomes difficult when the participants or their communication medium may experience failures.
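The consensus idea can be sketched with a simplified, in-memory, single-decree Paxos round. This is an illustration of the textbook protocol under stated assumptions, not Spanner's implementation: the `Acceptor` class, the `propose` function and the `up` flag that simulates a failed replica are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Acceptor:
    promised: int = -1                     # highest proposal number promised
    accepted_n: int = -1                   # number of the accepted proposal, if any
    accepted_value: Optional[str] = None
    up: bool = True                        # False simulates a failed replica

    def prepare(self, n: int):
        """Phase 1: promise to ignore proposals numbered below n."""
        if not self.up or n <= self.promised:
            return None
        self.promised = n
        return (self.accepted_n, self.accepted_value)

    def accept(self, n: int, value: str) -> bool:
        """Phase 2: accept unless a higher-numbered promise was made."""
        if not self.up or n < self.promised:
            return False
        self.promised = n
        self.accepted_n, self.accepted_value = n, value
        return True


def propose(acceptors, n: int, value: str) -> Optional[str]:
    """Run one proposal round; return the chosen value, or None on failure."""
    majority = len(acceptors) // 2 + 1
    promises = [p for p in (a.prepare(n) for a in acceptors) if p is not None]
    if len(promises) < majority:
        return None
    # A value accepted earlier must be adopted instead of the new one.
    prev_n, prev_value = max(promises, key=lambda p: p[0])
    if prev_n >= 0:
        value = prev_value
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None


if __name__ == "__main__":
    group = [Acceptor() for _ in range(5)]
    group[0].up = group[1].up = False      # two of five replicas are down
    print(propose(group, 1, "x=1"))        # a majority (3 of 5) still decides
    print(propose(group, 2, "x=2"))        # a later proposer adopts the chosen value
```

With two of five acceptors down, the remaining majority still agrees on a value, and a later proposal cannot overwrite it. Once a third acceptor fails, no proposal can succeed until a majority is reachable again.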

Spanserver software stack

Page 11: Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database

Replication Configuration

• Replication configurations for data can be dynamically controlled at a fine grain by applications

• Applications can specify constraints to control:

– which datacenters contain which data

– how far data is from user (to control read latency)

– how far replicas are from each other (to control write latency)

– how many replicas are maintained (to control durability, availability, and read performance)

• North America: 5 replicas, Europe: 2 replicas
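The effect of such a configuration can be sketched with a little quorum arithmetic. The `PlacementPolicy` class below is hypothetical (it is not Spanner's configuration interface) and assumes majority quorums, as in Paxos-based replication.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PlacementPolicy:
    """Hypothetical per-application replication configuration."""
    replicas_per_region: Dict[str, int]

    @property
    def total_replicas(self) -> int:
        return sum(self.replicas_per_region.values())

    @property
    def quorum(self) -> int:
        # Smallest majority of all replicas.
        return self.total_replicas // 2 + 1

    @property
    def tolerated_failures(self) -> int:
        # Replicas that may be lost while a majority is still reachable.
        return self.total_replicas - self.quorum

    def regions_with_local_quorum(self) -> List[str]:
        # Regions that hold a majority on their own, so a write could be
        # acknowledged without waiting for a replica in another region.
        return [r for r, n in self.replicas_per_region.items() if n >= self.quorum]


policy = PlacementPolicy({"North America": 5, "Europe": 2})
print(policy.total_replicas, policy.quorum, policy.tolerated_failures)
print(policy.regions_with_local_quorum())
```

For the example on the slide this gives 7 replicas, a quorum of 4, and tolerance for 3 failed replicas. Because North America alone holds a majority, the distance between replicas that matters for write latency is the distance within that region.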

Page 12: Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database

Hierarchical Data Model

• Universe (Spanner deployment)

– Database

• Tables

– Rows and columns

– Must have an ordered set of one or more primary-key columns

– Primary key uniquely identifies each row

• Hierarchies of tables

– Tables must be partitioned by the client into one or more hierarchies of tables (INTERLEAVE IN)

– The table at the top of a hierarchy is the directory table

Page 13: Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database

Storing Photo Metadata

[Figure: photo metadata schema; labels in the figure: directory table, directory]

Page 17: Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database

Storing Photo Metadata

Albums(2,1) is the row from the Albums table for user_id 2, album_id 1.

Interleaving is important because it allows clients to describe the locality relationships between tables, which is necessary for good performance in a sharded, distributed database.
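A minimal sketch of what interleaving means for physical layout: if rows are ordered by primary key, each Albums row sorts directly after the Users row it belongs to, and all rows sharing a user_id form one directory. The row data and helper names are invented for illustration; only the table names and key shapes come from the slide.

```python
from itertools import groupby

# Hypothetical rows as (table, primary key). An Albums key starts with the
# owner's user_id, so it shares a prefix with the parent Users row.
rows = [
    ("Albums", (2, 1)),
    ("Users", (1,)),
    ("Albums", (1, 2)),
    ("Users", (2,)),
    ("Albums", (1, 1)),
]


def interleave(rows):
    """Order rows by primary key so each child row follows its parent."""
    return sorted(rows, key=lambda row: row[1])


def directories(rows):
    """Group the interleaved rows by the directory-table key (user_id)."""
    return {
        user_id: [f"{table}({','.join(map(str, key))})" for table, key in group]
        for user_id, group in groupby(interleave(rows), key=lambda row: row[1][0])
    }


for user_id, members in directories(rows).items():
    print(user_id, members)
```

Each directory here holds one user and that user's albums, which is the kind of unit that can be kept together and moved as a whole.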

Page 18: Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database

Key Innovation

Spanner knows what time it is

Page 19: Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database

Is Synchronizing Time at the Global Scale Possible?

Distributed systems dogma:

• synchronizing time within and between datacenters is extremely hard and uncertain

• serialization of requests is impossible at global scale

Page 21: Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database

Is Synchronizing Time at the Global Scale Possible?

Idea: accept the uncertainty, keep it small, and quantify it (using GPS and atomic clocks)

Page 22: Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database

TrueTime API

Novel API distributing a globally synchronized "proper time"

Method        Returns
TT.now()      TTinterval: [earliest, latest]
TT.after(t)   True if t has definitely passed
TT.before(t)  True if t has definitely not arrived

The TTinterval is guaranteed to contain the absolute time at which TT.now() was invoked.
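The three calls can be sketched as a toy class. This is an assumption-laden illustration, not Google's implementation: it takes the local clock as the midpoint of the interval and uses a fixed, made-up error bound `epsilon`; the class and parameter names are hypothetical.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class TrueTime:
    """Toy TrueTime: a clock reading plus an assumed error bound (seconds)."""

    def __init__(self, epsilon: float = 0.004, clock=time.time):
        self.epsilon = epsilon
        self.clock = clock

    def now(self) -> TTInterval:
        t = self.clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t: float) -> bool:
        """True if t has definitely passed."""
        return t < self.now().earliest

    def before(self, t: float) -> bool:
        """True if t has definitely not arrived."""
        return t > self.now().latest


if __name__ == "__main__":
    tt = TrueTime()
    interval = tt.now()
    print(interval.latest - interval.earliest)  # width is about 2 * epsilon
```

A timestamp that falls inside the current interval is neither definitely past nor definitely future, so both `after` and `before` return False for it.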

Page 23: Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database

How Is TrueTime Implemented?

• A set of time master machines per datacenter

• The majority of masters have GPS receivers with dedicated antennas

• The remaining masters (which we refer to as Armageddon masters) are equipped with atomic clocks

• A timeslave daemon per machine

Page 24: Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database

Time Reference Vulnerabilities

Two forms of time reference, with two failure modes that are uncorrelated with each other:

• GPS:

– antenna and receiver failures

– local radio interference

– correlated failures (e.g. spoofing)

– GPS system outages

• Atomic clock:

– can drift significantly due to frequency error

Page 25: Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database

How Does the Daemon Work?

The daemon polls a variety of masters:

• masters chosen from nearby datacenters

• masters from farther datacenters

• Armageddon masters

From the responses, the daemon reaches a consensus about the correct timestamp. The daemon's poll interval is 30 seconds.

Between synchronizations, the daemon advertises a slowly increasing time uncertainty (ε).
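The growth of the uncertainty between polls can be sketched as a simple linear model. Only the 30-second poll interval comes from the slide; the drift rate of 200 microseconds per second and the 1 ms baseline are the figures reported in the Spanner paper and are treated here as assumptions.

```python
POLL_INTERVAL_S = 30.0        # from the slide
DRIFT_RATE = 200e-6           # assumed worst-case local drift, seconds per second
BASE_UNCERTAINTY_S = 0.001    # assumed uncertainty right after a synchronization


def epsilon(seconds_since_sync: float) -> float:
    """Advertised uncertainty: baseline plus accumulated worst-case drift."""
    return BASE_UNCERTAINTY_S + DRIFT_RATE * seconds_since_sync


if __name__ == "__main__":
    for t in (0.0, 10.0, 20.0, POLL_INTERVAL_S):
        print(f"{t:4.0f} s after sync: epsilon = {epsilon(t) * 1e3:.1f} ms")
```

Under these assumptions the uncertainty follows a sawtooth: it climbs from about 1 ms to about 7 ms over one poll interval, then drops back at the next synchronization.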

Page 26: Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database

Transactions In Spanner

• Assigns globally meaningful commit timestamps to distributed transactions

– If A happens-before B, then timestamp(A) < timestamp(B)

– A happens-before B if its effects become visible before B begins, in real time

• Visible means acked to client or updates applied to some replica

• Begins means first request arrived at Spanner server

• Two-phase commit
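The slides do not spell out how the timestamp ordering is enforced. The Spanner paper describes a "commit wait": a transaction takes the latest possible current time as its commit timestamp and does not become visible until that timestamp has definitely passed. The sketch below simulates this with a hypothetical clock class that is advanced by hand instead of sleeping; all names are made up for the example.

```python
class SimulatedTrueTime:
    """Hand-advanced clock with a fixed, assumed uncertainty (seconds)."""

    def __init__(self, epsilon: float):
        self.true_time = 0.0
        self.epsilon = epsilon

    def now(self):
        return (self.true_time - self.epsilon, self.true_time + self.epsilon)

    def after(self, t: float) -> bool:
        return t < self.now()[0]

    def advance(self, dt: float) -> None:
        self.true_time += dt


def commit(tt: SimulatedTrueTime, step: float = 0.001):
    """Pick a commit timestamp, then wait until it has definitely passed."""
    s = tt.now()[1]              # latest possible current time
    waited = 0.0
    while not tt.after(s):       # commit wait
        tt.advance(step)         # stands in for sleeping
        waited += step
    return s, waited


tt = SimulatedTrueTime(epsilon=0.005)
ts_a, waited_a = commit(tt)      # A becomes visible only after its wait
ts_b, _ = commit(tt)             # B begins after A is visible
print(ts_a < ts_b, round(waited_a / tt.epsilon, 1))
```

The wait is roughly twice the uncertainty, which is why keeping the uncertainty small matters for write latency.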

Page 27: Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database

What About Performance?

"We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions."

Two-phase commit can raise availability and performance issues.
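A minimal sketch of textbook two-phase commit shows where those issues come from. It is a generic illustration with hypothetical class and function names, not Spanner's coordinator: every participant must vote yes, so one unavailable participant blocks or aborts the whole transaction, and each commit costs two rounds of messages.

```python
class Participant:
    def __init__(self, name: str, can_commit: bool = True):
        self.name = name
        self.can_commit = can_commit   # False simulates a failure or conflict
        self.state = "idle"

    def prepare(self) -> bool:
        """Phase 1: vote on whether the transaction can commit."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants) -> bool:
    """Coordinator: commit only if every participant votes yes."""
    votes = [p.prepare() for p in participants]    # phase 1: collect votes
    decision = all(votes)
    for p in participants:                         # phase 2: broadcast decision
        if decision:
            p.commit()
        else:
            p.abort()
    return decision


if __name__ == "__main__":
    healthy = [Participant("a"), Participant("b")]
    print(two_phase_commit(healthy), [p.state for p in healthy])
    degraded = [Participant("a"), Participant("b", can_commit=False)]
    print(two_phase_commit(degraded), [p.state for p in degraded])
```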

Page 28: Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database

Summary

• Externally consistent global write-transactions with synchronous replication.

• Schematized, semi-relational data model.

• SQL-like query interface.

• Auto-sharding, auto-rebalancing, automatic failure response.

• Exposes control of data replication and placement to user/application.
