Everything you always wanted to know about distributed databases, at Devoxx London, by Javier Ramirez, teowaki

everything you always wanted to know about
Highly Available Distributed Databases

Javier Ramirez @supercoco9 https://teowaki.com

Javier Ramirez: 20 years in web development (C/Java/Ruby/Python)

6 years in NoSQL (Redis, Mongo, Neo4j)

4 years in Cloud (AWS, GCP)

3 years in Big Data (BigQuery, Spark, Apache Beam/Dataflow)

Google Developer Expert and Authorised trainer on the Google Cloud Platform

My projects: https://teowaki.com

https://aprendoaprogramar.com

IBM data center in Japan during and after an earthquake (2011)

"A squirrel did take out half of our Santa Clara data centre two years back"

Mike Christian, Yahoo Director of Engineering

2012, at a conference

That's the reason why Google wraps submarine fibre cables in Kevlar, so shark bites won't damage them

Hayastan Shakarian

a.k.a. The Spade Hacker

Cut off Armenia from the Internet for almost one day*

* By accident, while scavenging copper

I have no idea what the internet is

Some data center outages reported in 2015:

* Amazon Web Services

* Apple iCloud

* Microsoft Azure

* IBM Softlayer

* Google Cloud Platform

* And of course every hosting provider with scheduled maintenance operations (Rackspace, DigitalOcean, OVH...)

Rackspace was taken down when a truck driver had an accident during a delivery to the data centre

Complex systems can and will fail

hurricanes, truck drivers, sharks eating transoceanic cable, and of course electronic and mechanical failures, human errors, and malicious attacks

You better distribute your data, or else...

Also, distributed databases can perform better and run on cheaper hardware than centralised ones

Starbucks customers couldn't buy any coffee for a whole morning

Tinder users temporarily lost their matches for a few hours

Twilio did well

Netflix had a few problems in the past, but now they are awesome

Most basic level: Backup

Of course this doesn't give you high availability, but it at least prevents data loss to an extent (depending on your backup practices)

And keep the copy on a separate data centre*

* Vodafone once lost one year of data in a fire because of this

Next Level: Replicas (master-slave)

A main server sends a binary log of changes to one or more replicas

* Also known as Write Ahead Log or WAL

Frequently used not only on relational databases, but on every kind of distributed system. Redis when configured as master-slave works in a very similar way too
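A minimal sketch of log-based replication, assuming an in-memory key-value store. The class names `Primary` and `Replica` and the `pull` method are invented for illustration and do not correspond to any particular database:

```python
class Primary:
    def __init__(self):
        self.data = {}
        self.log = []  # append-only list of (operation, key, value)

    def write(self, key, value):
        self.log.append(("set", key, value))  # log first, then apply
        self.data[key] = value

    def delete(self, key):
        self.log.append(("del", key, None))
        self.data.pop(key, None)


class Replica:
    def __init__(self):
        self.data = {}
        self.applied = 0  # how many log entries have been replayed so far

    def pull(self, primary):
        # Replay only the entries this replica has not seen yet.
        for op, key, value in primary.log[self.applied:]:
            if op == "set":
                self.data[key] = value
            else:
                self.data.pop(key, None)
        self.applied = len(primary.log)


primary, replica = Primary(), Replica()
primary.write("a", 1)
primary.write("b", 2)
replica.pull(primary)
primary.delete("a")         # not shipped yet: the replica lags behind
stale = dict(replica.data)  # still contains "a"
replica.pull(primary)       # now the replica has caught up
```

Note that every write, update and deletion ends up in the log, which is why all replicas carry the full write load.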

Master-slave is good, but:

* All the operations are replicated on all slaves

* Good scalability on reads, but not on writes

* Cannot function during a network partition

* Single point of failure (SPOF)

So the more writes you have, the busier all of your servers will be

When I say write I mean updates and deletions too

Recovery is not fully automatic and, at best, requires some extra coordination

Next Level: Multi-Master Cluster (master-master)

Every server can accept reads or writes, and send its binary log to all the other servers

* also referred to as update-anywhere

Multi-master is great, but:

* All the operations are replicated on all masters.

* When synchronous, high latency (Consistency achieved via locks, coordination and serializable transactions)

* When asynchronous, typically poor conflict resolution

* Hard to scale up or down automatically

OrientDB is quite good, so I put it into distributed databases

The more writes you have, the more load in the whole system

Also, the usual case is all the data lives on all the servers, and that simply doesn't scale

Netflix: several thousand Cassandra nodes

Facebook: several tens of thousands of nodes for analytics

The system I want:

* Always ON, even with network partitions

* Scales out both reads and writes. Doesn't need to keep all the data in all the servers

* Runs on cheap commodity diverse hardware

* Runs locally to my users (low latency)

* Grows/shrinks elastically and survives server failures

Cheap hardware: important to be heterogeneous!

or else it's very difficult to support

Then you need to let go of many convenient things you take for granted in databases

Forget about:

flexible queries, table design where everything can be queried no matter what (even if slow)

transactions

strong consistency

delegating all the complexity to the servers

CAP Theorem

Everything is a trade-off

Eventually consistent Eric Brewer

Next Level: Distributed Data stores

You know some of the names of relational, traditional, non-distributed databases

MySQL, MariaDB, Oracle, PostgreSQL, SQL Server, IBM DB2, SQLite, SAP HANA

The Amazon Dynamo paper and the Google BigTable paper are behind many of the concepts of modern distributed databases, together with the work of Leslie Lamport, the creator of LaTeX and a member of Microsoft Research

There is a new generation of systems based on the Google Spanner paper

Distributed DB design decisions

* data (keys) distribution

* data replication/durability

* conflict resolution

* membership

* status of the other peers

* operation under partitions and during unavailability of peers

* incremental scalability

Data distribution

Consistent hashing based on the key

Usually implies operations work on single keys. Some solutions, like Redis, allow the clients to group related keys consistently. Some solutions, like BigTable, allow you to colocate data by group or family.

Queries are frequently limited to query by key or by secondary indexes (say bye to the power of SQL)

Data distribution. The Ring

Some systems allow you to define virtual nodes, so a physical node in reality contains several nodes

That's one way of allowing heterogeneity in the system
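The ring with virtual nodes can be sketched as follows. This is only an illustration under assumptions: the `Ring` class, the node names and the choice of SHA-256 as the hash are hypothetical, and real systems differ in how they place virtual nodes.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Map any string to a position on the ring (SHA-256 digest as an integer)."""
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)


class Ring:
    def __init__(self):
        self._positions = []  # sorted virtual-node positions
        self._owner = {}      # position -> physical node name

    def add_node(self, node: str, vnodes: int) -> None:
        # A bigger machine can be given more virtual nodes, and so more keys.
        for i in range(vnodes):
            pos = ring_position(f"{node}#{i}")
            bisect.insort(self._positions, pos)
            self._owner[pos] = node

    def node_for(self, key: str) -> str:
        # First virtual node clockwise from the key, wrapping around at the end.
        idx = bisect.bisect_left(self._positions, ring_position(key))
        return self._owner[self._positions[idx % len(self._positions)]]


ring = Ring()
ring.add_node("small-box", vnodes=4)   # hypothetical node names
ring.add_node("big-box", vnodes=16)

owners = [ring.node_for(f"user:{i}") for i in range(1000)]
print({name: owners.count(name) for name in ("small-box", "big-box")})
```

With more virtual nodes, `big-box` will typically end up owning a larger share of the keys, which is how machines of different sizes can share one ring.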

Data Replication

How many replicas of each? Typically at least 3, so in case of conflicts there can be a quorum

Often, the distribution of keys is done taking into account the physical location of nodes, so replicas live in different racks or different datacentres
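A rough sketch of location-aware replica placement: walk the ring clockwise from the key and pick the first N nodes that sit in different racks. The `NODES` table, the rack names and the `replicas_for` helper are assumptions made up for this example.

```python
import hashlib


def ring_position(value: str) -> int:
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)


# Hypothetical cluster layout: node name -> rack
NODES = {
    "n1": "rack-a", "n2": "rack-a",
    "n3": "rack-b", "n4": "rack-b",
    "n5": "rack-c", "n6": "rack-c",
}


def replicas_for(key: str, n: int = 3) -> list:
    """Pick up to n replicas, clockwise from the key, one per rack."""
    ring = sorted(NODES, key=ring_position)
    key_pos = ring_position(key)
    # Index of the first node at or after the key; wrap to 0 if there is none.
    start = next((i for i, node in enumerate(ring)
                  if ring_position(node) >= key_pos), 0)
    chosen, used_racks = [], set()
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)]
        if NODES[node] not in used_racks:
            chosen.append(node)
            used_racks.add(NODES[node])
        if len(chosen) == n:
            break
    return chosen  # shorter than n if there are fewer racks than n


placement = replicas_for("user:42")
print(placement, [NODES[node] for node in placement])
```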

Replication: durability

If we want to have a durable system, we need at least to make sure the data is replicated in at least 2 nodes before confirming the transaction to the client.

This is called the write quorum, and in many cases it can be configured individually.

Not all data are equally important, and not all systems have the same R/W ratio.

Systems can be configured to be always writable or always readable.

Parameters W and R can also be configured to LOCAL_QUORUM, so they need agreement only from local nodes and not across datacenters
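A minimal sketch of a configurable write quorum, assuming in-memory replicas. `Node` and `quorum_write` are hypothetical names, and the sketch deliberately omits retries, timeouts and local versus cross-datacenter quorums:

```python
class Node:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.data = {}

    def store(self, key, value):
        """Return True if the write was acknowledged."""
        if not self.up:
            return False
        self.data[key] = value
        return True


def quorum_write(replicas, key, value, w):
    """Send the write to every replica; succeed if at least w acknowledge."""
    acks = sum(node.store(key, value) for node in replicas)
    return acks >= w


replicas = [Node("a"), Node("b"), Node("c", up=False)]
ok_with_two = quorum_write(replicas, "cart:1", "coffee", w=2)    # 2 acks are enough
ok_with_three = quorum_write(replicas, "cart:1", "coffee", w=3)  # one replica is down
```

A lower W keeps the system writable with more failed replicas; a higher W gives stronger durability at the cost of availability.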

by combining global quorum for reads and local quorum for reads, netflix gets 500 ms from the time it writes on one region until it can be read from another, while keeping very fast reads

Conflicts

I see a record that I thought was deleted

I created a record but cannot see it

I have different values in two nodes

Something should be unique, but it's not

usually due to load balancing, concurrency, or network partitions

No-Conflict strategies

Quorum-based systems: Paxos, Raft. Require coordination of processes with continuous elections of leaders and consensus. Worse latency

Last Write Wins (LWW): Doesn't require coordination. Good latency

But, what does Last mean?

* Google Spanner uses atomic clocks and servers with GPS clocks to synchronize time

* Cassandra tries to sync clocks and divides updates in small parts to minimize conflict

* Dynamo-like systems use vector clocks
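To see why the meaning of "last" matters, here is a small sketch of Last Write Wins with a skewed clock. The timestamps and the 5-second skew are invented numbers for illustration:

```python
def lww_merge(a, b):
    """Each version is (timestamp, value); the higher timestamp wins."""
    return a if a[0] >= b[0] else b


# Hypothetical scenario: node B's clock runs 5 seconds behind node A's.
first_write = (100.0, "v1")         # written on node A at real time 100
second_write = (102.0 - 5.0, "v2")  # written on node B at real time 102,
                                    # but stamped 97 by its slow clock

winner = lww_merge(first_write, second_write)
print(winner)  # the write that really happened last is silently discarded
```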

Conflict resolution

Can be done at Write time or at Read time.

As long as R + W > N it's possible to reach a quorum
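The R + W > N rule can be checked by brute force for small clusters: if the rule holds, every possible set of R nodes read shares at least one node with every possible set of W nodes written. The function name below is an assumption for this sketch:

```python
from itertools import combinations


def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """True if every read set of size r intersects every write set of size w."""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(nodes, r)
        for write_set in combinations(nodes, w)
    )


print(quorums_always_overlap(3, 2, 2))  # R + W = 4 > 3
print(quorums_always_overlap(3, 1, 1))  # R + W = 2, not greater than 3
```

When the sets always overlap, at least one node contacted by a read has seen the latest acknowledged write.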

Vector clocks

* Don't need to sync time

* There are several versions of the same item

* Need consolidation to prune size

* Usually the client needs to fix the conflict and update
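A minimal sketch of vector clocks as plain dictionaries of per-node counters. The helper names and the node ids are assumptions for this example, and pruning of old entries is left out:

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a, b):
    """True if clock a has seen everything clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "conflict"  # concurrent versions: someone has to reconcile them


def merge(a, b):
    """Element-wise maximum, used after the client has resolved a conflict."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


base = increment({}, "n1")     # first version, written through n1
left = increment(base, "n2")   # updated through n2
right = increment(base, "n3")  # updated concurrently through n3

print(compare(left, base))     # left descends from base
print(compare(left, right))    # neither descends from the other
resolved = increment(merge(left, right), "n1")  # client fixes and writes back
```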

Alternatives to conflict resolution

* Conflict-free Replicated Data Types (CRDTs): counters, hashes, maps

* Allowing for strong consistency on keys from the same family

* The Uber solution with serialized tokens

* Some solutions are implementing immutability, so no conflicts

* Peter David Bailis paper on Coordination Avoidance using Read Atomic Multi-Partition transactions (Nov/15)

riak: crdt
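As an illustration of the CRDT idea, here is a sketch of one of the simplest ones, a grow-only counter. It is a generic textbook construction and not the implementation of any specific database; the class and method names are assumptions:

```python
class GCounter:
    """Grow-only counter: each node only ever increments its own slot."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node id -> increments seen from that node

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Taking the maximum per slot makes merging commutative, associative
        # and idempotent, so replicas converge in any merge order.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


a, b = GCounter("a"), GCounter("b")
a.increment()
a.increment()
b.increment(3)

a.merge(b)
b.merge(a)
a.merge(b)  # merging again changes nothing
print(a.value(), b.value())
```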

membership

gossip

infection-like protocols

Gossip

A centralised server is a SPOF

Communicating state with each node is very time consuming and doesn't support partitions

Gossip protocols communicate pairs of random nodes at regular frequent intervals and exchange information.

Based on that information exchange, a new status is agreed
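A toy simulation of the gossip idea, assuming each node already knows the addresses of its peers and only a single piece of news ("n7 is down") needs to spread. The function names, the seed and the round limit are arbitrary choices for this sketch:

```python
import random


def gossip_round(state, rng):
    """Every node contacts one random peer and both merge what they know."""
    nodes = sorted(state)
    for node in nodes:
        peer = rng.choice([n for n in nodes if n != node])
        merged = state[node] | state[peer]
        state[node] = set(merged)
        state[peer] = set(merged)


def rounds_to_converge(n_nodes, seed=0, max_rounds=100):
    """Count rounds until all nodes have heard the news (None if never)."""
    rng = random.Random(seed)
    state = {f"n{i}": set() for i in range(n_nodes)}
    state["n0"].add("n7 is down")  # only one node knows at the start
    for rounds in range(1, max_rounds + 1):
        gossip_round(state, rng)
        if all("n7 is down" in known for known in state.values()):
            return rounds
    return None


print(rounds_to_converge(16))
```

No node talks to everyone, yet the number of informed nodes grows quickly with each round, which is why the approach scales to large clusters.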

Gossip example

Systems based on gossip for membership and liveness can be extended by adding extra monitoring information. This solution, for example, is used at CERN to monitor grids of thousands of nodes and monitor memory/CPU usage

Amazon Dynamo uses gossip to send ring distribution information, apart from using it to check disconnected/failed/new nodes

Incremental scalability

When a new node enters the system, the rest of nodes notice via gossip.

The node claims a partition of the ring and asks the replicas of the same partition to send data to it.

When the rest of nodes decide (after gossiping) that a node has left the system and it's not a temporary failure, the data assigned to the partitions of that node is copied to more replicas to reach the N copies.

The whole process is automatic and transparent.
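A small sketch of what a joining node has to fetch, assuming the simplest possible ring with one position per node and no replication. The node names, the key names and the helper functions are hypothetical:

```python
import hashlib


def position(value: str) -> int:
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)


def owner(key: str, nodes) -> str:
    """First node clockwise from the key's position on the ring."""
    key_pos = position(key)
    ring = sorted(nodes, key=position)
    for node in ring:
        if position(node) >= key_pos:
            return node
    return ring[0]  # wrapped around the end of the ring


keys = [f"user:{i}" for i in range(1000)]
before = {k: owner(k, ["n1", "n2", "n3"]) for k in keys}
after = {k: owner(k, ["n1", "n2", "n3", "n4"]) for k in keys}

# Keys whose owner changed are exactly the ones the new node must receive.
moved = [k for k in keys if before[k] != after[k]]
donors = {before[k] for k in moved}
print(len(moved), "keys move to n4, sent by", donors)
```

Only the keys in the range claimed by the new node move; every other key stays where it was, which is what makes growing the cluster incremental.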

Adding more than one node at a time is tricky

Operation under partition: Hinted Handoff

On a network partition, it can happen that we have fewer than W nodes of the same segment in the current partition.

In this case, the data is still replicated to W nodes, even if a node wasn't responsible for the segment. The data is kept with a hint, and stored in a special area.

Periodically, the server will try to contact the original destination and will hand off the data to it.
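The mechanism can be sketched like this, assuming in-memory nodes and that the fallback nodes are reachable. `Node`, `write` and `hand_off` are names made up for this illustration:

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # special area: (intended owner, key, value)


def write(key, value, owners, fallbacks):
    """Write to each owner; if one is down, park the data on a fallback with a hint."""
    spare = iter(fallbacks)  # assumes there are enough fallback nodes
    for owner in owners:
        if owner.up:
            owner.data[key] = value
        else:
            next(spare).hints.append((owner, key, value))


def hand_off(node):
    """Deliver stored hints whose intended owner is back; keep the rest."""
    pending = []
    for owner, key, value in node.hints:
        if owner.up:
            owner.data[key] = value
        else:
            pending.append((owner, key, value))
    node.hints = pending


a, b, c, d = Node("a"), Node("b"), Node("c"), Node("d")
c.up = False
write("cart:1", "coffee", owners=[a, b, c], fallbacks=[d])

hand_off(d)  # c is still down: the hint stays on d
hints_while_down = len(d.hints)
c.up = True
hand_off(d)  # c is back: the data is handed off and the hint is dropped
```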

Anti Entropy

A system with handoffs can be chaotic and not very effective

Anti Entropy is implemented to make sure hints are handed off or synchronized to other nodes

Anti entropy is usually achieved by using Merkle Trees, a hash of hashes structure very efficient to compare differences between nodes
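A compact sketch of comparing two replicas with Merkle trees, assuming both hold the same number of entries and that this number is a power of two. The `build` and `diff` helpers and the sample data are assumptions for the example:

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build(leaves):
    """Return the tree as a list of levels, from leaf hashes up to the root."""
    level = [h(item.encode()) for item in leaves]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def diff(a, b, depth=None, index=0):
    """Indices of differing leaves, descending only into mismatching subtrees."""
    if depth is None:
        depth = len(a) - 1  # start at the root
    if a[depth][index] == b[depth][index]:
        return []           # identical subtree: nothing to compare below
    if depth == 0:
        return [index]
    return (diff(a, b, depth - 1, 2 * index)
            + diff(a, b, depth - 1, 2 * index + 1))


replica_a = ["k1=a", "k2=b", "k3=c", "k4=d"]
replica_b = ["k1=a", "k2=b", "k3=STALE", "k4=d"]
out_of_sync = diff(build(replica_a), build(replica_b))
print(out_of_sync)
```

If the two roots match, the replicas are in sync after comparing a single hash; otherwise only the branches that differ need to be walked.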

All these features mean your clients need to be aware of some internals of the system

Clients must

* Know which close nodes are responsible for each segment of the ring, and hash locally**

* Be aware of when nodes become available or unavailable**

* Decide on durability

* Handle conflict resolution, unless under LWW

** some solutions offer a load balancer proxy to abstract the client from that complexity, but trading off latency

now you know how it works

* A system that always can work, even with network partitions

* That scales out both reads and writes

* On cheap commodity diverse hardware

* Running locally to your users (low latency)

* Can grow/shrink elastically and survive server failures

Extra level: Build your own distributed database

Netflix Dynomite, built in C

Uber Ringpop, built in JavaScript

Netflix performance:

Chaos Monkeys and 500 ms between recovery across regions

Of course you can always read the source of any open source solution, but it's easier to plug a generic ring/membership and extend it

Not Scared Of You Anymore

Q & A

Find related links at

http://bit.ly/teowaki-distributed-systems (https://teams.teowaki.com/teams/javier-community/link-categories/distributed-systems)

Cheers!

Need help with cloud, distributed systems or big data? https://teowaki.com

07/06/16

@supercoco9 #distributed-devoxx