OSDC 2014: Fabrizio Manfredi - Data replication

Post on 05-Dec-2014

description

Data replication is a crucial component of distributed services deployed in a multi-data-center environment. The replication scheme needs to be carefully evaluated before it is implemented: a wrong design or misuse in most cases ends in a major service outage. Understanding replication requires understanding the algorithms behind it, so the session starts by explaining the algorithms most commonly used to deal with the CAP theorem (Consistency, Availability and Partition Tolerance), such as consistent hashing, vector clocks, gossip protocols, Paxos and Raft. The second part of the talk analyzes how products on the market implement replication (replication in action), with their advantages and disadvantages. It covers distributed filesystems (Ceph, Tahoe-LAFS, XtreemFS, ..), distributed databases (DB replication primitives and external tools like Tungsten), NoSQL (Riak, Cassandra, MongoDB, CouchDB) and frameworks for in-house solutions (beardb, open replication, ..). The talk also shows the evaluation methods and testing process for identifying the best solution for your environment.

Transcript of OSDC 2014: Fabrizio Manfredi - Data replication

Beolink.org

Data replication
Fabrizio Manfredi Furuholmen

FOSDEM 2014

Agenda

- Introduction
  - Overview
  - Theorem
  - Common Pattern
- Implementation
  - Filesystem
  - RDBMS
  - NoSQL
  - Framework
- Example

Data Replication

http://blog.open-e.com/in-a-nutshell-data-replication-snapshots-and-backup/

Data Replication

http://www.dreamstime.com/stock-images-cloud-computing-scalability-reliability-background-concept-word-image34898574

Introduction

World Connection

Main Problem

VS

Main Problem

CAP theorem

According to Brewer’s CAP theorem, it is impossible for any distributed computer system to simultaneously provide all three of Consistency, Availability and Partition Tolerance.

You can’t have all three at the same time and get an acceptable latency.

CAP

RDBMS: ACID

- Atomic: Everything in a transaction succeeds or the entire transaction is rolled back.
- Consistent: A transaction cannot leave the database in an inconsistent state.
- Isolated: Transactions cannot interfere with each other.
- Durable: Completed transactions persist, even when servers restart etc.

- Strong consistency for transactions is the highest priority
- Pessimistic
- Complex mechanisms

NoSQL: BASE

- Basic Availability
- Soft state
- Eventual consistency

- Availability and scaling are the highest priorities
- Weak consistency
- Optimistic
- Best effort
- Simple and fast

Data Distribution

Business Decision

Start with some Algorithms

Data Distribution

Replication

- Data Placement
- Data Consistency
- System Coordination
- Data Transmission

Data Placement

- Better distribution = partitioning
- Parallel operation = parallel streams / multi-core

Data Placement

Data placement by HASH

It isn’t rocket science!
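
The slide gives no code, so the following is only a minimal sketch of placement by hash, assuming the simplest scheme: a stable hash of the key taken modulo the number of nodes. The names `node_for` and `moved_fraction` and the `object-N` keys are illustrative and not from the talk. The sketch also measures how many keys change node when the cluster grows, which is the weakness that consistent hashing addresses.

```python
import hashlib


def node_for(key: str, node_count: int) -> int:
    """Pick a node index from a stable hash of the key.

    MD5 is used here only to spread keys evenly, not for security.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % node_count


def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose node changes when the cluster is resized."""
    moved = sum(
        1 for k in keys if node_for(k, old_count) != node_for(k, new_count)
    )
    return moved / len(keys)


# Hypothetical key set, used only for the demonstration.
keys = [f"object-{i}" for i in range(10_000)]
fraction = moved_fraction(keys, 4, 5)
print(f"{fraction:.0%} of keys move when growing from 4 to 5 nodes")
```

With a uniform hash, a key keeps its node only when the hash gives the same remainder modulo 4 and modulo 5, so in theory about four keys out of five are relocated by adding a single node.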

Data Distribution

http://www.cs.rutgers.edu/~pxk/417/notes/23-lookup.html

- Consistent hash
- Chord
- Space based / multi-dimensional

Data placement

http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/

- Vnode based
- Proximity based
- Replication
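
A vnode-based consistent hash ring with replication can be sketched as follows. This is an illustration under stated assumptions, not the implementation of any product named in the talk: the class `HashRing`, the default of 64 virtual nodes per physical node, the `node-x` names and the use of MD5 for ring positions are all hypothetical choices.

```python
import bisect
import hashlib


def _ring_position(value: str) -> int:
    """Map a string to a point on the ring (128-bit MD5, used only for spreading)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> physical node
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each physical node is placed on the ring at `vnodes` positions.
        for i in range(self.vnodes):
            point = _ring_position(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def preference_list(self, key, replicas=3):
        """Walk clockwise from the key and collect `replicas` distinct nodes."""
        if not self._points:
            return []
        start = bisect.bisect(self._points, _ring_position(key))
        result = []
        for offset in range(len(self._points)):
            point = self._points[(start + offset) % len(self._points)]
            node = self._owner[point]
            if node not in result:
                result.append(node)
                if len(result) == replicas:
                    break
        return result


ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
print(ring.preference_list("object-42"))
```

The first entry of the preference list acts as the primary and the following entries hold the replicas. Adding a node only takes over the keys that fall just before its own ring positions, so most keys stay where they are.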

Data Consistency

http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/

To avoid an ACID implementation while still guaranteeing consistency, some solutions leave ownership of the algorithm to the client.

- Read and write quorum
- Write quorum, read all
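
A client-driven quorum can be sketched as below. The sketch assumes a single client, in-memory replicas, integer versions and the usual overlap condition R + W > N; the names `Replica` and `QuorumClient` are hypothetical and do not come from the talk or from any specific product. "Write quorum, read all" is the special case R = N.

```python
import random


class Replica:
    """In-memory stand-in for one storage node."""

    def __init__(self):
        self.store = {}  # key -> (version, value)

    def write(self, key, version, value):
        # Keep only the newest version.
        if version > self.store.get(key, (0, None))[0]:
            self.store[key] = (version, value)

    def read(self, key):
        return self.store.get(key, (0, None))


class QuorumClient:
    def __init__(self, replicas, w, r, seed=0):
        n = len(replicas)
        if not (1 <= w <= n and 1 <= r <= n):
            raise ValueError("w and r must be between 1 and N")
        self.replicas, self.w, self.r = replicas, w, r
        self._rng = random.Random(seed)  # seeded so the demo is repeatable

    @property
    def overlapping(self):
        """True when every read set intersects every write set (R + W > N)."""
        return self.r + self.w > len(self.replicas)

    def _read_quorum(self, key):
        answers = [rep.read(key) for rep in self._rng.sample(self.replicas, self.r)]
        return max(answers, key=lambda answer: answer[0])

    def put(self, key, value):
        # Learn the latest version from a read quorum, then write to W replicas.
        version = self._read_quorum(key)[0] + 1
        for rep in self._rng.sample(self.replicas, self.w):
            rep.write(key, version, value)
        return version

    def get(self, key):
        # The highest version among R answers wins.
        return self._read_quorum(key)[1]


client = QuorumClient([Replica() for _ in range(5)], w=3, r=3)
client.put("session", "v1")
client.put("session", "v2")
print(client.get("session"))
```

Because any 3 of the 5 replicas share at least one member with the 3 that received the last write, the read always sees the newest version, even though two replicas may still hold stale data.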

Coordination Protocol

Consensus protocol
- Paxos, Raft, etc.
- Based on the state machine approach (a technique for converting an algorithm into a fault-tolerant, distributed implementation)

Epidemic (Gossip)
- Epidemic: anybody can infect anyone else with equal probability
- Anti-entropy protocols assume that synchronization is performed on a fixed schedule: every node regularly chooses another node at random or by some rule and exchanges database contents, resolving differences.
- Propagation to all nodes takes O(log n) rounds

http://www.cis.cornell.edu/IAI/events/Gossip_Tutorial.pdf
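
The O(log n) behaviour of epidemic dissemination can be checked with a small simulation. The sketch assumes a simplified push model in which every informed node contacts one uniformly random node per round (possibly itself or an already informed node); the function name and the fixed seed are illustrative and not taken from the talk.

```python
import math
import random


def push_gossip_rounds(n: int, seed: int = 0) -> int:
    """Count the rounds until a rumor started at node 0 has reached all n nodes."""
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n:
        # Every informed node pushes the rumor to one random peer.
        contacted = {rng.randrange(n) for _ in informed}
        informed |= contacted
        rounds += 1
    return rounds


for size in (16, 256, 4096):
    rounds = push_gossip_rounds(size)
    print(f"n={size}: {rounds} rounds (log2 n = {math.log2(size):.0f})")
```

The informed set can at most double per round, so log2(n) rounds is a lower bound. The random choices waste some contacts, which adds a further logarithmic number of rounds rather than a linear one.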

Transmission Protocol

Optimization
- Reorder
- Deduplication

Transmission
- By difference (Merkle tree)
- Callback
- Compression
- Auto correction

Locking
- Distributed locking
- Multiversioning
- …

mitosis
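
Transfer by difference with a Merkle tree could look roughly like the sketch below. It assumes that both replicas split their data into the same ordered list of blocks, uses SHA-256, and pads the leaf level to a power of two with the hash of an empty block; `build_tree` and `diff_blocks` are hypothetical names, not an API from the talk.

```python
import hashlib


def _digest(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(blocks):
    """Return the tree as a list of levels: leaves first, root level last."""
    leaves = [_digest(block) for block in blocks]
    size = 1
    while size < len(leaves):
        size *= 2
    # Pad so that every inner node has exactly two children.
    leaves += [_digest(b"")] * (size - len(leaves))
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append(
            [_digest(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)]
        )
    return levels


def diff_blocks(tree_a, tree_b):
    """Return the leaf indices that differ, visiting only mismatching subtrees."""
    if len(tree_a) != len(tree_b):
        raise ValueError("trees must have the same depth")
    differing = []
    stack = [(len(tree_a) - 1, 0)]  # (level, index), starting at the root
    while stack:
        level, index = stack.pop()
        if tree_a[level][index] == tree_b[level][index]:
            continue  # identical subtree: nothing to transfer
        if level == 0:
            differing.append(index)
        else:
            stack.append((level - 1, 2 * index))
            stack.append((level - 1, 2 * index + 1))
    return sorted(differing)


local = [b"block-0", b"block-1", b"block-2", b"block-3", b"block-4"]
remote = list(local)
remote[2] = b"block-2-modified"
print(diff_blocks(build_tree(local), build_tree(remote)))
```

When the two roots match, a single hash comparison proves the replicas are in sync. Otherwise only the blocks whose indices are returned need to be sent.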

Implementation

Answer … no answer

- Block replication, file: Distributed file system
- Information: RDBMS
- Document, blog, session: NoSQL
- Content with a TTL over 1m: Caching system

Distributed Filesystem

DFS is a service that provides a single point of reference and a logical tree structure for file system resources that may be physically located anywhere on the network.

One significant responsibility of a file system is to ensure that, regardless of the actions by programs accessing the data, the structure remains consistent…

Filesystem

Properties of DFS
•  Simple from the application point of view
•  Data consistency

Depending on the solution
•  Partition tolerance
•  Scalability
•  High availability

Filesystem DRBD

DRBD
- Replication mode: Asynchronous, Memory synchronous, Synchronous
- Transfer optimization: DRBD Proxy

Main Goals
- Disk replication, single service availability
- Disaster Recovery

Filesystem CEPH

Ceph
- Data distribution: hash based
- Consensus protocol: Paxos for consensus (Ceph monitors)
- Write mode: write one, read one; the client is notified when all replicas have been written
- Weak consistency with cache pool

OpenStack backend at CERN
- 1128 OSDs
- 3PB
- XXX VMs
- http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern

Main Goals
- Block device / base for other filesystems
- Cloud support, image storage and VM storage

CEPH

- Users: > 5000
- VMs: > 7000
- > 250k VMs spawned

http://www.synnefo.org/resources.html

RDBMS

Properties of RDBMS
•  Quite simple from the application point of view
•  Data consistency

Depending on the solution
•  Low partition tolerance
•  Low scalability
•  Low high availability

RDBMS

- Asynchronous replication
- Semi-synchronous

Postgres
- Synchronous
- Asynchronous

NoSQL

Properties of NoSQL
•  Fast

Depending on the solution
•  Partition tolerance
•  Scalability
•  High availability
•  Simple

NoSQL Performance

http://planetcassandra.org/nosql-performance-benchmarks/

Riak

- Geo replication
- Tunable trade-offs for distribution and replication (N, R, W)
- Distributed hash table

Filesystem over NoSQL

FUSE
- In most cases not stable

S3 Interface
- De facto Internet standard

Filesystem over NoSQL

Wooga

http://www.slideshare.net/wooga/riak-at-woogariak-meetup-sept-2013?qid=4809eca2-8378-4e70-8e75-0db29b635fa5&v=qf1&b=&from_search=3

https://fosdem.org/2014/schedule/event/nyt_cassandra/

Combine different solutions

[Diagram labels: Edge node (Varnish), NoSQL, Local cache, Centralized cache, Info, Storage, DFS, Origin (Distributed cache), Local DB, NoSQL. Arrows: decrease in the number of requests; increase in the age of the data.]

Framework

Build your system if you need to …

… do you really need to?

CERN

Framework

Don’t forget rsync!

Framework

Replication or caching?

Build a solution

•  Split in pieces
•  Track versions
•  Transfer when needed
•  Transfer the difference
•  Use notifications when possible
•  Move data close to the computation
•  Move the master close to the write operations
•  Split counters to avoid deadlock
•  In HTTP, don’t forget the ETag and Last-Modified headers
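
"Transfer when needed" with an ETag can be sketched without any network code. The example assumes a hypothetical origin handler `respond` and a hypothetical `CachingClient`; only the `ETag` header, the `If-None-Match` comparison and the 200/304 status codes are standard HTTP. Deriving the ETag from a SHA-256 content hash is one common choice, not something HTTP requires. `Last-Modified` with `If-Modified-Since` follows the same pattern using timestamps.

```python
import hashlib
from typing import Callable, Dict, Optional, Tuple

Response = Tuple[int, Dict[str, str], bytes]


def make_etag(body: bytes) -> str:
    """Build a quoted ETag from the content hash (an assumed scheme)."""
    return '"' + hashlib.sha256(body).hexdigest()[:32] + '"'


def respond(body: bytes, if_none_match: Optional[str] = None) -> Response:
    """Hypothetical origin: answer 304 with no payload when the ETag matches."""
    etag = make_etag(body)
    headers = {"ETag": etag}
    if if_none_match == etag:
        return 304, headers, b""
    return 200, headers, body


class CachingClient:
    """Keeps the last body and revalidates it instead of downloading it again."""

    def __init__(self, origin: Callable[[Optional[str]], Response]):
        self.origin = origin
        self.etag: Optional[str] = None
        self.body: Optional[bytes] = None
        self.bytes_transferred = 0

    def fetch(self) -> Optional[bytes]:
        status, headers, payload = self.origin(self.etag)
        self.bytes_transferred += len(payload)
        if status == 200:
            self.etag, self.body = headers["ETag"], payload
        return self.body


document = {"data": b"replicated content"}
cache = CachingClient(lambda etag: respond(document["data"], etag))
cache.fetch()
cache.fetch()  # revalidated with 304, no payload transferred
print(cache.bytes_transferred)
```

The second `fetch` costs a round trip but no payload, which is why the version marker matters for replicas that poll an origin.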

•  openkad
•  open-chord
•  openReplica
•  Raft

Build a solution

Five pylons

Objects
•  Separation between data and metadata
•  Each element is marked with a revision
•  Each element is marked with a hash

Cache
•  Client side
•  Callback/Notify
•  Persistent

Transmission
•  Parallel operation
•  HTTP-like protocol
•  Compression
•  Transfer by difference

Distribution
•  Resource discovery by DNS
•  Data spread on a multi-node cluster
•  Decentralized
•  Independent clusters
•  Data replication

Security
•  Secure connection
•  Client-side encryption
•  Extended ACL
•  Delegation/Federation
•  Admin delegation

Build a solution

-  Consistent hash
-  ZMQ transport protocol
-  Gossip protocol for failure detection
-  Tunable trade-offs

Pisa is a simple block data replication system for a wide range of nodes
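
The talk does not show how Pisa implements gossip-based failure detection. As a hedged illustration of the general idea only, the sketch below assumes heartbeat counters: each member increments its own counter every round, gossips its table, and suspects a peer whose counter has not increased for `fail_after` rounds. The names `Member` and `run_round`, the threshold of 3 rounds and the full-mesh exchange are assumptions made to keep the example deterministic; a real system would gossip to a few random peers over its transport (ZMQ in the slide).

```python
class Member:
    def __init__(self, name, fail_after=3):
        self.name = name
        self.fail_after = fail_after
        self.clock = 0  # local round counter
        # peer -> (highest heartbeat seen, local clock when it last increased)
        self.table = {name: (0, 0)}

    def tick(self):
        """Advance the local clock and bump our own heartbeat."""
        self.clock += 1
        beat, _ = self.table[self.name]
        self.table[self.name] = (beat + 1, self.clock)

    def merge(self, remote_table):
        """Adopt any heartbeat that is newer than what we already know."""
        for peer, (beat, _) in remote_table.items():
            known = self.table.get(peer)
            if known is None or beat > known[0]:
                self.table[peer] = (beat, self.clock)

    def suspected(self):
        """Peers whose heartbeat has not advanced for `fail_after` rounds."""
        return sorted(
            peer
            for peer, (_, seen) in self.table.items()
            if peer != self.name and self.clock - seen >= self.fail_after
        )


def run_round(alive):
    """One gossip round: every live member ticks, then shares its table."""
    for member in alive:
        member.tick()
    for sender in alive:
        for receiver in alive:
            if receiver is not sender:
                receiver.merge(sender.table)


a, b, c = Member("a"), Member("b"), Member("c")
for _ in range(5):
    run_round([a, b, c])
for _ in range(3):
    run_round([a, b])  # c has stopped sending heartbeats
print(a.suspected())
```

A stale heartbeat that is relayed by another member does not reset the timer, because only a strictly higher counter updates the table.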

And …

“There is always a failure waiting around the corner”
Werner Vogels

Thank you
http://restfs.beolink.org
manfred.furuholmen@gmail.com