
Transcript of “Providing Big Data Applications with Fault-Tolerant Data Migration Across Heterogeneous NoSQL databases”

Page 1: Providing Big Data Applications with Fault-Tolerant Data Migration Across Heterogeneous NoSQL databases

Marco Scavuzzo, Damian A. Tamburri, Elisabetta Di Nitto

Politecnico di Milano, Italy

BIGDSE ’16 – May 16th, 2016, Austin

Page 2: NoSQLs and Big Data applications

- Highly-available, big data applications need specific storage technologies:
  - Distributed File Systems (DFSs), e.g., HDFS, Ceph
  - NoSQL databases, e.g., Riak, Cassandra, MongoDB, Neo4j

- NoSQLs are preferred to DFSs for:
  - efficient data access (for reads and/or writes)
  - concurrent data access
  - adjustable data consistency and integrity policies
  - logic (filter, group, aggregate) in the data layer instead of the application layer (Hive, Pig, etc.)


Page 3: NoSQLs heterogeneity

- Lack of standard data access interfaces and languages

- Lack of common data models (e.g., data types, secondary indexes, integrity constraints, etc.)

- Different architectures, leading to different ways of approaching important problems (e.g., concurrency control, replication, transactions, etc.)


Page 4: Vendor lock-in

“The lack of standards due to most NoSQLs creating their own APIs [..] is going to be a nightmare in due course of time w.r.t. porting applications from one NoSQL to another. Whether it is an open source system or a proprietary one, users will feel locked in.”

C. Mohan

Page 5: Research objective

Provide a method and supporting architecture to aid fault-tolerant data migration across heterogeneous NoSQL databases for Big Data applications


Hegira4Cloud

Page 6: Hegira4Cloud requirements

1. Big Data migration across any NoSQL database and Database as a Service (DaaS)

2. High-performance data migration

3. Fault-tolerant data migration


Page 7: Hegira4Cloud approach

[Architecture diagram: Source DB → SRC → MIGRATION QUEUE → TWC → Target DB. SRC converts data to the metamodel format; TWC converts data from the metamodel format. Together, SRC and TWC form the Migration System Core.]
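Every record is thus translated twice: SRC converts source records into an intermediate metamodel, and TWC converts metamodel entities into the target database's format, so any source/target pair needs only two translators rather than a direct converter per database combination. A minimal Java sketch of what such a database-agnostic entity and the two converter roles could look like (all names here are illustrative, not Hegira4Cloud's actual API):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative intermediate representation: a database-agnostic entity
// holding a key plus named column values, so any source database can be
// translated into it and any target database can be produced from it.
public class MetamodelEntity {
    private final String key;
    private final Map<String, Object> columns = new HashMap<>();

    public MetamodelEntity(String key) { this.key = key; }

    public void addColumn(String name, Object value) { columns.put(name, value); }

    public String getKey() { return key; }

    public Map<String, Object> getColumns() { return columns; }
}

// SRC-side role: convert a source record into the metamodel.
interface SourceTranslator<S> {
    MetamodelEntity toMetamodel(S sourceRecord);
}

// TWC-side role: convert a metamodel entity into a target record.
interface TargetTranslator<T> {
    T fromMetamodel(MetamodelEntity entity);
}
```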

Page 8: Monolithic architecture data migration: GAE Datastore -> Azure Tables

Hegira4Cloud V1:

                              dataset #1    dataset #2    dataset #3
Source size (MB)              16            64            512
# of Entities                 36940         147758        1182062
Migration time (sec)          1098 (~18m)   4270 (~71m)   34111 (~568m)
Entities throughput (ent/s)   33.643        34.604        34.653
Avg. %CPU usage               4.749         3.947         4.111
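The throughput row is simply entities divided by migration time (e.g., 36940 / 1098 ≈ 33.6 ent/s); its near-constant value across a 32x growth in dataset size suggests the monolithic pipeline itself is the bottleneck, which motivates the decoupling and parallelization on the following slides.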


Page 9: Improving performance: components decoupling

Components decoupling helps in:
- distributing the computation (conversion to/from the intermediate metamodel);
- isolating possible bottlenecks;
- finding (and solving) errors.


Page 10: Improving performance: parallelization

Operations to be executed can be parallelized:
- data extraction (from the source database)
  - data should be partitionable
- data load (to the target database)


Page 11: Improving performance: TWC parallelization

Challenges:
- avoid duplicating data (i.e., process disjoint data only once)
- avoid thread starvation
- in case of a fault, already-extracted data should not be lost

Solution: RabbitMQ (see the consumer sketch below)
- messages are distributed (disjointly) in a round-robin fashion
- correctly processed messages are acknowledged and removed
- messages are persisted on disk
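A minimal sketch, using the RabbitMQ Java client, of how a TWC consumer could realize these three properties (the queue name and the write helper are assumptions for illustration):

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

public class TwcConsumer {
    private static final String QUEUE = "migration-queue"; // illustrative name

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        Connection conn = factory.newConnection();
        Channel channel = conn.createChannel();

        // durable = true: the queue (and its persistent messages) survives a broker restart
        channel.queueDeclare(QUEUE, true, false, false, null);

        // At most one unacknowledged message per consumer: the broker
        // round-robins work across all TWC consumers, so none of them starves.
        channel.basicQos(1);

        DeliverCallback onDeliver = (consumerTag, delivery) -> {
            writeToTargetDb(delivery.getBody()); // convert from the metamodel and store
            // Ack only after a successful write: if the consumer crashes first,
            // the broker redelivers the message, so extracted data is not lost.
            channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
        };
        channel.basicConsume(QUEUE, /* autoAck = */ false, onDeliver, consumerTag -> { });
    }

    private static void writeToTargetDb(byte[] metamodelEntity) {
        /* target-database-specific insert goes here */
    }
}
```

For persistence to hold end to end, the SRC side would also have to publish messages with the persistent delivery mode (e.g., MessageProperties.PERSISTENT_BASIC).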


Page 12: Improving performance: SRC parallelization

Challenges:
- complete knowledge of the stored data is needed in order to partition it
- partitions should be processed at most once (to avoid duplication)


Page 13: Improving performance: SRC parallelization

[Diagram: source database keys 1-10, 11-20, and 21-30 grouped into virtual data partitions VDP1, VDP2, and VDP3.]

Let's assume that data are associated with a unique, incremental primary key (or an indexed property).

References to the VDPs are stored in persistent storage.
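Under that assumption, computing the VDP boundaries is a straightforward split of the key space. A minimal sketch (the class, method, and fixed partition size are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: split the key space [firstKey, lastKey] into
// fixed-size virtual data partitions (VDPs), assuming the unique,
// incremental primary key described on the slide.
public class VdpPlanner {

    public record Vdp(int id, long lowKey, long highKey) { }

    public static List<Vdp> plan(long firstKey, long lastKey, long vdpSize) {
        List<Vdp> vdps = new ArrayList<>();
        int id = 1;
        for (long low = firstKey; low <= lastKey; low += vdpSize) {
            long high = Math.min(low + vdpSize - 1, lastKey);
            vdps.add(new Vdp(id++, low, high));
        }
        return vdps; // to be persisted, so a restarted SRC can resume
    }
}
// Example: plan(1, 30, 10) -> [VDP(1, 1, 10), VDP(2, 11, 20), VDP(3, 21, 30)]
```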

Page 14: Addressing faults

Types of (non-trivial) faults:
- database faults
- component faults
- network faults

On a connection loss, not all databases guarantee a unique pointer into the data from which to resume reading (e.g., Google Datastore).

[Diagram: the same architecture, showing a connection loss between the Source DB and SRC.]

Page 15: Virtual data partitioning

[Diagram: the source database's keys, grouped into VDP1 (keys 1-10), VDP2 (11-20), and VDP3 (21-30), each mapped to an entry of a status log kept in ZooKeeper (PARTITION STATUS).]

Status Log:

VDPid  Status
1      migrated
2      under_mig
3      not_mig

State machine: not_mig --migrate--> under_mig --finish_mig--> migrated
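With the status log in ZooKeeper, the migrate transition can be made safe under parallel SRC threads via a versioned (compare-and-set) update on each VDP's znode. A minimal sketch using the plain ZooKeeper client; the znode path layout and status strings mirror the slide but are assumptions, not Hegira4Cloud's actual schema:

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class VdpStatusLog {
    private final ZooKeeper zk;

    public VdpStatusLog(ZooKeeper zk) { this.zk = zk; }

    /** Returns true iff the calling SRC thread won the right to migrate this VDP. */
    public boolean tryClaim(int vdpId) throws KeeperException, InterruptedException {
        String path = "/hegira/vdp/" + vdpId; // hypothetical znode path
        Stat stat = new Stat();
        String status = new String(zk.getData(path, false, stat), StandardCharsets.UTF_8);
        if (!"not_mig".equals(status)) {
            return false; // already claimed or migrated by another thread
        }
        try {
            // setData fails with BadVersion if another thread changed the znode
            // since our read, so each VDP moves to under_mig at most once.
            zk.setData(path, "under_mig".getBytes(StandardCharsets.UTF_8), stat.getVersion());
            return true;
        } catch (KeeperException.BadVersionException raced) {
            return false; // lost the race: another SRC thread claimed this VDP
        }
    }
}
```

The finish_mig transition would update the same znode to migrated once the VDP is fully extracted, and a restarted component could scan the log to resume only the not_mig and under_mig partitions.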

Page 16: Hegira4Cloud V2

[Architecture diagram: Source DB → SRC → MIGRATION QUEUE → TWC → Target DB, with SRC and TWC coordinating through a STATUS LOG.]

Page 17: Hegira4Cloud V2: Evaluation

- 1 Source Reading Thread
- 40 Target Writing Threads

                              Monolithic architecture                Parallel distributed
                                                                     architecture
                              dataset #1   dataset #2   dataset #3   dataset #1
Source size (MB)              16           64           512          318464 (311 GB)
# of Entities                 36940        147758       1182062      ~107M
Migration time (sec)          1098         4270         34111        124867 (~34.5 h)
Entities throughput (ent/s)   33.643       34.604       34.653       856.41
Avg. %CPU usage               4.749        3.947        4.111        49.87
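At ~107M entities in 124867 s, V2 sustains about 856 ent/s, roughly a 25x throughput improvement over the monolithic architecture's ~34.65 ent/s, at the cost of proportionally higher CPU usage.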


Page 18: Conclusions

- Efficient, fault-tolerant method for data migration

- Architecture supporting data migration across NoSQL databases
  - supports several databases (Azure Tables, Cassandra, Google Datastore, HBase)
  - evaluated on an industrial case study

Page 19: Future work

- Support online data migrations

- Rigorous tests for assessing data completeness and correctness


Page 20:

Marco Scavuzzo
PhD student @ Politecnico di Milano
You can find me at: [email protected]

Page 21:

Credits: presentation template by SlidesCarnival