Chapter 1 Final


Chapter 1

1 TABLE OF CONTENTS

2 Introduction

2.1 Existing solutions

2.2 Motivation

2.3 Problem statement

2.4 Contribution

2.5 Dissertation Outline

2 INTRODUCTION

At a time when the world's demand for digital data storage exceeds a zettabyte (i.e., 10^21 bytes), developing powerful and efficient systems with enough capacity to accommodate that data is both a challenge and a necessity. In terms of raw density, even a medium as dense as deoxyribonucleic acid (DNA) has so far been shown to hold no more than 700 terabytes per gram [1]. Considering that storing a zettabyte of information in DNA would therefore require more than a ton of genetic material, the argument can be made that data storage needs to be distributed for quantitative reasons alone. Of course, among the qualitative arguments for distributed storage systems are reliability and availability. That is to say, distribution, possibly across geographically remote locations, and concurrent accessibility by millions of users are just as important as raw storage capacity.

Web giants such as Google, Amazon, Facebook, and LinkedIn are among the industries that use distributed storage systems. To fulfill their requirements they have deployed thousands of datacenters globally, so that their data not only remains available at all times but can also be scaled at any time. Furthermore, at such scales failures are inevitable, whether attributed to software faults or to hardware components operating continuously for long periods. These failures must be considered when planning solutions and handled in implementations.

In addition to inheriting all the challenges of large-scale distributed storage, there is also a concern about the kind of data being stored and the way it is accessed. Data stored in a distributed manner may be of any size, which poses many challenges for conventional data storage systems; these new horizons bring big data onto the scene.

There is no single, clear definition of big data, but a widely accepted one is given by Edd Dumbill: "Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the structure of your database architectures. To gain value from this data, you must choose an alternative way to process it."

The challenge is to handle big data with the available resources rather than with naive solutions. When it comes to access, workloads are typically categorized as online transaction processing1 (OLTP) and online analytical processing2 (OLAP). These concepts are mostly used for structured database systems. Modern applications, however, require more powerful and efficient solutions to process petabytes of data. This is well exemplified by MapReduce [2], originally implemented by Google but today the basis of many open-source systems, such as Apache's Hadoop framework.
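The MapReduce programming model can be sketched with a toy word-count example. This is an illustrative single-process sketch of the model only, not Google's or Hadoop's actual API; the function names are our own:

```python
# Minimal sketch of the MapReduce model: a map phase emits key-value
# pairs, a shuffle groups them by key, and a reduce phase aggregates.
# In a real framework these phases run in parallel across many machines.
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in the input document."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Aggregate all counts emitted for one word."""
    return (key, sum(values))

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because the map and reduce functions are pure, a framework is free to partition documents and keys across nodes, which is what makes the model scale to long-running tasks over large datasets.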

The important aspect of this framework is that it operates in parallel on large datasets and long-running tasks. These datasets may be structured, semi-structured, or unstructured, and therefore require a new kind of database system that can deal with all kinds of data. To handle such heterogeneous datasets, NoSQL (Not Only SQL) databases were developed. NoSQL has the capability to process big data that can be mined for valuable information.

1 OLTP is used for transaction-based data entry and retrieval.
2 OLAP is used for analytical workloads.

Individual industries have developed different NoSQL databases according to their own requirements, for example Facebook's Cassandra [3], Amazon's Dynamo [4], Yahoo!'s PNUTS [5], Google's BigTable [6], or Riak and MongoDB [7]. Each of them uses shards to store its data. Since the main challenge is to handle big data across all globally deployed shards, NoSQL databases have come up with their own solutions.

2.1 EXISTING SOLUTIONS

In the following section, some of the solutions offered by NoSQL databases are described.

Facebook's Cassandra [12] [3] is a distributed database designed to store structured data as key-value pairs indexed by a key. Cassandra is highly scalable in two respects, storage and request throughput, while avoiding any single point of failure. Cassandra stores data in tables that closely resemble distributed multi-dimensional maps, again indexed by a key, and it belongs to the column family of stores, like BigTable [6]. It provides atomic per-replica operations on a single row key. For data partitioning, Cassandra uses consistent hashing [13] to map keys to nodes, in a manner similar to the Chord distributed hash table (DHT). Partitioned data is stored across the Cassandra cluster as nodes take positions on the ring, and the DHT over its keys facilitates load balancing.
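The consistent-hashing idea that Cassandra borrows from Chord-style DHTs can be illustrated with a minimal ring. This is a conceptual sketch only; the hash function, node names, and interface below are our assumptions, not Cassandra's implementation. Its key property is that adding a node remaps only the keys falling into the new node's arc:

```python
# Conceptual consistent-hashing ring: nodes and keys are hashed onto the
# same circular space, and each key belongs to the first node found when
# walking clockwise from the key's position.
import hashlib
from bisect import bisect

def ring_hash(s):
    """Map a string to a position on the ring (MD5 chosen for illustration)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes):
        # Each node owns one position on the ring, kept sorted for lookup.
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key):
        """Walk clockwise from the key's position to the next node."""
        positions = [pos for pos, _ in self.ring]
        idx = bisect(positions, ring_hash(key)) % len(self.ring)
        return self.ring[idx][1]
```

When a node joins, only keys in the arc it takes over change owner; every other key keeps its previous node, which is exactly why DHT-based stores can grow without global rehashing.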

Amazon's Dynamo [4] is a distributed key-value store that sacrifices consistency guarantees for scalability and availability. It uses a data-partitioning scheme similar to Cassandra's but does not use hashes to store chunks. Furthermore, Dynamo addresses the non-uniformity of node distribution on the ring with virtual nodes (vnodes) and provides a slightly different partition-to-vnode assignment strategy, which achieves a better distribution of load across the vnodes and thus over the physical nodes.
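Dynamo's virtual-node idea can be illustrated by hashing each physical node to many ring positions: with enough vnodes, each physical node's share of the keyspace evens out. The node names and vnode count below are illustrative assumptions, not Dynamo's actual parameters:

```python
# Sketch of virtual nodes: each physical node is placed at many positions
# on the ring, so the arcs it owns average out and the load per physical
# node becomes close to uniform.
import hashlib
from bisect import bisect
from collections import Counter

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=128):
    # One ring entry per (node, replica index) pair.
    return sorted((h(f"{node}#{i}"), node) for node in nodes for i in range(vnodes))

def owner(ring, key):
    positions = [pos for pos, _ in ring]
    return ring[bisect(positions, h(key)) % len(ring)][1]

ring = build_ring(["A", "B", "C"], vnodes=128)
load = Counter(owner(ring, f"key-{i}") for i in range(3000))
# With 128 vnodes per node, each physical node's share of 3000 keys
# lands near the ideal 1000, unlike a single-position-per-node ring.
```

A single position per physical node can leave one node owning a far larger arc than the others; multiplying positions is the standard remedy, and it also lets heterogeneous machines take proportionally more or fewer vnodes.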

Scatter [14] is a highly decentralized, consistent distributed key-value store that provides linearizable consistency in the face of (some) failures. For storage, it uses the uniform key distribution through consistent hashing typical of DHTs, and it employs two mechanisms for load balancing. The first policy directs a node that newly joins the system to randomly sample k groups and then join the one handling the largest number of operations. The second policy allows neighboring groups to trade off their responsibility ranges based on the load distribution.

MongoDB [15] [7] is a schema-free, document-oriented database written in C++. MongoDB uses replication to provide data availability and sharding to provide partition tolerance and to manage data across the distributed environment. MongoDB stores data in the form of chunks, and a balancer is used to keep the chunks evenly distributed across all servers in the cluster. The balancer waits for the chunk-count imbalance to reach a threshold: in the field, once the difference between the most and least loaded shards reaches 8 chunks, the balancer redistributes chunks until the difference between any two shards is down to 2 chunks [16].
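The balancing rule just described can be approximated with a short simulation. This is an illustrative sketch of the thresholds reported in [16], not MongoDB's actual balancer code, which operates per collection and migrates chunks asynchronously:

```python
# Approximate simulation of the described MongoDB balancing rule:
# balancing is triggered once the most- and least-loaded shards differ
# by 8 or more chunks, then chunks migrate one at a time until the
# spread between any two shards is at most 2.
MIGRATION_THRESHOLD = 8   # chunk-count difference that triggers balancing
TARGET_SPREAD = 2         # balancing stops once max - min <= this

def balance(chunks_per_shard):
    """chunks_per_shard: dict shard name -> chunk count (mutated in place).
    Returns the number of chunk migrations performed."""
    most = max(chunks_per_shard, key=chunks_per_shard.get)
    least = min(chunks_per_shard, key=chunks_per_shard.get)
    if chunks_per_shard[most] - chunks_per_shard[least] < MIGRATION_THRESHOLD:
        return 0  # spread below threshold: balancer stays idle
    migrations = 0
    while True:
        most = max(chunks_per_shard, key=chunks_per_shard.get)
        least = min(chunks_per_shard, key=chunks_per_shard.get)
        if chunks_per_shard[most] - chunks_per_shard[least] <= TARGET_SPREAD:
            return migrations
        chunks_per_shard[most] -= 1   # move one chunk from the most loaded...
        chunks_per_shard[least] += 1  # ...to the least loaded shard
        migrations += 1
```

For example, two shards holding 20 and 10 chunks trigger balancing (difference 10 >= 8) and need four single-chunk migrations to reach a spread of 2, while shards holding 14 and 10 chunks are left alone. Counting migrations this way is also how the improvement proposed later in this work can be measured against the baseline.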


2.2 MOTIVATION

It is clear from the elaboration above that the development of flexible, scalable, and reliable distributed storage systems is a necessity. Of course, a large number of products are already available today, some RDBMSs and some NoSQL. Among RDBMSs, MySQL can be deployed in a distributed manner with replication and shards, or as MySQL Cluster, but it has limited scalability. Some organizations have put considerable effort into making their systems work with large datasets, as Facebook has done [8], for example.

On the NoSQL side, there are many systems, such as Google's BigTable, Riak3, Amazon's Dynamo, Facebook's Cassandra, and MongoDB. These systems scale very well by trading off consistency, availability, and partition tolerance according to the CAP theorem [11]. However, NoSQL databases do not provide transactional support like classical RDBMSs, and another major problem faced by these systems is low utilization of the storage (shards). The load balancing techniques used in these systems do not treat chunk migration as a performance indicator; rather, they prioritize high availability and partition tolerance as their key indicators. In this work, we aim to bridge this gap by proposing an improvement over existing load balancing techniques that takes shard utilization and data migration into account, to increase the efficiency of NoSQL databases.

2.3 PROBLEM STATEMENT

All NoSQL databases have an implicit load balancing mechanism that makes sure load is evenly distributed among all the shards in the cluster. If any shard becomes overloaded, the balancer redistributes chunks among the underloaded shards until all shards in the cluster are evenly loaded. In this work, we propose an improvement over the original MongoDB load balancing [12] that minimizes the number of chunk migrations and improves the memory utilization of the shards.

2.4 CONTRIBUTION

In this dissertation, we design an algorithm for load balancing in NoSQL databases. Although load balancing algorithms exist for all NoSQL databases, we propose a new technique for MongoDB [12], a document-oriented NoSQL database. The load that reaches the MongoDB system is stored across multiple commodity servers in a distributed manner. However, each host has a fixed amount of storage capacity. Therefore, data must be spread across all hosts so that no single host becomes overloaded.

3 Riak is a NoSQL database that offers high availability, fault tolerance, operational simplicity, and scalability, implementing principles from Amazon's Dynamo and Google's BigTable.

In the prototype implementation, we simulate both algorithms, MongoDB's automatic load balancing algorithm and our improved version of it, and then compare the results with the help of charts and tables.

2.5 DISSERTATION OUTLINE

The layout of the dissertation is as follows:

Chapter 1 discusses the purpose of this study, the problems of big data, some existing solutions, and our contribution.

Chapter 2 discusses NoSQL databases and their classification. Furthermore, MongoDB, a type of NoSQL database, is elaborated.

Chapter 3 describes some basic concepts related to distributed systems and the modified load balancing algorithm for MongoDB in detail.

Chapter 4 describes the implementation and the results.

Chapter 5 contains concluding remarks as well as work to be done in the future.