Chapter 1
1 TABLE OF CONTENTS
2 Introduction
2.1 Existing solutions
2.2 Motivation
2.3 Problem statement
2.4 Contribution
2.5 Dissertation Outline
2 INTRODUCTION
At a time when the requirement for storing the world's digital data
exceeds a zettabyte (i.e. 10^21 bytes), it is both a challenge and a
necessity to develop powerful and efficient systems with enough
storage to accommodate that data. Even a very dense storage medium
such as deoxyribonucleic acid (DNA) has so far been shown to hold no
more than 700 terabytes per gram [1]. Considering that more than a
ton of genetic material would be required to store a zettabyte of
information in DNA, the argument can be made that data storage needs
to be distributed for quantitative reasons alone. Of course, among
the qualitative arguments for distributed storage systems are
reliability and availability. That is to say, distribution, possibly
across geographically remote locations, and concurrent accessibility
by anything from thousands to millions of users are just as
important as raw storage capacity.
Web giants such as Google, Amazon, Facebook and LinkedIn are among
the industries that use distributed storage systems. To fulfill their
requirements they have deployed thousands of datacenters globally,
so that their data is available at all times and the systems can
scale on demand. Furthermore, at such scales failures are common,
whether attributed to software faults or to hardware components
operating continuously for long periods. These failures need to be
considered when planning solutions and handled in implementations.
In addition to inheriting all the challenges of distributed
large-scale storage, there is also the question of what kind of data
is being stored and how it is accessed. Data stored in a distributed
manner may be of any size and poses many challenges to conventional
data storage systems; these new horizons bring big data onto the
scene.
There is no single clear definition of big data, but a widely
accepted one was given by Edd Dumbill: "Big data is data that
exceeds the processing capacity of conventional database systems.
The data is too big, moves too fast, or doesn't fit the structure of
your database architectures. To gain value from this data, you must
choose an alternative way to process it".
The challenge is that big data cannot be handled with conventional
resources or naïve solutions. When it comes to access, workloads are
typically categorized as online transaction processing¹ (OLTP) and
online analytical processing² (OLAP). These concepts are mostly used
for structured database systems. Modern applications, however,
require more powerful and efficient solutions to process petabytes
of data; this is well exemplified by MapReduce [2], originally
implemented at Google and today the basis for many open-source
systems, such as Apache's Hadoop framework.
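The MapReduce paradigm mentioned above can be illustrated with a minimal sketch (a toy Python illustration of the map and reduce phases for word counting, not Google's or Hadoop's actual implementation; the function names and sample documents are our own):

```python
from collections import defaultdict

def map_phase(documents):
    # Mapper: emit an intermediate (word, 1) pair for every word.
    for doc in documents:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    # Reducer: group intermediate pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big storage", "data storage"]
print(reduce_phase(map_phase(docs)))  # {'big': 2, 'data': 2, 'storage': 2}
```

In a real framework the mapper and reducer run in parallel on many machines, with a shuffle phase routing each key to one reducer; the sketch collapses all of that into a single process.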
An important aspect of this framework is that it works in parallel
on large datasets and long-running tasks. These datasets may be
structured, unstructured or semi-structured, which calls for a new
kind of database system that can deal with all kinds of data. To
handle such heterogeneous datasets, NoSQL (Not Only SQL) databases
were developed. NoSQL has the capability to process big data that
has the potential to be mined for valuable information.
¹ OLTP is used for transaction-based data entry and retrieval. ² OLAP is used for analytical workloads.
According to their individual requirements, industries have
developed different NoSQL databases for their own use, for example
Facebook's Cassandra [3], Amazon's Dynamo [4], Yahoo!'s PNUTS [5],
Google's BigTable [6], Riak and MongoDB [7]. Each of them uses
shards to store data. Since the main challenge is to balance big
data among all globally deployed shards, NoSQL databases have come
up with their own solutions.
2.1 EXISTING SOLUTIONS
In the following section, solutions offered by some NoSQL databases
are outlined.
Facebook's Cassandra [12] [3] is a distributed database designed
to store structured data as key-value pairs indexed by a key.
Cassandra is highly scalable in both storage and request throughput
while avoiding a single point of failure. Cassandra stores data in
tables that are very similar to distributed multi-dimensional maps,
likewise indexed by a key, and it belongs to the column family of
stores like BigTable [6]. Within a single row key it provides atomic
per-replica operations. Cassandra uses consistent hashing [13] for
data partitioning, mapping keys to nodes in a manner similar to the
Chord distributed hash table (DHT). Partitioned data is stored in a
Cassandra cluster whose nodes occupy positions on a ring, and the
DHT over its keys facilitates load balancing.
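The consistent-hashing scheme described above can be sketched as follows (a toy Python illustration, not Cassandra's actual implementation; the node names and the use of MD5 as the hash function are arbitrary choices for the example):

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Toy consistent-hash ring: each node is hashed to a position on
    the ring, and a key belongs to the first node clockwise from the
    key's own hash position."""

    def __init__(self, nodes):
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value):
        # Any uniform hash works; MD5 is used here only for illustration.
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(key)
        positions = [p for p, _ in self.ring]
        # Wrap around to the first node if the key hashes past the last one.
        i = bisect_right(positions, h) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))
```

The appeal of this scheme is that adding or removing one node only moves the keys in that node's arc of the ring, rather than rehashing the entire keyspace.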
Amazon's Dynamo [4] is a distributed key-value store that sacrifices
consistency guarantees for scalability and availability. It uses a
data-partitioning scheme similar to Cassandra's but does not use
hashes to store chunks. Furthermore, Dynamo addresses the
non-uniformity of node distribution on the ring with virtual nodes
(vnodes) and provides a slightly different partition-to-vnode
assignment strategy, which achieves a better distribution of load
across the vnodes and thus over the physical nodes.
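The vnode idea can be sketched by extending the plain ring: each physical node is hashed at several positions instead of one, so its load is averaged over many small arcs (again a toy Python illustration with arbitrary names and hash choice, not Dynamo's actual assignment strategy):

```python
import hashlib
from bisect import bisect_right

def build_vnode_ring(nodes, vnodes_per_node=4):
    # Hash each physical node at several ring positions ("vnodes") so
    # that load spreads more evenly than with one position per node.
    ring = []
    for node in nodes:
        for i in range(vnodes_per_node):
            h = int(hashlib.md5(f"{node}#{i}".encode()).hexdigest(), 16)
            ring.append((h, node))
    return sorted(ring)

def owner(ring, key):
    # A key belongs to the physical node behind the first vnode
    # clockwise from the key's hash position.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    positions = [p for p, _ in ring]
    return ring[bisect_right(positions, h) % len(ring)][1]

ring = build_vnode_ring(["node-a", "node-b", "node-c"])
print(owner(ring, "user:42"))
```

With more vnodes per physical node, the arcs owned by each machine become smaller and more numerous, and a heterogeneous cluster can give powerful machines more vnodes than weak ones.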
Scatter [14] is a highly decentralized, distributed, consistent
key-value store that provides linearizable consistency in the face
of (some) failures. For storage it uses the uniform key distribution
through consistent hashing typical of DHTs, and it employs two
mechanisms for load balancing. The first policy directs a node that
newly joins the system to randomly sample k groups and then join the
one handling the largest number of operations. The second policy
allows neighboring groups to trade off their responsibility ranges
based on the load distribution.
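The first of Scatter's two policies can be sketched as follows (a simplified Python model; the group structure and the "ops" load metric are hypothetical stand-ins for whatever load measure the real system tracks):

```python
import random

def choose_group(groups, k=3, rng=random):
    """Join policy sketch: sample k groups uniformly at random and
    pick the one currently handling the most operations, so the new
    node relieves a heavily loaded group."""
    sample = rng.sample(groups, min(k, len(groups)))
    return max(sample, key=lambda g: g["ops"])

groups = [{"id": i, "ops": ops} for i, ops in enumerate([120, 45, 300, 80])]
print(choose_group(groups, k=4))  # the group with ops=300
```

Sampling only k groups keeps the join decision cheap and decentralized: no node needs a global view of the load, yet heavily loaded groups are still the most likely to receive help.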
MongoDB [15] [7] is a schema-free, document-oriented database
written in C++. MongoDB uses replication to keep data available and
sharding to provide partition tolerance and to manage data across
the distributed environment. MongoDB stores data in the form of
chunks, and a balancer keeps the chunks evenly distributed across
all servers in the cluster. The balancer waits for the chunk counts
to become uneven beyond a threshold: if the difference in chunks
between the least and the most loaded shards reaches 8, the balancer
redistributes chunks until the difference between any two shards is
down to 2 chunks [16].
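The threshold behaviour described above can be sketched as follows (a simplified Python model of the balancer logic with hypothetical shard names; the real balancer migrates chunks between live servers rather than decrementing counters):

```python
def needs_balancing(chunk_counts, threshold=8):
    # The balancer triggers once the gap between the most- and
    # least-loaded shards reaches the migration threshold.
    return max(chunk_counts.values()) - min(chunk_counts.values()) >= threshold

def balance(chunk_counts, stop_gap=2):
    # Migrate one chunk at a time from the most- to the least-loaded
    # shard until the gap is down to stop_gap; return the number of
    # migrations performed.
    moves = 0
    while max(chunk_counts.values()) - min(chunk_counts.values()) > stop_gap:
        src = max(chunk_counts, key=chunk_counts.get)
        dst = min(chunk_counts, key=chunk_counts.get)
        chunk_counts[src] -= 1
        chunk_counts[dst] += 1
        moves += 1
    return moves

shards = {"shard-a": 14, "shard-b": 6, "shard-c": 5}
if needs_balancing(shards):
    print(balance(shards), shards)
```

Since each chunk migration is an expensive network operation, the number of moves performed by this loop is exactly the cost that the improved algorithm proposed later in this work seeks to reduce.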
Drawbacks:
2.2 MOTIVATION
It is clear from the elaboration above that the development of
flexible, scalable and reliable distributed storage systems is a
necessity. Of course, a large number of products are already
available today, some RDBMSs and some NoSQL. An RDBMS such as MySQL
can be deployed in a distributed manner with replication and shards,
or as MySQL Cluster, but its scalability is limited. Some have had
to put considerable effort into making their systems work with large
datasets, as Facebook has done [8], for example.
Among NoSQL systems there are many, such as Google's BigTable,
Riak³, Amazon's Dynamo, Facebook's Cassandra and MongoDB. These
systems scale very well over the trade-offs between consistency,
availability and partition tolerance stated by the CAP theorem [11].
However, NoSQL databases do not provide transactional support like
classical RDBMSs, and another major problem faced by these systems
is low utilization of the storage (shards). The load balancing
techniques used in these systems do not consider chunk migration as
a performance indicator; rather, they prefer high availability and
partition tolerance as their key indicators. In this work, we aim to
bridge this gap by proposing an improvement over existing load
balancing techniques that takes shard utilization and data migration
into account to increase the efficiency of NoSQL databases.
2.3 PROBLEM STATEMENT
All NoSQL databases have an implicit load balancing mechanism to
ensure that load is evenly distributed among all the shards in the
cluster. If any shard gets overloaded, the balancer redistributes
chunks among the underloaded shards so that all shards in the
cluster become evenly loaded. In this work, we propose an
improvement over the original MongoDB load balancing [12] that
minimizes the number of chunk migrations and achieves better memory
utilization of the shards.
2.4 CONTRIBUTION
In this dissertation, we design an algorithm for load balancing in
NoSQL databases. Although load balancing algorithms exist for all
NoSQL databases, we propose a new technique for MongoDB [12], a
document-oriented NoSQL database. The load that reaches the MongoDB
system is stored across multiple commodity servers holding the data
in a distributed manner. However, every host has a fixed amount of
storage capacity. Therefore, data needs to be spread over all hosts
so that no single host becomes overloaded.
³ Riak is a NoSQL database that offers high availability, fault tolerance, operational simplicity and scalability, and implements principles from Amazon's Dynamo and Google's BigTable.
In our prototype implementation, we simulate both algorithms,
MongoDB's automatic load balancing algorithm and our improved
version, and then compare the results with the help of charts and
tables.
2.5 DISSERTATION OUTLINE
The layout of the dissertation is as follows:
Chapter 1 discusses the purpose of this study, the problems with big
data, some existing solutions and our contribution.
Chapter 2 discusses NoSQL databases and their classification;
furthermore, MongoDB, a type of NoSQL database, is elaborated.
Chapter 3 describes basic concepts related to distributed systems
and the modified load balancing algorithm for MongoDB in detail.
Chapter 4 describes the implementation and the results.
Chapter 5 contains concluding remarks as well as work to be done in
the future.