NoSQL Databases: An Introduction and Comparison between Dynamo, MongoDB and Cassandra
INSY 5337 Data Warehousing – Term Paper
NoSQL Databases: An Introduction and Comparison between Dynamo,
MongoDB and Cassandra
Authored by:
Nitin Shewale, Aditya Kashyap, Akshay Vadnere, Vivek Adithya, Aditya Trilok
Abstract
Data volumes have grown exponentially in recent years, and this growth across all business
domains has made the analysis and structuring of data increasingly important. NoSQL databases
are becoming popular as more organizations consider them a feasible option because of their
schema-less structure and their ability to handle Big Data. In this paper, we describe the main
types of NoSQL databases from an implementation perspective: key-value, column-oriented and
document-oriented stores. We consolidate an applied interpretation of NoSQL systems in terms of
database features such as security, concurrency control, partitioning, replication and
read/write implementation. We also compare three popular products on these factors and
recommend a particular NoSQL solution.
1. Introduction
Until recently, relational database systems were at the forefront of data storage and
management. The advent of mobile applications that require real-time analysis, such as GPS-based
services, banking and social media, has led to huge volumes of unstructured data being produced
every second. Traditional RDBMSs struggle with this data because they mainly store structured
data in tabular form. Mapping unstructured data onto a relational database increases complexity
and requires expensive infrastructure to model it. Even when the data model does fit into SQL,
the broad set of features that SQL provides becomes an overhead. A relational schema also becomes
a burden for applications that store data in multiple forms such as videos, blogs and images.
A new approach to managing unstructured data was therefore introduced, known as NoSQL
(Not Only SQL).
NoSQL covers a broad range of approaches to data structuring, storage and aggregation. It can
store unstructured data and provide real-time analysis to back web service applications. It
relaxes conventional database guarantees such as Atomicity, Consistency, Isolation and Durability
(ACID) to attain flexible data handling, and it provides built-in data partitioning and
replication. In most business domains, data is governed by company policies and processes for
data control and quality; NoSQL moves away from these restrictions in favour of the performance
and scalability requirements of particular applications and services [1][2][3][4][10].
2. NoSQL Characteristics
The NoSQL counterpart of the ACID properties is BASE, which follows from the CAP theorem. The CAP
theorem concerns three properties of a distributed data system and states that at most two of
them can be guaranteed at the same time –
Consistency – Every part of the system sees the same data at the same time.
Availability – The data is available at any time, and every request receives a response.
Partition Tolerance – Failure of one section or partition of the system does not cause the
whole system to fail [7][8][9].
NoSQL database systems such as MongoDB and Cassandra relax consistency to attain greater
availability and efficient partitioning. This gave rise to systems built on BASE principles.
Basically Available – Data is distributed across several systems, so it remains available
from one of them even if another fails.
Soft State – Because the data is distributed, replicas may temporarily differ, so the state
read at any moment is not guaranteed to be consistent.
Eventually Consistent – The data becomes consistent eventually, even if it is not consistent
at a given point in time [5][6][10].
3. Features of NoSQL
Flexible Data Models – Unlike the relational model, which has a fixed schema, NoSQL does not
impose a predefined schema. Applications based on NoSQL have data models explicitly designed
and optimized for them.
Partial Record Updates – NoSQL data models emphasize column-based processing, which allows
individual attributes to be updated and data to be aggregated over more than one attribute
or entity.
Optimized MapReduce Processing – Many NoSQL systems provide MapReduce as a native facility for
distributing and aggregating work over the data.
Horizontal Scalability – Processors with their own resources can be added on the fly, and data
is partitioned horizontally across them. Each node processes a subset of the data, which
increases the efficiency of the application. Horizontal scalability is easier to achieve in
the NoSQL data model than in an RDBMS [1].
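As a rough illustration of how MapReduce and horizontal partitioning fit together, the sketch below counts records that are already split across three nodes. The data set and the function names `map_phase` and `reduce_phase` are hypothetical and do not correspond to any particular product's API.

```python
from collections import Counter
from functools import reduce

# Hypothetical data set, already split across three nodes (horizontal partitions).
partitions = [
    ["mongo", "cassandra", "dynamo"],
    ["cassandra", "cassandra"],
    ["dynamo", "mongo"],
]

def map_phase(partition):
    """Runs independently on each node: count the records that node holds."""
    return Counter(partition)

def reduce_phase(left, right):
    """Merges two partial results into one."""
    return left + right

# Each partition is processed on its own; only the small partial counts are combined.
partial_counts = [map_phase(p) for p in partitions]
totals = reduce(reduce_phase, partial_counts, Counter())
```

Because each node only touches its own subset, adding a node adds processing capacity without changing the map or reduce functions.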
4. Types of NoSQL Databases
Key Value -> Key-value data stores reference data using a unique key. The key acts as a link
to data that is stored independently on disk. New data values can be added without
conflicting with existing data. Key-value stores are therefore entirely schema-less; the only
structure that can be derived from them is the set of key-value pairs. In this paper we
discuss our findings on DynamoDB by Amazon.
Document -> Document data stores hold collections of uniquely identified sets of key-value
pairs known as documents. Each document is identified by its own unique ID within the
collection. Document stores allow new documents to be stored with different kinds of
attributes. In this paper we discuss our findings on MongoDB.
Column -> These are column-centric data stores in which every column is indexed. They provide
efficient, high-speed read and write operations. Any modification or addition of data is
stored as a timestamped version. We use Cassandra as the example of a column data store
[1][10][12].
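To make the three models concrete, the following sketch expresses one hypothetical user record in each of them using plain Python dictionaries. The record contents and the helper `latest` are illustrative assumptions only; real products use their own storage formats.

```python
# Key-value: the store sees only an opaque value behind a unique key.
key_value_store = {
    "user:42": b'{"name": "Ada", "city": "Austin"}',
}

# Document: each document has its own ID and may carry different attributes.
document_collection = {
    "42": {"name": "Ada", "city": "Austin"},
    "43": {"name": "Bo", "languages": ["en", "hi"]},  # different fields, same collection
}

# Column: row key -> column name -> list of (value, timestamp) versions.
column_store = {
    "42": {
        "name": [("Ada", 1)],
        "city": [("Dallas", 1), ("Austin", 2)],  # a change is stored as a new version
    },
}

def latest(row_key, column):
    """Return the value with the highest timestamp for one column."""
    versions = column_store[row_key][column]
    return max(versions, key=lambda version: version[1])[0]
```

Only the document and column models let the store itself look inside a record; the key-value model can do nothing but return the bytes stored under the key.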
5. Comparative study between DynamoDB, Cassandra and MongoDB
5. A. DynamoDB
Dynamo was designed to provide a storage system within Amazon's platform that would remain
resilient under unforeseen circumstances.
5. A.1 Key Features of Dynamo:
Key-Value Data Model - Data is represented as objects, and each object is identified by a
unique key. The operations supported on the data are get and put on a specified key.
Eventual Consistency - The primary objective of Dynamo is to remain available under
unforeseen circumstances. Full consistency cannot be guaranteed at the moment a write is
accepted, so replicas are updated in the background and converge to a consistent state
over time.
Symmetry and Decentralization - Every node in Dynamo has the same responsibilities as its
peers. There is therefore no single point of failure, and little manual intervention is
required [24][26][27][28][29].
5. A.2 Operations:
Dynamo performs the following operations:
Get - To return the object associated with the key.
Put - To associate the object with the specified key.
5. A.3 Security:
Dynamo was built for a trusted internal environment and does not implement authentication or
authorization, so it is unsuitable for scenarios that require access control [24][26][27][28][29].
5. A.4 Partitioning:
Dynamo is a highly scalable system that adapts to varying amounts of data by adding and
removing nodes. To implement partitioning, Dynamo uses a technique called consistent hashing,
in which every node is assigned to one or more points on a fixed ring. Each data item is
identified by a unique key, and hashing that key yields a point on the ring. The ring is then
walked clockwise from that point, and the item is assigned to the first node encountered. This
is an effective partitioning method because removing a node affects only its immediate
neighbours on the ring.
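The ring lookup described above can be sketched in a few lines. This is a minimal illustration, not Dynamo's implementation: the class name `HashRing`, the node names, the number of points per node and the use of MD5 as the hash function are all assumptions made for the example.

```python
import bisect
import hashlib

def ring_position(key: str) -> int:
    """Hash a string to a point on the ring (MD5 is used here only for illustration)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, points_per_node=8):
        # Each node is placed at several points on the ring.
        self._points = sorted(
            (ring_position(f"{node}#{i}"), node)
            for node in nodes
            for i in range(points_per_node)
        )

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's position to the first node point."""
        idx = bisect.bisect_right(self._points, (ring_position(key), ""))
        return self._points[idx % len(self._points)][1]  # wrap around the ring

keys = [f"item-{i}" for i in range(1000)]
before = HashRing(["A", "B", "C", "D"])
after = HashRing(["A", "B", "C"])  # node D removed
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
```

In this sketch the only keys that change owner when `D` is removed are the keys that `D` held; every other key stays where it was, which is the property that makes the scheme attractive for adding and removing nodes.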
5. A.5 Replication:
The data is replicated on multiple hosts, which provides high availability and durability.
Dynamo replicates each data object at N nodes, where the value of N is configurable. Each key K
is assigned to a coordinator node, which stores the data for K locally and also replicates it
to the next N-1 nodes on the ring [24][26][27][28][29].
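A minimal sketch of this replica placement, assuming one ring point per node and MD5 as the hash function (both are simplifications chosen for the example, and `preference_list` is a hypothetical helper name):

```python
import bisect
import hashlib

def ring_position(key: str) -> int:
    """Hash a string to a point on the ring (illustrative choice of hash)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def preference_list(key, nodes, n):
    """Return the coordinator for `key` followed by its clockwise successors, N nodes in total."""
    ring = sorted((ring_position(node), node) for node in nodes)
    start = bisect.bisect_right(ring, (ring_position(key), ""))
    count = min(n, len(ring))  # cannot place more replicas than there are nodes
    return [ring[(start + i) % len(ring)][1] for i in range(count)]

nodes = ["A", "B", "C", "D", "E"]
replicas = preference_list("cart:1001", nodes, n=3)
```

The first entry of `replicas` plays the role of the coordinator; the remaining N-1 entries are the nodes it replicates to.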
5. A.6 Storage:
Every node in Dynamo has its own local persistence engine, which stores the data as binary
objects. The persistence engine is a pluggable component; engines in use include MySQL and
Berkeley Database (BDB). The advantage of pluggable engines is that users can choose the engine
that suits their requirements. For instance, BDB handles relatively small objects whereas MySQL
can handle larger objects [24][26][27][28][29].
5. A.7 Read/Write Implementation:
Dynamo implements a protocol with two parameters, R and W, which represent the minimum number
of nodes that must participate in a read and a write operation respectively. To perform a
write, the coordinator writes the data locally and then sends the write request to the other
N-1 replica nodes. If at least W-1 of them respond, the write is considered successful and the
coordinator informs the client.
To perform a read, the coordinator sends a read request to the N-1 other replica nodes. Once at
least R-1 of them have responded, the result is returned to the client. If different nodes
return different versions of the object, the coordinator sends the client a list of objects
rather than a single object [24][26][27][28][29].
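The R/W protocol can be approximated with a small in-memory simulation. This is a sketch under simplifying assumptions: the coordinator is modelled as one of the N replicas, versions are plain integers, and the names `Replica`, `quorum_write` and `quorum_read` are hypothetical.

```python
class Replica:
    def __init__(self):
        self.data = {}  # key -> (version, value)
        self.up = True

    def put(self, key, version, value):
        if not self.up:
            return False
        self.data[key] = (version, value)
        return True

    def get(self, key):
        return self.data.get(key) if self.up else None

def quorum_write(replicas, w, key, version, value):
    """Succeeds once at least W replicas (coordinator included) acknowledge."""
    acks = sum(replica.put(key, version, value) for replica in replicas)
    return acks >= w

def quorum_read(replicas, r, key):
    """Collect replies from reachable replicas and require at least R of them."""
    replies = [replica.get(key) for replica in replicas if replica.up]
    if len(replies) < r:
        raise RuntimeError("not enough replicas for a read quorum")
    found = {reply for reply in replies if reply is not None}
    return sorted(found)  # more than one entry means the replicas have diverged

N, R, W = 3, 2, 2
replicas = [Replica() for _ in range(N)]
ok_first = quorum_write(replicas, W, "k", 1, "old")
replicas[2].up = False  # one replica misses the next write
ok_second = quorum_write(replicas, W, "k", 2, "new")
replicas[2].up = True
versions = quorum_read(replicas, R, "k")
```

Here the second write still succeeds with one replica down, and the later read returns both the old and the new version as a list, which mirrors the case where the coordinator hands several objects back to the client.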
5. A.8 Concurrency Control:
Multiple clients can access shared objects concurrently. A write operation returns before all
replica nodes have been updated, so a later read may return different versions of the same
object. Dynamo detects such divergent versions and returns them to the client for
reconciliation [24][26][27][28][29].
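This section does not say how divergent versions are told apart. The Dynamo paper [28] describes vector clocks for this purpose, and the sketch below assumes that technique; the function names and the example clocks are hypothetical.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock `a` has seen everything recorded in clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a: dict, b: dict) -> str:
    """Classify two versions by their vector clocks."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "concurrent"  # neither supersedes the other, so the client must reconcile

v1 = {"X": 1}          # first write, handled by node X
v2 = {"X": 2}          # overwrites v1, again via node X
v3 = {"X": 1, "Y": 1}  # written from v1 via node Y, unaware of v2
```

In this example `v2` and `v3` both descend from `v1` but not from each other, so a read that encounters both would have to return both.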
5. B. MongoDB
MongoDB is a document-oriented data store developed in C++. Its indexing is based on the
document key structure. It is a schema-less product designed around performance and query
optimization [12][10].
5. B.1 Features of MongoDB
Flexibility during the initial phases of development and design.
Horizontal scalability is an inbuilt feature of MongoDB.
User-friendly tools to transfer data between different databases.
Compatibility between implementations in various programming languages [30].
5. B.2 Operations:
MongoDB allows these operations:
Insert – Adds new documents to a collection.
Find – Retrieves documents from a collection.
Update – Updates documents of a collection.
Remove – Removes a document from a collection [23][10].
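The four operations can be illustrated with a toy in-memory collection. This is deliberately not the MongoDB driver API: `ToyCollection` and its method signatures are hypothetical stand-ins that only mirror the operation names listed above, with queries limited to exact field matches.

```python
import itertools

class ToyCollection:
    """In-memory stand-in for a document collection; not the MongoDB API."""

    def __init__(self):
        self._docs = {}
        self._ids = itertools.count(1)

    def insert(self, doc):
        doc_id = next(self._ids)
        self._docs[doc_id] = {"_id": doc_id, **doc}
        return doc_id

    def find(self, query):
        """Return documents whose fields equal every key/value pair in `query`."""
        return [d for d in self._docs.values()
                if all(d.get(k) == v for k, v in query.items())]

    def update(self, query, changes):
        matched = self.find(query)
        for doc in matched:
            doc.update(changes)
        return len(matched)

    def remove(self, query):
        matched = self.find(query)
        for doc in matched:
            del self._docs[doc["_id"]]
        return len(matched)

users = ToyCollection()
users.insert({"name": "Ada", "city": "Austin"})
users.insert({"name": "Bo", "languages": ["en", "hi"]})  # different attributes
users.update({"name": "Ada"}, {"city": "Dallas"})
users.remove({"name": "Bo"})
remaining = users.find({})
```

Note that the two inserted documents have different fields, which a fixed relational schema would not allow without altering the table.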
5. B.3 Security
Since MongoDB data files are unencrypted, they are prone to attack. To mitigate this, the
application must encrypt all sensitive information before writing it to the database and must
also prevent unauthorized access. Because MongoDB uses JavaScript as its internal language, it
is also prone to script injection attacks. Authentication is not provided in sharded MongoDB
clusters [13].
5. B.4 Partitioning/ Sharding
Sharding distributes data across numerous machines. MongoDB provides automated partitioning as
a built-in feature, which allows horizontal scaling across many nodes. Sharding combined with
replication yields a highly scalable cluster. For resource-hungry applications, MongoDB creates
a cluster of shards and balances the data across the nodes without impacting the original node
[1][10][11][13].
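One common way to route data to shards is to split the range of a shard key into contiguous chunks. The sketch below assumes that range-based scheme; the class `ShardRouter`, the split points and the shard names are hypothetical and only illustrate the routing step, not balancing.

```python
import bisect

class ShardRouter:
    """Routes a shard-key value to the shard that owns its range (hypothetical layout)."""

    def __init__(self, split_points, shards):
        # split_points[i] is the exclusive upper bound of shards[i];
        # the last shard takes everything from the final split point upward.
        if len(shards) != len(split_points) + 1:
            raise ValueError("need exactly one more shard than split points")
        self.split_points = split_points
        self.shards = shards

    def shard_for(self, shard_key):
        return self.shards[bisect.bisect_right(self.split_points, shard_key)]

router = ShardRouter(split_points=["h", "q"], shards=["shard0", "shard1", "shard2"])
placement = {name: router.shard_for(name) for name in ["ada", "hugo", "mia", "zoe"]}
```

Rebalancing in such a scheme amounts to moving a split point or reassigning a range to another shard, while the routing function itself stays unchanged.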
5. B.5 Storage
Data is stored as Binary JSON (BSON) documents with a maximum size of 16 MB. Data allocation is
limited to 2 GB per node on a 32-bit system. Data files are mapped into memory to increase
performance. By default, data is flushed to disk every minute; this interval is configurable.
When new files are created, data is flushed to disk immediately, which frees up memory
[10][11][14].
5. B.6 Replication
In MongoDB, data replication is based on master-slave replication organized in replica sets.
Data is replicated asynchronously across servers. Read operations can be served by multiple
slave servers, whereas write operations are handled by only one server at any given time. At
any point in time the servers have one master, and a new master is elected if the current one
fails. Reading from multiple slave servers provides load balancing at the cost of eventual
consistency. The client can enforce write operations on the master server [10][11][14][13].
Replicas can be configured in several ways in MongoDB to cater to different application needs -
Secondary Replicas - These replicas are never given the opportunity to become master; they
only store data.
Hidden Replicas - These replicas are hidden from the application and cannot be elected as
master. They perform read-only operations and have voting rights to elect a new master in
case of failover.
Delayed - Delayed replicas lag behind the master and do not hold the most recent data.
Arbiters - These replicas act only as arbitrators: they take no part in any function other
than communicating with other members and voting in elections [11][14].
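The effect of asynchronous master-slave replication on reads can be shown with a small simulation. The classes `Master` and `Slave`, the pending-operation queue and the explicit `replicate` step are assumptions made for the sketch; a real system ships operations in the background.

```python
from collections import deque

class Master:
    def __init__(self):
        self.data = {}
        self.pending = deque()  # operations not yet shipped to the slaves

    def write(self, key, value):
        self.data[key] = value  # acknowledged before any slave sees it
        self.pending.append((key, value))

class Slave:
    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)  # may be stale

def replicate(master, slaves):
    """Ship pending operations to every slave (in reality this runs later, in the background)."""
    while master.pending:
        key, value = master.pending.popleft()
        for slave in slaves:
            slave.data[key] = value

master, slaves = Master(), [Slave(), Slave()]
master.write("balance", 100)
stale = slaves[0].read("balance")  # None: replication has not run yet
replicate(master, slaves)
fresh = slaves[0].read("balance")  # 100: the slave has caught up
```

The gap between `stale` and `fresh` is the window in which a client reading from a slave sees outdated data, which is what eventual consistency means in practice.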
5. B.7 Read/Write Implementation
Indexing in MongoDB makes read operations efficient but negatively affects write and insert
operations. MongoDB allows read operations on slave servers, while write operations are
controlled by the master server. Slave servers can serve reads simultaneously, using data that
was replicated to them asynchronously [14].
5. B.8 Concurrency Control:
Updates take effect immediately on the node that receives them, but MongoDB does not provide
concurrency control across nodes. Data is sent asynchronously to the slave servers, so
propagation is not coordinated and the system exhibits eventual consistency [11][12][13][14].
5. C. Cassandra
Cassandra was designed by Facebook to cater to the very large data needs of the organization.
Cassandra essentially prioritizes two properties, availability and scalability [21]. It brings
together the data structure of Bigtable and the high-availability design of Dynamo [25][11].
5. C.1 Features of Cassandra
A Cassandra cluster has multiple nodes that are identical in terms of their software
infrastructure. All the nodes are symmetric and no master node is needed. This allows
linear scalability.
Hashing a new data value does not significantly affect the indexing maintained for other
data values.
The interface provided by Cassandra is not easy for developers to use [13].
5. C.2 Operations/Read Write Implementation
1) Write -> When a client issues a write, the request is received by one of the nodes in the
cluster, chosen at random. This node in turn writes the data to the cluster. The write is
then replicated to the other nodes of the cluster according to a replica placement strategy.
2) Append -> After the write has been passed on to the individual nodes, each node appends
the change to its commit log.
3) Update -> The update step applies the change to the in-memory table structure.
4) Read -> The client sends a read request to a random node in Cassandra. This node identifies
the node in the cluster that holds the required data and forwards the read request to it
[11][13][14].
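The four steps above can be sketched as a small simulation. The classes `Node` and `Cluster`, the CRC32-based placement and the choice of two replicas are assumptions made for the example; they stand in for Cassandra's actual placement strategy and storage structures.

```python
import zlib

class Node:
    """One node's local write path: commit log first, then the in-memory table."""

    def __init__(self):
        self.commit_log = []  # append-only record of every change
        self.memtable = {}    # in-memory table holding the latest values

    def write(self, key, column, value):
        self.commit_log.append((key, column, value))       # append step
        self.memtable.setdefault(key, {})[column] = value  # update step

    def read(self, key):
        return self.memtable.get(key, {})

class Cluster:
    def __init__(self, size, replicas):
        self.nodes = [Node() for _ in range(size)]
        self.replicas = replicas

    def owners(self, key):
        # Simplified placement: a checksum picks the first owner, the next nodes follow.
        start = zlib.crc32(key.encode()) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)] for i in range(self.replicas)]

    def write(self, key, column, value):
        # Whichever node receives the request forwards the write to the owners.
        for node in self.owners(key):
            node.write(key, column, value)

    def read(self, key):
        # The receiving node forwards the read to a node that holds the key.
        return self.owners(key)[0].read(key)

cluster = Cluster(size=4, replicas=2)
cluster.write("user:42", "city", "Austin")
row = cluster.read("user:42")
holders = [node for node in cluster.nodes if node.commit_log]
```

Writing to the log before touching the in-memory table means a node can rebuild its table after a crash by replaying the log.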
5. C.3 Storage
Column-based storage is the mainstay of Cassandra's storage system. Cassandra has a table as
its primary operational unit: a distributed multidimensional map indexed by keys [19][20].
Column families are defined when Cassandra is first launched, and there is no limit on their
number. At least some of the column families must be specified up front. Column families are
further subdivided into columns and super columns, which can be added to a column family at
runtime. Columns are indexed by the name assigned to them and store numerous data values in
each row. Similarly, super columns are identified by name and consist of an arbitrary set of
columns linked to them [11][13][14].
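The multidimensional map can be pictured as nested dictionaries. The sketch below is an assumption-laden simplification: the column family names are hypothetical, super columns (which would add one more nesting level) are omitted, and each column keeps only its latest timestamped value.

```python
import time

# row key -> column family -> column name -> (value, timestamp)
table = {}

# Column families are fixed up front (hypothetical names).
COLUMN_FAMILIES = {"profile", "activity"}

def put(row_key, family, column, value, timestamp=None):
    """Store a column value; columns may be new, column families may not."""
    if family not in COLUMN_FAMILIES:
        raise KeyError(f"unknown column family: {family}")
    ts = time.time() if timestamp is None else timestamp
    columns = table.setdefault(row_key, {}).setdefault(family, {})
    current = columns.get(column)
    if current is None or ts >= current[1]:  # the latest timestamp wins
        columns[column] = (value, ts)

def get(row_key, family, column):
    return table[row_key][family][column][0]

put("user:42", "profile", "name", "Ada", timestamp=1)
put("user:42", "profile", "city", "Dallas", timestamp=1)
put("user:42", "profile", "city", "Austin", timestamp=2)
put("user:42", "activity", "last_page", "/home", timestamp=3)  # column added at runtime
```

New columns such as `last_page` appear simply by being written, whereas writing to an undeclared column family is rejected.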
5. C.4 Partitioning
Cassandra runs on symmetrical nodes in a cluster, so the data is distributed across all the
nodes. Partitioning is done using one of two techniques: order-preserving partitioning or
random partitioning. Order-preserving partitioning enables efficient execution of range queries
but may cause load-balancing problems. In both techniques, the nodes and their keys are meant
to be evenly distributed in the cluster [11][13][14].
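The trade-off between the two techniques can be seen with a deliberately skewed set of keys. The split points, the key format and the use of MD5 for the random partitioner are assumptions made for the sketch.

```python
import bisect
import hashlib
from collections import Counter

SPLIT_POINTS = ["g", "n", "t"]  # hypothetical range boundaries for 4 nodes
NODES = len(SPLIT_POINTS) + 1

def order_preserving(key: str) -> int:
    """Keys keep their sort order, so adjacent keys land on the same node."""
    return bisect.bisect_right(SPLIT_POINTS, key)

def random_partition(key: str) -> int:
    """Hashing destroys key order but spreads keys across the nodes."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NODES

keys = [f"user{n:04d}" for n in range(400)]  # skewed: every key starts with "u"
ordered_load = Counter(order_preserving(k) for k in keys)
random_load = Counter(random_partition(k) for k in keys)
```

With these keys the order-preserving scheme places the whole data set on a single node, which makes a range scan over `user0000` to `user0399` cheap but leaves the other nodes idle, while the hashed scheme uses every node at the cost of scattering the range.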
5. C.5 Replication
Data is replicated across the nodes of a cluster, with each data set assigned to a particular
node. A data item is allocated to a position on a node according to its key, and consistent
hashing is used to map the key to that position [24]. Each data item has a coordinator node,
which coordinates the replication of that item to other nodes. The client can also choose the
number of replicas maintained for a particular data item [8][11].
5. C.6 Concurrency Control
Cassandra enables Multi version concurrency control [8].
5. C.7 Security
Data files and client-database communication are unencrypted, so any user with access to the
file system can extract whatever information they want. Intra-cluster communication is
unrestricted, whereas inter-cluster communication supports authentication. Security in
Cassandra is loosely implemented: the IP addresses of the cluster's nodes are the only
information needed to snoop on the system [13][11].
6. Conclusion and Recommendation
We have compared three main products, Dynamo, MongoDB and Cassandra, on the basis of the major
features that drive the selection of a NoSQL product for an organization. MongoDB and Cassandra
can be regarded as supersets of Dynamo, since they too are essentially built on key-value
indexing. Dynamo cannot keep related attributes together, as MongoDB can through document
linking. Also, horizontal scalability is better achieved in MongoDB and Cassandra than in
Dynamo.
Overall, we found that MongoDB and Cassandra are better products than Dynamo in terms of
partitioning, replication and concurrency control.
When it comes to update operations, Cassandra is much faster than MongoDB, and its speed
is independent of the size of the data.
Read operations in Cassandra are faster than in MongoDB for medium-sized data sets; the
speed of read operations declines as the number of records increases.
Complex workloads that mix read and update operations perform better in Cassandra than
in MongoDB.
The symmetric node structure of a Cassandra cluster serves concurrency control better than
the master-slave structure of MongoDB.
Security in NoSQL systems is loosely implemented; comparatively, Cassandra provides better
authentication and authorization mechanisms than MongoDB [8][11][14].
We recommend Cassandra as the better overall product when compared on the basis of replication,
concurrency control, partitioning and read/write implementation. Cassandra is tried and tested:
it is used by more than 1500 companies [25].
References
[1] V. N. Gudivada, D. Rao, and V. V. Raghavan, “NoSQL Systems for Big Data Management,” 2014 IEEE
10th World Congress on Services.
[2] R. Cattell, “Scalable sql and nosql data stores,” SIGMOD Rec., vol. 39, no. 4, pp. 12–27, May 2011.
[3] V. Benzaken, G. Castagna, K. Nguyen, and J. Siméon, “Static and dynamic semantics of NoSQL
languages,” SIGPLAN Not., vol. 48, no. 1, pp. 101–114, Jan. 2013.
[4] F. Cruz, F. Maia, M. Matos, R. Oliveira, J. Paulo, J. Pereira, and R. Vilaça, “MeT: Workload aware
elasticity for NoSQL,” in Proceedings of the 8th ACM European Conference on Computer Systems
(EuroSys ’13), Prague, Czech Republic, 2013, pp. 183–196. ACM, New York, NY, USA.
ISBN 978-1-4503-1994-2.
[5] L. Bonnet, A. Laurent, M. Sala, B. Laurent, and N. Sicard, “REDUCE, YOU SAY: What NoSQL can do
for Data Aggregation and BI in Large Repositories,” 2011 22nd International Workshop on Database
and Expert Systems Applications.
[6] P. A. Bernstein and N. Goodman. Multiversion concurrency control – theory and algorithms. ACM
Trans. Database Syst., 8:465–483, December 1983.
[7] A B M Moniruzzaman and S. A. Hossain, “NoSQL Database: New Era of Databases for Big data
Analytics - Classification, Characteristics and Comparison,” International Journal of Database
Theory and Application, vol. 6, no. 4, 2013.
[8] V. Abramova and J. Bernardino, “NoSQL Databases: MongoDB vs Cassandra,” Polytechnic Institute
of Coimbra, ISEC - Coimbra Institute of Engineering, Coimbra, Portugal.
[9] Jing Han; Haihong, E.; Guan Le; Jian Du, "Survey on NoSQL database," Pervasive Computing and
Applications (ICPCA), 2011 6th International Conference on , vol., no., pp.363,366, 26-28 Oct. 2011.
doi:10.1109/ICPCA.2011.6106531.
[10] R. Hecht and S. Jablonski, “NoSQL Evaluation: A Use Case Oriented Survey,” 2011 International
Conference on Cloud and Service Computing.
[11] C. J. M. Tauro, B. R. Patil, and K. R. Prashanth, “A Comparative Analysis of Different NoSQL
Databases on Data Model, Query Model and Replication Model,” Christ University, Bangalore, India.
[12] A. Boicea, F. Radulescu, and L. I. Agapin, “MongoDB vs Oracle - database comparison,” 2012
Third International Conference on Emerging Intelligent Data and Web Technologies.
[13] L. Okman, N. Gal-Oz, Y. Gonen, E. Gudes, and J. Abramov, “Security Issues in NoSQL Databases,”
2011 International Joint Conference of IEEE TrustCom-11/IEEE ICESS-11/FCST-11.
[14] V. Abramova and J. Bernardino, “NoSQL Databases: MongoDB vs Cassandra,” Polytechnic Institute
of Coimbra, ISEC - Coimbra Institute of Engineering, Coimbra, Portugal.
[15] E. Brewer. (2000, Jun.) Towards robust distributed systems. [Online]. Available:
http://www.cs.berkeley.edu/ brewer/cs262b-2004/PODCkeynote.pdf
[16] S. Gilbert and N. Lynch, “Brewer’s conjecture and the feasibility of consistent, available, partition-
tolerant web services,” SIGACT News, vol. 33, pp. 51–59, June 2002. [Online]. Available:
http://doi.acm.org/10.1145/564585.56460
[17] Jing Han; Haihong, E.; Guan Le; Jian Du, "Survey on NoSQL database," Pervasive Computing and
Applications (ICPCA), 2011 6th International Conference on , vol., no., pp.363,366, 26-28 Oct. 2011.
doi:10.1109/ICPCA.2011.6106531.
[18] Tudorica, B.G.; Bucur, C., "A comparison between several NoSQL databases with comments and
notes," Roedunet International Conference (RoEduNet), 2011 10th , vol., no., pp.1,5, 23-25 June 2011.
doi:10.1109/RoEduNet.2011.5993686.
[19] A. Lakshman and P. Malik, “Cassandra – A Decentralized Structured Storage System,” SIGOPS
Operating Systems Review, vol. 44, pp. 35–40, April 2010.
[20] A. Lakshman, “Cassandra – A structured storage system on a P2P Network,” August 2008.
[21] The Apache Software Foundation, The Apache Cassandra Project. http://cassandra.apache.org/,
last accessed January 2011.
[22] D. Karger, E. Lehman, T. Leighton, R. Panigrahy, M. Levine, and D. Lewin, “Consistent hashing
and random trees: distributed caching protocols for relieving hot spots on the World Wide Web,”
in Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing (STOC ’97),
pp. 654–663, New York, NY, USA, 1997. ACM.
[23] https://www.mongodb.org/ - https://docs.mongodb.org/manual/
[24] https://aws.amazon.com/dynamodb/
[25] http://cassandra.apache.org/
[26] G. Weintraub, “Dynamo and BigTable – Review and Comparison,” 2014 IEEE 28th Convention of
Electrical and Electronics Engineers in Israel.
[27] N. Leavitt, “Will NoSQL Databases Live Up to Their Promise?,” IEEE Computer, vol. 43, no. 2,
pp. 12–14, 2010.
[28] G. DeCandia et al., “Dynamo: Amazon's highly available key-value store,” SOSP 2007,
pp. 205–220.
[29] R. Cattell, “Scalable SQL and NoSQL data stores,” SIGMOD Record, vol. 39, no. 4, pp. 12–27,
2010.
[30] A. Boicea, F. Radulescu, and L. I. Agapin, “MongoDB vs Oracle - database comparison,” 2012
Third International Conference on Emerging Intelligent Data and Web Technologies.