NoSQL Databases: An Introduction and Comparison between Dynamo, MongoDB and Cassandra

INSY 5337 Data Warehousing – Term Paper


Authored By-

Nitin Shewale, Aditya Kashyap, Akshay Vadnere, Vivek Adithya, Aditya Trilok

Abstract

Data volumes have grown exponentially in recent years, and this growth across all business
domains has played a significant part in how data is structured and analyzed. NoSQL databases
are becoming popular as more organizations consider them a feasible option because of their
schema-less structure and their ability to handle Big Data. In this paper, we discuss the main
types of NoSQL databases from an implementation perspective: key-value, columnar and
document-oriented stores. The paper consolidates an applied interpretation of NoSQL systems in
terms of database features such as security, concurrency control, partitioning, replication and
read/write implementation. We also compare popular products and recommend a particular NoSQL
solution based on these factors.

1. Introduction

Until recently, relational database systems have been at the forefront of data storage and
management. The advent of mobile applications that require real-time analysis, such as GPS-based
services, banking and social media, has led to huge volumes of unstructured data being produced
every second. Traditional RDBMSs find it difficult to cater to this data, because an RDBMS mainly
stores structured data in tabular form. Mapping unstructured data onto a relational database
increases complexity and requires expensive infrastructure to model it. Even when the data model
does fit into SQL, the wide range of features that SQL provides becomes an overhead. A relational
schema also becomes a burden on applications that store data in multiple forms such as videos,
blogs and images. NoSQL (Not Only Structured Query Language) was introduced as a new methodology
for managing unstructured data.

NoSQL covers the broader topic of data structuring, storage and aggregation through various
implementation approaches. It can store unstructured data and provide real-time analysis to back
web service applications. To attain flexible data handling, it relaxes the conventional database
guarantees of Atomicity, Consistency, Isolation and Durability (ACID). It also provides built-in
data partitioning and replication. Data across business domains is typically governed by company
policies and processes for data control and quality; NoSQL moves away from these restrictions in
favor of the performance and scalability requirements of particular applications and services
[1][2][3][4][10].

2. NoSQL Characteristics

The NoSQL analogue of the ACID properties is BASE, which is derived from the CAP theorem. The CAP
theorem states that a distributed system can guarantee at most two of the following three
properties at the same time –

Consistency – All parts of the system see the same data at the same time.

Availability – The data is available at any time, and every request receives a response.

Partition Tolerance – The failure of one section or partition of the system does not cause the
whole system to fail [7][8][9].

NoSQL database systems such as MongoDB and Cassandra relax consistency to attain greater
availability and efficient partitioning. This gave rise to systems built on the BASE principles.

Basically Available – Data is distributed across several systems, so it remains available from
one of them even if another fails.

Soft State – Since the data is distributed, there is no assurance that it is consistent at any
given moment.

Eventually Consistent – The data becomes consistent eventually, even if it is not consistent at a
given point in time [5][6][10].

3. Features of NoSQL

Flexible Data Models – In contrast to the relational model, which has a fixed schema, NoSQL does
not impose one, and it allows horizontal partitioning of data across distributed systems or
processors. Applications based on NoSQL have data models explicitly designed and tuned for them.

Partial Record Updates – NoSQL data models emphasize column-based processing, which enables
aggregation over more than one attribute or entity.

Optimized MapReduce Processing – MapReduce, a facility for moving and mapping data, is natively
supported by NoSQL systems.

Horizontal Scalability – Processors with their own resources can be added on the fly. Each node
is given a subset of the data to process, which increases the efficiency of the application.
Horizontal scalability is easier to achieve in the NoSQL data model than in an RDBMS [1].

4. Types of NoSQL Databases

Key Value -> Key-value data stores reference data using a unique key. The key acts as a link to
data that is stored independently, at an arbitrary location on disk. New values can be added
without conflicting with existing data. Key-value stores are therefore entirely schema-less; the
only structure that can be derived from them is the set of key-value pairs. In this paper we
discuss our findings on DynamoDB by Amazon.

Document -> Document data stores hold collections of uniquely identified sets of key-value pairs
known as documents. Each document is identified by its own unique ID within the collection.
Document stores allow new documents to be stored with different kinds of attributes. In this
paper we discuss our findings on MongoDB.

Column -> These are column-centric data stores, in which indexing is done on every column. They
provide efficient, high-speed read and write operations. Any modification or addition of data is
stored as a timestamped version. We introduce Cassandra as an example of a column data store
[1][10][12].

5. Comparative study between DynamoDB, Cassandra and MongoDB

5. A. DynamoDB

Dynamo was designed to provide a storage system within Amazon's platform that remains resilient
under unforeseen circumstances.

5. A.1 Key Features of Dynamo:

Key-Value Data Model - Data are represented as objects, and each object is identified by a unique
key. The operations supported on the data are get and put, each associated with a specified
unique key.

Eventual Consistency - The primary objective of Dynamo is to remain resilient under unforeseen
circumstances. It is difficult to guarantee consistency across all replicas at the moment of a
write, so an update is propagated to the replicas over time and the replicas become consistent
eventually.

Symmetry and Decentralization - Every node in Dynamo has the same responsibilities as its peers.
As a result, the probability of failure is very low and little manual intervention is required
[24][26][27][28][29].

5. A.2 Operations:

Dynamo performs the following operations:

Get - To return the object associated with the key.

Put - To associate the object with the specified key.

5. A.3 Security:

Dynamo does not implement a robust security mechanism, which makes it poorly suited to scenarios
that require authorization [24][26][27][28][29].

5. A.4 Partitioning:

Dynamo is a highly scalable system that adapts to varying amounts of data by adding and removing
nodes flexibly. To implement partitioning, Dynamo uses a technique called consistent hashing, in
which every node is assigned to one or more points on a fixed ring. Each data item is identified
by a unique key, and hashing that key yields a point on the ring. The ring is then walked
clockwise from that point, and the first node encountered is the node to which the item is
allocated. This is an effective partitioning method, because adding or removing a node affects
only its immediate neighbors on the ring.
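The ring lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Dynamo's implementation: the names HashRing and ring_position are hypothetical, MD5 is an arbitrary choice of hash function, and each node is placed at a single point on the ring rather than several.

```python
import bisect
import hashlib


def ring_position(name):
    """Map a string to a point on the ring (assumption: a 128-bit MD5 integer)."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hashing ring with one point per node."""

    def __init__(self, nodes=()):
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        point = ring_position(node)
        bisect.insort(self._points, point)
        self._owner[point] = node

    def remove_node(self, node):
        point = ring_position(node)
        self._points.remove(point)
        del self._owner[point]

    def node_for(self, key):
        """Walk clockwise from the key's position to the first node."""
        idx = bisect.bisect_left(self._points, ring_position(key))
        # Wrap around to the first node when the key falls past the last point.
        return self._owner[self._points[idx % len(self._points)]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}
ring.remove_node("node-b")
after = {k: ring.node_for(k) for k in keys}
# Keys whose owner changed after the removal.
moved = [k for k in keys if before[k] != after[k]]
```

In this sketch, every key in moved was previously owned by node-b; keys owned by the other nodes keep their owner, which is the locality property the paragraph refers to.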

5. A.5 Replication:

The data is replicated on multiple hosts, which results in high reliability and durability.
Dynamo replicates each data object at N nodes, where the value of N is set by the user. Each key
K is assigned to a coordinator node, which stores the data associated with K locally. The
coordinator also replicates the data to N-1 other nodes on the ring [24][26][27][28][29].
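As a rough sketch of this replication scheme, the function below returns the coordinator for a key followed by the next N-1 nodes clockwise on the ring. The names preference_list and ring_position are hypothetical, MD5 is an assumed hash function, and choosing the clockwise successors as the N-1 replicas follows the Dynamo paper [28] rather than anything stated explicitly here.

```python
import hashlib


def ring_position(name):
    """Assumed hash: map a string to a 128-bit integer position on the ring."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


def preference_list(key, nodes, n):
    """Return the coordinator for `key` followed by its N-1 clockwise successors."""
    ordered = sorted(nodes, key=ring_position)
    key_pos = ring_position(key)
    # Index of the first node at or after the key's position; wrap to 0 if none.
    start = next(
        (i for i, node in enumerate(ordered) if ring_position(node) >= key_pos),
        0,
    )
    count = min(n, len(ordered))   # cannot place more replicas than there are nodes
    return [ordered[(start + i) % len(ordered)] for i in range(count)]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
replicas = preference_list("cart:42", nodes, n=3)
coordinator, successors = replicas[0], replicas[1:]
```

With N = 3, the list holds three distinct nodes: the coordinator that stores the item locally and the two nodes it replicates to.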

5. A.6 Storage:

Every node in Dynamo has its own persistence engine, which stores data as binary objects. The
persistence engines used by instances include MySQL and Berkeley Database (BDB). The engine is a
pluggable component, so users can choose one based on their requirements. For instance, BDB
handles relatively small objects, whereas MySQL can handle larger objects [24][26][27][28][29].

5. A.7 Read/Write Implementation:

Dynamo implements a protocol with two parameters, R and W, which represent the minimum number of
nodes that must participate in a successful read and write operation respectively. When the
coordinator performs a write, it writes the data locally and then sends the write request to the
other N-1 replica nodes. If at least W-1 of those nodes respond, the write is considered
successful and the coordinator informs the client.

When the coordinator is asked to perform a read, it sends a read request to the other N-1 nodes.
Once at least R-1 nodes have responded, the result is returned to the client. If different nodes
return different versions of the object, the coordinator sends the client a list of objects
rather than a single object [24][26][27][28][29].
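The R/W protocol can be illustrated with a small in-memory simulation. This is a sketch under simplifying assumptions: Replica and Coordinator are hypothetical classes, the coordinator's own local copy is modelled as the first entry in the replica list (so the thresholds are written as W and R rather than W-1 and R-1), and network calls are replaced by direct method calls.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    up: bool = True
    data: dict = field(default_factory=dict)

    def store(self, key, value):
        """Store the value and acknowledge, unless the replica is unreachable."""
        if self.up:
            self.data[key] = value
        return self.up

    def fetch(self, key):
        """Return (responded, value)."""
        return (self.up, self.data.get(key))


class Coordinator:
    def __init__(self, replicas, r, w):
        self.replicas, self.r, self.w = replicas, r, w   # len(replicas) is N

    def put(self, key, value):
        acks = sum(replica.store(key, value) for replica in self.replicas)
        return acks >= self.w   # success only if at least W replicas acknowledged

    def get(self, key):
        responses = [replica.fetch(key) for replica in self.replicas]
        answers = [value for responded, value in responses if responded]
        if len(answers) < self.r:
            raise RuntimeError("read quorum not reached")
        distinct = []   # divergent versions are all returned to the client
        for value in answers:
            if value not in distinct:
                distinct.append(value)
        return distinct


replicas = [Replica("a"), Replica("b"), Replica("c")]   # N = 3
coordinator = Coordinator(replicas, r=2, w=2)

coordinator.put("cart", "v1")        # all three replicas store v1
replicas[2].up = False               # one replica becomes unreachable
ok = coordinator.put("cart", "v2")   # still succeeds: 2 acknowledgements >= W
replicas[2].up = True                # the stale replica comes back
versions = coordinator.get("cart")   # both versions are handed to the client
```

Here the second write succeeds with one replica down, and the subsequent read returns a list containing both "v2" and the stale "v1", mirroring the list-of-objects behavior described above.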

5. A.8 Concurrency Control:

Multiple clients may access shared objects concurrently. A write operation can return before all
replica nodes have been updated, so a later read may return different versions of the same
object. Dynamo is designed to handle such inconsistencies [24][26][27][28][29].
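This section does not say how divergent versions are told apart. The Dynamo paper [28] uses vector clocks for this purpose, and the following is a minimal sketch of that idea. The function names increment, descends and reconcile are hypothetical, and a clock is assumed to be a plain dictionary mapping a node name to an update counter.

```python
def increment(clock, node):
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a, b):
    """True if clock `a` has seen every update recorded in clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def reconcile(versions):
    """Drop versions superseded by another one; keep concurrent siblings.

    `versions` is a list of (value, clock) pairs.
    """
    survivors = []
    for value, clock in versions:
        dominated = any(
            descends(other, clock) and other != clock for _, other in versions
        )
        if not dominated and (value, clock) not in survivors:
            survivors.append((value, clock))
    return survivors


v1 = ("one item", {"x": 1})
v2 = ("two items", increment(v1[1], "x"))      # written after v1 via node x
v3 = ("add book", increment(v1[1], "y"))       # written after v1 via node y
v4 = ("add pen", increment(v1[1], "z"))        # written after v1 via node z

newer_wins = reconcile([v1, v2])               # v2 supersedes v1
siblings = reconcile([v1, v3, v4])             # v3 and v4 are concurrent
```

A version that descends from another replaces it, while concurrent versions are both kept and returned so that the client can merge them.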

5. B. MongoDB

MongoDB is a document store developed in C++. Indexing in MongoDB is based on the document key
structure. It is a schema-less product geared toward performance and query optimization
[12][10].

5. B.1 Features of MongoDB

Flexibility during the initial phases of development and design.

Horizontal scalability as a built-in feature.

User-friendly tools for transferring data between databases.

Compatibility across implementations in various programming languages [30].

5. B.2 Operations:

MongoDB allows these operations:

Insert – Adds new documents to a collection.

Find – Retrieves documents from a collection.

Update – Updates documents of a collection.

Remove – Removes a document from a collection [23][10].
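The four operations can be illustrated with a toy in-memory collection. This is deliberately not the MongoDB driver API: the Collection class and its method signatures are hypothetical, queries are limited to exact field matches, and the only point is to show the semantics of insert, find, update and remove on schema-less documents.

```python
import itertools


class Collection:
    """Toy in-memory stand-in for a document collection (not the MongoDB driver)."""

    def __init__(self):
        self._docs = {}                  # document ID -> document
        self._ids = itertools.count(1)   # source of automatically assigned IDs

    @staticmethod
    def _matches(doc, query):
        return all(doc.get(name) == value for name, value in query.items())

    def insert(self, doc):
        doc = dict(doc)
        doc.setdefault("_id", next(self._ids))
        self._docs[doc["_id"]] = doc
        return doc["_id"]

    def find(self, query=None):
        query = query or {}
        return [dict(d) for d in self._docs.values() if self._matches(d, query)]

    def update(self, query, changes):
        hits = [d for d in self._docs.values() if self._matches(d, query)]
        for doc in hits:
            doc.update(changes)
        return len(hits)

    def remove(self, query):
        doomed = [i for i, d in self._docs.items() if self._matches(d, query)]
        for doc_id in doomed:
            del self._docs[doc_id]
        return len(doomed)


posts = Collection()
posts.insert({"title": "Hello", "tags": ["intro"]})
posts.insert({"title": "Sharding", "author": "ann", "views": 10})  # different fields
posts.update({"title": "Sharding"}, {"views": 11})
posts.remove({"title": "Hello"})
remaining = posts.find()
```

Note that the two inserted documents carry different attributes, which is the schema-less behavior described for document stores earlier in the paper.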

5. B.3 Security

Since MongoDB data files are unencrypted, they are prone to attack. To mitigate this, the
application must encrypt all sensitive information before writing it to the database and must
also prevent unauthorized access. Because MongoDB uses JavaScript as its internal language, it is
also prone to script injection attacks. Authentication is not provided in sharded MongoDB
clusters [13].

5. B.4 Partitioning/ Sharding

Sharding distributes data across numerous machines. MongoDB provides automated partitioning as a
built-in feature, which allows horizontal scaling across many processors (nodes). Sharding
combined with replication yields a highly scalable cluster. For resource-hungry applications,
MongoDB creates a cluster of shards and balances data across the nodes without impacting the
original node [1][10][11][13].
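A minimal sketch of how documents might be routed to shards is given below. It assumes hash-based routing on a single shard key for simplicity; the ShardedCluster class is hypothetical, and chunk splitting and automatic balancing are not modelled.

```python
import hashlib
from collections import defaultdict


class ShardedCluster:
    """Hypothetical router that places each document on one shard by hashed key."""

    def __init__(self, shard_names, shard_key):
        self.shard_key = shard_key
        self.shard_names = list(shard_names)
        self.shards = defaultdict(list)   # shard name -> documents stored there

    def _shard_for(self, key_value):
        digest = hashlib.md5(str(key_value).encode("utf-8")).hexdigest()
        return self.shard_names[int(digest, 16) % len(self.shard_names)]

    def insert(self, doc):
        shard = self._shard_for(doc[self.shard_key])
        self.shards[shard].append(doc)
        return shard

    def find_by_key(self, key_value):
        # Only the shard that owns the key is consulted.
        shard = self._shard_for(key_value)
        return [d for d in self.shards[shard] if d[self.shard_key] == key_value]


cluster = ShardedCluster(["shard-0", "shard-1", "shard-2"], shard_key="user_id")
for i in range(100):
    cluster.insert({"user_id": i, "name": f"user{i}"})
match = cluster.find_by_key(42)
```

Because the shard is computed from the key, a lookup by shard key touches a single machine, which is what lets reads and writes scale horizontally.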

5. B.5 Storage

Data is stored in Binary JSON (BSON) format, with a maximum document size of 16 MB. Data
allocation is limited to 2 GB per node on a 32-bit system. Data files are mapped into memory to
increase performance. By default, data is flushed to disk every minute, and this interval is
configurable. When new files are created, data is flushed to disk immediately, which frees up
memory [10][11][14].

5. B.6 Replication

In MongoDB, replication follows a master-slave model organized into replica sets. Data is
replicated asynchronously across servers. Read operations can be served by multiple slave
servers, whereas write operations are handled by only one server at any given time. At any point
in time the servers have one master, and a new master is elected if the current one fails.
Reading from multiple slave servers provides load balancing at the cost of eventual consistency.
The client can direct write operations to the master server [10][11][14][13].

Replicas can be configured in several ways in MongoDB to cater to different application needs -

Secondary Replicas - These replicas are not given the opportunity to become master; they only
store data.

Hidden Replicas - These replicas are hidden from the application and cannot be elected as master.
They perform read-only operations and have voting rights in the election of a new master on
failover.

Delayed - Delayed replicas are deliberately kept behind the master and therefore do not hold the
most recent data.

Arbiters - These members act purely as arbitrators: they take no part in any function other than
communicating with other members and voting in elections [11][14].

5. B.7 Read/Write Implementation

Indexing in MongoDB makes read operations efficient but has a negative effect on write and insert
operations. MongoDB allows read operations on slave servers, while write operations are
controlled by the master server. Slave servers can serve reads simultaneously and asynchronously
[14].

5. B.8 Concurrency Control:

Updates in MongoDB are applied immediately, and MongoDB does not provide a separate concurrency
control mechanism. It exhibits eventual consistency: data is sent to the slave servers
asynchronously, so the moment at which each slave receives an update is not controlled
[11][12][13][14].

5. C. Cassandra

Cassandra was designed by Facebook to cater to the organization's enormous data needs. It
essentially emphasizes two features, availability and scalability [21]. It combines the data
structure of BigTable with the high availability of Dynamo [25][11].

5. C.1 Features of Cassandra

Cassandra has multiple nodes in a cluster, all identical in terms of their software
infrastructure. The nodes are symmetric and no master node is needed. This allows linear
scalability.

Hashing a new data value does not significantly affect the indexing maintained for other data
values.

The interface provided by Cassandra is not easy for developers to use [13].

5. C.2 Operations/Read Write Implementation

1) Write -> When a client issues a write, it is received by an arbitrary node in the cluster.
This node in turn writes the data to the cluster. The write is then replicated to the other nodes
of the cluster according to a replication placement strategy.

2) Append -> After the write has been passed on to the individual nodes, the change is appended
to the commit log.

3) Update -> The update step applies the change to the in-memory table structure.

4) Read -> The client sends a read request to an arbitrary node in Cassandra; this node
identifies the node in the cluster that holds the required data and forwards the read request to
it.

[11][13][14]
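The four steps above can be sketched as a small simulation. The Node and Cluster classes are hypothetical, the replication placement strategy is passed in as a plain function from key to node names, and flushing the in-memory table to disk is not modelled. Since any node may receive a request and simply forwards it, the entry node is left implicit.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.commit_log = []   # append-only record of every write received
        self.memtable = {}     # in-memory table: key -> (timestamp, value)

    def write(self, key, value, timestamp):
        self.commit_log.append((timestamp, key, value))    # step 2: append
        current = self.memtable.get(key)
        if current is None or timestamp >= current[0]:     # step 3: update
            self.memtable[key] = (timestamp, value)

    def read(self, key):
        entry = self.memtable.get(key)
        return None if entry is None else entry[1]


class Cluster:
    def __init__(self, nodes, replicas_for):
        self.nodes = {node.name: node for node in nodes}
        self.replicas_for = replicas_for   # placement strategy: key -> node names
        self._clock = 0                    # logical timestamp for ordering writes

    def write(self, key, value):
        # Step 1: the receiving node forwards the write to every replica.
        self._clock += 1
        for name in self.replicas_for(key):
            self.nodes[name].write(key, value, self._clock)

    def read(self, key):
        # Step 4: the request is forwarded to a node that holds the data.
        owner = self.replicas_for(key)[0]
        return self.nodes[owner].read(key)


nodes = [Node("n1"), Node("n2"), Node("n3")]
# Assumed placement: every key lives on n1 and n2.
cluster = Cluster(nodes, replicas_for=lambda key: ["n1", "n2"])
cluster.write("user:1", {"name": "Ann"})
cluster.write("user:1", {"name": "Ann B."})
latest = cluster.read("user:1")
```

After the two writes, both replicas hold two commit-log entries and the latest value in their in-memory table, while the third node is untouched.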

5. C.3 Storage

Column-based storage is the mainstay of the storage system in Cassandra. The table is Cassandra's
primary operational unit; it is a distributed multidimensional map indexed by keys [19][20].
Column families are defined when Cassandra is first set up, and there is no limit on their
number. At least some of the column families must be specified up front. Column families are
further subdivided into columns and super columns, which can be added to a column family at
runtime. Columns are indexed by the name assigned to them, and each row can store numerous column
values. Similarly, super columns are identified by name and consist of an arbitrary number of
columns [11][13][14].
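One way to picture this data model is as a nested map from row key to column family to column name. The Table class below is a hypothetical single-machine sketch, not Cassandra's storage engine: column families are fixed when the table is created, columns are added freely at runtime, and each value carries a timestamp.

```python
import time
from collections import defaultdict


class Table:
    """Hypothetical nested-map view of a table with fixed column families."""

    def __init__(self, column_families):
        self.column_families = set(column_families)   # declared up front
        # row key -> column family -> column name -> (value, timestamp)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def insert(self, row_key, family, column, value, timestamp=None):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        ts = time.time() if timestamp is None else timestamp
        # Columns need no prior declaration; a repeated insert overwrites the value.
        self.rows[row_key][family][column] = (value, ts)

    def get(self, row_key, family, column):
        value, _ = self.rows[row_key][family][column]
        return value


users = Table(column_families=["profile", "activity"])
users.insert("user:1", "profile", "name", "Ann", timestamp=1)
users.insert("user:1", "profile", "email", "ann@example.org", timestamp=2)
users.insert("user:1", "profile", "name", "Ann B.", timestamp=3)
name = users.get("user:1", "profile", "name")
```

A super column would add one more level of nesting, with a named group of columns stored inside a column family.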

5. C.4 Partitioning

Cassandra runs on a cluster of symmetrical nodes, and the data is distributed across all of them.
Partitioning is done using one of two techniques: order-preserving partitioning and random
partitioning. Order-preserving partitioning enables efficient execution of range queries but may
cause load-balancing problems. Random partitioning distributes the keys evenly across the nodes
of the cluster [11][13][14].
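The trade-off between the two techniques can be seen by comparing where a range of keys ends up in token order. The sketch below assumes that an order-preserving partitioner uses the key itself as its token and that a random partitioner uses an MD5 hash of the key; all function names are hypothetical.

```python
import hashlib


def order_preserving_token(key):
    """The key itself is the token, so key order is kept on the ring."""
    return key


def random_token(key):
    """A hash of the key is the token, so neighboring keys are scattered."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def positions_of_range(keys, token, start, end):
    """Positions, in token order, of the keys that fall in [start, end]."""
    ordered = sorted(keys, key=token)
    return [i for i, key in enumerate(ordered) if start <= key <= end]


def is_contiguous(positions):
    return all(b - a == 1 for a, b in zip(positions, positions[1:]))


keys = [f"user:{i:03d}" for i in range(100)]
ordered_positions = positions_of_range(
    keys, order_preserving_token, "user:010", "user:019"
)
hashed_positions = positions_of_range(keys, random_token, "user:010", "user:019")
range_is_local = is_contiguous(ordered_positions)
range_is_scattered = not is_contiguous(hashed_positions)
```

With order-preserving tokens the ten keys of the range sit next to each other, so a range query can be served from one stretch of the ring. With hashed tokens the same keys are spread around the ring, which evens out the load but makes range queries touch many nodes.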

5. C.5 Replication

Data is replicated across the nodes of a cluster, and each data item is assigned to a particular
node. The item is allocated to a position on the ring according to its key, using consistent
hashing [24]. Each data item has a coordinator node, which coordinates the replication of that
item to other nodes. The client can also choose the number of replicas maintained for a
particular data item [8][11].

5. C.6 Concurrency Control

Cassandra uses multiversion concurrency control [8].

5. C.7 Security

Data files and the communication between client and database are unencrypted, so any user with
access to the file system can extract whatever information they want. Intra-cluster communication
is unrestricted, whereas inter-cluster communication comes with an authentication facility.
Security in Cassandra is loosely implemented: the IP addresses of the cluster's nodes are the
only information needed to snoop on the system [13][11].

6. Conclusion and Recommendation

We have compared three products, Dynamo, MongoDB and Cassandra, on the basis of the major
features that drive the selection of a NoSQL product for an organization. MongoDB and Cassandra
can be seen as supersets of Dynamo, since they too are essentially built on key-value indexing.
Dynamo cannot keep related attributes together, as MongoDB can through document linking.
Horizontal scalability is also better achieved in MongoDB and Cassandra than in Dynamo.

Overall, we found that MongoDB and Cassandra are better products than Dynamo in terms of
partitioning, replication and concurrency control.

When it comes to update operations, Cassandra is much faster than MongoDB, and its speed is
independent of the size of the data.

Read operations in Cassandra are faster than in MongoDB for medium-sized data sets; the speed of
read operations declines as the number of records increases.

Complex queries that combine read and update operations are performed better in Cassandra than in
MongoDB.

The symmetric node structure of a Cassandra cluster supports concurrency control better than the
master-slave structure of MongoDB.

Security in NoSQL systems is loosely implemented; comparatively, Cassandra provides better
authentication and authorization mechanisms than MongoDB [8][11][14].

We recommend Cassandra as the better product overall when compared on the basis of replication,
concurrency control, partitioning and read/write implementation. Cassandra is tried and tested;
it is used by more than 1500 companies [25].

References

[1] V. N. Gudivada, D. Rao, and V. V. Raghavan, “NoSQL Systems for Big Data Management,” 2014 IEEE 10th World Congress on Services, 2014.

[2] R. Cattell, “Scalable sql and nosql data stores,” SIGMOD Rec., vol. 39, no. 4, pp. 12–27, May 2011.

[3] V. Benzaken, G. Castagna, K. Nguyen, and J. Siméon, “Static and dynamic semantics of NoSQL

languages,” SIGPLAN Not., vol. 48, no. 1, pp. 101–114, Jan. 2013.

[4] F. Cruz, F. Maia, M. Matos, R. Oliveira, J. Paulo, J. Pereira, and R. Vilaça, “MeT: Workload aware elasticity for NoSQL,” in Proceedings of the 8th ACM European Conference on Computer Systems (EuroSys ’13), Prague, Czech Republic, ACM, 2013, pp. 183–196.

[5] L. Bonnet, A. Laurent, M. Sala, B. Laurent, and N. Sicard, “REDUCE, YOU SAY: What NoSQL can do for Data Aggregation and BI in Large Repositories,” 2011 22nd International Workshop on Database and Expert Systems Applications, 2011.

[6] P. A. Bernstein and N. Goodman, “Multiversion concurrency control – theory and algorithms,” ACM Trans. Database Syst., vol. 8, pp. 465–483, December 1983.

[7] A. B. M. Moniruzzaman and S. A. Hossain, “NoSQL Database: New Era of Databases for Big data Analytics - Classification, Characteristics and Comparison,” International Journal of Database Theory and Application, vol. 6, no. 4, 2013.

[8] V. Abramova and J. Bernardino, “NoSQL Databases: MongoDB vs Cassandra,” Polytechnic Institute of Coimbra, ISEC - Coimbra Institute of Engineering, Coimbra, Portugal.

[9] J. Han, E. Haihong, G. Le, and J. Du, “Survey on NoSQL database,” 2011 6th International Conference on Pervasive Computing and Applications (ICPCA), pp. 363–366, 26-28 Oct. 2011. doi:10.1109/ICPCA.2011.6106531.

[10] R. Hecht and S. Jablonski, “NoSQL Evaluation: A Use Case Oriented Survey,” 2011 International Conference on Cloud and Service Computing, 2011.

[11] C. J. M. Tauro, B. R. Patil, and K. R. Prashanth, “A Comparative Analysis of Different NoSQL Databases on Data Model, Query Model and Replication Model,” Christ University, Bangalore, India.

[12] A. Boicea, F. Radulescu, and L. I. Agapin, “MongoDB vs Oracle - database comparison,” 2012 Third International Conference on Emerging Intelligent Data and Web Technologies, 2012.

[13] L. Okman, N. Gal-Oz, Y. Gonen, E. Gudes, and J. Abramov, “Security Issues in NoSQL Databases,” 2011 International Joint Conference of IEEE TrustCom-11/IEEE ICESS-11/FCST-11, 2011.

[14] V. Abramova and J. Bernardino, “NoSQL Databases: MongoDB vs Cassandra,” Polytechnic Institute of Coimbra, ISEC - Coimbra Institute of Engineering, Coimbra, Portugal.

[15] E. Brewer, “Towards robust distributed systems,” Jun. 2000. [Online]. Available: http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODCkeynote.pdf

[16] S. Gilbert and N. Lynch, “Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services,” SIGACT News, vol. 33, pp. 51–59, June 2002. [Online]. Available: http://doi.acm.org/10.1145/564585.56460

[17] J. Han, E. Haihong, G. Le, and J. Du, “Survey on NoSQL database,” 2011 6th International Conference on Pervasive Computing and Applications (ICPCA), pp. 363–366, 26-28 Oct. 2011. doi:10.1109/ICPCA.2011.6106531.

[18] B. G. Tudorica and C. Bucur, “A comparison between several NoSQL databases with comments and notes,” 2011 10th Roedunet International Conference (RoEduNet), pp. 1–5, 23-25 June 2011. doi:10.1109/RoEduNet.2011.5993686.

[19] A. Lakshman and P. Malik, “Cassandra – A Decentralized Structured Storage System,” SIGOPS Operating Systems Review, vol. 44, pp. 35–40, April 2010.

[20] A. Lakshman, “Cassandra – A structured storage system on a P2P Network,” August 2008.




[21] The Apache Software Foundation, The Apache Cassandra Project, 2011. http://cassandra.apache.org/, last accessed January 2011.

[22] D. Karger, E. Lehman, T. Leighton, R. Panigrahy, M. Levine, and D. Lewin, “Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web,” in Proceedings of the twenty-ninth annual ACM symposium on Theory of computing (STOC ’97), pp. 654–663, New York, NY, USA, ACM, 1997.

[23] https://www.mongodb.org/ - https://docs.mongodb.org/manual/

[24] https://aws.amazon.com/dynamodb/

[25] http://cassandra.apache.org/

[26] G. Weintraub, “Dynamo and BigTable – Review and Comparison,” 2014 IEEE 28th Convention of Electrical and Electronics Engineers in Israel, 2014.

[27] N. Leavitt, “Will NoSQL Databases Live Up to Their Promise?,” IEEE Computer, vol. 43, no. 2, pp. 12–14, 2010.

[28] G. DeCandia et al., “Dynamo: Amazon's highly available key-value store,” SOSP 2007, pp. 205–220.

[29] R. Cattell, “Scalable SQL and NoSQL data stores,” SIGMOD Record, vol. 39, no. 4, pp. 12–27, 2010.

[30] A. Boicea, F. Radulescu, and L. I. Agapin, “MongoDB vs Oracle - database comparison,” 2012 Third International Conference on Emerging Intelligent Data and Web Technologies, 2012.
