NoSQL by Michael Britton, Mark McGregor, and Sam Howard Simplicity, Speed, Scalability.



What is NoSQL?

● Next-generation databases that mostly address some of these points: being non-relational, distributed, open-source, and horizontally scalable

● The term “NoSQL” is misleading. A more appropriate reading is “Not Only SQL”


Origins

● 1998 - Carlo Strozzi

● Still used the relational model

● More accurately called “NoRel”

● 2009 - Eric Evans and Johan Oskarsson

● Organized an event to discuss open-source distributed databases

● Originally a term to label non-ACID databases

● Meant to be a Twitter hashtag, but it went viral and stuck


Why NoSQL


What You Are Giving Up With NoSQL

● Relationships between entities are basically non-existent

● Limited ACID transactions

● No standard language for queries (SQL)

● Less structured


RDBMS vs. NoSQL

RDBMS

● Structured and organized data

● Structured Query Language (SQL)

● Data and its relationships stored in separate tables

● Data Manipulation Language, Data Definition Language

● Tight consistency

● ACID transactions

NoSQL

● No declarative query language

● No predefined schema

● Key-Value pair storage, Column Store, Document Store, Graph Databases

● Eventual consistency rather than ACID properties

● Unstructured and unpredictable data

● CAP Theorem

● Prioritizes high performance, high availability, and scalability


SQL vs. NoSQL Queries

NoSQL Query:

SQL Query:
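The two queries on this slide did not survive in the transcript. As a rough stand-in, not the presenters' original example, the sketch below runs the same question ("who is older than 30?") both ways. The SQL side uses Python's built-in sqlite3 module. The document side uses a hypothetical `find` helper over plain dicts, written loosely in the style of a document-store query; it is not any real database's API, and the table, fields, and data are made up.

```python
import sqlite3

# Relational side: a table with a fixed schema, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users (name, age) VALUES (?, ?)",
    [("Ann", 31), ("Bob", 24), ("Cy", 45)],
)
sql_names = [
    row[0]
    for row in conn.execute("SELECT name FROM users WHERE age > ? ORDER BY name", (30,))
]

# Document side (hypothetical stand-in): schemaless records held as dicts.
# Note that the documents do not all carry the same fields.
users = [
    {"name": "Ann", "age": 31, "tags": ["admin"]},
    {"name": "Bob", "age": 24},
    {"name": "Cy", "age": 45, "city": "Oslo"},
]


def find(collection, query):
    """Return documents matching every {field: value} or {field: {"$gt": value}} clause."""
    def matches(doc):
        for field, cond in query.items():
            if isinstance(cond, dict) and "$gt" in cond:
                if not (field in doc and doc[field] > cond["$gt"]):
                    return False
            elif doc.get(field) != cond:
                return False
        return True

    return [doc for doc in collection if matches(doc)]


doc_names = sorted(doc["name"] for doc in find(users, {"age": {"$gt": 30}}))
```

Both `sql_names` and `doc_names` end up holding the same two names; the difference is that the SQL query is a declarative string checked against a schema, while the document query is itself a data structure matched against whatever fields each record happens to have.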


NoSQL vs. MySQL

● MySQL > 50 GB Data

● Writes Average: ~300 ms

● Reads Average: ~350 ms

● Cassandra > 50 GB Data

● Writes Average: 0.12 ms

● Reads Average: 15 ms


NoSQL Pros/Cons

Pros

● High Scalability

● Distributed Computing

● Lower Cost

● Schema Flexibility, Semi-Structured Data

● No Complicated Relationships

Cons

● No Standardization

● Limited query capabilities

● The eventual consistency model is not intuitive to program for


Non-Relational: The concept of joining tables together by relations is non-existent.

Distributed: A network of interconnected computers, controlled by a central Database Management System.

Open-Source: Anyone can make changes to the original source code.

Horizontally Scalable: Using multiple computers as one unit to increase productivity.


Non-Relational

● Relational databases join tables together using Primary Key / Foreign Key relationships

● Non-relational databases have no such structure

● Items are aggregated into one file, much like a giant Excel spreadsheet

● Prone to data duplication

● Difficult to update records


Distributed

● Non-relational databases can easily be spread out over multiple machines on the same network

● Each machine in the distributed network can carry the information most relevant to its area

● Controlled by the DDBMS - Distributed Database Management System


Open-Source

● Source code is generally available to the public

● Improve the software as needed

● Share with the community


Horizontally Scalable

[Figure: horizontal vs. vertical scaling]


Other Important Terms

● Denormalization - optimizing read performance by adding redundant data or grouping data in order to improve scalability and performance

● does NOT mean that the data has not been normalized

● Denormalization should ideally take place after 3NF has been achieved

● Constraints are used to ensure that redundant copies of data are synchronized

● Materialized View - a database object that contains the results of a query.

● query result is cached but can be updated from the original query as necessary
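One way to picture denormalization and a materialized view together is the sketch below. It uses Python's sqlite3 module, which has no native materialized views, so the cached result is emulated with an ordinary table that a hypothetical `refresh_post_summary` helper rebuilds from the original query. The table and column names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO posts VALUES (1, 1, 'Intro'), (2, 1, 'Sharding'), (3, 2, 'CAP');
""")


def refresh_post_summary(conn):
    """Rebuild the cached join result (an emulated materialized view)."""
    conn.executescript("""
        DROP TABLE IF EXISTS post_summary;
        CREATE TABLE post_summary AS
            SELECT p.id AS post_id, p.title, a.name AS author_name
            FROM posts p JOIN authors a ON a.id = p.author_id;
    """)


def author_of(conn, post_id):
    # Reads hit the redundant copy, so no join is needed at query time.
    row = conn.execute(
        "SELECT author_name FROM post_summary WHERE post_id = ?", (post_id,)
    ).fetchone()
    return row[0]


refresh_post_summary(conn)
before = author_of(conn, 3)

# The redundant copy goes stale when the source data changes...
conn.execute("UPDATE authors SET name = 'Robert' WHERE id = 2")
stale = author_of(conn, 3)

# ...until it is refreshed from the original query.
refresh_post_summary(conn)
fresh = author_of(conn, 3)
```

The normalized `authors` and `posts` tables remain the source of truth; `post_summary` is the redundant, read-optimized copy that has to be kept in sync.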


Other Important Terms

● Keyspace - object that holds together all column families of a design

● outermost grouping of data in the datastore

● resembles a schema in an RDBMS

● Column Family - a tuple (pair) consisting of a key and a value, where the key is mapped to a value that is a set of columns

● object that contains columns of related data

● resembles a table in an RDBMS


Other Important Terms

● Super Column Family - a tuple (pair) consisting of a key and a value, where the key is mapped to a value that is a set of column families

● similar to a view in an RDBMS

● Column (data store) - a tuple (triplet) consisting of a unique name, a value, and a timestamp

● the timestamp distinguishes old data from new data

● not to be confused with a standard relational database column

● lowest level object in a keyspace
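The nesting described on the last two slides (keyspace, then column family, then row key, then columns) can be sketched with plain Python objects. This is only an illustrative model, not Cassandra's or HBase's actual implementation; the class names and the "newest timestamp wins" rule are assumptions chosen to show what the timestamp is for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """The lowest-level object: a (name, value, timestamp) triplet."""
    name: str
    value: str
    timestamp: int  # larger means newer


class ColumnFamily:
    """Maps each row key to a set of columns, much like a table."""

    def __init__(self):
        self.rows = {}  # row key -> {column name -> Column}

    def write(self, row_key, column):
        columns = self.rows.setdefault(row_key, {})
        current = columns.get(column.name)
        # Assumed rule: keep whichever version carries the newest timestamp.
        if current is None or column.timestamp >= current.timestamp:
            columns[column.name] = column

    def read(self, row_key, name):
        return self.rows[row_key][name].value


# The keyspace is the outermost grouping and holds the column families.
keyspace = {"users": ColumnFamily()}
users = keyspace["users"]

users.write("u1", Column("email", "old@example.com", timestamp=1))
users.write("u1", Column("email", "new@example.com", timestamp=5))
# A delayed write with an older timestamp does not overwrite newer data.
users.write("u1", Column("email", "late@example.com", timestamp=3))

current_email = users.read("u1", "email")
```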


Other Important Terms

● Database Shard - a horizontal partition in a database or a search engine. Each partition is a separate shard.

● shards can be distributed to separate hardware, reducing the number of rows in each table

● not to be confused with plain horizontal partitioning, which refers to splitting one or more tables by rows within a single schema or database server

● Sharding - the process of forming shards within the distributed database system.

● traditionally done by hand coding

● auto-sharding code is highly sought after
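A minimal sketch of hand-coded sharding, assuming the simplest possible routing rule: hash the key and take it modulo the number of shards. The `shard_for`, `put`, and `get` helpers and the in-memory dicts standing in for separate servers are all hypothetical.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Pick a shard by hashing the key (stable across runs, unlike built-in hash())."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


# Each dict stands in for a table on a separate machine.
shards = [dict() for _ in range(4)]


def put(key, row):
    shards[shard_for(key, len(shards))][key] = row


def get(key):
    return shards[shard_for(key, len(shards))].get(key)


for i in range(1000):
    put(f"user:{i}", {"id": i})

# Every shard holds only a fraction of the rows.
sizes = [len(shard) for shard in shards]
```

The weakness of plain modulo routing is that changing the number of shards changes the result for most keys, which is the problem consistent hashing addresses.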


Other Important Terms

● Consistent Hashing - a special kind of hashing in which, when the hash table is resized, only K / n keys need to be remapped on average

● K is the number of keys

● n is the number of slots
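The definition above can be sketched as a hash ring. This is one common way to build consistent hashing, not necessarily the variant any particular database uses; the `HashRing` class, the node names, and the choice of 100 virtual points per node are assumptions for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # virtual points per node, to even out the load
        self._points = []      # sorted hash positions on the ring
        self._owner = {}       # hash position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def node_for(self, key):
        # A key belongs to the first point clockwise from its own hash.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


keys = [f"user:{i}" for i in range(2000)]
ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
before = {key: ring.node_for(key) for key in keys}

ring.add("node-e")  # resize: one more slot
after = {key: ring.node_for(key) for key in keys}

moved = [key for key in keys if before[key] != after[key]]
```

Only the keys that now fall to `node-e` change owner; every other key stays where it was. With five nodes, the moved share should be in the neighborhood of one fifth of the keys, although the exact count depends on where the hashes land.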


All your BASE are belonging to NoSQL

● A BASE system gives up on strict (immediate) consistency.

● Basically Available indicates that the system guarantees availability.

● Soft state indicates that the state of the system may change over time, even without input.

● Eventual consistency indicates that the system will become consistent over time, provided it receives no new input during that time.
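A toy simulation may make the three BASE terms more concrete. It assumes a write is acknowledged after reaching one replica and is delivered to the others later; the `EventuallyConsistentStore` class and its `propagate` step are hypothetical and much simpler than any real replication protocol.

```python
class EventuallyConsistentStore:
    def __init__(self, replica_count=3):
        self.replicas = [dict() for _ in range(replica_count)]
        self.pending = []  # (replica index, key, value) updates not yet delivered

    def write(self, key, value):
        # Basically Available: acknowledge as soon as one replica has the write.
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key, replica=0):
        return self.replicas[replica].get(key)

    def propagate(self):
        # Soft state: replicas change here even though no client wrote anything.
        while self.pending:
            index, key, value = self.pending.pop(0)
            self.replicas[index][key] = value


store = EventuallyConsistentStore()
store.write("x", 1)

fresh_read = store.read("x", replica=0)   # sees the new value
stale_read = store.read("x", replica=2)   # has not heard about it yet

# Eventual consistency: with no further writes, the replicas converge.
store.propagate()
converged = [store.read("x", replica=i) for i in range(3)]
```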


CAP Theorem (Brewer’s Theorem)

● There are three basic requirements which exist in a special relation when designing for a distributed architecture.

● Consistency ‘C’ - the data in the database remains consistent after the execution of the operation

● Availability ‘A’ - the system is always on, no downtime.

● Partition Tolerance ‘P’ - the system continues to function even if the communication among the servers is unreliable.


CAP Theorem Cont.

● The CAP theorem says a distributed system can satisfy only 2 of the 3 requirements at the same time. Each current NoSQL database follows one of the combinations of C, A, and P.

● CA - Single site cluster, therefore all the nodes are always in contact.

● CP - Some data may not be accessible, but the rest is still consistent/accurate

● AP - System is still available under partitioning, but some of the data may be inaccurate.
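The CP and AP trade-offs can be sketched with a value replicated on two nodes. This is a deliberately tiny model under stated assumptions: the `ReplicatedRegister` class, its `mode` flag, and the `partitioned` switch are invented for illustration and do not describe how any specific database behaves.

```python
class Unavailable(Exception):
    """Raised when a CP system refuses a request during a partition."""


class ReplicatedRegister:
    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.values = [0, 0]      # one copy of the value per node
        self.partitioned = False  # True when the two nodes cannot communicate

    def write(self, node, value):
        if not self.partitioned:
            self.values = [value, value]  # both copies updated together
        elif self.mode == "CP":
            # Stay consistent by refusing writes that cannot reach both copies.
            raise Unavailable("cannot reach the other replica")
        else:
            # Stay available by accepting the write locally; the copies diverge.
            self.values[node] = value

    def read(self, node):
        return self.values[node]


cp = ReplicatedRegister("CP")
ap = ReplicatedRegister("AP")
cp.write(0, 1)
ap.write(0, 1)

cp.partitioned = True
ap.partitioned = True

try:
    cp.write(0, 2)
    cp_accepted = True
except Unavailable:
    cp_accepted = False

ap.write(0, 2)
ap_views = (ap.read(0), ap.read(1))  # node 1 still returns the old value
```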


Challenges of NoSQL

● Maturity - By comparison, RDBMS systems have been around for a long time. Most NoSQL alternatives are in pre-production versions with many key features yet to be implemented.

● Support - Most NoSQL systems are Open Source projects, and the companies that offer support are small start-ups without global reach, support services, or the credibility of Oracle, Microsoft, or IBM.


Challenges of NoSQL

● Analytics and Business Intelligence - NoSQL databases have evolved to meet the scaling demands of Web 2.0 applications, so their feature sets focus on those applications rather than on ad hoc query and analysis.

● Administration - The design goal for NoSQL is to provide a zero-admin solution, but as of today it requires a lot of skill to install and a lot of effort to maintain.

● Expertise - Almost all NoSQL developers are still learning how to use and develop for NoSQL.


Advantages of NoSQL

● Elastic Scaling - NoSQL databases are designed to expand transparently to take advantage of new nodes, and they are usually designed with low-cost commodity hardware in mind.

● Big Data - The volumes of data that can be handled by NoSQL systems are greater than what can be handled by the biggest RDBMS.

● No DBA - NoSQL databases are designed from the ground up to require less management: automatic repair, data distribution, and simpler data models lead to lower administration and tuning requirements.


Advantages of NoSQL

● Economic - NoSQL databases typically use clusters of cheap commodity servers to manage the ever-expanding amount of data and transactions.

● Flexible Data Models - NoSQL databases have more relaxed data model restrictions. Key Value stores and document databases allow the application to store virtually any structure it wants in a data element.


Taxonomy (Data Models)

Key-value stores are the simplest NoSQL databases. Every single item in the database is stored as an attribute name (or "key"), together with its value. Examples of key-value stores are Riak and Voldemort. Some key-value stores, such as Redis, allow each value to have a type, such as "integer", which adds functionality.

Document databases pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.

Graph stores are used to store information about networks, such as social connections. Graph stores include Neo4J and HyperGraphDB.

Column stores such as Cassandra and HBase are optimized for queries over large datasets, and store columns of data together, instead of rows.
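To compare the four data models side by side, the sketch below writes small sample records in each shape using plain Python literals. The records and field names are invented, and real products store and index these shapes very differently; the point is only how the data is organized.

```python
# Key-value store: the value is opaque to the database and is looked up by key.
key_value = {"user:1": '{"name": "Ann", "city": "Oslo"}'}

# Document store: the database understands the nested structure of the value.
documents = {
    "user:1": {"name": "Ann", "address": {"city": "Oslo"}, "tags": ["admin", "ops"]},
}

# Column store: all values of one column are kept together instead of row by row.
columns = {
    "name": ["Ann", "Bob", "Cy"],
    "age": [31, 24, 45],
}

# Graph store: nodes plus explicit relationships between them.
graph = {
    "nodes": {"ann": {"name": "Ann"}, "bob": {"name": "Bob"}},
    "edges": [("ann", "FOLLOWS", "bob")],
}

# A document query can reach into a nested field...
city = documents["user:1"]["address"]["city"]

# ...while an aggregate over a column store scans a single column only.
average_age = sum(columns["age"]) / len(columns["age"])

# Following a relationship in the graph model.
followed_by_ann = [dst for src, rel, dst in graph["edges"] if src == "ann" and rel == "FOLLOWS"]
```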


Key-Value Stores

● Examples - Tokyo Cabinet/Tyrant, Redis, Voldemort, Oracle BDB

● Typical applications - Content caching (focus on scaling to huge amounts of data, designed to handle massive load), logging, etc.

● Strengths - Fast lookups

● Weaknesses - Stored data has no schema


Oracle Embraces NoSQL

● Distributed key-value database

● Designed to provide highly reliable, scalable, and available data storage across a configurable set of systems that function as storage nodes

● Data is stored as key-value pairs, which are written to particular storage node(s), based on the hashed value of the primary key.

● Storage nodes are replicated to ensure high availability, rapid failover in the event of a node failure and optimal load balancing of queries.

● Customer applications are written using an easy-to-use Java/C API to read and write data.


Oracle Embraces NoSQL

● Utilizes storage nodes

● more storage nodes provide greater throughput

● Storage Node Agent (SNA) monitors each node's behavior

● Replication nodes work in groups to serve the same data

● Replication factor of 3

● Single-master architecture

● Master node replicates to replication nodes

● Election system elects new master in case of failure


Column Stores

● Examples-Cassandra, HBase, Riak

● Typical applications-Distributed file systems

● Data model-Columns → column families

● Strengths-Fast lookups, good distributed storage of data

● Weaknesses-Very low-level API


Apache Cassandra Project

● Scalability and high availability without compromising performance

● Uses column indexes

● Denormalization

● Materialized Views

● Built-in caching


Apache Cassandra Project

● Used in over 1500 companies with large, active data sets

● Largest cluster has 300 TB of data on over 400 machines

● Replication across multiple data centers allows failed nodes to be replaced with no downtime

● Every node is identical, allowing no single point of failure

● Users can choose between synchronous and asynchronous replication


Document Databases

● Examples - CouchDB, MongoDB

● Typical applications - Web applications (similar to Key-Value stores, but the DB knows what the value is)

● Data model-Collections of Key-Value collections

● Strengths-Tolerant of incomplete data

● Weaknesses-Query performance, no standard query syntax


Hu - MongoDB - us

● Stores data in the form of BSON (Binary JSON) documents with dynamic schemas, making the integration of data in certain types of applications easy and fast.

● The most talked about NoSQL DBMS technology because it features auto-sharding, replication, schema-less design, scalability, and more.


Hu - MongoDB - us

● Full indexing support - index on any attribute

● Replicable - mirror across WAN and LAN

● Auto Sharding

● Document-based querying

● Flexible aggregation

● GridFS allows for storage of data files larger than BSON allows


Graph Databases

● Graph databases store data in graphs to easily represent connected data

● Graphs record data in nodes with properties

● Nodes can have unlimited properties, but are generally broken up into multiple nodes

● Useful for answering questions based on related information
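A small in-memory sketch of the idea: nodes carry properties, relationships connect them, and a question about related information becomes a walk along those relationships. The `Graph` class and `co_stars` function are hypothetical helpers, and the sample data is made up for the example; a real graph database would index and store this very differently.

```python
from collections import defaultdict


class Graph:
    def __init__(self):
        self.nodes = {}                # node id -> property dict
        self.edges = defaultdict(set)  # (node id, relationship) -> neighbor ids

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def relate(self, source, relationship, target):
        self.edges[(source, relationship)].add(target)

    def neighbors(self, node_id, relationship):
        return self.edges.get((node_id, relationship), set())


g = Graph()
g.add_node("keanu", label="Actor", name="Keanu Reeves")
g.add_node("carrie", label="Actor", name="Carrie-Anne Moss")
g.add_node("matrix", label="Movie", title="The Matrix")
g.relate("keanu", "ACTED_IN", "matrix")
g.relate("carrie", "ACTED_IN", "matrix")


def co_stars(graph, actor_id):
    """Names of other actors who appear in any movie with the given actor."""
    result = set()
    for movie in graph.neighbors(actor_id, "ACTED_IN"):
        for other_id, properties in graph.nodes.items():
            if other_id != actor_id and movie in graph.neighbors(other_id, "ACTED_IN"):
                result.add(properties["name"])
    return result


keanu_co_stars = co_stars(g, "keanu")
```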


Neo4J

● Highly Scalable

● Fully ACID

● Intuitive graphical models

● Custom disk-based native storage engine

● Massively scalable, with potential for BILLIONS of nodes

● Highly available


Neo4J

● Expressive, powerful, human-readable graph query language

● EX:

MATCH (a:Actor { name:"Keanu Reeves" })

RETURN a


Other NoSQL DBMS Products Cont.

● CouchDB - Stores data as a collection of documents. Each document is a set of ‘keys’ and corresponding ‘values’. CouchDB supports indices, queries, and views. It uses JSON to store data, JavaScript with MapReduce as its query language, and HTTP for the API.

● Redis - An in-memory key-value data store. Mostly used as a caching mechanism because it keeps data in RAM, which makes retrieval extremely fast. It is a data structure server and not a replacement for a traditional database. Used in combination with products like MySQL to deliver high performance when data must be served rapidly.
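The "cache in front of a database" combination described for Redis is often implemented as a cache-aside lookup. The sketch below shows the pattern only: a plain dict stands in for the in-memory store and sqlite3 stands in for MySQL, so no real Redis or MySQL calls appear, and `get_product_name` and the `products` table are hypothetical.

```python
import sqlite3

# Stand-in for the slower, durable database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO products VALUES (1, 'widget')")

cache = {}                 # stand-in for the in-memory key-value store
stats = {"db_reads": 0}    # counts how often the database is consulted


def get_product_name(product_id):
    key = f"product:{product_id}"
    if key in cache:
        return cache[key]  # cache hit: served straight from memory

    # Cache miss: fall back to the database, then remember the answer.
    stats["db_reads"] += 1
    row = db.execute("SELECT name FROM products WHERE id = ?", (product_id,)).fetchone()
    name = row[0] if row else None
    if name is not None:
        cache[key] = name
    return name


first = get_product_name(1)    # goes to the database
second = get_product_name(1)   # answered from the cache
missing = get_product_name(99)
```

A production setup would also need an expiry or invalidation rule so that cached values do not outlive changes in the database.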


Other NoSQL DBMS Products Cont.

● Hadoop - An open-source framework written in Java that supports data-intensive distributed applications. It supports applications running on large clusters of computers and allows data to be analyzed across many different machines. Designed to scale up from single servers to thousands of machines.

● There are currently 150 different NoSQL databases


Companies That Implement NoSQL

● Google - Bigtable

● Facebook - Cassandra

● Mozilla - HBase

● Adobe - HBase

● Foursquare - MongoDB

● LinkedIn - Voldemort

● Digg - Redis

● Twitter - Hadoop, Pig, Cassandra


Questions?

Tough!


Sources:

● http://nosql-database.org/

● http://www.ignoredbydinosaurs.com/2013/05/explaining-non-relational-databases-my-mom

● http://en.wikipedia.org/wiki/NoSQL

● http://greendatacenterconference.com/blog/the-five-key-advantages-and-disadvantages-of-nosql

● http://www.tutorialindustry.com/nosql-tutorial-for-beginners

● http://www.techrepublic.com/blog/10-things/10-things-you-should-know-about-nosql-databases

● http://readwrite.com/2011/10/24/oracle-formally-embraces-nosql#awesm=~oCvdI8zKkJmAiZ

● http://www.oracle.com/technetwork/database/database-technologies/nosqldb/overview/index.html

● http://cassandra.apache.org/

● http://www.neo4j.org/learn/nosql

● http://www.w3resource.com/mongodb/nosql.php

● http://architects.dzone.com/articles/putting-nosql-perspective

● http://en.wikipedia.org/wiki/Shard_%28database_architecture%29

● http://en.wikipedia.org/wiki/Consistent_hashing

● https://www.mongodb.org/