Introduction to Big Data Ricardo Campos NoSQL Databases Mestrado EI-IC Análise e Processamento de Grandes Volumes de Dados Tomar, Portugal, 2016

Introduction to Big Data

Ricardo Campos

NoSQL Databases

Mestrado EI-IC – Análise e Processamento de Grandes Volumes de Dados Tomar, Portugal, 2016

Instituto Politécnico de Tomar

What is Information Retrieval?

Some of the slides used in this presentation were adapted from presentations found on the internet and from the reference bibliography.

AGENDA

What is this talk about?

1 Overview

2 Key Value

3 Column

4 Document

5 Graph

6 Q&A

Early NoSQL databases did not support SQL, hence the name for this class of data management engines.

They also did not provide the traditional atomicity, consistency, isolation, and durability (ACID) properties provided by a relational database.

Unlike traditional relational database management systems, NoSQL databases generally do not provide robust built-in security mechanisms. They instead rely on simple HTTP-based APIs where data is exchanged in plaintext, making the data prone to network-based

A variety of NoSQL database types have emerged. These include the following:

• Key-value pairs

• Column-based

• Document-based

• Graph-based

By far, the simplest of the NoSQL databases are those employing the key-value pair (KVP) model. Unlike RDBMSs, KVP databases do not require a schema.

They consist of keys and a value or set of values, and they are often used for very lightweight transactions.

Key-value storage devices store data as key-value pairs and act like hash tables. Key-value storage devices generally do not maintain any indexes, therefore writes are quite fast.

The table is a list of values where each value is identified by a key.
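The hash-table behaviour described above can be sketched in a few lines of Python. This is only an illustrative in-memory toy, not how Riak, Redis or any other product is implemented; the class name `KeyValueStore` and its `put`/`get`/`delete` methods are assumptions made for this example.

```python
from typing import Any, Dict, Optional


class KeyValueStore:
    """Toy in-memory key-value store backed by a Python dict (a hash table)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        # Insert or overwrite: the store never inspects the value.
        self._data[key] = value

    def get(self, key: str) -> Optional[Any]:
        # Lookup by key only; returns None when the key is absent.
        return self._data.get(key)

    def delete(self, key: str) -> bool:
        # Returns True if the key existed and was removed.
        if key in self._data:
            del self._data[key]
            return True
        return False


# Hypothetical keys and values, for illustration only.
store = KeyValueStore()
store.put("user:1", "Ana")
store.put("user:2", b"\x89PNG")  # opaque binary values are accepted as well
```

Because no secondary index is maintained, a write is a single hash-table insertion, which is the reason given above for fast writes.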

Most of the data is stored as strings.

A single collection can hold multiple data formats:

A key-value storage device is appropriate when:

• unstructured data storage is required;

• the value is fully identifiable via the key alone;

• the value is a standalone entity that is not dependent on other values;

• values have a comparatively simple structure or are binary;

• query patterns are simple, involving insert, select and delete operations only;

A key-value storage device is inappropriate when:

• applications require searching or filtering data using attributes of the stored value;

• relationships exist between different key-value entries;

• updates to individual attributes of the value are required.
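A small sketch can show why these cases fit the model poorly. It assumes, purely for illustration, that values are opaque JSON strings held in a plain dict; the helper names `find_by_attribute` and `update_attribute` are hypothetical.

```python
import json

# Hypothetical data: each value is an opaque JSON string, as the store sees it.
kv = {
    "user:1": json.dumps({"name": "Ana", "city": "Tomar"}),
    "user:2": json.dumps({"name": "Rui", "city": "Porto"}),
    "user:3": json.dumps({"name": "Eva", "city": "Tomar"}),
}


def find_by_attribute(store, attribute, wanted):
    """Client-side filter: every value must be fetched and decoded."""
    matches = []
    for key, raw in store.items():
        if json.loads(raw).get(attribute) == wanted:
            matches.append(key)
    return sorted(matches)


def update_attribute(store, key, attribute, new_value):
    """Read-modify-write: the whole value is replaced to change one field."""
    record = json.loads(store[key])
    record[attribute] = new_value
    store[key] = json.dumps(record)


in_tomar = find_by_attribute(kv, "city", "Tomar")
update_attribute(kv, "user:2", "city", "Lisboa")
```

Both helpers do their work in the application, not in the store: filtering scans every entry, and changing one attribute rewrites the entire value.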

KVP databases do not offer ACID (Atomicity, Consistency, Isolation, Durability) capability, and they require implementers to think about replication and fault tolerance, as these are not expressly controlled by the technology itself.

One widely used open source key-value pair database is called Riak. Others: Redis, Amazon DynamoDB, Voldemort (LinkedIn).

Riak (http://basho.com) is a very fast and scalable implementation of a key-value database. It has many features and is part of an ecosystem consisting of the following:

• Parallel processing:

• Riak Search has a fault-tolerant, distributed full-text searching capability

Tutorial: http://www.littleriakbook.com/

Installing: http://docs.basho.com/riak/kv/2.1.4/

Relational databases are row oriented, as the data in each row of a table is stored together. In a columnar, or column-oriented, database, the data in each column is stored together.

The goal of a columnar database is to efficiently write and read data to and from hard disk storage in order to reduce the time it takes to return a query result.

In a columnar database, all the column 1 values are physically together, followed by all the column 2 values, etc. The data is stored in record order, so the 100th entry for column 1 and the 100th entry for column 2 belong to the same input record.
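The two layouts can be compared with a short Python sketch. The records and the column names `id`, `name` and `age` are made up for illustration, and plain lists stand in for what would be disk blocks in a real system.

```python
# Hypothetical input records (id, name, age).
records = [
    (1, "Ana", 31),
    (2, "Rui", 45),
    (3, "Eva", 28),
]

# Row-oriented layout: all fields of one record are stored together.
row_store = [field for record in records for field in record]

# Column-oriented layout: all values of one column are stored together,
# in record order, so position i in every column belongs to record i.
column_store = {
    "id": [r[0] for r in records],
    "name": [r[1] for r in records],
    "age": [r[2] for r in records],
}


def reconstruct(columns, i):
    """Rebuild the i-th input record from the same position in each column."""
    return tuple(columns[name][i] for name in ("id", "name", "age"))


# Reading a single column touches only that column's values.
average_age = sum(column_store["age"]) / len(column_store["age"])
```

Computing `average_age` reads only the `age` list, whereas the row layout would require stepping over every field of every record.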

Check this video: https://www.youtube.com/watch?v=8KGVFB3kVHQ

Each column can be a collection of related columns itself, referred to as a super-column.

Each super-column can contain an arbitrary number of related columns that are generally retrieved or updated as a single unit.

Each row (identified by a row key) consists of multiple column-families and can have a different set of columns, thereby manifesting flexible schema support.
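One way to picture rows, column families and flexible schemas is as nested dictionaries. The sketch below is an assumption-laden toy (the `put` and `get_family` helpers and the student data are hypothetical) and does not reflect the storage format of HBase or Cassandra.

```python
from collections import defaultdict

# table[row_key][column_family][column] = value
table = defaultdict(lambda: defaultdict(dict))


def put(row_key, family, column, value):
    """Write one cell; rows and families are created on first use."""
    table[row_key][family][column] = value


def get_family(row_key, family):
    """Retrieve one column family of one row as a single unit."""
    return dict(table.get(row_key, {}).get(family, {}))


# Hypothetical rows: each row may carry a different set of columns.
put("student:1", "personal", "name", "Ana")
put("student:1", "personal", "email", "ana@example.org")
put("student:1", "grades", "big_data", 17)
put("student:2", "personal", "name", "Rui")
put("student:2", "grades", "big_data", 15)
put("student:2", "grades", "databases", 14)
```

Here `student:1` has an `email` column that `student:2` lacks, and `student:2` has an extra grade, without any schema change being declared.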

One of the main benefits of a columnar database is that data can be highly compressed. The compression permits columnar operations, such as MIN, MAX, SUM, COUNT and AVG, to be performed very rapidly.
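As an illustration of why compression can help aggregation, the sketch below uses run-length encoding. The slides do not name a compression scheme, so this choice is an assumption, and the `quantity` column is made-up data.

```python
from itertools import groupby


def rle_encode(column):
    """Run-length encode a column into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def rle_count(runs):
    # COUNT is the sum of the run lengths.
    return sum(length for _, length in runs)


def rle_sum(runs):
    # Each run contributes value * length; no decompression is needed.
    return sum(value * length for value, length in runs)


def rle_min(runs):
    return min(value for value, _ in runs)


def rle_max(runs):
    return max(value for value, _ in runs)


# Hypothetical sorted column with many repeated values.
quantity = [1, 1, 1, 1, 2, 2, 5, 5, 5]
runs = rle_encode(quantity)
average = rle_sum(runs) / rle_count(runs)
```

The aggregates are computed over three runs instead of nine values; on long columns with many repeats the saving grows accordingly.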

Column-family storage devices provide fast data access with random read/write capability. They store different column-families in separate physical files, which improves query responsiveness as only the required column-families are searched.

A column-based storage device is appropriate when:

• data represents a tabular structure, each row consists of a large number of columns and nested groups of interrelated data exist;

• support for schema evolution is required as column families can be added or removed without any system downtime;

• certain fields are mostly accessed together;

• query patterns involve insert, select, update and delete operations;

A column-based storage device is inappropriate when:

• relational data access is required; for example, joins

• ACID transactional support is required;

• binary data needs to be stored;

• SQL-compliant queries need to be executed;

One of the most popular columnar databases is HBase (http://hbase.apache.org). Others: Cassandra (used by Facebook) and Amazon SimpleDB

HBase is natively integrated with Hadoop.

Important characteristics of HBase include the following:

• Consistency. Although not an “ACID” implementation, HBase offers strongly consistent reads and writes

• High-volume, incremental data gathering and processing

• High availability:

Document storage devices also store data as key-value pairs. However, unlike key-value storage devices, the stored value is a document that can be queried by the database.

You find two kinds of document databases:

One is often described as a repository for full document-style content (Word files, complete web pages, and so on).

The other is a database for storing document components for permanent storage as a static entity or for dynamic assembly of the parts of a document.

The structure of the documents and their parts is provided by XML or JavaScript Object Notation (JSON) and/or Binary JSON (BSON).

Each document can have a different schema; therefore, it is possible to store different types of documents in the same collection or bucket. Additional fields can be added to a document after the initial insert, thereby providing flexible schema support.
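The flexible-schema behaviour can be sketched with Python dictionaries standing in for JSON documents. The `Collection` class and its `insert`, `update`, `find` and `get` methods are hypothetical names for this toy and are not the API of MongoDB or CouchDB.

```python
class Collection:
    """Toy document collection: documents are dicts stored under an id."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc_id, document):
        # Store a copy so later changes to the caller's dict do not leak in.
        self._docs[doc_id] = dict(document)

    def update(self, doc_id, fields):
        # Add or overwrite individual fields after the initial insert.
        self._docs[doc_id].update(fields)

    def find(self, **criteria):
        """Return ids of documents whose fields match all criteria."""
        return sorted(
            doc_id
            for doc_id, doc in self._docs.items()
            if all(doc.get(k) == v for k, v in criteria.items())
        )

    def get(self, doc_id):
        return self._docs[doc_id]


# Hypothetical documents with different schemas in one collection.
library = Collection()
library.insert("b1", {"title": "NoSQL Basics", "year": 2016})
library.insert("b2", {"title": "Graphs", "authors": ["Eva"], "year": 2016})
library.insert("p1", {"type": "paper", "title": "Columns", "pages": 12})
library.update("b1", {"publisher": "IPT"})
```

Unlike a key-value store, the collection looks inside the stored value, so `find` can match on any field, including fields that only some documents have.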

A document storage device is appropriate when:

• schema evolution is a requirement as the structure of the document is either unknown or is likely to change;

• searches need to be performed on different fields of the documents;

• query patterns involve insert, select, update and delete operations;

A document storage device is inappropriate when:

• binary data needs to be stored;

• multiple documents need to be updated as part of a single transaction;

Document databases are becoming a gold standard for big data adoption. Two of the most well-known projects are MongoDB (www.mongodb.com) and CouchDB (http://couchdb.apache.org). Others: Terrastore.

Graph databases use graph structures in which nodes are connected by edges that represent relationships.

The fundamental structure for graph databases is called “node-relationship.” This structure is most useful when you must deal with highly interconnected data.

Nodes and relationships both support properties: key-value pairs in which the data is stored.

Unlike other NoSQL storage devices, where the emphasis is on the structure of the entities, graph storage devices place emphasis on storing the linkages between entities

Entities are stored as nodes (not to be confused with cluster nodes) and are also called vertices, while the linkages between entities are stored as edges

Nodes can have more than one type of link between them through multiple edges. Having multiple edges is similar to defining multiple foreign keys in an RDBMS; however, not every node is required to have the same edges.

Each node can have attribute data as key-value pairs, such as a customer node with ID, name and age attributes.
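The node and edge model described above might be represented as follows. This is a minimal sketch: the `Node`, `Edge` and `Graph` classes, the edge labels and the sample data are all assumptions for illustration, not the data model of Neo4j or any other product.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    target: str
    label: str
    properties: dict = field(default_factory=dict)


class Graph:
    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)

    def add_edge(self, source, target, label, **properties):
        self.edges.append(Edge(source, target, label, properties))

    def edges_between(self, source, target):
        """Labels of every edge going from source to target."""
        return [e.label for e in self.edges
                if e.source == source and e.target == target]


g = Graph()
g.add_node("c1", name="Ana", age=31)   # customer node with attributes
g.add_node("c2", name="Rui", age=45)
g.add_node("p1", title="Laptop")       # a node with different attributes
g.add_edge("c1", "c2", "FRIEND_OF")
g.add_edge("c1", "c2", "WORKS_WITH", since=2014)
g.add_edge("c1", "p1", "BOUGHT")
```

The two customer nodes are linked by two differently labelled edges, while `c2` has no `BOUGHT` edge at all, which mirrors the point that not every node needs the same edges.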

Queries generally involve finding interconnected nodes based on node attributes and/or edge attributes, commonly referred to as node traversal.

Generally, graph storage devices provide consistency via ACID compliance.

Graph storage devices generally allow adding new types of nodes without making changes to the database. This also enables defining additional links between nodes as new types of relationships or nodes appear in the database

Edges can be unidirectional or bidirectional, setting the node traversal direction.

A graph-based storage device is appropriate when:

• interconnected entities need to be stored;

• entities need to be queried based on the type of relationship they have with each other rather than on their attributes;

• groups of interconnected entities need to be found;

• distances between entities need to be found in terms of the node traversal distance.
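The last use case, distance in terms of node traversal, can be illustrated with a breadth-first search over an adjacency list. The graph below is hypothetical, and real graph databases expose traversal through their own query languages, so this is only a sketch of the idea.

```python
from collections import deque

# Hypothetical directed adjacency list; a bidirectional edge is stored
# as two directed entries (for example ana <-> rui).
adjacency = {
    "ana": ["rui"],
    "rui": ["ana", "eva"],
    "eva": ["joao"],
    "joao": [],
    "ines": ["ana"],
}


def traversal_distance(graph, start, goal):
    """Breadth-first search: number of edges on a shortest path, or None."""
    if start == goal:
        return 0
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour == goal:
                return distance + 1
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None
```

Edge direction matters: `joao` is reachable from `ana`, but because the edges leading to `joao` are unidirectional, no path exists from `joao` back to `ana`.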

A graph-based storage device is inappropriate when:

• updates are required to a large number of node attributes or edge attributes, as this involves searching for nodes or edges, which is a costly operation compared to performing node traversals;

• binary storage is required;

• entities have a large number of attributes or nested data; in that case it is best to store lightweight entities in a graph storage device while storing the rest of the attribute data in a separate non-graph NoSQL storage device.

One of the most widely used graph databases is Neo4j (www.neo4j.org). Others: InfiniteGraph and OrientDB.

There are dozens of NoSQL database engines of the various types we have described. Some that you are more likely to encounter include Apache Cassandra, MongoDB, Amazon DynamoDB, Oracle NoSQL Database, IBM Cloudant, Couchbase, and MarkLogic
