Big Data Analytics - Universität Hildesheim

Big Data Analytics

Lucas Rego Drumond

Information Systems and Machine Learning Lab (ISMLL), Institute of Computer Science

University of Hildesheim, Germany

Distributed File Systems and NoSQL Database

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Distributed File Systems and NoSQL Database 1 / 31

Outline

1. Distributed File Systems

2. NoSQL Databases

Big Data Analytics 1. Distributed File Systems

Why do we need a Distributed File System?

Read?
- Whole file?
- Specific part?

Write?
- Append to the end of the file?
- Insert content in the middle?

We want to:

- Perform multiple parallel reads and writes
- Have the files available even if one computer crashes (replication)
- Hide parallelization and distribution details

What is a Distributed File System?

File Namespace

/

/home

/home/john

/home/john/big_file

Examples

- GFS (Google Inc.)
- HDFS (Apache Software Foundation)
- Ceph (Inktank, Red Hat)
- MooseFS (Core Technology / Gemius)
- Windows Distributed File System (DFS) (Microsoft)
- FhGFS (Fraunhofer)
- GlusterFS (Red Hat)
- Lustre
- Ibrix

Components

A typical distributed file system contains the following components:

- Clients: provide the interface to the user
- Chunk nodes: store chunks of files
- Master node: stores which parts of each file are on which chunk node
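The three components can be sketched as plain data structures. This is only an illustrative model, not the design of any particular system: the class names, the `chunk_table` field, the handle `"h1"`, and the file `/home/john/small_file` are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkNode:
    """Stores chunk contents, keyed by a chunk handle."""
    name: str
    chunks: dict = field(default_factory=dict)  # handle -> bytes


@dataclass
class MasterNode:
    """Stores metadata only: for each file, its chunks and their locations."""
    # path -> list of (handle, [names of chunk nodes holding a replica])
    chunk_table: dict = field(default_factory=dict)


@dataclass
class Client:
    """User-facing interface; knows the master and the chunk nodes."""
    master: MasterNode
    nodes: dict  # node name -> ChunkNode


# Hypothetical two-node cluster holding a one-chunk file.
c1, c2 = ChunkNode("C1"), ChunkNode("C2")
c1.chunks["h1"] = b"hello"
c2.chunks["h1"] = b"hello"  # second replica of the same chunk

master = MasterNode()
master.chunk_table["/home/john/small_file"] = [("h1", ["C1", "C2"])]

client = Client(master, {"C1": c1, "C2": c2})
```

Note that the master holds no file contents at all; it only maps file paths to chunk handles and node names.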

Distributed File Systems

The Google File System Architecture

Distributed File Systems - Storing files

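[Figure: a master node holding the file namespace (/, /home, /home/john, /home/john/big_file) and chunk nodes C1 to C8. The file /home/john/big_file is split into Chunk 1 to Chunk 4, and each chunk is stored on two chunk nodes.]

Chunk locations of /home/john/big_file:

Chunk     Chunk nodes
Chunk 1   C1, C7
Chunk 2   C3, C5
Chunk 3   C4, C6
Chunk 4   C2, C8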
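A minimal sketch of how such a table could come about: the file is cut into fixed-size chunks and each chunk is assigned to two nodes. The round-robin placement used here is an assumption chosen for simplicity, so the resulting assignment differs from the one in the figure; real systems use their own placement policies. The function names and the tiny chunk size are hypothetical.

```python
from itertools import cycle


def split_into_chunks(data: bytes, chunk_size: int) -> list:
    """Cut the file contents into consecutive fixed-size chunks."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_chunks(num_chunks: int, node_names: list, replicas: int = 2) -> dict:
    """Assign each chunk to `replicas` nodes, walking the node list round robin.

    The nodes of one chunk are distinct as long as replicas <= len(node_names).
    """
    ring = cycle(node_names)
    return {
        f"Chunk {i + 1}": [next(ring) for _ in range(replicas)]
        for i in range(num_chunks)
    }


NODES = ["C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8"]

data = bytes(10)                      # a 10-byte stand-in for big_file
chunks = split_into_chunks(data, 3)   # chunk size of 3 bytes, for illustration
placement = place_chunks(len(chunks), NODES)

for chunk_name, nodes in placement.items():
    print(chunk_name, nodes)
```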

Read Example

[Figure: the client application, the master node with the file namespace, and chunk nodes C1 to C8. Chunk locations of /home/john/big_file: Chunk 1 on C1 and C7, Chunk 2 on C3 and C5, Chunk 3 on C4 and C6, Chunk 4 on C2 and C8.]

1. read(/home/john/big_file, chunk 1)
2. (Chunk 1 handle, {C1, C7})
3. (Chunk 1 handle, byte range)
4. Chunk 1 data
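The four read steps can be sketched as follows. The chunk locations come from the table in the slides; the handle strings, the chunk contents, and the fallback to a second replica when the first one is unreachable are assumptions made for this illustration.

```python
# Chunk locations as kept by the master (from the table in the slides).
# The handle strings are hypothetical.
MASTER_TABLE = {
    "/home/john/big_file": {
        1: ("handle-1", ["C1", "C7"]),
        2: ("handle-2", ["C3", "C5"]),
        3: ("handle-3", ["C4", "C6"]),
        4: ("handle-4", ["C2", "C8"]),
    }
}

# Hypothetical chunk node contents: node name -> {handle: bytes}.
# C1 is simulated as crashed (it has no entry), so the read falls back to C7.
CHUNK_NODES = {
    "C7": {"handle-1": b"contents of chunk 1"},
}


def read(path, chunk_no, start, end):
    # Steps 1 and 2: ask the master for the chunk handle and its replicas.
    handle, replicas = MASTER_TABLE[path][chunk_no]
    # Steps 3 and 4: request the byte range from the first reachable replica.
    for node in replicas:
        store = CHUNK_NODES.get(node)
        if store is not None and handle in store:
            return store[handle][start:end]
    raise IOError(f"no replica of {handle} is reachable")


data = read("/home/john/big_file", 1, 0, 8)
```

The master is only consulted for metadata; the chunk data itself travels directly between the chunk node and the client.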

Write Example

- Make sure each replica contains the same data all the time
- One replica is designated to be the primary replica
- Master pings the nodes to make sure they are alive

[Figure: the client application, the master node with the file namespace, and chunk nodes C1 to C8. Chunk locations of /home/john/big_file: Chunk 1 on C1 and C7, Chunk 2 on C3 and C5, Chunk 3 on C4 and C6, Chunk 4 on C2 and C8.]

1. write(/home/john/big_file, chunk 1)
2. (Chunk 1 handle, {C1, C7})
3. (Chunk 1 handle, data)
4. (Chunk 1 handle, offset)
5. Return status (success or failure)
6. done
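A rough sketch of steps 3 to 6 for an append, under the assumption that the primary replica picks the offset and forwards the same offset and data to the other replica, which then reports success or failure. Steps 1 and 2 (locating the replicas through the master) are left out. The class names and the choice of C1 as primary are hypothetical.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = bytearray()
        self.alive = True

    def apply(self, offset, payload):
        """Write payload at offset and report success or failure (step 5)."""
        if not self.alive:
            return False
        end = offset + len(payload)
        if len(self.data) < end:
            # Grow the chunk with zero bytes so the slice below fits.
            self.data.extend(b"\0" * (end - len(self.data)))
        self.data[offset:end] = payload
        return True


class Primary(Replica):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries

    def append(self, payload):
        # The primary alone picks the offset, which fixes the write order.
        offset = len(self.data)
        ok = self.apply(offset, payload)
        # Step 4: forward the same (offset, payload) to every other replica.
        for replica in self.secondaries:
            ok = replica.apply(offset, payload) and ok
        # Step 6: the combined status goes back to the client.
        return ok


c7 = Replica("C7")
c1 = Primary("C1", [c7])  # assumption: C1 is the primary for Chunk 1
status = [c1.append(b"first;"), c1.append(b"second;")]
```

Because every replica applies the writes at the offsets chosen by the primary, all live replicas end up with identical contents.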

Considerations

- Reads are very efficient operations
- Writes are efficient if they are appends to the end of the file
- Writes in the middle of a file can be problematic
- The primary replica decides the order in which to make writes:
  - Data is always consistent in all replicas

GFS vs. HDFS

                   HDFS                                GFS
Chunk size         128 MB                              64 MB
Default replicas   3 (each replica is stored as        3 chunk nodes
                   2 files: data and generation
                   stamp)
Master             NameNode                            GFS Master
Chunk nodes        DataNode                            Chunk Server
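To see what the chunk size means in practice, the number of chunks for a given file can be computed directly. The 1000 MB file size is a hypothetical figure chosen for the example; the chunk sizes are the 64 MB and 128 MB from the comparison above.

```python
MB = 1024 * 1024


def num_chunks(file_size_bytes: int, chunk_size_bytes: int) -> int:
    """Number of chunks needed, rounding up for a partially filled last chunk."""
    return (file_size_bytes + chunk_size_bytes - 1) // chunk_size_bytes


file_size = 1000 * MB                          # hypothetical file size
gfs_chunks = num_chunks(file_size, 64 * MB)    # 16 chunks
hdfs_chunks = num_chunks(file_size, 128 * MB)  # 8 chunks

print(gfs_chunks, hdfs_chunks)
```

A larger chunk size means fewer entries for the master to track per file, at the cost of coarser units of distribution.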

Google File System

Hadoop Distributed File System

Big Data Analytics 2. NoSQL Databases

Databases for Big Data: NoSQL

NoSQL: “Not only SQL”

Wide variety of database technologies addressing:

- Non-relational
- Distributed storing and processing
- Dynamic schema
- Horizontal scalability

Relational vs NoSQL Databases

Relational Databases

- Structured data: relational tables
- Vertical scaling
- ACID
- Atomic transactions
- More functionality, less scalability

NoSQL Databases

- Structured and unstructured data: collections
- Horizontal scaling
- BASE
- Eventual consistency
- Less functionality, more scalability

Types of NoSQL Databases

Graph databases

A graph database is a database that uses graph structures with nodes, edges, and properties to represent and store data.

- Compared with relational databases, graph databases are often faster for associative data sets.
- They map more directly to the structure of object-oriented applications.
- As they depend less on a rigid schema, they are more suitable to manage ad hoc and changing data with evolving schemas.
- Graph databases are a powerful tool for graph-like queries.

Graph queries

- Reachability queries
- Shortest path queries
- Pattern queries
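The first two query types can be illustrated on a small in-memory graph with breadth-first search. This is a sketch of the queries themselves, not of how a graph database executes them; the graph and its node names are made up, and pattern queries are not covered.

```python
from collections import deque

# Hypothetical directed graph as adjacency lists.
GRAPH = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": [],
    "erin": ["alice"],
}


def shortest_path(graph, source, target):
    """Breadth-first search; returns a shortest path as a list, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def reachable(graph, source, target):
    """Reachability query: is there any path from source to target?"""
    return shortest_path(graph, source, target) is not None


print(shortest_path(GRAPH, "alice", "dave"))
print(reachable(GRAPH, "dave", "alice"))
```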

Column Databases

A column database stores tuples consisting of three elements:

- Unique name: used to reference the column.
- Value: the content of the column.
- Timestamp: the system timestamp used to determine the valid content.

Main advantage: new information about existing entities can be added efficiently.

Example

{
  street: {name: "street", value: "1234 x street", timestamp: 123456789},
  city: {name: "city", value: "san francisco", timestamp: 123456789},
  zip: {name: "zip", value: "94107", timestamp: 123456789}
}
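The example row can be modelled with a small tuple type. The sketch assumes that "determine the valid content" means the version with the newest timestamp wins; the `phone` column, the second `zip` value, and the later timestamps are invented for illustration.

```python
from typing import NamedTuple


class Column(NamedTuple):
    name: str
    value: str
    timestamp: int


def resolve(versions):
    """Assumption: among versions of one column, the newest timestamp wins."""
    return max(versions, key=lambda column: column.timestamp)


row = {
    "street": Column("street", "1234 x street", 123456789),
    "city": Column("city", "san francisco", 123456789),
    "zip": Column("zip", "94107", 123456789),
}

# Adding new information about an existing entity is just one more column.
row["phone"] = Column("phone", "555-0100", 123456790)  # hypothetical value

# A later write to "zip" supersedes the earlier one.
row["zip"] = resolve([row["zip"], Column("zip", "94110", 123456999)])
```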

Document Databases

- Designed for storing, retrieving, and managing document-oriented information.
- In contrast to relational databases and their notions of "Relations" (or "Tables"), these systems are designed around an abstract notion of a "Document".
- Documents inside a document-oriented database are not required to have all the same sections, slots, parts, or keys.
- Documents are addressed in the database via a unique key that represents that document.

Example 1

{
  FirstName: "Bob",
  Address: "5 Oak St.",
  Hobby: "sailing"
}

Example 2

{
  FirstName: "Jonathan",
  Address: "15 Wanamassa Point Road",
  Children: [
    {Name: "Michael", Age: 10},
    {Name: "Jennifer", Age: 8},
    {Name: "Samantha", Age: 5},
    {Name: "Elena", Age: 2}
  ]
}
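Both example documents can live side by side in one collection even though their keys differ. The sketch below uses a plain dictionary as a stand-in for a document store; the `put` and `get` helpers and the keys `"bob"` and `"jonathan"` are hypothetical and do not correspond to the API of any specific database.

```python
# Hypothetical in-memory document store: unique key -> document.
store = {}


def put(key, document):
    store[key] = document


def get(key):
    return store[key]


put("bob", {"FirstName": "Bob", "Address": "5 Oak St.", "Hobby": "sailing"})
put("jonathan", {
    "FirstName": "Jonathan",
    "Address": "15 Wanamassa Point Road",
    "Children": [
        {"Name": "Michael", "Age": 10},
        {"Name": "Jennifer", "Age": 8},
        {"Name": "Samantha", "Age": 5},
        {"Name": "Elena", "Age": 2},
    ],
})

# Documents need not share keys: only some have "Hobby" or "Children".
with_children = [key for key, doc in store.items() if "Children" in doc]
youngest = min(get("jonathan")["Children"], key=lambda child: child["Age"])
```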

Key–Value stores

- Key–Value stores use the associative array as their fundamental data model.
- In this model, data is represented as a collection of key–value pairs.
- The key–value model is one of the simplest non-trivial data models.

Example:
{
  "Great Expectations": "John",
  "Pride and Prejudice": "Alice",
  "Wuthering Heights": "Alice"
}
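Since the model is an associative array, a Python dictionary is a direct stand-in for it. The sketch shows the basic operations on the example data; the added pair `"Jane Eyre": "Bob"` and the looked-up key `"Moby Dick"` are made up for illustration.

```python
books = {
    "Great Expectations": "John",
    "Pride and Prejudice": "Alice",
    "Wuthering Heights": "Alice",
}

value = books["Great Expectations"]   # lookup by key
books["Jane Eyre"] = "Bob"            # insert a new (hypothetical) pair
del books["Wuthering Heights"]        # delete by key
missing = books.get("Moby Dick")      # an absent key yields None

# Looking up by value is not part of the model; it needs a full scan.
alice_keys = [key for key, val in books.items() if val == "Alice"]
```

Everything the model offers efficiently goes through the key, which is what makes it simple to partition across machines.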
