Bigdata and Hadoop


Transcript of Bigdata and Hadoop

Page 1: Bigdata and Hadoop
Page 2: Bigdata and Hadoop

What is the need for Big Data technology

when we already have robust, high-performing

relational database management systems?

Page 3: Bigdata and Hadoop

Data was stored in a structured format: primary keys (PK), rows, columns, tuples, and foreign keys (FK).

It was used mainly for transactional data analysis.

Later, data warehouses were used for offline data (analysis done within the enterprise).

With the massive use of the Internet and social networking (Facebook, LinkedIn), data became less structured.

Data was stored on a central server.

Page 4: Bigdata and Hadoop
Page 5: Bigdata and Hadoop

‘Big Data’ is similar to ‘small data’, but bigger

…but handling bigger data requires different approaches: techniques, tools, and architecture

…with an aim to solve new problems, or old problems in a better way

Page 6: Bigdata and Hadoop

Volume

• Data quantity

Velocity

• Data Speed

Variety

• Data Types

Page 7: Bigdata and Hadoop

HADOOP

Page 8: Bigdata and Hadoop

Open-source data storage and processing API

Massively scalable, automatically parallelizable

Based on work from Google

GFS + MapReduce + BigTable

Current Distributions based on Open Source and Vendor Work

Apache Hadoop

Cloudera – CDH4 with Impala

Hortonworks

MapR

AWS

Windows Azure HDInsight

Page 9: Bigdata and Hadoop

HDFS (Storage): self-healing, high-bandwidth clustered storage

MapReduce (Processing): fault-tolerant distributed processing

Page 10: Bigdata and Hadoop

HDFS is a file system written in Java

Sits on top of a native file system

Provides redundant storage for massive

amounts of data

Designed to run on cheap, unreliable (commodity) computers

Page 11: Bigdata and Hadoop

Data is split into blocks and stored on multiple nodes in the cluster

› Each block is usually 64 MB or 128 MB (configurable)

Each block is replicated multiple times (configurable)

› Replicas are stored on different DataNodes

Designed for large files, 100 MB+
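
Both the block size and the replication factor can also be overridden per file through the HDFS Java API. A minimal sketch (assuming a reachable cluster configured via core-site.xml/hdfs-site.xml; the path used here is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSettings {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Cluster-wide defaults, as configured in hdfs-site.xml
        System.out.println("default block size  = " + fs.getDefaultBlockSize());
        System.out.println("default replication = " + fs.getDefaultReplication());

        // Per-file override: 128 MB blocks, 3 replicas (hypothetical path)
        Path p = new Path("/user/demo/large-input.txt");
        FSDataOutputStream out =
            fs.create(p, true, 4096, (short) 3, 128L * 1024 * 1024);
        out.writeBytes("example content\n");
        out.close();
    }
}
```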

Page 12: Bigdata and Hadoop

[Diagram] Master nodes and slave nodes

Page 13: Bigdata and Hadoop

NameNode

› only 1 per cluster

› metadata server and database

› SecondaryNameNode helps with some housekeeping

• JobTracker

• only 1 per cluster

• job scheduler

Page 14: Bigdata and Hadoop

DataNodes

› 1-4000 per cluster

› block data storage

• TaskTrackers

• 1-4000 per cluster

• task execution

Page 15: Bigdata and Hadoop

A single NameNode stores all metadata

Filenames, locations on DataNodes of each

block, owner, group, etc.

All information maintained in RAM for fast lookup

File system metadata size is limited to the amount

of available RAM on the NameNode

Page 16: Bigdata and Hadoop

DataNodes store file contents

Stored as opaque ‘blocks’ on the underlying filesystem

Different blocks of the same file will be stored on different DataNodes

Same block is stored on three (or more) DataNodes for redundancy

Page 17: Bigdata and Hadoop

DataNodes send heartbeats to NameNode

› After a period without any heartbeats, a DataNode is assumed to be lost

› NameNode determines which blocks were on the lost node

› NameNode finds other DataNodes with copies of these blocks

› These DataNodes are instructed to copy the blocks to other nodes

› Replication is actively maintained

Page 18: Bigdata and Hadoop

The Secondary NameNode is not a failover NameNode

Performs memory-intensive administrative functions on behalf of the NameNode

Should run on a separate machine

Page 19: Bigdata and Hadoop

[Diagram] Each slave node runs a DataNode daemon and a TaskTracker on top of the local Linux file system; the namenode runs the NameNode daemon, and the job submission node runs the JobTracker.

Page 20: Bigdata and Hadoop

MapReduce

Page 21: Bigdata and Hadoop

MapReduce

[Diagram] A MapReduce job is submitted by a client computer to the JobTracker on the master node; the TaskTracker on each slave node runs the job's task instances.

In our case: circe.rc.usf.edu

Page 22: Bigdata and Hadoop

Preparing for MapReduce

Loading files: blocks of 64 MB or 128 MB

File system: native file system, HDFS, or cloud storage

Output: immutable

You define: Input, Map, Reduce, Output

Use Java or another programming language

Work with key-value pairs

Page 23: Bigdata and Hadoop

Input: a set of key/value pairs

User supplies two functions:

› map(k, v) → list(k1, v1)

› reduce(k1, list(v1)) → v2

(k1,v1) is an intermediate key/value pair

Output is the set of (k1,v2) pairs
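
For example, in WordCount the map function emits (word, 1) for every word in its input line, so map(offset, "the cat sat on the mat") yields [("the", 1), ("cat", 1), ("sat", 1), ("on", 1), ("the", 1), ("mat", 1)]; the framework then groups values by key, and reduce("the", [1, 1]) yields ("the", 2).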

Page 24: Bigdata and Hadoop

InputFormat

Map function

Partitioner

Sorting & Merging

Combiner

Shuffling

Merging

Reduce function

OutputFormat

Page 25: Bigdata and Hadoop

[Diagram] Master node: Name Node and Job Tracker. Slave nodes 1–3: each runs a Task Tracker and a Data Node.

Page 26: Bigdata and Hadoop

MapReduce Example: WordCount
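
The WordCount example can be sketched with the org.apache.hadoop.mapreduce Java API roughly as follows (a minimal sketch; the class and field names other than the standard Hadoop types are made up for illustration):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: (byte offset, line of text) -> (word, 1) for every word in the line
    public static class WordMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, total count)
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```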

Page 27: Bigdata and Hadoop

InputFormat:

› TextInputFormat

› KeyValueTextInputFormat

› SequenceFileInputFormat

OutputFormat:

› TextOutputFormat

› SequenceFileOutputFormat
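
With TextInputFormat, each record is a line of the file: the key is the line's byte offset (a LongWritable) and the value is the line itself (a Text). KeyValueTextInputFormat instead splits each line into a key and a value at a separator character (tab by default), and SequenceFileInputFormat reads records from binary SequenceFiles.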

Page 28: Bigdata and Hadoop


Page 29: Bigdata and Hadoop

Probably the most complex aspect of MapReduce!

Map side

› Map outputs are buffered in memory in a circular buffer

› When buffer reaches threshold, contents are “spilled” to disk

› Spills are merged into a single, partitioned file (sorted within each partition): the combiner runs here

Reduce side

› First, map outputs are copied over to reducer machine

› “Sort” is a multi-pass merge of map outputs (happens in

memory and on disk): combiner runs here

› Final merge pass goes directly into reducer
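
Because the combiner may run zero, one, or several times on both the map and reduce sides, it must not change the result: in practice it should be a commutative, associative reduction (as in WordCount, where the reducer itself can serve as the combiner) and must emit the same key/value types it consumes.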

Page 30: Bigdata and Hadoop

[Diagram] The mapper fills a circular buffer in memory; spills go to disk and are merged (the combiner runs here) into intermediate files on disk; the reducer copies these from this mapper and other mappers, merges the spills (the combiner may run again), and feeds the final merge into the reduce function. Other reducers copy their own partitions in the same way.

Page 31: Bigdata and Hadoop

Writable: defines a de/serialization protocol. Every data type in Hadoop is a Writable.

WritableComparable: defines a sort order. All keys must be of this type (but not values).

IntWritable, LongWritable, Text: concrete Writable classes for different data types.

SequenceFiles: binary-encoded sequences of key/value pairs.
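
A minimal sketch of a custom key type (the class name and fields are hypothetical; write/readFields/compareTo are what the Writable and WritableComparable contracts require):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// A custom key type: serializable (write/readFields) and sortable (compareTo)
public class YearTemperaturePair implements WritableComparable<YearTemperaturePair> {
    private int year;
    private int temperature;

    public void set(int year, int temperature) {
        this.year = year;
        this.temperature = temperature;
    }

    @Override
    public void write(DataOutput out) throws IOException {     // serialization
        out.writeInt(year);
        out.writeInt(temperature);
    }

    @Override
    public void readFields(DataInput in) throws IOException {  // deserialization
        year = in.readInt();
        temperature = in.readInt();
    }

    @Override
    public int compareTo(YearTemperaturePair other) {          // sort order for keys
        int cmp = Integer.compare(year, other.year);
        return cmp != 0 ? cmp : Integer.compare(temperature, other.temperature);
    }
}
```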

Page 32: Bigdata and Hadoop

Map function

Reduce function

Run this program as a MapReduce job
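
A driver that configures and submits the job could look roughly like this (a sketch assuming the WordCount.WordMapper and WordCount.SumReducer classes sketched earlier; the input and output HDFS paths are taken from the command line):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.WordMapper.class);
        job.setCombinerClass(WordCount.SumReducer.class); // safe: summing is associative
        job.setReducerClass(WordCount.SumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```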

Page 33: Bigdata and Hadoop

[Diagram] You and the Hadoop Cluster:

1. Load data into HDFS

2. Develop code locally

3. Submit MapReduce job

3a. Go back to Step 2

4. Retrieve data from HDFS
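
Steps 1 and 4 can be done from the command line (hadoop fs -put / -get) or programmatically through the FileSystem API, roughly as sketched below (all paths are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopy {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Step 1: load local data into HDFS
        fs.copyFromLocalFile(new Path("/tmp/input.txt"),
                             new Path("/user/demo/input/input.txt"));

        // Step 4: retrieve job output from HDFS back to the local file system
        fs.copyToLocalFile(new Path("/user/demo/output/part-r-00000"),
                           new Path("/tmp/wordcount-output.txt"));
    }
}
```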

Page 34: Bigdata and Hadoop

Applications for Big Data Analytics

Homeland Security

Finance

Smarter Healthcare

Multi-channel sales

Telecom

Manufacturing

Traffic Control

Trading Analytics

Fraud and Risk

Log Analysis

Search Quality

Retail: Churn, NBO

Page 35: Bigdata and Hadoop

CASE STUDY 1: Environment Change Prediction to Assist Farmers Using Hadoop

Page 36: Bigdata and Hadoop

Thank you

Questions?