BIGDATA - Survey on Scheduling Methods in Hadoop MapReduce Framework
Big Data and Hadoop
What is the need for Big Data technology when we already have robust, high-performing relational database management systems?
Data was stored in a structured format: primary keys, rows, columns, tuples, and foreign keys.
It was used only for transactional data analysis; later, data warehouses were used for offline data (analysis done within the enterprise).
With massive use of the Internet and social networking (Facebook, LinkedIn), data became less structured.
Data is stored on a central server.
'Big Data' is similar to 'small data', but bigger
…and being bigger, it requires different approaches: techniques, tools, and architectures
…with the aim of solving new problems, or old problems in a better way
Volume
• Data quantity
Velocity
• Data Speed
Variety
• Data Types
HADOOP
Open-source data storage and processing API
Massively scalable, automatically parallelizable
Based on work from Google
GFS + MapReduce + BigTable
Current Distributions based on Open Source and Vendor Work
Apache Hadoop
Cloudera – CDH4 w/ Impala
Hortonworks
MapR
AWS
Windows Azure HDInsight
HDFS (storage): self-healing, high-bandwidth clustered storage
MapReduce (processing): fault-tolerant distributed processing
HDFS is a file system written in Java
› Sits on top of a native file system
› Provides redundant storage for massive amounts of data
› Uses cheap, unreliable computers
Data is split into blocks and stored on multiple nodes in the cluster
› Each block is usually 64 MB or 128 MB (configurable)
› Each block is replicated multiple times (configurable)
› Replicas are stored on different DataNodes
Optimized for large files, 100 MB+
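As a minimal sketch of how a client might load a file into HDFS with these settings (the paths and values are illustrative assumptions, not from the slides):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("dfs.blocksize", "134217728"); // 128 MB blocks (configurable)
    conf.set("dfs.replication", "3");       // three replicas per block
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical paths: copy a local file into the cluster
    fs.copyFromLocalFile(new Path("/tmp/input.txt"),
                         new Path("/user/hadoop/input.txt"));
    fs.close();
  }
}
```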
Master Nodes
• NameNode
  › only 1 per cluster
  › metadata server and database
  › SecondaryNameNode helps with some housekeeping
• JobTracker
  › only 1 per cluster
  › job scheduler
Slave Nodes
• DataNodes
  › 1-4000 per cluster
  › block data storage
• TaskTrackers
  › 1-4000 per cluster
  › task execution
A single NameNode stores all metadata
› Filenames, locations on DataNodes of each block, owner, group, etc.
› All information is maintained in RAM for fast lookup
› File system metadata size is therefore limited to the amount of available RAM on the NameNode
DataNodes store file contents
Stored as opaque ‘blocks’ on the underlying filesystem
Different blocks of the same file will be stored on different DataNodes
Same block is stored on three (or more) DataNodes for redundancy
DataNodes send heartbeats to NameNode
› After a period without any heartbeats, a DataNode is assumed to be lost
› NameNode determines which blocks were on the lost node
› NameNode finds other DataNodes with copies of these blocks
› These DataNodes are instructed to copy the blocks to other nodes
› Replication is actively maintained
The SecondaryNameNode is not a failover NameNode
› Performs memory-intensive administrative functions on behalf of the NameNode
› Should run on a separate machine
[Figure: cluster layout. Each slave node runs a DataNode daemon and a TaskTracker on top of the Linux file system; the NameNode daemon runs on the namenode, and the JobTracker runs on the job submission node.]
[Figure: MapReduce job flow. A MapReduce job is submitted by a client computer to the JobTracker on the master node, which dispatches task instances to TaskTrackers on the slave nodes. In our case the cluster is circe.rc.usf.edu.]
Preparing for MapReduce: Loading Files
› Files are loaded from the native file system (or the cloud) into HDFS
› Files are split into blocks of 64 MB or 128 MB
› Output is immutable
You define: Input, Map, Reduce, Output
› Use Java or another programming language
› Work with key/value pairs
Input: a set of key/value pairs
User supplies two functions:
› map(k, v) → list(k1, v1)
› reduce(k1, list(v1)) → v2
(k1, v1) is an intermediate key/value pair
Output is the set of (k1, v2) pairs
For example, in word counting: map over the line "the cat sat" emits ("the", 1), ("cat", 1), ("sat", 1), and reduce("the", [1, 1]) emits ("the", 2).
InputFormat
Map function
Partitioner
Sorting & Merging
Combiner
Shuffling
Merging
Reduce function
OutputFormat
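Most of these stages correspond to pluggable classes on the Job object; sorting, shuffling, and merging are handled by the framework itself. A minimal configuration sketch, where MyMapper, MyPartitioner, MyCombiner, and MyReducer are hypothetical class names:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class PipelineSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "pipeline-sketch");
    job.setJarByClass(PipelineSketch.class);
    job.setInputFormatClass(TextInputFormat.class);   // InputFormat
    job.setMapperClass(MyMapper.class);               // Map function (hypothetical)
    job.setPartitionerClass(MyPartitioner.class);     // Partitioner (hypothetical)
    job.setCombinerClass(MyCombiner.class);           // Combiner (hypothetical)
    job.setReducerClass(MyReducer.class);             // Reduce function (hypothetical)
    job.setOutputFormatClass(TextOutputFormat.class); // OutputFormat
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```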
[Figure: a Master Node hosting the Name Node and Job Tracker, with Slave Nodes 1-3 each hosting a Task Tracker and a Data Node.]
MapReduce Example: WordCount
InputFormat:
› TextInputFormat
› KeyValueTextInputFormat
› SequenceFileInputFormat
OutputFormat:
› TextOutputFormat
› SequenceFileOutputFormat
Probably the most complex aspect of MapReduce!
Map side
› Map outputs are buffered in memory in a circular buffer
› When the buffer reaches a threshold, its contents are "spilled" to disk
› Spills are merged into a single, partitioned file (sorted within each partition): the combiner runs here
Reduce side
› First, map outputs are copied over to the reducer machine
› "Sort" is a multi-pass merge of map outputs (happens in memory and on disk): the combiner runs here
› The final merge pass goes directly into the reducer
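The buffer size and spill threshold are tunable. A sketch using Hadoop 2.x property names; the values here are illustrative, not recommendations:

```java
import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
  public static Configuration tunedConf() {
    Configuration conf = new Configuration();
    // Size of the in-memory circular buffer for map output, in MB
    conf.setInt("mapreduce.task.io.sort.mb", 256);
    // Fraction of the buffer that, once filled, triggers a spill to disk
    conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
    return conf;
  }
}
```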
[Figure: shuffle and sort. The mapper fills a circular buffer in memory; spills go to disk and are merged (the combiner runs here) into intermediate files on disk. These are copied, along with output from other mappers, to the reducers, which merge them (the combiner may run again) before the final pass into reduce.]
Writable: defines a de/serialization protocol. Every data type in Hadoop is a Writable.
WritableComparable: defines a sort order. All keys must be of this type (but not values).
IntWritable, LongWritable, Text, …: concrete classes for different data types.
SequenceFiles: a binary encoding of a sequence of key/value pairs.
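A sketch of a custom key type: the YearKey class itself is hypothetical, but the write/readFields/compareTo contract is Hadoop's WritableComparable interface:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical key type: a record keyed and sorted by year
public class YearKey implements WritableComparable<YearKey> {
  private int year;

  public YearKey() {}                  // Hadoop requires a no-arg constructor
  public YearKey(int year) { this.year = year; }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(year);                // serialization
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    year = in.readInt();               // deserialization
  }

  @Override
  public int compareTo(YearKey other) {
    return Integer.compare(year, other.year);  // sort order for keys
  }
}
```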
Map function and Reduce function: run this program as a MapReduce job (sketched below).
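A sketch of the canonical WordCount map function, reduce function, and driver, following the standard Hadoop pattern (the slides' exact code is not reproduced in the transcript):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // map(k, v) -> list(k1, v1): emit (word, 1) for every word in the line
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // reduce(k1, list(v1)) -> v2: sum the counts for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: configure and submit the job
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // the reducer doubles as combiner
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```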
You ↔ Hadoop Cluster:
1. Load data into HDFS
2. Develop code locally
3. Submit MapReduce job (3a. go back to Step 2)
4. Retrieve data from HDFS
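Steps 1 and 4 can be done from the command line or programmatically; a sketch of step 4 with illustrative paths (part-r-00000 is the default reducer output file name):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FetchOutput {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Step 4: copy job output back to the local disk (hypothetical paths)
    fs.copyToLocalFile(new Path("out/part-r-00000"),
                       new Path("/tmp/wordcount.txt"));
    fs.close();
  }
}
```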
Applications for Big Data Analytics
› Homeland Security
› Finance
› Smarter Healthcare
› Multi-channel sales
› Telecom
› Manufacturing
› Traffic Control
› Trading Analytics
› Fraud and Risk
› Log Analysis
› Search Quality
› Retail: Churn, NBO
CASE STUDY 1: Environment Change Prediction to Assist Farmers Using Hadoop
Thank you
Questions?