Hadoop An Introduction

By: Rishi Arora, www.rishiarora.com

Transcript of Hadoop An Introduction

Page 1: Hadoop An Introduction

By: Rishi Arora

www.rishiarora.com

Page 2: Hadoop An Introduction
Page 3: Hadoop An Introduction
Page 4: Hadoop An Introduction
Page 5: Hadoop An Introduction
Page 6: Hadoop An Introduction
Page 7: Hadoop An Introduction
Page 8: Hadoop An Introduction
Page 9: Hadoop An Introduction

Companies by Estimated Number of Servers (chart)

Page 10: Hadoop An Introduction
Page 11: Hadoop An Introduction
Page 12: Hadoop An Introduction
Page 13: Hadoop An Introduction

Source : http://www.ibmbigdatahub.com/infographic/four-vs-big-data

Page 14: Hadoop An Introduction

WHY BIG DATA?

WHY NOW?

2,500 exabytes of new information in 2012 with Internet as primary driver

Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 "zettabytes" this year

Source: IDC White Paper, "As the Economy Contracts, the Digital Universe Expands"

Page 15: Hadoop An Introduction

Problems with Big Data?

Page 16: Hadoop An Introduction

Problem #1

Reading from and writing to disk is slow: 1 TB drives are read at about 100 MB/sec.

Solution

Use disks in parallel:

1 HDD = 100 MB/sec

100 HDDs = 10 GB/sec
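As a rough back-of-the-envelope check of these numbers (a sketch; the 100 MB/sec per-drive figure is the slide's assumption), the time to scan 1 TB serially versus with 100 drives in parallel:

    public class ScanTime {
        public static void main(String[] args) {
            double totalMb = 1_000_000.0;   // 1 TB expressed as 1,000,000 MB
            double mbPerSec = 100.0;        // per-drive read throughput assumed on the slide
            double oneDrive = totalMb / mbPerSec;      // about 10,000 sec, roughly 2.8 hours
            double hundredDrives = oneDrive / 100.0;   // about 100 sec with 100 drives in parallel
            System.out.printf("1 drive:    %.0f sec (~%.1f hours)%n", oneDrive, oneDrive / 3600.0);
            System.out.printf("100 drives: %.0f sec%n", hundredDrives);
        }
    }

This is essentially why Hadoop spreads data across many disks and reads them at the same time.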

Page 17: Hadoop An Introduction

Problem #2

Hardware failure: with many machines working in parallel, single-machine failures become routine.

Solution

Keep multiple copies of the data (replication).
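HDFS implements this solution through block replication. A minimal sketch with the Hadoop Java FileSystem API (the file path here is hypothetical, and dfs.replication already defaults to 3 on a stock cluster):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // dfs.replication controls how many copies HDFS keeps of each block (default 3)
            conf.set("dfs.replication", "3");
            FileSystem fs = FileSystem.get(conf);
            // Raise the replication factor of one existing file to 5 (hypothetical path)
            fs.setReplication(new Path("/data/important.log"), (short) 5);
            fs.close();
        }
    }

If a datanode dies, the namenode notices the missing copies and re-replicates the affected blocks from the surviving ones.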

Page 18: Hadoop An Introduction

Problem #3

Merging the data read from different disks and nodes.

Solution

Combine the partial results: only completed results need to be taken into consideration and failed results need to be ignored. Data needs to be compressed before being sent across the network.

Page 20: Hadoop An Introduction

Hadoop Components

HDFS: the Hadoop Distributed File System (storage)

MapReduce: the distributed processing framework

Page 21: Hadoop An Introduction

HDFS Overview

• Designed for a modest number of large files (millions instead of billions)

• Sequential access, not random access

• Write once, read many

• Data is split into chunks and stored on multiple nodes as blocks

• The namenode maintains the block locations

• Blocks get replicated over the data nodes

• Single namespace, universally accessible

• Computation is moved to the data (data locality)
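A minimal sketch of the write-once/read-many flow, and of asking the namenode where a file's blocks live, using Hadoop's Java FileSystem API (the path is a hypothetical example; the cluster address comes from the usual core-site.xml):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsBlocksDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/demo/events.txt");   // hypothetical path

            // Write once ...
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("hello hdfs");
            }

            // ... read many: the namenode reports which datanodes hold each block
            FileStatus status = fs.getFileStatus(file);
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("block at offset " + b.getOffset()
                        + " on hosts " + String.join(", ", b.getHosts()));
            }
            fs.close();
        }
    }

The block-to-host mapping printed here is what MapReduce uses to schedule computation next to the data (data locality).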

Page 22: Hadoop An Introduction

Map Reduce Overview

• Tasks are distributed to multiple nodes

• Each node processes the data stored on that node

• Consists of two phases:

  • Map: reads input data and outputs intermediate keys and values

  • Reduce: values with the same key are sent to the same reducer for further processing
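To make the two phases concrete, here is a compact word-count job written against the standard Hadoop MapReduce Java API (a sketch; input and output directories are taken from the command line):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: read a line of input and emit (word, 1) for every word in it
        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        ctx.write(word, ONE);
                    }
                }
            }
        }

        // Reduce phase: all counts for the same word arrive at one reducer; sum them
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Each mapper runs on the node that stores its input split, and the framework groups the intermediate (word, 1) pairs by key before handing them to the reducers.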

Page 23: Hadoop An Introduction

Hadoop Ecosystem

Components layered on HDFS / HDFS v2 with YARN:

• ZooKeeper: coordinator

• Flume: log collector

• Sqoop: data exchanger

• Workflow

• Pig: scripting

• Hive: SQL query

• Machine learning

• Column store

Page 24: Hadoop An Introduction

Thank You !!