Hadoop


These slides give a basic overview of Hadoop distributed computing.


Overview of HADOOP Distributed Computing

RAGHU JULURI, Senior Member Technical Staff, Oracle India Development Center.


Dealing with Lots of Data

20 billion web pages * 20 KB each = 400 TB
1,000 hard disks just to store the web
One computer can read ~50 MB/sec from disk => ~3 months to read it all

Solution: spread the work over many machines.

Hardware & Software

Software must handle communication and coordination, recovery from failure, status reporting, and debugging. Every application would need to implement all of this itself (Google search indexing, PageRank, Trends, Picasa, ...). In 2003, Google came up with the MapReduce runtime library.


Standard Model


Hadoop Ecosystem


Hadoop, Why?

Need to process multi-petabyte datasets
Expensive to build reliability into each application
Nodes fail every day
– Failure is expected, rather than exceptional
– The number of nodes in a cluster is not constant
Need a common infrastructure
– Efficient, reliable, open source (Apache License)
The goals above are the same as Condor's, but the workloads are I/O-bound, not CPU-bound


HDFS splits user data across servers in a cluster. It uses replication to ensure that even multiple node failures will not cause data loss.
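As a toy illustration of that replication guarantee (not HDFS's real rack-aware placement policy; the function and node names below are made up), each block is assigned to three distinct DataNodes, so any two node failures still leave a live replica:

```python
import itertools

def place_blocks(block_ids, datanodes, replication=3):
    """Toy placement: give each block `replication` distinct DataNodes,
    rotating through the cluster so load is spread evenly."""
    assert len(datanodes) >= replication
    ring = itertools.cycle(datanodes)
    return {block: [next(ring) for _ in range(replication)]
            for block in block_ids}

# A 256 MB file split into four 64 MB blocks, spread over five nodes.
print(place_blocks(["blk_1", "blk_2", "blk_3", "blk_4"],
                   ["dn1", "dn2", "dn3", "dn4", "dn5"]))
```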


Goals of HDFS

Very large distributed file system
– 10K nodes, 100 million files, 10 PB
Assumes commodity hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
Optimized for batch processing
– Data locations are exposed so that computations can move to where the data resides
– Provides very high aggregate bandwidth
Runs in user space, on heterogeneous operating systems


HDFS Architecture

[Diagram: the Client sends (1) a filename to the NameNode, which replies with (2) block IDs and the DataNodes holding them; the Client then (3) reads the data directly from those DataNodes. DataNodes report cluster membership to the NameNode, and a SecondaryNameNode runs alongside it.]

NameNode: maps a file to a file-id and a list of blocks and the DataNodes holding them
DataNode: maps a block-id to a physical location on disk
SecondaryNameNode: periodically merges the transaction log
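A minimal single-process sketch of that three-step read path; the NameNode class and read_file function here are hypothetical stand-ins, not the real HDFS client API:

```python
class NameNode:
    """Toy metadata service: filename -> ordered list of (block_id, replica DataNodes)."""
    def __init__(self, metadata):
        self.metadata = metadata

    def lookup(self, filename):
        # Steps 1-2: filename in, block IDs and DataNode locations out.
        return self.metadata[filename]

def read_file(namenode, datanodes, filename):
    """Step 3: fetch each block directly from the first replica that has it."""
    data = b""
    for block_id, replicas in namenode.lookup(filename):
        holder = next(dn for dn in replicas if block_id in datanodes.get(dn, {}))
        data += datanodes[holder][block_id]  # data flows from DataNodes, not the NameNode
    return data

# Two blocks, each replicated on two of three DataNodes.
dns = {"dn1": {"blk_1": b"Hello, "},
       "dn2": {"blk_1": b"Hello, ", "blk_2": b"HDFS!"},
       "dn3": {"blk_2": b"HDFS!"}}
nn = NameNode({"/user/demo.txt": [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn3", "dn2"])]})
print(read_file(nn, dns, "/user/demo.txt"))  # b'Hello, HDFS!'
```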


MapReduce: Programming Model

[Diagram: word-count data flow from Input through Map and Reduce to Output. The input documents "How now brown cow" and "How does it work now" feed Map tasks, which emit <How,1> <now,1> <brown,1> <cow,1> and <How,1> <does,1> <it,1> <work,1> <now,1>. The MapReduce framework groups the pairs by key into <brown,1> <cow,1> <does,1> <How,1 1> <it,1> <now,1 1> <work,1>, and the Reduce tasks produce the output: brown 1, cow 1, does 1, How 2, it 1, now 2, work 1.]


MapReduce: Programming Model

Process data using special map() and reduce() functions
The map() function is called on every item in the input and emits a series of intermediate key/value pairs
All values associated with a given key are grouped together
The reduce() function is called on every unique key and its value list, and emits a value that is added to the output
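A single-process Python sketch of this model, using the word-count example from the previous slide (the grouping dictionary stands in for the framework's shuffle phase):

```python
from collections import defaultdict

def map_fn(document):
    """Emit an intermediate <word, 1> pair for every word in the input."""
    for word in document.split():
        yield word, 1

def reduce_fn(word, counts):
    """Called once per unique key with all of its values grouped together."""
    return word, sum(counts)

documents = ["How now brown cow", "How does it work now"]

# Shuffle: group all intermediate values by key.
groups = defaultdict(list)
for doc in documents:
    for word, count in map_fn(doc):
        groups[word].append(count)

output = [reduce_fn(word, counts) for word, counts in groups.items()]
print(sorted(output))  # [('How', 2), ('brown', 1), ('cow', 1), ...]
```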


MapReduce Benefits

Greatly reduces parallel-programming complexity
Reduces synchronization complexity
Automatically partitions data
Provides failure transparency
Handles load balancing

Practical
Approximately 1,000 Google MapReduce jobs run every day.


MapReduce Example: Word Frequency

[Diagram: a document flows into Map, which emits <word,1> pairs; the runtime system groups them into <word,1,1,1>; Reduce then emits <word,3>.]


A Brief History

Functional programming (e.g., Lisp)
map() function: applies a function to each value of a sequence
reduce() function: combines all elements of a sequence using a binary operator
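Python carries the same two primitives, which shows the functional roots directly:

```python
from functools import reduce
from operator import add

words = "How now brown cow".split()
lengths = list(map(len, words))  # apply a function to each element: [3, 3, 5, 3]
total = reduce(add, lengths)     # fold the sequence with a binary operator: 14
print(lengths, total)
```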


MapReduce Execution Overview

1. The user program, via the MapReduce library, shards the input data.

[Diagram: the user program splits the Input Data into Shard 0 through Shard 6.]

* Shards are typically 16-64 MB in size
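A sketch of that first step, assuming a simple fixed-size byte split (real implementations also respect record boundaries):

```python
def shard_input(data: bytes, shard_size: int = 64 * 1024 * 1024):
    """Split the input into fixed-size shards; the last shard may be short."""
    return [data[i:i + shard_size] for i in range(0, len(data), shard_size)]

shards = shard_input(b"x" * 200 * 1024 * 1024)  # 200 MB -> shards of 64/64/64/8 MB
print([len(s) // (1024 * 1024) for s in shards])
```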


MapReduce Execution Overview

2. The user program creates process copies distributed across a machine cluster. One copy becomes the “Master” and the others become workers.

[Diagram: the user program forks one Master and many Workers.]


MapReduce Execution Overview

3. The Master distributes M map tasks and R reduce tasks to idle workers.
M == the number of input shards
R == the number of parts the intermediate key space is divided into

[Diagram: the Master sends a Do_map_task message to an idle Worker.]
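The split into R parts is typically a hash of the intermediate key modulo R (the default in Google's MapReduce paper); a sketch:

```python
import hashlib

def partition(key: str, R: int) -> int:
    """Route an intermediate key to one of R reduce tasks. A stable hash
    (unlike Python's per-process salted hash()) keeps the routing
    consistent across all map workers."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % R

print(partition("brown", 4), partition("cow", 4))  # each key lands in one of 4 regions
```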


MapReduce Execution Overview

4. Each map-task worker reads its assigned input shard and outputs intermediate key/value pairs, buffered in RAM.

[Diagram: a Map worker reads Shard 0 and emits key/value pairs.]


MapReduce Execution Overview

5. Each map worker flushes its intermediate values, partitioned into R regions, to local disk and notifies the Master of the disk locations.

[Diagram: the Map worker writes to local storage and reports the disk locations to the Master.]


MapReduce Execution Overview

6. The Master passes the disk locations to an available reduce-task worker, which reads all of the associated intermediate data from the map workers' (remote) storage.

[Diagram: the Master sends disk locations to a Reduce worker, which reads from remote storage.]


MapReduce Execution Overview

7. Each reduce-task worker sorts its intermediate data, then calls the reduce function, passing in each unique key and its associated values. The reduce function's output is appended to the reduce task's partition output file.

[Diagram: the Reduce worker sorts its data and writes its partition output file.]
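A sketch of that reduce side, assuming the intermediate pairs for this partition have already been fetched into memory:

```python
from itertools import groupby
from operator import itemgetter

def run_reduce_task(pairs, reduce_fn, output_path):
    """Sort by key so equal keys are adjacent, then call reduce_fn once
    per unique key with its full value list, appending to the output file."""
    pairs.sort(key=itemgetter(0))
    with open(output_path, "w") as out:
        for key, group in groupby(pairs, key=itemgetter(0)):
            out.write(f"{key}\t{reduce_fn(key, [v for _, v in group])}\n")

run_reduce_task([("now", 1), ("How", 1), ("now", 1), ("How", 1)],
                lambda key, values: sum(values), "part-00000")
```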


MapReduce Execution Overview

8. The Master wakes up the user program when all tasks have completed. The output is contained in R output files.

[Diagram: the Master wakes up the user program; the results are in the output files.]
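Tying steps 1-8 together in one sequential simulation (a stand-in for the distributed scheduling; the map, reduce, and partition functions follow the earlier sketches):

```python
import hashlib
from collections import defaultdict

def hash_partition(key, R):
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big") % R

def word_count_map(shard):
    for word in shard.split():
        yield word, 1

def mapreduce(shards, map_fn, reduce_fn, R=2):
    """Map every shard, partition intermediates into R regions,
    then sort and reduce each region into its own output."""
    regions = [defaultdict(list) for _ in range(R)]
    for shard in shards:                                   # M map tasks
        for key, value in map_fn(shard):
            regions[hash_partition(key, R)][key].append(value)
    return [{k: reduce_fn(k, vs) for k, vs in sorted(r.items())}
            for r in regions]                              # R reduce "output files"

print(mapreduce(["How now brown cow", "How does it work now"],
                word_count_map, lambda k, vs: sum(vs)))
```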


Pig

• Data-flow oriented language, “Pig Latin”
• Datatypes include sets, associative arrays, tuples
• High-level language for routing data; allows easy integration of Java for complex tasks
• Developed at Yahoo!

Hive

• SQL-based data warehousing app
• Feature set is similar to Pig
– Language is more strictly SQL
• Supports SELECT, JOIN, GROUP BY, etc.
• Features for analyzing very large data sets
– Partition columns
– Sampling
– Buckets
• Developed at Facebook


HBase

• Column-store database
– Based on the design of Google BigTable
– Provides interactive access to information
• Holds extremely large datasets (multi-TB)
• Constrained access model
– (key, value) lookup
– Limited transactions (single rows only)
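To make that access model concrete, a sketch using the third-party happybase Python client; it assumes an HBase Thrift gateway running on localhost and a pre-created 'users' table with column family 'info':

```python
import happybase

# Connect via the HBase Thrift gateway (assumed to be running locally).
connection = happybase.Connection("localhost")
table = connection.table("users")

# Reads and writes are addressed by row key; a single-row put is atomic,
# but there are no multi-row transactions.
table.put(b"user:1001", {b"info:name": b"Ada", b"info:city": b"London"})
print(table.row(b"user:1001"))  # {b'info:name': b'Ada', b'info:city': b'London'}
```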


ZooKeeper

• Distributed consensus engine
• Provides well-defined concurrent access semantics:
– Leader election
– Service discovery
– Distributed locking / mutual exclusion
– Message board / mailboxes
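For instance, distributed locking via the third-party kazoo Python client (assuming a ZooKeeper server reachable at 127.0.0.1:2181):

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Contenders create sequential znodes under the same path; the lowest
# sequence number holds the lock, giving cluster-wide mutual exclusion.
lock = zk.Lock("/locks/my-resource", "worker-1")
with lock:                      # blocks until this client holds the lock
    print("critical section")   # no other client holds /locks/my-resource now

zk.stop()
```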


Some more projects…

• Chukwa – Hadoop log aggregation
• Scribe – more general log aggregation
• Mahout – machine-learning library
• Cassandra – column-store database on a P2P backend
• Dumbo – Python library for streaming
• Ganglia – distributed monitoring


Conclusions

• Computing with big datasets is a fundamentally different challenge from doing “big compute” over a small dataset
• New ways of thinking about problems are needed
– New tools provide the means to capture this
– MapReduce, HDFS, etc. can help
