September 2008

H is for Hadoop

Steve Loughran and Guijarro

What is Hadoop?

A yellow elephant

A use for a datacentre

Hadoop is behind Yahoo!

• Yahoo! has about 10,000 machines running Hadoop

• Largest cluster is currently 1,600 nodes

• Storage is about 1 petabyte of user data (compressed)

• Yahoo! runs about 10,000 research jobs/week

source: Eric Baldeschwieler, OSCON, July 25 2007

Java Cloud Computing Edition

•A filesystem that scales to petabytes

•Google's MapReduce implemented in Java

•The foundation for Yahoo!'s search, last.fm's music correlation, and other datamining applications

•Open source: Apache hosted

•A framework for data-centric computation

Commodity data processing for commodity data

MapReduce

1. Map input data => (key,data')

2. Reduce (key, data')* => (key, data'')

3. Repeat until final output is generated

The fun comes from applying it to terabytes of data

Uses: log analysis, correlations, statistics, indexing
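The three steps above can be sketched in a few lines of plain Python. This is an in-memory toy, not Hadoop's API: `map_reduce`, the mappers and reducers, and the sample web-log lines are all hypothetical names and data made up for illustration. It shows one map/reduce pass feeding a second one, as in step 3.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """One in-memory pass: map each record, group values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Keys are visited in sorted order, so the output is deterministic.
    return [out for key in sorted(groups) for out in reducer(key, groups[key])]

# Hypothetical web-log lines: "<client> <path> <status>"
LOG = [
    "10.0.0.1 /index.html 200",
    "10.0.0.2 /index.html 200",
    "10.0.0.1 /missing 404",
]

def status_mapper(line):
    """Step 1: input data => (key, data'), here (status code, 1)."""
    _client, _path, status = line.split()
    yield status, 1

def count_reducer(status, ones):
    """Step 2: (key, data')* => (key, data''), here the count per status code."""
    yield status, sum(ones)

def swap_mapper(pair):
    """Second pass: re-key each (status, count) pair by its count."""
    status, count = pair
    yield count, status

def list_reducer(count, statuses):
    """Second pass: list the status codes that share a count."""
    yield count, sorted(statuses)

pass1 = map_reduce(LOG, status_mapper, count_reducer)
pass2 = map_reduce(pass1, swap_mapper, list_reducer)  # step 3: repeat
print(pass1)  # [('200', 2), ('404', 1)]
print(pass2)  # [(1, ['404']), (2, ['200'])]
```

On a real cluster the grouping step is the distributed shuffle between map and reduce tasks; here it is just a dictionary.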

Example problem: Bluetooth phones

•Map: Bluetooth device ID

•Reduce: debounce to list of sightings and duration

•Map: sightings and durations

•Reduce: statistics for each device, day of week, …

lost,"00:0F:B3:92:05:D3","2008-04-17T22:11:15",1124313075
found,"00:0F:B3:92:05:D3","2008-04-17T22:11:29",1124313089
lost,"00:0F:B3:92:05:D3","2008-04-17T22:24:45",1124313885
found,"00:0F:B3:92:05:D3","2008-04-17T22:25:00",1124313900
found,"00:60:57:70:25:0F","2008-04-17T22:29:00",1124314140
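The two map/reduce stages can be sketched in plain Python over the five sample events. Several things here are assumptions rather than anything the slides state: the 60-second debounce threshold, the rule that a "lost" followed quickly by a "found" counts as one continuous sighting, closing an open sighting at the device's last event, and ignoring the fourth log field. The helper names are hypothetical and this is not Hadoop's API.

```python
import csv
from collections import defaultdict
from datetime import datetime

DEBOUNCE_SECONDS = 60  # assumed threshold, not from the slides
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

LOG = '''lost,"00:0F:B3:92:05:D3","2008-04-17T22:11:15",1124313075
found,"00:0F:B3:92:05:D3","2008-04-17T22:11:29",1124313089
lost,"00:0F:B3:92:05:D3","2008-04-17T22:24:45",1124313885
found,"00:0F:B3:92:05:D3","2008-04-17T22:25:00",1124313900
found,"00:60:57:70:25:0F","2008-04-17T22:29:00",1124314140'''

def map_reduce(records, mapper, reducer):
    """One in-memory pass: map, group by key, reduce (keys in sorted order)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [out for key in sorted(groups) for out in reducer(key, groups[key])]

def event_mapper(row):
    """Stage 1 map: key each event by Bluetooth device ID."""
    kind, device, stamp, _unused = row  # fourth field is ignored in this sketch
    yield device, (datetime.fromisoformat(stamp), kind)

def debounce_reducer(device, events):
    """Stage 1 reduce: merge short lost/found gaps into (start, seconds) sightings."""
    start = lost_at = last = None
    for when, kind in sorted(events):
        last = when
        if kind == "found":
            if start is None:
                start = when  # a new sighting begins
            elif lost_at is not None:
                if (when - lost_at).total_seconds() > DEBOUNCE_SECONDS:
                    # The gap was long enough: close the old sighting, open a new one.
                    yield device, (start, int((lost_at - start).total_seconds()))
                    start = when
                lost_at = None
        elif kind == "lost" and start is not None and lost_at is None:
            lost_at = when  # tentative end, confirmed only if the gap is long
    if start is not None:
        end = lost_at if lost_at is not None else last
        yield device, (start, int((end - start).total_seconds()))

def sighting_mapper(sighting):
    """Stage 2 map: key each sighting by (device, day of week)."""
    device, (start, seconds) = sighting
    yield (device, DAYS[start.weekday()]), seconds

def stats_reducer(key, durations):
    """Stage 2 reduce: simple statistics per device and day."""
    yield key, {"sightings": len(durations), "total_seconds": sum(durations)}

rows = list(csv.reader(LOG.splitlines()))
sightings = map_reduce(rows, event_mapper, debounce_reducer)
stats = map_reduce(sightings, sighting_mapper, stats_reducer)
for line in stats:
    print(line)
```

With these assumptions the first device yields one 811-second sighting on a Thursday: the 15-second gap between 22:24:45 and 22:25:00 is debounced away, and the leading "lost" event is ignored because no sighting was open.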

Datacentre View

[Diagram: a Name Node (index), a Job Tracker (scheduler) and three Data Node + TaskTracker pairs, running on hardware + OS. HDFS is the storage layer; a user job runs under Map/Reduce as Map, Map and Reduce tasks.]

old world: Java EE

[Diagram: three browsers, one showing "Travel expenses: server too busy.", in front of JSP pages and two app servers hosting MessageBeans, SessionBeans and Entity Beans, with IIOP and WS-* links and an RDBMS behind them.]

• IE: "Your friends are having more fun than you"

• Mozilla: "Your friends are having more fun than you"

• Chrome: "Your friends are having more fun than you"

• iPhone: "You are having more fun than your friends"

Scatter/gather? Message Queue? Tuple Space?

Diskless frontend: Memcached, JSP? PHP?

Cloud Layer: HDFS, HBase, MapReduce, DLucene

• Ops View: 92% servers up. Network OK. Cost: $200/hour

• Management View: David Hasselhoff is very popular today

• Affiliate View: David Hasselhoff keywords cost more today

• Cloud Owner View: Customer #17 is using lots of disk space. Normal HDD failure rate

• Developer View: MR job 17 completed.

REST APIs

Layers on Top

• Pig (from Pig Latin): MapReduce query language

• Hive: SQL against the data (Facebook)

• HBase: non-relational database

• Mahout: machine learning

• Distributed Lucene: search over HDFS

• Hama: mathematics

Limitations of Hadoop

• HDFS

− is not HA: the NameNode is a SPOF

− does not like small files (neither does S3, GFS)

− server requirements (esp. RAM) high

•Performance, scalability limits being discovered

•Configuration, lifecycle to be improved

•Need Apache project for web log analysis

•Diagnostics could be better

•How power efficient is Hadoop?
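On the small-files and RAM points above, a back-of-envelope estimate shows why many small files hurt. Both constants are assumptions, not figures from these slides: roughly 150 bytes of NameNode heap per file or block object is a commonly quoted rule of thumb, and 64 MB is taken as the HDFS block size. `namenode_heap_bytes` is a hypothetical helper.

```python
OBJECT_BYTES = 150            # assumed rule-of-thumb heap cost per file or block object
BLOCK_BYTES = 64 * 1024 ** 2  # assumed HDFS block size (64 MB)
TERABYTE = 1024 ** 4

def namenode_heap_bytes(file_count, file_bytes):
    """Estimate NameNode heap for file_count files of file_bytes each."""
    blocks_per_file = max(1, -(-file_bytes // BLOCK_BYTES))  # ceiling division
    objects = file_count * (1 + blocks_per_file)             # one file object plus its blocks
    return objects * OBJECT_BYTES

# The same terabyte stored as 100 KB files versus 1 GB files.
small = namenode_heap_bytes(TERABYTE // (100 * 1024), 100 * 1024)
large = namenode_heap_bytes(TERABYTE // 1024 ** 3, 1024 ** 3)
print(small)  # 3221225400 bytes, roughly 3 GB of heap
print(large)  # 2611200 bytes, roughly 2.6 MB of heap
```

Under these assumptions the same data costs about a thousand times more NameNode memory when stored as small files, which is one reason to pack small records into larger files.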

What to do?

•Start collecting data now!

• Look at Hadoop for all your large data storage needs

• Look at outsourced hosting of the cluster

•Or learn to manage your own

•Help code the layers on top to meet your needs

http://hadoop.apache.org

what are we up to?

• Make Hadoop deployment agile

• Integrate with dynamic cluster deployments

Around Hadoop ... with SmartFrog now

[Diagram: a stack of Hardware, Hadoop and Vertical applications, with Management, Monitoring and Virtualization running alongside.]
