September 2008
What is Hadoop?
A yellow elephant
A use for a datacentre
Hadoop is behind Yahoo!
• Yahoo! has about 10,000 machines running Hadoop
• Largest cluster is currently 1,600 nodes
• Storage is about 1 petabyte of user data (compressed)
• Yahoo! runs about 10,000 research jobs/week
source: Eric Baldeschwieler, OSCON, July 25 2007
Java Cloud Computing Edition
•A filesystem that scales to petabytes
•Google's MapReduce implemented in Java
•The foundation for Yahoo!'s search, last.fm's music correlation, and other datamining applications
•Open source: Apache hosted
•A framework for data-centric computation
Commodity data processing for commodity data
MapReduce
1. Map input data => (key,data')
2. Reduce (key, data')* => (key, data'')
3. Repeat until final output is generated
The fun comes from applying it to terabytes of data
Uses: log analysis, correlations, statistics, indexing
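The three steps above can be sketched in a few lines of Python. This is an in-memory toy, not Hadoop's API: `map_reduce`, `status_mapper`, `sum_reducer`, `total_mapper` and the "path status" log format are all hypothetical names and data made up for illustration. It only shows the shape of the computation: map each record to (key, value) pairs, group by key, reduce each group, then feed the output into another pass.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one map/group/reduce pass entirely in memory."""
    groups = defaultdict(list)
    for record in records:
        # Step 1: map each input record to zero or more (key, value) pairs.
        for key, value in mapper(record):
            groups[key].append(value)
    # Step 2: reduce all values that share a key to one result per key.
    return {key: reducer(key, values) for key, values in groups.items()}


def status_mapper(line):
    # Assumed log format (hypothetical): "<path> <status>".
    _path, status = line.split()
    yield status, 1


def sum_reducer(_key, values):
    return sum(values)


def total_mapper(item):
    # Second pass: collapse the per-status counts under a single key.
    _status, count = item
    yield "total", count


LOG_LINES = ["/index.html 200", "/missing 404", "/about 200"]

per_status = map_reduce(LOG_LINES, status_mapper, sum_reducer)
# Step 3: repeat, using the previous output as the next input.
total = map_reduce(per_status.items(), total_mapper, sum_reducer)
print(per_status, total)
```

On a real cluster the grouping step is a distributed shuffle and the mappers and reducers run on many machines, which is where the terabytes come in.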
Example problem: Bluetooth phones
• Map: key each event by Bluetooth device ID
• Reduce: debounce to a list of sightings and durations
• Map: sightings and durations
• Reduce: statistics for each device, day of week, …
lost,"00:0F:B3:92:05:D3","2008-04-17T22:11:15",1124313075
found,"00:0F:B3:92:05:D3","2008-04-17T22:11:29",1124313089
lost,"00:0F:B3:92:05:D3","2008-04-17T22:24:45",1124313885
found,"00:0F:B3:92:05:D3","2008-04-17T22:25:00",1124313900
found,"00:60:57:70:25:0F","2008-04-17T22:29:00",1124314140
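A minimal sketch of the first map and reduce steps, run in memory over the sample records. The slides do not define the debounce rule, so the one here is an assumption: a "found" arriving within `debounce_seconds` of the previous "lost" continues the same sighting. The function names are hypothetical, and the fourth CSV field is assumed to be a timestamp in seconds.

```python
import csv
from collections import defaultdict

SAMPLE = '''lost,"00:0F:B3:92:05:D3","2008-04-17T22:11:15",1124313075
found,"00:0F:B3:92:05:D3","2008-04-17T22:11:29",1124313089
lost,"00:0F:B3:92:05:D3","2008-04-17T22:24:45",1124313885
found,"00:0F:B3:92:05:D3","2008-04-17T22:25:00",1124313900
found,"00:60:57:70:25:0F","2008-04-17T22:29:00",1124314140
'''


def map_record(line):
    """Map: key each event by device ID; value is (seconds, event)."""
    event, device, _iso, seconds = next(csv.reader([line]))
    return device, (int(seconds), event)


def reduce_sightings(events, debounce_seconds):
    """Reduce: turn one device's events into (start, end, duration) tuples."""
    sightings = []  # each entry is [start, end]; end is None while visible
    for t, event in sorted(events):
        last = sightings[-1] if sightings else None
        if event == "found":
            if last is not None and last[1] is None:
                continue  # already visible: duplicate "found"
            if last is not None and t - last[1] <= debounce_seconds:
                last[1] = None  # brief dropout: reopen the sighting
            else:
                sightings.append([t, None])
        elif event == "lost" and last is not None and last[1] is None:
            last[1] = t  # a "lost" with no open sighting is ignored
    return [(s, e, None if e is None else e - s) for s, e in sightings]


def run(lines, debounce_seconds=30):
    grouped = defaultdict(list)
    for line in lines:
        if line.strip():
            device, value = map_record(line)
            grouped[device].append(value)
    return {d: reduce_sightings(v, debounce_seconds) for d, v in grouped.items()}


for device, sightings in run(SAMPLE.splitlines()).items():
    print(device, sightings)
```

The second map/reduce pass would take these tuples as input and aggregate them per device or per day of week, in the same grouped style.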
Datacentre View
[Diagram: layers, bottom to top]
• Hardware + OS
• HDFS: a Name Node (index) plus a Data Node on each worker
• Map/Reduce: a Job Tracker (scheduler) plus a TaskTracker beside each Data Node
• User Job: submitted as a Job and run as Map, Map and Reduce tasks
old world:
[Diagram: a Java EE deployment]
• Browsers, each showing "Travel expenses: server too busy."
• Protocols and pages: IIOP, JSP, WS-*
• App Servers hosting a MessageBean, SessionBeans and Entity Beans
• RDBMS
[Diagram: browser clients, a diskless frontend, a cloud layer and per-role views]
• IE, Mozilla, Chrome: "Your friends are having more fun than you"
• iPhone: "You are having more fun than your friends"
• Scatter/gather? Message Queue? Tuple Space?
• Diskless frontend: Memcached; JSP? PHP?
• Cloud Layer: HDFS, HBase, MapReduce, DLucene
• Ops View: "92% servers up. Network OK. Cost: $200/hour"
• Management view: "David Hasselhoff is very popular today"
• Affiliate View: "David Hasselhoff keywords cost more today"
• Cloud Owner View: "Customer #17 is using lots of disk space. Normal HDD failure rate"
• Developer View: "MR job 17 completed."
• REST APIs
Layers on Top
• Pig (from Pig Latin): MapReduce query language
• Hive: SQL against the data (Facebook)
• HBase: non-relational database
• Mahout: machine learning
• Distributed Lucene: search over HDFS
• Hama: mathematics
Limitations of Hadoop
•HDFS
  − is not HA: the NameNode is a SPOF
  − does not like small files (neither does S3, GFS)
  − server requirements (esp. RAM) high
•Performance, scalability limits being discovered
•Configuration, lifecycle to be improved
•Need Apache project for web log analysis
•Diagnostics could be better
•How power efficient is Hadoop?
What to do?
•Start collecting data now!
• Look at Hadoop for all your large data storage needs
• Look at outsourced hosting of the cluster
•Or learn to manage your own
•Help code the layers on top to meet your needs
http://hadoop.apache.org
What are we up to?
Around Hadoop ... with SmartFrog now
•Make Hadoop deployment agile
•Integrate with dynamic cluster deployments
[Diagram: a stack of Hardware, Hadoop and Vertical applications, with Management, Monitoring and Virtualization alongside]