Hadoop: The Hadoop Java Software Framework
Hadoop: Playing with data, at scale
If you have a lot of data to process… what should you know?
Mahesh Tiyyagura
25th November, Bangalore
Mahesh Tiyyagura
Email: [email protected] http://www.twitter.com/tmahesh
Works on large-scale crawling and extraction of structured data from the web. Used Hadoop at Yahoo! to run machine learning algorithms and analyze click logs.
Hadoop
• Massively scalable storage and batch data processing system
• It's all about scale…
– Scaling hardware infrastructure (horizontal scaling)
– Scaling operations and maintenance (handling failures)
– Scaling developer productivity (keep it simple)
Numbers you should know…
• You can store, say, 10TB of data per node
• 1 disk: 75MB/sec (sequential read)
• Say you want to process 200GB of data
• That's ~1 hour just to read the data!!
• Processing data (CPU) is much, much faster (say, 10x)
• To remove the bottleneck, we need to read data in parallel
• Reading from 100 disks in parallel: 7.5GB/sec!!
• Insight: move computation, NOT data
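The slide's numbers are easy to check as back-of-envelope arithmetic (using decimal units, 1GB = 1000MB):

```python
# Back-of-envelope check of the slide's numbers.
data_mb = 200 * 1000            # 200GB to process, in MB
disk_mb_per_sec = 75            # one disk, sequential read

single_disk_min = data_mb / disk_mb_per_sec / 60      # ~44 min, roughly the hour on the slide
parallel_gb_per_sec = 100 * disk_mb_per_sec / 1000    # 100 disks in parallel: 7.5GB/sec
parallel_sec = data_mb / (100 * disk_mb_per_sec)      # the same 200GB read in ~27 seconds
```

Two orders of magnitude from parallel reads alone, which is the whole argument for moving the computation to where the data lives.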
• Oh! BTW, data should not (and cannot) reside on only one node
• In a 1000-node cluster, you can expect ~10 failures per week
• For peace of mind, reliability should be handled by software
Hadoop is designed to address these issues.
The Platform, in brief…
• HDFS: Storing Data
– Data split into multiple blocks across nodes
– Replication protects data from failures
– A master node orchestrates the read/write requests (without being a bottleneck!!)
– Scales linearly… 4TB of raw disk translates to ~1TB of storage (tunable)
• MapReduce (MR): Processing Data
– A beautiful abstraction; asks the user to implement just 2 functions (Map and Reduce)
– You need no knowledge of network IO, node failures, checkpoints, distributed what??
– Most data processing jobs can be mapped onto the MapReduce abstraction
– Data is processed locally, in parallel. Reliability is implicit.
– A giant merge-sort infrastructure does the magic
Will revisit this slide. Some things are better understood in retrospect.
HDFS
MR: Programming Model
• Map function: (key, value) -> list of (key1, value1)
• Reduce function: (key1, list of value1) -> (key1, output)
• Examples:
– map(k, v) -> emit(k, v.toUpper())
– map(k, v) -> foreach c in v; do emit(k, c); done
– reduce(k, vals) -> foreach v in vals; do sum += v; done; emit(k, sum)
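The model above can be sketched as a toy, single-process Python function. This is an illustration of the programming model only, not Hadoop's actual API; `run_mapreduce` and its arguments are made-up names:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy sketch of the MapReduce model: apply map_fn to every record,
    group the intermediate pairs by key (the framework's shuffle/sort),
    then apply reduce_fn once per key."""
    groups = defaultdict(list)
    for key, value in records:
        for k1, v1 in map_fn(key, value):   # map: (key, value) -> [(key1, value1)]
            groups[k1].append(v1)
    # reduce: (key1, [value1, ...]) -> output
    return {k: reduce_fn(k, vals) for k, vals in sorted(groups.items())}

# The slide's examples: an uppercasing map, and a summing reduce.
upper = run_mapreduce([("a", "hi"), ("b", "yo")],
                      lambda k, v: [(k, v.upper())],
                      lambda k, vals: vals)
summed = run_mapreduce([("x", 1), ("x", 2), ("y", 3)],
                       lambda k, v: [(k, v)],
                       lambda k, vals: sum(vals))
```

The real system does the same grouping, but across machines and with fault tolerance; that is exactly the part you don't have to write.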
MAPREDUCE
Thinking in MapReduce…
• Word Count Example
– map(docid, text) -> foreach word in text.split(); do emit(word, 1); done
– reduce(word, counts) -> foreach count in counts; do sum += count; done; emit(word, sum)
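The word-count pseudocode above, written out as runnable Python (an in-process sketch; the `sorted` call stands in for the framework's shuffle/sort between the two stages):

```python
from itertools import groupby

def map_wordcount(docid, text):
    # map(docid, text) -> emit(word, 1) for each word
    return [(word, 1) for word in text.split()]

def reduce_wordcount(word, counts):
    # reduce(word, counts list) -> (word, sum of counts)
    return (word, sum(counts))

def word_count(docs):
    """Map every document, sort the intermediate pairs by key
    (the framework does this for you), then reduce each group."""
    pairs = sorted(p for docid, text in docs for p in map_wordcount(docid, text))
    return dict(reduce_wordcount(w, [c for _, c in grp])
                for w, grp in groupby(pairs, key=lambda p: p[0]))
```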
• Document search index (inverted index)
– map(docid, html) -> foreach term in getTerms(html); do emit(term, docid); done
– reduce(term, docid list) -> emit(term, docid list)
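A sketch of the inverted-index job in Python. Note the elegance of the reduce here: the grouped values per term already *are* the posting list, so reduce is the identity. `get_terms` is a hypothetical stand-in for real HTML term extraction:

```python
from collections import defaultdict

def get_terms(html):
    # Hypothetical stand-in for real term extraction from HTML:
    # here we just lowercase and split on whitespace.
    return html.lower().split()

def inverted_index(docs):
    """map: (docid, html) -> (term, docid) pairs; grouping by term
    yields the posting list directly, so reduce emits it unchanged."""
    index = defaultdict(list)
    for docid, html in docs:
        for term in set(get_terms(html)):   # dedupe terms within a doc
            index[term].append(docid)
    return {term: sorted(ids) for term, ids in index.items()}
```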
Thinking in MapReduce ….
• All the anchor text to a page
– map(docid, html) -> foreach link in getLinks(html); do emit(link, anchorText); done
– reduce(link, anchorText list) -> emit(link, anchorText list)
• Image resize
– map(imgid, image) -> emit(imgid, image.resize())
– No need for a reduce
Hadoop Streaming Demo
• cat someInputFile | shellMapper.sh | shellReducer.sh > someOutputFile
• Each line in the input file is written to stdin of shellMapper.sh
• Each line on stdout of shellMapper.sh is split into key and value (text before the first tab is the key, the rest is the value)
• Each key/value pair is fed as a line into stdin of shellReducer.sh
• Each line on stdout of shellReducer.sh is written to the output file
• hadoop jar hadoop-streaming.jar \
  -input myInputDirs \
  -output myOutputDir \
  -mapper /bin/cat \
  -reducer /bin/wc
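The per-line, tab-separated protocol described above can be sketched as a pair of Python functions (hypothetical stand-ins for shellMapper.sh and shellReducer.sh), with a sort in between playing the role of the framework's shuffle:

```python
from itertools import groupby

def mapper(lines):
    """Streaming-style mapper: one input line in, zero or more
    'key<TAB>value' lines out (here: word count, emitting 'word\t1')."""
    return [f"{word}\t1" for line in lines for word in line.split()]

def reducer(lines):
    """Streaming-style reducer: assumes its input lines arrive sorted
    by key, which is what Hadoop guarantees between the stages."""
    pairs = [line.split("\t", 1) for line in lines]
    return [f"{key}\t{sum(int(v) for _, v in grp)}"
            for key, grp in groupby(pairs, key=lambda p: p[0])]

# The shell pipeline from the slide, simulated in-process:
# cat input | mapper | sort | reducer
if __name__ == "__main__":
    print("\n".join(reducer(sorted(mapper(["to be or", "not to be"])))))
```

In a real streaming job each function would be its own script reading stdin and writing stdout; any language that can do that works.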
• More Info: http://hadoop.apache.org/common/docs/r0.18.2/streaming.html
• Will share the terminal session now… for DEMO
Brief intro to PIG
• Ad-hoc data analysis
• An abstraction over MapReduce
• Think of it as a stdlib for MapReduce
• Supports common data processing operators (join, group by)
• A high-level language for data processing
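To give a flavour of the high-level operators Pig adds over raw MapReduce, here is word count as a Pig Latin sketch (the input file name and field names are hypothetical):

```
-- Word count in Pig Latin; 'docs.txt' is a hypothetical input file.
lines  = LOAD 'docs.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words);
DUMP counts;
```

Five declarative lines instead of a mapper, a reducer, and a job driver; Pig compiles this down to MapReduce jobs for you.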
PIG demo – switch to terminal
• Also try… HIVE, which exposes a SQL-like interface over HDFS data
HIVE demo – switch to terminal