Transcript of "Getting Started with Hadoop" (bed-con.org/images/files/bed2011/bed2011-hadoop.pdf)

[Page 1]
Getting Started with Hadoop

Josh Devins, Nokia
Berlin Expert Days, April 8, 2011, Berlin, Germany
[Page 2]
http://www.flickr.com/photos/haiko/154105048/
how did we get here?
* Google crawls the web, surfaces the "big data" problem
* big data problem defined: so much data that it cannot be processed by one individual machine
* (also defined as: so much data that you need a team of people to manage it)
* solve it: use multiple machines
[Page 3]
http://www.flickr.com/photos/jamisonjudd/2433102356/
[Page 4]
http://www.flickr.com/photos/torkildr/3462607995/
[Page 5]
http://www.flickr.com/photos/torkildr/3462606643/
* since 1999, Google engineers wrote complex distributed programs to analyze crawled data
* too complex, not accessible
* requirement: must be easy for engineers with little to no distributed computing and large data processing experience
  * fault tolerance
  * scaling
  * simple coding experience
  * easy to teach
  * visibility/monitorability
[Page 6]
• Google implements MapReduce and GFS
• GFS paper published (Ghemawat, et al)
basic history of MapReduce at Google
* 2003: Google implements MapReduce and GFS
  * to support large-scale, distributed computing on large data sets using commodity hardware
  * basically to make data crunching a reality for "regular" Google engineers
* 2003: GFS paper published by Sanjay Ghemawat, et al.
[Page 7]
• MapReduce paper published (Jeffrey Dean and Sanjay Ghemawat)
• MapReduce patent application (2004 applied, 2010 approved)
* 2004: MapReduce paper published by Jeffrey Dean and Sanjay Ghemawat
  * http://labs.google.com/papers/mapreduce.html
* MapReduce is patented by Google (applied 2004, approved 2010), but Google supports Hadoop completely and uses the patent defensively only (to ensure that everyone can use the patented technique)
  * http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=7,650,331.PN.&OS=PN/7,650,331&RS=PN/7,650,331
[Page 8]
• 2004 Doug Cutting and Mike Cafarella create implementation for Nutch
• 2006 Doug Cutting joins Yahoo!
• 2006 Hadoop split out from Nutch
• 2006 Yahoo! search index building powered by Hadoop
• 2007 Yahoo! runs 2x 1,000 node R&D clusters
• 2008 Hadoop wins the 1 TB sort benchmark in 209s on 900 nodes
• 2008 Cloudera founded by ex-Oracle, Yahoo! and Facebook employees
• 2009 Cutting leaves Yahoo! for Cloudera
evolution into Hadoop: a natural continuation of the Google work, this time as open source
* implemented for Nutch's index creation, relying on their NDFS (Nutch distributed filesystem)
* Nutch is a web crawler and search engine based on Lucene
[Page 9]
[Diagram: "book summary" — map1, map2, map3, map4 … mapn feeding into a single reduce]
so what the hell is it already? "a distributed batch processing system"
the non-technical example, courtesy of Matt Biddulph: give n people a book to read and get reports back from them
the map/reduce parts can be parallelized (the section in the outer box)
[Page 10]
map(String key, String value):
  // key: document name
  // value: row/line from document
  for each w in value:
    EmitIntermediate(w, 1);
reduce(String key, Iterator<Integer> values):
  // key: a word
  // values: a list of counts
  Integer count = 0;
  for each v in values:
    count += v;
  Emit(key, count);
sortAndGroup(List<String, Integer> mapOut)
similar to the previous example of reports
(simplified) canonical example of word counting
* give those same n people, or mappers, each a line from the document and have them write down a '1' for every word they see
* the collector is responsible for summing up all the '1's per word
* not a 'pure function' (the 'emit' methods have side-effects; the implementation in Hadoop has side-effects)
* based on, but not exactly, 'map' and 'reduce' in the strictly functional definition
map function takes: - key as document name - value as the line from the document
map function emits: - key as the word - value as the number 1 (I’ve seen this word one time)
reduce function takes: - key as the word - list of values is list of 1’s -- for each time the word was seen by a mapper
reduce function emits: the word, the sum of number of times word was encountered by a mapper
[Page 11]
map input:
(doc1, start of the first document)
(doc1, the document is super interesting)
(doc1, end of the first document)

map output:
(start,1) (of,1) (the,1) (first,1) (document,1)
(the,1) (document,1) (is,1) (super,1) (interesting,1)
(end,1) (of,1) (the,1) (first,1) (document,1)

sort:
(start,1) (of,1) (of,1) (the,1) (the,1) (the,1) (first,1) (first,1) (document,1) (document,1) (document,1) (is,1) (super,1) (interesting,1) (end,1)

group (reduce input):
(start,{1}) (of,{1,1}) (the,{1,1,1}) (first,{1,1}) (document,{1,1,1}) (is,{1}) (super,{1}) (interesting,{1}) (end,{1})

reduce output:
(start,1) (of,2) (the,3) (first,2) (document,3) (is,1) (super,1) (interesting,1) (end,1)
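For reference, here is a minimal sketch of the same word count written against Hadoop's Java MapReduce API (the `org.apache.hadoop.mapreduce` classes shipped with CDH3). The class and variable names are illustrative, not taken from the slides; the sort-and-group step between the two classes is done by the framework. Note that with the standard `TextInputFormat` the map key is the byte offset of the line rather than the document name used in the example above.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // one input record per line; emit (word, 1) for every token
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            // the framework has already sorted and grouped by word;
            // sum the 1's to get the total count per word
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}
```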
[Page 12]
HDFS
logical file view
HDFS primer
* block structure
* standard block size
* replicated blocks, standard 3x
* one input task per block
* data locality
[Page 13]
[Diagram: high-level physical view of HDFS, write operation steps 1–4]

* high-level, physical view of HDFS
* walk through the write operation steps
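To make the write path concrete, here is a hedged sketch (not from the slides) using the Java `FileSystem` API. The client asks the NameNode where to place each block, then streams the data to a pipeline of DataNodes that replicate it; the path, replication factor, and block size below are illustrative.

```java
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {

    public static void main(String[] args) throws Exception {
        // picks up fs.default.name (e.g. hdfs://namenode:8020) from core-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");

        short replication = 3;              // standard replication: 3 copies of every block
        long blockSize = 64 * 1024 * 1024;  // illustrative block size (64 MB)
        int bufferSize = 4096;

        // the write is streamed block by block to a pipeline of DataNodes;
        // the NameNode only tracks the metadata (which blocks live where)
        OutputStream out = fs.create(file, true, bufferSize, replication, blockSize);
        try {
            out.write("hello, HDFS".getBytes("UTF-8"));
        } finally {
            out.close();
        }

        fs.close();
    }
}
```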
[Page 14]
[Diagram: MapReduce job run, steps 1–3]

* job run
* data/processing locality (best-effort attempt)
* can't always achieve data-local processing though
* the job stats will show how many data-local map tasks were run
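A minimal driver sketch (again illustrative, not from the talk's repository) that would submit the word count from earlier: the framework creates roughly one map task per HDFS block of the input, and the JobTracker tries, best effort, to schedule each task on a TaskTracker whose node holds a replica of that block.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // input directory in HDFS; one map task per block of each input file
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // blocks until the job finishes; the job counters/web UI show how many
        // map tasks were data-local vs. rack-local
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```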
[Page 15]
Nomenclature Review
• HDFS
• NameNode: metadata, coordination
• DataNode: storage, retrieval, replication
• MapReduce
• JobTracker: job coordination
• TaskTracker: task management (map and reduce)
* saw all of these pieces in the previous slides
[Page 16]
Hadoop ecosystem
[Page 17]
Yahoo!
[Page 18]
[Page 19]
Cloudera
* Avro started at Yahoo! by Doug Cutting; the work continues at Cloudera
[Page 20]
[Page 21]
Other: Amazon (AWS Elastic MapReduce), Chris Wensel (Cascading), Infochimps (Wukong), Google (Protocol Buffers)
[Page 22]
Hadoop ecosystem
[Page 23]
Diving In
• Cloudera training VM, CDH3b3
• github.com/joshdevins/talks-hadoop-getting-started
• Exercise:
• analyse Apache access logs from mac-geeks.de
• use raw Java MapReduce API, MRUnit
• use Pig, PigUnit
• simple visualization/dashboard
* Cloudera VM, pre-installed with CDH (Cloudera Distribution for Hadoop): http://cloudera-vm.s3.amazonaws.com/cloudera-demo-0.3.5.tar.bz2?downloads (username/password: cloudera/cloudera)
* thanks @maxheadroom, mac-geeks.de
* throughput analysis
* Pig is a high-level abstraction on MR providing a 'data flow' language, with constructs similar to SQL
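As a flavour of the MRUnit part of the exercise, here is a hedged sketch of a unit test for the word count mapper above, assuming MRUnit's `MapDriver` fluent API; it is not taken from the linked repository, and the class and method names are illustrative of how such a test is typically written.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class TokenizerMapperTest {

    @Test
    public void emitsOneForEveryWord() throws Exception {
        // feed a single line into the mapper and assert the (word, 1) pairs it emits
        new MapDriver<LongWritable, Text, Text, IntWritable>()
                .withMapper(new WordCount.TokenizerMapper())
                .withInput(new LongWritable(0), new Text("foo bar foo"))
                .withOutput(new Text("foo"), new IntWritable(1))
                .withOutput(new Text("bar"), new IntWritable(1))
                .withOutput(new Text("foo"), new IntWritable(1))
                .runTest();
    }
}
```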
[Page 24]
1.2.3.4 - - [30/Sep/2010:15:07:53 -0400] "GET /foo HTTP/1.1" 200 3190
1.2.3.4 - - [30/Sep/2010:15:07:53 -0400] "GET /bar HTTP/1.1" 404 3190
1.2.3.4 - - [30/Sep/2010:15:07:54 -0400] "GET /foo HTTP/1.1" 200 3190
1.2.3.4 - - [30/Sep/2010:15:07:54 -0400] "GET /foo HTTP/1.1" 200 3190
group by second:
30/Sep/2010:15:07:53, 1
30/Sep/2010:15:07:54, 2

group by hour:
30/Sep/2010:15:00:00, {(30/Sep/2010:15:07:53, 1), (30/Sep/2010:15:07:54, 2)}

count, find max:
30/Sep/2010:15:00:00, 3, 2
general approach
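A hedged Java sketch of the first step of that approach (not taken from the linked repository): a mapper that pulls the timestamp out of each access log line and emits (second, 1), so a summing reducer produces hits per second; a second pass can then group those per-second counts by hour and find the per-hour total and maximum. The regex and class name are illustrative.

```java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RequestsPerSecondMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    // matches the bracketed timestamp, e.g. [30/Sep/2010:15:07:53 -0400]
    private static final Pattern TIMESTAMP =
            Pattern.compile("\\[(\\d{2}/\\w{3}/\\d{4}:\\d{2}:\\d{2}:\\d{2}) [^\\]]+\\]");

    private static final IntWritable ONE = new IntWritable(1);
    private final Text second = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        Matcher m = TIMESTAMP.matcher(line.toString());
        if (m.find()) {
            // key by the timestamp truncated to the second
            second.set(m.group(1));
            context.write(second, ONE);
        }
        // malformed lines are silently skipped; a counter would be better in practice
    }
}
```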
[Page 25]
Code
github.com/joshdevins/talks-hadoop-getting-started
[Page 26]
Hadoop at Nokia
* Nokia Berlin - location-based services
[Page 27]
Global Architecture
* remote DC’s: Singapore, Peking, Atlanta, Mumbai* central DC: Slough/London* R&D DC’s and Hadoop clusters: Berlin, Boston
[Page 28]
Hardware
| DC      | LONDON      | BERLIN              |
|---------|-------------|---------------------|
| cores   | 12x (w/ HT) | 4x 2.00 GHz (w/ HT) |
| RAM     | 48GB        | 16GB                |
| disks   | 12x 2TB     | 4x 1TB              |
| storage | 24TB        | 4TB                 |
| LAN     | 1Gb         | 2x 1Gb (bonded)     |
http://www.flickr.com/photos/torkildr/3462607995/in/photostream/
BERLIN
* HP DL160 G6
* 1x quad-core Intel Xeon E5504 @ 2.00 GHz (4 cores total)
* 16GB DDR3 RAM
* 4x 1TB 7200 RPM SATA
* 2x 1Gb LAN
* iLO Lights-Out 100 Advanced
[Page 29]
Meaning?
• Size
• Berlin: 2 master nodes, 13 data nodes, ~17TB HDFS
• London: “large enough to handle a year’s worth of activity log data, with plans for rapid expansion”
• Scribe
• 250,000 1KB msg/sec
• 244MB/sec, 14.3GB/hr, 343GB/day
http://www.flickr.com/photos/torkildr/3462607995/in/photostream/
[Page 30]
Reporting
* operational - access logs, throughput, general usage, dashboards
* business reporting - what are all of the products doing, how do they compare to other months
* ad-hoc - random business queries
* almost all of this goes through Pig at some point
* pipelines with Oozie
* sometimes parsing and decoding in a Java MR job, then Pig for the heavy lifting
* mostly goes into an RDBMS using Sqoop for display and querying in other tools
* Tableau for some dashboards and quick visualizations
* many JS libs for good visualization/dashboarding
* sometimes roll your own with image libraries in Python, Ruby, etc.
[Page 31]
IKEA!
other than reporting, we also occasionally do some data exploration, which can be quite fun
any guesses what this is a plot of?
geo-searches for Ikea!
[Page 32]
Ikea Tempelhof
Ikea Spandau
Ikea Schoenefeld
Prenzl Berg Yuppies
Ikea geo-searches bounded to Berlin
can we make any assumptions about what the actual locations are?
kind of, but not much data here
clearly there is a Tempelhof cluster, but the others are not very evident
certainly shows the relative popularity of all the locations
Ikea Lichtenberg was not open yet during this time frame
[Page 33]
Ikea Croydon
Ikea Wembley
Ikea Edmonton
Ikea Lakeside
Ikea geo-searches bounded to London
can we make any assumptions about what the actual locations are?
turns out we can!
using a clustering algorithm like K-Means (maybe from Mahout) we probably could guess
> this is considering search location, what about time?
[Page 34]
Berlin
distribution of searches over days of the week and hours of the day
certainly can make some comments about the hours that Berliners are awake
can we make assumptions about average opening hours?
[Page 35]
Berlin
upwards trend a couple hours before opening
can also clearly make some statements about the best time to visit Ikea in Berlin - Sat night!
BERLIN
* Mon-Fri: 10am-9pm
* Saturday: 10am-10pm
[Page 36]
London
more data points again, so we get smoother results
[Page 37]
London
* Mon-Fri: 10am-10pm
* Saturday: 9am-10pm
* Sunday: 11am-5pm
> potential revenue stream?> what to do with this data or data like this?
[Page 38]
Productizing
[Page 39]
Berlin
another example of something that can be productized
Berlin * traffic sensors * map tiles
[Page 40]
Los Angeles
LA * traffic sensors * map tiles
[Page 41]
Berlin Los Angeles
[Page 42]
Join Us
• Nokia is hiring in Berlin!
• software engineers
• operations engineers
• www.nokia.com/careers