AJUG April 2011
![Page 1: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/1.jpg)
Introduction to MapReduce
Christopher Curtin
![Page 2: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/2.jpg)
About Me
• 20+ years in Technology
• Background in Factory Automation, Warehouse Management and Food Safety system development before Silverpop
• CTO of Silverpop
• Silverpop is a leading marketing automation and email marketing company
![Page 3: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/3.jpg)
Contrived Example
![Page 4: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/4.jpg)
What is MapReduce?
“MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.”
http://labs.google.com/papers/mapreduce.html
![Page 5: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/5.jpg)
Back to the example
• I need to know:
– # of each color M&M
– Average weight of each color
– Average width of each color
![Page 6: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/6.jpg)
![Page 7: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/7.jpg)
Traditional approach
• Initialize data structure
• Read CSV
• Split each row into parts
• Find color in data structure
• Increment count, add width, weight
• Write final result
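The steps above can be sketched in plain Java. This is a minimal single-threaded sketch, not the code from the talk: it assumes each CSV row looks like `color,weight,width` (the slides do not show the file layout), and the `CandyTally` and `Stats` names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-threaded tally; assumes each CSV row is "color,weight,width".
class CandyTally {
    // Running totals for one color.
    static final class Stats {
        long count;
        double totalWeight;
        double totalWidth;

        double averageWeight() { return totalWeight / count; }
        double averageWidth() { return totalWidth / count; }
    }

    // One pass over the rows: split, find the color, increment count, add weight and width.
    static Map<String, Stats> tally(List<String> rows) {
        Map<String, Stats> byColor = new LinkedHashMap<>();   // initialize data structure
        for (String row : rows) {
            String[] parts = row.split(",");                  // split each row into parts
            Stats stats = byColor.computeIfAbsent(parts[0].trim(), c -> new Stats());
            stats.count++;
            stats.totalWeight += Double.parseDouble(parts[1].trim());
            stats.totalWidth += Double.parseDouble(parts[2].trim());
        }
        return byColor;
    }

    public static void main(String[] args) {
        // Made-up sample rows standing in for the CSV file.
        List<String> rows = List.of("red,0.5,1.0", "blue,1.0,1.5", "red,1.5,2.0");
        // Write final result.
        tally(rows).forEach((color, s) -> System.out.println(
                color + " count=" + s.count
                + " avgWeight=" + s.averageWeight()
                + " avgWidth=" + s.averageWidth()));
    }
}
```

Everything lives in one map on one thread, which is simple but leaves the other cores idle.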
![Page 8: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/8.jpg)
ASSume with me
• Determining weight is a CPU intensive step
• 8 core machine
• 5,000,000,000 pieces per shift to process
• Files ‘rotated’ hourly
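A rough back-of-the-envelope check of what these numbers imply, assuming an 8-hour shift (the slides do not state the shift length) and perfect scaling across the 8 cores:

```latex
\[
\frac{5 \times 10^{9}\ \text{pieces}}{8\ \text{h} \times 3600\ \text{s/h}}
  \approx 173{,}611\ \text{pieces/s},
\qquad
\frac{173{,}611\ \text{pieces/s}}{8\ \text{cores}}
  \approx 21{,}701\ \text{pieces/s per core}
\]
```

Under those assumptions each core has roughly 46 microseconds per piece, before any disk or network I/O is counted.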
![Page 9: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/9.jpg)
Thread It!
• Write logic to start multiple threads, pass each one a row (or 1000 rows) to evaluate
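One way this could look in Java is a fixed thread pool that receives batches of rows. This is a sketch under assumptions, not the talk's code: rows are assumed to look like `color,weight,width`, only the count is computed to keep it short, and `ThreadedTally`, `countBatch` and `count` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical threaded counter; assumes rows look like "color,weight,width".
class ThreadedTally {
    // Counts colors in one batch. Each task owns its own map, so no lock is needed here.
    static Map<String, Long> countBatch(List<String> batch) {
        Map<String, Long> counts = new HashMap<>();
        for (String row : batch) {
            counts.merge(row.split(",")[0].trim(), 1L, Long::sum);
        }
        return counts;
    }

    // Splits the rows into batches, runs them on a pool, and merges the partial counts.
    static Map<String, Long> count(List<String> rows, int batchSize, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Map<String, Long>>> pending = new ArrayList<>();
            for (int start = 0; start < rows.size(); start += batchSize) {
                List<String> batch =
                        rows.subList(start, Math.min(start + batchSize, rows.size()));
                Callable<Map<String, Long>> task = () -> countBatch(batch);
                pending.add(pool.submit(task));
            }
            // Coordination logic: wait for every batch, then merge on the calling thread.
            Map<String, Long> total = new HashMap<>();
            for (Future<Map<String, Long>> future : pending) {
                future.get().forEach((color, n) -> total.merge(color, n, Long::sum));
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a batch", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a batch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Made-up sample rows.
        List<String> rows = List.of("red,1,1", "blue,1,1", "red,1,1", "green,1,1", "red,1,1");
        System.out.println(count(rows, 2, 4));
    }
}
```

This version avoids a shared map by giving each batch a private one, but the batching, waiting and merging are all code you have to write and maintain yourself.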
![Page 10: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/10.jpg)
Issues with threading
• Have to write coordination logic
• Locking of the color data structure
• Disk/Network I/O becomes next bottleneck
• As volume increases, cost of CPUs/Disks isn’t linear
![Page 11: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/11.jpg)
Ideas to solve these problems?
• Put it in a database
• Multiple machines, each processes a file
![Page 12: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/12.jpg)
MapReduce
• Map
– Parse the data into name/value pairs
– Can be fast or expensive
• Reduce
– Collect the name/value pairs and perform function on each ‘name’
• Framework makes sure you get all the distinct ‘names’ and only one per invocation
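The contract described above can be illustrated in memory with plain Java. This is only a sketch of the model, not Hadoop's API: the `run` method stands in for the framework, rows are assumed to look like `color,weight,width`, and all names are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the MapReduce contract; this is not Hadoop's API.
class MiniMapReduce {
    // Map: parse one input row (assumed "color,weight,width") into a name/value pair.
    static Map.Entry<String, Double> map(String row) {
        String[] parts = row.split(",");
        return new SimpleEntry<>(parts[0].trim(), Double.parseDouble(parts[1].trim()));
    }

    // Reduce: called once per distinct name with every value emitted for that name.
    static double reduce(String color, List<Double> weights) {
        double sum = 0;
        for (double weight : weights) {
            sum += weight;
        }
        return sum / weights.size();   // average weight for this color
    }

    // Stand-in for the framework: group values by name, then reduce each group once.
    static Map<String, Double> run(List<String> rows) {
        Map<String, List<Double>> grouped = new TreeMap<>();
        for (String row : rows) {
            Map.Entry<String, Double> pair = map(row);
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        Map<String, Double> averages = new TreeMap<>();
        grouped.forEach((color, weights) -> averages.put(color, reduce(color, weights)));
        return averages;
    }

    public static void main(String[] args) {
        // Made-up sample rows.
        System.out.println(run(List.of("red,0.5,1.0", "blue,1.0,1.5", "red,1.5,2.0")));
    }
}
```

Note that `map` and `reduce` never touch shared state; the grouping step in between is the part a real framework takes over and distributes.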
![Page 13: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/13.jpg)
Distributed File System
• System takes the files and makes copies across all the machines in the cluster
• Often files are broken apart and spread around
![Page 14: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/14.jpg)
Move processing to the data!
• Rather than copying files to the processes, push the application to the machine where the data lives!
• System pushes jar files and launches JVMs to process
![Page 15: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/15.jpg)
Runtime Distribution © Concurrent 2009
![Page 16: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/16.jpg)
Hadoop
• Apache’s MapReduce implementation
• Lots of third party support
– Yahoo
– Cloudera
– Others announcing almost daily
![Page 17: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/17.jpg)
Example
![Page 18: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/18.jpg)
Issues with example
• /ajug/output can’t exist!
• What’s with all the ‘Writable’ classes?
• Data Structures have a lot of coding overhead
• What if I want to do multiple things off the source?
• What if I want to do something after the Reduce?
![Page 19: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/19.jpg)
Cascading
• Layer on top of Hadoop
• Introduces Pipes to abstract when mappers or reducers are needed
• Can easily string together logic steps
• No need to think about when to map, when to reduce
• No need for intermediate data structures
![Page 20: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/20.jpg)
Sample Example in Cascading
![Page 21: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/21.jpg)
Multiple Output example in Cascading
![Page 22: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/22.jpg)
Unit testing
• Kind of hard without some upfront thought
• Separate business logic from Hadoop/Cascading-specific parts
• Try to use domain objects or primitives in business logic, not Tuples or Hadoop structures
• Cascading has a nice testing framework to implement
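The separation described above might look like the following sketch. The `Candy` class is a hypothetical domain object that assumes rows of the form `color,weight,width`; it uses no Hadoop or Cascading types, so it can be checked with any plain test runner. Cascading's own testing framework is not shown here.

```java
// Hypothetical domain object: no Hadoop or Cascading types, so it is testable in isolation.
final class Candy {
    final String color;
    final double weight;
    final double width;

    Candy(String color, double weight, double width) {
        this.color = color;
        this.weight = weight;
        this.width = width;
    }

    // Business logic: parse one CSV row, assumed to be "color,weight,width".
    static Candy parse(String row) {
        String[] parts = row.split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected 3 fields but got: " + row);
        }
        return new Candy(
                parts[0].trim(),
                Double.parseDouble(parts[1].trim()),
                Double.parseDouble(parts[2].trim()));
    }

    public static void main(String[] args) {
        // Made-up sample row.
        Candy candy = Candy.parse("red, 0.5, 1.25");
        System.out.println(candy.color + " " + candy.weight + " " + candy.width);
    }
}
```

A mapper or pipe function would then only call `Candy.parse` and copy the fields into whatever structure the framework expects, leaving very little framework-specific code to test.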
![Page 23: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/23.jpg)
Other testing
• Known sets of data are critical at volume
![Page 24: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/24.jpg)
Common Use Cases
• Evaluation of large volumes of data at a regular frequency
• Algorithms that take a single pass through the data
• Sensor data, log files, web analytics, transactional data
• First pass ‘what is going on’ evaluation before building/paying for ‘real’ reports
![Page 25: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/25.jpg)
Things it is not good for
• Ad-hoc queries (though there are some tools on top of Hadoop to help)
• Fast/real-time evaluations
• OLTP
• Well known analysis may be better off in a data warehouse
![Page 26: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/26.jpg)
Issues to watch out for
• Lots of small files
• Default scheduler is pretty poor
• Users need shell-level access?!?
![Page 27: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/27.jpg)
Getting started
• Download latest from Cloudera or Apache
• Set up a local-only cluster (really easy to do)
• Download Cascading
• Optionally download Karmasphere if using Eclipse (http://www.karmasphere.com/)
• Build some simple tests/apps
• Running locally is almost the same as in the cluster
![Page 28: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/28.jpg)
Elastic Map Reduce
• Amazon EC2-based Hadoop
• Define as many servers as you want
• Load the data and go
• 60 CENTS per hour per machine for a decent size
![Page 29: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/29.jpg)
So ask yourself
What could I do with 100 machines in an hour?
![Page 30: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/30.jpg)
Ask yourself again …
What design/architecture do I have because I didn’t have a good way to store the data?
Or
What have I shoved into an RDBMS because I had one?
![Page 31: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/31.jpg)
Other Solutions
• Apache Pig: http://hadoop.apache.org/pig/
– More ‘SQL-like’
– Not as easy to mix regular Java into processes
– More ‘ad hoc’ than Cascading
• Yahoo! Oozie: http://yahoo.github.com/oozie/
– Work coordination via configuration not code
– Allows integration of non-Hadoop jobs into process
![Page 32: Ajug april 2011](https://reader036.fdocuments.us/reader036/viewer/2022070304/54c04df54a795930218b45cd/html5/thumbnails/32.jpg)
Resources
• Me: [email protected] @ChrisCurtin
• Chris Wensel: @cwensel
• Web site: www.cascading.org, Mailing list off website
• Atlanta Hadoop Users Group: http://www.meetup.com/Atlanta-Hadoop-Users-Group/
• Cloud Computing Atlanta Meetup: http://www.meetup.com/acloud/
• O’Reilly Hadoop Book: http://oreilly.com/catalog/9780596521974/