Experiences Teaching MapReduce in the Clouds Ari Rabkin, Charles Reiss, Randy Katz, David Patterson...
![Page 1: Experiences Teaching MapReduce in the Clouds Ari Rabkin, Charles Reiss, Randy Katz, David Patterson University of California, Berkeley 1.](https://reader036.fdocuments.us/reader036/viewer/2022062516/56649d985503460f94a8352b/html5/thumbnails/1.jpg)
1
Experiences Teaching MapReduce in the Clouds
Ari Rabkin, Charles Reiss, Randy Katz, David Patterson
University of California, Berkeley
Introduction: What we did
• Hadoop MapReduce performance benchmarking
• 300 students, 80 cores per student (in one semester)
• 2400 cores
• Impossible without the cloud
Context: Teaching varieties of parallelism
• Instruction (e.g. pipelining), Data (e.g. vector instructions), Request (e.g. replicated webservers), …
• We were teaching many of these in a sophomore course
• This talk focuses on task parallelism
Task parallelism
• Our example: MapReduce
• Sophomores wrote a MapReduce program and ran it in a distributed environment
• Observed speedup
• On a large dataset using real-world tools
Others have taught MapReduce
• As a programming paradigm [Johnson '08]
• As part of an elective "big data" analysis course [Aaron '08, Lin '10, Couch '10]
Unlike prior work, we
• Cared about performance and its implementation on a cluster
• Taught sophomores
• Emphasized cost and economics
Outline
• Motivation: MapReduce and why it matters
• Assignment goals and design
• Experiences
  o challenges for students
  o challenges for instructors
MapReduce: Why it matters
• Trend of "big data"
  o more data collection: smartphones, Internet services, etc.
  o cheaper data storage
  o cheaper access to data processing capability: public cloud computing providers
• Dominant way to make sense of very large datasets on commodity hardware is MapReduce
  o Google, Facebook, IBM, Amazon, many more, …
MapReduce: Programming model
input: input records (e.g. page from a web crawl)
"map": a function call per record
key-value pairs (e.g. word -> # of times in record)
group by: list of values for each key
"reduce": a function call per group
output: results for each key (e.g. word and its number of occurrences)
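The programming model above can be sketched in a few lines of plain Java. This is only an illustration using in-memory collections, not the Hadoop API; the class name `WordCountSketch` and the method names are hypothetical, and the word-count task is the example the slide itself uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-machine sketch of the model; not Hadoop code.
public class WordCountSketch {
    // "map": called once per input record; emits word -> # of times in record.
    static Map<String, Integer> map(String record) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : record.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // "reduce": called once per group; receives every value emitted for one key.
    static int reduce(String key, List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    static Map<String, Integer> run(List<String> records) {
        // "group by": collect the list of values for each key.
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String record : records) {
            for (Map.Entry<String, Integer> pair : map(record).entrySet()) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
            }
        }
        // One "reduce" call per group produces the result for that key.
        Map<String, Integer> results = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            results.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return results;
    }

    public static void main(String[] args) {
        // prints {cat=1, dog=1, down=1, sat=2, the=2}
        System.out.println(run(List.of("the cat sat", "the dog sat down")));
    }
}
```

Because each "map" call sees one record and each "reduce" call sees one key, neither function needs to know how many machines are involved, which is what lets a framework distribute the calls.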
MapReduce: Distributed execution
Input File Partition (×3) → Map task (×3) → Reduce task (×2) → Output File (×2)
Multiple "map", "reduce" calls per task
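The task structure in the diagram can be imitated on one machine. The sketch below is an assumption-laden illustration, not how Hadoop schedules work: parallel streams stand in for cluster nodes, lists of strings stand in for input file partitions, and all names (`DistributedSketch`, `mapTask`, `reduceTask`) are hypothetical. Counts are pre-summed inside each map task only to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical single-machine imitation of map tasks and reduce tasks.
public class DistributedSketch {
    // One map task: many "map" calls, one per record of its input partition.
    // Output is split into one bucket per reduce task, chosen by key hash.
    static List<Map<String, Integer>> mapTask(List<String> partition, int numReduceTasks) {
        List<Map<String, Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            buckets.add(new HashMap<>());
        }
        for (String record : partition) {
            for (String word : record.toLowerCase().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                int r = Math.floorMod(word.hashCode(), numReduceTasks);
                buckets.get(r).merge(word, 1, Integer::sum);
            }
        }
        return buckets;
    }

    // One reduce task: many "reduce" calls, one per key routed to this task.
    static Map<String, Integer> reduceTask(List<Map<String, Integer>> bucketsForTask) {
        Map<String, Integer> outputFile = new TreeMap<>();
        for (Map<String, Integer> bucket : bucketsForTask) {
            bucket.forEach((word, n) -> outputFile.merge(word, n, Integer::sum));
        }
        return outputFile;
    }

    static List<Map<String, Integer>> run(List<List<String>> partitions, int numReduceTasks) {
        // Map phase: tasks are independent, so they may run in parallel.
        List<List<Map<String, Integer>>> mapOutputs = partitions.parallelStream()
                .map(p -> mapTask(p, numReduceTasks))
                .collect(Collectors.toList());
        // Reduce phase: task r reads bucket r from every map task's output.
        return IntStream.range(0, numReduceTasks).parallel()
                .mapToObj(r -> reduceTask(mapOutputs.stream()
                        .map(out -> out.get(r))
                        .collect(Collectors.toList())))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<String>> partitions = List.of(
                List.of("the cat sat"), List.of("the dog sat"), List.of("the end"));
        // Prints one map per simulated output file.
        System.out.println(run(partitions, 2));
    }
}
```

Routing by key hash guarantees that every value for a given key reaches the same reduce task, so each key appears in exactly one output file.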
Assignment goals
• Measure performance
  o Observe parallel speedup
• Non-trivial use of MapReduce
  o Multiple stages: output of one MapReduce program used as input to another
• Off-the-shelf tools
  o Hadoop (standard industry platform, open source)
Why we used cloud computing
• Datacenter-like resources to hundreds of students
  o Performance isolation
  o Complement teaching about datacenter architecture
• Maximum actual usage of >2400 cores
  o Larger than our instructional clusters
  o Interference with other instructional users
Usage over time
[Figure: usage over time; annotations: Lab, Project deadline]
Assignment (Spring)
• Two-stage: co-occurrence (“How associated is a target word with other words?”) + sorting (top-K)
• Java: native Hadoop API language
• Dataset of Usenet posts: 8.4GB (compressed size)
inst.eecs.berkeley.edu/~cs61c/sp11/
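A two-stage job of this shape might look roughly like the following plain-Java sketch. The slides do not give the actual association measure, so this assumes the simplest stand-in: counting how many posts that contain the target word also contain each other word. The names `TwoStageSketch`, `coOccurrence`, and `topK` are hypothetical, and the target word is assumed to be lowercase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a two-stage pipeline; the scoring rule is an assumption.
public class TwoStageSketch {
    // Stage 1: for each post containing the target, count each other distinct word once.
    static Map<String, Integer> coOccurrence(List<String> posts, String target) {
        Map<String, Integer> counts = new HashMap<>();
        for (String post : posts) {
            Set<String> words = new HashSet<>(Arrays.asList(post.toLowerCase().split("\\s+")));
            words.remove("");
            if (!words.contains(target)) {
                continue;
            }
            for (String word : words) {
                if (!word.equals(target)) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Stage 2: takes stage 1's output as its input and keeps the K largest counts.
    // Ties are broken alphabetically so the result is deterministic.
    static List<Map.Entry<String, Integer>> topK(Map<String, Integer> counts, int k) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return new ArrayList<>(entries.subList(0, Math.min(k, entries.size())));
    }

    public static void main(String[] args) {
        List<String> posts = List.of("cat dog", "cat dog fish", "bird fish", "cat bird");
        // prints [dog=2, bird=1]
        System.out.println(topK(coOccurrence(posts, "cat"), 2));
    }
}
```

In a real deployment each stage would be its own MapReduce program, with the first stage's output files serving as the second stage's input.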
Assignment structure (Spring)
1. Laboratory 1: MapReduce programming
  o Against native Hadoop API
  o Running on lab machines only (not parallel)
  o Trivial MR tasks (fit in lab time)
2. Laboratory 2: Measuring MR at scale
  o Timing, calculations for existing MR programs
  o Some design exercises; no new coding
3. Project Part 1: implement, run locally (smaller datasets)
4. Project Part 2: time, get working at scale
What students achieved
[Figure legend: = linear speedup]
Debugging difficulties
• First time efficiency mattered for many students
• Long runtime + remote execution → longer debugging cycle
  o Real-world problem
Efficiency
Most students on par with reference solution
~10 minutes — time on input big enough for MapReduce to make sense
Hadoop not well-tuned for small inputs
on 40 cores
Efficiency
But some students observed very bad performance
Waiting 40+ minutes for results which should take 10 minutes
on 40 cores
Things we learned about our students' Java
Integer numSeen;
for (...) {
    ...
    numSeen += 1;
}

for (each word in bigString) {
    ...
    if (bigString.contains(targetWord)) { ... }
}
// and more...
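The slide does not spell out what is wrong with these fragments. A plausible reading is that the boxed `Integer` counter creates avoidable boxing work on every increment, and that `bigString.contains(targetWord)` rescans the whole string on every loop iteration even though its answer never changes. Under that reading, a cheaper version might look like the sketch below; the class and method names are hypothetical.

```java
// Hypothetical rewrite of the two student patterns shown on the slide.
public class StudentJavaFixes {
    // Counts the words in bigString, but only if targetWord occurs in it.
    static int countWordsIfTargetPresent(String bigString, String targetWord) {
        // Hoisted out of the loop: one scan of the string instead of one per word.
        boolean hasTarget = bigString.contains(targetWord);

        // Primitive int instead of Integer: no unboxing and reboxing per increment.
        int numSeen = 0;
        for (String word : bigString.split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            if (hasTarget) {
                numSeen += 1;
            }
        }
        return numSeen;
    }

    public static void main(String[] args) {
        System.out.println(countWordsIfTargetPresent("a b c", "b")); // prints 3
        System.out.println(countWordsIfTargetPresent("a b c", "z")); // prints 0
    }
}
```

On a small input the difference is invisible; on a multi-gigabyte dataset, per-record costs like these are multiplied across every record.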
Using a public cloud provider
• Grant from Amazon ($100 credit/student)
• We wanted:
  o More capacity than we could provision internally
  o Students use cloud provider like commercial user
Using a public cloud provider
"Backup" billing even with grant
What it cost (in grant credits)
Outliers: usually misunderstood tools; tried restarting repeatedly after problems
Most student costs reasonable
Each used a "dedicated" cluster of around 80 cores.
Student satisfaction
• When surveyed, students ranked this project first among the three software projects
  o Most students (90% of responders) recommended keeping the project in later semesters
• Students reported that this project impressed potential employers
Conclusion/Lessons Learned
• Students wrote a parallel program and ran it against a large data set
  o Almost all students ran programs on large datasets and observed parallel speedups
  o Early experience for sophomores debugging, deploying programs with large datasets
• First time that students wrote programs with long enough run-time to measure efficiency
• Public clouds allowed us to demonstrate scale with low per-student costs
Other CC uses: long-running servers
• Long-running servers per student or group
• Web/service classes
No elasticity, low resource usage: cost-effective?
Other CC uses: VM per student
• Consistent infrastructure for development
• Way to hand out/in assignments
• With or without a “cloud” to host the VMs
Other CC uses: static clusters
• Customized machines for a particular course
• Sometimes done without cost benefit: cluster kept up for entire semester
Backup Slides
Scripts
• https://github.com/woggling/ec2-wrappers
• Danger! Pre-alpha software!
  – Depends on Berkeley infrastructure in several places
  – Could spend real money; do not use without understanding
  – Requires some manual monitoring
  – Documentation is probably incomplete
Using a public cloud provider
56%
44%