Pre-Workshop Deck
Transcript of Pre-Workshop Deck
Map Reduce Workshop
Monday, November 12th, 2012
5:30-6:00 Networking, Software Install, Cloud Setup
6:00-6:10 M/R and Workshop Overview - John Verostek
6:10-7:20 Map/Reduce Tutorial - Vipin Sachdeva (IBM Research Labs)
The Map/Reduce programming framework will be introduced with a hands-on word count example in Python. Next, the basics of Hadoop Map/Reduce and the Hadoop file system will be covered. Time permitting, a demo will be given of running the Python M/R program on a local Hadoop installation.
7:20-7:30 Short Break
7:30-8:45 Applications using Amazon Elastic M/R - J Singh (DataThinks.org)
A Facebook application will also be walked through. For this dataset, everyone who attends the workshop will have the option to sign into a workshop prep page with their Facebook account and give permission to share their likes. The data is automatically anonymized and sent to an Amazon S3 file. The exercise will find likes common to people in the sample. What might someone do after the analysis of such data? Design an advertising campaign, perhaps (but designing an ad campaign is not part of the workshop).
Overview: Schedule
Pre-Workshop Checklist
• Cloud: Amazon Web Services to be used
• Facebook Likes Exercise:
– App used to anonymously collect “Likes”
– Dataset to be used for M/R exercise
• Software Installation:
– Python will be used to run programs locally
– Download 2.7.3, set environment variables
• Code and Datasets: Included are links to files located on Amazon
• Running Python Locally:
– Various screenshots to try out a regular (non-Map/Reduce) wordcount program
Please sign-up for an account here:
http://aws.amazon.com/
Amazon’s Elastic Map/Reduce will be used. The following 6-minute AWS video shows a wordcount example somewhat similar to what will be used in the workshop. There is enough info in this video to run an M/R job within 15 minutes.
http://www.youtube.com/watch?v=kNsS9aDf6uE
Cloud Account
We will be using Elastic MapReduce and S3
Python Scripts and Wikipedia Datasets
Dataset Size URL
Very Small 2 KB https://s3.amazonaws.com/workshop-verysmall/input-very-small.txt
Small 10 MB https://s3.amazonaws.com/workshop-small/input-small.txt
Medium 76 MB https://s3.amazonaws.com/workshop-medium/input-medium.txt
Large 1 GB https://s3.amazonaws.com/workshop-large/input-large.txt
Very Large 8 GB https://s3.amazonaws.com/workshop-verylarge/input-very-large.txt
These five files of different sizes were created by Vipin to test how long each one takes to run locally. Please note the 8 GB file may take a while to download.
What URL
Word Counter (Non-Map/Reduce) https://s3.amazonaws.com/python-code/seq.py
Word Counter Mapper https://s3.amazonaws.com/python-code/mapper.py
Sorter for Windows machines https://s3.amazonaws.com/python-code/sorter.py
Word Counter Reducer https://s3.amazonaws.com/python-code/reducer.py
Word Counter Reducer https://s3.amazonaws.com/python-code/reducer-all.py
sorter.py is for folks with Windows machines.
After loading a link in the browser and the text appears, use right-click, “Save As” to download the file.
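The mapper and reducer scripts themselves are not reproduced in this deck, but a minimal Hadoop-Streaming-style word count looks roughly like the sketch below (the workshop’s actual mapper.py and reducer.py may differ in details). The mapper emits one tab-separated `word<TAB>1` pair per token; the pairs must then be sorted by key before the reducer sums consecutive identical keys, which is why a sorter sits between the two stages.

```python
#!/usr/bin/env python
# Sketch of a Hadoop-Streaming-style word-count pair; the workshop's
# actual mapper.py and reducer.py may differ in details.
# Streaming-style pipeline:
#   python mapper.py < input.txt | sort | python reducer.py
import re

def map_lines(lines):
    """Mapper: emit one tab-separated 'word<TAB>1' pair per token."""
    for line in lines:
        for word in re.split(r"\W+", line):
            if word:  # skip empty tokens from leading/trailing punctuation
                yield "%s\t1" % word

def reduce_pairs(sorted_pairs):
    """Reducer: sum counts of consecutive identical keys (input pre-sorted)."""
    current, total = None, 0
    for pair in sorted_pairs:
        word, count = pair.split("\t")
        if word == current:
            total += int(count)
        else:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = word, int(count)
    if current is not None:
        yield "%s\t%d" % (current, total)
```

Hadoop Streaming runs the same two stages, but distributes the mapper over file chunks and routes each key to a reducer after its own sort/shuffle phase.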
Facebook Likes Exercise
• The App that collects Facebook data is:
• http://apps.facebook.com/map-reduce-workshop
Please sign in to the workshop prep page with your Facebook account and give permission to share your likes. If you don't have a Facebook account, then no worries. If everyone opts out, though, we won't have much data to work with. All collected data will be anonymized and then deleted after the workshop is done.
The exercise will find likes common to people in the sample. What might someone do after the analysis of such data? Design an advertising campaign, perhaps (though designing an ad campaign is not part of the workshop).
You should be able to just copy and paste the URL into a browser where you are signed in to Facebook. Then something like this should appear after it has pulled over the Likes.
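To make the goal of the exercise concrete, here is a small in-memory sketch of “find likes common to people in the sample.” The user IDs and likes below are made up, and the workshop does this with Map/Reduce over the anonymized S3 data rather than with a single-machine script like this:

```python
from collections import Counter

def common_likes(likes_by_person, min_people=2):
    """Return the set of likes shared by at least `min_people` people."""
    counts = Counter()
    for likes in likes_by_person.values():
        # count each person at most once per like
        counts.update(set(likes))
    return {like for like, n in counts.items() if n >= min_people}

# Hypothetical anonymized sample
sample = {
    "user_001": ["Hadoop", "Python", "Chess"],
    "user_002": ["Python", "Cycling"],
    "user_003": ["Hadoop", "Python"],
}
print(common_likes(sample))  # "Python" is shared by all three, "Hadoop" by two
```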
Python Download
• Mac/Linux: Python comes preinstalled (you should be able to run it as-is).
• Windows : if you do not already have Python installed, then use the following website to download and install:
– http://www.python.org/download/
– DOWNLOAD VERSION 2.7.3
– DO NOT DOWNLOAD VERSION 3.3.0
Python Installation
http://pythoncentral.org/how-to-install-python-2-7-on-windows-7-python-is-not-recognized-as-an-internal-or-external-command/
Python Environment Variables
There are many online instructions that explain this, such as the pythoncentral link above.
• Get to the directory where your code/data is located
• cd to change into a directory
• cd .. to go up a level
• e.g. cd users\john\desktop\python
We will be using the Command Line
Wordcount - Very Simple Python Script – seq.py

#!/usr/bin/env python
import sys
import re
counts = {}
for line in sys.stdin:
    words = re.split("\W+", line)
    for word in words:
        counts[word] = counts.get(word, 0) + 1
for x, y in counts.items():
    print x, y
Let’s try running it!!
Import system functions and regular expressions library
Read from stdin
Split on runs of non-word characters (\W+)
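The `re.split("\W+", line)` call in seq.py splits each line on runs of non-word characters: spaces, punctuation, and the trailing newline alike, not just spaces. A quick check:

```python
import re

line = "Hello, world -- hello!\n"
print(re.split(r"\W+", line))
# → ['Hello', 'world', 'hello', '']
```

Note the empty trailing token produced when a line ends in punctuation or a newline; it is why seq.py's counts include an empty-string "word".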
> python ./seq.py < input.txt > output.txt
Python Code: seq.py
Wikipedia dataset filename
Output file – can be any name
Make sure the < and > point in the right directions, or you could overwrite your input file.
> python ./seq.py < input.txt
CTRL-C to break
No Output File so goes to screen
Mac OS/Linux/Cygwin > time python ./seq.py < input.txt
Time
Category Size Time
Very small 2 KB
Small 10 MB
Medium 76 MB
Large 1 GB
Very large 8 GB My Windows laptop took 4 hours*
Windows
If you run the datasets locally and time how long they take (with output to a file, not the screen), then please email me the results (including laptop info, etc.), as we are making a composite slide of “time vs. file size.”
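The Unix `time` command shown above isn't built into the Windows command prompt. One workaround is a small Python wrapper like the sketch below (the script name and usage are assumptions, not workshop files); it redirects stdin/stdout to files and reports elapsed wall-clock seconds:

```python
# timer.py - a sketch of a cross-platform timing wrapper (not a workshop file).
# Assumed usage:  python timer.py input.txt output.txt
import subprocess
import sys
import time

def timed_run(cmd, in_path, out_path):
    """Run `cmd` with stdin read from in_path and stdout written to
    out_path; return elapsed wall-clock seconds."""
    start = time.time()
    with open(in_path, "rb") as fin, open(out_path, "wb") as fout:
        subprocess.check_call(cmd, stdin=fin, stdout=fout)
    return time.time() - start

if __name__ == "__main__" and len(sys.argv) == 3:
    elapsed = timed_run([sys.executable, "seq.py"], sys.argv[1], sys.argv[2])
    print("%.1f seconds" % elapsed)
```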
* I noted the start time, 11:46 PM, then ran it overnight and went with the timestamp on the output file: 03:54 AM.