Map Reduce: An Example (James Grant at Big Data Brighton)
Map Reduce: An Example
Who am I?
My name is James Grant ([email protected]).
I'm a developer here at Brandwatch.
For the last three years I've been a Data Engineer at Last.fm and the maintainer of their Hadoop Cluster.
Coming up…
● What happens during MapReduce?
● Plays and Reach from music listening data
● The Mapper pseudo code
● The Reducer pseudo code
● The result
● What if…?
What happens during MapReduce?
[Diagram: Input Data → Data Fragments → Mapper → Map Output → Sort → Reducer Input → Reducer → Reduce Output]
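The flow on this slide can be sketched in a single process. This is only a rough illustration, not how Hadoop implements it: the function name run_mapreduce, the word-count mapper and reducer, and the sample fragments are all hypothetical, and a real cluster runs the phases in parallel across machines.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(fragments, mapper, reducer):
    """Simulate one MapReduce job in a single process (illustrative only)."""
    # Map phase: every record of every fragment goes through the mapper,
    # which may emit any number of (key, value) pairs.
    map_output = []
    for fragment in fragments:
        for record in fragment:
            map_output.extend(mapper(record))

    # Sort phase: order the pairs by key so equal keys become adjacent.
    map_output.sort(key=itemgetter(0))

    # Reduce phase: each key is passed to the reducer once,
    # together with an iterator over all of its values.
    reduce_output = []
    for key, pairs in groupby(map_output, key=itemgetter(0)):
        values = (value for _, value in pairs)
        reduce_output.extend(reducer(key, values))
    return reduce_output


def word_mapper(line):
    # Hypothetical mapper: emit (word, 1) for every word in the line.
    for word in line.split():
        yield (word, 1)


def count_reducer(word, counts):
    # Hypothetical reducer: add up the 1s emitted for this word.
    yield (word, sum(counts))


# Three hypothetical data fragments, each holding lines of text.
fragments = [["a b a"], ["b c"], ["a"]]
result = run_mapreduce(fragments, word_mapper, count_reducer)
print(result)  # [('a', 3), ('b', 2), ('c', 1)]
```

The sort is what guarantees that a reducer sees every value for a key in one call.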
Plays and Reach from music listening data
● Plays - The number of times a song has been played
● Reach - The number of unique listeners to a song
● Similar to hits and uniques for web properties
● Input data has columns for user id and song id (amongst others)
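To make the two measures concrete, here is a small in-memory calculation, with no MapReduce involved. The events list is invented sample data, and the variable names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical (user id, song id) listening events.
events = [(1, 10), (2, 10), (1, 10), (3, 20), (1, 20), (3, 20), (3, 20)]

plays = defaultdict(int)      # song id -> number of plays
listeners = defaultdict(set)  # song id -> set of distinct user ids

for user, song in events:
    plays[song] += 1
    listeners[song].add(user)

# Reach is the size of each song's listener set.
reach = {song: len(users) for song, users in listeners.items()}

print(dict(plays))  # {10: 3, 20: 4}
print(reach)        # {10: 2, 20: 2}
```

Song 20 is played four times but only by two people, so its plays and reach differ in the same way hits and uniques do.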
The Mapper
function map(Integer user, Integer song):
    emit(song, user);
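One way the mapper might look in Python, as a sketch only. It assumes tab-separated input with the user id in the first column and the song id in the second; the function name map_line and the extra date column in the sample lines are hypothetical.

```python
def map_line(line):
    """Emit one (song, user) pair for a tab-separated input line."""
    fields = line.rstrip("\n").split("\t")
    # Assumed column order: user id first, song id second.
    # Any further columns are ignored.
    user, song = int(fields[0]), int(fields[1])
    yield (song, user)


# Hypothetical input lines with one extra column.
lines = [
    "1\t10\t2012-01-01\n",
    "2\t10\t2012-01-02\n",
    "1\t20\t2012-01-03\n",
]
pairs = [pair for line in lines for pair in map_line(line)]
print(pairs)  # [(10, 1), (10, 2), (20, 1)]
```

Keying on the song is what brings all the listeners of one song together at the reducer.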
The Reducer
function reduce(Integer song, Iterator users):
    Integer plays = 0;
    Set uniqueUsers = [];
    foreach user in users:
        increment plays;
        if user not within uniqueUsers:
            uniqueUsers.add(user);
    result.plays = plays;
    result.reach = uniqueUsers.cardinality();
    emit(song, result);
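The reducer pseudo code translates fairly directly to Python. This is a sketch under assumptions: the function name reduce_song and the dictionary used for the result are illustrative choices, not part of the original talk.

```python
def reduce_song(song, users):
    """Count plays and unique listeners for one song."""
    plays = 0
    unique_users = set()
    for user in users:
        plays += 1
        # Adding to a set ignores duplicates, so no membership test is needed.
        unique_users.add(user)
    yield (song, {"plays": plays, "reach": len(unique_users)})


# Song 10 was played three times, twice by user 1 and once by user 2.
output = list(reduce_song(10, iter([1, 2, 1])))
print(output)  # [(10, {'plays': 3, 'reach': 2})]
```

Note that the set holds every distinct listener of a song in memory at once, which is fine for a sketch but worth keeping in mind for very popular songs.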
What if…?
You often hear that in nearly all cases you should use a higher-level tool like Pig or Hive to solve problems.
So what does the Pig script look like for this problem?
Using Pig
subs = LOAD 'submissions.tsv' USING PigStorage() AS (user:int, song:int);
songs = GROUP subs BY song;
songs = FOREACH songs GENERATE group AS song, subs.user;
songs = FOREACH songs GENERATE song, COUNT($1.user), COUNT(Distinct($1.user));
STORE songs INTO 'playsreach.tsv';
Questions?