Map Reduce: An Example (James Grant at Big Data Brighton)


Presentation by Brandwatch Developer James Grant at the second Big Data Brighton meetup, hosted by Brandwatch: www.brandwatch.com


Page 1: Map Reduce: An Example (James Grant at Big Data Brighton)

Map Reduce: An Example

Page 2: Map Reduce: An Example (James Grant at Big Data Brighton)

Who am I?

My name is James Grant.

I'm a developer here at Brandwatch.

For the last three years I've been a Data Engineer at Last.fm and the maintainer of their Hadoop Cluster.

Page 3: Map Reduce: An Example (James Grant at Big Data Brighton)

Coming up…

● What happens during MapReduce?

● Plays and Reach from music listening data

● The Mapper pseudo code

● The Reducer pseudo code

● The result

● What if…?

Page 4: Map Reduce: An Example (James Grant at Big Data Brighton)

What happens during MapReduce?

[Diagram: Input Data is split into Data Fragments. Each Data Fragment is passed to a Mapper, which produces Map Output. The Map Output is sorted to form the Reducer Input, and each Reducer writes its Reduce Output.]
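The flow on this slide can be imitated in a single process. The sketch below is only an illustration, not how Hadoop is implemented: `run_mapreduce`, `count_mapper`, `sum_reducer`, and the `fragment_size` parameter are hypothetical names, and the toy job simply counts how often each record appears.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, mapper, reducer, fragment_size=2):
    """Simulate the map, sort, and reduce phases in a single process."""
    # Split the input data into fragments, as the framework would.
    fragments = [records[i:i + fragment_size]
                 for i in range(0, len(records), fragment_size)]

    # Map phase: every record of every fragment goes through the mapper.
    map_output = []
    for fragment in fragments:
        for record in fragment:
            map_output.extend(mapper(record))

    # Sort phase: order the (key, value) pairs by key so equal keys are adjacent.
    map_output.sort(key=itemgetter(0))

    # Reduce phase: one reducer call per distinct key.
    reduce_output = []
    for key, pairs in groupby(map_output, key=itemgetter(0)):
        values = [value for _, value in pairs]
        reduce_output.append(reducer(key, values))
    return reduce_output


def count_mapper(record):
    # Hypothetical mapper: emit the record as the key with a count of 1.
    yield (record, 1)


def sum_reducer(key, values):
    # Hypothetical reducer: add up the counts for one key.
    return (key, sum(values))


results = run_mapreduce(["b", "a", "b", "c", "b"], count_mapper, sum_reducer)
print(results)
```

On a real cluster the fragments are mapped on different machines and the sort also moves data between them; here everything happens in one list.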

Page 5: Map Reduce: An Example (James Grant at Big Data Brighton)

Plays and Reach from music listening data

● Plays - The number of times that song has been played

● Reach - The number of unique listeners to that song

● Similar to hits and uniques for web properties

● Input data has columns for user id and song id (amongst others)
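To make the two definitions concrete, here is a small sketch that computes plays and reach directly, without MapReduce. The `listens` sample is invented for illustration and keeps only the user id and song id columns.

```python
from collections import Counter, defaultdict

# Hypothetical sample of (user id, song id) pairs; real input has more columns.
listens = [(1, 10), (2, 10), (1, 10), (3, 20), (1, 20), (3, 20), (3, 20)]

# Plays: how many times each song appears in the data.
plays = Counter(song for _, song in listens)

# Reach: how many distinct users listened to each song.
listeners = defaultdict(set)
for user, song in listens:
    listeners[song].add(user)
reach = {song: len(users) for song, users in listeners.items()}

print(dict(plays))
print(reach)
```

In this sample, user 1 played song 10 twice, which counts as two plays but adds only one listener to its reach.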

Page 6: Map Reduce: An Example (James Grant at Big Data Brighton)

The Mapper

function map(Integer user, Integer song):
    emit(song, user);
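The mapper pseudocode could look roughly like this in Python. This is a sketch under assumptions: the input is taken to be tab-separated text with the user id in the first column and the song id in the second, and `map_line` is a hypothetical name. The talk does not specify the actual file layout.

```python
def map_line(line):
    """Parse one input line and emit (song, user), mirroring emit(song, user)."""
    fields = line.rstrip("\n").split("\t")
    # Assumption: user id is the first column and song id the second;
    # any further columns are ignored.
    user, song = int(fields[0]), int(fields[1])
    yield (song, user)


# Hypothetical sample lines with one extra, unused column.
sample_lines = ["1\t10\tx\n", "2\t10\tx\n", "1\t20\tx\n"]
map_output = [pair for line in sample_lines for pair in map_line(line)]
print(map_output)
```

Emitting the song as the key is what makes the framework bring all users of one song to the same reducer call.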

Page 7: Map Reduce: An Example (James Grant at Big Data Brighton)

The Reducer

function reduce(Integer song, Iterator users):
    Integer plays = 0;
    Set uniqueUsers = [];

    foreach user in users:
        increment plays;
        if user not within uniqueUsers:
            uniqueUsers.add(user);

    result.plays = plays;
    result.reach = uniqueUsers.cardinality();
    emit(song, result);
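A runnable Python version of the reducer pseudocode might look like the following. It is a sketch rather than the code used in the talk: `reduce_song` is a hypothetical name, and the result is returned as a dictionary instead of being emitted to a framework.

```python
def reduce_song(song, users):
    """Count plays and unique listeners for one song."""
    plays = 0
    unique_users = set()
    for user in users:
        plays += 1
        unique_users.add(user)  # a set ignores users it already holds
    return (song, {"plays": plays, "reach": len(unique_users)})


# Song 10 was played by users 1, 2, and 1 again.
result = reduce_song(10, iter([1, 2, 1]))
print(result)
```

Note that the set holds every unique listener of the song in memory at once, so its size grows with the song's reach.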

Page 8: Map Reduce: An Example (James Grant at Big Data Brighton)

What if…?

You often hear that in nearly all cases you should use a higher-level tool like Pig or Hive to solve problems.

So what does the Pig script look like for this problem?

Page 9: Map Reduce: An Example (James Grant at Big Data Brighton)

Using Pig

subs = LOAD 'submissions.tsv' USING PigStorage() AS (user:int, song:int);
songs = GROUP subs BY song;
songs = FOREACH songs GENERATE group AS song, subs.user;
songs = FOREACH songs GENERATE song, COUNT($1.user), COUNT(Distinct($1.user));
STORE songs INTO 'playsreach.tsv';

Page 10: Map Reduce: An Example (James Grant at Big Data Brighton)

Questions?