Common MapReduce Patterns


Page 1: Common MapReduce Patterns

Common MapReduce Patterns

Chris K Wensel

BuzzWords 2011

Monday, June 6, 2011

Page 2: Common MapReduce Patterns

Copyright Concurrent, Inc. 2011. All rights reserved.

Engineer, Not Academic

• Concurrent, Inc., Founder

• Cascading support and tools

• http://concurrentinc.com/

• Cascading, Lead Developer (started Sept 2007)

• An alternative API to MapReduce

• http://cascading.org/

• Formerly Hadoop mentoring and training

• Sun - Apple - HP - LexisNexis - startups - etc

• Formerly Systems Architect & Consultant

• Thomson/Reuters - TeleAtlas - startups - etc

Page 3: Common MapReduce Patterns

Overview

• MapReduce

• Heavy Lifting

• Analytics

• Optimizations

Page 4: Common MapReduce Patterns

MapReduce

• A “divide and conquer” strategy for parallelizing workloads against collections of data

• Map & Reduce are two user-defined functions chained via key-value pairs

• It’s really Map -> Group -> Reduce, where Group is built in

Page 5: Common MapReduce Patterns

Keys and Values

• Map translates input keys and values to new keys and values

• System Groups each unique key with all its values

• Reduce translates the values of each unique key to new keys and values

Map: [K1,V1] -> [K2,V2]*

Group: [K2,V2] -> [K2,{V2,V2,....}]

Reduce: [K2,{V2,V2,....}] -> [K3,V3]*

* = zero or more
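The three signatures above can be sketched as a tiny in-memory engine. This is only an illustration of the data flow, not how Hadoop implements it; `run_job` and the sample mapper and reducer are hypothetical names invented for this sketch.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Run one in-memory Map -> Group -> Reduce pass over (key, value) pairs."""
    # Map: each [K1,V1] yields zero or more [K2,V2]
    mapped = [pair for k1, v1 in records for pair in mapper(k1, v1)]
    # Group: collect every V2 under its unique K2
    groups = defaultdict(list)
    for k2, v2 in mapped:
        groups[k2].append(v2)
    # Reduce: each [K2,{V2,...}] yields zero or more [K3,V3]
    return [pair for k2 in sorted(groups) for pair in reducer(k2, groups[k2])]

# Hypothetical example: longest line length per first character.
def first_char_mapper(offset, line):
    if line:  # an empty line emits nothing ("zero or more")
        yield line[0], len(line)

def max_reducer(key, values):
    yield key, max(values)

result = run_job(
    [(0, "apple"), (6, "avocado"), (14, "bean"), (19, "")],
    first_char_mapper,
    max_reducer,
)
print(result)
```

Note that the user only supplies `first_char_mapper` and `max_reducer`; the grouping step belongs to the system.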

Page 6: Common MapReduce Patterns

Word Count

Mapper

[0, "when in the course of human events"] -> Map -> ["when",1] ["in",1] ["the",1] [...,1]

Group

["when",1] ["when",1] ["when",1] ["when",1] ["when",1] -> Group -> ["when",{1,1,1,1,1}]

Reducer

["when",{1,1,1,1,1}] -> Reduce -> ["when",5]
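A minimal word count sketch in Python, assuming whitespace tokenization and grouping by sorting on the key (which loosely mirrors the sort-based grouping of the shuffle). The function names and the second input line are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter

def word_count_mapper(offset, line):
    # Emit [word, 1] for every word on the line
    for word in line.split():
        yield word, 1

def word_count_reducer(word, counts):
    # Sum the 1s collected for this word
    yield word, sum(counts)

def word_count(lines):
    # Map
    mapped = [kv for offset, line in lines for kv in word_count_mapper(offset, line)]
    # Group: sort by key, then gather each key's values
    mapped.sort(key=itemgetter(0))
    out = []
    for word, pairs in groupby(mapped, key=itemgetter(0)):
        # Reduce
        out.extend(word_count_reducer(word, [v for _, v in pairs]))
    return dict(out)

counts = word_count([
    (0, "when in the course of human events"),
    (35, "when in doubt"),  # hypothetical second record
])
print(counts["when"], counts["course"])
```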

Page 7: Common MapReduce Patterns

Divide and Conquer Parallelism

• The ‘records’ entering the Map and the ‘groups’ entering the Reduce are independent

• That is, there is no expectation of order and no requirement to share state between records/groups

• So arbitrary numbers of Map and Reduce function instances can be created against arbitrary portions of the input data

Page 8: Common MapReduce Patterns

Cluster

• Multiple instances of each Map and Reduce function are distributed throughout the cluster

[Diagram: a Cluster contains Racks, each Rack contains Nodes, and each Node runs several map and reduce instances]

Page 9: Common MapReduce Patterns

Another View

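[Diagram: a file is divided into splits (split1, split2, split3, split4, ...), each read by a Mapper Task as [K1,V1]. Mapper Tasks run Map and emit [K2,V2], with an optional Combine step. Shuffle moves the output to the Reducer Tasks, which Group it into [K2,{V2,...}] and Reduce it to [K3,V3], written as part files (part-00000, part-00001, ..., part-000N) in one directory. Combine and Reduce are marked "same code".]

• Mappers must complete before Reducers can begin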

Page 10: Common MapReduce Patterns

Complex job assemblies

• Real applications are many MapReduce jobs chained together

• Linked by intermediate (usually temporary) files

• Executed in order, by hand, from the ‘client’ application

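[Diagram: two chained jobs. Count Job: File -> Map -> [ k, v ] -> [ k, [v] ] -> Reduce -> [ k, v ] -> File. Sort Job: File -> Map -> [ k, v ] -> [ k, [v] ] -> Reduce -> [ k, v ] -> File. The output File of the Count Job is the input File of the Sort Job.]

[ k, v ] = key and value pair

[ k, [v] ] = key and associated values collection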

Page 11: Common MapReduce Patterns

Real World Apps

1 app, 75 jobs

[Diagram: dependency graph of 75 jobs, labeled [1/75] through [75/75]. Most are map+reduce; jobs 55, 57-62, 71 and 72 are map only.]

Legend: green = map + reduce, purple = map, blue = join/merge, orange = map split

Page 12: Common MapReduce Patterns

Heavy Lifting

• Things we must do because data can be heavy

• These patterns are natural to MapReduce and easy to implement

• But have some room for composition/aggregation within a Map/Reduce (e.g., Filter + Binning)

• (leading us to think of Hadoop as an ETL framework)

• Record Filtering

• Parsing, Conversion

• Counting, Summing

• Unique

• Binning

• Distributed Tasks

Page 13: Common MapReduce Patterns

Record Filtering

• Think unix ‘grep’

• Filtering is discarding unwanted values (or preserving wanted)

• Only uses a Map function, no Reducer
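A grep-like filter can be sketched as a map-only job: the Map function either passes a record through unchanged or emits nothing. The helper names and the sample log lines below are hypothetical.

```python
import re

def grep_mapper(pattern):
    """Build a Map function that keeps only lines matching the pattern."""
    regex = re.compile(pattern)

    def mapper(offset, line):
        if regex.search(line):
            yield offset, line  # wanted record: pass through unchanged
        # unwanted record: emit nothing

    return mapper

def run_map_only(records, mapper):
    # No Group and no Reduce: Map output is the job output
    return [kv for key, value in records for kv in mapper(key, value)]

log = [
    (0, "GET /index.html 200"),
    (20, "GET /missing 404"),
    (37, "POST /login 200"),
]
errors = run_map_only(log, grep_mapper(r" 404$"))
print(errors)
```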

Page 14: Common MapReduce Patterns

Parsing, Conversion

• Think unix ‘sed’

• A Map function that takes an input key and/or value and translates it into a new format

• Examples:

• raw logs to delimited text or archival efficient binary

• entity extraction

Page 15: Common MapReduce Patterns

Counting, Summing

• The same as SQL aggregation functions

• Simply applying some function to the values collection seen in Reduce

• Other examples:

• average, max, min, unique

Page 16: Common MapReduce Patterns

Merging

• Where many files of the same type are converted to one output path

• Map side merges

• One directory with as many part files as Mappers

• Reduce side merges

• Allows for removing duplicates or deleted items

• One directory with as many part files as Reducers

• Examples

• Nutch

• Normalizing log files (apache, log4j, etc)

Page 17: Common MapReduce Patterns

Binning

• Where the values associated w/ unique keys are persisted together

• Typically a directory path based on key’s value

• Must be conscious of total open files, remember no appends

• Examples:

• web log files by year/month/day

• trade data by symbol
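A sketch of binning web log records by year/month/day. The `logs/` path prefix, the timestamp format, and the function names are assumptions for illustration. Grouping on the bin path first means each bin can be written in a single pass, which sidesteps the lack of appends.

```python
from collections import defaultdict
from datetime import datetime

def bin_path(timestamp):
    """Derive a directory path from the key's value (hypothetical layout)."""
    day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return day.strftime("logs/%Y/%m/%d")

def bin_records(records):
    # Group every line under the path it should be persisted to
    bins = defaultdict(list)
    for timestamp, line in records:
        bins[bin_path(timestamp)].append(line)
    return dict(bins)

bins = bin_records([
    ("2011-06-06 09:00:00", "GET /a"),
    ("2011-06-06 09:05:00", "GET /b"),
    ("2011-06-07 10:00:00", "GET /c"),
])
print(sorted(bins))
```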

Page 18: Common MapReduce Patterns

Distributed Tasks

• Simply where a Map or Reduce function executes some ‘task’ based on the input key and value.

• Examples:

• web crawling,

• load testing services,

• rdbms/nosql updates,

• file transfers (S3),

• image to pdf (NYT on EC2)

Page 19: Common MapReduce Patterns

Basic Analytic Patterns

• Some of these patterns are unnatural to MapReduce

• We think in terms of columns/fields, not key value pairs

• (leading us to think of Hadoop as an RDBMS)

• Group By

• Unique

• Secondary Sort

• Secondary Unique

• CoGrouping and Joining

Page 20: Common MapReduce Patterns

Composite Keys/Values

• It is easier to think in columns/fields

• e.g. “firstname” & “lastname”, not “line”

• Whether a set of columns are Keys or Values is arbitrary

• Keys become a means to piggyback the properties of MR and become an impl detail

[K1,V1] -> <A1,B1,C1,...>

Page 21: Common MapReduce Patterns

Group By

• Group By is where Value fields are grouped by Grouping fields

• In the diagram, Map output key is “dept_id” and value is “name”

[Diagram: GroupBy. The names Jim, Mary, Susan, Fred, Wilma, Ernie and Barny (name) are grouped under the dept_id values 1001 and 1002]

Page 22: Common MapReduce Patterns

Group By

• So the K2 key becomes a composite Key of

• key: [grouping], value: [values]

Map

[K1,V1]

[K1,V1] -> <A1,B1,C1,D1>

Mapper

<A2,B2> -> K2, <C2,D2> -> V2

[K2,V2]

Reduce

[K2,{V2,V2,....}]

[K2,V2] -> <A2,B2,{<C2,D2>,...}>

Reducer

<A3,B3> -> K3, <C3,D3> -> V3

[K3,V3]

Legend in the original diagram: Piggyback Code, User Code

Page 23: Common MapReduce Patterns

Unique

• Or Distinct (as in SQL)

• Globally finding all the unique values in a dataset

• Usually finding unique values in a column

• Often used to filter a second dataset using a join

Mapper

[0, "when in the course of human events"] -> Map -> ["when",null] ["in",null] [...,null]

Group

["when",null] ["when",null] ["when",null] ["when",null] -> Group -> ["when",{nulls}]

Reducer

["when",{nulls}] -> Reduce -> ["when",null]
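Unique can be sketched as word count with the values thrown away: the Map emits each word with a null value, and the Reduce emits each key once. The names and the second input line are hypothetical.

```python
def unique_mapper(offset, line):
    # The value carries no information; only the key matters
    for word in line.split():
        yield word, None

def unique_reducer(word, nulls):
    # One output per unique key, however many nulls were grouped
    yield word, None

def unique(lines):
    groups = {}
    for offset, line in lines:
        for word, value in unique_mapper(offset, line):
            groups.setdefault(word, []).append(value)
    return [key for word in sorted(groups)
            for key, _ in unique_reducer(word, groups[word])]

words = unique([
    (0, "when in the course of human events"),
    (35, "when in the course"),  # hypothetical second record
])
print(words)
```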

Page 24: Common MapReduce Patterns

Secondary Sort

• Secondary Sorting is where

• Some Fields are grouped on, and

• Some of the remaining Fields are sorted within their grouping

Date (group), Time (sorted value), Url (remaining value)

08/08/2008, 1:00:00, http://www.example.com/foo

08/08/2008, 1:01:00, http://www.example.com/bar

08/08/2008, 1:01:30, http://www.example.com/baz

Page 25: Common MapReduce Patterns

Secondary Sort

• So the K2 key becomes a composite Key of

• key: [grouping, secondary], value: [remaining values]

• The trick is for the secondary fields to piggyback on the Reduce sort, yet not be compared during the unique key (grouping) comparison

Map

[K1,V1]

[K1,V1] -> <A1,B1,C1,D1>

Mapper

<A2,B2><C2> -> K2, <D2> -> V2

[K2,V2]

Reduce

[K2,{V2,V2,....}]

[K2,V2] -> <A2,B2,{<C2,D2>,...}>

Reducer

<A3,B3> -> K3, <C3,D3> -> V3

[K3,V3]
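A sketch of the composite key trick in plain Python: sort on the whole key (grouping + secondary), but group on the grouping part only. In Hadoop this corresponds to using different comparators for sorting and grouping; here `sort` and `groupby` stand in for them. The extra 08/07/2008 record is hypothetical, and comparing dates and times as strings only works for this sample data.

```python
from itertools import groupby

records = [
    ("08/08/2008", "1:01:30", "http://www.example.com/baz"),
    ("08/08/2008", "1:00:00", "http://www.example.com/foo"),
    ("08/07/2008", "9:30:00", "http://www.example.com/qux"),  # hypothetical
    ("08/08/2008", "1:01:00", "http://www.example.com/bar"),
]

def secondary_sort(rows):
    # Map: composite key (grouping, secondary), value = remaining fields
    mapped = [((date, time), url) for date, time, url in rows]
    # Sort comparison: the whole composite key
    mapped.sort(key=lambda kv: kv[0])
    # Grouping comparison: only the grouping part of the key
    result = {}
    for date, pairs in groupby(mapped, key=lambda kv: kv[0][0]):
        result[date] = [(key[1], url) for key, url in pairs]
    return result

grouped = secondary_sort(records)
for date, visits in grouped.items():
    print(date, visits)
```

The Reducer for 08/08/2008 then sees its values already ordered by time, without sorting them itself.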

Page 26: Common MapReduce Patterns

Secondary Unique

• Secondary Unique is where the values within each grouping are uniqued

• ... in a “scale free” way

• Perform a Secondary Sort...

• Reducer removes duplicates by discarding every value that matches the previous value

• since values are now ordered, no need to maintain a Set of values

Mapper

[0, "when in the course of human events"] -> Map -> [0,"when"] [0,"in"] [0,"the"] [0,...]

Group (assume Secondary Sorting magic happens here)

[0,"when"] [0,"in"] [0,"the"] [0,...] -> Group -> [0,{"in","in","the","when","when",...}]

Reducer

[0,{"in","in","the","when","when",...}] -> Reduce -> ["in",null] ["the",null] ["when",null]
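A sketch of the Reducer side, assuming the values arrive already ordered. Only the previous value is kept in memory, which is what makes the approach scale free. `sorted(...)` below is a stand-in for the Secondary Sort, and the function names are hypothetical.

```python
def dedupe_sorted(values):
    """Yield each distinct value from an already sorted iterable."""
    previous = object()  # sentinel that is unequal to every real value
    for value in values:
        if value != previous:
            yield value
        previous = value

def secondary_unique_reducer(key, sorted_values):
    # No Set of seen values is needed: duplicates are always adjacent
    for value in dedupe_sorted(sorted_values):
        yield value, None

words = "when in the course of human events when in the".split()
out = list(secondary_unique_reducer(0, sorted(words)))
print(out)
```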

Page 27: Common MapReduce Patterns

Joining

• Where two or more input data sets are ‘joined’ by a common key

• Like a SQL join

[Diagram: lhs data (dept_id, name: Jim, Mary, Susan, Fred, Wilma, Ernie, Barny under dept_id 1001 and 1002) joined with rhs data (dept_id, dept_name: Accounting, Shipping), so that each name is paired with its dept_name]

Page 28: Common MapReduce Patterns

Join Definitions

• Consider the input data [key, value]:

• LHS = [0,a] [1,b] [2,c]

• RHS = [0,A] [2,C] [3,D]

• Joins on the key:

• Inner

• [0,a,A] [2,c,C]

• Outer (Left Outer, Right Outer)

• [0,a,A] [1,b,null] [2,c,C] [3,null,D]

• Left (Left Inner, Right Outer)

• [0,a,A] [1,b,null] [2,c,C]

• Right (Left Outer, Right Inner)

• [0,a,A] [2,c,C] [3,null,D]
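The four join types can be checked against the LHS and RHS data above with a short sketch. It assumes each key appears at most once per side (so a dict per side is enough), and `join` with its two flags is a hypothetical helper, not a library API. Python's `None` plays the role of null.

```python
LHS = {0: "a", 1: "b", 2: "c"}
RHS = {0: "A", 2: "C", 3: "D"}

def join(lhs, rhs, keep_lhs_only, keep_rhs_only):
    """Join two key -> value dicts; flags keep rows that lack a match."""
    rows = []
    for key in sorted(lhs.keys() | rhs.keys()):
        in_lhs, in_rhs = key in lhs, key in rhs
        if (in_lhs and in_rhs) or (in_lhs and keep_lhs_only) or (in_rhs and keep_rhs_only):
            rows.append((key, lhs.get(key), rhs.get(key)))
    return rows

inner = join(LHS, RHS, keep_lhs_only=False, keep_rhs_only=False)
outer = join(LHS, RHS, keep_lhs_only=True, keep_rhs_only=True)
left = join(LHS, RHS, keep_lhs_only=True, keep_rhs_only=False)
right = join(LHS, RHS, keep_lhs_only=False, keep_rhs_only=True)
print(inner, outer, left, right, sep="\n")
```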

Page 29: Common MapReduce Patterns

CoGrouping

• Before Joining, CoGrouping must happen

• Simply concurrent GroupBy operations on each input data set

Page 30: Common MapReduce Patterns

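GroupBy vs CoGroup

[Diagram: GroupBy. The names Jim, Mary, Susan, Fred, Wilma, Ernie and Barny (name) are grouped under the dept_id values 1001 and 1002]

[Diagram: CoGroup. For each dept_id (1001, 1002), the names from the lhs data and the dept_name values (Accounting, Shipping) from the rhs data are grouped side by side]

• Independent collections of unordered values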

Page 31: Common MapReduce Patterns

CoGroup Joined

• Considering the previous data, a typical Inner Join

[Diagram: the CoGroup of lhs data (dept_id, name) and rhs data (dept_id, dept_name) is joined per dept_id, so every name (Jim, Mary, Susan, Fred, Wilma, Ernie, Barny) is paired with the dept_name (Accounting or Shipping) of its dept_id (1001 or 1002)]

Page 32: Common MapReduce Patterns

CoGrouping

• Maps must run for each input set in same Job (n, n+1, etc)

• CoGrouping must happen against each common key

Map

[K1,V1] from input [n], [K1',V1'] from input [n+1]

[K1,V1] -> <A1,B1,C1,D1>

[K1',V1'] -> <A1',B1',C1',D1'>

Mapper

<A2,B2> -> K2, [n]<C2,D2> -> V2

[K2,V2]

Reduce

[K2,{V2,V2,....}]

[K2,V2] -> <A2,B2,{<C2,D2,C2',D2'>,...}>

Reducer

<A3,B3> -> K3, <C3,D3> -> V3

[K3,V3]

Page 33: Common MapReduce Patterns

Joining

• The CoGroups must be joined

• Finally the Reduce can be applied

Reduce

[K2,{V2,V2,....}]

<A2,B2,{[n]<C2,D2>,[n+1]..}>

<A2,B2,{<C2,D2>,...},{<C2',D2'>,...}>

Join: {<C2,D2>,...} and {<C2',D2'>,...} -> <C2,D2,C2',D2'>

<A2,B2,{<C2,D2,C2',D2'>,...}>

[K2,V2] -> <A2,B2,{<C2,D2,C2',D2'>,...}>

Reducer

<A3,B3> -> K3, <C3,D3> -> V3

[K3,V3]
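A sketch of a Reduce side inner join: each Map tags its values with the index n of the input they came from, the CoGroup collects one list per input under each key, and the join is the cross product of those lists. The assignment of names to departments, the extra department 1003, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

def tag_mapper(n, records):
    # Tag each value with the index n of the input it came from
    for key, value in records:
        yield key, (n, value)

def cogroup(*inputs):
    # One independent collection of values per input, per key
    groups = defaultdict(lambda: [[] for _ in inputs])
    for n, records in enumerate(inputs):
        for key, (tag, value) in tag_mapper(n, records):
            groups[key][tag].append(value)
    return groups

def inner_join(lhs, rhs):
    joined = []
    for key, (left_values, right_values) in sorted(cogroup(lhs, rhs).items()):
        # The cross product is empty when either side has no values
        for left, right in product(left_values, right_values):
            joined.append((key, left, right))
    return joined

names = [(1001, "Jim"), (1002, "Fred"), (1001, "Mary"), (1003, "Pat")]
depts = [(1001, "Accounting"), (1002, "Shipping")]
rows = inner_join(names, depts)
print(rows)
```

An outer join would differ only in the join step, by substituting a null for an empty side instead of dropping the key.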

Page 34: Common MapReduce Patterns

Optimizations

• Patterns for reducing IO

• Identity Mapper

• Map Side Join

• Combiners

• Partial Aggregates

• Similarity Joins

Page 35: Common MapReduce Patterns

Identity Mapper

• Move Map operations to the previous Reduce

• Replace with an Identity function

• Assumes Map operations reduce the data

[Diagram: a Cascading flow plan running from [head] to [tail]. It reads '/logs/stumbles/short-stumbles-20090504.log' as TextLine ('offset', 'line'), then parses and filters it in the 'import' pipe (RegexParser, ExpressionFilter) into 'day', 'urlid', 'method'. The flow branches into 'organicCount' and 'paidCount' (GroupBy on 'day', 'urlid' with Count) and into 'organicDomainCount' and 'paidDomainCount' (LookupDomainFunction, then GroupBy with Sum). Each organic/paid pair is CoGrouped, passed through ExpressionFunction and Identity steps, and written to an HBaseTap. Intermediate results are stored in TempHfs SequenceFiles. A step in the plan is labeled "identity function".]

Page 36: Common MapReduce Patterns

Map Side Joins

• Bypasses the (immediate) need for a Reducer

• Symmetrical

• Where LHS and RHS are of equivalent size

• Requires data to be sorted on key

• Asymmetrical

• One side is small enough to fit in memory

• Typically a hashtable lookup
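A sketch of the asymmetrical case: the small side is loaded into an in-memory hashtable once, and every record of the large side is joined by lookup inside the Map, with no Reducer. The sample data and names are hypothetical, and how the small side reaches each Mapper (for example, a file shipped with the job) is left out.

```python
def make_hash_join_mapper(small_side):
    """Build a Map function that joins against an in-memory lookup table."""
    lookup = dict(small_side)  # small side loaded into memory once per Mapper

    def mapper(key, value):
        if key in lookup:  # inner join: drop keys with no match
            yield key, (value, lookup[key])

    return mapper

depts = [(1001, "Accounting"), (1002, "Shipping")]           # small side
names = [(1001, "Jim"), (1002, "Fred"), (1003, "Pat")]       # large side

mapper = make_hash_join_mapper(depts)
joined = [kv for key, value in names for kv in mapper(key, value)]
print(joined)
```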

Page 37: Common MapReduce Patterns

Combiners

• Where Reduce runs Map side, and again Reduce side

• Only works if Reduce is commutative and associative

• Reduces bandwidth by trading CPU for IO

• Serialization/deserialization during local sorting before combining

Mapper

[0, "when in the course of human events"] -> Map -> ["when",1] ["in",1] ["the",1] [...,1]

Combiner (same implementation as the Reducer)

["when",1] ["when",1] -> Group -> ["when",{1,1}] -> Reduce -> ["when",2]

Reducer

["when",2] ["when",1] ["when",2] -> Group -> ["when",{2,1,2}] -> Reduce -> ["when",5]
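A sketch that runs the same sum function Map side (as the Combiner) and Reduce side, and counts how many pairs cross the shuffle with and without it. The three input splits are hypothetical and chosen so the Reducer sees the {2,1,2} group from the diagram; `run` is an illustrative helper, not a Hadoop API.

```python
from collections import defaultdict

def mapper(offset, line):
    for word in line.split():
        yield word, 1

def sum_reducer(word, counts):
    # Addition is commutative and associative, so partial sums are safe
    yield word, sum(counts)

def group(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def run(splits, use_combiner):
    shuffled = []
    for split in splits:  # one Mapper per split
        out = [kv for offset, line in split for kv in mapper(offset, line)]
        if use_combiner:  # same function as the Reducer, run Map side
            out = [kv for k, vs in group(out).items() for kv in sum_reducer(k, vs)]
        shuffled.extend(out)
    result = dict(kv for k, vs in group(shuffled).items() for kv in sum_reducer(k, vs))
    return result, len(shuffled)

splits = [[(0, "when when in")], [(0, "when the")], [(0, "when when")]]
plain, plain_pairs = run(splits, use_combiner=False)
combined, combined_pairs = run(splits, use_combiner=True)
print(plain_pairs, combined_pairs, combined)
```

Both runs give the same counts; only the number of shuffled pairs changes.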

Page 38: Common MapReduce Patterns

Partial Aggregates

• Supports any aggregate type, while being composable with other aggregates

• Reduces bandwidth by trading Memory for IO

• Very important for a CPU constrained cluster

• Use a bounded LRU to keep constant memory (requires tuning)

• Provides an opportunity to promote the functionality of the next Map to this Reduce

Mapper

[0, "when in the course of human events"] -> Map -> ["when",1] ["in",1] ["the",1] [...,1]

Partial

["when",1] ["when",1] -> Partial -> ["when",2]

Reducer

["when",2] ["when",1] ["when",2] -> Group -> ["when",{2,1,2}] -> Reduce -> ["when",5]

Page 39: Common MapReduce Patterns

Partial Aggregates

• OK that dupes emit from a Mapper and across Mappers (or prev Reducers!)

• Final aggregation happens in Reducer

• Larger the cache, fewer dupes

partial unique: [a,b,c,a,a,b] -> [a,b,c,a,b]

incoming value -> LRU* -> discarded value

{_,_}

a -> {a,_} -> _

b -> {b,a} -> _

c -> {c,b} -> a

a -> {a,c} -> b

a -> {a,c}

b -> {b,a} -> c

* cache size of 2

Page 40: Common MapReduce Patterns

Tradeoffs

• CPU for IO == fault tolerance

• Memory for IO == performance

Page 41: Common MapReduce Patterns

Similarity Join

• Compare all values LHS to values RHS to find duplicates (or similar values)

• Naive approaches

• Cross Join (all data through one reducer)

• In-common features (very common features will bottleneck)

Page 42: Common MapReduce Patterns

Set-Similarity Joining

• “Efficient Parallel Set-Similarity Joins Using MapReduce” - R Vernica, M Carey, C Li

• Only compare candidate pairs

• Candidates share uncommon features

Page 43: Common MapReduce Patterns

[Diagram: a set-similarity join over four records (1, 2, 3, 4), in six steps]

1: records

2: count tokens

3: order by least frequent, discard common

4: uncommon features in common

5: candidate pairs

6: final compare

• 1 and 3 share uncommon features

• thus are candidates for a full comparison
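The six steps can be sketched in a single process as follows. This is a simplified illustration, not the algorithm from the cited paper: the fixed prefix length, the Jaccard measure for the final compare, and the four sample records are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations

def candidate_pairs(records, prefix_len):
    # 2: count tokens across all records
    counts = Counter(tok for toks in records.values() for tok in set(toks))
    index = defaultdict(set)
    for rid, toks in records.items():
        # 3: order by least frequent, keep only the rarest tokens
        ordered = sorted(set(toks), key=lambda t: (counts[t], t))
        for tok in ordered[:prefix_len]:
            index[tok].add(rid)
    # 4-5: records sharing an uncommon token become candidate pairs
    pairs = set()
    for rids in index.values():
        pairs.update(combinations(sorted(rids), 2))
    return pairs

def jaccard(a, b):
    # 6: final compare on the full token sets
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# 1: records (hypothetical)
records = {
    1: ["the", "quick", "brown", "fox"],
    2: ["the", "lazy", "dog"],
    3: ["the", "quick", "brown", "cat"],
    4: ["the", "old", "dog", "barks"],
}
pairs = candidate_pairs(records, prefix_len=2)
scores = {pair: jaccard(records[pair[0]], records[pair[1]]) for pair in pairs}
print(scores)
```

Only one of the six possible pairs reaches the final compare, because the very common token "the" never makes it into a prefix.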

Page 44: Common MapReduce Patterns


Match two sets using prefix filtering

[Diagram: a chain of Map/Reduce jobs linked by Files: Tokenize, Count Job, Join Tokens/Counts Job, Sort/Prefix Filter Job, Self Join Job, Unique Pairs Job, Join LHS Job, Join RHS / Match Job]

Page 45: Common MapReduce Patterns

Duality

• Note the use of the previous patterns to route data to implement a more efficient algorithm

Page 46: Common MapReduce Patterns

Use a Higher Abstraction

• Command Line

• Multitool - CLI for parallel sed, grep & joins

• API

• Cascading - Java Query API and Planner

• Plume - “approximate clone of FlumeJava”

• Interactive Shell

• Cascalog - Clojure+Cascading query language (API also)

• Pig - A text Syntax

• Hive - Syntax + Infrastructure - SQL “like”

Page 47: Common MapReduce Patterns

References

• Set Similarity

• http://www.slideshare.net/ydn/4-similarity-joinshadoopsummit2010

• http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/

• MapReduce Text Processing

• http://www.umiacs.umd.edu/~jimmylin/book.html

• Plume/FlumeJava

• http://portal.acm.org/citation.cfm?id=1806596.1806638

• http://github.com/tdunning/Plume/wiki

Page 48: Common MapReduce Patterns

I’m Hiring

• Enterprise Java server and web client

• Language design, compilers, and interpreters

• No Hadoop experience required

• More info

• http://www.concurrentinc.com/careers/

Page 49: Common MapReduce Patterns

Resources

• Chris K Wensel

• @cwensel

• Cascading & Cascalog

• http://cascading.org

• @cascading

• Concurrent, Inc.

• http://concurrentinc.com

• @concurrent

Page 50: Common MapReduce Patterns

Appendix

Page 51: Common MapReduce Patterns

Simple Total Sorting

• Where lines in a result file should be sorted

• Must set number of reducers to 1

• Sorting in MR is local per Reduce, not global across Reducers

Page 52: Common MapReduce Patterns

Why Sorting Isn’t “Total”

• Keys emitted from Map are naturally sorted at a given Reducer

• But are Partitioned to Reducers in a random way

• Thus, only one Reducer can be used for a total sort

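[Diagram: five Mappers feed three Reducers. The sorted key runs [aaa,aab,aac] and [zzx,zzy,zzz] are partitioned so that the Reducers receive [aaa,zzx], [aab,zzy] and [aac,zzz]: each Reducer's keys are sorted locally, but no Reducer holds a contiguous range of keys]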

Page 53: Common MapReduce Patterns

Distributed Total Sort

• To work, the Shuffling phase must be modified with:

• Custom Partitioner to partition on the distribution of ordered Keys

• Custom Comparator for comparing Key types

• Strings work by default

Page 54: Common MapReduce Patterns

Distributed Total Sort - Details

• Sample all K2 values and build balanced distribution for num reducers

• Sample all input keys and divide into partitions

• Write out boundaries of partitions

• Supply Partitioner that looks up partition for current K2 value

• Read boundaries into a Trie (pronounced ‘try’) data structure

• Use appropriate Comparator for Key type

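[Diagram: a Trie of partition boundaries. The root branches into a ... z; the second level into ar, ax, za, zo; the third level into ara ... ari, axe ... axi, zag ... zap, zon ... zoo; the leaves include aran, aria, axis and zone]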
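A sketch of the sampling and partitioning steps. For brevity it looks up partitions with a binary search over a sorted boundary list (`bisect`) rather than the Trie described above, and it samples every key instead of a subset; the function names and sample keys are hypothetical.

```python
from bisect import bisect_right

def sample_boundaries(sample_keys, num_reducers):
    """Pick num_reducers - 1 boundary keys that split the sample evenly."""
    ordered = sorted(sample_keys)
    step = len(ordered) / num_reducers
    return [ordered[int(step * i)] for i in range(1, num_reducers)]

def partition(key, boundaries):
    # Partition i holds keys >= boundaries[i-1] and < boundaries[i]
    return bisect_right(boundaries, key)

keys = ["zoo", "aran", "axis", "zone", "aria", "zap"]
boundaries = sample_boundaries(keys, num_reducers=3)

reducers = [[] for _ in range(3)]
for key in keys:
    reducers[partition(key, boundaries)].append(key)

# Each Reducer sorts locally; concatenating the parts in order is a total sort
total_order = [key for part in reducers for key in sorted(part)]
print(boundaries, reducers, total_order, sep="\n")
```

Because every key in partition i sorts before every key in partition i+1, reading the part files in order yields a globally sorted result with more than one Reducer.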
