MapReduce. Source: MapReduce Design Patterns, Miner & Shook, O’Reilly. Total Order Sort Mapper...

MapReduce PATTERNS FOR PROCESS

Agenda

Overview of all the MapReduce Design Patterns

MapReduce Design Patterns Overview

Deep Dive into following Patterns

Filtering Patterns

Join Patterns

Input and Output Patterns

Other Patterns Overview

Summarization Patterns

Data Organization Patterns

MetaPatterns

Comparison chart of when to use which design patterns

Best Practices

MapReduce Patterns


Filtering Patterns


Join Patterns

Meta Patterns

Input & Output Patterns

BIG DATA SERIES, Powered by Prognosive © 2015


Numerical Summarizations

Inverted Indexes

Counters

Numerical Summarizations

Word Count

Record Counts

Min / Max

Average/Median/Std Deviation


Inverted Indexes

source: MapReduce Design Patterns, Miner & Shook, O’Reilly

Counters

Record Count

Unique Instances

Summations

if (StringUtils.startsWithLetter(token)) {
    context.getCounter(WordsNature.STARTS_WITH_LETTER).increment(1);
}
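The snippet above increments a Hadoop counter from inside a mapper. As a minimal stand-alone sketch of the same tallying idea, with no Hadoop on the classpath, the class below keeps one count per enum constant. Only `WordsNature.STARTS_WITH_LETTER` comes from the slide; the `OTHER` constant, the class name `CounterSketch`, and the use of `Character.isLetter` in place of `StringUtils.startsWithLetter` are assumptions made for illustration.

```java
import java.util.EnumMap;
import java.util.Map;

// Stand-in for the slide's counter group; OTHER is a hypothetical extra constant.
enum WordsNature { STARTS_WITH_LETTER, OTHER }

class CounterSketch {
    // Tallies tokens per category, mimicking context.getCounter(...).increment(1).
    static Map<WordsNature, Long> count(String line) {
        Map<WordsNature, Long> counters = new EnumMap<>(WordsNature.class);
        for (String token : line.trim().split("\\s+")) {
            if (token.isEmpty()) continue; // blank input yields one empty token
            WordsNature kind = Character.isLetter(token.charAt(0))
                    ? WordsNature.STARTS_WITH_LETTER : WordsNature.OTHER;
            counters.merge(kind, 1L, Long::sum);
        }
        return counters;
    }

    public static void main(String[] args) {
        System.out.println(count("42 apples and 7 pears"));
    }
}
```

In a real job the framework aggregates the per-task counter values; this sketch only shows the per-record increment.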

Filter Patterns

Filters

Bloom Filters

Top Ten

Distinct


Filters

Narrowing Views

Tracking Event Threads

Distributed Grep

Data Cleansing

Simple Random Sampling

Low Scoring Data

Bloom Filters

Similar to other filters: check each record and decide whether to keep or remove it

Different: filters based on set membership

Set membership is evaluated as well

Compares one list to another

Sometimes emits a false positive; often this is OK

Steps:

Train the filter on the list of values and store it in HDFS

Do the filtering
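The two steps above (train, then filter) can be sketched in plain Java. This is not Hadoop's own Bloom filter class, only a small illustration backed by a `BitSet`; the class name `BloomSketch`, the double-hashing scheme, and the size and hash-count parameters are all assumptions chosen for brevity.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

class BloomSketch {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    BloomSketch(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derives the i-th bit position from two base hashes (double hashing).
    private int position(String value, int i) {
        int h1 = value.hashCode();
        int h2 = (h1 >>> 16) | 1; // odd second hash; an arbitrary choice for this sketch
        return Math.floorMod(h1 + i * h2, size);
    }

    // Step 1: train the filter on the list of known values.
    void train(List<String> values) {
        for (String v : values)
            for (int i = 0; i < hashes; i++) bits.set(position(v, i));
    }

    // Step 2: keep a record only if every one of its bits is set.
    // Can return true for an untrained value (false positive),
    // but never returns false for a trained one.
    boolean mightContain(String value) {
        for (int i = 0; i < hashes; i++)
            if (!bits.get(position(value, i))) return false;
        return true;
    }

    public static void main(String[] args) {
        BloomSketch filter = new BloomSketch(1024, 3);
        filter.train(Arrays.asList("alice", "bob"));
        System.out.println(filter.mightContain("alice"));
    }
}
```

In the pattern as described on the slide, the trained filter would be stored in HDFS and loaded by each mapper before filtering; the sketch keeps everything in one process.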

Bloom Filters


Top Ten

Distinct (De-dupe)

Several Methods:

HDFS & MapReduce Alone

HBase & HDFS

HDFS, MapReduce & Storage Controller

Streaming, HDFS & MapReduce

MapReduce with Blocking


Structured to Hierarchical

Partitioning

Binning

Total Ordering

Shuffling


Structured to Hierarchical


Partitioning

Binning

Uses the MultipleOutputs class

Emits multiple distinct files

The mapper:

looks at each line

iterates through a list of criteria for each bin

If the record meets the criteria, it is sent to that bin

No combiner, partitioner, or reducer used
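The mapper logic described above can be sketched without Hadoop as a function that checks every line against each bin's criterion. In a real job each bin would be written through `MultipleOutputs`; here the bins are in-memory lists. The class name `BinningSketch` and the sample criteria are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class BinningSketch {
    // Sends each line to every bin whose criterion it meets;
    // lines that match no criterion are dropped.
    static Map<String, List<String>> bin(List<String> lines,
                                         Map<String, Predicate<String>> criteria) {
        Map<String, List<String>> bins = new LinkedHashMap<>();
        for (String name : criteria.keySet()) bins.put(name, new ArrayList<>());
        for (String line : lines)
            for (Map.Entry<String, Predicate<String>> c : criteria.entrySet())
                if (c.getValue().test(line)) bins.get(c.getKey()).add(line);
        return bins;
    }

    public static void main(String[] args) {
        Map<String, Predicate<String>> criteria = new LinkedHashMap<>();
        criteria.put("hadoop", s -> s.contains("hadoop"));
        criteria.put("pig", s -> s.contains("pig"));
        System.out.println(bin(Arrays.asList("hadoop job", "pig script"), criteria));
    }
}
```

Because the work happens entirely per record, no combiner, partitioner, or reducer is needed, which matches the slide.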


Total Order Sort

Mapper extracts the sort key

Custom partitioner loads the partition file, takes the data ranges from it, and decides which reducer to target

Reducer is an identity reducer; the number of reducers must equal the number of partitions

$ hadoop fs -cat output/part-r-*
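The partitioner's decision can be sketched as a binary search over sorted split points, which stand in here for the ranges read from the partition file. The class name `RangePartitionSketch`, the sample split points, and the rule that a key equal to a split point goes to the range on its right are assumptions of this sketch, not details taken from the slides.

```java
import java.util.Arrays;

class RangePartitionSketch {
    // splitPoints must be sorted ascending; n split points define n + 1 key ranges,
    // so the job needs n + 1 reducers.
    static int partitionFor(String key, String[] splitPoints) {
        int idx = Arrays.binarySearch(splitPoints, key);
        // Exact match: the key starts the range to the right of that split point.
        // No match: binarySearch returns -(insertionPoint) - 1.
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }

    public static void main(String[] args) {
        String[] splitPoints = {"g", "p"};
        for (String key : new String[] {"apple", "kiwi", "zebra"})
            System.out.println(key + " -> reducer " + partitionFor(key, splitPoints));
    }
}
```

Since every key in reducer i's range sorts before every key in reducer i+1's range, concatenating the part files in order yields a globally sorted result.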

Shuffling

Mapper just outputs a random key K for each K,V pair

Reducer sorts on these keys, which randomizes the results further

Use case: random sampling

Load-balances well
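A minimal single-process sketch of the shuffling idea: attach a random key to each record, sort by that key, then emit the records without the key. The `TreeMap` plays the role of the framework's sort phase. The class name `ShuffleSketch` and the seed parameter are assumptions added so the example is reproducible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

class ShuffleSketch {
    static <V> List<V> shuffle(List<V> records, long seed) {
        Random random = new Random(seed);
        // "Map" step: give every record a random key; TreeMap keeps keys sorted.
        Map<Integer, List<V>> byRandomKey = new TreeMap<>();
        for (V record : records)
            byRandomKey.computeIfAbsent(random.nextInt(), k -> new ArrayList<>()).add(record);
        // "Reduce" step: emit the values in key order and drop the key.
        List<V> out = new ArrayList<>();
        for (List<V> group : byRandomKey.values()) out.addAll(group);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shuffle(Arrays.asList("a", "b", "c", "d"), 7L));
    }
}
```

Random keys spread records evenly across reducers, which is why the slide notes that the pattern load-balances well.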


Join Patterns

Reduce-Side Joins

Replicated Joins

Composite Joins

Cartesian Product

Review: Inner Join

Review: Outer Join

Review: Cartesian Product

Reduce-side Join


Replicated Join

Map-only

Mapper reads the join file from the cache at startup and stores it in memory
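As a rough sketch of the replicated join, the class below loads the small data set into a hash map once (the mapper's startup step) and then joins each large-side record against it in memory. The class name `ReplicatedJoinSketch`, the tab-separated `id<TAB>value` record layout, and the inner-join behavior are assumptions for illustration; the slides do not specify them.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ReplicatedJoinSketch {
    private final Map<String, String> small = new HashMap<>();

    // "Startup": load the small data set (id<TAB>attribute per line) into memory once.
    ReplicatedJoinSketch(List<String> smallSideLines) {
        for (String line : smallSideLines) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) small.put(parts[0], parts[1]);
        }
    }

    // "Map": inner-join one large-side record against the in-memory table.
    // Returns null when the key has no match, meaning the record is dropped.
    String map(String largeSideLine) {
        String[] parts = largeSideLine.split("\t", 2);
        String match = parts.length == 2 ? small.get(parts[0]) : null;
        return match == null ? null : parts[0] + "\t" + parts[1] + "\t" + match;
    }

    public static void main(String[] args) {
        ReplicatedJoinSketch join = new ReplicatedJoinSketch(Arrays.asList("1\tAlice", "2\tBob"));
        System.out.println(join.map("1\tfirst comment"));
    }
}
```

The approach only works when the replicated data set fits in each mapper's memory, which is the trade-off for avoiding a reduce phase.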


Composite Join

Map-only

Driver code handles most of the work

Hadoop does the rest


Cartesian Product

Map-only

Driver code handles most of the work

Simple mapper

Input/Output Patterns

Custom Input & Output

Generating Data

External Sources

Partition Pruning

MapReduce Input and Output


Custom Inputs

OutputFormats

FileOutputFormat<K,V>: superclass

TextOutputFormat<K,V>: default output format

SequenceFileOutputFormat<K,V>

MultipleOutputs<K,V>: sends to various destinations

NullOutputFormat<K,V>: null output

LazyOutputFormat<K,V>


Custom Output

Extend OutputFormat, usually via FileOutputFormat

Implement getRecordWriter(), returning a RecordWriter instance

Define write() in the RecordWriter class; it is invoked for each key-value pair

public void write(AccountKey key, Account value) {
    out.println(key.getAccountKeyId() + "\t" + value.getAccountNbr());
}

Class: BankRecordWriter

OutputFormat

RecordWriter

Generating Data

Map-Only

Good for generating sample data

MapReduce is a good tool to use

Seldom done

External Outputs

Partition Pruning


MetaPatterns

Job Chaining

Chain Folding

Job Merging


End of Chapter

Lab
