MapReduces
PATTERNS FOR PROCESS
Agenda
Overview of all the MapReduce Design Patterns
MapReduce Design Patterns Overview
Deep Dive into following Patterns
Filtering Patterns
Join Patterns
Input and Output Patterns
Other Patterns Overview
Summarization Patterns
Data Organization Patterns
MetaPatterns
Comparison chart of when to use which design patterns
Best Practices
MapReduce Patterns
Summarization Patterns
Filtering Patterns
Data Organization Patterns
Join Patterns
Meta Patterns
Input & Output Patterns
BIG DATA SERIES. Powered by Prognosive © 2015
Summarization Patterns
Numerical Summarizations
Inverted Indexes
Counters
Numerical Summarizations
Word Count
Record Counts
Min / Max
Average/Median/Std Deviation
Inverted Indexes
source: MapReduce Design Patterns, Miner & Shook, O’Reilly
Counters
Record Count
Unique Instances
Summations
if (StringUtils.startsWithLetter(token)) {
    context.getCounter(WordsNature.STARTS_WITH_LETTER).increment(1);
}
Filter Patterns
Filters
Bloom Filters
Top Ten
Distinct
Filters
Narrowing Views
Tracking Event Threads
Distributed Grep
Data Cleansing
Simple Random Sampling
Low Scoring Data
Bloom Filters
Similar to other filters
Check each record: decide to keep or remove
Different: filter based on set membership
Set membership is evaluated as well
Compares one list to another
Sometimes emits a false positive
Often this is OK
Steps:
Train the filter on the list of values; store it in HDFS
Do the filtering
Bloom Filters
source: MapReduce Design Patterns, Miner & Shook, O’Reilly
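The two steps above (train, then filter) can be sketched in plain Java. This is a minimal illustration, not the code from the book or from Hadoop: the class name `BloomFilterSketch`, the bit-array size, and the double-hashing scheme are all assumptions made for the example. It shows the key property: a trained value is never rejected, while an untrained value may occasionally pass (a false positive).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Hypothetical, minimal Bloom filter; sizes and hash scheme are illustrative. */
final class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomFilterSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Training step: set one bit per hash function for the value. */
    void add(String value) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(value, i));
        }
    }

    /** False means definitely absent; true means present or a false positive. */
    boolean mightContain(String value) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(value, i))) {
                return false;
            }
        }
        return true;
    }

    /** Filtering step: keep only records that pass the membership test. */
    List<String> keepMembers(List<String> records) {
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            if (mightContain(record)) {
                kept.add(record);
            }
        }
        return kept;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int position(String value, int i) {
        byte[] data = value.getBytes(StandardCharsets.UTF_8);
        int h1 = Arrays.hashCode(data);
        int h2 = fnv1a(data);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // 32-bit FNV-1a hash, used here only as a second, independent hash.
    private static int fnv1a(byte[] data) {
        int hash = 0x811C9DC5;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 0x01000193;
        }
        return hash;
    }
}
```

In a MapReduce job the trained bit array would be written to HDFS and loaded by each mapper before filtering; here it simply stays in memory.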
Top Ten
Distinct (De-dupe)
Several Methods:
HDFS & MapReduce Alone
HBase & HDFS
HDFS, MapReduce & Storage Controller
Streaming, HDFS & MapReduce
MapReduce with Blocking
Data Organization Patterns
Structured to Hierarchical
Partitioning
Binning
Total Ordering
Shuffling
Structured to Hierarchical
source: MapReduce Design Patterns, Miner & Shook, O’Reilly
Partitioning
Binning
Uses MultipleOutputs class
Emits multiple distinct files
The mapper:
looks at each line
iterates through a list of criteria for each bin
If the record meets the criteria, it is sent to that bin
No combiner, partitioner, or reducer used
source: MapReduce Design Patterns, Miner & Shook, O’Reilly
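The mapper logic for binning can be sketched without Hadoop as follows. The class name `BinningSketch`, the bin names, and the predicates are hypothetical; in a real job the `add` call would be replaced by a write through `MultipleOutputs`, and the bins would be files rather than in-memory lists.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical stand-in for a map-only binning job. */
final class BinningSketch {
    /**
     * For each line, test every bin's criterion and send the line to each
     * bin whose criterion it meets. A line can land in several bins or none.
     */
    static Map<String, List<String>> bin(List<String> lines,
                                         Map<String, Predicate<String>> criteria) {
        Map<String, List<String>> bins = new LinkedHashMap<>();
        for (String name : criteria.keySet()) {
            bins.put(name, new ArrayList<>());
        }
        for (String line : lines) {
            for (Map.Entry<String, Predicate<String>> c : criteria.entrySet()) {
                if (c.getValue().test(line)) {
                    // In Hadoop this is where the record would be written
                    // to the named output for the bin.
                    bins.get(c.getKey()).add(line);
                }
            }
        }
        return bins;
    }
}
```

Because each record is routed as soon as it is read, no shuffle is needed, which is why the pattern runs without a combiner, partitioner, or reducer.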
Total Order Sort
Mapper extracts the sort key
Custom partitioner loads the partition file, takes the data ranges from the file, and decides which reducer to target
Reducer: identity reducer; the number of reducers must equal the number of partitions
$ hadoop fs -cat output/part-r-*
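The partitioner's decision can be illustrated with a binary search over sorted split points. This is a sketch under assumptions: the class `RangePartitionerSketch` is hypothetical and holds its split points in memory, whereas in Hadoop this role is played by `TotalOrderPartitioner`, which reads them from a partition file. With n-1 split points there are n partitions, and since partition i only receives keys smaller than those in partition i+1, concatenating the sorted reducer outputs in order gives a total order.

```java
import java.util.Arrays;

/** Hypothetical range partitioner over in-memory, sorted split points. */
final class RangePartitionerSketch {
    private final String[] splitPoints;

    RangePartitionerSketch(String[] sortedSplitPoints) {
        this.splitPoints = sortedSplitPoints.clone();
    }

    /** One more partition (and reducer) than there are split points. */
    int numPartitions() {
        return splitPoints.length + 1;
    }

    /** Keys below the first split point go to 0; a key equal to a split point goes to the upper range. */
    int getPartition(String key) {
        int idx = Arrays.binarySearch(splitPoints, key);
        // Found: idx is the split point's index, so use the range above it.
        // Not found: binarySearch returns -(insertionPoint) - 1.
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }
}
```

For example, with split points "g" and "n", keys starting with a through f go to reducer 0, g through m to reducer 1, and the rest to reducer 2.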
Shuffling
Mapper just outputs a random key K for each K,V pair
Reducer sorts these; further randomization results
Use case: random sampling
Load-balances well
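A single-process sketch of the idea, assuming a hypothetical class `ShuffleSketch`: each record is paired with a random key (the mapper's job), and sorting by that key (done by the framework between map and reduce) leaves the records in random order.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical single-process stand-in for the shuffling pattern. */
final class ShuffleSketch {
    static <V> List<V> shuffle(List<V> records, Random rng) {
        // "Map" step: emit (random key, record) for every record.
        List<Map.Entry<Long, V>> keyed = new ArrayList<>();
        for (V record : records) {
            keyed.add(new AbstractMap.SimpleEntry<>(rng.nextLong(), record));
        }
        // "Sort" step: ordering by the random key randomizes the records.
        keyed.sort((a, b) -> Long.compare(a.getKey(), b.getKey()));
        // "Reduce" step: drop the key and emit the record.
        List<V> out = new ArrayList<>();
        for (Map.Entry<Long, V> e : keyed) {
            out.add(e.getValue());
        }
        return out;
    }
}
```

Random keys spread evenly across reducers, which is where the good load balancing comes from.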
Join Patterns
Reduce-Side Joins
Replicated Joins
Composite Joins
Cartesian Product
Review: Inner Join
Review: Outer Join
Review: Cartesian Product
Reduce-side Join
Replicated Join
Map-only
Mapper reads the join file from the cache at startup
Stores it in memory
source: MapReduce Design Patterns, Miner & Shook, O’Reilly
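The replicated join can be sketched as an in-memory hash lookup. Everything here is an assumption made for illustration: the class name `ReplicatedJoinSketch`, the tab-separated record layout, and the choice of an inner join. `loadSmallTable` corresponds to what the mapper does once at startup, and `join` to what it does for each record of the large data set.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical map-side (replicated) inner join on tab-separated records. */
final class ReplicatedJoinSketch {
    /** Startup step: load the small data set into memory, keyed by join key. */
    static Map<String, String> loadSmallTable(List<String> lines) {
        Map<String, String> table = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                table.put(parts[0], parts[1]);
            }
        }
        return table;
    }

    /** Per-record step: look up the join key and emit the joined record if it matches. */
    static List<String> join(Map<String, String> small, List<String> largeRecords) {
        List<String> joined = new ArrayList<>();
        for (String record : largeRecords) {
            String[] parts = record.split("\t", 2);
            if (parts.length != 2) {
                continue; // skip malformed records
            }
            String match = small.get(parts[0]);
            if (match != null) {
                joined.add(parts[0] + "\t" + match + "\t" + parts[1]);
            }
        }
        return joined;
    }
}
```

The approach only works when the small data set fits in each mapper's memory; in exchange, no data is shuffled to reducers.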
Composite Join
Map-only
Driver code handles most of the work
Hadoop does the rest
Cartesian Product
Map-only
Driver code handles most of the work
Simple mapper
Input/Output Patterns
Custom Input & Output
Generating Data
External Sources
Partition Pruning
MapReduce Input and Output
Custom Inputs
OutputFormats
FileOutputFormat<K,V> superclass
TextOutputFormat<K,V> default output format
SequenceFileOutputFormat<K,V>
MultipleOutputs<K,V> sends to various destinations
NullOutputFormat<K,V> null output
LazyOutputFormat<K,V>
Custom Output
Extend OutputFormat (usually FileOutputFormat)
Implement getRecordWriter(), returning a RecordWriter instance
Define write() in the RecordWriter class; it is invoked for each K-V pair
public void write(AccountKey key, Account value) {
    // Use a String tab so the key is concatenated, not added to a char
    out.println(key.getAccountKeyId() + "\t"
        + value.getAccountNbr());
}
Class: BankRecordWriter
OutputFormat
RecordWriter
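To make the write() step concrete without a Hadoop dependency, the sketch below mimics a record writer that emits one tab-separated line per key-value pair. It is not a real `RecordWriter` subclass: the class `BankRecordWriterSketch`, the nested `AccountKey` and `Account` types, and their field types are all hypothetical stand-ins for the classes named on the slide.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.io.Writer;

/** Hypothetical stand-in for BankRecordWriter; not a Hadoop RecordWriter. */
final class BankRecordWriterSketch implements AutoCloseable {
    /** Assumed key type: a numeric account key id. */
    static final class AccountKey {
        private final int accountKeyId;
        AccountKey(int accountKeyId) { this.accountKeyId = accountKeyId; }
        int getAccountKeyId() { return accountKeyId; }
    }

    /** Assumed value type: an account number held as text. */
    static final class Account {
        private final String accountNbr;
        Account(String accountNbr) { this.accountNbr = accountNbr; }
        String getAccountNbr() { return accountNbr; }
    }

    private final PrintWriter out;

    BankRecordWriterSketch(Writer sink) {
        this.out = new PrintWriter(sink);
    }

    /** Called once per key-value pair; writes "keyId<TAB>accountNbr". */
    void write(AccountKey key, Account value) {
        out.print(key.getAccountKeyId() + "\t" + value.getAccountNbr() + "\n");
    }

    /** Flush buffered output, as a record writer would on close. */
    @Override
    public void close() {
        out.flush();
    }

    public static void main(String[] args) {
        StringWriter sink = new StringWriter();
        try (BankRecordWriterSketch writer = new BankRecordWriterSketch(sink)) {
            writer.write(new AccountKey(7), new Account("ACC-001"));
        }
        System.out.print(sink);
    }
}
```

In an actual job, the custom OutputFormat would construct the writer around an output stream on HDFS and the framework would call write() and close().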
Generating Data
Map-Only
Good for generating sample data
MapReduce is a good tool to use
Seldom done
External Outputs
Partition Pruning
source: MapReduce Design Patterns, Miner & Shook, O’Reilly
MetaPatterns
Job Chaining
Chain Folding
Job Merging
End of Chapter
Lab