MapReduce in Action Team 306 Led by Chen Lin College of Information Science and Technology.
MapReduce in Action
Team 306, led by Chen Lin
College of Information Science and Technology
Data Mining Group @ Xiamen University
Contents
1. Basic MapReduce Programs
2. Advanced MapReduce
3. Beyond the horizon
4. Discussion
Basic MapReduce Programs
Implement Interface
Environment Configuration
Job Configuration?
Java Class
Configure
JVM options: mapred.child.java.opts
Local directory: {mapred.local.dir}
Input path, output path
How many map/reduce tasks?
Basic MapReduce Program
InputFormat → Map → Reduce → OutputFormat
Data along the way: Text, InputSplit, <K1,V2>, <K1, List<V1>>, List<K1,V1>
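The flow on this slide can be sketched in plain Python. This is only a single-process simulation, not Hadoop code: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, word count is an assumed example task, and the K1/V1, K2/V2 labels in the comments follow the usual convention rather than anything the slide spells out.

```python
from collections import defaultdict

def map_fn(offset, line):
    """Map: one (K1, V1) record in, a list of (K2, V2) pairs out."""
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    """Reduce: one key with the list of all its values in, one pair out."""
    return (word, sum(counts))

def run_job(lines):
    # InputFormat stand-in: one record per line, keyed by its byte offset.
    records, offset = [], 0
    for line in lines:
        records.append((offset, line))
        offset += len(line.encode("utf-8")) + 1  # +1 for the newline

    # Map phase.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle and sort: group values by key, then visit keys in sorted order.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase; the returned list stands in for the OutputFormat.
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]

result = run_job(["to be or not", "to be"])
print(result)
```

In a real job the framework performs the grouping step across machines; here it is a dictionary.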
PARTITIONERS AND COMBINERS
Combiners
An optimization in MapReduce that allows for local aggregation before the shuffle and sort phase.
Partitioner
Determines which reducer will be responsible for processing a particular key; the execution framework uses this information to copy the data to the right location during the shuffle and sort phase.
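As a rough illustration of the two roles, the sketch below simulates them in plain Python. It assumes a hash-modulo partitioning scheme (the idea behind Hadoop's default HashPartitioner) and a summing combiner; `partition` and `combine` are hypothetical helper names, and `zlib.crc32` stands in for a stable hash function.

```python
import zlib
from collections import Counter

def partition(key, num_reducers):
    """Hash partitioner: the same key always maps to the same reducer index."""
    # crc32 gives a stable hash; Python's built-in hash() of str varies per process.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def combine(pairs):
    """Combiner: locally sum the counts emitted by one mapper."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

mapper_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1)]
combined = combine(mapper_output)

# Route each combined pair to the bucket of the reducer responsible for its key.
num_reducers = 2
buckets = {r: [] for r in range(num_reducers)}
for key, value in combined:
    buckets[partition(key, num_reducers)].append((key, value))

print(len(mapper_output), "pairs before combining,", len(combined), "after")
print(buckets)
```

The combiner shrinks what has to be shuffled; the partitioner only decides where each remaining pair goes.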
Basic MapReduce Program: InputFormat
• TextInputFormat
• KeyValueTextInputFormat
• SequenceFileInputFormat
• NLineInputFormat
• Creating a custom InputFormat
InputFormat
• TextInputFormat - Each line in the text files is a record. The key is the byte offset of the line, and the value is the content of the line.
• KeyValueTextInputFormat - Each line in the text files is a record. The first separator character divides each line. Everything before the separator is the key, and everything after is the value. The separator is set by the key.value.separator.in.input.line property, and the default is the tab (\t) character.
• NLineInputFormat - Same as TextInputFormat, but each split is guaranteed to have exactly N lines. The mapred.line.input.format.linespermap property, which defaults to one, sets N.
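To make the three record layouts concrete, here is a small plain-Python imitation of how each format might turn the same bytes into records. The function names are hypothetical and none of this is Hadoop API; in particular, the last split in this sketch may hold fewer than N lines when the input does not divide evenly.

```python
def text_input_format(data):
    """Key = byte offset of the line, value = line content."""
    records, offset = [], 0
    for raw in data.splitlines(keepends=True):
        records.append((offset, raw.rstrip(b"\n").decode("utf-8")))
        offset += len(raw)
    return records

def key_value_text_input_format(data, separator="\t"):
    """Split each line at the first separator only."""
    records = []
    for line in data.decode("utf-8").splitlines():
        key, _, value = line.partition(separator)
        records.append((key, value))
    return records

def n_line_splits(data, n=1):
    """Group TextInputFormat-style records into splits of n lines each."""
    records = text_input_format(data)
    return [records[i:i + n] for i in range(0, len(records), n)]

data = b"k1\tv1\tx\nk2\tv2\nk3\tv3\n"
print(text_input_format(data))
print(key_value_text_input_format(data))
print(n_line_splits(data, n=2))
```

Note how the second tab in the first line stays inside the value: only the first separator divides the line.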
Summary for basic Program
What’s a complete MapReduce job?
Code for the mapper, reducer, combiner, and partitioner, along with job configuration parameters.
The execution framework handles everything else.
Advanced MapReduce
• Chaining MapReduce jobs
• Local aggregation
• Secondary sorting
• Work on Hadoop files
Chaining MapReduce jobs
So far you’ve been doing data processing tasks that a single MapReduce job can accomplish.
But as you get more comfortable writing MapReduce programs and take on more ambitious data processing tasks, you’ll find that many complex tasks need to be broken down into simpler subtasks, each accomplished by an individual MapReduce job.
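A minimal sketch of the idea, in plain Python rather than Hadoop: the output of one simulated job becomes the input of the next. The helper `run_job` and the two example jobs (count words, then group words by their count) are assumptions chosen only for illustration.

```python
from collections import defaultdict

def run_job(records, map_fn, reduce_fn):
    """Run one simulated MapReduce job over a list of (key, value) records."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]

# Job 1: count occurrences of each word.
def count_map(_, line):
    return [(word, 1) for word in line.split()]

def sum_reduce(word, counts):
    return (word, sum(counts))

# Job 2: group words by their count; it consumes job 1's output.
def invert_map(word, count):
    return [(count, word)]

def collect_reduce(count, words):
    return (count, sorted(words))

lines = [(0, "a b a"), (6, "c b a")]
word_counts = run_job(lines, count_map, sum_reduce)
words_by_count = run_job(word_counts, invert_map, collect_reduce)
print(word_counts)
print(words_by_count)
```

Neither job alone answers "which words share a count"; the second subtask only becomes simple once the first has finished.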
LOCAL AGGREGATION
In Hadoop, intermediate results are written to local disk before being sent over the network.
Reductions in the amount of intermediate data therefore translate into increases in algorithmic efficiency.
With a combiner, it is possible to substantially reduce both the number and the size of the key-value pairs that need to be shuffled from the mappers to the reducers.
LOCAL AGGREGATION
1. Combiners must have the same input and output key-value types.
2. Combiners are optimizations that cannot change the correctness of the algorithm.
Hadoop makes no guarantees on how many times combiners are called; it could be zero, one, or multiple times.
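One way to check these rules is to run the same simulated job with the combiner applied zero, one, and two times and compare the results. The sketch below is plain Python with hypothetical helper names; summation is used because it is associative and commutative, which is what makes repeated combining safe here.

```python
from collections import defaultdict

def sum_combine(pairs):
    """Combiner with identical input and output types: (key, int) -> (key, int)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

def reduce_sum(pairs):
    """Reducer: final sum per key."""
    return dict(sum_combine(pairs))

def run(mapper_outputs, combiner_passes):
    """Apply the combiner `combiner_passes` times per mapper, then reduce."""
    shuffled = []
    for pairs in mapper_outputs:
        for _ in range(combiner_passes):
            pairs = sum_combine(pairs)
        shuffled.extend(pairs)
    return reduce_sum(shuffled)

mapper_outputs = [[("x", 1), ("y", 1), ("x", 1)], [("x", 1), ("y", 1)]]
results = [run(mapper_outputs, passes) for passes in (0, 1, 2)]
print(results)
```

A combiner that computed a mean instead of a sum would not pass this check, because a mean of means generally differs from the overall mean.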
SECONDARY SORTING
Sometimes we also need to sort by value:
(k1, m1, v8) (k1, m2, v1) (k1, m3, v7) ... (k2, m1, v2) (k2, m2, v6) (k2, m3, v9)
Move part of the value into the key: k1 → (m1, v8) becomes (k1, m1) → (v8)
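The composite-key idea can be sketched as follows, using the slide's example tuples in a deliberately shuffled order. This is a plain-Python simulation: sorting and `itertools.groupby` stand in for the framework's sort and grouping, and in a real Hadoop job a custom partitioner on k alone would presumably also be needed so that all (k, m) keys for one k reach the same reducer.

```python
from itertools import groupby

# Records shaped like the slide's example, deliberately out of order.
records = [
    ("k2", "m3", "v9"), ("k1", "m2", "v1"), ("k1", "m1", "v8"),
    ("k2", "m1", "v2"), ("k1", "m3", "v7"), ("k2", "m2", "v6"),
]

# Map: move m from the value into a composite key (k, m).
mapped = [((k, m), v) for k, m, v in records]

# Sort on the full composite key, as the shuffle and sort phase would.
mapped.sort(key=lambda pair: pair[0])

# Group on k alone, so one reduce call sees all of a key's values in m order.
grouped = {
    k: [(composite[1], v) for composite, v in group]
    for k, group in groupby(mapped, key=lambda pair: pair[0][0])
}
print(grouped)
```

The reducer never sorts anything itself; the ordering comes for free from the key sort.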
Beyond the horizon
It’s a shame: the remaining topics play an important role in MapReduce, but they are beyond my horizon.
So I need all your help to master them together.
Beyond the horizon
• Create a custom InputFormat
• Manipulate local files
• Create a custom Partitioner
• Pipes for C++
• Streaming for other languages
Beyond the horizon
• Joining data from different sources
• Hive
• Pig
• HBase
• Multiple file output
Joining data from different sources
Customers file: CSV format; record fields: (Customer ID, Name, Phone Number)
Orders file: CSV format; record fields: (Customer ID, Order ID, Price, Purchase Date)
Joining data from different sources
Customers:
Joey Leung,555-555-55
Edward,123-456-7890
Jose Madriz,281-330-8004
David Stork,408-555-0000
…....
Orders:
A,12.95,02-Jun-2008
B,88.25,20-May-2008
C,32.00,30-Nov-2007
D,25.02,22-Jan-2009
Joined output:
Joey Leung,555-555-5555,B,88.25,20-May-2008
Edward,123-456-7890,C,32.00,30-Nov-2007
Jose Madriz,281-330-8004,A,12.95,02-Jun-2008
Jose Madriz,281-330-8004,D,25.02,22-Jan-2009
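One common way to get this result is a reduce-side join: tag each record with its source, key it by Customer ID, and pair customers with orders inside the reducer. The sketch below simulates that in plain Python. The Customer ID values 1 to 4 are hypothetical, since the records shown on the slide do not include that column, and `map_tagged` and `reduce_join` are illustrative names, not Hadoop API.

```python
from collections import defaultdict

# Customer IDs 1-4 are assumed for illustration; the slide does not show them.
customers = [
    "1,Joey Leung,555-555-5555",
    "2,Edward,123-456-7890",
    "3,Jose Madriz,281-330-8004",
    "4,David Stork,408-555-0000",
]
orders = [
    "3,A,12.95,02-Jun-2008",
    "1,B,88.25,20-May-2008",
    "2,C,32.00,30-Nov-2007",
    "3,D,25.02,22-Jan-2009",
]

def map_tagged(source, line):
    """Map: key each record by Customer ID and tag it with its source file."""
    customer_id, _, rest = line.partition(",")
    return (customer_id, (source, rest))

def reduce_join(tagged_values):
    """Reduce: pair every customer record with every order for the same ID."""
    names = [rest for source, rest in tagged_values if source == "customers"]
    bought = [rest for source, rest in tagged_values if source == "orders"]
    return [f"{name},{order}" for name in names for order in bought]

# Shuffle stand-in: group the tagged records by Customer ID.
groups = defaultdict(list)
for source, lines in (("customers", customers), ("orders", orders)):
    for line in lines:
        key, value = map_tagged(source, line)
        groups[key].append(value)

joined = [row for key in sorted(groups) for row in reduce_join(groups[key])]
print("\n".join(joined))
```

This behaves as an inner join: a customer with no orders, such as the fourth record here, produces no output rows.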