04/19/23 EECS 584, Fall 2011 1
MapReduce: Simplified Data Processing on Large Clusters
Yunxing Dai, Huan Feng
Real world problem
Count the number of occurrences of each word in a huge collection of word lists.
– sample input: the seven Harry Potter books
Word Occurrences
The 15414
Good 5435
Never 6546
Tie 694
... ....
Possible solution
Hash table
– each entry is a key-value pair, (word, occurrence)
– scan all the files, put each word into the hash table
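The hash-table approach can be sketched in a few lines of Python. This is only an illustration: the function name `count_words` and the choice to pass the text in as a list of lines (rather than file paths) are assumptions made to keep the example self-contained.

```python
def count_words(lines):
    """Scan every line once and count words in one in-memory hash table."""
    counts = {}  # hash table: word -> number of occurrences
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


sample = ["the good the bad", "never the tie"]
print(count_words(sample))
```

Everything lives in a single dictionary on a single machine, which is what makes this version hard to spread over many nodes.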
Real world problem--follow up
What if you are given a huge set of files and have access to a large set of machines?
Problem with hash table:
– low concurrency
– hard to scale
• if one node fails, all the work must be restarted
MapReduce solution
Map primitive
Idea from functional languages
Given a function, apply it to each element of a list INDIVIDUALLY and combine the results into a new list
e.g. increment each element of a list by 1
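As a small illustration of the map primitive, the increment example looks like this in Python, using the built-in `map`. The names `increment` and `numbers` are made up for the example.

```python
def increment(x):
    """The function applied to each element independently."""
    return x + 1


numbers = [1, 2, 3, 4]
incremented = list(map(increment, numbers))
print(incremented)  # [2, 3, 4, 5]
```

Because each element is processed on its own, the calls could run in any order, or on different machines.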
Reduce primitive
Idea from functional languages
Apply a function to all elements of a list, combining them into a single result
e.g. calculate the sum of a list
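The sum example can be written with `functools.reduce` from the Python standard library. This is just a sketch of the primitive; the variable names are illustrative.

```python
from functools import reduce
from operator import add

numbers = [1, 2, 3, 4]
# Fold the list into a single value, starting from 0.
total = reduce(add, numbers, 0)
print(total)  # 10
```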
Map reduce solution--Single node
Map each single word into a (key, value) pair.
– "Good" -> ("Good", 1)
Put together all the pairs that have the same key.
Input these pairs to a reduce program.
Add the values together
– [("Good", 1), ("Good", 1), ("Good", 1)] -> 3
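The single-node steps above can be sketched as three small functions. The names `map_words`, `group_by_key`, and `reduce_counts` are hypothetical, chosen only to mirror the three steps on the slide.

```python
from collections import defaultdict


def map_words(text):
    """Map: turn every word into a (word, 1) pair."""
    return [(word, 1) for word in text.split()]


def group_by_key(pairs):
    """Put together all the values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce: add the values for one key together."""
    return key, sum(values)


def word_count(text):
    groups = group_by_key(map_words(text))
    return dict(reduce_counts(key, values) for key, values in groups.items())


print(word_count("Good Bad Good Good"))  # {'Good': 3, 'Bad': 1}
```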
Map reduce solution
[Diagram: Files -> Map -> pairs -> Merge and Sort -> list of pairs]
Files: in_01, in_02, in_03, in_04, in_05, ...
pairs: (The, 1), (Good, 1), (The, 1), (Bad, 1), (Not, 1), (The, 1), ...
list of pairs:
[(The, 1), (The, 1), (The, 1)...]
[(Good, 1), (Good, 1), (Good, 1)...]
[(Is, 1), (Is, 1), (Is, 1)...]
[(Therofiery, 1)]
[(Bad, 1), (Bad, 1), (Bad, 1)...]
...
Map reduce solution
What if we are now given a huge number of files and a large number of machines?
The solution scales!
Map can be applied to different parts of the input in parallel.
If some map tasks fail, only those tasks need to be restarted, not the whole job.
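A rough sketch of both points, using threads on one machine as a stand-in for separate machines. The helper `run_map_tasks`, its `max_attempts` parameter, and the retry policy are assumptions for illustration, not something the slides specify.

```python
from concurrent.futures import ThreadPoolExecutor


def map_task(chunk):
    """Turn one chunk of text into (word, 1) pairs."""
    return [(word, 1) for word in chunk.split()]


def run_map_tasks(chunks, task=map_task, max_attempts=3, workers=4):
    """Run one map task per chunk in parallel; re-run only the tasks that fail."""
    results = {}
    pending = list(range(len(chunks)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_attempts):
            futures = {i: pool.submit(task, chunks[i]) for i in pending}
            pending = []
            for i, future in futures.items():
                try:
                    results[i] = future.result()
                except Exception:
                    pending.append(i)  # only this task is retried
            if not pending:
                break
    if pending:
        raise RuntimeError(f"map tasks {pending} failed {max_attempts} times")
    return [results[i] for i in range(len(chunks))]


print(run_map_tasks(["The Good", "The Bad"]))
```

Finished tasks keep their results; a failure only costs the work of the chunk that failed.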
Map reduce solution-scalable version
Map : split the input into several parts and apply the map function to each part.
Shuffle : distribute the intermediate results into buckets according to the hash value of the key, and assign the buckets to the reducers. Each reducer sorts its pairs by key.
Reduce : apply the reduce function to all the elements that have the same key and produce the result.
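The three phases can be sketched on one machine as follows. The use of `zlib.crc32` as the hash function and all function names are assumptions for this example; the slides only say that buckets are chosen by the hash value of the key.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def map_split(split):
    """Map: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in split.split()]


def partition(key, num_reducers):
    """Stable hash, so the same key always lands in the same bucket."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Shuffle: distribute intermediate pairs into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


def reduce_bucket(bucket):
    """Reduce: sort one bucket by key, then sum the values of each key."""
    ordered = sorted(bucket, key=itemgetter(0))
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(ordered, key=itemgetter(0))
    }


splits = ["The Good The Never", "Is Is Tie Work"]
pairs = [pair for split in splits for pair in map_split(split)]
result = {}
for bucket in shuffle(pairs, num_reducers=2):
    result.update(reduce_bucket(bucket))
print(result)
```

Since a key always hashes to the same bucket, each reducer sees every pair for its keys and the buckets can be reduced independently.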
MapReduce
Generalized software framework
Users are only responsible for providing two functions: map and reduce
Easy to scale to a large number of machines.
[Diagram: intermediate pairs split into buckets]
(The, 1), (Good, 1), (The, 1), (Never, 1)
(Is, 1), (Is, 1), (Tie, 1), (Work, 1)
Each bucket is read by one worker (reducer), which sorts the pairs and produces the results.
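To make the division of labour concrete, here is a toy, single-process sketch of such a framework. `run_mapreduce` is a hypothetical driver, not the real system's API; the user supplies only `wc_map` and `wc_reduce`.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver: the framework owns mapping, sorting, grouping and reducing."""
    intermediate = [pair for record in records for pair in map_fn(record)]
    intermediate.sort(key=itemgetter(0))
    return [
        reduce_fn(key, [value for _, value in group])
        for key, group in groupby(intermediate, key=itemgetter(0))
    ]


# The only two functions the user writes:
def wc_map(line):
    for word in line.split():
        yield word, 1


def wc_reduce(word, counts):
    return word, sum(counts)


print(run_mapreduce(["The Good The", "Never The"], wc_map, wc_reduce))
# [('Good', 1), ('Never', 1), ('The', 3)]
```

Swapping in a different pair of functions changes the job without touching the driver.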
Fault tolerance
Worker failure : simply assign the task to another worker.
Master failure : restart the whole job.
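Worker-failure handling can be sketched as a small master loop. Everything here is simulated: workers are plain Python callables, `WorkerFailed` is a made-up exception standing in for a crashed machine, and the round-robin assignment is an assumption.

```python
from collections import deque


class WorkerFailed(Exception):
    """Raised by a (simulated) worker that has crashed."""


def master_loop(tasks, workers):
    """Assign tasks round-robin; put a task back on the queue if its worker fails."""
    queue = deque(tasks)
    alive = list(workers)
    results = {}
    turn = 0
    while queue:
        if not alive:
            raise RuntimeError("no healthy workers left")
        task = queue.popleft()
        worker = alive[turn % len(alive)]
        try:
            results[task] = worker(task)
            turn += 1
        except WorkerFailed:
            alive.remove(worker)  # stop using the failed worker
            queue.append(task)    # reassign its task to another worker
    return results


def good_worker(task):
    return task.upper()


def broken_worker(task):
    raise WorkerFailed(task)


print(master_loop(["a", "b", "c"], [broken_worker, good_worker]))
```

If the master itself dies there is nobody left to do this bookkeeping, which is why that case falls back to restarting the whole job.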
Implementation details
Locality
– Take the location of the input files into account
– Assign each map task to the machine closest to its input data.
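One possible way to express "closest machine" is a three-level preference. The notion of racks, the `rack_of` mapping, and the function name `pick_worker` are assumptions introduced for this sketch; the slide itself only says to prefer the closest machine.

```python
def pick_worker(replica_hosts, idle_workers, rack_of):
    """Prefer an idle worker that stores the data, then one on the same rack, then any."""
    # 1. A worker that already holds a copy of the input split.
    for worker in idle_workers:
        if worker in replica_hosts:
            return worker
    # 2. A worker on the same rack as a copy (hypothetical rack map).
    replica_racks = {rack_of[host] for host in replica_hosts}
    for worker in idle_workers:
        if rack_of[worker] in replica_racks:
            return worker
    # 3. Any idle worker; the data must cross the network.
    return idle_workers[0] if idle_workers else None


rack_of = {"m1": "r1", "m2": "r1", "m3": "r2"}
print(pick_worker({"m1"}, ["m3", "m2"], rack_of))  # m2
```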
Implementation details
Backup tasks
– Abnormally slow machines lengthen the total running time.
– When the job is almost finished, duplicate the remaining in-progress tasks as backup tasks.
– Whenever a primary or a backup execution is done, mark the task as finished.
[Diagram: TASK1, TASK2, TASK3, each marked "In progress" or "Completed"; a backup copy of TASK1 is launched]
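The "first copy to finish wins" rule can be sketched with two threads racing on the same task. The delays simulate a slow primary machine; `run_with_backup` and its parameters are hypothetical names for this example.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_with_backup(task, primary_delay, backup_delay):
    """Run a primary and a backup copy of one task; return whichever finishes first."""

    def attempt(name, delay):
        time.sleep(delay)  # simulated machine slowness
        return name, task()

    pool = ThreadPoolExecutor(max_workers=2)
    futures = [
        pool.submit(attempt, "primary", primary_delay),
        pool.submit(attempt, "backup", backup_delay),
    ]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)  # do not block on the straggler
    return next(iter(done)).result()


winner, value = run_with_backup(lambda: 42, primary_delay=0.3, backup_delay=0.01)
print(winner, value)
```

Running a task twice is only safe because both copies produce the same result, so it does not matter which one is kept.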
Useful Extensions
Partitioning Functions
– hash-based or range-based
– self-defined partition function
Combiner Function (similar to Reduce Function)
– <any, 1> <a, 1> <any, 1> <any, 1>
– resolves significant repetition in intermediate outputs
Skipping Bad Records
– errors or bugs
– acceptable to ignore a few records
Local Execution
– helps facilitate debugging, profiling and testing
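As an illustration of the combiner, the example pairs from the slide can be pre-aggregated on the map side before anything is sent to a reducer. The function name `combine` is an assumption for this sketch.

```python
from collections import Counter


def combine(pairs):
    """Pre-aggregate (word, count) pairs locally, before they cross the network."""
    combined = Counter()
    for word, count in pairs:
        combined[word] += count
    return list(combined.items())


map_output = [("any", 1), ("a", 1), ("any", 1), ("any", 1)]
print(combine(map_output))  # [('any', 3), ('a', 1)]
```

Four intermediate pairs shrink to two; this works for word count because addition gives the same total whether it is applied in one step or in stages.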
Performance & Evaluation
Cluster Configuration
– 1800 nodes
– 2×2GHz, 4GB memory, 2×160GB IDE, Gb Ethernet link
– 2-level tree-shaped switched network
Grep
– in 10^10 100-byte records
– M = 15000, R = 1
– takes ~150 seconds
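A quick back-of-the-envelope check of what the grep numbers imply. The figures are taken from the slide; the derived values are averages over the whole run and ignore startup overhead, so treat them as rough.

```python
records = 10**10      # number of input records
record_bytes = 100    # size of each record
map_tasks = 15000     # M
seconds = 150         # approximate running time

total_bytes = records * record_bytes          # 10**12 bytes, about 1 TB
bytes_per_map_task = total_bytes / map_tasks  # about 66.7 MB per map task
scan_rate = total_bytes / seconds             # about 6.7 GB/s across the cluster

print(f"{total_bytes / 1e12:.1f} TB total")
print(f"{bytes_per_map_task / 1e6:.1f} MB per map task")
print(f"{scan_rate / 1e9:.1f} GB/s average scan rate")
```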
Performance & Evaluation (Sort)
Sort
– 10^10 100-byte records
– M = 15000, R = 4000
– Normal, No-Backup, 200 tasks killed
A few things to note
– Input rate is higher than the shuffle & output rates
– with no backup tasks, the execution flow is similar except for the long tail
– with tasks killed, the tasks are restarted & the rate drops to zero
Application of MapReduce
Broadly applicable
– large-scale machine learning problems
– clustering problems for Google News
– extraction of data & properties
– graph computations
Large-Scale Indexing
– The indexing code is simpler and smaller (~3800 to ~700 lines)
– The indexing process is much easier to operate & easy to speed up
MapReduce & Parallel DBMS
MapReduce is not novel at all
– an entirely new paradigm?
MapReduce is a step backwards
– no schema
– no high-level access language
MapReduce is a poor implementation
– no indexes
– overlooks skew
– lots of P2P network traffic in the shuffle phase
Missing features
– indexes, updates, transactions
Not compatible with DBMS tools
MapReduce & Parallel DBMS
Parallel database
– has a significant performance advantage
– takes a lot of time to tune and set up
– is not general enough (UDFs, UDTs)
– SQL is not that easy & straightforward
MapReduce
– easy to set up & easy to program
– scalable & fault-tolerant
– brute-force solution
What is MapReduce
A parallel programming model / data processing paradigm rather than a complete DBMS
– Does not target everything a DBMS targets
– It's simple but it works
Works for those who
– have a lot of data (of some specific type)
– find UDTs and UDFs complex to tune
– would rather program in a sequential language than SQL
– have no need to index data because the data change all the time
– do not need to pay