Map-Reduce examples
So, what is it?
• A two-phase process geared toward optimizing broad, widely distributed parallel computing platforms
• Apache Hadoop is an open-source framework that implements MapReduce on top of its own distributed file system (HDFS).
• MapReduce is Google's version (and it is proprietary).
• Phases
  • 1. Take a series of keys and transform them into a different series of values, generally ones that have some semantic context
  • 2. Perform a second pass where the new series of values is compressed into far fewer values
In its strictest sense…
• Map-reduce is a two-phase operation
• First, convert a list of data into a list of a different kind of data
• Second, turn the second list into a single scalar value or a list of scalar values, often the cardinality of the items created in the first step
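The two phases can be sketched in a few lines of Python. This is only an illustration on a single machine; the log lines and variable names are made up for the example and do not come from the slides.

```python
from functools import reduce

# Hypothetical input: a list of raw log lines (one kind of data).
lines = ["GET /home", "POST /login", "GET /about"]

# Phase 1 (map): convert each item into a different kind of data,
# here the HTTP verb at the start of each line.
verbs = list(map(lambda line: line.split()[0], lines))

# Phase 2 (reduce): collapse the second list into a single scalar,
# here the number of items produced by the first step.
count = reduce(lambda acc, _verb: acc + 1, verbs, 0)

print(verbs)
print(count)
```

A real M-R platform would run the map step on many machines at once; the shape of the computation stays the same.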
Relevant computing/data application types
• Suited to aggregate database processing; less so to set-oriented querying, and certainly not to object-based querying
• Fits well with cluster-based environments, where there are lots of opportunities for parallel processing
• Fits query patterns that calculate the cardinality of sets and the removal of duplicates
Strategy for M-R
• We try to do the computing on the machines where the data sits
• So we try to engineer the storage of data so that it accommodates the chaining of M-R operations
The key bottom line concept
• In a relational database, we try to minimize the I/O costs of moving large volumes of data from the server to the client, so that it can then be scanned and aggregated
• In a database that supports M-R, we try to screen (and sometimes aggregate) data on the server where it sits
• We also use parallel processing within cluster servers to minimize the cost of doing that aggregation if it cannot all be done on a single server housing the original data.
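A rough sketch of the idea, assuming a hypothetical two-node cluster holding (region, amount) sales rows: each node aggregates its own rows where they sit, and only the small partial totals travel to be merged. The node names, data, and function names are invented for illustration.

```python
# Hypothetical cluster: each server holds its own slice of (region, amount) rows.
servers = {
    "node-a": [("east", 100), ("west", 50), ("east", 25)],
    "node-b": [("west", 70), ("east", 5)],
}

def aggregate_locally(rows):
    """Runs on the server where the data sits: returns {region: partial_sum}."""
    partial = {}
    for region, amount in rows:
        partial[region] = partial.get(region, 0) + amount
    return partial

def merge_partials(partials):
    """Runs once, on whichever machine collects the small partial results."""
    merged = {}
    for partial in partials:
        for region, subtotal in partial.items():
            merged[region] = merged.get(region, 0) + subtotal
    return merged

# Only the per-node partial sums cross the network, not the raw rows.
partials = [aggregate_locally(rows) for rows in servers.values()]
totals = merge_partials(partials)
print(totals)
```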
Another way of looking at this…
• We have seen the tradeoff between moving data and moving processing logic in the context of homogeneous distributed databases
• Often, in distributed databases, it is far cheaper to ship processing logic instead of data, even if it causes extra processing to have to happen
• This is another context in which we often choose to send processing code to a server in order to minimize the movement of large volumes of data
Example
• 1. We start with a set of person keys and map each of these to the names of the people.
  • Key 1 -> Harry
  • Key 2 -> Harry
  • Key 3 -> Tommy
• 2. We aggregate the list of people names by counting how many unique names are in the list.
  • Harry, Harry, Tommy -> 2
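The example above can be written out directly in Python. The `people` dictionary is a stand-in for whatever store holds the person records; it is an assumption made for this sketch.

```python
# Hypothetical person store: key -> name, matching the slide's three keys.
people = {1: "Harry", 2: "Harry", 3: "Tommy"}
keys = [1, 2, 3]

# Step 1 (map): turn each key into the name it refers to.
names = [people[key] for key in keys]

# Step 2 (reduce): collapse the list of names into one scalar,
# the number of unique names.
unique_count = len(set(names))

print(names)
print(unique_count)
```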
What actually happens?
• Informally:
  • Each key leads to a name field.
  • Then, the names are isolated.
  • Then, each is passed to a “mapper”, which returns the name, along with a 1.
  • Then, a “reducer” takes each name and makes a list of 1’s.
  • The reducer adds up the 1’s for each name and returns a list of (name, count) pairs.
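The informal steps above might look like this in Python. The grouping loop in the middle stands in for the shuffle step that an M-R framework would perform between mappers and reducers; the function names are illustrative only.

```python
from collections import defaultdict

def mapper(name):
    """Return the name along with a 1."""
    return (name, 1)

def reducer(name, ones):
    """Add up the 1's collected for one name."""
    return (name, sum(ones))

names = ["Harry", "Harry", "Tommy"]

# Map: every name becomes a (name, 1) pair.
pairs = [mapper(name) for name in names]

# Group: collect the list of 1's for each distinct name.
grouped = defaultdict(list)
for name, one in pairs:
    grouped[name].append(one)

# Reduce: one (name, count) pair per distinct name.
counts = [reducer(name, ones) for name, ones in grouped.items()]
print(counts)
```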
From Wikipedia
Imagine that for a database of 1.1 billion people, one would like to compute the average number of social contacts a person has according to age.
    SELECT age AS Y, AVG(contacts) AS A
    FROM social.person GROUP BY age ORDER BY age
function Map is
    input: integer K1 between 1 and 1100, representing a batch of 1 million social.person records
    for each social.person record in the K1 batch do
        let Y be the person's age
        let N be the number of contacts the person has
        produce one output record <Y,N>
    repeat
end function
function Reduce is
    input: age (in years) Y
    for each input record <Y,N> do
        Accumulate in S the sum of N
        Accumulate in C the count of records so far
    repeat
    let A be S/C
    produce one output record <Y,A>
end function
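A small Python rendering of the pseudocode above, run on a toy data set rather than 1.1 billion records. The `batches` dictionary is a hypothetical stand-in for the social.person table split into K1 batches, and the grouping loop plays the part of the framework's shuffle.

```python
from collections import defaultdict

# Hypothetical stand-in for social.person: batch number K1 -> (age, contacts) records.
batches = {
    1: [(25, 100), (30, 40)],
    2: [(25, 200), (30, 60), (30, 20)],
}

def map_batch(k1):
    """Map: produce one <Y, N> record per person in batch K1."""
    for age, contacts in batches[k1]:
        yield (age, contacts)

def reduce_age(age, contact_counts):
    """Reduce: average the contact counts seen for one age Y."""
    s = 0  # sum of N
    c = 0  # count of records so far
    for n in contact_counts:
        s += n
        c += 1
    return (age, s / c)

# Shuffle: group every N under its age Y.
shuffled = defaultdict(list)
for k1 in batches:
    for age, n in map_batch(k1):
        shuffled[age].append(n)

# One output record <Y, A> per age, ordered by age as in the SQL query.
result = [reduce_age(age, ns) for age, ns in sorted(shuffled.items())]
print(result)
```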
From NoSQL Distilled:
1. Creating a list with a map
2. Aggregating with a reduce
3. Partitioning the output of mappers: parallelism & adding a phase that merges the results of the reducers
4. Introducing a combiner operation to minimize the movement of redundant data – the output format must be the same as the input format
5. A combiner that removes duplicate product-customer pairs
6. Concatenating a combining and a reduce (counting) operation
7. Maintaining the 1’s counts in the mapping phase
8. Adding temporal information to the map/reduce process
9. Using a reduce operator to create product per month totals
10. A second mapper that creates base year by year comparisons
11. A reduce operation combines records for a given year
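The figures for these steps are not reproduced here. As a rough illustration of steps 1 to 4 only, the sketch below maps order line items to (product, quantity) pairs, combines them on each node, and then reduces across nodes. The data, node names, and function names are assumptions made for this example, not taken from the book. Because the combiner's output has the same format as its input, the same function can serve as both combiner and reducer.

```python
from collections import defaultdict

# Hypothetical orders split across two nodes; each line item is (product, quantity).
node_orders = {
    "node-1": [[("tea", 2), ("coffee", 1)], [("tea", 3)]],
    "node-2": [[("coffee", 4)], [("tea", 1), ("cocoa", 5)]],
}

def map_order(order):
    """Map: emit one (product, quantity) pair per line item."""
    return list(order)

def sum_by_key(pairs):
    """Combiner and reducer: takes (key, number) pairs, returns (key, number) pairs."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

# Map and combine on each node, so fewer pairs leave the node.
combined = {}
for node, orders in node_orders.items():
    mapped = [pair for order in orders for pair in map_order(order)]
    combined[node] = sum_by_key(mapped)

# Reduce: merge the combined outputs from all nodes.
totals = sum_by_key(pair for pairs in combined.values() for pair in pairs)
print(totals)
```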
Complaints
• M-R is low level.
• It is rigid.
• It exists to optimize the distributed cluster model – only.
• It demands that an application fit perfectly into the paradigm.
• It takes careful planning and knowledge of exactly how the data will be used to structure the database to optimally serve a series of map/reduce operations
• It thus does not accommodate on-the-fly browsing