Map-Reduce examples 1. So, what is it? A two phase process geared toward optimizing broad, widely...

1

Map-Reduce examples

2

So, what is it?

• A two phase process geared toward optimizing broad, widely distributed parallel computing platforms

• Apache Hadoop is an open-source framework that implements MapReduce on top of its own distributed file system (HDFS).

• MapReduce is Google's version (and it is proprietary).

• Phases

• 1. Take a series of keys and transform them into a different series of values, generally ones that have some semantic context

• 2. Perform a second pass where the new series of values is compressed into far fewer values

3

In its strictest sense…

• Map-reduce is a two-phase operation

• First, convert a list of data into a list of a different kind of data

• Second, turn the second list into a single scalar value or a list of scalar values, often the cardinality of the items created in the first step
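
The two phases can be sketched in a few lines of Python. This is only an illustration: the input list and the names `words`, `lengths`, `total`, and `count` are assumptions, not part of the slides.

```python
from functools import reduce

# Hypothetical input list (not from the slides).
words = ["map", "reduce", "cluster"]

# Phase 1 (map): convert a list of strings into a list of a different kind of data.
lengths = list(map(len, words))

# Phase 2 (reduce): collapse the second list into a single scalar value.
total = reduce(lambda acc, n: acc + n, lengths, 0)

# Counting the items (their cardinality) is another common reduction.
count = reduce(lambda acc, _item: acc + 1, lengths, 0)
```

Here the map step changes the type of the data (strings to integers) and the reduce step shrinks the list to one value.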

4

Relevant computing/data application types

• Suited to aggregate database processing, not so much to set-oriented querying, and certainly not to object-based querying

• Fits well with cluster-based environments, where there are lots of opportunities for parallel processing

• Fits query patterns that calculate the cardinality of sets and remove duplicates

5

Strategy for M-R

• We try to do the computing on the machines where the data sits

• So we try to engineer the storage of data so that it accommodates the chaining of M-R operations

6

The key bottom line concept

• In a relational database, we try to minimize the I/O costs of moving large volumes of data from the server to the client, so that it can then be scanned and aggregated

• In a database that supports M-R, we try to screen (and sometimes aggregate) data on the server where it sits

• We also use parallel processing within cluster servers to minimize the cost of doing that aggregation if it cannot all be done on a single server housing the original data.

7

Another way of looking at this…

• We have seen the tradeoff between moving data and moving processing logic in the context of homogeneous distributed data

• Often, in distributed databases, it is far cheaper to ship processing logic than to ship data, even if it causes extra processing to have to happen

• This is another context in which we often choose to send processing code to a server in order to minimize the movement of large volumes of data

8

Example

• 1. We start with a set of person keys and map each of these to the names of the people.

• Key 1 -> Harry

• Key 2 -> Harry

• Key 3 -> Tommy

• 2. We aggregate the list of people names by counting how many unique names are in the list.

• Harry, Harry, Tommy -> 2

9

What actually happens?

• Informally:

• Each key leads to a name field.

• Then, the names are isolated.

• Then, each is passed to a “mapper”, which returns the name, along with a 1.

• Then, a “reducer” takes each name and makes a list of 1’s. The reducer adds up the 1’s for each name and returns a list of (name, count) pairs.
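
The informal steps above can be sketched in Python using the Harry/Harry/Tommy data from the example. The function names `mapper` and `reducer` and the in-memory grouping are assumptions for illustration; a real framework would do the grouping between the two phases across machines.

```python
from collections import defaultdict

# Person keys mapped to names, as in the example.
people = {1: "Harry", 2: "Harry", 3: "Tommy"}

def mapper(name):
    """Return the name along with a 1."""
    return (name, 1)

def reducer(name, ones):
    """Add up the 1's collected for a single name."""
    return (name, sum(ones))

# Isolate the names and pass each one to the mapper.
mapped = [mapper(people[key]) for key in sorted(people)]

# Group the 1's by name (the step a framework performs between map and reduce).
grouped = defaultdict(list)
for name, one in mapped:
    grouped[name].append(one)

# Reduce each group to a (name, count) pair.
counts = [reducer(name, ones) for name, ones in sorted(grouped.items())]

# The number of pairs is the number of unique names.
unique_names = len(counts)
```

The list of (name, count) pairs has one entry per distinct name, so its length gives the unique-name count from the previous example.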

10

From Wikipedia

Imagine that for a database of 1.1 billion people, one would like to compute the average number of social contacts a person has according to age.

SELECT age AS Y, AVG(contacts) AS A
FROM social.person
GROUP BY age
ORDER BY age

function Map is
    input: integer K1 between 1 and 1100, representing a batch of 1 million social.person records
    for each social.person record in the K1 batch do
        let Y be the person's age
        let N be the number of contacts the person has
        produce one output record <Y,N>
    repeat
end function

function Reduce is
    input: age (in years) Y
    for each input record <Y,N> do
        Accumulate in S the sum of N
        Accumulate in C the count of records so far
    repeat
    let A be S/C
    produce one output record <Y,A>
end function
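
A small runnable sketch of the same Map and Reduce functions in Python may help. The `batches` data is made up and stands in for the social.person batches; the names `map_batch` and `reduce_age` are likewise assumptions.

```python
from collections import defaultdict

# Hypothetical stand-in for social.person: batches of (age, contacts) records.
batches = [
    [(30, 10), (30, 20), (41, 5)],
    [(41, 7), (30, 30)],
]

def map_batch(batch):
    """Emit one (Y, N) record per person in the batch."""
    for age, contacts in batch:
        yield (age, contacts)

def reduce_age(age, contact_counts):
    """Average every N value seen for one age Y."""
    s = sum(contact_counts)   # S: sum of N
    c = len(contact_counts)   # C: count of records
    return (age, s / c)

# Group mapper output by age before reducing.
grouped = defaultdict(list)
for batch in batches:
    for age, n in map_batch(batch):
        grouped[age].append(n)

averages = [reduce_age(age, ns) for age, ns in sorted(grouped.items())]
```

Note that this reducer needs all records for a given age in one call: an average of partial averages is not, in general, the overall average.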

11

From NoSQL Distilled:

1. Creating a list with a map

12

2. Aggregating with a reduce

13

3. Partitioning the output of mappers: parallelism & adding a phase that merges the results of the reducers
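
As a rough sketch of partitioning, the Python below splits mapper output by key, reduces each partition separately, and then merges the results. The (product, quantity) data, `NUM_PARTITIONS`, and the toy `partition_for` rule are assumptions; real frameworks use their own partitioning functions.

```python
from collections import defaultdict

# Hypothetical mapper output: (product, quantity) pairs.
mapped = [("tea", 2), ("coffee", 1), ("tea", 3), ("cocoa", 4), ("coffee", 5)]

NUM_PARTITIONS = 2

def partition_for(key):
    """Choose a partition from the key (a stable toy rule, not a real framework's hash)."""
    return sum(key.encode("utf-8")) % NUM_PARTITIONS

# Route every pair to the partition that owns its key.
partitions = [defaultdict(list) for _ in range(NUM_PARTITIONS)]
for key, value in mapped:
    partitions[partition_for(key)][key].append(value)

def reduce_partition(groups):
    """Sum the quantities for each key held by one partition."""
    return {key: sum(values) for key, values in groups.items()}

# Each partition could be reduced on a different machine, in parallel.
partial_results = [reduce_partition(groups) for groups in partitions]

# Merge phase: a key lives in exactly one partition, so merging is a plain union.
merged = {}
for result in partial_results:
    merged.update(result)
```

Because all values for a key land in the same partition, the reducers never need to talk to one another.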

14

4. Introducing a combiner operation to minimize the movement of redundant data – the output format must be the same as the input format
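
A minimal sketch of a combiner in Python, assuming hypothetical (product, quantity) pairs produced on two nodes. The point it illustrates is that `combine` takes and returns the same format, so it can run locally on each node and again as the final reduce.

```python
from collections import defaultdict

def combine(pairs):
    """Sum quantities per product; input and output are both (product, quantity) pairs."""
    totals = defaultdict(int)
    for product, quantity in pairs:
        totals[product] += quantity
    return sorted(totals.items())

# Hypothetical mapper output on two different nodes.
node_a = [("tea", 2), ("tea", 3), ("coffee", 1)]
node_b = [("tea", 4), ("coffee", 5)]

# Combine locally first, so fewer pairs have to cross the network.
combined_a = combine(node_a)
combined_b = combine(node_b)

# The same function serves as the final reducer because the formats match.
final = combine(combined_a + combined_b)
```

Running the combiner first changes how much data moves, not the final totals.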

15

5. A combiner that removes duplicate product-customer pairs

16

6. Concatenating a combining and a reduce (counting) operation
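
Steps 5 and 6 might look like the following in Python: a combiner that drops duplicate product-customer pairs, followed by a reduce that counts customers per product. The sample pairs and the names `dedup_combiner` and `customers_per_product` are assumptions.

```python
from collections import Counter

def dedup_combiner(pairs):
    """Remove duplicate (product, customer) pairs; same format in and out."""
    return sorted(set(pairs))

# Hypothetical mapper output on two nodes.
node_a = [("tea", "ann"), ("tea", "ann"), ("tea", "bob")]
node_b = [("tea", "ann"), ("coffee", "bob")]

# Combining stage: deduplicate locally, then once more across nodes.
local_a = dedup_combiner(node_a)
local_b = dedup_combiner(node_b)
unique_pairs = dedup_combiner(local_a + local_b)

# Reduce (counting) stage: how many distinct customers bought each product.
customers_per_product = dict(Counter(product for product, _customer in unique_pairs))
```

Counting alone cannot serve as a combiner here, since the same customer may appear on several nodes; deduplication has to finish before the count.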

17

7. Maintaining the 1’s counts in the mapping phase

18

8. Adding temporal information to the map/reduce process

19

9. Using a reduce operator to create product per month totals
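
Steps 8 and 9 can be sketched as a mapper that adds the year and month to the key and a reducer that totals quantities per product per month. The line items below are made-up sample data, and the tuple layout is an assumption.

```python
from collections import defaultdict

# Hypothetical order line items: (product, "YYYY-MM-DD", quantity).
line_items = [
    ("tea", "2011-12-03", 2),
    ("tea", "2011-12-19", 3),
    ("tea", "2010-12-07", 4),
    ("coffee", "2011-12-11", 1),
]

def mapper(item):
    """Emit ((product, year, month), quantity) so the key carries the temporal information."""
    product, date, quantity = item
    year, month, _day = date.split("-")
    return ((product, int(year), int(month)), quantity)

# Reduce: sum the quantities that share a (product, year, month) key.
totals = defaultdict(int)
for key, quantity in map(mapper, line_items):
    totals[key] += quantity

monthly_totals = dict(totals)
```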

20

10. A second mapper that creates base year by year comparisons

21

11. A reduce operation combines records for a given year
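
One possible sketch of the second stage (steps 10 and 11) in Python: a second mapper turns each monthly total into partial comparison records, and a reducer merges the partial records that share a key. The input totals, the record fields `current` and `prior`, and the key layout are all assumptions, not the book's exact format.

```python
from collections import defaultdict

# Hypothetical output of the first stage: (product, year, month) -> total quantity.
monthly_totals = {
    ("tea", 2011, 12): 5,
    ("tea", 2010, 12): 4,
    ("coffee", 2011, 12): 1,
}

def second_mapper(key, quantity):
    """Emit partial comparison records keyed by (product, month, comparison year)."""
    product, year, month = key
    # A total is the "current" figure for its own year and the "prior" figure for the next.
    yield ((product, month, year), {"current": quantity})
    yield ((product, month, year + 1), {"prior": quantity})

def second_reducer(records):
    """Merge the incomplete records that share a key into one record."""
    merged = {}
    for record in records:
        merged.update(record)
    return merged

grouped = defaultdict(list)
for key, quantity in monthly_totals.items():
    for out_key, record in second_mapper(key, quantity):
        grouped[out_key].append(record)

comparisons = {key: second_reducer(records) for key, records in grouped.items()}
```

Chaining the stages this way means the second mapper reads only the small output of the first reduce, not the original line items.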

22

Complaints

• M-R is low level.

• It is rigid.

• It exists to optimize the distributed cluster model – only.

• It demands that an application fit perfectly into the paradigm.

• It takes careful planning and knowledge of exactly how the data will be used to structure the database to optimally serve a series of map/reduce operations

• It thus does not accommodate on-the-fly browsing
