Introduction to MapReduce Data Transformations


Tasso Argyros, CTO and Co-Founder, Aster Data Systems

[email protected]

A Brief History of MapReduce

Confidential and proprietary. Copyright © 2008 Aster Data Systems

What is MapReduce?

It’s the simplest API you have ever seen

It has just two functions: (1) Map() and (2) Reduce()

Plus: it’s language independent (Java, Perl, Python, …)


Why is MapReduce Useful?

It simplifies distributed applications…

…by abstracting the details of data distribution (where is the data I need?) and process distribution (where should I run this process?)…

…behind two simple functions.

But let’s see an example


[Figure: four servers (Server A, Server B, Server C, Server D) connected through a switch. Together they store the following documents:]

The quick brown fox jumps over the lazy dog.
To be or not to be: that is the question.
The world only needs five computers.
Hello world.
In-Database MapReduce is the future.
MapReduce is a very powerful programming paradigm.


Goal: We want to count the number of times each word occurs


1st Approach: No MapReduce


[Figure: each server parses its documents into a lowercase word list, for example "the quick brown fox jumps over the lazy dog". All word lists are sent through the switch to Server 4, where they are concatenated into one long list.]


Server 4 Final Result File

the 5
is 3
mapreduce 2
… …


What Did We Do?

1. Write a script to parse the documents and output word lists

2. FTP all the word lists to Server 4

3. Write another script to count each word on Server 4

Problem: (2) and (3) do not scale!
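The three steps above can be sketched in a few lines of Python. This is only an illustration: the assignment of documents to servers, the dictionary name DOCUMENTS_BY_SERVER and the function names are assumptions made for the sketch, and the "FTP" step is simulated by copying lists in memory.

```python
import re
from collections import Counter

# Assumption: which document lives on which server is arbitrary here.
DOCUMENTS_BY_SERVER = {
    "A": ["The quick brown fox jumps over the lazy dog.",
          "To be or not to be: that is the question."],
    "B": ["The world only needs five computers.", "Hello world."],
    "C": ["In-Database MapReduce is the future."],
    "D": ["MapReduce is a very powerful programming paradigm."],
}


def parse_to_word_list(text):
    """Step 1: lowercase the text and split it into words."""
    return re.findall(r"[a-z]+", text.lower())


def count_centrally(documents_by_server):
    """Steps 2 and 3: gather every word list on one machine and count there."""
    all_words = []
    for docs in documents_by_server.values():
        for doc in docs:
            # Stand-in for shipping the word list to the central server.
            all_words.extend(parse_to_word_list(doc))
    # A single machine holds and counts every word: this is what does not scale.
    return Counter(all_words)


counts = count_centrally(DOCUMENTS_BY_SERVER)
print(counts["the"], counts["is"], counts["mapreduce"])
```

Both the transfer and the final count run through one machine, so the work on that machine grows with the total amount of text.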


2nd Approach: No MapReduce, Fully Distributed







[Figure: each server parses its documents into a word list, then sends every word through the switch to the server responsible for that word, so that all copies of a word end up on the same server:]

Server 1: the, the, the, the, the, database, database, future
Server 2: world, world, powerful, lazy, brown
Server 3: mapreduce, mapreduce, be, be, to, jumps, computers, hello
Server 4: is, is, is, question, over, a, that

Server 1 Final Result File: the 5, …

Server 2 Final Result File: world 2, …

Server 3 Final Result File: mapreduce 2, …

Server 4 Final Result File: is 3, …

2nd Approach: No MapReduce, Distributed


Does it work? Yes

Is it a pain? Yes!!

Does it take lots of time? Yes!

Would you do it? No!!!
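To make the pain of the second approach concrete, here is a rough sketch of the routing logic that would have to be written and maintained by hand. The routing rule (a CRC32 hash of the word modulo the number of servers) and every function name are assumptions chosen for illustration, so the word-to-server assignment will not match the one drawn on the slide.

```python
import zlib
from collections import Counter, defaultdict

N_SERVERS = 4  # assumption: hard-coded, which is part of the problem


def destination(word, n_servers=N_SERVERS):
    """Pick the server that owns a word; crc32 gives the same answer on every machine."""
    return zlib.crc32(word.encode("utf-8")) % n_servers


def split_for_shipping(word_list):
    """On one server: split the local word list into one outgoing list per destination."""
    outgoing = defaultdict(list)
    for word in word_list:
        outgoing[destination(word)].append(word)
    return outgoing


def count_received(received_lists):
    """On one server: count the words received from all servers."""
    counts = Counter()
    for words in received_lists:
        counts.update(words)
    return counts


# Word lists produced locally on two of the servers.
local_lists = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["to", "be", "or", "not", "to", "be", "that", "is", "the", "question"],
]

# Simulated network transfer: inbox[server] collects the lists sent to it.
inbox = defaultdict(list)
for word_list in local_lists:
    for server, words in split_for_shipping(word_list).items():
        inbox[server].append(words)

results = {server: count_received(lists) for server, lists in inbox.items()}
print(results[destination("the")]["the"])
```

Even this toy version has to agree on a routing rule across machines and fix the server count in code, and it does nothing about file management or failed nodes.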


Moreover…

Who will manage your files?

What if nodes fail?

What if you want to add more nodes?

What if…

What if…

What if…


Map()

Input: Any file (e.g. documents)

Output: Stream of <key, value> pairs (e.g. <word, count> pairs)

Data Redistribution and Grouping

Reduce()

Input: All <key, value> pairs with the same key, grouped (e.g. all <word, count> pairs where word = “the”)

Output: Anything (e.g. sum of counts for a specific word)
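For word count, the two functions could look roughly like the sketch below. The names map_fn and reduce_fn and their exact signatures are assumptions for illustration; real MapReduce libraries each define their own interfaces.

```python
import re
from typing import Iterable, Iterator, Tuple


def map_fn(document: str) -> Iterator[Tuple[str, int]]:
    """Map(): emit one <word, 1> pair for every word in the input document."""
    for word in re.findall(r"[a-z]+", document.lower()):
        yield (word, 1)


def reduce_fn(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Reduce(): receive all values emitted for one key and sum them."""
    return (key, sum(values))


pairs = list(map_fn("The quick brown fox jumps over the lazy dog."))
# In a real system the framework does this grouping; here it is done by hand.
the_values = [value for key, value in pairs if key == "the"]
print(reduce_fn("the", the_values))
```

Neither function mentions servers, files or the network; redistribution and grouping happen between the two calls and are the framework's job.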






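Map() and Redistribution Phase

[Figure: Map() runs on each server and emits one <word, 1> pair per word, for example:]

<the, 1> <quick, 1> <brown, 1> <fox, 1> <jumps, 1> <over, 1> <the, 1> <lazy, 1> <dog, 1>
<in, 1> <database, 1> <mapreduce, 1> <is, 1> <the, 1> <future, 1>

[The pairs are then redistributed through the switch so that all pairs with the same key land on the same server, one list per server:]

<the, 1> <the, 1> <the, 1> <the, 1> <the, 1> <database, 1> <database, 1> <future, 1>
<world, 1> <world, 1> <powerful, 1> <lazy, 1> <brown, 1>
<mapreduce, 1> <mapreduce, 1> <be, 1> <be, 1> <to, 1> <jumps, 1> <computers, 1> <hello, 1>
<is, 1> <is, 1> <is, 1> <question, 1> <over, 1> <a, 1> <that, 1>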


Grouping and Reduce() Phase (on Server 1)

[Figure: Server 1 groups the pairs it received by key and calls Reduce() once per group:]

Received: <the, 1> <the, 1> <the, 1> <the, 1> <the, 1> <database, 1> <database, 1> <future, 1>

Group 1: <the, 1> <the, 1> <the, 1> <the, 1> <the, 1>
Group 2: <database, 1> <database, 1>
Group 3: <future, 1>

Server 1 Final Result File

the 5
database 2
future 1


What Just Happened?

By writing two small scripts with a few lines of code…

…we achieved exactly the same result!

Plus, our code did not have to care about:
• the # of servers on the system (4 or 400?)
• which server to send each word to
• any network communication aspects
• any fault tolerance aspects
• …
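The point about the number of servers can be illustrated with a toy, single-process stand-in for the framework. Everything here is an assumption made for the sketch: run_mapreduce is a hypothetical driver, the "servers" are in-memory dictionaries, keys are routed by a CRC32 hash, and fault tolerance is not modelled at all.

```python
import re
import zlib
from collections import defaultdict

DOCUMENTS = [
    "The quick brown fox jumps over the lazy dog.",
    "To be or not to be: that is the question.",
    "The world only needs five computers.",
    "Hello world.",
    "In-Database MapReduce is the future.",
    "MapReduce is a very powerful programming paradigm.",
]


def map_fn(document):
    """User code, part 1: emit <word, 1> for every word."""
    for word in re.findall(r"[a-z]+", document.lower()):
        yield (word, 1)


def reduce_fn(key, values):
    """User code, part 2: sum the counts for one word."""
    return (key, sum(values))


def run_mapreduce(documents, map_fn, reduce_fn, n_servers):
    """Hypothetical driver: map, redistribute by key, group, then reduce."""
    # One dictionary per simulated server: key -> list of values.
    inboxes = [defaultdict(list) for _ in range(n_servers)]
    for document in documents:
        for key, value in map_fn(document):
            server = zlib.crc32(key.encode("utf-8")) % n_servers
            inboxes[server][key].append(value)
    # Each simulated server reduces only the groups it owns.
    results = {}
    for groups in inboxes:
        for key, values in groups.items():
            out_key, out_value = reduce_fn(key, values)
            results[out_key] = out_value
    return results


small = run_mapreduce(DOCUMENTS, map_fn, reduce_fn, n_servers=4)
large = run_mapreduce(DOCUMENTS, map_fn, reduce_fn, n_servers=400)
print(small == large, small["the"])
```

The user-written map_fn and reduce_fn are identical in both runs; only the driver's n_servers argument changes.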


Word Count was Only an Example!

Google does all web indexing on MapReduce

“The indexing code is simpler, smaller, and easier to understand, because the code that deals with fault tolerance, distribution and parallelization is hidden within the MapReduce library. For example, the size of one phase of the computation dropped from approximately 3,800 lines of C++ code to approximately 700 lines when expressed using MapReduce.”

Google 2004 MapReduce paper


Word Count was Only an Example!

Published work from Stanford University showed that even extremely complex Data Mining algorithms can fit in this very simple model

“We adapt Google’s MapReduce paradigm to demonstrate this parallel speed up technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN).”

Stanford 2006 AI Lab paper


Result?

MapReduce makes writing parallel programs extremely easy…

…and can accommodate anything from trivial to very complex algorithms…

…thus enabling the processing of petabytes of data with a few lines of code!


But…

Today MapReduce is used only by hardcore coders/programmers/hackers

Changes in MapReduce queries require changes in the MapReduce code itself
• Constantly keep coding

Using MapReduce with database data is hard and cumbersome…

…when most of the structured data in the enterprise are stored in databases!


Beyond SQL and MapReduce


SQL vs MapReduce: Two different worlds?

SQL
• Declarative: specifies what needs to happen
• Execution plans optimized dynamically
• Input/output is structured
• Data redistribution inferred from SQL statement (in MPP databases)

MapReduce
• Procedural: specifies how it needs to happen
• Code compiled once; MapReduce plans are static
• Input/output is unstructured
• Data redistribution based on <keys> in Reduce() phase


Implementing MR in the Database

Uses polymorphic SQL operators to embed MapReduce functions into SQL

Introduces a “PARTITION BY” clause to specify data redistribution

Introduces a “SEQUENCE BY” clause to specify ordering of data flows to the MR functions

Best of both worlds
• Planning is still dynamic
• MapReduce functions can be used like custom SQL operators
• MapReduce functions can implement any algorithm or transformation
• Code Once, Use Many (through SQL) model


The SQL/MR Process


SQL/MR Function: Syntax

SELECT …
FROM MR_Function (
    ON source_data
    [ PARTITION BY column ]
    [ ORDER BY column ]
    [ Function Arguments ]
)
WHERE …
GROUP BY …
HAVING …
ORDER BY …
LIMIT …;

(1) ON source_data: source table or sub-select
(2) PARTITION BY: <key> for data redistribution
(3) ORDER BY: sort before the MR function
(4) MR_Function: Java/Python/… MR function
(5) SELECT: select output (e.g. count)
Function Arguments: optional MR_Function arguments
WHERE … LIMIT …: optional conditions & filters


Example 1: Tokenization

Demo #1: Only Map (Tokenization) in SQL/MR

SELECT word, count(*) AS wordcount
FROM Tokenize( ON blogs )
GROUP BY word
ORDER BY wordcount DESC
LIMIT 20;

Demo #2: Map (Tokenization) and Reduce (WordCount) in SQL/MR

SELECT key AS word, value AS wordcount
FROM WordCountReduce (
    ON Tokenize ( ON blogs )
    PARTITION BY key
)
ORDER BY wordcount DESC
LIMIT 20;

Demo #3: Why do Reduce when you have SQL?

SELECT word, count(*) AS wordcount
FROM Tokenize( ON blogs )
GROUP BY word
ORDER BY wordcount DESC
LIMIT 20;


Example 2: Sessionization

What Is Sessionize?

An example Aster SQL/MR function

Leverages Aster’s Java library API

What Does It Do?

User specifies a column (e.g. timestamp) and a session timeout value (in seconds)

Outputs session identifiers (sessionid column)

Usage

CREATE TABLE sessionized_clicks AS
SELECT ts, userid, sessionid, ...
FROM Sessionize(
    ON clicks
    PARTITION BY userid
    ORDER BY ts
    TIMEOUT 60
);


Example 2: Sessionization

Session Timeout = 60 seconds

INPUT (Clickstream)

timestamp  userid
10:00:00   Shawn1
00:58:24   PrezBush
10:00:24   Shawn1
02:30:33   PrezBush
10:01:23   Shawn1
10:02:40   Shawn1

OUTPUT

timestamp  userid    sessionid
10:00:00   Shawn1    0
10:00:24   Shawn1    0
10:01:23   Shawn1    0
10:02:40   Shawn1    1

timestamp  userid    sessionid
00:58:24   PrezBush  0
02:30:33   PrezBush  1
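The behaviour shown in this example can be approximated in plain Python as follows. This is not Aster's implementation: the function name and signature are assumptions, and so is the rule that a new session starts when the gap to the previous click of the same user is strictly greater than the timeout (the example is consistent with that rule, but does not pin down the boundary case).

```python
from collections import defaultdict
from datetime import datetime


def sessionize(rows, timeout_seconds):
    """rows: (timestamp 'HH:MM:SS', userid) pairs -> (timestamp, userid, sessionid)."""
    # Equivalent of PARTITION BY userid: collect each user's clicks together.
    by_user = defaultdict(list)
    for ts, userid in rows:
        by_user[userid].append(ts)

    output = []
    for userid, stamps in by_user.items():
        session_id = 0
        previous = None
        # Equivalent of ORDER BY ts: zero-padded HH:MM:SS strings sort correctly.
        for ts in sorted(stamps):
            current = datetime.strptime(ts, "%H:%M:%S")
            # Assumed rule: a gap longer than the timeout opens a new session.
            if previous is not None and (current - previous).total_seconds() > timeout_seconds:
                session_id += 1
            output.append((ts, userid, session_id))
            previous = current
    return output


clicks = [
    ("10:00:00", "Shawn1"),
    ("00:58:24", "PrezBush"),
    ("10:00:24", "Shawn1"),
    ("02:30:33", "PrezBush"),
    ("10:01:23", "Shawn1"),
    ("10:02:40", "Shawn1"),
]

for row in sessionize(clicks, timeout_seconds=60):
    print(row)
```

Because each user's clicks are processed independently of every other user's, the partitions could be handled on different servers in parallel, which is what PARTITION BY userid expresses in the SQL/MR call.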


MR Applications in the Database

ELT
• Text and data transformations, in-parallel, in-database

Queries that become too complex for SQL
• E.g. Sessionize(), customer segmentation, predictive analytics, …

Queries that SQL inherently cannot handle well
• Time series analytics
• Aster has a set of pre-defined SQL/MR functions for this

Data structures that do not fit the relational model well
• Time series (again)
• Graphs, spatial data

Any analytical or reporting application that requires more performance and data proximity!


Summary

Growing challenges in scaling analytical applications and reporting

MapReduce is driving a data revolution (see: Google)

In-Database MapReduce will open up databases to a host of new applications

[email protected] (Questions, Comments)

asterdata.com/blog (Lots of technical details)

1.888.Aster.Data (Any other information)

