Introduction to MapReduce Data Transformations

Introduction to Map/Reduce Data Transformations Tasso Argyros CTO and Co-Founder Aster Data Systems [email protected]

description

MapReduce is a framework for scalable parallel data processing popularized by Google. Although initially used for simple large-scale text processing, Map/Reduce has recently been expanded to serve some application tasks normally performed by traditional relational databases.

You Will Learn
* The basics of Map/Reduce programming in Java
* The application domains where the framework is most appropriate
* How to build analytic database systems that handle large datasets and multiple data sources robustly
* How to evaluate data warehousing vendors in a realistic and unbiased way
* Emerging trends to combine Map/Reduce with standard SQL for improved power and efficiency

Geared To
* Programmers
* Developers
* Database Administrators
* Data warehouse managers
* CIOs
* CTOs

Transcript of Introduction to MapReduce Data Transformations

Page 1: Introduction to MapReduce Data Transformations

Introduction to Map/Reduce Data Transformations

Tasso Argyros, CTO and Co-Founder, Aster Data Systems

[email protected]

Page 2: Introduction to MapReduce Data Transformations

A Brief History of MapReduce

2 Confidential and proprietary. Copyright © 2008 Aster Data Systems

Page 3: Introduction to MapReduce Data Transformations

What is MapReduce?

It’s the simplest API you have ever seen

It has just two functions:
1. Map()
2. Reduce()

Plus: it’s language independent (Java, Perl, Python, …)


Page 4: Introduction to MapReduce Data Transformations

Why is MapReduce Useful?

It simplifies distributed applications…

…by abstracting the details of data distribution (where is the data I need?) and process distribution (where should I run this process?)…

…behind two simple functions.

But let’s see an example


Page 5: Introduction to MapReduce Data Transformations

[Diagram: Server A, Server B, Server C, and Server D are connected through a switch. Between them, the servers store the following documents:]

The quick brown fox jumps over the lazy dog.
To be or not to be: that is the question.
The world only needs five computers.
Hello world.
In-Database MapReduce is the future.
MapReduce is a very powerful programming paradigm.

Page 6: Introduction to MapReduce Data Transformations

Goal: We Want to Count the # of Times Each Word Occurs

Page 7: Introduction to MapReduce Data Transformations

1st Approach: No MapReduce

Page 8: Introduction to MapReduce Data Transformations

[Diagram: Server A, Server B, Server C, and Server D are connected through a switch.]

Documents stored on the servers:
The quick brown fox jumps over the lazy dog
To be or not to be: that is the question.
The world only needs five computers.
Hello world.
In-Database MapReduce is the future.
MapReduce is a very powerful concept.

Word list produced from each document:
the quick brown fox jumps over the lazy dog
in database mapreduce is the future
the world only needs five computers
hello world
mapreduce is a very powerful concept
to be or not to be that is the question

Combined word list collected on a single server:
the quick brown fox jumps over the lazy dog in database mapreduce is the future the world only needs five computers hello world mapreduce is a very powerful concept to be or not to be that is the question

Page 9: Introduction to MapReduce Data Transformations

Server 4 Final Result File

the 5
is 3
mapreduce 2
… …

Page 10: Introduction to MapReduce Data Transformations

What Did We Do?

1. Write a script to parse the documents and output word lists

2. FTP all the word lists to Server 4

3. Write another script to count each word on Server 4

Problem: (2) and (3) do not scale!


Page 11: Introduction to MapReduce Data Transformations

2nd Approach: No MapReduce, Fully Distributed

Page 12: Introduction to MapReduce Data Transformations

[Diagram: Server A, Server B, Server C, and Server D are connected through a switch.]

Documents stored on the servers:
The quick brown fox jumps over the lazy dog
To be or not to be: that is the question.
The world only needs five computers.
Hello world.
In-Database MapReduce is the future.
MapReduce is a very powerful concept.

Word list produced from each document:
the quick brown fox jumps over the lazy dog
in database mapreduce is the future
the world only needs five computers
hello world
mapreduce is a very powerful concept
to be or not to be that is the question

Words held by each server after redistribution (one line per server):
the the the the the database database future
world world powerful lazy brown
mapreduce mapreduce be be to jumps computers hello
is is is question over a that

Page 13: Introduction to MapReduce Data Transformations

Server 1 Final Result File
the 5
… …

Server 2 Final Result File
world 2
… …

Server 3 Final Result File
mapreduce 2
… …

Server 4 Final Result File
is 3
… …
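The fully distributed approach only works if every server applies the same rule to decide which server owns a given word. The slides do not say how the words were assigned, so the sketch below assumes simple hash partitioning with a CRC32 checksum. The names `owner_of`, `redistribute`, and `NUM_SERVERS` are illustrative, and a single Python process stands in for the four servers.

```python
import re
import zlib
from collections import Counter, defaultdict

NUM_SERVERS = 4  # assumption: the four servers shown in the diagram


def owner_of(word, num_servers=NUM_SERVERS):
    """Return the index of the server responsible for counting `word`.

    CRC32 is used instead of Python's built-in hash() because hash() of a
    string differs between interpreter runs, and all servers must agree.
    """
    return zlib.crc32(word.encode("utf-8")) % num_servers


def redistribute(word_lists, num_servers=NUM_SERVERS):
    """Send every word to its owning server; returns one word list per server."""
    inboxes = defaultdict(list)
    for words in word_lists:
        for word in words:
            inboxes[owner_of(word, num_servers)].append(word)
    return [inboxes[i] for i in range(num_servers)]


documents = [
    "The quick brown fox jumps over the lazy dog",
    "To be or not to be: that is the question.",
    "The world only needs five computers.",
    "Hello world.",
    "In-Database MapReduce is the future.",
    "MapReduce is a very powerful concept.",
]

# Step 1: each server parses its own documents into word lists.
word_lists = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]

# Steps 2 and 3: words travel to their owners, and each owner counts locally.
per_server_counts = [Counter(words) for words in redistribute(word_lists)]
```

Because each word has exactly one owner, every per-server result file is final and no single server has to receive all of the data. Which server ends up with which word depends on the hash, so the assignment will not match the slide's layout.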

Page 14: Introduction to MapReduce Data Transformations

2nd Approach: No MapReduce, Distributed


Page 15: Introduction to MapReduce Data Transformations

Does it work? Yes

Is it a pain? Yes!!

Does it take lots of time? Yes!

Would you do it? No!!!

Page 16: Introduction to MapReduce Data Transformations

Moreover…

Who will manage your files?

What if nodes fail?

What if you want to add more nodes?

What if…

What if…

What if…


Page 17: Introduction to MapReduce Data Transformations

Map()

Input: Any file (e.g. documents)

Output: Stream of <key, value> pairs (e.g. <word, count> pairs)

Data Redistribution and Grouping

Reduce()

Input: All <key, value> pairs with the same key, grouped (e.g. all <word, count> pairs where word = “the”)

Output: Anything (e.g. sum of counts for a specific word)
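For word count, the two functions could look roughly like the sketch below. It is a minimal single-process illustration, not any particular framework's API: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, and the small driver only imitates the redistribution and grouping step that a real framework performs across servers.

```python
import re
from collections import defaultdict


def map_fn(document):
    """Map(): emit a <word, 1> pair for every word in one document."""
    for word in re.findall(r"[a-z]+", document.lower()):
        yield word, 1


def reduce_fn(word, counts):
    """Reduce(): receive all counts for one word and emit their sum."""
    return word, sum(counts)


def run_mapreduce(documents, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for document in documents:
        for key, value in map_fn(document):
            groups[key].append(value)  # data redistribution and grouping
    return dict(reduce_fn(key, values) for key, values in groups.items())


result = run_mapreduce(
    [
        "The quick brown fox jumps over the lazy dog",
        "In-Database MapReduce is the future.",
    ],
    map_fn,
    reduce_fn,
)
```

Neither function mentions servers, files, or the network; everything between Map() and Reduce() is left to the driver.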

Page 18: Introduction to MapReduce Data Transformations

Map() and Redistribution Phase

[Diagram: Server A, Server B, Server C, and Server D are connected through a switch.]

Document: The quick brown fox jumps over the lazy dog
Map() output: <the, 1><quick, 1><brown, 1><fox, 1><jumps, 1><over, 1><the, 1><lazy, 1><dog, 1>

Document: In-Database MapReduce is the future.
Map() output: <in, 1><database, 1><mapreduce, 1><is, 1><the, 1><future, 1>

Pairs held by each server after redistribution (one line per server):
<the, 1><the, 1><the, 1><the, 1><the, 1><database, 1><database, 1><future, 1>
<world, 1><world, 1><powerful, 1><lazy, 1><brown, 1>
<mapreduce, 1><mapreduce, 1><be, 1><be, 1><to, 1><jumps, 1><computers, 1><hello, 1>
<is, 1><is, 1><is, 1><question, 1><over, 1><a, 1><that, 1>

Page 19: Introduction to MapReduce Data Transformations

Grouping and Reduce() Phase (on Server 1)

Pairs received by Server 1:
<the, 1><the, 1><the, 1><the, 1><the, 1><database, 1><database, 1><future, 1>

Groups passed to Reduce(), one call per key:
<the, 1><the, 1><the, 1><the, 1><the, 1>
<database, 1><database, 1>
<future, 1>

Server 1 Final Result File
the 5
database 2
future 1

Page 20: Introduction to MapReduce Data Transformations

What Just Happened?

By writing two small scripts with a few lines of code…

…we achieved exactly the same result!

Plus, our code did not have to care about:
• the # of servers on the system (4 or 400?)
• which server to send each word to
• any network communication aspects
• any fault tolerance aspects
• …

Page 21: Introduction to MapReduce Data Transformations

Word Count was Only an Example!

Google does all web indexing on MapReduce

“The indexing code is simpler, smaller, and easier to understand, because the code that deals with fault tolerance, distribution and parallelization is hidden within the MapReduce library. For example, the size of one phase of the computation dropped from approximately 3,800 lines of C++ code to approximately 700 lines when expressed using MapReduce.”

Google 2004 MapReduce paper

Page 22: Introduction to MapReduce Data Transformations

Word Count was Only an Example!

Published work from Stanford University showed that even extremely complex Data Mining algorithms can fit in this very simple model

“We adapt Google’s MapReduce paradigm to demonstrate this parallel speed up technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN).”

Stanford 2006 AI Lab paper

Page 23: Introduction to MapReduce Data Transformations

Result?

MapReduce makes writing parallel programs extremely easy…

…and can accommodate anything from trivial to very complex algorithms…

…thus enabling the processing of petabytes of data with a few lines of code!

Page 24: Introduction to MapReduce Data Transformations

But…

Today MapReduce is used only by hardcore coders/programmers/hackers

Changes in MapReduce queries require changes in the MapReduce code itself
• Constantly keep coding

Using MapReduce with database data is hard and cumbersome…

…when most of the structured data in the enterprise are stored in databases!

Page 25: Introduction to MapReduce Data Transformations

Beyond SQL and MapReduce


Page 26: Introduction to MapReduce Data Transformations

SQL vs MapReduce: Two different worlds?

SQL
• Declarative: specifies what needs to happen
• Execution plans optimized dynamically
• Input/output is structured
• Data redistribution inferred from SQL statement (in MPP databases)

MapReduce
• Procedural: specifies how it needs to happen
• Code compiled once; MapReduce plans are static
• Input/output is unstructured
• Data redistribution based on <keys> in Reduce() phase

Page 27: Introduction to MapReduce Data Transformations

Implementing MR in the Database

Uses polymorphic SQL operators to embed MapReduce functions in SQL

Introduces a “PARTITION BY” clause to specify data redistribution

Introduces a “SEQUENCE BY” clause to specify ordering of data flows to the MR functions

Best of both worlds
• Planning is still dynamic
• MapReduce functions can be used like custom SQL operators
• MapReduce functions can implement any algorithm or transformation
• Code Once – Use Many (through SQL) model

Page 28: Introduction to MapReduce Data Transformations

The SQL/MR Process


Page 29: Introduction to MapReduce Data Transformations

SQL/MR Function: Syntax

    SELECT …                          -- (5) Select output (e.g. count)
    FROM
      MR_Function (                   -- (4) Java/Python/… MR function
          ON source_data              -- (1) Source table or sub-select
          [ PARTITION BY column ]     -- (2) <key> for data redistribution
          [ ORDER BY column ]         -- (3) Sort before the MR function
          [ Function Arguments ]      -- Optional MR_Function arguments
      )
    WHERE …                           -- Optional conditions & filters
    GROUP BY …
    HAVING …
    ORDER BY …
    LIMIT …;
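The numbered annotations suggest an evaluation order: read the source, redistribute by key, sort each partition, call the MR function, then select from its output. A rough Python analogy of that order is sketched below. It assumes rows are dictionaries and that the MR function receives one sorted partition at a time; `run_mr_function` and `row_numbers` are hypothetical names and not part of Aster's API. A real MPP database would redistribute rows across servers instead of sorting them in one process.

```python
from itertools import groupby
from operator import itemgetter


def run_mr_function(source_rows, mr_function, partition_by=None,
                    order_by=None, **function_args):
    """Mimic steps (1) to (4) of the SQL/MR syntax in a single process."""
    rows = list(source_rows)                      # (1) source table or sub-select
    if partition_by is None:
        partitions = [rows]                       # no key: a single partition
    else:
        key = itemgetter(partition_by)
        rows.sort(key=key)                        # (2) bring equal keys together
        partitions = [list(group) for _, group in groupby(rows, key=key)]
    output = []
    for partition in partitions:
        if order_by is not None:
            partition.sort(key=itemgetter(order_by))  # (3) sort before the MR function
        output.extend(mr_function(partition, **function_args))  # (4) MR function
    return output


def row_numbers(partition, start=1):
    """Hypothetical MR function: number the rows inside each partition."""
    for offset, row in enumerate(partition):
        yield {**row, "n": start + offset}


clicks = [
    {"userid": "b", "ts": 5},
    {"userid": "a", "ts": 9},
    {"userid": "a", "ts": 2},
]
numbered = run_mr_function(clicks, row_numbers,
                           partition_by="userid", order_by="ts")

# (5) select output: an ordinary filter and projection over the returned rows
first_clicks = [(row["userid"], row["ts"]) for row in numbered if row["n"] == 1]
```

The function body never sees the partitioning or ordering logic, which is the sense in which the same function can be reused from many different queries.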

Page 30: Introduction to MapReduce Data Transformations

Example 1: Tokenization

Demo #1: Only Map (Tokenization) in SQL/MR
    SELECT word, count(*) AS wordcount
    FROM Tokenize( ON blogs )
    GROUP BY word
    ORDER BY wordcount DESC
    LIMIT 20;

Demo #2: Map (Tokenization) and Reduce (WordCount) in SQL/MR
    SELECT key AS word, value AS wordcount
    FROM WordCountReduce (
          ON Tokenize ( ON blogs )
          PARTITION BY key
          )
    ORDER BY wordcount DESC
    LIMIT 20;

Demo #3: Why do Reduce when you have SQL?
    SELECT word, count(*) AS wordcount
    FROM Tokenize( ON blogs )
    GROUP BY word
    ORDER BY wordcount DESC
    LIMIT 20;
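Demo #1 and Demo #3 pair a map-only function with ordinary SQL aggregation. A rough Python equivalent is sketched below. It assumes that `blogs` is a list of rows with a hypothetical `body` text column (the slides do not show the table's schema) and that Tokenize lowercases the text and splits on non-letter characters.

```python
import re
from collections import Counter


def tokenize(rows, text_column="body"):
    """Map-only function: emit one output row per word in each input row."""
    for row in rows:
        for word in re.findall(r"[a-z]+", row[text_column].lower()):
            yield {"word": word}


blogs = [
    {"body": "MapReduce is a very powerful concept."},
    {"body": "In-Database MapReduce is the future."},
    {"body": "Hello world."},
]

# GROUP BY word with count(*), then ORDER BY wordcount DESC and LIMIT 20
wordcount = Counter(row["word"] for row in tokenize(blogs))
top_20 = wordcount.most_common(20)
```

The counting step needs no custom code, which is the point of Demo #3: once the function has turned text into rows, the aggregation can be left to SQL.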

Page 31: Introduction to MapReduce Data Transformations

Example 2: Sessionization

What Is Sessionize?

An example Aster SQL/MR function

Leverages Aster’s Java library API

What Does It Do?

User specifies a column (e.g. timestamp) and a session timeout value (in seconds)

Spits out unique session identifiers (sessionid column)

Usage
    CREATE TABLE sessionized_clicks AS
    SELECT ts, userid, sessionid, ...
    FROM Sessionize(
          ON clicks
          PARTITION BY userid
          ORDER BY ts
          TIMEOUT 60
          );

Page 32: Introduction to MapReduce Data Transformations

Example 2: Sessionization

Session Timeout = 60 seconds

INPUT (Clickstream)

timestamp   userid
10:00:00    Shawn1
00:58:24    PrezBush
10:00:24    Shawn1
02:30:33    PrezBush
10:01:23    Shawn1
10:02:40    Shawn1

OUTPUT

timestamp   userid     sessionid
10:00:00    Shawn1     0
10:00:24    Shawn1     0
10:01:23    Shawn1     0
10:02:40    Shawn1     1

timestamp   userid     sessionid
00:58:24    PrezBush   0
02:30:33    PrezBush   1
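The logic behind this example can be sketched in a few lines of Python. This is a stand-in for illustration, not Aster's Java implementation. It assumes that a new session starts when the gap to the same user's previous click exceeds the timeout, which is consistent with the sample data (a 59-second gap stays in session 0, a 77-second gap starts session 1), and that timestamps are `HH:MM:SS` strings from a single day.

```python
from itertools import groupby
from operator import itemgetter


def to_seconds(hhmmss):
    """Convert an 'HH:MM:SS' string to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def sessionize(clicks, timeout=60):
    """Yield (timestamp, userid, sessionid) rows.

    Rows are partitioned by userid and ordered by timestamp. A user's
    session id increases whenever the gap to that user's previous click
    exceeds `timeout` seconds.
    """
    ordered = sorted(clicks, key=lambda click: (click[1], to_seconds(click[0])))
    for userid, rows in groupby(ordered, key=itemgetter(1)):
        sessionid, previous = 0, None
        for ts, _ in rows:
            now = to_seconds(ts)
            if previous is not None and now - previous > timeout:
                sessionid += 1
            yield ts, userid, sessionid
            previous = now


clicks = [
    ("10:00:00", "Shawn1"), ("00:58:24", "PrezBush"), ("10:00:24", "Shawn1"),
    ("02:30:33", "PrezBush"), ("10:01:23", "Shawn1"), ("10:02:40", "Shawn1"),
]
sessionized_clicks = list(sessionize(clicks))
```

Sorting by user and then by time plays the role of PARTITION BY userid and ORDER BY ts, so the per-user loop only ever has to compare a click with the one before it.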

Page 33: Introduction to MapReduce Data Transformations

MR Applications in the Database

ELT
• Text and data transformations, in-parallel, in-database

Queries that become too complex for SQL
• E.g. Sessionize(), customer segmentation, predictive analytics, …

Queries that SQL inherently cannot handle well
• Time series analytics
• Aster has a set of pre-defined SQL/MR functions for this

Data structures that do not fit the relational model well
• Time series (again)
• Graphs, spatial data

Any analytical or reporting application that requires more performance and data proximity!

Page 34: Introduction to MapReduce Data Transformations

Summary

Growing challenges in scaling analytical applications and reporting

MapReduce is driving a data revolution (see: Google)

In-Database MapReduce will open up databases to a host of new applications

[email protected] (Questions, Comments)

asterdata.com/blog (Lots of technical details)

1.888.Aster.Data (Any other information)