A Comparison of Approaches to Large-Scale Data Analysis Andrew Pavlo, Erik Paulson, Alexander Rasin,...

A Comparison of Approaches to Large-Scale Data Analysis

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker

SIGMOD 2009

2009-10-09

Summarized by Jaeseok Myung

Intelligent Database Systems Lab
School of Computer Science & Engineering
Seoul National University, Seoul, Korea

Copyright 2009 by CEBT, Center for E-Business Technology

MapReduce vs. Parallel DBMS

MapReduce

Source: 한재선 (Jaesun Han), SearchDay2008, http://nexr.tistory.com

Architectural Differences

                    Parallel DBMS                  MapReduce
Schema Support      O                              X
Indexing            O                              X
Programming Model   Stating what you want (SQL)    Presenting an algorithm (C/C++, Java, …)
Optimization        O                              X
Flexibility                                        Good
Fault Tolerance                                    Good

(O = supported, X = not supported)

Benchmark Environment (1/2)

Systems

Hadoop: The most popular open-source MR implementation

DBMS-X: a parallel DBMS that stores data in a row-based format

Vertica: a column-based parallel DBMS

All three systems were deployed on a 100-node cluster

Analytical Tasks

Data Loading

Selection Task

Aggregation Task

Join Task

UDF Aggregation Task

Benchmark Environment (2/2)

Dataset

Documents: 600,000 unique documents for each node

155 million UserVisits records (20GB/node)

18 million Rankings records (1GB/node)

1. Data Loading

[Figure: data loading results; annotations: loading time, reorganization]

2. Selection Task

The selection task is a lightweight filter to find the pageURLs in the Rankings table (1GB/node) with a pageRank above a user-defined threshold

Query

SELECT pageURL, pageRank FROM Rankings WHERE pageRank > x;

x = 10, which yields approximately 36,000 records per data file on each node

For MR, the same task is implemented in Java
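The filter above maps naturally onto a map-only job. The following is a minimal sketch, written in Python rather than Java for brevity. The `pageURL|pageRank` record layout and the names `select_map` and `run_map_only` are assumptions made for illustration, not taken from the benchmark code.

```python
from typing import Iterable, Iterator, List, Tuple


def select_map(line: str, threshold: int) -> Iterator[Tuple[str, int]]:
    """Map function: emit (pageURL, pageRank) when pageRank exceeds the threshold."""
    # Assumed record layout: "pageURL|pageRank" (hypothetical delimiter and order).
    page_url, page_rank = line.split("|")[:2]
    rank = int(page_rank)
    if rank > threshold:
        yield page_url, rank


def run_map_only(lines: Iterable[str], threshold: int) -> List[Tuple[str, int]]:
    """Apply the map function to every record; a pure filter needs no reduce step."""
    output = []
    for line in lines:
        output.extend(select_map(line, threshold))
    return output


rankings = ["a.com|3", "b.com|42", "c.com|10", "d.com|11"]
selected = run_map_only(rankings, threshold=10)
```

With x = 10, only records whose pageRank is strictly greater than 10 survive, matching the `WHERE pageRank > x` predicate.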

2. Selection Task - Result

[Figure: selection task results; annotations: processing time, and time for combining the output into a single file (additional MR job)]

3. Aggregation Task

The aggregation task calculates the total adRevenue generated for each sourceIP in the UserVisits table (20GB/node), grouped by the sourceIP column

Query

SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;

This task always produces 2.5 million records
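A minimal sketch of how this GROUP BY could be expressed in the MapReduce style, in Python for brevity. The map step emits (sourceIP, adRevenue) pairs and the reduce step sums them per key. The per-partition pre-summing (a combiner-style step) and the names `agg_map`, `agg_reduce`, and `run_job` are illustrative assumptions, not the benchmark's code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, float]  # (sourceIP, adRevenue)


def agg_map(record: Record) -> Iterator[Record]:
    """Map function: emit the grouping key and the value to be summed."""
    source_ip, ad_revenue = record
    yield source_ip, ad_revenue


def agg_reduce(source_ip: str, revenues: Iterable[float]) -> Record:
    """Reduce function: sum all revenue values seen for one sourceIP."""
    return source_ip, sum(revenues)


def run_job(partitions: List[List[Record]]) -> Dict[str, float]:
    """Run map, a combiner-style local pre-sum, shuffle, and reduce in one process."""
    shuffled = defaultdict(list)
    for partition in partitions:
        local = defaultdict(float)  # local pre-aggregation for this partition
        for record in partition:
            for key, value in agg_map(record):
                local[key] += value
        for key, value in local.items():
            shuffled[key].append(value)  # one partial sum per partition and key
    return dict(agg_reduce(key, values) for key, values in shuffled.items())


partitions = [
    [("1.1.1.1", 1.5), ("2.2.2.2", 2.0)],
    [("1.1.1.1", 0.25)],
]
totals = run_job(partitions)
```

Pre-summing inside each partition reduces the amount of data that has to cross the network before the reduce step, which matters when the input is 20GB per node.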

3. Aggregation Task - Result

4. Join Task

The join task consists of two sub-tasks that perform a complex calculation on two data sets

In the first part of the task, each system must find the sourceIP that generated the most revenue within a particular date range

Once these intermediate records are generated, the system must then calculate the average pageRank of all the pages visited during this interval

Query

SELECT INTO Temp sourceIP, AVG(pageRank) as avgPageRank, SUM(adRevenue) as totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22') GROUP BY UV.sourceIP;

SELECT sourceIP, totalRevenue, avgPageRank FROM Temp ORDER BY totalRevenue DESC LIMIT 1;
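The two queries above can be read as three steps: filter and join, aggregate per sourceIP, then pick the top row. The sketch below walks through those steps in plain Python on in-memory tuples. The tuple layouts and the name `join_task` are assumptions for illustration; it is not the MR program used in the benchmark.

```python
from collections import defaultdict
from datetime import date


def join_task(rankings, user_visits, start, end):
    """Return (sourceIP, totalRevenue, avgPageRank) for the top-revenue sourceIP."""
    # Step 1: filter visits by date and join with Rankings on the URL.
    by_url = defaultdict(lambda: {"rank": None, "visits": []})
    for page_url, page_rank in rankings:
        by_url[page_url]["rank"] = page_rank
    for source_ip, dest_url, visit_date, ad_revenue in user_visits:
        if start <= visit_date <= end:  # BETWEEN is inclusive on both ends
            by_url[dest_url]["visits"].append((source_ip, ad_revenue))

    # Step 2: aggregate the joined rows per sourceIP.
    per_ip = defaultdict(lambda: [0, 0, 0.0])  # [rank sum, row count, revenue]
    for entry in by_url.values():
        if entry["rank"] is None:
            continue  # visited URL has no Rankings row, so it does not join
        for source_ip, ad_revenue in entry["visits"]:
            acc = per_ip[source_ip]
            acc[0] += entry["rank"]
            acc[1] += 1
            acc[2] += ad_revenue

    # Step 3: ORDER BY totalRevenue DESC LIMIT 1.
    if not per_ip:
        return None
    best_ip, (rank_sum, count, revenue) = max(
        per_ip.items(), key=lambda item: item[1][2]
    )
    return best_ip, revenue, rank_sum / count


rankings = [("a.com", 10), ("b.com", 20), ("c.com", 30)]
user_visits = [
    ("1.1.1.1", "a.com", date(2000, 1, 16), 1.0),
    ("1.1.1.1", "b.com", date(2000, 1, 20), 2.0),
    ("2.2.2.2", "c.com", date(2000, 1, 17), 2.5),
    ("2.2.2.2", "a.com", date(2000, 2, 1), 9.0),  # outside the date range
]
result = join_task(rankings, user_visits, date(2000, 1, 15), date(2000, 1, 22))
```

In SQL the system is free to choose how to carry out these steps; in MR each step has to be written out, and typically chained as separate jobs.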

4. Join Task - Result

5. UDF Aggregation Task

The final task is to compute the inlink count for each document in the dataset

Query

SELECT INTO Temp F(contents) FROM Documents;

– F : a user-defined function that parses the contents of each record in the Documents table and emits URLs into the database

– With this function F, we populate a temporary table with a list of URLs and then can execute a simple query to calculate the inlink count

SELECT url, SUM(value) FROM Temp GROUP BY url;
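A minimal sketch of the same computation outside the database, in Python. Here `extract_urls` plays the role of F, and the counting loop plays the role of the GROUP BY query. The regular expression is a simplifying assumption about what counts as a URL, and both function names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List

# Simplified URL pattern (an assumption): http(s) links up to whitespace or a quote.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(contents: str) -> List[str]:
    """Stand-in for F: parse one document and return every URL it references."""
    return URL_PATTERN.findall(contents)


def inlink_counts(documents: Iterable[str]) -> Counter:
    """Count how many times each URL is referenced across all documents."""
    counts = Counter()
    for contents in documents:
        for url in extract_urls(contents):
            counts[url] += 1  # equivalent to emitting (url, 1) and summing per url
    return counts


documents = [
    '<a href="http://a.com/x">x</a> see http://b.com/y',
    "link http://a.com/x again",
]
counts = inlink_counts(documents)
```

In MapReduce terms the extraction is the map step and the per-URL sum is the reduce step, so no temporary table is needed.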

5. UDF Aggregation Task - Result

Conclusion

MapReduce < Parallel DBMS

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, Alexander Rasin

VLDB 2009

2009-10-09

Summarized by Jaeseok Myung

Intelligent Database Systems Lab
School of Computer Science & Engineering
Seoul National University, Seoul, Korea

HadoopDB

The Basic Idea (An Architectural Hybrid of MR & DBMS)

Use MR as the communication layer above multiple nodes running single-node DBMS instances

Queries are expressed in SQL, translated into MR by extending existing tools, and as much work as possible is pushed into the higher-performing single-node databases
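The idea of pushing work into per-node databases and using MR only to combine results can be sketched as follows. This is a toy illustration, not HadoopDB's implementation: in-memory SQLite databases stand in for the single-node DBMS instances, and the names `make_node`, `map_phase`, and `reduce_phase` are hypothetical.

```python
import sqlite3
from collections import defaultdict

# SQL pushed down to every node; each node aggregates its own partition.
LOCAL_SQL = "SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP"


def make_node(rows):
    """Create one single-node database holding a partition of UserVisits."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE UserVisits (sourceIP TEXT, adRevenue REAL)")
    conn.executemany("INSERT INTO UserVisits VALUES (?, ?)", rows)
    return conn


def map_phase(node):
    """Map step: run the pushed-down SQL inside the node's own database."""
    return node.execute(LOCAL_SQL).fetchall()


def reduce_phase(partials):
    """Reduce step: merge the per-node partial sums into global totals."""
    totals = defaultdict(float)
    for partial in partials:
        for source_ip, revenue in partial:
            totals[source_ip] += revenue
    return dict(totals)


nodes = [
    make_node([("1.1.1.1", 1.5), ("2.2.2.2", 2.0)]),
    make_node([("1.1.1.1", 0.25)]),
]
hybrid_totals = reduce_phase(map_phase(node) for node in nodes)
```

Each database does the scan and local GROUP BY, so the MR layer only moves small partial results between nodes.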

The Architecture of HadoopDB

HadoopDB – Join Task
