A Comparison of Approaches to Large-Scale Data Analysis Andrew Pavlo, Erik Paulson, Alexander Rasin,...

Post on 18-Jan-2016

218 views 0 download

Transcript of A Comparison of Approaches to Large-Scale Data Analysis Andrew Pavlo, Erik Paulson, Alexander Rasin,...

A Comparison of Approaches to Large-Scale Data AnalysisA Comparison of Approaches to Large-Scale Data Analysis

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. Dewitt,

Samuel Madden, Michael Stonebraker

SIGMOD 2009

2009-10-09

Summarized by Jaeseok Myung

Intelligent Database Systems LabSchool of Computer Science & EngineeringSeoul National University, Seoul, Korea

Copyright 2009 by CEBTCenter for E-Business Technology

MapReduce vs. Parallel DBMS

Copyright 2009 by CEBT

MapReduceMapReduce

Center for E-Business Technology

한재선 , SearchDay2008, http://nexr.tistory.com

Copyright 2009 by CEBT

Architectural DifferencesArchitectural Differences

Parallel DBMS MapReduce

Schema Support O X

Indexing O X

Programming ModelStating what you

want(SQL)

Presenting an algorithm

(C/C++, Java, …)

Optimization O X

Flexibility Good

Fault Tolerance Good

Center for E-Business Technology

Copyright 2009 by CEBT

Benchmark Environment (1/2)Benchmark Environment (1/2)

Systems

Hadoop: The most popular open-source MR implementation

DBMS-X: a parallel DBMS that stores data in a row-based format

Vertica: a column-based parallel DBMS

All Three systems were deployed on a 100-node cluster

Analytical Tasks

Data Loading

Selection Task

Aggregation Task

Join Task

UDF Aggregation Task

Center for E-Business Technology

Copyright 2009 by CEBT

Benchmark Environment (2/2)Benchmark Environment (2/2)

Dataset

Documents : 600,000 unique documents for each node

155 million UserVisits records (20GB/node)

18 million Rankings records (1GB/node)

Center for E-Business Technology

Copyright 2009 by CEBT

1. Data Loading1. Data Loading

Center for E-Business Technology

loading timeloading time

ReorganizationReorganization

Copyright 2009 by CEBT

2. Selection Task2. Selection Task

The selection task is a lightweight filter to find the pageURLs in the Rankings table(1GB/node) with a pageRank above a user-defined threshold

Query

SELECT pageURL, pageRank FROM Rankings WHERE pageRank > x;

x = 10, which yields approximately 36,000 records per data file on each node

For MR, implementing the same task with Java language

Center for E-Business Technology

Copyright 2009 by CEBT

2. Selection Task - Result2. Selection Task - Result

Center for E-Business Technology

time for combining the output into a single file(Additional MR)

time for combining the output into a single file(Additional MR)

Processing timeProcessing time

Copyright 2009 by CEBT

3. Aggregation Task3. Aggregation Task

The aggregation task is calculating the total adRevenue generated for each sourceIP in the UserVisits(20GB/node), grouped by the sourceIP column

Query

SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;

This task always produces 2.5 million records

Center for E-Business Technology

Copyright 2009 by CEBT

3. Aggregation Task - Result3. Aggregation Task - Result

Center for E-Business Technology

Copyright 2009 by CEBT

4. Join Task4. Join Task

The join task consists of two sub-tasks that perform a complex calculation on two data sets

In the first part of the task, each system must find the sourceIP that generated the most revenue within a particular date range

Once these intermediate records are generated, the system must then calculate the average pageRank of all the pages visited during this interval

Query

SELECT INTO Temp sourceIP, AVG(pageRank) as avgPageRank, SUM(adRevenue) as totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date(‘2000-01-15’) AND Date(‘2000-01-22’) GROUP BY UV.sourceIP;

SELECT sourceIP, totalRevenue, avgPageRank FROM Temp ORDER BY totalRevenue DESC LIMIT 1;

Center for E-Business Technology

Copyright 2009 by CEBT

4. Join Task - Result4. Join Task - Result

Center for E-Business Technology

Copyright 2009 by CEBT

5. UDF Aggregation Task5. UDF Aggregation Task

The final task is to compute the inlink count for each document in the dataset

Query

SELECT INTO Temp F(contents) FROM Document;

– F : a user-defined function that parses the contents of each record in the Documents table and emits URLs into the database

– With this function F, we populate a temporary table with a list of URLs and then can execute a simple query to calculate the inlink count

SELECT url, SUM(value) FROM Temp GROUP BY url;

Center for E-Business Technology

Copyright 2009 by CEBT

5. UDF Aggregation Task - Result5. UDF Aggregation Task - Result

Center for E-Business Technology

Copyright 2009 by CEBT

ConclusionConclusion

Center for E-Business Technology

MapReduce < Parallel DBMS

HadoopDB: An Architectural Hybrid of MapReduce and HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical WorkloadsDBMS Technologies for Analytical Workloads

Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, Alexander Rasin

VLDB 2009

2009-10-09

Summarized by Jaeseok Myung

Intelligent Database Systems LabSchool of Computer Science & EngineeringSeoul National University, Seoul, Korea

Copyright 2009 by CEBT

HadoopDBHadoopDB

The Basic Idea (An Architectural Hybrid of MR & DBMS)

To use MR as the communication layer above multiple nodes running single-node DBMS instances

Queries are expressed in SQL, translated into MR by extending existing tools, and as much work as possible is pushed into the higher performing single node databases

Center for E-Business Technology

Copyright 2009 by CEBT

The Architecture of HadoopDBThe Architecture of HadoopDB

Center for E-Business Technology

Copyright 2009 by CEBT

HadoopDB – Join TaskHadoopDB – Join Task

Center for E-Business Technology