A Comparison of Approaches to Large-Scale Data Analysis

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker

SIGMOD 2009

2009-10-09

Summarized by Jaeseok Myung

Intelligent Database Systems Lab, School of Computer Science & Engineering, Seoul National University, Seoul, Korea

MapReduce vs. Parallel DBMS

MapReduce

Center for E-Business Technology

Jaesun Han (한재선), SearchDay2008, http://nexr.tistory.com

Architectural Differences

                     Parallel DBMS                  MapReduce
Schema Support       Yes                            No
Indexing             Yes                            No
Programming Model    Stating what you want (SQL)    Presenting an algorithm (C/C++, Java, …)
Optimization         Yes                            No
Flexibility                                         Good
Fault Tolerance                                     Good

Benchmark Environment (1/2)

Systems

Hadoop: The most popular open-source MR implementation

DBMS-X: a parallel DBMS that stores data in a row-based format

Vertica: a column-based parallel DBMS

All three systems were deployed on a 100-node cluster

Analytical Tasks

Data Loading

Selection Task

Aggregation Task

Join Task

UDF Aggregation Task

Benchmark Environment (2/2)

Dataset

Documents: 600,000 unique documents per node

155 million UserVisits records (20GB/node)

18 million Rankings records (1GB/node)

1. Data Loading

Loading time

Reorganization

2. Selection Task

The selection task is a lightweight filter that finds the pageURLs in the Rankings table (1GB/node) with a pageRank above a user-defined threshold

SELECT pageURL, pageRank FROM Rankings WHERE pageRank > x;

x = 10, which yields approximately 36,000 records per data file on each node

For MR, the same task is implemented in Java
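The filter above can be sketched as a map-only job. This is a minimal illustration in Python rather than the Java program used in the benchmark; the names `select_map` and `run_selection` and the sample records are hypothetical, and the driver loop only simulates what a framework would do across many map tasks.

```python
from typing import Iterable, Iterator, List, Tuple

Ranking = Tuple[str, int]  # (pageURL, pageRank)


def select_map(record: Ranking, threshold: int) -> Iterator[Ranking]:
    """Map function: emit the record only if pageRank exceeds the threshold."""
    page_url, page_rank = record
    if page_rank > threshold:
        yield page_url, page_rank


def run_selection(rankings: Iterable[Ranking], threshold: int = 10) -> List[Ranking]:
    """Apply the map function to every record; a pure filter needs no reduce phase."""
    output: List[Ranking] = []
    for record in rankings:
        output.extend(select_map(record, threshold))
    return output


# Made-up sample data, for illustration only.
sample = [("a.com", 5), ("b.com", 42), ("c.com", 10), ("d.com", 11)]
print(run_selection(sample))
```

In a real cluster each map task would write its own output file, so gathering the results into a single file is a separate step.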

2. Selection Task - Result

Time for combining the output into a single file (additional MR job)

Processing time

3. Aggregation Task

The aggregation task calculates the total adRevenue generated for each sourceIP in the UserVisits table (20GB/node), grouped by the sourceIP column

SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;

This task always produces 2.5 million records
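For comparison with the SQL query, the same aggregation can be sketched in MapReduce style. This is a simplified single-process illustration in Python, not the benchmark's Hadoop code; `agg_map`, `shuffle`, `agg_reduce`, and the sample visits are hypothetical names and data, and `shuffle` stands in for the grouping a framework performs between the two phases.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Visit = Tuple[str, float]  # (sourceIP, adRevenue)


def agg_map(visit: Visit) -> Iterator[Visit]:
    """Map: emit (sourceIP, adRevenue) so records are grouped by sourceIP."""
    source_ip, ad_revenue = visit
    yield source_ip, ad_revenue


def shuffle(pairs: Iterable[Visit]) -> Dict[str, List[float]]:
    """Group mapped values by key, as the framework would between map and reduce."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def agg_reduce(source_ip: str, revenues: List[float]) -> Visit:
    """Reduce: sum all adRevenue values seen for one sourceIP."""
    return source_ip, sum(revenues)


def run_aggregation(visits: Iterable[Visit]) -> Dict[str, float]:
    mapped = (pair for visit in visits for pair in agg_map(visit))
    return dict(agg_reduce(ip, values) for ip, values in shuffle(mapped).items())


# Made-up sample data, for illustration only.
visits = [("1.1.1.1", 0.5), ("2.2.2.2", 1.0), ("1.1.1.1", 0.25)]
print(run_aggregation(visits))
```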

3. Aggregation Task - Result

4. Join Task

The join task consists of two sub-tasks that perform a complex calculation on two data sets

In the first part of the task, each system must find the sourceIP that generated the most revenue within a particular date range

Once these intermediate records are generated, the system must then calculate the average pageRank of all the pages visited during this interval

SELECT INTO Temp sourceIP, AVG(pageRank) as avgPageRank, SUM(adRevenue) as totalRevenue
FROM Rankings AS R, UserVisits AS UV
WHERE R.pageURL = UV.destURL
AND UV.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22')
GROUP BY UV.sourceIP;

SELECT sourceIP, totalRevenue, avgPageRank FROM Temp ORDER BY totalRevenue DESC LIMIT 1;
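The logic of the two queries can be traced in a short Python sketch: filter UserVisits by date, join to Rankings on the URL, aggregate per sourceIP, then keep the row with the highest revenue. This is an in-memory illustration under assumed tuple layouts, not how any of the benchmarked systems execute the join; `run_join` and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import date


def run_join(rankings, user_visits, start, end):
    """Return (sourceIP, totalRevenue, avgPageRank) for the top-revenue sourceIP.

    rankings:    iterable of (pageURL, pageRank)
    user_visits: iterable of (sourceIP, destURL, visitDate, adRevenue)
    """
    # Build a hash table on the smaller table, keyed by the join column.
    rank_by_url = dict(rankings)

    # sourceIP -> [sum of pageRank, total adRevenue, number of joined rows]
    totals = defaultdict(lambda: [0, 0.0, 0])
    for source_ip, dest_url, visit_date, ad_revenue in user_visits:
        if not (start <= visit_date <= end):  # BETWEEN is inclusive
            continue
        if dest_url not in rank_by_url:       # inner join: drop unmatched visits
            continue
        acc = totals[source_ip]
        acc[0] += rank_by_url[dest_url]
        acc[1] += ad_revenue
        acc[2] += 1

    if not totals:
        return None
    # ORDER BY totalRevenue DESC LIMIT 1
    best_ip = max(totals, key=lambda ip: totals[ip][1])
    rank_sum, revenue, count = totals[best_ip]
    return best_ip, revenue, rank_sum / count


# Made-up sample data, for illustration only.
rankings = [("a.com", 10), ("b.com", 30)]
user_visits = [
    ("1.1.1.1", "a.com", date(2000, 1, 16), 1.0),
    ("1.1.1.1", "b.com", date(2000, 1, 20), 2.0),
    ("2.2.2.2", "b.com", date(2000, 1, 17), 2.5),
    ("2.2.2.2", "a.com", date(2000, 2, 1), 9.0),  # outside the date range
]
print(run_join(rankings, user_visits, date(2000, 1, 15), date(2000, 1, 22)))
```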

4. Join Task - Result

5. UDF Aggregation Task

The final task is to compute the inlink count for each document in the dataset

SELECT INTO Temp F(contents) FROM Documents;

– F: a user-defined function that parses the contents of each record in the Documents table and emits URLs into the database

– With this function F, we populate a temporary table with a list of URLs and can then execute a simple query to calculate the inlink count

SELECT url, SUM(value) FROM Temp GROUP BY url;
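A rough Python analogue of the two steps: a stand-in for F that emits a (url, 1) pair for every URL found in a document, followed by a per-URL sum. The regular expression is a deliberately simple assumption about what counts as a URL, and `extract_urls`, `inlink_counts`, and the sample documents are hypothetical.

```python
import re
from collections import Counter

# Simplistic URL pattern, assumed for illustration; real parsing would be stricter.
URL_PATTERN = re.compile(r"https?://[^\s<>]+")


def extract_urls(contents):
    """Stand-in for F: emit (url, 1) for every URL that appears in a document."""
    for url in URL_PATTERN.findall(contents):
        yield url, 1


def inlink_counts(documents):
    """Equivalent of SELECT url, SUM(value) ... GROUP BY url over the emitted pairs."""
    counts = Counter()
    for contents in documents:
        for url, value in extract_urls(contents):
            counts[url] += value
    return dict(counts)


# Made-up sample data, for illustration only.
documents = [
    "see http://a.com and http://b.com",
    "link to http://a.com",
]
print(inlink_counts(documents))
```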

5. UDF Aggregation Task - Result

Conclusion

MapReduce < Parallel DBMS

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, Alexander Rasin

VLDB 2009

2009-10-09

Summarized by Jaeseok Myung

Intelligent Database Systems Lab, School of Computer Science & Engineering, Seoul National University, Seoul, Korea

HadoopDB

The Basic Idea (An Architectural Hybrid of MR & DBMS)

To use MR as the communication layer above multiple nodes running single-node DBMS instances

Queries are expressed in SQL, translated into MR by extending existing tools, and as much work as possible is pushed into the higher-performing single-node databases
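The idea of pushing work into single-node databases and using MR only to combine results can be sketched as follows. This is a toy illustration, not HadoopDB's code: in-memory `sqlite3` databases stand in for the per-node DBMS instances, plain Python functions stand in for the MR layer, and `make_node`, `map_phase`, and `reduce_phase` are hypothetical names.

```python
import sqlite3
from collections import defaultdict


def make_node(rows):
    """Create one in-memory database holding a partition of UserVisits."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE UserVisits (sourceIP TEXT, adRevenue REAL)")
    conn.executemany("INSERT INTO UserVisits VALUES (?, ?)", rows)
    return conn


def map_phase(node):
    """Push the aggregation into the node's database and return its partial sums."""
    return node.execute(
        "SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP"
    ).fetchall()


def reduce_phase(partials):
    """Merge the per-node partial sums into global totals."""
    totals = defaultdict(float)
    for rows in partials:
        for source_ip, revenue in rows:
            totals[source_ip] += revenue
    return dict(totals)


# Two made-up partitions, one per simulated node.
nodes = [
    make_node([("1.1.1.1", 0.5), ("2.2.2.2", 1.0)]),
    make_node([("1.1.1.1", 0.25)]),
]
totals = reduce_phase(map_phase(node) for node in nodes)
print(totals)
```

Each node returns at most one row per sourceIP, so the layer above the databases merges small partial results instead of scanning raw records.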

The Architecture of HadoopDB

HadoopDB – Join Task
