Tuple MapReduce: Beyond classic MapReduce

Pedro Ferrera, Ivan de Prado, Eric Palacios

DataSalt, Barcelona, SPAIN

pere,ivan,epalacios@datasalt.com

Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo

University of Geneva, CUI, Geneva, SWITZERLAND

joseluis.fernandez@unige.ch

2 / 18

Outline

● Introduction

● Related Work

● Classic MapReduce

– The problems of MapReduce

● Tuple MapReduce

– The basic Tuple MapReduce

– Joins

– Generalization of MapReduce

● Pangool

● Conclusions and Future work

3 / 18

Introduction

● A huge amount of information → need for new processing technologies.

● MapReduce → a major contribution ...

– ... but it involves a steep learning curve.

● Most design patterns found in real-world problems are not well covered.

● We propose Tuple MapReduce as a better foundation model.

● Tuple MapReduce on Hadoop → Pangool

– No key architectural changes needed.

4 / 18

Related work

● MapReduce: Google paper from 2004

● Hadoop

● Higher-level tools

– Sawzall, FlumeJava, Pig, Hive, Jaql, Cascading

● Higher-level abstractions are very popular

– This supports the view that MapReduce is too low-level a paradigm

● Merge MapReduce

– Targets the problem of relational operations (joins)

– Implies changes in the architecture and a new merge step

5 / 18

Classic MapReduce

● Jobs

– Input file, output file

– The developer provides two functions: map and reduce

● Distributed execution of work

– First the map function, in the map phase

– Then the reduce function, in the reduce phase
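
The two phases above can be sketched in a single process. This is only an illustration of the programming model, not Hadoop code: `run_mapreduce`, `map_visits`, `reduce_visits` and the sample records are all hypothetical names and data chosen for this sketch.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for a MapReduce job (illustrative only)."""
    # Map phase: each input record may emit zero or more (key, value) pairs.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)

    # Reduce phase: one call per key. Keys are visited in sorted order,
    # but the values of a key arrive in no guaranteed order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


def map_visits(record):
    # The record has three fields, but only a key and a value can be emitted.
    url, _date, visits = record
    yield url, visits


def reduce_visits(url, visits):
    yield url, sum(visits)


# Hypothetical sample data: (url, date, visits)
records = [
    ("a.com", "2012-01-01", 3),
    ("b.com", "2012-01-01", 5),
    ("a.com", "2012-01-02", 4),
]
totals = run_mapreduce(records, map_visits, reduce_visits)
```

Here `totals` holds one total per URL. Note that the date field had to be dropped to fit the key/value shape.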

6 / 18

The problems of MapReduce

● Compound records

– Real-world problems involve multi-field records, which do not fit well into the key/value schema

● Sorting

– No inherent sorting of the records within a reduce group.

– Implementations (Hadoop) rely on the “secondary sorting trick”

● Join

– A quite common operation

– Not directly possible in MapReduce without “tricks”: secondary sorting and compound records
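
A rough sketch of what the secondary sorting trick amounts to, assuming the usual recipe: the field to sort on is moved into a composite key, the shuffle sorts on the whole composite key, and grouping looks only at its first part. The function name and the sample pairs are hypothetical, and the code simulates the shuffle in one process rather than using any Hadoop API.

```python
from itertools import groupby


def shuffle_with_secondary_sort(pairs):
    """pairs: iterable of ((natural_key, secondary_key), value)."""
    # Sort on the full composite key ...
    ordered = sorted(pairs, key=lambda kv: kv[0])
    # ... but group on the natural key only, so each reduce call sees
    # its values already ordered by the secondary key.
    for natural_key, group in groupby(ordered, key=lambda kv: kv[0][0]):
        yield natural_key, [(key[1], value) for key, value in group]


# Hypothetical sample data: ((url, date), visits)
pairs = [
    (("a.com", "2012-01-02"), 4),
    (("b.com", "2012-01-01"), 5),
    (("a.com", "2012-01-01"), 3),
]
grouped = list(shuffle_with_secondary_sort(pairs))
```

The developer has to pack and unpack the composite key by hand, which is the kind of boilerplate the slides call a “trick”.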

7 / 18

Tuple MapReduce

● Idea: replace key/value by tuples

● group-by and sort-by clauses

8 / 18

Tuple MapReduce (II)

● group-by and sort-by constraint

– group-by must be a prefix of sort-by

– Needed to implement Tuple MapReduce on top of a MapReduce architecture

● Unlike MapReduce, Tuple MapReduce:

– provides compound records → tuples

– provides intra-reduce sorting
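
A minimal single-process sketch of the model, to show why the prefix constraint matters: when group-by is a prefix of sort-by, one sort places every group in a contiguous run that is already in sort-by order. `run_tuple_mapreduce`, the dict-based tuples and the sample events are assumptions made for this illustration; this is not the Pangool API.

```python
from itertools import groupby


def run_tuple_mapreduce(tuples, group_by, sort_by, reduce_fn):
    """tuples: list of dicts; group_by and sort_by: lists of field names."""
    if list(sort_by[:len(group_by)]) != list(group_by):
        raise ValueError("group-by must be a prefix of sort-by")

    # A single sort on the sort-by fields is enough: because of the prefix
    # constraint, tuples of the same group end up next to each other.
    ordered = sorted(tuples, key=lambda t: tuple(t[f] for f in sort_by))

    output = []
    for group_key, group in groupby(
        ordered, key=lambda t: tuple(t[f] for f in group_by)
    ):
        output.extend(reduce_fn(group_key, group))
    return output


def actions_in_order(group_key, group):
    # The group arrives sorted by timestamp, with no extra work here.
    yield group_key[0], [t["action"] for t in group]


# Hypothetical sample tuples
events = [
    {"user": "ana", "ts": 3, "action": "logout"},
    {"user": "bob", "ts": 1, "action": "login"},
    {"user": "ana", "ts": 1, "action": "login"},
    {"user": "ana", "ts": 2, "action": "search"},
]
sessions = run_tuple_mapreduce(
    events, group_by=["user"], sort_by=["user", "ts"], reduce_fn=actions_in_order
)
```

Passing `sort_by=["ts"]` with `group_by=["user"]` raises `ValueError` in this sketch, since the tuples of one user would no longer be contiguous after sorting.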

9 / 18

Example: cumulative visits

● Cumulative number of visits to each URL up to each date

Input → URL, date, visits

Expected output → URL, date, cumulative visits
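
The example can be sketched as group-by (URL) and sort-by (URL, date), with a reducer that keeps a running sum. The function name and the sample records are hypothetical, and dates are assumed to be ISO strings so that text order matches chronological order.

```python
from itertools import groupby


def cumulative_visits(records):
    """records: iterable of (url, date, visits) tuples."""
    # sort-by (url, date); group-by (url) is a prefix of it.
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    for url, group in groupby(ordered, key=lambda r: r[0]):
        running = 0
        # Dates arrive in order, so a running sum is all the reducer needs.
        for _, date, visits in group:
            running += visits
            yield url, date, running


# Hypothetical sample data
records = [
    ("a.com", "2012-01-02", 4),
    ("b.com", "2012-01-01", 5),
    ("a.com", "2012-01-01", 3),
    ("a.com", "2012-01-03", 1),
]
result = list(cumulative_visits(records))
```

For this input, `result` contains `("a.com", "2012-01-01", 3)`, `("a.com", "2012-01-02", 7)`, `("a.com", "2012-01-03", 8)` and `("b.com", "2012-01-01", 5)`.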

10 / 18

Join-Tuple MapReduce

● Joins among heterogeneous datasets

– Tuples are associated with a source-id.

● Tuples reach the reducer sorted by source-id

– and grouped by some common fields

– enabling memoryless reduce joins

11 / 18

Example: join between clients and payments

Inputs → clients, payments

Inner join → client_id, name, payment_id, amount
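
A sketch of how such a reduce-side join could work, under assumptions not stated on the slide: clients are `(client_id, name)`, payments are `(payment_id, client_id, amount)`, each client id appears at most once, and the clients source is given the lower source-id so that it reaches the reducer first. All names and data are hypothetical.

```python
from itertools import groupby

# Source-ids: clients sort before payments within each group.
CLIENTS, PAYMENTS = 0, 1


def inner_join(clients, payments):
    # Tag every tuple with the common field (client_id) and its source-id.
    tagged = [(c[0], CLIENTS, c) for c in clients]
    tagged += [(p[1], PAYMENTS, p) for p in payments]

    # group-by (client_id), sort-by (client_id, source-id).
    ordered = sorted(tagged, key=lambda t: (t[0], t[1]))

    for client_id, group in groupby(ordered, key=lambda t: t[0]):
        name = None
        for _, source_id, record in group:
            if source_id == CLIENTS:
                # Only this one value is remembered; payments are streamed.
                name = record[1]
            elif name is not None:
                payment_id, _, amount = record
                yield client_id, name, payment_id, amount


# Hypothetical sample data
clients = [(1, "Ana"), (2, "Luis"), (3, "Eva")]
payments = [(100, 1, 9.5), (101, 3, 20.0), (102, 1, 5.0), (103, 4, 7.0)]
joined = list(inner_join(clients, payments))
```

Because the client tuple always arrives before its payments, the reducer never has to buffer a whole group. Client 2 (no payments) and payment 103 (unknown client) produce no output, as expected for an inner join.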

12 / 18

Generalization of MapReduce

● MapReduce is a Tuple MapReduce with ...

– tuples of two values, and

– group-by and sort-by set to the first value

● The converse is also possible → Tuple MapReduce can be implemented on top of existing MapReduce implementations.

– Architectural changes are not needed.

– Pangool is a proof of that.
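
The first direction can be illustrated with a small sketch: a classic key/value job expressed as a Tuple MapReduce over two-field tuples, with group-by and sort-by both set to the first field. `tuple_mapreduce` and `classic_mapreduce` are hypothetical helpers written for this illustration, with fields addressed by position.

```python
from itertools import groupby


def tuple_mapreduce(tuples, group_by, sort_by, reduce_fn):
    """Minimal tuple engine; group_by and sort_by are lists of positions."""
    if sort_by[:len(group_by)] != group_by:
        raise ValueError("group-by must be a prefix of sort-by")
    ordered = sorted(tuples, key=lambda t: tuple(t[i] for i in sort_by))
    output = []
    for key, group in groupby(
        ordered, key=lambda t: tuple(t[i] for i in group_by)
    ):
        output.extend(reduce_fn(key, list(group)))
    return output


def classic_mapreduce(pairs, reduce_fn):
    """Classic reduce(key, values) on top of the tuple engine."""

    def tuple_reduce(group_key, group):
        # Unpack the two-field tuples back into a key and its values.
        return reduce_fn(group_key[0], [value for _, value in group])

    # Two-field tuples, group-by and sort-by on the first field only.
    return tuple_mapreduce(pairs, group_by=[0], sort_by=[0], reduce_fn=tuple_reduce)


# Hypothetical sample data
pairs = [("b", 2), ("a", 1), ("b", 3)]
sums = classic_mapreduce(pairs, lambda key, values: [(key, sum(values))])
```

In this sketch `sums` is `[("a", 1), ("b", 5)]`, the same result a plain key/value job would give.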

13 / 18

Pangool

● Tuple MapReduce implementation on top of Hadoop.

– On top of an existing MapReduce implementation.

● It is just a library. No architecture change was needed.

● Used in real-world applications

– Banking

– Searching

– Social networks

pangool.net

14 / 18

Pangool benchmark – secondary sort

15 / 18

Pangool benchmark – join

16 / 18

Pangool performance

● Only between 5% and 8% worse than Hadoop

– Pretty good, considering that Pangool is built on top of the Hadoop API

● The difference would probably disappear with a native implementation

● Much better than higher-level APIs

– Probably because Pangool is a low-level API

17 / 18

Conclusions and Future work

● The MapReduce key/value model has been shown to be too strict.

● Tuple MapReduce keeps the MapReduce features

– Enhancing them with compound records, joins and intra-reduce sorting.

● Pangool is a proof of its viability,

– including on top of existing implementations like Hadoop, without changing the architecture

● Future work would involve abstractions for flow creation

– Simplifying job chaining and data flow.

18 / 18

Thanks!

● Any questions or doubts?

– ivan@datasalt.com

– @ivanprado
