Tuple map reduce: beyond classic mapreduce

Post on 10-May-2015


This paper presents Tuple MapReduce, a new foundational model that extends MapReduce with the notion of tuples. Tuple MapReduce bridges the gap between the low-level constructs provided by MapReduce and the higher-level needs of programmers, such as compound records, sorting or joins. The paper also presents Pangool, an open-source framework implementing Tuple MapReduce. Pangool eases the design and implementation of applications based on MapReduce and increases their flexibility, while maintaining Hadoop’s performance.


Tuple MapReduce: Beyond classic MapReduce

Pedro Ferrera, Ivan de Prado, Eric Palacios
DataSalt
Barcelona, SPAIN
pere,ivan,epalacios@datasalt.com

Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo
University of Geneva, CUI
Geneva, SWITZERLAND
joseluis.fernandez@unige.ch


Outline

● Introduction
● Related Work
● Classic MapReduce
  – The problems of MapReduce
● Tuple MapReduce
  – The basic Tuple MapReduce
  – Joins
  – Generalization of MapReduce
● Pangool
● Conclusions and Future work


Introduction

● A huge amount of information → need for new processing technologies.
● MapReduce → a major contribution ...
  – ... but it involves a steep learning curve.
● Most design patterns found in real-world problems are not well covered.
● We propose Tuple MapReduce as a better foundational model.
● Tuple MapReduce on Hadoop → Pangool
  – No key architectural changes needed.


Related work

● MapReduce: Google paper in 2004
● Hadoop
● Higher-level tools
  – Sawzall, FlumeJava, Pig, Hive, Jaql, Cascading
● Higher-level abstractions are very popular
  – This supports the idea that MapReduce is too low-level a paradigm
● Map-Reduce-Merge
  – Targets the problem of relational operations (joins)
  – Implies changes in the architecture and a new merge step


Classic MapReduce

● Jobs
  – Input file, output file
  – The developer provides two functions: map and reduce
● Distributed execution of work
  – First the map function, in the map phase
  – Then the reduce function, in the reduce phase
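The two phases can be illustrated with a minimal single-process sketch in Python. This is not Hadoop code: `run_mapreduce`, `word_count_map` and `word_count_reduce` are hypothetical names chosen for this illustration, and the in-memory grouping only stands in for the distributed shuffle.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate one MapReduce job in a single process (illustration only)."""
    # Map phase: every input record yields zero or more (key, value) pairs.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # stands in for the shuffle by key

    # Reduce phase: reduce_fn is called once per key, keys in sorted order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


def word_count_map(line):
    # Emit (word, 1) for every word in the line.
    for word in line.split():
        yield word, 1


def word_count_reduce(word, counts):
    # Sum all partial counts for one word.
    yield word, sum(counts)


lines = ["map reduce map", "reduce tuple"]
result = run_mapreduce(lines, word_count_map, word_count_reduce)
print(result)
```

Note that the values handed to the reduce function arrive in no particular order; only the keys are sorted.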


The problems of MapReduce

● Compound records
  – Real-world problems include multi-field records. They don’t fit well in the key/value schema.
● Sorting
  – No inherent sorting of the records within a reduce group.
  – “Secondary sorting trick” in implementations (Hadoop)
● Join
  – A quite common operation
  – Not directly possible in MapReduce without using “tricks”:
    ● secondary sorting
    ● compound records
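As a rough illustration of the secondary sorting trick, the Python sketch below imitates what is usually done on a key/value framework: the field to sort on is moved into a composite key, the shuffle sorts on the whole composite key, and grouping looks only at its first component. The function name `secondary_sort` and the sample records are assumptions made for this example; on Hadoop the same effect needs a custom partitioner, sort comparator and grouping comparator.

```python
from itertools import groupby


def secondary_sort(records):
    """Group (url, date, visits) records by url, each group sorted by date."""
    # Map: build a composite key (url, date); the value is the visit count.
    pairs = [((url, date), visits) for url, date, visits in records]

    # Shuffle: the framework sorts on the full composite key.
    pairs.sort(key=lambda kv: kv[0])

    # Grouping: only the first key component (url) decides the reduce group,
    # so each group sees its values already ordered by date.
    grouped = {}
    for url, group in groupby(pairs, key=lambda kv: kv[0][0]):
        grouped[url] = [(key[1], visits) for key, visits in group]
    return grouped


# Hypothetical sample data.
records = [
    ("a.com", "2012-01-02", 5),
    ("b.com", "2012-01-01", 7),
    ("a.com", "2012-01-01", 10),
]
by_url = secondary_sort(records)
print(by_url)
```

The composite key is itself a small compound record, which is why the two "tricks" tend to appear together.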


Tuple MapReduce

● Idea: replace key/value by tuples
● group-by and sort-by clauses


Tuple MapReduce (II)

● group-by and sort-by constraint
  – group-by must be a prefix of sort-by
  – Needed to be able to implement Tuple MapReduce over a MapReduce architecture
● In contrast to MapReduce, Tuple MapReduce:
  – provides compound records → tuple
  – provides intra-reduce sorting
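The group-by and sort-by clauses, and the prefix constraint between them, can be sketched in a few lines of Python. This is a single-process illustration and not Pangool's API: `run_tuple_mapreduce`, the dictionary-based tuples and the field names `user`, `ts` and `page` are all assumptions made for the example.

```python
from itertools import groupby


def run_tuple_mapreduce(records, map_fn, reduce_fn, group_by, sort_by):
    """Simulate a Tuple MapReduce job over tuples represented as dicts."""
    # The constraint from the slide: group_by must be a prefix of sort_by.
    if list(sort_by[:len(group_by)]) != list(group_by):
        raise ValueError("group_by must be a prefix of sort_by")

    # Map phase: each record yields zero or more tuples.
    tuples = [t for record in records for t in map_fn(record)]

    # Shuffle: sort on all sort_by fields, then group on the group_by prefix.
    tuples.sort(key=lambda t: tuple(t[f] for f in sort_by))
    output = []
    for key, group in groupby(tuples, key=lambda t: tuple(t[f] for f in group_by)):
        # Inside a group, tuples arrive ordered by the remaining sort_by fields.
        output.extend(reduce_fn(key, group))
    return output


def identity_map(record):
    yield record


def first_page_reduce(group_key, tuples):
    # Thanks to intra-reduce sorting, the first tuple is the earliest visit.
    first = next(iter(tuples))
    yield group_key[0], first["page"]


# Hypothetical sample data.
events = [
    {"user": "bob", "ts": 3, "page": "/c"},
    {"user": "ann", "ts": 2, "page": "/b"},
    {"user": "ann", "ts": 1, "page": "/a"},
    {"user": "bob", "ts": 1, "page": "/d"},
]
first_pages = run_tuple_mapreduce(
    events, identity_map, first_page_reduce,
    group_by=["user"], sort_by=["user", "ts"],
)
print(first_pages)
```

Because the group-by fields lead the sort order, all tuples of one group are contiguous after a single sort, which is what lets the model sit on top of a plain MapReduce shuffle.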


Example: cumulative visits

● Cumulative number of visits per URL up to each date

Input → URL, date, visits

Expected output → URL, date, cumulative visits
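A possible reading of this example, sketched in plain Python: group by URL, sort by (URL, date), and keep a running total inside each group. The function name `cumulative_visits` and the sample rows are hypothetical, and dates are assumed to be ISO-formatted strings so that string order matches chronological order.

```python
from itertools import groupby


def cumulative_visits(rows):
    """rows: iterable of (url, date, visits). Returns (url, date, cumulative)."""
    # sort-by (url, date); group-by url is a prefix of that sort order.
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))

    output = []
    for url, group in groupby(ordered, key=lambda r: r[0]):
        running_total = 0
        # Tuples arrive sorted by date, so one pass with a counter is enough.
        for _, date, visits in group:
            running_total += visits
            output.append((url, date, running_total))
    return output


# Hypothetical sample data.
rows = [
    ("a.com", "2012-01-02", 5),
    ("b.com", "2012-01-01", 7),
    ("a.com", "2012-01-01", 10),
    ("a.com", "2012-01-03", 1),
]
totals = cumulative_visits(rows)
for row in totals:
    print(row)
```

The reducer never has to buffer or re-sort a group, which is the benefit of intra-reduce sorting in this case.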



Join-Tuple MapReduce

● Joins among heterogeneous datasets
  – Tuples are associated with a source-id.
● Tuples reach the reducer sorted by source-id
  – enabling memoryless reduce joins
  – and grouped by some common fields


Example: join between clients and payments

[Figure: inner join of the clients and payments datasets; columns shown: client_id, name, payment_id, amount]
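The join can be sketched as a reduce-side join in plain Python. The input layouts are assumptions made for this illustration: clients as (client_id, name) and payments as (payment_id, client_id, amount), with at most one client per client_id. `reduce_side_join` and the source-id constants are hypothetical names.

```python
from itertools import groupby

# Source ids: clients must sort before payments inside each group.
CLIENTS, PAYMENTS = 0, 1


def reduce_side_join(clients, payments):
    """Inner join of clients and payments on client_id."""
    # Map: tag every tuple with its source id and expose the common field.
    tagged = [(cid, CLIENTS, (name,)) for cid, name in clients]
    tagged += [(cid, PAYMENTS, (pid, amount)) for pid, cid, amount in payments]

    # group-by client_id, sort-by (client_id, source id).
    tagged.sort(key=lambda t: (t[0], t[1]))

    joined = []
    for cid, group in groupby(tagged, key=lambda t: t[0]):
        name = None
        for _, source, fields in group:
            if source == CLIENTS:
                # The client tuple arrives first; only it is kept in memory.
                name = fields[0]
            elif name is not None:
                # Payments are streamed and emitted one by one.
                pid, amount = fields
                joined.append((cid, name, pid, amount))
            # Payments without a matching client are dropped (inner join).
    return joined


# Hypothetical sample data.
clients = [(2, "Bob"), (1, "Ann"), (3, "Cid")]
payments = [(10, 1, 5.0), (11, 2, 7.5), (12, 1, 2.5), (13, 4, 9.0)]
joined = reduce_side_join(clients, payments)
for row in joined:
    print(row)
```

Sorting by source-id is what keeps the reducer nearly memoryless: the payment side of a group is never buffered, however many payments a client has.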


Generalization of MapReduce

● MapReduce is a Tuple MapReduce with...
  – tuples of two values and
  – group-by and sort-by set to the first value
● The opposite is also possible → implementing Tuple MapReduce on top of existing MapReduce implementations.
  – Architectural changes are not needed.
  – Pangool is proof of that.
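The first direction of this equivalence can be sketched in Python: a classic key/value job is run on a small tuple engine by treating each (key, value) pair as a two-field tuple and setting both group-by and sort-by to field 0. Both `tuple_mapreduce` and `classic_mapreduce` are hypothetical helpers written for this illustration, with tuple fields addressed by position.

```python
from itertools import groupby


def tuple_mapreduce(records, map_fn, reduce_fn, group_by, sort_by):
    """Minimal tuple engine; fields are addressed by position."""
    if tuple(sort_by[:len(group_by)]) != tuple(group_by):
        raise ValueError("group_by must be a prefix of sort_by")
    tuples = sorted(
        (t for record in records for t in map_fn(record)),
        key=lambda t: tuple(t[i] for i in sort_by),
    )
    output = []
    for key, group in groupby(tuples, key=lambda t: tuple(t[i] for i in group_by)):
        output.extend(reduce_fn(key, group))
    return output


def classic_mapreduce(records, map_fn, reduce_fn):
    """Classic MapReduce expressed as a special case of the tuple engine."""
    def reduce_adapter(group_key, group):
        # Unpack the two-field tuples back into one key and a list of values.
        return reduce_fn(group_key[0], [value for _, value in group])

    # Two-field tuples, group-by and sort-by both set to the first field.
    return tuple_mapreduce(records, map_fn, reduce_adapter,
                           group_by=(0,), sort_by=(0,))


def word_count_map(line):
    for word in line.split():
        yield word, 1


def word_count_reduce(word, counts):
    yield word, sum(counts)


counts = classic_mapreduce(["map reduce map", "reduce tuple"],
                           word_count_map, word_count_reduce)
print(counts)
```

The adapter is the only extra piece needed, which is consistent with the claim that the two models can be mapped onto each other without architectural changes.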


Pangool

● Tuple MapReduce implementation on top of Hadoop.
  – On top of an existing MapReduce implementation.
● It is just a library. No architecture change was needed.
● Used in real-world applications
  – Banking
  – Searching
  – Social networks

pangool.net


Pangool benchmark – secondary sort


Pangool benchmark – join


Pangool performance

● Only between 5% and 8% worse than Hadoop
  – Pretty good, considering that Pangool is built on top of the Hadoop API
● The difference would probably disappear with a native implementation
● Much better than higher-level APIs
  – Probably because Pangool is a low-level API


Conclusions and Future work

● The MapReduce key/value model has proven too strict.
● Tuple MapReduce keeps MapReduce's features
  – Enhancing them with
    ● compound records,
    ● joins and
    ● intra-reduce sorting.
● Pangool is proof of its viability,
  – including on existing implementations like Hadoop, without changing the architecture
● Future work would involve abstractions for flow creation
  – Simplifying job chaining and data flow.


Thanks!

● Any questions or doubts?

– ivan@datasalt.com

– @ivanprado
