Incremental Recomputations in MapReduce

Thomas Jörg, University of Kaiserslautern


DBAG Treffen 2011 – Thomas Jörg – TU Kaiserslautern 2

Motivation

[Diagram: Base data → MapReduce Program → Result data; storage: Bigtable / HBase]


Motivation

[Diagram: Base data → View Definition → Materialized view]


Motivation

[Diagram: Base data → MapReduce Program → Result data, with an incremental MapReduce Program next to it; storage: Bigtable / HBase]


Agenda

• Related Work

• Case study

• Incremental view maintenance

• Summary Delta Algorithm

• Conclusion and future work


Related Work

• Caching intermediate results

  • DryadInc

  • Incoop

• Incremental programming models

  • Google Percolator

  • Continuous bulk processing (CBP)

L. Popa et al.: DryadInc: Reusing work in large-scale computations. HotCloud 2009
P. Bhatotia et al.: Incoop: MapReduce for Incremental Computations. SoCC 2011
D. Peng and F. Dabek: Large-scale Incremental Processing Using Distributed Transactions and Notifications. OSDI 2010
D. Logothetis et al.: Stateful Bulk Processing for Incremental Analytics. SoCC 2010


Challenges

• Programming model

  • SQL / relational algebra vs. MapReduce

• Efficient access paths

  • No secondary indexes in HBase

• Support for transactions

  • Only single-row transactions in HBase


Case Study

• Word histograms

• Reverse web-link graphs

• Term-vectors per host

• Count of URL access frequency

• Inverted Indexes

J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004


Computing Reverse Web-Link Graphs

[Figure: a collection of HTML documents, each shown as <html>...</html>]


Sample Web-Link Graph

a.htm: <html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html>

b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a> </html>


Computing Reverse Web-Link Graphs

Input

a.htm: <html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html>
b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a> </html>

Map

a.htm → (b.htm, a.htm), (b.htm, a.htm)
b.htm → (a.htm, b.htm), (b.htm, b.htm)

Shuffle / Reduce

a.htm, {b.htm}
b.htm, {a.htm, b.htm}
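The map, shuffle, and reduce steps of this example can be sketched in a few lines of plain Python. This is only an illustration, not the Hadoop job from the talk: the sample pages, the regular expression used to find links, and all function names are assumptions made for the sketch.

```python
import re
from collections import defaultdict

# Hypothetical sample input mirroring the two pages of the example.
PAGES = {
    "a.htm": '<html> <a href="b.htm">x</a> <a href="b.htm">y</a> </html>',
    "b.htm": '<html> <a href="a.htm">x</a> <a href="b.htm">y</a> </html>',
}

# Simplistic link extraction; a real job would use an HTML parser.
HREF = re.compile(r'href="([^"]+)"')

def map_page(source, html):
    """Map: emit one (link target, link source) pair per hyperlink."""
    for target in HREF.findall(html):
        yield target, source

def shuffle(pairs):
    """Shuffle: group the emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_links(target, sources):
    """Reduce: collapse duplicate sources into a set."""
    return target, set(sources)

def reverse_link_graph(pages):
    pairs = [kv for src, html in pages.items() for kv in map_page(src, html)]
    return dict(reduce_links(k, v) for k, v in shuffle(pairs).items())

graph = reverse_link_graph(PAGES)
```

Because the reducer stores a plain set, the view forgets that a.htm links to b.htm twice, which is what makes deletions hard to apply later.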


Summary Delta Algorithm

CREATE VIEW Parts AS
SELECT partID, SUM(qty*price) AS revenue, COUNT(*) AS tplcnt
FROM Orders
GROUP BY partID

SELECT partID, SUM(revenue) AS revenue, SUM(tplcnt) AS tplcnt
FROM (
  (SELECT partID, SUM(qty*price) AS revenue, COUNT(*) AS tplcnt
   FROM Orders_Insertions
   GROUP BY partID)
  UNION ALL
  (SELECT partID, -SUM(qty*price) AS revenue, -COUNT(*) AS tplcnt
   FROM Orders_Deletions
   GROUP BY partID))
GROUP BY partID

I. S. Mumick et al.: Maintenance of Data Cubes and Summary Tables in a Warehouse. SIGMOD Conference 1997
W. Labio et al.: Performance Issues in Incremental Warehouse Maintenance. VLDB 2000
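To see the summary delta query at work, the sketch below runs it against an in-memory SQLite database and then installs the result into a dictionary that stands in for the Parts view. The sample rows, the `delta` alias, the `install` helper, and the rule of dropping a part once its tuple count reaches zero are assumptions for illustration. The parentheses around the two sub-selects are omitted because SQLite does not accept them in a compound SELECT.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Orders_Insertions (partID INTEGER, qty INTEGER, price REAL);
CREATE TABLE Orders_Deletions  (partID INTEGER, qty INTEGER, price REAL);
INSERT INTO Orders_Insertions VALUES (1, 2, 10.0), (1, 1, 10.0), (2, 5, 3.0);
INSERT INTO Orders_Deletions  VALUES (1, 1, 10.0), (3, 4, 2.0);
""")

SUMMARY_DELTA = """
SELECT partID, SUM(revenue) AS revenue, SUM(tplcnt) AS tplcnt
FROM (
    SELECT partID, SUM(qty*price) AS revenue, COUNT(*) AS tplcnt
    FROM Orders_Insertions GROUP BY partID
    UNION ALL
    SELECT partID, -SUM(qty*price) AS revenue, -COUNT(*) AS tplcnt
    FROM Orders_Deletions GROUP BY partID
) AS delta
GROUP BY partID
ORDER BY partID
"""
summary_delta = con.execute(SUMMARY_DELTA).fetchall()

def install(view, delta_rows):
    """Apply summary-delta rows to a dict standing in for the Parts view."""
    for part_id, revenue, tplcnt in delta_rows:
        old_revenue, old_cnt = view.get(part_id, (0.0, 0))
        new_cnt = old_cnt + tplcnt
        if new_cnt == 0:
            view.pop(part_id, None)  # no contributing orders remain
        else:
            view[part_id] = (old_revenue + revenue, new_cnt)
    return view

# Hypothetical state of the view before the update.
parts = install({1: (50.0, 3), 3: (8.0, 1)}, summary_delta)
```

The tuple count `tplcnt` is what lets the view decide, without reading Orders, whether a group still has contributing rows.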




Achieving Self-Maintainability

Input

a.htm: <html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html>
b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a> </html>

Map

a.htm → (b.htm, [a.htm, 1]), (b.htm, [a.htm, 1])
b.htm → (a.htm, [b.htm, 1]), (b.htm, [b.htm, 1])

Shuffle / Reduce

a.htm, {[b.htm, 1]}
b.htm, {[a.htm, 2], [b.htm, 1]}
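A minimal sketch of the counted variant, assuming the same regex-based link extraction and using `collections.Counter` to hold the per-source link counts. The function names and sample pages are illustrative only.

```python
import re
from collections import Counter, defaultdict

# Hypothetical sample input mirroring the two pages of the example.
PAGES = {
    "a.htm": '<html> <a href="b.htm">x</a> <a href="b.htm">y</a> </html>',
    "b.htm": '<html> <a href="a.htm">x</a> <a href="b.htm">y</a> </html>',
}

HREF = re.compile(r'href="([^"]+)"')

def map_page(source, html):
    """Map: emit (link target, [link source, 1]) for every hyperlink."""
    for target in HREF.findall(html):
        yield target, (source, 1)

def reduce_counts(pairs):
    """Shuffle and reduce: sum the counts per (target, source)."""
    view = defaultdict(Counter)
    for target, (source, n) in pairs:
        view[target][source] += n
    return dict(view)

pairs = [kv for src, html in PAGES.items() for kv in map_page(src, html)]
counted_view = reduce_counts(pairs)
```

With the count stored next to each source, removing one link only requires a decrement, and a source disappears from the view exactly when its count reaches zero.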


Sample Web-Link Graph

a.htm (old version): <html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html>

a.htm (new version): <html> <a href="b.htm"> ...</a> <a href="a.htm"> ...</a> </html>

b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a> </html>


Summary Delta Algorithm in MapReduce

Map

a.htm (deleted): <html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html>
→ (b.htm, [a.htm, -1]), (b.htm, [a.htm, -1])

a.htm (inserted): <html> <a href="b.htm"> ...</a> <a href="a.htm"> ...</a> </html>
→ (b.htm, [a.htm, +1]), (a.htm, [a.htm, +1])

Shuffle / Reduce

a.htm, {[a.htm, +1]}
b.htm, {[a.htm, -1]}
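The same computation as a small Python sketch: the old version of a page is mapped with sign -1, the new version with sign +1, and the reducer sums the signed counts. The input dictionaries, the `sign` parameter, and the choice to drop entries whose sum is zero are assumptions of this sketch.

```python
import re
from collections import defaultdict

HREF = re.compile(r'href="([^"]+)"')

# Hypothetical delta sets: the old and the new version of a.htm.
DELETED = {"a.htm": '<html> <a href="b.htm">x</a> <a href="b.htm">y</a> </html>'}
INSERTED = {"a.htm": '<html> <a href="b.htm">x</a> <a href="a.htm">y</a> </html>'}

def map_delta(source, html, sign):
    """Map: emit (target, [source, sign]); sign is -1 for deletions, +1 for insertions."""
    for target in HREF.findall(html):
        yield target, (source, sign)

def summary_delta(deleted, inserted):
    """Shuffle and reduce: sum signed counts per (target, source)."""
    pairs = []
    for source, html in deleted.items():
        pairs.extend(map_delta(source, html, -1))
    for source, html in inserted.items():
        pairs.extend(map_delta(source, html, +1))
    sums = defaultdict(lambda: defaultdict(int))
    for target, (source, n) in pairs:
        sums[target][source] += n
    # Keep only entries that actually change the view.
    return {
        target: {s: n for s, n in counts.items() if n != 0}
        for target, counts in sums.items()
        if any(n != 0 for n in counts.values())
    }

delta = summary_delta(DELETED, INSERTED)
```

For b.htm the three emitted values -1, -1, and +1 sum to -1, matching the reduce output of the example.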


Delta Installation Approaches

Increment Installation: Base deltas → MapReduce → Materialized view

Overwrite Installation: Base deltas + Materialized view → MapReduce → Materialized view
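The two installation approaches can be contrasted with in-memory dictionaries standing in for the materialized view. This is a hedged sketch based only on the diagram: the helper names, the sample view, and the removal of zero counts are assumptions, and a real deployment would write to HBase rather than to Python objects.

```python
from collections import Counter

def sample_view():
    """Hypothetical view before the update (counted reverse web-link graph)."""
    return {
        "a.htm": Counter({"b.htm": 1}),
        "b.htm": Counter({"a.htm": 2, "b.htm": 1}),
    }

# Summary delta for replacing a.htm (links b, b) by a new version (links b, a).
DELTA = {"a.htm": {"a.htm": 1}, "b.htm": {"a.htm": -1}}

def install_increment(view, delta):
    """Increment installation: apply each delta count to the view in place."""
    for target, counts in delta.items():
        row = view.setdefault(target, Counter())
        for source, n in counts.items():
            row[source] += n
            if row[source] == 0:
                del row[source]
        if not row:
            del view[target]
    return view

def install_overwrite(view, delta):
    """Overwrite installation: read view and delta together, write a fresh view."""
    new_view = {}
    for target in view.keys() | delta.keys():
        merged = Counter(view.get(target, {}))
        merged.update(delta.get(target, {}))  # update() adds counts, negative ones too
        merged = Counter({s: n for s, n in merged.items() if n != 0})
        if merged:
            new_view[target] = merged
    return new_view

incremented = install_increment(sample_view(), DELTA)
overwritten = install_overwrite(sample_view(), DELTA)
```

Both variants end in the same view. The increment variant touches only the rows named in the delta, while the overwrite variant reads the whole old view, so which one is cheaper plausibly depends on how large the delta is relative to the view.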


Case Study – Lessons Learned

• Numerical aggregation

  • Word histogram

  • URL access frequency

• Set aggregation

  • Reverse web-link graph

  • Inverted index

• Multiset aggregation

  • Term-vector per host


General Solution

• Self-maintainable aggregates

• Computed in three steps

  • Translation

  • Grouping

  • Aggregation

    • commutative and associative binary function

    • inverse elements

    • Abelian group
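The three steps and the role of the Abelian group can be sketched generically. The `AbelianGroup` class, the `aggregate` and `maintain` functions, and the word-count example are hypothetical names chosen for this illustration. Integers under addition are used so that inverse elements exist.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class AbelianGroup:
    identity: Any
    combine: Callable[[Any, Any], Any]  # commutative and associative
    inverse: Callable[[Any], Any]

INT_ADD = AbelianGroup(0, lambda a, b: a + b, lambda a: -a)

def aggregate(group, translate, records):
    """Translation, grouping, and aggregation over a collection of records."""
    result = {}
    for record in records:
        for key, value in translate(record):  # translation
            current = result.get(key, group.identity)  # grouping by key
            result[key] = group.combine(current, value)  # aggregation
    return result

def maintain(group, translate, view, inserted, deleted):
    """Update the view from the deltas alone, without re-reading base data."""
    new_view = dict(view)
    for key, value in aggregate(group, translate, inserted).items():
        new_view[key] = group.combine(new_view.get(key, group.identity), value)
    for key, value in aggregate(group, translate, deleted).items():
        new_view[key] = group.combine(
            new_view.get(key, group.identity), group.inverse(value)
        )
    return {k: v for k, v in new_view.items() if v != group.identity}

def words(doc):
    """Translation function for a word histogram: doc -> (word, 1) pairs."""
    return ((w, 1) for w in doc.split())

view = aggregate(INT_ADD, words, ["a b a", "b c"])
updated = maintain(INT_ADD, words, view, inserted=["a c"], deleted=["a b a"])
recomputed = aggregate(INT_ADD, words, ["b c", "a c"])
```

Commutativity and associativity let the deltas be combined in any order, and the inverse is what allows a deletion to be expressed as one more value to combine.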


Case Study – Lessons Learned

• Numerical aggregation

  • Word histogram

    Translation function: Translate web pages into (word, 1)

    Aggregation function: Abelian group (Natural numbers, +)

  • URL access frequency

• Set aggregation

  • Reverse web-link graph

    Translation function: Translate web pages into (link target, link source)

    Aggregation function: Abelian group (Power-multiset of URLs, multiset union)

  • Inverted index

• Multiset aggregation

  • Term-vector per host


Evaluation

[Figure: four plots (Word histogram, Reverse web-link graph, URL access frequency, Term-vector per host). x-axis: Updates in base documents [%], 0 to 100. y-axis: Elapsed time [min], ticks at 1, 10, 100 (4, 40, 400 for the reverse web-link graph).]


Conclusion & Future Work

• View Maintenance in MapReduce

  • Case study

  • Summary delta algorithm

  • Self-maintainable aggregations

• Future Work

  • Broader class of MapReduce programs

  • High-level MapReduce languages, e.g. Jaql or Pig Latin