Map-Side Merge Joins for Scalable SPARQL BGP Processing

23
Map - Side Merge Joins for Scalable SPARQL BGP Processing IEEE CloudCom 2013 Bristol, UK Martin Przyjaciel-Zablocki Alexander Schätzle Eduard Skaley Thomas Hornung Georg Lausen University of Freiburg Databases & Information Systems December 4th, 2013

Transcript of Map-Side Merge Joins for Scalable SPARQL BGP Processing

Map-Side Merge Joins for Scalable SPARQL BGP Processing

IEEE CloudCom 2013Bristol, UK

Martin Przyjaciel-Zablocki Alexander SchätzleEduard SkaleyThomas Hornung Georg Lausen

University of FreiburgDatabases & Information Systems

December 4th, 2013

Map-Side Merge Joins for Scalable SPARQL BGP Processing 2

Motivation

Front-EndArchitecture

Storage

Back-EndArchitecture

Map-Side Merge Joins for Scalable SPARQL BGP Processing 3

Motivation

Front-EndArchitecture

Storage

Back-EndArchitecture

Online

Real-time (~ ms), interactive (~ s)

Low-latency

Fast, simple queries

Accessing precomputed (offline) data

Avoid Joins (costly)

Map-Side Merge Joins for Scalable SPARQL BGP Processing 4

Motivation

Front-EndArchitecture

Storage

Back-EndArchitecture

Online

Off

line

Real-time (~ ms), interactive (~ s)

Low-latency

Fast, simple queries

Accessing precomputed (offline) data

Avoid Joins (costly)

Batch processing (≥ h)

Scalability, Fault Tolerance

Long-running, complex queries

Joins between large datasets common

Used in many DBMS due to stable & efficient performance

What we present in our paper:

◦ Sort-Merge Join for RDF data with MapReduce

◦ Support for n-way joins

◦ Support for cascaded execution

Map-Side Merge Joins for Scalable SPARQL BGP Processing 5

Sort-Merge Join

L

a a

a b

b a

b b

c a

c b

...

sorted

by jo

in key

R

sorted

by jo

in key

L R

a b

a c

b a

b c

c c

c d

...

Approach:

◦ Divide Join task into N independent subtasks

◦ Each subtask is processed by exactly one machine

Map-Side Merge Joins for Scalable SPARQL BGP Processing 6

Sort-Merge Joins with MapReduce

L1[a … c)

L2

[c … e)

LN[x … z]

sorted

by jo

in key

R1

[a … c)

R2[c … e)

…RN

[x … z]

sorted

by jo

in key

L1 R1

L2 R2

LN RN

Challenges:

◦ Datasets must be sorted according to the join key

Map-Side Merge Joins for Scalable SPARQL BGP Processing 7

Sort-Merge Joins with MapReduce

L1[a … c)

L2

[c … e)

LN[x … z]

sorted

by jo

in key

R1

[a … c)

R2[c … e)

…RN

[x … z]

sorted

by jo

in key

L1 R1

L2 R2

LN RN

Challenges:

◦ Datasets must be sorted according to the join key

◦ Datasets must be split into N partitions

Map-Side Merge Joins for Scalable SPARQL BGP Processing 8

Sort-Merge Joins with MapReduce

L1[a … c)

L2

[c … e)

LN[x … z]

sorted

by jo

in key

R1

[a … c)

R2[c … e)

…RN

[x … z]

sorted

by jo

in key

L1 R1

L2 R2

LN RN

Challenges:

◦ Datasets must be sorted according to the join key

◦ Datasets must be split into N partitions

◦ Same join key values must be processed on the same machine i.e. they must be in the same partition

Map-Side Merge Joins for Scalable SPARQL BGP Processing 9

Sort-Merge Joins with MapReduce

L1

L2

[c … e)

LN[x … z]

sorted

by jo

in key

R1

R2[c … e)

…RN

[x … z]

sorted

by jo

in key

L1 R1

L2 R2

LN RN

[a … c) [a … c)

Challenges:

◦ Datasets must be sorted according to the join key

◦ Datasets must be split into N partitions

◦ Same join key values must be processed on the same machine i.e. they must be in the same partition

◦ Cascaded joins: Output must be sorted & partitioned according to the requirements of the following join

Map-Side Merge Joins for Scalable SPARQL BGP Processing 10

Sort-Merge Joins with MapReduce

L1[a … c)

L2

[c … e)

LN[x … z]

sorted

by jo

in key

R1

[a … c)

R2[c … e)

…RN

[x … z]

sorted

by jo

in key

L1 R1

L2 R2

LN RN

Main Contributions:

(1) RDF data store that guarantees that one side is already presortedand partitioned we only have to sort & split the output

(2) Merge Join is processed in the Map phase Reduce phase used to postprocess the output for the next join

(3) Dynamic Bloom Filter to remove dangling intermediate results

Map-Side Merge Joins for Scalable SPARQL BGP Processing 11

Sort-Merge Joins with MapReduce

L1[a … c)

L2

[c … e)

LN[x … z]

sorted

by jo

in key

R1

[a … c)

R2[c … e)

…RN

[x … z]

sorted

by jo

in key

L1 R1

L2 R2

LN RN

Linear scaling of query runtimes

Average performance benefit of Merge Joins compared to theother approaches: 15 – 48 %

Map-Side Merge Joins for Scalable SPARQL BGP Processing 12

Experiments

Map-Side Merge Joins for Scalable SPARQL BGP Processing

Thank you

13

Contact information:

Martin Przyjaciel-ZablockiAlexander Schätzlezablocki|[email protected] of Freiburg

Discussion Slides1. RDF Data Store Layout

2. Map-Side Merge Join

3. Experiments

Map-Side Merge Joins for Scalable SPARQL BGP Processing 14

Map-Side Merge Joins for Scalable SPARQL BGP Processing 15

RDF Data Store Layout (1)

Map-Side Merge Joins for Scalable SPARQL BGP Processing 16

RDF Data Store Layout (2)

Map-Side Merge Joins for Scalable SPARQL BGP Processing 17

RDF Data Store Layout (3)

Map-Side Merge Joins for Scalable SPARQL BGP Processing 18

Map-Side Merge Join (1)

Map-Side Merge Joins for Scalable SPARQL BGP Processing 19

Map-Side Merge Join (2)

Map-Side Merge Joins for Scalable SPARQL BGP Processing 20

Experiments (1)

Map-Side Merge Joins for Scalable SPARQL BGP Processing 21

Experiments (2)

Map-Side Merge Joins for Scalable SPARQL BGP Processing 22

Experiments (3)

Map-Side Merge Joins for Scalable SPARQL BGP Processing 23

Experiments (4)