A Main Memory Index Structure to Query Linked Data

A Main Memory Index Structure to Query Linked Data Olaf Hartig http://olafhartig.de/foaf.rdf#olaf Frank Huber Database and Information Systems Research Group Humboldt-Universität zu Berlin

description

These are the slides with which I presented my paper at the Linked Data on the Web workshop (LDOW) at WWW 2011 in Hyderabad, India.

Transcript of A Main Memory Index Structure to Query Linked Data

Page 1: A Main Memory Index Structure to Query Linked Data

A Main Memory Index Structure to Query Linked Data

Olaf Hartig
http://olafhartig.de/foaf.rdf#olaf

Frank Huber

Database and Information Systems Research Group
Humboldt-Universität zu Berlin

Page 2: A Main Memory Index Structure to Query Linked Data

Olaf Hartig - A Main Memory Index Structure to Query Linked Data 2

The Issue

[Figure: hit rate, query execution time (in seconds), and number of query results for the queries ContactInfoPhillipe, UnsetPropsPhillipe, 2ndDegree1Phillipe, 2ndDegree2Phillipe, and IncomingPhillipe (Query No. 36 to 40), comparing "no reuse" with "given order"]

Page 3: A Main Memory Index Structure to Query Linked Data

The Issue

[Figure: hit rate, query execution time (in seconds), and number of query results for Query No. 36 to 40, comparing "no reuse" with "given order"]

Descriptor objects in the query-local dataset after query execution: 172 and 533

Page 4: A Main Memory Index Structure to Query Linked Data

What data structure do we use to physically represent the query-local dataset?

[Diagram: the query-local dataset as the logical representation of Linked Data from the Web, next to the physical representation of Linked Data from the Web, which is marked "?"]

Page 5: A Main Memory Index Structure to Query Linked Data

Outline

1. Requirements + Existing Work

2. Data Structures

3. Evaluation

Page 6: A Main Memory Index Structure to Query Linked Data

Requirements

● (Consecutively) build and use ad hoc collections of many small sets of RDF triples

● Four main operations:
  ● Find … matching triples for a triple pattern in all descriptor objects
  ● Add, Remove, Replace … descriptor objects

● Support of concurrent access (i.e., isolation)

● Non-relevant properties:
  ● Querying descriptor objects individually is not necessary
  ● No need to write data back to the Web
  ● ACID properties not required for complete queries
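The four operations can be read as a small interface. The following is only a naive, unindexed sketch of that interface, not the data structure from the paper: the class name, the use of URLs as descriptor object keys, the `None` wildcard, and the single coarse lock for isolation are all assumptions made for illustration.

```python
import threading
from typing import Dict, FrozenSet, Iterable, Optional, Set, Tuple

Triple = Tuple[str, str, str]
# None in a pattern position stands for a variable (assumption of this sketch).
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


class NaiveQueryLocalDataset:
    """Unindexed baseline for Find, Add, Remove, Replace (illustrative only)."""

    def __init__(self) -> None:
        # Hypothetical keying: one descriptor object per source URL.
        self._objects: Dict[str, FrozenSet[Triple]] = {}
        self._lock = threading.Lock()  # coarse-grained isolation

    def add(self, url: str, triples: Iterable[Triple]) -> None:
        with self._lock:
            self._objects[url] = frozenset(triples)

    def remove(self, url: str) -> None:
        with self._lock:
            self._objects.pop(url, None)

    def replace(self, url: str, triples: Iterable[Triple]) -> None:
        # The whole set is swapped under the lock, so a reader sees either
        # the old or the new descriptor object, never a mixture.
        with self._lock:
            self._objects[url] = frozenset(triples)

    def find(self, pattern: Pattern) -> Set[Triple]:
        # Take a consistent snapshot, then scan every descriptor object.
        with self._lock:
            snapshot = list(self._objects.values())
        return {
            t
            for triples in snapshot
            for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))
        }
```

Find scans every triple of every descriptor object here, which is exactly the cost an index is meant to avoid.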

Page 8: A Main Memory Index Structure to Query Linked Data

Existing Work

● Disk-based storage solutions for RDF data
  ● Unsuitable due to very costly I/O operations

● Main-memory-based data structures in the literature
  ● Focus on a large, single set of RDF triples
  ● Optimized for complete graph pattern queries or path queries

● Main-memory-based data structures in RDF frameworks
  ● Focus on Jena, ARQ and NG4J
  ● Inefficient (see evaluation)

Page 10: A Main Memory Index Structure to Query Linked Data

Hash-Based Index for RDF Data

[Diagram: logical representation and physical representation; the physical representation shows Dict and the hash tables S, P, O, SP, SO, PO]

● Dictionary:
  ● Two-way mapping between RDF terms and numerical identifiers

● 6 hash tables:
  ● Each hash table contains all ID-encoded triples
  ● Efficient support for all types of triple patterns

* Similar to Harth and Decker, 2005
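A minimal sketch of a dictionary plus six hash tables is given here, assuming Python `dict` objects stand in for the hash tables (the implementation described in the paper uses its own bucket arrays). The class and method names, and the convention that `None` marks a variable, are assumptions of this sketch.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

IdTriple = Tuple[int, int, int]

# Triple positions (0=subject, 1=predicate, 2=object) that key each table.
KEYS = {"S": (0,), "P": (1,), "O": (2,), "SP": (0, 1), "SO": (0, 2), "PO": (1, 2)}


class Dictionary:
    """Two-way mapping between RDF terms and numerical identifiers."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self._terms: List[str] = []

    def encode(self, term: str) -> int:
        if term not in self._ids:
            self._ids[term] = len(self._terms)
            self._terms.append(term)
        return self._ids[term]

    def lookup(self, term: str) -> Optional[int]:
        return self._ids.get(term)

    def decode(self, ident: int) -> str:
        return self._terms[ident]


class HashIndex:
    def __init__(self) -> None:
        self.dictionary = Dictionary()
        self.tables = {name: defaultdict(set) for name in KEYS}

    def add(self, s: str, p: str, o: str) -> None:
        enc = self.dictionary.encode
        t: IdTriple = (enc(s), enc(p), enc(o))
        # Every table receives every ID-encoded triple, under a different key.
        for name, positions in KEYS.items():
            self.tables[name][tuple(t[i] for i in positions)].add(t)

    def find(self, s=None, p=None, o=None) -> Set[Tuple[str, str, str]]:
        ids = []
        for term in (s, p, o):
            ident = None if term is None else self.dictionary.lookup(term)
            if term is not None and ident is None:
                return set()  # a term the dictionary has never seen cannot match
            ids.append(ident)
        bound = [i for i, v in enumerate(ids) if v is not None]
        if not bound:
            candidates = set().union(*self.tables["S"].values())
        else:
            positions = bound[:2]  # a fully bound pattern uses SP, then filters
            name = "".join("SPO"[i] for i in positions)
            candidates = self.tables[name].get(tuple(ids[i] for i in positions), set())
        return {
            tuple(self.dictionary.decode(i) for i in t)
            for t in candidates
            if all(v is None or v == t[i] for i, v in enumerate(ids))
        }
```

A pattern with bound subject and predicate, for example `find(s="http://bob.name", p="knows")`, is answered with one lookup in the SP table.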

Page 11: A Main Memory Index Structure to Query Linked Data

Hash-Based Index for RDF Data

ID-encoded triple: $t_{id} = (id_s, id_p, id_o)$

Page 12: A Main Memory Index Structure to Query Linked Data

Hash-Based Index for RDF Data

Find ( http://bob.name , knows , ?acq )

Page 13: A Main Memory Index Structure to Query Linked Data

Individual Indexing

[Diagram: the query-local dataset (logical representation) and its physical representation, with Dict and a separate set of hash tables (S, P, O, SP, SO, PO) for each descriptor object]

● Idea: Index each descriptor object separately

● Implementation of the four operations:
  ● Add, Remove, and Replace are straightforward
  ● Find requires iterating over all indexes
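The trade-off of individual indexing can be sketched as follows. This is an illustration under assumptions: triples are taken to be already ID-encoded tuples, Python dictionaries replace the hash tables, `None` marks a variable, and all names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Key positions for the six tables: (0,), (1,), (2,), (0, 1), (0, 2), (1, 2),
# i.e. S, P, O, SP, SO, PO.
KEYS = [c for n in (1, 2) for c in combinations(range(3), n)]


def build_index(triples):
    """Build the six tables for the triples of one descriptor object."""
    index = {k: defaultdict(set) for k in KEYS}
    for t in triples:
        for k in KEYS:
            index[k][tuple(t[i] for i in k)].add(t)
    return index


def find_in_index(index, pattern):
    bound = tuple(i for i, v in enumerate(pattern) if v is not None)
    if not bound:
        candidates = set().union(*index[(0,)].values())
    else:
        k = bound[:2]
        candidates = index[k].get(tuple(pattern[i] for i in k), set())
    return {t for t in candidates
            if all(v is None or v == t[i] for i, v in enumerate(pattern))}


class IndividuallyIndexedDataset:
    def __init__(self):
        self.indexes = {}  # descriptor object ID -> its own index

    # Add, Remove, and Replace touch exactly one index.
    def add(self, obj_id, triples):
        self.indexes[obj_id] = build_index(triples)

    def remove(self, obj_id):
        self.indexes.pop(obj_id, None)

    def replace(self, obj_id, triples):
        self.indexes[obj_id] = build_index(triples)

    # Find has to consult every index, so its cost grows with the
    # number of descriptor objects.
    def find(self, pattern):
        result = set()
        for index in self.indexes.values():
            result |= find_in_index(index, pattern)
        return result
```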

Page 14: A Main Memory Index Structure to Query Linked Data

Individual Indexing

Find ( http://bob.name , knows , ?acq )

Page 15: A Main Memory Index Structure to Query Linked Data

Combined Indexing

[Diagram: the query-local dataset (logical representation) and its physical representation, with Dict and one set of hash tables (S, P, O, SP, SO, PO)]

● Idea: Use a single index for all descriptor objects
  ● src: maps each triple to a set of descriptor object IDs

Page 16: A Main Memory Index Structure to Query Linked Data

Combined Indexing

$t_{id} = (id_s, id_p, id_o)$

Page 17: A Main Memory Index Structure to Query Linked Data

Combined Indexing

$t_{id} = (id_s, id_p, id_o)$ + $src(t_{id}) = \{\ldots, \ldots\}$ (a set of descriptor object IDs)
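The role of src can be sketched as follows: a triple stays in the single index as long as at least one descriptor object still contains it. This is a simplified illustration, not the SQUIN implementation. It assumes ID-encoded tuples, Python dictionaries in place of hash tables, no concurrency handling, and a hypothetical `contents` helper map that remembers which triples each descriptor object contributed.

```python
from collections import defaultdict
from itertools import combinations

# Key positions for S, P, O, SP, SO, PO.
KEYS = [c for n in (1, 2) for c in combinations(range(3), n)]


class CombinedIndex:
    def __init__(self):
        self.tables = {k: defaultdict(set) for k in KEYS}
        self.src = defaultdict(set)  # triple -> set of descriptor object IDs
        self.contents = {}           # helper for removal (assumption of this sketch)

    def add(self, obj_id, triples):
        triples = set(triples)
        self.contents[obj_id] = triples
        for t in triples:
            if not self.src[t]:  # first source: insert into all six tables
                for k in KEYS:
                    self.tables[k][tuple(t[i] for i in k)].add(t)
            self.src[t].add(obj_id)

    def remove(self, obj_id):
        for t in self.contents.pop(obj_id, set()):
            self.src[t].discard(obj_id)
            if not self.src[t]:  # last source gone: drop the triple everywhere
                del self.src[t]
                for k in KEYS:
                    key = tuple(t[i] for i in k)
                    self.tables[k][key].discard(t)
                    if not self.tables[k][key]:
                        del self.tables[k][key]

    def replace(self, obj_id, triples):
        self.remove(obj_id)
        self.add(obj_id, triples)

    def find(self, pattern):
        # One lookup in one table, independent of the number of descriptor objects.
        bound = tuple(i for i, v in enumerate(pattern) if v is not None)
        if not bound:
            candidates = set().union(*self.tables[(0,)].values())
        else:
            k = bound[:2]
            candidates = self.tables[k].get(tuple(pattern[i] for i in k), set())
        return {t for t in candidates
                if all(v is None or v == t[i] for i, v in enumerate(pattern))}
```

Compared with one index per descriptor object, Find no longer iterates, but Add and Remove have to maintain src for every triple.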

Page 18: A Main Memory Index Structure to Query Linked Data

Quad Indexing

[Diagram: the query-local dataset (logical representation) and its physical representation, with Dict and one set of hash tables (S, P, O, SP, SO, PO)]

● Idea: Use a single quad index for all descriptor objects
  ● quad = ID-encoded triple + descriptor object ID

$q = ((id_s, id_p, id_o), \ldots)$, where the second component is the descriptor object ID
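A quad index can be sketched by storing (triple, descriptor object ID) pairs directly in the tables, so that no separate source map has to be maintained when loading. The code below is a hedged illustration: it assumes ID-encoded tuples and Python dictionaries as hash tables, and the `quads_of` helper used for removal is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Key positions for S, P, O, SP, SO, PO (taken from the triple part of a quad).
KEYS = [c for n in (1, 2) for c in combinations(range(3), n)]


class QuadIndex:
    def __init__(self):
        self.tables = {k: defaultdict(set) for k in KEYS}
        self.quads_of = defaultdict(set)  # helper for removal (assumption)

    def add(self, obj_id, triples):
        for t in triples:
            q = (t, obj_id)  # quad = ID-encoded triple + descriptor object ID
            self.quads_of[obj_id].add(q)
            for k in KEYS:
                self.tables[k][tuple(t[i] for i in k)].add(q)

    def remove(self, obj_id):
        for q in self.quads_of.pop(obj_id, set()):
            t = q[0]
            for k in KEYS:
                key = tuple(t[i] for i in k)
                bucket = self.tables[k].get(key)
                if bucket is not None:
                    bucket.discard(q)
                    if not bucket:
                        del self.tables[k][key]

    def replace(self, obj_id, triples):
        self.remove(obj_id)
        self.add(obj_id, triples)

    def find(self, pattern):
        bound = tuple(i for i, v in enumerate(pattern) if v is not None)
        if not bound:
            candidates = set().union(*self.tables[(0,)].values())
        else:
            k = bound[:2]
            candidates = self.tables[k].get(tuple(pattern[i] for i in k), set())
        # The same triple may occur in several quads; the result set
        # reports it once.
        return {t for (t, _obj) in candidates
                if all(v is None or v == t[i] for i, v in enumerate(pattern))}
```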

Page 20: A Main Memory Index Structure to Query Linked Data

Experiment Setup

Does this affect the overall execution time for link-traversal-based query executions?

Page 21: A Main Memory Index Structure to Query Linked Data

Experiment Setup

● Simulation of the Web of Data
  ● Linked Data server publishes BSBM dataset (scaling factor: 50)
  ● Adjusted BSBM queries link to the simulation server

● Experiment:
  ● Sequence of 200 query mixes
  ● Reuse of the query-local dataset for the whole sequence
  ● IndIR, CombIR, and QuadIR (as presented), engine: SQUIN
  ● NamedGraphSetImpl (NG4J/Jena), engine: SemWeb Client

Page 22: A Main Memory Index Structure to Query Linked Data

Execution Time

[Figure: execution time in seconds (y-axis, 0 to 80) per query mix (x-axis, 0 to 200) for NG4J (SWClLib), IndIR (m=4), CombIR (m=12), and CombQuadIR (m=12). Second plot: overall number of descriptor objects in the queried dataset (y-axis, 0 to 2500) per query mix (x-axis, 0 to 200)]

Page 24: A Main Memory Index Structure to Query Linked Data

Summary

● Three hash-index-based data structures:
  ● Individual indexing
  ● Combined indexing
  ● Quad indexing

● Findings:
  ● A single index improves query performance significantly
  ● Shorter load times with quads

● Also applicable to other use cases of ad hoc storage of Linked Data that is
  ● Consecutively retrieved from remote sources
  ● Used for immediate local processing

Page 25: A Main Memory Index Structure to Query Linked Data

Backup Slides

Page 26: A Main Memory Index Structure to Query Linked Data

Combined Indexing

[Diagram: the query-local dataset (logical representation) and its physical representation, with Dict and one set of hash tables (S, P, O, SP, SO, PO)]

● Idea: Use a single index for all descriptor objects
  ● src: maps each triple to a set of descriptor object IDs

● Implementation of the four operations requires:
  ● status: maps each descriptor object ID to a status (BeingIndexed, Indexed, ToBeRemoved, BeingRemoved)
  ● Find reports a triple t only if: ∃d ∈ src(t) : status(d) = Indexed
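The visibility rule for Find can be written down directly. The sketch below only illustrates the condition ∃d ∈ src(t) : status(d) = Indexed; the function names and the use of plain dictionaries for src and status are assumptions, and the actual locking and status transitions are not modeled.

```python
from enum import Enum


class Status(Enum):
    BEING_INDEXED = "BeingIndexed"
    INDEXED = "Indexed"
    TO_BE_REMOVED = "ToBeRemoved"
    BEING_REMOVED = "BeingRemoved"


def is_visible(triple, src, status):
    """True if at least one source of the triple has status Indexed."""
    return any(status.get(d) is Status.INDEXED for d in src.get(triple, ()))


def filter_visible(candidates, src, status):
    """Keep only those candidate triples that Find is allowed to report."""
    return {t for t in candidates if is_visible(t, src, status)}
```

A triple that belongs only to a descriptor object that is still being indexed, or that is already marked for removal, is skipped, which keeps half-finished Add, Remove, and Replace operations invisible to concurrent queries.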

Page 27: A Main Memory Index Structure to Query Linked Data

Implementation

● Available as Free Software (http://squin.org)

● Hash tables with $n = 2^m$ buckets

● Hash functions:
  ● $h_S(i_s, i_p, i_o) = i_s \mathbin{\&} \mathit{bitmask}[m]$
  ● $h_{SP}(i_s, i_p, i_o) = (i_s \bullet i_p) \mathbin{\&} \mathit{bitmask}[m]$
  ● etc.

● Which m? (see paper)
  ● m = 4 for the individual indexes
  ● m = 12 for the combined index
  ● m = 12 for the quad index
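The bitmask-based hash functions can be illustrated with a small bucket table. Two assumptions are made here and should be checked against the paper: `bitmask[m]` is taken to be the integer with the m lowest bits set, and the "•" in the SP hash function is read as integer multiplication. The class and helper names are hypothetical.

```python
def bitmask(m):
    """Integer with the m lowest bits set (assumed meaning of bitmask[m])."""
    return (1 << m) - 1


def h_s(i_s, i_p, i_o, m):
    return i_s & bitmask(m)


def h_sp(i_s, i_p, i_o, m):
    # Assumption: the "•" on the slide is read as integer multiplication.
    return (i_s * i_p) & bitmask(m)


class BucketTable:
    """Hash table with n = 2^m buckets, addressed by one of the hash functions."""

    def __init__(self, m, hash_fn):
        self.m = m
        self.hash_fn = hash_fn
        self.buckets = [[] for _ in range(1 << m)]

    def insert(self, triple):
        bucket = self.buckets[self.hash_fn(*triple, self.m)]
        if triple not in bucket:
            bucket.append(triple)


def find_by_subject(table, i_s):
    """Look up a subject in a table built with h_s.

    Different subjects can share a bucket, so entries are filtered."""
    bucket = table.buckets[h_s(i_s, 0, 0, table.m)]
    return [t for t in bucket if t[0] == i_s]
```

With m = 4 the table has 16 buckets, so subject IDs 3 and 19 fall into the same bucket and are separated only by the final filter; a larger m trades memory for fewer such collisions.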

Page 28: A Main Memory Index Structure to Query Linked Data

Experiment Setup

● Comparison without link-traversal-based query execution

● Compared data structures:
  ● Our implementation of IndIR, CombIR, CombQuadIR
  ● NamedGraphSetImpl in NG4J (Jena)
  ● DatasetGraphMap in ARQ (Jena)

● Berlin SPARQL Benchmark (BSBM)
  ● BSBM datasets partitioned into query-local datasets
  ● BSBM (v2.0) query mixes executed over these datasets

Page 29: A Main Memory Index Structure to Query Linked Data

Required Memory

BSBM scaling factor    Number of descriptor objects    Overall number of triples
 50                     2,599                           22,616
100                     4,178                           40,133
150                     5,756                           57,524
200                     7,329                           75,062
250                     9,873                           97,613
300                    11,455                          115,217
350                    13,954                          137,567
500                    18,687                          190,502

[Figure: estimated required memory in MB (y-axis, 0 to 120) over pc, the BSBM scaling factor (x-axis, 0 to 500), for ARQ, NG4J, IndIR (m=4), CombIR (m=12), and CombQuad (m=12)]

Page 30: A Main Memory Index Structure to Query Linked Data

Execution Time

[Figure: average execution time per query mix in seconds (y-axis, 0 to 2500) over pc, the BSBM scaling factor (x-axis, 0 to 500), for ARQ, NG4J, IndIR (m=4), CombIR (m=12), and CombQuad (m=12)]

Page 31: A Main Memory Index Structure to Query Linked Data

Load Time

[Figure: average creation time in seconds (y-axis, 0 to 10) over pc, the BSBM scaling factor (x-axis, 0 to 500), for CombIR (m=12), IndIR (m=4), and CombQuad (m=12)]

Page 32: A Main Memory Index Structure to Query Linked Data

Execution Time

[Figure: execution time in seconds (y-axis, 0 to 80) per query mix (x-axis, 0 to 200) for NG4J (SWClLib), IndIR (m=4), CombIR (m=12), and CombQuadIR (m=12). Second plot: query mix (x-axis, 0 to 25), y-axis 0 to 60, for CombIR (m=12) and CombQuadIR (m=12)]

Page 33: A Main Memory Index Structure to Query Linked Data

These slides have been created by Olaf Hartig

http://olafhartig.de

This work is licensed under a Creative Commons Attribution-Share Alike 3.0 License

(http://creativecommons.org/licenses/by-sa/3.0/)