gStore: Answering SPARQL Queries Via Subgraph Matching

Lei Zou (1), Jinghui Mo (1), Lei Chen (2), M. Tamer Özsu (3), Dongyan Zhao (1)
(1) Peking University, (2) Hong Kong University of Science and Technology, (3) University of Waterloo

Transcript of gStore: Answering SPARQL Queries Via Subgraph Matching

Page 1: gStore: Answering SPARQL Queries Via Subgraph Matching

gStore: Answering SPARQL Queries Via Subgraph Matching

Lei Zou (1), Jinghui Mo (1), Lei Chen (2), M. Tamer Özsu (3), Dongyan Zhao (1)

(1) Peking University, (2) Hong Kong University of Science and Technology, (3) University of Waterloo

Page 2: gStore: Answering SPARQL Queries Via Subgraph Matching

Outline

• Background & Related Work

• Overview of gStore

• Encoding Technique

• VS*-tree & Query Algorithm

• Experiments

• Conclusions

Page 3: gStore: Answering SPARQL Queries Via Subgraph Matching

Outline

• Background & Related Work

• Overview of gStore

• Encoding Technique

• VS*-tree & Query Algorithm

• Experiments

• Conclusions

Page 4: gStore: Answering SPARQL Queries Via Subgraph Matching

Semantic Web

“Semantic Web Technologies” is a collection of standard technologies to realize a Web of Data.

Page 6: gStore: Answering SPARQL Queries Via Subgraph Matching

RDF Graph

[Figure legend: Entity Vertex, Literal Vertex]

Page 7: gStore: Answering SPARQL Queries Via Subgraph Matching

SPARQL Queries

SPARQL Query:

Select ?name Where {
  ?m <hasName> ?name .
  ?m <BornOnDate> "1809-02-12" .
  ?m <DiedOnDate> "1865-04-15" .
}

Query Graph
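The idea of treating the query as a graph can be sketched in a few lines of Python. This is only an illustration, not gStore's algorithm: each triple pattern becomes one query edge, and a brute-force search maps query edges onto edges of a tiny RDF graph. The data triples and the helper names (`TRIPLES`, `QUERY`, `match`) are made up for this example.

```python
# Toy RDF graph as (subject, predicate, object) triples; the data is illustrative.
TRIPLES = [
    ("y:Abraham_Lincoln", "hasName", "Abraham Lincoln"),
    ("y:Abraham_Lincoln", "BornOnDate", "1809-02-12"),
    ("y:Abraham_Lincoln", "DiedOnDate", "1865-04-15"),
    ("y:Charles_Darwin", "hasName", "Charles Darwin"),
    ("y:Charles_Darwin", "BornOnDate", "1809-02-12"),
    ("y:Charles_Darwin", "DiedOnDate", "1882-04-19"),
]

# Query graph: one edge per triple pattern; terms starting with "?" are variables.
QUERY = [
    ("?m", "hasName", "?name"),
    ("?m", "BornOnDate", "1809-02-12"),
    ("?m", "DiedOnDate", "1865-04-15"),
]


def is_var(term):
    return term.startswith("?")


def match(patterns, triples, binding=None):
    """Yield every variable binding that maps all query edges onto graph edges."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        new = dict(binding)
        ok = True
        for q, d in zip(first, triple):
            if is_var(q):
                # Bind the variable, or check it against its earlier binding.
                if new.setdefault(q, d) != d:
                    ok = False
                    break
            elif q != d:
                ok = False
                break
        if ok:
            yield from match(rest, triples, new)


names = [b["?name"] for b in match(QUERY, TRIPLES)]
print(names)  # ['Abraham Lincoln']
```

The search is exponential in the number of patterns, which is why a large graph needs pruning and indexing.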

Page 8: gStore: Answering SPARQL Queries Via Subgraph Matching

Subgraph Match vs. SPARQL Queries

Page 9: gStore: Answering SPARQL Queries Via Subgraph Matching

Naïve Triple Store

SPARQL Query:

Select ?name Where {
  ?m <hasName> ?name .
  ?m <BornOnDate> "1809-02-12" .
  ?m <DiedOnDate> "1865-04-15" .
}

SQL:

Select T3.Object
From T as T1, T as T2, T as T3
Where T1.Predict = 'BornOnDate' and T1.Object = '1809-02-12'
  and T2.Predict = 'DiedOnDate' and T2.Object = '1865-04-15'
  and T3.Predict = 'hasName'
  and T1.Subject = T2.Subject and T2.Subject = T3.Subject

Too many Self-Joins
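The self-join pattern can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch, assuming a single three-column table `T(Subject, Predict, Object)` as on the slide; the rows are made up. The query selects `T3.Object`, because `?name` is the object of the `hasName` triple.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (Subject TEXT, Predict TEXT, Object TEXT)")
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?)",
    [
        ("y:Abraham_Lincoln", "hasName", "Abraham Lincoln"),
        ("y:Abraham_Lincoln", "BornOnDate", "1809-02-12"),
        ("y:Abraham_Lincoln", "DiedOnDate", "1865-04-15"),
        ("y:Charles_Darwin", "hasName", "Charles Darwin"),
        ("y:Charles_Darwin", "BornOnDate", "1809-02-12"),
        ("y:Charles_Darwin", "DiedOnDate", "1882-04-19"),
    ],
)

# One copy of T per triple pattern: three patterns need two self-joins.
SQL = """
SELECT T3.Object
FROM T AS T1, T AS T2, T AS T3
WHERE T1.Predict = 'BornOnDate' AND T1.Object = '1809-02-12'
  AND T2.Predict = 'DiedOnDate' AND T2.Object = '1865-04-15'
  AND T3.Predict = 'hasName'
  AND T1.Subject = T2.Subject AND T2.Subject = T3.Subject
"""
rows = conn.execute(SQL).fetchall()
print(rows)  # [('Abraham Lincoln',)]
```

Every additional triple pattern in the SPARQL query adds another copy of `T` to the join.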

Page 10: gStore: Answering SPARQL Queries Via Subgraph Matching

Existing Solutions

Three categories of solutions have been proposed to speed up query processing:

1. Property Table: Jena [K. Wilkinson et al., SWDB 03], …

2. Vertically Partitioned Solution: SW-store [D. J. Abadi et al., VLDB 07], …

3. Exhaustive-Indexing: RDF-3X [T. Neumann et al., VLDB 08], Hexastore [C. Weiss et al., VLDB 08], …

Page 11: gStore: Answering SPARQL Queries Via Subgraph Matching

Existing Solutions-Property Table

SPARQL Query:

Select ?name Where {
  ?m <hasName> ?name .
  ?m <BornOnDate> "1809-02-12" .
  ?m <DiedOnDate> "1865-04-15" .
}

SQL:

Select People.hasName
From People
Where People.BornOnDate = '1809-02-12' and People.DiedOnDate = '1865-04-15'

Reducing # of join steps
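For comparison, the property-table layout can be sketched with `sqlite3` as well. The `People` table and its column names follow the SQL on this slide; the `Subject` key column and the rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per entity, one column per property: no join is needed for this query.
conn.execute(
    "CREATE TABLE People ("
    "Subject TEXT PRIMARY KEY, hasName TEXT, BornOnDate TEXT, DiedOnDate TEXT)"
)
conn.executemany(
    "INSERT INTO People VALUES (?, ?, ?, ?)",
    [
        ("y:Abraham_Lincoln", "Abraham Lincoln", "1809-02-12", "1865-04-15"),
        ("y:Charles_Darwin", "Charles Darwin", "1809-02-12", "1882-04-19"),
    ],
)

rows = conn.execute(
    "SELECT hasName FROM People WHERE BornOnDate = ? AND DiedOnDate = ?",
    ("1809-02-12", "1865-04-15"),
).fetchall()
print(rows)  # [('Abraham Lincoln',)]
```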

Page 12: gStore: Answering SPARQL Queries Via Subgraph Matching

Existing Solutions-Vertically Partitioned Solution

Fast Merge Join

Page 13: gStore: Answering SPARQL Queries Via Subgraph Matching

Existing Solutions- Exhaustive-Indexing

Each SPARQL query statement can be translated into one “range query”.

SPARQL Query:

Select ?name Where {
  ?m <hasName> ?name .
  ?m <BornOnDate> "1809-02-12" .
  ?m <DiedOnDate> "1865-04-15" .
}

Range query & Merge Join
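A rough sketch of the range-query-plus-merge-join idea, assuming the index is simply a sorted list of permuted triples (real systems such as RDF-3X use compressed B+-trees over all six permutations). The data and the helper names `range_query` and `merge_join` are hypothetical.

```python
from bisect import bisect_left

TRIPLES = [
    ("y:Abraham_Lincoln", "hasName", "Abraham Lincoln"),
    ("y:Abraham_Lincoln", "BornOnDate", "1809-02-12"),
    ("y:Abraham_Lincoln", "DiedOnDate", "1865-04-15"),
    ("y:Charles_Darwin", "hasName", "Charles Darwin"),
    ("y:Charles_Darwin", "BornOnDate", "1809-02-12"),
    ("y:Charles_Darwin", "DiedOnDate", "1882-04-19"),
]

# Two of the six possible orderings, each kept sorted.
POS = sorted((p, o, s) for s, p, o in TRIPLES)
PSO = sorted((p, s, o) for s, p, o in TRIPLES)


def range_query(index, prefix):
    """Return the contiguous run of index entries that start with `prefix`."""
    start = bisect_left(index, prefix)
    end = start
    while end < len(index) and index[end][: len(prefix)] == prefix:
        end += 1
    return index[start:end]


def merge_join(left, right):
    """Intersect two sorted, duplicate-free lists in one linear pass."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out


# Each triple pattern becomes one range query; results come back sorted by subject.
born = [s for _, _, s in range_query(POS, ("BornOnDate", "1809-02-12"))]
died = [s for _, _, s in range_query(POS, ("DiedOnDate", "1865-04-15"))]
people = merge_join(born, died)
names = [o for s in people for _, _, o in range_query(PSO, ("hasName", s))]
print(names)  # ['Abraham Lincoln']
```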

Page 14: gStore: Answering SPARQL Queries Via Subgraph Matching

Some Limitations

1. Difficult to handle “wildcard queries”.

2. Difficult to handle updates.

Page 15: gStore: Answering SPARQL Queries Via Subgraph Matching

Outline

• Background & Related Work

• Overview of gStore

• Encoding Technique

• VS*-tree & Query Algorithm

• Experiments

• Conclusions

Page 16: gStore: Answering SPARQL Queries Via Subgraph Matching

Intuition of gStore

Finding Matches over a Large Graph is not a trivial task.

Page 17: gStore: Answering SPARQL Queries Via Subgraph Matching

Preliminaries

[Figure legend: Entity Vertex, Literal Vertex]

Page 18: gStore: Answering SPARQL Queries Via Subgraph Matching

Preliminaries

• RDF graph

Page 19: gStore: Answering SPARQL Queries Via Subgraph Matching

Preliminaries

• Query Graph

Page 20: gStore: Answering SPARQL Queries Via Subgraph Matching

Preliminaries

• match

Page 21: gStore: Answering SPARQL Queries Via Subgraph Matching

Preliminaries

• Problem definition

Page 22: gStore: Answering SPARQL Queries Via Subgraph Matching

Storage Schema in gStore

Encoding all neighbors into a “bit-string”, called a signature.

Page 23: gStore: Answering SPARQL Queries Via Subgraph Matching

Encoding Technique (1)

• |eSig(e).e| = M.

• We employ m different string hash functions Hi (i = 1, ..., m).

• For each hash function Hi, we set the (Hi(eLabel) MOD M)-th bit in eSig(e).e to ‘1’.

• Encoding eSig(e).n is the same:

– |eSig(e).n| = N

– n different hash functions
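The encoding rule above can be sketched as follows. The values of M, m, N and n and the salted SHA-256 used for the hash functions Hi are assumptions chosen for illustration; the slides do not say which hash functions or lengths gStore uses.

```python
import hashlib

M, m = 12, 2  # |eSig(e).e| and number of edge-label hash functions (assumed values)
N, n = 16, 2  # |eSig(e).n| and number of neighbor-label hash functions (assumed values)


def H(i, text):
    """Hypothetical i-th string hash function H_i: SHA-256 salted with i."""
    digest = hashlib.sha256(f"{i}:{text}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def encode(label, length, num_hashes):
    """Set the (H_i(label) MOD length)-th bit for each hash function H_i."""
    sig = 0
    for i in range(1, num_hashes + 1):
        sig |= 1 << (H(i, label) % length)
    return sig


def edge_signature(edge_label, neighbor_label):
    """Return eSig(e) as the pair (eSig(e).e, eSig(e).n)."""
    return encode(edge_label, M, m), encode(neighbor_label, N, n)


e_part, n_part = edge_signature("hasName", "Abraham Lincoln")
print(format(e_part, f"0{M}b"), format(n_part, f"0{N}b"))
```

Two hash functions may land on the same bit, so each part has between one and m (or n) bits set.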

Page 24: gStore: Answering SPARQL Queries Via Subgraph Matching

Encoding Technique (2)

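Example: the n-grams of “Abraham Lincoln” (“Abr”, “bra”, “rah”, “aha”, …) are hashed to bit strings, which are ORed together:

0000 0010 0000 0000
1000 0000 0000 0000
0000 0000 0100 0000
0000 0000 0000 0001
-------------------- OR
1000 0010 0100 0001

Edge signatures (in this example, 12 bits for the edge label followed by 16 bits for the neighbor) and their OR:

(hasName, “Abraham Lincoln”)     0010 0000 0000 1000 0010 0100 0001
(BornOnDate, “1809-02-12”)       0100 0000 0000 0100 0010 0100 1000
(DiedOnDate, “1865-04-15”)       0000 1000 0000 0000 0010 0100 0000
(DiedIn, “y:Washington_D.c”)     0000 0010 0000 1000 0010 0100 0001
------------------------------------------------------------------- OR
                                 0110 1010 0000 1100 0010 0100 1001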
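The OR steps on this slide can be checked mechanically. The bit strings below are copied from the slide; the `may_match` helper is an assumption that illustrates the usual signature-containment filter (a query signature can only match a vertex whose signature has all of its bits set) and is not taken from the transcript.

```python
def bits(s):
    """Parse a space-separated bit string into an integer."""
    return int(s.replace(" ", ""), 2)


# Hashed n-grams of "Abraham Lincoln", as listed on the slide.
NGRAMS = [
    "0000 0010 0000 0000",
    "1000 0000 0000 0000",
    "0000 0000 0100 0000",
    "0000 0000 0000 0001",
]
n_part = 0
for g in NGRAMS:
    n_part |= bits(g)

# Edge signatures of the four adjacent edges, as listed on the slide.
EDGES = {
    "hasName": "0010 0000 0000 1000 0010 0100 0001",
    "BornOnDate": "0100 0000 0000 0100 0010 0100 1000",
    "DiedOnDate": "0000 1000 0000 0000 0010 0100 0000",
    "DiedIn": "0000 0010 0000 1000 0010 0100 0001",
}
vertex = 0
for s in EDGES.values():
    vertex |= bits(s)


def may_match(query_sig, vertex_sig):
    """Assumed filter: every bit set in the query must be set in the vertex."""
    return query_sig & vertex_sig == query_sig


print(format(n_part, "016b"))  # 1000001001000001
print(format(vertex, "028b"))  # 0110101000001100001001001001
```

A query vertex with only the BornOnDate and DiedOnDate edges passes `may_match` against this vertex; a signature with a bit outside the vertex signature is filtered out.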

Page 25: gStore: Answering SPARQL Queries Via Subgraph Matching

Encoding Technique (3)

Page 26: gStore: Answering SPARQL Queries Via Subgraph Matching

Encoding Technique (4)

Page 27: gStore: Answering SPARQL Queries Via Subgraph Matching

Encoding Technique (5)

Page 28: gStore: Answering SPARQL Queries Via Subgraph Matching

Outline

• Background & Related Work

• Overview of gStore

• Encoding Technique

• VS*-tree & Query Algorithm

• Experiments

• Conclusions

Page 29: gStore: Answering SPARQL Queries Via Subgraph Matching

A Straightforward Solution (1)

[Figure: candidate lists L1 = {001, 004, 006} for u1 and L2 = {002, 003, 006} for u2.]

Page 30: gStore: Answering SPARQL Queries Via Subgraph Matching

A Straightforward Solution (2)

[Figure: joining candidate lists L1 = {001, 004, 006} and L2 = {002, 003, 006}.]

Large join space!
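A minimal sketch of why the join space is large, assuming the straightforward solution joins the two candidate lists pairwise and keeps the pairs connected by an edge in the data graph. The lists L1 and L2 are taken from the slide; the `EDGES` set is made up for illustration.

```python
from itertools import product

L1 = ["001", "004", "006"]  # candidates for query vertex u1 (from the slide)
L2 = ["002", "003", "006"]  # candidates for query vertex u2 (from the slide)

# Hypothetical data-graph edges between candidate vertices.
EDGES = {("001", "002"), ("004", "006")}


def join(list1, list2, edges):
    """Check every pair of candidates; return the matches and the pairs checked."""
    checked = 0
    matches = []
    for v1, v2 in product(list1, list2):
        checked += 1
        if (v1, v2) in edges:
            matches.append((v1, v2))
    return matches, checked


matches, checked = join(L1, L2, EDGES)
print(matches, checked)  # [('001', '002'), ('004', '006')] 9
```

All |L1| × |L2| pairs are inspected even though only two of them match, and the product grows with every additional query vertex.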

Page 31: gStore: Answering SPARQL Queries Via Subgraph Matching

VS-tree

Page 32: gStore: Answering SPARQL Queries Via Subgraph Matching

VS-Tree query definition

Page 33: gStore: Answering SPARQL Queries Via Subgraph Matching

Pruning Technique

[Figure: query vertices u1 and u2; VS-tree nodes d1, d2 and d4 at level 3 (G3); signature 10010; candidate vertices 001, 004, 006 and 002, 003, 006 in G*.]

Reduced Join Space!

Page 34: gStore: Answering SPARQL Queries Via Subgraph Matching

Query Algorithm-Top-Down

Page 35: gStore: Answering SPARQL Queries Via Subgraph Matching

Optimized method

• Too many super edges

• Which level to start search

• No brute-force enumeration

Page 36: gStore: Answering SPARQL Queries Via Subgraph Matching

VS*-Tree Insert

• The insertion criterion in the VS-tree depends only on the Hamming distance between the signature of u and that of the node in VS-tree.

• The criterion in VS*-tree depends on both node signatures and G*’s structure.
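The Hamming-distance criterion of the plain VS-tree can be sketched as below. Signatures are modeled as Python integers, and the function names are hypothetical. The structure-aware criterion of the VS*-tree is not modeled, because the transcript does not give its formula.

```python
def hamming(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def choose_child(child_signatures, u_signature):
    """Pick the index of the child node whose signature is closest to u's."""
    return min(
        range(len(child_signatures)),
        key=lambda i: hamming(child_signatures[i], u_signature),
    )


children = [0b1100, 0b0011]
u = 0b0111
print(choose_child(children, u))  # 1
```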

Page 37: gStore: Answering SPARQL Queries Via Subgraph Matching

Updates- Insertion in G*

Page 38: gStore: Answering SPARQL Queries Via Subgraph Matching

Updates- Insertion in VS*-tree

Page 39: gStore: Answering SPARQL Queries Via Subgraph Matching

VS*-Tree split

• The B+1 entries of the node are partitioned into two new nodes, where B is the maximal fanout of a node in VS*-tree.

• 1. We find the two entries that have the maximal Hamming distance between them and use them as the two seed nodes.

• 2. We associate each remaining entry with the nearest seed node, according to Equation 1.
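The two split steps can be sketched as follows. Equation 1 is not included in this transcript, so plain Hamming distance is used here as a stand-in for the distance measure; the function names and sample signatures are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def split(entries):
    """Partition the entries of an overflowing node into two new nodes."""
    # Step 1: the two entries with maximal Hamming distance become the seeds.
    seed_a, seed_b = max(
        combinations(range(len(entries)), 2),
        key=lambda p: hamming(entries[p[0]], entries[p[1]]),
    )
    left, right = [entries[seed_a]], [entries[seed_b]]
    # Step 2: every remaining entry joins the nearer seed (ties go left).
    for i, e in enumerate(entries):
        if i in (seed_a, seed_b):
            continue
        if hamming(e, entries[seed_a]) <= hamming(e, entries[seed_b]):
            left.append(e)
        else:
            right.append(e)
    return left, right


left, right = split([0b0000, 0b0001, 0b1111, 0b1110, 0b0011])
print([format(x, "04b") for x in left], [format(x, "04b") for x in right])
# ['0000', '0001', '0011'] ['1111', '1110']
```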

Page 40: gStore: Answering SPARQL Queries Via Subgraph Matching

VS*-Tree deletion

• Similar to split.

• If some node d has fewer than b entries, where b is the minimal fanout of a node in VS*-tree, then d is deleted and its entries are reinserted into VS*-tree.

Page 41: gStore: Answering SPARQL Queries Via Subgraph Matching

Updates- Deletion in VS*-tree

To be deleted

Page 42: gStore: Answering SPARQL Queries Via Subgraph Matching

Which Level To Begin

• We introduce the “pruning power” of GI with regard to Q*, denoted as P(Q*, GI).

Page 43: gStore: Answering SPARQL Queries Via Subgraph Matching

Estimate P(Q*,GI)

Page 44: gStore: Answering SPARQL Queries Via Subgraph Matching

Finding Valid Child States

• We propose a DFS strategy to find all valid child states of J.

• Start a DFS over G* beginning from some vertex vi.

Page 45: gStore: Answering SPARQL Queries Via Subgraph Matching

Page 46: gStore: Answering SPARQL Queries Via Subgraph Matching

Outline

• Background & Related Work

• Overview of gStore

• Encoding Technique

• VS*-tree & Query Algorithm

• Experiments

• Conclusions

Page 47: gStore: Answering SPARQL Queries Via Subgraph Matching

Datasets

Dataset   Triple #     Size
Yago      20 million   3.1 GB
DBLP      8 million    0.8 GB

Page 48: gStore: Answering SPARQL Queries Via Subgraph Matching

Offline Performance

Page 49: gStore: Answering SPARQL Queries Via Subgraph Matching

Exact Queries

Page 50: gStore: Answering SPARQL Queries Via Subgraph Matching

Wildcard Queries

Page 51: gStore: Answering SPARQL Queries Via Subgraph Matching

Outline

• Background & Related Work

• Overview of gStore

• Encoding Technique

• VS*-tree & Query Algorithm

• Experiments

• Conclusions

Page 52: gStore: Answering SPARQL Queries Via Subgraph Matching

Conclusions

• Vertex Encoding Technique;

• An Efficient Index Structure: VS-tree;

• A Novel Filtering Technique.

Page 53: gStore: Answering SPARQL Queries Via Subgraph Matching
