G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

Sherif Sakr (NICTA & UNSW, Sydney, Australia)
Sameh Elnikety (Microsoft Research, Redmond, WA)
Yuxiong He (Microsoft Research, Redmond, WA)

CIKM 2012

Example 1: Social Network

[Figure: social network graph. Person nodes: Bob, Hillary, Alice, Chris, David, France, Ed, George. Photo nodes: Photo1 to Photo8.]

Example 2: Bibliographical Network

[Figure: attributed graph of a bibliographic network.
Author nodes: Smith (age: 28, office: 518), Alice (age: 42, location: Sydney), John (age: 45).
Paper nodes: Paper 1 (keyword: graph), Paper 2 (keyword: XML, type: Demo).
Organization nodes: UNSW (country: Australia, established: 1949), Microsoft (country: USA, established: 1975).
Venue node: VLDB’12 (location: Istanbul).
Edge attributes: title (Professor, Senior Researcher), order (1, 2), month (1, 3). One edge is labeled citedBy.]

Contributions

1. G-SPARQL language
– Pattern matching
– Reachability

2. Hybrid execution engine
– Graph topology in main memory
– Graph data in relational database

3. Algebraic transformation
– Operators
– Optimizations

4. Experimental evaluation

1. G-SPARQL Query Language

•Extends a subset of SPARQL
– Based on triple pattern: (subject, predicate, object)
•Sub-graph matching patterns on
– Graph structure
– Node attribute
– Edge attribute
•Reachability patterns on
– Path
– Shortest path

[Figure: triple pattern, a subject node linked to an object node]

G-SPARQL Syntax

G-SPARQL Pattern Matching

•Node attribute
– ?Person @officeNumber "518"
•Edge attribute
– ?E @Role "Programmer"
•Structural
– ?Person worksAt Microsoft
– ?Person ?E(worksAt) Microsoft

[Figure: node Alice (officeNumber = 518) connected to node Microsoft by an edge with Role = Programmer]
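
The three pattern kinds on this slide can be illustrated with a minimal sketch over a tiny in-memory attributed graph. The dictionary layout and the function name `match_works_at` are assumptions made for illustration, not the engine's actual data structures, and the `worksAt` label on the Alice-Microsoft edge is inferred from the structural patterns above.

```python
# Hypothetical in-memory attributed graph holding the slide's example.
node_attrs = {
    "Alice": {"officeNumber": "518"},
    "Microsoft": {},
}
# Each edge: (source, label, destination, edge attributes).
edges = [
    ("Alice", "worksAt", "Microsoft", {"Role": "Programmer"}),
]


def match_works_at(office, role, employer):
    """Return persons matching the combined pattern:
    ?Person @officeNumber office. ?Person ?E(worksAt) employer. ?E @Role role.
    """
    result = []
    for src, label, dst, attrs in edges:
        if label != "worksAt" or dst != employer:
            continue  # structural pattern fails
        if attrs.get("Role") != role:
            continue  # edge attribute pattern fails
        if node_attrs.get(src, {}).get("officeNumber") != office:
            continue  # node attribute pattern fails
        result.append(src)
    return result


print(match_works_at("518", "Programmer", "Microsoft"))  # ['Alice']
```

Binding the edge to a variable, as in `?E(worksAt)`, is what lets a later pattern constrain an attribute of that same edge.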

G-SPARQL Reachability

•Path
– Subject ??PathVar Object
•Shortest path
– Subject ?*PathVar Object
•Path filters
– Path length
– All edges
– All nodes
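
As a rough illustration of a reachability pattern with a path-length filter, the sketch below runs a breadth-first search over an adjacency list. The function name `find_path`, the `max_length` parameter, and the sample topology are assumptions; the slides do not show how the engine evaluates these patterns internally.

```python
from collections import deque


def find_path(adjacency, source, target, max_length=None):
    """Breadth-first search; returns a path with the fewest edges, or None.

    max_length mimics a path-length filter: paths with more edges are rejected.
    """
    parents = {source: None}
    depth = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        if max_length is not None and depth[node] == max_length:
            continue  # do not expand beyond the length bound
        for neighbor in adjacency.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return None


# Hypothetical topology: node -> list of successor nodes.
adjacency = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}
print(find_path(adjacency, "A", "D", max_length=3))  # ['A', 'B', 'C', 'D']
print(find_path(adjacency, "A", "D", max_length=2))  # None
```

Because breadth-first search finds a path with the fewest edges first, a length bound can be checked without enumerating all paths.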

Example: G-SPARQL Query

SELECT ?L1 ?L2
WHERE {
  ?X ??P ?Y.
  ?X @Label ?L1. ?Y @Label ?L2.
  ?X @Age ?Age1. ?Y @Age ?Age2.
  ?X Affiliated UNSW. ?Y ?E(Affiliated) Microsoft.
  ?X LivesIn Sydney. ?E @Title "Researcher".
  FILTER(?Age1 >= 40). FILTER(?Age2 >= 40).
  FILTERPATH( Length( ??P, <= 3) ).
}

Outline

1. G-SPARQL language
– Pattern matching
– Reachability

2. Hybrid execution engine
– Graph topology in main memory
– Graph data in relational database

3. Algebraic transformation
– Operators
– Optimizations

4. Experimental evaluation

2. Hybrid Execution Engine

•Reachability queries
– Main memory algorithms
– Example: BFS and Dijkstra’s algorithm
•Pattern matching queries
– Relational database
– Indexing
» Example: B-tree
– Query optimizations
» Example: selectivity estimation and join ordering
– Recursive queries
» Not efficient: large intermediate results and multiple joins

[Figure: the social network graph from Example 1]
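
The slide names Dijkstra's algorithm as one of the main-memory algorithms for reachability queries. The following is a minimal textbook sketch of it using Python's `heapq`; the edge weights, node names, and function name are hypothetical, and non-negative weights are assumed.

```python
import heapq


def dijkstra(adjacency, source, target):
    """Return (cost, path) of a cheapest path, or (None, None) if unreachable."""
    best = {source: 0}
    parents = {source: None}
    heap = [(0, source)]
    done = set()
    while heap:
        cost, node = heapq.heappop(heap)
        if node in done:
            continue  # stale heap entry
        done.add(node)
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return cost, path[::-1]
        for neighbor, weight in adjacency.get(node, ()):
            new_cost = cost + weight
            if neighbor not in best or new_cost < best[neighbor]:
                best[neighbor] = new_cost
                parents[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    return None, None


# Hypothetical weighted topology: node -> list of (successor, weight).
adjacency = {
    "Bob": [("Alice", 1), ("Chris", 4)],
    "Alice": [("Chris", 1)],
    "Chris": [],
}
print(dijkstra(adjacency, "Bob", "Chris"))  # (2, ['Bob', 'Alice', 'Chris'])
```

Only the topology (node and edge identifiers plus weights) is needed here, which is why it can stay in main memory while attribute values live in the database.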

Graph Representation

Node attribute tables (one table per attribute, keyed by node ID)

Node Label
ID  Value
1   John
2   Paper 2
3   Alice
4   Microsoft
5   VLDB’12
6   Paper 1
7   UNSW
8   Smith

age
ID  Value
1   45
3   42
8   28

office
ID  Value
8   518

location
ID  Value
3   Sydney
5   Istanbul

keyword
ID  Value
2   XML
6   graph

type
ID  Value
2   Demo

country
ID  Value
4   USA
7   Australia

established
ID  Value
4   1975
7   1949

Edge tables (one table per edge label; eID = edge ID, sID = source node ID, dID = destination node ID)

authorOf
eID  sID  dID
1    1    2
5    3    2
6    3    6
11   8    6

affiliated
eID  sID  dID
3    1    4
8    3    7
12   8    7

published
eID  sID  dID
4    2    5
10   6    5

citedBy
eID  sID  dID
9    6    2

supervise
eID  sID  dID
7    3    8

know
eID  sID  dID
2    1    3

Edge attribute tables (one table per attribute, keyed by edge ID)

title
ID  Value
3   Senior Researcher
8   Professor

order
ID  Value
1   2
5   1
6   2
11  1

month
ID  Value
4   3
10  1
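
To show how a pattern-matching query could run as SQL over this kind of layout, here is a sketch using Python's built-in `sqlite3`. The talk's system uses IBM DB2; SQLite is swapped in only to keep the example self-contained. Table and column names follow the slide, while the query itself is an illustrative assumption rather than SQL produced by the engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One two-column table per attribute, one three-column table per edge label.
cur.executescript("""
CREATE TABLE label (ID INTEGER, Value TEXT);
CREATE TABLE age (ID INTEGER, Value INTEGER);
CREATE TABLE affiliated (eID INTEGER, sID INTEGER, dID INTEGER);
""")
cur.executemany("INSERT INTO label VALUES (?, ?)", [
    (1, "John"), (3, "Alice"), (4, "Microsoft"), (7, "UNSW"), (8, "Smith"),
])
cur.executemany("INSERT INTO age VALUES (?, ?)", [(1, 45), (3, 42), (8, 28)])
cur.executemany("INSERT INTO affiliated VALUES (?, ?, ?)", [
    (3, 1, 4), (8, 3, 7), (12, 8, 7),
])

# Pattern: ?X affiliated UNSW. ?X @age ?A. FILTER(?A >= 40). Return ?X's label.
rows = cur.execute("""
    SELECT person.Value
    FROM affiliated AS e
    JOIN label AS org    ON org.ID = e.dID
    JOIN label AS person ON person.ID = e.sID
    JOIN age             ON age.ID = e.sID
    WHERE org.Value = 'UNSW' AND age.Value >= 40
    ORDER BY person.Value
""").fetchall()
print(rows)  # [('Alice',)]
```

Each triple pattern becomes a join against one narrow table, which is the kind of select/project/join workload a relational optimizer handles well.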

Hybrid Execution Engine: interfaces

[Figure: engine interfaces. A G-SPARQL query is turned into SQL commands and traversal operations; the social network graph from Example 1 is shown alongside.]

3. Intermediate Language & Compilation

[Figure: compilation pipeline. G-SPARQL query -> Step 1: front-end compilation -> algebraic query plan -> Step 2: back-end compilation -> physical execution plan (SQL commands and traversal operations). The social network graph from Example 1 is shown alongside.]

Intermediate Language

•Objective
– Generate query plan and chop it
» Reachability part -> main-memory algorithms on topology
» Pattern matching part -> relational database
– Optimizations
•Features
– Independent of execution engine and graph representation
– Algebraic query plan
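
The idea of chopping a query into a reachability part and a pattern-matching part can be sketched at the level of triple patterns. This is a simplification: per the slides, the engine splits an algebraic plan, not raw triples. The tuple representation and the function names below are assumptions for illustration.

```python
def is_path_pattern(triple):
    """A predicate starting with ?? (path) or ?* (shortest path) needs traversal."""
    _, predicate, _ = triple
    return predicate.startswith("??") or predicate.startswith("?*")


def split_patterns(triples):
    """Split triple patterns into (relational part, traversal part)."""
    relational = [t for t in triples if not is_path_pattern(t)]
    traversal = [t for t in triples if is_path_pattern(t)]
    return relational, traversal


# Triple patterns as (subject, predicate, object) strings.
query = [
    ("?X", "??P", "?Y"),
    ("?X", "@age", "?Age1"),
    ("?X", "affiliated", "UNSW"),
    ("?Y", "?E(affiliated)", "Microsoft"),
]
relational, traversal = split_patterns(query)
print(traversal)        # [('?X', '??P', '?Y')]
print(len(relational))  # 3
```

Note that `?E(affiliated)` binds an edge variable, not a path variable, so it stays on the relational side.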

G-SPARQL Algebra

•Variant of “Tuple Algebra”
•Algebra details
– Data: tuples
» Sets of nodes, edges, paths
– Operators
» Relational: select, project, join
» Graph specific: node and edge attributes, adjacency
» Path operators

[Figure: operators labeled “Relational” and “NOT Relational”]

Front-end Compilation (Step 1)

•Input
– G-SPARQL query
•Output
– Algebraic query plan
•Technique
– Map from triple patterns to G-SPARQL operators
– Use inference rules

Front-end Compilation: Inference Rules

Front-end Compilation: Optimizations

•Objective
– Delay execution of traversal operations
•Technique
– Order triple patterns, based on restrictiveness
•Heuristics
– Triple pattern P1 is more restrictive than P2 if:
1. P1 has fewer path variables than P2
2. P1 has fewer variables than P2
3. P1’s variables have more filter statements than P2’s variables
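
One way to turn the three heuristics into an ordering is a sort key. The sketch below assumes the heuristics are applied in the listed priority order (lexicographically), which the slide does not state; the regular expression, the function names, and the `filter_counts` mapping are likewise hypothetical.

```python
import re

# Matches ?X, ??P, ?*P, and the ?E inside ?E(affiliated).
VARIABLE = re.compile(r"\?[?*]?\w+")


def variables(triple):
    """All variables that occur in a triple pattern."""
    return [v for term in triple for v in VARIABLE.findall(term)]


def restrictiveness_key(triple, filter_counts):
    vars_ = variables(triple)
    path_vars = [v for v in vars_ if v.startswith(("??", "?*"))]
    filters = sum(filter_counts.get(v, 0) for v in vars_)
    # Smaller key = more restrictive = earlier in the plan.
    return (len(path_vars), len(vars_), -filters)


def order_patterns(triples, filter_counts):
    return sorted(triples, key=lambda t: restrictiveness_key(t, filter_counts))


triples = [
    ("?X", "??P", "?Y"),
    ("?X", "@label", "?L1"),
    ("?X", "@age", "?Age1"),
    ("?X", "affiliated", "UNSW"),
]
filter_counts = {"?Age1": 1}  # number of FILTER statements per variable
ordered = order_patterns(triples, filter_counts)
print(ordered[0])   # ('?X', 'affiliated', 'UNSW')
print(ordered[-1])  # ('?X', '??P', '?Y')
```

The path pattern sorts last, so the traversal is delayed until the cheaper patterns have narrowed the candidate nodes.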

Back-end Compilation (Step 2)

•Input
– G-SPARQL algebraic plan
•Output
– SQL commands
– Traversal operations
•Technique
– Substitute G-SPARQL relational operators with SPJ (select/project/join)
– Traverse the plan
» Bottom up
» Stop when reaching root or reaching non-relational operator
» Transform relational algebra to SQL commands
– Send non-relational commands to main memory algorithms
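
The bottom-up traversal described above can be sketched on a small plan tree. The `Op` class, the operator names, and the step format are all hypothetical; the sketch only shows the grouping idea, where a maximal all-relational subtree becomes one SQL step and a non-relational operator becomes a traversal step.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str
    relational: bool
    children: list = field(default_factory=list)


def is_relational_subtree(op):
    return op.relational and all(is_relational_subtree(c) for c in op.children)


def operators(op):
    """Operator names of a subtree, children first (bottom-up order)."""
    names = []
    for child in op.children:
        names.extend(operators(child))
    names.append(op.name)
    return names


def compile_plan(op, steps=None):
    """Emit execution steps bottom-up, inputs before the operators that use them."""
    if steps is None:
        steps = []
    if is_relational_subtree(op):
        steps.append(("SQL", operators(op)))  # whole fragment as one SQL command
        return steps
    for child in op.children:
        compile_plan(child, steps)
    kind = "SQL" if op.relational else "TRAVERSAL"
    steps.append((kind, [op.name]))
    return steps


# Hypothetical plan: a projection over a path operator with two relational inputs.
plan = Op("project", True, [
    Op("path_join", False, [
        Op("select_age_X", True, [Op("scan_affiliated_X", True)]),
        Op("select_age_Y", True, [Op("scan_affiliated_Y", True)]),
    ]),
])
steps = compile_plan(plan)
for step in steps:
    print(step)
```

In this sketch a relational operator that sits above a traversal operator is emitted as its own SQL step after the traversal; how the real engine passes intermediate results between the two sides is not shown in the slides.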

Back-end Compilation: Optimizations

•Optimize a fragment of query plan
– Before generating SQL command
•All operators are Select/Project/Join
•Apply standard techniques
– For example pushing selection

Example: G-SPARQL Query

SELECT ?L1 ?L2
WHERE {
  ?X ??P ?Y.
  ?X @label ?L1. ?Y @label ?L2.
  ?X @age ?Age1. ?Y @age ?Age2.
  ?X affiliated UNSW. ?Y ?E(affiliated) Microsoft.
  ?X livesIn Sydney. ?E @title "Researcher".
  FILTER(?Age1 >= 40). FILTER(?Age2 >= 40).
}

Example: Query Plan

4. Experimental Evaluation

•Objective
– Show that the hybrid approach is a good idea
– Good performance from DBMS and main memory topology
•Data sets
– Real ACM bibliographic network
– Synthetic graphs
» See technical report

Experimental Environment

•Workload
– Created Q1 … Q12
•Process
– Compare to Neo4j (non-optimized, optimized)
•Environment
– Implementation
» Main memory algorithms in C++
» IBM DB2
– PC Server

Results on Real Dataset

Response time on ACM Bibliographic Network

Conclusions

•G-SPARQL Language
– Expresses pattern matching and reachability queries on attributed graphs
•Hybrid engine
– Graph topology in main memory
– Graph data in database
•Compilation into algebraic plan
– Operators and optimizations
•Evaluation
– Real and synthetic datasets
– Good performance
» Leveraging database engine and main memory topology