Centro de Investigacio´n de la Web Sem´antica Center for...

Centro de Investigacion de la Web SemanticaCenter for Semantic Web Research

www.ciws.cl, @ciwschile

Four institutions

PUC, Chile

◮ Marcelo Arenas (Boss)◮ SW, data exchange,

semistrucured data

◮ Juan Reutter◮ SW, graph DBs, DLs

◮ Cristian Riveros◮ data exchange, semistr. data

◮ Jorge Baier◮ planning, search

Univ. of Talca

Renzo Angles (SW)

Carlos Buil (SW)

Univ. of Chile

◮ Pablo Barcelo (Deputy)◮ graph DBs, DB theory

◮ Claudio Gutierrez◮ SW, graph DBs

◮ Jorge Perez◮ SW, data exchange

◮ Aidan Hogan◮ SW, semistr. data

◮ Barbara Poblete◮ web mining, SNA

◮ Benjamın Bustos◮ multimedia

Query Languages for Graph DBs(Bridging the Gap Between Theory and Practice)

Pablo Barcelo and Aidan HoganDCC, Universidad de Chile

Center for Semantic Web Research (www.ciws.cl)

BACKGROUND AND OBJECTIVES

Graph databases

Trendy applications:

◮ Social network analysis

◮ Semantic web

◮ Scientific databases

◮ Software bug localization

◮ Geo-routing

Graph databases

Trendy applications:

◮ Social network analysis

◮ Semantic web

◮ Scientific databases

◮ Software bug localization

◮ Geo-routing

More in general:

◮ Wherever connections are as important as data

What is a graph database?

A data management system that exposes a graph data model.1

1Graph databases. Robinson, Webber, & Eifrem. O’Reilly, 2013.

What is a graph database?

A data management system that exposes a graph data model.1

Several existing graph DB engines and query languages:

◮ DEX/Sparksee - basic algebra

◮ IBM System G - Gremlin

◮ Neo4J - Cypher

◮ Oracle PGX - PGQL

◮ RDF stores (Virtuoso, AllegroGraph, Oracle, IBM) - SPARQL

1Graph databases. Robinson, Webber, & Eifrem. O’Reilly, 2013.

What graph databases are good for?

◮ Flexible modelling of interconnected data

◮ Agile evolution of the data model

◮ Scalable evaluation of join-intensive queries

My personal story

◮ Since 2009: Working on theory of query languages for graph DBs

◮ Since 2015: Working group of LDBC for the design of such language

My personal story

◮ Since 2009: Working on theory of query languages for graph DBs

◮ Since 2015: Working group of LDBC for the design of such language

Conclusion:

◮ Theory and practice are more connected than expected

Objectives

◮ Identify topics of common interest for theoreticians and developers

◮ Formalize relevant concepts (syntax, semantics, terminology, etc)

◮ Understand tradeoff expressiveness/efficiency

Overview of the course

////////Aidan/////will///////teach///////////Monday,///////////Tuesday,//////and///////////////Wednesday:

◮ ///////////Semantic//////web,////////RDF,/////and///////////SPARQL

◮ /////Half//////class////////based////on////////////////presentation,//////half///////based////on/////lab

Pablo will teach on Thursday and Friday:

◮ Property graphs, graph patterns, path queries

◮ Lectures presenting how theory of graph DBs relates to practice

Syllabus

//////////Monday:////////////Semantic///////web,//////RDF//////data/////////model,////////////Wikidata

//////////Tuesday:////////////SPARQL/////1.0,///////basic////////graph///////////patterns,///////basic/////////////operations//////////(union,/////////////projection,////////////optional),//////////solution/////////////modifiers,///////graph////////////construct/

//////////////Wednesday:////////////SPARQL/////1.1,///////////property////////paths,////////////negation,////////////////aggregation,//////////////distribution,///////////updates

Syllabus

Thursday: Graph DBs, property graphs, graph patterns, expressiveness,complexity of evaluation

Syllabus

Thursday: Graph DBs, property graphs, graph patterns, expressiveness,complexity of evaluation

Friday: Path query languages, regular path queries, ungrouping,expressiveness, complexity

Evaluation

35% in-class labs, 15% applied assignment, 50% theoretical assignment

THE DATA MODEL:PROPERTY GRAPHS

The data model is important as it must be:

◮ Flexible enough to accomodate scenarios of practical interest

◮ Simple enough to allow for a clean presentation

◮ Expressive enough for theoretical issues to appear in full force

The data model is important as it must be:

◮ Flexible enough to accomodate scenarios of practical interest

◮ Simple enough to allow for a clean presentation

◮ Expressive enough for theoretical issues to appear in full force

This is accomplished by the model of property graphs

A bit of history and context is needed

The simplest possible graph data model: Directed graphs

A bit of history and context is needed

The simplest possible graph data model: Directed graphs

Clint EastwoodThe Bridges of

Madison County

Meryl Streep Out of Africa

Any problems with this data model?

It does not allow to classify relationships

Any problems with this data model?

It does not allow to classify relationships

Madison County

C. Eastwood acts in and directed this movie

Edge-labeled graphs

Allow for such classification

Edge-labeled graphs

Allow for such classification

Madison County

acts in

directs

Meryl Streepacts in

Formal definition of edge-labeled graphs

An edge-labelled graph G is a pair (V ,E ), where:

1. V is a finite set of vertices (or nodes).

2. E is a finite set of labeled edges; formally,

E ⊆ V × Lab× V ,

where Lab is a set of labels.

What if actors play different roles in movies?

Peter Sellers

Lionel Mandrake

Merkin Muffley

Dr.Strangelove

18 minutes

34 minutes

screentime

Any problems with this approach?

We might need to modify the schema every time we add attributes

More convenient to allow attributes in nodes/edges explicitly

◮ This corresponds to what property graphs do

A property graph

role = Scorpio

ref = IMDb

name = Clint Eastwood

gender = male

n1 : Person

title = The Bridges of

Madison County

n2 : Movie

role = Robert Kincaid

ref = IMDb

e1 : acts in

e2 : directs

title = Dirty Harry

n5 : Movie

role = Harry Callahan

ref = IMDb

e5 : acts in

name = Andy Robinson

gender = male

n6 : Person

e6 : acts in

Another property graph

firstName = Julie

lastName = Freud

country = Chile

n1 : Person

firstName = John

lastName = Cook

gender = male

country = Chile

n2 : Person

e1 : knows

e2 : knows

name = U2

n3 : Tag

e3 : hasInterest

content = I love U2

language = en

n4 : Post

e4 : hasTag

content = Queen is awesome

n5 : Post

date = 14-09-15

e5 : likes

date = 15-03-14

e6 : likes

date = 23-10-15

e7 : likes

Yet another property graph

authored by

Author

affil: U. of Edinburgh

Author

affil: Princeton Univ.

Article

year: 1974

authored by

corresponding: YES

Author

affil: Cornell Univ.

affil: Stanford Univ.

affil: IBM Almaden

affil: Rice University

Author

affil: NU Singapore

Article

year: 1967

Article

year: 1989

Article

year: 1983

Article

year: 1995

Article

year: 1995

Article

year: 1995

Journal

Journal

journal

part of

Conference

series

part of

series

authored by

corresponding: YES

authored by

cooresponding: YES

authored by

cooresponding: YES

What is a property graph then?

◮ It is a graph

◮ It is directed

◮ It is labeled in nodes and edges

◮ Nodes and edges can be attributed

◮ It is a graph

◮ It is directed

◮ It is a graph

◮ It is directed

◮ It is a graph

◮ It is directed

◮ It is a graph

◮ It is directed

A formal definition

A property graph G is a tuple (V ,E , ρ, λ, σ), where:

2. E is a finite set of edges.

3. ρ : E → (V × V ) is a total function.◮ ρ(e) = (v1, v2) indicates that e is an edge from v1 to v2.

4. λ : (V ∪ E )→ Lab is a total function with Lab a set of labels.◮ If λ(m) = ℓ, then ℓ is the label of node or edge m.

5. σ : (V ∪ E )× Prop→ Val is a partial functionwith Prop a finite set of properties and Val a set of values.

◮ If σ(m, p) = v , then v is the value of property p in m.

A formal definition

How would you represent property graphs?

Labeled graphs: set of triples

Property graphs:set of triples + special triples describing property/value pairs of objects

GRAPH PATTERNS

Basic graph patterns are:

The basic unit for querying property graphs

A graph-structured query that is to be matched against a property graph

Definition of basic graph pattern (bgp):

◮ A property graph that uses variables and constants

◮ Variables can appear as:node and edge names, labels, properties, and values

Examples of basic graph patterns

acts in acts in

firstName = x1

lastName = x2

x10 : Person

firstName = x3

lastName = x4

x11 : Person

x12 : knows

x13 : knows

x8 = x9

x16 : x7

date = x5

x14 : likes

date = x6

x15 : likes

Evaluation of bgps

Set p(G ) of all possible “matchings” of bgp p in property graph G

Evaluation of bgps

Set p(G ) of all possible “matchings” of bgp p in property graph G

Matching:

◮ Replacement of variables in p by constants such that resultingproperty graph G ′ is “contained” in G

Example of evaluation

x1 : Person

title = x5

x4 : Movie

x2 : x3title = x9

x8 : Movie

x6 : x7

Example of evaluation

x1 : Person

title = x5

x4 : Movie

x2 : x3title = x9

x8 : Movie

x6 : x7

x1 x2 x3 x4 x5 x6 x7 x8 x9

n1 e2 directs n2 The Bridges of Madison C. e5 acts in n5 Dirty Harry

n1 e5 acts in n5 Dirty Harry e2 directs n2 The Bridges of Madison C.

n1 e1 acts in n2 The Bridges of Madison C. e5 acts in n5 Dirty Harry

n1 e5 acts in n5 Dirty Harry e1 acts in n2 The Bridges of Madison C.

n1 e1 acts in n2 The Bridges of Madison C. e2 directs n2 The Bridges of Madison C.

n1 e2 directs n2 The Bridges of Madison C. e1 acts in n2 The Bridges of Madison C.

n1 e1 acts in n2 The Bridges of Madison C. e1 acts in n2 The Bridges of Madison C.

n1 e2 directs n2 The Bridges of Madison C. e2 directs n2 The Bridges of Madison C.

n1 e5 acts in n5 Dirty Harry e5 acts in n5 Dirty Harry

Which semantic is the right semantics?

◮ Homomorphism-based semantics: All matchings are valid

◮ Isomorphism-based semantics: Only 1-1 matchings are valid

◮ Node/edge injective semantics: Self-defining

Different semantics by example

Homomorphism-based semantics:

x1 x2 x3 x4 x5 x6 x7 x8 x9

n1 e2 directs n2 The Bridges of Madison C. e5 acts in n5 Dirty Harry X

n1 e5 acts in n5 Dirty Harry e2 directs n2 The Bridges of Madison C. X

n1 e1 acts in n2 The Bridges of Madison C. e5 acts in n5 Dirty Harry X

n1 e5 acts in n5 Dirty Harry e1 acts in n2 The Bridges of Madison C. X

n1 e1 acts in n2 The Bridges of Madison C. e2 directs n2 The Bridges of Madison C. X

n1 e2 directs n2 The Bridges of Madison C. e1 acts in n2 The Bridges of Madison C. X

n1 e1 acts in n2 The Bridges of Madison C. e1 acts in n2 The Bridges of Madison C. X

n1 e2 directs n2 The Bridges of Madison C. e2 directs n2 The Bridges of Madison C. X

n1 e5 acts in n5 Dirty Harry e5 acts in n5 Dirty Harry X

Isomorphism-based semantics:

x1 x2 x3 x4 x5 x6 x7 x8 x9

n1 e1 acts in n2 The Bridges of Madison C. e5 acts in n5 Dirty Harry ✗

n1 e5 acts in n5 Dirty Harry e1 acts in n2 The Bridges of Madison C. ✗

n1 e1 acts in n2 The Bridges of Madison C. e2 directs n2 The Bridges of Madison C. ✗

n1 e2 directs n2 The Bridges of Madison C. e1 acts in n2 The Bridges of Madison C. ✗

n1 e1 acts in n2 The Bridges of Madison C. e1 acts in n2 The Bridges of Madison C. ✗

n1 e2 directs n2 The Bridges of Madison C. e2 directs n2 The Bridges of Madison C. ✗

n1 e5 acts in n5 Dirty Harry e5 acts in n5 Dirty Harry ✗

Node-injective semantics:

x1 x2 x3 x4 x5 x6 x7 x8 x9

n1 e1 acts in n2 The Bridges of Madison C. e2 directs n2 The Bridges of Madison C. ✗

n1 e2 directs n2 The Bridges of Madison C. e1 acts in n2 The Bridges of Madison C. ✗

Edge-injective semantics:

x1 x2 x3 x4 x5 x6 x7 x8 x9

n1 e1 acts in n2 The Bridges of Madison C. e2 directs n2 The Bridges of Madison C. X

n1 e2 directs n2 The Bridges of Madison C. e1 acts in n2 The Bridges of Madison C. X

To make bgps usable we need more features

◮ Projection: Only some variables appear in the output, i.e., π(p(G ))

◮ Filter: Prune based on the values of variables, i.e., σ(π(G ))

◮ Union: Take the union of two bgps, i.e., p1(G ) ∪ p2(G )

◮ Difference: Take the difference of two bgps, i.e., p1(G ) \ p2(G )

◮ Optional: Extend matchings of p1 with those in p2, if possible

Projection

Some variables in the output are distinguished

Projection

Some variables in the output are distinguished

Evaluation under projection:

◮ Consider all possible matchings of the graph pattern

◮ Remove columns that do not correspond to distinguished variables

Example of projection