Will Graph Data Management Techniques …...Music Brainz (Data Incubator) Moseley Metoffice Folk...

3
Will Graph Data Management Techniques Contribute to the Successful Large-Scale Deployment of Semantic Web Technologies? Philippe Cudre-Mauroux eXascale Infolab University of Fribourg—Switzlerand [email protected] Abstract— Most certainly. I. I NTRODUCTION The Semantic Web vision ambitions to create a large-scale, interconnected Web of machine-processable data in addition to the current Web of documents currently available online. Its most recent and visible incarnation—the Linked Data movement 1 —is based on three main principles: using dereferenceable Web identifiers (HTTP URIs) to uniquely identify the various pieces of data online providing semi-structured information—most often using standard formats like RDF [1] and OWL [2]—about the data when dereferencing the corresponding URIs providing links to other, semantically related pieces of data. The result of this grassroots effort is the emergence of a giant, decentralized and interlinked online database. The wide adoption of Linked Data principles opens the door to many exciting applications, including expressive and structured Web search, automated processing of online data, and Web-scale data integration. While the Web of data is currently capturing a lot of attention because of its potential, it is also being threatened by a series of technical challenges. At this stage, the research community is actively working on novel solutions to store, manage and integrate massive Linked Data sets across the Web. In this short paper, I argue that graph data manage- ment techniques will play an essential role in the successful deployment of Semantic Web technologies. I start below by giving a state of the art in Semantic Data Management, before describing a few areas where graph data management is already having an important impact in this context. II. STATE OF THE ART IN SEMANTIC DATA MANAGEMENT Semantic Web Data Management is an interdisciplinary field or research at the intersection of several disciplines, including database systems, artificial intelligence, and Web engineering. A number of key advances have been made in this field over the past ten years, most notably on efficient and scalable 1 http://linkeddata.org/ RDF/OWL back-end, scalable inference techniques, query languages, and more recently on federated query processing. A. Data Back-End The data model behind the Semantic Web and Linked Data movement is at the same time conceptually simple and technically challenging to capture. RDF data is constituted of bags of (subject, predicate, object) triples, where the subject and predicate are typically URIs, and the object is either a URI or a literal value. A triple storing the name of an identifier representing a person would for instance look as follows: ( http://semanticweb.org/pcudremauroux, http://xmlns.com/foaf/name, “Philippe Cudre-Mauroux” ) RDF data can as such be considered as collections of ternary structures. Many approaches use this perspective to store RDF data in giant triple tables (e.g., in relational tables having three attributes, one for the subjects, one for the predicates, and finally one storing the objects). The GridVine [3], [4] system, for instance, uses a triple-table storage approach to distribute RDF data over decentralized P2P networks using a distributed hash-table. More recently, Hexastore [5] suggests to index RDF data using six possible indices, one for each permutation of the set of columns in the triple table. Similarly, RDF-3X [6] creates various indices from a giant triple-table, including indices based on the six possible permutations of the triple columns, and aggregate indices storing only two out of the three columns. 4Store 2 and AllegroGraph 3 are two further examples of database systems following this philosophy. A second important approach clusters RDF data in various tables based on their properties (i.e., based on the predicates of the triples). Wilkinson et al. [7] propose the use of two types of property tables: one containing clusters of values for properties that are often co-accessed together, and one exploiting the type property of subjects to cluster similar sets of subjects together in the same table. Chong et al. [8] also suggest the use of property tables as materialized views, complementing 2 http://4store.org/ 3 http://www.franz.com/agraph/allegrograph/

Transcript of Will Graph Data Management Techniques …...Music Brainz (Data Incubator) Moseley Metoffice Folk...

Page 1: Will Graph Data Management Techniques …...Music Brainz (Data Incubator) Moseley Metoffice Folk Weather Forecasts Discogs (Data Incubator)Climbing data.gov.uk intervals Data Gov.ie

Will Graph Data Management TechniquesContribute to the Successful Large-Scale

Deployment of Semantic Web Technologies?Philippe Cudre-Mauroux

eXascale InfolabUniversity of Fribourg—Switzlerand

[email protected]

Abstract— Most certainly.

I. INTRODUCTION

The Semantic Web vision ambitions to create a large-scale,interconnected Web of machine-processable data in additionto the current Web of documents currently available online.Its most recent and visible incarnation—the Linked Datamovement1—is based on three main principles:

• using dereferenceable Web identifiers (HTTP URIs) touniquely identify the various pieces of data online

• providing semi-structured information—most often usingstandard formats like RDF [1] and OWL [2]—about thedata when dereferencing the corresponding URIs

• providing links to other, semantically related pieces ofdata.

The result of this grassroots effort is the emergence of agiant, decentralized and interlinked online database. The wideadoption of Linked Data principles opens the door to manyexciting applications, including expressive and structured Websearch, automated processing of online data, and Web-scaledata integration.

While the Web of data is currently capturing a lot ofattention because of its potential, it is also being threatenedby a series of technical challenges. At this stage, the researchcommunity is actively working on novel solutions to store,manage and integrate massive Linked Data sets across theWeb. In this short paper, I argue that graph data manage-ment techniques will play an essential role in the successfuldeployment of Semantic Web technologies. I start below bygiving a state of the art in Semantic Data Management,before describing a few areas where graph data managementis already having an important impact in this context.

II. STATE OF THE ART IN SEMANTIC DATA MANAGEMENT

Semantic Web Data Management is an interdisciplinary fieldor research at the intersection of several disciplines, includingdatabase systems, artificial intelligence, and Web engineering.A number of key advances have been made in this field overthe past ten years, most notably on efficient and scalable

1http://linkeddata.org/

RDF/OWL back-end, scalable inference techniques, querylanguages, and more recently on federated query processing.

A. Data Back-End

The data model behind the Semantic Web and LinkedData movement is at the same time conceptually simple andtechnically challenging to capture. RDF data is constitutedof bags of (subject, predicate, object) triples, where thesubject and predicate are typically URIs, and the object iseither a URI or a literal value. A triple storing the name ofan identifier representing a person would for instance look asfollows:

( http://semanticweb.org/pcudremauroux,http://xmlns.com/foaf/name,“Philippe Cudre-Mauroux” )

RDF data can as such be considered as collections of ternarystructures. Many approaches use this perspective to store RDFdata in giant triple tables (e.g., in relational tables havingthree attributes, one for the subjects, one for the predicates,and finally one storing the objects). The GridVine [3], [4]system, for instance, uses a triple-table storage approach todistribute RDF data over decentralized P2P networks using adistributed hash-table. More recently, Hexastore [5] suggeststo index RDF data using six possible indices, one for eachpermutation of the set of columns in the triple table. Similarly,RDF-3X [6] creates various indices from a giant triple-table,including indices based on the six possible permutations of thetriple columns, and aggregate indices storing only two out ofthe three columns. 4Store2 and AllegroGraph3 are two furtherexamples of database systems following this philosophy.

A second important approach clusters RDF data in varioustables based on their properties (i.e., based on the predicates ofthe triples). Wilkinson et al. [7] propose the use of two types ofproperty tables: one containing clusters of values for propertiesthat are often co-accessed together, and one exploiting thetype property of subjects to cluster similar sets of subjectstogether in the same table. Chong et al. [8] also suggest theuse of property tables as materialized views, complementing

2http://4store.org/3http://www.franz.com/agraph/allegrograph/

Page 2: Will Graph Data Management Techniques …...Music Brainz (Data Incubator) Moseley Metoffice Folk Weather Forecasts Discogs (Data Incubator)Climbing data.gov.uk intervals Data Gov.ie

a primary storage using a triple-table. Going one step further,Abadi et al. suggest a fully-decomposed storage model forRDF: the triples are in that case rewritten into n two-columntables where n is the number of unique properties in the data.In each table, the first column contains the subjects that definethat property and the second column contains the object valuesfor those subjects.

B. Query Processing

Very early on in the development of the Semantic Web, theWorld Wide Web Consortium (W3C)4 standardized a seriesof languages to express and query Semantic Web data. Thepredominant query language in that space is SPARQL [9],which is a declarative query language a la SQL to query (andupdate) RDF data. The basic query construct behind SPARQLis the triple pattern, which looks like an RDF triple exceptthat any of its subparts can be replaced by a variable. Asimple SPARQL query retrieving the name associated to agiven resource would look like this:

SELECT ?xFROM example.rdfWHERE {

http://semanticweb.org/philippecudremaurouxhttp://xmlns.com/foaf/name?x .

}

A large fraction of Semantic Web queries can be expressedby writing conjunctions of such triple patterns. The queriesare then typically answered by iteratively looking up boundvariables in the patterns, and joining up intermediate results.

In addition to local queries, distributed queries involvingseries of autonomous SPARQL end-points (i.e., federatedqueries) are rapidly gaining in popularity. This is a natural con-sequence of the emergence of the Web of Linked Data, sincelocal data systematically include links to distant but relateddata. A growing number of approaches exist to optimize suchoperations (see for instance [10] for a recent survey and [11]for a state of the art approach).

III. GRAPH DATA MANAGEMENT AND THE SEMANTICWEB

While RDF data can be seen as a collection of triples,researchers realized very early that it can also be modeledas a graph (see for example Angles & Gutierrez [12] foran early paper on that topic). Figure 1, for instance, gives agraph representation of two RDF triples (the figure adopts theusual convention to represent such a graph, where URIs arerepresented by ovals, literals by rectangles, and predicates arerepresented by directed edges connecting subjects to objects).

Adopting this formalism, the Linked Data movement canbe represented as a gigantic graph of interconnected RDF datasets. Figure 2shows a macroscopic representation of this graph,

4http://www.w3.org/

http://www.unifr.ch

"Philippe Cudre-Mauroux"

http://semanticweb.org/

cudremauroux

http://xmlns.com/foaf/schoolHomepage

http://xmlns.com/foaf/name

Fig. 1. A small RDF graph representing two triples

where each edge represents an entire RDF data set, and edgesrepresent collections of links between pairs of data sets5 .

As of September 2011

MusicBrainz

(zitgist)

P20

Turismo de

Zaragoza

yovisto

Yahoo! Geo

Planet

YAGO

World Fact-book

El ViajeroTourism

WordNet (W3C)

WordNet (VUA)

VIVO UF

VIVO Indiana

VIVO Cornell

VIAF

URIBurner

Sussex Reading

Lists

Plymouth Reading

Lists

UniRef

UniProt

UMBEL

UK Post-codes

legislationdata.gov.uk

Uberblic

UB Mann-heim

TWC LOGD

Twarql

transportdata.gov.

uk

Traffic Scotland

theses.fr

Thesau-rus W

totl.net

Tele-graphis

TCMGeneDIT

TaxonConcept

Open Library (Talis)

tags2con delicious

t4gminfo

Swedish Open

Cultural Heritage

Surge Radio

Sudoc

STW

RAMEAU SH

statisticsdata.gov.

uk

St. Andrews Resource

Lists

ECS South-ampton EPrints

SSW Thesaur

us

SmartLink

Slideshare2RDF

semanticweb.org

SemanticTweet

Semantic XBRL

SWDog Food

Source Code Ecosystem Linked Data

US SEC (rdfabout)

Sears

Scotland Geo-

graphy

ScotlandPupils &Exams

Scholaro-meter

WordNet (RKB

Explorer)

Wiki

UN/LOCODE

Ulm

ECS (RKB

Explorer)

Roma

RISKS

RESEX

RAE2001

Pisa

OS

OAI

NSF

New-castle

LAASKISTI

JISC

IRIT

IEEE

IBM

Eurécom

ERA

ePrints dotAC

DEPLOY

DBLP (RKB

Explorer)

Crime Reports

UK

Course-ware

CORDIS (RKB

Explorer)CiteSeer

Budapest

ACM

riese

Revyu

researchdata.gov.

ukRen. Energy Genera-

tors

referencedata.gov.

uk

Recht-spraak.

nl

RDFohloh

Last.FM (rdfize)

RDF Book

Mashup

Rådata nå!

PSH

Product Types

Ontology

ProductDB

PBAC

Poké-pédia

patentsdata.go

v.uk

OxPoints

Ord-nance Survey

Openly Local

Open Library

OpenCyc

Open Corpo-rates

OpenCalais

OpenEI

Open Election

Data Project

OpenData

Thesau-rus

Ontos News Portal

OGOLOD

JanusAMP

Ocean Drilling Codices

New York

Times

NVD

ntnusc

NTU Resource

Lists

Norwe-gian

MeSH

NDL subjects

ndlna

myExperi-ment

Italian Museums

medu-cator

MARC Codes List

Man-chester Reading

Lists

Lotico

Weather Stations

London Gazette

LOIUS

Linked Open Colors

lobidResources

lobidOrgani-sations

LEM

LinkedMDB

LinkedLCCN

LinkedGeoData

LinkedCT

LinkedUser

FeedbackLOV

Linked Open

Numbers

LODE

Eurostat (OntologyCentral)

Linked EDGAR

(OntologyCentral)

Linked Crunch-

base

lingvoj

Lichfield Spen-ding

LIBRIS

Lexvo

LCSH

DBLP (L3S)

Linked Sensor Data (Kno.e.sis)

Klapp-stuhl-club

Good-win

Family

National Radio-activity

JP

Jamendo (DBtune)

Italian public

schools

ISTAT Immi-gration

iServe

IdRef Sudoc

NSZL Catalog

Hellenic PD

Hellenic FBD

PiedmontAccomo-dations

GovTrack

GovWILD

GoogleArt

wrapper

gnoss

GESIS

GeoWordNet

GeoSpecies

GeoNames

GeoLinkedData

GEMET

GTAA

STITCH

SIDER

Project Guten-berg

MediCare

Euro-stat

(FUB)

EURES

DrugBank

Disea-some

DBLP (FU

Berlin)

DailyMed

CORDIS(FUB)

Freebase

flickr wrappr

Fishes of Texas

Finnish Munici-palities

ChEMBL

FanHubz

EventMedia

EUTC Produc-

tions

Eurostat

Europeana

EUNIS

EU Insti-

tutions

ESD stan-dards

EARTh

Enipedia

Popula-tion (En-AKTing)

NHS(En-

AKTing) Mortality(En-

AKTing)

Energy (En-

AKTing)

Crime(En-

AKTing)

CO2 Emission

(En-AKTing)

EEA

SISVU

education.data.g

ov.uk

ECS South-ampton

ECCO-TCP

GND

Didactalia

DDC Deutsche Bio-

graphie

datadcs

MusicBrainz

(DBTune)

Magna-tune

John Peel

(DBTune)

Classical (DB

Tune)

AudioScrobbler (DBTune)

Last.FM artists

(DBTune)

DBTropes

Portu-guese

DBpedia

dbpedia lite

Greek DBpedia

DBpedia

data-open-ac-uk

SMCJournals

Pokedex

Airports

NASA (Data Incu-bator)

MusicBrainz(Data

Incubator)

Moseley Folk

Metoffice Weather Forecasts

Discogs (Data

Incubator)

Climbing

data.gov.uk intervals

Data Gov.ie

databnf.fr

Cornetto

reegle

Chronic-ling

America

Chem2Bio2RDF

Calames

businessdata.gov.

uk

Bricklink

Brazilian Poli-

ticians

BNB

UniSTS

UniPathway

UniParc

Taxonomy

UniProt(Bio2RDF)

SGD

Reactome

PubMedPub

Chem

PRO-SITE

ProDom

Pfam

PDB

OMIMMGI

KEGG Reaction

KEGG Pathway

KEGG Glycan

KEGG Enzyme

KEGG Drug

KEGG Com-pound

InterPro

HomoloGene

HGNC

Gene Ontology

GeneID

Affy-metrix

bible ontology

BibBase

FTS

BBC Wildlife Finder

BBC Program

mes BBC Music

Alpine Ski

Austria

LOCAH

Amster-dam

Museum

AGROVOC

AEMET

US Census (rdfabout)

Media

Geographic

Publications

Government

Cross-domain

Life sciences

User-generated content

Fig. 2. A macroscopic representation of the Linked Data graph

A. Data Back-End

Though often depicted as a graph, there has been surpris-ingly little research and development on graph data infrastruc-tures for the Semantic Web. Oracle led one of the most visibleefforts in that space, offering RDF support for its 11g databaseproduct6, which stores RDF data using two main relationaltables, one for the edges (predicates), and another one storingdata about the nodes (subjects, objects).

Recently, some further approaches took advantage of sub-graph storage to efficiently and compactly store RDF data.My own experience with dipLODocus[RDF ] [13], a hybridback-end [14] for RDF, shows that collocation of data valuesbased on the proper identification of RDF subgraphs canspeed up query processing by one order of magnitude. Alongconceptually similar lines, Yan et al. [15] suggest to divide theRDF graph into subgraphs and to build secondary indices (e.g.,Bloom filters) to quickly detect whether some informationcan be found inside an RDF subgraph or not. gStore [16]

5Linking Open Data cloud diagram, by Richard Cyganiak and AnjaJentzsch. http://lod-cloud.net/

6http://www.oracle.com/technetwork/database/options/semantic-tech/index.html

Page 3: Will Graph Data Management Techniques …...Music Brainz (Data Incubator) Moseley Metoffice Folk Weather Forecasts Discogs (Data Incubator)Climbing data.gov.uk intervals Data Gov.ie

is a recent system storing RDF data as a large, labeled, anddirected multi-edge graph; SPARQL queries are then executedby being transformed into subgraph matching queries, thatare efficiently matched to the graph using a novel indexingmechanism.

B. Query Processing

Beyond triple pattern queries—which can be efficientlyprocessed as subgraph matching queries on graph back-ends(see above)—a number of further SPARQL queries can takeadvantage of graph data management techniques. Path queries,for instance, are getting extremely popular. They come indifferent flavors, but typically are defined as lists of predicates(e.g., (address, town, zip code) for a path query going froma node having and address to a zip code literal) definingseries of edges to follow in the RDF graph. Such queriesare extremely costly to resolve using triple table or propertytable approaches, since they require a (self) join for each newproperty defined in the query. Graph or subgraph systems canoften resolve path queries more efficiently by limiting thenumber of joins (e.g., by joining subgraphs [13]).

Beyond the classical path queries mentioned above,SPARQL is currently being enhanced with additional featuresto express more complex path queries, transitive closures andreachability queries (see the SPARQL 1.1. W3C WorkingDraft [17]). The following SPARQL query (taken from theW3C draft), for example, expresses a reachability query astwo triple patterns and uses a new syntactic construct (+)to express a transitive closure on the foaf:knows transitiveproperty starting from a given node (Alice):

{ ?x foaf:mbox mailto:alice@example .?x foaf:knows+ / foaf:name ?name .}

Reachability queries for RDF have not been widely studiedso far but are currently getting a lot of attention becauseof those recent developments. Shortest-path queries are notyet supported by SPARQL, but are already receiving someattention [18] and are supported by a few systems (likeVirtuoso7).

In addition to local graph query processing, an increasingnumber of researchers are starting to adapt well-known graphtechniques to optimize RDF federated queries on wide-areanetworks. Recent efforts [19], [20], for example, apply well-known graph algorithms (Edmonds’ algorithms or Minimum-Spanning-Tree algorithms) to optimize distributed SPARQLqueries.

IV. CONCLUSIONS

With the rapid adoption of Linked Data principles, a giantgraph of structured data is emerging from the first time on theWeb. The language used to express such data (RDF) can itselfbe considered as a graph data language. In this short paper, Igave a series of concrete examples showing how graph datamanagement techniques are gaining increasing attention in that

7http://virtuoso.openlinksw.com/

context. Additional examples can be found in a recent tutorialcomparing both social networks and the Web of data from agraph data management perspective [21].

V. ACKNOWLEDGMENT

This work is supported by the Swiss National ScienceFoundation under grant number PP00P2 128459.

REFERENCES

[1] F. Manola and E. Miller (Ed.), “RDF Primer,” W3C Recommendation,February 2004, http://www.w3.org/TR/rdf-primer/.

[2] D. McGuinness and F. van Harmelen (Ed.), “OWL Web Ontol-ogy Language Overview,” W3C Recommendation, February 2004,http://www.w3.org/TR/owl-features/.

[3] K. Aberer, P. Cudre-Mauroux, M. Hauswirth, and T. van Pelt, “GridVine:Building Internet-Scale Semantic Overlay Networks,” in InternationalSemantic Web Conference (ISWC), 2004.

[4] P. Cudre-Mauroux, S. Agarwal, and K. Aberer, “Gridvine: An infras-tructure for peer information management,” IEEE Internet Computing,vol. 11, no. 5, 2007.

[5] C. Weiss, P. Karras, and A. Bernstein, “Hexastore: sextuple indexing forsemantic web data management,” Proceeding of the VLDB Endowment(PVLDB), vol. 1, no. 1, pp. 1008–1019, 2008.

[6] T. Neumann and G. Weikum, “RDF-3X: a RISC-style engine for RDF,”Proceedings of the VLDB Endowment (PVLDB), vol. 1, no. 1, pp. 647–659, 2008.

[7] K. Wilkinson, C. Sayers, H. A. Kuno, and D. Reynolds, “Efficient rdfstorage and retrieval in jena2,” in SWDB’03, 2003, pp. 131–150.

[8] E. I. Chong, S. Das, G. Eadon, and J. Srinivasan, “An efficient sql-basedrdf querying scheme,” in International Conference on Very Large DataBases (VLDB), 2005.

[9] E. Prud’hommeaux and A. Seaborne, “SPARQL QueryLanguage for RDF,” W3C Recommendation, January 2008,http://www.w3.org/TR/rdf-sparql-query/.

[10] O. Grlitz and S. Staab, Federated Data Management and Query Op-timization for Linked Open Data, ser. New Directions in Web DataManagement. Springer, 2011, vol. 1.

[11] A. Schwarte, P. Haase, K. Hose, R. Schenkel, and M. Schmidt, “Fedx:optimization techniques for federated query processing on linked data,”in International Semantic Web Conference (ISWC). Berlin, Heidelberg:Springer-Verlag, 2011, pp. 601–616.

[12] R. Angles and C. Gutierrez, “Querying rdf data from a graph databaseperspective,” in In Proceedings of the Second European Semantic WebConference, 2005, pp. 346–360.

[13] M. Wylot, J. Pont, M. Wisniewski, and P. Cudre-Mauroux,“diplodocus[rdf] - short and long-tail rdf analytics for massive websof data,” in International Semantic Web Conference (ISWC), 2011, pp.778–793.

[14] M. Grund, J. Kruger, H. Plattner, A. Zeier, P. Cudre-Mauroux, andS. Madden, “Hyrise - a main memory hybrid storage engine,” PVLDB,vol. 4, no. 2, pp. 105–116, 2010.

[15] Y. Yan, C. Wang, A. Zhou, W. Qian, L. Ma, and Y. Pan, “Efficientindices using graph partitioning in rdf triple stores,” in InternationalConference on Data Engineering (ICDE), 2009, pp. 1263–1266.

[16] L. Zou, J. Mo, L. Chen, M. T. Oezsu, and D. Zhao, “gstore: Answeringsparql queries via subgraph matching,” PVLDB, vol. 4, no. 8, 2011.

[17] S. Harris and A. Seaborne, “RSPARQL 1.1 Query Language,” W3CWorking Draft, May 2011, http://www.w3.org/TR/sparql11-query/.

[18] A. Gubichev and T. Neumann, “Path query processing on very large rdfgraphs,” in WebDB, 2011.

[19] B. P. Vandervalk, E. L. McCarthy, and M. D. Wilkinson, “Optimizationof distributed sparql queries using edmonds’ algorithm and prim’salgorithm,” in International Conference on Computational Science andEngineering (CSE), 2009, pp. 330–337.

[20] X. Wang, T. Tiropanis, and H. Davis, “Evaluating graph traversal algo-rithms for distributed sparql query optimization,” in Joint InternationalSemantic Technology Conference (JIST2011), 2011.

[21] P. Cudre-Mauroux and S. Elnikety, “Graph Data Management Systemsfor New Application Domains,” in International Conference on VeryLarge Data Bases (VLDB), 2011.