A Path-based Relational RDF Database
A. Matono, T. Amagasa, M. Yoshikawa, S. Uemura
ADC 2005
SNU IDB Lab.
Hyewon Lim
January 9th, 2009
2
Contents
  Introduction
  An Overview of RDF
  Related Work and the Differences with Our Work
  Path-based Approach for Storing RDF Data in Relational Databases
  Performance Evaluation
  Conclusions
3
Introduction (1/8)
Quality and quantity of metadata
  The Semantic Web makes it possible to perform high-level processing
    Reasoning, deduction, semantic searches
Metadata
  Described by the Resource Description Framework (RDF)
  RDF describes data and their semantics
4
Introduction (2/8)
The specification defines an RDF model and an RDF syntax
RDF model
  Statements describe a relationship between a pair of terms
  A set of statements represents metadata whose structure is a directed graph
5
Introduction (3/8)
RDF is commonly used as a format to describe various types of metadata
  Typical usage: describing large-scale metadata
    WordNet (35 MB), Gene Ontology (365 MB), Open Directory Project (2 GB)
To handle such data efficiently, RDF DBs that can manage massive RDF data are essential
6
Introduction (4/8)
One naïve approach is to use XML DBs
  Any RDF data can be serialized as XML data
This approach is impractical
  The structure of the semantics (the RDF data) differs from the structure of the syntax (the XML data)
  The semantics cannot be stored in XML DBs
7
Introduction (5/8)
Another way: utilize relational DBs or Berkeley DB
  Several RDF DBs have been proposed
  Such conventional RDF DBs can be classified into two groups
    1. The relational schema is designed based on the RDF schema
       Cannot handle RDF data that do not have an accompanying RDF schema
    2. RDF data are stored in terms of triples
8
Introduction (6/8)
Problems of processing large RDF data using conventional RDF databases
  Ability to handle RDF schema
    Queries that use RDF schema information are an important class of RDF queries
    The second group does not make any distinction between schema information and instance data
    The first group can process such queries
  Poor performance in processing path queries
    A join operation is needed for each path step
9
Introduction (7/8)
We propose a path-based relational RDF DB
  The relational schema is designed to be independent of RDF schema information, and
  designed to make the distinction between schema information and instance data
    Can handle schemaless RDF data as well as RDF data with schema
  Extract all reachable path expressions for each resource, and store them
    To improve performance for path queries
    Join operations are not needed
10
Introduction (8/8)
Steps
  Classify every statement into categories according to the type of predicate
  Construct subgraphs for each category
  Store the subgraphs in distinct relational tables
    Apply appropriate techniques for representing the semantics of each subgraph
Limitation: the structure of each subgraph must be a DAG
11
An Overview of RDF (1/4)
RDF
  A foundation for representing and manipulating metadata on Web resources
  Usable as long as the location of a Web resource is identifiable in terms of a URI
  Statements represent binary relationships between two distinct (or identical) resources
  RDF data are modeled as a directed graph
    Nodes and arcs represent resources and relationships
  Example: "This paper is authored by Akiyoshi MATONO."
    [Figure: nodes www.matono.net/paper and "Akiyoshi MATONO" connected by an arc labeled "authored"]
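As a minimal sketch of the graph model on this slide, the example statement can be written as a (subject, predicate, object) triple and collected into an adjacency map. The names `Triple` and `to_graph` are illustrative and not taken from the paper, and the arc direction (from the paper to the author's name) is an assumption.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement (hypothetical helper type)."""
    subject: str
    predicate: str
    obj: str


def to_graph(triples):
    """Build an adjacency map: node -> list of (arc label, target node)."""
    graph = defaultdict(list)
    for t in triples:
        graph[t.subject].append((t.predicate, t.obj))
    return dict(graph)


# Assumed direction: the paper is the subject, the author's name the object.
statements = [Triple("www.matono.net/paper", "authored", "Akiyoshi MATONO")]
graph = to_graph(statements)
```

Each key of `graph` is a node with outgoing arcs, so a set of statements maps directly onto a labeled directed graph.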
12
An Overview of RDF (2/4)
RDF Schema
  A specification for defining schematic information of RDF data
  We can define:
    Classes (rdfs:Class) as types of resources
    Properties of a class (rdf:Property)
    Domains (rdfs:domain) and ranges (rdfs:range) of the properties
    Inheritance relationships (rdfs:subClassOf, rdfs:subPropertyOf) among classes or properties
    Types (rdf:type)
13
An Overview of RDF (3/4)
Using RDF and RDF Schema, we can represent complex information
14
An Overview of RDF (4/4)
Classifying RDF data
  Large size
    WordNet, ODP, and Gene Ontology
    Created mainly for systematic organization of data resources
    Do not contain cycles
    Simple structure
  Small size
    RSS, FOAF, and Dublin Core
    Used as metadata of images or Web pages
15
Related Work and the Differences with Our Work (1/3)
Several RDF DBs have been proposed
  Most of them use relational DBs or Berkeley DB as their underlying data storage
  Approaches using RDB
    Flat approach: stores statements flatly in a single relational table
    Schema approach: creates relational tables for classes and properties that are defined in the RDF schema information, storing resources according to their classes/properties
  Approaches using BDB
    Hash approach: creates three hash tables
      Keys: subjects, predicates, objects
16
Related Work and the Differences with Our Work (2/3)
Problems of the conventional approaches
  Flat and hash approaches
    Difficult to perform schema queries
    They do not make any distinction between schema information and resource descriptions
  Schema approach
    Able to process queries about RDF schema
    Cannot handle RDF data without RDF schema information
      The relational schema is designed based on that information
    Costly to maintain under schema evolution
  The capabilities of the three approaches for processing path-based queries are not sufficient
17
Related Work and the Differences with Our Work (3/3)
In conventional RDF databases, statement-based queries can be processed efficiently
  RDF data are decomposed into a large number of statements
When processing a path-based query
  A number of join operations are required, according to the steps in the path expression
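The join cost described above can be illustrated with a small sketch of the flat approach, using Python's built-in `sqlite3` module. The table name `triple`, its columns, and the sample rows are assumptions made for illustration; they are not the schema of Jena2 or of any system mentioned in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical flat layout: every statement is one row.
conn.execute("CREATE TABLE triple (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triple VALUES (?, ?, ?)",
    [
        ("#artist1", "#paints", "#painting1"),
        ("#painting1", "#title", "Untitled"),
        ("#artist2", "#paints", "#painting2"),
    ],
)

# Following the path #paints then #title takes one self-join here;
# every further step in the path expression adds another self-join.
rows = conn.execute(
    """
    SELECT t2.object
    FROM triple AS t1
    JOIN triple AS t2 ON t1.object = t2.subject
    WHERE t1.predicate = '#paints' AND t2.predicate = '#title'
    """
).fetchall()
titles = [r[0] for r in rows]
```

With the rows above, only `#painting1` has a title, so the join yields a single result.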
18
Path-based Approach for Storing RDF Data in Relational Databases
- Subgraph extraction from RDF graph (1/2)
When storing RDF data, the system
  Parses the RDF data
  Generates its own RDF graph
  Decomposes the graph into five subgraphs according to the type of predicate
    Class Inheritance (CI) graphs: rdfs:subClassOf
    Property Inheritance (PI) graphs: rdfs:subPropertyOf
    Type (T) graphs: rdf:type
    Domain-Range (DR) graphs: rdfs:domain, rdfs:range
    Generic (G) graphs
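The decomposition step can be sketched as a lookup on the predicate of each statement. This is an illustrative sketch only: the function name `split_into_subgraphs` and the sample resources are hypothetical, and it assumes that every statement whose predicate is not one of the listed RDF/RDFS terms belongs to the G graph.

```python
from collections import defaultdict

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Predicate -> subgraph category, following the five categories on the slide.
CATEGORY_BY_PREDICATE = {
    RDFS + "subClassOf": "CI",
    RDFS + "subPropertyOf": "PI",
    RDF + "type": "T",
    RDFS + "domain": "DR",
    RDFS + "range": "DR",
}


def split_into_subgraphs(triples):
    """Group (subject, predicate, object) triples by subgraph category."""
    subgraphs = defaultdict(list)
    for s, p, o in triples:
        # Assumption: anything not listed above goes to the generic graph G.
        subgraphs[CATEGORY_BY_PREDICATE.get(p, "G")].append((s, p, o))
    return dict(subgraphs)


example = [
    ("#Painter", RDFS + "subClassOf", "#Artist"),
    ("#paints", RDFS + "domain", "#Painter"),
    ("#picasso", RDF + "type", "#Painter"),
    ("#picasso", "#paints", "#guernica"),
]
subgraphs = split_into_subgraphs(example)
```

Because the split depends only on the predicate, it works the same way whether or not the data come with an RDF schema.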
19
Path-based Approach for Storing RDF Data in Relational Databases
- Subgraph extraction from RDF graph (2/2)
Advantages of dividing an RDF graph
  RDF data are stored in distinct relational tables
    The relational schema can be designed to be independent of RDF schema information
  Structures of the resulting subgraphs are less complex than the original RDF graph
    Opportunities to apply several techniques for representing each subgraph by considering each graph structure
20
Path-based Approach for Storing RDF Data in Relational Databases
- Path expressions (1/3)
Most queries on RDF data are
  Queries to detect subgraphs matching a given graph
  Queries to detect a set of nodes that can be reached via given path expressions
These queries are represented as path expressions
  Storage based on path expressions
    Decreases the number of join operations
21
Path-based Approach for Storing RDF Data in Relational Databases
- Path expressions (2/3)
Path expressions are stored not for the entire RDF graph, but only for graph G, to which path-based queries are frequently posed
  Graphs CI and PI should be stored by a scheme that can detect ancestor-descendant relationships
Queries for RDF data use path expressions consisting of arcs
  Arc paths are stored in a relational table
22
Path-based Approach for Storing RDF Data in Relational Databases
- Path expressions (3/3)
Arc path
  Given a DAG g with node set V(g) and arc set E(g)
  A finite sequence of arcs (v_0, v_1), (v_1, v_2), …, (v_{k-2}, v_{k-1}), (v_{k-1}, v_k)
  The path expression of the arc path: l(v_0, v_1), l(v_1, v_2), …, l(v_{k-2}, v_{k-1}), l(v_{k-1}, v_k), where l(v_i, v_j) is the label of arc (v_i, v_j)
Absolute arc path
  An arc path whose source node is a root
[Figure: nodes v_m and v_n]
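To make the definition concrete, the following sketch enumerates the path expressions of all absolute arc paths in a small DAG. The function name, the sample arcs, and the "/" separator are assumptions for illustration; the query example later in the slides writes a stored expression as '#title<#paints', so the paper's actual encoding differs from this one.

```python
def absolute_path_expressions(arcs):
    """arcs: list of (source, label, target) tuples forming a DAG.

    Returns {node: set of path expressions leading from a root to node}.
    """
    children = {}
    nodes, targets = set(), set()
    for src, label, dst in arcs:
        children.setdefault(src, []).append((label, dst))
        nodes.update((src, dst))
        targets.add(dst)
    roots = nodes - targets  # roots have no incoming arc
    result = {n: set() for n in nodes}

    def walk(node, labels):
        # Extend the current label sequence along every outgoing arc.
        for label, nxt in children.get(node, []):
            path = labels + (label,)
            result[nxt].add("/".join(path))
            walk(nxt, path)

    for root in sorted(roots):
        walk(root, ())
    return result


arcs = [
    ("#artist1", "#paints", "#painting1"),
    ("#painting1", "#title", "Untitled"),
    ("#museum1", "#exhibits", "#painting1"),
]
paths = absolute_path_expressions(arcs)
```

The recursion terminates because the input is assumed to be acyclic. A node reachable from several roots, such as the title literal here, gets one expression per absolute arc path.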
23
Path-based Approach for Storing RDF Data in Relational Databases
- Extended interval numbering scheme for DAGs (1/2)
Interval numbering scheme
  Detects ancestor-descendant relationships between two nodes in a tree
  We use it to detect inheritance relationships between classes or properties
  We extend the scheme to apply it to DAGs
24
Path-based Approach for Storing RDF Data in Relational Databases
- Extended interval numbering scheme for DAGs (2/2)
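The relationship between two nodes can be verified by a subsumption test
  v is an ancestor of u iff pre(v) < pre(u) ∧ post(u) < post(v)
  v is a parent of u if, in addition, depth(u) - depth(v) = 1
[Figure: two examples; the node labels appear to be (pre, post, depth). Ancestor v (2, 5, 1) with descendant u (4, 1, 3); parent v (5, 4, 2) with child u (6, 3, 3)]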
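The subsumption test can be sketched for the plain tree case as follows. This is only an illustration: the class names are hypothetical, numbering starts at 1 by assumption, and the extension to DAGs is not reproduced here because the slides do not describe its details.

```python
def number_tree(children, root):
    """Assign (pre, post, depth) to every node of a tree by depth-first search."""
    labels = {}
    counters = {"pre": 0, "post": 0}

    def visit(node, depth):
        counters["pre"] += 1           # preorder rank: assigned on entry
        pre = counters["pre"]
        for child in children.get(node, []):
            visit(child, depth + 1)
        counters["post"] += 1          # postorder rank: assigned on exit
        labels[node] = (pre, counters["post"], depth)

    visit(root, 1)
    return labels


def is_ancestor(labels, v, u):
    """v is an ancestor of u iff pre(v) < pre(u) and post(u) < post(v)."""
    pre_v, post_v, _ = labels[v]
    pre_u, post_u, _ = labels[u]
    return pre_v < pre_u and post_u < post_v


def is_parent(labels, v, u):
    """A parent is an ancestor exactly one level above."""
    return is_ancestor(labels, v, u) and labels[u][2] - labels[v][2] == 1


# Hypothetical class hierarchy, given as parent -> list of children.
tree = {"Resource": ["Artist", "Artifact"], "Artist": ["Painter"]}
labels = number_tree(tree, "Resource")
```

Once the numbers are stored, both checks are simple comparisons on the two label tuples and need no traversal of the hierarchy.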
25
Path-based Approach for Storing RDF Data in Relational Databases
- Proposed relational schema (1/2)
A relational schema is designed for storing RDF data based on the subgraphs
26
Path-based Approach for Storing RDF Data in Relational Databases
- Proposed relational schema (2/2)
Storage example of the RDF data
27
Path-based Approach for Storing RDF Data in Relational Databases
- Query Processing
Examples
  Find the title of something painted by someone
    SELECT r.resourceName
    FROM path AS p, resource AS r
    WHERE p.pathID = r.pathID
      AND p.pathexp = '#title<#paints'
  Find the names of the classes whose direct superclass is http://www.w3.org/2000/01/rdf-schema#Resources
    SELECT c1.className
    FROM class AS c, class AS c1
    WHERE c.pre < c1.pre AND c.post > c1.post
      AND c.depth = c1.depth - 1
      AND c.className = 'http://www.w3.org/2000/01/rdf-schema#Resources'
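Both queries can be tried on toy data with Python's built-in `sqlite3` module. The sketch below is not the paper's schema: the column lists are guessed from the names used in the queries, the sample rows are invented, and the root class is passed as a parameter using the standard rdfs:Resource URI.

```python
import sqlite3

ROOT = "http://www.w3.org/2000/01/rdf-schema#Resource"

conn = sqlite3.connect(":memory:")
# Hypothetical minimal tables; only the columns the queries touch are modeled.
conn.executescript(
    """
    CREATE TABLE path (pathID INTEGER PRIMARY KEY, pathexp TEXT);
    CREATE TABLE resource (resourceName TEXT, pathID INTEGER);
    CREATE TABLE class (className TEXT, pre INTEGER, post INTEGER, depth INTEGER);

    INSERT INTO path VALUES (1, '#paints'), (2, '#title<#paints');
    INSERT INTO resource VALUES ('#painting1', 1), ('Untitled', 2);
    INSERT INTO class VALUES
        ('http://www.w3.org/2000/01/rdf-schema#Resource', 1, 4, 1),
        ('#Artist', 2, 2, 2),
        ('#Painter', 3, 1, 3),
        ('#Artifact', 4, 3, 2);
    """
)

# Path query: one equality test on the stored path expression,
# with no join per path step.
titles = [row[0] for row in conn.execute(
    "SELECT r.resourceName FROM path AS p, resource AS r "
    "WHERE p.pathID = r.pathID AND p.pathexp = ?", ("#title<#paints",))]

# Schema query: classes directly below ROOT, via the interval numbers.
subclasses = [row[0] for row in conn.execute(
    "SELECT c1.className FROM class AS c, class AS c1 "
    "WHERE c.pre < c1.pre AND c.post > c1.post "
    "AND c.depth = c1.depth - 1 AND c.className = ? "
    "ORDER BY c1.className", (ROOT,))]
```

In this toy data `#Painter` sits two levels below the root, so the depth condition excludes it from the second result.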
28
Performance Evaluation
Compared the processing time between our approach and Jena2
  Jena2: based on the flat approach
Cannot evaluate the performance of schema-based queries
  No RDF data with schema information that is large enough to be used in our experiments exists on the Web
Environment
  Athlon 1.4 GHz CPU, 1 GB memory, Gentoo Linux 1.4, PostgreSQL 7.4.3
29
Performance Evaluation
- Schema-based Queries (1/3)
Basic schema queries
  Find immediate children (or parents) of a given class (or property)
  Find inheritance relationships between two given classes (or properties)
  Find classes as a domain (or range) of a given property
Querying the meta-schema
  Find all resources, that is, instances of "rdfs:Resource"
  Find all literals
30
Performance Evaluation
- Schema-based Queries (2/3)
Querying type information
  Find a set of instances of a given class
  Find a set of statements using a given property
When the above queries are processed, there are two cases:
  The answer is obtained by a single access to data storage, or by multiple accesses
31
Performance Evaluation
- Schema-based Queries (3/3)
The ability of each approach for schema-based queries
  Our approach is efficient because of the interval numbering scheme
  In meta-schema queries, if the RDF graph includes many multiple paths, the redundancy increases
32
Performance Evaluation
- Path-based Queries (1/2)
Datasets
  Sufficient size to see scalability
  The G graph of the data does not contain any cycles
  The G graph of the data contains long absolute path expressions
  Use the Gene Ontology
33
Performance Evaluation
- Path-based Queries (2/2)
Experiment results
34
Conclusions
We can handle schemaless RDF data
We can process schema-based queries using the interval numbering scheme
For path-based queries
  Achieved high performance
    To reduce the number of join operations, we stored RDF data based on path expressions
Future work
  Investigate query-processing techniques
    Query language, query transformation, and query optimization for RDF data