VLDB 2005
An Efficient SQL-based RDF Querying Scheme
Eugene Inseok Chong Souripriya Das George Eadon Jagannathan Srinivasan
New England Development Center, Oracle
Talk Outline
• Introduction
• Functionality
• Design and Implementation
• Performance
• Conclusions and Future Work
Introduction
RDF (Resource Description Framework)
• RDF is a W3C Standard for describing resources on the web
• Uniform Resource Identifiers (URIs) are used to identify resources
  • Example: http://www.oracle.com/people#John
• RDF triples are used to make statements about a resource
  • Format: (subject predicate object)
  • Example: (:John :brotherOf :Mary)
  • Represents a directed, labeled edge in an RDF graph: :John --:brotherOf--> :Mary
RDF Data and Graph Example
Family Data:
  (:John :brotherOf :Mary)
  (:Mary :parentOf :Matt)
  (:John :name “John”)
  (:Mary :name “Mary”)
  (:Matt :name “Matt”)
[Figure: RDF graph of the family data. Nodes :John, :Mary, and :Matt are connected by the edges :brotherOf and :parentOf, and each node has a :name edge to its literal.]
RDF Querying Problem
• Given
  • RDF graphs: the data set to be searched
  • Graph Pattern: containing a set of variables
• Find
  • Matching Subgraphs
• Return
  • Sets of variable bindings: where each set corresponds to a Matching Subgraph
RDF Query Example
Family Data:
  (:John :brotherOf :Mary)
  (:Mary :parentOf :Matt)
  (:John :name “John”)
  (:Mary :name “Mary”)
  (:Matt :name “Matt”)
Graph Pattern: (names of Mary’s brothers)
  (?x :brotherOf ?y)
  (?y :name “Mary”)
  (?x :name ?n)
Variable Bindings: x = :John, y = :Mary, n = “John”
Matching Subgraph:
  (:John :brotherOf :Mary)
  (:Mary :name “Mary”)
  (:John :name “John”)
[Figure: the family RDF graph, in which the matching subgraph consists of the :brotherOf edge and the :name edges of :John and :Mary.]
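The matching step in this example can be sketched in a few lines of Python. This is a naive backtracking matcher written for illustration, not the scheme presented in this talk; the function names `unify` and `match` and the convention of writing literals with embedded double quotes are assumptions made here.

```python
# Family data from the slide, as (subject, predicate, object) tuples.
# Literals carry their double quotes so they cannot be confused with URIs.
TRIPLES = [
    (":John", ":brotherOf", ":Mary"),
    (":Mary", ":parentOf", ":Matt"),
    (":John", ":name", '"John"'),
    (":Mary", ":name", '"Mary"'),
    (":Matt", ":name", '"Matt"'),
]

# Graph pattern: names of Mary's brothers.
PATTERN = [
    ("?x", ":brotherOf", "?y"),
    ("?y", ":name", '"Mary"'),
    ("?x", ":name", "?n"),
]


def unify(pattern_triple, data_triple, bindings):
    """Extend bindings so pattern_triple equals data_triple, or return None."""
    result = dict(bindings)
    for term, value in zip(pattern_triple, data_triple):
        if term.startswith("?"):
            # A variable must keep the value it was first bound to.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def match(pattern, triples, bindings=None):
    """Yield one set of variable bindings per matching subgraph."""
    bindings = {} if bindings is None else bindings
    if not pattern:
        yield bindings
        return
    first, rest = pattern[0], pattern[1:]
    for triple in triples:
        extended = unify(first, triple, bindings)
        if extended is not None:
            yield from match(rest, triples, extended)


RESULTS = list(match(PATTERN, TRIPLES))
```

`RESULTS` holds a single binding set, corresponding to the single matching subgraph listed on the slide.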
RDF Storage Issues
• Need to store RDF <subject, predicate, object> triples where the individual components can be URIs, blank nodes, or literals
• Namespaces used in URIs could be long
• Multiple triples describe a resource, resulting in repetition of (possibly long) URIs
• Different representations possible for a literal occurring in multiple triples
  • e.g. 120 120.0 12.0e+1 1.20e+2
• RDF graph may include schema triples
  • e.g. (:brotherOf rdfs:domain :Male)
RDF Querying Issues in SQL
• Support specification of graph pattern-based SQL query
• Occurrence of same variables in multiple triples of graph pattern: Processing requires self-join
  • e.g. (?x :brotherOf ?y) (?y :name “Mary”) (?x :name ?n)
• Query processing (e.g., for filter conditions, ORDER BY) requires datatype-specific comparison semantics
  Schema Triple: (:age rdfs:range xsd:int)
  Graph Pattern: (?x :age ?a)
  Filter Condition: a > 60
  ORDER BY: a DESCENDING
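Both issues can be illustrated with a small SQLite session driven from Python. This is a sketch under stated assumptions: the single `triples` table, its column names, and the people `:Ann`, `:Bob`, and `:Cy` with their ages are hypothetical and chosen only for illustration; they are not the storage schema or data of the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical one-table layout: one row per triple, values stored as text.
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        (":John", ":brotherOf", ":Mary"),
        (":Mary", ":parentOf", ":Matt"),
        (":John", ":name", "John"),
        (":Mary", ":name", "Mary"),
        (":Matt", ":name", "Matt"),
        # Hypothetical age data for the datatype example.
        (":Ann", ":age", "9"),
        (":Bob", ":age", "61"),
        (":Cy", ":age", "100"),
    ],
)

# One alias of the table per pattern triple. A variable shared by two
# pattern triples becomes an equality predicate between their columns.
SELF_JOIN_SQL = """
SELECT t1.subject AS x, t1.object AS y, t3.object AS n
FROM triples t1, triples t2, triples t3
WHERE t1.predicate = ':brotherOf'   -- (?x :brotherOf ?y)
  AND t2.predicate = ':name'        -- (?y :name "Mary")
  AND t2.object    = 'Mary'
  AND t2.subject   = t1.object      -- shared variable ?y
  AND t3.predicate = ':name'        -- (?x :name ?n)
  AND t3.subject   = t1.subject     -- shared variable ?x
"""
rows = conn.execute(SELF_JOIN_SQL).fetchall()

# Compared as text, '9' is greater than '60'. The filter a > 60 and the
# ORDER BY therefore compare the values as integers, which is what the
# schema triple (:age rdfs:range xsd:int) calls for.
AGE_SQL = """
SELECT subject AS x, CAST(object AS INTEGER) AS a
FROM triples
WHERE predicate = ':age' AND CAST(object AS INTEGER) > 60
ORDER BY a DESC
"""
age_rows = conn.execute(AGE_SQL).fetchall()
```

A three-triple pattern turns into a three-way self-join, which is why the number of triples in a pattern drives query cost.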
RDF Querying Issues: Inference
• Query processing may involve inferencing.
• Example:
  Data: (:Jim :brotherOf :John) (:John :fatherOf :Mary)
  Graph Pattern: (?x :uncleOf ?y)
  Result without inference: Empty
  Rule: (?x :brotherOf ?y) (?y :fatherOf ?z) → (?x :uncleOf ?z)
  Inferred data: (:Jim :uncleOf :Mary)
  Result with inference: x = :Jim, y = :Mary
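The effect of the rule can be reproduced with a short Python sketch. It applies the one rule from this slide by a nested scan over the data; the function names are assumptions introduced here, and the sketch says nothing about how rulebases are evaluated in the actual implementation.

```python
# Data from the slide.
DATA = {
    (":Jim", ":brotherOf", ":John"),
    (":John", ":fatherOf", ":Mary"),
}


def apply_uncle_rule(triples):
    """Infer (?x :uncleOf ?z) from (?x :brotherOf ?y) (?y :fatherOf ?z)."""
    inferred = set()
    for x, p1, y in triples:
        if p1 != ":brotherOf":
            continue
        for y2, p2, z in triples:
            # The two antecedents must agree on the shared variable ?y.
            if p2 == ":fatherOf" and y2 == y:
                inferred.add((x, ":uncleOf", z))
    return inferred


def query_uncle_of(triples):
    """Return (x, y) bindings for the graph pattern (?x :uncleOf ?y)."""
    return sorted((s, o) for s, p, o in triples if p == ":uncleOf")


WITHOUT_INFERENCE = query_uncle_of(DATA)
INFERRED = apply_uncle_rule(DATA)
WITH_INFERENCE = query_uncle_of(DATA | INFERRED)
```

Querying the raw data returns nothing, while querying the data together with the inferred triple returns the binding shown on the slide.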
RDF Querying Approach
• General Approach
  • Create a new (declarative, SQL-like) query language
  • e.g.: RQL, SeRQL, TRIPLE, N3, Versa, SPARQL, RDQL, RDFQL, SquishQL, RSQL, etc.
• SQL-based Approach
  • Introduces a SQL Table Function RDF_MATCH that uses SPARQL-like graph pattern to express RDF queries
• Benefits of SQL-based Approach
  • Leverages all the powerful constructs in SQL (e.g., SELECT / FROM / WHERE, ORDER BY, GROUP BY, aggregates, Join) to process graph query results
  • RDF queries can easily be combined with conventional queries on database tables, thereby avoiding staging
Embedding RDF Query in SQL
• Use of RDF_MATCH Table Function allows embedding a graph query in a SQL query
  SELECT …
  FROM …, TABLE (
         <RDF Query (expressed as RDF_MATCH Table Function invocation)>
       ) t, …
  WHERE …;
Functionality
RDF_MATCH Table Function
• Input parameters
  RDF_MATCH (
    Pattern,     graph pattern
    Models,      data (set of RDF graphs)
    RuleBases,   rules (0 or more rulebases)
    Aliases      list of prefixes for namespaces
  )
• Returns a set of columns containing variable bindings
  • Variable matching URI returned as single VARCHAR2 column with the same name (e.g. x for ?x)
  • Variable matching literal returned as a pair of VARCHAR2 columns with a name (e.g. x for ?x) and the type (x$type for ?x)
RDF_MATCH Example
• Example: student reviewers less than 25 years old
  SELECT t.r reviewer, t.c conf, t.a age
  FROM TABLE ( RDF_MATCH (
         '(?r rdf:type :Student)
          (?r :reviewerOf ?c)
          (?r :age ?a)',
         RDFModels('reviewers'),
         NULL,
         RDFAliases(…))) t
  WHERE t.a < 25;
Specifying Rules
• RDFS rulebase: Pre-Loaded
• Can add User-defined rules
  • Rule: “Chairperson of Conference is also a reviewer”
    ('rb',                       rulebase name
     'ChairpersonRule',          rule name
     '(?r :ChairpersonOf ?c)',   antecedents
     NULL,                       filter condition
     NULL,                       aliases
     '(?r :ReviewerOf ?c)')      consequents
RDF_MATCH Example with rulebase
• Query: Find reviewers of conferences
  SELECT t.r reviewer
  FROM TABLE ( RDF_MATCH (
         '(?r :ReviewerOf ?c)',
         RDFModels ('reviewers'),
         RDFRules ('rb'),
         NULL)) t;
• Data: (:Mary :ChairpersonOf :IDBC2005)
• Inferred data: (:Mary :ReviewerOf :IDBC2005)
Design & Implementation
RDF Data Storage
• Triples data stored after normalization in two tables
  • UriMap (UriID, UriValue, …) contains mapping of (URIs, blank nodes, literals) to internal identifiers
  • IdTriples (ModelID, SubjectID, PropertyID, ObjectID, …) contains the triple information encoded as three identifiers
• Multiple representations of literals: the first occurrence is treated as canonical, and the rest are mapped to the canonical representation
  • e.g. 120.0 120 1.20e+2 12.0e+1
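The normalization described above can be sketched with SQLite and Python. Only the table and column names UriMap, IdTriples, UriID, UriValue, ModelID, SubjectID, PropertyID, and ObjectID come from the slide; the helper functions, the use of `Decimal` to decide that two numeric literals are the same value, and the `:item` and `:weight` terms are assumptions made for this illustration.

```python
import sqlite3
from decimal import Decimal, InvalidOperation

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE UriMap (UriID INTEGER PRIMARY KEY, UriValue TEXT UNIQUE);
CREATE TABLE IdTriples (ModelID INTEGER, SubjectID INTEGER,
                        PropertyID INTEGER, ObjectID INTEGER);
""")

canonical_form = {}  # numeric value -> first lexical form seen


def canonicalize(term):
    """Map numeric literals of equal value to the first form seen."""
    try:
        value = Decimal(term)
    except InvalidOperation:
        return term  # URIs, blank nodes, non-numeric literals pass through
    return canonical_form.setdefault(value, term)


def term_id(term):
    """Return the UriMap identifier for a term, inserting it if new."""
    term = canonicalize(term)
    conn.execute("INSERT OR IGNORE INTO UriMap (UriValue) VALUES (?)", (term,))
    row = conn.execute(
        "SELECT UriID FROM UriMap WHERE UriValue = ?", (term,)
    ).fetchone()
    return row[0]


def add_triple(model_id, subject, predicate, obj):
    """Store one triple as three internal identifiers."""
    conn.execute(
        "INSERT INTO IdTriples VALUES (?, ?, ?, ?)",
        (model_id, term_id(subject), term_id(predicate), term_id(obj)),
    )


# The four representations from the slide, in the order shown there.
for weight in ("120.0", "120", "1.20e+2", "12.0e+1"):
    add_triple(1, ":item", ":weight", weight)

object_ids = {row[0] for row in conn.execute("SELECT ObjectID FROM IdTriples")}
stored = [row[0] for row in
          conn.execute("SELECT UriValue FROM UriMap ORDER BY UriID")]
```

All four triples end up with the same ObjectID, and UriMap keeps only the first representation, `120.0`. Because long URIs are stored once and referenced by identifier, repeated subjects and properties cost one integer each in IdTriples.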
RDF_MATCH Query Processing
• Substitute aliases with namespaces in search pattern
• Convert URIs and literals to internal IDs
• Generate Query
  • Generate self-join query based on matching variables
  • Generate SQL subqueries for rulebases component (if any)
  • Generate the join result by joining internal IDs with UriMap table
  • Use model IDs to restrict IdTriples table
• Compile and Execute the generated query
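The processing steps above can be sketched as a small pattern-to-SQL compiler over the UriMap and IdTriples tables. This is an illustration under stated assumptions, not the generator used by RDF_MATCH: the namespace URL, the helper names, the `-1` sentinel for unknown terms, and the SQLite backend are all hypothetical, and the rulebase subqueries are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE UriMap (UriID INTEGER PRIMARY KEY, UriValue TEXT UNIQUE);
CREATE TABLE IdTriples (ModelID INTEGER, SubjectID INTEGER,
                        PropertyID INTEGER, ObjectID INTEGER);
""")

NS = "http://www.example.org/family#"  # hypothetical namespace
ALIASES = {"": NS}                     # the empty prefix, as in :John
COLUMNS = ("SubjectID", "PropertyID", "ObjectID")


def expand(term):
    """Substitute the alias prefix of a URI with its namespace."""
    if term.startswith("?") or term.startswith('"'):
        return term  # variables and quoted literals are left alone
    prefix, local = term.split(":", 1)
    return ALIASES[prefix] + local


def lookup_id(value):
    """Convert a URI or literal to its internal ID (-1 if unknown)."""
    row = conn.execute(
        "SELECT UriID FROM UriMap WHERE UriValue = ?", (value,)).fetchone()
    return row[0] if row else -1


def load(model_id, triples):
    """Normalize triples into UriMap and IdTriples."""
    for triple in triples:
        ids = []
        for term in triple:
            value = expand(term)
            conn.execute(
                "INSERT OR IGNORE INTO UriMap (UriValue) VALUES (?)", (value,))
            ids.append(lookup_id(value))
        conn.execute("INSERT INTO IdTriples VALUES (?, ?, ?, ?)",
                     (model_id, *ids))


def compile_pattern(pattern, model_ids):
    """Generate a self-join SQL string for a list of pattern triples."""
    models = ", ".join(str(int(m)) for m in model_ids)
    first_seen = {}  # variable -> column where it first appeared
    tables, where, select = [], [], []
    for i, triple in enumerate(pattern):
        alias = f"t{i}"
        tables.append(f"IdTriples {alias}")
        where.append(f"{alias}.ModelID IN ({models})")  # restrict by model
        for column, term in zip(COLUMNS, triple):
            ref = f"{alias}.{column}"
            if not term.startswith("?"):
                where.append(f"{ref} = {lookup_id(expand(term))}")
            elif term in first_seen:
                where.append(f"{ref} = {first_seen[term]}")  # self-join
            else:
                first_seen[term] = ref
    for var, ref in first_seen.items():
        name = var[1:]
        tables.append(f"UriMap m_{name}")  # map IDs back to values
        where.append(f"m_{name}.UriID = {ref}")
        select.append(f"m_{name}.UriValue AS {name}")
    return (f"SELECT {', '.join(select)} FROM {', '.join(tables)} "
            f"WHERE {' AND '.join(where)}")


load(1, [
    (":John", ":brotherOf", ":Mary"),
    (":Mary", ":parentOf", ":Matt"),
    (":John", ":name", '"John"'),
    (":Mary", ":name", '"Mary"'),
    (":Matt", ":name", '"Matt"'),
])
SQL = compile_pattern(
    [("?x", ":brotherOf", "?y"), ("?y", ":name", '"Mary"'),
     ("?x", ":name", "?n")],
    [1],
)
ROWS = conn.execute(SQL).fetchall()
```

Constants are resolved to IDs before the query is generated, so the joins compare integers only; the UriMap joins at the end exist solely to turn the bound IDs back into readable values.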
Optimization: Table Function Rewrite
• TableRewriteSQL( )
  • Takes RDF Query (specified via arguments) as input
  • Generates a SQL string
• Substitute the table function call with the generated SQL string
• Reparse and execute the resulting query
• Advantages
  • Avoid execution-time overhead (linear in number of result rows) associated with table function infrastructure
  • Leverage SQL optimizer capabilities to optimize the resulting query (including filter condition pushdown)
Optimization: Materialized Join Views
• Generic Materialized Join Views (MJVs)
  • Subject-Subject, Object-Subject, …
• Subject-property matrix MJVs (SPMJVs)
  • Custom, workload based (e.g., frequent search patterns)
  • Example: Select student name, university, and age
    Select r, u, a ……
    ‘(?r rdf:type :Student) (?r :enrolledAt ?u) (?r :age ?a)’……
  • SPMJV: < Student enrolledAt age >
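The idea behind a subject-property matrix can be sketched in SQLite from Python. SQLite has no materialized views, so a plain table built once from the self-join stands in for the SPMJV here; the `triples` table, the `spm_student` name, and the sample students are hypothetical, and the sketch ignores view maintenance and the ID-based encoding of the real storage.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    # Hypothetical sample data.
    (":Ann", "rdf:type", ":Student"),
    (":Ann", ":enrolledAt", ":MIT"),
    (":Ann", ":age", "22"),
    (":Bob", "rdf:type", ":Student"),
    (":Bob", ":enrolledAt", ":NYU"),
    (":Bob", ":age", "24"),
    (":Cy", "rdf:type", ":Professor"),
    (":Cy", ":age", "50"),
])

# The pattern (?r rdf:type :Student) (?r :enrolledAt ?u) (?r :age ?a)
# as a subject-subject self-join over the triples table.
JOIN_SQL = """
SELECT t1.subject AS r, t2.object AS u, t3.object AS a
FROM triples t1
JOIN triples t2 ON t2.subject = t1.subject AND t2.predicate = ':enrolledAt'
JOIN triples t3 ON t3.subject = t1.subject AND t3.predicate = ':age'
WHERE t1.predicate = 'rdf:type' AND t1.object = ':Student'
ORDER BY r
"""
join_rows = conn.execute(JOIN_SQL).fetchall()

# Stand-in for the SPMJV < Student enrolledAt age >: the join is computed
# once and stored with one row per student and one column per property.
conn.execute("""
CREATE TABLE spm_student AS
SELECT t1.subject AS r, t2.object AS enrolledAt, t3.object AS age
FROM triples t1
JOIN triples t2 ON t2.subject = t1.subject AND t2.predicate = ':enrolledAt'
JOIN triples t3 ON t3.subject = t1.subject AND t3.predicate = ':age'
WHERE t1.predicate = 'rdf:type' AND t1.object = ':Student'
""")

# The same query answered from the matrix needs no join at all.
matrix_rows = conn.execute(
    "SELECT r, enrolledAt AS u, age AS a FROM spm_student ORDER BY r"
).fetchall()
```

Both queries return the same rows; the second one reads a single pre-joined table, which is the saving a workload-specific SPMJV is meant to provide for frequent search patterns.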
Performance
Dataset
• WordNet : lexical database for English language
• UniProt : large scale (80 million triples)
  • Protein and annotation data
Experiments
• Varying number of triples in search pattern
• Varying filter conditions
• Varying projection list
• Large-scale RDF data
• Subject-property MJVs
Varying Number of Triples
• ‘(?a wn:hyponymOf ?b) (?b wn:hyponymOf ?c) …..
• Increasing number of self-joins
Varying Number of Triples
[Figure: time (seconds) versus # of triples in the pattern, with one series without MJV and one with MJV]
Varying Projection List
• ‘(?c0 wn:wordForm ?word) (?c0 wn:wordForm ?syn1) (?c1 wn:wordForm ?syn1) …. (5 triples)
• Benefit of the projection list optimization
  • Eliminate joins with UriMap table for variables not referenced outside of RDF_MATCH
Varying Projection List
[Figure: time (seconds) versus projection list size]
Large-Scale RDF Data
• UniProt – 10M, 20M, 40M, 80M triples
• 6 example queries given with UniProt
• Number of matches remains constant as dataset size changes (ROWNUM)
UniProt Sample Queries
Query | Description | Pattern | Projection | Result limit
Q1 | Display the ranges of transmembrane regions | 6 triples, 5 vars | 3 vars | 15000 rows
Q2 | List proteins with publications by authors with matching names | 5 triples, 5 vars, 1 LIKE pred. | 3 vars | 10 rows
Q3 | Count the number of times a publication by a specific author is cited | 3 triples, 2 vars | 0 vars | 32 rows
Q4 | List resources that are related to proteins annotated with a specific keyword | 3 triples, 2 vars | 1 var | 3000 rows
Q5 | List genes associated with human diseases | 7 triples, 5 vars | 3 vars | 750 rows
Q6 | List recently modified entries | 2 triples, 2 vars, 1 range pred. | 2 vars | 8000 rows
Query Response Times: RDF_MATCH Performance Scalability
             | Q1   | Q2     | Q3     | Q4   | Q5   | Q6
10 M Triples | 0.86 | < 0.01 | < 0.01 | 0.03 | 0.18 | 0.46
20 M Triples | 0.95 | < 0.01 | < 0.01 | 0.03 | 0.19 | 0.47
40 M Triples | 0.96 | < 0.01 | < 0.01 | 0.03 | 0.18 | 0.47
80 M Triples | 1.03 | < 0.01 | < 0.01 | 0.03 | 0.20 | 0.49
Maximum      | .054 | 0.002  | 0.002  | .011 | .065 | 0.07
Conclusions
Conclusions and Future Work
• SQL-based RDF querying scheme
  • RDF_MATCH table function
  • Supports graph-pattern based query on RDF data with RDFS and user-defined rules
• Efficient Execution
  • Table Function Rewrite
  • Materialized Join Views: Generic and Subject-Property
  • Rule Indexes
• Future work
  • OPTIONAL support – outer-join
  • Provenance support