RDF Storage methods and Systems Nikolaou Charalampos (A.M.: M953) Kotsifakos Alexios (A.M.: M964)...

RDF Storage methods and Systems

Nikolaou Charalampos (A.M.: M953)Kotsifakos Alexios (A.M.: M964)

Department of Informatics and Telecommunications

15/03/08 2

Presentation Outline

• RDF Triple Table

• Means of Storage

• Storage Methods

• RDF Storage Systems

• Storage Systems’ Evaluation

15/03/08 3

Storing RDF data

• RDF triple: <subject, property, object>, e.g. : <http://en.wikipedia.org/wiki/Stevie_Ray_Vaughan, performed, “Little Wing”>

• RDF database: a single table with three columns, subject, property, object

15/03/08 4

Why studying different RDF storage methods? (1/2)

• RDF triple store has the following disadvantages:– All interesting queries require many

self-joins over the rdf table:

SQL Query

SELECT p5.objFROM rdf AS p1, rdf AS p2, rdf AS p3,rdf AS p4, rdf AS p5WHERE p1.prop = ’title’ AND p1.obj ˜= ’Transaction’AND p1.subj = p2.subj AND p2.prop = ’type’AND p2.obj = ’book’ AND p3.prop = ’type’AND p3.obj = ’auth’ AND p4.prop = ’hasAuth’AND p4.subj = p2.subj AND p4.obj = p3.subjAND p5.prop = ’isnamed’ AND p5.subj = p4.obj;

RDF Triple Table

person1 isNamed ‘‘Serge Abiteboul’’person2 isNamed ‘‘Rick Hull’’person3 isNamed ‘‘Victor Vianu’’book1 hasAuthor person1book1 hasAuthor person2book1 hasAuthor person3book1 isTitled ‘‘Foundations of Databases’’

15/03/08 5

Why studying different RDF storage methods? (2/2)

- As the number of triples increases, the execution time of queries is also increased, due to memory consumption => more indexes required

- Inference at assertion time is infeasible as the number of triples scales, since for each entailment rule many more triples are stored

15/03/08 6

Triple store requirements – mapping considerations

Triple Store Requirements• Text searching• URIs• Datatypes• RDF Containers (rdf:Seq, rdf:Bag, rdf:Alt, rdf:_n)• RDF Vocabulary Description Language support (aka RDF Schema)• Ontological support, Inferencing• Triple provenance

Mapping considerations• The database schema• The particular database implementation• Non-relational databases such as ODBMS, XML• Database tuning• Database updates• Exposing the database schema

15/03/08 7

Means of storage

• Memory

• Hard Disk- Database

- File System (Native)

15/03/08 8

Storage Methods (1/2)

• Schema-oblivious (also called generic or vertical): One ternary relation is used to store any RDF/S schema or resource description graph. (Figure 1)

• Schema-aware: (also called specific or binary): One table per RDF/S schema property or class is used. (Figure 2)

• Hybrid: One ternary relation for every different property range type and a binary relation for all class instances (as in schema-aware). Property (class) instances with range values of the same type are stored in the same relation, distinguished by the property (class) id (as in schema-oblivious). (Figure 3)

15/03/08 9

Storage Methods (2/2)

15/03/08 10

Method Variations (1/2)

• Schema-oblivious– URI: stores the URIs in the table holding the triples

– ID: relies on integer identifiers to represent resources and properties in the triple table and stores them only once in a separate table

15/03/08 11

Method Variations (2/2)

• Schema-aware– ISA: exploits the object-relational features of SQL99

for representing subsumption relationships using sub-table definitions

– NOISA: stores RDF/S data using a standard relational representation as depicted in Figure 2

– Vertical Partitioning: NOISA + sort by subject + value column for each table can be optionally indexed

15/03/08 12

Extension to Schema-oblivious (1/2)

• Property Tables– Clustered Property Table: Clusters of properties that

tend to be defined together (Figure 4)

– Property-class Table: Clusters similar sets of subjects together in the same table (Figure 5)

15/03/08 13

Extension to Schema-oblivious (2/2)

Figure 4: Clustered PropertyTable Example

Figure 5: Property-class Table

15/03/08 14

Advantages – Disadvantages (1/5)

• Schema-oblivious(+) Straightforward Schema Evolution

(-) Disregards Type Information

(-) Significant storage overhead (URI)

(-) Additional join operation at the end of every query (ID)

15/03/08 15


• Schema-aware(-) Addition/deletion of a new property requires the

addition/deletion of a table

(-) Significant overhead when managing a potentially large number of tables

(++) Subsumption is implicitly supported by the schema (ISA)

(+) Internal encoding of subsumption => efficient evaluation of taxonomic queries in secondary storage (NOISA)

15/03/08 16


• Schema-aware (NOISA)(+) Support for multi-valued attributes(+) Support for heterogeneous records(+) Only those properties accessed by a query need to

be read(+) No clustering algorithms are needed(+) Fewer unions for one property and fast joins(-) Join is not free(-) Slower inserts

15/03/08 17


• Hybrid(+) Schema evolution easily supported

(+) Preserves type information

(+) Internal encoding of subsumption => efficient evaluation of taxonomic queries in secondary storage (NOISA)

15/03/08 18


• Property Tables(+) Reduce subject-subject self-joins of the triples table(+) Speed up queries that can be answered from a single

property table(-) Most queries require joins or unions to combine data

from several tables(-) RDF data tends not to be very structured (NULL)(-) Multi-valued attributes cause further complexity(-) Property clustering must be carefully done to avoid

creating too wide tables(-) Problematic queries that do not select on class type

(Property-class tables)(-) Unspecified property (clustered property tables)

15/03/08 19

Inference Support

• Precompute inferring triples (compile time)– Schema-oblivious (URI, ID)

• Compute on demand (run time)– Schema-aware (ISA, NOISA)– Hybrid

• Computation in:– Main memory (ISA)– Secondary memory (NOISA, Hybrid)

15/03/08 20

Column-Oriented DBMS

• Tuples are stored in column format instead of standard row format

• Query evaluation– Only columns relevant to a query read into memory and not entire

rows before projection occurs (wasting bandwidth)– Inserts might be slower in column-stores (especially if they are not

done in batch)

• Vertical partition uses column format for better performance

15/03/08 21

Performance Results (1/2)

Figure 6: Performance comparison of the triple-store schemawith the property table and vertically partitioned schemas. Propertytables contain only the columns necessary to execute particular query.

15/03/08 22

Performance Results (2/2)

Figure 7: Performance as number of triplesscale for a specific query

15/03/08 23

RDF Storage Systems

• Jena1, Jena2

• ICS-FORTH RDFSuite

• Sesame

• 3store

• RStar

• Oracle

15/03/08 24

Jena1 (1/2)

• Storage method: Schema-oblivious (ID)

• Supports multiple graphs stored in the Statement Table (one more attribute specifying the graph id)

• Supports reified statements by adding a statement id attribute to the Statement Table (does not allow multiple reified instances of any statement)

15/03/08 25

Jena1 (2/2)

Figure 8: Jena1 Schema (Normalized)

15/03/08 26

Jena2 (1/3)

• Storage method: Schema-oblivious (URI)

• Supports storage of similar graphs in multiple statement tables (one more attribute specifying the graph id)

• Supports clustered property tables and property-class table

• Reified statements are stored in property-class table (allows multiple reified instances of any statement)

15/03/08 27

Jena2 (2/3)

• Multi-valued properties may be clustered or may be stored in a separate table

• To face space consumption– Common prefixes in URIs are stored in a separate

table and the prefix itself is replaced by a db reference (prefix in cache)

– Long values are stored only once (defined by a threshold)

15/03/08 28

Jena2 (3/3)

Figure 9: Jena2 Schema (Denormalized)

15/03/08 29

Jena1 vs. Jena2

• The denormalized schema of Jena2 is faster than the normalized schema of Jena1, twice as fast for many operations

• For queries with high selectivity Jena1 and Jena2 perform about the same, but Jena2 performs better on queries which join a large number of tuples

15/03/08 30

ICS-Forth RDFSuite (1/3)• Storage method: Schema-aware (ISA)

• Core RDF/S model: Class, Property, SubClass and SubProperty tables to represent RDFS hierarchies

• Uses a Namespace table

• Uses a Type table to hold the built-in types of RDF/S (e.g. rdf:Property, rdfs:Class, etc), Literal types (e.g. integer, date), Container types (e.g. rdf:Bag)

• For each class/property there is a table in which its instances are stored

15/03/08 31

ICS-Forth RDFSuite (2/3)

Figure 10: Relational Representation of RDF Description Bases

15/03/08 32

Variations of RDFSuite (3/3)

• Construct a unified Instance table containing all instances of all classes

• Construct a unified Instance table containing all instance properties of all properties

• For every property with range a literal type, add an attribute to the class table of its domain (this is for single valued properties which are not specialized)

15/03/08 33

Sesame (1/5)


• Memory, Database and File System Support

• Abandonment of object-relational backend– One table for each class/property– The semantics of subtables do not match the

semantics of rdfs:subClassOf (multiple parents)

15/03/08 34

Sesame – Database (2/5)

Figure 11: Sesame RDBMS Schema

15/03/08 35

Sesame - Memory (3/5)

Figure 12: Bipartite graph representation of an RDF model

15/03/08 36

Sesame – File System (4/5)

• Read/write with Java’s I/O classes

• Stores indices (B-trees, hash tables) for quick searching

• Employs selective caching of data in memory for increasing retrieval performance

15/03/08 37

Sesame - Performance (5/5)

Figure 13: Left: first 20 min. , Right: complete upload

15/03/08 38

3store (1/2)

Figure 14: 3store Database Schema

15/03/08 39

3store (2/2)

• Datatypes: All datatypes are stored as strings => determination of datatype is done by conversion at runtime

• Hybrid approach for the production of entailment rules:– Those generating fewer entailments are evaluated at

assertion time with forward chaining rules– Those with greater storage cost and a lower evaluation

cost are evaluated at query time with backward chaining rules

15/03/08 40

RStar (1/2)


• Core RDF/S model: Class, Property, SubClass, SubProperty and Property-Class tables to represent RDFS hierarchies

• Uses a Triples table

• Uses a Namespace table

• Uses a Type table to hold Literal types (e.g. integer, float, date, string)

• For efficient searching all property, class, resource and data names are hashed

15/03/08 41

RStar (2/2)

Figure 15: A database schema of RStar

15/03/08 42

Oracle• Storage method: Schema-oblivious (ID)

• One table for triple storage (IdTriples), one table for mapping URIs and Literals to IDs (UriMap)

• Supports user views on selective portions of the RDF data

• Supports multiple representations for the same value (canonical literal ID)

• Pre-defined datatypes are partitioned into families according to their value space (e.g. numeric family)

• To speed up query answering “subject property matrices” are used as auxiliary tables

15/03/08 43

Storage Systems’ Evaluation (1/5)

• Two storage systems based on memory(Sesame-Memory, Jena-Memory)

• Two persistent storage systems with RDBMS (Sesame-DB, Jena-DB)

• Three native RDF approach systems(Sesame-Native, Kowari, YARS)

15/03/08 44

Storage Systems’ EvaluationQuery response time (2/5)

• Sesame-DB, Jena-DB slowest systems:– Query mechanism represented by a series of join

instructions– Store most triples in a single table

• Sesame-Memory, Jena-Memory -> quick

• Native systems– Sesame-Native like Sesame-Memory (use same

query evaluation processing)– Kowari solid performance– YARS several times faster than Sesame, using index

15/03/08 45

Storage Systems’ EvaluationData loading time, repository size (3/5)

Figure 16: Data loading time and repository size

15/03/08 46

Storage Systems’ EvaluationData loading time, repository size (4/5)

• Memory-based systems -> best performance

• DB-based systems -> bad scalability– Sesame-DB: a) performs duplication check, without caching

mechanism, since it employs a unique ID to RDF node mapping system, b) re-invokes its inference engine on loading RDF file

• Native systems -> nice performances– Sesame-Native, YARS use B-trees. YARS uses additional indices

=> more space– Kowari even more spaces using 64-bit storage

15/03/08 47

Storage Systems’ Evaluation (5/5)

fixedIBM DB2Forward ChainingRSQLRstar

fixedOracle, any RDBMS

(Btrees, table functions)

Forward ChainingSPARQL-like, RDQL-like

Oracle

fixedMySQLHybridRDQL3store

fixed and non-fixed

PostgreSql, any RDBMS

Backward ChainingRQLRDF

Suite

non-fixedPostgreSql, MySQL, Oracle9i

Forward chainingRDQL, RQL, SeRQLSesame

non-fixedPostgreSql, MySQL, Oracle

Backward/Forward and hybrid chaining

RQL, RDQL, SPARQL

Jena

SchemaDBMSInference support

Query

Language

System

15/03/08 48

Thank you!!!

15/03/08 49

References (1/2)

• Y. Theoharis, V. Christophides and G. Karvounarakis, Benchmarking Database Representations of RDF/S Stores. Fourth International Semantic Web Conference (ISWC'05), Galway, Ireland, November, 2005.

• Abadi, Daniel J., Marcus, Adam, Madden, Samuel R., and Hollenbach, Kate. Scalable Semantic Web Data Management Using Vertical Partitioning. In Proceedings of VLDB 2007.

• Stephen Harris, Nicholas Gibbins. 3store: Efficient Bulk RDF Storage, In Proceedings of the 1st International Workshop on Practical and Scalable Semantic Systems(PSSS'03).

• Jeen Broekstra, Arjohn Kampman, and Frank van Harmelen, Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema. In Proceedings of ISWC 2002.

• S. Alexaki, V. Christophides, G. Karvounarakis, D. Plexousakis, K. Tolle, The ICS-FORTH RDFSuite: Managing Voluminous RDF Description Bases, 2nd International Workshop on the Semantic Web (SemWeb'01), pp. 1-13, Hongkong, May 1, 2001.

• L. Ma, Z. Su, Y. Pan, L. Zhang, T. Liu: RStar: An RDF Storage and Query System for Enterprise Resource Management. In Proc. of the ACM CIKM 2004.

15/03/08 50

References (2/2)

• K. Wilkinson, C. Sayers, H. A. Kuno, D. Raynolds: Efficient RDF Storage and Retrieval in Jena2. In Proc. of SWDB'03 (co-located with VLDB'03).

• E. I. Chong, S. Das, G. Eadon, and J. Srinivasan. An Efficient SQL-based RDF Querying Scheme. In VLDB, pages 1216–1227, 2005.

• D. Beckett, J. Grant, SWAD-Europe: Mapping Semantic Web Data with RDBMSes, http://www.w3.org/2001/sw/Europe/reports/scalable_rdbms_mapping_report/

• J. Broekstra, “Storage, Querying and Inferencing for Semantic Web Languages”, Phd Thesis.

• Liu Baolin, H. Bo, An Evaluation of RDF Storage Systems for Large Data Applications, In Proc. of First International Conference on Semantics, Knowledge, and Grid (SKG 2005), 2005.

RDF Storage methods and Systems Nikolaou Charalampos (A.M.: M953) Kotsifakos Alexios (A.M.: M964)...

Documents

Transcript of RDF Storage methods and Systems Nikolaou Charalampos (A.M.: M953) Kotsifakos Alexios (A.M.: M964)...