SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data
![Page 1: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/1.jpg)
Institute for Web Science and Technologies
University of Koblenz ▪ Landau, Germany
Systematic Generation of SPARQL Benchmark Queries for Linked Open Data
Olaf Görlitz, Matthias Thimm, Steffen Staab
![Page 2: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/2.jpg)
ISWC'12, Boston, 11/15/2012
SPLODGE: Systematic LOD Benchmark Query Generation
Olaf Görlitz, Matthias Thimm, Steffen Staab
Linked Data Federation
SPARQL Queries on the Linked Data Cloud
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
![Page 3: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/3.jpg)
The Problem
distributed queries
federation implementation
Why not use benchmark queries?
![Page 4: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/4.jpg)
RDF Benchmarks
LUBM, BSBM, SP²B, ...
• Synthetic datasets
• Domain-specific
• Highly structured
• Sophisticated queries
• Limitation: centralized
FedBench (ISWC'11)
• 10 Linked Data sets (~170M triples)
• 25 handpicked distributed queries
• Limitation: fixed
Goal: a scalable, flexible, expressive Linked Data benchmark
![Page 5: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/5.jpg)
Overview
Benchmark Idea, Methodology, Evaluation
![Page 6: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/6.jpg)
Linked Data Benchmark Features
• Scalability: Real Linked Data Sets
• Flexibility: Customization
• Expressiveness: Typical + Complex Queries
Systematic SPARQL Benchmark Query Generator for Linked Open Data
![Page 7: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/7.jpg)
Requirements
What we want, and the requirement that follows:
1. Customize Benchmark → Define Query Characteristics
2. Random Queries → Automatic Query Generation
3. #results > 0 → Query Validation
![Page 8: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/8.jpg)
Contribution
Methodology and toolset for systematic query generation
Parameterization → Query Generation → Query Validation
Input: Linked Data, Config
Output: Benchmark Queries
![Page 9: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/9.jpg)
Overview
Benchmark Idea, Methodology, Evaluation
![Page 10: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/10.jpg)
SPLODGE Methodology
Query Parameterization → Query Generation → Query Validation
Define typical + challenging distributed queries
• No federation query logs available
• Analyze queries of benchmarks
Example (FedBench/LifeScience#5):
SELECT ?drug ?keggUrl ?chebiImage WHERE {
  ?drug rdf:type drugbank:drugs .
  ?drug drugbank:keggCompoundId ?keggDrug .
  ?keggDrug bio2rdf:url ?keggUrl .
  ?drug drugbank:genericName ?drugBankName .
  ?chebiDrug purl:title ?drugBankName .
  ?chebiDrug chebi:image ?chebiImage .
}
![Page 11: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/11.jpg)
SPLODGE Methodology
Query Parameterization → Query Generation → Query Validation
Algebra
• Query Form (Select, Construct, ...)
• Join Type (conj. / disj. / left-join)
• Result Modifiers (limit, offset, order by)
Structure
• Variable Patterns (s, o, s+o, ...)
• Join Patterns (star, path)
• Cross Product
Cardinality
• # Data Sources
• # Joins / Patterns
• # Results
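The three parameter groups (algebra, structure, cardinality) could be captured in one small configuration object. The following is a minimal sketch in Python; the class name `QueryConfig`, its fields, and its defaults are hypothetical and are not taken from the SPLODGE toolset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QueryConfig:
    """Hypothetical parameter set mirroring the three groups on the slide."""

    # Algebra
    query_form: str = "SELECT"        # SELECT, CONSTRUCT, ...
    join_type: str = "conjunctive"    # conjunctive, disjunctive, left-join
    limit: Optional[int] = None       # one example of a result modifier
    # Structure
    join_pattern: str = "path"        # "star" or "path"
    # Cardinality
    num_patterns: int = 3
    num_sources: int = 2
    min_results: int = 1

    def validate(self) -> None:
        """Raise ValueError if the combination of parameters is inconsistent."""
        if self.join_pattern not in ("star", "path"):
            raise ValueError(f"unknown join pattern: {self.join_pattern}")
        if self.num_patterns < 1:
            raise ValueError("need at least one triple pattern")
        # For path-joins the slides state m sources over n patterns with m <= n.
        if self.join_pattern == "path" and self.num_sources > self.num_patterns:
            raise ValueError("num_sources must not exceed num_patterns")
```

Calling `QueryConfig(num_patterns=4, num_sources=3).validate()` passes, while a path-join with more sources than patterns is rejected.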
![Page 12: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/12.jpg)
SPLODGE Methodology
Query Parameterization → Query Generation → Query Validation
Main query parameter: join structure
Join structures in FedBench queries: star join, path join
![Page 13: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/13.jpg)
SPLODGE Methodology
Query Parameterization → Query Generation → Query Validation
• Path-join: n triple patterns, m sources (m ≤ n)
• Star-join: n triple patterns, anchor node (s/o)
Additional query parameters:
• # triple patterns
• # data sources
• result size
• ...
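To make the two join structures concrete, here is a minimal Python sketch that assembles a path-join and a star-join query from a list of predicate URIs. The function names are hypothetical; the `?var1`, `?var2`, ... naming is borrowed from the example query on the Evaluation Setup slide, and `?anchor` is an illustrative choice.

```python
from typing import List


def path_join_query(predicates: List[str]) -> str:
    """Chain patterns so the object of one pattern is the subject of the next."""
    patterns = [
        f"  ?var{i + 1} <{p}> ?var{i + 2} ."
        for i, p in enumerate(predicates)
    ]
    return "SELECT * WHERE {\n" + "\n".join(patterns) + "\n}"


def star_join_query(predicates: List[str], anchor: str = "s") -> str:
    """Join all patterns on one anchor variable in subject (s) or object (o) position."""
    if anchor not in ("s", "o"):
        raise ValueError("anchor must be 's' or 'o'")
    patterns = []
    for i, p in enumerate(predicates):
        if anchor == "s":
            patterns.append(f"  ?anchor <{p}> ?var{i + 1} .")
        else:
            patterns.append(f"  ?var{i + 1} <{p}> ?anchor .")
    return "SELECT * WHERE {\n" + "\n".join(patterns) + "\n}"
```

A path-join over n predicates uses n + 1 variables, while a star-join shares the single anchor variable across all n patterns.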
![Page 14: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/14.jpg)
SPLODGE Methodology
Query Parameterization → Query Generation → Query Validation
Iteratively add random triple patterns
• #results > 0?
• Need background knowledge: level of detail?
• Predicate combinations: how provided?
Example predicates: owl:sameAs, rdf:type, rdfs:label, foaf:knows
![Page 15: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/15.jpg)
SPLODGE Methodology
Query Parameterization → Query Generation → Query Validation
Example predicates: owl:sameAs, rdf:type, rdfs:label, foaf:knows
Linked Predicates
• (owl:sameAs → rdf:type)
• DBpedia → geonames (43, 58)
• freebase → DBpedia (86, 72)
• ...
Characteristic Sets*
• {rdfs:label, foaf:knows, …}
• DBpedia (322), rdfs:label (437), foaf:knows (322)
• ...
*[Neumann, Moerkotte, ICDE 2011]
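Both kinds of statistics can be gathered in a single pass over source-annotated triples. The sketch below is a simplified illustration in Python, not the SPLODGE implementation: the `Quad` layout, the function names, and the choice of what is counted (subjects per predicate set, and linking resources per predicate pair) are assumptions, and the exact numbers reported on the slide are not reproduced.

```python
from collections import Counter, defaultdict
from typing import Dict, FrozenSet, Iterable, Tuple

Quad = Tuple[str, str, str, str]  # (subject, predicate, object, source)


def characteristic_sets(quads: Iterable[Quad]) -> Dict[Tuple[str, FrozenSet[str]], int]:
    """Count, per source, how many subjects share the same set of predicates."""
    preds_by_subject = defaultdict(set)
    for s, p, _o, src in quads:
        preds_by_subject[(src, s)].add(p)
    counts: Counter = Counter()
    for (src, _s), preds in preds_by_subject.items():
        counts[(src, frozenset(preds))] += 1
    return dict(counts)


def linked_predicates(quads: Iterable[Quad]) -> Dict[Tuple[str, str, str, str], int]:
    """Count resources that are the object of p1 in src1 and the subject of p2 in src2."""
    incoming = defaultdict(set)  # resource -> {(p1, src1)}
    outgoing = defaultdict(set)  # resource -> {(p2, src2)}
    for s, p, o, src in quads:
        outgoing[s].add((p, src))
        incoming[o].add((p, src))
    counts: Counter = Counter()
    for resource in incoming.keys() & outgoing.keys():
        for p1, src1 in incoming[resource]:
            for p2, src2 in outgoing[resource]:
                counts[(p1, p2, src1, src2)] += 1
    return dict(counts)
```

Characteristic sets describe which predicates co-occur on one subject (useful for star-joins), while linked predicates describe which predicate pairs connect through a shared resource (useful for path-joins).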
![Page 16: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/16.jpg)
SPLODGE Methodology
Query Parameterization → Query Generation → Query Validation
Linked Predicates: (p₁ → p₂) ⊗ (p₂ → p₃) ⊗ (p₃ → pᵢ)
Characteristic Sets: {p₁, p₄}, {p₁, p₄, ...}
[Diagram: predicates p₁, p₂, p₃, p₄]
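Chaining linked predicates into a path can be pictured as a random walk over the predicate pairs. The following Python sketch is illustrative only: the `links` dictionary (predicate pair to link count) and the function name are hypothetical, and the uniform random choice is an assumption that ignores the counts.

```python
import random
from typing import Dict, List, Optional, Tuple


def random_predicate_path(
    links: Dict[Tuple[str, str], int],
    length: int,
    rng: Optional[random.Random] = None,
) -> Optional[List[str]]:
    """Pick `length` predicates so that each consecutive pair is a known link.

    Returns None if the walk reaches a predicate without a successor.
    """
    rng = rng or random.Random()
    if length < 2 or not links:
        return None
    successors: Dict[str, List[str]] = {}
    for p1, p2 in links:
        successors.setdefault(p1, []).append(p2)
    # Start at a random predicate that has at least one successor.
    path = [rng.choice(sorted(successors))]
    while len(path) < length:
        options = successors.get(path[-1])
        if not options:
            return None  # dead end: no linked predicate continues the path
        path.append(rng.choice(options))
    return path
```

A caller would typically retry on `None` and pass the resulting predicate list to a query builder.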
![Page 17: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/17.jpg)
SPLODGE Methodology
Query Parameterization → Query Generation → Query Validation
Verify generated queries (#results > 0)
How to evaluate? Compute confidence value
minimum join selectivity > ε
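A threshold on the minimum join selectivity can be checked before a query is ever executed. The sketch below assumes the textbook definition of join selectivity, |p1 ⋈ p2| / (|p1| · |p2|); whether SPLODGE computes its confidence value this way is not stated on the slides, and all names here are hypothetical.

```python
from typing import Dict, List, Tuple


def join_selectivity(link_count: int, card_p1: int, card_p2: int) -> float:
    """Textbook join selectivity: joined pairs divided by the cross product size."""
    if card_p1 == 0 or card_p2 == 0:
        return 0.0
    return link_count / (card_p1 * card_p2)


def passes_threshold(
    path: List[str],
    links: Dict[Tuple[str, str], int],
    cardinality: Dict[str, int],
    epsilon: float,
) -> bool:
    """True if every join along the predicate path has selectivity above epsilon."""
    selectivities = [
        join_selectivity(
            links.get((a, b), 0),
            cardinality.get(a, 0),
            cardinality.get(b, 0),
        )
        for a, b in zip(path, path[1:])
    ]
    # A path with fewer than two predicates has no join to check.
    return bool(selectivities) and min(selectivities) > epsilon
```

Raising `epsilon` discards more candidate paths up front in exchange for a higher chance that the surviving queries return results.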
![Page 18: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/18.jpg)
Overview
Benchmark Idea, Methodology, Evaluation
![Page 19: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/19.jpg)
Evaluation Objective
• Verify generation of valid queries (#results > 0)
• Compare variations of query generation algorithms
Metrics:
• #queries with non-empty results
• #results per query
Variants compared:
• Baseline: "random" predicate
• SPLODGE lite: background knowledge
• SPLODGE: background knowledge + minimum join selectivity (> 10⁻⁴ / 10⁻³ / 10⁻²)
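The two metrics are straightforward to compute once each query in a batch has been executed. A minimal Python sketch, assuming the per-query result counts are already available as a list; the function name and dictionary keys are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def batch_metrics(result_counts: List[int]) -> Dict[str, float]:
    """Summarize one batch: total queries, non-empty queries, mean results per query."""
    non_empty = [c for c in result_counts if c > 0]
    return {
        "queries": len(result_counts),
        "non_empty": len(non_empty),
        "mean_results": mean(result_counts) if result_counts else 0.0,
    }
```

Comparing these numbers per variant and per number of joined triple patterns gives the kind of breakdown the evaluation plots report.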
![Page 20: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/20.jpg)
Evaluation Setup
Real Linked Data: Billion Triple Challenge Dataset
Random queries:
• Path-joins across data sources
• 3-6 patterns, bound predicates
• 100 queries per batch
Triple Store: RDF3X
Example query:
SELECT * WHERE {
  ?var1 <http://dbpedia.org/property/description> ?var2 .
  ?var2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?var3 .
  ?var3 <http://www.w3.org/2002/07/owl#disjointWith> ?var4 .
  ?var4 <http://www.w3.org/2002/07/owl#disjointWith> ?var5 .
  ?var5 <http://semantic-mediawiki.org/swivt/1.0#wikiPageModificationDate> ?var6
}
![Page 21: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/21.jpg)
Evaluation Results
[Plot: #queries (y-axis) by number of joined triple patterns (x-axis)]
![Page 22: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/22.jpg)
Evaluation Results
[Plot: #results (y-axis) by number of joined triple patterns (x-axis)]
![Page 23: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/23.jpg)
Estimated vs. actual results size
[Plot: actual result size (y-axis) vs. estimated result size (x-axis)]
![Page 24: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/24.jpg)
Predicate Occurrence in Queries
![Page 25: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/25.jpg)
Conclusion
SPLODGE provides
• Flexible query characterization + parameterization
• Methodology for Systematic & Scalable Query Generation
• Toolset as Open Source (http://code.google.com/p/splodge/)
Future Work:
• Create a LOD Federation Benchmark
• Interactive SPARQL query construction
Questions?
![Page 26: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/26.jpg)
SPLODGE Evaluation Setup
BTC 2011 dataset in RDF3X
• pure triples, no context
• 160 GB repository file (14 h loading, 200 GB tmp mem)
![Page 27: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022062312/55506639b4c905ae3f8b5646/html5/thumbnails/27.jpg)
SPLODGE Pre-Processing for BTC data
• Identify common domains (e.g. jane08.lifejournal.com/home): 3.0 h, 17 GB gzip
• Replace quad context (reduce number of sources): 4.4 h
• Sort quads + remove duplicates: 8.5 h
• Build predicate/context dictionary: 1.0 h, <1 MB gzip
• Create resource in/out-link index: 9.7 h, 1.7 GB gzip
• Create linked predicate stats, compute characteristic sets: 1.6 h
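The context replacement step, which reduces the number of distinct sources, might look roughly like the Python sketch below. It assumes that a source is identified by the last two labels of the context URI's host name; the slides do not say how common domains are actually determined, and the function names are hypothetical.

```python
from urllib.parse import urlparse


def source_of(context_uri: str, levels: int = 2) -> str:
    """Map a quad context URI to a coarse source name.

    Assumption: the last `levels` host labels identify the source, so
    http://jane08.example.com/home and http://bob.example.com/ share one.
    """
    host = urlparse(context_uri).hostname or ""
    if not host:
        return context_uri  # not a parseable URI: leave the context unchanged
    return ".".join(host.split(".")[-levels:])


def replace_context(quad: tuple) -> tuple:
    """Return the quad with its context replaced by the coarse source name."""
    s, p, o, context = quad
    return (s, p, o, source_of(context))
```

The two-label heuristic misgroups hosts under multi-part suffixes such as `co.uk`, so a real pipeline would need a more careful rule.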