TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store
![Page 1: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/1.jpg)
23rd International World Wide Web Conference, 10th April 2014, Seoul, Korea
TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store
Marcin Wylot (1), Philippe Cudré-Mauroux (1), and Paul Groth (2)
(1) eXascale Infolab, University of Fribourg, Switzerland
(2) Web & Media Group, VU University Amsterdam, Netherlands
![Page 2: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/2.jpg)
Outline
➢ Motivation
➢ Provenance Polynomials
➢ System
➢ Results
![Page 3: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/3.jpg)
Data Provenance
“Provenance is information about
entities, activities, and people involved
in producing a piece of data or thing, which can be used to form
assessments about its quality, reliability or trustworthiness.”
How a query answer was derived: what data was
combined to produce the result.
![Page 4: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/4.jpg)
Data Integration
➢ Integrated and summarized data
➢ Trust, transparency, and cost
➢ Capability to pinpoint the exact source from which the result was selected
➢ Capability to trace back the complete list of sources and how they were combined to deliver a result
![Page 5: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/5.jpg)
Querying Distributed Data Sources
How exactly was the answer derived?
![Page 6: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/6.jpg)
Application: Post-query Calculations
➢ Scores or probabilities for query result
➢ Result ranking
➢ Compute trust
➢ Information quality based on used sources
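As a rough illustration of a post-query trust computation, the sketch below assumes that a result's lineage is available as a provenance polynomial (the ⊕/⊗ notation presented later in this talk) and that each source has a trust score. The scores, the `trust_score` helper, and the choice of max for ⊕ and min for ⊗ are assumptions made for this example, not something stated on the slides.

```python
# Hypothetical per-source trust scores in [0, 1]; values are illustrative only.
trust = {"l1": 0.9, "l2": 0.4, "l3": 0.7, "l4": 0.8, "l5": 0.6}

# A polynomial in "join of unions" form: (l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5).
polynomial = [["l1", "l2", "l3"], ["l4", "l5"]]


def trust_score(poly, scores):
    """Assumed semantics: ⊕ keeps the best alternative (max),
    ⊗ is limited by its weakest input (min)."""
    return min(max(scores[s] for s in union) for union in poly)


score = trust_score(polynomial, trust)
print(score)
```

Other interpretations (for example probabilities combined by products) would plug into the same structure by swapping the two aggregation functions.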
![Page 7: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/7.jpg)
Application: Query Execution
➢ Modify query strategies on the fly
➢ Restrict results to a certain subset of sources
➢ Restrict results w.r.t. queries over provenance
➢ Access control: only certain sources will appear
➢ Detect whether a result would remain valid if a certain source were removed
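The validity check in the last bullet can be sketched as a Boolean evaluation of a result's lineage. The example assumes the lineage is a join (⊗) of unions (⊕) of source identifiers, as in the polynomials shown later in this talk; the function name `still_valid` is hypothetical.

```python
def still_valid(poly, removed):
    """Return True if every joined factor keeps at least one source
    once the sources in `removed` are taken away."""
    removed = set(removed)
    return all(any(s not in removed for s in union) for union in poly)


# (l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ (l6 ⊕ l7) ⊗ (l8 ⊕ l9)
polynomial = [["l1", "l2", "l3"], ["l4", "l5"], ["l6", "l7"], ["l8", "l9"]]

survives_without_l1 = still_valid(polynomial, {"l1"})
survives_without_l4_l5 = still_valid(polynomial, {"l4", "l5"})
print(survives_without_l1, survives_without_l4_l5)
```

Removing `l1` leaves alternatives in the first factor, whereas removing both `l4` and `l5` empties the second factor, so the result can no longer be derived.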
![Page 8: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/8.jpg)
Provenance Polynomials
➢ Characterize the ways each source contributed to a result
➢ Pinpoint the exact source of each result
➢ Trace back the list of sources and the way they were combined to deliver a result
![Page 9: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/9.jpg)
Graph-based Query
select ?lat ?long ?g1 ?g2 ?g3 ?g4 where {
  graph ?g1 { ?a [] "Eiffel Tower" . }
  graph ?g2 { ?a inCountry FR . }
  graph ?g3 { ?a lat ?lat . }
  graph ?g4 { ?a long ?long . }
}
lat long l1 l2 l4 l4, lat long l1 l2 l4 l5, lat long l1 l2 l5 l4, lat long l1 l2 l5 l5,
lat long l1 l3 l4 l4, lat long l1 l3 l4 l5, lat long l1 l3 l5 l4, lat long l1 l3 l5 l5,
lat long l2 l2 l4 l4, lat long l2 l2 l4 l5, lat long l2 l2 l5 l4, lat long l2 l2 l5 l5,
lat long l2 l3 l4 l4, lat long l2 l3 l4 l5, lat long l2 l3 l5 l4, lat long l2 l3 l5 l5,
lat long l3 l2 l4 l4, lat long l3 l2 l4 l5, lat long l3 l2 l5 l4, lat long l3 l2 l5 l5,
lat long l3 l3 l4 l4, lat long l3 l3 l4 l5, lat long l3 l3 l5 l4, lat long l3 l3 l5 l5
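The 24 rows above are the Cartesian product of the graphs that can bind each of ?g1 to ?g4. The sketch below reproduces that enumeration; the per-pattern graph lists are read off the rows on this slide, and the variable names are chosen for this example.

```python
from itertools import product

# Candidate graphs for ?g1..?g4, read off the result rows on the slide.
graphs_per_pattern = [
    ["l1", "l2", "l3"],  # ?g1
    ["l2", "l3"],        # ?g2
    ["l4", "l5"],        # ?g3
    ["l4", "l5"],        # ?g4
]

# One output row per combination of graphs; the same (lat, long) pair repeats.
rows = [("lat", "long", *combo) for combo in product(*graphs_per_pattern)]
print(len(rows))
```

The same answer is repeated once per combination, which is the redundancy a single compact lineage expression avoids.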
![Page 10: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/10.jpg)
TripleProv Results
result: lat, long
provenance polynomial: (l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ (l6 ⊕ l7) ⊗ (l8 ⊕ l9)
![Page 11: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/11.jpg)
Polynomial Operators
➢ Union (⊕)
○ constraint or projection satisfied with multiple sources
l1 ⊕ l2 ⊕ l3
○ multiple entities satisfy a set of constraints or projections
➢ Join (⊗)
○ sources joined to handle a constraint or a projection
○ object-subject (OS) and object-object (OO) joins between a few sets of constraints
(l1 ⊕ l2) ⊗ (l3 ⊕ l4)
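One way to picture the two operators is as a small expression tree. The classes below (`Source`, `UnionOf`, `JoinOf`) are hypothetical names introduced for this sketch and are not taken from the TripleProv implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str

    def __str__(self):
        return self.name


@dataclass(frozen=True)
class UnionOf:
    """⊕: alternative sources that each satisfy the same constraint."""
    terms: tuple

    def __str__(self):
        return "(" + " ⊕ ".join(str(t) for t in self.terms) + ")"


@dataclass(frozen=True)
class JoinOf:
    """⊗: sources (or unions of sources) that had to be combined."""
    factors: tuple

    def __str__(self):
        return " ⊗ ".join(str(f) for f in self.factors)


l1, l2, l3, l4 = (Source(n) for n in ("l1", "l2", "l3", "l4"))
poly = JoinOf((UnionOf((l1, l2)), UnionOf((l3, l4))))
print(poly)
```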
![Page 12: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/12.jpg)
Example Polynomial
select ?lat ?long where {
  ?a [] "Eiffel Tower" .
  ?a inCountry FR .
  ?a lat ?lat .
  ?a long ?long .
}
(l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ (l6 ⊕ l7) ⊗ (l8 ⊕ l9)
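To make the shape of this polynomial concrete, the sketch below evaluates the four triple patterns over a tiny in-memory list of quads and emits one ⊕ group per pattern, joined by ⊗. The data, the pattern encoding, and the helper names are invented for illustration; this is a naive scan, not the query processing used by the system.

```python
# Toy quads (subject, predicate, object, source); all values are made up.
quads = [
    ("e1", "label", "Eiffel Tower", "l1"),
    ("e1", "name", "Eiffel Tower", "l2"),
    ("e1", "inCountry", "FR", "l4"),
    ("e1", "inCountry", "FR", "l5"),
    ("e1", "lat", "48.858", "l6"),
    ("e1", "long", "2.294", "l8"),
    ("e2", "label", "Big Ben", "l1"),
]

# Patterns as (predicate, object); None stands for [] or a variable.
patterns = [(None, "Eiffel Tower"), ("inCountry", "FR"), ("lat", None), ("long", None)]


def sources_for(subject, pattern):
    """Sources holding at least one triple of `subject` matching `pattern`."""
    pred, obj = pattern
    return sorted({src for s, p, o, src in quads
                   if s == subject and pred in (None, p) and obj in (None, o)})


def polynomial_for(subject):
    """Join (⊗) of per-pattern unions (⊕), or None if a pattern is unmatched."""
    factors = [sources_for(subject, pat) for pat in patterns]
    if not all(factors):
        return None
    return " ⊗ ".join("(" + " ⊕ ".join(f) + ")" for f in factors)


print(polynomial_for("e1"))
```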
![Page 13: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/13.jpg)
Example Polynomial
select ?l ?long ?lat where {
?p name "Krebs, Emil" .
?p deathPlace ?l .
?c [] ?l .
?c featureClass P .
?c inCountry DE .
?c long ?long .
?c lat ?lat .
}
[(l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5)] ⊗
[(l6 ⊕ l7) ⊗ (l8) ⊗ (l9 ⊕ l10) ⊗ (l11 ⊕ l12) ⊗ (l13)]
![Page 14: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/14.jpg)
Granularity Levels
➢ source-level: sources of the triples
➢ triple-level: all pieces of data used to answer the query
(l1 ⊕ l2) ⊗ (l3 ⊕ l4)
![Page 15: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/15.jpg)
System Architecture
![Page 16: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/16.jpg)
Native Data Model
➢ Semantically co-located data
➢ Template based molecules
![Page 17: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/17.jpg)
Various Physical Storage Models
Differences:
➢ ease of implementation
➢ memory consumption
➢ query execution
➢ interference with the original concept of molecule
1) SPOL 2) LSPO 3) SLPO 4) SPLO
![Page 18: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/18.jpg)
Annotated Triples
➢ Annotated provenance
➢ Quadruples
➢ Easy to implement
➢ Source data repeated
for each triple
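A minimal sketch of the annotated idea, assuming each triple simply carries its source as a fourth element. The `Quad` type, the sample values, and the lookup helper are hypothetical and only illustrate why this layout is easy to implement and why the source value is repeated.

```python
from collections import namedtuple

# Each stored element is a quadruple: the triple plus its source annotation.
Quad = namedtuple("Quad", "subject predicate object source")

store = [
    Quad("e1", "lat", "48.858", "l1"),
    Quad("e1", "long", "2.294", "l1"),
    Quad("e1", "inCountry", "FR", "l2"),
]


def objects_with_sources(subject, predicate):
    """Plain scan: return every matching object together with its source."""
    return [(q.object, q.source) for q in store
            if q.subject == subject and q.predicate == predicate]


# The source identifier is stored once per triple, hence the repetition.
stored_source_values = [q.source for q in store]
print(objects_with_sources("e1", "lat"), stored_source_values)
```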
![Page 19: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/19.jpg)
Co-located Elements
➢ Data grouped by source
➢ Physically co-located
➢ Avoids duplication of the
same source inside a
molecule
➢ Data about a given subject
co-located in one molecule
➢ More difficult to implement
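The co-located idea can be sketched as a nested mapping in which each subject's molecule groups its data by source, so a source identifier appears once per molecule rather than once per triple. The structure and names below are assumptions for illustration, not the system's physical layout.

```python
from collections import defaultdict

# Toy quads (subject, predicate, object, source); values are made up.
quads = [
    ("e1", "lat", "48.858", "l1"),
    ("e1", "long", "2.294", "l1"),
    ("e1", "inCountry", "FR", "l2"),
]


def build_molecules(data):
    """Group data as subject -> source -> list of (predicate, object)."""
    molecules = defaultdict(lambda: defaultdict(list))
    for s, p, o, src in data:
        molecules[s][src].append((p, o))
    return molecules


molecules = build_molecules(quads)

# Filtering by source becomes a direct lookup inside the subject's molecule.
from_l1 = molecules["e1"]["l1"]
print(from_l1, len(molecules["e1"]))
```

In this layout, restricting a query to one source touches only that source's group, at the cost of a more involved structure to build and maintain.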
![Page 20: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/20.jpg)
Experiments
How expensive is it to trace
provenance?
What is the overhead on query
execution time?
![Page 21: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/21.jpg)
Datasets
➢ Two collections of RDF data gathered from the Web
○ Billion Triple Challenge (BTC): Crawled from the linked
open data cloud
○ Web Data Commons (WDC): RDFa, Microdata
extracted from common crawl
➢ Typical collections gathered from multiple sources
➢ sampled subsets of ~110 million triples each; ~25GB each
![Page 22: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/22.jpg)
Workloads
➢ 8 queries defined for BTC
○ T. Neumann and G. Weikum. Scalable join processing on very large RDF graphs. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pages 627–640. ACM, 2009.
➢ Two additional queries with UNION and OPTIONAL
clauses
➢ 7 new queries for WDC
http://exascale.info/tripleprov
![Page 23: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/23.jpg)
Results
Overhead of tracking provenance compared to
vanilla version of the system for BTC dataset
Chart series: source-level co-located, source-level annotated, triple-level co-located, triple-level annotated
![Page 24: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/24.jpg)
Conclusions
➢ provenance overhead is considerable but acceptable, on average about 60-70%
➢ the most suitable storage model depends on data and workload characteristics
➢ annotated: more appropriate for heterogeneous datasets and workloads retrieving provenance
➢ co-located: more appropriate for homogeneous datasets and workloads filtering by source
![Page 25: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/25.jpg)
Future Work
➢ Distributed version
➢ Dynamic storage model
➢ Adaptive query execution strategies
➢ PROV output
➢ Queries over provenance
![Page 26: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/26.jpg)
Summary
➢ TripleProv: an efficient triplestore tracking provenance
➢ Two storage models
➢ Fine-grained multilevel provenance tracing
➢ Formal provenance polynomials
➢ Experimental evaluation
http://exascale.info/tripleprov
![Page 27: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/27.jpg)
Loading & Memory
Billion Triple Challenge
Web Data Commons
![Page 28: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/28.jpg)
Results
Overhead of tracking provenance compared to
vanilla version of the system for WDC dataset
Chart series: source-level SLPO, source-level SPOL, triple-level SLPO, triple-level SPOL
![Page 29: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store](https://reader030.fdocuments.us/reader030/viewer/2022020217/53fb77f98d7f729c2e8b584f/html5/thumbnails/29.jpg)
Polynomials: multiple records
[(l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ (l6 ⊕ l7) ⊗ (l8 ⊕ l9)]
⊕
[(l5 ⊕ l7) ⊗ (l4) ⊗ (l13 ⊕ l17) ⊗ (l28)]
⊕
[(l4) ⊗ (l1 ⊕ l2) ⊗ (l3 ⊕ l7) ⊗ (l8 ⊕ l9 ⊕ l4)]
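When several records contribute to an answer, their polynomials are combined with ⊕. The sketch below encodes the three records shown on this slide as nested lists and derives the complete list of sources involved; the encoding and the helper names `render` and `sources_used` are assumptions made for this example.

```python
# Each record is a join (⊗) of unions (⊕); records are combined with ⊕.
records = [
    [["l1", "l2", "l3"], ["l4", "l5"], ["l6", "l7"], ["l8", "l9"]],
    [["l5", "l7"], ["l4"], ["l13", "l17"], ["l28"]],
    [["l4"], ["l1", "l2"], ["l3", "l7"], ["l8", "l9", "l4"]],
]


def render(recs):
    """Render the multi-record polynomial as a single-line string."""
    return " ⊕ ".join(
        "[" + " ⊗ ".join("(" + " ⊕ ".join(u) + ")" for u in rec) + "]"
        for rec in recs
    )


def sources_used(recs):
    """Distinct sources across all records, ordered by their numeric suffix."""
    names = {s for rec in recs for union in rec for s in union}
    return sorted(names, key=lambda s: int(s[1:]))


print(render(records))
print(sources_used(records))
```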