UTPB: A Benchmark for Scientific Workflow Provenance Storage and Querying Systems
Artem Chebotko
Joint work with
E. De Hoyos, C. Gomez, A. Kashlev, X. Lian, and C. Reilly
Department of Computer Science
University of Texas - Pan American
6th IEEE International Workshop on Scientific Workflows, June 24, 2012
Provenance in eScience
Metadata that captures history of an experiment
Problem diagnosis
Result interpretation
Experiment reproducibility
Scientific Workflow Community
Provenance Challenges
2006: understanding and sharing information about provenance representations and capabilities
2006: interoperability of different provenance
2009: evaluating various aspects of OPM
2010: showcase OPM in the context of novel applications
Open Provenance Model
W3C Provenance Working Group
UTPB – University of Texas Provenance Benchmark
SWFMS and Provenance
Taverna, Kepler, View, VisTrails, Pegasus, Swift, Galaxy, Triana, OPMProv, Karma, RDFProv, etc.
Support provenance collection
Use proprietary or third-party systems to manage provenance
Differ in provenance models, provenance vocabularies, inference support, and query languages.
Provenance Management Requirements
Non-functional
Data storage and querying efficiency and scalability
Inference soundness and completeness
Functional
Support of a particular provenance model, provenance vocabulary, query type, inference feature, visualization and analysis
No standard way to evaluate provenance systems with respect to these requirements
Provenance System Benchmarking Challenges
Well-documented and easy-to-understand datasets
Provenance data in a range of sizes
Provenance data with predefined inferred results that are known to be correct and complete
Test queries
Performance metrics
Result interpretation
Existing empirical studies of provenance systems use ad-hoc benchmarks or benchmarks developed in other research domains (see the paper for details)
Our Contributions
University of Texas Provenance Benchmark (UTPB)
http://faculty.utpa.edu/chebotkoa/utpb/
Focus on scalability and inference
Flexible data generator
27 provenance templates
3 virtual workflows
3 workflow execution scenarios
3 provenance vocabularies
27 test queries in 11 categories
5 performance metrics
Talk Outline
University of Texas Provenance Benchmark
UTPB Architecture
Provenance Templates
Provenance Generation
UTPB Queries
Performance Metrics
Interpretation of Benchmark Results
Summary and Future Work
UTPB Architecture
Provenance Templates
A provenance template is a document that serializes provenance of one workflow execution according to a particular provenance model and a provenance vocabulary.
Provenance templates make the benchmark extensible and thus adaptable to the changing requirements of the field.
UTPB currently supports:
1 provenance model (OPM)
3 virtual workflows
3 provenance vocabularies (OPMV, OPMO, OPMX)
3 workflow execution scenarios
1 x 3 x 3 x 3 = 27 provenance templates
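The count of 27 templates follows from taking every combination of model, workflow, vocabulary, and execution scenario. A minimal Python sketch of that enumeration is shown here; the workflow, vocabulary, and scenario names come from the slides, while the tuple layout and the function name `template_keys` are illustrative assumptions rather than part of UTPB itself.

```python
from itertools import product

# Dimensions named in the talk.
MODELS = ["OPM"]
WORKFLOWS = ["Database Experiment", "Jeans Manufacturing", "French Press Coffee"]
VOCABULARIES = ["OPMV", "OPMO", "OPMX"]
SCENARIOS = [
    "successful execution",
    "incomplete execution with an error",
    "successful execution with materialized provenance inferences",
]


def template_keys():
    """Return one (model, workflow, vocabulary, scenario) tuple per template.

    Hypothetical helper: UTPB ships templates as documents; this only
    enumerates the combinations that identify them.
    """
    return list(product(MODELS, WORKFLOWS, VOCABULARIES, SCENARIOS))


templates = template_keys()
print(len(templates))  # 1 x 3 x 3 x 3 = 27
```

Adding a second provenance model to `MODELS` would double the number of combinations, which is the sense in which templates make the benchmark extensible.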
Virtual Workflow 1: Database Experiment
Processes: 7, Artifacts: 14, Accounts: 2, Agents: 1
Virtual Workflow 2: Jeans Manufacturing
Processes: 13, Artifacts: 18, Accounts: 3, Agents: 2
Several processes use and generate the same artifacts and are “executed” in parallel
Virtual Workflow 3: French Press Coffee
Processes: 15, Artifacts: 15, Accounts: 4, Agents: 0
Several branches with multiple processes are “executed” in parallel
Several processes trigger each other without the record of using or generating artifacts
Provenance Vocabularies
Almost every existing scientific workflow management system defines its own proprietary model for provenance.
Each model is serialized in some format, such as RDF, XML, or relational data, according to one or more predefined vocabularies or schemas.
Open Provenance Model (OPM) – a layer of interoperability
OPM Vocabulary
OPM Ontology
OPM XML Schema
Workflow Execution Scenarios
Successful execution
Incomplete execution with an error
Successful execution with materialized provenance inferences
Provenance Generation
OPMV
# Named graph: http://cs.panam.edu/utpb#opmGraph_C0_T0
@prefix opmv: <http://purl.org/net/opmv/ns#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix utpb: <http://cs.panam.edu/utpb#> .
utpb:account_black_C0_T0 rdf:type <http://www.w3.org/2004/03/trix/rdfg-1/Graph> .
utpb:cuttingMachine_C0_T0 rdf:type opmv:Artifact .
utpb:denim_C0_T0 rdfs:label "blue" .
utpb:andrey_C0_T0 rdf:type opmv:Agent .
utpb:cutDenim_C0_T0 opmv:used utpb:cuttingMachine_C0_T0, utpb:cuttingPattern_C0_T0, utpb:denim_C0_T0 .
utpb:denimParts_C0_T0 opmv:wasGeneratedBy utpb:cutDenim_C0_T0 .
# Default graph
<http://cs.panam.edu/utpb#opmGraph_C0_T0> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .
OPMO
# Named graph: http://cs.panam.edu/utpb#opmGraph_C0_T0
@prefix opmo: <http://openprovenance.org/model/opmo#> .
@prefix opmv: <http://purl.org/net/opmv/ns#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix utpb: <http://cs.panam.edu/utpb#> .
utpb:account_black_C0_T0 rdf:type opmo:Account .
utpb:cuttingMachine_C0_T0 rdf:type opmv:Artifact .
utpb:propertyDenim_C0_T0 opmo:key utpb:keyDenimType_C0_T0 ;
    opmo:value "blue" .
utpb:andrey_C0_T0 rdf:type opmv:Agent .
utpb:used1_C0_T0 rdf:type opmo:Used ;
    opmo:effect utpb:cutDenim_C0_T0 ;
    opmo:cause utpb:cuttingMachine_C0_T0 ;
    opmo:role utpb:roleMachine_C0_T0 ;
    opmo:pname utpb:_used1 ;
    opmo:account utpb:account_black_C0_T0 .
utpb:wgb1_C0_T0 rdf:type opmo:WasGeneratedBy ;
    opmo:effect utpb:cutDenim_C0_T0 ;
    opmo:cause utpb:denimParts_C0_T0 ;
    opmo:role utpb:roleDenim_C0_T0 ;
    opmo:pname utpb:_wgb1 ;
    opmo:account utpb:account_black_C0_T0 .
# Default graph
<http://cs.panam.edu/utpb#opmGraph_C0_T0> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .
OPMX
<utpb xmlns="http://openprovenance.org/model/opmx#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <dictionary>
    <opmGraph id="opmGraph_C0_T0">
  </dictionary>
  <opmGraph id="opmGraph_C0_T0">
    <accounts>
      <account id="account_black"/>
    </accounts>
    <artifacts>
      <artifact id="cuttingMachine">
        <account ref="account_black"/>
        <annotation>
          <property key="value"> <value>laser</value></property>
          <property key="label"> <value>Cutting machine</value></property>
        </annotation>
      </artifact>
    </artifacts>
    <agents>
      <agent id="andrey"><account ref="account_black"/></agent>
    </agents>
    <dependencies>
      <used id="used1">
        <effect ref="cutDenim"/>
        <role id="roleMachine1" value="machine"/>
        <cause ref="cuttingMachine"/>
        <account ref="account_black"/>
      </used>
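The generated identifiers in the serializations above carry the suffix `_C0_T0`. The slides do not say what the two counters stand for or how the generator works internally, so the following Python sketch is only one plausible reading: a template is instantiated many times, and each copy gets a distinct suffix so that identifiers never collide across graphs. The `{sfx}` placeholder syntax, the function names, and the treatment of both counters as plain integers are all assumptions made for illustration.

```python
# Hypothetical three-triple template; "{sfx}" marks where a per-graph suffix goes.
TEMPLATE = """\
utpb:cuttingMachine{sfx} rdf:type opmv:Artifact .
utpb:cutDenim{sfx} opmv:used utpb:cuttingMachine{sfx} .
utpb:denimParts{sfx} opmv:wasGeneratedBy utpb:cutDenim{sfx} .
"""


def instantiate(template, c, t):
    """Stamp one copy of the template with the suffix _C<c>_T<t>."""
    return template.replace("{sfx}", f"_C{c}_T{t}")


def generate(template, c_count, t_count):
    """Yield c_count * t_count instantiated graphs, one per (c, t) pair."""
    for c in range(c_count):
        for t in range(t_count):
            yield instantiate(template, c, t)


graphs = list(generate(TEMPLATE, 2, 3))
print(len(graphs))             # 6 graphs
print(graphs[0].splitlines()[0])
```

Scaling the dataset then amounts to raising the two counts, which is consistent with the requirement for provenance data in a range of sizes.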
UTPB Queries
27 queries in 11 categories:
Graphs, Dependencies, Artifacts, Processes, Accounts, Agents, Roles, Values, Cross-Graph Queries, Inferences, Application-Specific
Sample query
Type: English
Find all artifact derivation dependencies in a particular provenance graph
Type: SPARQL, Format: OPMV
SELECT ?causeArtifact ?effectArtifact
FROM NAMED <http://cs.panam.edu/utpb#opmGraph_C0_T0>
WHERE { GRAPH utpb:opmGraph {
    ?effectArtifact opmv:wasDerivedFrom ?causeArtifact . } }
Type: SPARQL, Format: OPMO
SELECT ?causeArtifact ?effectArtifact
FROM NAMED <http://cs.panam.edu/utpb#opmGraph_C0_T0>
WHERE { GRAPH utpb:opmGraph {
    ?wdf rdf:type opmo:WasDerivedFrom .
    ?wdf opmo:cause ?causeArtifact .
    ?wdf opmo:effect ?effectArtifact . } }
Type: XQuery, Format: OPMX
declare default element namespace "http://openprovenance.org/model/opmx#";
<result> {
for $wdf in /utpb/opmGraph[@id="opmGraph_C0_T0"]/dependencies/wasDerivedFrom
return <wasDerivedFrom>{$wdf/effect}{$wdf/cause}</wasDerivedFrom>
} </result>
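For readers without an XQuery engine at hand, the same selection can be approximated with Python's standard `xml.etree.ElementTree`. This is a sketch, not part of UTPB: the `SAMPLE` document is a hand-written miniature that only mimics the OPMX element names used in the query, and the function name `derivations` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the XQuery above; the "opmx" prefix is local to this sketch.
NS = {"opmx": "http://openprovenance.org/model/opmx#"}

# Hand-written miniature document (an assumption, not UTPB generator output).
SAMPLE = """\
<utpb xmlns="http://openprovenance.org/model/opmx#">
  <opmGraph id="opmGraph_C0_T0">
    <dependencies>
      <used id="used1"><effect ref="cutDenim"/><cause ref="denim"/></used>
      <wasDerivedFrom><effect ref="denimParts"/><cause ref="denim"/></wasDerivedFrom>
      <wasDerivedFrom><effect ref="rawJeans"/><cause ref="denimParts"/></wasDerivedFrom>
    </dependencies>
  </opmGraph>
  <opmGraph id="opmGraph_C0_T1">
    <dependencies>
      <wasDerivedFrom><effect ref="other"/><cause ref="ignored"/></wasDerivedFrom>
    </dependencies>
  </opmGraph>
</utpb>
"""


def derivations(xml_text, graph_id):
    """Return (effect, cause) ref pairs of wasDerivedFrom edges in one graph."""
    root = ET.fromstring(xml_text)
    pairs = []
    for graph in root.findall("opmx:opmGraph", NS):
        if graph.get("id") != graph_id:
            continue  # restrict the search to the requested provenance graph
        for wdf in graph.findall("opmx:dependencies/opmx:wasDerivedFrom", NS):
            effect = wdf.find("opmx:effect", NS).get("ref")
            cause = wdf.find("opmx:cause", NS).get("ref")
            pairs.append((effect, cause))
    return pairs


print(derivations(SAMPLE, "opmGraph_C0_T0"))
```

Like the XQuery, the function ignores `used` edges and any graph whose `id` does not match.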
Result, OPMV
effectArtifact              causeArtifact
---------------------------------------------
utpb:denimParts_C0_T0       utpb:denim_C0_T0
utpb:rawJeans_C0_T0         utpb:denimParts_C0_T0
utpb:rawJeans_C0_T0         utpb:sewingThread_C0_T0
utpb:washedJeans_C0_T0      utpb:rawJeans_C0_T0
utpb:inspectedJeans_C0_T0   utpb:washedJeans_C0_T0
utpb:buttonedJeans_C0_T0    utpb:inspectedJeans_C0_T0
utpb:buttonedJeans_C0_T0    utpb:buttons_C0_T0
utpb:qualityJeans_C0_T0     utpb:buttonedJeans_C0_T0
utpb:jeans_C0_T0            utpb:qualityJeans_C0_T0
utpb:jeans_C0_T0            utpb:labels_C0_T0
utpb:inspectedJeans_C0_T0   utpb:washedJeans_C0_T0
utpb:qualityJeans_C0_T0     utpb:buttonedJeans_C0_T0
Result, OPMO
effectArtifact              causeArtifact
---------------------------------------------
utpb:denimParts_C0_T0       utpb:denim_C0_T0
utpb:rawJeans_C0_T0         utpb:denimParts_C0_T0
utpb:rawJeans_C0_T0         utpb:sewingThread_C0_T0
utpb:washedJeans_C0_T0      utpb:rawJeans_C0_T0
utpb:inspectedJeans_C0_T0   utpb:washedJeans_C0_T0
utpb:buttonedJeans_C0_T0    utpb:inspectedJeans_C0_T0
utpb:buttonedJeans_C0_T0    utpb:buttons_C0_T0
utpb:qualityJeans_C0_T0     utpb:buttonedJeans_C0_T0
utpb:jeans_C0_T0            utpb:qualityJeans_C0_T0
utpb:jeans_C0_T0            utpb:labels_C0_T0
utpb:inspectedJeans_C0_T0   utpb:washedJeans_C0_T0
utpb:qualityJeans_C0_T0     utpb:buttonedJeans_C0_T0
Result, OPMX
<result xmlns="http://openprovenance.org/model/opmx#" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <wasDerivedFrom> <effect ref="denimParts_C0_T0"/> <cause ref="denim_C0_T0"/> </wasDerivedFrom>
  <wasDerivedFrom> <effect ref="rawJeans_C0_T0"/> <cause ref="denimParts_C0_T0"/> </wasDerivedFrom>
  <wasDerivedFrom> <effect ref="rawJeans_C0_T0"/> <cause ref="sewingThread_C0_T0"/> </wasDerivedFrom>
  <wasDerivedFrom> <effect ref="washedJeans_C0_T0"/> <cause ref="rawJeans_C0_T0"/> </wasDerivedFrom>
  …
  <wasDerivedFrom> <effect ref="inspectedJeans_C0_T0"/> <cause ref="washedJeans_C0_T0"/> </wasDerivedFrom>
  <wasDerivedFrom> <effect ref="qualityJeans_C0_T0"/> <cause ref="buttonedJeans_C0_T0"/> </wasDerivedFrom>
</result>
Performance Metrics
Data loading time
Repository size
Query response time
Query soundness
Query completeness
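The slides name query soundness and query completeness as metrics but do not give formulas. One common reading, assumed here, treats soundness as the share of returned rows that are correct and completeness as the share of correct rows that were returned, measured against the predefined expected results. The function names, the set-based comparison (which ignores duplicate rows), and the sample rows are illustrative assumptions.

```python
def soundness(returned, expected):
    """Share of returned rows that appear in the expected result.

    Assumed definition; an empty answer is treated as trivially sound.
    """
    returned, expected = set(returned), set(expected)
    if not returned:
        return 1.0
    return len(returned & expected) / len(returned)


def completeness(returned, expected):
    """Share of expected rows that the system actually returned.

    Assumed definition; an empty expected result is trivially complete.
    """
    returned, expected = set(returned), set(expected)
    if not expected:
        return 1.0
    return len(returned & expected) / len(expected)


# Expected rows borrowed from the sample query result; the returned rows
# are made up to show one wrong row and two missing rows.
expected = [
    ("denimParts", "denim"),
    ("rawJeans", "denimParts"),
    ("rawJeans", "sewingThread"),
    ("washedJeans", "rawJeans"),
]
returned = [
    ("denimParts", "denim"),
    ("rawJeans", "denimParts"),
    ("jeans", "denim"),  # not in the expected result
]

print(soundness(returned, expected))     # 2 of 3 returned rows are correct
print(completeness(returned, expected))  # 2 of 4 expected rows were found
```

Both values equal 1.0 only when the returned and expected row sets coincide, which is the case a system with sound and complete inference should reach.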
Interpretation of Benchmark Results
Comparison across datasets of varying sizes
Comparison using a fixed dataset
Comparison across data serialized with different vocabularies (e.g., OPMV vs. OPMO)
Comparison across data managed using different technologies (e.g., RDF vs. XML)
Comparison across data of different provenance models (e.g., OPM vs. PROV-DM) – in the future
Summary and Future Work
UTPB: A first formal benchmark for scientific workflow provenance management systems
Extensible with new provenance templates
Flexible data generation
Large selection of test queries
Well-defined performance metrics
Future work
Benchmarking existing systems using UTPB
Extending UTPB (functional requirements, PROV-DM, new metrics – query expressiveness)
THANK YOU! Questions?
UTPB website: http://faculty.utpa.edu/chebotkoa/utpb/
My contact information: Artem Chebotko, Department of Computer Science, University of Texas – Pan American
http://www.cs.panam.edu/~artem