Coding Provenance in Software and Matching Tools to Data

Coding Provenance in Software and Matching Tools to Data OPeNDAP Provenance Project And ESIP ToolMatch Project Patrick West, Tetherless World Constellation Rensselaer Polytechnic Institute


Transcript of Coding Provenance in Software and Matching Tools to Data

Page 1: Coding Provenance in Software and Matching Tools to Data

Coding Provenance in Software and Matching Tools to Data

OPeNDAP Provenance Project

And

ESIP ToolMatch Project

Patrick West, Tetherless World Constellation Rensselaer Polytechnic Institute

Page 2: Coding Provenance in Software and Matching Tools to Data

What is Provenance

• Provenance is information about entities, activities, and people involved in producing a piece of data or thing.

• In Data Science we’re interested in keeping track of, or being able to trace back, how a data product was generated and from what.

• For example, the Ecosystem Status Report has an interesting plot in one of its chapters that I’m interested in learning more about.


Page 3: Coding Provenance in Software and Matching Tools to Data

Generating a Plot


Page 4: Coding Provenance in Software and Matching Tools to Data

How did I get there?


Page 5: Coding Provenance in Software and Matching Tools to Data

I know how it was generated

• Because I’m the one who added the plot to the document

• I know how the plot was generated

• I wrote parts of the software in OPeNDAP Hyrax that’s doing the data access, manipulation, and transformation

• So I know that the plot is generated by accessing a set of data using OPeNDAP Hyrax, which generates a DAP DataDDS object by reading in a set of NetCDF files, constraining and projecting the data, running a server-side function or two, and doing an aggregation; that data product is then used to generate the plot.
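The chain described above starts from an OPeNDAP request URL. As a rough sketch (not taken from the slides), a DAP2-style URL with a projection and index ranges could be assembled as follows. The dataset URL, the variable names `sst` and `time`, and the index ranges are hypothetical placeholders.

```python
from urllib.parse import quote


def hyperslab(start: int, stop: int, stride: int = 1) -> str:
    """Format one DAP2 array index range as [start:stride:stop]."""
    return f"[{start}:{stride}:{stop}]"


def build_request_url(dataset_url: str, response: str, projections: dict) -> str:
    """Build <dataset>.<response>?<var><ranges>,<var><ranges>...

    `projections` maps a variable name to a list of (start, stop) or
    (start, stop, stride) tuples, one tuple per dimension.
    """
    parts = []
    for name, ranges in projections.items():
        parts.append(name + "".join(hyperslab(*r) for r in ranges))
    constraint = ",".join(parts)
    # Keep the constraint syntax characters readable; escape anything else.
    return f"{dataset_url}.{response}?{quote(constraint, safe='[]:,')}"


if __name__ == "__main__":
    # Hypothetical dataset and variables, for illustration only.
    print(build_request_url(
        "http://example.org/opendap/data/sample.nc",
        "ascii",
        {"sst": [(0, 9), (10, 20, 2)], "time": [(0, 0)]},
    ))
```

Everything after the `?` is the part that tells the server which variables to project and how to subset them, which is exactly the information a provenance record would want to capture.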


Page 6: Coding Provenance in Software and Matching Tools to Data

Generating a Plot

[Diagram: an OPeNDAP request URL is sent to OPeNDAP Hyrax, which reads in data and spits out data. An IPython Notebook, made up of cells, uses the data and generates the plot. Badda bing, badda boom.]

BUT I WANT TO KNOW MORE

Page 7: Coding Provenance in Software and Matching Tools to Data

Some information I WANT to know

• How was that plot generated?

• What software was used to generate the plot and any intermediary data?

• What data files were read in to generate the plot, what was done to the data, and by what?

• Where did those data files come from? What parameters are in there? What sensors measured those parameters? Tell me how the data were measured.


Page 8: Coding Provenance in Software and Matching Tools to Data

Generating a Plot

[Diagram: an OPeNDAP request URL is sent to OPeNDAP Hyrax, which reads in data and spits out data. An IPython Notebook, made up of cells, uses the data and generates the plot.]

Where did the data files come from?

Page 9: Coding Provenance in Software and Matching Tools to Data

Linked Data

• I am also interested in who develops and publishes the software, how it is licensed, and how I could use it.

• I’m interested in what IPython Notebooks are, what they can do, and whether I could use them for other projects.

• And I want to be able to let the “owner” of the data files know that I’ve used the results of an access in a publication, presentation, article, or whatever.


Page 10: Coding Provenance in Software and Matching Tools to Data

What the project focuses on

[Diagram: a request URL enters OPeNDAP Hyrax. Components shown: OLFS, BES, NetCDF, dap, ServerSideFunctions, aggregate, transform.]

Page 11: Coding Provenance in Software and Matching Tools to Data

W3C Prov


Page 12: Coding Provenance in Software and Matching Tools to Data

Prov-O

:dds_of_reading a prov:Entity ;
    dcterms:format opendap:DataDDS ;
    prov:wasGeneratedBy [
        a prov:Activity ;
        prov:used <http://test.opendap.org/dap/data/h5/monday.h5> ,
                  <http://test.opendap.org/dap/data/h5/tuesday.h5> ;
        prov:wasAssociatedWith <opendapi:software/hdf5_handler/2.1.1>
    ] .

<http://test.opendap.org/dap/data/h5/monday.h5>
    a vsto:Dataset, prov:Entity, toolmatch:DataCollection ;
    toolmatch:hasAccessURL <http://test.opendap.org/dap/data/h5/monday.h5> .

<http://test.opendap.org/dap/data/h5/tuesday.h5>
    a vsto:Dataset, prov:Entity, toolmatch:DataCollection ;
    toolmatch:hasAccessURL <http://test.opendap.org/dap/data/h5/tuesday.h5> .

Page 13: Coding Provenance in Software and Matching Tools to Data

Prov-O

:constrained_dds a prov:Entity ;
    dcterms:format opendap:DataDDS ;
    prov:wasGeneratedBy [
        a prov:Activity ;
        prov:used :dds_of_reading ;
        prov:wasAssociatedWith <opendapi:software/BES/3.12.0>
    ] .

:aggregated_dds a prov:Entity ;
    dcterms:format opendap:DataDDS ;
    prov:wasGeneratedBy [
        a prov:Activity ;
        prov:used :constrained_dds ;
        prov:wasAssociatedWith <opendapi:software/ncml_module/1.2.2>
    ] .

:result a foaf:Document ;
    nfo:fileName "thursday.h5" ;
    dcterms:format "netcdf" ;
    prov:wasGeneratedBy [
        a prov:Activity ;
        prov:used :aggregated_dds ;
        prov:wasAssociatedWith <opendapi:software/fileout_netcdf/1.2.1>
    ] .
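To show what these statements buy us, here is a minimal sketch of tracing a result back to its source files. It assumes the Prov-O statements have already been loaded into a plain Python dictionary; the dictionary layout and the function names are illustrative and are not part of the OPeNDAP project.

```python
# Each entry: generated entity -> (inputs it used, software associated with
# the generating activity). Names mirror the Prov-O snippets.
PROVENANCE = {
    ":result": ([":aggregated_dds"], "fileout_netcdf/1.2.1"),
    ":aggregated_dds": ([":constrained_dds"], "ncml_module/1.2.2"),
    ":constrained_dds": ([":dds_of_reading"], "BES/3.12.0"),
    ":dds_of_reading": (
        ["http://test.opendap.org/dap/data/h5/monday.h5",
         "http://test.opendap.org/dap/data/h5/tuesday.h5"],
        "hdf5_handler/2.1.1",
    ),
}


def trace(entity, provenance=PROVENANCE, depth=0):
    """Return indented lines describing how `entity` was produced."""
    if entity not in provenance:
        return ["  " * depth + f"{entity} (source data)"]
    used, agent = provenance[entity]
    lines = ["  " * depth + f"{entity} <- generated with {agent}"]
    for source in used:
        lines.extend(trace(source, provenance, depth + 1))
    return lines


def sources(entity, provenance=PROVENANCE):
    """Return the original inputs (entities with no recorded activity)."""
    if entity not in provenance:
        return [entity]
    used, _ = provenance[entity]
    found = []
    for source in used:
        found.extend(sources(source, provenance))
    return found


if __name__ == "__main__":
    print("\n".join(trace(":result")))
```

Calling `sources(":result")` answers the question "what data files were read in?", while `trace(":result")` also lists the software involved at each step.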

Page 14: Coding Provenance in Software and Matching Tools to Data

DOAP – Description of a Project


Page 15: Coding Provenance in Software and Matching Tools to Data

DOAP – Description of a Project

<http://opendap.tw.rpi.edu/instances/software/BES>
    a doap:Project, prov:Entity ;
    doap:name "OPeNDAP Back-End Server (BES)" ;
    doap:developer <http://tw.rpi.edu/instances/PatrickWest> ;
    doap:developer <http://tw.rpi.edu/instances/DanHalloway> ;
    doap:developer <http://tw.rpi.edu/instances/James_Gallagher> ;
    doap:developer <http://tw.rpi.edu/instances/NathanPotter> ;
    doap:homepage <http://opendap.org/download/hyrax?q=BES_software> ;
    doap:vendor <http://tw.rpi.edu/instances/OPeNDAP> ;
    doap:repository <http://opendap.tw.rpi.edu/instances/Repository> ;
    doap:bug-database <http://scm.opendap.org/trac/> ;
    doap:release <http://opendap.tw.rpi.edu/instances/software/BES/3.12.0> ;
    doap:description "BES is a high-performance back-end server software framework that allows data providers more flexibility in providing end users views of their data." ;
    doap:license <http://opendap.tw.rpi.edu/instances/License> .

<http://opendap.tw.rpi.edu/instances/software/BES/3.12.0>
    a doap:Version, prov:Entity ;
    prov:specializationOf <http://opendap.tw.rpi.edu/instances/software/BES> ;
    doap:name "BES-3.12.0" ;
    doap:revision "3.12.0" ;
    doap:download-page <http://opendap.org/download/hyrax/1.9> ;
    doap:repository <http://scm.opendap.org/svn/tags/bes/3.12.0> ;
    doap:license <http://opendap.tw.rpi.edu/instances/License> ;
    doap:created "2013-08-27" .

Page 16: Coding Provenance in Software and Matching Tools to Data

DOAP – Description of a Project

<http://opendap.tw.rpi.edu/instances/Repository>
    a doap:SVNRepository ;
    doap:location <http://scm.opendap.org/svn/> ;
    doap:browse <http://scm.opendap.org/svn/> .

<http://opendap.tw.rpi.edu/instances/License>
    dc:description "This software is distributed under the GNU Lesser General Public License <http://www.gnu.org/licenses/gpl.html>" ;
    doap:name "GNU LESSER GENERAL PUBLIC LICENSE" ;
    rdfs:seeAlso <http://www.gnu.org/licenses/gpl.html> .

# The hash in the IRI below is: HASH(config file, BES version that read it)
<http://opendap.tw.rpi.edu/id/opendap/D9IH6677D3I6HDIHD36IHDI7DH>
    a prov:Agent ;
    prov:wasDerivedFrom
        <http://opendap.tw.rpi.edu/instances/software/hdf5_handler/2.1.1> ,
        <http://opendap.tw.rpi.edu/instances/software/BES/3.12.0> ,
        <http://opendap.tw.rpi.edu/instances/software/ncml_module/1.2.2/> ,
        <http://opendap.tw.rpi.edu/instances/software/fileout_netcdf/1.2.1> ;
    prov:wasDerivedFrom :config_file_hash ;
    # b/c BES set it up:
    prov:wasAttributedTo <http://scm.opendap.org/svn/tags/bes/3.9.2> .
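The comment in the last snippet says the agent identifier is a hash of the configuration file and the BES version that read it. The slides do not say which hash function is used, so the following is only a sketch under an assumption: it uses SHA-256 from the Python standard library, and the function name and sample configuration text are made up for illustration.

```python
import hashlib

# IRI prefix as it appears in the slide.
BASE_IRI = "http://opendap.tw.rpi.edu/id/opendap/"


def configured_agent_iri(config_text: str, bes_version: str) -> str:
    """Derive a stable identifier for one configured server instance.

    ASSUMPTION: SHA-256 stands in for the unspecified HASH() in the slide.
    """
    digest = hashlib.sha256()
    digest.update(bes_version.encode("utf-8"))
    digest.update(b"\0")  # separator so ("ab", "c") differs from ("a", "bc")
    digest.update(config_text.encode("utf-8"))
    return BASE_IRI + digest.hexdigest().upper()


if __name__ == "__main__":
    # Hypothetical configuration contents.
    print(configured_agent_iri("example configuration contents", "3.12.0"))
```

The point of hashing both inputs is that the same software version run with a different configuration gets a different agent IRI, so provenance records can tell the two setups apart.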

Page 17: Coding Provenance in Software and Matching Tools to Data

What We’re Trying

• The BES loads shared modules at startup that handle specific tasks

• Our first attempt was to use something called a Reporter, which reports on the completion of a request, but that happens too late, after the fact.

• Our second thought was to have the modules themselves add provenance information on the fly, which to me is ideal but unrealistic.

• The probable implementation is to track provenance in the BES, the software framework that communicates with the modules.
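The third option can be illustrated with a toy model. This is not the BES API (the BES is not written in Python); it is a hypothetical sketch in which a framework object calls each module and writes the provenance record itself, so the modules need no changes. All class, function, and module names are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Module:
    """A pluggable unit of work that knows nothing about provenance."""
    name: str
    version: str
    run: Callable[[object], object]


@dataclass
class Framework:
    """Stands in for the framework that calls each module in turn."""
    records: List[dict] = field(default_factory=list)

    def call(self, module: Module, input_id: str, output_id: str, data):
        result = module.run(data)
        # The framework, not the module, writes the provenance record.
        self.records.append({
            "generated": output_id,
            "used": input_id,
            "wasAssociatedWith": f"{module.name}/{module.version}",
        })
        return result


def demo():
    """Run a two-step toy pipeline and return (data, provenance records)."""
    fw = Framework()
    read = Module("toy_reader", "0.1", lambda path: [1, 2, 3, 4])
    constrain = Module("toy_constrain", "0.1", lambda values: values[:2])
    data = fw.call(read, "file:example.h5", ":dds_of_reading", "example.h5")
    data = fw.call(constrain, ":dds_of_reading", ":constrained_dds", data)
    return data, fw.records


if __name__ == "__main__":
    for record in demo()[1]:
        print(record)
```

Because every module call passes through `Framework.call`, each step leaves one record with the same three fields that the Prov-O snippets use: what was generated, what was used, and which software was associated with the activity.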


Page 18: Coding Provenance in Software and Matching Tools to Data

What’s next

• Get more use cases about what types of information we want to collect

• Write the story about what we’re trying to do

• Come up with software use cases for the implementation

• Continue discussing provenance with the core OPeNDAP group

• Continue to work with the original Prov group (Tim, Jim, and Stephan) in discussions


Page 19: Coding Provenance in Software and Matching Tools to Data

Questions


Page 20: Coding Provenance in Software and Matching Tools to Data

ToolMatch Use Case

• “I need data for carbon dioxide (CO2) concentrations, a climate change indicator, for the summer of 2012, that can be accessed via OPeNDAP Hyrax and plotted as a time series.”

• “I need data with measurements of atmospheric aerosol optical depth sliced along latitude and longitude, returned as NetCDF data, and accessible in MATLAB.”

Page 21: Coding Provenance in Software and Matching Tools to Data

Using SADL


Page 22: Coding Provenance in Software and Matching Tools to Data

Inferencing

* Equivalent Class
    DataCollection <Aqua_AIRS_Level2_Plus_AMSU>
    and (isAccessedBy value OPeNDAP) or (hasDataStorageFormat value NetCDF)
    and (usesGridType value AuxiliaryLatLonGrid) or (usesGridType value RegularLatLonGrid)
    and usesConvention value ClimateForecast_CF
* Subclass Of
    mappedBy value IDV
    and mappedBy value McIDAS-V
    and mappedBy value Panoply

Inferred

Page 23: Coding Provenance in Software and Matching Tools to Data

Inferencing

* Equivalent Class
    DataCollection
    and (isAccessedBy value OPeNDAP) or (hasDataFormat value NetCDF)
    and usesConvention value CF1Convention
    and usesConvention value RegularLatLonGrid
* Subclass Of
    mappedBy value Ferret
    and mappedBy value GrADS

Inferred

Page 24: Coding Provenance in Software and Matching Tools to Data

Inferencing

* Equivalent Class
    DataCollection
    and (isAccessedBy value GrADSDataServer) or (isAccessedBy value Hyrax) or (isAccessedBy value ThreddsDataServer) or (isAccessedBy value erddap)
* Subclass Of
    isAccessedBy value OPeNDAP

Inferred
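To make the chaining concrete, here is a small sketch of how the rule on this page and the rule on page 22 combine. It is plain Python rather than SADL or OWL reasoning. It also assumes the page 22 conditions group as (OPeNDAP access or NetCDF storage) and (auxiliary or regular lat/lon grid) and CF convention, which the flattened slide text does not state explicitly.

```python
OPENDAP_SERVERS = {"GrADSDataServer", "Hyrax", "ThreddsDataServer", "erddap"}


def infer(facts):
    """Return `facts` plus everything the two rules add, as a new set.

    `facts` is a set of (property, value) pairs describing one data collection.
    """
    inferred = set(facts)

    # Page 24 rule: access through any of these servers implies OPeNDAP access.
    if any(("isAccessedBy", server) in inferred for server in OPENDAP_SERVERS):
        inferred.add(("isAccessedBy", "OPeNDAP"))

    # Page 22 rule (grouping of and/or is an assumption, see lead-in).
    access_ok = (("isAccessedBy", "OPeNDAP") in inferred
                 or ("hasDataStorageFormat", "NetCDF") in inferred)
    grid_ok = (("usesGridType", "AuxiliaryLatLonGrid") in inferred
               or ("usesGridType", "RegularLatLonGrid") in inferred)
    cf_ok = ("usesConvention", "ClimateForecast_CF") in inferred
    if access_ok and grid_ok and cf_ok:
        for tool in ("IDV", "McIDAS-V", "Panoply"):
            inferred.add(("mappedBy", tool))

    return inferred


def tools_for(facts):
    """Return the sorted tool names inferred for a data collection."""
    return sorted(value for prop, value in infer(facts) if prop == "mappedBy")


if __name__ == "__main__":
    collection = {
        ("isAccessedBy", "Hyrax"),
        ("usesGridType", "RegularLatLonGrid"),
        ("usesConvention", "ClimateForecast_CF"),
    }
    print(tools_for(collection))
```

A collection that only states it is served by Hyrax still ends up matched to tools, because the first rule supplies the OPeNDAP fact that the second rule needs.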

Page 25: Coding Provenance in Software and Matching Tools to Data

Resulting Query

The resulting query to find the set of tools available to visualize a data collection becomes very simple:

DESCRIBE ?tool
WHERE {
  <data_collection> toolmatch:visualizedBy ?tool .
  ?tool rdf:type toolmatch:Tool .
}
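As a hedged sketch of how such a query might be sent, the snippet below fills in the data collection IRI and builds a SPARQL protocol GET URL, with the query text passed in the `query` parameter. The collection IRI is a hypothetical placeholder, the endpoint is the ToolMatch endpoint listed in the references, and the `PREFIX` declarations for `toolmatch:` and `rdf:` are left out as in the slide, so they would need to be added (or predefined by the endpoint) before the query could run.

```python
from urllib.parse import urlencode

QUERY_TEMPLATE = """\
DESCRIBE ?tool
WHERE {{
  <{collection}> toolmatch:visualizedBy ?tool .
  ?tool rdf:type toolmatch:Tool .
}}"""


def describe_tools_query(collection_iri: str) -> str:
    """Fill the data collection IRI into the query from the slide."""
    return QUERY_TEMPLATE.format(collection=collection_iri)


def sparql_get_url(endpoint: str, query: str) -> str:
    """SPARQL protocol GET: the query text goes in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    # Hypothetical collection IRI, for illustration only.
    query = describe_tools_query("http://example.org/collection/1")
    print(sparql_get_url("http://toolmatch.tw.rpi.edu/sparql", query))
```

The snippet only builds the URL and does not contact the server, so it can be inspected or pasted into a browser without assuming the endpoint is still online.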

Page 26: Coding Provenance in Software and Matching Tools to Data

The Result

[Figure: the result for a data collection, with a Description section and a Tools section.]

Page 27: Coding Provenance in Software and Matching Tools to Data

Where are we and what’s next

• We’ve got part of the ontology done

• We’ve got stuff in the triple store

• We need to complete the dataset ontology piece

• We need to verify the ontology and rules

• We need crowd sourcing for more tools and information about tools

• Patrick needs to understand rules better


Page 28: Coding Provenance in Software and Matching Tools to Data

Questions


Page 29: Coding Provenance in Software and Matching Tools to Data

References

OPeNDAP Provenance Project

• Prov Overview - http://www.w3.org/TR/prov-overview/

• OPeNDAP Prov - https://github.com/tetherless-world/opendap/

• OPeNDAP LODSPeaKr - http://opendap.tw.rpi.edu/index.html

• OPeNDAP Endpoint - http://opendap.tw.rpi.edu/virtuoso/sparql

• OPeNDAP - http://opendap.org

ToolMatch Project

• ToolMatch - http://wiki.esipfed.org/index.php/ToolMatch

• ToolMatch Virtual Server - http://toolmatch.tw.rpi.edu/

• ToolMatch Schema - http://toolmatch.tw.rpi.edu/docs/index

• ToolMatch Endpoint - http://toolmatch.tw.rpi.edu/sparql