Standard Provenance Reporting and Scientific Software … · 2016. 2. 14. · Catherine Wise,...
Catherine Wise, Nicholas J Car, Ryan Fraser and Geoff Squire
Data61 and LAND & WATER
Standard Provenance Reporting and Scientific Software Management in Virtual Labs
Outline
What are VLs?
What is VHIRL?
What is provenance?
How does VHIRL manage provenance (or not)?
How do we represent VHIRL’s actions as standardised provenance?
What work, other than representation, is needed for provenance?
What benefits do we get from this work?
What are VLs?
From https://nectar.org.au/virtual-laboratories-1, they are:
data repositories and computational tools and streamlining research workflows
What is VHIRL?
• Virtual Hazards Impact & Risk Laboratory (VHIRL) is a scientific workflow portal
• Gives researchers access to cloud computing for natural hazards research
• data from a variety of sources
• uses cloud computing resources
• currently has tools for earthquakes, tsunamis & tropical cyclones in the Asia-Pacific region
[Figure: Components of the Virtual Lab: Virtual Hazard Impact & Risk Laboratory (VHIRL). Layer labels: Data Services, Processing Services, Compute Services, Enablers, Virtual Laboratories/Apps, Data Analytics. Component labels: Magnetics, Gravity, DEM, eScript, ANUGA, NCI Petascale, NCI Cloud, NeCTAR Cloud, Amazon Cloud, Desktop, Service Orchestration, Provenance Metadata, Auth., Coastal Inundation, Tsunami Inundation, Scenario, Cyclone Wind Path Calculation, Landsat, Bathymetry, Cyclone Wind Model, Surface Wave Propagation (earthquake), TCRM]
Connectivity via Provenance | Melanie Ayre | eResearch Australasia 2015, Brisbane
What is provenance?
From http://en.wikipedia.org/wiki/Provenance#Computer_Science:
“Computer science uses the term provenance to mean the lineage of data or processes, as per data provenance. However there is a field of informatics research within computer science called provenance that studies how provenance of data and processes should be characterised, stored and used. Semantic web standards bodies, such as the World Wide Web Consortium, ratified a standard for provenance representation in 2014, known as PROV.”
How do we represent VLs using standardised provenance?
VHIRL’s own data management
• Natively tracks ‘everything’ used for scenario (re)runs
• Is not a data store, software repo or records management system
• Externalises as much information management as possible
• Code is managed by the SSSC
Scientific Solutions Software Centre (SSSC)
• SSSC is a web-based system to manage code & dependencies
• Contains Problems & Solutions that define a workflow
• Solutions consist of a Toolbox
• Toolboxes are code wrapped in a Python script + a description of the required inputs
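As a rough illustration of the Toolbox idea described for the SSSC, the sketch below pairs a piece of code wrapped in a Python callable with a description of its required inputs. All names here (`Toolbox`, `InputSpec`, `scale_depth`, the example inputs) are hypothetical and are not the SSSC's actual data model or API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class InputSpec:
    """Description of one required input (hypothetical structure)."""
    name: str
    py_type: type
    description: str


@dataclass
class Toolbox:
    """Code wrapped in a Python callable plus a description of its inputs."""
    name: str
    script: Callable[..., Any]
    inputs: List[InputSpec] = field(default_factory=list)

    def run(self, **kwargs: Any) -> Any:
        # Check every declared input is present and of the declared type.
        for spec in self.inputs:
            if spec.name not in kwargs:
                raise ValueError(f"missing required input: {spec.name}")
            if not isinstance(kwargs[spec.name], spec.py_type):
                raise TypeError(f"{spec.name} must be {spec.py_type.__name__}")
        return self.script(**kwargs)


def scale_depth(depth_m: float, factor: float) -> float:
    """Stand-in for real model code (an assumption for this example)."""
    return depth_m * factor


toolbox = Toolbox(
    name="example-toolbox",
    script=scale_depth,
    inputs=[
        InputSpec("depth_m", float, "water depth in metres"),
        InputSpec("factor", float, "scaling factor"),
    ],
)

result = toolbox.run(depth_m=2.0, factor=1.5)
```

Keeping the input description next to the code means a portal could, in principle, check a user's inputs before any cloud resources are started.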
[Figure: Class diagram for the SSSC]
• Beautiful, RESTful API; this example: http://vhirl-dev.csiro.au/scm/toolbox/2
• Solution maps to prov:Plan
• No RDF metadata, yet!
Mapping VHIRL to PROV 1
[Diagram: Input Data → Process → Output Data]
Mapping VHIRL to PROV 2
[Diagram: Code, Config and Input Data → Process → Output Data]
“Ontology Design Pattern”
Mapping VHIRL to PROV 3
[Diagram: Code, Config and Input Data → Process → Output Data, annotated with “Who/which system”, “Who” and “used”, and with the PROV classes Entity, Activity and Agent]
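The mapping in these diagrams can be sketched in a few lines of Python. The example below is only an illustration: it writes N-Triples by hand using the PROV-O terms `prov:Entity`, `prov:Activity`, `prov:Agent`, `prov:used`, `prov:wasGeneratedBy` and `prov:wasAssociatedWith`, and every URI under `http://example.org/vl/` is a made-up placeholder, not a VHIRL identifier.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/vl/"  # hypothetical namespace for this sketch


def triple(s, p, o):
    """Format one N-Triples statement whose three terms are all URIs."""
    return f"<{s}> <{p}> <{o}> ."


def map_run_to_prov(run_id, inputs, outputs, agent_id):
    """Return N-Triples lines describing one run with PROV-O terms."""
    activity = BASE + "run/" + run_id
    agent = BASE + "agent/" + agent_id
    lines = [
        triple(activity, RDF_TYPE, PROV + "Activity"),
        triple(agent, RDF_TYPE, PROV + "Agent"),
        triple(activity, PROV + "wasAssociatedWith", agent),
    ]
    # Code, config and input data are all modelled as Entities the run used.
    for name in inputs:
        entity = BASE + "entity/" + name
        lines.append(triple(entity, RDF_TYPE, PROV + "Entity"))
        lines.append(triple(activity, PROV + "used", entity))
    # Output data are Entities generated by the run.
    for name in outputs:
        entity = BASE + "entity/" + name
        lines.append(triple(entity, RDF_TYPE, PROV + "Entity"))
        lines.append(triple(entity, PROV + "wasGeneratedBy", activity))
    return lines


prov_lines = map_run_to_prov(
    run_id="456",
    inputs=["code", "config", "input-data"],
    outputs=["output-data"],
    agent_id="user-1",
)
print("\n".join(prov_lines))
```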
Mapping VHIRL to PROMS
[Diagram labels: Report N; Entity, Activity, Agent; Reporting System X; R.S. Report]
VHIRL provenance into PROMS Server
[Diagram labels: Reports N and M; Entity, Activity, Agent; Reporting System X; Reporting System Y; R.S. Report; Organisational Provenance Store; reported and stored]
Modelling VHIRL’s data types
[Diagram labels: VL Run; user; The VL; Report N; output data; managed data; web service data; user supplied data; managed code; user supplied code]
PROMS Reporting Toolkits
VHIRL’s native PROV output
RDF file
What work, other than representation, is needed for provenance?
Provenance effort (step) pyramid
• Data Management
• Establishing Reporting
• Continued Reporting
Data Management
• managed data, web service data, user supplied data, managed code, user supplied code, output data: all Entities need to be ID’d (via URI) and persisted
• VL Run: each VL run is reported as an Activity within a Report
• The VL: each VL instance has/needs an ID and is modelled as a Reporting System
• user: each VL user is known by their login (account) details. Modelled as a Reporter
• Report N: each VL Report is ID’d and persisted in the VL Provenance Store
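A minimal sketch of these identification rules follows, assuming a made-up base URI and a simple `kind/id` path layout. Neither is taken from VHIRL or PROMS; the role names simply echo the slide.

```python
import uuid

BASE_URI = "http://example.org/vl"  # hypothetical base URI

# Role each kind of VL item plays when reported, following the slide.
ROLES = {
    "dataset": "Entity",
    "code": "Entity",
    "run": "Activity",
    "vl": "ReportingSystem",
    "user": "Reporter",
    "report": "Report",
}


def mint(kind, local_id=None):
    """Give an item a URI and record the role it plays in a Report."""
    if kind not in ROLES:
        raise ValueError(f"unknown kind: {kind}")
    # Fall back to a random identifier when the caller has none.
    local_id = local_id or uuid.uuid4().hex
    return {"uri": f"{BASE_URI}/{kind}/{local_id}", "role": ROLES[kind]}


run = mint("run", "456")
dataset = mint("dataset")
```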
Data Management
[Diagram: managed data, web service data, user supplied data, managed code, user supplied code and output data, annotated with the notes below]
• VL ID’d and persisted
• cited using PROMS-O format
• soon to be VL ID’d and persisted, with minimal metadata recorded too
• SSSC ID’d and persisted
• perhaps SSSC ID’d and persisted, perhaps VL managed
• soon to be VL ID’d and persisted, if required, perhaps with time limits
Virtual Labs Service Citation Example
[{ref}] {service title}
{service endpoint URI}
{query}
{time queried}
{cached copy ID}
[1] “Subset of elevation”
http://pid.csiro.au/service/anuga-thredds
“bussleton.nc?var=elevation&spatial=bb&north=-33.06495205829679&south=-33.551573283840156&west=114.84967874597227&east=115.70661233971667&temporal=all&time_start=&time_end=&horizStride”
“2014-12-15T13:15:11”
http://pid.csiro.au/dataset/abcd1234
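The citation template above could be filled in programmatically along the following lines. This is a sketch only: the `ServiceCitation` class and its field names are invented for illustration, the query is shortened, and straight quotes are used in place of the slide's typographic ones.

```python
from dataclasses import dataclass


@dataclass
class ServiceCitation:
    """One web service citation, mirroring the five-line template."""
    ref: int
    title: str
    endpoint_uri: str
    query: str
    time_queried: str
    cached_copy_id: str

    def format(self) -> str:
        # One template field per output line, in the template's order.
        return "\n".join([
            f'[{self.ref}] "{self.title}"',
            self.endpoint_uri,
            f'"{self.query}"',
            f'"{self.time_queried}"',
            self.cached_copy_id,
        ])


citation = ServiceCitation(
    ref=1,
    title="Subset of elevation",
    endpoint_uri="http://pid.csiro.au/service/anuga-thredds",
    query="bussleton.nc?var=elevation&spatial=bb",  # shortened for the example
    time_queried="2014-12-15T13:15:11",
    cached_copy_id="http://pid.csiro.au/dataset/abcd1234",
)
text = citation.format()
print(text)
```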
Establishing Reporting
[Diagram labels: VL Report; Provenance Reporting Toolkit (C#, Java, Python); Organisational Provenance Store; querying & redelivery]
Establishing Reporting - Reporting Toolkits
[Diagram labels: managed data “Grid X”; web service data “Service Y”; VL Run “Run 456”; Agent N; Report N “Report for Run 456”]
e1 = Entity(title='Grid X',
            description='netCDF grid of property X',
            uri='http://eg-vl.org.au/dataset/123',
            downloadURL='http://eg-vl.org.au/dataset/123?_view=dl',
            wasAttributedTo='http://data.ga.gov.au/id/person/john.doe')
Establishing Reporting - Reporting Toolkits
e2 = ServiceEntity(
    title='Subset of elevation',
    description='5km solar radiation interpolated raster service',
    serviceBaseUri='http://siss2.anu.edu.au/anuga/busselton.nc',
    query='var=elevation&spatial=bb&north=-33.06495205&south=-33.551573283&west=114.84967874&east=115.70661233&temporal=all&time_start=&time_end=&horizStride',
    queriedAtTime='2014-12-15T13:15:11',
    cachedCopy='http://bom.gov.au/dataset/678')
Establishing Reporting - Reporting Toolkits
a0 = Activity(
    title='Run 456',
    description='Upper bound run, full Grid X use',
    wasAssociatedWith={VL added automatically},
    startedAtTime={VL added automatically},
    endedAtTime={VL added automatically},
    usedEntities=[e1, e2],
    generatedEntities={VL added automatically})
Establishing Reporting - Reporting Toolkits
r0 = Report(
    title='Report for Run 456',
    description='Upper bound run, full Grid X use',
    startingActivity={VL added automatically},
    endingActivity={VL added automatically})
rs0 = ReportSender('http://provstore.vl.org.au/report/')
rs0.send(r0)
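To make the “{VL added automatically}” placeholders more concrete, here is a self-contained sketch of how a VL might complete an Activity before reporting it. The classes below are simplified stand-ins written for this example, not the PROMS reporting toolkit's real classes. The agent URI, output dataset URI and timestamps are invented, and the report is serialised to JSON rather than sent to a provenance store.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Entity:
    title: str
    uri: str


@dataclass
class Activity:
    title: str
    usedEntities: List[Entity]
    generatedEntities: List[Entity] = field(default_factory=list)
    wasAssociatedWith: Optional[str] = None
    startedAtTime: Optional[str] = None
    endedAtTime: Optional[str] = None


@dataclass
class Report:
    title: str
    activity: Activity


def complete_activity(activity, agent_uri, started, ended, outputs):
    """Fill in the fields the slides mark as added automatically by the VL."""
    activity.wasAssociatedWith = agent_uri
    activity.startedAtTime = started.isoformat()
    activity.endedAtTime = ended.isoformat()
    activity.generatedEntities = list(outputs)
    return activity


e1 = Entity("Grid X", "http://eg-vl.org.au/dataset/123")
a0 = Activity(title="Run 456", usedEntities=[e1])
complete_activity(
    a0,
    agent_uri="http://example.org/vl/user/jdoe",  # hypothetical agent URI
    started=datetime(2014, 12, 15, 13, 15, 11, tzinfo=timezone.utc),
    ended=datetime(2014, 12, 15, 13, 20, 0, tzinfo=timezone.utc),
    outputs=[Entity("Run 456 output", "http://eg-vl.org.au/dataset/124")],
)
r0 = Report("Report for Run 456", a0)

# Serialise instead of sending, so the sketch needs no network access.
payload = json.dumps(asdict(r0), indent=2)
print(payload)
```

The point of the split is that the user only names the run and its inputs, while the VL supplies the agent, times and outputs it already knows.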
What do we get from this work?
Graph power!
[Diagram labels: Report N; Reporting System X; ...]
URI power!
[Diagram labels: Report N; Reporting System X; corporate staff DB; temp repo; public web service; DAP-style repo; PROMS instance]
Distributed graphs!
[Diagram labels: GA PROMS instance; VL PROMS instance; Uni Prov Store]
Distributed Querying via endpoint cache