Metadata, Ontologies, and Provenance: Towards Extended Forms of Data Management
Beth Plale, Yogesh Simmhan
Computer Science Dept.
2005-03-07T18:00-05:00
Networks & Complex Systems Seminar Talk
The Data Deluge
Computational science is increasingly data intensive and getting more so. Why?
• More complex computations:
– Nested model runs
– Linked models
– Finer resolution
• More sources of data products
– Observational data products
• Streaming continuously from hundreds of sensor and network sources, scaling to thousands
• Large archives
– Annotations
– Model configuration parameters
– Output results
– Model data
– Statistical data (e.g., data mining)
Problem
Computational scientists are reaching the limit of their ability to manage the data products associated with their investigations
– A scientist can touch hundreds to thousands of data products in a single investigation
Seeds of solution in Internet?
• Internet has proven the utility of user-oriented view towards information space management
– Search, tag: browser, bookmarks
– Publish: blogs, web page tools
• But web not completely appropriate. Web is
– Single-writer, multiple reader, and
– Search-and-download.
• Apply concept of user-oriented view to managing data space
• Want ability to work locally.
– myLEAD: tool to help an investigator make sense of, and operate in, the vast information space that is computational science (e.g., mesoscale meteorology).
Personal metadata catalog requirements
Scientists have the following needs:
• Want to share products but retain control over what gets shared and with whom
– Data not made public until results appear in journal
• Want rich search criteria over vast data space but don’t necessarily want to write SQL queries
• Need help managing products generated over extended period of time (i.e., years)
• Want high level of reliability: data must always be accessible
Distributed and replicated personal metadata catalogues
[Figure: map of the LEAD testbed sites (IU, NCSA, UAHuntsville, Millersville, UCAR Unidata, OklaUniv) marking the master myLEAD catalog and the satellite myLEAD catalogs]
– Distribution: users partitioned over 6 sites in LEAD testbed
– Replication: master is replica site for all satellites
[Figure: User Bob’s workspace in 1998, in 2002, and in 2003. Each workspace lists experiments (Hurricane Ivan, SE quadrant; Voltice study 1998; Voltice study 2002; Voltice study 2003), each with a workflow template, a collection, and an input parameter. The workspaces are backed by the metadata catalog (table of collection, table of file, table of user) and by physical data storage, e.g. ftp://fileserver.org/file1998o768]
Ontologies aid in querying
[Figure: data products placed along three axes: structure, sharing, preservation. Labels: Depth 2: searchable; Depth 3: browsable; Flat structure; Does not know existence; Temporary data product; Non-published data products of other users; Non-preserved data product; Non-structured data products]
Ontologies provide
– Transparent structure
– Controlled vocabulary
LEAD (http://lead.ou.edu)
• Each year, mesoscale weather – floods, tornadoes, hail, strong winds, lightning, and winter storms – causes hundreds of deaths, routinely disrupts transportation and commerce, and results in annual economic losses > $13B.
Conventional Numerical Weather Prediction
• OBSERVATIONS
– Radar Data
– Mobile Mesonets
– Surface Observations
– Upper-Air Balloons
– Commercial Aircraft
– Geostationary and Polar Orbiting Satellite
– Wind Profilers
– GPS Satellites
• Analysis/Assimilation
– Quality Control
– Retrieval of Unobserved Quantities
– Creation of Gridded Fields
• Prediction
– PCs to Teraflop Systems
• Product Generation, Display, Dissemination
• End Users
– NWS
– Private Companies
– Students
The process is entirely serial and pre-scheduled: no response to weather!
The LEAD Vision: No Longer Serial or Static
[Figure: the stages of the conventional pipeline (Observations; Analysis/Assimilation; Prediction; Product Generation, Display, Dissemination; End Users) redrawn for the LEAD vision]
Objective discussed in this talk:
• Grow the value of the data holdings. Can do so through provenance:
[Figure: workflow, myLEAD, time; process, time, causality]
Exploiting Provenance Metadata
Contents of Talk
• Importance of Provenance
• Techniques for Provenance Management
• Data Quality and Provenance
• Conclusion
Data Provenance
• Derivation History of Data starting from its original sources
• Data: Files, tables, tuples, virtual collections
• Derivation: Process that transforms data – Script, Web service, Queries, Commands
• Lineage, Pedigree, Genealogy, Filiation, Parentage, …
A Simple Provenance DAG
[Figure: provenance DAG over data products D0 to D4 (with variants D0’ and D2’) and processes P1 to P3]
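The DAG above can be held in a very small data structure. The following is a minimal sketch, not the speakers' implementation. The edges are an assumption read off the quality equations on the "Weighted DAG?" slide (D3 feeds P2, which yields D1; D4 feeds P3, which yields D2; D1, D2, and D4 feed P1, which yields D0), and the names `DERIVED_FROM` and `lineage` are hypothetical.

```python
# Each node maps to the nodes it was directly derived from.
# The edges are assumed from the quality equations later in the talk.
DERIVED_FROM = {
    "D0": ["P1"],
    "P1": ["D1", "D2", "D4"],
    "D1": ["P2"],
    "D2": ["P3"],
    "P2": ["D3"],
    "P3": ["D4"],
    "D3": [],
    "D4": [],
}


def lineage(node, graph=DERIVED_FROM):
    """Return every ancestor (data or process) of `node`, nearest first."""
    seen = []
    queue = list(graph.get(node, []))
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue  # D4 is reachable twice; report it once
        seen.append(current)
        queue.extend(graph.get(current, []))
    return seen


if __name__ == "__main__":
    print(lineage("D0"))
    print(lineage("D1"))
```

Calling `lineage("D0")` walks back through P1 to the source datasets D3 and D4, which is the "derivation history starting from its original sources" in the definition of data provenance.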
Importance of Provenance
• Scientific Domain
– Publications are Provenance!
– Many scientific datasets available online
• Biology, Astronomy (SDSS)
– Standard metadata describes datasets in well-known repositories
– Lineage information usually missing, but vital
– GIS: Fitness for use
– Material Engineering: Pedigree, Auditing
– Biology: Citation & copyright, trust
– Astronomy: Context information
Importance of Provenance
• Business Domain
– Data warehousing: Integrated view over historical data from multiple sources
– Complex transformations to generate normalized view (ETL)
– Business analytics and intelligence (OLAP queries)
– Lineage allows “drill-down” from view to source table
– Allows tracing back sources of errors
– “View deletion” problem
[Figure: source tables T1, T2; Extract, Transform, Load process P1; queries Q2, Q3; view data V0, V1, V2]
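The "drill-down" idea can be illustrated with an in-memory SQLite database. This is a sketch under stated assumptions: the table `sales`, the view `sales_by_region`, and the helper `drill_down` are hypothetical, and the lineage query is hand-written for this one group-by view rather than derived automatically. For an aggregate view, the lineage of a view row is the set of source rows in its group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO sales (region, amount) VALUES
        ('east', 100.0), ('east', 250.0), ('west', 75.0);
    -- The "view data": one aggregated row per region.
    CREATE VIEW sales_by_region AS
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
    """
)


def drill_down(region):
    """Return the source rows that contributed to one row of the view."""
    rows = conn.execute(
        "SELECT id, region, amount FROM sales WHERE region = ? ORDER BY id",
        (region,),
    )
    return rows.fetchall()


# One row of the view, and the source rows behind it.
view_row = conn.execute(
    "SELECT region, total FROM sales_by_region WHERE region = 'east'"
).fetchone()
source_rows = drill_down("east")
```

If the total for a region looks wrong, `drill_down` gives the rows to inspect, which is the error-tracing use named in the list above.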
Application of Provenance
• Data Quality
– Evaluate quality of data
– Trust in the source of data
– Use provenance and metadata information to estimate data quality for a user
– Assertions and Signatures for provenance guarantee
• Audit Trail
– Error detection
– Usage log
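One way to picture a signature over provenance is a keyed hash of the provenance record. The sketch below is an illustration only: the talk does not say which mechanism it has in mind, the record layout and the names `sign_record` and `verify_record` are hypothetical, and HMAC is a shared-key scheme rather than a public-key signature.

```python
import hashlib
import hmac
import json

SECRET_KEY = b"demo-key-not-for-production"  # hypothetical shared secret


def sign_record(record, key=SECRET_KEY):
    """Return a hex digest over a canonical JSON encoding of the record."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_record(record, signature, key=SECRET_KEY):
    """Check that the record has not changed since it was signed."""
    return hmac.compare_digest(sign_record(record, key), signature)


record = {"output": "D0", "process": "P1", "inputs": ["D1", "D2", "D4"]}
signature = sign_record(record)

# Any later change to the recorded lineage invalidates the signature.
tampered = dict(record, process="P9")
```

Here `verify_record(record, signature)` holds, while `verify_record(tampered, signature)` does not.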
Application of Provenance
• Replication Recipe
– Provenance can be recipe for generating a dataset
– Repeat to verify/compare
– Recreate/replicate
– Partial updates
• Attribution
– Copyright, citation, check data users
• Informational
– Discover datasets
– Browse provenance
Subject of Provenance
• What is provenance about?
• Granularity
– Attribute, tables, files, data collections
– Fine-grained vs. Coarse-grained
– Trade-off with cost of collecting, storing, querying
• Data vs. Process Provenance
– Provenance can be a graph of data & processes
– Which of them is provenance focused upon?
– Hybrid where all grouped together
Process vs. Data Oriented
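[Figure: the provenance DAG (data products D0 to D4, D0’, D2’; processes P1 to P3) drawn twice, once for the process-oriented view and once for the data-oriented view]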
Data Processing Architectures
• Service Oriented Architecture
– Grid & Web services
– Workflow & Service invocations
– Data as parameters, references
• Databases
– Update/View Queries, Stored Procedure Calls
– Views, Tables, tuples, attributes
• Scripting, Command-line, etc.
Scheme for Representing Provenance
• Scheme for representing provenance
– Annotations vs. Inversion
• Annotation
– Annotate data with ancestral data & the steps used to derive it, e.g. a DAG
– Annotation requires more storage; “Eager”
– Annotation can be as rich as user decides
• Inversion
– Store function (query) used to generate data and invert it
– Not all functions are invertible; auxiliary data required; JIT computation; query optimization
– Minimal information provided (“Where”, “Why”)
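The contrast between the two schemes can be shown on a toy transformation. This is a sketch only: the functions `double` and `inverse_double` and the record layout are hypothetical, chosen because doubling is trivially invertible while summation is not.

```python
def double(x):
    return 2 * x


def inverse_double(y):
    return y // 2


sources = [3, 5, 8]

# Annotation ("eager"): record each output's ancestor when it is produced.
# Costs extra storage, but lineage is available without recomputation.
annotated = [
    {"value": double(x), "derived_from": x, "step": "double"} for x in sources
]

# Inversion ("lazy"): keep only the outputs and the function, and compute
# the ancestor on demand by inverting the function.
outputs = [double(x) for x in sources]


def source_of(value):
    return inverse_double(value)


# A non-invertible function needs auxiliary data: the total alone cannot
# say which inputs produced it.
total = sum(sources)
```

Both approaches recover `[3, 5, 8]` for the doubled values. For `total`, inversion has nothing to work with unless the inputs (or enough auxiliary data to reconstruct them) are stored alongside.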
Syntactic vs. Semantic Representation of Provenance
• Syntactic Structure
– XML for Annotations
– Implementation-specific for Inversion
• Semantic Knowledge
– Semantic language used to define lineage metadata
• RDF, OWL
– Advantages
• Provides Context
• Enhances searches
• Lineage proofs
– Ontologies used as a framework for semantic knowledge
– Community effort needed!
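A syntactic XML annotation might look like the following. This is a sketch using Python's standard `xml.etree.ElementTree`. The element and attribute names (`dataProduct`, `derivedBy`, `input`, `ref`) are hypothetical and do not follow any schema named in the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical element and attribute names; no standard schema is implied.
product = ET.Element("dataProduct", id="D0")
derivation = ET.SubElement(product, "derivedBy", process="P1")
for input_id in ("D1", "D2", "D4"):
    ET.SubElement(derivation, "input", ref=input_id)

xml_text = ET.tostring(product, encoding="unicode")

# Reading it back: the structure is recoverable, but nothing in the XML
# says what "derivedBy" means. That is the gap a semantic layer fills.
parsed = ET.fromstring(xml_text)
inputs = [node.get("ref") for node in parsed.iter("input")]
```

The annotation round-trips cleanly, yet two groups choosing different tag names could not search each other's lineage without an agreed vocabulary, which is where the ontology and community-effort points come in.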
Provenance Storage
• Stored with or separate from data?
– Integrity, accessibility
• Maintenance
– Mutability, versioning
– Who is responsible: data creator or central?
• Scalability
– # of datasets, depth of lineage, granularity, geographical distribution, # of users
– Inversion vs. Annotation; Distributed vs. Centralized
• Overhead
– Collection & storage
– Automation
Provenance Dissemination
• Browsing Provenance as a DAG
– Go back and forward in lineage through GUI
• Query based on lineage
– By source data, or generating process
– Enhanced by semantic information
– Drill down during data mining
• Verify how data was created by reenactment or present proof statements
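Queries "by source data, or generating process" can be sketched over a flat list of provenance records. The record layout and the function names `produced_by` and `derived_from` are hypothetical, and the records reuse the node names of the talk's example DAG under the assumption that D3 and D4 are the source datasets.

```python
# Hypothetical provenance records: output, generating process, inputs.
RECORDS = [
    {"output": "D1", "process": "P2", "inputs": ["D3"]},
    {"output": "D2", "process": "P3", "inputs": ["D4"]},
    {"output": "D0", "process": "P1", "inputs": ["D1", "D2", "D4"]},
]


def produced_by(process):
    """Datasets generated directly by the given process."""
    return sorted(r["output"] for r in RECORDS if r["process"] == process)


def derived_from(source):
    """Datasets that depend on `source`, directly or transitively."""
    found = set()
    frontier = [source]
    while frontier:
        current = frontier.pop()
        for r in RECORDS:
            if current in r["inputs"] and r["output"] not in found:
                found.add(r["output"])
                frontier.append(r["output"])
    return sorted(found)
```

`derived_from("D4")` answers the forward question (what would be affected if D4 turned out to be bad), while `produced_by("P1")` answers the process question.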
Taxonomy in Brief
• Application of Provenance
– Data quality
– Audit trail
– Attribution
– Replication Recipe
– Informational
• Subject of Provenance
– Data vs. Process
– Granularity
• Representation of Provenance
– Annotation vs. Inversion
– Contents
– Syntactic vs. Semantic
• Provenance Storage
– Scalability
– Overhead
• Provenance Dissemination
Data Quality for Scientific Data
• Fitness for use
• Subjective & Objective Parameters
– believability, reputation, reliability
– precision, timeliness, accuracy
• Intrinsic Quality of data vs. Quality of data service
– Correctness, consistency
– accessibility, throughput, availability
• Good quality for one application may not be good for another (user driven)
Estimating Data Quality from Provenance
• Hypothesis: For derived datasets, quality depends not just on the dataset but also on its provenance: ancestral processes and data
• Quality of a dataset could be a function of:
– Attributes of dataset
– Attributes of generating process
– Ancestral Datasets used to derive this dataset
– And so on recursively …
Weighted DAG?
[Figure: the provenance DAG (data products D0 to D4, D0’, D2’; processes P1 to P3) annotated with quality equations]
D0_q = f(D0, P1_q)
P1_q = f(P1, D1_q, D2_q, D4_q)
D1_q = f(D1, P2_q)
D2_q = f(D2, P3_q)
P2_q = f(P2, D3_q)
P3_q = f(P3, D4_q)
D3_q = f(D3)
D4_q = f(D4)
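The recursion in these equations can be made concrete with one possible choice of f. This is a sketch, not the speakers' metric: the talk leaves f open, so the rule used here (a node's own score multiplied by the lowest quality among its ancestors) and all the numeric scores are assumptions, as is the treatment of D3 and D4 as source datasets with no ancestors.

```python
# Hypothetical intrinsic scores in [0, 1] for each dataset and process.
OWN_SCORE = {
    "D0": 1.0, "D1": 0.9, "D2": 1.0, "D3": 0.8, "D4": 0.5,
    "P1": 1.0, "P2": 1.0, "P3": 0.9,
}

# Direct ancestors, following the equations on the slide.
ANCESTORS = {
    "D0": ["P1"],
    "P1": ["D1", "D2", "D4"],
    "D1": ["P2"],
    "D2": ["P3"],
    "P2": ["D3"],
    "P3": ["D4"],
    "D3": [],
    "D4": [],
}


def quality(node, cache=None):
    """One possible f: own score times the weakest ancestor's quality."""
    if cache is None:
        cache = {}
    if node not in cache:
        inherited = [quality(parent, cache) for parent in ANCESTORS[node]]
        # Source nodes have no ancestors, so they keep their own score.
        cache[node] = OWN_SCORE[node] * min(inherited, default=1.0)
    return cache[node]
```

With these scores the low-quality source D4 caps the result for D0 at about 0.45. A rule of this shape can never raise quality above that of an ancestor, so it does not capture processes that improve on their inputs.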
Challenges for Quality Metrics
• Some processes may produce data of better quality than their input datasets
• Subsetting, aggregation of data may change overall quality estimate
• Quality of transformation may be parameter dependent
• Multiple user profiles for different applications
• Missing lineage information can short-circuit measurement
Uses of Data Quality Measurement
• Compare and rank datasets uniformly
– Google Personalized
• Reduce search space to datasets matching user quality requirement
• Build community-wide quality feedback mechanism
– Leverage knowledge of domain expert
– Promote publication of better quality data
– Amazon reviews?
Research Questions
• What are the metrics for estimating the quality of data using provenance?
• How do we optimize user-centric searches based on quality?
• How can we recover information from incomplete lineage?
Thank you!
Questions | Comments