Data Provenance and Data Quality Inference
The University of Texas at Dallas
Computer Science
11/13/2006
Ping Mao
Jungin Kim

Contents
  Data Quality
    Overview
    Quality Inference
  Data Provenance
    Data Provenance Definitions
    Taxonomy of Provenance Techniques

Data Quality Overview
  What is data quality?
    Accuracy
    Timeliness
    Credibility (trustworthiness)
    Subjective: it depends on users and domains


Example
  A database collected over a period of time and by a variety of company departments

  Company Name   Address        Number of Employees
  A              20 Rode St.    3,000
  B              50 Main Av.    500


Questions:
  When was it created?
  Where did it come from?
  How and why was it obtained?

  Company Name   Address        Number of Employees
  A              20 Rode St.    3,000
  B              50 Main Av.    500

  Cell tags (date, source):
    Jan-12-00, by sales
    Feb-5-00, by ABC
    Oct-24-00, by acctig
    Oct-10-00, by EFG


How to store it?
  Annotations by tagging
  Provenance
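A minimal sketch of annotation by tagging, assuming each cell carries its value together with a (date, source) tag. The class name TaggedValue and its field names are hypothetical, and pairing the address of company A with the "sales" tag is an assumption made only for illustration; the slides do not say which tag belongs to which cell.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedValue:
    """A cell value plus a quality tag (hypothetical structure)."""
    value: str
    created: str  # when the value was recorded, kept as written on the slide
    source: str   # who or what supplied the value


# Assumed pairing of tag and cell, for illustration only.
address_a = TaggedValue(value="20 Rode St.", created="Jan-12-00", source="sales")


def describe(cell: TaggedValue) -> str:
    """Answer 'when was it created' and 'where did it come from' for one cell."""
    return f"{cell.value} (recorded {cell.created}, by {cell.source})"


print(describe(address_a))
```

Keeping the tag next to the value means the questions above can be answered per cell rather than per table.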

Data Quality Inference
  Next question: can we trust data sets or data sources?
  Answer: rank by quality the data sets generated from data sources


Motivation
  Data are:
    Distributed
    Erroneous
    Shared and integrated


Data source ranking
  1. Rank the data sets or sources in order of their accuracies
  2. Determine the top-k most accurate data sets or sources
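The two ranking tasks can be sketched as follows, assuming an accuracy estimate in [0, 1] is already available for every source. The function names and the scores are made up for illustration.

```python
def rank_sources(accuracy):
    """Return source names ordered from most to least accurate."""
    return sorted(accuracy, key=accuracy.get, reverse=True)


def top_k(accuracy, k):
    """Return the k most accurate sources."""
    return rank_sources(accuracy)[:k]


# Made-up accuracy estimates for three hypothetical sources.
scores = {"D1": 0.62, "D2": 0.91, "D3": 0.47}

print(rank_sources(scores))
print(top_k(scores, 2))
```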


Framework
  D: a set of data sources
  $T_i(k, v)$: table for a query Q, where k is the key and v is the value at time t
  $A_i \in [0, 1]$: accuracy of data source $D_i$
  $A_i < A_j$ if $D_i$ is less accurate than $D_j$


General Framework
  h(t): historical function, $0 \le h(t) \le 1$
    weighted sum of all accuracy estimates within the last w time indexes
  c(i,t): cohesion function
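The transcript does not preserve the formula that combines h(t) and c(i,t). The sketch below assumes a convex combination: h(t) weights a weighted average of the last w accuracy estimates, and 1 - h(t) weights the current cohesion value. Both the combination rule and the function names are assumptions, not the authors' definition.

```python
def weighted_history(past, weights):
    """Weighted average of the most recent accuracy estimates.

    `weights` has one entry per time index in the window (w = len(weights)),
    oldest first; it is assumed to be non-empty with a positive sum.
    """
    recent = past[-len(weights):]
    used = weights[-len(recent):]
    return sum(w * a for w, a in zip(used, recent)) / sum(used)


def update_accuracy(past, weights, h, cohesion):
    """Blend history and cohesion (assumed convex combination)."""
    if not 0.0 <= h <= 1.0:
        raise ValueError("h(t) must lie in [0, 1]")
    if not past:
        return cohesion  # no history yet, so rely on cohesion alone
    return h * weighted_history(past, weights) + (1.0 - h) * cohesion


# Window of w = 2 time indexes; the newer estimate counts twice as much.
estimate = update_accuracy([0.5, 0.7, 0.9], [1.0, 2.0], h=0.5, cohesion=0.6)
print(round(estimate, 4))
```

With h(t) = 0 the estimate depends only on current agreement; with h(t) = 1 it depends only on history.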


Cohesion function, c(i,t)
  Determines:
    the new accuracy estimate
    how well the data sets agree with one another
  f(i,t): dampening factor function
  a(i,j,t): agreement function


Dampening factor function, f(i,t)
  f(i,t) is a probability for each data source
  Similar to Google’s PageRank:
    high-quality sites receive a higher PageRank, which Google remembers each time it conducts a search
  Prevents the solution from being zero for all data sources
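A PageRank-style sketch of how a dampening factor keeps every score away from zero. The update rule is an assumption chosen for illustration, not the formula from the slides: each source's score is (1 - f) plus f times the accuracy-weighted average of its agreement with the other sources. The agreement matrix is made up.

```python
def cohesion(i, accuracy, agreement, f=0.5):
    """Assumed PageRank-like cohesion score for source i."""
    others = [j for j in range(len(accuracy)) if j != i]
    total = sum(accuracy[j] for j in others)
    if total == 0.0:
        support = 0.0
    else:
        support = sum(accuracy[j] * agreement[i][j] for j in others) / total
    # The (1 - f) term guarantees a non-zero score for every source.
    return (1.0 - f) + f * support


def estimate(agreement, f=0.5, rounds=20):
    """Repeat the update for a fixed number of rounds, starting from 1.0."""
    accuracy = [1.0] * len(agreement)
    for _ in range(rounds):
        accuracy = [cohesion(i, accuracy, agreement, f)
                    for i in range(len(agreement))]
    return accuracy


# Made-up agreement values: sources 0 and 1 agree, source 2 disagrees with both.
agreement = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
scores = estimate(agreement)
print([round(s, 3) for s in scores])
```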


Agreement function, a(i,j,t)
  tupleOverlap(i,j,t)
    Measures the proportion of tuples in approximate agreement
  cosineOverlap(i,j,t)
    Measures the complement of the cosine distance between two sets of data over the same key values
  eOverlap(i,j,t): Euclidean-based function
    Euclidean distance in n dimensions
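A sketch of the first two agreement measures, assuming each table is a mapping from key to a numeric value. The 5% relative tolerance used for "approximate agreement" is an assumption, as are the function names and the sample values.

```python
import math


def shared_keys(t_i, t_j):
    """Keys present in both tables."""
    return sorted(set(t_i) & set(t_j))


def tuple_overlap(t_i, t_j, rel_tol=0.05):
    """Proportion of shared keys whose values approximately agree."""
    keys = shared_keys(t_i, t_j)
    if not keys:
        return 0.0
    agree = sum(math.isclose(t_i[k], t_j[k], rel_tol=rel_tol) for k in keys)
    return agree / len(keys)


def cosine_overlap(t_i, t_j):
    """Complement of the cosine distance, i.e. cosine similarity on shared keys."""
    keys = shared_keys(t_i, t_j)
    u = [t_i[k] for k in keys]
    v = [t_j[k] for k in keys]
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    if norm == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / norm


# Two made-up sources reporting employee counts for companies A and B.
t1 = {"A": 3000, "B": 500}
t2 = {"A": 3000, "B": 700}

print(tuple_overlap(t1, t2))
print(round(cosine_overlap(t1, t2), 4))
```

The two measures can disagree: here only half the tuples match within tolerance, yet the value vectors point in nearly the same direction.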


Agreement function, a(i,j,t)
  Using Euclidean distance:
    eOverlap(i,j,t) = 1 - eDist(V(i,j,t), V(j,i,t))
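A sketch of eOverlap under one assumption the slides do not state: for the result to stay in [0, 1], eDist must be a normalized distance. Here the value vectors are assumed to have components in [0, 1], and the Euclidean distance is divided by sqrt(n), its largest possible value.

```python
import math


def e_dist(u, v):
    """Euclidean distance scaled to [0, 1]; assumes components lie in [0, 1]."""
    if len(u) != len(v):
        raise ValueError("vectors must cover the same keys")
    if not u:
        return 0.0
    raw = math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
    return raw / math.sqrt(len(u))


def e_overlap(u, v):
    """eOverlap = 1 - eDist, so identical vectors score 1."""
    return 1.0 - e_dist(u, v)


print(e_overlap([0.2, 0.4], [0.2, 0.4]))
print(round(e_overlap([0.2, 0.4], [0.3, 0.8]), 4))
```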


Experimental results
  100 data sources
  20 different tuples (key, value)
  Randomly assigned
  Dampening function f(i,t) = 0.5


Experimental results (contd…)

Data Provenance
  Data Provenance Definitions
  Taxonomy of Provenance Techniques
    Application of Provenance
    Subject of Provenance
    Representation of Provenance
    Provenance Storage
    Provenance Dissemination
  Examples of Data Provenance Techniques

What is Data Provenance
  Data provenance:
    In the database systems domain, data provenance, a kind of metadata sometimes called “lineage” or “pedigree”, is the description of the origins of a piece of data and the process by which it arrived in a database.
    Data provenance is information that helps determine the derivation history of a data product, starting from its original sources.
  E-Science:
    E-science is computationally intensive science. It is also the type of science that is carried out in highly distributed network environments, or science that uses immense data sets that require grid computing. Examples include social simulations, particle physics, earth sciences, and bioinformatics.

Why Data Provenance is important
  When you find some data on the Web, do you have any information about how it got there? It is quite possible that it was copied from somewhere else on the Web, which, in turn, may have also been copied; and in this process it may have been transformed and edited.
  If you are a scientist, or any kind of scholar, you would like to have confidence in the accuracy and timeliness of the data that you are working with. Medical research requires tight controls on the quality of data because mistakes can harm people’s health. Data quality in bioinformatics may not be as immediate, but it is no less important.
  Among the sciences, the field of Molecular Biology is possibly one of the most sophisticated consumers of modern database technology and has generated a wealth of new database issues. A substantial fraction of research in genetics is conducted in "dry" laboratories using in silico experiments, that is, analysis of data in the available databases.

Taxonomy of Provenance Techniques
  This paper categorizes provenance systems based on:
    Why they record provenance: Application of Provenance
    What they describe: Subject of Provenance
    How they represent provenance: Provenance Representation
    How they store provenance: Storing Provenance
    Ways to disseminate provenance: Provenance Dissemination

Taxonomy of Provenance

Application of Provenance
  Provenance systems can support a number of uses. Several applications of provenance information are as follows:
    Data Quality: Lineage can be used to estimate data quality and data reliability based on the source data and transformations. It can also provide proof statements on data derivation.
    Audit Trail: Provenance can be used to trace the audit trail of data, determine resource usage, and detect errors in data generation.
    Replication Recipes: Detailed provenance information can allow repetition of data derivation, help maintain its currency, and be a recipe for replication.
    Attribution: Pedigree can establish the copyright and ownership of data, enable its citation, and determine liability in case of erroneous data.
    Informational: A generic use of lineage is to query based on lineage metadata for data discovery. It can also be browsed to provide a context to interpret data.

Subject of Provenance
  Provenance Models:
    Data-oriented model
      An explicit model, where lineage metadata is specifically gathered about the data product. One can delineate the provenance metadata about the data product from metadata concerning other resources.
    Process-oriented model
      An indirect model, where the deriving processes are the primary entities for which provenance is collected, and the data provenance is determined by inspecting the input and output data products of these processes.
  Provenance Granularity (Coarse Grained/Fine Grained)
    The usefulness of provenance and the cost of collecting and storing provenance in a certain domain are linked to the granularity at which it is collected.
    Granularity ranges from provenance on attributes and tuples in a database to provenance for collections of files, say, generated by an ensemble experiment run.

Representation of Provenance
  Two major approaches:
    Annotations
      Metadata comprising the derivation history of a data product is collected as annotations and descriptions about source data and processes.
      Advantage: annotations are richer and, in addition to the derivation history, often include the parameters passed to the derivation processes, the versions of the workflows that will enable reproduction of the data, or even related publication references.
    Inversion
      Uses the property by which some derivations can be inverted to find the input data supplied to them to derive the output data. Examples include queries and user-defined functions in databases that can be inverted automatically or by explicit functions.
      Advantage: more compact.
      Limitation: the information it provides is sparse and limited to the derivation history of the data.
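A minimal sketch of the inversion approach, assuming the derivation is a group-by sum. Nothing is stored per output value; the inputs that produced an output are recovered by running an inverse function over the source rows. The row layout, function names, and numbers are made up for illustration.

```python
def derive(rows):
    """Forward derivation: total employees per company."""
    totals = {}
    for company, site, employees in rows:
        totals[company] = totals.get(company, 0) + employees
    return totals


def invert(rows, company):
    """Inverse: the input rows that contributed to one output value."""
    return [row for row in rows if row[0] == company]


# Made-up source rows: (company, site, employees).
rows = [("A", "HQ", 2000), ("A", "Plant", 1000), ("B", "HQ", 500)]

print(derive(rows))
print(invert(rows, "A"))
```

This shows both sides of the trade-off: no extra storage is needed, but the inverse says nothing about parameters, workflow versions, or who ran the derivation.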

Representation of Provenance (contd…)
  Many current provenance systems that use annotations have adopted XML for representing the lineage information. Some also capture semantic information within provenance using domain ontologies in languages like RDF and OWL. Ontologies precisely express the concepts and relationships used in the provenance and provide good contextual information.
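A sketch of an annotation serialized as XML with Python's standard xml.etree.ElementTree module. The element and attribute names (provenance, derivedBy, parameter, input) are hypothetical and do not follow any particular provenance system's schema.

```python
import xml.etree.ElementTree as ET


def annotation_xml(product, process, parameters, inputs):
    """Build a small XML annotation describing how `product` was derived."""
    root = ET.Element("provenance", {"product": product})
    step = ET.SubElement(root, "derivedBy", {"process": process})
    for name, value in parameters.items():
        ET.SubElement(step, "parameter", {"name": name}).text = str(value)
    for source in inputs:
        ET.SubElement(step, "input", {"ref": source})
    return ET.tostring(root, encoding="unicode")


# Made-up file names and parameter.
xml_text = annotation_xml("summary.csv", "aggregate", {"window": 7}, ["raw.csv"])
print(xml_text)
```

Recording the parameters alongside the inputs is what makes the annotation richer than the bare derivation history.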

Provenance Storage
  Scalability
    Provenance information can grow to be larger than the data it describes if the data is fine-grained and the provenance information is rich. So the manner in which the provenance metadata is stored is important to its scalability.
    The inversion method is arguably more scalable than using annotations. However, one can reduce storage needs in the annotation method by recording just the immediately preceding transformation step that creates the data and recursively inspecting the provenance information of those ancestors for the complete derivation history.
  Overhead
    Less frequently used provenance information can be archived to reduce storage overhead, or a demand-supply model based on usefulness can retain provenance for data that is frequently used.
    If provenance depends on users manually adding annotations instead of automatically collecting it, the burden on the user may prevent complete provenance from being recorded and made available in a machine-accessible form that has semantic value.
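The storage reduction described under Scalability can be sketched as follows: each data product records only its immediate parents, and the complete derivation history is recovered by following those records. The traversal uses an explicit stack rather than recursion, and the product names are made up.

```python
def derivation_history(product, parents):
    """All ancestors of `product`, found by following one-step records."""
    seen = []
    stack = list(parents.get(product, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(parents.get(current, ()))
    return seen


# Each product stores only the step that immediately preceded it.
parents = {
    "report": ["summary"],
    "summary": ["clean"],
    "clean": ["raw_a", "raw_b"],
}

print(derivation_history("report", parents))
```

The saving in storage is paid for at query time, since every lookup has to walk the chain of ancestors.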

Provenance Dissemination
  Visual Graph
    A common way of disseminating provenance data is through a derivation graph that users can browse and inspect.
  Queries
    Users can also search for datasets based on their provenance metadata, such as to locate all datasets generated by executing a certain workflow. If semantic provenance information is available, these query results can automatically feed input datasets for a workflow at runtime. The derivation history of datasets can be used to replicate data at another site, or update it if a dataset is stale due to changes made to its ancestors.
  Service API
    Provenance retrieval APIs can additionally allow users to implement their own mechanism of usage.
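A sketch of the two query uses named above: locating datasets generated by a given workflow, and detecting a dataset that is stale because an ancestor changed after it was built. The dictionaries, dataset names, and integer timestamps are made up, and both function names are hypothetical.

```python
def generated_by(workflows, workflow):
    """Datasets whose recorded generating workflow matches `workflow`."""
    return sorted(d for d, w in workflows.items() if w == workflow)


def is_stale(dataset, inputs, modified):
    """True if any ancestor was modified after the dataset itself."""
    stack = list(inputs.get(dataset, ()))
    seen = set()
    while stack:
        parent = stack.pop()
        if parent in seen:
            continue
        seen.add(parent)
        if modified[parent] > modified[dataset]:
            return True
        stack.extend(inputs.get(parent, ()))
    return False


workflows = {"alignment": "align", "tree": "build_tree"}
inputs = {"alignment": ["sequences"], "tree": ["alignment"]}
modified = {"sequences": 1, "alignment": 2, "tree": 3}

print(generated_by(workflows, "align"))
print(is_stale("tree", inputs, modified))

# The original sequences are edited after the tree was built.
modified["sequences"] = 5
print(is_stale("tree", inputs, modified))
```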

Survey of Data Provenance Techniques

Provenance in a Bioinformatics Grid (myGrid)
  myGrid builds a personalised problem-solving environment that helps bioinformaticians find, adapt, construct and execute in silico experiments
  Keep the scientist informed as to the provenance of data relevant to their experiment space

What is the problem?
  Provenance recording should be part of the infrastructure, so that users can elect to enable it when they execute their complex tasks over the Grid or in Web Services environments.
  Currently, the Web Services protocol stack and the Open Grid Services Architecture do not provide any support for recording provenance.

Architectural Vision
  Provenance gathering is a collaborative process that involves multiple entities, including the workflow enactment engine, the enactment engine's client, the service directory, and the invoked services.
  Provenance data will be submitted to one or more “provenance repositories” acting as storage for provenance data.
  Upon the user's request, some analysis, navigation and reasoning over provenance data can be undertaken.

Architectural Vision (contd…)
  Storage could be achieved by a provenance service.
  The provenance service would provide support for analysis, navigation or reasoning over provenance.
  Client-side support for submitting provenance data to the provenance service.
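A sketch of the submit-and-retrieve shape this architecture implies, using an in-memory stand-in for a provenance repository. The class ProvenanceService and its methods are hypothetical and are not the myGrid API; the record fields and sample entries are made up.

```python
class ProvenanceService:
    """In-memory stand-in for a provenance repository (hypothetical API)."""

    def __init__(self):
        self._records = []

    def submit(self, actor, activity, outputs):
        """Called by any participating entity to record what it did."""
        self._records.append(
            {"actor": actor, "activity": activity, "outputs": list(outputs)}
        )

    def records_for(self, output):
        """Navigation support: every record that mentions `output`."""
        return [r for r in self._records if output in r["outputs"]]


service = ProvenanceService()
# Several entities contribute records about the same run.
service.submit("enactment engine", "ran workflow step 1", ["result.dat"])
service.submit("invoked service", "computed alignment", ["result.dat", "log.txt"])

for record in service.records_for("result.dat"):
    print(record["actor"], "-", record["activity"])
```

Because every entity submits to the same service, the history of a single output can be assembled from records written by different participants.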

Prototype Overview

Conclusion
  Provenance is a rather unexplored domain
  Necessity to design a configurable architecture capable of supporting multiple requirements from very different application domains
  Need to further investigate the algorithmic foundations of provenance, which will lead to scalable and secure industrial solutions

Future work
  Using heterogeneous data sources
  Large data sources
  Historical measurement
  Dynamic measurement
  Security and authorization of data provenance
  Managing provenance in diverse domains

References
1) Yogesh L. Simmhan, Beth Plale, and Dennis Gannon, "A Survey of Data Provenance in e-Science," in SIGMOD Record, Vol. 34, No. 3, Sept. 2005
2) "Using Semantic Web Technologies for Representing e-Science Provenance," http://theory.csail.mit.edu/~dquan/iswc2004-mygrid.pdf
3) Jan Brase, "Using digital library techniques - Registration of scientific primary data," in ECDL, 2004, http://www.kbs.uni-hannover.de/Arbeiten/Publikationen/2004/brase_TIB_hannover.pdf
4) Peter Buneman, Sanjeev Khanna, and Wang-Chiew Tan, "Why and Where: A Characterization of Data Provenance," in ICDT, 2001
5) Peter Buneman, Sanjeev Khanna, and Wang-Chiew Tan, "Data Provenance: Some Basic Issues," http://db.cis.upenn.edu/DL/fsttcs.pdf
6) Wang-Chiew Tan, "Research Problems in Data Provenance," http://www.soe.ucsc.edu/~wctan/papers/2004/ieee.pdf
7) Raymond K. Pon and Alfonso F. Cárdenas, "Data Quality Inference," http://www.cs.ucla.edu/~rpon/IQIS.pdf
8) Wang, R., Kon, H. & Madnick, S. (1993), "Data Quality Requirements Analysis and Modelling," Ninth International Conference of Data Engineering, Vienna, Austria.
9) Wand, Y. and Wang, R. (1996), "Anchoring Data Quality Dimensions in Ontological Foundations," Communications of the ACM, November 1996, pp. 86-95
