Using Web Data Provenance for Quality Assessment

I used these slides to present our paper at the provenance workshop (SWPM) at the International Semantic Web Conference (ISWC), October 2009.

Page 1: Using Web Data Provenance for Quality Assessment

Using Web Data Provenance for Quality Assessment

Olaf Hartig*, Jun Zhao˚

*Humboldt-Universität zu Berlin ˚University of Oxford

Page 2: Using Web Data Provenance for Quality Assessment

Olaf Hartig - Using Web Data Provenance for Quality Assessment 2

Information Quality (IQ)

● Common definition: fitness for use of information

● Multidimensional concept

● IQ criteria not independent of each other

● Relevancy of criteria determined by task and preferences

Category* | Criteria / Dimensions
Intrinsic | Accuracy, Believability, Objectivity, ...
Contextual | Completeness, Relevance, Timeliness, ...
Representational | Conciseness, Understandability, ...
Accessibility | Availability, Security, ...

*Classification by Wang and Strong, 1996

Page 3: Using Web Data Provenance for Quality Assessment

IQ Assessment

● Assigning numerical values (IQ scores) to IQ criteria

● It is difficult!
  ● Precision vs. Practicality

Semi-automatic methods
  ● Rating-based
  ● Reputation-based

Manual methods
  ● Questionnaires

Page 4: Using Web Data Provenance for Quality Assessment

Automated IQ Assessment

● Literature only outlines ideas for automatic methods

● Content analysis
  ● Comparison (e.g. outlier detection)
  ● Application of information retrieval methods
  ● Analysis of results from data cleansing
  ● Sampling techniques

● Context analysis
  ● Analysis of metadata
  ● Utilization of domain knowledge

Page 5: Using Web Data Provenance for Quality Assessment

Our Goal:

Methods to automatically assess IQ criteria of Web data

Primary means:

Provenance of assessed data

Page 6: Using Web Data Provenance for Quality Assessment

Outline

1. Web Data Provenance

2. General Assessment Approach

3. Development of Assessment Methods

Page 7: Using Web Data Provenance for Quality Assessment

Existing Provenance Research

● Main research areas: (scientific) workflows, DBMSs

● General focus: data creation

Page 9: Using Web Data Provenance for Quality Assessment

Provenance of Web Data

Web data provenance comprises two dimensions:

Data Creation • Data Access

Page 10: Using Web Data Provenance for Quality Assessment

Model of Web Data Provenance

● Provenance graph describes provenance of a data item
  ● Nodes: provenance elements (pieces of provenance info)
  ● Edges: relate provenance elements to each other
  ● Subgraphs for related data items possible
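
A provenance graph of this kind could be held in a small data structure like the one below. This is only a minimal sketch, not the representation used in the paper: the class names `ProvenanceElement` and `ProvenanceGraph`, their methods, and the element names `item` and `creation` are all hypothetical. The element types "Data Item" and "Data Creation" and the edge label "created by" are borrowed from the model slides later in the deck.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProvenanceElement:
    """A node: one piece of provenance information (hypothetical class)."""
    name: str
    element_type: str  # e.g. "Data Item" or "Data Creation"


@dataclass
class ProvenanceGraph:
    """Provenance elements (nodes) plus labelled edges between them."""
    elements: dict = field(default_factory=dict)  # name -> ProvenanceElement
    edges: list = field(default_factory=list)     # (source, label, target) triples

    def add_element(self, name: str, element_type: str) -> None:
        self.elements[name] = ProvenanceElement(name, element_type)

    def relate(self, source: str, label: str, target: str) -> None:
        # Both endpoints must already be known provenance elements.
        if source not in self.elements or target not in self.elements:
            raise KeyError("add both elements before relating them")
        self.edges.append((source, label, target))

    def related(self, source: str, label: str) -> list:
        """Elements reachable from `source` over edges with the given label."""
        return [
            self.elements[target]
            for src, lbl, target in self.edges
            if src == source and lbl == label
        ]


# Illustrative two-node graph: a data item and the execution that created it.
graph = ProvenanceGraph()
graph.add_element("item", "Data Item")
graph.add_element("creation", "Data Creation")
graph.relate("item", "created by", "creation")
```

Storing edges as labelled triples keeps the sketch independent of any particular provenance vocabulary; a subgraph for a related data item would simply add more elements and triples to the same structure.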

Page 11: Using Web Data Provenance for Quality Assessment

Model of Web Data Provenance

● Provenance model defines:
  ● Types of provenance elements
  ● Relationships

Actors • Executions • Artifacts

Page 12: Using Web Data Provenance for Quality Assessment

Data Access Dimension

[Figure: model of the data access dimension]
● Provenance elements: Data Item, Document, Data Access (with Execution Time), Data Accessor (non-human), Data Providing Service (non-human), Data Publisher (human), Service Provider, Resource
● Relationship labels: contains, retrieved by, accessed, performed by, uses, controls, relation to the provided information

Page 13: Using Web Data Provenance for Quality Assessment

Data Access Dimension cont.

[Figure: model of integrity verification in the data access dimension]
● Provenance elements: (Verified) Artifact, Integrity Verification, Signature Verification, Signer, Signature Method, Verification Result
● Also shown: relation to the signed data; a specialization marked {incomplete}

Page 14: Using Web Data Provenance for Quality Assessment

Data Creation Dimension

[Figure: model of the data creation dimension]
● Provenance elements: Data Item, (Encompassing) Data Item, Source Data, Data Creation (with Execution Time), Creation Guidelines, Data Creator (human or non-human), Provenance Information
● Data Creator specializations {complete, disjoint}: Data Creating Service (e.g. software agent), Data Creating Entity (e.g. person, group, organization), Data Creating Device (e.g. sensor)
● Relationship labels: part of, responsible for, relation to the created data

Page 15: Using Web Data Provenance for Quality Assessment

Outline

1. Web Data Provenance

2. General Assessment Approach

3. Development of Assessment Methods

Page 16: Using Web Data Provenance for Quality Assessment

A General Approach

● Blueprint for actual assessment methods that
  ● Address a specific scenario
  ● Focus on a specific IQ criterion

● Provenance elements have an influence on IQ

● Impact values represent these influences

● Assessment is affected by knowing about the influences

● Calculation of the IQ score with an assessment function that combines all impact values

Page 17: Using Web Data Provenance for Quality Assessment

General Assessment Procedure

Step 1 – Generate a provenance graph for the data item

Step 2 – Annotate the provenance graph with impact values

Step 3 – Execute the assessment function
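
The three steps can be read as a small pipeline. The sketch below is one possible way to wire them together, not the authors' implementation: the function `assess`, its parameter names, and the dictionary-based `Graph` and `ImpactValues` types are hypothetical, and the lambdas in the usage example are toy stand-ins for a real method.

```python
from typing import Callable, Optional

Graph = dict         # hypothetical: element name -> properties of that element
ImpactValues = dict  # hypothetical: element name -> impact values of that element


def assess(
    data_item: str,
    generate_graph: Callable[[str], Graph],
    annotate: Callable[[Graph], ImpactValues],
    assessment_function: Callable[[str, Graph, ImpactValues], Optional[float]],
) -> Optional[float]:
    """Run the general assessment procedure for one data item."""
    graph = generate_graph(data_item)   # Step 1: provenance graph
    impact_values = annotate(graph)     # Step 2: impact values
    # Step 3: combine the impact values into an IQ score
    return assessment_function(data_item, graph, impact_values)


# Toy usage with placeholder steps; a real method supplies its own functions.
score = assess(
    "item",
    generate_graph=lambda item: {item: {"type": "Data Item"}},
    annotate=lambda graph: {name: {"impact": 0.5} for name in graph},
    assessment_function=lambda item, graph, impacts: impacts[item]["impact"],
)
```

Passing the three steps in as functions mirrors the idea of a blueprint: the procedure stays fixed while each concrete method decides how the graph is built, which impact values it needs, and how they are combined.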

Page 18: Using Web Data Provenance for Quality Assessment

Outline

1. Web Data Provenance

2. General Assessment Approach

3. Development of Assessment Methods

Page 20: Using Web Data Provenance for Quality Assessment

Designing Assessment Methods

● Developing the general approach into an actual method

● Fundamental design question:

For which IQ criterion do we want to apply the method?

● Timeliness: degree to which the data item is up-to-date with respect to the task at hand

● Representation* as an absolute measure in [0,1]
  ● 1 – meeting the strictest timeliness standards
  ● 0 – unacceptable

*Following Ballou et al., 1998

Page 21: Using Web Data Provenance for Quality Assessment

Step 1 – Generate the Provenance Graph

● Two complementary options:
  ● Recording
  ● Analyzing metadata

Where and how do we get provenance information?

What types of provenance elements are necessary?

What level of detail (i.e. granularity) is necessary?

Page 23: Using Web Data Provenance for Quality Assessment

Step 1 – Generate the Provenance Graph

Example:

● Sensors (e.g. sensor1) take a measurement (e.g. msr) every hour
● All measurements are stored in a Web-accessible storage device (store)
● Our system (sys) accesses them for further processing
● sys assesses the timeliness of every msr

[Figure: provenance graph for msr]
● msr (type: Data Item) is contained by doc (type: Document) and was created by cExc (type: Data Creation)
● doc was retrieved by aExc (type: Data Access)
● aExc (Execution Time: 10:13) was performed by sys (type: Data Accessor) and accessed store (type: Data Providing Service)
● cExc (Execution Time: 10:00) was performed by sensor1 (type: Data Creator)

Page 24: Using Web Data Provenance for Quality Assessment

Step 2 – Annotation with Impact Values

● Systematically analyze each type of provenance element

● Impact values not necessarily numerical
  ● Depends on the assessment function in step 3

How might each provenance element influence the IQ criterion?

What kind of impact values are necessary?

How do we determine impact values?

How do we represent the influences by impact values?

Page 25: Using Web Data Provenance for Quality Assessment

Determining Impact Values

● From the provenance information

● From user input
  ● Configuration options
  ● Rating-based, Reputation-based

● By content analysis
  ● Comparison (e.g. outlier detection)
  ● Adoption of information retrieval methods
  ● Adoption of data cleansing techniques

● By context analysis
  ● Further metadata
  ● Domain knowledge

Page 26: Using Web Data Provenance for Quality Assessment

Step 2 – Annotation with Impact Values

How might each provenance element influence the IQ criterion?

Data Creation Dimension:

Prov. Element Type | Impact Values
Data Creation | creation time, weights
Creation Guidelines | -
(Source) Data Item | expiry time
Data Creator | -

Page 29: Using Web Data Provenance for Quality Assessment

Step 2 – Annotation with Impact Values

[Figure: provenance graph for msr, annotated with impact values]
● cExc (type: Data Creation, Execution Time: 10:00) is annotated with creation time 10:00
● msr (type: Data Item) is annotated with expiry time 11:00
● Unannotated elements: doc (type: Document), aExc (type: Data Access, Execution Time: 10:13), sys (type: Data Accessor), store (type: Data Providing Service), sensor1 (type: Data Creator)
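
One way the two annotations in the example might be derived is sketched below. The slides do not say how the expiry time of 11:00 was obtained; the sketch assumes that a measurement expires when the next hourly measurement is due, so the `VALIDITY` constant, the `annotate` function, and the dictionary layout are all hypothetical. Times of day are written as offsets from midnight so that no calendar date has to be invented.

```python
from datetime import timedelta

# Fragment of the example provenance graph (times as offsets from midnight).
graph = {
    "msr": {"type": "Data Item", "created by": "cExc"},
    "cExc": {"type": "Data Creation", "execution_time": timedelta(hours=10)},
}

# Assumption: a measurement expires when the next hourly one is due.
VALIDITY = timedelta(hours=1)


def annotate(graph: dict) -> dict:
    """Return impact values per provenance element (hypothetical layout)."""
    impact = {}
    # Data Creation elements: the creation time is read from the execution time.
    for name, element in graph.items():
        if element["type"] == "Data Creation" and "execution_time" in element:
            impact[name] = {"creation time": element["execution_time"]}
    # Data Items: the expiry time is the creation time plus the validity period.
    for name, element in graph.items():
        if element["type"] != "Data Item":
            continue
        creation = impact.get(element.get("created by"), {}).get("creation time")
        if creation is not None:
            impact[name] = {"expiry time": creation + VALIDITY}
    return impact


impact_values = annotate(graph)
```

Elements without usable provenance information simply receive no annotation here, which leaves it to the assessment function to decide what a missing impact value means.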

Page 30: Using Web Data Provenance for Quality Assessment

Step 3 – Assessment Function

● Develop the function together with the impact values

● Take incompleteness into consideration
  ● Provenance graphs could be fragmentary
  ● Annotations could be missing

What does the assessment function look like?

How do we represent the IQ criterion by an IQ score?

Page 34: Using Web Data Provenance for Quality Assessment

Step 3 – Assessment Function

[Figure: annotated provenance graph for msr, with creation time 10:00 and expiry time 11:00]

t(msr) = 1 – (10:15 – 10:00) / (11:00 – 10:00) = 1 – 0.25h / 1h = 0.75
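
The calculation above can be reproduced with a short function. This is a sketch under stated assumptions rather than the assessment function from the paper: it treats 10:15 as the time of the assessment (the slide does not say where that value comes from), clamps the result to [0,1], and returns no score when an annotation is missing. The name `timeliness` and its parameters are hypothetical, and times of day are again written as offsets from midnight.

```python
from datetime import timedelta
from typing import Optional


def timeliness(
    now: timedelta,
    creation_time: Optional[timedelta],
    expiry_time: Optional[timedelta],
) -> Optional[float]:
    """Timeliness in [0, 1]: 1 minus the item's age relative to its lifetime."""
    # Assumption: without both annotations no score can be given.
    if creation_time is None or expiry_time is None:
        return None
    lifetime = expiry_time - creation_time
    if lifetime <= timedelta(0):
        return 0.0  # assumption: an item with no lifetime is never timely
    age = now - creation_time
    # Dividing two timedeltas yields a float; clamp the result to [0, 1].
    return max(0.0, min(1.0, 1 - age / lifetime))


score = timeliness(
    now=timedelta(hours=10, minutes=15),  # assumed time of assessment
    creation_time=timedelta(hours=10),    # creation time annotation
    expiry_time=timedelta(hours=11),      # expiry time annotation
)
print(score)  # 0.75
```

Returning `None` for missing annotations is only one option for handling fragmentary provenance; a method could equally fall back to a default score or to a pessimistic 0.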

Page 35: Using Web Data Provenance for Quality Assessment

Conclusion

● Web Data Provenance (data creation + data access)

● General approach for provenance-based IQ assessment
  ● Impact values: influence of provenance elements on IQ

● Design decisions for actual assessment methods

● Application to timeliness (more in the paper)

● Future work:
  ● How do we deal with incompleteness?
  ● Application of the approach to other IQ criteria

Page 36: Using Web Data Provenance for Quality Assessment

These slides have been created by Olaf Hartig

http://olafhartig.de

This work is licensed under a Creative Commons Attribution-Share Alike 3.0 License

(http://creativecommons.org/licenses/by-sa/3.0/)

Attribution:
● http://www.flickr.com/photos/rrrrred/3809362767/
● http://www.hasslefreeclipart.com