Quality Metrics for Linked Open Data

Behshid Behkamal, Mohsen Kahani, Ebrahim Bagheri

Transcript of Quality Metrics for Linked Open Data

Page 1: Quality Metrics for  Linked Open Data

Behshid Behkamal, Mohsen Kahani, Ebrahim Bagheri


Page 2: Quality Metrics for  Linked Open Data

Outline

What is the problem?

What have others done?

What is our solution?

Does it work?

Page 3: Quality Metrics for  Linked Open Data

What is the problem?

Linked Open Data (LOD): Realizing the Semantic Web by interlinking existing but dispersed data

Main components of LOD:

• URIs to identify things

• RDF to describe data

• HTTP to access data

Page 4: Quality Metrics for  Linked Open Data

What is the problem?

Inclusion criteria for publishing and interlinking datasets into the LOD cloud:

• Resolvable http/https URIs

• Presented in one of the standard formats of the Semantic Web (RDF, RDFa, RDF/XML, Turtle, N-Triples)

• Contains at least 1000 triples

• Connected via at least 50 RDF links to the existing datasets of LOD

• Accessible via RDF crawling, RDF dump, or SPARQL endpoint

Is the dataset ready to be published?
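The inclusion criteria are mechanical, so they can be checked with a few comparisons. The sketch below is only an illustration: the thresholds (1000 triples, 50 links) and the format list come from the slide, while the `DatasetSummary` record, its field names and the access-method labels are assumptions made for this example. It checks URI schemes only and does not actually dereference anything.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Thresholds and formats as listed on the slide.
MIN_TRIPLES = 1000
MIN_LINKS = 50
STANDARD_FORMATS = {"RDF", "RDFa", "RDF/XML", "Turtle", "N-Triples"}
# Labels for the three access routes; the label strings are assumptions.
ACCESS_METHODS = {"crawling", "dump", "sparql"}


@dataclass
class DatasetSummary:
    """Hypothetical summary record; the field names are illustrative."""
    sample_uris: list
    data_format: str
    triple_count: int
    outgoing_link_count: int
    access_methods: set


def failed_criteria(ds: DatasetSummary) -> list:
    """Return the names of the inclusion criteria the dataset fails."""
    failures = []
    # Only the URI scheme is checked; real resolvability needs an HTTP request.
    if not all(urlparse(u).scheme in ("http", "https") for u in ds.sample_uris):
        failures.append("http/https URIs")
    if ds.data_format not in STANDARD_FORMATS:
        failures.append("standard format")
    if ds.triple_count < MIN_TRIPLES:
        failures.append("at least 1000 triples")
    if ds.outgoing_link_count < MIN_LINKS:
        failures.append("at least 50 RDF links")
    if not (ds.access_methods & ACCESS_METHODS):
        failures.append("accessible")
    return failures
```

A dataset can pass every one of these checks and still contain the quality problems discussed on the following slides, which is the gap the talk addresses.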

Page 5: Quality Metrics for  Linked Open Data

What is the problem?

One approach: Publish first, improve later

Results in: Quality problems in the published datasets

Missing link: Data quality evaluation prior to release

Page 6: Quality Metrics for  Linked Open Data

What have others done?

Data quality in the context of LOD

Validators

• General validators

• Parsing and syntax

• Accessibility / dereferenceability

Quality assessment of published data

• Classifying quality problems of LOD

• Using metadata for quality assessment

• Filtering poor-quality data (WIQA)

• Semantic annotation using ontologies

Page 7: Quality Metrics for  Linked Open Data

Our Contribution


• Identifying important quality deficiencies that need to be avoided or resolved prior to release

• Proposing a set of metrics to measure and identify these quality deficiencies in a dataset.

Page 8: Quality Metrics for  Linked Open Data

Quality Deficiencies (Schema level)

Quality Deficiency | Issues | Resolution Method
Improper usage of vocabularies | Not using appropriate existing vocabularies to describe the resources | Domain Expert
Redefining existing classes/properties | Redefining the classes/properties in the ontology that already exist in the vocabularies | Domain Expert
Improper definition of classes/properties | Classes with different names, but the same relations | Semi-Automated
Improper definition of classes/properties | Properties with different names, but the same meaning | Ontologist
Improper definition of classes/properties | Inadequate number of classes/properties used to describe the resources | Domain Expert
Misuse of data type | Not using appropriate data types for the literals | Automated
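Of the schema-level deficiencies, misuse of data type is the one marked as fully automatable. One possible automated check, sketched here under simplifying assumptions, is to test whether a literal's lexical form fits its declared XSD datatype. The function names are hypothetical, only three datatypes are covered, and the date check ignores optional time zones.

```python
import re
from datetime import date

XSD = "http://www.w3.org/2001/XMLSchema#"


def _is_integer(text):
    # xsd:integer: optional sign followed by decimal digits.
    return re.fullmatch(r"[+-]?[0-9]+", text) is not None


def _is_boolean(text):
    # xsd:boolean allows exactly these four lexical forms.
    return text in ("true", "false", "1", "0")


def _is_date(text):
    # Simplified: accepts YYYY-MM-DD only, without time zone suffixes.
    if re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", text) is None:
        return False
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False


CHECKERS = {
    XSD + "integer": _is_integer,
    XSD + "boolean": _is_boolean,
    XSD + "date": _is_date,
}


def misused_literals(literals):
    """Return the (lexical_form, datatype_uri) pairs that do not fit.

    Datatypes without a registered checker are skipped, not flagged.
    """
    bad = []
    for lexical, datatype in literals:
        checker = CHECKERS.get(datatype)
        if checker is not None and not checker(lexical):
            bad.append((lexical, datatype))
    return bad
```

For instance, `misused_literals([("forty-two", XSD + "integer")])` flags the pair, while `("42", XSD + "integer")` passes.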

Page 9: Quality Metrics for  Linked Open Data

Quality Deficiencies (Instance level)

Quality Deficiency | Issues | Resolution Method
Errors in property values | Missing values | Automated
Errors in property values | Out-of-range values | Automated
Errors in property values | Misspelling | Semi-Automated
Errors in property values | Inconsistent values | Automated
Mismatch with the real world | Resources without correspondence in the real world | Domain Expert
Syntax errors | Triples containing syntax errors | Validator
Misuse of data type/object property | Improper assignment of object property to the data type property or vice versa | Validator
Improper usage of classes/properties | Using undefined classes/properties | Semi-Automated
Improper usage of classes/properties | Membership of disjoint classes | Automated
Improper usage of classes/properties | Misplaced classes/properties | Validator
Redundant/similar individuals | Individuals with similar property values, but different names | Ontologist
Invalid usage of inverse-functional properties | Inverse-functional properties with void values | Automated
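The redundant/similar individuals deficiency can be illustrated with a small grouping routine. This is a sketch, not the authors' procedure: it treats "similar" as "exactly the same set of (property, value) pairs", which is a deliberate simplification, and the triples are assumed to be plain `(subject, property, value)` tuples.

```python
from collections import defaultdict


def similar_individuals(triples):
    """Group individuals described by identical (property, value) pairs.

    Returns a list of groups, each a sorted list of two or more names.
    Candidates would still need review, since the slide assigns this
    deficiency to an ontologist rather than to an automated step.
    """
    description = defaultdict(set)
    for subject, prop, value in triples:
        description[subject].add((prop, value))

    groups = defaultdict(list)
    for subject, pairs in description.items():
        groups[frozenset(pairs)].append(subject)

    return [sorted(members) for members in groups.values() if len(members) > 1]
```

With two individuals `ex:a` and `ex:b` that both carry the same name and code, the function returns `[["ex:a", "ex:b"]]`.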

Page 10: Quality Metrics for  Linked Open Data

Proposing Metrics

Name | Description | Related quality deficiencies
Miss_Vlu (M1) | The ratio of the properties defined in the schema but not present in the dataset | Errors in property values
Out_Vlu (M2) | The ratio of the triples of the dataset which contain properties with out-of-range values | Errors in property values
Msspl_Prp_Vlu (M3) | The ratio of the properties of the dataset which contain misspelled values | Errors in property values
Und_Cls_Prp (M4) | The ratio of the triples of the dataset using classes or properties without any formal definition | Improper usage of classes/properties
Dsj_Cls (M5) | The ratio of the instances of the dataset being members of disjoint classes | Improper usage of classes/properties
Inc_Prp_Vlu (M6) | The ratio of the triples of the dataset in which the values of properties are inconsistent | Errors in property values
FP (M7) | The ratio of the triples of the dataset with functional properties which contain inconsistent values | Errors in property values
IFP (M8) | The ratio of the triples of the dataset which contain invalid usage of inverse-functional properties | Invalid usage of inverse-functional properties
Im_DT (M9) | The ratio of the triples of the dataset which contain data type properties with inappropriate data types | Not using appropriate data types for the literals
Sml_Cls (M10) | The ratio of the classes of the dataset with different names, but the same instances | Improper definition of classes
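All ten metrics are ratios, so each reduces to counting offending items and dividing by a total. As a rough sketch, two of them (M4 and M5) could be computed as below. The slide does not spell out the denominators, so the choices here (all triples for M4, all typed subjects for M5) are assumptions, as are the function names, the abbreviated `rdf:type` string and the tuple representation of triples. No subclass reasoning is performed.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # abbreviated for readability


def und_cls_prp(triples, defined_terms):
    """M4 sketch: share of triples using an undefined class or property.

    For rdf:type triples the class (object) is checked; otherwise the
    property (predicate) is checked. Denominator: all triples (assumed).
    """
    if not triples:
        return 0.0

    def uses_undefined(subject, prop, obj):
        if prop == RDF_TYPE:
            return obj not in defined_terms
        return prop not in defined_terms

    return sum(uses_undefined(*t) for t in triples) / len(triples)


def dsj_cls(triples, disjoint_pairs):
    """M5 sketch: share of instances typed with two disjoint classes.

    An instance is any subject of an rdf:type triple (assumed).
    """
    classes_of = defaultdict(set)
    for subject, prop, obj in triples:
        if prop == RDF_TYPE:
            classes_of[subject].add(obj)
    if not classes_of:
        return 0.0
    violating = sum(
        any(a in classes and b in classes for a, b in disjoint_pairs)
        for classes in classes_of.values()
    )
    return violating / len(classes_of)
```

Both functions return a value between 0 and 1, where 0 means the deficiency was not observed.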

Page 11: Quality Metrics for  Linked Open Data

Empirical Evaluation

Selecting several real datasets from LOD

Calculation of the metrics values for datasets

Metrics interdependency study

Manipulating the quality of the datasets

Comparing the trends of metrics over two observations

Page 12: Quality Metrics for  Linked Open Data

Empirical Evaluation

Real-world datasets

Dataset | No. of triples | No. of instances | No. of classes | No. of properties
FAO Water Areas | 10,730 | 586 | 31 | 19
Water Economic Zones | 29,193 | 1,074 | 113 | 127
Large Marine Ecosystems | 12,012 | 716 | 21 | 31
Geopolitical Entities | 22,725 | 312 | 88 | 101
ISSCAAP Species Classification | 398,166 | 25,253 | 52 | 93
Species Taxonomic Classification | 319,490 | 11,741 | 33 | 26
Commodities | 56,420 | 2,788 | 10 | 19
Vessels | 4,236 | 240 | 6 | 22


Page 14: Quality Metrics for  Linked Open Data

Empirical Evaluation

Selecting several real datasets from LOD

Calculation of the metrics values for datasets

Metrics interdependency study

Manipulating the quality of the datasets using heuristics

Comparing the trends of metrics over two observations

Result:

• Two pairs of metrics are correlated: {IFP, Im_DT} and {IFP, Inc_Prp_Vlu}

• The others are independent
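An interdependency study of this kind can be approximated by correlating each pair of metrics across the datasets. The slides do not say which statistic was used, so the Pearson coefficient and the 0.8 threshold below are assumptions, and the function names are hypothetical. The input maps a metric name to its list of values, one value per dataset.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant metric carries no trend; treated as uncorrelated here.
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlated_pairs(metric_values, threshold=0.8):
    """Return metric-name pairs whose |r| reaches the (assumed) threshold."""
    pairs = []
    for a, b in combinations(sorted(metric_values), 2):
        if abs(pearson(metric_values[a], metric_values[b])) >= threshold:
            pairs.append((a, b))
    return pairs
```

With only eight datasets, any such coefficient rests on very few points, so a reported pair is best read as a hint of dependency rather than proof.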


Page 17: Quality Metrics for  Linked Open Data

Results analysis

In 78% of the scenarios, the metrics react as expected to the corresponding heuristics:

• Heuristics have been applied and the corresponding metrics have changed (23%)

• Heuristics have not been applied and the metric values have not changed (55%)

In 22% of the scenarios, heuristics have been applied but the corresponding metrics have not changed, because:

• Dependency between heuristics caused some side effects

• The ratio of heuristics applied to the size of the datasets is very low
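The breakdown above (23% + 55% = 78% expected, 22% unexpected) follows from classifying each scenario by two flags: whether a heuristic was applied, and whether the corresponding metric changed between the two observations. A minimal sketch of that bookkeeping is shown here. The labels, function names and tolerance are assumptions, and the fourth case (not applied, yet changed) is not reported on the slide and is included only for completeness.

```python
from collections import Counter


def metric_changed(before, after, tolerance=1e-9):
    """True if the metric value differs between the two observations."""
    return abs(after - before) > tolerance


def classify(applied, changed):
    """Label one scenario by (heuristic applied?, metric changed?)."""
    if applied and changed:
        return "expected: applied, changed"
    if not applied and not changed:
        return "expected: not applied, unchanged"
    if applied and not changed:
        return "unexpected: applied, unchanged"
    return "unexpected: not applied, changed"  # not reported on the slide


def summarize(scenarios):
    """Map each label to its share of the (applied, changed) scenarios."""
    counts = Counter(classify(applied, changed) for applied, changed in scenarios)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}
```

Each scenario here would be one (dataset, heuristic, metric) combination, with `changed` obtained by calling `metric_changed` on the metric value before and after the manipulation.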


Page 18: Quality Metrics for  Linked Open Data

Discussions