Quality Metrics for Linked Open Data
![Page 1: Quality Metrics for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022082723/5879f1a11a28ab70298b4cdf/html5/thumbnails/1.jpg)
Behshid Behkamal, Mohsen Kahani, Ebrahim Bagheri
Quality Metrics for Linked Open Data
![Page 2: Quality Metrics for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022082723/5879f1a11a28ab70298b4cdf/html5/thumbnails/2.jpg)
Outline
• What is the problem?
• What have others done?
• What is our solution?
• Does it work?
![Page 3: Quality Metrics for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022082723/5879f1a11a28ab70298b4cdf/html5/thumbnails/3.jpg)
What is the problem?
Linked Open Data (LOD): Realizing the Semantic Web by interlinking existing but dispersed data
Main components of LOD:
• URIs to identify things
• RDF to describe data
• HTTP to access data
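As a rough illustration of how the three components fit together, the sketch below names a thing with a URI, describes it with one RDF triple, and prepares (without sending) an HTTP request for it. The `example.org` URIs and the helper names `to_ntriples` and `dereference_request` are hypothetical, chosen only for this example.

```python
from urllib.request import Request

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# One RDF statement: (subject URI, predicate URI, object URI).
# The example.org URIs are placeholders, not real LOD resources.
triple = (
    "http://example.org/resource/vessel42",
    RDF_TYPE,
    "http://example.org/ontology/Vessel",
)


def to_ntriples(subject, predicate, obj):
    """Serialize a triple whose three terms are all URIs as one N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def dereference_request(uri):
    """Build (but do not send) an HTTP request asking for a Turtle description."""
    return Request(uri, headers={"Accept": "text/turtle"})


line = to_ntriples(*triple)
request = dereference_request(triple[0])
```

Sending the request with `urllib.request.urlopen(request)` would only work against a server that actually publishes the resource, so the sketch stops at building it.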
![Page 4: Quality Metrics for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022082723/5879f1a11a28ab70298b4cdf/html5/thumbnails/4.jpg)
What is the problem?
Inclusion criteria for publishing and interlinking datasets into the LOD cloud:
• Resolvable http/https URIs
• Presented in one of the standard formats of the Semantic Web (RDF, RDFa, RDF/XML, Turtle, N-Triples)
• Contains at least 1000 triples
• Connected via at least 50 RDF links to the existing datasets of LOD
• Accessible via RDF crawling, RDF dump, or SPARQL endpoint
Is the dataset ready to be published?
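The inclusion criteria listed above can be read as a simple checklist. The following is a minimal sketch of such a check, assuming a hand-filled summary of the dataset; the `DatasetSummary` fields and the `unmet_criteria` helper are hypothetical names, and the URI test only looks at the scheme because real resolvability would require issuing HTTP requests.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse

# Thresholds and formats as listed in the inclusion criteria.
STANDARD_FORMATS = {"RDF", "RDFa", "RDF/XML", "Turtle", "N-Triples"}
ACCESS_METHODS = {"crawling", "dump", "sparql"}
MIN_TRIPLES = 1000
MIN_LINKS = 50


@dataclass
class DatasetSummary:
    """Hypothetical hand-filled description of a candidate dataset."""
    sample_uris: list
    serialization: str
    triple_count: int
    links_to_lod: int
    access_methods: set = field(default_factory=set)


def unmet_criteria(ds):
    """Return a list of the inclusion criteria the dataset does not meet."""
    problems = []
    # Scheme check only; this does not prove the URIs actually resolve.
    if not ds.sample_uris or any(
        urlparse(u).scheme not in ("http", "https") for u in ds.sample_uris
    ):
        problems.append("URIs must be http/https")
    if ds.serialization not in STANDARD_FORMATS:
        problems.append("unsupported serialization format")
    if ds.triple_count < MIN_TRIPLES:
        problems.append("fewer than 1000 triples")
    if ds.links_to_lod < MIN_LINKS:
        problems.append("fewer than 50 RDF links")
    if not (ds.access_methods & ACCESS_METHODS):
        problems.append("no supported access method")
    return problems


# Made-up summaries, for illustration only.
ready = DatasetSummary(["http://example.org/item/1"], "Turtle", 5000, 60, {"dump"})
not_ready = DatasetSummary(["ftp://example.org/item/1"], "CSV", 10, 0, set())
```

Passing this checklist says nothing about the quality of the data itself, which is the gap the rest of the talk addresses.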
![Page 5: Quality Metrics for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022082723/5879f1a11a28ab70298b4cdf/html5/thumbnails/5.jpg)
What is the problem?
One approach: Publish first, improve later
Results in: Quality problems in the published datasets
Missing link: Data quality evaluation prior to release
![Page 6: Quality Metrics for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022082723/5879f1a11a28ab70298b4cdf/html5/thumbnails/6.jpg)
What have others done?
Data quality in the context of LOD
Validators:
• General validators
• Parsing and syntax
• Accessibility / dereferenceability
Quality assessment of published data:
• Classifying quality problems of LOD
• Using metadata for quality assessment
• Filtering poor-quality data (WIQA)
• Semantic annotation using ontologies
![Page 7: Quality Metrics for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022082723/5879f1a11a28ab70298b4cdf/html5/thumbnails/7.jpg)
Our Contribution
• Identifying important quality deficiencies that need to be avoided or resolved prior to release
• Proposing a set of metrics to measure and identify these quality deficiencies in a dataset.
![Page 8: Quality Metrics for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022082723/5879f1a11a28ab70298b4cdf/html5/thumbnails/8.jpg)
Quality Deficiencies (Schema level)

| Quality deficiency | Issues | Resolution method |
| --- | --- | --- |
| Improper usage of vocabularies | Not using appropriate existing vocabularies to describe the resources | Domain expert |
| Redefining existing classes/properties | Redefining classes/properties in the ontology that already exist in the vocabularies | Domain expert |
| Improper definition of classes/properties | Classes with different names, but the same relations | Semi-automated |
| | Properties with different names, but the same meaning | Ontologist |
| | Inadequate number of classes/properties used to describe the resources | Domain expert |
| Misuse of data type | Not using appropriate data types for the literals | Automated |
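Misuse of data types is the one schema-level deficiency marked as automatable. A minimal sketch of such a check follows, assuming literals arrive as (lexical value, declared datatype) pairs; the helper names and the two supported datatypes are illustrative choices, and Python's `int` and `date.fromisoformat` only approximate the XML Schema lexical rules.

```python
from datetime import date


def _is_integer(text):
    # Approximation of xsd:integer; Python's int() is slightly more lenient.
    try:
        int(text)
        return True
    except ValueError:
        return False


def _is_date(text):
    # Approximation of xsd:date using ISO 8601 calendar dates.
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False


# Only these two datatypes are covered in this sketch.
CHECKS = {"xsd:integer": _is_integer, "xsd:date": _is_date}


def mistyped_literals(literals):
    """Return the (value, datatype) pairs whose value does not fit the datatype."""
    return [
        (value, datatype)
        for value, datatype in literals
        if datatype in CHECKS and not CHECKS[datatype](value)
    ]


# Made-up literals, for illustration only.
sample = [
    ("42", "xsd:integer"),
    ("forty-two", "xsd:integer"),
    ("2013-02-30", "xsd:date"),
    ("2013-02-28", "xsd:date"),
]
bad = mistyped_literals(sample)
```

Literals with a datatype outside `CHECKS` are silently skipped here; a fuller check would need a validator per datatype.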
![Page 9: Quality Metrics for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022082723/5879f1a11a28ab70298b4cdf/html5/thumbnails/9.jpg)
Quality Deficiencies (Instance level)

| Quality deficiency | Issues | Resolution method |
| --- | --- | --- |
| Errors in property values | Missing values | Automated |
| | Out-of-range values | Automated |
| | Misspelling | Semi-automated |
| | Inconsistent values | Automated |
| Mismatch with the real world | Resources without correspondence in the real world | Domain expert |
| Syntax errors | Triples containing syntax errors | Validator |
| Misuse of data type/object property | Improper assignment of object property to the data type property or vice versa | Validator |
| Improper usage of classes/properties | Using undefined classes/properties | Semi-automated |
| | Membership of disjoint classes | Automated |
| | Misplaced classes/properties | Validator |
| Redundant/similar individuals | Individuals with similar property values, but different names | Ontologist |
| Invalid usage of inverse-functional properties | Inverse-functional properties with void values | Automated |
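Two of the instance-level issues marked as automated, membership of disjoint classes and inverse-functional properties with void values, lend themselves to a short sketch. It assumes triples are plain `(subject, predicate, object)` tuples with prefixed names, and that the disjointness axioms and the set of inverse-functional properties are supplied by the caller; all names and sample data are hypothetical.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"


def disjoint_members(triples, disjoint_pairs):
    """Return the subjects typed with both classes of any disjoint pair."""
    types = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            types[subject].add(obj)
    return sorted(
        subject
        for subject, classes in types.items()
        if any(a in classes and b in classes for a, b in disjoint_pairs)
    )


def void_ifp_triples(triples, inverse_functional):
    """Return triples whose inverse-functional property has an empty value."""
    return [
        (s, p, o)
        for s, p, o in triples
        if p in inverse_functional and (o is None or str(o).strip() == "")
    ]


# Made-up triples, for illustration only.
triples = [
    ("ex:a", "rdf:type", "ex:Fish"),
    ("ex:a", "rdf:type", "ex:Vessel"),
    ("ex:b", "rdf:type", "ex:Fish"),
    ("ex:b", "ex:code", ""),
    ("ex:a", "ex:code", "A1"),
]
conflicts = disjoint_members(triples, [("ex:Fish", "ex:Vessel")])
void_values = void_ifp_triples(triples, {"ex:code"})
```

The sketch does no reasoning over subclass hierarchies, so it only catches violations that are stated directly in the data.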
![Page 10: Quality Metrics for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022082723/5879f1a11a28ab70298b4cdf/html5/thumbnails/10.jpg)
Proposing Metrics

| Name | Description | Related quality deficiencies |
| --- | --- | --- |
| Miss_Vlu (M1) | The ratio of the properties defined in the schema but not present in the dataset | Errors in property values |
| Out_Vlu (M2) | The ratio of the triples of the dataset which contain properties with out-of-range values | Errors in property values |
| Msspl_Prp_Vlu (M3) | The ratio of the properties of the dataset which contain misspelled values | Errors in property values |
| Und_Cls_Prp (M4) | The ratio of the triples of the dataset using classes or properties without any formal definition | Improper usage of classes/properties |
| Dsj_Cls (M5) | The ratio of the instances of the dataset being members of disjoint classes | Improper usage of classes/properties |
| Inc_Prp_Vlu (M6) | The ratio of the triples of the dataset in which the values of properties are inconsistent | Errors in property values |
| FP (M7) | The ratio of the triples of the dataset with functional properties which contain inconsistent values | Errors in property values |
| IFP (M8) | The ratio of the triples of the dataset which contain invalid usage of inverse-functional properties | Invalid usage of inverse-functional properties |
| Im_DT (M9) | The ratio of the triples of the dataset which contain data type properties with inappropriate data types | Not using appropriate data types for the literals |
| Sml_Cls (M10) | The ratio of the classes of the dataset with different names, but the same instances | Improper definition of classes |
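Each metric is described as a ratio, so a computation might look like the sketch below for Miss_Vlu (M1) and Und_Cls_Prp (M4). The slides do not give exact formulas, so the denominators (number of schema properties for M1, number of triples for M4) are assumptions read off the descriptions, and the tuple-based triple representation and function names are hypothetical.

```python
RDF_TYPE = "rdf:type"


def ratio(numerator, denominator):
    """Safe ratio that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def miss_vlu(triples, schema_properties):
    """M1 (assumed form): share of schema properties never used in the data."""
    used = {predicate for _, predicate, _ in triples}
    missing = set(schema_properties) - used
    return ratio(len(missing), len(schema_properties))


def und_cls_prp(triples, defined_classes, defined_properties):
    """M4 (assumed form): share of triples using an undefined class or property."""
    def uses_undefined(triple):
        _, predicate, obj = triple
        if predicate == RDF_TYPE:
            return obj not in defined_classes
        return predicate not in defined_properties

    return ratio(sum(1 for t in triples if uses_undefined(t)), len(triples))


# Made-up data, for illustration only.
triples = [
    ("ex:a", "rdf:type", "ex:Fish"),
    ("ex:a", "ex:name", "cod"),
    ("ex:a", "ex:colour", "grey"),
    ("ex:b", "rdf:type", "ex:Boat"),
]
schema_properties = {"ex:name", "ex:length"}
defined_classes = {"ex:Fish"}

m1 = miss_vlu(triples, schema_properties)
m4 = und_cls_prp(triples, defined_classes, schema_properties)
```

In this toy data one of two schema properties is unused and two of four triples refer to an undefined term, so both ratios come out at 0.5. Values closer to 0 indicate fewer occurrences of the deficiency.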
![Page 11: Quality Metrics for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022082723/5879f1a11a28ab70298b4cdf/html5/thumbnails/11.jpg)
Empirical Evaluation
1. Selecting several real datasets from LOD
2. Calculation of the metrics values for datasets
3. Metrics interdependency study
4. Manipulating the quality of the datasets
5. Comparing the trends of metrics over two observations
![Page 12: Quality Metrics for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022082723/5879f1a11a28ab70298b4cdf/html5/thumbnails/12.jpg)
Empirical Evaluation
Real-world datasets

| Dataset | No. of triples | No. of instances | No. of classes | No. of properties |
| --- | --- | --- | --- | --- |
| FAO Water Areas | 10,730 | 586 | 31 | 19 |
| Water Economic Zones | 29,193 | 1,074 | 113 | 127 |
| Large Marine Ecosystems | 12,012 | 716 | 21 | 31 |
| Geopolitical Entities | 22,725 | 312 | 88 | 101 |
| ISSCAAP Species Classification | 398,166 | 25,253 | 52 | 93 |
| Species Taxonomic Classification | 319,490 | 11,741 | 33 | 26 |
| Commodities | 56,420 | 2,788 | 10 | 19 |
| Vessels | 4,236 | 240 | 6 | 22 |
![Page 13: Quality Metrics for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022082723/5879f1a11a28ab70298b4cdf/html5/thumbnails/13.jpg)
![Page 14: Quality Metrics for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022082723/5879f1a11a28ab70298b4cdf/html5/thumbnails/14.jpg)
Empirical Evaluation
√ Selecting several real datasets from LOD
√ Calculation of the metrics values for datasets
√ Metrics interdependency study
• Manipulating the quality of the datasets using heuristics
• Comparing the trends of metrics over two observations
Result:
• Two pairs of metrics are correlated: {IFP, Im_DT} and {IFP, Inc_Prp_Vlu}
• The others are independent
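An interdependency study of this kind can be sketched as a pairwise correlation over the metric values measured on each dataset. The slides do not say which statistic or threshold was used, so the Pearson coefficient and the 0.8 cut-off below are assumptions, and the metric names and values are placeholders rather than the study's data.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlated_pairs(metric_values, threshold=0.8):
    """Return metric pairs whose absolute correlation reaches the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(metric_values), 2)
        if abs(pearson(metric_values[a], metric_values[b])) >= threshold
    ]


# Placeholder values, one entry per dataset; not taken from the study.
metric_values = {
    "metric_a": [1, 2, 3, 4],
    "metric_b": [2, 4, 6, 8],
    "metric_c": [1, -1, 1, -1],
}
pairs = correlated_pairs(metric_values)
```

With only a handful of datasets, a coefficient like this is a weak signal, so any pair it flags would still need a closer look.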
![Page 15: Quality Metrics for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022082723/5879f1a11a28ab70298b4cdf/html5/thumbnails/15.jpg)
![Page 16: Quality Metrics for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022082723/5879f1a11a28ab70298b4cdf/html5/thumbnails/16.jpg)
√ Selecting several real datasets from LOD
√ Calculation of the metrics values for datasets
√ Metrics interdependency study
√ Manipulating the quality of the datasets using heuristics
√ Comparing the trends of metrics over two observations
![Page 17: Quality Metrics for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022082723/5879f1a11a28ab70298b4cdf/html5/thumbnails/17.jpg)
Results analysis
In 78% of the scenarios, the metrics react as expected to the corresponding heuristics:
• Heuristics have been applied and the corresponding metrics have changed (23%)
• Heuristics have not been applied and the metric values have not changed (55%)
In 22% of the scenarios, heuristics have been applied and the corresponding metrics have not changed, because:
• Dependencies between heuristics caused some side effects
• The number of heuristic changes relative to the size of the datasets is very low
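The breakdown above amounts to tallying each scenario by two flags: whether a heuristic was applied and whether the corresponding metric changed. A minimal sketch of that tally follows; the slides do not give the actual number of scenarios, so the 100 placeholder scenarios are arranged only to reproduce the reported shares, and the function names are hypothetical.

```python
from collections import Counter


def classify(applied, changed):
    """Label one scenario by (heuristic applied?, metric changed?)."""
    if applied and changed:
        return "applied, changed"
    if not applied and not changed:
        return "not applied, unchanged"
    if applied and not changed:
        return "applied, unchanged"
    return "not applied, changed"


def summarize(scenarios):
    """Return the percentage of scenarios falling under each label."""
    counts = Counter(classify(applied, changed) for applied, changed in scenarios)
    total = len(scenarios)
    return {label: 100 * count / total for label, count in counts.items()}


# Placeholder scenarios matching the reported shares (23%, 55%, 22%).
scenarios = (
    [(True, True)] * 23
    + [(False, False)] * 55
    + [(True, False)] * 22
)
shares = summarize(scenarios)

# "As expected" covers the two cases where metric and heuristic agree.
as_expected = shares["applied, changed"] + shares["not applied, unchanged"]
```

The two agreeing categories add up to the 78% reported as expected behaviour, and the remaining 22% is the case where a heuristic was applied but the metric did not move.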
![Page 18: Quality Metrics for Linked Open Data](https://reader036.fdocuments.us/reader036/viewer/2022082723/5879f1a11a28ab70298b4cdf/html5/thumbnails/18.jpg)
Discussions