Page 1

A Tutorial on Instance Matching Benchmarks

Evangelia Daskalaki, Institute of Computer Science – FORTH, Greece
Tzanina Saveta, Institute of Computer Science – FORTH, Greece
Irini Fundulaki, Institute of Computer Science – FORTH, Greece
Melanie Herschel, Universitaet Stuttgart

ESWC 2016, May 30th, Anissaras – Crete, Greece

http://www.ics.forth.gr/isl/BenchmarksTutorial/

Page 2

Teaser Slide

• We will talk about benchmarks

• A benchmark is, in general, a set of tests used to assess the performance of computer systems

• Specifically, we will talk about Instance Matching (IM) benchmarks for Linked Data.

Page 3

Overview

• Introduction to Linked Data

• Instance Matching

• Benchmarks for Linked Data

– Why Benchmarks?

– Benchmark Characteristics

– Benchmark Dimensions

• Benchmarks in the literature

– Benchmark Systems

– Synthetic Benchmarks

– Real Benchmarks

• Summary & Conclusions

Page 4

Linked Data - The LOD Cloud

LOD cloud domains: Media, Government, Geographic, Publications, User-generated, Life sciences, Cross-domain

Page 5

Linked Data – The LOD Cloud

*Adapted from Suchanek & Weikum tutorial@SIGMOD 2013

The same entity can be described in different sources.

Page 6

Different Descriptions of Same Entity in Different Sources

"Riva del Garda description in GeoNames"

"Riva del Garda description in DBpedia"

Page 7

Overview

• Introduction to Linked Data

• Instance Matching

• Benchmarks for Linked Data

– Why Benchmarks?

– Benchmark Characteristics

– Benchmark Dimensions

• Benchmarks in the literature

– Benchmark Generators

– Synthetic Benchmarks

– Real Benchmarks

• Summary & Conclusions

Page 8

Instance Matching: the cornerstone for Linked Data

data acquisition, data evolution, data integration, open/social data

How can we automatically recognize multiple mentions of the same entity across or within sources? = Instance Matching

Page 9

Instance Matching

• The problem has been studied in Computer Science for more than half a century [EIV07]

• Traditional instance matching over relational data (known as record linkage)

Title | Genre | Year | Director
Troy | Action | 2004 | Petersen
Troj | History | (missing) | Petersen

(contradiction in Genre; missing value in Year)

Nicely and homogeneously structured data; value variations.

Typically few sources are compared.
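Such value variations are typically handled with string-similarity measures. As a minimal, hypothetical sketch (not part of the tutorial), the two title values can be compared in Python like this:

    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        # Similarity ratio in [0, 1]; 1.0 means identical strings
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    print(similarity("Troy", "Troj"))        # 0.75 -- likely the same movie
    print(similarity("Troy", "Casablanca"))  # low score -- different movies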

Page 10

Web Data Instance Matching – « The Early Days »

• IM algorithms for the semi-structured XML model, used to represent and exchange data.

Example: two XML trees describing the same movie

m1 (movie): t1 (title) = "Troy"; y1 (year) = 2004; s1 (set): a11 (actor) = "Brad Pitt", a12 (actor) = "Eric Bana"

m2 (movie): t2 (title) = "Troja"; y2 (year) = 04; s2 (set): a21 (actor) = "Brad Pit", a22 (actor) = "Erik Bana", a23 (actor) = "Brian Cox"

Solutions assume one common schema

Structural variation

Page 11

Instance Matching Today

Sets of RDF/OWL triples

*Adapted from Suchanek & Weikum tutorial@SIGMOD 2013

Many sources to match; rich semantics; value, structure, and logical variations

Page 12

Need for IM techniques

• People interconnect their datasets with existing ones.

– These links are often manually curated (or semi-automatically generated).

• The size and number of datasets are huge, so it is vital to automatically detect additional links, making the graph denser.

Page 13

Benchmarking

Instance matching research has led to the development of various systems.

–How to compare these?

–How can we assess their performance?

–How can we push the systems to get better?

These systems need to be benchmarked!

Page 14

Overview

• Introduction to Linked Data

• Instance Matching

• Benchmarks for Linked Data

– Why Benchmarks?

– Benchmark Characteristics

– Benchmark Dimensions

• Benchmarks in the literature

– Benchmark Generators

– Synthetic Benchmarks

– Real Benchmarks

• Summary & Conclusions

Page 15

Benchmarking

“A Benchmark specifies a workload characterizing typical applications in the specific domain. The performance of various computer systems on this workload gives a rough estimate of their relative performance on that problem domain”

[G92]

Page 16

Instance Matching Benchmark Ingredients [FLM08]

Organized into test cases, each addressing a different kind of requirement:

• Datasets

The raw material of the benchmark. These are the source and the target datasets that are matched against each other to find the links.

• Gold Standard (Ground Truth / Reference Alignment)

The “correct answer sheet” used to judge the completeness and soundness of the instance matching algorithms.

• Metrics

The performance metric(s) that quantify a system's behavior and performance.

Page 17

Datasets Characteristics

Nature of data (Real vs. Synthetic)

Schema (Same vs. Different)

Domain (dependent vs. independent)

Language (One vs. Multiple)

Page 18

Real vs. Synthetic Datasets

Real datasets :

– Realistic conditions for heterogeneity problems

– Realistic distributions

– Error prone Reference Alignment

Synthetic datasets:

– Fully controlled test conditions

– Accurate Gold Standards

– Unrealistic distributions

– Systematic heterogeneity problems

Page 19

Data Variations in Datasets

Value Variations

Structural Variations

Logical Variations

Combination of the variations

Multilingual variations

Page 20

Variations

Value

- Name style abbreviation

- Typographical errors

- Change format (date/gender/number)

- Synonym Change

- Multilingualism

Structural

- Change property depth
- Delete/Add property
- Split property values
- Transformation of object to data type property
- Transformation of data to object type property

Logical

- Delete/Modify class assertions
- Invert property assertions
- Change property hierarchy
- Assert disjoint classes

[FMN+11]


Page 21

Gold Standard Characteristics

Existence of errors / missing alignments

Representation

(owl:sameAs / skos:exactMatch)
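As a minimal illustration (assumed, not from the tutorial), such a gold standard link can be represented with rdflib in Python; the instance URIs below are hypothetical placeholders for two descriptions of the same entity:

    from rdflib import Graph, URIRef
    from rdflib.namespace import OWL, SKOS

    # Hypothetical URIs for two descriptions of the same entity
    src = URIRef("http://example.org/geonames/riva-del-garda")
    tgt = URIRef("http://dbpedia.org/resource/Riva_del_Garda")

    g = Graph()
    g.add((src, OWL.sameAs, tgt))          # one reference alignment entry
    # g.add((src, SKOS.exactMatch, tgt))   # alternative representation
    print(g.serialize(format="turtle"))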

Page 22

Metrics: Recall / Precision / F-measure

Comparing the result set against the gold standard yields true positives (TP), false positives (FP), and false negatives (FN).

Recall r = TP / (TP + FN)

Precision p = TP / (TP + FP)

F-measure f = 2 * p * r / (p + r)
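A minimal sketch (assuming the result set and the gold standard are sets of (source, target) pairs) that computes these metrics:

    def evaluate(result: set, gold: set):
        """Precision, recall and F-measure of a result set against the gold standard."""
        tp = len(result & gold)              # true positives
        fp = len(result - gold)              # false positives
        fn = len(gold - result)              # false negatives
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    gold = {("s:a", "t:1"), ("s:b", "t:2"), ("s:c", "t:3")}
    result = {("s:a", "t:1"), ("s:b", "t:9")}
    print(evaluate(result, gold))            # (0.5, 0.333..., 0.4)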

Page 23

Benchmark Criteria

Systematic Procedure: matching tasks are reproducible and their execution is comparable.

Availability: the benchmark remains available over time.

Quality: precise evaluation rules and high-quality ontologies.

Equity: no system is privileged during the evaluation process.

Dissemination: how many systems have been evaluated with the benchmark.

Volume: how many instances the datasets contain.

Gold Standard: existence of a gold standard and its accuracy.

Page 24

Benchmarking

• Instance matching techniques have, until recently, been benchmarked in an ad-hoc way.

• There does not exist a standard way of benchmarking the performance of the systems when it comes to Linked Data.

Page 25

Ontology Alignment Evaluation Initiative

• On the other hand, IM benchmarks have been mainly driven forward by the Ontology Alignment Evaluation Initiative (OAEI)

– has organized an annual campaign for ontology matching since 2005

– hosts independent benchmarks

• In 2009, OAEI introduced the Instance Matching (IM) Track

– focuses on the evaluation of different instance matching techniques and tools for Linked Data

Page 26

Overview

• Introduction to Linked Data

• Instance Matching

• Benchmarks for Linked Data

– Why Benchmarks?

– Benchmark Characteristics

– Benchmark Dimensions

• Benchmarks in the literature

– Benchmark Systems

– Synthetic Benchmarks

– Real Benchmarks

• Summary & Conclusions

Page 27

Benchmark Systems

SWING
SPIMBENCH
LANCE

Page 28

Semantic Web Instance Generation (SWING 2010) [FMN+11]

Semi-automatic generator of IM Benchmarks

• Contributed to the generation of the IIMB benchmarks of OAEI in 2010, 2011, and 2012

• Freely available (https://code.google.com/p/swing-generator/)

• All kinds of variations are contained in the benchmarks (apart from multilingualism)

• Automatically created Gold Standard

Page 29

SWING phases

Data Acquisition

• Data Selection

• Ontology Enrichment

Data Transformation

• All kinds of variations

• Combination

Data Evaluation

• Creation of Gold Standard

• Testing
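To illustrate the kind of value transformation applied in the Data Transformation phase above, here is a minimal, hypothetical sketch (not SWING's actual implementation) that injects typographical errors into a string value:

    import random

    def typo_transform(value: str, error_rate: float = 0.3, seed: int = 7) -> str:
        """Randomly swap or drop characters to simulate typographical errors."""
        rng = random.Random(seed)
        chars = list(value)
        i = 0
        while i < len(chars) - 1:
            if chars[i].isalpha() and rng.random() < error_rate:
                if rng.random() < 0.5:
                    chars[i], chars[i + 1] = chars[i + 1], chars[i]  # swap neighbours
                else:
                    chars[i] = ""                                    # drop a character
            i += 2
        return "".join(chars)

    print(typo_transform("James Anthony Church"))   # e.g. a garbled variant of the name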

Page 30

SPIMBENCH [SDF+15]

• Based on the Semantic Publishing Benchmark (SPB) of the Linked Data Benchmark Council (LDBC)

• Generates synthetic benchmarks using the BBC ontologies.

• Deterministic, scalable data generation in the order of billions of triples

• Weighted gold standard

Page 31

Semantic Publishing Benchmark Ontologies

• Supports value, structural and logical transformations

• Full expressiveness of RDF/OWL language

– Complex class definitions (union, intersection)

– Complex property definitions (functional properties, inverse functional properties)

– Disjointness (properties)

• Downloadable from https://github.com/jsaveta/SPIMBench

Page 32

SPIMBENCH Architecture

Architecture components: SPB Data Generator Module (with SPB data generation parameters) producing the source data; Test Case Generator Module (with test case generation parameters) producing the target data and matched instances; RESCAL matcher and sampler; Weight Computation Module producing the weighted gold standard.

Page 33

LANCE [SDFF+15]

–Descendant of SPIMBENCH

–Domain-independent benchmark generator

–LANCE supports:

• Semantics-aware transformations

• Standard value and structure based transformations

• Weighted gold standard

–Downloadable from https://github.com/jsaveta/Lance

Page 34

LANCE Architecture

Architecture components: Data Ingestion Module loading the source data & ontology (SPB, DBpedia, UOBM, etc.) into an RDF repository; Test Case Generator Module (with test case generation parameters) producing the target data and matched instances; RESCAL matcher and sampler; Weight Computation Module producing the weighted gold standard.

Page 35

Overview

• Introduction to Linked Data

• Instance Matching

• Benchmarks for Linked Data

– Why Benchmarks?

– Benchmark Characteristics

– Benchmark Dimensions

• Benchmarks in the literature

– Benchmark Generators

– Synthetic Benchmarks

– Real Benchmarks

• Summary & Conclusions

Page 36

Synthetic Benchmarks

OAEI IIMB 2009
OAEI IIMB 2010
OAEI Persons-Restaurants 2010
ONTOBI 2010
OAEI IIMB 2011
Sandbox 2012
OAEI IIMB 2012
OAEI RDFT 2013
ID-REC Task 2014
SPIMBENCH 2015
Author Task 2015

Page 37

OAEI IIMB (2009) [EFH+09]

First attempt to create an IM benchmark with a synthetic dataset

• Datasets

– OKKAM project data containing actors, sports persons, and business firms

– Number of instances up to ~200

– Shallow ontology (max depth = 2)

– Small RDF/OWL ontology comprising 6 classes and 47 datatype properties

• Test Cases (divided into 37 test cases)

– Test cases 2-10 include value variations (typographical errors, use of different formats)

– Test cases 11-19 include structural variations (property deletion, change of property types)

– Test cases 20-29 include logical variations (subClassOf assertions, modified class assertions)

– Test cases 30-37 include combinations of the above

• Gold Standard

– Automatically created gold standard

Page 38

Value Variations IIMB 2009

Property | Original Instance | Transformed Instance
type | “Actor” | “Actor”
wikipedia-name | “James Anthony Church” | “qJaes Anthnodziurcdh”
cogito-Name | “Tony Church” | “Toty fCurch”
cogito-description | “James Anthony Church (Tony Church) (May 11, 1930 - March 25, 2008) was a British Shakespearean actor, who has appeared on stage and screen” | “Jpes Athwobyi tuscr(nTonsCourh)pMa y1sl1,9 3i- mrc 25, 200hoa s Bahirtishwaksepearnactdor, woh hmwse appezrem yonytmlaenn dscerepnq”

Variation shown: typographical errors

*Triples in the form (property, object)

Page 39

Structural Variations IIMB 2009

Original Instance | Transformed Instance
type (uri1, “Actor”) | type (uri2, “Actor”)
cogito-Name (uri1, “Wheeler Dryden”) | cogito-Name (uri2, “Wheeler Dryden”)
cogito-first_sentence (uri1, “George Wheeler Dryden (August 31, 1892 in London - September 30, 1957 in Los Angeles) was an English actor and film director, the son of Hannah Chaplin and” ...) | cogito-first_sentence (uri2, uri3) + hasDataValue (uri3, “George Wheeler Dryden (August 31, 1892 in London - September 30, 1957 in Los Angeles) was an English actor and film director, the son of Hannah Chaplin and” ...)
cogito-tag (uri1, “Actor”) | cogito-tag (uri2, uri4) + hasDataValue (uri4, “Actor”)

*Triples in the form property (subject, object)

Page 40

Logical Variations IIMB 2009

Property name | Original instance | Transformed instance
type | “Sportsperson” | owl:Thing
wikipedia-name | “Sammy Lee” | “Sammy Lee”
cogito-first_sentence | “Dr. Sammy Lee (born August 1, 1920 in Fresno, California) is the first Asian American to win an Olympic gold…” | “Dr. Sammy Lee (born August 1, 1920 in Fresno, California) is the first Asian American to win an Olympic gold…”
cogito-tag | “Sportperson” | “Sportperson”
cogito-domain | “Sport” | “Sport”

Axiom used: Sportsperson subClassOf Thing

*Triples in the form (property, object)

Page 41

Gold Standard IIMB 2009

– RDF/XML file

– Pairs of mapped instances

<Cell>
  <entity1 rdf:resource="http://www.okkam.org/ens/id1"/>
  <entity2 rdf:resource="http://islab.dico.unimi.it/iimb/abox.owl#ID3"/>
  <measure rdf:datatype="http://www.w3.org/2001/XMLSchema#float">1.0</measure>
  <relation>=</relation>
</Cell>
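A minimal, hypothetical sketch of how such an RDF/XML alignment file could be read in Python (namespace handling is simplified and assumed):

    import xml.etree.ElementTree as ET

    RDF_NS = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

    def local(tag: str) -> str:
        # '{namespace}Cell' -> 'Cell'
        return tag.rsplit("}", 1)[-1]

    def read_alignment(path: str):
        """Yield (entity1, entity2, measure) for every Cell in the alignment file."""
        tree = ET.parse(path)
        for cell in (e for e in tree.iter() if local(e.tag) == "Cell"):
            e1 = e2 = None
            measure = 1.0
            for child in cell:
                name = local(child.tag)
                if name == "entity1":
                    e1 = child.get(RDF_NS + "resource")
                elif name == "entity2":
                    e2 = child.get(RDF_NS + "resource")
                elif name == "measure":
                    measure = float(child.text)
            yield (e1, e2, measure)

    # usage: gold = {(e1, e2) for e1, e2, _ in read_alignment("refalign.rdf")}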

Page 42

Systems- Results IIMB 2009

*Source OAEI 2009 http://oaei.ontologymatching.org/2009/results/oaei2009.pdf

Balanced benchmark - shows both good and bad results from systems.

Page 43

Overview IIMB 2009

Characteristics: Systematic Procedure, Quality, Equity, Volume, Dissemination, Availability, Ground Truth

Variations: Value Variations, Structural Variations, Logical Variations (limited), Multilinguality

Volume: ~200 instances; Dissemination: 6 systems

Page 44

OAEI IIMB (2010) [EFM+10]

• Datasets

– Freebase ontology – domain independent

– Implemented in a small version with ~350 instances and a large version with ~1400 instances

– OWL ontologies consisting of 29 classes (81 for the large version), 32 object properties, 13 data properties

– Shallow ontology with max depth=3

– Created using the SWING Benchmark Generator [FMN+11]

• Test cases (divided into 80 test cases)

– Test cases 1-20 containing Value variations

– Test cases 21-40 containing Structural variations

– Test cases 41-60 containing Logical variations

– Test cases 61-80 Combination of the above

• Gold Standard

– Automatically created Gold Standards (same format as IIMB 2009)

Page 45

Value Variations IIMB (2010)

Variation | Original Instance | Transformed Instance
Typographical errors | “Luke Skywalker” | “L4kd Skiwaldek”
Date format | 1948-12-21 | December 21, 1948
Name format | “Samuel L. Jackson” | “Jackson, S.L.”
Gender format | “Male” | “M”
Synonyms | “Jackson has won multiple awards (...).” | “Jackson has gained several prizes (…).”
Integer | 10 | 110
Float | 1.3 | 1.30
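For example, the date format variation above can be reproduced with a few lines of Python (an assumed sketch, not the benchmark's code):

    from datetime import datetime

    def reformat_date(value: str) -> str:
        # '1948-12-21' -> 'December 21, 1948'
        d = datetime.strptime(value, "%Y-%m-%d")
        return f"{d.strftime('%B')} {d.day}, {d.year}"

    print(reformat_date("1948-12-21"))   # December 21, 1948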

Page 46

Structural Variations IIMB (2010) [FMN+11]

Original Instance:
  name (uri1, “Natalie Portman”)
  born_in (uri1, uri2)
  name (uri2, “Jerusalem”)
  gender (uri1, “Female”)
  date_of_birth (uri1, “1981-06-09”)

Transformed Instance:
  name (uri3, “Natalie”)
  name (uri3, “Portman”)
  born_in (uri3, uri4)
  name (uri4, “Jerusalem”)
  name (uri4, “Aukland”)
  obj_gender (uri3, uri5)
  has_value (uri5, “Female”)

*Triples in the form property (subject, object)

Page 47

Logical Variations IIMB (2010)

Original Values:
  Character(uri1)
  Creature(uri2)
  Creature(uri3)
  created_by(uri1, uri2)
  acted_by(uri1, uri3)
  name(uri1, “Luke Skywalker”)
  name(uri1, “George Lucas”)
  name(uri1, “Mark Hamill”)

Transformed Values:
  Creature(uri4)
  Creature(uri5)
  Thing(uri6)
  creates(uri5, uri4)
  featuring(uri4, uri6)
  name(uri4, “Luke Skywalker”)
  name(uri4, “George Lucas”)
  name(uri4, “Mark Hamill”)

Axioms used: Character subClassOf Creature; created_by inverseOf creates; acted_by subPropertyOf featuring; Creature subClassOf Thing

*Triples in the form property (subject, object)

Page 48

Systems Results OAEI 2010 (large version)

*Source OAEI 2010 Results http://disi.unitn.it/~p2p/OM-2010/oaei10_paper0.pdf

The closer the benchmark gets to reality, the more challenging it becomes.

Page 49

Overview IIMB 2010

Characteristics: Systematic Procedure, Quality, Equity, Volume, Dissemination, Availability, Ground Truth

Variations: Value Variations, Structural Variations, Logical Variations, Multilinguality

Volume: ~1400 instances; Dissemination: 3 systems

Page 50

OAEI Persons & Restaurants Benchmark (2010) [EFM+10]

First benchmark that includes cluster matchings (1-n matchings)

• Datasets

– Febrl project about Persons

– Fodor’s and Zagat’s restaurant guides about Restaurants

– Same Schemata

• TestCases

– Person 1 ~500 instances (Max. 1 mod./property)

– Person 2 ~600 instances (Max 3 mod./property and max 10 mod./instance)

– Restaurant ~860 instances

• Variations

– Combination of Value and Structural variations

• Gold Standard

– Automatically created gold standard (same format as IIMB 2009)

– 1-N matching in Person 2

Page 51

Systems Results PR 2010

*Source OAEI 2010 Results http://disi.unitn.it/~p2p/OM-2010/oaei10_paper0.pdf

F-Measure

1. The more variations are added, the worse the systems perform.
2. Some systems could not cope with the 1-n mappings requirement.

Page 52

Overview PR 2010

Characteristics: Systematic Procedure, Quality, Equity, Volume, Dissemination, Availability, Ground Truth

Variations: Value Variations, Structural Variations, Logical Variations, Multilinguality

Volume: ~860 instances; Dissemination: 6 systems

Page 53

ONTOlogy matching Benchmark with many Instances (ONTOBI) [Z10]

Synthetic Benchmark

• Datasets

– RDF/OWL benchmark created by extracting data from DBpedia v. 3.4

– 205 classes, 1144 object properties and 1024 datatype properties

– 13,704 instances

• Divided into 16 Test cases

• Variations

– Value variations

– Structural variations

– Combination of the above

• Ground Truth

– Automatically created Gold Standard

Page 54

ONTOBI Variations

Simple Variations

Spelling mistakes (Value Variations)

Change format (Value Variation)

Suppressed Comments

(Structural Variation)

Delete data types (Structural Variation)

Page 55

ONTOBI Variations

Complex Variations

Flatten/Expand Structure

(Structural Variation)

Language modification

(Value Variation)

Random names (Value Variation)

Synonyms (Value Variation)

Disjunct Dataset (Value Variation)

Page 56

ONTOBI Systems & Results

MICU system

*Figure source: K. Zaiß, Instance-Based Ontology Matching and the Evaluation of Matching Systems, Dissertation, 2011

Page 57

Overview ONTOBI 2010

Characteristics: Systematic Procedure, Quality, Equity, Volume, Dissemination, Availability, Ground Truth

Variations: Value Variations, Structural Variations, Logical Variations, Multilinguality

Volume: ~13700 instances; Dissemination: 1 system

Page 58

OAEI IIMB (2011) [EHH+11]

• Datasets

– Freebase Ontology- Domain independent.

– OWL ontologies consisting of 29 concepts, 20 object properties, 12 data properties

– ~4000 instances

– Created using the SWING Tool

• Test Cases (divided into 80 test cases)

– Test cases 1-20 containing Value variations

– Test cases 21-40 containing Structural variations

– Test cases 41-60 containing Logical variations

– Test cases 61-80 Combination of the above

• Ground Truth

– Automatically created Gold Standard (same format as IIMB 2009)

Page 59

System Results IIMB 2011

Test | Precision | F-measure | Recall
001–010 | 0.94 | 0.84 | 0.76
011–020 | 0.94 | 0.87 | 0.81
021–030 | 0.89 | 0.79 | 0.70
031–040 | 0.83 | 0.66 | 0.55
041–050 | 0.86 | 0.72 | 0.62
051–060 | 0.83 | 0.72 | 0.64
061–070 | 0.89 | 0.59 | 0.44
071–080 | 0.73 | 0.33 | 0.21

CODI system results

The closer the benchmark gets to reality, the more challenging it becomes.

Page 60

Overview IIMB 2011

Characteristics: Systematic Procedure, Quality, Equity, Volume, Dissemination, Availability, Ground Truth

Variations: Value Variations, Structural Variations, Logical Variations, Multilinguality

Volume: ~4000 instances; Dissemination: 1 system

Page 61

OAEI Sandbox (2012) [AEE+12]

• Datasets

– Freebase Ontology- Domain independent

– Collection of OWL files consisting of 31 concepts, 36 object properties, 13 data properties

– ~375 instances

• Test Cases (divided into 10 test cases)

– All 10 test cases contain value variations

• Ground Truth

– Automatically created Gold Standard (same format as IIMB 2009)

Goal: attract new systems

Page 62

Systems Results Sandbox 2012

Systems/Results | Precision | Recall | F-measure
LogMap | 0.94 | 0.94 | 0.94
LogMap Lite | 0.95 | 0.89 | 0.92
SBUEI | 0.95 | 0.98 | 0.96

Simple tests – Very good Results

Page 63

Overview Sandbox 2012

Characteristics: Systematic Procedure, Quality, Equity, Volume, Dissemination, Availability, Ground Truth

Variations: Value Variations, Structural Variations, Logical Variations, Multilinguality

Volume: ~375 instances; Dissemination: 3 systems

Page 64

OAEI IIMB (2012) [AEE+12]

Enhanced Sandbox Benchmarks

• Datasets

– Freebase Ontology- Domain independent

– Volume ~1500 instances

– Generated using the SWING Benchmark Generator

• Test Cases (Divided into 80 test cases)

– Test cases 1-20 containing Value variations

– Test cases 21-40 containing Structural variations

– Test cases 41-60 containing Logical variations

– Test cases 61-80 Combination of the above

• Ground Truth

– Automatically created Gold Standard (same format as IIMB 2009)

Page 65

IIMB 2012 Systems & Results

*Source OAEI 2012 Results http://oaei.ontologymatching.org/2012/results/oaei2012.pdf

Systems show a drop in F-measure for combinations of variations

Page 66

Overview IIMB 2012

Characteristics: Systematic Procedure, Quality, Equity, Volume, Dissemination, Availability, Ground Truth

Variations: Value Variations, Structural Variations, Logical Variations, Multilinguality

Volume: ~1500 instances; Dissemination: 4 systems

Page 67

OAEI RDFT (2013) [GDE+13]

First synthetic Benchmark with language variations

First synthetic Benchmark with Blind Evaluation

• Datasets

– RDF benchmark created by extracting data from DBpedia

– 430 instances, 11 RDF properties and 1744 triples

– Use of same schemata

• Test Cases (Divided into 5 test cases)

– Test case 1 contains Value variations

– Test case 2 contains Structural variations

– Test case 3 contains Language variations for comments and labels (English – French)

– Test cases 4-5 contain combinations of the above variations

• Gold Standard

– Automatically created Gold Standard (same format as IIMB 2009)

– Cardinality 1-n matchings for test case 5

Page 68

*Source OAEI 2013 Results http://ceur-ws.org/Vol-1111/oaei13_paper0.pdf

RDFT Systems - Results

1. Systems can cope with multilingualism.
2. Slight drop of the F-measure for cluster mappings (apart from RiMOM).

Page 69

Overview RDFT 2013

Characteristics: Systematic Procedure, Quality, Equity, Volume, Dissemination, Availability, Ground Truth

Variations: Value Variations, Structural Variations, Logical Variations, Multilinguality

Volume: ~430 instances; Dissemination: 4 systems

Page 70

OAEI ID-REC track (2014) [DEE14]

– 1 test case: match books from the source dataset to the target dataset

– The benchmark contains ~2500 instances

– Transforms the structured information into an unstructured version of the same information.

Page 71

System Results

Systems/Results | Precision | Recall | F-measure
InsMT | 0.0008 | 0.7785 | 0.0015
InsMTL | 0.0008 | 0.7785 | 0.0015
LogMap | 0.6031 | 0.0540 | 0.0991
LogMap-C | 0.6421 | 0.0417 | 0.0783
RiMOM-IM | 0.6491 | 0.4894 | 0.5581

Systems show either high precision and low recall or the opposite (apart from RiMOM-IM)

Page 72

Overview OAEI ID-REC Track

Characteristics: Systematic Procedure, Quality, Equity, Volume, Dissemination, Availability, Ground Truth

Variations: Value Variations, Structural Variations, Logical Variations, Multilinguality

Volume: ~2500 instances; Dissemination: 5 systems

Page 73

OAEI SPIMBENCH (2015) [CDE+15]

• Created from the SPIMBENCH System

• Contains 3 test cases:

– value-semantics ("val-sem"),

– value-structure ("val-struct"), and

– value-structure-semantics ("val-struct-sem")

• Volumes: sandbox – 10K instances; mainbox – 100K instances.

• First synthetic benchmark that tackles both scalability and logical variations

• First synthetic benchmark that contains OWL constructs beyond the standard ones

Page 74

OAEI SPIMBENCH

Val-Struct-Sem | Precision | Recall | F-measure
STRIM | 0.92 | 0.99 | 0.95
LogMap | 0.99 | 0.79 | 0.88

Val-Struct | Precision | Recall | F-measure
STRIM | 0.99 | 0.99 | 0.99
LogMap | 0.99 | 0.82 | 0.90

Val-Sem | Precision | Recall | F-measure
STRIM | 0.91 | 0.99 | 0.95
LogMap | 0.99 | 0.86 | 0.92

Page 75

Overview OAEI SPIMBENCH

Characteristics: Systematic Procedure, Quality, Equity, Volume, Dissemination, Availability, Ground Truth

Variations: Value Variations, Structural Variations, Logical Variations, Multilinguality

Volume: ~100K instances; Dissemination: 2 systems

Page 76

OAEI Author Task (2015) [CDE+15]

Two test cases:

• Author Disambiguation (author-dis)

– Find same authors based on their publications

• Author Recognition (author-rec)

– Associate authors with publications

• Both test cases show strong value and structural complexities

– Author and publication information is described in a different way.

• Abbreviations of author names and/or the initial part of publication titles.

– Class “Publication report” containing aggregated information, e.g. number of publications, years of activity, and number of citations.

• Shows similarities with ID-REC track 2014

Page 77

OAEI Author Task

author-rec | Precision | Recall | F-measure
Exona | 0.41 | 0.41 | 0.41
InsMT+ | 0.25 | 0.03 | 0.05
Lily | 0.99 | 0.99 | 0.99
LogMap | 0.99 | 1.0 | 0.99
RiMOM | 0.99 | 0.99 | 0.99

author-dis | Precision | Recall | F-measure
Exona | 0.0 | NaN | 0.0
InsMT+ | 0.76 | 0.66 | 0.71
Lily | 0.96 | 0.96 | 0.96
LogMap | 0.99 | 0.83 | 0.91
RiMOM | 0.91 | 0.91 | 0.91

Systems appear to be more mature here than in the ID-REC track of 2014!

Page 78

Overview OAEI Author Task

Characteristics: Systematic Procedure, Quality, Equity, Volume, Dissemination, Availability, Ground Truth

Variations: Value Variations, Structural Variations, Logical Variations, Multilinguality

Volume: ~10K instances; Dissemination: 5 systems

Page 79

Comparison of synthetic Benchmarks

Page 80

Overview

• Introduction to Linked Data

• Instance Matching

• Benchmarks for Linked Data

– Why Benchmarks?

– Benchmark Characteristics

– Benchmark Dimensions

• Benchmarks in the literature

– Benchmark Generators

– Synthetic Benchmarks

– Real Benchmarks

• Summary & Conclusions

Page 81

Real Benchmarks

ARS (OAEI 2009)
DI (OAEI 2010)
DI-NYT (OAEI 2011)

Page 82

AKT-Rexa-DBLP (ARS - OAEI 2009) [EFH+09]

• Datasets

– AKT-Eprints archive - information about papers produced within the AKT project.

– Rexa dataset – computer science research literature: data about people, organizations, venues, and research communities

– SWETO-DBLP dataset - publicly available dataset listing publications from the computer science domain.

– All three datasets were structured using the same schema - SWETO-DBLP ontology

• Test cases (Value/Structural variations)

– AKT / Rexa

– AKT /DBLP

– Rexa / DBLP

• Challenges

– Many instances (almost 1M instances)

– Ambiguous labels (person names and paper titles) and

– Noisy data (some sources contained incorrect information)

Page 83

ARS Data Statistics

• Dataset Statistics

– AKT-Eprints: 564 foaf:Person and 283 sweto:Publication instances

– Rexa: 11,050 foaf:Person and 3,721 sweto:Publication instances

– SWETO-DBLP: 307,774 foaf:Person and 983,337 sweto:Publication instances

• Ground Truth

– Manually constructed - Error prone Reference Alignment

– AKT-REXA contains 777 overall mappings

– AKT-DBLP contains 544 overall mappings

– REXA-DBLP contains 1540 overall mappings

Page 84

ARS Systems & Results

*Source OAEI results 2009 http://ceur-ws.org/Vol-551/oaei09_paper0.pdf

1. Scalability issues for some of the systems.
2. Structural variations in person names lower the F-measure of the systems.

Page 85

Overview ARS

Characteristics: Systematic Procedure, Quality, Equity, Volume, Dissemination, Availability, Ground Truth

Variations: Value Variations, Structural Variations, Logical Variations, Multilinguality

Volume: ~1M instances; Dissemination: 5 systems

Page 86

Data Interlinking (OAEI 2010) [EFM+10]

The first real benchmark that contained semi-automatically created reference alignments

• Datasets

– DailyMed - Provides marketed drug labels containing 4308 drugs

– Diseasome - Contains information about 4212 disorders and genes

– DrugBank - Is a repository of more than 5900 drugs approved by the US FDA

– SIDER - Contains information on marketed medicines (996 drugs) and their recorded adverse drug reaction (4192 side effects).

• Reference Alignments

– Semi-automatically created reference alignments

– Generated by running the tests with the Silk and LinQuer systems

Page 87

DI Results

*Source OAEI 2010 Results http://disi.unitn.it/~p2p/OM-2010/oaei10_paper0.pdf

1. Provides a reliable mechanism for the evaluation of systems.
2. Helps improve the performance of matching systems.

Page 88

Overview DI 2010

Characteristics: Systematic Procedure, Quality, Equity, Volume, Dissemination, Availability, Ground Truth

Variations: Value Variations, Structural Variations, Logical Variations, Multilinguality

Volume: ~6000 instances; Dissemination: 2 systems

Page 89

Data Integration (OAEI 2011) [EHH+11]

• Datasets

– New York Times

– DBpedia

– Freebase

– Geonames

• Tests cases

– DBpedia locations

– DBpedia organizations

– DBpedia people

– Freebase locations

– Freebase organizations

– Freebase people

– Geonames

• Reference Alignments

– Based on the links present in the datasets

– Provided matches are accurate but may not be complete

New York Times Subject headings

Page 90

Data Integration – New York Times

                     People   Organizations   Locations
# NYT resources        9958            6088        3840
# Links to Freebase    4979            3044        1920
# Links to DBpedia     4977            1949        1920
# Links to Geonames       0               0        1789

Page 91

DI Results

*Source: OAEI 2010, http://oaei.ontologymatching.org/2010/vlcr/index.html

1. Good results from all the systems

2. Well-known domain and datasets

3. No logical variations

Page 92

Overview DI 2011

[Slide shows the benchmark-characteristics overview table for DI 2011: Characteristics (Systematic Procedure, Quality, Equity, Volume, Dissemination, Availability, Ground Truth) and Variations (Value, Structural, Logical, Multilinguality).]

Page 93

Comparison of Real Benchmarks

Page 94

Overview

• Introduction into Linked Data

• Instance Matching

• Benchmarks for Linked Data

– Why Benchmarks?

– Benchmarks Characteristics

– Benchmarks Dimensions

• Benchmarks in the literature

– Benchmark Systems

– Synthetic Benchmarks

– Real Benchmarks

• Summary and Conclusions

Page 95

Wrapping up: Benchmarks

Which benchmarks included multilingual datasets?

OAEI RDFT 2013 (French-English)

ID-REC 2014 (English-Italian)

Author Task 2015 (English-Italian)

Page 96

Wrapping up: Benchmarks

Which benchmarks included value variations in the test cases?

OAEI IIMB 2009

OAEI IIMB 2010

OAEI Persons-Restaurants 2010

ONTOBI

OAEI IIMB 2011

Sandbox 2012

OAEI IIMB 2012

OAEI RDFT 2013

ID-REC 2014

SPIMBENCH 2015

Author Task 2015

ARS

DI 2010

DI 2011

(An illustrative sketch of such value variations follows below.)
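The sketch below illustrates the kind of value variations a synthetic test-case generator applies to literals, such as character swaps, token abbreviation, and date reformatting. It is an illustrative toy under these assumptions, not the actual IIMB or SPIMBENCH transformation code.

```python
# Illustrative sketch of value variations (not the actual IIMB/SPIMBENCH code):
# a random character swap, token abbreviation, and a date-format change.
import random

def swap_characters(text, rng):
    """Introduce a typo by swapping two adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def abbreviate(text):
    """Shorten every token longer than three characters to its initial."""
    return " ".join(t[0] + "." if len(t) > 3 else t for t in text.split())

def reformat_date(iso_date):
    """Turn YYYY-MM-DD into DD/MM/YYYY."""
    year, month, day = iso_date.split("-")
    return f"{day}/{month}/{year}"

rng = random.Random(42)                      # fixed seed for repeatability
print(swap_characters("Anissaras, Crete", rng))
print(abbreviate("Institute of Computer Science"))
print(reformat_date("2016-05-30"))
```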

Page 97

Wrapping up: Benchmarks

Which benchmarks included structural variations in the test cases?

OAEI IIMB 2009

OAEI IIMB 2010

OAEI Persons-Restaurants 2010

ONTOBI

OAEI IIMB 2011

OAEI IIMB 2012

OAEI RDFT 2013

ID-REC 2014

SPIMBENCH 2015

Author Task 2015

ARS

DI 2010

DI 2011

Page 98

Wrapping up: Benchmarks

Which benchmarks included logical variations in the test cases?

OAEI IIMB 2009

OAEI IIMB 2010

OAEI IIMB 2011

OAEI IIMB 2012

SPIMBENCH 2015

Page 99

Wrapping up: Benchmarks

Which benchmarks included a combination of variations in the test cases?

IIMB 2009 IIMB 2010 IIMB 2011

IIMB 2012 RDFT 2013 ID-REC 2014

SPIMBENCH 2015

Author Task 2015

Page 100

Wrapping up: Benchmarks

Which benchmarks are the most voluminous?

SPIMBENCH 2015

ARS

DI 2011

Page 101

Wrapping up: Benchmarks

Which benchmark included both a combination of variations and was voluminous at the same time?

SPIMBENCH 2015

Page 102

Open Issues

• Issue 1:

Only one benchmark tackles both a combination of variations and scalability issues

• Issue 2:

Not enough IM benchmarks use the full expressiveness of the RDF/OWL languages
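As an illustration of the OWL expressiveness such benchmarks could exercise, here is a small rdflib sketch of a logical variation: two differently named instances share the value of an inverse-functional property, so a semantics-aware matcher (or a reasoner run beforehand) should derive owl:sameAs between them. The namespace and the ex:isbn property are made up for illustration.

```python
# Hedged sketch of a "logical variation": two differently named instances that a
# semantics-aware matcher (or an OWL reasoner run beforehand) should merge,
# because they share the value of an inverse-functional property.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF

EX = Namespace("http://example.org/")
g = Graph()

# Declaring ex:isbn inverse functional means: same ISBN value => same individual.
g.add((EX.isbn, RDF.type, OWL.InverseFunctionalProperty))
g.add((EX.book1, EX.isbn, Literal("978-3-16-148410-0")))
g.add((EX.book2, EX.isbn, Literal("978-3-16-148410-0")))

# Expected inference for an instance matcher that uses OWL semantics:
#   (EX.book1, OWL.sameAs, EX.book2)
print(g.serialize(format="turtle"))
```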

Page 103

Wrapping Up: Systems for Benchmarks

Outcomes as far as systems are concerned:

• Systems can handle value variations, structural variations, and simple logical variations separately.

• More work is needed for complex variations (combinations of value, structural, and logical variations)

• More work is needed for structural variations

• Systems need enhancements to cope with the clustering of mappings (1-n mappings); see the sketch below.
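A minimal sketch of the mapping clustering the last point refers to: pairwise matches are treated as edges and grouped into connected components with a small union-find, so that 1-n matches of the same real-world entity end up in one cluster. This is an illustrative implementation, not code taken from any of the benchmarked systems.

```python
# Minimal sketch: cluster pairwise matches into connected components so that
# 1-n (and n-m) mappings of the same entity land in a single cluster.
def cluster(pairs):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in pairs:
        union(a, b)

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())

# Toy example: one source record matched to two target records (1-n mapping).
print(cluster([("src:p1", "tgt:a"), ("src:p1", "tgt:b"), ("src:p2", "tgt:c")]))
```

Because the grouping is by transitive closure, chains of pairwise matches collapse into one cluster, which is exactly the behaviour needed when the same entity has several descriptions in the target dataset.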

Page 104

Conclusion

• Many instance matching benchmarks have been proposed

• Each of them answers some of the needs of instance matching systems.

• It is high time to start creating benchmarks that will “show the way to the future” and extend the limits of existing systems.

Page 105

Questions? Comments?

Thank you!

Page 106

References (1)

1. J. L. Aguirre, K. Eckert, J. Euzenat, A. Ferrara, W. R. van Hage, L. Hollink, C. Meilicke, A. Nikolov, D. Ritze, F. Scharffe, P. Shvaiko, O. Svab-Zamazal, C. Trojahn, E. Jimenez-Ruiz, B. Cuenca Grau, and B. Zapilko. Results of the Ontology Alignment Evaluation Initiative 2012. In OM, 2012. [AEE+12]

2. I. Bhattacharya and L. Getoor. Entity Resolution in Graphs. Mining Graph Data. Wiley and Sons, 2006. [BG06]

3. J. Euzenat, A. Ferrara, L. Hollink, A. Isaac, C. Joslyn, V. Malaise, C. Meilicke, A. Nikolov, J. Pane, M. Sabou, F. Scharffe, P. Shvaiko, V. Spiliopoulos, H. Stuckenschmidt, O. Svab-Zamazal, V. Svatek, C. Trojahn, G. Vouros, and S. Wang. Results of the Ontology Alignment Evaluation Initiative 2009. In OM, 2009. [EFH+09]

4. J. Euzenat, A. Ferrara, C. Meilicke, J. Pane, F. Scharffe, P. Shvaiko, H. Stuckenschmidt, O. Svab-Zamazal, V. Svatek, and C. Trojahn. Results of the Ontology Alignment Evaluation Initiative 2010. In OM, 2010. [EFM+10]

5. J. Euzenat, A. Ferrara, W. R. van Hage, L. Hollink, C. Meilicke, A. Nikolov, D. Ritze, F. Scharffe, P. Shvaiko, H. Stuckenschmidt, O. Svab-Zamazal, and C. Trojahn. Results of the Ontology Alignment Evaluation Initiative 2011. In OM, 2011. [EHH+11]

6. A. K. Elmagarmid, P. Ipeirotis, and V. Verykios. Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering, 19(1), 2007. [EIV07]

7. J. Euzenat and P. Shvaiko, editors. Ontology Matching. Springer-Verlag, 2007. [ES07]

8. A. Ferrara, D. Lorusso, S. Montanelli, and G. Varese. Towards a Benchmark for Instance Matching. In OM, 2008. [FLM08]

9. A. Ferrara, S. Montanelli, J. Noessner, and H. Stuckenschmidt. Benchmarking Matching Applications on the Semantic Web. In ESWC, 2011. [FMN+11]

10. J. Gray, editor. The Benchmark Handbook for Database and Transaction Systems. Morgan Kaufmann, 1993. [G93]

Page 107

References (2)

11. B. Cuenca Grau, Z. Dragisic, K. Eckert, J. Euzenat, A. Ferrara, R. Granada, V. Ivanova, E. Jimenez-Ruiz, A. O. Kempf, P. Lambrix, A. Nikolov, H. Paulheim, D. Ritze, F. Scharffe, P. Shvaiko, C. Trojahn, and O. Zamazal. Results of the Ontology Alignment Evaluation Initiative 2013. In OM, 2013. [GDE+13]

12. A. J. G. Gray, P. Groth, A. Loizou, et al. Applying Linked Data Approaches to Pharmacology: Architectural Decisions and Implementation. Semantic Web, 2012. [GGL+12]

13. P. Hayes. RDF Semantics. www.w3.org/TR/rdf-mt, February 2004. [H04]

14. R. Isele and C. Bizer. Learning Linkage Rules Using Genetic Programming. In OM, 2011. [IB11]

15. A. Isaac, L. van der Meij, S. Schlobach, and S. Wang. An Empirical Study of Instance-Based Ontology Matching. In ISWC/ASWC, 2007. [IMS07]

16. E. Ioannou, N. Rassadko, and Y. Velegrakis. On Generating Benchmark Data for Entity Matching. Journal of Data Semantics, 2012. [IRV12]

17. A. Jentzsch, J. Zhao, O. Hassanzadeh, K.-H. Cheung, M. Samwald, and B. Andersson. Linking Open Drug Data. In Linking Open Data Triplification Challenge, I-SEMANTICS, 2009. [JZH+09]

18. C. Li, L. Jin, and S. Mehrotra. Supporting Efficient Record Linkage for Large Data Sets Using Mapping Techniques. In WWW, 2006. [LJM06]

19. D. L. McGuinness and F. van Harmelen. OWL Web Ontology Language. http://www.w3.org/TR/owl-features/, 2004. [MH04]

20. F. Manola and E. Miller. RDF Primer. www.w3.org/TR/rdf-primer, February 2004. [MM04]

21. M. Cheatham, Z. Dragisic, J. Euzenat, et al. Results of the Ontology Alignment Evaluation Initiative 2015. In Proc. 10th ISWC Workshop on Ontology Matching (OM), 2015. [CDE15]

Page 108

References (3)

22. J. Noessner, M. Niepert, C. Meilicke, and H. Stuckenschmidt. Leveraging Terminological Structure for Object Reconciliation. In ESWC, 2010. [NNM10]

23. A. Nikolov, V. Uren, E. Motta, and A. de Roeck. Refining Instance Coreferencing Results Using Belief Propagation. In ASWC, 2008. [NUM+08]

24. M. Perry. TOntoGen: A Synthetic Data Set Generator for Semantic Web Applications. AIS SIGSEMIS, 2(2), 2005. [P05]

25. E. Prud'hommeaux and A. Seaborne. SPARQL Query Language for RDF. www.w3.org/TR/rdf-sparql-query, January 2008. [PS08]

26. S. Wang, G. Englebienne, and S. Schlobach. Learning Concept Mappings from Instance Similarity. In ISWC, 2008, pp. 339-355. [WES08]

27. A. J. Williams, L. Harland, P. Groth, S. Pettifer, C. Chichester, E. L. Willighagen, C. T. Evelo, N. Blomberg, G. Ecker, C. Goble, and B. Mons. Open PHACTS: Semantic Interoperability for Drug Discovery. Drug Discovery Today, 17, 1188-1198, 2012. [WHG+12]

28. K. Zaiss, S. Conrad, and S. Vater. A Benchmark for Testing Instance-Based Ontology Matching Methods. In KMIS, 2010. [Z10]

29. J. Gray. Benchmark Handbook: For Database and Transaction Processing Systems. ISBN 1558601597, 1992. [G92]

30. T. Saveta, E. Daskalaki, G. Flouris, I. Fundulaki, M. Herschel, and A.-C. Ngonga Ngomo. Pushing the Limits of Instance Matching Systems: A Semantics-Aware Benchmark for Linked Data. In WWW, 2015. [SDF+15]

31. T. Saveta, E. Daskalaki, G. Flouris, I. Fundulaki, M. Herschel, and A.-C. Ngonga Ngomo. LANCE: Piercing to the Heart of Instance Matching Tools. In ISWC, 2015, pp. 375-391. [SDFF+15]

32. Z. Dragisic, K. Eckert, J. Euzenat, D. Faria, A. Ferrara, R. Granada, V. Ivanova, E. Jimenez-Ruiz, A. Oskar Kempf, P. Lambrix, S. Montanelli, H. Paulheim, D. Ritze, P. Shvaiko, A. Solimando, C. Trojahn, O. Zamazal, and B. Cuenca Grau. Results of the Ontology Alignment Evaluation Initiative 2014. In Proc. 9th ISWC Workshop on Ontology Matching (OM), 2014. [DEE14]

Page 109

Contact Information

Contact Information:

Evangelia Daskalaki - [email protected]

Tzanina Saveta - [email protected]

Irini Fundulaki - [email protected]

Melanie Herschel - [email protected]