SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data

Samur Araujo, Duc Thanh Tran, Arjen de Vries, Jan Hidders, Daniel Schwabe. Delft University of Technology. WebDB 2012.

Transcript of SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data

Page 1:

Delft University of Technology

SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data

Samur Araujo, Duc Thanh Tran, Arjen de Vries, Jan Hidders, Daniel Schwabe

Delft University of Technology, WebDB 2012

Page 2:

Me You

Page 3:

Apple Me

Page 4:

You

Page 5:

You

?

Ambiguous

Page 6:

Me

Page 7:

Me

Page 8:

Me

Page 9:

You

Page 10:

Round Shape

Green Color

Eatable

Spherical Shape

Red Color

Eatable

My Apple Your Apple

Page 11:

Shape

Color

Eatable

Shape

Color

Eatable

My Apple Your Apple

Page 12:

Fruit

Round Shape

Green Color

Eatable

Spherical Shape

Red Color

Eatable

My Apple Your Apple

Page 13:

My Apple Your Apple

Page 14:

Instance Matching

Source

Target

Page 15:

“Instance matching uses a direct comparison paradigm”.

Page 16:

Page 17:

Source

Target

Is your Apple like my Apple?

Hmm... Maybe!

Page 18:

Homogeneous data and schema.

Page 19:

The source and target descriptions overlap.

Target Source

Page 20:

Syntactic Overlap

Population = TotalPopulation

Page 21:

Semantic Overlap

Population = Num_Inhabitants

Page 22:

Web of Data: heterogeneous data and schema

Page 23:

None or limited overlap between schemas

Target Source

Page 24:

Instances do not properly instantiate the schema.

Target Source

Page 25:

Apple

Nutritional Information

Botanical Information

Page 26:

“The direct comparison paradigm does not apply.”

Target Source

Page 27:

Apple Me

Page 28:

Me

Apple Orange Pineapple

Page 29:

You

Apple

Me Orange

Pineapple

Page 30:

You

Page 31:

Food

Page 32:

Page 33:

Food Eatable

Page 34:

Source

Page 35:

My Apple Your Apple

My Orange Your Orange

My Pineapple Your Pineapple

Page 36:

Page 37:

“We use a class-based disambiguation paradigm …”

Page 38:

“We use a class-based disambiguation paradigm …” “… when there is no overlap between schemas.”

Page 39:

Instance Matching with SERIMI

Source

Target

Page 40:

Instance Representation

Page 41:

Instance Representation

(Instance, Predicate, Value)

Page 42:

Instance Representation

(Apple1, shape, Round)

(Apple1, title, Apple)

(Apple1, color, Red)

(Apple1, category, Eatable)

Page 43:

Instance Representation

(Apple1, shape, Round)

(Apple1, title, Apple)

(Apple1, color, Red)

(Apple1, category, Eatable)

P(X) = {p | (s, p, o) ∈ IR(G, X) ∧ s ∈ X},

D(X) = {o | (s, p, o) ∈ IR(G, X) ∧ s ∈ X ∧ o ∈ L},

O(X) = {o | (s, p, o) ∈ IR(G, X) ∧ s ∈ X ∧ o ∈ U},

T(X) = {(p, o) | (s, p, o) ∈ IR(G, X) ∧ s ∈ X},

where L is the set of literals and U is the set of URIs.

Page 44:

Instance Representation

(Apple1, shape, Round)

(Apple1, title, Apple)

(Apple1, color, Red)

(Apple1, category, Eatable)

P(X) = {shape, title, color, category},

D(X) = {Round, Apple, Red, Eatable},

O(X) = {},

T(X) = {(shape, Round), (title, Apple), (color, Red), (category, Eatable)}.
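
A minimal Python sketch of how the four sets could be derived for the Apple1 example. The triple list, the `is_uri` test (a leading "http"), and the function name `represent` are assumptions made for illustration and are not taken from the slides.

```python
# Hypothetical triples for the Apple1 example on this slide.
TRIPLES = [
    ("Apple1", "shape", "Round"),
    ("Apple1", "title", "Apple"),
    ("Apple1", "color", "Red"),
    ("Apple1", "category", "Eatable"),
]


def is_uri(value):
    # Assumption for this sketch: URIs start with "http", everything else is a literal.
    return value.startswith("http")


def represent(triples, instances):
    """Return the sets P, D, O and T for the set of instances X."""
    rows = [(s, p, o) for (s, p, o) in triples if s in instances]
    return {
        "P": {p for (_, p, _) in rows},                   # predicates
        "D": {o for (_, _, o) in rows if not is_uri(o)},  # literal values
        "O": {o for (_, _, o) in rows if is_uri(o)},      # URI values
        "T": {(p, o) for (_, p, o) in rows},              # predicate-value pairs
    }


rep = represent(TRIPLES, {"Apple1"})
print(rep["P"], rep["D"], rep["O"], rep["T"])
```

All four values in the example are literals, which is why O(X) comes out empty.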

Page 45:

Instance Representation

[P(hi), D(hi), O(hi), T(hi)]

Page 46:

Step 1: Cluster the source

Source

Cars

Fruits

Companies

Page 47:

Step 2: Blocking Key Selection

Key Selection

Source instances

Page 48:

Step 2: Blocking Key Selection

Key Selection

key

key key

Source instances

Page 49:

Step 2: Blocking Key Selection

Key Selection

key

key key

Source instances

e.g. Title

Page 50:

Step 3: Pseudo-Homonyms Builder

Title=apple

Title=orange

Title=pineapple

Pseudo-Homonyms Builder

Target

Page 51:

Step 3: Pseudo-Homonyms Builder

Pseudo-Homonyms Builder

Target

Source instances

Target

Pseudo-homonyms sets

Everything called Apple
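
The builder step can be sketched roughly as follows: for each source key value, collect every target instance that carries the same label. The target data, the identifiers `t1` to `t5`, and the exact case-insensitive match are assumptions for illustration; the slides do not say how the lookup against the target is performed.

```python
# Hypothetical target dataset: (instance id, label).
TARGET = [
    ("t1", "Apple"),      # e.g. the fruit
    ("t2", "Apple"),      # e.g. the company
    ("t3", "Orange"),
    ("t4", "Pineapple"),
    ("t5", "apple"),
]


def build_pseudo_homonyms(source_keys, target):
    """Map each source key value to the target instances sharing that label."""
    index = {}
    for instance_id, label in target:
        # Assumption: labels are compared case-insensitively and exactly.
        index.setdefault(label.casefold(), []).append(instance_id)
    return {key: index.get(key.casefold(), []) for key in source_keys}


sets = build_pseudo_homonyms(["apple", "orange", "pineapple"], TARGET)
print(sets)
```

Each value in `sets` is one pseudo-homonym set, for example everything called Apple, and still mixes instances of different classes.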

Page 52:

Step 4: Class-based disambiguation

Target

Pseudo-homonyms sets

Disambiguation

Class-based Disambiguator

Page 53:

Step 4: Class-based disambiguation

Target

Pseudo-homonyms sets

Disambiguation

Class-based Disambiguator

Source instances

Target

Pseudo-homonyms sets

Page 54:

Step 4: Class-based disambiguation

Page 55:

Step 4: Class-based disambiguation

Page 56:

Step 4: Class-based disambiguation

Pseudo-homonym sets and their instances:

H1: h11, h12, h13, h14

H2: h21, h22

H3: h31, h32, h33

Page 57:

Step 4: Class-based disambiguation

Pseudo-homonym sets and their instances:

H1: h11, h12, h13, h14

H2: h21, h22

H3: h31, h32, h33

[P(hi11), D(hi11), O(hi11), T(hi11)]

Page 58:

Instance Representation

(Apple1, shape, Round)

(Apple1, title, Apple)

(Apple1, color, Red)

(Apple1, category, Eatable)

P(X) = {shape, title, color, category},

D(X) = {Round, Apple, Red, Eatable},

O(X) = {},

T(X) = {(shape, Round), (title, Apple), (color, Red), (category, Eatable)}.

Page 59:

Step 4: Class-based disambiguation

Pseudo-homonym sets and the score of each instance:

H1: h11 = 0.98, h12 = 0.32, h13 = 0.32, h14 = 0.76

H2: h21 = 0.95, h22 = 0.53

H3: h31 = 0.94, h32 = 0.91, h33 = 0.87

Page 60:

Step 4: Class-based disambiguation

H1: h11, h12, h13, h14

H2: h21, h22

H3: h31, h32, h33

SetSim(A, B) = |A ∩ B| - ( (|A - B| + |B - A|) / (2 |A ∪ B|) )
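
A small Python sketch of SetSim, assuming the slide's formula reads as the size of the intersection minus the combined size of the two set differences divided by twice the size of the union. Returning 0.0 for two empty sets is an assumption of this sketch; the slide does not cover that case.

```python
def set_sim(a, b):
    """SetSim(A, B) = |A ∩ B| - (|A - B| + |B - A|) / (2 |A ∪ B|)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets contribute nothing
    return len(a & b) - (len(a - b) + len(b - a)) / (2 * len(union))


mine = {"shape", "color", "category"}
yours = {"shape", "color", "title"}
print(set_sim(mine, yours))  # two shared predicates, one unshared on each side
```

Under this reading the score grows with the number of shared elements and is reduced by at most 0.5 for elements that appear on only one side.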

Page 61:

Step 4: Class-based disambiguation

H1: h11, h12, h13, h14

H2: h21, h22

H3: h31, h32, h33

SetSim(P(h11), P(H2)) + SetSim(P(h11), P(H3))

[P(hi11), D(hi11), O(hi11), T(hi11)]

Page 62:

Step 4: Class-based disambiguation

RDS(A, B) = SetSim(P(A), P(B)) + SetSim(D(A), D(B)) + SetSim(O(A), O(B)) + SetSim(T(A), T(B))
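
RDS can be sketched in Python as the sum of SetSim over the four sets of the instance representation. The dictionary layout with keys "P", "D", "O", "T", the sample values, and the zero score for two empty sets are assumptions for illustration.

```python
def set_sim(a, b):
    """SetSim as read from the slide; two empty sets score 0.0 (assumption)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) - (len(a - b) + len(b - a)) / (2 * len(union))


def rds(a, b):
    """RDS(A, B): sum of SetSim over predicates, literals, URIs and pairs."""
    return sum(set_sim(a[f], b[f]) for f in ("P", "D", "O", "T"))


# Hypothetical representations of two apples with partly different values.
my_apple = {
    "P": {"shape", "color"},
    "D": {"Round", "Red"},
    "O": set(),
    "T": {("shape", "Round"), ("color", "Red")},
}
your_apple = {
    "P": {"shape", "color"},
    "D": {"Spherical", "Red"},
    "O": set(),
    "T": {("shape", "Spherical"), ("color", "Red")},
}
print(rds(my_apple, your_apple))
```

The two apples share all predicates but only some values, so the P term contributes fully while the D and T terms are penalised.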

Page 63:

Step 4: Class-based disambiguation

H1: h11, h12, h13, h14

H2: h21, h22

H3: h31, h32, h33

With the score of h11 filled in:

H1: 0.98 (h11), h12, h13, h14

H2: h21, h22

H3: h31, h32, h33

Page 64:

Step 4: Class-based disambiguation

H1: 0.98, 0.32, 0.32, 0.76

H2: 0.95, 0.53

H3: 0.94, 0.91, 0.87

Page 65:

Step 4: Class-based disambiguation

URDS(t, PH(S)^-) = Σ_{PH(s') ∈ PH(S)^-} RDS({t}, PH(s')) / |PH(S')|
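
A hedged Python sketch of the URDS idea: a candidate t is scored by its RDS against the other pseudo-homonym sets. The formula on the slide is only partly legible, so this sketch assumes the sum is divided by the number of other sets (a plain mean) and that a set of instances is represented by the union of its members' sets. The helper names `inst` and `merge` and the sample data are hypothetical.

```python
FEATURES = ("P", "D", "O", "T")


def set_sim(a, b):
    """SetSim as read from the slides; two empty sets score 0.0 (assumption)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) - (len(a - b) + len(b - a)) / (2 * len(union))


def rds(a, b):
    return sum(set_sim(a[f], b[f]) for f in FEATURES)


def inst(predicates):
    """Hypothetical helper: an instance described by its predicates only."""
    return {"P": set(predicates), "D": set(), "O": set(), "T": set()}


def merge(instances):
    """Representation of a set of instances: union of each feature set."""
    return {f: set().union(*(i[f] for i in instances)) for f in FEATURES}


def urds(t, other_sets):
    """Mean RDS of candidate t against the other pseudo-homonym sets."""
    if not other_sets:
        return 0.0
    return sum(rds(t, merge(ph)) for ph in other_sets) / len(other_sets)


fruit = inst({"color", "shape"})
company = inst({"ceo", "revenue"})
others = [
    [inst({"color", "shape"}), inst({"ceo"})],  # H2
    [inst({"color", "shape", "taste"})],         # H3
]
print(urds(fruit, others), urds(company, others))
```

In this toy data the fruit candidate scores higher than the company candidate because the other sets are dominated by fruit-like predicates, which is the intuition behind class-based disambiguation.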

Page 66:

Step 4: Class-based disambiguation

0.98 0.95 0.94

H1 H2 H3

TOP-K or Threshold
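
The two selection strategies can be sketched in Python as follows. The scores reuse the values shown for H1 on the earlier slide; pairing them with h11 to h14 in slide order, and the function names, are assumptions of this sketch.

```python
def select_top_k(scores, k=1):
    """Keep the k highest-scoring candidates of one pseudo-homonym set."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def select_by_threshold(scores, delta):
    """Keep every candidate whose score is at least delta."""
    return [name for name, score in scores.items() if score >= delta]


# Scores for H1, paired with instances in slide order (assumption).
H1 = {"h11": 0.98, "h12": 0.32, "h13": 0.32, "h14": 0.76}

print(select_top_k(H1, k=1))
print(select_by_threshold(H1, delta=0.75))
```

Top-K always returns a fixed number of candidates per set, while a threshold can return none or several, depending on how the scores are distributed.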

Page 67:

Step 4: Class-based disambiguation

Page 68:

Step 4: Class-based disambiguation

Disambiguation

Class-based Disambiguator

Source instances

Target

Pseudo-homonyms sets

Page 69:

Experiment

• Ontology Alignment Evaluation Initiative (OAEI 2010)

• Collections: the life science (LS) collection (DBpedia, Sider, DrugBank, LinkedCT, DailyMed, TCM, and Diseasome) and the Person-Restaurant (PR) collection.

• 20 gigabytes of data, millions of triples.

• We compared SERIMI to ObjectCoref and RiMOM.

• Precision, Recall and F1
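
For reference, the three measures can be computed from a set of produced matches and a reference alignment as in the sketch below. The match pairs are made-up placeholders, not OAEI data, and the function name is an assumption.

```python
def precision_recall_f1(found, reference):
    """Precision, recall and F1 of found matches against a reference alignment."""
    found, reference = set(found), set(reference)
    true_positives = len(found & reference)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Placeholder (source, target) pairs for illustration only.
found = {("s1", "t1"), ("s2", "t2"), ("s3", "t9")}
reference = {("s1", "t1"), ("s2", "t2"), ("s3", "t3"), ("s4", "t4")}

p, r, f1 = precision_recall_f1(found, reference)
print(round(p, 3), round(r, 3), round(f1, 3))
```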

Page 70:

Results

Page 71:

Results

Page 72:

Results

Page 73:

Page 74:

Results

Page 75:

Step 4: Class-based disambiguation

0.98 0.95 0.94

H1 H2 H3

TOP-K or Threshold

Page 76:

Results for Top-K

[Figure: F1 (y-axis, 0.00 to 1.00) per dataset pair (x-axis: Sider-Daily., Sider-Drug., Drug.-Sider, P11-P12), with one series each for Top-1, Top-2, Top-5, and Top-10.]

Page 77:

Results for δ threshold

[Figure: F1 (y-axis, 0.00 to 1.00) per dataset pair (x-axis: Sider-Daily., Sider-Drug., Drug.-Sider, P11-P12), with one series each for δ >= δm, δ = 1.0, δ >= 0.95, δ >= 0.90, and δ >= 0.85.]

Page 78:

Conclusion

• SERIMI is a complementary approach to instance matching tools based on direct matching.

• SERIMI is recommended for heterogeneous data where there is no overlap between schemas.

• It is recommended for multi-class disambiguation.

Page 79:

THANK YOU!

• Samur Araujo