Entity Resolution with Evolving Rules

Post on 12-Jan-2016

Youzhong Ma, 2010-9-25, Lab of WAMDM

Outline

Motivations
ER Related concepts
ER properties
Conclusions

Entity Resolution background

Naïve ER Approach Vs. New Approach

Outline

Motivations
ER Related concepts
ER properties
Conclusions

ER Related concepts

Suppose market A merges with market B, so the two have to combine their customers. The same person may appear in both markets' customer databases, but with some attributes that differ.

How do we deal with this?

ER Rule

Boolean functions: determine whether two records represent the same entity (true or false).

Distance functions: measure how different (or similar) two records are.
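The two kinds of rule can be sketched in Python as follows. This is only an illustration: the record fields (`name`, `zip`), the sample values, and the function names `match_name_zip` and `name_distance` are assumptions that do not come from the slides, and `difflib.SequenceMatcher` is just one possible similarity measure.

```python
from difflib import SequenceMatcher

def match_name_zip(a, b):
    """Boolean rule: True if both records agree on name and on a known zip."""
    return a["name"] == b["name"] and a["zip"] is not None and a["zip"] == b["zip"]

def name_distance(a, b):
    """Distance rule: 0.0 for identical names, larger as the names diverge."""
    return 1.0 - SequenceMatcher(None, a["name"], b["name"]).ratio()

# Hypothetical records, for illustration only.
r1 = {"name": "John", "zip": "54321"}
r2 = {"name": "John", "zip": "54321"}
r3 = {"name": "Jon", "zip": "11111"}

print(match_name_zip(r1, r2))  # True
print(match_name_zip(r1, r3))  # False
print(name_distance(r1, r3))   # a small positive value
```

A Boolean rule gives a yes/no decision per pair, while a distance rule leaves the threshold choice to the clustering algorithm.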

ER Example

ER procedure

Original record set S = {r1,r2,r3,r4}

ER input Pi = {{r1},{r2},{r3},{r4}}

B1: Pname, E1 = {{r1,r2,r3},{r4}} (6 comparisons)

B2: Pname ∧ Pzip, E2 = {{r1,r2},{r3},{r4}}

Naïve approach: 6 comparisons

Evolving rule approach: 3 comparisons

The evolving rule approach only works if the ER algorithm satisfies certain properties and B2 is stricter than B1.

So one contribution of this paper is to explore under what conditions, and for what ER algorithms, incremental approaches are feasible.
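The comparison counts in this example can be reproduced with a small sketch. It assumes a simple pairwise, transitive-closure clustering (not necessarily the algorithm used in the paper), and the record values and helper names (`RECORDS`, `resolve`, `p_name`, `p_name_zip`) are hypothetical, chosen only so that the partitions match the slide.

```python
from itertools import combinations

# Hypothetical records chosen so that the partitions match the slide.
RECORDS = {
    "r1": {"name": "John", "zip": "54321"},
    "r2": {"name": "John", "zip": "54321"},
    "r3": {"name": "John", "zip": "11111"},
    "r4": {"name": "Bob", "zip": None},
}

def p_name(a, b):
    return a["name"] == b["name"]

def p_name_zip(a, b):
    return p_name(a, b) and a["zip"] is not None and a["zip"] == b["zip"]

def resolve(ids, match):
    """Compare every pair in ids and merge matches transitively.

    Returns (set of clusters, number of comparisons performed).
    """
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    comparisons = 0
    for a, b in combinations(ids, 2):
        comparisons += 1
        if match(RECORDS[a], RECORDS[b]):
            parent[find(a)] = find(b)
    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), set()).add(i)
    return {frozenset(c) for c in clusters.values()}, comparisons

all_ids = sorted(RECORDS)
e1, comps_b1 = resolve(all_ids, p_name)

# Naive approach: rerun B2 on all records from scratch.
naive_e2, naive_comps = resolve(all_ids, p_name_zip)

# Evolving rule approach: apply B2 only inside each cluster of E1.
evolving_e2, evolving_comps = set(), 0
for cluster in e1:
    result, n = resolve(sorted(cluster), p_name_zip)
    evolving_e2 |= result
    evolving_comps += n

print(naive_comps, evolving_comps)  # 6 3
print(naive_e2 == evolving_e2)      # True
```

Restricting B2 to the clusters of E1 is safe here only because B2 is stricter than B1: records that B1 already separated cannot match under B2.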

Original record set S = {r1,r2,r3,r4}

ER input Pi = {{r1},{r2},{r3},{r4}}

B1: Pname ∧ Pzip, E1 = {{r1,r2},{r3},{r4}} (6 comparisons)

B2: Pname ∧ Pphone, E2 = {{r1},{r2,r3},{r4}} (3 comparisons)

Pname: Ename = {{r1,r2,r3},{r4}}

Pzip: Ezip = {{r1,r2},{r3},{r4}}

Materialization!
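A minimal sketch of the materialization idea, under the assumption that rules are conjunctions of per-field equality predicates and that clustering is pairwise with transitive closure. The record values, the `matches` and `resolve` helpers, and the `materialized` dictionary are all hypothetical names introduced for illustration. The partition of each conjunct of B1 is stored, and B2 is then evaluated only inside the clusters of a stored conjunct that it shares.

```python
from itertools import combinations

# Hypothetical records chosen so that the partitions match the slide.
RECORDS = {
    "r1": {"name": "John", "zip": "54321", "phone": "123-4567"},
    "r2": {"name": "John", "zip": "54321", "phone": "987-6543"},
    "r3": {"name": "John", "zip": "11111", "phone": "987-6543"},
    "r4": {"name": "Bob", "zip": None, "phone": "121-1212"},
}

def matches(a, b, fields):
    """Conjunctive rule: every listed field must be known and equal."""
    return all(a[f] is not None and a[f] == b[f] for f in fields)

def resolve(ids, fields):
    """Pairwise comparison with transitive merging; returns (clusters, comparisons)."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    comparisons = 0
    for a, b in combinations(ids, 2):
        comparisons += 1
        if matches(RECORDS[a], RECORDS[b], fields):
            parent[find(a)] = find(b)
    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), set()).add(i)
    return {frozenset(c) for c in clusters.values()}, comparisons

all_ids = sorted(RECORDS)

# Materialize one partition per conjunct of B1 = Pname AND Pzip.
b1 = ("name", "zip")
materialized = {f: resolve(all_ids, (f,))[0] for f in b1}

# B2 = Pname AND Pphone shares the conjunct Pname, so start from Ename.
b2 = ("name", "phone")
shared = next(f for f in b2 if f in materialized)
e2, comps = set(), 0
for cluster in materialized[shared]:
    result, n = resolve(sorted(cluster), b2)
    e2 |= result
    comps += n

print(shared, comps)  # name 3
```

E1 itself cannot serve as the starting point here because B2 is not stricter than B1; the stored Ename can, since B2 is stricter than Pname.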

Outline

Motivations
ER Related concepts
ER properties
Conclusions

Two important properties of ER algorithms enable efficient rule evolution for match-based clustering:

Rule Monotonicity (RM)

Context Free (CF)

Pname ∧ Pzip ≤ Pname (that is, Pname ∧ Pzip is stricter than Pname)

Rule Monotonicity (RM)

B2: Pname, E2 = {{r1,r2,r3},{r4}}

B1: Pname ∧ Pzip, E1 = {{r1,r2},{r3},{r4}}
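In this example B1 ≤ B2 and every cluster of E1 lies inside some cluster of E2, which is how rule monotonicity is read here: a stricter rule should yield a partition that refines the result of the looser rule. The check can be sketched as below; the helper name `refines` is hypothetical, and the partitions are copied from the slide.

```python
def refines(finer, coarser):
    """True if every cluster of `finer` is contained in some cluster of `coarser`."""
    return all(any(c <= d for d in coarser) for c in finer)

e1 = [{"r1", "r2"}, {"r3"}, {"r4"}]   # result of B1: Pname AND Pzip
e2 = [{"r1", "r2", "r3"}, {"r4"}]     # result of B2: Pname

print(refines(e1, e2))  # True
print(refines(e2, e1))  # False
```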

Context Free (CF)

General Incremental vs. Context Free

Order Independent vs. Rule Monotonicity

An ER algorithm is order independent if the ER result is the same regardless of the order in which the records are processed.
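Order independence can be checked by brute force on a small input. The sketch below assumes a simple one-pass clustering on name equality; the `NAMES` data and the `resolve` function are hypothetical and stand in for whatever ER algorithm is being tested.

```python
from itertools import permutations

# Hypothetical name attribute for the four records.
NAMES = {"r1": "John", "r2": "John", "r3": "John", "r4": "Bob"}

def resolve(ordered_ids):
    """Process records one at a time, merging each into every cluster it matches."""
    clusters = []
    for rid in ordered_ids:
        merged = {rid}
        rest = []
        for c in clusters:
            if any(NAMES[rid] == NAMES[other] for other in c):
                merged |= c
            else:
                rest.append(c)
        clusters = rest + [merged]
    return frozenset(frozenset(c) for c in clusters)

# Run the algorithm on every processing order and collect the distinct results.
results = {resolve(order) for order in permutations(NAMES)}
print(len(results) == 1)  # True
```

An exhaustive check like this is only practical for a handful of records; it demonstrates the property on an instance rather than proving it.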

Existing properties in literature

Experiments

Outline

Motivations
ER Related concepts
ER properties
Conclusions

Conclusions

Proposes a new ER approach with evolving rules

Exploits the properties (RM, CF) of ER algorithms that enable efficient rule evolution

Provides guidance to designers of ER algorithms

Some problems

How are the comparison rules generated?

How can ER algorithms be designed so that the RM and CF properties hold?

How can the ER algorithms be implemented in the MapReduce framework?

Sincere thanks to everyone in the Web Group