IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee,...

29
iMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign Alon Halevy, Pedro Domingos University of Washington 1

Transcript of IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee,...

Page 1: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

1

iMAP: Discovering Complex Semantic Matchesbetween Database Schemas

Robin Dhamankar, Yoonkyong Lee, AnHai Doan

University of Illinois, Urbana-Champaign

Alon Halevy, Pedro DomingosUniversity of Washington

Page 2: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

2

Semantic Heterogeneity

• Semantics refer to the meaning of data in contrast to syntax, which only defines the structure of the schema items (e.g., classes and attributes).

• Semantic heterogeneity is a general term referring to disagreement about the meaning, interpretation or intended use of the same or related data.

• One of the toughest problems in Data Integration.

Page 3: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

3

Schema Matching

• A match between two schemas specifies semantic correspondences between elements of both schemas.

• A complex match specifies that a combination of attributes in one schema corresponds to a combination in the other.

Page 4: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

4

Aim• Examine an unbounded number of match

candidates for complex matching between data sources.

• Describe an iMAP system which semi-automatically discovers both 1-1 and complex matches for relational data (database schemas in this paper). The system uses following steps:

o Generating Matcheso Exploiting Domain Knowledgeo Explaining Match Prediction

Page 5: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

5

Content• An architecture for semi-automatically discovering

complex matches that combines search through a set of candidate matches and methods for evaluating each match in isolation, and a set of matches as a whole.

• Uses of new kinds of domain knowledge (overlap data and mining external data), and applying the knowledge as early as possible in the matching process.

• A mechanism for explaining the decisions made by the matching system.

• The iMAP system which embodies all these innovations, and a set of experiments on real-world.

Page 6: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

6

Problem Definition• consider the two relational schemas S and T in

Figure 1. Both databases store house listings and are managed by two di®erent real-estate companies. The schema of database T, for example, has one table, LISTINGS, whereas database S stores its data in two tables, HOUSES and AGENT.

• Suppose the two real-estate companies have decided to merge. To cut costs, they eliminate database S by transferring all house listings from S to database T.

Page 7: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

7

Jean Laup Mike Brown

Semantic Mapping• Specify relationships between schema elements

of disparate data sources.

• Example:

Table S1

32 15

city state

price ($)

Athens GA 360,000 Raleigh

NC 360,000

agent-id

3215

agent commission

0.03 0.04

id

Table S2

locationTable T listed-price

agent-name

Denver, CO

550,000 Laura Smith

Atlanta, GA

370,800 Mike Brown

Page 8: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

Creating Semantic Mappings

• Step1: Generating matches

8

• Step2: Elaborating matches

into mappings

location=concat(city,state)listed-price=price*(1+commission)agent-name=agent

location = SELECT concat(city,state) FROM S1

listed-price = SELECT price*(1+commission)

FROM S1, S2 WHERE agent-id = id

agent-name = SELECT agent FROM S2

Table S1

city state

price ($)Athens GA 360,000

Raleigh NC 360,000

agent-id3215

3215 Jean Laup

Mike Brown

agent commission

0.030.04

id

Table S2

location

Target Tlisted-price

agent-name

Denver, CO 550,000 Laura Smith

Atlanta, GA 370,800 Mike Brown

Page 9: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

Generating Matches

• 1-1 matchesagent-name=agent

• Complex matcheslocation=concat(city,state)listed-price=price*(1+commission)

• The experiments in this paper contain 30-50% of complex matches.

9

Page 10: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

Illustrative Example

10

Finding the best global assigment

location = concat(city,state) listed-price=price*(1+commission) agent-name=agent

Estimating similarity

Generating candidate matches

city state price

agent

commission

Table S

location listed-price

Table T

agent-name

sim(location,concat(city,state))=0.96

sim(listed-price,price(1+commision))=0.97

sim(location,city)=0.76 …

sim(listed-price,price)=0.86…

sim(agent-name,agent)=0.95

location=concat(city,state)listed-price=price(1+commision)

location=city …

listed-price=price…

agent-name=agent…

“location” and “listed-price” are unrelated“city” and “price” are unrelated

Page 11: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

iMAP Architecture

11

Match selector

1-1 and complex matches

Match candidates

Explanationmodule

User

Domain knowledgeand data

Target schema T and source schema S

Similarity matrix

Similarity estimator

Match generator

Page 12: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

Match Generator• The space of candidate matches is huge

12

• Solution: multi-searcher strategy– employing a set of special-purpose searchers– each searcher examines only a specific portion of search

space– e.g., text searcher, numeric searcher, …

• iMAP is highly extensible– new searchers can be added as developed

location=citylocation=stat

e…location=concat(city,state)location=concat(city,agent-

name)…

listed-price=pricelisted-

price=commission…listed-

price=price*(1+commision)listed-price=price*(1-commision)…

location=price*(1+commision)

…location=price*(1-commision)

listed-price=concat(city,state)

…listed-price=concat(city,agent-

name)

Page 13: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

13

Default Search Strategy

• Use beam searcho at each search level, keep k best match

candidates

• Example:

concat(city,price) concat(city,agent-name)concat(city,state)

city state

price ($)

commission

Find all concatenation candidate matches for location (k=3)

… …

agent-name

Page 14: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

Default Search Strategy

• Use beam searcho at each search level, keep k-best match candidates

• Example:

14

concat(city,state)

city state

price ($)

commission

Find all concatenation candidate matches for location (k=3)

concat(city,state,price)concat(city,state,agent-name) …

… …

agent-name

concat(city,price) concat(city,agent-name)

Page 15: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

Scoring Functions

• Each searcher has its own scoring function.o can use machine learning, statistics, heuristics, etc.

• E.g., current text searcher uses Naive Bayes classification.

• To compute score between location & concat(city,state)o build a Naive Bayes classifier for location.o Use classifier to compute how similar

each instance of concat(city,state) to locationo average similarity value is the desired score.

15

Page 16: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

Termination Condition

• Even with beam search, search space can still be huge or infiniteo need to find effective termination condition

• Currently: terminate when see diminishing returns

16

Best candidate match at the 2nd

iterationf(concat(city,state))f(concat(city,state))-| |

<= 0.03thresholdscoring

function

Best candidate match at the 3rd

iteration

Page 17: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

Implemented Searchers

Searcher Space of candidates Example matches

Text Text attributes at the source schemaname = concat(first-name, last-name)

NumericUser supplied matches or past complex matches

list-price = price * (1+tax-rate)

Category Attributes with less than t distinct valuesproduct-categories = product-types

Schema mismatch

Source attribute containing target schema infofireplace = 1 if house-desc has “fireplace”

Unit conversion

Physical quantity attributes weight-kg = 2.2* net-weight-pounds

Date Columns recognized as ontology nodesbirth-date = b-day / b-month / b-year

Overlap numeric

Specified by a context free grammarinterest-earned = balance * interest-rate

17

Page 18: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

Exploiting Domain Knowledge

• Exploit new kinds of domain knowledge - overlap data and external datao Overlap data: data from different sources that represent

a same entity• tuples from real estate listings that represent same house

o External data: data not in the two input schemas• “# of real estate agents <= 50” mined from real estate

listings

• Use domain knowledge as early as possible

18

Page 19: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

Generating Explanations

• Motivating example 2:

19

• Motivating example 1:

num-rooms=bath-rooms + bed-rooms + dining-rooms: 0.434

Correct match:num-rooms = bath-rooms + bed-rooms + dining-rooms + living-rooms

num-rooms=bath-rooms + bed-rooms + living-rooms: 0.432

name=last-name: 0.434name=concat(first-name,last-name): 0.420

Correct match:name=concat(first-name,last-name)

Page 20: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

Types of Questions Users Can

Ask

• Existence of a matcho why is month-posted=monthly-fee-rate generated?

• Non-existence of a matcho why is num-rooms= bath-rooms+bed-

rooms+dining-rooms+ living-rooms not generated?

• Ranking of a matcho why is name=last-name ranked higher than

name=concat(first-name,last-name)?

20

Page 21: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

Generating Explanation

using Dependency Graph

21

list-price=pricelist-price=price(1 + monthly-fee-rate)

list-price = price(1+ monthly-fee-rate)score = 0.76

month-postedis unrelated to

list-price

month-posted = monthly-fee-rate score = 0.67

Match selector

month-posted = monthly-fee-rate score = 0.79

month-posted = monthly-fee-rate score = 0.55

Combining module

month-posted = monthly-fee-rateNaïve Bayes evaluatorName based evaluator

monthly-fee-rate is a monthmonth-posted is a month.

Date searcher

PreprocessorPreprocessor

If data in a column is in (1..12) it’s a month

Assumption Assumption

Source ColumnConstraint Target Column

Match List

Constraint

Candidate

Candidate

Candidate

Candidate

Candidate Candidatelist-price = price

score = 0.63

month-posted

monthly-fee-rate3

2…

910…

Page 22: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

Sample Explanation

22

USER: Why num-rooms = bath-rooms + bed-rooms + dining-rooms + living-rooms does not appear?

iMAP:

(1) Overlap numeric matcher CAN generate bath-rooms + bed-rooms + dining-rooms + living-rooms for num-rooms.

(2) Overlap numeric matcher DID NOT generate it.

(3) A reason: the match has length of 4 terms. overlap numeric searcher has considered only candidates of length up to 3 terms.

(4) Characteristics of the search space for num-rooms: a. Number of considered numeric attributes: 7 b. Considered numeric attributes: building-area lot-dimension1 lot-dimension2 bath-rooms bed-rooms dining-rooms living-rooms c. Used successor functions: Addition Multiplication d. Max. number of terms: 3 e. Max. number of elements in a term: 3

Page 23: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

Empirical Evaluation• Domains & data sources

o four domains: Real Estate, Inventory, Cricket, and Financial Wizard

o avg. # of matches: 30o obtain data from several sources

• Internet, Microsoft Access sample databases, students in a large database class

• Methodology and performance measureo run experiments with several configurations on

eight experiment setso top-1 matching accuracy = % of target attributes

whose top-1 matches are correct

23

Page 24: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

Complex Matching Accuracy

1-1 matching accuracy: 77-100%

24

0

20

40

60

80

100

Real Estate Inventory Cricket FinancialWizard

Matc

hin

g a

ccu

racy

(%

)

Complex matching accuracy: 50 - 86%

Default system: 33 - 55%+ Domain constraints: 50 - 55%+ Overlap data: 33 - 71%+ Domain constraints + Overlap data: 50 - 86%

Page 25: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

Complex Matching Accuracy

for Disjoint Domains

25

Complex matching accuracy: 27 - 58%

0

20

40

60

80

100

Real Estate Inventory Cricket FinancialWizard

Default system: 18 - 58%

+ Domain constraints: 27 - 58%

Matc

hin

g a

ccu

racy

(%

)

Page 26: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

Why not 100%?• Hard to find smaller components of a complex

matchagent-address = concat(street-name,city,state), where apt-number is missed

• Hard to remove small noise components froma complex match

phone-number = concat(agent-id,area-code,number),where agent-id is noise (a single digit number)

• Many correct complex matches are in the top three, but not in the top one

26

Page 27: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

Summary & Future Work

• Complex matches are pervasiveo finding them is crucial for practical matching solutions

• Described the iMAP solutiono discovers both complex and 1-1 matcheso is highly extensible o exploits several kinds of domain knowledge o can explain produced matches

• Experiments show the promise of iMAP.

• Future worko apply iMAP to other data representations (e.g., XML)o combine iMAP-style techniques with Clio-style user

interaction 27

Page 28: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

28

QUESTIONS?

Page 29: IMAP: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan University of Illinois, Urbana-Champaign.

29