Putting Context into Schema Matching Philip Bohannon* Yahoo! Research Eiman Elnahrawy* Rutgers...

Putting Context into Schema Matching

Philip Bohannon*, Yahoo! Research

Eiman Elnahrawy*, Rutgers University

Wenfei Fan, Univ of Edinburgh and Bell Labs

Michael Flaster*, Google

*Work performed at Lucent Technologies -- Bell Laboratories.

Overview

1. Motivation

2. Background

3. Strawman

4. Framework

5. Experimental Evaluation

6. Related Work

7. Conclusions

Schema Matching vs. Schema Mapping

RS.Person

First

Last

City

RT.Student

Name

Address

City...

.

.

.

Source Schema: RS

Target Schema: RT

Arrows inferred based on meta-data or sample instance data

Associated confidence score

Meaning (variant of): RS.Person.City RT.Student.City

.88

.93

.97

Schema Matching means “computer-suggested arrows”

Schema Mapping: “From Arrows to Queries”

RS.Person

First

Last

City

RT.Student

Name

Address

City...

.

.

.

Given a set of arrows as user input, produce a query that maps instances of RS into instances of RT

Transformations, joins [Miller, Haas, Hernandez, VLDB 2000] added by, or with help from, the user

Most of this talk is about matching, some implications for mapping later

select concat(First, ' ', Last) as Name,
       City as City
from RS.Person, RS.Education, …
where …

Q: RS -> RT

Motivation: inventory mapping example

RS.inv
  id: integer
  name: string
  code: string
  type: integer
  instock: string
  descr: string
  arrival: date

RT.book
  title: string
  isbn: string
  price: float
  format: string

RT.music
  title: string
  asin: string
  price: float
  label: string
  sale: float

Consider integrating two inventory schemas

Books, music in separate tables in RT

Run some nice schema match software

Inventory where clause

RS.inv (id: integer, name: string, code: string, descr: string, arrival: date)

RT.music (price: float, label: string, sale: float)

The lines are helpful (schema matching is a best-effort affair), but…

lines are semantically correct only in the context of a selection condition

where type=1

where type = 2

Definition and Goals

Contextual schema match: An arrow between source and

target schema elements, annotated with a selection condition

– In a standard schema match, the condition “true” is always

used

Goal: Adapt instance-driven schema matching techniques to

infer semantically valid contextual schema matches, and

create schema maps from those matches

RS.aa RT.bb true M

RS.aa RT.bb RS.c=3 M
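The definition above can be sketched as a small data structure. This is only an illustration, not the authors' implementation: the names `ContextualMatch`, `standard_match` and `applies` are hypothetical, and the condition is modelled as a Python predicate over a source row.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

Row = Dict[str, Any]


@dataclass(frozen=True)
class ContextualMatch:
    source: str                        # source schema element, e.g. "RS.a"
    target: str                        # target schema element, e.g. "RT.b"
    condition: Callable[[Row], bool]   # selection condition on source rows
    label: str = "true"                # readable form of the condition


def standard_match(source: str, target: str) -> ContextualMatch:
    # A standard match always uses the condition "true".
    return ContextualMatch(source, target, lambda row: True, "true")


def applies(match: ContextualMatch, row: Row) -> bool:
    # True when the arrow is valid in the context of this source row.
    return match.condition(row)


m_std = standard_match("RS.a", "RT.b")
m_ctx = ContextualMatch("RS.a", "RT.b", lambda row: row["c"] == 3, "RS.c=3")
```

Here `m_std` applies to every source row, while `m_ctx` applies only to rows where `c` equals 3.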

Attribute promotion example

Consider integrating data about grade assignments [Fletcher, Wyss, SIGMOD 2005 demo]

Again context is needed, but semantics are slightly different: attribute promotion

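[Figure: source table (Name, Assgn, Grade) with sample rows for Joe and Mary, Assgn values 1, 2, 3 and Grade values 84, 86, 75, 92, 94, 85; target table (Name, Grade1, Grade2, Grade3, …). The arrows into Grade1, Grade2, Grade3, … carry the conditions "where Assgn=1", "where Assgn=2", "where Assgn=3", …]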
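Attribute promotion can be illustrated with a short sketch: each value of the selector attribute (Assgn) becomes its own target column (Grade1, Grade2, …). The function name `promote` is hypothetical, and the pairing of names with grades in the sample rows is illustrative only.

```python
def promote(rows, key, selector, value):
    """Pivot rows so each selector value becomes a column, e.g. Grade1, Grade2."""
    out = {}
    for r in rows:
        target = out.setdefault(r[key], {key: r[key]})
        # The condition "where Assgn = i" routes this Grade into column Grade<i>.
        target[f"{value}{r[selector]}"] = r[value]
    return list(out.values())


# Illustrative sample rows (the name/grade pairing is an assumption).
rows = [
    {"Name": "Joe", "Assgn": 1, "Grade": 84},
    {"Name": "Joe", "Assgn": 2, "Grade": 86},
    {"Name": "Joe", "Assgn": 3, "Grade": 75},
    {"Name": "Mary", "Assgn": 1, "Grade": 92},
    {"Name": "Mary", "Assgn": 2, "Grade": 94},
    {"Name": "Mary", "Assgn": 3, "Grade": 85},
]
wide = promote(rows, "Name", "Assgn", "Grade")
```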

Overview

1. Motivation

2. Background

3. Strawman

4. Framework

5. Experimental Evaluation

6. Related Work

7. Conclusion

Background: Instance-level matching

RS.ac RT1.bb true M

RS.ac RT1.ac true M

Sample values: San Jose, Cupertino, Palo Alto, Gilroy, Pleasanton, Sunnyvale

Sample values: Sunnyvale, Los Angeles, Cupertino, Gilroy, San Diego

Nice match!

Sample values: (408) 123-4456, (212) 223-3455, (408) 123-2222, (408) 324-4444

Sample values: Sunnyvale, Los Angeles, Cupertino, Gilroy, San Diego

Dubious, at best!

Background: Instance-level matching

RS.ac RT1.bb true M

RS.ac RT1.ac true M

Perfect match!

Dubious, at best!

Matchers: Bayesian, Tri-gram, Type Expert, String Edit Distance, Cosine Similarity, Whatever, More Whatever

Coming up with a good score is far from simple!

• Derive comparable scores across sample size, data types, etc.

StandardMatch(RS,RT,τ)

RS.ac RT1.ac true M

RS.ba RT1.cd true M

RS.ba RT1.sb true M

RS.db RT1.ar true M

RS.ac RT1.vw true M

RS.bd RT1.ad true M

1. Consider all |RS||RT| matches, score them, normalize the scores

RS.ac RT1.ac true M

RS.ba RT1.cd true M

RS.ba RT1.sb true M

RS.db RT1.ar true M

RS.ac RT1.vw true M

RS.bd RT1.ad true M

2. Rank by normalized score

3. Apply the confidence threshold τ as a cutoff, and return the matches at or above it
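The three steps above can be sketched as follows. This is a toy version under stated assumptions: the scorer is a plain Jaccard overlap of value sets rather than the combined matchers described in the talk, normalization divides by the best score, and the names `jaccard` and `standard_match` are hypothetical.

```python
from itertools import product


def jaccard(a, b):
    """Toy instance-level score: overlap of the two value sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def standard_match(source_cols, target_cols, threshold, score=jaccard):
    # Step 1: score all |RS| x |RT| pairs, then normalize by the best score.
    scored = [
        (s, t, score(sv, tv))
        for (s, sv), (t, tv) in product(source_cols.items(), target_cols.items())
    ]
    top = max((sc for _, _, sc in scored), default=0.0)
    normalized = [(s, t, sc / top if top else 0.0) for s, t, sc in scored]
    # Step 2: rank by normalized score.
    ranked = sorted(normalized, key=lambda m: m[2], reverse=True)
    # Step 3: apply the threshold as a cutoff.
    return [m for m in ranked if m[2] >= threshold]


source = {
    "RS.city": ["San Jose", "Cupertino", "Gilroy", "Sunnyvale"],
    "RS.phone": ["(408) 123-4456", "(212) 223-3455"],
}
target = {"RT.city": ["Sunnyvale", "Cupertino", "Gilroy", "San Diego"]}
matches = standard_match(source, target, threshold=0.5)
```

With this sample, the city columns share three of five distinct values and survive the cutoff, while the phone column shares none and is dropped.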

Background: Categorical Attributes

What attributes are candidates for the where clause?

We focus on “categorical” attributes (leaving non-categorical attributes as future work)

If not identified by schema, infer from sample data, as any attribute with

–more than 1 value

–most values associated with more than one tuple

RS.inv (id: integer, name: string, code: string, descr: string, arrival: date)
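The inference rule for categorical attributes can be sketched as below. The reading of "most values" as "more than half of the distinct values" is an assumption, as are the function name `is_categorical` and the `majority` parameter.

```python
from collections import Counter


def is_categorical(values, majority=0.5):
    """Heuristic: more than one distinct value, and most distinct values
    occur in more than one tuple ("most" is taken as > majority here)."""
    counts = Counter(values)
    if len(counts) <= 1:
        return False
    repeated = sum(1 for n in counts.values() if n > 1)
    return repeated / len(counts) > majority


# Sample columns in the style of the inventory example.
type_col = [1, 2, 1, 1, 2]
instock_col = ["y", "y", "n", "y", "n"]
name_col = ["leaves of grass", "the white album", "heart of darkness",
            "wasteland", "hotel california"]
```

On these samples, `type` and `instock` qualify as categorical, while `name` does not because every value appears in exactly one tuple.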

Overview

1. Motivation

2. Background

3. Strawman

4. Framework

5. Experimental Evaluation

6. Related Work

7. Conclusion

Strawman Algorithm

1. Use instance-based matching algorithm to compute a set of matches, L = M1..Mn, along with associated scores

2. For each Mi in L, of the form (RS.s,RT.t,true)

For each categorical attribute c in the source (or target)

For each value v taken by c in the sample

1. Restrict the sample of RS to tuples where c=v

2. Re-compute the match score on the new sample

3. For c,v that most improves score, replace Mi with (RS.s,RT.t,c=v)
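A minimal sketch of the strawman loop follows, assuming a toy Jaccard scorer in place of a real instance-based matcher. The names `jaccard` and `strawman` are hypothetical, and only source-side categorical attributes with simple c=v conditions are tried.

```python
def jaccard(a, b):
    """Toy match score: overlap of the two value sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def strawman(matches, source_rows, target_cols, categorical, score=jaccard):
    """For each match (s, t), keep the c=v condition that most improves the score."""
    result = []
    for s, t in matches:
        best = score([r[s] for r in source_rows], target_cols[t])
        best_cond = None  # None stands for the condition "true"
        for c in categorical:
            for v in sorted({r[c] for r in source_rows}):
                # Restrict the sample to tuples where c = v and re-score.
                sample = [r[s] for r in source_rows if r[c] == v]
                sc = score(sample, target_cols[t])
                if sc > best:
                    best, best_cond = sc, (c, v)
        result.append((s, t, best_cond))
    return result


inv = [
    {"name": "leaves of grass", "type": 1, "instock": "y"},
    {"name": "the white album", "type": 2, "instock": "y"},
    {"name": "heart of darkness", "type": 1, "instock": "n"},
    {"name": "wasteland", "type": 1, "instock": "y"},
    {"name": "hotel california", "type": 2, "instock": "n"},
]
book = {"title": ["leaves of grass", "heart of darkness", "wasteland"]}
found = strawman([("name", "title")], inv, book, ["type", "instock"])
```

The sketch also makes the cost visible: every match is re-scored once per (attribute, value) pair, before any disjuncts are considered.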

ContextMatch(RS,RT,τ)

StandardMatch… returns:

RS.ac RT1.ac true M

RS.ba RT1.cd true M

RS.ba RT2.sb true M

RS.db RT2.ar true M

RS.ac RT1.vw true M

RS.bd RT1.ad true M

4. Try each context condition: RS.c = 2; RS.d = “open”; RS.c = 2 or RS.c = 3; RS.t = 0; RS.t = 1

5. Evaluate quality of match, e.g. RS.ba RT1.cd RS.t=1 M

6. Keep the best!

Problems with Strawman

False Positives – the increase in the score may not be meaningful, since some random subsets of the corpus will match better than the whole (even with size-adjusted metrics)

False Negatives – the original matching algorithm only returns matches with quality above some threshold in L, but a match that didn’t make the cut may improve greatly with contextual matching

Time – with disjuncts, there are too many expressions to test

Strawman 2.0

Like Strawman, but require an improvement threshold, w, to cut down on false positives

Not much better

Setting w is problematic, as matcher scores are not perfect

Overview

1. Motivation

2. Background

3. Strawman

4. Framework

5. Experimental Evaluation

6. Related Work

7. Conclusion

Our approach:

RS.inv (id: integer, name: string, code: string, descr: string, arrival: date)

RT.music (price: float, label: string, sale: float)

1. Pre-filter conditions based on classification

2. Find conditions that improve several matches from the same table

View-oriented contextual mapping (cont’d)

RS.inv (id: integer, name: string, code: string, descr: string, arrival: date)

RT.music (price: float, label: string, sale: float)

RS.inv where type = 2 (id: integer, name: string, code: string, descr: string, arrival: date)

RS.inv where type = 1 (id: integer, name: string, code: string, descr: string, arrival: date)

Algorithm ContextMatch(RS,RT,τ)

L = ∅;
M = StandardMatch(RS,RT,τ);
C = InferCandidateViews(RS,M,EarlyDisjuncts);
for c ∈ C do
    Vc = select * from RS where c;
    for m ∈ M do
        m’ := m with RS replaced by Vc;
        s := ScoreMatch(m’);
        L = L ∪ {(m’,s)};
return SelectContextualMatches(M, L, EarlyDisjuncts)
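The main loop of the algorithm can be sketched as below. This is not the authors' code: the outputs of StandardMatch and InferCandidateViews are passed in as plain lists, ScoreMatch is replaced by a toy Jaccard scorer, and the final SelectContextualMatches step is left out. The names `jaccard` and `context_match` are hypothetical.

```python
def jaccard(a, b):
    """Toy stand-in for ScoreMatch: overlap of the two value sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def context_match(source_rows, target_cols, matches, candidates, score=jaccard):
    """Score every standard match against every candidate view.

    matches:    (source_attr, target_attr) pairs, as StandardMatch would return
    candidates: (label, predicate) pairs, as InferCandidateViews would return
    """
    scored = []  # plays the role of the list L
    for label, cond in candidates:
        view = [r for r in source_rows if cond(r)]  # Vc = select * from RS where c
        for s, t in matches:
            # m' is the match m with RS replaced by the view Vc.
            scored.append(((s, t, label), score([r[s] for r in view], target_cols[t])))
    return scored


inv = [
    {"name": "leaves of grass", "type": 1},
    {"name": "the white album", "type": 2},
    {"name": "heart of darkness", "type": 1},
    {"name": "wasteland", "type": 1},
    {"name": "hotel california", "type": 2},
]
book = {"title": ["leaves of grass", "heart of darkness", "wasteland"]}
views = [("type = 1", lambda r: r["type"] == 1),
         ("type = 2", lambda r: r["type"] == 2)]
scored = context_match(inv, book, [("name", "title")], views)
```

Unlike the strawman, the loop iterates over a pre-filtered set of candidate views and scores all matches against each view, so a condition can be judged on how it improves several matches at once.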

ContextMatch(RS,RT,τ)

StandardMatch… returns:

RS.ac RT1.ac true M

RS.ba RT1.cd true M

RS.ba RT2.sb true M

RS.db RT2.ar true M

RS.ac RT1.vw true M

RS.bd RT1.ad true M

InferCandidateViews proposes: RS.c = 2; RS.d = “open”; RS.c = 2 or RS.c = 3; RS.t = 0; RS.t = 1

For each candidate view V,

4. Re-compute summaries for V as: “select * from RS where RS.t = 1”

5. Evaluate quality of matches, e.g. RS.ba RT1.cd RS.t=1 M

How to Filter Candidate Views

Naïve

– Any Boolean condition involving a categorical attribute (strawman approach)

SourceClassifier, TargetClassifier

– Check for categorical attributes that do a “good job” categorizing other attributes

Disjunct Handling (early or late)

Conjunct Handling

RS.inv (id: integer, name: string, code: string, descr: string, arrival: date)

id  name               type  instock  code     descr
0   leaves of grass    1     y        0195128  hardcover
1   the white album    2     y        B002UAX  audio cd
2   heart of darkness  1     n        0486611  paperback
3   wasteland          1     y        039995   paperback
4   hotel california   2     n        B002GVO  electra

Source Classifier Intuition

how well do the categorical attributes serve as classifier labels for the other attributes?







Source Classifier Intuition: type

how about ‘type’?







Source Classifier Intuition: instock

how about ‘instock’?

What do we really mean by a “good job”?

Split the sample into a training set and a testing set (randomly)

For each categorical attribute C and non-categorical attribute A

– Train a classifier H by treating the value of A as the document and the value of C as the label

– Test H against the test set, determine precision, p, and recall, r

– Score(C) w.r.t. A based on combination of precision and recall (F = 2pr/(p+r))

– Compare Score(C) to Score(NC), where NC is a Naïve Classifier:

• This classifier chooses the most frequent label

– C does a good job with H if H’s improvement over Naïve is statistically significant with 95% confidence
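The scoring step can be sketched as follows, given the true test labels and a classifier's predictions. Several choices here are assumptions: per-label F scores are macro-averaged, the classifier itself is not shown, and the statistical significance test is omitted. The function names are hypothetical.

```python
from collections import Counter


def f_measure(p, r):
    """F = 2pr / (p + r), defined as 0 when both are 0."""
    return 2 * p * r / (p + r) if p + r else 0.0


def precision_recall(truth, predicted, label):
    tp = sum(1 for t, p in zip(truth, predicted) if t == p == label)
    n_pred = sum(1 for p in predicted if p == label)
    n_true = sum(1 for t in truth if t == label)
    return (tp / n_pred if n_pred else 0.0, tp / n_true if n_true else 0.0)


def macro_f(truth, predicted):
    """Average the per-label F scores (macro-averaging is an assumption)."""
    labels = sorted(set(truth))
    return sum(f_measure(*precision_recall(truth, predicted, l)) for l in labels) / len(labels)


def naive_predictions(train_labels, n):
    """Naive classifier: always guess the most frequent training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n


truth = [1, 1, 2, 2]              # true 'type' labels on the test set
learned = [1, 1, 2, 1]            # predictions of a trained classifier H
naive = naive_predictions([1, 1, 1, 2], len(truth))
score_h = macro_f(truth, learned)
score_naive = macro_f(truth, naive)
```

In this sample, H scores well above the naive baseline; the method on the slide would additionally require that gap to be statistically significant.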







Target Classifier Intuition

Train a new classifier, T, treating each target schema attribute as a class of documents

Check source values against this classifier

Label each value with best guess label

Use labels instead of values in the same framework

Book.comment

Book.comment

Music.label

Handling Disjunctive Conditions

Why Disjuncts? What if type field had separate categories for hardback and paperback?

Two approaches to handling disjunctive conditions, “early” and “late”

Early Disjuncts

– InferCandidateViews is responsible for identifying “interesting” disjuncts

– Each interesting disjunct is evaluated separately, no overlapping conditions are output

Late Disjuncts

– InferCandidateViews returns no disjuncts

– All high-scoring conditions are unioned together (Clio semantics), effectively creating a disjunct

Early Disjuncts: A Heuristic Approach

When evaluating trained classifier on test set for some categorical attribute C, make note of misclassifications of the form “should be A, but guessed B”

Consider merging the (A,B) pair that would repair most errors

– by merge, we mean “replace” A and B values with (A,B)

Re-evaluate

Repeat

Keep all alternatives formed this way that score well

Only accept 1 view that mentions attribute C (don’t union)
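One round of the merge heuristic can be sketched as below. It treats the (A,B) pair as unordered, which is an assumption, and it shows only the "find pair, merge" step; re-evaluating and repeating would wrap this in a loop. The names `best_merge` and `merge_labels` are hypothetical.

```python
from collections import Counter


def best_merge(truth, predicted):
    """Return the label pair whose merge would repair the most misclassifications."""
    confusions = Counter(
        frozenset((t, p)) for t, p in zip(truth, predicted) if t != p
    )
    if not confusions:
        return None
    pair, _ = confusions.most_common(1)[0]
    return tuple(sorted(pair))


def merge_labels(labels, pair):
    """Replace both values of the pair with one merged value, e.g. "A|B"."""
    merged = "|".join(str(x) for x in pair)
    return [merged if l in pair else l for l in labels]


truth = ["BOOK1", "BOOK2", "BOOK1", "CD1", "CD1", "BOOK2"]
guess = ["BOOK2", "BOOK1", "BOOK1", "CD1", "BOOK1", "BOOK2"]
pair = best_merge(truth, guess)
merged_truth = merge_labels(truth, pair)
```

Here the classifier confuses BOOK1 and BOOK2 twice and CD1 and BOOK1 once, so the two book categories are merged first.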

Handling Conjuncts

Proposed Approach:

– Assumes that a good conjunctive view has a good disjunctive view as one of the terms in the conjunct.

Run Context Match Repeatedly

At stage i, consider views VC identified by the previous (i-1)th run as the input base tables

– where C was the select condition defining the view

When considering candidate attributes for a run, only consider categorical attributes not in C.

(Conjunct handling not in current experiments)

Selecting Contextual Matches

Each view V based on condition c is evaluated, rather than each match

Compute overall confidence of matches from V, and compare to overall confidence from base table

If overall confidence is better than w, use V instead of the base table

If more than one qualifies

– If EarlyDisjunct, choose the best

– Else, take all that are over w
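The selection rule can be sketched as below. It reads "better than w" as "the view's overall confidence exceeds the base table's by more than w", which is an interpretation based on w being called an improvement threshold elsewhere in the talk. The function name `select_views` and the sample confidences are hypothetical.

```python
def select_views(base_confidence, view_confidence, w, early_disjunct):
    """Pick the views that should replace the base table.

    view_confidence maps a view's condition to its overall match confidence.
    An empty result means: keep using the base table.
    """
    qualifying = sorted(
        (v for v, c in view_confidence.items() if c - base_confidence > w),
        key=lambda v: view_confidence[v],
        reverse=True,
    )
    if early_disjunct:
        return qualifying[:1]   # choose the single best view
    return qualifying           # late disjuncts: take all that qualify


views = {"type = 1": 0.90, "type = 2": 0.85, "instock = y": 0.62}
early = select_views(0.60, views, w=0.1, early_disjunct=True)
late = select_views(0.60, views, w=0.1, early_disjunct=False)
```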

Comments on Schema Mapping

Seek to apply the Clio ([Popa et al, VLDB 2002]) approach to mapping construction

Create ‘logical tables’ based on key-foreign key constraints

Two challenges

– Extend notion of foreign-key constraints in context of selection views, undecidability result

– Extend join rules of [Popa et al, VLDB 2002] to handle the selection views

See paper for details

Overview

1. Motivation

2. Background

3. Strawman

4. Framework

5. Experimental Evaluation

6. Related Work

7. Conclusion

Experimental Study

Used schemas from the retail domain

– schemas created by students at UW

• Aaron, Ryan, Barrett

– Populated code, descr info by scraping web-sites, used some name data from Illinois Semantic Integration Archive

ItemType is split, so that instead of just CD, BOOK

– e.g. CD1, CD2, BOOK1, BOOK2, =4

Compare matched edges to correct edges

– Accuracy: how many of BOOKi edges go to book target table?

– Precision: of the BOOKi edges, how many go to book target?

– Fmeas: 2(Accuracy * Precision)/(Accuracy + Precision)

View improvement threshold: w

[Figure panels: Aaron, Barrett, Ryan, Strawman]

How sensitive is technique to w?

Depends on disjunct strategy

Easier to pick w with EarlyDisjunct

Strawman means

– Late disjunct (EarlyDisjunct=false)

– Pick best arrow from each source attribute on per-attribute basis (MultiTable)

Sensitivity to Decoy Categorical Attributes

[Figure panels: EarlyDisjunct, LateDisjunct]

Add 3 extra categorical attributes

Vary their correlation with ItemType (higher correlation makes it harder)

Naïve is not only slow, it is overly confusing to the quality metrics

EarlyDisjunct heuristic based on classification helps with quality

Varying schema size

Add n non-categorical attributes to every table, all taken from same domain

Add n/4 categorical attributes to tables with categorical attributes

Early dip is before non-categorical attributes match each other

Runtime as schema gets larger

Same experiment, compare runtimes

TgtClass is somewhat higher quality (not shown), but takes much longer for large schemas

Grades Example

Create an experiment based on grades example

Artificial data

– mean of assignment I is 40 + 10(I-1) (as grades improve)

– standard deviation is varied

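[Figure: source table (Name, Assgn, Grade) and target table (Name, Grade1, Grade2, Grade3, …) with sample names Joe, Mary, Bob, Sue, Assgn values 1, 2, 3 and Grade values 84, 86, 75, 92, 94, 85. The arrows into Grade1, Grade2, Grade3, … carry the conditions "where Assgn=1", "where Assgn=2", "where Assgn=3", …]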
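The artificial grade data can be generated along these lines. The slide fixes only the mean, 40 + 10(I-1), and says the standard deviation is varied; drawing grades from a normal distribution is an assumption, and the names `assignment_mean` and `make_grades` are hypothetical.

```python
import random


def assignment_mean(i):
    """Mean grade of assignment i: 40 + 10(i - 1)."""
    return 40 + 10 * (i - 1)


def make_grades(names, n_assignments, std_dev, seed=0):
    """Build (Name, Assgn, Grade) rows; a normal distribution is assumed."""
    rng = random.Random(seed)
    return [
        {"Name": n, "Assgn": i, "Grade": rng.gauss(assignment_mean(i), std_dev)}
        for n in names
        for i in range(1, n_assignments + 1)
    ]


sample = make_grades(["Joe", "Mary"], n_assignments=3, std_dev=5.0)
```

As the standard deviation grows, the grade distributions of neighbouring assignments overlap more, which is what makes the contextual matches harder to recover.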

Grades accuracy as std. dev increases

Overview

1. Motivation

2. Background

3. Strawman

4. Framework

5. Experimental Evaluation

6. Related Work

7. Conclusion

Related Work

Instance level schema matching

– Survey [Rahm, Bernstein, VLDB Journal 2001], Coma [Do, Rahm, VLDB 02], Coma++ [SIGMOD 05], iMAP [Doan et al, SIGMOD 01], Cupid [Madhavan, Bernstein, Rahm, VLDB 01], etc.

Schema mapping

– Clio [Popa, et al, VLDB 02], [Haas et al, SIGMOD 2005], etc

– Model Management (many papers)

Overcoming heterogeneity during match process

– Schema Mapping as Query Discovery [Miller, Haas, Hernandez, VLDB 2000] - present user with examples to derive join conditions

– MIQIS [Fletcher, Wyss, (demo) SIGMOD 2005] - search through a large space of schema transformations (beyond what is given here), but requires the same data to appear in both source and target

– We focus on inferring selection views only, but are very compatible with existing schema match work

Conclusions

Contributions

– Introduced contextual matching as an important extension to schema matching

– Defined a general framework in which instance-level match technique is treated as a black box

– Identified two techniques based on classification to find good conditions

– Identified filtering criteria for contextual matches

– Defined contextual foreign keys and new join rules to extend a Clio-style schema mapper to better handle contextual matches

– Experimental study illustrating time/quality tradeoffs

Future Work

– More complex view conditioning (horizontal partitioning + attribute promotion)

– Consider taking constraints on target into account in quality functions

The End

Thank you, any questions?


Standard Match Algorithm

StandardMatch(RS,RT,τ)

– Evaluate quality of match between all pairs of (source, target) attributes

• Ignore complex (multi-attribute) matches for simplicity

– Return matches between source table RS and target schema RT whose confidence is >= the threshold τ

RS.ac RT1.ac true M

RS.ba RT1.cd true M

RS.ba RT1.sb true M

RS.db RT1.ar true M

RS.ac RT1.vw true M

RS.bd RT1.ad true M

RS.af RT1.ca true M

Background: Instance-level matching

Instance-level schema matching requires sample data for source and target schema

Train a variety of matchers by treating each (source, target) column as a set of documents labeled by the column name

– e.g. text matchers based on string similarity, token similarity, format similarity, number of tokens, etc, or

– numeric matchers based on value distribution, etc.

Apply source matchers to sample target data, and vice versa

Combine resulting scores (with machine-learned weightings [Doan, Domingos, Halevy, SIGMOD 2001]) to score each arrow

RS.ac RT1.bb true M

score

RS.ac RT1.bb true M

“perfect match”

