eTuner: Tuning Schema Matching Software using Synthetic Scenarios

Mayssam Sayyadian, Yoonkyong Lee, AnHai Doan (University of Illinois, USA)
Arnon Rosenthal (MITRE Corp., USA)
Main Points
Tuning matching systems: a long-standing problem
– and one that is becoming increasingly worse
We propose a principled solution
– exploits synthetic input/output pairs
– promising, though much work remains
The idea is applicable to other contexts
Schema Matching
[Figure: matching two real-estate schemas.]
Schema 1: listed-price | contact-name | city | state
  320K | Jane Brown | Seattle | WA
  240K | Mike Smith | Miami | FL
Schema 2: price | agent-name | address
  120,000 | George Bush | Crawford, TX
  239,900 | Hillary Clinton | New York City, NY
1-1 match: e.g., listed-price = price
Complex match: e.g., address = concat(city, state)
Schema Matching is Ubiquitous
Databases
– data integration, model management
– data translation, collaborative data sharing
– keyword querying, schema/view integration
– data warehousing, peer data management, ...
AI
– knowledge bases, ontology merging, information-gathering agents, ...
Web
– e-commerce, Deep Web, Semantic Web
Also eGovernment, bioinformatics, scientific data management
Current State of Affairs
Finding semantic mappings is now a key bottleneck!
– largely done by hand; labor intensive and error prone
Numerous matching techniques have been developed
– Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U Leipzig, U Wisconsin, NCSU, U Illinois, Washington, Humboldt-Universität zu Berlin, ...
– AI: Stanford, Karlsruhe University, NEC Japan, ...
Techniques are often synergistic, leading to multi-component matching architectures
– each component employs a particular technique
– final predictions combine those of the components
An Example: LSD [SIGMOD-01]
Schema 1: address ("Urbana, IL", "Seattle, WA"), agent-name ("James Smith", "Mike Doan")
Schema 2: area ("Peoria, IL", "Kent, WA"), contact-agent ("(206) 634 9435", "(617) 335 4243")
[Figure: the LSD pipeline. A Name Matcher and a Naive Bayes Matcher each assign confidence scores to candidate attribute pairs; a Combiner merges their scores.]
Combiner output:
– area => (address, 0.7), (description, 0.3)
– contact-agent => (agent-phone, 0.7), (agent-name, 0.3)
– comments => (address, 0.6), (desc, 0.4)
The Constraint Enforcer then applies integrity constraints, e.g., "only one attribute of Schema 2 matches address", and the Match Selector outputs the final matches:
– area = address
– contact-agent = agent-phone
– ...
– comments = desc
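The pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not LSD's actual code: the two "matchers" are toy stand-ins (a token-overlap name matcher and a crude data-value matcher, not LSD's real Name and Naive Bayes matchers), the combiner simply averages, and the selector picks the best-scoring target per source attribute.

```python
# Minimal sketch of a multi-component matching pipeline in the style of LSD.
# The matchers below are toy stand-ins, not LSD's actual components.

def name_matcher(src_attr, tgt_attr):
    """Score by word overlap between hyphen-separated attribute names (0..1)."""
    a, b = set(src_attr.split("-")), set(tgt_attr.split("-"))
    return len(a & b) / len(a | b)

def data_matcher(src_vals, tgt_vals):
    """Toy data-based score: fraction of shared tokens in sample values."""
    a = {tok for v in src_vals for tok in v.split()}
    b = {tok for v in tgt_vals for tok in v.split()}
    return len(a & b) / max(len(a | b), 1)

def combiner(scores):
    """Average the per-matcher scores (the combination function is one knob)."""
    return sum(scores) / len(scores)

def match_selector(predictions):
    """Pick, for each source attribute, the best-scoring target attribute."""
    return {s: max(cands, key=cands.get) for s, cands in predictions.items()}

# Source and target attributes with sample data, as in the slide.
schema1 = {"area": ["Peoria, IL", "Kent, WA"],
           "contact-agent": ["(206) 634 9435", "(617) 335 4243"]}
schema2 = {"address": ["Urbana, IL", "Seattle, WA"],
           "agent-name": ["James Smith", "Mike Doan"]}

predictions = {}
for s, s_vals in schema1.items():
    predictions[s] = {}
    for t, t_vals in schema2.items():
        scores = [name_matcher(s, t), data_matcher(s_vals, t_vals)]
        predictions[s][t] = combiner(scores)

final = match_selector(predictions)  # e.g., maps "area" to "address"
```

Even this toy version shows why tuning matters: swapping the average combiner for min, max, or a weighted sum, or changing how matchers tokenize, can change the final matches.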
Multi-Component Matching Solutions
Such systems are very powerful ...
– maximize accuracy; highly customizable to individual domains
... but place a serious tuning burden on domain users
[Figure: the component stacks of LSD, COMA, SF, and LSD-SF; each combines base matchers (Matcher 1 ... Matcher n) with some subset of a combiner, a constraint enforcer, and a match selector.]
Developed in many recent works
– e.g., Doan et al., WebDB-00, SIGMOD-01; Do & Rahm, VLDB-02; Embley et al., 02; Bernstein et al., SIGMOD Record-04; Madhavan et al., 05
Now commonly adopted, with industrial-strength systems
– e.g., Protoplasm [MSR], COMA++ [Univ of Leipzig]
Tuning Schema Matching Systems
[Figure: a library of matching components feeding an execution graph (Matcher 1 ... Matcher n → Combiner → Constraint enforcer → Match selector).]
Library of matching components:
– Matchers: q-gram name matcher, TF/IDF name matcher, decision tree matcher, Naive Bayes matcher, SVM matcher
– Combiners: average combiner, min combiner, max combiner, weighted-sum combiner
– Constraint enforcers: A* search enforcer, relaxation labeler, ILP
– Match selectors: threshold selector, bipartite graph selector
– Knobs of the decision tree matcher: characteristics of attributes, post-prune?, size of validation set, split measure, ...
Given a particular matching situation
– how to select the right components?
– how to adjust the multitude of knobs?
Untuned versions produce inferior accuracy, however ...
... Tuning is Extremely Difficult
Large number of knobs
– e.g., 8-29 in our experiments
Wide variety of techniques
– database, machine learning, IR, information theory, etc.
Complex interaction among components
Not clear how to compare the quality of knob configs
Matching systems are still tuned manually, by trial and error
– multi-component systems make tuning even worse
Developing efficient tuning techniques is crucial to making matching systems attractive in practice
The eTuner Solution
Given schema S and matching system M, eTuner
– tunes M to maximize the average accuracy of matching S with future schemas
– incurs virtually no cost to the user
Key challenge 1: evaluation
– we must search for the "best" knob config, so how do we compute the quality of any knob config C?
– if we knew the "ground-truth" matches for a representative workload W = {(S,T1), ..., (S,Tn)}, we could use W to evaluate C
– but we often have no such W
Key challenge 2: search
– how to efficiently explore the huge space of knob configs?
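Once a workload with ground-truth matches exists, evaluating a knob config is straightforward: run the system under that config on each schema pair and average an accuracy measure such as F1. A minimal sketch, where `run_system` is a hypothetical hook (not part of eTuner's published API) that runs the matching system under a given config and returns its predicted matches as a set of attribute pairs:

```python
# Sketch: scoring a knob configuration against a workload with known matches.
# `run_system(config, s, t)` is a hypothetical hook returning predicted
# matches as a set of (source-attribute, target-attribute) pairs.

def f1(predicted, truth):
    """F1 of a predicted match set against the ground-truth match set."""
    if not predicted or not truth:
        return 0.0
    tp = len(predicted & truth)          # true positives
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)

def config_quality(run_system, config, workload):
    """Average F1 of the system under `config` over (S, Ti, truth) cases."""
    scores = [f1(run_system(config, s, t), truth) for (s, t, truth) in workload]
    return sum(scores) / len(scores)
```

The search then reduces to maximizing `config_quality` over the space of configs, which is exactly where the staged tuner (later slides) comes in.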
Key Idea: Generate Synthetic Input/Output Pairs
We need a workload W = {(S,T1), (S,T2), ..., (S,Tn)}
To generate W
– start with S
– perturb S to generate T1
– perturb S to generate T2
– etc.
Because we know the perturbation, we know the matches between S and each Ti.
Key Idea: Generate Synthetic Input/Output Pairs (continued)
[Figure: a worked example of generating one synthetic pair.]
Start with schema S, with table EMPLOYEES(id, first, last, salary ($)):
  1 | Bill | Laup  | 40,000 $
  2 | Mike | Brown | 60,000 $
  3 | Jean | Ann   | 30,000 $
  4 | Roy  | Bond  | 70,000 $
Split S into V and U with disjoint data tuples:
  V: EMPLOYEES with tuples 1-2
  U: EMPLOYEES with tuples 3-4
Then perturb V step by step:
– perturb # of columns in each table: drop first => EMPLOYEES(last, id, salary($))
– perturb column and table names: => EMPS(emp-last, id, wage)
– perturb data tuples in each table: e.g., 40,000$ => 45200, 60,000$ => 59328
– (a further rule perturbs the # of tables)
The result is V1 = EMPS(emp-last, id, wage) together with Ω1, a set of semantic matches known by construction:
  EMPS.emp-last = EMPLOYEES.last
  EMPS.id = EMPLOYEES.id
  EMPS.wage = EMPLOYEES.salary($)
Repeating with different perturbations yields (U, Ω1, V1), ..., (U, Ωn, Vn).
Examples of Perturbation Rules
Number of tables
– merge two tables based on a join path
– split a table into two
Structure of a table
– merge two columns
– e.g., neighboring columns, or columns sharing a prefix/suffix (last-name, first-name)
– drop a column
– swap the locations of two columns
Names of tables/columns
– rules capture common name transformations
– abbreviation to the first 3-4 characters, dropping all vowels, synonyms, dropping prefixes, adding the table name to a column name, etc.
Data values
– rules capture common format transformations: 12/4 => Dec 4
– values are changed based on some distribution (e.g., Gaussian)
See paper for details
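Two of the rules above (dropping a column, abbreviating a name) are easy to sketch. This is an illustrative toy, not eTuner's rule engine: the drop probability, the 4-character abbreviation, and the rule mix are arbitrary choices made for the sketch. The point is structural: because we apply the perturbation ourselves, the ground-truth matches fall out for free.

```python
# Sketch of two perturbation rules: drop a column, abbreviate a column name.
# Applying the perturbation ourselves yields the match set by construction.
import random

def abbreviate(name):
    """Name rule: abbreviate to the first 4 characters."""
    return name[:4]

def perturb(columns, rng):
    """Perturb a schema (a list of column names).
    Returns (new_columns, matches) where matches maps each perturbed
    column name back to the original it was derived from."""
    kept = [c for c in columns if rng.random() > 0.3]      # rule: drop a column
    new_cols, matches = [], {}
    for c in kept:
        new = abbreviate(c) if rng.random() < 0.5 else c   # rule: rename
        new_cols.append(new)
        matches[new] = c                                   # known by construction
    return new_cols, matches

rng = random.Random(0)                                     # reproducible
schema = ["id", "first", "last", "salary"]
workload = [perturb(schema, rng) for _ in range(3)]        # (Ti, Omega_i) pairs
```

A real generator also perturbs tables, column order, and data values, and composes rules; the returned `matches` dicts play the role of the Ω_i sets in the slides.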
The eTuner Architecture
[Figure: Schema S feeds the Workload Generator, which applies Perturbation Rules to produce a synthetic workload (U, Ω1, V1), (U, Ω2, V2), ..., (U, Ωn, Vn). The Staged Tuner then runs Tuning Procedures over this workload and the matching tool M, with optional user input, and outputs the tuned matching tool M.]
The Staged Tuner
[Figure: a four-level execution graph (Level 1: Matcher 1 ... Matcher n; Level 2: Combiner; Level 3: Constraint enforcer; Level 4: Match selector), with tuning proceeding upward from Level 1.]
Tune sequentially, starting with the lowest-level components.
Assume
– the execution graph has k levels, with m nodes per level
– each node can be assigned one of n components
– each component has p knobs, each of which takes q values
Then staged tuning examines only npqkm of the (npq)^(km) possible knob configs.
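The staged strategy is essentially a greedy, level-by-level search: fix all other levels, pick the best option for the current level, and move up. A minimal sketch under simplifying assumptions (one node per level, and a hypothetical `evaluate` hook returning the accuracy of a full assignment, e.g. measured on the synthetic workload):

```python
# Sketch of staged tuning: greedily tune one level at a time, holding the
# other levels fixed. `evaluate` is a hypothetical hook that scores a full
# assignment of one (component, knob-config) choice per level.

def staged_tune(levels, evaluate):
    """levels: one list of candidate choices per level, lowest level first.
    Returns the chosen option for each level."""
    chosen = [options[0] for options in levels]          # arbitrary start
    for i, options in enumerate(levels):                 # tune level by level
        chosen[i] = max(
            options,
            key=lambda opt: evaluate(chosen[:i] + [opt] + chosen[i + 1:]),
        )
    return chosen
```

The cost is the *sum* of the option counts per level rather than their *product*, which mirrors the npqkm vs. (npq)^(km) figure above; the price is that, like any greedy search, it can miss jointly optimal combinations across levels.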
Empirical Evaluation
Domains:
  Domain      | # schemas | # tables/schema | # attributes/schema | # tuples/table | reference paper
  Real Estate |     5     |        2        |         30          |      1000      | LSD (SIGMOD'01)
  Courses     |     5     |        3        |         13          |        50      | LSD
  Inventory   |    10     |        4        |         10          |        20      | Corpus (ICDE'05)
  Product     |     2     |        2        |         50          |       120      | iMAP (SIGMOD'04)
Matching systems:
– LSD: 6 matchers, 6 combiners, 1 constraint enforcer, 2 match selectors, 21 knobs
– COMA: 10 matchers, 4 combiners, 2 match selectors, 20 knobs
– SF: 3 matchers, 1 constraint enforcer, 2 match selectors, 8 knobs
– LSD-SF: 7 matchers, 7 combiners, 1 constraint enforcer, 2 match selectors, 29 knobs
Matching Accuracy
[Figure: bar charts of matching accuracy (0 to 0.9) on the Course, Inventory, Product, and Real Estate domains for LSD, COMA, SF, and LSD-SF, comparing off-the-shelf, domain-independent, domain-dependent, and source-dependent baselines against eTuner (automatic) and eTuner (human-assisted).]
eTuner achieves higher accuracy than current best methods, at virtually no cost to the user.
Cost of Using eTuner
You have a schema S and a matching system M. Two deployment scenarios:
Vendor supplies eTuner
– the user hooks it up with matching system M
Vendor supplies a matching system M
– with eTuner bundled inside
Sensitivity Analysis
Two factors studied:
– adding perturbation rules
– exploiting prior match results (enriching the workload)
[Figure: left, accuracy (F1) vs. the number of schemas in the synthetic workload (1 to 50), on average and for the Inventory and Real Estate domains; right, accuracy of tuned LSD vs. the percentage of previous matches in the collection (0% to 88%).]
Summary: The eTuner Project @ Illinois
Tuning matching systems is crucial
– a long-standing problem that is getting worse
– a next logical step in schema matching research
eTuner provides an automatic and principled solution
– generates a synthetic workload and employs it to tune efficiently
– incurs virtually no cost to human users
– exploits user assistance whenever available
Extensive experiments over 4 domains with 4 systems
Future directions
– find the optimal synthetic workload
– apply to other matching scenarios
– adapt the ideas to scenarios beyond schema matching (see 3rd speaker)
Backup: User Assistance
Example: S(phone1, phone2, ...)
– generate V by dropping phone2: V(phone1, ...)
– rename phone1 in V: V(x, ...)
Problem:
– x matches phone1, but x does not match phone2
User assistance:
– group phone1 and phone2
– so if x matches phone1, it will also count as matching phone2
Intuition: tell the system not to bother trying to distinguish phone1 from phone2.
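The grouping idea can be expressed directly in the scoring step: a predicted match counts as correct if its target is the true target or any attribute in the same user-declared group. A minimal sketch (the group representation as a list of sets is an assumption of this sketch, not eTuner's internal format):

```python
# Sketch: treating user-declared attribute groups as interchangeable when
# scoring predicted matches, per the phone1/phone2 example above.

def same_group(a, b, groups):
    """True if attributes a and b are identical or share a declared group."""
    return a == b or any(a in g and b in g for g in groups)

groups = [{"phone1", "phone2"}]
# x truly corresponds to phone1; predicting phone2 should also count.
assert same_group("phone2", "phone1", groups)
assert not same_group("phone2", "address", groups)
```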