eTuner: Tuning Schema Matching Software using Synthetic Scenarios

Mayssam Sayyadian, Yoonkyong Lee, AnHai Doan (University of Illinois, USA)
Arnon Rosenthal (MITRE Corp., USA)
Main Points
Tuning matching systems: a long-standing problem
– and one that is becoming increasingly worse
We propose a principled solution
– exploits synthetic input/output pairs
– promising, though much work remains
The idea is applicable to other contexts
Schema Matching
[Figure: matching two real-estate schemas.]
Schema 1: listed-price | contact-name | city | state
  320K | Jane Brown | Seattle | WA
  240K | Mike Smith | Miami | FL
Schema 2: price | agent-name | address
  120,000 | George Bush | Crawford, TX
  239,900 | Hillary Clinton | New York City, NY
1-1 match: e.g., listed-price = price
Complex match: e.g., address = concat(city, state)
Schema Matching is Ubiquitous
Databases
– data integration, model management
– data translation, collaborative data sharing
– keyword querying, schema/view integration
– data warehousing, peer data management, ...
AI
– knowledge bases, ontology merging, information-gathering agents, ...
Web
– e-commerce, Deep Web, Semantic Web
Also eGovernment, bioinformatics, scientific data management
Current State of Affairs
Finding semantic mappings is now a key bottleneck!
– largely done by hand; labor intensive and error prone
Numerous matching techniques have been developed
– Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U Leipzig, U Wisconsin, NCSU, U Illinois, Washington, Humboldt-Universität zu Berlin, ...
– AI: Stanford, Karlsruhe University, NEC Japan, ...
Techniques are often synergistic, leading to multi-component matching architectures
– each component employs a particular technique
– final predictions combine those of the components
An Example: LSD [SIGMOD-01]
Schema 1: address ("Urbana, IL", "Seattle, WA"), agent-name ("James Smith", "Mike Doan")
Schema 2: area ("Peoria, IL", "Kent, WA"), contact-agent ("(206) 634 9435", "(617) 335 4243")
[Figure: the LSD pipeline. A Name Matcher and a Naive Bayes Matcher each assign confidence scores to candidate attribute pairs; a Combiner merges their scores.]
Combiner output:
– area => (address, 0.7), (description, 0.3)
– contact-agent => (agent-phone, 0.7), (agent-name, 0.3)
– comments => (address, 0.6), (desc, 0.4)
The Constraint Enforcer then applies integrity constraints, e.g., "only one attribute of Schema 2 matches address", and the Match Selector outputs the final matches:
– area = address
– contact-agent = agent-phone
– ...
– comments = desc
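The pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not LSD's actual code: the two "matchers" are toy stand-ins (a token-overlap name matcher and a crude data-value matcher, not LSD's real Name and Naive Bayes matchers), the combiner simply averages, and the selector picks the best-scoring target per source attribute.

```python
# Minimal sketch of a multi-component matching pipeline in the style of LSD.
# The matchers below are toy stand-ins, not LSD's actual components.

def name_matcher(src_attr, tgt_attr):
    """Score by word overlap between hyphen-separated attribute names (0..1)."""
    a, b = set(src_attr.split("-")), set(tgt_attr.split("-"))
    return len(a & b) / len(a | b)

def data_matcher(src_vals, tgt_vals):
    """Toy data-based score: fraction of shared tokens in sample values."""
    a = {tok for v in src_vals for tok in v.split()}
    b = {tok for v in tgt_vals for tok in v.split()}
    return len(a & b) / max(len(a | b), 1)

def combiner(scores):
    """Average the per-matcher scores (the combination function is one knob)."""
    return sum(scores) / len(scores)

def match_selector(predictions):
    """Pick, for each source attribute, the best-scoring target attribute."""
    return {s: max(cands, key=cands.get) for s, cands in predictions.items()}

# Source and target attributes with sample data, as in the slide.
schema1 = {"area": ["Peoria, IL", "Kent, WA"],
           "contact-agent": ["(206) 634 9435", "(617) 335 4243"]}
schema2 = {"address": ["Urbana, IL", "Seattle, WA"],
           "agent-name": ["James Smith", "Mike Doan"]}

predictions = {}
for s, s_vals in schema1.items():
    predictions[s] = {}
    for t, t_vals in schema2.items():
        scores = [name_matcher(s, t), data_matcher(s_vals, t_vals)]
        predictions[s][t] = combiner(scores)

final = match_selector(predictions)  # e.g., maps "area" to "address"
```

Even this toy version shows why tuning matters: swapping the average combiner for min, max, or a weighted sum, or changing how matchers tokenize, can change the final matches.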
Multi-Component Matching Solutions
Such systems are very powerful ...
– maximize accuracy; highly customizable to individual domains
... but place a serious tuning burden on domain users
[Figure: the component stacks of LSD, COMA, SF, and LSD-SF; each combines base matchers (Matcher 1 ... Matcher n) with some subset of a combiner, a constraint enforcer, and a match selector.]
Developed in many recent works
– e.g., Doan et al., WebDB-00, SIGMOD-01; Do & Rahm, VLDB-02; Embley et al., 02; Bernstein et al., SIGMOD Record-04; Madhavan et al., 05
Now commonly adopted, with industrial-strength systems
– e.g., Protoplasm [MSR], COMA++ [Univ of Leipzig]
Tuning Schema Matching Systems
[Figure: a library of matching components feeding an execution graph (Matcher 1 ... Matcher n → Combiner → Constraint enforcer → Match selector).]
Library of matching components:
– Matchers: q-gram name matcher, TF/IDF name matcher, decision tree matcher, Naive Bayes matcher, SVM matcher
– Combiners: average combiner, min combiner, max combiner, weighted-sum combiner
– Constraint enforcers: A* search enforcer, relaxation labeler, ILP
– Match selectors: threshold selector, bipartite graph selector
– Knobs of the decision tree matcher: characteristics of attributes, post-prune?, size of validation set, split measure, ...
Given a particular matching situation
– how to select the right components?
– how to adjust the multitude of knobs?
Untuned versions produce inferior accuracy, however ...
... Tuning is Extremely Difficult
Large number of knobs
– e.g., 8-29 in our experiments
Wide variety of techniques
– database, machine learning, IR, information theory, etc.
Complex interaction among components
Not clear how to compare the quality of knob configs
Matching systems are still tuned manually, by trial and error
– multi-component systems make tuning even worse
Developing efficient tuning techniques is crucial to making matching systems attractive in practice
The eTuner Solution
Given schema S and matching system M, eTuner
– tunes M to maximize the average accuracy of matching S with future schemas
– incurs virtually no cost to the user
Key challenge 1: evaluation
– we must search for the "best" knob config, so how do we compute the quality of any knob config C?
– if we knew the "ground-truth" matches for a representative workload W = {(S,T1), ..., (S,Tn)}, we could use W to evaluate C
– but we often have no such W
Key challenge 2: search
– how to efficiently explore the huge space of knob configs?
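Once a workload with ground-truth matches exists, evaluating a knob config is straightforward: run the system under that config on each schema pair and average an accuracy measure such as F1. A minimal sketch, where `run_system` is a hypothetical hook (not part of eTuner's published API) that runs the matching system under a given config and returns its predicted matches as a set of attribute pairs:

```python
# Sketch: scoring a knob configuration against a workload with known matches.
# `run_system(config, s, t)` is a hypothetical hook returning predicted
# matches as a set of (source-attribute, target-attribute) pairs.

def f1(predicted, truth):
    """F1 of a predicted match set against the ground-truth match set."""
    if not predicted or not truth:
        return 0.0
    tp = len(predicted & truth)          # true positives
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)

def config_quality(run_system, config, workload):
    """Average F1 of the system under `config` over (S, Ti, truth) cases."""
    scores = [f1(run_system(config, s, t), truth) for (s, t, truth) in workload]
    return sum(scores) / len(scores)
```

The search then reduces to maximizing `config_quality` over the space of configs, which is exactly where the staged tuner (later slides) comes in.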
Key Idea: Generate Synthetic Input/Output Pairs
We need a workload W = {(S,T1), (S,T2), ..., (S,Tn)}
To generate W
– start with S
– perturb S to generate T1
– perturb S to generate T2
– etc.
Because we know the perturbation, we know the matches between S and each Ti.
Key Idea: Generate Synthetic Input/Output Pairs (continued)
[Figure: a worked example of generating one synthetic pair.]
Start with schema S, with table EMPLOYEES(id, first, last, salary ($)):
  1 | Bill | Laup  | 40,000 $
  2 | Mike | Brown | 60,000 $
  3 | Jean | Ann   | 30,000 $
  4 | Roy  | Bond  | 70,000 $
Split S into V and U with disjoint data tuples:
  V: EMPLOYEES with tuples 1-2
  U: EMPLOYEES with tuples 3-4
Then perturb V step by step:
– perturb # of columns in each table: drop first => EMPLOYEES(last, id, salary($))
– perturb column and table names: => EMPS(emp-last, id, wage)
– perturb data tuples in each table: e.g., 40,000$ => 45200, 60,000$ => 59328
– (a further rule perturbs the # of tables)
The result is V1 = EMPS(emp-last, id, wage) together with Ω1, a set of semantic matches known by construction:
  EMPS.emp-last = EMPLOYEES.last
  EMPS.id = EMPLOYEES.id
  EMPS.wage = EMPLOYEES.salary($)
Repeating with different perturbations yields (U, Ω1, V1), ..., (U, Ωn, Vn).
Examples of Perturbation Rules
Number of tables
– merge two tables based on a join path
– split a table into two
Structure of a table
– merge two columns
– e.g., neighboring columns, or columns sharing a prefix/suffix (last-name, first-name)
– drop a column
– swap the locations of two columns
Names of tables/columns
– rules capture common name transformations
– abbreviation to the first 3-4 characters, dropping all vowels, synonyms, dropping prefixes, adding the table name to a column name, etc.
Data values
– rules capture common format transformations: 12/4 => Dec 4
– values are changed based on some distribution (e.g., Gaussian)
See paper for details
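Two of the rules above (dropping a column, abbreviating a name) are easy to sketch. This is an illustrative toy, not eTuner's rule engine: the drop probability, the 4-character abbreviation, and the rule mix are arbitrary choices made for the sketch. The point is structural: because we apply the perturbation ourselves, the ground-truth matches fall out for free.

```python
# Sketch of two perturbation rules: drop a column, abbreviate a column name.
# Applying the perturbation ourselves yields the match set by construction.
import random

def abbreviate(name):
    """Name rule: abbreviate to the first 4 characters."""
    return name[:4]

def perturb(columns, rng):
    """Perturb a schema (a list of column names).
    Returns (new_columns, matches) where matches maps each perturbed
    column name back to the original it was derived from."""
    kept = [c for c in columns if rng.random() > 0.3]      # rule: drop a column
    new_cols, matches = [], {}
    for c in kept:
        new = abbreviate(c) if rng.random() < 0.5 else c   # rule: rename
        new_cols.append(new)
        matches[new] = c                                   # known by construction
    return new_cols, matches

rng = random.Random(0)                                     # reproducible
schema = ["id", "first", "last", "salary"]
workload = [perturb(schema, rng) for _ in range(3)]        # (Ti, Omega_i) pairs
```

A real generator also perturbs tables, column order, and data values, and composes rules; the returned `matches` dicts play the role of the Ω_i sets in the slides.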
The eTuner Architecture
[Figure: Schema S feeds the Workload Generator, which applies Perturbation Rules to produce a synthetic workload (U, Ω1, V1), (U, Ω2, V2), ..., (U, Ωn, Vn). The Staged Tuner then runs Tuning Procedures over this workload and the matching tool M, with optional user input, and outputs the tuned matching tool M.]
The Staged Tuner
[Figure: a four-level execution graph (Level 1: Matcher 1 ... Matcher n; Level 2: Combiner; Level 3: Constraint enforcer; Level 4: Match selector), with tuning proceeding upward from Level 1.]
Tune sequentially, starting with the lowest-level components.
Assume
– the execution graph has k levels, with m nodes per level
– each node can be assigned one of n components
– each component has p knobs, each of which takes q values
Then staged tuning examines only npqkm of the (npq)^(km) possible knob configs.
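The staged strategy is essentially a greedy, level-by-level search: fix all other levels, pick the best option for the current level, and move up. A minimal sketch under simplifying assumptions (one node per level, and a hypothetical `evaluate` hook returning the accuracy of a full assignment, e.g. measured on the synthetic workload):

```python
# Sketch of staged tuning: greedily tune one level at a time, holding the
# other levels fixed. `evaluate` is a hypothetical hook that scores a full
# assignment of one (component, knob-config) choice per level.

def staged_tune(levels, evaluate):
    """levels: one list of candidate choices per level, lowest level first.
    Returns the chosen option for each level."""
    chosen = [options[0] for options in levels]          # arbitrary start
    for i, options in enumerate(levels):                 # tune level by level
        chosen[i] = max(
            options,
            key=lambda opt: evaluate(chosen[:i] + [opt] + chosen[i + 1:]),
        )
    return chosen
```

The cost is the *sum* of the option counts per level rather than their *product*, which mirrors the npqkm vs. (npq)^(km) figure above; the price is that, like any greedy search, it can miss jointly optimal combinations across levels.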
Empirical Evaluation
Domains:
  Domain      | # schemas | # tables/schema | # attributes/schema | # tuples/table | reference paper
  Real Estate |     5     |        2        |         30          |      1000      | LSD (SIGMOD'01)
  Courses     |     5     |        3        |         13          |        50      | LSD
  Inventory   |    10     |        4        |         10          |        20      | Corpus (ICDE'05)
  Product     |     2     |        2        |         50          |       120      | iMAP (SIGMOD'04)
Matching systems:
– LSD: 6 matchers, 6 combiners, 1 constraint enforcer, 2 match selectors, 21 knobs
– COMA: 10 matchers, 4 combiners, 2 match selectors, 20 knobs
– SF: 3 matchers, 1 constraint enforcer, 2 match selectors, 8 knobs
– LSD-SF: 7 matchers, 7 combiners, 1 constraint enforcer, 2 match selectors, 29 knobs
Matching Accuracy
[Figure: bar charts of matching accuracy (0 to 0.9) on the Course, Inventory, Product, and Real Estate domains for LSD, COMA, SF, and LSD-SF, comparing off-the-shelf, domain-independent, domain-dependent, and source-dependent baselines against eTuner (automatic) and eTuner (human-assisted).]
eTuner achieves higher accuracy than current best methods, at virtually no cost to the user.
Cost of Using eTuner
You have a schema S and a matching system M. Two deployment scenarios:
Vendor supplies eTuner
– the user hooks it up with matching system M
Vendor supplies a matching system M
– with eTuner bundled inside
Sensitivity Analysis
Two factors studied:
– adding perturbation rules
– exploiting prior match results (enriching the workload)
[Figure: left, accuracy (F1) vs. the number of schemas in the synthetic workload (1 to 50), on average and for the Inventory and Real Estate domains; right, accuracy of tuned LSD vs. the percentage of previous matches in the collection (0% to 88%).]
Summary: The eTuner Project @ Illinois
Tuning matching systems is crucial
– a long-standing problem that is getting worse
– a next logical step in schema matching research
eTuner provides an automatic and principled solution
– generates a synthetic workload and employs it to tune efficiently
– incurs virtually no cost to human users
– exploits user assistance whenever available
Extensive experiments over 4 domains with 4 systems
Future directions
– find the optimal synthetic workload
– apply to other matching scenarios
– adapt the ideas to scenarios beyond schema matching (see 3rd speaker)
Backup: User Assistance
Example: S(phone1, phone2, ...)
– generate V by dropping phone2: V(phone1, ...)
– rename phone1 in V: V(x, ...)
Problem:
– x matches phone1, but x does not match phone2
User assistance:
– group phone1 and phone2
– so if x matches phone1, it will also count as matching phone2
Intuition: tell the system not to bother trying to distinguish phone1 from phone2.
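The grouping idea can be expressed directly in the scoring step: a predicted match counts as correct if its target is the true target or any attribute in the same user-declared group. A minimal sketch (the group representation as a list of sets is an assumption of this sketch, not eTuner's internal format):

```python
# Sketch: treating user-declared attribute groups as interchangeable when
# scoring predicted matches, per the phone1/phone2 example above.

def same_group(a, b, groups):
    """True if attributes a and b are identical or share a declared group."""
    return a == b or any(a in g and b in g for g in groups)

groups = [{"phone1", "phone2"}]
# x truly corresponds to phone1; predicting phone2 should also count.
assert same_group("phone2", "phone1", groups)
assert not same_group("phone2", "address", groups)
```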