iMAP: Discovering Complex Semantic Matches Between Database Schemas
Ohad Edry
January 2009
Seminar in Databases
Motivation
Consider a union of the databases of two banks.
We need to generate a mapping between the schemas.
Bank A tables:
  (Id, Name, City, Street, House Number)
  (Id, Account number, Account status)
Bank B tables:
  (Id, First name, Last name, Address, Account, Account status)
Introduction
• Semantic mappings specify the relationships between data stored in disparate sources.
• A semantic mapping relates an attribute of the target schema to attributes of the source schema according to their semantics.
Motivation – Example continued
[Figure: the Bank A and Bank B tables, annotated "Semantic Mapping!"]
Introduction
• Most of the work in this field has focused on the matching process.
• Matches fall into two types:
  • 1-1 matching: one attribute corresponds to one attribute.
  • Complex matching: a combination of attributes in one schema corresponds to a combination of attributes in the other schema.
• Match candidate: any proposed matching between attributes of the source and target schemas.
Motivation – Example continued
[Figure: the same Bank A and Bank B tables, annotated "Semantic Mapping!", with one 1-1 matching candidate and one complex matching candidate marked]
Introduction - examples
Example 1:
  Name     Address    Phone
  Ohad     Haifa      12345
  David    Tel-Aviv   13579

  Student  Location   Cellular
  Eyal     Haifa      1234567
  Miri     Tel-Aviv   2345678
  1-1 matching: Name = Student, Address = Location, Phone = Cellular.
Example 2:
  Company A: (Product ID, Product name, Price), (Product ID, Discount)
  Company B: (Product ID, Name, Product Price)
  Complex matching: Product Price = Price*(1-Discount)
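The complex match in Example 2 can be sketched in a few lines of Python. This is only an illustration: the rows, the product names and the function name `to_company_b` are hypothetical, and only the column names and the formula Product Price = Price*(1-Discount) come from the slide.

```python
# Hypothetical sample rows; only the column names come from the slide.
company_a_products = [
    {"product_id": 1, "product_name": "Lamp", "price": 100.0},
    {"product_id": 2, "product_name": "Desk", "price": 250.0},
]
company_a_discounts = [
    {"product_id": 1, "discount": 0.25},
    {"product_id": 2, "discount": 0.5},
]


def to_company_b(products, discounts):
    """Join the two Company A tables on product_id, then apply the matches."""
    discount_by_id = {row["product_id"]: row["discount"] for row in discounts}
    result = []
    for row in products:
        discount = discount_by_id[row["product_id"]]
        result.append({
            "product_id": row["product_id"],                  # 1-1 match
            "name": row["product_name"],                      # 1-1 match
            "product_price": row["price"] * (1 - discount),   # complex match
        })
    return result


company_b_rows = to_company_b(company_a_products, company_a_discounts)
```

Note that the complex match needs both a join (on Product ID) and an arithmetic expression, which is why it is harder to discover than the 1-1 matches.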
Difficulties in Generating Matchings
Matches are difficult to find because:
• Finding complex matches is far from trivial.
  • How should the system know that Product Price = Price*(1-Discount)?
• The number of candidates for complex matches is large.
• Sometimes tables must be joined:
  (Product ID, Product name, Price) and (Product ID, Discount) map to (Product ID, Product Name, Product Price)
  Product Price = Price*(1-Discount)
Main Parts of the iMAP System
• Generating matching candidates
• Pruning matching candidates
  • By exploiting domain knowledge
• Explaining match predictions
  • Provides an explanation for a selected predicted match
  • Makes the system semi-automatic
iMAP System Architecture
iMAP consists of three main modules:
• Match Generator: generates the match candidates, using specialized searchers over the target schema and the source schema.
• Similarity Estimator: generates a matrix that stores the similarity score of each pair (target attribute, match candidate).
• Match Selector: examines the score matrix and outputs the best matches under certain conditions.
iMAP System Architecture – cont.
• Match Generator: for each attribute t of the target schema T, iMAP generates match candidates from the source schema S.
• Similarity Estimator: receives the match candidates and outputs a similarity matrix.
• Match Selector: receives the similarity matrix and outputs the final matches.
Part 1: Match Generation - searchers
• The key to match generation is to SEARCH through the space of possible match candidates.
  • Search space: all attributes and data in the source schemas.
• Searchers rely on knowledge of operators and attribute types, such as numeric and textual, and on some heuristic methods.
The Internals of Searchers
• Search strategy
  • Copes with the large search space by using standard beam search.
• Match evaluation
  • Assigns each candidate a score that approximates the distance between the candidate and the target.
• Termination condition
  • Because the search space is large, the search must be cut off at some point.
The Internals of Searchers – Example
The search runs in iterations i, each limited to k results:
  Source: (Product ID, Product name, Price), (Product ID, Discount)
  Target: (Product ID, Name, Product Price)
  1. Product Price = Price*(1-Discount)
  2. Product Price = Product ID
  k. …
• MAXi and MAXi+1: the highest scores in iterations i and i+1
• Stop when MAXi-MAXi+1 < delta
• Return the first k candidates
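The search loop above can be sketched as a small beam search over arithmetic expressions. This is a minimal sketch under stated assumptions, not iMAP's implementation: the sample data is hypothetical, the rows are assumed to be aligned with the target values, the score `1/(1+mean absolute error)` is an illustrative choice, the operand set (each column x plus its complement 1-x) is a simplification, and the stop rule is read as "the best score changed by less than delta between two iterations".

```python
from itertools import product

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def score(values, target):
    """Map the mean absolute error to (0, 1]; 1.0 means an exact fit."""
    mae = sum(abs(v - t) for v, t in zip(values, target)) / len(target)
    return 1.0 / (1.0 + mae)


def beam_search(columns, target, k=3, delta=1e-6, max_iters=5):
    # Operands: every source column x plus its complement (1-x).
    operands = {}
    for name, vals in columns.items():
        operands[name] = list(vals)
        operands[f"(1-{name})"] = [1 - v for v in vals]

    def top_k(cands):
        ranked = sorted(cands.items(),
                        key=lambda kv: score(kv[1], target), reverse=True)
        return dict(ranked[:k])

    beam = top_k(operands)
    best = max(score(v, target) for v in beam.values())        # MAX_i
    for _ in range(max_iters):
        pool = dict(beam)  # keep the old beam so the best score never drops
        for (expr, vals), (op, fn), (name, opvals) in product(
                beam.items(), OPS.items(), operands.items()):
            pool[f"({expr}{op}{name})"] = [fn(a, b) for a, b in zip(vals, opvals)]
        beam = top_k(pool)
        new_best = max(score(v, target) for v in beam.values())  # MAX_{i+1}
        if abs(new_best - best) < delta:
            break
        best = new_best
    return sorted(((score(v, target), e) for e, v in beam.items()), reverse=True)


# Hypothetical data for the target attribute Product Price.
source_columns = {"price": [100.0, 250.0, 40.0], "discount": [0.25, 0.5, 0.5]}
target_prices = [75.0, 125.0, 20.0]
ranked = beam_search(source_columns, target_prices, k=3)
```

With this data the top-ranked candidate fits the target exactly and combines `price` with `(1-discount)`. Keeping only k candidates per iteration is what bounds the search, at the cost of possibly pruning a sub-expression that would have been useful later.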
The Internals of Searchers – Join Paths
Matches over join paths are found in two steps:
  Company A: (Product ID, Product name, Price), (Product ID, Discount)
  Company B: (Product ID, Name, Product Price)
  Product Price = Price*(1-Discount)
• First step: find the join paths between tables: Join(T1,T2)
• Second step: the search process uses the join paths
Implemented searchers in iMAP
iMAP contains the following searchers:
• Text
• Numeric
• Category
• Schema Mismatch
• Unit Conversion
• Date
• Overlap versions of Text, Numeric, Category, Schema Mismatch and Unit Conversion
Implemented searchers – Text Searcher example
Text searcher:
• Purpose: finds match candidates that are concatenations of text attributes.
• Method:
  • Target attribute: Name
  • Search space: attributes in the source schemas that have textual properties
  • The searcher searches the search space for attributes or concatenations of attributes
  Target: (Id, Name)
  Source: (Id, First name, Last name)
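As a rough illustration of the text searcher's candidate space, the sketch below enumerates single text attributes and pairwise concatenations and scores each against the target values. The rows, the extra `address` column, the last names and the overlap-based score are all hypothetical; the slide does not say how the text searcher scores candidates, so the scoring here is just a stand-in that assumes the two sources describe the same people.

```python
from itertools import permutations

# Hypothetical source rows and target values for the attribute Name.
source_rows = [
    {"first_name": "Ohad", "last_name": "Cohen", "address": "Haifa"},
    {"first_name": "David", "last_name": "Levi", "address": "Tel-Aviv"},
]
target_names = ["Ohad Cohen", "David Levi"]


def text_candidates(rows, sep=" "):
    """Single text attributes plus every ordered pairwise concatenation."""
    attrs = list(rows[0])
    cands = {a: [r[a] for r in rows] for a in attrs}
    for a, b in permutations(attrs, 2):
        cands[f"concat({a},{b})"] = [r[a] + sep + r[b] for r in rows]
    return cands


def overlap_score(values, target):
    """Fraction of candidate values that also occur among the target values."""
    target_set = set(target)
    return sum(v in target_set for v in values) / len(values)


scores = {name: overlap_score(vals, target_names)
          for name, vals in text_candidates(source_rows).items()}
best = max(scores, key=scores.get)
```

Here only `concat(first_name,last_name)` reproduces the target values, so it is the single best candidate.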
Implemented searchers – Numeric Searcher example
Numeric searcher:
• Purpose: finds the best matches for numeric attributes.
• Issues:
  • Computing the similarity score of a complex match
    • Value distribution
  • Types of matches
    • +, -, *, /
    • 2 columns
  dim1: 1, 2, 1, 4, 3
  dim2: 3, 4, 2, 1, 1
  size: 3, 7, 2, 4, 4
  dim1*dim2=size
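The "value distribution" issue can be made concrete with the numbers on the slide. The `size` values do not equal `dim1*dim2` row by row, so this sketch assumes the two tables are not row-aligned and compares value distributions instead. The distance used here (mean absolute difference between the sorted samples) is a simple stand-in chosen for the example, not necessarily the measure iMAP uses, and division is left out to keep the arithmetic in integers.

```python
dim1 = [1, 2, 1, 4, 3]
dim2 = [3, 4, 2, 1, 1]
size = [3, 7, 2, 4, 4]


def distribution_distance(candidate, target):
    """Mean absolute difference between the sorted samples (equal sizes assumed)."""
    pairs = zip(sorted(candidate), sorted(target))
    return sum(abs(c - t) for c, t in pairs) / len(target)


# Candidate expressions over at most 2 columns.
candidates = {
    "dim1": dim1,
    "dim2": dim2,
    "dim1+dim2": [a + b for a, b in zip(dim1, dim2)],
    "dim1-dim2": [a - b for a, b in zip(dim1, dim2)],
    "dim1*dim2": [a * b for a, b in zip(dim1, dim2)],
}
distances = {name: distribution_distance(vals, size)
             for name, vals in candidates.items()}
best = min(distances, key=distances.get)
```

On this data `dim1*dim2` has the smallest distance to the distribution of `size` (0.4, against 0.8 for `dim1+dim2`), which agrees with the match shown on the slide.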
Implemented searchers in iMAP – cont.
• Category Searcher: finds matches between categorical attributes in the source and target schemas.
• Schema Mismatch Searcher: relates the data of one schema to the schema of the other. This kind of mismatch occurs very often.
• Unit Conversion Searcher: finds matches between different types of units.
• Date Searcher: finds complex matches for date attributes.
Part 2: Similarity Estimator
• Receives the candidate matches from the Match Generator, each with the score that its searcher assigned.
• Problem: each searcher can assign a different score.
• Solution: compute a final, more accurate score for each match by using additional types of information.
• iMAP uses evaluator modules:
  • Name-based evaluator: computes a score based on the similarity of names
  • Naive Bayes evaluator
• Why not perform this phase during the search phase?
  • It would be very expensive!
Module example - Naive Bayes evaluator
Consider the match agent-address = location:
• Building the model: data instances of the target attribute are positive examples; all other data instances are negative examples.
• A Naive Bayes classifier learns the model.
• The trained classifier is applied to the data of the source attribute.
• Each data instance receives a score.
• The average of all the scores is returned as the result.
  Agent Address (Target)   Location (Source)
  Haifa                    T.A.
  T.A.                     Eilat
  Jerusalem                Nahariya
  Eilat                    Nesher
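The steps of the Naive Bayes evaluator can be sketched as follows. This is a toy word-level classifier with add-one smoothing, written only to show the train / apply / average flow. The city values come from the slide; the negative examples (agent names), the second source column and all function names are hypothetical.

```python
import math
from collections import Counter


def train(positive, negative):
    """Count tokens per class; positive = target attribute data."""
    counts = {"pos": Counter(), "neg": Counter()}
    for label, values in (("pos", positive), ("neg", negative)):
        for value in values:
            counts[label].update(value.lower().split())
    vocab = set(counts["pos"]) | set(counts["neg"])
    total = len(positive) + len(negative)
    priors = {"pos": len(positive) / total, "neg": len(negative) / total}
    return counts, vocab, priors


def prob_positive(value, model):
    """P(positive | value) under Naive Bayes with add-one smoothing."""
    counts, vocab, priors = model
    log_p = {}
    for label in ("pos", "neg"):
        denom = sum(counts[label].values()) + len(vocab) + 1  # +1 for unseen tokens
        lp = math.log(priors[label])
        for token in value.lower().split():
            lp += math.log((counts[label][token] + 1) / denom)
        log_p[label] = lp
    top = max(log_p.values())
    e_pos = math.exp(log_p["pos"] - top)
    e_neg = math.exp(log_p["neg"] - top)
    return e_pos / (e_pos + e_neg)


def evaluate_match(target_values, other_target_values, source_values):
    """Average positive-class probability over the source attribute's data."""
    model = train(target_values, other_target_values)
    return sum(prob_positive(v, model) for v in source_values) / len(source_values)


agent_address = ["Haifa", "T.A.", "Jerusalem", "Eilat"]   # target, from the slide
agent_name = ["Ohad", "David", "Eyal", "Miri"]            # hypothetical negatives
location = ["T.A.", "Eilat", "Nahariya", "Nesher"]        # source, from the slide
student = ["Eyal", "Miri", "Dana", "Noa"]                 # hypothetical source column

score_location = evaluate_match(agent_address, agent_name, location)
score_student = evaluate_match(agent_address, agent_name, student)
```

On this data `location` scores higher than `student` as a match for agent-address, because two of its values were seen as positive examples during training.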
Part 3: Match Selector
• Receives the scored match candidates from the Similarity Estimator.
• Problem: these matches may violate certain domain integrity constraints.
  • For example: mapping 2 source attributes to the same target attribute.
• Solution: a set of domain constraints
  • Defined by domain experts or users
Constraint Example
  Source: (Product ID, Product name, Price), (Product ID, Club members Price)
  Target: (Product ID, Product Name, Product Price)
• The Match Selector receives a list of candidates:
  k. Product Price = Price+club members price
• Constraint: Price and Club members price are unrelated
• The Match Selector deletes this match candidate
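The constraint check in this example can be sketched as a filter over the candidate list. The `Candidate` class, the scores and the helper names are hypothetical; only the attributes and the "Price and Club members price are unrelated" constraint come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    target: str
    expression: str
    source_attrs: frozenset
    score: float  # hypothetical similarity score


# Each constraint is a pair of source attributes declared unrelated.
UNRELATED = [frozenset({"price", "club_members_price"})]


def violates(candidate, unrelated_pairs):
    """True if the candidate combines two attributes declared unrelated."""
    return any(pair <= candidate.source_attrs for pair in unrelated_pairs)


def select(candidates, unrelated_pairs):
    """Drop violating candidates and rank the rest by score."""
    kept = [c for c in candidates if not violates(c, unrelated_pairs)]
    return sorted(kept, key=lambda c: c.score, reverse=True)


candidates = [
    Candidate("product_price", "price+club_members_price",
              frozenset({"price", "club_members_price"}), 0.9),
    Candidate("product_price", "price", frozenset({"price"}), 0.7),
]
selected = select(candidates, UNRELATED)
```

Even though the deleted candidate has the higher (hypothetical) score, the domain constraint removes it, which is exactly the role the slide assigns to the Match Selector.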
Exploiting Domain Knowledge
• iMAP uses 4 different types of knowledge:
  • Domain constraints
  • Past matches
  • Overlap data
  • External data
• iMAP uses its knowledge at all levels of the system, and as early as it can in match generation.
Types of knowledge
Domain constraints, three cases:
• Attributes from the source schema are unrelated. Example: Name and ID are unrelated.
  • Used by: Searchers
• Constraint on a single attribute t. Example: Account < 10000.
  • Used by: Similarity Estimator and Searchers
• Attributes from the target schema are unrelated. Example: Account and ID are unrelated.
  • Used by: Match Selector
  Source: (Id, Name), (Id, Account number, Account status)
  Target: (Id, First name, Last name, Account, Account status)
Types of knowledge – cont.
• Past complex matches
  • The Numeric Searcher can reuse a past expression as a template:
    Price=Price*(1-Discount) generates VARIABLE*(1-VARIABLE)
• External data: using external sources to learn about attributes and their data.
  • Given a target attribute and a useful feature of that attribute, iMAP learns about its value distribution.
    Example: the number of cities in a state
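The past-match template idea above can be sketched with a regular expression. This is an illustrative sketch only: the identifier pattern, the function names and the attribute names `list_price` and `fee_rate` are assumptions, and attribute names are assumed to contain no hyphens so that they cannot be confused with the minus operator.

```python
import re
from itertools import permutations

# Attribute names: letters, digits and underscores, not starting with a digit.
IDENT = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")


def to_template(expression):
    """Generalize a past match by replacing every attribute with VARIABLE."""
    return IDENT.sub("VARIABLE", expression)


def instantiate(template, attributes):
    """Fill the template with every ordered choice of distinct attributes."""
    slots = template.count("VARIABLE")
    results = []
    for combo in permutations(attributes, slots):
        names = iter(combo)
        results.append(re.sub("VARIABLE", lambda m: next(names), template))
    return results


template = to_template("Price*(1-Discount)")
new_candidates = instantiate(template, ["list_price", "fee_rate"])
```

The template narrows the numeric search: instead of exploring all operator combinations, the searcher only tries attribute assignments for a shape that was correct in the past.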
Types of knowledge – cont.
• Overlap data: provides information for the mapping process.
• iMAP contains searchers that can exploit overlap data.
• Overlap Text, Category & Schema Mismatch searchers
  • S and T share a state listing
  • Matches: city=state, country=state
  • Re-evaluating the results: city=state is 0 and country=state is 1
• Overlap Numeric Searcher: using the overlap data and an equation discovery system (LAGRAMGE), finds the best arithmetic expression for t.
Generating Explanations
• One goal is to provide a design environment in which the user inspects the matches predicted by the system and modifies them manually, giving the system feedback.
• The system uses complex algorithms, so it needs to explain its matches to the user.
• Explanations are good for the user as well:
  • Matches can be corrected quickly
  • They tell the system where its mistakes are
Generating Explanations – so, what do you want to know about the matches?
iMAP defines 3 main questions:
• Explain an existing match: why is a certain match X present in the output of iMAP? Why did the match survive the whole process?
• Explain an absent match: why is a certain match Y not present in the output of iMAP?
• Explain a match ranking: why is match X ranked higher than match Y?
Each of these questions can be asked of each module of iMAP.
A question can be reformulated recursively for the underlying components.
Generating Explanations - Example
Suppose we have 2 real-estate schemas:
  (List-price, Month-posted, …)
  (Price, Monthly-fee-rate, …)
iMAP produces the ranked matches:
  (1) List-price=price*(1+monthly-fee-rate)
  (2) List-price=price
iMAP's explanation: both matches were generated by the numeric searcher, and the similarity estimator agreed with the ranking.
Generating Explanations - Example
The current order:
  (1) List-price=price*(1+monthly-fee-rate)
  (2) List-price=price
The match selector has 2 constraints: (1) month-posted=monthly-fee-rate, (2) month-posted and price don't share common attributes.
The match List-price=price is selected by the match selector.
Generating Explanations - Example
The current order:
  (1) List-price=price
  (2) List-price=price*(1+monthly-fee-rate)
iMAP explains that the source of month-posted=monthly-fee-rate is the date searcher.
The user corrects iMAP: monthly-fee-rate is not a date.
Generating Explanations - Example
List-price=price*(1+monthly-fee-rate) is again the chosen match.
The final order:
  (1) List-price=price*(1+monthly-fee-rate)
  (2) List-price=price
Example cont. – generated dependency graph
The dependency graph is small:
• Searchers produce only the k best matches
• iMAP goes through three stages
What do you want to know about the matches?
• Why is a certain match X present in the output of iMAP?
  • Returns the part of the graph that describes the match.
• Why is match X ranked higher than match Y?
  • Returns the part of the graph that compares the 2 matches.
• Why is a certain match Y not present in the output of iMAP?
  • If the match was eliminated during the process, the part responsible for the elimination explains why.
  • Otherwise iMAP asks the searchers to check whether they can generate the match and to explain why it was not generated.
Evaluating iMAP on real world domains
• iMAP was evaluated on 4 real-world domains.
• For the Cricket domain, 2 independently developed databases were used.
• For the other 3 domains, one real-world source database was used, with a target schema created by volunteers.
• Both databases with overlapping domains and databases with disjoint domains were tested.
Evaluating iMAP on real world domains – cont.
• Data processing: removing data such as “unknown” and adding the most obvious constraints.
• Experiments: there are actually 8 experimental domains, 2 for each domain: an overlap domain and a disjoint domain.
• Performance measures:
  • Top-1 matching accuracy
  • Top-3 matching accuracy
  • Complex match
  • Partial complex match
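One plausible reading of the top-1 and top-3 measures is the fraction of target attributes whose correct match appears among the first k ranked candidates. The sketch below implements that reading; the definition, the function name and the sample rankings are assumptions made for illustration and are not taken from the slides.

```python
def top_k_accuracy(ranked_candidates, correct_match, k):
    """Fraction of target attributes whose correct match is in the top k."""
    hits = 0
    for attribute, gold in correct_match.items():
        if gold in ranked_candidates.get(attribute, [])[:k]:
            hits += 1
    return hits / len(correct_match)


# Hypothetical ranked output for two target attributes.
ranked = {
    "list_price": ["price", "price*(1+monthly_fee_rate)"],
    "name": ["concat(first_name,last_name)", "last_name"],
}
gold = {
    "list_price": "price*(1+monthly_fee_rate)",
    "name": "concat(first_name,last_name)",
}

top1 = top_k_accuracy(ranked, gold, 1)
top3 = top_k_accuracy(ranked, gold, 3)
```

In this toy case top-1 accuracy is 0.5 and top-3 accuracy is 1.0, which shows how a correct complex match ranked second only counts under the looser measure.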
Results (1)
Overall and 1-1 matching accuracy:
• Not in the figure, but according to the article the top-3 accuracy is even higher, and iMAP also achieves top-1 and top-3 accuracy of 77%-100% for 1-1 matching
• (a) Exploiting domain constraints and overlap data improves accuracy
• (b) Disjoint domains achieve lower accuracy than overlap-data domains
Results (2)
Complex matching accuracy – Top 1 and Top 3:
Results (2) – Cont.
Complex matching accuracy – Top 1:
• Low results for default iMAP (for example: inventory=9%), both in overlap domains and in disjoint domains
• (a) Exploiting domain constraints and overlap data improves accuracy
• (b) In disjoint domains iMAP achieves lower accuracy than in overlap-data domains
  • Without overlap data, the accuracy of the Numeric Searcher and the Text Searcher decreases.
Results (2) – complex matches low results
• Smaller components, for example apt-number
  • Suggested solution: adding format learning techniques
• Small noise components, for example agent-id
  • Suggested solution: more aggressive match cleaning and more constraints
• Disjoint databases are difficult for the numeric searcher
  • Suggested solution: using past numeric matches
• Top-k: many results are not in the top 1
  • Increasing k to 10 will increase accuracy
Results (2) – Cont.
Complex matching accuracy – Top 3:
• Low results for default iMAP (for example: inventory=9%), both in overlap domains and in disjoint domains
  • Same reasons as in Top 1
• (c) Improvement in accuracy compared to (a) when using overlap data and constraints
  • This is an outcome of correct complex matches appearing among the top 3 matches
Results (3)
Partial complex matching accuracy – Top 1 and Top 3:
Results (3) – cont.
• Accuracy is measured only on finding the right attributes
  • For example: a wrong numeric template but the right attributes
• Much higher accuracy than for full complex matching
• Partial complex matches can be very useful when the user wants to fix wrong matches
Performance & Efficiency
Performance:
[Figure: accuracy as a function of the number of data tuples]
• iMAP is stable after 100 data tuples
• Running it on fewer examples first can reduce iMAP's running time
Performance & Efficiency – Cont.
Efficiency:
• The unoptimized iMAP version ran for 5-20 minutes on the experimental domains
• Several techniques are suggested in the article to improve this time
  • For example, breaking the schemas into independent chunks
Explaining match predictions
Example of an explained match prediction:
• Searcher level: concat(first-name,last-name) was ranked higher than last-name
• Similarity Estimator:
  • The name-based evaluator was wrong
  • The Naive Bayes evaluator was right
• Match Selector: had no influence
• Conclusion: the name-based evaluator has more influence (last line)
• The user can use this information to reduce the influence of the name-based evaluator
Related work
• L. Xu and D. Embley. Using domain ontologies to discover direct and indirect matches for schema elements:
  • Maps the schema to a domain ontology and searches in this domain.
  • Can be added to iMAP as an additional searcher.
• Clio system:
  • A sophisticated set of user-interface techniques to improve matches
Conclusions
• Most of the work in this field until now was about 1-1 matching
• This article focuses on complex matching
• The key to iMAP is the use of:
  • Searchers
  • Domain knowledge
• The user is given the possibility to affect the matches
Any Questions?
Thank you!
Bibliography
• Robin Dhamankar, Yoonkyong Lee, AnHai Doan, Alon Halevy, Pedro Domingos. iMAP: Discovering Complex Semantic Matches between Database Schemas.
• http://en.wikipedia.org/wiki/Beam_search