Inaport Training
Fuzzy Matching
© Copyright 2010 InaPlex Inc
Matching
• Process of deciding which record or set of records in the target table(s) should be updated
• Alternatively, decide if record already exists and take appropriate action
Matching Techniques
Inaport supports different ways to match
• Standard
  • Build expressions on source and target
• Fuzzy
  • Refine Standard to allow for poor data
• SQL
  • Use SQL SELECT instead of expressions
Fuzzy Matching
Standard Matching
• can use any combination of fields
• can use expressions
BUT
• Ultimately is restricted to exact match
“InaPlex” <> “Innerplex Ltd”
© Copyright 2007 InaPlex Limited
Fuzzy Matching
Fuzzy matching compares source and target, and gives a similarity score
Score measures how “close” two strings are
Score = 1 : Perfect match
Score = 0 : No match
“InaPlex” and “inaplx” : 98%
“InaPlex” and “innerplex” : 87%
“InaPlex” and “ibm” : 49%
See Tools – Fuzzy Match Demo
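The idea of a similarity score can be tried outside Inaport. The sketch below is a stand-in only: it assumes Python's standard-library difflib.SequenceMatcher as the measure, and the function name similarity is hypothetical. Inaport's own scoring algorithm is not described in these slides, so the numbers produced here will not reproduce the percentages quoted above.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a score between 0.0 (nothing in common) and 1.0 (identical).

    Assumption: case and surrounding whitespace are ignored before comparing.
    This is a stand-in measure, not Inaport's scoring algorithm.
    """
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


if __name__ == "__main__":
    # Print the stand-in score for each string pair mentioned on the slide.
    for candidate in ("InaPlex", "inaplx", "innerplex", "ibm"):
        print(f"InaPlex vs {candidate}: {similarity('InaPlex', candidate):.2f}")
```

The ordering of the results is what matters for matching: a near-identical string should score higher than a loosely similar one, which in turn should score higher than an unrelated one.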
How it Works
As with Standard matching, Fuzzy match
• Can use any field or combination of fields
• Reads the match fields
• Builds an in-memory index for each table
  • The target match expression is applied to the field data read from the table
How it Works
Set scoring levels
• Score > Upper
  • Good match – accept immediately
• Lower < Score < Upper
  • Possible match – user review
• Score < Lower
  • Not a match – reject
No match < Lower < Possible < Upper < Good
No match < 85% < Possible < 95% < Good
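The two boundaries split the score range into three bands. A minimal sketch of that classification follows; the function name classify and the labels are hypothetical, and the defaults are the 85% and 95% figures from the slide. The slides do not say what happens when a score lands exactly on a boundary, so treating a score equal to Upper as good and a score equal to Lower as possible is an assumption.

```python
def classify(score: float, lower: float = 0.85, upper: float = 0.95) -> str:
    """Place a similarity score into one of the three bands from the slide.

    Assumption: a score exactly equal to a boundary falls into the higher band.
    """
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("expected 0 <= lower <= upper <= 1")
    if score >= upper:
        return "good"       # accept immediately
    if score >= lower:
        return "possible"   # candidate for user review
    return "no match"       # reject


if __name__ == "__main__":
    for s in (0.98, 0.87, 0.49):
        print(s, "->", classify(s))
```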
How it Works
When a source record comes in:
• Source expression applied to build match value
• Source match value scored against every value in target index
• “Best” matches used – you set boundaries
  • No match < Possible match < Good match
  • No match < 0.85 < Possible < 0.95 < Good
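The per-record steps above can be sketched as a loop over an in-memory list of target match values. This is an illustration under assumptions, not Inaport's implementation: similarity uses difflib.SequenceMatcher as a stand-in scorer, and score_against_index is a hypothetical name.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in scorer (an assumption, not Inaport's algorithm)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def score_against_index(source_value, target_index, lower=0.85):
    """Score the source match value against every target match value.

    Returns (score, target_value) pairs at or above the lower boundary,
    highest score first. Values below the boundary are dropped as no match.
    """
    scored = [(similarity(source_value, t), t) for t in target_index]
    kept = [pair for pair in scored if pair[0] >= lower]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    index = ["InaPlex Inc", "Innerplex Ltd", "IBM"]
    print(score_against_index("Inaplex Inc", index))
```

Note that every target value is scored for every source record, so the cost grows with the size of the target index.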
How it Works
User Review
• Shows the source record and possible matches in target
• User can select one or more records as match
• Options
  • Review “good” and “possible” matches
    – For testing purposes
  • Review just “possible” matches
    – If there are no possibles, good and no match accepted automatically
  • No review
    – Good and no match accepted automatically
    – Possible treated as bad
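One way to read the three review options is as a rule for when a source record is shown to the user. The sketch below encodes that reading; ReviewMode and needs_review are hypothetical names, and the logic is an interpretation of the slide rather than documented Inaport behaviour. It only decides whether review is triggered and does not model how an unreviewed "possible" is handled.

```python
from enum import Enum


class ReviewMode(Enum):
    GOOD_AND_POSSIBLE = "review good and possible"
    POSSIBLE_ONLY = "review just possible"
    NONE = "no review"


def needs_review(mode: ReviewMode, bands: list) -> bool:
    """Decide whether a source record goes to user review.

    bands holds the band ("good" or "possible") of each candidate target
    record that survived scoring; an empty list means no match was found.
    """
    if mode is ReviewMode.GOOD_AND_POSSIBLE:
        return any(b in ("good", "possible") for b in bands)
    if mode is ReviewMode.POSSIBLE_ONLY:
        return "possible" in bands
    return False  # ReviewMode.NONE: suitable for unattended batch runs


if __name__ == "__main__":
    print(needs_review(ReviewMode.POSSIBLE_ONLY, ["good", "possible"]))
```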
How it Works
Customise User Review
• May need to see more than the target table to decide on match
• Can also display associated tables
  • E.g. Address, Contact
• Can also select which fields from associated tables to display
Example – Operation Tab
Select Fuzzy Match from Match Type
Example – Match Tab
Specify the base match criteria
• Source and target match expressions
• Boundary scores for no, possible, good matches
• Cluster Match covered later
Example – Match Tab
Set up the User Review
• Can choose
  • No review – use in batch mode
  • Only possible matches – accept good matches
  • Good + possible – review all matches
Example – User Review
Shows possible matches at run time
• Source record
• Possible matching target records, with score
• If configured, child records of selected target record
• Allows selection of desired matches
Clustering
Fuzzy Matching is powerful, flexible
BUT
Every source record must be scored against EVERY target match, then highest scores selected
100,000 records in target => 100,000 scores per source record
Solution is CLUSTERING
Clustering
Specify
• an expression to sort target records into clusters
Then
• an equivalent expression for source to sort it into one cluster
Finally
• scoring only done against members of the selected cluster
100,000 target divided into 20 clusters
• 5,000 records per cluster => 5,000 scores per source record
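The three clustering steps can be sketched with a dictionary that groups target values by cluster key. This is an illustration only: cluster_key, build_clusters and best_in_cluster are hypothetical names, the first letter of the name is used as the cluster expression, and difflib.SequenceMatcher stands in for Inaport's scorer.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in scorer (an assumption, not Inaport's algorithm)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def cluster_key(name: str) -> str:
    """Cluster expression: first letter of the name, lower-cased."""
    return name.strip().lower()[:1]


def build_clusters(target_names):
    """Step 1: sort target records into clusters."""
    clusters = defaultdict(list)
    for name in target_names:
        clusters[cluster_key(name)].append(name)
    return clusters


def best_in_cluster(source_name, clusters):
    """Steps 2 and 3: pick the source's cluster, then score only its members."""
    candidates = clusters.get(cluster_key(source_name), [])
    if not candidates:
        return None
    return max(((similarity(source_name, t), t) for t in candidates),
               key=lambda pair: pair[0])


if __name__ == "__main__":
    clusters = build_clusters(["Alpha Corp", "Zulu Corp", "Beta Corp", "Brown Corp"])
    print(best_in_cluster("Beta Corporation", clusters))
```

With evenly sized clusters the work per source record falls in proportion to the number of clusters, which is the 100,000 / 20 = 5,000 figure above.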
Clustering
Cluster expression should:
• Sort target into roughly equal groups
• Guard against allocating source to wrong cluster
• Examples
  • First letter of company name
  • Zip/Post code
  • Phone area code
Clustering
Alpha Corp
Zulu Corp
Beta Corp
Source record scored against every record in target
No clustering established
Clustering
Set up clustering based on first letter of company name
Clustering
Alpha Corp
Zulu Corp
Beta Corp
Source record only scored against records in “b” cluster
Beta Corp
Brown Corp
Cluster on first letter
Clustering
Important Note
Because source records will only be scored against one cluster, poorly chosen clustering can lead to missed matches
• “naplex” would look in “n” cluster, not “I”
Cluster expression does NOT have to use same fields as match
• E.g. Match on name, cluster on ZIP code
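Both points in this note can be shown with one small sketch. All names here (best_match, by_first_letter, by_zip) are hypothetical, the sample records and ZIP values are made up for illustration, and difflib.SequenceMatcher again stands in for Inaport's scorer.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in scorer (an assumption, not Inaport's algorithm)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def by_first_letter(record: dict) -> str:
    """Cluster expression on the same field as the match (name)."""
    return record["name"].strip().lower()[:1]


def by_zip(record: dict) -> str:
    """Cluster expression on a different field from the match (ZIP code)."""
    return record["zip"]


def best_match(source: dict, targets: list, cluster_of):
    """Match on name, but only within the cluster chosen by cluster_of."""
    clusters = defaultdict(list)
    for record in targets:
        clusters[cluster_of(record)].append(record)
    candidates = clusters.get(cluster_of(source), [])
    if not candidates:
        return None  # the source landed in a cluster with no target records
    return max(candidates, key=lambda r: similarity(source["name"], r["name"]))


if __name__ == "__main__":
    # Made-up sample data: the source name has lost its first letter.
    targets = [{"name": "InaPlex", "zip": "00001"},
               {"name": "IBM", "zip": "00002"}]
    source = {"name": "naplex", "zip": "00001"}
    print(best_match(source, targets, by_first_letter))  # looks in "n", finds nothing
    print(best_match(source, targets, by_zip))           # finds the InaPlex record
```

Clustering on a field that is less likely to be mistyped than the match field reduces the chance that a dirty source value is sent to the wrong cluster.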
Summary
Fuzzy matching provides a powerful new tool for handling complex, dirty data
Need to
• Use carefully, especially clustering
• Allow for overhead of user review
THANK YOU
www.inaplex.com
www.inaplex.com/cs/forums