Inaport Training Fuzzy Matching. © Copyright 2010 InaPlex Inc Matching Process of deciding which...

Inaport Training

Fuzzy Matching

© Copyright 2010 InaPlex Inc

Matching

• Process of deciding which record or set of records in the target table(s) should be updated
• Alternatively, decide if the record already exists and take appropriate action

Matching Techniques

Inaport supports different ways to match:

• Standard
  • Build expressions on source and target
• Fuzzy
  • Refine Standard to allow for poor data
• SQL
  • Use SQL SELECT instead of expressions

Fuzzy Matching

Standard Matching:

• Can use any combination of fields
• Can use expressions

BUT

• Ultimately is restricted to exact match

“InaPlex” <> “Innerplex Ltd”

© Copyright 2007 InaPlex Limited

Fuzzy Matching

Fuzzy matching compares source and target, and gives a similarity score

Score measures how “close” two strings are:

• Score = 1: Perfect match
• Score = 0: No match

Examples:

• “InaPlex” and “inaplx”: 98%
• “InaPlex” and “innerplex”: 87%
• “InaPlex” and “ibm”: 49%

See Tools – Fuzzy Match Demo
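
The idea of a similarity score can be tried out with a minimal Python sketch. It uses the standard library's difflib.SequenceMatcher as a stand-in scorer. This is an assumption for illustration only: the slides do not say which algorithm Inaport uses, so the numbers will not match the percentages quoted above.

```python
from difflib import SequenceMatcher


def similarity(source: str, target: str) -> float:
    """Return a score between 0.0 (no match) and 1.0 (perfect match)."""
    # Assumption: comparison is case-insensitive. The stand-in scorer is
    # difflib, not Inaport's own scoring algorithm.
    return SequenceMatcher(None, source.lower(), target.lower()).ratio()


for candidate in ("inaplx", "innerplex", "ibm"):
    print(candidate, round(similarity("InaPlex", candidate), 2))
```

The ranking is the useful part: the closer the candidate is to “InaPlex”, the higher the score.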

How it Works

As with Standard matching, Fuzzy match:

• Can use any field or combination of fields
• Reads the match fields
• Builds an in-memory index for each table
  • The target match expression is applied to the field data read from the table

How it Works

Set scoring levels:

• Score > Upper
  • Good match – accept immediately
• Lower < Score < Upper
  • Possible match – user review
• Score < Lower
  • Not a match – reject

No match < Lower < Possible < Upper < Good

No match < 85% < Possible < 95% < Good
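
The scoring levels can be sketched as a small Python function. The 0.85 and 0.95 defaults are the example boundaries from the slide. The function name and the handling of a score that lands exactly on a boundary are assumptions, since the slides only state strict inequalities.

```python
LOWER = 0.85  # example lower boundary from the slide
UPPER = 0.95  # example upper boundary from the slide


def classify(score: float, lower: float = LOWER, upper: float = UPPER) -> str:
    """Map a similarity score to 'good', 'possible' or 'no match'."""
    if score > upper:
        return "good"  # accept immediately
    if score > lower:
        # Assumption: a score exactly equal to upper counts as possible.
        return "possible"  # send to user review
    # Assumption: a score exactly equal to lower counts as no match.
    return "no match"  # reject


for score in (0.98, 0.87, 0.49):
    print(score, classify(score))
```

With these boundaries the three example scores from the earlier slide (98%, 87%, 49%) fall into good, possible and no match respectively.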

How it Works

When a source record comes in:

• Source expression applied to build match value
• Source match value scored against every value in target index
• “Best” matches used – you set boundaries
  • No match < Possible match < Good match
  • No match < 0.85 < Possible < 0.95 < Good
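
The steps above can be sketched in Python as a loop over an in-memory index. Everything here is illustrative: the index layout (record key mapped to match value), the function names and the difflib scorer are assumptions, not Inaport's implementation.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Stand-in scorer; Inaport's own scoring algorithm is not described here.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_matches(source_value, target_index, lower=0.85, upper=0.95):
    """Score source_value against every target value; keep those above lower.

    target_index maps a record key to its pre-computed match value.
    Returns (key, score, label) tuples, highest score first.
    """
    results = []
    for key, target_value in target_index.items():
        score = similarity(source_value, target_value)
        if score > upper:
            results.append((key, score, "good"))
        elif score > lower:
            results.append((key, score, "possible"))
    return sorted(results, key=lambda r: r[1], reverse=True)


# Hypothetical in-memory index: record id -> match value.
index = {1: "InaPlex", 2: "Innerplex Ltd", 3: "IBM"}
print(best_matches("inaplx", index, lower=0.7, upper=0.9))
```

Note that every target value is scored for each source record, which is the cost that clustering later reduces.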

How it Works

User Review

• Shows the source record and possible matches in target
• User can select one or more records as match
• Options
  • Review “good” and “possible” matches
    – For testing purposes
  • Review just “possible” matches
    – If there are no possibles, good and no match accepted automatically
  • No review
    – Good and no match accepted automatically
    – Possible treated as bad
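
One way to read the three review options is as a rule that decides whether the review screen is shown for a given source record. The Python sketch below is only that reading: the mode names 'all', 'possible' and 'none' are invented for illustration and are not Inaport settings.

```python
def needs_review(mode: str, labels: list[str]) -> bool:
    """Decide whether to show the review screen for one source record.

    mode is one of 'all', 'possible', 'none' (illustrative names).
    labels are the classifications of the candidate matches that were found.
    """
    if mode == "all":
        # Review "good" and "possible" matches, e.g. while testing.
        return any(label in ("good", "possible") for label in labels)
    if mode == "possible":
        # Only possible matches trigger a review; good and no match
        # are accepted automatically.
        return "possible" in labels
    if mode == "none":
        # Batch mode: nothing is reviewed.
        return False
    raise ValueError(f"unknown review mode: {mode}")


print(needs_review("possible", ["good"]))
print(needs_review("possible", ["good", "possible"]))
```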

How it Works

Customise User Review

• May need to see more than the target table to decide on match
• Can also display associated tables
  • E.g. Address, Contact
• Can also select which fields from associated tables to display

Example – Operation Tab

Select Fuzzy Match from Match Type

Example – Match Tab

Specify the base match criteria:

• Source and target match expressions
• Boundary scores for no, possible, good matches
• Cluster Match covered later

Example – Match Tab

Set up the User Review

• Can choose
  • No review – use in batch mode
  • Only possible matches – accept good matches
  • Good + possible – review all matches

Example – User Review

Shows possible matches at run time:

• Source record
• Possible matching target records, with score
• If configured, child records of selected target record
• Allows selection of desired matches

Clustering

Fuzzy Matching is powerful, flexible

BUT

Every source record must be scored against EVERY target match, then highest scores selected

100,000 records in target => 100,000 scores per source record

Solution is CLUSTERING

Clustering

Specify

• an expression to sort target records into clusters

Then

• an equivalent expression for source to sort it into one cluster

Finally

• scoring only done against members of the selected cluster

100,000 target divided into 20 clusters

• 5,000 records per cluster => 5,000 scores per source record
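
The three steps can be sketched in Python. The cluster expression used here (first letter of the company name), the sample company names, the function names and the difflib scorer are all assumptions for illustration. With evenly sized clusters the work per source record drops from 100,000 scores to 100,000 / 20 = 5,000.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Stand-in scorer, not Inaport's own algorithm.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_key(name: str) -> str:
    # Illustrative cluster expression: first letter, lower-cased.
    return name.strip()[:1].lower()


def build_clusters(targets):
    """Step 1: sort target values into clusters."""
    clusters = defaultdict(list)
    for name in targets:
        clusters[cluster_key(name)].append(name)
    return clusters


def score_in_cluster(source, clusters):
    """Steps 2 and 3: pick the source's cluster, score only its members."""
    candidates = clusters.get(cluster_key(source), [])
    return sorted(((similarity(source, t), t) for t in candidates), reverse=True)


# Hypothetical sample data.
targets = ["Alpha Corp", "Zulu Corp", "Beta Corp", "Brown Corp"]
clusters = build_clusters(targets)
print(score_in_cluster("Beta Corporation", clusters))
```

Only the two names in the “b” cluster are scored; “Alpha Corp” and “Zulu Corp” are never compared with the source value.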

Clustering

Cluster expression should:

• Sort target into roughly equal groups
• Guard against allocating source to wrong cluster
• Examples
  • First letter of company name
  • Zip/Post code
  • Phone area code
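
Whether a candidate cluster expression gives roughly equal groups can be checked by counting cluster sizes before committing to it. The Python sketch below uses hypothetical sample names and a hypothetical helper, cluster_sizes.

```python
from collections import Counter


def cluster_sizes(names, key):
    """Count how many target values fall into each cluster."""
    return Counter(key(name) for name in names)


# Hypothetical sample data; the key is first letter of company name.
names = ["Alpha Corp", "Zulu Corp", "Beta Corp", "Brown Corp"]
sizes = cluster_sizes(names, lambda name: name[:1].lower())
print(sizes.most_common())
```

A few very large clusters next to many tiny ones suggests trying a different expression, such as post code or phone area code.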

Clustering

No clustering established: source record scored against every record in target.

[Diagram labels: Alpha Corp, Zulu Corp, Beta Corp]

Clustering

Set up clustering based on first letter of company name

Clustering

Cluster on first letter: source record only scored against records in “b” cluster.

[Diagram labels: Alpha Corp, Zulu Corp, Beta Corp; “b” cluster: Beta Corp, Brown Corp]

Clustering

Important Note

Because source records will only be scored against one cluster, poorly done clustering can lead to missed matches

• “naplex” would look in the “n” cluster, not “i”

Cluster expression does NOT have to use the same fields as the match

• E.g. match on name, cluster on ZIP code
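
The missed-match risk, and the benefit of clustering on a different field, can be shown with a short Python sketch. The records, the ZIP codes ("11111", "22222") and the helper names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical target records: (company name, ZIP code). ZIPs are made up.
targets = [("InaPlex", "11111"), ("IBM", "22222")]


def build_clusters(records, key):
    """Group records by the value of the cluster expression."""
    clusters = defaultdict(list)
    for record in records:
        clusters[key(record)].append(record)
    return clusters


def by_first_letter(record):
    return record[0][:1].lower()


def by_zip(record):
    return record[1]


source = ("naplex", "11111")  # leading "I" lost in the source data

letter_clusters = build_clusters(targets, by_first_letter)
zip_clusters = build_clusters(targets, by_zip)

# There is no "n" cluster, so "InaPlex" would never be scored.
print(letter_clusters.get(by_first_letter(source), []))
# Clustering on ZIP still selects "InaPlex" as a candidate to score by name.
print(zip_clusters.get(by_zip(source), []))
```

The match expression (name) and the cluster expression (ZIP) are independent here, so a typo in the name no longer sends the record to the wrong cluster.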

Summary

Fuzzy matching provides a powerful new tool for handling complex, dirty data

Need to:

• Use carefully, especially clustering
• Allow for the overhead of user review

THANK YOU

www.inaplex.com

www.inaplex.com/cs/forums