An Open-source Similar-name Finder
Dallan Quass
What's the problem?
People can't spell unusual names
Maybe a piece of mail addressed to Solverg Quast?
Solverg Quast, 5934 Phoenix Ave., Shoreview, MN 55126
Johnston Bros., 1256 Bristol St., Mapleton, MN 55126
Should be: Solveig Quass
People use nicknames
John
Johnny
Jack
Transcribers make typos
Jhon
Most of our ancestors didn't know how to read or write
signature
What does it matter?
How do you find records?
Johnny Snith
John Smith
How do you match people?
John Smith
Johnny Smithe
Not a new problem
Lots of solutions
Soundex
NYSIIS
Double Metaphone
Refined Soundex
Daitch-Mokotoff
Caverphone
Levenshtein
Jaro-Winkler
Monge-Elkan
Needleman-Wunsch
Smith-Waterman
No Bullseye
Why is this so hard?
How similar are two names?
We’re neighbors
John
Jonny
Joe
I don’t know those guys
First approach: Coders
Soundex
NYSIIS
Double Metaphone
Refined Soundex
Daitch-Mokotoff
Caverphone
General approach
Combine repeated letters
Remove vowels (except maybe for leading)
Unite similar-sounding letters
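The three steps above can be sketched as a toy coder. This is not Soundex or any of the listed coders: the letter groups, the step order, and the name `toy_code` are assumptions made for illustration only.

```python
import re

# Hypothetical groups of similar-sounding consonants (loosely Soundex-like).
GROUPS = ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"]
# Map every letter in a group to the group's first letter.
REPRESENTATIVE = {ch: group[0] for group in GROUPS for ch in group}


def toy_code(name: str) -> str:
    """Return a crude phonetic code for a name (illustrative only)."""
    letters = re.sub(r"[^a-z]", "", name.lower())
    if not letters:
        return ""
    # Unite similar-sounding letters.
    mapped = [REPRESENTATIVE.get(ch, ch) for ch in letters]
    # Combine repeated letters.
    collapsed = [mapped[0]]
    for ch in mapped[1:]:
        if ch != collapsed[-1]:
            collapsed.append(ch)
    # Remove vowels (and h, w, y), except for a leading one.
    head, tail = collapsed[0], collapsed[1:]
    return head + "".join(ch for ch in tail if ch not in "aeiouhwy")


if __name__ == "__main__":
    for name in ["Jim", "John", "Jane", "Johan", "Johannes"]:
        print(name, toy_code(name))
```

With these groups, Jon and John share a code, but so do Jim and John: a coder can only say "same" or "different", not "how similar".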
First approach: Coders
Jim
John
Jane
Johan
Johannes
Second approach: Distance functions
Levenshtein
Jaro-Winkler
Monge-Elkan
Needleman-Wunsch
Smith-Waterman
General approach
Align sequences of letters
Score based upon the number of matches, transpositions, differences
Monge-Elkan considers similar-sounding letters
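As a concrete instance of the alignment idea, here is a minimal sketch of plain Levenshtein distance, where every insertion, deletion, or substitution costs 1. Plain Levenshtein does not count transpositions as a single edit; other functions in the list handle them differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-letter inserts, deletes, or substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    for other in ["jim", "jane", "johan", "johannes"]:
        print("john", other, levenshtein("john", other))
```

Comparing one query name against every name in a large index this way requires a distance computation per candidate, which is why this family of functions is slow at scale.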
Second approach: Distance functions
Jim
John
Jane
Johan
Johannes
Better results, but
Doesn't scale well
Can we do better?
Warning: Machine learning ahead!
Thank you Ancestry!
Ancestry.com paid someone to label 100,000 pairs of names
Name pairs were drawn from actual matching records at Ancestry
Labeled name pairs have been made freely available
A closer look at Levenshtein
Jon
John
Bohn
-1
-1
Maximize your expectations
Expectation Maximization Algorithm
Expectation step: calculate the expected value of a function
Maximization step: find parameters that maximize the expected value
Iterate until convergence
Jon
John
Bohn
high cost
low cost
Weighted Edit Distance
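A weighted edit distance replaces the uniform cost of 1 with a cost per operation. The sketch below assumes hand-picked toy costs (`toy_indel_cost`, `toy_sub_cost`, both hypothetical); in the approach described here the costs would presumably be learned from the labeled name pairs rather than set by hand.

```python
def weighted_edit_distance(a, b, indel_cost, sub_cost):
    """Edit distance where each operation has its own cost."""
    # First row: build b from the empty string by insertions only.
    prev = [0.0]
    for cb in b:
        prev.append(prev[-1] + indel_cost(cb))
    for ca in a:
        curr = [prev[0] + indel_cost(ca)]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + indel_cost(ca),         # delete ca
                curr[j - 1] + indel_cost(cb),     # insert cb
                prev[j - 1] + sub_cost(ca, cb),   # substitute ca -> cb
            ))
        prev = curr
    return prev[-1]


def toy_indel_cost(ch):
    # Assumption for illustration: a silent-ish "h" is cheap to add or drop.
    return 0.2 if ch == "h" else 1.0


def toy_sub_cost(x, y):
    # Assumption for illustration: all substitutions cost the same.
    return 0.0 if x == y else 1.0


if __name__ == "__main__":
    print(weighted_edit_distance("jon", "john", toy_indel_cost, toy_sub_cost))
    print(weighted_edit_distance("john", "bohn", toy_indel_cost, toy_sub_cost))
```

With these toy costs Jon is much closer to John than Bohn is, even though plain Levenshtein scores both pairs the same.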
Learn to classify
Positive and negative examples
Features
Coders
Distance functions
Weighted edit distance
Learn weights
several algorithms to choose from
Results in a vector
Threshold separates matches from non-matches
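The scoring side of such a classifier can be sketched as a weighted sum of features compared against a threshold. Everything specific here is an assumption: the two features are crude stand-ins for a coder and a distance function, and `WEIGHTS` and `THRESHOLD` are hand-picked placeholders, whereas the talk describes learning the weights from labeled examples.

```python
import difflib


def features(a: str, b: str) -> list[float]:
    """Feature vector for a name pair (stand-in features, for illustration)."""
    a, b = a.lower(), b.lower()
    return [
        # Stand-in for "do the coders agree?": same first letter.
        1.0 if a[:1] == b[:1] else 0.0,
        # Stand-in for a distance function: difflib similarity in [0, 1].
        difflib.SequenceMatcher(None, a, b).ratio(),
    ]


# Hypothetical values; a real system would learn these from labeled pairs.
WEIGHTS = [1.0, 2.0]
THRESHOLD = 2.5


def score(a: str, b: str) -> float:
    """Weighted sum of the features."""
    return sum(w * f for w, f in zip(WEIGHTS, features(a, b)))


def is_match(a: str, b: str) -> bool:
    """Pairs scoring at or above the threshold count as matches."""
    return score(a, b) >= THRESHOLD


if __name__ == "__main__":
    for pair in [("John", "Jon"), ("John", "Jim"), ("John", "Ryan")]:
        print(pair, round(score(*pair), 3), is_match(*pair))
```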
Wait, isn't this just another distance function?
Distance functions don't scale, right?
Right
Back to the basics
x     f(x)
-5    -1
-3    4.5
0     7
2     3.5
4     2
Long tail
200,000 Surnames 70,000 Given names
≤ 1/5,000,000 names
Long tail
Use distance function with table here
Use coder here
Result: Table initialized by a function
Dallan: Dalana Daleen Dalen Dalin … Talan Tallon
Ryan: Aaran Aran Arrin … Rian Riana ...
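A minimal sketch of the hybrid lookup, assuming a precomputed table for common names and a coder for everything else. The table holds only the variants visible on the slide (the elided entries are not filled in), and `fallback_code` and `similar` are hypothetical names, not the project's API.

```python
# Variants copied from the slide; the "…" entries are deliberately left out.
VARIANTS = {
    "dallan": {"dalana", "daleen", "dalen", "dalin", "talan", "tallon"},
    "ryan": {"aaran", "aran", "arrin", "rian", "riana"},
}


def fallback_code(name: str) -> str:
    """Crude stand-in coder for rare names: first letter plus consonants."""
    name = name.lower()
    return name[:1] + "".join(ch for ch in name[1:] if ch not in "aeiouhwy")


def similar(a: str, b: str) -> bool:
    """Table lookup for names in the table, coder comparison for the long tail."""
    a, b = a.lower(), b.lower()
    if a == b:
        return True
    if a in VARIANTS or b in VARIANTS:
        # Common name: trust the table, which a distance function filled in offline.
        return b in VARIANTS.get(a, ()) or a in VARIANTS.get(b, ())
    # Rare name: no table row, so compare codes instead.
    return fallback_code(a) == fallback_code(b)


if __name__ == "__main__":
    print(similar("Dallan", "Talan"))
    print(similar("Ryan", "Dalin"))
    print(similar("Solveig", "Solvieg"))
```

The expensive distance function runs once, when the table is built, so a search only pays for a dictionary lookup.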
A nice thing about tables...
Dallan: Dalana Daleen Dalen Dalin … Talan Tallon
Ryan: Aaran Aran Arrin … Rian Riana ...
Add to the table
Nicknames
BehindTheName.com
The New American Dictionary of Baby Names by Leslie Dunkling and William Gosling
A Dictionary of Surnames by Patrick Hanks and Flavia Hodges
WeRelate community
Thank you BehindTheName.com!
Fascinating Family Trees for given names
Result
Given names
Method         Precision   Recall
Soundex        97          65
Our approach   97          74
28% decrease in false negatives
Surnames
Method         Precision   Recall
Soundex        89          68
Our approach   89          77
28% decrease in false negatives
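The 28% figure for surnames can be checked from the recall numbers. The sketch assumes the slide's values are percentages, with recall of 68 for Soundex and 77 for the new approach; the function name is hypothetical.

```python
def false_negative_reduction(recall_before: float, recall_after: float) -> float:
    """Percentage drop in false negatives, given recall values in percent."""
    # False negatives are the true matches that were missed: 100 - recall.
    fn_before = 100 - recall_before
    fn_after = 100 - recall_after
    return (fn_before - fn_after) / fn_before * 100


if __name__ == "__main__":
    # Surnames: missed matches fall from 32% to 23% of all true matches.
    print(round(false_negative_reduction(68, 77)))
```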
Who is using it?
WeRelate.org
Continuous improvement
Community oversight
How do I use it?
Source code and table available on Github: http://github.com/DallanQ/Names
Search
Normalize
Index
Search
Score
Eval
Service
Roadmap
Jan 2011 Open-source project created
Jan 2011 Implemented at WeRelate
Feb 2011 Announce at RootsTech
Continued improvements
Future work
Reduce the number of name variants to look up
Look up multiple codes
Refined Soundex?
Cluster names
Mahout?
Remove “chaff” variants from common names
Conclusion
Images appearing on these slides are copyrighted by the contributors to http://commons.wikimedia.org and are used under license
Thank you Ancestry.com and BehindTheName.com!!!
Identifying name variants is hard
But getting it right is pretty important
names are at the core of genealogical research
Open source algorithm is now freely available
http://github.com/DallanQ/Names
28% reduction in false negatives compared to Soundex
continuous improvement
Hopefully others will benefit from this effort
goal is to improve genealogical searches across the Web