Simple fuzzy name matching in elasticsearch
Uploaded by: basis-technology
Category: Technology
Transcript of Simple fuzzy name matching in elasticsearch
Quick survey: How many of us...
● Regularly develop Elastic applications?
● Develop Elastic applications that include names of…
○ ...People?
○ ...Places?
○ ...Products?
○ ...Organizations?
○ …(other entity types)?
● Have names in languages besides English?
● Want to have better name search?
● Are Elasticsearch or plugin developers?
Motivating Questions...
● How could a border officer know whether you’re on a terrorist watch list?
● How does your bank know if you’re wiring money to a drug lord?
● How can an ecommerce site treat “Ho-medics Ultra sonic” and “Homedics Ultrasconic” as the same thing?
● How can a system search for mentions of people across news articles?
Answer...
Name Matching (plus more)
What kinds of name variation?
Real life example
David K. Murgatroyd, VP of Engineering
Boarding Pass
Current Best Practice?
● multi_field type with a field per possible variation (http://stackoverflow.com/questions/20632042/elasticsearch-searching-for-human-names)
"mappings": { ... "type": "multi_field", "fields": {
"pty_surename": { "type": "string", "analyzer": "simple" },
"metaphone": { "type": "string", "analyzer": "metaphone" },
"porter": { "type": "string", "analyzer": "porter" } …
● Complex query against each field
● Generally gives high recall
Can’t a name field type do this?
● Manage all the subfields
● Contribute score that reflects phenomena
● Be part of queries using many field types
● Have multiple fields per document
● Have multiple values per field (coming soon)
But what if variations co-occur?
“Jesus Alfonso Lopez Diaz” v.
“LobEzDiaS, Chuy”
● Reordered
● Missing token
● Two spelling differences
● Nickname for first name
● Missing space
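To see why co-occurring variations are hard, here is a minimal sketch of a naive approach: lowercase the names, split them into tokens, and count query tokens that have a counterpart within a small edit distance, ignoring token order. The class and method names (`NameVariationSketch`, `matchedTokens`) are hypothetical and this is not the plugin's matching algorithm; it only illustrates the problem.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; NOT the plugin's matching algorithm.
class NameVariationSketch {

    // Classic Levenshtein edit distance (insert, delete, substitute).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Lowercase the name and split it on anything that is not a letter.
    static List<String> tokens(String name) {
        List<String> out = new ArrayList<>();
        for (String t : name.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Order-insensitive comparison: number of query tokens that have
    // some target token within maxEdits edits.
    static int matchedTokens(String query, String target, int maxEdits) {
        List<String> targetTokens = tokens(target);
        int matched = 0;
        for (String q : tokens(query)) {
            for (String t : targetTokens) {
                if (editDistance(q, t) <= maxEdits) {
                    matched++;
                    break;
                }
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        String query = "Jesus Alfonso Lopez Diaz";
        // Reordering alone is handled by the order-insensitive comparison.
        System.out.println(matchedTokens(query, "Diaz, Lopez Alfonso Jesus", 0));
        // Co-occurring variations (nickname, missing space, misspellings) are not.
        System.out.println(matchedTokens(query, "LobEzDiaS, Chuy", 1));
    }
}
```

In this sketch a reordered name still matches on all four tokens, while the slide's example pair matches on none of them, because the missing space and the nickname defeat a per-token comparison.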
Can we do better?
● Incorporates our proprietary name matching technology
● Provides similarity scores for name pairs
● Uses Elasticsearch's Rescore query
● Allows for higher precision ranking and thresholding
● Multi-lingual name search
Demo
How could you use such a Field?
● Plugin contains custom mapper which does all the work behind the scenes
PUT /ofac/ofac/_mapping
{
  "ofac" : {
    "properties" : {
      "name" : { "type" : "rni_name" },
      "aka" : { "type" : "rni_name" }
    }
  }
}
What happens at index time?
● NameMapper indexes keys for different phenomena in separate (sub) fields
@Override
public void parse(ParseContext context) throws IOException {
    Name name = NameBuilder.data(nameString).build();
    //Generate keys for name
    Collection<FieldSpec> fields = helper.deriveFieldsForName(name);
    //Parse each key with the appropriate Mapper
    for (FieldSpec field : fields) {
        Mapper mapper = keyMappers.get(field.getField().fieldName());
        context = context.createExternalValueContext(field.getStringValue());
        mapper.parse(context);
    }
}
Indexing
User Doc:
{ name: "Robert Smith", dob: "1987/02/13" }
Plug-in Implementation (Index):
{ name: "Robert Smith", name.key1: …, name.key2: …, name.key3: …, dob: "1987/02/13" }
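The slides do not say what the keys contain, so the following is only a rough sketch of the general idea: derive several normalized keys from one name and store each under its own subfield. The three key definitions here (sorted tokens, initials, consonant skeleton) and the names `NameKeySketch` and `deriveKeys` are invented for illustration and are not the plugin's actual keys.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; the real plugin's keys are not described in the slides.
class NameKeySketch {

    // Derive subfield keys for one name value, e.g. "name.key1".
    static Map<String, String> deriveKeys(String field, String name) {
        // Lowercase, split on non-letters, and sort so token order does not matter.
        List<String> tokens = new ArrayList<>();
        for (String t : name.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        Collections.sort(tokens);

        StringBuilder initials = new StringBuilder();
        List<String> skeletons = new ArrayList<>();
        for (String t : tokens) {
            initials.append(t.charAt(0));
            // Assumed crude spelling key: drop vowels from each token.
            skeletons.add(t.replaceAll("[aeiou]", ""));
        }

        Map<String, String> keys = new LinkedHashMap<>();
        keys.put(field + ".key1", String.join(" ", tokens));    // order-insensitive tokens
        keys.put(field + ".key2", initials.toString());         // initials
        keys.put(field + ".key3", String.join(" ", skeletons)); // consonant skeleton
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(deriveKeys("name", "Robert Smith"));
    }
}
```

Because the same derivation would run on the query name, a reordered query such as "Smith, Robert" produces the same keys as the indexed "Robert Smith" in this sketch.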
What happens at query time?
● Step #1: NameMapper generates analogous keys for a custom Lucene query that finds good candidates for re-scoring
@Override
public Query termQuery(Object value, @Nullable QueryParseContext context) {
    //Parse name string
    Name name = NameBuilder.data(value.toString()).build();
    QuerySpec spec = helper.buildQuerySpec(new NameIndexQuery(name));
    //Build Lucene query
    Query query = spec.accept(new ESQueryVisitor(names.indexName() + "."));
    return query;
}
What else happens at query time?
● Step #2: Uses a Rescore query to score names in the best candidate documents and reorder accordingly
○ Tuned for high precision name matching
○ Computationally expensive
"rescore" : {
  "query" : {
    "rescore_query" : {
      "function_score" : {
        "name_score" : {
          "field" : "name",
          "query_name" : "LobEzDiaS, Chuy"
        }
        ...
● The 'name_score' function matches the query name against the indexed name in every candidate document and returns the similarity score
@Override
public double score(int docId, float subQueryScore) {
    //Create a scorer for the query name
    CachedScorer cs = createCachedScorer(queryName);
    //Retrieve name data from doc values (first value; name fields are single-valued for now)
    nameByteData.setDocument(docId);
    Name indexName = bytesToName(nameByteData.valueAt(0).bytes);
    //Score the query against the indexed name in this document
    return cs.score(indexName);
}
What does that function do?
User Query:
{ match : { name: "Bob Smitty" } }
Plug-in Implementation:
Main Query: bool: name.Key1:... name.Key2:... name.Key3:...
Rescore Query: name_score : { field : "name", name : "Bob Smitty" }
Scored result:
name: "Robert Smith", dob: 2/13/1987, score: .79
Rescore Params - Speed v. Accuracy
● window_size
○ Controls how many of the top documents to rescore
○ Tradeoff accuracy vs speed
● minScoreToCheck - (Added by Us)
○ Score threshold top doc must meet to be rescored
○ Tradeoff accuracy vs speed
Trading Off Accuracy for Speed
Query → High Recall Query (Elastic) → Subset of High Recall Results (Total < windowsize & Score > minimumScoreThreshold) → Rescoring (for High Precision) → Scored Results
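The selection step in the flow above can be sketched as follows. This is an illustrative reading of the slide, not the plugin's code: it assumes a hit is rescored only if it is among the top `windowSize` results of the main query and its main-query score is strictly greater than `minScoreToCheck`. The `Hit` class and `selectForRescore` method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of choosing which hits get the expensive rescoring.
class RescoreWindowSketch {

    static class Hit {
        final String id;
        final double score;

        Hit(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    // Returns the hits to rescore: inside the window AND above the score threshold.
    static List<Hit> selectForRescore(List<Hit> hits, int windowSize, double minScoreToCheck) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());

        List<Hit> selected = new ArrayList<>();
        int limit = Math.min(windowSize, sorted.size());
        for (int i = 0; i < limit; i++) {
            Hit h = sorted.get(i);
            // Assumption: strict comparison, following "Score > minimumScoreThreshold".
            if (h.score > minScoreToCheck) {
                selected.add(h);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("a", 5.0));
        hits.add(new Hit("b", 3.0));
        hits.add(new Hit("c", 1.0));
        hits.add(new Hit("d", 0.5));
        for (Hit h : selectForRescore(hits, 3, 2.0)) {
            System.out.println(h.id + " " + h.score);
        }
    }
}
```

A smaller window or a higher threshold means fewer calls to the expensive name scorer, at the risk of never rescoring a true match that the main query ranked low.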
Rescore Params - Integration w/Query
● rescore_query
○ Calls the name_score function to get score
○ Combine rescore_queries to query across multiple fields
● query_weight
○ Controls how much weight is given to main query
○ Allows user to include queries on other non-name fields
● rescore_query_weight
○ Controls how much weight is given to rescore query
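As a small worked example of how the two weights interact, the sketch below assumes Elasticsearch's default rescore `score_mode` of `total`, where the final score is the weighted sum of the main query score and the rescore query score. The class and method names are hypothetical, and the numbers are made up for illustration.

```java
// Hypothetical illustration of combining main-query and rescore-query scores,
// assuming the default "total" score_mode (weighted sum).
class RescoreCombineSketch {

    static double combine(double queryScore, double queryWeight,
                          double rescoreScore, double rescoreQueryWeight) {
        return queryWeight * queryScore + rescoreQueryWeight * rescoreScore;
    }

    public static void main(String[] args) {
        // Main query (e.g. keys plus a date-of-birth clause) scored 2.0;
        // the name scorer returned a similarity of 0.79.
        System.out.println(combine(2.0, 0.5, 0.79, 10.0));
        // With query_weight = 0 the ranking depends on the name score alone.
        System.out.println(combine(2.0, 0.0, 0.79, 1.0));
    }
}
```

Setting `query_weight` to zero makes the final order depend only on name similarity, while a non-zero value lets clauses on other fields keep influencing the ranking.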
What Challenges Were There?
● Design based on similar Solr plugin
● 1-2 months solo development time
● Nice plugin infrastructure
● Missing some useful javadocs/comments
● No (official) plugin development guide
● Used other plugin implementations as guides https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-plugins.html#_plugins
Summary: How it works
● Custom field type mapping
○ Splits a single field into multiple fields covering different phenomena
○ Supports multiple name fields in a document
○ Intercepts the query to inject a custom Lucene query
● Custom rescore function
○ Rescores documents with algorithm specific to name matching
○ Limits intense calculations to only top candidates
○ Highly configurable
Suggested Questions:
● What if the names are in other text fields?
● How did you implement multi-valued fields?
● How does it scale?
● How do you handle names not in English?
● How does this relate to the theme of Entity-Centric Search?
● How do the plug-in’s scores relate to Solr scores?