Simple Fuzzy Name Matching in Elasticsearch

June 18, 2015

Brian Sawyer

Engineering

Quick survey: How many of us...

● Regularly develop Elastic applications?
● Develop Elastic applications that include names of…
  ○ ...People?
  ○ ...Places?
  ○ ...Products?
  ○ ...Organizations?
  ○ …(other entity types)?
● Have names in languages besides English?
● Want to have better name search?
● Are Elasticsearch or plugin developers?

Motivating Questions...

● How could a border officer know whether you’re on a terrorist watch list?

● How does your bank know if you’re wiring money to a drug lord?

● How can an ecommerce site treat “Ho-medics Ultra sonic” and “Homedics Ultrasconic” as the same thing?

● How can a system search for mentions of people across news articles?

Answer...

Name Matching (plus more)

What kinds of name variation?

Real life example

David K. Murgatroyd
VP of Engineering

Boarding Pass

Current Best Practice?

● multi_field type with a field per possible variation (http://stackoverflow.com/questions/20632042/elasticsearch-searching-for-human-names)

    "mappings": { ...
      "type": "multi_field",
      "fields": {
        "pty_surename": { "type": "string", "analyzer": "simple" },
        "metaphone": { "type": "string", "analyzer": "metaphone" },
        "porter": { "type": "string", "analyzer": "porter" } …

● Complex query against each field

● Generally gives high recall

Can’t a name field type do this?

● Manage all the subfields

● Contribute score that reflects phenomena

● Be part of queries using many field types

● Have multiple fields per document

● Have multiple values per field (coming soon)

But what if variations co-occur?

“Jesus Alfonso Lopez Diaz” v.

“LobEzDiaS, Chuy”

● Reordered
● Missing token
● Two spelling differences
● Nickname for first name
● Missing space

Can we do better?

● Incorporates our proprietary name matching technology
● Provides similarity scores to name pairs
● Uses Elasticsearch's Rescore query
● Allows for higher precision ranking and thresholding
● Multi-lingual name search

How could you use such a Field?

● Plugin contains custom mapper which does all the work behind the scenes

    PUT /ofac/ofac/_mapping
    {
      "ofac" : {
        "properties" : {
          "name" : { "type" : "rni_name" },
          "aka" : { "type" : "rni_name" }
        }
      }
    }

What happens at index time?

● NameMapper indexes keys for different phenomena in separate (sub) fields

    @Override
    public void parse(ParseContext context) throws IOException {
        Name name = NameBuilder.data(nameString).build();

        //Generate keys for name
        Collection<FieldSpec> fields = helper.deriveFieldsForName(name);

        //Parse each key with the appropriate Mapper
        for (FieldSpec field : fields) {
            Mapper mapper = keyMappers.get(field.getField().fieldName());
            context = context.createExternalValueContext(field.getStringValue());
            mapper.parse(context);
        }
    }
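To make "keys for different phenomena" concrete, here is a minimal sketch. It is not the plugin's key generation, which is proprietary and not shown in these slides. The class `NameKeySketch`, the method `deriveKeys`, and the sub-field names (`name.sorted`, `name.joined`, `name.initials`, `name.skeleton`) are all hypothetical, chosen only to show one key per kind of variation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only: one simple key per kind of name variation.
final class NameKeySketch {

    static Map<String, String> deriveKeys(String name) {
        // Lowercase, then split on anything that is not a letter or digit.
        List<String> tokens = new ArrayList<>();
        for (String t : name.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        // Sorting makes the keys insensitive to token order.
        List<String> sorted = new ArrayList<>(tokens);
        Collections.sort(sorted);

        StringBuilder initials = new StringBuilder();
        StringBuilder skeleton = new StringBuilder();
        for (String t : sorted) {
            initials.append(t.charAt(0));
            if (skeleton.length() > 0) {
                skeleton.append(' ');
            }
            // Keep the first letter, drop later vowels: a crude stand-in for a phonetic key.
            skeleton.append(t.charAt(0)).append(t.substring(1).replaceAll("[aeiou]", ""));
        }

        Map<String, String> keys = new LinkedHashMap<>();
        keys.put("name.sorted", String.join(" ", sorted));   // reordered tokens
        keys.put("name.joined", String.join("", sorted));    // missing spaces
        keys.put("name.initials", initials.toString());      // abbreviated tokens
        keys.put("name.skeleton", skeleton.toString());      // vowel spelling differences
        return keys;
    }
}
```

With these assumed keys, "Smith, Robert" and "Robert Smith" produce identical maps, so a query on any sub-field would retrieve either form as a candidate.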

Indexing

User Doc:
    { name: "Robert Smith", dob: "1987/02/13" }

Plug-in Implementation

Index:
    { name: "Robert Smith", name.key1: …, name.key2: …, name.key3: …, dob: "1987/02/13" }

What happens at query time?

● Step #1: NameMapper generates analogous keys for a custom Lucene query that finds good candidates for re-scoring

    @Override
    public Query termQuery(Object value, @Nullable QueryParseContext context) {
        //Parse name string
        Name name = NameBuilder.data(value.toString()).build();
        QuerySpec spec = helper.buildQuerySpec(new NameIndexQuery(name));

        //Build Lucene query
        Query query = spec.accept(new ESQueryVisitor(names.indexName() + "."));
        return query;
    }

What else happens at query time?

● Step #2: Uses a Rescore query to score names in the best candidate documents and reorder accordingly
  ○ Tuned for high precision name matching
  ○ Computationally expensive

    "rescore" : {
      "query" : {
        "rescore_query" : {
          "function_score" : {
            "name_score" : {
              "field" : "name",
              "query_name" : "LobEzDiaS, Chuy"
            }
            ...

● The 'name_score' function matches the query name against the indexed name in every candidate document and returns the similarity score

    @Override
    public double score(int docId, float subQueryScore) {
        //Create a scorer for the query name
        CachedScorer cs = createCachedScorer(queryName);

        //Retrieve name data from doc values
        nameByteData.setDocument(docId);
        //i is the index of the value within this document (not defined in this excerpt)
        Name indexName = bytesToName(nameByteData.valueAt(i).bytes);

        //Score the query against the indexed name in this document
        return cs.score(indexName);
    }
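The scorer behind `cs.score(indexName)` is proprietary and not shown. As a rough stand-in, the sketch below scores a name pair by matching each query token to its most similar indexed token using normalized Levenshtein distance, so token order does not matter. The class `NamePairScoreSketch` and its methods are hypothetical. This simple approach does not handle nicknames or missing spaces, and it is not symmetric in its two arguments.

```java
import java.util.Locale;

// Hypothetical stand-in for a pairwise name scorer; not the plugin's algorithm.
final class NamePairScoreSketch {

    // Classic dynamic-programming Levenshtein distance using two rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // 1.0 for identical tokens, approaching 0.0 as more edits are needed.
    static double tokenSimilarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) editDistance(a, b) / longest;
    }

    static String[] tokens(String name) {
        return name.toLowerCase(Locale.ROOT).trim().split("[^\\p{L}\\p{N}]+");
    }

    // Average, over query tokens, of the best similarity to any indexed token.
    static double score(String queryName, String indexedName) {
        String[] q = tokens(queryName);
        String[] d = tokens(indexedName);
        double total = 0.0;
        for (String qt : q) {
            double best = 0.0;
            for (String dt : d) {
                best = Math.max(best, tokenSimilarity(qt, dt));
            }
            total += best;
        }
        return total / q.length;
    }
}
```

Because every token pair is compared, the cost grows with name length, which is one reason to run a scorer like this only on a small set of candidates.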

What does that function do?

User Query:
    { match : { name: "Bob Smitty" } }

Plug-in Implementation

Main Query:
    bool: name.Key1:... name.Key2:... name.Key3:...

Rescore Query:
    name_score : { field : "name", name : "Bob Smitty" }

Result:
    name: "Robert Smith", dob: 2/13/1987, score: .79

Indexing

User Doc:
    { name: "Robert Smith", dob: "1987/02/13" }

Index:
    { name: "Robert Smith", name.Key1: …, name.Key2: …, name.Key3: …, dob: "1987/02/13" }

Rescore Params - Speed v. Accuracy

● window_size
  ○ Controls how many of the top documents to rescore
  ○ Tradeoff accuracy vs speed
● minScoreToCheck - (Added by Us)
  ○ Score threshold top doc must meet to be rescored
  ○ Tradeoff accuracy vs speed

Trading Off Accuracy for Speed

Query
  → High Recall Query (Elastic)
  → Subset of High Recall Results (Total < windowsize & Score > minimumScoreThreshold)
  → Rescoring (for High Precision)
  → Scored Results
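The flow above can be sketched as a two-pass ranking. This is a minimal illustration under stated assumptions, not the plugin's code: the class `RescoreWindowSketch`, the `Hit` type, and the choice to leave skipped hits in first-pass order behind the rescored ones are all hypothetical. For simplicity the expensive score replaces the first-pass score rather than being combined with it.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch of window- and threshold-limited rescoring.
final class RescoreWindowSketch {

    static final class Hit {
        final String name;
        final double score;

        Hit(String name, double score) {
            this.name = name;
            this.score = score;
        }
    }

    // hits must already be sorted by descending first-pass score.
    static List<Hit> rescore(List<Hit> hits, int windowSize, double minScoreToCheck,
                             ToDoubleFunction<String> expensiveScorer) {
        List<Hit> rescored = new ArrayList<>();
        List<Hit> skipped = new ArrayList<>();
        for (int i = 0; i < hits.size(); i++) {
            Hit h = hits.get(i);
            // Only hits inside the window AND above the threshold pay for the expensive scorer.
            if (i < windowSize && h.score >= minScoreToCheck) {
                rescored.add(new Hit(h.name, expensiveScorer.applyAsDouble(h.name)));
            } else {
                skipped.add(h);
            }
        }
        // Reorder the rescored hits by their new score, best first (stable for ties).
        rescored.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        // Assumption: skipped hits keep their first-pass order after the rescored ones.
        rescored.addAll(skipped);
        return rescored;
    }
}
```

A larger window or a lower threshold sends more candidates through the expensive scorer, which improves accuracy at the cost of speed.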

Rescore Params - Integration w/Query

● rescore_query
  ○ Calls the name_score function to get score
  ○ Combine rescore_queries to query across multiple fields
● query_weight
  ○ Controls how much weight is given to main query
  ○ Allows user to include queries on other non-name fields
● rescore_query_weight
  ○ Controls how much weight is given to rescore query
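As a small worked example of how the two weights interact, the sketch below assumes Elasticsearch's default rescore `score_mode` of `total`, in which the weighted original score and the weighted rescore score are added. The class and method names are hypothetical.

```java
// Hypothetical sketch of combining scores, assuming score_mode "total".
final class RescoreWeightSketch {

    static double combine(double queryScore, double queryWeight,
                          double rescoreScore, double rescoreQueryWeight) {
        return queryWeight * queryScore + rescoreQueryWeight * rescoreScore;
    }
}
```

Under this assumption, setting query_weight to 0 and rescore_query_weight to 1 makes the final score equal to the name similarity alone (for example .79), which is convenient for thresholding. A non-zero query_weight lets matches on other, non-name fields contribute to the ranking.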

What Challenges Were There?

● Design based on similar Solr plugin
● 1-2 months solo development time
● Nice plugin infrastructure
● Missing some useful javadocs/comments
● No (official) plugin development guide
● Used other plugin implementations as guides

https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-plugins.html#_plugins



Summary: How it works

● Custom field type mapping
  ○ Splits a single field into multiple fields covering different phenomena
  ○ Supports multiple name fields in a document
  ○ Intercepts the query to inject a custom Lucene query
● Custom rescore function
  ○ Rescores documents with algorithm specific to name matching
  ○ Limits intense calculations to only top candidates
  ○ Highly configurable

Suggested Questions:

● What if the names are in other text fields?
● How did you implement multi-valued fields?
● How does it scale?
● How do you handle names not in English?
● How does this relate to the theme of Entity-Centric Search?
● How do the plug-in’s scores relate to Solr scores?
