Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecker for Classified Ads

SPELLCHECKING IN TROVIT: IMPLEMENTING A CONTEXTUAL MULTI-LANGUAGE SPELLCHECKER FOR CLASSIFIED ADS
Xavier Sanchez Loro
R&D Engineer

Outline

•  Introduction
•  Our approach: Contextual Spellchecking
•  Nature and characteristics of our document corpus
•  Spellcheckers in Solr
•  White-listing and purging: controlling dictionary data
•  Spellchecker configuration
•  Customizing Solr’s SpellCheckComponent
•  Conclusions and Future Work

Supporting text for this talk

Trovit Engineering Blog post on spellchecking:
http://tech.trovit.com/index.php/spellchecking-in-trovit/

INTRODUCTION

Introduction

Trovit: a search engine for classified ads

Introduction: spellchecking in Trovit

•  Multi-language spellchecking system using Solr and Lucene
•  Objectives:
    –  help our users find the ads they are looking for
    –  avoid the dreaded 0 results as much as possible
    –  our goal is not only pure orthographic correction but also suggesting searches that are correct for a given site.

OUR APPROACH: CONTEXTUAL SPELLCHECKING

Contextual Spellchecking: approach

•  The key element in the spellchecking process is choosing the right dictionary
    –  one with a relevant vocabulary
        •  according to the type of information included in each site.
•  Approach
    –  Specializing the dictionaries based on the user’s search context.
•  Search contexts are composed of:
    –  country (with a default language)
    –  vertical (determining the type of ads and vocabulary).
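
As an illustration only, the idea of one dictionary per search context can be sketched as a lookup keyed by (country, vertical). The site keys, the sample words and the function names below are hypothetical and are not taken from Trovit's code.

```python
from typing import Dict, Set, Tuple

# Hypothetical per-site vocabularies keyed by (country, vertical).
SITE_DICTIONARIES: Dict[Tuple[str, str], Set[str]] = {
    ("es", "homes"): {"piscina", "garaje", "jardin", "piso"},
    ("es", "cars"): {"diesel", "gasolina", "automatico"},
    ("uk", "homes"): {"garden", "garage", "flat"},
}


def dictionary_for(country: str, vertical: str) -> Set[str]:
    """Return the vocabulary for one search context (one site)."""
    return SITE_DICTIONARIES[(country, vertical)]


def is_known(term: str, country: str, vertical: str) -> bool:
    """A term is only 'known' relative to the site it is searched on."""
    return term.lower() in dictionary_for(country, vertical)
```

With these sample sets, "piscina" is a known term on the Spanish homes site but not on the Spanish cars site, which is the kind of context dependence the slide describes.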

Contextual Spellchecking: vocabularies

•  Each site’s document corpus has a limited vocabulary
    –  reduced to the type of information, language and terms included in each site’s ads.
•  Using a more generalized approach is not suitable for our needs
    –  One vocabulary per language is less precise than specialized vocabularies for each site.
    –  Drastic differences in
        •  type of terms
        •  semantics of each vertical.
    –  Terms that are relevant in one context are meaningless in another
•  Different vocabularies for each site, even when supporting the same language.
    –  Vocabulary is tailored according to the context of searches

NATURE AND CHARACTERISTICS OF OUR DOCUMENT CORPUS

Challenges: Inconsistencies in our corpus

•  The document corpus is fed by different third-party sources
    –  providing the ads for the different sites.
•  We can detect incorrect documents and reconcile certain inconsistencies
    –  but we cannot control or modify the content of the ads themselves.
•  Inconsistencies
    –  hinder any language detection process
    –  pose challenges to the development of the spellchecking system

Inconsistencies example

•  Spanish homes vertical
    –  not fully written in Spanish
    –  ads in several languages
        •  native languages: Spanish, Catalan, Basque and Galician
        •  foreign languages: English, German, French, Italian, Russian… even oriental languages like Chinese!
        •  multi-language ads
    –  badly written and misspelled words
        •  Spanish words badly translated from regional languages
        •  overtly misspelled words
            –  e.g. “picina” yields 1197 docs vs. 1048434 for “piscina”, roughly 0.11%
    –  “noisy” content
        •  numbers, postal codes, references, etc.
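
The relative weight of the misspelling can be checked directly from the two document counts on the slide. This is just the arithmetic; the variable names are ours.

```python
# Document counts quoted on the slide for the Spanish homes index.
misspelled_docs = 1197      # ads containing "picina"
correct_docs = 1048434      # ads containing "piscina"

# Share of the misspelled form relative to the correct one.
ratio = misspelled_docs / correct_docs
print(f"picina / piscina = {ratio:.2%}")
```

The ratio comes out at roughly 0.11%. That is small in relative terms, but it still means more than a thousand documents contain the misspelled form, so a dictionary built straight from the index would treat it as a valid word.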

Characteristics of our ads

•  Summarizing
    –  Corpus segmented into different indexes, one per country plus vertical (site)
    –  Third-party generated
    –  Ads in the national language plus other languages (regional and foreign)
    –  Multi-language content in ads
    –  Noisy content (numbers, references, postal codes, etc.)
    –  Small texts (around 3000 characters long)
    –  Misspellings and incorrect words

The corpus is unreliable as the knowledge base for building any spellchecking dictionary.

What/Where search segmentation

•  Geolocation data is not mixed with vertical data
    –  Only vertical data (no geodata)
    –  Narrower dictionary, fewer collisions, more controllable
•  Geolocation data interleaved with vertical data
    –  Cover all geodata
    –  Wider dictionary, more collisions, less controllable

SPELLCHECKERS IN SOLR

IndexBasedSpellChecker

•  It creates a parallel index for the spelling dictionary that is based on an existing Lucene index.
    –  Depends on index data correctness (misspellings)
    –  Creates an additional index from the current index (small, MB)
    –  Supports term frequency parameters
    –  Must be (re)built
•  Even though this component behaves as expected
    –  it was of no use for Trovit’s use case.

IndexBasedSpellChecker

•  It depends on index data
    –  not an accurate and reliable source for the spellchecking dictionary.
•  Continuous builds
    –  required to keep the index data and the spelling index data in sync.
    –  If not
        •  frequency information and hit counting are neither reliable nor accurate.
        •  false positives/negatives
        •  suggestions of words with a different number of hits, even 0.
•  We cannot risk suffering this situation

FileBasedSpellChecker

•  It uses a flat file to generate a spelling dictionary in the form of a Lucene spellchecking index.
    –  Requires a dictionary file
    –  Creates an additional index from the dictionary file (small, MB)
    –  Does not depend on index data (controlled data)
    –  Build once
        •  rebuild only if the dictionary is updated
    –  No frequency information used when calculating spelling suggestions

FileBasedSpellChecker

•  Also requires rebuilds
    –  albeit less frequently
•  No frequency-related data
    –  Pure orthographic correction is not our main goal
    –  We cannot risk suggesting corrections without results.
•  But
    –  it gave us insight on how to approach the final solution we are implementing.
    –  it allows the highest degree of control over dictionary contents
        •  an essential feature for spelling dictionaries.

DirectSpellChecker

•  Experimental spellchecker that just uses the main Solr index directly
    –  Build/rebuild is not required.
    –  Depends on index data correctness (misspellings)
    –  Uses the existing index
        •  field: source of the spelling dictionary.
    –  Supports term frequency parameters.
    –  No (re)build.
•  Several promising features
    –  No build + continuously in sync with index data.
    –  Provides accurate frequency information.

DirectSpellChecker

•  The real drawback
    –  lack of control over the index data sourcing the spelling dictionary.
•  If we can overcome it, this type would make an ideal candidate for our use case.

WordBreakSpellChecker

•  Generates suggestions by combining adjacent words and/or breaking words into multiples.
    –  This spellchecker can be configured with a traditional checker (e.g. DirectSolrSpellChecker).
    –  The results are combined and collations can contain a mix of corrections from both spellcheckers.
    –  Uses the existing index. No build.

WordBreakSpellChecker

•  Good complement to the other spellcheckers
•  It works really well with well-written concatenated words
    –  it is able to break them up with great accuracy.
•  Combining split words is not as accurate
•  Drawback: it’s based on index data.
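
To make the "breaking words" behaviour concrete, here is a minimal sketch of dictionary-based word segmentation. It is not the algorithm used by Lucene's WordBreakSpellChecker, which works on index terms and their frequencies; it only shows, with a hypothetical vocabulary set, why correctly written concatenated words are easy to split while a misspelled fragment blocks the split.

```python
from typing import List, Optional, Set


def break_words(text: str, vocabulary: Set[str]) -> Optional[List[str]]:
    """Split `text` into vocabulary words, or return None if impossible."""
    # best[i] holds one segmentation of text[:i], or None if none exists.
    best: List[Optional[List[str]]] = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            prefix = best[start]
            word = text[start:end]
            if prefix is not None and word in vocabulary:
                best[end] = prefix + [word]
                break
    return best[len(text)]


# Hypothetical vocabulary for a Spanish homes site.
vocabulary = {"piscina", "garaje", "jardin"}
print(break_words("piscinagarajejardin", vocabulary))
print(break_words("picinagaraje", vocabulary))
```

The first call finds a full split because every piece is a vocabulary word. The second returns None because "picina" is not in the vocabulary, so a misspelling has to be corrected by another checker before or alongside the word break.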

WHITE-LISTING AND PURGING: CONTROLLING DICTIONARY DATA

White-listing

•  Any spelling system can only be as good as its knowledge base or dictionary is accurate.
•  We need to control the data indexed as dictionary.
•  White-listing approach
    –  we only index spelling data contained in a controlled dictionary list.
    –  processes to build a base dictionary specialized for a given site.

White-list building process

SPELLCHECKER CONFIGURATION

Initial spellchecker configuration

•  DirectSpellChecker using a purged spell field
    –  Spell field filled with purged content
        •  Purging according to the whitelist
        •  Whitelist generated by matching the dictionary with index words, after the purge process
•  Benefits:
    –  Build is no longer required.
    –  Spell field is automatically updated via the pipeline.
    –  We can work with term frequencies.
    –  No additional index, just an additional field.
    –  Better relevance and suggestions.
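
A rough sketch of the purge step, under the assumption that the whitelist is the set of dictionary words that also occur in the site's index and that purging means keeping only whitelisted tokens for the spell field. The function names and the sample data are hypothetical; the real pipeline is not described in the slides.

```python
import re
from typing import Iterable, List, Set


def build_whitelist(dictionary: Iterable[str], index_terms: Iterable[str]) -> Set[str]:
    """Keep only dictionary words that actually occur in the site's index."""
    return set(dictionary) & set(index_terms)


def purge(ad_text: str, whitelist: Set[str]) -> List[str]:
    """Return the tokens of an ad that are allowed into the spell field."""
    tokens = re.findall(r"\w+", ad_text.lower())
    return [token for token in tokens if token in whitelist]


# Hypothetical base dictionary and index terms for a Spanish homes site.
dictionary = {"piscina", "garaje", "jardin", "terraza"}
index_terms = {"piscina", "picina", "garaje", "jardin", "08001"}

whitelist = build_whitelist(dictionary, index_terms)
spell_field = purge("Piso con picina, garaje y jardin 08001", whitelist)
print(spell_field)
```

In this example the misspelling "picina" and the postal code never reach the spell field, so a checker reading that field cannot suggest them, while term frequencies for the surviving words still come from real ads.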

Initial spellchecker configuration

•  Cons:
    –  Whitelist maintenance and creation for new sites.
•  Features:
    –  Accurate detection of misspelled words.
    –  Good detection of concatenated words.
        •  piscinagarajejardin to piscina garaje jardin
        •  picina garajejardin to piscina (garaje jardin)
    –  Able to detect several misspelled words.
    –  Evolution based on fine-tuning of the whitelist.

Initial spellchecker configuration

•  Issues:
    –  False negatives: suggestion of corrections when words are correctly spelled.
    –  Suggestions for all the words in the query, not just the misspelled ones.
    –  Misleading “correctlySpelled” parameter.
        •  The parameter depends on frequency information, making it unreliable for our purposes.
        •  It returns true/false according to thresholds,
            –  not really depending on word distance but on
            –  results found and the “alternativeTermCount” and “maxResultsForSuggest” thresholds.
    –  Minor discrepancies if we only index boosted terms (i.e. qf)
        •  # hits in spell field < # docs in index
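
For reference, the two thresholds named above are ordinary Solr request parameters. The sketch below only builds a request URL; the host, the core name "homes_es" and the threshold values are made up for illustration, and it assumes the /select handler has the spellcheck component attached.

```python
from urllib.parse import urlencode


def spellcheck_url(base_url: str, query: str) -> str:
    """Build a Solr select URL with spellchecking enabled (values are illustrative)."""
    params = {
        "q": query,
        "spellcheck": "true",
        "spellcheck.q": query,
        "spellcheck.count": 5,
        "spellcheck.collate": "true",
        "spellcheck.extendedResults": "true",
        # Thresholds that influence the "correctlySpelled" flag in the response.
        "spellcheck.alternativeTermCount": 2,
        "spellcheck.maxResultsForSuggest": 5,
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core for the Spanish homes site.
url = spellcheck_url("http://localhost:8983/solr/homes_es", "picina garaje")
print(url)
```

Because both thresholds are tied to hit counts and term frequencies, the flag returned for the same query can change as the index changes, which is the unreliability the slide points at.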

CUSTOMIZING SOLR SPELLCHECKCOMPONENT

Hacking SpellCheckComponent

•  Lack of reliability of the “correctlySpelled” parameter
    –  Difficult to know when to give a suggestion or not.
    –  First policy based on document hits
        •  sliding window
            –  based on the number of queried terms
        •  the longer the tail, the smaller the threshold
        •  inaccurate and prone to collisions.
    –  Difficult to set up thresholds to a good level of accuracy.

We needed a more reliable way.

Hacking SpellCheckComponent: correctlySpelled parameter behaviour

•  Binary approach to deciding if a word is correctly spelled or not.
•  Simpler approach
    –  any term that appears in our spelling field is a correctly spelled word
        •  regardless of the value of its frequency info or the configured thresholds.
    –  this way the parameter can be used to control when to start querying the spellchecking index.
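
The binary rule can be sketched in a few lines. This is an illustration of the decision only, not the modified Solr component; the spell field is modelled as a plain set of terms and the function names are ours.

```python
from typing import List, Set


def misspelled_terms(query: str, spell_field_terms: Set[str]) -> List[str]:
    """Terms of the query that do not appear in the spelling field."""
    return [term for term in query.lower().split() if term not in spell_field_terms]


def correctly_spelled(query: str, spell_field_terms: Set[str]) -> bool:
    """Binary rule: correct if and only if every term is in the spelling field."""
    return not misspelled_terms(query, spell_field_terms)


# Hypothetical contents of the purged spell field.
spell_field_terms = {"piscina", "garaje", "jardin"}

print(correctly_spelled("piscina garaje", spell_field_terms))
print(misspelled_terms("piscina garage", spell_field_terms))
```

No frequency or threshold enters the decision, so the result for a given query only changes when the contents of the spell field change.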

Hacking SpellCheckComponent

•  Other changes to the SpellCheckComponent:
    –  No suggestions when words are correctly spelled.
    –  Only makes suggestions for misspelled words, not for all words
        •  e.g. piscina garage -> piscina garaje
•  Spanish-friendly ASCIIFoldingFilter
    –  modified in order to not fold the “ñ” (for Spanish) and “ç” (for Catalan names) characters.
        •  Avoids collisions with similar words with “n” and “c”
            –  e.g. “pena” and “peña”
    –  Still folding accented vowels
        •  usually omitted by users.
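
The folding behaviour can be approximated as follows. This is a simplified sketch, not the modified Lucene filter: it only strips combining marks after Unicode decomposition, whereas the real ASCIIFoldingFilter maps a much larger set of characters. The set of protected characters is taken from the slide.

```python
import unicodedata

# Characters that must survive folding: n-tilde and c-cedilla, both cases.
KEEP = {"\u00f1", "\u00d1", "\u00e7", "\u00c7"}


def fold(text: str) -> str:
    """Strip diacritics from `text`, except on the protected characters."""
    # Compose first so the protected characters are single code points.
    text = unicodedata.normalize("NFC", text)
    folded = []
    for char in text:
        if char in KEEP:
            folded.append(char)
            continue
        decomposed = unicodedata.normalize("NFD", char)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(folded)


print(fold("jard\u00edn"), fold("pe\u00f1a"), fold("pena"))
```

Accented vowels are folded, so "jardín" and "jardin" index to the same term, while "peña" and "pena" stay distinct.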

CONCLUSIONS AND FUTURE WORK

Conclusion & Future Work

•  Base code
    –  expand the spellchecking process to other sites
    –  design the final policy to decide when to give suggestions.
•  Geodata in homes verticals
    –  find ways to avoid collisions in large dictionary sets.
•  Scoring system for spelling dictionary
    –  Control suggestions based on user input
        •  Feedback on relevance or quality of our spellchecking suggestions.
        •  System more accurate and reliable
        •  Expand whitelists to cover large amounts of geodata
            –  with acceptable levels of precision.

Conclusion & Future Work

•  Plural suggester
    –  suggest alternative searches and corrections using plural or singular variants of the terms in the query.
    –  Use frequency and scoring information to choose the most suitable suggestions.

THANKS FOR YOUR ATTENTION! ANY QUESTIONS?

References

[1] Lucene/Solr Revolution EU 2013. Dublin, 6-7 November 2013. http://www.lucenerevolution.org/
[2] Trovit. A search engine for classified ads of real estate, jobs, cars and vacation rentals. http://www.trovit.com
[3] Apache Software Foundation. “Apache Solr”. https://lucene.apache.org/solr/
[4] Apache Software Foundation. “Apache Lucene”. https://lucene.apache.org
[5] Apache Software Foundation. “Spell Checking – Apache Solr Reference Guide”. https://cwiki.apache.org/confluence/display/solr/Spell+Checking