Elasticsearch: Dealing With Human Language

Welcome!

12/06/2015

Elasticsearch-3
Dealing with Human Language

Mohammad Aminul Islam
11103812
Cologne University of Applied Science

Topics to Be Covered
Starting with languages
Analyzer
Stemming
Language identification
Identifying words
Normalizing tokens
Reducing words to their root form
Stemming issues
Lemmatization
Types of stemmer
Stopwords: Performance vs. Precision
Synonyms, Typos and Misspellings

Goals
What is happening behind the search engine

"I know all the words, but that sentence makes no sense to me." (Matt Groening)

Starting with languages: Analyzer
Elasticsearch has a collection of language analyzers.
These analyzers perform four tasks:
Tokenize text into individual words: The quick brown foxes > The, quick, brown, foxes
Lowercase tokens: The > the
Remove common stopwords: quick, brown, foxes
Reduce tokens to their root form: foxes > fox

Analyzer
The english analyzer removes the possessive 's: John's > John

The french analyzer removes elisions like l' and qu'

The german analyzer normalizes terms, replacing ä and ae with a, and ß with ss

Question?

I am not happy with HTC mobile > I, am, happy, HTC, mobile
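The effect in the question above can be reproduced with a minimal sketch of the first three analyzer steps (tokenize, lowercase, remove stopwords). This is not Elasticsearch code: the function name `analyze` and the simple regular-expression tokenizer are assumptions made for illustration, stemming is left out, and the stopword set is the default English list quoted later in these slides.

```python
import re

# Default English stopwords as listed on the "Default stopwords" slide.
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}

def analyze(text, stopwords=STOPWORDS):
    """Toy analyzer (hypothetical helper): tokenize, lowercase, drop stopwords."""
    # Step 1: split into word-like tokens (a crude stand-in for a real tokenizer).
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    # Step 2: lowercase every token.
    tokens = [token.lower() for token in tokens]
    # Step 3: remove stopwords; "not" and "with" disappear here.
    return [token for token in tokens if token not in stopwords]

print(analyze("I am not happy with HTC mobile"))
print(analyze("The quick brown foxes"))
```

Because "not" is in the default stopword list, the negation is lost before the text is indexed. The stopword set is passed as a parameter so that the effect of changing it can be tried out.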

What can we do now?

Analyzer: Configuration
Language analyzers can be used without configuration, but they also allow some of their behaviour to be controlled.

Stem-word exclusion
For example: organization and organs are both stemmed to organ, although they are different words.

Custom stopwords
For example: not and no can be considered important words and kept.

Incorrect stemming
Stemming rules differ from language to language.
We cannot use one stemming rule for all languages.
Reducing a word to its root sometimes changes its meaning.

For example: ebay.co.uk, ebay.de

Identify Language
We know the language of our own documents.
External documents may contain different languages.
We can use a language detector to identify the language.

For example: the Compact Language Detector from Google
It can detect 160+ languages.
It can detect multiple languages within a single line of text.

Identifying Words

Words
Words are separated by whitespace or punctuation.
In English the boundaries of some words are debatable: o'clock, cooperate, eyewitness
German and Dutch have compound words.
Some Asian languages have no whitespace between words.

Dedicated analyzers exist for many languages.

Standard Analyzer
The standard analyzer is used by default.

The standard analyzer can also be defined as a custom analyzer.
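As a rough sketch of what such a definition might look like, the index settings below declare a custom analyzer built from the standard tokenizer and the lowercase token filter. The analyzer name `my_standard` is hypothetical, and the sketch only builds and prints the JSON body; it does not talk to a cluster. The exact set of filters in the built-in standard analyzer depends on the Elasticsearch version, so treat the filter list as an assumption.

```python
import json

# Settings body that could be supplied when creating an index.
# "my_standard" is a made-up analyzer name used only for this sketch.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_standard": {
                    "type": "custom",          # build the analyzer from parts
                    "tokenizer": "standard",   # Unicode-aware word tokenizer
                    # A "stop" filter could be appended here to remove stopwords.
                    "filter": ["lowercase"],
                }
            }
        }
    }
}

print(json.dumps(index_settings, indent=2))
```

Spelling the analyzer out this way makes it possible to add, remove or reorder token filters, which the built-in analyzer does not allow.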

Standard Tokenizer
A tokenizer takes a string as input, processes it and breaks it into individual words.
Input: You are the 1st runner home!
The whitespace tokenizer simply breaks on whitespace: You, are, the, 1st, runner, home!
The letter tokenizer breaks on any character that is not a letter: You, are, the, st, runner, home
The standard tokenizer uses the Unicode Text Segmentation algorithm, which lets it handle text containing a mixture of languages.

Reducing words to their root form

Words can change their form
Number: fox, foxes
Tense: pay, paid, paying
Gender: waiter, waitress
Person: hear, hears

Stemming tries to remove the difference between an inflected word and its root word.
For example: foxes > fox
Problem: the root word does not always have the same meaning.

Two issues of Stemming
Understemming: failing to reduce words with the same meaning to the same root.
For example: jumped and jumps > jump, but jumping > jumpi. Relevant documents are not returned.
Overstemming: failing to keep two words with distinct meanings separate.
For example: general and generate > gener. Irrelevant documents are returned.

Lemmatization
A lemma is the canonical form of a set of related words.
For example: the lemma of paying, paid, pays is pay.

Lemmatization can group words by their word sense. For example: wake can mean to wake up or a funeral; these are different senses.

Lemmatization is a much more complicated and expensive process.

Types of stemmer
Algorithmic stemmers:
They use an algorithm
Easy to use
Fast, use little memory
Good for regular words
Dictionary stemmers:
They use a dictionary
They use more memory
Have to load all words
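The trade-off between the two stemmer types can be illustrated with a toy sketch. Neither function is a real stemmer: `algorithmic_stem` strips a handful of suffixes chosen for this example (it is not the Porter algorithm), and `dictionary_stem` looks words up in a tiny hand-written table. Both names and the word table are assumptions.

```python
# Suffixes tried in order; a real algorithmic stemmer has many more rules.
SUFFIXES = ("ing", "ed", "es", "s")

def algorithmic_stem(word):
    """Toy rule-based stemmer: strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# A dictionary stemmer must hold every word form it should recognise.
WORD_TABLE = {"foxes": "fox", "paid": "pay", "paying": "pay", "pays": "pay"}

def dictionary_stem(word):
    """Toy dictionary stemmer: look the word up, fall back to the word itself."""
    word = word.lower()
    return WORD_TABLE.get(word, word)

for word in ("foxes", "jumping", "paid"):
    print(word, algorithmic_stem(word), dictionary_stem(word))
```

The rule-based version handles regular forms it has never seen (jumping) but misses the irregular form paid, while the dictionary version gets paid right but can do nothing with a word that is not in its table.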

Question?
Which stemmer should we use?

Stopwords
Performance vs. Precision

For search purposes some words are more important than others. For better indexing we need to find the valuable words.

Low-frequency terms: words that rarely appear in documents have a high value or weight.
High-frequency terms: common words that appear in many documents, such as the and an, have a lower value or weight.

Default stopwords
The default English stopwords used in Elasticsearch are as follows:

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with

These stopwords are filtered out before indexing, with little negative impact on retrieval. Is it a good idea?

Pros & Cons of Stopwords
Cons:
Distinguishing happy from not happy
Finding Shakespeare's quotation "To be, or not to be"
Using the country code for Norway: no

Pros:
Performance
Search for fox instead of the fox

Synonyms
Synonyms are listed as comma-separated values.

With the => syntax, it is possible to specify a list of terms to

Typos & Misspellings
80% of human misspellings have an edit distance of 1.
80% of misspellings could be corrected with a single edit.
The allowed edit distance is specified by the fuzziness parameter, up to a maximum of 2. With fuzziness set to AUTO it depends on the length of the term:

0 for strings of one or two characters
1 for strings of three, four, or five characters
2 for strings of more than five characters
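The length-based thresholds listed above can be combined with an edit-distance function to decide whether two terms count as a fuzzy match. The sketch below is an illustration, not the Elasticsearch implementation: it assumes the distance counts insertions, deletions, substitutions and swaps of adjacent characters (the optimal string alignment variant), and the helper names are made up.

```python
def edit_distance(a, b):
    """Optimal string alignment distance between a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

def auto_fuzziness(term):
    """Allowed edit distance for a term, following the thresholds above."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2

def fuzzy_match(query, candidate):
    """True if candidate is within the allowed edit distance of query."""
    return edit_distance(query, candidate) <= auto_fuzziness(query)

print(edit_distance("fox", "box"))
print(fuzzy_match("mispeling", "misspelling"))
```

Short terms get no tolerance at all because a single edit changes too much of a two-letter word.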

Phonetic Matching
Words may sound similar although their spelling differs.
Algorithms for converting a word to a phonetic representation:
Soundex is the granddaddy of them all
Metaphone and Double Metaphone for English
Caverphone for matching names in New Zealand
Kölner Phonetik for better handling of German words
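To give an idea of what a phonetic representation looks like, here is a small sketch of the classic American Soundex rules: keep the first letter, map the following consonants to digits, collapse repeated codes, and pad the result to four characters. It is written from the commonly published rules rather than taken from any Elasticsearch plugin, and it only handles the letters a to z.

```python
# Consonant groups and their Soundex digits.
GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
CODES = {letter: digit for digit, letters in GROUPS.items() for letter in letters}

def soundex(word):
    """Return the four-character Soundex code for a word."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w are skipped and do not separate equal codes
        code = CODES.get(letter, "")  # vowels and y have no code
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated code counts again
        if len(result) == 4:
            break
    return result.ljust(4, "0")

for name in ("Robert", "Rupert", "Smith", "Smyth"):
    print(name, soundex(name))
```

Names that sound alike end up with the same code, so a search for one spelling can also find the other.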

Q&A

Summary
Analyzer
Stemming
Identifying words
Normalizing tokens
Types of stemmer
Stopwords
Synonyms, Misspellings

Thank you
