Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse...
Transcript of Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse...
![Page 1: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/1.jpg)
Introduction to Information Retrievalhttp://informationretrieval.org
IIR 3: Dictionaries and tolerant retrieval
Hinrich Schutze
Center for Information and Language Processing, University of Munich
2014-04-10
1 / 114
![Page 2: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/2.jpg)
Overview
1 Recap
2 Dictionaries
3 Wildcard queries
4 Edit distance
5 Spelling correction
6 Soundex
2 / 114
![Page 3: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/3.jpg)
Outline
1 Recap
2 Dictionaries
3 Wildcard queries
4 Edit distance
5 Spelling correction
6 Soundex
3 / 114
![Page 4: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/4.jpg)
Type/token distinction
Token – an instance of a word or term occurring in adocument
Type – an equivalence class of tokens
In June, the dog likes to chase the cat in the barn.
12 word tokens, 9 word types
4 / 114
![Page 5: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/5.jpg)
Problems in tokenization
What are the delimiters? Space? Apostrophe? Hyphen?
For each of these: sometimes they delimit, sometimes theydon’t.
No whitespace in many languages! (e.g., Chinese)
No whitespace in Dutch, German, Swedish compounds(Lebensversicherungsgesellschaftsangestellter)
5 / 114
![Page 6: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/6.jpg)
Problems with equivalence classing
A term is an equivalence class of tokens.
How do we define equivalence classes?
Numbers (3/20/91 vs. 20/3/91)
Case folding
Stemming, Porter stemmer
Morphological analysis: inflectional vs. derivational
Equivalence classing problems in other languages
More complex morphology than in EnglishFinnish: a single verb may have 12,000 different formsAccents, umlauts
6 / 114
![Page 7: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/7.jpg)
Skip pointers
16 28 72
5 51 98
2 4 8 16 19 23 28 43
1 2 3 5 8 41 51 60 71
Brutus
Caesar
7 / 114
![Page 8: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/8.jpg)
Positional indexes
Postings lists in a nonpositional index: each posting is just a docIDPostings lists in a positional index: each posting is a docID and a list ofpositionsExample query: “to1 be2 or3 not4 to5 be6”
to, 993427:〈 1: 〈 7, 18, 33, 72, 86, 231〉;2: 〈1, 17, 74, 222, 255〉;4: 〈 8, 16, 190, 429, 433〉;5: 〈363, 367〉;7: 〈13, 23, 191〉; . . . 〉
be, 178239:〈 1: 〈 17, 25〉;4: 〈 17, 191, 291, 430, 434〉;5: 〈14, 19, 101〉; . . . 〉 Document 4 is a match!
8 / 114
![Page 9: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/9.jpg)
Positional indexes
With a positional index, we can answer phrase queries.
With a positional index, we can answer proximity queries.
9 / 114
![Page 10: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/10.jpg)
Take-away
Tolerant retrieval: What to do if there is no exact matchbetween query term and document term
Wildcard queries
Spelling correction
10 / 114
![Page 11: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/11.jpg)
Outline
1 Recap
2 Dictionaries
3 Wildcard queries
4 Edit distance
5 Spelling correction
6 Soundex
11 / 114
![Page 12: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/12.jpg)
Inverted index
For each term t, we store a list of all documents that contain t.Brutus −→ 1 2 4 11 31 45 173 174
Caesar −→ 1 2 4 5 6 16 57 132 . . .
Calpurnia −→ 2 31 54 101
...
︸ ︷︷ ︸ ︸ ︷︷ ︸
dictionary postings
12 / 114
![Page 13: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/13.jpg)
Dictionaries
The dictionary is the data structure for storing the termvocabulary.
Term vocabulary: the data
Dictionary: the data structure for storing the term vocabulary
13 / 114
![Page 14: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/14.jpg)
Dictionary as array of fixed-width entries
For each term, we need to store a couple of items:
document frequencypointer to postings list. . .
Assume for the time being that we can store this informationin a fixed-length entry.
Assume that we store these entries in an array.
14 / 114
![Page 15: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/15.jpg)
Dictionary as array of fixed-width entries
term documentfrequency
pointer topostings list
a 656,265 −→aachen 65 −→. . . . . . . . .zulu 221 −→
space needed: 20 bytes 4 bytes 4 bytes
How do
we look up a query term qi in this array at query time? That is:which data structure do we use to locate the entry (row) in thearray where qi is stored?
15 / 114
![Page 16: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/16.jpg)
Data structures for looking up term
Two main classes of data structures: hashes and trees
Some IR systems use hashes, some use trees.
Criteria for when to use hashes vs. trees:
Is there a fixed number of terms or will it keep growing?What are the relative frequencies with which various keys willbe accessed?How many terms are we likely to have?
16 / 114
![Page 17: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/17.jpg)
Hashes
Each vocabulary term is hashed into an integer, its rownumber in the array
At query time: hash query term, locate entry in fixed-widtharray
Pros: Lookup in a hash is faster than lookup in a tree.
Lookup time is constant.
Cons
no way to find minor variants (resume vs. resume)no prefix search (all terms starting with automat)need to rehash everything periodically if vocabulary keepsgrowing
17 / 114
![Page 18: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/18.jpg)
Trees
Trees solve the prefix problem (find all terms starting withautomat).
Simplest tree: binary tree
Search is slightly slower than in hashes: O(logM), where M isthe size of the vocabulary.
O(logM) only holds for balanced trees.
Rebalancing binary trees is expensive.
B-trees mitigate the rebalancing problem.
B-tree definition: every internal node has a number ofchildren in the interval [a, b] where a, b are appropriatepositive integers, e.g., [2, 4].
18 / 114
![Page 19: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/19.jpg)
Binary tree
19 / 114
![Page 20: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/20.jpg)
B-tree
20 / 114
![Page 21: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/21.jpg)
Outline
1 Recap
2 Dictionaries
3 Wildcard queries
4 Edit distance
5 Spelling correction
6 Soundex
21 / 114
![Page 22: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/22.jpg)
Wildcard queries
mon*: find all docs containing any term beginning with mon
Easy with B-tree dictionary: retrieve all terms t in the range:mon ≤ t < moo
*mon: find all docs containing any term ending with mon
Maintain an additional tree for terms backwardsThen retrieve all terms t in the range: nom ≤ t < non
Result: A set of terms that are matches for wildcard query
Then retrieve documents that contain any of these terms
22 / 114
![Page 23: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/23.jpg)
How to handle * in the middle of a term
Example: m*nchen
We could look up m* and *nchen in the B-tree and intersectthe two term sets.
Expensive
Alternative: permuterm index
Basic idea: Rotate every wildcard query, so that the * occursat the end.
Store each of these rotations in the dictionary, say, in a B-tree
23 / 114
![Page 24: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/24.jpg)
Permuterm index
For term hello: add hello$, ello$h, llo$he, lo$hel, o$hell, and$hello to the B-tree where $ is a special symbol
24 / 114
![Page 25: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/25.jpg)
Permuterm → term mapping
25 / 114
![Page 26: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/26.jpg)
Permuterm index
For hello, we’ve stored: hello$, ello$h, llo$he, lo$hel, o$hell,$hello
Queries
For X, look up X$For X*, look up $X*For *X, look up X$*For *X*, look up X*For X*Y, look up Y$X*Example: For hel*o, look up o$hel*
Permuterm index would better be called a permuterm tree.
But permuterm index is the more common name.
26 / 114
![Page 27: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/27.jpg)
Processing a lookup in the permuterm index
Rotate query wildcard to the right
Use B-tree lookup as before
Problem: Permuterm more than quadruples the size of thedictionary compared to a regular B-tree. (empirical number)
27 / 114
![Page 28: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/28.jpg)
k-gram indexes
More space-efficient than permuterm index
Enumerate all character k-grams (sequence of k characters)occurring in a term
2-grams are called bigrams.
Example: from April is the cruelest month we get the bigrams:$a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ue el le es st t$ $mmo on nt h$
$ is a special word boundary symbol, as before.
Maintain an inverted index from bigrams to the terms thatcontain the bigram
28 / 114
![Page 29: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/29.jpg)
Postings list in a 3-gram inverted index
etr beetroot metric petrify retrieval✲ ✲ ✲ ✲
29 / 114
![Page 30: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/30.jpg)
k-gram (bigram, trigram, . . . ) indexes
Note that we now have two different types of inverted indexes
The term-document inverted index for finding documentsbased on a query consisting of terms
The k-gram index for finding terms based on a queryconsisting of k-grams
30 / 114
![Page 31: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/31.jpg)
Processing wildcarded terms in a bigram index
Query mon* can now be run as:$m and mo and on
Gets us all terms with the prefix mon . . .
. . . but also many “false positives” like moon.
We must postfilter these terms against query.
Surviving terms are then looked up in the term-documentinverted index.
k-gram index vs. permuterm index
k-gram index is more space efficient.Permuterm index doesn’t require postfiltering.
31 / 114
![Page 32: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/32.jpg)
Exercise
Google has very limited support for wildcard queries.
For example, this query doesn’t work very well on Google:[gen* universit*]
Intention: you are looking for the University of Geneva, butdon’t know which accents to use for the French words foruniversity and Geneva.
According to Google search basics, 2010-04-29: “Note thatthe * operator works only on whole words, not parts of words.”
But this is not entirely true. Try [pythag*] and [m*nchen]
Exercise: Why doesn’t Google fully support wildcard queries?
32 / 114
![Page 33: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/33.jpg)
Processing wildcard queries in the term-document index
Problem 1: we must potentially execute a large number ofBoolean queries.
Most straightforward semantics: Conjunction of disjunctions
For [gen* universit*]: genevauniversity orgenevauniversiteorgeneveuniversity orgeneveuniversite orgeneraluniversitiesor . . .
Very expensive
Problem 2: Users hate to type.
If abbreviated queries like [pyth* theo*] for [pythagoras’theorem] are allowed, users will use them a lot.
This would significantly increase the cost of answering queries.
Somewhat alleviated by Google Suggest
33 / 114
![Page 34: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/34.jpg)
Outline
1 Recap
2 Dictionaries
3 Wildcard queries
4 Edit distance
5 Spelling correction
6 Soundex
34 / 114
![Page 35: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/35.jpg)
Edit distance
The edit distance between string s1 and string s2 is theminimum number of basic operations that convert s1 to s2.
Levenshtein distance: The admissible basic operations areinsert, delete, and replace
Levenshtein distance dog-do: 1
Levenshtein distance cat-cart: 1
Levenshtein distance cat-cut: 1
Levenshtein distance cat-act: 2
Damerau-Levenshtein distance cat-act: 1
Damerau-Levenshtein includes transposition as a fourthpossible operation.
35 / 114
![Page 36: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/36.jpg)
Levenshtein distance: Computation
f a s t
0 1 2 3 4
c 1 1 2 3 4
a 2 2 1 2 3
t 3 3 2 2 2
s 4 4 3 2 3
36 / 114
![Page 37: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/37.jpg)
Levenshtein distance: Algorithm
LevenshteinDistance(s1, s2)1 for i ← 0 to |s1|2 do m[i , 0] = i3 for j ← 0 to |s2|4 do m[0, j] = j5 for i ← 1 to |s1|6 do for j ← 1 to |s2|7 do if s1[i ] = s2[j]8 then m[i , j] = min{m[i -1, j]+1,m[i , j-1]+1,m[i -1, j-1]}9 else m[i , j] = min{m[i -1, j]+1,m[i , j-1]+1,m[i -1, j-1]+1}10 return m[|s1|, |s2|]Operations: insert (cost 1), delete (cost 1), replace (cost 1), copy(cost 0)
37 / 114
![Page 38: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/38.jpg)
Levenshtein distance: Algorithm
LevenshteinDistance(s1, s2)1 for i ← 0 to |s1|2 do m[i , 0] = i3 for j ← 0 to |s2|4 do m[0, j] = j5 for i ← 1 to |s1|6 do for j ← 1 to |s2|7 do if s1[i ] = s2[j]8 then m[i , j] = min{m[i -1, j]+1,m[i , j-1]+1,m[i -1, j-1]}9 else m[i , j] = min{m[i -1, j]+1,m[i , j-1]+1,m[i -1, j-1]+1}10 return m[|s1|, |s2|]Operations: insert (cost 1), delete (cost 1), replace (cost 1), copy(cost 0)
38 / 114
![Page 39: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/39.jpg)
Levenshtein distance: Algorithm
LevenshteinDistance(s1, s2)1 for i ← 0 to |s1|2 do m[i , 0] = i3 for j ← 0 to |s2|4 do m[0, j] = j5 for i ← 1 to |s1|6 do for j ← 1 to |s2|7 do if s1[i ] = s2[j]8 then m[i , j] = min{m[i -1, j]+1,m[i , j-1]+1,m[i -1, j-1]}9 else m[i , j] = min{m[i -1, j]+1,m[i , j-1]+1,m[i -1, j-1]+1}10 return m[|s1|, |s2|]Operations: insert (cost 1), delete (cost 1), replace (cost 1), copy(cost 0)
39 / 114
![Page 40: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/40.jpg)
Levenshtein distance: Algorithm
LevenshteinDistance(s1, s2)1 for i ← 0 to |s1|2 do m[i , 0] = i3 for j ← 0 to |s2|4 do m[0, j] = j5 for i ← 1 to |s1|6 do for j ← 1 to |s2|7 do if s1[i ] = s2[j]8 then m[i , j] = min{m[i -1, j]+1,m[i , j-1]+1,m[i -1, j-1]}9 else m[i , j] = min{m[i -1, j]+1,m[i , j-1]+1,m[i -1, j-1]+1}10 return m[|s1|, |s2|]Operations: insert (cost 1), delete (cost 1), replace (cost 1), copy(cost 0)
40 / 114
![Page 41: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/41.jpg)
Levenshtein distance: Algorithm
LevenshteinDistance(s1, s2)1 for i ← 0 to |s1|2 do m[i , 0] = i3 for j ← 0 to |s2|4 do m[0, j] = j5 for i ← 1 to |s1|6 do for j ← 1 to |s2|7 do if s1[i ] = s2[j]8 then m[i , j] = min{m[i -1, j]+1,m[i , j-1]+1,m[i -1, j-1]}9 else m[i , j] = min{m[i -1, j]+1,m[i , j-1]+1,m[i -1, j-1]+1}10 return m[|s1|, |s2|]Operations: insert (cost 1), delete (cost 1), replace (cost 1), copy(cost 0)
41 / 114
![Page 42: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/42.jpg)
Levenshtein distance: Example
f a s t
0 1 1 2 2 3 3 4 4
c1
1
1 2
2 1
2 3
2 2
3 4
3 3
4 5
4 4
a2
2
2 2
3 2
1 3
3 1
3 4
2 2
4 5
3 3
t3
3
3 3
4 3
3 2
4 2
2 3
3 2
2 4
3 2
s4
4
4 4
5 4
4 3
5 3
2 3
4 2
3 3
3 3
42 / 114
![Page 43: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/43.jpg)
Each cell of Levenshtein matrix
cost of getting here frommy upper left neighbor(copy or replace)
cost of getting herefrom my upper neighbor(delete)
cost of getting here frommy left neighbor (insert)
the minimum of thethree possible “move-ments”; the cheapestway of getting here
43 / 114
![Page 44: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/44.jpg)
Levenshtein distance: Example
f a s t
0 1 1 2 2 3 3 4 4
c1
1
1 2
2 1
2 3
2 2
3 4
3 3
4 5
4 4
a2
2
2 2
3 2
1 3
3 1
3 4
2 2
4 5
3 3
t3
3
3 3
4 3
3 2
4 2
2 3
3 2
2 4
3 2
s4
4
4 4
5 4
4 3
5 3
2 3
4 2
3 3
3 3
44 / 114
![Page 45: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/45.jpg)
Dynamic programming (Cormen et al.)
Optimal substructure: The optimal solution to the problemcontains within it subsolutions, i.e., optimal solutions tosubproblems.
Overlapping subsolutions: The subsolutions overlap. Thesesubsolutions are computed over and over again whencomputing the global optimal solution in a brute-forcealgorithm.
Subproblem in the case of edit distance: what is the editdistance of two prefixes
Overlapping subsolutions: We need most distances of prefixes3 times – this corresponds to moving right, diagonally, down.
45 / 114
![Page 46: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/46.jpg)
Weighted edit distance
As above, but weight of an operation depends on thecharacters involved.
Meant to capture keyboard errors, e.g., m more likely to bemistyped as n than as q.
Therefore, replacing m by n is a smaller edit distance than byq.
We now require a weight matrix as input.
Modify dynamic programming to handle weights
46 / 114
![Page 47: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/47.jpg)
Using edit distance for spelling correction
Given query, first enumerate all character sequences within apreset (possibly weighted) edit distance
Intersect this set with our list of “correct” words
Then suggest terms in the intersection to the user.
→ exercise in a few slides
47 / 114
![Page 48: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/48.jpg)
Exercise
1 Compute Levenshtein distance matrix for oslo – snow
2 What are the Levenshtein editing operations that transformcat into catcat?
48 / 114
![Page 49: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/49.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
s2
2
l3
3
o4
4
49 / 114
![Page 50: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/50.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 ?
s2
2
l3
3
o4
4
50 / 114
![Page 51: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/51.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
s2
2
l3
3
o4
4
51 / 114
![Page 52: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/52.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 ?
s2
2
l3
3
o4
4
52 / 114
![Page 53: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/53.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
s2
2
l3
3
o4
4
53 / 114
![Page 54: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/54.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 ?
s2
2
l3
3
o4
4
54 / 114
![Page 55: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/55.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
s2
2
l3
3
o4
4
55 / 114
![Page 56: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/56.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 ?
s2
2
l3
3
o4
4
56 / 114
![Page 57: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/57.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
l3
3
o4
4
57 / 114
![Page 58: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/58.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 ?
l3
3
o4
4
58 / 114
![Page 59: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/59.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
l3
3
o4
4
59 / 114
![Page 60: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/60.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 ?
l3
3
o4
4
60 / 114
![Page 61: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/61.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
l3
3
o4
4
61 / 114
![Page 62: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/62.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 ?
l3
3
o4
4
62 / 114
![Page 63: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/63.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 3
l3
3
o4
4
63 / 114
![Page 64: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/64.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 3
3 4
4 ?
l3
3
o4
4
64 / 114
![Page 65: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/65.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 3
3 4
4 3
l3
3
o4
4
65 / 114
![Page 66: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/66.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 3
3 4
4 3
l3
3
3 2
4 ?
o4
4
66 / 114
![Page 67: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/67.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 3
3 4
4 3
l3
3
3 2
4 2
o4
4
67 / 114
![Page 68: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/68.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 3
3 4
4 3
l3
3
3 2
4 2
2 3
3 ?
o4
4
68 / 114
![Page 69: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/69.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 3
3 4
4 3
l3
3
3 2
4 2
2 3
3 2
o4
4
69 / 114
![Page 70: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/70.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 3
3 4
4 3
l3
3
3 2
4 2
2 3
3 2
3 4
3 ?
o4
4
70 / 114
![Page 71: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/71.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 3
3 4
4 3
l3
3
3 2
4 2
2 3
3 2
3 4
3 3
o4
4
71 / 114
![Page 72: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/72.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 3
3 4
4 3
l3
3
3 2
4 2
2 3
3 2
3 4
3 3
4 4
4 ?
o4
4
72 / 114
![Page 73: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/73.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 3
3 4
4 3
l3
3
3 2
4 2
2 3
3 2
3 4
3 3
4 4
4 4
o4
4
73 / 114
![Page 74: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/74.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 3
3 4
4 3
l3
3
3 2
4 2
2 3
3 2
3 4
3 3
4 4
4 4
o4
4
4 3
5 ?
74 / 114
![Page 75: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/75.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 3
3 4
4 3
l3
3
3 2
4 2
2 3
3 2
3 4
3 3
4 4
4 4
o4
4
4 3
5 3
75 / 114
![Page 76: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/76.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 3
3 4
4 3
l3
3
3 2
4 2
2 3
3 2
3 4
3 3
4 4
4 4
o4
4
4 3
5 3
3 3
4 ?
76 / 114
![Page 77: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/77.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 3
3 4
4 3
l3
3
3 2
4 2
2 3
3 2
3 4
3 3
4 4
4 4
o4
4
4 3
5 3
3 3
4 3
77 / 114
![Page 78: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/78.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 3
3 4
4 3
l3
3
3 2
4 2
2 3
3 2
3 4
3 3
4 4
4 4
o4
4
4 3
5 3
3 3
4 3
2 4
4 ?
78 / 114
![Page 79: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/79.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 3
3 4
4 3
l3
3
3 2
4 2
2 3
3 2
3 4
3 3
4 4
4 4
o4
4
4 3
5 3
3 3
4 3
2 4
4 2
79 / 114
![Page 80: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/80.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 3
3 4
4 3
l3
3
3 2
4 2
2 3
3 2
3 4
3 3
4 4
4 4
o4
4
4 3
5 3
3 3
4 3
2 4
4 2
4 5
3 ?
80 / 114
![Page 81: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/81.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 3
3 4
4 3
l3
3
3 2
4 2
2 3
3 2
3 4
3 3
4 4
4 4
o4
4
4 3
5 3
3 3
4 3
2 4
4 2
4 5
3 3
81 / 114
![Page 82: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/82.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 3
3 4
4 3
l3
3
3 2
4 2
2 3
3 2
3 4
3 3
4 4
4 4
o4
4
4 3
5 3
3 3
4 3
2 4
4 2
4 5
3 3
82 / 114
![Page 83: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/83.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 3
3 4
4 3
l3
3
3 2
4 2
2 3
3 2
3 4
3 3
4 4
4 4
o4
4
4 3
5 3
3 3
4 3
2 4
4 2
4 5
3 3
How do
I read out the editing operations that transform oslo into snow?
83 / 114
![Page 84: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/84.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 3
3 4
4 3
l3
3
3 2
4 2
2 3
3 2
3 4
3 3
4 4
4 4
o4
4
4 3
5 3
3 3
4 3
2 4
4 2
4 5
3 3
cost operation input output
1 insert * w
84 / 114
![Page 85: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/85.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 3
3 4
4 3
l3
3
3 2
4 2
2 3
3 2
3 4
3 3
4 4
4 4
o4
4
4 3
5 3
3 3
4 3
2 4
4 2
4 5
3 3
cost operation input output
0 (copy) o o
1 insert * w
85 / 114
![Page 86: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/86.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 3
3 4
4 3
l3
3
3 2
4 2
2 3
3 2
3 4
3 3
4 4
4 4
o4
4
4 3
5 3
3 3
4 3
2 4
4 2
4 5
3 3
cost operation input output
1 replace l n
0 (copy) o o
1 insert * w
86 / 114
![Page 87: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/87.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 3
3 4
4 3
l3
3
3 2
4 2
2 3
3 2
3 4
3 3
4 4
4 4
o4
4
4 3
5 3
3 3
4 3
2 4
4 2
4 5
3 3
cost operation input output
0 (copy) s s
1 replace l n
0 (copy) o o
1 insert * w
87 / 114
![Page 88: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/88.jpg)
s n o w
0 1 1 2 2 3 3 4 4
o1
1
1 2
2 1
2 3
2 2
2 4
3 2
4 5
3 3
s2
2
1 2
3 1
2 3
2 2
3 3
3 3
3 4
4 3
l3
3
3 2
4 2
2 3
3 2
3 4
3 3
4 4
4 4
o4
4
4 3
5 3
3 3
4 3
2 4
4 2
4 5
3 3
cost operation input output
1 delete o *
0 (copy) s s
1 replace l n
0 (copy) o o
1 insert * w
88 / 114
![Page 89: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/89.jpg)
c a t c a t
0 1 1 2 2 3 3 4 4 5 5 6 6
c1
1
0 2
2 0
2 3
1 1
3 4
2 2
3 5
3 3
5 6
4 4
6 7
5 5
a2
2
2 1
3 1
0 2
2 0
2 3
1 1
3 4
2 2
3 5
3 3
5 6
4 4
t3
3
3 2
4 2
2 1
3 1
0 2
2 0
2 3
1 1
3 4
2 2
3 5
3 3
89 / 114
![Page 90: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/90.jpg)
c a t c a t
0 1 1 2 2 3 3 4 4 5 5 6 6
c1
1
0 2
2 0
2 3
1 1
3 4
2 2
3 5
3 3
5 6
4 4
6 7
5 5
a2
2
2 1
3 1
0 2
2 0
2 3
1 1
3 4
2 2
3 5
3 3
5 6
4 4
t3
3
3 2
4 2
2 1
3 1
0 2
2 0
2 3
1 1
3 4
2 2
3 5
3 3
cost operation input output
1 insert * c
1 insert * a
1 insert * t
0 (copy) c c
0 (copy) a a
0 (copy) t t
90 / 114
![Page 91: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/91.jpg)
c a t c a t
0 1 1 2 2 3 3 4 4 5 5 6 6
c1
1
0 2
2 0
2 3
1 1
3 4
2 2
3 5
3 3
5 6
4 4
6 7
5 5
a2
2
2 1
3 1
0 2
2 0
2 3
1 1
3 4
2 2
3 5
3 3
5 6
4 4
t3
3
3 2
4 2
2 1
3 1
0 2
2 0
2 3
1 1
3 4
2 2
3 5
3 3
cost operation input output
0 (copy) c c
1 insert * a
1 insert * t
1 insert * c
0 (copy) a a
0 (copy) t t
91 / 114
![Page 92: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/92.jpg)
c a t c a t
0 1 1 2 2 3 3 4 4 5 5 6 6
c1
1
0 2
2 0
2 3
1 1
3 4
2 2
3 5
3 3
5 6
4 4
6 7
5 5
a2
2
2 1
3 1
0 2
2 0
2 3
1 1
3 4
2 2
3 5
3 3
5 6
4 4
t3
3
3 2
4 2
2 1
3 1
0 2
2 0
2 3
1 1
3 4
2 2
3 5
3 3
cost operation input output
0 (copy) c c
0 (copy) a a
1 insert * t
1 insert * c
1 insert * a
0 (copy) t t
92 / 114
![Page 93: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/93.jpg)
c a t c a t
0 1 1 2 2 3 3 4 4 5 5 6 6
c1
1
0 2
2 0
2 3
1 1
3 4
2 2
3 5
3 3
5 6
4 4
6 7
5 5
a2
2
2 1
3 1
0 2
2 0
2 3
1 1
3 4
2 2
3 5
3 3
5 6
4 4
t3
3
3 2
4 2
2 1
3 1
0 2
2 0
2 3
1 1
3 4
2 2
3 5
3 3
cost operation input output
0 (copy) c c
0 (copy) a a
0 (copy) t t
1 insert * c
1 insert * a
1 insert * t
93 / 114
![Page 94: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/94.jpg)
Outline
1 Recap
2 Dictionaries
3 Wildcard queries
4 Edit distance
5 Spelling correction
6 Soundex
94 / 114
![Page 95: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/95.jpg)
Spelling correction
Two principal uses
Correcting documents being indexedCorrecting user queries
Two different methods for spelling correction
Isolated word spelling correction
Check each word on its own for misspellingWill not catch typos resulting in correctly spelled words, e.g.,an asteroid that fell form the sky
Context-sensitive spelling correction
Look at surrounding wordsCan correct form/from error above
95 / 114
![Page 96: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/96.jpg)
Correcting documents
We’re not interested in interactive spelling correction ofdocuments (e.g., MS Word) in this class.
In IR, we use document correction primarily for OCR’eddocuments. (OCR = optical character recognition)
The general philosophy in IR is: don’t change the documents.
96 / 114
![Page 97: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/97.jpg)
Correcting queries
First: isolated word spelling correction
Premise 1: There is a list of “correct words” from which thecorrect spellings come.
Premise 2: We have a way of computing the distance betweena misspelled word and a correct word.
Simple spelling correction algorithm: return the “correct”word that has the smallest distance to the misspelled word.
Example: informaton → information
For the list of correct words, we can use the vocabulary of allwords that occur in our collection.
Why is this problematic?
97 / 114
![Page 98: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/98.jpg)
Alternatives to using the term vocabulary
A standard dictionary (Webster’s, OED etc.)
An industry-specific dictionary (for specialized IR systems)
The term vocabulary of the collection, appropriately weighted
98 / 114
![Page 99: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/99.jpg)
Distance between misspelled word and “correct” word
Several alternatives
Edit distance and Levenshtein distance
Weighted edit distance
k-gram overlap
99 / 114
![Page 100: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/100.jpg)
Spelling correction
Now that we can compute edit distance: how to use it forisolated word spelling correction – this is the last slide in thissection.
k-gram indexes for isolated word spelling correction.
Context-sensitive spelling correction
General issues
100 / 114
![Page 101: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/101.jpg)
k-gram indexes for spelling correction
Enumerate all k-grams in the query term
Example: bigram index, misspelled word bordroom
Bigrams: bo, or, rd, dr, ro, oo, om
Use the k-gram index to retrieve “correct” words that matchquery term k-grams
Threshold by number of matching k-grams
E.g., only vocabulary terms that differ by at most 3 k-grams
101 / 114
![Page 102: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/102.jpg)
k-gram indexes for spelling correction: bordroom
rd aboard ardent boardroom border
or border lord morbid sordid
bo aboard about boardroom border
✲ ✲ ✲ ✲
✲ ✲ ✲ ✲
✲ ✲ ✲ ✲
102 / 114
![Page 103: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/103.jpg)
Context-sensitive spelling correction
Our example was: an asteroid that fell form the sky
How can we correct form here?
One idea: hit-based spelling correction
Retrieve “correct” terms close to each query termfor flew form munich: flea for flew, from for form, munch formunichNow try all possible resulting phrases as queries with one word“fixed” at a timeTry query “flea form munich”Try query “flew from munich”Try query “flew form munch”The correct query “flew from munich” has the most hits.
Suppose we have 7 alternatives for flew, 20 for form and 3 formunich, how many “corrected” phrases will we enumerate?
103 / 114
![Page 104: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/104.jpg)
Context-sensitive spelling correction
The “hit-based” algorithm we just outlined is not veryefficient.
More efficient alternative: look at “collection” of queries, notdocuments
104 / 114
![Page 105: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/105.jpg)
General issues in spelling correction
User interface
automatic vs. suggested correctionDid you mean only works for one suggestion.What about multiple possible corrections?Tradeoff: simple vs. powerful UI
Cost
Spelling correction is potentially expensive.Avoid running on every query?Maybe just on queries that match few documents.Guess: Spelling correction of major search engines is efficientenough to be run on every query.
105 / 114
![Page 106: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/106.jpg)
Exercise: Understand Peter Norvig’s spelling corrector
import re, collections
def words(text): return re.findall(’[a-z]+’, text.lower())
def train(features):
model = collections.defaultdict(lambda: 1)
for f in features:
model[f] += 1
return model
NWORDS = train(words(file(’big.txt’).read()))
alphabet = ’abcdefghijklmnopqrstuvwxyz’
def edits1(word):
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
deletes = [a + b[1:] for a, b in splits if b]
transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) gt 1]
replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
inserts = [a + c + b for a, b in splits for c in alphabet]
return set(deletes + transposes + replaces + inserts)
def known_edits2(word):
return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
def known(words): return set(w for w in words if w in NWORDS)
def correct(word):
candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
return max(candidates, key=NWORDS.get)
106 / 114
![Page 107: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/107.jpg)
Outline
1 Recap
2 Dictionaries
3 Wildcard queries
4 Edit distance
5 Spelling correction
6 Soundex
107 / 114
![Page 108: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/108.jpg)
Soundex
Soundex is the basis for finding phonetic (as opposed toorthographic) alternatives.
Example: chebyshev / tchebyscheff
Algorithm:
Turn every token to be indexed into a 4-character reduced formDo the same with query termsBuild and search an index on the reduced forms
108 / 114
![Page 109: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/109.jpg)
Soundex algorithm
1 Retain the first letter of the term.2 Change all occurrences of the following letters to ’0’ (zero): A, E, I,
O, U, H, W, Y3 Change letters to digits as follows:
B, F, P, V to 1C, G, J, K, Q, S, X, Z to 2D,T to 3L to 4M, N to 5R to 6
4 Repeatedly remove one out of each pair of consecutive identical digits5 Remove all zeros from the resulting string; pad the resulting string
with trailing zeros and return the first four positions, which willconsist of a letter followed by three digits
109 / 114
![Page 110: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/110.jpg)
Example: Soundex of HERMAN
Retain H
ERMAN → 0RM0N
0RM0N → 06505
06505 → 06505
06505 → 655
Return H655
Note: HERMANN will generate the same code
110 / 114
![Page 111: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/111.jpg)
How useful is Soundex?
Not very – for information retrieval
Ok for “high recall” tasks in other applications (e.g., Interpol)
Zobel and Dart (1996) suggest better alternatives for phoneticmatching in IR.
111 / 114
![Page 112: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/112.jpg)
Exercise
Compute Soundex code of your last name
112 / 114
![Page 113: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/113.jpg)
Take-away
Tolerant retrieval: What to do if there is no exact matchbetween query term and document term
Wildcard queries
Spelling correction
113 / 114
![Page 114: Introduction to Information Retrieval ` `%%%`#`&12 ` ~~~ alse …hs/teach/14s/ir/pdf/03dict.flat.pdf · Overview 1 Recap 2 Dictionaries 3 Wildcard queries 4 Edit distance 5 Spelling](https://reader034.fdocuments.us/reader034/viewer/2022042221/5ec7c4dbc2f4ec1c153284b2/html5/thumbnails/114.jpg)
Resources
Chapter 3 of IIR
Resources at http://cislmu.org
trie vs hash vs ternary treeSoundex demoEdit distance demoPeter Norvig’s spelling correctorGoogle: wild card search, spelling correction gone wrong, amisspelling that is more frequent that the correct spelling
114 / 114