Speech & NLP
www.vkedco.blogspot.com
Spellchecking
N-Gram Analysis, Error Detection, Word Boundary Problem, Error
Correction in Isolated Words
Vladimir Kulyukin
Outline
● N-Gram Analysis
● Error Detection & Correction
● Word Boundary Problem
● Spellchecking Nonwords & Isolated Words
N-Gram Analysis
N-Gram Tables (NGTs)
● The simplest form of NGT is a bigram NGT
● The dimensions of bigram NGTs depend on the size
of the alphabet
● For the standard English alphabet of 26 letters, the
dimensions of the bigram NGT are 26 x 26
● For larger character sets (e.g., Unicode), the table is
considerably larger
N-Gram Tables: N-Gram Weights
● There are multiple strategies for assigning weights
to N-Grams in an NGT
● The simplest form is binary: an N-Gram is either
present or absent
● Another common form is the frequency of an
N-Gram in the corpus
● More sophisticated weights are tf*idf or weights
based on some probability distribution
Example: Bigram NGT for the Alphabet {A, B, C}
        A     B     C
  A     0    20    11
  B    50    12   125
  C     0    78    97

● Assume that there is some imaginary corpus where each wordform is a string over {A, B, C}
● Suppose that the weight of each bigram is equal to its frequency in the corpus
● The above NGT records the following bigram frequencies:
  – AA occurs 0 times; AB occurs 20 times; AC occurs 11 times
  – BA occurs 50 times; BB occurs 12 times; BC occurs 125 times
  – CA occurs 0 times; CB occurs 78 times; CC occurs 97 times
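A frequency-weighted bigram NGT like the one above can be sketched in a few lines of Python; the three-word corpus below is an illustrative stand-in:

from collections import Counter

def bigram_ngt(corpus, alphabet):
    """Frequency-weighted bigram NGT as a nested dict
    (row letter -> column letter -> count)."""
    counts = Counter()
    for wordform in corpus:
        for i in range(len(wordform) - 1):
            counts[wordform[i:i + 2]] += 1
    return {a: {b: counts[a + b] for b in alphabet} for a in alphabet}

# a tiny imaginary corpus over the alphabet {A, B, C}
ngt = bigram_ngt(["ABC", "BCA", "BCC"], "ABC")
print(ngt["B"]["C"])  # frequency of the bigram BC -> 3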
Trigram NGTs
● A trigram NGT is a 3D array (a cube)
● The dimensions of the cube depend on the size of
the alphabet
● Like a bigram NGT, a trigram NGT records weights
of every possible trigram
● Like a bigram NGT, a trigram NGT can be binary
(i.e., recording the presence/absence of a trigram
in the corpus) or real-valued (i.e., recording real
weights)
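A minimal sketch of a trigram NGT; a sparse map keyed by trigram strings is assumed here instead of a dense 3D array:

from collections import Counter

def trigram_ngt(corpus, binary=False):
    """Trigram NGT kept as a sparse map: trigram -> weight."""
    counts = Counter()
    for wordform in corpus:
        for i in range(len(wordform) - 2):
            counts[wordform[i:i + 3]] += 1
    # a binary NGT records only the presence/absence of a trigram
    return {t: 1 for t in counts} if binary else dict(counts)

print(trigram_ngt(["ABCCA", "ABCAB"]))
# {'ABC': 2, 'BCC': 1, 'CCA': 1, 'BCA': 1, 'CAB': 1}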
Nonpositional vs. Positional NGTs
● If an NGT records only weights (numeric or
symbolic) in its entries it is called nonpositional
● Nonpositional is used in the sense that the
recorded weight does not take into account where
in the string a specific n-gram occurs
● Positional NGTs typically add the start and the
end indices to each weight that record where a
specific n-gram occurs in some string
Example: Nonpositional vs. Positional Bigrams
● We have two strings S1 and S2 in the corpus:
  – S1 = ABCCABCCAB (character positions 0 through 9)
  – S2 = ABCCBAA (character positions 0 through 6)
● Bigrams for S1: AB [0,1], BC [1,2], CC [2,3], CA [3,4], AB [4,5], BC [5,6], CC [6,7], CA [7,8], AB [8,9]
● Bigrams for S2: AB [0,1], BC [1,2], CC [2,3], CB [3,4], BA [4,5], AA [5,6]

Non-positional Bigram Table:

        A     B     C
  A     1     4     0
  B     1     0     3
  C     2     1     3

Positional Bigram Table:

        A                   B                           C
  A     [5,6]-1             [0,1]-2; [4,5]-1; [8,9]-1   NULL
  B     [4,5]-1             NULL                        [1,2]-2; [5,6]-1
  C     [3,4]-1; [7,8]-1    [3,4]-1                     [2,3]-2; [6,7]-1
Example: Positional NGT Table Entry
● Here is how we can read the sample entry for AB (row A,
column B) in the positional bigram NGT on the previous slide:
  – AB occurs 2 times in position [0, 1] in the corpus;
  – AB occurs once in position [4, 5] in the corpus;
  – AB occurs once in position [8, 9] in the corpus.

Entry at row A, column B: [0, 1] - 2; [4, 5] - 1; [8, 9] - 1
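A positional bigram NGT can be sketched as a map from each bigram to a counter over its (start, end) positions; running it on S1 and S2 from the earlier slide reproduces the AB entry:

from collections import Counter, defaultdict

def positional_bigram_ngt(corpus):
    """Map each bigram to a Counter over its (start, end) positions."""
    table = defaultdict(Counter)
    for wordform in corpus:
        for i in range(len(wordform) - 1):
            table[wordform[i:i + 2]][(i, i + 1)] += 1
    return table

# S1 and S2 from the earlier slide
table = positional_bigram_ngt(["ABCCABCCAB", "ABCCBAA"])
print(table["AB"])  # Counter({(0, 1): 2, (4, 5): 1, (8, 9): 1})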
Corpus-Based vs. Text-Based NGTs
● The most frequent source of N-Grams for an NGT is
a text corpus
● However, there are approaches that advocate
computing NGTs for individual texts (journal
articles, web pages, etc.)
● The longer the document, the more reliable the
NGT
● Possible misspellings are detected as infrequent or
non-existent N-Grams
Word Boundary Problem
Word Boundary Problem
● A typical definition of word boundaries is through white
space characters (tabs, newlines, spaces, etc.)
● This definition does not work too well in automated
spellchecking because of run-ons and splits:
● Examples of run-ons:
– atthe, ina, ofthe, forgetit, understandyou
● Examples of splits:
– at t he, sp end, un derstand you
Challenges of Run-ons & Splits
● A major difficulty in dealing with run-ons & splits
is that both frequently result in valid wordforms
● Examples:
– forgot, when split after r, results in two valid
wordforms: for and got, i.e., for got
– A run-on of drive- and in results in drive-in (this
type of error is frequent in OCR)
Challenges of Run-ons & Splits
● Another major challenge in dealing with run-ons &
splits is that any relaxation of the standard word
boundary definition results in combinatorial
explosion of possible strings to check
● Successful applications have been designed only
for limited domains
● Research indicates that many run-ons and splits
involve a relatively small set of high-frequency
words
Error Detection & Correction
Error Detection vs. Error Correction
● There are two main problems in automated
spellchecking: error detection & error correction
● Error detection is determining that a string is not
in the list of known strings (aka the dictionary)
● Error correction is a harder problem: a string
which is not in the dictionary must be somehow
changed to become one of the strings in the
dictionary
Nonword Error Correction
● A nonword is a wordform that is not on the list of
known wordforms
● Example:
– wordform ATER would not typically be on any list of known
English wordforms (it could be an abbreviation for a company
or a scientific term)
– suggested spelling corrections are AFTER, LATER, ALTER,
WATER, ATE
N-Gram-Based Error Detection
● Collect a representative sample of texts
● Compute the N-Gram table (NGT) for the text corpus
● Given an input string, compute its list of N-Grams
● For each N-Gram in the input string
  – check it against the NGT
  – if the input string contains N-Grams that are not in the NGT or
N-Grams that are infrequent (this is typically decided by a
threshold), mark the string as a misspelling (see the sketch below)
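Here is a minimal sketch of the detection loop; the two-word corpus and the threshold value are illustrative assumptions:

from collections import Counter

def bigrams(s):
    return [s[i:i + 2] for i in range(len(s) - 1)]

def is_misspelling(word, ngt, threshold=1):
    """Flag word if any of its bigrams is absent from the NGT
    or occurs fewer than threshold times."""
    return any(ngt[b] < threshold for b in bigrams(word))

# NGT built from a tiny illustrative corpus
ngt = Counter(b for w in ["WATER", "LATER"] for b in bigrams(w))
print(is_misspelling("ATER", ngt))  # False: all bigrams attested
print(is_misspelling("ATLR", ngt))  # True: TL and LR are unattested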
Dictionary Lookup Techniques
● Nonwords can be detected via dictionary lookup
techniques
● The challenges are 1) efficient key lookup; 2) efficient
storage of dictionaries (possibly on multiple nodes); 3)
vocabulary selection for the dictionary
● Typical data structures are
– Hash tables
– Frequency ordered binary search trees
– Tries
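A minimal trie sketch for dictionary lookup; the wordlist is a placeholder:

class TrieNode:
    """One node of a dictionary trie."""
    def __init__(self):
        self.children = {}
        self.is_word = False

def trie_insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True  # mark the end of a known wordform

def trie_lookup(root, word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_word

root = TrieNode()
for w in ["water", "later", "alter"]:
    trie_insert(root, w)
print(trie_lookup(root, "water"), trie_lookup(root, "ater"))  # True False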
Some Word Error Correction Approaches
● N-Gram techniques
● Minimum edit distances
● Key similarity
● Rule-based techniques
● Probabilistic techniques
● Neural networks
Basic Error Types
● Approximately 80% of all misspelled words contain
a single instance of one of the following four types:
– Insertion: receeive (extra e is inserted)
– Deletion: rceive (e after r is deleted)
– Substitution: reseive (s is substituted for c)
– Transposition: recieve (e and i are transposed)
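Since roughly 80% of misspellings contain a single error of one of these types, a common correction step is to generate every string one edit away from the input and keep those found in the dictionary. A minimal sketch, assuming a lowercase English alphabet:

def single_edits(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one insertion, deletion, substitution, or
    transposition away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = [l + r[1:] for l, r in splits if r]
    transpositions = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    substitutions = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    insertions = [l + c + r for l, r in splits for c in alphabet]
    return set(deletions + transpositions + substitutions + insertions)

dictionary = {"receive", "relieve"}
print(dictionary & single_edits("recieve"))
# {'receive', 'relieve'} (set order may vary)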
Minimum Edit Distance
● Minimum edit distance techniques use dynamic
programming to compute the distance between a
source string and a target string in terms of the
number of insertions, deletions, transpositions,
and substitutions
● Some techniques use the trie data structure to
compute the similarity between a source and a target
● Minimum edit distance techniques based on
dynamic programming are expensive
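A sketch of the dynamic-programming computation; this variant (often called the restricted Damerau-Levenshtein distance) covers all four basic error types:

def edit_distance(src, tgt):
    """Minimum number of insertions, deletions, substitutions, and
    transpositions turning src into tgt (dynamic programming)."""
    m, n = len(src), len(tgt)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and src[i - 1] == tgt[j - 2]
                    and src[i - 2] == tgt[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(edit_distance("recieve", "receive"))  # 1 (one transposition)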
Generic N-Gram Correction Algorithm
● Build the NGT (bigram or trigram, positional or non-
positional)
● At run time:
– Detect misspellings through the presence of infrequent or
non-existent N-Grams
– Compute the number of N-Grams common to a misspelling and
a dictionary word
– Use the number of common n-grams in a similarity measure
k(C/(n1 + n2)), where k is a constant, C is the number of
common N-Grams, and n1 and n2 are the lengths of two strings
– Another similarity measure is C/max(n1, n2)
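The second similarity measure can be sketched as follows; the slide leaves open whether n1 and n2 count characters or N-Grams, so string lengths are assumed here:

def bigram_set(s):
    return {s[i:i + 2] for i in range(len(s) - 1)}

def ngram_similarity(s1, s2):
    """C / max(n1, n2): shared bigrams over the longer string's length."""
    c = len(bigram_set(s1) & bigram_set(s2))
    return c / max(len(s1), len(s2))

# rank candidate corrections for the misspelling ATER
for w in ["WATER", "LATER", "ALTER", "ATE"]:
    print(w, ngram_similarity("ATER", w))  # WATER and LATER score highest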
Similarity Key Techniques
● The basic idea is to map every string to a key in such a way that similarly
spelled strings have identical or similar keys (the same idea is used in hash
tables)
● The SOUNDEX system was one of the first key mapping techniques; it assigns
digits to letters according to the following map:
  – A, E, I, O, U, H, W, Y → 0
  – B, F, P, V → 1
  – C, G, J, K, Q, S, X, Z → 2
  – D, T → 3
  – L → 4
  – M, N → 5
  – R → 6
● A word’s key consists of its first letter plus the digits computed according to the map above
● Zeros are eliminated and runs of the same digit are conflated
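A minimal sketch of the key computation; conflating runs before eliminating zeros is assumed, which reproduces the examples on the next slide:

import itertools

SOUNDEX_MAP = {}
for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                       ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        SOUNDEX_MAP[ch] = digit

def soundex_key(word):
    """First letter, then mapped digits with runs conflated
    and zeros eliminated."""
    word = word.upper()
    digits = "".join(SOUNDEX_MAP[c] for c in word[1:] if c in SOUNDEX_MAP)
    conflated = "".join(d for d, _ in itertools.groupby(digits))
    return word[0] + conflated.replace("0", "")

print(soundex_key("Busch"), soundex_key("QUAYLE"))  # B2 Q4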
Examples of SOUNDEX Similarity Keys
● Bush → B020 → B2
● Busch → B0220 → B2
● QUAYLE → Q00040 → Q4
● KWAIL → K0004 → K4
Generic Similarity Key Algorithm
● Compute the key for each word in the dictionary
● Sort the key-word pairs by the keys
● At run time,
– Detect a misspelled word (e.g., bigram or trigram matching)
– Compute its key
– Find the key in the sorted list
– Check the key’s word and the words of a small number of its
immediate neighbors to the left and right for possible
suggestions
– Return the list of possible suggestions
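A sketch of the run-time lookup; it reuses the soundex_key function from the earlier sketch, and the neighborhood window size is an arbitrary choice:

import bisect

def build_key_index(words):
    """Sorted (key, word) pairs; soundex_key is the function
    sketched two slides earlier."""
    return sorted((soundex_key(w), w) for w in words)

def suggest(misspelling, index, window=3):
    """Collect the words stored near the misspelling's key."""
    key = soundex_key(misspelling)
    pos = bisect.bisect_left(index, (key, ""))
    lo, hi = max(0, pos - window), min(len(index), pos + window)
    return [word for _, word in index[lo:hi]]

index = build_key_index(["bush", "busch", "bosh", "brush"])
print(suggest("buzh", index))  # ['bosh', 'busch', 'bush']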
Probabilistic Techniques
● Probabilistic techniques use two types of probabilities: transition
probabilities and confusion probabilities
● Transition probabilities are language dependent and give the
conditional probability that a given letter or sequence of
letters is followed by another letter or sequence of letters
(these are estimated with N-Grams)
● Confusion probabilities are estimates of how frequently a given
letter is mistaken for another letter
● They are source dependent (e.g., in OCR each OCR engine must
be repeatedly evaluated to estimate confusion probabilities
accurately)
Conditional Probabilities in Error Correction
● Let X be a dictionary word and Y be a misspelled string
● By Bayes' rule, we need to estimate:

  P(X|Y) = P(Y|X) P(X) / P(Y)

● P(X) is the probability of X
● P(Y) is the probability of Y
● P(Y|X) is the probability of the misspelling Y given that
X is the correct word
Conditional Probabilities in Error Correction
● Let X be a dictionary word and Y be a misspelled string
● To find the most probable correct spelling X of the
misspelled string Y, we need to maximize:
G(X|Y) = log P(Y|X) + log P(X)
● P(X) is the unigram probability of X
● Estimation of log P(Y|X) is harder but doable:
  log P(Y|X) ≈ Σ_{i=1..n} log P(Y_i|X_i)

  where n is the length of the longer word (X or Y) and
  Y_i, X_i are the characters in the i-th position
Example
● Suppose that the misspelled wordform is FTA. What is the
probability that the correct word is FAT?
● G(Y = FTA | X = FAT) =
  log P(Y_1 = F | X_1 = F) + log P(Y_2 = T | X_2 = A) +
  log P(Y_3 = A | X_3 = T) + log P(FAT)
● P(F|F), P(T|A), P(A|T) must be estimated. For example, we
can run experiments for a specific OCR engine (e.g., Tesseract)
to get these estimates.
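A worked sketch of the scoring; every probability below is a made-up placeholder, not a measured value, since real confusion estimates would come from experiments like those described above:

import math

# hypothetical confusion and unigram probabilities; real values
# would be measured for a specific OCR engine
P_confusion = {("F", "F"): 0.95, ("T", "A"): 0.02, ("A", "T"): 0.03}
P_word = {"FAT": 1e-4}

def g_score(candidate, observed):
    """G(candidate|observed) = sum_i log P(Y_i|X_i) + log P(candidate)."""
    g = math.log(P_word[candidate])
    for x, y in zip(candidate, observed):
        g += math.log(P_confusion[(y, x)])
    return g

print(g_score("FAT", "FTA"))  # higher scores mean better candidates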
References
1. K. Kukich, "Techniques for Automatically Correcting Words in
Text." ACM Computing Surveys, Vol. 24, No. 4, Dec. 1992.