Speech & NLP (Fall 2014): Spellchecking: N-Gram Analysis, Error Detection, Word Boundary Problem,...
Speech & NLP
www.vkedco.blogspot.com
Spellchecking
N-Gram Analysis, Error Detection, Word Boundary Problem, Error
Correction in Isolated Words
Vladimir Kulyukin
Outline
● N-Gram Analysis
● Error Detection & Correction
● Word Boundary Problem
● Spellchecking Nonwords & Isolated Words
N-Gram Analysis
N-Gram Tables (NGTs)
● The simplest form of NGT is a bigram NGT
● The dimensions of a bigram NGT depend on the size
of the alphabet
● For the 26 letters of the English alphabet (ignoring case)
the dimensions of the bigram NGT are 26 x 26
● For larger character sets (e.g., Unicode) the table will be
considerably larger
N-Gram Tables: N-Gram Weights
● There are multiple strategies of assigning weights
to N-Grams in an NGT
● The simplest form is binary: an N-Gram is either
present or absent
● Another common form is the frequency of an N-Gram
in the corpus
● More sophisticated weights are tf*idf or weights
based on some probability distribution
Example: Bigram NGT for the Alphabet {A, B, C}
● Assume that there is some imaginary corpus where each wordform is a string over {A, B, C}
● Suppose that the weight of each bigram is equal to its frequency in the corpus
● In the NGT below, the row is the first letter of the bigram and the column is the second letter:
     A    B    C
A    0   20   11
B   50   12  125
C    0   78   97
● The NGT records the following bigram frequencies:
– AA occurs 0 times; AB occurs 20 times; AC occurs 11 times
– BA occurs 50 times; BB occurs 12 times; BC occurs 125 times
– CA occurs 0 times; CB occurs 78 times; CC occurs 97 times
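A minimal sketch of how such a frequency-weighted bigram NGT could be built. The function name `bigram_ngt` and the three-word toy corpus are assumptions made for illustration; they are not the imaginary corpus behind the table above.

```python
from collections import Counter
from itertools import product

def bigram_ngt(corpus, alphabet):
    """Map every possible bigram over `alphabet` to its frequency in `corpus`."""
    counts = Counter()
    for word in corpus:
        for i in range(len(word) - 1):
            counts[word[i:i + 2]] += 1
    # Include bigrams that never occur, so the table has |alphabet|^2 entries.
    return {a + b: counts[a + b] for a, b in product(alphabet, repeat=2)}

toy_corpus = ["ABC", "BCC", "CBA"]  # hypothetical corpus
table = bigram_ngt(toy_corpus, "ABC")
for row in "ABC":
    print(row, [table[row + col] for col in "ABC"])
```

A binary NGT follows from the same table by replacing each count with whether it is greater than zero.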
Trigram NGTs
● A trigram NGT is a 3D array (a cube)
● The dimensions of the cube depend on the size of
the alphabet
● Like a bigram NGT, a trigram NGT records weights
of every possible trigram
● As with a bigram NGT, a trigram NGT can be binary
(i.e., recording presence/absence of a trigram in
the corpus) or real-valued (i.e., recording some real weights)
Nonpositional vs. Positional NGTs
● If an NGT records only weights (numeric or
symbolic) in its entries it is called nonpositional
● Nonpositional is used in the sense that the
recorded weight does not take into account
where in the string a specific n-gram occurs
● Positional NGTs typically attach to each weight the
start and end indices that record where a
specific n-gram occurs in some string
Example: Nonpositional vs. Positional Bigrams
● We have two strings S1 and S2 in the corpus
● Bigrams for S1: AB [0,1], BC [1,2], CC [2,3], CA [3,4], AB [4,5], BC [5,6], CC [6,7], CA [7, 8], AB [8,9]
● Bigrams for S2: AB [0,1], BC [1,2], CC [2,3], CB [3,4], BA [4,5], AA [5,6]
S1 = A B C C A B C C A B
     0 1 2 3 4 5 6 7 8 9
S2 = A B C C B A A
     0 1 2 3 4 5 6
Non-positional Bigram Table (row = first letter, column = second letter):
     A   B   C
A    1   4   0
B    1   0   3
C    2   1   3
Positional Bigram Table (each entry lists [start,end] - count):
     A                   B                            C
A    [5,6]-1             [0,1]-2; [4,5]-1; [8,9]-1    NULL
B    [4,5]-1             NULL                         [1,2]-2; [5,6]-1
C    [3,4]-1; [7,8]-1    [3,4]-1                      [2,3]-2; [6,7]-1
Example: Positional NGT Table Entry
● Here is how we can read the sample entry for AB in
the positional bigram NGT on the previous slide:
– AB occurs 2 times in position [0, 1] in the corpus;
– AB occurs once in position [4, 5] in the corpus;
– AB occurs once in position [8, 9] in the corpus.
Entry at row A, column B: [0,1] - 2; [4,5] - 1; [8,9] - 1
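A short sketch of how the positional table could be computed for the two example strings S1 = ABCCABCCAB and S2 = ABCCBAA. The function name `positional_bigrams` and the choice of a dictionary of counters are assumptions, not something the slides prescribe.

```python
from collections import Counter, defaultdict

def positional_bigrams(corpus):
    """Map each bigram to a Counter of (start, end) positions across the corpus."""
    table = defaultdict(Counter)
    for s in corpus:
        for i in range(len(s) - 1):
            table[s[i:i + 2]][(i, i + 1)] += 1
    return table

S1, S2 = "ABCCABCCAB", "ABCCBAA"
pos = positional_bigrams([S1, S2])

# Summing over positions recovers the nonpositional weights.
nonpos = {bigram: sum(counts.values()) for bigram, counts in pos.items()}

print(dict(pos["AB"]))  # {(0, 1): 2, (4, 5): 1, (8, 9): 1}
print(nonpos["AB"])     # 4
```

Bigrams that never occur (AC and BB here) are simply absent from the dictionary, which plays the role of the NULL entries.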
Corpus-Based vs. Text-Based NGTs
● The most frequent source of N-Grams for an NGT is a
text corpus
● However, there are approaches that advocate
computing NGTs for individual texts (journal
articles, web pages, etc.)
● The longer the document the more reliable the
NGT
● Possible misspellings are detected as infrequent or
non-existent N-Grams
Word Boundary Problem
Word Boundary Problem
● A typical definition of word boundaries is through white
space characters (tabs, newlines, spaces, etc.)
● This definition does not work too well in automated
spellchecking because of run-ons and splits:
● Examples of run-ons:
– atthe, ina, ofthe, forgetit, understandyou
● Examples of splits:
– at t he, sp end, un derstand you
Challenges of Run-ons & Splits
● A major difficulty in dealing with run-ons & splits
is that both frequently result in valid wordforms
● Examples:
– forgot, when split after r, results in two valid
wordforms: for and got, i.e., for got
– A run-on of drive- and in results in drive-in (this
type of error is frequent in OCR)
Challenges of Run-ons & Splits
● Another major challenge in dealing with run-ons &
splits is that any relaxation of the standard word
boundary definition results in combinatorial
explosion of possible strings to check
● Successful applications have been designed only
for limited domains
● Research indicates that many run-ons and splits
involve a relatively small set of high-frequency
words
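A minimal sketch of run-on handling, assuming the simplest possible relaxation: try every split of a token into exactly two known words. The function name `split_run_on` and the small `lexicon` are hypothetical.

```python
def split_run_on(token, lexicon):
    """Return every way to split `token` into two words that are both in `lexicon`."""
    return [(token[:i], token[i:])
            for i in range(1, len(token))
            if token[:i] in lexicon and token[i:] in lexicon]

# Hypothetical lexicon of a few high-frequency words.
lexicon = {"at", "the", "in", "a", "of", "for", "got", "forgot", "get", "it", "forget"}

print(split_run_on("atthe", lexicon))   # [('at', 'the')]
print(split_run_on("forgot", lexicon))  # [('for', 'got')]
```

The second call shows the difficulty noted above: forgot is itself a valid wordform, so a checker cannot tell from the lexicon alone whether it is a run-on of for and got. Allowing more than one split point per token grows the number of candidates quickly, which is the combinatorial explosion the slide refers to.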
Error Detection & Correction
Error Detection vs. Error Correction
● There are two main problems in automated
spellchecking: error detection & error correction
● Error detection is finding when a string is not in
the list of known strings (aka dictionary)
● Error correction is a harder problem: a string
which is not in the dictionary must be somehow
changed to become one of the strings in the
dictionary
Nonword Error Correction
● A nonword is a wordform that is not on the list of
known wordforms
● Example:
– wordform ATER would not typically be on any list of known
English wordforms (although it could be an abbreviation for a
company or a scientific term)
– suggested spelling corrections are AFTER, LATER, ALTER,
WATER, ATE
N-Gram-Based Error Detection
● Collect a representative sample of texts
● Compute the N-Gram table (NGT) for the text corpus
● Given an input string, compute its list of N-Grams
● For each N-Gram in the input string
– check it against the NGT
– if the input string contains N-Grams that are not in the NGT or
contains N-Grams that are infrequent (this is typically decided by a
threshold), mark it as a misspelling
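The steps above can be sketched as follows, assuming a nonpositional bigram NGT with frequency weights. The names `build_ngt` and `is_suspect`, the five-word corpus, and the default threshold of 1 are all illustrative assumptions.

```python
from collections import Counter

def bigrams(word):
    return [word[i:i + 2] for i in range(len(word) - 1)]

def build_ngt(corpus):
    """Nonpositional bigram NGT: bigram -> frequency in the corpus."""
    return Counter(bg for word in corpus for bg in bigrams(word))

def is_suspect(word, ngt, threshold=1):
    """Flag `word` if any of its bigrams occurs fewer than `threshold` times."""
    return any(ngt[bg] < threshold for bg in bigrams(word))

corpus = ["water", "later", "alter", "after", "ate"]  # hypothetical sample
ngt = build_ngt(corpus)

print(is_suspect("later", ngt))  # False
print(is_suspect("atxr", ngt))   # True: "tx" and "xr" never occur
print(is_suspect("ater", ngt))   # False: every bigram of "ater" is frequent
```

The last call illustrates a limitation: a nonword built entirely from common N-Grams passes this check, which is one reason dictionary lookup is also used for detection.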
Dictionary Lookup Techniques
● Nonwords can be detected via dictionary lookup
techniques
● The challenges are 1) efficient key lookup; 2) efficient
storage of dictionaries (possibly on multiple nodes); 3)
vocabulary selection for the dictionary
● Typical data structures are
– Hash tables
– Frequency ordered binary search trees
– Tries
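As one illustration of the data structures listed above, here is a minimal trie sketch for dictionary lookup. The class layout is an assumption chosen for brevity; production dictionaries typically use more compact representations.

```python
class Trie:
    """A node of a character trie; the root represents the empty prefix."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def __contains__(self, word):
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        # A prefix of a stored word is not itself a word unless marked.
        return node.is_word

dictionary = Trie()
for w in ["ate", "after", "alter", "later", "water"]:
    dictionary.insert(w)

print("ate" in dictionary)   # True
print("ater" in dictionary)  # False, so ATER is detected as a nonword
print("at" in dictionary)    # False: only a prefix
```

Lookup time depends on the length of the key rather than on the number of stored words, and shared prefixes are stored once.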
Some Word Error Correction Approaches
● N-Gram techniques
● Minimum edit distances
● Key similarity
● Rule-based techniques
● Probabilistic techniques
● Neural networks
Basic Error Types
● Approximately 80% of all misspelled words contain
a single instance of one of the four types:
– Insertion: receeive (extra e is inserted)
– Deletion: rceive (e after r is deleted)
– Substitution: reseive (s is substituted for c)
– Transposition: recieve (e and i are transposed)
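Because most misspellings contain a single error of one of these types, a common way to generate correction candidates is to apply every single edit to the misspelling and keep the results that are dictionary words. The sketch below assumes that approach; the function name `single_edits` and the small `dictionary` are hypothetical. Note that undoing an insertion error requires a deletion, and vice versa.

```python
import string

def single_edits(word, alphabet=string.ascii_lowercase):
    """All strings one insertion, deletion, substitution, or transposition away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    transposes = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    substitutes = {l + c + r[1:] for l, r in splits if r for c in alphabet}
    inserts = {l + c + r for l, r in splits for c in alphabet}
    return (deletes | transposes | substitutes | inserts) - {word}

dictionary = {"after", "later", "alter", "water", "ate", "receive"}

print(sorted(single_edits("ater") & dictionary))
# ['after', 'alter', 'ate', 'later', 'water']
for typo in ["receeive", "rceive", "reseive", "recieve"]:
    print(typo, "receive" in single_edits(typo))  # True for all four
```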
Minimum Edit Distance
● Minimum edit distance techniques use dynamic
programming to compute the distance between a
source string and a target string in terms of the
number of insertions, deletions, transpositions,
and substitutions
● The trie data structure can be used to compute the
similarity between a source and a target
● Minimum edit distance techniques based on
dynamic programming are expensive: the table for one
source-target pair has (source length) x (target length) entries,
and it must be filled for every dictionary word compared
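A sketch of one dynamic-programming formulation that counts all four operations with unit cost. This particular variant, often called optimal string alignment, is an assumption; the slides do not say which formulation is meant.

```python
def edit_distance(source, target):
    """Unit-cost distance over insertions, deletions, substitutions,
    and transpositions of adjacent characters (optimal string alignment)."""
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters of the source prefix
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters of the target prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(edit_distance("recieve", "receive"))  # 1 (transposition)
print(edit_distance("ater", "water"))       # 1 (insertion)
print(edit_distance("kitten", "sitting"))   # 3
```

The nested loops make the cost of a single comparison visible: one table cell per pair of character positions.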
Generic N-Gram Correction Algorithm
● Build the NGT (bigram or trigram, positional or non-
positional)
● At run time:
– Detect misspellings through the presence of infrequent or
non-existent N-Grams
– Compute the number of N-Grams common to a misspelling and
a dictionary word
– Use the number of common N-Grams in a similarity measure
k(C/(n1 + n2)), where k is a constant, C is the number of
common N-Grams, and n1 and n2 are the lengths of the two strings
– Another similarity measure is C/max(n1, n2)
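Both similarity measures can be sketched with bigrams as follows. Counting common N-Grams as a multiset intersection and setting k = 2 are assumptions; the slides leave both choices open. The strings are assumed to be non-empty.

```python
from collections import Counter

def bigrams(word):
    return [word[i:i + 2] for i in range(len(word) - 1)]

def common_ngrams(a, b):
    """C: number of bigrams shared by a and b (multiset intersection)."""
    return sum((Counter(bigrams(a)) & Counter(bigrams(b))).values())

def sim_sum(a, b, k=2.0):
    """k * C / (n1 + n2), where n1 and n2 are the string lengths."""
    return k * common_ngrams(a, b) / (len(a) + len(b))

def sim_max(a, b):
    """C / max(n1, n2)."""
    return common_ngrams(a, b) / max(len(a), len(b))

for candidate in ["water", "ate", "after"]:
    print(candidate, common_ngrams("ater", candidate), sim_max("ater", candidate))
# water 3 0.6
# ate 2 0.5
# after 2 0.4
```

Under these assumptions, WATER ranks above ATE and AFTER as a correction for ATER.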
Similarity Key Techniques
● The basic idea is to map every string into a key in such a way that similarly spelled strings have
identical or similar keys (the same idea is used in hash tables)
● The SOUNDEX system was one of the first key mapping techniques; it uses the following
map:
– A, E, I, O, U, H, W, Y -> 0
– B, F, P, V -> 1
– C, G, J, K, Q, S, X, Z -> 2
– D, T -> 3
– L -> 4
– M, N -> 5
– R -> 6
● A word’s key consists of its first letter plus the digits computed for the remaining letters according to the map above
● Zeros are eliminated and runs of the same digit are conflated
Examples of SOUNDEX Similarity Keys
● Bush -> B020 -> B2
● Busch -> B0220 -> B2
● QUAYLE -> Q00040 -> Q4
● KWAIL -> K0004 -> K4
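A sketch of the key computation exactly as described on the previous slide: keep the first letter, map the remaining letters to digits, eliminate zeros, then conflate runs. This follows the slide's simplified description and is not the full standard Soundex, which has additional rules (for example, fixed-length codes). The names `raw_code` and `soundex_key` are hypothetical.

```python
from itertools import groupby

GROUPS = [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
          ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]
CODES = {ch: digit for letters, digit in GROUPS for ch in letters}

def raw_code(word):
    """First letter plus one digit per remaining letter (zeros kept)."""
    word = word.upper()
    return word[0] + "".join(CODES[ch] for ch in word[1:] if ch in CODES)

def soundex_key(word):
    """Eliminate zeros, then conflate runs of the same digit."""
    raw = raw_code(word)
    digits = raw[1:].replace("0", "")
    return raw[0] + "".join(d for d, _ in groupby(digits))

for w in ["Bush", "Busch", "QUAYLE", "KWAIL"]:
    print(w, raw_code(w), soundex_key(w))
# Bush B020 B2
# Busch B0220 B2
# QUAYLE Q00040 Q4
# KWAIL K0004 K4
```

Because the first letter is kept verbatim, QUAYLE and KWAIL receive different keys (Q4 and K4) even though they sound alike.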
Generic Similarity Key Algorithm
● Compute the key for each word in the dictionary
● Sort the key-word pairs by the keys
● At run time,
– Detect a misspelled word (e.g., by bigram or trigram matching)
– Compute its key
– Find the key in the sorted list
– Check the key’s word and the words of a small number of its
immediate neighbors to the left and right for possible
suggestions
– Return the list of possible suggestions
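The lookup part of this algorithm can be sketched with a sorted list and binary search. The key function below is a compact version of the simplified SOUNDEX map; `build_index`, `suggest`, the `window` parameter, and the five-word dictionary are assumptions made for illustration. Misspelling detection is assumed to have happened already.

```python
import bisect
from itertools import groupby

GROUPS = [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
          ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]
CODES = {ch: digit for letters, digit in GROUPS for ch in letters}

def key_of(word):
    """Simplified SOUNDEX key: first letter + digits, zeros removed, runs conflated."""
    w = word.upper()
    digits = "".join(CODES.get(ch, "") for ch in w[1:]).replace("0", "")
    return w[0] + "".join(d for d, _ in groupby(digits))

def build_index(dictionary):
    """Key-word pairs sorted by key."""
    return sorted((key_of(w), w) for w in dictionary)

def suggest(misspelling, index, window=2):
    """Words at the key's position plus `window` neighbors on each side."""
    pos = bisect.bisect_left(index, (key_of(misspelling),))
    lo, hi = max(0, pos - window), min(len(index), pos + window + 1)
    return [word for _, word in index[lo:hi]]

index = build_index(["quail", "quayle", "bush", "receive", "water"])
print(index)
print(suggest("quale", index, window=1))  # ['bush', 'quail', 'quayle']
```

The neighbor window also returns words with merely adjacent keys (bush here), so in practice the suggestions would still need to be ranked or filtered.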
Probabilistic Techniques
● Probabilistic techniques use two types of probabilities: transition
probabilities and confusion probabilities
● Transition probabilities are language dependent and compute the
conditional probability that a given letter or a sequence of
letters is followed by another letter or a sequence of letters
(these are done with n-grams)
● Confusion probabilities are estimates of how frequently a given
letter is mistaken for another letter
● They are source dependent (e.g., in OCR each OCR engine must
be repeatedly evaluated to estimate confusion probabilities
accurately)
Conditional Probabilities in Error Correction
● Let X be a dictionary word and Y be a misspelled string
● We need to estimate:
$$P(X \mid Y) = \frac{P(Y \mid X)\,P(X)}{P(Y)}$$
● P(X) is the probability of X
● P(Y) is the probability of Y
● P(Y|X) is the probability of the misspelling Y given that
X is the correct word
Conditional Probabilities in Error Correction
● Let X be a dictionary word and Y be a misspelled string
● To find the most probable correct spelling (X) of the
misspelled string (Y), we need to maximize: G(X|Y)=log
P(Y|X)+ log P(X)
● P(X) is the unigram probability of X
● Estimation of log P(Y|X) is harder but doable:
$$\log P(Y \mid X) = \sum_{i=1}^{n} \log P(Y_i \mid X_i)$$
where n is the length of the longest word (X or Y) and X_i, Y_i are the characters in the i-th position
Example
● Suppose that the misspelled wordform is FTA. What is the probability that the correct word is FAT?
$$G(X = FAT \mid Y = FTA) = \log P(Y_1 = F \mid X_1 = F) + \log P(Y_2 = T \mid X_2 = A) + \log P(Y_3 = A \mid X_3 = T) + \log P(FAT)$$
● P(F|F), P(T|A), P(A|T) must be estimated. For example, we can run experiments for a specific OCR engine (e.g., Tesseract) to get these estimates.
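A sketch of how G(X|Y) could be computed for this example. All probability values below are made up for illustration and are not measurements from any OCR engine. The function `score`, the `floor` value for unseen confusions, and the restriction to equal-length strings are also assumptions.

```python
import math

def score(candidate, observed, confusion, unigram, floor=1e-6):
    """G(X|Y) = sum_i log P(Y_i|X_i) + log P(X) for equal-length X and Y.

    `confusion[(y, x)]` is P(observed char y | intended char x);
    pairs missing from the table get the small probability `floor`.
    """
    if len(candidate) != len(observed):
        raise ValueError("this sketch only handles equal-length strings")
    log_channel = sum(math.log(confusion.get((y, x), floor))
                      for x, y in zip(candidate, observed))
    return log_channel + math.log(unigram[candidate])

# Hypothetical estimates.
confusion = {("F", "F"): 0.9, ("T", "T"): 0.9, ("A", "A"): 0.9,
             ("T", "A"): 0.01, ("A", "T"): 0.01}
unigram = {"FAT": 1e-4, "FIT": 1e-4}

scores = {x: score(x, "FTA", confusion, unigram) for x in unigram}
best = max(scores, key=scores.get)
print(best)  # FAT
```

With these numbers FAT outscores FIT because P(T|I) is not in the table and falls back to the floor value.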
References
1. K. Kukich, "Techniques for Automatically Correcting Words in
Text." ACM Computing Surveys, Vol. 24, No. 4, Dec. 1992.