Post on 05-Jan-2016
January 2012 Spelling Models 1
Human Language Technology
Spelling Models
References
• Mays, E., Damerau, F. J., and Mercer, R. L. (1991). Context based spelling correction. Information Processing & Management 27(5): 517-522.
• Church, K. and Gale, W. (1991). Probability Scoring for Spelling Correction. Statistics and Computing 1: 93-103.
• Brill, E. and Moore, R. (2000). An improved error model for noisy channel spelling correction. Proceedings of ACL Conference.
Outline
• In this lecture we describe three different models of how spelling errors are produced.
• Single Character
  – Equal probability
  – Differentiated probability
• Multiple Character
Confusion Set
The confusion set of a word w includes w along with all words O in the dictionary D such that O can be derived from w by a single application of one of the four edit operations:
– Add a single letter.
– Delete a single letter.
– Replace one letter with another.
– Transpose two adjacent letters.
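The definition above can be sketched in Python. This is a minimal illustration, not code from any of the cited papers; the function names `single_edits` and `confusion_set` and the lowercase ASCII alphabet are assumptions made for the example.

```python
import string


def single_edits(w, alphabet=string.ascii_lowercase):
    """Return every string one add/delete/replace/transpose away from w."""
    splits = [(w[:i], w[i:]) for i in range(len(w) + 1)]
    adds = {a + c + b for a, b in splits for c in alphabet}
    deletes = {a + b[1:] for a, b in splits if b}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    return adds | deletes | replaces | transposes


def confusion_set(w, dictionary):
    """w itself plus every dictionary word reachable from w by one edit."""
    return {w} | (single_edits(w) & set(dictionary))
```

For a small hypothetical dictionary containing cat, cart, at, cot, act and dog, the confusion set of "cat" keeps every word except "dog".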
Error Model 1: Mays, Damerau et al. (1991)
• Let C be the number of words in the confusion set of w.
• The error model, for all O in the confusion set of w, is:
P(O|w) = α if O = w,
P(O|w) = (1 - α)/(C - 1) otherwise
• α is the prior probability of a given typed word being correct.
• Key Idea: The remaining probability mass is distributed evenly among all other words in the confusion set.
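A minimal sketch of this error model in Python follows. The function name `error_model_1` and the default value of `alpha` are assumptions chosen for illustration; the slides do not give a value for α.

```python
def error_model_1(observed, w, confusion, alpha=0.99):
    """P(observed | w) under the equal-probability error model.

    `confusion` is the confusion set of w (it must contain w).
    `alpha` is an assumed value; the slides leave it unspecified.
    """
    if observed not in confusion:
        return 0.0  # strings outside the confusion set get no mass
    if observed == w:
        return alpha
    # Remaining mass is shared evenly among the other C - 1 words.
    return (1 - alpha) / (len(confusion) - 1)
```

Summing the function over every member of the confusion set gives 1 whenever the set contains at least two words.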
Error Model 2: Church & Gale 1991
• Church & Gale (1991) propose a more sophisticated error model based on the same confusion set (one edit operation away from w).
• Two improvements:
1. Unequal weightings attached to different editing operations.
2. Insertion and deletion probabilities are conditioned on context. The probability of inserting or deleting a character is conditioned on the letter appearing immediately to the left of that character.
Obtaining Error Probabilities
• The error probabilities are derived by first assuming all edits are equiprobable.
• They use as a training corpus a set of space-delimited strings that were found in a large collection of text, and that (a) do not appear in their dictionary and (b) are no more than one edit away from a word that does appear in the dictionary.
• They iteratively run the spell checker over the training corpus to find corrections, then use these corrections to update the edit probabilities.
Error Model 3: Brill and Moore (2000)
• Let Σ be an alphabet.
• Model allows all operations of the form α → β, where α, β in Σ*.
• P(α → β) is the probability that when a user intends to type the string α they type β instead.
• N.B. The model considers substitutions of arbitrary substrings, not just single characters.
Model 3: Brill and Moore (2000)
• Model also tries to account for the fact that, in general, positional information is a powerful conditioning feature, e.g. p(entler|antler) < p(reluctent|reluctant)
• i.e. the probability is partially conditioned on the position in the string at which the edit occurs.
• artifact/artefact; correspondance/correspondence
Three Stage Model
• Person picks a word.
  physical
• Person picks a partition of characters within word.
  ph y s i c al
• Person types each partition, perhaps erroneously.
  f i s i k le
• p(fisikle|physical) = p(f|ph) * p(i|y) * p(s|s) * p(i|i) * p(k|c) * p(le|al)
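The product above can be written out as a short Python sketch. The probability values in `toy_p` are invented purely for illustration and are not estimates from any corpus; the function name `partition_probability` is likewise an assumption.

```python
from math import prod


def partition_probability(typed_parts, intended_parts, p):
    """Multiply P(T_i | R_i) over aligned segments of one partition pair."""
    if len(typed_parts) != len(intended_parts):
        raise ValueError("partitions must have the same number of segments")
    return prod(p[(t, r)] for t, r in zip(typed_parts, intended_parts))


# Made-up probabilities, keyed as (typed segment, intended segment).
toy_p = {
    ("f", "ph"): 0.1, ("i", "y"): 0.2, ("s", "s"): 0.9,
    ("i", "i"): 0.9, ("k", "c"): 0.1, ("le", "al"): 0.05,
}

score = partition_probability(
    ["f", "i", "s", "i", "k", "le"],    # fisikle
    ["ph", "y", "s", "i", "c", "al"],   # physical
    toy_p,
)
```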
Formal Presentation
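• Let Part(w) be the set of all possible ways to partition string w into substrings.
• For a particular R in Part(w) containing j contiguous segments, let R_i be the i-th segment (so |R| = j). Then:

$$P(s \mid w) = \sum_{R \in Part(w)} P(R \mid w) \sum_{T \in Part(s),\ |T| = |R|} \ \prod_{i=1}^{|R|} P(T_i \mid R_i)$$

• Here T ranges over the partitions of the typed string s that have the same number of segments as R.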
Simplification
• By considering only the best partitioning of s and w, this simplifies to:

$$P(s \mid w) = \max_{R \in Part(w),\ T \in Part(s),\ |T| = |R|} P(R \mid w) \prod_{i=1}^{|R|} P(T_i \mid R_i)$$
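One way to search for the best partitioning is dynamic programming over prefixes of w and s. The sketch below is an illustration under stated assumptions, not the authors' implementation: P(R|w) is treated as a constant and dropped, segments on both sides must be non-empty, segment length is capped by a hypothetical `max_len` parameter, and `p` is a dictionary keyed as (typed segment, intended segment).

```python
from functools import lru_cache


def best_partition_prob(s, w, p, max_len=2):
    """Max over joint partitions of prod P(T_i | R_i); P(R | w) is ignored."""

    @lru_cache(maxsize=None)
    def best(i, j):
        # Best score for typed prefix s[:j] given intended prefix w[:i].
        if i == 0 and j == 0:
            return 1.0
        top = 0.0
        for a in range(max(0, i - max_len), i):       # last intended segment w[a:i]
            for b in range(max(0, j - max_len), j):   # last typed segment s[b:j]
                seg = p.get((s[b:j], w[a:i]), 0.0)
                if seg:
                    top = max(top, best(a, b) * seg)
        return top

    return best(len(w), len(s))
```

Segment pairs missing from `p` are given probability zero here, so a pair (s, w) with no fully covered partitioning scores 0.0; a real system would need some form of smoothing.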
Training the Model
• To train the model, we need a series of (s,w) word pairs.
• Begin by aligning the letters in (si,wi) based on minimum edit distance (MED).
• For instance, given the training pair (akgsual, actual), this could be aligned as:
a c ε t u a l
a k g s u a l
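A minimum-edit-distance alignment can be sketched with the standard dynamic-programming table plus a backtrace. This is an illustrative version with unit costs for insertion, deletion and substitution (an assumption; the slides do not state the costs), and the function name `align` is hypothetical. The empty string stands for ε.

```python
def align(w, s):
    """Align intended word w with typed string s by minimum edit distance.

    Returns (edits, distance), where edits is a list of
    (intended_char, typed_char) pairs and '' stands for ε.
    """
    n, m = len(w), len(s)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + 1,                           # deletion
                d[i][j - 1] + 1,                           # insertion
                d[i - 1][j - 1] + (w[i - 1] != s[j - 1]),  # match / substitution
            )
    edits, i, j = [], n, m
    while i > 0 or j > 0:  # walk back from the bottom-right cell
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (w[i - 1] != s[j - 1]):
            edits.append((w[i - 1], s[j - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            edits.append(("", s[j - 1]))
            j -= 1
        else:
            edits.append((w[i - 1], ""))
            i -= 1
    edits.reverse()
    return edits, d[n][m]
```

Several alignments can share the same minimum cost, so the backtrace may place ε at a different position than the alignment shown on the slide.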
Training the Model
• This corresponds to the sequence of editing operations
a→a c→k ε→g t→s u→u a→a l→l
• To allow for richer contextual information, each nonmatch substitution is expanded to incorporate up to N additional adjacent edits.
• For example, for the first nonmatch edit c→k in the example above, with N=2, we would generate the following substitutions:
Training the Model
a c ε t u a l
a k g s u a l
c → k
ac → ak
c → kg
ac → akg
ct → kgs
• We would do similarly for the other nonmatch edits, and give each of these substitutions a fractional count.
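The expansion step can be sketched as enumerating every window of adjacent edits that contains the nonmatch edit and adds at most N neighbours. The function name `expand_edit` and the (intended, typed) pair representation, with the empty string for ε, are assumptions made for this example.

```python
def expand_edit(edits, i, n=2):
    """Merge edit i with up to n adjacent edits, in every possible window.

    `edits` is a list of (intended, typed) pairs, '' standing for ε.
    Returns a set of (alpha, beta) substitutions.
    """
    out = set()
    for lo in range(max(0, i - n), i + 1):
        for hi in range(i, min(len(edits) - 1, lo + n) + 1):
            window = edits[lo:hi + 1]
            alpha = "".join(a for a, _ in window)
            beta = "".join(b for _, b in window)
            out.add((alpha, beta))
    return out


# The slide's alignment of (akgsual, actual).
slide_edits = [("a", "a"), ("c", "k"), ("", "g"), ("t", "s"),
               ("u", "u"), ("a", "a"), ("l", "l")]
expanded = expand_edit(slide_edits, 1, n=2)  # expand the edit c → k
```

With N=2 this yields the five substitutions listed on the slide. How the fractional count is divided among them is not specified here, so the sketch leaves counting to the caller.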
Training the Model
• We can then calculate the probability of each substitution α → β as count(α → β)/count(α).
• count(α → β) is simply the sum of the counts derived from our training data as explained above.
• Estimating count(α) is harder, since we are not training from a text corpus, but from a set of (s,w) tuples (without an associated corpus).
Training the Model
• From a large collection of representative text, count the number of occurrences of α.
• Adjust the count based on an estimate of the rate with which people make typing errors.
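A rough sketch of the final ratio is shown below. The slides do not say exactly how the corpus count is adjusted, so the adjustment appears only as a caller-supplied `scale` factor, which is a hypothetical parameter; the function names are also assumptions.

```python
def count_occurrences(alpha, text):
    """Count (possibly overlapping) occurrences of substring alpha in text."""
    return sum(
        1 for i in range(len(text) - len(alpha) + 1)
        if text.startswith(alpha, i)
    )


def substitution_probability(sub_count, alpha, text, scale=1.0):
    """Estimate P(alpha -> beta) = count(alpha -> beta) / count(alpha).

    `sub_count` is the summed fractional count for alpha -> beta.
    `scale` is a hypothetical stand-in for the error-rate adjustment.
    """
    denom = count_occurrences(alpha, text) * scale
    return sub_count / denom if denom else 0.0
```

In practice `text` would be the large representative collection mentioned above, and `scale` would be chosen so that the corpus counts and the training-pair counts are on a comparable footing.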