Speech & NLP
www.vkedco.blogspot.com
Spellchecking
N-Gram Analysis, Error Detection, Word Boundary Problem, Error
Correction in Isolated Words
Vladimir Kulyukin
Outline
● N-Gram Analysis
● Error Detection & Correction
● Word Boundary Problem
● Spellchecking Nonwords & Isolated Words
N-Gram Analysis
N-Gram Tables (NGTs)
● The simplest form of NGT is a bigram NGT
● The dimensions of bigram NGTs depend on the size
of the alphabet
● For the standard English alphabet of 26 letters, the
dimensions of the bigram NGT are 26 x 26
● For larger character sets (e.g., Unicode), the table is
considerably larger
N-Gram Tables: N-Gram Weights
● There are multiple strategies for assigning weights
to N-Grams in an NGT
● The simplest form is binary: an N-Gram is either
present or absent
● Another common form is the frequency of an
N-Gram in the corpus
● More sophisticated weights are tf*idf or weights
based on some probability distribution
Example: Bigram NGT for the Alphabet {A, B, C}
        A     B     C
  A     0    20    11
  B    50    12   125
  C     0    78    97

● Assume that there is some imaginary corpus where each wordform is a string over {A, B, C}
● Suppose that the weight of each bigram is equal to its frequency in the corpus
● The above NGT records the following bigram frequencies:
  – AA occurs 0 times; AB occurs 20 times; AC occurs 11 times
  – BA occurs 50 times; BB occurs 12 times; BC occurs 125 times
  – CA occurs 0 times; CB occurs 78 times; CC occurs 97 times
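A frequency-weighted bigram NGT like the one above can be sketched in a few lines of Python; the three-word corpus below is an illustrative stand-in:

from collections import Counter

def bigram_ngt(corpus, alphabet):
    """Frequency-weighted bigram NGT as a nested dict
    (row letter -> column letter -> count)."""
    counts = Counter()
    for wordform in corpus:
        for i in range(len(wordform) - 1):
            counts[wordform[i:i + 2]] += 1
    return {a: {b: counts[a + b] for b in alphabet} for a in alphabet}

# a tiny imaginary corpus over the alphabet {A, B, C}
ngt = bigram_ngt(["ABC", "BCA", "BCC"], "ABC")
print(ngt["B"]["C"])  # frequency of the bigram BC -> 3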
Trigram NGTs
● A trigram NGT is a 3D array (a cube)
● The dimensions of the cube depend on the size of
the alphabet
● Like a bigram NGT, a trigram NGT records weights
of every possible trigram
● Like a bigram NGT, a trigram NGT can be binary
(i.e., recording the presence/absence of a trigram
in the corpus) or real-valued (i.e., recording real
weights)
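A minimal sketch of a trigram NGT; a sparse map keyed by trigram strings is assumed here instead of a dense 3D array:

from collections import Counter

def trigram_ngt(corpus, binary=False):
    """Trigram NGT kept as a sparse map: trigram -> weight."""
    counts = Counter()
    for wordform in corpus:
        for i in range(len(wordform) - 2):
            counts[wordform[i:i + 3]] += 1
    # a binary NGT records only the presence/absence of a trigram
    return {t: 1 for t in counts} if binary else dict(counts)

print(trigram_ngt(["ABCCA", "ABCAB"]))
# {'ABC': 2, 'BCC': 1, 'CCA': 1, 'BCA': 1, 'CAB': 1}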
Nonpositional vs. Positional NGTs
● If an NGT records only weights (numeric or
symbolic) in its entries it is called nonpositional
● Nonpositional is used in the sense that the
recorded weight does not take into account where
in the string a specific n-gram occurs
● Positional NGTs typically add the start and the
end indices to each weight that record where a
specific n-gram occurs in some string
Example: Nonpositional vs. Positional Bigrams
● We have two strings S1 and S2 in the corpus:
  – S1 = ABCCABCCAB (character positions 0 through 9)
  – S2 = ABCCBAA (character positions 0 through 6)
● Bigrams for S1: AB [0,1], BC [1,2], CC [2,3], CA [3,4], AB [4,5], BC [5,6], CC [6,7], CA [7,8], AB [8,9]
● Bigrams for S2: AB [0,1], BC [1,2], CC [2,3], CB [3,4], BA [4,5], AA [5,6]

Non-positional Bigram Table:

        A     B     C
  A     1     4     0
  B     1     0     3
  C     2     1     3

Positional Bigram Table:

        A                   B                           C
  A     [5,6]-1             [0,1]-2; [4,5]-1; [8,9]-1   NULL
  B     [4,5]-1             NULL                        [1,2]-2; [5,6]-1
  C     [3,4]-1; [7,8]-1    [3,4]-1                     [2,3]-2; [6,7]-1
Example: Positional NGT Table Entry
● Here is how we can read the sample entry for AB (row A,
column B) in the positional bigram NGT on the previous slide:
  – AB occurs 2 times in position [0, 1] in the corpus;
  – AB occurs once in position [4, 5] in the corpus;
  – AB occurs once in position [8, 9] in the corpus.

Entry at row A, column B: [0, 1] - 2; [4, 5] - 1; [8, 9] - 1
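A positional bigram NGT can be sketched as a map from each bigram to a counter over its (start, end) positions; running it on S1 and S2 from the earlier slide reproduces the AB entry:

from collections import Counter, defaultdict

def positional_bigram_ngt(corpus):
    """Map each bigram to a Counter over its (start, end) positions."""
    table = defaultdict(Counter)
    for wordform in corpus:
        for i in range(len(wordform) - 1):
            table[wordform[i:i + 2]][(i, i + 1)] += 1
    return table

# S1 and S2 from the earlier slide
table = positional_bigram_ngt(["ABCCABCCAB", "ABCCBAA"])
print(table["AB"])  # Counter({(0, 1): 2, (4, 5): 1, (8, 9): 1})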
Corpus-Based vs. Text-Based NGTs
● The most frequent source of N-Grams for an NGT is
a text corpus
● However, there are approaches that advocate
computing NGTs for individual texts (journal
articles, web pages, etc.)
● The longer the document, the more reliable the
NGT
● Possible misspellings are detected as infrequent or
non-existent N-Grams
Word Boundary Problem
Word Boundary Problem
● A typical definition of word boundaries is through white
space characters (tabs, newlines, spaces, etc.)
● This definition does not work too well in automated
spellchecking because of run-ons and splits:
● Examples of run-ons:
– atthe, ina, ofthe, forgetit, understandyou
● Examples of splits:
– at t he, sp end, un derstand you
Challenges of Run-ons & Splits
● A major difficulty in dealing with run-ons & splits
is that both frequently result in valid wordforms
● Examples:
– forgot, when split after r, results in two valid
wordforms: for and got, i.e., for got
– A run-on of drive- and in results in drive-in (this
type of error is frequent in OCR)
Challenges of Run-ons & Splits
● Another major challenge in dealing with run-ons &
splits is that any relaxation of the standard word
boundary definition results in combinatorial
explosion of possible strings to check
● Successful applications have been designed only
for limited domains
● Research indicates that many run-ons and splits
involve a relatively small set of high-frequency
words
Error Detection & Correction
Error Detection vs. Error Correction
● There are two main problems in automated
spellchecking: error detection & error correction
● Error detection is determining that a string is not
in the list of known strings (aka the dictionary)
● Error correction is a harder problem: a string
which is not in the dictionary must be somehow
changed to become one of the strings in the
dictionary
Nonword Error Correction
● A nonword is a wordform that is not on the list of
known wordforms
● Example:
– wordform ATER would not typically be on any list of known
English wordforms (it could be an abbreviation for a company
or a scientific term)
– suggested spelling corrections are AFTER, LATER, ALTER,
WATER, ATE
N-Gram-Based Error Detection
● Collect a representative sample of texts
● Compute the N-Gram table (NGT) for the text corpus
● Given an input string, compute its list of N-Grams
● For each N-Gram in the input string
  – check it against the NGT
  – if the input string contains N-Grams that are not in the NGT or
N-Grams that are infrequent (this is typically decided by a
threshold), mark the string as a misspelling (see the sketch below)
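Here is a minimal sketch of the detection loop; the two-word corpus and the threshold value are illustrative assumptions:

from collections import Counter

def bigrams(s):
    return [s[i:i + 2] for i in range(len(s) - 1)]

def is_misspelling(word, ngt, threshold=1):
    """Flag word if any of its bigrams is absent from the NGT
    or occurs fewer than threshold times."""
    return any(ngt[b] < threshold for b in bigrams(word))

# NGT built from a tiny illustrative corpus
ngt = Counter(b for w in ["WATER", "LATER"] for b in bigrams(w))
print(is_misspelling("ATER", ngt))  # False: all bigrams attested
print(is_misspelling("ATLR", ngt))  # True: TL and LR are unattested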
Dictionary Lookup Techniques
● Nonwords can be detected via dictionary lookup
techniques
● The challenges are 1) efficient key lookup; 2) efficient
storage of dictionaries (possibly on multiple nodes); 3)
vocabulary selection for the dictionary
● Typical data structures are
– Hash tables
– Frequency ordered binary search trees
– Tries
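A minimal trie sketch for dictionary lookup; the wordlist is a placeholder:

class TrieNode:
    """One node of a dictionary trie."""
    def __init__(self):
        self.children = {}
        self.is_word = False

def trie_insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True  # mark the end of a known wordform

def trie_lookup(root, word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_word

root = TrieNode()
for w in ["water", "later", "alter"]:
    trie_insert(root, w)
print(trie_lookup(root, "water"), trie_lookup(root, "ater"))  # True False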
Some Word Error Correction Approaches
● N-Gram techniques
● Minimum edit distances
● Key similarity
● Rule-based techniques
● Probabilistic techniques
● Neural networks
Basic Error Types
● Approximately 80% of all misspelled words contain
a single instance of one of the following four types:
– Insertion: receeive (extra e is inserted)
– Deletion: rceive (e after r is deleted)
– Substitution: reseive (s is substituted for c)
– Transposition: recieve (e and i are transposed)
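Since roughly 80% of misspellings contain a single error of one of these types, a common correction step is to generate every string one edit away from the input and keep those found in the dictionary. A minimal sketch, assuming a lowercase English alphabet:

def single_edits(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one insertion, deletion, substitution, or
    transposition away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = [l + r[1:] for l, r in splits if r]
    transpositions = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    substitutions = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    insertions = [l + c + r for l, r in splits for c in alphabet]
    return set(deletions + transpositions + substitutions + insertions)

dictionary = {"receive", "relieve"}
print(dictionary & single_edits("recieve"))
# {'receive', 'relieve'} (set order may vary)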
Minimum Edit Distance
● Minimum edit distance techniques use dynamic
programming to compute the distance between a
source string and a target string in terms of the
number of insertions, deletions, transpositions,
and substitutions
● Some techniques use the trie data structure to
compute the similarity between a source and a target
● Minimum edit distance techniques based on
dynamic programming are expensive
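A sketch of the dynamic-programming computation; this variant (often called the restricted Damerau-Levenshtein distance) covers all four basic error types:

def edit_distance(src, tgt):
    """Minimum number of insertions, deletions, substitutions, and
    transpositions turning src into tgt (dynamic programming)."""
    m, n = len(src), len(tgt)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and src[i - 1] == tgt[j - 2]
                    and src[i - 2] == tgt[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(edit_distance("recieve", "receive"))  # 1 (one transposition)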
Generic N-Gram Correction Algorithm
● Build the NGT (bigram or trigram, positional or non-
positional)
● At run time:
– Detect misspellings through the presence of infrequent or
non-existent N-Grams
– Compute the number of N-Grams common to a misspelling and
a dictionary word
– Use the number of common n-grams in a similarity measure
k(C/(n1 + n2)), where k is a constant, C is the number of
common N-Grams, and n1 and n2 are the lengths of two strings
– Another similarity measure is C/max(n1, n2)
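The second similarity measure can be sketched as follows; the slide leaves open whether n1 and n2 count characters or N-Grams, so string lengths are assumed here:

def bigram_set(s):
    return {s[i:i + 2] for i in range(len(s) - 1)}

def ngram_similarity(s1, s2):
    """C / max(n1, n2): shared bigrams over the longer string's length."""
    c = len(bigram_set(s1) & bigram_set(s2))
    return c / max(len(s1), len(s2))

# rank candidate corrections for the misspelling ATER
for w in ["WATER", "LATER", "ALTER", "ATE"]:
    print(w, ngram_similarity("ATER", w))  # WATER and LATER score highest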
Similarity Key Techniques
● The basic idea is to map every string to a key in such a way that similarly
spelled strings have identical or similar keys (the same idea is used in hash
tables)
● The SOUNDEX system was one of the first key mapping techniques; it assigns
digits to letters according to the following map:
  – A, E, I, O, U, H, W, Y → 0
  – B, F, P, V → 1
  – C, G, J, K, Q, S, X, Z → 2
  – D, T → 3
  – L → 4
  – M, N → 5
  – R → 6
● A word’s key consists of its first letter plus the digits computed according to the map above
● Zeros are eliminated and runs of the same digit are conflated
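A minimal sketch of the key computation; conflating runs before eliminating zeros is assumed, which reproduces the examples on the next slide:

import itertools

SOUNDEX_MAP = {}
for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                       ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        SOUNDEX_MAP[ch] = digit

def soundex_key(word):
    """First letter, then mapped digits with runs conflated
    and zeros eliminated."""
    word = word.upper()
    digits = "".join(SOUNDEX_MAP[c] for c in word[1:] if c in SOUNDEX_MAP)
    conflated = "".join(d for d, _ in itertools.groupby(digits))
    return word[0] + conflated.replace("0", "")

print(soundex_key("Busch"), soundex_key("QUAYLE"))  # B2 Q4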
Examples of SOUNDEX Similarity Keys
● Bush → B020 → B2
● Busch → B0220 → B2
● QUAYLE → Q00040 → Q4
● KWAIL → K0004 → K4
Generic Similarity Key Algorithm
● Compute the key for each word in the dictionary
● Sort the key-word pairs by the keys
● At run time,
– Detect a misspelled word (e.g., bigram or trigram matching)
– Compute its key
– Find the key in the sorted list
– Check the key’s word and the words of a small number of its
immediate neighbors to the left and right for possible
suggestions
– Return the list of possible suggestions
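A sketch of the run-time lookup; it reuses the soundex_key function from the earlier sketch, and the neighborhood window size is an arbitrary choice:

import bisect

def build_key_index(words):
    """Sorted (key, word) pairs; soundex_key is the function
    sketched two slides earlier."""
    return sorted((soundex_key(w), w) for w in words)

def suggest(misspelling, index, window=3):
    """Collect the words stored near the misspelling's key."""
    key = soundex_key(misspelling)
    pos = bisect.bisect_left(index, (key, ""))
    lo, hi = max(0, pos - window), min(len(index), pos + window)
    return [word for _, word in index[lo:hi]]

index = build_key_index(["bush", "busch", "bosh", "brush"])
print(suggest("buzh", index))  # ['bosh', 'busch', 'bush']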
Probabilistic Techniques
● Probabilistic techniques use two types of probabilities: transition
probabilities and confusion probabilities
● Transition probabilities are language dependent and give the
conditional probability that a given letter or sequence of
letters is followed by another letter or sequence of letters
(these are estimated with N-Grams)
● Confusion probabilities are estimates of how frequently a given
letter is mistaken for another letter
● They are source dependent (e.g., in OCR each OCR engine must
be repeatedly evaluated to estimate confusion probabilities
accurately)
Conditional Probabilities in Error Correction
● Let X be a dictionary word and Y be a misspelled string
● By Bayes' rule, we need to estimate:

  P(X|Y) = P(Y|X) P(X) / P(Y)

● P(X) is the probability of X
● P(Y) is the probability of Y
● P(Y|X) is the probability of the misspelling Y given that
X is the correct word
Conditional Probabilities in Error Correction
● Let X be a dictionary word and Y be a misspelled string
● To find the most probable correct spelling X of the
misspelled string Y, we need to maximize:
G(X|Y) = log P(Y|X) + log P(X)
● P(X) is the unigram probability of X
● Estimation of log P(Y|X) is harder but doable:
  log P(Y|X) ≈ Σ_{i=1..n} log P(Y_i|X_i)

  where n is the length of the longer word (X or Y) and
  Y_i, X_i are the characters in the i-th position
Example
● Suppose that the misspelled wordform is FTA. What is the
probability that the correct word is FAT?
● G(Y = FTA | X = FAT) =
  log P(Y_1 = F | X_1 = F) + log P(Y_2 = T | X_2 = A) +
  log P(Y_3 = A | X_3 = T) + log P(FAT)
● P(F|F), P(T|A), P(A|T) must be estimated. For example, we
can run experiments for a specific OCR engine (e.g., Tesseract)
to get these estimates.
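A worked sketch of the scoring; every probability below is a made-up placeholder, not a measured value, since real confusion estimates would come from experiments like those described above:

import math

# hypothetical confusion and unigram probabilities; real values
# would be measured for a specific OCR engine
P_confusion = {("F", "F"): 0.95, ("T", "A"): 0.02, ("A", "T"): 0.03}
P_word = {"FAT": 1e-4}

def g_score(candidate, observed):
    """G(candidate|observed) = sum_i log P(Y_i|X_i) + log P(candidate)."""
    g = math.log(P_word[candidate])
    for x, y in zip(candidate, observed):
        g += math.log(P_confusion[(y, x)])
    return g

print(g_score("FAT", "FTA"))  # higher scores mean better candidates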
References
1. K. Kukich, "Techniques for Automatically Correcting Words in
Text." ACM Computing Surveys, Vol. 24, No. 4, Dec. 1992.