Speech & NLP (Fall 2014): Spellchecking: N-Gram Analysis, Error Detection, Word Boundary Problem,...


Outline

● N-Gram Analysis

● Error Detection & Correction

● Word Boundary Problem

● Spellchecking Nonwords & Isolated Words

N-Gram Analysis

N-Gram Tables (NGTs)

● The simplest form of NGT is a bigram NGT

● The dimensions of a bigram NGT depend on the size of the alphabet

● For the 26 letters of the standard English alphabet, the dimensions of the bigram NGT are 26 x 26

● For larger character sets (e.g., Unicode) the table is considerably larger

N-Gram Tables: N-Gram Weights

● There are multiple strategies for assigning weights to N-Grams in an NGT

● The simplest form is binary: an N-Gram is either present or absent

● Another common form is the frequency of an N-Gram in the corpus

● More sophisticated weights are tf*idf or weights based on some probability distribution

Example: Bigram NGT for the Alphabet {A, B, C}

      A    B    C
A     0   20   11
B    50   12  125
C     0   78   97

● Assume that there is some imaginary corpus where each wordform is a string over {A, B, C}

● Suppose that the weight of each bigram is equal to its frequency in the corpus

● The above NGT records the following bigram frequencies:

AA occurs 0 times; AB occurs 20 times; AC occurs 11 times

BA occurs 50 times; BB occurs 12 times; BC occurs 125 times

CA occurs 0 times; CB occurs 78 times; CC occurs 97 times
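To make the construction concrete, here is a minimal Python sketch that builds such a frequency-weighted bigram NGT; the function name and the tiny corpus are invented for illustration:

```python
from collections import defaultdict

def build_bigram_ngt(corpus):
    """Build a frequency-weighted bigram NGT: ngt[c1][c2] = count."""
    ngt = defaultdict(lambda: defaultdict(int))
    for wordform in corpus:
        # Walk over consecutive character pairs in each wordform.
        for c1, c2 in zip(wordform, wordform[1:]):
            ngt[c1][c2] += 1
    return ngt

# A tiny invented corpus of strings over {A, B, C}.
corpus = ["ABCCAB", "BCAB", "CBBC"]
ngt = build_bigram_ngt(corpus)
print(ngt["A"]["B"])   # frequency of the bigram AB in this corpus
```

The same nested-dictionary shape extends to trigrams by adding one more level of nesting instead of allocating a dense cube.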

Trigram NGTs

● A trigram NGT is a 3D array (a cube)

● The dimensions of the cube depend on the size of the alphabet

● Like a bigram NGT, a trigram NGT records the weight of every possible trigram

● Like a bigram NGT, a trigram NGT can be binary (i.e., recording the presence/absence of a trigram in the corpus) or real-valued (i.e., recording some real weights)

Nonpositional vs. Positional NGTs

● If an NGT records only weights (numeric or symbolic) in its entries, it is called nonpositional

● Nonpositional is used in the sense that the recorded weight does not take into account where in the string a specific n-gram occurs

● Positional NGTs typically add the start and end indices to each weight, recording where a specific n-gram occurs in some string

Example: Nonpositional vs. Positional Bigrams

● We have two strings S1 and S2 in the corpus

● Bigrams for S1: AB [0,1], BC [1,2], CC [2,3], CA [3,4], AB [4,5], BC [5,6], CC [6,7], CA [7, 8], AB [8,9]

● Bigrams for S2: AB [0,1], BC [1,2], CC [2,3], CB [3,4], BA [4,5], AA [5,6]

S1 = A B C C A B C C A B
     0 1 2 3 4 5 6 7 8 9

S2 = A B C C B A A
     0 1 2 3 4 5 6

Non-positional Bigram Table:

      A    B    C
A     1    4    0
B     1    0    3
C     2    1    3

Positional Bigram Table:

      A                  B                           C
A     [5,6]-1            [0,1]-2; [4,5]-1; [8,9]-1   NULL
B     [4,5]-1            NULL                        [1,2]-2; [5,6]-1
C     [3,4]-1; [7,8]-1   [3,4]-1                     [2,3]-2; [6,7]-1

Example: Positional NGT Table Entry

● Here is how we can read the sample entry for AB in the positional bigram NGT on the previous slide:

– AB occurs 2 times in position [0,1] in the corpus;

– AB occurs once in position [4,5] in the corpus;

– AB occurs once in position [8,9] in the corpus.

Entry for row A, column B: [0,1]-2; [4,5]-1; [8,9]-1
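A minimal Python sketch of both kinds of tables, using the strings S1 and S2 from the example above; the dictionary-of-Counters layout is one possible choice, not something the slides prescribe:

```python
from collections import Counter, defaultdict

def bigram_tables(corpus):
    """Return (nonpositional, positional) bigram tables.

    nonpositional[(c1, c2)]          -> total frequency in the corpus
    positional[(c1, c2)][(i, i+1)]   -> frequency at that position
    """
    nonpositional = Counter()
    positional = defaultdict(Counter)
    for s in corpus:
        for i in range(len(s) - 1):
            bigram = (s[i], s[i + 1])
            nonpositional[bigram] += 1
            positional[bigram][(i, i + 1)] += 1
    return nonpositional, positional

S1 = "ABCCABCCAB"
S2 = "ABCCBAA"
nonpos, pos = bigram_tables([S1, S2])
print(nonpos[("A", "B")])     # 4
print(dict(pos[("A", "B")]))  # {(0, 1): 2, (4, 5): 1, (8, 9): 1}
```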

Corpus-Based vs. Text-Based NGTs

● The most frequent source of N-Grams for an NGT is a text corpus

● However, there are approaches that advocate computing NGTs for individual texts (journal articles, web pages, etc.)

● The longer the document, the more reliable the NGT

● Possible misspellings are detected as infrequent or non-existent N-Grams

Word Boundary Problem

● A typical definition of word boundaries is through whitespace characters (tabs, newlines, spaces, etc.)

● This definition does not work well in automated spellchecking because of run-ons and splits:

● Examples of run-ons:

– atthe, ina, ofthe, forgetit, understandyou

● Examples of splits:

– at t he, sp end, un derstand you

Challenges of Run-ons & Splits

● A major difficulty in dealing with run-ons & splits is that both frequently result in valid wordforms

● Examples:

– forgot, when split after r, results in two valid wordforms, for and got, i.e., for got

– A run-on of drive- and in results in drive-in (this type of error is frequent in OCR)
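A small sketch of the underlying check: enumerating the split points of a suspected run-on that yield two valid wordforms (the tiny dictionary here is illustrative only):

```python
def valid_splits(string, dictionary):
    """Return all ways to split a run-on into two valid wordforms."""
    return [(string[:i], string[i:])
            for i in range(1, len(string))
            if string[:i] in dictionary and string[i:] in dictionary]

dictionary = {"for", "got", "forgot", "at", "the", "in", "a"}
print(valid_splits("forgot", dictionary))  # [('for', 'got')]
print(valid_splits("atthe", dictionary))   # [('at', 'the')]
```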

Challenges of Run-ons & Splits

● Another major challenge in dealing with run-ons & splits is that any relaxation of the standard word boundary definition results in a combinatorial explosion of possible strings to check

● Successful applications have been designed only for limited domains

● Research indicates that many run-ons and splits involve a relatively small set of high-frequency words

Error Detection & Correction

Error Detection vs. Error Correction

● There are two main problems in automated spellchecking: error detection & error correction

● Error detection is finding when a string is not in the list of known strings (aka the dictionary)

● Error correction is a harder problem: a string which is not in the dictionary must be somehow changed to become one of the strings in the dictionary

Nonword Error Correction

● A nonword is a wordform that is not on the list of known wordforms

● Example:

– the wordform ATER would not typically be on any list of known English wordforms (though it could be an abbreviation for a company or a scientific term)

– suggested spelling corrections are AFTER, LATER, ALTER, WATER, ATE

N-Gram-Based Error Detection

● Collect a representative sample of texts

● Compute the N-Gram table (NGT) for the text corpus

● Given an input string, compute its list of N-Grams

● For each N-Gram in the input string:

– check it against the NGT

– if the input string contains N-Grams that are not in the NGT or contains N-Grams that are infrequent (this is typically decided by a threshold), mark it as a misspelling (see the sketch below)
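A minimal sketch of this detection loop for bigrams, assuming a frequency-weighted NGT in the nested-dictionary shape used earlier; the threshold value is an arbitrary illustration:

```python
def detect_misspelling(string, ngt, threshold=1):
    """Mark a string as a possible misspelling if any of its bigrams
    is absent from the NGT or occurs fewer than `threshold` times."""
    for c1, c2 in zip(string, string[1:]):
        if ngt.get(c1, {}).get(c2, 0) < threshold:
            return True   # unseen or infrequent bigram found
    return False

# The NGT from the earlier {A, B, C} example.
ngt = {"A": {"B": 20, "C": 11},
       "B": {"A": 50, "B": 12, "C": 125},
       "C": {"B": 78, "C": 97}}
print(detect_misspelling("ABCA", ngt))  # True: CA never occurs
print(detect_misspelling("ABCB", ngt))  # False
```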

Dictionary Lookup Techniques

● Nonwords can be detected via dictionary lookup techniques

● The challenges are 1) efficient key lookup; 2) efficient storage of dictionaries (possibly on multiple nodes); 3) vocabulary selection for the dictionary

● Typical data structures are

– Hash tables

– Frequency ordered binary search trees

– Tries
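A minimal trie sketch for dictionary lookup; the node layout here is my own choice, and real implementations typically add frequency information and path compression:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False

class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

trie = Trie(["water", "later", "alter"])
print(trie.contains("water"))  # True
print(trie.contains("ater"))   # False: a nonword
```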

Some Word Error Correction Approaches

● N-Gram techniques

● Minimum edit distances

● Key similarity

● Rule-based techniques

● Probabilistic techniques

● Neural networks

Basic Error Types

● Approximately 80% of all misspelled words contain a single instance of one of the four types:

– Insertion: receeive (extra e is inserted)

– Deletion: rceive (e after r is deleted)

– Substitution: reseive (s is substituted for c)

– Transposition: recieve (e and i are transposed)
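Since a single one of these errors accounts for most misspellings, a common correction strategy is to generate every string one edit away from the input and keep those found in the dictionary. A sketch (the helper name and alphabet are illustrative):

```python
def single_edit_candidates(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one insertion, deletion, substitution, or
    transposition away from the given word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    substitutes = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts)

dictionary = {"receive"}
print(dictionary & single_edit_candidates("recieve"))  # {'receive'}
```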

Minimum Edit Distance

● Minimum edit distance techniques use dynamic programming to compute the distance between a source string and a target string in terms of the number of insertions, deletions, transpositions, and substitutions (see the sketch below)

● The trie data structure can be used to compute the similarity between a source and a target

● Minimum edit distance techniques based on dynamic programming are expensive
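A standard dynamic-programming sketch of minimum edit distance with unit costs, extended with adjacent transpositions in the Damerau style; the nested loops illustrate why the technique is quadratic in the string lengths:

```python
def min_edit_distance(source, target):
    """Minimum number of insertions, deletions, substitutions, and
    adjacent transpositions turning source into target."""
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of source
    for j in range(n + 1):
        d[0][j] = j                      # insert all of target
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(min_edit_distance("recieve", "receive"))  # 1 (one transposition)
```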

Generic N-Gram Correction Algorithm

● Build the NGT (bigram or trigram, positional or nonpositional)

● At run time:

– Detect misspellings through the presence of infrequent or non-existent N-Grams

– Compute the number of N-Grams common to a misspelling and a dictionary word

– Use the number of common N-Grams in a similarity measure k(C/(n1 + n2)), where k is a constant, C is the number of common N-Grams, and n1 and n2 are the lengths of the two strings

– Another similarity measure is C/max(n1, n2)
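A sketch of both similarity measures over bigrams; it treats the common N-Grams as a set (the slides do not specify set vs. multiset counting) and takes n1 and n2 to be the string lengths, as stated above, with k an arbitrary constant:

```python
def bigrams(s):
    """The list of character bigrams of a string."""
    return [s[i:i + 2] for i in range(len(s) - 1)]

def ngram_similarity(s1, s2, k=1.0):
    """k * C / (n1 + n2), where C is the number of common bigrams."""
    common = len(set(bigrams(s1)) & set(bigrams(s2)))
    return k * common / (len(s1) + len(s2))

def ngram_similarity_max(s1, s2):
    """Alternative measure: C / max(n1, n2)."""
    common = len(set(bigrams(s1)) & set(bigrams(s2)))
    return common / max(len(s1), len(s2))

# Comparing the nonword ATER against the dictionary word WATER.
print(ngram_similarity("ater", "water"))      # 3 / 9 ~ 0.33
print(ngram_similarity_max("ater", "water"))  # 3 / 5 = 0.6
```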

Similarity Key Techniques

● The basic idea is to map every string to a key in such a way that similarly spelled strings have identical or similar keys (the same idea is used in hash tables)

● The SOUNDEX system was one of the first key mapping techniques; it codes letters according to the following map:

– A, E, I, O, U, H, W, Y → 0

– B, F, P, V → 1

– C, G, J, K, Q, S, X, Z → 2

– D, T → 3

– L → 4

– M, N → 5

– R → 6

● A word’s key consists of its first letter plus the digits computed according to the map above

● Zeros are eliminated and runs of the same digit are conflated

Examples of SOUNDEX Similarity Keys

● Bush → B020 → B2

● Busch → B0220 → B2

● QUAYLE → Q00040 → Q4

● KWAIL → K0004 → K4
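A sketch of this SOUNDEX variant, following the letter map and the two reduction rules above; it reproduces the four example keys:

```python
SOUNDEX_MAP = {c: d for cs, d in [
    ("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
    ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6"),
] for c in cs}

def soundex_key(word):
    """First letter plus the digit codes of the remaining letters,
    with runs of the same digit conflated and zeros eliminated."""
    word = word.upper()
    digits = [SOUNDEX_MAP[c] for c in word[1:] if c in SOUNDEX_MAP]
    # Conflate runs of identical digits.
    conflated = [d for i, d in enumerate(digits)
                 if i == 0 or d != digits[i - 1]]
    # Eliminate zeros.
    return word[0] + "".join(d for d in conflated if d != "0")

for w in ["Bush", "Busch", "QUAYLE", "KWAIL"]:
    print(w, soundex_key(w))  # B2, B2, Q4, Q4
```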

Generic Similarity Key Algorithm

● Compute the key for each word in the dictionary

● Sort the key-word pairs by the keys

● At run time,

– Detect a misspelled word (e.g., bigram or trigram matching)

– Compute its key

– Find the key in the sorted list

– Check the key’s word and the words of a small number of its immediate neighbors to the left and right for possible suggestions

– Return the list of possible suggestions
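A minimal sketch of this run-time lookup using a sorted key list and binary search; it reuses the soundex_key function sketched earlier, and the neighbor window size is an arbitrary choice:

```python
import bisect

def build_key_index(dictionary, key_fn):
    """Sorted list of (key, word) pairs for the whole dictionary."""
    return sorted((key_fn(w), w) for w in dictionary)

def suggest(misspelling, index, key_fn, window=2):
    """Words whose keys sit near the misspelling's key in the index."""
    key = key_fn(misspelling)
    i = bisect.bisect_left(index, (key, ""))
    lo, hi = max(0, i - window), min(len(index), i + window + 1)
    return [word for _, word in index[lo:hi]]

index = build_key_index(["Bush", "Busch", "QUAYLE", "KWAIL"], soundex_key)
print(suggest("Bussh", index, soundex_key))  # includes Bush and Busch
```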

Probabilistic Techniques

● Probabilistic techniques use two types of probabilities: transition probabilities and confusion probabilities

● Transition probabilities are language dependent and give the conditional probability that a given letter or sequence of letters is followed by another letter or sequence of letters (these are computed with n-grams)

● Confusion probabilities are estimates of how frequently a given letter is mistaken for another letter

● Confusion probabilities are source dependent (e.g., in OCR each OCR engine must be repeatedly evaluated to estimate its confusion probabilities accurately)

Conditional Probabilities in Error Correction

● Let X be a dictionary word and Y be a misspelled string

● We need to estimate P(X|Y), which by Bayes’ rule is:

P(X|Y) = P(Y|X) P(X) / P(Y)

● P(X) is the probability of X

● P(Y) is the probability of Y

● P(Y|X) is the probability of the misspelling Y given that X is the correct word

Conditional Probabilities in Error Correction

● Let X be a dictionary word and Y be a misspelled string

● To find the most probable correct spelling X of the misspelled string Y, we need to maximize: G(X|Y) = log P(Y|X) + log P(X)

● P(X) is the unigram probability of X

● Estimation of log P(Y|X) is harder but doable:

log P(Y|X) = Σ_{i=1}^{n} log P(Y_i|X_i)

where n is the length of the longer of the two words (X or Y) and X_i, Y_i are the characters in the i-th position

Example

● Suppose that the misspelled wordform is FTA. What is the probability that the correct word is FAT?

● G(X=FAT|Y=FTA) = log P(Y=FTA|X=FAT) + log P(X=FAT) = log P(F|F) + log P(T|A) + log P(A|T) + log P(FAT)

● P(F|F), P(T|A), P(A|T) must be estimated. For example, we can run experiments for a specific OCR engine (e.g., Tesseract) to get these estimates.
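A sketch of this scoring with invented confusion probabilities and an invented unigram probability P(FAT); none of the numbers come from a real OCR engine:

```python
import math

# Invented confusion probabilities P(observed | intended); placeholders only.
confusion = {("F", "F"): 0.95, ("T", "A"): 0.02, ("A", "T"): 0.02}

def score(candidate, observed, p_word, confusion):
    """G(X|Y) = log P(Y|X) + log P(X), summing per-character
    confusion log-probabilities position by position."""
    log_pyx = sum(math.log(confusion.get((y, x), 1e-6))
                  for x, y in zip(candidate, observed))
    return log_pyx + math.log(p_word)

# G(X=FAT | Y=FTA), with an invented unigram probability for FAT.
print(score("FAT", "FTA", p_word=1e-4, confusion=confusion))
```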
