Speech & NLP (Fall 2014): Spellchecking: N-Gram Analysis, Error Detection, Word Boundary Problem, Error Correction in Isolated Words

Speech & NLP | www.vkedco.blogspot.com | Vladimir Kulyukin

Page 2: Speech & NLP (Fall 2014): Spellchecking:  N-Gram Analysis, Error Detection, Word Boundary Problem, Error Correction in Isolated Words

Outline

● N-Gram Analysis

● Error Detection & Correction

● Word Boundary Problem

● Spellchecking Nonwords & Isolated Words

Page 3

N-Gram Analysis

Page 4

N-Gram Tables (NGTs)

● The simplest form of NGT is the bigram NGT

● The dimensions of a bigram NGT depend on the size of the alphabet

● For the 26 letters of the English alphabet, the dimensions of the bigram NGT are 26 x 26

● For larger character sets (e.g., Unicode), the table is considerably larger

Page 5

N-Gram Tables: N-Gram Weights

● There are multiple strategies for assigning weights to N-Grams in an NGT

● The simplest is binary: an N-Gram is either present or absent

● Another common weight is the frequency of an N-Gram in the corpus

● More sophisticated weights include tf*idf or weights based on some probability distribution

Page 6

Example: Bigram NGT for the Alphabet {A, B, C}

       A    B    C
  A    0   20   11
  B   50   12  125
  C    0   78   97

● Assume that there is some imaginary corpus where each wordform is a string over {A, B, C}

● Suppose that the weight of each bigram is equal to its frequency in the corpus

● The above NGT records the following bigram frequencies:

– AA occurs 0 times; AB occurs 20 times; AC occurs 11 times

– BA occurs 50 times; BB occurs 12 times; BC occurs 125 times

– CA occurs 0 times; CB occurs 78 times; CC occurs 97 times
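As a sketch, a frequency-weighted bigram NGT can be stored as a dictionary keyed by bigram rather than a full 2D array (a common choice when the table is sparse). The corpus below is a toy one invented for illustration, not the corpus behind the table above:

```python
from collections import defaultdict

def build_bigram_ngt(corpus):
    """Map each bigram to its frequency across all wordforms in the corpus."""
    ngt = defaultdict(int)
    for wordform in corpus:
        for i in range(len(wordform) - 1):
            ngt[wordform[i:i + 2]] += 1
    return ngt

# Toy corpus of strings over the alphabet {A, B, C}
ngt = build_bigram_ngt(["ABCCABCCAB", "ABCCBAA"])
print(ngt["AB"], ngt["CA"], ngt["AA"])  # 4 2 1
```

A `defaultdict` returns weight 0 for any bigram never seen in the corpus, which matches the zero entries of the full table.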

Page 7

Trigram NGTs

● A trigram NGT is a 3D array (a cube)

● The dimensions of the cube depend on the size of

the alphabet

● Like a bigram NGT, a trigram NGT records weights

of every possible trigram

● Like in a bigram NGT, a trigram NGT can be binary

(i.e., recording presence/absence of a trigram in

the corpus), real (i.e., some real weights)

Page 8

Nonpositional vs. Positional NGTs

● If an NGT records only weights (numeric or

symbolic) in its entries it is called nonpositional

● Nonpositional is used in the sense that the

recorded weight does not take into account of

where in the string a specific n-gram occurs

● Positional NGTs typically add the start and the

end indices to each weight that record where a

specific n-gram occurs in some string

Page 9

Example: Nonpositional vs. Positional Bigrams

● We have two strings S1 and S2 in the corpus

● Bigrams for S1: AB [0,1], BC [1,2], CC [2,3], CA [3,4], AB [4,5], BC [5,6], CC [6,7], CA [7, 8], AB [8,9]

● Bigrams for S2: AB [0,1], BC [1,2], CC [2,3], CB [3,4], BA [4,5], AA [5,6]

S1 = A B C C A B C C A B   (positions 0-9)
S2 = A B C C B A A         (positions 0-6)

Non-positional Bigram Table:

       A   B   C
  A    1   4   0
  B    1   0   3
  C    2   1   3

Positional Bigram Table:

       A                      B                                C
  A    [5,6] - 1              [0,1] - 2; [4,5] - 1; [8,9] - 1  NULL
  B    [4,5] - 1              NULL                             [1,2] - 2; [5,6] - 1
  C    [3,4] - 1; [7,8] - 1   [3,4] - 1                        [2,3] - 2; [6,7] - 1

Page 10

Example: Positional NGT Table Entry

● Here is how we can read the sample entry for AB in the positional bigram NGT on the previous slide:

– AB occurs 2 times in position [0, 1] in the corpus;

– AB occurs once in position [4, 5] in the corpus;

– AB occurs once in position [8, 9] in the corpus.

Entry at row A, column B: [0, 1] - 2; [4, 5] - 1; [8, 9] - 1
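A positional bigram NGT can be sketched the same way as a nonpositional one, except that each bigram maps to a table from (start, end) positions to counts. The strings passed in below are S1 and S2 from the earlier example:

```python
from collections import defaultdict

def build_positional_bigram_ngt(corpus):
    """Map each bigram to {(start, end): count} across the corpus."""
    ngt = defaultdict(lambda: defaultdict(int))
    for wordform in corpus:
        for i in range(len(wordform) - 1):
            ngt[wordform[i:i + 2]][(i, i + 1)] += 1
    return ngt

# S1 = ABCCABCCAB, S2 = ABCCBAA
ngt = build_positional_bigram_ngt(["ABCCABCCAB", "ABCCBAA"])
print(dict(ngt["AB"]))  # {(0, 1): 2, (4, 5): 1, (8, 9): 1}
```

The printed entry for AB matches the one read off on this slide: twice at [0, 1], once each at [4, 5] and [8, 9].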

Page 11

Corpus-Based vs. Text-Based NGTs

● The most frequent source of N-Grams for an NGT is a text corpus

● However, there are approaches that advocate computing NGTs for individual texts (journal articles, web pages, etc.)

● The longer the document, the more reliable the NGT

● Possible misspellings are detected as infrequent or non-existent N-Grams

Page 12

Word Boundary Problem

Page 13

Word Boundary Problem

● A typical definition of word boundaries is through whitespace characters (tabs, newlines, spaces, etc.)

● This definition does not work well in automated spellchecking because of run-ons and splits:

● Examples of run-ons:

– atthe, ina, ofthe, forgetit, understandyou

● Examples of splits:

– at t he, sp end, un derstand you

Page 14

Challenges of Run-ons & Splits

● A major difficulty in dealing with run-ons & splits is that both frequently result in valid wordforms

● Examples:

– forgot, when split after r, results in two valid wordforms, for and got, i.e., for got

– A run-on of drive- and in results in drive-in (this type of error is frequent in OCR)

Page 15

Challenges of Run-ons & Splits

● Another major challenge in dealing with run-ons & splits is that any relaxation of the standard word boundary definition results in a combinatorial explosion of possible strings to check

● Successful applications have been designed only for limited domains

● Research indicates that many run-ons and splits involve a relatively small set of high-frequency words

Page 16

Error Detection & Correction

Page 17

Error Detection vs. Error Correction

● There are two main problems in automated spellchecking: error detection & error correction

● Error detection is discovering that a string is not in the list of known strings (aka the dictionary)

● Error correction is a harder problem: a string that is not in the dictionary must somehow be changed into one of the strings in the dictionary

Page 18

Nonword Error Correction

● A nonword is a wordform that is not on the list of known wordforms

● Example:

– the wordform ATER would not typically be on any list of known English wordforms (though it could be an abbreviation for a company or a scientific term)

– suggested spelling corrections are AFTER, LATER, ALTER, WATER, ATE

Page 19

N-Gram-Based Error Detection

● Collect a representative sample of texts

● Compute the N-Gram table (NGT) for the text corpus

● Given an input string, compute its list of N-Grams

● For each N-Gram in the input string:

– check it against the NGT

– if the input string contains N-Grams that are not in the NGT or N-Grams that are infrequent (this is typically decided by a threshold), mark it as a misspelling
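The detection loop on this slide can be sketched as follows; the dictionary corpus, the threshold value, and the function names are illustrative assumptions:

```python
from collections import defaultdict

def build_bigram_ngt(corpus):
    """Frequency-weighted bigram NGT over a sample of wordforms."""
    ngt = defaultdict(int)
    for wordform in corpus:
        for i in range(len(wordform) - 1):
            ngt[wordform[i:i + 2]] += 1
    return ngt

def is_misspelling(word, ngt, threshold=1):
    """Mark a word if any of its bigrams is absent from the NGT
    or falls below the frequency threshold."""
    return any(ngt[word[i:i + 2]] < threshold for i in range(len(word) - 1))

ngt = build_bigram_ngt(["water", "later", "alter", "after"])
print(is_misspelling("wqter", ngt))  # True: bigrams 'wq' and 'qt' never occur
print(is_misspelling("water", ngt))  # False
```

With a real corpus the threshold would be tuned so that rare but legitimate bigrams are not flagged.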

Page 20

Dictionary Lookup Techniques

● Nonwords can be detected via dictionary lookup techniques

● The challenges are 1) efficient key lookup; 2) efficient storage of dictionaries (possibly on multiple nodes); 3) vocabulary selection for the dictionary

● Typical data structures are

– Hash tables

– Frequency-ordered binary search trees

– Tries
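Of the three data structures, the trie is the least standard-library-friendly, so here is a minimal sketch of one as nested dictionaries, with a hypothetical "$" end-of-word marker:

```python
def make_trie(words):
    """Build a nested-dict trie; '$' marks the end of a stored wordform."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def in_trie(trie, word):
    """Dictionary lookup: walk the trie character by character."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

trie = make_trie(["water", "wate", "later"])
print(in_trie(trie, "water"), in_trie(trie, "wat"))  # True False
```

The end-of-word marker is what distinguishes a stored word ("wate") from a mere prefix ("wat").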

Page 21

Some Word Error Correction Approaches

● N-Gram techniques

● Minimum edit distances

● Key similarity

● Rule-based techniques

● Probabilistic techniques

● Neural networks

Page 22

Basic Error Types

● Approximately 80% of all misspelled words contain a single instance of one of the four types:

– Insertion: receeive (extra e is inserted)

– Deletion: rceive (e after r is deleted)

– Substitution: reseive (s is substituted for c)

– Transposition: recieve (e and i are transposed)

Page 23

Minimum Edit Distance

● Minimum edit distance techniques use dynamic programming to compute the distance between a source string and a target string in terms of the number of insertions, deletions, transpositions, and substitutions

● Some approaches use the trie data structure to compute the similarity between a source and a target

● Minimum edit distance techniques based on dynamic programming are expensive
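A standard dynamic-programming sketch of the distance over the four operations (the optimal-string-alignment variant of Damerau-Levenshtein, which handles adjacent transpositions):

```python
def edit_distance(source, target):
    """Edit distance counting insertions, deletions, substitutions,
    and adjacent transpositions (optimal string alignment)."""
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of source
    for j in range(n + 1):
        d[0][j] = j                      # insert all of target
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
            if (i > 1 and j > 1 and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(edit_distance("recieve", "receive"))  # 1 (one transposition)
print(edit_distance("ater", "water"))       # 1 (one insertion)
```

The quadratic table is the cost the slide alludes to: every (source, candidate) pair needs an m-by-n computation.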

Page 24

Generic N-Gram Correction Algorithm

● Build the NGT (bigram or trigram, positional or non-positional)

● At run time:

– Detect misspellings through the presence of infrequent or non-existent N-Grams

– Compute the number of N-Grams common to a misspelling and a dictionary word

– Use the number of common N-Grams in a similarity measure k(C/(n1 + n2)), where k is a constant, C is the number of common N-Grams, and n1 and n2 are the lengths of the two strings

– Another similarity measure is C/max(n1, n2)
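The two similarity measures can be sketched as below; the slide leaves n1 and n2 as the lengths of the two strings, which is what this code assumes (some formulations use the n-gram counts instead), and common bigrams are counted with multiplicity:

```python
from collections import Counter

def bigrams(word):
    return [word[i:i + 2] for i in range(len(word) - 1)]

def common_bigrams(s1, s2):
    """C: number of bigrams shared by the two strings, with multiplicity."""
    return sum((Counter(bigrams(s1)) & Counter(bigrams(s2))).values())

def similarity(s1, s2, k=1.0):
    """k * (C / (n1 + n2))"""
    return k * common_bigrams(s1, s2) / (len(s1) + len(s2))

def similarity_max(s1, s2):
    """C / max(n1, n2)"""
    return common_bigrams(s1, s2) / max(len(s1), len(s2))

print(similarity_max("ater", "water"))  # 0.6: shares at, te, er
```

A corrector would rank dictionary words by one of these scores and return the top few as suggestions.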

Page 25

Similarity Key Techniques

● The basic idea is to map every string to a key in such a way that similarly spelled strings have identical or similar keys (the same idea is used in hash tables)

● The SOUNDEX system was one of the first key-mapping techniques; it computes keys according to the following map:

– A, E, I, O, U, H, W, Y → 0

– B, F, P, V → 1

– C, G, J, K, Q, S, X, Z → 2

– D, T → 3

– L → 4

– M, N → 5

– R → 6

● A word's key consists of its first letter plus the digits computed according to the map above

● Zeros are eliminated and runs of the same digit are conflated
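A sketch of this SOUNDEX variant (first letter kept, runs of the same digit conflated, then zeros dropped):

```python
# Letter-to-digit map from the slide: group index is the digit
SOUNDEX_MAP = {ch: str(d) for d, group in enumerate(
    ["AEIOUHWY", "BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"]) for ch in group}

def soundex_key(word):
    """First letter + digits for the remaining letters, with runs of the
    same digit conflated and zeros eliminated."""
    word = word.upper()
    digits = [SOUNDEX_MAP[ch] for ch in word[1:] if ch in SOUNDEX_MAP]
    conflated = []
    for d in digits:
        if not conflated or conflated[-1] != d:
            conflated.append(d)
    return word[0] + "".join(d for d in conflated if d != "0")

print(soundex_key("Bush"), soundex_key("Busch"))    # B2 B2
print(soundex_key("QUAYLE"), soundex_key("KWAIL"))  # Q4 K4
```

Note how Bush and Busch collide on B2: that collision is exactly what makes the key useful for finding spelling suggestions.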

Page 26

Examples of SOUNDEX Similarity Keys

● Bush → B020 → B2

● Busch → B0220 → B2

● QUAYLE → Q00040 → Q4

● KWAIL → K0004 → K4

Page 27

Generic Similarity Key Algorithm

● Compute the key for each word in the dictionary

● Sort the key-word pairs by the keys

● At run time,

– Detect a misspelled word (e.g., bigram or trigram matching)

– Compute its key

– Find the key in the sorted list

– Check the key's word and the words of a small number of its immediate neighbors to the left and right for possible suggestions

– Return the list of possible suggestions
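The algorithm above can be sketched with a sorted key index and binary search. The key function here is a deliberately toy one (first letter plus the remaining letters sorted), standing in for a real similarity key such as a SOUNDEX-style map:

```python
import bisect

def toy_key(word):
    """Toy similarity key: first letter + remaining letters sorted.
    A placeholder for a real key function (e.g., a SOUNDEX-style map)."""
    return word[0] + "".join(sorted(word[1:]))

def build_key_index(dictionary):
    """Sorted list of (key, word) pairs, computed once up front."""
    return sorted((toy_key(w), w) for w in dictionary)

def suggest(misspelling, index, window=2):
    """Return words whose keys sit at or near the misspelling's key."""
    i = bisect.bisect_left(index, (toy_key(misspelling), ""))
    return [w for _, w in index[max(0, i - window):i + window + 1]]

index = build_key_index(["water", "later", "alter", "after"])
print(suggest("watre", index))  # ['alter', 'later', 'water']
```

The transposition in "watre" does not change the sorted-letter key, so "water" lands at the lookup position and its list neighbors come along as extra candidates.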

Page 28

Probabilistic Techniques

● Probabilistic techniques use two types of probabilities: transition probabilities and confusion probabilities

● Transition probabilities are language-dependent and give the conditional probability that a given letter or sequence of letters is followed by another letter or sequence of letters (these are computed with n-grams)

● Confusion probabilities are estimates of how frequently a given letter is mistaken for another letter

● Confusion probabilities are source-dependent (e.g., in OCR, each OCR engine must be repeatedly evaluated to estimate its confusion probabilities accurately)

Page 29

Conditional Probabilities in Error Correction

● Let X be a dictionary word and Y be a misspelled string

● We need to estimate P(X|Y), which by Bayes' rule is

  P(X|Y) = P(Y|X) P(X) / P(Y)

● P(X) is the probability of X

● P(Y) is the probability of Y

● P(Y|X) is the probability of the misspelling Y given that X is the correct word

Page 30

Conditional Probabilities in Error Correction

● Let X be a dictionary word and Y be a misspelled string

● To find the most probable correct spelling X of the misspelled string Y, we need to maximize: G(X|Y) = log P(Y|X) + log P(X)

● P(X) is the unigram probability of X

● Estimation of log P(Y|X) is harder but doable:

  log P(Y|X) = Σ_{i=1}^{n} log P(Y_i|X_i),

  where n is the length of the longest word (X or Y), and Y_i, X_i are the characters in the i-th position
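As a sketch, scoring a candidate against a misspelling with G(X|Y) might look like the code below. The probability tables are invented toy numbers (real confusion probabilities would be estimated experimentally), and padding the shorter string is a simplification of the length mismatch the formula glosses over:

```python
import math

def g_score(candidate, misspelling, confusion, unigram_p, floor=1e-6):
    """G(X|Y) = log P(Y|X) + log P(X), with log P(Y|X) taken as the sum
    of per-position character confusion log-probabilities."""
    n = max(len(candidate), len(misspelling))
    x = candidate.ljust(n, "-")    # pad to equal length (a simplification)
    y = misspelling.ljust(n, "-")
    log_p_y_x = sum(math.log(confusion.get((yc, xc), floor))
                    for yc, xc in zip(y, x))
    return log_p_y_x + math.log(unigram_p[candidate])

# Invented numbers: characters are usually read correctly (0.9);
# T/A confusions happen occasionally (0.05)
confusion = {("F", "F"): 0.9, ("A", "A"): 0.9, ("T", "T"): 0.9,
             ("T", "A"): 0.05, ("A", "T"): 0.05}
unigram_p = {"FAT": 0.001, "FIT": 0.001}

print(g_score("FAT", "FTA", confusion, unigram_p) >
      g_score("FIT", "FTA", confusion, unigram_p))  # True
```

With these toy tables, FAT beats FIT as a correction for FTA because each of its three character pairs has non-negligible confusion probability.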

Page 31

Example

● Suppose that the misspelled wordform is FTA. What is the probability that the correct word is FAT?

  G(Y=FTA | X=FAT) = log P(Y_1=F | X_1=F) + log P(Y_2=T | X_2=A) + log P(Y_3=A | X_3=T) + log P(FAT)

● P(F|F), P(T|A), P(A|T) must be estimated. For example, we can run experiments for a specific OCR engine (e.g., Tesseract) to get these estimates.

Page 32

References

1. K. Kukich, "Techniques for Automatically Correcting Words in Text." ACM Computing Surveys, Vol. 24, No. 4, Dec. 1992.