
Probabilistic Detection of Context-Sensitive Spelling Errors

Johnny Bigert, Royal Institute of Technology, Sweden

[email protected]

What?

Context-Sensitive Spelling Errors

Example: "Nice whether today."

All words are found in the dictionary, but if the context is considered, the spelling of "whether" is incorrect

Why?

Why do we need detection of context-sensitive spelling errors?

These errors are quite frequent (reported to be 16-40% of all errors)

Larger dictionaries result in more errors going undetected

They cannot be found by regular spell checkers!

Why not?

What about proposing corrections for the errors?

An interesting topic, but not the topic of this article

Detection is imperative; correction is an aid

Related work?

Are there no algorithms doing this already?

A full parser is perfect for the job

Drawbacks: high accuracy is required; not available for many languages; manual labor is expensive; not robust

Related work?

Are there no other algorithms?

Several other algorithms exist (e.g. Winnow); some do correction

Drawbacks: they require a set of easily confused words, and normally you don't know your spelling errors beforehand

Why?

What are the benefits of this algorithm?

Find any error; avoid extensive manual work; robustness

How?

Prerequisites

We use PoS tag trigram frequencies from an annotated corpus

We are given a sentence, and apply a PoS tagger

How?

Basic assumption

If any tag trigram frequency is low, that part is probably ungrammatical
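The basic assumption can be sketched in a few lines of Python; the tag names, toy corpus, and threshold below are illustrative assumptions, not data from the article:

```python
from collections import Counter

def trigram_counts(tagged_corpus):
    """Count PoS tag trigram frequencies in an annotated corpus
    (each sentence given as a list of PoS tags)."""
    counts = Counter()
    for tags in tagged_corpus:
        for i in range(len(tags) - 2):
            counts[tuple(tags[i:i + 3])] += 1
    return counts

def flag_rare_trigrams(tags, counts, threshold=1):
    """Return starting positions of trigrams whose corpus frequency is
    below the threshold -- the parts that are probably ungrammatical."""
    return [i for i in range(len(tags) - 2)
            if counts[tuple(tags[i:i + 3])] < threshold]

# Tiny illustration with made-up tag sequences:
corpus = [["dt", "nn", "vb", "jj"], ["dt", "nn", "vb", "rb"]]
counts = trigram_counts(corpus)
seen = flag_rare_trigrams(["dt", "nn", "vb", "jj"], counts)    # no rare spans
unseen = flag_rare_trigrams(["dt", "jj", "vb", "nn"], counts)  # positions 0 and 1
```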

But?

But don’t you often encounter rare or unseen trigrams?

Yes, unfortunately

We modify the notion of frequency: find and use other, "syntactically close" PoS trigrams

Close?

What is the syntactic distance between two PoS tags?

A probability that one tag is replaceable by another while retaining grammaticality

Distances are extracted from a corpus by an unsupervised learning algorithm
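One plausible way to extract such replacement probabilities without supervision is to compare the tag-context distributions of two tags; this context-overlap measure is an assumption for illustration, not necessarily the article's exact algorithm:

```python
from collections import Counter, defaultdict

def replacement_probs(tagged_corpus):
    """Estimate the probability that tag a can be replaced by tag b while
    retaining grammaticality, from how much of a's (left, right) tag-context
    distribution is shared with b's. This overlap measure is an illustrative
    assumption; the article's unsupervised algorithm may differ."""
    contexts = defaultdict(Counter)  # tag -> Counter over (left, right) pairs
    for tags in tagged_corpus:
        for i in range(1, len(tags) - 1):
            contexts[tags[i]][(tags[i - 1], tags[i + 1])] += 1
    probs = {}
    for a, ca in contexts.items():
        total = sum(ca.values())
        for b, cb in contexts.items():
            shared = sum(min(ca[c], cb[c]) for c in ca)
            probs[(a, b)] = shared / total  # 1.0 when b covers all of a's contexts
    return probs

# Toy corpus: "jj" occurs only in contexts that "nn" also occurs in.
corpus = [["dt", "nn", "vb"], ["dt", "jj", "vb"], ["dt", "nn", "vb"]]
probs = replacement_probs(corpus)
```

A tag always replaces itself with probability 1.0, and a tag whose contexts are a subset of another's is scored as highly replaceable by it.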

Then?

The algorithm

We have a generalized PoS tag trigram frequency

If the frequency is below a threshold, the text is probably ungrammatical
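A sketch of how the generalized frequency might combine raw trigram counts with tag-replacement probabilities; the exact weighting scheme here is an assumption, not taken from the article:

```python
def generalized_freq(trigram, counts, probs):
    """Generalized trigram frequency: sum the counts of all observed
    trigrams, each weighted by the probability that its tags can stand
    in for the query trigram's tags (weighting scheme assumed)."""
    t1, t2, t3 = trigram
    return sum(n * probs.get((t1, u1), 0.0)
                 * probs.get((t2, u2), 0.0)
                 * probs.get((t3, u3), 0.0)
               for (u1, u2, u3), n in counts.items())

def probably_ungrammatical(tags, counts, probs, threshold):
    """Flag positions whose generalized frequency falls below the threshold."""
    return [i for i in range(len(tags) - 2)
            if generalized_freq(tuple(tags[i:i + 3]), counts, probs) < threshold]

# With identity replacement probabilities this reduces to the raw count:
counts = {("dt", "nn", "vb"): 5, ("dt", "jj", "nn"): 2}
identity = {(t, t): 1.0 for tri in counts for t in tri}
freq = generalized_freq(("dt", "nn", "vb"), counts, identity)
flags = probably_ungrammatical(["dt", "nn", "vb", "nn"], counts, identity, 1)
```

With non-trivial replacement probabilities, a rare or unseen trigram can still receive a high generalized frequency if syntactically close trigrams are frequent.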

Result?

Summary so far

Unsupervised learning; automatic algorithm; detection of any error; no manual labor!

Alas, phrase boundaries cause problems

Phrases?

What about phrases?

PoS tag trigrams overlapping two phrases are very productive: rare phrases give rare trigrams

Transformations!

Transform?

How do we transform a phrase?

A shallow parser transforms phrases to their most common form, normally the head

Benefits: retain grammaticality, fewer rare trigrams, longer tagger scope

Example?

Example of phrase transformation

Only [NP the paintings that are old] are for sale
Only [NP the paintings] are for sale

Then what?

How do we use the transformations?

Apply the tagger to the transformed sentence and run the first part of the algorithm again

If any transformation yields only trigrams with high frequency, the sentence is ok

Otherwise, a probable error
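The whole decision procedure can be sketched as follows; the tag sequences and counts are made up for illustration:

```python
from collections import Counter

def rare_spans(tags, counts, threshold):
    """Positions of trigrams with corpus frequency below the threshold."""
    return [i for i in range(len(tags) - 2)
            if counts[tuple(tags[i:i + 3])] < threshold]

def detect(original_tags, transformed_variants, counts, threshold=1):
    """A sentence passes if its own tag sequence, or that of any
    phrase-transformed variant, contains no rare trigrams; otherwise
    a probable error is reported."""
    for tags in [original_tags] + list(transformed_variants):
        if not rare_spans(tags, counts, threshold):
            return "ok"
    return "probable error"

counts = Counter({("dt", "nn", "vb"): 3, ("nn", "vb", "jj"): 2})
# The original tags contain an unseen trigram, but the variant with the
# phrase reduced to its head does not (tag sequences are hypothetical):
verdict = detect(["dt", "nn", "rb", "vb", "jj"],
                 [["dt", "nn", "vb", "jj"]], counts)
bad = detect(["dt", "jj", "rb", "vb"], [["jj", "dt", "vb"]], counts)
```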

Result?

Summary

Trigram part: fully automatic

Phrase part: could use machine learning of rules for the shallow parser

Finds many difficult error types

The threshold determines the precision/recall trade-off

Evaluation?

Fully automatic evaluation

Introduce artificial context-sensitive spelling errors (using the software Missplel)

Automated evaluation procedure for 1, 2, 5, 10 and 20% misspelled words (using the software AutoEval)
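A minimal sketch of such error injection, assuming a simple confusion mapping; Missplel's and AutoEval's actual procedures are not shown in the article:

```python
import random

def inject_errors(words, confusions, rate, seed=0):
    """Sketch of artificial context-sensitive spelling errors: at the given
    rate, replace a word with a confusable real word (so a plain spell
    checker still accepts it). The confusion mapping is an assumption."""
    rng = random.Random(seed)  # fixed seed for a reproducible evaluation set
    return [confusions[w] if w in confusions and rng.random() < rate else w
            for w in words]

confusions = {"weather": "whether", "there": "their"}
clean = ["nice", "weather", "today"]
all_replaced = inject_errors(clean, confusions, rate=1.0)  # every candidate hit
none = inject_errors(clean, confusions, rate=0.0)          # unchanged
```

Running the detector over texts injected at 1, 2, 5, 10 and 20% error rates, with known error positions, lets precision and recall be computed automatically.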

Results? 1% errors

Results? 2% errors