Probabilistic Detection of Context-Sensitive Spelling Errors
Johnny Bigert, Royal Institute of Technology, Sweden
What?
Context-Sensitive Spelling Errors
Example: Nice whether today.
All words are found in the dictionary
If context is considered, the spelling of whether is incorrect
Why?
Why do we need detection of context-sensitive spelling errors?
These errors are quite frequent (reported to account for 16-40% of all errors)
Larger dictionaries result in more errors going undetected
They cannot be found by regular spell checkers!
Why not?
What about proposing corrections for the errors?
An interesting topic, but not the topic of this article
Detection is imperative; correction is an aid
Related work?
Are there no algorithms doing this already?
A full parser is perfect for the job
Drawbacks:
high accuracy is required
not available for many languages
manual labor is expensive
not robust
Related work?
Are there no other algorithms?
Several other algorithms (e.g. Winnow)
Some do correction
Drawbacks:
they require a set of easily confused words
normally, you don't know your spelling errors beforehand
How?
Prerequisites
We use PoS tag trigram frequencies from an annotated corpus
We are given a sentence and apply a PoS tagger
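The prerequisite step can be sketched as follows. This is a minimal sketch: the input format and the tag names are invented for illustration, not taken from the article.

```python
from collections import Counter

def trigram_counts(tagged_corpus):
    """Count PoS tag trigram frequencies over a tagged corpus.

    tagged_corpus: iterable of sentences, each a list of PoS tags
    (hypothetical input format)."""
    counts = Counter()
    for tags in tagged_corpus:
        # Pad with sentence-boundary markers so edge words get trigrams too
        padded = ["<s>"] + list(tags) + ["</s>"]
        for i in range(len(padded) - 2):
            counts[tuple(padded[i:i + 3])] += 1
    return counts

# Tiny illustration with made-up tags
freqs = trigram_counts([["dt", "nn", "vb", "jj"], ["dt", "nn", "vb", "ab"]])
```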
But?
But don’t you often encounter rare or unseen trigrams?
Yes, unfortunately
We modify the notion of frequency:
find and use other, "syntactically close" PoS trigrams
Close?
What is the syntactic distance between two PoS tags?
The probability that one tag can replace the other while retaining grammaticality
Distances are extracted from a corpus by an unsupervised learning algorithm
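One plausible reading of the resulting "generalized" frequency is to weight every observed trigram by how syntactically close its tags are to the target trigram's tags. The exact formula is not given on these slides, so the combination below (a product of per-tag replacement probabilities) is an assumption:

```python
def generalized_freq(trigram, counts, repl_prob):
    """Sum raw trigram counts weighted by how 'syntactically close'
    each observed trigram is to the target one.

    repl_prob[a][b]: assumed probability that tag a may replace tag b
    while retaining grammaticality (learned unsupervised in the article;
    the estimation itself is not reproduced here)."""
    total = 0.0
    for (a, b, c), f in counts.items():
        weight = (repl_prob.get(a, {}).get(trigram[0], 0.0)
                  * repl_prob.get(b, {}).get(trigram[1], 0.0)
                  * repl_prob.get(c, {}).get(trigram[2], 0.0))
        total += weight * f
    return total
```

With an identity replacement table (each tag replaces only itself, with probability 1), this reduces to the raw trigram count.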
Then?
The algorithm
We now have a generalized PoS tag trigram frequency
If the frequency is below a threshold, the text is probably ungrammatical
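The detection step is then a threshold test over the sentence's trigrams. The sketch below uses raw counts for brevity where the article uses the generalized frequency:

```python
def detect_errors(tags, counts, threshold):
    """Return the PoS trigrams of a tagged sentence whose frequency
    falls below the threshold; such positions are probably
    ungrammatical (e.g. a context-sensitive spelling error)."""
    padded = ["<s>"] + list(tags) + ["</s>"]
    return [tuple(padded[i:i + 3])
            for i in range(len(padded) - 2)
            if counts.get(tuple(padded[i:i + 3]), 0) < threshold]
```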
Result?
Summary so far
Unsupervised learning
Fully automatic algorithm
Detection of any error type
No manual labor!
Alas, phrase boundaries cause problems
Phrases?
What about phrases?
PoS tag trigrams overlapping two phrases are very productive
Rare phrases lead to rare trigrams
Transformations!
Transform?
How do we transform a phrase?
A shallow parser transforms phrases to their most common form, normally the head
Benefits: retains grammaticality, fewer rare trigrams, longer tagger scope
Example?
Example of phrase transformation
Only [NP the paintings that are old] are for sale
Only [NP the paintings] are for sale
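The transformation itself can be mimicked with a small helper; the shallow parser that finds the phrase span is assumed, not shown, and the span indices below are chosen by hand for the slide's example:

```python
def transform_phrase(tokens, span, canonical):
    """Replace a detected phrase with its most common form (normally
    the head). span = (start, end), end exclusive; canonical is the
    replacement token list. Hypothetical helper."""
    start, end = span
    return tokens[:start] + canonical + tokens[end:]

sent = "Only the paintings that are old are for sale".split()
# NP "the paintings that are old" occupies tokens 1..5 (end exclusive 6)
short = transform_phrase(sent, (1, 6), ["the", "paintings"])
```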
Then what?
How do we use the transformations?
Apply the tagger to the transformed sentence
Run the first part of the algorithm again
If any transformation yields only high-frequency trigrams, the sentence is ok
Otherwise, a probable error
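Putting the pieces together: the sentence is accepted if any variant (the original tagging or a transformed one) contains only frequent trigrams. A sketch, again using raw counts in place of the generalized frequency:

```python
def sentence_ok(tag_variants, counts, threshold):
    """Accept the sentence if some variant's trigrams are all at or
    above the threshold; otherwise report a probable error."""
    def all_frequent(tags):
        padded = ["<s>"] + list(tags) + ["</s>"]
        return all(counts.get(tuple(padded[i:i + 3]), 0) >= threshold
                   for i in range(len(padded) - 2))
    return any(all_frequent(tags) for tags in tag_variants)
```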
Result?
Summary
Trigram part: fully automatic
Phrase part: could use machine learning of rules for the shallow parser
Finds many difficult error types
The threshold determines the precision/recall trade-off
Evaluation?
Fully automatic evaluation
Introduce artificial context-sensitive spelling errors (using the software Missplel)
Automated evaluation procedure for 1, 2, 5, 10 and 20% misspelled words (using the software AutoEval)
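The error-introduction step could look roughly like this. The actual Missplel tool is more elaborate; the confusable table and replacement policy here are invented for illustration:

```python
import random

def introduce_errors(tokens, confusables, rate, rng):
    """With probability `rate`, replace a word that has confusable
    spellings by one of them, creating an artificial context-sensitive
    spelling error (a simplification of what Missplel does)."""
    return [rng.choice(confusables[tok])
            if tok in confusables and rng.random() < rate
            else tok
            for tok in tokens]

# rate=1.0 misspells every confusable word, reproducing the slide's example
noisy = introduce_errors(["nice", "weather", "today"],
                         {"weather": ["whether"]},
                         1.0, random.Random(0))
```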