Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach...
-
Upload
darren-mcbride -
Category
Documents
-
view
215 -
download
0
Transcript of Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach...
Intelligent Key Prediction byN-grams and Error-correction Rules
Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong PotipitiInformation Research and Development Division,National Electronics and Computer Technology
Center(NECTEC), Thailand
Outline Introduction Related
Works The Approach Experiments Conclusions
Introduction
The Thai-English Keyboard System, Problems and Solutions
The Thai-English Keyboard System
174 characters. 83 in Thai. 52 in English. 39 in Symbol.
4 characters/1 Button. 2 languages. 2 sets of characters.
Use special keys to select languages and to distinguish a set of characters.
ฤ ฟ
EnglishThai
Shift
No shift
A
a
A
The Problems Two annoyances for Thais to type
Thai-English bilingual texts. To switch between languages by
using a language switching key. To distinguish a set of characters by
using a combination of shift-key and a character-key.
Other multilingual users face the same problems.
Related Works
Language Identification and Key Prediction
Language Identification Microsoft Word 2000.
Automatic language identification.
Considering only surrounding text.
Problems. No key prediction. Often detect incorrect language. Difficult to switch to the proper
language.
Key Prediction Zheng et al.
Automatic word prediction. Base on lexical tree. Word class n-grams are
employed. Problems.
No language identification. Required large amount of
memory space.
The Approach
Overview, Language Identification, Key Prediction and Pattern Shortening
The System Overview There are two main
processes in our system.
Automatic language identification.
Automatic Thai key prediction.
Language?
Key Input
Thai KeyPrediction
Output
Eng
Thai
(1)
(2)1
11)(
m
iiiE KKpEprob
1
11)(
m
iiiT KKpTprob
Language IdentificationLanguage Identification
Thai Key Prediction
)|(.),|(maxarg 2-1-1
..,2,1
iiiii
n
iccccKpcccp
n
(3)
Error-correction Rules Some character
sequences often generate errors.
Error patterns are collected for error-correction process.
Mutual information is used to optimize the patterns.
Training corpus
Tri-gram prediction
Error correction rules
Errors from Tri-gram prediction
Error-collection Rule Reduction
Pattern No.
Error Key Sequences
Correct Patterns
1. k[lkl9
2. mpklkl9
3. kkklkl9
4. lkl9
Pattern Shortening
)()(
)()(
zpxp
zxpzxLm
y
yy
)()(
)()(
zpxp
zxpzxRm
y
yy
(4)
(5)
The Error-correction Process
Finite Automata pattern matching is applied.
The longest pattern is selected.
Key Input
Finite Automata
Terminal?
No Yes
Correction
Experiments
Language Identification and Key Prediction
Language Identification Result
Bi-Gram is employed. Using artificial
corpus for testing. The first 6 characters
are enough.
m (number of first characters)
Identification Accuracy (%)
3 94.27
4 97.06
5 98.16
6 99.1
7 99.11
Key Prediction Corpus Information
25 MB training corpus.
5 MB test corpus
Training Corpus
(%)
Test Corpus
(%)
Unshift Alphabets
88.63 88.95
Shift Alphabets
11.37 11.05
Key Prediction Result
Prediction Accuracy from
Training Corpus (%)
Prediction Accuracy from
Test Corpus (%)
Tri-gram Prediction
93.11 92.21
Tri-gram Prediction +
Error Correction99.53 99.42
Conclusions
The Approach and Future Works
The Approach We have applied the tri-gram
model and error-correction rules for intelligent Thai key prediction and English-Thai language identification.
The experiment reports 99 percent in accuracy, which is very impressive.
Future Works Hopefully, this technique is
applicable to other Asian languages and multilingual systems.
Our future work is to apply the algorithm to mobile phones, handheld devices and multilingual input systems.
The End
Questions and Answers