Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach...

24
Intelligent Key Prediction by N-grams and Error- correction Rules Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong Potipiti Information Research and Development Division, National Electronics and Computer Technology Center(NECTEC), Thailand

Transcript of Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach...

Page 1: Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong Potipiti Information Research.

Intelligent Key Prediction byN-grams and Error-correction Rules

Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong PotipitiInformation Research and Development Division,National Electronics and Computer Technology

Center(NECTEC), Thailand

Page 2: Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong Potipiti Information Research.

Outline Introduction Related

Works The Approach Experiments Conclusions

Page 3: Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong Potipiti Information Research.

Introduction

The Thai-English Keyboard System, Problems and Solutions

Page 4: Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong Potipiti Information Research.

The Thai-English Keyboard System

174 characters. 83 in Thai. 52 in English. 39 in Symbol.

4 characters/1 Button. 2 languages. 2 sets of characters.

Use special keys to select languages and to distinguish a set of characters.

ฤ ฟ

EnglishThai

Shift

No shift

A

a

A

Page 5: Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong Potipiti Information Research.

The Problems Two annoyances for Thais to type

Thai-English bilingual texts. To switch between languages by

using a language switching key. To distinguish a set of characters by

using a combination of shift-key and a character-key.

Other multilingual users face the same problems.

Page 6: Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong Potipiti Information Research.

Related Works

Language Identification and Key Prediction

Page 7: Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong Potipiti Information Research.

Language Identification Microsoft Word 2000.

Automatic language identification.

Considering only surrounding text.

Problems. No key prediction. Often detect incorrect language. Difficult to switch to the proper

language.

Page 8: Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong Potipiti Information Research.

Key Prediction Zheng et al.

Automatic word prediction. Base on lexical tree. Word class n-grams are

employed. Problems.

No language identification. Required large amount of

memory space.

Page 9: Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong Potipiti Information Research.

The Approach

Overview, Language Identification, Key Prediction and Pattern Shortening

Page 10: Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong Potipiti Information Research.

The System Overview There are two main

processes in our system.

Automatic language identification.

Automatic Thai key prediction.

Language?

Key Input

Thai KeyPrediction

Output

Eng

Thai

Page 11: Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong Potipiti Information Research.

(1)

(2)1

11)(

m

iiiE KKpEprob

1

11)(

m

iiiT KKpTprob

Language IdentificationLanguage Identification

Page 12: Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong Potipiti Information Research.

Thai Key Prediction

)|(.),|(maxarg 2-1-1

..,2,1

iiiii

n

iccccKpcccp

n

(3)

Page 13: Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong Potipiti Information Research.

Error-correction Rules Some character

sequences often generate errors.

Error patterns are collected for error-correction process.

Mutual information is used to optimize the patterns.

Training corpus

Tri-gram prediction

Error correction rules

Errors from Tri-gram prediction

Page 14: Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong Potipiti Information Research.

Error-collection Rule Reduction

Pattern No.

Error Key Sequences

Correct Patterns

1. k[lkl9

2. mpklkl9

3. kkklkl9

4. lkl9

Page 15: Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong Potipiti Information Research.

Pattern Shortening

)()(

)()(

zpxp

zxpzxLm

y

yy

)()(

)()(

zpxp

zxpzxRm

y

yy

(4)

(5)

Page 16: Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong Potipiti Information Research.

The Error-correction Process

Finite Automata pattern matching is applied.

The longest pattern is selected.

Key Input

Finite Automata

Terminal?

No Yes

Correction

Page 17: Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong Potipiti Information Research.

Experiments

Language Identification and Key Prediction

Page 18: Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong Potipiti Information Research.

Language Identification Result

Bi-Gram is employed. Using artificial

corpus for testing. The first 6 characters

are enough.

m (number of first characters)

Identification Accuracy (%)

3 94.27

4 97.06

5 98.16

6 99.1

7 99.11

Page 19: Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong Potipiti Information Research.

Key Prediction Corpus Information

25 MB training corpus.

5 MB test corpus

Training Corpus

(%)

Test Corpus

(%)

Unshift Alphabets

88.63 88.95

Shift Alphabets

11.37 11.05

Page 20: Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong Potipiti Information Research.

Key Prediction Result

Prediction Accuracy from

Training Corpus (%)

Prediction Accuracy from

Test Corpus (%)

Tri-gram Prediction

93.11 92.21

Tri-gram Prediction +

Error Correction99.53 99.42

Page 21: Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong Potipiti Information Research.

Conclusions

The Approach and Future Works

Page 22: Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong Potipiti Information Research.

The Approach We have applied the tri-gram

model and error-correction rules for intelligent Thai key prediction and English-Thai language identification.

The experiment reports 99 percent in accuracy, which is very impressive.

Page 23: Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong Potipiti Information Research.

Future Works Hopefully, this technique is

applicable to other Asian languages and multilingual systems.

Our future work is to apply the algorithm to mobile phones, handheld devices and multilingual input systems.

Page 24: Intelligent Key Prediction by N-grams and Error-correction Rules Kanokwut Thanadkran, Virach Sornlertlamvanich and Tanapong Potipiti Information Research.

The End

Questions and Answers