Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

Walid Magdy & Kareem Darwish, IBM Technology Development Center, PO Box 166 El-Ahram, Giza, Egypt, {wmagdy,darwishk}@eg.ibm.com


Page 1: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

Walid Magdy & Kareem Darwish
IBM Technology Development Center
PO Box 166 El-Ahram, Giza, Egypt
{wmagdy,darwishk}@eg.ibm.com

Page 2: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

Outline:

1. Motivation

2. Background

3. Approach

4. Experimental Setup

5. Results

6. Conclusion

7. Future Work

Page 3: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

Motivation:

Timeline, 1400 to 2000:

• First printing press

• Read to search

• E-text becomes commonplace

• Automated full text search

• 1998: Arabic e-text comes online

Problem: 500+ years of legacy documents

Goal: To search printed documents efficiently and effectively

Does OCR solve the problem?

Page 4: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

Arabic Language Challenges

• Orthography
– Character shape depends on position
– 15 of the 28 letters contain dots
– Optional diacritics may be present
– Printed text may include ligatures and kashida

• Morphology
– Prefix, infix, and suffix
– 6 x 10^10 possible surface forms

• Other factors
– Eighth most widely spoken language in the world
– Web growth started only recently

Example:

وسـيــ + كـتب + ونـهـا
wasaya + ktub + uunahaa
and will + write + they it

= and they will write it

Page 5: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

Arabic Pre-processing & Retrieval

• Pre-processing:
– Remove diacritics
– Normalize different forms of alef & ya to account for
∙ Common spelling errors
∙ Grammatical, morphological, and orthographic properties
– Letter forms involved: ا، إ، آ، أ، ء، ؤ، ئ and ي، ى

• Text Retrieval: Best Index Terms
– Regular text: Light stemming and character 3- & 4-grams are best
– OCR text: character 3- & 4-grams are best
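The pre-processing and n-gram indexing steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `normalize` and `char_ngrams` are hypothetical, the exact set of alef forms folded together is an assumption, and stripping the kashida (tatweel) character is an extra assumption not stated on the slide.

```python
import re

# Arabic short-vowel and related marks, U+064B (fathatan) through U+0652 (sukun).
DIACRITICS = re.compile("[\u064B-\u0652]")
TATWEEL = "\u0640"                 # kashida; removing it is an assumption
ALEF_FORMS = "\u0622\u0623\u0625"  # alef madda, alef hamza above, alef hamza below
ALEF = "\u0627"                    # bare alef
ALEF_MAQSURA = "\u0649"
YA = "\u064A"


def normalize(text):
    """Remove diacritics and fold alef and ya variants to one form each."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    for form in ALEF_FORMS:
        text = text.replace(form, ALEF)
    return text.replace(ALEF_MAQSURA, YA)


def char_ngrams(word, n):
    """Overlapping character n-grams; words no longer than n are kept whole."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("walid", 3))  # ['wal', 'ali', 'lid']
```

Short index terms such as these limit the damage of a single OCR error to the few n-grams that overlap it, which is one plausible reading of why n-grams do well on OCR text.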

Page 6: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

Main Idea:

Image: Word-Based Correction for Retrieval of Arabic OCR Degraded Documents

→ OCR →

Degraded Text: VVorcl-Easod Comectlon l0r Belrieval of Arahie OCRDcgraclod Doeurnerits

→ Correction →

Corrected Text: Word-Based Correction for Retrieval of Arabic OCR Degraded Documents

We want to examine the effect of correction on retrieval

Page 7: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

Approach:

OCR system → OCR Degraded Text → OCR Correction → OCR Corrected Text → Indexing → Ranked List of Documents

Page 8: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

Experimental Setup:

• Test collections

• Error Correction

• Building Error Model

• Training & Decoding

• Experiments

Page 9: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

Document Collections:

ZAD:

• Printed 14th century religious book, scanned at 300x300 dpi and OCR'ed

• 2,730 documents

• 25 topics

• Real degraded text by OCR process

• WER = 39%

TREC 2002 CLIR:

• Arabic newswire articles from Agence France-Presse (AFP)

• 383,872 articles

• 50 topics

• Synthetic degraded text using degradation model

• WER = 30.8%

Page 10: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

The ZAD Collection:

Sample Document: [scanned page image]

Sample Query: حكم التيمم ومتى شرع ("the ruling on tayammum and when it was legislated")

Page 11: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

The TREC 2002 CLIR Collection:

Sample Query: سجناء حرب ايرانيين وعراقيين ("Iranian and Iraqi prisoners of war")

Sample Document:

<DOC><DOCNO>19940513_AFP_ARB0001</DOCNO><HEADER> 7710ع 4 0800ارا- تصج / افب / 86قبرص / ذاتي حكم سالم االوسط الشرق </HEADER><BODY><HEADLINE> &HT; اريحا كنيس فوق 1رفع ي لم الفلسطيني <HEADLINE/> العلم<TEXT><P> ) الغربية ) الضفة (- 5-31اريحا مدخل ) بحراسة الفلسطينية الشرطة عناصر احد يقوم ب افاال الفلسطينية الشرطة الى تسليمها تم التي المدينة مواقع آخر احد اريحا وسط في اليهودي الكنيس

الكنيس فوق الفلسطيني العلم رفع يتم لم <P/> انه<P> " مكان هذا الكنيس فوق الفلسطيني العلم رفع تحاول كانت لفلسطينية فلسطيني ضابط وقال<P/> "مقدس<P> ما االسرائيليون الجنود كان الذي الكنيس مدخل من يهود مستوطنين ثالثة اقترب ذلك وقبيل

ثيابهم بتمزيق قاموا الدخول من الجنود منعهم وعندما حراسته يوءمنون <P/> زالوا</TEXT>

Page 12: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

OCR-Correction Model:

Training: Manually Corrected OCR Text + OCR Degraded Text → Aligning Characters Mapping → Build Error Model

Decoding: OCR Degraded Text → Generate Corrections → Pick the most likely correction using Bayes' Rule → OCR Corrected Text

Page 13: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

Aligning Characters Mapping:

Ex: walid → vvaicl (S = substitution, I = insertion, D = deletion, √ = match)

m:n Mapping

w → vv (S)
a → a (√)
l → Null (D)
i → i (√)
d → cl (S)

1:1 Mapping

w → v (S)
Null → v (I)
a → a (√)
l → Null (D)
i → i (√)
d → c (S)
Null → l (I)
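A 1:1 character alignment of the kind shown for walid and vvaicl can be obtained with a standard edit-distance table and a backtrace. The sketch below is an assumption about how such an alignment could be produced, not the authors' tool; the function name `align` and the operation codes are hypothetical, and ties between equally cheap alignments may be broken differently from the slide.

```python
def align(clean, ocr):
    """Return a minimum-edit 1:1 alignment as (clean_char, ocr_char, op) tuples.

    op is "M" (match), "S" (substitution), "D" (deletion) or "I" (insertion);
    None stands for the Null side of a deletion or insertion.
    """
    n, m = len(clean), len(ocr)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (clean[i - 1] != ocr[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Walk back from the bottom-right corner to recover one optimal path.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (clean[i - 1] != ocr[j - 1]):
            op = "M" if clean[i - 1] == ocr[j - 1] else "S"
            ops.append((clean[i - 1], ocr[j - 1], op))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append((clean[i - 1], None, "D"))
            i -= 1
        else:
            ops.append((None, ocr[j - 1], "I"))
            j -= 1
    ops.reverse()
    return ops


if __name__ == "__main__":
    for clean_char, ocr_char, op in align("walid", "vvaicl"):
        print(clean_char, ocr_char, op)
```

An m:n mapping could then be derived by merging adjacent non-match operations into one multi-character segment pair, for example merging w → v and Null → v into w → vv.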

Page 14: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

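Building Error Model:

$$P_{substitution}(C_k..C_l \rightarrow D_x..D_y) = \frac{count(C_k..C_l \rightarrow D_x..D_y)}{count(C_k..C_l)}$$

$$P_{deletion}(C_k..C_l \rightarrow \varepsilon) = \frac{count(C_k..C_l \rightarrow \varepsilon)}{count(C_k..C_l)}$$

$$P_{insertion}(\varepsilon \rightarrow D_x..D_y) = \frac{count(\varepsilon \rightarrow D_x..D_y)}{count(C)}$$

Where $C_k..C_l$ and $D_x..D_y$ are sequences of one or more characters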
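The relative-frequency estimates on this slide can be sketched as simple counting over aligned segment pairs. This is an illustrative assumption, not the authors' implementation: `build_error_model` is a hypothetical name, an empty string stands for the null segment, identity pairs such as w → w are counted like any other substitution, and `total_clean_chars` is taken to be the denominator used for insertions.

```python
from collections import Counter


def build_error_model(aligned_pairs, total_clean_chars):
    """Estimate P(ocr_segment | clean_segment) from aligned segment pairs.

    aligned_pairs: iterable of (clean_segment, ocr_segment) strings, where ""
    marks a deletion (on the OCR side) or an insertion (on the clean side).
    total_clean_chars: count of clean characters, used only for insertions.
    """
    pairs = list(aligned_pairs)
    pair_counts = Counter(pairs)
    clean_counts = Counter(clean for clean, _ in pairs if clean)
    model = {}
    for (clean, ocr), count in pair_counts.items():
        denominator = clean_counts[clean] if clean else total_clean_chars
        model[(clean, ocr)] = count / denominator
    return model


if __name__ == "__main__":
    toy_pairs = [("w", "vv"), ("w", "w"), ("w", "w"), ("w", "w"),
                 ("d", "cl"), ("d", "d"), ("", "v")]
    for pair, prob in sorted(build_error_model(toy_pairs, 10).items()):
        print(pair, prob)
```

The toy pairs are invented for the demonstration and are not data from the talk.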

Page 15: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

Decoding:

Bayes' Rule:

$$\widehat{Word}_{correct} = \arg\max_{Word_{correct}} P(Word_{correct} \mid Word_{OCR}) = \arg\max_{Word_{correct}} P(Word_{OCR} \mid Word_{correct})\, P(Word_{correct})$$

Character Level model:

$$P(Word_{OCR} \mid Word_{correct}) = \prod_{\text{all } D_x..D_y} P(D_x..D_y \mid C_k..C_l)$$

Word Level model:

$P(Word_{correct})$ = LM probability (used simple unigram probability)

Page 16: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

Example:

Character Level Model:

1. Segmentation

2. Mapping

3. Generate Candidates

Ex: dairn

Segmentation (all 16 ways to split the word):

d a i r n
da i r n
d ai r n
dai r n
d a ir n
da ir n
d air n
dair n
d a i rn
da i rn
d ai rn
dai rn
d a irn
da irn
d airn
dairn

Mapping, shown for the segmentation d a i rn (ε = empty segment):

d → d 0.8, h 0.1, cl 0.08, ε 0.02
a → a 0.9, o 0.05, r 0.02, oi 0.015, ε 0.005, n 0.005, e 0.005
i → i 0.84, l 0.12, ε 0.02, t 0.015, ll 0.005, (label lost) 0.005
rn → rn 0.7, m 0.15, im 0.02, ln 0.015, ε 0.005
(segment label lost) → l 0.09, i 0.05, li 0.02, s 0.015, f 0.005, t 0.005, a 0.005

Generated candidates:

dairn 0.425
daim 0.091
claim 0.0091
aim 0.00227
horn 0.00007
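The segmentation and candidate-generation steps of the dairn example can be sketched as below. This is a hedged illustration rather than the authors' decoder: `segmentations` and `candidates` are hypothetical names, `TABLE` copies only a subset of the mapping probabilities from the slide, and keeping the highest score when two derivations yield the same word is an assumption.

```python
from itertools import product


def segmentations(word):
    """All ways to split word into contiguous, non-empty segments."""
    if not word:
        return [[]]
    result = []
    for i in range(1, len(word) + 1):
        for rest in segmentations(word[i:]):
            result.append([word[:i]] + rest)
    return result


# Subset of the slide's mapping table; "" stands for the empty segment.
TABLE = {
    "d": {"d": 0.8, "h": 0.1, "cl": 0.08, "": 0.02},
    "a": {"a": 0.9, "o": 0.05},
    "i": {"i": 0.84, "l": 0.12},
    "rn": {"rn": 0.7, "m": 0.15},
}


def candidates(segments, table):
    """Score every combination of per-segment replacements by their product."""
    scored = {}
    for combo in product(*(table[seg].items() for seg in segments)):
        word = "".join(replacement for replacement, _ in combo)
        score = 1.0
        for _, prob in combo:
            score *= prob
        scored[word] = max(score, scored.get(word, 0.0))
    return scored


if __name__ == "__main__":
    print(len(segmentations("dairn")))
    scores = candidates(["d", "a", "i", "rn"], TABLE)
    for word in ("daim", "claim", "aim"):
        print(word, scores[word])
```

A word of n characters has 2^(n-1) segmentations, which gives the 16 splits listed for dairn.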

Page 17: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

Example (cont):

Word Level Model:

Find the frequency of occurrence of each generated word in the dictionary

P(dairn | dairn) = 0.425, Freq(dairn) = 0
P(daim | dairn) = 0.091, Freq(daim) = 0
P(claim | dairn) = 0.0091, Freq(claim) = 1500
P(aim | dairn) = 0.00227, Freq(aim) = 4000
P(horn | dairn) = 0.00007, Freq(horn) = 150

dairn → claim
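Combining the character-level scores with unigram frequencies, as in the example above, can be sketched like this. The function name `pick_correction` is hypothetical, and normalizing frequencies by the total of the listed words is an assumption made only to turn counts into probabilities; it does not change which candidate wins.

```python
def pick_correction(char_scores, freq):
    """Return the candidate maximizing character-level score times unigram probability."""
    total = sum(freq.values())

    def combined(word):
        return char_scores[word] * freq.get(word, 0) / total

    return max(char_scores, key=combined)


if __name__ == "__main__":
    # Values taken from the slide's example for the OCR token "dairn".
    char_scores = {"dairn": 0.425, "daim": 0.091, "claim": 0.0091,
                   "aim": 0.00227, "horn": 0.00007}
    freq = {"dairn": 0, "daim": 0, "claim": 1500, "aim": 4000, "horn": 150}
    print(pick_correction(char_scores, freq))
```

With the slide's numbers, claim scores 0.0091 x 1500 = 13.65 before normalization against 0.00227 x 4000 = 9.08 for aim, so claim is selected even though aim is the more frequent word.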

Page 18: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

IR Experiments

• The degraded collections were corrected, and the best one, two, three, and five corrections for each word were selected for indexing

• The collections were indexed and searched using words, character 3-grams, character 4-grams, and lightly stemmed words

• Retrieval performance was tested for every combination of index type and number of corrections

• The measure of merit is Mean Average Precision

• Significance testing was done using a t-test with a p-value threshold of 0.05
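For reference, Mean Average Precision can be computed as in the following sketch. It uses the common definition (precision at each rank where a relevant document appears, averaged over all relevant documents, then averaged over topics); the function names and the toy rankings are hypothetical, and this is not the evaluation tool used in the experiments.

```python
def average_precision(ranked_docs, relevant):
    """Average precision of one ranked list against a set of relevant doc ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_docs, relevant_set) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    runs = [
        (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
        (["a", "b"], {"b"}),
    ]
    print(mean_average_precision(runs))
```

Relevant documents that never appear in the ranked list still count in the denominator, so missing them lowers the score.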

Page 19: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

Correction Results:

[Figure: two bar charts, ZAD Collection and TREC Collection. Y axis: Word Error Rate (%). X axis: N-corrections (No Correction, 1, 2, 3, 4, 5, 10, All). No Correction: 39 (ZAD), 30.8 (TREC). Other bar values, ZAD: 22.2, 16.9, 15, 13.7, 13.2, 11.5, 8.1. Other bar values, TREC: 16.7, 11.9, 10.2, 9.5, 9.2, 8.1, 6.8.]

Page 20: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

IR Results:

“ZAD Collection” :

[Figure: bar chart. Y axis: Mean Average Precision (0 to 0.45). X axis: index term (Whole Word, 3-gram, 4-gram, Stem). Series: Clean, Bad, 1 Correction, 2 Corrections, 3 Corrections, 5 Corrections.]

Page 21: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

IR Results:

“TREC Collection” :

[Figure: bar chart. Y axis: Mean Average Precision (0 to 0.3). X axis: index term (Whole Word, 3-gram, 4-gram, Stem). Series: Original, Bad, 1 Correction, 2 Corrections, 3 Corrections, 5 Corrections.]

Page 22: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

Conclusion & future work:

Conclusion:

• Although WER was halved, IR effectiveness did not improve by a statistically significant amount

• Using more than one correction does not help

• Indexing using n-grams (shorter index terms) is better than “moderate” error correction

Future work:

• Effect of using an n-gram word LM on error correction (Magdy, W. and K. Darwish. “Arabic OCR Error Correction Using Character Segment Correction, Language Modeling, and Shallow Morphology.” In EMNLP 2006)

• Effect of “good” error correction on improving retrieval effectiveness

Page 23: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents

Lnanh gon → Correction → Thank you