Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents


Walid Magdy & Kareem Darwish
IBM Technology Development Center
PO Box 166 El-Ahram, Giza, Egypt
{wmagdy,darwishk}@eg.ibm.com

Outline:

1. Motivation

2. Background

3. Approach

4. Experimental Setup

5. Results

6. Conclusion

7. Future Work

Motivation:

First printing press

Read to search

E-text becomes commonplace

Automated full text search

Problem: 500+ years of legacy documents

Goal: To search printed documents efficiently and effectively

1998: Arabic e-text comes online

Does OCR solve the problem?

Arabic Language Challenges

• Orthography
– Character shape depends on position
– 15 of the 28 letters contain dots
– Optional diacritics may be present
– Printed text may include ligatures and kashida

• Morphology
– Prefix, infix, and suffix
– 6 x 10^10 possible surface forms

• Other factors
– Eighth most widely spoken language in the world
– Web growth started only recently

وسيكتبونها = wasaya + ktub + uunahaa
= and will + write + they it
= and they will write it

Arabic Pre-processing & Retrieval

• Pre-processing:
– Remove diacritics
– Normalize different forms of alef & ya to account for
∙ Common spelling errors
∙ Grammatical, morphological, and orthographic properties
– Letter forms involved: ا ، أ ، إ ، آ ، ء ، ؤ ، ئ and ي ، ى

• Text Retrieval: Best Index Terms
– Regular text: light stemming and character 3- & 4-grams are best
– OCR text: character 3- & 4-grams are best
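The pre-processing steps can be sketched in a few lines of Python. This is only an illustration: the Unicode ranges and the mapping (alef variants to bare alef, alef maqsura to ya) are assumptions about what "normalize" means here, and the hamza forms are left untouched because the slide does not make their target form clear.

```python
import re

# Arabic diacritics: tanween, short vowels, shadda, sukun (U+064B..U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")
KASHIDA = "\u0640"  # tatweel, the elongation stroke used in printed text

# Assumed normalization table (not taken from the slides).
LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0622": "\u0627",  # alef with madda -> alef
    "\u0649": "\u064A",  # alef maqsura -> ya
})

def normalize(text: str) -> str:
    """Strip diacritics and kashida, then fold alef and ya variants."""
    text = DIACRITICS.sub("", text)
    text = text.replace(KASHIDA, "")
    return text.translate(LETTER_MAP)
```

Applying `normalize` to both documents and queries before indexing keeps spelling variants of the same word on a single index term.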

Main Idea:

Image: Word-Based Correction for Retrieval of Arabic OCR Degraded Documents

Degraded Text: VVorcl-Easod Comectlon l0r Belrieval of Arahie OCRDcgraclod Doeurnerits

Correction

Corrected Text: Word-Based Correction for Retrieval of Arabic OCR Degraded Documents

We want to examine the effect of correction on Retrieval

Approach:

[Diagram: OCR system → OCR Degraded Text → OCR Correction → OCR Corrected Text; Text → Indexing → Ranked List of Documents]

Experimental Setup:

• Test collections
• Error Correction
• Building Error Model
• Training & Decoding
• Experiments

Document Collections:

ZAD
– Printed 14th century religious book, scanned at 300x300 dpi and OCR'ed
– 2,730 documents
– 25 topics
– Real degraded text produced by the OCR process
– WER = 39 %

TREC 2002 CLIR
– Arabic newswire articles from Agence France-Presse
– 383,872 articles
– 50 topics
– Synthetic degraded text produced using a degradation model
– WER = 30.8 %
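For reference, word error rate (WER) can be computed as the word-level edit distance between the clean text and the OCR output, divided by the number of clean words. The sketch below assumes whitespace tokenization and unit edit costs; the slides do not say exactly how the reported WER figures were measured.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference text must contain at least one word")
    # prev[j] holds the distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a hypothesis word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1] / len(ref)
```

With two of five words damaged, `word_error_rate("word based correction for retrieval", "vvord based comectlon for retrieval")` gives 0.4.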

The ZAD Collection:

Sample Document: [figure not recoverable]

Sample Query: حكم التيمم ومتى شرع

The TREC 2002 CLIR Collection:

Sample Query: سجناء حرب ايرانيين وعراقيين

Sample Document:

<DOC><DOCNO>19940513_AFP_ARB0001</DOCNO><HEADER> 7710ع 4 0800ارا- تصج / افب / 86قبرص / ذاتي حكم سالم االوسط الشرق </HEADER><BODY><HEADLINE> &HT; اريحا كنيس فوق 1رفع ي لم الفلسطيني <HEADLINE/> العلم<TEXT><P> ) الغربية ) الضفة (- 5-31اريحا مدخل ) بحراسة الفلسطينية الشرطة عناصر احد يقوم ب افاال الفلسطينية الشرطة الى تسليمها تم التي المدينة مواقع آخر احد اريحا وسط في اليهودي الكنيس

الكنيس فوق الفلسطيني العلم رفع يتم لم <P/> انه<P> " مكان هذا الكنيس فوق الفلسطيني العلم رفع تحاول كانت لفلسطينية فلسطيني ضابط وقال<P/> "مقدس<P> ما االسرائيليون الجنود كان الذي الكنيس مدخل من يهود مستوطنين ثالثة اقترب ذلك وقبيل

ثيابهم بتمزيق قاموا الدخول من الجنود منعهم وعندما حراسته يوءمنون <P/> زالوا</TEXT>

OCR-Correction Model:

Training: Manually Corrected OCR Text → Aligning Characters Mapping → Build Error Model

Decoding: OCR Degraded → Generate Corrections → Pick the most likely correction using Bayes' rule → OCR Corrected

Aligning Characters Mapping:

m:n Mapping
Ex: walid → vvaicl

w → vv     S
a → a      √
l → Null   D
i → i      √
d → cl     S

1:1 Mapping
Ex: walid → vvaicl

w → v      S
Null → v   I
a → a      √
l → Null   D
i → i      √
d → c      S
Null → l   I

(S = substitution, I = insertion, D = deletion, √ = match)
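A 1:1 alignment of this kind can be obtained with a standard edit-distance table plus a backtrace. The sketch below is one generic way to do it, not necessarily the aligner used in this work; several minimum-cost alignments usually exist, so the operations returned for walid → vvaicl may differ from the ones listed on the slide while having the same total cost.

```python
def align(correct: str, ocr: str):
    """Return (distance, ops); each op is (correct_char, ocr_char, tag).

    tag is "M" (match), "S" (substitution), "I" (insertion), "D" (deletion);
    None stands for Null on the side that has no character.
    """
    n, m = len(correct), len(ocr)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (correct[i - 1] != ocr[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Walk back from the bottom-right corner to recover one optimal alignment.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dist[i][j] == dist[i - 1][j - 1] + (correct[i - 1] != ocr[j - 1])):
            tag = "M" if correct[i - 1] == ocr[j - 1] else "S"
            ops.append((correct[i - 1], ocr[j - 1], tag))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append((correct[i - 1], None, "D"))
            i -= 1
        else:
            ops.append((None, ocr[j - 1], "I"))
            j -= 1
    ops.reverse()
    return dist[n][m], ops
```

An m:n mapping can then be derived by merging adjacent non-match operations into one segment pair, for example S followed by I into w → vv.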

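Building Error Model:

$$P_{substitution}(C_k..C_l \rightarrow D_x..D_y) = \frac{count(C_k..C_l \rightarrow D_x..D_y)}{count(C_k..C_l)}$$

$$P_{deletion}(C_k..C_l \rightarrow \varepsilon) = \frac{count(C_k..C_l \rightarrow \varepsilon)}{count(C_k..C_l)}$$

$$P_{insertion}(\varepsilon \rightarrow D_x..D_y) = \frac{count(\varepsilon \rightarrow D_x..D_y)}{count(C)}$$

where $C_k..C_l$ and $D_x..D_y$ are sequences of one or more characters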
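These relative-frequency estimates can be sketched as follows. The input format (a list of aligned segment pairs, with the empty string standing for ε) is a hypothetical choice for illustration, and count(C) is assumed here to be the total number of characters on the correct side of the training alignments.

```python
from collections import Counter

def build_error_model(aligned_pairs):
    """Estimate mapping probabilities from (correct_segment, ocr_segment) pairs.

    An empty string stands for ε: ("l", "") is a deletion, ("", "v") an insertion.
    """
    pairs = list(aligned_pairs)
    pair_counts = Counter(pairs)
    source_counts = Counter(c for c, _ in pairs if c)
    total_chars = sum(len(c) for c, _ in pairs)  # assumed meaning of count(C)

    model = {}
    for (c, d), n in pair_counts.items():
        if c:
            # Substitution (including identity) or deletion: divide by count(C_k..C_l).
            model[(c, d)] = n / source_counts[c]
        else:
            # Insertion: divide by the total character count.
            model[(c, d)] = n / total_chars
    return model
```

For example, if d is aligned once with cl and once with d in the training data, the model assigns P(d → cl) = 0.5.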

Decoding:

Bayes' Rule:

$$\widehat{Word}_{correct} = \arg\max_{Word_{correct}} P(Word_{correct} \mid Word_{OCR}) = \arg\max_{Word_{correct}} P(Word_{OCR} \mid Word_{correct}) \, P(Word_{correct})$$

Character Level model:

$$P(Word_{OCR} \mid Word_{correct}) = \prod_{\text{all } D_x..D_y} P(D_x..D_y \mid C_k..C_l)$$

Word Level model:

P(Word_correct) = LM probability (used simple unigram probability)

Example:

Character Level Model:

1. Segmentation
2. Mapping
3. Generate Candidates

Ex: dairn

1. Segmentation (candidate segmentations of the OCR word):

d a i r n
da i r n
d ai r n
dai r n
d a ir n
da ir n
d air n
dair n
d a i rn
da i rn
d ai rn
dai rn
d a irn
da irn
d airn

2. Mapping (shown for the segmentation d a i rn; ? marks an entry whose label is missing):

d:  d 0.8, h 0.1, cl 0.08, ε 0.02
a:  a 0.9, o 0.05, r 0.02, oi 0.015, ? 0.005, n 0.005, e 0.005
i:  i 0.84, l 0.12, ε 0.02, t 0.015, ll 0.005, ? 0.005
rn: rn 0.7, m 0.15, im 0.02, ln 0.015, ? 0.005
(segment label missing): l 0.09, i 0.05, li 0.02, s 0.015, f 0.005, t 0.005, a 0.005

3. Generate Candidates:

dairn 0.425
daim 0.091
claim 0.0091
aim 0.00227
horn 0.00007
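The segmentation step can be illustrated with a short recursive function that enumerates every way of cutting a word into contiguous segments. A word of n characters has 2^(n-1) such segmentations, so dairn has 16. Whether the original system capped the segment length is not stated in the slides, so no cap is applied here.

```python
def segmentations(word: str):
    """Return every split of `word` into contiguous, non-empty segments."""
    if not word:
        return [[]]
    result = []
    for cut in range(1, len(word) + 1):
        head = word[:cut]
        # Combine this first segment with every segmentation of the remainder.
        for tail in segmentations(word[cut:]):
            result.append([head] + tail)
    return result
```

`segmentations("dairn")` includes `["d", "a", "i", "rn"]`, the segmentation used in the mapping step of the example.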

Example (cont):

Word Level Model:

Find the frequency of occurrence of each generated word in the dictionary

Candidate   P(candidate | dairn)   Freq(candidate)
dairn       0.425                  0
daim        0.091                  0
claim       0.0091                 1500
aim         0.00227                4000
horn        0.00007                150

dairn → claim
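The whole example can be replayed in code. The sketch below uses only the labelled mapping entries from the example; the two ε entries (d → ε and i → ε, both 0.02) are inferred from the slide's own scores for aim and horn, and raw dictionary frequency stands in for the unigram probability, which does not change the argmax. Function and variable names are illustrative.

```python
from itertools import product

# Labelled mapping probabilities from the example; "" stands for ε.
# Entries whose label is missing in the source are omitted.
MAPPINGS = {
    "d": {"d": 0.8, "h": 0.1, "cl": 0.08, "": 0.02},
    "a": {"a": 0.9, "o": 0.05, "r": 0.02, "oi": 0.015, "n": 0.005, "e": 0.005},
    "i": {"i": 0.84, "l": 0.12, "": 0.02, "t": 0.015, "ll": 0.005},
    "rn": {"rn": 0.7, "m": 0.15, "im": 0.02, "ln": 0.015},
}
# Dictionary frequencies from the example; unseen words count as 0.
FREQ = {"claim": 1500, "aim": 4000, "horn": 150}

def candidates(segments):
    """Character level model: best mapping probability for each candidate word."""
    best = {}
    for combo in product(*(MAPPINGS[s].items() for s in segments)):
        word = "".join(out for out, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        best[word] = max(best.get(word, 0.0), prob)
    return best

def correct(segments):
    """Word level model: weight each candidate by its dictionary frequency."""
    scored = {w: p * FREQ.get(w, 0) for w, p in candidates(segments).items()}
    return max(scored, key=scored.get)
```

`correct(["d", "a", "i", "rn"])` selects claim: its score of about 0.0091 x 1500 beats aim at about 0.00227 x 4000, while dairn and daim score zero because they are not in the dictionary.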

IR Experiments

• The degraded collections were corrected, and the best one, two, three, or five corrections for each word were indexed

• The collections were indexed and searched using words, character 3-grams, character 4-grams, and lightly stemmed words

• Retrieval performance was tested for every combination of index type and number of corrections

• The measure of merit is Mean Average Precision

• Significance testing was done using a t-test with a p-value threshold of 0.05
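To make the character n-gram index terms concrete, the sketch below extracts overlapping n-grams from a word. It is a simplified illustration: it ignores word-boundary markers and keeps a word shorter than n as a single term, both of which are assumptions rather than details taken from the slides.

```python
def char_ngrams(word: str, n: int):
    """Overlapping character n-grams of `word`; short words are kept whole."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def shared_ngrams(clean: str, degraded: str, n: int):
    """N-grams that a clean word and its OCR-degraded form have in common."""
    return set(char_ngrams(clean, n)) & set(char_ngrams(degraded, n))
```

A clean word and its degraded form never match as whole words, but they often still share n-grams: `shared_ngrams("retrieval", "belrieval", 3)` contains four 3-grams (rie, iev, eva, val), so a query for the clean word can still reach the degraded document.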

Correction Results:

[Figure: two charts, ZAD Collection and TREC Collection; x-axis: N corrections (No Correction, 1, 2, 3, 4, 5, 10, All). Data labels recovered from the first panel: 13.2, 13.7, 15, 16.9; from the second panel: 9.2, 8.1, 6.8, 9.5, 10.2]

IR Results: "ZAD Collection"

[Figure: two charts comparing index terms (Whole Word, 3-gram, 4-gram, Stem) for Clean, Bad, 1 Correction, 2 Corrections, 3 Corrections, 5 Corrections]

IR Results: "TREC Collection"

[Figure: chart comparing index terms (Whole Word, 3-gram, 4-gram, Stem) for Original, Bad, 1 Correction, 2 Corrections, 3 Corrections, 5 Corrections]

Conclusion & future work:

Conclusion:

• Although WER was halved, the improvement in IR effectiveness was not statistically significant

• Using more than one correction does not help

• Indexing using n-grams (shorter index terms) is better than "moderate" error correction

Future work:

• Effect of using an n-gram word LM on error correction (Magdy, W. and K. Darwish. Arabic OCR Error Correction Using Character Segment Correction, Language Modeling, and Shallow Morphology. In EMNLP 2006)

• Effect of "good" error correction on improving retrieval effectiveness

Lnanh gon → Correction → Thank you
