Post on 14-Jan-2016
Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents
Walid Magdy & Kareem Darwish
IBM Technology Development Center
PO Box 166 El-Ahram, Giza, Egypt
{wmagdy,darwishk}@eg.ibm.com
Outline:
1. Motivation
2. Background
3. Approach
4. Experimental Setup
5. Results
6. Conclusion
7. Future Work
Motivation:
[Timeline, 1400 to 2000: first printing press; read to search; e-text becomes commonplace; automated full-text search; 1998: Arabic e-text comes online]
Problem: 500+ years of legacy documents
Goal: To search printed documents efficiently and effectively
Does OCR solve the problem?
Arabic Language Challenges
• Orthography
– Character shape depends on position
– 15 of the 28 letters contain dots
– Optional diacritics may be present
– Printed text may include ligatures and kashida
• Morphology
– Prefix, infix, and suffix
– 6×10^10 possible surface forms
– Ex: وسيكتبونها = wasaya + ktub + uunahaa = and will + write + they it = and they will write it
• Other factors
– Eighth most widely spoken language in the world
– Web growth started only recently
Arabic Pre-processing & Retrieval
• Pre-processing:
– Remove diacritics
– Normalize different forms of alef & ya (ا، أ، إ، آ، ء، ؤ، ئ and ي، ى) to account for
∙ Common spelling errors
∙ Grammatical, morphological, and orthographic properties
• Text Retrieval: Best Index Terms
– Regular text: Light stemming and character 3- & 4-grams are best
– OCR text: character 3- & 4-grams are best
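The pre-processing and n-gram indexing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Unicode range used for diacritics, the choice of bare alef and ya as normalization targets, and the helper names `normalize` and `char_ngrams` are not taken from the slides.

```python
import re

# Assumption: Arabic diacritics occupy U+064B..U+0652 and kashida is U+0640.
DIACRITICS_AND_KASHIDA = re.compile("[\u064B-\u0652\u0640]")

# Assumption: every listed alef/hamza form maps to bare alef,
# and alef maqsura maps to ya. The slides do not state the targets.
NORMALIZATION = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0622": "\u0627",  # alef with madda -> alef
    "\u0621": "\u0627",  # hamza -> alef
    "\u0624": "\u0627",  # waw with hamza -> alef
    "\u0626": "\u0627",  # ya with hamza -> alef
    "\u0649": "\u064A",  # alef maqsura -> ya
})


def normalize(text):
    """Strip diacritics and kashida, then normalize alef and ya forms."""
    return DIACRITICS_AND_KASHIDA.sub("", text).translate(NORMALIZATION)


def char_ngrams(word, n):
    """Return overlapping character n-grams; short words are kept whole."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]
```

For example, `char_ngrams("walid", 3)` yields the index terms `wal`, `ali`, and `lid`, so a single OCR character error damages only some of a word's index terms rather than the whole word.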
Main Idea:
Image → OCR → Degraded Text → Correction → Corrected Text
Image: Word-Based Correction for Retrieval of Arabic OCR Degraded Documents
Degraded Text: VVorcl-Easod Comectlon l0r Belrieval of Arahie OCR Dcgraclod Doeurnerits
Corrected Text: Word-Based Correction for Retrieval of Arabic OCR Degraded Documents
We want to examine the effect of correction on retrieval
Approach:
OCR system → OCR Degraded Text → OCR Correction → OCR Corrected Text → Indexing → Ranked List of Documents
Experimental Setup:
• Test collections
• Error Correction
• Building Error Model
• Training & Decoding
• Experiments
Document Collections:
• ZAD
– Printed 14th century religious book, scanned at 300x300 dpi and OCR'ed
– 2,730 documents
– 25 topics
– Real degraded text by OCR process
– WER = 39 %
• TREC 2002 CLIR
– Arabic newswire articles from Agence France-Presse (AFP)
– 383,872 articles
– 50 topics
– Synthetic degraded text using degradation model
– WER = 30.8 %
The ZAD Collection:
Sample Document: [page image not preserved]
Sample Query: حكم التيمم ومتى شرع (the ruling on tayammum and when it was prescribed)
The TREC 2002 CLIR Collection:
Sample Query: سجناء حرب ايرانيين وعراقيين (Iranian and Iraqi prisoners of war)
Sample Document:
<DOC><DOCNO>19940513_AFP_ARB0001</DOCNO><HEADER> 7710ع 4 0800ارا- تصج / افب / 86قبرص / ذاتي حكم سالم االوسط الشرق </HEADER><BODY><HEADLINE> &HT; اريحا كنيس فوق 1رفع ي لم الفلسطيني <HEADLINE/> العلم<TEXT><P> ) الغربية ) الضفة (- 5-31اريحا مدخل ) بحراسة الفلسطينية الشرطة عناصر احد يقوم ب افاال الفلسطينية الشرطة الى تسليمها تم التي المدينة مواقع آخر احد اريحا وسط في اليهودي الكنيس
الكنيس فوق الفلسطيني العلم رفع يتم لم <P/> انه<P> " مكان هذا الكنيس فوق الفلسطيني العلم رفع تحاول كانت لفلسطينية فلسطيني ضابط وقال<P/> "مقدس<P> ما االسرائيليون الجنود كان الذي الكنيس مدخل من يهود مستوطنين ثالثة اقترب ذلك وقبيل
ثيابهم بتمزيق قاموا الدخول من الجنود منعهم وعندما حراسته يوءمنون <P/> زالوا</TEXT>
OCR-Correction Model:
Training: Manually Corrected OCR Text + OCR Degraded Text → Aligning Characters Mapping → Build Error Model
Decoding: OCR Degraded Text → Generate Corrections → Pick the most likely correction using Bayes' Rule → OCR Corrected Text
Aligning Characters Mapping:
(S = substitution, D = deletion, I = insertion, √ = match)
m:n Mapping
Ex: walid → vvaicl
w → vv (S)
a → a (√)
l → Null (D)
i → i (√)
d → cl (S)
1:1 Mapping
Ex: walid → vvaicl
w → v (S)
Null → v (I)
a → a (√)
l → Null (D)
i → i (√)
d → c (S)
Null → l (I)
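A 1:1 character alignment of this kind can be produced with a standard edit-distance (Levenshtein) alignment. The sketch below assumes unit costs for substitution, deletion, and insertion; the slides do not say which aligner was actually used, and the function name `align` is hypothetical. When several alignments share the minimum cost, the one returned may differ from the one shown on the slide.

```python
def align(correct, ocr):
    """Align two strings 1:1; returns (correct_char, ocr_char, op) triples.

    op is "match", "S" (substitution), "D" (deletion), or "I" (insertion).
    None stands for Null on the missing side.
    """
    n, m = len(correct), len(ocr)
    # dp[i][j] = edit distance between correct[:i] and ocr[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (correct[i - 1] != ocr[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dp[i][j] == dp[i - 1][j - 1] + (correct[i - 1] != ocr[j - 1])):
            op = "match" if correct[i - 1] == ocr[j - 1] else "S"
            ops.append((correct[i - 1], ocr[j - 1], op))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append((correct[i - 1], None, "D"))
            i -= 1
        else:
            ops.append((None, ocr[j - 1], "I"))
            j -= 1
    ops.reverse()
    return ops
```

Calling `align("walid", "vvaicl")` returns one minimum-cost alignment with five non-match operations, the same number as the 1:1 mapping in the example.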
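Building Error Model:
\[ P_{substitution}(C_k..C_l \rightarrow D_x..D_y) = \frac{count(C_k..C_l \rightarrow D_x..D_y)}{count(C_k..C_l)} \]
\[ P_{deletion}(C_k..C_l \rightarrow \varepsilon) = \frac{count(C_k..C_l \rightarrow \varepsilon)}{count(C_k..C_l)} \]
\[ P_{insertion}(\varepsilon \rightarrow D_x..D_y) = \frac{count(\varepsilon \rightarrow D_x..D_y)}{count(C)} \]
where C_k..C_l and D_x..D_y are sequences of one or more characters, and ε is the empty string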
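As a rough illustration of how such an error model could be estimated from aligned training data, the sketch below counts segment mappings and divides by the count of the source segment. It rests on several assumptions that are not in the slides: the input is a flat list of `(correct_segment, ocr_segment)` pairs, the empty string stands for ε, identity mappings are counted like substitutions, and `count(C)` is taken to be the total number of correct characters. The name `build_error_model` is hypothetical.

```python
from collections import Counter


def build_error_model(aligned_pairs):
    """Estimate P(correct_segment -> ocr_segment) from aligned segment pairs."""
    pair_counts = Counter(aligned_pairs)
    # count(C_k..C_l): how often each non-empty correct segment occurs
    source_counts = Counter(c for c, _ in aligned_pairs if c)
    # count(C): total number of correct characters (assumed insertion denominator)
    total_chars = sum(len(c) for c, _ in aligned_pairs)

    model = {}
    for (c, d), n in pair_counts.items():
        denominator = source_counts[c] if c else total_chars
        model[(c, d)] = n / denominator
    return model


pairs = [("w", "vv"), ("a", "a"), ("l", ""), ("i", "i"), ("d", "cl"),
         ("a", "a"), ("a", "o"), ("", "x")]
model = build_error_model(pairs)
```

In this toy data the segment `a` occurs three times and is read correctly twice, so `model[("a", "a")]` is 2/3 and `model[("a", "o")]` is 1/3.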
Decoding:
Bayes' Rule:
\[ \hat{W}_{correct} = \arg\max_{W_{correct}} P(W_{OCR} \mid W_{correct}) \, P(W_{correct}) \]
Character Level model:
\[ P(W_{OCR} \mid W_{correct}) = \prod_{all\ D_x..D_y} P(D_x..D_y \mid C_k..C_l) \]
Word Level model:
\[ P(W_{correct}) = \text{LM probability} \]
(used simple unigram probability)
Example:
Character Level Model:
1. Segmentation
2. Mapping
3. Generate Candidates
Ex: dairn
Segmentations:
d a i r n
da i r n
d ai r n
dai r n
d a ir n
da ir n
d air n
dair n
d a i rn
da i rn
d ai rn
dai rn
d a irn
da irn
d airn
dairn
Mapping for the segmentation "d a i rn" (candidate segment and probability; ε = empty segment):
d: d 0.8, h 0.1, cl 0.08, ε 0.02
a: a 0.9, o 0.05, r 0.02, oi 0.015, (label lost) 0.005, n 0.005, e 0.005
i: i 0.84, l 0.12, ε 0.02, t 0.015, ll 0.005, (label lost) 0.005
rn: rn 0.7, m 0.15, im 0.02, ln 0.015, (label lost) 0.005
(source segment lost): l 0.09, i 0.05, li 0.02, s 0.015, f 0.005, t 0.005, a 0.005
Generated candidates:
dairn 0.425
daim 0.091
claim 0.0091
aim 0.00227
horn 0.00007
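The segmentation and candidate-generation steps of this example can be sketched as follows. The mapping table copies the probabilities from the slide for the segmentation "d a i rn"; reading the unlabeled 0.02 entries for `d` and `i` as the empty segment is an assumption (it reproduces the slide's scores for "aim" and "horn"), and the other unlabeled entries are left out. Keeping the best-scoring path per candidate word, rather than summing paths, is also an assumption, as are the function names.

```python
from itertools import product


def segmentations(word):
    """Enumerate every split of word into contiguous non-empty segments."""
    if not word:
        return [[]]
    result = []
    for i in range(1, len(word) + 1):
        for rest in segmentations(word[i:]):
            result.append([word[:i]] + rest)
    return result


# Probabilities copied from the example; "" stands for the empty segment (assumed).
MAPPING = {
    "d": {"d": 0.8, "h": 0.1, "cl": 0.08, "": 0.02},
    "a": {"a": 0.9, "o": 0.05, "r": 0.02, "oi": 0.015, "n": 0.005, "e": 0.005},
    "i": {"i": 0.84, "l": 0.12, "": 0.02, "t": 0.015, "ll": 0.005},
    "rn": {"rn": 0.7, "m": 0.15, "im": 0.02, "ln": 0.015},
}


def candidates(segments, mapping):
    """Score every combination of segment replacements; keep the best path per word."""
    scores = {}
    for combo in product(*(mapping[s].items() for s in segments)):
        word = "".join(replacement for replacement, _ in combo)
        probability = 1.0
        for _, p in combo:
            probability *= p
        scores[word] = max(scores.get(word, 0.0), probability)
    return scores


scores = candidates(["d", "a", "i", "rn"], MAPPING)
```

A five-letter word has 2^4 = 16 segmentations, which matches the list in the example. With these tables the products agree with the slide's figures to within rounding, for instance 0.8 × 0.9 × 0.84 × 0.15 for "daim" and 0.08 × 0.9 × 0.84 × 0.15 for "claim".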
Example (cont):
Word Level Model:
Find the frequency of occurrence of each generated word in the dictionary
P(dairn | dairn) = 0.425, Freq(dairn) = 0
P(daim | dairn) = 0.091, Freq(daim) = 0
P(claim | dairn) = 0.0091, Freq(claim) = 1500
P(aim | dairn) = 0.00227, Freq(aim) = 4000
P(horn | dairn) = 0.00007, Freq(horn) = 150
dairn → claim
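The final selection in this example can be checked with a few lines of Python that multiply each character-level probability by a unigram probability. The numbers are the ones on the slide; normalizing the frequencies by the sum over the five listed words is an assumption made only to turn counts into probabilities (it does not change the ranking), and `best_correction` is a hypothetical name.

```python
CHAR_LEVEL = {"dairn": 0.425, "daim": 0.091, "claim": 0.0091,
              "aim": 0.00227, "horn": 0.00007}
FREQ = {"dairn": 0, "daim": 0, "claim": 1500, "aim": 4000, "horn": 150}


def best_correction(char_level, freq):
    """Rank candidates by P(OCR word | candidate) * unigram P(candidate)."""
    total = sum(freq.values())
    scored = {word: p * freq.get(word, 0) / total for word, p in char_level.items()}
    best = max(scored, key=scored.get)
    return best, scored


best, scored = best_correction(CHAR_LEVEL, FREQ)
```

"dairn" and "daim" score zero because they do not occur in the dictionary, and "claim" (0.0091 × 1500) outranks "aim" (0.00227 × 4000), so "claim" is selected.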
IR Experiments
• The degraded collections were corrected, and the best one, two, three, and five corrections for each word were picked to be indexed
• The collections were indexed and searched using words, character 3-grams, character 4-grams, and lightly stemmed words
• Retrieval performance was tested for every combination of index type and number of corrections
• The measure of merit is Mean Average Precision
• Significance testing was done using a t-test with a p-value threshold of 0.05
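For reference, Mean Average Precision can be computed as below. This is the standard textbook definition written as a small sketch, not the evaluation tool used in the experiments; the document identifiers and function names are made up for illustration.

```python
def average_precision(ranked, relevant):
    """Average of the precision values at the rank of each relevant document."""
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs is a list of (ranked_list, relevant_set) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d2", "d1"], {"d1"}),                    # AP = 1/2
]
map_score = mean_average_precision(runs)
```

Relevant documents that are never retrieved contribute zero precision, since the sum is divided by the total number of relevant documents for the topic.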
Correction Results:
[Figure: Word Error Rate (%) vs. N-corrections, one bar chart each for the ZAD Collection and the TREC Collection]
N-corrections: No Correction, 1, 2, 3, 4, 5, 10, All
ZAD Collection WER (%): 39, 22.2, 16.9, 15, 13.7, 13.2, 11.5, 8.1
TREC Collection WER (%): 30.8, 16.7, 11.9, 10.2, 9.5, 9.2, 8.1, 6.8
IR Results:
“ZAD Collection” :
[Figure: Mean Average Precision for each index term (Whole Word, 3-gram, 4-gram, Stem); series: Clean, Bad, 1 Correction, 2 Corrections, 3 Corrections, 5 Corrections]
IR Results:
“TREC Collection” :
[Figure: Mean Average Precision for each index term (Whole Word, 3-gram, 4-gram, Stem); series: Original (Clean), Bad, 1 Correction, 2 Corrections, 3 Corrections, 5 Corrections]
Conclusion & future work:
Conclusion:
• Although WER was halved, IR effectiveness did not improve by a statistically significant margin
• Using more than one correction does not help
• Indexing using n-grams (shorter index terms) is better than “moderate” error correction
Future work:
• Effect of using an n-gram word LM on error correction (Magdy, W. and K. Darwish. Arabic OCR Error Correction Using Character Segment Correction, Language Modeling, and Shallow Morphology. In EMNLP 2006)
• Effect of “good” error correction on improving the retrieval effectiveness
Lnanh gon → Correction → Thank you