Fusion of Multiple Corrupted Transmissions and its effect on Information Retrieval

Walid Magdy

Kareem Darwish

Mohsen Rashwan

Outline

1. Motivation

2. Prior work

3. Fusion Definition

4. Approach

5. Experimental Setup

6. Results

7. Conclusion & Future work

Motivation

• Many Arabic documents are available only in print form.

• The need to convert these documents into electronic form has grown since the end of the last century, because searching electronic text is much easier.

• Arabic OCR accuracy is still much lower than the state-of-the-art for other languages, such as English.

• Degraded text, resulting from OCR systems, affects the effectiveness of Information Retrieval.

• Higher-quality text for Arabic documents is therefore necessary to improve IR effectiveness.

Prior Art:

• Previous work on OCRed text focused on two main aspects:

1. Work that improves information retrieval effectiveness without improving text quality.

2. Work that improves text quality, which in turn improves IR effectiveness.

• Examples:

1. Query garbling based on character error model.

2. OCR correction based on character error model and Language model.

Fusion Definition:

• Previous approaches depend on the presence of only one source of degraded text.

• Our approach assumes the presence of more than one version of the degraded text.

• OCR: Simage → OCR → Sx = S0 + εx

• Correction: Sx = S0 + εx → Correction → S0’ = S0 + ε0’, where ε0’ < εx

• Fusion: S1 = S0 + ε1, S2 = S0 + ε2, …, Sn = S0 + εn → Fusion → S0’ = S0 + ε0’, where ε0’ < min(ε1 … εn)

• Notation: S0 is the clean version of the text, ε denotes the noisy edit operations, and Sx (or S1 … Sn) is a degraded version of the text.

Approach:

[Figure: a page image is fed to OCR System 1 and OCR System 2. Each system produces its own degraded Arabic text, and a language model combines the two outputs into a single fused text.]
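The fusion step in the figure can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the two OCR outputs are already aligned word by word, and it substitutes a unigram language model with add-one smoothing for whatever language model was actually used. The names `train_unigram_lm` and `fuse` and the toy strings are hypothetical.

```python
from collections import Counter
import math


def train_unigram_lm(corpus_words):
    """Return a log-probability function built from training words (add-one smoothing)."""
    counts = Counter(corpus_words)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen words

    def logprob(word):
        return math.log((counts[word] + 1) / (total + vocab))

    return logprob


def fuse(versions, logprob):
    """At each word position, keep the candidate the language model scores highest.

    Assumption: every version has the same number of words (pre-aligned).
    """
    return [max(candidates, key=logprob) for candidates in zip(*versions)]


# Toy data: a tiny "clean" training text and two degraded versions of one phrase.
lm = train_unigram_lm("جزاكم الله خيرا جزاكم الله خيرا".split())
version_1 = "جزاكم الله خبرا".split()
version_2 = "بزاكم الته خيرا".split()
fused = fuse([version_1, version_2], lm)
```

Because each version is wrong at different positions, the language model can recover the clean word at every position in this toy case. A position where both versions are wrong would stay wrong.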

Experimental Setup:

• Only one OCR system was available: “Sakhr Automatic Reader v4”.

• In order to obtain multiple sources for a given data set:

1. A few pages were selected at random from a book and OCRed, then the output text was manually corrected.

2. The degraded and clean text were used to create a character error model based on 1:1 character mapping.

3. The generated model was then used to garble clean text at different CERs.

• The OCRed book used for testing was Zad Alma’ad, with the following specifications:

1. Eight pages scanned at 300x300 dpi, containing 4,236 words, with a CER of 13.9% and a WER of 36.8%.

2. A clean version of the book was available in electronic form, consisting of 2,730 separate documents, with an associated set of 25 topics and relevance judgments.

• The language model (LM) was built using a web-mined collection of religious text by Ibn Taymiya, the teacher of the author of Zad Alma’ad.

• MAP was used as the figure of merit for IR results.
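Step 2 of the procedure above (building a character error model from a 1:1 character mapping) could look like the following sketch. It assumes the clean and degraded texts are already aligned character by character and have equal length. The function name `build_char_error_model` and the Latin toy strings are hypothetical, and the authors' actual estimation procedure may differ.

```python
from collections import Counter, defaultdict


def build_char_error_model(clean, degraded):
    """Estimate P(observed char | clean char) from two aligned, equal-length strings."""
    if len(clean) != len(degraded):
        raise ValueError("1:1 character mapping requires equal-length strings")

    # Count how often each clean character was observed as each output character.
    counts = defaultdict(Counter)
    for clean_char, observed_char in zip(clean, degraded):
        counts[clean_char][observed_char] += 1

    # Normalize the counts of each clean character into probabilities.
    model = {}
    for clean_char, observed in counts.items():
        total = sum(observed.values())
        model[clean_char] = {ch: n / total for ch, n in observed.items()}
    return model


# Toy data: "a" is misread as "o" once out of four occurrences.
model = build_char_error_model("abcabcabca", "abcabcabco")
```

In this toy case `model["a"]` assigns 0.75 to "a" and 0.25 to "o". A real model would be estimated from the manually corrected pages.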

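Experimental Setup: Generating Synthetic Garbled Data

• For a clean word “قنبلة”, the garbler processes one character at a time, starting with “ق”.

• Character error model for “ق”: ق 0.8, ف 0.1, ت 0.05, ن 0.05

• Cumulative intervals: ق from 0.0 to 0.8, ف from 0.8 to 0.9, ت from 0.9 to 0.95, ن from 0.95 to 1

• Generate random number: 0.921. It falls in the interval of “ت”, so the garbler replaces “ق” with “ت”, giving the garbled word “تنبلة”.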
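The sampling step on this slide can be sketched as below. The probabilities for “ق” and the random number 0.921 are taken from the slide; the function names `sample_replacement` and `garble` are hypothetical, and characters without an entry in the model are simply copied, which is an assumption of this sketch.

```python
import bisect
import itertools
import random


def sample_replacement(distribution, u):
    """Map a number u in [0, 1) to a character using cumulative probability intervals.

    distribution: list of (char, probability) pairs that sum to 1.
    """
    chars = [ch for ch, _ in distribution]
    upper_bounds = list(itertools.accumulate(p for _, p in distribution))
    index = bisect.bisect_right(upper_bounds, u)
    # Guard against floating-point rounding in the final cumulative bound.
    return chars[min(index, len(chars) - 1)]


def garble(word, model, rng):
    """Garble a word character by character; characters missing from the model are kept."""
    out = []
    for ch in word:
        distribution = model.get(ch)
        out.append(sample_replacement(distribution, rng.random()) if distribution else ch)
    return "".join(out)


# Character error model for the letter qaf, as shown on the slide.
qaf_model = [("ق", 0.8), ("ف", 0.1), ("ت", 0.05), ("ن", 0.05)]
replacement = sample_replacement(qaf_model, 0.921)  # falls between 0.9 and 0.95
garbled = garble("قنبلة", {"ق": qaf_model}, random.Random(0))
```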

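Experimental Setup: Generating Synthetic Garbled Data

• To generate versions with different CERs, the character error model is scaled by a factor k = CERnew / CERorg.

• Cumulative interval boundaries for “ق” (order of outcomes: ق, ف, ت, ن):

Original:  0.0  0.8  0.9   0.95   1
k = 2:     0.0  0.6  0.8   0.9    1
k = 0.5:   0.0  0.9  0.95  0.975  1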
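One way to reproduce the interval boundaries on this slide is to multiply every error probability by k and assign the remaining probability mass to the correct character. That rule is inferred from the numbers shown, not stated by the authors, and the function name `scale_error_model` is hypothetical.

```python
def scale_error_model(distribution, correct_char, k):
    """Scale all error probabilities by k; the correct character gets what is left."""
    scaled = {ch: p * k for ch, p in distribution.items() if ch != correct_char}
    error_mass = sum(scaled.values())
    if error_mass > 1.0:
        raise ValueError("k is too large: the scaled error probabilities exceed 1")
    scaled[correct_char] = 1.0 - error_mass
    return scaled


# Original model for the letter qaf, as shown on the slide.
qaf = {"ق": 0.8, "ف": 0.1, "ت": 0.05, "ن": 0.05}
doubled = scale_error_model(qaf, "ق", 2)    # error mass 0.2 becomes 0.4
halved = scale_error_model(qaf, "ق", 0.5)   # error mass 0.2 becomes 0.1
```

With k = 2 the correct character keeps probability 0.6, and with k = 0.5 it keeps 0.9, which matches the first boundary of each scaled interval on the slide.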

Experimental Setup: Generated Versions

Data set   k      CER     WER     OOV
Original   NA     13.9%   36.8%   20.9%
Model-1    1      13.9%   36.3%   21.1%
                  13.9%   36.4%   21.1%
Model-2    0.5    7.0%    20.3%   11.9%
                  7.0%    20.4%   11.9%
Model-3    0.67   9.3%    26.1%   15.2%
                  9.3%    25.9%   15.2%
Model-4    1.25   17.4%   43.2%   25.0%
                  17.4%   43.3%   24.9%
Model-5    2      27.9%   59.2%   33.8%
                  27.9%   59.2%   33.7%

Table: Error rates for generated versions

[Figure: Retrieval results for generated versions. Bar chart of mean average precision (axis from 0 to 0.6) for the clean collection and for Model-1 through Model-5.]

Results: Fusion Results

[Figure: WER of the output text from the fusion process between pairs of versions. Series: WER after fusion of both versions; common errors between versions.]
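The two quantities in this figure can be related with a small sketch. It assumes the versions are aligned word by word with the clean reference, so only substitution errors are counted; real WER also counts insertions and deletions. The function names are hypothetical. If fusion only selects among the candidate words, a position where both versions are wrong cannot be repaired, so the share of common errors acts as a floor on the WER after fusion.

```python
def word_error_positions(reference, version):
    """Return the positions where an aligned version differs from the reference."""
    return {i for i, (ref, word) in enumerate(zip(reference, version)) if ref != word}


def error_overlap(reference, version_a, version_b):
    """Per-version error rate and the rate of errors shared by both versions."""
    errors_a = word_error_positions(reference, version_a)
    errors_b = word_error_positions(reference, version_b)
    n = len(reference)
    return {
        "wer_a": len(errors_a) / n,
        "wer_b": len(errors_b) / n,
        "common": len(errors_a & errors_b) / n,
    }


# Toy data: each version has two wrong words, one of them at the same position.
reference = "a b c d".split()
stats = error_overlap(reference, "a x c y".split(), "z b c y".split())
```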

Results: Retrieval Results

[Figure: MAP when searching the different fused models. Hashed bars indicate retrieval results that are statistically significantly better than those of the original degraded versions.]
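For reference, mean average precision can be computed as in the sketch below, using the standard definition. The data structures (`runs` and `qrels` as dictionaries keyed by topic) and the toy document identifiers are assumptions of this example, not the authors' evaluation setup.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at the ranks where relevant documents appear."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """runs: topic -> ranked document ids; qrels: topic -> relevant document ids."""
    scores = [average_precision(runs.get(topic, []), qrels[topic]) for topic in qrels]
    return sum(scores) / len(scores)


# Toy data: two topics with hand-made rankings and relevance judgments.
runs = {"t1": ["d1", "d2", "d3"], "t2": ["d4", "d5"]}
qrels = {"t1": ["d1", "d3"], "t2": ["d5"]}
map_score = mean_average_precision(runs, qrels)
```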

Conclusion & Future Work:

• Text fusion proved to be an effective method for selecting the proper word among different candidate words coming from different sources.

• Effectiveness of text fusion on WER reduction depends on the percentage of error overlap among different versions.

• The improvement in information retrieval resulting from text fusion was promising, especially for the few output versions that are statistically indistinguishable from the clean version.

• As future work, the fusion technique needs to be tested on real degraded data coming from different sources, which introduces a new challenge: word alignment among the different sources.

Version 1: جزاكم الله خبرا

Version 2: بزاكم الته خيرا

Fused: جزاكم الله خيرا (“May God reward you with good”, a closing thank-you)
