Search space reduction for holistic ligature recognition in Urdu Nastaliq script (Bachelor thesis...

Holistic Recognition of Printed Arabic Script Ligatures. Akram El-Korashy. Supervised by: Dr. Faisal Shafait. Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), Kaiserslautern, Germany

description

The thesis addresses the problem of holistic recognition of printed text in the Nastalique writing style of the Urdu language. The main difficulty of the recognition process lies in the large number of classes (17,000 different possible classes in our Urdu text data). This large number of classes not only limits the run-time efficiency of many recognition algorithms, but also makes it harder to use some state-of-the-art classifiers, such as random forests, that assume a much smaller number of classes. In this work, we investigate different strategies for improving the efficiency (reducing the search space) of nearest-neighbor-based classification of Urdu ligatures. Experiments using spectral hashing show that the search space of nearest-neighbor comparison can be reduced by about 50% without loss in recognition accuracy. Further experiments demonstrate that a Random Forest classifier can reliably distinguish one-character ligatures from multiple-character ligatures.

Page 1: Search space reduction for holistic ligature recognition in Urdu Nastaliq script (Bachelor thesis presentation)

Holistic Recognition of Printed Arabic Script Ligatures
Akram El-Korashy
Supervised by: Dr. Faisal Shafait
Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI)
Kaiserslautern, Germany

Outline
● Introduction
  ○ Segmentation-free OCR for Arabic scripts
● Approaches Used
  ○ Feature Extraction and the Shape Context method
  ○ Machine Learning Techniques (Hierarchical classification, Spectral Hashing, Random Forests)
● Improvements and Methodology
  ○ Shape Context weaknesses
  ○ New Features (dots, sizes, pixel-level matching)
  ○ Classification Methodology
● Experiments and Results
● Conclusion and Summary

Akram El-Korashy, Segmentation-free OCR, 14.08.12 1

Segmentation-free OCR for Arabic scripts
● Nastalique writing: classify ligatures instead of individual characters.
● Over 20,000 valid ligatures in the Urdu language.
● Preprocessing becomes easier, but feature extraction and classification become harder.

Feature Extraction, Shape Context method
● Distribution of Points, Transformation methods, Structural Analysis.
● Nabocr: Shape Context feature vector.
● Contour Extraction.
● Shape Context is a shape descriptor proposed by Belongie et al.

Feature Extraction, Shape Context method
● 4 histograms from 4 quadrants.
● Each histogram is a sum of point histograms.
● Distance, Orientation
● Histogram: bins of ranges.
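
The histogram construction above can be sketched roughly as follows. This is a minimal illustration, not the Nabocr implementation: the function names, the log-spaced distance bins, the bin counts, and the choice of the bounding-box centre to define the four quadrants are all assumptions made for the example.

```python
import math

def point_histogram(ref, points, r_max, n_r=3, n_theta=4):
    """Distance/orientation histogram of all other points as seen from `ref`."""
    hist = [[0] * n_theta for _ in range(n_r)]
    for p in points:
        if p == ref:
            continue
        dx, dy = p[0] - ref[0], p[1] - ref[1]
        # Distance bin: log-spaced ranges up to r_max (assumed binning).
        ratio = math.log1p(math.hypot(dx, dy)) / math.log1p(r_max)
        r_bin = min(n_r - 1, int(n_r * ratio))
        # Orientation bin: equal angular ranges over the full circle.
        angle = math.atan2(dy, dx) % (2 * math.pi)
        t_bin = min(n_theta - 1, int(n_theta * angle / (2 * math.pi)))
        hist[r_bin][t_bin] += 1
    return hist

def quadrant_histograms(points, n_r=3, n_theta=4):
    """Four histograms, one per quadrant around the bounding-box centre."""
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    cx, cy = (min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2
    # Largest pairwise distance; fall back to 1.0 for degenerate input.
    r_max = max(math.hypot(p[0] - q[0], p[1] - q[1])
                for p in points for q in points) or 1.0
    quads = [[[0] * n_theta for _ in range(n_r)] for _ in range(4)]
    for p in points:
        q = 2 * (p[1] >= cy) + (p[0] >= cx)  # quadrant index 0..3
        h = point_histogram(p, points, r_max, n_r, n_theta)
        for i in range(n_r):
            for j in range(n_theta):
                quads[q][i][j] += h[i][j]  # sum of point histograms
    return quads

contour = [(0, 0), (2, 0), (0, 2), (2, 2)]  # toy contour points
hists = quadrant_histograms(contour)
```

Flattening the four histograms in a fixed order would give one fixed-length feature vector per ligature.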

Hierarchical Classification
● Decomposing a classification problem into a set of smaller problems.
● Useful with large numbers of categories.
● Efficiency of recognition.
● Can help improve accuracy.
● Independent set of features for each branch.

Spectral Hashing
● Fast nearest-neighbor (NN) technique
● Feature vector into a binary code:
  ○ easily computed
  ○ small number of bits
  ○ similarity-preserving mapping
● Calculating binary code:
  ○ maximum variance direction (PCA)
  ○ sinusoidal eigenfunctions
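
The binary-code computation can be illustrated with a simplified sketch. It assumes the feature vectors have already been projected onto their principal (maximum-variance) directions, so the PCA step itself is omitted, and the function names are hypothetical. Each bit thresholds a sinusoid along one direction; low-frequency modes are preferred, which should correspond to the smallest eigenvalues in the published spectral hashing formulation.

```python
import math

def train_spectral_hash(data, n_bits):
    """Pick the n_bits lowest-frequency sinusoidal modes over all dimensions.

    `data` is assumed to be already PCA-projected (list of equal-length tuples).
    """
    dims = len(data[0])
    lo = [min(row[d] for row in data) for d in range(dims)]
    hi = [max(row[d] for row in data) for d in range(dims)]
    modes = []
    for d in range(dims):
        span = (hi[d] - lo[d]) or 1.0  # guard against a constant dimension
        for k in range(1, n_bits + 1):
            modes.append((k * math.pi / span, d))  # (frequency, dimension)
    modes.sort()  # wide-range (high-variance) directions come first
    return lo, modes[:n_bits]

def hash_code(x, model):
    """Binary code: one bit per mode, the sign of the sinusoidal eigenfunction."""
    lo, modes = model
    return tuple(int(math.sin(math.pi / 2 + w * (x[d] - lo[d])) > 0)
                 for w, d in modes)

projected = [(0.0, 0.0), (4.0, 1.0), (0.5, 0.5), (3.5, 0.2)]  # toy data
model = train_spectral_hash(projected, n_bits=2)
codes = [hash_code(x, model) for x in projected]
```

In this toy data the two points near 0 along the first direction share one code and the two points near 4 share another, which is the property that lets a lookup compare a query only against samples with matching or nearby codes.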

Random Forests
● Ensemble Classifier

Ensemble learning combines the predictions of different classifiers (decision trees) by collecting an independent vote from each tree and taking the majority vote as the prediction.
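
The voting step can be sketched in a few lines. The "trees" below are hypothetical stand-in functions rather than trained decision trees, and `majority_vote` is a name chosen for this example only.

```python
from collections import Counter

def majority_vote(trees, sample):
    """Collect one vote per tree and return the most frequent label."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

# Stand-in "trees": each is just a function from a sample to a label.
toy_trees = [
    lambda s: "1",
    lambda s: "3+",
    lambda s: "3+",
]
prediction = majority_vote(toy_trees, sample={"fill_ratio": 0.4})
```

With an odd number of trees and two labels a tie cannot occur; with more labels, ties are broken here by whichever label was seen first.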

Shape context weaknesses
● Scale invariance
● Missing representation of dots
● Confusion between ligatures that vary only in dots.

New Features
● Sizes of connected components
● Locations of connected components
  ○ above, below, or interleaving
  ○ Grid location

New Features
● Pixel-level properties:
  ○ weights of regions
  ○ fill ratio
● Length, Width, Aspect Ratio
  ○ Invariance to scanning resolution
  ○ Setting reference size
  ○ Histogram of widths and heights
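
A few of these properties can be computed from a binary image as sketched below. The exact definitions used in the thesis are not given on the slides, so this assumes that fill ratio is the ink fraction of the bounding box, aspect ratio is width over height, and region weights are the share of ink in each cell of a grid laid over the bounding box. The image must contain at least one ink pixel.

```python
def shape_features(image, grid=2):
    """Fill ratio, aspect ratio and per-region ink weights of a binary image."""
    rows = [r for r, line in enumerate(image) if any(line)]
    cols = [c for c in range(len(image[0])) if any(line[c] for line in image)]
    top, bottom, left, right = rows[0], rows[-1], cols[0], cols[-1]
    height, width = bottom - top + 1, right - left + 1
    weights = [0] * (grid * grid)
    for r in range(top, bottom + 1):
        for c in range(left, right + 1):
            if image[r][c]:
                # Grid cell of this ink pixel inside the bounding box.
                gr = min(grid - 1, (r - top) * grid // height)
                gc = min(grid - 1, (c - left) * grid // width)
                weights[gr * grid + gc] += 1
    ink = sum(weights)
    return {
        "fill_ratio": ink / (height * width),
        "aspect_ratio": width / height,
        "region_weights": [w / ink for w in weights],
    }

toy_image = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
features = shape_features(toy_image)
```

All three quantities are ratios, so they do not change when the same shape is scanned at a different resolution, apart from rasterisation effects.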

Classification Methodology
● Experiment set "1"
  ○ Spectral Hashing, reduction of number of comparisons
● Experiment set "2"
  ○ Random Forests, hierarchy by recognizing the number of characters
● Experiment "3"
  ○ Random Forests, classification of alphabet symbols

Classification Methodology
● Spectral Hashing (sunvid project):
  ○ Training Dataset (~80,000 samples)
  ○ Test Dataset (~20,000 samples)
  ○ Different combinations of number of bits, number of tables, tolerance bits (training different hash structures in parallel)
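
One plausible reading of the three parameters is sketched below; it is an assumption for illustration, not a description of the sunvid code. Each table indexes the training samples under its own hash function, and a query retrieves every sample whose code differs from the query code in at most `tolerance` bits in any table. The size of the retrieved set is the number of nearest-neighbor comparisons still needed.

```python
from itertools import combinations

def codes_within(code, n_bits, tolerance):
    """All integer codes at Hamming distance <= tolerance from `code`."""
    out = [code]
    for t in range(1, tolerance + 1):
        for bits in combinations(range(n_bits), t):
            flipped = code
            for b in bits:
                flipped ^= 1 << b  # flip one bit position
            out.append(flipped)
    return out

def build_table(codes):
    """Map each binary code to the indices of the training samples it holds."""
    table = {}
    for idx, code in enumerate(codes):
        table.setdefault(code, []).append(idx)
    return table

def candidates(query_codes, tables, n_bits, tolerance):
    """Union of the near-code buckets over all tables."""
    found = set()
    for code, table in zip(query_codes, tables):
        for near in codes_within(code, n_bits, tolerance):
            found.update(table.get(near, ()))
    return found

# Toy setup: 4 training samples, 3-bit codes, 2 tables (made-up codes).
table_a = build_table([0b000, 0b001, 0b111, 0b110])
table_b = build_table([0b101, 0b010, 0b111, 0b101])
to_compare = candidates((0b000, 0b010), [table_a, table_b], n_bits=3, tolerance=1)
```

Under this reading, more tables or more tolerance bits retrieve more candidates, trading a smaller reduction for a lower risk of missing the true nearest neighbor.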

Classification Methodology
● Random Forests (python milk):
  ○ Number of decision trees: 101
  ○ 70% of the attributes
  ○ 70% of the training samples
  ○ Reduced training dataset (~20,000 samples)
  ○ Test dataset of ~18,000 samples

Classification Methodology
● New feature vector
● Classifying based on number of characters
● Classifying the Alphabet Symbols

[Diagram: input, Random Forest classifier, then one of: 1-character classifier, 2-character classifier, 3+ character classifier]
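
The routing shown in the diagram can be sketched as below. All names and the toy classifiers are hypothetical; in the thesis setting the first stage would be the trained Random Forest and each branch a classifier restricted to ligatures of that length.

```python
def recognise(features, count_classifier, branch_classifiers):
    """Route a ligature to the classifier for its predicted character count."""
    branch = count_classifier(features)  # expected to be "1", "2" or "3+"
    return branch, branch_classifiers[branch](features)

# Hypothetical stand-ins so the sketch runs; real ones would be trained models.
def toy_count_classifier(features):
    n = features["n_chars_guess"]
    return "3+" if n >= 3 else str(n)

toy_branches = {
    "1": lambda f: "alphabet symbol",
    "2": lambda f: "two-character ligature",
    "3+": lambda f: "longer ligature",
}
result = recognise({"n_chars_guess": 4}, toy_count_classifier, toy_branches)
```

Each branch only has to search the classes of its own group, which is where the reduction in search space comes from.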

Experiments and Results
● Spectral Hashing Results "1"
  ○ Effect of changing the number of tables
  ○ 7-bit binary code, 2 tolerance bits

Experiments and Results
● Spectral Hashing Results "1"

Accuracy   Best Reduction    Hash (bits, tables, tolerance)
81.5%      37538 (47.2%)     7, 9, 1
81%        31553 (39.7%)     7, 7, 1
80.5%      23975 (30.1%)     8, 9, 1
79.5%      20736 (26.1%)     7, 4, 1
78%        18737 (23.6%)     8, 7, 1
76%        15392 (19.4%)     7, 3, 1

Experiments and Results
● Spectral Hashing Results "1"
● Significant reduction rates
  ○ Reduction down to 19% for a difference of 6% in accuracy.
  ○ Reduction down to 24% for a difference of 4% in accuracy.
  ○ Reduction down to 47.2% for no accuracy loss.
  ○ Observation: Accuracy slightly higher than 1-NN for reduction down to 57.6%.

Experiments and Results
● Random Forest Results "2"
● Accuracy of 78.7% for 1, 2, 3, 4+ labels
● Accuracy of 45.4% for 1, 2, 3, 4, 5+ labels
● Accuracy of 20.7% for 1, 2, 3, 4, 5, 6+ labels
● Even worse with more partitioning

Experiments and Results
● Random Forest Results "2"
● Confusion matrix for 1, 2, 3+: alphabet symbols can be separately classified.

test label / result    1       2      3+       Recall
1                      1131    88     14       91.9%
2                      16      94     531      17.2%
3+                     7       2      16627    99.9%
% true positives       98%     51%    96.8%    ___
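
The summary figures can be recomputed from the counts, taking rows as test labels and columns as predicted labels. In this sketch the column-wise values (correct predictions divided by all predictions of that label) agree with the "% true positives" row after rounding. The row-wise values for labels 1 and 2 come out somewhat lower than the Recall column as transcribed, so one of the transcribed counts or percentages may be off.

```python
labels = ["1", "2", "3+"]
# Rows: true (test) label; columns: predicted label, in the order of `labels`.
matrix = {
    "1": [1131, 88, 14],
    "2": [16, 94, 531],
    "3+": [7, 2, 16627],
}

def recall(label):
    """Correct predictions divided by all test samples with this true label."""
    row = matrix[label]
    return row[labels.index(label)] / sum(row)

def precision(label):
    """Correct predictions divided by all samples predicted as this label."""
    j = labels.index(label)
    return matrix[label][j] / sum(matrix[other][j] for other in labels)

total_samples = sum(sum(row) for row in matrix.values())
precision_pct = [round(100 * precision(l), 1) for l in labels]
recall_pct = [round(100 * recall(l), 1) for l in labels]
```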

Experiments and Results
● Alphabet symbols
  ○ Accuracy of 80.34% for Random Forests "3"
  ○ Accuracy of 98.74% for 1-NN classifier
  ○ 1-NN classifier can be used for recognition under class 1.
  ○ Over 30% of ligatures are individual characters.

Conclusion and Summary
● Feature vector can be improved.
● Spectral Hashing improved the efficiency of 1-NN: significant reduction of the search space.
● Random Forests: can be used to separate the 1-character alphabet symbols.
● Useful for overall performance improvement on real text data.

Future Work

Thank You

Questions?