Search space reduction for holistic ligature recognition in Urdu Nastaliq script (Bachelor thesis...
Holistic Recognition of Printed Arabic Script Ligatures
Akram El-Korashy
Supervised by: Dr. Faisal Shafait
Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI)
Kaiserslautern, Germany
Outline
● Introduction
○ Segmentation-free OCR for Arabic scripts
● Approaches Used
○ Feature Extraction and the Shape Context method
○ Machine Learning Techniques (Hierarchical classification, Spectral Hashing, Random Forests)
● Improvements and Methodology
○ Shape Context weaknesses
○ New Features (dots, sizes, pixel-level matching)
○ Classification Methodology
● Experiments and Results
● Conclusion and Summary
Akram El-Korashy, Segmentation-free OCR, 14.08.12 1
Segmentation-free OCR for Arabic scripts
● Nastalique writing: classify whole ligatures instead of individual characters.
● Over 20,000 valid ligatures in the Urdu language.
● Preprocessing becomes easier, but feature extraction and classification become harder.
Feature Extraction, Shape Context method
● Feature families: distribution of points, transformation methods, structural analysis.
● Nabocr: Shape Context features vector.
● Contour extraction.
● Shape Context is a shape descriptor proposed by Belongie et al.
Feature Extraction, Shape Context method
● 4 histograms from 4 quadrants.
● Each histogram is a sum of point histograms.
● Distance, Orientation
● Histogram: bins of ranges.
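The bullets above can be illustrated with a small sketch. It is not the Nabocr implementation: the function names, the bin counts, the use of linear (rather than log-polar) distance bins, and the choice to split contour points into quadrants around their centroid are all assumptions made for illustration.

```python
import math

def point_histogram(p, points, n_r, n_theta, r_max):
    """Histogram of where the other points lie relative to p (distance x orientation bins)."""
    hist = [0] * (n_r * n_theta)
    for q in points:
        if q == p:
            continue
        dx, dy = q[0] - p[0], q[1] - p[1]
        dist = math.hypot(dx, dy)
        angle = math.atan2(dy, dx) % (2 * math.pi)
        # Each bin covers a range of distances and a range of orientations.
        r_bin = min(int(n_r * dist / r_max), n_r - 1)
        t_bin = min(int(n_theta * angle / (2 * math.pi)), n_theta - 1)
        hist[r_bin * n_theta + t_bin] += 1
    return hist

def quadrant_shape_context(points, n_r=3, n_theta=4):
    """Concatenate 4 histograms, one per quadrant; each is a sum of point histograms."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    r_max = max(math.hypot(a[0] - b[0], a[1] - b[1]) for a in points for b in points) or 1.0
    quadrants = [[0] * (n_r * n_theta) for _ in range(4)]
    for p in points:
        # Assumed quadrant rule: position of the point relative to the centroid.
        quad = (0 if p[0] >= cx else 1) + (0 if p[1] >= cy else 2)
        hist = point_histogram(p, points, n_r, n_theta, r_max)
        quadrants[quad] = [a + b for a, b in zip(quadrants[quad], hist)]
    return [value for quad in quadrants for value in quad]

# Toy contour: the four corners of a square.
features = quadrant_shape_context([(0, 0), (2, 0), (0, 2), (2, 2)])
```

With these defaults the feature vector has 4 x 3 x 4 = 48 entries, and every contour point contributes one count per other point.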
Hierarchical Classification
● Decomposing a classification problem into a set of smaller problems.
● Useful with large numbers of categories.
● Efficiency of recognition.
● Can help improve accuracy.
● Independent set of features for each branch.
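As a minimal sketch of the idea, a hierarchical classifier can be written as a router that picks a branch, with each branch owning its own feature function and classifier. The class name, the toy router, and the sample fields ("width", "dots") are hypothetical and only serve to show the structure.

```python
class HierarchicalClassifier:
    """Route a sample to a branch, then classify it with that branch's own features."""

    def __init__(self, router, branches):
        self.router = router      # callable: sample -> branch key
        self.branches = branches  # dict: branch key -> (feature_fn, classify_fn)

    def predict(self, sample):
        key = self.router(sample)
        feature_fn, classify_fn = self.branches[key]
        # Only the chosen branch is evaluated, which is where the efficiency comes from.
        return key, classify_fn(feature_fn(sample))


# Toy setup (all rules are made up): narrow shapes go to the "single" branch.
clf = HierarchicalClassifier(
    router=lambda s: "single" if s["width"] < 20 else "multi",
    branches={
        "single": (lambda s: s["dots"], lambda dots: "dotted" if dots else "plain"),
        "multi": (lambda s: s["width"], lambda w: "long" if w > 40 else "short"),
    },
)
result = clf.predict({"width": 10, "dots": 1})
```

Each branch sees a smaller set of categories and can use features that would be useless for the other branches.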
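Spectral Hashing
● Fast nearest-neighbour (NN) technique
● Maps a feature vector to a binary code:
○ easily computed
○ small no. of bits
○ similarity mapping (similar vectors get similar codes)
● Calculating the binary code:
○ maximum variance directions (PCA)
○ sinusoidal eigenfunctions along each direction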
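The two steps named on this slide can be sketched as follows. This follows the commonly described spectral hashing recipe (project onto principal directions, then threshold sinusoids, lowest frequencies first); it is not the sunvid code used in the thesis, and the function names and the dictionary layout of the model are assumptions.

```python
import numpy as np

def train_spectral_hash(X, n_bits):
    """Fit PCA directions and pick the n_bits lowest-frequency sinusoidal modes."""
    mean = X.mean(axis=0)
    Xc = X - mean
    n_pc = min(n_bits, X.shape[1])
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    pcs = eigvecs[:, np.argsort(eigvals)[::-1][:n_pc]]  # maximum variance directions
    proj = Xc @ pcs
    mn = proj.min(axis=0)
    span = np.maximum(proj.max(axis=0) - mn, 1e-12)
    # Candidate (direction d, mode k) pairs ranked by frequency k / span[d].
    candidates = sorted((k / span[d], d, k) for d in range(n_pc) for k in range(1, n_bits + 1))
    modes = [(d, k) for _, d, k in candidates[:n_bits]]
    return {"mean": mean, "pcs": pcs, "mn": mn, "span": span, "modes": modes}

def spectral_hash(model, X):
    """Return an (n_samples, n_bits) array of 0/1 bits."""
    proj = (X - model["mean"]) @ model["pcs"] - model["mn"]
    bits = [
        np.sin(np.pi / 2 + k * np.pi * proj[:, d] / model["span"][d]) > 0
        for d, k in model["modes"]
    ]
    return np.stack(bits, axis=1).astype(np.uint8)

# Toy data standing in for feature vectors; 7 bits as in the experiments reported later.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
model = train_spectral_hash(X, n_bits=7)
codes = spectral_hash(model, X)
```

Nearby feature vectors tend to fall on the same side of each sinusoid, so they share most bits, which is what makes the code usable for fast candidate lookup.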
Random Forests
● Ensemble Classifier
Ensemble learning combines the predictions of different classifiers (decision trees): each tree casts an independent vote, and the majority vote is the prediction.
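The voting step can be sketched in a few lines. The "trees" here are hand-written stand-ins (simple threshold rules on a hypothetical "width" feature), not trained decision trees.

```python
from collections import Counter

def majority_vote(trees, sample):
    """Collect one vote per tree and return the most common label."""
    votes = [tree(sample) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Three toy "trees"; the thresholds and labels are made up.
toy_trees = [
    lambda s: "1" if s["width"] < 20 else "3+",
    lambda s: "1" if s["width"] < 15 else "3+",
    lambda s: "1" if s["width"] < 30 else "3+",
]
prediction = majority_vote(toy_trees, {"width": 18})
```

Using an odd number of trees avoids ties when there are two classes.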
Shape Context weaknesses
● Scale invariance
● Missing representation of dots
● Confusion between ligatures that differ only in their dots.
New Features
● Sizes of connected components
● Locations of connected components
○ above, below, or interleaving
○ Grid location
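A minimal sketch of these features on a binary image (list of rows, 1 = ink). Treating the largest component as the main body, using 8-connectivity, and deciding "above / below / interleaving" from vertical extents are assumptions, not details taken from the thesis.

```python
from collections import deque

def connected_components(img):
    """Return a list of components, each a list of (row, col) ink pixels (8-connectivity)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if not img[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and img[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            comps.append(pixels)
    return comps

def describe_components(img):
    """Size of the main body, plus (size, location) for every other component."""
    comps = sorted(connected_components(img), key=len, reverse=True)
    main = comps[0]
    top, bottom = min(y for y, _ in main), max(y for y, _ in main)
    others = []
    for comp in comps[1:]:
        c_top, c_bottom = min(y for y, _ in comp), max(y for y, _ in comp)
        if c_bottom < top:
            where = "above"
        elif c_top > bottom:
            where = "below"
        else:
            where = "interleaving"
        others.append((len(comp), where))
    return len(main), others

toy = [
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
]
main_size, dots = describe_components(toy)
```

In the toy image the main stroke has one small component above it and one below it, which is the kind of information the plain contour histogram does not carry.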
New Features
● Pixel-level properties:
○ weights of regions
○ fill ratio
● Length, Width, Aspect Ratio
○ Invariance to scanning resolution
○ Setting reference size
○ Histogram of widths and heights
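One possible reading of these pixel-level features, as a sketch: the fill ratio is taken as ink pixels over bounding-box area, region weights as the share of ink in each cell of a grid laid over the bounding box, and sizes are divided by a reference height so they do not depend on scanning resolution. The function name, the 2x2 grid, and the `ref_height` parameter are assumptions.

```python
def pixel_features(img, ref_height=1.0, grid=2):
    """Bounding-box based features of a binary image (list of rows, 1 = ink)."""
    ink = [(y, x) for y, row in enumerate(img) for x, v in enumerate(row) if v]
    top, bottom = min(y for y, _ in ink), max(y for y, _ in ink)
    left, right = min(x for _, x in ink), max(x for _, x in ink)
    height, width = bottom - top + 1, right - left + 1
    # Share of ink falling in each cell of a grid x grid partition of the bounding box.
    weights = [0.0] * (grid * grid)
    for y, x in ink:
        gy = min((y - top) * grid // height, grid - 1)
        gx = min((x - left) * grid // width, grid - 1)
        weights[gy * grid + gx] += 1 / len(ink)
    return {
        "fill_ratio": len(ink) / (height * width),
        "aspect_ratio": width / height,
        "rel_height": height / ref_height,  # size relative to the reference size
        "rel_width": width / ref_height,
        "region_weights": weights,
    }

feats = pixel_features([[1, 1, 1, 1], [1, 0, 0, 0]], ref_height=2.0)
```

Dividing by a reference size (for example a typical line height measured on the same page) is one way to keep width and height comparable across scans of different resolution.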
Classification Methodology
● Experiment set "1"
○ Spectral Hashing, reduction of number of comparisons
● Experiment set "2"
○ Random Forests, hierarchy by recognizing the no. of characters
● Experiment "3"
○ Random Forests, classification of alphabet symbols
Classification Methodology
● Spectral Hashing (sunvid project):
○ Training Dataset (~80,000 samples)
○ Test Dataset (~20,000 samples)
○ Different combinations of number of bits, number of tables, tolerance bits (training different hash structures in parallel)
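How the three parameters interact can be sketched as a multi-table lookup: a training sample becomes a candidate if, in any table, its code is within `tolerance` bit flips of the query's code, and only candidates are then compared by 1-NN. This is an illustrative sketch, not the sunvid API; how the thesis builds distinct tables is not stated, so the codes per table are simply passed in.

```python
from itertools import combinations

def neighbours_within(code, n_bits, tolerance):
    """All codes at Hamming distance <= tolerance from code."""
    out = [code]
    for t in range(1, tolerance + 1):
        for bits in combinations(range(n_bits), t):
            flipped = code
            for b in bits:
                flipped ^= 1 << b
            out.append(flipped)
    return out

def build_tables(codes_per_table):
    """codes_per_table[t][i] is the integer code of training sample i in table t."""
    tables = []
    for codes in codes_per_table:
        table = {}
        for i, code in enumerate(codes):
            table.setdefault(code, []).append(i)
        tables.append(table)
    return tables

def candidates(tables, query_codes, n_bits, tolerance):
    """Indices of training samples that need a full distance comparison."""
    found = set()
    for table, code in zip(tables, query_codes):
        for near in neighbours_within(code, n_bits, tolerance):
            found.update(table.get(near, ()))
    return found

# Toy example: 3 training samples, 2 tables, 3-bit codes.
tables = build_tables([[0b000, 0b001, 0b111], [0b010, 0b010, 0b101]])
hits = candidates(tables, [0b000, 0b101], n_bits=3, tolerance=1)
```

The reduction rate is then the number of candidates divided by the training set size; more tables and more tolerance bits raise recall of the true nearest neighbour but also raise the number of comparisons.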
Classification Methodology
● Random Forests (python milk):
○ Number of decision trees: 101
○ 70% of the attributes
○ 70% of the training samples
○ Reduced training dataset (~20,000 samples)
○ Test dataset of ~18,000 samples
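The sampling settings above can be sketched without any library. This does not use the milk API; it only shows what "70% of the attributes, 70% of the training samples" could mean per tree. Whether the thesis samples with or without replacement, and whether attributes are drawn per tree or per split, is not stated, so drawing once per tree without replacement is an assumption.

```python
import random

def draw_tree_subsets(n_samples, n_attributes, n_trees=101, fraction=0.7, seed=0):
    """For each tree, pick which training samples and which attributes it may use."""
    rng = random.Random(seed)
    k_samples = max(1, round(fraction * n_samples))
    k_attrs = max(1, round(fraction * n_attributes))
    subsets = []
    for _ in range(n_trees):
        sample_idx = sorted(rng.sample(range(n_samples), k_samples))
        attr_idx = sorted(rng.sample(range(n_attributes), k_attrs))
        subsets.append((sample_idx, attr_idx))
    return subsets

# Toy sizes; the real training set had roughly 20,000 samples.
subsets = draw_tree_subsets(n_samples=20, n_attributes=10)
```

Because each tree sees a different slice of the data, their errors are less correlated, which is what makes the majority vote useful.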
Classification Methodology
○ New features vector
○ Classifying based on no. of characters
○ Classifying the Alphabet Symbols
[Diagram labels: input; Random Forest classifier; 1-character classifier; 2-character classifier; 3+ character classifier]
Experiments and Results
● Spectral Hashing Results "1"
○ Effect of changing the number of tables
○ 7-bit binary code, 2 tolerance bits
Experiments and Results
● Spectral Hashing Results "1"
Accuracy   Best Reduction    Hash (bits, tables, tolerance)
81.5%      37538 (47.2%)     7, 9, 1
81%        31553 (39.7%)     7, 7, 1
80.5%      23975 (30.1%)     8, 9, 1
79.5%      20736 (26.1%)     7, 4, 1
78%        18737 (23.6%)     8, 7, 1
76%        15392 (19.4%)     7, 3, 1
Experiments and Results
● Spectral Hashing Results "1"
● Significant reduction rates in the number of comparisons
○ Reduction down to 19% for a difference of 6% in accuracy.
○ Reduction down to 24% for a difference of 4% in accuracy.
○ Reduction down to 47.2% for no accuracy loss.
○ Observation: accuracy slightly higher than 1-NN for reduction down to 57.6%.
Experiments and Results
● Random Forest Results "2"
● Accuracy of 78.7% for 1, 2, 3, 4+ labels
● Accuracy of 45.4% for 1, 2, 3, 4, 5+ labels
● Accuracy of 20.7% for 1, 2, 3, 4, 5, 6+ labels
● Even worse with more partitioning
Experiments and Results
● Random Forest Results "2"
● Confusion matrix for 1, 2, 3+: alphabet symbols can be separately classified.
test label / result      1       2      3+    Recall
1                     1131      88      14    91.9%
2                       16      94     531    17.2%
3+                       7       2   16627    99.9%
% true positives       98%     51%   96.8%    ___
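For readers who want to recompute the two summary figures, a short sketch: the row-wise ratio is the recall per test label, and the column-wise ratio corresponds to the "% true positives" row. The counts are copied from the transcribed matrix; the row-wise values this yields need not match the slide's Recall column exactly, since the counts are only as reliable as the transcript.

```python
def recall_and_precision(matrix):
    """matrix[i][j] = number of samples with test label i that were classified as j."""
    n = len(matrix)
    recall = [matrix[i][i] / sum(matrix[i]) for i in range(n)]
    precision = [matrix[i][i] / sum(row[i] for row in matrix) for i in range(n)]
    return recall, precision

labels = ["1", "2", "3+"]
confusion = [
    [1131, 88, 14],
    [16, 94, 531],
    [7, 2, 16627],
]
recall, precision = recall_and_precision(confusion)
for label, r, p in zip(labels, recall, precision):
    print(f"{label}: recall {100 * r:.1f}%, true positives among predictions {100 * p:.1f}%")
```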
Experiments and Results
● Alphabet symbols
○ Accuracy of 80.34% for Random Forests "3"
○ Accuracy of 98.74% for 1-NN classifier
○ 1-NN classifier can be used for recognition under class 1.
○ Over 30% of ligatures are individual characters.
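For reference, a 1-NN classifier in its simplest form: the query takes the label of the closest training vector. Euclidean distance and the toy labels are assumptions; the thesis may use a different distance on its feature vectors.

```python
import math

def nearest_neighbour(train, query):
    """train: list of (feature_vector, label). Returns the label of the closest vector."""
    best_label, best_dist = None, math.inf
    for vector, label in train:
        dist = math.dist(vector, query)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Toy training set with made-up two-dimensional features and labels.
train = [((0.0, 0.0), "alef"), ((1.0, 1.0), "bay"), ((0.0, 2.0), "jeem")]
label = nearest_neighbour(train, (0.9, 0.8))
```

The cost is one distance computation per training sample, which is why restricting 1-NN to the alphabet-symbol class, or to hash candidates, matters for speed.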
Conclusion and Summary
● Features vector can be improved.
● Spectral Hashing improved the efficiency of 1-NN: significant reduction in the number of comparisons.
● Random Forests: can be used to separate the 1-character alphabet symbols.
● Useful for overall performance improvement on real text data.
Future Work
Thank You
Questions?