Sindhi Optical Character Recognition
By: Mutee U Rahman, Muhammad Rafi, Waleed Butt
سنڌي عڪسي اکرن جي سڃاڻپ
Summary of the Project
◦A total of 15 main bodies are considered
◦Due to complications, diacritics are not considered
◦Tesseract and Decision Tree training models were generated and tested
◦Accuracy is calculated by counting the correctly generated IDs
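The counting step above can be sketched as follows. This is a minimal illustration, assuming each test string has one expected syllable ID and the recognizer emits one ID per string; the function name `id_accuracy` and the sample lists are hypothetical and not taken from the project.

```python
def id_accuracy(expected_ids, predicted_ids):
    """Return the percentage of test strings whose predicted ID matches the expected ID."""
    if len(expected_ids) != len(predicted_ids):
        raise ValueError("expected and predicted ID lists must have the same length")
    if not expected_ids:
        raise ValueError("cannot compute accuracy on an empty test set")
    # Count positions where the generated ID is the correct one.
    correct = sum(1 for e, p in zip(expected_ids, predicted_ids) if e == p)
    return 100.0 * correct / len(expected_ids)


# Hypothetical example: four test strings, the last one misrecognized.
expected = [502, 503, 504, 505]
predicted = [502, 503, 504, 507]
print(id_accuracy(expected, predicted))
```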
Data Description
Data Set-I
◦15 main bodies
◦35 tokens of training data
◦10 tokens of testing data
Data Set-II
◦56 random MBs
Syllable  ID   Total of Strings  Correctly Recognized
با        502  10                10
بد        503  10                10
ٻو        504  10                10
د         505  10                10
ني        506  10                10
۽         507  10                1
هو        508  10                10
جي        509  10                10
خو        510  10                10
۾         511  10                10
ن         512  10                10
ر         513  10                10
سا        514  10                10
س         515  10                10
و         516  10                10
ي         517  10                10
Subtotal       160               151
Accuracy: 94.375%
Tesseract Recognition Results on Data-Set I (Test Data)
Tesseract Accuracy Results on Data-Set II Data-File
100% Accuracy on random data file
Syllable  ID   Total of Strings
با        502  5
بد        503  0
ٻو        504  1
د         505  4
ني        506  6
۽         507  5
هو        508  0
جي        509  3
خو        510  5
۾         511  6
ن         512  7
ر         513  6
سا        514  4
س         515  3
و         516  5
ي         517  1
Total          56
Decision Tree Results
Syllable  ID   Total of Strings  Correctly Recognized
با        502  10                10
بد        503  10                10
ٻو        504  10                9
د         505  10                9
ني        506  10                8
۽         507  10                1
هو        508  10                8
جي        509  10                9
خو        510  10                10
۾         511  10                9
ن         512  10                10
ر         513  10                9
سا        514  10                10
س         515  10                10
و         516  10                9
ي         517  10                8
Subtotal       160               139
Accuracy: 86.875%
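As a cross-check of the subtotals reported for Data Set-I, the per-syllable counts from the Tesseract and Decision Tree result tables can be summed directly. This is only an arithmetic sketch; the names `summarize`, `TESSERACT_CORRECT` and `DT_CORRECT` are hypothetical, and the counts are copied from the tables in this transcript.

```python
# Correctly recognized strings per syllable ID (IDs 502..517, 10 test strings each),
# copied from the two Data Set-I result tables.
TESSERACT_CORRECT = [10, 10, 10, 10, 10, 1, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
DT_CORRECT = [10, 10, 9, 9, 8, 1, 8, 9, 10, 9, 10, 9, 10, 10, 9, 8]
STRINGS_PER_ID = 10


def summarize(correct_counts, strings_per_id=STRINGS_PER_ID):
    """Return (total strings, correctly recognized strings, accuracy in percent)."""
    total = strings_per_id * len(correct_counts)
    correct = sum(correct_counts)
    return total, correct, 100.0 * correct / total


for name, counts in (("Tesseract", TESSERACT_CORRECT), ("Decision Tree", DT_CORRECT)):
    total, correct, pct = summarize(counts)
    print(f"{name}: {correct}/{total} = {pct:.3f}%")
```

Both classifiers recognize only 1 of the 10 test strings for ID 507 (۽), so that single syllable accounts for 9 of the errors in each table.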
Preprocessing
Line Segmentation
◦Sample pages with different numbers of lines were given as input
◦All lines were extracted correctly (100%)
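The slides do not state how lines were extracted. One common approach is a horizontal projection profile: count the ink pixels in each pixel row and cut the page wherever a row is empty. The sketch below assumes a binarized page stored as a list of rows of 0 (background) and 1 (ink); the function name `segment_lines` and the toy page are hypothetical, and this is not necessarily the method used in the project.

```python
def segment_lines(image):
    """Split a binary page image into text lines.

    `image` is a list of rows; each row is a list of 0 (background) / 1 (ink).
    Returns (start_row, end_row) pairs with end_row exclusive.
    """
    # Horizontal projection profile: number of ink pixels in each row.
    profile = [sum(row) for row in image]
    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = y                  # first inked row of a new line
        elif ink == 0 and start is not None:
            lines.append((start, y))   # blank row closes the current line
            start = None
    if start is not None:
        lines.append((start, len(profile)))  # line runs to the bottom edge
    return lines


# Toy page: two text lines separated by one blank row.
page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
print(segment_lines(page))
```

On real scans, dots and other marks above or below the main bodies can be separated from their line by a blank row, so a minimum gap height would likely be needed before cutting.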
Preprocessing
Syllable/Ligature Segmentation
◦Syllables/ligatures were successfully extracted from every page
◦Syllable/ligature segmentation performance: 80%
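The segmentation method itself is not described in the slides. A typical first step for cursive scripts is connected-component labeling on each extracted line, which yields one bounding box per connected ink region. The sketch below assumes a binary line image (list of rows of 0/1) and 8-connectivity; `connected_components` and the toy image are hypothetical. Components found this way would still need to be grouped into syllables/ligatures, since a main body and its dots are separate regions.

```python
from collections import deque


def connected_components(image):
    """Return bounding boxes (top, left, bottom, right), inclusive, of 8-connected ink regions."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    # Sindhi is written right to left, so order components by right edge, descending.
    boxes.sort(key=lambda b: -b[3])
    return boxes


# Toy line image with two separate ink regions.
line = [
    [1, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 1],
]
print(connected_components(line))
```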
Preprocessing
Main Body (MB)
◦We selected 15 MBs from the Sindhi alphabet
◦We were not able to isolate diacritics, hence not all MBs are correctly identifiable
Total main bodies  Correctly classified as main bodies  % Accuracy
15                 12                                   80%
Preprocessing
Diacritics
◦We were not able to extract diacritics from the text.
Conclusion
◦Tesseract accuracy is 94.4% and Decision Tree (DT) accuracy is 86.9% on Data Set-I
◦On Data Set-II, Tesseract accuracy is 100%
◦Line extraction accuracy is 100%; syllable/ligature segmentation accuracy is 80%