Sindhi Optical Character Recognition

By: Mutee U Rahman, Muhammad Rafi, Waleed Butt

سنڌي عڪسي اکرن جي سڃاڻپ

Summary of the Project

A total of 15 main bodies are considered

Due to complications, diacritics are not considered

Tesseract and Decision Tree training models were generated and tested

Accuracy is calculated by counting the correctly generated IDs
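The accuracy measure described above (counting correctly generated IDs) can be sketched as follows. This is a minimal illustration, not the authors' scoring script: the function name `id_accuracy` and the sample predictions are hypothetical, and it assumes each test string has exactly one expected ID and one predicted ID.

```python
def id_accuracy(expected_ids, predicted_ids):
    """Return the percentage of positions where the predicted ID equals the expected ID."""
    if len(expected_ids) != len(predicted_ids):
        raise ValueError("expected and predicted ID lists must have the same length")
    if not expected_ids:
        raise ValueError("no IDs to score")
    correct = sum(1 for e, p in zip(expected_ids, predicted_ids) if e == p)
    return 100.0 * correct / len(expected_ids)


# Hypothetical recognizer output for ten test strings of syllable ID 502;
# the predicted values are made up for illustration only.
expected = [502] * 10
predicted = [502, 502, 503, 502, 502, 502, 502, 505, 502, 502]
print(id_accuracy(expected, predicted))  # 80.0
```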

Data Description

Data Set-I

15 main bodies

35 Tokens of Training Data

10 Tokens of Testing Data

Data Set-II

56 random MBs

Tesseract Recognition Results on Data-Set I (Test Data)

Syllable   ID    Total Strings   Correctly Recognized
با         502   10              10
بد         503   10              10
ٻو         504   10              10
د          505   10              10
ني         506   10              10
۽          507   10              1
هو         508   10              10
جي         509   10              10
خو         510   10              10
۾          511   10              10
ن          512   10              10
ر          513   10              10
سا         514   10              10
س          515   10              10
و          516   10              10
ي          517   10              10
Subtotal         160             151

Accuracy: 94.375%

Tesseract Accuracy Results on Data-Set II Data-File

100% accuracy on random data file

Syllable   ID    Count
با         502   5
بد         503   0
ٻو         504   1
د          505   4
ني         506   6
۽          507   5
هو         508   0
جي         509   3
خو         510   5
۾          511   6
ن          512   7
ر          513   6
سا         514   4
س          515   3
و          516   5
ي          517   1

Total 56

Decision Tree Results

Syllable   ID    Total Strings   Correctly Recognized
با         502   10              10
بد         503   10              10
ٻو         504   10              9
د          505   10              9
ني         506   10              8
۽          507   10              1
هو         508   10              8
جي         509   10              9
خو         510   10              10
۾          511   10              9
ن          512   10              10
ر          513   10              9
سا         514   10              10
س          515   10              10
و          516   10              9
ي          517   10              8
Subtotal         160             139

Accuracy: 86.875%
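The subtotals and accuracies of the two Data-Set I result tables can be re-derived from the per-syllable counts. The sketch below copies the "Correctly Recognized" values from the Tesseract and Decision Tree tables; the variable and function names are illustrative, and it assumes 10 test strings per syllable ID, as the tables show.

```python
STRINGS_PER_SYLLABLE = 10  # each syllable ID has 10 test strings in both tables

# Correctly recognized strings per syllable ID, copied from the result tables.
tesseract_correct = {
    502: 10, 503: 10, 504: 10, 505: 10, 506: 10, 507: 1, 508: 10, 509: 10,
    510: 10, 511: 10, 512: 10, 513: 10, 514: 10, 515: 10, 516: 10, 517: 10,
}
decision_tree_correct = {
    502: 10, 503: 10, 504: 9, 505: 9, 506: 8, 507: 1, 508: 8, 509: 9,
    510: 10, 511: 9, 512: 10, 513: 9, 514: 10, 515: 10, 516: 9, 517: 8,
}


def summarize(correct_by_id):
    """Return (total strings, correct strings, accuracy in percent)."""
    total = STRINGS_PER_SYLLABLE * len(correct_by_id)
    correct = sum(correct_by_id.values())
    return total, correct, 100.0 * correct / total


for name, counts in (("Tesseract", tesseract_correct),
                     ("Decision Tree", decision_tree_correct)):
    total, correct, pct = summarize(counts)
    print(f"{name}: {correct}/{total} = {pct:.3f}%")

# Syllable IDs on which the Decision Tree scored below Tesseract.
dt_worse = [sid for sid in tesseract_correct
            if decision_tree_correct[sid] < tesseract_correct[sid]]
print(dt_worse)
```

Both classifiers recognize only 1 of 10 strings for ID 507, so that syllable accounts for most of the Tesseract errors.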

Preprocessing Line Segmentation

◦ Sample pages with different numbers of lines were given for line segmentation

◦ All lines were extracted correctly (100%)
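The slides do not say which line segmentation method was used. One common approach is a horizontal projection profile: count the ink pixels in each row and split the page wherever a row is empty. The sketch below assumes the page is already binarized into a list of rows of 0 (background) and 1 (ink); the function name `segment_lines` and the tiny sample page are hypothetical.

```python
def segment_lines(image):
    """Return (start_row, end_row) pairs, end exclusive, for each run of rows containing ink.

    `image` is a list of rows; each row is a list of 0 (background) or 1 (ink).
    """
    profile = [sum(row) for row in image]  # ink pixels per row
    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = y                      # a text line begins
        elif ink == 0 and start is not None:
            lines.append((start, y))       # a text line ends at the first empty row
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(image)))
    return lines


# Toy page with two text lines separated by two empty rows.
page = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 0],
]
print(segment_lines(page))  # [(1, 3), (5, 7)]
```

On real scans a small threshold in place of `ink > 0` would likely be needed to tolerate noise, and dots placed above or below a line may form short separate runs that have to be merged with the neighbouring line.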

Preprocessing Syllable/Ligature Segmentation

◦ Syllables/ligatures were successfully extracted from every page

◦ Syllable/ligature segmentation accuracy: 80%
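The slides do not describe how syllables/ligatures were extracted. A frequently used technique for this kind of segmentation is connected-component labeling on a binarized text line, sketched below as an assumption rather than the authors' method. The function name `connected_components` and the sample line are hypothetical; pixels are 0 (background) or 1 (ink), and 8-connectivity is assumed.

```python
from collections import deque


def connected_components(image):
    """Return bounding boxes (top, left, bottom, right), inclusive, one per 8-connected ink blob."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    # Sindhi is written right to left, so order blobs by right edge, rightmost first.
    boxes.sort(key=lambda b: -b[3])
    return boxes


# Toy text line containing three separate blobs.
line = [
    [1, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 1],
]
print(connected_components(line))  # [(0, 5, 2, 6), (1, 3, 1, 3), (0, 0, 1, 1)]
```

In this sketch a detached mark such as a dot comes out as its own box, so a further grouping step would be needed to attach it to a main body.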

Preprocessing Main Body (MB)

◦ We selected 15 MBs from the Sindhi alphabet

◦ We were not able to isolate diacritics, hence the MBs are not always correctly identifiable

Total main bodies   Correctly classified as main bodies   % Accuracy
15                  12                                    80%

Preprocessing Diacritics

◦ We were not able to extract diacritics from the text.

Conclusion

On Dataset-I, Tesseract accuracy is 94.4% and Decision Tree (DT) accuracy is 86.9%

On Dataset-II, Tesseract accuracy is 100%

Line extraction accuracy is 100%; syllable/ligature segmentation accuracy is 80%
