Lessons from Indic OCR Development

National Conference on Free Software Nishad T R NIT, Calicut http://www.himili.com/ocr/ Lessons from Indic OCR Development

description

Talk about Tesseract-OCR system for Malayalam in National Conference on Free Software

Transcript of Lessons from Indic OCR Development

Page 1: Lessons from Indic OCR Development

National Conference on Free Software

Nishad T R
NIT, Calicut
http://www.himili.com/ocr/

Lessons from Indic OCR Development

Page 2: Lessons from Indic OCR Development

Overview

History and Evolution of OCR
When, Where, Why and How of OCR
Selection of an OCR Engine and other gears
Putting it all together, and why
Tesseract architectural style
Challenges in Indic OCR
Lessons learned and applied
Where is it NOW?

Page 3: Lessons from Indic OCR Development

Apr 13, 2023 3

OCR in General

Engine
Training Data
Input Tools
Output formatting tools

Page 4: Lessons from Indic OCR Development

Three competitors

Ocrad
Ocrad is the GNU OCR program. It was written by Antonio Diaz Diaz and is licensed under GPL.

GOCR
GOCR is an OCR program written by Joerg Schulenburg and others. It is licensed under GPL.

Tesseract
Under the sponsorship of Google, Tesseract was made open source in 2006.

Page 5: Lessons from Indic OCR Development

And how they performed

Page 6: Lessons from Indic OCR Development

Again how they performed

Page 7: Lessons from Indic OCR Development

And the winner is ….

Tesseract gives extremely good output at a reasonable speed. It is the clear overall winner of the test. The only caveat is that one absolutely must convert the input to bitonal.

Ocrad gives reasonable output at extremely high speed. It can be useful in applications where speed is more important than accuracy.

GOCR gives poor output at a slow speed.
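The bitonal caveat for Tesseract can be illustrated with a minimal sketch of global thresholding. The talk does not say which binarization method was used, so the fixed threshold of 128, the function name `to_bitonal`, and the list-of-rows image representation are all assumptions made for illustration.

```python
def to_bitonal(gray_rows, threshold=128):
    """Map 8-bit grayscale pixels to 0 (black) or 255 (white).

    gray_rows is a list of rows, each a list of ints in 0..255.
    The default threshold of 128 is an illustrative assumption,
    not a value taken from the talk.
    """
    return [
        [255 if pixel >= threshold else 0 for pixel in row]
        for row in gray_rows
    ]


# A tiny hypothetical 2x3 grayscale image.
sample = [
    [12, 200, 130],
    [127, 128, 255],
]
bitonal = to_bitonal(sample)
```

In practice a scanned page would normally be binarized with an imaging tool before it is handed to the OCR engine; the sketch only shows what "bitonal" means at the pixel level.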

Page 8: Lessons from Indic OCR Development

Development Process Evolution

Fostering Contributions
  developer focus and avoiding starvation
  code, code review, documentation, support

Recognizing Ego
  trust and good intentions
  beware of maniacal focus

Limits of volunteerism
  eight knives and an apple (dining developer problem)
  eight knives and a pumpkin
  eight pumpkins and no knives

Page 9: Lessons from Indic OCR Development

How Debayan tamed Matra

http://debayanin.googlepages.com/hackingtesseract

Page 10: Lessons from Indic OCR Development

And how they performed

To train for another language, you have to create 8 data files in the tessdata subdirectory. Language codes follow the ISO 639-3 standard.

tessdata/xxx.freq-dawg
tessdata/xxx.word-dawg
tessdata/xxx.user-words
tessdata/xxx.inttemp
tessdata/xxx.normproto
tessdata/xxx.pffmtable
tessdata/xxx.unicharset
tessdata/xxx.DangAmbigs
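As a rough aid for checking a training run, the eight file names can be generated from a language code and compared against what is on disk. This is only a sketch: the helper names `required_tessdata_files` and `missing_tessdata_files` are hypothetical, and `mal` is used as the ISO 639-3 code for Malayalam.

```python
from pathlib import Path

# Suffixes of the eight data files named on the slide.
TESSDATA_SUFFIXES = (
    "freq-dawg",
    "word-dawg",
    "user-words",
    "inttemp",
    "normproto",
    "pffmtable",
    "unicharset",
    "DangAmbigs",
)


def required_tessdata_files(lang):
    """Return the eight file names expected for one language code."""
    return [f"{lang}.{suffix}" for suffix in TESSDATA_SUFFIXES]


def missing_tessdata_files(tessdata_dir, lang):
    """List the expected files that are not present in tessdata_dir."""
    root = Path(tessdata_dir)
    return [
        name
        for name in required_tessdata_files(lang)
        if not (root / name).is_file()
    ]


expected = required_tessdata_files("mal")
```

Calling `missing_tessdata_files("tessdata", "mal")` after training would then return an empty list only when all eight files exist.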

Page 11: Lessons from Indic OCR Development

The BOX File concept

Command

tesseract fontfile.tif fontfile batch.nochop makebox

Sample Box

അ 8 682 53 703
ആ 62 676 112 703
ഇ 121 676 155 705
ഈ 165 677 220 705
ഉ 232 677 256 704
ഊ 265 677 313 705
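The sample box entries can be read with a few lines of code. The sketch below assumes each entry is a character followed by four integers in the order left, bottom, right, top, measured from the bottom-left corner of the image; that ordering matches the sample values but is not stated on the slide. The names `Box`, `parse_box_line`, and `parse_box_text` are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """One box file entry: a character and its bounding box."""
    char: str
    left: int
    bottom: int
    right: int
    top: int

    @property
    def width(self):
        return self.right - self.left

    @property
    def height(self):
        return self.top - self.bottom


def parse_box_line(line):
    """Parse one 'char left bottom right top' entry (assumed field order)."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    char, *coords = parts
    return Box(char, *(int(value) for value in coords))


def parse_box_text(text):
    """Parse every non-empty line of a box file's contents."""
    return [parse_box_line(line) for line in text.splitlines() if line.strip()]


# The first two entries from the sample above.
SAMPLE = """\
അ 8 682 53 703
ആ 62 676 112 703
"""
boxes = parse_box_text(SAMPLE)
```

A parser like this makes it easy to spot boxes with zero or negative size before the corrected box file is fed back into training.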

Page 12: Lessons from Indic OCR Development

In Kindergarten

Page 13: Lessons from Indic OCR Development

His Teacher

JTesseract is the Tesseract GUI responsible for easing the training process. JTesseract is released under the Apache 2.0 license.

JTesseract currently works only on the Windows platform.

Developed by Ruwan Janapriya Egoda Gamage
http://www.janapriya.net

Features
  Visual box file editing
  Project based training process

Page 14: Lessons from Indic OCR Development

His Classmates

nopapaper

Page 15: Lessons from Indic OCR Development

LibTIFF

This software provides support for the Tag Image File Format (TIFF), a widely used format for storing image data. The latest version of the TIFF specification is available on-line in several different formats, as are a number of Technical Notes (TTN's).

Page 16: Lessons from Indic OCR Development

Windows GUI

Page 17: Lessons from Indic OCR Development

Questions?

Places to see:

Front Door
http://code.google.com/p/tesseract-ocr

jtesseract
http://code.google.com/p/jtesseract/

FreeOCR
http://www.freeocr.net

http://www.himili.com/ocr