Optical Character Recognition: the What, Why, and How

OCR: the What, Why, and How

Mackenzie Brooks, Metadata Librarian
Alston Cobourn, Digital Scholarship Librarian

What is Optical Character Recognition?

Wikipedia says: the mechanical or electronic conversion of scanned or photographed images of typewritten or printed text into machine-encoded/computer-readable text.

http://en.wikipedia.org/wiki/Optical_character_recognition

The Second Province of Sigma Chi, embracing the chapters at the University of Virginia, Uampden- Sidney, Roanoke, Randolph-Macon, University of North Carolina and Washington and Lec.held its annual convention here on Thursday night aud Friday. The delegates arrived on the evening trains on Thursday mid went immediately to the Lex- ington, where they put tip.

Why

● Textual analysis
● Your research, student work, DH projects
● ADA compliance

With the full text you can…

● Not all PDFs are created equal
  ○ Searchable
  ○ Extract text
● Textual analysis
  ○ Voyant Tools
  ○ TEI

Accessibility

● Screen reading
● Kurzweil
● Indexable
● General readability

How?

● Multiple tools, multiple methods
● Reads the text and tries to assign values to the characters it sees
● Matrix matching vs. feature extraction
● Character vs. whole word recognition
● Matches from internal lexicon/dictionary
● Language options available
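To make "matrix matching" concrete, here is a minimal Python sketch: a scanned glyph is compared pixel by pixel against stored templates, and the template with the most agreeing pixels wins. The 3x5 bitmaps and the names `TEMPLATES`, `match_score`, and `recognize` are invented for illustration; real OCR engines use far larger templates and more robust scoring.

```python
# Toy matrix matching: compare a glyph bitmap against stored templates
# and pick the template that agrees on the most pixels.
# All bitmaps below are hypothetical 3x5 examples, not real font data.

TEMPLATES = {
    "I": ["111",
          "010",
          "010",
          "010",
          "111"],
    "L": ["100",
          "100",
          "100",
          "100",
          "111"],
    "T": ["111",
          "010",
          "010",
          "010",
          "010"],
}


def match_score(glyph, template):
    """Count the pixels on which the glyph and the template agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph, templates=TEMPLATES):
    """Return (best_character, score) for the closest template."""
    best = max(templates, key=lambda ch: match_score(glyph, templates[ch]))
    return best, match_score(glyph, templates[best])


# A noisy "L": one stray dark pixel in the top-right corner.
noisy_l = ["101",
           "100",
           "100",
           "100",
           "111"]

print(recognize(noisy_l))
```

Feature extraction, by contrast, would describe the glyph in terms of strokes, loops, and intersections rather than raw pixels, which tends to cope better with unfamiliar fonts.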

Complications

● Fonts, formatting, line breaks, columns, italics, etc.
● Thin paper, writing on back of page
● Stray marks, printing errors, margin notes, footnotes
● Age of the language
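Line breaks are a good example of why these complications are hard. The sketch below, in Python, tries to repair words split by end-of-line hyphens such as "Lex- ington" in the sample above. It assumes a small word list is available (the `lexicon` set here is hypothetical) and joins the halves only when the result is a known word; otherwise it assumes the hyphen is real, as in a hyphenated name.

```python
import re


def rejoin_line_breaks(text, lexicon):
    """Repair 'Lex- ington'-style splits left by end-of-line hyphens."""
    def repair(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        if joined.lower() in lexicon:
            return joined           # a known word that was split across lines
        return left + "-" + right   # otherwise assume a genuine hyphen

    return re.sub(r"(\w+)- (\w+)", repair, text)


# Hypothetical one-word lexicon, for illustration only.
lexicon = {"lexington"}
sample = "Uampden- Sidney ... went immediately to the Lex- ington"
print(rejoin_line_breaks(sample, lexicon))
```

Note that this only handles the line break: the misread first letter in "Uampden" is a separate recognition error that a lexicon lookup alone will not fix.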

Recommendations

● High resolution
● Binarization
● Deskewing
● Orientation
● Crop out extraneous marks
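Binarization means reducing a grayscale scan to pure black and white before recognition. As a rough illustration, the Python sketch below picks a threshold with Otsu's method (one common choice, not necessarily what any tool in this talk uses) and marks every pixel at or below it as ink. The function names and the tiny 2x3 "image" are assumptions made up for the example.

```python
def otsu_threshold(pixels):
    """Pick the gray level (0-255) that best separates dark from light pixels."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        if weight_dark == 0:
            continue                      # no dark pixels yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break                         # every pixel is already "dark"
        sum_dark += level * histogram[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance: larger means a cleaner dark/light split.
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(image):
    """Return rows of 1 (ink) and 0 (paper) for a grayscale image (list of rows)."""
    flat = [value for row in image for value in row]
    threshold = otsu_threshold(flat)
    return [[1 if value <= threshold else 0 for value in row] for row in image]


# Hypothetical grayscale values: low numbers are dark ink, high are paper.
scan = [[30, 40, 200],
        [210, 35, 220]]
print(binarize(scan))
```

In practice an imaging library would do this on full-size scans; the point here is only what the step does to the pixel values.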

What tools can I use?

● Adobe Acrobat Pro
● Google Drive
● Tesseract

Adobe Acrobat Pro

● Via the Stable
● Over 40 languages (not just Latin characters)
● Automatically preprocesses the image
● Easy to use
● Enhances PDFs
● Will work with TIFs and JPGs

Google Drive

● Will process PDF, GIF, JPG, and PNG
● Recommends text be 10 pixels high
● Size limit: 2MB per file or 10 pages of PDF
● Upload Settings > Convert Text from Uploaded PDFs and Image Files

Project Naptha

http://projectnaptha.com/

● Automatically applies state-of-the-art computer vision algorithms on every image you see while browsing the web. The result is a seamless and intuitive experience, where you can highlight as well as copy and paste and even edit and translate the text formerly trapped within an image.


Tesseract

● OCR engine
● Developed at HP, 1985-1994; sponsored by Google since 2006
● Highest accuracy
● Command line or front end options

https://code.google.com/p/tesseract-ocr/
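For the command-line option, Tesseract's basic invocation takes an image, an output base name, and optionally a language with `-l`. The Python sketch below wraps that call; it assumes Tesseract is installed and on the PATH, and the file names (`page.tif`, `page`) and helper names are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, language="eng"):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG"""
    return ["tesseract", image_path, output_base, "-l", language]


def run_ocr(image_path, output_base, language="eng"):
    """Run Tesseract and return the path of the text file it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on the PATH")
    subprocess.run(
        build_tesseract_command(image_path, output_base, language),
        check=True,  # raise if Tesseract exits with an error
    )
    return output_base + ".txt"  # Tesseract appends .txt to the output base


# Show the command without running it (file names are placeholders).
print(" ".join(build_tesseract_command("page.tif", "page")))
```

Calling `run_ocr("page.tif", "page")` on a machine with Tesseract installed should leave the recognized text in `page.txt`.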



Tesseract Frontends

● FreeOCR
  ○ Windows
  ○ PDF, TIFF, JPG
  ○ 11 languages
  ○ Ability to input common errors
  ○ Functions as scanning software
● Other options:
  ○ https://code.google.com/p/tesseract-ocr/wiki/3rdParty

Free OCR

SWMSFC Is Set To Interview
New Candidates
Class Ring Orders
Now Being Taken
The SWMSFC, an autonomous
committee. constituted to raise funds
for a scholarship in memory of W&L
men who lost their lives in World
War ll, wffl interview candidates
for membership in the organization
Tuesday Oct. 11, at 7 p.m. in the
Student Union.

Adobe

SWMSFC Is Set To Interview New Candidates Class Ring Orders Now Being Taken The SWMSFC. an autonomous committee, cotultiluted to raise funds lor a scholarship In memory of W&L men who l09t their lives in World War II, will interview candidates for membership in the organilation Tuesday Oct. 11, at 7 p.m. in the Student. Union.

Tesseract

SWMSFC Is Set
To Interview
New Candidates
Class Ring Orders
Now Being Taken
The SWMSFC, an autonomous
committee, constituted to raise funds
for a scholarship in memory of W&L
men who lost their lives in World
War II, will interview candidates
for membership in the organization
Tuesday Oct. 11, at 7 p.m. in the
Student Union.

Google Drive

SWMSFC Is Set To Interview New Candidates
Class Ring Orders Now Being Taken
The SWMSFC, an autonomous committee, constituted to raise funds for a scholarship in memory of W&L men who lost their lives in World War II, will interview candidates for membership in the organization Tuesday Oct. 11, at 7 p.m. in the Student Union.
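One way to put a number on differences like those in the samples above is character-level edit distance: the count of single-character insertions, deletions, and substitutions needed to turn the OCR output into a reference text. The Python sketch below compares one short fragment. The reference string is an assumption (taken to be the correct reading of the page), and the function names are made up for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance divided by the length of the reference text."""
    return edit_distance(reference, ocr_output) / len(reference)


# Assumed correct reading versus the same fragment from the Free OCR sample.
reference = "War II, will interview candidates"
free_ocr = "War ll, wffl interview candidates"

print(edit_distance(reference, free_ocr))
print(round(character_error_rate(reference, free_ocr), 3))
```

Running the same comparison over a whole page for each tool would give a rough accuracy ranking, provided a hand-checked transcription is available to serve as the reference.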

Contact

Mackenzie Brooks, [email protected], x8659

Alston Cobourn, [email protected], x8657

[email protected]
