Optical Character Recognition: the What, Why, and How

OCR: the What, Why, and How
Mackenzie Brooks, Metadata Librarian
Alston Cobourn, Digital Scholarship Librarian


Delivered by Mackenzie Brooks and Alston Cobourn to Washington and Lee University. This presentation explains what OCR is, gives a variety of use cases, and covers the types of tools available.

Transcript of Optical Character Recognition: the What, Why, and How

Page 1: Optical Character Recognition: the What, Why, and How

OCR: the What, Why, and How

Mackenzie Brooks, Metadata Librarian
Alston Cobourn, Digital Scholarship Librarian

Page 2: Optical Character Recognition: the What, Why, and How

What is Optical Character Recognition?

Wikipedia says: the mechanical or electronic conversion of scanned or photographed images of typewritten or printed text into machine-encoded/computer-readable text.

http://en.wikipedia.org/wiki/Optical_character_recognition

Page 3: Optical Character Recognition: the What, Why, and How
Page 4: Optical Character Recognition: the What, Why, and How

The Second Province of Sigma Chi, embracing the chapters at the University of Virginia, Uampden- Sidney, Roanoke, Randolph-Macon, University of North Carolina and Washington and Lec.held its annual convention here on Thursday night aud Friday. The delegates arrived on the evening trains on Thursday mid went immediately to the Lex- ington, where they put tip.

Page 5: Optical Character Recognition: the What, Why, and How

Why

● Textual analysis
● Your research, student work, DH projects
● ADA compliance

Page 6: Optical Character Recognition: the What, Why, and How

With the full text you can…

● Not all PDFs are created equal
  ○ Searchable
  ○ Extract text
● Textual analysis
  ○ Voyant Tools
  ○ TEI

Page 7: Optical Character Recognition: the What, Why, and How

Accessibility

● Screen reading
● Kurzweil
● Indexable
● General readability

Page 8: Optical Character Recognition: the What, Why, and How

How?

● Multiple tools, multiple methods
● Reads the text and tries to assign values to the characters it sees
● Matrix matching vs. feature extraction
● Character vs. whole word recognition
● Matches from internal lexicon/dictionary
● Language options available
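As a rough illustration of the matrix matching idea listed above, the sketch below compares a scanned glyph, pixel by pixel, against a small set of stored templates and picks the closest one. The 3x3 bitmaps, the `TEMPLATES` table, and the function names are all hypothetical, chosen only to keep the example small; real OCR engines work on much larger glyph images and combine this with other methods.

```python
# Hypothetical 3x3 bitmap templates ("1" = ink, "0" = paper).
# Real engines store far larger glyph images for every character and font.
TEMPLATES = {
    "I": ["010",
          "010",
          "010"],
    "L": ["100",
          "100",
          "111"],
    "T": ["111",
          "010",
          "010"],
}


def match_score(glyph, template):
    """Count the pixels on which a scanned glyph and a template agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph):
    """Return the template character whose bitmap agrees with the glyph most."""
    return max(TEMPLATES, key=lambda ch: match_score(glyph, TEMPLATES[ch]))


# A "T" with one stray dark pixel in the bottom-left corner,
# as a noisy scan might produce.
noisy_t = ["111",
           "010",
           "110"]
print(recognize(noisy_t))
```

Feature extraction, by contrast, would describe each glyph by properties such as strokes, loops, and intersections rather than comparing raw pixels, which tends to cope better with unfamiliar fonts.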

Page 9: Optical Character Recognition: the What, Why, and How

Complications

● Fonts, formatting, line breaks, columns, italics, etc.
● Thin paper, writing on back of page
● Stray marks, printing errors, margin notes, footnotes
● Age of the language (e.g., historical spellings)

Page 10: Optical Character Recognition: the What, Why, and How

Recommendations

● High resolution
● Binarization
● Deskewing
● Orientation
● Crop out extraneous marks
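Binarization, one of the recommendations above, means reducing a grayscale scan to pure black and white before recognition. The sketch below is one possible approach, assuming the page is available as rows of 8-bit grayscale values; it uses Otsu's method to choose the threshold, which is a common technique but not necessarily what any tool named in this presentation does internally. The function names and the tiny sample `page` are illustrative only.

```python
def otsu_threshold(pixels):
    """Pick the gray level (0-255) that best separates dark and light pixels.

    `pixels` is a flat list of 8-bit grayscale values.
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    dark_count, dark_sum = 0, 0
    for level in range(256):
        dark_count += histogram[level]
        dark_sum += level * histogram[level]
        light_count = total - dark_count
        if dark_count == 0 or light_count == 0:
            continue
        dark_mean = dark_sum / dark_count
        light_mean = (total_sum - dark_sum) / light_count
        # Between-class variance: large when the two groups are well separated.
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def binarize(rows):
    """Map every pixel to 0 (ink) or 255 (paper) using the Otsu threshold."""
    threshold = otsu_threshold([p for row in rows for p in row])
    return [[0 if p <= threshold else 255 for p in row] for row in rows]


# A tiny made-up "scan": light paper (around 200) with a dark stroke (around 40).
page = [
    [200, 210, 40, 205],
    [198, 35, 45, 220],
    [215, 208, 30, 199],
]
for row in binarize(page):
    print(row)
```

Deskewing, orientation, and cropping are separate preprocessing steps and are not covered by this sketch.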

Page 11: Optical Character Recognition: the What, Why, and How

What tools can I use?

Adobe Acrobat Pro
Google Drive
Tesseract

Page 12: Optical Character Recognition: the What, Why, and How

Adobe Acrobat Pro

● Via the Stable
● Over 40 languages (not just Latin characters)
● Automatically preprocesses the image
● Easy to use
● Enhances PDFs
● Will work with TIFs and JPGs

Page 13: Optical Character Recognition: the What, Why, and How

Google Drive

● Will process PDF, GIF, JPG, and PNG
● Recommends text be 10 pixels high
● Size limit: 2 MB per file or 10 pages of PDF
● Upload Settings > Convert Text from Uploaded PDFs and Image Files

Page 14: Optical Character Recognition: the What, Why, and How

Project Naptha

http://projectnaptha.com/
● automatically applies state-of-the-art computer vision algorithms on every image you see while browsing the web. The result is a seamless and intuitive experience, where you can highlight as well as copy and paste and even edit and translate the text formerly trapped within an image.

Page 15: Optical Character Recognition: the What, Why, and How

Tesseract

● OCR engine
● 1985-1994 HP; 2006 Google
● Highest accuracy
● Command line or front end options

https://code.google.com/p/tesseract-ocr/
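For the command line option mentioned above, Tesseract's basic invocation takes an input image, an output base name, and optionally a language: `tesseract IMAGE OUTPUTBASE -l LANG`. The sketch below wraps that call in Python; the helper names and the file names `page.tif` and `page` are hypothetical placeholders, and it assumes the `tesseract` executable is installed and on the PATH.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run Tesseract, which writes the recognized text to OUTPUTBASE.txt."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(
        build_tesseract_command(image_path, output_base, lang),
        check=True,
    )
    return output_base + ".txt"


# Show the command that would be run for a hypothetical scan named page.tif.
print(" ".join(build_tesseract_command("page.tif", "page")))
```

Calling `run_ocr("page.tif", "page")` would then produce `page.txt` next to the base name, provided the image exists and the requested language data is installed.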

Page 16: Optical Character Recognition: the What, Why, and How

Tesseract Frontends

● FreeOCR
  ○ Windows
  ○ PDF, TIFF, JPG
  ○ 11 languages
  ○ Ability to input common errors
  ○ Functions as scanning software
● Other options:
  ○ https://code.google.com/p/tesseract-ocr/wiki/3rdParty

Page 17: Optical Character Recognition: the What, Why, and How

FreeOCR
SWMSFC Is Set To Interview
New Candidates
Class Ring Orders
Now Being Taken
The SWMSFC, an autonomous
committee. constituted to raise funds
for a scholarship in memory of W&L
men who lost their lives in World
War ll, wffl interview candidates
for membership in the organization
Tuesday Oct. 11, at 7 p.m. in the
Student Union.

Adobe
SWMSFC Is Set To Interview New Candidates Class Ring Orders Now Being Taken The SWMSFC. an autonomous committee, cotultiluted to raise funds lor a scholarship In memory of W&L men who l09t their lives in World War II, will interview candidates for membership in the organilation Tuesday Oct. 11, at 7 p.m. in the Student. Union.

Tesseract
SWMSFC Is Set
To Interview
New Candidates
Class Ring Orders
Now Being Taken
The SWMSFC, an autonomous
committee, constituted to raise funds
for a scholarship in memory of W&L
men who lost their lives in World
War II, will interview candidates
for membership in the organization
Tuesday Oct. 11, at 7 p.m. in the
Student Union.

Google Drive
SWMSFC Is Set To Interview New Candidates
Class Ring Orders Now Being Taken
The SWMSFC, an autonomous committee, constituted to raise funds for a scholarship in memory of W&L men who lost their lives in World War II, will interview candidates for membership in the organization Tuesday Oct. 11, at 7 p.m. in the Student Union.
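One way to put a number on a comparison like the one above is to score each tool's output against a trusted transcription. The sketch below uses Python's standard `difflib.SequenceMatcher` on a short excerpt from the four outputs. It rests on two assumptions: the Google Drive output, which reads cleanly here, is treated as the reference, and line breaks in the FreeOCR and Tesseract outputs are treated as spaces. A single sentence is far too small a sample to rank the tools.

```python
from difflib import SequenceMatcher

# Reference excerpt (assumption: the clean-reading Google Drive output is correct).
reference = "men who lost their lives in World War II, will interview candidates"

# The same excerpt as produced by each tool, with line breaks read as spaces.
outputs = {
    "FreeOCR": "men who lost their lives in World War ll, wffl interview candidates",
    "Adobe": "men who l09t their lives in World War II, will interview candidates",
    "Tesseract": "men who lost their lives in World War II, will interview candidates",
    "Google Drive": "men who lost their lives in World War II, will interview candidates",
}


def similarity(expected, actual):
    """Return a 0.0-1.0 similarity ratio between two strings."""
    return SequenceMatcher(None, expected, actual).ratio()


scores = {tool: similarity(reference, text) for tool, text in outputs.items()}
for tool, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{tool}: {score:.3f}")
```

A ratio of 1.0 means the excerpt matched the reference exactly; lower values reflect substituted or missing characters such as "wffl" for "will" or "l09t" for "lost".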