OCR: the What, Why, and How
Mackenzie Brooks, Metadata Librarian
Alston Cobourn, Digital Scholarship Librarian
What is Optical Character Recognition?
Wikipedia says: the mechanical or electronic conversion of scanned or photographed images of typewritten or printed text into machine-encoded/computer-readable text.
http://en.wikipedia.org/wiki/Optical_character_recognition
The Second Province of Sigma Chi, embracing the chapters at the University of Virginia, Uampden- Sidney, Roanoke, Randolph-Macon, University of North Carolina and Washington and Lec.held its annual convention here on Thursday night aud Friday. The delegates arrived on the evening trains on Thursday mid went immediately to the Lex- ington, where they put tip.
Why
● Textual analysis
● Your research, student work, DH projects
● ADA compliance
With the full text you can…
● Not all PDFs are created equal
  ○ Searchable
  ○ Extract text
● Textual analysis
  ○ Voyant Tools
  ○ TEI
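As a rough illustration of what searchable, extractable text makes possible, the sketch below counts word frequencies in a short OCR'd passage, a very small version of the kind of report a textual analysis tool produces. The function name `word_frequencies` is an assumption for illustration, and the sample is adapted from the Sigma Chi passage above with its OCR errors left in.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Return a Counter of lowercase word tokens found in text."""
    # Runs of letters and apostrophes count as words; digits and punctuation are dropped.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Adapted from the OCR sample above; "mid" and "tip" are OCR misreadings.
sample = ("The delegates arrived on the evening trains on Thursday "
          "mid went immediately to the Lexington, where they put tip.")

freq = word_frequencies(sample)
print(freq.most_common(3))
print("lexington" in freq)  # full-text search becomes a simple lookup
```

Note that misread words such as "mid" and "tip" are counted like any other word, so OCR quality directly affects any analysis built on the text.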
Accessibility
● Screen reading
● Kurzweil
● Indexable
● General readability
How?
● Multiple tools, multiple methods
● Reads the text and tries to assign values to the characters it sees
● Matrix matching vs. feature extraction
● Character vs. whole word recognition
● Matches from internal lexicon/dictionary
● Language options available
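Matrix matching can be sketched in a few lines: the scanned glyph is compared pixel by pixel against stored templates, and the template with the most agreeing pixels wins. This is a toy under stated assumptions, not how any particular OCR product is implemented; the 3x5 templates and the names `TEMPLATES`, `match_score`, and `recognize` are hypothetical.

```python
# Hypothetical 3x5 glyph templates; "#" is ink, "." is background.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}

def match_score(glyph, template):
    """Count the pixels where the scanned glyph and the template agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )

def recognize(glyph):
    """Return the template character that agrees with the glyph on the most pixels."""
    return max(TEMPLATES, key=lambda ch: match_score(glyph, TEMPLATES[ch]))

# A scanned "T" with one stray ink pixel on the right of the middle row.
noisy_t = ["###", ".#.", ".##", ".#.", ".#."]
print(recognize(noisy_t))
```

Because the comparison is tied to fixed pixel patterns, this approach tends to struggle with unfamiliar fonts; feature extraction instead compares properties such as lines, loops, and intersections.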
Complications
● Fonts, formatting, line breaks, columns, italics, etc.
● Thin paper, writing on back of page
● Stray marks, printing errors, margin notes, footnotes
● Era of the language (older spellings and vocabulary)
Recommendations
● High resolution
● Binarization
● Deskewing
● Orientation
● Crop out extraneous marks
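Binarization means reducing a grayscale scan to pure black and white so the OCR engine sees ink and paper only. The sketch below uses Otsu's method to pick the threshold, which is one common choice and an assumption here, not something the slides prescribe. It works on plain lists of 0-255 values; a real workflow would more likely use an image library. The names `otsu_threshold` and `binarize` and the tiny sample image are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the 0-255 threshold that maximizes between-class variance (Otsu's method)."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold

def binarize(image):
    """Map each pixel to 0 (ink) if it is at or below the threshold, else 255 (paper)."""
    flat = [value for row in image for value in row]
    threshold = otsu_threshold(flat)
    return [[0 if value <= threshold else 255 for value in row] for row in image]

# Tiny made-up grayscale image: dark ink (30-40) on light paper (200-220).
image = [[30, 40, 200],
         [210, 35, 220]]
print(binarize(image))
```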
What tools can I use?
Adobe Acrobat Pro
Google Drive
Tesseract
Adobe Acrobat Pro
● Via the Stable
● Over 40 languages (not just Latin characters)
● Automatically preprocesses the image
● Easy to use
● Enhances PDFs
● Will work with TIFs and JPGs
Google Drive
● Will process PDF, GIF, JPG, and PNG
● Recommends text be 10 pixels high
● Size limit: 2MB per file or 10 pages of PDF
● Upload Settings > Convert Text from Uploaded PDFs and Image Files
Project Naptha
http://projectnaptha.com/
● automatically applies state-of-the-art computer vision algorithms on every image you see while browsing the web. The result is a seamless and intuitive experience, where you can highlight as well as copy and paste and even edit and translate the text formerly trapped within an image.
Tesseract
● OCR Engine
● Developed at HP, 1985-1994; sponsored by Google since 2006
● Highest accuracy
● Command line or front end options
https://code.google.com/p/tesseract-ocr/
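For the command line option, Tesseract's basic invocation takes an input image, an output base name, and optionally a language code (`tesseract IMAGE OUTPUT_BASE -l LANG`), and writes the recognized text to `OUTPUT_BASE.txt`. The sketch below wraps that call from Python. The helper names and the file names `page.tif` and `page` are hypothetical, and the OCR step only runs if a `tesseract` executable is found on the system.

```python
import shutil
import subprocess

def build_tesseract_command(image_path, output_base, language="eng"):
    """Build the argument list for: tesseract IMAGE OUTPUT_BASE -l LANG"""
    return ["tesseract", image_path, output_base, "-l", language]

def run_ocr(image_path, output_base, language="eng"):
    """Run Tesseract if installed; return the path of the text file it writes, else None."""
    if shutil.which("tesseract") is None:
        return None  # Tesseract is not installed or not on the PATH
    subprocess.run(build_tesseract_command(image_path, output_base, language), check=True)
    return output_base + ".txt"

# Show the command that would be run for a hypothetical scan named page.tif.
print(build_tesseract_command("page.tif", "page"))
```

Calling `run_ocr("page.tif", "page")` on a machine with Tesseract installed would leave the recognized text in `page.txt`.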
Tesseract Frontends
● FreeOCR
  ○ Windows
  ○ PDF, TIFF, JPG
  ○ 11 languages
  ○ Ability to input common errors
  ○ Functions as scanning software
● Other options:
  ○ https://code.google.com/p/tesseract-ocr/wiki/3rdParty
FreeOCR
SWMSFC Is Set To Interview New Candidates Class Ring Orders Now Being Taken The SWMSFC, an autonomous committee. constituted to raise funds for a scholarship in memory of W&L men who lost their lives in World War ll, wffl interview candidates for membership in the organization Tuesday Oct. 11, at 7 p.m. in the Student Union.
Adobe
SWMSFC Is Set To Interview New Candidates Class Ring Orders Now Being Taken The SWMSFC. an autonomous committee, cotultiluted to raise funds lor a scholarship In memory of W&L men who l09t their lives in World War II, will interview candidates for membership in the organilation Tuesday Oct. 11, at 7 p.m. in the Student. Union.
Tesseract
SWMSFC Is Set To Interview New Candidates Class Ring Orders Now Being Taken The SWMSFC, an autonomous committee, constituted to raise funds for a scholarship in memory of W&L men who lost their lives in World War II, will interview candidates for membership in the organization Tuesday Oct. 11, at 7 p.m. in the Student Union.
Google Drive
SWMSFC Is Set To Interview New Candidates Class Ring Orders Now Being Taken The SWMSFC, an autonomous committee, constituted to raise funds for a scholarship in memory of W&L men who lost their lives in World War II, will interview candidates for membership in the organization Tuesday Oct. 11, at 7 p.m. in the Student Union.
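One way to put a number on the differences between these outputs is the character error rate: the edit distance between an OCR output and a reference transcription, divided by the reference length. The sketch below is a minimal version of that idea. The original scan is not reproduced here, so the reference phrase is an assumption: it is the reading on which the Tesseract and Google Drive outputs agree. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

def character_error_rate(reference, ocr_output):
    """Edit distance divided by the number of characters in the reference."""
    return edit_distance(reference, ocr_output) / len(reference)

# Assumed reference: the reading shared by the Tesseract and Google Drive outputs.
reference = "men who lost their lives in World War II, will interview candidates"
outputs = {
    "FreeOCR": "men who lost their lives in World War ll, wffl interview candidates",
    "Adobe": "men who l09t their lives in World War II, will interview candidates",
}

for tool, text in outputs.items():
    print(tool, edit_distance(reference, text), round(character_error_rate(reference, text), 3))
```

A single phrase from a single page says little about a tool in general; a fair comparison would need many pages checked against a hand-corrected transcription.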