Optical Character Recognition: the What, Why, and How
![Page 1: Optical Character Recognition: the What, Why, and How](https://reader034.fdocuments.us/reader034/viewer/2022051312/546e86eab4af9fb4268b4682/html5/thumbnails/1.jpg)
OCR: the What, Why, and How
Mackenzie Brooks, Metadata Librarian
Alston Cobourn, Digital Scholarship Librarian
![Page 2: Optical Character Recognition: the What, Why, and How](https://reader034.fdocuments.us/reader034/viewer/2022051312/546e86eab4af9fb4268b4682/html5/thumbnails/2.jpg)
What is Optical Character Recognition?
Wikipedia says: the mechanical or electronic conversion of scanned or photographed images of typewritten or printed text into machine-encoded/computer-readable text.
http://en.wikipedia.org/wiki/Optical_character_recognition
![Page 3: Optical Character Recognition: the What, Why, and How](https://reader034.fdocuments.us/reader034/viewer/2022051312/546e86eab4af9fb4268b4682/html5/thumbnails/3.jpg)
![Page 4: Optical Character Recognition: the What, Why, and How](https://reader034.fdocuments.us/reader034/viewer/2022051312/546e86eab4af9fb4268b4682/html5/thumbnails/4.jpg)
The Second Province of Sigma Chi, embracing the chapters at the University of Virginia, Uampden- Sidney, Roanoke, Randolph-Macon, University of North Carolina and Washington and Lec.held its annual convention here on Thursday night aud Friday. The delegates arrived on the evening trains on Thursday mid went immediately to the Lex- ington, where they put tip.
![Page 5: Optical Character Recognition: the What, Why, and How](https://reader034.fdocuments.us/reader034/viewer/2022051312/546e86eab4af9fb4268b4682/html5/thumbnails/5.jpg)
Why
● Textual analysis
● Your research, student work, DH projects
● ADA compliance
![Page 6: Optical Character Recognition: the What, Why, and How](https://reader034.fdocuments.us/reader034/viewer/2022051312/546e86eab4af9fb4268b4682/html5/thumbnails/6.jpg)
With the full text you can…
● Not all PDFs are created equal
  ○ Searchable
  ○ Extract text
● Textual analysis
  ○ Voyant Tools
  ○ TEI
![Page 7: Optical Character Recognition: the What, Why, and How](https://reader034.fdocuments.us/reader034/viewer/2022051312/546e86eab4af9fb4268b4682/html5/thumbnails/7.jpg)
Accessibility
● Screen reading
● Kurzweil
● Indexable
● General readability
![Page 8: Optical Character Recognition: the What, Why, and How](https://reader034.fdocuments.us/reader034/viewer/2022051312/546e86eab4af9fb4268b4682/html5/thumbnails/8.jpg)
How?
● Multiple tools, multiple methods
● Reads the text and tries to assign values to the characters it sees
● Matrix matching vs. feature extraction
● Character vs. whole word recognition
● Matches from internal lexicon/dictionary
● Language options available
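As a rough illustration of the matrix matching idea, the sketch below compares a tiny glyph bitmap against stored templates and picks the one that agrees on the most pixels. The 3x3 templates, the names `TEMPLATES`, `match_score`, and `matrix_match`, and the scoring rule are all assumptions made for illustration; real engines work on much larger bitmaps and are considerably more sophisticated.

```python
# Hypothetical 3x3 glyph templates (1 = ink, 0 = background).
# Real OCR engines use far larger bitmaps; these are illustrative only.
TEMPLATES = {
    "I": ((0, 1, 0),
          (0, 1, 0),
          (0, 1, 0)),
    "L": ((1, 0, 0),
          (1, 0, 0),
          (1, 1, 1)),
    "T": ((1, 1, 1),
          (0, 1, 0),
          (0, 1, 0)),
}


def match_score(glyph, template):
    """Count the pixels on which the glyph and the template agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def matrix_match(glyph):
    """Return the character whose template agrees with the glyph on the most pixels."""
    return max(TEMPLATES, key=lambda ch: match_score(glyph, TEMPLATES[ch]))


# A noisy "L": one stray ink pixel in the top-right corner.
noisy_l = ((1, 0, 1),
           (1, 0, 0),
           (1, 1, 1))

print(matrix_match(noisy_l))
```

Feature extraction, by contrast, would describe the glyph by properties such as strokes, loops, and intersections instead of comparing raw pixels, which tends to cope better with unfamiliar fonts.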
![Page 9: Optical Character Recognition: the What, Why, and How](https://reader034.fdocuments.us/reader034/viewer/2022051312/546e86eab4af9fb4268b4682/html5/thumbnails/9.jpg)
Complications
● Fonts, formatting, line breaks, columns, italics, etc.
● Thin paper, writing on back of page
● Stray marks, printing errors, margin notes, footnotes
● Age of the language (older spellings and vocabulary)
![Page 10: Optical Character Recognition: the What, Why, and How](https://reader034.fdocuments.us/reader034/viewer/2022051312/546e86eab4af9fb4268b4682/html5/thumbnails/10.jpg)
Recommendations
● High resolution
● Binarization
● Deskewing
● Orientation
● Crop out extraneous marks
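To make the binarization step concrete, here is a minimal sketch that turns a grayscale image into pure ink/background values using a single global threshold. The `binarize` function, the mean-based default threshold, and the sample `page` values are assumptions for illustration; dedicated tools typically use more robust, often adaptive, thresholding.

```python
def binarize(gray, threshold=None):
    """Convert rows of 0-255 grayscale values to 1 (ink) or 0 (background).

    If no threshold is given, the mean brightness of the image is used,
    which is a simplification chosen for this sketch.
    """
    pixels = [p for row in gray for p in row]
    if threshold is None:
        threshold = sum(pixels) / len(pixels)
    # Pixels darker than the threshold are treated as ink.
    return [[1 if p < threshold else 0 for p in row] for row in gray]


# Hypothetical 3x4 scan: light paper (about 240) with a few dark strokes (about 55).
page = [
    [240, 235, 60, 238],
    [242, 50, 55, 236],
    [239, 237, 58, 241],
]

for row in binarize(page):
    print(row)
```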
![Page 11: Optical Character Recognition: the What, Why, and How](https://reader034.fdocuments.us/reader034/viewer/2022051312/546e86eab4af9fb4268b4682/html5/thumbnails/11.jpg)
What tools can I use?
Adobe Acrobat Pro
Google Drive
Tesseract
![Page 12: Optical Character Recognition: the What, Why, and How](https://reader034.fdocuments.us/reader034/viewer/2022051312/546e86eab4af9fb4268b4682/html5/thumbnails/12.jpg)
Adobe Acrobat Pro
● Via the Stable
● Over 40 languages (not just Latin characters)
● Automatically preprocesses the image
● Easy to use
● Enhances PDFs
● Will work with TIFs and JPGs
![Page 13: Optical Character Recognition: the What, Why, and How](https://reader034.fdocuments.us/reader034/viewer/2022051312/546e86eab4af9fb4268b4682/html5/thumbnails/13.jpg)
Google Drive
● Will process PDF, GIF, JPG, and PNG
● Recommends text be at least 10 pixels high
● Size limit: 2MB per file or 10 pages of PDF
● Upload Settings > Convert Text from Uploaded PDFs and Image Files
![Page 14: Optical Character Recognition: the What, Why, and How](https://reader034.fdocuments.us/reader034/viewer/2022051312/546e86eab4af9fb4268b4682/html5/thumbnails/14.jpg)
Project Naptha
http://projectnaptha.com/
● automatically applies state-of-the-art computer vision algorithms on every image you see while browsing the web. The result is a seamless and intuitive experience, where you can highlight as well as copy and paste and even edit and translate the text formerly trapped within an image.
![Page 15: Optical Character Recognition: the What, Why, and How](https://reader034.fdocuments.us/reader034/viewer/2022051312/546e86eab4af9fb4268b4682/html5/thumbnails/15.jpg)
Tesseract
● OCR engine
● Developed at HP, 1985-1994; sponsored by Google since 2006
● Highest accuracy
● Command line or front end options
https://code.google.com/p/tesseract-ocr/
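For the command line option, the sketch below wraps the basic `tesseract <image> <output base> -l <language>` invocation in Python. The helper names `tesseract_command` and `run_ocr` and the file names `page.tif` and `page` are hypothetical; the sketch assumes Tesseract and the requested language data are installed and on the PATH.

```python
import shutil
import subprocess


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for a basic Tesseract run.

    Tesseract writes its text to <output_base>.txt.
    """
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run Tesseract if it is installed and return the path of the text file."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    return output_base + ".txt"


# Show the command that would be run for a hypothetical scan named page.tif.
print(" ".join(tesseract_command("page.tif", "page")))
```

Calling `run_ocr("page.tif", "page")` would then produce `page.txt` next to the working directory, provided the image exists.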
![Page 16: Optical Character Recognition: the What, Why, and How](https://reader034.fdocuments.us/reader034/viewer/2022051312/546e86eab4af9fb4268b4682/html5/thumbnails/16.jpg)
Tesseract Frontends
● FreeOCR
  ○ Windows
  ○ PDF, TIFF, JPG
  ○ 11 languages
  ○ Ability to input common errors
  ○ Functions as scanning software
● Other options:
  ○ https://code.google.com/p/tesseract-ocr/wiki/3rdParty
![Page 17: Optical Character Recognition: the What, Why, and How](https://reader034.fdocuments.us/reader034/viewer/2022051312/546e86eab4af9fb4268b4682/html5/thumbnails/17.jpg)
Free OCR
SWMSFC Is Set To Interview
New Candidates
Class Ring Orders
Now Being Taken
The SWMSFC, an autonomous
committee. constituted to raise funds
for a scholarship in memory of W&L
men who lost their lives in World
War ll, wffl interview candidates
for membership in the organization
Tuesday Oct. 11, at 7 p.m. in the
Student Union.

Adobe
SWMSFC Is Set To Interview New Candidates Class Ring Orders Now Being Taken The SWMSFC. an autonomous committee, cotultiluted to raise funds lor a scholarship In memory of W&L men who l09t their lives in World War II, will interview candidates for membership in the organilation Tuesday Oct. 11, at 7 p.m. in the Student. Union.

Tesseract
SWMSFC Is Set
To Interview
New Candidates
Class Ring Orders
Now Being Taken
The SWMSFC, an autonomous
committee, constituted to raise funds
for a scholarship in memory of W&L
men who lost their lives in World
War II, will interview candidates
for membership in the organization
Tuesday Oct. 11, at 7 p.m. in the
Student Union.

Google Drive
SWMSFC Is Set To Interview New Candidates
Class Ring Orders Now Being Taken
The SWMSFC, an autonomous committee, constituted to raise funds for a scholarship in memory of W&L men who lost their lives in World War II, will interview candidates for membership in the organization Tuesday Oct. 11, at 7 p.m. in the Student Union.
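One way to put a number on a comparison like the four outputs above is a character error rate: the edit distance between the OCR output and a trusted transcription, divided by the length of that transcription. The sketch below is an assumption about how such a comparison could be done, not the method used to prepare these slides; `edit_distance` and `character_error_rate` are illustrative names, and the reference phrase is assumed to be a hand-checked reading of the clipping.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text, reference):
    """Edit distance normalised by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(ocr_text, reference) / len(reference)


# Assumed hand-checked reference for one phrase of the clipping.
reference = "will interview candidates"
samples = {
    "Free OCR": "wffl interview candidates",
    "Tesseract": "will interview candidates",
}
for tool, text in samples.items():
    print(tool, character_error_rate(text, reference))
```

A single short phrase says little about overall accuracy; a fair comparison would score the full page, and ideally several pages, against a carefully proofread transcription.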