JSTOR & OCR - A Case Study
![Page 1: JSTOR & OCR - A Case Study](https://reader036.fdocuments.us/reader036/viewer/2022062322/5681472f550346895db46c04/html5/thumbnails/1.jpg)
JSTOR & OCR - A Case Study
Kiffany Francis
![Page 2: JSTOR & OCR - A Case Study](https://reader036.fdocuments.us/reader036/viewer/2022062322/5681472f550346895db46c04/html5/thumbnails/2.jpg)
What is JSTOR?
• “JSTOR is a not-for-profit organization with a dual mission to create and maintain a trusted archive of important scholarly journals, and to provide access to these journals as widely as possible.”
![Page 3: JSTOR & OCR - A Case Study](https://reader036.fdocuments.us/reader036/viewer/2022062322/5681472f550346895db46c04/html5/thumbnails/3.jpg)
JSTOR:
JSTOR - journal storage.
They are building a digital archive of journal back runs, some of which date back to the 1600s. JSTOR has converted over 10 million paper journal pages from over 240 journals representing more than 170 publishers. The JSTOR archive is available at more than 1,450 libraries.
![Page 4: JSTOR & OCR - A Case Study](https://reader036.fdocuments.us/reader036/viewer/2022062322/5681472f550346895db46c04/html5/thumbnails/4.jpg)
JSTOR: http://www.jstor.org
Each journal page digitized by JSTOR is processed by an OCR application. The resulting text files are used to support the full-text searching offered to JSTOR users.
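As a rough illustration of how per-page OCR text can support full-text searching, here is a minimal inverted-index sketch. It is not JSTOR's actual implementation: the page identifiers, the sample text, and the `build_index` and `search` helpers are all hypothetical.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page ids whose OCR text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the sorted page ids that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


# Hypothetical OCR output for two page images (assumption, not JSTOR data).
pages = {
    "page-001": "The archive holds journal back runs.",
    "page-002": "Each journal page is scanned and processed.",
}
index = build_index(pages)
```

A search returns page identifiers rather than text, so the user can be shown the scanned page image while the OCR text stays behind the scenes.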
![Page 5: JSTOR & OCR - A Case Study](https://reader036.fdocuments.us/reader036/viewer/2022062322/5681472f550346895db46c04/html5/thumbnails/5.jpg)
What is OCR?
Optical Character Recognition
It is the process that converts the text of a printed page or image into editable, digital text.
![Page 6: JSTOR & OCR - A Case Study](https://reader036.fdocuments.us/reader036/viewer/2022062322/5681472f550346895db46c04/html5/thumbnails/6.jpg)
What does OCR software do?
• The software analyzes the layout of the text.
• It determines the order of the paragraphs.
• Analysis of the characters begins.
• The software compares character groups (words) to the dictionary in the OCR application.
• When a match is found, the software writes the word to a text file.
![Page 7: JSTOR & OCR - A Case Study](https://reader036.fdocuments.us/reader036/viewer/2022062322/5681472f550346895db46c04/html5/thumbnails/7.jpg)
What does OCR software do?
If a match cannot be found…
• The software makes a reasonable assumption and flags the word with low confidence.
• If a word or character cannot be read at all, a default character is inserted as a placeholder.
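The matching and fallback behavior described on these two slides can be sketched roughly as follows. This is only an illustration under stated assumptions: real OCR engines work on character shapes and confidence scores, whereas this sketch uses Python's `difflib` for the "reasonable assumption" step. The `Token` class, the `resolve_word` function, the `~` placeholder, and the tiny dictionary are all hypothetical.

```python
from dataclasses import dataclass
from difflib import get_close_matches

PLACEHOLDER = "~"  # hypothetical default character for unreadable input
DICTIONARY = ["archive", "journal", "page", "scanned"]  # toy dictionary


@dataclass
class Token:
    text: str
    low_confidence: bool


def resolve_word(candidate, dictionary=DICTIONARY):
    """Resolve one recognized character group against the dictionary.

    `candidate` is None when the word could not be read at all.
    """
    if candidate is None:
        # Nothing readable: insert the placeholder character.
        return Token(PLACEHOLDER, True)
    if candidate.lower() in dictionary:
        # Exact dictionary match: write the word out as-is.
        return Token(candidate, False)
    # No exact match: take the closest dictionary word and flag it.
    guesses = get_close_matches(candidate.lower(), dictionary, n=1, cutoff=0.6)
    if guesses:
        return Token(guesses[0], True)
    # No plausible guess either: keep the raw reading, still flagged.
    return Token(candidate, True)
```

For example, a misread such as `joumal` would resolve to `journal` with the low-confidence flag set, which is the kind of word a later quality-control pass could review.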
![Page 8: JSTOR & OCR - A Case Study](https://reader036.fdocuments.us/reader036/viewer/2022062322/5681472f550346895db46c04/html5/thumbnails/8.jpg)
Problems with OCR
OCR does not handle certain kinds of text very well:
• Non-Arabic text
• Nonmodern type
• Small print
• Certain fonts
• Complex page layouts
![Page 9: JSTOR & OCR - A Case Study](https://reader036.fdocuments.us/reader036/viewer/2022062322/5681472f550346895db46c04/html5/thumbnails/9.jpg)
JSTOR: Production Process
The process begins at JSTOR in Ann Arbor, Michigan.
• Page-by-page examination of the journal run.
• Preservation concerns are addressed.
• Scanning guidelines are created.
• A production librarian and a serials specialist create indexing guidelines.
• The journal is shipped to a contractor to be scanned and described.
![Page 10: JSTOR & OCR - A Case Study](https://reader036.fdocuments.us/reader036/viewer/2022062322/5681472f550346895db46c04/html5/thumbnails/10.jpg)
JSTOR: Production Process
At the contractor facility:
• Physical journals are disbound and separated into pages sorted by issue.
• Each page is scanned in bitonal TIFF format at 600 dpi resolution.
• Page images are checked for marks, folds, and skewing.
• A table of contents file is added.
• If available, abstracts and keywords are added.
All digital files created by the contractor (page images and table of contents files) are written to CD-ROM and shipped back to JSTOR in Ann Arbor.
![Page 11: JSTOR & OCR - A Case Study](https://reader036.fdocuments.us/reader036/viewer/2022062322/5681472f550346895db46c04/html5/thumbnails/11.jpg)
JSTOR: Production Process
Rich Digital Masters:
Each page is scanned in bitonal TIFF format at 600 dpi. This is preferred because:
1. In 1994, there was some debate about whether 300 dpi or 600 dpi was better, given the storage space required. 600 dpi won out.
2. 600 dpi printers are now standard.
3. Resolutions higher than 600 dpi are not discernibly better for black-and-white text-based images.
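To see why storage space figured in the 300 dpi versus 600 dpi debate, the arithmetic can be sketched as below. The 6 x 9 inch page size is an assumption chosen for illustration, not a figure from the source, and the sizes are uncompressed; TIFF compression would shrink both numbers.

```python
def bitonal_page_bytes(width_in, height_in, dpi):
    """Uncompressed size in bytes of a 1-bit-per-pixel (bitonal) page image."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return (pixels + 7) // 8  # 8 pixels per byte, rounded up


# Hypothetical 6 x 9 inch journal page (assumption).
size_300 = bitonal_page_bytes(6, 9, 300)
size_600 = bitonal_page_bytes(6, 9, 600)
print(f"300 dpi: {size_300:,} bytes; 600 dpi: {size_600:,} bytes")
```

Doubling the resolution doubles the pixel count in each direction, so a 600 dpi scan needs four times the raw storage of a 300 dpi scan of the same page.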
![Page 12: JSTOR & OCR - A Case Study](https://reader036.fdocuments.us/reader036/viewer/2022062322/5681472f550346895db46c04/html5/thumbnails/12.jpg)
JSTOR: Production Process
Back at JSTOR - Ann Arbor:
• Files are uploaded from CD-ROM to JSTOR file servers.
• A quality control process verifies image and table-of-contents quality.
• After the quality check, each page image is processed by OCR software to create full text for searching.
• After further quality control, the title is announced to JSTOR participants.
![Page 13: JSTOR & OCR - A Case Study](https://reader036.fdocuments.us/reader036/viewer/2022062322/5681472f550346895db46c04/html5/thumbnails/13.jpg)
JSTOR: Production Process
The quality of OCR for journals.
• JSTOR reports a 97% accuracy rate for its OCR-created text files.
• Some journals yield OCR files that are 99.95% accurate.
• This level of accuracy is satisfactory for searching but not for presentation.
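A quick calculation suggests why these accuracy rates are adequate for searching but not for presentation. The sketch below assumes the rates are per-character and that errors are independent; neither is stated in the source. The 3,000-character page and 7-letter word are likewise hypothetical.

```python
def expected_errors(chars_per_page, accuracy):
    """Expected number of wrong characters on a page at a per-character accuracy."""
    return chars_per_page * (1 - accuracy)


def word_correct_probability(word_length, accuracy):
    """Chance that every character of a word is right, assuming independent errors."""
    return accuracy ** word_length


CHARS_PER_PAGE = 3000  # hypothetical page length (assumption)
for accuracy in (0.97, 0.9995):
    errors = expected_errors(CHARS_PER_PAGE, accuracy)
    p_word = word_correct_probability(7, accuracy)
    print(f"{accuracy:.2%}: about {errors:.1f} errors per page, "
          f"7-letter word fully correct with probability {p_word:.3f}")
```

Under these assumptions a 97% accurate page carries roughly 90 character errors, far too many to display as text, yet most individual words still come through intact, so a search term that appears in an article will usually be found.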
![Page 14: JSTOR & OCR - A Case Study](https://reader036.fdocuments.us/reader036/viewer/2022062322/5681472f550346895db46c04/html5/thumbnails/14.jpg)
Example of JSTOR page.
![Page 15: JSTOR & OCR - A Case Study](https://reader036.fdocuments.us/reader036/viewer/2022062322/5681472f550346895db46c04/html5/thumbnails/15.jpg)
Example of scanned image from JSTOR
![Page 16: JSTOR & OCR - A Case Study](https://reader036.fdocuments.us/reader036/viewer/2022062322/5681472f550346895db46c04/html5/thumbnails/16.jpg)
JSTOR: Preservation Issues
A PLAN FOR PRESERVATION.
Print repositories of JSTOR journals are being started at the University of California and Harvard University.
The database is currently housed on servers managed and maintained at Princeton University, the University of Michigan, and the University of Manchester (UK).
Archival cold tapes are also stored at OCLC and at the JSTOR offices in New York City.
![Page 17: JSTOR & OCR - A Case Study](https://reader036.fdocuments.us/reader036/viewer/2022062322/5681472f550346895db46c04/html5/thumbnails/17.jpg)
Guidelines: Is OCR right for your project?
1. “Select the technology that will enhance your ability to meet the objectives of the project.”
From “An OCR Case Study” by Eileen Gifford Fenton
![Page 18: JSTOR & OCR - A Case Study](https://reader036.fdocuments.us/reader036/viewer/2022062322/5681472f550346895db46c04/html5/thumbnails/18.jpg)
Guidelines: Is OCR right for your project?
2. “Scale matters -- a lot.”
From “An OCR Case Study” by Eileen Gifford Fenton
![Page 19: JSTOR & OCR - A Case Study](https://reader036.fdocuments.us/reader036/viewer/2022062322/5681472f550346895db46c04/html5/thumbnails/19.jpg)
Guidelines: Is OCR right for your project?
3. “There is no right answer.”
From “An OCR Case Study” by Eileen Gifford Fenton
![Page 20: JSTOR & OCR - A Case Study](https://reader036.fdocuments.us/reader036/viewer/2022062322/5681472f550346895db46c04/html5/thumbnails/20.jpg)
Guidelines: Is OCR right for your project?
4. “Costs will be higher than you expect.”
From “An OCR Case Study” by Eileen Gifford Fenton
![Page 21: JSTOR & OCR - A Case Study](https://reader036.fdocuments.us/reader036/viewer/2022062322/5681472f550346895db46c04/html5/thumbnails/21.jpg)
Guidelines: Is OCR right for your project?
5. “The answer that is right for today may not be right in the future.”
From “An OCR Case Study” by Eileen Gifford Fenton
![Page 22: JSTOR & OCR - A Case Study](https://reader036.fdocuments.us/reader036/viewer/2022062322/5681472f550346895db46c04/html5/thumbnails/22.jpg)
Sources for Further Investigation
Bibliography:
Guthrie, Kevin, JSTOR. “Developing a Digital Preservation Strategy for JSTOR, an interview.” http://www.rlg.org/preserv/diginews/diginews4-4.html#feature1
JSTOR website: http://www.jstor.org/
Kiplinger, John. Director of Production, JSTOR. “Print-Repository Effort Under Way at UCLA and Harvard.” http://www.clir.org/PUBS/issues/issues47.html#print
Fenton, Eileen Gifford, JSTOR, University of Michigan. “An OCR Case Study.” In Handbook for Digital Projects: A Management Tool for Preservation and Access. http://www.nedcc.org/digital/vii.htm#3