Digital Reformatting of Text
Aaron Choate
Digital Library Production Services
The University of Texas Libraries
From last time:
Calculating potential file size (no really… this time we got it!)
file size in bytes = (height x width x bit-depth x dpi^2) / 8
(height and width in inches; 8 bits per byte)
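As a rough worked example of the file-size formula, the sketch below assumes height and width are in inches, bit depth is in bits per pixel, and the image is stored uncompressed. The function name `uncompressed_file_size_bytes` is illustrative and not from the slides.

```python
def uncompressed_file_size_bytes(height_in, width_in, bit_depth, dpi):
    """Estimate uncompressed image size in bytes.

    Assumes height and width are in inches and bit_depth is bits per pixel.
    """
    total_bits = height_in * width_in * bit_depth * dpi ** 2
    return total_bits / 8  # 8 bits per byte


# An 11 x 8.5 inch page scanned at 300 dpi in 8-bit grayscale
size = uncompressed_file_size_bytes(11, 8.5, 8, 300)
print(f"{size:,.0f} bytes (about {size / 1_000_000:.1f} MB)")
# prints: 8,415,000 bytes (about 8.4 MB)
```

Doubling the dpi quadruples the result, because resolution applies to both dimensions.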
Imaging: Benchmarking
Subjective evaluation becomes more problematic when the goal is legibility rather than fidelity.
Imaging: Benchmarking
Physical Type, size and presentation
Imaging: Benchmarking
Physical condition
• Darkening pages
• Fading ink
• Stains
• Bleed-through
• Uneven printing
• Fold lines
• Smearing
Imaging: Benchmarking
Document classification
• Simple text / printed line art
  • Distinct-edge-based representation
  • Bitonal?
• Manuscripts
  • Soft-edge-based
  • Grayscale / color
• Mixed material
Imaging: Benchmarking
Medium and support
• Support – (paper, clay tablet, etc.)
  • Thin paper? (bleed-through)
• Medium – (graphite pencil, inks, etc.)
  • Fading of ink
  • Variations in color or density
Imaging: Benchmarking
Tonal Representation
Imaging: Benchmarking
Color Appearance
• Is color reproduction necessary to the document’s meaning?
• What purpose does the color serve?
• How important is maintaining the color appearance?
Imaging: Benchmarking
Detail
• Printed text –
  • Measure the height of the smallest lowercase letter that typifies the item or group of items.
• Manuscripts, line art –
  • Measure the finest stroke-width that must be represented and characterize the needed level of quality.
Imaging: Benchmarking
QI… (Quality Index)
• Defining detail as character height
• ANSI/AIIM preservation microfilming standard for determining requirements for text legibility
• Defines a range from barely legible through excellent that maps to technical test targets
Imaging: Benchmarking
Line pairs
Excellent = 8 line pairs
Good = 5 line pairs
Marginal = 3.6 line pairs
Barely legible = 3.0 line pairs
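Imaging: Benchmarking
Digital QI
(h = height of the smallest significant character in mm; 0.039 converts mm to inches)
Bitonal (black and white pixels only)
QI = (dpi x 0.039h)/3
h = 3QI/(0.039 x dpi)
dpi = 3QI/(0.039 x h)
Tonal images (grayscale for printed text)
QI = (dpi x 0.039h)/2
h = 2QI/(0.039 x dpi)
dpi = 2QI/(0.039 x h)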
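The digital QI formulas can be sketched in a few lines of Python. This is a minimal illustration, assuming h is the character height in millimetres (which the 0.039 mm-to-inch factor implies) and treating the line-pair benchmarks from the previous slide as minimum thresholds. The function names and the "below barely legible" label are hypothetical, not from the slides.

```python
MM_TO_INCH = 0.039  # conversion factor used in the slide formulas

# Divisor from the formulas: 3 for bitonal, 2 for tonal (grayscale) images
DIVISOR = {"bitonal": 3, "tonal": 2}

# Line-pair benchmarks, highest first (assumed here to be minimum thresholds)
RATINGS = [(8.0, "excellent"), (5.0, "good"), (3.6, "marginal"), (3.0, "barely legible")]


def quality_index(dpi, h_mm, mode="bitonal"):
    """QI = (dpi x 0.039h) / divisor."""
    return dpi * MM_TO_INCH * h_mm / DIVISOR[mode]


def required_dpi(qi, h_mm, mode="bitonal"):
    """dpi = (divisor x QI) / (0.039 x h)."""
    return DIVISOR[mode] * qi / (MM_TO_INCH * h_mm)


def rating(qi):
    """Map a QI value to the benchmark label it meets or exceeds."""
    for threshold, label in RATINGS:
        if qi >= threshold:
            return label
    return "below barely legible"  # hypothetical label for QI under 3.0


qi = quality_index(600, 1.0, "bitonal")        # 1 mm characters at 600 dpi
print(round(qi, 2), rating(qi))                 # 7.8 good
print(round(required_dpi(8, 1.0, "bitonal")))   # 615
print(round(required_dpi(8, 1.0, "tonal")))     # 410
```

With these formulas, a grayscale scan reaches the same QI at two-thirds of the bitonal resolution, since the divisor drops from 3 to 2.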
Text Capture
Methods
• Rekeying
• OCR
Accuracy …
Software
• ScanSoft – OmniPage Pro
• ABBYY – FineReader
• Adobe Acrobat …
• PrimeOCR – Prime Recognition
Encoding
XML vs SGML
SGML (Standard Generalized Markup Language) is the grand-daddy of all markup languages.
XML is a subset of SGML intended to be the format for use on the Internet.
XML attempts to fill the gap between SGML, which can be used for just about anything, and HTML, which is severely limited and is currently being abused as a result (table structures for layout, clear 1-pixel GIFs, etc.).
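As a small illustration of that gap, the sketch below uses Python's standard `xml.etree.ElementTree` to read a fragment whose element names describe structure rather than layout. The element names (`letter`, `sender`, `date`, `body`, `p`) and the sample content are invented for this example and belong to no particular DTD or schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: tags name what the content is, not how it looks.
doc = """<letter>
  <sender>A. Writer</sender>
  <date when="1901-05-17">17 May 1901</date>
  <body>
    <p>First paragraph of the letter.</p>
    <p>Second paragraph of the letter.</p>
  </body>
</letter>"""

root = ET.fromstring(doc)
sender = root.findtext("sender")
when = root.find("date").get("when")
paragraphs = [p.text for p in root.iter("p")]
print(sender, when, len(paragraphs))  # A. Writer 1901-05-17 2

# XML must be well formed: an unclosed element is rejected by the parser.
try:
    ET.fromstring("<letter><sender>A. Writer</letter>")
    well_formed = True
except ET.ParseError:
    well_formed = False
print(well_formed)  # False
```

Because the markup carries meaning, a program can pull out the sender or date directly instead of guessing from table cells or font tags.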
XML: DTDs vs Schemas
XML: TEI
Text Encoding Initiative
• Initially launched in 1987, the TEI is an international and interdisciplinary standard that helps libraries, museums, publishers, and individual scholars represent all kinds of literary and linguistic texts for online research and teaching, using an encoding scheme that is maximally expressive and minimally obsolescent.
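To give a feel for what TEI-encoded text looks like, here is a minimal sketch that parses a tiny TEI-style document with Python's standard library. The sample follows the general shape of a TEI P5 document (a `teiHeader` followed by `text`) in the `http://www.tei-c.org/ns/1.0` namespace. Its content is invented, it has not been validated against the TEI schema, and projects using earlier TEI versions would look somewhat different.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample: a header describing the file, then the transcribed text.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample letter (hypothetical)</title></titleStmt>
      <publicationStmt><p>Unpublished example.</p></publicationStmt>
      <sourceDesc><p>Invented for illustration.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <pb n="1"/>
      <p>First paragraph of the transcription.</p>
      <p>Second paragraph of the transcription.</p>
    </body>
  </text>
</TEI>"""

root = ET.fromstring(sample)
title = root.findtext(
    "tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", namespaces=TEI_NS
)
paragraphs = root.findall("tei:text/tei:body/tei:p", TEI_NS)
page_breaks = root.findall(".//tei:pb", TEI_NS)

print(title)                    # Sample letter (hypothetical)
print(len(paragraphs))          # 2
print(page_breaks[0].get("n"))  # 1
```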
XML: TEI
Levels of encoding
• Level 1: Fully Automated Conversion and Encoding
• Level 2: Minimal Encoding
• Level 3: Simple Analysis
• Level 4: Basic Content Analysis
• Level 5: Scholarly Encoding Projects
Character sets
Unicode –
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
Character sets: Unicode
Greek & Coptic
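The "unique number for every character" idea can be seen directly with Python's standard `unicodedata` module. This is a small sketch using three letters from the Greek and Coptic block (U+0370 to U+03FF); the choice of letters is arbitrary.

```python
import unicodedata

# Alpha, beta, gamma written as escapes so the source stays pure ASCII
letters = "\u03b1\u03b2\u03b3"

for ch in letters:
    # The code point and name are the same on every platform and program
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+03B1  GREEK SMALL LETTER ALPHA
# U+03B2  GREEK SMALL LETTER BETA
# U+03B3  GREEK SMALL LETTER GAMMA

# All three fall inside the Greek and Coptic block
in_block = all(0x0370 <= ord(ch) <= 0x03FF for ch in letters)

# The number is fixed, but its stored bytes depend on the encoding chosen
utf8_bytes = letters[0].encode("utf-8")       # b'\xce\xb1'
utf16_bytes = letters[0].encode("utf-16-be")  # b'\x03\xb1'
print(in_block, utf8_bytes, utf16_bytes)
```

The distinction between the code point and its encoded bytes matters when declaring the encoding of an XML file.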
Software
• XMetaL
• oXygen
• Cooktop
Software
MetaE