What publishers need to know about digitization
Liza Daly
Consultant, Threepress Consulting Inc.
http://threepress.org/
Thursday, November 13, 2008
Software engineer and consultant specializing in web-based publishing applications
Digitization projects for Ford Foundation, Arnold Arboretum, Rosen Publishing and SAGE Publications
Online reference products for Oxford University Press and Columbia University Press
Current: ebook applications and consulting
Introduction: Liza Daly [email protected]
1. Digitization 101: from scanning to OCR to XML
2. Smart vendor selection
3. A gentle introduction to XML
4. I’ve got digital content: now what?
Introduction: What I’ll cover
What we talk about when we talk about digitization
Turning printed content...
...or microfilm archives
...or documents in legacy systems
...into modern digital forms.
(sometimes starting from print is easier)
text
<text>
Assume that we’re starting from a print archive.
(If you’re starting from a digital file, congratulations, your costs just went down -- but not to zero!)
Digitization 101
Scan
From paper to digital images...
OCR
...to digital text...
XML
...to reusable markup.
Digitization 101: Scanning
http://www.flickr.com/photos/heather-dietz/448629362/
Digitization 101: Scanning methods
Destructive scanning
Pages are cut out of the binding and machine-fed into the scanner in batches.
(Imagine a huge office copier.)
The physical copies that were scanned are normally destroyed.
Non-destructive scanning
Pages kept in their original binding
Manual page-turning
Originals are returned to the source
Primarily for rare or historical works
Digitization 101: Scanning methods
High-volume, non-destructive automated scanning also exists.
Digitization 101: Scanning methods
Optical Character Recognition
OCR software “guesses” the letters that appear in an image. A dictionary is used to help correct errors.
Common errors include wordsruntogether or speling mistakes.
Digitization 101: OCR
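To make the dictionary idea concrete, here is a minimal Python sketch of one way a word list could be used to repair words that were run together. The tiny `DICTIONARY` and the `split_run_together` helper are illustrative assumptions, not how any particular OCR product works.

```python
# Toy dictionary (an assumption for illustration); real OCR software
# would rely on a far larger word list.
DICTIONARY = {"words", "run", "together", "spelling", "mistakes"}


def split_run_together(token, dictionary=DICTIONARY):
    """Return a list of dictionary words that spell `token`, or None.

    Tries each possible first word, longest first, and recurses on
    whatever is left over.
    """
    if token == "":
        return []
    for end in range(len(token), 0, -1):
        head = token[:end]
        if head in dictionary:
            rest = split_run_together(token[end:], dictionary)
            if rest is not None:
                return [head] + rest
    return None


print(split_run_together("wordsruntogether"))  # ['words', 'run', 'together']
print(split_run_together("speling"))           # None: needs a human or a spell-checker
```

A misspelling such as "speling" cannot be fixed by splitting alone, which is one reason a human check remains useful.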
OCR quality is sensitive to a number of factors.
Is the document in good condition with clear type?
Is the layout simple or complex?
Is a custom dictionary required for proper names or obscure terms?
Digitization 101: OCR
This is easy.
This is hard.
http://timesmachine.nytimes.com/
                 Better OCR          Worse OCR
Layout           Simple text         Multicolumn, sidebars
Vocabulary       Common              Specialized
Source quality   Clean and legible   Damaged, dirty or partial
Digitization 101: OCR
Limitations and cautions:
Documents with specialized jargon, such as medical journals or archaic texts, will require custom dictionaries.
Tables and equations aren’t suitable for OCR.
A human check is always advisable.
Digitization 101: OCR
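As a rough illustration of why custom dictionaries and human checks go together, the Python sketch below flags every word that is not in a known-word list so a person can review it. The word lists and the `words_needing_review` helper are hypothetical stand-ins, assuming a simple lowercase, letters-only notion of a "word".

```python
import re

# Hypothetical word lists: a general vocabulary plus a custom
# dictionary of specialized (here, medical) terms.
BASE_WORDS = {"the", "patient", "was", "given", "a", "dose", "of"}
CUSTOM_WORDS = {"warfarin"}


def words_needing_review(text, known_words):
    """Return the words in `text` that are not in `known_words`, in order."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return [token for token in tokens if token not in known_words]


ocr_text = "The patient was given a dose of warfarin"

# Without the custom dictionary, the specialized term is flagged.
print(words_needing_review(ocr_text, BASE_WORDS))

# With it, nothing on this line needs a second look.
print(words_needing_review(ocr_text, BASE_WORDS | CUSTOM_WORDS))
```

The fewer false alarms the dictionary produces, the less time a human checker spends on words that were actually recognized correctly.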
If the goal of digitization is to make content findable on the web, the text needs to be correct.
Scan the documents to convert them to digital files
Apply OCR to the scans to get computer-ready text
Convert the text into XML
Digitization 101: XML
Not all digitization projects end with XML.
Why?
[Chart: characters per page (1,000 to 3,000+) versus digitization cost/time, for machine OCR, human-checked OCR, and XML]
Vendor selection and costs
Consider:
Quantity of material
Quality of the originals
Layout complexity
Vocabulary
But also:
Project management
Shipping
Heterogeneous content
Front/back matter & indexes
Vendor tips
Send samples before considering any estimate
...and have the output evaluated.
Compare not just cost-per-page but estimated time.
Feel comfortable with their project management.
Check references!
Should you partner?
It’s too early to say whether Google Books is right for all publishers.
But you’re certainly giving up:
1. Control
2. Revenue share
3. Ownership
Creative partnerships
Consider whether some of your backlist is public domain or can be released under a Creative Commons license.
XML 101
XML 101: What’s XML?
XML is just plain text, with markers to tell a computer what the text means and how it should be laid out.
XML 101: What’s XML?
Text with “markup” is an old idea.
This is a paragraph.¶This is another paragraph.
XML 101: What’s XML?
XML just changes the symbols around.
<p>This is a paragraph.</p><p>This is another paragraph.</p>
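The swap from pilcrows to tags can be sketched in a few lines of Python using the standard library's `xml.etree.ElementTree`. The `pilcrow_to_xml` helper and the `doc` root element are assumptions added for illustration, since a well-formed XML document needs a single enclosing element.

```python
import xml.etree.ElementTree as ET


def pilcrow_to_xml(text, root_tag="doc"):
    """Turn pilcrow-separated text into <p> elements under one root."""
    root = ET.Element(root_tag)
    for chunk in text.split("¶"):
        chunk = chunk.strip()
        if chunk:
            # ElementTree escapes characters such as & and < for us.
            ET.SubElement(root, "p").text = chunk
    return ET.tostring(root, encoding="unicode")


print(pilcrow_to_xml("This is a paragraph.¶This is another paragraph."))
# <doc><p>This is a paragraph.</p><p>This is another paragraph.</p></doc>
```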
XML 101: What’s XML good for?
1. Everybody speaks it.
2. Once you have one kind of XML, it’s easy to turn it into another kind.
When you decide to digitize to XML, you’ll need to pick what kind of XML you want.
Kinds of XML
DTD
Format
Language
XSD
Schema
The schema defines the list of <tags> that appear in a document, and what they mean.
A paragraph ¶ in one schema might be <p>, but in another it might be <para>.
XML 101: Schema vocabulary
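Because two schemas may differ only in vocabulary, moving between them can sometimes be as simple as renaming tags. The Python sketch below shows that idea under stated assumptions: the `doc` root, the `TAG_MAP` mapping, and the `rename_tags` helper are hypothetical, and real schema conversions usually involve structural changes as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from one schema's vocabulary to another's.
TAG_MAP = {"p": "para"}


def rename_tags(xml_text, tag_map=TAG_MAP):
    """Rename every element whose tag appears in `tag_map`."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = tag_map.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


source = "<doc><p>This is a paragraph.</p><p>This is another paragraph.</p></doc>"
print(rename_tags(source))
# <doc><para>This is a paragraph.</para><para>This is another paragraph.</para></doc>
```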
TEI
DocBook
METS/ALTO
PRISM
ePub
DAISY
XML 101: Choosing a schema
Books                  DocBook, DAISY, ePub, TEI
Magazines/Newspapers   METS/ALTO, PRISM
Scholarly              TEI, MathML
XML 101: DIY schemas
Creating your own schema should be a last resort.
Expensive to build and maintain.
High training and hiring costs.
Reduced opportunities for interoperability.
Regulatory compliance.
[Chart: cost, from $ to $$$, plotted against a low-to-high axis]
Complex schemas cost more...
...but also provide more opportunity for product development.
Now what?
Monetizing: XML conversion
[Diagram slides: XML and web; web and XML; web and UGC]
Remixing content
XML allows content to be distributed, altered,
and recontextualized in unexpected ways.
http://flickr.com/photos/thomashawk/2492298772/
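One simple form of remixing is pulling selected pieces out of several XML sources and assembling them into something new. The Python sketch below is only an illustration: the `book`, `chapter`, `title`, and `anthology` element names, the sample content, and the `build_anthology` helper are all invented for the example and do not come from any particular schema.

```python
import xml.etree.ElementTree as ET

# Two made-up source documents in a hypothetical book vocabulary.
BOOK_A = (
    "<book><chapter><title>Scanning</title></chapter>"
    "<chapter><title>OCR</title></chapter></book>"
)
BOOK_B = "<book><chapter><title>XML</title></chapter></book>"


def build_anthology(sources, wanted_titles):
    """Collect the chapters whose titles are wanted into a new document."""
    anthology = ET.Element("anthology")
    for xml_text in sources:
        for chapter in ET.fromstring(xml_text).iter("chapter"):
            if chapter.findtext("title") in wanted_titles:
                anthology.append(chapter)
    return anthology


remix = build_anthology([BOOK_A, BOOK_B], {"OCR", "XML"})
print([chapter.findtext("title") for chapter in remix])  # ['OCR', 'XML']
```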
Small Beer Press
Questions?
Liza Daly
Threepress Consulting Inc.
+01 617 301 [email protected]