PDF (Portable Document Format) for Digital Preservation and Delivery
John Laurie
Digital Initiatives Librarian
The University of Auckland Library
National Digital Forum 2012
Issues
• Is PDF good enough?
• What’s a maximum file size?
• PDF/A or simple PDF?
• Searchable text or ClearScan?
• How dirty is our OCR?
• Can we attach metadata to PDF files?
• Should we be using METS-ALTO instead?
Local PDF collections at the University of Auckland
• Exam papers (image-only) - DigiTool
• JPS, NZJH, Early NZ Statutes, The Bookshelf - B-engine
• Theses, working papers - DSpace
• Course Materials (mainly chapters from books)
  – Linked from the Catalogue
Advantages and Disadvantages
• “PDF and PDF/A broadly acceptable for long term digital archiving” Seadle, Michael. Library Hi Tech 27.4 (2009): 639-644.
• Widely used, constantly improving, search engine friendly
• Open standard since 2008
• Read out loud, print
• But simple? Morass of variables in my experience
• Image PDF files are large and slow to load
• Editing a problem – crowdsourcing proofreading
• Difficult to repurpose as HTML etc
• Metadata only at the item level
Scanning for PDF
• Condition of originals
• Target outcomes
  – searchable text or ClearScan
• 300dpi for clear modern fonts
• 400dpi for older documents and very small fonts
• Adobe Acrobat or FineReader
• Different settings needed for photos and text-only pages
• Black-and-white scans don’t work for historical texts and old newspapers.
• Splitting born-digital PDFs
Optical Character Recognition (OCR)
• Accuracy depends on document and font - getting better all the time
• FineReader better than Adobe Acrobat but doesn’t offer ClearScan option
• ClearScan vs Searchable image, dirty OCR hidden behind image
• FineReader offers spell-checking, find and replace editing, proofreading
• Tables, HTML versions, rekeying
• pdftotext and other text extractors for indexing
ABBYY FineReader 11
Spellchecking options
PDF text behind image
HTML showing actual text
File Sizes, Optimising files
• Compromise between image quality and overlarge files
• What size is too big?
• Text behind image – I’m saving at 300dpi, 40% quality, about 200K per page for simple text
• Breaking up into smaller sections
• Batch optimising
• Preservation masters, simple text, saved as 5-6MB TIFF as part of FineReader files
• Reduce File Size best method but often can’t save as PDF/A afterwards
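To put the quality-versus-size compromise in rough numbers, the following sketch computes the uncompressed raster size of one scanned page. It assumes an A4 page (8.27 x 11.69 inches), which the slides do not specify, and the function name is illustrative. Row padding and file headers are ignored, so the figures are approximations.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Return the approximate uncompressed raster size in bytes for one page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


A4 = (8.27, 11.69)  # assumed page size in inches; not stated in the slides

for dpi in (300, 400):
    for label, bits in (("bilevel", 1), ("greyscale", 8), ("colour", 24)):
        size_mb = uncompressed_bytes(*A4, dpi, bits) / 1_000_000
        print(f"{dpi} dpi {label:9s} {size_mb:6.1f} MB")
```

Under these assumptions a 300dpi greyscale page is about 8.7 MB before compression, so a figure of about 200K per page implies heavy compression. Going from 300dpi to 400dpi multiplies the pixel count by roughly 1.78.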
PDF/A, PDF/A-1a, PDF/A-1b
• “PDF/A is an ISO-standardized version of the Portable Document Format (PDF) specialized for the digital preservation of electronic documents”
• A-1a is stricter than A-1b
• Many PDF files can’t be saved as PDF/A – after “reduce file size” because it substitutes non-embedded fonts.
• Many fonts not allowed to be embedded?
• Preflight identifies errors.
• Medline wants a PDF/A copy of each article
• PDFs downloaded from EBSCO, Springer and ProQuest not PDF/A compliant
• Will the smarter computers of the future need embedded fonts? “As we all get smarter and technology improves the acute concerns about format obsolescence may diminish” Butch Lazorchak, The Signal, Library of Congress
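As a quick triage step before running a full check such as Preflight, a script can look for the PDF/A identification properties (`pdfaid:part` and `pdfaid:conformance`) in a file's XMP metadata. The sketch below is a heuristic only: it assumes the metadata stream is stored uncompressed and uses the conventional `pdfaid` prefix. It reports what the file claims, not whether it actually conforms. The function name is illustrative.

```python
import re

# XMP properties from the PDF/A identification schema, written either as
# attributes (pdfaid:part="1") or as elements (<pdfaid:part>1</pdfaid:part>).
_PART = re.compile(rb"pdfaid:part\s*(?:=\s*[\"']|>)\s*(\d)")
_CONF = re.compile(rb"pdfaid:conformance\s*(?:=\s*[\"']|>)\s*([A-Za-z])")


def claimed_pdfa_level(pdf_bytes):
    """Return e.g. 'PDF/A-1b' if the metadata claims PDF/A, else None.

    This only reads the claim; it does not validate conformance.
    """
    part = _PART.search(pdf_bytes)
    if part is None:
        return None
    conf = _CONF.search(pdf_bytes)
    suffix = conf.group(1).decode("ascii").lower() if conf else ""
    return f"PDF/A-{part.group(1).decode('ascii')}{suffix}"
```

A batch job could call this on the raw bytes of each downloaded file and send only the files that return `None` for closer inspection.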
ClearScan vs Searchable image
• ClearScan files are just over half the size, and are sharper and clearer
• No ClearScan option from FineReader (which offers spellcheck, find and replace editing, TIFF master copies)
• ClearScan substitutes a new font that matches the letter shapes, not the OCRed text, so unlike a text-only PDF it can’t guarantee 100% accuracy
• But pretty good especially on clean text
Adobe ClearScan example
Text behind image says AkaroQ
Adobe Searchable Image Version
Text behind image says AkaroQ
FineReader Text over the image
(FineReader reads Akaroa correctly from the same TIFF file)
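One way to put a number on "how dirty is our OCR?" is a character error rate: the edit distance between the OCR output and a proofread reference, divided by the length of the reference. The sketch below uses the standard Levenshtein distance and the AkaroQ / Akaroa example from these slides. The function names are illustrative, and this is not a description of how either product measures accuracy.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text, reference):
    """Edit distance divided by the length of the proofread reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(ocr_text, reference) / len(reference)


print(edit_distance("AkaroQ", "Akaroa"))         # one substitution
print(character_error_rate("AkaroQ", "Akaroa"))  # one error in six characters
```

In practice the reference would be a proofread sample of pages rather than a single word.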
Problems with text extraction for indexing using pdftotext applet
Search for t h e
Problems with text extraction for indexing using pdftotext applet
Search for the
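The "t h e" search suggests that some extracted text comes out letter-spaced, so a word is indexed as separate single letters. A minimal sketch for flagging such runs in already-extracted text follows. The heuristic (three or more single letters separated by single spaces) and the function names are assumptions for illustration, and collapsing a run will also merge adjacent letter-spaced words, so it is better suited to flagging files for review than to automatic repair.

```python
import re

# Three or more single letters separated by single spaces, e.g. "t h e".
_SPACED = re.compile(r"\b(?:[^\W\d_] ){2,}[^\W\d_]\b")


def find_letter_spaced_runs(text):
    """Return every letter-spaced run found in the extracted text."""
    return [m.group(0) for m in _SPACED.finditer(text)]


def collapse_letter_spaced_runs(text):
    """Join each letter-spaced run into one token (word boundaries are lost)."""
    return _SPACED.sub(lambda m: m.group(0).replace(" ", ""), text)


sample = "In t h e beginning the A B C of it"
print(find_letter_spaced_runs(sample))
print(collapse_letter_spaced_runs(sample))
```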
And diacritics
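The slide does not say which diacritic problem was encountered. One common case is that extracted text stores an accented letter as a base letter plus a combining mark, or that users search without the accent. The sketch below, using only Python's standard `unicodedata` module, normalises text and produces both an exact and an accent-folded index key. The function names are illustrative.

```python
import unicodedata


def fold_diacritics(text):
    """Strip combining marks so that 'Māori' also matches 'Maori'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def index_keys(word):
    """Return the exact (composed, case-folded) key and its accent-folded form."""
    exact = unicodedata.normalize("NFC", word).casefold()
    return {exact, fold_diacritics(exact)}


# "Ma" + combining macron + "ori" is the decomposed spelling of "Māori".
print(index_keys("Ma\u0304ori"))
```

Indexing both keys lets an exact search keep the macron while an unaccented search still finds the word.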
PDF XMP metadata
Attaching Dublin Core metadata to PDF documents
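As an illustration of what Dublin Core looks like inside XMP, the sketch below builds the RDF/XML body of an XMP packet with `dc:title`, `dc:creator` and `dc:subject`. In XMP the title is a language alternative (`rdf:Alt`), creators are an ordered list (`rdf:Seq`) and subjects an unordered one (`rdf:Bag`). The function name and the sample values are illustrative, and the sketch stops short of adding the `xpacket` wrapper or embedding the result in a PDF, which needs a PDF tool.

```python
import xml.etree.ElementTree as ET

NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, local):
    """Build a namespace-qualified tag name for ElementTree."""
    return f"{{{NS[prefix]}}}{local}"


def dublin_core_xmp(title, creators, subjects=()):
    """Return XMP RDF/XML carrying a few Dublin Core properties."""
    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})

    # dc:title is a language alternative with an x-default entry.
    alt = ET.SubElement(ET.SubElement(desc, q("dc", "title")), q("rdf", "Alt"))
    ET.SubElement(alt, q("rdf", "li"), {XML_LANG: "x-default"}).text = title

    # dc:creator is an ordered list of names.
    seq = ET.SubElement(ET.SubElement(desc, q("dc", "creator")), q("rdf", "Seq"))
    for name in creators:
        ET.SubElement(seq, q("rdf", "li")).text = name

    # dc:subject is an unordered bag of keywords.
    if subjects:
        bag = ET.SubElement(ET.SubElement(desc, q("dc", "subject")), q("rdf", "Bag"))
        for keyword in subjects:
            ET.SubElement(bag, q("rdf", "li")).text = keyword

    return ET.tostring(xmpmeta, encoding="unicode")


# Sample values are placeholders, not real catalogue data.
print(dublin_core_xmp("Example title", ["Example Author"], ["law", "statutes"]))
```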
PDF files
PDF vs METS-ALTO
• Papers Past and other newspaper projects use METS-ALTO
• METS (Metadata Encoding and Transmission Standard) links hierarchy of pages, sections, articles, issues and volumes, provides for descriptive and other metadata at each level – structural metadata
• ALTO (Analyzed Layout and Text Object) stores layout information and OCR text, enables page views, article views for newspapers.
• CCS (Content Conversion Specialists) have created DocWorks METAe which automates creation of METS-ALTO files and metadata for sections
• Should we all be using METS-ALTO?
• Derivatives (PDF, text, TEI, HTML), complex document structures, metadata at any level
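To show what "layout information and OCR text" means in practice, the sketch below reads a small hand-written ALTO fragment and returns the text of each line plus the bounding box of a searched word. The sample XML is invented for illustration, not taken from Papers Past or any real file. Tags are matched by local name because the ALTO namespace differs between schema versions.

```python
import xml.etree.ElementTree as ET

# Invented ALTO-style sample; real files carry many more elements.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="P1"><PrintSpace>
    <TextBlock ID="B1">
      <TextLine>
        <String CONTENT="Akaroa" HPOS="100" VPOS="200" WIDTH="180" HEIGHT="40"/>
        <SP/>
        <String CONTENT="Harbour" HPOS="300" VPOS="200" WIDTH="210" HEIGHT="40"/>
      </TextLine>
      <TextLine>
        <String CONTENT="1850" HPOS="100" VPOS="260" WIDTH="90" HEIGHT="40"/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def local(tag):
    """Drop the '{namespace}' prefix so any ALTO version matches."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(alto_xml):
    """Return the OCR text of each TextLine, words joined by spaces."""
    lines = []
    for el in ET.fromstring(alto_xml).iter():
        if local(el.tag) == "TextLine":
            words = [s.get("CONTENT", "") for s in el if local(s.tag) == "String"]
            lines.append(" ".join(words))
    return lines


def find_word(alto_xml, word):
    """Return (HPOS, VPOS, WIDTH, HEIGHT) for each String matching word.

    Assumes every String carries all four position attributes.
    """
    boxes = []
    for el in ET.fromstring(alto_xml).iter():
        if local(el.tag) == "String" and el.get("CONTENT", "").casefold() == word.casefold():
            boxes.append(tuple(float(el.get(k)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT")))
    return boxes


print(alto_lines(SAMPLE_ALTO))
print(find_word(SAMPLE_ALTO, "akaroa"))
```

Because each word keeps its coordinates, a viewer can highlight search hits on the page image, which a plain text extract cannot support.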
Websites
• New Zealand Journal of History http://www.nzjh.auckland.ac.nz/
• ResearchSpace Doctoral Theses https://researchspace.auckland.ac.nz/handle/2292/2
• Early New Zealand Statutes http://www.enzs.auckland.ac.nz/
• Early New Zealand Statistics test with PDF and HTML http://www.thebookshelf.auckland.ac.nz/document.php?wid=1148&action=null