Document Delivery Formats for the Web and Legal Digital Collections
Kevin Reiss
June 18th, 2004
Law Library
Rutgers-Newark School of Law
Delivery Formats & Issues
Delivery format: the type of file a user receives when accessing a document in a digital collection
Important not just for viewing, but also for Information Retrieval (IR) tasks like full-text indexing
There is no one format that is right for every type of collection.
Important issues to consider:
– Open v. Closed Formats
– Usability and Accessibility
– Subject Specific Concerns for Legal Materials
Open v. Closed Formats
Who is "in control" of the document format you choose? A standards body? A single company or organization?
Can you count on something that one entity controls to be supported over time?
Advantages of Open Formats (a.k.a. Standards)
– Interoperability and support over time
– Integrate well with open-source or low-cost processing and IR tools
– Help web content providers who need to support an increasing variety of devices and platforms
Usability & Accessibility
What software do users need to view a particular format?
Can a web browser natively display it?
If the format requires a browser plug-in:
– Is it free? Are users likely to have it installed?
– Does it work on all computing platforms?
Do public search engines index the format?
Can dial-up modem users access the material in the collection?
Subject Specific Concerns for Legal Materials
Legal digital projects usually manage texts, not images.
Some types of legal materials are harder to maintain, e.g. codified material.
Legal documents are almost exclusively printed in black & white.
Preservation of the page structure is important for citation purposes.
Maintaining the original appearance of digitized print documents is usually not important; archival and rare materials are potential exceptions.
Possible Delivery Formats
Pure image formats: TIFF, JPEG
Open encoded formats: XML, HTML, ASCII, and Unicode
Hybrid formats: PDF, DjVu (can contain both image and text)
Proprietary formats: Microsoft Word, WordPerfect
Pure Images: TIFF, JPEG
Raster (pixel-based) formats, used exclusively for scanned collections
TIFF is the best choice for archival scanned images
Pros
– Web browsers display them natively
– Both are open formats
Cons
– Large file sizes make viewing on slow connections problematic
– Text of the documents available only through OCR (Optical Character Recognition)
– Weak support for multi-page documents
– JPEGs have trouble displaying text when they are compressed to levels appropriate for the web
– Contain metadata about the physical file itself, not the contents of the file
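To make the file-size concern concrete, here is a rough back-of-the-envelope sketch. The page size (8.5 x 11 inches), the 300 dpi scan resolution, the bit depths, and the 56 kbit/s modem speed are illustrative assumptions, not figures from this presentation. Real TIFF and JPEG files are normally compressed, so these numbers are closer to an upper bound than to typical sizes.

```python
def uncompressed_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Size in bytes of an uncompressed raster scan of one page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


def download_seconds(size_bytes, link_bits_per_second):
    """Ideal transfer time, ignoring protocol overhead and latency."""
    return size_bytes * 8 / link_bits_per_second


# Assumed values for illustration only.
PAGE_W_IN, PAGE_H_IN, DPI = 8.5, 11, 300
MODEM_BPS = 56_000

bitonal = uncompressed_scan_bytes(PAGE_W_IN, PAGE_H_IN, DPI, 1)    # black & white
grayscale = uncompressed_scan_bytes(PAGE_W_IN, PAGE_H_IN, DPI, 8)  # 8-bit gray

for label, size in (("bitonal", bitonal), ("grayscale", grayscale)):
    seconds = download_seconds(size, MODEM_BPS)
    print(f"{label}: {size:,} bytes, about {seconds:.0f} s per page at 56 kbit/s")
```

Under these assumptions a single black-and-white page is about 1 MB uncompressed and takes roughly two and a half minutes over dial-up, which is why compression and format choice matter for scanned collections.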
Imaged Formats Cont.
OCR is an important consideration:
– A 5% error rate doesn't have an impact on traditional IR measures
– A 20% error rate significantly degrades the performance of traditional IR techniques [Doermann 98]
– High quality OCR is now available for relatively low cost
  ABBYY FineReader ($300)
  Table and page layout recognition supported
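The error rates above are easier to reason about with a concrete measure. One common way to quantify OCR quality is the character error rate: the edit distance between the OCR output and a corrected transcription, divided by the length of the transcription. The slides do not say how the cited rates were measured, so this is only an illustrative sketch, and the sample strings are invented.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text: str, reference: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(ocr_text, reference) / len(reference)


# Hypothetical sample: two characters misrecognized in an 18-character phrase.
reference = "the court affirmed"
ocr_output = "tbe c0urt affirmed"
rate = character_error_rate(ocr_output, reference)
print(f"character error rate: {rate:.1%}")
```

Measuring a sample of pages this way gives a rough sense of whether a collection sits nearer the 5% or the 20% end of the range before committing to manual correction.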
Open Encoded Formats
XML, HTML, ASCII, Unicode
Typically easier to integrate into digital libraries [Baird 2004]
– Created in 3 ways:
  Born digital documents
  Manually keyed documents
  Corrected OCR
– IR applications easy to build, open source support strong
– International standards or W3C recommendations
– Accessible with all current web technologies
– Metadata easily embedded in XML/HTML documents
– Can be created with any text editor
– Improvements in OCR make encoding scanned collections feasible
Open Encoded Formats Cont.
Cons:
– These documents can be expensive for staff to create
  Encoding in XML may have to be done by hand
  Manual correction of OCR errors
– Need technical expertise on staff to get the full benefits of these formats, e.g. a Perl programmer
– These don't necessarily preserve the "look" of printed documents
Hybrid Formats: PDF, DjVu
PDF and DjVu are proprietary technologies that have substantial support in the open source community.
Both can contain a layer of the document’s text and an image of each page in a document.
Both utilize cross-platform, freely available web browser plug-ins.
Both try to preserve the look of print documents.
Easy to export born digital documents to these formats using printer drivers, e.g. "print to PDF"
Adobe PDF
Pros:
– PDF has strong market acceptance in the legal community
– PDF-Archive, a standard for using PDF as an archival format, is in development by AIIM [Association for Information and Image Management]
– Adobe makes the PDF reference manual and software development kit freely available to developers.
– Standard methodology for embedding metadata in documents: XMP (Extensible Metadata Platform), which seeks compatibility with semantic web technologies
Cons:
– Plug-in performance is poor for long documents
– PDFs composed of scanned images can be very large in size, even for short documents
DjVu
Designed to be a scan-to-web technology.
Pros:
– Best compression of any image format on the web
– Users can load lengthy documents very quickly
– The DjVu plug-in can be manipulated via CGI-style arguments
– Use the Any2DjVu server to try out the format.
Cons:
– DjVu does not yet have great market acceptance in the legal community.
– DjVu does not have a standard method for embedding metadata within documents.
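As an illustration of the CGI-style arguments mentioned under Pros, the sketch below builds a link that asks the viewer to open a document at a given page. It assumes the `djvuopts` keyword and the `page` and `zoom` options as described in the DjVu viewer documentation; verify them against the plug-in version in use. The document URL and the helper name `djvu_view_url` are hypothetical.

```python
from urllib.parse import urlencode


def djvu_view_url(document_url: str, **options) -> str:
    """Append viewer options to a DjVu document URL.

    Assumes the viewer reads options that follow the 'djvuopts' keyword
    in the query string; the option names are passed through unchanged.
    """
    if not options:
        return document_url
    return f"{document_url}?djvuopts&{urlencode(options)}"


# Hypothetical document location; page and zoom values chosen for illustration.
url = djvu_view_url("http://example.org/statutes/volume1.djvu",
                    page=42, zoom="width")
print(url)
```

Links of this kind let a search results page send the user directly to the page where a hit occurs instead of to the start of a long volume.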
Proprietary Formats
Word Processing Formats: MS Word, WordPerfect
Not a good choice for document delivery on the web
Cons:
– These formats are completely closed
– Poor cross-platform support
– It is often problematic to index these documents using inexpensive or open source IR tools.
The New Jersey Digital Legal Library
URL: http://njlegallib.rutgers.edu
Digitize New Jersey legal materials not currently available online.
Available for users in two formats: DjVu and PDF
Current Workflow:
– Scan -> TIFF; then TIFF -> PDF and TIFF -> DjVu
– Extract OCR text from the DjVu to XHTML using XSL stylesheets and DjVuLibre (the open source DjVu library)
– Use swish-e to index the XHTML documents with embedded extended Dublin Core metadata
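The last step of the workflow depends on XHTML files that carry Dublin Core metadata. The sketch below shows one plausible shape for such a file, using the common convention of `<meta name="DC.*">` elements in the document head. It is not the project's actual stylesheet output: the project uses XSL and DjVuLibre for extraction, whereas this example starts from plain text, omits the DOCTYPE, and uses invented field values and a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from html import escape

XHTML_NS = "http://www.w3.org/1999/xhtml"


def xhtml_with_dublin_core(title, body_text, dc_fields):
    """Wrap extracted text in minimal XHTML with Dublin Core <meta> elements."""
    metas = "\n".join(
        f'    <meta name="DC.{escape(name)}" content="{escape(value)}" />'
        for name, value in dc_fields.items()
    )
    paragraphs = "\n".join(
        f"    <p>{escape(p.strip())}</p>"
        for p in body_text.split("\n\n") if p.strip()
    )
    return (
        f'<html xmlns="{XHTML_NS}">\n'
        "  <head>\n"
        f"    <title>{escape(title)}</title>\n"
        '    <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />\n'
        f"{metas}\n"
        "  </head>\n"
        "  <body>\n"
        f"{paragraphs}\n"
        "  </body>\n"
        "</html>\n"
    )


# Hypothetical metadata and text, for illustration only.
doc = xhtml_with_dublin_core(
    "Sample volume",
    "First paragraph of OCR text & notes.\n\nSecond paragraph.",
    {"title": "Sample volume", "date": "1900", "type": "Text"},
)

# Parsing confirms the result is well-formed XML, as an XSL or indexing
# step downstream would require.
root = ET.fromstring(doc)
metas = root.findall(f".//{{{XHTML_NS}}}meta")
print(len(metas), "Dublin Core fields embedded")
```

Keeping the metadata inside each XHTML file means the indexed document is self-describing, so the indexer does not need a separate catalog lookup to populate fielded searches.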
References
1. Baird, Henry. Difficult and Urgent Open Problems in Document Image Analysis for Libraries. Proceedings of the First International Workshop on Document Image Analysis for Libraries. Palo Alto, CA, 2004.
2. Doermann, David. The Indexing and Retrieval of Document Images: A Survey. Computer Vision and Image Understanding 70 (3), 1998. pp. 287-298.