Document Delivery Formats for the Web and Legal Digital Collections

Kevin Reiss

June 18th, 2004

Law Library

Rutgers-Newark School of Law

Delivery Formats & Issues

Delivery format: the type of file a user receives when accessing a document in a digital collection
Important not just for viewing, but also for Information Retrieval (IR) tasks like full-text indexing
There is no one format that is right for every type of collection.
Important issues to consider:
  – Open v. Closed Formats
  – Usability and Accessibility
  – Subject Specific Concerns for Legal Materials

Open v. Closed Formats

Who is "in control" of the document format you choose? A standards body? A single company or organization?
Can you count on something that one entity controls to be supported over time?
Advantages of Open Formats (a.k.a. Standards)
  – Interoperability and support over time
  – Integrate well with open-source or low-cost processing and IR tools
  – Help web content providers who need to support an increasing variety of devices and platforms

Usability & Accessibility

What software do users need to view a particular format?
Can a web browser natively display it?
If the format requires a browser plug-in:
  – Is it free? Are users likely to have it installed?
  – Does it work on all computing platforms?
Do public search engines index the format?
Can dial-up modem users access the material in the collection?
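To make the dial-up question concrete, download time can be roughly estimated from file size and link speed. The sketch below assumes a nominal 56 kbit/s modem and ignores protocol overhead and modem compression; the function name and the file sizes are hypothetical illustrations, not measurements from any collection.

```python
def download_seconds(size_bytes: int, link_bits_per_second: int = 56_000) -> float:
    """Idealized transfer time: total bits divided by nominal link speed."""
    return size_bytes * 8 / link_bits_per_second


# Hypothetical file sizes, chosen only to show the scale of the difference.
examples = {
    "50 KB text page": 50_000,
    "1 MB scanned page image": 1_000_000,
}

for label, size in examples.items():
    print(f"{label}: about {download_seconds(size):.0f} s")
```

Real transfers are usually slower than this idealized figure, so the estimate is best read as a lower bound.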

Subject Specific Concerns for Legal Materials

Legal digital projects usually manage texts, not images.
Some types of legal materials are harder to maintain, e.g. codified material.
Legal documents are almost exclusively printed in black & white.
Preservation of the page structure is important for citation purposes.
Maintaining the original appearance of digitized print documents is not important; archival and rare materials are potential exceptions.

Possible Delivery Formats

Pure image formats: TIFF, JPEG
Open encoded formats: XML, HTML, ASCII, and Unicode
Hybrid formats: PDF, DjVu (can contain both image and text)
Proprietary formats: Microsoft Word, WordPerfect

Pure Images: TIFF, JPEG

Raster (pixel-based) formats, used exclusively for scanned collections
TIFF is the best choice for archival scanned images
Pros
  – Web browsers display them natively
  – Both are open formats
Cons
  – Large file sizes make viewing on slow connections problematic
  – Text of the documents is available only through OCR (Optical Character Recognition)
  – Weak support for multi-page documents
  – JPEGs have trouble displaying text when they are compressed to levels appropriate for the web
  – Contain metadata about the physical file itself, not the contents of the file

Imaged Formats Cont.

OCR is an important consideration:
  – A 5% error rate does not have an impact on traditional IR measures
  – A 20% error rate significantly degrades the performance of traditional IR techniques [Doermann 98]
  – High quality OCR is now available for relatively low cost
      ABBYY FineReader ($300)
      Table and page layout recognition supported
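One common way to put a number on OCR quality is the character error rate: the edit distance between a hand-corrected reference text and the OCR output, divided by the length of the reference. The slides do not say how their 5% and 20% figures were measured, so the sketch below is only an illustration of that general measure; the function names and the sample sentence are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance, computed with a rolling row of the DP table."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the number of characters in the reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical sample: one substituted character and one dropped character.
reference_text = "The court held that the statute applies."
ocr_text = "The c0urt held that the statue applies."

rate = character_error_rate(reference_text, ocr_text)
print(f"Character error rate: {rate:.1%}")
```

Measuring a small hand-corrected sample this way gives a rough sense of whether a scanned collection's OCR falls nearer the 5% or the 20% end of the range above.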

Open Encoded Formats: XML, HTML, ASCII, Unicode

Typically easier to integrate into digital libraries [Baird 2004]
  – Created in 3 ways:
      Born digital documents
      Manually keyed documents
      Corrected OCR
  – IR applications easy to build, open source support strong
  – International standards or W3C recommendations
  – Accessible with all current web technologies
  – Metadata easily embedded in XML and HTML documents
  – Can be created with any text editor
  – Improvements in OCR make encoding scanned collections feasible
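As an illustration of why embedded metadata in open formats is easy to work with, the sketch below reads `<meta>` tags out of an HTML page using only Python's standard library. The class name, the sample page, and its metadata values are hypothetical; the `DC.` prefix follows the common convention for Dublin Core in HTML and is an assumption about how a collection might label its fields.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects name/content pairs from <meta> tags as the page is parsed."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are also routed here by default.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.metadata[name] = content


# Hypothetical sample page with two embedded metadata fields.
sample_page = """<html>
  <head>
    <title>Sample opinion (hypothetical)</title>
    <meta name="DC.title" content="Sample opinion (hypothetical)" />
    <meta name="DC.date" content="2004-06-18" />
  </head>
  <body><p>Text of the document.</p></body>
</html>"""

collector = MetaTagCollector()
collector.feed(sample_page)
print(collector.metadata)
```

No special tooling is needed beyond a general-purpose parser, which is part of what makes these formats straightforward to index.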

Open Encoded Formats Cont.

Cons:
  – These documents can be expensive for staff to create
      Encoding in XML may have to be done by hand
      Manual correction of OCR errors
  – Need technical expertise on staff, such as a Perl programmer, to get the full benefits of these formats
  – These don't necessarily preserve the "look" of printed documents

Hybrid Formats: PDF, DjVu

PDF and DjVu are proprietary technologies that have substantial support in the open source community.
Both can contain a layer of the document’s text and an image of each page in a document.
Both utilize cross-platform, freely available web browser plug-ins.
Both try to preserve the look of print documents
Easy to export born digital documents to these formats using printer drivers, “print to PDF”

Adobe PDF

Pros:
  – PDF has strong market acceptance in the legal community
  – PDF-Archive, a standard for using PDF as an archival format, is in development by AIIM (Association for Information and Image Management)
  – Adobe makes the PDF reference manual and software development kit freely available to developers.
  – Standard methodology for embedding metadata in documents: XMP (Extensible Metadata Platform), which seeks compatibility with semantic web technologies
Cons:
  – Plug-in performance is poor for long documents
  – PDFs composed of scanned images can be very large in size, even for short documents

DjVu

Designed to be a scan-to-web technology.
Pros:
  – Best compression of any image format on the web
  – Users can load lengthy documents very quickly
  – The DjVu plug-in can be manipulated via CGI-style arguments
  – Use the Any2DjVu server to try out the format.
Cons:
  – DjVu does not yet have great market acceptance in the legal community.
  – DjVu does not have a standard method for embedding metadata within documents.

Proprietary Formats

Word Processing Formats: MS Word, WordPerfect
Not a good choice for document delivery on the web
Cons:
  – These formats are completely closed
  – Poor cross-platform support
  – It is often problematic to index these documents using inexpensive or open source IR tools.

The New Jersey Digital Legal Library

URL: http://njlegallib.rutgers.edu
Digitizes New Jersey legal materials not currently available online.
Available for users in two formats: DjVu and PDF
Current workflow:
  – Scan -> TIFF; then TIFF -> PDF and TIFF -> DjVu
  – Extract OCR text from the DjVu to XHTML using XSL stylesheets and DjVuLibre (the open source DjVu library)
  – Use swish-e to index the XHTML documents with embedded extended Dublin Core metadata
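The last two workflow steps produce XHTML pages that carry both the OCR text and Dublin Core metadata. The project itself uses XSL stylesheets for this; the sketch below swaps in plain Python only to show the shape such a page might take. The function name, the metadata values, and the sample text are all hypothetical, and the `DC.` naming in `<meta>` tags is an assumption rather than the project's documented scheme.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def build_xhtml(title: str, metadata: dict, ocr_text: str) -> str:
    """Wrap OCR text in a minimal XHTML page with metadata in <meta> tags."""
    meta_lines = "\n".join(
        f"    <meta name={quoteattr(name)} content={quoteattr(value)} />"
        for name, value in metadata.items()
    )
    # Treat blank-line-separated blocks of OCR text as paragraphs.
    paragraphs = "\n".join(
        f"    <p>{escape(block.strip())}</p>"
        for block in ocr_text.split("\n\n")
        if block.strip()
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        "  <head>\n"
        f"    <title>{escape(title)}</title>\n"
        f"{meta_lines}\n"
        "  </head>\n"
        "  <body>\n"
        f"{paragraphs}\n"
        "  </body>\n"
        "</html>\n"
    )


# Hypothetical metadata and OCR text, for illustration only.
document = build_xhtml(
    title="Sample report (hypothetical)",
    metadata={"DC.title": "Sample report (hypothetical)", "DC.date": "2004"},
    ocr_text="First paragraph of OCR text.\n\nSmith & Jones, second paragraph.",
)

# Parsing the result confirms that the page is well-formed XML.
root = ET.fromstring(document.encode("utf-8"))
print(root.tag)
```

Escaping the text and attribute values matters here because OCR output often contains characters such as `&` and `<` that would otherwise break the XHTML an indexer has to read.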

References

1. Baird, Henry. Difficult and Urgent Open Problems in Document Image Analysis for Libraries. Proceedings of the First International Workshop on Document Image Analysis for Libraries. Palo Alto, CA, 2004.
2. Doermann, David. The Indexing and Retrieval of Document Images: A Survey. Computer Vision and Image Understanding 70 (3), 1998, pp. 287-298.
