Document Computing Technologies for Managing Electronic Document Collections Ross Wilkinson... [et...

Post on 18-Jan-2016


Document Computing

Technologies for Managing Electronic Document Collections

Ross Wilkinson ... [et al.]

Circulation Counter  [RES3H]  ZA4080 .D63 1998  

Chapter 1

Document Lifecycle

What is a document?

A document records a message from people to people.

Characteristics of a document

• Content

• Structure

• Metadata

Metadata

• A message has a context, which is important for understanding the message.

• A document contains not only the contents of a message, but also some information about the document, e.g. author, date, recipients.

• We call such information the metadata about the document.


Why Document Management?

• It is hard to find documents.

• It is hard to organize documents.

• It is hard to control documents.

• Metadata helps document management.

Benefits of Document Management

• Location-independent delivery of documents upon demand

• Controlled access to documents

• A record of the life of a document

• Better re-use of documents

Chapter 2

Electronic Document Description

Document Content

• Simplest type of content – unformatted text

• Text retrieval system based on search by keywords

• E.g. Windows Desktop Search (video)

• Optical character recognition (OCR) system
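Keyword-based text retrieval can be illustrated with a small inverted index. The following is a minimal sketch, not how any particular product works: the function names (tokenize, build_index, search) and the two sample documents are assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every keyword in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical two-document collection, for illustration only.
docs = {
    "memo1": "I have finished Stage B of the design.",
    "memo2": "Stage C of the build starts next week.",
}
index = build_index(docs)
print(search(index, "stage design"))
```

Only memo1 contains both keywords, so it is the only match. Real retrieval systems add ranking, stemming and phrase search on top of this basic idea.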

Document Structure

• Even unformatted text has some structure, e.g. lines, words, images, etc.

• A document may have elaborate structures.

• Two levels of structure:

– Logical structure

– Presentational structure

Logical structures

• Example:

TO: John D.

FROM: Kate M.

DATE: 7/8/98

I have finished Stage B of the design. Could you take a look at it?

• Simple logical structure: lines of text

• A logical structure of a memo: (see next slide)

A logical structure for a memo

Memo
    Head
        Sender
        Receiver
        Date
    Body
        Paragraph

Presentational Structure

• A different presentational structure for the same memo

John D., 7/8/98

I have finished Stage B of the design. Could you take a look at it?

Kate M.

Presentation medium

• The content of the same document can be presented in different media with different presentational structures:

• E.g. a PDF file vs. an online Web page
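The separation between logical and presentational structure can be sketched in code: one logical memo, two renderings. This is only an illustration; the Memo class and the two render functions are hypothetical names, and the layouts simply mimic the two memo examples above.

```python
from dataclasses import dataclass


@dataclass
class Memo:
    """Logical structure: a head (sender, receiver, date) and a body."""
    sender: str
    receiver: str
    date: str
    paragraphs: list


def render_headed(memo):
    """Presentation 1: labelled TO/FROM/DATE lines, then the body."""
    lines = [
        f"TO: {memo.receiver}",
        f"FROM: {memo.sender}",
        f"DATE: {memo.date}",
        "",
    ]
    lines.extend(memo.paragraphs)
    return "\n".join(lines)


def render_letter(memo):
    """Presentation 2: receiver and date first, sender as a signature."""
    lines = [f"{memo.receiver}, {memo.date}", ""]
    lines.extend(memo.paragraphs)
    lines.extend(["", memo.sender])
    return "\n".join(lines)


memo = Memo(
    sender="Kate M.",
    receiver="John D.",
    date="7/8/98",
    paragraphs=["I have finished Stage B of the design. Could you take a look at it?"],
)
print(render_headed(memo))
print()
print(render_letter(memo))
```

The logical content is stored once; each presentation is just a different function over it.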

Metadata

• Generally, we need metadata to capture:

– Registration information

– Usage information

– Structural properties

– Contextual information

– Content description

– Historical information

The Dublin Core metadata set

• Title

• Creator

• Subject

• Description

• Publisher

• Contributors

• Date

• Type

• Format: e.g. HTML, PDF

• Identifier: e.g. URI

• Source

• Language

• Relation

• Coverage: duration

• Rights: e.g. copyright
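A Dublin Core record is essentially a set of named fields, so it maps naturally onto a dictionary. The sketch below is an assumption-laden illustration: the record values are made up, and the "DC." prefix with upper-case names is one common convention for embedding such fields in HTML META tags, not the only one.

```python
from html import escape

# Hypothetical record using a few Dublin Core elements; values are illustrative.
record = {
    "title": "Memo",
    "creator": "Kate M",
    "date": "7/8/1998",
    "format": "text/html",
}


def to_meta_tags(record):
    """Render each metadata field as an HTML META tag with a DC. prefix."""
    return [
        f'<META NAME="DC.{name.upper()}" CONTENT="{escape(value, quote=True)}">'
        for name, value in record.items()
    ]


for tag in to_meta_tags(record):
    print(tag)
```

Escaping the values keeps quotes or angle brackets in a title from breaking the generated markup.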

Document Description Language (DDL)

• For use by document management systems

• E.g. RTF, PostScript, SGML

• DDL support:

– Language support, media support, transparency, structure, link support, metadata support

• Other DDL characteristics:

– Document creation, import conversion, export transformation, update, presentation quality, presentation flexibility, etc.

Examples of DDLs

• ASCII (American Standard Code for Information Interchange)

• Unicode

• ASCII and Unicode offer very limited support

• Rich Text Format

• TeX and LaTeX

• SGML, HTML, XML

• PostScript, PDF

Rich Text Format (RTF)

• Developed by Microsoft

• For interchange between Microsoft Word and other software

• Main purposes:

– Preserve information in Word (blocks of text)

• Example: next slide

{\rtf1\adeflang1025\ansi\ansicpg1252\uc2\adeff0\deff0\stshfdbch13\stshfloch0\stshfhich0\stshfbi0\deflang2057\deflangfe1028{\fonttbl{\f0\froman\fcharset0\fprq2{\*\panose 02020603050405020304}Times New Roman

{\title John D}{\author Dr. Yeung}{\operator Dr. Yeung}{\creatim\yr2008\mo3\dy18\hr15\min24}{\revtim\yr2008\mo3\dy18\hr15\min25}{\version1}{\edmins1}{\nofpages1}{\nofwords14}{\nofchars81}{\*\company Lingnan University}{\nofcharsws94}

\ltrch\fcs0 \insrsid1782868\charrsid1782868 \hich\af0\dbch\af13\loch\f0 John D., 7/8/98

\par \hich\af0\dbch\af13\loch\f0 I have finished Stage B of the design. Could you take a look at it?

\par

\par \hich\af0\dbch\af13\loch\f0 Kate M\hich\af0\dbch\af13\loch\f0 .

\par }\pard \ltrpar\ql \li0\ri0\widctlpar\wrapdefault\aspalpha\aspnum\faauto\adjustright\rin0\lin0\itap0 {\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid4811147

\par }}

TeX and LaTeX

• TeX created by Donald Knuth

• TeX is a typesetting system.

• LaTeX was created by Leslie Lamport on top of TeX.

• LaTeX uses markup constructs to separate logical description from presentation.

• LaTeX example: see next slide


\documentclass{article}
\usepackage{times}
\pagestyle{empty}

\begin{document}

\title{Sample Document}

\author{W. L. Yeung\\Department of Computing and Decision Sciences\\Lingnan University, Hong Kong\\wlyeung@ln.edu.hk}

\maketitle

\section{Introduction}

\section{Conclusion}

\end{document}

SGML

• Standard Generalized Markup Language

• To describe a document in SGML, we need:

– An SGML declaration

– A document type definition (DTD)

– A document instance

• An SGML declaration specifies which characters are used in the DTD. Normally a default is used.

SGML (cont.)

• A document type definition (DTD) defines the rules for forming a class of documents, i.e. the grammar of a document class.

• The building blocks of SGML documents are elements.

• A DTD for the memo document: next slide.

<!-- DTD for office memo -->

<!-- ELEMENT CONTENT -->

<!ELEMENT memo - - (head, body, close?) >

<!ELEMENT head O O (to & from & date) >

<!ELEMENT to - - (#PCDATA) >

<!ELEMENT from - - (#PCDATA) >

<!ELEMENT date - - (#PCDATA) >

<!ELEMENT body - - (par+) >

<!ELEMENT par - - (#PCDATA) >

<!ELEMENT close - - (#PCDATA) >

<!-- ELEMENT NAME VALUE DEFAULT -->

<!ATTLIST memo status (con|pub) pub >

<!ATTLIST par id id #IMPLIED >

DTD

• An element definition gives the name of the element, then the rules for building that element.

• Elements can contain other elements.

• Terminal (basic) elements often consist of parsed character data, "#PCDATA", or unparsed character data, "CDATA".

The memo in SGML

<MEMO>

<TO> John D </TO>

<FROM> Kate M </FROM>

<DATE> 7/8/1998 </DATE>

<BODY>

<PAR>

I have finished Stage B of the design.

</PAR>

</BODY>

</MEMO>

HTML

• Hypertext Markup Language

• For World Wide Web (WWW) documents

• Conforms to an SGML DTD

• HTML is presentation oriented: instructions (tags) are inserted into a document for presentation effects

• The DTD for HTML is available on http://www.w3.org/TR/html401/sgml/dtd.html

The memo in HTML

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<HTML>
<HEAD>
<TITLE>Memo</TITLE>
<META NAME="DC.AUTHOR" CONTENT="Kate M">
<META NAME="DC.DATE" CONTENT="7/8/1998">
</HEAD>
<BODY>
<H1>Memo</H1>
<P>I have finished Stage B of the <A HREF="/team3/design2">design</A>.</P>
</BODY>
</HTML>
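Metadata embedded in META tags like those in the memo can be read back by a program. The following is a minimal sketch using Python's standard html.parser module; the MetaCollector class name and the shortened sample page are assumptions for illustration.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect NAME/CONTENT pairs from META tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            if name is not None:
                self.meta[name] = attributes.get("content", "")


# Shortened version of the memo page, with straight quotes.
page = """<HTML><HEAD><TITLE>Memo</TITLE>
<META NAME="DC.AUTHOR" CONTENT="Kate M">
<META NAME="DC.DATE" CONTENT="7/8/1998">
</HEAD><BODY><H1>Memo</H1></BODY></HTML>"""

collector = MetaCollector()
collector.feed(page)
print(collector.meta)
```

A document management system could use the collected fields to index Web pages by author or date without parsing the body text.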

XML

• Extensible Markup Language

• Three basic definitions:

– XML for representing data and documents

– XLink and XPointer for representing inter-document linking

– XSL for representing presentation

• XML is a near-subset of SGML

XML (Cont.)

• Two classes of XML documents:

– Valid XML documents: documents that conform to a specific supplied DTD

– Well-formed documents: only satisfy a simple default grammar, without conforming to a specific DTD

• XML has become a cornerstone of electronic commerce because it lets businesses exchange electronic documents in standard XML-based formats.
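The notion of a well-formed document can be tested mechanically. The sketch below uses Python's standard xml.etree.ElementTree parser, which checks well-formedness only and does not validate against a DTD; the function name is_well_formed and the two sample strings are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Properly nested tags: well-formed.
good = "<memo><to>John D</to><from>Kate M</from></memo>"
# The <to> element is closed with </from>: not well-formed.
bad = "<memo><to>John D</from></memo>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

Checking validity would additionally require a validating parser and the DTD for the document class, which this sketch does not attempt.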

PostScript

• Developed by Adobe

• For representing documents that are to be printed (mainly on laser printers)

• A page description language optimized for printing text, images, graphics.

Portable Document Format (PDF)

• Developed by Adobe

• A page description language for representing text, graphics and images

• A PDF file contains presentation information on pages, annotations, links, fonts, etc.

• Supports delivery of electronic documents exactly as they would appear in printed form.

• Not designed for editing or document format exchange.