1

Digital preservation. Principles and potential role of XML

Giovanni Michetti

Urbino, 9th October 2002

2

Documents: form vs. content?

Traditional environment:

Form

Content

3

Documents: form vs. content?

Digital environment:

Form

Content

4

Documents: structure

Structure is unavoidably present in documents

As complexity grows, structure grows

Structure is (part of the) message

We deal with structure not only in the digital environment

5

Documents: structure and digital environment

Moving information onto new media

Need for functionality to manage the explosive growth of information

Need to make structure explicit

6

Markup

The proper description of an information resource requires:

identifying its logical components

making its structure explicit

Markup

7

Markup

Markup: every means of making the interpretation of a document explicit

8

From a record ...

University of Urbino
Faculty of Arts

Rome, 1st August 2002
Dr. Giovanni Michetti

Protocol n. 1234/AB
Subject: Teaching appointment

We inform you that you have been offered the teaching of Analysis and treatment of digital records by the Faculty of Arts Council, during the meeting of 30th July 2002. We remind you that for the stipulation of the contract we need, according to the legislative decree n. 80/1998, the authorization by the administration you belong to.

The Dean
Prof. Giorgio Cerboni Baiardi

Faculty of Arts
Piano S. Lucia 6 - 61029 Urbino
Tel: 0722.320125 Fax: 0722.322553 Email: [email protected]

9

… to a marked record ...

<letter>
<sender>University of Urbino
Faculty of Arts</sender>
<date>Rome, 1st August 2002</date>
<addressee>Dr. Giovanni Michetti</addressee>
<protocolnumb>Protocol n. 1234/AB</protocolnumb>
<subject>Subject: Teaching appointment</subject>
<text>We inform you that you have been offered the teaching of Analysis and treatment of digital records by the Faculty of Arts Council, during the meeting of 30th July 2002. We remind you that for the stipulation of the contract we need, according to the legislative decree n. 80/1998, the authorization by the administration you belong to.</text>
<author>The Dean
Prof. Giorgio Cerboni Baiardi</author>
<heading>Faculty of Arts
Piano S. Lucia 6 - 61029 Urbino
Tel: 0722.320125 Fax: 0722.322553 Email: [email protected]</heading>
</letter>

10

… to a DTD ...

<!ELEMENT letter (sender, date, addressee, protocolnumb, subject, text, author, heading)>
<!ELEMENT sender (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT addressee (#PCDATA)>
<!ELEMENT protocolnumb (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT heading (#PCDATA)>

11

… to a more precise DTD

<!ELEMENT letter (sender, date, addressee, precedent?, protocolnumb, classif?, subject, text, attachments?, author, heading)>
<!ELEMENT sender (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT addressee (#PCDATA)>
<!ELEMENT protocolnumb (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT precedent (#PCDATA)>
<!ELEMENT classif (#PCDATA)>
<!ELEMENT attachments (#PCDATA)>

12

Let’s refine the markup ...

<letter>
<sender><body>University of Urbino</body>
<bureau>Faculty of Arts</bureau></sender>
<date><place>Rome,</place><time>1st August 2002</time></date>
<addressee>Dr. Giovanni Michetti</addressee>
<protocolnumb>Protocol n. 1234/AB</protocolnumb>
<subject>Subject: Teaching appointment</subject>
<text>We inform you that you have been offered the teaching of Analysis and treatment of digital records by the Faculty of Arts Council, during the meeting of 30th July 2002. We remind you that for the stipulation of the contract we need, according to the legislative decree n. 80/1998, the authorization by the administration you belong to.</text>
<author><role>The Dean</role><name>Prof. Giorgio Cerboni Baiardi</name></author>
<heading>Faculty of Arts
Piano S. Lucia 6 - 61029 Urbino
Tel: 0722.320125 Fax: 0722.322553 Email: [email protected]</heading>
</letter>

13

... keeping on refining ...

<letter>
<sender><body>University of Urbino</body>
<bureau>Faculty of Arts</bureau></sender>
<date><place>Rome,</place><time>1st August 2002</time></date>
<addressee>Dr. Giovanni Michetti</addressee>
<!-- protocolnumb, subject and text as in the previous version -->
<author><role>The Dean</role><name><title>Prof.</title><propername>Giorgio</propername><surname>Cerboni Baiardi</surname></name></author>
<heading><bureau>Faculty of Arts</bureau>
<address>Piano S. Lucia 6 - 61029 Urbino</address>
<tel>Tel: 0722.320125</tel><fax>Fax: 0722.322553</fax><email>Email: [email protected]</email></heading>
</letter>

14

… and let’s refine the DTD

<!ELEMENT letter (sender, date, addressee, precedent?, protocolnumb, classif?, subject, text, attachment?, author, heading)>
<!ELEMENT sender (body, bureau)>
<!ELEMENT body (#PCDATA)>
<!ELEMENT bureau (#PCDATA)>
<!ELEMENT date (place, time)>
<!ELEMENT place (#PCDATA)>
<!ELEMENT time (#PCDATA)>
<!ELEMENT addressee (#PCDATA)>
<!ELEMENT precedent (#PCDATA)>
<!ELEMENT protocolnumb (#PCDATA)>
<!ELEMENT classif (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT attachment (#PCDATA)>
<!ELEMENT author (role, name)>
<!ELEMENT role (#PCDATA)>
<!ELEMENT name (title, propername, surname)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT propername (#PCDATA)>
<!ELEMENT surname (#PCDATA)>
<!ELEMENT heading (bureau, address, tel, fax, email)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT tel (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT email (#PCDATA)>

15

The final DTD

<!ELEMENT letter (sender, date, addressee+, precedent?, protocolnumb, classif?, subject, text, attachment?, author, heading?)>
<!ELEMENT sender (body?, bureau)>
<!ELEMENT body (#PCDATA)>
<!ELEMENT bureau (#PCDATA)>
<!ELEMENT date (place, time)>
<!ELEMENT place (#PCDATA)>
<!ELEMENT time (#PCDATA)>
<!ELEMENT addressee (#PCDATA)>
<!ELEMENT precedent (#PCDATA)>
<!ELEMENT protocolnumb (#PCDATA)>
<!ELEMENT classif (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT attachment (#PCDATA)>
<!ELEMENT author (role?, name)>
<!ELEMENT role (#PCDATA)>
<!ELEMENT name (title?, propername?, surname)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT propername (#PCDATA)>
<!ELEMENT surname (#PCDATA)>
<!ELEMENT heading (bureau?, address?, tel?, fax?, email?)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT tel (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT email (#PCDATA)>

16

XML declaration

Every XML document should start with an XML declaration, like

<?xml version="1.0"?>

The declaration must be at the very start of the document: nothing may precede it (no comments, processing instructions or white space)

17

XML declaration

A parser uses the first 5 characters, <?xml, to work out which character encoding the document uses

The version attribute must have value 1.0

18

XML declaration

It is possible to specify the character encoding using the optional encoding attribute.

Example:

<?xml version="1.0" encoding="ISO-8859-1"?>
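A minimal sketch of how a parser treats the declaration, assuming Python's standard xml.etree.ElementTree module as the parser (a choice made here for illustration only). The sample content Niccolò and the helper name parses are hypothetical.

```python
import xml.etree.ElementTree as ET

# The document is stored as ISO-8859-1 bytes and says so in its declaration.
document = '<?xml version="1.0" encoding="ISO-8859-1"?><author>Niccolò</author>'
raw_bytes = document.encode("iso-8859-1")

# The parser reads the declaration and decodes the content accordingly.
author = ET.fromstring(raw_bytes)
decoded_name = author.text


def parses(data: bytes) -> bool:
    """Return True if the parser accepts the data as an XML document."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# The declaration is accepted only at the very start of the document.
declaration_first = parses(b'<?xml version="1.0"?><author/>')
space_before = parses(b' <?xml version="1.0"?><author/>')
```

Here decoded_name holds the accented name correctly, and the document with a leading space is rejected.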

19

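Elements

Elements are the most important components of XML documents: they are the logical components through which you can identify the structure of documents. Example:

<author>Giovanni Michetti</author>

Here <author> is the start-tag and </author> is the end-tag; < and > are the delimiters, author is the tag name and Giovanni Michetti is the content. Start-tag, content and end-tag together form the element.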

20

Elements

Each start-tag must have a corresponding end-tag (starting with a forward slash)

Empty elements (like <img>, <br>, <hr> in HTML) can be written as a single tag that ends with a forward slash before the closing bracket. Example: <image/>
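The rules on tags can be tried out with a short sketch, assuming Python's standard xml.etree.ElementTree module as the parser. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

author = ET.fromstring("<author>Giovanni Michetti</author>")
tag_name = author.tag   # the name shared by the start-tag and the end-tag
content = author.text   # the character data between the two tags

# The two spellings of an empty element are equivalent to the parser.
short_form = ET.fromstring("<image/>")
long_form = ET.fromstring("<image></image>")
same_element = ET.tostring(short_form) == ET.tostring(long_form)

# A start-tag without its end-tag is an error.
try:
    ET.fromstring("<author>Giovanni Michetti")
    missing_end_tag_accepted = True
except ET.ParseError:
    missing_end_tag_accepted = False
```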

21

Attributes

Attributes are expressed as name-value pairs associated with elements and appearing only in start-tags (or empty-element tags)

Names are separated from their values by an equals sign (=). Values are wrapped in single or double quotes

Attributes must be associated with elements

The order of the attributes inside a start-tag is not significant
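A small sketch of these rules, assuming Python's standard xml.etree.ElementTree module. The attribute names protocol and year are hypothetical and do not appear in the DTDs shown earlier.

```python
import xml.etree.ElementTree as ET

# Same attributes, different order and different quote style.
first = ET.fromstring('<letter protocol="1234/AB" year="2002"/>')
second = ET.fromstring("<letter year='2002' protocol='1234/AB'/>")

# Attributes are exposed as a name-value mapping on the element.
protocol = first.get("protocol")

# Neither the order nor the kind of quotes changes the parsed result.
same_attributes = first.attrib == second.attrib
```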

22

XML tree

An XML document is a hierarchical tree. It starts from a root (the root or document element) and branches into child elements; children of the same parent are siblings

23

XML tree

Each element has one and only one parent (except the root)

Each element (except the root) is completely nested inside another element
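The tree can be walked programmatically. The sketch below assumes Python's standard xml.etree.ElementTree module and uses a shortened version of the letter; the dictionary name parent_of is illustrative.

```python
import xml.etree.ElementTree as ET

letter = ET.fromstring(
    "<letter>"
    "<sender><body>University of Urbino</body>"
    "<bureau>Faculty of Arts</bureau></sender>"
    "<addressee>Dr. Giovanni Michetti</addressee>"
    "</letter>"
)

# Map every element to its parent; the root is the only element left out.
parent_of = {child: parent for parent in letter.iter() for child in parent}

root_name = letter.tag
children_of_root = [child.tag for child in letter]

body = letter.find("sender/body")
parent_of_body = parent_of[body].tag
siblings_of_body = [e.tag for e in parent_of[body] if e is not body]
```

Every element except the root appears exactly once as a key of parent_of, which is the "one and only one parent" rule in practice.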

24

Entities

Example:

<author>Giovanni Michetti</author>

The string Giovanni Michetti (the element content) is also called character data. Character data can appear anywhere inside elements, or as values of attributes

25

Entities

Some special characters are not allowed in text blocks: what if we want to use the less-than symbol < in a mathematical formula (a < b)?

Stratagem 1

Stratagem 2

26

Entities

1. CDATA sections: they start with the CDATA start marker

<![CDATA[

and end with the CDATA end marker

]]>

27

Entities

2. Entity references

Example: &lt; stands for <

The parser recognizes the entity reference &lt; and substitutes it with the proper value <
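Both approaches, the CDATA section and the entity reference, can be compared in a short sketch, assuming Python's standard xml.etree.ElementTree module. The element name formula is hypothetical.

```python
import xml.etree.ElementTree as ET

# Approach 1: wrap the text in a CDATA section.
with_cdata = ET.fromstring("<formula><![CDATA[a < b]]></formula>")

# Approach 2: replace the special character with an entity reference.
with_entity = ET.fromstring("<formula>a &lt; b</formula>")

# Writing the character directly is rejected by the parser.
try:
    ET.fromstring("<formula>a < b</formula>")
    raw_accepted = True
except ET.ParseError:
    raw_accepted = False

same_text = with_cdata.text == with_entity.text
```

Both accepted forms deliver the same character data, a < b, to the application.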

28

Entities

A parser is a piece of software able to read and interpret an XML document. A parser reads the XML document as plain text

Some parsers (validating parsers) are able to check the conformance of an XML document with a DTD

29

Entities

Standard (i.e. predefined) entities:

&lt;    <
&gt;    >
&amp;   &
&apos;  '
&quot;  "

Any XML parser recognizes these entities and substitutes them with the proper values
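A brief sketch of the substitution in both directions, assuming Python's standard library: xml.etree.ElementTree for parsing and xml.sax.saxutils.escape for writing. The element name t is a placeholder.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Parsing: the five predefined entity references become plain characters.
parsed = ET.fromstring("<t>&lt; &gt; &amp; &apos; &quot;</t>").text

# Writing: escape() goes the other way for the characters that need it
# in element content (&, < and >).
escaped = escape("a < b & b > c")

# Parsing the escaped text gives back the original string.
round_trip = ET.fromstring("<t>" + escaped + "</t>").text
```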

30

Well-formed documents

Any XML document must be well formed: it has to comply with some constraints, some of which are:

Each start-tag has a corresponding end-tag

Elements can’t overlap

There must be one and only one root element

Attribute values must be quoted

An element can’t contain different attributes with the same name
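Each of these constraints can be checked with a non-validating parser. The sketch below assumes Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


checks = {
    "well formed": is_well_formed('<letter id="1"><date>2002</date></letter>'),
    "missing end-tag": is_well_formed("<letter><date>2002</letter>"),
    "overlapping elements": is_well_formed("<a><b></a></b>"),
    "two root elements": is_well_formed("<letter/><letter/>"),
    "unquoted attribute": is_well_formed("<letter id=1/>"),
    "duplicate attribute": is_well_formed('<letter id="1" id="2"/>'),
}
```

Only the first entry is accepted; each of the others breaks one of the constraints listed above.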

31

Document Type Definition (DTD)

Once we have created a set of tags and attributes, we need to share it with other users so that everyone adopts the same syntax

We need a Document Type Definition (DTD)

32

Document Type Definition (DTD)

A DTD defines what markup can be used in a document that is supposed to conform to a specific structure, whose components are identified by tags

33

Document Type Definition (DTD)

For example, a DTD defines what elements a document can contain, their occurrences, their order, and so on

A DTD can set out which attributes an element can take and whether a value is required. It is also possible to define a set of predefined values for the attributes, and so on

34

Internal and external DTD

A DTD can be an external file or it can be included as part of the XML document. If it is an external file, the XML document must contain an explicit reference to it inside the document type declaration:

<!DOCTYPE MyXMLDoc SYSTEM "file.dtd">

35

Internal and external DTD

A DTD can also be written inside the document type declaration. In this case we have an internal DTD, like:

<!DOCTYPE MyXMLDoc [
<!ELEMENT MyXMLDoc (#PCDATA)>
]>

In this case, all the constraints on the structure of the document are provided as declarations inside the square brackets

36

Element declarations

A DTD is a set of declarations, the most important of which is the element declaration. Any DTD must have at least one element declaration (the one for the root element)

The syntax for a declaration is:

<!ELEMENT elementname (contentmodel)>

37

Element declarations

Example:

<!ELEMENT anthology (poem+)>
<!ELEMENT poem (title?, (stanza+ | line+))>
<!ELEMENT title (#PCDATA)>
<!ELEMENT stanza (line+)>
<!ELEMENT line (#PCDATA)>

38

Cardinality suffixes

Cardinality suffixes are symbols used to specify how many times an element can occur at a certain point of the structure. Symbols used are:

?       0-1
+       1-n
*       0-n
(none)  1

39

Connectors

Connectors are symbols used to specify order and relationships between components of a content model

Symbols used are:

, (comma): the components must appear in the given order

| (vertical line): exactly one of the components must appear
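Cardinality suffixes and connectors behave much like regular-expression operators. The sketch below is a simplified illustration in Python, not how a validating parser is actually implemented: it translates a content model into a regular expression and matches it against the sequence of child element names. The function names are hypothetical, and only content models made of element names are handled (no #PCDATA, EMPTY or ANY).

```python
import re


def content_model_to_regex(model):
    """Translate a DTD content model made of element names into a regex."""
    tokens = re.findall(r"[A-Za-z_][\w.-]*|[()?,+*|]", model)
    parts = []
    for token in tokens:
        if token == "(":
            parts.append("(?:")
        elif token == ",":
            continue  # a sequence is plain concatenation in a regex
        elif token in ")?+*|":
            parts.append(token)  # same meaning in both notations
        else:
            # An element name, followed by the separator used below.
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))


def children_match(model, child_names):
    """Check a list of child element names against a content model."""
    sequence = "".join(name + " " for name in child_names)
    return content_model_to_regex(model).fullmatch(sequence) is not None


POEM = "(title?, (stanza+ | line+))"
LETTER = ("(sender, date, addressee+, precedent?, protocolnumb, classif?, "
          "subject, text, attachment?, author, heading?)")

poem_ok = children_match(POEM, ["title", "stanza", "stanza"])
poem_mixed = children_match(POEM, ["stanza", "line"])
letter_ok = children_match(
    LETTER,
    ["sender", "date", "addressee", "protocolnumb",
     "subject", "text", "author", "heading"],
)
letter_no_sender = children_match(LETTER, ["date", "addressee"])
```

A poem with a title and two stanzas matches, a poem mixing stanzas and loose lines does not, and the refined letter matches the final content model for letter.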

40

Attribute declarations

An attribute declaration allows you to define the attributes associated with a given element

The syntax for a declaration is:

<!ATTLIST element_name attribute_definition*>

where an attribute definition is like:

attribute_name attribute_type default_declaration
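A sketch of a concrete declaration inside an internal DTD, assuming Python's standard xml.etree.ElementTree module as the parser. The attribute names protocol, status and lang are hypothetical. ElementTree's parser does not validate, so here the declaration documents the rules rather than enforcing them.

```python
import xml.etree.ElementTree as ET

# Three attribute definitions for the element "letter":
#   protocol  character data, a value is required
#   status    one of two enumerated values, "final" by default
#   lang      a name token, optional
document = """<!DOCTYPE letter [
  <!ELEMENT letter (#PCDATA)>
  <!ATTLIST letter
            protocol CDATA          #REQUIRED
            status   (draft|final)  "final"
            lang     NMTOKEN        #IMPLIED>
]>
<letter protocol="1234/AB" status="final">Teaching appointment</letter>"""

letter = ET.fromstring(document)
protocol = letter.get("protocol")
status = letter.get("status")
lang = letter.get("lang")  # not written in the start-tag, no default declared
```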

41

Valid documents

Well-formed documents: XML documents conforming to the rules laid down in the XML 1.0 specification

Valid documents: well-formed documents conforming to the rules laid down in a DTD

42

Stylesheets

So far the structure. But how can we render documents in the proper way?

Stylesheets

43

Stylesheets

Since content is separated from style, we no longer need to rewrite the whole document each time we want to change the layout: we simply need to change the “instructions” that modify rendering. In other words, we can modify representation without modifying content

XSL (eXtensible Stylesheet Language) is a style language based upon DSSSL (Document Style Semantics and Specification Language)
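The principle of changing the rendering without touching the content can be illustrated with a toy sketch. This is not XSL: plain Python dictionaries stand in for stylesheets, and the names plain_style, html_style and render are hypothetical.

```python
import xml.etree.ElementTree as ET

letter = ET.fromstring(
    "<letter><subject>Teaching appointment</subject>"
    "<author>Prof. Giorgio Cerboni Baiardi</author></letter>"
)

# Two stand-in "stylesheets": each maps an element name to a template.
plain_style = {"subject": "Subject: {}", "author": "Signed: {}"}
html_style = {"subject": "<h1>{}</h1>", "author": "<p><em>{}</em></p>"}


def render(document, style):
    """Apply the template of each child element to its content."""
    return "\n".join(style[child.tag].format(child.text) for child in document)


as_text = render(letter, plain_style)
as_html = render(letter, html_style)
```

The same letter yields two different representations, and only the style mapping changes between them.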

44

So far the document …

… but a document is (generally) part of a file, which is in turn part of a series or a more complex archival collection

Archival bond

45

The object of analysis: from documents ...

46

… to files ...

47

.....

48

… to series

49

Archives: a complex system of relationships

File

Series

Archive

Document

50

Preserving, of course; but what?

Preserving:

Original data?

Context allowing data to be interpreted?

Hardware?

51

Preserving context

Preserving the context

Need to manage a network of metadata

52

XML technologies

XML Schema

Document Object Model (DOM)

Simple API for XML (SAX)

XSLT/XPath

XML Query

XLink

XPointer

XML Base

XForms

XML Fragment Interchange

XInclude

53

XML features

It’s a formal, non-proprietary standard: it is acceptable to a wide range of users

It’s a meta-language: it allows you to define DTDs and validate documents

It allows you to manage highly structured documents

It’s human-readable and self-descriptive: it has a good chance of lasting

It uses Unicode text: no problems related to internationalization

54

XML features

It’s a family of technologies

It’s modular

It’s license-free and platform-independent

It can be transported across the Web using existing transport protocols: communication and security structures already in place can be reused

55

XML features

It allows metadata to be managed easily

It provides a very good mechanism for representing the layout

It’s easy and powerful, but not too expensive

56

XML double-edged features

1. It’s a meta-language: it allows you to define DTDs, hence the danger of specialization (each user community with its own language)

Without a common language, XML is not so competitive with respect to other mechanisms of data interchange

XSL does allow translation between different encodings, but it can be quite complex

RosettaNet and OASIS: trying to adopt common languages

57

XML double-edged features

2. It’s self-descriptive: you can create documents without using a DTD ...

58

XML double-edged features

3. It supports sophisticated searching by means of the tags embedded in the text, but poor markup (incomplete or incorrect) greatly reduces search effectiveness
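A small sketch of both sides of this point, assuming Python's standard xml.etree.ElementTree module. The series wrapper, the second and third letters and the function name subjects_from are hypothetical.

```python
import xml.etree.ElementTree as ET

collection = ET.fromstring(
    "<series>"
    "<letter><sender><bureau>Faculty of Arts</bureau></sender>"
    "<subject>Teaching appointment</subject></letter>"
    "<letter><sender><bureau>Faculty of Law</bureau></sender>"
    "<subject>Exam schedule</subject></letter>"
    # Incomplete markup: the bureau is in the text but it is not tagged.
    "<letter><sender>Faculty of Arts</sender>"
    "<subject>Library hours</subject></letter>"
    "</series>"
)


def subjects_from(bureau_name):
    """Subjects of the letters whose tagged bureau equals bureau_name."""
    return [
        letter.findtext("subject")
        for letter in collection.findall("letter")
        if letter.findtext("sender/bureau") == bureau_name
    ]


found = subjects_from("Faculty of Arts")
```

The query finds the first letter through its bureau tag, but it misses the third one, which comes from the same bureau and lacks the tag.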

59

XML limitations

It’s a syntax: it contains no semantics, so you need to use other technologies such as XML Schema and RDF

It’s based upon text: the size of the markup can be much larger than the data itself
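The overhead can be measured on a fragment of the refined letter. The sketch assumes Python's standard xml.etree.ElementTree module and counts characters rather than bytes; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

record = (
    "<author><role>The Dean</role><name><title>Prof.</title>"
    "<propername>Giorgio</propername><surname>Cerboni Baiardi</surname>"
    "</name></author>"
)

# The data is the character content of all elements, without any tags.
data = "".join(ET.fromstring(record).itertext())

data_size = len(data)
markup_size = len(record) - data_size
markup_is_larger = markup_size > data_size
```

For this finely tagged fragment the tags take up more characters than the content they describe.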

60

Preservation

Some considerations ...

61

Thanks to all

Giovanni Michetti

[email protected]