Introduction to DTD Bun Yue Professor, CS/CIS UHCL.

Introduction to DTD

Bun Yue, Professor, CS/CIS, UHCL

Introduction

A DTD is a grammar that is used to determine the validity of an XML document.

There is no separate recommendation for DTD.

It is embedded inside the XML recommendation: http://www.w3.org/TR/2008/REC-xml-20081126/ (5th edition).

DTD

DTD is used to specify additional constraints and rules for a given vocabulary, such as element nesting rules and attribute name and value constraints.

DTD allows XML parsers to capture errors as soon as possible. Errors are less costly to fix in earlier stages.

Validation

An XML document satisfying the rules of a DTD is said to be valid with respect to that DTD.

The command line DTD validation tool, xmlvalid, can be obtained from http://www.elcel.com/products/xmlvalid.html.

XML editors and parsers usually can be used to validate XML documents.

Example

<?xml version="1.0"?>
<person>
  <name>Adam</name>
  <spouse>Lucy</spouse>
  <spouse>Eva</spouse>
</person>

Should there be two spouses? Is it an error? Is "Eva" or "Lucy" a person? Is there any additional information about "Lucy" or "Eva"?

Creator’s Intentions

General problems with XML documents: Creators may not know what applications will use the file.

Need to communicate creator's intentions to users.

Document Modeling

XML document modeling defines a grammar to restrict and constrain an XML application.

Advantages of document modeling:

Clear intention.

Restrictions lead to easier processing.

Interoperability improves if everyone uses the same standards.

Facilitates the development of tools for the XML applications.

Document Modeling

Disadvantages of document modeling: Time for development. Checking validity may take additional time. May be too restrictive.

XML Modeling Languages

Many methods. Two main standards:

Document Type Definition (DTD): more established, but limited.

XML Schema: more sophisticated and gaining popularity.

May use both.

Example

Continuing on the previous example, a better approach is to specify the constraints using DTD, such as:

A person may only have up to one spouse.

A spouse must refer to a person in the same XML doc.

DTD Example

A possible DTD declaration for this:

…
<!ELEMENT person (name, pet*)>
<!ATTLIST person
  id     ID    #REQUIRED
  spouse IDREF #IMPLIED>
…

XML Example

An XML document satisfying the DTD:

...
<person id="p12324" spouse="p10001">
  <name>Adam</name>
  <pet>Eva</pet>
</person>
<person id="p10001">
  <name>Lucy</name>
</person>
...

This XML document is validated w.r.t. the DTD.
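The ID and IDREF constraints in this example can be approximated by hand. The following is a minimal sketch, not a DTD validator: it assumes a hypothetical wrapper element named people (the slide elides the surrounding document), and it hard-codes which attributes play the ID and IDREF roles instead of reading them from the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element <people>; the slide elides the surrounding document.
DOC = """<people>
  <person id="p12324" spouse="p10001"><name>Adam</name><pet>Eva</pet></person>
  <person id="p10001"><name>Lucy</name></person>
</people>"""


def check_ids(xml_text, id_attr="id", idref_attr="spouse"):
    """Return a list of ID/IDREF problems; an empty list means none were found."""
    root = ET.fromstring(xml_text)
    seen = set()
    problems = []
    # First pass: every ID value must be unique within the whole document.
    for element in root.iter():
        value = element.get(id_attr)
        if value is not None:
            if value in seen:
                problems.append(f"duplicate ID {value!r}")
            seen.add(value)
    # Second pass: every IDREF value must match some ID in the document.
    for element in root.iter():
        ref = element.get(idref_attr)
        if ref is not None and ref not in seen:
            problems.append(f"IDREF {ref!r} does not match any ID")
    return problems


problems = check_ids(DOC)
print(problems)
```

A validating parser performs these checks from the ATTLIST declaration itself; the sketch only mirrors the two rules for illustration.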

Document Modeling

Without a document model, an XML document only needs to be well-formed and it may have:

Unlimited and unrestricted vocabulary: any element and attributes will be allowed.

No grammar rules, for example:

  any element can be nested within any other element.

  any element may have any attribute.

  an attribute may have any value.

Associating XML to Document Type Declarations

The <!DOCTYPE> declaration is used in XML to associate the XML document to its document type declarations.

It is optional, but if present it must come after the XML declaration and before the root element.

DTD declarations can be:

Internal DTD subset

External DTD

The name of the root element should follow the keyword DOCTYPE.

Internal Subset

DTD is defined at the beginning of an XML document within the <!DOCTYPE> declaration.

Format: <!DOCTYPE root-element external-subset-declaration [internal-subset-declaration]>.

Internal Subset Example

<?xml version="1.0"?>
<!DOCTYPE persons [
  <!ELEMENT persons (#PCDATA)>
]>
<persons>Kwok-Bun Yue</persons>

Internal Subset Example

<?xml version="1.0"?>
<!DOCTYPE board SYSTEM "msg.dtd" [
  <!ENTITY monitor "Kwok-Bun Yue">
  <!ENTITY monitoremail "[email protected]">
]>
…

Consideration

Internal DTD declarations have higher precedence than external DTD.

Internal DTD advantages:

Always available as it is part of the XML document.

Higher precedence than external DTD.

Disadvantages:

Wasted transmission for non-validating parsers.

Redundancy problems: many documents may have the same internal DTD subset definitions.

Good to use internal DTD subset to override external DTD (for example, to define entities suitable for the XML document).
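As a rough illustration of an entity defined in the internal subset, the sketch below parses a small document with Python's standard xml.etree.ElementTree, which is non-validating but still reads internal subset entity declarations. The content of the board element is a hypothetical stand-in, and the external subset reference is left out so that the snippet is self-contained.

```python
import xml.etree.ElementTree as ET

# The <board> content is hypothetical; only the entity declaration follows the slide.
DOC = """<?xml version="1.0"?>
<!DOCTYPE board [
  <!ENTITY monitor "Kwok-Bun Yue">
]>
<board>Monitor: &monitor;</board>"""

root = ET.fromstring(DOC)
# The parser replaces &monitor; with the value declared in the internal subset.
text = root.text
print(text)
```

Because the declaration travels with the document, the entity is available even when no external DTD can be fetched.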

External DTD

External DTD is stored in external resources (e.g. files specified by a URL).

Example:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

External DTD Format

Instruct the XML document to get the DTD from the URL.

The keyword after the root element can be:

SYSTEM: always get the DTD from the URL.

PUBLIC: may get the DTD from some other means.

Formal Public Identifier

"-//W3C//DTD XHTML 1.0 Strict//EN" is the formal public identifier (FPI) of the XHTML DTD.

FPIs identify resources by names instead of URLs and thus do not have the URL relocation problem.

Rough meaning in this example:

'-': not registered.

'W3C': owner id, W3C in this case.

'DTD XHTML 1.0 Strict': type and description of document.

'EN': language, English, in this case.
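The four parts listed above can be pulled apart mechanically, since they are separated by "//". This is a small sketch only: the FPI class and the parse_fpi function are hypothetical names introduced for illustration, and the sketch does not cover every form an FPI may take.

```python
from typing import NamedTuple


class FPI(NamedTuple):
    """The four fields of a formal public identifier, in the order they appear."""
    registration: str
    owner: str
    description: str
    language: str


def parse_fpi(fpi: str) -> FPI:
    # Fields are separated by a double slash.
    parts = fpi.split("//")
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields separated by '//', got {len(parts)}")
    return FPI(*parts)


xhtml = parse_fpi("-//W3C//DTD XHTML 1.0 Strict//EN")
print(xhtml)
```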

FPI

Another example of FPI: "-//W3C//DTD HTML 4.0 Transitional//EN"

FPI is required for PUBLIC but not SYSTEM.

DTD Declarations

Rigorous data modeling should be used to define DTD.

Need to define the right business rules and constraints.

Errors in DTD are costly.

Usually, define the DTD to be as restrictive as possible.

DTD

A DTD is composed of a sequence of declarations.

Each DTD declaration declares one of the following constructs:

ELEMENT: XML element types.

ATTLIST: attributes of an element.

ENTITY: reusable content referenced by the &…; syntax.

NOTATION: external contents not to be parsed.

DTD Declarations

If there is conflict, earlier declarations have higher precedence.

Although internal declarations are physically located after external declarations, they are read first and thus have higher precedence.

No forward reference is allowed for parameter entities.

Element Declarations

Format of element declaration: <!ELEMENT element-name element-declaration>.

Element declarations can be one of the following four kinds:

EMPTY

ANY

#PCDATA

Content model: most important.

EMPTY and ANY

EMPTY: empty element.

E.g. <!ELEMENT file EMPTY>

ANY: may contain anything. No content checking is performed, but any embedded descendant elements will still need to be declared within the DTD.

E.g. <!ELEMENT freeForAll ANY>

#PCDATA and Content Model

#PCDATA (parsed character data): text that is parsed for entity reference replacement. E.g. <!ELEMENT firstname (#PCDATA)>

Content model: a declaration of contents enclosed by ( and ) for specifying child elements.

Content Model

The following symbols can be used by content models:

,: sequencing.

(): grouping.

?: 0 or 1.

*: 0 or more.

+: 1 or more.

|: or.

Example

<!ELEMENT abc EMPTY>
<!ELEMENT generalnote ANY>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT name (lastname, firstname)>

Example

<!ELEMENT name (first, middleinitial?, last)>

Acceptable:

<name><first>Bun</first><last>Yue</last></name>

<name><first>Bun</first><middleinitial>K</middleinitial><last>Yue</last></name>

Example

<!ELEMENT name (first, middleinitial?, last)>

Not acceptable:

<name><last>Yue</last><first>Bun</first></name>

<name>The one and only:<middleinitial>K</middleinitial><first>Bun</first><last>Yue</last></name>
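Content model symbols read much like regular expression operators, so the name model can be imitated with a regular expression over the sequence of child element names. This is only a sketch of the idea, assuming a hand-translated pattern (NAME_MODEL and children_match are hypothetical names); it looks at child element order only and ignores any character data between the children.

```python
import re
import xml.etree.ElementTree as ET

# (first, middleinitial?, last) hand-translated into a regular expression
# over child element names, each followed by a single space.
NAME_MODEL = re.compile(r"first (middleinitial )?last ")


def children_match(model, xml_text):
    """Check whether the root's child element names follow the given pattern."""
    root = ET.fromstring(xml_text)
    sequence = "".join(child.tag + " " for child in root)
    return model.fullmatch(sequence) is not None


ok = children_match(NAME_MODEL, "<name><first>Bun</first><last>Yue</last></name>")
bad = children_match(NAME_MODEL, "<name><last>Yue</last><first>Bun</first></name>")
print(ok, bad)
```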

Exercise #1

Provide a DTD that will validate the following:

<names>
  <name><first>Bun</first><last>Yue</last></name>
  <name><first>Bun</first><middleinitial>K</middleinitial><last>Yue</last></name>
</names>

Mixed Content Model

For mixed content model, the following format must be used: (#PCDATA | child-element-1 | child-element-2 ...)*

#PCDATA must come first.

* must be used.

(#PCDATA) is also mixed content.

In general, mixed content models (character data and elements) should be avoided if possible.

Mixed Content Model

Mixed content models should be avoided if possible because:

They provide minimal constraints.

They are harder to parse.

Behaviors may also be different with or without DTD: some spaces may be for cosmetic uses only.

Scattered #PCDATA is open to interpretation.

Exercise #2

Can you provide an example of well known elements that use mixed content models?

Exercise #3a

How many text nodes are there?

<a><title>Greeting</title> <b>Hello</b>How are you?<b>Goodbye</b></a>
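One way to check an answer for the case where no DTD is involved is to walk the DOM tree and collect the text nodes. The sketch below uses Python's standard xml.dom.minidom, which does not consult a DTD and therefore keeps every piece of white space as text; text_nodes is a hypothetical helper name.

```python
from xml.dom import minidom

DOC = "<a><title>Greeting</title> <b>Hello</b>How are you?<b>Goodbye</b></a>"


def text_nodes(node):
    """Collect the data of every text node below the given node, in document order."""
    found = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            found.append(child.data)
        else:
            found.extend(text_nodes(child))
    return found


nodes = text_nodes(minidom.parseString(DOC))
for data in nodes:
    # repr() makes white-space-only text nodes visible.
    print(repr(data))
```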

Exercise #3b

How many text nodes are there using <!ELEMENT a (#PCDATA | title | b)* >?

<a><title>Greeting</title> <b>Hello</b>How are you?<b>Goodbye</b></a>

Exercise #3c

How many text nodes are there using <!ELEMENT a (title | b | c)*>?

<a><title>Greeting</title> <b>Hello</b>How are you?<b>Goodbye</b></a>

Example

<!ELEMENT email (from, to+, cc*, subject, body)>

Acceptable:

<email><from>Yue</from><to>Lee</to><to>Smith</to><cc>King</cc><subject>hello</subject><body>good bye</body></email>

Example

<!ELEMENT email (from, to+, cc*, subject, body)>

Not acceptable:

<email><from>Yue</from><cc>King</cc><to>Lee</to><cc>Queen</cc><to>Smith</to><subject>hello</subject><body>good bye</body></email>

Exercise #4

Modify the DTD so cc and to can come in any order (there should still be at least one to).

Exercise #5

Comment on this DTD:

<!ELEMENT bookcollection (book+)>
<!ELEMENT book (author, publisher, isbn, chapter*)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT isbn (#PCDATA)>
<!ELEMENT chapter (#PCDATA)>

Exercise #6

How do you declare in DTD that element <a> may have a <b> and a <c> child in any order?

ATTLIST Declarations

To declare attribute properties of an element.

Format: <!ATTLIST element-name attribute-declarations>

Attribute declarations declare one or more attributes.

Each attribute declaration includes the attribute name, its type and a setting.

Example

<!ATTLIST person comment CDATA #IMPLIED>
<!ATTLIST person
  ssn    ID            #REQUIRED
  gender (male|female) #IMPLIED
  age    CDATA         #IMPLIED
  iq     CDATA         "100">

Example

Attribute data types

CDATA: character data. A string, least restrictive.

ID: unique identifier.

  A unique name string within the document.

  Must start with a letter, a "_" or a ":".

  Like a primary key of the element, but not exactly so.

  No two elements within the XML document should have the same ID value.

  The scope of ID is for the document, not for the element.

Attribute data types

IDREF: identifier reference. Refers to an ID value of some other element.

IDREFS: identifier reference list. Refers to many ID values separated by white space.

ENTITY: entity name. Name of an unparsed external entity declared in the DTD.

ENTITIES: entity name list. Many entity names separated by white space.

Attribute data types

NMTOKEN: name token. A token formed by name characters only (letters, digits, ".", "-", "_", and ":"). Unlike a name, it may start with any of these characters, including a digit.

NMTOKENS: name token list. Many NMTOKEN values separated by white space.

NOTATION: notation list. For referencing data other than XML. A list of notation names. Each notation contains instructions for processing non-XML data.

Attribute data types

Enumeration: provide explicit choices separated by | within a pair of parentheses. Note that each value of an enumeration must be an NMTOKEN.

Example

Here is the XML 1.0 specification for name (for ID and IDREFS) and NMTOKEN:

Names and Tokens:

[4] NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
[5] Name ::= (Letter | '_' | ':') (NameChar)*
[6] Names ::= Name (S Name)*
[7] Nmtoken ::= (NameChar)+
[8] Nmtokens ::= Nmtoken (S Nmtoken)*
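The difference between Name and Nmtoken is easiest to see by trying a few strings. The sketch below is an ASCII-only approximation of productions [4], [5] and [7]: Letter is narrowed to A-Z and a-z, Digit to 0-9, and CombiningChar and Extender are left out, so it will reject some strings the real grammar accepts. The function names are hypothetical.

```python
import re

# ASCII-only approximation: Letter -> [A-Za-z], Digit -> [0-9];
# CombiningChar and Extender from production [4] are omitted.
NAME_CHAR = r"[A-Za-z0-9._:-]"
NAME = re.compile(rf"[A-Za-z_:]{NAME_CHAR}*")   # production [5]
NMTOKEN = re.compile(rf"{NAME_CHAR}+")          # production [7]


def is_name(value):
    """True if value looks like a Name (usable as an ID or IDREF value)."""
    return NAME.fullmatch(value) is not None


def is_nmtoken(value):
    """True if value looks like an Nmtoken."""
    return NMTOKEN.fullmatch(value) is not None


for candidate in ["s123456789", "123456789", "-draft", "two words"]:
    print(candidate, is_name(candidate), is_nmtoken(candidate))
```

A value that starts with a digit passes as an Nmtoken but not as a Name, which is why numeric identifiers used as ID values often carry a letter prefix.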

Example

<!ATTLIST person gender (male | female) #REQUIRED>

<!ATTLIST node
  name  NMTOKEN #IMPLIED
  id    ID      #REQUIRED
  links IDREFS  #IMPLIED>

Attribute values and settings

#REQUIRED: mandatory; commonly used.

#IMPLIED: optional; commonly used.

Default value: if the attribute is missing, assume default value (use with care).

#FIXED and default value: only one possible value of the attribute is acceptable and that is the default value.

Example

<!ATTLIST node
  name   NMTOKEN                     #IMPLIED
  id     ID                          #REQUIRED
  links  IDREFS                      #IMPLIED
  type   (element|attribute|comment) "comment"
  author (yue|davari|liaw)           #FIXED "yue"
>
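The effect of default and #FIXED values can be observed with a non-validating parser, provided the ATTLIST sits in the internal subset where the parser can read it. The sketch below uses Python's standard xml.etree.ElementTree; the EMPTY declaration for node and the id value n1 are assumptions added to make a complete document.

```python
import xml.etree.ElementTree as ET

# The <!ELEMENT node EMPTY> declaration and the id value "n1" are assumptions
# added so that the document is complete.
DOC = """<!DOCTYPE node [
  <!ELEMENT node EMPTY>
  <!ATTLIST node
    id     ID                          #REQUIRED
    type   (element|attribute|comment) "comment"
    author (yue|davari|liaw)           #FIXED "yue">
]>
<node id="n1"/>"""

# Only id is written in the start tag; type and author come from the DTD defaults.
attrs = dict(ET.fromstring(DOC).attrib)
print(attrs)
```

This is the reason behind the "use with care" warning: the same document yields different attribute sets depending on whether the parser has seen the declarations.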

Example: AddressBook.dtd

<!ELEMENT addressBook (person+)>
<!ELEMENT person (name,email*,link?)>
<!ATTLIST person id ID #REQUIRED>
<!ATTLIST person
  gender      NMTOKEN #IMPLIED
  luckynumber CDATA   #IMPLIED>
<!ELEMENT name (last,first)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT last (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT link EMPTY>
<!ATTLIST link
  spouse   IDREF  #IMPLIED
  children IDREFS #IMPLIED>

A Conforming XML

<?xml version="1.0"?>
<!DOCTYPE addressBook SYSTEM "AddressBook.dtd">
<addressBook>
  <person id="s123456789" gender="male" luckynumber="7 12 3">
    <name><last>Hope</last><first>Bob</first></name>
    <email>[email protected]</email>
  </person>
  <person id="s222222222" gender="female">
    <name><last>King</last><first>Deborah</first></name>
    <link spouse="s222222223" children="s222222226 s222222227" />
  </person>
  <person id="s222222223" gender="male">
    <name><last>King</last><first>Jim</first></name>
  </person>
  <person id="s222222226" gender="male">
    <name><last>King</last><first>John</first></name>
  </person>
  <person id="s222222227" gender="female">
    <name><last>King</last><first>Jane</first></name>
  </person>
</addressBook>

Exercise #7

Can you improve the DTD?

Exercise #8

Construct a simple and restrictive DTD for a labeled directed graph.

Entity Declarations

Entities are like macros in C. When XML processors parse an entity (usually of the format &entity-name;), the entity is replaced by its value.

Entity declaration uses the syntax <!ENTITY entity-declaration>.

Kinds of Entity Declarations

General entity: <!ENTITY entity-name "entity-value">.

External entity: <!ENTITY entity-name SYSTEM "entity-uri">.

  The entity will be replaced by text from the external source (therefore it can be long and shared).

Nonparsed external entity: <!ENTITY entity-name SYSTEM "entity-uri" NDATA entity-type>.

  NDATA is a keyword.

  entity-type is the notation name giving the type of the non-XML data source, which will not be parsed by the XML processor.

Kinds of Entity Declarations

Parameter entity: <!ENTITY % entity-name "entity-value">.

  To be used inside the DTD only.

  Referred to as %entity-name;

External parameter entity: <!ENTITY % entity-name SYSTEM "entity-uri">.

  To be used inside the DTD only.

  Referred to as %entity-name;

Example

In http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd:

<!ENTITY % HTMLlat1 PUBLIC
   "-//W3C//ENTITIES Latin 1 for XHTML//EN"
   "xhtml-lat1.ent">
%HTMLlat1;

HTMLlat1 is an external parameter entity.

Example

In http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd:

<!ENTITY % ContentType "CDATA">
    <!-- media type, as per [RFC2045] -->

<!ENTITY % ContentTypes "CDATA">
    <!-- comma-separated list of media types, as per [RFC2045] -->

<!ENTITY % Charset "CDATA">
    <!-- a character encoding, as per [RFC2045] -->
...

Notation Declarations

XML is not designed for storing binary data.

If needed, binary data can be referenced by using notation declarations and will not be parsed.

To handle binary data, the XML processor needs to know its data type as well as some instructions on how to handle it.

Notation Declarations

Notation declaration syntax: <!NOTATION name identifier>.

name is the notation type name.

identifier is some instruction meaningful to the target XML processor.

Notation is usually used together with non-parsed external entity.

Notation types defined by notation declarations are used as the entity types of non-parsed external entities.

Example

<!NOTATION jpeg SYSTEM "image/jpeg">

<!ENTITY uhcl SYSTEM "images/uhcl.jpeg" NDATA jpeg>
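A parser does not read the binary data itself; it only reports the declarations so that the application can decide what to do. The sketch below registers two handlers on Python's standard xml.parsers.expat parser to show what is reported. It assumes a hypothetical root element named campus, and it writes the entity declaration with the SYSTEM keyword that XML 1.0 requires for external entities.

```python
import xml.parsers.expat

# The root element <campus> is hypothetical; the two declarations follow the slide,
# with the SYSTEM keyword that external entity declarations require.
DOC = """<!DOCTYPE campus [
  <!NOTATION jpeg SYSTEM "image/jpeg">
  <!ENTITY uhcl SYSTEM "images/uhcl.jpeg" NDATA jpeg>
]>
<campus/>"""

notations = {}
unparsed = {}


def on_notation(name, base, system_id, public_id):
    # Called once for each <!NOTATION ...> declaration.
    notations[name] = system_id


def on_unparsed_entity(name, base, system_id, public_id, notation_name):
    # Called once for each entity declared with NDATA; the data is never fetched.
    unparsed[name] = (system_id, notation_name)


parser = xml.parsers.expat.ParserCreate()
parser.NotationDeclHandler = on_notation
parser.UnparsedEntityDeclHandler = on_unparsed_entity
parser.Parse(DOC, True)

print(notations)
print(unparsed)
```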

Document Modeling with DTD

Mapping of data requirements of the problem to XML model.

May usually take two steps:

Use a modeling language for analysis and design: e.g. UML.

Map the model to DTD, XML Schema, etc.

DTD Modeling Tips

Use formal modeling techniques (such as UML).

Use modeling tools, such as Rational's Rose.

Track versions carefully.

Decide on organization before modeling. For example, declarations may be grouped by:

  functions

  hierarchical elements

Some General Tips

Consider major design options, such as:

  Elements versus attributes.

  Flat element structures versus nested element structures.

  Descendants, siblings or ancestors.

Model should generally be as restrictive as possible.

Include sufficient documentation.

Some General Tips

Use whitespace and inline comments generously.

Use parameter entities generously.

Import modules by using external parameter entities.

Use meaningful names.

Exercise #9

Start with the following UML diagram for a graph. Refine it and design a suitable DTD.

Exercise #10

Consider the XML file, satvexample.xml, and its DTD, tvschedule.dtd. Both files are obtained from the site http://mysite.verizon.net/vze20h45/comp/xml/videoxml.html with very minor changes in satvexample.xml (to remove XSL references and modify DOCTYPE to refer to a local DTD).

Validation results of the XML file:

http://xmlvalidation.com/: no error.

xmlvalid (from http://www.elcel.com/products/xmlvalid.html; you will need to register, download, install and run it in command line mode):

  Error: non-deterministic content model for element 'DAY': more than one path leads to element ' DATE'.

  Error: element content invalid. Element 'PROGRAMSLOT' is not expected here, expecting 'HOLIDAY'.

Which validator is correct? How do you correct the problem?

Weaknesses of DTD

Not XML compliant: cannot be parsed by XML parsers, and it is difficult to extract information from XML applications.

Closed construct: all defined within one DTD. Not easily extensible. Difficult to break down to smaller pieces.

Type definitions not rich. E.g. insufficient types and precisions: no int, float, etc. No user defined types.

Does not work well with XML namespaces.

Difficult to enforce sophisticated constraints.

Other Schema Languages

There are many efforts to overcome the limitations of DTD.

XML Schema is one of the most important, since it is a W3C standard.

Other important languages include RelaxNG, Schematron, etc.

However, the ‘schema war’ is considered settled by many.

Questions