Introduction to DTD Bun Yue Professor, CS/CIS UHCL.
Introduction
A DTD is a grammar that is used to determine the validity of an XML document.
There is no separate recommendation for DTD.
It is embedded inside the XML recommendation: http://www.w3.org/TR/2008/REC-xml-20081126/ (5th edition).
DTD
DTD is used to specify additional constraints and rules for a given vocabulary, such as element nesting rules and attribute name and value constraints.
DTD allows XML parsers to capture errors as soon as possible. Errors are less costly to fix in earlier stages.
Validation
An XML document satisfying the rules of a DTD is said to be valid.
The command line DTD validation tool, xmlvalid, can be obtained from http://www.elcel.com/products/xmlvalid.html.
XML editors and parsers usually can be used to validate XML documents.
Example
<?xml version="1.0"?>
<person>
  <name>Adam</name>
  <spouse>Lucy</spouse>
  <spouse>Eva</spouse>
</person>
Should there be two spouses? Is it an error? Is "Eva" or "Lucy" a person? Is there any additional information about "Lucy" or "Eva"?
Creator’s Intentions
General problems with XML documents: Creators may not know what applications will use the file.
Need to communicate creator's intentions to users.
Document Modeling
XML document modeling defines a grammar to restrict and constrain an XML application.
Advantages of document modeling:
  Clear intention.
  Restrictions lead to easier processing.
  Interoperability improves if everyone uses the same standards.
  Facilitate the development of tools for the XML applications.
Document Modeling
Disadvantages of document modeling:
  Time for development.
  Potentially more time-consuming to check validity.
  May be too restrictive.
XML Modeling Languages
Many methods. Two main standards:
Document Type Definition (DTD): more established, but limited.
XML Schema: more sophisticated and gaining popularity.
May use both.
Example
Continuing on the previous example, a better approach is to specify the constraints using DTD, such as:
A person may only have up to one spouse.
A spouse must refer to a person in the same XML doc.
DTD Example
A possible DTD declaration for this:
…
<!ELEMENT person (name, pet*)>
<!ATTLIST person
  id     ID    #REQUIRED
  spouse IDREF #IMPLIED>
…
XML Example
An XML document satisfying the DTD:
...
<person id="p12324" spouse="p10001">
  <name>Adam</name>
  <pet>Eva</pet>
</person>
<person id="p10001">
  <name>Lucy</name>
</person>
...
This XML document is valid with respect to the DTD.
Document Modeling
Without a document model, an XML document only needs to be well-formed and it may have:
  unlimited and unrestricted vocabulary: any element and attribute will be allowed.
  no grammar rules, for example:
    any element can be nested within any other element.
    any element may have any attribute.
    an attribute may have any value.
Associating XML to Document Type Declarations
The <!DOCTYPE> declaration is used in XML to associate the XML document with its document type declarations.
It is optional; if present, it must come after the XML declaration and before the root element.
DTD declarations can be:
  Internal DTD subset
  External DTD
The name of the root element must follow the keyword DOCTYPE.
Internal Subset
DTD is defined at the beginning of an XML document within the <!DOCTYPE> tag.
Format: <!DOCTYPE root-element external-subset-declaration [internal-subset-declaration]>.
Internal Subset Example
<?xml version="1.0"?>
<!DOCTYPE persons [
  <!ELEMENT persons (#PCDATA)>
]>
<persons>Kwok-Bun Yue</persons>
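The DOCTYPE information in the example above can be inspected programmatically. The following is a minimal sketch using Python's standard xml.parsers.expat module, which is non-validating: it reads the declaration but does not check the document against it. The function name read_doctype is a hypothetical helper introduced only for this illustration.

```python
import xml.parsers.expat

DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE persons [
  <!ELEMENT persons (#PCDATA)>
]>
<persons>Kwok-Bun Yue</persons>"""


def read_doctype(xml_text):
    """Return (root name, system id, public id, has internal subset) from the DOCTYPE."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # expat calls this once, when it reaches the <!DOCTYPE ...> declaration.
        found["doctype"] = (name, system_id, public_id, bool(has_internal_subset))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)
    return found.get("doctype")


doctype = read_doctype(DOCUMENT)
print(doctype)
```

The first item of the tuple is the root element name that follows the DOCTYPE keyword; the last item tells whether an internal subset in square brackets was present.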
Internal Subset Example
<?xml version="1.0"?>
<!DOCTYPE board SYSTEM "msg.dtd" [
  <!ENTITY monitor "Kwok-Bun Yue">
  <!ENTITY monitoremail "[email protected]">
]>
…
Consideration
Internal DTD declarations have higher precedence than external DTD.
Internal DTD advantages:
  Always available as it is part of the XML document.
  Higher precedence than external DTD.
Disadvantages:
  Wasted transmission for non-validating parsers.
  Redundancy problems: many documents may have the same internal DTD subset definitions.
Good to use internal DTD subset to override external DTD (for example, to define entities suitable for the XML document).
External DTD
External DTD is stored in external resources (e.g. files specified by a URL).
Example:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
External DTD Format
Instruct the XML document to get the DTD from the URL.
The keyword after the root element can be:
  SYSTEM: always get the DTD from the URL.
  PUBLIC: may get the DTD by some other means.
Formal Public Identifier
"-//W3C//DTD XHTML 1.0 Strict//EN" is the formal public identifier (FPI) of the XHTML DTD.
FPIs identify resources by names instead of URLs and thus do not have the URL relocation problem.
Rough meaning in this example:
  '-': not registered.
  'W3C': owner id, W3C in this case.
  'DTD XHTML 1.0 Strict': type and description of document.
  'EN': language, English in this case.
FPI
Another example of FPI: "-//W3C//DTD HTML 4.0 Transitional//EN"
FPI is required for PUBLIC but not SYSTEM.
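The field breakdown described above can be sketched in a few lines of Python. This is an illustration only: it assumes the four-field layout separated by "//" shown in the two examples, and the names FormalPublicIdentifier and parse_fpi are hypothetical, not part of any library.

```python
from typing import NamedTuple


class FormalPublicIdentifier(NamedTuple):
    registration: str  # '-' means not registered
    owner: str         # owner id, e.g. 'W3C'
    text_class: str    # type of the resource, e.g. 'DTD'
    description: str   # free-text description, e.g. 'XHTML 1.0 Strict'
    language: str      # language code, e.g. 'EN'


def parse_fpi(fpi: str) -> FormalPublicIdentifier:
    """Split an FPI of the form registration//owner//class description//language."""
    parts = fpi.split("//")
    if len(parts) != 4:
        raise ValueError("expected four fields separated by '//': " + fpi)
    registration, owner, text, language = parts
    # The third field holds the type first, then the description.
    text_class, _, description = text.partition(" ")
    return FormalPublicIdentifier(registration, owner, text_class, description, language)


xhtml = parse_fpi("-//W3C//DTD XHTML 1.0 Strict//EN")
print(xhtml)
```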
DTD Declarations
Rigorous data modeling should be used to define DTD.
Need to define the right business rules and constraints.
Errors in DTD are costly.
Usually, define the DTD to be as restrictive as possible.
DTD
A DTD is composed of a sequence of declarations.
Each DTD declaration declares one of the following constructs:
  ELEMENT: XML element types.
  ATTLIST: attributes of an element.
  ENTITY: reusable content referenced by the &…; syntax.
  NOTATION: external contents not to be parsed.
DTD Declarations
If there is conflict, earlier declarations have higher precedence.
Although internal declarations are physically located after external declarations, they are read first and thus have higher precedence.
No forward reference is allowed for parameter entities.
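The "earlier declaration wins" rule can be observed with entity declarations. The sketch below uses Python's standard xml.etree.ElementTree (a non-validating parser) on a small made-up document; the element name note and the entity name author are assumptions chosen for the illustration.

```python
import xml.etree.ElementTree as ET

# The same entity is declared twice in the internal subset.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ENTITY author "first declaration">
  <!ENTITY author "second declaration">
]>
<note>&author;</note>"""

note = ET.fromstring(DOCUMENT)
# The parser binds the entity to the first declaration it encounters.
print(note.text)
```

This is also why an internal subset can override an external DTD: the internal declarations are read first, so a later external declaration of the same entity is ignored.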
Element Declarations
Format of element declaration: <!ELEMENT element-name element-declaration>.
Element declarations can be one of the following four kinds:
  EMPTY
  ANY
  #PCDATA
  Content model: most important.
EMPTY and ANY
EMPTY: empty element. E.g. <!ELEMENT file EMPTY>
ANY: may contain anything. No content checking. Any embedded descendant elements will still need to be declared within the DTD.
E.g. <!ELEMENT freeForAll ANY>
#PCDATA and Content Model
#PCDATA (parsed character data): text that is parsed for entity reference replacement.
E.g. <!ELEMENT firstname (#PCDATA)>
Content model: a declaration of contents, enclosed by ( and ), for specifying child elements.
Content Model
The following symbols can be used by content models:
  , : sequencing.
  (): grouping.
  ? : 0 or 1.
  * : 0 or more.
  + : 1 or more.
  | : or.
Example
<!ELEMENT abc EMPTY>
<!ELEMENT generalnote ANY>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT name (lastname, firstname)>
Example
<!ELEMENT name (first, middleinitial?, last)>
Acceptable:
<name><first>Bun</first><last>Yue</last></name>
<name><first>Bun</first><middleinitial>K</middleinitial><last>Yue</last></name>
Example
<!ELEMENT name (first, middleinitial?, last)>
Not acceptable:
<name><last>Yue</last><first>Bun</first></name>
<name>The one and only:<middleinitial>K</middleinitial><first>Bun</first><last>Yue</last></name>
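Because the content model symbols (sequence, grouping, ?, *, +, |) mirror regular expression operators, the acceptable and unacceptable cases above can be checked with a small sketch. This is not how a validating parser is implemented; it is an illustration that handles element-only content models (no #PCDATA, EMPTY or ANY), and the helper names model_to_regex and matches_model are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def model_to_regex(model):
    """Turn an element-only content model into a regex over 'name;' tokens."""
    # Wrap every child name in a group that matches the name plus a ';' terminator.
    pattern = re.sub(r"[A-Za-z_][A-Za-z0-9_.\-]*",
                     lambda m: "(?:" + re.escape(m.group(0)) + ";)", model)
    # ',' means sequence, which is plain concatenation in a regex; drop it and spaces.
    pattern = re.sub(r"[,\s]+", "", pattern)
    return re.compile(pattern)


def matches_model(element, model):
    """Check the children of element against an element-only content model."""
    # Element content may hold whitespace between children, but no other text.
    chunks = [element.text] + [child.tail for child in element]
    if any(chunk and chunk.strip() for chunk in chunks):
        return False
    children = "".join(child.tag + ";" for child in element)
    return model_to_regex(model).fullmatch(children) is not None


MODEL = "(first, middleinitial?, last)"
ok = ET.fromstring("<name><first>Bun</first><last>Yue</last></name>")
swapped = ET.fromstring("<name><last>Yue</last><first>Bun</first></name>")
print(matches_model(ok, MODEL), matches_model(swapped, MODEL))
```

The ';' terminator keeps names such as to and topic from being confused when the child names are joined into one string.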
Exercise #1
Provide a DTD that will validate the following:
<names>
  <name><first>Bun</first><last>Yue</last></name>
  <name><first>Bun</first><middleinitial>K</middleinitial><last>Yue</last></name>
</names>
Mixed Content Model
For mixed content model, the following format must be used: (#PCDATA | child-element-1 | child-element-2 ...)*
  #PCDATA must come first.
  * must be used.
(#PCDATA) is also mixed content.
In general, mixed content models (character data and elements) should be avoided if possible.
Mixed Content Model
Mixed content models should be avoided if possible because:
  They provide minimum constraints.
  They are harder to parse.
  Behaviors may also be different with or without DTD: some spaces may be for cosmetic uses only.
Scattered #PCDATA is up to interpretation.
Exercise #3a
How many text nodes are there?
<a><title>Greeting</title> <b>Hello</b>How are you?<b>Goodbye</b></a>
Exercise #3b
How many text nodes are there using <!ELEMENT a (#PCDATA | title | b)* >?
<a><title>Greeting</title> <b>Hello</b>How are you?<b>Goodbye</b></a>
Exercise #3c
How many text nodes are there using <!ELEMENT a (title | b | c)*>?
<a><title>Greeting</title> <b>Hello</b>How are you?<b>Goodbye</b></a>
Example
<!ELEMENT email (from, to+, cc*, subject, body)>
Acceptable:
<email><from>Yue</from><to>Lee</to><to>Smith</to><cc>King</cc><subject>hello</subject><body>good bye</body></email>
Example
<!ELEMENT email (from, to+, cc*, subject, body)>
Not acceptable:
<email><from>Yue</from><cc>King</cc><to>Lee</to><cc>Queen</cc><to>Smith</to><subject>hello</subject><body>good bye</body></email>
Exercise #4
Modify the DTD so that cc and to can come in any order (there should still be at least one to).
Exercise #5
Comment on this DTD:
<!ELEMENT bookcollection (book+)>
<!ELEMENT book (author, publisher, isbn, chapter*)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT isbn (#PCDATA)>
<!ELEMENT chapter (#PCDATA)>
ATTLIST Declarations
To declare attribute properties of an element.
Format: <!ATTLIST element-name attribute-declarations>
Attribute declarations declare one or more attributes.
Each attribute declaration includes the attribute name, its type and a setting.
Example
<!ATTLIST person comment CDATA #IMPLIED>
<!ATTLIST person
  ssn    ID            #REQUIRED
  gender (male|female) #IMPLIED
  age    CDATA         #IMPLIED
  iq     CDATA         "100">
Attribute data types
CDATA: character data. A string, least restrictive.
ID: unique identifier.
  A unique name string within the document.
  Must start with a letter, a "_" or a ":".
  Like a primary key of the element, but not exactly so.
  No two elements within the XML document should have the same ID value.
  The scope of ID is the document, not the element.
Attribute data types
IDREF: identifier reference. Refers to an ID value of some other element.
IDREFS: identifier reference list. Refers to many ID values separated by white space.
ENTITY: entity name. Name of an unparsed external entity declared in the DTD.
ENTITIES: entity name list. Many entity names separated by white space.
Attribute data types
NMTOKEN: name token. A token formed by name characters only: letters, digits, ".", "-", "_" and ":". Unlike a name, the first character may be any name character, including a digit, "." or "-".
NMTOKENS: name token list. Many name tokens separated by white space.
NOTATION: notation list. For referencing data other than XML.
  A list of notation names.
  Each notation contains instructions for processing non-XML data.
Attribute data types
Enumeration: provides explicit choices separated by | within a pair of parentheses. Note that each value of an enumeration must be an NMTOKEN.
Example
Here is the XML 1.0 specification for name (for ID and IDREFS) and NMTOKEN:
Names and Tokens:
[4] NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
[5] Name ::= (Letter | '_' | ':') (NameChar)*
[6] Names ::= Name (S Name)*
[7] Nmtoken ::= (NameChar)+
[8] Nmtokens ::= Nmtoken (S Nmtoken)*
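The difference between Name and Nmtoken in the productions above can be sketched with two regular expressions. This is an ASCII-only approximation: Letter and Digit are narrowed to A-Z, a-z and 0-9, and CombiningChar and Extender are left out, so it is an assumption-laden illustration and not a conforming implementation. The names is_name and is_nmtoken are hypothetical.

```python
import re

# ASCII-only approximations of productions [5] Name and [7] Nmtoken.
NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9._:\-]*")
NMTOKEN = re.compile(r"[A-Za-z0-9._:\-]+")


def is_name(value):
    """True if value could be used as an ID or IDREF value (ASCII subset)."""
    return NAME.fullmatch(value) is not None


def is_nmtoken(value):
    """True if value could be used as an NMTOKEN value (ASCII subset)."""
    return NMTOKEN.fullmatch(value) is not None


for candidate in ["p12324", "123", "-draft", "7 12 3"]:
    print(candidate, is_name(candidate), is_nmtoken(candidate))
```

A value such as 123 is an acceptable NMTOKEN but not an acceptable ID, which is why purely numeric keys usually get a letter prefix when they are used as ID values.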
Example
<!ATTLIST person gender (male | female) #REQUIRED>
<!ATTLIST node
  name  NMTOKEN #IMPLIED
  id    ID      #REQUIRED
  links IDREFS  #IMPLIED>
Attribute values and settings
#REQUIRED: mandatory; commonly used.
#IMPLIED: optional; commonly used.
Default value: if the attribute is missing, assume the default value (use with care).
#FIXED and default value: only one possible value of the attribute is acceptable and that is the default value.
Example
<!ATTLIST node
  name   NMTOKEN                     #IMPLIED
  id     ID                          #REQUIRED
  links  IDREFS                      #IMPLIED
  type   (element|attribute|comment) "comment"
  author (yue|davari|liaw)           #FIXED "yue"
>
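The effect of default and #FIXED values can be seen by parsing a document that omits those attributes. The sketch below uses Python's standard xml.parsers.expat, which does not validate but, in its default configuration, supplies attribute defaults declared in the internal subset. The declaration of node as EMPTY, the id value n1 and the helper name read_root_attributes are assumptions made for the illustration.

```python
import xml.parsers.expat

DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE node [
  <!ELEMENT node EMPTY>
  <!ATTLIST node
    name   NMTOKEN                     #IMPLIED
    id     ID                          #REQUIRED
    links  IDREFS                      #IMPLIED
    type   (element|attribute|comment) "comment"
    author (yue|davari|liaw)           #FIXED "yue">
]>
<node id="n1"/>"""


def read_root_attributes(xml_text):
    """Return the attributes the parser reports for the first element."""
    seen = []

    def on_start(name, attributes):
        # attributes holds both the written and the defaulted attributes.
        seen.append(dict(attributes))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)
    return seen[0]


attributes = read_root_attributes(DOCUMENT)
print(attributes)
```

Only id is written in the document; type and author are filled in from the DTD, while the #IMPLIED attributes name and links stay absent. This is the reason for the "use with care" warning: an application sees different attributes depending on whether the DTD was read.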
Example: AddressBook.dtd
<!ELEMENT addressBook (person+)>
<!ELEMENT person (name,email*,link?)>
<!ATTLIST person id ID #REQUIRED>
<!ATTLIST person
  gender      NMTOKEN #IMPLIED
  luckynumber CDATA   #IMPLIED>
<!ELEMENT name (last,first)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT last (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT link EMPTY>
<!ATTLIST link
  spouse   IDREF  #IMPLIED
  children IDREFS #IMPLIED>
A Conforming XML
<?xml version="1.0"?>
<!DOCTYPE addressBook SYSTEM "AddressBook.dtd">
<addressBook>
  <person id="s123456789" gender="male" luckynumber="7 12 3">
    <name><last>Hope</last><first>Bob</first></name>
    <email>[email protected]</email>
  </person>
  <person id="s222222222" gender="female">
    <name><last>King</last><first>Deborah</first></name>
    <link spouse="s222222223" children="s222222226 s222222227" />
  </person>
  <person id="s222222223" gender="male">
    <name><last>King</last><first>Jim</first></name>
  </person>
  <person id="s222222226" gender="male">
    <name><last>King</last><first>John</first></name>
  </person>
  <person id="s222222227" gender="female">
    <name><last>King</last><first>Jane</first></name>
  </person>
</addressBook>
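The ID, IDREF and IDREFS constraints that a validating parser enforces for this document can be sketched by hand. The code below is an illustration only: Python's standard xml.etree.ElementTree does not validate, so the check is written explicitly, on a shortened copy of the document without the DOCTYPE line. The helper name check_ids is hypothetical.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<addressBook>
  <person id="s222222222"><name><last>King</last><first>Deborah</first></name>
    <link spouse="s222222223" children="s222222226 s222222227"/></person>
  <person id="s222222223"><name><last>King</last><first>Jim</first></name></person>
  <person id="s222222226"><name><last>King</last><first>John</first></name></person>
  <person id="s222222227"><name><last>King</last><first>Jane</first></name></person>
</addressBook>"""


def check_ids(root):
    """Return (duplicate ids, references that match no id)."""
    seen, duplicates = set(), []
    for person in root.iter("person"):
        person_id = person.get("id")
        if person_id in seen:
            duplicates.append(person_id)
        seen.add(person_id)
    dangling = []
    for link in root.iter("link"):
        # spouse is an IDREF (one value); children is IDREFS (a whitespace-separated list).
        references = link.get("spouse", "").split() + link.get("children", "").split()
        dangling.extend(ref for ref in references if ref not in seen)
    return duplicates, dangling


result = check_ids(ET.fromstring(DOCUMENT))
print(result)
```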
Entity Declarations
Entities are like macros in C. When XML processors parse an entity (usually of the format &entity-name;), the entity is replaced by its value.
Entity declaration uses the syntax <!ENTITY entity-declaration>.
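The macro-like replacement can be observed with any XML parser that reads the internal subset. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the element name board, the attribute owner and the surrounding text are made up for the illustration, while the entity monitor is borrowed from the earlier internal subset example.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE board [
  <!ENTITY monitor "Kwok-Bun Yue">
]>
<board owner="&monitor;">Moderator: &monitor; &amp; friends</board>"""

board = ET.fromstring(DOCUMENT)
# By the time the application sees the data, the references are already replaced.
print(board.get("owner"))
print(board.text)
```

The reference is expanded both in the attribute value and in the element content, and the predefined entity &amp; is handled the same way without any declaration.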
Kinds of Entity Declarations
General entity: <!ENTITY entity-name entity-value>.
External entity: <!ENTITY entity-name SYSTEM entity-uri>.
  The entity will be replaced by text from the external source (therefore it can be long and shared).
Nonparsed external entity: <!ENTITY entity-name SYSTEM entity-uri NDATA entity-type>.
  NDATA is a keyword.
  entity-type is the type of the non-XML data source, which will not be parsed by the XML processor.
Kind of Entity Declarations
Parameter entity: <!ENTITY % entity-name entity-value>.
  To be used inside the DTD only.
  Referred to as %entity-name;
External parameter entity: <!ENTITY % entity-name SYSTEM entity-uri>.
  To be used inside the DTD only.
  Referred to as %entity-name;
Example
In http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd:
<!ENTITY % HTMLlat1 PUBLIC
  "-//W3C//ENTITIES Latin 1 for XHTML//EN"
  "xhtml-lat1.ent">
%HTMLlat1;
HTMLlat1 is an external parameter entity.
Example
In http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd:
<!ENTITY % ContentType "CDATA">
  <!-- media type, as per [RFC2045] -->
<!ENTITY % ContentTypes "CDATA">
  <!-- comma-separated list of media types, as per [RFC2045] -->
<!ENTITY % Charset "CDATA">
  <!-- a character encoding, as per [RFC2045] -->
...
Notation Declarations
XML is not designed for storing binary data.
If needed, binary data can be referenced by using a notation declaration and will not be parsed.
To handle binary data, the XML processor needs to know its data type as well as some instructions on how to handle it.
Notation Declarations
Notation declaration syntax: <!NOTATION name identifier>.
  name is the notation type name.
  identifier is some instruction meaningful to the target XML processor.
Notation is usually used together with non-parsed external entities.
Notation types defined by notation declarations are used as the entity types of non-parsed external entities.
Document Modeling with DTD
Mapping of data requirements of the problem to XML model.
May usually take two steps:
  Use a modeling language for analysis and design: e.g. UML.
  Map the model to DTD, XML Schema, etc.
DTD Modeling Tips
Use formal modeling techniques (such as UML).
Use modeling tools, such as Rational's Rose.
Track versions carefully.
Decide on organization before modeling. For example, declarations may be grouped by:
  functions
  hierarchical elements
Some General Tips
Consider major design options, such as:
  Elements versus attributes.
  Flat element structures versus nested element structures.
  Descendants, siblings or ancestors.
Model should generally be as restrictive as possible.
Include sufficient documentation.
Some General Tips
Use whitespace and inline comments generously.
Use parameter entities generously.
Import modules by using external parameter entities.
Use meaningful names.
Exercise #10
Consider the XML file, satvexample.xml, and its DTD, tvschedule.dtd. Both files are obtained from the site http://mysite.verizon.net/vze20h45/comp/xml/videoxml.html with very minor changes in satvexample.xml (to remove XSL references and modify DOCTYPE to refer to a local DTD).
Validation results of the XML file:
  http://xmlvalidation.com/: no error.
  xmlvalid (from http://www.elcel.com/products/xmlvalid.html; you will need to register, download, install and run it in command line mode):
    Error: non-deterministic content model for element 'DAY': more than one path leads to element ' DATE'.
    Error: element content invalid. Element 'PROGRAMSLOT' is not expected here, expecting 'HOLIDAY'.
Which validator is correct? How do you correct the problem?
Weaknesses of DTD
Not XML compliant.
  DTD syntax is not XML, so a DTD cannot be processed as an XML document.
  Difficult to extract information from XML applications.
Closed construct: all defined within one DTD.
  Not easily extensible.
  Difficult to break down into smaller pieces.
Type definitions not rich. E.g.
  Insufficient types and precisions: no int, float, etc.
  No user defined types.
Does not work well with XML namespaces.
Difficult to enforce sophisticated constraints.
Other Schema Languages
There are many efforts to overcome the limitations of DTD.
XML Schema is one of the most important, since it is a W3C standard.
Other important languages include RelaxNG, Schematron, etc.
However, the ‘schema war’ is considered settled by many.