CIS 670 Fall 2001 (LN 5) 1
XML
Introduction to XML
– XML basics
– DTDs
– XML and semistructured data
Query languages for XML
XML-QL, XQL, XSL
XML extensions
XML-Data, XLink, XPointer
XML’s history: SGML, HTML, XML
SGML: Standard Generalized Markup Language
-- Charles Goldfarb, ISO 8879, 1986
DTDs (Document Type Definitions)
powerful and flexible tool for structuring information, but
– complete, generic implementation of SGML proved extremely difficult
– tools for working with SGML documents proved expensive
two children that have outpaced SGML:
– HTML: HyperText Markup Language (Tim Berners-Lee, 1991). Describing presentation.
– XML: eXtensible Markup Language, W3C, 1998. Describing content.
From HTML to XML
HTML is good for presentation (human friendly), but does not help automatic data extraction by means of programs (not computer friendly).
Why? HTML tags:
predefined and fixed
describing display format, not the structure of the data.
<h3> Book </h3>
<ul>
<li><i> SGML </i> Goldfarb <br>
<b> 1986 </b>
<li><b> XML </b> W3C <br>
</ul>
XML: a first glance
XML tags:
user defined
describing the structure of the data
<book id="B1">
<title> SGML </title>
<author> Goldfarb </author>
<year> 1986 </year>
</book>
<book id="B2">
<title> XML </title>
<author> W3C </author>
</book>
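A minimal sketch of why user-defined tags help programs extract data. It assumes the two book elements are wrapped in a hypothetical root element `books` (not on the slide) and reads them with Python's standard `xml.etree.ElementTree`; the function name `read_books` is also an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# The two <book> elements from the slide, wrapped in a hypothetical
# <books> root so that the text forms a single document.
DOC = """
<books>
  <book id="B1">
    <title> SGML </title>
    <author> Goldfarb </author>
    <year> 1986 </year>
  </book>
  <book id="B2">
    <title> XML </title>
    <author> W3C </author>
  </book>
</books>
"""

def read_books(xml_text):
    """Return one dict per <book>, keyed by the tag names found inside it."""
    records = []
    for book in ET.fromstring(xml_text).iter("book"):
        record = {"id": book.get("id")}
        for child in book:
            # The tag name itself tells the program what the value means.
            record[child.tag] = child.text.strip()
        records.append(record)
    return records

books = read_books(DOC)
```

The second record simply has no `year` key, which a program can test for directly; the HTML version gives no comparable hook.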
XML vs. HTML
user defined new tags, describing structure instead of display
structures can be arbitrarily nested (even recursively defined)
optional description of its grammar (DTD) and thus validation is possible
XML presentation:
XML standard does not define how data should be displayed
Style sheet: provide browsers with a set of formatting rules to be applied to particular elements
– CSS (Cascading Style Sheets), originally for HTML
– XSL (eXtensible Stylesheet Language), for XML
Web applications
data exchange
data transformation
data integration
data extraction
E-commerce
XML basics (1)
XML consists of tags and text
<book id="B2">
<title> XML </title>
<author> W3C </author>
</book>
tags come in pairs: markups
– start tag, e.g., <book>
– end tag, e.g., </book>
tags must be properly nested
– <book> <title> … </title> </book> -- good
– <book> <title> … </book> </title> -- bad
XML has only one “basic” type: text, called PCDATA (Parsed Character DATA)
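The nesting rule can be checked mechanically. The sketch below is an illustration only: the helper name `is_well_formed` is an assumption, and it relies on the fact that Python's standard `xml.etree.ElementTree` raises `ParseError` on input that is not well-formed.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it raises ParseError."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# The "good" and "bad" nestings from the slide.
good = "<book> <title> ... </title> </book>"
bad = "<book> <title> ... </book> </title>"

good_ok = is_well_formed(good)
bad_ok = is_well_formed(bad)
```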
XML basics (2)
nested tags can be used to express various structures,
e.g., “records”:
<person>
<name> Wenfei Fan </name>
<tel> (215) 204-6485 </tel>
<email> [email protected] </email>
<email> [email protected] </email>
</person>
a list: represented by using the same tags repeatedly:
<person> … </person>
<person> … </person>
...
XML basics (3)
XML data is ordered!
How to represent sets in XML?
How to represent an unordered pair (a, b) in XML?
Can one directly represent the following in a conventional database?
– <person> … </person>
<person> … </person> …
– <person>
<name> Wenfei Fan </name>
<tel> (215) 204-6485 </tel>
<email> [email protected] </email>
<email> [email protected] </email>
</person>
XML element (1)
Element: the segment between a start tag and the corresponding end tag
subelement: the relation between an element and its component elements.
<person>
<name> Wenfei Fan </name>
<tel> (215) 204-6485 </tel>
<email> [email protected] </email>
<email> [email protected] </email>
</person>
XML elements (2)
root element: an XML document consists of a single element called the root element, e.g.,
<db>
<person> … </person>
<person> … </person> ...
</db>
empty element: special element indicating non-textual content, e.g., <giggle></giggle> or simply <giggle/>
– sound effect (to be generated by application):
<claim> Everyone taking CIS 670 can get an A <giggle/>. However, …</claim>
– its attributes carry information:
<image img="picture.gif" />
XML elements (3)
mixed content: an element may contain a mixture of subelements and PCDATA:
<person>
This is my daughter
<name> Grace Fan</name>
<email> [email protected] </email>
I am not too sure of the following
<hobby> crying </hobby>
<education> Ph.D </education>
Don’t even think about it: <smoking/>
</person>
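To make mixed content concrete, here is a small sketch that lists an element's content in document order, using a shortened version of the example above. The helper `content_sequence` is hypothetical. It relies on how `xml.etree.ElementTree` stores PCDATA: text before the first subelement is in `.text`, and text following a subelement is in that subelement's `.tail`.

```python
import xml.etree.ElementTree as ET

# Shortened version of the slide's example (the e-mail and education lines are omitted).
DOC = """<person>
This is my daughter
<name> Grace Fan</name>
I am not too sure of the following
<hobby> crying </hobby>
Don't even think about it: <smoking/>
</person>"""

def content_sequence(element):
    """List the content in order: ("text", s) for PCDATA, ("element", tag) for subelements."""
    items = []
    if element.text and element.text.strip():
        items.append(("text", element.text.strip()))
    for child in element:
        items.append(("element", child.tag))
        # PCDATA that follows a subelement is stored on that subelement's tail.
        if child.tail and child.tail.strip():
            items.append(("text", child.tail.strip()))
    return items

sequence = content_sequence(ET.fromstring(DOC))
```

The result alternates text and element entries, which is exactly what "a mixture of subelements and PCDATA" means.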
XML attributes (1)
A start tag may contain attributes describing “properties” of the element (e.g., dimension or type)
<picture>
<height dim="cm"> 2400</height>
<width dim="in"> 96 </width>
<data encoding="gif"> M05-+C$ … </data>
</picture>
References (meaningful only when a DTD is present):
<person id="989" father="868">
<name> Grace Fan</name>
<email> [email protected] </email>
</person>
XML attributes (2)
XML attributes cannot be nested
XML attributes must be unique
one can’t write <person friends="0" friends="1"> ...
XML attributes are not ordered
<person id="989" friends="110 111 112 113 114">
<name> Grace Fan</name>
</person>
is the same as
<person friends="110 111 112 113 114" id="989">
<name> Grace Fan</name>
</person>
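Both attribute rules can be observed with a parser. This is a sketch rather than a definitive test: the helper `attributes` is hypothetical, and it uses `xml.etree.ElementTree`, which exposes attributes as a dictionary and raises `ParseError` when an attribute is repeated.

```python
import xml.etree.ElementTree as ET

FIRST = '<person id="989" friends="110 111 112 113 114"><name> Grace Fan</name></person>'
SECOND = '<person friends="110 111 112 113 114" id="989"><name> Grace Fan</name></person>'
DUPLICATE = '<person friends="0" friends="1"/>'

def attributes(xml_text):
    """Return the root element's attributes as a dict, or None if parsing fails."""
    try:
        return dict(ET.fromstring(xml_text).attrib)
    except ET.ParseError:
        return None

# Dictionary comparison ignores the order in which the attributes were written.
same = attributes(FIRST) == attributes(SECOND)
# A repeated attribute name makes the document not well-formed.
rejected = attributes(DUPLICATE) is None
```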
Attributes vs. subelements
When to use attributes?
Not always clear. A research problem.
An incomplete guideline:
attributes cannot nest (flat structure)
subelements cannot represent references
subelements are more easily displayed (without the use of complex style sheets)
subelements are often easier to read
attributes are more compact
Other XML constructs (1)
XML declaration: if present, it must provide version information:
<?xml version="1.0"?>
comments:
<!-- This is a comment. Processors will ignore me -->
CDATA: escape blocks containing characters that would otherwise be recognized as markup:
<![CDATA[ content]]>
e.g., <![CDATA[ <start> this is not an element </end>]]>
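A short sketch of the effect of a CDATA section, assuming a hypothetical wrapper element `note` around the example above: the parser hands the enclosed characters back as plain text and creates no subelements.

```python
import xml.etree.ElementTree as ET

# <note> is a hypothetical wrapper; the CDATA section is the one from the slide.
DOC = "<note><![CDATA[ <start> this is not an element </end>]]></note>"

note = ET.fromstring(DOC)
text = note.text                # the escaped characters, returned verbatim
subelement_count = len(note)    # number of child elements
```

Without the CDATA wrapper the same characters would be rejected, because `<start>` and `</end>` do not nest properly.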
Other XML constructs (2)
PI (Processing Instruction): for applications, not parsers
<?instruction?>
Example: associate a CSS style sheet with an XML document
<?xml-stylesheet href="book.css" type="text/css"?>
Example: associate an XSL style sheet with an XML document
<?xml-stylesheet
href="http://www.cis.temple.edu/~fan/book.xsl"
type="text/xsl" ?>
Other XML constructs (3)
Entities: macros (defined in a DTD):
<!ENTITY entity_name "entity_content">
E.g., in 670.dtd
<!ENTITY XML-flavor "structure">
<!ENTITY HTML-flavor "display">
In 670.xml:
XML is about &XML-flavor; while HTML is about &HTML-flavor;
DTD (Document Type Declaration)
<!DOCTYPE 670 SYSTEM
"http://www.cis.temple.edu/670/670.dtd">
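Entity expansion can be tried without an external file. The sketch below is an assumption-laden variant of the slide's example: the two entities are declared in an internal DTD subset instead of 670.dtd, and the root element `note` is hypothetical. The parser behind `xml.etree.ElementTree` expands internal general entities declared this way.

```python
import xml.etree.ElementTree as ET

# Entities declared in an internal subset (an assumption; the slide keeps them in 670.dtd).
DOC = """<!DOCTYPE note [
  <!ENTITY XML-flavor "structure">
  <!ENTITY HTML-flavor "display">
]>
<note>XML is about &XML-flavor; while HTML about &HTML-flavor;</note>"""

# Each reference &name; is replaced by the entity's content during parsing.
expanded = ET.fromstring(DOC).text
```

Note that every reference ends with a semicolon; without it the document is not well-formed.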
A complete XML document
<?xml version="1.0"?>
<!DOCTYPE books SYSTEM "~fan/book.dtd">
<?xml-stylesheet href="book.xsl" type="text/xsl"?>
<books> <!-- book database -->
<book id="B1">
<title> SGML </title> <author> Goldfarb </author> <year> 1986 </year>
</book>
<book id="B2">
<title> XML </title>
</book>
</books>
Well-formed XML documents
a document is well-formed if it satisfies two constraints (when only elements and attributes are considered):
tags have to nest properly
attributes have to be unique
There are also constraints about other constructs (e.g., entities) -- XML specification.
Very weak constraints: it does little more than ensure that XML data will parse into a labeled tree
XML is tree-like: rooted directed tree (graph) with labels on vertices!
We are here
Introduction to XML
– XML basics
– DTDs
– XML and semistructured data
Document Type Definitions (DTDs)
A DTD imposes structure on an XML document
The DTD is a syntactic specification (grammar)
There is some relationship between a DTD and a schema, but it is not close -- hence the need for additional “typing” systems (extensions)
DTDs are optional: an XML document need not come with a DTD
DTDs are somewhat unsatisfactory, and several proposals have been made for better schema formalisms
A DTD
<!DOCTYPE db [
<!ELEMENT db (book*)>
<!ELEMENT book (title, author*, section*, ref*)>
<!ATTLIST book isbn ID #REQUIRED>
<!ELEMENT section (text | section)*>
<!ELEMENT ref EMPTY>
<!ATTLIST ref to IDREFS #IMPLIED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT text (#PCDATA)>
]>
Element declarations (1)
for each element type E, a declaration of the form:
<!ELEMENT E P>
where P is a regular expression, i.e.,
P ::= EMPTY | ANY | #PCDATA | E’ |
P1, P2 | P1 | P2 | P? | P+ | P*
– E’: element type
– P1 , P2: concatenation
– P1 | P2: disjunction
– P?: optional
– P+: one or more occurrences
– P*: the Kleene closure
Element declarations (2)
Extended context free grammar: <!ELEMENT E P>
Why is it called extended?
E.g., <!ELEMENT book (title, author*, section*, ref*)>
single root: <!DOCTYPE db [ … ] >
subelements are ordered.
The following two are different. Why?
<!ELEMENT section (text | section)*>
<!ELEMENT section (text* | section* )>
recursive definition, e.g., section, binary tree:
<!ELEMENT node (leaf | (node, node))>
<!ELEMENT leaf (#PCDATA)>
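One way to explore the difference between the two section declarations above is to treat a content model as an ordinary regular expression over the sequence of child tags. The sketch below is hand-translated and illustrative only: the names `MODEL_A`, `MODEL_B`, `child_word` and `conforms` are assumptions, and only the direct children of one element are checked, not the whole document.

```python
import re
import xml.etree.ElementTree as ET

# Each child tag is written as "tag " so that a sequence of children
# becomes a string that a regular expression can match.
MODEL_A = re.compile(r"(text |section )*")      # (text | section)*
MODEL_B = re.compile(r"(text )*|(section )*")   # (text* | section*)

def child_word(element):
    """Spell out the element's children, in order, as one string."""
    return "".join(child.tag + " " for child in element)

def conforms(element, model):
    """True if the whole child sequence matches the content model."""
    return model.fullmatch(child_word(element)) is not None

# Children of different types interleaved.
mixed = ET.fromstring("<section><text>a</text><section/><text>b</text></section>")
# Children of a single type only.
uniform = ET.fromstring("<section><text>a</text><text>b</text></section>")
```

Under these assumptions, `mixed` conforms to the first model but not the second, while `uniform` conforms to both: the second declaration allows only text children or only section children, never a mixture.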
Element declarations (3)
more on recursive DTDs
<!ELEMENT person (name, father, mother)>
<!ELEMENT father (person)>
<!ELEMENT mother (person)>
What is the problem with this? How to fix it?
– Attributes
– optional (e.g., father?, mother?)
more on ordering
How to declare E to be an unordered pair (a, b)?
<!ELEMENT E ((a, b) | (b, a)) >
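The problem with the recursive person declaration can be examined with a small fixpoint computation that finds the element types able to produce a finite document. This is a sketch under stated assumptions: each content model is reduced by hand to its required children (optional and starred parts are dropped), and the names `STRICT`, `RELAXED` and `finite_types` are hypothetical.

```python
# Each element type maps to a list of alternatives; each alternative lists
# only the children that are REQUIRED (hand-simplified from the DTD).
STRICT = {
    "person": [["name", "father", "mother"]],   # (name, father, mother)
    "father": [["person"]],
    "mother": [["person"]],
    "name": [[]],                                # (#PCDATA): no required children
}
RELAXED = {
    "person": [["name"]],                        # (name, father?, mother?)
    "father": [["person"]],
    "mother": [["person"]],
    "name": [[]],
}

def finite_types(grammar):
    """Return the element types that can derive at least one finite document."""
    finite = set()
    changed = True
    while changed:
        changed = False
        for element_type, alternatives in grammar.items():
            if element_type in finite:
                continue
            # A type is finite if some alternative needs only finite children.
            if any(all(child in finite for child in alt) for alt in alternatives):
                finite.add(element_type)
                changed = True
    return finite

strict_result = finite_types(STRICT)
relaxed_result = finite_types(RELAXED)
```

With the strict declaration only `name` is finite, so no finite person element can exist; making father and mother optional makes every type finite.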
Element declarations (4)
EMPTY element:
<!ELEMENT ref EMPTY>
<!ATTLIST ref to IDREFS #IMPLIED>
observe it has attributes
ANY: may contain any content -- discouraged
<!ELEMENT generic ANY>
mixed content
<!ELEMENT section (#PCDATA | section)*>
Element declarations (5)
global definition:
<!ELEMENT person (name, ssn)>
<!ELEMENT course (name, credit, instructor)>
The type associated with an element is unique -- only one declaration for name is allowed.
To avoid name clashes, one may use two distinct tags: e.g., personname, coursename.
namespace: define two namespaces
<MYNAMESPACE xmlns:person="~fan/person.dtd"
xmlns:course="~fan/course.dtd">
<person:name> … <course:name> …
</MYNAMESPACE>
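A sketch of how a namespace-aware parser keeps the two `name` elements apart. The element contents ("Wenfei Fan", "CIS 670") are placeholders added for illustration, since the slide elides them; `xml.etree.ElementTree` reports each tag as `{namespace-name}local-name`.

```python
import xml.etree.ElementTree as ET

# Namespace names are taken from the slide; the element contents are
# illustrative placeholders, not part of the original example.
DOC = """<MYNAMESPACE xmlns:person="~fan/person.dtd"
             xmlns:course="~fan/course.dtd">
  <person:name> Wenfei Fan </person:name>
  <course:name> CIS 670 </course:name>
</MYNAMESPACE>"""

# The prefix is replaced by the namespace name it is bound to.
tags = [child.tag for child in ET.fromstring(DOC)]
```

The two tags differ even though both have the local name `name`, which is how the clash described above is avoided.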
Attribute declarations (1)
General syntax:
<!ATTLIST element_name
attribute-name attribute-type default-declaration>
example:
<!ATTLIST book
isbn ID #REQUIRED>
<!ATTLIST ref
to IDREFS #IMPLIED>
Note: it is OK for several element types to define an attribute of the same name, e.g.,
<!ATTLIST person name ID #REQUIRED>
<!ATTLIST pet name ID #REQUIRED>
Attribute declarations (2)
<!ATTLIST element_name
attribute-name attribute-type default-declaration>
attribute types: 10
– CDATA
– ID, IDREF, IDREFS
– ENTITY, ENTITIES
– NMTOKEN, NMTOKENS
– enumerated, notation
default declarations: 4
– #REQUIRED, #IMPLIED
– "default value", #FIXED "default value"
Specifying ID and IDREF attributes
<!ATTLIST person
id ID #REQUIRED
father IDREF #IMPLIED
mother IDREF #IMPLIED
children IDREFS #IMPLIED>
e.g.,
<person id="898" father="332" mother="336"
children="982 984 986">
….
</person>
XML reference mechanism
ID attribute: unique within the entire document.
– An element can have at most one ID attribute.
– No default (fixed default) value is allowed.
• #REQUIRED: a value must be provided
• #IMPLIED: a value is optional
IDREF attribute: its value must be the ID value of some element in the document.
IDREFS attribute: its value is a whitespace-separated list, each item of which is the ID value of some element in the document.
<person id="898" father="332" mother="336"
children="982 984 986">
ID vs. object identifiers in OODBs
ID is unique within the whole document, like an oid
ID is not system-generated and can be changed: different from an oid and somewhat like keys in relational DBs
IDREF (IDREFS) are untyped -- a big problem: you point to something, but you don’t know what it is!
<student id="01" taking="cis670"/>
<student id="02" taking="cis670 01"/>
<course id="cis670"/>
No inverse constraints: e.g., one cannot state that child is the inverse of parent.
This makes it difficult to translate object-oriented databases into an XML encoding.
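The untyped nature of IDREFS shows up as soon as references are resolved. In this sketch the three elements above are wrapped in a hypothetical root `db`, and the helper `target_types` is an assumption; it looks up each referenced ID and reports the tag of the element it lands on.

```python
import xml.etree.ElementTree as ET

# The slide's elements, wrapped in a hypothetical <db> root.
DOC = """<db>
  <student id="01" taking="cis670"/>
  <student id="02" taking="cis670 01"/>
  <course id="cis670"/>
</db>"""

root = ET.fromstring(DOC)
# Index every element that carries an id attribute.
by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}

def target_types(student):
    """Return the tag of each element that the student's 'taking' list points to."""
    return [by_id[ref].tag for ref in student.get("taking").split()]

types = {s.get("id"): target_types(s) for s in root.iter("student")}
```

Student 02 is "taking" both a course and another student, and nothing in an IDREFS declaration can rule that out.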
ID/IDREF vs. key/foreign key in RDBs
keys are unique within the same relation, while IDs are unique within the whole document
a relation may have several different keys, while an element can have at most one ID
keys can be composite (span several attributes), while an ID must be a single value
enroll (sid: string, cid: string, grade:string)
foreign keys are typed, while IDREF (IDREFS) is not
This makes it difficult to translate RDBs into XML.
Why is it a problem? To exchange/integrate/transform data, we need to translate legacy data into an XML encoding while preserving the original semantics of the data.
CDATA attributes
CDATA: string
<!ATTLIST snore
volume CDATA #IMPLIED>
e.g., <snore volume="loud" />
default value: used when no value is given
<!ATTLIST snore
volume CDATA "normal">
e.g., <snore volume="loud" />
<snore/>
fixed default value: fixed and may not be changed
<!ATTLIST snore
volume CDATA #FIXED "normal">
e.g., <snore/>
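Default values can be observed with Python's standard `xml.parsers.expat`, which by default reports attributes supplied by declarations in the internal DTD subset alongside those written in the document. The sketch is illustrative: the root element `room` and the helper `attributes_seen` are assumptions, not part of the slide.

```python
import xml.parsers.expat

def attributes_seen(xml_text):
    """Return (element name, attribute dict) for each start tag, in document order."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: seen.append((name, dict(attrs)))
    parser.Parse(xml_text, True)
    return seen

# Internal subset with the default-value declaration from the slide;
# <room> is a hypothetical root holding two snore elements.
DOC = """<!DOCTYPE room [
  <!ELEMENT room (snore*)>
  <!ELEMENT snore EMPTY>
  <!ATTLIST snore volume CDATA "normal">
]>
<room><snore volume="loud"/><snore/></room>"""

seen = attributes_seen(DOC)
```

The second snore element has no attribute in the text, yet the application sees `volume` with the declared default.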
Enumerated types
An enumerated type restricts the attribute to a fixed list of values, so that, e.g., a volume cannot fall outside the allowed range.
<!ATTLIST snore
volume (silent | quiet | normal | loud | loudest)
"normal">
NOTATIONS
Notations allow documents to identify the types of content they will contain:
<!NOTATION notation-id SYSTEM "notation-type">
e.g., <!NOTATION gif SYSTEM "image/gif">
<!NOTATION jpg SYSTEM "image/jpeg">
In attributes:
<!ATTLIST picture
source CDATA #REQUIRED
type NOTATION (gif | jpg) #REQUIRED>
NMTOKEN, NMTOKENS
NMTOKEN: name token, a restricted form of string built only from name characters (letters, digits, “.”, “-”, “_”, “:”); unlike an element name, it may begin with any of them. No additional constraints: it does not have to match another attribute or declaration.
<!ATTLIST picture
width NMTOKEN #REQUIRED>
NMTOKENS: a list of name tokens.
ENTITY, ENTITIES
ENTITY (attribute): must match the name of an entity
ENTITIES: multiple ENTITY values, each must be the name of an entity.
Parameter entities: their use is limited to the DTD:
syntax: <!ENTITY % entity-name "entity-content">
use: %entity-name;
e.g., <!ENTITY % basics "(RDBs | OODBs | WebDBs)">
<!ATTLIST DB-user
skills %basics; #IMPLIED>
<!ATTLIST DBA
skills %basics; #IMPLIED
others CDATA #REQUIRED>
Research: sub-typing and inheritance by means of entities?
Valid XML documents
A valid XML document must have a DTD.
The document is well-formed
It conforms to the DTD:
– elements observe the grammar (nested only in the way described by the DTD)
– elements have only the attributes specified by the DTD
– ID/IDREF attributes satisfy their constraints:
• ID must be distinct
• IDREF/IDREFS values must be existing ID values
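The two ID/IDREF constraints can be sketched as a small checker. Python's standard library does not validate against a DTD, so this example assumes the caller states which attributes are of type ID, IDREF and IDREFS; the function name `check_references`, the root element `db` and the sample data are all hypothetical.

```python
import xml.etree.ElementTree as ET

def check_references(root, id_attr, idref_attrs, idrefs_attrs):
    """Return a list of violations of the ID / IDREF / IDREFS constraints."""
    errors = []
    ids = set()
    # Constraint 1: ID values must be distinct within the document.
    for element in root.iter():
        value = element.get(id_attr)
        if value is None:
            continue
        if value in ids:
            errors.append("duplicate ID " + value)
        ids.add(value)
    # Constraint 2: every IDREF / IDREFS value must be an existing ID value.
    for element in root.iter():
        targets = [element.get(a) for a in idref_attrs if element.get(a) is not None]
        for a in idrefs_attrs:
            targets.extend(element.get(a, "").split())
        for target in targets:
            if target not in ids:
                errors.append("dangling reference " + target)
    return errors

DOC = """<db>
  <person id="898" father="332" mother="336" children="982 984 986"/>
  <person id="332"/>
  <person id="336"/>
  <person id="982"/>
</db>"""

errors = check_references(ET.fromstring(DOC), "id", ["father", "mother"], ["children"])
```

Here two of the listed children do not exist in the document, so a validating parser would reject it for the same reason this checker does.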
We are here
Introduction to XML
– XML basics
– DTDs
– XML and semistructured data
Graph representation
XML data can be presented as a rooted node-labeled directed tree, while semistructured data is a rooted edge-labeled directed graph.
If references are not considered (without DTD), an XML document is usually modeled as a tree.
References (IDREF/IDREFS) can be viewed as edges and thus lead to graphs (cyclic structure) -- XML-QL model
When IDREF/IDREFS attributes are simply treated as text data, XML data is modeled as a tree - XSL and XQL model.
This model does not take validation into account.
Research: ordered graph/tree?
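The tree and graph views can be contrasted by listing edges with and without following references. This is a sketch, not either language's actual data model: the sample document, the tuple `REFERENCE_ATTRS` and the function `edges` are assumptions, and nodes are named by their id where one exists.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: a parent and a child that refer to each other.
DOC = """<db>
  <person id="332" children="898"/>
  <person id="898" father="332"/>
</db>"""

# Attributes assumed to be declared as IDREF / IDREFS in some DTD.
REFERENCE_ATTRS = ("father", "mother", "children")

def edges(root, follow_references):
    """Return (source, target) pairs; nodes are named by id, or by tag if they have none."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    result = []
    for parent in root.iter():
        source = parent.get("id", parent.tag)
        # Element nesting gives the tree edges.
        for child in parent:
            result.append((source, child.get("id", child.tag)))
        # References, if followed, add further edges and may create cycles.
        if follow_references:
            for attr in REFERENCE_ATTRS:
                for ref in parent.get(attr, "").split():
                    if ref in by_id:
                        result.append((source, ref))
    return result

root = ET.fromstring(DOC)
tree_edges = edges(root, follow_references=False)
graph_edges = edges(root, follow_references=True)
```

Treating the reference attributes as text leaves a two-edge tree; following them adds an edge in each direction between the two persons, which is a cycle.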
Node-labeled vs. edge-labeled (1)
Consider an ssd expression {a: {b: “string”}, a: {c: “string”}}
Edge-labeled graph:
We may encode it in XML either as (with some DTD):
<a> <b id="&12"> string </b> </a>
<a c="&12" />
or as <a b="&12" />
<a> <c id="&12"> string </c> </a>
[Figure: edge-labeled graph with two edges labeled a, followed by edges labeled b and c that lead to “string”]
Node-labeled vs. edge-labeled (2)
Node-labeled graph:
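<a> <b id="&12"> string </b> </a>
<a c="&12" />
<a b="&12" />
<a> <c id="&12"> string </c> </a>
[Figure: the node-labeled graphs for the two encodings above, each with two nodes labeled a, nodes labeled b and c, and the text node “string”]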
XML vs. semistructured data
similarities:
– both described best by graphical representation
– both are schema-less, self-describing
differences:
– XML is ordered, semistructured data is not.
– XML data is modeled as a node labeled graph (tree), semistructured data as an edge labeled graph.
– XML can mix text and elements; when translating it to a semistructured data expression, we have to add some surrounding tag for the PCDATA.
– XML has other constructs (PIs, comments).
– XML may have (optional) DTD.
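The point about mixed content can be sketched as a translation from an XML element to a list of (label, value) pairs standing in for a semistructured data expression. Several choices here are assumptions: the function name `to_ssd`, the surrounding label `text` chosen for PCDATA, the use of an ordered Python list (semistructured data itself is unordered), and the decision to ignore attributes.

```python
import xml.etree.ElementTree as ET

def to_ssd(element, text_label="text"):
    """Translate an element's content into (edge label, value) pairs.

    An element without subelements becomes its stripped text; PCDATA mixed in
    among subelements is wrapped in an extra edge labeled text_label.
    """
    if len(element) == 0:
        return (element.text or "").strip()
    pairs = []
    if element.text and element.text.strip():
        pairs.append((text_label, element.text.strip()))
    for child in element:
        pairs.append((child.tag, to_ssd(child, text_label)))
        if child.tail and child.tail.strip():
            pairs.append((text_label, child.tail.strip()))
    return pairs

# A shortened mixed-content example in the style of the earlier slide.
DOC = "<person>This is my daughter <name> Grace Fan</name> <hobby> crying </hobby></person>"
ssd = to_ssd(ET.fromstring(DOC))
```

The free-standing sentence has no tag of its own in XML, so the translation has to invent the `text` label for it.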
Query languages
for semistructured data:
– Lorel
– UnQL
for XML
– XML-QL
– XQL
– XSLT (transformation language)
XML data can be treated and queried as semistructured data (e.g., Lorel).
DTDs vs. schemas (types)
By database (or programming language) standards, XML DTDs are rather weak specifications.
– Only one base type -- PCDATA.
– No useful “abstractions”, e.g., sets.
– IDREFs are not typed or scoped -- you point to something, but you don’t know what!
– No constraints (e.g., inverse) or methods.
– No sub-typing or inheritance.
XML extensions to overcome the limitations.
– Type systems: XML-Data, XML-Schema, SOX, DCD
– metadata: RDF
– constraints
Summary
XML is a new data exchange format. Its main virtues include widespread acceptance.
DTDs provide some useful syntactic constraints on documents. As schemas they are weak.
Research topics include:
– How will we query XML data?
– How will we navigate the Web (XLink, XPointer)?
– How will we extend XML DTDs to capture the semantics of the data?
– How will we map between XML and other representations, esp. structured databases?
– How will we compress/store XML data?