CIS 670 Fall 2001 (LN 5) 1

XML

Introduction to XML

– XML basics

– DTDs

– XML and semistructured data

Query languages for XML

XML-QL, XQL, XSL

XML extensions

XML-Data, XLink, XPointer

XML’s history: SGML, HTML, XML

SGML: Standard Generalized Markup Language

-- Charles Goldfarb, ISO 8879, 1986

DTDs (Document Type Definitions)

powerful and flexible tool for structuring information, but

– complete, generic implementation of SGML proved extremely difficult

– tools for working with SGML documents proved expensive

two children that have outpaced SGML:

– HTML: HyperText Markup Language (Tim Berners-Lee, 1991). Describing presentation.

– XML: eXtensible Markup Language, W3C, 1998. Describing content.

From HTML to XML

HTML is good for presentation (human friendly), but does not help automatic data extraction by means of programs (not computer friendly).

Why? HTML tags:

predefined and fixed

describing display format, not the structure of the data.

<h3> Book </h3>
<ul>
<li> SGML Goldfarb 1986
<li> XML W3C
</ul>

XML: a first glance

XML tags:

user defined

describing the structure of the data

<book id="B1">
<title> SGML </title>
<author> Goldfarb </author>
<year> 1986 </year>
</book>

<book id="B2">
<title> XML </title>
<author> W3C </author>
</book>

XML vs. HTML

user defined new tags, describing structure instead of display

structures can be arbitrarily nested (even recursively defined)

optional description of its grammar (DTD) and thus validation is possible

XML presentation:

XML standard does not define how data should be displayed

Style sheet: provide browsers with a set of formatting rules to be applied to particular elements

– CSS (Cascading Style Sheets), originally for HTML

– XSL (eXtensible Stylesheet Language), for XML

Web applications

data exchange

data transformation

data integration

data extraction

E-commerce

XML basics (1)

XML consists of tags and text

<book>
<title> XML </title>
<author> W3C </author>
</book>

tags come in pairs: markups

– start tag, e.g., <book>

– end tag, e.g., </book>

tags must be properly nested

– <book> <title> … </title> </book> -- good

– <book> <title> … </book> </title> -- bad

XML has only one “basic” type: text, called PCDATA (Parsed Character DATA)
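The nesting rule can be checked mechanically. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree parser (not part of the original notes); the helper name is_well_formed is made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts text as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

good = "<book> <title> XML </title> </book>"   # properly nested
bad = "<book> <title> XML </book> </title>"    # end tags cross each other
print(is_well_formed(good), is_well_formed(bad))
```

Any non-validating parser would do here; no DTD is needed to detect improper nesting.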

XML basics (2)

nested tags can be used to express various structures,

e.g., “records”:

<person>

<name> Wenfei Fan </name>

<tel> (215) 204-6485 </tel>

<email> [email protected] </email>


</person>

a list: represented by using the same tags repeatedly:

<person> … </person>


...

XML basics (3)

XML data is ordered!

How to represent sets in XML?

How to represent an unordered pair (a, b) in XML?

Can one directly represent the following in a conventional database?

– <person> … </person>

<person> … </person> …

– <person>


<tel> (215) 204-6485 </tel>



</person>
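To make the ordering point concrete, here is a small sketch using Python's standard xml.etree.ElementTree (an assumption of this example, as are the element name pair and the helper child_tags). The two documents differ as XML because their children appear in a different order; treating them as the same unordered pair is something the application has to do itself.

```python
import xml.etree.ElementTree as ET

def child_tags(text):
    """Return the tags of the root's children in document order."""
    return [child.tag for child in ET.fromstring(text)]

ab = child_tags("<pair><a/><b/></pair>")
ba = child_tags("<pair><b/><a/></pair>")

ordered_equal = ab == ba               # compares the sequences
unordered_equal = set(ab) == set(ba)   # compares only membership
print(ordered_equal, unordered_equal)
```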

XML element (1)

Element: the segment between a start tag and its corresponding end tag

subelement: the relation between an element and its component elements.

<person>


<tel> (215) 204-6485 </tel>



</person>

XML elements (2)

root element: an XML document consists of a single element called the root element, e.g.,

<db>


<person> … </person> ...

</db>

empty element: special element indicating non-textual content, e.g., <giggle></giggle> or simply <giggle/>

– sound effect (to be generated by application):

<claim> Everyone taking CIS 670 can get an A <giggle/>. However, …</claim>

– its attributes carry information:

<image img="picture.gif" />

XML elements (3)

mixed content: an element may contain a mixture of subelements and PCDATA:

<person>

This is my daughter

<name> Grace Fan</name>


I am not too sure of the following

<hobby> crying </hobby>

<education> Ph.D </education>

Don’t even think about it: <smoking/>

</person>
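A sketch of how mixed content looks to a program, assuming Python's standard xml.etree.ElementTree. In that library the PCDATA before the first subelement is stored in text, and the PCDATA following a subelement is stored in that subelement's tail. The helper mixed_content and the shortened person document are illustrative only.

```python
import xml.etree.ElementTree as ET

person = ET.fromstring(
    "<person>This is my daughter <name>Grace Fan</name>"
    " I am not too sure of the following <hobby>crying</hobby></person>"
)

def mixed_content(elem):
    """List an element's content in document order: strings for PCDATA, <tag> for subelements."""
    parts = []
    if elem.text and elem.text.strip():
        parts.append(elem.text.strip())
    for child in elem:
        parts.append("<" + child.tag + ">")
        if child.tail and child.tail.strip():
            parts.append(child.tail.strip())
    return parts

print(mixed_content(person))
```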

XML attributes (1)

A start tag may contain attributes describing “properties” of the element (e.g., dimension or type)

<picture>
<height dim="cm"> 2400</height>
<width dim="in"> 96 </width>
<data encoding="gif"> M05-+C$ … </data>
</picture>

References (meaningful only when a DTD is present):

<person id="989" father="868">



</person>

XML attributes (2)

XML attributes cannot be nested

XML attributes must be unique

one can’t write <person friends="0" friends="1"> ...

XML attributes are not ordered

<person id="989" friends="110 111 112 113 114">
</person>

is the same as

<person friends="110 111 112 113 114" id="989">
</person>
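Both attribute rules can be observed with a parser. This is a minimal sketch assuming Python's standard xml.etree.ElementTree, which exposes attributes as a dictionary (so order plays no role in comparisons) and rejects a start tag that repeats an attribute name.

```python
import xml.etree.ElementTree as ET

first = ET.fromstring('<person id="989" friends="110 111 112 113 114"/>')
second = ET.fromstring('<person friends="110 111 112 113 114" id="989"/>')
same = first.attrib == second.attrib   # dictionary comparison ignores order

try:
    ET.fromstring('<person friends="0" friends="1"/>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False

print(same, duplicate_accepted)
```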

Attributes vs. subelements

When to use attributes?

Not always clear. A research problem.

An incomplete guideline:

attributes cannot nest (flat structure)

subelements cannot represent references

subelements are more easily displayed (without the use of complex style sheets)

subelements are often easier to read

attributes are more compact

Other XML constructs (1)

XML declaration: version information must be provided:

<?xml version='1.0'?>

comments:

<!-- comment text -->

CDATA: escape blocks containing characters that would otherwise be recognized as markup:

<![CDATA[ content]]>

e.g., <![CDATA[ <start> this is not an element </end>]]>
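The effect of a CDATA section can be seen by parsing one. The sketch below assumes Python's standard xml.etree.ElementTree and wraps the example in a made-up note element, since a CDATA section must appear inside some element.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<note><![CDATA[ <start> this is not an element </end>]]></note>")

content = doc.text          # the characters inside the CDATA section
subelements = len(doc)      # number of child elements of <note>
print(repr(content), subelements)
```

The angle brackets arrive as ordinary character data, and note has no subelements.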


PI (Processing Instruction): for applications, not parsers

<?instruction?>

Example: associate a CSS style sheet with XML document

<?xml-stylesheet href="book.css" type="text/css"?>

Example: associate an XSL style sheet with XML document

<?xml-stylesheet href="http://www.cis.temple.edu/~fan/book.xsl" type="text/xsl"?>


Entities: macros (defined in a DTD):

<!ENTITY entity_name "entity_content">

E.g., in 670.dtd

<!ENTITY XML-flavor "structure">
<!ENTITY HTML-flavor "display">

In 670.xml:

XML is about &XML-flavor; while HTML is about &HTML-flavor;

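Document type declaration (points the document at its DTD):

<!DOCTYPE cis670 SYSTEM "http://www.cis.temple.edu/670/670.dtd">

(The name after DOCTYPE is the name of the root element. An XML name cannot begin with a digit, so the root is called cis670 here rather than 670. A PUBLIC identifier needs both a public name and a system URL, so SYSTEM is used when only a URL is given.)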
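Entity expansion can be tried without an external file by placing the declarations in an internal DTD subset. This is a sketch under assumptions: it uses Python's standard xml.etree.ElementTree, a made-up root element note, and the two entity names from the 670.dtd example.

```python
import xml.etree.ElementTree as ET

text = """<?xml version='1.0'?>
<!DOCTYPE note [
  <!ENTITY XML-flavor "structure">
  <!ENTITY HTML-flavor "display">
]>
<note>XML is about &XML-flavor; while HTML is about &HTML-flavor;</note>"""

note = ET.fromstring(text)
print(note.text)   # entity references are replaced by their content
```

Note that each reference ends with a semicolon; without it the document is not well-formed.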

A complete XML document

<?xml version='1.0'?>
<!DOCTYPE books SYSTEM "~fan/book.dtd">
<?xml-stylesheet href="book.xsl" type="text/xsl"?>
<books>
<book id="B1">
<title> SGML </title><author> Goldfarb </author><year> 1986 </year>
</book>
<book id="B2">
<title> XML </title>
</book>
</books>

Well-formed XML documents

a document is well-formed if it satisfies two constraints (when only elements and attributes are considered):

tags have to nest properly

attributes have to be unique

There are also constraints about other constructs (e.g., entities) -- XML specification.

Very weak constraints: they do little more than ensure that XML data will parse into a labeled tree

XML is tree-like: rooted directed tree (graph) with labels on vertices!
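The labeled tree that a well-formed document parses into can be made explicit. The sketch below assumes Python's standard xml.etree.ElementTree; the helper to_tree and its (label, children) tuple format are illustrative choices, not a standard representation.

```python
import xml.etree.ElementTree as ET

def to_tree(elem):
    """Turn an element into (label, children); text leaves become plain strings."""
    children = []
    if elem.text and elem.text.strip():
        children.append(elem.text.strip())
    for child in elem:
        children.append(to_tree(child))
        if child.tail and child.tail.strip():
            children.append(child.tail.strip())
    return (elem.tag, children)

doc = ET.fromstring(
    "<books><book><title> SGML </title><year> 1986 </year></book>"
    "<book><title> XML </title></book></books>"
)
tree = to_tree(doc)
print(tree)
```

Children are kept in a list, so the tree is ordered, matching the earlier observation that XML data is ordered.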

We are here

Introduction to XML

– XML basics

– DTDs

– XML and semistructured data

Document Type Definitions (DTDs)

A DTD imposes structure on an XML document

The DTD is a syntactic specification (grammar)

There is some relationship between a DTD and a schema, but it is not close -- hence the need for additional “typing” systems (extensions)

DTDs are optional: an XML document need not come with a DTD

DTDs are somewhat unsatisfactory, and several proposals have been made for better schema formalisms

A DTD

<!DOCTYPE db [
<!ELEMENT db (book*)>
<!ELEMENT book (title, author*, section*, ref*)>
<!ATTLIST book isbn ID #REQUIRED>
<!ELEMENT section (text | section)*>
<!ELEMENT ref EMPTY>
<!ATTLIST ref to IDREFS #IMPLIED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT text (#PCDATA)>
]>

Element declarations (1)

for each element type E, a declaration of the form:

<!ELEMENT E P>

where P is a regular expression, i.e.,

P ::= EMPTY | ANY | #PCDATA | E’ |

P1, P2 | P1 | P2 | P? | P+ | P*

– E’: element type

– P1 , P2: concatenation

– P1 | P2: disjunction

– P?: optional

– P+: one or more occurrences

– P*: the Kleene closure


Extended context free grammar: <!ELEMENT E P>

Why is it called extended?

E.g., <!ELEMENT book (title, author*, section*, ref*)>

single root: <!DOCTYPE db [ … ] >

subelements are ordered.

The following two are different. Why?

<!ELEMENT section (text | section)*>

<!ELEMENT section (text* | section* )>

recursive definition, e.g., section, binary tree:

<!ELEMENT node (leaf | (node, node))>

<!ELEMENT leaf (#PCDATA)>
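One way to see why the two section declarations above differ is to read each content model as a regular expression over the sequence of child element names. The sketch below does this by hand with Python's re module; the translation into regex strings and the helper matches are assumptions of this example, not a general DTD validator.

```python
import re

def matches(content_model_regex, children):
    """Check a sequence of child element names against a regex over names."""
    word = "".join(name + " " for name in children)
    return re.fullmatch(content_model_regex, word) is not None

# (text | section)*  : any interleaving of text and section children
interleaved = r"(text |section )*"
# (text* | section*) : only text children, or only section children
one_kind = r"(text )*|(section )*"

mixed = ["text", "section", "text"]
print(matches(interleaved, mixed), matches(one_kind, mixed))
```

A section whose children alternate between text and section satisfies the first declaration but not the second.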


more on recursive DTDs

<!ELEMENT person (name, father, mother)>

<!ELEMENT father (person)>

<!ELEMENT mother (person)>

What is the problem with this? How to fix it?

– Attributes

– optional (e.g., father?, mother?)

more on ordering

How to declare E to be an unordered pair (a, b)?

<!ELEMENT E ((a, b) | (b, a)) >


EMPTY element:

<!ELEMENT ref EMPTY>

<!ATTLIST ref to IDREFS #IMPLIED>

observe it has attributes

ANY: may contain any content -- discouraged

<!ELEMENT generic ANY>

mixed content

<!ELEMENT section (#PCDATA | section)*>


global definition:

<!ELEMENT person (name, ssn)>

<!ELEMENT course (name, credit, instructor)>

The type associated with an element is unique -- only one declaration for name is allowed.

To avoid name clashes, one may use two distinct tags: e.g., personname, coursename.

namespace: define two namespaces

<MYNAMESPACE xmlns:person="~fan/person.dtd"
xmlns:course="~fan/course.dtd">
<person:name> … <course:name> …
</MYNAMESPACE>
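A namespace-aware parser keeps the two name elements apart by qualifying each tag with its namespace identifier. The sketch below assumes Python's standard xml.etree.ElementTree, which writes qualified tags as {namespace}localname; the element contents are filled in only so that the document is complete.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<MYNAMESPACE xmlns:person="~fan/person.dtd" xmlns:course="~fan/course.dtd">'
    "<person:name>Wenfei Fan</person:name>"
    "<course:name>CIS 670</course:name>"
    "</MYNAMESPACE>"
)

tags = [child.tag for child in doc]
print(tags)
```

The prefixes person and course are only shorthand; what distinguishes the two elements is the namespace identifier each prefix is bound to.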

Attribute declarations (1)

General syntax:

<!ATTLIST element_name

attribute-name attribute-type default-declaration>

example:

<!ATTLIST book
isbn ID #REQUIRED>

<!ATTLIST ref
to IDREFS #IMPLIED>

Note: it is OK for several element types to define an attribute of the same name, e.g.,

<!ATTLIST person name ID #REQUIRED>
<!ATTLIST pet name ID #REQUIRED>

Attribute declarations (2)

<!ATTLIST element_name

attribute-name attribute-type default-declaration>

attribute types: 10

– CDATA

– ID, IDREF, IDREFS

– ENTITY, ENTITIES

– NMTOKEN, NMTOKENS

– enumerated, notation

default declarations: 4

– #REQUIRED, #IMPLIED

– "default value", #FIXED "default value"

Specifying ID and IDREF attributes

<!ATTLIST person
id ID #REQUIRED
father IDREF #IMPLIED
mother IDREF #IMPLIED
children IDREFS #IMPLIED>

e.g.,

<person id="p898" father="p332" mother="p336"
children="p982 p984 p986">
….
</person>

(ID values must be XML names, so they begin with a letter rather than a digit.)

XML reference mechanism

ID attribute: unique within the entire document.

– An element can have at most one ID attribute.

– Its value must be an XML name, so it cannot begin with a digit.

– No default (fixed default) value is allowed.

• #REQUIRED: a value must be provided

• #IMPLIED: a value is optional

IDREF attribute: its value must be the ID value of some element in the document.

IDREFS attribute: its value is a set, each element of the set is the ID value of some element in the document.

<person id="p898" father="p332" mother="p336"
children="p982 p984 p986">
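The ID and IDREF constraints can be checked with a few lines of code. This is a minimal sketch, not a validator: it assumes Python's standard xml.etree.ElementTree, hard-codes the attribute types from the ATTLIST for person instead of reading a DTD, and uses a made-up helper reference_errors. ID values carry a letter prefix because an ID must be an XML name.

```python
import xml.etree.ElementTree as ET

# Assumed attribute typing, mirroring the ATTLIST declaration for person.
ID_ATTRS = {"id"}
IDREF_ATTRS = {"father", "mother"}
IDREFS_ATTRS = {"children"}

def reference_errors(root):
    """Return a list of violated ID/IDREF constraints in the document."""
    errors, seen = [], set()
    elements = list(root.iter())
    # First pass: every ID value must be distinct in the whole document.
    for elem in elements:
        for name in ID_ATTRS:
            if name in elem.attrib:
                value = elem.attrib[name]
                if value in seen:
                    errors.append("duplicate ID " + value)
                seen.add(value)
    # Second pass: every IDREF/IDREFS value must be an existing ID value.
    for elem in elements:
        for name, value in elem.attrib.items():
            if name in IDREF_ATTRS:
                targets = [value]
            elif name in IDREFS_ATTRS:
                targets = value.split()
            else:
                continue
            errors.extend("dangling reference " + t for t in targets if t not in seen)
    return errors

db = ET.fromstring(
    '<db><person id="p332"/><person id="p336"/>'
    '<person id="p898" father="p332" mother="p336" children="p982"/></db>'
)
errors = reference_errors(db)
print(errors)
```

Here the only problem reported is the child p982, which no element in this small document declares as an ID.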

ID vs. object identifiers in OODBs

ID is unique within the whole document, like an oid

ID is not system-generated and can be changed: different from an oid and somewhat like keys in relational DBs

IDREF (IDREFS) are untyped -- a big problem: you point to something, but you don’t know what it is!

<student id="s01" taking="cis670"/>
<student id="s02" taking="cis670 s01"/>
<course id="cis670"/>

No inverse constraints can be expressed, e.g., that child is the inverse of parent.

This makes it difficult to translate object-oriented databases into an XML encoding.

ID/IDREF vs. key/foreign key in RDBs

keys are unique within the same relation, while IDs are unique within the whole document

a relation may have several different keys, while an element can have at most one ID

keys can be composite (span several attributes), while an ID is a single attribute, e.g., a key such as (sid, cid) of

enroll (sid: string, cid: string, grade:string)

foreign keys are typed, while IDREF (IDREFS) is not

This makes it difficult to translate RDBs into XML.

Why is it a problem? To exchange/integrate/transform data, we need to translate legacy data into an XML encoding while preserving the original semantics of the data.

CDATA attributes

CDATA: string

<!ATTLIST snore
volume CDATA #IMPLIED>

e.g., <snore volume="loud" />

default value: used when no value is given

<!ATTLIST snore
volume CDATA "normal">

e.g., <snore volume="loud" />
<snore/>

fixed default value: fixed and may not be changed

<!ATTLIST snore
volume CDATA #FIXED "normal">

e.g., <snore/>
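The three default declarations for volume behave as sketched below. This is a plain Python model of the rules, not a parser feature: the tuples IMPLIED, DEFAULT and FIXED and the helper effective_value are hypothetical names introduced only for illustration.

```python
# Hypothetical in-memory form of the three ATTLIST declarations for volume.
IMPLIED = ("#IMPLIED", None)
DEFAULT = ("default", "normal")
FIXED = ("#FIXED", "normal")

def effective_value(declaration, given=None):
    """Return the value an application sees for the attribute.

    given is the value written in the start tag, or None if it is omitted.
    """
    kind, default = declaration
    if kind == "#FIXED" and given not in (None, default):
        raise ValueError("a #FIXED attribute may only take its declared value")
    if given is None:
        return default      # None for #IMPLIED, the declared value otherwise
    return given

print(effective_value(DEFAULT))           # <snore/>
print(effective_value(DEFAULT, "loud"))   # <snore volume="loud"/>
print(effective_value(IMPLIED))           # no value at all
```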

Enumerated types

We list the allowed values and do not want volume to take any other value.

<!ATTLIST snore
volume (silent | quiet | normal | loud | loudest) "normal">

NOTATIONS

Notations allow documents to identify the types of content they will contain:

<!NOTATION notation-id SYSTEM "notation-type">

e.g., <!NOTATION gif SYSTEM "image/gif">
<!NOTATION jpg SYSTEM "image/jpeg">

In attributes:

<!ATTLIST picture
source CDATA #REQUIRED
type NOTATION (gif | jpg) #REQUIRED>

NMTOKEN, NMTOKENS

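NMTOKEN: name token, a restricted form of string built only from name characters (letters, digits, and the characters . - _ :). Unlike an element name, a name token may begin with any of these characters, e.g., a digit. No additional constraints: it does not have to match another attribute or declaration.

<!ATTLIST picture
width NMTOKEN #REQUIRED>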

NMTOKENS: a list of name tokens.

ENTITY, ENTITIES

ENTITY (attribute): must match the name of an unparsed entity declared in the DTD

ENTITIES: multiple ENTITY values, each must be the name of such an entity.

Parameter entities: their use is limited to the DTD:

syntax: <!ENTITY % entity-name "entity-content">

use: %entity-name;

e.g., <!ENTITY % basics "RDBs | OODBs | WebDBs">

<!ATTLIST DB-user
skills (%basics;) #IMPLIED>

<!ATTLIST DBA
skills (%basics;) #IMPLIED
others CDATA #REQUIRED>

(#IMPLIED is supplied here only because every attribute declaration needs a default declaration.)

Research: sub-typing and inheritance by means of entities?

Valid XML documents

A valid XML document must have a DTD.

The document is well-formed

It conforms to the DTD:

– elements observe the grammar (nested only in the way described by the DTD)

– elements have only the attributes specified by the DTD

– ID/IDREF attributes satisfy their constraints:

• ID must be distinct

• IDREF/IDREFS values must be existing ID values

We are here

Introduction to XML

– XML basics

– DTDs

– XML and semistructured data

Graph representation

XML data can be represented as a rooted node-labeled directed tree, while semistructured data is a rooted edge-labeled directed graph.

If references are not considered (without a DTD), an XML document is usually modeled as a tree.

References (IDREF/IDREFS) can be viewed as edges and thus lead to graphs (cyclic structure) -- XML-QL model

When IDREF/IDREFS attributes are simply treated as text data, XML data is modeled as a tree - XSL and XQL model.

This model does not take validation into account.

Research: ordered graph/tree?

Node-labeled vs. edge-labeled (1)

Consider an ssd expression {a: {b: “string”}, a: {c: “string”}}

Edge-labeled graph:

We may encode it in XML either as (with some DTD):

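<a> <b id="&12"> string </b> </a>
<a c="&12" />

or as

<a b="&12" />
<a> <c id="&12"> string </c> </a>

[Figure: edge-labeled graph with two edges labeled a, edges labeled b and c, and the leaf “string”]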

Node-labeled vs. edge-labeled (2)

Node-labeled graph:

<a> <b id="&12"> string </b> </a>
<a c="&12" />

<a b="&12" />
<a> <c id="&12"> string </c> </a>

[Figures: the node-labeled graph of each encoding, with nodes a, a, b, c and “string”]

XML vs. semistructured data

similarities:

– both described best by graphical representation

– both are schema-less, self-describing

differences:

– XML is ordered, semistructured data is not.

– XML data is modeled as a node labeled graph (tree), semistructured data as an edge labeled graph.

– XML can mix text and elements; when translating it to a semistructured data expression, we have to add some surrounding tag for the PCDATA.

– XML has other constructs (PIs, comments).

– XML may have (optional) DTD.
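The translation of mixed content mentioned above can be sketched as follows. The code assumes Python's standard xml.etree.ElementTree and represents a semistructured data expression as a list of (label, value) edges; the helper to_ssd and the surrounding label text chosen for PCDATA are assumptions of this example.

```python
import xml.etree.ElementTree as ET

def to_ssd(elem):
    """Translate an element into a list of (label, value) edges."""
    text = (elem.text or "").strip()
    if len(elem) == 0:
        return text                      # leaf: the PCDATA itself is the value
    edges = []
    if text:
        edges.append(("text", text))     # wrap loose PCDATA in a 'text' edge
    for child in elem:
        edges.append((child.tag, to_ssd(child)))
        tail = (child.tail or "").strip()
        if tail:
            edges.append(("text", tail))
    return edges

person = ET.fromstring("<person>This is my daughter <name>Grace Fan</name></person>")
ssd = to_ssd(person)
print(ssd)
```

The list keeps document order only because Python lists are ordered; in the semistructured data model the edges would form an unordered set.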

Query languages

for semistructured data:

– Lorel

– UnQL

for XML

– XML-QL

– XQL

– XSLT (transformation language)

XML data can be treated and queried as semistructured data (e.g., Lorel).

DTDs vs. schemas (types)

By database (or programming language) standards, XML DTDs are rather weak specifications.

– Only one base type -- PCDATA.

– No useful “abstractions”, e.g., sets.

– IDREFs are not typed or scoped -- you point to something, but you don’t know what!

– No constraints (e.g., inverse) or methods.

– No sub-typing or inheritance.

XML extensions to overcome the limitations.

– Type systems: XML-Data, XML-Schema, SOX, DCD

– metadata: RDF

– constraints

Summary

XML is a new data exchange format. Its main virtues include widespread acceptance.

DTDs provide some useful syntactic constraints on documents. As schemas they are weak.

Research topics include:

– How will we query XML data?

– How will we navigate the Web (XLink, XPointer)?

– How will we extend XML DTDs to capture the semantics of the data?

– How will we map between XML and other representations, esp. structured databases?

– How will we compress/store XML data?
