Database Systems Part VII: XML Software School of Hunan University 2006.10.

46
Database Systems Part VII: XML Software School of Hunan University • 2006.10

Transcript of Database Systems Part VII: XML Software School of Hunan University 2006.10.

Page 1: Database Systems Part VII: XML Software School of Hunan University 2006.10.

Database SystemsPart VII:

XML

• Software School of Hunan University

• 2006.10

Page 2: Database Systems Part VII: XML Software School of Hunan University 2006.10.

Objectives

Define semistructured data and list its advantages over (fixed)

relational schemas.

Given OEM objects, be able to draw a graph representation of

the data.

Given an XML document, determine if it is well-formed.

Given an XML document and a DTD, determine if it is valid.

Explain the difference between #PCDATA and CDATA.

Know the symbols (?,*,+) for cardinality constraints in DTDs.

Compare and contrast ID/IDREFs in DTDs with keys and

foreign keys in the relational model.

Page 3: Database Systems Part VII: XML Software School of Hunan University 2006.10.

Objectives (2)

List some advantages that XML Schema has over DTDs.

Explain what an XML parser does.

Compare and contrast the two XML parser APIs: DOM and S

AX.

Explain why and when namespaces are used.

Why are paths important in XML? Explain the role of XPath.

Explain what XSL and XSLT are used for.

Page 4: Database Systems Part VII: XML Software School of Hunan University 2006.10.

Semistructured Data

Semistructured data is data that may have a varying structure. The data typically has some structure, but the data does not conform to a fixed schema.

With semistructured data, the schema information is contained within the data itself. Thus, semistructured data is often called self-describing.

Semistructured data is important because:�It allows flexibility in representing data whose schema

may frequently change or is hard to conform to a single schema.

�XML data is very similar to semistructured data.

Page 5: Database Systems Part VII: XML Software School of Hunan University 2006.10.

Representing Semistructured Data

• Semistructured data is often represented in a graph form.

• In this graph:

�Nodes represent objects.

�Labels on edges represent attributes or relationships.

�Atomic values are at leaf nodes (nodes with no edges out).

• The graph representation is very flexible because there is no restriction on:

� The number of labels out of a node.

� The number of children nodes with a given label.

Page 6: Database Systems Part VII: XML Software School of Hunan University 2006.10.

Semistructured Graph ExampleData represented:- Dept D1 has Employees:R. Davis and J. Jones- Dept D2 has budget $350,000 andEmps: L. Chu, A. Lee andmanager R. Davis

&1

&2 &3

&5&4 &6 &9&8 &10&7

&12&11 &13 &14

rootUnique node identifier

D1 D2name name name name

emp emp emp empdnomgr

dnobudget

R. Davis J. Jones L. Chu A. Lee

$350,000

dept dept

Page 7: Database Systems Part VII: XML Software School of Hunan University 2006.10.

Semistructured Data Question

Given the following data, represent it in a semistructured graph:

Employees: (eno, ename, dept) E1, J. Doe, D1 E3, A. Lee, D2Departments: (dno, dname) D1, Management D2, ConsultingWorksOn: (eno, pno, hours) E1, P1, 12 E3, P3, 10 E3, P4, 48

How would your graph changeif instead of D1 this value was NULL?

Page 8: Database Systems Part VII: XML Software School of Hunan University 2006.10.

Object Exchange Model (OEM)

Object Exchange Model (OEM) is a semistructured data

model developed for the TSIMMIS integration project.

Data in OEM consists of objects where each object has:

a unique identifier (e.g. &6)

a text label (string)

a type (string)

a value

Atomic objects have a value for a base type and have no

children nodes when represented in a diagram.

Complex objects have a set of object identifiers as their value, and have children nodes in the diagram.

Page 9: Database Systems Part VII: XML Software School of Hunan University 2006.10.

Object Exchange Model (OEM) Example

The previous data could be represented in the OEM model as:(&1, root, set, {&2, &3})(&2, Dept, set, {&4, &5, &6})(&3, Dept, set, {&5, &7, &8, &9, &10})(&4, name, string, "D1")(&5, Emp, set, {&11})(&6, Emp, set, {&12})(&7, name, string, "D2")(&8, Emp, set, {&13})(&9, Emp, set, {&14})(&10, budget, double, 350000)(&11, name, string, "R. Davis")(&12, name, string, "J. Jones")(&13, name, string, "L. Chu")(&14, name, string, "A. Lee")

Page 10: Database Systems Part VII: XML Software School of Hunan University 2006.10.

XML

Extensible Markup Language (XML) is a markup language that allows for the description of data semantics.

XML is a markup language for describing any type of data because the markup terms can be user-defined.

XML derives from SGML (Standard Generalized Markup Language) for structuring documents.

XML is interoperable with both HTML and SGML.

Note that XML is case-sensitive unlike HTML.

XML is standardized by the World Wide Web Consortium (W3C).

Page 11: Database Systems Part VII: XML Software School of Hunan University 2006.10.

Advantages of XMLSome advantages of XML: Simplicity The XML standard is relatively short and less complex that S

GML. Open standard Standardized by W3C to be vendor and platform independent. Extensibility XML allows users to define their own tags. Separation of data and presentation XML data may be presented to the user in multiple ways. Interoperability(互用性 ,协同工作能力 ) By standardizing on tags, interoperation and integration of sys

tems is simplified.

Page 12: Database Systems Part VII: XML Software School of Hunan University 2006.10.

XML Components An XML document is a text document that contains markup in the form of

tags.

An XML document consists of: An XML declaration line indicating the XML version. Elements (or tags) called markup. Each element may contain free-text,

attributes, or other nested elements.

Every XML document has a single root element.

Tags, as in HTML, are matched pairs, as <FOO> … </FOO>.

Closing tags are not needed if the element contains no data: <FOO/>

Tags may be nested. An attribute is a name-value pair declared in an element. Comments

Note that XML data is ordered by nature.

Page 13: Database Systems Part VII: XML Software School of Hunan University 2006.10.

XML Example

<?xml version = "1.0" encoding="UTF-8" standalone="no"?>

<?xml:stylesheet type="text/xsl" href="dept.xsl"?>

<!DOCTYPE root SYSTEM "dept.dtd">

<!-- Department/Employee data formatted in XML -->

<root><Dept dno = "D1">

<Emp eno="E7"><name>R. Davis</name></Emp>

<Emp eno="E8"><name>J. Jones</name></Emp>

</Dept>

<Dept dno = "D2" mgr = "E7">

<Emp eno="E6"><name>L. Chu</name></Emp>

<Emp eno="E3"><name>A. Lee</name></Emp>

<budget>350000</budget>

</Dept>

</root>

Element reference

AttributeComment

External DTDfor validation

Stylesheet forpresentation

XML declaration

Page 14: Database Systems Part VII: XML Software School of Hunan University 2006.10.

Valid and Well-Formed XML Documents

An XML document is well-formed if it obeys the syntax of the XML standard. This includes:

Having a single root element

All elements must be properly closed and nested.

An XML document is valid if it is well-formed and it conforms to its Document Type Definition (DTD).

A document can be well-formed without being valid if it contains tags or nesting structures that are not allowed in its DTD.

The DTD is a schema definition for an XML document.

Page 15: Database Systems Part VII: XML Software School of Hunan University 2006.10.

Namespaces

Namespaces allow tag names to be qualified to avoid naming conflicts. A naming conflict would occur when the same name is used by two different domains or vocabularies.

A namespace consists of two components:

1) A declaration of the namespace and its abbreviation.

2) Prefixing tag names with the namespace name to exactly define the tag's origin.

Page 16: Database Systems Part VII: XML Software School of Hunan University 2006.10.

Namespaces Example

<?xml version = "1.0" encoding="UTF-8" standalone="no"?><root xmlns = "http://www.foo.com"

xmlns:n1 = "http://www.abc.com"><Dept dno = "D1">

<Emp eno="E7"><name>R. Davis</name></Emp><Emp eno="E8"><name>J. Jones</name></Emp>

</Dept><Dept dno = "D2" mgr = "E7">

<Emp eno="E6"><name>L. Chu</name></Emp><Emp eno="E3"><name>A. Lee</name></Emp><n1:budget>350000</budget>

</Dept></root>

budget is a XML tag in the n1 namespace.

n1 namespace

Default namespace

Page 17: Database Systems Part VII: XML Software School of Hunan University 2006.10.

Schemas for XML

Although the totally unrestricted format of XML and semistructured data is valuable to many applications, database data normally has some structure, even though that structure may not be as rigid as required in relational schemas.

Thus, it is valuable to define schemas for XML documents that restrict the format of those documents.

There are two main ways of specifying a schema for XML:

Document Type Definition (DTD)

XML Schema

Page 18: Database Systems Part VII: XML Software School of Hunan University 2006.10.

Document Type Definitions (DTDs)

A Document Type Definition (DTD) defines the grammatical rules for the document. It is not required for an XML document but provides a mechanism for checking a document's validity.

General DTD form:<!DOCTYPE myroot [ <elements> ]>

Essentially, a DTD is a set of document rules expressed using EBNF (Extended Backus-Naur Form) grammar. The rules limit:

the set of allowable element names, how elements can be nested, and the attributes of an element among other things

contents of DTD declareselements and attributes

name of root element in XML document

Page 19: Database Systems Part VII: XML Software School of Hunan University 2006.10.

Document Type Definitions (DTDs)

<!DOCTYPE root [

<!ELEMENT root(Dept+)>

<!ELEMENT Dept(Emp*, budget?)>

<!ATTLIST Dept dno ID #REQUIRED>

<!ATTLIST Dept mgr IDREF #IMPLIED>

<!ELEMENT budget (#PCDATA)>

<!ELEMENT Emp (name)>

<!ATTLIST Emp eno ID #REQUIRED>

<!ELEMENT name (#PCDATA)>

]>

+ means 1 or more times

* means 0 or more times? means 0 or 1 time

Element reference(like a foreign key)

ID is a unique value thatidentifies the element

Parsed Character Data(atomic value)

Page 20: Database Systems Part VII: XML Software School of Hunan University 2006.10.

DTD Elements

Each element declaration has the form:

<!ELEMENT <name> ( <components> )>

Notes: Each !ELEMENT declaration specifies that an element is bein

g created. Components is either a list of one or more subelements or #

PCDATA, or EMPTY (has no content). Each subelement may occur exactly once, zero or one time

(?), zero or more (*), or one or more (+). An element that is a composite element (contains other elem

ents) specifies the other elements in parentheses. These elements must appear in the document in the order specified.

Page 21: Database Systems Part VII: XML Software School of Hunan University 2006.10.

DTD Attributes

Attributes are declared using:

<!ATTLIST <eltName> <attrName> <type> <value>)

Notes: Attribute names must be unique within an element. The type of attribute may be (among others):

CDATA - non-parsed string ID - element identifier (ID valued attributes must start with a letter) IDREF or IDREFS - one or more element references

The value source may be: #IMPLIED - attribute is optional #REQUIRED - always present #FIXED "default value"- declared default value

Page 22: Database Systems Part VII: XML Software School of Hunan University 2006.10.

DTD Attribute Example

Values for an attribute may be specified: <!ELEMENT Emp (name,salary)> <!ATTLIST Emp title ("EE"|"SA"|"PR"|"ME") "EE"> <!ATTLIST Emp eno ID #REQUIRED> <!ATTLIST Emp bdate CDATA #IMPLIED>

Page 23: Database Systems Part VII: XML Software School of Hunan University 2006.10.

DTD ID and IDREF

An ID attribute is used to uniquely identify an element within a document. It is used to model keys. Note however that the scope of the uniqueness is the entire document not elements of the given type. An ID is like declaring named anchors( 锚 ) in HTML. An ID must be a string starting with a letter and must be unique throughout the whole document.

IDREF/IDREFS are attributes that allow an element to refer to another element in the document. An IDREF attribute contains a single value that references a named location in the document (an existing ID value). An IDREFS attribute contains a list of ID references. Using IDREF allows the structure of an XML document to be a graph, rather than just a tree.

Page 24: Database Systems Part VII: XML Software School of Hunan University 2006.10.

DTD Choice

The symbol "|" can connect alternative sequences of tags.

For example, a name may have an optional title, a first name, and a last name, in that order, or is a full name:

<!ELEMENT NAME ( (TITLE?, FIRST, LAST) | FULLNAME)>

Page 25: Database Systems Part VII: XML Software School of Hunan University 2006.10.

Element identify, IDs and ID References

• Example: Branch has staff

<!ATTLIST STAFF staffNo ID #REQUIRED>

<!ATTLIST BRANCH staff IDREFS #IMPLIED>

<STAFF staffNo = “SL21”>

<NAME>

<FNAME>John</FNAME><LNAME>White</LNAME>

</NAME>

</STAFF><STAFF staffNo = “SL41”> <NAME> <FNAME>Julie</FNAME><LNAME>Lee</LNAME> </NAME></STAFF><BRANCH staff = “SL21 SL41”> <BRANCHNO>B005</BRANCHNO></BRANCH>

Page 26: Database Systems Part VII: XML Software School of Hunan University 2006.10.

Using DTDs in an XML Document

A DTD can either be either embedded in the XML document itself or specified as an external file.

Embedding:

Set STANDALONE = "yes" in XML declaration.

Put DTD right after XML declaration line.

Separate file:

Set STANDALONE = "no" in XML declaration.

Put DTD in another file say myfile.dtd.

Put the line in the XML document:

<!DOCTYPE myroot SYSTEM "myfile.dtd">

privateDTD

DTD filelocation

name of root element in XML document

Page 27: Database Systems Part VII: XML Software School of Hunan University 2006.10.

DTD Question

Build a DTD for our relational database:

Emp (eno, ename, title, salary, dno)

Project (pno, pname, budget, dno)

WorksOn (eno, pno, resp, dur)

Department (dno, dname, mgr)

Page 28: Database Systems Part VII: XML Software School of Hunan University 2006.10.

XML Schema

While DTDs are sufficient for many uses of XML, they are lacking compared to other forms of database schema. Specifically, they do not have good support for:

data types constraints explicit primary keys and foreign key references.

Further, DTDs are written in a non-XML syntax.

XML Schema was defined by the W3C to provide a standard XML schema language which itself is written in XML and has better support for typical data modeling issues.

Page 29: Database Systems Part VII: XML Software School of Hunan University 2006.10.

Types in XML Schema

Elements that contain other elements are called complex types:<xsd:element name = "Dept">

<xsd:complexType><xsd:sequence>

// Children defined here</xsd:sequence>

</xsd:complexType></xsd:element>

Elements that have no subelements are simple types:

<xsd:element name = "name" type = "xsd:string" /><xsd:element name = "budget" type = "xsd:decimal" />

Page 30: Database Systems Part VII: XML Software School of Hunan University 2006.10.

Cardinality in XML Schema

The number of occurrences of elements can be controlled by cardinality constraints minOccurs and maxOccurs:

<xsd:element name = "Emp"

minOccurs="0" maxOccurs="unbounded">

<xsd:element name = "budget" type = "xsd:decimal"

minOccurs="1" maxOccurs="1" />

Page 31: Database Systems Part VII: XML Software School of Hunan University 2006.10.

Constraints in XML Schema

Uniqueness constraints dictate that the value of an element (with a specified path) must be unique:

<xsd:unique name = "UniqueDno"><xsd:selector xpath = "Dept" /><xsd:field xpath = "@dno" />

</xsd:unique>

Key constraints are uniqueness constraints where the value cannot be null.

<xsd:key name = "EmpKey"><xsd:selector xpath = "Dept/Emp" /><xsd:field xpath = "@eno" />

</xsd:key>

Page 32: Database Systems Part VII: XML Software School of Hunan University 2006.10.

Constraints in XML Schema (2)

Reference constraints forces the value of a reference to be constrained to specified keys:

<xsd:keyref name = "DeptMgrFK" refer "EmpKey"><xsd:selector xpath = "Dept" />

<xsd:field xpath = "@mgr" />

</xsd:keyref>

Page 33: Database Systems Part VII: XML Software School of Hunan University 2006.10.

Other Features of XML Schema

XML Schema also allows you to:

Define your own types

Define schema references to simplify schema maintenance

Define groups of elements or attributes

Use groups to allow for schema variations and choices

Construct elements that are lists or unions of values

Page 34: Database Systems Part VII: XML Software School of Hunan University 2006.10.

XML Schema Example<?xml version = "1.0"><xsd:schema xmlns:xsd = "http://www.w3.org/2000/10/XMLSchema"><xsd:element name = "root"> <xsd:complexType> <xsd:sequence> <xsd:element name = "Dept" minOccurs="1" maxOccurs="unbounded"> <xsd:complexType> <xsd:attribute name = "dno" type = "xsd:string" /> <xsd:attribute name = "mgr" type = "xsd:string" /> <xsd:element name = "Emp" minOccurs="0" maxOccurs="unbounded"> <xsd:complexType> <xsd:element name = "name" type = "xsd:string" /> <xsd:attribute name = "eno" type = "xsd:string" /> </xsd:complexType> </xsd:element> <xsd:element name="budget" minOccurs="0" type ="xsd:decimal" /> </xsd:complexType> </xsd:element>

Page 35: Database Systems Part VII: XML Software School of Hunan University 2006.10.

XML Schema Example

</xsd:sequence> </xsd:complexType></xsd:element><xsd:key name = "DeptKey">

<xsd:selector xpath = "Dept" /><xsd:field xpath = "@dno" />

</xsd:key><xsd:key name = "EmpKey">

<xsd:selector xpath = "Dept/Emp" /><xsd:field xpath = "@eno" />

</xsd:key><xsd:keyref name = "DeptMgrFK" refer "EmpKey">

<xsd:selector xpath = "Dept" /><xsd:field xpath = "@mgr" />

</xsd:keyref></xsd:schema>

Page 36: Database Systems Part VII: XML Software School of Hunan University 2006.10.

XML Parsers

An XML parser processes the content and structure of an XML document and may use its DTD (if one is present).

Internet Explorer 5 has a built-in XML parser.

XML parsers read in the XML document and determine if the document is well-formed and valid (if a DTD is provided).

Once a document is parsed, programs may manipulate the document using one of two common interfaces: DOM and SAX.

Note that you can write and parse XML documents without using a parser as the document itself is just a text file.

Page 37: Database Systems Part VII: XML Software School of Hunan University 2006.10.

DOM and SAX

Document Object Model (DOM) and Simple API for XML (SAX) are two APIs for manipulating XML documents.

DOM is a tree-based API that parses the entire document into an in-memory tree and provides methods for traversing the tree.

SAX is an event-based serial access method. As a parser reads an XML file, it generates events when a new element is seen, when an element is closed, etc.

A SAX program has event handler methods that perform actions when an event is fired.

The difference between DOM and SAX is that DOM may be costly for large documents, but DOM programs are normally easier to write.

Page 38: Database Systems Part VII: XML Software School of Hunan University 2006.10.

XML and HTML

XML documents are considered a data source.

An XML document that exists in an HTML document is called a data island.

Page 39: Database Systems Part VII: XML Software School of Hunan University 2006.10.

XML in HTML Example

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"><HTML> <! -- A contact list stored in XML displayed in HTML --><BODY><XML ID = "xmlDoc"> <!-- xmlDoc is the name for the XML info -->

<contacts><contact>

<LastName>Smith</LastName><FirstName>Bob</FirstName>

</contact><contact>

<LastName>Henry</LastName><FirstName>Steve</FirstName>

</contact><contact>

<LastName>Jones</LastName><FirstName>Fred</FirstName>

</contact></contacts>

Page 40: Database Systems Part VII: XML Software School of Hunan University 2006.10.

XML in HTML Example (2)

</XML><TABLE BORDER = "1" DATASRC = "#xmlDoc">

<THEAD> <TR>

<TH>Last Name</TH><TH>First Name</TH>

</TR></THEAD><TBODY> <TR>

<TD><SPAN DATAFLD = "LastName"></SPAN></TD><TD><SPAN DATAFLD = "FirstName"></SPAN></TD></TR>

</TBODY></TABLE></BODY></HTML>

Page 41: Database Systems Part VII: XML Software School of Hunan University 2006.10.

XPath

XPath is a declarative query language for XML that provides syntax for addressing parts of an XML document.

XPath allows collections of elements to be retrieved by specifying a directory like path with optional conditions.

We will talk more about XPath when we look at XML querying.

Page 42: Database Systems Part VII: XML Software School of Hunan University 2006.10.

XSL and XSLT

XSL (eXtenstible Stylesheet Language) is a W3C recommendation that defines how XML data is displayed.

It is similar but more powerful than Cascading Stylesheet Specification (CSS) used with HTML.

XSLT (eXtenstible Stylesheet Language for Transformations) is a subset of XSL that provides a method for transforming XML (or other text documents) into other documents.

Page 43: Database Systems Part VII: XML Software School of Hunan University 2006.10.

XSL Example

<?xml version = "1.0"?><xsl:stylesheet xmlns:xsl ="http://www.w3.org/TR/WD-xsl"><xsl:template match = "/">

<html><head><title>My Title</title></head><body><table><xsl:for-each select = "Dept">

<tr><td><xsl:value-of select="Dept[@dno]"></td><td><table><xsl:for-each select = "Dept/Emp"><tr><td><xsl:value-of select="Dept/Emp/name"></td></tr>

</xsl:for-each></table></td></tr>

</xsl:for-each></table></body></html>

</xsl:template></xsl:stylesheet>

paths in XML document

Stylesheet declaration with namespace

Repeat for other columns headings

Page 44: Database Systems Part VII: XML Software School of Hunan University 2006.10.

XSL Example Output

<html><head><title>My Title</title></head><body>

<table><tr><td>D1</td><td> <table>

<tr><td>R. Davis</td></tr><tr><td>J. Jones</td></tr>

</table></td></tr><tr><td>D2</td><td> <table>

<tr><td>L. Chu</td></tr><tr><td>A. Lee</td></tr>

</table></td></tr>

</table></body></html>

Page 45: Database Systems Part VII: XML Software School of Hunan University 2006.10.

XHTML

Extensible Hypertext Markup Language (XHTML) includes HTML 4 and XML and is the proposed successor (继承者 ) to HTML 4.

XHTML contains the HTML tag set for backwards compatibility(向后兼容 ).

XHTML documents are well-formed documents as they are verified using a DTD.

For more information on XHTML see:

http://www.w3.org/TR/xhtml1

Page 46: Database Systems Part VII: XML Software School of Hunan University 2006.10.

Conclusion

Semistructured data is data that may have a varyingstructure. The data typically has some structure, but the data does not conform to a fixed schema.

Extensible Markup Language (XML) is a markup languagethat allows for the description of data semantics.

An XML document does not need a schema to be well-formed.An XML document is valid if it conforms to its DTD schema.XML Schemas can define schemas with more precision.

XML has many associated standards including parsing(DOM/SAX), namespaces, querying (XPath/XQuery) anddisplay and transformations (XSL,XSLT,XHTML).