Representing Web Data: XML - Engineeringgvj/Courses/CSI3140/lectures/XML.pdf · Guy-Vincent Jourdan...
Transcript of Representing Web Data: XML - Engineeringgvj/Courses/CSI3140/lectures/XML.pdf · Guy-Vincent Jourdan...
Representing Web Data:XML
CSI 3140
WWW Structures, Techniques and Standards
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 2
XML
Example XML document:
An XML document is one that follows
certain syntax rules (most of which we
followed for XHTML)
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 3
XML Syntax
An XML document consists of
Markup
Tags, which begin with < and end with >
References, which begin with & and end with ;
Character, e.g.  
Entity, e.g. <
The entities lt, gt, amp, apos, and quot are recognized in every XML document.
Other XHTML entities, such as nbsp, are only recognized in other XML documents if they are defined in the DTD
Character data: everything not markup
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 4
XML Syntax
Comments
Begin with <!--
End -->
Must not contain –
CDATA section
Special element the entire content of which is interpreted
as character data, even if it appears to be markup
Begins with <![CDATA[
Ends with ]]> (illegal except when ending CDATA)
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 5
XML Syntax
The CDATA section
is equivalent to the markup
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 6
XML Syntax
< and & must be represented by references
except
When beginning markup
Within comments
Within CDATA sections
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 7
XML Syntax
Element tags and elements
Three types
Start, e.g. <message>
End, e.g. </message>
Empty element, e.g. <br />
Start and end tags must properly nest
Corresponding pair of start and end element tags plus
everything in between them defines an element
Character data may only appear within an element
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 8
XML Syntax
Start and empty-element tags may contain
attribute specifications separated by white
space
Syntax: name = quoted value
quoted value must not contain <, can contain &
only if used as start of reference
quoted value must begin and end with matching
quote characters („ or “)
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 9
XML Syntax
Element and attribute names are case
sensitive
XML white space characters are space,
carriage return, line feed, and tab
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 10
XML Documents
A well-formed XML document
follows the XML syntax rules and
has a single root element
Well-formed documents have a tree structure
Many XML parsers (software for
reading/writing XML documents) use tree
representation internally
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 11
XML Documents
An XML document is written according to
an XML vocabulary that defines
Recognized element and attribute names
Allowable element content
Semantics of elements and attributes
XHTML is one widely-used XML
vocabulary
Another example: RSS (rich site summary)
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 12
XML Documents
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 13
XML Documents
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 14
XML Documents
Valid names and content for an XML
vocabulary can be specified using
Natural language
XML DTDs (Chapter 2)
XML Schema (Chapter 9)
If DTD is used, then XML document can
include a document type declaration:
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 15
XML Documents
Two types of XML parsers:
Validating
Requires document type declaration
Generates error if document does not
Conform with DTD and
Meet XML validity constraints
Example: every attribute value of type ID must be unique within the document
Non-validating
Checks for well-formedness
Can ignore external DTD
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 16
XML Documents
Good practice to begin XML documents with
an XML declaration
Minimal example:
If included, < must be very first character of the
document
To override default UTF-8/UTF-16 character
encoding, include encoding declaration
following version:
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 17
XML Documents
Internal subset of DTD
Entity vsn will be defined by any XML parser,
validating or not
Declaration of
internal subset of DTD
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 18
XML Namespaces
XML Namespace: Collection of element and
attribute names associated with an XML vocabulary
Namespace Name: Absolute URI that is the name
of the namespace
Ex: http://www.w3.org/1999/xhtml is the namespace
name of XHTML 1.0
Default namespace for elements of a document is
specified using a form of the xmlns attribute:
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 19
XML Namespaces
Another form of xmlns attribute known as a
namespace declaration can be used to
associate a namespace prefix with a
namespace name:Namespace prefix
Namespace declaration
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 20
XML Namespaces
Example use of namespace prefix:
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 21
XML Namespaces
In a namespace-aware XML application, all
element and attribute names are considered qualified
names
A qualified name has an associated expanded name that
consists of a namespace name and a local name
Ex: item is a qualified name with expanded name
<null, item>
Ex: xhtml:a is a qualified name with expanded name
<http://www.w3.org/1999/xhtml, a>
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 22
XML Namespaces
Other namespace usage:A namespace can be declared and used on the same element
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 23
XML Namespaces
Other namespace usage:
A namespace prefix can be redefined for
an element and its content
These elements belong to http://www.example.org/namespace
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 24
JavaScript and XML
JavaScript DOM can be used to process
XML documents
JavaScript XML Dom processing is often
used with XMLHttpRequest
Host object that is a constructor for other host
objects
Sends an HTTP request to server, receives back
an XML document
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 25
JavaScript and XML
Example use:
Previous visit count servlet: must reload
document to see updated count
Visit count with XMLHttpRequest: browser
will automatically update the visit count
periodically without reloading the entire page
Jackson, Web Technologies: A Computer Science Perspective, © 2007 Prentice-Hall, Inc. All rights reserved. 0-13-185603-0
26
JavaScript and XMLDocument
generated by
GET request to
VisitCountUpdate
servlet
JavaScript file using
XMLHttpRequest object
span that will be updated by JavaScript code
Jackson, Web Technologies: A Computer Science Perspective, © 2007 Prentice-Hall, Inc. All rights reserved. 0-13-185603-0
27
JavaScript and XML
XMLHttpRequest
request is processed
by doPost() method of
servlet
Response is XML document
Current visit count is returned as content
of count XML element
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 28
JavaScript and XML
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 29
JavaScript and XML
Typical code for creating an instance of XMLHttpRequest
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 30
JavaScript and XML
Return immediately after sending request
(asynchronous behavior)
Function called as
state of connection
changes
Body of request (empty in this example)
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 31
JavaScript and XML
Indicates response received
successfully
Root of
returned
XML
document
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 32
JavaScript and XML
Ajax: Asynchronous JavaScript and XML
Combination of
(X)HTML
XML
CSS
JavaScript
JavaScript DOM (HTML and XML)
XMLHttpRequest in asynchronous mode
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 33
Java-based DOM
Java DOM API defined by org.w3c.dom
package
Semantically similar to JavaScript DOM
API, but many small syntactic differences
Nodes of DOM tree belong to classes such as
Node, Document, Element, Text
Non-method properties accessed via methods
Ex: parentNode accessed by calling
getParentNode()
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 34
Java-based DOM
Methods such as
getElementsByTagName() return
instance of NodeList
getLength() method returns # of items
item() method returns an item
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 35
Java-based DOM
Example: program to count link elements
in an RSS document:
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 36
Java-based DOM
Imports:
From Java
API for XML
Processing
(JAXP)
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 37
Java-based DOM
Default parser is non-validating and non-
namespace-aware.
Overriding:
Also setValidating(true)
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 38
Java-based DOM
Namespace-aware versions of methods end
in NS:Namespace name
Local name
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 39
SAX
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 40
SAX
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 41
SAX
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 42
SAX
Used if not namespace-aware or
if qualified name does not belong
to any namespace.
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 43
SAX
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 44
Transformations
JAXP provides API for transforming
between DOM, SAX, and Stream (text)
representations of XML documents
Example:
Input from stream to DOM
Modify DOM
Output as stream
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 45
Transformations
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 46
Transformations
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 47
Transformations
“SAX output” means that a SAX event
handler is called:
Example: the code
feeds the XML document represented by DOM
document through the SAX event handler
CountElementsHelper()
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 48
XSL
The Extensible Stylesheet Language (XSL)
is an XML vocabulary typically used to
transform XML documents from one form to
another form
XSL document
Input XML
document XSLT ProcessorOutput XML
document
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 49
XSL
XSL
markup
Everything in the body
of the document that is
not XSL markup is
template data
Example XSL document
HelloWorld.xsl
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 50
XSL
Input XML document HelloWorld.xml:
Output XML document:
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 51
XSL
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 52
XSL
Components of XSL:
XSL Transformations (XSLT): defines XSL
namespace elements and attributes
XML Path Language (XPath): used in many
XSL attribute values (ex: child::message)
XSL Formatting Objects (XSL-FO): XML
vocabulary for defining document style (print-
oriented)
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 53
XPath
XPath operates on a tree representation of an XML document
Similar to DOM tree in that nodes have different types (element, text, comment, etc.)
Unlike DOM, attributes are also nodes in the XPath tree
Root of XPath tree called document root
One of the children of the document root, called the document element, is the root element of the XML document
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 54
XPath
Location path: expression representing one
or more XPath tree nodes
/ represents document root
child::message is an example of a location
step and has two parts:
Axis name Node test
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 55
XPath
Attribute nodes are
only seen along the attribute axis
XSLT specifies
context node
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 56
XPath
Node test:
Name test: qualified name representing an
element (or attribute, for attribute axis) type
Example: child::message uses a name test
May use * as wildcard name test
Node-type test:
text(): true if node is a text node
comment(): true if node is a comment node
node(): true of any node
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 57
XPath
A location step can have one or more
predicates that act as filters:
This predicate
applied first
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 58
XPath
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 59
XPath
Abbreviations:
Axis defaults to child if not specified
child::para = para
@ can be used in place of attribute::
attribute::display = @display
parent::node() = ..
self::node() = .
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 60
XPath
A location path is one or more location steps
separated by /
Ex: child::para/child::strong (or
just para/strong)
Ex: para/strong/emph
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 61
XPath
Evaluating a two-step location path:
Evaluate first location step, producing node list L1
For each node ni in L1
Temporarily set context node to ni
Evaluate second location step, producing node list Li
Result is union of Lis
Continue process for paths with more steps
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 62
XPath
If body is context node, then:
para/strong represents {s1,s2,s4}
para[strong] represents {p1, p3}
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 63
XPath
An absolute location path begins with / and uses
the document root as the context node
Ex: /body/para represents all para nodes that are
children of body which is child of document root
Ex: / represents list consisting only of the document
root
A relative location path does not begin with / and
uses an element determined by the application (e.g.,
XSLT) as the context node
Ex: body/para
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 64
XPath
Another abbreviation:
/descendant-or-self::node()/ = //
Examples:
//strong: All strong elements in document
.//strong: All strong elements that are
descendants (or self) relative to the context node
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 65
XPath
Combining node lists:
Use | to represent union of node lists produced
by individual location paths
Ex: strong|descendant::emph
represents all nodes that are either
children of the context node of type strong; or
descendants of the context node of type emph
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 66
XSLT
Template
rule
Pattern of template rule
Template
of template
rule
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 67
XSLT
XSLT processor deals with three XPath
trees:
Input trees: source and style-sheet
Elements containing only white space are normally
not included in either input tree (exception:
xsl:text element)
White space retained within other elements
Output tree: result
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 68
XSLT
XSLT processing (high level):
Construct input trees
Initialize empty result tree
Search source tree for a node that is matched by a
template rule, i.e., a node that is contained in the node list
represented by the pattern of some template rule
Instantiate the template of the matching template rule in
the result tree
Context node for XPath expressions is matched node
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 69
XSLT
Matches source tree document root
Context node for relative location
path is source tree document root
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 70
XSLT
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 71
XSLT
Restrictions on XPath in template rule
pattern (value of match attribute):
Only child and attribute axes are allowed
directly (can indirectly use descendant-or-
self axis via // notation)
XPath expression must evaluate to a node list
(some XPath expressions are functions that
produce string values)
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 72
XSLT
…
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 73
XSLT
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 74
XSLT
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 75
XSLT
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 76
XSLT
Adding attributes:
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 77
XSLTSource document elements are in a namespace
We want to copy
all h1 elements
plus all of their
descendants
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 78
XSLT
XPath expressions must
use qualified names (because source
includes namespace)
Copy
element
and content
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 79
XSLT
Template markup
in the result tree becomes
Most browsers will not accept this notation!
XSLT does not recognize
Solution:
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 80
XSLT
Adding XML special characters to the result
Template:
Result:
disable-output-escaping also applies to value-of
Becomes & on input
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 81
XSLT
Output formatting:
Add xml:space=“preserve” to
transform element of template to retain white
space
Use xsl:output element:
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 82
XML and Browsers
An XML document can contain a processing
instruction telling a browser to:
Apply XSLT to create an XHTML document:
Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson’s slides 83
XML and Browsers
An XML document can contain a processing
instruction telling a browser to:
Apply CSS to style the XML document: