Structured-Document Processing Languages Spring 2004 Course Review Repetitio mater studiorum est!

Goals of the Course

To become familiar with the most important models and languages for
– manipulating
– representing
– transforming and
– querying
structured documents (or XML)

Generic XML processing technology
– very little about specific XML applications or commercial systems

Methodological Goals

Some central professional skills
– consulting of technical specifications
– experimenting with SW implementations

Ability to think…?
– to find out relationships
– to apply knowledge in new situations

("Pidgin English" for scientific communication)

XML?

Extensible Markup Language is not a markup language!
– does not fix a tag set or its semantics (as markup languages such as HTML do)

XML is
– A way to use markup to represent information
– A metalanguage
» supports definition of specific markup languages through XML DTDs or Schemas
» E.g. XHTML, a reformulation of HTML using XML

XML Encoding of Structure: Example

<S>
  <W>Hello</W>
  <E A='1'></E>
  <W>world!</W>
</S>

[Tree diagram: root S with children W (text "Hello"), E (attribute A=1) and W (text "world!")]

Basics of XML DTDs

A Document Type Declaration provides a grammar (document type definition, DTD) for a class of documents

Syntax (in the prolog of a document instance):
<!DOCTYPE rootElemType SYSTEM "ex.dtd"
  [ <!-- "internal subset" may come here -->
  ]>
where the "external subset" is in the file ex.dtd

The DTD is the union of the external and internal subsets

What Do Declarations Look Like?

<!ELEMENT invoice (client, item+)>
<!ATTLIST invoice num NMTOKEN #REQUIRED>
<!ELEMENT client (name, email?)>
<!ATTLIST client num NMTOKEN #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT item (#PCDATA)>
<!ATTLIST item
    price NMTOKEN #REQUIRED
    unit (FIM | EUR) "EUR" >
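One way to experiment with these declarations is to put them in the internal subset of a document and let a validating parser check some instances. The following is a minimal sketch, not part of the course material: the class name DtdValidationSketch, the helper isValid and the sample invoices are made up for illustration, and the JDK's default JAXP parser is assumed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class DtdValidationSketch {
    // The declarations from the slide, placed in the internal subset.
    static final String DOCTYPE =
        "<!DOCTYPE invoice [\n"
        + "<!ELEMENT invoice (client, item+)>\n"
        + "<!ATTLIST invoice num NMTOKEN #REQUIRED>\n"
        + "<!ELEMENT client (name, email?)>\n"
        + "<!ATTLIST client num NMTOKEN #REQUIRED>\n"
        + "<!ELEMENT name (#PCDATA)>\n"
        + "<!ELEMENT email (#PCDATA)>\n"
        + "<!ELEMENT item (#PCDATA)>\n"
        + "<!ATTLIST item price NMTOKEN #REQUIRED unit (FIM | EUR) \"EUR\">\n"
        + "]>\n";

    /** Returns true if the document body is valid against the DTD above (hypothetical helper). */
    static boolean isValid(String body) {
        final boolean[] valid = {true};
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // DTD validation is off by default
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                // Validity violations are reported as recoverable errors.
                public void error(SAXParseException e) { valid[0] = false; }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DOCTYPE + body)));
        } catch (Exception e) {
            return false; // not well-formed, or the parser could not be set up
        }
        return valid[0];
    }

    public static void main(String[] args) {
        String client = "<client num='7'><name>John Doe</name></client>";
        // An invoice with a client and one item.
        System.out.println(isValid("<invoice num='1'>" + client
            + "<item price='10'>pen</item></invoice>"));
        // An invoice without any item violates the content model (client, item+).
        System.out.println(isValid("<invoice num='1'>" + client + "</invoice>"));
    }
}
```

The error handler matters here: without one, a validating DocumentBuilder typically only prints validity errors and still returns a document.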

Element type declarations

The general form is
<!ELEMENT elementTypeName (E)>
where E is a content model, i.e. a regular expression of element names

Content model operators:
E | F : alternation
E, F : concatenation
E? : optional
E* : zero or more
E+ : one or more
(E) : grouping
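Because content models are regular expressions over element names, they can be mapped almost directly to an ordinary regex engine. The sketch below illustrates that correspondence only; it is not how XML parsers implement validation. The class ContentModelSketch and its methods are hypothetical, and #PCDATA, EMPTY and ANY are not handled.

```java
import java.util.regex.Pattern;

public class ContentModelSketch {
    /** Translates a content model over element names into a java.util.regex pattern. */
    static Pattern toPattern(String contentModel) {
        String regex = contentModel
            .replaceAll("\\s+", "")                            // drop whitespace
            .replaceAll("[A-Za-z_][A-Za-z0-9_.-]*", "(?:$0 )") // each name becomes a "name " token
            .replace(",", "");                                 // concatenation is implicit in a regex
        return Pattern.compile(regex);
    }

    /** Checks a sequence of child element names against a content model. */
    static boolean matches(String contentModel, String... children) {
        StringBuilder sequence = new StringBuilder();
        for (String child : children) {
            sequence.append(child).append(' ');
        }
        return toPattern(contentModel).matcher(sequence).matches();
    }

    public static void main(String[] args) {
        // (client, item+): one client followed by one or more items
        System.out.println(matches("(client, item+)", "client", "item", "item"));
        System.out.println(matches("(client, item+)", "client"));
        // (name, email?): the email is optional but may not repeat
        System.out.println(matches("(name, email?)", "name"));
        System.out.println(matches("(name, email?)", "name", "email", "email"));
    }
}
```

The operators |, ?, *, + and parentheses need no translation because they mean the same thing in both notations; only the comma has to be removed.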

XML Schema Definition Language

XML syntax
– schema documents easier to manipulate by programs (than the special DTD syntax)

Compatibility with namespaces
– can validate documents using declarations from multiple sources

Content datatypes
– 44 built-in datatypes (including primitive Java datatypes, datatypes of SQL, and XML attribute types)
– mechanisms to derive user-defined datatypes
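The effect of content datatypes can be tried out with a small schema that types an attribute as xs:decimal. This is a sketch under assumptions: the schema, the item element and the class SchemaDatatypeSketch are invented for illustration, and the javax.xml.validation package used here was added to the JDK after JAXP 1.1 (in Java 5).

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

public class SchemaDatatypeSketch {
    // A made-up schema: <item> has string content and a required decimal price.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='item'><xs:complexType><xs:simpleContent>"
        + "<xs:extension base='xs:string'>"
        + "<xs:attribute name='price' type='xs:decimal' use='required'/>"
        + "</xs:extension></xs:simpleContent></xs:complexType></xs:element>"
        + "</xs:schema>";

    /** Returns true if the document is valid against the schema above. */
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            // With no error handler set, validate() throws on the first validity error.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<item price='9.50'>pen</item>"));  // a decimal price
        System.out.println(isValid("<item price='cheap'>pen</item>")); // not a decimal
    }
}
```

A DTD could only declare price as NMTOKEN, which would accept both values; the datatype is what rejects the second one.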

XML Namespaces

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/TR/xhtml1/strict">
  <!-- XHTML is the 'default namespace' -->
  <xsl:template match="doc/title">
    <H1>
      <xsl:apply-templates />
    </H1>
  </xsl:template>
</xsl:stylesheet>
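To see how a processor resolves the prefixed and unprefixed names in this stylesheet, it can be parsed with a namespace-aware DOM parser. A minimal sketch, assuming the JDK's default parser; the class NamespaceSketch and the helper namespaceOf are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class NamespaceSketch {
    static final String XSL_NS = "http://www.w3.org/1999/XSL/Transform";
    static final String XHTML_NS = "http://www.w3.org/TR/xhtml1/strict";

    // The stylesheet from the slide, written on one line.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='" + XSL_NS + "' xmlns='" + XHTML_NS + "'>"
        + "<xsl:template match='doc/title'><H1><xsl:apply-templates/></H1></xsl:template>"
        + "</xsl:stylesheet>";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP parsers ignore namespaces by default
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Namespace URI of the first element with the given local name ("*" matches any namespace). */
    static String namespaceOf(String localName) {
        Element element = (Element) parse(STYLESHEET)
            .getElementsByTagNameNS("*", localName).item(0);
        return element.getNamespaceURI();
    }

    public static void main(String[] args) {
        System.out.println("template -> " + namespaceOf("template")); // bound via the xsl: prefix
        System.out.println("H1       -> " + namespaceOf("H1"));       // bound via the default namespace
    }
}
```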

3. XML Processor APIs

How can applications manipulate structured documents?
– An overview of document parser interfaces

3.1 SAX: an event-based interface
3.2 DOM: an object-based interface
3.3 JAXP: Java API for XML Processing

A SAX-based application

[Diagram: the application's main routine calls parse(); while reading the document, the parser invokes the application's callback routines]

Document:
<?xml version='1.0'?>
<A i="1">Hi!</A>

Callbacks and their arguments:
startDocument()
startElement()   "A",[i="1"]
characters()     "Hi!"
endElement()     "A"

DOM: What is it?

An object-based, language-neutral API for XML and HTML documents
– Allows programs and scripts to build, navigate, and modify documents
– a foundation for developing querying, filtering, transformation, rendering etc. applications on top of DOM implementations

In contrast to "Serial Access XML", one could think of DOM as "Directly Obtainable in Memory"

DOM structure model

<invoice form="00" type="estimated">
  <addressdata>
    <name>John Doe</name>
    <address>
      <streetaddress>Pyynpolku 1</streetaddress>
      <postoffice>70460 KUOPIO</postoffice>
    </address>
  </addressdata>
  ...

[Tree diagram of the corresponding DOM structure; node types shown: Document, Element, NamedNodeMap, Text]

Document
  invoice (Element; NamedNodeMap: form="00", type="estimated")
    addressdata
      name
        "John Doe" (Text)
      address
        streetaddress
          "Pyynpolku 1"
        postoffice
          "70460 KUOPIO"
    ...
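The node types in the diagram map directly to interfaces in org.w3c.dom. The following sketch walks the invoice fragment using only DOM Level 2 calls; the class DomNavigationSketch and the helper textOf are hypothetical, and the elided rest of the invoice ("...") is simply left out so that the sample is well-formed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomNavigationSketch {
    // The slide's fragment, closed after <addressdata> (the elided part is omitted).
    static final String INVOICE =
        "<invoice form='00' type='estimated'><addressdata>"
        + "<name>John Doe</name><address>"
        + "<streetaddress>Pyynpolku 1</streetaddress>"
        + "<postoffice>70460 KUOPIO</postoffice>"
        + "</address></addressdata></invoice>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Value of the Text node that is the first child of the first element with this name. */
    static String textOf(Document doc, String elementName) {
        Node element = doc.getElementsByTagName(elementName).item(0);
        return element.getFirstChild().getNodeValue();
    }

    public static void main(String[] args) {
        Document doc = parse(INVOICE);
        Element root = doc.getDocumentElement();  // the invoice Element
        NamedNodeMap attributes = root.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node attribute = attributes.item(i);
            System.out.println(attribute.getNodeName() + "=" + attribute.getNodeValue());
        }
        System.out.println(textOf(doc, "name"));
        System.out.println(textOf(doc, "postoffice"));
    }
}
```

Note that the character data is not a property of the element itself: it sits in a separate Text child, as in the tree above.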

Overview of XSLT Transformation

[Diagram: the Source Document is read into a Source Tree; the Transformation Process, controlled by the StyleSheet, builds a Result Tree; the Output Process writes it out as XML, Text or HTML]

JAXP 1.1

An interface for "plugging-in" and using XML processors in Java applications
– includes packages
» org.xml.sax: SAX 2.0 interface
» org.w3c.dom: DOM Level 2 interface
» javax.xml.parsers: initialization and use of parsers
» javax.xml.transform: initialization and use of transformers (XSLT processors)

Included in JDK starting from vers. 1.4

JAXP: Using a SAX parser (1)

[Diagram: SAXParserFactory .newSAXParser() → SAXParser .getXMLReader() → XMLReader .parse("f.xml"), which reads the XML file f.xml]
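The chain of calls in the diagram can be sketched as follows. To keep the example self-contained it parses a string instead of the file f.xml, and it records the SAX callbacks it receives; the class JaxpSaxSketch and the method trace are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class JaxpSaxSketch {
    /** Parses the document and returns the callbacks it triggered, in order. */
    static List<String> trace(String xml) {
        final List<String> events = new ArrayList<>();
        // The application's callback routines; DefaultHandler ignores everything else.
        DefaultHandler callbacks = new DefaultHandler() {
            @Override public void startDocument() {
                events.add("startDocument()");
            }
            @Override public void startElement(String uri, String local, String qName, Attributes atts) {
                StringBuilder event = new StringBuilder("startElement(" + qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    event.append(", ").append(atts.getQName(i)).append('=').append(atts.getValue(i));
                }
                events.add(event.append(')').toString());
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks.
                events.add("characters(" + new String(ch, start, length) + ")");
            }
            @Override public void endElement(String uri, String local, String qName) {
                events.add("endElement(" + qName + ")");
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser parser = factory.newSAXParser();
            XMLReader reader = parser.getXMLReader();
            reader.setContentHandler(callbacks);
            reader.parse(new InputSource(new StringReader(xml))); // or reader.parse("f.xml")
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        for (String event : trace("<?xml version='1.0'?><A i=\"1\">Hi!</A>")) {
            System.out.println(event);
        }
    }
}
```

The factory decides which parser implementation is plugged in; the application code only depends on the SAX interfaces.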

JAXP: Using a DOM parser (1)

[Diagram: DocumentBuilderFactory .newDocumentBuilder() → DocumentBuilder, which either reads f.xml with .parse("f.xml") or creates an empty document with .newDocument()]
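The .newDocument() branch of the diagram is the one used when a program builds a document from scratch rather than parsing one. A minimal sketch; the class JaxpDomBuildSketch, the method buildInvoice and the element names are chosen only for illustration.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class JaxpDomBuildSketch {
    /** Builds <invoice num="1"><item>pen</item></invoice> in memory. */
    static Document buildInvoice() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.newDocument();       // empty Document node

            Element invoice = doc.createElement("invoice");
            invoice.setAttribute("num", "1");
            doc.appendChild(invoice);                   // becomes the document element

            Element item = doc.createElement("item");
            item.appendChild(doc.createTextNode("pen"));
            invoice.appendChild(item);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = buildInvoice();
        Element root = doc.getDocumentElement();
        System.out.println(root.getTagName() + " num=" + root.getAttribute("num"));
    }
}
```

All nodes are created through the Document object, which acts as the factory for the nodes it owns.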

JAXP: Using Transformers (1)

[Diagram: TransformerFactory .newTransformer(…), given an XSLT script → Transformer .transform(.,.), which maps a source to a result]
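The two calls in the diagram can be tried out with a stylesheet held in a string. This is a sketch, assuming the XSLT processor bundled with the JDK; the class JaxpTransformSketch is hypothetical, and the stylesheet is a reduced variant of the namespaces example (without the default XHTML namespace, to keep the output short).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class JaxpTransformSketch {
    // Wraps the content of doc/title in an H1 element.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='doc/title'><H1><xsl:apply-templates/></H1></xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the stylesheet above to the given source document. */
    static String transform(String xml) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            // newTransformer(...) compiles the XSLT script into a Transformer.
            Transformer transformer = factory.newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            // transform(source, result): the two arguments shown as dots in the diagram.
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("<doc><title>Course Review</title></doc>"));
    }
}
```

Sources and results need not be streams: javax.xml.transform also provides DOM and SAX variants, so a Transformer can sit between the parser interfaces discussed earlier.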

Transformation & Formatting

[Diagram: an XSLT script and two processing phases, labelled I and II]

Page regions

A simple page can contain 1-5 regions, specified by child elements of the simple-page-master

Top-level formatting objects

Slightly simplified:

fo:root
  fo:layout-master-set
    (fo:simple-page-master | fo:page-sequence-master)+
      fo:simple-page-master
        fo:region-body
        fo:region-before?
        fo:region-after?
        fo:region-start?
        fo:region-end?
      fo:page-sequence-master
        specify masters for page sequences by referring to simple-page-masters
  fo:page-sequence+
    contents of pages

XML-wrapping

Need "XML-wrappers" (aka extractors)
– interface/conversion program to produce an XML representation for source data

[Diagram: source1, source2 and source3 are processed by wrapper1, wrapper2 and wrapper3, which produce XML-form-1 and XML-form-2]
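As a rough illustration of what a hand-written wrapper does, the sketch below converts line-oriented source data into an XML representation. Everything specific here is an assumption made for the example: the "name;email" input format, the clients/client element names and the class WrapperSketch. It is not the XW wrapper language discussed in the course.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class WrapperSketch {
    /** Wraps lines of the (hypothetical) form "name;email" into a <clients> document. */
    static String wrap(String sourceData) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element clients = doc.createElement("clients");
            doc.appendChild(clients);
            for (String line : sourceData.split("\n")) {
                String[] fields = line.trim().split(";");
                if (fields.length < 2) {
                    continue; // skip lines that do not look like name;email
                }
                Element client = doc.createElement("client");
                client.appendChild(field(doc, "name", fields[0]));
                client.appendChild(field(doc, "email", fields[1]));
                clients.appendChild(client);
            }
            // A Transformer without a stylesheet copies its source, i.e. serializes the DOM tree.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    static Element field(Document doc, String name, String value) {
        Element element = doc.createElement(name);
        element.appendChild(doc.createTextNode(value));
        return element;
    }

    public static void main(String[] args) {
        System.out.println(wrap("John Doe;jd@example.org\nA & B;ab@example.org"));
    }
}
```

Building the result as a tree and serializing it, instead of concatenating strings, means that characters such as & in the source data are escaped automatically.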

XW-architecture (3)

[Diagram: the XW-engine reads the source data and the wrapper description, and delivers the result document to the application through SAX]

source data:
A x1 x2
B y1 y2
z1 z2

wrapper description:
<xw:wrapper … > … </xw:wrapper>

result document:
<part-a>
  <e1>x1</e1>
  <e2>x2</e2>
</part-a>
<part-b>
  <line-1>
    <d1>y1</d1>
    <d2>y2</d2>
  </line-1>
  <d3>z2</d3>
</part-b>

Rearranging result structures

Wrapper description:
<whole>
  <xw:STORE xw:name="xx">
    <a xw:starter="A" xw:terminator="$"/>
  </xw:STORE>
  <b xw:starter="B">
    <b1 xw:starter="1"/>
    <b2 xw:starter="2"/>
    <xw:COPY-OF xw:select="xx"/>
    <b3 xw:starter="3"/>
  </b>
</whole>

Source data:
Axy..z$???
B..
1one 2two
3three

Result document:
<whole>
  <b>
    <b1>one</b1>
    <b2>two</b2>
    <a>xy..z</a>
    <b3>three</b3>
  </b>
</whole>

XQuery in a Nutshell

Functional expression language
Strongly-typed: (XML Schema) types may be assigned to expressions statically
Includes XPath 2.0 (says the Draft, but not all XPath axes are included!)
– XQuery 1.0 and XPath 2.0 share extensive functionality:
» XQuery 1.0 and XPath 2.0 Functions and Operators, WD 15/11/2002

Roughly: XQuery ≈ XPath' + XSLT' + SQL'

Course Main Message

XML is a universal way to represent information as tree-like data structures

There are specialized and powerful technologies for processing it

Development is still ongoing