Structured-Document Processing Languages Spring 2007


Page 1: Structured-Document Processing Languages Spring 2007

Structured-Document Processing Languages

Spring 2007

Course Review

Repetitio mater studiorum est! (Repetition is the mother of learning!)

Page 2: Structured-Document Processing Languages Spring 2007

Goals of the Course

Learn about central models and languages for
– manipulating
– representing
– transforming and
– querying
structured documents (or XML)

"Generic XML processing technology"

Page 3: Structured-Document Processing Languages Spring 2007

Methodological Goals

Central professional skills
– consulting technical specifications
– experimenting with SW implementations

Ability to think…?
– to find out relationships
– to apply knowledge in new situations

("Pidgin English" for scientific communication)

Page 4: Structured-Document Processing Languages Spring 2007

XML?

Extensible Markup Language is not a markup language!
– does not fix a tag set nor its semantics (as markup languages like HTML do)

XML is
– A way to use markup to represent information
– A metalanguage
» supports definition of specific markup languages through XML DTDs or Schemas
» E.g. XHTML, a reformulation of HTML using XML

Page 5: Structured-Document Processing Languages Spring 2007

XML Encoding of Structure: Example

<S>
  <W>Hello</W>
  <E A='1'></E>
  <W>world!</W>
</S>

[Figure: the corresponding tree. The root S has three children: W (text "Hello"), E (attribute A=1), and W (text "world!").]

Page 6: Structured-Document Processing Languages Spring 2007

Basics of XML DTDs

A Document Type Declaration provides a grammar (document type definition, DTD) for a class of documents

Syntax (in the prolog of document instance):
<!DOCTYPE rootElemType SYSTEM "ex.dtd"
  <!-- "external subset" in file ex.dtd -->
  [ <!-- "internal subset" may come here -->
  ]>

DTD = union of the external and internal subset

Page 7: Structured-Document Processing Languages Spring 2007

What Do Declarations Look Like?

<!ELEMENT invoice (client, item+)>
<!ATTLIST invoice num NMTOKEN #REQUIRED>
<!ELEMENT client (name, email?)>
<!ATTLIST client num NMTOKEN #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT item (#PCDATA)>
<!ATTLIST item
  price NMTOKEN #REQUIRED
  unit (FIM | EUR) "EUR" >
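
These declarations can be tried out with a validating parser. The following is a minimal sketch, assuming the declarations are placed in an internal subset and the document is parsed from a string; the class name DtdValidationSketch, the helper validate, and the sample attribute values are illustrative and not part of the course material.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DtdValidationSketch {
    // The invoice DTD from the slide, written as an internal subset.
    static final String DOCTYPE =
          "<!DOCTYPE invoice [\n"
        + "<!ELEMENT invoice (client, item+)>\n"
        + "<!ATTLIST invoice num NMTOKEN #REQUIRED>\n"
        + "<!ELEMENT client (name, email?)>\n"
        + "<!ATTLIST client num NMTOKEN #REQUIRED>\n"
        + "<!ELEMENT name (#PCDATA)>\n"
        + "<!ELEMENT email (#PCDATA)>\n"
        + "<!ELEMENT item (#PCDATA)>\n"
        + "<!ATTLIST item price NMTOKEN #REQUIRED unit (FIM | EUR) \"EUR\">\n"
        + "]>\n";

    // Parses DOCTYPE + body with DTD validation on and returns the
    // validity error messages (an empty list means the document is valid).
    static List<String> validate(String body) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        final List<String> errors = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* ignored in this sketch */ }
            public void error(SAXParseException e) { errors.add(e.getMessage()); }
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        builder.parse(new InputSource(new StringReader(DOCTYPE + body)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        String valid = "<invoice num=\"1\"><client num=\"7\"><name>John Doe</name></client>"
                     + "<item price=\"10\">Widget</item></invoice>";
        // Violates the content model (client, item+): no item element.
        String invalid = "<invoice num=\"1\"><client num=\"7\"><name>John Doe</name></client></invoice>";
        System.out.println("valid document, errors: " + validate(valid).size());
        System.out.println("invalid document, errors: " + validate(invalid).size());
    }
}
```

Validity errors are collected instead of thrown, because a validating parser reports them through the error handler and otherwise continues parsing.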

Page 8: Structured-Document Processing Languages Spring 2007

Element type declarations

The general form is
<!ELEMENT elementTypeName (E)>
where E is a content model, i.e. a regular expression of element names

Content model operators:
E | F : alternation
E, F : concatenation
E? : optional
E* : zero or more
E+ : one or more
(E) : grouping
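
The correspondence between content models and regular expressions can be illustrated with an ordinary regex engine. This is only a sketch under a simplifying assumption: the sequence of child element names is encoded as a string in which every name is followed by one space, so that the content models (client, item+) and (name, email?) become plain java.util.regex patterns. The class ContentModelSketch and its encoding are hypothetical; real validators work on the parsed DTD instead.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class ContentModelSketch {
    // (client, item+) : concatenation and one-or-more
    static final Pattern INVOICE = Pattern.compile("client (item )+");
    // (name, email?) : concatenation and optional
    static final Pattern CLIENT = Pattern.compile("name (email )?");

    // True if the whole sequence of child element names matches the model.
    static boolean matches(Pattern model, List<String> childNames) {
        StringBuilder encoded = new StringBuilder();
        for (String name : childNames) {
            encoded.append(name).append(' ');
        }
        return model.matcher(encoded).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(INVOICE, Arrays.asList("client", "item", "item")));
        System.out.println(matches(INVOICE, Arrays.asList("client")));
        System.out.println(matches(CLIENT, Arrays.asList("name")));
    }
}
```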

Page 9: Structured-Document Processing Languages Spring 2007

XML Schema Definition Language

XML syntax
– schema documents easier to manipulate by programs (than the DTD syntax)

Compatibility with namespaces
– can validate documents using declarations from multiple sources

Content datatypes
– 44 built-in datatypes (including primitive Java datatypes, datatypes of SQL, and XML attribute types)
– + user-defined datatypes

Page 10: Structured-Document Processing Languages Spring 2007

XML Namespaces

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/TR/xhtml1/strict">
  <!-- XHTML is the 'default namespace' -->
  <xsl:template match="doc/title">
    <H1>
      <xsl:apply-templates />
    </H1>
  </xsl:template>
</xsl:stylesheet>

Page 11: Structured-Document Processing Languages Spring 2007

3. XML Processor APIs

How can applications manipulate structured documents?
– Overview of document parser interfaces

3.1 SAX: an event-based interface
3.2 DOM: an object-based interface
3.3 JAXP: Java API for XML Processing

Page 12: Structured-Document Processing Languages Spring 2007

A SAX-based application

[Figure: the application's main routine calls Parse(); while reading the input, the parser invokes the application's callback routines.]

Input: <?xml version='1.0'?> <A i="1">Hi!</A>

Callback routines and the data passed to them:
startDocument()
startElement()  "A",[i="1"]
characters()    "Hi!"
endElement()    "A"

Page 13: Structured-Document Processing Languages Spring 2007

DOM: What is it?

Object-based, language-neutral API for XML and HTML documents
– Allows programs/scripts to
» build
» navigate and
» modify documents

"Directly Obtainable in Memory" vs "Serial Access XML"

Page 14: Structured-Document Processing Languages Spring 2007

DOM structure model

<invoice form="00" type="estimated">
  <addressdata>
    <name>John Doe</name>
    <address>
      <streetaddress>Pyynpolku 1</streetaddress>
      <postoffice>70460 KUOPIO</postoffice>
    </address>
  </addressdata>
  ...

[Figure: the corresponding DOM tree. A Document node contains the Element invoice, whose attributes form="00" and type="estimated" are held in a NamedNodeMap. invoice contains the Element addressdata, which contains name (Text "John Doe") and address; address contains streetaddress (Text "Pyynpolku 1") and postoffice (Text "70460 KUOPIO").]
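
The node types in the figure can be explored from Java through the org.w3c.dom interfaces. The following is a rough sketch, assuming the invoice fragment above is completed into a well-formed document and parsed from a string; the class DomSketch, its helper names, and the added note element are hypothetical and serve only to show navigating, building and modifying.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomSketch {
    // The slide's fragment, closed so that it is well-formed.
    static final String XML =
          "<invoice form=\"00\" type=\"estimated\"><addressdata>"
        + "<name>John Doe</name><address>"
        + "<streetaddress>Pyynpolku 1</streetaddress>"
        + "<postoffice>70460 KUOPIO</postoffice>"
        + "</address></addressdata></invoice>";

    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Navigate: Document -> root Element -> NamedNodeMap of attributes,
    // and Element "name" -> its Text child.
    static String describe(Document doc) {
        Element invoice = doc.getDocumentElement();
        NamedNodeMap attributes = invoice.getAttributes();
        Node name = doc.getElementsByTagName("name").item(0);
        return invoice.getTagName()
             + " type=" + attributes.getNamedItem("type").getNodeValue()
             + " name=" + name.getFirstChild().getNodeValue();
    }

    // Build and modify: create a new element with a text child and
    // append it to the root ("note" is a made-up element name).
    static void addNote(Document doc, String text) {
        Element note = doc.createElement("note");
        note.appendChild(doc.createTextNode(text));
        doc.getDocumentElement().appendChild(note);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XML);
        System.out.println(describe(doc));
        addNote(doc, "paid");
        System.out.println(doc.getElementsByTagName("note").getLength());
    }
}
```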

Page 15: Structured-Document Processing Languages Spring 2007

Overview of XSLT Transformation

[Figure: the Source Document is read into a Source Tree; the Transformation Process applies the StyleSheet to the Source Tree and produces a Result Tree; the Output Process writes the Result Tree as XML, Text, or HTML.]

Page 16: Structured-Document Processing Languages Spring 2007

JAXP (Java API for XML Processing)

An interface for "plugging-in" and using XML processors in Java applications
– includes packages
» org.xml.sax: SAX 2.0 interface
» org.w3c.dom: DOM Level 2 interface
» javax.xml.parsers: initialization and use of parsers
» javax.xml.transform: initialization and use of transformers (XSLT processors)

Included in standard Java

Page 17: Structured-Document Processing Languages Spring 2007

JAXP: Using a SAX parser (1)

[Figure: call chain for reading the XML file f.xml: .newSAXParser(), then .getXMLReader(), then .parse("f.xml")]
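
The call chain can be tried out roughly as follows. This is a minimal sketch, assuming default factory settings and an in-memory string in place of f.xml; the class name SaxEventsSketch and the log format are illustrative. Character data is buffered before logging because a SAX parser is allowed to deliver text in several characters() calls.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventsSketch {
    // Parses the XML text and returns a log of the callbacks received.
    static List<String> events(String xml) throws Exception {
        final List<String> log = new ArrayList<>();
        final StringBuilder text = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            // Emit buffered character data as one "characters" entry.
            private void flushText() {
                if (text.length() > 0) {
                    log.add("characters " + text);
                    text.setLength(0);
                }
            }
            @Override public void startDocument() {
                log.add("startDocument");
            }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                flushText();
                StringBuilder entry = new StringBuilder("startElement " + qName);
                for (int k = 0; k < atts.getLength(); k++) {
                    entry.append(' ').append(atts.getQName(k))
                         .append('=').append(atts.getValue(k));
                }
                log.add(entry.toString());
            }
            @Override public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                flushText();
                log.add("endElement " + qName);
            }
        };

        // The chain from the figure: factory -> SAXParser -> XMLReader -> parse
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        XMLReader reader = parser.getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return log;
    }

    public static void main(String[] args) throws Exception {
        for (String event : events("<?xml version='1.0'?><A i=\"1\">Hi!</A>")) {
            System.out.println(event);
        }
    }
}
```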

Page 18: Structured-Document Processing Languages Spring 2007

JAXP: Using a DOM parser (1)

[Figure: call chain for f.xml: .newDocumentBuilder(), then either .parse("f.xml") to read an existing document or .newDocument() to create an empty one]

Page 19: Structured-Document Processing Languages Spring 2007

JAXP: Using Transformers (1)

[Figure: call chain: .newTransformer(…) creates a transformer from an XSLT script; .transform(.,.) applies it to a source and writes the result]
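
A small end-to-end use of these two calls might look like the sketch below. It assumes an XSLT 1.0 script and an input document held in strings; the class TransformerSketch and the one-template stylesheet are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TransformerSketch {
    // Illustrative stylesheet: outputs the title of a doc element as plain text.
    static final String XSLT =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:template match=\"/doc\">"
        + "<xsl:value-of select=\"title\"/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String transform(String xml) throws Exception {
        // newTransformer(...) compiles the stylesheet into a Transformer.
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        // transform(source, result) runs the transformation and serializes the result.
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform("<doc><title>Course Review</title><p>ignored</p></doc>"));
    }
}
```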

Page 20: Structured-Document Processing Languages Spring 2007

4. Introduction to Style Sheets

Specify and produce visual representation for structured documents

by defining a mapping from document structure+content to formatting tasks, and
– inserting/generating new text
– numbering
– rearranging

by rules based on contextual conditions

Page 21: Structured-Document Processing Languages Spring 2007

Process of Transformation (muunnos)

[Figure: a Transformation step, controlled by a style sheet (LaTeX style file, CSS, XSLT), maps the structured document (a doc tree with, e.g., sect id="sect3" and xref elements) to formatter input: TeX or an FOT (XSL formatting object tree).]

Page 22: Structured-Document Processing Languages Spring 2007

Process of Formatting (muotoilu)

Creates a detailed description of presentation
– => style sheet may not have complete control of the final formatted presentation!

[Figure: formatter input (TeX, or FOT = XSL formatting object tree) is processed by a formatter (TeX, FOP, ...) into a description of presentation (DVI, PS, PDF).]

Page 23: Structured-Document Processing Languages Spring 2007

Process of Rendering (hahmonnus)

Display/play the document on output medium

[Figure: the description of presentation (DVI, PS, PDF) is passed to a printer driver or a display driver, which produces the rendered pages of text.]

Page 24: Structured-Document Processing Languages Spring 2007

CSS - Cascading Style Sheets

A stylesheet language
– mainly to specify the representation of web pages by attaching style (fonts, colours, margins, …) to HTML/XML documents

Example style rule:

H1 {color: blue; font-weight: bold;}

Page 25: Structured-Document Processing Languages Spring 2007

CSS Processing Model (simplified)

0. Parse the document
1. Match style rules to elements of the doc tree
– annotate each element with values assigned for properties
» inheritance and elaborate "cascade" rules applied to select which value is assigned
2. Generate a formatting structure
– of nested rectangular boxes
3. Render the formatting structure
– display, print, audio-synthesize, ...

Page 26: Structured-Document Processing Languages Spring 2007

XSL: Transformation & Formatting

[Figure: two phases labelled I and II; an XSLT script controls the transformation.]

Page 27: Structured-Document Processing Languages Spring 2007

Page regions

A simple page can contain 1-5 regions, specified by child elements of the simple-page-master

Page 28: Structured-Document Processing Languages Spring 2007

Top-level formatting objects

Slightly simplified:

fo:root
  fo:layout-master-set
    (fo:simple-page-master | fo:page-sequence-master)+
      children of fo:simple-page-master: fo:region-body, fo:region-before?, fo:region-after?, fo:region-start?, fo:region-end?
      fo:page-sequence-master: specify masters for page sequences, by referring to simple-page-masters
  fo:page-sequence+ (contents of pages)
    fo:flow

Page 29: Structured-Document Processing Languages Spring 2007

XQuery in a Nutshell

Functional expression language
– A query is a side-effect-free expression

Operates on sequences of items
– atomic values or XML nodes

Strongly-typed: (XML Schema) types may be assigned to expressions statically, and results can be validated

Extends XPath 2.0 (but not all axes required)
– common for XQuery 1.0 and XPath 2.0:
» Functions and Operators, W3C Rec. 01/2007

Roughly: XQuery ≈ XPath 2.0 + XSLT' + SQL'

Page 30: Structured-Document Processing Languages Spring 2007

FLWOR ("flower") Expressions

for, let, where, order by and return clauses (~SQL select-from-where)

Form:
(ForClause | LetClause)+ WhereClause? OrderByClause? "return" Expr

binds variables to values, and uses these bindings to construct a result (an ordered sequence of nodes)
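
To make the roles of the clauses concrete, here is a rough Java analogue of a FLWOR query, not an XQuery engine. It assumes a hypothetical document of sp_tuple elements, each with a pno and a price child, and lists the part numbers that occur in at least three tuples together with their average price. The class FlworSketch, the root element name sp and the "pno:average" output format are illustrative only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FlworSketch {
    // Text content of the first child element with the given name.
    static String childText(Element parent, String name) {
        return parent.getElementsByTagName(name).item(0).getFirstChild().getNodeValue();
    }

    static List<String> wellSupplied(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

        // "for" over distinct part numbers plus "let": bind each pno to its prices.
        // A TreeMap keeps the keys sorted, standing in for "order by".
        Map<String, List<Double>> pricesByPno = new TreeMap<>();
        NodeList tuples = doc.getElementsByTagName("sp_tuple");
        for (int i = 0; i < tuples.getLength(); i++) {
            Element tuple = (Element) tuples.item(i);
            pricesByPno.computeIfAbsent(childText(tuple, "pno"), k -> new ArrayList<>())
                       .add(Double.parseDouble(childText(tuple, "price")));
        }

        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<Double>> entry : pricesByPno.entrySet()) {
            List<Double> prices = entry.getValue();
            if (prices.size() >= 3) {                 // "where": at least three tuples
                double sum = 0;
                for (double p : prices) {
                    sum += p;
                }
                // "return": construct one result item per remaining binding
                result.add(entry.getKey() + ":" + (sum / prices.size()));
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<sp>"
            + "<sp_tuple><pno>p1</pno><price>10</price></sp_tuple>"
            + "<sp_tuple><pno>p1</pno><price>20</price></sp_tuple>"
            + "<sp_tuple><pno>p1</pno><price>30</price></sp_tuple>"
            + "<sp_tuple><pno>p2</pno><price>5</price></sp_tuple>"
            + "</sp>";
        System.out.println(wellSupplied(xml));
    }
}
```

The imperative version needs explicit grouping, filtering and sorting steps, which is what the declarative clauses express directly.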

Page 31: Structured-Document Processing Languages Spring 2007

XQuery Example

for $pn in distinct-values(doc("sp.xml")//pno)
let $sp := doc("sp.xml")//sp_tuple[pno=$pn]
where count($sp) >= 3
order by $pn
return
  <well_supplied_item>
    <pno>{$pn}</pno>
    <avgprice>{avg($sp/price)}</avgprice>
  </well_supplied_item>

Page 32: Structured-Document Processing Languages Spring 2007

Course Main Message

XML is a universal way to represent information as tree-like data structures

Specialized and powerful technologies for processing it
– Worst hype has settled
– R&D still active