XML Tools Presentation by Jeiel & Joost 21-3-2007.

XML Tools

Presentation by

Jeiel & Joost

21-3-2007

Intro

• Introduction

• HaXml

• HXT

• Summary

Why Functional?

• Transforming a XML document:Applying a transform function to a XML argument and getting a XML result.

• XSLT is based on functional languages.

HaXml features

• Combinators

• DTD2Haskell

• Haskell2XML/XML2Haskell(made easy with DrIFT)

• Xtract

• Parsing, validating, special tools for HTML.

HaXml is not widely used

• Installation was a pain

• Little documentation and support available

• Google searches also not helpful

• Not very userfriendly, e.g. no easy way to validate XML document.

Example: Validate a document

x = do f <- readFile "zetels.xml"

g <- readFile "properties.dtd"

z <- return.getElement $ (xmlParse "zetels" f)

d <- return.fromJust $ (dtdParse "properties" g)

return (validate d z)

getElement :: Document -> Element

getElement (Document _ _ e _) = e

Had to be saved to disk

Result: ["Document type should be <extsubset> but appears to be <properties>."]

Example: Combinators

coalitie :: CFiltercoalitie = foldXml (cat [tag "properties", txt, attrval ("key",AttValue [Left "CDA"]), attrval ("key",AttValue [Left "PvdA"]), attrval ("key",AttValue [Left

"ChristenUnie"]) ])

main = processXmlWith coalitie

Result: ["Document type should be <extsubset> but appears to be <properties>."]

DTD2Haskell

example = Properties (Properties_Attrs {propertiesVersion = Default ""}) (Just (Comment "Tweede Kamer Zetel Verdeling 2003")) [Entry (Entry_Attrs {entryKey = "CDA"}) "44", Entry (Entry_Attrs {entryKey = "PvdA"}) "42", Entry (Entry_Attrs {entryKey = "VVD"}) "28"]

Based on Java Properties DTD: http://java.sun.com/dtd/properties.dtd

showXML example

<?xml version='1.0' ?>

<properties>

<comment>Tweede Kamer Zetel Verdeling 2003

</comment>

<entry key="CDA">44</entry>

<entry key="PvdA">42</entry>

<entry key="VVD">28</entry>

</properties>

Haskell2Xml

• Make your datatype an instance of Haskell2Xml

• Use the DrIFT tool to derive this class automatically.

class Haskell2Xml a wheretoHType :: a -> HTypetoContents :: a -> [Content]fromContents :: [Content] -> (a, [Content])

http://www.cs.york.ac.uk/fp/HaXml/HaXml/Text-XML-HaXml-Haskell2Xml.html#t%3AHaskell2Xml

http://www.cs.york.ac.uk/fp/HaXml/HaXml/Text-XML-HaXml-Haskell2Xml.html#v%3AtoHType

Haskell2Xml

• For free:

toXml :: Haskell2Xml a => a -> Document fromXml :: Haskell2Xml a => Document -> a readXml :: Haskell2Xml a => String -> Maybe a showXml :: Haskell2Xml a => a -> String fReadXml :: Haskell2Xml a => FilePath -> IO a fWriteXml :: Haskell2Xml a => FilePath -> a -> IO ()

hGetXml :: Haskell2Xml a => Handle -> IO a hPutXml :: Haskell2Xml a => Handle -> a -> IO ()

Haskell2Xml example

• showXml (Just True)

<!DOCTYPE maybe-bool [

<!ATTLIST bool value CDATA #REQUIRED>

<!ELEMENT bool EMPTY>

<!ELEMENT maybe-bool bool?>

]>

<maybe-bool><bool value="True"/></maybe-bool>

Xml2Haskell

• Same as Haskell2Xml

• Make datatype instance of XMLContent instead of Haskell2Xml

• use DTD2Haskell to generate the instances of XMLContent

• Xml2Haskell and Haskell2Xml both have showXml, readXml, etc. be alert

Haskell XML Toolkit (HXT)

Features:

• Support for different character sets (Unicode and UTF-8, US-ASCII and ISO-Latin-1)• Wellformed document parsing, validation, construction• Namespace support: namespace propagation and checking • XPath support for selection of document parts • Liberal HTML parser for interpreting any text containing < ... > as HTML/XML • Schema validator • Integrated XSLT transformer

Datatypestype NTree a = NTree a [NTree a] -- already defineddata XNode = XText String

| ...| XTag QName XmlTrees | XAttr QName | ...

data QName = QN { namePrefix :: String localPart :: String namespaceUri :: String}type XmlTree = NTree XNode type XmlTrees = [XmlTree]

More types - filters

type XmlFilter = XmlTree -> [XmlTree]

or more general:type Filter a b = a -> [b]

but for now: XmlFilter

Simple predicate functions

isA :: (a -> Bool) -> a -> [a]

isA p x

| p x = [x]

| otherwise = []

isText :: XmlFilter

isText t@(NTree (XText _) _) = [t]

isText _ = []

Another example

getChildren :: XmlFilter

getChildren (NTree n cs) = cs

getGrandChildren :: XmlFilter

getGrandChildren (NTree n cs) = concat [ getChildren c | c <- cs ]

Seems a bit overdone...

Combining filters

(>>>) :: XmlFilter -> XmlFilter -> XmlFilter

(f >>> g) t = concat [g t' | t' <- f t]

So the definition of grandchildren becomes:getGrandChildren :: XmlFilter

getGrandChildren = getChildren >>> getChildren

Or selecting only text children:getTextChildren :: XmlFilter

getTextChildren = getChildren >>> isText

Logical and/or

• The >>> is a logical and, using predicates as XmlFilters

• Where is the logical or?

Answer:(<+>) :: XmlFilter -> XmlFilter -> XmlFilter

(f <+> g) t = f t ++ g t

Tree traversal

deep :: XmlFilter -> XmlFilter deep f = f `orElse` (getChildren >>> deep f)

multi :: XmlFilter -> XmlFiltermulti f = f <+> (getChildren >>> multi f)

Partial summary

• Combinators are powerful and elegant

• More elegant than pure functions

• But filters alone are not general enough

• What about side effects?

• A big difference between HaXml and HXT is... can you guess?

HXT uses Arrows!!!

• >>> and arr are in Arrow• <+> is in ArrowPlus• Functions like isA and isText lift pure functions in

the arrow• List filters are in ArrowList• Choice filters are in ArrowIf• Tree filters are in ArrowTree• Xml specific filters are in

Text.XML.HXT.XmlArrow

4 typesIn HXT there are 4 types for using pure list arrows:newtype LA a b = LA { runLA :: (a -> [b]) }newtype SLA s a b = SLA { runSLA :: (s -> a -> (s, [b])) }newtype IOLA a b = IOLA { runIOLA :: (a -> IO [b]) }newtype IOSLA s a b = IOSLA { runIOSLA :: (s -> a -> IO (s, [b])) } Which are instances of ArrowXml:class (Arrow a, ArrowList a, ArrowTree a) =>

ArrowXml a where ...

Examples

Selecting all text nodes from an XML document:

selectAllText :: ArrowXml a => a XmlTree XmlTree

selectAllText = deep isText

Selecting all tags in an XML document with a certain name:

selectAllTags :: ArrowXml a => String -> a XmlTree XmlTree

selectAllTags s = deep (isElem >>> hasName s)

Textbased browser?selectAllTextAndRealAltValues :: ArrowXml a => a XmlTree XmlTreeselectAllTextAndRealAltValues = deep ( isText <+> ( isElem >>> hasName "img" >>> getAttrValue "alt" >>> isA significant >>> arr addBrackets >>> mkText ) ) where significant :: String -> Bool significant = not . all (`elem` " \n\r\t") addBrackets :: String -> String addBrackets s = " [[ " ++ s ++ " ]] "

All images in an HTML document

imageTable :: ArrowXml a => a XmlTree XmlTreeimageTable = selem "html" [ selem "head" [ selem "title" [ txt "Images in Page" ]] , selem "body" [ selem "h1" [ txt "Images in Page" ] , selem "table" [ collectImages >>> genTableRows ] ] ] where collectImages = deep ( isElem >>> hasName "img" ) genTableRows = selem "tr"[ selem "td" [ getAttrValue "src" >>> mkText ] ]

Class ArrowXml a

• Contains a lot of XML specific functions• Some of these have implementations

based on others for convenience:– mkelem– aelem– selem– eelem

• Extension class is ArrowDTD a, which contains DTD specific functions.

Tools over XML

• As mentioned before, HXT contains an XPath module– We’ll have a quick look at it in a minute.

• As a result of a master’s thesis, a basic XSLT processor has been implemented on top of HXT– Pure Haskell– 1800 lines of code (compare to XALAN’s 347000!!)– The processor is not yet optimized– Kind of hard to give a quick view.

XPath• XPath in C#XmlNodeList XmlElement.SelectNodes(string xpath);

• XPath is just another plain old XmlFilter:getXPath :: String -> XmlFiltergetXPathWithNsEnv :: NsEnv -> String -> XmlFiltergetXPathSubTrees :: String -> XmlFiltergetXPathSubTreesWithNsEnv :: NsEnv -> String ->

XmlFiltergetXPathNodeSet :: String -> XmlTree -> XmlNodeSetgetXPathNodeSetWithNsEnv :: NsEnv -> String ->

XmlTree -> XmlNodeSetevalExpr :: Env -> Context -> Expr -> XPathFilter

Summary

XML Tools Presentation by Jeiel & Joost 21-3-2007.

Documents

Transcript of XML Tools Presentation by Jeiel & Joost 21-3-2007.