XML Tools Presentation by Jeiel & Joost 21-3-2007.
-
Upload
shawn-mcgee -
Category
Documents
-
view
215 -
download
0
Transcript of XML Tools Presentation by Jeiel & Joost 21-3-2007.
Why Functional?
• Transforming a XML document:Applying a transform function to a XML argument and getting a XML result.
• XSLT is based on functional languages.
HaXml features
• Combinators
• DTD2Haskell
• Haskell2XML/XML2Haskell(made easy with DrIFT)
• Xtract
• Parsing, validating, special tools for HTML.
HaXml is not widely used
• Installation was a pain
• Little documentation and support available
• Google searches also not helpful
• Not very userfriendly, e.g. no easy way to validate XML document.
Example: Validate a document
x = do f <- readFile "zetels.xml"
g <- readFile "properties.dtd"
z <- return.getElement $ (xmlParse "zetels" f)
d <- return.fromJust $ (dtdParse "properties" g)
return (validate d z)
getElement :: Document -> Element
getElement (Document _ _ e _) = e
Had to be saved to disk
Result: ["Document type should be <extsubset> but appears to be <properties>."]
Example: Combinators
coalitie :: CFiltercoalitie = foldXml (cat [tag "properties", txt, attrval ("key",AttValue [Left "CDA"]), attrval ("key",AttValue [Left "PvdA"]), attrval ("key",AttValue [Left
"ChristenUnie"]) ])
main = processXmlWith coalitie
Result: ["Document type should be <extsubset> but appears to be <properties>."]
DTD2Haskell
example = Properties (Properties_Attrs {propertiesVersion = Default ""}) (Just (Comment "Tweede Kamer Zetel Verdeling 2003")) [Entry (Entry_Attrs {entryKey = "CDA"}) "44", Entry (Entry_Attrs {entryKey = "PvdA"}) "42", Entry (Entry_Attrs {entryKey = "VVD"}) "28"]
Based on Java Properties DTD: http://java.sun.com/dtd/properties.dtd
showXML example
<?xml version='1.0' ?>
<properties>
<comment>Tweede Kamer Zetel Verdeling 2003
</comment>
<entry key="CDA">44</entry>
<entry key="PvdA">42</entry>
<entry key="VVD">28</entry>
</properties>
Haskell2Xml
• Make your datatype an instance of Haskell2Xml
• Use the DrIFT tool to derive this class automatically.
class Haskell2Xml a wheretoHType :: a -> HTypetoContents :: a -> [Content]fromContents :: [Content] -> (a, [Content])
Haskell2Xml
• For free:
toXml :: Haskell2Xml a => a -> Document fromXml :: Haskell2Xml a => Document -> a readXml :: Haskell2Xml a => String -> Maybe a showXml :: Haskell2Xml a => a -> String fReadXml :: Haskell2Xml a => FilePath -> IO a fWriteXml :: Haskell2Xml a => FilePath -> a -> IO ()
hGetXml :: Haskell2Xml a => Handle -> IO a hPutXml :: Haskell2Xml a => Handle -> a -> IO ()
Haskell2Xml example
• showXml (Just True)
<!DOCTYPE maybe-bool [
<!ATTLIST bool value CDATA #REQUIRED>
<!ELEMENT bool EMPTY>
<!ELEMENT maybe-bool bool?>
]>
<maybe-bool><bool value="True"/></maybe-bool>
Xml2Haskell
• Same as Haskell2Xml
• Make datatype instance of XMLContent instead of Haskell2Xml
• use DTD2Haskell to generate the instances of XMLContent
• Xml2Haskell and Haskell2Xml both have showXml, readXml, etc. be alert
Haskell XML Toolkit (HXT)
Features:
• Support for different character sets (Unicode and UTF-8, US-ASCII and ISO-Latin-1)• Wellformed document parsing, validation, construction• Namespace support: namespace propagation and checking • XPath support for selection of document parts • Liberal HTML parser for interpreting any text containing < ... > as HTML/XML • Schema validator • Integrated XSLT transformer
Datatypestype NTree a = NTree a [NTree a] -- already defineddata XNode = XText String
| ...| XTag QName XmlTrees | XAttr QName | ...
data QName = QN { namePrefix :: String localPart :: String namespaceUri :: String}type XmlTree = NTree XNode type XmlTrees = [XmlTree]
More types - filters
type XmlFilter = XmlTree -> [XmlTree]
or more general:type Filter a b = a -> [b]
but for now: XmlFilter
Simple predicate functions
isA :: (a -> Bool) -> a -> [a]
isA p x
| p x = [x]
| otherwise = []
isText :: XmlFilter
isText t@(NTree (XText _) _) = [t]
isText _ = []
Another example
getChildren :: XmlFilter
getChildren (NTree n cs) = cs
getGrandChildren :: XmlFilter
getGrandChildren (NTree n cs) = concat [ getChildren c | c <- cs ]
Seems a bit overdone...
Combining filters
(>>>) :: XmlFilter -> XmlFilter -> XmlFilter
(f >>> g) t = concat [g t' | t' <- f t]
So the definition of grandchildren becomes:getGrandChildren :: XmlFilter
getGrandChildren = getChildren >>> getChildren
Or selecting only text children:getTextChildren :: XmlFilter
getTextChildren = getChildren >>> isText
Logical and/or
• The >>> is a logical and, using predicates as XmlFilters
• Where is the logical or?
Answer:(<+>) :: XmlFilter -> XmlFilter -> XmlFilter
(f <+> g) t = f t ++ g t
More combinatorsorElse :: XmlFilter -> XmlFilter -> XmlFilterorElse f g t | null (f t) = g t | otherwise = f t
guards :: XmlFilter -> XmlFilter -> XmlFilter guards g f t | null (g t) = [] | otherwise = f t
when :: XmlFilter -> XmlFilter -> XmlFilter when f g t | null (g t) = [t] | otherwise = f t
Tree traversal
deep :: XmlFilter -> XmlFilter deep f = f `orElse` (getChildren >>> deep f)
multi :: XmlFilter -> XmlFiltermulti f = f <+> (getChildren >>> multi f)
Partial summary
• Combinators are powerful and elegant
• More elegant than pure functions
• But filters alone are not general enough
• What about side effects?
• A big difference between HaXml and HXT is... can you guess?
HXT uses Arrows!!!
• >>> and arr are in Arrow• <+> is in ArrowPlus• Functions like isA and isText lift pure functions in
the arrow• List filters are in ArrowList• Choice filters are in ArrowIf• Tree filters are in ArrowTree• Xml specific filters are in
Text.XML.HXT.XmlArrow
4 typesIn HXT there are 4 types for using pure list arrows:newtype LA a b = LA { runLA :: (a -> [b]) }newtype SLA s a b = SLA { runSLA :: (s -> a -> (s, [b])) }newtype IOLA a b = IOLA { runIOLA :: (a -> IO [b]) }newtype IOSLA s a b = IOSLA { runIOSLA :: (s -> a -> IO (s, [b])) } Which are instances of ArrowXml:class (Arrow a, ArrowList a, ArrowTree a) =>
ArrowXml a where ...
Examples
Selecting all text nodes from an XML document:
selectAllText :: ArrowXml a => a XmlTree XmlTree
selectAllText = deep isText
Selecting all tags in an XML document with a certain name:
selectAllTags :: ArrowXml a => String -> a XmlTree XmlTree
selectAllTags s = deep (isElem >>> hasName s)
Textbased browser?selectAllTextAndRealAltValues :: ArrowXml a => a XmlTree XmlTreeselectAllTextAndRealAltValues = deep ( isText <+> ( isElem >>> hasName "img" >>> getAttrValue "alt" >>> isA significant >>> arr addBrackets >>> mkText ) ) where significant :: String -> Bool significant = not . all (`elem` " \n\r\t") addBrackets :: String -> String addBrackets s = " [[ " ++ s ++ " ]] "
All images in an HTML document
imageTable :: ArrowXml a => a XmlTree XmlTreeimageTable = selem "html" [ selem "head" [ selem "title" [ txt "Images in Page" ]] , selem "body" [ selem "h1" [ txt "Images in Page" ] , selem "table" [ collectImages >>> genTableRows ] ] ] where collectImages = deep ( isElem >>> hasName "img" ) genTableRows = selem "tr"[ selem "td" [ getAttrValue "src" >>> mkText ] ]
Class ArrowXml a
• Contains a lot of XML specific functions• Some of these have implementations
based on others for convenience:– mkelem– aelem– selem– eelem
• Extension class is ArrowDTD a, which contains DTD specific functions.
Tools over XML
• As mentioned before, HXT contains an XPath module– We’ll have a quick look at it in a minute.
• As a result of a master’s thesis, a basic XSLT processor has been implemented on top of HXT– Pure Haskell– 1800 lines of code (compare to XALAN’s 347000!!)– The processor is not yet optimized– Kind of hard to give a quick view.
XPath• XPath in C#XmlNodeList XmlElement.SelectNodes(string xpath);
• XPath is just another plain old XmlFilter:getXPath :: String -> XmlFiltergetXPathWithNsEnv :: NsEnv -> String -> XmlFiltergetXPathSubTrees :: String -> XmlFiltergetXPathSubTreesWithNsEnv :: NsEnv -> String ->
XmlFiltergetXPathNodeSet :: String -> XmlTree -> XmlNodeSetgetXPathNodeSetWithNsEnv :: NsEnv -> String ->
XmlTree -> XmlNodeSetevalExpr :: Env -> Context -> Expr -> XPathFilter