XML processing with perl

XML processing with Perl. For the 2nd YPPUG session, by Joe Jiang [email protected]


Transcript of XML processing with perl

Page 1: XML processing with perl

XML processing with Perl

For the 2nd YPPUG session, by Joe Jiang [email protected]

Page 2: XML processing with perl

XML is a data format, not a language

• We use it in finance and search.
• DMP can also support it, but not as well as text/HTML.
• Many people use it for configuration files.
• I have used it for translating a Perl book.
• For example: ...
$book/> count $book//sect1
117
$book/> count $book//sect2
149
$book/> count $book//para
4691
# Wah, it's a big book :)
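The count commands above evaluate an XPath expression and report how many nodes match. A minimal sketch of the same idea in plain Perl, assuming XML::LibXML is installed. The small inline document is made up for illustration and is not the book from the talk.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical miniature document standing in for the real book.
my $xml = <<'XML';
<book>
  <chapter>
    <sect1><para>one</para><sect2><para>two</para></sect2></sect1>
    <sect1><para>three</para></sect1>
  </chapter>
</book>
XML

my $book = XML::LibXML->load_xml(string => $xml);

# findvalue() with the XPath count() function returns the number of matches.
my %count;
for my $name (qw(sect1 sect2 para)) {
    $count{$name} = $book->findvalue("count(//$name)");
    print "$name: $count{$name}\n";
}
```

The XPath expressions are the same ones xsh takes; only the surrounding verbs differ.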

 

Page 3: XML processing with perl

The tool to work with XML

• It's named XML::XSH2, by Petr Pajas
• It comes with a useful utility named xsh
• Which is based on XML::LibXSLT and XML::SAX::Writer, and ...
• Which is based on XML::LibXML and a lot of ...
• So you should not expect a smooth, easy installation :)
• But it can still be built with the cpanm utility
• So I suggest installing cpanm first

$ curl -kL http://cpanmin.us | perl - --sudo App::cpanminus
$ cpanm -S XML::XSH2
# Already installed on dev, so you can just run: xsh
# ! Finding XML::XSH2 on cpanmetadb failed.
# Messages like this are common

Page 4: XML processing with perl

How is it used? XPath plus verbs

$scratch/> $book := open english-tidyup.xml
parsing english-tidyup.xml
done.
$book/> cd //book/chapter[1]
$book/book/chapter[1]> ls title
<title>Introduction</title>

$book/book/chapter[1]> cd /
$book/> ls //chapter/title
<title>Introduction</title>
<title>Filesystems</title>
<title>User Accounts</title>
...
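The session above pairs a context node (cd) with a relative or absolute XPath (ls). A rough Perl counterpart, assuming XML::LibXML is installed. The inline document is a made-up stand-in that only borrows the three chapter titles shown above.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample; the real english-tidyup.xml is not reproduced here.
my $book = XML::LibXML->load_xml(string => <<'XML');
<book>
  <chapter><title>Introduction</title></chapter>
  <chapter><title>Filesystems</title></chapter>
  <chapter><title>User Accounts</title></chapter>
</book>
XML

# Roughly "cd //book/chapter[1]" followed by "ls title":
# pick a context node, then evaluate a relative XPath against it.
my ($first) = $book->findnodes('//book/chapter[1]');
my @first_titles = map { $_->toString } $first->findnodes('title');
print "$_\n" for @first_titles;

# Roughly "cd /" followed by "ls //chapter/title".
my @all_titles = map { $_->textContent } $book->findnodes('//chapter/title');
print "$_\n" for @all_titles;
```

toString() serializes the element with its tags, as ls does, while textContent returns only the text inside.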

Page 5: XML processing with perl

Good at pipeline processing

$book/> ls //sect1//para/text() | wc -w

Found 12398 node(s).
150879
Use "wc -m" for Chinese char count.
Or have fun with frequency statistics, such as the top 100 most-used words:
$book/> ls //sect1//para/text() | perl -MList::MoreUtils=natatime -lane 'END{ $it = natatime 100, sort {$cnt{$b} <=> $cnt{$a}} keys %cnt; print for map {join qq(\t), $_, $cnt{$_}} $it->() } $cnt{$_}++ for @F'
...
data    483
...
Perl    437
...
file    426
...
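The one-liner above counts whitespace-separated words and prints the most frequent ones. The same logic spelled out as a small script, using only core Perl. The top_words helper name and the sample lines are assumptions for illustration, and an array slice replaces natatime for taking the first N entries.

```perl
use strict;
use warnings;

# top_words(\@lines, $n): count whitespace-separated words and return the
# $n most frequent as [word, count] pairs (ties broken by string order).
sub top_words {
    my ($lines, $n) = @_;
    my %cnt;
    for my $line (@$lines) {
        $cnt{$_}++ for split ' ', $line;
    }
    my @sorted = sort { $cnt{$b} <=> $cnt{$a} || $a cmp $b } keys %cnt;
    $n = @sorted if $n > @sorted;
    return map { [ $_, $cnt{$_} ] } @sorted[0 .. $n - 1];
}

# Hypothetical input standing in for the text nodes piped out of xsh.
my @lines = ('Perl reads the file', 'the data file', 'the Perl data');
my @top = top_words(\@lines, 3);
print join("\t", @$_), "\n" for @top;
```

To use it in the pipeline, read the lines from STDIN instead of the sample array.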

Page 6: XML processing with perl

It can be used for conversion #1

$scratch/> $x:=open ArticleInfo_9.xml;
parsing ArticleInfo_9.xml
done.
$x/> ls $x
<?xml version="1.0" encoding="utf-16"?>
<小样>
        <标题><![CDATA[第一推荐]]></标题>
        <作者><![CDATA[]]></作者>
        <内容><![CDATA[  华为美国拓展求解  华为对美国市场的执着显示出中国企业走出去的急切需要,但这样高调注定要经受更多挫折。]]></内容>
        <附图>
                <简图>
                        <文件名>../cnmlfiles/A01/A01Ab25C005_b.jpg</文件名>
                        <高>260</高>
                        <宽>245</宽>
                </简图>
        </附图>
</小样>

Page 7: XML processing with perl

Now building an empty xHTML #2

$x/> $y:=new html;
$y/> ls $y
<?xml version="1.0" encoding="utf-8"?>
<html/>
$y/> xadd element "<head/>" into $y/html; # xadd is just an alias of insert
$y/> ls $y
<?xml version="1.0" encoding="utf-8"?>
<html>
  <head/>
</html>
$y/> xadd element "<title/>" into $y/html/head;
$y/> xadd element "<body/>" into $y/html;
$y/> ls $y
<?xml version="1.0" encoding="utf-8"?>
<html>
  <head>
    <title/>
  </head>
  <body/>
</html>
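The same skeleton can be built without xsh through the DOM interface. A minimal sketch, assuming XML::LibXML is installed; it mirrors the commands above but is not how xsh implements them.

```perl
use strict;
use warnings;
use XML::LibXML;

# Counterpart of "$y := new html": a document with an <html/> root.
my $doc  = XML::LibXML::Document->new('1.0', 'utf-8');
my $html = $doc->createElement('html');
$doc->setDocumentElement($html);

# Counterpart of the three "xadd element ... into ..." commands.
# appendChild() returns the node it appended, so it can be chained on.
my $head = $html->appendChild($doc->createElement('head'));
$head->appendChild($doc->createElement('title'));
$html->appendChild($doc->createElement('body'));

# toString(1) asks libxml2 to indent the serialized output.
my $out = $doc->toString(1);
print $out;
```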

Page 8: XML processing with perl

Copy contents into xHTML #3

$y/> xadd text $x//小样/标题/text() into $y/html/head/title;
$y/> ls $y
<?xml version="1.0" encoding="utf-8"?>
<html>
  <head>
    <title>第一推荐</title>
  </head>
  <body/>
</html>

$y/> xadd text $x//小样/内容/text() into $y/html/body;
$y/> save --file x.html $y;
Document saved into file 'x.html'.
$y/>Good bye!
$ cat x.html
<?xml version="1.0" encoding="utf-8"?>
<html>
  <head>
    <title>第一推荐</title>
  </head>
  <body>  华为美国拓展求解  华为对美国市场的执着显示出中国企业走出去的急切需要,但这样高调注定要经受更多挫折。</body>
</html>
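The copy step boils down to: read text at a source XPath, append it at a destination XPath. A sketch of that mapping in Perl, assuming XML::LibXML is installed. The source document uses made-up English element names and text instead of the real ArticleInfo_9.xml, and the %map table is an illustrative assumption.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical source, shaped like the article file but with English
# element names and invented text.
my $src = XML::LibXML->load_xml(string => <<'XML');
<article>
  <title><![CDATA[Top pick]]></title>
  <content><![CDATA[Body text of the article.]]></content>
</article>
XML

my $dst = XML::LibXML->load_xml(
    string => '<html><head><title/></head><body/></html>');

# Source XPath => destination XPath, mirroring "xadd text SRC into DST".
my %map = (
    '/article/title'   => '/html/head/title',
    '/article/content' => '/html/body',
);
for my $from (sort keys %map) {
    my ($target) = $dst->findnodes($map{$from});
    $target->appendText($src->findvalue($from));
}

print $dst->toString(1);
# $dst->toFile('x.html', 1) would write the result to disk instead.
```

Keeping the XPath pairs in a table makes it easy to add more fields later without touching the loop.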

Page 9: XML processing with perl

XSLT is a focused XML conversion language, based on XPath

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/perldata/hashref">
  <table border="1">
   <tr>
    <th>Key</th>
    <th>Value</th>
   </tr>
   <xsl:for-each select="item">
    <tr>
     <td><xsl:value-of select="@key"/></td>
     <td><xsl:value-of select="."/></td>
    </tr>
   </xsl:for-each>
  </table>
</xsl:template>
</xsl:stylesheet>
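A stylesheet like this can also be applied from Perl rather than from the command line. A minimal sketch, assuming XML::LibXML and XML::LibXSLT are installed. The stylesheet is a trimmed copy of the one above (header row dropped), and the one-item input document is invented to match its perldata/hashref/item shape.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# Trimmed stylesheet: one table row per <item>.
my $style_doc = XML::LibXML->load_xml(string => <<'XSL');
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/perldata/hashref">
    <table border="1">
      <xsl:for-each select="item">
        <tr>
          <td><xsl:value-of select="@key"/></td>
          <td><xsl:value-of select="."/></td>
        </tr>
      </xsl:for-each>
    </table>
  </xsl:template>
</xsl:stylesheet>
XSL

# Hypothetical input in the shape the template matches.
my $source = XML::LibXML->load_xml(string =>
    '<perldata><hashref><item key="lang">Perl</item></hashref></perldata>');

my $stylesheet = XML::LibXSLT->new->parse_stylesheet($style_doc);
my $result     = $stylesheet->transform($source);
my $html       = $stylesheet->output_as_chars($result);
print $html;
```

The select attributes are ordinary XPath, so anything tried interactively in xsh carries over to a stylesheet.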

Page 10: XML processing with perl

This works well with XML::Dumper

$ perl -MXML::Dumper -e 'print pl2xml(\%INC)' | xsltproc hashref.xsl - | w3m -T text/html
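The command above feeds pl2xml output straight into the stylesheet. A small round-trip sketch to see what XML::Dumper produces, assuming the module (and the XML::Parser it relies on for reading) is installed. The %config hash is made up for illustration.

```perl
use strict;
use warnings;
use XML::Dumper;

# pl2xml() serializes a Perl data structure to an XML string.
my %config = (editor => 'vim', shell => 'bash');
my $xml = pl2xml(\%config);
print $xml;

# xml2pl() reverses the conversion and returns a reference.
my $copy = xml2pl($xml);
print "$_ => $copy->{$_}\n" for sort keys %$copy;
```

The printed XML shows the perldata, hashref and item elements that the stylesheet's match and select expressions rely on.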

• We can use xsltproc to convert the DocBook book to HTML
• And to PDF, with another utility named fop
• Or generate an MS Word doc file from OpenOffice
• With the help of the OpenOffice DocBook XSLT filter

Page 11: XML processing with perl

Now you are equipped with another tool named XML

Thanks all for the magic!

Module Name    Author       Version
XML::Dumper    MIKEWONG     0.81
XML::Simple    GRANTM       2.18
XML::LibXML    PAJAS        1.87
XML::XPath     MSERGEANT    1.13
XML::XSH2      PAJAS        2.1.3
XML::Twig      MIROD        3.38