Reading Microsoft Word XML files with SAS August 25, 2005
Larry Hoyle -- Policy Research Institute
University of Kansas
revised 8/18/2005
3 scenarios
• Extracting text along with associated properties (styles and attributes)
• Extracting all data from tables
• Extracting coordinates of objects in drawings
XML - syntax
<?xml version="1.0" ?>
<LarryRootTag>
<EmptyTag/>
<nestedTag>
Some content
</nestedTag>
<nestedTag anAttribute="wha">
Other content
</nestedTag>
</LarryRootTag>
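• Begins with this prolog (the XML declaration); when present, it must come first
• Tags are paired, and there must be exactly 1 root tag
• Tag names are case sensitive
• Empty tags end with />
• A pair of tags together with its content is called an "element"
• Tags can be qualified by attributes
• Elements can be nested, and must start and end in the same parent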
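The syntax rules above can be checked with any XML parser. The following is a minimal sketch in Python (not SAS), using the standard library on the slide's sample document; the helper name is_well_formed is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The sample document from the slide, with the XML declaration first.
SAMPLE = """<?xml version="1.0" ?>
<LarryRootTag>
<EmptyTag/>
<nestedTag>
Some content
</nestedTag>
<nestedTag anAttribute="wha">
Other content
</nestedTag>
</LarryRootTag>"""


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(SAMPLE)

# One root element; its children are the nested elements, in document order.
child_tags = [child.tag for child in root]

# Attributes qualify a tag; element content is available as text.
attribute_value = root[2].get("anAttribute")
first_content = root[1].text.strip()

# Tag names are case sensitive, so <b> cannot be closed by </B>.
case_mismatch_ok = is_well_formed("<a><b></B></a>")

# Elements must end inside the parent they started in.
overlap_ok = is_well_formed("<a><b><c></b></c></a>")

print(root.tag, child_tags, attribute_value, first_content)
print(case_mismatch_ok, overlap_ok)
```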
Word XML
Extracting text and properties
• SAS XML Engine
• Needs XMLMAP file
• Can use XML Mapper to generate XMLMAP
• Only needs to be generated once for each type of extract
Example Document
I have never been so humiliated in my life. That was very rude treatment.
What a pleasant experience. Your staff was both quick and pleasant.
It took about the time I expected to reach someone.
I have nothing to say. The sky is blue and the sea is green.
You are the worst organization in the world.
I love you guys.
XML - Example Document
Paragraph property: /w:wordDocument/w:body/wx:sect/w:p/w:pPr
Run property: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr
Rows
• The XMLMap has to describe a path that delineates rows:
• In this case it’s each text element in a run (in a paragraph…)
<TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</TABLE-PATH>
Columns – the text
• The XMLMap has to describe a path that delineates each column:
• The text itself is:
<COLUMN name="t">
<PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</PATH>
Columns – the text element number
• A sequential number for the text element is:
<COLUMN name="tNum" ordinal="YES" retain="YES">
<INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</INCREMENT-PATH>
Columns – the paragraph number
• A sequential number for the paragraph is:
<COLUMN name="pNum" ordinal="YES" retain="YES">
<INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p</INCREMENT-PATH>
Columns – paragraph color
<COLUMN name="PColorVal" retain="YES">
<PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:pPr/w:rPr/w:color/@val</PATH>
Columns – run color
<COLUMN name="RColorVal" retain="YES">
<PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:color/@val</PATH>
Our dataset
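To make the row and column paths above concrete, here is a rough Python sketch (not the SAS XML engine) that walks the same paths and builds one row per w:t element. The sample XML, its color values, and the function name extract_rows are invented for illustration. The sketch assumes that retain="YES" keeps a value until a new one is found and that the ordinal counters never reset.

```python
import xml.etree.ElementTree as ET

# Word 2003 XML namespaces for the w: and wx: prefixes used in the paths.
NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
}
W = "{%s}" % NS["w"]  # attribute names such as w:val are namespaced

# Hypothetical, heavily trimmed document; the colors are made up.
SAMPLE = """<w:wordDocument
 xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
 xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">
<w:body><wx:sect>
<w:p>
  <w:pPr><w:rPr><w:color w:val="FF0000"/></w:rPr></w:pPr>
  <w:r><w:rPr><w:color w:val="FF0000"/></w:rPr>
       <w:t>I have never been so humiliated in my life.</w:t></w:r>
  <w:r><w:rPr><w:color w:val="0000FF"/></w:rPr>
       <w:t>That was very rude treatment.</w:t></w:r>
</w:p>
<w:p>
  <w:r><w:t>I love you guys.</w:t></w:r>
</w:p>
</wx:sect></w:body></w:wordDocument>"""


def extract_rows(xml_text):
    """Build one dict per w:t element, mimicking the XMLMap columns."""
    root = ET.fromstring(xml_text)
    rows = []
    t_num = p_num = 0          # ordinal counters (tNum, pNum)
    p_color = r_color = None   # retained until a new value is seen
    for p in root.findall("w:body/wx:sect/w:p", NS):
        p_num += 1
        color = p.find("w:pPr/w:rPr/w:color", NS)
        if color is not None:
            p_color = color.get(W + "val")
        for r in p.findall("w:r", NS):
            color = r.find("w:rPr/w:color", NS)
            if color is not None:
                r_color = color.get(W + "val")
            for t in r.findall("w:t", NS):
                t_num += 1
                rows.append({"tNum": t_num, "pNum": p_num,
                             "PColorVal": p_color, "RColorVal": r_color,
                             "t": t.text})
    return rows


rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

Note that the last row carries the colors forward from the previous paragraph, because nothing in the second paragraph replaces the retained values.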
Tables
All Tables Into One Dataset
Tables – Word XML
Tables - DataSet Rows
<TABLE-PATH syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t</TABLE-PATH>
Tables – Table Number
<COLUMN name="tblNum" ordinal="YES" retain="YES">
<INCREMENT-PATH beginend="BEGIN" syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:tbl
</INCREMENT-PATH>
Tables – Row Number
<COLUMN name="trNum" ordinal="YES" retain="YES">
<INCREMENT-PATH beginend="BEGIN" syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:tbl/w:tr
</INCREMENT-PATH>
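The table, row, and counter paths above can be imitated outside SAS. This Python sketch builds a small made-up document with two tables and emits one tuple per text element. The helper names cell and table_rows are hypothetical. The sketch assumes the row counter keeps running across tables, since no reset is defined in the map.

```python
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
}


def cell(text):
    """Return the XML for one table cell holding a single text run."""
    return "<w:tc><w:p><w:r><w:t>%s</w:t></w:r></w:p></w:tc>" % text


# Made-up document: a 2x2 table followed by a 1x1 table.
SAMPLE = (
    '<w:wordDocument xmlns:w="%s" xmlns:wx="%s"><w:body><wx:sect>'
    % (NS["w"], NS["wx"])
    + "<w:tbl><w:tr>" + cell("A1") + cell("B1") + "</w:tr>"
    + "<w:tr>" + cell("A2") + cell("B2") + "</w:tr></w:tbl>"
    + "<w:tbl><w:tr>" + cell("X1") + "</w:tr></w:tbl>"
    + "</wx:sect></w:body></w:wordDocument>"
)


def table_rows(xml_text):
    """Return (tblNum, trNum, text) for every text element in a table cell."""
    root = ET.fromstring(xml_text)
    out = []
    tbl_num = tr_num = 0
    for tbl in root.findall("w:body/wx:sect/w:tbl", NS):
        tbl_num += 1                      # increments at each w:tbl
        for tr in tbl.findall("w:tr", NS):
            tr_num += 1                   # increments at each w:tr, never reset
            for t in tr.findall("w:tc/w:p/w:r/w:t", NS):
                out.append((tbl_num, tr_num, t.text))
    return out


records = table_rows(SAMPLE)
for record in records:
    print(record)
```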
We Could Add Properties if Needed
Nested tables
Nested Tables – Absolute Path for Rows
<TABLE-PATH syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t</TABLE-PATH>
Nested Tables – Rootless Path for Rows
<TABLE-PATH syntax="XPath">
w:tbl/w:tr/w:tc/w:p/w:r/w:t</TABLE-PATH>
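The difference between the two paths can be sketched in Python. The absolute path only reaches tables that sit directly under the section, while a search at any depth also reaches a table nested inside a cell. Python's ".//" prefix is used here as a stand-in for the rootless path; treating the two as equivalent is an assumption, and the sample document is made up.

```python
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
}

# Made-up document: one table whose only cell holds text and a nested table.
SAMPLE = """<w:wordDocument
 xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
 xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">
<w:body><wx:sect>
<w:tbl><w:tr><w:tc>
  <w:p><w:r><w:t>outer</w:t></w:r></w:p>
  <w:tbl><w:tr><w:tc>
    <w:p><w:r><w:t>inner</w:t></w:r></w:p>
  </w:tc></w:tr></w:tbl>
</w:tc></w:tr></w:tbl>
</wx:sect></w:body></w:wordDocument>"""

CELL_TEXT = "w:tbl/w:tr/w:tc/w:p/w:r/w:t"

root = ET.fromstring(SAMPLE)

# Absolute path: anchored at the document root, so the nested table is missed.
absolute = [t.text for t in root.findall("w:body/wx:sect/" + CELL_TEXT, NS)]

# Any-depth search: matches the same tail of the path wherever it occurs.
any_depth = [t.text for t in root.findall(".//" + CELL_TEXT, NS)]

print(absolute)
print(any_depth)
```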
Drawing Objects
VML – Vector Markup Language
• Drawings in Word get stored as XML also
• We’ll just look at lines
VML – Vector Markup Language
Dataset – One Row for Each Line
<TABLE-PATH syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line</TABLE-PATH>
Dataset – Column: From
<COLUMN name="from"> <PATH syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@from</PATH>
Dataset – Column: To
<COLUMN name="to"> <PATH syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@to</PATH>
Dataset – Column: StrokeColor
<COLUMN name="strokecolor"> <PATH syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@strokecolor</PATH>
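As a rough illustration of what these paths select, the following Python sketch (not the SAS XML engine) reads the from, to, and strokecolor attributes of each v:line. The sample drawing, its coordinates, and the function name extract_lines are made up for this example.

```python
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
    "v": "urn:schemas-microsoft-com:vml",
}

# Made-up drawing with two lines inside one group.
SAMPLE = """<w:wordDocument
 xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
 xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
 xmlns:v="urn:schemas-microsoft-com:vml">
<w:body><wx:sect><w:p><w:r><w:pict><v:group>
  <v:line from="1800,2160" to="3600,2160" strokecolor="red"/>
  <v:line from="1800,2880" to="3600,3600" strokecolor="blue"/>
</v:group></w:pict></w:r></w:p></wx:sect></w:body></w:wordDocument>"""

# Same path as the TABLE-PATH, written relative to the root element.
LINE_PATH = "w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line"


def extract_lines(xml_text):
    """Return one dict per v:line with its from, to and strokecolor."""
    root = ET.fromstring(xml_text)
    return [
        {
            "from": line.get("from"),
            "to": line.get("to"),
            "strokecolor": line.get("strokecolor"),
        }
        for line in root.findall(LINE_PATH, NS)
    ]


lines = extract_lines(SAMPLE)
for line in lines:
    print(line)
```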
The Dataset
Usage Example: Annotate dataset
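/* xyPattern is assumed to have been compiled earlier with PRXPARSE */
if prxmatch(xyPattern, from) then do;
   function = 'move';
   x = input(prxposn(xyPattern, 1, from), 10.);
   /* for a line flipped vertically, take the starting y from the "to" point */
   if prxmatch('/flip:y/', style) then
      y = -1 * input(prxposn(xyPattern, 2, to), 10.);
   else
      y = -1 * input(prxposn(xyPattern, 2, from), 10.);
   output;
end;   /* closes the DO block opened above */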
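The coordinate parsing in this step can be sketched in Python. The slides do not show how xyPattern is defined, so the regular expression below is an assumption (two comma-separated numbers), and the name move_point is hypothetical. Unlike the SAS code, which reuses the capture positions from the match on from, this sketch matches each string separately.

```python
import re

# Assumed pattern: "x,y" with optional sign and decimals, e.g. "1800,2160".
XY_PATTERN = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)")


def move_point(from_, to, style=""):
    """Return the 'move' record for one line, or None if parsing fails."""
    from_match = XY_PATTERN.match(from_)
    if from_match is None:
        return None
    x = float(from_match.group(1))
    # For a vertically flipped line the starting y comes from the "to" point.
    y_source = to if "flip:y" in style else from_
    y_match = XY_PATTERN.match(y_source)
    if y_match is None:
        return None
    # Negate y so the drawing's top-down axis plots bottom-up.
    y = -1 * float(y_match.group(2))
    return {"function": "move", "x": x, "y": y}


plain = move_point("1800,2160", "3600,2160")
flipped = move_point("1800,2880", "3600,3600", style="flip:y")
unparsed = move_point("not a point", "3600,3600")
print(plain, flipped, unparsed)
```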
Plotted in SAS
Contact Information
Larry Hoyle
Policy Research Institute, University of Kansas
http://www.ku.edu/pri/ksdata/sashttp/sugi31