Reading Microsoft Word XML files with SAS
August 25, 2005
Larry Hoyle -- Policy Research Institute
University of Kansas
revised 8/18/2005
3 scenarios
• Extracting text along with associated properties (styles and attributes)
• Extracting all data from tables
• Extracting coordinates of objects in drawings
XML - syntax
<?xml version="1.0" ?>
<LarryRootTag>
<EmptyTag/>
<nestedTag>
Some content
</nestedTag>
<nestedTag anAttribute="wha">
Other content
</nestedTag>
</LarryRootTag>
• The document begins with the prolog (the XML declaration)
• Tags are paired, and there must be exactly 1 root tag
• Tag names are case sensitive
• Empty tags end with />
• A pair of tags together with its content is called an "element"
• Tags can be qualified by attributes
• Elements can be nested, but each must start and end in the same parent
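These syntax rules can be checked with any XML parser. The following is a minimal sketch in Python (not SAS), using only the standard library; the helper name is_well_formed is made up for this illustration. It parses the example document from this slide and then shows that mismatched case and a second root element are both rejected.

```python
import xml.etree.ElementTree as ET

# The example document from the slide, with whitespace tidied.
doc = """<?xml version="1.0" ?>
<LarryRootTag>
  <EmptyTag/>
  <nestedTag>Some content</nestedTag>
  <nestedTag anAttribute="wha">Other content</nestedTag>
</LarryRootTag>"""

root = ET.fromstring(doc)
child_tags = [child.tag for child in root]
attribute_value = root.findall("nestedTag")[1].get("anAttribute")


def is_well_formed(text):
    """Hypothetical helper: True if the parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Tag names are case sensitive, so <Tag> is not closed by </tag>.
case_mismatch_ok = is_well_formed("<Tag>content</tag>")
# A document may have only one root element.
two_roots_ok = is_well_formed("<a/><b/>")

print(root.tag, child_tags, attribute_value)
print(case_mismatch_ok, two_roots_ok)
```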
Word XML
Extracting text and properties
• SAS XML Engine
• Needs XMLMAP file
• Can use XML Mapper to generate XMLMAP
• Only needs to be generated once for each type of extract
Example Document
I have never been so humiliated in my life. That was very rude treatment.
What a pleasant experience. Your staff was both quick and pleasant.
It took about the time I expected to reach someone.
I have nothing to say. The sky is blue and the sea is green.
You are the worst organization in the world.
I love you guys.
XML - Example Document
Paragraph property: /w:wordDocument/w:body/wx:sect/w:p/w:pPr
Run property: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr
Rows
• The XMLMap has to describe a path that delineates rows:
• In this case it’s each text element in a run (in a paragraph…)
<TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</TABLE-PATH>
Columns – the text
• The XMLMap has to describe a path that delineates each column:
• The text itself is:
<COLUMN name="t">
<PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</PATH>
Columns – the text element number
• A sequential number for the text element is:
<COLUMN name="tNum" ordinal="YES" retain="YES">
<INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</INCREMENT-PATH>
Columns – the paragraph number
• A sequential number for the paragraph is:
<COLUMN name="pNum" ordinal="YES" retain="YES">
<INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p</INCREMENT-PATH>
Columns – paragraph color
<COLUMN name="PColorVal" retain="YES">
<PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:pPr/w:rPr/w:color/@val</PATH>
Columns – run color
<COLUMN name="RColorVal" retain="YES">
<PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:color/@val</PATH>
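To see what these column definitions produce together, here is a rough emulation in Python rather than SAS. It is only a sketch under stated assumptions: the sample document is a hand-made fragment (not actual Word output), the function and helper names are invented, and carrying the last colour forward is my reading of retain="YES". One row is produced per w:t element, with running text and paragraph counters.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by Word 2003 XML for the w: and wx: prefixes.
NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
}

# Hand-made sample (an assumption, not real Word output).
SAMPLE = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">
  <w:body><wx:sect>
    <w:p>
      <w:pPr><w:rPr><w:color w:val="FF0000"/></w:rPr></w:pPr>
      <w:r><w:rPr><w:color w:val="FF0000"/></w:rPr><w:t>I have never been so humiliated in my life.</w:t></w:r>
      <w:r><w:t> That was very rude treatment.</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:rPr><w:color w:val="008000"/></w:rPr><w:t>I love you guys.</w:t></w:r>
    </w:p>
  </wx:sect></w:body>
</w:wordDocument>"""

W_VAL = "{%s}val" % NS["w"]  # the val attribute is namespace-qualified


def color_of(parent, path):
    """Return the colour value found under path, or None if absent."""
    node = parent.find(path, NS)
    return None if node is None else node.get(W_VAL)


def extract_text_rows(xml_text):
    """One row per w:t, loosely mirroring the XMLMap columns above."""
    root = ET.fromstring(xml_text)
    rows = []
    t_num = p_num = 0
    p_color = r_color = None  # carried forward, like retain="YES"
    for p in root.findall("w:body/wx:sect/w:p", NS):
        p_num += 1
        p_color = color_of(p, "w:pPr/w:rPr/w:color") or p_color
        for r in p.findall("w:r", NS):
            r_color = color_of(r, "w:rPr/w:color") or r_color
            for t in r.findall("w:t", NS):
                t_num += 1
                rows.append({"t": t.text, "tNum": t_num, "pNum": p_num,
                             "PColorVal": p_color, "RColorVal": r_color})
    return rows


rows = extract_text_rows(SAMPLE)
for row in rows:
    print(row)
```

Because values are carried forward, a run without its own colour reports the colour of the previous run, which is worth keeping in mind when reading the resulting dataset.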
Our dataset
Tables
All Tables Into One Dataset
Tables – Word XML
Tables - DataSet Rows
<TABLE-PATH syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t</TABLE-PATH>
Tables – Table Number
<COLUMN name="tblNum" ordinal="YES" retain="YES">
<INCREMENT-PATH beginend="BEGIN" syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:tbl
</INCREMENT-PATH>
Tables – Row Number
<COLUMN name="trNum" ordinal="YES" retain="YES">
<INCREMENT-PATH beginend="BEGIN" syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:tbl/w:tr
</INCREMENT-PATH>
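The table and row counters can be pictured with a short Python sketch (again not SAS, and only an approximation). The sample markup and the function name are made up for illustration. Since the map shown here has no reset for trNum, the sketch assumes the row counter keeps counting across tables.

```python
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
}

# Hand-made sample: a 2x2 table followed by a 1x1 table.
SAMPLE = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">
  <w:body><wx:sect>
    <w:tbl>
      <w:tr>
        <w:tc><w:p><w:r><w:t>a</w:t></w:r></w:p></w:tc>
        <w:tc><w:p><w:r><w:t>b</w:t></w:r></w:p></w:tc>
      </w:tr>
      <w:tr>
        <w:tc><w:p><w:r><w:t>c</w:t></w:r></w:p></w:tc>
        <w:tc><w:p><w:r><w:t>d</w:t></w:r></w:p></w:tc>
      </w:tr>
    </w:tbl>
    <w:tbl>
      <w:tr>
        <w:tc><w:p><w:r><w:t>e</w:t></w:r></w:p></w:tc>
      </w:tr>
    </w:tbl>
  </wx:sect></w:body>
</w:wordDocument>"""


def extract_table_rows(xml_text):
    """One (tblNum, trNum, text) tuple per w:t inside a table cell."""
    root = ET.fromstring(xml_text)
    rows = []
    tbl_num = tr_num = 0
    for tbl in root.findall("w:body/wx:sect/w:tbl", NS):
        tbl_num += 1              # incremented at the start of each table
        for tr in tbl.findall("w:tr", NS):
            tr_num += 1           # never reset, so it runs across tables
            for t in tr.findall("w:tc/w:p/w:r/w:t", NS):
                rows.append((tbl_num, tr_num, t.text))
    return rows


table_rows = extract_table_rows(SAMPLE)
for row in table_rows:
    print(row)
```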
We Could Add Properties if Needed
Nested tables
Nested Tables – Absolute Path for Rows
<TABLE-PATH syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t</TABLE-PATH>
Nested Tables – Rootless Path for Rows
<TABLE-PATH syntax="XPath">
w:tbl/w:tr/w:tc/w:p/w:r/w:t</TABLE-PATH>
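The difference between the two paths can be illustrated in Python, where a descendant search (".//") plays a role roughly analogous to the rootless path. This is a sketch on a hand-made sample, not a statement about how the SAS XML engine resolves paths internally. The absolute path reaches only text in the outer table, while the descendant search also reaches the table nested inside a cell.

```python
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
}

# Hand-made sample: an outer table whose second cell holds another table.
SAMPLE = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">
  <w:body><wx:sect>
    <w:tbl><w:tr>
      <w:tc><w:p><w:r><w:t>outer</w:t></w:r></w:p></w:tc>
      <w:tc>
        <w:tbl><w:tr>
          <w:tc><w:p><w:r><w:t>inner</w:t></w:r></w:p></w:tc>
        </w:tr></w:tbl>
        <w:p/>
      </w:tc>
    </w:tr></w:tbl>
  </wx:sect></w:body>
</w:wordDocument>"""

CELL_TEXT = "w:tbl/w:tr/w:tc/w:p/w:r/w:t"
root = ET.fromstring(SAMPLE)

# Absolute path: anchored at the document root, so only the outer table matches.
absolute = [t.text for t in root.findall("w:body/wx:sect/" + CELL_TEXT, NS)]

# Descendant search: matches the same tail at any depth, nested tables included.
rootless = [t.text for t in root.findall(".//" + CELL_TEXT, NS)]

print(absolute)
print(sorted(rootless))
```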
Drawing Objects
VML – Vector Markup Language
• Drawings in Word get stored as XML also
• We’ll just look at lines
Dataset – One Row for Each Line
<TABLE-PATH syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line</TABLE-PATH>
Dataset – Column: From
<COLUMN name="from"> <PATH syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@from</PATH>
Dataset – Column: To
<COLUMN name="to"> <PATH syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@to</PATH>
Dataset – Column: StrokeColor
<COLUMN name="strokecolor"> <PATH syntax="XPath">
/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@strokecolor</PATH>
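As a rough picture of the resulting dataset, the Python sketch below (not SAS) collects one record per v:line with its from, to and strokecolor attributes. The sample fragment and its coordinate values are invented for illustration, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
    "v": "urn:schemas-microsoft-com:vml",
}

# Hand-made sample with invented coordinates (not real Word output).
SAMPLE = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
    xmlns:v="urn:schemas-microsoft-com:vml">
  <w:body><wx:sect>
    <w:p><w:r><w:pict>
      <v:group>
        <v:line from="1800,2160" to="3600,2160" strokecolor="red"/>
        <v:line from="3600,2160" to="3600,4320" strokecolor="blue"/>
      </v:group>
    </w:pict></w:r></w:p>
  </wx:sect></w:body>
</w:wordDocument>"""

LINE_PATH = "w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line"


def extract_lines(xml_text):
    """One dict per v:line; the attributes themselves are not namespaced."""
    root = ET.fromstring(xml_text)
    return [
        {"from": line.get("from"),
         "to": line.get("to"),
         "strokecolor": line.get("strokecolor")}
        for line in root.findall(LINE_PATH, NS)
    ]


lines = extract_lines(SAMPLE)
for line in lines:
    print(line)
```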
The Dataset
Usage Example: Annotate dataset
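/* xyPattern is a regular-expression id defined outside this excerpt; */
/* its capture buffers 1 and 2 hold the x and y coordinates.          */
if prxmatch(xyPattern, from) then do;
   function='move';
   x = input(PRXPOSN(xyPattern, 1, from), 10.);
   /* y is negated; for flipped lines (flip:y in style) it is read from "to" */
   if prxmatch('/flip:y/', style) then
      y = -1 * input(PRXPOSN(xyPattern, 2, to), 10.);
   else
      y = -1 * input(PRXPOSN(xyPattern, 2, from), 10.);
   output;
end;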
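The coordinate handling in this step can be sketched in Python as well. The regular expression below is a hypothetical stand-in, since the definition of xyPattern is not shown on the slides, and the function name is invented. Unlike the SAS excerpt, the sketch matches the "to" string separately when the line is flipped, which is a simplification on my part.

```python
import re

# Hypothetical pattern: two integers separated by a comma, e.g. "1800,2160".
XY_PATTERN = re.compile(r"^\s*(-?\d+)\s*,\s*(-?\d+)")


def move_point(from_attr, to_attr, style=""):
    """Build the 'move' record for one line, or None if parsing fails."""
    m_from = XY_PATTERN.match(from_attr)
    if m_from is None:
        return None
    x = int(m_from.group(1))
    # For flipped lines the y coordinate comes from the "to" attribute.
    y_source = XY_PATTERN.match(to_attr) if "flip:y" in style else m_from
    if y_source is None:
        return None
    y = -int(y_source.group(2))  # negated, as in the SAS excerpt
    return {"function": "move", "x": x, "y": y}


plain = move_point("100,200", "300,400")
flipped = move_point("100,200", "300,400", style="position:absolute;flip:y")
unparsed = move_point("not coordinates", "300,400")

print(plain)
print(flipped)
print(unparsed)
```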
Plotted in SAS
Contact Information
Larry Hoyle
Policy Research Institute, University of Kansas
http://www.ku.edu/pri/ksdata/sashttp/sugi31