Reading Microsoft Word XML files with SAS August 25, 2005


Reading Microsoft Word XML files with SAS

August 25, 2005

Larry Hoyle -- Policy Research Institute

University of Kansas

revised 8/18/2005

3 scenarios

• Extracting text along with associated properties (styles and attributes)

• Extracting all data from tables

• Extracting coordinates of objects in drawings

XML - syntax

<?xml version="1.0" ?>
<LarryRootTag>
  <EmptyTag/>
  <nestedTag>
    Some content
  </nestedTag>
  <nestedTag anAttribute="wha">
    Other content
  </nestedTag>
</LarryRootTag>

• Begins with this prolog (the XML declaration), which must come first

• Tags are paired, and there must be exactly 1 root tag

• Tag names are case sensitive

• Empty tags end with />

• Tags together with their content are called an "element"

• Tags can be qualified by attributes

• Elements can be nested, but each must start and end in the same parent

Word XML


Extracting text and properties

• SAS XML Engine

• Needs XMLMAP file

• Can use XML Mapper to generate XMLMAP

• Only needs to be generated once for each type of extract
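As a minimal sketch of how the XML engine and an XMLMAP fit together, the statements below assign a libref to a Word XML file and print the mapped table. The file paths, the fileref names (wordxml, wordmap) and the table name WordText are hypothetical. The options shown are those of the SAS 9.1 XML engine; later releases may need a different engine name or map version.

```sas
/* Hypothetical locations of the Word XML document and its XMLMap */
filename wordxml 'C:\temp\comments.xml';
filename wordmap 'C:\temp\comments.map';

/* The libref matches the fileref of the XML document;
   XMLMAP= names the fileref that points to the map */
libname wordxml xml xmlmap=wordmap access=readonly;

/* WordText is assumed to be the TABLE name defined in the map */
proc print data=wordxml.WordText;
run;
```

Because the map describes a type of extract rather than one particular document, the same map file can be reused by pointing the first FILENAME statement at another Word XML file.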

Example Document

I have never been so humiliated in my life. That was very rude treatment.

What a pleasant experience. Your staff was both quick and pleasant.

It took about the time I expected to reach someone.

I have nothing to say. The sky is blue and the sea is green.

You are the worst organization in the world.

I love you guys.

XML - Example Document

(The same six paragraphs, shown as Word XML.)

Paragraph property: /w:wordDocument/w:body/wx:sect/w:p/w:pPr

Run property: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr

Rows

• The XMLMap has to describe a path that delineates rows:

• In this case it’s each text element in a run (in a paragraph…)

<TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</TABLE-PATH>

Columns – the text

• The XMLMap has to describe a path that delineates each column:

• The text itself is:

<COLUMN name="t">

<PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</PATH>

Columns – the text element number

• A sequential number for the text element is:

<COLUMN name="tNum" ordinal="YES" retain="YES">

<INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</INCREMENT-PATH>

Columns – the paragraph number

• A sequential number for the paragraph is:

<COLUMN name="pNum" ordinal="YES" retain="YES">

<INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p</INCREMENT-PATH>

Columns – paragraph color

<COLUMN name="PColorVal" retain="YES">

<PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:pPr/w:rPr/w:color/@val</PATH>

Columns – run color

<COLUMN name="RColorVal" retain="YES">

<PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:color/@val</PATH>
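The column fragments on these slides are excerpts. A complete map also needs the enclosing SXLEMAP and TABLE elements, plus a type for each column. The sketch below assembles three of the columns into one map. The table name WordText, the map version, and every TYPE, DATATYPE and LENGTH value are assumptions made for illustration, not values taken from the original map.

```xml
<?xml version="1.0" ?>
<!-- Sketch only: table name, version, TYPE, DATATYPE and LENGTH are assumptions -->
<SXLEMAP version="1.2">
  <TABLE name="WordText">
    <!-- one dataset row per text element -->
    <TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</TABLE-PATH>

    <!-- sequential paragraph number -->
    <COLUMN name="pNum" ordinal="YES" retain="YES">
      <INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p</INCREMENT-PATH>
      <TYPE>numeric</TYPE>
      <DATATYPE>integer</DATATYPE>
    </COLUMN>

    <!-- color attribute of the run -->
    <COLUMN name="RColorVal" retain="YES">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:color/@val</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>8</LENGTH>
    </COLUMN>

    <!-- the text itself -->
    <COLUMN name="t">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>200</LENGTH>
    </COLUMN>
  </TABLE>
</SXLEMAP>
```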

Our dataset

Tables

All Tables Into One Dataset

Tables – Word XML

Tables - DataSet Rows

<TABLE-PATH syntax="XPath">

/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t</TABLE-PATH>

Tables – Table Number

<COLUMN name="tblNum" ordinal="YES" retain="YES">

<INCREMENT-PATH beginend="BEGIN" syntax="XPath">

/w:wordDocument/w:body/wx:sect/w:tbl

</INCREMENT-PATH>

Tables – Row Number

<COLUMN name="trNum" ordinal="YES" retain="YES">

<INCREMENT-PATH beginend="BEGIN" syntax="XPath">

/w:wordDocument/w:body/wx:sect/w:tbl/w:tr

</INCREMENT-PATH>
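With the table and row counters in place, the one-cell-per-row dataset can be reshaped so that each Word table row becomes one observation. The sketch below is one possible approach, not part of the original presentation. The dataset tblcells and its values are invented stand-ins for what the XML engine would return, and the cell counter assumes each cell holds exactly one text element.

```sas
/* Hypothetical stand-in for the dataset read through the XMLMap */
data tblcells;
   infile datalines dsd;
   input tblNum trNum t :$20.;
   datalines;
1,1,Name
1,1,Score
1,2,Ann
1,2,90
2,3,Item
2,3,Count
;
run;

/* Number the cells within each table row
   (assumes one text element per cell) */
data cells;
   set tblcells;
   by tblNum trNum;
   if first.trNum then cellNum = 0;
   cellNum + 1;
run;

/* One observation per table row, one variable per cell: col1, col2, ... */
proc transpose data=cells out=wide(drop=_name_) prefix=col;
   by tblNum trNum;
   id cellNum;
   var t;
run;
```

A cell whose text is split across several runs would produce several rows here, so a real extract might instead count cells with an ordinal column incremented on the w:tc path.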

We Could Add Properties if Needed

Nested tables

Nested Tables – Absolute Path for Rows

<TABLE-PATH syntax="XPath">

/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t</TABLE-PATH>

Nested Tables – Rootless Path for Rows

<TABLE-PATH syntax="XPath">

w:tbl/w:tr/w:tc/w:p/w:r/w:t</TABLE-PATH>

Drawing Objects

VML – Vector Markup Language

• Drawings in Word get stored as XML also

• We’ll just look at lines

VML – Vector Markup Language

Dataset – One Row for Each Line

<TABLE-PATH syntax="XPath">

/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line</TABLE-PATH>

Dataset – Column: From

<COLUMN name="from"> <PATH syntax="XPath">

/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@from</PATH>

Dataset – Column: To

<COLUMN name="to"> <PATH syntax="XPath">

/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@to</PATH>

Dataset – Column: StrokeColor

<COLUMN name="strokecolor"> <PATH syntax="XPath">

/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@strokecolor</PATH>
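Put together, the map for the line extract might look like the sketch below. The Annotate code later in the presentation also tests a style variable that none of the excerpts define; mapping it from the line's style attribute is an assumption. The table name Lines, the map version, and all TYPE, DATATYPE and LENGTH values are likewise assumptions.

```xml
<?xml version="1.0" ?>
<!-- Sketch only: table name, the style column, and all type/length values are assumptions -->
<SXLEMAP version="1.2">
  <TABLE name="Lines">
    <!-- one dataset row per v:line element -->
    <TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line</TABLE-PATH>

    <COLUMN name="from">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@from</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>40</LENGTH>
    </COLUMN>

    <COLUMN name="to">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@to</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>40</LENGTH>
    </COLUMN>

    <COLUMN name="strokecolor">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@strokecolor</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>20</LENGTH>
    </COLUMN>

    <!-- assumed source of the style variable tested for flip:y -->
    <COLUMN name="style">
      <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@style</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>200</LENGTH>
    </COLUMN>
  </TABLE>
</SXLEMAP>
```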

The Dataset

Usage Example: Annotate dataset

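/* xyPattern: a regular expression id, presumably compiled earlier with prxparse; */
/* capture buffers 1 and 2 hold the x and y values */
if prxmatch(xyPattern, from) then do;
   function = 'move';
   x = input(prxposn(xyPattern, 1, from), 10.);
   /* y is negated; for lines flagged flip:y it is taken from the other end point */
   if prxmatch('/flip:y/', style) then
      y = -1 * input(prxposn(xyPattern, 2, to), 10.);
   else
      y = -1 * input(prxposn(xyPattern, 2, from), 10.);
   output;
end;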
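The fragment above shows only the 'move' half of each line. A fuller DATA step might look like the sketch below, which is a reconstruction under stated assumptions rather than the original program. It assumes the coordinates are plain integer pairs such as "10,80", matches from and to separately, and swaps the two y values for lines flagged flip:y. The input dataset lines and its values are invented for illustration.

```sas
/* Hypothetical stand-in for the dataset read through the XMLMap */
data lines;
   length from to $20 style $60;
   infile datalines dsd;
   input from to style;
   datalines;
"0,0","100,50","position:absolute"
"10,80","90,20","position:absolute;flip:y"
;
run;

data anno;
   length function $8 xsys ysys $1;
   retain xyPattern;
   set lines;

   /* Compile the x,y pattern once; two capture buffers (assumed integer pairs) */
   if _n_ = 1 then xyPattern = prxparse('/(-?\d+),(-?\d+)/');

   /* Use data coordinates for both axes */
   xsys = '2';
   ysys = '2';

   /* Start point */
   if prxmatch(xyPattern, from) then do;
      x1 = input(prxposn(xyPattern, 1, from), 10.);
      y1 = input(prxposn(xyPattern, 2, from), 10.);
   end;
   else delete;

   /* End point */
   if prxmatch(xyPattern, to) then do;
      x2 = input(prxposn(xyPattern, 1, to), 10.);
      y2 = input(prxposn(xyPattern, 2, to), 10.);
   end;
   else delete;

   /* Flipped lines: swap the y values of the two end points */
   if prxmatch('/flip:y/', style) then do;
      tmp = y1;
      y1 = y2;
      y2 = tmp;
   end;

   /* y is negated, as in the fragment above */
   function = 'move'; x = x1; y = -1 * y1; output;
   function = 'draw'; x = x2; y = -1 * y2; output;

   keep function xsys ysys x y;
run;
```

The resulting dataset has two observations per line and could be passed to a SAS/GRAPH procedure through its ANNOTATE= option.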

Plotted in SAS

Contact Information

Larry Hoyle

Policy Research Institute, University of Kansas

LarryHoyle@ku.edu

http://www.ku.edu/pri/ksdata/sashttp/sugi31