Multi Dimensional Markup -...

Multi Dimensional Markup

Richard de Jong 9516697 [email protected] Geus 9658556 [email protected]

June 30, 2005

This document is the thesis of the Bachelor project of the Bsc Artificial Intelligence of theUniversity of Amsterdam by Joris Geus and Richard de Jong. This project is supervised byMaarten Marx and Valentin Jijkoun of ILPS, the Information and Language Processing Sys-tems group of the Informatics Institute of the University of Amsterdam.

Division of labor:

Joris was responsible for the programming and has written sections 6 and 7.Richard was responsible for creating and running the test scripts, and for typesetting this doc-ument. He has also documented the code. He has written section 4 and 8.The other sections have been written iteratively by Richard and Joris.

1

Abstract

This paper describes the implementation of Extended XPath performed by the authors. It givesan introduction to XML, XPath, the GODDAG and overlapping markup. It shows the need tobe able to work with documents containing overlapping markup (called concurrent XML), andthe need to be able to query such documents. It describes Extended XPath, a way to querysuch documents, and shows a way to implement this. It evaluates this implementation usingvalidation and performance tests. It is shown that, although parsing and loading documents israther slow, Extended XPath queries are answered in linear time.

Contents

1 Introduction 41.1 Concurrent XML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.2 Component documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.3 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Research Goal 7

3 Method 8

4 Main results 9

5 Terminology 11

6 Data structure for Concurrent XML 126.1 Conceptual . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126.2 Mathematical . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

7 Extended XPath 177.1 Conceptual . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187.2 Mathematical . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

8 Tests and results 228.1 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

8.2.1 Query Complexity of XML::XPath . . . . . . . . . . . . . . . . . . . . . . 238.2.2 Query Complexity of CXML::CXPath . . . . . . . . . . . . . . . . . . . . 258.2.3 Data Complexity of XML::XPath . . . . . . . . . . . . . . . . . . . . . . . 278.2.4 Data Complexity of CXML::CXPath . . . . . . . . . . . . . . . . . . . . . 28

8.3 Ease of use . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298.3.1 ILPS data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298.3.2 Iacob’s data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318.3.3 QLDB data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

9 Conclusion 32

10 Discussion 33

A Iacob’s data 35A.1 Original files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

A.1.1 text1.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35A.1.2 text.dtd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35A.1.3 condition.dtd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35A.1.4 physical.dtd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

A.2 Standoff files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36A.2.1 base.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36A.2.2 text.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36A.2.3 condition.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

1

A.2.4 physical.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

B Results of validation tests 38B.1 Iacob’s example query # 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38B.2 Iacob’s example query # 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38B.3 Iacob’s example query # 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38B.4 //u[@n=”1”]/xancestor::* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38B.5 //sentence[@n=”13”]/xdescendant::* . . . . . . . . . . . . . . . . . . . . . . . . . 39B.6 //sentence[@n=”13”]/xdescendant or self::* . . . . . . . . . . . . . . . . . . . . . 39B.7 //u[@n=”1”]/xancestor or self::* . . . . . . . . . . . . . . . . . . . . . . . . . . . 40B.8 //u[@n=”1”]/xfollowing::* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40B.9 //u[@n=”2”]/xpreceding::* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41B.10 //sentence[@n=”13”]/following overlapping::* . . . . . . . . . . . . . . . . . . . . 42B.11 //sentence[@n=”14”]/preceding overlapping::* . . . . . . . . . . . . . . . . . . . 42B.12 //sentence[@n=”13”]/overlapping::* . . . . . . . . . . . . . . . . . . . . . . . . . 42B.13 //sentence[@n=”13”]/xancestor or overlapping::* . . . . . . . . . . . . . . . . . . 42B.14 //sentence[@n=”13”]/xdescendant or overlapping::* . . . . . . . . . . . . . . . . 43B.15 //sentence[@n=”13”]/intersect::* . . . . . . . . . . . . . . . . . . . . . . . . . . . 43B.16 //u[@n=”1”]/direct following::* . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43B.17 //u[@n=”2”]/direct preceding::* . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

C Documentation 45C.1 CXML::CXPath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

C.1.1 DESCRIPTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45C.1.2 DETAILS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45C.1.3 API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45C.1.4 AUTHOR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

C.2 CXML::CXPath::Node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46C.2.1 DESCRIPTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46C.2.2 DETAILS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46C.2.3 AUTHOR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

C.3 CXML::CXPath::Parser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46C.3.1 DESCRIPTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46C.3.2 DETAILS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46C.3.3 API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47C.3.4 AUTHOR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

C.4 CXML::CXPath::Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47C.4.1 DESCRIPTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47C.4.2 DETAILS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48C.4.3 API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48C.4.4 AUTHOR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

C.5 CXML::CXPath::Base . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49C.5.1 DESCRIPTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49C.5.2 DETAILS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49C.5.3 API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49C.5.4 AUTHOR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

C.6 CXML::CXPath::BaseOrder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49C.6.1 DESCRIPTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49C.6.2 DETAILS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

2

C.6.3 API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49C.6.4 AUTHOR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

C.7 CXML::CXPath::DocumentOrder . . . . . . . . . . . . . . . . . . . . . . . . . . . 50C.7.1 DESCRIPTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50C.7.2 DETAILS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50C.7.3 API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50C.7.4 AUTHOR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

List of Tables

1 Query Complexity of XML::XPath parent/child . . . . . . . . . . . . . . . . . . . 242 Query Complexity of XML::XPath ancestor/descendant . . . . . . . . . . . . . . 253 Query Complexity of CXML::CXPath non-x-axes . . . . . . . . . . . . . . . . . . 264 Query Complexity of CXML::CXPath x-axes . . . . . . . . . . . . . . . . . . . . 275 Data Complexity of XML::XPath . . . . . . . . . . . . . . . . . . . . . . . . . . . 286 Data Complexity of CXML::CXPath . . . . . . . . . . . . . . . . . . . . . . . . . 29

List of Figures

1 Example of a GODDAG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 Conceptual drawing of a Base object . . . . . . . . . . . . . . . . . . . . . . . . . 143 Connection XML trees with a base object . . . . . . . . . . . . . . . . . . . . . . 154 Query Complexity of XML::XPath parent/child . . . . . . . . . . . . . . . . . . . 245 Query Complexity of XML::XPath ancestor/descendant . . . . . . . . . . . . . . 256 Query Complexity of CXML::CXPath non-x-axes . . . . . . . . . . . . . . . . . . 267 Query Complexity of CXML::CXPath x-axes . . . . . . . . . . . . . . . . . . . . 278 Data Complexity of XML::XPath . . . . . . . . . . . . . . . . . . . . . . . . . . . 299 Data Complexity of CXML::CXPath docsize vs parsetime . . . . . . . . . . . . . 3010 Data Complexity of CXML::CXPath docsize vs querytime . . . . . . . . . . . . . 30

3

1 Introduction

In this section we first give a short introduction to the field of Concurrent XML, then we describethe outline of this thesis.

1.1 Concurrent XML

XML[1] stands for Extensible Markup Language. It is a markup language much like HTML[11].XML can be used to describe meaning and structure of data, or add meaning and structure todata, using tags that are user definable. Consider the following fragment of text which will serveas our running example:

Here is the text. Hereis some more text.

You can use XML to annotate the sentences:

<DOC>

<PARAGRAPH id="par1">

<SENTENCE id="s1"><W>Here</W><W>is</W><W>the</W><W>text.</W></SENTENCE>

<SENTENCE id="s2"><W>Here</W><W>is</W><W>some</W><W>more</W><W>text.</W></SENTENCE>

</PARAGRAPH>

</DOC>

or to annotate the lines:

<DOC>

<PAGE id="pg1">

<LINE id="l1">Here is the text. Here</LINE>

<LINE id="l2">is some more text.</LINE>

</PAGE>

</DOC>

Both of these XML documents are valid. Now if you wanted to combine these different “anno-tations” in one XML document you get

<DOC>

<PARAGRAPH id="par1">

<PAGE id="pg1">

<LINE id="l1">

<SENTENCE id="s1"><W>Here</W><W>is</W> <W>the</W><W>text.</W></SENTENCE>

<SENTENCE id="s2"><W>Here</W></LINE>

<LINE id="l2"><W>is</W><W>some</W><W>more</W><W>text.</W></SENTENCE>

<LINE id="l2"><W>is</W><W>some</W><W>more</W><W>text.</W></LINE>

</PAGE>

</PARAGRAPH>

</DOC>

This is not a valid XML document: the LINE element is closed before the SENTENCE elementinside it is closed. It is no longer possible to create a tree representation of this document sothat every leaf node has one parent and the leaf nodes are in document order. This conflictingmarkup is known as overlapping markup[13]. Whilst it is not a valid XML document, it can berepresented using Concurrent XML.

Overlapping markup can occur in any application where markup is being merged. It is pos-sible to adapt data to fit the current XML processing model, but that can change the semanticsof the markup. A number of techniques exist for handling overlapping markup. Most involvethe introduction of special tags that restore the tree-like structure of the XML document.

Some researchers handle overlap by deviating from the tree data model for representingstructured data perhaps even abandoning XML altogether. Our work (implementing Extended

4

XPath in Perl) is based on an article of Ionut Emil Iacob[6] on an ‘XPath Extension for QueryingConcurrent XML Markup’[7].

In our example we can use the overlap to find interesting things. We want to be able to askquestions like:

• “Give me all sentences on the first line.”

• “Give me all sentences that stretch more than one line.”

This brings us to the subject of XPath[2]. XPath is a language for addressing parts of anXML document, based on its tree structure. It uses path expressions to specify the locations ofstructures and data within an XML document. When deviating from standard XML in orderto accommodate for overlap, XPath should be modified as well. Ionut Emil Iacob implementeda data structure for concurrent XML and defined an extension to XPath with 11 new axes. Hecalls this “Extended XPath”. His work is based on GODDAG’s[12], a way to represent multipleconnected trees, which we will describe later.

The two questions above cannot be written using normal XPath expressions. However, thenew axes we implement make this possible. Using these new axes, the questions can be writtenas:

• //LINE[@id=“l1”]/xdescendant::SENTENCE

• //LINE/overlapping::SENTENCE

1.2 Component documents

Next we introduce the component documents used for the Question-Answering (QA) applicationat ILPS[8]. This application needs to use Concurrent XML and Extended XPath. Componentdocuments are the XML documents that together form concurrent XML. These componentdocuments are related: a set of component documents describes the same piece of information.All documents do this from a different point of view. The piece of information described maybe an article in a journal, a book, a timetable or anything else that can be written as a text file1.

One of our component XML documents is the base document. This document containsan inline annotation of the source document2. This means that all the text that is availableis contained in this document. In our case, there is a tag for every sentence with an ‘id’attribute that has a unique value. All of the other component XML documents contain standoffmarkup. They have elements containing the attribute ‘cxmlBase’ whose value is a URI. EachURI evaluates to a sentence in the base document using the ‘id’ attribute of the sentence foraddressing. Elements with a ‘cxmlBase’ attribute have children with ‘start’ and ‘end’ attributes.The values of these attributes are to be interpreted as the character offsets on the content ofthis sentence (the sentence where the URI in the ‘cxmlBase’ attribute of the parent evaluatesto).

Here is an example from the QA dataset. The base document contains lines like this:

<SENT id="s19">

<TEXT>Menusuggestie voor zondag : vooraf : runderbouillon

met tuinkruiden ; hoofdgerecht : Wiener Tafelspitz , roomsaus

met mierikswortel , gesmoorde bleekselderij , gekookte

aardappelen ; dessert : flensjes met abrikozenmarmelade .

</TEXT>

</SENT>

1Binary data is excluded, because XML cannot handle this2See section 5 for an explanation of this and other terms

5

Here the original text was broken up in sentences and each sentence now has an ‘id’ attribute.Now we have a parser that extracts time expressions from a document and produces a documentcalled timex.xml with elements like these:

<SENT cxmlBase="xml:base//SENT[@id=’s19’]/TEXT">

<TIMEX id="time67" pid="s19" start="19" type="05DOW" end="25"

val="1995-12-03" span="zondag">1995-12-03</TIMEX>

</SENT>

Here we see the use of standoff annotation. We see a SENT element with a ‘cxmlBase’attribute whose value is a URI pointing to the TEXT element (with id=’s19’) from the base.xmldocument. It has a child which is a TIMEX element with offset attributes ‘start’ and ‘end’ withthe respective values ‘19’ and ‘25’. Apparently this TIMEX element annotates the word ‘zondag’which begins at character 19 and ends at character 25 in sentence ‘s19’.

As a second example, we now show our example, rewritten to standoff annotation. As can beseen, a new base document is created containing all text. The two other files no longer containthe text itself, but only refer to positions in the base document.The base document is:

<DOC>Here is the text. Here is some more text.</DOC>

The structural document is:

<DOC>

<PARAGRAPH id="par1" cxmlBase="xml:base‘//DOC’">

<SENTENCE id="s1" start="1" end="17">

<W start="1" end="4">



<W start="13" end="17">

</SENTENCE>

<SENTENCE id="s2" start="19" end="41">

<W start="19" end="22">

<W start="24" end="25">

<W start="27" end="30">

<W start="32" end="35">

<W start="37" end="41">

</SENTENCE>

</PARAGRAPH>

</DOC>

And the physical document looks like:

<DOC>

<PAGE id="pg1" cxmlBase="xml:base‘//DOC’">

<LINE id="l1" start="1" end="22"/>

<LINE id="l2" start="24" end="41"/>

</PAGE>

</DOC>

Now that we have given a quick tour of Concurrent XML and standoff markup, we continueby describing the structure of the rest of this thesis.

1.3 Outline

In section 2 we will describe the research goals of our project. Section 3 describes the stepswe have taken to complete the goals. In section 4, we describe what has been made. Section5 describes a number of terms we use. In section 6 a datastructure for Concurrent XML ispresented. Extended XPath is explained in section 7. Section 8 describes how we tested ourimplementation. The conclusions are presented in section 9 and the discussion is found in section10.

6

2 Research Goal

The main goal of this project is to see if it possible to create a hybrid system, that uses relationsbetween nodes in a tree to answer regular XPath questions, but uses relations between the yieldsof nodes to answer Extended XPath questions.

Answer this goal by implementing concurrent XML and Extended XPath based on the ideaof a GODDAG, as proposed by Iacob. Do this on top of the implementation of XPath in Perl.

When used on a text corpus that has concurrent markup, this implementation should beable to answer questions such as:

• Find all noun phrases overlapping with temporal expressions

• Find all persons with temporal expressions in the same sentence

• Find all sentences that do not stretch more than one line

The second goal of this project is to further improve the current implementation of ExtendedXPath developed at ILPS[8], focussing on efficiency, scalability, modularity and documentation.

To give a brief hint on the results, the main goal has been completed. The second goal hasonly partly been completed. More details on this can be found in the discussion, see section 10.

7

3 Method

The method we have used to solve these goals is composed of several steps:

1. Read and analyze papers on the subject.

2. Analyze and document the existing implementation.

3. Discuss strengths and weaknesses of the existing implementation.

4. Define how to implement Extended XPath:

(a) Define the datastructure.

(b) Write pseudocode for the new axes.

5. Perform the programming to implement the datastructure and the new axes.

6. Validate the output of the program by comparing the answers to those of Iacob’s im-plementation[6].

7. Evaluate the performance of this system by running the experiments of Gottlob et al.[5].

8. Discuss the results.

8

4 Main results

The main result of our project, is a Perl module, that implements Extended XPath as definedby Iacob[7]. It is based on previous work by Valentin Jijkoun of ILPS[8]. It is written such thatit can handle the data that ILPS uses.

The module can be found here [9]. The zip file found there contains:

• The Perl module itself (a set of .pm files)

• Two scripts that show how to use the module.

• A number of scripts that let one determine the performance of the module.

• Three data sets:

– CLEF-QA, a set of articles from the Algemeen Dagblad, tagged by a named-entityparser, a chunk parser and a timex parser.

– Framenet, an example of data used in Framenet.

– Iacob’s data rewritten to standoff annotation (see section A).

• A README file.

The Perl module is capable of reading these data sets and answer queries about them. Wewill show how to do this, by showing part of one of the test scripts:

01 #!/usr/bin/perl -w

02 # This script runs all queries that are used for testing in our thesis.

03 use strict;

04 use lib ’../’;

05 use CXML::CXPath;

06 use XML::XPath::Node::Text;

07

08 my $dir = ’../Iacob’;

09

10 # What’s the question to answer?

11 my $query;

12

13 my $xp = CXML::CXPath->new(’xml:base’ => "$dir/base.xml",

14 ’xml:text’ => "$dir/text.xml",

15 ’xml:condition’ => "$dir/condition.xml",

16 ’xml:physical’ => "$dir/physical.xml",

17 );

18 print STDERR "finished parsing xml\n";

19

20 my @queries = ( ’//page[@n="1"]/line[@n="21"]/xdescendant::w’,

21 ’//u[@n="1"]/xancestor::*’,

22 ’//sentence[@n="13"]/xdescendant::*’,

23 ’//sentence[@n="13"]/overlapping::*’,

24 );

25

26 foreach $query (@queries)

27 {

28 print "\n\n==========================================================

===================\n\nNow running query: ", $query, "\n\n";

29 my @nodes = $xp->findnodes($query);

30 my $norecurse = 1;

31 foreach my $node (@nodes) {

32 print $node->toString($norecurse);

33 print "\n";

34 }

35 }

9

What happens is this:Line 05 tells Perl to use our module.On line 13-17 four files are read and parsed, so they are available in memory.Line 18 prints a messages when the reading and parsing is done.Line 20 creates an array of 4 Extended XPath questions.These questions are all handled consecutively by the loop on lines 26-35:Line 28 prints a line of =’s and the question at hand, so one can see the question in the results.Line 29 uses findnode to find the answers to the questions. It stores this nodeset in an array.Lines 31-34 loop over the nodeset, the nodes are converted to strings and printed.

The result of running this file is:

finished parsing xml

=============================================================================

Now running query: //page[@n="1"]/line[@n="21"]/xdescendant::w

<w n="4" start="17" end="23"></w>

<w n="5" start="25" end="28"></w>

<w n="6" start="30" end="31"></w>

=============================================================================

Now running query: //u[@n="1"]/xancestor::*

<doc id="CP56483" cxmlBase="xml:base//doc[@id=’CP56483’]"></doc>



<sentence n="13" start="1" end="154"></sentence>



<page n="1" start="1" end="187"></page>

=============================================================================

Now running query: //sentence[@n="13"]/xdescendant::*

<w n="1" start="1" end="5"></w>

<w n="2" start="7" end="11"></w>

<w n="3" start="13" end="15"></w>

<w n="4" start="17" end="23"></w>

<w n="5" start="25" end="28"></w>

<w n="6" start="30" end="31"></w>





<line n="20" start="1" end="15"></line>




=============================================================================

Now running query: //sentence[@n="13"]/overlapping::*


More information on the code is found in Appendix C and in the source code itself [9].

10

5 Terminology

The purpose of this section is to provide a consistent terminology to be used throughout thisdocument.

Source document: The text that is being annotated, this text is often called content.

Component documents: The XML documents that are taken as the input for ConcurrentXML.

Component XML trees: XML trees of the component XML documents.

CXML::CXPath The Perl module that implements Extended XPath written by Joris Geusand Richard de Jong based on earlier work at ILPS[8].

Inline annotation: An XML document that contains text and markup (tags). Normal way torepresent data.

Standoff annotation: An XML document that conforms to XML encoding standards, but thesource content being annotated may reside in a separate document. The markup pointsto, rather than wraps around, the relevant data content.

XML::XPath The Perl module that implements XPath.

XML tree: The tree representation of an XML document

XPath Data Model: The XML tree as built by XPath.

11

6 Data structure for Concurrent XML

This paragraph proposes a data structure for representing Concurrent XML. It does so at threedifferent levels, the conceptual level, the mathematical level and the implementation level. Thedescription of our data structure is incremental, first we discuss how documents merge into athree dimensional structure, then we will further explain this structure introducing a mathe-matical graph. Next we will deviate from this structure by introducing a hybrid design of treesand bases and eventually we explain our data structure as it is implemented.

6.1 Conceptual

As is explained in the introduction, XML as a data structure for annotating data is not powerfulenough. It cannot handle the arbitrary combinations of markup that stems from combining validXML documents. We need a new data structure to store all annotations for one common sourcedocument.

The conceptual data structure for an XML document is like a tree. XML tags become nodesin the tree, the most outer tag becomes the root node, enclosing tags become the parent ofthe tags they enclose. If several XML documents all annotate the same source document in aninline manner each corresponding XML tree will have the same content at it’s leaf (text) nodes.We define our conceptual model for the concurrent XML data structure as: a set of nodes thatrepresent XML elements, where the XML trees representing the component documents exist inparallel and share one set of common leaf nodes and a new common root node. These leaf nodesrepresent the content of the source document. There are now several paths from the new rootnode to the leaf nodes, and back up again. Different paths represent different annotations. Thenew data structure can be thought of as a three dimensional tree where leaf nodes can haveparents in multiple component documents.

6.2 Mathematical

A GODDAG is a General Ordered-Descendant Directed Acyclic Graph[12]. It can be used torepresent a set of trees. As with the tree view of XML documents, leaf nodes represent thecharacter data content of the document, non-terminal nodes represent elements. The “GeneralOrdered-Descendant” in the name means that the set of leaf nodes have a general order, theorder is such that the character data content they represent is in document order. Normally,this is just the order in which the text is read. The GODDAG model is a superset of the XMLtree model, so that every XML document can be represented in a GODDAG that is equivalentto its XML tree.

A GODDAG for our example can be found in figure 1. Here you can recognize the twoXML trees. At the top is the tree representation of our document for annotating paragraphs,sentences and words. At the bottom the tree representation of the document for annotatingpages and lines is drawn upside down. They are connected by a set of new leafnodes. These aredrawn in dark blue.

In the GODDAG overlapping nodes share children. For instance in our example, LINE l1is overlapping with SENTENCE s2 because they share the new leafnode containing the word‘Here’.

A GODDAG can be constructed from several XML trees by adding a common root and con-necting the trees at their common leaf nodes, giving these leaf nodes multiple parents. SeparateXML trees in general do not have common leaf nodes, these need to be constructed first. Ifall component documents annotate the source document inline the construction of common leafnodes is straightforward. The newly constructed common leaf nodes are children of the original

12

Figure 1: Example of a GODDAG

leaf nodes, such that each represents a sequence of consecutive characters that is not broken byan XML tag in any of the component documents. See the dark blue nodes in figure 1 above. Ifthe component documents use standoff annotation, the construction of common leaf nodes is alittle different but the GODDAG stays the same.

The next level of complexity in the data structure is the use of objects instead of nodes forconnecting XML trees. Here we deviate from the mathematical GODDAG and have a hybriddata structure. Unlike proper nodes general objects can store all sorts of information. Theobjects we use for connecting the trees are called bases. We divide the source content over thebases by storing a reference to a TEXT in each base. This TEXT node contains a part of thesource content. Each base refers to a TEXT node with a sequence of consecutive characters

13

which does not overlap with that of any of the TEXT nodes the other bases refer to. Therecan be overlap within one base but not between any two bases. Overlap means that differentnodes partially refer to the same content. Overlap will be discussed in more detail in section7. Besides storing the TEXT node we also store a list of tuples 〈node, begin, end〉 in each base.The tuples represent nodes in any of the component trees that annotate a piece of the contentin the TEXT node of this base. The offsets begin and end are character positions in the textfrom the TEXT node stored in the base.

6.3 Implementation

There is a good implementation of XML XPath in Perl. This implementation aims to complyexactly to the XPath specification [2]. A CXPath Perl module that implements a GODDAG-likedata structure using the Perl XPath implementation has already been created at ILPS. We basethe rest of our work on this CXPath Perl module.

With Perl XPath one can parse an XML document and build a tree representation of it inmemory, called the XPath Data Model. From this point on we will call the tree representation ofan XML document the ‘XML tree’. With Perl XPath we can parse n XML documents and thusmake n XML trees. Instead of making one GODDAG we make n XML trees and a separate datastructure to join them together. We call this separate data structure ‘base’. A base contains

• a reference to a TEXT node with a sequence of consecutive characters from the originalcontent document.

• a list of nodes whose markup points to the content of this base and their respective offsets,

Figure 2 shows a conceptual drawing of a base object. The right hand side of the drawingdepicts the reference to a TEXT node of the source document. The left hand side depicts thelist containing nodes with their associated begin and end value. Here there are thee such nodesbut the list can have any length. The arrows point to the nodes the references refer to.

Figure 2: Conceptual drawing of a Base object

In Figure 3 we show how to connect two XML trees with a base object. The XML tree tothe right comes from our running example. The XML tree to the left represents the followingXML document that uses standoff annotation for adding BOLD and ITALIC tags.

14

Figure 3: Connection XML trees with a base object

<DOC>

<LINE cxmlBase="l1">

<BOLD begin="6" end="7"/>

<ITALIC begin="9 end="11"/>

<BOLD begin="13" end="16"/>

</LINE>

</DOC>

The Base stores a ref to the TEXT node with id “l1”. This is shown by the big arroworiginating from the right box in the conceptual drawing of the base object. The base alsostores a list of nodes that annotate text from TEXT node “l1”, together with the characterpositions begin and end. This is drawn as the three big arrows originating from the threereferences at the left side of the object. In addition to base objects pointing to nodes in thetrees, nodes in the trees store a reference to the base object (that holds a ref to a TEXT nodethey annotate on).

There can be more then one base. The content of the source document is distributed overall bases. The set of bases is ordered such that the in order concatenation of the content of theindividual bases is in source document order. If a node representing a sentence from the sourcedocument is not annotated by any of the nodes in the XML trees that represent the documentswith standoff annotation then there is no need to create a base for that node. As a consequence,

15

the set of bases does not need to contain the complete content of the source document, individualsentences can be missing.

The specific implementation is closely related to the structure of the component XML doc-uments as described in the introduction. We feed our software a number of component XMLdocuments with the structure as described in the introduction, with the first document be-ing the base document. After all component documents have been processed we have a datastructure in memory that consists of the individual XML trees and the collection of base objects.

Here is how we construct the bases. For each component document the software uses PerlXPath to create an XML tree. Next, this tree is traversed, from the root down to the leaf nodes,by following the child relation. If a node X has a ‘cxmlBase’ attribute the URI in it’s value isevaluated to the corresponding node in the base document. A base object is created containingthe corresponding node and a reference to node X. We also add a reference from node X to itsbase. Of course if that base already exists we only add the references to it. The children ofnode X will have offset attributes on the same base. In all these child nodes we store a referenceto their base, and in the corresponding base we store references to these nodes as well as thevalues of their offsets. In the process of creating base objects we keep track of their order ina separate ‘BaseOrder’ object. Now we have the data structure in memory which consists ofthe individual XML trees and the collection of base objects with references from the base to allnodes annotating it’s content and references from these nodes to their base.

Not all nodes of the component XML trees are part of the data structure for concurrentXML. This is especially so for the component XML trees that represent the documents withstandoff annotation. For those trees, only the nodes with the proper attributes (’cxmlBase’,‘start’ and ‘end’) are listed in the base objects. This supports intuition since some nodes ina document with standoff annotation only serve a structuring role. They have no meaning inthe context of the source document and therefore should not be part of the concurrent XMLdata structure. The ancestor and sibling relation between nodes in component documents withstandoff annotation, as well as their corresponding XPath axes, have no meaning in the contextof concurrent XML. Siblings in standoff XML can have an xancestor relation between them inthe concurrent XML data structure.

The data structure does not require that all words or character sequences in the contentof the source document are annotated. Strings of content text that are not annotated are notstored in the base objects. We have discussed the idea of adding extra nodes to the base objectsto annotate these unannotated strings. See section .

When is a set of XML documents still concurrent? In theory every node in any of the XMLtrees can have a ‘cxmlBase’ attribute whose value is an URI that evaluates to any node in anyof the XML trees. In order for all component documents to annotate the same source documentwe need a set of common bases. This can be achieved by taking all the bases from the samedocument that annotates the source document in an inline manner.

16

7 Extended XPath

XPath is a language for addressing parts of an XML document. It uses path expressions tospecify the locations of structures and data within an XML document. This language needsto be extended to be able to work on the CXPath data structure. Since the whole idea ofconcurrent XML is to have different annotations on the same text, we can create new XPathaxes to select nodes in one component tree based on nodes from another component tree throughtheir common base. The extension to XPath that we are implementing has been described byIacob in [7].

The new axes of the extension to XPath are useful in the application of QA systems. Thefollowing example questions show there is a use for combining the annotations of different com-ponent documents, because they relate named entities to TIMEX elements for instance.

• Find all noun phrases overlapping with temporal expressions:“//CHUNK[@type=”NP”]/overlapping::TIMEX”

• Find all persons with temporal expressions in the same sentence:“//SENTENCE/xdescendant::[NE[@type=”person”] and TIMEX]”

• Find all sentences that do not stretch more than one line:“//SENTENCE[not(overlapping::LINE)]”

Below is a listing of the new axes. They are grouped by type of overlap. For all axes,except intersect, direct following and direct preceding, their definitions can be found in Iacob’spaper[7]. The definitions for the other axes can be found in the Mathematical section below.The axes will be discussed in depth in the following paragraphs.

• containment

– xancestor– xdescendant– xancestor or self– xdescendant or self

• overlap

– following overlapping– preceding overlapping– overlapping– intersect

• order

– xfollowing– xpreceding– direct following– direct preceding

• hybrid

– xancestor or overlapping– xdescendant or overlapping

17

7.1 Conceptual

In a tree that represents a single XML document, there are relations between nodes such asparent and child or ancestor and descendant. These relations stem from the tree data repre-sentation. One could define these relations as being based on the yield of a node. The yieldis the collection of leaf nodes that are reachable from a node by following the child reference.Two nodes may have different yields. The yield of one node could be completely contained inthe yield of another node, in this case the second node is an ancestor of the first. Or the yieldof a node could completely contain the yield of another node, in this case the second node is adescendant of the first.

In the GODDAG structure representing multiple XML documents, this notion of relationsthrough yields becomes useful. Now the yield of one node could have an ancestor that is anode generated from a different XML document. With this notion of comparing yields the XMLaddressing language XPath can be extended to work on a GODDAG. Since a GODDAG allowsfor concurrent markup, yields of different nodes can now also display all kinds of overlap notpossible in a tree representation of one XML document.

In our running example, SENTENCE s1 is contained in LINE l1, LINE l1 overlaps SEN-TENCE s2, LINE l2 is following SENTENCE s1.

7.2 Mathematical

Consider the following ordering relations of concurrent annotations of nodes a and b. The yieldsof a and b are notated with dots. We will describe our new axes as relations between yields.

containment

1: <a>.................</a>..................................

2: <a>.................</a>..................................

3: <a>......................</a>..................................

4: <a>..................................</a>..................................

If element a represents the context node then in all the above concurrent annotations theyield of a is completely contained by the yield of b. We call this axis xdescendant, a is thexdescendant of b.

If b were the context node then in all the above concurrent annotations the yield of b fullycontains the yield of a. We call this axis xancestor, b is the xancestor of a. In plain XML thexancestor and xdescendant relations are equal to the ancestor and descendant relations.

overlap

5: <a>.......................</a>......................

6: <a>......................</a>.......................

18

Concurrent annotations 5 and 6 are the complement of each other. If a is the context nodethen the axis in concurrent annotation 5 is called following overlapping, b is the follow-ing overlapping of a. The concurrent annotation in 6 is called preceding overlapping, b isthe preceding overlapping of a. The axis representing the union of concurrent annotation 5and 6 is called overlapping. We call the axis that represents the union of overlap and contain-ment intersect. It can be used to find all yields that have at least one common member in thenodesets.

order

7: <a>...........</a>...........

8: <a>...........</a>...........

The yield of b follows after the yield of a, in concurrent annotation 7 b immediately follows aand in concurrent annotation 8 b follows a. The accompanying axes are called direct followingand xfollowing. If b is the context node we have the complementing axes direct precedingand xpreceding.

The hybrid axes represent unions of the other axes.We now give the formal definitions of the axes we created. As mentioned before, formal

definitions of the other axes can be found in [7].

Let D = 〈d1, . . . , dk〉 be a concurrent XML document over a concurrent markup hierarchyCMH. We define 3 new Extended XPath axes over the concurrent document D, in the con-text of a node x ∈ nodes(D): intersect, direct following and direct preceding. As for ExtendedXPath, for each axis we define the corresponding function

X : nodes(D) → 2nodes(D)

where X(x) returns the set of nodes corresponding to axis X evaluated for the context nodeof node x ∈ nodes(D)

1 intersect ::= {y ∈ nodes(D) |(start(y) ≤ end(x)∧end(x) ≤ end(y)

)∨

(start(y) ≤ start(x)∧

end(x) ≤ end(y))∨

(start(y) ≥ start(x) ∧ end(x) ≥ end(y)

)}

2 direct following ::= {y ∈ nodes(D) | start(y) = end(x) + 1}

3 direct preceding ::= {y ∈ nodes(D) | end(y) = start(x)− 1}

In analogy with the chapter on the data structure we will first consider a pure GODDAGconstructed from component documents that annotates a common document in an inline manner.In the GODDAG the yield of a context node is the nodeset of all leafnodes that are reachablefrom the context node by following the child relation. These nodesets are ordered. Overlapbetween nodes is defined as common membership in the nodesets of those nodes. In this modelevaluating an axis for a context node means following all child relations until all nodes areleafnodes to construct the yield nodeset. Now all members of the yield nodeset are tested formembership of nodesets that represent the yields of other nodes. This is done by followingthe parent relation of the nodes. In the GODDAG the leafnodes can have multiple parents. A

19

typical axis evaluation would require several traversals of the GODDAG from nodes to leafnodesand up again.

Now we consider Extended XPath on our data structure that builds a sort of GODDAG,based on standoff annotation by adding a series of bases. The yield of a node is now defined asa tuple 〈base, begin, end〉. Yields on the same base can be compared by comparing the beginand end values of nodes.

Let start(x) be the start and end(x) the end from the tuple 〈base, start, end〉 for node xand start(y) be the start and end(y) the end from the tuple 〈base, start, end〉 for a node y.Evaluating the axis xancestor for node x would now become the following expression:

xancestor ::= {y | start(x) <= start(y) ∧ end(x) >= end(y)}

We have used definitions like this for every axis. We found that comparing numbers is a muchcheaper operations than traversing trees.

In our definition of the base we stated that “each base annotates a sequence of consecutivecharacters that does not overlap that of any of the other bases. There can be overlap within onebase but not between any two bases”. As a consequence the evaluation of many of our axes canbe restricted to testing for those begin and end values of nodes in the same base. Only for theaxes xfollowing, xpreceding, direct following and direct preceding do we need to look beyondthe base of the context node.

For xfollowing we need to find all nodes that follow after the context node within the base ofthe context node, and we need to find all nodes whose bases follow after the base of the contextnode. For xpreceding we need to find all nodes that precede the context node within the base ofthe context node united with all nodes whose bases precede the base of the context node. Withdirect following if the end value of the yield of the context node is equal to the length of thebase, in which case the end of the yield coincides with the end of the base, every node whoseyield begins at position 0 in the next base is an direct following of the context node. If the endvalue of the yield of the context node is smaller then the length of the base, all nodes whoseyield have a begin value that is equal to the end value of the yield of the context node +1 are andirect following. This +1 denotes that the context node and its direct following nodes do notoverlap. In practice this +1 offset is not always correct due to whitespace. The direct precedingaxis is like the direct following axis, but here the special case is the one where the begin valueof the yield of the context node is equal to 0. If so, all nodes in the preceding base whose yieldshave an end value equal to the length of that base are direct preceding nodes. The general caseis the one where all nodes whose yield have an end value that is equal to the begin value of thecontext node −1 are direct preceding nodes. Here the same practical problem with whitespaceexists.

7.3 Implementation

The implementation closely follows the definitions of the axes as presented above. We haveseparate XML trees and we added some extra data members to all the nodes. Every node that hasa base holds a reference to that base. Also, every node holds a reference to a ‘DocumentOrder’object and a BaseOrder object. The DocumentOrder object is used in sorting nodesets thatresult from the evaluation of an axis. The BaseOrder object contains a list of references tobases, and this list is in document order. We use this BaseOrder object in the evaluation ofxfollowing, xpreceding, direct following and direct preceding.

What happens if we evaluate a new axis on our implementation? A context node that has anyoverlap will have a base, since that is where the overlap takes place. This base is referred to in adata member of the node so the base is immediately available. The base contains a list of all the

20

nodes that annotate part of the content of the TEXT node referred to from this base. All overlap(except the special cases xfollowing, xpreceding, direct following and direct preceding which willbe discussed next) takes place between the yields of the nodes from this list. Evaluating an axisboils down to iterating over the list, testing for numerical relations between the begin and endvalues of the yield of the context node and those of the yields of the nodes from the list.

In any of the axes xfollowing, xpreceding, direct following and direct preceding a contextnode has a reference to its base in a data member. It also has a reference to the BaseOrderobject in a data member. This BaseOrder object has methods getFollowing(), getPreceding(),getNext() and getPrevious() that all take a reference to a base as its argument and respectivelyreturn all following bases, all preceding bases, the next base and the previous base. With thesedata members and the methods of the BaseOrder object, every node has access to all the neces-sary bases for evaluating the axes xfollowing, xpreceding, direct following and direct preceding.

21

8 Tests and results

This section describes the tests executed to validate the answers, to evaluate the performanceof the implementation of Extended XPath, and the results of these tests.

8.1 Validation

To validate the correctness of our system, we process a set of Extended XPath queries usingour implementation and Iacob’s. We expect to receive the same results. We assume that bothimplementations handle regular XPath queries according to the specifications, and will notfurther investigate this.

To be able to compare the two implementations, we first had to rewrite Iacob’s test data toour format, so our implementation could process it. See Appendix A for details.

Iacob has prepared 3 test questions and their answers which we run on our system. To becomplete, we add queries to test all of the new axes. These can be found in the subsections below.We also test the new axes that we defined: intersect, direct following and direct preceding.

The exact queries, and the answers of both implementations are found in Appendix B. Here,we only discuss the results.

Note that CXML::CXPath uses underscores in the names of axes, while Iacob uses dashes.This is because of a limitation in Perl3. So for instance Iacob’s following-overlapping is calledfollowing overlapping in our implementation.

As can be seen from the full queries and results in Appendix B, both implementations agreefor the most part on what is the answer to a question. However, some differences are noted:

• In cases where multiple items are returned from the same component document, and thereis a parent/child or ancestor/descendant relation between these items, our implementationreturns them top down (first the parent then the child), while Iacob’s implementationreturns them bottom up (first the items on the lowest level, then those one level above,etc.). This happens with xancestor and xancestor or self.

• In the case of xpreceding and preceding overlapping, the same happens. In Iacob’s imple-mentation, items are returned ordered on smallest distance to the context element. In ourimplementation, the items are returned in document order.

• In xdescendant or overlapping and xancestor or overlapping, we notice something strange.Iacob’s implementation returns nothing, while our implementation does.

The first two differences probably can be explained by the fact that we always return thedata in document order (first ordered by component document, than by order in that document)as the specification of XPath requires, but Iacob apparently sorts his data in increasing distancefrom the context element. For xdescendant for instance, these orders would coincide, so thedifference is not noted.

The reason Iacob’s implementation returns the empty set, is probably that Iacob has notimplemented xdescendant or overlapping and xancestor or overlapping. This is very strange,since he does define these axes. As can be read in his definition of Extended XPath [7], theresult should be the union of the results of xdescendant and overlapping, so anything that iseither an xdescendant or a following should be returned. But while both separate queries doyield a result, their union does not. We will report this to Iacob.

3In Perl, it is impossible to use dashes in the name of a subroutine, only letters, digits and underscores areallowed

22

8.2 Performance

We measure how our implementation scales with growing complexity of queries, and with growingcomplexity of data. For evaluating the growing complexity of queries, we measure four items:

• regular XPath queries in XML::XPath using the parent/child axes

• regular XPath queries in XML::XPath using ancestor/descendant on one document

• regular XPath queries in CXML::CXPath using ancestor/descendant on a set of 4 docu-ments

• extended XPath queries in CXML::CXPath using xancestor/xdescendant on a set of 4documents

For evaluating the growing complexity of data, we measure two things:

• an XPath query on a growing XML document

• an Extended XPath query on a set of growing concurrent XML documents

We have used the Perl module “Benchmark” to measure the time execution of a subroutinetakes. Each query is run 10 times and the results are averaged. The experiment is run on a 1.6GHz Pentium M laptop with 512 MB of memory.

Unfortunately, it was impossible to measure the use of memory correctly. If one would wantto measure this, one has to compile Perl using the ‘-DDEBUGGING’ flag, and use the function‘mstat()’. Because of time constraints, we were unable to do this. What we can say is that someof the tests required a moderate amount of memory, but others grew beyond the limits of thetest machine very quickly. The upper bound on the test data was chosen either because it wasimpractical to wait much longer (some experiments took hours), or because physical memorywas exhausted so the machine started swapping.

8.2.1 Query Complexity of XML::XPath

As Gottlob et al. describe [5], several popular implementations of XPath follow the specificationin a rather straightforward, perhaps even naıve way. They found that in these implementations,in the worst case, query evaluation requires time exponential in the size of queries.

We are interested whether the XPath implementation in Perl is written in a similar wayas the implementations Gottlob et al. studied. To this end, we repeated the experimentof Gottlob et al.. In this experiment, a very small document is queried, the document is‘<a></a>’. Queries are constructed using a simple pattern. The first query is‘//a/b’ and the i+1-th query is obtained by taking the i-th query and appending ‘/parent::a/b’.For instance, the third query is ‘//a/b/parent::a/b/parent::a/b’.

The results are listed in Table 1 and a graph is given in Figure 4. As can be seen from thegraph, on a logarithmic scale, a straight line can be fitted with a better than 99% correlation.Hence, in the worst case the implementation of Perl requires time exponential in the size ofqueries.

For the second experiment, we use queries of the same structure as in Gottlob et al.’s exper-iment, but now using the ancestor and descendant axes. We use text.xml, one of Iacob’s filesin, standoff annotation (see Appendix A for details).

The first query is ‘//sentence[@n=”13”]’ and the i + 1-th query is obtained by taking thei-th query and appending ‘/descendant::*[self::w]/ancestor::*[self::sentence]’4.

4There is a bug in XML::XPath which causes filters to not always work. Therefore we implement the filterusing the self axis (*[self::w])

23

Iteration (i) Time to query (s)1 0.002 0.003 0.004 0.025 0.056 0.097 0.228 0.459 0.97

10 1.9411 3.7412 7.1713 14.8614 29.8015 59.4516 118.7017 237.8318 476.22

Table 1: Query Complexity of XML::XPath parent/child

Figure 4: Query Complexity of XML::XPath parent/child

The results are listed in Table 2 and a graph is given in Figure 5. As can be seen from thegraph, for the ancestor and descendant axes the implementation of Perl requires time exponentialin the size of queries, just as for the parent/child relation.

24


Table 2: Query Complexity of XML::XPath ancestor/descendant

Figure 5: Query Complexity of XML::XPath ancestor/descendant

8.2.2 Query Complexity of CXML::CXPath

We know now that in the worst case query evaluation time of XML::XPath is exponential inthe size of the query. We are interested whether our implementation performs as good or badas XML::XPath or perhaps even better or worse.

For the third experiment, we use queries of the same structure as in the last experiment,but now on all 4 documents of Iacob in standoff annotation (again, see Appendix A for details).This will reveal whether the use of bases makes our implementation faster or slower.

The first query is ‘//sentence[@n=”13”]’ and the i + 1-th query is obtained by taking thei-th query and appending ‘/descendant::*[self::w]/ancestor::*[self::sentence]. This set of queriesis chosen such that answers will always come from the same document, but that other documentswill have to be searched as well.

25

The results are in table 3, a graph is presented in figure 6. As can be seen from the graph,CXML::CXPath also needs time exponential in the size of the query. What is more, it takesmuch more time than XML::XPath. This is because the query itself is a lot more complexbecause of the self axes, and because more data is involved since we use 4 documents for thisexperiments. (compleet?)


Table 3: Query Complexity of CXML::CXPath non-x-axes

Figure 6: Query Complexity of CXML::CXPath non-x-axes

For the fourth experiment, we use queries of the same structure as in the other experiments,but now using two of the extended axes. Again, the set of 4 files of Iacob in standoff annotationare used.

The first query is ‘//sentence[@n=”13”]’ and the i + 1-th query is obtained by taking thei-th query and appending ‘/xdescendant::*[self::w]/xancestor::*[self::sentence].

The results can be found in table 4 and a graph is found in figure 7. The time needed to

26

answer an Extended XPath question is also exponential in the size of the query. Further can benoticed that the time needed is a factor 2 smaller than with the regular XPath queries. Thissuggests that our implementation is better at answering Extended XPath questions than regularones.


Table 4: Query Complexity of CXML::CXPath x-axes

Figure 7: Query Complexity of CXML::CXPath x-axes

8.2.3 Data Complexity of XML::XPath

Gottlob et al. describe in [5] another experiment to evaluate how the time required to answer aquestion depends on the size of the document. He uses a query of the form ‘//a//b’ +q times‘//b[ancestor::a’ +q times ‘//b]/ancestor::a’ + ‘//b’. He chose q = 20. The data used is of theform ‘<a></a>’ with the ‘’ part repeated n times, so the document size is n + 1.

27

For practical reasons, we will use q = 2, so the query becomes ‘//a//b[ancestor::a//b[ancestor-::a//b]/ancestor::a//b]/ancestor::a//b’.

Two parameters are of interest now: the time to parse the document and load it into memory,and the time needed to answer the query once the document is loaded.

The results of this experiment can be found in table 5. The graph of the time to query thedocument is drawn in figure 8. The graph of the time to parse and load the document is notgiven for obvious reasons. As can be seen, from the table, time to load a document is negligible5.The time needed to answer the query however, grows polynomially in the number of nodes.

Document size (# nodes) Time to parse (s) Time to query (s)0 0.00 0.00

20 0.00 0.8640 0.00 7.0560 0.00 23.6380 0.00 57.06

100 0.00 110.58120 0.00 192.58140 0.00 307.66160 0.00 463.78180 0.00 667.28200 0.00 914.61

Table 5: Data Complexity of XML::XPath

8.2.4 Data Complexity of CXML::CXPath

To evaluate the dependency on the size of data in our implementation, we run the experimenton CXML::CXPath, using xdescendant and xancestor. For this experiment we have to constructour own data set, because it needs to fit our constraints, and it should be easy to grow.

We create a base document consisting of: ‘<a> . </a>’. For documentsize n, the b part is repeated n− 1 times, with increasing id’s.

We construct two files containing standoff markup. They have the same structure as the basedocument, but we add a URI to the b element, and add a c or a d element, which has offsets onthe b part. For the ‘c’-document of size 2 it is ‘<a><cid=’1’ start=’0’ end=’1’/></a>’. To match the base document of size n, the <c/>- part is repeated n − 1 times, with increasing id’s. The offsets always stay the same. Inthe ‘d’-documents, the c tag is replaced by a d tag, for the rest they are identical.

Again, time to parse the document and time to answer the question are recorded. The exactquery is ‘//b//c//xancestor::*[self::d]/xdescendant::*[self::c]//xancestor::*[self::d]/xdescendant-::*[self::c]//xancestor::*[self::d]/xdescendant::*[self::c]’.

The results are very different now compared to XML::XPath. See table 6, figures 9 and 10.Now the loading and parsing of the document takes a considerable amount of time (quadratic inthe size of the document), but the time needed to answer a question grows linearly with the sizeof the document. Apparently our datastructure is costly to build in memory, but very efficientto answer Extended XPath queries.

5We investigated when the time to load a document would increase. A document of this form containing 107

nodes still took less than half a second to load. It did exhaust the memory of the computer though. . .

28

Figure 8: Data Complexity of XML::XPath

Document size (# nodes) Time to parse (s) Time to query (s)2 0.16 0.02

100 40.39 2.45200 153.00 4.99300 343.33 7.50400 614.41 9.67500 962.28 12.27600 2526.44 14.66700 3203.03 17.24800 4888.69 19.30900 6088.33 21.72

Table 6: Data Complexity of CXML::CXPath

8.3 Ease of use

We want to evaluate whether our system can cope with various data that may be in the same“spirit” but with a slightly different notation.

8.3.1 ILPS data

Our system has of course been built to work with the data that is currently available, and inuse at the ILPS, so it will work very easily with it. The data has been generated using severalparsers. The resulting files from these parsers have been preprocessed to add the cxmlBaseattribute and the offsets.

Our system will be able to work with data from new parsers, as long as it is possible to

29

Figure 9: Data Complexity of CXML::CXPath docsize vs parsetime

Figure 10: Data Complexity of CXML::CXPath docsize vs querytime

preprocess it in the same manner. There is however one restriction, the URI’s used in allannotations of one text, should all point to the same level in the base document. That is, if one

30

annotation bases it’s offsets on ‘sentence’ elements in the base document, another annotationshould not address it based on ‘paragraph’ elements. If this will ever be necessary, the assumptionof a linear order in the bases will have to be dropped, and bases will have to be created in atree-like fashion. An other option is to do the preprocessing of the other files again, choosing adifferent element in the base document to base the URI on.

8.3.2 Iacob’s data

Iacob’s data are constructed using a variant of fragmentation with virtual join. For this kind ofdata, it is possible to rewrite the data to our format by creating a base document containingall the text, and creating one document for every DTD. This document should contain only thetags used in that DTD and no more other tags than necessary to ensure a logical structure (thedoc element for instance). It should contain a URI to the base document just created. One hasto choose the level on which the URI’s should be created. This depends on the level of overlap.It may for instance not be possible to address the base document on ‘sentence’ elements, soone has to address on document level. In fact this reduces the number of bases to 1. See alsoAppendix A for an example of Iacob’s data and a rewritten version.

It should be possible to write a script that can do this, but we have not done this.

8.3.3 QLDB data

On the QLDB site[14] there is similar data[15] as ILPS uses. This data annotates fragmentsof speech. It has a key for each sentence/utterance, and “local” timex-keys on each sentence.There is a natural overall ordering in this data: first sentence, then local key in sentence.

This data cannot be used without modification, either to our system or to the data. Tomention only the smallest problems: the attribute representing the position of an utterance ona timeline is called “offset”, while we expect it to be called either “start” or “begin”. Anotherproblem is that only the start of the utterance is given, while our system also expects an “end”attribute.

31

9 Conclusion

This thesis shows that it is possible to create a hybrid system that uses either relations basedon positions in a tree or relations between yields of nodes to answer both normal and ExtendedXPath questions.

Even though conceptually the model of a GODDAG and the query language are clear, wefound several points where we had to make important design decisions which could not be tracedback to the theoretical model. This was for a large part due to the fact that the theoretical modelis based on text with inline annotation, while our data consisted of text with standoff annotation.Even though the latter corresponds in some sense even closer to a physical representation of aGODDAG. Another problem not addressed in the theoretical model had to do with the commontext data. Theoretically it is assumed that all sources share exactly the same common data, soaddressing and aligning is trivial. In reality this assumption does not hold.

We mention two of these problems: 1) With inline annotation, every source file containsexactly the same text data. They only differ in their markup. With standoff annotation thisneed not be, and in general is not, true. Thus one has to decide what is the common data of anumber of source files. 2) Even if one has made a decision on the common data, one still hasto decide on some canonical form of that data. This is because all files should have pointersto the same common form. This problem occurs because typically we do not have control overthe source documents. In reality we see that 1) source documents do not agree on the commondata (e.g., adding or deleting whitespace, which causes that offsets do not match), and 2) sourcedocuments use different ways of addressing the common data (inline vs standoff annotation).

In section 8 we found that our system answers all questions correctly, sometimes even betterthan Iacob’s implementation. We also found that our system answers questions in linear time.This makes our implementation fast and expressive enough to be used in real life applicationslike the Question-Answering system of ILPS.

32

10 Discussion

While normal XPath axes run in polynomial time (with respect to the complexity of the data),the new axes run in linear time. Our implementation would greatly benefit if processing ofregular XPath expressions would be improved, since regular and extended axes may be mixedin XPath path expressions. According to Gottlob et al. at least some of the axes can beimplemented to run in linear time[5].

For our system, it is beneficial to keep documents in memory because the loading and parsingstage is expensive. Future work could go into replacing the base objects by a relational database.Most of the new axes are evaluated by comparing offsets, this could be done very efficiently usingdatabase queries.

The implementation can be extended to incorporate those pieces of text that are not anno-tated onto the base. After all bases have been built, one can iterate over them and compare thetext in the node to the offsets in the list of nodes. Every substring that is not annotated canbe annotated in a new node. This way the base contains all text from the source document justlike the root node in an XML tree does.

Our implementation is a hybrid system because it uses trees as well as bases. Instead of onlyreplacing the bases with some database, as mentioned above, it may be interesting to investigatereplacing the whole system by a database implementation. This requires that the GODDAGstructure is captured in a relational model and that the XPath expressions can be rewritten toact on this model. A good candidate for a database would be the monetDB[10] that is currentlydeveloped at the CWI[3], since it is already capable of handling XQuery, which makes use ofXPath.

33

References

[1] Bray Tim, Paoli Jean, Sperberg-McQueen C. M., Maler Eve, Yergeau Francois, editorsXML 1.0, 2004

[2] Clark James, DeRose Steven, editors XPath 1.0, 1999

[3] CWI http://www.cwi.nl/

[4] DeRose Steven, Markup Overlap: A Review and a Horse, Extreme Markup Languages,2004

[5] Gottlob, Georg, Koch Christoph, Pichler Reinhard , Efficient Algorithms for ProcessingXPath Queries, Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002

[6] Iacob Ionut E., Website http://dblab.csr.uky.edu/˜eiaco0/

[7] Iacob Ionut E., Dekhtyar Alex, Zhao Wenzhong, XPath Extension for Querying ConcurrentXML Markup, University of Kentucky, Department of Computer Science, Technical ReportTR394-04, February 2004

[8] Information and Language Processing Systems group ILPS

[9] Jong de Richard, Website for CXML::CXPath http://student.science.uva.nl/˜rjong/cxml/

[10] monetDB http://monetdb.cwi.nl/

[11] Raggett Dave, Le Hors Arnaud, Jacobs Ian, editors, HTML 4.01 Specification, 1999

[12] Sperberg-McQueen C. M., Huitfeldt Claus, GODDAG: A Data Structure for OverlappingHierarchies

[13] Text Encoding Initiative (TEI) http://www.tei-c.org/

[14] QLDB Website http://www.ldc.upenn.edu/Projects/QLDB/

[15] QLDB data http://www.ldc.upenn.edu/Projects/QLDB/

34

http://www.w3.org/TR/REC-xml/

http://www.w3.org/TR/xpath

http://www.cwi.nl/

http://www.mulberrytech.com/Extreme/Proceedings/xslfo-pdf/2004/DeRose01/EML2004DeRose01.pdf

http://www.dbai.tuwien.ac.at/research/xmltaskforce/vldb2002.pdf

http://www.dbai.tuwien.ac.at/research/xmltaskforce/vldb2002.pdf

http://dblab.csr.uky.edu/~eiaco0/

http://dblab.csr.uky.edu/~eiaco0/publications/TR394-04.pdf

http://dblab.csr.uky.edu/~eiaco0/publications/TR394-04.pdf

http://ilps.science.uva.nl/

http://student.science.uva.nl/~rjong/cxml/

http://monetdb.cwi.nl/

http://www.w3.org/TR/1999/REC-html401-19991224/

http://helmer.aksis.uib.no/claus/goddag.html

http://helmer.aksis.uib.no/claus/goddag.html

http://www.tei-c.org/

http://www.ldc.upenn.edu/Projects/QLDB/

http://www.ldc.upenn.edu/Projects/QLDB/

A Iacob’s data

The original data of Iacob consists of 1 XML document and 3 DTD documents. The DTD’s eachdefine a subset of valid tags. Depending on what DTD is selected, only those tags are regardedvalid in the XML document. In this way, it is possible to put all tags in 1 XML document,and it is still possible to select only a subset of them. The tags are written as a mix of regularopening and closing tags, and a variant of fragmentation with virtual join, which means thereare empty elements, that contain a start or end attribute, meant to “open” or “close” thetag. This is one way to circumvent the requirement in XML files that every tag opened mustbe closed before it’s parent tags are closed.

We will now present the original data of Iacob, and the data rewritten to our format forstandoff annotation.

A.1 Original files

A.1.1 text1.xml

<?xml version="1.0"?>

<doc id="CP56483">

<page n="1">...



<line n="20"><sentence n="13" __tag_type="START"/> <w n="1">Where</w> <w n="2">

there</w><w n="3">are</w></line>

<line n="21"><w n="4">charges</w> <w n="5">that</w> <w n="6">by</w> one means

</line>

<line n="22">of another the vote is being denied, 

we must</line>

<line n="23">find out all of the facts -- 

the extent, the</line>

<line n="24">methods, the results.<sentence __tag_type="END"/>

<sentence n="14" __tag_type="START"/>The</line>

<line n="25">same is true of substantial</line>

</page>

<page n="2">

<line n="1"><w n="7">charges</w> that unwarranted economic of other</line>

<line n="2">pressures are being applied to deny</line>

<line n="3"><w n="8">fundamental</w> <w n="9">rights</w> 

<w n="10" __tag_type="START"/>safe-</line>

<line n="4">guarded<w __tag_type="END"/> <w n="11">by</w> the Constitution and

laws of<w n="12">the</w></line>

<line n="5">United States.<sentence __tag_type="END"/></line>



...</page>

</doc>

A.1.2 text.dtd



<!ELEMENT doc (p)*>

<!ELEMENT p (sentence)*>

<!ELEMENT sentence (#PCDATA|w)*>

<!ELEMENT w (#PCDATA)>

A.1.3 condition.dtd



<!ELEMENT doc (#PCDATA|u|i)*>

<!ELEMENT u (#PCDATA)>

<!ELEMENT i (#PCDATA)>

35

A.1.4 physical.dtd



<!ELEMENT doc (#PCDATA|page)*>

<!ELEMENT page (#PCDATA|line)*>

<!ELEMENT line (#PCDATA)>

A.2 Standoff files

A.2.1 base.xml

<?xml version="1.0"?> <doc id="CP56483">

Where there are charges that by one means of another the vote is

being denied, we must find out all of the facts -- the extent, the

methods, the results. The same is true of substantial charges that

unwarranted economic of other pressures are being applied to deny

fundamental rights safe-guarded by the Constitution and laws of

the United States.

</doc>

A.2.2 text.xml


<doc id="CP56483" cxmlBase="xml:base//doc[@id=’CP56483’]">



<sentence n="13" start="1" end="154">

<w n="1" start="1" end="5">Where</w>

<w n="2" start="7" end="11">there</w>

<w n="3" start="13" end="15">are</w>

<w n="4" start="17" end="23">charges</w>

<w n="5" start="25" end="28">that</w>

<w n="6" start="30" end="31">by</w>

</sentence>

<sentence n="14" start="155" end="347">

<w n="7" start="187" end="193">charges</w>

<w n="8" start="266" end="276">fundamental</w>

<w n="9" start="278" end="283">rights</w>

<w n="10" start="285" end="296">safe-guarded</w>

<w n="11" start="298" end="299">by</w>

<w n="12" start="330" end="332">the</w>

</sentence>



</doc>

A.2.3 condition.xml









</doc>

A.2.4 physical.xml



<page n="1" start="1" end="187">

<line n="20" start="1" end="15">Where there are</line>

<line n="21" start="17" end="41">charges that by one means</line>

<line n="22" start="43" end="86">of another the vote is being denied, we must</line>

<line n="23" start="88" end="131">find out all of the facts the extent, the</line>

<line n="24" start="133" end="157">methods, the results. The</line>

<line n="25" start="159" end="187">same is true of substantial</line>

</page>

<page n="2" start="189" end="347">

<line n="1" start="189" end="228">charges that unwarranted economic of other</line>

36

<line n="2" start="230" end="264">pressures are being applied to deny</line>

<line n="3" start="266" end="289">fundamental rights safe-</line>

<line n="4" start="290" end="332">guarded the Constitution and laws of the</line>

<line n="5" start="334" end="347">United States.</line>

</page>

</doc>

37

B Results of validation tests

B.1 //page[@n=”1”]/line[@n=”21”]/xdescendant::w

Iacob:<w n=”4”><w n=”5”><w n=”6”>

CXML::CXPath:<w n=”4” start=”17” end=”23”></w><w n=”5” start=”25” end=”28”></w><w n=”6” start=”30” end=”31”></w>

B.2 //sentence[xancestor::page[@n=”1”] or overlapping::page[@n=”1”]]

Iacob:<sentence n=”13”><sentence n=”14”>

CXML::CXPath:<sentence n=”13” start=”1” end=”154”></sentence><sentence n=”14” start=”155” end=”347”></sentence>

B.3 //line[xdescendant::w[string(.)=”safeguarded”] or following-overlapping-::w[translate(string(.),”\n\r-”,””)=”safeguarded”]]

Iacob:<line n=”3”>CXML::CXPath:<line n=”3” start=”266” end=”289”></line>

B.4 //u[@n=”1”]/xancestor::*

Iacob:<sentence n=”13”><doc id=”CP56483”><doc id=”CP56483”><page n=”1”><doc id=”CP56483”>

CXML::CXPath:<doc id=”CP56483” cxmlBase=”xml:base//doc[@id=’CP56483’]”></doc><sentence n=”13” start=”1” end=”154”></sentence><doc id=”CP56483” cxmlBase=”xml:base//doc[@id=’CP56483’]”></doc><doc id=”CP56483” cxmlBase=”xml:base//doc[@id=’CP56483’]”></doc><page n=”1” start=”1” end=”187”></page>

38

B.5 //sentence[@n=”13”]/xdescendant::*

Iacob:<w n=”1”><w n=”2”><w n=”3”><w n=”4”><w n=”5”><w n=”6”><line n=”20”><line n=”21”><line n=”22”><line n=”23”>

CXML::CXPath:<w n=”1” start=”1” end=”5”></w><w n=”2” start=”7” end=”11”></w><w n=”3” start=”13” end=”15”></w><w n=”4” start=”17” end=”23”></w><w n=”5” start=”25” end=”28”></w><w n=”6” start=”30” end=”31”></w><line n=”20” start=”1” end=”15”></line><line n=”21” start=”17” end=”41”></line><line n=”22” start=”43” end=”86”></line><line n=”23” start=”88” end=”131”></line>

B.6 //sentence[@n=”13”]/xdescendant or self::*

Iacob:<sentence n=”13”><w n=”1”><w n=”2”><w n=”3”><w n=”4”><w n=”5”><w n=”6”><line n=”20”><line n=”21”><line n=”22”><line n=”23”>

CXML::CXPath:<sentence n=”13” start=”1” end=”154”></sentence><w n=”1” start=”1” end=”5”></w>

39

<w n=”2” start=”7” end=”11”></w><w n=”3” start=”13” end=”15”></w><w n=”4” start=”17” end=”23”></w><w n=”5” start=”25” end=”28”></w><w n=”6” start=”30” end=”31”></w><line n=”20” start=”1” end=”15”></line><line n=”21” start=”17” end=”41”></line><line n=”22” start=”43” end=”86”></line><line n=”23” start=”88” end=”131”></line>

B.7 //u[@n=”1”]/xancestor or self::*

Iacob:<sentence n=”13”><doc id=”CP56483”><doc id=”CP56483”><page n=”1”><doc id=”CP56483”>

CXML::CXPath:<doc id=”CP56483” cxmlBase=”xml:base//doc[@id=’CP56483’]”></doc><sentence n=”13” start=”1” end=”154”></sentence><doc id=”CP56483” cxmlBase=”xml:base//doc[@id=’CP56483’]”></doc><doc id=”CP56483” cxmlBase=”xml:base//doc[@id=’CP56483’]”></doc><page n=”1” start=”1” end=”187”></page>

B.8 //u[@n=”1”]/xfollowing::*

Iacob:<sentence n=”14”><w n=”7”><w n=”8”><w n=”9”><w n=”10”><w n=”11”><w n=”12”><line n=”24”><line n=”25”><page n=”2”><line n=”1”><line n=”2”><line n=”3”>

40

<line n=”4”><line n=”5”>

CXML::CXPath:<sentence n=”14” start=”155” end=”347”></sentence><w n=”7” start=”187” end=”193”></w><w n=”8” start=”266” end=”276”></w><w n=”9” start=”278” end=”283”></w><w n=”10” start=”285” end=”296”></w><w n=”11” start=”298” end=”299”></w><w n=”12” start=”330” end=”332”></w><line n=”24” start=”133” end=”157”></line><line n=”25” start=”159” end=”187”></line><page n=”2” start=”189” end=”347”></page><line n=”1” start=”189” end=”228”></line><line n=”2” start=”230” end=”264”></line><line n=”3” start=”266” end=”289”></line><line n=”4” start=”290” end=”332”></line><line n=”5” start=”334” end=”347”></line>

B.9 //u[@n=”2”]/xpreceding::*

Iacob:<w n=”7”><w n=”6”><w n=”5”><w n=”4”><w n=”3”><w n=”2”><w n=”1”><sentence n=”13”><line n=”1”><line n=”25”><line n=”24”><line n=”23”><line n=”22”><line n=”21”><line n=”20”><page n=”1”>

CXML::CXPath:<sentence n=”13” start=”1” end=”154”></sentence><w n=”1” start=”1” end=”5”></w><w n=”2” start=”7” end=”11”></w><w n=”3” start=”13” end=”15”></w>

41

<w n=”4” start=”17” end=”23”></w><w n=”5” start=”25” end=”28”></w><w n=”6” start=”30” end=”31”></w><w n=”7” start=”187” end=”193”></w><page n=”1” start=”1” end=”187”></page><line n=”20” start=”1” end=”15”></line><line n=”21” start=”17” end=”41”></line><line n=”22” start=”43” end=”86”></line><line n=”23” start=”88” end=”131”></line><line n=”24” start=”133” end=”157”></line><line n=”25” start=”159” end=”187”></line><line n=”1” start=”189” end=”228”></line>

B.10 //sentence[@n=”13”]/following overlapping::*

Iacob:<line n=”24”>CXML::CXPath:<line n=”24” start=”133” end=”157”></line>

B.11 //sentence[@n=”14”]/preceding overlapping::*

Iacob:<line n=”24”><page n=”1”>

CXML::CXPath:<page n=”1” start=”1” end=”187”></page><line n=”24” start=”133” end=”157”></line>

B.12 //sentence[@n=”13”]/overlapping::*

Iacob:<line n=”24”>

CXML::CXPath:<line n=”24” start=”133” end=”157”></line>

B.13 //sentence[@n=”13”]/xancestor or overlapping::*

Iacob:Empty results set

CXML::CXPath:<doc id=”CP56483” cxmlBase=”xml:base//doc[@id=’CP56483’]”></doc><doc id=”CP56483” cxmlBase=”xml:base//doc[@id=’CP56483’]”></doc><doc id=”CP56483” cxmlBase=”xml:base//doc[@id=’CP56483’]”></doc>

42

<page n=”1” start=”1” end=”187”></page><line n=”24” start=”133” end=”157”></line>

B.14 //sentence[@n=”13”]/xdescendant or overlapping::*

Iacob:Empty results set

CXML::CXPath:<w n=”1” start=”1” end=”5”></w><w n=”2” start=”7” end=”11”></w><w n=”3” start=”13” end=”15”></w><w n=”4” start=”17” end=”23”></w><w n=”5” start=”25” end=”28”></w><w n=”6” start=”30” end=”31”></w><line n=”20” start=”1” end=”15”></line><line n=”21” start=”17” end=”41”></line><line n=”22” start=”43” end=”86”></line><line n=”23” start=”88” end=”131”></line><line n=”24” start=”133” end=”157”></line>

B.15 //sentence[@n=”13”]/intersect::*

CXML::CXPath:<doc id=”CP56483” cxmlBase=”xml:base//doc[@id=’CP56483’]”></doc><sentence n=”13” start=”1” end=”154”></sentence><w n=”1” start=”1” end=”5”></w><w n=”2” start=”7” end=”11”></w><w n=”3” start=”13” end=”15”></w><w n=”4” start=”17” end=”23”></w><w n=”5” start=”25” end=”28”></w><w n=”6” start=”30” end=”31”></w><doc id=”CP56483” cxmlBase=”xml:base//doc[@id=’CP56483’]”></doc><doc id=”CP56483” cxmlBase=”xml:base//doc[@id=’CP56483’]”></doc><page n=”1” start=”1” end=”187”></page><line n=”20” start=”1” end=”15”></line><line n=”21” start=”17” end=”41”></line><line n=”22” start=”43” end=”86”></line><line n=”23” start=”88” end=”131”></line><line n=”24” start=”133” end=”157”></line>

B.16 //u[@n=”1”]/direct following::*

CXML::CXPath:

43

B.17 //u[@n=”2”]/direct preceding::*

CXML::CXPath:

44

C Documentation

This section contatins the documentation of the code of our modules. It is written in Plain OldDocumentation format (POD), the native documentation language for Perl. It can be found inthe Perl files themselves and can be read using a command like “perldoc CXML::CXPath”. Wehave included it here as well for the interested reader.

C.1 CXML::CXPath

C.1.1 DESCRIPTION

This module is an extension of XML::XPath to enable querying Concurrent XML documents. Itis based on the work of Ionut Emil Iacob which can be found at http://dblab.csr.uky.edu/~eiaco0/-research/cmh/.

C.1.2 DETAILS

The following XML attributes are used to define concurrent annotation, and are therefore neededin the XML files:

cxmlBase

Is a reference (URI) to the node which serves as the ”base” for the annotation inside thecurrent nodes.

cxmlLocalBase

A URI referring to the node with whose text values the offsets (start/begin and end at-tributes) were calculated. If cxmlLocalBase is absent, offsets are assumed to be calculatedusing the text value of cxmlBase. cxmlLocalBase=’inline’ indicates that the annotationinside current node is inline, that is, with usual XML markup rather than offsets.

start or begin

First character, current node refers to (starting from 0)

end

Last character current node refers to.

C.1.3 API

new Expects a hash of names and filenames of XML documents. The names are expectedto be in the format ”xml:name”, where the name is free to choose. All documents refer to acxmlbase, these should map to one of the names.

For every filename it opens the file and creates an XPath tree representation for the XMLdocument. All trees are stored together in one object under the name assigned with the filename.

The document is parsed and for every base (tag with URI attribute) the text it points to isstored, together with all the nodes referring to it.

Returns a reference to a CXPath object.

findnodes Is called on a CXPath object. It expects an (Extended) XPath query and optionallya context node. It runs this XPath query on all XPath trees using XML::XPath::findnode.

Returns an array of all node objects that satisfy the query.

45

add to uri cache This adds a uri and a ref to it’s text to the uri cache of the CXPath object.add to uri cache is called from add cxmlbase or find uri. It expects a uri and a ref to a nodecontaining the text represented by a uri. If it does not receive a uri, it retrieves the cxmlnamefrom the node. If this does not succeed, it sets uri to the name of the document last loaded intothe CXPath object. In the last two cases the location of the node ref is appended to the uri.

The uri cache is emptied if it’s full, the uri is then inserted as the first element.A ref to the node containing the text with which it is called is then returned.

find uri This is called from add cxmlbase. It finds the name of the XML document that thecxmlbase of the CXPath object refers to (The base file that is encoded with standoff annotation).When this is found, it follows the uri and retrieves a ref to the XPath node containing the textdenoted by the uri.

Returns a ref to the node containing the text for the uri.

find cxmlbases Checks for an XML element with attribute’cxmlBase’ or ’cxmlLocalBase’and if that exists calls the subroutine add cxmlbase

C.1.4 AUTHOR

Original author: Valentin Jijkoun [email protected] by: Joris Geus [email protected] and Richard de Jong [email protected]

C.2 CXML::CXPath::Node

C.2.1 DESCRIPTION

This is part of the implementation of Extended XPath for Concurrent XML

C.2.2 DETAILS

This is a bad solution. We are modifying class XML::XPath::NodeImpl here! I don’t know howto make it nicer <mailto:[email protected]>

In this module some objects and methods are added to the node implementation of XML::XPath.

C.2.3 AUTHOR


C.3 CXML::CXPath::Parser

C.3.1 DESCRIPTION

This is part of the implementation of Extended XPath for Concurrent XML.In this module the new axes are defined.

C.3.2 DETAILS

This module extends XML::XPath::Parser by adding the new axes of Extended XPath to theaxes already existing in XML::XPath::Parser.

46

These axes are:

1. xancestor

2. xdescendant

3. xancestor or self

4. xdescendant or self

5. xfollowing

6. xpreceding

7. following overlapping

8. preceding overlapping

9. overlapping

10. xancestor or overlapping

11. xdescendant or overlapping

12. intersect

Valentin’s definition of overlapping

13. direct following

Retrieves the nodes immediately following the context node

14. direct preceding

Retrieves the nodes immediately preceding the context node

This is quite ugly: we are modifying the constants of the parent class. I’d rather redefineis step() etc., but unfortunately they are called as functions, not methods, in XML::XPath::Parser,so cannot really be redefined.

C.3.3 API

new This just calls XML::XPath::Parser->new now the new axes have been defined.

C.3.4 AUTHOR


C.4 CXML::CXPath::Step

C.4.1 DESCRIPTION

This is part of the implementation of Extended XPath for Concurrent XMLIn this module the routines for processing the new axes are defined.

47

C.4.2 DETAILS

This module extends XML::XPath::Step by adding the axes defined by Extended XPath.Again, we are directly modifying a class. It’s just adding functions though.

C.4.3 API

axis xancestor Find all elements in the current base that start earlier and end later than$context in this base. That’s an xancestor. When those are found, find all their ancestors, andthe ancestors of $context itself.

axis xdescendant This is just like the axis xancestor but with the tests for start and endreversed.

axis xancestor or self This is just like the axis xancestor but it returns the context node aswell.

axis xdescendant or self This is just like the axis xdescendant but it returns the contextnode as well.

axis xfollowing This finds all nodes that follow the current node. It searches first in thecurrent base, then in all following bases.

axis preceding This finds all nodes that precede the current node. It searches first in thecurrent base, then in all preceding bases.

axis direct following This finds all nodes that immediately follow the current node (basedon character positions).

axis direct preceding This finds all nodes that immediately preced the current node (basedon character positions).

axis following overlapping This finds all nodes that start within the span of the contextnode, and end later than the context node.

axis preceding overlapping This finds all nodes that start before the context node startand end within the span of the context node.

axis poverlapping This finds all nodes that either start before the context node start andend within the span of the context node, or that start within the span of the context node, andend later than the context node.

axis intersect This finds all nodes that have some form of overlap or containment with thecontext node.

axis xancestor or overlapping This finds all nodes that are either an xancestor of (con-taining), or overlapping with the context node.

48

axis xdescendant or overlapping This finds all nodes that are either an xdescant of (con-tained within), or overlapping with the context node.

getListFromNodes Expects a ref to a hashOfHash.Returns the hashOfHash transformed into an Array.

C.4.4 AUTHOR


C.5 CXML::CXPath::Base

C.5.1 DESCRIPTION


C.5.2 DETAILS

This module creates the datastructure Base and defines methods for it.

C.5.3 API

get text length Expects a ref to a base.Returns the length of the text in that base.

add node Expects a ref to a base, a ref to a node, a hash containing start and end.Pushes the node, start-value and end-value on the refs of the base.

get basetext Expects a baseReturns the text of the base requested if called with only 2 arguments. Returns the substring

of the text of the requested base if called with more arguments.See also get basetext in CXML::CXPath::Node

C.5.4 AUTHOR


C.6 CXML::CXPath::BaseOrder

C.6.1 DESCRIPTION


C.6.2 DETAILS

BaseOrder is an object that contains a list of base object references in the order in which theywere added. There also are access methods to get a list of the preceding or following bases.

C.6.3 API

new Constructor. Creates an array to store bases.

49

addBase Expects a ref to a BaseOrder object and a base. Adds this base to the BaseOrderobject.

altAddBase Expects a ref to a BaseOrder object and a base. Adds this base to the BaseOrderobject.

This alternative addBase routine heavily relies on the data we use. The base is a TEXTelement with a SENT element as parent. This SENT element has an attribute ”id” and it’s valueis the letter S followed by a number. This number denotes the order of the SENT elements,which serve as base in at least one application.

getPreceding Expects a ref to a BaseOrder object and a base.Returns all bases preceding the current base.

getFollowing Expects a ref to a BaseOrder object and a base.Returns all bases following the current base.

matchBase Expects a ref to a BaseOrder object and a base.Returns 1 if the base is in the list and 0 otherwise.

getNext Expects a ref to a BaseOrder object and a base.Returns the next base if available, returns 0 otherwise.

getPrevious Expects a ref to a BaseOrder object and a base.Returns the previous base if available, returns 0 otherwise.

C.6.4 AUTHOR

Joris Geus [email protected] and Richard de Jong [email protected]

C.7 CXML::CXPath::DocumentOrder

C.7.1 DESCRIPTION


C.7.2 DETAILS

This class provides a datastructure and methods to create an order in a set of documents. Ithas methods to add documents and to retreive the order of documents or nodes.

C.7.3 API

new Constructor. Creates a hash consisting of two arrays. One for documentnames, one forroot-nodes.

addDocument Adds a document to the end of a DocumentOrder object.Expects a ref to a DocumentOrder object, a string containg the name of the document and

a ref to the root-node of the document to add.

50

getDocnameByOrder Expects a ref to a DocumentOrder object, and a number representingthe order.

Returns the name of the document requested, based on it’s order.

getDocrootByOrder Expects a ref to a DocumentOrder object, and a number representingthe order.

Returns the root-node requested, based on it’s order.

getOrderByDocname Expects a ref to a DocumentOrder object, and the name of a docu-ment.

Returns the order of the document.

getOrderByDocroot Expects a ref to a DocumentOrder object, and a ref to a root-node.Returns the order of the root-node.

getDocnameByDocroot Expects a ref to a DocumentOrder object, and a ref to a root-node.Returns the name of the document matching the root-node.

C.7.4 AUTHOR

Joris Geus [email protected] and Richard de Jong [email protected]

51

Multi Dimensional Markup -...

Documents

Transcript of Multi Dimensional Markup -...