On the Count of Trees - Project WAMwam.inrialpes.fr/publications/2010/RR-7251.pdf · when the...

37
apport de recherche ISSN 0249-6399 ISRN INRIA/RR--7251--FR+ENG Knowledge and Data Representation and Management INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE On the Count of Trees Everardo Bárcenas — Pierre Genevès — Nabil Layaïda — Alan Schmitt N° 7251 — version 2 initial version April 2010 — revised version August 2010

Transcript of On the Count of Trees - Project WAMwam.inrialpes.fr/publications/2010/RR-7251.pdf · when the...

appor t de r echerche

ISSN

0249-6399

ISRNINRIA/RR--7251--FR+ENG

Knowledge and Data Representation and Management

INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE

On the Count of Trees

Everardo Bárcenas — Pierre Genevès — Nabil Layaïda — Alan Schmitt

N° 7251 — version 2initial version April 2010 — revised version August 2010

Centre de recherche INRIA Grenoble – Rhône-Alpes655, avenue de l’Europe, 38334 Montbonnot Saint Ismier

Téléphone : +33 4 76 61 52 00 — Télécopie +33 4 76 61 52 52

On the Count of Trees

Everardo Barcenas , Pierre Geneves , Nabil Layaıda , Alan Schmitt

Theme : Knowledge and Data Representation and ManagementEquipes-Projets WAM et SARDES

Rapport de recherche n° 7251 — version 2 — initial version April 2010 —revised version August 2010 — 34 pages

Abstract: Regular tree grammars and regular path expressions constitute coreconstructs widely used in programming languages and type systems. Neverthe-less, there has been little research so far on frameworks for reasoning aboutpath expressions where node cardinality constraints occur along a path in atree. We present a logic capable of expressing deep counting along paths whichmay include arbitrary recursive forward and backward navigation. The count-ing extensions can be seen as a generalization of graded modalities that countimmediate successor nodes. While the combination of graded modalities, nomi-nals, and inverse modalities yields undecidable logics over graphs, we show thatthese features can be combined in a decidable tree logic whose main featurescan be decided in exponential time. Our logic being closed under negation, itmay be used to decide typical problems on XPath queries such as satisfiability,type checking with relation to regular types, containment, or equivalence.

Key-words: Modal Logic, XML, XPath, Schema

On the Count of Trees

Resume : Ce document introduit une logique d’arbre decidable en tempsexponentielle et qui est capable d’exprimer des contraintes de cardinalite surchemins multidirectionnelle.

Mots-cles : Logique Modal, XML, XPath, Schema

On the Count of Trees 3

1 Introduction

A fundamental peculiarity of XML is the description of regular properties. Forexample, in XML schema languages the content types of element definitionsrely on regular expressions. In addition, selecting nodes in such constrainedtrees is also done by means of regular path expressions (a la XPath). In bothcases, it is often interesting to be able to express conditions on the frequency ofoccurrences of nodes.

Even if we consider simple strings, it is well known that some formal lan-guages easily described in English may require voluminous regular expressions.For instance, as pointed out in [13], the language L2a2b of all strings over! = {a, b, c} containing at least two occurrences of a and at least two occur-rences of b requires a large expression, such as:

!!a!!a!!b!!b!! ! !!a!!b!!a!!b!!! !!a!!b!!b!!a!! ! !!b!!b!!a!!a!!! !!b!!a!!b!!a!! ! !!b!!a!!a!!b!!.

If we add " to the operators for forming regular expressions, then the languageL2a2b can be expressed more concisely as (!!a!!a!!)"(!!b!!b!!). In logicalterms, conjunction o"ers a dramatic reduction in expression size, which is crucialwhen the complexity of the decision procedure depends on formula size.

If we now consider a formalism equipped with the ability to describe nu-merical constraints on the frequency of occurrences, we get a second (exponen-tial) reduction in size. For instance, the above expression can be formulatedas (!!a!!)2 " (!!b!!)2. We can even write (!!a!!)2n " (!!b!!)2n (for anynatural n) instead of a (much) larger expression.

Di"erent extensions of regular expressions with intersection, counting con-straints, and interleaving have been considered over strings, and for describingcontent models of sibling nodes in XML type languages [4, 9, 15]. The complex-ity of the inclusion problem over these di"erent language extensions and theircombinations typically ranges from polynomial time to exponential space (see[9] for a survey). The main distinction between these works and the work pre-sented here is that we focus on counting nodes located along deep and recursivepaths in trees.

When considering regular tree languages instead of regular string languages,succinct syntax such as the one presented above is even more useful, as branchingresults in a higher combinatorial complexity. In the case of trees, it is oftenuseful to express cardinality constraints not only on the sequence of childrennodes, but also in a particular region of a tree, such as a subtree. Suppose, forinstance, that we want to define a tree language over ! where there is no morethan 2 “b” nodes. This requires a quite large regular tree type expression such

RR n° 7251

On the Count of Trees 4

as:xroot # b[xb"1] $ c[xb"2] $ a[xb"2]xb"2 # x¬b, b[x¬b], x¬b, b[x¬b], x¬b $ x¬b, b[xb"1], x¬b$ x¬b, a[xb"2], x¬b $ x¬b, c[xb"2], x¬b $ xb"1xb"1 # x¬b $ x¬b, b[x¬b], x¬b $ a[xb"1] $ c[xb"1]x¬b # (a[x¬b] $ c[x¬b])!

where xroot is the starting non-terminal; x¬b, xb"1, xb"2 are non-terminals; thenotation a[x¬b] describes a subtree whose root is labeled a and in which thereis no b node; and “,” is concatenation.

More generally, the widely adopted notations for regular tree grammars pro-duce very verbose definitions for properties involving cardinality constraints onthe nesting of elements1.

The problem with regular tree (and even string) grammars is that one isforced to fully expand all the patterns of interest using concatenation, union,and Kleene star. Instead, it is often tempting to rely on another kind of (for-mal) notation that just describes a simple pattern and additional constraintson it, which are intuitive and compact with respect to size. For instance, onecould imagine denoting the previous example as follows, where the additionalconstraint is described using XPath notation:

(x#(a[x] $ b[x] $ c[x])!) % count(/descendant-or-self::b) & 2Although this kind of counting operators does not increase the expressive

power of regular tree grammars, it can have a drastic impact on succinctness,thus making reasoning over these languages harder (as noticed in [7] in the caseof strings). Indeed, reasoning on this kind of extensions without relying on theirexpansion (in order to avoid syntactic blow-ups) is often tricky [8]. Determiningsatisfiability, containment, and equivalence over these classes of extended regularexpressions typically requires involved algorithms with higher complexity [22]compared to ordinary regular expressions.

In the present paper, we propose a succinct logical notation, equipped witha satisfiability checking algorithm, for describing many sorts of cardinality con-straints on the frequency of occurrence of nodes in regular tree types. Regulartree types encompass most of XML types (DTDs, XML Schemas, RelaxNGs)used in practice today.

XPath is the standard query language for XML documents, and it is animportant part of other XML technologies such as XSLT and XQuery. XPathexpressions are regular path expressions interpreted as sets of nodes selectedfrom a given context node. One of the reasons why XPath is popular for webprogramming resides in its ability to express multidirectional navigation. In-deed, XPath expressions may use recursive navigation, to access descendantnodes, and also backward navigation, to reach previous siblings or ancestor

1This is typically the reason why the standard DTD for XHTML does not syntacticallyprevent the nesting of anchors, whereas this nesting is actually prohibited in the XHTMLstandard.

RR n° 7251

On the Count of Trees 5

nodes. Expressing cardinality restrictions on nodes accessible by recursive mul-tidirectional paths may introduce an extra-exponential cost [11, 27], or mayeven lead to undecidable formalisms [27, 6]. We present in this paper a decid-able framework capable of succinctly expressing cardinality constraints alongdeep multidirectional paths.

A major application of this logical framework is the decision of problemsfound in the static analysis of programming languages manipulating XML data.For instance, since the logic is closed under negation, it can be used to solve sub-typing problems such as XPath containment in the presence of tree constraints.Checking that a query q is contained in a query p with this logical approachamounts to verifying the validity of q ' p, or equivalently, the unsatisfiabilityof q % ¬p.Contributions We extend a tree logic with a succinct notation for countingoperators. These operators allow arbitrarily deep and recursive counting con-straints. We present a sound and complete algorithm for checking satisfiabilityof logical formulas. We show that its complexity is exponential in the size ofthe succinct form.

Outline We introduce the logic in Section 2. Section 3 shows how the logic canbe applied in the XML setting, in particular for the static analysis of XPathexpressions and of common schemas containing constraints on the frequencyof occurrence of nodes. The decision procedure and the proofs of soundness,completeness, and complexity are presented in Section 4. Finally, we reviewrelated work in Section 5 before concluding in Section 6.

2 Counting Tree Logic

We introduce our syntax for trees, define a notion of trails in trees, then presentthe syntax and semantics of logical formulas.

2.1 Trees

We consider finite trees which are node-labeled and sibling-ordered. Since thereis a well-known bijective encoding between n(ary and binary trees, we focuson binary trees without loss of generality. Specifically, we use the encodingrepresented in Figure 1, where the binary representation preserves the first childof a node and append sibling nodes as second successors.

The structure of a tree is built upon modalities “)” and “*”. Modality“)” labels the edge between a node and its first child. Modality “*” labels theedge between a node and its next sibling. Converse modalities “+” and “,”respectively label the same edges in the reverse direction.

We define a Kripke semantics for our tree logic, similar to the one of modallogics [29]. We write M = {),*,+,,} for the set of modalities. For m -M wedenote by m the corresponding inverse modality () =+,* =,,+ =),, =*).RR n° 7251

On the Count of Trees 6

Figure 1: n(ary to binary trees

We also consider a countable alphabet P of propositions representing names ofnodes. A node is always labeled with exactly one proposition.

A tree is defined as a tuple (N,R,L), where N is a finite set of nodes; R isa partial mapping from N .M to N that defines a tree structure;2 and L is alabeling function from N to P .

2.2 Trails

Trails are defined as regular expressions formed by modalities, as follows:

! //= !0 $ !#0 $ !#0 ,!!0 //=m $ !0,!0 $ !0 0 !0

We restrict trails to sequences of repeated subtrails (which themselves contain norepetition) followed by a subtrail (with no repetition). Since we do not considerinfinite paths, we also disallow trails where both a subtrail and its converseoccurs under the scope of the recursion operator, thus ensuring cycle-freeness(see Section 2.5). These restrictions on trails allow us to prove the completenessof our approach while retaining the ability to express many counting formulas,such as the ones of XPath.

Trails are interpreted as sets of paths. A path, written ", is a sequence ofmodalities that belongs to the regular language denoted by the trail, written" - !.

In a given tree, we say that there is a trail ! from the node n0 to the nodenk, written n0

!1# nk, if and only if there is a sequence of nodes n0, . . . , nk

and a path " = m1, . . . ,mk such that " - !, and R(nj,mj+1) = nj+1 for everyj = 0, . . . , k ( 1.

2For all n,n! ! N,m ! M , R(n,m) = n! "# R(n!,m) = n; for all n ! N except one (theroot), exactly one of R(n,$) or R(n,%) is defined; for the root, neither R(n,$) nor R(n,%)is defined.

RR n° 7251

On the Count of Trees 7

# 2 # //= formula3 $ ¬3 true, false$ p $ ¬p atomic prop (negated)$ x recursion variable$ # 4 # disjunction$ # % # conjunction$ 5m6# $ ¬5m63 modality (negated)$ 5!6"k$ $ 5!6>k$ counting$ µx.$ fixpoint operator$ //= 3 $ ¬3$ p $ ¬p$ x$ $ 4$$ $ %$$ 5m6$ $ ¬5m63$ µx.$

Figure 2: Syntax of Formulas (in Normal Form).

¬5m6# 7 ¬5m63 4 5m6¬# ¬µx.$ 7 µx.¬${x8¬x}¬5!6"k$ 7 5!6>k$ ¬5!6>k$ 7 5!6"k$Figure 3: Reduction to Negation Normal Form.

2.3 Syntax of Logical Formulas

The syntax of logical formulas is given in Figure 2, where m - M and k - N.Formulas written #may contain counting subformulas, whereas formulas written$ cannot. We thus disallow counting under counting or under fixpoints. We alsorestrict formulas to cycle-free formulas, as detailed in Section 2.5. The syntaxis shown in negation normal form. The negation of any closed formula (i.e.,with no free variable) built using the syntax of Figure 2 may be transformedinto negation normal form using the usual De Morgan rules together with rulesgiven in Figure 3. When we write ¬#, we mean its negated normal form.

Defining an equality operator for counting formulas is straightforward usingthe other counting operators.

5!6=k$ 7 5!6>(k$1)$ % 5!6"k$ if k > 05!6=0$ 7 5!6"0$

RR n° 7251

On the Count of Trees 8

[[3]]TV = N

[[¬3]]TV = 9[[p]]TV = {n $ L(n) = p}[[¬p]]TV = {n $ L(n) : p}[[x]]TV = V (x)[[#1 4 #2]]TV = [[#1]]TV ! [[#2]]TV[[#1 % #2]]TV = [[#1]]TV " [[#2]]TV[[5m6#]]TV = {n $ R(n,m) - [[#]]TV }[[¬5m63]]TV = {n $ R(n,m) undefined}[[5!6"k$]]TV = {n $ ${n% - [[$]]TV $ n !1# n%}$ & k}[[5!6>k$]]TV = {n $ ${n% - [[$]]TV $ n !1# n%}$ > k}[[µx.$]]TV = ;{N % $ [[$]]TV [N !&x] < N %}

Figure 4: Semantics of Formulas.

2.4 Semantics of Logical Formulas

A formula is interpreted as a set of nodes in a tree. A model of a formula isa tree such that the formula denotes a non-empty set of nodes in this tree. Acounting formula 5!6>k$ satisfied at a given node n means that there are atleast k + 1 nodes satisfying $ that can be reached from n through the trail !.A counting formula 5!6>k$ is thus interpreted as the set of nodes such that,for each of them, the previously described condition holds. For example, theformula p1 % 5)65*!6>5p2, denotes p1 nodes with strictly more than 5 childrennodes named p2.

In order to present the formal semantics of formulas, we introduce valuations,written V , which relate variables to sets of nodes. We write V [N !8x], where N %is a subset of the nodes, for the valuation defined as V [N !8x](y) = V (y) if x : y,and V [N !8x](x) = N %. Given a tree T = (N,R,L) and a valuation V , the formalsemantics of formulas is given in Figure 4.

Note that the function f / Y # [[$]]TV [Y&x] is monotone, and the denotation

of µx.$ is a fixed point [26].Intuitively, propositions denote the nodes where they occur; negation is in-

terpreted as set complement; disjunction and conjunction are respectively setunion and intersection; the least fixpoint operator performs finite recursive nav-igation; and the counting operator denotes nodes such that the ones accessiblefrom this node through a trail fulfill a cardinality restriction. A logical formulais said to be satisfiable i" it has a model, i.e., there exists a tree for which thesemantics of the formula is not empty.

RR n° 7251

On the Count of Trees 9

2.5 Cycle-Freeness

Formal definition of cycle-freeness can be found in [10]. Intuitively, in a cycle-free formula, fixpoint variables must occur under a modality but cannot occurin the scope of both a modality and its converse. For instance, the formulaµx.5)6x45+6x is not cycle-free. In a cycle-free formula, the number of modalitycycles (of the formmm) is bound independently of the number of times fixpointsare unfolded (i.e., by replacing a fixpoint variable with the fixpoint itself). Afundamental consequence of the restriction to cycle-free formulas is that, whenconsidering only finite trees, the interpretations of the greatest and smallestfixpoints coincide. This greatly simplifies the logic.

Here, we also restrict our approach to cycle-free formulas. We thus need toextend this notion to the counting operators, and more precisely to the trailsthat occur in them. Cycle-free trails are trails where both a subtrail and itsconverse do not occur under the scope of the recursion operator. We thusrestrict the formulas under consideration to cycle-free formulas whose countingoperators contain cycle-free trails.

Lemma 2.1. Let # be a cycle-free formula, and T be a tree for which [[#]]T' : 9.Then there is a finite unfolding #% of the fixpoints of # such that [[#%{¬(8µx."}]]T' =[[#]]T' .Proof. As cycle-free counting formulas may be translated into (exponentiallylarger) cycle-free non-counting formulas, the proof is identical to the one in[10].

As a consequence, our logic is closed under negation even without greatestfixpoints.

2.6 Global Counting Formulas and Nominals

To conclude this section, we turn to an illustration of the expressive power ofour logic. An interesting consequence of the inclusion of backward axes in trailsis the ability to reach every node in the tree from any node of the tree, usingthe trail (+$,)#, ()$*)#.3 We can thus select some nodes depending on someglobal counting property. Consider the following formula, where # stands forone of the comparison operators &, >, or =.

5(+$,)#, ()$*)#6#k#1

Intuitively, this formula counts how many nodes in the whole tree satisfy #1.For each node of the tree, it selects it if and only if the count is compatible withthe comparison considered. The interpretation of this formula is thus eitherevery node of the tree, or none. It is then easy to restrict the selected nodes tosome that satisfy another formula #2, using intersection.

(5(+$,)#, ()$*)#6#k#1) % #2

3Note that this trail is cycle-free.

RR n° 7251

On the Count of Trees 10

This formula select every node satisfying #2 if and only if there are #k nodessatisfying #1, which we write as follows.

#1#k =' #2

We can now express existential properties, such as “select every node satisfying#2 if there exists a node satisfying #1”.

#1 > 0 =' #2

We can also express universal properties, such as “select every node satisfying#2 if every node satisfies #1”.

(¬#1) & 0 =' #2

Another way to interpret global counting formulas is as a generalizationof the so-called nominals in the modal logics community [24]. Nominals arespecial propositions whose interpretation is a singleton (they occur exactly oncein the model). They come for free with the logic. A nominal, denoted “@n”,corresponds to the following global counting formula:

[5(+$,)#, ()$*)#6=1n] % nwhere n is a new fresh atomic proposition.

One may need for nominals to occur in the scope of counting formulas. As wedisallow counting under counting, we propose the following alternative encodingof nominals in these cases:

@n 7 n % ¬[descendant(n) 4 ancestor(n)4anc(or(self(siblings(desc(or(self(n)))],

where:

descendant($) = 5)6µx.$ 4 5)6x 4 5*6x;foll(sibling($) = µx.5*6$ 4 5*6x;prec(sibling($) = µx.5,6$ 4 5,6x;desc(or(self($) = µx.$ 4 5)6µy.x 4 5*6y;

ancestor($) = µx.5+6($ 4 x) 4 5,6x;anc(or(self($) = µx.$ 4 µy.5+6(y 4 x) 4 5,6y;

siblings($) = foll(sibling($) 4 prec(sibling($).3 Application to XML Trees

3.1 XPath Expressions

XPath [3] was introduced as part of the W3C XSLT transformation language tohave a non-XML format for selecting nodes and computing values from an XML

RR n° 7251

On the Count of Trees 11

document (see [10] for a formal presentation of XPath). Since then, XPath hasbecome part of several other standards, in particular it forms the “navigationsubset” of the XQuery language.

In their simplest form XPath expressions look like “directory navigationpaths”. For example, the XPath

/company/personnel/employee

navigates from the root of a document through the top-level “company” nodeto its “personnel” child nodes and on to its “employee” child nodes. The resultof the evaluation of the entire expression is the set of all the “employee” nodesthat can be reached in this manner. At each step in the navigation, the selectednodes for that step can be filtered with a predicate test. Of special interest tous are the predicates that count nodes or that test the position of the selectednode in the previous step’s selection. For example, if we ask for

/company/personnel/employee[position()=2]

then the result is all employee nodes that are the second employee node (in doc-ument order) among the employee child nodes of each personnel node selectedby the previous step.

XPath also makes it possible to combine the capability of searching along“axes” other than the shown “children of” with counting constraints. For ex-ample, if we ask for

/company[count(descendant::employee)<=300]/name

then the result consists of the company names with less than 300 employees intotal (the axis “descendant” is the transitive closure of the default – and oftenomitted – axis “child”).

The syntax and semantics of Core XPath expressions are respectively givenon Figure 5 and Figure 6. An XPath expression is interpreted as a relation be-tween nodes. The considered XPath fragment allows absolute and relative paths,path union, intersection, composition, as well as node tests and qualifiers withcounting operators, conjunction, disjunction, negation, and path navigation.Furthermore, it supports all XPath axes allowing multidirectional navigation.

It was already observed in [11, 27] that using positional information in pathsreduces to counting (at the cost of an exponential blow-up). For example, theexpression

child::a[position()=5]

first selects the “a” nodes occurring as children of the current context node, andthen keeps those occurring at the 5th position. This expression can be rewritteninto the semantically equivalent expression:

child::a[count(preceding-sibling::a)=4]

RR n° 7251

On the Count of Trees 12

Axis //=self $ child $ parent $ descendant $ ancestor $following-sibling $ preceding-sibling $following $ preceding

NameTest //=QName $ >Step //=Axis::NameTest

PathExpr //=PathExpr8PathExpr $ PathExpr[Qualifier] $ StepQualifier //=PathExpr $ CountExpr $ not Qualifier $

Qualifier and Qualifier $ Qualifier or Qualifier $ @n

CountExpr //=count(PathExpr%) Comp k

PathExpr% //=PathExpr%8PathExpr% $ PathExpr%[Qualifier%] $ StepQualifier% //=PathExpr% $ not Qualifier% $ Qualifier% and Qualifier%

$ Qualifier% or Qualifier% $ @n

Comp //= &$>$?$<$=XPath //=PathExpr $ 8PathExpr $ XPath union PathExpr $

XPath intersect PathExpr $ XPath except PathExpr

Figure 5: Syntax of Core XPath Expressions.

which constraints the number of preceding siblings named “a” to 4, so thatthe qualifier becomes true only for the 5th child “a”. A general translation ofpositional information in terms of counting operators [11, 27] is summarized onFigure 7, where@ denotes the document order (depth-first left-to-right) relationin a tree. Note that translated path expressions can in turn be expressed into thecore XPath fragment of Figure 5 (at the cost of another exponential blow-up).Indeed, expressions like PathExpr8(PathExpr2 except PathExpr3)8PathExpr4must be rewritten into expressions where binary connectives for paths occuronly at top level, as in:

PathExpr8PathExpr28PathExpr4 except

PathExpr8PathExpr38PathExpr4We focus on Core XPath expressions involving the counting operator (see

Figure 5). The XPath fragment without the counting operator (the naviga-tional fragment) was already linearly translated into µ-calculus in [10]. Thecontributions presented in this paper allow to equip this navigational fragmentwith counting features such as the ones formulated above. Logical formulascapture the aforementioned XPath counting constraints. For example, considerthe following XPath expression:

child::a[count(descendant::b[parent::c])>5]

RR n° 7251

On the Count of Trees 13

!Axis::NameTest" ={(x, y) - N2 $ x(Axis)y and

y satisfies NameTest}!8PathExpr" ={(r, y) - !PathExpr" $

r is the root}!P18P2" =!P1" A !P2"

!P1 union P2" =!P1" ! !P2"!P1 intersect P2" =!P1" " !P2"

!P1 except P2" =!P1" B !P2"!PathExpr[Qualifier]" ={(x, y) - !PathExpr" $

y - !Qualifier"Qualif}!PathExpr"Qualif ={x $ Cy.(x, y) - !PathExpr"}

!count(PathExpr) Comp k"Qualif ={x - N $${y $ (x, y) - !PathExpr"} $satisfies Comp k}

!not Q"Qualif =N B !Q"Qualif

!Q1 and Q2"Qualif =!Q1"Qualif " !Q1"Qualif

!Q1 or Q2"Qualif =!Q2"Qualif ! !Q2"Qualif

Figure 6: Semantics of Core XPath Expressions

PathExpr[position() = 1] 7PathExpr except (PathExpr8 @)PathExpr[position() = k + 1] 7(PathExpr intersect

(PathExpr[k]8@))[position()=1]@7(descendant::*) union (a-o-s::*8

following-sibling::*8d-or-s::*)a-or-s::* 7ancestor::* union self::*

d-or-s::* 7descendant::* union self::*

Figure 7: Positional Information as Syntactic Sugars [11, 27]

RR n° 7251

On the Count of Trees 14

Path Logical formula%8self::* %

%8child::* 5,!,+6%%8parent::* 5)65*!6%

%8descendant::* 5(, $ +)!,+6%%8ancestor::* 5)65() $ *)!6%

%8following-sibling::* 5,65,!6%%8preceding-sibling::* 5*65*!6%

Figure 8: XPath axes as modalities over binary trees.

This expression selects the children nodes named “a” provided they have morethan 5 descendants which (1) are named “b” and (2) whose parent is named“c”. The logical formula denoting the set of children nodes named “a” is:

$ = a % 5,!,+63The logical translation of the above XPath expression is:

$ % 5)65()$*)#6>5(b % µx.5+6c 4 5,6x)This formula holds for nodes selected by the XPath expression. A correspon-dence between the main XPath axes over unranked trees and modal formulasover binary trees is given in Figure 8. In this figure, each logical formula holdsfor nodes selected by the corresponding XPath axis from a context %.

Let consider another example (XPath expression e1):

child::a/child::b[count(child::e/descendant::h)>3]

Starting from a given context in a tree, this XPath expression navigates tochildren nodes named “a” and selects their children named “b”. Finally, itretains only those “b” nodes for which the qualifier between brackets holds.The first path can be translated in the logic as follows:

& = b % µx.5+6(a % µx%.5+63 4 5,6x%) 4 5,6xThe counting part requires a more sophisticated translation in the logic. This

is because it makes implicit that “e” nodes (whose existence is simply tested forcounting purposes) must be children of selected “b” nodes. The translation ofthe full aforementioned XPath expression is as follows:

& %@n % 5(+ $ ,)!, () $ *)!6>3'where @n is a new fresh nominal used to mark a “b” node which is filtered bythe qualifier and the formula ' describes the counted “h” nodes:

' = h % µx.5+6(e % µx%.5+6@n 4 5,6x%) 4 5,6x 4 5+6xRR n° 7251

On the Count of Trees 15

Intuitively, the general idea behind the translation is to first translate the leadingpath, use a fresh nominal for marking a node which is filtered, then find at least“3” instances of “h” nodes from which we can reach back the marked node viathe inverse path of the counting formula.

Since trails make it possible to navigate but not to test properties (likeexistence of labels), we test for labels in the counted formula ' and we use ageneral navigation (+ $ ,)!, () $ *)! to look for counted nodes everywherein the tree. Introducing the nominal is necessary to bind the context properly(without loss of information). Indeed, the XPath expression e1 makes implicitthat a “e” node must be a child of a “b” node selected by the outer path. Usinga nominal, we restore this property by connecting the counted nodes to theinitial single context node.

Lemma 3.1. The translation of Core XPath expressions with counting con-straints into the logic is linear.

It is proven by structural induction in a similar manner to [10] (in which thetranslation is proven for expressions without counting constraints). For countingformulas, the use of nominals and the general (constant-size) counting trail makeit possible to avoid duplication of trails so that the translation remains linear.

We can now address several decision problems such as equivalence, con-tainment, and emptiness of XPath expressions. These decision problems arereduced to test satisfiability for the logic (in the manner of [10]). We present inSection 4 a satisfiability testing algorithm with a single exponential complexitywith respect to the formula size.

In [10], it was show the logic is also able to capture XML schema languages.This allows to test the XPath decision problems in the presence of XML types.We now show our logic can also succinctly express cardinality constraints onXML types.

3.2 Regular Tree Languages with Cardinality Constraints

Regular tree grammars capture most of the schemas in use today [23]. The logiccan express all regular tree languages (it is easy to prove that regular expressiontypes in the manner of e.g., [14] can be linearly translated into the logic: see[10]).

In practice, schema languages often provide shorthands for expressing car-dinality constraints on node occurrences. XML Schema notably o"ers two at-tributes minOccurs and maxOccurs for this purpose. For instance, the followingXML schema definition:

<xsd:element name="a"><xsd:complexType>

<xsd:sequence><xsd:element name="b" minOccurs="4" maxOccurs="9"/></xsd:sequence>

</xsd:complexType></xsd:element>

RR n° 7251

On the Count of Trees 16

is a notation that restricts the number of occurrences of “b” nodes to be atleast 4 and at most 9, as children of “a” nodes. The goal here is to havea succinct notation for expressing regular languages which could otherwise beexponentially large if written with usual regular expression operators. The aboveregular requirement can be translated as the formula:

# % 5)6(5*#6>3b % 5*#6"9b)where # corresponds to the regular tree type a[b!] as follows:

# = (a % (¬5)63 4 5)6$)) % ¬5*63$ = µx. (b % ¬5)63 % ¬5*63) 4 (b % ¬5)63 % 5*6x)

This example only involves counting over children nodes. The logic allowscounting through more general trails, and in particular arbitrarily deep trails.Trails corresponding to the XPath axes “preceding, ancestor, following” can beused to constrain the context of a schema. The “descendant” trail can be usedto specify additional constraints over the subtree defined by a given schema.For instance, suppose we want to forbid webpages containing nested anchors“a” (whose interpretation makes no sense for web browsers). We can build thelogical formula f which is the conjunction of a considered schema for webpages(e.g. XHTML) with the formula a8descendant::a in XPath notation. Nestedanchors are forbidden by the considered schema i" f is unsatisfiable.

As another example, suppose we want paragraph nodes (“p” nodes) not to benested inside more than 3 unordered lists (“ul” nodes), regardless of the schemadefining the context. One may check for the unsatisfiability of the followingformula:

p % 5(+$,)#,+6>3ul4 Satisfiability Algorithm

We present a tableau-based algorithm for checking satisfiability of formulas.Given a formula, the algorithm seeks to build a tree containing a node selectedby the formula. We show that our algorithm is correct and complete: a satisfyingtree is found if and only if the formula is satisfiable. We also show that the timecomplexity of our algorithm is exponential in the size of the formula.

4.1 Overview

The algorithm operates in two stages.First, a formula # is decomposed into a set of subformulas, called the lean.

The lean gathers all subformulas that are useful for determining the truth statusof the initial formula, while eliminating redundancies. For instance, conjunctionsand disjunctions are eliminated at this stage. More precisely, the lean (definedin 4.2) mainly gathers atomic propositions and modal subformulas. From thelean, one may gather a finite number of formulas, called a #(node, which may

RR n° 7251

On the Count of Trees 17

be satisfied at a given node of a tree. Trees of #(nodes represent the exhaustivesearch universe in which the algorithm is looking for a satisfying tree.

The second stage of the algorithm consists in the building of sets of suchtrees in a bottom-up manner, ensuring consistency at each step. Initially, allpossible leaves (i.e., #(node that do not require children nodes) are considered.During further steps, the algorithm considers every possible #(node that canbe connected with a tree of the previous steps, checking for consistency. Forinstance, if a formula at a #(node n involve a forward modality 5)6#%, then #%must be verified at the first child of n. Reciprocally, due to converse modalities,a #(node may impose restrictions on its possible parent nodes. The new treesthat are built may involve converse modalities, which will be satisfied duringfurther steps of the algorithm. To ensure the algorithm terminates, a bound onthe number of times each #(node may occur in the tree is given.

Finally, the algorithm terminates whenever:

• either a tree that satisfies the initial formula has been found, and its rootdoes not contain any pending (unproven) backward modality; or

• every tree has been considered (the exploration of the whole search uni-verse is complete): the formula is unsatisfiable.

4.2 Preliminaries

To track where counting formulas are satisfied, we annotate each one with afresh counting proposition c, yielding formulas of the form 5!6c#k#. To definethe notions of lean and #(nodes, we need to extract navigating formulas fromcounting formulas (Figure 9).

We now define the Fisher-Ladner relation to extract subformulas. In thefollowing, i ranges over 1 and 2.

Rfl(#1 % #2,#i), Rfl(#1 4 #2,#i),Rfl(µx.#,#[µx.#8x]), Rfl(5!6c#k$, nav(5!6c#k$)),Rfl(5m6#,#).

The Fisher-Ladner closure of a formula #, written FL(#), is the set definedas follow.

FL(#)0 = {#},FL(#)i+1 = FL(#)i ! {#% $ Rfl(#%%,#%),#%% - FL(#)i},FL(#) = FL(#)k,

where k is the smallest integer s.t. FL(#)k = FL(#)k+1. Note that this set isfinite since only one expansion of a fixpoint formula is required in order toproduce all its subformulas in the closure.

The lean of a formula # is a set of formulas containing navigating formulasof the form 5m63, every navigating formulas of the form 5m6$ (i.e., that do

RR n° 7251

On the Count of Trees 18

nav(x) = x nav(p) = pnav(3) = 3 nav(c) = c

nav(¬p) = ¬p nav(¬5m63) = ¬5m63nav(#1 % #2) = nav(#1) % nav(#2)nav(#1 4 #2) = nav(#1) 4 nav(#2)nav(5m6#) = 5m6nav(#)nav(µx.$) = µx.nav($)

nav(5!6c>k$) = nav((!),$ % c)nav(5!6c"k$) = nav((!), ($ % c) 4 (¬$ % ¬c))nav(((),$) = $nav((m),$) = 5m6$

nav((!1,!2),$) = nav((!1), nav((!2),$))nav((!1 $ !2),$) = nav((!1),$) 4 nav((!2),$)

nav((!#),$) = µx.nav($) 4 nav((!), x)Figure 9: Navigation extraction from counting formulas

not contain counting formulas) from FL(#), every proposition occurring in #,written P#, every counting proposition, written C, and an extra propositionthat does not occur in # used to represent other names, written p#.

lean(#) = {5m63} ! {5m6$ - FL(#)} !P# !C ! {p#}A #(node , written n#, is a subset of lean(#), such that:

• exactly one proposition from P# ! {p#} is present;• when 5m6$ is present, then 5m63 is present; and

• both 5+63 and 5,63 cannot be present at the same time.

The set of #(nodes is defined as N#.Intuitively, a node n# corresponds to a formula.

n# = D")n!

$ % D")lean(#)*n!

¬$When the formula # under consideration is fixed, we often omit the super-

script.A #tree is either the empty tree 9, or a triple (n#,$1,$2) where $1 and $2

are #trees. When clear from the context, we usually refer to #trees simply astrees.

RR n° 7251

On the Count of Trees 19

n E# 3$ - nn E# $

$ 8- nn E# ¬$

n E# $1 n E# $2

n E# $1 %$2

n E# $1

n E# $1 4$2

n E# $2

n E# $1 4$2

n E# ${µx."8x}n E# µx.$

Figure 10: Local entailment relation: between nodes and formulas

We now turn to the definition of consistency of a #tree. To this end, wedefine an entailment relation between a node and a formula in Figure 10.

Two nodes n1 and n2 are consistent under modality m - {),*}, writtenR#(n1,m) = n2, i"

F5m6$ - lean(#), 5m6$ - n1 G' n2 E# $

F5m6$ - lean(#), 5m6$ - n2 G' n1 E# $

Consistency is checked each time a node is added to the tree, ensuring thatforward modalities of the node are indeed satisfied by the nodes below, and thatpending backward modalities of the node below are consistent with the addednode. Note that counting formulas are not considered at this point, as they areglobally verified in the next step.

Upon generation of a finished tree, i.e., a tree with no pending backwardmodality, one may check whether a node of this tree satisfies #. To this end, wefirst define forward navigation in a #tree $. Given a path consisting of forwardmodalities ", $(") is the node at that path. It is undefined if there is no suchnode.

(n,$1,$2)(() = n(n,$1,$2)()") = $1(")(n,$1,$2)(*") = $2(")We also allow extending the path with backward modalities if they match thelast modality of the path.

(n,$1,$2)(")+) = (n,$1,$2)(")(n,$1,$2)("*,) = (n,$1,$2)(")Now, we are able to define an entailment relation along paths in #trees

in Figure 11. This relation extends local entailment relation (Figure 10) withchecks for counting formulas. Note that the case for fixpoints is contained in thecase for formulas with no counting subformula. In the “less than” case, we needto make sure that every node reachable through the trail is taken into account,either as counted if it satisfies $, or not counted otherwise (in this case, ¬$denotes the negation normal form).

RR n° 7251

On the Count of Trees 20

#% does not contain counting formulas $(") E# #%" E#! #%

" E#! #1 " E#! #2

" E#! #1 % #2

" E#! #1

" E#! #1 4 #2

" E#! #2

" E#! #1 4 #2

"m E#! #%" E#! 5m6#%

${n%, "% - ! % $(""%) = n% % n% E# $ % c}$ > k" E#! 5!6c>k$

${n%, "% - ! % $(""%) = n% % n% E# $ % c}$ & kF"% - !,$(""%) E# ($ % c) 4 (¬$ % ¬c)" E#! 5!6c"k$

Figure 11: Global entailment relation (incl. counting formulas)

We conclude these preliminaries by introducing some final notations. Theroot of a #tree is defined as follows.

root(9) = 9root((n,$1,$2)) = n

A #tree $ satisfies a formula #, written $ E #, if neither 5+63 nor 5,63 occurin root($), and if there is a path " such that " E#! #. A set of trees ST satisfiesa formula #, written ST E #, when there is a tree $ - ST such that $ E #.

4.3 The Algorithm

We are now ready to present the algorithm, which is parameterized by K(#)(defined in Figure 12), the maximum number of occurrences of a given nodein a path from the root of the tree to a leaf. The algorithm builds consistentcandidate trees from the bottom up, and checks at each step if one of the builttree satisfies the formula, returning 1 if it is the case. As the set of nodes fromwhich to build the trees is finite, it eventually stops and returns 0 if no satisfyingtree has been found.

To bound the size of the trees that are built, we restrict the number ofidentical nodes on a path from the root to any leaf by K(#) + 2, defined inFigure 12, using nmax defined as follows.

nmax(n,$1,$2) = max(nmax(n,$1),nmax(n,$2))nmax(n, (n,$1,$2)) = 1 + nmax(n,$1,$2)nmax(n, (n%,$1,$2)) = nmax(n,$1,$2) if n : n%

nmax(n,9) = 0RR n° 7251

On the Count of Trees 21

Algorithm 1 Check Satisfiability of #ST H 9repeatAUX H {(n,$1,$2) $ {we extend the trees}nmax(n,$1,$2) &K(#) + 2 {with an available node}for i in ),* {and each child is either}$i = 9 and 5i63 I n {an empty tree}or $i - ST {or a previously built tree}5i63 - root($i) {with pending backward modalities}R#(n, i) = root($i)} {checking consistency}

if AUX < ST thenreturn 0 {No new tree was built}

end ifST H ST !AUX

until ST E #return 1

K(p) =K(¬p) =K(¬5m63) =K(3) =K(µx.$) = 0K(#1 % #2) =K(#1 4 #2) =K(#1) +K(#2)K(5m6#) =K(#)K(5!6#k$) = k + 1

Figure 12: Occurrences bound

RR n° 7251

On the Count of Trees 22

p1 p2 p3 . . . p2. . .

p2

p2

p1

)**

Figure 13: Checking # = p1 % 5)65*#6>2p2Consider for instance the formula # = p1 % 5)65*#6>2p2. The computed lean

is as follows, where $ = µx.(p2 % c) 4 5*6x.{p1, p2, p3, c, 5)63, 5*63, 5+63, 5,63, 5)6$, 5*6$}

Names other than p1 and p2 are represented by p3; c identifies counted nodes.Computing the bound on nodes, we get K(#) = 3.

After the first step, ST consists of the trees ({pi},9,9), ({pi, c},9,9),({pi, 5j63},9,9), and ({pi, c, 5j63},9,9) with i - {1,2,3} and j - {),*}. Atthis point the three finished trees in ST are tested and found not to satisfy #.

After the second iteration many trees are created, but the one of interest isthe following.

T0 = ({p2, c, 5*63, 5,63, 5*6$},9, ({p2 , c, 5,63},9,9))The third iteration yields the following tree.

T1 = ({p2, c, 5*63, 5+63, 5*6$},9, T0)We can conclude by the fourth iteration when we find the tree ({p1, 5)6$, 5)63}, T1,9),

which is found to satisfy # at path (. As the nodes at every step are di"erent,the limit is not reached. Figure 13 depicts a graphical representation of theexample where counted nodes (containing c) are drawn as thick circles.

4.4 Termination

Proving termination of the algorithm is straightforward, as only a finite numberof trees may be built and the algorithm stops as soon as it cannot build a newtree.

4.5 Soundness

If the algorithm terminates with a candidate, we show that the initial formulais satisfiable. Let $ and " be the #tree and path such that " E#! #. We build

RR n° 7251

On the Count of Trees 23

a tree from $ and show that the interpretation of # for this tree includes thenode at path ".

We write T ($) for the tree (N,R,L) defined as follows. We first rewrite $such that each node n is replaced by the path to reach it (i.e, nodes are identifiedby their path).

path(n,$1,$2) # ((, path(),$1), path(*,$2))path(", (n,$1,$2))# (", path("),$1), path("*,$2))

path(",9)# 9We then define:

• N = nodes(path($));• for every (",$1,$2) in path($) and i = ),*, if $i : 9 then R(", i) = "iand R("i, i) = "; and

• for all " - N if p - $(") then L(") = p.Lemma 4.1. Let $ a subformula of # with no counting formula. If $(") E# $

then we have " - [[$]]T (!)' .

Proof. We proceed by induction on the lexical ordering of the number of un-folding of $ that are required for T ($) as defined by Lemma 2.1, and of the sizeof the formula.

The base cases are 3, atomic or counting propositions, and negated forms.

These are immediate by definition of [[$]]T (!)' . The cases for disjunction andconjunction are immediate by induction (the formula is smaller). The case forfixpoints is also immediate by induction, as the number of unfoldings required

decreases, and as [[µx.$]]T (!)' = [[${µx."8x}]]T (!)' .The last case is the presence of a modality 5m6$ from the #node $("). In this

case we rely on the fact that the nodes $("m) and $(") are consistent to derive$("m) E# $. We then conclude by induction as the formula is smaller.

Theorem 4.2 (Soundness). If " E#! # then " - [[#]]T (!)'Proof. The proof proceeds by induction on the derivation of " E#! #. Most casesare immediate (or rely on Lemma 4.1). For the “greater than” counting case,we rely on the k+1 selected nodes that have to satisfy $%c thus $. In addition,in the “less than” case, every node that is not counted has to satisfy ¬$ % ¬c,so in particular ¬$. In both cases we conclude by induction.

4.6 Completeness

Our proof proceeds in two step. We build a #tree that satisfies the formula, thenwe show it is actually built by the algorithm. As the proof is quite complex, wedevote some space to detail it.

RR n° 7251

On the Count of Trees 24

Assume that formula # is satisfiable by a tree T . We consider the smallestsuch tree (i.e., the tree with the fewest number of nodes) and fix n#, a nodewitnessing satisfiability.

We now build a #tree homomorphic to T , called the lean labeled version of #,written $(T,#). To this end, we start by annotating counted nodes along withtheir corresponding counting proposition, yielding a new tree Tc. Starting fromn# and by induction on #, we proceed as follows. For formulas with no countingsubformula, including recursion, we stop. For conjunction and disjunction offormulas, we recursively annotate according to both subformulas. For modal-ities, we recursively annotate from the node under the modality. For 5!6c"k$,we annotate every selected node with the counting proposition correspondingto the formula. For 5!6c>k$, we annotate exactly k + 1 selected nodes.

We now extend the semantics of formulas to take into account countingpropositions and annotated nodes, written [[J]]Tc

V . The definition is identicalto Figure 4, with one addition and two changes. The addition is for countingpropositions, which we define as n - [[c]]Tc

V i" n is annotated by c. The twochanges are for counting propositions, which we define as follows, where weselect only nodes that are annotated.

[[5!6"k#%]]TcV = {n, ${n% - [[#%]]Tc

V " [[c]]TcV , n

!1# n%}$ & k}[[5!6>k#%]]Tc

V = {n, ${n% - [[#%]]TcV " [[c]]Tc

V , n!1# n%}$ > k}

We show that this modification of the semantics does no change the satisfi-ability of the formula.

Lemma 4.3. We have n# - [[#]]Tc' .

Proof. We proceed by recursion on the derivation n# - [[#]]T' . The cases whereno counting formula is involved, thus including fixpoints, are immediate, asthe selected nodes are identical. The disjunction, conjunction, and modalitycases are also immediate by induction. The interesting cases are the countingformulas.

For 5!6c>k$, as there are exactly k+1 nodes annotated, the property is true byinduction. For 5!6c"k$, we rely on the fact that every counted node is annotated.We conclude by remarking that $ does not contain a counting formula, thus wehave [[$]]Tc

V = [[$]]TV and [[¬$]]TcV = [[¬$]]TV .

To every node n, we associate n#, the largest subset of formulas of the leanselecting the node.

n# = {#0 $ n - [[#0]]T' ,#0 - lean(#)}This is a #-node as it contains one and exactly one proposition, and if it

includes a modal formula 5m6$, then it also includes 5m63. The tree $(T,#) isthen built homomorphically to T .

In the remainder of this section, we write $ for $(T,#). We now check that$ is consistent, starting with local consistency.

RR n° 7251

On the Count of Trees 25

$ - lean(#)$

.- lean(#)$1

.- lean(#) $2.- lean(#)

$1 %$2.- lean(#)

$1.- lean(#) $2

.- lean(#)$1 4 $2

.- lean(#) 3 .- lean(#)$ - (P# ! 5m63 !C)¬$ .- lean(#)

Figure 14: Formula induced by a lean

In the following, we say a formula $ is induced by the lean of #, written$

.- lean(#), if it consists of the boolean combination of subformulas from thelean as defined in Figure 14.

Lemma 4.4. Let 5m6$ be a formula in lean(#), and let $% be $ after unfoldingits fixpoint formulas not under modalities. We have $% .- lean(#).Proof. By definition of the lean and of the

.- relation.Lemma 4.5. Let $ be a formula induced by lean(#). We have n - [[$]]Tc' ifand only if n# E# $.

Proof. We proceed by induction on $. The base cases (the formula is in the#-node or is a negation of a lean formula not in the #-node) hold by definitionof n#. The inductive cases are straightforward as these formulas only containfixpoints under modalities.

Lemma 4.6. Let n1 and n2 such that R(n1,m) = n2 with m - {),*}. Wehave R#(n#

1 ,m) = n#2 .

Proof. Let 5m6$ be a formula in lean(#). We show that 5m6$ - n#1 G'

n#2 E# $. We have 5m6$ - n#

1 if and only if n1 - [[5m6$]]Tc' by definition of n#1 ,

which in turn holds if and only if n2 = R(n1,m) - [[$]]Tc' . We now consider $%which is $ after unfolding its fixpoint formulas not under modalities. We have[[$%]]Tc' = [[$]]Tc' and we conclude by Lemmas 4.4 and 4.5.

We now turn to global consistency, taking counting formulas into account.

Lemma 4.7. Let #s be a subformula of #, and " be a path from the root in Tsuch that T (") - [[#s]]Tc' . We then have " E#! #s.

Proof. We proceed by induction on #s.If #s does not contain any counting formula, we consider #%s which is #s after

unfolding its fixpoint formulas not under modalities. We have [[#%s]]Tc' = [[#s]]Tc'and #%s .- lean(#). We conclude by Lemma 4.5.

For most inductive cases, the proof is immediate by induction, as the formulasize decreases.

RR n° 7251

On the Count of Trees 26

For 5!6c>k$, we have by induction for every counted node ""% E#! $ and

""% E#! c. We conclude by the conjunction rule and by the counting rule ofFigure 11.

For 5!6c"k$, we proceed as above for the counted nodes. For the nodes thatare not counted, we have $(""%) E# ¬$ by Lemma 4.5 (since ¬$ .- lean(#)). Weconclude by remarking that the node is not annotated by c, hence $(""%) E#¬c.

We next show that the #tree $ is actually built by the algorithm. The prooffollows closely the one from [10], with a crucial exception: we need to makesure there are enough instances of each formula. Indeed, in [10], the algorithmuses a #type (a subset of lean(#)) at most once on each branch from the rootto a leaf of the built tree. This yields a simple condition to stop the algorithmand conclude the formula is unsatisfiable. However, in the presence of countingformulas, a given #type may occur more than once on a branch. To maintainthe termination of the algorithm, we bound the number of identical #type thatmay be needed by K(#) as defined in Figure 12. We thus need to check thatthis bound is su%cient to build a tree for any satisfiable formula.

We recall that # is a satisfiable formula and T is a smallest tree such that #is satisfied, and n# is a witness of satisfiability.

We proceed in two steps: first we show that counted nodes (with countedpropositions) imply a bound on the number of identical #types on a branch fora smallest tree. Second, we show that this minimal marking is bound by K(#).

In the following, we call counted nodes and node n# annotations. We definethe projection of an annotation on a path. Let " be a path from the root of thetree to a leaf. An annotation projects on " at "1 if " = "1"2, the annotation isat "1"m, and "2 shares no prefix with "m.

Lemma 4.8. Let $% be the annotated tree, " a path from the root of the treeto a leaf, n1 and n2 two distinct nodes of " such that n#

1 = n#2 . Then either

annotations projects both on " at n1 and n2, or an annotation projects strictlybetween n1 and n2.

Proof. We proceed by contradiction: we assume there is no annotation thatprojects between n1 and n2 and at most one of them has an annotation thatprojects on it. Without loss of generality, we assume that n2 is below n1 in thetree.

Assume neither n1 nor n2 is annotated (through projection). We considerthe tree $s where n2 is “grafted” upon n1. Formally, let "1 be the path to n1

and "1"2 the path to n2. We remove every node whose path is of the form "1"3where "2 is not a prefix of "3, and we also remove node n2. The mapping R%from nodes and modalities to nodes is the same as before for the node that arekept except for n1, where R%(n1,)) = R(n2,)) and R%(n1,*) = R(n2,*). Forevery path " of $, let "s be the potentially shorter path if it exists (i.e., if itwas not removed when pruning the tree). More precisely, if "% = "%1"%3 where "%1is a prefix of "1 and the paths are disjoint from there, then $s("%) = $("%). If"% = "1"2"3, then $s("1"3) = $("%).RR n° 7251

On the Count of Trees 27

We now show that $s still satisfies # at n#, a contradiction since this tree isstrictly smaller than $.

First, as there was no annotation projected, n# is still part of this tree at apath "s. We show that we have "s E#!s

# by induction on the derivation " E#! #.

Let "% E#! #% in the derivation, assuming that "%s is defined.The case where #% does not mention any counting formula is trivial: $("%) =

$s("%s) thus local entailment is immediate.Conjunction and disjunction are also immediate by induction.We now turn to the modality case, 5m6#% where #% contains a counting

formula. If "% is neither "1 nor "1"2, we deduce from the fact that "%s is definedthat ("%m)s is also defined and we conclude by induction. We now assumethat "% is either "1 or "1"2 and find a contradiction. First, remark that "% E#!5m6#% implies that the navigation generated by 5m6#% is in $("1) = $("1"2). Aseach syntactic occurrence of a counting formula mentions a distinct countingproposition c, this is possible only if the counting formula is under a fixpoint orunder another counting formula, both of which are impossible.

We finally turn to the counting case 5!6c#k$. We say that a path does notcross over when this path does not contain n1 nor n2. For nodes that arereached using paths that do not cross over, we conclude by induction that theyare also counted. We show that the remaining nodes reached through a crossoverremain reachable (there cannot be any counted node in the part of the tree thatis removed since counted nodes are annotated and there was no annotation inthe part removed). Without loss of generality, assume that "% is a prefix of "1(the counting formula is in the “top” part of the tree), and let "n be the pathfrom the counting formula to the counted node ("n is an instance of the trail!). This path is of the shape "%1"2"c, with "1 = "%"%1. We now show that thepath "%1"c is an instance of ! if and only if "n is an instance of the trail, thusthe same node is still reached.

Recall that ! is of the shape !1, . . . ,!n,!n+1 where !1 to !n are of theform !#ri and where !n+1 does not contain a repeated trail. We say that aprefix "p of a path " stops at i if there is a su%x "s such that "p"s is still aprefix of ", "p"s - !1, . . . ,!i, and there is no shorter su%x "%s and j such that"p"%s - !1, . . . ,!j . (Intuitively, !i is the trail being used when matching the endof "p.) If there are several satisfying indices i, we consider the smallest.

We first show that a counting proposition is necessarily mentioned in a for-mula of n#

2 , by contradiction. Assume no counting proposition is mentioned,yet the counting crossed-over. This can only occur for a “less than” countingformula that reaches n2 which is not counted (because the formula was false),and if there is no path whose "n is a strict prefix that is an instance of ! (oth-erwise, by definition of the lean and of nav (Figure 9), a formula of the formnav((!%), ($ % c) 4 (¬$ % ¬c)) would be true and thus would be present, con-tradicting the assumption that no counting proposition is mentioned). Sincen#1 = n#

2 , the same is true for n#1 , a direct contradiction to the fact that n2 is

also reached by the trail. Thus counting propositions are mentioned in n#1 and

n#2 .

RR n° 7251

On the Count of Trees 28

We next show that there are i & j & n such that both "%1 stops at i and"%1"2 stop at j, i.e., neither i nor j may be n + 1. Recall that !n+1 does notcontain a repeated subtrail. Thus every formula of n#

2 mentioning c is of theform nav((!%),$), where !% does not contain a repetition. We consider thelargest such formula. Since n1 is before n2 in the path from the counting nodeto the counted node, a similar formula with a larger trail or with a repetitionmust occur in n#

1 , contradicting n#1 = n#

2 .Consider next the su%xes "1s and "2s computed when stating that the paths

stop at i and j. These su%xes correspond to the path matching the end of !i

and !j , respectively (before the next iteration or switching to the next subtrail).

They have matching formulas in n#1 and n#

2 . As the formulas are present inboth nodes, then the remainder of the paths ("2"c and "c) are instances of("1s $"2s)!i . . .!n+1, thus "%1"c is an instance of ! if and only if "n is.

In the case of “greater than” counting, we conclude immediately by inductionas the same nodes are selected (thus there are enough). In the case of “less than”,we need to check that no new node is counted in the smaller tree. Assumeit is not the case for the formula 5!6"k$, thus there is a path "n - ! to anode satisfying $. As the same node can be reached in $, and as we have$("%"n) E# ¬$ by induction, we have a contradiction.

This concludes the proof when neither n1 nor n2 is annotated. The proofis identical when n2 is annotated. If n1 is annotated, we look at the firstmodality between n1 and n2. If it is a ), then we build the smaller tree bydoing R%(n1,)) = R(n2,)) (we remove the * subtree from n2 instead of n1).Symmetrically, if the first modality is a *, we consider R%(n1,*) = R(n2,*) assmaller tree. The rest of the proof proceeds as above.

Theorem 4.9 (Completeness). If # is satisfiable, then a satisfying tree is built.

Proof. The proof proceeds as in [10], we only need to check there are enoughcopies of each node to build every path. Let " be a path from the root of thetree to the leaves. By Lemma 4.8, there are at most n+1 identical nodes in thispath, where n is the number of annotations. The number of annotations is c+1where c is the number of counted nodes. We show by an immediate inductionon the formula # that c is bound by K(#) as defined in Figure 12. We concludeby remarking that K(#) + 2 is the number of identical nodes we allow in thealgorithm.

4.7 Complexity

We now show that the time complexity of the satisfiability algorithm is expo-nential in the formula size. This is achieved in two steps: we first show thatthe lean size is linear in the formula size, then we show that the algorithm hasa single exponential complexity with relation to the lean size.

Lemma 4.10. The lean size is linear in terms of the original formula size.

RR n° 7251

On the Count of Trees 29

Proof Sketch. First note that the size of the lean is the number of elements itcontains; the size of each element does not matter.

It was shown in [10] that the size of the lean generated by a non-countingformula is linear with respect to the formula size.

We now describe the case for counting formulas. The lean consists of proposi-tions and of modal subformulas, including the ones generated by the navigationof counting formulas (Figure 9). Moreover, each counting formula adds one freshcounting proposition. In the case of “less than” formulas 5!6"k$, a duplicationoccurs due to the consideration of the negated normal form of $. Since thereis no counting under counting, this duplication and the fact that the negatednormal form of a formula is linear in the size of the original formula (Figure 3)result in the lean remaining linear. Another duplication occurs in the case ofcounting formulas of the form 5!1$!26#k$. This duplication does not doublethe size of the lean, however, since $ still occurs only once in the lean, thus thenumber of elements in the lean induced by nav((!1),$) 4 nav((!2),$) is thesame as the sum of the ones in nav((!1),$) and in nav((!2), J).Theorem 4.11. The satisfiability algorithm for the logic is decidable in time2O(n), where n is the size of the lean.

Proof Sketch. The maximum number of considered nodes is the number of dis-tinct tree nodes which is 2n, the number of subsets of the lean. For a givenformula #, the number of occurrences of the same node in the tree is boundedby K(#) & k >m, where k is the greatest constant occurring in the countingformulas and m is the number of counting subformulas of #. Hence the numberof steps of the algorithm is bounded by 2n > k >m.

At each iteration, the main operation performed by the algorithm is thecomposition of trees stored in AUX . The cost of each iteration consists in:the di"erent searches needed to form the necessary triples (n,$1,$2), the nmaxfunction andR#. Since the total number of nodes is exponential, and the numberof di"erent subtrees too, therefore the maximum number of newly formed trees(triples) at each step has also an exponential bound. The function nmax performsa single traversal of the tree which is also exponential. Since the entailmentrelation involved in the definition of R# is local, R# is performed in linear time.Computing the containment AUX < ST and the union ST ! AUX are linearoperations over sets of exponential size.

The stop condition of the algorithm is checked by the global entailmentrelation. It involves traversals parametrized by the number of trees, the numberof nodes in each tree, the number of traversals for the entailment relation ofcounting formulas, and K(#). Its time complexity is bounded by (2n > k >m)3.

Hence, the total time complexity of the algorithm is bounded by (2n>k>m)k! ,for some constant k%.

RR n° 7251

On the Count of Trees 30

5 Related Work

Counting over trees The notion of Presburger Automata for trees, combin-ing both regular constraints on the children of nodes and numerical constraintsgiven by Presburger formulas, has independently been introduced by Dal Zilioand Lugiez [5] and Seidl et al. [25]. Specifically, Dal Zilio and Lugiez [5] proposea modal logic for unordered trees called Sheaves logic. This logic allows to im-pose certain arithmetical constraints on children nodes but lacks recursion (i.e.,fixpoint operators) and inverse navigation. Dal Zilio and Lugiez consider thesatisfiability and the membership problems. Demri and Lugiez [6] showed bymeans of an automata-free decision procedure that this logic is only PSPACE-complete. Restrictions like p1 nodes have no more “children” than p2 nodes, aresuccinctly expressible by this approach. Seidl et al. [25] introduce a fixpointPresburger logic, which, in addition to numerical constraints on children nodes,also supports recursive forward navigation. For example, expressions like thedescendants of p1 nodes have no more “children” than the number of children ofdescendants of p2 nodes are e%ciently represented. This means that constraintscan be imposed on sibling nodes (even if they are deep in the tree) by forwardrecursive navigation but not on distant nodes which are not siblings.

Compared to the work presented here, neither of the two previous approachescan e%ciently support constraints like there are more than 5 ancestors of “p”nodes.

Furthermore, due to the lack of backward navigation, the works found in[5, 25, 6] are not suited for succinctly capturing XPath expressions. Indeed, it iswell-known that expressions with backward modalities are exponentially moresuccinct than their forward-only counterparts [11, 29].

There is poor hope to push the decidability envelope much further for count-ing constraints. Indeed, it is known from [16, 6, 27] that the equivalence problemis undecidable for XPath expressions with counting operators of the form:

• PathExpr1[count(PathExpr2) = count(PathExpr3)], or• PathExpr1[position() = count(PathExpr2)].

This is the reason why logical frameworks that allow comparisons between count-ing operators limit counting by restricting the PathExpr to immediate childrennodes [5, 25]. In this paper, we chose a di"erent tradeo": comparisons arerestricted to constants but at the same time comparisons along more generalpaths are permitted.

Counting over graphs The µ-calculus is a propositional modal logic aug-mented with least and greatest fixpoint operators [18]. Kupferman, Sattler andVardi study a µ-calculus with graded modalities where one can express, e.g.,that a graph node has at least n successors satisfying a certain property [19].The modalities are limited in scope since they only count immediate succes-sors of a given node. A similar notion in trees consists in counting immediatechildren nodes, as performed by the counting formula 5)65*!6#k#, where #

RR n° 7251

On the Count of Trees 31

describes the property to be counted. Compared to graded modalities of [19],we consider trees and we can extend the “immediate successor” notion to nodesreachable from regular paths, involving reverse and recursive navigation.

A recent study [2] focuses on extending the µ-calculus with inverse modali-ties [29], nominals [24], and graded modalities of [19]. If only two of the aboveconstructs are considered, satisfiability of the enriched calculus is EXPTIME-complete [2, 1]. However, if all of the above constructs are considered simul-taneously, the calculus becomes undecidable [2]. The present work shows thatthis undecidability result in the case of graphs does not preclude decidable treelogics combining such features.

XPath-like counting extensions The proposed logic can be the target forthe compilation of a few more sophisticated counting features, considered assyntactic sugars (and that may come at the potential extra cost of their trans-lation).

In particular, XPath allows nested counting, as in the expression

self::book[chapter[section > 1] > 1,which selects the current “book” node provided it has at least two “chapter”child nodes which in turn must contain at least two “section” nodes each. Fora simple set of formulas, formulas that count only on children nodes, such nest-ing can be translated into ordinary logical formulas. For instance, the logicalformulation of the above XPath expression can be captured as follows:

book % 5)6µx. (chapter % $ % 5*6µy.chapter % $ 4 5*6y) 4 5*6xwhere $ = 5)6µx.(section % 5*6µy.section 4 5*6y) 4 5*6x.

In [21], Marx introduced an “until” operator for extending XPath’s expres-sive power to be complete with respect to first-order logic over trees. Thisoperator is trivially expressible in the present logic, owing to the use of the fix-point binder. We can even combine counting features with the “until” operatorand express properties that go beyond the expressive power of the XPath stan-dard. For instance, the following formula states that “starting from the currentnode, and until we reach an ancestor named a, every ancestor has at least 3children named b”:

µx. (5)65*!6>2b % µy.5+6x 4 5,6y) 4 aThese extensions come at an extra cost, however. It is not di%cult to observe

(by induction) that, given a formula # with subformulas $1, ...,$n countingonly on children nodes, if formulas $1, ...,$n are replaced by their expansionsin #, yielding a formula #%, then $lean(#%)$ & $lean(#)$ > kl, where k is greatestnumerical constraint of the counting subformulas, and l is the greatest levelnesting of counting subformulas. As a consequence of Theorem 4.11, the logicextended with nested formulas counting on children nodes and formulas countingon children nodes under the scope of a fixpoint operator can be decided in time

2O(n!kl).

RR n° 7251

On the Count of Trees 32

6 Conclusion

We introduced a modal logic of trees equipped with (1) converse modalities,which allow to succinctly express forward and backward navigation, (2) a leastfixpoint operator for recursion, and (3) cardinality constraint operators for ex-pressing numerical occurrence constraints on tree nodes satisfying some regularproperties. A sound and complete algorithm is presented for testing satisfiabil-ity of logical formulas. This result is surprising since the corresponding logic forgraphs is undecidable [2].

The decision procedure for the logic is exponential time w.r.t. to the formulasize. The logic captures regular tree languages with cardinality restrictions, aswell as the navigational fragment of XPath equipped with counting features.Similarly to backward modalities, numerical constraints do not extend the log-ical expressivity beyond regular tree languages. Nevertheless they enhance thesuccinctness of the formalism as they provide useful shorthands for otherwiseexponentially large formulas.

This exponential gain in succinctness makes it possible to extend static anal-ysis to a larger set of XPath and XML schema features in a more e%cient way.We believe the field of application of this logic may go beyond the XML setting.For example, in verification of linked data structures [20, 30, 12] reasoning ontree structures with in-depth cardinality constraints seems a major issue. Ourresult may help building solvers that are attractive alternatives to those basedon non-elementary logics such as SkS [28], like Mona [17].

References

[1] Alessandro Bianco, Fabio Mogavero, and Aniello Murano. Graded compu-tation tree logic. In LICS, 2009.

[2] Piero Bonatti, Carsten Lutz, Aniello Murano, and Moshe Vardi. The com-plexity of enriched µ-calculi. In ICALP, 2006.

[3] J. Clark and S. DeRose. XML path language (XPath) version 1.0, Novem-ber 1999. W3C recommendation.

[4] Dario Colazzo, Giorgio Ghelli, and Carlo Sartiani. E%cient asymmetricinclusion between regular expression types. In ICDT, 2009.

[5] Silvano Dal-Zilio, Denis Lugiez, and Charles Meyssonnier. A logic you cancount on. In POPL, 2004.

[6] S. Demri and D. Lugiez. Presburger modal logic is PSPACE-Complete. InIJCAR, 2006.

[7] Wouter Gelade. Succinctness of regular expressions with interleaving, in-tersection and counting. In MFCS, 2008.

RR n° 7251

On the Count of Trees 33

[8] Wouter Gelade, Marc Gyssens, and Wim Martens. Regular expressionswith counting: Weak versus strong determinism. In MFCS, 2009.

[9] Wouter Gelade, Wim Martens, and Frank Neven. Optimizing schema lan-guages for XML: Numerical constraints and interleaving. SIAM J. Comput.,38(5), 2008.

[10] P. Geneves, N. Layaıda, and A. Schmitt. E%cient static analysis of XMLpaths and types. In PLDI, 2007.

[11] Pierre Geneves and Kristo"er Høgsbro Rose. Compiling XPath for stream-ing access policy. In DocEng, 2005.

[12] Peter Habermehl, Radu Iosif, and Tomas Vojnar. Automata-based verifi-cation of programs with tree updates. Acta Inf., 47(1):1–31, 2010.

[13] J.G. Henriksen, J. Jensen, M. Jørgensen, N. Klarlund, B. Paige, T. Rauhe,and A. Sandholm. Mona: Monadic second-order logic in practice. InTACAS, 1995.

[14] H. Hosoya, J. Vouillon, and B. C. Pierce. Regular expression types forXML. ACM Trans. Program. Lang. Syst., 27(1):46–90, 2005.

[15] Pekka Kilpelinen and Rauno Tuhkanen. One-unambiguity of regular expres-sions with numeric occurrence indicators. Information and Computation,205(6):890 – 916, 2007.

[16] F. Klaedtke and H. Rueß. Monadic second-order logics with cardinalities.In ICALP, 2003.

[17] Nils Klarlund and Anders Møller. MONA Version 1.4 User Manual. BRICSNotes Series NS-01-1, Department of Computer Science, University ofAarhus, January 2001.

[18] D. Kozen. Results on the propositional µ-Calculus. In ICALP, 1982.

[19] O. Kupferman, U. Sattler, and M. Y. Vardi. The complexity of the gradedµ-calculus. In CADE, 2002.

[20] Zohar Manna, Henny B. Sipma, and Ting Zhang. Verifying balanced trees.In LFCS, 2007.

[21] Maarten Marx. Conditional XPath. ACM Trans. Database Syst., 30(4),2005.

[22] Albert R. Meyer and Larry J. Stockmeyer. The equivalence problem forregular expressions with squaring requires exponential space. In FOCS,1972.

[23] M. Murata, D. Lee, M. Mani, and K. Kawaguchi. Taxonomy of XMLschema languages using formal language theory. ACM Trans. InternetTechn., 5(4):660–704, 2005.

RR n° 7251

On the Count of Trees 34

[24] Ulrike Sattler and Moshe Y. Vardi. The hybrid µ-calculus. In IJCAR, 2001.

[25] H. Seidl, T. Schwentick, A. Muscholl, and P. Habermehl. Counting in treesfor free. In ICALP, 2004.

[26] Alfred Tarski. A lattice-theoretical fixpoint theorem and its applications.Pacific J. Math., 5(2):285–309, 1955.

[27] Balder ten Cate and Maarten Marx. Axiomatizing the logical core of XPath2.0. Theor. Comp. Sys., 44(4):561–589, 2009.

[28] James W. Thatcher and Jesse B. Wright. Generalized finite automatatheory with an application to a decision problem of second-order logic.Mathematical Systems Theory, 2(1):57–81, 1968.

[29] M. Y. Vardi. Reasoning about the past with two-way automata. In ICALP,1998.

[30] Karen Zee, Viktor Kuncak, and Martin Rinard. Full functional verificationof linked data structures. In PLDI, 2008.

RR n° 7251

Centre de recherche INRIA Grenoble – Rhône-Alpes655, avenue de l’Europe - 38334 Montbonnot Saint-Ismier (France)

Centre de recherche INRIA Bordeaux – Sud Ouest : Domaine Universitaire - 351, cours de la Libération - 33405 Talence CedexCentre de recherche INRIA Lille – Nord Europe : Parc Scientifique de la Haute Borne - 40, avenue Halley - 59650 Villeneuve d’Ascq

Centre de recherche INRIA Nancy – Grand Est : LORIA, Technopôle de Nancy-Brabois - Campus scientifique615, rue du Jardin Botanique - BP 101 - 54602 Villers-lès-Nancy Cedex

Centre de recherche INRIA Paris – Rocquencourt : Domaine de Voluceau - Rocquencourt - BP 105 - 78153 Le Chesnay CedexCentre de recherche INRIA Rennes – Bretagne Atlantique : IRISA, Campus universitaire de Beaulieu - 35042 Rennes Cedex

Centre de recherche INRIA Saclay – Île-de-France : Parc Orsay Université - ZAC des Vignes : 4, rue Jacques Monod - 91893 Orsay CedexCentre de recherche INRIA Sophia Antipolis – Méditerranée : 2004, route des Lucioles - BP 93 - 06902 Sophia Antipolis Cedex

ÉditeurINRIA - Domaine de Voluceau - Rocquencourt, BP 105 - 78153 Le Chesnay Cedex (France)

http://www.inria.frISSN 0249-6399