
Generalized Multitext Grammars

A revised and extended version of the ACL’04 paper with the same title by the same authors.*

I. Dan Melamed (1), Giorgio Satta (2), and Benjamin Wellington (1)

(1) New York University   (2) University of Padua

Abstract

Generalized Multitext Grammar (GMTG) is a synchronous grammar formalism that is weakly equivalent to Linear Context-free Rewriting Systems (LCFRS), but retains much of the notational and intuitive simplicity of Context-Free Grammar (CFG). GMTG allows both synchronous and independent rewriting. Such flexibility facilitates more perspicuous modeling of parallel text than what is possible with other synchronous formalisms. This paper investigates the generative capacity of GMTG, proves that each component grammar of a GMTG retains its generative power, and proposes a generalization of Chomsky Normal Form, which is necessary for synchronous CKY parsing.

1 Introduction

Synchronous grammars have been proposed for the formal description of parallel texts representing translations of the same document. As shown by Melamed (2003), a plausible model of parallel text must be able to express discontinuous constituents. Since linguistic expressions often vanish in translation, a good model must also be able to express independent (in addition to synchronous) rewriting. Inversion Transduction Grammar (ITG) (Wu, 1997) and Syntax-directed Translation Schema (SDTS) (Aho and Ullman, 1969) lack both of these properties. Synchronous Tree Adjoining Grammar (STAG) lacks the latter and allows only limited discontinuities in each tree.

Generalized Multitext Grammar (GMTG) offers a way to synchronize Mildly Context-Sensitive Grammars (MCSGs), while satisfying both of the above criteria. The move to MCSG is motivated by our desire to more perspicuously account for certain syntactic phenomena in natural language that cannot be easily captured by context-free grammars, for example clitic climbing, extraposition, and other types of long-distance movement (Becker et al., 1991). On the other hand, MCSG still observes some restrictions that make the set of languages it generates less expensive to analyze than the languages generated by (properly) context-sensitive formalisms.

More technically, our proposal starts from Multitext Grammar (MTG), a formalism for synchronizing context-free grammars recently proposed by Melamed (2003). In MTG, synchronous rewriting is implemented by means of an indexing relation that is maintained over occurrences of nonterminals in a sentential form, using essentially the same machinery as SDTS. Unlike SDTS, MTG can extend the dimensionality of the translation relation beyond two, and it can implement independent rewriting by means of partial deletion of syntactic structures. Our proposal generalizes MTG by moving from component grammars that generate context-free languages to component grammars whose generative power is equivalent to Linear Context-free Rewriting Systems (LCFRS), a formalism for describing a class of MCSGs. The generalization is achieved by allowing context-free productions to rewrite tuples of strings, rather than single strings. Thus, we retain the intuitive top-down definition of synchronous derivation originating in SDTS and MTG but not found in LCFRS, while extending the generative power to linear context-free rewriting languages.

* This is a working paper. That means its contents will probably change. Feel free to cite, but please do not quote. Both the original ACL paper and this working paper are on the web at http://www.cs.nyu.edu/~melamed/pubs.html. Correspondence should be addressed to the first author by emailing {lastname}@cs.nyu.edu.


In this respect, GMTGs have also been inspired by the class of local unordered scattered context grammars (Rambow and Satta, 1999). A syntactically very different synchronous formalism involving LCFRS has also been presented by Bertsch and Nederhof (2001), extending the definition of Range Concatenation Grammars (Boullier, 2000).

In this paper we study several properties of GMTG. First, we investigate the generative capacity of these systems, proposing a way to compare the power of our synchronous formalism with that of an ordinary (monolingual) formalism. Next, we prove that in GMTG each component grammar retains its generative power, a requirement for synchronous formalisms that is technically called the weak language preservation property in (Rambow and Satta, 1996). Lastly, we propose a synchronous generalization of Chomsky Normal Form for GMTG, which lays the groundwork for a synchronous CKY parser.

2 An Informal Description

GMTG is a generalization of MTG, which is itself a generalization of CFG to the synchronous case. Here we present MTG in a new notation that shows the relation to CFG more clearly. For example, MTG productions for the multitext [(I fed the cat), (ya kota kormil)] are:[1]

[(S), (S)] → [(PN^1 VP^2), (PN^1 VP^2)]
[(PN), (PN)] → [(I), (ya)]
[(VP), (VP)] → [(V^1 NP^2), (NP^2 V^1)]
[(V), (V)] → [(fed), (kormil)]
[(NP), (NP)] → [(D^1 N^2), (N^2)]   (1)
[(D), ()] → [(the), ()]   (2)
[(N), (N)] → [(cat), (kota)]

Each production in this example has two components, the first modeling English and the second (transliterated) Russian. Nonterminals with the same index have to be rewritten together (synchronous rewriting). One strength of MTG, and thus also GMTG, is shown in production rules (1) and (2). There is a determiner in English, but not in Russian, so production (1) does not have the nonterminal D in the Russian component, and (2) applies only to the English component of the language (independent rewriting).
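To make the notation concrete, the following is a minimal sketch (ours, not the paper's) of how such productions might be encoded as plain Python data: each production is a pair of LHS/RHS component tuples, integer indices mark which nonterminals must be rewritten together, and an empty component () marks the side that is not rewritten.

# A hedged, illustrative encoding of some MTG productions above (names and
# representation are ours).  Each RHS symbol is a terminal string or a
# (nonterminal, index) pair; equal indices across components mark synchronous
# rewriting, and an empty component () marks independent (one-sided) rewriting.
productions = [
    # [(S), (S)] -> [(PN^1 VP^2), (PN^1 VP^2)]
    ((("S",), ("S",)), ((("PN", 1), ("VP", 2)), (("PN", 1), ("VP", 2)))),
    # [(NP), (NP)] -> [(D^1 N^2), (N^2)]      -- production (1)
    ((("NP",), ("NP",)), ((("D", 1), ("N", 2)), (("N", 2),))),
    # [(D), ()] -> [(the), ()]                -- production (2), English only
    ((("D",), ()), (("the",), ())),
]
for lhs, rhs in productions:
    active = [i + 1 for i, comp in enumerate(rhs) if comp != ()]
    print(lhs, "->", rhs, "| active components:", active)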

Similarly, independent derivations are needed to model paraphrasing. Take, for example, [(Tim got a pink slip), (Tim got laid off)]. While the two sentences have the same meaning, they are structured very differently. GMTG can express their relationships as follows:

[(S), (S)] → [(NP^1 VP^2), (NP^1 VP^2)]
[(VP), (VP)] → [(V^1 NP^2), (V^1 PP^2)]
[(NP), (PP)] → [(DET^1 NP^2), (VBN^3 RD^4)]
[(NP), (NP)] → [(Tim), (Tim)]
[(V), (V)] → [(got), (got)]
[(DET), ()] → [(a), ()]
[(NP), ()] → [(pink slip), ()]
[(), (VBN)] → [(), (laid)]
[(), (RD)] → [(), (off)]

However, as described by Melamed (2003), MTG does not allow discontinuities in a single production component, except after binarization. GMTG generalizes MTG to model discontinuities. Take, for example, the sentence pair [(The doctor treats his teeth), (El médico le examinó los dientes)] (Dras and Bleam, 2000).

[1] We write production components both side by side and one above another to save space, but each component is always in parentheses.


The clitic le should derive from the NP for teeth in Spanish, so the NP is discontinuous in the Spanish component. A simple GMTG fragment for the sentence is shown below:

[(S), (S)] → [(NP^1 VP^2), (NP^1 VP^2)]
[(VP), (VP)] → [(VP^1 NP^2), (NP^2 VP^1 NP^2)]
[(NP), (NP)] → [(The doctor), (El médico)]
[(VP), (VP)] → [(treats), (examinó)]
[(NP), (NP, NP)] → [(his teeth), (le, los dientes)]

Notice the discontinuity between le and los dientes, which allows them to be mapped to the same link. Such discontinuities are marked by commas within one component, on both the LHS and the RHS.

GMTG's modeling of discontinuous constituents allows it to deal with many complex syntactic phenomena. Becker et al. (1991) point out that STAG does not have the generative capacity to model certain kinds of scrambling in German. For example, they examine the English-German sentence fragment [(...that the detective has promised the client to indict the suspect of the crime), (...daß des Verbrechens der Detektiv den Verdächtigen dem Klienten zu überführen versprochen hat)]. The interesting thing here is that the nouns can appear to the left of the verbs in any order in German. Both the verb promised and the verb indict have two nouns as arguments, but alternating between the two verbs' arguments leads to a sentence fragment with 6 distinct segments. STAGs can only handle up to 5 distinct segments. The following is a GMTG fragment for the sentence:[2]

[(S), (S)] → [(N_det^1 has promised N_client^2 S^3), (S^3 N_det^1 S^3 N_verb^2 S^3 versprochen hat)]

[(S), (S, S, S)] → [(to indict N_suspect^1 N_crime^2), (N_Verbrech^2, N_Verdäch^1, zu überführen)]

The discontinuities allow the noun arguments of indict to be placed in any order with the noun arguments of promised.

Each tuple-vector consists of Σ_i q_i different addresses, each containing one string. Each address is an integer pair, (d, q), where d represents the component, and q marks the string in that component. A link which yields only ε is called an ε-link. An ITV consisting of D copies of the special start-symbol and/or ∆ is called the start-link.

A GMTG derivation for a particular multitext can be represented by a D-tuple of GMTG derivation trees. Each component of the multitext is represented in a separate derivation tree, where

• every leaf is labeled with a terminal or ε.

• every non-leaf node is labeled with an indexed nonterminal.

• the nonterminal at the root of each tree is the special start symbol S (unless $ is ∆ in that component, in which case the whole tree is represented as a ∆).

In a derivation tree tuple, two nonterminals are co-indexed if they are at the same depth in their respective trees and if they have the same index. This means that they are all part of the same link.

Example: Consider the example grammar G0 = {N0, T0, P0, S}, where P0 contains the following productions:

[2] These are only a subset of the GMTG productions. The subscripts on the nonterminals indicate which terminal each will eventually yield; the terminal productions have been left out to save space.


Figure 1: The pair of derivation trees representing a derivation in the sample grammar G0.

[(S), (S)] → [(B^1 D^2 B^1), (D^2 B^1)]
[(B, B), (B)] → [(b C^1, D^2), (C^1 D^2 b)]
[(C), (C)] → [(c), (c)]
[(D), (D)] → [(ε), (d)]

The derivation of the sentential form [(bc), (dcdb)] is seen in Figure 1.

The leftmost link in a sentential form is the link that contains the leftmost nonterminal in component d, where d is the first component with nonterminals in it, for 1 ≤ d ≤ D. In order to restrict the number of choices we have in deriving a sentential form, it is often useful to require that at each step in a derivation we replace the leftmost link by one of its RHSs. Such a derivation is called a leftmost derivation, and we indicate that a derivation is leftmost by using the relations ⇒_lm and ⇒*_lm.

Each αdq in the definition of GMTG is called a string. Each of these strings consists of a number of substrings, which are sequences of terminals and nonterminals.

3 Formal Definitions

We now provide a formal definition of GMTG and its associated derivation process. Let VN be a finite set of nonterminal symbols and let N be the set of positive natural numbers. We define I(VN) = {A^(t) | A ∈ VN, t ∈ N}.[3] Note that I(VN) is an infinite set. Each t in some A^(t) ∈ I(VN) will be called an index, and elements of I(VN) will be called indexed symbols. In what follows we also consider a finite set of terminal symbols VT, disjoint from VN, and work with strings in (I(VN) ∪ VT)*, that is, sequences of indexed nonterminals and non-indexed terminals. For γ ∈ (I(VN) ∪ VT)*, we define index(γ) = {t | γ = γ′ A^(t) γ′′, A^(t) ∈ I(VN)}; that is, index(γ) is the set of all indices of the indexed symbols occurring in γ.

[3] We use positive natural numbers to index grammar symbols, but this is just for notational convenience: any infinite set would work as well.

An indexed tuple vector, or ITV, is a vector of tuples of strings over I(VN) ∪ VT, having the form

γ = [(γ11, . . . , γ1q1), . . . , (γD1, . . . , γDqD)],

where D ≥ 1, qi ≥ 0 and γij ∈ (I(VN) ∪ VT)* for 1 ≤ i ≤ D, 1 ≤ j ≤ qi. We write γ[i], 1 ≤ i ≤ D, to denote the i-th tuple of γ, and ϕ(γ[i]) to denote the arity of that tuple, that is, qi. When ϕ(γ[i]) = 0, γ[i] is the empty tuple, written (). This should not be confused with (ε), the tuple of arity one containing the empty string. A link is an ITV with all γij consisting of a single indexed nonterminal and with all of these nonterminals co-indexed. The notion of a link generalizes the notion of nonterminal in context-free grammars, as each derivation step is applied to a single link.

Definition 1 Let D ≥ 1 be some integer constant. A generalized multitext grammar with D dimensions (D-GMTG for short) is a tuple G = (VN, VT, P, S) where VN, VT are finite, disjoint sets of nonterminal and terminal symbols, respectively, S ∈ VN is the start symbol and P is a finite set of productions. Each production has the form α → β, where α and β are D-dimensional ITVs over I(VN) ∪ VT, subject to the restrictions:

(i) ϕ(α[i]) = ϕ(β[i]), for 1 ≤ i ≤ D; and

(ii) α[i] = (Ai1^(1), . . . , Aiqi^(1)), Aij ∈ VN, for 1 ≤ i ≤ D, 1 ≤ j ≤ qi, with qi = 1 whenever Aiqi = S.
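As a small illustration (our own sketch, with an assumed data representation), the two restrictions of Definition 1 can be checked mechanically; here a production is a pair (α, β) of component tuples, where α[i] lists the LHS nonterminal names and β[i] is the tuple of RHS strings of component i.

# A minimal well-formedness check following restrictions (i) and (ii) of
# Definition 1 (representation assumed, not the paper's).
def well_formed(alpha, beta, start="S"):
    if len(alpha) != len(beta):
        return False                      # must have the same number of dimensions
    for lhs_tuple, rhs_tuple in zip(alpha, beta):
        if len(lhs_tuple) != len(rhs_tuple):
            return False                  # (i) equal arity in every component
        if start in lhs_tuple and len(lhs_tuple) != 1:
            return False                  # (ii) start symbol only with arity one
    return True

# The clitic production [(NP), (NP, NP)] -> [(his teeth), (le, los dientes)]:
alpha = (("NP",), ("NP", "NP"))
beta = ((("his", "teeth"),), (("le",), ("los", "dientes")))
print(well_formed(alpha, beta))           # True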

We omit the symbol D from D-GMTG whenever it is understood from the context or is not relevant. We sometimes write productions as

p = [(A11, . . . , A1q1), . . . , (AD1, . . . , ADqD)] → [(α11, . . . , α1q1), . . . , (αD1, . . . , αDqD)],

dropping index 1 from the ITV in the left-hand side. We also use the notation

p = [p1, . . . , pD]

with pi = (Ai1, . . . , Aiqi) → (αi1, . . . , αiqi), 1 ≤ i ≤ D, called a production component of p. The production component () → () is called the inactive production component.

All other production components are called active, and we set active(p) = {i | qi > 0}. Inactive production components are used to relax synchronous rewriting on some dimensions, that is, to implement rewriting on d < D components. When d = 1, the production implements independent rewriting.

In GMTG, the derive relation is defined over ITVs and works as the synchronous application of all the active production components in some p. The indexed nonterminals to be rewritten simultaneously must all have the same index t, and all occurrences of t in the ITV must be rewritten.

We now turn to the definition of the derive relation associated with a GMTG. We first introduce some additional notation. A reindexing is a one-to-one function on N. A reindexing f is extended to I(VN) ∪ VT by letting f(a) = a for a ∈ VT and f(A^(t)) = A^(f(t)) for A^(t) ∈ I(VN). We also extend f to strings in (I(VN) ∪ VT)* as usual, by letting f(ε) = ε and f(Xα) = f(X)f(α), for each X ∈ I(VN) ∪ VT and α ∈ (I(VN) ∪ VT)*. We say that strings α, α′ ∈ (I(VN) ∪ VT)* are independent if index(α) ∩ index(α′) = ∅.

Definition 2 Let G = (VN, VT, P, S) be a D-GMTG and let p ∈ P, with p = [p1, . . . , pD] and pi = (Ai1, . . . , Aiqi) → (αi1, . . . , αiqi) for 1 ≤ i ≤ D. Let γ and δ be two ITVs with γ[i] = (γi1, . . . , γiqi) and δ[i] = (δi1, . . . , δiqi) for 1 ≤ i ≤ D. Assume that α is some concatenation of all the αij and that γ is some concatenation of all the γij, 1 ≤ i ≤ D, 1 ≤ j ≤ qi, and let f be some reindexing such that the strings f(α) and γ are independent. The derive relation γ ⇒p_G δ holds whenever there exists an index t ∈ N such that the following two conditions are both satisfied:

(i) for each i ∉ active(p) we have t ∉ index(γi1 · · · γiqi) and γ[i] = δ[i];

(ii) for each i with 1 ≤ i ≤ D and i ∈ active(p) we have

γi1 · · · γiqi = γ′i0 Ai1^(t) γ′i1 Ai2^(t) · · · γ′i(qi−1) Aiqi^(t) γ′iqi,

with γ′ij ∈ (I(VN) ∪ VT)*, 0 ≤ j ≤ qi, and t ∉ index(γ′i0 γ′i1 · · · γ′iqi), and each δij is obtained from γij by replacing each Aij′^(t) with f(αij′).


Note that condition (ii) in Definition 2 imposes that, for each production component pi of p that is not the inactive production component, qi must exactly equal the number of occurrences of index t within (γi1, . . . , γiqi). Furthermore, condition (i) imposes that there are no occurrences of index t within (γi1, . . . , γiqi) in case pi is the inactive production component. If one of these conditions is violated for some i, then production p cannot be used to rewrite γ.

We often use the relation ⇒_G, defined as the union of all ⇒p_G for p ∈ P. We also use the reflexive and transitive closure of ⇒_G, written ⇒*_G. The language generated by a D-GMTG G is

L(G) = {[w1, w2, . . . , wD] | [(S^(1)), (S^(1)), . . . , (S^(1))] ⇒*_G [(w1), (w2), . . . , (wD)], wi ∈ VT*, 1 ≤ i ≤ D}.

Each such language will be called a multitext. We conclude this section with the definition of two important parameters for GMTG that will be used throughout this paper.

Definition 3 Let p = [p1, . . . , pD] ∈ P and pi = (Ai1, . . . , Aiqi) → (αi1, . . . , αiqi). The rank of p and of G are, respectively, ρ(p) = |index(α11 · · · α1q1 α21 · · · αDqD)| and ρ(G) = max_{p∈P} ρ(p). The fan-out of pi, of p, and of G are, respectively, ϕ(pi) = qi, ϕ(p) = Σ_{i=1}^{D} ϕ(pi), and ϕ(G) = max_{p∈P} ϕ(p).

For example, the rank of production (36) is two and its fan-out is three.

For integers r ≥ 0 and f ≥ 1, we write D-GMTG(r)_f to represent the class of all D-GMTGs with rank r and fan-out f.
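For illustration, the rank and fan-out of Definition 3 can be computed directly from a production's right-hand side; the sketch below (our own, with an assumed representation in which RHS symbols are terminals or (nonterminal, index) pairs) reproduces the values quoted above for production (36).

# rank(p)   = number of distinct indices on the RHS
# fanout(p) = total number of RHS strings over all components
def rank_and_fanout(beta):
    indices, fanout = set(), 0
    for component in beta:                     # one tuple of strings per dimension
        fanout += len(component)
        for string in component:
            for symbol in string:
                if isinstance(symbol, tuple):  # an indexed nonterminal
                    indices.add(symbol[1])
    return len(indices), fanout

# Production (36): [(V), (V, V)] -> [(V_went^1 P_home^2), (P_damoy^2, V_pashol^1)]
beta_36 = (((("V_went", 1), ("P_home", 2)),),
           ((("P_damoy", 2),), (("V_pashol", 1),)))
print(rank_and_fanout(beta_36))                # (2, 3)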

4 Generative capacity

In this section we compare the generative capacity of GMTG with that of mildly context-sensitive grammars. We focus on a class of rewriting systems, called linear context-free rewriting systems (LCFRSs; (Vijay-Shanker et al., 1987)).[4] LCFRSs are especially relevant in the field of computational linguistics, since these formalisms generate languages that are mildly context-sensitive (Joshi et al., 1990). Also, LCFRSs are generatively equivalent to several other formalisms that have been independently proposed in the formal language literature and are called finite-copying parallel rewriting systems; see Weir (1992), Rambow and Satta (1999) and references therein.

[4] To be more precise, we should use the term string-based LCFRS, following Weir (1992). The more general term LCFRS was originally used in (Vijay-Shanker et al., 1987) to denote a family that groups together a larger class of rewriting systems that operate on different types of objects, such as strings, tuples of strings, trees, graphs, and so on. The result of rewriting is then associated with terminal strings by "yield functions", in order to generate string languages. However, since no confusion can arise in this paper, we simply write LCFRS for string-based LCFRS.

To simplify our proofs, we use here a notational variant of LCFRSs that was introduced in (Rambow and Satta, 1999). Let VT be an alphabet of terminal symbols. LCFRSs are defined on the basis of functions mapping tuples of strings in VT* into tuples of strings in VT*. We say that a function g has rank r ≥ 0 if there exist integers fi ≥ 1, 1 ≤ i ≤ r, such that g is defined on (VT*)^f1 × (VT*)^f2 × · · · × (VT*)^fr. We also say that g has fan-out f ≥ 1 if the range of g is a subset of (VT*)^f. Let yh, xij, 1 ≤ h ≤ f, 1 ≤ i ≤ r and 1 ≤ j ≤ fi, be string-valued variables. A function g as above is said to be linear regular if it is defined by an equation of the form

g(〈x11, . . . , x1f1〉, . . . , 〈xr1, . . . , xrfr〉) = 〈y1, . . . , yf〉,   (3)

where 〈y1, . . . , yf〉 represents some grouping into f strings of all and only the variables appearing in the left-hand side of (3) (without repetitions), along with some additional terminal symbols (with possible repetitions).

Definition 4 A linear context-free rewriting system (LCFRS) is a quadruple G = (VN, VT, P, S) where VN and VT are finite, disjoint sets of nonterminal and terminal symbols, respectively, every symbol A ∈ VN is associated with an integer ϕ(A) ≥ 1, S is a symbol in VN such that ϕ(S) = 1, and P is a finite set of productions of the form A → g(B1, B2, . . . , Bρ(g)), where ρ(g) ≥ 0, A, Bi ∈ VN, 1 ≤ i ≤ ρ(g), and g is a linear regular function having rank ρ(g) and fan-out ϕ(A), defined on (VT*)^ϕ(B1) × · · · × (VT*)^ϕ(Bρ(g)).

For every A ∈ VN and τ ∈ (VT*)^ϕ(A), we write A ⇒G τ if one of the following conditions is met:

(i) A → g() ∈ P and g() = τ;

(ii) A → g(B1, . . . , Bρ(g)) ∈ P, Bi ⇒G τi ∈ (VT*)^ϕ(Bi) for every 1 ≤ i ≤ ρ(g), and g(τ1, . . . , τρ(g)) = τ.

The language generated by G is defined as L(G) = {w | S ⇒G (w), w ∈ VT*}.

Let G = (VN, VT, P, S) be an LCFRS and let p ∈ P be a production of the form A → g(B1, B2, . . . , Bρ(g)). The rank of p is ρ(p) = ρ(g) and the rank of G is ρ(G) = max_{p∈P} ρ(p). The fan-out of p is ϕ(p) = ϕ(A) and the fan-out of G is ϕ(G) = max_{p∈P} ϕ(p).
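As a concrete illustration of Definition 4 (our own toy example, not taken from the paper), the non-context-free language {a^n b^n c^n | n ≥ 0} is generated by an LCFRS with one nonterminal A of fan-out two; the linear regular functions are written out directly below and the derive relation A ⇒G τ is simulated by bottom-up evaluation.

# S -> f(A),  A -> g1(A),  A -> g0();  A derives pairs <a^n b^n, c^n>.
def g0():                      # fan-out 2, rank 0
    return ("", "")

def g1(pair):                  # fan-out 2, rank 1: wrap a..b around x1, append c to x2
    x1, x2 = pair
    return ("a" + x1 + "b", x2 + "c")

def f(pair):                   # fan-out 1, rank 1: concatenate the two strings
    x1, x2 = pair
    return (x1 + x2,)

def derive(n):
    tau = g0()
    for _ in range(n):
        tau = g1(tau)
    return f(tau)[0]

print([derive(n) for n in range(4)])   # ['', 'abc', 'aabbcc', 'aaabbbccc']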

We now show that one-dimensional GMTGs achieve the same generative capacity as LCFRSs. In the remainder of this section, we identify strings with string vectors having a single component.

Theorem 1 For any LCFRS G, there exists some 1-GMTG G′ with ρ(G) = ρ(G′) and ϕ(G) = ϕ(G′) such that L(G) = L(G′).

Proof. Let G = (VN, VT, P, S). We define G′ = (V′N, VT, P′, 〈S, 1〉), where

V′N = {〈A, i〉 | A ∈ VN, 1 ≤ i ≤ ϕ(A)}

and P′ is constructed as follows. Let p ∈ P be a production of the form A → g(B1, . . . , Bρ(p)), with g defined by an equation of the form

g(〈x11, . . . , x1ϕ(B1)〉, . . . , 〈xρ(p)1, . . . , xρ(p)ϕ(Bρ(p))〉) = 〈y1, . . . , yϕ(A)〉.

We define a homomorphism hp mapping the set {xij | 1 ≤ i ≤ ρ(p), 1 ≤ j ≤ ϕ(Bi)} ∪ VT into the set I(V′N) ∪ VT as follows:

hp(ξ) = 〈Bi, j〉^(i) if ξ = xij;   hp(ξ) = ξ if ξ ∈ VT.

Let also π be some permutation defined on the set {1, . . . , ϕ(A)}. We add to P′ the production p′ defined as

p′ : [(〈A, π(1)〉, . . . , 〈A, π(ϕ(A))〉)] → [(hp(yπ(1)), . . . , hp(yπ(ϕ(A))))].

Note that ρ(p) = ρ(p′) and ϕ(p) = ϕ(p′). We iterate the above construction for each p ∈ P and each π specified as above.

We claim that, for any A ∈ VN and wi ∈ VT*, 1 ≤ i ≤ ϕ(A), the following condition holds: A ⇒G (w1, . . . , wϕ(A)) if and only if

[(〈A, π(1)〉^(1), . . . , 〈A, π(ϕ(A))〉^(1))] ⇒*_G′ [(wπ(1), . . . , wπ(ϕ(A)))]

for every permutation π of {1, . . . , ϕ(A)}. The claim can be easily established by associating with the derivations above the corresponding underlying context-free trees and then proceeding by induction on the height of these trees (the two trees have the same height). The relation L(G) = L(G′) follows directly from the claim and from the fact that ϕ(S) = 1.

We now show that the generative capacity of GMTGs does not exceed that of LCFRSs. However, the comparison cannot be carried out directly, since LCFRSs generate string languages, while GMTGs generate multitexts. To circumvent this syntactic problem, we glue together all components of a string vector generated by a GMTG, and use a special character to denote boundaries between components. Let G = (VN, VT, P, S) be some D-GMTG and let # be a variable denoting some symbol not in VN ∪ VT. For a string vector [γ1, . . . , γq], γi ∈ (I(VN) ∪ VT)*, 1 ≤ i ≤ q and q ≥ 1, we write glue([γ1, . . . , γq]) = γ1 # . . . # γq. This is extended to languages of string vectors in the obvious way.
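The glue operation itself is trivial to implement; a short sketch (with '#' assumed as the boundary symbol, as in the text, and components represented as plain strings):

def glue(vector, sep="#"):
    # glue([gamma_1, ..., gamma_q]) = gamma_1 # ... # gamma_q
    return sep.join(vector)

def glue_language(language, sep="#"):
    # extended elementwise to a language of string vectors
    return {glue(v, sep) for v in language}

print(glue(["I fed the cat", "ya kota kormil"]))   # I fed the cat#ya kota kormil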


Theorem 2 For any D-GMTG G, there exists some LCFRS G′ with ρ(G) = ρ(G′) and ϕ(G) = ϕ(G′) such that glue(L(G)) = L(G′).

Proof. Let G = (VN, VT, P, S). Without any loss of generality, we assume that G contains only one production with S appearing in the left-hand side, called pS, and that this production has the form [(S), . . . , (S)] → [(A^(1)), . . . , (A^(1))] for some A ∈ VN. Let αG be the concatenation of all strings in (I(VN) ∪ VT)* appearing in the right-hand side of some production in P. We define G′ = (V′N, VT, P′, [S]), where

V′N = {[p, t] | p ∈ P, t ∈ index(αG)} ∪ {[S]}

and where P′ is constructed as specified below.

Each production in P′ is associated with two productions p, p′ in P and an index tp′ ∈ index(αG), satisfying the requirements below, and will thus be named 〈p, p′, tp′〉. Let p = [p1, . . . , pD] and p′ = [p′1, . . . , p′D], where for each i with 1 ≤ i ≤ D

pi : (Ai1, . . . , Aiqi) → (αi1, . . . , αiqi),
p′i : (Bi1, . . . , Biq′i) → (βi1, . . . , βiq′i).

Assume also that p can rewrite the right-hand side of p′, that is,

[(β11, . . . , β1q′1), . . . , (βD1, . . . , βDq′D)] ⇒p_G [(δ11, . . . , δ1q1), . . . , (δD1, . . . , δDqD)].   (4)

From derivation (4) and from Definition 2 we conclude that there exists at least one index tp′ ∈ index(αG) such that each tuple (βi1, . . . , βiq′i), with 1 ≤ i ≤ D and i ∈ active(p), contains exactly qi occurrences of tp′.

We now introduce some additional notation. Let αp be the concatenation of all strings in (I(VN) ∪ VT)* appearing in the right-hand side of p, that is,

αp = α11 · · · α1q1 α21 · · · αDqD.

Let index(αp) = {t1, . . . , tρ(p)} and, for each i with 1 ≤ i ≤ ρ(p), let ϕ(ti) be the number of occurrences of index ti appearing in string αp. We define an alphabet Xp of variables, denoting strings over VT, with Xp = {xij | 1 ≤ i ≤ ρ(p), 1 ≤ j ≤ ϕ(ti)}. For each i and j with 1 ≤ i ≤ D, i ∈ active(p) and 1 ≤ j ≤ qi, we define a string h(p, i, j) over Xp ∪ VT as follows. Let αij = Y1 Y2 · · · Yc, where Yk ∈ I(VN) ∪ VT for 1 ≤ k ≤ c, c ≥ 0. Then h(p, i, j) = Y′1 Y′2 · · · Y′c, where for each k, 1 ≤ k ≤ c, we have

• Y′k = Yk in case Yk ∈ VT; and

• Y′k = xt,m in case Yk ∈ I(VN), where t is the index of Yk and the indicated occurrence of Yk is the m-th occurrence of such a symbol appearing from left to right in string αp.

We can now specify the construction of P′. For every possible p, p′, and tp′ specified as above, we construct an LCFRS production 〈p, p′, tp′〉 to be added to P′, having the form

〈p, p′, tp′〉 : [p′, tp′] → g〈p,p′,tp′〉([p, t1], . . . , [p, tρ(p)]).

The function g〈p,p′,tp′〉 is defined as

g〈p,p′,tp′〉(〈x11, . . . , x1ϕ(t1)〉, . . . , 〈xρ(p)1, . . . , xρ(p)ϕ(tρ(p))〉) = 〈h(p, 1, 1), . . . , h(p, 1, q1), h(p, 2, 1), . . . , h(p, D, qD)〉,   (5)

where in each h(p, i, j) above we always have i ∈ active(p). Note that (5) defines a function with rank ρ(p) and fan-out Σ_{i=1}^{D} qi = ϕ(p). Thus we have ρ(〈p, p′, tp′〉) = ρ(p) and ϕ(〈p, p′, tp′〉) = ϕ(p).


Recall that in P we have a single production with S appearing in the left-hand side, having the form pS = [(S), . . . , (S)] → [(A^(1)), . . . , (A^(1))]. To complete the construction of P′, we add to this set a last production with the start symbol of G′ in its left-hand side, having the form

〈pS〉 : [S] → g〈pS〉([pS, 1]),

where

g〈pS〉(〈x11, x12, . . . , x1D〉) = 〈x11 # x12 # · · · # x1D〉.

Let p ∈ P with p = [p1, . . . , pD] and pi = (Ai1, . . . , Aiqi) → (αi1, . . . , αiqi) for 1 ≤ i ≤ D, and assume that p can rewrite the nonterminals in the left-hand side of a production p′ ∈ P that are indexed by tp′. Let also uij ∈ VT* for 1 ≤ i ≤ D and 1 ≤ j ≤ qi. We claim that

[(A11^(1), . . . , A1q1^(1)), . . . , (AD1^(1), . . . , ADqD^(1))] ⇒*_G [(u11, . . . , u1q1), . . . , (uD1, . . . , uDqD)]

if and only if [p′, tp′] ⇒_G′ 〈u11, . . . , u1q1, u21, . . . , uDqD〉. The claim can be easily established by associating with the derivations above the corresponding underlying context-free trees and then proceeding by induction on the height of these trees (the two trees have the same height). The statement of the theorem easily follows from this claim and the specification of productions pS ∈ P and 〈pS〉 ∈ P′.

Using the above theorems, we can now explore some important properties of GMTG and of the translation relations defined through the generated multitexts. Let G = (VN, VT, P, S) be some D-GMTG. For 1 ≤ i ≤ D, the i-th component grammar of G, written proj(G, i), is the 1-GMTG (VN, VT, Pi, S), where

Pi = {pi | [p1, . . . , pD] ∈ P, pi ≠ () → ()}.

In words, proj(G, i) is obtained from G by selecting the i-th production component from each production in P and discarding inactive production components. Similarly, we define the i-th projection language of multitext L(G) as

proj(L(G), i) = {wi | [w1, . . . , wD] ∈ L(G)}.

In words, proj(L(G), i) is the string language obtained from multitext L(G) by selecting the i-th string component from each string vector. Note that in general we have L(proj(G, i)) ≠ proj(L(G), i). This is because the component grammars proj(G, i) interact with each other in the rewriting process of G. To give a simple example, consider the 2-GMTG G = ({S, A}, {a, b}, P, S), where P consists of the following three productions:

[(S), (S)] → [(ε), (ε)]
[(S), (S)] → [(a A^1), (a S^1)]
[(A), (S)] → [(S^1), (S^1 b)]

The generated multitext is L(G) = {[a^n, a^n b^n] | n ≥ 0}, and thus proj(L(G), 2) = {a^n b^n | n ≥ 0}. On the other hand, proj(G, 2) is the 1-GMTG with productions [(S)] → [(a S)], [(S)] → [(S b)] and [(S)] → [(ε)], thus we have L(proj(G, 2)) = {a^n b^m | n, m ≥ 0}.
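A small sketch of the projection operation on this example (representation assumed, indices dropped for brevity): proj(G, i) keeps the i-th component of every production and discards inactive components, which is exactly why the co-indexing that ties the number of a's to the number of b's is lost.

INACTIVE = ((), ())

def proj(productions, i):
    # keep the i-th production component of every production, drop inactive ones
    kept = []
    for lhs, rhs in productions:
        component = (lhs[i], rhs[i])
        if component != INACTIVE:
            kept.append(component)
    return kept

# The 2-GMTG above, with L(G) = {[a^n, a^n b^n]}:
G = [
    ((("S",), ("S",)), (("",), ("",))),        # [(S),(S)] -> [(eps),(eps)]
    ((("S",), ("S",)), (("a A",), ("a S",))),  # [(S),(S)] -> [(aA),(aS)]
    ((("A",), ("S",)), (("S",), ("S b",))),    # [(A),(S)] -> [(S),(Sb)]
]
print(proj(G, 1))   # the three component-2 productions of proj(G, 2),
                    # which generate a^n b^m rather than a^n b^n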

Despite the interaction between different component grammars discussed above, GMTGs have an important property that is also one of the defining requirements of synchronous rewriting systems, as discussed for instance by Shieber (1994). Informally stated, the generative capacity of the class of all component grammars of a GMTG exactly corresponds to the class of all projection languages. In other words, the interaction among different grammar components in the rewriting process of GMTGs does not add extra generative power. We state this more formally in the next result.

Theorem 3 Let L(LCFRS) be the class of languages generated by all LCFRSs. Let also Lp(G) and Lp(L) be the classes of languages L(proj(G, d)) and proj(L(G), d), respectively, for every D ≥ 1, every D-GMTG G and every d with 1 ≤ d ≤ D. Then we have


(i) Lp(G) = L(LCFRS);

(ii) Lp(L) = L(LCFRS).

Proof. In both cases (i) and (ii), the ⊇ relation follows directly from Theorem 1. We prove the ⊆ relation separately for the two statements in the theorem.

(i) Since proj(G, d) is a 1-GMTG, we have glue(L(proj(G, d))) = L(proj(G, d)). Hence L(proj(G, d)) can be generated by some LCFRS, by Theorem 2.

(ii) Let G = (VN, VT, P, S) be some D-GMTG. We define an LCFRS G′d such that L(G′d) = proj(L(G), d). Let G′d = (VN, VT, P′, S), where P′ is constructed almost as in the proof of Theorem 2. The only difference is in the definition of the strings h(p, i, j) and of the production 〈pS〉. We report the new definitions below, omitting the rest of the proof; we use the same notation as in the proof of Theorem 2.

For each i and j with 1 ≤ i ≤ D, i ∈ active(p) and 1 ≤ j ≤ qi, we have h(p, i, j) = Y′1 Y′2 · · · Y′c, where for each k, 1 ≤ k ≤ c,

• Y′k = Yk in case Yk ∈ VT and i = d;

• Y′k = ε in case Yk ∈ VT and i ≠ d; and

• Y′k = xt,m in case Yk ∈ I(VN), where t is the index of Yk and the indicated occurrence of Yk is the m-th occurrence of such a symbol appearing from left to right in string αp.

Production 〈pS〉 has the form [S] → g〈pS〉([pS, 1]), where

g〈pS〉(〈x11, x12, . . . , x1D〉) = 〈x11 x12 · · · x1D〉.

5 Generalized Chomsky Normal Form

Certain kinds of text analysis require a grammar in a convenient normal form. The prototypical example for CFG is Chomsky Normal Form (CNF), which is required for CKY-style parsing. A D-GMTG is in Generalized Chomsky Normal Form (GCNF) if it has no useless links or useless terminals and if every production is in one of two forms:

(i) A nonterminal production has rank = 2 and no terminals or ε’s on the RHS.

(ii) A terminal production has exactly one component of the form A → a, where A ∈ VN and a ∈ VT. The other components are inactive.

The algorithm to convert a grammar G0 to a new grammar GGCNF in GCNF has several steps, detailed below.

(i) add new start-symbol

(ii) isolate terminals

(iii) binarize productions

(iv) remove ε’s

(v) eliminate unit productions

(vi) eliminate useless links and terminals

Parts of the GCNF conversion algorithm and some of the proofs presented here are generalizations of those presented in (Hopcroft et al., 2001) to the multidimensional case with discontinuities. The ordering of these steps is important, as some steps can restore conditions that others eliminate. Traditionally, the terminal isolation and binarization steps came last, but the alternative order reduces the number of productions that can be created during ε-elimination.
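The conversion can be pictured as a pipeline of grammar-to-grammar transformations applied in the order listed above. The skeleton below is only an illustration of that shape: the step functions are placeholders (identity transforms) standing in for the constructions of Sections 5.1-5.4 and the remaining two steps.

def add_start_symbol(g):        return g   # Section 5.1
def isolate_terminals(g):       return g   # Section 5.2
def binarize(g):                return g   # Section 5.3
def remove_epsilons(g):         return g   # Section 5.4
def remove_unit_productions(g): return g   # step (v)
def remove_useless(g):          return g   # step (vi)

def to_gcnf(grammar):
    # the ordering matters: later steps must not reintroduce earlier violations
    for step in (add_start_symbol, isolate_terminals, binarize,
                 remove_epsilons, remove_unit_productions, remove_useless):
        grammar = step(grammar)
    return grammar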


5.1 Creating a new Start-Symbol

The first step in converting a grammar G0 = {N0, T0, P0, S} to GCNF is to add a new start-symbol, S′, to N0, and to add k new productions to P0 of the form:

[(S′), . . . , (S′)] → [(S^1), . . . , (S^1)]   (6)

where the original grammar has k different versions of start-links. One new start-link is created for each version of the old start-link that appears in the original grammar, and the appropriate components are left inactive in each case. Without loss of generality, we write the LHS of one of these productions in shorthand as the link $′ and the RHS as the link $. This step guarantees that a start-symbol of the grammar does not appear on the RHS of any production.
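For the common case of a single start-link, step (i) amounts to prepending one production; a minimal sketch (our own representation, with indexed nonterminals as (name, index) pairs):

def add_start_symbol(productions, D, old_start="S", new_start="S'"):
    # add [(S'), ..., (S')] -> [(S^1), ..., (S^1)] in front of the grammar
    lhs = tuple((new_start,) for _ in range(D))
    rhs = tuple((((old_start, 1),),) for _ in range(D))
    return [(lhs, rhs)] + productions

print(add_start_symbol([], 2)[0])   # the 2-dimensional analogue of production (7)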

Theorem 4 If the grammar G1 is constructed from G0 by adding k new start-links using the above construction, then L(G1) = L(G0).

Proof: Let w be a sentential form of terminals. We want to show w ∈ L(G0) iff w ∈ L(G1).

(Only-if) Assume w ∈ L(G0). Then $ ⇒*_G0 w. We know every production in G0 is present in G1, so $ ⇒*_G1 w. We also know $′ →_G1 $. So $′ →_G1 $ ⇒*_G1 w. Thus w ∈ L(G1).

(If) Assume w ∈ L(G1). Then $′ ⇒*_G1 w. We know the derivation must look like this: $′ →_G1 $ ⇒*_G1 w, since the only production with $′ on the LHS is $′ →_G1 $. We also know the derivation $ ⇒*_G1 w contains no copies of $′, since $′ never appears on the RHS of a rule. Thus, all the productions used in the derivation are present in G0, and so $ ⇒*_G0 w. Thus w ∈ L(G0). □

Example: Consider the example grammar G0 = {N0, T0, P0, S}, where P0 contains the following productions:

[(S), (S)] → [(B^1 D^2 B^1), (D^2 B^1)]
[(B, B), (B)] → [(b C^1, D^2), (C^1 D^2 b)]
[(C), (C)] → [(c), (c)]
[(D), (D)] → [(ε), (d)]

We add the new start-symbol to N0 and a new start-production to P0 to get the new grammar, G1 = {N1, T1, P1, S′}, such that T1 = T0, where P1 contains the following productions:

[(S′), (S′)] → [(S^1), (S^1)]   (7)
[(S), (S)] → [(B^1 D^2 B^1), (D^2 B^1)]   (8)
[(B, B), (B)] → [(b C^1, D^2), (C^1 D^2 b)]   (9)
[(C), (C)] → [(c), (c)]   (10)
[(D), (D)] → [(ε), (d)]   (11)


5.2 Isolating Terminals

The RHS of a GMTG production can mix terminals and nonterminals, which is not allowed in GCNF. In addition, GMTG allows more than one terminal on the RHS. We would like to arrange it so that terminals always appear in productions without any other symbols on the RHS. Our goal is to identify all the terminals in the productions where they do not appear as the only symbol, and to replace each of those terminals with a nonterminal that eventually yields it. We do this by replacing the set of terminals on the RHS of each of these “mixed” productions with a single link, and then adding a new production p1 with that link as its LHS. On the RHS of p1, there is a new unique nonterminal for each terminal being replaced. Each nonterminal has a unique index. Lastly, for each nonterminal N in component d on the RHS of p1, there will be a new production with N on the LHS and the terminal it is associated with on the RHS in component d. All the other components are inactive. This process is repeated until all the terminals have been replaced with nonterminals.
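The following sketch shows a simplified variant of this step (ours, and deliberately coarser than the construction just described): instead of routing all the replaced terminals of a mixed production through one intermediate link, it replaces each non-isolated terminal occurrence directly with a fresh nonterminal and emits a terminal production that is active only in that component.

import itertools

def isolate_terminals(productions, D):
    fresh = itertools.count(1)
    out = []
    for lhs, rhs in productions:
        symbols = [s for comp in rhs for string in comp for s in string]
        if len(symbols) == 1 and isinstance(symbols[0], str):
            out.append((lhs, rhs))            # already an isolated terminal
            continue
        new_rhs = []
        for d, comp in enumerate(rhs):
            new_comp = []
            for string in comp:
                new_string = []
                for sym in string:
                    if isinstance(sym, str):  # a non-isolated terminal
                        name = "T%d" % next(fresh)
                        new_string.append((name, next(fresh)))  # unique index
                        # terminal production, active only in component d
                        t_lhs = tuple((name,) if i == d else () for i in range(D))
                        t_rhs = tuple((((sym,),) if i == d else ()) for i in range(D))
                        out.append((t_lhs, t_rhs))
                    else:
                        new_string.append(sym)
                new_comp.append(tuple(new_string))
            new_rhs.append(tuple(new_comp))
        out.append((lhs, tuple(new_rhs)))
    return out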

PROOF: The construction that converts G1 to a grammar with no non-isolated terminals, G2, does so in such a way that each production of G2 can be simulated by one or more productions of G1. Conversely, the introduced nonterminals of G2 each have only one production, so they can only be used in the manner intended. More formally, we prove that a sentential form of terminals w is in L(G1) iff w is in L(G2).

(Only-if) If w has a derivation in G1, it is easy to replace each production used by a sequence of productions of G2. That is, one step in the derivation in G1 becomes one or more steps in G2. If any item on the RHS of that production is a non-isolated terminal, we know G2 has a corresponding link with a nonterminal that yields it. We conclude that there is a derivation of w in G2, so w is in L(G2).

(If) Suppose w is in L(G2). Then there is a parse tree in G2 with $ at the root which yields w. We convert this tree into a parse tree of G1 that also has root $ and yields w.

The terminal separation step in the GCNF construction introduced other links that derive terminals. We can identify those in the current parse tree and replace them with the terminals they yield. Now every interior node of the parse tree forms a production of G1. Since w is the yield of a parse tree in G1, we conclude that w is in L(G1).

Example: Consider the grammar G1 = {N1, T1, P1, S′} above. We want to convert it to a new grammar, G2 = {N2, T2, P2, S′}. We are interested in altering any production rules that mix terminals and nonterminals, or that have more than one terminal on the RHS. Three productions have non-isolated terminals in G1: (9), (10) and (11). First we examine (9). We create a new link [(B1), (B1)] and replace the terminals with it. We create two new nonterminals, B2 and B3, one for each of the terminals that must be replaced, and add them to N2. We then add the following productions to P2:

[(B1), (B1)] → [(B2^1), (B3^2)]   (12)

[(B2), ∆] → [(b), ∆]   (13)

[∆, (B3)] → [∆, (b)]   (14)

We alter production (9) by replacing the terminals with the new link that yields them:

[(B, B), (B)] → [(B1^3 C^1, D^2), (C^1 D^2 B1^3)]   (15)

We do the same thing to production (10). We create three new productions:


[(C1), (C1)] → [(C2^1), (C3^2)]   (16)

[(C2), ∆] → [(c), ∆]   (17)

[∆, (C3)] → [∆, (c)]   (18)

and alter (10) to look like:

[(C), (C)] → [(C1^1), (C1^1)]   (19)

Lastly we do the same to (11). So our new set of productions P2 looks like the following:

[(S′), (S′)] → [(S^1), (S^1)]   (20)
[(S), (S)] → [(B^1 D^2 B^1), (D^2 B^1)]   (21)
[(B, B), (B)] → [(B1^3 C^1, D^2), (C^1 D^2 B1^3)]   (22)
[(B1), (B1)] → [(B2^1), (B3^2)]   (23)
[(B2), ()] → [(b), ()]   (24)
[(), (B3)] → [(), (b)]   (25)
[(C), (C)] → [(C1^1), (C1^1)]   (26)
[(C1), (C1)] → [(C2^1), (C3^2)]   (27)
[(C2), ()] → [(c), ()]   (28)
[(), (C3)] → [(), (c)]   (29)
[(D), (D)] → [(D1^1), (D1^1)]   (30)
[(D1), (D1)] → [(D2^1), (D3^2)]   (31)
[(D2), ()] → [(ε), ()]   (32)
[(), (D3)] → [(), (d)]   (33)

5.3 Binarization

The third step of converting to GCNF is binarization of the productions, making the rank of the grammar two. For r ≥ 0 and f ≥ 1, we write D-GMTG(r)_f to represent the class of all D-GMTGs with rank r and fan-out f.


Figure 2: (a) A production that requires an increased fan-out to binarize: [(S), (S)] → [(N_Pat^1 V_went^2 P_home^3 A_early^4), (P_damoy^3 N_Pat^1 A_rano^4 V_pashol^2)]. (b) A graphical representation of the production.

A CFG can always be binarized into another CFG: two adjacent nonterminals are replaced with a single nonterminal that yields them. In contrast, it might be impossible to binarize a D-GMTG(r)_f into an equivalent D-GMTG(2)_f. For every fan-out f ≥ 2 and rank r ≥ 4, there are some index orderings that can be generated by D-GMTG(r)_f but not by D-GMTG(r−1)_f. The distinguishing characteristic of such index orderings is apparent in Figure 2, which shows a production in a grammar with fan-out two, and a graph that illustrates which nonterminals are co-indexed. No two nonterminals are adjacent in more than one string, so replacing any two nonterminals with a single nonterminal causes a discontinuity. Increasing the fan-out of the grammar allows a single nonterminal to yield two nonadjacent nonterminals in a single string. ITG (Wu, 1997) cannot express such structures in a binarized form, since it does not allow discontinuous constituents.

To binarize, we nondeterministically split each nonterminal production p0 of rank r > 2 into two nonterminal productions p1 and p2 of rank < r, but possibly of higher fan-out. Since this algorithm replaces p0 with two productions that have rank < r, recursively applying the algorithm to productions of rank greater than two will reduce the rank of the grammar to two. We binarize by the following algorithm:

(i) Nondeterministically choose n links to be removed from p0 and replaced with a single link to make p1, where 2 ≤ n ≤ r − 1. We call these links the m-links.

(ii) Create a new ITV Γ. Two nonterminals are neighbors if they are adjacent in the same string in a production. For each set of m-link neighbors in component d in p0, place that set of neighbors into the d-th component of Γ in the order in which they appeared in p0, so that each set of neighbors becomes a different string, for 1 ≤ d ≤ D.

(iii) Create a new unique nonterminal, say B, and replace each set of neighbors in production p0 with B, to create p1. The production p2 is [B, . . . , B] → Γ.

For example, binarization of the production in Figure 2 requires that we increase the fan-out of the language to three.

The binarized productions are as follows:

[(S), (S)] → [(N_Pat^1 VP^2), (VP^2 N_Pat^1 VP^2)]   (34)

[(VP), (VP, VP)] → [(V^1 A_early^2), (V^1, A_rano^2 V^1)]   (35)

[(V), (V, V)] → [(V_went^1 P_home^2), (P_damoy^2, V_pashol^1)]   (36)

Increasing the fan-out can be necessary even for binarizing a 1-GMTG production, as in:

[(B, B)] → [(N^1 V^2 P^3 A^4, P^3 N^1 A^4 V^2)]   (37)
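A sketch of a single application of steps (i)-(iii) above, run on production (22) with the links indexed 1 and 2 as m-links; it reproduces the right-hand sides of productions (38) and (39). The representation and the helper name binarize_step are ours.

def binarize_step(rhs, m_links, new_name="H", new_index=1):
    # Each maximal run of adjacent m-link nonterminals ("neighbors") becomes one
    # string of the new ITV Gamma and is replaced by one occurrence of new_name.
    gamma, new_rhs = [], []
    for component in rhs:                    # component = tuple of strings
        gamma_strings, new_component = [], []
        for string in component:             # string = tuple of symbols
            new_string, run = [], []
            for sym in string + (None,):     # sentinel flushes a trailing run
                in_m = isinstance(sym, tuple) and sym[1] in m_links
                if in_m:
                    run.append(sym)
                    continue
                if run:                       # close one set of neighbors
                    gamma_strings.append(tuple(run))
                    new_string.append((new_name, new_index))
                    run = []
                if sym is not None:
                    new_string.append(sym)
            new_component.append(tuple(new_string))
        gamma.append(tuple(gamma_strings))
        new_rhs.append(tuple(new_component))
    return tuple(new_rhs), tuple(gamma)       # p1's RHS and Gamma (p2's RHS)

# Production (22): [(B,B),(B)] -> [(B1^3 C^1, D^2), (C^1 D^2 B1^3)]
rhs_22 = (((("B1", 3), ("C", 1)), (("D", 2),)),
          ((("C", 1), ("D", 2), ("B1", 3)),))
p1_rhs, gamma = binarize_step(rhs_22, {1, 2})
print(p1_rhs)   # [(B1^3 H^1, H^1), (H^1 B1^3)]   -- cf. production (38)
print(gamma)    # [(C^1, D^2), (C^1 D^2)]         -- cf. production (39)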

PROOF: The construction that converts G2 to a grammar of rank ≤ 2, G3, does so in such a way that each production of G3 can be simulated by one or more productions of G2. Conversely, the introduced nonterminals of G3 each have only one production, so they can only be used in the manner intended. More formally, we prove that a sentential form of terminals w is in L(G2) iff w is in L(G3).


(Only-if) If w has a derivation in G2, it is easy to replace each production used by a sequence of productions of G3. That is, one step in the derivation in G2 becomes one or more steps in G3. If the RHS has more than two links in a single component, we know there is instead a sequence of links that yield them, as that is how G3 was constructed. We conclude that there is a derivation of w in G3, so w is in L(G3).

(If) Suppose w is in L(G3). Then there is a parse tree in G3 with $ at the root which yields w. We convert this tree into a parse tree of G2 that also has root $ and yields w.

We “undo” the binarization step of the GCNF construction. That is, suppose there is a node labeled with the link Γ, with two children, the links Γ and Λ, where Λ is one of the links introduced in the binarization. Then this portion of the parse tree has under it all the links that Λ replaced. We can alter the derivation tree by replacing Λ with all those links, and leaving what the links yield untouched. Now every interior node of the parse tree forms a production of G2. Since w is the yield of a parse tree in G2, we conclude that w is in L(G2).

Example: We run this algorithm on G2. G2 has one production with rank > 2. We examine (22) and apply our algorithm to it.

p0 = [(B, B), (B)] → [(B1^3 {C^1, D^2}), ({C^1 D^2} B1^3)]

(i) We choose the links with indices 1 and 2 to be our m-links, and mark the neighbors with braces.

(ii) Set Γ = [(C^1, D^2), (C^1 D^2)].

(iii) Create the new nonterminal, H, and set p1 to:

[(B, B), (B)] → [(B1^3 H^1, H^1), (H^1 B1^3)]   (38)

Set p2 to:

[(H, H), (H)] → [(C^1, D^2), (C^1 D^2)]   (39)

Our production p0 now has a rank of two and the remaining production set is as follows:


[(S′), (S′)] → [(S^1), (S^1)]   (40)
[(S), (S)] → [(B^1 D^2 B^1), (D^2 B^1)]   (41)
[(B, B), (B)] → [(B1^1 H^2, H^2), (H^2 B1^1)]   (42)
[(H, H), (H)] → [(C^1, D^2), (C^1 D^2)]   (43)
[(B1), (B1)] → [(B2^1), (B3^2)]   (44)
[(B2), ()] → [(b), ()]   (45)
[(), (B3)] → [(), (b)]   (46)
[(C), (C)] → [(C1^1), (C1^1)]   (47)
[(C1), (C1)] → [(C2^1), (C3^2)]   (48)
[(C2), ()] → [(c), ()]   (49)
[(), (C3)] → [(), (c)]   (50)
[(D), (D)] → [(D1^1), (D1^1)]   (51)
[(D1), (D1)] → [(D2^1), (D3^2)]   (52)
[(D2), ()] → [(ε), ()]   (53)
[(), (D3)] → [(), (d)]   (54)

5.4 Eliminating ε’s

Each tuple-vector consists of Σ_i q_i different addresses, each containing one string. Each address is an integer pair, (d, q), where d represents the component, and q marks the string in that component. An ε-component is a production component in which at least one string on the RHS is ε. An ε-production is any production that has an ε-component not of the form S → ε, where S is the special start symbol. Grammars in GCNF cannot have any ε-productions. We need to show that ε-productions are not essential in any GMTG, i.e., given a GMTG G3 with ε-productions, we can convert it to a weakly equivalent grammar G4 without ε-productions.

We denote any sentential form with only ε in every component as ε. Our strategy is to start by discovering which links, and within them which strings at which addresses, are nullable. Once these links are found, we will eliminate the productions with nullable links on their RHS, and replace them with one or more productions that yield the same terminal ITV. A link Γ = [(A1, . . . , A1), . . . , (AD, . . . , AD)] is nullable if Γ ⇒* α, where α = [(α11, . . . , α1q1), . . . , (αD1, . . . , αDqD)] is an ITV in which at least one string, αdq, is ε. We call the string αdq a nullable string, and say that the link Γ is nullable at address (d, q).


Let G3 = {N3, T3, P3, S} be a GMTG. We can find all the nullable links and the addresses of their nullable strings in G3 by the following iterative algorithm.

BASIS: If [A1, . . . , AD] → α is a production in P3, where α = [(α11, . . . , α1q1), . . . , (αD1, . . . , αDqD)] and αdq = ε for some d and q, where 1 ≤ d ≤ D and 1 ≤ q ≤ qd, then by definition the link on the LHS of the production, Γ = [(A1, . . . , A1), . . . , (AD, . . . , AD)], is nullable and αdq is a nullable string in Γ at address (d, q).

INDUCTION: If [A1, . . . , AD] → α is a production in G3, where α = [(α11, . . . , α1q1), . . . , (αD1, . . . , αDqD)], and for some d and q every element of αdq is a nullable string of some link on the RHS of the production, such that all of these elements can yield ε at once, then the link on the LHS of the production, Γ = [(A1, . . . , A1), . . . , (AD, . . . , AD)], is nullable and αdq is a nullable string in Γ at address (d, q). □
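A coarse sketch of this iteration as a fixed-point computation (ours): it tracks, per LHS nonterminal name, whether some production-component string headed by that name can derive ε, which is a simplification of the paper's finer per-link, per-address relation and of the "all at once" condition in the induction step.

def nullable_names(productions):
    # terminals are str, indexed nonterminals are (name, index) pairs;
    # a string is nullable if it is empty (BASIS) or consists entirely of
    # nonterminals whose names are already known to be nullable (INDUCTION)
    nullable, changed = set(), True
    while changed:
        changed = False
        for lhs, rhs in productions:
            for lhs_names, component in zip(lhs, rhs):
                for string in component:
                    if all(isinstance(s, tuple) and s[0] in nullable for s in string):
                        for name in lhs_names:
                            if name not in nullable:
                                nullable.add(name)
                                changed = True
    return nullable

G3_fragment = [
    ((("D2",), ()), (((),), ())),                              # (53)
    ((("D1",), ("D1",)), (((("D2", 1),),), ((("D3", 2),),))),  # (52)
    ((("D",), ("D",)), (((("D1", 1),),), ((("D1", 1),),))),    # (51)
]
print(sorted(nullable_names(G3_fragment)))   # ['D', 'D1', 'D2']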

We have shown how to recognize all nullable links and their associated nullable strings and addresses. Now we give the construction of a grammar without ε-productions. Let G3 = {N3, T3, P3, S} be a GMTG. First, we determine all nullable links and associated strings in G3. We construct a new grammar G4 = {N4, T4, P4, S′} whose set of production rules P4 is determined as follows.

For each nullable link, the construction creates 2^n versions of the link, where n is the number of nullable strings of that link. There is one version for each of the possible combinations of the nullable strings being present or absent. The version of the link with all strings present is its original version. Each non-original version of the link gets a unique subscript (except for start-links), which is applied to all the nonterminals in the link, so that each link is unique in the grammar. For each production, we identify the nullable links on the RHS and replace them with each combination of the non-original versions found earlier. If a string is left empty during this process, that string is removed from the RHS and the fan-out of the production component is reduced by one. The link on the LHS is replaced with its appropriate matching non-original link. There are two exceptions to the replacements. One is that if a production consists of all nullable strings, we do not include this case. The other is that if the production is a start-production and we remove all strings on the RHS of a component, then that RHS is replaced with ε. Lastly, we remove all strings on the RHS of productions that have ε's, and reduce the fan-out of the productions accordingly. Once again, we replace the LHS link with the appropriate version.

Theorem 5 If the grammar G4 is constructed from G3 by the above method for eliminating ε-productions, then L(G4) = L(G3) − ε.

PROOF: We must show that given a sentential form w of terminals, w ∈ L(G4) iff w ∈ L(G3). We shall prove a generalization of this:

Given any link Γ3 in G3, and a version of it Γ4 in G4, Γ4 ⇒*_G4 w iff Γ3 ⇒*_G3 w and w ≠ ε, where w is an ITV containing only terminals. This is a generalization because the corresponding link to $ in G3 is $′ in G4, so we end up proving that every sentential form generated by one grammar can be generated by the other.

(ONLY-IF) Suppose Γ4 ⇒*_G4 w. Then surely w ≠ ε, because G4 has no ε-productions. We must show by induction on the length of the derivation that Γ3 ⇒*_G3 w, where Γ3 is the corresponding link to Γ4.

BASIS: One step. Then there is a production Γ4 →_G4 w in G4. The construction of G4 tells us that there is some corresponding production Γ3 →_G3 β of G3, such that β is w with zero or more ε-links interspersed, and Γ3 is the corresponding link to Γ4. Applying production rules to those ε-links removes them, and we are left with w. So in G3, Γ3 →_G3 β ⇒*_G3 w, where the steps after the first, if any, derive ε from whatever links are remaining on the RHS.

INDUCTION: Suppose that the derivation takes n > 1 steps. Then the derivation looks like Γ4 →_G4 α ⇒*_G4 w. The first production must have come from a production in G3, Γ3 →_G3 β, where β contains all the terminals and corresponding links of α in order, with zero or more additional ε-links interspersed.

We know that every link in α derived w in n − 1 or fewer productions. So, by our induction, the corresponding links in β yield w as well. Note that the only links present in β that do not have a corresponding link in α yield ε in all their components. So, we can remove them by deriving ε from them.


Link                 Address   Versions
[(D2), ()]           (1,1)     NA
[(D1), (D1)]         (1,1)     [(), (D4)]
[(D), (D)]           (1,1)     [(), (D5)]
[(H, H), (H)]        (1,2)     [(H6), (H6)]
[(B, B), (B)]        (1,2)     [(B7), (B7)]

Table 1: Nullable links

What we are left with is a γ which contains only terminals and links that have correspondents in α, and which can derive w in n − 1 steps. Thus γ ⇒*_G3 w. Since Γ3 →_G3 β ⇒*_G3 γ ⇒*_G3 w, we know Γ3 ⇒*_G3 w.

(IF) Suppose Γ3 ⇒*G3 w and w is not ε. We show by induction on the length n of the derivation that Γ4 ⇒*G4 w.

BASIS: One step. Then Γ3 →G3 w is a production of G3. There is a corresponding production in G4 of the form Γ4 →G4 v. If any component of w is empty, then the construction guarantees that the corresponding component of Γ4 →G4 v will be the dummy component, and so it will yield an empty component in v wherever the ε was in w. All the nonempty components of w will be present in v, since they consist only of terminals and so are not affected by the construction. Thus w = v, and so Γ4 →G4 w.

INDUCTION: Suppose the derivation takes n > 1 steps. Then the derivation looks like Γ3 →G3 β ⇒*G3 w. There must be a corresponding production in G4 of the form Γ4 →G4 α, where the link corresponding to every link in β appears in α in the proper order. Since the derivation β ⇒*G3 w takes fewer than n steps, we know by our inductive hypothesis that the corresponding links in α yield w as well. Since Γ4 →G4 α ⇒*G4 w, we know Γ4 ⇒*G4 w.

Now we complete the proof as follows. We know that w is in L(G4) iff $′ ⇒*G4 w. Letting $ be Γ3 and $′ be Γ4 above, we know that w is in L(G4) iff $ ⇒*G3 w and w ≠ ε. That is, w is in L(G4) iff w is in L(G3) and w ≠ ε. □

Example: Let us find all the nullable links in the example grammar G3 = {N3, T3, P3, $′}. We display each nullable link in Table 1, along with the address at which it is nullable and all possible versions of the link.

Now let us construct the productions of grammar G4 = {N4, T4, P4, S′}. First, we consider production rule (41). This production contains two of our nullable links on the RHS, each containing one nullable string. So, we have four combinations of these two nullable strings being present or absent, and we are left with the following four production rules:

[(S)(S)] → [(B1D2B1)(D2B1)]    (55)

[(S)(S)] → [(B17D2)(D2B17)]    (56)

[(S)(S)] → [(B1B1)(D25B1)]    (57)

[(S)(S)] → [(B17)(D25B17)]    (58)

Production rule (43) also contains a nullable link, likewise with one nullable string. So there are two versions of the rule, one with the nullable string present and one with it absent:



[(H, H)(H)] → [(C1, D2)(C1D2)] | [(C1)(C1D25)]    (59)

The same procedure is applied to rule (42). Production (53) has a component containing only an ε, so that component is replaced with the inactive component; this removes the production completely from the grammar. Productions (51) and (52) are also altered. Thus, the following productions constitute P4:



[(S′)(S′)] → [(S1)(S1)]    (60)

[(S)(S)] → [(B1D2B1)(D2B1)]    (61)

[(S)(S)] → [(B17D2)(D2B17)]    (62)

[(S)(S)] → [(B1B1)(D25B1)]    (63)

[(S)(S)] → [(B17)(D25B17)]    (64)

[(B, B)(B)] → [(B21H1, H1)(H1B21)]    (65)

[(B7)(B7)] → [(B21H16)(H16B21)]    (66)

[(H, H)(H)] → [(C1, D2)(C1D2)]    (67)

[(H6)(H6)] → [(C1)(C1D25)]    (68)

[(B1)(B1)] → [(B12)(B23)]    (69)

[(B2)()] → [(b)()]    (70)

[()(B3)] → [()(b)]    (71)

[(C)(C)] → [(C12)(C23)]    (72)

[(C2)()] → [(c)()]    (73)

[()(C3)] → [()(c)]    (74)

[(D)(D)] → [(D11)(D11)]    (75)

[()(D5)] → [()(D4)]    (76)

[(D1)(D1)] → [(D12)(D13)]    (77)

[()(D4)] → [()(D13)]    (78)

[()(D3)] → [()(d)]    (79)



5.5 Eliminating Unit Productions

A unit production is a production of the form [c1, . . . , cD] where each cd is either of the form Ad → (Bd1, . . . , Bdqd) or () → (), where Ad ∈ VN and Bdj ∈ I(VN) for 1 ≤ d ≤ D and 1 ≤ j ≤ qd. In addition, the rank of the production must be 1, which means that the RHS is composed of a single link. We write such a unit production in shorthand as Γ → Λ, where both Γ and Λ are links: Γ is the LHS of the unit production and Λ constitutes the RHS.
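As a quick illustration of the rank-1 condition, here is a hedged sketch under a toy encoding of our own (none of it is notation from the paper): an RHS is a tuple of components, each component a tuple of strings, each string a tuple of symbols; terminals are bare Python strings and indexed nonterminals are (name, index) pairs.

    def is_unit_production(rhs):
        # A unit production has rank 1: its RHS mentions exactly one link
        # and contains no terminals.
        indices = set()
        for component in rhs:
            for string in component:
                for symbol in string:
                    if isinstance(symbol, str):   # a terminal on the RHS
                        return False
                    indices.add(symbol[1])        # the link index of the nonterminal
        return len(indices) == 1

    # Production (60), [(S')(S')] -> [(S1)(S1)], is a unit production in this
    # encoding: both components consist of the single link with index 1.
    s1 = ("S", 1)
    print(is_unit_production((((s1,),), ((s1,),))))  # True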

Given a grammar G4 = {N4, T4, P4, S}, we find all unit pairs (Γ, Λ) such that Γ ⇒* Λ using a series of unit productions only. Once we have determined all such unit pairs, we replace any sequence of derivation steps Γ ⇒ Λ1 ⇒ Λ2 ⇒ . . . ⇒ Λ ⇒ Ψ by the single step that uses the non-unit production Γ → Ψ, where Ψ is a non-unit tuple-vector. Lastly, we remove all unit productions from the grammar. The resulting production set is P5 in G5 = {N5, T5, P5, S}.

Here is the inductive construction of all unit pairs (Γ, Λ) such that Γ ⇒* Λ, using only unit productions.

BASIS: (Γ, Γ) is a unit pair for any link Γ. That is, Γ ⇒* Γ in zero steps.

INDUCTION: Suppose we have determined that (Γ, Λ) is a unit pair and that Λ → Ψ is a unit production, where Ψ is a link. Then (Γ, Ψ) is a unit pair. □

Once we have all the unit pairs, we simply take the first member of each pair as the head, and all the non-unit production bodies of the second member of the pair as the production bodies.
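The unit-pair computation and the subsequent replacement can be sketched as two fixed-point loops. This is a minimal illustration under assumptions of our own: links are hashable values, `productions` maps each LHS link to a list of RHS bodies in the toy encoding used above, and all helper names are ours.

    def is_unit(rhs):
        # True when the RHS mentions exactly one link and no terminals
        # (terminals are bare strings, indexed nonterminals are (name, index) pairs).
        symbols = [sym for component in rhs for string in component for sym in string]
        return (len(symbols) > 0
                and all(not isinstance(sym, str) for sym in symbols)
                and len({sym[1] for sym in symbols}) == 1)

    def link_of(rhs):
        # The link a unit RHS consists of: one tuple of nonterminal names per
        # component (empty components stay empty).
        return tuple(tuple(sym[0] for string in component for sym in string)
                     for component in rhs)

    def unit_pairs(productions):
        # All pairs (Gamma, Lambda) with Gamma =>* Lambda via unit productions only:
        # start from the basis (Gamma, Gamma) and close under the induction step.
        pairs = {(lhs, lhs) for lhs in productions}
        changed = True
        while changed:
            changed = False
            for gamma, lam in list(pairs):
                for rhs in productions.get(lam, []):
                    if is_unit(rhs) and (gamma, link_of(rhs)) not in pairs:
                        pairs.add((gamma, link_of(rhs)))
                        changed = True
        return pairs

    def eliminate_unit_productions(productions):
        # Build P5: every unit pair (Gamma, Lambda) contributes Lambda's
        # non-unit bodies to Gamma; the unit productions themselves are dropped.
        p5 = {lhs: [] for lhs in productions}
        for gamma, lam in unit_pairs(productions):
            for rhs in productions.get(lam, []):
                if not is_unit(rhs):
                    p5[gamma].append(rhs)
        return p5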

Theorem 6 If the grammar G5 is constructed from G4 by the above construction for eliminating unit productions, then L(G5) = L(G4).

PROOF: We shall show that w is in L(G4) iff it is in L(G5).

(IF) Suppose w is in L(G5). Then $ ⇒*G5 w. Since every production of G5 is equivalent to a sequence of zero or more unit productions of G4 followed by a non-unit production of G4, we know that α ⇒G5 β implies α ⇒*G4 β. That is, every step of a derivation in G5 can be replaced by one or more derivation steps in G4. If we put these sequences of steps together, we conclude that $ ⇒*G4 w, and so w is in L(G4).

(ONLY-IF) Suppose now that w is in L(G4). Then we know that w has a leftmost derivation, i.e., $ ⇒*lm w. Whenever a unit production is used in a leftmost derivation, the link that constitutes the RHS becomes the leftmost link, and so it is replaced immediately. Thus, the leftmost derivation in grammar G4 can be broken down into a sequence of steps in which zero or more unit productions are followed by a non-unit production. Note that any non-unit production that is not preceded by a unit production is a "step" by itself. Each of these steps can be performed by one production of G5, because the construction of G5 created exactly the productions that reflect zero or more unit productions followed by a non-unit production. Thus $ ⇒*G5 w, and so w is in L(G5). □

Example: We apply this process to our grammar G4 = {N4, T4, P4, S′} to get a new grammar G5 = {N5, T5, P5, S′}, where T5 = T4 and N5 = N4. Productions (60), (75), (76), (77), and (78) are the unit productions in G4. Following the algorithm leads to many new productions, as well as the removal of the five unit productions. Our new set of productions, P5, is as follows:



[(S′)(S′)] → [(B1D2B1)(D2B1)]    (80)

[(S′)(S′)] → [(B17D2)(D2B17)]    (81)

[(S′)(S′)] → [(B1B1)(D25B1)]    (82)

[(S′)(S′)] → [(B17)(D25B17)]    (83)

[(B, B)(B)] → [(B21H1, H1)(H1B21)]    (84)

[(B7)(B7)] → [(B21H16)(H16B21)]    (85)

[(H, H)(H)] → [(C1, D2)(C1D2)]    (86)

[(H6)(H6)] → [(C1)(C1D25)]    (87)

[(B1)(B1)] → [(B12)(B23)]    (88)

[(B2)()] → [(b)()]    (89)

[()(B3)] → [()(b)]    (90)

[(C)(C)] → [(C12)(C23)]    (91)

[(C2)()] → [(c)()]    (92)

[()(C3)] → [()(c)]    (93)

[()(D5)] → [()(d)]    (94)

[()(D4)] → [()(d)]    (95)

[()(D3)] → [()(d)]    (96)

5.6 Eliminating Useless Productions and Symbols

Let X be either a link or a terminal. We say X is useful for a GMTG G if there is some derivation of the form

$ ⇒* α ⇒* w    (97)

where X is contained in the sentential form α, and where w is a sentential form containing only terminals and ()'s. If X is not useful, we say it is useless. Evidently, omitting useless links and terminals from our grammar will not change the language generated, so we eliminate them.

For X to be useful, it must have both of the following properties:



(i) X is generating if it is a terminal or if it is a link such that X ⇒* w for some ITV w which contains only terminals in its nonempty components.

(ii) X is reachable if there is a derivation

$ ⇒* α    (98)

such that X is contained in the tuple-vector α.

An X that is useful will be both generating and reachable. If we eliminate all non-generating links, and then eliminate all non-reachable links and terminals, we shall have only useful links and terminals left.

To compute the generating links and terminals of G, we perform the following induction.

BASIS: Every terminal is generating by definition.

INDUCTION: Suppose there is a production Γ → α, where every symbol of α is known to be either a terminal or part of a generating link. Then the link Γ is generating. □

To compute the reachable links and terminals of G, we perform the following induction.

BASIS: $, the special start-link, is reachable by definition.

INDUCTION: Suppose we have discovered that some link Γ is reachable. Then, for every production with Γ as its LHS, all the links and terminals on the RHS are reachable. □

Once the generating and reachable symbols have been identified, the following algorithm removes all useless symbols from the grammar G5 = {N5, T5, P5, S} (a code sketch of these computations appears after the two steps below):

(i) First, eliminate non-generating links, and all productions that have a non-generating link in them. We call the remaining grammar G′5 = {N′5, T′5, P′5, S}.

(ii) Second, eliminate all terminals and links that are not reachable, and eliminate all productions with these terminals and links. The remaining grammar is G6 = {N6, T6, P6, S}.
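The two fixed-point computations and the two elimination steps can be sketched as follows. This is only an illustration under a simplified encoding of our own: each production is a (lhs_link, rhs_links, rhs_terminals) triple in which rhs_links is the set of links occurring on the RHS, and the function names are ours. Terminals are generating by definition, so only links are tracked in the first computation.

    def generating_links(productions):
        # A link becomes generating once some production for it has an RHS
        # whose links are all already known to be generating.
        generating = set()
        changed = True
        while changed:
            changed = False
            for lhs, rhs_links, _ in productions:
                if lhs not in generating and rhs_links <= generating:
                    generating.add(lhs)
                    changed = True
        return generating

    def reachable_links(productions, start_link):
        # The start-link is reachable; every link on the RHS of a production
        # whose LHS is reachable is reachable.
        reachable = {start_link}
        changed = True
        while changed:
            changed = False
            for lhs, rhs_links, _ in productions:
                if lhs in reachable and not rhs_links <= reachable:
                    reachable |= rhs_links
                    changed = True
        return reachable

    def eliminate_useless(productions, start_link):
        # Step (i): drop productions that mention a non-generating link.
        # Step (ii): drop productions that mention an unreachable link
        # (unreachable terminals disappear with the productions that carry them).
        gen = generating_links(productions)
        kept = [p for p in productions if {p[0]} | p[1] <= gen]
        reach = reachable_links(kept, start_link)
        return [p for p in kept if {p[0]} | p[1] <= reach]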

Theorem 7 This process will eliminate all useless links and terminals from our grammar.

PROOF: Suppose X is a link or terminal that remains. If X is a terminal, it is generating by definition, and so it is in G′5. If X is a link, we know that X ⇒*G5 w for some sentential form w. Moreover, every link and terminal used in the derivation of w is generating. Thus X ⇒*G′5 w.

Since X was not eliminated in the second step, we also know that there is a sentential form α such that $ ⇒*G′5 α, where X is in α. Further, every link and terminal used in this derivation is reachable, so $ ⇒*G6 α.

We know that every link and terminal in α is reachable, and we also know that all these links and terminals are in G′5. The derivation of some terminal string, say α ⇒*G′5 v, where v contains w, involves only links and terminals that are reachable from $, because they are reached by symbols in α. Thus, this derivation is also a derivation of G6; that is, $ ⇒*G6 α ⇒*G6 v, where X is in α and w is in v. We conclude that X is useful in G6. Since X is an arbitrary link or terminal of G6, we conclude that G6 has no useless links or terminals. □

Theorem 8 If the GMTG G6 is constructed from G5 by the above construction for eliminating useless links and terminals, then L(G6) = L(G5).

PROOF: First we show L(G6) ⊆ L(G5). Since we have only eliminated links, terminals, and productions from G5 to obtain G6, every derivation of G6 is also a derivation of G5, and the inclusion follows.

Next we show L(G5) ⊆ L(G6). We must prove that if w is in L(G5), then w is in L(G6). If w is in L(G5), then $ ⇒*G5 w. Each symbol and link in this derivation is evidently both reachable and generating, so the derivation is also a derivation of G6. That is, $ ⇒*G6 w, and thus w is in L(G6). □

Example: The previous steps have created many non-generating links. We remove all productions with non-generating links and are left with G6 = {N6, T6, P6, S′}, where P6 is as follows:



[(S′)(S′)] → [(B17)(D25B17)]    (99)

[(B7)(B7)] → [(B21H16)(H16B21)]    (100)

[(H6)(H6)] → [(C1)(C1D25)]    (101)

[(B1)(B1)] → [(B32)(B43)]    (102)

[(B2)()] → [(b)()]    (103)

[()(B3)] → [()(b)]    (104)

[(C)(C)] → [(C12)(C23)]    (105)

[(C2)()] → [(c)()]    (106)

[()(C3)] → [()(c)]    (107)

[()(D5)] → [()(d)]    (108)

6 Conclusions

Generalized Multitext Grammar is a convenient and intuitive model of parallel text. In this paper, we have presented some important properties of GMTG, including a proof that the generative capacity of GMTG is comparable to that of ordinary LCFRS. We gave some examples of syntactic phenomena that GMTG can model perspicuously. Although some of these phenomena can be easily modeled by other formalisms, we are not aware of any other formalism with the same generative capacity that can easily model all of these phenomena. Lastly, we proposed a synchronous generalization of Chomsky Normal Form, laying the foundation for synchronous CKY parsing under GMTG.

In future work, we shall explore the empirical properties of GMTG by inducing stochastic GMTGs from real multitexts.

