CINTIL TreeBank Handbook:
Design options for the representation of syntactic constituency

António Branco, João Silva, Francisco Costa and Sérgio Castro

University of Lisbon
January 2011

1 Introduction
1.1 Concordancer
2 Constituency relations
2.1 constituency in a nutshell
2.2 minimal constituents
2.3 syntactic predication
2.4 head
2.4.1 personal pronouns
2.4.2 clitic pronouns
2.4.3 participles
2.5 complements
2.6 specifiers
2.6.1 bare NPs
2.7 modifiers
3 Non-constituency relations
3.1 CINTIL DepBank
3.2 CINTIL PropBank
4 Tag set
4.1 lexical categories
4.2 non-lexical categories
4.3 grammatical functions
4.4 semantic functions
5 Specific phrases
5.1 Sentences
5.2 Nominal
6 Phonetically null items
6.1 null subjects
6.2 null heads
6.3 traces
6.4 "though" null objects
7 Specific constructions
7.1 relatives
7.2 adjectives: predicative and attributive
7.3 comparatives
7.4 coordination
7.5 appositions
7.6 infinitives: inflected and non inflected
7.7 gerunds
7.8 complex predicates: auxiliary, raising and modal verbs
7.9 control verbs
7.10 "though" constructions
7.11 clitics
8 Long-distance relations
8.1 topicalization
8.2 relatives
8.3 interrogatives
9 Valency alternations
9.1 passives
9.2 anti-causatives
10 Tokenization
10.1 sentence splitting
10.2 non verbal utterances
10.3 contractions
10.4 clitics
11 Multi-word expressions
11.1 Proper names
11.2 cardinals
12 Textual marking
12.1 punctuation
12.2 comma
12.3 quotation marks
13 References

1 Introduction

Treebanks are data sets of utmost importance for the study of natural languages and for their computational processing. They permit the training and evaluation of different processing tools, including taggers, chunkers, parsers, deep linguistic grammars, etc.

A treebank is an annotated corpus: a data set consisting of a collection of individual written utterances, each associated with a representation of its linguistic structure, which can be set to capture different degrees of linguistic information.

CINTIL Treebank is a corpus of Portuguese utterances annotated with the representation of constituency relations. It is being developed and maintained at the University of Lisbon.

This document aims at supporting the use and exploitation of the CINTIL Treebank. It presents its major design options concerning the representation of syntactic relations.

The adopted design options were informed by advanced linguistic theorizing. The reader is referred to the literature for a thorough discussion and justification of them.

For the source of the utterances in this corpus, for its composition and for the annotation methodology used, see (Barreto et al., 2006).

The CINTIL Treebank has two versions: a reference version for human users, and a variant for training probabilistic parsers. Where the latter differs from the reference version, this is indicated below by text between square brackets starting with "ProbParser:".

1.1 Concordancer

The CINTIL DepBank can be searched through an online concordancer at http://lxcenter/services/en/LXServicesSearcher.html

Each example graph displayed below is associated with its identifier in the corpus. The corresponding sentences can be retrieved in the concordancer using these identifiers.

2 Constituency relations

2.1 constituency in a nutshell

In a sequence of lexemes w1 w2 w3, if the subsequence w1 w2 has a higher level of aggregation than the subsequences w1 w2 w3 or w2 w3, the sequence w1 w2 is considered to form a constituent of w1 w2 w3, of which w1 and w2 are themselves constituents.

The contrasting levels of aggregation are determined through the application of empirical tests, which rely on grammatical intuitions or judgments of syntactic well-formedness. These tests are based on judiciously designed minimal pairs of sequences. To test a putative constituent, these minimal pairs are constructed, for instance, by inserting a parenthetical element inside it, by displacing it to a non-canonical position in the word order of the sentence, by replacing it with an anaphoric expression, or by coordinating it with other known constituents.

A constituent is represented by enclosing the relevant sequence in square brackets (e.g. [w1 w2] w3) or, in an alternative but equivalent notation, by forming a tree of depth one whose leaves are w1 and w2 and whose top node stands for the whole constituent.
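The equivalence between the bracket notation and the tree notation can be illustrated with a short sketch. The Python function below is only an illustration, not part of any CINTIL tool: the name parse_brackets is hypothetical, and it assumes whitespace-separated lexemes with square brackets as the only delimiters. It reads a bracketed sequence into nested lists, where each inner list stands for one constituent.

```python
def parse_brackets(text):
    """Parse e.g. "[w1 w2] w3" into nested lists: [["w1", "w2"], "w3"]."""
    # Pad brackets with spaces so that they become separate tokens.
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    stack = [[]]  # stack[0] collects the top-level sequence
    for token in tokens:
        if token == "[":
            stack.append([])  # open a new constituent
        elif token == "]":
            if len(stack) == 1:
                raise ValueError("unbalanced closing bracket")
            constituent = stack.pop()
            stack[-1].append(constituent)  # attach it to its parent
        else:
            stack[-1].append(token)  # a lexeme (leaf)
    if len(stack) != 1:
        raise ValueError("unbalanced opening bracket")
    return stack[0]


print(parse_brackets("[w1 w2] w3"))  # [['w1', 'w2'], 'w3']
```

The inner list ['w1', 'w2'] corresponds to the top node of the one-level tree, and its two elements to the leaves.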

A syntactic category is a set of constituents with identical syntactic distribution, that is, constituents whose replacement by each other preserves the syntactic well-formedness of the larger expressions they are constituents of (provided that other key grammatical relations, such as morphological agreement, subcategorization, etc., are not affected by that replacement).

The categorization of constituents is represented by decorating the nodes of the constituency trees with tags signaling the appropriate categories. These tags are usually acronyms of the categories they correspond to. For instance, NP stands for Noun Phrase, S for Sentence, etc. See section "4 Tag set" below for the list of categories in use in the treebank.

2.2 minimal constituents

A lexeme is a terminal node and its category is represented in the immediately dominating, pre-terminal node. Together they form a unary branching tree.

2.3 syntactic predication

The constituency relations are intertwined with other grammatical relations, determining and being determined by them. Syntactic predication is one such relation of interest.

A syntactic predication is organized around a predicate and its complements, possibly extended with modifiers and specifiers.

To be part of a well-formed utterance, a predicate requires that a number of other expressions (zero or more), of a certain syntactic category or categories, co-occur with it. A predicate and its complements form a constituent.

2.4 head

Lexemes of categories N, V, A, P, ADV, CONJ and C may be syntactic predicates.

A syntactic predicate of category X is a special constituent (termed head) of its phrase, of category XP. In that constituency tree, the path from X to XP contains only (zero or more) intermediate nodes of category X'. The node XP, as well as the intermediate nodes X', are said to be projected by the head X.

In general, the head of a phrase XP is its single constituent of lexical category X, thus immediately dominated by a pre-terminal X node, except for multi-words, whose individual items project several pre-terminal Xs immediately dominated by a node X'.

In the treebank, in general, for major categories, a head X is represented as projecting an XP when this is a constituent having complements or modifiers of X as subconstituents (see also section "7.3 Comparatives").

Given their specific or ambivalent nature in categorial terms, this schema is adapted for the following items:

2.4.1 personal pronouns

A personal pronoun has category PRS. It is the head of an NP.

2.4.2 clitic pronouns

A clitic pronoun has category CL. It is the head of an NP.

2.4.3 participles

A past participle has category V. It is the head of an AP in attributive and predicative constructions.

2.5 complements

A complement of a predicate X is a constituent of the projected XP, immediately dominated by XP or by an intermediate X'. Such nodes of the phrase are said to be (internal) complements of the head.

Given their specific nature, verbal predicates may also have an external complement, not occurring inside the VP they project (see also section "5.1 Sentences").

Given their specific nature, nominal heads project an NP even when no complement exists or is realized (see also section "5.2 Nominal" for more details on NPs).

Complements are of the following categories: NP, PP, AP, ADVP, CP (see also section "7.3 Comparatives").

2.6 specifiers

Inside NPs, besides the head, complements and modifiers, other expressions may occur, which are specifiers.

A specifier of an NP projected by a head N is a constituent of that NP, immediately dominated by it or by an intermediate N', provided all other dominating N's also dominate specifiers (see also section "5.2 Nominal" for more details on NPs).

Specifiers are of the following categories: QNT, ART, D, DEM, POSS, CARD.

2.6.1 bare NPs

Given the key semantic function of specifiers, it is considered that NPs without a phonetically realized specifier (bare NPs) still undergo some process of specification. As a result, the NP node of bare NPs has a unary branch to the immediately dominated node.

The exception is to be found in proper names that modify a common noun, whose category is N' if it is a multi-word proper name (e.g. o actor Artur Semedo), else N (e.g. o rio Jadar).

2.7 modifiers

The event described by a predicate and its complements can be further characterized by other co-occurring lexemes or phrases, which are modifiers.

A modifier Y is in an adjunction position to the modified constituent Z, that is, it is a sister node of that Z, and both are dominated by a node also of category Z.

Modifiers are of the following categories: ORD, ADV, ADVP, A, AP, PP, CONJP, NP, CP, VP.

3 Non-constituency relations

Trees are aimed at depicting constituency relations. In the CINTIL Treebank, they are further decorated with information tags relevant for two types of grammatical relations that are of a non-constituency nature, namely grammatical dependency relations and semantic role relations.

Such information tags encode, respectively, the grammatical functions and the semantic functions of the corresponding nodes. They are displayed in accordance with the pattern Z-GF-SF, where Z is a constituency category, GF is a grammatical function, and SF is a semantic function (e.g. NP-SJ-ARG1).
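As an illustration of the Z-GF-SF pattern, the following Python sketch splits a node label into its three parts. It is a minimal example under stated assumptions, not a CINTIL utility: the names NodeLabel and parse_label are hypothetical, the separator is assumed to be a plain ASCII hyphen, and labels carrying fewer than three parts (such as a bare NP, or VP-C as used in section 7.8) are assumed to simply omit the trailing fields.

```python
from typing import NamedTuple, Optional


class NodeLabel(NamedTuple):
    category: str       # constituency category Z, e.g. "NP"
    gf: Optional[str]   # grammatical function, e.g. "SJ"
    sf: Optional[str]   # semantic function, e.g. "ARG1"


def parse_label(label):
    """Split a node label following the Z-GF-SF pattern into its parts."""
    parts = label.split("-")
    if not parts[0] or len(parts) > 3:
        raise ValueError(f"unexpected label: {label!r}")
    # Pad with None so that labels such as "NP" or "VP-C" are also accepted.
    parts += [None] * (3 - len(parts))
    return NodeLabel(*parts)


print(parse_label("NP-SJ-ARG1"))
# NodeLabel(category='NP', gf='SJ', sf='ARG1')
```

Keeping the three fields separate makes it straightforward to filter nodes by grammatical function or by semantic function independently of their category.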

A grammatical function results from an abstraction over complements and modifiers of different predicates. It makes it possible to categorize complements, or modifiers, with similar syntactic constraints on their realization, such as category, case, agreement, canonical word order, inflection paradigm, etc.

A semantic function, or semantic role, is also an abstraction over complements and modifiers of various syntactic predicates, but along a different, semantic, dimension. It makes it possible to categorize complements, or modifiers, according to similar semantic constraints on their denotation, that is, in terms of the similar contribution that the extra-linguistic elements they may denote bring to the characterization of the event being described. Given that semantic roles are much more elusive than grammatical functions, and following common practice in the creation of PropBanks (see also section 3.2 below on the CINTIL PropBank), the option here was to adopt a set of roles for complements that primarily makes it possible to semantically distinguish the complements of the same predicate from each other.

The possible values of grammatical functions are listed in section 4.3, and those of semantic functions in section 4.4.

3.1 CINTIL DepBank

Grammatical functions are a necessary but not sufficient element to characterize grammatical dependencies. Grammatical dependency relations can be depicted as graphs whose nodes are words and whose directed arcs establish a connection from a lexeme to its subordinate lexemes.

An arc represents the dependency of the subordinate item on the head. These dependencies can be of a number of different types, which are mostly the grammatical functions, and the arcs are decorated with the corresponding tags.

Corpora annotated with grammatical dependency graphs are known as Dependency Banks. The CINTIL Treebank is aligned with a dependency bank, the CINTIL DepBank. The bridging elements are the grammatical function tags decorating the nodes, in the treebank, and the arcs, in the dependency bank.

For the Handbook of the CINTIL DepBank see (Branco et al., 2011).

3.2 CINTIL PropBank

Treebanks encoding constituency relations that are extended to also encode the semantic functions, or semantic roles, of elements of syntactic predications have been termed PropBanks in the literature. Given that the nodes of the CINTIL Treebank are decorated with semantic functions, this annotated corpus can be taken as being also the CINTIL PropBank.

It is worth noting that in so-called PropBanks, the tag on a given constituent signals a semantic relation between that constituent and a predicator in the utterance. Hence, that relation, being signaled on a single constituent, is not fully identified in an explicit way, as one of its terms is not indicated.

Nonetheless, the relevant predicate is usually the closest predicate in the tree, which belongs to the same minimal predication as the tagged constituent.

The cases where this does not hold are typically cases of complex predicates, formed by means of several chained verbs, e.g. modals, auxiliaries and raising verbs. In such cases the tag used to code the semantic function is suffixed with "cp" (standing for "complex predicate") in order to help the search and concordancing of the treebank (for more details see section 4.4 below).

For an annotated corpus with a fully fledged representation of semantic relations, see the CINTIL LogicalFormBank.

4 Tag set

4.1 lexical categories

A         Adjective
ADV       Adverb
ART       Article
C         Complementizer
CARD      Cardinal
CL        Clitic
CONJ      Conjunction
D         Determiner
DEM       Demonstrative
ITJ       Interjection
N         Noun
ORD       Ordinal
P         Preposition
PERCENT   Percentage
PNT       Punctuation
POSS      Possessive
PRS       Personal pronoun
QNT       Quantifier
REL       Relative pronoun
V         Verb

4.2 non-lexical categories

A'        Adjective sub-phrase constituent
ADV'      Adverb sub-phrase constituent
ADVP      Adverb phrase
AP        Adjective phrase
CARD'     Cardinal sub-phrase constituent
CONJ'     Conjunction sub-phrase constituent
CONJP     Conjunction phrase
CP        Complementizer phrase
ITJ       Interjection
N'        Nominal sub-phrase constituent
NP        Noun phrase
PERCENTP  Percentage phrase
POSS'     Possessive sub-phrase constituent
PP        Preposition phrase
QNT'      Quantifier sub-phrase constituent
S         Sentence
V'        Verb sub-phrase constituent
VP        Verb phrase

4.3 grammatical functions

SJ        Subject
DO        Direct Object
IO        Indirect Object
OBL       Oblique Object
M         Modifier
PRD       Predicate
C         Complement
SP        Specifier

4.4 semantic functions

ARG1      Argument 1
ARG11     Argument 1 of subordinating predicator and Argument 1 of the subordinate clause (semantic function of Subjects of so-called Subject Control predicators)
ARG21     Argument 2 of subordinating predicator and Argument 1 of the subordinate clause (semantic function of Subjects in so-called Direct Object Control predicators)
ARG31     Argument 3 of subordinating predicator and Argument 1 of the subordinate clause (semantic function of Subjects in so-called Indirect Object Control predicators)
ARG2      Argument 2
ARG3      Argument 3
ARGncp    Argument n in complex predicate constructions
ARGnac    Argument n of anticausative readings
LOC       Location
EXT       Extension
ADV       Adverbial
CAU       Cause
TMP       Temporal
PNC       Purpose, goal
MNR       Manner
DIR       Direction
PRD       Predication
POV       Point of view

5 Specific phrases

Phrases of category S and NP have a specific constituency format (see also section 7.3 on comparatives).

5.1 Sentences

Phrases of category S have no S' daughters or S-categorial head.

An S is projected out of a V in case this is an impersonal or intransitive verb, or a verb with no realized complement other than the Subject, and with no modifiers:

#Id: a001/1

An S is projected out of a VP in case this VP's head has an internal complement:

#Id: b001/42

Or is modified:

#Id: a012/569

For S not projected out of a verbal head, see section 10.2 on non verbal utterances.

In canonical SVO word order, the Subject is immediately dominated by S and is a sister node of the projecting V or VP.

In VOS word order, extraposed Subjects are sister nodes of their V or VP and immediately dominated by S:

#Id: b051/3036

In VSO order, when the extraposed Subject intervenes between the Verb and its internal complement, the extraposed Subject is dominated by V':

#Id: b012/782

Quantifiers floating to a post-verbal position, as in Os jogadores viram todos isso, are in adjunction to a projection of the verb. Those floating to an immediately post-nominal position, as in Os jogadores todos viram isso, are in adjunction position to their NP (see example #Id: b092/5911, in section 6.1 below).

5.2 Nominal

Phrases of category NP may have specifier daughters. In general, these are left-branching nodes.

Bare NPs, with no realized specifier, are characterized by having a unary branch immediately below the NP node projected by their nominal head:

#Id: e000369/56038

6 Phonetically null items

Phonetically null items mark positions in the tree related to other positions in the tree (in the case of traces), or mark elided elements whose context is rich enough to support the recovery of their interpretation (in the case of null subjects or null heads).

[ProbParser: Phonetically null items are removed from the tree and represented by means of an appropriate tree configuration or an appropriate relabelling of the relevant nodes.]

6.1 null subjects

Null subjects are marked by *NULL* and are immediately dominated by the S node projected by a V or a VP:

#Id: b092/5911

[ProbParser: Null subjects are represented by unary branching below S.]

6.2 null heads

Null heads may be nominal or verbal. They are marked by *ELLIPSIS*:

#Id: b001/11

[ProbParser: Null heads are represented by XP or X' with no descendant X.]

6.3 traces

Traces of constituents are marked by *GAP* followed by _n, where n is a natural number. The category of the "displaced" node is coindexed with the trace and is thus also followed by _n:

#Id: b094/6024

[ProbParser: Any node Z in the tree path between the trace and the gap filler of category W is relabeled as Z/W.]
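The coindexation between traces and "displaced" nodes lends itself to a simple consistency check. The Python sketch below is a hypothetical illustration, not a CINTIL tool: it assumes the tree has already been flattened into a list of node labels and leaves, that a trace is written exactly as *GAP*_n, and that a coindexed node is written as its category immediately followed by _n (for instance NP_1, possibly with further function tags after the index). The function name unmatched_indices and the sample labels are invented for the example.

```python
import re

# A trace such as "*GAP*_1".
GAP = re.compile(r"^\*GAP\*_(\d+)$")
# A coindexed category such as "NP_1" or "N'_2"; anything may follow the index.
FILLER = re.compile(r"^[A-Z]+'?_(\d+)")


def unmatched_indices(symbols):
    """Return the indices that occur on a trace but on no category, or vice versa."""
    gaps, fillers = set(), set()
    for symbol in symbols:
        match = GAP.match(symbol)
        if match:
            gaps.add(int(match.group(1)))
            continue
        match = FILLER.match(symbol)
        if match:
            fillers.add(int(match.group(1)))
    # Symmetric difference: indices present in exactly one of the two sets.
    return sorted(gaps ^ fillers)


print(unmatched_indices(["CP", "NP_1", "S", "VP", "V", "*GAP*_1"]))  # []
print(unmatched_indices(["NP_1", "S", "*GAP*_2"]))                   # [1, 2]
```

An empty result means that every trace index has a matching coindexed category and every coindexed category has a matching trace.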

6.4 "though" null objects

Null direct objects specifically licensed by "though" constructions are marked by *THOUGH*.

#Id: a012/591

[ProbParser: "Though" null objects are represented by unary branching below VP.]

7 Specific constructions

7.1 relatives

A modifying relative clause is dominated by N'. It is of category CP, with two immediate constituents, an XP projected by a relative pronoun REL and a clause S. It has grammatical function M and semantic role PRED:

#Id: b080/4682

An appositive relative clause is preceded by a comma ",", which is in adjunction to the CP constituent and forms another CP constituent with it.

#Id: b219/16012

A free relative clause is of category NP, which is projected out of the relative:

#Id: e000149/38916

For the representation of the phonetically null trace corresponding to the relativizer XP, see section 6.3 on traces (see also section "8 Long-distance relations").

7.2 adjectives: predicative and attributive

In predicative constructions, the Subject is ARG1 of the copula verb, and in the corresponding logical form it shows up as ARG1 of the adjective.

#Id: a003/102

The same holds with respect to the head noun in attributive constructions. That is the case of the noun diferente in this example:

#Id: a001/34

Accordingly, any further arguments of the adjective surfacing in the tree are tagged with ARGn with n >= 2. That is the case of deste in the example above.

7.3 comparatives

A comparative construction is typically built around an adjective by two constituents, an adverbial of degree and a CONJP phrase:

#Id: e000282/49262

(Some adverbs may also support comparative constructions, as with perto in the example mais perto do que a Maria.)

The exception happens with adjectives like maior, menor, melhor, pior, which also express the comparison, in which case the comparative construction is built around the adjective and the CONJP phrase.

#Id: e000481/64969

The adverbial of degree (e.g. mais, menos, tão) is a sister of the adjective, dominated by an A' node. It is superficially tagged as A-M-M, that is, as a modifier, but note that in logical form the adjective shows up as the ARG1 of this adverb.

The CONJP phrase is a sister node of that node A'. It is projected by one of the conjunction expressions for comparatives: que, de que, de_ o que, como, quanto. It is a complement of the adverbial of degree. Hence this adverbial happens not to project an ADVP. This phrase is tagged as CONJP-C-ARG2, indicating that it is the complement and ARG2 of the adverb.

The CONJP may be absent from the comparative construction. In such a case, though it can be semantically recovered from the context, no phonetically null item is inserted in the tree to mark it.

7.4 coordination

The coordination of two constituents A and B by means of a coordinative conjunction Conj (either a lexical item, such as e, or a comma) is a cascade of adjunctions [A [Conj [B]]].

#Id: b001/30

7.5 appositions

Appositions are adjoined to NPs:

#Id: b005/254

7.6 infinitives: inflected and non inflected

An inflected infinitive projects an S:

#Id: c031/23222

And a non inflected one projects a VP:

#Id: b076/4469

7.7 gerunds

When in complex predicate constructions, preceded by an auxiliary verb, a gerund projects a VP. Otherwise, a gerund projects an adverbial sentence with a null subject:

#Id: c020/22209

7.8 complex predicates: auxiliary, raising and modal verbs

Auxiliary, modal and raising verbs select for a VP.

The different status of this VP is signaled by its grammatical function and/or semantic role.

Auxiliaries syntactically select for a complement VP, and are thus sister nodes of VP-C:

#Id: e000585/73183

Modal and raising verbs select not only syntactically but also semantically for a complement VP. They are thus sister nodes of VP-C-ARG1:

#Id: c003/20534

In a complex predicate, formed by any sequence of auxiliary, raising or modal verbs, the Subject is marked as NP-SJ-ARGncp, signaling that it is the Subject of the topmost verb (viz. -SJ-) and the ARGn of some verb further down in the complex predicate:

#Id: b134/8372

7.9 control verbs

Subject control verbs (e.g. querer) select for a Subject NP-SJ-ARG11, signaling that it is both the subject of the control verb and the subject in the clause occurring as direct object of the latter:

#Id: b001/34

Object control verbs (e.g. obrigar) select for a Direct Object NP-DO-ARG21, signaling that it is both the object of the control verb and the subject in the clause occurring as the other internal argument of the control verb.

#Id: e000660/79129

Indirect object control verbs (e.g. pedir) select for an Indirect Object PP-IO-ARG31, signaling that it is both the indirect object of the control verb and the subject in the clause occurring as the other internal argument of the control verb.

7.10 "though" constructions

In "though" constructions, the sentential complement of the adjective, introduced by the preposition de and projected by an inflected infinitive, has a phonetically null object marked with *TOUGH*:

#Id: a012/591

For more details, see also section 7.6 on infinitives and 6.4 on "though" null objects.

7.11 clitics

Clitics project NPs. In terms of constituency, they enter the same positions as any NP projected from an N with similar grammatical function (see also section 10.4 on the tokenization of clitics).

8 Long-distance relations

Long-distance relations are established between a constituent and a position further down and to the right in the tree, where this constituent typically occurs in (declarative) counterparts with canonical SVO word order. Constructions with long-distance relations include topicalizations, interrogatives and relatives.

The canonical position is marked by a phonetically null item *GAP*, which is coindexed with the constituent with which it establishes a long-distance relation.

[ProbParser: The long-distance dependency is represented by decorating every node in the path inside the tree connecting the node immediately dominating the putative gap and the sister node of the "displaced" constituent. The nodes in that path are decorated by concatenating to their category tags a slash "/" followed by the triple CAT-GF-SR of the "displaced" constituent, where CAT is its category, GF is its grammatical function, and SR is its semantic role.]
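The slash decoration used in the ProbParser variant can be sketched as a plain string operation. The Python functions below are hypothetical helpers written for illustration only: they assume the path between the two nodes has already been extracted as a list of labels ordered from top to bottom, and the labels used in the example (S, VP and NP-DO-ARG2) are invented rather than taken from the corpus.

```python
def decorate_path(path_labels, displaced_label):
    """Append "/CAT-GF-SR" of the displaced constituent to every label on the path."""
    return [f"{label}/{displaced_label}" for label in path_labels]


def split_decoration(label):
    """Split "Z/CAT-GF-SR" back into ("Z", "CAT-GF-SR"); undecorated labels yield (label, None)."""
    base, slash, displaced = label.partition("/")
    return (base, displaced) if slash else (base, None)


decorated = decorate_path(["S", "VP"], "NP-DO-ARG2")
print(decorated)                       # ['S/NP-DO-ARG2', 'VP/NP-DO-ARG2']
print(split_decoration(decorated[0]))  # ('S', 'NP-DO-ARG2')
```

Because the decoration is a pure suffix after the slash, the original label of each node can always be recovered from the decorated one.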

8.1 topicalization

The topicalized constituent is in adjunction to the constituent from which it was topicalized:

#Id: c049/24856

8.2 relatives

The relative phrase projects a CP immediately dominating the constituent S from which it was relativized.

See also section 7.1 on relatives.

8.3 interrogatives

In its current version, the corpus does not yet contain interrogatives with long-distance relations.

9 Valency alternations

9.1 passives

The by-phrase is an internal complement of the past participle verb form, with grammatical function OBL and semantic role ARG1.

The corresponding Subject bears ARG2cp (see also section 7.8 on complex predicates above):

#Id: b179/11830

9.2 anti-causatives

The Subject of an anticausative verb is ARG2ac:

#Id: e000530/68760

In a predication supported by the transitive counterpart of a possible anticausative verb, the Subject is ARG1, as with other transitive verbs. As expected, in its passive alternation, the Subject is ARG2.

10 Tokenization

10.1 sentence splitting

Sentences are split at the expected points. Worth mentioning is the case of utterances involving a colon ":", which are split into two separate entry sentences in the treebank, one preceding the colon and the other following it.

10.2 non verbal utterances

Titles of newspaper articles, stretches around colons, etc. are cases of possible utterances in the corpus which are not projected by a corresponding verbal head. In any case, every entry utterance in the corpus is dominated by an S node.

10.3 contractions

Contractions are expanded. The first element of an expanded contraction is marked with an "_" (underscore) symbol, for instance do → |de_|o|.

10.4 clitics

Clitics are detached from the verb. The detached clitic is marked with a "-" (hyphen) symbol, as for instance dá-se-lho → |dá|-se|-lhe|-o|

When in mesoclisis, a "-CL-" mark is used to signal the original position of the detached clitic: afirmar-se-ia → |afirmar-CL-ia|-se|

Possible vocalic alterations of the verb form are marked with a "#" (hash) symbol, as for instance in vê-las → |vê#|-las|.
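The tokenization marks described in sections 10.3 and 10.4 can be recognized with a few string tests. The Python sketch below is an illustration under stated assumptions, not the tokenizer used to build the corpus: the function name classify_tokens and the category names it returns are hypothetical, tokens are assumed to be delimited by "|" as in the examples above, and the hyphen is assumed to be the plain ASCII character. Note that a free-standing hyphen used as punctuation would also be labeled as a clitic by this simple test.

```python
def classify_tokens(tokenized):
    """Classify the tokens of a string such as "|dá|-se|-lhe|-o|"."""
    result = []
    for token in tokenized.strip("|").split("|"):
        if "-CL-" in token:
            kind = "verb with mesoclisis mark"
        elif token.startswith("-"):
            kind = "clitic"
        elif token.endswith("#"):
            kind = "verb with vocalic alteration"
        elif token.endswith("_"):
            kind = "first element of contraction"
        else:
            kind = "other"
        result.append((token, kind))
    return result


for example in ["|de_|o|", "|dá|-se|-lhe|-o|", "|afirmar-CL-ia|-se|", "|vê#|-las|"]:
    print(classify_tokens(example))
```

The test for "-CL-" comes first so that a verb form carrying the mesoclisis mark is not mistaken for a detached clitic.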

11 Multi-word expressions

11.1 Proper names

Multi-word proper names form a flat constituent where all the words are sisters of each other, are of category N, and are dominated by a single common N' node. This head projects an NP except when it is a modifier of a common noun:

#Id: b005/254

See also sections 2.6.1 on bare NPs, and 5.2 on NPs.

11.2 cardinals

Complex cardinals have a flat structure like a multi-word named entity.

#Id: e000650/78330

12 Textual marking

12.1 punctuation

Each punctuation mark is a constituent of category PNT.

End-of-sentence markers are in topmost adjunction.

12.2 comma

Commas separating left periphery constituents are right-adjoined to these constituents.

Commas surrounding appositions are topmost constituents of the appositive constituent.

#Id: b029/1761

Commas with coordinative value are represented like lexical coordinative conjunctions (for further details, see section 7.4 on coordination):

#Id: a001/30

Commas surrounding parentheticals are adjoined to the surrounded constituent. With several parentheticals in sequence, the first one is surrounded, and each of the following ones has a single comma at its right:

#Id: b227/16800

Commas emphasizing conjunctions, thus immediately preceding them, are right-adjoined to the left coordinated constituent:

#Id: b184/12279

Other "pause" commas are left-adjoined:

#Id: b128/8002

12.3 quotation marks

Quotation marks surrounding constituents are adjoined to them. When they surround a lexical item of category X, they are dominated by an X' node together with X:

#Id: b010/654

Quotation marks surrounding strings not forming a constituent are adjoined to the highest possible node:

#Id: b091/5858

13 References

Barreto, Florbela, António Branco, Eduardo Ferreira, Amália Mendes, Maria Fernanda Nascimento, Filipe Nunes and João Silva, 2006, "Open Resources and Tools for the Shallow Processing of Portuguese", Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC2006), Genoa, Italy.

Branco, António, Sérgio Castro, João Silva and Francisco Costa, 2011, CINTIL DepBank Handbook: Design options for the representation of grammatical dependencies.