CINTIL TreeBank Handbook:
Design options for the representation of syntactic constituency
António Branco, João Silva, Francisco Costa and Sérgio Castro
University of Lisbon
January 2011
1 INTRODUCTION
1.1 Concordancer
2 CONSTITUENCY RELATIONS
2.1 constituency in a nutshell
2.2 minimal constituents
2.3 syntactic predication
2.4 head
2.4.1 personal pronouns
2.4.2 clitic pronouns
2.4.3 participles
2.5 complements
2.6 specifiers
2.6.1 bare NPs
2.7 modifiers
3 NON-CONSTITUENCY RELATIONS
3.1 CINTIL DepBank
3.2 CINTIL PropBank
4 TAGSET
4.1 lexical categories
4.2 non-lexical categories
4.3 grammatical functions
4.4 semantic functions
5 SPECIFIC PHRASES
5.1 Sentences
5.2 Nominal
6 PHONETICALLY NULL ITEMS
6.1 null subjects
6.2 null heads
6.3 traces
6.4 "though" null objects
7 SPECIFIC CONSTRUCTIONS
7.1 relatives
7.2 adjectives: predicative and attributive
7.3 comparatives
7.4 coordination
7.5 appositions
7.6 infinitives: inflected and non-inflected
7.7 gerunds
7.8 complex predicates: auxiliary, raising and modal verbs
7.9 control verbs
7.10 "though" constructions
7.11 clitics
8 LONG-DISTANCE RELATIONS
8.1 topicalization
8.2 relatives
8.3 interrogatives
9 VALENCY ALTERNATIONS
9.1 passives
9.2 anticausatives
10 TOKENIZATION
10.1 sentence splitting
10.2 non-verbal utterances
10.3 contractions
10.4 clitics
11 MULTI-WORD EXPRESSIONS
11.1 Proper names
11.2 cardinals
12 TEXTUAL MARKING
12.1 punctuation
12.2 comma
12.3 quotation marks
13 REFERENCES
1 Introduction
Treebanks are data sets of utmost importance for the study of natural languages and for their computational processing. They permit the training and evaluation of different processing tools, including taggers, chunkers, parsers, deep linguistic grammars, etc.
A treebank is an annotated corpus. It is a data set consisting of a collection of individual written utterances associated with the representation of their linguistic structure, which can be set to capture different degrees of linguistic information.
CINTIL Treebank is a corpus of Portuguese utterances annotated with the representation of constituency relations. It is being developed and maintained at the University of Lisbon.
This document aims at supporting the utilization and exploitation of the CINTIL Treebank. It presents the major design options concerning the representation of syntactic relations.
The adopted design options were informed by advanced linguistic theorizing. The reader is referred to the literature for a thorough discussion and justification of them.
For the source of the utterances in this corpus, for its composition and for the annotation methodology used, see (Barreto et al., 2006).
The CINTIL Treebank has two versions. There is a reference version for human users, and there is a variant for training probabilistic parsers. Where the latter differs from the reference version, this is indicated below by text between square brackets starting with "ProbParser:".
1.1 Concordancer
The CINTIL DepBank can be searched through a concordancer online at http://lxcenter/services/en/LXServicesSearcher.html
Each example graph displayed below is accompanied by its identifier in the corpus. The corresponding sentences can be retrieved in this concordancer with these identifiers.
2 Constituency relations
2.1 constituency in a nutshell
In a sequence of lexemes w1 w2 w3, if the subsequence w1 w2 has a higher level of aggregation than the subsequences w1 w2 w3 or w2 w3, the sequence w1 w2 is considered to form a constituent of w1 w2 w3, of which w1 and w2 are themselves constituents.
The contrasting levels of aggregation are determined through the application of empirical tests which rely on grammatical intuitions or judgments on syntactic well-formedness. These empirical tests are based on judiciously designed minimal pairs of sequences. To test a putative constituent, these minimal pairs are constructed, for instance, by means of the insertion of a parenthetical element inside it, by displacing it to a non-canonical word order in the sentence, by replacing it by an anaphoric expression, or by coordinating it with other known constituents, etc.
A constituent is represented by enclosing the relevant sequence in square brackets (e.g. [w1 w2] w3), or in an alternative, but equivalent notation, by forming a tree of depth one whose leaves are w1 and w2 and whose top node stands for the whole constituent.
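The equivalence between the bracket notation and the tree notation can be illustrated with a short Python sketch. It is not part of any CINTIL tooling: the function name parse_brackets and the encoding of trees as nested lists are assumptions made here purely for illustration.

```python
def parse_brackets(text):
    """Turn a bracketed string such as "[w1 w2] w3" into nested lists.

    Each pair of square brackets becomes one list (a constituent);
    lexemes stay as plain strings (the leaves of the tree).
    """
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    stack = [[]]  # stack[0] collects the top-level sequence
    for token in tokens:
        if token == "[":
            stack.append([])  # open a new constituent
        elif token == "]":
            if len(stack) == 1:
                raise ValueError("closing bracket without an opening one")
            constituent = stack.pop()
            stack[-1].append(constituent)  # attach it to its mother node
        else:
            stack[-1].append(token)
    if len(stack) != 1:
        raise ValueError("opening bracket without a closing one")
    return stack[0]


tree = parse_brackets("[w1 w2] w3")
print(tree)  # [['w1', 'w2'], 'w3']
```

In the result, the inner list plays the role of the top node of the constituent [w1 w2], with w1 and w2 as its leaves.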
A syntactic category is a set of constituents with identical syntactic distribution, that is, constituents whose replacement by each other preserves the syntactic well-formedness of the larger expressions they are constituents of (provided some other key grammatical relations are not affected by that replacement, such as morphological agreement, subcategorization, etc.).
The categorization of constituents is represented by decorating the nodes of the constituency trees with tags signaling the appropriate categories. These tags are usually acronyms of the categories they correspond to. For instance, NP stands for Noun Phrase, S for Sentence, etc. See section "4 Tagset" below for the list of categories in use in the treebank.
2.2 minimal constituents
A lexeme is a terminal node and its category is represented in the immediately dominating, pre-terminal node. They form a unary branching tree.
2.3 syntactic predication
The constituency relations are intertwined with other grammatical relations, determining and being determined by them. Syntactic predication is one such relation of interest.
A syntactic predication is organized around a predicate and its complements, possibly extended with modifiers and specifiers.
To integrate a well-formed utterance, a predicate requires that a number of other expressions (zero or more), of a certain syntactic category or categories, co-occur with it. A predicate and its complements form a constituent.
2.4 head
Lexemes of categories N, V, A, P, ADV, CONJ and C may be syntactic predicates.
A syntactic predicate of category X is a special constituent (termed head) of its phrase, of category XP. In that constituency tree, the path from X to XP only contains (zero or more) intermediate nodes of category X'. That node XP, as well as the intermediate nodes X', are said to be projected by that head X.
In general, the head of a phrase XP is its single constituent of lexical category X, thus immediately dominated by a pre-terminal X node, except for multi-words, whose individual items project several pre-terminal Xs immediately dominated by a node X'.
In the treebank, in general, for major categories, a head X is represented as projecting an XP when this is a constituent having complements or modifiers of X as subconstituents (see also section "7.3 comparatives").
Given the specific or ambivalent nature of some items in categorial terms, this schema is adapted for the following ones:
2.4.1 personal pronouns
A personal pronoun has category PRS. It is the head of an NP.
2.4.2 clitic pronouns
A clitic pronoun has category CL. It is the head of an NP.
2.4.3 participles
A past participle has category V. It is the head of an AP in attributive and predicative constructions.
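The projection schema of section 2.4, together with its three adaptations, can be stated as a small check over the sequence of category tags found on the path from a head up to its phrase. The sketch below is only an illustration under assumptions: the names SPECIAL_HEADS and is_projection_path are hypothetical, tags are taken without grammatical or semantic function suffixes, and the (V, AP) pair is accepted everywhere although the handbook restricts it to attributive and predicative constructions.

```python
# Heads whose category differs from that of the phrase they project
# (sections 2.4.1 to 2.4.3). Hypothetical helper table.
SPECIAL_HEADS = {("PRS", "NP"), ("CL", "NP"), ("V", "AP")}


def is_projection_path(labels):
    """Check a path of category tags from a head X up to its phrase XP.

    `labels` lists the tags bottom-up, e.g. ["N", "N'", "NP"].
    The path is accepted when it ends in XP, starts in X (or in one of
    the special heads) and contains only X' nodes in between.
    """
    if len(labels) < 2:
        return False
    head, *intermediate, phrase = labels
    if not phrase.endswith("P"):
        return False
    base = phrase[:-1]  # "NP" -> "N", "ADVP" -> "ADV"
    if head != base and (head, phrase) not in SPECIAL_HEADS:
        return False
    return all(label == base + "'" for label in intermediate)


print(is_projection_path(["N", "N'", "N'", "NP"]))  # True
print(is_projection_path(["PRS", "NP"]))            # True
print(is_projection_path(["N", "V'", "NP"]))        # False
```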
2.5 complements
A complement of a predicate X is a constituent of the projected XP, immediately dominated by XP or by an intermediate X'. Such nodes of the phrase are said to be (internal) complements of the head.
Given their specific nature, verbal predicates may also have an external complement, not occurring inside the VP they project (see also section "5.1 Sentences").
Given their specific nature, nominal heads project an NP even when no complement exists or is realized (see also section "5.2 Nominal" for more details on NPs).
Complements are of the following categories: NP, PP, AP, ADVP, CP (see also section "7.3 comparatives").
2.6 specifiers
Inside NPs, besides the head, complements and modifiers, other expressions may occur, which are specifiers.
A specifier of an NP projected by a head N is a constituent of that NP, immediately dominated by it or by an intermediate N', provided all other dominating N' nodes also dominate specifiers (see also section "5.2 Nominal" for more details on NPs).
Specifiers are of the following categories: QNT, ART, D, DEM, POSS, CARD.
2.6.1 bare NPs
Given the key semantic function of specifiers, it is considered that NPs without a phonetically realized specifier (bare NPs) still undergo some process of specification. As a result, the NP node of bare NPs has a unary branch to the immediately dominated node.
The exception is to be found in Proper Names that modify a common noun, whose category is N' if it is a multi-word proper name (e.g. o actor Artur Semedo), and N otherwise (e.g. o rio Jadar).
2.7 modifiers
The event described by a predicate and its complements can be further characterized by other co-occurring lexemes or phrases, which are modifiers.
A modifier Y is in an adjunction position to the modified constituent Z, that is, it is a sister node of that Z, and both are dominated by a node also of category Z.
Modifiers are of the following categories: ORD, ADV, ADVP, A, AP, PP, CONJP, NP, CP, VP.
3 Non-constituency relations
Trees are aimed at depicting constituency relations. In the CINTIL treebank, they are further decorated with information tags relevant also for two types of grammatical relations that are of a non-constituency nature, namely grammatical dependency relations and semantic role relations.
Such information tags encode, respectively, grammatical functions and semantic functions of the corresponding nodes. They are displayed in accordance with the pattern Z-GF-SF, where Z is a constituency category, GF is a grammatical function, and SF is a semantic function (e.g. NP-SJ-ARG1).
A grammatical function results from an abstraction over complements and modifiers of different predicates. It permits categorizing complements, or modifiers, with similar syntactic constraints on their realization, such as category, case, agreement, canonical word order, inflection paradigm, etc.
A semantic function, or semantic role, is also an abstraction over complements and modifiers of various syntactic predicates, but along a different, semantic, dimension. It permits categorizing complements, or modifiers, according to similar semantic constraints on their denotation, that is, in terms of the similar contribution that the extra-linguistic elements they may denote bring to the characterization of the event being described. Given that semantic roles are much more elusive than grammatical functions, and following common practice with respect to the creation of PropBanks (see also section 3.2 below on the CINTIL PropBank), the option here was to adopt a set of roles for complements that primarily permits semantically distinguishing the complements of the same predicate from each other.
The possible values of grammatical functions are listed in section 4.3, and those of semantic functions in section 4.4.
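When processing the treebank, node tags following the Z-GF-SF pattern typically need to be split into their three fields. The following Python sketch shows one way this might be done. It rests on assumptions not stated in the handbook: fields are taken to be separated by a plain ASCII hyphen and to be strictly positional (category, then grammatical function, then semantic function), and the names NodeLabel, parse_label and split_arg are hypothetical. The set of grammatical functions is the one listed in section 4.3.

```python
import re
from typing import NamedTuple, Optional

# Grammatical functions as listed in section 4.3.
GRAMMATICAL_FUNCTIONS = {"SJ", "DO", "IO", "OBL", "M", "PRD", "C", "SP"}


class NodeLabel(NamedTuple):
    category: str
    gf: Optional[str]  # grammatical function, if present
    sf: Optional[str]  # semantic function, if present


def parse_label(label):
    """Split a tag such as "NP-SJ-ARG1" into (category, GF, SF)."""
    parts = label.split("-")
    if len(parts) > 3:
        raise ValueError(f"too many fields in {label!r}")
    category, gf, sf = parts + [None] * (3 - len(parts))
    if gf is not None and gf not in GRAMMATICAL_FUNCTIONS:
        raise ValueError(f"unknown grammatical function {gf!r} in {label!r}")
    return NodeLabel(category, gf, sf)


# ARGn, optionally followed by "1" (control, e.g. ARG21) and by "cp" or "ac".
ARG_PATTERN = re.compile(r"^ARG(\d)(1)?(cp|ac)?$")


def split_arg(sf):
    """Return (n, is_control, suffix) for ARG-type roles, else None."""
    match = ARG_PATTERN.match(sf)
    if match is None:
        return None
    number, control, suffix = match.groups()
    return int(number), control is not None, suffix


print(parse_label("NP-SJ-ARG1"))  # NodeLabel(category='NP', gf='SJ', sf='ARG1')
print(parse_label("VP-C"))        # NodeLabel(category='VP', gf='C', sf=None)
print(split_arg("ARG2cp"))        # (2, False, 'cp')
```

Keeping the ARG-specific analysis in a separate function leaves parse_label usable for roles such as LOC or TMP, for which split_arg simply returns None.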
3.1 CINTIL DepBank
Grammatical functions are a necessary but not sufficient element to characterize grammatical dependencies. Grammatical dependency relations can be depicted as graphs whose nodes are words and whose directed arcs establish a connection from a lexeme to its subordinate lexemes.
An arc represents the dependency of the subordinate item on the head. These dependencies can be of a number of different types, which are mostly the grammatical functions, and with whose tags the arcs are decorated.
Corpora annotated with grammatical dependency graphs are known as Dependency Banks. The CINTIL Treebank is aligned to a dependency bank, the CINTIL DepBank. The bridging elements are the grammatical function tags decorating the nodes, in the treebank, and the arcs, in the dependency bank.
For the Handbook of the CINTIL DepBank see (Branco et al., 2011).
3.2 CINTIL PropBank
Treebanks encoding constituency relations which are extended to encode also semantic functions, or semantic roles, of elements of syntactic predications have been termed PropBanks in the literature. Given that the nodes of the CINTIL Treebank are decorated with semantic functions, this annotated corpus can be taken as being also the CINTIL PropBank.
It is worth noting that in so-called PropBanks, the tag on a given constituent signals a semantic relation between that constituent and a predicator in the utterance. Hence, that relation, being signaled over a single constituent, is not fully identified in an explicit way, as one of its terms is not indicated.
Nonetheless, usually the relevant predicate is the closest predicate in the tree, which belongs to the same minimal predication as the tagged constituent does.
The cases where this does not hold are typical cases of complex predicates, formed by means of several chained verbs, e.g. modals, auxiliaries and raising verbs. In such cases the tag used to code the semantic function is suffixed with "cp" (standing for "complex predicate") in order to help the search and concordancing of the treebank (for more details see section 4.4 below).
For an annotated corpus with a fully fledged representation of semantic relations, see the CINTIL LogicalFormBank.
4 Tagset
4.1 lexical categories
A         Adjective
ADV       Adverb
ART       Article
C         Complementizer
CARD      Cardinal
CL        Clitic
CONJ      Conjunction
D         Determiner
DEM       Demonstrative
ITJ       Interjection
N         Noun
ORD       Ordinal
P         Preposition
PERCENT   Percentage
PNT       Punctuation
POSS      Possessive
PRS       Personal pronoun
QNT       Quantifier
REL       Relative pronoun
V         Verb
4.2 non-lexical categories
A'        Adjective sub-phrase constituent
ADV'      Adverb sub-phrase constituent
ADVP      Adverb phrase
AP        Adjective phrase
CARD'     Cardinal sub-phrase constituent
CONJ'     Conjunction sub-phrase constituent
CONJP     Conjunction phrase
CP        Complementizer phrase
ITJ       Interjection
N'        Nominal sub-phrase constituent
NP        Noun phrase
PERCENTP  Percentage phrase
POSS'     Possessive sub-phrase constituent
PP        Preposition phrase
QNT'      Quantifier sub-phrase constituent
S         Sentence
V'        Verb sub-phrase constituent
VP        Verb phrase
4.3 grammatical functions
SJ        Subject
DO        Direct Object
IO        Indirect Object
OBL       Oblique Object
M         Modifier
PRD       Predicate
C         Complement
SP        Specifier
4.4 semantic functions
ARG1      Argument 1
ARG11     Argument 1 of subordinating predicator and Argument 1 of the subordinate clause (semantic function of Subjects of so-called Subject Control predicators)
ARG21     Argument 2 of subordinating predicator and Argument 1 of the subordinate clause (semantic function of Subjects in so-called Direct Object Control predicators)
ARG31     Argument 3 of subordinating predicator and Argument 1 of the subordinate clause (semantic function of Subjects in so-called Indirect Object Control predicators)
ARG2      Argument 2
ARG3      Argument 3
ARGncp    Argument n in complex predicate constructions
ARGnac    Argument n of anticausative readings
LOC       Location
EXT       Extension
ADV       Adverbial
CAU       Cause
TMP       Temporal
PNC       Purpose, goal
MNR       Manner
DIR       Direction
PRD       Predication
POV       Point of view
5 Specific phrases
Phrases of category S and NP have a specific constituency format (see also section 7.3 on comparatives).
5.1 Sentences
Phrases of category S have no S' daughters or S-categorial head.
An S is projected out of a V in case this is an impersonal or intransitive verb, or a verb with no realized complement other than the Subject, with no modifiers:
#Id:a001/1
An S is projected out of a VP in case this VP's head has an internal complement:
#Id:b001/42
Or is modified:
#Id:a012/569
For S not projected out of a verbal head, see section 10.2 on non-verbal utterances.
In canonical SVO word order, the Subject is immediately dominated by S and is a sister node of the projecting V or VP.
In VOS word order, extraposed Subjects are sister nodes of their V or VP and immediately dominated by S:
#Id:b051/3036
In VSO order, when the extraposed Subject intervenes between the Verb and its internal complement, the extraposed Subject is dominated by V':
#Id:b012/782
Quantifiers floating to a post-verbal position, as in Os jogadores viram todos isso, are in adjunction to a projection of the verb. Those floating to an immediately post-nominal position, as in Os jogadores todos viram isso, are in adjunction position to their NP (see example #Id:b092/5911, in section 6.1 below).
5.2 Nominal
Phrases of category NP may have specifier daughters. In general, these are left-branching nodes.
Bare NPs, with no realized specifier, are characterized by having a unary branch immediately below the NP node projected by their nominal head:
#Id:e000369/56038
6 Phonetically null items
Phonetically null items mark positions in the tree related to other positions in the tree (in the case of traces), or mark elided elements whose context is rich enough to support the recovery of their interpretation (in the case of null subjects or null heads).
[ProbParser: Phonetically null items are removed from the tree and represented by means of an appropriate tree configuration or an appropriate relabelling of the relevant nodes.]
6.1 null subjects
Null subjects are marked by *NULL* and are immediately dominated by the S node projected by a V or a VP:
#Id:b092/5911
[ProbParser: Null subjects are represented by unary branching below S.]
6.2 null heads
Null heads may be nominal or verbal. They are marked by *ELLIPSIS*:
#Id:b001/11
[ProbParser: Null heads are represented by XP or X' with no descendant X.]
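The two ProbParser conventions above (null subjects and null heads) amount to deleting the null item and leaving the remaining configuration to signal it. A minimal Python sketch of that deletion follows. It is not the conversion procedure actually used for the treebank: the list-based tree encoding ([tag, child, ...], with words as plain strings), the function name strip_null_items and the example sentences are assumptions made here, and tags are shown without grammatical or semantic functions.

```python
NULL_MARKS = ("*NULL*", "*ELLIPSIS*")


def strip_null_items(tree):
    """Remove null subjects and null heads from a [tag, child, ...] tree.

    A child is either a subtree (a list) or a word (a string). Nodes that
    are left with no children after the removal are dropped as well.
    Returns the pruned tree, or None if nothing is left.
    """
    label, *children = tree
    kept = []
    for child in children:
        if isinstance(child, str):
            if child not in NULL_MARKS:
                kept.append(child)
        else:
            pruned = strip_null_items(child)
            if pruned is not None:
                kept.append(pruned)
    return [label, *kept] if kept else None


# Null subject: S is left with a single (unary) branch.
print(strip_null_items(["S", ["NP", "*NULL*"], ["V", "chegaram"]]))
# ['S', ['V', 'chegaram']]

# Null nominal head: N' is left with no descendant N.
print(strip_null_items(["NP", ["ART", "o"], ["N'", ["N", "*ELLIPSIS*"], ["A", "azul"]]]))
# ['NP', ['ART', 'o'], ["N'", ['A', 'azul']]]
```

Traces are deliberately not handled here, since their ProbParser representation involves relabelling the nodes on a path rather than simple deletion.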
6.3 traces
Traces of constituents are marked by *GAP* followed by _n, where n is a natural number. The category of the "displaced" node is coindexed with the trace and thus also followed by _n:
#Id:b094/6024
[ProbParser: Any node Z in the tree path between the trace and the gap filler of category W is relabeled as Z/W.]
6.4 "though" null objects
Null direct objects specifically licensed by "though" constructions are marked by *THOUGH*.
#Id:a012/591
[ProbParser: "Though" null objects are represented by unary branching below VP.]
7 Specific constructions
7.1 relatives
A modifying relative clause is dominated by N'. It is of category CP, with two immediate constituents, an XP projected by a relative pronoun REL and a clause S. It has grammatical function M and semantic role PRED:
#Id:b080/4682
An appositive relative clause is preceded by a comma ',', which is in adjunction to the CP constituent and forms another CP constituent with it.
#Id:b219/16012
A free relative clause is of category NP, which is projected out of the relative:
#Id:e000149/38916
For the representation of the phonetically null trace, in correspondence to the relativizer XP, see section 6.3 on traces (see also section "8 Long-distance relations").
7.2 adjectives: predicative and attributive
In predicative constructions, the Subject is ARG1 of the copula verb, and in the corresponding logical form it shows up as ARG1 of the adjective.
#Id:a003/102
The same holds with respect to the head noun in attributive constructions. That is the case of the noun diferente in this example:
#Id:a001/34
Accordingly, any further arguments of the adjective surfacing in the tree are tagged with ARGn with n >= 2. That is the case of deste in the example above.
7.3 comparatives
A comparative construction is typically built around an adjective by two constituents, an adverbial of degree and a CONJP phrase:
#Id:e000282/49262
(some adverbs may also support comparative constructions, as with perto in the example mais perto do que a Maria)
The exception happens with adjectives like maior, menor, melhor, pior, which also express the comparison, in which case the comparative construction is built around the adjective and the CONJP phrase.
#Id:e000481/64969
The adverbial of degree (e.g. mais, menos, tão) is a sister of the adjective, dominated by an A' node. It is superficially tagged as A-M-M, that is, as a modifier, but note that in logical form the adjective shows up as the ARG1 of this adverb.
The CONJP phrase is a sister node of that node A'. It is projected by one of the conjunction expressions for comparatives que, de que, de_ o que, como, quanto. It is a complement of the adverbial of degree. Hence this adverbial happens not to project an ADVP. This phrase is tagged as CONJP-C-ARG2, indicating that it is the complement and ARG2 of the adverb.
The CONJP may be absent from the comparative construction. In such a case, though it can be semantically recovered from the context, no phonetically null item is inserted in the tree to mark it.
7.4 coordination
The coordination of two constituents A and B by means of a coordinative conjunction Conj (either a lexical item, such as e, or a comma) is a cascade of adjunctions [A [Conj [B]]].
#Id:b001/30
7.5 appositions
Appositions are adjoined to NPs:
#Id:b005/254
7.6 infinitives: inflected and non-inflected
An inflected infinitive projects an S:
#Id:c031/23222
And a non-inflected one projects a VP:
#Id:b076/4469
7.7 gerunds
When in complex predicate constructions, preceded by an auxiliary verb, a gerund projects a VP. Otherwise, a gerund projects an adverbial sentence with a null subject:
#Id:c020/22209
7.8 complex predicates: auxiliary, raising and modal verbs
Auxiliary, modal and raising verbs select for a VP.
The different status of this VP is signaled by its grammatical function and/or semantic role.
Auxiliaries syntactically select for a complement VP, and are thus sister nodes of VP-C:
#Id:e000585/73183
Modal and raising verbs select not only syntactically but also semantically for a complement VP. They are thus sister nodes of VP-C-ARG1:
#Id:c003/20534
In a complex predicate, formed by any sequence of auxiliary, raising or modal verbs, its Subject is marked as NP-SJ-ARGncp, signaling that it is the Subject of the topmost verb (viz. -SJ-) and the ARGn of some verb down below in the complex predicate:
#Id:b134/8372
7.9 control verbs
Subject control verbs (e.g. querer) select for a Subject NP-SJ-ARG11, signaling that it is both the subject of the control verb and the subject in the clause occurring as direct object of the latter:
#Id:b001/34
Object control verbs (e.g. obrigar) select for a Direct Object NP-DO-ARG21, signaling that it is both the object of the control verb and the subject in the clause occurring as the other internal argument of the control verb.
#Id:e000660/79129
Indirect object control verbs (e.g. pedir) select for an Indirect Object PP-IO-ARG31, signaling that it is both the indirect object of the control verb and the subject in the clause occurring as the other internal argument of the control verb.
7.10 "though" constructions
In "though" constructions, the sentential complement of the adjective, introduced by the preposition de, and projected by an inflected infinitive, has a phonetically null object marked with *TOUGH*:
#Id:a012/591
For more details, see also section 7.6 on infinitives and 6.4 on "though" null objects.
7.11 clitics
Clitics project NPs. In terms of constituency, they enter the same positions as any NP projected from an N with similar grammatical function (see also section 10.4 on the tokenization of clitics).
8 Long-distance relations
Long-distance relations are established between a constituent and a right downwards position in the tree where this constituent typically occurs in (declarative) counterparts with canonical SVO word order. Constructions with long-distance relations include topicalizations, interrogatives and relatives.
The canonical position is marked by a phonetically null item *GAP* which is coindexed with the constituent with which it establishes a long-distance relation.
[ProbParser: The long-distance dependency is represented by decorating every node in the path inside the tree connecting the node immediately dominating the putative gap and the sister node of the "displaced" constituent. The nodes in that path are decorated by concatenating to their category tags a slash "/" followed by the triple CAT-GF-SR of that "displaced" constituent, where CAT is its category, GF is its grammatical function, and SR is its semantic role.]
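The ProbParser relabelling just described can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the procedure used to build the ProbParser variant: the Node class, the function names path_to and mark_slash_path, and the example tree are all hypothetical; the gap is modelled as a terminal node whose word is the *GAP* mark; coindexation suffixes are left out; and the subsequent removal of the gap itself is not shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # nodes are compared by identity, not by content
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None  # set on terminal nodes only


def path_to(node, target):
    """Return the list of nodes from `node` down to `target`, or None."""
    if node is target:
        return [node]
    for child in node.children:
        sub = path_to(child, target)
        if sub is not None:
            return [node] + sub
    return None


def mark_slash_path(root, filler, gap):
    """Append "/CAT-GF-SR" of the filler to every node on the path from
    the filler's sister down to the node immediately dominating the gap."""
    to_filler = path_to(root, filler)
    to_gap = path_to(root, gap)
    parent = to_filler[-2]            # node dominating filler and its sister
    start = to_gap.index(parent) + 1  # position of the filler's sister
    for node in to_gap[start:-1]:     # the gap itself is left untouched
        node.label = f"{node.label}/{filler.label}"


# Hypothetical topicalization: the filler is adjoined to the inner S.
filler = Node("NP-DO-ARG2", word="isso")
gap = Node("NP-DO-ARG2", word="*GAP*")
vp = Node("VP", [Node("V", word="viram"), gap])
inner_s = Node("S", [Node("NP-SJ-ARG1", word="todos"), vp])
root = Node("S", [filler, inner_s])

mark_slash_path(root, filler, gap)
print(inner_s.label)  # S/NP-DO-ARG2
print(vp.label)       # VP/NP-DO-ARG2
```

Only the nodes between the filler's sister and the gap's mother are relabelled; the topmost S and the subject NP keep their original tags.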
8.1 topicalization
The topicalized constituent is in adjunction to the constituent from which it was topicalized:
#Id:c049/24856
8.2 relatives
The relative phrase projects a CP immediately dominating the constituent S from which it was relativized.
See also section 7.1 on relatives.
8.3 interrogatives
In its current version, the corpus does not yet contain interrogatives with long-distance relations.
9 Valency alternations
9.1 passives
The by-phrase is an internal complement of the past participle verb form, with grammatical function OBL and semantic role ARG1.
The corresponding Subject bears ARG2cp (see also section 7.8 on complex predicates above):
#Id:b179/11830
9.2 anticausatives
The Subject of an anticausative verb is ARG2ac:
#Id:e000530/68760
In a predication supported by the transitive counterpart of a possible anticausative verb, the Subject is ARG1, as with other transitive verbs. As expected, in its passive alternation, the Subject is ARG2.
10 Tokenization
10.1 sentence splitting
Sentences are split at the expected points. Worth mentioning is the case of utterances involving a colon ":", which are split into two separate entry sentences in the treebank, one preceding the colon and the other following it.
10.2 non-verbal utterances
Titles of newspaper articles, stretches around colons, etc. are cases of possible utterances in the corpus which are not projected by a corresponding verbal head. In any case, every entry utterance in the corpus is dominated by an S node.
10.3 contractions
Contractions are expanded. The first element of an expanded contraction is marked with an "_" (underscore) symbol, for instance do → |de_|o|.
10.4 clitics
Clitics are detached from the verb. The detached clitic is marked with a "-" (hyphen) symbol, as for instance dá-se-lho → |dá|-se|-lhe|-o|
When in mesoclisis, a "-CL-" mark is used to signal the original position of the detached clitic: afirmar-se-ia → |afirmar-CL-ia|-se|
Possible vocalic alterations of the verb form are marked with a "#" (hash) symbol, as for instance in vê-las → |vê#|-las|.
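The two simplest conventions of this section, the expansion of contractions and the detachment of enclitics, can be sketched as follows. This is not the tokenizer used for the corpus. The lookup tables are tiny illustrative samples, the function names are hypothetical, input verb forms are assumed to be written with hyphens (e.g. dá-se-lho), and mesoclisis, vocalic alterations and ambiguous strings are not handled.

```python
# Illustrative sample tables, far from complete (assumptions of this sketch).
CONTRACTIONS = {"do": ["de_", "o"], "da": ["de_", "a"]}
CLITIC_CONTRACTIONS = {"lho": ["lhe", "o"]}


def expand_contraction(token):
    """Expand a contraction, marking its first element with "_"."""
    return CONTRACTIONS.get(token.lower(), [token])


def detach_enclitics(token):
    """Detach hyphenated enclitics, marking each detached clitic with "-".

    Every hyphen is treated as a clitic boundary, so hyphenated compounds
    would be split wrongly; a real tokenizer needs to check for verb forms.
    """
    verb, *clitics = token.split("-")
    pieces = [verb]
    for clitic in clitics:
        for part in CLITIC_CONTRACTIONS.get(clitic, [clitic]):
            pieces.append("-" + part)
    return pieces


print(expand_contraction("do"))       # ['de_', 'o']
print(detach_enclitics("dá-se-lho"))  # ['dá', '-se', '-lhe', '-o']
```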
11 Multi-word expressions
11.1 Proper names
Multi-word proper names form a flat constituent where every word is a sister of each other, is of category N, and is dominated by a single common N' node. This head projects an NP except when it is a modifier of a common noun:
#Id:b005/254
See also sections 2.6.1 on bare NPs, and 5.2 on NPs.
11.2 cardinals
Complex cardinals have a flat structure like a multi-word named entity.
#Id:e000650/78330
12 Textual marking
12.1 punctuation
Each punctuation mark is a constituent of category PNT.
End-of-sentence markers are in topmost adjunction.
12.2 comma
Commas separating left periphery constituents are right adjoined to these constituents.
Commas surrounding appositions are topmost constituents of the appositive constituent.
#Id:b029/1761
Commas with coordinative value are represented like lexical coordinative conjunctions (for further details, see section 7.4 on coordination):
#Id:a001/30
Commas surrounding parentheticals are adjoined to the surrounded constituent. With several parentheticals in sequence, the first one is surrounded, and the following ones have a single comma at their right:
#Id:b227/16800
Commas emphasizing conjunctions, thus immediately preceding them, are right adjoined to the left coordinated constituent:
#Id:b184/12279
Other "pause" commas are left adjoined:
#Id:b128/8002
12.3 quotation marks
Quotation marks surrounding constituents are adjoined to them. When they surround a lexical item of category X, they are dominated by an X' node together with X:
#Id:b010/654
Quotation marks surrounding strings not forming a constituent are adjoined to the highest possible node:
#Id:b091/5858
13 References
Barreto, Florbela, António Branco, Eduardo Ferreira, Amália Mendes, Maria Fernanda Nascimento, Filipe Nunes and João Silva, 2006, "Open Resources and Tools for the Shallow Processing of Portuguese", Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC2006), Genoa, Italy.
Branco, António, Sérgio Castro, João Silva and Francisco Costa, 2011, CINTIL DepBank Handbook: Design options for the representation of grammatical dependencies.