A treebank for everyone - cuni.cz
Transcript of A treebank for everyone - cuni.cz
A treebank for everyone
Milena Hnátková, Petr Jäger, Tomáš Jelínek,Vladimír Petkevic, Alexandr Rosen, Hana Skoumalová
Ústav teoretické a komputacní lingvistikyFilozofická fakulta Univerzity Karlovy v Praze
Institute of Theoretical and Computational LinguisticsFaculty of Arts, Charles University, Prague
Zespół Lingwistyki Korpusowej Jezyków SłowianskichInstytut Slawistyki Zachodniej i Południowej
Uniwersytet Warszawski19 czerwca 2012
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 1 / 55
Outline of the talk
1 Introduction
2 Architecture
3 Examples
4 Processing of the input text
5 Conclusions and plans
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 2 / 55
Introduction
Outline of the talk
1 Introduction
2 Architecture
3 Examples
4 Processing of the input text
5 Conclusions and plans
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 3 / 55
Introduction About treebanks
About treebanks
Treebank = a corpus annotated with syntactic structureProbably first major project: Penn Treebank (release v. 0.5, 1992)Now: 74 treebanks in about 40 languages (Wikipedia)Different in:
sizelinguistic backgroundformatlevel of detaildepth of analysisways they are built
Parallel treebanks, semantic annotation, ...
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 4 / 55
Introduction About treebanks
Why are treebanks useful?
Explicit markup of syntactic relations(constituents; heads and dependents)– easier to identify semantic relations(predicates/functors and arguments)– simplifies some queries– simplifies extraction of lexical properties (valency) or syntacticrules– support for grammar development
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 5 / 55
Introduction About treebanks
Treebanks of Czech
Prague Dependency Treebank
Dependency syntax, close to the Prague theory of FunctionalGenerative Description [Sgall et al.(1986)]3 annotation levels: morphology, surface syntax, deep syntaxPDT 0.5 – 1998, 0.5M tokensPDT 1 – 2000, 1.5M tokensPDT 2 – 2004, deep syntax
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 6 / 55
Introduction About treebanks
Some other treebanks
Polish Constituency TreebankBulTreeBankIness Treebanking InfrastructureLingo RedwoodsTigerLASSY...
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 8 / 55
Introduction Why a new project
Why isn’t PDT good enough?
Too small for investigating less frequent forms and phenomenaTheoretical bias –– distracts some people used to constituencystructure
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 12 / 55
Introduction Why a new project
How to build a large treebank?
Treebanks built with human assistance,even semi-automatically, are too small.They can be built by a stochastic parser.Or by a rule-based parser,with the advantage of correspondence to the grammar
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 13 / 55
Introduction Why a new project
Can a single core annotationbe viewed in different ways?
Theory-specific representations have different appearancesbut share a large part of content:constituency/dependency, morphosyntactic categories,even the spirit of analyses of many phenomenaA treebank offering different views of a sufficiently expressiveannotation scheme is a realistic goalAdditional benefit: relating linguistic theories
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 14 / 55
Introduction Why a new project
A larger treebank with customizable visualization?
A project (2010––2012) aiming at:
Syntactic annotation of the Czech National Corpus(1.3 billion words) using a stochastic parser,followed by a rule-based correction moduleRobust and expressive core annotation format, potentiallyunderspecifiedCustomizable query, visualization and export interface,offering multiple options to view syntactic structureAccessible to lay users and satisfying experts at the same time
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 15 / 55
Introduction Why a new project
Hopefully a follow-up project (2013–2015) aiming at:
Development of a corpus-based grammarOptions for queries, visualization and export:
ready-made, tailored to specific theories, ordefinable by the user
Improvements of the correction module
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 16 / 55
Introduction Why a new project
A corpus grammar?
to bridge the gap between the empirical and the theoreticalfor grammar developmentfor checking consistencyfor adding more infofor assisting the treebank userto help converting the data onto other formats more easily
A corpus grammar can be:
hand-craftedextracted from corpushand-crafted but verified against the corpus dataadvantage: incremental development
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 17 / 55
Architecture
Outline of the talk
1 Introduction
2 Architecture
3 Examples
4 Processing of the input text
5 Conclusions and plans
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 18 / 55
Architecture Syntactic structure
Syntactic structureInternal skeleton structures: constituency-based, with acombination of binary and flat branchingInterpretable as constituency or dependency trees, according tousers’ specification, visualized with an arbitrary amount of detail,not necessarily by tree graphsSurface and deep structure encoded within a single structure:constituents are labelled as syntactic functionsincluding head as a special functionHeads are further specified as deep or surface
Deep head: deep syntactic governor: bylo by se to povedloSurface head: can be identical to the deep head or different:auxiliary, prepositions, subordinate conjunctions, numerals
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 19 / 55
PRED
šel
AUXVbyl
AUXVby
AUXPdo
ADV
lesaS
SHD
byDHD
SHD
bylDHD
HD
šelADVB
SHD
doDHD
lesa
Architecture Syntactic structure
Three levelsWord order and syntactic structure as distinct dimensions, eachsentence is represented at three inter-linked levels:
graphemics (orthographic words, contractions)morphology (syntactic words, including haplologized items)syntax (trees, no nodes for pro-dropped subjects)
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 21 / 55
Architecture Syntactic structure
Annotation of syntactic phenomenaAgreement of various typesCompound periphrastic verbal forms (passives, conditionalstructures, future...)Grammatical co-reference (grammatical control, relative/reflexivepronouns, predicative complements)Multi-word units (collocations)
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 22 / 55
Architecture Syntactic structure
Expressive powerExpressive enough to accommodate analyses of arbitrarygranularityAmbiguous or undecidable phenomena represented byunderspecification and distributive disjunctionAnnotation of any kind can be missing, a sentence may be a merelist of words
SpecificationsAnnotation must be licensed by a formal grammar. Words andconstituents have their appropriate (potentially underspecified)sets of featuresLexicons are used to index forms, syntactic words and compoundformsCustomizable visualizations are enabled by formal definitions
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 23 / 55
Architecture General layout
Representation layers:textmorphologysyntactic structure
Lexicons:morphologicalsyntacticmulti-word expressions
Two-way links between layers, and between layers and lexiconsto link information across the layersto provide lexeme-specific informationto identify multi-word expressions, including periphrastic forms
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 24 / 55
Architecture General layout
Links within a treeAgreementCompound (multi-word) verbal predicatesGrammatical coreference...
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 25 / 55
Architecture General layout
word layer
morphological layer
syntactic layer
Morphological lexicon
Syntactic lexicon
Multi-word lexicon
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 26 / 55
Architecture The layers
Word layer
tokenized, including punctuation and MWE: cesko-slovenskýcontractions left intact (not interpreted): ocs, ses
Morphological layer
morphological analysis and lemmatization of all formscontractions split (i.e. interpreted): ocs → o co jsisome punctuation marks glued back with word forms:atd., cesko-
Syntactic layerconstituency-based structure, representable according to user’soptions/specificationspunctuation omitted
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 27 / 55
Architecture Construction types and syntactic functions
Syntactic structureeach nonterminal node is assigned a construction type and asyntactic functioneach terminal node is assigned a syntactic function
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 28 / 55
Architecture Construction types
Hierarchy of construction typesHeadedUnHeaded
Coord – coordinationAdord – adordinationUnspec – unspecified (for collocations and other)
Function for UnHeaded structures:Memb – a member
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 29 / 55
Architecture Syntactic functions
Syntactic functions for HeadedSurfHead – surface head: auxiliary být/bývat, prepositions,subordinate conjunctions, numerals in quantified expressions: petdetíDeepHead – in case it differs from SurfHead (head nouns in PPs,autosemantic verbs in analytical predicates...)Head – both SurfHead and DeepHead
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 30 / 55
Architecture Syntactic functions
Other syntactic functions for HeadedSubj – subjectAttr – attributeObj-Advb
ObjAdvb
VbAttr – predicative complementReflTant – reflexive element (si, se) for inherent reflexivesDeagent – deagentive reflexiveApos – appositionInDep – independent syntactic element (parenthesis, vocativesyntactic noun...)
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 31 / 55
Architecture Lexicons
Morphological lexicon
List of lemmas with inflection paradigmsA lemma is introduced if two words differ in morphologicalparadigms – not only in syntactic properties or in semantics:
travicka ‘little grass’/‘female poisoner’ has only one lemma,clen ‘member’ has two lemmas, as it is either masculine animateor masculine inanimate.
Syntactic lexiconlist of lemmas with their syntactic propertiesinherent reflexives have separate entries – rozhodnout androzhodnout se ‘decide’,videt ‘see’ is one entryvalency frame entriesdifferent valency frames listed under one lemma
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 32 / 55
Architecture Lexicons
Multi-word lexiconcollocations: krížem krážem ‘in all directions’nechat na holickách ‘leave in the lurch’types of analytical verb forms: bych byl prišel ‘(I) would havecome’, jsi prišel ‘(you) have come’inherent reflexives: usmíváš se ‘(you) smile’;rozhodne se ‘(he) decides’(reflexive) deagentive construction: jde se ‘let’s go’reflexive passives: bábovka se pece ‘the cake is baking/beingbaked’analytical passives: je cten ‘is read’nominal predicates: je velký ‘is big’agreement patterns:Lucie ho videla opilého ‘Lucie saw him drunk’
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 33 / 55
Examples
Outline of the talk
1 Introduction
2 Architecture
3 Examples
4 Processing of the input text
5 Conclusions and plans
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 34 / 55
Examples Contractions
Treating contractions
(1) Ty by ses byl ušpinil.you would REFL+AUX2nd ,sg bepple get dirtypple‘You would have got dirty.’
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 35 / 55
Examples Contractions
Ty by ses byl ušpinil.
(2)
S
SUBJty
HEAD
SURFHEAD
by s
DEEPHEAD
SURFHEAD
bylDEEPHEAD
HEAD
ušpinilOBJse
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 36 / 55
Examples Contractions
(3) Surface dependency structure derived from (2)
by+s
SUBJty
byl
ušpinil
OBJse
(4) Deep dependency structure derived from (2)
bys,byl,ušpinil
SUBJty
OBJse
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 37 / 55
Examples Subject/object ambiguity
Subject/object ambiguity
Reflexive passive:
(5) ZarízeníNom/Gen se využívá.device REFL uses‘The device is being used.’
S
SUBJ
zarízeníHEAD
DEAGENT
seHEAD
využívá
S
HEAD
využíváDEAGENT
seOBJ
zarízení
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 38 / 55
Examples Another type of subject/object ambiguity
Another type of subject/object ambiguity
(6) Zdravotnictví musí zachránit stát.health servicenom/acc must save statenom/acc
Two different readings:
#1 Health service must save the State.#2 Health service must be saved by the government.
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 39 / 55
Examples Another type of subject/object ambiguity
S
SUBJ
zdravotnictvíHEAD
HEAD
musíOBJ
HEAD
zachránitOBJ
státS
SUBJ
státHEAD
HEAD
musíOBJ
HEAD
zachránitOBJ
zdravotnictví
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 40 / 55
Examples Another type of subject/object ambiguity
Morphological analysis of (6) with some values unspecified:
1 zdravotnictví noun, CASE=X, NUM=sg, GEND=n2 musí verbfin, PERS=3, NUM=sg3 zachránit verbinf4 stát noun, CASE=Y, NUM=sg, GEND=m
Constituents in one of the two possible syntactic structures of (6),some boxed numbers refer to the forms above:
5 [ 3 zachránit 4 stát ]6 [ 2 musí 5 ]7 [ 1 zdravotnictví 6 ]
Two possible structures with constraints on category values andoverriding clauses:
#1 = 7 , X=nom, Y =acc#2 = 7 , X=acc, Y=nom, 1 → 4 , 4 → 1
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 41 / 55
Examples Representation of a sentence
Multi-word lex.
V se...
Rozhodl se umýt .(he) decided REFL (to) wash .
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 42 / 55
Examples Representation of a sentence
Multi-word lex.
V se...
Rozhodl se umýt .(he) decided REFL (to) wash .
VpMS---3R-AA--- TT-------------
P7-X4----------
Vf--------A----
Z:-------------
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 42 / 55
Examples Representation of a sentence
Morph. lex.
rozhodnoutse_1se_2umýt...
Multi-word lex.
V se...
Rozhodl se umýt .(he) decided REFL (to) wash .
VpMS---3R-AA--- TT-------------
P7-X4----------
Vf--------A----
Z:-------------
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 42 / 55
Examples Representation of a sentence
Morph. lex.
rozhodnoutse_1se_2umýt...
Multi-word lex.
V se...
Rozhodl se umýt .(he) decided REFL (to) wash .
VpMS---3R-AA--- TT-------------
P7-X4----------
Vf--------A----
Z:-------------
Head Obj
Head
ReflTant Head Obj
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 42 / 55
Examples Representation of a sentence
Morph. lex.
rozhodnoutse_1se_2umýt...
Multi-word lex.
V se...
Rozhodl se umýt .(he) decided REFL (to) wash .
VpMS---3R-AA--- TT-------------
P7-X4----------
Vf--------A----
Z:-------------
Head Obj
Head
ReflTant Head Obj
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 42 / 55
Examples Representation of a sentence
Morph. lex.
rozhodnoutse_1se_2umýt...
Synt. lex.
rozhodnoutrozhodnout_se
se_1se_2umýt...
Multi-word lex.
V se...
Rozhodl se umýt .(he) decided REFL (to) wash .
VpMS---3R-AA--- TT-------------
P7-X4----------
Vf--------A----
Z:-------------
Head Obj
Head
ReflTant Head Obj
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 42 / 55
Examples Representation of a sentence
Morph. lex.
rozhodnoutse_1se_2umýt...
Synt. lex.
rozhodnoutrozhodnout_se
se_1se_2umýt...
Multi-word lex.
V se...
Rozhodl se umýt .(he) decided REFL (to) wash .
VpMS---3R-AA--- TT-------------
P7-X4----------
Vf--------A----
Z:-------------
Head Obj
Head
ReflTant Head Obj
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 42 / 55
Examples Representation of a sentence
Morph. lex.
rozhodnoutse_1se_2umýt...
Synt. lex.
rozhodnoutrozhodnout_se
se_1se_2umýt...
Multi-word lex.
V se...
Rozhodl se umýt .(he) decided REFL (to) wash .
VpMS---3R-AA--- TT-------------
P7-X4----------
Vf--------A----
Z:-------------
Head Obj
Head
ReflTant Head Obj
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 42 / 55
Processing of the input text
Outline of the talk
1 Introduction
2 Architecture
3 Examples
4 Processing of the input text
5 Conclusions and plans
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 43 / 55
Processing of the input text
Processing of the input text:Word and sentence segmentation → sentence on the graphemiclevelMorphological analysis and disambiguation (rule-baseddisambiguation, stochastic tagger, collocation module) →sentence on the morphological levelStochastic parser applied to the disambiguated sentenceAutomatic correction of the parseConversion of the corrected parse + modifications:
phenomena that require arbitrary decisions in a dependency tree:constructions with function words, coordinated constructions, listsdisjunction accounting for structural ambiguities expressed by“combined functions” AttrAdv, ObjAdv
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 44 / 55
Processing of the input text
Syntactic tree in the PDT and the new format
(7) Most,Bridge
kterýwhich
bylwas
vin
havarijnímemergency
stavu,state
byshould
melhavemodal
sloužitserve
dalšíchnext
tricetthirty
let.years.
‘The bridge, which was ramshackle, should serve for anotherthirty years.’
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 45 / 55
Processing of the input text Correction module
Correction module
30 correction rules so farfor more frequent errors which can be reliably correctedsuch as noun in accusative as subject
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 49 / 55
Processing of the input text Correction module
Success rate of the correction modules
rules dependency label totalClauses 6 1688 774 1744NP 8 819 2066 2625PP 9 834 7160 7722Other 5 412 1390 1802Total (ppm) 3753 11390 13893Total (%) 0.38% 1.14% 1.39%
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 50 / 55
Conclusions and plans
Outline of the talk
1 Introduction
2 Architecture
3 Examples
4 Processing of the input text
5 Conclusions and plans
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 51 / 55
Conclusions and plans
Conclusions and plans
Results200M corpus parsed and correctedFirst version of a viewer with three modes of representation
Further workManual tagging of a 2M training corpus for improvement of thetaggerManual parsing of sentences for improvement of the parserDetection of more errors made by the parser and their correctionCreating a corpus with improved toolsEnabling more modes of viewing the syntactic structureGrammar development
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 52 / 55
Conclusions and plans
Acknowledgment
Work on this project was supportedby the Grant Agency of the Czech Republic
as project No. P406/10/0434.
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 53 / 55
Conclusions and plans
Thank you for your attention!
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 54 / 55
Bibliography
Sgall, P., Hajicová, E., & Panevová, J. (1986).The Meaning of the Sentence in its Semantic and PragmaticAspects.Reidel and Academia, Dordrecht and Praha.Editor: Jacob Mey.
A. Rosen et al. (CU Prague) A treebank for everyone Zespół Lingwistyki Korpusowej 55 / 55