Download - Treebanks and MWEs (Part 1) Jan Hajič, Pavel Straňák, Jiří Mírovský Institute of Formal and Applied Linguistics & LINDAT/CLARIN School of Computer Science.

Transcript

Treebanks and MWEs(Part 1)

Jan Hajič, Pavel Straňák, Jiří MírovskýInstitute of Formal and Applied Linguistics & LINDAT/CLARIN

School of Computer ScienceFaculty of Mathematics and Physics

Charles University in PragueCzech Republic

19.1.2015 PARSEME Training School Prague 2

Outline

• Treebanks– Phrase-(Constituency-) based: The Penn Treebank– Dependency: The Prague Dependency Treebanks

• The Penn Treebank (basics)• The Prague Dependency Treebank

– Layers of Annotation• Morphology• Syntax• Semantics• Valency

THE PENN TREEBANK

19.1.2015 PARSEME Training School Prague 3

19.1.2015 PARSEME Training School Prague 4

Phrase- vs. Dependency-Based Treebanks

• The original: The Penn Treebank– Phrase-based style; good for parsing by CFG grammars

• Followers– Almost all Penn-based treebanks

• Chinese, Arabic, Korean, …

– Negra (German), many others

• Now: dependency parsing prevails• Conversion from phrase-based treebanks

– Might lose information, heads added „ad hoc“

• “native” dependency treebanks: annotated as such– Considered “better”– Hindi/Urdu, TIGER (sort of); both styles manually annotated– PDT (of course) and similar ones

» PDT style treebanks: Danish, Croatian, Slovene, Greek, Latin

The Penn Treebank

• Published (first) in 1993, now LDC99T42 (www.ldc.upenn.edu)– First the Wall Street Journal part (1 mil. words, 2312 documents)

• Added other text types– ATIS corpus (dialogs, travel reservations)

– Brown corpus annotated for syntax

– Switchboard (spoken language, tel. conversations)

19.1.2015 PARSEME Training School Prague 5

19.1.2015 PARSEME Training School Prague 6

Penn Treebank Format

• ( (S • (NP-SBJ • (NP (NNP Pierre) (NNP Vinken) )• (, ,) • (ADJP • (NP (CD 61) (NNS years) )• (JJ old) )• (, ,) )• (VP (MD will) • (VP (VB join) • (NP (DT the) (NN board) )• (PP-CLR (IN as) • (NP (DT a) (JJ nonexecutive) (NN director) ))• (NP-TMP (NNP Nov.) (CD 29) )))• (. .) ))

Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.

POS tag (NNS)(noun, plural)

Phrase label (NP)

Noun Phrase

“Preterminal”

The Penn Treebank(s)

• Extensions– Annotation of named entities, co-reference (BBN)

– cf. also previous slides

– Function labels (SBJ, OBJ, TMP, ...)– PropBank

• Penn Treebank syntax + Predicate-argument relations, added “frame files” (predicate dictionary)

(S (NP-SBJ (PRP I Arg0) VP (VBD gave Pred) (NP-DOBJ (PRP him Arg1) (NP-IOBJ (DET the) (NN book Arg2))) ... )

– NomBank• Like PropBank, but for nouns and their “arguments”

• Other languages (Chinese, Arabic, ...)

19.1.2015 PARSEME Training School Prague 7

THE PRAGUE DEPENDENCY TREEBANK

19.1.2015 PARSEME Training School Prague 8

19.1.2015 PARSEME Training School Prague 9

The Prague Dependency Treebanks: the Basics

• Original Treebank: PDT 1.0, 2001 (morf., dep. syntax)• First full release: PDT 2.0

– http://ufal.mff.cuni.cz/pdt2.0 • LDC2006T01, see http://www.ldc.upenn.edu

– Now: PDT 3.0: http://ufal.mff.cuni.cz/pdt3.0 • Basic general features

– Multilayered annotation, interlinked layers– Dependency-based syntax (both surface and deep)– Information structure of the sentence (topic/focus)– Grammatical and basic textual coreference– New: discourse relations, MWEs

• Languages: Czech, English (also parallel), Arabic– Student work on “samples”: Indonesian, Urdu, Russian, …– Spoken: work started on Czech and English (non-parallel, dialogs)

19.1.2015 PARSEME Training School Prague 10

The Prague Dependency

Treebank

• Three basic layers of annotation– Morphemic layer– Surface syntax (“analytical”) layer– “Tectogrammatical” layer:

underlying syntax, semantic roles (valency), inf. structure, coreference

• Size– 830,000 words (tokens) = 50000 sentences in 3165 full

documents (texts)• Format

– Prague Markup Language (XML-based)

– Now also: .treex format• For smooth uise in the TreeX platform• http://ufal.mff.cuni.cz/treex

19.1.2015 PARSEME Training School Prague 11

PDT (Czech) Data

• 4 sources:– Lidové noviny (daily newspaper, incl. extra sections)– DNES (Mladá fronta Dnes) (daily newspaper)– Vesmír (popular science magazine, monthly)– Českomoravský Profit (economical journal, weekly)

• Full articles selected– article ~ DOCUMENT (basic corpus unit)

• Time period: 1990-1995

• 1.8 million tokens (~110,000 sentences total)

19.1.2015 PARSEME Training School Prague 12

PDT Annotation Layers

• L0 (w) Words (tokens)– automatic segmentation and markup only

• L1 (m) Morphology– Tag (full morphology, 13 categories), lemma

• L2 (a) Analytical layer (surface syntax)– Dependency, analytical dependency function

• L3 (t) Tectogrammatical layer (“deep” syntax)– Dependency, “functor”, grammatemes, ellipsis

solution, coreference, topic/focus (deep word order), valency lexicon; PDT 3.0PDT 3.0: mass, clauses, formemes, discourse, ...

PD

T 1

.0

(200

1)

PD

T 2

.0

(200

6)

19.1.2015 PARSEME Training School Prague 13

PDT Annotation Layers

• L0 (w) Words (tokens)– automatic segmentation and markup only

• L1 (m) Morphology– Tag (full morphology, 13 categories), lemma

• L2 (a) Analytical layer (surface syntax)– Dependency, analytical dependency function

• L3 (t) Tectogrammatical layer (“deep” syntax)– Dependency, functor (detailed), grammatemes,

ellipsis solution, coreference, topic/focus (deep word order), valency lexicon

19.1.2015 PARSEME Training School Prague 14

Morphological Attributes

Tag: 13 categories Example: AAFP3----3N---- Adjective no poss. Gender negated Regular no poss. Number no voice Feminine no person reserve1 Plural no tense reserve2 Dative superlative base var.

Lemma: POS-unique identifierBooks/verb -> book-1, went -> go, to/prep. -> to-1

Ex.: nejnezajímavějším“(to) the most uninteresting”

19.1.2015 PARSEME Training School Prague 15

PDT Annotation Layers

• L0 (w) Words (tokens)– automatic segmentation and markup only

• L1 (m) Morphology– Tag (full morphology, 13 categories), lemma

• L2 (a) Analytical layer (surface syntax)– Dependency, analytical dependency function

• L3 (t) Tectogrammatical layer (“deep” syntax)– Dependency, functor (detailed), grammatemes,

ellipsis solution, coreference, topic/focus (deep word order), valency lexicon

19.1.2015 PARSEME Training School Prague 16

Layer 2 (a-layer): Analytical Syntax

• Dependency + Analytical Function

dependent

governor

The influence of the Mexicancrisis on Central and EasternEurope has apparently been underestimated.

19.1.2015 PARSEME Training School Prague 17

Analytical Syntax: Functions

• Main (for [main] semantic lexemes):• Pred, Sb, Obj, Adv, Atr, Atv(V), AuxV, Pnom• “Double” dependency: AtrAdv, AtrObj, AtrAtr

• Special (function words, punctuation,...):• Reflexives, particles: AuxT, AuxR, AuxO, AuxZ, AuxY• Prepositions/Conjunctions: AuxP, AuxC• Punctuation, Graphics: AuxX, AuxS, AuxG, AuxK

• Structural• Elipsis: ExD, Coordination etc.: Coord, Apos

PDT Annotation Layers

• L0 (w) Words (tokens)– automatic segmentation and markup only

• L1 (m) Morphology– Tag (full morphology, 13 categories), lemma

• L2 (a) Analytical layer (surface syntax)– Dependency, analytical dependency function

• L3 (t) Tectogrammatical layer (“deep” syntax)– Dependency, functor (detailed), grammatemes,

ellipsis solution, coreference, topic/focus (deep word order), valency lexicon

19.1.2015 PARSEME Training School Prague 18

Tectogrammatical Annotation

• Underlying (deep) syntax

• 5 sublayers (integrated and/or standoff annotation):– dependency structure, (detailed) functors

• valency annotation

– topic/focus and deep word order

– coreference (mostly grammatical only)

– discourse

– all the rest (grammatemes): • detailed functors• underlying gender, number, mass nouns, ...

• Total: 39 attributes (vs. 5 at m-layer, 2 at a-layer)

19.1.2015 PARSEME Training School Prague 19

19.1.2015 PARSEME Training School Prague 20

Tectogrammatical vs. analytical syntax

TR: Nofunction words

AR: All words

Predicate verb

“Location”

In practice, that procedure will require making of certified copies.

Re-inserted elided actorof “making”

Dependency Structure

• Similar to the surface (Analytical) layer... ...but:– certain nodes deleted

• auxiliaries, non-autosemantic words, punctuation• (some) multiword expressions -> 1 node

– some nodes added• based on word (mostly verb, noun) valency• some ellipsis resolution

– detailed dependency relation labels (functors)

19.1.2015 PARSEME Training School Prague 21

Tectogrammatical Functors

• “Actants”: ACT, PAT, EFF, ADDR, ORIG

– modify: verbs, nouns, adjectives

– cannot repeat in a clause, usually obligatory

• Free modifications (~ 50), semantically defined– can repeat; optional, sometimes obligatory– Ex.: LOC, DIR1, ...; TWHEN, TTILL,...; RSTR; BEN, ATT, ACMP,

INTT, MANN; MAT, APP; ID, DPHR, CPHR, ...

• Special– Coordination, Rhematizers, Foreign phrases (#Forn),...

19.1.2015 PARSEME Training School Prague 22

“syntactic” semantic

MW

Es

Deep Word Order Topic/Focus

• Example:

• Baker bakes rolls. vs. BakerIC bakes rolls.

19.1.2015 PARSEME Training School Prague 23

Analyticaldep. tree:

Deep Word OrderTopic/Focus

• Deep word order:– from “old” information to the “new” one (left-to-

right) at every level (head included)– projectivity by definition (almost...)

• i.e., partial level-based order -> total d.w.o.

• Topic/focus/contrastive topic– attribute of every node (t, f, c)– restricted by d.w.o. and other constraints

19.1.2015 PARSEME Training School Prague 24

Coreference

• Grammatical (easy)– relative clauses

• which, who– Peter and Paul, who ...

– control• infinitival constructions

– John promised to go home

– reflexive pronouns• {him,her,thme}self(-ves)

– Mary saw herself in ...

19.1.2015 PARSEME Training School Prague 25

Johngo

he home

promisePRED

ACTPAT

ACT DIR3

Coreference

• Textual– Ex.: Peter moved to Iowa after he finished his PhD.

19.1.2015 PARSEME Training School Prague 26

Peter Iowafinish

he PhD

movePRED

ACT DIR1TWHEN

ACT PAT

heAPP

Grammatemes

• Detailed functors (“subfunctors”)– needed for some functors:

• TWHEN: before/after• LOC: next-to, behind, in-front-of, ...• also: ACMP, BEN, CPR, DIR1, DIR2, DIR3, EXT

• Lexical (underlying)– number (Sg/Pl), tense, modality, degree of

comparison, mass-noun?; is_person_name, is_dsp_root, ...

19.1.2015 PARSEME Training School Prague 27

MW

Es

VALENCY IN PDT

19.1.2015 PARSEME Training School Prague 30

Prague Dependency Treebank & Valency

• Valency in the PDT– Valency lexicon for PDT– General valency lexicon

• Valency in deep vs. surface syntax– Links between the layers w.r.t. valency

• Valency and word sense– Sense-disambiguated occurrences:

• Links from data to the lexicon

• Valency in translation, text generation

19.1.2015 PARSEME Training School Prague 31

Definition of Valency

• Ability (“desire”) of words (verbs, nouns, adjectives) to combine themselves with other units of meaning

• Properties of valency:– Specific for every word meaning (in general)

• leave: sb left sth for sb vs. sb left from somewhere• similar to PropBank leave.02 vs. leave.01

– Typically strongly correlates with surface form (Czech)• morphological case (~ ending), preposition+case, ...)

– Semantic constraints

19.1.2015 PARSEME Training School Prague 32

Structure of Valency

• word (lemma) – word sense group 1

• valency frame:– slot1 slot2 slot3

• surface expression

– word sense group 2 • ...

19.1.2015 PARSEME Training School Prague 33

vyměnit (to replace) vyměnit1

ACT PAT EFF

Nom. Acc.za+Acc.

vyměnit2

...

PDT-Vallex Entry

• dosáhnout: “to reach”, “to get [sb to do sth]”

• browser/user-formatted example:

19.1.2015 PARSEME Training School Prague 34

MWEs in PDT-Vallex

• Types included:– Reflexive particle (se, si)

• smát se – to laugh• všimnout si – to notice

– Idiomatic constructions• dosáhnout svého - to achieve one’s goals• běhá mi mráz po zádech – to give me the shivers

– Light verb constructions (and similar)• uzavřit dohodu – to agree [on sth], strike an

agreement, ...• vzbuzovat pochybnosti – to doubt, to raise doubts

19.1.2015 PARSEME Training School Prague 35

smát

_se

(t_l

emm

a)

DP

HR

(ar

gum

ent)

CP

HR

(ar

gum

ent)

Corpus ↔ Valency Lexicon

19.1.2015 PARSEME Training School Prague 36

• Corpus:

ENTRY: uzavřít (to close) vf1: ACT(.1) CPHR({smlouva}.4)

ex: u. dohodu (close a contract)vf2: ACT(.1) PAT(.4)

ex.: u. pokoj (close a room, house)

Lexicon:

Sentence 2035: Sentence 15345: Sentence 51042:

Valency & Text Generation

19.1.2015 PARSEME Training School Prague 37

• Using valency for...– ...getting the correct (lemma, tag) of verb arguments

• Example:

starat_se

PRED

Martin

ACT

tygr

PAT

Martin

....1..........

starat

V..............

o

...............

tygr

....4..........

VALLEX entry: starat (se) ACT(.1) PAT(o.[.4])

se

...............

Martin se stará o tygry.

“Martin takes care of tigers.”

“to take care of”

“tiger”

PARALLEL TREEBANK CZ-EN

19.1.2015 PARSEME Training School Prague 38

19.1.2015 PARSEME Training School Prague 39

Parallel Czech-English Annotation

• English text → Czech text (human translation)• Czech side (goal): all layers manual annotation• English side (goal):

– Morphology and surface syntax: technical conversion• Penn Treebank style -> PDT Analytic layer

– Tectogrammatical annotation: manual annotation• (Slightly) different rules needed for English

• Alignment– Natural, sentence level only (now)

19.1.2015 PARSEME Training School Prague 41

English Annotation POS and Syntax

• Automatic conversion from Penn Treebank– PDT morphological layer

• From POS tags

– PDT analytic layer• From:

– Penn Treebank Syntactic Structure– Non-terminal labels– Function tags (non-terminal “suffixes”)

• 2-step process– Head determination rules– Conversion to dependency + analytic function

Czech-English Example

19.1.2015 PARSEME Training School Prague 46

Dicku Darmane, zavolejte do své kanceláře! Dick Darman, call your office!

SUMMARY OF PART 1/1

19.1.2015 PARSEME Training School Prague 47

19.1.2015 PARSEME Training School Prague 48

PDT Treebanks at UFAL (written language)

• Czech– Prague Dependency Treebank

• Complex annotation, all levels, additional annotation

– Translation of Penn Treebank• Tectogrammatical layer only, no t/f

– Analytical, morphology: automatic tool

• English– Re-annotation of Penn Treebank

• Other languages– Arabic (own annotation)– Other: by conversion (HamleDT – 30 treebanks)

Prague Dependency Treebanks

• Annotation:– 4 layers:

• Words, lemmas/tags, surface dep. syntax, tectogrammatics

– Tectogrammatical layer:• No function words, semantic relations• Valency/verb arguments (some MWE features)

– Separate valency lexicon, fully linked from PDT nodes

• Coreference, Topic/focus, Discourse • Links back to analytical layer (parsing!)

19.1.2015 PARSEME Training School Prague 49

Now also in the Universal Dependency format!

https://github.com/UniversalDependencies

19.1.2015 PARSEME Training School Prague 50

Pointers• PDT 2.0 (the “Original”), newest version: PDT 3.0

– http://ufal.mff.cuni.cz/pdt2.0– http://ufal.mff.cuni.cz/pdt3.0

• PCEDT– http://ufal.mff.cuni.cz/pcedt2.0/

• PEDT– English side of PCEDT, additional: NE, coreference– http://ufal.mff.cuni.cz/pedt2.0/

• PADT (Arabic, morphology + surface syntax)– http://ufal.mff.cuni.cz/padt

• Other corpora, PDT-Vallex, EngVallex:– Search at http://lindat.cz

• LDC catalog numbers:– LDC2006T01 (PDT 2.0), LDC2004T23 (PADT 1.0), LDC2004T25 (PEDT 1.0)

• CoNLL 2009 shared task (7 languages, surface syntax + predicate arguments only)

– http://ufal.mff.cuni.cz/conll2009-st • HamleDT 2.0 (30 treebanks in unified format)

– http://ufal.mff.cuni.cz/hamledt