The Prague (Czech-)English Dependency Treebank

Jan Hajič

Charles University in Prague

Computer Science School

Institute of Formal and Applied Linguistics

Major contributions by:

E: Silvie Cinková, Jana Šindlerová, Josef Toman, (J. Semecký)

C: Marie Mikulová, Zdeňka Urešová, Jan Štěpánek

June 8, 2009 Dependency Workshop Boulder, CO Czech-English Dependency Treebank

Today...

• The family of Prague Dependency Treebanks
  – Incl. the Prague (Czech-)English Dependency Treebank

• English “Tectogrammatical Representation” (TR)
  – Annotation layers
  – From Penn Treebank (et al.) to PDT-style English tectogrammatics
  – TR annotation of 5 interesting English phenomena

• The annotation process
  – TrEd, EngVallex and the current status

• To take home + pointers

The Family of Prague Dependency Treebanks

• Prague Dependency Treebank (Czech)
  – 2001: version 1.0 (no deep syntax/semantics)
  – 2006: version 2.0 (w/deep syntax, semantics)

• Prague Czech-English Dependency TB 1.0
  – 2004: automatic annotation
  – English: PTB, Czech: 1/3rd of PTB translated

• Prague Arabic Dependency Treebank 1.0
  – 2004: ~ PDT 1.0 (no deep syntax)

The Prague Czech-English Dependency Treebank

• Penn Treebank
  + PropBank
  + BBN (co-reference and Named Entities)
  + NP structure (D. Vadas, J. R. Curran, ACL’07)
  + “Czech-like” tectogrammatics

• Translation to Czech
  – Manual annotation (with auto pre-annotation)

• Morphology, Syntax, Tectogrammatics (TR)

Example: English TR

• Words

• Dependencies

• Sem. function

• Valency (predicates)

• Coref (BBN)

• Named Entities (BBN)

Layers of Annotation

• t-layer – tectogrammatics

• a-layer – (surface) syntax

• m-layer – morphology (POS)

• w-layer – words (tokens)
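
The four layers can be pictured as records that point downward, from the t-layer to the w-layer. The following is a minimal sketch only: the class names, field names and the `AuxV`/`Pred` labels are illustrative assumptions, not the actual PDT file format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WNode:
    """w-layer: one token of the original text."""
    token: str


@dataclass
class MNode:
    """m-layer: lemma and POS tag, linked to its w-layer token."""
    w: WNode
    lemma: str
    tag: str


@dataclass
class ANode:
    """a-layer: surface-syntax node, linked to its m-layer unit."""
    m: MNode
    afun: str                      # analytic (surface) function
    parent: Optional["ANode"] = None


@dataclass
class TNode:
    """t-layer: deep-syntax node; may cover several a-layer nodes."""
    t_lemma: str
    functor: str
    a_nodes: List[ANode] = field(default_factory=list)
    parent: Optional["TNode"] = None


def surface_tokens(t: TNode) -> List[str]:
    """Follow the links t -> a -> m -> w and return the covered tokens."""
    return [a.m.w.token for a in t.a_nodes]


# "will join": two surface nodes, one tectogrammatical node (assumed labels).
a_join = ANode(MNode(WNode("join"), "join", "VB"), afun="Pred")
a_will = ANode(MNode(WNode("will"), "will", "MD"), afun="AuxV", parent=a_join)
t_join = TNode(t_lemma="join", functor="PRED", a_nodes=[a_will, a_join])

print(surface_tokens(t_join))
```

Keeping the links explicit is what lets a tool move from a deep node back to the exact words it stands for.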

English Surface Syntax

• From PTB:
  – Form
  – POS Tag
  – Function label
  – (Structure)

• Added
  – Lemma
  – Heads

Head Determination Rules

• Exhaustive set of rules
  – By J. Eisner + M. Čmejrek/J. Cuřín
  – 4000 rules (non-terminal based)
    • Ex.: (S (NP-SBJ VP .)) → VP
  – Additional rules
    • Coordination, Apposition
    • Punctuation (end-of-sentence, internal)

• Original idea (possibility of conversion)
  – J. Robinson (1960s)

Example: Head Determination Rules

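[Figure: phrase-structure tree over the words “will join the board”, with the head word marked at each node: (board) for the NP over (the) and (board), and (join) for both VP nodes and for S.]

Rules:

(NP (DT NN)) → NN

(VP (VB NP)) → VB

(VP (MD VP)) → VP

(S (… VP …)) → VP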
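
The four rules on this slide are enough to turn the example phrase structure into dependencies. The sketch below is a toy illustration, not the actual rule set or converter: the nested-tuple tree encoding, the function name `find_head`, the subject word and the extra `NP-SBJ` rule are assumptions added to make the example complete.

```python
# A tree node is (label, child, ...); a leaf is (POS, word).
# Exact-match rules: (parent label, child labels) -> index of the head child.
HEAD_RULES = {
    ("NP", ("DT", "NN")): 1,    # (NP (DT NN)) -> NN
    ("VP", ("VB", "NP")): 0,    # (VP (VB NP)) -> VB
    ("VP", ("MD", "VP")): 1,    # (VP (MD VP)) -> VP
    ("NP-SBJ", ("NNP",)): 0,    # assumption, needed only for this example
}
# Wildcard rules: parent label -> label of the head child, wherever it occurs.
WILDCARD_RULES = {"S": "VP"}    # (S (... VP ...)) -> VP


def is_leaf(node):
    return len(node) == 2 and isinstance(node[1], str)


def find_head(node, deps):
    """Return the head word of node; append (dependent, head) pairs to deps."""
    if is_leaf(node):
        return node[1]
    label, children = node[0], node[1:]
    child_labels = tuple(child[0] for child in children)
    idx = HEAD_RULES.get((label, child_labels))
    if idx is None:
        # Fall back to the wildcard rule; raises KeyError if none applies.
        idx = child_labels.index(WILDCARD_RULES[label])
    heads = [find_head(child, deps) for child in children]
    for i, word in enumerate(heads):
        if i != idx:
            deps.append((word, heads[idx]))
    return heads[idx]


TREE = ("S",
        ("NP-SBJ", ("NNP", "Vinken")),
        ("VP", ("MD", "will"),
               ("VP", ("VB", "join"),
                      ("NP", ("DT", "the"), ("NN", "board")))))

deps = []
root = find_head(TREE, deps)
print(root, deps)
```

Every non-head child becomes a dependent of the head child's head word, so “the” attaches to “board”, while “board”, “will” and the subject attach to “join”.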

Conversion: Analytic Structure, Functions

• Syntactic Function assignment (conversion)

• Rules
  – based on PTB functional tags:
      -SBJ → Sb
      -PRD → Pnom
      -BNF, -DTV, -LGS → Obj
      -ADV, -DIR, -EXT, -LOC, -MNR, -PRP, -PUT, -TMP → Adv
  – Ad-hoc rules (if functional tags missing)
  – Lemmatization (years → year)
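
The tag-based part of the conversion amounts to a lookup table. The sketch below uses exactly the mapping listed on the slide; the function name `afun_for_label` and the `default` argument (a stand-in for the ad-hoc rules, which the slides do not spell out) are assumptions.

```python
# PTB functional tag -> analytic function, as listed on the slide.
FUNC_TAG_TO_AFUN = {
    "SBJ": "Sb", "PRD": "Pnom",
    "BNF": "Obj", "DTV": "Obj", "LGS": "Obj",
    "ADV": "Adv", "DIR": "Adv", "EXT": "Adv", "LOC": "Adv",
    "MNR": "Adv", "PRP": "Adv", "PUT": "Adv", "TMP": "Adv",
}


def afun_for_label(ptb_label, default=None):
    """Map a PTB label such as 'NP-SBJ-1' to an analytic function.

    The first known functional tag wins; co-indexing numbers and unknown
    tags are skipped. If no tag matches, `default` is returned, which is
    where ad-hoc rules would take over.
    """
    _category, *tags = ptb_label.split("-")
    for tag in tags:
        if tag in FUNC_TAG_TO_AFUN:
            return FUNC_TAG_TO_AFUN[tag]
    return default


for label in ("NP-SBJ", "PP-LOC", "NP-SBJ-1", "ADVP-TMP", "NP"):
    print(label, afun_for_label(label))
```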

Syntactic Structure, Functions: PTB to P(E)DT

[Figure: the example tree in three stages: Penn Treebank structure (with heads added) → PDT-like Analytic Representation → PDT-like Tectogrammatic Representation (automatic pre-annotation), the last one carrying the labels PRED.Fut and PAT.]

English TR I: Predicative Complement

• Free (non-valency) modification (of both a noun and a verb)

• attribute compl.rf (green arrow to the noun)

English TR II: Which + Relative Clause

We have not answered your question completely, for which we apologize.

English TR III: Coordination

English TR IV: Comparison

English TR V: Restriction (“Exclusion”)

except, with the exception of, excluding, (all/none) but, beyond, apart from, unless, bar, barring, besides

English TR: (manual) annotation

• TrEd
  – Pre-annotated
  – Graphical
    • TR dep. tree is primary
  – Text + TR
  – Czech translation

• Valency (a.k.a. “propbanking”)
  – During TR annotation
  – Propbank origins and examples
    • Linked, displayed

EngVallex (give)

EngVallex Format (admit)

Interannotator Agreement

2007-2009:
- New annotators (lower numbers)
- Annotation “by phenomenon”
- Restarting now

Prague English Dependency Treebank

• Availability
  – Version 1.0 now (PTB license needed)
    • 250k words
  – Full version (parallel with Czech): late 2010

• Size
  – Full WSJ portion of PTB (2312 files)
  – 49208 sentences, 1253013 tokens
  – Now: 17210 sentences (34.97%), 439983 tokens (35.11%)
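
The coverage percentages follow directly from the counts on this slide. A quick check, with variable and function names chosen here only for illustration:

```python
# Counts taken from the slide.
total_sentences, total_tokens = 49208, 1253013
done_sentences, done_tokens = 17210, 439983


def share(part, whole):
    """Percentage of `whole` covered by `part`, rounded to two decimals."""
    return round(100 * part / whole, 2)


print(share(done_sentences, total_sentences))  # 34.97
print(share(done_tokens, total_tokens))        # 35.11
```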

Czech PDT-style Annotation

• All layers – morphology, syntax, tectogrammatical

• So far…
  – Automatic (many tools by many authors)

• Manual annotation
  – In progress (28124 sentences/639326 words)
  – Top-down
    • Tectogrammatical first (lower layers automatically)
    • … then syntactic structure and morphology

Summary

• PDT is/has (a)…
  – (Family of) dependency-based treebanking project(s)
    • Czech (English, Arabic, ...)
  – ~ 1mil. words
    • sufficient size for ML experiments
  – 4 interlinked layers of annotation
    • (token, morphology, syntax, deep syntax/semantics++)
    • independent and “full” information at all levels
    • interlinked (for the development of parsers/generators)
  – Parallel corpus Cze <-> Eng -> Machine Translation

Pointers, Acknowledgements

• http://ufal.mff.cuni.cz/pedt

• http://ufal.mff.cuni.cz/pdt2.0

• http://ufal.mff.cuni.cz/~pajas/tred

• Acknowledgements
  – FP6-IST “Euromatrix”, FP7-IST “Euromatrix+”
  – LC536 (Center for Computational Linguistics)
  – GAČR 405/06/0589 (Speech and deep syntax)
  – MŠMT: MSM0021620838, ME838, ME09008