Universal Dependency Parsing
Joakim Nivre
Uppsala University Department of Linguistics and Philology
The Grand Vision

The problem:
• Why 90% parsing accuracy for English but only 80% for Finnish?
• Are some languages intrinsically harder to parse?
• Not just morphological richness – many typological parameters

An ideal solution:
• A universal parser for all languages
• Linguistic universals are hard-coded
• Typological parameters are learned from data

Gee! What a neat idea!
Pieces of the Puzzle
1. Morphosyntactic disambiguation
2. Universal dependency annotation
3. Parsing with universal dependencies
Part 1
Morphosyntactic Disambiguation
Joint work with Bernd Bohnet, Igor Boguslavsky, Richárd Farkas, Filip Ginter and Jan Hajič
Background
Parsing accuracy for morphologically rich languages tends to be lower than for languages like English (Nivre et al., 2007)
Hypothesized explanations (Tsarfaty et al., 2010, 2013):
• Strict separation of morphology and syntax
• Data sparsity due to high type-token ratio
Suggested remedies:
• Joint morphological and syntactic analysis (Lee et al., 2011)
• Lexical resources (Hajič, 2000; Goldberg and Elhadad, 2013)
This Study

Parsing techniques:
• Transition-based model for joint morphological and syntactic analysis
• Lexical resources integrated as hard or soft constraints
Evaluation on five morphologically rich languages:
• Czech, Finnish, German, Hungarian, Russian
• New state of the art in dependency parsing for all languages
Bernd Bohnet, Joakim Nivre, Igor Boguslavsky, Richárd Farkas, Filip Ginter, and Jan Hajič. 2013. Joint Morphological and Syntactic Analysis for Richly Inflected Languages. Transactions of the Association for Computational Linguistics 1:415–428
Representations

For each word of a sentence w1, …, wn:
• A part-of-speech tag p ∈ P
• A morphological feature bundle m ∈ M
• A lemma l ∈ Σ*
• A syntactic head h ∈ {0, w1, …, wn}
• A dependency label d ∈ D

Example (German, TIGER-style annotation of "Ein Haus hat er in Ulm gebaut."):

Word    Ein          Haus         hat            er             in    Ulm          gebaut  .
PoS     ART          NN           VAFIN          PPER           APPR  NE           VVPP    $.
Morph   acc|sg|neut  acc|sg|neut  sg|3|pres|ind  nom|sg|masc|3  —     dat|sg|neut  —       —
Lemma   ein          haus         haben          er             in    ulm          bauen   —
Deprel  NK           OA           (root)         SB             MO    NK           OC      —
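Concretely, each analyzed word can be pictured as a record with these five attributes. A minimal Python sketch (names are illustrative, not from the paper):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Word:
    """One analyzed word: the form plus the five predicted attributes."""
    form: str
    pos: str                  # part-of-speech tag p ∈ P
    morph: Optional[str]      # feature bundle m ∈ M, e.g. "acc|sg|neut"
    lemma: Optional[str]      # lemma l ∈ Σ*
    head: int                 # index of the syntactic head, 0 = root
    deprel: Optional[str]     # dependency label d ∈ D

# "Haus" in the example above: object (OA) of "gebaut" (word 7)
haus = Word("Haus", "NN", "acc|sg|neut", "haus", head=7, deprel="OA")
```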
Parsing Framework

Transition system (Bohnet and Nivre, 2012):
• Arc-standard system with online reordering (Nivre, 2009)
• Select morphology when shifting words to the stack (see the sketch below)

Beam search and structured learning (Zhang and Clark, 2008)

Preprocessing (at learning and parsing time):
• Tagger assigns the k best tags and feature bundles
• Parser can only select analyses licensed by the preprocessor

Experiment 1

Pipeline:
• Preprocessor assigns a single tag and feature bundle per word
• Beam = 40 distinct trees

Joint:
• Preprocessor assigns up to 2 tags and feature bundles per word
• Beam = 40 distinct trees + 8 tag variants + 8 feature variants

Data (P/M/D = size of the part-of-speech, morphological feature bundle, and dependency label inventories):

Treebank                                  Train   Dev   Test  P   M     D
cs  PDT (Hajič et al., 2001)              1249K   88K   70K   12  1851  49
de  Tiger (Brants et al., 2002)           648K    32K   32K   54  257   43
fi  TDT (Haverinen et al., 2013)          183K    —     21K   12  1917  47
hu  Szeged (Farkas et al., 2012)          1101K   210K  171K  22  1105  33
ru  SynTagRus (Boguslavsky et al., 2000)  575K    73K   72K   14  454   78
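The joint selection of morphology while shifting can be pictured as follows. A minimal sketch under assumed interfaces (stack and buffer hold word indices; k_best[i] lists the analyses licensed by the preprocessor for word i), not the authors' implementation:

```python
# Arc-standard-style transition enumeration where SHIFT is split into one
# variant per licensed morphological analysis, so beam search scores
# tags/features and trees together.

def legal_transitions(stack, buffer, k_best):
    transitions = []
    if len(stack) >= 2:                    # arcs between two topmost stack words
        transitions.append(("LEFT-ARC",))
        transitions.append(("RIGHT-ARC",))
    if buffer:
        i = buffer[0]                      # next word to be shifted
        for analysis in k_best[i]:         # e.g. up to 2 (tag, feats) pairs
            transitions.append(("SHIFT", i, analysis))
    return transitions
```

With a single analysis per word this degenerates to the pipeline setting; with up to 2 analyses per word it corresponds to the joint setting of Experiment 1.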
Morphological Accuracy (Experiment 1, %)

          cs    de    fi    hu    ru
Pipeline  92.6  96.1  88.8  89.1  93.0
Joint     92.8  96.2  89.1  90.0  93.7
Syntactic Accuracy (Experiment 1, %)

          cs    de    fi    hu    ru
Pipeline  87.4  88.4  79.9  91.8  83.1
Joint     87.6  88.9  80.6  92.0  83.3
Experiment 2

Morphological lexicons:
• Lookup of tag, feature bundle and lemma
• Added as hard or soft constraints to preprocessor and parser (sketched in code below)
• Lemma selected deterministically from form + tag + feature bundle
Lexicon
cs PDT (Hajič and Hladká, 1998)
de SMOR (Schmid et al., 2004)
fi OMorFi (Pirinen, 2011)
hu morphdb.hu (Trón et al., 2006)
ru ETAP-3 (Apresian et al., 2003)
[Chart: per-language distribution of the number of lexicon analyses per word form (0, 1, 2, 3, 4+), shown separately for PoS tags (POS) and full morphological analyses (MOR), for cs, fi, de, hu, ru.]
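The hard/soft distinction can be sketched as follows (assumed interfaces, not the actual system): a hard constraint prunes the tagger's candidates to those the lexicon licenses, while a soft constraint keeps all candidates and only exposes lexicon agreement as a feature for the learner.

```python
def candidate_analyses(form, tagger_kbest, lexicon, mode="soft"):
    """tagger_kbest: list of (tag, feats); lexicon: dict form -> set of (tag, feats)."""
    licensed = lexicon.get(form, set())
    if mode == "hard" and licensed:
        pruned = [a for a in tagger_kbest if a in licensed]
        return pruned or tagger_kbest      # back off if nothing survives
    # soft: keep every candidate, flag lexicon agreement as a feature
    return [(a, {"in_lexicon": a in licensed}) for a in tagger_kbest]
```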
Morphological Accuracy (Experiment 2, %)

         cs    de    fi    hu    ru
Joint    92.8  96.2  89.1  90.0  93.7
LexHard  94.5  97.1  91.1  92.9  —
LexSoft  95.1  97.4  91.6  91.2  94.4
Syntactic Accuracy (Experiment 2, %)

         cs    de    fi    hu    ru
Joint    87.6  88.9  80.6  92.0  83.3
LexHard  88.0  89.1  82.5  92.0  83.4
LexSoft  87.7  89.1  82.3  92.1  83.5
Discussion

Joint inference benefits both morphology and syntax:
• Strongest effect on morphology for Czech and German – syncretism?
• Strongest effect on syntax for Finnish and Hungarian – why?

Lexical resources mitigate data sparsity:
• Strongest effect on morphology – soft constraints always best?
• Weaker effect on syntax – relevant errors fixed by joint inference?
• Strong effect on syntax for Finnish – sparse data?

Nice! But why only 82% for Finnish? Can we even compare the numbers?
Part 2
Universal Dependency Annotation
Joint work with Ryan McDonald, Slav Petrov, Chris Manning, Marie de Marneffe, Jinho Choi, Filip Ginter, Yoav Goldberg, Jan Hajič and Reut Tsarfaty
Apples and Oranges

Treebank annotation schemes vary across languages:
• Hard to compare parsing results across languages (Nivre et al., 2007)
• Hard to evaluate cross-lingual learning (McDonald et al., 2013)
• Hard to make progress towards a universal parser?

Recent initiatives:
• HamleDT: conversion of 29 existing treebanks to a PDT-like annotation scheme (Zeman et al., 2012)
• Universal Dependency Treebank Project: new annotation, conversion and harmonization to Google universal PoS tags and Stanford dependencies (McDonald et al., 2013)
Case Study

Delexicalized transfer parsing with universal PoS tags:
• First evaluation of labeled accuracy
• Results make typological sense

Cross-lingual transfer parsing results (McDonald et al., 2013, Table 3); rows = source training language, columns = target test language:

Unlabeled Attachment Score (UAS)
      DE     EN     SV     ES     FR     KO
DE   74.86  55.05  65.89  60.65  62.18  40.59
EN   58.50  83.33  70.56  68.07  70.14  42.37
SV   61.25  61.20  80.01  67.50  67.69  36.95
ES   55.39  58.56  66.84  78.46  75.12  30.25
FR   55.05  59.02  65.05  72.30  81.44  35.79
KO   33.04  32.20  27.62  26.91  29.35  71.22

Labeled Attachment Score (LAS)
      DE     EN     SV     ES     FR     KO
DE   64.84  47.09  53.57  48.14  49.59  27.73
EN   48.11  78.54  57.04  56.86  58.20  26.65
SV   52.19  49.71  70.90  54.72  54.96  19.64
ES   45.52  47.87  53.09  70.29  63.65  16.54
FR   45.96  47.41  52.25  62.56  73.37  20.84
KO   26.36  21.81  18.12  18.63  19.52  55.85

From the paper: for the languages common to both studies (DE, EN, SV and ES), UAS in earlier delexicalized transfer work was in the 38–68% range, as compared to 55–75% here; for Swedish, where the test sets are the same, UAS increases from 58.3% to 70.6%. This suggests that most cross-lingual parsing studies have underestimated accuracies.

Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Täckström, Claudia Bedini, Núria Bertomeu Castelló, and Jungmee Lee. 2013. Universal Dependency Annotation for Multilingual Parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 92–97.
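For reference, both scores are simple to compute from aligned gold and predicted (head, label) pairs. A minimal sketch:

```python
def attachment_scores(gold, pred):
    """gold/pred: lists of (head, label) pairs, one per word, aligned by position."""
    assert len(gold) == len(pred) and gold
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return 100 * uas, 100 * las   # percentages, as in the tables above
```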
Too Many Standards?

• English SD – de Marneffe et al. (2006)
• Google SD – McDonald et al. (2013)
• Hebrew SD – Tsarfaty (2013)
• Finnish SD – Haverinen et al. (2013)
• CLEAR – Choi (2012)
• HamleDT – Zeman et al. (2012)
Universal Dependencies

Marie-Catherine de Marneffe, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre and Christopher D. Manning. 2014. Universal Stanford Dependencies: A Cross-Linguistic Typology. In Proceedings of the Ninth International Conference on Language Resources and Evaluation.

http://universaldependencies.github.io/docs/

Example (German): Im Restaurant isst Maria den Fisch.
(Token Im maps to words 1.1 In + 1.2 dem.)

Word ID  Form        PoS    Features                  Deprel
1.1      In          ADP    —                         case
1.2      dem         DET    CASE=DAT|NUM=SIN|GEN=NEU  det
2        Restaurant  NOUN   CASE=DAT|NUM=SIN|GEN=NEU  advmod
3        isst        VERB   TENSE=PRES                (root)
4        Maria       PNOUN  CASE=NOM                  nsubj
5        den         DET    CASE=ACC|NUM=SIN|GEN=MAS  det
6        Fisch       NOUN   CASE=ACC|NUM=SIN|GEN=MAS  dobj
7        .           .      —                         punc

The representation combines:
• Stanford Universal Dependencies (the tree and its labels)
• Google Universal Part-of-Speech Tags + Morphological Features
• Explicit Mapping from Tokens to Words
Guiding Principles

Maximize parallelism:
• Don't annotate the same thing in different ways
• Don't make different things look the same

But don't overdo it:
• Don't annotate things that are not there
• Languages select from a universal pool of categories
• Allow language-specific extensions
Dependency Structure

• Keeping content words as heads promotes parallelism
• Function words often correlate with morphology

English:
The  dog        was      chased  by    the  cat    .
DET  NOUN       AUX      VERB    ADP   DET  NOUN   .
det  nsubjpass  auxpass  (root)  case  det  agent  punc

Swedish:
Hunden     jagades    av    katten  .
NOUN       VERB       ADP   NOUN    .
DEF=+      VOICE=PAS  —     DEF=+   —
nsubjpass  (root)     case  agent   punc
Dependency Relations

• Taxonomy of 42 universal grammatical relations, broadly attested across languages in the typological literature
• Language-specific subtypes can be added

Morphology

• The lexicalist hypothesis:
  • Grammatical relations hold between words (including clitics)
  • Morphological categories are properties of words
• Morphological annotation:
  • Revised Google Universal Part-of-Speech Tags (Petrov et al., 2012)
  • Universal inventory of morphological features (under construction)

Universal part-of-speech tags:
ADJ ADP ADV AUX CONJ DET INTJ NOUN NUM PNOUN PRON PRT PUNCT SCONJ SYM VERB X
Tokens and Words

Principle of recoverability:
• Clitics and contractions are split to allow meaningful annotation
• Mapping from basic tokenization is explicitly represented
• Heuristic mapping of annotation to basic tokens is provided

Encoded in a revised CoNLL format (CoNLL-U)

Example (Spanish): Vamonos al mar.
(Token Vamonos maps to words 1.1 Vamos + 1.2 nos; al maps to 2.1 a + 2.2 el.)

Word ID  Form   PoS   Deprel
1.1      Vamos  VERB  (root)
1.2      nos    PRON  expl
2.1      a      ADP   case
2.2      el     DET   det
3        mar    NOUN  nmod
4        .      .     punc
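In the CoNLL-U format as later released, this token-to-word mapping is encoded with multiword token ranges rather than the 1.1/1.2 indices shown above. A simplified sketch for the Spanish example (the full format has ten columns; here only ID, FORM, POS, HEAD, DEPREL):

```text
1-2  Vamonos  _     _  _
1    Vamos    VERB  0  root
2    nos      PRON  1  expl
3-4  al       _     _  _
3    a        ADP   5  case
4    el       DET   5  det
5    mar      NOUN  1  nmod
6    .        .     1  punc
```

The range lines carry the surface tokens, so the original text is always recoverable from the annotated words.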
Work in Progress

Current time plan:
• Stable annotation guidelines by end of September
• First release of data sets before the end of 2014

Follow our progress and give feedback:
• Universal Dependencies: http://universaldependencies.github.io/docs/

Check out old releases:
• Uni-Dep-TB: https://code.google.com/p/uni-dep-tb/

Let's work together!

Hey, wait a minute! How is this going to improve parsing?
Part 3
Parsing with Universal Dependencies
Confessions of a converted dependency parser
Taking Stock

Motivation for universal dependencies:
• Improve comparability of parsing results across languages
• Facilitate development of multilingual systems
• Enable typological studies of syntactic structure

What about parsing?
• Not likely to improve parsing accuracy with existing parsers
• Parsers tend to prefer function words as heads (Schwartz et al., 2012)
• We risk bringing English down to 80% instead of Finnish up to 90%
What’s the Problem?
• Dependency parsers know only one syntactic relation
What’s the Problem?
h d
• Dependency parsers know only one syntactic relation
• They do not interpret dependency labels
What’s the Problem?
h d
?
• Dependency parsers know only one syntactic relation
• They do not interpret dependency labels
• They represent a construction primarily by its head
What’s the Problem?
h d
?h
A Simple Case

The dog was chased by the black cat.

• All criteria point to cat being the head
• Little (syntactic) information is lost by dropping black

[Diagram: black attached to cat by amod; dropping black leaves "The dog was chased by the cat."]
A Tricky Case

The dog was chased by the black cat.

• Some criteria point to was, others to chased as the head
• Neither word can represent the whole
• The same problem arises with by (the black cat)

[Diagram: two competing analyses of "was chased" – chased as head with an aux arc to was, or was as head with a main arc to chased.]
Back to the Roots

• Three fundamental syntactic relations (Tesnière, 1959)
• Tesnière-style dependency treebank (Sangati and Mazza, 2009)

Dependency – single head: black cat
Transfer – split head: was chased
Coordination – multiple heads: cats and mice
Universal Dependencies

• The UD representation is a simple dependency tree
• But labels are universal and can be interpreted
• Therefore, we can put more knowledge into the parser

Example: Fido was chased by Kitty and Tiger.

Fido       was      chased  by    Kitty  and  Tiger  .
PNOUN      AUX      VERB    ADP   PNOUN  CONJ PNOUN  .
nsubjpass  auxpass  (root)  case  agent  cc   conj   punc

[Diagram: the same sentence regrouped around content words – "was chased" as the nucleus, with "Fido" and "by Kitty and Tiger" as its arguments.]
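One concrete reading of the last point: because labels like case, aux and cc explicitly mark function words, a consumer (or an informed parser) can deterministically regroup them with the content word they serve. A sketch of such a regrouping (illustrative, not the authors' algorithm):

```python
# Use the universal labels to regroup function words with their content-word
# heads, recovering units like Tesnière's "was chased" without changing
# the tree itself.

FUNCTION_LABELS = {"case", "aux", "auxpass", "cop", "det", "cc", "mark"}

def content_groups(words, heads, labels):
    """words: forms; heads[i]: index of i's head (-1 for root); labels[i]: deprel."""
    groups = {i: [i] for i, l in enumerate(labels) if l not in FUNCTION_LABELS}
    for i, l in enumerate(labels):
        if l in FUNCTION_LABELS:
            groups.setdefault(heads[i], []).append(i)  # join the head's group
    return [[words[j] for j in sorted(g)] for g in groups.values()]

# "Fido was chased by Kitty and Tiger ." yields
# [Fido] [was chased] [by Kitty and] [Tiger] [.]
```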
Final Remarks

Constituency parsing – largely driven by the PTB:
• Perhaps too much emphasis on English (until recently)
• But deep analysis of categories and representations

Dependency parsing – largely driven by CoNLL data:
• More attention to typological diversity
• But ignorance and blind faith with respect to representations

Can UD give us the best of both worlds?
Thanks for Your Attention! Questions?
http://stp.lingfil.uu.se/~nivre/