Getting the Most out of Transition-based Dependency Parsing


Description

This paper suggests two ways of improving transition-based, non-projective dependency parsing. First, we add a transition to an existing non-projective parsing algorithm, so it can perform either projective or non-projective parsing as needed. Second, we present a bootstrapping technique that narrows down discrepancies between gold-standard and automatic parses used as features. The new addition to the algorithm shows a clear advantage in parsing speed. The bootstrapping technique gives a significant improvement to parsing accuracy, showing near state-of-the-art performance with respect to other parsing approaches evaluated on the same data set.

Transcript of Getting the Most out of Transition-based Dependency Parsing

Page 1: Getting the Most out of Transition-based Dependency Parsing

Getting the Most out of Transition-based Dependency Parsing

Jinho D. Choi and Martha Palmer, Institute of Cognitive Science, University of Colorado at Boulder

Why transition-based dependency parsing?

• It is fast.
  : Projective parsing - O(n); non-projective parsing - O(n²).
• Parse history can be used as features.
  : Parsing complexity is still preserved.

Can non-projective dependency parsing be any faster?

• # of non-projective dependencies <<< # of projective dependencies.
  : Perform projective parsing in most cases and non-projective parsing only when it is needed.
• Choi and Nicolov, 2009
  : Added a non-deterministic SHIFT transition to Nivre’s list-based non-projective algorithm, which reduced the search space and achieved linear-time parsing speed in practice.
• This work
  : Adds a transition from Nivre’s projective algorithm to Choi-Nicolov’s approach (LEFT-POP), which reduces the search space even more.

How do we use parse history as features?

• Current approaches use gold-standard parses as features during training.
  : Not necessarily what parsers encounter during decoding.
• This work
  : Minimizes the gap between gold-standard and automatic parses using bootstrapping.

Introduction

Transitions

• Parsing states are represented as tuples (λ1, λ2, β, E).
  : λ1, λ2, and β are lists of word tokens.
  : E is a set of labeled edges (previously identified dependencies).
• L is a dependency label, and i, j, k are indices of their corresponding word tokens.
• The initial state is ([0], [ ], [1, …, n], E); 0 corresponds to the root node.
• The final state is (λ1, λ2, [ ], E); the algorithm terminates when all tokens in β are consumed.
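As an illustration only, a parsing state of this form could be represented as in the minimal Python sketch below; the names (ParseState, initial_state, is_final) are ours, not ClearParser's.

```python
from dataclasses import dataclass, field

@dataclass
class ParseState:
    """A parsing state (lambda1, lambda2, beta, E) over tokens 0..n (0 = root)."""
    lambda1: list                               # partially processed tokens
    lambda2: list                               # tokens set aside, restored on SHIFT
    beta: list                                  # remaining input tokens
    edges: set = field(default_factory=set)     # labeled edges (head, label, dependent)

def initial_state(n: int) -> ParseState:
    # ([0], [ ], [1, ..., n], E): the root in lambda1, all word tokens in beta
    return ParseState(lambda1=[0], lambda2=[], beta=list(range(1, n + 1)), edges=set())

def is_final(state: ParseState) -> bool:
    # the algorithm terminates when all tokens in beta have been consumed
    return not state.beta
```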

• LEFT-POP(L) and LEFT-ARC(L) are performed when wj is the head of wi with a dependency L.
  : LEFT-POP removes wi from λ1, assuming that the token is no longer needed.
  : LEFT-ARC keeps wi so it can be the head of some token wk (j < k ≤ n) in β.
• RIGHT-ARC(L) is performed when wi is the head of wj with a dependency L.
• SHIFT is performed when
  : DT – λ1 is empty.
  : NT – there is no token in λ1 that is either the head or a dependent of wj.
• NO-ARC buffers processed tokens so each token in β can be compared to all (or some) tokens prior to it.
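The transitions above can be sketched as operations on that state, where wi is the last token of λ1 and wj is the first token of β. This is a hedged sketch of the list-based machinery described on the poster, not the authors' implementation.

```python
def left_pop(state: ParseState, label: str) -> None:
    # LEFT-POP(L): w_j is the head of w_i; w_i leaves the search space for good.
    i, j = state.lambda1.pop(), state.beta[0]
    state.edges.add((j, label, i))

def left_arc(state: ParseState, label: str) -> None:
    # LEFT-ARC(L): same edge, but w_i is kept (in lambda2) since it may still
    # become the head of a later token w_k (j < k <= n) in beta.
    i, j = state.lambda1.pop(), state.beta[0]
    state.edges.add((j, label, i))
    state.lambda2.insert(0, i)

def right_arc(state: ParseState, label: str) -> None:
    # RIGHT-ARC(L): w_i is the head of w_j.
    i, j = state.lambda1.pop(), state.beta[0]
    state.edges.add((i, label, j))
    state.lambda2.insert(0, i)

def no_arc(state: ParseState) -> None:
    # NO-ARC: set w_i aside so w_j can be compared with tokens before w_i.
    state.lambda2.insert(0, state.lambda1.pop())

def shift(state: ParseState) -> None:
    # SHIFT: restore lambda2 into lambda1 and move w_j from beta onto lambda1.
    # Deterministic (DT) when lambda1 is empty; non-deterministic (NT) when no
    # token left in lambda1 can be the head or a dependent of w_j.
    state.lambda1.extend(state.lambda2)
    state.lambda2.clear()
    state.lambda1.append(state.beta.pop(0))
```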

Parsing states

• After LEFT-POP is performed (#8), [w4 = my] is removed from the search space and no longer considered in the later parsing states (e.g., between #10 and #11).

Parsing Algorithm

Experimental setup

• Corpora: English and Czech data distributed by the CoNLL’09 shared task.
• Machine learning algorithm: Liblinear L2-L1 SVM.

Accuracy comparisons

• Our: ‘Choi and Nicolov’ + LEFT-POP transition.
• Our+: ‘Our’ + bootstrapping technique.
• Gesmundo et al.: the best transition-based system for CoNLL’09.
• Bohnet: the best graph-based system for CoNLL’09 (the overall rank is in parentheses).
• LAS and UAS: Labeled and Unlabeled Attachment Scores.

Speed comparisons

• ‘Our’ performed slightly faster than ‘Our+’ because it made more non-deterministic SHIFTs.
• ‘Nivre’ indicates Nivre’s swap algorithm, which showed the expected linear-time non-projective parsing complexity (Nivre, 2009); we used the MaltParser implementation.
• The curve shown by ‘Nivre’ might be caused by implementation details regarding feature extraction, which we included as part of parsing.

Experiments

System                     English             Czech
                           LAS        UAS      LAS        UAS
Choi and Nicolov, 2009     88.54      90.57    78.12      83.29
Our                        88.62      90.66    78.30      83.47
Our+                       89.15      91.18    80.24      85.24
Gesmundo et al., 2009      88.79 (3)  -        80.38 (1)  -
Bohnet, 2009               89.88 (1)  -        80.11 (2)  -

Conclusion

• The LEFT-POP transition gives improvements to both parsing speed and accuracy, yielding non-projective dependency parsing speed that is linear in sentence length.
• The bootstrapping technique gives a significant improvement to parsing accuracy, showing near state-of-the-art performance with respect to other parsing approaches.

ClearParser

• Open source project: http://code.google.com/p/clearparser/
• Contact: Jinho D. Choi ([email protected])

Conclusion

Average parsing speeds per sentence

• Nivre: 2.86 ms
• Choi-Nicolov: 2.69 ms
• Our+: 2.29 ms
• Our: 2.20 ms

Note: ‘Our’, not presented in the figure, showed a growth very similar to ‘Our+’.

• We gratefully acknowledge the support of the National Science Foundation Grants CISE-IIS-RI-0910992, Richer Representations for Machine Translation; a subcontract from the Mayo Clinic and Harvard Children’s Hospital based on a grant from the ONC, 90TR0002/01, Strategic Health Advanced Research Project Area 4: Natural Language Processing; and a grant from the Defense Advanced Research Projects Agency (DARPA/IPTO) under the GALE program, DARPA/CMO Contract No. HR0011-06-C-0022, subcontract from BBN, Inc.
• Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Acknowledgments

The range of subtree and head information

• When wi and wj are compared, subtree and head information of these tokens is partially provided by previous parsing states.
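For illustration, the partial head and subtree information available at a given state can be read off the edges identified so far; a small sketch under the same assumptions as above (helper names are ours):

```python
def head_of(state: ParseState, k: int):
    # Head and label of token k, if an incoming edge has already been identified.
    for head, label, dep in state.edges:
        if dep == k:
            return head, label
    return None

def dependents_of(state: ParseState, k: int) -> list:
    # Dependents of token k identified so far: a partial subtree, since later
    # parsing states may still attach more dependents to k.
    return sorted(dep for head, _, dep in state.edges if head == k)
```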

Bootstrapping technique

• A simplified version of SEARN, an algorithm for integrating SEARch and lEARNing to solve complex structured prediction problems (Daumé et al., 2009).
• This is the first time the idea has been applied to transition-based dependency parsing.

Bootstrapping Technique

• Stop the procedure when the parsing accuracy of the current cross-validation is lower than that of the previous iteration.

• Gold-standard labels are obtained by comparing the dependency relation between wi and wj in the gold-standard tree.
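Putting these pieces together, the bootstrapping procedure might look roughly like the sketch below; train, parse, extract_instances, and cross_validation_accuracy are hypothetical helpers standing in for the actual training pipeline, and the overall shape is our reading of the description above rather than the authors' code.

```python
def bootstrap_training(sentences, gold_trees, max_iter=10):
    # Hypothetical helpers: train, parse, extract_instances, cross_validation_accuracy.
    # Iteration 0: parse-history features come from the gold-standard trees.
    model = train(extract_instances(sentences, gold_trees, feature_trees=gold_trees))
    best = cross_validation_accuracy(model, sentences, gold_trees)

    for _ in range(max_iter):
        # Re-parse the training data with the current model; its output supplies
        # the parse-history features, while classification labels are still taken
        # from the dependency relations between w_i and w_j in the gold trees.
        auto_trees = [parse(model, s) for s in sentences]
        candidate = train(extract_instances(sentences, gold_trees, feature_trees=auto_trees))
        score = cross_validation_accuracy(candidate, sentences, gold_trees)
        # Stop when cross-validation accuracy drops below the previous iteration's.
        if score < best:
            break
        model, best = candidate, score
    return model
```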