Partial Prebracketing to Improve Parser Performance

John Judge

NCLT Seminar Series

7th December 2005

Overview

• Background and Motivation

• Prebracketing

• NE/MWE Markup

• Labelled Bracketing Constituent Markup

• NE/MWE + LBCM Combined

• Grammars compared

• Conclusions

Background and Motivation

• Parse-annotated corpora are crucial for developing machine-learning and statistics-based parsing resources

• Large treebanks are available for major languages

• For other languages there is a lack of such resources for grammar induction

Background and Motivation

• Treebank construction is usually semi-automatic (Penn Treebank, NEGRA)
  – Raw text is parsed
  – Annotator corrects parser output

• Propose a new method
  – Text is pre-processed to help parser
  – Pre-processed text is parsed
  – Annotator corrects parser output

• Hopefully the output will be better quality and correction will be quicker and easier for the annotator

Relevance to my work

• Previous work on question analysis has identified a need for a suitable training corpus

• Plan to be able to use this technique for developing a question treebank

• Method is general so it can be applied to other areas/languages

Prebracketing

• Marking up the input text with information that will help the parser parse the sentence properly
  – Named Entities
  – Multi-Word Expressions
  – Constituents VP, PP

• Prebracketing can be done automatically (NE, MWE) or manually (constituents)

• Parser (LoPar, Schmid (2000)) will respect this markup when parsing

Prebracketing

Robert Erwin, president of Biosource, called Plant Genetic’s approach `` Interesting ’’ and ``novel, ’’ and `` complementary rather than competitive . ’’

Prebracketing: Named Entities

Robert Erwin, president of Biosource, called Plant Genetic’s approach `` Interesting ’’ and ``novel, ’’ and `` complementary rather than competitive . ’’

Prebracketing: Multi-Word Expressions

Robert Erwin, president of Biosource, called Plant Genetic’s approach `` Interesting ’’ and ``novel, ’’ and `` complementary rather than competitive . ’’

Prebracketing: Constituents

Robert Erwin, president of Biosource, called Plant Genetic’s approach `` Interesting ’’ and ``novel, ’’ and `` complementary rather than competitive . ’’

Named Entities

• Names of people, places, companies, products, etc.

• Similar to Proper Nouns

• Internal structure isn’t really important

• Can be treated as a single lexical unit

Parsing NEs

• Parser doesn’t normally output NEs

• Retrain on NE annotated treebank

• Add NE annotation to sections 2-21 of Penn-II Treebank to use as training data

• Add NE annotations to section 23 as test set

Transforming the Corpus

• Use named entity recogniser, Lingpipe, to pick out the entities

(http://alias-i.com/lingpipe/)

• Use (slightly hacked) NE transformation routine in the LFG Annotation Algorithm to transform the trees

Example Tree Before Transformation

(NP-SBJ
  (NP (NNP Robert) (NNP Erwin))
  (, ,)
  (NP
    (NP (NN president))
    (PP (IN of)
        (NP (NNP Biosource))))
  (, ,))

Example Tree After Transformation

(NP-SBJ
  (NE Robert Erwin)
  (, ,)
  (NP
    (NP (NN president))
    (PP (IN of)
        (NE Biosource)))
  (, ,))
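
The NE transformation routine itself is not shown in the slides. As a minimal sketch of the idea, assuming the recogniser's output is available as a set of word tuples, the code below replaces the largest subtree whose words exactly match an entity with a flat NE node. The names `parse`, `leaves`, `collapse` and `show` are hypothetical helpers, not part of the LFG Annotation Algorithm.

```python
def parse(s):
    """Parse a bracketed tree into nested lists: [label, child, ...]; words are str."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return [label] + children

    return node()


def leaves(t):
    """Return the words under a node, left to right."""
    if isinstance(t, str):
        return [t]
    return [w for child in t[1:] for w in leaves(child)]


def collapse(t, entities):
    """Replace any subtree whose words form a known entity with (NE w1 w2 ...)."""
    if isinstance(t, str):
        return t
    words = leaves(t)
    if tuple(words) in entities:
        return ["NE"] + words         # internal structure is discarded
    return [t[0]] + [collapse(child, entities) for child in t[1:]]


def show(t):
    """Print a nested-list tree back as a bracketed string."""
    if isinstance(t, str):
        return t
    return "(" + " ".join([t[0]] + [show(c) for c in t[1:]]) + ")"


# Entities as a recogniser might report them (assumed input format).
entities = {("Robert", "Erwin"), ("Biosource",)}
before = ("(NP-SBJ (NP (NNP Robert) (NNP Erwin)) (, ,) "
          "(NP (NP (NN president)) (PP (IN of) (NP (NNP Biosource)))) (, ,))")
print(show(collapse(parse(before), entities)))
```

The sketch only handles entities that line up with a single constituent; entities that cross constituent boundaries would need the more careful treatment that a real transformation routine provides.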

Example Prebracketed Input

• Parser input is generated by systematically stripping markup from the gold standard trees, leaving in only NE markup

((S (NP-SBJ (NE Robert Erwin)…(JJ competitive))))))(. .)('' '')))

(NE Robert Erwin), president of (NE Biosource), called (NE Plant Genetic)’s approach `` Interesting ’’ and ``novel, ’’ and `` complementary rather than competitive . ’’
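
A rough sketch of the stripping step, under the assumption that trees are available as bracketed strings: every label is dropped except those in a `keep` set, so only the NE brackets survive around the words. The helper names are hypothetical, and the exact input syntax LoPar expects is not specified here.

```python
def parse(s):
    """Parse a bracketed tree into nested lists: [label, child, ...]; words are str."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return [label] + children

    return node()


def prebracket(t, keep):
    """Flatten a tree to tokens, keeping brackets only for labels in `keep`."""
    if isinstance(t, str):
        return [t]
    inner = [tok for child in t[1:] for tok in prebracket(child, keep)]
    if t[0] in keep:
        # Attach the brackets to the first and last token of the span.
        return ["(" + t[0]] + inner[:-1] + [inner[-1] + ")"]
    return inner


gold = ("(NP-SBJ (NE Robert Erwin) (, ,) "
        "(NP (NP (NN president)) (PP (IN of) (NE Biosource))) (, ,))")
print(" ".join(prebracket(parse(gold), keep={"NE"})))
```

Passing a different `keep` set (for example `{"MWE"}` or `{"NE", "MWE"}`) would give the other kinds of prebracketed input from the same gold trees.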

Results for NE Parsing Section 23

            Baseline   Marked up
Coverage    99.42      99.63
F-Score     62.95      66.09
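
The slides do not state how the F-scores were computed. Assuming the usual PARSEVAL-style labelled bracketing measure, a minimal sketch looks like this: each non-POS node contributes a (label, start, end) span, and precision and recall are taken over the spans shared by the gold and parser trees. The toy trees and helper names are invented for illustration.

```python
from collections import Counter


def parse(s):
    """Parse a bracketed tree into nested lists: [label, child, ...]; words are str."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return [label] + children

    return node()


def spans(t, start, out):
    """Record (label, start, end) for every node that is not a POS tag; return end."""
    if isinstance(t, str):
        return start + 1
    end = start
    for child in t[1:]:
        end = spans(child, end, out)
    is_pos_tag = len(t) == 2 and isinstance(t[1], str)
    if not is_pos_tag:
        out[(t[0], start, end)] += 1
    return end


def f_score(gold, test):
    """Labelled bracketing F-score between two trees over the same words."""
    g, p = Counter(), Counter()
    spans(gold, 0, g)
    spans(test, 0, p)
    matched = sum((g & p).values())   # multiset intersection
    if matched == 0:
        return 0.0
    precision = matched / sum(p.values())
    recall = matched / sum(g.values())
    return 2 * precision * recall / (precision + recall)


gold = parse("(S (NP (DT the) (NN dog)) (VP (VBD barked)))")
test = parse("(S (NP (DT the)) (VP (NN dog) (VBD barked)))")
print(round(100 * f_score(gold, test), 2))
```

In the toy pair only the S bracket matches, so precision and recall are both one in three.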

Multi-Word Expressions

• Short expressions that are interpreted as a single unit, e.g. with respect to, vice versa, all of a sudden, according to, as well as …

• Can correspond to a number of syntactic categories

• Treat as a single lexical unit and ignore internal structure

Parsing MWEs

• Parser doesn’t normally output MWEs

• As with NEs, Penn-II is transformed to contain MWEs

• Parser is retrained on MWE annotated treebank

• Tested on MWE annotated version of Section 23

Transforming the corpus

• Use a list of multi-word expressions from the Stanford University Multi-Word Expression Project (http://mwe.stanford.edu/)

• Transform the corpus using (an even more hacked) NE transformation routine in the LFG Annotation Algorithm
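
As an illustration of list-based MWE markup (not the tree transformation used in the talk), the sketch below scans a token sequence and brackets the longest entry from a word list at each position. The list here contains only the expressions named in these slides, matching is case-insensitive by assumption, and `mark_mwes` is a hypothetical helper.

```python
# Expressions mentioned in the slides; the real list is much larger.
MWES = {
    ("with", "respect", "to"),
    ("vice", "versa"),
    ("all", "of", "a", "sudden"),
    ("according", "to"),
    ("as", "well", "as"),
    ("rather", "than"),
}


def mark_mwes(tokens, mwes=MWES):
    """Greedy longest-match bracketing of multi-word expressions."""
    max_len = max((len(m) for m in mwes), default=0)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to two words.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(w.lower() for w in tokens[i:i + n])
            if candidate in mwes:
                out.append("(MWE " + " ".join(tokens[i:i + n]) + ")")
                i += n
                break
        else:
            out.append(tokens[i])     # no expression starts here
            i += 1
    return out


print(" ".join(mark_mwes("complementary rather than competitive".split())))
```

The original casing is kept inside the bracket, so sentence-initial and sentence-internal occurrences of the same expression remain distinct strings.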

Example Tree Before Transformation

(PP
  (RB rather)
  (IN than)
  (ADJP (JJ competitive)))

Example Tree After Transformation

(PP
  (MWE rather than)
  (ADJP (JJ competitive)))

Example Prebracketed Input

• Parser input is generated by systematically stripping markup from the gold standard trees, leaving in only MWE markup

((S (NP-SBJ (NP (NNP Robert) (NNP Erwin)…(JJ competitive))))))(. .)('' '')))

Robert Erwin, president of Biosource, called Plant Genetic’s approach `` Interesting ’’ and ``novel, ’’ and `` complementary (MWE rather than) competitive . ’’

Results for MWE Parsing Section 23

            Baseline   Marked up
Coverage    99.42      99.55
F-Score     63.88      64.26

NE and MWE Markup Combined

• Marking up NEs and MWEs individually has yielded an improvement on the baseline

• Try doing both together

• Treebank is transformed as before but marking up both NEs and MWEs

• Prebracketed input can also contain both NEs and MWEs

Results for NE/MWE Combined Parsing

            Baseline   MWE     NE      NE&MWE
Coverage    99.67      99.67   99.67   99.67
F-Score     65.11      65.13   65.40   65.53

• MWE score is almost 1% better than when using MWEs alone

• NE score is over half a percent worse than when using NEs alone

• Coverage is up slightly on both runs

Labelled Bracketing Constituent Markup

• Instead of marking up lexicalised units, sentence constituents (VP, PP) are marked up

• No transformations are necessary as original treebank trees produce constituents in output

Generating LBCM input

• Systematically strip markup from the gold standard trees, leaving in only selected markup

((S (NP-SBJ (NP (NNP Robert) (NNP Erwin)…(JJ competitive))))))(. .)('' '')))

Robert Erwin, president of Biosource, (VP called Plant Genetic’s approach `` Interesting ’’ and ``novel, ’’ and `` complementary rather than competitive) . ’’

• Simulating “ideal” human generated input
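
A sketch of how a single constituent type might be selected from the gold tree. It assumes that "Top" means a VP (or PP) not contained in another node of the same label, and "Bottom" means one that contains no other node of that label; the slides do not define these terms, and the helper names and toy sentence are invented.

```python
def parse(s):
    """Parse a bracketed tree into nested lists: [label, child, ...]; words are str."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return [label] + children

    return node()


def contains(t, label):
    """True if some proper descendant of t carries the given label."""
    return any(
        not isinstance(c, str) and (c[0] == label or contains(c, label))
        for c in t[1:]
    )


def lbcm(t, label, mode, inside=False):
    """Flatten to tokens, bracketing only the topmost or bottommost `label` nodes."""
    if isinstance(t, str):
        return [t]
    is_target = t[0] == label
    if mode == "top":
        keep = is_target and not inside
    else:                             # mode == "bottom"
        keep = is_target and not contains(t, label)
    inner = [tok for c in t[1:] for tok in lbcm(c, label, mode, inside or is_target)]
    if keep:
        return ["(" + label] + inner[:-1] + [inner[-1] + ")"]
    return inner


tree = parse("(S (NP (PRP He)) (VP (VBD said) (SBAR (S (NP (PRP she)) (VP (VBD left))))))")
print(" ".join(lbcm(tree, "VP", "top")))
print(" ".join(lbcm(tree, "VP", "bottom")))
```

Labels carrying function tags (such as a hypothetical `PP-LOC`) would not match a plain `PP` in this sketch and would need normalising first.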

Results for LBCM

            Baseline   Top VP   Bottom VP   Top PP   Bottom PP
Coverage    99.87      99.11    98.98       99.55    98.55
F-Score     67.19      71.16    69.74       70.48    68.44

Combining NE, MWE and LBCM Prebracketing

• Taking the corpora from NE and MWE combination and preprocessing input as for LBCM

• (NE Robert Erwin), president of (NE Biosource), (VP called (NE Plant Genetic)’s approach `` Interesting ’’ and ``novel, ’’ and `` complementary (MWE rather than) competitive) . ’’

Results for all 3

Prebracketed        Coverage   F-Score
None (Baseline)     99.67      65.11
MWE                 99.67      65.13
NE                  99.67      65.40
MWE & NE            99.67      65.53
NE & Top VP         99.13      70.06
MWE & Top VP        99.21      69.63
NE, MWE & Top VP    99.21      69.91

Looking at the Grammar/Lexicon

• Expect to reduce grammar/lexicon size by conflating NEs and MWEs

• Compare 4 grammars and lexicons
  – Plain PCFG
  – NE PCFG
  – MWE PCFG
  – NE/MWE PCFG

Comparison

                   Vanilla   NE      MWE     Combi
Rules              15427     16253   16791   17661
NP Rules           6456      6754    6544    6853
NP …NE…            -         1515    -       1520
NP …MWE…           -         -       198     197
… NE               -         2123    -       2549
… MWE              -         -       1821    1793
Lexical Entries    45357     51257   45622   51521
NEs in Lex         -         14773   -       14773
MWEs in Lex        -         -       323     318

Why are the Grammars Growing?

• Expectation was that the grammar/lexicon would shrink

• Instead they’re growing in size

• Caused by the lexicon
  – Many MWEs are appearing as a single MWE entry and also their individual words
  – Likewise for NEs
  – Introducing new categories
  – Correspondingly more rules in the grammar

Multi-Word Expression Example

• all of a sudden
  – all of a sudden MWE 2
  – All of a sudden MWE 2
  – all DT 800 PDT 182 RB 36
  – of IN 22741 RP 2
  – a DT 19895 FW 6 LS 1 NNP 2 SYM 10
  – sudden JJ 32
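
The growth can be reproduced on a toy corpus. In the sketch below (invented data, hypothetical `lexicon` helper), conflating one occurrence of "all of a sudden" into a single MWE token adds a new lexical entry, while the individual words stay in the lexicon because they still occur elsewhere.

```python
from collections import Counter, defaultdict


def lexicon(tagged_sentences):
    """Map each lexical item to a Counter of the tags it was seen with."""
    lex = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            lex[word][tag] += 1
    return lex


# Toy corpus: the same three sentences, without and with MWE conflation.
plain = [
    [("all", "DT"), ("of", "IN"), ("a", "DT"), ("sudden", "JJ")],
    [("all", "DT"), ("of", "IN"), ("them", "PRP")],
    [("a", "DT"), ("sudden", "JJ"), ("stop", "NN")],
]
with_mwe = [
    [("all of a sudden", "MWE")],
    [("all", "DT"), ("of", "IN"), ("them", "PRP")],
    [("a", "DT"), ("sudden", "JJ"), ("stop", "NN")],
]

print(len(lexicon(plain)), len(lexicon(with_mwe)))
```

The lexicon would only shrink if every occurrence of the component words fell inside a conflated expression, which is rarely the case for frequent words such as "of" or "a".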

Named Entity Example

• Winston Churchill
  – Winston Churchill NE 1
  – Churchill NE 2

• George Bush
  – George Bush NE 26
  – George NE 5
  – Bush NE 344

Knock-on effects

• Grammar/lexicon size is growing
  – Parse time is increasing

• A likely cause of the gains in terms of precision, recall and f-score being small
  – NE/MWE analysis of a phrase is not the most likely according to the grammar
  – Consequently a less likely parse is output

Summary

• Generating NE/MWE PCFG grammars in this way is possible

• Unexpectedly, these grammars are larger than plain PCFG grammars

• Results show something can be gained from prebracketing input

• However, even the best result, 71.16 (prebracketing topmost VP), is considerably lower than the scores of history-based parsers (Charniak, Collins, Bikel)

Further Work

• Better NE recognition
  – More sophisticated transformation method

• Different/larger MWE lists

• Experiment with history-based parsers for better results*

Thanks

• Any questions or comments?