Error Analysis for Learning-based Coreference Resolution Olga Uryupina 27.05.08.
Error Analysis for Learning-based Coreference Resolution
Olga Uryupina, 27.05.08
Outline
• CR: state-of-the-art and our system
• Distribution of errors
• Discussion: possible remedies
Coreference Resolution
"This deal means that Bernard Schwartz can focus most of his time on Globalstar and that is a key plus for Globalstar because Bernard Schwartz is brilliant," said Robert Kaimovitz, a satellite communications analyst at Unterberg Harris in New York.
..Globalstar still needs to raise $ 600 million,
and Schwartz said that the company would try..
Machine Learning Approaches
• Soon et al. (2000)
• Cardie & Wagstaff (1999)
• Strube et al. (2002)
• Ng & Cardie (2001-2004)
• ACE competition
Features: Soon et al. (2000)
1. Anaphor is a pronoun
2. Anaphor is a definite NP
3. Anaphor is an NP with a demonstrative pronoun ("this", ..)
4. Antecedent is a pronoun
5. Both markables are proper names
6. Number agreement
7. Gender agreement
8. Alias
9. Appositive
10. Same surface form
11. Semantic class agreement
12. Distance in sentences
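The pairwise features above can be sketched as a simple extraction function over candidate (antecedent, anaphor) pairs. The `Mention` fields and the exact tests below are illustrative assumptions, not the authors' implementation:

```python
# Sketch of a Soon et al. (2000)-style feature vector for one
# (antecedent, anaphor) mention pair. The Mention fields and the
# concrete tests are simplifying assumptions for illustration.
from dataclasses import dataclass

PRONOUNS = {"he", "she", "it", "they", "his", "her", "its", "their"}
DEMONSTRATIVES = {"this", "that", "these", "those"}

@dataclass
class Mention:
    text: str          # surface form, lowercased
    is_proper: bool    # proper-name flag from the NE tagger
    number: str        # "sg" or "pl"
    gender: str        # "m", "f", "n", or "unknown"
    sentence: int      # sentence index in the document

def pair_features(ante: Mention, ana: Mention) -> dict:
    return {
        "ana_pronoun": ana.text in PRONOUNS,
        "ana_definite": ana.text.startswith("the "),
        "ana_demonstrative": ana.text.split()[0] in DEMONSTRATIVES,
        "ante_pronoun": ante.text in PRONOUNS,
        "both_proper": ante.is_proper and ana.is_proper,
        "number_agree": ante.number == ana.number,
        "gender_agree": ante.gender == ana.gender
                        or "unknown" in (ante.gender, ana.gender),
        "same_surface": ante.text == ana.text,
        "sent_distance": ana.sentence - ante.sentence,
    }
```

Each pair's feature dictionary would then be fed, with a coreferent/non-coreferent label, to a learner such as C5.0.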
Features: other approaches
Cardie & Wagstaff: 11 features
Strube et al.: 17 features (the same standard features + approximate matching (MED))
Ng & Cardie: 53 features (no improvement on the extended feature set; better results (F=63.4) with manual feature selection)
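The approximate-matching (MED) feature mentioned above is based on minimum edit distance between the two surface strings. A minimal sketch, assuming the common convention of normalizing by the longer string's length:

```python
# Minimum edit distance (Levenshtein) and a normalized similarity,
# as a sketch of MED-style approximate matching; the normalization
# scheme is an assumption, not necessarily Strube et al.'s exact one.
def edit_distance(a: str, b: str) -> int:
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def med_similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b)) or 1
    return 1.0 - edit_distance(a, b) / longest
```

A learner can then threshold or directly consume the similarity score, which fires even when the two mentions are not string-identical.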
Performance: Soon et al.
                        R     P     F
Soon et al.'s system:
C5.0, optimized        56.1  65.5  60.4
Our reimplementation:
C4.5, not optimized    53.5  72.8  61.7
Ripper                 44.6  74.8  55.9
SVM                    50.9  68.8  58.5
MaxEnt                 49.2  64.1  55.7
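The last column of the table is the harmonic mean of recall and precision (the MUC F-measure); a quick sanity check over the reported rows:

```python
# F-measure as the harmonic mean of recall and precision,
# checked against the reimplemented C4.5 row (R=53.5, P=72.8).
def f_measure(recall: float, precision: float) -> float:
    return 2 * recall * precision / (recall + precision)

print(round(f_measure(53.5, 72.8), 1))  # → 61.7
```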
Performance: Soon et al.
Learning Curve for C5.0
[Plot: F-measure (y-axis, 47 to 63) against training-set size (x-axis, 10 to 30 documents)]
Tricky and easy anaphors
Cristea et al. (2002): state-of-the-art coreference resolution systems have essentially the same performance level:
• Pronominal anaphora: ~80%
• Full-scale coreference: ~60%
Hypothesis: tricky vs. easy anaphors
Our system
Goal: bridge the gap between theory and practice:
sophisticated linguistic knowledge + a data-driven coreference resolution algorithm
New Features
Different aspects of CR:
• Surface similarity (122 features)
• Syntax (64)
• Semantic compatibility (29)
• Salience (136)
• (Anaphoricity)
More or less sophisticated linguistic theories exist for all these phenomena
Evaluation
Methodology:
• Standard dataset (MUC-7)
• Standard learning set-up
• Compare to Soon et al. (2001)
Performance (F)
                    Basic feature set   Extended feature set
Soon et al., C5.0        60.4                 N/A
C4.5                     61.7                 64.6
SVM                      58.5                 65.4
Ripper                   55.9                 57.5
MaxEnt                   55.7                 59.4
Performance
Learning Curve, SVM
[Plot: F-measure (y-axis, 50 to 66) against training-set size (x-axis, 10 to 30 documents)]
Error analysis
Different approaches, same performance:
• Same errors?
• "Tricky anaphors"? (Cristea et al., 2002)
Extensive error analysis needed!
Outline
• CR: state-of-the-art and our system
• Distribution of errors
• Discussion: possible remedies
Recall errors
                   Errors      %
MUC                  17       3.6
Markables           166      35.4
Propagated P         31       6.6
Pronouns             77      16.4
NE-matching          31       6.6
Syntax               39       8.3
Nominal anaphora    104      22.2
Total               469     100
Recall errors - markables
• Auxiliary doc parts
• Tokenization
• Modifiers
• Bracketing/labeling
Recall errors - markables
.. there was no requirement for tether to be manufactured in a contaminant-free environment.
A mesmerizing set.
Recall errors - pronouns
1st pl: reconstructing the group:
The retiring Republican chairman of the House Committee on Science want U.S. Businesses to <..> "We need to make it easier for the private sector.." Walker said
3rd sg, 3rd pl: (non-)salience:
[The explanation] for the History Channel's success begin with its association with another channel owned by the same parent consortium.
Recall errors - nominal
Mostly common noun phrases with different heads, WordNet does not help much
.. a report on the satellites' findings <..> the abilities of U.S. Reconnaissance technology <..> the use of advanced intelligence-gathering tools <..> Remote-sensing instruments..
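The different-heads problem can be made concrete with a head-match baseline sketch: mentions like "Remote-sensing instruments" and "advanced intelligence-gathering tools" share no head noun, so head- or string-based features fire nothing. Taking the last token as the head is a crude simplifying assumption (a parser would be used in practice):

```python
# Head-match baseline sketch for common-noun anaphora.
# Head extraction here is just "last token", an assumption;
# it is enough to show why different-head pairs are missed.
def head(np: str) -> str:
    return np.lower().split()[-1]

def head_match(a: str, b: str) -> bool:
    return head(a) == head(b)

print(head_match("Remote-sensing instruments",
                 "advanced intelligence-gathering tools"))  # → False
```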
Precision errors
                   Errors      %
MUC                  30       7.4
Markables            76      18.6
Pronouns             78      19.1
NE-matching          20       4.9
Syntax               22       5.4
Nominal anaphora    182      44.6
Total               408     100
Precision errors - pronouns
• incorrect parsing/tagging:
Two key vice presidents, [Wei Yen] and Eric Carlson, are leaving to start their own Silicon Valley companies.
• (non-)salience
• matching (propagated R)
Precision errors - nominal
Mostly same-head descriptions. Possible solutions:
• modifiers?
• anaphoricity detectors?
P errors – nominal - modifiers
Idea: "red car" cannot corefer with "blue car"
Problem: list of mutually incompatible properties?
MUC-7 test data:
  incompatible modifiers       30
  "new" modifier for anaphor   15
  compatible modifiers         58
  no modifiers                 62
P errors - nominal - discourse-new (DNEW)
Idea: identify and discard unlikely anaphors
Problem: even a very good detector does not help
Outline
• CR: state-of-the-art and our system
• Distribution of errors
• Discussion: possible remedies
Discussion – Errors
Problematic areas:
• Data
• Preprocessing modules
• Features
• Resolution strategy
Discussion - Data
• bigger corpus
• more uniform doc selection, text only
• better definition of COREF
• better scoring
Discussion - Preprocessing
• local improvements (e.g. appositions)
• probabilistic architecture to neutralize errors
Discussion - Features
• feature selection
• ensemble learning
• more targeted learning for under-represented phenomena (abbreviations)
Discussion - Resolution
• less local: move to the chain level
• less uniform: specific treatment for different types of anaphors
Discussion – Conclusion
• ML approaches to coreference resolution yield similar performance values
• Some anaphors are indeed tricky (especially crucial for precision errors)
• But some errors can be eliminated within an ML framework:
  – improving the training material
  – elaborated integration of preprocessing modules
  – more global resolution strategies
Thank You!
Recall errors - MUC
Mainly incorrect bracketing
..said <COREF .. MIN="vice president">Jim Johannesen, <COREF .. MIN="vice president">vice president of site development for McDonald's</COREF></COREF>..
Only clear typos etc. are considered MUC errors
Recall errors – propagated P
The company also said the Marine Corps has begun testing two of [its radars] as part of a short-range ballistic missile defense program. That testing could lead to an order for the radars.
Crucial for pronouns and indicators for intrasentential coreference
Recall errors - matching
Mostly ORGANIZATIONs. Problems:
• Abbreviations: Federal Communication Commission / FCC
• Hyphenated names: Ziff-Davis Publishing / Ziff
• Foreign names: Taiwan President Lee Teng-hui / President Lee
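The abbreviation case above is the most mechanical of the three; a sketch of an initial-letter acronym check, which by assumption covers only FCC-style names (hyphenated and foreign names need separate treatment):

```python
# Sketch of an abbreviation check for ORGANIZATION matching.
# Initial-letter acronyms only; the stopword list is an assumption.
STOPWORDS = {"of", "the", "for", "and"}

def is_acronym(short: str, full: str) -> bool:
    initials = "".join(w[0] for w in full.split()
                       if w.lower() not in STOPWORDS)
    return short.upper() == initials.upper()

print(is_acronym("FCC", "Federal Communication Commission"))  # → True
```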
Recall errors - syntax
Apposition, copula. Problems:
• Parsing mistakes
• Missing constructions: ..the venture will become synonymous with JSkyB
• P/R trade-off: ..Kevlar, a synthetic fiber, and Nomex..
• Quantitative constructions: ..more than quadruple the three-month daily average of 88,700 shares
Precision errors - matching
Finer NE analysis could help, but mostly too difficult even for humans:
Loral
Loral Space and Communications Corp
Loral Space
Space Systems Loral
Anaphoricity
Some markables are not anaphors. We can tell that by looking at them, without any sophisticated coreference resolution.
Poesio & Vieira, Ng & Cardie – try to identify Discourse New entities automatically
Not used for this talk