Error Analysis for Learning-based Coreference Resolution Olga Uryupina 27.05.08.
Error Analysis for Learning-based Coreference Resolution
Olga Uryupina, 27.05.08
Outline
• CR: state-of-the-art and our system
• Distribution of errors
• Discussion: possible remedies
Coreference Resolution
"This deal means that Bernard Schwartz can focus most of his time on Globalstar and that is a key plus for Globalstar because Bernard Schwartz is brilliant," said Robert Kaimovitz, a satellite communications analyst at Unterberg Harris in New York.
..Globalstar still needs to raise $ 600 million,
and Schwartz said that the company would try..
Machine Learning Approaches
• Soon et al. (2000)
• Cardie & Wagstaff (1999)
• Strube et al. (2002)
• Ng & Cardie (2001-2004)
• ACE competition
Features: Soon et al. (2000)
1. Anaphor is a pronoun
2. Anaphor is a definite NP
3. Anaphor is an NP with a demonstrative pronoun ("this", ..)
4. Antecedent is a pronoun
5. Both markables are proper names
6. Number agreement
7. Gender agreement
8. Alias
9. Appositive
10. Same surface form
11. Semantic class agreement
12. Distance in sentences
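The pairwise features above can be sketched as a simple extraction function over candidate (antecedent, anaphor) pairs. The `Mention` fields and the exact tests below are illustrative assumptions, not the authors' implementation:

```python
# Sketch of a Soon et al. (2000)-style feature vector for one
# (antecedent, anaphor) mention pair. The Mention fields and the
# concrete tests are simplifying assumptions for illustration.
from dataclasses import dataclass

PRONOUNS = {"he", "she", "it", "they", "his", "her", "its", "their"}
DEMONSTRATIVES = {"this", "that", "these", "those"}

@dataclass
class Mention:
    text: str          # surface form, lowercased
    is_proper: bool    # proper-name flag from the NE tagger
    number: str        # "sg" or "pl"
    gender: str        # "m", "f", "n", or "unknown"
    sentence: int      # sentence index in the document

def pair_features(ante: Mention, ana: Mention) -> dict:
    return {
        "ana_pronoun": ana.text in PRONOUNS,
        "ana_definite": ana.text.startswith("the "),
        "ana_demonstrative": ana.text.split()[0] in DEMONSTRATIVES,
        "ante_pronoun": ante.text in PRONOUNS,
        "both_proper": ante.is_proper and ana.is_proper,
        "number_agree": ante.number == ana.number,
        "gender_agree": ante.gender == ana.gender
                        or "unknown" in (ante.gender, ana.gender),
        "same_surface": ante.text == ana.text,
        "sent_distance": ana.sentence - ante.sentence,
    }
```

Each pair's feature dictionary would then be fed, with a coreferent/non-coreferent label, to a learner such as C5.0.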
Features: other approaches
Cardie & Wagstaff: 11 features
Strube et al.: 17 features (the same standard features + approximate matching (MED))
Ng & Cardie: 53 features (no improvement on the extended feature set; better results (F=63.4) with manual feature selection)
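The approximate-matching (MED) feature mentioned above is based on minimum edit distance between the two surface strings. A minimal sketch, assuming the common convention of normalizing by the longer string's length:

```python
# Minimum edit distance (Levenshtein) and a normalized similarity,
# as a sketch of MED-style approximate matching; the normalization
# scheme is an assumption, not necessarily Strube et al.'s exact one.
def edit_distance(a: str, b: str) -> int:
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def med_similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b)) or 1
    return 1.0 - edit_distance(a, b) / longest
```

A learner can then threshold or directly consume the similarity score, which fires even when the two mentions are not string-identical.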
Performance: Soon et al.
                        R     P     F
Soon et al.'s system:
C5.0, optimized        56.1  65.5  60.4
Our reimplementation:
C4.5, not optimized    53.5  72.8  61.7
Ripper                 44.6  74.8  55.9
SVM                    50.9  68.8  58.5
MaxEnt                 49.2  64.1  55.7
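The last column of the table is the harmonic mean of recall and precision (the MUC F-measure); a quick sanity check over the reported rows:

```python
# F-measure as the harmonic mean of recall and precision,
# checked against the reimplemented C4.5 row (R=53.5, P=72.8).
def f_measure(recall: float, precision: float) -> float:
    return 2 * recall * precision / (recall + precision)

print(round(f_measure(53.5, 72.8), 1))  # → 61.7
```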
Performance: Soon et al.
Learning Curve for C5.0
[Plot: F-measure (y-axis, 47 to 63) against training-set size (x-axis, 10 to 30 documents)]
Tricky and easy anaphors
Cristea et al. (2002): state-of-the-art coreference resolution systems have essentially the same performance level:
• Pronominal anaphora: ~80%
• Full-scale coreference: ~60%
Hypothesis: tricky vs. easy anaphors
Our system
Goal: bridge the gap between theory and practice:
sophisticated linguistic knowledge + a data-driven coreference resolution algorithm
New Features
Different aspects of CR:
• Surface similarity (122 features)
• Syntax (64)
• Semantic compatibility (29)
• Salience (136)
• (Anaphoricity)
More or less sophisticated linguistic theories exist for all these phenomena
Evaluation
Methodology:
• Standard dataset (MUC-7)
• Standard learning set-up
• Compare to Soon et al. (2001)
Performance (F)
                    Basic feature set   Extended feature set
Soon et al., C5.0        60.4                 N/A
C4.5                     61.7                 64.6
SVM                      58.5                 65.4
Ripper                   55.9                 57.5
MaxEnt                   55.7                 59.4
Performance
Learning Curve, SVM
[Plot: F-measure (y-axis, 50 to 66) against training-set size (x-axis, 10 to 30 documents)]
Error analysis
Different approaches, same performance:
• Same errors?
• "Tricky anaphors"? (Cristea et al., 2002)
Extensive error analysis needed!
Outline
• CR: state-of-the-art and our system
• Distribution of errors
• Discussion: possible remedies
Recall errors
                   Errors      %
MUC                  17       3.6
Markables           166      35.4
Propagated P         31       6.6
Pronouns             77      16.4
NE-matching          31       6.6
Syntax               39       8.3
Nominal anaphora    104      22.2
Total               469     100
Recall errors - markables
• Auxiliary doc parts
• Tokenization
• Modifiers
• Bracketing/labeling
Recall errors - markables
.. there was no requirement for tether to be manufactured in a contaminant-free environment.
A mesmerizing set.
Recall errors - pronouns
1st pl: reconstructing the group:
The retiring Republican chairman of the House Committee on Science want U.S. Businesses to <..> "We need to make it easier for the private sector.." Walker said
3rd sg, 3rd pl: (non-)salience:
[The explanation] for the History Channel's success begin with its association with another channel owned by the same parent consortium.
Recall errors - nominal
Mostly common noun phrases with different heads, WordNet does not help much
.. a report on the satellites' findings <..> the abilities of U.S. Reconnaissance technology <..> the use of advanced intelligence-gathering tools <..> Remote-sensing instruments..
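The different-heads problem can be made concrete with a head-match baseline sketch: mentions like "Remote-sensing instruments" and "advanced intelligence-gathering tools" share no head noun, so head- or string-based features fire nothing. Taking the last token as the head is a crude simplifying assumption (a parser would be used in practice):

```python
# Head-match baseline sketch for common-noun anaphora.
# Head extraction here is just "last token", an assumption;
# it is enough to show why different-head pairs are missed.
def head(np: str) -> str:
    return np.lower().split()[-1]

def head_match(a: str, b: str) -> bool:
    return head(a) == head(b)

print(head_match("Remote-sensing instruments",
                 "advanced intelligence-gathering tools"))  # → False
```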
Precision errors
                   Errors      %
MUC                  30       7.4
Markables            76      18.6
Pronouns             78      19.1
NE-matching          20       4.9
Syntax               22       5.4
Nominal anaphora    182      44.6
Total               408     100
Precision errors - pronouns
• incorrect parsing/tagging:
Two key vice presidents, [Wei Yen] and Eric Carlson, are leaving to start their own Silicon Valley companies.
• (non-)salience
• matching (propagated R)
Precision errors - nominal
Mostly same-head descriptions. Possible solutions:
• modifiers?
• anaphoricity detectors?
P errors – nominal - modifiers
Idea: "red car" cannot corefer with "blue car"
Problem: list of mutually incompatible properties?
MUC-7 test data:
  incompatible modifiers       30
  "new" modifier for anaphor   15
  compatible modifiers         58
  no modifiers                 62
P errors - nominal - discourse-new (DNEW)
Idea: identify and discard unlikely anaphors
Problem: even a very good detector does not help
Outline
• CR: state-of-the-art and our system
• Distribution of errors
• Discussion: possible remedies
Discussion – Errors
Problematic areas:
• Data
• Preprocessing modules
• Features
• Resolution strategy
Discussion - Data
• bigger corpus
• more uniform doc selection, text only
• better definition of COREF
• better scoring
Discussion - Preprocessing
• local improvements (e.g. appositions)
• probabilistic architecture to neutralize errors
Discussion - Features
• feature selection
• ensemble learning
• more targeted learning for under-represented phenomena (abbreviations)
Discussion - Resolution
• less local: move to the chain level
• less uniform: specific treatment for different types of anaphors
Discussion – Conclusion
• ML approaches to coreference resolution yield similar performance values
• Some anaphors are indeed tricky (especially crucial for precision errors)
• But some errors can be eliminated within an ML framework:
  – improving the training material
  – elaborated integration of preprocessing modules
  – more global resolution strategies
Thank You!
Recall errors - MUC
Mainly incorrect bracketing
..said <COREF .. MIN="vice president">Jim Johannesen, <COREF .. MIN="vice president">vice president of site development for McDonald's</COREF></COREF>..
Only clear typos etc. are considered MUC errors
Recall errors – propagated P
The company also said the Marine Corps has begun testing two of [its radars] as part of a short-range ballistic missile defense program. That testing could lead to an order for the radars.
Crucial for pronouns and indicators for intrasentential coreference
Recall errors - matching
Mostly ORGANIZATIONs. Problems:
• Abbreviations: Federal Communication Commission / FCC
• Hyphenated names: Ziff-Davis Publishing / Ziff
• Foreign names: Taiwan President Lee Teng-hui / President Lee
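The abbreviation case above is the most mechanical of the three; a sketch of an initial-letter acronym check, which by assumption covers only FCC-style names (hyphenated and foreign names need separate treatment):

```python
# Sketch of an abbreviation check for ORGANIZATION matching.
# Initial-letter acronyms only; the stopword list is an assumption.
STOPWORDS = {"of", "the", "for", "and"}

def is_acronym(short: str, full: str) -> bool:
    initials = "".join(w[0] for w in full.split()
                       if w.lower() not in STOPWORDS)
    return short.upper() == initials.upper()

print(is_acronym("FCC", "Federal Communication Commission"))  # → True
```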
Recall errors - syntax
Apposition, copula. Problems:
• Parsing mistakes
• Missing constructions: ..the venture will become synonymous with JSkyB
• P/R trade-off: ..Kevlar, a synthetic fiber, and Nomex..
• Quantitative constructions: ..more than quadruple the three-month daily average of 88,700 shares
Precision errors - matching
Finer NE analysis could help, but mostly too difficult even for humans:
Loral
Loral Space and Communications Corp
Loral Space
Space Systems Loral
Anaphoricity
Some markables are not anaphors. We can tell that by looking at them, without any sophisticated coreference resolution.
Poesio & Vieira, Ng & Cardie – try to identify Discourse New entities automatically
Not used for this talk