Evaluation issues in
anaphora resolution and beyond
Ruslan Mitkov
University of Wolverhampton
Faro, 27 June 2002
Evaluation
Evaluation is a driving force for every NLP task/approach/application
Evaluation is indicative of the performance of a specific approach/application; no less importantly, it shows where that approach/application stands in comparison with other approaches/applications
Growing research in evaluation inspired by the availability of annotated corpora
Major impediments to fulfilling evaluation’s mission
• Different approaches evaluated on different data
• Different approaches evaluated in different modes
• Results not independently confirmed
As a result, no comparison or objective evaluation is possible
Anaphora resolution vs. coreference resolution
• Anaphora resolution has to do with tracking down an antecedent of an anaphor
• Coreference resolution seeks to identify all coreference classes (chains)
Anaphora resolution
For nominal anaphora which involves coreference, it is logical to regard each of the preceding noun phrases that are coreferential with the anaphor as a legitimate antecedent.
Example: "Computational linguists from many different countries attended PorTAL. The participants enjoyed the presentations; they also took an active part in the discussions."
Here both "Computational linguists from many different countries" and "The participants" are legitimate antecedents of "they".
Evaluation in anaphora resolution
Two perspectives:
• Evaluation of anaphora resolution algorithms
• Evaluation of anaphora resolution systems
Recall and Precision
MUC introduced the measures recall and precision for coreference resolution.
These measures, as defined, are not satisfactory in terms of clarity and coverage (Mitkov 2001).
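The MUC scores referred to above are the link-based measures of Vilain et al. (1995). A minimal Python sketch of that scoring scheme, assuming coreference chains are represented as sets of mention identifiers:

    def muc_recall(key_chains, response_chains):
        # Link-based MUC recall: how many links of each key chain
        # the response preserves, summed over all key chains.
        numerator = denominator = 0
        for chain in key_chains:
            # Partition the key chain by the response chains it intersects;
            # mentions missing from every response chain become singletons.
            parts = set()
            for mention in chain:
                idx = next((i for i, r in enumerate(response_chains)
                            if mention in r), None)
                parts.add(idx if idx is not None else ("singleton", mention))
            numerator += len(chain) - len(parts)
            denominator += len(chain) - 1
        return numerator / denominator if denominator else 0.0

    def muc_precision(key_chains, response_chains):
        # Precision is recall with key and response swapped.
        return muc_recall(response_chains, key_chains)

    # Key chain {a, b, c}; the system links only a and b:
    print(muc_recall([{"a", "b", "c"}], [{"a", "b"}]))     # 0.5
    print(muc_precision([{"a", "b", "c"}], [{"a", "b"}]))  # 1.0

Recall is 0.5 here because the key chain contains two links and the response preserves only one of them.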
Evaluation package for anaphora resolution algorithms (Mitkov 1998; 2000)
The package comprises:
(i) performance measures
(ii) comparative evaluation tasks
(iii) component measures
Performance measures
Success rate
Critical success rate
Critical success rate applies only to those ‘tough’ anaphors which still have more than one candidate for antecedent after the gender and number filter
Example
• Evaluation data: 100 anaphors
• Anaphors correctly resolved: 80
• Anaphors resolved by the gender and number filter alone (a single candidate left): 30
Success rate: 80/100 = 80%
Critical success rate: (80 - 30)/(100 - 30) = 50/70 ≈ 71.4%
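The same arithmetic as a short Python sketch over the figures above (the function names are illustrative):

    def success_rate(correct, total):
        # Fraction of all anaphors resolved correctly.
        return correct / total

    def critical_success_rate(correct, total, filter_resolved):
        # Only 'tough' anaphors count: remove those the gender and
        # number filter resolved on its own from both counts.
        return (correct - filter_resolved) / (total - filter_resolved)

    print(f"{success_rate(80, 100):.1%}")               # 80.0%
    print(f"{critical_success_rate(80, 100, 30):.1%}")  # 71.4%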
Comparative evaluation tasks
• Evaluation against baseline models
• Comparison to similar approaches
• Comparison with well-established approaches
Approaches frequently used for comparison:
Hobbs (1978), Brennan et al. (1987), Lappin and Leass (1994), Kennedy and Boguraev (1996), Baldwin (1997), Mitkov (1996; 1998)
Component measures
• Relative importance
• Decision power (Mitkov 2001)
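One plausible reading of relative importance is an ablation test: switch a component off and measure how far the success rate drops. A hedged Python sketch; the resolver factory and its interface are hypothetical, not the actual formulation of Mitkov (2001):

    def success(resolver, test_items):
        # test_items: (anaphor, candidates, gold antecedent) triples
        correct = sum(resolver(anaphor, cands) == gold
                      for anaphor, cands, gold in test_items)
        return correct / len(test_items)

    def relative_importance(make_resolver, indicators, test_items):
        # Compare the full algorithm with each variant that has one
        # indicator switched off; a bigger drop suggests a more
        # important component.
        full = success(make_resolver(indicators), test_items)
        return {ind: full - success(
                    make_resolver([i for i in indicators if i != ind]),
                    test_items)
                for ind in indicators}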
Evaluation measures for anaphora resolution systems
• Success rate
• Critical success rate
• Resolution etiquette (Mitkov et al. 2002)
Reliability of evaluation results
Evaluation results can be regarded as reliable if the evaluation
(i) covers all naturally occurring texts, or
(ii) employs sampling procedures
Relative vs. absolute results
• Results may be relative to a specific evaluation set or to another approach
• More “absolute” figures could be obtained if there were a measure that quantified the complexity of the anaphors to be resolved
Measures quantifying complexity in anaphora resolution
Measures for complexity (Mitkov 2001):
• Knowledge required for resolution
• Distance between anaphor and antecedent (in NPs, clauses, sentences)
• Number of competing candidates
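These measures can be computed directly from annotated data. A minimal Python sketch, with an illustrative record format rather than any real corpus schema:

    from dataclasses import dataclass

    @dataclass
    class AnnotatedAnaphor:
        sentence: int             # sentence index of the anaphor
        antecedent_sentence: int  # sentence index of the antecedent
        candidates: list          # competing candidates after filtering

    def average_distance(anaphors):
        # Anaphor-antecedent distance, here measured in sentences.
        return sum(a.sentence - a.antecedent_sentence
                   for a in anaphors) / len(anaphors)

    def average_candidates(anaphors):
        # Mean number of competing candidates per anaphor.
        return sum(len(a.candidates) for a in anaphors) / len(anaphors)

    corpus = [AnnotatedAnaphor(3, 2, ["the participants", "the talks"]),
              AnnotatedAnaphor(5, 5, ["PorTAL"])]
    print(average_distance(corpus))    # 0.5
    print(average_candidates(corpus))  # 1.5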
Fair evaluation
Algorithms should be evaluated on the basis of the same
• Evaluation data
• Pre-processing tools
Evaluation workbench
Evaluation workbench for anaphora resolution (Mitkov 2000; Barbu and Mitkov 2001)
• Allows the comparison of approaches sharing common principles or similar pre-processing
• Enables the ‘plugging in’ and testing of different anaphora resolution algorithms
All algorithms implemented operate in a fully automatic mode
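The 'plugging in' idea can be pictured as a common interface that every algorithm implements over the same pre-processed input. A hedged Python sketch; the class and method names are illustrative, not the actual workbench API:

    from abc import ABC, abstractmethod

    class AnaphoraResolver(ABC):
        @abstractmethod
        def resolve(self, anaphor, candidates):
            """Return the chosen antecedent from the candidate list."""

    class MostRecentBaseline(AnaphoraResolver):
        def resolve(self, anaphor, candidates):
            return candidates[-1]  # trivial baseline: pick the closest NP

    def run_workbench(resolvers, test_items):
        # Same data and the same pre-processing output for every
        # algorithm, so the resulting success rates are comparable.
        return {type(r).__name__:
                sum(r.resolve(anaphor, cands) == gold
                    for anaphor, cands, gold in test_items) / len(test_items)
                for r in resolvers}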
The need for annotated corpora
Annotated corpora are vital for training and evaluation
Annotation should cover anaphoric or coreferential chains, not only anaphor-antecedent pairs
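The difference can be made concrete with the PorTAL example from earlier; the Python structures below are purely illustrative:

    # Pair annotation records a single antecedent per anaphor:
    pair_annotation = [("they", "The participants")]

    # Chain annotation groups all coreferential mentions, so any earlier
    # member of the chain counts as a legitimate antecedent of "they":
    chain_annotation = [[
        "Computational linguists from many different countries",
        "The participants",
        "they",
    ]]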
Scarce commodity
Lancaster Anaphoric Treebank (100,000 words)
MUC coreference task annotated data (65,000 words)
Part of the Penn Treebank (90,000 words)
Additional issues
Annotation scheme
Annotating tools
Annotation strategy
Inter-annotator (dis)agreement is a major issue!
The Wolverhampton coreference annotation project
A 500,000-word corpus annotated for anaphoric and coreferential links (identity-of-sense direct nominal anaphora)
Less ambitious in terms of coverage, but much more consistent
Watch out for the traps!
• Are all annotated data reliable?
• Are all original documents reliable?
• Are all results reported “honest”?
Morale and motivation important!
If I may offer you my advice...
Do not despair if your first evaluation results are not as high as you wanted them to be.
Be prepared to provide considerable input in exchange for minor performance improvements.
Work hard. Be transparent.
... and you'll get there!
Anaphora resolution projects
Ruslan Mitkov’s home page
http://www.wlv.ac.uk/~le1825
Research Group in Computational Linguistics
http://clg.wlv.ac.uk