
Introduction to DUC-2001: an Intrinsic Evaluation of Generic News Text Summarization Systems

Paul Over
Retrieval Group
Information Access Division
National Institute of Standards and Technology


Document Understanding Conferences (DUC)

• Summarization has always been a major TIDES component

• An evaluation roadmap was completed in the summer of 2000 following the spring TIDES PI meeting

• DUC-2000 occurred in November 2000:
  – research reports
  – planning for the first evaluation using the roadmap


Summarization road map

• Specifies a series of annual cycles, with
  – progressively more demanding text data
  – both direct (intrinsic) and indirect (extrinsic, task-based) evaluations
  – increasing challenge in tasks

• Year 1 (September 2001): intrinsic evaluation of generic summaries
  – of newswire/paper stories
  – for single and multiple documents
  – with fixed lengths of 50, 100, 200, and 400 words
  – 60 sets of 10 documents used: 30 for training, 30 for test


DUC-2001 schedule

• Preliminary call out via ACL
  – over 55 responses
  – 25 groups signed up

• Creation/distribution of training and test data
  – 30 training reference sets released March 1
  – 30 test sets of documents released June 15

• System development
• System testing
• Evaluation at NIST
  – 15 sets of summaries submitted July 1
  – Human judgments of submissions at NIST, July 9-31

• Analysis of results
• Discussion of results and plans
  – DUC-2001 at SIGIR in New Orleans, Sept. 13-14


Goals of the talk

• Provide an overview of the:
  – Data
  – Tasks
  – Evaluation
    • Experience with implementing the evaluation procedure
    • Feedback from NIST assessors

• Introduce the results:
  – Sanity checking the results and measures
  – Effect of reassessment with a different model summary (Phase 2)

• Emphasize:
  – Exploratory data analysis
  – Attention to evaluation fundamentals over “final” conclusions
  – Improving future evaluations


The design…


Data: Formation of training/test document sets

• Each of 10 NIST information analysts chose one set of newswire/paper articles of each of the following types:

1. A single event with causes and consequences

2. Multiple distinct events of a single type

3. Subject (discuss a single subject)

4. One of the above in the domain of natural disasters

5. Biographical (discuss a single person)

6. Opinion (different opinions about the same subject)

• Each set contains about 10 documents (mean=10.2, std=2.1)

• All documents in a set were to be mainly about a specific “concept”


Human summary creation

[Figure: summary creation pipeline. Each assessor reads the documents, writes single-document summaries and a 400-word multi-document summary, then successively halves the multi-document summary to 200, 100, and 50 words; the steps are labeled A–F below.]

A: Read hardcopy of documents.

B: Create a 100-word softcopy summary for each document using the document author’s perspective.

C: Create a 400-word softcopy multi-document summary of all 10 documents written as a report for a contemporary adult newspaper reader.

D,E,F: Cut, paste, and reformulate to reduce the size of the summary by half.


Training and test document sets

• For each of the 10 authors,
  – 3 docsets were chosen at random to be training sets
  – the 3 remaining sets were reserved for testing

• Counts of docsets by type:

                                 Training    Test
  Single event                       9         3
  Multiple events of same type       6        12
  Subject                            4         6
  Biographical                       7         3
  Opinion                            4         6


Example training and test document sets

• Assessor A:
  1. TR - D01: Clarence Thomas’s nomination to the Supreme Court [11]
  2. TR - D06: Police misconduct [16]
  3. TR - D05: Mad cow disease [11]
  4/1. TE - D04: Hurricane Andrew [11]
  5. TE - D02: Rise and fall of Michael Milken [11]
  6. TE - D03: Sununu resignation [11]

• Assessor B:
  1. TR - D09: America’s response to the Iraqi invasion of Kuwait [16]
  2. TE - D08: Solar eclipses [11]
  3. TR - D07: Antarctica [9]
  4/2. TE - D11: Tornadoes [8]
  5. TR - D10: Robert Bork [12]
  6. TE - D12: Welfare reform [8]


Automatic baselines

• NIST created 3 baselines automatically, based roughly on algorithms suggested by Daniel Marcu from earlier work

• Single-document summaries:
  1. Take the first 100 words in the document.

• Multi-document summaries:
  1. Take the first 50, 100, 200, or 400 words in the most recent document.
     – 23.3% of the 400-word summaries were shorter than the target.
  2. Take the first sentence in the 1st, 2nd, 3rd, … document in chronological sequence until you reach the target summary size. Truncate the last sentence if the target size is exceeded.
     – 86.7% of the 400-word summaries and 10% of the 200-word summaries were shorter than the target.
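As a minimal sketch of the two multi-document baseline algorithms (the function names and the crude sentence splitting are mine, not NIST’s), assuming each document is a plain-text string and the list is ordered oldest first:

    def lead_baseline(docs, target_words):
        """Baseline 1: the first target_words words of the most recent document."""
        return " ".join(docs[-1].split()[:target_words])

    def first_sentence_baseline(docs, target_words):
        """Baseline 2: first sentence of each document in chronological order,
        truncating the last sentence if the target size would be exceeded."""
        summary = []
        for doc in docs:
            first_sentence = doc.split(". ")[0]  # illustrative sentence split only
            for word in first_sentence.split():
                if len(summary) == target_words:
                    return " ".join(summary)
                summary.append(word)
        return " ".join(summary)  # may fall short of the target, as noted above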


Submitted summaries

System                            Multi-   Single-
code   Group name                 doc.     doc.
L      Columbia University         120     -----
M      Cogentex                    112     -----
N      USC/ISI – Webclopedia       120     -----
O      Univ. of Ottawa             120       307
P      Univ. of Michigan           120       308
Q      Univ. of Lethbridge       -----       308
R      SUNY at Albany              120       308
S      TNO/TPD                     118       308
T      SMU                         120       307
U      Rutgers Univ.               120     -----
V      NYU                       -----       308
W      NSA                         120       279
X      NIJL                      -----       296
Y      USC/ISI                     120       308
Z      Baldwin Lang. Tech.         120       308
                                  -----     -----
Total                              1430      3345


Evaluation basics

• Intrinsic evaluation by humans using special version of SEE (thanks to Chin-Yew Lin, ISI)

• Compare:
  – a model summary: authored by a human
  – a peer summary: system-created, baseline, or human

• Produce judgments of:
  – Peer grammaticality, cohesion, organization
  – Coverage of each model unit by the peer (recall)
  – Characteristics of peer-only material


Phases: summary evaluation and evaluation of the evaluation

• Phase 1: Assessor judged peers against his/her own models.

• Phase 2: Assessor judged a subset of peers for a subset of docsets twice, against two other humans’ summaries.

• Phase 3 (not implemented): 2 different assessors judge same peers using same models.


Models

• Source:
  – Authored by a human
  – Phase 1: assessor is document selector and model author
  – Phase 2: assessor is neither document selector nor model author

• Formatting:
  – Divided into model units (MUs), i.e., elementary discourse units (EDUs; thanks to William Wong at ISI)
  – Lightly edited by authors to integrate uninterpretable fragments
  – Flowed together with HTML tags for SEE


Model editing very limited


Peers

• Formatting:
  – Divided into peer units (PUs):
    • simple, automatically determined sentences
    • tuned slightly to the documents and submissions:
      – abbreviations list
      – a submission ending most sentences with “…”
      – a submission formatted as lists of titles
  – Flowed together with HTML tags for SEE

• 3 sources:
  1. Automatically generated by research systems
     • For single-document summaries: 5 “randomly” selected documents per docset, common across systems
     • No multi-document summaries for docset 31 (model error)
  2. Automatically generated by baseline algorithms
  3. Authored by a human other than the assessor
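To illustrate what “simple, automatically determined sentences” tuned with an abbreviations list could look like, here is a sketch of abbreviation-aware splitting into peer units; the abbreviation set and splitting rule are my assumptions, not the actual NIST segmenter:

    # Illustrative only: the real PU segmenter and its abbreviations list are
    # not given in the talk; this just shows the general technique.
    ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Gen.", "U.S.", "Jan.", "Sept."}

    def split_into_peer_units(text):
        units, current = [], []
        for token in text.split():
            current.append(token)
            # End a unit at sentence-final punctuation, unless the token
            # is a known abbreviation.
            if token.endswith((".", "?", "!")) and token not in ABBREVIATIONS:
                units.append(" ".join(current))
                current = []
        if current:  # keep any trailing fragment as its own unit
            units.append(" ".join(current))
        return units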


The implementation…


Origins of the evaluation framework: SEE+++

• The evaluation framework builds on ISI work embodied in the original SEE software

• Challenges for DUC-2001:
  – Better explain questions posed to the NIST assessors
  – Modify the software to reduce sources of error/distraction
  – Get agreement from the DUC program committee

• Three areas of assessment in SEE:
  – Overall peer quality
  – Per-unit content
  – Unmarked peer units


Overall peer quality: difficult to define operationally

• Grammaticality: “Do the sentences, clauses, phrases, etc. follow the basic rules of English?
  – Don’t worry here about style or the ideas.
  – Concentrate on grammar.”

• Cohesion: “Do the sentences fit in as they should with the surrounding sentences?
  – Don’t worry about the overall structure of the ideas.
  – Concentrate on whether each sentence naturally follows the preceding one and leads into the next.”

• Organization: “Is the content expressed and arranged in an effective manner?
  – Concentrate here on the high-level arrangement of the ideas.”


SEE: overall peer quality


Overall peer quality: assessor feedback

• How much should typos, truncated sentences, obvious junk characters, headlines vs. full sentences, etc. affect grammaticality score?

• Hard to keep all three questions separate – especially cohesion and organization.

• The 5-value answer scale is OK.

• Good to be able to go back and change judgments for correctness and consistency.

• Need rule for small and single-unit summaries – cohesion and organization as defined don’t make much sense for these.


Counts of peer units (sentences) in submissions: widely variable


Grammaticality across all summaries

• Most scores relatively high

• System score range very wide

• Medians/means: Baselines < Systems < Humans

• But why are baselines (extractions) less than perfect?

            Mean    Std.
  Baseline  3.23    0.67
  System    3.53    0.75
  Human     3.79    0.52

Notches in box plots indicate 95% confidence intervals around the mean if and only if the sample is large (> 30) or has an approximately normal distribution.
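For reference, the conventional notch half-width (McGill, Tukey, and Larsen; the convention followed by R’s boxplot, where the notch approximates a 95% confidence interval for the median) is 1.57 × IQR / √n.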


Most baselines contained a sentence fragment

• Single-document summaries:
  1. Take the first 100 words in the document.
     – 91.7% of these summaries ended with a sentence fragment.

• Multi-document summaries:
  1. Take the first 50, 100, 200, or 400 words in the most recent document.
     – 87.5% of these summaries ended with a sentence fragment.
  2. Take the first sentence in the 1st, 2nd, 3rd, … document in chronological sequence until you reach the target summary size. Truncate the last sentence if the target size is exceeded.
     – 69.2% of these summaries ended with a sentence fragment.


Grammaticality: singles vs multis

Single- vs multi-document seems to have little effect


Grammaticality: among multis

Why are there more low scores for the baseline 50s and the human 400s?


Cohesion across all summaries

Median baselines = systems < humans


Cohesion: singles vs multis

• Better results on singles than multis

• For singles: median baselines = systems = humans


Cohesion: among multis

Why do more system summaries score higher in the 50s?


Organization across all summaries

Median baselines > systems > humans


Organization: singles vs multis

Generally lower scores for multi-document summaries than single-document summaries


Organization: among multis

Why do more system summaries score higher in the 50s?

Why are human summaries worse for the 200s?


Cohesion vs. organization: any real difference for assessors?

Why is organization ever higher than cohesion?


Per-unit content: evaluation details

– “First, find all the peer units which tell you at least some of what the current model unit tells you, i.e., peer units which express at least some of the same facts as the current model unit. When you find such a PU, click on it to mark it.”

– “When you have marked all such PUs for the current MU, then think about the whole set of marked PUs and answer the question.”

– “The marked PUs, taken together, express [ All, Most, Some, Hardly any, or None ] of the meaning expressed by the current model unit.”


SEE: per-unit content


Per-unit content: assessor feedback

– This is a laborious process and easy to get wrong: a loop within a loop.

– How should fragments be interpreted as units, e.g., a date standing alone?

– How much and what kind of information (e.g., from context) can/should you add to determine what a peer unit means?

– The criteria for marking a PU need to be clear. Sharing of what?

• Facts

• Ideas

• Meaning

• Information

• Reference


Per-unit content: measures

– Recall
  • Average coverage: the average of the per-MU completeness judgments [0..4] for a peer summary
  • Recall at various threshold levels:
    – Recall4: # MUs with all information covered / # MUs
    – Recall3: # MUs with all/most information covered / # MUs
    – Recall2: # MUs with all/most/some information covered / # MUs
    – Recall1: # MUs with all/most/some/any information covered / # MUs
  • Weighted average?

– Precision: problems
  • Peer summary lengths are fixed
  • Insensitive to:
    – duplicate information
    – partially unused peer units
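A minimal sketch of these recall measures (function and variable names are mine), assuming the per-MU judgments are encoded as integers None=0 … All=4:

    def average_coverage(judgments):
        """Average of the per-MU completeness judgments for one peer summary."""
        return sum(judgments) / len(judgments)

    def recall_at(judgments, threshold):
        """Recall_k: fraction of MUs judged at or above threshold k."""
        return sum(j >= threshold for j in judgments) / len(judgments)

    mus = [4, 3, 0, 2, 4]           # judgments for one hypothetical peer summary
    print(average_coverage(mus))    # 2.6
    print(recall_at(mus, 4))        # Recall4 = 0.4 (all information covered)
    print(recall_at(mus, 1))        # Recall1 = 0.8 (at least some information)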


Average coverage across all summaries

• Medians: baselines <= systems < humans

• Lots of “outliers”

• Best system summaries approach, equal, or exceed human models


Average coverage: singles vs multis

• Baseline and system scores are relatively lower for multi-document summaries


Average coverage: among multis. Small improvement as size increases.


Average coverage by system for singles

[Box plot: average coverage by system for single-document summaries, grouped as baseline, humans, and systems O, P, Q, R, S, T, V, W, X, Y, Z]


Average coverage by system for multis

[Box plot: average coverage by system for multi-document summaries, grouped as baselines, humans, and systems L, M, N, O, P, R, S, T, U, W, Y, Z]


Average coverage by docset for 2 systems: averages hide lots of variation by docset/assessor


SEE: unmarked peer units


Unmarked peer units: evaluation details

• “Think of 3 categories of unmarked PUs:
  – really should be in the model in place of something already there
  – not good enough to be in the model, but at least relevant to the model’s subject
  – not even related to the model”

• Answer the following question for each category: “[ All, Most, Some, Hardly any, or None ] of the unmarked PUs belong in this category.”

• Every PU should be accounted for in some category.

• If there are no unmarked PUs, then answer each question with “None”.

• If there is only one unmarked PU, then the answers can only be “All” or “None”.
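These constraints (together with the “if one answer is ‘All’, the others must be ‘None’” rule noted in the assessor feedback below) can be expressed as a small consistency check; this is a sketch under my own 0..4 encoding, not NIST’s actual validation code:

    def consistent(n_unmarked, answers):
        """answers: the three category ratings on a 0..4 (None..All) scale."""
        if n_unmarked == 0:
            return all(a == 0 for a in answers)       # every answer must be "None"
        if n_unmarked == 1:
            return all(a in (0, 4) for a in answers)  # only "All" or "None" allowed
        if answers.count(4) > 1:                      # two categories cannot both
            return False                              # contain all the unmarked PUs
        if 4 in answers:                              # if one category holds them all,
            return all(a == 0 for a in answers if a != 4)  # the others hold none
        return True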


Unmarked peer units: assessor feedback

• Many errors (illogical results) had to be corrected, e.g., if one question is answered “all”, then the others must be answered “none”.

• Allow identification of duplicate information in the peer.

• There was very little peer material that deserved to be in the model in place of something already there.

• Assessors were possessive of their model formulations.


Unmarked peer units: few extremely good or bad

                  Needed       Not needed,     Not
                  in model     but relevant    relevant
  All    mean       0.32           2.80          0.40
         median     0              4             0
  Singles           0.35           2.57          0.33
  Multis            0.29           3.06          0.47
  50                0.16           1.06          0.55
  100               0.25           3.03          0.49
  200               0.34           3.18          0.50
  400               0.41           3.27          0.41


Phase 2 initial results

• Designed to gauge the effect of different models

• Restricted to multi-document summaries of 50 and 200 words

• Assessor used 2 models created by other authors

• Within-assessor differences mostly very small:
  – Mean = 0.020
  – Std = 0.55

• Still want to compare to original judgments…


Summing up …


Summing up …

• Overall peer quality:
  – “Grammaticality” (especially the “All” choice) was too sensitive to low-level formatting.
  – “Cohesion” and “organization”, as defined, made little sense for very short summaries.
  – “Cohesion” was generally hard to distinguish from “organization”.

• For the future:
  – Address the dependence of grammaticality on low-level formatting
  – Reassess the operational definitions of cohesion/organization to better capture what researchers want to measure


Summing up …

• Per-unit content (coverage):
  – Assessors were often in a quandary about how much information to bring to the interpretation of the summaries.
  – Even for simple sentences/EDUs, determining shared meaning was very hard.

• For the future:
  – The results seem to pass a sanity check?
  – Was the assessor time worth it in terms of what researchers can learn from the output?


Summing up …

• Unmarked peers:
  – Very little difference among the DUC-2001 summaries with respect to the quality of the unmarked (peer-only) material

• For the future:
  – DUC-2001 systems are not producing junk, so little will be unrelated.
  – Summary authors are not producing junk, so little will be good enough to replace what is there.
  – What, then, if anything, can be usefully measured with respect to peer-only material?


The End


Average coverage by docset type (confounded by docset and assessor/author)

Human and system summaries


Average coverage by docset type (confounded by docset and assessor/author)

Single- and multi-doc baselines