Do Summaries Help? A Task-Based Evaluation of Multi-Document Summarization
Kathleen McKeown, Rebecca J. Passonneau, David K. Elson, Ani Nenkova, Julia Hirschberg


Transcript of Do Summaries Help? A Task-Based Evaluation of Multi-Document Summarization, by Kathleen McKeown, Rebecca J. Passonneau, David K. Elson, Ani Nenkova, and Julia Hirschberg

Page 1:

Do Summaries Help? A Task-Based Evaluation of Multi-Document Summarization

Kathleen McKeown, Rebecca J. Passonneau, David K. Elson, Ani Nenkova, Julia Hirschberg

Columbia University

SIGIR 2005

Page 2:

INTRODUCTION

Newsblaster is a system that provides an interface for browsing the news, featuring multi-document summaries of clusters of articles on the same event.

Key components of Newsblaster:
• Article clustering
• Event cluster summarization
• User interface
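The slide does not say how these components are implemented. Purely as an illustration of how the three pieces might fit together as a pipeline, here is a minimal Python sketch; every name in it (Article, cluster_articles, summarize_cluster, render_page) is hypothetical and stands in for whatever Newsblaster actually uses.

    # Hypothetical sketch: clustering -> cluster summarization -> user interface.
    # Placeholder logic only; not Newsblaster's actual algorithms.
    from dataclasses import dataclass

    @dataclass
    class Article:
        title: str
        text: str

    def cluster_articles(articles: list[Article]) -> list[list[Article]]:
        # Real systems cluster by content similarity; this placeholder
        # returns a single cluster containing every article.
        return [articles]

    def summarize_cluster(cluster: list[Article]) -> str:
        # Placeholder multi-document summary: first sentence of each article.
        return " ".join(a.text.split(".")[0] + "." for a in cluster)

    def render_page(clusters: list[list[Article]]) -> str:
        # The interface component: cluster summary followed by article titles.
        lines = []
        for cluster in clusters:
            lines.append("Summary: " + summarize_cluster(cluster))
            lines.extend("  - " + a.title for a in cluster)
        return "\n".join(lines)

    if __name__ == "__main__":
        articles = [Article("Storm hits coast", "A storm hit the coast. Damage was heavy."),
                    Article("Storm aftermath", "Cleanup began after the storm. Aid arrived.")]
        print(render_page(cluster_articles(articles)))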

Page 3:

http://newsblaster.cs.columbia.edu/

Page 4:

METHODS

In the experiment, subjects were asked to write a report using a news aggregator as a tool.

Each subject was asked to perform four 30-minute fact-gathering scenarios using a Web interface.

Each scenario involved answering three related questions about an issue in the news.

Page 5:

METHODS

The four tasks were:
• The Geneva Accord in the Middle East
• Hurricane Ivan's effects
• Conflict in the Iraqi city of Najaf
• Attacks by Chechen separatists in Russia

Page 6:

Page 7:

METHODS

Subjects were given a Web page that we constructed as their sole resource.

The page contained four document clusters, two of which were centrally related to the topic at hand, and two of which were peripherally related.

Each cluster contained, on average, ten articles.

Page 8:

METHODS

Four summary condition levels:
• Level 1: No summaries
• Level 2: One-sentence summary for each article, plus a one-sentence summary for each entire cluster
• Level 3: Newsblaster multi-document summary for each cluster
• Level 4: Human multi-document summary for each cluster

Page 9:

Experiment

45 subjects participated in three studies:
• Study A: 21 subjects wrote reports for two scenarios each, in two summary conditions: Level 3 and Level 4.
• Study B: 11 subjects wrote reports for all four scenarios, using Summary Level 2.
• Study C: 13 subjects wrote reports for all four scenarios, using Summary Level 1.

Page 10:

Scoring

We use the Pyramid method for evaluation.

For example, to score a report written in the human-summary condition, we constructed the pyramid using the reports created under all other conditions, plus the reports written by other subjects with human summaries.

If there are n reports, then the pyramid will have n tiers. The top tier will contain those facts that appear in all n reports.
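To make the tier construction concrete, here is a minimal sketch, assuming the facts extracted from each report are available as sets of fact identifiers. It only illustrates the tiering idea and is not the authors' implementation.

    from collections import Counter

    def build_pyramid(report_facts: list[set[str]]) -> dict[int, set[str]]:
        # Each fact goes into the tier equal to the number of reports it appears in,
        # so with n reports there are n tiers and tier n holds facts found in every report.
        counts = Counter(fact for facts in report_facts for fact in facts)
        n = len(report_facts)
        tiers = {i: set() for i in range(1, n + 1)}
        for fact, c in counts.items():
            tiers[c].add(fact)
        return tiers

    # Example with three reports sharing some facts.
    reports = [{"f1", "f2", "f3"}, {"f1", "f2"}, {"f1", "f4"}]
    pyramid = build_pyramid(reports)
    # pyramid[3] == {"f1"}: the top tier holds the fact present in all three reports.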

Page 11:

Scoring

Let Tj refer to the jth tier in a pyramid of facts. If the pyramid has n tiers, then Tn is the top-most tier and T1, the bottom-most. The score for a report with X facts is:

where j is equal to the index of the lowest tier an optimally informative report will draw from.
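The formula itself appeared as an image on the slide and did not survive transcription. A reconstruction that is consistent with the definitions above and with the published Pyramid method (Nenkova and Passonneau) is given below in LaTeX; read it as a restatement of that method rather than a verbatim copy of the slide. The maximum attainable weight for a report with X facts is

    \mathrm{Max} = \sum_{i=j+1}^{n} i \cdot |T_i| \; + \; j \cdot \Bigl( X - \sum_{i=j+1}^{n} |T_i| \Bigr),
    \qquad j = \max_{i} \Bigl( \sum_{t=i}^{n} |T_t| \ge X \Bigr),

and the report's score is its observed weight \sum_{i=1}^{n} i \cdot D_i (where D_i is the number of its facts that appear in tier T_i) divided by Max.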

Page 12:

Page 13:

Scoring

It has been observed that report length has a significant effect on evaluation results.

We restricted the length of reports to be no longer than one standard deviation above the mean, and we truncated all question answers to a length of eight content units, which was the third quartile of lengths of all answers.
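As a small illustration of how these two cutoffs can be computed, the sketch below derives the mean-plus-one-standard-deviation cap on report length and the third-quartile cap on answer length. The numbers are made up and the helper names are not from the paper.

    import statistics

    def report_word_cap(report_lengths: list[int]) -> float:
        # Reports longer than one standard deviation above the mean get truncated.
        return statistics.mean(report_lengths) + statistics.stdev(report_lengths)

    def answer_unit_cap(answer_unit_counts: list[int]) -> float:
        # Answers get truncated to the third quartile of content-unit counts.
        q1, q2, q3 = statistics.quantiles(answer_unit_counts, n=4)
        return q3

    # Hypothetical data: word counts per report, content units per answer.
    report_lengths = [102, 340, 510, 760, 1525]
    answer_units = [3, 5, 6, 8, 9, 12]
    print(report_word_cap(report_lengths))  # length cutoff for reports
    print(answer_unit_cap(answer_units))    # unit cutoff for answers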

Page 14:

RESULTS

We measured the effect of summaries in three ways:

1. By scoring the reports and comparing scores across summary conditions

2. By comparing user satisfaction per summary condition

3. By comparing where subjects preferred to draw report content from, measured by counting the citations they inserted following each extracted fact.

Page 15:

Page 16:

Content Score Comparison

The differences between the scores in Table 1 are not significant (p = 0.3760, ANOVA).

This may have been due to the fact that the event clusters for Geneva contained more editorials with less “hard” news, while the clusters for Hurricane Ivan contained more breaking news reports.
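For readers who want to reproduce this kind of comparison, a one-way ANOVA across summary conditions can be run as sketched below; the score lists are invented placeholders, not the study's data, and scipy.stats.f_oneway is simply one standard way to obtain the F statistic and p-value.

    from scipy.stats import f_oneway

    # Invented placeholder content scores, one list per summary condition.
    no_summary      = [0.31, 0.28, 0.35, 0.40]
    minimal_summary = [0.33, 0.36, 0.30, 0.41]
    newsblaster     = [0.38, 0.35, 0.42, 0.39]
    human_summary   = [0.40, 0.37, 0.44, 0.41]

    f_stat, p_value = f_oneway(no_summary, minimal_summary, newsblaster, human_summary)
    print(f"F = {f_stat:.3f}, p = {p_value:.4f}")  # p > 0.05: no significant difference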

Page 17:

Content Score Comparison

After removing the Geneva Accord scenario scores, Newsblaster summaries are significantly better than documents only (no summaries).

The differences between Newsblaster and minimal or human summaries are not significant.

Page 18:

User Satisfaction

Page 19:

Citation patterns

Page 20:

DISCUSSION

When we developed Newsblaster, we speculated that summaries could help users find necessary information in the following ways:

1. They may find the information they need in the summaries themselves, thus requiring them to read less of the full articles.

2. The summaries may link them directly to the relevant articles and the positions within the articles where the relevant information occurs.

Page 21:

DISCUSSION

There were two problems with the interface.

First, the interface for Summary Level 2 identified individual articles with a title and a one-sentence summary, but Level 3 only had titles for each article.

Second, the interface for Summary Level 2 showed the list of individual articles on the same Web page as the cluster, but Level 3 showed the summary and cluster title on the same page and required the subject to click on the cluster title to see the list of individual articles.

Page 22:

DISCUSSION

Another problem that we noted was that reports written by subjects were of widely varying length.

Reports varied from 102 words to 1525 words. We adjusted for this by truncating reports.

Lengthy reports tended to have more duplication of facts, which clearly makes for less effective reports.

The impact of truncating reports requires a follow-up study.

Page 23:

CONCLUSIONS

Our answer to the question, "Do summaries help?", is clearly yes. Our results show that subjects produce better-quality reports using a news interface with Newsblaster summaries than with no summaries.

Users are also more satisfied with multi-document summaries than with minimal one-sentence summaries such as those used by commercial online news systems.

Interface design, report length, and scenario design all affect task completion.