
Summarizing Email Conversations with Clue Words

Giuseppe Carenini, Raymond T. Ng, Xiaodong Zhou (WWW ’07)

Advisor: Dr. Koh Jia-Ling

Speaker: Tu Yi-Lang

Date: 2008.08.01


Outline

• Introduction
• Building The Fragment Quotation Graph
  • Creating nodes
  • Creating edges
• Email Summarization Methods
  • CWS
  • MEAD
  • RIPPER
• Result 1: User Study
• Result 2: Evaluation Of CWS


Introduction

With the ever-increasing popularity of email, email overload has become a major problem for many email users.

This paper proposes a different form of support: email summarization.

The goal is to provide a concise, informative summary of emails contained in a folder, thus saving the user from browsing through each email one by one.


The summary is intended to be multi-granularity in that the user can specify the size of the concise summary.

Email summarization can also be valuable for users reading emails on mobile devices. Given the small screen size of handheld devices, efforts have been made to redesign the user interface.


Many emails are asynchronous responses to some previous messages and as such they constitute a conversation, which may be hard to reconstruct in detail.

A conversation may involve many users, many of whom may have different writing styles.

A hidden email is an email that is quoted by at least one email in the folder but is not itself present in the user’s folders. A hidden email may carry important information that should be part of the summary.


Building The Fragment Quotation Graph

We assume that if one email quotes another email, they belong to the same conversation.

A fragment quotation graph is used to represent conversations. The graph G = (V, E) is a directed graph, where each node u ∈ V is a text unit in the email folder, and an edge (u, v) means node u is in reply to node v.


Identifying quoted and new fragments:

We assume that there exist one or more quotation markers (e.g., “>”) that are used as a prefix of every quoted line.

Quotation depth: the number of quotation markers “>” in the prefix. It reflects the number of times that this line has been quoted since the original message containing this line was sent.


Quoted fragment: a maximally contiguous block of quoted lines having the same quotation depth.

New fragment: a maximally contiguous block of lines that are not prefixed by the quotation markers.

An email can be viewed as an alternating sequence of quoted and new fragments.
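The fragment definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: “>” is taken as the sole quotation marker, blank lines are ignored, and the function names (`quotation_depth`, `split_fragments`) are hypothetical, not taken from the paper.

```python
import re
from itertools import groupby

# Leading run of ">" markers, optionally separated by spaces (e.g. "> > text").
_QUOTE_PREFIX = re.compile(r"^\s*((?:>\s*)+)")


def quotation_depth(line: str) -> int:
    """Count the ">" markers that prefix a line (0 for a new, unquoted line)."""
    match = _QUOTE_PREFIX.match(line)
    return match.group(1).count(">") if match else 0


def strip_markers(line: str) -> str:
    """Remove the quotation prefix so fragments can be compared by content."""
    return _QUOTE_PREFIX.sub("", line, count=1).strip()


def split_fragments(body: str) -> list[tuple[int, str]]:
    """Return (depth, text) pairs, one per maximal run of equal-depth lines.

    Depth 0 marks a new fragment; depth >= 1 marks a quoted fragment.
    Blank lines are dropped (an assumption of this sketch).
    """
    lines = [ln for ln in body.splitlines() if ln.strip()]
    fragments = []
    for depth, run in groupby(lines, key=quotation_depth):
        fragments.append((depth, " ".join(strip_markers(ln) for ln in run)))
    return fragments
```

For instance, the body `"> > a\n> b\nnew\nmore new\n> c"` yields one depth-2 fragment, one depth-1 fragment, one new fragment spanning two lines, and a final depth-1 fragment, which matches the alternating-sequence view of an email.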


Creating nodes:

Given an email folder Fdr = {M1, . . . , Mn}, the first step is to identify distinct fragments, each of which will be represented as a node in the graph.

Quoted and new fragments from all emails in Fdr are matched against each other to identify overlaps.

There is an overlap between two fragments if they share a common substring that is sufficiently long, i.e., at least as long as a given overlap threshold.
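A minimal sketch of the overlap test, assuming character-level matching and a hypothetical threshold of 20 characters (the slides do not give the threshold value or the matching unit). It uses the standard library's `difflib.SequenceMatcher` to find the longest common substring.

```python
from difflib import SequenceMatcher


def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous run of characters shared by a and b."""
    # autojunk=False disables difflib's heuristic that ignores frequent characters.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


def overlaps(a: str, b: str, threshold: int = 20) -> bool:
    """True if the fragments share a substring of at least `threshold` characters.

    The default threshold is an illustrative assumption.
    """
    return len(longest_common_substring(a, b)) >= threshold
```

Fragments that overlap in this sense would then be split or merged so that each distinct piece of text becomes one node; that step is not shown here.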


Creating edges:

We assume that any new fragment is a potential reply to its neighboring quotations, i.e., the quoted fragments immediately preceding or following it.

Because of possible differences in quotation depth, a block may contain multiple fragments.

For the general situation when a quoted block QSp precedes a new block NS, which is then followed by a quoted block QSf, we create an edge (v, u) for each fragment u ∈ (QSp ∪ QSf) and v ∈ NS.
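The neighboring-quotation rule can be sketched as follows. The block representation (an ordered list of `(kind, fragment_ids)` pairs) and the name `reply_edges` are assumptions made for illustration; the paper's own data structures are not shown on the slides.

```python
def reply_edges(blocks):
    """Build edges (v, u) meaning "new fragment v replies to quoted fragment u".

    blocks: ordered list of (kind, fragment_ids) for one email, where kind is
    "quoted" or "new" and fragment_ids lists the fragments in that block.
    """
    edges = set()
    for i, (kind, new_fragments) in enumerate(blocks):
        if kind != "new":
            continue
        neighbours = []
        # QSp: the quoted block immediately preceding the new block, if any.
        if i > 0 and blocks[i - 1][0] == "quoted":
            neighbours += blocks[i - 1][1]
        # QSf: the quoted block immediately following the new block, if any.
        if i + 1 < len(blocks) and blocks[i + 1][0] == "quoted":
            neighbours += blocks[i + 1][1]
        edges.update((v, u) for v in new_fragments for u in neighbours)
    return edges
```

With blocks `[("quoted", ["a", "b"]), ("new", ["c"]), ("quoted", ["d"]), ("new", ["e"])]`, fragment `c` gets edges to `a`, `b`, and `d`, while `e` only has a preceding quoted block and gets a single edge to `d`.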


For a hidden fragment, additional edges are created within the quoted block, following the same neighboring quotation assumption.

The minimum equivalent graph, which is transitively equivalent to the original graph, is used as the fragment quotation graph.

In the fragment quotation graph, each conversation is reflected as a weakly connected component.
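The two graph operations named above can be sketched in plain Python. The sketch assumes the reply graph is acyclic, in which case the minimum equivalent graph coincides with the transitive reduction; the function names are illustrative only.

```python
from collections import defaultdict


def transitive_reduction(edges):
    """Drop every edge (u, v) for which v stays reachable from u without it.

    Assumes the graph is a DAG, so the result is its minimum equivalent graph.
    """
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)

    def reachable(start, target, skip_edge):
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            for nxt in succ[node]:
                if (node, nxt) == skip_edge or nxt in seen:
                    continue
                if nxt == target:
                    return True
                seen.add(nxt)
                stack.append(nxt)
        return False

    return {(u, v) for u, v in edges if not reachable(u, v, (u, v))}


def weakly_connected_components(edges):
    """Group nodes into components, ignoring edge direction."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in list(adj):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in adj[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        components.append(component)
    return components
```

For the edges `c→b`, `b→a`, `c→a`, `e→d`, the reduction removes the redundant `c→a`, and the remaining graph splits into two conversations, `{a, b, c}` and `{d, e}`.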


Email Summarization Methods

Clue words:

A clue word in node (fragment) F is a word which also appears in a semantically similar form in a parent or a child node of F in the fragment quotation graph.

In this paper, only stemming is applied to the identification of clue words: Porter’s stemming algorithm computes the stem of each word, and the stems are used to judge reoccurrence.


[Figure: fragments (a) and (b) are two adjacent nodes, with (b) as the parent node of (a).]


Three major kinds of reoccurrence are observed:

• The same root (stem) with different forms, e.g., “settle” vs. “settlement” and “discuss” vs. “discussed” as in the example above.

• Synonyms/antonyms or words with similar/contrary meaning, e.g., “talk” vs. “discuss” and “peace” vs. “war”.

• Words that have a looser semantic link, e.g., “deadline” with “Friday morning”.


Algorithm CWS:

Algorithm ClueWordSummarizer (CWS) uses clue words as the main feature for email summarization.

The assumption is that if those words reoccur between parent and child nodes, they are more likely to be relevant and important to the conversation.


The significance of the clue words is evaluated first for each clue word and then at the sentence level, giving ClueScore(s).
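The scoring formulas did not survive on this slide, so the following is only a sketch under explicit assumptions: the score of a clue word is taken to be the number of times its stem occurs in the parent and child fragments, and the sentence score is the sum over the distinct stems in the sentence. `crude_stem` is a toy suffix stripper standing in for Porter's stemmer, and the stop-word list is hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for the illustration.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to", "we"}


def crude_stem(word: str) -> str:
    """Toy suffix stripper used in place of Porter's stemmer (an assumption)."""
    word = word.lower()
    for suffix in ("ment", "ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stems(text: str) -> list[str]:
    """Stems of the non-stop words in text."""
    words = re.findall(r"[A-Za-z]+", text)
    return [crude_stem(w) for w in words if w.lower() not in STOP_WORDS]


def clue_score(sentence: str, neighbours: list[str]) -> int:
    """Assumed sentence score: for each distinct stem in the sentence, add its
    frequency in the parent and child fragments given in `neighbours`."""
    neighbour_counts = Counter()
    for fragment in neighbours:
        neighbour_counts.update(stems(fragment))
    return sum(neighbour_counts[s] for s in set(stems(sentence)))
```

With the parent fragment "We should discuss the settlement", the sentence "They discussed how to settle" shares the stems of “discuss” and “settle”, so it scores 2 in this sketch; a sentence with no reoccurring stems scores 0.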


[Figure: worked ClueScore example; the surviving label reads ClueScore(s) = 3.]


MEAD: A centroid-based multi-document summarizer.

MEAD computes the centroid of all emails in one conversation. A centroid is a vector of words’ average TFIDF values in all documents.

MEAD compares each sentence s with the centroid and assigns it a score as the sum of all the centroid values of the common words shared by the sentence s and the centroid.


All sentences are ranked based on the MEADScore, and the top ones are included in the summary.
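The centroid scoring and ranking described above can be sketched as follows. This is not MEAD's actual implementation: the IDF is computed from the conversation's own emails (an assumption, since the slides do not say which corpus supplies it), tokenization is a simple regular expression, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def centroid(documents):
    """Average TF-IDF value of each word over all documents."""
    n = len(documents)
    term_counts = [Counter(tokenize(doc)) for doc in documents]
    doc_freq = Counter(word for counts in term_counts for word in counts)
    totals = Counter()
    for counts in term_counts:
        for word, tf in counts.items():
            totals[word] += tf * math.log(n / doc_freq[word])
    return {word: total / n for word, total in totals.items()}


def mead_score(sentence, centroid_vec):
    """Sum of centroid values of the words shared by the sentence and the centroid."""
    return sum(centroid_vec.get(w, 0.0) for w in set(tokenize(sentence)))


def top_sentences(sentences, centroid_vec, k):
    """Rank sentences by score and keep the k best."""
    return sorted(sentences, key=lambda s: mead_score(s, centroid_vec), reverse=True)[:k]
```

Because the centroid is built from every email in the conversation, a sentence can score well without sharing any word with its direct parent or child fragments, which is the sense in which this score is more “global” than ClueScore.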

Compared with MEAD, CWS may appear to use more “local” features, and MEAD may capture more “global” salience related to the whole conversation.


Hybrid ClueScore with MEADScore:

ClueScore and MEADScore tend to represent different aspects of the importance of a sentence, so it is natural to combine both methods.

A linear combination of ClueScore and MEADScore is used:

• LinearClueMEAD(s) = α ∗ ClueScore(s) + (1 − α) ∗ MEADScore(s), where α ∈ [0, 1].
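A small sketch of the linear combination, mainly to make the role of α concrete. The slide does not say whether the two scores are rescaled before mixing, so this sketch combines them exactly as given; the function name is hypothetical.

```python
def linear_clue_mead(clue_score: float, mead_score: float, alpha: float) -> float:
    """alpha * ClueScore(s) + (1 - alpha) * MEADScore(s), with alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * clue_score + (1.0 - alpha) * mead_score
```

Setting α = 1 reduces the hybrid to pure CWS and α = 0 to pure MEAD; intermediate values trade off the two signals.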


RIPPER: The RIPPER system is used to induce a classifier for determining whether a given sentence should be included in the summary.

RIPPER is trained on a corpus in which each sentence is described by a set of 14 features and annotated with the correct classification (i.e., whether it should be included in the summary or not).

• 8 “basic” linguistic features
• 2 “basic+” features
• 4 “basic++” features


Result 1: User Study

Dataset setup:

We collected 20 email conversations from the Enron email dataset as the testbed and recruited 25 human summarizers to review them.

The email threads can be divided into two types: single chains and thread hierarchies (trees).

We randomly selected 4 single chains and 16 trees.


Each summarizer reviewed 4 distinct conversations in one hour, and each email conversation was reviewed by 5 different human summarizers.

The generated summary contained about 30% of the original sentences.

The human summarizers were asked to classify each selected sentence as either essential or optional.


Result of the user study:

We assign a GSValue to each sentence to evaluate its importance according to the human summarizers’ selections. For each sentence s, one essential selection has a score of 3 and one optional selection has a score of 1.

Out of the 741 sentences in the 20 conversations, 88 are overall essential sentences, which is about 12% of all sentences.
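The GSValue arithmetic can be sketched as below, assuming the value is simply the sum of the per-selection scores (3 per essential vote, 1 per optional vote) over the 5 summarizers who reviewed the conversation. The function name and the vote-count check are illustrative additions.

```python
def gs_value(essential: int, optional: int, reviewers: int = 5) -> int:
    """Gold-standard value of a sentence: 3 per essential vote, 1 per optional vote."""
    if essential < 0 or optional < 0 or essential + optional > reviewers:
        raise ValueError("vote counts must be non-negative and at most `reviewers` in total")
    return 3 * essential + 1 * optional


# Share of overall essential sentences reported on the slide: 88 of 741.
essential_share = 88 / 741
```

Under this reading, a sentence marked essential by all 5 reviewers gets the maximum value of 15, and 88 / 741 is roughly 0.119, consistent with the "about 12%" quoted above.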


When all sentences are sorted by their GSValue, about 17% of the top-30% sentences are from hidden emails. Among the 88 overall essential sentences, about 18% are from hidden emails.

The average ClueScore of overall essential sentences in the gold standard is compared with the average ClueScore of all other sentences.


Result 2: Evaluation Of CWS

Comparing CWS with MEAD and RIPPER:

• Sentence-level training.
• Conversation-level training.


Comparing CWS with MEAD: [figure not recovered]


Hybrid methods: [figure not recovered; surviving axis label: “100% CWS”]


Conclusion

In this paper, we study how to generate accurate email summaries and build a novel structure, the fragment quotation graph, to represent the conversation structure.

The experiments with the Enron email dataset not only indicate that hidden emails have to be considered for email summarization, but also show that CWS can generate more accurate summaries than the other methods.