Motivation



Effective Access Over Public Email Conversations

William Lee, Hui Fang and Yifan Li

University of Illinois at Urbana-Champaign

Motivation

• Information within newsgroups or mailing lists has largely been underutilized.
• For now, access to those data is restricted to traditional searching and browsing.

Clustering

• Goal:
  – Find commonly discussed topics from a set of conversations (threads)
• Method:
  – Use agglomerative clustering with complete link
  – Learn similarity functions from different “perspectives” of threads:
    • authors, date, subject, contents, contents without quote, first message, reply, reply without quote.
    • Use Linear and Logistic Regression to learn the combined similarity function
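The last step of the method, learning a combined similarity function from the per-perspective similarities, could look roughly like the sketch below. It covers the logistic regression variant only. The three perspectives used (subject, authors, contents), the toy training pairs, the labels and the hyperparameters are all made-up assumptions for illustration, not the authors' data or settings.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(weights, bias, sims):
    """Combined similarity in (0, 1) for one pair of threads."""
    return sigmoid(bias + sum(w * s for w, s in zip(weights, sims)))

def train_logistic(rows, labels, lr=0.5, epochs=500):
    """Plain stochastic gradient ascent on the logistic log-likelihood."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        for sims, label in zip(rows, labels):
            error = label - score(weights, bias, sims)
            weights = [w + lr * error * s for w, s in zip(weights, sims)]
            bias += lr * error
    return weights, bias

# Hypothetical training data: one row per pair of threads, holding the
# (subject, authors, contents) similarities of that pair.
rows = [
    (0.9, 0.5, 0.8), (0.8, 0.0, 0.7), (0.7, 1.0, 0.6),  # same topic
    (0.1, 0.5, 0.2), (0.0, 0.0, 0.1), (0.2, 1.0, 0.1),  # different topics
]
# 1 = the human taggers put both threads under the same topic.
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(rows, labels)
similar_pair = score(weights, bias, (0.85, 0.2, 0.9))
dissimilar_pair = score(weights, bias, (0.05, 0.8, 0.1))
```

The learned `score` can then serve as the pairwise similarity that the agglomerative clustering step consumes.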

Conclusion

• Propose two ways to access the public email archive
  – Clustering
  – Summarization
• Use a conversation map to visualize the clustering result
• Future Work
  – Derive better algorithms to learn the similarity function
  – Faster clustering algorithms that work on mining patterns in conversations
  – LM approach to email summarization

• Experiment Design
  – Data: 3 CS class newsgroups from UIUC
  – Judgment file: manually created by three different human taggers
  – Methodology: 3-way cross-validation, using one group’s judgment file as the training set and testing on the other two.
  – Evaluation: Use overall entropy as the comparison metric
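The poster does not spell out how overall entropy is computed. A minimal sketch, assuming the common definition (the entropy of the human-assigned topic labels inside each cluster, averaged over clusters and weighted by cluster size), is shown below. The thread identifiers and topic labels are hypothetical.

```python
import math
from collections import Counter

def overall_entropy(clusters, gold_label):
    """Size-weighted average of per-cluster label entropy (lower is better).

    clusters: list of non-empty lists of thread ids.
    gold_label: dict mapping thread id to the topic a human tagger assigned.
    """
    total = sum(len(cluster) for cluster in clusters)
    result = 0.0
    for cluster in clusters:
        counts = Counter(gold_label[t] for t in cluster)
        entropy = -sum(
            (n / len(cluster)) * math.log2(n / len(cluster))
            for n in counts.values()
        )
        result += (len(cluster) / total) * entropy
    return result

# Hypothetical judgment file: two topics, two threads each.
gold = {"t1": "debugging", "t2": "debugging", "t3": "exam", "t4": "exam"}

pure = overall_entropy([["t1", "t2"], ["t3", "t4"]], gold)   # clusters match topics
mixed = overall_entropy([["t1", "t3"], ["t2", "t4"]], gold)  # every cluster is half/half
```

Under this definition, a clustering that agrees with the taggers scores 0, and one that mixes two topics evenly in every cluster scores 1 bit.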

• Experiment Result

[Figure: existing technologies (Search, Browse).]

[Figure: thread layout. Each thread (Thread 1, Thread 2) is shown with its Subject, Authors and Date, followed by its First Message, First Reply, Second Reply and Third Reply.]

• Visualization: Conversation Map
  – Derived from Treemap
  – Clusters sorted by two time dimensions
  – Allows the user to adjust the similarity threshold (“zoom” to the more similar threads)
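One way to read the adjustable threshold is as the stopping point of complete-link agglomerative clustering: raising it leaves only the most similar threads grouped together. The sketch below illustrates that idea only and is not the Conversation Map implementation. The pairwise similarities are invented, and `complete_link` is a hypothetical helper name.

```python
from itertools import combinations

# Hypothetical pairwise similarities between four threads (symmetric).
SIM = {
    (0, 1): 0.9, (0, 2): 0.5, (0, 3): 0.1,
    (1, 2): 0.4, (1, 3): 0.2,
    (2, 3): 0.1,
}

def sim(i, j):
    return SIM[(min(i, j), max(i, j))]

def complete_link(n, sim, threshold):
    """Agglomerative clustering; stops when no pair of clusters reaches threshold."""
    clusters = [[i] for i in range(n)]

    def link(a, b):
        # Complete link: a pair of clusters is only as similar as
        # its least similar pair of members.
        return min(sim(i, j) for i in clusters[a] for j in clusters[b])

    while len(clusters) > 1:
        a, b = max(combinations(range(len(clusters)), 2), key=lambda p: link(*p))
        if link(a, b) < threshold:
            break
        clusters[a] += clusters[b]
        del clusters[b]  # safe because a < b
    return clusters

loose = complete_link(4, sim, threshold=0.3)  # coarse view
tight = complete_link(4, sim, threshold=0.8)  # "zoomed" view
```

With these made-up numbers the loose threshold groups threads 0, 1 and 2, while the tight one keeps only threads 0 and 1 together.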

• How to access this information more effectively?
  – Clustering
  – Summarization

Summarization

• Goal:
  – Find the gist of a conversation (i.e. a thread)
• Observation: Different types of conversation need different summarization methods
  – Question-driven conversation
  – Announcement-driven conversation

• Solution to Question-Driven summarization
  – Key observation
    • The question plays an important role
  – Method
    • Identify the question
    • Detect topic shifts during the conversation
    • Divide the conversation into segments based on the topic shifts
    • Store all the segments containing the question
    • Remove segments with a redundant question
    • Return the remaining segments
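The steps above can be sketched as follows. The poster does not say how questions, topic shifts or redundancy are detected, so this sketch substitutes simple stand-ins: a question is any sentence ending in "?", a topic shift is low word overlap between consecutive messages, and a question is redundant when it overlaps heavily with one already kept. Both thresholds and the sample messages are hypothetical.

```python
import re

def words(text):
    return set(re.findall(r"[a-z']+", text.lower()))

def overlap(a, b):
    """Jaccard overlap between the word sets of two texts."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def questions(text):
    """Stand-in question detector: sentences that end with '?'."""
    return [q.strip() for q in re.findall(r"[^.!?]*\?", text)]

def summarize_question_thread(messages, shift_threshold=0.1, dup_threshold=0.6):
    if not messages:
        return []
    # Divide the conversation into segments at detected topic shifts.
    segments = [[messages[0]]]
    for prev, cur in zip(messages, messages[1:]):
        if overlap(prev, cur) < shift_threshold:
            segments.append([cur])
        else:
            segments[-1].append(cur)
    # Keep segments that contain a question, skipping redundant questions.
    kept, seen = [], []
    for segment in segments:
        found = [q for message in segment for q in questions(message)]
        if not found:
            continue
        if any(overlap(q, s) >= dup_threshold for q in found for s in seen):
            continue
        seen.extend(found)
        kept.append(segment)
    return kept

thread = [
    "Why does DDD show nil when I dereference the array pointer?",
    "DDD can show nil for the array pointer when the type is unknown.",
    "Lunch is at noon today.",
    "Why does DDD show nil when I dereference my array pointer?",
]
summary = summarize_question_thread(thread)
```

Here the first two messages form the returned segment; the off-topic message has no question, and the final message repeats the original question and is dropped.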

• Solution to Announcement-Driven summarization
  – Key observation
    • The subject plays an important role
    • Threads with similar subjects may have common pivot words in their summaries
  – Method
    • Training stage
      – Cluster the threads by the similarity of their subjects
      – Find frequent words (pivot words) in the summaries
      – Extend the subjects by combining them with the corresponding pivot words
    • Testing stage
      – Given a thread, find a training subject similar to the current one
      – Find the pivot words associated with that subject and use them to extend the current one
      – Select the sentences similar to the extended subject as the summary
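A rough sketch of the two stages is given below. It makes several simplifying assumptions that are not from the poster: subjects are grouped in a single greedy pass instead of a full clustering, similarity is Jaccard overlap of tokens, and the stop-word list, thresholds and training examples are invented for illustration.

```python
import re
from collections import Counter

STOP = {"the", "a", "is", "in", "at", "and", "of", "to", "be", "will"}

def tokens(text):
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP]

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def train(examples, group_threshold=0.2, top_k=2):
    """examples: (subject, reference summary) pairs.
    Returns one (subject words, pivot words) pair per subject group."""
    groups = []
    for subject, summary in examples:
        subj = set(tokens(subject))
        for group in groups:
            if jaccard(subj, group["subject"]) >= group_threshold:
                break
        else:  # no similar group yet, so start a new one
            group = {"subject": set(), "counts": Counter()}
            groups.append(group)
        group["subject"] |= subj
        group["counts"].update(tokens(summary))
    return [
        (g["subject"], [w for w, _ in g["counts"].most_common(top_k)])
        for g in groups
    ]

def summarize(subject, sentences, model, n=2):
    subj = set(tokens(subject))
    # Find the most similar training subject and borrow its pivot words.
    _, pivots = max(model, key=lambda g: jaccard(subj, g[0]))
    extended = subj | set(pivots)
    ranked = sorted(sentences, key=lambda s: jaccard(tokens(s), extended), reverse=True)
    return ranked[:n]

# Hypothetical training threads with hand-written summaries.
model = train([
    ("Midterm Exam", "TIME: 7pm. PLACE: SC 1404."),
    ("Exam 2 information", "TIME: 2pm. PLACE: DCL 1320."),
    ("MP3 released", "Due Friday. Handin via svn."),
])
body = [
    "Good luck to everyone.",
    "TIME: 1:30-4:30pm",
    "PLACE: Regular classroom (SC 1404)",
    "Topics are cumulative.",
]
summary = summarize("Final Exam", body, model)
```

In this toy run the exam-related training subjects contribute the pivot words "time" and "place", so the TIME and PLACE lines are selected as the summary.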

• Experiment Design
  – Data: 3 CS class newsgroups from UIUC
  – Judgment file: manually created by two different human taggers
  – Evaluation: Precision and Recall, or a user study
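For the precision and recall option, a summary can be scored against a tagger's judgment by comparing the set of selected sentences with the set the tagger marked as belonging in the summary. The sketch below assumes sentences are identified by index; the sample indices are hypothetical.

```python
def precision_recall(selected, relevant):
    """selected: sentence ids chosen by the summarizer.
    relevant: sentence ids a human tagger marked as summary-worthy."""
    selected, relevant = set(selected), set(relevant)
    hits = len(selected & relevant)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# The summarizer picked sentences 1-4; the tagger marked 2, 4 and 5.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 4, 5})
```

Two of the four selected sentences are correct (precision 0.5), and two of the three marked sentences were found (recall 2/3).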

• Examples of Experiment Results

Question-Driven

Subject: MP7: Viewing Array in debugger

From: Scott Stephens <[email protected]> on Sat, 04 Dec 2004 22:49:06 -0600

I've been debugging things using DDD with dbx, but I'm running into a weird problem. My PatriciaTree class is basically a wrapper around a root pointer, so I observe that pointer, and dereference it to give me my first PatriciaNode, and then look at all that's inside that, and one of those things is a pointer to "data" within the Array object that's embedded in my PatriciaNode. An address shows up fine, but I'd like to dereference it to take a look at the actual array, so I can continue to examine the structure of my tree. But when I dereference it in DDD, i just shows up as "(nil)". I know for a fact that there's valid data in that array somewhere, because I can access it in my program, I just can't look at it in the debugger. Anybody have any ideas? It'd be really nice to look at that info for debugging purposes.

-Scott

Announcement-Driven

SUBJECT: Final Exam - PLEASE READ

TIME: 1:30-4:30pm

PLACE: Regular classroom (SC 1404)

TOPICS: The exam is cumulative but with emphasis on the material after the midterm. Most important from earlier material are general techniques: divide-and-conquer, greedy, dynamic programming, randomization. Also topics like MST and shortest paths that have reappeared after the midterm.

Student having more than two consecutive examinations: No student should be required to take more than two consecutive final examinations.
• In a semester, this means that a student taking a final examination at 8:00 a.m. and another at 1:30 p.m. on the same day cannot be required to take an examination that same evening.
• However, the student could be required to take an examination beginning at 8:00 a.m. the next day ...