BuzzTrack: Topic Detection and Tracking in Email
IUI – Intelligent User Interfaces, January 2007
Keno Albrecht, ETH Zurich
Roger Wattenhofer, ETH Zurich
Gabor Cselle, Google
2
Email Overload
• Email clients were not designed to handle the volume and variety of messages users deal with today:
  • Large volumes of email
  • Task management
  • Personal archiving or filing
  • Keeping context
[Whittaker and Sidner, 1996]
3
Search vs. Inbox Browsing
• Fast full-text search is today's solution for finding past emails.
• But the flat inbox view of newly incoming emails hasn’t changed.
In our work, we focus on the problem of sensibly structuring emails in the inbox.
4
Today's Email Clients: The Three-Pane View
No sense of context: unrelated messages are shown together
Important emails may drop off the “first screen”
“Thread-based” tree views are unsophisticated and may not pull in all relevant messages.
5
BuzzTrack
Email client extension for Mozilla Thunderbird that displays email grouped by topic.
6
Related Work
7
Visualizations: Conversations
Gmail (Google)
common conversation title
one entry per email, folds out on click
8
Automatic Foldering
• Using machine learning techniques to automatically move emails into folders upon arrival
• Low accuracy rates [Bekkerman et al., 2005], conceptual problems:
  • Users need to manually create folders and seed them with data.
9
People-Centered Email Clients
Bifrost [Bälter and Sidner, 2002]
ContactMap [Whittaker et al., 2004]
10
Task-based Email
Example: TaskMaster [Bellotti et al., 2003]
[Screenshot labels: thrasks; thrask contents; item contents (emails, documents, etc.)]
11
BuzzTrack
12
BuzzTrack
• Mozilla Thunderbird extension that automatically groups related emails into topics.
• Will be distributed through the website www.buzztrack.net
• Provides a view on the user’s inbox.
13
What’s a Topic?
• Topics are groups of emails that relate to the same idea, action, event, task, or question.
• Examples:
  • A conversation about buying a digital camera.
  • Referring a candidate for a job.
  • All emails belonging to the same newsgroup.
14
Clustering Process
• For every new incoming email:
[Diagram: Preprocessing, Clustering, Label generation, Cluster store, BuzzTrack View in Thunderbird]
15
Preprocessing
• Tokenization (remove HTML tags, style sheets, punctuation, and numbers)
• Language detection
• Stemming
• For topic labelling:
  • Identify parts of speech
  • Remember popular original word forms
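The tokenization step can be sketched as follows. This is a minimal illustration, not the BuzzTrack implementation: the function name `tokenize` and the regular expressions are assumptions, and language detection, stemming, and part-of-speech tagging are left out.

```python
import re

# Assumed patterns for this sketch; real email HTML may need a proper parser.
STYLE_RE = re.compile(r"<style\b.*?</style>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters only: digits and punctuation are dropped


def tokenize(body: str) -> list[str]:
    """Return lowercase word tokens from an email body."""
    text = STYLE_RE.sub(" ", body)  # drop embedded style sheets first
    text = TAG_RE.sub(" ", text)    # then any remaining HTML tags
    return [word.lower() for word in WORD_RE.findall(text)]
```

Style sheets are removed before tags so that CSS rules between `<style>` and `</style>` do not leak into the token stream.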
16
Clustering
• Single-link clustering: Newly incoming emails are compared to every email in existing topics:
  • Similarity value > threshold: assigned to topic
  • Similarity value <= threshold: email starts new topic
[Diagram: a new email compared against Topic 1, Topic 2, and Topic 3]
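The assignment rule above can be sketched like this. The function name `assign_to_topic` and the list-of-lists topic store are hypothetical, and choosing the highest-scoring topic when several exceed the threshold is an assumption of this sketch rather than something the slides specify.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def assign_to_topic(
    email: T,
    topics: List[List[T]],
    similarity: Callable[[T, T], float],
    threshold: float,
) -> int:
    """Add email to the best-matching topic or start a new one; return the topic index."""
    best_index, best_score = -1, float("-inf")
    for index, topic in enumerate(topics):
        # Single-link: a topic scores as high as its most similar member.
        score = max(similarity(email, member) for member in topic)
        if score > best_score:
            best_index, best_score = index, score
    if best_index >= 0 and best_score > threshold:
        topics[best_index].append(email)
        return best_index
    topics.append([email])  # similarity <= threshold: the email starts a new topic
    return len(topics) - 1
```

Every topic is assumed to hold at least one email, which is true when topics are only ever created through this function.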
17
Features - 1
• How do we generate similarity values between emails?
• Via a linear combination of several similarity features.
• Examples:
  • Text similarity (TF-IDF values, cosine similarity metric)
  • People similarities (comparing sets of people in the From / To / Cc lines of email headers)
  • Thread membership
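Two of these features can be sketched in a few lines. The TF-IDF weighting (raw term count times log of inverse document frequency) and the Jaccard overlap used for people similarity are common choices assumed here for illustration; the slides do not state the exact formulas, and all function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a {term: tf * idf} dictionary."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n / doc_freq[term]) for term, count in Counter(doc).items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def people_similarity(a, b):
    """Jaccard overlap between two sets of addresses (From / To / Cc)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0
```

Emails that share no terms score 0.0 on text similarity, so a shared set of correspondents can still pull them into the same topic once the features are combined.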
18
Features - 2
Other features for deriving similarities:
• Subject similarity
• Sender domain overlaps
• Sender rank and percentage
• % of email from sender that is answered
• Time passed since last email in topic
• People and reference count for email
• Known people and reference %
• Cluster size
• Has attachment
19
Decision Score
Similarities are combined into a decision score for each email / cluster pair through a linear combination of feature values:
$\mathrm{dec}_{i,j} = w_a \cdot \mathrm{sim}_a(m_i, C_j) + w_b \cdot \mathrm{sim}_b(m_i, C_j) + \dots$
We tested two sets of weights $w_x$, both trained on a development set of emails:
• Empirical
• Linear SVM
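As a small worked example of the linear combination, the sketch below computes a decision score from named feature values. The feature names and the weights in `EXAMPLE_WEIGHTS` are made up for illustration and are not the empirical or SVM-trained weights from this work.

```python
def decision_score(similarities, weights):
    """Weighted sum of similarity feature values for one email / cluster pair."""
    return sum(weights[name] * value for name, value in similarities.items())


# Hypothetical weights, chosen only to make the arithmetic easy to follow.
EXAMPLE_WEIGHTS = {"text": 2.0, "people": 1.0, "thread": 4.0}
```

With similarities of 0.5 for text, 1.0 for people, and 0.0 for thread, the score is 2.0 * 0.5 + 1.0 * 1.0 + 4.0 * 0.0 = 2.0, which is then compared against the clustering threshold.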
20
Evaluation
• How do we evaluate clustering quality?
• Topic Detection and Tracking (TDT) competitions by NIST, aimed at clustering news articles.
• Corpus:
21
Clustering Tasks
• The clustering task is split into subtasks:
  • New Topic Detection (NTD): Given a stream of emails, which ones start new topics?
  • Topic Tracking (TT): Given a fixed topic, which newly incoming emails belong to it?
• DET curves plot miss rate vs. false alarm rate for each possible threshold on the decision scores
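The points of such a curve can be computed by sweeping the threshold over the observed decision scores. This is a generic sketch with a hypothetical function name, not the NIST evaluation tooling: it assumes a pair is accepted when its score is strictly above the threshold, and that the input contains at least one target and one non-target.

```python
def det_points(scores, labels):
    """Return (threshold, miss_rate, false_alarm_rate) for each distinct score.

    scores: decision scores; labels: True where the pair is a real target.
    """
    targets = sum(1 for is_target in labels if is_target)
    non_targets = len(labels) - targets
    points = []
    for threshold in sorted(set(scores)):
        # A miss is a real target that was not accepted at this threshold.
        misses = sum(1 for s, t in zip(scores, labels) if t and s <= threshold)
        # A false alarm is a non-target that was accepted at this threshold.
        false_alarms = sum(1 for s, t in zip(scores, labels) if not t and s > threshold)
        points.append((threshold, misses / targets, false_alarms / non_targets))
    return points
```

Raising the threshold trades false alarms for misses, which is the trade-off the curve makes visible.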
22
Results NTD
• TDT New Topic Detection Task
• Miss: 3%, false alarm: 30%
[Figure: DET curve, with arrows marking the “better” direction on each axis]
23
Results TT
• TDT Topic Tracking Task
• Miss: 8%, false alarm: 2%
[Figure: DET curve, with arrows marking the “better” direction on each axis]
24
Comparison
• Comparable quality to TDT for news articles [NIST 2004]
• News has less metadata, email has worse text quality.
• A wide body of work exists on improving clustering performance on news; we haven’t tapped into that yet.
25
BuzzTrack View
• Mozilla Thunderbird plugin that provides a useful view on inbox data “for free”
• Topics contain email from the last 60 days
  • We’re interested in current email only
  • Reduces initial clustering time
• Each email is shown in one topic
26
27
Demo 1: BuzzTrack
28
BuzzTrack Panes
Topic pane:
• Provides additional info
• Starred topics
Email pane:
• Topics sorted by last incoming email
29
Future Work
• Distribute plugin to Thunderbird users
  • Input on possible UI improvements
  • Input on clustering quality
• Different clustering styles
  • People-based
  • Thread-based
• We hope BuzzTrack will be a valuable tool for real-world users