
Preventing Email Leaks & A Recommendation System for Email Recipients

Vitor R. Carvalho and William W. Cohen, Carnegie Mellon University

March 2007

Preventing Email Leaks

On July 6th 2001, the news agency Bloomberg.com published…

“California Governor Gray Davis’s office released data on the state’s purchases in the spot electricity market — information Davis has been trying to keep secret — through a misdirected e-mail. The e-mail, containing data on California’s power purchases yesterday, was intended for members of the governor’s staff, said Davis spokesman Steve Maviglio. It was accidentally sent to some reporters on the office’s press list, he said. Davis is fighting disclosure of state power purchases, saying it would compromise negotiations for future contracts”.

Information Leaks via Email

Email leak = an email message accidentally sent to “unintended” recipients. Typical causes:

1. Similar first/last names, aliases

2. Aggressive auto-completion of email addresses

3. Typos

4. Keyboard settings

Email leaks may contain sensitive information leading to disastrous consequences.

Detecting Email Leaks: Method

Idea:

1. Goal: to detect emails accidentally sent to the wrong person

2. Generate artificial leaks: Email leaks may be simulated by various criteria: a typo, similar last names, identical first names, aggressive auto-completion of addresses, etc.

3. Method: Look for outliers

Look for Outliers

1. Build a model for (message, recipients) pairs: train a classifier on real data to detect simulated outliers (added to the “true” recipient list).

2. Features: textual (subject, body) and network features (frequencies, co-occurrences, etc.).

3. Rank potential outliers: detect the outlier and warn the user based on the classifier’s confidence (sketched below).
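To make the ranking step concrete, here is a minimal sketch (not the authors’ code), assuming a scikit-learn-style binary classifier `clf` trained on leak/non-leak (message, recipient) pairs and a hypothetical `make_features` helper that builds the textual + network feature vector:

```python
def rank_outliers(msg, recipients, clf, make_features):
    """Rank a message's recipients from most to least likely outlier.

    clf           -- trained binary classifier with predict_proba (assumed)
    make_features -- hypothetical helper building one feature vector per
                     (message, candidate recipient, other recipients) triple
    """
    scored = []
    for rec in recipients:
        others = [r for r in recipients if r != rec]
        x = make_features(msg, rec, others)
        p_leak = clf.predict_proba([x])[0][1]  # P(rec is an outlier)
        scored.append((p_leak, rec))
    # Most likely outlier first; the client can warn the user when the
    # top score exceeds a confidence threshold.
    return sorted(scored, reverse=True)
```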

Detecting Email Leaks: Method

[Figure: the recipients of a message (Rec_1 … Rec_K) ranked from most likely outlier to least likely outlier by P(rec_t)]

P(rec_t) = probability that recipient t is an outlier, given the message text and the other recipients of the message.


Leak Criteria: how to generate (artificial) outliers

• Several options: frequent typos, same/similar last names, identical/similar first names, aggressive auto-completion of addresses, etc.

• We adopted the “3g-address” criterion: on each trial, one of the message recipients is randomly chosen and an outlier is generated according to:

[Figure: three-step example of 3g-address outlier generation for the address marina.wang@enron.com. Else: randomly select an Address Book entry.]
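A sketch of one way to implement the 3g-address criterion, under the assumption that it means “replace a randomly chosen true recipient with an Address Book address sharing a character 3-gram with it, falling back to a random Address Book entry”:

```python
import random

def char_3grams(address):
    """Character 3-grams of an address, e.g. 'marina' -> {'mar', 'ari', ...}."""
    return {address[i:i + 3] for i in range(len(address) - 2)}

def generate_3g_leak(recipients, address_book):
    """Simulate a leak for one message by picking a 'similar' wrong address."""
    chosen = random.choice(recipients)
    grams = char_3grams(chosen.lower())
    candidates = [a for a in address_book
                  if a not in recipients and grams & char_3grams(a.lower())]
    if candidates:
        return random.choice(candidates)
    # Else: randomly select an Address Book entry
    return random.choice([a for a in address_book if a not in recipients])
```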

Data Preprocessing

• Used the Enron Email Dataset

• Set up a realistic temporal split: for each user, the 10% most recent sent messages are used as the test set

• Address Books were extracted for all users: the list of all recipients in that user’s sent messages

• Self-addressed messages were disregarded

Data Preprocessing (continued)

• ISI version of Enron: repeated messages and inconsistencies removed

• Main Enron addresses disambiguated, using the list provided by Corrada-Emmanuel (UMass)

• Bag-of-words: messages were represented as the union of the bag-of-words of the body and the bag-of-words of the subject (textual features)

• Some stop words removed

Experiments: Textual Features only

• Three baseline methods:

– Random: rank recipient addresses randomly.

– Cosine/TfIdf Centroid (Rocchio): create a “TfIdf centroid” for each user in the Address Book. A user1-centroid is the sum of all training messages (in TfIdf vector format) that were addressed to user1. At test time, rank users by the cosine similarity between the test message and each centroid (sketched below).

– Knn-30: given a test message, retrieve the 30 most similar messages in the training set, then rank each user by the sum of the similarities of the messages in this 30-message set that were addressed to that user.
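For illustration, a compact sketch of the TfIdf-centroid (Rocchio) baseline using scikit-learn; the variable names and the library choice are mine, not the authors’:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rocchio_rank(train_texts, train_recipients, test_text):
    """Rank users by cosine similarity between a test message and per-user
    TfIdf centroids.

    train_texts      -- training message texts (subject + body), one per message
    train_recipients -- recipient lists, one per training message
    """
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(train_texts)
    users, centroids = [], []
    for user in {r for recs in train_recipients for r in recs}:
        rows = [i for i, recs in enumerate(train_recipients) if user in recs]
        centroids.append(np.asarray(X[rows].sum(axis=0)).ravel())
        users.append(user)
    sims = cosine_similarity(vec.transform([test_text]), np.vstack(centroids))[0]
    return sorted(zip(sims, users), reverse=True)
```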

Experiments: Textual Features only

[Figure: email leak prediction results, accuracy (Prec@rank 1) over 10 trials; on each trial a different set of outliers is generated]

Using Network Features

1. Frequency features: number of messages received from this user, number of messages sent to this user, and number of sent + received messages.

2. Co-occurrence features: number of times a user co-occurred with all other recipients; “co-occur” means two recipients were addressed in the same message in the training set (sketched below).

3. Max3g features: for each recipient R, find Rm (the address with the maximum score from the 3g-address list of R), then use score(R) - score(Rm) as a feature. Scores come from the CV10 procedure; leak-recipient scores are likely to be smaller than their highest 3g-address score.
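A sketch of how the frequency and co-occurrence counts could be computed from a user’s training messages (a plausible implementation, not necessarily the authors’):

```python
from collections import Counter
from itertools import combinations

def network_counts(sent_recipient_lists, received_senders):
    """Frequency and co-occurrence counts from the training messages.

    sent_recipient_lists -- one recipient list per sent message
    received_senders     -- one sender address per received message
    """
    sent_count = Counter(r for recs in sent_recipient_lists for r in recs)
    recv_count = Counter(received_senders)
    cooccur = Counter()
    for recs in sent_recipient_lists:
        for a, b in combinations(sorted(set(recs)), 2):
            cooccur[(a, b)] += 1
    return sent_count, recv_count, cooccur

def cooccurrence_feature(recipient, other_recipients, cooccur):
    """Times `recipient` co-occurred with the other recipients of a message."""
    return sum(cooccur[tuple(sorted((recipient, o)))] for o in other_recipients)
```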

Results: Textual+Network Features

Finding Real Leaks in Enron

• How can we find them? Look for “mistake”, “sorry” or “accident”. We were looking for sentences like “Sorry. Sent this to you by mistake. Please disregard.” or “I accidentally sent you this reminder.” (a simple phrase filter is sketched below).

• How many can we find? Dozens of cases. Unfortunately, most of these cases originated from non-Enron email addresses, or from an Enron address that is not one of the 151 Enron users whose messages were collected; our method requires a collection of sent (and received) messages from a user.

• Found 2 real “valid” cases (“valid” = testable):

– Message germanyc/sent/930: 20 recipients, the leak is alex.perkins@

– Message kitchen-l/sent items/497: 44 recipients, the leak is rita.wynne@
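The search itself can be approximated with a simple phrase filter over the sent messages; the phrase list below is illustrative only:

```python
import re

# Phrases suggesting that a previously sent message was a mistake.
LEAK_HINTS = re.compile(
    r"(sent .{0,30}by mistake|accidentally sent|please disregard"
    r"|sorry.{0,40}wrong (person|address))",
    re.IGNORECASE,
)

def candidate_leak_reports(messages):
    """messages: iterable of (msg_id, text) pairs; yields ids worth inspecting."""
    for msg_id, text in messages:
        if LEAK_HINTS.search(text):
            yield msg_id
```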

Finding Real Leaks in Enron

– Very disappointing results!

– Reason: alex.perkins@ and rita.wynne@ were never observed in the training set!

[Table: accuracy and average rank over 100 trials]

“Smoothing” the leak generation

[Figure: smoothed leak generation for marina.wang@enron.com; in addition to the 3g-address steps, a random email address NOT in the Address Book can be generated. Else: randomly select an Address Book entry.]

• Sampling from random unseen recipients with probability α (the mixture parameter; sketched below)

Some results:

• Kitchen-l has 4 unseen addresses out of the 44 recipients

• Germany-c has only one, out of 20
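A sketch of the smoothed generation, reusing generate_3g_leak from the earlier sketch; the parameter name alpha stands for the mixture parameter and the domain string is only an example:

```python
import random
import string

def random_unseen_address(address_book, domain="enron.com"):
    """Generate a random address that does NOT appear in the Address Book."""
    while True:
        local = "".join(random.choices(string.ascii_lowercase, k=8))
        addr = f"{local}@{domain}"
        if addr not in address_book:
            return addr

def generate_smoothed_leak(recipients, address_book, alpha):
    """With probability alpha sample a random unseen address, else 3g-address."""
    if random.random() < alpha:
        return random_unseen_address(address_book)
    return generate_3g_leak(recipients, address_book)  # see earlier sketch
```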

Mixture parameter α: Germany Leak Case

[Figure: AvgRank and Prec@1 as a function of the mixture parameter α, for α between 0 and 0.5]

Back to the simulated leaks:

Conclusions

• Papers on privacy and email are rare. To the best of our knowledge, this was the first paper on preventing information leaks via email.

• Can prevent HUGE problems.

• Easy to implement in any email client; no changes needed on the email server side.

• The email leak paper was accepted at SDM-2007.

“This is a feature I would like to have in the email client I use myself.”

“Personally, I am eager to use such a tool if its accuracy is good.”


Recommending Email Recipients

1. Prevent a user from forgetting to add an important collaborator or manager as a recipient, avoiding costly misunderstandings and communication delays. The cost of errors in task management is high: for instance, deadlines can be missed or opportunities wasted because of such errors.

2. Find people in an organization who are working on a similar topic or project, or who have appropriate expertise or skills.

• A valuable addition to email systems, particularly in large corporations.

Goal: email systems that can suggest who the recipients of a message might be while the message is being composed, given its current contents and its previously-specified recipients.

Two Recommendation Tasks

• TO+CC+BCC prediction

• CC+BCC prediction

Method:

1. Extract features: textual and non-textual.

2. Build a model for (message, recipients) pairs: train a classifier to detect “true” missing recipients.

3. Rank all email addresses in the Address Book according to the classifier’s confidence (a sketch follows below).
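As with the leak task, the ranking step can be sketched as follows, again assuming a trained classifier `clf` with predict_proba and a hypothetical `make_features` helper:

```python
def recommend_recipients(msg, given_recipients, address_book,
                         clf, make_features, k=5):
    """Rank Address Book entries as candidate missing recipients.

    Returns the top-k candidates by classifier confidence; `clf` and
    `make_features` are the same kind of objects assumed in the leak sketch.
    """
    scored = []
    for cand in address_book:
        if cand in given_recipients:
            continue
        x = make_features(msg, cand, given_recipients)
        scored.append((clf.predict_proba([x])[0][1], cand))
    return sorted(scored, reverse=True)[:k]
```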

Methods

• A large-scale multi-class, multi-label classification task:

– Address Books have hundreds, sometimes thousands, of email addresses (classes)

– One-vs-all training is too expensive, even for users with small collections of messages

• Information retrieval techniques as baselines: Rocchio (TfIdf centroid) and KNN

• Enron dataset, with preprocessing steps similar to the leak problem

Using Network Features

1. Frequency features: number of messages received from this user, number of messages sent to this user, and number of sent + received messages.

2. Co-occurrence features (CC+BCC prediction only): number of times a user co-occurred with all other recipients; “co-occur” means two recipients were addressed in the same message in the training set.

3. Recency features: how frequently a recipient appears in the last 20, 50, and 100 sent messages (see the sketch below).
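A sketch of the recency features, assuming the user’s sent messages are available in chronological order (the window sizes follow the slide):

```python
def recency_features(sent_recipient_lists, candidate, windows=(20, 50, 100)):
    """How often `candidate` was a recipient in the last N sent messages.

    sent_recipient_lists -- recipient lists in chronological order (oldest first)
    """
    feats = {}
    for n in windows:
        recent = sent_recipient_lists[-n:]
        feats[f"recency_{n}"] = sum(candidate in recs for recs in recent)
    return feats
```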

Results: TO+CC+BCC Prediction

Avg. Recall vs Rank Curves

Overall Results

Related Work

• Email privacy enforcement system

– Boufaden et al. (CEAS-2005): used information extraction techniques and domain knowledge to detect privacy breaches via email in a university environment. Breaches: student names, student grades and student IDs.

• CC prediction

– Pal & McCallum (CEAS-2006): the counterpart problem of predicting the most likely intended recipients of an email message. A single user, limited evaluation, non-public data.

• Expert finding in email

– Dom et al. (SIGMOD-03), Campbell et al. (CIKM-03)

– Balog & de Rijke (WWW-06), Balog et al. (SIGIR-06)

– Soboroff, Craswell, de Vries (TREC-Enterprise 2005, 2006, 2007, …): expert finding task on the W3C corpus

Conclusions

• Submitted to KDD-07.

• The recipient prediction task can be seen as the negative counterpart of the email leak prediction task: in the former we want to find the intended recipients of email messages, whereas in the latter we want to find the unintended recipients, i.e., the email leaks.

• A desirable addition to email systems to avoid misunderstandings and communication delays: efficient, and easy to implement and integrate, particularly in email systems where traditional search is already available.