Post on 30-Dec-2015
Representation of Electronic Mail Filtering Profiles: A User Study
Michael J. Pazzani
Information and Computer Science
University of California, Irvine
pazzani@ics.uci.edu
http://www.ics.uci.edu/~pazzani
Issues Addressed
• Would you let an agent filter your mail?
• If you could examine its filtering criteria, would this increase acceptance?
• Comprehensible filters can reduce legal liability
This release of Outlook Express comes equipped with a new "junk" e-mail filter. Insofar as Blue Mountain can ascertain, Microsoft's e-mail filter relegates e-mail greeting cards sent from Blue Mountain's web site to a "junk mail" folder for immediate discard, rather than receipt by the user.
• How should the mail filtering profile be represented?
Learning to Filter Mail
• Vector Space (TF-IDF): Segal, R. and Kephart, J. (1999). MailCat: An Intelligent Assistant for Organizing E-Mail. In Proceedings of the Third International Conference on Autonomous Agents, May 1999.
• Rules: Cohen, W. (1996). Learning Rules that Classify E-Mail.
• Bayesian: Sahami, M., Dumais, S., Heckerman, D., and Horvitz, E. (1998). A Bayesian approach to filtering junk e-mail.
• Support Vector Machines: Dumais, S., Platt, J., Heckerman, D., and Sahami, M. (1998). Inductive learning algorithms and representations for text categorization.
• Neural Networks: Lewis, D., Schapire, R., Callan, J., and Papka, R. (1996). Training algorithms for linear text classifiers.
The paper I was going to write
• Word pairs increase user acceptance of learned rule-based e-mail filters
– Collect representative e-mail messages
– Learn rule-based models with and without word pairs
– Ask users to rate profiles learned under various conditions
– Demonstrate increased acceptance of models with word pairs
Assumptions
• Why Rules?
W. Cohen (1996) “the greater comprehensibility of the rules may be advantageous in a system that allows users to extend or otherwise modify a learned classifier.”
• Word Pairs: Treating two contiguous words as a single term
Restaurant Recommendation: Pazzani (in press)
“goat” vs. “goat cheese” “prime” vs. “prime rib”
General finding: Negligible increase in accuracy of learned profile
Intuition: It might make profiles much more understandable
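A minimal sketch of the word-pair idea, treating two contiguous words as a single term. The tokenizer (lowercasing, splitting on non-alphanumeric characters) and the function names are illustrative assumptions, not the procedure used in the study.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def terms_with_pairs(text):
    """Return the single words plus every pair of contiguous words as terms."""
    words = tokenize(text)
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return set(words) | set(pairs)

terms = terms_with_pairs("Try the goat cheese and the prime rib")
```

With this representation, "goat cheese" and "prime rib" become terms in their own right, alongside "goat" and "prime".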
Ripper Rules: Comprehensible Acceptable
Discard if the message contains our & internet
Discard if the message contains free & call
Discard if the message contains http & com
Discard if the message contains UCI & available
Discard if the message contains all & our & not
Discard if the message contains business & you
Discard if the message contains by & Humanities
Discard if the message contains over & you & can
Otherwise Forward
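A rule profile of this kind can be read as an ordered list of word conjunctions with a default action. The sketch below shows one way to apply such a list. It assumes case-insensitive whole-word matching, which the slides do not specify, and it includes only the first three rules for brevity.

```python
import re

# Each rule is a set of words that must all be present to discard the message.
# The three rules are copied from the slide; the remaining rules are omitted.
DISCARD_RULES = [
    {"our", "internet"},
    {"free", "call"},
    {"http", "com"},
]

def words_in(message):
    """Lowercased set of alphabetic words in the message (an assumption)."""
    return set(re.findall(r"[a-z]+", message.lower()))

def classify(message, rules=DISCARD_RULES):
    """Return 'Discard' if any rule's words all appear, otherwise 'Forward'."""
    present = words_in(message)
    for rule in rules:
        if rule <= present:  # every word of the rule is present
            return "Discard"
    return "Forward"
```

For example, `classify("Visit our internet store")` matches the first rule, while a message that matches no rule falls through to the default, Forward.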
Ripper Rules with Word Pairs: A “floor” effect
Discard if the message contains you can & to be
Discard if the message contains the UCI
Discard if the message contains the internet & if you
Discard if the message contains you can & you have
Discard if the message contains http://www
Discard if the message contains P.M.
Discard if the message contains you want
Discard if the message contains one of
Discard if the message contains there are
Discard if the message contains please contact
Otherwise Forward
Ripper Rules for Forwarding
Forward if the message contains I & not business & not you can
Forward if the message contains computer science
Forward if the message contains Subject Re:
Forward if the message contains in your & not free
Forward if the message contains I & not us
Forward if the message contains use the
Otherwise Discard
Ripper Rules with Style Features
Discard if the message has greater than 5% capital letters & does not contain I & does not contain computing
Discard if there is greater than 1 $ & not they
Discard if the message contains our & http
Discard if greater than 2% of the words are in ALL CAPS
Discard if the message contains please & not your
Otherwise Forward
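The style features referenced in these rules (share of capital letters, share of ALL CAPS words, number of $ signs) could be computed roughly as follows. The exact definitions are assumptions: the slides do not say whether percentages are taken over letters or over all characters, or whether one-letter words such as "I" count as ALL CAPS.

```python
def percent_capital_letters(message):
    """Percentage of alphabetic characters that are uppercase (assumed definition)."""
    letters = [c for c in message if c.isalpha()]
    if not letters:
        return 0.0
    return 100.0 * sum(c.isupper() for c in letters) / len(letters)

def percent_all_caps_words(message, min_length=2):
    """Percentage of words written entirely in capitals.

    Words with fewer than min_length letters (e.g. "I") are not counted as
    ALL CAPS; this threshold is a hypothetical choice.
    """
    words = [w for w in message.split() if any(c.isalpha() for c in w)]
    if not words:
        return 0.0
    caps = [w for w in words
            if w.isupper() and sum(c.isalpha() for c in w) >= min_length]
    return 100.0 * len(caps) / len(words)

def dollar_count(message):
    """Number of $ characters in the message."""
    return message.count("$")
```

A rule such as "greater than 2% of the words are in ALL CAPS" then becomes a simple comparison, for example `percent_all_caps_words(message) > 2`.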
FOCL Rules with Word Pairs
Discard if the message contains not I & not science
Discard if the message contains business & not Subject:Re
Discard if the message contains our & internet
Discard if the message contains income
Discard if the message contains you can & not all your
Discard if the message contains the UCI
Otherwise Forward
Ripper Rules: 80% accurate profile
Discard if the message contains the UCI & to the
Discard if the message contains the internet & you have
Discard if the message contains http://www & you can
Discard if the message contains are available
Discard if the message contains you will
Discard if the message contains web site
Discard if the message contains of the & we are
Discard if the message contains a new
Otherwise Forward
Evaluation Criteria for Mail Filtering
• Accuracy (and precision, recall, sensitivity, etc.)
• Efficiency (Learning and Classification)
• Cost Sensitivity
• Traceability: The ease with which the user can emulate the categorization using a model.
• Credibility: The degree to which the user believes the decision-making criteria will produce the desirable results.
• Accountability: The degree to which the representation allows a user to distinguish an accurate model from an inaccurate one.
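Of the criteria above, accuracy, precision, and recall (sensitivity) are the ones that can be computed directly from labeled messages. A minimal sketch, assuming "Discard" is treated as the positive class and that labels are plain strings; the function name is illustrative.

```python
def evaluate(actual, predicted, positive="Discard"):
    """Compute accuracy, precision, and recall for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    correct = sum(a == p for a, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        # Guard against division by zero when nothing is predicted/labeled positive.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

metrics = evaluate(
    actual=["Discard", "Discard", "Discard", "Forward"],
    predicted=["Discard", "Discard", "Forward", "Forward"],
)
```

Traceability, credibility, and accountability have no such formula, which is why the study measures them through user ratings.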
Text classification profiles
Goals: create a user-understandable/editable profile; create a profile that makes errors easy to detect/correct
• Rule-based representation (similar to Outlook): disappointing results
• Speculations
Representation issues
Are weighted representations less understandable?
Are “prototype” representations more understandable?
• Hypotheses
Using word pairs as terms makes the profile more understandable
Using absence of words makes the profile less understandable
Prototype Representation
IF the message contains more of
papers particular business internet http money us
THAN
I me Re science problem talk ICS begins
THEN Discard
OTHERWISE Forward
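One plausible reading of the prototype profile is a comparison of two counts: how many terms from the first list appear in the message versus how many from the second. The sketch below takes that reading. Counting distinct terms (rather than occurrences), matching case-insensitively, and forwarding on ties are all assumptions.

```python
import re

# Term lists copied from the slide, lowercased for matching (an assumption).
DISCARD_TERMS = {"papers", "particular", "business", "internet", "http", "money", "us"}
FORWARD_TERMS = {"i", "me", "re", "science", "problem", "talk", "ics", "begins"}

def classify(message):
    """Discard when more discard-prototype terms than forward-prototype terms appear."""
    present = set(re.findall(r"[a-z]+", message.lower()))
    discard_hits = len(present & DISCARD_TERMS)
    forward_hits = len(present & FORWARD_TERMS)
    return "Discard" if discard_hits > forward_hits else "Forward"
```

Under this reading, a user can trace a decision by ticking off the terms from each list that occur in the message.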
Linear Threshold
IF ( 11"remove" + 10"internet" + 8"http" + 7"call" + 7"business" + 5"center"
+ 3"please" + 3"marketing" + 2"money" + 1"us" + 1"reply" + 1"my" + 1"free"
- 14"ICS" - 10"me" - 8"science" - 6"thanks" - 6"meeting" - 5"problem"
- 5"begins" - 5"I" - 3"mail" - 3"com" - 2"www" - 2"talk" - 2"homework"
- 1"our" - 1"it" - 1"email" - 1"all" - 1) is positive
Then DELETE
Else Forward
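The linear threshold profile sums the weights of the terms found in the message, adds the constant, and deletes the message when the total is positive. A minimal sketch using the weights from the slide, assuming each term contributes its weight once if present (binary features) and that matching is case-insensitive; neither detail is stated on the slide.

```python
import re

# Weights copied from the slide; keys are lowercased for matching (an assumption).
WEIGHTS = {
    "remove": 11, "internet": 10, "http": 8, "call": 7, "business": 7,
    "center": 5, "please": 3, "marketing": 3, "money": 2, "us": 1,
    "reply": 1, "my": 1, "free": 1,
    "ics": -14, "me": -10, "science": -8, "thanks": -6, "meeting": -6,
    "problem": -5, "begins": -5, "i": -5, "mail": -3, "com": -3,
    "www": -2, "talk": -2, "homework": -2, "our": -1, "it": -1,
    "email": -1, "all": -1,
}
BIAS = -1  # the trailing constant in the slide's expression

def score(message):
    """Sum the weights of the terms present in the message, plus the bias."""
    present = set(re.findall(r"[a-z]+", message.lower()))
    return sum(w for term, w in WEIGHTS.items() if term in present) + BIAS

def classify(message):
    """Delete when the score is strictly positive, otherwise forward."""
    return "Delete" if score(message) > 0 else "Forward"
```

For "Please call our business center" the terms contribute 3 + 7 - 1 + 7 + 5, and with the constant the score is 20, so the message is deleted. A message containing only "free" scores 0, which is not positive, so it is forwarded.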
Linear Threshold with Pairs
IF ( 10"business" + 7"internet" + 6"you can" + 6"http" + 6"center"
+5"our" + 5"e-mail" + 3"money" + 2"the UCI" + 1"I have"
-13"ICS" - 10"I'm" - 7"science" - 7"com" - 6"but I" - 6"Subject: Re"
-5"I" - 4"thanks" - 4"problem" - 4"me" - 4"computer science"
-4"I can" - 2"talk" - 2"mail" - 1"my" - 2) is positive
Prototype Representation with Pairs
IF the message contains more of
com service us marketing financial 'the UCI' 'http www' 'you can' 'removed from'
THAN
I me ICS learning 'Subject: Re:' function 'talk begins' 'computer science' 'the end'
THEN Discard
OTHERWISE Forward
Prototype Representation 80% accurate
IF the message contains more of
looking are over mailing expert reply ‘the subject’ ‘send an’ ‘at UCI’
THAN
done I research sorry science because minute overview similar ‘of it’ ‘need to’ ‘a minute’
THEN Discard
OTHERWISE Forward
Preferences
Algorithm               Mean Rating
Rules                    0.015
Rules (Pairs)           -0.135
Rules (Noise)           -0.105
Linear Model             0.421
Linear Model (Pairs)     0.518
Linear Model (Noise)    -0.120
Prototype                0.677
Prototype (Pairs)        1.06
Prototype (Noise)        0.195
• The following differences were highly significant (at least at the .005 level).
Prototype representations with word pairs received higher ratings than rule representations with word pairs, t(132) = 5.64. Inaccurate prototype models (learned from noisy training data) are less acceptable to users than accurate ones, t(132) = 4.88.
• The following differences were significant (at least at the .05 level).
Prototype representations with word pairs received higher ratings than linear model representations with word pairs, t(132) = 2.84. Inaccurate linear models are less acceptable to users than accurate ones, t(132) = 2.99.
• The following difference was marginally significant (between the .1 and .05 levels).
For prototype representations, using word pairs as terms increases user ratings, t(132) = 2.37.
Learning Prototype: A First Pass
• Genetic Algorithm
Instance is a pair of term vectors
128 most informative terms
Initialized with 10% of features of each example
Fitness function: number correct on training data
Operators: breeding, mutation
Results on mail, S&W data: as good as anything else
Algorithm     Mail   Goats  Sheep  Bands
Perceptron    83.6   65.7   80.0   65.9
Nearest       81.4   71.4   75.7   70.4
ID3           82.3   72.8   86.3   68.8
Naïve Bayes   90.1   72.6   81.4   70.7
Rocchio       84.9   70.1   78.5   67.6
Prototype     87.1   72.8   84.2   71.4
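The genetic algorithm outlined on the "Learning Prototype" slide (an instance is a pair of term vectors, fitness is the number correct on training data, operators are breeding and mutation) might be sketched as follows. Everything beyond those three points is an assumption made for illustration: the seeding from one example per class, the uniform crossover, the single-term mutation, keeping the better half of each generation, and the toy training set. Selecting the 128 most informative terms is omitted.

```python
import random

def predict(individual, terms):
    """Prototype decision: discard when more discard terms than forward terms match."""
    discard_proto, forward_proto = individual
    more_discard = len(terms & discard_proto) > len(terms & forward_proto)
    return "Discard" if more_discard else "Forward"

def fitness(individual, data):
    """Number of training messages classified correctly."""
    return sum(predict(individual, terms) == label for terms, label in data)

def seed_individual(data, rng):
    """Seed each term vector with about 10% of one example's terms (at least one)."""
    def sample(label):
        terms = sorted(rng.choice([t for t, l in data if l == label]))
        k = min(len(terms), max(1, len(terms) // 10))
        return frozenset(rng.sample(terms, k))
    return (sample("Discard"), sample("Forward"))

def breed(a, b, rng):
    """Uniform crossover: keep each term of either parent with probability 0.5."""
    def mix(x, y):
        return frozenset(t for t in sorted(x | y) if rng.random() < 0.5)
    return (mix(a[0], b[0]), mix(a[1], b[1]))

def mutate(individual, vocab, rng):
    """Toggle one random vocabulary term in one of the two term vectors."""
    protos = [set(individual[0]), set(individual[1])]
    protos[rng.randrange(2)].symmetric_difference_update({rng.choice(vocab)})
    return (frozenset(protos[0]), frozenset(protos[1]))

def evolve(data, generations=30, size=20, seed=0):
    """Evolve a population and return the fittest (discard, forward) pair."""
    rng = random.Random(seed)
    vocab = sorted(set().union(*(terms for terms, _ in data)))
    population = [seed_individual(data, rng) for _ in range(size)]
    for _ in range(generations):
        population.sort(key=lambda ind: fitness(ind, data), reverse=True)
        parents = population[: size // 2]  # the better half survives unchanged
        children = [
            mutate(breed(rng.choice(parents), rng.choice(parents), rng), vocab, rng)
            for _ in range(size - len(parents))
        ]
        population = parents + children
    return max(population, key=lambda ind: fitness(ind, data))

# Hypothetical toy training set: (set of terms in the message, label).
TRAIN = [
    ({"free", "money", "call", "now"}, "Discard"),
    ({"internet", "business", "free", "offer"}, "Discard"),
    ({"ics", "meeting", "talk", "science"}, "Forward"),
    ({"homework", "problem", "science", "thanks"}, "Forward"),
]
best = evolve(TRAIN)
print(fitness(best, TRAIN), "of", len(TRAIN), "training messages correct")
```

Because the better half of each generation is carried over unchanged, the best fitness in the population never decreases from one generation to the next.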