05/04/07 Using Active Learning to Label Large Email Corpora Ted Markowitz Pace University CSIS DPS &...

05/04/07

Using Active Learning to Label Large Email Corpora

Ted MarkowitzPace University CSIS DPS &

IBM T. J. Watson Research Ctr.

2

Quick History

• Email-related research suggested by Dr. Chuck Tappert’s work with MS student, Ian Stuart

• Decided to approach IBM Research’s SpamGuru anti-spam group for joint research

• Started P/T onsite at IBM in 11/05• Dr. Richard Segal of IBM Research

generously agreed to act as adjunct advisor

3

Research Motivation

• Assumption: Ongoing training and testing of anti-spam tools require large, fresh databases–corpora–of labeled (spam vs. good) messages

• Problem: How do we accurately label large numbers of examples―potentially millions― without manually examining every one?

4

Building Email Corpora

• Accurate training & testing of anti-spam tools require:– truly random, i.e., unbiased, samples– sufficient # of examples to measure low (< 0.1%) error rates– reasonable distributions of spam vs. good mail– examples which represent the target operating environment

• However, most existing email testing corpora are:– Rather small (just a few thousand messages)– Very narrowly focused in type and content– Aging rapidly and growing more and more stale over time

5

Building Email Corpora (cont.)

• Email and spam are constantly evolving• Building large, current and diverse bodies of

examples is time-consuming and expensive• Result: Just a few–relatively small and aging–

email corpora are used over and over again

6

One Potential Approach

• Machine Learning (ML) methods can help to build corpus labelers which learn how to label

• Research in semi-supervised learning (SSL) has shown it’s possible to accurately learn by bootstrapping, i.e., using relatively few labeled examples and lots of unlabeled examples

7

Active Learning

• Active Learning (AL) is one form of SSL• While some ML is passive (e.g., learner is only

given labeled examples), AL is proactive• Active Learner component directs attention to

particular areas it wants information about from a teacher who knows all the labels

Active Learning & Email Corpora

Select M “best” messages to label

Ask human to label selected messages

Update model based on returned labels

Label messages using Spam Classifier Model

Unlabeled Messages

Done?No

Yes

SpamClassifier

Model

9

Active Learning (cont.)

• Basic Challenge: Minimize the total cost of teacher queries required to achieve a target error rate, often simply the fewest queries

• Research Question: How does one selectively choose an optimal set of queries for the teacher during each update cycle?

10

Selective Sampling

• Uncertainty Sampling† (US) is one selective sampling technique for choosing the most informative examples

• US is based on the premise that the learner learns fastest by asking first about those examples it, itself, is most uncertain about

† “A Sequential Algorithm for Training Text Classifiers”, D. D. Lewis & W. A. Gale, ACM SIGIR ‘94

11

Uncertainty Sampling (cont.)

• Minimizing total uncertainty over all examples is computationally expensive: O(n)

• Can you reduce the # of questions asked in each cycle and still learn accurately?

• Is picking just the most uncertain examples always the best learning strategy?

• Can other knowledge be brought to bear in selecting the best questions?

12

Research Hypothesis

• Hypothesis: It should be possible to achieve close to full US accuracy while asking fewer, better questions

• Focused on development of Approximate Uncertainty Sampling (AUS) labelers– Compromise between speed of learning, # of

questions asked & computational resources– Computational complexity is O(m log(n)) vs. O(n)

for original Uncertainty Sampling algorithm

13

Research Approach

1. Construct competing AL/US-based labelers2. Compare them by…

– Accuracy (% correct, FP’s & FN’s) – # of teacher queries required to hit error rates– Relative sample sizes– Overall performance & resource usage

3. Select best labelers and refine them

14

Research Infrastructure

• Built a Java labeler testbench for comparing labeler variations on IBM SpamGuru codebase

• Developed and tested several Uncertainty Sampling-based labelers

• Used gold-standard, labeled 92K msg TREC 2005 Enron mail corpus to simulate the teacher

• Built a GUI front-end (CSI) to support human teacher interaction with labelers

16

Benefits of AUS

• Nearly as effective as vanilla US, but with lower computational complexity: O(m log(n))

• Reduced computational cost allows AUS to be applied to labeling larger datasets

• AUS makes it possible to update the learned model more frequently

• AUS is applicable to any AL/US-based solution, not just email corpus labeling

17

Ongoing Work

• Determine why selective sampling of queries using simple unsupervised clustering (AUS3 & AUS4) didn’t produce better results

• Develop enhanced clustering versions to attempt to improve AUS performance

05/04/07 Using Active Learning to Label Large Email Corpora Ted Markowitz Pace University CSIS DPS &...

Documents

Transcript of 05/04/07 Using Active Learning to Label Large Email Corpora Ted Markowitz Pace University CSIS DPS &...