
End-User Debugging of Machine Learning Systems

Weng-Keen Wong
Oregon State University
School of Electrical Engineering and Computer Science
http://www.eecs.oregonstate.edu/~wong


Collaborators

• Margaret Burnett

• Simone Stumpf

• Tom Dietterich

• Jon Herlocker

• Erin Fitzhenry

• Lida Li

• Ian Oberst

• Vidya Rajaram

• Russell Drummond

• Erin Sullivan

(Faculty, graduate students, and undergraduates)


Papers

Stumpf, S., Rajaram, V., Li, L., Burnett, M., Dietterich, T., Sullivan, E., Drummond, R., Herlocker, J. (2007). Toward Harnessing User Feedback for Machine Learning. In Proceedings of IUI 2007.

Stumpf, S., Rajaram, V., Li, L., Wong, W.-K., Burnett, M., Dietterich, T., Sullivan, E., Herlocker, J. (2008). Interacting Meaningfully with Machine Learning Systems: Three Experiments. (Submitted to IJHCS)

Stumpf, S., Sullivan, E., Fitzhenry, E., Oberst, I., Wong, W.-K., Burnett, M. (2008). Integrating Rich User Feedback into Intelligent User Interfaces. In Proceedings of IUI 2008.


Motivation

Date: Mon, 28 Apr 2008 23:59:00 (PST)
From: John Doe <[email protected]>
To: Weng-Keen Wong <[email protected]>
Subject: CS 162 Assignment

I can’t get my Java assignment to work! It just won’t compile and it prints out lots of error messages! Please help!

public class MyFrame extends JFrame {
    private AsciiFrameManager reader;
    private JPanel displayPanel;

    public MyFrame(String filename) throws Exception {
        reader = new AsciiFrameManager(filename);
        displayPanel = new JPanel();
        ...

[Screenshot: which folder should this email be filed into (CS 162, John Doe, or Trash)?]

• Machine learning tool adapts to end user

• Similar situation in recommender systems, smart desktops, etc.


Motivation

Date: Mon, 28 Apr 2008 23:51:00 (PST)
From: Bella Bose <[email protected]>
To: Weng-Keen Wong <[email protected]>
Subject: Teaching Assignments

I’ve compiled the teaching preferences for all the faculty. Here are the teaching assignments for next year:

Fall Quarter
CS 160 (Computer Science Orientation) – Paul Paulson
CS 161 (Introduction to Programming I) – Chris Wallace
CS 162 (Introduction to Programming II) – Weng-Keen Wong
...

[Screenshot: the message has been filed into Trash]

• Machine Learning systems are great when they work correctly, aggravating when they don’t

• The end user is the only person at the computer

• Can we let end users correct machine learning systems?


Motivation

• Learn to correct behavior quickly
  - Sparse data on start
  - Concept drift
• Rich end-user knowledge
  - Effects of user feedback on accuracy?
  - Effects on users?


Overview

[Diagram: the machine learning algorithm provides explanations to the end user, and the end user provides feedback to the machine learning algorithm]


Related Work

Explanation

• Expert Systems (Swartout 83, Wick and Thompson 92)

• TREPAN (Craven and Shavlik 95)

• Description Logics (McGuinness 96)

• Bayesian networks (Lacave and Díez 00)

• Additive classifiers (Poulin et al. 06)

• Others (Crawford et al. 02, Herlocker et al. 00)

End user interaction

• Active Learning (Cohn et al. 96, many others)

• Constraints (Altendorf et al. 05, Huang and Mitchell 06)

• Ranks (Radlinski and Joachims 05)

• Feature Selection (Raghavan et al. 06)

• Crayons (Fails and Olsen 03)

• Programming by Demonstration (Cypher 93, Lau and Weld 99, Lieberman 01)


Outline

1. What types of explanations do end users understand? What types of corrective feedback could end users provide? (IUI 2007)

2. How do we incorporate this feedback into a ML algorithm? (IJHCS 2008)

3. What happens when we put this together? (IUI 2008)


What Types of Explanations Do End Users Understand?

• Think-aloud study with 13 participants
• Classify Enron emails
• Explanation systems: rule-based, keyword-based, similarity-based
• Findings:
  - Rule-based best, but not a clear winner
  - Evidence indicates multiple explanation paradigms are needed


What types of corrective feedback could end users provide?

Suggested corrective feedback in response to explanations:

1. Adjust importance of word
2. Add/remove word from consideration
3. Parse / extract text in a different way
4. Word combinations
5. Relationships between messages/people


Outline

1. What types of explanations do end users understand? What types of corrective feedback could end users provide? (IUI 2007)

2. How do we incorporate this feedback into a ML algorithm? (IJHCS 2008)

3. What happens when we put this together? (IUI 2008)


Incorporating Feedback into ML Algorithms

Two approaches:
• Constraint-based
• User co-training


Constraint-based approach

Constraints:

1. If the weight on a word is reduced or the word is removed, remove the word as a feature.

2. If the weight of a word is increased, the word is assumed to be important for that folder:

   P(Y = y_k | x_j = 1) ≥ P(Y = y_k' | x_j = 1)   for every other folder y_k'

3. If the weight of a word is increased, the word is a better predictor for that folder than other words:

   P(x_j = 1 | Y = y_k) ≥ P(x_j' = 1 | Y = y_k)   for every other word x_j'

Estimate the parameters for Naive Bayes using MLE, subject to these constraints.
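A minimal sketch of constraints 1 and 2 on a Bernoulli Naive Bayes model, in Python. The function and argument names here are hypothetical, and constraint 2 is only approximated by a post-hoc floor on the conditional probability rather than by a true constrained-MLE solver:

from collections import Counter
import math

def train_constrained_nb(messages, labels, removed_words, boosted_words, floor=0.9):
    # messages: list of word sets; labels: folder label per message
    # removed_words: words the user removed or down-weighted (constraint 1)
    # boosted_words: {folder: words the user up-weighted for that folder} (constraint 2)
    folders = sorted(set(labels))
    vocab = {w for m in messages for w in m} - set(removed_words)       # constraint 1: drop the feature
    prior = {f: labels.count(f) / len(labels) for f in folders}
    cond = {}
    for f in folders:
        docs = [m for m, y in zip(messages, labels) if y == f]
        counts = Counter(w for m in docs for w in (m & vocab))
        cond[f] = {w: (counts[w] + 1) / (len(docs) + 2) for w in vocab}  # smoothed MLE of P(x_j = 1 | Y = f)
        for w in boosted_words.get(f, set()) & vocab:                    # constraint 2, approximated by a floor
            cond[f][w] = max(cond[f][w], floor)
    return prior, cond, vocab

def classify(message, prior, cond, vocab):
    def log_posterior(f):
        return math.log(prior[f]) + sum(
            math.log(cond[f][w] if w in message else 1.0 - cond[f][w]) for w in vocab)
    return max(prior, key=log_posterior)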


Standard Co-training

Create classifiers C1 and C2 based on the two independent feature sets.

Repeat i times:
  Add the messages most confidently classified by either classifier to the training data
  Rebuild C1 and C2 with the new training data
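A compact sketch of this loop, assuming two feature views of the same messages and any classifier object with scikit-learn-style fit / predict_proba / classes_. The selection heuristic (highest predicted class probability from either view) and all names are illustrative:

import numpy as np

def co_train(clf1, clf2, X1, X2, y, U1, U2, iterations=10, k=5):
    # X1, X2: labeled data in the two views; y: labels; U1, U2: unlabeled data in the two views
    X1, X2, y = list(X1), list(X2), list(y)
    remaining = list(range(len(U1)))
    for _ in range(iterations):
        clf1.fit(np.array(X1), np.array(y))                      # rebuild C1 and C2
        clf2.fit(np.array(X2), np.array(y))
        if not remaining:
            break
        p1 = clf1.predict_proba(np.array([U1[i] for i in remaining]))
        p2 = clf2.predict_proba(np.array([U2[i] for i in remaining]))
        conf = np.maximum(p1.max(axis=1), p2.max(axis=1))        # confidence of the more confident classifier
        picked = list(np.argsort(-conf)[:k])                     # most confidently classified messages
        for pos in picked:
            i = remaining[pos]
            probs = p1[pos] if p1[pos].max() >= p2[pos].max() else p2[pos]
            X1.append(U1[i]); X2.append(U2[i])
            y.append(clf1.classes_[int(np.argmax(probs))])       # predicted label joins the training data
        remaining = [remaining[pos] for pos in range(len(remaining)) if pos not in set(picked)]
    return clf1, clf2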


User Co-training

C_user = “classifier” based on user feedback
C_ML = machine learning algorithm

For each “session” of user feedback:
  Add the messages most confidently classified by C_user to the training data
  Rebuild C_ML with the new training data


The inner loop is expanded below.


User Co-training

For each folder f, let the vector v_f = words whose weights the user increased for f

For each message m in the unlabeled set:
  For each folder f:
    Compute Prob_f from the machine learning classifier
    Score_f = (# of words in v_f appearing in the message) × Prob_f
  f_max = argmax over f in Folders of Score_f
  Score_other = max over f in Folders \ {f_max} of Score_f
  Score_m = Score_f_max − Score_other

Sort Score_m for all messages in decreasing order
Select the top k messages to add to the training set, along with their folder label f_max
Rebuild C_ML with the new training data
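A minimal sketch of one feedback session under this scheme, with messages represented as sets of words. prob_from_cml and rebuild_cml are hypothetical helpers standing in for the machine learning classifier's probability estimates and its retraining step:

def user_cotrain_session(prob_from_cml, rebuild_cml, folders, v,
                         unlabeled, training, labels, k=10):
    # v: {folder: set of words whose weight the user increased for that folder}
    scored = []
    for m in unlabeled:
        prob = prob_from_cml(m)                                  # Prob_f for each folder f
        score = {f: len(v.get(f, set()) & m) * prob[f] for f in folders}
        f_max = max(score, key=score.get)
        score_other = max(s for f, s in score.items() if f != f_max)
        scored.append((score[f_max] - score_other, m, f_max))    # Score_m
    scored.sort(key=lambda t: t[0], reverse=True)                # decreasing Score_m
    for _, m, f_max in scored[:k]:                               # top k messages, labeled f_max
        training.append(m)
        labels.append(f_max)
    return rebuild_cml(training, labels)                         # rebuild C_ML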


Constraint-based vs User co-training

Constraint-based:
• Difficult to set the “hardness” of a constraint
• Constraints often already satisfied
• End user can over-constrain the learning algorithm
• Slow

User co-training:
• Requires unlabeled emails in the inbox
• Better accuracy than constraint-based


Results

[Bar charts: classification accuracy (0–100%) by algorithm, one chart for feedback from the keyword-based paradigm and one for feedback from the similarity-based paradigm]


Outline

1. What types of explanations work for end users? What types of corrective feedback could end users provide? (IUI 2007)

2. How do we incorporate this feedback into a ML algorithm? (IJHCS 2008)

3. What happens when we put this together? (IUI 2008)


Experiment: Email program



Experiment: Procedure

• Intelligent email system to classify emails into folders
• 43 English-speaking, non-CS students
• Background questionnaire
• Tutorial (email program and folders)
• Experiment task on feedback set
  - Correct folders; add, remove, or change the weight on keywords
• 30 interaction logs
• Post-session questionnaire


Experiment: Data

• Enron data set
• 9 folders
• 50 training messages
  - 10 each for 5 folders, with folder labels
• 50 feedback messages
  - For use in the experiment
  - Same for each participant
• 1051 test messages
  - For evaluation after the experiment


Experiment: Classification algorithm “User co-training”

• Two classifiers: user and Naïve Bayes
• Slight modification to the user classifier:
  Score_f = sum of the weights of the words in v_f that appear in the message
• Weights can be modified interactively by the user
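A one-line sketch of that modified user-classifier score, where user_weights is a hypothetical {folder: {word: weight}} mapping maintained from the participant's keyword changes and a message is a set of words:

def user_score(message, user_weights):
    # Score_f = sum of the user-assigned weights of the v_f words appearing in the message
    return {folder: sum(w for word, w in fw.items() if word in message)
            for folder, fw in user_weights.items()}

# e.g. user_score({"java", "compile"}, {"CS 162": {"java": 2.0}, "Personal": {"mom": 1.5}})
#      returns {"CS 162": 2.0, "Personal": 0.0}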


Results: Accuracy improvements of rich feedback


[Chart: accuracy Δ per subject of rich feedback (participant folder labels and keyword changes) over folder feedback (participant folder labels only)]


Results: Accuracy improvements of rich feedback


[Chart: accuracy Δ per subject of rich feedback (participant folder labels and keyword changes) over the baseline (original Enron labels)]


Results: Accuracy summary

• 60% of participants saw accuracy improvements, some very substantial
• Some dramatic decreases
• More time between filing emails, or more folder assignments → higher accuracy


Interesting bits

1. Need to communicate the effects of the user’s corrective feedback

2. Unstable classifier period
   - With sparse training data, a single new training example can dramatically change the classifier’s decision boundaries
   - Wild fluctuations in the classifier’s predictions frustrate end users
   - Causes a “wall of red”


Interesting bits: Unstable classifier period

[Line chart: accuracy (0–0.7) vs. number of training data points (0–350)]

Moved test emails into the training set to look for the effect on accuracy (baseline, participant 101)


Interesting bits

3. “Unlearning” important, especially to correct undesirable changes

4. Gender differences
   - Females took longer to complete the task
   - Females added twice as many keywords
   - Females commented more on unlearning


Interesting directions for HCI

1. Gender differences
2. More directed debugging
3. Other forms of feedback
4. Communicating effects of corrective feedback
   - Users need to detect the system is listening to their feedback
5. Explanations
   - Form
   - Fidelity


Interesting directions for Machine Learning

1. Algorithms for learning from corrective feedback

2. Modeling reliability of user feedback

3. Explanations
4. Incorporating new features


Future work

ML Whyline (with Andy Ko)


For more information

[email protected]

www.eecs.oregonstate.edu/~wong
