
Learning from Labeled and Unlabeled Data

Tom Mitchell
Statistical Approaches to Learning and Discovery, 10-702 and 15-802

March 31, 2003

When can Unlabeled Data help supervised learning?

Consider setting:
• Set X of instances drawn from unknown P(X)
• f: X → Y target function (or, P(Y|X))
• Set H of possible hypotheses for f

Given:
• iid labeled examples
• iid unlabeled examples

Determine:

When can Unlabeled Data help supervised learning?

Important question! In many cases, unlabeled data is plentiful, labeled data expensive

• Medical outcomes (x=<patient,treatment>, y=outcome)

• Text classification (x=document, y=relevance)

• User modeling (x=user actions, y=user intent)

• …

Four Ways to Use Unlabeled Data for Supervised Learning

1. Use to reweight labeled examples

2. Use to help EM learn class-specific generative models

3. If problem has redundantly sufficient features, use CoTraining

4. Use to detect/preempt overfitting

1. Use U to reweight labeled examples
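The slide gives no further detail on this first approach; purely as an assumption about one common reading of it, the sketch below uses the unlabeled data to estimate the input density P(X) and importance-weight the labeled examples before fitting a standard classifier. All names, libraries, and the particular weighting scheme are illustrative, not from the lecture.

```python
# Hedged sketch only: estimate P(X) from unlabeled data with a kernel density
# estimate, then weight each labeled example by the estimated density at its
# location, so regions the unlabeled data says are common carry more weight.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KernelDensity

def fit_with_reweighted_labels(X_labeled, y_labeled, X_unlabeled, bandwidth=0.5):
    # density estimate from the (plentiful) unlabeled data
    kde = KernelDensity(bandwidth=bandwidth).fit(X_unlabeled)
    # weight each labeled example by the estimated density at its x
    weights = np.exp(kde.score_samples(X_labeled))
    weights = weights / weights.sum()
    clf = LogisticRegression()
    clf.fit(X_labeled, y_labeled, sample_weight=weights)
    return clf
```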

2. Use U with EM and Assumed Generative Model

[Figure: Naive Bayes generative model, with class Y the parent of features X1, X2, X3, X4]

Y   X1  X2  X3  X4
1   0   0   1   1
0   0   1   0   0
0   0   1   1   0
?   0   0   0   1
?   0   1   0   1

(? marks unlabeled examples)

Learn P(Y|X)

From [Nigam et al., 2000]

E Step: estimate class membership probabilities for each unlabeled document,

$P(c_j \mid d_i; \hat\theta) \propto P(c_j \mid \hat\theta) \prod_t P(w_t \mid c_j; \hat\theta)^{N(w_t, d_i)}$

M Step: re-estimate word probabilities using all documents, weighted by those memberships,

$\hat\theta_{w_t \mid c_j} = \frac{1 + \sum_i N(w_t, d_i)\, P(c_j \mid d_i)}{|V| + \sum_s \sum_i N(w_s, d_i)\, P(c_j \mid d_i)}$

where $w_t$ is the t-th word in the vocabulary and $N(w_t, d_i)$ is its count in document $d_i$.

Elaboration 1: Downweight the influence of unlabeled examples by a factor λ

New M step: counts contributed by unlabeled documents are multiplied by λ; λ is chosen by cross validation
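A compact sketch of what this EM loop looks like in code, in the spirit of Nigam et al. (2000): a multinomial Naive Bayes text model whose M step downweights unlabeled documents by lam (the λ above). Function and variable names are illustrative, not from the paper.

```python
import numpy as np

def semi_supervised_nb(X, y_labeled, n_classes, lam=1.0, n_iters=10, alpha=1.0):
    """X: document-by-word count matrix; the first len(y_labeled) rows are labeled."""
    n_docs, vocab = X.shape
    n_labeled = len(y_labeled)
    # class responsibilities: labeled docs fixed one-hot, unlabeled start uniform
    R = np.full((n_docs, n_classes), 1.0 / n_classes)
    R[:n_labeled] = 0.0
    R[np.arange(n_labeled), y_labeled] = 1.0
    for _ in range(n_iters):
        # M step: unlabeled responsibilities are downweighted by lam
        W = R.copy()
        W[n_labeled:] *= lam
        priors = (W.sum(axis=0) + alpha) / (W.sum() + n_classes * alpha)
        word_counts = W.T @ X                                  # (classes, vocab)
        theta = (word_counts + alpha) / (word_counts.sum(axis=1, keepdims=True) + vocab * alpha)
        # E step: recompute P(class | doc); only unlabeled rows are updated
        log_post = np.log(priors) + X @ np.log(theta).T        # (docs, classes)
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        R[n_labeled:] = post[n_labeled:]
    return priors, theta
```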

Experimental Evaluation

• Newsgroup postings – 20 newsgroups, 1000/group

• Web page classification – student, faculty, course, project – 4,199 web pages

• Reuters newswire articles – 12,902 articles – 90 topic categories

[Figure: 20 Newsgroups results]

[Figure: 20 Newsgroups results, using one labeled example per class]

• Can’t really get something for nothing…

• But unlabeled data is useful to the degree that the assumed form for P(X,Y) is correct

• E.g., in text classification, it helps despite the obvious error in the assumed form of P(X,Y)

2. Use U with EM and Assumed Generative Model

3. If Problem Setting Provides Redundantly Sufficient Features, use CoTraining

learn $f : X \rightarrow Y$, where $X = X_1 \times X_2$,
where $x$ is drawn from an unknown distribution,
and $\exists\, g_1, g_2$ such that $g_1(x_1) = g_2(x_2) = f(x)$

Redundantly Sufficient Features

[Example: two views of a web page: x1 = words on the page itself (e.g., "Professor Faloutsos"), x2 = words on hyperlinks pointing to it (e.g., "my advisor")]

CoTraining Algorithm #1 [Blum&Mitchell, 1998]

Given: labeled data L,

unlabeled data U

Loop:

Train g1 (hyperlink classifier) using L

Train g2 (page classifier) using L

Allow g1 to label p positive, n negative examples from U

Allow g2 to label p positive, n negative examples from U

Add these self-labeled examples to L
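A minimal sketch of this loop in code, assuming scikit-learn-style Naive Bayes base classifiers over two bag-of-words views held in numpy arrays; the parameter defaults (p, n, number of iterations) are illustrative, not from the slides.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def cotrain(X1_l, X2_l, y_l, X1_u, X2_u, p=1, n=3, num_iters=30):
    """Co-training in the style of the loop above: two views, two classifiers."""
    X1_l, X2_l, y_l = list(X1_l), list(X2_l), list(y_l)
    unlabeled = list(range(len(X1_u)))
    g1 = MultinomialNB().fit(np.asarray(X1_l), y_l)   # e.g. hyperlink classifier
    g2 = MultinomialNB().fit(np.asarray(X2_l), y_l)   # e.g. page classifier
    for _ in range(num_iters):
        for g, X_u in ((g1, X1_u), (g2, X2_u)):
            if len(unlabeled) < p + n:
                return g1, g2
            probs = g.predict_proba(X_u[unlabeled])[:, 1]   # P(positive | x)
            order = np.argsort(probs)
            picks = [(unlabeled[i], 0) for i in order[:n]] + \
                    [(unlabeled[i], 1) for i in order[-p:]]
            for idx, label in picks:                        # add self-labeled examples
                X1_l.append(X1_u[idx]); X2_l.append(X2_u[idx]); y_l.append(label)
                unlabeled.remove(idx)
        # retrain both classifiers on the augmented labeled set
        g1 = MultinomialNB().fit(np.asarray(X1_l), y_l)
        g2 = MultinomialNB().fit(np.asarray(X2_l), y_l)
    return g1, g2
```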

CoTraining: Experimental Results

• begin with 12 labeled web pages (academic course)
• provide 1,000 additional unlabeled web pages
• average error: learning from labeled data 11.1%
• average error: cotraining 5.0%

Typical run:

Co-Training Rote Learner

[Figure: bipartite graph connecting the two views, with hyperlink anchor texts (e.g., "my advisor") on one side and pages on the other; + / - labels propagate within connected components of the graph]

Expected Rote CoTraining error given m examples

$E[\text{error}] = \sum_j P(x \in g_j)\,\bigl(1 - P(x \in g_j)\bigr)^m$

where $g_j$ is the jth connected component of the graph
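A few lines of code illustrate this quantity; the component probabilities below are hypothetical inputs, not values from the lecture.

```python
def expected_rote_error(component_probs, m):
    """Expected rote co-training error: sum_j P(g_j) * (1 - P(g_j))**m."""
    return sum(p * (1 - p) ** m for p in component_probs)

# e.g. four equally likely connected components and m = 10 labeled examples:
# expected_rote_error([0.25, 0.25, 0.25, 0.25], 10) ~= 0.056
```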

CoTraining setting:

learn $f : X \rightarrow Y$, where $X = X_1 \times X_2$,
where $x$ is drawn from an unknown distribution,
and $\exists\, g_1, g_2$ such that $g_1(x_1) = g_2(x_2) = f(x)$

How many unlabeled examples suffice?

Want to assure that connected components in the underlying distribution, GD, are connected components in the observed sample, GS

[Figure: underlying distribution graph GD and observed sample graph GS]

O(log(N)/α) examples assure that with high probability, GS has the same connected components as GD [Karger, 94]

where N is the size of GD and α is the min cut over all connected components of GD

CoTraining Setting

learn $f : X \rightarrow Y$, where $X = X_1 \times X_2$,
where $x$ is drawn from an unknown distribution,
and $\exists\, g_1, g_2$ such that $g_1(x_1) = g_2(x_2) = f(x)$

• If
  – x1, x2 conditionally independent given y
  – f is PAC learnable from noisy labeled data

• Then
  – f is PAC learnable from weak initial classifier plus unlabeled data

PAC Generalization Bounds on CoTraining

[Dasgupta et al., NIPS 2001]

• Idea: Want classifiers that produce a maximally consistent labeling of the data

• If learning is an optimization problem, what function should we optimize?

What if CoTraining Assumption Not Perfectly Satisfied?


What Objective Function?

$E = E_1 + E_2 + c_3 E_3 + c_4 E_4$

$E_1 = \sum_{\langle x, y\rangle \in L} \bigl(y - \hat{g}_1(x)\bigr)^2$   (error on labeled examples)

$E_2 = \sum_{\langle x, y\rangle \in L} \bigl(y - \hat{g}_2(x)\bigr)^2$   (error on labeled examples)

$E_3 = \sum_{x \in U} \bigl(\hat{g}_1(x) - \hat{g}_2(x)\bigr)^2$   (disagreement over unlabeled)

$E_4 = \Bigl(\frac{1}{|L|}\sum_{\langle x, y\rangle \in L} y \;-\; \frac{1}{|L \cup U|}\sum_{x \in L \cup U} \frac{\hat{g}_1(x) + \hat{g}_2(x)}{2}\Bigr)^2$   (misfit to estimated class priors)

What Function Approximators?

• Same fn form as Naïve Bayes, Max Entropy

• Use gradient descent to simultaneously learn g1 and g2, directly minimizing E = E1 + E2 + E3 + E4

• No word independence assumption, use both labeled and unlabeled data

$\hat{g}_1(x) = \frac{1}{1 + e^{\sum_j w_{1j}\, x_{1j}}}$

$\hat{g}_2(x) = \frac{1}{1 + e^{\sum_j w_{2j}\, x_{2j}}}$
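A hedged sketch of what gradient co-training over this objective can look like with logistic function approximators; the step size, the coefficients c3 and c4, and the finite-difference gradient are illustrative choices of mine, not details from the slides (which presumably use analytic gradients).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(w1, w2, X1_l, X2_l, y_l, X1_u, X2_u, c3=1.0, c4=1.0):
    g1_l, g2_l = sigmoid(X1_l @ w1), sigmoid(X2_l @ w2)
    g1_u, g2_u = sigmoid(X1_u @ w1), sigmoid(X2_u @ w2)
    E1 = np.sum((y_l - g1_l) ** 2)                      # error on labeled, view 1
    E2 = np.sum((y_l - g2_l) ** 2)                      # error on labeled, view 2
    E3 = np.sum((g1_u - g2_u) ** 2)                     # disagreement on unlabeled
    prior_hat = np.mean(y_l)                            # class prior from labels
    prior_model = np.mean(np.concatenate([g1_l, g1_u, g2_l, g2_u]))
    E4 = (prior_hat - prior_model) ** 2                 # misfit to estimated prior
    return E1 + E2 + c3 * E3 + c4 * E4

def gradient_step(w1, w2, args, lr=0.01, eps=1e-5):
    """One numerical-gradient descent step on E over both weight vectors."""
    w, k = np.concatenate([w1, w2]), len(w1)
    grad = np.zeros_like(w)
    for i in range(len(w)):
        wp, wm = w.copy(), w.copy()
        wp[i] += eps; wm[i] -= eps
        grad[i] = (objective(wp[:k], wp[k:], *args) -
                   objective(wm[:k], wm[k:], *args)) / (2 * eps)
    w = w - lr * grad
    return w[:k], w[k:]
```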

Classifying Jobs for FlipDog

X1: job title
X2: job description

Gradient CoTraining Classifying FlipDog job descriptions: SysAdmin vs. WebProgrammer

Final Accuracy

Labeled data alone: 86%

CoTraining: 96%

Gradient CoTraining Classifying Upper Case sequences as Person Names

[Table: error rates for two settings (25 labeled + 5000 unlabeled; 2300 labeled + 5000 unlabeled) and three methods: using labeled data only, cotraining, and cotraining without fitting class priors (E4). Reported error rates are .27, .13, .24, .11, and .15; entries marked * are sensitive to the weights of error terms E3 and E4.]

CoTraining Summary

• Unlabeled data improves supervised learning when example features are redundantly sufficient
  – Family of algorithms that train multiple classifiers

• Theoretical results
  – Expected error for rote learning
  – If X1, X2 conditionally independent given Y:
    • PAC learnable from weak initial classifier plus unlabeled data
    • error bounds in terms of disagreement between g1(x1) and g2(x2)

• Many real-world problems of this type
  – Semantic lexicon generation [Riloff, Jones 99], [Collins, Singer 99]

– Web page classification [Blum, Mitchell 98]

– Word sense disambiguation [Yarowsky 95]

– Speech recognition [de Sa, Ballard 98]

4. Use U to Detect/Preempt Overfitting

• Definition of distance metric
  – non-negative: d(f,g) ≥ 0
  – symmetric: d(f,g) = d(g,f)
  – triangle inequality: d(f,g) ≤ d(f,h) + d(h,g)

• Classification with zero-one loss: $d(f,g) = P_x\bigl(f(x) \neq g(x)\bigr)$

• Regression with squared loss: $d(f,g) = \bigl(E_x\bigl[(f(x) - g(x))^2\bigr]\bigr)^{1/2}$

Both can be estimated from unlabeled data alone, since neither requires the true labels.
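A hedged sketch of a TRI-style selection rule for the polynomial-regression experiment described next: candidate fits of increasing degree are trained on the labeled data, distances between fits are estimated on unlabeled x values with the squared-loss metric above, and candidates whose unlabeled-data distance to an earlier fit exceeds the sum of the two training errors (a triangle-inequality violation) are treated as overfit. The exact rule and all names here are my reading of Schuurmans & Southey (2002), not a verbatim transcription of their procedure.

```python
import numpy as np

def dist(pred_a, pred_b):
    """Squared-loss metric between two prediction vectors."""
    return np.sqrt(np.mean((pred_a - pred_b) ** 2))

def tri_select(x_l, y_l, x_u, max_degree=10):
    """Pick a polynomial degree using labeled (x_l, y_l) and unlabeled x_u."""
    fits, train_err = [], []
    for d in range(max_degree + 1):
        coeffs = np.polyfit(x_l, y_l, d)
        fits.append(coeffs)
        train_err.append(dist(np.polyval(coeffs, x_l), y_l))
    best = 0
    for k in range(1, len(fits)):
        # keep k only if it satisfies the triangle inequality against all
        # earlier hypotheses, with distances measured on the unlabeled data
        ok = all(
            dist(np.polyval(fits[j], x_u), np.polyval(fits[k], x_u))
            <= train_err[j] + train_err[k]
            for j in range(k)
        )
        if ok:
            best = k
    return best, fits[best]
```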

Experimental Evaluation of TRI [Schuurmans & Southey, MLJ 2002]

• Use it to select degree of polynomial for regression

• Compare to alternatives such as cross validation, structural risk minimization, …

Generated y values contain zero-mean Gaussian noise: Y = f(x) + ε

Cross validation (ten-fold); structural risk minimization

Approximation ratio = (true error of selected hypothesis) / (true error of best hypothesis considered)

Results using 200 unlabeled, t labeled

Worst performance in top .50 of trials

Bound on Error of TRI Relative to Best Hypothesis Considered

Extension to TRI: Adjust for expected bias of training data estimates

[Schuurmans & Southey, MLJ 2002]

Experimental results: averaged over multiple target functions, outperforms TRI

Summary

Several ways to use unlabeled data in supervised learning

Ongoing research area

1. Use to reweight labeled examples

2. Use to help EM learn class-specific generative models

3. If problem has redundantly sufficient features, use CoTraining

4. Use to detect/preempt overfitting

Further Reading

• EM approach: K. Nigam, et al., 2000. "Text Classification from Labeled and Unlabeled Documents using EM", Machine Learning, 39, pp. 103–134.

• CoTraining: A. Blum and T. Mitchell, 1998. “Combining Labeled and Unlabeled Data with Co-Training,” Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT-98).

• S. Dasgupta, et al., “PAC Generalization Bounds for Co-training”, NIPS 2001

• Model selection: D. Schuurmans and F. Southey, 2002. "Metric-Based Methods for Adaptive Model Selection and Regularization", Machine Learning, 48, pp. 51–84.