Learning to Predict
Presenter: Russell Greiner
Vision Statement
Helping the world understand data
… and make informed decisions.
Single decision: determine the class label of an instance
  set of labels for a set of pixels, …
  value of a property of an instance, …
Motivation for Training a Predictor

  Temp  Press.  Sore-Throat  …  Color
  32    90      N            …  Pale    →  Predictor  →  treatX: Ok

Need to know the "label" of an instance, to determine the appropriate action:
  Predictor_Med( patient#2 )  =?  "treatX is Ok"
Unfortunately, Predictor(.) is not known a priori
But there are many examples of ⟨patient, treatX⟩
Motivation for Training a Predictor
Machine learning provides algorithms for mapping a set of { ⟨patient, treatX⟩ } examples to a Predictor(.) function (see the sketch below)

  Temp.  Press.  Sore Throat  …  Colour  treatX
  10     87      N            …  Pale    No
  22     110     N            …  Clear   Ok
  35     95      Y            …  Pale    No
  :      :       :               :       :

The Learner maps this table to a Predictor, e.g. Predictor( 32, 90, N, …, Pale ) = "treatX: Ok"
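A minimal, hypothetical sketch of this mapping, using an off-the-shelf decision-tree learner rather than the project's own algorithms; the numeric encodings of Sore Throat and Colour are illustrative assumptions.

from sklearn.tree import DecisionTreeClassifier

# Each row: [Temp, Press, SoreThroat (1=Y / 0=N), Colour (0=Pale, 1=Clear)]
X_train = [[10, 87, 0, 0],
           [22, 110, 0, 1],
           [35, 95, 1, 0]]
y_train = ["No", "Ok", "No"]                    # treatX label for each patient

predictor = DecisionTreeClassifier().fit(X_train, y_train)
print(predictor.predict([[32, 90, 0, 0]]))      # label the new patient from the slide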
Motivation for Training a Predictor
Need to learn the predictor (not program it in) when it is …
  … not known
  … not expressible
  … changing
  … user dependent

[Figure: the same training table, Learner, and Predictor as on the previous slide.]
Personnel
PI synergy: Greiner, Schuurmans, Holte, Sutton, Szepesvari, Goebel
5 Postdocs
16 Grad students (5 MSc, 11 PhD)
5 Supporting technical staff
+ personnel for the Bioinformatics thrust
Partners/Collaborators
4 UofA CS profs
1 UofAlberta Math/Stat prof
Non-UofA collaborators: Google, Yahoo!, Electronic Arts, UofMontreal, UofWaterloo, UofNebraska, NICTA, NRC-IIT, …
+ Bioinformatics thrust collaborators
Additional Resources
Grants
  $225K CFI
  $100K MITACS
  $100K Google
Hardware
  68-processor, 2TB Opteron cluster
  54-processor (dual core), 1.5TB Opteron cluster
+ funds/data for the Bioinformatics thrust
Highlights
IJCAI 2005 – Distinguished Paper Prize
UM 2003 – Best Student Paper Prize
WebIC technology is the foundation for a start-up company
Significant advances in extending SVMs to use unsupervised/semi-supervised data, and to handle structured data
+ Highlights from the Bioinformatics thrust
Learning to Predict: Challenges
Simplifying assumptions re: training data
  IID / unstructured
  Lots of instances
  Low dimensions
  Complete features
  Completely labeled
  Balanced data
  … is sufficient

[Figure: the training table of ⟨Temp, Press., Sore Throat, …, Colour, treatX⟩ examples, fed to a Learner that produces a Predictor.]
Learning to Predict: Challenges
Simplifying assumption questioned: IID / unstructured ?
  Segmenting Brain Tumors
  Extensions to Conditional Random Fields, …
Learning to Predict: Challenges
Simplifying assumptions questioned: Lots of instances ?  Low dimensions ?

[Figure: the standard setting these assumptions describe: a data matrix with m in the 1000's of instances (rows) and only N in the 10's of features (columns), each row labeled Y or N.]
Learning to Predict: Challenges
Simplifying assumptions questioned: Lots of instances ?  Low dimensions ?

  g1    g2    g3    …  gN    disease
  7.3   2.1   55.0  …  1.1   Y
  22.1  6.03  29.1  …  3.0   Y
  :     :     :        :     :
  32.0  1.9   15.8  …  2.8   N

  m ≈ 100 instances, N ≈ 20,000 features (Microarray, SNP Chips, …)

Dimensionality Reduction … L2 Model: Component Discovery, BiCluster Coding
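A generic dimensionality-reduction sketch for a matrix of this shape, using plain PCA rather than the L2 / BiCluster methods named above; the data here is random and purely illustrative.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 20000)                   # m ~ 100 patients, N ~ 20,000 gene-expression features
X_reduced = PCA(n_components=20).fit_transform(X)
print(X_reduced.shape)                           # (100, 20): a far easier learning problem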
Learning to Predict: Challenges
Simplifying assumption questioned: Complete features ?

  g1    g2    g3    …  gN    diseaseX
  7.3   2.1   55.0  …  1.1   Y
  22.1  6.03  29.1  …  3.0   Y
  :     :     :        :     :
  32.0  1.9   15.8  …  2.8   N

… vs. tables where many, or even all, of the feature values are missing and only the labels are known

Budgeted Learning
Learning to Predict: Challenges
Simplifying assumption questioned: Completely labeled ?

  g1    g2    g3    …  gN    treatX
  7.3   2.1   55.0  …  1.1   Y
  22.1  6.03  29.1  …  3.0   Y
  20.7  6.03  29.1  …  3.0   N
  22.1  8.73  20.1  …  5.0   N
  123   6.03  17.1  …  7.0   Y
  :     :     :        :     :
  32.0  1.9   15.8  …  2.8   N

… vs. data where only a few of the instances have labels

SemiSupervised Learning
Active Learning
Learning to Predict: Challenges
Simplifying assumption questioned: Balanced data ?
Cost Curves (analysis)
Learning to Predict: Challenges
Simplifying assumption questioned: … is sufficient ?
Robust SVM
Mixture Using Variance
Large Margin Bayes Net
Coordinated Classifiers
…
Projects and Status
IID / unstructured  →  Structured Prediction: Random Fields, Parsing, Unsupervised M3N
Lots of instances / Low dimensions  →  Dimensional Reduction (L2 Model: Component Discovery)
Complete features  →  Budgeted Learning
Completely labeled  →  SemiSupervised Learning (large-margin SVM, probabilistic CRF, graph-based transduction); Active Learning
Balanced data  →  Cost Curves (Poster #26)
Beyond simple learners  →  Robust SVM, Coordinated Classifiers, Mixture Using Variance, Large Margin Bayes Net
Budgeted Learning
Technical Details
Typical Supervised Learning

  b 0 5 b | 1
  b 1 3 a | 0
  a 1 1 a | 0
  b 1 1 a | 0
  a 0 3 a | 1
  (each row is one person: feature values | Response)

The complete table of feature values and responses is given to the Learner, which produces a Predictor.
Active Learning

  b 0 5 b | ?
  b 1 3 a | ?
  a 1 1 a | ?
  b 1 1 a | ?
  a 0 3 a | ?

All feature values are known, but the Response labels are not.
The user is able to PURCHASE labels, at some cost … but for which instances?
Budgeted Learning

  ? ? ? ? | 1
  ? ? ? ? | 0
  ? ? ? ? | 0
  ? ? ? ? | 0
  ? ? ? ? | 1

All Response labels are known, but the feature values are not.
The user is able to PURCHASE values of individual features, at some cost … but which features, for which instances?
Budgeted Learning
Significantly different from ACTIVE learning: correlations between feature values

[Figure: the same label-only table, after a few individual feature values have been purchased.]

The user is able to PURCHASE values of features, at some cost … but which features, for which instances?
[Figure: performance of the feature-purchase policies (round-robin, random, greedy, allocational, lookahead, biased-robin) as a function of the number of features purchased; n = 10 features with Beta(10,1) priors, 10 tests at $1/test, budget = $40.]
Budgeted Learning … so far
Defined framework
  Ability to purchase individual feature values
  Fixed LEARNING / CLASSIFICATION budget
Theoretical results
  NP-hard in general
  Standard algorithms are not even approximation algorithms!
Empirical results show …
  Avoid Round Robin
  Try clever algorithms: Biased Robin, Randomized Single Feature Lookahead (see the sketch below)
[Lizotte, Madani, Greiner: UAI'03], [Madani, Lizotte, Greiner: UAI'04], [Kapoor, Greiner: ECML'05]
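A minimal sketch of the Biased Robin idea, assuming the learner supplies a purchase routine and a gain test; the callback names (buy, improved) are hypothetical and not taken from the cited papers.

def biased_robin(n_features, budget, buy, improved):
    """Keep purchasing values of the same feature while purchases 'succeed';
    otherwise rotate to the next feature (unlike Round Robin, which always rotates)."""
    f = 0
    for _ in range(budget):            # each purchase costs one unit of budget
        buy(f)                         # purchase one (instance, feature) value of feature f
        if not improved():             # if the learner's gain measure did not improve ...
            f = (f + 1) % n_features   # ... move on to the next feature
    return f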
Future Work #1
[Figure: a budgeted-learning setting in which both the training table and the new instance to be classified consist almost entirely of unknown ('?') feature values; Learner, Classifier, and Response are shown as before.]
Future Work #2
Sample complexity of Budgeted Learning: how many (Ij, Xi) "probes" are required to PAC-learn?
Develop policies with guarantees on learning performance
More complex cost models … bundling tests, …
Allow the learner to perform more powerful probes
  e.g., purchase X3 in instances where X7 = 0 & Y = 1
More complex classifiers?
Future Work #3
Learning a Generative Model
Goal: find Θ* = argmax_Θ P(D | Θ)
[Figure: a data table whose entries are all unknown ('?'), in the budgeted purchasing setting.]
Projects and Status
Structured Prediction (ongoing)
Dimensional Reduction (ongoing; RoBiC: Poster #8)
Budgeted Learning (ongoing)
SemiSupervised Learning (ongoing)
Active Learning (ongoing)
Cost Curves (complete; Poster #26)

[Figure: bicluster pipeline: FindBiClusters on the training matrix (MTrain), compute BiClusterMembership, then a Learner produces a Classifier that assigns +/- labels to the test instances (MTest).]
Using Variance Estimates to Combine Bayesian Classifiers
Technical Details
Motivation
Suppose we have many different classifiers …
For each instance, we want each classifier to …
  "know what it knows" …
  … and shout LOUDEST when it knows best …
"Loudness"  ∝  1 / Variance !

[Figure: four classifiers C1, C2, C3, C4, each shown with its own '+' / 'o' training points and two query points (*, §).]
Mixture Using Variance
Given a belief net classifier with a fixed (correct) structure, and parameters estimated from a (random) data sample …
The response to a query "P(+c | -e, +w)" is … asymptotically normal, with an (asymptotic) variance
The variance is easy to compute … for simple structures (Naïve Bayes, TAN) … and for complete queries
[The slide gives the closed-form expression for this asymptotic variance of the query response.]
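A minimal sketch of the inverse-variance ("shout loudest when it knows best") combination behind MUV, assuming each base classifier already supplies its prediction and its estimated variance for the query; the function name is illustrative.

import numpy as np

def muv_combine(preds, variances):
    """Weight each classifier's probability estimate for one query by the
    inverse of its estimated variance, then renormalize."""
    preds = np.asarray(preds, dtype=float)           # p_j = classifier j's P(+c | query)
    weights = 1.0 / np.asarray(variances, dtype=float)
    return float(np.dot(weights, preds) / weights.sum())

# Example: the most confident classifier (smallest variance) dominates the mixture.
print(muv_combine([0.9, 0.4, 0.55], [0.01, 0.2, 0.05]))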
Experiment #4b: MUV(kNB, AdaBoost, js) vs AdaBoost(NB)
MUV significantly out-performs AdaBoost, even when using base classifiers that AdaBoost generated!
MUV(kNB, AdaBoost, js) is better than AdaBoost(NB) with p < 0.023
MUV Results
Sound statistical foundation
Very effective classifier … across many real datasets
MUV(NB) better than AdaBoost(NB)!
C. Lee, S. Wang and R. Greiner. ICML'06.
Mixture Using Variance … next steps?
Other structures (beyond NB, TAN)
Beyond just tabular CPTables for discrete variables: Noisy-or, Gaussians
Learn different base classifiers from different subsets of features
Scaling up to many, MANY features: overfitting characteristics?
Confidence in Classifier
Confidence of a prediction?
Fit each (μj, σj²) to a Beta(aj, bj)
Compute the area CDF_Beta(aj, bj)(0.5)
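A small sketch of this confidence computation, assuming the Beta parameters are obtained by moment-matching the prediction's mean and variance; that fitting choice is an assumption, not taken from the slides.

from scipy.stats import beta

def beta_confidence(mu, var):
    """Moment-match Beta(a, b) to a predicted probability (mean mu, variance var),
    then return CDF_Beta(a,b)(0.5), the mass the fitted Beta places below 0.5."""
    common = mu * (1.0 - mu) / var - 1.0      # from the Beta mean/variance formulas
    a, b = mu * common, (1.0 - mu) * common
    return beta.cdf(0.5, a, b)

print(beta_confidence(0.8, 0.01))             # a confident '+' prediction: little mass below 0.5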
Semi-Supervised Learning

[Figure: a small Labeled Training Data table (Temp., BP, Sore Throat, …, Colour, diseaseX) and a much larger UnLabeled Training Data table are both given to the Learner; the resulting Classifier labels a new instance (32, 90, N, …, Pale) as diseaseX: No.]
Approaches
Ignore the unlabeled data
  Great if you have LOTS of labeled data
Use the unlabeled data, as is … "Semi-Supervised Learning" … based on
  large margin (SVM)
  graph
  probabilistic model
  (a generic sketch follows below)
Pay to get labels for SOME of the unlabeled data: "Active Learning"
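A generic self-training sketch, one simple way to "use the unlabeled data, as is"; it is a plain baseline for illustration, not the large-margin, graph, or probabilistic methods listed above.

import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, rounds=5):
    """Repeatedly fit a classifier, pseudo-label the unlabeled instances it is
    confident about, and add them to the training set."""
    X, y = np.asarray(X_lab, float), np.asarray(y_lab)
    U = np.asarray(X_unlab, float)
    clf = LogisticRegression()
    for _ in range(rounds):
        clf.fit(X, y)
        if len(U) == 0:
            break
        probs = clf.predict_proba(U)
        confident = probs.max(axis=1) >= threshold          # keep only confident guesses
        if not confident.any():
            break
        X = np.vstack([X, U[confident]])                    # add pseudo-labeled instances
        y = np.append(y, clf.classes_[probs[confident].argmax(axis=1)])
        U = U[~confident]
    return clf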
Semi-supervised Multi-class SVM
Approach: find a labeling that would yield an optimal SVM classifier on the resulting training data.
This is hard, but semi-definite relaxations can approximate the objective surprisingly well; the training procedures are computationally intensive, but produce high-quality generalization results.
L. Xu, J. Neufeld, B. Larson, D. Schuurmans. Maximum margin clustering. NIPS-04.
L. Xu and D. Schuurmans. Unsupervised and semi-supervised multi-class SVMs. AAAI-05.
Probabilistic Approach to Semi-Supervised Learning
Probabilistic model: P(y | x)
Context: non-IID data
  Language modelling
  Segmenting brain tumors from MR images
Use unlabeled data as a regularizer
Future: other applications …
C-H. Lee, S. Wang, F. Jiao, D. Schuurmans and R. Greiner. Learning to model spatial dependency: Semi-supervised discriminative random fields. NIPS-06.
F. Jiao, S. Wang, C. Lee, R. Greiner, and D. Schuurmans. Semi-supervised conditional random fields for improved sequence segmentation and labeling. COLING/ACL-06.
Active Learning
Pay for the label of the query xi that … maximizes the conditional mutual information about the unlabeled data:
How to determine yi?
  Take the EXPECTATION wrt Yi?
  Use an OPTIMISTIC guess wrt Yi?
  Optimistic:    i* = argmin_{i ∈ U}  min_{y_i}  Σ_{u ∈ U}  H(Y_u | x_u, L ∪ {(x_i, y_i)})
  Expectation:   i* = argmin_{i ∈ U}  Σ_{y_i} P(y_i | x_i, L)  Σ_{u ∈ U}  H(Y_u | x_u, L ∪ {(x_i, y_i)})
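A small sketch of the optimistic criterion above, assuming a scikit-learn-style model with fit() and predict_proba(); it is illustrative, not the implementation from the IJCAI'07 paper.

import numpy as np

def optimistic_query(model, X_lab, y_lab, X_unlab):
    """Pick the unlabeled instance whose best-case ('optimistic') label choice
    minimizes the total entropy of the remaining unlabeled pool."""
    X_lab, X_unlab = np.asarray(X_lab, float), np.asarray(X_unlab, float)
    best_i, best_score = None, np.inf
    for i, x in enumerate(X_unlab):
        rest = np.delete(X_unlab, i, axis=0)
        for y in set(y_lab):                                  # try each possible label y_i
            m = model.fit(np.vstack([X_lab, [x]]), np.append(y_lab, y))
            p = m.predict_proba(rest)
            total_H = -np.sum(p * np.log(p + 1e-12))          # sum_u H(Y_u | x_u, L + {(x_i, y_i)})
            if total_H < best_score:                          # optimistic: min over y_i, then argmin over i
                best_i, best_score = i, total_H
    return best_i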
Optimistic Active Learning using Mutual Information
Need Optimism
Need "on-line adjustment"
Better than just MostUncertain, …
[Figure: learning curves on the pima and breast datasets.]
Y. Guo and R. Greiner. Optimistic active learning using mutual information. IJCAI'07.
Future Work on Active Learning
Understand WHY "optimism" works …
  + other applications of optimism
Extend the framework to deal with
  non-iid data
  different qualities of labelers
  …