Learning to Predict
Presenter: Russell Greiner
Vision Statement
Helping the world understand data
… and make informed decisions.
Single decision: determine the class label of an instance
  set of labels for a set of pixels, …
  value of a property of an instance, …
Motivation for Training a Predictor

  Temp  Press.  Sore-Throat  …  Color
  32    90      N            …  Pale    →  Predictor  →  treatX: Ok

Need to know the "label" of an instance, to determine the appropriate action:
  Predictor_Med( patient#2 )  =?  "treatX is Ok"
Unfortunately, Predictor(.) is not known a priori
But there are many examples of ⟨patient, treatX⟩
Motivation for Training a Predictor
Machine learning provides algorithms for mapping a set of { ⟨patient, treatX⟩ } examples to a Predictor(.) function (see the sketch below)

  Temp.  Press.  Sore Throat  …  Colour  treatX
  10     87      N            …  Pale    No
  22     110     N            …  Clear   Ok
  35     95      Y            …  Pale    No
  :      :       :               :       :

The Learner maps this table to a Predictor, e.g. Predictor( 32, 90, N, …, Pale ) = "treatX: Ok"
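A minimal, hypothetical sketch of this mapping, using an off-the-shelf decision-tree learner rather than the project's own algorithms; the numeric encodings of Sore Throat and Colour are illustrative assumptions.

from sklearn.tree import DecisionTreeClassifier

# Each row: [Temp, Press, SoreThroat (1=Y / 0=N), Colour (0=Pale, 1=Clear)]
X_train = [[10, 87, 0, 0],
           [22, 110, 0, 1],
           [35, 95, 1, 0]]
y_train = ["No", "Ok", "No"]                    # treatX label for each patient

predictor = DecisionTreeClassifier().fit(X_train, y_train)
print(predictor.predict([[32, 90, 0, 0]]))      # label the new patient from the slide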
Motivation for Training a Predictor
Need to learn the predictor (not program it in) when it is …
  … not known
  … not expressible
  … changing
  … user dependent

[Figure: the same training table, Learner, and Predictor as on the previous slide.]
Personnel
PI synergy: Greiner, Schuurmans, Holte, Sutton, Szepesvari, Goebel
5 Postdocs
16 Grad students (5 MSc, 11 PhD)
5 Supporting technical staff
+ personnel for the Bioinformatics thrust
Partners/Collaborators
4 UofA CS profs
1 UofAlberta Math/Stat prof
Non-UofA collaborators: Google, Yahoo!, Electronic Arts, UofMontreal, UofWaterloo, UofNebraska, NICTA, NRC-IIT, …
+ Bioinformatics thrust collaborators
Additional Resources
Grants
  $225K CFI
  $100K MITACS
  $100K Google
Hardware
  68-processor, 2TB Opteron cluster
  54-processor (dual core), 1.5TB Opteron cluster
+ funds/data for the Bioinformatics thrust
Highlights
IJCAI 2005 – Distinguished Paper Prize
UM 2003 – Best Student Paper Prize
WebIC technology is the foundation for a start-up company
Significant advances in extending SVMs to use unsupervised/semi-supervised data, and to handle structured data
+ Highlights from the Bioinformatics thrust
Learning to Predict: Challenges
Simplifying assumptions re: training data
  IID / unstructured
  Lots of instances
  Low dimensions
  Complete features
  Completely labeled
  Balanced data
  … is sufficient

[Figure: the training table of ⟨Temp, Press., Sore Throat, …, Colour, treatX⟩ examples, fed to a Learner that produces a Predictor.]
Learning to Predict: Challenges
Simplifying assumption questioned: IID / unstructured ?
  Segmenting Brain Tumors
  Extensions to Conditional Random Fields, …
Learning to Predict: Challenges
Simplifying assumptions questioned: Lots of instances ?  Low dimensions ?

[Figure: the standard setting these assumptions describe: a data matrix with m in the 1000's of instances (rows) and only N in the 10's of features (columns), each row labeled Y or N.]
Learning to Predict: Challenges
Simplifying assumptions questioned: Lots of instances ?  Low dimensions ?

  g1    g2    g3    …  gN    disease
  7.3   2.1   55.0  …  1.1   Y
  22.1  6.03  29.1  …  3.0   Y
  :     :     :        :     :
  32.0  1.9   15.8  …  2.8   N

  m ≈ 100 instances, N ≈ 20,000 features (Microarray, SNP Chips, …)

Dimensionality Reduction … L2 Model: Component Discovery, BiCluster Coding
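A generic dimensionality-reduction sketch for a matrix of this shape, using plain PCA rather than the L2 / BiCluster methods named above; the data here is random and purely illustrative.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 20000)                   # m ~ 100 patients, N ~ 20,000 gene-expression features
X_reduced = PCA(n_components=20).fit_transform(X)
print(X_reduced.shape)                           # (100, 20): a far easier learning problem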
Learning to Predict: Challenges
Simplifying assumption questioned: Complete features ?

  g1    g2    g3    …  gN    diseaseX
  7.3   2.1   55.0  …  1.1   Y
  22.1  6.03  29.1  …  3.0   Y
  :     :     :        :     :
  32.0  1.9   15.8  …  2.8   N

… vs. tables where many, or even all, of the feature values are missing and only the labels are known

Budgeted Learning
Learning to Predict: Challenges
Simplifying assumption questioned: Completely labeled ?

  g1    g2    g3    …  gN    treatX
  7.3   2.1   55.0  …  1.1   Y
  22.1  6.03  29.1  …  3.0   Y
  20.7  6.03  29.1  …  3.0   N
  22.1  8.73  20.1  …  5.0   N
  123   6.03  17.1  …  7.0   Y
  :     :     :        :     :
  32.0  1.9   15.8  …  2.8   N

… vs. data where only a few of the instances have labels

SemiSupervised Learning
Active Learning
Learning to Predict: Challenges
Simplifying assumption questioned: Balanced data ?
Cost Curves (analysis)
Learning to Predict: Challenges
Simplifying assumption questioned: … is sufficient ?
Robust SVM
Mixture Using Variance
Large Margin Bayes Net
Coordinated Classifiers
…
Projects and Status
IID / unstructured  →  Structured Prediction: Random Fields, Parsing, Unsupervised M3N
Lots of instances / Low dimensions  →  Dimensional Reduction (L2 Model: Component Discovery)
Complete features  →  Budgeted Learning
Completely labeled  →  SemiSupervised Learning (large-margin SVM, probabilistic CRF, graph-based transduction); Active Learning
Balanced data  →  Cost Curves (Poster #26)
Beyond simple learners  →  Robust SVM, Coordinated Classifiers, Mixture Using Variance, Large Margin Bayes Net
Budgeted Learning
Technical Details
Typical Supervised Learning

  b 0 5 b | 1
  b 1 3 a | 0
  a 1 1 a | 0
  b 1 1 a | 0
  a 0 3 a | 1
  (each row is one person: feature values | Response)

The complete table of feature values and responses is given to the Learner, which produces a Predictor.
Active Learning

  b 0 5 b | ?
  b 1 3 a | ?
  a 1 1 a | ?
  b 1 1 a | ?
  a 0 3 a | ?

All feature values are known, but the Response labels are not.
The user is able to PURCHASE labels, at some cost … but for which instances?
Budgeted Learning

  ? ? ? ? | 1
  ? ? ? ? | 0
  ? ? ? ? | 0
  ? ? ? ? | 0
  ? ? ? ? | 1

All Response labels are known, but the feature values are not.
The user is able to PURCHASE values of individual features, at some cost … but which features, for which instances?
Budgeted Learning
Significantly different from ACTIVE learning: correlations between feature values

[Figure: the same label-only table, after a few individual feature values have been purchased.]

The user is able to PURCHASE values of features, at some cost … but which features, for which instances?
[Figure: performance of the feature-purchase policies (round-robin, random, greedy, allocational, lookahead, biased-robin) as a function of the number of features purchased; n = 10 features with Beta(10,1) priors, 10 tests at $1/test, budget = $40.]
Budgeted Learning … so far
Defined framework
  Ability to purchase individual feature values
  Fixed LEARNING / CLASSIFICATION budget
Theoretical results
  NP-hard in general
  Standard algorithms are not even approximation algorithms!
Empirical results show …
  Avoid Round Robin
  Try clever algorithms: Biased Robin, Randomized Single Feature Lookahead (see the sketch below)
[Lizotte, Madani, Greiner: UAI'03], [Madani, Lizotte, Greiner: UAI'04], [Kapoor, Greiner: ECML'05]
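A minimal sketch of the Biased Robin idea, assuming the learner supplies a purchase routine and a gain test; the callback names (buy, improved) are hypothetical and not taken from the cited papers.

def biased_robin(n_features, budget, buy, improved):
    """Keep purchasing values of the same feature while purchases 'succeed';
    otherwise rotate to the next feature (unlike Round Robin, which always rotates)."""
    f = 0
    for _ in range(budget):            # each purchase costs one unit of budget
        buy(f)                         # purchase one (instance, feature) value of feature f
        if not improved():             # if the learner's gain measure did not improve ...
            f = (f + 1) % n_features   # ... move on to the next feature
    return f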
Future Work #1
[Figure: a budgeted-learning setting in which both the training table and the new instance to be classified consist almost entirely of unknown ('?') feature values; Learner, Classifier, and Response are shown as before.]
Future Work #2
Sample complexity of Budgeted Learning: how many (Ij, Xi) "probes" are required to PAC-learn?
Develop policies with guarantees on learning performance
More complex cost models … bundling tests, …
Allow the learner to perform more powerful probes
  e.g., purchase X3 in instances where X7 = 0 & Y = 1
More complex classifiers?
Future Work #3
Learning a Generative Model
Goal: find Θ* = argmax_Θ P(D | Θ)
[Figure: a data table whose entries are all unknown ('?'), in the budgeted purchasing setting.]
Projects and Status
Structured Prediction (ongoing)
Dimensional Reduction (ongoing; RoBiC: Poster #8)
Budgeted Learning (ongoing)
SemiSupervised Learning (ongoing)
Active Learning (ongoing)
Cost Curves (complete; Poster #26)

[Figure: bicluster pipeline: FindBiClusters on the training matrix (MTrain), compute BiClusterMembership, then a Learner produces a Classifier that assigns +/- labels to the test instances (MTest).]
Using Variance Estimates to Combine Bayesian Classifiers
Technical Details
Motivation
Suppose we have many different classifiers …
For each instance, we want each classifier to …
  "know what it knows" …
  … and shout LOUDEST when it knows best …
"Loudness"  ∝  1 / Variance !

[Figure: four classifiers C1, C2, C3, C4, each shown with its own '+' / 'o' training points and two query points (*, §).]
Mixture Using Variance
Given a belief net classifier with a fixed (correct) structure, and parameters estimated from a (random) data sample …
The response to a query "P(+c | -e, +w)" is … asymptotically normal, with an (asymptotic) variance
The variance is easy to compute … for simple structures (Naïve Bayes, TAN) … and for complete queries
[The slide gives the closed-form expression for this asymptotic variance of the query response.]
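A minimal sketch of the inverse-variance ("shout loudest when it knows best") combination behind MUV, assuming each base classifier already supplies its prediction and its estimated variance for the query; the function name is illustrative.

import numpy as np

def muv_combine(preds, variances):
    """Weight each classifier's probability estimate for one query by the
    inverse of its estimated variance, then renormalize."""
    preds = np.asarray(preds, dtype=float)           # p_j = classifier j's P(+c | query)
    weights = 1.0 / np.asarray(variances, dtype=float)
    return float(np.dot(weights, preds) / weights.sum())

# Example: the most confident classifier (smallest variance) dominates the mixture.
print(muv_combine([0.9, 0.4, 0.55], [0.01, 0.2, 0.05]))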
Experiment #4b: MUV(kNB, AdaBoost, js) vs AdaBoost(NB)
MUV significantly out-performs AdaBoost, even when using base classifiers that AdaBoost generated!
MUV(kNB, AdaBoost, js) is better than AdaBoost(NB) with p < 0.023
MUV Results
Sound statistical foundation
Very effective classifier … across many real datasets
MUV(NB) better than AdaBoost(NB)!
C. Lee, S. Wang and R. Greiner. ICML'06.
Mixture Using Variance … next steps?
Other structures (beyond NB, TAN)
Beyond just tabular CPTables for discrete variables: Noisy-or, Gaussians
Learn different base classifiers from different subsets of features
Scaling up to many, MANY features: overfitting characteristics?
Confidence in Classifier
Confidence of a prediction?
Fit each (μj, σj²) to a Beta(aj, bj)
Compute the area CDF_Beta(aj, bj)(0.5)
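A small sketch of this confidence computation, assuming the Beta parameters are obtained by moment-matching the prediction's mean and variance; that fitting choice is an assumption, not taken from the slides.

from scipy.stats import beta

def beta_confidence(mu, var):
    """Moment-match Beta(a, b) to a predicted probability (mean mu, variance var),
    then return CDF_Beta(a,b)(0.5), the mass the fitted Beta places below 0.5."""
    common = mu * (1.0 - mu) / var - 1.0      # from the Beta mean/variance formulas
    a, b = mu * common, (1.0 - mu) * common
    return beta.cdf(0.5, a, b)

print(beta_confidence(0.8, 0.01))             # a confident '+' prediction: little mass below 0.5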
Semi-Supervised Learning

[Figure: a small Labeled Training Data table (Temp., BP, Sore Throat, …, Colour, diseaseX) and a much larger UnLabeled Training Data table are both given to the Learner; the resulting Classifier labels a new instance (32, 90, N, …, Pale) as diseaseX: No.]
Approaches
Ignore the unlabeled data
  Great if you have LOTS of labeled data
Use the unlabeled data, as is … "Semi-Supervised Learning" … based on
  large margin (SVM)
  graph
  probabilistic model
  (a generic sketch follows below)
Pay to get labels for SOME of the unlabeled data: "Active Learning"
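A generic self-training sketch, one simple way to "use the unlabeled data, as is"; it is a plain baseline for illustration, not the large-margin, graph, or probabilistic methods listed above.

import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, rounds=5):
    """Repeatedly fit a classifier, pseudo-label the unlabeled instances it is
    confident about, and add them to the training set."""
    X, y = np.asarray(X_lab, float), np.asarray(y_lab)
    U = np.asarray(X_unlab, float)
    clf = LogisticRegression()
    for _ in range(rounds):
        clf.fit(X, y)
        if len(U) == 0:
            break
        probs = clf.predict_proba(U)
        confident = probs.max(axis=1) >= threshold          # keep only confident guesses
        if not confident.any():
            break
        X = np.vstack([X, U[confident]])                    # add pseudo-labeled instances
        y = np.append(y, clf.classes_[probs[confident].argmax(axis=1)])
        U = U[~confident]
    return clf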
Semi-supervised Multi-class SVM
Approach: find a labeling that would yield an optimal SVM classifier on the resulting training data.
This is hard, but semi-definite relaxations can approximate the objective surprisingly well; the training procedures are computationally intensive, but produce high-quality generalization results.
L. Xu, J. Neufeld, B. Larson, D. Schuurmans. Maximum margin clustering. NIPS-04.
L. Xu and D. Schuurmans. Unsupervised and semi-supervised multi-class SVMs. AAAI-05.
Probabilistic Approach to Semi-Supervised Learning
Probabilistic model: P(y | x)
Context: non-IID data
  Language modelling
  Segmenting brain tumors from MR images
Use unlabeled data as a regularizer
Future: other applications …
C-H. Lee, S. Wang, F. Jiao, D. Schuurmans and R. Greiner. Learning to model spatial dependency: Semi-supervised discriminative random fields. NIPS-06.
F. Jiao, S. Wang, C. Lee, R. Greiner, and D. Schuurmans. Semi-supervised conditional random fields for improved sequence segmentation and labeling. COLING/ACL-06.
Active Learning
Pay for the label of the query xi that … maximizes the conditional mutual information about the unlabeled data:
How to determine yi?
  Take the EXPECTATION wrt Yi?
  Use an OPTIMISTIC guess wrt Yi?
  Optimistic:    i* = argmin_{i ∈ U}  min_{y_i}  Σ_{u ∈ U}  H(Y_u | x_u, L ∪ {(x_i, y_i)})
  Expectation:   i* = argmin_{i ∈ U}  Σ_{y_i} P(y_i | x_i, L)  Σ_{u ∈ U}  H(Y_u | x_u, L ∪ {(x_i, y_i)})
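A small sketch of the optimistic criterion above, assuming a scikit-learn-style model with fit() and predict_proba(); it is illustrative, not the implementation from the IJCAI'07 paper.

import numpy as np

def optimistic_query(model, X_lab, y_lab, X_unlab):
    """Pick the unlabeled instance whose best-case ('optimistic') label choice
    minimizes the total entropy of the remaining unlabeled pool."""
    X_lab, X_unlab = np.asarray(X_lab, float), np.asarray(X_unlab, float)
    best_i, best_score = None, np.inf
    for i, x in enumerate(X_unlab):
        rest = np.delete(X_unlab, i, axis=0)
        for y in set(y_lab):                                  # try each possible label y_i
            m = model.fit(np.vstack([X_lab, [x]]), np.append(y_lab, y))
            p = m.predict_proba(rest)
            total_H = -np.sum(p * np.log(p + 1e-12))          # sum_u H(Y_u | x_u, L + {(x_i, y_i)})
            if total_H < best_score:                          # optimistic: min over y_i, then argmin over i
                best_i, best_score = i, total_H
    return best_i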
Optimistic Active Learning using Mutual Information
Need Optimism
Need "on-line adjustment"
Better than just MostUncertain, …
[Figure: learning curves on the pima and breast datasets.]
Y. Guo and R. Greiner. Optimistic active learning using mutual information. IJCAI'07.
Future Work on Active Learning
Understand WHY "optimism" works …
  + other applications of optimism
Extend the framework to deal with
  non-iid data
  different qualities of labelers
  …