RESULTS OF THE WCCI 2006 PERFORMANCE PREDICTION CHALLENGE Isabelle Guyon
Isabelle Guyon, Amir Reza Saffari Azar Alamdari, Gideon Dror
Part I
INTRODUCTION
Model selection
• Selecting models (neural net, decision tree, SVM, …)
• Selecting hyperparameters (number of hidden units, weight decay/ridge, kernel parameters, …)
• Selecting variables or features (dimensionality reduction).
• Selecting patterns (data cleaning; data reduction, e.g., by clustering).
Performance prediction
How good are you at predicting
how good you are?
• Practically important in pilot studies.
• Good performance predictions render model selection trivial.
Why a challenge?
• Stimulate research and push the state of the art.
• Move towards fair comparisons and give a voice to methods that work but may not be backed up by theory (yet).
• Find practical solutions to real problems.
• Have fun…
History
• USPS/NIST.
• Unipen (with Lambert Schomaker): 40 institutions share 5 million handwritten characters.
• KDD cup, TREC, CASP, CAMDA, ICDAR, etc.
• NIPS challenge on unlabeled data.
• Feature selection challenge (with Steve Gunn): a success! ~75 entrants, thousands of entries.
• Pascal challenges.
• Performance prediction challenge…

(Slide timeline spans 1980 to 2005.)
Challenge
• Date started: Friday September 30, 2005.
• Date ended: Monday, March 1, 2006.
• Duration: 21 weeks.
• Estimated number of entrants: 145.
• Number of development entries: 4228.
• Number of ranked participants: 28.
• Number of ranked submissions: 117.
Datasets
| Dataset | Domain | Type | Features | Training examples | Validation examples | Test examples |
|---|---|---|---|---|---|---|
| ADA | Marketing | Dense | 48 | 4147 | 415 | 41471 |
| GINA | Digits | Dense | 970 | 3153 | 315 | 31532 |
| HIVA | Drug discovery | Dense | 1617 | 3845 | 384 | 38449 |
| NOVA | Text classif. | Sparse binary | 16969 | 1754 | 175 | 17537 |
| SYLVA | Ecology | Dense | 216 | 13086 | 1308 | 130858 |
http://www.modelselect.inf.ethz.ch/
BER distribution
Figure: histograms of the test BER over all entries, one panel per dataset (ADA, GINA, HIVA, NOVA, SYLVA); x-axis: test BER from 0 to 0.5; y-axis: number of entries.
Results
Overall winners for ranked entries:
• Ave. rank: Roman Lutz with LB tree mix cut adapted
• Ave. score: Gavin Cawley with Final #2

Dataset winners:
• ADA: Marc Boullé with SNB(CMA)+10k F(2D) tv or SNB(CMA)+100k F(2D) tv
• GINA: Kari Torkkola & Eugene Tuv with ACE+RLSC
• HIVA: Gavin Cawley with Final #3 (corrected)
• NOVA: Gavin Cawley with Final #1
• SYLVA: Marc Boullé with SNB(CMA)+10k F(3D) tv

Best AUC: Radford Neal with Bayesian neural networks.
Part II
PROTOCOL and
SCORING
Protocol
• Data split: training/validation/test.
• Data proportions: 10/1/100.
• Online feedback on validation data.
• Validation labels released one month before the end of the challenge.
• Final ranking on test data using the five last complete submissions of each entrant.
Performance metrics
• Balanced Error Rate (BER): average of error rates of positive class and negative class.
• Guess error: ΔBER = |testBER − guessedBER|.
• Area Under the ROC Curve (AUC).
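The two quantities above can be sketched in a few lines of Python (a minimal illustration, not challenge code; labels are assumed to be in {−1, +1}):

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """BER: average of the error rates on the positive and negative class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    err_pos = np.mean(y_pred[y_true == 1] != 1)    # error rate on positives
    err_neg = np.mean(y_pred[y_true == -1] != -1)  # error rate on negatives
    return 0.5 * (err_pos + err_neg)

def guess_error(test_ber, guessed_ber):
    """Guess error: dBER = |testBER - guessedBER|."""
    return abs(test_ber - guessed_ber)

# toy example: 3 of 4 positives and 2 of 4 negatives classified correctly
y_true = [1, 1, 1, 1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, -1, -1, 1, 1]
ber = balanced_error_rate(y_true, y_pred)  # 0.5*(0.25 + 0.5) = 0.375
delta = guess_error(ber, 0.30)             # ≈ 0.075
```

Note that with unbalanced classes (as in HIVA or SYLVA) the BER differs markedly from the plain error rate, which is why the challenge used it.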
Optimistic guesses
Figure: guessed BER vs. test BER for all ranked entries (ADA, GINA, HIVA, NOVA, SYLVA); entrants tended to guess a BER lower than their test BER, i.e., guesses were optimistic.
Scoring method
E = testBER + ΔBER × (1 − exp(−ΔBER/σ)), where ΔBER = |testBER − guessedBER|

Figure: challenge score E as a function of the guessed BER, for a fixed test BER; the score is minimal (E = testBER) when the guess is exact.
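In Python, the scoring formula reads as follows (a sketch; σ and the example values are illustrative, not the challenge settings):

```python
import math

def challenge_score(test_ber, guessed_ber, sigma):
    """E = testBER + dBER * (1 - exp(-dBER / sigma)), dBER = |testBER - guessedBER|.

    For dBER << sigma the penalty vanishes (E ~ testBER); for dBER >> sigma
    the full guess error is added (E ~ testBER + dBER).
    """
    d = abs(test_ber - guessed_ber)
    return test_ber + d * (1.0 - math.exp(-d / sigma))

exact = challenge_score(0.20, 0.20, 0.01)   # exact guess: score = testBER = 0.20
way_off = challenge_score(0.20, 0.10, 0.01) # bad guess: score ≈ testBER + dBER ≈ 0.30
```

The exponential term makes the penalty forgiving for guess errors small compared to σ, while charging the full ΔBER for large ones.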
ΔBER/σ

Figure: ΔBER/σ vs. test BER (log scale, 10⁻⁴ to 10⁴) for ADA, GINA, HIVA, NOVA, SYLVA. When ΔBER ≫ σ, the score reduces to E ≈ testBER + ΔBER.
Score
E = testBER + ΔBER × (1 − exp(−ΔBER/σ))

Figure: challenge score vs. log(γ) on GINA for top entrants (Roman Lutz, Gavin Cawley, Radford Neal, Corinne Dahinden, Wei Chu, Nicolai Meinshausen); as γ varies, the score interpolates between testBER and testBER + ΔBER.
Score (continued)
Figure: challenge score vs. log(γ) for each dataset (ADA, GINA, SYLVA on top; HIVA, NOVA below).
Part III
RESULT ANALYSIS
What did we expect?
• Learn about new competitive machine learning techniques.
• Identify competitive methods of performance prediction, model selection, and ensemble learning (theory put into practice).
• Drive research in the direction of refining such methods (ongoing benchmark).
Method comparison
Figure: ΔBER vs. test BER by method family (TREE; NN/BNN; NB; LD/SVM/KLS/GP; X = other), with one cluster per dataset (SYLVA, GINA, NOVA, ADA, HIVA); ΔBER on a log scale from 10⁻⁴ to 10⁰.
Danger of overfitting
Figure: BER vs. time (days) over the course of the challenge, one curve pair per dataset (ADA, GINA, HIVA, NOVA, SYLVA); full lines show test BER, dashed lines validation BER, illustrating the danger of overfitting the validation set.
How to estimate the BER?
• Statistical tests (Stats): compute the BER on training data and compare it with a “null hypothesis”, e.g., the results obtained with a random permutation of the labels.
• Cross-validation (CV): split the training data many times into training and validation sets; average the results on the validation data.
• Guaranteed risk minimization (GRM): Use of theoretical performance bounds.
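The CV estimator above can be sketched in Python (a minimal sketch; the function names and the nearest-centroid toy classifier are our illustration, not challenge code):

```python
import numpy as np

def cv_ber(fit, predict, X, y, n_splits=10, valid_frac=0.1, seed=0):
    """Estimate the BER by repeated random 90%/10% train/validation splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_valid = max(1, int(valid_frac * n))
    bers = []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        valid, train = idx[:n_valid], idx[n_valid:]
        pred = predict(fit(X[train], y[train]), X[valid])
        yv = y[valid]
        err_pos = np.mean(pred[yv == 1] != 1)    # error rate on positives
        err_neg = np.mean(pred[yv == -1] != -1)  # error rate on negatives
        bers.append(0.5 * (err_pos + err_neg))
    return float(np.mean(bers))

# toy classifier: assign each point to the nearest class centroid
fit = lambda X, y: (X[y == 1].mean(axis=0), X[y == -1].mean(axis=0))
predict = lambda m, X: np.where(
    np.linalg.norm(X - m[0], axis=1) < np.linalg.norm(X - m[1], axis=1), 1, -1)
```

On well-separated Gaussian blobs, `cv_ber(fit, predict, X, y)` returns a BER close to zero; the average over many splits is what entrants would report as their guessed BER.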
Stats / CV / GRM ???
Top ranking methods
• Performance prediction:
– CV with many 90% train / 10% validation splits
– Nested CV loops
• Model selection:
– Use of a single model family
– Regularized risk / Bayesian priors
– Ensemble methods
– Nested CV loops, made computationally efficient with virtual leave-one-out (VLOO)
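The nested-CV idea can be sketched as follows (a toy Python example selecting k for a brute-force k-NN classifier; the classifier and grid are our illustration, not what the entrants used). The inner loop picks the hyperparameter on the outer-training data only; the outer loop then gives an unbiased estimate of the whole selection procedure:

```python
import numpy as np

def knn_predict(X_tr, y_tr, X_te, k):
    # brute-force k-nearest-neighbour vote, labels in {-1, +1}
    d = np.linalg.norm(X_te[:, None, :] - X_tr[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]
    return np.sign(y_tr[nn].sum(axis=1) + 1e-9)  # tie broken towards +1

def ber(y_true, y_pred):
    err_pos = np.mean(y_pred[y_true == 1] != 1)
    err_neg = np.mean(y_pred[y_true == -1] != -1)
    return 0.5 * (err_pos + err_neg)

def nested_cv_ber(X, y, ks=(1, 3, 5), outer=5, inner=4, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), outer)
    scores = []
    for i in range(outer):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(outer) if j != i])
        # inner loop: choose k by CV on the outer-training data only
        ifolds = np.array_split(rng.permutation(tr), inner)
        def inner_ber(k):
            errs = []
            for m in range(inner):
                iv = ifolds[m]
                it = np.concatenate([ifolds[j] for j in range(inner) if j != m])
                errs.append(ber(y[iv], knn_predict(X[it], y[it], X[iv], k)))
            return np.mean(errs)
        best_k = min(ks, key=inner_ber)
        # outer loop: score the selected model on held-out data
        scores.append(ber(y[te], knn_predict(X[tr], y[tr], X[te], best_k)))
    return float(np.mean(scores))
```

VLOO methods achieve a similar effect for kernel machines without re-training, by computing leave-one-out predictions in closed form from a single fit.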
Other methods
• Use of training data only:– Training BER.– Statistical tests.
• Bayesian evidence.
• Performance bounds.
• Bilevel optimization.
Part IV
CONCLUSIONS AND FURTHER WORK
Open problems
Bridge the gap between theory and practice…
• What are the best estimators of the variance of CV?
• What should k be in k-fold CV?
• Are other cross-validation methods better than k-fold (e.g., bootstrap, 5x2CV)?
• Are there better “hybrid” methods?
• What search strategies are best?
• More than 2 levels of inference?
Future work
• Game of model selection.
• JMLR special topic on model selection.
• IJCNN 2007 challenge!
Benchmarking model selection?
• Performance prediction: Participants just need to provide a guess of their test performance. If they can solve that problem, they can perform model selection efficiently. Easy and motivating.
• Selection of a model from a finite toolbox: In principle a more controlled benchmark, but less attractive to participants.
CLOP
• CLOP=Challenge Learning Object Package.
• Based on the Spider developed at the Max Planck Institute.
• Two basic abstractions:– Data object– Model object
http://clopinet.com/isabelle/Projects/modelselect/MFAQ.html
CLOP tutorial
At the Matlab prompt:

```matlab
D = data(X, Y);                                % wrap the data in a data object
hyper = {'degree=3', 'shrinkage=0.1'};         % hyperparameters
model = kridge(hyper);                         % kernel ridge regression model
[resu, model] = train(model, D);               % train on D
tresu = test(model, testD);                    % test on held-out data
model = chain({standardize, kridge(hyper)});   % preprocessing + model chain
```
Conclusions
• Twice the volume of participation of the feature selection challenge.
• Top methods as before (in a different order):
– Ensembles of trees
– Kernel methods (RLSC/LS-SVM, SVM)
– Bayesian neural networks
– Naïve Bayes
• Danger of overfitting.
• Triumph of cross-validation?