Lab 1: Getting started with Basic Learning Machines and the Overfitting Problem
3
Matlab: POLY_GUI
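• The code implements the ridge regression algorithm: $w = \arg\min_w \sum_i (1 - y_i f(x_i))^2 + \gamma \|w\|^2$
$f(x) = w_1 x + w_2 x^2 + \dots + w_n x^n = w \mathbf{x}^T$
$\mathbf{x} = [x, x^2, \dots, x^n]$
$w^T = X^+ Y$
$X^+ = X^T (X X^T + \gamma I)^{-1} = (X^T X + \gamma I)^{-1} X^T$
$X = [\mathbf{x}^{(1)}; \mathbf{x}^{(2)}; \dots; \mathbf{x}^{(p)}]$ (matrix (p, n))
• The leave-one-out error (LOO) is obtained with the PRESS statistic (Predicted REsidual Sums of Squares):
• LOO error $= (1/p) \sum_k [\, r_k / (1 - (X X^+)_{kk}) \,]^2$, where $r_k$ is the training residual of example k.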
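The PRESS shortcut can be checked numerically. The following is a minimal Python sketch, not the poly_gui code: it assumes a ridge model without intercept, uses the squared residual (y - f(x))^2 (which equals (1 - y f(x))^2 when y = ±1), and calls the ridge coefficient gamma. All function names and the toy data are hypothetical.

```python
def poly_features(x, n):
    """Map a scalar x to the row [x, x^2, ..., x^n]."""
    return [x ** d for d in range(1, n + 1)]


def solve(A, B):
    """Solve A Z = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        scale = M[col][col]
        M[col] = [v / scale for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]


def ridge_pinv(X, gamma):
    """Return X^+ = (X^T X + gamma I)^-1 X^T as an n-by-p list of lists."""
    p, n = len(X), len(X[0])
    A = [[sum(X[k][i] * X[k][j] for k in range(p)) + (gamma if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    Xt = [[X[k][i] for k in range(p)] for i in range(n)]
    return solve(A, Xt)


def ridge_fit(X, y, gamma):
    """Weights w = X^+ y."""
    Xp = ridge_pinv(X, gamma)
    return [sum(Xp[i][k] * y[k] for k in range(len(y))) for i in range(len(Xp))]


def predict(w, row):
    return sum(wi * xi for wi, xi in zip(w, row))


def press_loo(X, y, gamma):
    """LOO error from a single fit: (1/p) sum_k (r_k / (1 - (X X^+)_kk))^2."""
    p = len(X)
    Xp = ridge_pinv(X, gamma)
    w = [sum(Xp[i][k] * y[k] for k in range(p)) for i in range(len(Xp))]
    total = 0.0
    for k in range(p):
        h_kk = sum(X[k][i] * Xp[i][k] for i in range(len(Xp)))
        r_k = y[k] - predict(w, X[k])
        total += (r_k / (1.0 - h_kk)) ** 2
    return total / p


def brute_force_loo(X, y, gamma):
    """LOO error by retraining p times, each time leaving one example out."""
    total = 0.0
    for k in range(len(X)):
        w = ridge_fit(X[:k] + X[k + 1:], y[:k] + y[k + 1:], gamma)
        total += (y[k] - predict(w, X[k])) ** 2
    return total / len(X)


# Toy data (made up for illustration): 6 samples, polynomial of degree 3.
xs = [-1.0, -0.6, -0.2, 0.1, 0.5, 0.9]
ys = [0.8, 0.1, -0.3, 0.2, 0.4, 1.1]
X = [poly_features(x, 3) for x in xs]
print(press_loo(X, ys, 0.1), brute_force_loo(X, ys, 0.1))
```

The two printed values should agree up to rounding, which is why the LOO error costs no more than one training run for this model.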
5
Matlab: POLY_GUI
• At the prompt type: poly_gui;
• Vary the parameters. Refrain from hitting “CV”. Explain what happens in the following situations:
– Sample num. << Target degree (small noise)
– Large noise, small sample num
– Target degree << Model degree
• Why is the LOO error sometimes larger than the training and test error?
• Are there local minima in the LOO error? Is the LOO error flat near the optimum?
• Propose ways of getting a better solution.
6
CLOP Data Objects
The poly_gui emulates CLOP objects of type “data”:
X = rand(10,5)
Y = rand(10,1)
D = data(X,Y) % constructor
methods(D)
get_x(D)
get_y(D)
plot(D);
7
CLOP Model Objects
poly_ridge is a “model” object.
P = poly_ridge; h = plot(P);
D = gene(P); plot(D, h);
[resu, P] = train(P, D); mse(resu)
Dt = gene(P);
[tresu, P] = test(P, Dt); mse(tresu)
plot(P, h);
9
Support Vector Classifier
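[Figure: two-class data in the plane x = [x1, x2]; the decision boundary f(x) = 0 separates the regions f(x) > 0 and f(x) < 0.]
$f(x) = \sum_{i \in SV} \alpha_i y_i k(x, x_i)$
Boser-Guyon-Vapnik-1992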
10
Matlab: SVC_GUI
• At the prompt type: svc_gui;
• The code implements the Support Vector Machine algorithm with kernel $k(s, t) = (1 + s \cdot t)^q \exp(-\gamma \|s - t\|^2)$
• Regularization similar to ridge regression:
Hinge loss: $L(x_i) = \max(0, 1 - y_i f(x_i))$
Empirical risk: $\sum_i L(x_i)$
$w = \arg\min_w (1/C) \|w\|^2 + \sum_i L(x_i)$
Shrinkage: the $(1/C) \|w\|^2$ term.
12
Loss Functions
[Figure: loss L(y, f(x)) (vertical axis, 0 to 4) versus functional margin z = y f(x) (horizontal axis, -1 to 2). Annotations: decision boundary, margin, well classified, misclassified.]
Curves shown:
0/1 loss
square loss: $(1 - z)^2$
SVC loss, $\beta = 1$: $\max(0, 1 - z)$
logistic loss: $\log(1 + e^{-z})$
Adaboost loss: $e^{-z}$
Perceptron loss: $\max(0, -z)$
SVC loss, $\beta = 2$: $\max(0, 1 - z)^2$
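To compare the curves numerically, the losses listed on this slide can be written as functions of the margin z = y f(x). This is a small Python sketch with hypothetical function names; the convention that the 0/1 loss is zero exactly at z = 0 is an assumption, since the slide does not say how ties are counted.

```python
import math


def zero_one_loss(z):
    # Assumption: z == 0 counts as correct.
    return 1.0 if z < 0 else 0.0


def square_loss(z):
    return (1.0 - z) ** 2


def svc_loss(z, beta=1):
    # beta=1 is the hinge loss, beta=2 the squared hinge loss.
    return max(0.0, 1.0 - z) ** beta


def logistic_loss(z):
    return math.log(1.0 + math.exp(-z))


def adaboost_loss(z):
    return math.exp(-z)


def perceptron_loss(z):
    return max(0.0, -z)


# Tabulate the losses at a few margins across the range of the plot.
for z in (-1.0, 0.0, 1.0, 2.0):
    print(z, zero_one_loss(z), square_loss(z), svc_loss(z), svc_loss(z, 2),
          round(logistic_loss(z), 4), round(adaboost_loss(z), 4),
          perceptron_loss(z))
```

Note that the square loss is the only one in the list that penalizes well-classified examples with z > 1.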
13
Exercise: Gradient Descent
• Linear discriminant $f(x) = \sum_j w_j x_j$
• Functional margin $z = y f(x)$, $y = \pm 1$
• Compute $\partial z / \partial w_j$
• Derive the learning rules $\Delta w_j = -\eta \, \partial L / \partial w_j$ corresponding to the following loss functions:
square loss $(1 - z)^2$
SVC loss $\max(0, 1 - z)$
logistic loss $\log(1 + e^{-z})$
Adaboost loss $e^{-z}$
Perceptron loss $\max(0, -z)$
14
Exercise: Dual Algorithms
• From the $\Delta w_j$ derive the $\Delta w$
• $w = \sum_i \alpha_i x_i$
• From the $\Delta w$, derive the $\Delta \alpha_i$ of the dual algorithms.
18
What is CLOP?
• CLOP = Challenge Learning Object Package.
• Based on the Spider developed at the Max Planck Institute.
• Two basic abstractions:
– Data object
– Model object
• Put the CLOP directory in your path.
• At the prompt type: use_spider_clop;
• If you have used poly_gui before, type: clear classes
19
CLOP Data Objects
At the Matlab prompt:
addpath(<clop_dir>);
use_spider_clop;
X=rand(10,8);
Y=[1 1 1 1 1 -1 -1 -1 -1 -1]';
D=data(X,Y); % constructor
[p,n]=get_dim(D)
get_x(D)
get_y(D)
20
CLOP Model Objects
D is a data object previously defined.
model = kridge; % constructor
[resu, model] = train(model, D);
resu, model.W, model.b0
Yhat = D.X*model.W' + model.b0
testD = data(rand(3,8), [-1 -1 1]');
tresu = test(model, testD);
balanced_errate(tresu.X, tresu.Y)
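The balanced error rate (BER) is the average of the error rates of the two classes, so a classifier that always predicts the majority class scores 0.5 however unbalanced the data are. The Python sketch below illustrates the definition only. It is not CLOP's balanced_errate, the function name is hypothetical, and it assumes that both classes are present and that an output of exactly 0 is read as class -1.

```python
def balanced_error_rate(outputs, targets):
    """Average of the per-class error rates for targets in {+1, -1}."""
    errors = {1: 0, -1: 0}
    counts = {1: 0, -1: 0}
    for out, y in zip(outputs, targets):
        counts[y] += 1
        predicted = 1 if out > 0 else -1  # assumption: 0 maps to class -1
        if predicted != y:
            errors[y] += 1
    return 0.5 * (errors[1] / counts[1] + errors[-1] / counts[-1])


# One error in each class of two examples: both class error rates are 0.5.
print(balanced_error_rate([0.5, -0.2, 0.3, -0.7], [1, 1, -1, -1]))
# Always predicting +1 on unbalanced data still gives 0.5.
print(balanced_error_rate([1, 1, 1, 1], [1, 1, 1, -1]))
```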
21
Hyperparameters and Chains
A model often has hyperparameters:
default(kridge)
hyper = {'degree=3', 'shrinkage=0.1'};
model = kridge(hyper);
Models can be chained:
model = chain({standardize,kridge(hyper)});
[resu, model] = train(model, D);
tresu = test(model, testD);
balanced_errate(tresu.X, tresu.Y)
22
Hyper-parameters
• Kernel methods: kridge and svc:
$k(x, y) = (\mathrm{coef0} + x \cdot y)^{\mathrm{degree}} \exp(-\mathrm{gamma} \, \|x - y\|^2)$
$k_{ij} = k(x_i, x_j)$
$k_{ii} \leftarrow k_{ii} + \mathrm{shrinkage}$
• Naïve Bayes: naive: none
• Neural network: neural: units, shrinkage, maxiter
• Random Forest: rf (windows only): mtry
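The kernel formula for kridge and svc combines a polynomial factor and a Gaussian factor, and the shrinkage is added to the diagonal of the kernel matrix. Here is a minimal Python sketch of that formula. The function names and the default argument values are assumptions made for illustration, not CLOP's defaults.

```python
import math


def clop_style_kernel(x, y, coef0=0.0, degree=1, gamma=0.0):
    """k(x, y) = (coef0 + x.y)^degree * exp(-gamma * ||x - y||^2)."""
    dot = sum(a * b for a, b in zip(x, y))
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return (coef0 + dot) ** degree * math.exp(-gamma * sq_dist)


def kernel_matrix(X, shrinkage=0.0, **kernel_params):
    """Matrix k_ij = k(x_i, x_j), with shrinkage added to the diagonal."""
    K = [[clop_style_kernel(xi, xj, **kernel_params) for xj in X] for xi in X]
    for i in range(len(X)):
        K[i][i] += shrinkage
    return K


# degree=2, gamma=0: a pure polynomial kernel, (1 + 11)^2.
print(clop_style_kernel([1, 2], [3, 4], coef0=1, degree=2, gamma=0))
# degree=0, gamma=1: a pure Gaussian kernel, exp(-1).
print(clop_style_kernel([0, 0], [1, 0], coef0=1, degree=0, gamma=1))
print(kernel_matrix([[1, 0], [0, 1]], shrinkage=0.1, coef0=1, degree=1))
```

Setting degree=0 switches the polynomial factor off and setting gamma=0 switches the Gaussian factor off, which is how the MADELON and GISETTE examples later in the deck select one kernel family or the other.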
23
Exercise
• Here are some of the pattern recognition CLOP objects: @rf @naive @svc @neural @gentleboost @lssvm @gkridge @kridge @klogistic @logitboost
• Try at the prompt example(neural)
• Try other pattern recognition objects
• Try different sets of hyperparameters, e.g., example(svc({'gamma=1', 'shrinkage=0.001'}))
• Remember: use default(method) to get the HP.
24
Lab 2
Example: Digit Recognition
Subset of the MNIST data of LeCun and Cortes used for the NIPS2003 challenge
25
data(X, Y)
% Go to the Gisette directory:
cd('GISETTE')
% Load “validation” data:
Xt=load('gisette_valid.data');
Yt=load('gisette_valid.labels');
% Create a data object and examine it:
Dt=data(Xt, Yt);
browse(Dt, 2);
% Load “training” data (longer):
X=load('gisette_train.data');
Y=load('gisette_train.labels');
[p, n]=get_dim(Dt);
D=train(subsample(['p_max=' num2str(p)]), data(X, Y));
clear X Y Xt Yt
% Save for later use:
save('gisette', 'D', 'Dt');
26
model(hyperparam)
% Define some hyperparameters:
hyper = {'degree=3', 'shrinkage=0.1'};
% Create a kernel ridge regression model:
model = kridge(hyper);
% Train it and test it:
[resu, Model] = train(model, D);
tresu = test(Model, Dt);
% Visualize the results:
roc(tresu);
% Indices of the misclassified test examples (they index Dt, not D):
idx=find(tresu.X.*tresu.Y<0);
browse(get(Dt, idx), 2);
27
Exercise
• Here are some pattern recognition CLOP objects: @rf @naive @gentleboost @svc @neural @logitboost @kridge @lssvm @klogistic
• Instantiate a model with some hyperparameters (use default(method) to get the HP)
• Vary the HP and the number of training examples (Hint: use get(D, 1:n) to restrict the data to n examples).
28
chain({model1, model2,…})
% Combine preprocessing and kernel ridge regression:
my_prepro=normalize;
model = chain({my_prepro,kridge(hyper)});
ensemble({model1, model2,…})
% Combine replicas of a base learner:
for k=1:10
base_model{k}=neural;
end
model=ensemble(base_model);
29
Exercise
• Here are some preprocessing CLOP objects: @normalize @standardize @fourier
• Chain a preprocessing and a model, e.g.,
model=chain({fourier, kridge('degree=3')});
my_classif=svc({'coef0=1', 'degree=4', 'gamma=0', 'shrinkage=0.1'});
model=chain({normalize, my_classif});
• Train, test, visualize the results. Hint: you can browse the preprocessed data:
browse(train(standardize, D), 2);
30
Summary
% After creating your complex model, just one command: train
model=ensemble({chain({standardize,kridge(hyper)}),chain({normalize,naive})});
[resu, Model] = train(model, D);
% After training your complex model, just one command: test
tresu = test(Model, Dt);
% You can use a “cv” object to perform cross-validation:
cv_model=cv(model);
[resu, Model] = train(cv_model, D);
roc(resu);
32
POLY_GUI again…
clear classes
poly_gui;
•Check the “Multiplicative updates” (MU) box.
•Play with the parameters.
•Try CV
•Compare with no MU
34
Re-load the GISETTE data
% Start CLOP:
clear classes
use_spider_clop;
% Go to the Gisette directory:
cd('GISETTE')
load('gisette');
35
Visualization
1) Create a heatmap of the data matrix or a subset:
show(D);
show(get(D,1:10, 1:2:500));
2) Look at individual patterns:
browse(D);
browse(D, 2); % For 2d data
% Display feature positions:
browse(D, 2, [212, 463, 429, 239]);
3) Make a scatter plot of a few features:
scatter(D, [212, 463, 429, 239]);
36
Example
my_classif=svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=1'});
model=chain({normalize, s2n('f_max=100'), my_classif});
[resu, Model] = train(model, D);
tresu = test(Model, Dt);
roc(tresu);
% Show the misclassified first
[s,idx]=sort(tresu.X.*tresu.Y);
browse(get(Dt, idx), 2, Model{2});
37
Some Filters in CLOP
Univariate:
• @s2n (Signal to noise ratio.)
• @Ttest (T statistic; similar to s2n.)
• @Pearson (Uses Matlab corrcoef. Gives the same results as Ttest when classes are balanced.)
• @aucfs (ranksum test)
Multivariate:
• @relief (no elimination of redundancy)
• @gs (Gram-Schmidt orthogonalization; complementary features)
38
Exercise
• Change the feature selection algorithm
• Visualize the features
• What can you say of the various methods?
• Which one gives the best results for 2, 10, 100 features?
• Can you improve by changing the preprocessing? (Hint: try @pc_extract)
40
T-test
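• Normally distributed classes, equal variance $\sigma^2$ unknown; estimated from data as $\sigma^2_{within}$.
• Null hypothesis $H_0$: $\mu_+ = \mu_-$
• T statistic: If $H_0$ is true,
$t = (\mu_+ - \mu_-) / (\sigma_{within} \sqrt{1/m_+ + 1/m_-}) \sim \mathrm{Student}(m_+ + m_- - 2 \ \mathrm{d.f.})$
[Figure: class-conditional densities P(Xi|Y=-1) and P(Xi|Y=1) of feature xi, with the class means and spreads marked.]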
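As an illustration of the statistic on this slide, here is a minimal Python sketch for one feature. It is not the CLOP Ttest object; the function name and the toy values are hypothetical, and the within-class variance is assumed to be the usual pooled estimate with m+ + m- - 2 degrees of freedom.

```python
import math


def t_statistic(pos, neg):
    """Two-sample T statistic with pooled within-class variance.

    pos and neg hold the values of one feature in the two classes.
    Returns (t, degrees_of_freedom).
    """
    m_pos, m_neg = len(pos), len(neg)
    mu_pos = sum(pos) / m_pos
    mu_neg = sum(neg) / m_neg
    # Pooled sum of squared deviations from each class mean.
    ss = sum((v - mu_pos) ** 2 for v in pos) + sum((v - mu_neg) ** 2 for v in neg)
    dof = m_pos + m_neg - 2
    sigma_within = math.sqrt(ss / dof)
    t = (mu_pos - mu_neg) / (sigma_within * math.sqrt(1.0 / m_pos + 1.0 / m_neg))
    return t, dof


t, dof = t_statistic([2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
print(t, dof)
```

Ranking features by the absolute value of t gives a univariate filter; the p-value then follows from the Student distribution with the returned degrees of freedom.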
41
Evaluation of pval and FDR
• Ttest object:
– computes pval analytically
– FDR~pval*nsc/n
• probe object:
– takes any feature ranking object as an argument (e.g. s2n, relief, Ttest)
– pval~nsp/np
– FDR~pval*nsc/n
42
Analytic vs. probe
[Figure: FDR (vertical axis, 0 to 1) versus feature rank (horizontal axis, 0 to 5000).]
43
Example
[resu, FS] = train(Ttest, D);
[resu, PFS] = train(probe(Ttest), D);
figure('Name', 'pvalue');
plot(get_pval(FS, 1), 'r');
hold on; plot(get_pval(PFS, 1));
figure('Name', 'FDR');
plot(get_fdr(FS, 1), 'r');
hold on; plot(get_fdr(PFS, 1));
44
Exercise
• What could explain differences between the pvalue and fdr with the analytic and probe method?
• Replace Ttest with chain({rmconst('w_min=0'), Ttest})
• Recompute the pvalue and fdr curves. What do you notice?
• Choose an optimum number fnum of features based on pvalue or FDR. Visualize with browse(D, 2,FS, fnum);
• Create a model with fnum. Is fnum optimal? Do you get something better with CV?
46
Exercise
Consider the 1 nearest neighbor algorithm. We define the following score:
Where s(k) (resp. d(k)) is the index of the nearest neighbor of xk belonging to the same class (resp. different class) as xk.
47
Exercise
1. Motivate the choice of such a cost function to approximate the generalization error (qualitative answer)
2. How would you derive an embedded method to perform feature selection for 1 nearest neighbor using this functional?
3. Motivate your choice (what makes your method an ‘embedded method’ and not a ‘wrapper’ method)
48
Relief
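[Figure: an example with its nearest hit (nearest neighbor of the same class) at distance Dhit and its nearest miss (nearest neighbor of the other class) at distance Dmiss.]
Relief = <Dmiss/Dhit>
Local_Relief = Dmiss/Dhit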
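The Relief criterion can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the CLOP relief object: the nearest hit and nearest miss are assumed to be found with the Euclidean distance over all features, Dhit and Dmiss are then measured along one feature at a time, and a small eps (hypothetical) guards against division by zero. Every example needs at least one neighbor in each class.

```python
import math


def relief_scores(X, y, eps=1e-12):
    """Per-feature average of Dmiss/Dhit over all examples."""
    p, n = len(X), len(X[0])
    scores = [0.0] * n
    for k in range(p):
        hit = miss = None
        d_hit = d_miss = math.inf
        for j in range(p):
            if j == k:
                continue
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(X[k], X[j])))
            if y[j] == y[k] and d < d_hit:
                hit, d_hit = j, d
            elif y[j] != y[k] and d < d_miss:
                miss, d_miss = j, d
        for f in range(n):
            dist_miss = abs(X[k][f] - X[miss][f])
            dist_hit = abs(X[k][f] - X[hit][f])
            scores[f] += dist_miss / (dist_hit + eps)
    return [s / p for s in scores]


# Toy data (made up): feature 0 separates the classes, feature 1 does not.
X = [[0.0, 0.0], [0.2, 0.1], [1.0, 0.0], [1.2, 0.1]]
y = [1, 1, -1, -1]
print(relief_scores(X, y))
```

On this toy set the informative feature gets a much larger score than the uninformative one. The per-example ratios computed inside the loop, before averaging, correspond to the Local_Relief quantity.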
49
Exercise
[resu, FS] = train(relief, D);
browse(D, 2,FS, 20);
[resu, LFS] = train(local_relief,D);
browse(D, 2,LFS, 20);
•Propose a modification to the nearest neighbor algorithm that uses features relevant to individual patterns (like those provided by “local_relief”).
•Do you anticipate such an algorithm to perform better than the non-local version using “relief”?
51
Some CLOP objects
Basic learning machines
Feature selection, pre- and post- processing
Compound models
52
http://clopinet.com/challenges/
• Challenges in
– Feature selection
– Performance prediction
– Model selection
– Causality
• Large datasets
53
MADELON Best BER=6.22±0.57% - n0=20 (4%) – BER0=7.33%
my_classif=svc({'coef0=1', 'degree=0', 'gamma=1', 'shrinkage=1'});
my_model=chain({probe(relief,{'p_num=2000', 'pval_max=0'}), standardize, my_classif})
DOROTHEA Best BER=8.54±0.99% - n0=1000 (1%) – BER0=12.37%
my_model=chain({TP('f_max=1000'), naive, bias});
Competitive baseline methods set new standards for the NIPS 2003 feature selection benchmark, Isabelle Guyon, Jiwen Li, Theodor Mader, Patrick A. Pletscher, Georg Schneider and Markus Uhr, Pattern Recognition Letters, Volume 28, Issue 12, 1 September 2007, Pages 1438-1444.
Dataset    Size      Type            Features  Training examples  Validation examples  Test examples
Arcene     8.7 MB    Dense           10000     100                100                  700
Gisette    22.5 MB   Dense           5000      6000               1000                 6500
Dexter     0.9 MB    Sparse integer  20000     300                300                  2000
Dorothea   4.7 MB    Sparse binary   100000    800                350                  800
Madelon    2.9 MB    Dense           500       2000               600                  1800
Class taught at ETH, Zurich, winter 2005
Task of the students:
• Baseline method provided, BER0 performance and n0 features.
• Get BER<BER0 or BER=BER0 but n<n0.
• Extra credit for beating best challenge entry.
[Figure: example data from the five datasets: GISETTE, DOROTHEA, DEXTER, MADELON, ARCENE. One panel shows a text excerpt beginning “NEW YORK, October 2, 2001 – Instinet Group Incorporated (Nasdaq: INET), the world’s largest electronic agency securities broker, today announced tha”.]
DEXTER Best BER=3.30±0.40% - n0=300 (1.5%) – BER0=5%
my_classif=svc({'coef0=1', 'degree=1', 'gamma=0', 'shrinkage=0.5'});
my_model=chain({s2n('f_max=300'), normalize, my_classif})
GISETTE Best BER=1.26±0.14% - n0=1000 (20%) – BER0=1.80%
my_classif=svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=1'});
my_model=chain({normalize, s2n('f_max=1000'), my_classif});
ARCENE Best BER=11.9±1.2% - n0=1100 (11%) – BER0=14.7%
my_svc=svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=0.1'});
my_model=chain({standardize, s2n('f_max=1100'), normalize, my_svc})
NIPS 2003 Feature Selection Challenge
54
NIPS 2006 Model Selection Game
First place: Juha Reunanen, cross-indexing-7
Dataset  CLOP models selected
ADA      2*{sns,std,norm,gentleboost(neural),bias}; 2*{std,norm,gentleboost(kridge),bias}; 1*{rf,bias}
GINA     6*{std,gs,svc(degree=1)}; 3*{std,svc(degree=2)}
HIVA     3*{norm,svc(degree=1),bias}
NOVA     5*{norm,gentleboost(kridge),bias}
SYLVA    4*{std,norm,gentleboost(neural),bias}; 4*{std,neural}; 1*{rf,bias}
sns = shift’n’scale, std = standardize, norm = normalize (some details of hyperparameters not shown)
Second place: Hugo Jair Escalante Balderas, BRun2311062
Dataset  CLOP models selected
ADA      {sns, std, norm, neural(units=5), bias}
GINA     {norm, svc(degree=5, shrinkage=0.01), bias}
HIVA     {std, norm, gentleboost(kridge), bias}
NOVA     {norm,gentleboost(neural), bias}
SYLVA    {std, norm, neural(units=1), bias}
sns = shift’n’scale, std = standardize, norm = normalize (some details of hyperparameters not shown)
Note: entry Boosting_1_001_x900 gave better results, but was older.
[Figure: example data from the five datasets: NOVA, GINA, HIVA, ADA, SYLVA. One panel shows a newsgroup post with the subject line “Re: Goalie masks”.]
Dataset  Domain               Features  Training examples  Validation examples  Test examples
ADA      Marketing            48        4147               415                  41471
GINA     Digit recognition    970       3153               315                  31532
HIVA     Drug discovery       1617      3845               384                  38449
NOVA     Text classification  16969     1754               175                  17537
SYLVA    Ecology              216       13086              1309                 130857
Proc. IJCNN07, Orlando, FL, Aug, 2007:
PSMS for Neural Networks, H. Jair Escalante, Manuel Montes y Gómez, and Luis Enrique Sucar
Model Selection and Assessment Using Cross-indexing, Juha Reunanen