
Transcript of Lab 1: Getting started with Basic Learning Machines and the Overfitting Problem.

1

Lab 1

Getting started with Basic Learning Machines and the Overfitting Problem

2

Lab 1

Polynomial regression

3

Matlab: POLY_GUI

• The code implements the ridge regression algorithm:

$w = \arg\min_w \sum_i (1 - y_i f(x_i))^2 + \lambda \|w\|^2$

$f(x) = w_1 x + w_2 x^2 + \dots + w_n x^n = \mathbf{w}\,\mathbf{x}^T$, with $\mathbf{x} = [x, x^2, \dots, x^n]$

$w^T = X^+ Y$, where $X^+ = X^T (X X^T + \lambda I)^{-1} = (X^T X + \lambda I)^{-1} X^T$

$X = [\mathbf{x}(1); \mathbf{x}(2); \dots; \mathbf{x}(p)]$ (a $p \times n$ matrix)

• The leave-one-out (LOO) error is obtained with the PRESS statistic (Predicted REsidual Sum of Squares):

$\text{LOO error} = \frac{1}{p} \sum_k \left[ \frac{r_k}{1 - (X X^+)_{kk}} \right]^2$

where $r_k$ is the k-th training residual.
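The computation is compact enough to spell out. Below is a minimal plain-Matlab sketch of the formulas above, independent of poly_gui (the data, degree, and lambda value are arbitrary choices):

% Minimal sketch of ridge polynomial regression and PRESS-based LOO error.
p = 30; n = 5; lambda = 1e-3;             % sample size, model degree, shrinkage (assumed)
x = rand(p, 1);
y = sin(4*x) + 0.1*randn(p, 1);           % a toy target function
X = repmat(x, 1, n) .^ repmat(1:n, p, 1); % rows are [x, x^2, ..., x^n]
Xplus = (X'*X + lambda*eye(n)) \ X';      % regularized pseudo-inverse X+
w = Xplus * y;                            % ridge regression weights
r = y - X*w;                              % training residuals r_k
H = X * Xplus;                            % "hat" matrix X X+
loo = mean((r ./ (1 - diag(H))).^2);      % PRESS estimate of the LOO error
fprintf('training MSE = %.4f, LOO (PRESS) = %.4f\n', mean(r.^2), loo);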

4

Matlab: POLY_GUI

5

Matlab: POLY_GUI

• At the prompt type: poly_gui;
• Vary the parameters. Refrain from hitting "CV". Explain what happens in the following situations:
– Sample num. << Target degree (small noise)
– Large noise, small sample num.
– Target degree << Model degree

• Why is the LOO error sometimes larger than the training and test error?

• Are there local minima in the LOO error? Is the LOO error flat near the optimum?

• Propose ways of getting a better solution.

6

CLOP Data Objects

The poly_gui emulates CLOP objects of type "data":

X = rand(10,5)
Y = rand(10,1)
D = data(X,Y) % constructor
methods(D)
get_x(D)
get_y(D)
plot(D);

7

CLOP Model Objects

poly_ridge is a "model" object:

P = poly_ridge; h = plot(P);
D = gene(P); plot(D, h);
[resu, P] = train(P, D); mse(resu)
Dt = gene(P);
[tresu, P] = test(P, Dt); mse(tresu)
plot(P, h);

8

Lab 1

Support Vector Machines

9

Support Vector Classifier

[Figure: two-class data in the plane $x = [x_1, x_2]$; the decision boundary $f(x) = 0$ separates the region $f(x) > 0$ from $f(x) < 0$.]

$f(x) = \sum_{i \in SV} \alpha_i\, y_i\, k(x, x_i)$

(Boser, Guyon, Vapnik, 1992)

10

Matlab: SVC_GUI

• At the prompt type: svc_gui;
• The code implements the Support Vector Machine algorithm with kernel:

$k(s, t) = (1 + s \cdot t)^q\, e^{-\gamma \|s - t\|^2}$

• Regularization similar to ridge regression:

Hinge loss: $L(x_i) = \max(0,\, 1 - y_i f(x_i))$

Empirical risk: $\sum_i L(x_i)$

$w = \arg\min\; (1/C)\,\|w\|^2 + \sum_i L(x_i)$, where the $(1/C)\|w\|^2$ term is the shrinkage penalty.
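For concreteness, a hedged plain-Matlab sketch of evaluating this objective for a linear decision function f(x) = X*w on toy data (all values assumed; this is an illustration of the formula, not the svc_gui internals):

% Evaluate the regularized hinge-loss objective on toy data.
X = randn(20, 2); Y = sign(randn(20, 1));   % toy inputs and +/-1 labels (assumed)
w = randn(2, 1); C = 1;                     % candidate weights, cost hyperparameter
f = X * w;                                  % decision values f(x_i)
L = max(0, 1 - Y .* f);                     % hinge loss per example
J = (1/C) * (w' * w) + sum(L)               % shrinkage term + empirical risk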

11

Lab 1

More loss functions…

12

Loss Functions

[Figure: the loss functions below plotted against the margin $z = y f(x)$; the decision boundary sits at $z = 0$ (misclassified to the left, well classified to the right), with the margin at $z = 1$.]

0/1 loss

square loss: $(1 - z)^2$

SVC loss, p=1 (hinge): $\max(0,\, 1 - z)$

SVC loss, p=2: $\max(0,\, 1 - z)^2$

logistic loss: $\log(1 + e^{-z})$

Adaboost loss: $e^{-z}$

Perceptron loss: $\max(0,\, -z)$
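The plot is easy to re-create; a plain-Matlab sketch:

% Plot each loss as a function of the margin z = y f(x).
z = linspace(-1, 2, 300);
plot(z, double(z <= 0), 'k', z, (1 - z).^2, 'b', ...
     z, max(0, 1 - z), 'r', z, max(0, 1 - z).^2, 'r--', ...
     z, log(1 + exp(-z)), 'g', z, exp(-z), 'm', z, max(0, -z), 'c');
xlabel('z = y f(x)'); ylabel('L(y, f(x))');
legend('0/1', 'square', 'hinge (p=1)', 'squared hinge (p=2)', ...
       'logistic', 'Adaboost', 'Perceptron');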

13

Exercise: Gradient Descent

• Linear discriminant: $f(x) = \sum_j w_j x_j$

• Functional margin: $z = y f(x)$, $y = \pm 1$

• Compute $\partial z / \partial w_j$.

• Derive the learning rules $\Delta w_j \propto -\partial L / \partial w_j$ corresponding to the following loss functions (a worked sketch for one loss follows the list):

square loss: $(1 - z)^2$

SVC loss: $\max(0,\, 1 - z)$

logistic loss: $\log(1 + e^{-z})$

Adaboost loss: $e^{-z}$

Perceptron loss: $\max(0,\, -z)$
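A hedged worked sketch for one of the losses (the square loss); the other rules follow the same pattern. Since $z = y f(x)$ and $\partial z/\partial w_j = y\, x_j$, the square loss gives $\partial L/\partial w_j = -2(1 - z)\, y\, x_j$:

% Stochastic gradient descent with the square loss on toy data (assumed).
X = randn(50, 3); Y = sign(randn(50, 1));   % toy inputs and +/-1 labels
w = zeros(3, 1); eta = 0.01;                % weights and learning rate
for epoch = 1:100
  for i = 1:size(X, 1)
    z = Y(i) * (X(i,:) * w);                % functional margin of example i
    w = w + eta * 2 * (1 - z) * Y(i) * X(i,:)';  % w <- w - eta * dL/dw
  end
end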

14

Exercise: Dual Algorithms

• From the $\Delta w_j$, derive the $\Delta \mathbf{w}$.

• $\mathbf{w} = \sum_i \alpha_i\, \mathbf{x}_i$

• From the $\Delta \mathbf{w}$, derive the $\Delta \alpha_i$ of the dual algorithms.

15

Summary

• Modern ML algorithms optimize a penalized risk functional:

$R[f] = \sum_i L(y_i, f(x_i)) + \lambda\, \|w\|^2$

16

Lab 2

Getting started with CLOP

17

Lab 2

CLOP tutorial

18

What is CLOP?

• CLOP = Challenge Learning Object Package.
• Based on the Spider package developed at the Max Planck Institute.
• Two basic abstractions:
– Data object
– Model object
• Put the CLOP directory in your path.
• At the prompt type: use_spider_clop;
• If you have used poly_gui before, type: clear classes

19

CLOP Data Objects

At the Matlab prompt:

addpath(<clop_dir>);
use_spider_clop;
X=rand(10,8);
Y=[1 1 1 1 1 -1 -1 -1 -1 -1]';
D=data(X,Y); % constructor
[p,n]=get_dim(D)
get_x(D)
get_y(D)

20

CLOP Model Objects

D is the data object previously defined.

model = kridge; % constructor
[resu, model] = train(model, D);
resu, model.W, model.b0
Yhat = D.X*model.W' + model.b0
testD = data(rand(3,8), [-1 -1 1]');
tresu = test(model, testD);
balanced_errate(tresu.X, tresu.Y)

21

Hyperparameters and Chains

A model often has hyperparameters:

default(kridge)
hyper = {'degree=3', 'shrinkage=0.1'};
model = kridge(hyper);

Models can be chained:

model = chain({standardize, kridge(hyper)});
[resu, model] = train(model, D);
tresu = test(model, testD);
balanced_errate(tresu.X, tresu.Y)

22

Hyper-parameters

• Kernel methods (kridge and svc):

$k(x, y) = (\text{coef0} + x \cdot y)^{\text{degree}}\, e^{-\text{gamma}\, \|x - y\|^2}$

$k_{ij} = k(x_i, x_j)$, and $k_{ii} \leftarrow k_{ii} + \text{shrinkage}$

(see the sketch after this list)
• Naïve Bayes (naive): none.
• Neural network (neural): units, shrinkage, maxiter.
• Random Forest (rf, Windows only): mtry.
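A hedged plain-Matlab sketch of this kernel computation (an illustration of the formula above, not the actual CLOP internals; the hyperparameter values are examples):

% Build the kernel matrix for 10 examples with 8 features.
coef0 = 1; degree = 3; gamma = 0.5; shrinkage = 0.1;        % example HP values
X = rand(10, 8);
sq = sum(X.^2, 2);                                          % squared norms
D2 = repmat(sq, 1, 10) + repmat(sq', 10, 1) - 2*(X*X');     % ||xi - xj||^2
K = ((coef0 + X*X').^degree) .* exp(-gamma*D2);             % kij = k(xi, xj)
K = K + shrinkage*eye(10);                                  % kii <- kii + shrinkage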

23

Exercise

• Here are some of the pattern recognition CLOP objects:
@rf @naive @svc @neural @gentleboost @lssvm @gkridge @kridge @klogistic @logitboost
• Try at the prompt: example(neural)
• Try other pattern recognition objects.
• Try different sets of hyperparameters, e.g., example(svc({'gamma=1', 'shrinkage=0.001'}))
• Remember: use default(method) to get the HP.

24

Lab 2

Example: Digit Recognition

Subset of the MNIST data of LeCun and Cortes, used for the NIPS 2003 challenge.

25

data(X, Y)

% Go to the Gisette directory:
cd('GISETTE')

% Load "validation" data:
Xt=load('gisette_valid.data');
Yt=load('gisette_valid.labels');
% Create a data object and examine it:
Dt=data(Xt, Yt);
browse(Dt, 2);

% Load "training" data (takes longer):
X=load('gisette_train.data');
Y=load('gisette_train.labels');
[p, n]=get_dim(Dt);
D=train(subsample(['p_max=' num2str(p)]), data(X, Y));
clear X Y Xt Yt

% Save for later use:
save('gisette', 'D', 'Dt');

26

model(hyperparam)

% Define some hyperparameters:
hyper = {'degree=3', 'shrinkage=0.1'};

% Create a kernel ridge regression model:
model = kridge(hyper);

% Train it and test it:
[resu, Model] = train(model, D);
tresu = test(Model, Dt);

% Visualize the results:
roc(tresu);
idx=find(tresu.X.*tresu.Y<0);
browse(get(D, idx), 2);

27

Exercise

• Here are some pattern recognition CLOP objects:
@rf @naive @gentleboost @svc @neural @logitboost @kridge @lssvm @klogistic
• Instantiate a model with some hyperparameters (use default(method) to get the HP).
• Vary the HP and the number of training examples (hint: use get(D, 1:n) to restrict the data to n examples).

28

chain({model1, model2,…})

% Combine preprocessing and kernel ridge regression:
my_prepro=normalize;
model = chain({my_prepro, kridge(hyper)});

ensemble({model1, model2,…})

% Combine replicas of a base learner:
for k=1:10
  base_model{k}=neural;
end
model=ensemble(base_model);

29

Exercise

• Here are some preprocessing CLOP objects:

@normalize @standardize @fourier
• Chain a preprocessing and a model, e.g.:
model=chain({fourier, kridge('degree=3')});
my_classif=svc({'coef0=1', 'degree=4', 'gamma=0', 'shrinkage=0.1'});
model=chain({normalize, my_classif});

• Train, test, visualize the results. Hint: you can browse the preprocessed data:

browse(train(standardize, D), 2);

30

Summary

% After creating your complex model, just one command: train

model=ensemble({chain({standardize,kridge(hyper)}),chain({normalize,naive})});

[resu, Model] = train(model, D);

% After training your complex model, just one command: test

tresu = test(Model, Dt);

% You can use a "cv" object to perform cross-validation:

cv_model=cv(model);
[resu, Model] = train(cv_model, D);
roc(resu);

31

Lab 3

Getting started with Feature Selection

32

POLY_GUI again…

clear classes
poly_gui;

• Check the "Multiplicative updates" (MU) box.
• Play with the parameters.
• Try CV.
• Compare with no MU.

33

Lab 3

Exploring feature selection methods

34

Re-load the GISETTE data

% Start CLOP:
clear classes
use_spider_clop;

% Go to the Gisette directory:
cd('GISETTE')
load('gisette');

35

Visualization

1) Create a heatmap of the data matrix or a subset of it:

show(D);
show(get(D, 1:10, 1:2:500));

2) Look at individual patterns:

browse(D);
browse(D, 2); % for 2d data
% Display feature positions:
browse(D, 2, [212, 463, 429, 239]);

3) Make a scatter plot of a few features:

scatter(D, [212, 463, 429, 239]);

36

Example

my_classif=svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=1'});

model=chain({normalize, s2n('f_max=100'), my_classif});

[resu, Model] = train(model, D);
tresu = test(Model, Dt);
roc(tresu);
% Show the misclassified first:
[s,idx]=sort(tresu.X.*tresu.Y);
browse(get(Dt, idx), 2, Model{2});

37

Some Filters in CLOP

Univariate:• @s2n (Signal to noise ratio.)

• @Ttest (T statistic; similar to s2n.)

• @Pearson (Uses Matlab corrcoef; gives the same results as Ttest if the classes are balanced.)

• @aucfs (ranksum test)

Multivariate:• @relief (no elimination of redundancy)

• @gs (Gram-Schmidt orthogonalization; complementary features)
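As a hedged illustration, any of these filters can be swapped into the chain of the previous Example slide (this assumes relief and gs accept an 'f_max' hyperparameter like s2n; my_classif, D and Dt are as defined earlier):

% Try several feature selection filters in the same chain.
for filt = {s2n('f_max=100'), relief('f_max=100'), gs('f_max=100')}
  model = chain({normalize, filt{1}, my_classif});
  [resu, Model] = train(model, D);
  tresu = test(Model, Dt);
  roc(tresu);
end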

38

Exercise

• Change the feature selection algorithm.
• Visualize the features.
• What can you say of the various methods?
• Which one gives the best results for 2, 10, 100 features?
• Can you improve by changing the preprocessing? (Hint: try @pc_extract.)

39

Lab 3

Feature significance

40

T-test

• Normally distributed classes, equal variance $\sigma^2$ unknown; estimated from the data as $\sigma^2_{within}$.

• Null hypothesis H0: $\mu_+ = \mu_-$

• T statistic: if H0 is true, then

$t = \dfrac{\mu_+ - \mu_-}{\sigma_{within}\,\sqrt{1/m_+ + 1/m_-}} \;\sim\; \text{Student},\; m_+ + m_- - 2 \text{ d.f.}$

[Figure: the class-conditional densities $P(X_i \mid Y=-1)$ and $P(X_i \mid Y=1)$ of feature $x_i$, centered at $\mu_-$ and $\mu_+$.]
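A plain-Matlab sketch of this statistic, computed for every feature of a toy data matrix (data assumed; tcdf requires the Statistics toolbox):

% Per-feature two-sample T statistic and two-sided p-value.
X = randn(100, 20); Y = sign(randn(100, 1));      % toy data (assumed)
pos = (Y > 0); neg = ~pos;
mp = sum(pos); mn = sum(neg);                     % m+ and m-
mu_p = mean(X(pos,:)); mu_n = mean(X(neg,:));     % class means mu+ and mu-
s2 = (sum((X(pos,:) - repmat(mu_p, mp, 1)).^2) + ...
      sum((X(neg,:) - repmat(mu_n, mn, 1)).^2)) / (mp + mn - 2);  % sigma2_within
t = (mu_p - mu_n) ./ sqrt(s2 * (1/mp + 1/mn));    % Student, m+ + m- - 2 d.f.
pval = 2 * (1 - tcdf(abs(t), mp + mn - 2));       % two-sided p-value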

41

Evaluation of pval and FDR

• Ttest object:
– computes pval analytically;
– FDR ~ pval * nsc/n.

• probe object:
– takes any feature ranking object as an argument (e.g. s2n, relief, Ttest);
– pval ~ nsp/np;
– FDR ~ pval * nsc/n.
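A hedged plain-Matlab sketch of the probe idea: rank the real features together with random "probe" features and estimate pval as the fraction of probes scoring at least as well (the ranking criterion and sizes are arbitrary choices, not the CLOP probe internals):

% Estimate p-values with random probe features.
X = randn(100, 50); Y = sign(randn(100, 1));     % toy data (assumed)
np = 500; Xp = randn(100, np);                   % np random probe features
Yc = Y - mean(Y);
Xc = X - repmat(mean(X), 100, 1);                % centered real features
Xpc = Xp - repmat(mean(Xp), 100, 1);             % centered probe features
s_real = abs(Yc' * Xc) ./ sqrt(sum(Xc.^2));      % |Pearson|-like criterion
s_probe = abs(Yc' * Xpc) ./ sqrt(sum(Xpc.^2));
pval = zeros(1, 50);
for j = 1:50
  pval(j) = sum(s_probe >= s_real(j)) / np;      % pval ~ nsp/np
end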

42

Analytic vs. probe

[Figure: FDR as a function of feature rank, comparing the analytic and probe estimates.]

43

Example

[resu, FS] = train(Ttest, D);
[resu, PFS] = train(probe(Ttest), D);

figure('Name', 'pvalue');
plot(get_pval(FS, 1), 'r');
hold on; plot(get_pval(PFS, 1));
figure('Name', 'FDR');
plot(get_fdr(FS, 1), 'r');
hold on; plot(get_fdr(PFS, 1));

44

Exercise

• What could explain the differences between the pvalue and FDR curves of the analytic and probe methods?

• Replace Ttest with chain({rmconst('w_min=0'), Ttest})

• Recompute the pvalue and fdr curves. What do you notice?

• Choose an optimum number fnum of features based on the pvalue or FDR. Visualize with browse(D, 2, FS, fnum);

• Create a model with fnum features. Is fnum optimal? Do you get something better with CV? (See the sketch below.)
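A hedged sketch for the last two items (it assumes Ttest accepts an 'f_max' hyperparameter like s2n; D and Dt are the GISETTE data objects loaded earlier):

% Scan several feature-set sizes and compare held-out error.
for fnum = [2 10 100 500]
  model = chain({rmconst('w_min=0'), Ttest(['f_max=' num2str(fnum)]), naive});
  [resu, Model] = train(model, D);
  tresu = test(Model, Dt);
  fprintf('fnum = %d: balanced error = %.3f\n', fnum, balanced_errate(tresu.X, tresu.Y));
end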

45

Lab 3

Local feature selection

46

Exercise

Consider the 1-nearest-neighbor algorithm. We define the following score:

where s(k) (resp. d(k)) is the index of the nearest neighbor of x_k belonging to the same class (resp. a different class) as x_k.

47

Exercise

1. Motivate the choice of such a cost function to approximate the generalization error (qualitative answer)

2. How would you derive an embedded method to perform feature selection for 1 nearest neighbor using this functional?

3. Motivate your choice (what makes your method an ‘embedded method’ and not a ‘wrapper’ method)

48

Relief

nearest hit

nearest miss

Dhit Dmiss

Relief=<Dmiss/Dhit>

Dhit

Dmiss

Local_Relief= Dmiss/Dhit
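A plain-Matlab sketch of the score as defined above, i.e. the per-example nearest hit/miss distance ratio (toy data assumed; this is the slide's <Dmiss/Dhit> version, not the per-feature Relief weighting):

% Compute the Relief-style nearest hit/miss ratio for each example.
X = randn(40, 5); Y = sign(randn(40, 1));           % toy data (assumed)
p = size(X, 1); ratio = zeros(p, 1);
for k = 1:p
  d = sqrt(sum((X - repmat(X(k,:), p, 1)).^2, 2));  % distances to x_k
  d(k) = Inf;                                       % exclude x_k itself
  Dhit  = min(d(Y == Y(k)));                        % nearest hit distance
  Dmiss = min(d(Y ~= Y(k)));                        % nearest miss distance
  ratio(k) = Dmiss / Dhit;                          % local relief score
end
relief_score = mean(ratio);                         % Relief = <Dmiss/Dhit>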

49

Exercise

[resu, FS] = train(relief, D);
browse(D, 2, FS, 20);
[resu, LFS] = train(local_relief, D);
browse(D, 2, LFS, 20);

•Propose a modification to the nearest neighbor algorithm that uses features relevant to individual patterns (like those provided by “local_relief”).

• Do you expect such an algorithm to perform better than the non-local version using "relief"?

50

Epilogue

Becoming a pro and playing with other datasets

51

Some CLOP objects

Basic learning machines

Feature selection, pre- and post-processing

Compound models

52

http://clopinet.com/challenges/

• Challenges in:
– Feature selection
– Performance prediction
– Model selection
– Causality

• Large datasets

53

NIPS 2003 Feature Selection Challenge

MADELON: Best BER = 6.22 ± 0.57%; n0 = 20 features (4%); BER0 = 7.33%

my_classif=svc({'coef0=1', 'degree=0', 'gamma=1', 'shrinkage=1'});

my_model=chain({probe(relief,{'p_num=2000', 'pval_max=0'}), standardize, my_classif})

DOROTHEA: Best BER = 8.54 ± 0.99%; n0 = 1000 features (1%); BER0 = 12.37%

my_model=chain({TP('f_max=1000'), naive, bias});

Competitive baseline methods set new standards for the NIPS 2003 feature selection benchmark, Isabelle Guyon, Jiwen Li, Theodor Mader, Patrick A. Pletscher, Georg Schneider and Markus Uhr, Pattern Recognition Letters, Volume 28, Issue 12, 1 September 2007, Pages 1438-1444.

Dataset    Size     Type            Features  Training ex.  Validation ex.  Test ex.
Arcene     8.7 MB   Dense           10000     100           100             700
Gisette    22.5 MB  Dense           5000      6000          1000            6500
Dexter     0.9 MB   Sparse integer  20000     300           300             2000
Dorothea   4.7 MB   Sparse binary   100000    800           350             800
Madelon    2.9 MB   Dense           500       2000          600             1800

Class taught at ETH, Zurich, winter 2005. Task of the students:
• Baseline method provided, with BER0 performance and n0 features.
• Get BER < BER0, or BER = BER0 but n < n0.
• Extra credit for beating the best challenge entry.

[Figure: example patterns from the datasets: GISETTE and DOROTHEA (heat maps), DEXTER (a news snippet: "NEW YORK, October 2, 2001 – Instinet Group Incorporated (Nasdaq: INET), the world's largest electronic agency securities broker, today announced tha…"), MADELON, and an ARCENE spectrum.]

DEXTER: Best BER = 3.30 ± 0.40%; n0 = 300 features (1.5%); BER0 = 5%

my_classif=svc({'coef0=1', 'degree=1', 'gamma=0', 'shrinkage=0.5'});

my_model=chain({s2n('f_max=300'), normalize, my_classif})

GISETTE: Best BER = 1.26 ± 0.14%; n0 = 1000 features (20%); BER0 = 1.80%

my_classif=svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=1'});

my_model=chain({normalize, s2n('f_max=1000'), my_classif});

ARCENE: Best BER = 11.9 ± 1.2%; n0 = 1100 features (11%); BER0 = 14.7%

my_svc=svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=0.1'});

my_model=chain({standardize, s2n('f_max=1100'), normalize, my_svc})


54

NIPS 2006 Model Selection Game

First place: Juha Reunanen, cross-indexing-7

Dataset  CLOP models selected
ADA      2*{sns,std,norm,gentleboost(neural),bias}; 2*{std,norm,gentleboost(kridge),bias}; 1*{rf,bias}
GINA     6*{std,gs,svc(degree=1)}; 3*{std,svc(degree=2)}
HIVA     3*{norm,svc(degree=1),bias}
NOVA     5*{norm,gentleboost(kridge),bias}
SYLVA    4*{std,norm,gentleboost(neural),bias}; 4*{std,neural}; 1*{rf,bias}

(sns = shift'n'scale, std = standardize, norm = normalize; some details of hyperparameters not shown)

Second place: Hugo Jair Escalante Balderas, BRun2311062

Dataset  CLOP models selected
ADA      {sns, std, norm, neural(units=5), bias}
GINA     {norm, svc(degree=5, shrinkage=0.01), bias}
HIVA     {std, norm, gentleboost(kridge), bias}
NOVA     {norm, gentleboost(neural), bias}
SYLVA    {std, norm, neural(units=1), bias}

(sns = shift'n'scale, std = standardize, norm = normalize; some details of hyperparameters not shown)

Note: the entry Boosting_1_001_x900 gave better results, but was older.

[Figure: example patterns from the five datasets ADA, GINA, HIVA, NOVA, SYLVA. The NOVA example is a newsgroup post: "Subject: Re: Goalie masks ... Tom Barrasso wore a great mask, one time, last season. It was all black, with Pgh city scenes on it. The 'Golden Triangle' graced the top, along with a steel mill on one side and the Civic Arena on the other. On the back of the helmet was the old Pens' logo, the current (at the time) Pens logo, and a space for the 'new' logo. - Lori"]

Dataset  Domain               Features  Training ex.  Validation ex.  Test ex.
ADA      Marketing            48        4147          415             41471
GINA     Digit recognition    970       3153          315             31532
HIVA     Drug discovery       1617      3845          384             38449
NOVA     Text classification  16969     1754          175             17537
SYLVA    Ecology              216       13086         1309            130857

Proc. IJCNN07, Orlando, FL, August 2007:
• PSMS for Neural Networks, H. Jair Escalante, Manuel Montes y Gómez, and Luis Enrique Sucar.
• Model Selection and Assessment Using Cross-indexing, Juha Reunanen.