
Page 1

Data mining and statistical learning, lecture 5

Outline

- Summary of regressions on correlated inputs
- Ridge regression
- PCR (principal components regression)
- PLS (partial least squares regression)
- Model selection using cross-validation

- Linear classification models
- Logistic regression
- Regression on indicator functions
- Linear discriminant analysis (LDA)

Page 2

Ridge regression

The ridge regression coefficients minimize a penalized residual sum of squares:

$$ \hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{N} \left( y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\} $$

or, equivalently,

$$ \hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{N} \left( y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le s $$

Normally, inputs are centred prior to the estimation of regression coefficients
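As an illustration, ridge estimates over a grid of penalty values can be computed in SAS with the RIDGE= option of PROC REG (a sketch; the dataset mining.corr and the variable names are hypothetical):

proc reg data=mining.corr outest=ridge_est ridge=0 to 1 by 0.1;
   model y = x1-x10;   /* one set of ridge coefficients per penalty value, stored in ridge_est */
run;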

Page 3

Regression methods using derived input directions
- Partial Least Squares Regression

Extract linear combinations of the inputs as derived features, and then model the target (response) as a linear function of these features

[Diagram: the inputs x1, x2, …, xp are combined into derived directions z1, z2, …, zM, which in turn are used to model the response y.]

Select the intermediates so that the covariance with the response variable is maximized

Normally, the inputs are standardized to mean zero and variance one prior to the PLS analysis
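Such a model can be fitted, for instance, with SAS PROC PLS (a sketch; the dataset mining.spectra, the response y, and the inputs x1-x100 are hypothetical):

proc pls data=mining.spectra method=pls nfac=3;
   model y = x1-x100;   /* extracts 3 derived directions (factors) and regresses y on them */
run;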

Page 4

PLS vs PCR
- absorbance records for chopped meat

[Figure: percent variation accounted for (0 to 120) versus number of factors (1 to 10), for PLS and PCR.]

In general, PLS models have fewer factors than PCR models

Page 5

Common characteristics of ridge regression, PCR, and PLS

Ridge regression, PCR, and PLS can all handle high-dimensional inputs

In contrast to ordinary least squares regression, the cited methods can be used for prediction even if the number of inputs (x-variables) exceeds the number of cases

For minimizing prediction error, ridge regression, PCR, and PLS are generally preferable to variable subset selection in ordinary least squares regression

Page 6

Behaviour of ridge regression, PCR, and PLS

Ridge regression, PCR, and PLS tend to behave similarly

Ridge regression shrinks all directions, but shrinks low-variance directions more

Principal components regression leaves M high-variance directions alone, and discards the rest

Partial least squares regression tends to shrink the low-variance directions, but may inflate some of the higher variance directions
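This comparison can be made precise with the singular value decomposition X = UDV^T of the centred input matrix (a standard result, stated here for reference, not taken from the slides):

$$ X\hat{\beta}^{\text{ridge}} = \sum_{j=1}^{p} \mathbf{u}_j \, \frac{d_j^2}{d_j^2 + \lambda} \, \mathbf{u}_j^{\top} \mathbf{y} $$

Each principal direction u_j is shrunk by the factor d_j^2/(d_j^2 + λ), so low-variance directions (small d_j) are shrunk the most; PCR instead sets this factor to 1 for the M leading directions and 0 for the rest.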

Page 7

Model selection: ordinary cross-validation

For each model, do the following:

(i) Fit the model to the training set of inputs and responses

(ii) Use the fitted model to predict the response values in the test set and compute the prediction errors

Select the model that produces the smallest PRESS-value

[Schematic: the data matrix with rows (x_{i1}, x_{i2}, …, x_{ip}, y_i), i = 1, …, n, split into a training set and a test set.]

PRESS = Prediction Error Sum of Squares
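In symbols (the notation is added here for clarity): if $\hat{y}_i$ denotes the prediction of $y_i$ from the model fitted to the training set, then

$$ \text{PRESS} = \sum_{i \,\in\, \text{test set}} (y_i - \hat{y}_i)^2 $$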

Page 8

Model selection: leave-one-out cross-validation

For each model, do the following:

(i) Leave out one case and fit the model to the remaining data

(ii) Use the fitted model to predict the response value in the case that was left out and compute the prediction error

(iii) Repeat steps (i) and (ii) for all cases and compute the PRESS-value (prediction error sum of squares)

Select the model that produces the smallest PRESS-value

[Schematic: the same data matrix; each row (x_{i1}, …, x_{ip}, y_i) is left out in turn.]

PRESS = Prediction Error Sum of Squares
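In symbols (notation added here): with $\hat{y}_{(-i)}$ the prediction of $y_i$ from the model fitted to all cases except case $i$,

$$ \text{PRESS} = \sum_{i=1}^{n} (y_i - \hat{y}_{(-i)})^2 $$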

Page 9

Model selection: K-fold (block) cross-validation

Divide the data set into m blocks of size K and do the following for each model:

(i) Leave out one block of cases and fit the model to the remaining data

(ii) Use the fitted model to predict the response values in the block that was left out and compute the sum of squared prediction errors

(iii) Repeat steps (i) and (ii) for all blocks and compute the PRESS-value (prediction error sum of squares)

Select the model that produces the smallest PRESS-value

[Schematic: the data matrix partitioned into m blocks of K consecutive rows (Block 1, …, Block j, …, Block m); block j contains the rows (j-1)K+1, …, jK.]
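Both leave-one-out and block cross-validation are available directly in SAS PROC PLS (a sketch; dataset and variable names are hypothetical):

proc pls data=mining.spectra method=pls cv=one;
   /* cv=one: leave-one-out cross-validation; the procedure reports the
      cross-validated PRESS for each number of extracted factors */
   model y = x1-x100;
run;

proc pls data=mining.spectra method=pls cv=block;
   /* cv=block: blocks of successive observations are left out in turn */
   model y = x1-x100;
run;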

Page 10

Classification

The task of assigning objects to one of several predefined categories

Detecting spam among e-mails

Credit scoring

Classifying tumours as malignant or benign

Page 11

Customer relations management
- an example

Consider a database in which 2470 customers have been registered

For each customer the enterprise has recorded a binary response variable Y (Y = 1: multiple purchases, Y = 0: single purchase) and several predictors

We shall model the probability that Y = 1.

Y  Installment  First_amount_spent  No._products  Age51_89  Age36_50  Age15_35  Sex  North  Central  South_and_islands
0  0              520000            0             0         0         1         0    0      0        1
0  1             1484000            2             0         1         0         1    0      0        1
0  0             2459000            1             1         0         0         1    0      0        1
0  0             3389000            0             0         1         0         1    0      1        0

Page 12

Logistic regression for a binary response variable Y
- single input

$$ p = P(Y=1 \mid X=x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)} $$

$$ \log\frac{p}{1-p} = \log\frac{P(Y=1 \mid X=x)}{P(Y=0 \mid X=x)} = \beta_0 + \beta_1 x $$

[Figure: the logistic curve P(Y=1) plotted against x for 0 ≤ x ≤ 3, rising from 0 towards 1.]

The log of the odds, log(p/(1-p)), is linear in x
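A model of this form can be fitted, for instance, with SAS PROC LOGISTIC (a sketch; the dataset mining.crm is hypothetical, the variables follow the table on page 11):

proc logistic data=mining.crm;
   model Y(event='1') = First_amount_spent;   /* models the logit of P(Y = 1) as linear in the input */
run;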

Page 13

Logistic regression of multiple purchases vs first amount spent

[Figure: observed binary responses and the estimated event probability plotted against first amount spent (0 to 7000).]

Page 14

Logistic regression of multiple purchases vs first amount spent
- inference from a model comprising a single input

Response Information

Variable Value Count

Multiple_purchases 1 34 (Event)

0 66

Total 100

Logistic Regression Table

Odds 95% CI

Predictor Coef SE Coef Z P Ratio Lower Upper

Constant -2.50310 0.450895 -5.55 0.000

First_amount_spent 0.0014381 0.0003063 4.69 0.000 1.00 1.00 1.00

Log-Likelihood = -43.215

Test that all slopes are zero: G = 41.776, DF = 1, P-Value = 0.000
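As a worked illustration of these estimates (the input value 2000 is chosen arbitrarily): for a first amount spent of 2000, the estimated log odds are -2.50310 + 0.0014381 * 2000 ≈ 0.373, so the estimated probability of multiple purchases is exp(0.373)/(1 + exp(0.373)) ≈ 0.59.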

Page 15

Logistic regression for a binary response variable
- multiple inputs

Consider a binary response variable Y and set p = P(Y = 1)

Assume that the log odds

$$ \log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_m x_m $$

is a linear function of the m predictors x1, …, xm; equivalently,

$$ p = \frac{\exp(\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m)}{1 + \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m)} $$

Page 16

Logistic regression
- inference from a model comprising two inputs

Binary Logistic Regression: RestingPulse versus Smokes, Weight

Variable Value Count

RestingPulse Low 70 (Event)

High 22

Total 92

Factor Information

Factor Levels Values

Smokes 2 No, Yes

Logistic Regression Table

Odds 95% CI

Predictor Coef SE Coef Z P Ratio Lower Upper

Constant -1.98717 1.67930 -1.18 0.237

SmokesYes -1.19297 0.552980 -2.16 0.031 0.30 0.10 0.90

Weight 0.0250226 0.0122551 2.04 0.041 1.03 1.00 1.05

The estimated coefficient -1.19297 represents the change in the log of P(low pulse)/P(high pulse) when the subject smokes compared to when he/she does not smoke, with the covariate Weight held constant

The odds of smokers in the sample having a low pulse are thus estimated to be 30% of the odds for non-smokers.
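(Indeed, exponentiating the coefficient reproduces the odds ratio column: exp(-1.19297) ≈ 0.30.)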

Page 17

Logistic regression for an ordinal response variable Y

$$ P(Y=1 \mid X=x) = \frac{\exp(\beta_{10} + \beta_{11} x)}{1 + \exp(\beta_{10} + \beta_{11} x) + \exp(\beta_{20} + \beta_{21} x)} $$

$$ P(Y=2 \mid X=x) = \frac{\exp(\beta_{20} + \beta_{21} x)}{1 + \exp(\beta_{10} + \beta_{11} x) + \exp(\beta_{20} + \beta_{21} x)} $$

$$ P(Y=3 \mid X=x) = \frac{1}{1 + \exp(\beta_{10} + \beta_{11} x) + \exp(\beta_{20} + \beta_{21} x)} $$

[Figure: P(Y=1), P(Y=2), and P(Y=3) plotted against x for 0 ≤ x ≤ 5.]

Page 18

Logistic regression for an ordinal response variable Y

$$ \log\frac{P(Y=1 \mid X=x)}{P(Y=3 \mid X=x)} = \beta_{10} + \beta_{11} x $$

$$ \log\frac{P(Y=2 \mid X=x)}{P(Y=3 \mid X=x)} = \beta_{20} + \beta_{21} x $$

[Figure: P(Y=1), P(Y=2), and P(Y=3) plotted against x for 0 ≤ x ≤ 5.]
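These equations are generalized (baseline-category) logits with Y = 3 as the reference level. In SAS PROC LOGISTIC they can be fitted with the GLOGIT link (a sketch; dataset and variable names are hypothetical):

proc logistic data=mining.ordinal;
   model y(ref='3') = x / link=glogit;   /* one logit equation per non-reference level of y */
run;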

Page 19

Classification using logistic regression

Assign the object to the class k that maximizes

$$ P(Y = k \mid X = x) $$

[Figure: P(Y=1), P(Y=2), and P(Y=3) plotted against x for 0 ≤ x ≤ 5; each x is assigned to the class whose probability curve is highest.]

Page 20

Regression of an indicator matrix

[Figure: scatterplot of Class 1 and Class 2 objects in the (x1, x2) plane.]

Find a linear function

$$ \hat{f}_1(x_1, x_2) = \hat{\beta}_{10} + \hat{\beta}_{11} x_1 + \hat{\beta}_{12} x_2 $$

which is (on average) one for objects in class 1 and otherwise (on average) zero, and a linear function

$$ \hat{f}_2(x_1, x_2) = \hat{\beta}_{20} + \hat{\beta}_{21} x_1 + \hat{\beta}_{22} x_2 $$

which is (on average) one for objects in class 2 and otherwise (on average) zero.

Assign a new object to class 1 if

$$ \hat{f}_1(x_1, x_2) > \hat{f}_2(x_1, x_2) $$
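A sketch of this indicator regression in SAS (the dataset mining.twoclass and its variable class are hypothetical):

data indicators;
   set mining.twoclass;
   y1 = (class = 1);   /* indicator response for class 1 */
   y2 = (class = 2);   /* indicator response for class 2 */
run;

proc reg data=indicators;
   model y1 y2 = x1 x2;   /* one fitted linear function per class */
run;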

Page 21

3D-plot of an indicator matrix for class 1

[Figure: 3D scatterplot of Class_1 vs x2 vs x1; the indicator for class 1 takes the values 0 and 1.]

Page 22

3D-plot of an indicator matrix for class 2

[Figure: 3D scatterplot of Class_2 vs x2 vs x1; the indicator for class 2 takes the values 0 and 1.]

Page 23

Regression of an indicator matrix
- discriminating function

[Figure: scatterplot of Class 1 and Class 2 in the (x1, x2) plane together with the fitted linear discriminating function.]

Page 24

Regression of an indicator matrix
- discriminating function

[Figure: scatterplot of Class 1, Class 2, and Class 3 in the (x1, x2) plane.]

Estimate a discriminant function $\delta_k(x)$ for each class, and then classify a new object to the class with the largest value of its discriminant function

Page 25

Linear discriminant analysis (LDA)

LDA is an optimal classification method when the data arise from Gaussian distributions with different means and a common covariance matrix

[Figure: scatterplot of Class 1, Class 2, and Class 3 in the plane.]
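Under these assumptions the discriminant functions take the familiar explicit form (a standard result, stated here for reference):

$$ \delta_k(x) = x^{\top} \Sigma^{-1} \mu_k - \tfrac{1}{2} \mu_k^{\top} \Sigma^{-1} \mu_k + \log \pi_k $$

where $\mu_k$ is the mean of class k, $\Sigma$ the common covariance matrix, and $\pi_k$ the prior probability of class k; in practice these quantities are replaced by their sample estimates.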

Page 26

Software recommendation

SAS Proc DISCRIM

proc discrim data=mining.lda;
   class class;
   var x1 x2;
run;
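The CLASS statement names the grouping variable and VAR the inputs. Adding the CROSSVALIDATE option to the PROC DISCRIM statement additionally requests a leave-one-out cross-validated estimate of the error rate.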