Case Selection and Resampling Lucila Ohno-Machado HST951.
Case Selection and Resampling
Lucila Ohno-Machado, HST951
Topics
• Case selection (influence detection)
• Regression diagnostics
• Sampling procedures
– Bootstrap
– Jackknife
– Cross-validation
Unusual Data
• Outlier (discrepancy, unusual observation that may change parameters)
• Leverage (far from mean or centroid of other observations, unusual combinations of independent variable values X that may change parameters)
• Influence = discrepancy x leverage
Detecting Outliers: Residuals
• Measure of error
• Studentized residuals can be calculated by removing one observation at a time
• Note: high-leverage observations may have small residuals
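The leave-one-out idea can be sketched in a few lines (a minimal numpy illustration, not code from the course; the data are made up, with a gross outlier planted at index 0):

```python
import numpy as np

def deleted_residuals(X, y):
    """Residual of each observation against a fit that excludes it."""
    n = len(y)
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        out[i] = y[i] - X[i] @ beta
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=20)
X = np.column_stack([np.ones(20), x])     # intercept + one predictor
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=20)
y[0] += 5.0                               # plant a gross outlier
r = deleted_residuals(X, y)
```

The planted outlier stands out with by far the largest deleted residual, which would be masked less reliably by ordinary residuals if it also had high leverage.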
Assessing Leverage
• Hat values measure the distance of an observation from the mean (or centroid) of all observations
• The dependent variable is not involved in determining leverage
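Hat values can be computed from X alone, consistent with the point that the dependent variable plays no role (a numpy sketch on made-up data, with one point placed far from the centroid):

```python
import numpy as np

def hat_values(X):
    """Diagonal of the hat matrix H = X (X'X)^-1 X'."""
    return np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

rng = np.random.default_rng(1)
x = rng.normal(size=15)
x[0] = 10.0                                # far from the centroid of the x's
X = np.column_stack([np.ones(15), x])
h = hat_values(X)
```

The hat values sum to the number of parameters (here 2), and observation 0 gets by far the largest value.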
Measuring Influence
• Impact on coefficients of deleting an observation
– DFBETA
– Cook's D
– DFFITS
• Impact on standard errors
– COVRATIO
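Two of these deletion diagnostics can be sketched with plain numpy (illustrative formulas on made-up data, not the course's code): DFBETA by direct refitting, and Cook's D from the residuals and hat values.

```python
import numpy as np

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def dfbeta(X, y):
    """Change in each coefficient when observation i is deleted."""
    n = len(y)
    full = ols(X, y)
    return np.array([full - ols(X[np.arange(n) != i], y[np.arange(n) != i])
                     for i in range(n)])

def cooks_distance(X, y):
    """D_i = e_i^2 h_i / (p s^2 (1 - h_i)^2)."""
    n, p = X.shape
    e = y - X @ ols(X, y)
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    s2 = e @ e / (n - p)
    return e ** 2 * h / (p * s2 * (1 - h) ** 2)

rng = np.random.default_rng(2)
x = rng.normal(size=20)
x[0] = 8.0                                  # high leverage...
X = np.column_stack([np.ones(20), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=20)
y[0] -= 10.0                                # ...plus discrepancy = influence
D = cooks_distance(X, y)
B = dfbeta(X, y)
```

The point with both high leverage and high discrepancy dominates both diagnostics, matching the definition influence = discrepancy x leverage.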
Case selection
• Not all cases are created equal
• Some influential cases are good
• Some are bad: “outliers”
• Some non-influential cases are redundant
• It would be nice to keep a “minimal” set of good cases in training sets for fast on-line training
Classical Diagnostics
• Unicase selection is determined by removing one observation and inspecting results
• Unicase influence on
– Estimated parameters (coefficients)
– Fitted values (Y-hat)
– Residuals (error)
When outcomes are binary
• Residuals may not reflect discriminatory performance, but rather calibration
• Remember that a model with good discriminatory performance may be recalibrated
• Same rationale for coefficients
Influence
• The definition of influence is not fixed
• If the main reason for building models is prediction, then evaluating model performance on different subsets of the original sample might point to good, redundant, and bad cases
Qualifying a case
• Bad cases, when removed, should result in models with better predictions
• Redundant cases, when removed, should not affect predictions
• Good cases, if removed, would result in models with worse predictions
Defining prediction performance
• Use, for example, areas under ROC curves (or mean square error or cross entropy error)
• For each set of samples:
– Evaluate performance on training and holdout sets
– Determine which cases to remove
• Determine performance on test or validation sets
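This evaluation loop can be made concrete (a sketch under assumptions: a rank-based AUC, and a `fit` callable standing in for whatever model-building step is used):

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney statistic."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    gt = (pos[:, None] > neg[None, :]).sum()
    eq = (pos[:, None] == neg[None, :]).sum()
    return (gt + 0.5 * eq) / (len(pos) * len(neg))

def qualify_cases(train_x, train_y, hold_x, hold_y, fit):
    """Change in holdout AUC when each training case is deleted.
    Positive delta -> removing the case helps (a 'bad' case);
    near zero -> redundant; negative -> a 'good' case."""
    base = auc(fit(train_x, train_y)(hold_x), hold_y)
    n = len(train_y)
    deltas = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        model = fit(train_x[keep], train_y[keep])
        deltas[i] = auc(model(hold_x), hold_y) - base
    return deltas
```

Here `fit(x, y)` is any routine that returns a scoring function; the deltas implement the good/redundant/bad classification described above.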
Sequential Multicase Selection
• Sequential procedure
– remove the most influential case
– remove the second-most influential case (conditioned on the first)
– and so on (an exhaustive search would instead evaluate C(n,i) subsets for each i = 1 to m, where C(n,m) represents the number of subsets of size m that can be built from n cases)
• Problem: cases are not considered en bloc
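The greedy sequential procedure can be sketched generically (illustrative; `influence` is any per-case score, such as Cook's D or a holdout-performance delta):

```python
import numpy as np

def sequential_removal(X, y, influence, k):
    """Greedily delete k cases, rescoring after each deletion."""
    idx = np.arange(len(y))
    removed = []
    for _ in range(k):
        scores = influence(X[idx], y[idx])
        worst = idx[int(np.argmax(scores))]
        removed.append(int(worst))
        idx = idx[idx != worst]
    return removed, idx
```

Because each step conditions only on the earlier deletions, a pair of cases that is influential only jointly can be missed; that is the "en bloc" problem noted above.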
Alternatives
• Multicase selection that is not sequential, yet not exhaustive (e.g., genetic algorithm search)
• Analogous to variable selection
Genetic Algorithm
• Given a training set C and a selection of cases v, we construct a logistic regression model l_C(v). We evaluate the model using the AUC, and represent this evaluation as a(l_C(v)). For a total number of cases n, and m cases in the selection v, we use the following fitness function:
• f(v, C) = a(l_C(v)) + (n − m)/n
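As a sketch, this fitness function can be written with the model-building-and-evaluation step a(l_C(v)) abstracted as a callable (an assumed interface, not the authors' code):

```python
import numpy as np

def fitness(v, auc_of_model):
    """f(v, C) = a(l_C(v)) + (n - m)/n for a boolean case-selection mask v.

    The second term rewards parsimony: selections that keep fewer of the
    n training cases score higher, all else being equal."""
    n, m = len(v), int(np.sum(v))
    return auc_of_model(v) + (n - m) / n
```

For example, a selection keeping 2 of 4 cases with AUC 0.8 scores 0.8 + 2/4 = 1.3, while keeping all 4 cases at the same AUC scores only 0.8.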
Resampling
Bootstrap Motivation
• Sometimes it is not possible to collect many samples from a population
• Sometimes it is not correct to assume a certain distribution for the population
• Goal: Assess sampling variation
Bootstrap
• Efron (Stanford biostatistics), late 1970s
– “Pulling oneself up by one’s bootstraps”
• Nonparametric approach to statistical inference
• Uses computation instead of traditional distributional assumptions and asymptotic results
• Can be used to derive standard errors and confidence intervals, and to test hypotheses
Example
• Adapted from Fox (1997) “Applied Regression Analysis”
• Goal: Estimate the mean Male–Female difference in finding X
• Four pairs of observations are available:
| Observation | Male | Female | Difference |
|---|---|---|---|
| 1 | 24 | 18 | 6 |
| 2 | 14 | 17 | -3 |
| 3 | 40 | 35 | 5 |
| 4 | 44 | 41 | 3 |
Mean Difference
• Sample mean is (6 − 3 + 5 + 3)/4 = 2.75
• If Y were normally distributed, the 95% CI would be Ȳ ± 1.96 σ_Y/√n
• But we do not know σ_Y
Estimates
• Estimate of σ² is S² = Σᵢ (Yᵢ − Ȳ)² / (n − 1)
• Estimate of the standard error is ŜE(Ȳ) = S/√n
• Assuming the population is normally distributed, we can use the t-distribution: Ȳ ± t(n−1, 0.025) · S/√n
Confidence Interval
Ȳ ± t(n−1, 0.025) · S/√n = 2.75 ± 4.30 (2.015) = 2.75 ± 8.66

−5.91 < μ < 11.41

HUGE!!!
Sample mean and variance
• Use the distribution Y* of the sample to estimate the distribution of Y in the population

| y* | p*(y*) |
|---|---|
| 6 | .25 |
| -3 | .25 |
| 5 | .25 |
| 3 | .25 |

E*(Y*) = Σ y* p*(y*) = 2.75
V*(Y*) = Σ [y* − E*(Y*)]² p*(y*) = 12.1875
Sample with Replacement

| Sample | Y1* | Y2* | Y3* | Y4* | Ȳ* |
|---|---|---|---|---|---|
| 1 | 6 | 6 | 6 | 6 | 6.00 |
| 2 | 6 | 6 | 6 | -3 | 3.75 |
| 3 | 6 | 6 | 6 | 5 | 5.75 |
| … | | | | | |
| 100 | -3 | 5 | 6 | 3 | 2.75 |
| 101 | -3 | 5 | -3 | 6 | 1.25 |
| … | | | | | |
| 255 | 3 | 3 | 3 | 5 | 3.50 |
| 256 | 3 | 3 | 3 | 3 | 3.00 |
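The enumeration can be reproduced directly: all 4⁴ = 256 bootstrap samples of the four observed differences, and the distribution of their means (a quick numpy check):

```python
from itertools import product
import numpy as np

diffs = [6, -3, 5, 3]                       # the four observed differences
samples = list(product(diffs, repeat=4))    # all 4^4 = 256 bootstrap samples
means = np.array([np.mean(s) for s in samples])

n_samples = len(means)                      # 256
grand_mean = means.mean()                   # 2.75, matching the table
boot_se = means.std()                       # about 1.745 (the exact bootstrap SE)
```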
Calculating the CI
• Mean of the 256 bootstrap means is 2.75, but the SE is known rather than estimated (no hat, since it is computed over all nⁿ bootstrap samples):

SE*(Ȳ*) = √( Σ_b (Ȳ*_b − 2.75)² / nⁿ ) = 1.745

• Compare with the estimated standard error ŜE(Ȳ) = S/√n = 2.015
So what?
• We already knew that!
• But with the bootstrap
– Confidence intervals can be more accurate
– It can be used for non-linear statistics without known standard-error formulas
The population is to the sample as the sample is to the bootstrap samples.

In practice (as opposed to the previous example), not all possible bootstrap samples are enumerated; a large number are drawn at random instead.
Procedure
• 1. Specify the data-collection scheme that results in the observed sample: Collect(population) → sample
• 2. Use the sample as if it were the population (sampling with replacement): Collect(sample) → bootstrap sample 1, bootstrap sample 2, etc.
Cont.
• 3. For each bootstrap sample, calculate the estimate you are looking for
• 4. Use the distribution of the bootstrap estimates to estimate the properties of the sampling distribution of the estimate
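Steps 1–4 reduce to a few lines when only B random bootstrap samples are drawn rather than all of them (a minimal sketch; B and the seed are arbitrary choices):

```python
import numpy as np

def bootstrap_se(sample, statistic, B=2000, seed=0):
    """SE of `statistic`, from B resamples of `sample` with replacement."""
    rng = np.random.default_rng(seed)
    n = len(sample)
    stats = np.array([statistic(rng.choice(sample, size=n, replace=True))
                      for _ in range(B)])
    return stats.std(ddof=1)

se = bootstrap_se(np.array([6, -3, 5, 3]), np.mean)
```

With the four differences from the earlier example, `se` lands near the exact enumerated value of 1.745.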
Bootstrap Confidence Intervals
• Normal theory
• Percentile intervals
– Example: a 95% interval is obtained by taking
– Lower = the (0.025 × B)-th ordered bootstrap replicate
– Upper = the (0.975 × B)-th ordered bootstrap replicate
• There are corrections for bootstrap intervals
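A percentile interval in this style (a sketch; B is arbitrary and no corrections are applied):

```python
import numpy as np

def percentile_ci(sample, statistic, B=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: the alpha/2 and 1 - alpha/2 quantiles
    of the ordered bootstrap replicates."""
    rng = np.random.default_rng(seed)
    n = len(sample)
    stats = np.array([statistic(rng.choice(sample, size=n, replace=True))
                      for _ in range(B)])
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])

lo, hi = percentile_ci(np.array([6, -3, 5, 3]), np.mean)
```

On the four-observation example the interval brackets the sample mean of 2.75 and stays within the range of possible resample means.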
Bootstrapping Linear Regression
The observed estimate is usually the coefficient(s); there are (at least) two ways of doing this:
• Resample observations (the usual approach) and re-regress (X will vary)
• Resample residuals (X is fixed; Y* = Ŷ + E* is the new dependent variable; re-regress with X fixed)
– Assumes errors are identically distributed
– The impact of high-leverage outliers may be lost
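Both schemes, sketched on made-up data with ordinary least squares (an illustration, not code from the lecture):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta = ols(X, y)
resid = y - X @ beta

def boot_cases(B=1000):
    """Scheme 1: resample (x, y) pairs, so X varies across replicates."""
    return np.array([ols(X[i], y[i])
                     for i in (rng.integers(0, n, size=n) for _ in range(B))])

def boot_residuals(B=1000):
    """Scheme 2: X fixed; y* = fitted value + resampled residual."""
    return np.array([ols(X, X @ beta + rng.choice(resid, size=n, replace=True))
                     for _ in range(B)])

slopes_cases = boot_cases()[:, 1]
slopes_resid = boot_residuals()[:, 1]
```

Both replicate distributions center near the fitted slope; the residual scheme holds the design fixed, which is exactly why a high-leverage outlier's impact can disappear from its replicates.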
Bootstrap for other methods
• Used in other classification methods (neural networks, classification trees, etc.)
• Especially useful when the sample size is small and no distributional assumptions can be made
• Same principles apply
Other resampling methods
• Jackknife (leave one out) is a special case of the bootstrap
– Resamples without one case and without replacement (samples have size n − 1)
• Cross-validation
– Divides the data into training and test sets
• Generally used to estimate confidence intervals on predictions for the “full” model (i.e., the model that used all cases)
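The jackknife in this sense can be sketched for the mean of the four differences used earlier (for the mean, the jackknife SE coincides with the usual S/√n):

```python
import numpy as np

def jackknife_se(sample, statistic):
    """SE from the n leave-one-out estimates (size n-1, no replacement)."""
    n = len(sample)
    loo = np.array([statistic(np.delete(sample, i)) for i in range(n)])
    return np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2))

se = jackknife_se(np.array([6, -3, 5, 3]), np.mean)
```

The (n − 1)/n inflation factor compensates for the leave-one-out estimates being much less variable than estimates from independent samples.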