Case Selection and Resampling Lucila Ohno-Machado HST951.
Case Selection and Resampling
Lucila Ohno-Machado, HST951
Topics
• Case selection (influence detection)
• Regression diagnostics
• Sampling procedures
– Bootstrap
– Jackknife
– Cross-validation
Unusual Data
• Outlier (discrepancy, unusual observation that may change parameters)
• Leverage (far from mean or centroid of other observations, unusual combinations of independent variable values X that may change parameters)
• Influence = discrepancy x leverage
Detecting Outliers: Residuals
• Measure of error
• Studentized residuals can be calculated by removing one observation at a time
• Note: high-leverage observations may have small residuals
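The leave-one-out idea can be sketched in a few lines (a minimal numpy illustration, not code from the course; the data are made up, with a gross outlier planted at index 0):

```python
import numpy as np

def deleted_residuals(X, y):
    """Residual of each observation against a fit that excludes it."""
    n = len(y)
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        out[i] = y[i] - X[i] @ beta
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=20)
X = np.column_stack([np.ones(20), x])     # intercept + one predictor
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=20)
y[0] += 5.0                               # plant a gross outlier
r = deleted_residuals(X, y)
```

The planted outlier stands out with by far the largest deleted residual, which would be masked less reliably by ordinary residuals if it also had high leverage.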
Assessing Leverage
• Hat values measure the distance of an observation from the mean (or centroid) of all observations
• The dependent variable is not involved in determining leverage
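Hat values can be computed from X alone, consistent with the point that the dependent variable plays no role (a numpy sketch on made-up data, with one point placed far from the centroid):

```python
import numpy as np

def hat_values(X):
    """Diagonal of the hat matrix H = X (X'X)^-1 X'."""
    return np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

rng = np.random.default_rng(1)
x = rng.normal(size=15)
x[0] = 10.0                                # far from the centroid of the x's
X = np.column_stack([np.ones(15), x])
h = hat_values(X)
```

The hat values sum to the number of parameters (here 2), and observation 0 gets by far the largest value.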
Measuring Influence
• Impact on coefficients of deleting an observation
– DFBETA
– Cook's D
– DFFITS
• Impact on standard errors
– COVRATIO
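Two of these deletion diagnostics can be sketched with plain numpy (illustrative formulas on made-up data, not the course's code): DFBETA by direct refitting, and Cook's D from the residuals and hat values.

```python
import numpy as np

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def dfbeta(X, y):
    """Change in each coefficient when observation i is deleted."""
    n = len(y)
    full = ols(X, y)
    return np.array([full - ols(X[np.arange(n) != i], y[np.arange(n) != i])
                     for i in range(n)])

def cooks_distance(X, y):
    """D_i = e_i^2 h_i / (p s^2 (1 - h_i)^2)."""
    n, p = X.shape
    e = y - X @ ols(X, y)
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    s2 = e @ e / (n - p)
    return e ** 2 * h / (p * s2 * (1 - h) ** 2)

rng = np.random.default_rng(2)
x = rng.normal(size=20)
x[0] = 8.0                                  # high leverage...
X = np.column_stack([np.ones(20), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=20)
y[0] -= 10.0                                # ...plus discrepancy = influence
D = cooks_distance(X, y)
B = dfbeta(X, y)
```

The point with both high leverage and high discrepancy dominates both diagnostics, matching the definition influence = discrepancy x leverage.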
Case selection
• Not all cases are created equal
• Some influential cases are good
• Some are bad: “outliers”
• Some non-influential cases are redundant
• It would be nice to keep a “minimal” set of good cases in training sets for fast on-line training
Classical Diagnostics
• Unicase selection is determined by removing one observation and inspecting results
• Unicase influence on
– Estimated parameters (coefficients)
– Fitted values (Y-hat)
– Residuals (error)
When outcomes are binary
• Residuals may not reflect discriminatory performance, but rather calibration
• Remember that a model with good discriminatory performance may be recalibrated
• Same rationale for coefficients
Influence
• The definition of influence is not fixed
• If the main reason for building models is prediction, then evaluating model performance on different subsets of the original sample might point to good, redundant, and bad cases
Qualifying a case
• Bad cases, when removed, should result in models with better predictions
• Redundant cases, when removed, should not affect predictions
• Good cases, if removed, would result in models with worse predictions
Defining prediction performance
• Use, for example, areas under ROC curves (or mean square error or cross entropy error)
• For each set of samples:
– Evaluate performance on training and holdout sets
– Determine which cases to remove
• Determine performance on test or validation sets
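This evaluation loop can be made concrete (a sketch under assumptions: a rank-based AUC, and a `fit` callable standing in for whatever model-building step is used):

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney statistic."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    gt = (pos[:, None] > neg[None, :]).sum()
    eq = (pos[:, None] == neg[None, :]).sum()
    return (gt + 0.5 * eq) / (len(pos) * len(neg))

def qualify_cases(train_x, train_y, hold_x, hold_y, fit):
    """Change in holdout AUC when each training case is deleted.
    Positive delta -> removing the case helps (a 'bad' case);
    near zero -> redundant; negative -> a 'good' case."""
    base = auc(fit(train_x, train_y)(hold_x), hold_y)
    n = len(train_y)
    deltas = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        model = fit(train_x[keep], train_y[keep])
        deltas[i] = auc(model(hold_x), hold_y) - base
    return deltas
```

Here `fit(x, y)` is any routine that returns a scoring function; the deltas implement the good/redundant/bad classification described above.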
Sequential Multicase Selection
• Sequential procedure
– remove the most influential case
– remove the second-most influential case (conditioned on the first)
– and so on (an exhaustive search would instead evaluate C(n,i) subsets for each i = 1 to m, where C(n,m) represents the number of subsets of size m that can be built from n cases)
• Problem: cases are not considered en bloc
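The greedy sequential procedure can be sketched generically (illustrative; `influence` is any per-case score, such as Cook's D or a holdout-performance delta):

```python
import numpy as np

def sequential_removal(X, y, influence, k):
    """Greedily delete k cases, rescoring after each deletion."""
    idx = np.arange(len(y))
    removed = []
    for _ in range(k):
        scores = influence(X[idx], y[idx])
        worst = idx[int(np.argmax(scores))]
        removed.append(int(worst))
        idx = idx[idx != worst]
    return removed, idx
```

Because each step conditions only on the earlier deletions, a pair of cases that is influential only jointly can be missed; that is the "en bloc" problem noted above.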
Alternatives
• Multicase selection that is not sequential, yet not exhaustive (e.g., genetic algorithm search)
• Analogous to variable selection
Genetic Algorithm
• Given a training set C and a selection of cases v, we construct a logistic regression model l_C(v). We evaluate the model using the AUC, and represent this evaluation as a(l_C(v)). For a total number of cases n, and m cases in the selection v, we use the following fitness function:
• f(v, C) = a(l_C(v)) + (n − m)/n
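As a sketch, this fitness function can be written with the model-building-and-evaluation step a(l_C(v)) abstracted as a callable (an assumed interface, not the authors' code):

```python
import numpy as np

def fitness(v, auc_of_model):
    """f(v, C) = a(l_C(v)) + (n - m)/n for a boolean case-selection mask v.

    The second term rewards parsimony: selections that keep fewer of the
    n training cases score higher, all else being equal."""
    n, m = len(v), int(np.sum(v))
    return auc_of_model(v) + (n - m) / n
```

For example, a selection keeping 2 of 4 cases with AUC 0.8 scores 0.8 + 2/4 = 1.3, while keeping all 4 cases at the same AUC scores only 0.8.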
Resampling
Bootstrap Motivation
• Sometimes it is not possible to collect many samples from a population
• Sometimes it is not correct to assume a certain distribution for the population
• Goal: Assess sampling variation
Bootstrap
• Efron (Stanford biostatistics), late 1970s
– “Pulling oneself up by one’s bootstraps”
• Nonparametric approach to statistical inference
• Uses computation instead of traditional distributional assumptions and asymptotic results
• Can be used to derive standard errors and confidence intervals, and to test hypotheses
Example
• Adapted from Fox (1997) “Applied Regression Analysis”
• Goal: Estimate the mean Male–Female difference in finding X
• Four pairs of observations are available:
| Observation | Male | Female | Difference |
|---|---|---|---|
| 1 | 24 | 18 | 6 |
| 2 | 14 | 17 | -3 |
| 3 | 40 | 35 | 5 |
| 4 | 44 | 41 | 3 |
Mean Difference
• Sample mean is (6 − 3 + 5 + 3)/4 = 2.75
• If Y were normally distributed, the 95% CI would be Ȳ ± 1.96 σ_Y/√n
• But we do not know σ_Y
Estimates
• Estimate of σ² is S² = Σᵢ (Yᵢ − Ȳ)² / (n − 1)
• Estimate of the standard error is ŜE(Ȳ) = S/√n
• Assuming the population is normally distributed, we can use the t-distribution: Ȳ ± t(n−1, 0.025) · S/√n
Confidence Interval
Ȳ ± t(n−1, 0.025) · S/√n = 2.75 ± 4.30 (2.015) = 2.75 ± 8.66

−5.91 < μ < 11.41

HUGE!!!
Sample mean and variance
• Use the distribution Y* of the sample to estimate the distribution of Y in the population

| y* | p*(y*) |
|---|---|
| 6 | .25 |
| -3 | .25 |
| 5 | .25 |
| 3 | .25 |

E*(Y*) = Σ y* p*(y*) = 2.75
V*(Y*) = Σ [y* − E*(Y*)]² p*(y*) = 12.1875
Sample with Replacement

| Sample | Y1* | Y2* | Y3* | Y4* | Ȳ* |
|---|---|---|---|---|---|
| 1 | 6 | 6 | 6 | 6 | 6.00 |
| 2 | 6 | 6 | 6 | -3 | 3.75 |
| 3 | 6 | 6 | 6 | 5 | 5.75 |
| … | | | | | |
| 100 | -3 | 5 | 6 | 3 | 2.75 |
| 101 | -3 | 5 | -3 | 6 | 1.25 |
| … | | | | | |
| 255 | 3 | 3 | 3 | 5 | 3.50 |
| 256 | 3 | 3 | 3 | 3 | 3.00 |
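The enumeration can be reproduced directly: all 4⁴ = 256 bootstrap samples of the four observed differences, and the distribution of their means (a quick numpy check):

```python
from itertools import product
import numpy as np

diffs = [6, -3, 5, 3]                       # the four observed differences
samples = list(product(diffs, repeat=4))    # all 4^4 = 256 bootstrap samples
means = np.array([np.mean(s) for s in samples])

n_samples = len(means)                      # 256
grand_mean = means.mean()                   # 2.75, matching the table
boot_se = means.std()                       # about 1.745 (the exact bootstrap SE)
```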
Calculating the CI
• Mean of the 256 bootstrap means is 2.75, but the SE is known rather than estimated (no hat, since it is computed over all nⁿ bootstrap samples):

SE*(Ȳ*) = √( Σ_b (Ȳ*_b − 2.75)² / nⁿ ) = 1.745

• Compare with the estimated standard error ŜE(Ȳ) = S/√n = 2.015
So what?
• We already knew that!
• But with the bootstrap
– Confidence intervals can be more accurate
– It can be used for non-linear statistics without known standard-error formulas
The population is to the sample as the sample is to the bootstrap samples.

In practice (as opposed to the previous example), not all possible bootstrap samples are enumerated; a large number are drawn at random instead.
Procedure
• 1. Specify the data-collection scheme that results in the observed sample: Collect(population) → sample
• 2. Use the sample as if it were the population (sampling with replacement): Collect(sample) → bootstrap sample 1, bootstrap sample 2, etc.
Cont.
• 3. For each bootstrap sample, calculate the estimate you are looking for
• 4. Use the distribution of the bootstrap estimates to estimate the properties of the sampling distribution of the estimate
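Steps 1–4 reduce to a few lines when only B random bootstrap samples are drawn rather than all of them (a minimal sketch; B and the seed are arbitrary choices):

```python
import numpy as np

def bootstrap_se(sample, statistic, B=2000, seed=0):
    """SE of `statistic`, from B resamples of `sample` with replacement."""
    rng = np.random.default_rng(seed)
    n = len(sample)
    stats = np.array([statistic(rng.choice(sample, size=n, replace=True))
                      for _ in range(B)])
    return stats.std(ddof=1)

se = bootstrap_se(np.array([6, -3, 5, 3]), np.mean)
```

With the four differences from the earlier example, `se` lands near the exact enumerated value of 1.745.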
Bootstrap Confidence Intervals
• Normal theory
• Percentile intervals
– Example: a 95% interval is obtained by taking
– Lower = the (0.025 × B)-th ordered bootstrap replicate
– Upper = the (0.975 × B)-th ordered bootstrap replicate
• There are corrections for bootstrap intervals
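A percentile interval in this style (a sketch; B is arbitrary and no corrections are applied):

```python
import numpy as np

def percentile_ci(sample, statistic, B=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: the alpha/2 and 1 - alpha/2 quantiles
    of the ordered bootstrap replicates."""
    rng = np.random.default_rng(seed)
    n = len(sample)
    stats = np.array([statistic(rng.choice(sample, size=n, replace=True))
                      for _ in range(B)])
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])

lo, hi = percentile_ci(np.array([6, -3, 5, 3]), np.mean)
```

On the four-observation example the interval brackets the sample mean of 2.75 and stays within the range of possible resample means.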
Bootstrapping Linear Regression
The observed estimate is usually the coefficient(s); there are (at least) two ways of doing this:
• Resample observations (the usual approach) and re-regress (X will vary)
• Resample residuals (X is fixed; Y* = Ŷ + E* is the new dependent variable; re-regress with X fixed)
– Assumes errors are identically distributed
– The impact of high-leverage outliers may be lost
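Both schemes, sketched on made-up data with ordinary least squares (an illustration, not code from the lecture):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta = ols(X, y)
resid = y - X @ beta

def boot_cases(B=1000):
    """Scheme 1: resample (x, y) pairs, so X varies across replicates."""
    return np.array([ols(X[i], y[i])
                     for i in (rng.integers(0, n, size=n) for _ in range(B))])

def boot_residuals(B=1000):
    """Scheme 2: X fixed; y* = fitted value + resampled residual."""
    return np.array([ols(X, X @ beta + rng.choice(resid, size=n, replace=True))
                     for _ in range(B)])

slopes_cases = boot_cases()[:, 1]
slopes_resid = boot_residuals()[:, 1]
```

Both replicate distributions center near the fitted slope; the residual scheme holds the design fixed, which is exactly why a high-leverage outlier's impact can disappear from its replicates.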
Bootstrap for other methods
• Used in other classification methods (neural networks, classification trees, etc.)
• Especially useful when the sample size is small and no distributional assumptions can be made
• Same principles apply
Other resampling methods
• Jackknife (leave one out) is a special case of the bootstrap
– Resamples without one case and without replacement (samples have size n − 1)
• Cross-validation
– Divides the data into training and test sets
• Generally used to estimate confidence intervals on predictions for the “full” model (i.e., the model that used all cases)
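The jackknife in this sense can be sketched for the mean of the four differences used earlier (for the mean, the jackknife SE coincides with the usual S/√n):

```python
import numpy as np

def jackknife_se(sample, statistic):
    """SE from the n leave-one-out estimates (size n-1, no replacement)."""
    n = len(sample)
    loo = np.array([statistic(np.delete(sample, i)) for i in range(n)])
    return np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2))

se = jackknife_se(np.array([6, -3, 5, 3]), np.mean)
```

The (n − 1)/n inflation factor compensates for the leave-one-out estimates being much less variable than estimates from independent samples.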