CSI5388 Error Estimation: Re-Sampling Approaches

Transcript of CSI5388 Error Estimation: Re-Sampling Approaches.

Page 2: Error Estimation

Error estimation is concerned with establishing whether the results we have obtained on a particular experiment are representative of the truth or whether they are meaningless.

Traditionally, error estimation was performed using the classical parametric (and sometimes non-parametric) tests discussed in Lecture 5.

More recently, however, new tests based on re-sampling methods have emerged for error estimation. These have the advantage of not making distributional assumptions the way parametric tests do. The tradeoff, though, is that such tests require high computational power.

Page 3: Traditional Statistical Methods versus Resampling Methods

Classical parametric tests compare observed statistics to theoretical sampling distributions.

Resampling makes statistical inferences based upon repeated sampling within the same sample.

Resampling methods stem from Monte Carlo simulations, but differ from them in that they are based upon some real data; Monte Carlo simulations, on the other hand, can be based on completely hypothetical data.

Page 4: Error Estimation through Resampling Techniques in Machine Learning

Error estimation through re-sampling techniques is concerned with finding the best way to utilize the available data to assess the quality of our algorithms.

In other words, we want to make sure that our classifiers are tested on a variety of instances, within our sample, presenting different types of properties, so that we do not mistake good performance on one type of instance for good performance across the entire domain.

Page 5: Re-Sampling Approaches

We will be considering four different kinds of sampling approaches, with some of their variations in each case:
• Cross-validation
• Jackknife (Leave-one-out)
• Bootstrapping
• Randomization

The following discussion is based on:
• Yu, Chong Ho (2003): Resampling methods: concepts, applications and justification. Practical Assessment, Research & Evaluation, 8(19).
• http://www.uvm.edu/~dhowell/StatPages/Resampling/Resampling.html
• [Weiss & Kulikowski, 1991, Chapter 2]

Page 6: Cross-Validation

A sample is randomly divided into two or more subsets, and test results are validated by comparing across sub-samples.

The purpose of cross-validation is to find out whether the result is replicable or whether it is just a matter of random fluctuations.

If the sample size is small, there is a chance that the results obtained are just artifacts of the sub-sample. In such cases, the jackknife procedure is preferred.
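The procedure above can be sketched in a few lines. The 5-fold split, the toy data, and the majority-vote "classifier" below are illustrative assumptions, not something prescribed by the slides:

```python
import random

def k_fold_error(data, labels, train_and_test, k=5, seed=0):
    """Estimate error by k-fold cross-validation: randomly split the
    sample into k folds, then test on each fold while training on the
    other k-1 folds."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        errors.append(train_and_test(
            [data[i] for i in train], [labels[i] for i in train],
            [data[i] for i in fold], [labels[i] for i in fold]))
    return sum(errors) / k

def majority_rule(xtr, ytr, xte, yte):
    # Stand-in classifier: always predict the most frequent training label.
    pred = max(set(ytr), key=ytr.count)
    return sum(1 for y in yte if y != pred) / len(yte)

data, labels = list(range(20)), [0] * 14 + [1] * 6
print(k_fold_error(data, labels, majority_rule))  # 0.3: the 6 minority cases always land in test folds
```

Comparing the per-fold errors, rather than only their mean, is what tells us whether the result replicates across sub-samples or is a random fluctuation.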

Page 7: Jackknife

In the Jackknife or Leave-One-Out approach, rather than splitting the data set into several subsamples, all but one sample is used for training and the testing is done on the remaining sample. This procedure is repeated for all the samples in the data set.

The procedure is preferable to cross-validation when the distribution is widely dispersed or in the presence of extreme scores in the data set.

In the two cases mentioned above, the estimate produced by the Jackknife approach is less biased than that of cross-validation.
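Leave-one-out is cross-validation taken to the limit of n folds of size one; a minimal sketch, again with an illustrative majority-vote classifier of my own rather than anything from the slides:

```python
def leave_one_out_error(data, labels, train_and_test):
    """Jackknife / leave-one-out: train on all cases but one, test on the
    single held-out case, and repeat so that every case is tested once."""
    n = len(data)
    total = 0.0
    for i in range(n):
        xtr, ytr = data[:i] + data[i + 1:], labels[:i] + labels[i + 1:]
        total += train_and_test(xtr, ytr, [data[i]], [labels[i]])
    return total / n

def majority_rule(xtr, ytr, xte, yte):
    # Stand-in classifier: always predict the most frequent training label.
    pred = max(set(ytr), key=ytr.count)
    return sum(1 for y in yte if y != pred) / len(yte)

data, labels = list(range(10)), [0] * 7 + [1] * 3
print(leave_one_out_error(data, labels, majority_rule))  # 0.3: the three minority cases are always misclassified
```

Because every case is held out exactly once, no single extreme score can dominate an entire test fold, which is why the estimate is less sensitive to dispersed distributions than a coarse split.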

Page 8: Bootstrapping and Randomization: Main Ideas

Bootstrapping makes the assumption that the sample is representative of the original distribution, and creates a thousand or more bootstrapped samples by drawing, with replacement, from that pseudo-population.

Randomization makes the same assumption but, instead of drawing samples with replacement, reorders (shuffles) the data systematically or randomly a thousand times or more, calculating the appropriate test statistic on each reordering.

Since shuffling the data amounts to sampling without replacement, the issue of replacement is one factor that differentiates the two approaches.

Page 9: Bootstrapping

One can understand the concept of bootstrapping by thinking of what can be done when not enough is known about the data.

For example, let us assume that we don't know the standard error of the difference of medians. One solution consists of drawing many pairs of samples, calculating and recording the difference between the medians for each pair, and outputting the standard deviation of these differences in lieu of the standard error of the difference of medians.

In other words, bootstrapping consists of using an empirical, brute-force solution when no analytical solution is available.

Bootstrapping is also very useful in cases where the sample is too small for techniques such as cross-validation or leave-one-out to provide a good estimate, due to the large variance a small sample will cause in such procedures.
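The median example above can be written as a short bootstrap loop. The resample count of 1,000 and the two made-up groups are illustrative assumptions; the slide does not specify either:

```python
import random
import statistics

def boot_se_median_diff(a, b, n_boot=1000, seed=0):
    """Bootstrap stand-in for the standard error of the difference of
    medians: resample each group with replacement, record the difference
    of medians for each pair of resamples, and return the standard
    deviation of those recorded differences."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        ra = [rng.choice(a) for _ in a]   # resample group a, with replacement
        rb = [rng.choice(b) for _ in b]   # resample group b, with replacement
        diffs.append(statistics.median(ra) - statistics.median(rb))
    return statistics.stdev(diffs)

group_a = [12, 14, 15, 15, 16, 19, 23, 30]   # hypothetical data
group_b = [10, 11, 11, 13, 14, 14, 18, 21]
print(boot_se_median_diff(group_a, group_b))  # spread of the resampled median differences
```

No analytical formula for this standard error is needed: the brute-force loop is the estimate.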

Page 10: The e0 Bootstrap Estimator

For the e0 bootstrap, we create training and testing sets as follows:

Given a data set consisting of n cases, we draw, with replacement, n samples from that set to form the training set. The set of cases not included in the training set forms the testing set.

The error rate obtained on the test set represents one estimate of e0.

We repeat this procedure a number of times (between 50 and 200) and take the average of the results obtained for e0.
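A sketch of the e0 procedure, reusing an illustrative majority-vote classifier of my own (the slides do not fix a learner or a data set):

```python
import random

def e0_bootstrap(data, labels, train_and_test, iters=100, seed=0):
    """e0 bootstrap: draw n cases with replacement as the training set;
    the cases never drawn (about 36.8% of them, on average) form the
    test set.  Repeat and average the test-set error rates."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(iters):
        picked = [rng.randrange(n) for _ in range(n)]
        in_train = set(picked)
        out = [i for i in range(n) if i not in in_train]
        if not out:               # rare: every case was drawn, no test set
            continue
        estimates.append(train_and_test(
            [data[i] for i in picked], [labels[i] for i in picked],
            [data[i] for i in out], [labels[i] for i in out]))
    return sum(estimates) / len(estimates)

def majority_rule(xtr, ytr, xte, yte):
    # Stand-in classifier: always predict the most frequent training label.
    pred = max(set(ytr), key=ytr.count)
    return sum(1 for y in yte if y != pred) / len(yte)

data, labels = list(range(30)), [0] * 20 + [1] * 10
print(e0_bootstrap(data, labels, majority_rule, iters=100))  # close to 1/3 for this toy classifier
```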

Page 11: The .632 Bootstrap Estimator

Based on the fact that the expected fraction of non-repeated cases in the training set is .632, the .632 Bootstrap estimator is the following linear combination:

  (1/I) Σ_{i=1..I} (.368 × app + .632 × e0_i)

where app is the apparent error rate (training and testing on the complete data set), and I represents the number of iterations.
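The linear combination can be written directly; the values of app and of the e0 iterations below are made-up numbers purely to show the arithmetic:

```python
def bootstrap_632(app, e0_estimates):
    """.632 bootstrap: average .368 * app + .632 * e0_i over the I
    iterations, blending the optimistic apparent error rate with the
    pessimistic e0 estimates."""
    I = len(e0_estimates)
    return sum(0.368 * app + 0.632 * e0 for e0 in e0_estimates) / I

# app = 0.05 (error when training and testing on the full data set);
# three hypothetical e0 iterations averaging 0.20:
print(bootstrap_632(0.05, [0.20, 0.25, 0.15]))  # .368 * .05 + .632 * .20 = 0.1448
```

Because app is optimistically biased and e0 pessimistically biased, the weighted blend sits between the two.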

Page 12: Bootstrapping versus Cross-Validation

Both e0 and .632B are low-variance estimators.

On the other hand, they are both more biased than cross-validation, which is nearly unbiased but has high variance.
• e0 is pessimistically biased on moderately sized samples. Nonetheless, it gives good results in the case of a high true error rate.
• .632B becomes too optimistic as the sample size grows. However, it is a very good estimator on small data sets, especially if the true error rate is small.

Page 13: Randomization Testing

Randomization testing is closer in spirit to classical parametric testing than bootstrapping is.

Bootstrapping proceeds to estimate the error directly.

In contrast, in randomization testing, as in parametric testing, a null hypothesis gets tested.

The null hypothesis of a randomization test, however, is quite different from that of a parametric test (see the next sets of slides).

Page 14: Randomization versus Parametric Tests I

In randomization, we are not required to have random samples from one or two population(s).

In randomization, we don't think of the distribution from which the data comes. There is no assumption about normality or the like.

The null hypothesis does not concern distributional parameters. Instead, it is worded as a vague, but perhaps more easily interpretable, statement of the kind: "the treatment has no effect on the patient", or perhaps, more precisely, "the score that is associated with a participant is independent of the treatment that person received."

Page 15: Randomization versus Parametric Tests II

In randomization, we are not concerned with estimating or testing population parameters.

Randomization testing does involve the computation of some sort of test statistic, but it does not compare that statistic to tabled distributions.

The test statistic computed in randomization testing gets compared to the results obtained by repeating the same calculation over and over on different randomizations of the data.

Randomization testing is even more concerned with the random assignment of subjects to treatments (for example) than parametric tests are.

Page 16: Randomization: Basic Permutation Test

• Decide on a metric to measure the effect in question. For this example, let's use the t statistic (though several others are possible and equivalent, including the difference between the means).
• Calculate that test statistic on the data (here denoted t_obt).
• Repeat the following N times, where N is a number greater than 1000:
  - Shuffle the data.
  - Assign the first n1 observations to the first condition, and the remaining n2 observations to the second condition.
  - Calculate the test statistic (here denoted ti*) for the reshuffled data.
  - If ti* is greater than t_obt, increment a counter by 1.
• Divide the value in the counter by N to get the proportion of times the t on the randomized data exceeded the t_obt on the data we actually obtained.
• This is the probability of such an extreme result under the null.
• Reject or retain the null hypothesis on the basis of this probability.
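The steps above can be sketched as follows. Two of my own choices are worth flagging: the comparison uses absolute values (a two-sided variant, whereas the slide counts only ti* greater than t_obt), and the statistic here is a simple mean difference, which the slide notes is equivalent to t for this purpose:

```python
import random

def permutation_test(x, y, statistic, n_iter=2000, seed=0):
    """Basic permutation test: compute the statistic on the observed
    grouping, then shuffle the pooled data N times, re-split into the
    original group sizes, and count how often the shuffled statistic is
    at least as extreme (in absolute value) as the observed one."""
    rng = random.Random(seed)
    obt = statistic(x, y)
    pooled = list(x) + list(y)
    extreme = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        if abs(statistic(pooled[:len(x)], pooled[len(x):])) >= abs(obt):
            extreme += 1
    return extreme / n_iter          # estimated two-sided p-value

def mean_diff(a, b):
    return sum(a) / len(a) - sum(b) / len(b)

p = permutation_test([1, 2, 3, 4, 5], [11, 12, 13, 14, 15], mean_diff)
print(p)  # very small: the two groups are perfectly separated
```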

Page 17: Permutation Test Example I: Two Independent Samples

Do drivers about to leave a parking space take more time than necessary if someone else is waiting for that spot?

Ruback and Juieng (1997) observed 200 drivers to answer this question.

Howell reduced the data set to 20 samples under each condition (someone waiting or not), and modified the standard deviation appropriately to maintain the significant difference that they found.

His data is reasonable in that it is positively skewed: a driver can safely leave a space only so quickly but, as we all know, can sometimes take a very long time.

Because the data are skewed, using a parametric t test may be problematic, so we will adopt a randomization test.

Page 18: Permutation Test Example II: The Data and Null Hypothesis

No one waiting:
36.30  42.07  39.97  39.33  33.76  33.91  39.65  84.92  40.70  39.65
39.48  35.38  75.07  36.46  38.73  33.88  34.39  60.52  53.63  50.62

Someone waiting:
49.48  43.30  85.97  46.92  49.18  79.30  47.35  46.52  59.68  42.89
49.29  68.69  41.61  46.81  43.75  46.55  42.33  71.48  78.95  42.06

Null hypothesis: Whether or not someone is waiting for a spot does not affect the speed at which the driver of the car will vacate that spot.

Page 19: Permutation Test Example III: Procedure

If the null hypothesis is true, then any one of the 40 scores listed on the previous slide (e.g., 36.30) is equally likely to appear in the No Waiting condition as in the Waiting condition.

So, when the null hypothesis is true, any random shuffling of the 40 observations is as likely as any other shuffling to represent the outcome for the t-statistic (the difference between the means divided by the standard error of the difference) obtained in this particular sample (here, Howell obtained t_obs = -2.15).

Howell shuffled the data 5,000 times and computed the value of the t-statistic obtained on each of these randomized sets.
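Howell's procedure can be reproduced in a few lines using the data from the previous slide. The pooled-variance t formula and the random seed are my assumptions, so the shuffle-based p-value will only approximate his result, not match it exactly:

```python
import random
from math import sqrt

no_wait = [36.30, 42.07, 39.97, 39.33, 33.76, 33.91, 39.65, 84.92, 40.70, 39.65,
           39.48, 35.38, 75.07, 36.46, 38.73, 33.88, 34.39, 60.52, 53.63, 50.62]
waiting = [49.48, 43.30, 85.97, 46.92, 49.18, 79.30, 47.35, 46.52, 59.68, 42.89,
           49.29, 68.69, 41.61, 46.81, 43.75, 46.55, 42.33, 71.48, 78.95, 42.06]

def t_stat(a, b):
    # Classical two-sample t with pooled variance.
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((v - ma) ** 2 for v in a) / (na - 1)
    vb = sum((v - mb) ** 2 for v in b) / (nb - 1)
    sp = sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / (sp * sqrt(1 / na + 1 / nb))

t_obt = t_stat(no_wait, waiting)      # comes out near -2.15

rng = random.Random(0)
pooled = no_wait + waiting
N, extreme = 5000, 0
for _ in range(N):
    rng.shuffle(pooled)
    if abs(t_stat(pooled[:20], pooled[20:])) >= abs(t_obt):
        extreme += 1
print(round(t_obt, 2), extreme / N)   # p lands in the neighbourhood of .04
```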

Page 20: [Figure: distribution of the t values obtained over the 5,000 randomizations]

Page 21: Permutation Test Example III: Result

The basic question we are asking is: "Would the result that we obtained (here, t = -2.15) be likely to arise if the null hypothesis were true?"

The distribution on the previous slide shows what t values we would be likely to obtain when the null hypothesis is true.

It is clear that the probability of obtaining a t as extreme as ours is very small (actually, only .0396).

At the traditional level of α = .05, we would reject the null hypothesis and conclude that people do, in fact, take a longer time to leave a parking space when someone is sitting there waiting for it.

Page 22: Conclusions I: Reasons for Supporting Resampling Techniques

Resampling methods do not make assumptions about the sample and the population.

Resampling techniques are conceptually clean and simple.

Resampling is useful when sample sizes are small and the distributional assumptions made by classical techniques cannot be made.

Some people argue that resampling techniques will work even if the data sample is not random. Others remain skeptical, however, since non-random samples may not be representative of the population.

Page 23: Conclusions II: Reasons for Supporting Resampling Techniques

Even if a data set meets parametric assumptions, if that set is small, the power of the conclusions in classical statistics will be low. Resampling techniques should suffer less from this.

If the data set is too large, virtually any null hypothesis can be rejected using classical techniques. Cross-validation can help relieve this problem.

Classical procedures do not inform researchers of how likely the results are to be replicated. Cross-validation and bootstrapping can be seen as internal replications (external replication is still necessary for confirmation purposes, but internal replication is useful to establish as well).

Page 24: Conclusions III: Reasons for Criticizing Resampling Techniques

Re-sampling techniques are not devoid of assumptions. The hidden assumption is that the same numbers can be used over and over to get an answer that cannot be obtained in any other way.

Because resampling techniques are based on a single sample, the conclusions do not generalize beyond that particular sample.

Confidence intervals obtained by simple bootstrapping are always biased.

If the collected data is biased, then resampling techniques could repeat and magnify that bias.

If researchers do not conduct enough experimental trials, then the accuracy of resampling estimates may be lower than that of estimates obtained by conventional parametric techniques.