
Notes on Simulations in SAS Studio

If you are not careful about simulations in SAS Studio, you can run into problems. In particular, SAS Studio has a limited amount of memory that you can use to write to the RESULTS tab (where results are normally displayed). If you are doing many t-tests, for example, the output takes a fair bit of memory and SAS Studio is likely to have trouble.

It helps enormously to do something like this

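ods output TTests=pvalues;   /* save the t-test table to the dataset pvalues */
ods select TTests;           /* display only the t-test table */
proc ttest data=sim;
  by iter n;                 /* one test per simulated dataset */
  var x;                     /* analysis variable */
run;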

The ODS SELECT statement reduces the output and increases the speed and number of iterations you can do. In SAS Studio, it can make the difference between your code working and not working.

SAS Programming November 13, 2014 1 / 63

Power: comparing methods

Here’s an example from an empirical paper,




Speed: comparing methods

For large analyses, speed and/or memory might be an issue when choosing between methods and/or algorithms. This paper compared different methods within SAS based on speed for doing permutation tests.


Use of macros for simulations

The author of the previous paper provides an appendix with lengthy macros to use as more efficient replacements for SAS procedures such as PROC NPAR1WAY and PROC MULTTEST, which for his data could crash or not terminate in a reasonable time.

In addition to developing your own macros, a common use of macros is to use macros written by someone else that have not been incorporated into the SAS language. You might just copy and paste the macro into your code, possibly with some modification, and you can use the macro even if you cannot understand it. Popular macros might eventually get replaced by new PROCs or new functionality within SAS. This is sort of the SAS alternative to user-defined packages in R.


From Macro to PROC

An example of an evolution from macros to PROCs is bootstrapping. For several years, to perform bootstrapping, SAS users relied on macros, often written by others. In bootstrapping, you sample your data (or the rows of your data set) with replacement and get a new dataset with the same sample size but with some of the values repeated and others omitted. For example, if your data is

-3 -2 0 1 2 5 6 9

bootstrap replicate data sets might be

-2 -2 1 5 6 9 9 9

-3 0 1 1 2 5 5 6

etc.


From Macro to PROC

Basically, to generate the bootstrap data set, you generate n random integers from 1 to n, with replacement, and extract the corresponding values from your data. This was done using macros, but now can be done with PROC SURVEYSELECT. If you search on the web for bootstrapping, you still might run into one of those old macros.

Newer methods might still be implemented using macros. A webpage from 2012 has a macro for bootstrap bagging, a method of averaging results from multiple classification algorithms.
http://statcompute.wordpress.com/2012/07/14/a-sas-macro-for-bootstrap-aggregating-bagging/

There are also macros for searching the web to download movie reviews or extract data from social media. Try searching on "SAS macro 2013" for interesting examples.


Bootstrapping with PROC SURVEYSELECT



To explain the syntax, everything here is an option. There are no statements within the procedure, which is why there is only one semicolon before the RUN statement.

We create an output dataset by whatever name we want, here outboot. seed is a random number seed. method refers to the type of sampling, which for the bootstrap should be sampling with replacement. If you sampled without replacement, you’d be permuting your observations, but this would have no effect on the mean, median, etc.

samprate is the fraction of the sample you want, which here is 1 for 100% (i.e., we want the same original sample size). outhits writes a separate row to the output each time an observation is selected, so you can see how many times each observation was selected, which isn’t necessary, but interesting to observe. rep is the number of bootstrap datasets, which in this case is set to 1000, which is a typical number. I’ve also seen 100 used a lot for genetics examples that require time-consuming maximum likelihood approaches.
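A call with the options described above might look something like the following. This is a minimal sketch, not the code from the original slide: the dataset name mydata, the variable x, and the seed value are placeholders chosen for illustration (the eight values are the small example data from earlier), and METHOD=URS is the SURVEYSELECT name for unrestricted random sampling, i.e., sampling with replacement.

```sas
/* Hypothetical example data: the eight values used earlier */
data mydata;
  input x @@;
  datalines;
-3 -2 0 1 2 5 6 9
;
run;

/* Every setting is an option on the PROC statement itself */
proc surveyselect data=mydata out=outboot
     seed=2014      /* arbitrary seed, for reproducibility */
     method=urs     /* sample with replacement */
     samprate=1     /* each replicate has the original sample size */
     outhits        /* one output row per selection */
     rep=1000;      /* number of bootstrap replicates */
run;
```

With 8 observations and 1000 replicates, outboot should contain 8000 rows, identified by the Replicate column.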



Opening the outboot dataset.



Things to note:

- The number of men versus women is a random variable in the replicated datasets. It is not fixed to be the same as the original data set, but will be the same on average. However, the data set seems to be sorted by sex.

- The number of times an observation is repeated is in the column NumberHits. If this number is 4, for example, the same row occurs four times in a row. If an observation is selected 0 times, it doesn’t show up in the dataset.

- The replicate is indicated in a column called Replicate. This is similar to the structure of the data sets we used to simulate power analyses, with (conceptually) multiple datasets simulated within a single SAS dataset.



The idea behind bootstrapping is that we can use the simulated data sets to get a simulated distribution of sample statistics: sample means, sample standard deviations, sample medians, sample coefficient of variation (s/x̄), interquartile range, etc.

We have theory to tell us the distribution of X̄, which is normal in most cases with large sample sizes. The distribution of the sample median, the 95th percentile, the range, and so forth is more difficult theoretically and will depend on the underlying distribution, so bootstrapping can be useful for this purpose.


Bootstrapping

The idea behind bootstrapping is that if we don’t know what the underlying population is, then our sample is our best guess at what the underlying population is. The idea then is to draw samples from our initial sample as if we were drawing multiple samples from a population. This should work well if our sample is representative of the population we are making inferences about.

When shouldn’t this work well? If the sample size is too small, then the sample won’t do a good job of representing the entire population, particularly the extremes of a distribution. Bootstrap samples don’t extrapolate beyond the original sample, so a sample of 100 observations might not do a good job of estimating the 99th percentile or even the 95th percentile of a distribution. A sample that is biased will also not be corrected by using a bootstrap.


Bootstrapping

We can now think about how to use the bootstrapped data that PROC SURVEYSELECT creates. Suppose we want to estimate the population median. A reasonable guess for the population median is the sample median, assuming that we don’t know anything about the distribution. (Is this always the case? No. What is the best guess for the population median when sampling from a normal distribution?)

The more interesting application of bootstrapping is to get some form of confidence interval around your estimate, or an estimated standard error.


Bootstrapping confidence intervals: percentiles



PROC UNIVARIATE gives enough information to give a 90% bootstrap interval and a 98% interval, but for the 95% interval, we need the 2.5% and 97.5% quantiles.

The 90% interval is (98.2, 98.4) for the median, which is pretty narrow. The 98% interval is (98.1, 98.5). To get the 95% interval we need to do a little more work to get customized percentiles out of PROC UNIVARIATE (using the output with options shown), or we can generate them a different way.


Customized percentiles in PROC UNIVARIATE

The 95% interval is (98.15, 98.4). Note that the 90% and 98% intervals were symmetric around the sample median of 98.3 but the 95% interval was not.
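One way to request customized percentiles is the PCTLPTS= option on the OUTPUT statement of PROC UNIVARIATE. The sketch below is illustrative rather than the code from the slide: the names mydata, x, bootmed, med, and pctl are hypothetical, and the small eight-value data set stands in for the temperature data.

```sas
/* Hypothetical example data */
data mydata;
  input x @@;
  datalines;
-3 -2 0 1 2 5 6 9
;
run;

/* 1000 bootstrap replicates, sampled with replacement */
proc surveyselect data=mydata out=outboot seed=2014
     method=urs samprate=1 outhits rep=1000 noprint;
run;

/* One median per replicate */
proc means data=outboot noprint;
  by Replicate;
  var x;
  output out=bootmed median=med;
run;

/* Request the 2.5th and 97.5th percentiles of the bootstrap medians */
proc univariate data=bootmed noprint;
  var med;
  output out=pctl pctlpts=2.5 97.5 pctlpre=P;
run;

proc print data=pctl;
run;
```

With PCTLPRE=P, the two requested percentiles are stored in variables named P2_5 and P97_5, which together give the 95% percentile interval.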


Getting the percentiles by sorting

Another way to get the percentiles is to sort the 1000 replicates and get the 26th and 975th ordered observations.


What are the percentiles?

It’s a little tricky to get the right percentiles. Should it be observation 25 or 26, 975 or 976 or 974?

My reasoning was that I wanted to get the middle 950 observations, so that there were 25 observations to the left of my interval and 25 observations to the right of my interval. The exact number people use varies, though, so sometimes you’ll see people use the 25th and 975th observations in the sorted data. This usually won’t make much difference.


What are percentiles?

There’s a nice function in R to help you find the percentiles. It interpolates between numbers, though, and has 9 different algorithms (you can specify which) to define the quantile. The R function apparently includes the SAS interpretation as one of the types. Here are some examples:

> x <- 1:1000
> quantile(x,.025)
  2.5%
25.975
> quantile(x,.975)
  97.5%
975.025
> quantile(x,.025,type=3) # type 3 is for SAS
2.5%
  25
> quantile(x,.975,type=3)
97.5%
  975

Interpreting Bootstrap CIs

Interpreting bootstrap CIs is a little unclear. Is it a probability? It is easiest to think of a bootstrap CI as a plausible range of values for the parameter.

Of course the same might be said of frequentist CIs. If I say that a 95% CI (based on the theory for normal distributions) for µ is (1.3, 2.1), this does NOT mean that there is a 95% chance that µ is between 1.3 and 2.1, since this would be treating µ as a random variable rather than a parameter.



What we hope to be the case for a frequentist CI is that 95% of the time, a 95% CI captures the population mean. The idea is that if there are many samples, then before you look at the data, you expect 95% of the CIs constructed from the different samples to capture the population mean.

As an example, if there are multiple polls for the proportion of people who support, say, Hillary Clinton for the next presidential election, conducted by CNN, ABC, Fox News, MSNBC, The New York Times, etc., then hopefully 95% of those polls will have the true percentage in their confidence intervals, assuming those polls are independent. On the frequentist way of looking at things, once the intervals have been constructed, any individual poll either captures this true proportion or it does not. But probability statements don’t make sense unless the proportion is a random variable.



Bootstrap CIs are similar to frequentist CIs in this respect: if the sample is representative, then approximately 100(1 − α)% of the time, they capture the parameter value they are estimating.


Interpreting CIs

How can we test how well a confidence interval (of any type) does? One thing we can do is estimate its coverage probability. That is, we generate many samples, construct a CI for each of them, and check how often it covers the true parameter. A well-constructed 95% confidence interval should cover the true parameter 95% of the time. For examples where you reject H0 : µ = µ0 if and only if the confidence interval does not include µ0, the coverage probability is the flip side of the type 1 error. If the confidence interval includes µ0 95% of the time when H0 is true, then you reject H0 : µ = µ0 exactly 5% of the time.
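As an illustration of estimating a coverage probability, the sketch below simulates many normal samples, builds the usual t-based 95% CI for the mean in each, and reports the fraction of intervals that contain the true mean. The settings (2000 samples of size 30 from a standard normal, so the true mean is 0) and all dataset and variable names are assumptions made for this example; they are not from the original notes.

```sas
/* Simulate 2000 samples of size 30 from N(0,1); the true mean is 0 */
data sim;
  call streaminit(2014);          /* arbitrary seed */
  do iter = 1 to 2000;
    do j = 1 to 30;
      x = rand("Normal", 0, 1);
      output;
    end;
  end;
  drop j;
run;

/* 95% t-based confidence limits for the mean of each sample */
proc means data=sim noprint;
  by iter;
  var x;
  output out=ci lclm=lower uclm=upper;
run;

/* Indicator: does the interval contain the true mean? */
data cover;
  set ci;
  covered = (lower <= 0 <= upper);
run;

/* The mean of the indicator estimates the coverage probability */
proc means data=cover mean;
  var covered;
run;
```

The same structure should work for a bootstrap interval: replace the PROC MEANS step with whatever constructs the interval for each simulated sample.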


Bootstrap estimate of the standard error

The bootstrap estimate of the standard error is obtained by taking the sample standard deviation (the square root of the sample variance) of your statistic, where the sample variance is computed across bootstrap replicates. For example, suppose my bootstrap median values are $m_1, m_2, \ldots, m_B$, where B = 1000 is the number of replicates. Then the bootstrap estimate of the standard error of the sample median is

$$\widehat{se}(m) = \sqrt{\frac{1}{B-1}\sum_{i=1}^{B} (m_i - \bar{m})^2}$$

where $\bar{m}$ is the arithmetic average of the bootstrap medians.
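In SAS, this calculation amounts to two PROC MEANS steps: one to get the median within each replicate, and one to take the standard deviation of those medians. The sketch below uses hypothetical names (mydata, x, bootmed, med, bootse) and the small eight-value example data rather than the temperature data.

```sas
/* Hypothetical example data */
data mydata;
  input x @@;
  datalines;
-3 -2 0 1 2 5 6 9
;
run;

proc surveyselect data=mydata out=outboot seed=2014
     method=urs samprate=1 outhits rep=1000 noprint;
run;

/* First PROC MEANS: the median of each bootstrap replicate */
proc means data=outboot noprint;
  by Replicate;
  var x;
  output out=bootmed median=med;
run;

/* Second PROC MEANS: the standard deviation of the 1000 medians
   is the bootstrap estimate of the standard error of the median */
proc means data=bootmed n mean std;
  var med;
  output out=bootse std=se_median;
run;
```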


Bootstrap standard error



This is the outboot data set and the first PROC MEANS output.



This is the meanboot3 data set and second PROC MEANS output.



Bootstrap standard error CI

From this estimate of the standard error, we can construct a 95% interval, based on using

98.3 ± 1.96(0.08619) = (98.13, 98.47)

which is similar to the percentile-based interval but slightly wider. In my experience I have mostly seen the percentile-based interval used for bootstrap CIs.


Hypothesis Testing

You can also use a bootstrapping framework to do hypothesis testing. Suppose you want to test the difference in two medians for two populations. The null hypothesis is that the two populations have the same median, so H0 : η1 = η2, while the alternative is HA : η1 ≠ η2. Using the bootstrap, we can estimate the standard errors for each group and the sample medians, m1 and m2, for each group. Assuming the two samples are independent, the standard error for the difference is the square root of the sum of the squared standard errors:

$$se(m_1 - m_2) = \sqrt{se(m_1)^2 + se(m_2)^2}$$

Using bootstrap estimates of se(m1) and se(m2), this can be used to test whether the difference m1 − m2 is significantly different from 0.
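One common way to carry out this test, sketched here under the added assumption that the difference in sample medians is approximately normally distributed, is to form a z-type statistic from the bootstrap standard errors:

$$z = \frac{m_1 - m_2}{\sqrt{\widehat{se}(m_1)^2 + \widehat{se}(m_2)^2}}$$

and to reject H0 at the 5% level when |z| > 1.96. The normal approximation is an assumption of this sketch, not something the bootstrap guarantees; it may be poor for small samples.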


Bootstrapping in R

Perhaps not surprisingly, bootstrapping is a little easier in R, largely because of a function called sample(), which allows you to sample with or without replacement. Suppose my temperatures are in temperature. Then

B <- 1000                  # number of bootstrap replicates
bootmedian <- numeric(B)   # storage for the bootstrap medians
for(b in 1:B) {
  # resample with replacement, same size as the original sample
  bootmedian[b] <- median(sample(temperature,replace=TRUE))
}
bootmedian <- sort(bootmedian)
ci <- c(bootmedian[26],bootmedian[975])   # middle 950 of the sorted medians
print(ci)

This is sufficient to generate your bootstrapped medians and the percentile interval. Of course you can also do sd(bootmedian) to get the bootstrap standard error.


Why do we use the sample mean instead of the sample median?

Suppose we are sampling from a symmetric distribution. Why do we use the sample mean instead of the sample median? Both are unbiased estimators of the center of the distribution.

A basic answer is that for most distributions, the sample median is more variable than the sample mean, so for finite sample sizes, the mean is more precise. Since the variance of the sample median is difficult to determine, this could be investigated by simulation, either by simulating many samples from the same distribution and computing the standard deviations of the sample means and sample medians, or by using the bootstrap estimates of the standard errors if you are working with one sample.

For the temperature data, it actually doesn’t make much difference. A normal-based 95% confidence interval for the mean temperature is (98.12, 98.38) (based on PROC TTEST), and the mean temperature is a bit lower than the median, being 98.249 instead of 98.3.


What if there were no PROC SURVEYSELECT?

PROC SURVEYSELECT was introduced in SAS version 6, and hasn’t always been around. What would you do if it weren’t available? Bootstrapping was invented in the 1970s, long before PROC SURVEYSELECT, and generally, statistical methods will be invented before there is a tidy SAS procedure for them. This is part of why being able to program can be important.

In the past, people used macros to accomplish bootstrapping. How would you do this? First think about how you would generate one bootstrap replicate data set.


Bootstrap “by hand”

Here is one way to create a single bootstrap dataset. To create many, you could loop over this code with a macro.
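Since the slide's code is not reproduced in these notes, the following is one possible sketch of a single bootstrap replicate built in a DATA step, not necessarily the approach shown in class. It draws a random row number from 1 to n and reads that row directly with the POINT= option. The names mydata, x, boot1, and pick are hypothetical.

```sas
/* Hypothetical example data */
data mydata;
  input x @@;
  datalines;
-3 -2 0 1 2 5 6 9
;
run;

data boot1;
  call streaminit(2014);               /* arbitrary seed */
  do i = 1 to n;                       /* n is filled in by NOBS= at compile time */
    pick = ceil(n * rand("Uniform"));  /* random row number from 1 to n */
    set mydata point=pick nobs=n;      /* read that row directly */
    output;
  end;
  stop;                                /* needed to end the step when using POINT= */
  drop i;
run;
```

The resulting boot1 has the same number of rows as mydata, with some values repeated and others omitted.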


Bootstrapping with a macro

In most cases, it is more efficient to create a giant dataset with all of your bootstrap replicates together, then summarize using a CLASS statement in PROC MEANS or some other PROC.

However, if you had 1 million observations and wanted 1000 bootstrap replicates, this would create a dataset with 1 billion observations. With the macro approach, you can extract the information you need from each bootstrap data set (median, quantile, etc.), then save that to a dataset, and rewrite your bootstrap data set, so that you never use more space than 1 million observations at a time. There is no need to save every bootstrap dataset. Sometimes there is a tradeoff between speed and memory.
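The memory-saving strategy just described could be sketched as a macro loop like the one below. It is only an illustration under assumed names (the macro bootloop, the datasets mydata, oneboot, onemed, and bootmed, and the variable x are all hypothetical): each pass overwrites the single replicate dataset, keeps only its median, and appends that one row to a running results dataset.

```sas
/* Hypothetical example data */
data mydata;
  input x @@;
  datalines;
-3 -2 0 1 2 5 6 9
;
run;

%macro bootloop(B=200, seed=579);
  /* start with an empty results dataset */
  proc datasets library=work nolist nowarn;
    delete bootmed;
  quit;

  %do b = 1 %to &B;
    /* one bootstrap replicate; overwritten on every pass */
    data oneboot;
      call streaminit(&seed + &b);      /* a different seed each pass */
      do i = 1 to n;
        pick = ceil(n * rand("Uniform"));
        set mydata point=pick nobs=n;
        output;
      end;
      stop;
      drop i;
    run;

    /* keep only the statistic we need from this replicate */
    proc means data=oneboot noprint;
      var x;
      output out=onemed median=med;
    run;

    /* append the single-row summary to the running results */
    proc append base=bootmed data=onemed;
    run;
  %end;
%mend bootloop;

%bootloop(B=200);
```

At no point does more than one replicate exist on disk; bootmed ends up with one row per replicate. The cost is speed, since each pass launches several steps.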


Bootstrapping and outliers

One interesting thing to think about is what happens when you use bootstrapping and there is an outlier in your data. Ordinarily, we want our inferences to be good for the population we are sampling from, and not sensitive to the idiosyncrasies of the sample we happened to collect. In other words, if we collect a new sample from the same population, we’d like our inferences to be stable.

Bootstrapping simulates this idea of getting a slightly different sample from the same population. Some of the same values will be repeated, and some of them will be left out. This leads back to the question about outliers. Sometimes an outlier will be in your bootstrap replicate, sometimes it won’t. If an outlier is seriously affecting your inferences, this can show up by your bootstrap inferences not being very stable.



What is the probability that an outlier is in one particular bootstrap replicate?

Let’s say you have a sample of size 5. The probability that the outlier is NOT chosen is (4/5)(4/5)(4/5)(4/5)(4/5), or $\left(\frac{5-1}{5}\right)^5$. What if the sample size is n?

$$\left(\frac{n-1}{n}\right)^n = \left(1 - \frac{1}{n}\right)^n$$

What is this value for large n?

$$\lim_{n\to\infty}\left(1 - \frac{1}{n}\right)^n = e^{-1}$$

So for large n, the probability that the outlier IS in the dataset is approximately $1 - e^{-1} = 0.632$, a little less than two-thirds.
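To see how quickly the limit takes over, here are two worked values (the second uses n = 130, the size of the temperature data set):

$$n = 5:\quad \left(\tfrac{4}{5}\right)^5 = 0.32768, \qquad 1 - 0.32768 = 0.67232$$

$$n = 130:\quad \left(\tfrac{129}{130}\right)^{130} \approx 0.366, \qquad 1 - 0.366 \approx 0.634$$

So even for n = 5 the probability that the outlier appears in a given replicate is already within about 0.04 of the limiting value 0.632.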


Parametric bootstrapping

The type of bootstrapping we’ve been doing is often called nonparametric bootstrapping, whereas parametric bootstrapping involves simulating samples from a known distribution and calculating simulated test statistics from these samples. This is also a useful procedure.

Often the parametric bootstrap is based on simulating from a distribution that is estimated from the data. Suppose you believe that your data is normally distributed, and you want to know what the distribution of the sample coefficient of variation should be from the population you sampled from. You could use the nonparametric bootstrap as we did for the median, or you could assume that your data is normal, use the mean and variance estimated from your data, and draw samples from a normal distribution to simulate the distribution of the coefficient of variation from this normal population with the same sample size as you obtained in your data.


Parametric bootstrapping the coefficient of variation

First, we’ll just look at computing the coefficient of variation from the data in SAS for the temperature data. There are several ways to do this. We could use macro variables to store the mean and standard deviation using the output of PROC MEANS, or we can create a dataset that computes them.

I’ll use the second approach first, which will remind us how to use a RETAIN statement, but the macro variable approach is just as good.


Computing the coefficient of variation



Note that the dataset cov just has one variable.



If we want to do a parametric bootstrap assuming that temperatures are normally distributed with the same mean and standard deviation as we’ve observed in our data, then it would help to have the mean and standard deviation stored as macro variables so that we can use those values whenever we want, or we could just hard code them.



To use a macro variable, we use CALL SYMPUT. We used this in the notes for week 11, but that was probably easy to forget. (I forgot the syntax myself and had to look it up again....)

First, we’ll just modify the previous code to store the mean and standard deviation as macro variables.
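As a reminder of the general pattern (this is a sketch with hypothetical names, not the code from the slide), CALL SYMPUT takes the name of the macro variable and the text to store in it. Here PROC MEANS writes the mean and standard deviation to a one-row dataset, and a DATA _NULL_ step copies them into the macro variables mu and sigma. The dataset mydata and variable x stand in for the temperature data.

```sas
/* Hypothetical example data */
data mydata;
  input x @@;
  datalines;
-3 -2 0 1 2 5 6 9
;
run;

/* One-row dataset holding the sample mean and standard deviation */
proc means data=mydata noprint;
  var x;
  output out=stats mean=xbar std=s;
run;

/* Copy the values into macro variables; PUT converts the numbers to text */
data _null_;
  set stats;
  call symput("mu",    strip(put(xbar, best16.)));
  call symput("sigma", strip(put(s,    best16.)));
run;

/* Write the stored values to the log */
%put mu=&mu sigma=&sigma;
```

After this runs, &mu and &sigma can be used anywhere later in the program, for example as arguments to a random number function.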



Note that the dataset cov just has one variable.


Simulating the coefficient of variation

Next we want to simulate normally distributed samples of size 130 (the same size as our original data) from a normal distribution with mean 98.2492308 and standard deviation 0.7331832.

Since we’re using the actual values generated internally in SAS, we’ll actually be using as many digits of precision as SAS uses. If you hard coded the values yourself, you’d have to decide how many digits of precision you wanted to use for your parameters.



Now that we have the mean and standard deviation saved, we can simulate from a normal distribution with those parameters. The point of doing this is to get an idea of how variable the sample coefficient of variation is when you have samples of size 130 from a normal with the same mean and variance as your data.

If you simulate data from a normal distribution with this mean and standard deviation, and the coefficient of variation is never close to your sample coefficient of variation, this suggests that a normal distribution is not a good fit to your data. The parametric bootstrap is sometimes used as a kind of goodness-of-fit test.
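A parametric bootstrap along these lines could be sketched as follows. To keep the example self-contained, the mean and standard deviation are hard coded with %LET using the values quoted above, rather than taken from CALL SYMPUT; the number of simulated samples (1000), the seed, and the names simnorm, simcv, rep, and cov are assumptions made for illustration.

```sas
/* Parameters quoted above, hard coded for this sketch */
%let mu    = 98.2492308;
%let sigma = 0.7331832;

/* 1000 simulated samples of size 130 from Normal(mu, sigma) */
data simnorm;
  call streaminit(130);       /* arbitrary seed */
  do rep = 1 to 1000;
    do j = 1 to 130;
      temperature = rand("Normal", &mu, &sigma);
      output;
    end;
  end;
  drop j;
run;

/* Coefficient of variation of each simulated sample.
   The CV statistic in PROC MEANS is a percentage, 100*s/xbar. */
proc means data=simnorm noprint;
  by rep;
  var temperature;
  output out=simcv cv=cv_pct;
run;

/* Convert from a percentage to the ratio s/xbar */
data simcv;
  set simcv;
  cov = cv_pct / 100;
run;

/* Center and spread of the simulated c.o.v. values */
proc means data=simcv mean std;
  var cov;
run;
```

The standard deviation reported in the last step is the parametric bootstrap estimate of the standard error of the c.o.v., and a histogram of cov shows its simulated distribution.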



Where does our sample coefficient of variation lie on this plot? It is very close to the center, so it is not a surprising value for the sample c.o.v. for this normal distribution. This is probably not the most powerful test for whether or not the data is normal, but sometimes this kind of procedure suggests that the data are not consistent with a particular distribution. In any case, your main interest might lie in the c.o.v. rather than the normality of the data.

We can highlight some things in the plot to make it more interesting. For example, we could add a point to the plot illustrating our sample c.o.v. and maybe a central 95% interval (a percentile interval) for the distribution of simulated coefficient of variation values. This is not really a confidence interval, but tells you where you expect the sample coefficient of variation to lie most of the time from this normal distribution.



Here I just draw a refline at the sample c.o.v. Note that SAS Studio recognizes when I start to type a user-defined macro variable.



In this case, the sample c.o.v. is near the center of the distribution of simulated c.o.v. values. This is not surprising since the original distribution was reasonably close to normal. In other cases, sample statistics are not necessarily near the average of their simulated distributions if the original distribution is different from the assumed distribution.



In addition to seeing the distribution of the sample statistic, we can also get the standard error of the sample statistic, which is the standard deviation of s/x̄ when n = 130. (A standard error is a standard deviation of a sample statistic.)

Here we just run PROC MEANS yet again.


Parametric bootstrap estimate of the standard error

The estimate of the standard error of the coefficient of variation is 0.00046.


Simulation versus theory

In some cases, theory can tell us what the standard error for some tricky statistic is. Sometimes we want standard errors for statistics that estimate odds, p/(1 − p), odds ratios, [p1/(1 − p1)] / [p2/(1 − p2)], precision, 1/σ², or other functions of parameters. If theory can give you a good answer, then this is often preferable to doing a simulation. However, theoretical expressions for these things often involve mathematical (if not numerical) approximations, so that a simulation, while approximate, might be just as good.

The main disadvantage of simulation is often computation time and the fact that you need to do separate simulations for different parameter values. If I want the standard error for the c.o.v., I need to do separate simulations for different choices of n, µ and σ (assuming a normal distribution). If I have a function to give me the standard error, then I can quickly examine the effect of one or more parameters on the standard error as a function of the parameter(s).
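As an example of what such a function can look like, a standard large-sample (delta method) approximation for the standard error of the sample c.o.v. under normality is the following. This formula is not from the original notes and is offered only as a sketch of the theory-based alternative, with c = σ/µ:

$$se\!\left(\frac{s}{\bar{x}}\right) \approx c\,\sqrt{\frac{\tfrac{1}{2} + c^2}{n}} \approx \frac{c}{\sqrt{2n}} \quad \text{when } c \text{ is small}$$

Plugging in the temperature values, c = 0.7331832/98.2492308 ≈ 0.00746 and n = 130 gives

$$\frac{0.00746}{\sqrt{260}} \approx 0.00046$$

which agrees with the parametric bootstrap estimate of 0.00046 reported above.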


How much is bootstrapping used?


How is bootstrapping used in phylogenetics?

Bootstrapping is used in phylogenetics primarily to help quantify uncertainty about maximum likelihood estimates. The idea is that bootstrap replicates are made of DNA sequences, and the best tree is constructed from each of these bootstrap replicates. Then the proportion of trees that have certain features in common is reported. This application is a bit different from the median example, because there is a discrete parameter being inferred.


Non-parametric bootstrapping in phylogenetics


Simulating the likelihood ratio statistic using parametric bootstrapping

(Huelsenbeck and Bull, Systematic Biology, 1996)

In this paper, the distribution of the likelihood ratio statistic δ = −2Λ is simulated under H0 (in a case where it is not asymptotically χ²) and compared to the observed likelihood ratio statistic.


Parametric versus nonparametric bootstrapping

In my experience, or in my area, nonparametric bootstrapping is used much more than parametric bootstrapping, although simulation from known distributions is used extensively without being called bootstrapping. We expect parametric bootstrapping to give more precise answers if we really know something about the distribution that the data comes from, and it is also useful in cases where we are testing a specific hypothesis.

