Chapter I: Families of randomization tests

Discrete Markov Chain Monte Carlo Week I INTRODUCTION

Stat 344 GWCobb 03/19/01 page 1 NSF#0089004

All simulation-based randomization tests follow the same 3-step algorithm that you’ve seen already.

Algorithm for estimating a p-value

1. Generate a random sample of data sets X1, X2, … in a way that gives each possible data set the same probability of getting into the sample.

2. For each random data set, compute the value of the test statistic, and record a Yes or No: Yes if the value is greater than or equal to the value for the actual data, No otherwise.

3. Estimate p using the observed fraction of Yes answers in the sample:

   $\hat{p} = \dfrac{\#\,\text{Yes}}{\#\,\text{data sets in the sample}}$

Although this algorithm is extremely flexible and general, for most people the most efficient way to come to appreciate and take advantage of that generality is through working with groups of special cases. That’s the approach of this chapter.
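The three-step algorithm can be sketched as a small function. This is a minimal sketch in R (which shares most of its syntax with S-Plus), not code from the course: the names estimate_p, generate, and statistic are hypothetical, and the bag of chips at the end uses made-up numbers chosen only for illustration.

```r
# Hypothetical helper (not from the original notes): estimate a p-value
# by simulation, following Steps 1-3 of the algorithm.
#   generate  : function() returning one random data set under the null model
#   statistic : function(x) returning the test statistic for data set x
#   observed  : value of the test statistic for the actual data
estimate_p <- function(generate, statistic, observed, nreps = 10000) {
  yes <- logical(nreps)
  for (i in seq_len(nreps)) {
    x <- generate()                      # Step 1: one random data set
    yes[i] <- statistic(x) >= observed   # Step 2: record Yes (TRUE) or No (FALSE)
  }
  mean(yes)                              # Step 3: fraction of Yes answers
}

# Toy illustration with made-up numbers: 12 chips, 5 of them marked 1.
# Draw 4 at random; the statistic is the number of 1s drawn; suppose the
# actual data showed 3.
set.seed(1)
chips <- c(rep(1, 5), rep(0, 7))
p_hat <- estimate_p(function() sample(chips, 4), sum, observed = 3)
p_hat
```

Passing the generator and the statistic as functions keeps the loop itself unchanged from one special case to the next; only the two arguments differ.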

Randomization tests, I: Fisher’s exact test

One of the simplest randomization tests is Fisher’s exact test, which is used to test hypotheses about data that can be summarized in a 2x2 table of counts.

Example. The Salem witchcraft hysteria
The year 1692 saw nineteen convicted witches hanged in Salem Village (now Danvers), Massachusetts. Almost three centuries later, historians examining documents related to the trials discovered a striking pattern relating trial testimony and geography. Those who testified against the accused witches tended to live in the western part of Salem Village; those who testified in defense of the accused tended to live in the eastern part, which was wealthier, more commercial, more cosmopolitan, and closer to the town of Salem, at the time the second busiest port in the colonies.1 A total of 61 residents testified in the trials, of whom 35 lived in the western part of the village, and 26 in the eastern part. Of the 35 “westerners,” 30 were “accusers” and only 5 were “defenders”; of the 26 easterners, only 2 were accusers; the remaining 24 were defenders:

1 Historians Boyer and Nissenbaum cite this pattern as part of the evidence in support of an economically based interpretation of the witchcraft hysteria. See Boyer, Paul and Stephen Nissenbaum (1974). Salem Possessed: The Social Origins of Witchcraft, Cambridge: Harvard University Press.



Display 3s.1 Geography and testimony in the Salem witch trials of 1692

Is it possible to get a pattern as extreme as this just by chance? If the relationship between geography and testimony were purely random, how likely would a pattern this extreme be? Represent the residents who testified by poker chips, 35 marked “West” and 26 marked “East.” Put all 61 chips in a bag, mix thoroughly, and draw out 32 at random. Call these Accusers; count how many of the accuser chips say West, how many say East, and record the results in a table.2 I just did this, and got:

Display 3s.2 Results for a sample of accusers drawn at random

My random data set is not nearly as extreme as the actual one.

I repeated the whole process (random draws, count, compare) 10,000 times, using the S-Plus code in Display 3s.3, and not once did I get a table as extreme as the actual data. Conclusion: If you draw at random, it is all but impossible to get a data table like the observed one. In other words, “It’s just a chance relationship” is not a believable explanation for the data.

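# Population: 35 chips marked 1 (West) and 26 marked 0 (East)
pop <- c(rep(1,35),rep(0,26))
phat <- 0
NRep <- 10000
for (i in 1:NRep){
    # Draw 32 "accusers" without replacement; add 1/NRep if 30 or more are West
    phat <- phat + (sum(sample(pop,32,replace=F))>=30)/NRep
}
phat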

Display 3s.3 S-Plus code for drawing random samples of accusers and estimating p
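The probability that the simulation in Display 3s.3 estimates can also be computed exactly, because the number of West chips in a random draw of 32 follows a hypergeometric distribution. The following is a sketch in R rather than S-Plus (an assumption about the reader's software); dhyper, phyper, and fisher.test are standard R functions, while the variable names are chosen here for illustration.

```r
# Exact version of the simulation in Display 3s.3.
# X = number of West chips among 32 drawn without replacement from
# 35 West and 26 East chips; X has a hypergeometric distribution.
p_exact <- sum(dhyper(30:32, m = 35, n = 26, k = 32))   # P(X >= 30)

# Same probability from the upper tail of the distribution function.
p_tail <- phyper(29, m = 35, n = 26, k = 32, lower.tail = FALSE)

# Same probability again from R's built-in one-sided Fisher's exact test.
salem <- matrix(c(30,  5,
                   2, 24), nrow = 2, byrow = TRUE,
                dimnames = list(Geography = c("West", "East"),
                                Testimony = c("Accuser", "Defender")))
p_fisher <- fisher.test(salem, alternative = "greater")$p.value

c(p_exact, p_tail, p_fisher)
```

All three numbers should agree, and all are far smaller than 1/10,000, which is consistent with seeing no Yes answers in 10,000 random data sets.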

2 The chips left in the bag are the defenders. You could count them to fill in the rest of the table, or you could get the missing values by subtraction, since they are determined by what you draw out.

Table for Display 3s.1 (actual data):

                  Testimony
Geography    Accuser   Defender   Total   % accuser
West            30         5        35      85.7%
East             2        24        26       7.7%
Total           32        29        61

Table for Display 3s.2 (one random draw):

                  Testimony
Geography    Accuser   Defender   Total   % accuser
West            19        16        35      54.3%
East            13        13        26      50.0%
Total           32        29        61



Here’s an abstract version of the same analysis:

Step 1. Generate random data sets.
Population: 61 individuals, 35 of them 1s (West) and 26 of them 0s (East).
Sample: A subset of 32.
Null model: The sample is chosen completely randomly; all subsets of 32 are equally likely.

Step 2. Compare random data sets with the actual data.
Test statistic: Number of 1s in the sample (= number of “West” chips among the randomly chosen “accusers”).
Actual data value: There were, in fact, 30 residents of the western part of Salem Village among the accusers.
Compare: Record a Yes if there are 30 or more 1s in the sample.

Step 3. Estimate.
Out of 10,000 data sets, none had as many as 30 1s.

Because the estimated p-value is so very tiny, we reject the null model. It is not a believable explanation for the actual data.

Drill Exercises:

1. A small version of the witch data.
Assume that only 10 people had testified, of whom 4 lived in the west, 6 in the east. Assume also that 3 were accusers, and that all 3 came from the west. Set up the population, null model, test statistic and observed value. Then find the p-value and state your conclusion.

2. There is more than one way to define the population and null model. Two easy variations are (a) to reverse the labels for 1s and 0s, so that 1 represents East and 0 represents West, and/or (b) to reverse the labels for in the sample and not in the sample, so that the sample represents Defenders, and those not in the sample are the Accusers. For each of these variations, tell what the test statistic would be, and how to compute the p-value. Verify that the p-value is the same for all these variations.

3. A more substantive variation reverses the roles of population and sample: Let the 1s and 0s in the population tell whether an individual was an Accuser or Defender. Let “in the sample” correspond to “West,” and “not in the sample” to “East.” Define a test statistic, and tell how to modify the S-Plus code in Display 3s.3 to compute the p-value. (Optional: Run your modified S-Plus code, and verify the non-obvious fact that, apart from random variation, it is equal to the p-value from the original code.)

Example 2. US v. Gilbert
For several years in the 1990s, Kristen Gilbert worked as a nurse in the intensive care unit (ICU) of the Veteran’s Administration hospital in Northampton, Massachusetts. Over the course of her time there, other nurses came to suspect that she was killing patients by injecting them with the heart stimulant epinephrine.3 Part of the evidence against Gilbert was a statistical analysis xxx 8-hour shifts during the time Gilbert worked in the ICU. Was there an association between Gilbert’s presence on the ICU and whether or not someone died on the shift?4 Here are some fictitious data of the same sort as the actual data from the grand jury testimony:

Display 3s.4 Fictitious data on possible association between nurse Gilbert’s presence in the ICU of the Northampton VA hospital and deaths on a shift.

Drill Exercises:

4. Define the population of 0s and 1s: What does a 0 represent? a 1? How many 0s, and how many 1s, are in the population? (Note: There is more than one right way to do this.)

5. Define the null model: If you think of drawing a random sample from the population, what does “drawn out” (that is, in the sample) represent? What does “not drawn out” represent?

6. Define the test statistic: What does the number of 1s in the sample represent? What is the observed value of the test statistic?

7. P-value. Modify the S-Plus code in Display 3s.3 so that it would compute the p-value. (Optional: Compute the p-value using your code.)

Example 3. Capture-mark-recapture.
An entomologist wants to estimate the size of an animal population. Suppose, for example, you want to know the number of giant cockroaches that feed in the vicinity of a particular stand of trees in a Costa Rican rainforest. The entomologist captures and marks 25 roaches, then releases them. After waiting long enough for the marked roaches to mix thoroughly with the rest of the roach population, he captures a sample of 50 cockroaches, and finds that 14 of them are marked.

3 A synthetic form of adrenaline.

4 As of this writing, the actual data are not yet public. They figured prominently in the grand jury testimony that led to Gilbert’s indictment, but at the subsequent trial, the judge ruled that they could not be shown to the jury.

Table for Display 3s.4 (fictitious data):

K.G. present        Death on shift?
on shift?          Yes      No     Total    % Yes
Yes                 27     273      300      9.0%
No                  18     882      900      2.0%
Total               45    1155     1200



Exercises.

8. Let N be the unknown number of cockroaches in the population. Tell how to use a null model to test the hypothesis that N is greater than or equal to 100.

9. Tell how to use your null model to find the set of possible population sizes N that are not rejected by a test of the sort in Exercise 8.

10. What assumptions about the cockroaches are necessary in order to make the random sampling model appropriate?



Randomization tests, II: the median test and its relatives.

Fisher’s exact test is appropriate when (1) you want to compare two randomly chosen groups of individuals, and (2) the feature of the individuals that you use to make the comparison is dichotomous.

Example 0. Think of randomly drawing marbles from a bucket. This gives two randomly chosen groups: those drawn out, and those left in. The marbles are of two colors; color is the feature used to compare the two groups.

Example 1. For the Salem witch data, we asked, “What if the ‘accusers’ and ‘defenders’ had been chosen at random?” In that example, the actual accusers and defenders were not chosen at random, but we wanted to test whether the observed data were consistent with random selection. So under our null model, the ‘accusers’ and ‘defenders’ were the randomly chosen groups. The feature used for the comparison was geography, east or west.

Example 2. For the Gilbert data we asked, “What if the deaths had occurred on randomly chosen shifts?” Here, also, the actual shifts were not chosen that way, but we wanted to compare the actual data with what we would be likely to get if the shifts had been chosen randomly. Thus shifts with and without deaths were the randomly chosen groups. The feature used for comparing groups was whether or not Gilbert was present on the shift.

Example 3. For the cockroach data we assume that the 50 insects caught in the sample are a random subset of the population. (This situation differs from the two previous examples. Here our goal is not to test whether the data are consistent with choosing at random. Rather, we accept random selection as a reasonable assumption,5 in order to use it to make deductions about the size of the population.) For this example, the two randomly chosen groups are the 50 roaches that got caught, and the N-50 that didn’t get caught. The feature used for comparing groups is whether or not a roach was marked.

What if the feature you use to make the comparison is a quantitative variable, like age or blood pressure? There are several options, but the simplest is to turn the quantitative variable into a dichotomous variable: pick a threshold value of the variable, and replace its actual value with a Yes or No answer to the question “Is the value of the variable greater than or equal to the threshold?” Once you’ve replaced the numbers with Yes/No answers, you can carry out Fisher’s exact test.

5 Biologists have broken down the assumption of random selection into a list of conditions about the population. In order for the assumption to be reasonable, there must be no movement of individuals in or out of the population during the time of the study, and the individuals must move around enough that the marked individuals get well mixed into the population before the sample is caught.



Example 4. Martin vs. WestVaCo
Here, once again, are the ages of the ten hourly workers involved in the second round of layoffs at WestVaCo, with the ages of those laid off indicated by underlining.

25 33 35 38 48 55 55 55 56 64

a. One way to choose a threshold is to go by the law. According to federal employment law, the “protected class” (of workers who cannot be fired because of their age) begins at age 40. If we use 40 as the threshold, then the set of ten ages becomes

N N N N Y Y Y Y Y Y

We can summarize the new, dichotomized version of what actually happened in a 2x2 table:

If we use as our test statistic the number of workers aged 40 or more among those laid off, we get a p-value of about .17:

p = P(3 or more Y in a random sample of 3 chosen from 4 N, 6 Y) ≈ .17.

b. Notice that using Fisher’s test when you have a quantitative variable ignores relevant information.6 Here, for example, all three workers laid off were very much older than the threshold age of 40, but the test in (a) didn’t use that information. In general, it is better not to use Fisher’s test in a situation like this, and later on you’ll see some alternatives that would ordinarily be preferred. However, you can sometimes get a better version of Fisher’s test by changing the threshold. For example, you could choose as your threshold the median (half-way point) of the set of observed ages. The resulting test is called the median test. For the Martin data we have an even number of values, so there are two middle values, 48 and 55, and the median is the number half-way between them, 51.5. Using 51.5 as a threshold, and replacing ages by Yes/No answers to “Is the age 51.5 or older?” gives a population of 5 Ns, 5 Ys.

Drill exercise: (1) Summarize the data in a 2x2 table.

6 Ignoring relevant information can reduce the power of a statistical test by making it less likely that the test will reject null models that should be rejected.

2x2 table for part (a), threshold 40:

                   Laid off?
40 or older?     No    Yes   Total
No                4      0      4
Yes               3      3      6
Total             7      3     10



For this threshold, the p-value is

p = P(3 or more Y in a random sample of 3 chosen from 5 N, 5 Y) = 1/12 ≈ .08.

c. Drill exercise. (2) Repeat the test using 55 as the threshold. Summarize the data in a 2x2 table, and explain why the p-value should be less than the two previous p-values. (Its actual value is 1/30 ≈ .03.)
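Threshold tests like those in parts (a) and (b) can be checked exactly, since the number of Ys in a random sample of 3 is hypergeometric. The sketch below is written in R (assumed here in place of S-Plus); the helper name threshold_p is hypothetical, and it relies only on the ten ages listed above and the fact that three workers were laid off, all at or above each threshold used.

```r
# Ages of the ten workers; three were laid off.
ages <- c(25, 33, 35, 38, 48, 55, 55, 55, 56, 64)
n_laid_off <- 3

# Hypothetical helper: dichotomize the ages at `threshold`, then compute
# P(at least `observed` Ys in a random sample of size n_laid_off).
threshold_p <- function(threshold, observed) {
  is_yes <- ages >= threshold          # Yes/No answers stored as TRUE/FALSE
  n_yes <- sum(is_yes)
  n_no  <- sum(!is_yes)
  upper <- min(n_laid_off, n_yes)      # largest possible number of Ys
  sum(dhyper(observed:upper, m = n_yes, n = n_no, k = n_laid_off))
}

table(ifelse(ages >= 40, "Y", "N"))                  # 4 N, 6 Y
p_40     <- threshold_p(40, observed = 3)            # (6 choose 3)/(10 choose 3) = 20/120
p_median <- threshold_p(median(ages), observed = 3)  # (5 choose 3)/(10 choose 3) = 10/120
c(p_40, p_median)
```

Because the population is tiny, these exact values make a convenient check on any simulation-based estimate.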

Example 5. Calcium and blood pressure.
To test whether taking calcium supplements can reduce blood pressure, investigators used a chance device to divide 21 male subjects into two groups. One group of 10 men, the treatment group, were given calcium supplements and told to take them every day for 12 weeks. The other 11 men, the control group, were given pills that looked the same as the supplements (a placebo), and given the same instructions: take one every day. Neither the subjects themselves nor the people giving out the pills and taking blood pressure readings knew which pills contained the calcium. (The experiment was double blind.) Subjects had their blood pressure read at the beginning of the study and again at the end. The numbers below tell the change in systolic blood pressure (when the heart is contracted), in millimeters of mercury.

Calcium: -7, 4, -18, -17, 3, 5, -1, -10, -11, 2
Placebo: 1, -12, 1, 3, -3, 5, -5, -2, 11, 1, 3

Here are the same numbers, arranged in order, with the values in the treatment group marked with an asterisk (*):

-18* -17* -12 -11* -10* -7* -5 -3 -2 -1* 1 1 1 2* 3* 3 3 4* 5* 5 11

Notice that for this example, as for all the others, there are two groups to compare, in this instance those assigned to the treatment group, and those assigned to the placebo group.7 Here, as in the Martin example, the information we have available for comparing the two groups is quantitative. If we want to apply Fisher’s test, we have to reduce the quantitative data to dichotomous data by choosing a threshold value and asking, “Is the change in blood pressure greater than or equal to the threshold?”

7 There is an extremely important difference that sets this example apart, however: The two groups in fact were chosen purely at random. For the witches, and Martin, and Gilbert, there wasn’t any actual randomization; instead, random selection was the null model being tested. For the giant roaches, random selection was an assumption. For a randomized controlled experiment like the calcium study, the randomization was a deliberate part of the experimental design. Although the source of the randomization makes no difference in how you calculate the p-value, it makes a tremendous difference in what it tells you. For the witches, Martin and Gilbert, a tiny p-value tells you to reject the null model of random selection. For the calcium study, that’s not a logical option, because the randomization actually occurred. That randomization guarantees that there are only two possible explanations for the observed difference between the treatment and control groups: chance, or the treatment.



Drill exercises.

(3) One natural choice for the threshold is 0. Values 0 or greater indicate that the blood pressure did not go down. Carry out Fisher’s test using 0 as your threshold.

(4) Carry out the median test.

Example 6. Hospital carpets.
In a hospital, noise can be an irritation that interferes with a patient’s recovery. Putting down carpeting in the rooms would cut down on noise, but the carpeting might tend to harbor bacteria. To study this possibility, doctors at a Montana hospital conducted an experiment to see whether rooms with carpeting had higher levels of airborne bacteria than rooms with bare floors. They began with 16 rooms and randomly chose eight to have carpeting installed. The other eight were left bare. At the end of their test period, they pumped air from each room over a culture medium (agar in a petri dish), allowed enough time for the bacterial colonies to grow, and recorded the number of colonies per cubic foot of air. Here are the results:

Drill exercises

(5) Tell how to conduct a median test:

Null model.
a. What is the population?
b. What constitutes a random sample?
c. What are the objects that are equally likely according to the null model?

Test statistic.
d. Tell what test statistic to use.

     Carpeted floors                 Bare floors
Room #   Colonies/cu.ft.      Room #   Colonies/cu.ft.
212           11.8            210           12.1
216            8.2            214            8.3
220            7.1            215            3.8
223           13.0            217            7.2
225           10.8            221           12.0
226           10.1            222           11.2
227           14.6            224           10.1
228           14.0            229           13.7

Average       11.2            Average        9.8



Randomization tests, III: the rank and permutation tests for two samples.

Randomization tests, IV: sign, rank sum, and permutation tests for paired data



Randomization tests, V: Chi-square tests.

One of the most common methods in all of statistics is the chi-square test. This test is so flexible and broadly applicable that almost any data set based on sorting and counting can be studied using it.8 The chi-square test has two parts, which are often presented as a single package, without noting or distinguishing between one part and the other.9 This is unfortunate, because one part (a measure of distance) is much more useful than the other (a shortcut method for approximating p-values). In what follows, I’ll describe and illustrate the more useful part, which gives a general-purpose method for carrying out Step 2 (the comparison step) of the p-value algorithm.10

The chi-square distance: an example.

For a concrete illustration, consider (yet again!) the summary table for the Martin example. For 2x2 tables, the chi-square distance isn’t something you really need, because its value is completely determined by the entry in the upper left cell of the table.11 However, you do need chi-square (or some alternative) for tables larger than 2x2, and the 2x2 case is a simple starting point for learning.

Display 6s.1. Summary table for the Martin example

The null model corresponds to choosing a random subset of size 3 (those laid off) from {25, 33, 35, 38, 48, 55, 55, 55, 56, 64} and counting the number of people who are 55 or older. Since 5 of the 10, or 50%, of those in our population are 55 or older, we would expect that, on the average over the long run, 50% of those in a randomly chosen subset would be 55 or older. For subsets of size 3, this long-run average, the expected value,

8 If you think about the last statement, it should sound too good to be true, and it is. Nevertheless, along with regression and analysis of variance, chi-square methods constitute one of the three principal groups of classical statistical methods.

9 If you’ve seen the chi-square test before, you yourself may well have been given the package tour.

10 The less useful part of the chi-square test is an approximation based on mathematical theory, one that you can sometimes use as a substitute for Steps 1 and 3 of the p-value algorithm. Back in 1900 when this approximation was invented (by Karl Pearson), before the days of computers, it was a major breakthrough, and statisticians had no choice but to use it. Without the approximation, there would have been no way to compute p-values. However, the approximation only works well for a restricted class of data sets, whereas the randomization algorithm is not restricted in this way, and can be used to estimate p-values to any desired accuracy.

11 This means that the p-value will be the same regardless of whether you use the chi-square distance or the cell entry in the comparison step of the algorithm.

Table for Display 6s.1 (observed counts):

                     Laid off?
                   Yes    No   Total
55 or      Yes      3      2      5
older?     No       0      5      5
           Total    3      7     10



would be 50% of 3, or 1.5. Because the table entries have to add to give the same row and column totals as for the actual data, we can fill in the entire table:

Display 6s.2. Expected values for the Martin example

We can now use these expected values to compare tables. In effect, we invent a way to measure the “distance” between two tables, and compare tables according to how far they are from the table of expected values.

Display 6s.3. The chi-square distance between observed and expected counts

By looking closely at the way the chi-square value is defined, you can convince yourself that it does behave like a distance.

• Observed close to expected ⇒ chi-square near zero. Consider first a table whose observed counts are exactly equal to the expected counts. All of the entries in the table of differences will be zeros, and the chi-square distance will be zero as well. In other words, the chi-square distance from a table to itself is zero, just as it should be.

• Observed far from expected ⇒ chi-square large. Now consider a table of observed counts that are far from their expected values. At least some of the differences (Obs – Exp) in Step 2a will be far from 0.12 When these differences are squared, in Step 2b, the resulting values will be large, and that will make the chi-square value large.

12 Notice that because the observed and expected counts have the same marginal totals, the marginal totals for the table of differences will all be 0.

Table for Display 6s.2 (expected values):

                     Laid off?
                   Yes    No   Total
55 or      Yes     1.5    3.5     5
older?     No      1.5    3.5     5
           Total    3      7     10

Computation for Display 6s.3:

Step 2a    Observed        Expected         Obs - Exp
            3    2     -   1.5   3.5    =    1.5  -1.5
            0    5         1.5   3.5        -1.5   1.5

Step 2b    (Obs - Exp)^2   Expected         (O-E)^2/E
            2.25  2.25  /  1.5   3.5    =    1.5   0.64
            2.25  2.25     1.5   3.5         1.5   0.64

Step 2c    Chi-square = sum of (O-E)^2/E = 4.29



• Why divide (O-E)^2 by E? Dividing by E is a technical adjustment, designed to give all the cells an equal chance to contribute to the chi-square total. To see how this works, consider two extreme cases. First, suppose that the expected value is 1. If 1 is the expected value, we might get observed values of 2, or 5, or even 10, but 10 would be a very major departure from expectation. On the low side, the observed count can never be less than 0, so –1 is the lowest possible value for O – E. Next, suppose that instead of 1, the expected value is 101. An observed count of 102, or 105, or even 110 is, in percentage terms, still quite close to the expected value, even though the differences (O-E) are the same as in the first case. On the low side, (O-E) can easily go far below –1.

Now compare the two cases. In the first, a value of (O-E) = 4 indicates a major departure from expectation: 5 instead of 1. In the second case, a value of (O-E) = 4 indicates a departure of less than 4% from the expectation. Dividing (O-E)^2 by E puts these departures in perspective. In the first case, (O-E)^2/E = 16; in the second case, (O-E)^2/E = 0.16.

Once you have defined the chi-square distance, you can use it to compare data sets. A random data set is at least as extreme as the actual Martin data, for example, if and only if its chi-square distance from the table of expected values is greater than or equal to 4.29. You calculate the p-value in the usual way, as the fraction of random data sets that are at least as extreme as the actual data.
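The arithmetic behind the chi-square distance for the Martin table takes only a few lines. This is a minimal sketch in R (assumed here in place of S-Plus); the variable names are chosen for illustration and are not taken from the course code.

```r
# Observed counts for the Martin example (rows: 55 or older? Yes/No;
# columns: laid off? Yes/No).
observed <- matrix(c(3, 2,
                     0, 5), nrow = 2, byrow = TRUE,
                   dimnames = list(Older55 = c("Yes", "No"),
                                   LaidOff = c("Yes", "No")))

# Expected counts under no association: (row total)(column total)/(grand total).
expected_counts <- outer(rowSums(observed), colSums(observed)) / sum(observed)

differences   <- observed - expected_counts           # Step 2a
contributions <- differences^2 / expected_counts      # Step 2b
chi_sq        <- sum(contributions)                   # Step 2c

expected_counts        # 1.5 and 3.5 in each row
round(chi_sq, 2)       # 4.29
```

The same three lines (differences, contributions, sum) work unchanged for a table of any size.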

Testing for association in two-way tables of counts

For tables larger than 2x2, the arithmetic is messier, but the logic is the same as in the Martin example. Here’s a version of the randomization algorithm suitable for larger tables.

Step 0. Expected value = (row fraction)(column total).
        Observed chi-square: Follow Steps 2a-2c below for the actual data.

Step 1. Generate random data sets with the same row and column totals as the actual data.

Step 2. a. Observed - Expected
        b. (Obs-Exp)^2/Exp
        c. Chi-square = sum of (O-E)^2/E
        d. Compare: is chi-square for the random data set at least as big as for the actual data?

Step 3. Estimate: $\hat{p} = \#\,\text{Yes} \,/\, \#\,\text{Datasets}$



Example. Victoria’s descendants

Some people claim there is an association between a person’s birthday and the day of the year on which they die. According to the theory, people who are dying tend to “hang on” until their birthday. Display 6s.4 shows counts for 82 descendants of Queen Victoria, classified by month of birth (row) and month of death (column). Those who died in the same month as they were born in appear on the main diagonal. Those who died in a month just before or just after their birth month appear just below or just above the main diagonal. If the claim of association is true, we would expect to find a tendency for counts to be higher on or near the main diagonal, lower near the southwest and northeast corners. If, on the other hand, there is no association, we would expect the count in a cell to be equal to the product of the cell’s column total times its row fraction. For example, consider the upper left cell, which corresponds to those born in a January who also died in a January. The row and column totals tell us that 6 of 82 descendants, or 7.32%, were born in a January; 13 died in a January. If there is no association between month of birth and month of death, we would expect the fraction of January births to be the same in each column. In particular, we would expect 7.32% of the 13 January deaths, or 13(0.0732) = 0.95, to be January births.

Display 6s.4. Month of birth (row) and month of death (column) for 82 descendants of Queen Victoria

For this data set, the value of chi-square is 115.6. To see how this value compares with the values we’d get from random data sets, we need a method for generating 12x12 tables of counts with the same margins as in Display 6s.4.
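The expected count and the chi-square value quoted for Victoria's descendants can be reproduced from the counts in Display 6s.4. The following is a sketch in R (assumed here in place of S-Plus); the matrix is transcribed by hand from the display, so the margin checks are included as a guard against typing errors, and the object names are illustrative.

```r
months <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
            "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Counts transcribed from Display 6s.4 (rows: birth month; columns: death month).
Victoria <- matrix(c(
  1, 0, 0, 0, 1, 2, 0, 0, 1, 0, 1, 0,
  1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 2,
  1, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 1,
  3, 0, 2, 0, 0, 0, 1, 0, 1, 3, 1, 1,
  2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
  2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
  2, 0, 2, 1, 0, 0, 0, 0, 1, 1, 1, 2,
  0, 0, 0, 3, 0, 0, 1, 0, 0, 1, 0, 2,
  0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,
  1, 1, 0, 2, 0, 0, 1, 0, 0, 1, 1, 0,
  0, 1, 1, 1, 2, 0, 0, 2, 0, 1, 1, 0,
  0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0),
  nrow = 12, byrow = TRUE, dimnames = list(Birth = months, Death = months))

# Check the transcription against the margins printed in the display.
stopifnot(all(rowSums(Victoria) == c(6, 5, 5, 12, 12, 3, 10, 7, 3, 7, 9, 3)),
          all(colSums(Victoria) == c(13, 4, 7, 10, 8, 4, 5, 3, 4, 9, 7, 8)))

# Expected counts under no association, and the chi-square distance.
ExpectedCounts <- outer(rowSums(Victoria), colSums(Victoria)) / sum(Victoria)
ExpectedCounts["Jan", "Jan"]      # 6 * 13 / 82, about 0.95
ChiSq <- sum((Victoria - ExpectedCounts)^2 / ExpectedCounts)
round(ChiSq, 1)                   # should reproduce the 115.6 quoted in the text
```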

Generating random tables of counts with given margins.

You can generate data sets by physical simulation, much as in the Martin example. Here’s how it works for Victoria’s descendants.

Table for Display 6s.4 (rows: month of birth; columns: month of death):

        Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec   Total
Jan       1   0   0   0   1   2   0   0   1   0   1   0       6
Feb       1   0   0   1   0   0   0   0   0   1   0   2       5
Mar       1   0   0   0   2   1   0   0   0   0   0   1       5
Apr       3   0   2   0   0   0   1   0   1   3   1   1      12
May       2   1   1   1   1   1   1   1   1   1   1   0      12
Jun       2   0   0   0   1   0   0   0   0   0   0   0       3
Jul       2   0   2   1   0   0   0   0   1   1   1   2      10
Aug       0   0   0   3   0   0   1   0   0   1   0   2       7
Sep       0   0   0   1   1   0   0   0   0   0   1   0       3
Oct       1   1   0   2   0   0   1   0   0   1   1   0       7
Nov       0   1   1   1   2   0   0   2   0   1   1   0       9
Dec       0   1   1   0   0   0   1   0   0   0   0   0       3
Total    13   4   7  10   8   4   5   3   4   9   7   8      82



Step 1. Label by rows. Put 82 chips in a bucket, with labels determined by the row (birth month) totals. Thus 6 of the chips say “Jan”, 5 say “Feb”, 5 “Mar”, 12 “Apr”, and so on.13

Step 2. Draw out by columns. Mix the chips thoroughly. Then draw them out in stages determined by the column totals: The first 13 chips you draw out correspond to deaths in January; the next 4 correspond to deaths in February, …, the last 8 correspond to deaths in December.

In S-plus, you can carry out the same two steps:

Step 1. Label by rows. Just as rep(1, 3) creates a vector (1, 1, 1), rep(1:3, c(4,3,1)) creates a vector (1, 1, 1, 1, 2, 2, 2, 3). For the Victoria data, we want rep(1:12, c(6, 5, 5, 12, 12, 3, 10, 7, 3, 7, 9, 3)). Rather than type in all the row totals, we use the command rowSums to tell S-plus to compute the totals for us.

Pop <- rep(1:12, rowSums(ActualData))

More generally, the number of rows won’t necessarily be 12, but it will be given by the first element of the vector that gives the dimension of the data, dim(ActualData)[1]. Thus we create the bucket of labeled chips with the command

Pop <- rep(1:dim(ActualData)[1], rowSums(ActualData))

Step 2. Draw out by columns. In the same way, we create a vector ColGroups of column labels. Following the column totals, this vector will have thirteen 1s for January, four 2s for February, etc. For our particular example, we could use rep(1:12, c(13, 4, 7, 10, 8, 4, 5, 3, 4, 9, 7, 8)). The S-plus code uses the more general

ColGroups <- rep(1:dim(ActualData)[2], colSums(ActualData))

To create a random data set, we permute the row labels using

Permutation <- sample(Pop, length(Pop)),

line up the vector of permuted row labels next to the vector of column labels, and count. Here’s how it works for the Martin example:

Row labels:          1 1 1 1 1 2 2 2 2 2

Permuted row labels: 1 2 2 1 2 1 1 1 2 2
Column labels:       1 1 1 2 2 2 2 2 2 2

13 Notice that the births tend to come between April and August. Our null model, which fixes the row totals, copies the actual distribution of the birth months.



The last two rows give 10 vertical pairs (Permuted row label, column label). Sorting and counting these gives a 2x2 summary table:

Display 6s.5. Summary table from sorting and counting

S-plus does the counting for us, in response to the command “table”:

RandomTable <- table(Permutation, ColGroups)
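The label-permute-count recipe can be tried on the small Martin margins (row totals 5 and 5, column totals 3 and 7) before moving to the 12x12 table. This is a minimal sketch in R (assumed here in place of S-Plus); the seed is arbitrary and is set only so that the run can be repeated.

```r
# rep() with a vector of counts repeats each label the given number of times.
rep(1:3, c(4, 3, 1))                 # 1 1 1 1 2 2 2 3

# Martin margins: row totals (5, 5) and column totals (3, 7).
Pop       <- rep(1:2, c(5, 5))       # bucket of chips labeled by row
ColGroups <- rep(1:2, c(3, 7))       # first 3 draws -> column 1, next 7 -> column 2

set.seed(344)                        # arbitrary seed, for repeatability only
Permutation <- sample(Pop, length(Pop))      # mix the chips
RandomTable <- table(Permutation, ColGroups) # sort and count
RandomTable

# Every random table has the same margins as the actual data.
rowSums(RandomTable)                 # 5 5
colSums(RandomTable)                 # 3 7
```

Only the interior of the table changes from one permutation to the next; the margins are fixed by construction, which is what the null model requires.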

Display 6s.6 shows the S-plus code for (1) computing expected values, (2) computing chi-square distance, and (3) estimating the p-value by generating random data sets and finding the fraction whose chi-square value is at least as large as for the actual data. Out of 10,000 random data sets, xxx gave a chi-square value of 115.6 or more. Conclusion: It isn’t at all unusual to get data as extreme as the actual data. Queen Victoria’s descendants provide no evidence of the claim that birth month and death month are associated.

Table for Display 6s.5:

                     Column label
                     (1 = laid off, 2 = retained)
Row label               1     2    Total
1 (under 55)            1     4      5
2 (55 or older)         2     3      5
Total                   3     7     10



###############################################################
# Chi-Square Tests by randomization
###############################################################

################
# Expected. This function takes a matrix of non-negative entries
# and returns a matrix of expected values computed assuming there
# is no association between rows and columns: the (i,j) element of
# the matrix equals the total for row i times the proportion for
# column j.
################
expected <- function(A){
    RowTotals <- matrix(rowSums(A),dim(A)[1],1)
    GrandTotal <- sum(A)
    ColFractions <- matrix(colSums(A),1,dim(A)[2]) / GrandTotal
    Expected <- RowTotals %*% ColFractions
    return(Expected)
}

################
# ChiSqDistance. This function takes matrices of observed and
# expected counts and returns the usual chi-square statistic,
# with value equal to the sum of
# (observed - expected)^2 / expected.
# Usage, for a matrix of counts A: ChiSqDistance(A,expected(A))
################
ChiSqDistance <- function(Observed,Expected){
    ChiSq <- sum((Observed-Expected)^2/Expected)
    return(ChiSq)
}

################
# ChiSqSim: Carries out a randomization chi-square test for independence
# by creating random data sets, computing the chi-square distance from
# expected values computed assuming no association, and estimating the
# p-value using the fraction of random data sets whose chi-square values
# are at least as large as the value for the actual data.
# Input: a matrix of counts and the number of repetitions.
################
ChiSqSim <- function(ActualData, NReps){
    Exp <- expected(ActualData)                   # Expected values
    ActChiSq <- ChiSqDistance(ActualData,Exp)     # Observed value of the chi-square distance
    NYes <- 0                                     # NYes counts # of random data sets
                                                  # with chi-square at least as big
    # Pop = the contents of the bucket:
    Pop <- rep(1:dim(ActualData)[1],rowSums(ActualData))
    PopSize <- length(Pop)                        # Number of chips in the bucket
    # ColGroups tells which draws go with which columns
    ColGroups <- rep(1:dim(ActualData)[2],colSums(ActualData))
    for (i in 1:NReps){                           # Repeat this loop once for each random data set
        Permutation <- sample(Pop,PopSize)        # Create a random permutation of the chips in the bucket
        RandomData <- table(Permutation,ColGroups)    # Summarize the results in an r x c table of counts
        # Add 1 to NYes if the random data set has a chi-square
        # value at least as large as for the actual data.
        NYes <- NYes + (ChiSqDistance(RandomData,Exp) >= ActChiSq)
    }
    p.hat <- NYes / NReps
    return(p.hat)
}

#
# The Martin data
#
Martin <- matrix(c(3,2,0,5),2,2,byrow=T)
Martin
expected(Martin)
ChiSqDistance(Martin,expected(Martin))
ChiSqSim(Martin,1000)

#
# Victoria’s Descendants
# (Victoria is assumed to hold the 12x12 matrix of counts from Display 6s.4)
#
Victoria
expected(Victoria)
ChiSqDistance(Victoria,expected(Victoria))
ChiSqSim(Victoria,1000)

Display 6s.6 S-plus code for randomization chi-square test for two-way tables
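The two quantities computed by the functions expected and ChiSqDistance in Display 6s.6 can be written compactly. The symbols used here (O for observed counts, E for expected counts, R and C for row and column totals, N for the grand total) are notation assumed for this illustration, not notation taken from the text. The worked numbers use the Martin matrix from the display, with rows (3, 2) and (0, 5).

```latex
E_{ij} = R_i \cdot \frac{C_j}{N},
\qquad
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}

% Martin data: R = (5, 5), C = (3, 7), N = 10
E = \begin{pmatrix} 1.5 & 3.5 \\ 1.5 & 3.5 \end{pmatrix},
\qquad
\chi^2 = \frac{(3-1.5)^2}{1.5} + \frac{(2-3.5)^2}{3.5}
       + \frac{(0-1.5)^2}{1.5} + \frac{(5-3.5)^2}{3.5}
       = \frac{30}{7} \approx 4.29
```

Every random data set produced inside ChiSqSim has the same row and column totals as the actual data, so the expected values are computed once and reused for each repetition.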



Randomization tests, VI: ad hoc tests. {Preliminary}

GIVEN:

1. A null model (or model and null hypothesis):
   a. A set, called the population. (Subsets of the population are samples; the number of elements in the sample is the sample size, n.)
   b. A finite (though often very large) collection of equally likely samples.

2. A test statistic (or metric):
   a. A function or rule that assigns a real number to each sample. (The function itself is the test statistic.)
   b. A way to tell which of two values of the test statistic is more extreme. (Often, larger values are more extreme.)

3. Observed data:
   A particular sample (and so, automatically, a particular value of the test statistic).

COMPUTE:

1. The p-value (more formally, the observed significance level):
The p-value is the probability, computed using the null model, of getting a value of the test statistic at least as extreme as the observed value.

According to the null model, all the samples are equally likely, so the p-value is just the fraction of samples that have values of the test statistic at least as extreme as the observed value. There are three general approaches to computing p-values: brute force, mathematical theory, and simulation.

a. Brute force: List all the samples, compute the value of the test statistic for each sample, and count.

b. Mathematical theory: Find a shortcut by applying mathematical ideas (e.g., the theory of permutations and combinations) to the structure of the set of samples in the null model.

c. Simulation: Use physical apparatus or a computer to generate a large number of random samples, and estimate the p-value using the fraction of samples that give a value of the test statistic at least as extreme as the observed value.
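The brute-force and simulation approaches can be compared directly on a small case. What follows is a minimal sketch, written in R rather than S-Plus (combn, replicate, and set.seed are R functions and may behave differently, or be missing, in S-Plus). It assumes the ten ages and the observed layoff (ages 55, 55, and 64, average 58) from the Martin exercises in this chapter; the variable names are illustrative only.

```r
# Sketch only: population and observed sample assumed from the Martin exercises.
ages <- c(25, 33, 35, 38, 48, 55, 55, 55, 56, 64)  # population: ten workers
observed_sum <- 55 + 55 + 64                       # observed sample, average age 58

# Every sample has three workers, so comparing sums is equivalent
# to comparing averages (and avoids rounding in the division by 3).

# (a) Brute force: list every sample of size 3 and count.
all_samples <- combn(ages, 3)                      # one column per sample
n_samples <- ncol(all_samples)                     # choose(10, 3) = 120
n_extreme <- sum(colSums(all_samples) >= observed_sum)
p_exact <- n_extreme / n_samples

# (c) Simulation: draw many random samples, record Yes/No, take the fraction.
set.seed(344)                                      # arbitrary seed, for repeatability
NReps <- 10000
yes <- replicate(NReps, sum(sample(ages, 3)) >= observed_sum)
p_hat <- mean(yes)

p_exact   # 0.05
p_hat     # an estimate; it should land near p_exact
```

Because combn works on positions rather than values, the three 55-year-olds are treated as three different workers, which is what the null model of equally likely samples requires.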

INTERPRETATION:

The p-value measures how surprising the observed data would be if the null model were true. A moderate sized p-value means that the observed value of the test statistic is pretty much what you would expect to get for data generated by the null model. A tiny p-value raises doubts about the null model: If the model is correct (or approximately correct), then it would be very unusual to get such an extreme value of the test statistic for data generated by the model. In other words, either the null model is wrong, or else a very unlikely outcome has occurred.



Related research questions

APPLIED QUESTIONS:

No matter what area of application you are interested in, there are two closely related research questions to be answered:

a. Null model: What choice or choices will provide useful evidence for evaluating a given scientific hypothesis?

b. Test statistic: Same question.

The questions and answers will depend very much on the applied context. In the field of community ecology, questions of this sort have been a source of vigorous debate, the subject of voluminous published research, and, on several occasions, the excuse for indecorous name-calling, for more than 20 years.

MATHEMATICAL QUESTIONS:

If you estimate the p-value using simulation, there are two generic research questions you can ask about the estimation process:

a. Convergence: Does the observed proportion converge to the correct p-value as you increase the number of samples? If not, can you alter the simulation to get convergence to the right value?

b. Rate: How fast is the convergence? In other words, how many samples do you need to get an estimate that is “close enough” to the true p-value?
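One way to start exploring the convergence and rate questions is to watch the running estimate as the number of samples grows. The following is a minimal sketch in R (not S-Plus), assuming the Martin setup from the exercises in this chapter, where the text gives the exact probability as 0.05. The variable names and the choice of checkpoints are illustrative assumptions.

```r
# Sketch only: running estimate of a p-value as samples accumulate.
ages <- c(25, 33, 35, 38, 48, 55, 55, 55, 56, 64)  # assumed: Martin ages
observed_sum <- 55 + 55 + 64                       # observed average age 58
true_p <- 0.05                                     # exact value stated in the text

set.seed(1)                                        # arbitrary seed
NReps <- 5000
yes <- replicate(NReps, sum(sample(ages, 3)) >= observed_sum)

# Estimate after 1, 2, ..., NReps samples: cumulative Yes count / samples so far.
running_p_hat <- cumsum(yes) / seq_len(NReps)

# Look at the estimate at a few checkpoints.
checkpoints <- c(10, 100, 1000, 5000)
data.frame(NReps = checkpoints, p.hat = running_p_hat[checkpoints])

# Plot the whole path against the true value.
plot(running_p_hat, type = "l", xlab = "Number of samples", ylab = "p-hat")
abline(h = true_p, lty = 2)
```

A single path shows only one possible history. Rerunning with different seeds gives a sense of how much the early estimates wander before they settle.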



Drill and practice: Null models, test statistics, p-values

1. Samples of size 1. Find p-values for the following situations.

a. Null model: All elements of S1 = {1, 2, …, 20} are equally likely.
Test statistic: t(x) = x, for x ∈ S1; larger values are more extreme.
Observed data: x0 = 3.

b. Null model: All elements of S2 = {1, 2, …, N} are equally likely.
Test statistic: t(x) = x, for x ∈ S2; larger values are more extreme.
Observed data: x0 = 3.

c. Same as (b), but x0 = N-3.

d. Null model: All 26 letters of the English alphabet are equally likely.
Test statistic: t = 1 if the letter drawn is a vowel, 0 otherwise.
Observed data: x0 = e.

2. Samples of size 2 or more. Find p-values for the following situations.

a. Null model: All subsets of size two drawn from {1, 2, …, 6} are equally likely.
Test statistic: t({x,y}) = x+y; larger values are more extreme.
Observed data: {x0, y0} = {3,6}.

b. Null model: All pairs (i,j) with i<j chosen from {1, 2, …, 10} are equally likely.
Test statistic: t1(i,j) = j – i; larger values are more extreme.
Observed data: (1,3).

c. Same as (b), but with t2(i,j) = (j – i)^2.

d. Same as (b), but with t3(i,j) = least common multiple of i and j.

e. Null model: All samples of size three chosen with replacement from {1, 2, …, 6} are equally likely. (Note that each sample is like a roll of three fair dice.)
Test statistic: Sum of the sample values; larger values are more extreme.
Observed data: (5,5,4).

3. More complicated structures. Find p-values for the following situations.

a. Null model: All 3 x 3 matrices of 0s and 1s with row totals 2, 1, 2 and column totals 2, 1, 2 are equally likely.
Test statistic: Number of checkerboard units (see bottom of page 3).
Observed data: The table on page 1.

b. Null model: All 2x2 matrices of 0s, 1s and 2s are equally likely.
Test statistic: Sum of squares of the elements of the matrix.
Observed data: Sum of squares = 10.

S-Plus exercises: very basic drill

Write S-Plus code to do the following; then run your code as a check.

4. Sample. Take a simple random sample of size 3 from {1, 2, …, 20}.

5. Test statistic. Take a simple random sample of size 3 from {1, 2, …, 20} and find the number of elements in the sample with values of 18 or more.

6. p-hat. Take NRep random samples of size 3 from the same population, and find p-hat, the proportion of samples with two or more elements with values 18 or more.

Martin v. Westvaco

7. The table below classifies salaried workers using two Yes/No questions: Under 40? and Laid off? (In employment law, 40 is a special age, because only those 40 or older belong to what is called the "protected class," the group covered by the law against age discrimination.)

                   Laid off?
Under 40?      Yes     No    Total    % Yes
Yes              4      5       9     44.4%
No              14     13      27     51.9%
Total           18     18      36     50.0%

a. Set up a null model: Tell the population (how many of what kinds of items), tell the sample size, and describe the set of equally likely samples.

b. What is the test statistic?

c. What is its observed value?

d. Use S-Plus to find the p-value.

8. 50 or older. The average age of the salaried workforce in Westvaco’s engineering department was higher than in many companies: ¾ of the 36 employees were 40 or older. Here are data like those in (7), except that ages are divided at 50 instead of 40; repeat parts (a) – (d):

                   Laid off?
Under 50?      Yes     No    Total    % Yes
Yes              5     10      15     33.3%
No              13      8      21     61.9%
Total           18     18      36     50.0%

Verbal/interpretive practice

9. Martin, continued. Does the evidence in the second table (8) provide stronger or weaker support for Martin's case? Explain. How do you account for the different messages from the two tables? Both provide evidence; how do you judge the evidence from the two tables taken together?

10. P-values. Write a short paragraph explaining the logic of p-values and significance testing in your own words.

11. Number of samples. Write a paragraph summarizing what you now know about the relationship between the number of repetitions in a simulation and the stability and reliability of the estimated p-value.

12. Graphs. In your investigation of the effect of NReps on the stability of p-hat, you used several different kinds of graphs. List at least four different kinds, and for each, tell its advantages and disadvantages in comparison with the other kinds of graphs.

More on the logic of inference

Contrast the time sequence for what actually happens when you generate statistical data with the time sequence for how you reason about it after the fact. Here's what actually happens when you produce data:

1. The setting: There is a particular given chance mechanism for producing the data, such as tossing a coin 100 times.

2. Before producing data, you can use probability calculations to determine which groups of outcomes are likely, which are unlikely. (In probability, which is a branch of mathematics, the model is known, and you use it to deduce the chances for various events that have not yet happened.)

3. The chance mechanism produces the data, e.g., 91 heads in 100 tosses of a fair coin.



Now consider the analysis using the logic of inference. What was once assumed fixed (the chance mechanism) is now regarded as unknown. In the coin example, you don't know the value p for the probability of heads. To make matters worse, what was initially unknown and variable – the number of heads you would get if you were to toss 100 times – is now fixed: you got 91 heads in 100 tosses. It may appear to make no sense to compute the probability of something that has already happened. However, according to the logic of classical inference, probabilities only apply to the data. There is no way to answer the question we really want answered: "How likely is it that p=.5?" No wonder people find this hard!

Here's the time sequence for the steps in the inference.

1. The setting: You have fixed, known data, and you want to test whether a particular model is reasonably consistent with the data.

2. You spell out the (tentative) model you want to test.

3. Now, in your imagination, you "go back in time," ignoring for the moment the fact that you already have the data, and ask "Which outcomes are likely, and which are unlikely?" In particular, you ask, "How likely is it to get data like the data we actually got (that is, data at least as extreme)?" In formal inference, this probability is the p-value.

4. The p-value is used not for prediction, the way probabilities are ordinarily used, but as a measure of surprise: "If I believe the model, how surprised should I be to get data like what I actually got?" If the p-value is small enough, the model is rejected. This is typical of the way probability is used in statistics: after the fact. (Though statistics is a mathematical science, it is not a branch of mathematics the way probability is.)
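The four steps can be traced on the coin example from this section (91 heads in 100 tosses, tentative model: a fair coin). The following is a minimal sketch in R (not S-Plus). It assumes that larger head counts are "more extreme," so the p-value is the chance of 91 or more heads; that one-sided choice and the variable names are assumptions made for illustration.

```r
# Sketch only: p-value for 91 heads in 100 tosses under a fair-coin model.
n_tosses <- 100
observed_heads <- 91          # step 1: the fixed, known data
prob_heads <- 0.5             # step 2: the tentative model (fair coin)

# Step 3, by mathematical theory: P(X >= 91) for X ~ Binomial(100, 0.5).
# pbinom(q, ..., lower.tail = FALSE) returns P(X > q), so use q = 90.
p_exact <- pbinom(observed_heads - 1, n_tosses, prob_heads, lower.tail = FALSE)

# Step 3, by simulation: "go back in time" and toss 100 fair coins, many times.
set.seed(2001)                # arbitrary seed, for repeatability
NReps <- 10000
heads <- rbinom(NReps, n_tosses, prob_heads)
p_hat <- mean(heads >= observed_heads)

# Step 4: both numbers measure surprise under the fair-coin model.
p_exact
p_hat
```

With a p-value this small, a simulation of 10,000 repetitions will almost surely record no Yes answers at all. An estimate of zero then says only that the p-value is small compared with 1/NReps, not that it is exactly zero.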



Practice Exercises: Interpretation

1. A trustworthy friend? A friend wants to bet with you on the outcome of a coin toss. The coin looks fair, but you decide to do a little checking. You flip the coin: it lands Heads. You flip again: also heads. A third flip: heads. Flip: heads. Flip: heads. You continue to flip, and the coin lands Heads nineteen times in twenty tosses. Don't try any calculations, but explain why the evidence -- 19 heads in 20 tosses -- makes it hard to believe the coin is fair.

2. The logic in (1) relies on the fact that a certain probability is small. Describe in words what this probability is, and tell how you could use simulation to estimate it.

3. Snow in July? A friendly tornado puts you and your dog Toto down in Kansas, and a booming voice from behind a screen tells you that the date is July 4 (hypothesis). However, you see snow in the air (data), and make an inference that it is not really July 4. Describe in words what probability your inference is based on.

4. Which test statistic? The Physical Simulation asked you to use the average age to summarize the set of three ages of the workers chosen for layoff. How different would your conclusions have been if you had chosen some other summary? (There is a well-developed mathematical theory for deciding which summaries work well, but that theory would take us off on a tangent. All the same, you can still think about the issues even without the theory.) Some other possible summaries are listed at the end of this question. Are any of them equivalent to the average age? Which summary do you like best, and why?

   Sum of the ages of the three who were laid off
   Average age difference (= average age of those laid off - average age of those retained)
   Number of employees 55 or older who were laid off
   Age of the youngest worker who was laid off
   Age of the oldest worker who was laid off
   Middle of the ages of the three who were laid off

5. How unlikely is "too unlikely"? The probability you estimated in the Physical Simulation is in fact exactly equal to 0.05. What if it had been 0.01 instead? Or 0.10? How would that have changed your conclusions? (In a typical court case, a probability of 0.025 or less is required to serve as evidence of discrimination. Some scientific publications use a cut-off value of 0.05, or sometimes 0.01.)

6. At the end of Round 3, there were only six hourly workers left. Their ages were 25, 33, 34, 38, 48, and 56. The 33 and 34 year olds were chosen for layoff. Think about how you would repeat the Physical Simulation using the data from Round 4.

a. What is the population? (Give a list.)

b. How big is the sample?

c. Define, in words, the probability you would estimate if you were to do the simulation.

d. Write out the rule for estimating the probability, using the format from Step A1 as a guide.



e. Give your best estimate for the probability by choosing from

1%, 5%, 20%, 50%, 80%, 95%, and 99%.

f. Is the actual outcome easy to get just by chance, or hard?

g. Does this one part of the data (Round 4, hourly) provide evidence in Martin's favor?

7. After the first three, the next hourly worker laid off by Westvaco was the other 55-year-old. What's wrong with the following argument?

Lawyer for Westvaco: "I grant you that if you choose three workers at random, the probability of getting an average age of 58 or older is only .05. But if you extend the analysis to include the fourth person laid off, the average age is lower, only 57.25 (= [55+55+55+64] / 4). An average of 57.25 or older is more likely than an average of 58 or older. So in fact, if you look at all four who were laid off instead of just the first three, the evidence of age bias is weaker than you claim."

8. Use the data from Rounds 2 and 3 combined. Tell how to simulate the chance of getting an average age of 57.25 or more using the methods of the Physical Simulation: What is the population? the sample size? Tell how to estimate the probability, following the Physical Simulation as a guide. Give your best estimate of the probability. Then tell how to use this probability in judging the evidence from Rounds 2 and 3 combined.

9. Sketch a dot graph like Display 1.4 to illustrate what you think simulations would look like for the following scenario:

Three workers were laid off from a set of ten whose ages were the same as in the Martin case. The ages of those laid off were 48, 55, and 55. If you choose three workers at random, the probability of getting an average of 52.66 or older is .166.

10. For some situations, it is possible to find probabilities by counting equally likely outcomes instead of by simulating. Suppose only two workers had been laid off, with an average age of 59.5 years. It is straightforward, though tedious, to list all possible pairs of workers who might have been chosen. Here's the beginning of a systematic listing. The first nine outcomes all include the 25-year-old and one other. The next eight outcomes all include the 33-year-old and one other, but not the 25-year-old, since the pair (25,33) was already counted.

Count   Pair chosen (in brackets = laid off)                Average age
  1     [25] [33]  35   38   48   55   55   55   56   64       29.0
  2     [25]  33  [35]  38   48   55   55   55   56   64       30.0
  3     [25]  33   35  [38]  48   55   55   55   56   64       31.5
 ...
  9     [25]  33   35   38   48   55   55   55   56  [64]      44.5
 10      25  [33] [35]  38   48   55   55   55   56   64       34.0
 11      25  [33]  35  [38]  48   55   55   55   56   64       35.5
 etc.

How many possible pairs are there? (Don't list them all!) How many give an average age of 59.5 years or older? (Do list them.) If the pair is chosen completely at random, then all possibilities are equally likely, and the probability of getting an average age of 59.5 or older equals the number of possibilities with an average of 59.5 or more divided by the total number of possibilities. What is the probability for this situation? Is the evidence of age bias stronger or weaker than in the example?

11. It is possible to use the same approach of listing and counting possibilities to find the probability of getting an average of 58 or more when drawing three at random. It turns out there are 120 possibilities. List the ones that give an average of 58 or more, and compute the probability. How does this number compare with the results of the class simulation in Physical Simulation? Why do the two probabilities differ (if they do)?

12. How would your reasoning and conclusions change if the five oldest workers among the entire group of ten were all age 55 (so that the ages of the ten were 25, 33, 35, 38, 48, 55, 55, 55, 55, 55), and the three chosen for layoff were all 55? Is the evidence of age bias stronger or weaker than in the actual case?

13. The law on age discrimination applies only to people 40 or older. Suppose that instead of looking at actual ages, you look only at whether each worker is less than 40, or 40 or older. Tell what summary statistic you would use, and tell how you would set up the model for simulating an age-neutral process for choosing three workers to be laid off. Conclude by discussing whether you think it is better to use actual ages, or just the information about whether a person is 40 or older.



Applied problems: Null models and test statistics

13. More Martin.

a. Use the 2x2 summary tables for salaried workers (in 7 and 8) to create a 3x2 summary table with three age groups: under 40, 40 to 49, and 50 or older.

b. Describe a null model that corresponds to the hypothesis of no discrimination.

c. Invent/define a test statistic that will be larger if older workers are more likely to be chosen for layoff.

d. Compute the observed value of your test statistic.

e. Find the p-value for your combination of null model, test statistic, and observed value.

14. Horse racing
Horse races in the US are run on a circular track with eight positions, numbered from 1, on the inside, closest to the rail, through 8, on the outside. You might expect that starting on the inside carries an advantage, and all it takes to confirm the conjecture is a quick look at the data in Display 1.xx.

Starting position     1    2    3    4    5    6    7    8
Number of wins       29   19   18   25   17   10   15   11

Display 1.xx. Starting position of winning horses in 144 races.14

a. Describe a null model that corresponds to the hypothesis that starting position has no effect.

b. Invent/describe a test statistic that will have larger values if lower numbered starting positions are more advantageous.

c. For the data given here, tell whether the p-value will be < 0.01, ≥ 0.01 but ≤ 0.1, or > 0.1.

14 The original source is the New York Post, August 30, 1955, page 42. The data were reprinted in Siegel, S. and N. J. Castellan (1988). Non-parametric Statistics for the Behavioral Sciences, 2nd ed., New York: McGraw-Hill, p. 47.

15. Spatial data.
One way to record the spatial distribution of a plant species is to subdivide a larger area into a grid of small squares (quadrats), and record whether the plant (Carex arenaria in Display 1.xx below) is present (1) or absent (0), in each square.

0 1 1 1 0 1 1 1 1 1 0 0 0 1 0 0 1 0 0 1 1 0 1 0
0 1 0 0 1 1 1 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 1
1 1 1 1 0 1 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0
0 0 1 1 1 1 1 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0
0 1 1 0 1 1 1 1 1 0 0 0 0 1 1 1 0 1 1 0 0 0 0 0
0 1 0 0 0 1 0 1 0 1 1 1 1 1 0 0 0 0 1 1 0 0 0 0
0 1 0 1 1 0 1 0 1 0 0 0 1 0 0 1 1 0 0 1 0 0 0 0
0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0
0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0
0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0
0 0 1 0 0 0 1 1 0 1 0 1 1 1 1 1 1 1 0 0 0 0 1 1
0 1 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 1 0 1
0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 1 1 0 0 1
1 0 0 0 0 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 0 0 1 1 0 1 0 1
0 1 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 1 1 1 1 0 1
1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 1
0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1
1 0 0 0 0 1 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 1
0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0

Display 1.xx Presence (1) or absence (0) of plants of the species Carex arenaria within the cells of a 24x24 grid15

a. Suppose you want to test the hypothesis that the plants distribute themselves randomly, without regard to how close they are to others of the same species. Thus the chance of finding a plant in any one quadrat doesn’t depend on whether or not there are plants in neighboring quadrats. Consider two possible null models, one that keeps only the total number of ones fixed, and a second that keeps the row and column sums fixed. Which model is more appropriate, and why?

b. Consider two alternatives to the null model: (i) The presence of the plant in any one quadrat makes its presence in neighboring quadrats more likely. (ii) The presence of the plant in a quadrat makes its presence in neighboring quadrats less likely. Devise and define two test statistics, one for each alternative.

15 Data taken from Strauss, D. (1992). “The many faces of logistic regression,” The American Statistician 46, 321-327.



The effect of the number of samples on the precision of your estimate

How many repetitions (how many samples) do you need for a given level of precision in your estimated p-value? Here is a possible rough guide: “95% of the time, the distance between your estimate and the true p-value will be at most 1/sqrt(NRep).” Gather evidence until you can decide which of the following is true: (a) The rule is false. (b) The rule is true, but it can be improved on, because the chances are in fact greater than 95%. (c) The rule is true, and it can’t be improved on.
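One way to gather such evidence is to repeat an entire simulation many times in a setting where the true p-value is known, and record how often the estimate falls within 1/sqrt(NRep) of it. The following is a minimal sketch in R (not S-Plus), assuming the Martin setup from the exercises, for which the text gives the exact probability as 0.05. The function names one_p_hat and coverage, and the values of NRep and NSims, are hypothetical choices made for illustration.

```r
# Sketch only: how often does p-hat land within 1/sqrt(NRep) of the true p?
ages <- c(25, 33, 35, 38, 48, 55, 55, 55, 56, 64)  # assumed: Martin ages
observed_sum <- 55 + 55 + 64                       # observed average age 58
true_p <- 0.05                                     # exact value stated in the text

# One complete simulation: NRep random samples, returning the estimate p-hat.
one_p_hat <- function(NRep) {
  mean(replicate(NRep, sum(sample(ages, 3)) >= observed_sum))
}

# Repeat the complete simulation NSims times and return the fraction of
# estimates whose distance from true_p is at most 1/sqrt(NRep).
coverage <- function(NRep, NSims) {
  estimates <- replicate(NSims, one_p_hat(NRep))
  mean(abs(estimates - true_p) <= 1 / sqrt(NRep))
}

set.seed(729)                                      # arbitrary seed
coverage_by_NRep <- sapply(c(25, 100, 400), coverage, NSims = 200)
coverage_by_NRep                                   # compare each value with 0.95
```

A true p-value of 0.05 is only one case. Deciding among (a), (b), and (c) calls for trying other true p-values as well, for example by changing the observed value so that the exact p-value is closer to 0.5.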

Short Investigations

1. What is the effect of ties on the permutation test? For example, compare drawing three at random from {1, 2, …, N} and drawing three at random from {1, 1, 2, …, N-1}.

2. What is the effect of the “cut point” when you turn a quantitative variable like age into a categorical variable like “Under 40” or “40 or older”?

3. What is the effect of the choice of test statistic on the performance of the test?

If you use simulations to estimate the p-value, different simulations will give different results. Large simulations (many samples; large value of NRep) are more stable and reliable than small ones. Invent a way to tell from a set of simulation results whether or not you’ve done enough to stop.
