Discrete Markov Chain Monte Carlo
George W. Cobb, Mount Holyoke College
Chapter 2: Randomization tests

There is a name for the method used by Connor and Simberloff to compare the observed number of checkerboards (10 for the finch data) with what you could expect to get if species had distributed themselves purely at random. The method belongs to the general area of statistics called hypothesis testing, and more specifically, the method is an instance of a randomization test. (In ecology, randomization tests are often named for the models they are based on, called null models.) Until the 1980s, randomization tests tended to be limited to comparatively simple kinds of data sets, because there was no general and well-understood method for generating random data sets in more complicated situations. Computers and related theory have been changing that during the last two decades, so that randomization tests have grown in importance, and are now used much more often than in the past. Much of the mathematics in this book is tied to the theory of how to create random data sets in order to carry out randomization tests. The following example, which comes from a court case, illustrates the use of randomization tests in one of the simpler situations where it is easy to see how to generate random data sets.

2.1 Martin vs. Westvaco: An Introduction to Randomization Tests

Robert Martin turned 55 during 1991. Earlier in that same year the Westvaco Corporation, which makes paper products, decided to downsize. They laid off several members of their engineering department, where Bob Martin worked, and he was one of those who lost their jobs. Later that year, he hired a lawyer to sue Westvaco, claiming he had been laid off because of his age. A major piece of Martin's case was based on a statistical analysis of the ages of the employees at Westvaco.

    At the time the layoffs began, Bob Martin was one of 50 people working at various jobs in the engineering department of Westvaco's envelope division. Some were paid by the hour; others, like Martin, who had more education and greater responsibility, were salaried. Over the course of the spring, Westvaco's management went through five rounds of planning for a reduction in force. In Round 1, they decided to eliminate 11 positions. In Round 2 they added 9 more to the list. By the time the layoffs ended, after all 5 rounds, only 22 of the 50 workers had kept their jobs, and the average age in the department had fallen from 48 to 46.

    Display 2.1 shows the data provided by Westvaco to Martin's lawyers.1 Each row corresponds to one worker, and each column corresponds to some feature: job title, whether hourly or salaried, the date of hire and age as of the first of January, 1991 (shortly before the layoffs). The last column tells how the worker fared in the downsizing: a 1 means chosen for layoff in Round 1 of planning for the reduction in force, a 2 means Round 2, and similarly for 3, 4 or 5; however, 0 means "not chosen for layoff."

    1 The statistical analysis in the lawsuit (CA No. 92-03121-F) used all 50 employees in the Engineering Department of the Envelope Division, with separate analyses for exempt (salaried) and non-exempt (hourly) workers.


Row  Job title                 Pay  Birth (mo yr)  Hire (mo yr)  RIF round  Age 1/1/91
 1   Engineering Clerk          H      9 66           7 89          0          25
 2   Engineering Tech II        H      4 53           8 78          0          38
 3   Engineering Tech II        H     10 35           7 65          0          56
 4   Secretary to Engin Manag   H      2 43           9 66          0          48
 5   Engineering Tech II        H      8 38           9 74          1          53
 6   Engineering Tech II        H      8 36           3 60          1          55
 7   Engineering Tech II        H      1 32           2 63          1          59
 8   Parts Crib Attendant       H     11 69          10 89          1          22
 9   Engineering Tech II        H      5 36           4 77          2          55
10   Engineering Tech II        H      8 27          12 51          2          64
11   Technical Secretary        H      5 36          11 73          2          55
12   Engineering Tech II        H      2 36           4 62          3          55
13   Engineering Tech II        H      9 58          11 76          4          33
14   Engineering Tech II        H      7 56           5 77          4          35
15   Customer Serv Engineer     S      4 30           9 66          0          61
16   Customer Serv Engr Assoc   S      2 62           5 88          0          29
17   Design Engineer            S     12 43           9 67          0          48
18   Design Engineer            S      3 37           6 74          0          54
19   Design Engineer            S      3 36           2 78          0          55
20   Design Engineer            S      1 31           3 67          0          60
21   Engineering Assistant      S      6 60           7 86          0          31
22   Engineering Associate      S      2 57           4 85          0          34
23   Engineering Manager        S      2 32          11 63          0          59
24   Machine Designer           S      9 59           3 90          0          32
25   Packaging Engineer         S      3 38          11 83          0          53
26   Prod Spec - Printing       S     12 44          11 74          0          47
27   Proj Eng-Elec              S      9 43           4 71          0          48
28   Project Engineer           S      7 49           9 73          0          42
29   Project Engineer           S      8 43           4 64          0          48
30   Project Engineer           S      6 34           8 81          0          57
31   Supv Engineering Serv      S      4 54           6 72          0          37
32   Supv Machine Shop          S     11 37           3 64          0          54
33   Chemist                    S      8 22           4 54          1          69
34   Design Engineer            S      9 38          12 87          1          53
35   Engineering Associate      S      2 61           9 85          1          30
36   Machine Designer           S      2 39           4 85          1          52
37   Machine Parts Cont-Supv    S     10 28           8 53          1          63
38   Prod Specialist            S      9 27          10 43          1          64
39   Project Engineer           S      7 25           9 59          1          66
40   Chemist                    S     12 30          10 52          2          61
41   Design Engineer            S      4 60           5 89          2          31
42   Electrical Engineer        S     11 49           3 86          2          42
43   Machine Designer           S      3 35          12 68          2          56
44   Machine Parts Cont Coor    S      9 37          10 67          2          54
45   VH Prod Specialist         S      5 35           9 55          2          56
46   Printing Coordinator       S      2 41           1 62          3          50
47   Prod Dev Engineer          S      6 59          11 85          3          32
48   Prod Specialist            S      7 32           1 55          4          59
49   VH Prod Specialist         S      3 42           4 62          4          49
50   Engineering Associate      S      8 68           5 89          5          23

    Display 2.1 The data in Martin versus Westvaco.


    On balance, the patterns in the Martin data show that the percentage of people laid off was higher for older workers than for younger ones. One of the main arguments in the case was about what those patterns mean: are the patterns "real," or could they be due just to natural variation? There's no way to repeat Westvaco's actual decision process, which means there's no way to measure the variability in that process. In fact, it's hard to say precisely what "natural variability" really means. It is possible, however, first to define a simple, artificial, age-neutral decision process (a null model), then to repeat that process, and use the results to ask whether that process is variable enough to give results as extreme as Westvaco's.

A comprehensive analysis to answer that question would be quite involved. For now, though, you can get a pretty good idea of how the analysis goes by working with just a subset of the data. Here are the ages and Row IDs of the ten hourly workers involved in the second of the five rounds of layoffs, arranged from youngest to oldest. The three who were laid off (ages 55, 55, and 64) are marked with an asterisk:

Age      25   33   35   38   48   55   55*  55*  56   64*
Row ID    1   13   14    2    4   12    9   11    3   10

What to make of the data requires balancing two points of view. On one hand, the pattern in the data is pretty striking. Of the five people under age 50, all kept their jobs. Of the five who were 55 or older, only two kept their jobs. On the other hand, the numbers of people involved are pretty small: just three out of ten. Should you take seriously a pattern involving so few people? The two viewpoints correspond to two sides of an argument that was at the center of the statistical part of the Martin case. Here's a simplified version.2

    Martin: Look at the pattern in the data: All three of the workers laid off were much older than average. That's evidence of age bias.

    Westvaco: Not so fast! You're only looking at ten people total, and only three jobs were eliminated. Just one small change and the picture would be entirely different. For example, suppose it had been the 25-year-old instead of the 64-year-old who was laid off. Switch the 25 and the 64, and you get a totally different set of averages:

Actual data:    25   33   35   38   48   55   55*  55*  56   64*
Altered data:   25*  33   35   38   48   55   55*  55*  56   64
                (* = laid off)

Average ages:       Laid off   Kept
Actual data           58.0     41.4
Altered data          45.0     47.0

    2 I owe the idea of a dialog to Statistics (1978) by David Freedman, Robert Pisani, and Roger Purves, now in a third edition (1998).


    See! Just one small change and the average age of the three who were laid off is actually lower than the average age of the others.

    Martin: Not so fast, yourself! Of all the possible changes, you picked the one that is most favorable to your side. If you'd switched one of the 55-year-olds who got fired with the 55-year-old who kept his job, the averages wouldn't change at all.

    Why not compare what actually happened with all the possibilities that might have happened? Start with the ten workers, and pick three at random. Do this over and over, to see what typically happens, and compare the actual data with these results.

    Westvaco: But you'd be ignoring relevant information, things like worker qualifications, and which positions were easiest to do without.

    Martin: I agree. But you're changing the subject. Remember our question: "Is the sample large enough to support a conclusion?" That's a pretty narrow question. It doesn't say anything about why the workers were chosen. At this point, we're just asking "If you treat all ten workers alike, and pick three at random without regard to age, how likely is it that their average age will be 58 or more?"

    You can use simulation to estimate the probability p that if you draw three workers at random, just by chance you will get an average age of 58 years or more.

    Randomization tests by simulation: generate, compare, estimate

Generate a large number (NReps) of random data sets. Here, each “data set” is a random subset of 3 workers’ ages chosen from the 10.

    Compare each random data set with the actual data. Is the average age for the random data set greater than or equal to 58? (Yes/No)

    Estimate the probability p by the observed proportion of Yes answers:

    p̂ = (# Yes)/(# Repetitions) .

    If p̂ is tiny, you know that an average age of 58 is too extreme to occur just by chance. Some other explanation is needed.

    For most applications, it will be necessary to carry out the three steps on a computer, but this deliberately simplified example is one you can do by drawing marbles out of a bucket or the equivalent.

    Activity 2.1 (Physical simulation): Did Westvaco Discriminate?

Step 1. Generate random data sets. Write each of the ten ages on identical squares cut from 3x5 cards, and put them in a box: 25, 33, 35, 38, 48, 55, 55, 55, 56, 64. Mix the squares thoroughly and draw out three at random without replacement.

    Step 2. Compare each random data set with the actual data (55, 55, 64) Compute the average age for the sample. Is the value ≥ 58? (Record Yes or no.)


    Step 3. Estimate the value of p using the observed proportion.

    Repeat Steps 1 and 2 ten times. Combine your results with those from the rest of the class before you compute the proportion p̂ = (# Yes) / (# Repetitions).

Your chance model in the physical simulation is completely age neutral: All sets of three workers have exactly the same chance of being selected for layoff, regardless of age. The simulation tells you what sort of results are reasonable to expect from that sort of age-blind process. Here are the average ages of those chosen in the first four of 1000 repetitions from such a model:

Repetition       1       2       3       4
Average age    42.67   48.00   42.67   37.00

Display 2.2 is a plot that shows the distribution of average ages for 1000 repetitions of the sampling process.

Display 2.2  Results of 1000 repetitions: the distribution of the average age of those chosen for layoff by the chance model (horizontal axis: average age of those chosen, 30 to 60; vertical axis: number of times, 0 to 50).


Out of 1000 repetitions, only 49, or about 5%, gave an average age of 58 or older. So it is not at all likely that just by chance you'd pick workers as old as the three Westvaco picked. Did the company discriminate? There's no way to tell just from the numbers alone. However, if your simulations had told you that an average of 58 or older is easy to get by chance alone, then the data would provide no evidence of discrimination. If, on the other hand, it turns out to be very unlikely to get a value this big just by chance, statistical logic says to conclude that the pattern is "real," that is, more than just coincidence. It is then up to the company to explain why their decision-making process led to such a large average age for those laid off.

    The logic of the last paragraph may take some time to get used to, but it can help to recast the logic in the form of a real argument between two people. Here's an imaginary version of such an argument.

    Martin: Look at the pattern in the data: All three of the workers laid off were much older than average.

    Westvaco: So what? I claim you could get a result like that just by chance. If chance alone can account for the pattern, there's no reason to look for any other explanation.

    Martin: OK, let's test your claim. If it's easy to get an average as big as 58 by drawing at random, I'll agree that we can't rule out chance as one possible explanation. But if an average that big is really hard to get from random draws, we agree that chance alone can't account for the pattern. Right?

    Westvaco: Right.

    Martin: Here are the results of my simulations. If you look at the three hourly workers laid off in round two, the probability is only 5% that you could get an average age of 58 or more. And if you do the same computations for the entire engineering department, the probability is a lot less, about 0.01, or one out of 100. What do you say to that?

    Westvaco: Well ... I'll agree that it's really hard to get patterns that extreme just by chance, but that by itself still doesn't prove discrimination.3

In principle we can apply the same three steps to the finch data, using the number of checkerboards in place of average age for comparing data sets in Step 2. Our estimate in Step 3 would then give an answer to the question Connor and Simberloff asked: "If you generate data sets purely at random, so that each data set has the same chance as each of the others, how likely are you to get 10 or more checkerboards?" Although in principle this three-step approach will answer the question, in practice it is hard to carry out Step 1, because there is no quick and simple way to generate random data sets. In a sense, much of the rest of this book deals with the mathematics of solving this problem, along with related questions that have yet to be answered.

    3 In the actual case, an analysis based on all 50 employees in the department gave a p-value much less than .05. Martin and Westvaco reached a settlement out of court before the case went to trial.


2.2 An informal introduction to S-Plus

A useful reference: http://lib.stat.cmu.edu/S/cheatsheet

Opening a new script file in S-Plus:
  Click on the S-Plus icon.
  Click OK to use existing data.
  Start a new script file (File > New > Script file).

Warm-up

There is a standard statistical vocabulary to describe choosing a random subset from some larger set: The larger set that you choose from is called a population; the random subset that you choose is called a sample. In the Martin example, the population is the set of ten ages {25, 33, 35, 38, 48, 55, 55, 55, 56, 64}. The set of three chosen (e.g., {55, 55, 64}) is the sample.

Several sets of lines of S-Plus code are shown below. For each set of lines, first make a guess about what the code will do. Then type the code into the top part of the split window of the script file. This is where you can enter and edit code. (Commands and keystrokes for editing are pretty much the same as in Microsoft Word.) Finally, click on the "run" button, the solid triangle in the left margin of the second toolbar, in the column below File. This will execute your code in the bottom half of the window, and let you check whether your guess was correct.

Populations as vectors

1a  Pop <- c(25, 33, 35, 38, 48, 55, 55, 55, 56, 64)


Populations and samples

2a  sample(Pop,3,replace=F)

2b  sum(sample(Pop,3,replace=F))

2c  Pop2 <- sample(Pop,3,replace=F)
    mean(Pop2) >= 58

Using a programming loop to create many samples4

Read through the following S-Plus code to see what it does. Note that a # separates comments from the executable code.

3   # Draw random samples of size 3, without replacement, from a given population,
    # and determine whether the average is >= 58.
    # Repeat this process NRep times, and find the proportion of samples that
    # have an average age of 58 or more.
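The executable lines can be sketched as follows; the variable names NRep, Count, and pHat follow the comments above, and the details (loop style, number of repetitions) are one reasonable choice rather than the only one:

Pop <- c(25, 33, 35, 38, 48, 55, 55, 55, 56, 64)   # the ten ages
NRep <- 1000                                       # number of repetitions
Count <- 0                                         # how many samples average 58 or more
for (i in 1:NRep) {                                # Begin the loop
    AvgAge <- mean(sample(Pop, 3, replace=F))      # average age of 3 workers chosen at random
    if (AvgAge >= 58) Count <- Count + 1           # compare with the observed value
}                                                  # End the loop
pHat <- Count/NRep                                 # estimated p-value
pHat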


2.3 Randomization tests, I: The two-sample permutation test

The Martin example is typical of a large class of situations. Here is another instance:

Example 1. Calcium and blood pressure. To test whether taking calcium supplements can reduce blood pressure, investigators used a chance device to divide 21 male subjects into two groups. One group of 10 men, the treatment group, were given calcium supplements and told to take them every day for 12 weeks. The other 11 men, the control group, were given pills that looked the same as the supplements (a placebo), and given the same instructions: take one every day. Neither the subjects themselves nor the people giving out the pills and taking blood pressure readings knew which pills contained the calcium. (The experiment was double blind.) Subjects had their blood pressure read at the beginning of the study and again at the end. The numbers below tell the reduction in systolic blood pressure (when the heart is contracted), in millimeters of mercury. (Negative values mean that the blood pressure went up.)

Calcium:  7, -4, 18, 17, -3, -5, 1, 10, 11, -2
Placebo: -1, 12, -1, -3, 3, -5, 5, 2, -11, -1, -3

Here are the same numbers, arranged in order, with the values in the treatment (calcium) group marked with an asterisk:

-11  -5*  -5  -4*  -3*  -3  -3  -2*  -1  -1  -1  1*  2  3  5  7*  10*  11*  12  17*  18*

Notice that for this example, as for Martin, there are two groups to compare, in this instance those assigned to the treatment group, and those assigned to the placebo group.5 Here also, as in the Martin example, the information we have available for comparing the two groups is quantitative, and we can judge the results using the average reduction in blood pressure for the calcium group, which was 5 millimeters of mercury.

    5 There is an extremely important difference that makes this example different from Martin, however. For the calcium study, the two groups in fact were chosen purely at random. For the Martin example, there wasn’t any actual randomization; instead, random selection was the null model being tested. For a randomized controlled experiment like the calcium study, the randomization was a deliberate part of the experimental design. Although the source of the randomization makes no difference in how you calculate the p-value, it makes a tremendous difference in what it tells you. For Martin, a tiny p-value tells you to reject the null model of random selection. For the calcium study, that’s not a logical option, because the randomization actually occurred. That randomization guarantees that there are only two possible explanations for the observed difference between the treatment and control groups: chance, or the treatment.


Was the calcium supplement effective in lowering blood pressure? Here's how the logic goes: The only differences between the two groups were (1) the calcium, and (2) differences created by the random assignment. Assume for the moment that the calcium had no effect. Then the observed reduction of 5 mm Hg in the calcium group was due purely to chance, that is, to the random assignment. To see whether chance is a believable way to account for the average of 5, we ask, "If you take the 21 blood pressure values, and choose 10 of them at random, how likely is it that you'll get an average of 5 or more?" If this probability, the p-value, is tiny, we conclude that chance is not a believable explanation; it must be due to the calcium treatment.

Exercise:

7. (a) I used 10,000 repetitions to estimate this probability, and got 0.0813. What do you conclude?
   (b) Use the S-Plus code from before to compute the p-value, with NReps = 1000. How far is your estimate from mine? Which value is more reliable?

Example 2. Hospital carpets. In a hospital, noise can be an irritation that interferes with a patient's recovery. Putting down carpeting in the rooms would cut down on noise, but the carpeting might tend to harbor bacteria. To study this possibility, doctors at a Montana hospital conducted an experiment to see whether rooms with carpeting had higher levels of airborne bacteria than rooms with bare floors. They began with 16 rooms and randomly chose eight to have carpeting installed. The other eight were left bare. At the end of their test period, they pumped air from each room over a culture medium (agar in a petri dish), allowed enough time for the bacterial colonies to grow, and recorded the number of colonies per cubic foot of air. Here are the results:

Carpeted floors               Bare floors
Room #   Colonies/cu.ft.      Room #   Colonies/cu.ft.
212          11.8             210          12.1
216           8.2             214           8.3
220           7.1             215           3.8
223          13.0             217           7.2
225          10.8             221          12.0
226          10.1             222          11.2
227          14.6             224          10.1
228          14.0             229          13.7
Average      11.2             Average       9.8

Display 2.3 Levels of airborne bacteria for 16 hospital rooms

Exercise:

8. Estimate the p-value for testing the hypothesis that carpeting had no effect on the levels of airborne bacteria. (Find the chance that if you choose 8 values at random from the 16 bacteria levels, you'll get an average of 11.2 or more. Use 10,000 repetitions.)


    The three examples, Martin, calcium, and carpets, all have the same abstract structure:

Summary: Two-sample permutation tests

Data:            Two groups (samples) of numerical values, n1 in Group 1 and n2 in Group 2.
Test statistic:  Average (mean) of the values in Group 1.
Observed value:  Group 1 average for the actual data.
Null model:      All possible ways to choose n1 values (a random sample) from the combined set of n1 + n2 values (the population) are equally likely.
p-value:         The chance that the average for a random sample is at least as large as the observed value.
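Because the three examples share this structure, the summary translates almost directly into a single piece of code. Here is a sketch of a general-purpose two-sample permutation test in the same S-Plus style as the earlier loop; the function name perm.test and its argument names are ours, introduced only for this illustration:

perm.test <- function(group1, group2, NRep = 10000) {
    pop <- c(group1, group2)                  # the combined population
    n1 <- length(group1)
    observed <- mean(group1)                  # observed value of the test statistic
    Count <- 0
    for (i in 1:NRep) {
        # a random sample of n1 values plays the role of Group 1
        if (mean(sample(pop, n1, replace=F)) >= observed) Count <- Count + 1
    }
    Count/NRep                                # estimated p-value
}

# For instance, with the calcium data of Example 1:
Calcium <- c(7, -4, 18, 17, -3, -5, 1, 10, 11, -2)
Placebo <- c(-1, 12, -1, -3, 3, -5, 5, 2, -11, -1, -3)
perm.test(Calcium, Placebo)    # should land near the .08 reported in Exercise 7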

Example 3. Speed limits and traffic deaths. The year 1996 offered an unusual opportunity to scientists who study traffic safety. Until that year, states had to keep highway speeds at 55 miles per hour or below in order to receive federal money. Then, toward the end of 1995, a new federal law took effect, one that allowed states to raise their speed limits. Thirty-two states did just that, either at the beginning of 1996 or at some point during the year. The other 18 states and the District of Columbia kept the 55 mph limit. Conventional wisdom had it that increasing the speed limit would lead to more highway deaths. The change in the law gave scientists a chance to test this hypothesis. The numbers in Display 2.4 show the percentage change in numbers of highway traffic deaths between 1995 and 1996, for all 50 states and DC.6

States that kept 55 mph:
  AK -29.0   CT  -4.4   DC -80.0   HI -25.0   IN -13.2   KY   3.4   LA  -5.4   ME -14.3   MN  10.8   ND -50.0
  NH -20.0   NJ  44.1   NY  -9.7   OR -16.4   SC  32.1   VA  -9.1   VT -41.2   WI  41.4   WV  23.2

States that raised the speed limit:
  AL  24.5   AR  41.3   AZ   0.0   CA   4.4   CO -19.1   DE  30.0   FL   8.2   GA  32.1   IA  41.4   ID -17.9   IL   9.4
  KS -13.3   MA  33.3   MD  -1.8   MI  -7.9   MO  50.7   MS  17.6   MT  17.9   NC   5.4   NE  62.5   NM   3.4   NV  17.9
  OH   1.6   OK  34.1   PA  -7.0   RI  18.2   SD  22.2   TN   4.0   TX  14.8   UT   9.4   WA  34.3   WY -31.5

    Display 2.4 Percentage change in traffic deaths

Here is how the features of this example correspond to the elements of the abstract summary. The two groups are the states that raised the 55 mph speed limit (Group 1) and those that kept it (Group 2). The test statistic is the average percent change in highway deaths for Group 1. Its observed value is the actual average for those states, which works out to 13.2.

6 Data from Ramsey, Fred L. and Daniel W. Schafer (2002). The Statistical Sleuth, 2nd ed. Pacific Grove, CA: Duxbury. Original source: "Report to Congress: The effect of increased speed limits in the post-NMSL era," National Highway Traffic Safety Administration, February, 1998.


The null model is that all ways to choose 32 numbers from the set of 51 listed in the table are equally likely. The p-value is the chance of getting a group average of 13.2 or more; this works out to about 0.005.

Discussion question:

9. What would change, and what would be the same, if you defined Group 1 to be the states that didn't raise their speed limits?

Example 4. O-rings. The explosion of the Challenger space shuttle has received a lot of attention from statisticians because the disaster and loss of the astronauts' lives could have been prevented by fairly simple data analysis. The explosion was caused by failure of O-ring seals that allowed rocket fuel to leak and explode, and an investigation concluded that the O-ring failures were themselves caused by the low temperature at the time of the launch. The summary below7 shows the relationship between air temperature at launch time and the number of "O-ring incidents" per launch for 24 launches.

Launch temperature
Below 65°:   1  1  1  3
Above 65°:   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  2

Exercises:

10. Identify the components (samples, null model, test statistic, observed value, p-value) for the Martin and calcium examples.

11. Guess: Will the p-value for Example 4 turn out to be closest to .5, .1, .05, .01, or .001? After you guess, use S-plus to estimate the p-value.

12. The faulty analysis before the launch ignored all the 0s in the data and looked only at the temperature on the days for which the launches had problems with the O-rings. Ignoring 0s gives the following summary:

Launch temperature
Below 65°:   1  1  1  3
Above 65°:   1  1  2

Guess: Will the p-value turn out to be closest to .5, .1, .05, .01, or .001? Then use S-plus to estimate the p-value.

7 Data from Ramsey, Fred L. and Daniel W. Schafer (2002). The Statistical Sleuth, 2nd ed. Pacific Grove, CA: Duxbury, p. 86. Original source: Feynman, Richard P. (1988). What Do You Care What Other People Think? New York: W. W. Norton.


2.4 Randomization tests, II: Fisher's exact test

So far, the data values in the examples have been quantitative: ages, reduction in blood pressure, levels of airborne bacteria. What if the data values are categorical? One of the simplest randomization tests is Fisher's exact test, which is used to test hypotheses about data that can be summarized in a 2x2 table of counts.

Example 5. The Salem witchcraft hysteria. The year 1692 saw nineteen convicted witches hanged in Salem Village (now Danvers), Massachusetts. Almost three centuries later, historians examining documents related to the trials discovered a striking pattern relating trial testimony and geography. Those who testified against the accused witches tended to live in the western part of Salem Village; those who testified in defense of the accused tended to live in the eastern part, which was wealthier, more commercial, more cosmopolitan, and closer to the town of Salem, at the time the second busiest port in the colonies.8 A total of 61 residents testified in the trials, of whom 35 lived in the western part of the village, and 26 in the eastern part. Of the 35 "westerners," 30 were "accusers" and only 5 were "defenders;" of the 26 easterners, only 2 were accusers; the remaining 24 were defenders:

                     Testimony
Geography      Accuser   Defender   Total   % accuser
West              30         5        35      85.7%
East               2        24        26       7.7%
Total             32        29        61

Display 2.5 Geography and testimony in the Salem witch trials of 1692

Is it possible to get a pattern as extreme as this just by chance? If the relationship between geography and testimony were purely random, how likely would a pattern this extreme be? Represent the residents who testified by poker chips, 35 marked "West" and 26 marked "East." Put all 61 chips in a bag, mix thoroughly, and draw out 32 at random. Call these Accusers; count how many of the accuser chips say West, how many say East, and record the results in a table.9 I just did this, and got:

                     Testimony
Geography      Accuser   Defender   Total   % accuser
West              19        16        35      54.3%
East              13        13        26      50.0%
Total             32        29        61

    Display 2.6 Results for a sample of accusers drawn at random

8 Historians Boyer and Nissenbaum cite this pattern as part of the evidence in support of an economically based interpretation of the witchcraft hysteria. See Boyer, Paul and Stephen Nissenbaum (1974). Salem Possessed: The Social Origins of Witchcraft, Cambridge: Harvard University Press.

9 The chips left in the bag are the defenders. You could count them to fill in the rest of the table, or you could get the missing values by subtraction, since they are determined by what you draw out.


My random data set is not nearly as extreme as the actual one. I repeated the whole process -- random draws, count, compare -- 10,000 times, using the S-Plus code in Display 2.7, and not once did I get a table as extreme as the actual data. Conclusion: If you draw at random, it is all but impossible to get a data table like the observed one. In other words, "It's just a chance relationship" is not a believable explanation for the data.

For the S-Plus simulation, I used 0s and 1s to represent East and West. There were 26 people from the East who testified, and 35 from the West, so my population has 26 0s and 35 1s.
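Here is a sketch of a simulation along the lines of Display 2.7 (the details, such as the number of repetitions, are one reasonable choice): draw 32 of the 61 chips at random to play the role of the accusers, and count how often 30 or more of them are Westerners.

pop <- c(rep(0, 26), rep(1, 35))       # 0 = East (26 people), 1 = West (35 people)
NRep <- 10000
Count <- 0
for (i in 1:NRep) {
    accusers <- sample(pop, 32, replace=F)        # choose 32 "accusers" at random
    if (sum(accusers) >= 30) Count <- Count + 1   # at least as many Westerners as observed?
}
pHat <- Count/NRep                                # estimated p-value
pHat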


Drill Exercises:

13. A small version of the witch data. Assume that only 10 people had testified, of whom 4 lived in the west, 6 in the east. Assume also that 3 were accusers, and that all 3 came from the west. Set up the population, null model, test statistic and observed value. Then find the p-value and state your conclusion.

14. There is more than one way to define the population and null model for S-Plus. Two easy variations are (a) to reverse the labels for 1s and 0s, so that 1 represents East and 0 represents West, and/or (b) to reverse the labels for "in the sample" and "not in the sample," so that the sample represents Defenders, and those not in the sample are the Accusers. For each of these variations, tell what the test statistic would be, and how to compute the p-value. Verify that the p-value is the same for all these variations.

15. A more substantive variation reverses the roles of population and sample: Let the 1s and 0s in the population tell whether an individual was an Accuser or Defender. Let "in the sample" correspond to "West," and "not in the sample" to "East." Define a test statistic, and tell how to modify the S-Plus code in Display 2.7 to compute the p-value. (Optional: Run your modified S-Plus code, and verify the (non-obvious) fact that, apart from random variation, it is equal to the p-value from the original code.)

Example 6. US v. Gilbert. For several years in the 1990s, Kristen Gilbert worked as a nurse in the intensive care unit (ICU) of the Veteran's Administration hospital in Northampton, Massachusetts. Over the course of her time there, other nurses came to suspect that she was killing patients by injecting them with the heart stimulant epinephrine.10 Part of the evidence against Gilbert was a statistical analysis of more than one thousand 8-hour shifts during the time Gilbert worked in the ICU. Was there an association between Gilbert's presence on the ICU and whether or not someone died on the shift?11 Here are some fictitious data of the same sort as the actual data from the grand jury testimony:

                          Death on shift?
K.G. present on shift?    Yes      No     Total    % Yes
Yes                        27     273      300      9.0%
No                         18     882      900      2.0%
Total                      45    1155     1200

Display 2.8 Fictitious data on possible association between nurse Gilbert's presence in the ICU of the Northampton VA hospital and deaths on a shift.

10 A synthetic form of adrenaline.

11 As of this writing, the actual data are not yet public. They figured prominently in the grand jury testimony that led to Gilbert's indictment, but at the subsequent trial, the judge ruled that they could not be shown to the jury.


Drill Exercises:

16. Define the population of 0s and 1s: What does a 0 represent? A 1? How many 0s, and how many 1s, are in the population? (Note: There is more than one right way to do this.)

17. Define the null model: If you think of drawing a random sample from the population, what does "drawn out" (that is, in the sample) represent? What does "not drawn out" represent?

18. Define the test statistic: What does the number of 1s in the sample represent? What is the observed value of the test statistic?

19. p-value. Modify the S-Plus code in Display 2.7 so that it would compute the p-value. (Optional: Compute the p-value using your code.)

Example 7. Anthrax. After Senator Tom Daschle received a letter containing anthrax, the Hart Senate Office Building was fumigated in an attempt to kill the spores. After the first fumigation, public health officials conducted a multi-part test to see whether the building was safe to work in. In the first phase of the test, 17 strips capable of detecting live anthrax spores were placed throughout the test area, and later were checked for anthrax. Five of the 17 were positive. In the second phase of the test, another 17 strips were placed in the same locations, but this time suitably protected technicians walked around on the carpet, moving the room air in the process, to simulate normal office traffic. This time 16 of the 17 strips were positive. The results of these tests led to a second, more vigorous, and successful fumigation.

Exercises

20. Summarize the test results in a 2x2 table.

21. Define a suitable population, null model, and test statistic for Fisher's exact test.

22. Use S-plus to compute an appropriate p-value.

Discussion question:

23. Just as each strip in the test for anthrax can show a false positive (indicate anthrax when none is present) or a false negative (indicate no anthrax when it is in fact present), statistical tests can also show false positives and false negatives. For the study of calcium supplements, what would a false positive be? A false negative?

Summary. Fisher's exact test is appropriate when (1) you want to compare two randomly chosen groups of individuals, and (2) the feature of the individuals that you use


    to make the comparison is dichotomous – reducible to yes/no. Think of randomly drawing marbles from a bucket. This gives two randomly chosen groups: those drawn out, and those left in. The marbles are of two colors; color is the feature used to compare the two groups. For the Salem witch data (Example 5), we asked, “What if the ‘accusers’ and ‘defenders’ had been chosen at random?” In that example, the actual accusers and defenders were not chosen at random, but we wanted to test whether the observed data was consistent with random selection. So under our null model, the ‘accusers’ and ‘defenders’ were the randomly chosen groups. The feature used for the comparison was geography, east or west. For the Gilbert data (Example 6) we asked, “What if the deaths had occurred on randomly chosen shifts?” Here, also, the actual shifts were not chosen that way, but we wanted to compare the actual data with what we would be likely to get if the shifts had been chosen randomly. Thus shifts with and without deaths were the randomly chosen groups. The feature used for comparing groups was whether or not Gilbert was present on the shift.

2.5 Randomization tests, III: Variations

Variation 1: Dichotomizing a numerical variable. Although Fisher's exact test is designed for dichotomous populations – those with just two kinds of individuals – it is possible to use the test when the feature you use to compare groups is quantitative. To turn a quantitative variable into a dichotomous one, pick a threshold value of the variable, and replace its actual value with a Yes or No answer to the question "Is the value of the variable greater than or equal to the threshold?" Once you've replaced the numbers with Yes/No answers, you can carry out Fisher's exact test.

Example 8. Martin vs. Westvaco. Here, once again, are the ages of the ten hourly workers involved in the second round of layoffs at Westvaco, with the ages of those laid off marked with an asterisk:

25   33   35   38   48   55   55*  55*  56   64*

    a. One way to choose a threshold is to go by the law. According to federal employment law, the “protected class” (of workers who cannot be fired because of their age) begins at age 40. If we use 40 as the threshold, then the set of ten ages becomes

    N N N N Y Y Y Y Y Y

    We can summarize the new, dichotomized version of what actually happened in a 2x2 table:

                 Laid off?
40 or older?     No   Yes   Total
No                4     0      4
Yes               3     3      6
Total             7     3     10


    If we use as our test statistic the number of workers aged 40 or more among those laid off, we get a p-value of about .17: p = P(3 or more Y in a random sample of 3 chosen from 4 N, 6 Y) ≈ .17.
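The same figure can be checked by simulation, using the same kind of loop as before; the 0/1 coding below (1 = aged 40 or older) is our own choice for this illustration:

popYN <- c(rep(0, 4), rep(1, 6))   # 4 workers under 40, 6 workers aged 40 or older
NRep <- 10000
Count <- 0
for (i in 1:NRep) {
    # were all three randomly chosen workers aged 40 or older?
    if (sum(sample(popYN, 3, replace=F)) >= 3) Count <- Count + 1
}
Count/NRep                         # should come out near 20/120, about .17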

b. Notice that using Fisher's test when you have a quantitative variable ignores relevant information.12 Here, for example, all three workers laid off were very much older than the threshold age of 40, but the test in (a) didn't use that information. In general, it is better not to use Fisher's test in a situation like this; the permutation test would ordinarily be preferred. However, you can sometimes get a better version of Fisher's test by changing the threshold. For example, you could choose as your threshold the median (half-way point) of the set of observed ages. The resulting test is called the median test. For the Martin data we have an even number of values, so there are two middle values, 48 and 55, and the median is the number half-way between them, 51.5. Using 51.5 as a threshold, and replacing ages by Yes/No answers to "Is the age 51.5 or older?" gives a population of 5 Ns, 5 Ys.

Drill exercise: (24) Summarize the data in a 2x2 table. For this threshold, the p-value is p = P(3 or more Y in a random sample of 3 chosen from 5 N, 5 Y) ≈ .08.

c. Drill exercise. (25) Repeat the test using 55 as the threshold. Summarize the data in a 2x2 table, and explain why the p-value should be less than the two previous p-values. (Its actual value is 1/30 ≈ .03.)

Example 9. Calcium and blood pressure. If we want to apply Fisher's test to the data in Example 1, we have to reduce the quantitative data to dichotomous data by choosing a threshold value and asking, "Is the change in blood pressure greater than or equal to the threshold?"

Drill exercises.

26. One natural choice for the threshold is 0. Values 0 or greater indicate that the blood pressure did not go up. Carry out Fisher's test using 0 as your threshold.

27. Carry out the median test.

28. Compare p-values for the two tests. Why do you think the p-values differ in the way that they do?

12 Ignoring relevant information can reduce the power of a statistical test by making it less likely that the test will reject null models that should be rejected.


Drill exercises

29. Tell how to conduct a median test:

Null model.
a. What is the population?
b. What constitutes a random sample?
c. What are the objects that are equally likely according to the null model?

Test statistic
d. Tell what test statistic to use.

Variation 2: Transforming to ranks. Back before the days of cheap computers, it was often not practical to estimate p-values by simulation. Statisticians who wanted to do randomization tests found a clever way around the problem. Their solution was based on the fact that if your population consists of consecutive integers, like {1, 2, 3, …, n}, there is a theoretical analysis that gives a workable approximation to p-values.13 Of course most populations don't consist of consecutive integers, but you can force them to if you replace the actual data values with their ranks: order the values from smallest to largest, assign rank 1 to the smallest, rank 2 to the next smallest, etc. Once you've assigned ranks, you can do a two-sample permutation test on the ranks. The resulting test is called the Wilcoxon rank sum test.14 Here's how the ranking works for the Martin data.

Example 10: A rank test for the Martin data

Age    25   33   35   38   48   55   55   55   56   64
Rank    1    2    3    4    5    7    7    7    9   10

Notice how the ranking handles ties: the three 55s have ranks 6, 7, and 8, so we assign each of the 55s the average of those ranks, (6+7+8)/3 = 7.

Exercises

30. Carry out the Wilcoxon test on the Martin data, but first, guess whether the p-value will be larger or smaller than for the permutation test using the actual age. (It will not be the same.) What do you think is the reason for the difference in p-values?

31. Carry out the Wilcoxon test on the calcium data of Example 1. As in (30), before you run the simulation, guess the p-value.
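If you carry out these exercises in S-Plus, note that the built-in rank function uses this same average-rank convention for ties, so the ranks can be computed rather than assigned by hand:

Ages <- c(25, 33, 35, 38, 48, 55, 55, 55, 56, 64)
rank(Ages)    # 1 2 3 4 5 7 7 7 9 10 -- the three 55s share the average rank 7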

13 Find the mean and variance of {1, 2, …, n}. Then use the fact that for large enough n1 and n2, the distribution of the average of the values in a random sample of size n1 will be approximately normal.

14 The p-value you get from the Wilcoxon test will always be equal to the p-value you would get using a different approach, called the Mann-Whitney U-test.


Variation 3: Paired data. Until now, all the data sets have had the same structure: two groups of values. To generate random data sets with the same structure, you combined the two groups into a single population, then randomly chose exactly enough values for Group 1, leaving the rest for Group 2. This structure is just one of a great many that are possible. Often data come in the form of pairs:

Example 11. Bee stings. When a bee stings and leaves his stinger behind in his victim, does he also leave with it some odor that tells other bees, "Drill here!"? To answer this question, J. B. Free designed a randomized experiment.15 First, he took a square board and from it suspended 16 cotton balls on threads in a 4x4 arrangement. Half the cotton balls had been previously stung, the other half were brand new, and the positions of the two kinds were randomized. Apparatus completed, Free went to a beehive, opened the top, and jerked the array of cotton balls up and down, inviting stings. Later, he counted the numbers of new stingers his provocation had garnered. He repeated all this eight more times, with the results shown in Display 2.9.

Occasion    I   II  III   IV    V   VI  VII  VIII  IX    Ave
Stung      27    9   33   33    4   22   21    33  70   28.0
Fresh      33    9   21   15    6   16   19    15  10   16.0

Display 2.9 Numbers of new stingers left by bees in previously stung and fresh cotton balls

On average, there were 28 stingers left in the cotton balls that had been previously stung, only 16 in those that had not. Given the grand total of 396 new stingers, is the average of 28 for Stung too big to be due just to the random assignment? Suppose for the moment that the presence of stingers in the cotton balls had no effect. Then within each pair, it would be just a matter of chance as to which number got assigned to Stung, and which to Fresh. We can create random data sets by regarding each occasion as a tiny population of just two values, and randomly choosing the value that gets assigned to Stung, leaving the other for Fresh. Equivalently, we can toss a coin for each occasion, choosing the first value if the coin lands heads, the second if tails. Display 2.10 shows an instance of this:

Occasion    I   II  III   IV    V   VI  VII  VIII  IX    Ave
Stung      27    9   33   33    4   22   21    33  70   28.0
Fresh      33    9   21   15    6   16   19    15  10   16.0
Coin toss   1    1    0    0    0    1    0     0   1
"Stung"    27    9   21   15    6   22   19    15  70   22.7
"Fresh"    33    9   33   33    4   16   21    33  10   21.3

    Display 2.10 Generating a random data set for the bee sting data

If the coin toss lands heads (1), the first value in a pair is assigned to "Stung"; if tails (0), the second value is assigned to "Stung."

    15 Free, J. B. (1961). “The stinging response of honeybees,” Animal Behavior 9, pp. 193-196.


    Once we have a way to generate random data sets, we’re in business. We can carry out the randomization test by the usual 3-step algorithm: generate, compare, estimate. Display 2.11 shows S-plus code for doing this.16 The p-value turns out to be about .04. Apparently, bees are more likely to sting where others have stung before.

Stung <- c(27, 9, 33, 33, 4, 22, 21, 33, 70)
Fresh <- c(33, 9, 21, 15, 6, 16, 19, 15, 10)
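The rest of the computation can be sketched as follows, continuing from the Stung and Fresh vectors just defined; the coin tosses are coded 1 = keep the pair as is, 0 = swap the two values, as in Display 2.10, and details such as the number of repetitions are one reasonable choice:

obsAvg <- mean(Stung)                        # observed average for Stung: 28
NRep <- 10000
Count <- 0
for (i in 1:NRep) {
    coin <- sample(c(0, 1), 9, replace=T)         # one toss per occasion
    newStung <- ifelse(coin == 1, Stung, Fresh)   # value assigned to "Stung" in each pair
    if (mean(newStung) >= obsAvg) Count <- Count + 1
}
pHat <- Count/NRep                           # should come out near the .04 reported above
pHat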


Exercises:

33. Modify the S-plus code in Display 2.11 to carry out a permutation test. What do you conclude about the effect of environment on health?

34. For the bee data, the justification for a permutation test comes from the fact that conditions (stung or fresh) were randomly assigned. The twin study, however, is an observational study, with no randomization of the conditions (rural or urban) possible. What is the justification for using a permutation test?

Additional exercises involving variations on the permutation test

35. Dichotomize the O-ring data of Example 4, replacing the number of incidents with Yes (1 or more incidents) or No (0 incidents), and summarize the results in a 2x2 table. Then carry out Fisher's exact test. How does your p-value here compare with the one based directly on the numbers of incidents?

36. Dichotomize the data on traffic deaths (Example 3) using the sign of the change, i.e., whether the deaths went up or down. Summarize the results in a 2x2 table, and carry out Fisher's exact test.

37. Use the bee sting data of Example 11. Order all 18 data values and assign ranks. Then carry out a permutation test on the pairs of ranks, using suitably modified code from Display 2.11. This test is called the signed rank test.

38. Use the bee sting data one more time. This time assign ranks separately for each pair: assign a 1 to the larger value and a 0 to the smaller value. If the two values are equal, simply omit that pair from the analysis. Carry out a permutation test on the pairs of ranks. This test is called the sign test.

39. Compare the p-values from the three tests using the bee sting data: the permutation test using the numbers of stings, the signed rank test, and the sign test. The first test uses the actual data, the second gives up some of that information by converting to ranks, and the third test gives up still more information by looking only at which value in a pair was the larger one. Based on the bee sting data, how does giving up information appear to affect p-values?


2.6 Randomization tests, IV: Chi-square tests

Introduction. Fisher's exact test applies to data sets you can summarize in a 2x2 table. Such data sets have the same structure as the results of drawing a sample from a bucket containing red and blue marbles: there are two groups (drawn out, left in) and two kinds of individuals (the two colors). What if there are more than two kinds of individuals, or more than two samples?

Example 13. Victoria's descendants. Some people claim there is an association between a person's birthday and the day of the year on which they die. According to the theory, people who are dying tend to "hang on" until their birthday. Display 2.13 shows counts for 82 descendants of Queen Victoria, classified by month of birth (row) and month of death (column). Those who died in the same month as they were born appear on the main diagonal. Those who died in a month just before or just after their birth month appear just below or just above the main diagonal.

        Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec   Total
Jan      1    0    0    0    1    2    0    0    1    0    1    0      6
Feb      1    0    0    1    0    0    0    0    0    1    0    2      5
Mar      1    0    0    0    2    1    0    0    0    0    0    1      5
Apr      3    0    2    0    0    0    1    0    1    3    1    1     12
May      2    1    1    1    1    1    1    1    1    1    1    0     12
Jun      2    0    0    0    1    0    0    0    0    0    0    0      3
Jul      2    0    2    1    0    0    0    0    1    1    1    2     10
Aug      0    0    0    3    0    0    1    0    0    1    0    2      7
Sep      0    0    0    1    1    0    0    0    0    0    1    0      3
Oct      1    1    0    2    0    0    1    0    0    1    1    0      7
Nov      0    1    1    1    2    0    0    2    0    1    1    0      9
Dec      0    1    1    0    0    0    1    0    0    0    0    0      3
Total   13    4    7   10    8    4    5    3    4    9    7    8     82

    Display 2.13. Month of birth (row) and month of death (column) for 82 descendants of Queen Victoria

    If the claim of association is true, we would expect to find a tendency for counts to be higher on or near the main diagonal, lower near the southwest and northeast corners. If, on the other hand, there is no association, we would expect the count in a cell to be equal to the product of the cell’s column total times its row fraction. For example, consider the upper left cell, which corresponds to those born in a January who also died in a January. The row and column totals tell us that 6 of 82 descendants, or 7.32%, were born in a January; 13 died in a January. If there is no association between month of birth and month of death, we would expect the fraction of January births to be the same in each column. In particular, we would expect 7.32% of the 13 January deaths, or 13(0.0732) = 0.95, to be January births. The goal of this section is to apply the same kind of thinking to all the cells of the table, and somehow combine the results to create/define a test statistic that can serve as the basis of a randomization test.


    The chi-square test statistic. One of the most common methods in all of statistics is the chi-square test. This test is so flexible and broadly applicable that almost any data set based on sorting and counting can be studied using it.18 The chi-square test has two parts, which are often presented as a single package, without noting or distinguishing between one part and the other.19 This is unfortunate, because one part – a measure of distance that serves as a test statistic -- is much more useful than the other – a shortcut method for approximating p-values. In what follows, I’ll describe and illustrate the more useful part, which gives a general-purpose method for carrying out Step 2 (the comparison step) of the p-value algorithm.20 For a concrete illustration, consider the summary table for the Martin example. For 2x2 tables, the chi-square distance isn’t something you really need, because its value is completely determined by the entry in the upper left cell of the table.21 However, you do need chi-square (or some alternative) for tables larger than 2x2, and the 2x2 case is a simple starting point for learning.

                      Laid off?
                  Yes     No    Total
55 or older?  Yes   3      2      5
              No    0      5      5
Total               3      7     10

Display 2.14. Summary table for the Martin example

The null model corresponds to choosing a random subset of size 3 – those laid off – from {25, 33, 35, 38, 48, 55, 55, 55, 56, 64} and counting the number of people who are 55 or older. Since 5 of the 10, or 50%, of those in our population are 55 or older, we would expect that, on the average over the long run, 50% of those in a randomly chosen subset would be 55 or older. For subsets of size 3, this long run average – the expected value – would be 50% of 3, or 1.5. Because the table entries have to add to give the same row and column totals as for the actual data, we can fill in the entire table:

    18 If you thing about the last statement, it should sound too good to be true, and it is. Nevertheless, along with regression and analysis of variance, chi-square methods constitute one of the three principal groups of classical statistical methods. 19 If you’ve seen the chi-square test before, you yourself may well have been given the package tour. 20 The less useful part of the chi-square test is an approximation based on mathematical theory, one that you can sometimes use as a substitute for Steps 1 and 3 of the p-value algorithm. Back in 1901 when this approximation was invented (by Karl Pearson), before the days of computers, it was a major breakthrough, and statisticians had no choice but to use it. Without the approximation, there would have been no way to compute p-values. However, the approximation only works well for a restricted class of data sets, whereas the randomization algorithm is not restricted in this way, and can be used to estimate p-values to any desired accuracy. 21 This means that the p-value will be the same regardless of whether you use the chi-square distance or the cell entry in the comparison step of the algorithm.


                             Laid off?
                          Yes    No    Total
    55 or older?   Yes    1.5   3.5       5
                   No     1.5   3.5       5
    Total                   3     7      10

    Display 2.15. Expected values for the Martin example

    We can now use these expected values to compare tables. In effect, we invent a way to measure the "distance" between two tables, and compare tables according to how far they are from the table of expected values.

    Step 2a.   Observed  -  Expected   =   Obs - Exp

               3   2        1.5  3.5        1.5  -1.5
               0   5   -    1.5  3.5   =   -1.5   1.5

    Step 2b.   (Obs - Exp)^2  /  Expected   =   (O-E)^2/E

               2.25  2.25       1.5  3.5        1.5   0.64
               2.25  2.25   /   1.5  3.5   =    1.5   0.64

    Step 2c.   Chi-square = sum of (O-E)^2/E = 4.29

    Display 2.16. The chi-square distance between observed and expected counts

    By looking closely at the way the chi-square value is defined, you can convince yourself that it does behave like a distance.

    • Observed close to expected ⇒ chi-square near zero. Consider first a table whose observed counts are exactly equal to the expected counts. All of the entries in the table of differences will be zeros, and the chi-square distance will be zero as well. In other words, the chi-square distance from a table to itself is zero, just as it should be.

    • Observed far from expected ⇒ chi-square large. Now consider a table of observed counts that are far from their expected values. At least some of the differences (Obs – Exp) in Step 2a will be far from zero.22 When these differences are squared, in Step 2b, the resulting values will be large, and that will make the chi-square value large.


    22 Notice that because the observed and expected counts have the same marginal totals, the marginal totals for the table of differences will all be 0.


    • Why divide (O-E)^2 by E? Dividing by E is a technical adjustment, designed to give all the cells an equal chance to contribute to the chi-square total. To see how this works, consider two extreme cases. First, suppose that the expected value is 1. If 1 is the expected value, we might get observed values of 2, or 5, or even 10, but 10 would be a very major departure from expectation. On the low side, the observed count can never be less than 0, so –1 is the lowest possible value for O – E. Next, suppose that instead of 1, the expected value is 101. An observed count of 102, or 105, or even 110 is, in percentage terms, still quite close to the expected value, even though the differences (O-E) are the same as in the first case. On the low side, (O-E) can easily go far below –1.

    Now compare the two cases. In the first, a value of (O-E) = 4 indicates major departure from expectation: 5 instead of 1. In the second case, a value of (O-E) = 4 indicates a departure of less than 4% from the expectation. Dividing (O-E)^2 by E puts these departures in perspective. In the first case, (O-E)^2/E = 16; in the second case, (O-E)^2/E = 0.16.

    Once you have defined the chi-square distance, you can use it to compare data sets. A random data set is at least as extreme as the actual Martin data, for example, if and only if its chi-square distance from the table of expected values is greater than or equal to 4.29. You calculate the p-value in the usual way, as the fraction of random data sets that are at least as extreme as the actual data.
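    Here is a minimal S-Plus sketch of this randomization test for the Martin table. The names (ages, chisq.dist, NRep, extreme) are ours, the ten ages are the ones listed for the null model above, and the sketch simply repeats Steps 2a-2c of Display 2.16 for many random data sets:

        ages <- c(25, 33, 35, 38, 48, 55, 55, 55, 56, 64)     # the ten workers
        observed <- matrix(c(3, 2,
                             0, 5), nrow = 2, byrow = T)       # Display 2.14
        exp.table <- outer(rowSums(observed), colSums(observed)) / sum(observed)
        chisq.dist <- function(obs, ex) sum((obs - ex)^2 / ex)
        obs.chisq <- chisq.dist(observed, exp.table)           # 4.29 for the actual data

        NRep <- 10000
        extreme <- 0
        for (i in 1:NRep) {
            laidoff <- sample(ages, 3)            # a random subset of size 3
            n.old <- sum(laidoff >= 55)           # upper left cell of the random table
            random <- matrix(c(n.old, 5 - n.old,
                               3 - n.old, 2 + n.old), nrow = 2, byrow = T)
            if (chisq.dist(random, exp.table) >= obs.chisq) extreme <- extreme + 1
        }
        extreme / NRep                            # estimated p-value

    Because the row and column totals are fixed, the whole random table is determined by the count in its upper left cell, which is why a single line fills it in.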

    Testing for association in two-way tables of counts

    For tables larger than 2x2, the arithmetic is messier, but the logic is the same as in the Martin example. Here's a version of the randomization algorithm suitable for larger tables.

    Step 0. Expected value = (row fraction)(column total). Observed chi-square: follow Steps 2a-2c below for the actual data.
    Step 1. Generate random data sets with the same row and column totals as the actual data.
    Step 2. Compare values of the chi-square statistic. For each random data set, compute
        a. Observed – Expected
        b. (Obs – Exp)^2/Exp
        c. Chi-square = sum of (O-E)^2/E
        d. Compare: is chi-square for the random data set at least as big as for the actual data?
    Step 3. Estimate: p-hat = (# Yes) / (# data sets)


    For the data in Display 2.13, the value of chi-square is 115.6. To see how this value compares with the values we'd get from random data sets, we need a method for generating 12x12 tables of counts with the same margins as in Display 2.13.

    Generating random tables of counts with given margins

    You can generate data sets by physical simulation, much as in the Martin example. Here's how it works for Victoria's descendants.

    Step 1. Label by rows. Put 82 chips in a bucket, with labels determined by the row (birth month) totals. Thus 6 of the chips say "Jan", 5 say "Feb", 5 "Mar", 12 "Apr", and so on.23

    Step 2. Draw out by columns. Mix the chips thoroughly. Then draw them out in stages determined by the column totals: the first 13 chips you draw out correspond to deaths in January; the next 4 correspond to deaths in February, …, the last 8 correspond to deaths in December.

    In S-Plus, you can carry out the same two steps:

    Step 1. Label by rows. Just as rep(1, 3) creates the vector (1, 1, 1), rep(1:3, c(4,3,1)) creates the vector (1, 1, 1, 1, 2, 2, 2, 3). For the Victoria data, we want rep(1:12, c(6, 5, 5, 12, 12, 3, 10, 7, 3, 7, 9, 3)). Rather than type in all the row totals, we use the command rowSums to tell S-Plus to compute the totals for us:

        Pop <- rep(1:12, rowSums(victoria))    # victoria holds the 12x12 table of counts in Display 2.13


    Step 2. Draw out by columns. Mixing the chips and drawing them out in a random order corresponds to taking a random permutation of Pop:

        Permutation <- sample(Pop)             # a random shuffle of the 82 row labels


    ##############################################################
    #
    # Chi-Square Tests by randomization
    #
    ##############################################################
    #
    ##############
    #
    # Expected.  This function takes a matrix of non-negative entries
    # and returns a matrix of expected values computed assuming there
    # is no association between rows and columns: the (i,j) element of
    # the matrix equals the total for row i times the proportion for
    # column j.
    #
    ##############
    #
    expected <- function(m) outer(rowSums(m), colSums(m)) / sum(m)


    # ColGroups.  A second ingredient is a vector that records with
    # which columns the drawn chips are matched, built from the
    # column totals:
    #
    #     ColGroups <- rep(1:12, colSums(victoria))
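    Putting the pieces together, here is a minimal sketch of the whole randomization test for the Victoria data. It assumes the 12x12 table of counts from Display 2.13 is stored in a matrix called victoria; the names rand.table, NRep, and extreme are ours, and expected is the function defined above:

        exp.victoria <- expected(victoria)
        obs.chisq <- sum((victoria - exp.victoria)^2 / exp.victoria)   # about 115.6

        rand.table <- function(m) {
            Pop <- rep(1:nrow(m), rowSums(m))        # Step 1: label the chips by rows
            ColGroups <- rep(1:ncol(m), colSums(m))  # which column each draw goes with
            table(sample(Pop), ColGroups)            # Step 2: draw out by columns
        }

        NRep <- 1000
        extreme <- 0
        for (i in 1:NRep) {
            r <- rand.table(victoria)
            if (sum((r - exp.victoria)^2 / exp.victoria) >= obs.chisq) extreme <- extreme + 1
        }
        extreme / NRep                               # estimated p-value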


    Appendix 2.1  Randomization tests: A summary

    GIVEN:

    1. A null model (or model and null hypothesis):
       a. A set, called the population. (Subsets of the population are samples; the number of elements in the sample is the sample size, n.)
       b. A finite (though often very large) collection of equally likely samples.

    2. A test statistic (or metric):
       a. A function or rule that assigns a real number to each sample. (The function itself is the test statistic.)
       b. A way to tell which of two values of the test statistic is more extreme. (Often, larger values are more extreme.)

    3. Observed data:
       A particular sample (and so, automatically, a particular value of the test statistic).

    COMPUTE:

    1. The p-value (more formally, the observed significance level):

       The p-value is the probability, computed using the null model, of getting a value of the test statistic at least as extreme as the observed value. According to the null model, all the samples are equally likely, so the p-value is just the fraction of samples that have values of the test statistic at least as extreme as the observed value. There are three general approaches to computing p-values: brute force, mathematical theory, and simulation.
       a. Brute force: List all the samples, compute values of the test statistic for each sample, and count.
       b. Mathematical theory: Find a shortcut by applying mathematical ideas (e.g., the theory of permutations and combinations) to the structure of the set of samples in the null model.
       c. Simulation: Use physical apparatus or a computer to generate a large number of random samples, and estimate the p-value using the fraction of samples that give a value of the test statistic at least as extreme as the observed value.
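    To make approaches (a) and (c) concrete, here is a minimal S-Plus sketch for a toy null model of our own (all subsets of size two from the population {1, 2, 3, 4}, test statistic the sum of the sample, observed sum 7); it is an illustration, not an example from the text:

        pop <- 1:4
        # (a) Brute force: list all C(4,2) = 6 samples and count
        sums <- NULL
        for (i in 1:3) for (j in (i+1):4) sums <- c(sums, pop[i] + pop[j])
        sum(sums >= 7) / length(sums)          # exact p-value: 1/6

        # (c) Simulation: draw many random samples and estimate
        NRep <- 10000
        sim.sums <- numeric(NRep)
        for (k in 1:NRep) sim.sums[k] <- sum(sample(pop, 2))
        sum(sim.sums >= 7) / NRep              # estimate; should be close to 1/6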

    INTERPRETATION:

    The p-value measures how surprising the observed data would be if the null model were true. A moderate sized p-value means that the observed value of the test statistic is pretty much what you would expect to get for data generated by the null model. A tiny p-value raises doubts about the null model: If the model is correct (or approximately correct), then it would be very unusual to get such an extreme value of the test statistic for data generated by the model. In other words, either the null model is wrong, or else a very unlikely outcome has occurred.


    More on the logic of inference

    Contrast the time sequence for what actually happens when you generate statistical data with the time sequence for how you reason about it after the fact. Here's what actually happens when you produce data:

    1. The setting: There is a particular given chance mechanism for producing the data, such as tossing a coin 100 times.

    2. Before producing data, you can use probability calculations to determine which groups of outcomes are likely, which are unlikely. (In probability, which is a branch of mathematics, the model is known, and you use it to deduce the chances for various events that have not yet happened.)

    3. The chance mechanism produces the data, e.g., 91 heads in 100 tosses of a fair coin.

    Now consider the analysis using the logic of inference. What was once assumed fixed (the chance mechanism) is now regarded as unknown. In the coin example, you don't know the value p for the probability of heads. To make matters worse, what was initially unknown and variable – the number of heads you would get if you were to toss 100 times -- is now fixed: you got 91 heads in 100 tosses. It may appear to make no sense to compute the probability of something that has already happened. However, according to the logic of classical inference, probabilities only apply to the data. There is no way to answer the question we really want answered: "How likely is it that p=.5?" No wonder people find this hard!

    Here's the time sequence for the steps in the inference.

    1. The setting: You have fixed, known data, and you want to test whether a particular model is reasonably consistent with the data.

    2. You spell out the (tentative) model you want to test.

    3. Now, in your imagination, you "go back in time," ignoring for the moment the fact that you already have the data, and ask "Which outcomes are likely, and which are unlikely?" In particular, you ask, "How likely is it to get data at least as extreme as what we actually got?" In formal inference, this probability is the p-value.

    4. The p-value is used not for prediction, the way probabilities are ordinarily used, but as a measure of surprise: "If I believe the model, how surprised should I be to get data like what I actually got?" If the p-value is small enough, the model is rejected. This is typical of the way probability is used in statistics: after the fact. (Though statistics is a mathematical science, it is not a branch of mathematics the way probability is.)
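    For the coin example above, a minimal S-Plus sketch of this "measure of surprise" calculation -- the chance of 91 or more heads in 100 tosses of a fair coin, estimated by simulation (the names are ours):

        NRep <- 10000
        heads <- numeric(NRep)
        for (i in 1:NRep) heads[i] <- sum(sample(c(0, 1), 100, replace = T))
        sum(heads >= 91) / NRep     # estimated p-value; essentially 0 for a fair coin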


    Drill and practice with the abstract structure: Null models, test statistics, p-values

    40. Samples of size 1. Find p-values for the following situations.
        a. Null model: All elements of S1 = {1, 2, …, 20} are equally likely.
           Test statistic: t(x) = x, for x ∈ S1; larger values are more extreme.
           Observed data: x0 = 3.
        b. Null model: All elements of S2 = {1, 2, …, N} are equally likely.
           Test statistic: t(x) = x, for x ∈ S2; larger values are more extreme.
           Observed data: x0 = 3.
        c. Same as (b), but x0 = N-3.
        d. Null model: All 26 letters of the English alphabet are equally likely.
           Test statistic: t = 1 if the letter drawn is a vowel, 0 otherwise.
           Observed data: x0 = e.

    41. Samples of size 2 or more. Find p-values for the following situations.
        a. Null model: All subsets of size two drawn from {1, 2, …, 6} are equally likely.
           Test statistic: t({x, y}) = x + y; larger values are more extreme.
           Observed data: {x0, y0} = {3, 6}.

    b. Null model: All pairs (i,j) with i


    b. Null model: All 2x2 matrices of 0s, 1s and 2s are equally likely. Test statistic: Sum of squares of the elements of the matrix. Observed data: Sum of squares = 10

    S-Plus exercises: very basic drill

    43 – 45. Write S-Plus code to do the following; then run your code as a check.

    43. Sample. Take a simple random sample of size 3 from {1, 2, …, 20}.

    44. Test statistic. Take a simple random sample of size 3 from {1, 2, …, 20} and find the number of elements in the sample with values of 18 or more.

    45. Take NRep random samples of size 3 from the same population, and find p-hat, the proportion of samples with two or more elements with values 18 or more.


    Martin v. Westvaco

    46. The table below classifies salaried workers using two Yes/No questions: Under 40? and Laid off? (In employment law, 40 is a special age, because only those 40 or older belong to what is called the "protected class," the group covered by the law against age discrimination.)

                        Laid off?
    Under 40?        Yes    No    Total    % laid off

       Yes             4     5       9       44.4%

       No             14    13      27       51.9%

       Total          18    18      36       50.0%

    Display 2.19. Martin data for salaried workers

    a. Set up a null model: Tell the population (how many of what kinds of items); tell the sample size, and describe the set of equally likely samples.
    b. What is the test statistic?
    c. What is its observed value?
    d. Use S-Plus to find the p-value.

    47. 50 or older. The average age of the salaried workforce in Westvaco's engineering department was higher than in many companies: ¾ of the 36 employees were over 40. Here are data like those in Problem 46, except that ages are divided at 50 instead of 40; repeat parts (a) – (d):


                        Laid off?
    Under 50?        Yes    No    Total    % laid off

       Yes             5    10      15       33.3%

       No             13     8      21       61.9%

       Total          18    18      36       50.0%

    Verbal/interpretive practice

    48. Martin, continued. Does the evidence in the second table (Problem 47) provide stronger or weaker support for Martin's case? Explain. How do you account for the different messages from the two tables? Both provide evidence; how do you judge the evidence from the two tables taken together?

    49. P-values. Write a short paragraph explaining the logic of p-values and significance testing in your own words.

    50. Number of samples. Write a short paragraph summarizing your current understanding of the relationship between the number of repetitions in a simulation and the stability and reliability of the estimated p-value.

    51. A trustworthy friend? A friend wants to bet with you on the outcome of a coin toss. The coin looks fair, but you decide to do a little checking. You flip the coin: it lands Heads. You flip again: also heads. A third flip: heads. Flip: heads. Flip: heads. You continue to flip, and the coin lands Heads nineteen times in twenty tosses. Don't try any calculations, but explain why the evidence -- 19 heads in 20 tosses -- makes it hard to believe the coin is fair.

    52. The logic in Problem 51 relies on the fact that a certain probability is small. Describe in words what this probability is, and tell how you could use simulation to estimate it.

    53. Snow in July? A friendly tornado puts you and your dog Toto down in Kansas, and a booming voice from behind a screen tells you that the date is July 4 (hypothesis). However, you see snow in the air (data), and make an inference that it is not really July 4. Describe in words what probability your inference is based on.

    54. Which test statistic? The Physical Simulation asked you to use the average age to summarize the set of three ages of the workers chosen for layoff. How different would your conclusions have been if you had chosen some other summary? (There is a well-developed mathematical theory for deciding which summaries work well, but that theory would take us off on a tangent. All the same, you can still think about the issues even without the theory.) Some other possible summaries are listed at the end of this question. Are any of them equivalent to the average age? Which summary do you like best, and why?

        Sum of the ages of the three who were laid off
        Average age difference (= average age of those laid off - average age of those retained)
        Number of employees 55 or older who were laid off
        Age of the youngest worker who was laid off
        Age of the oldest worker who was laid off
        Middle of the ages of the three who were laid off

    55. How unlikely is "too unlikely"? The probability you estimated in the Physical Simulation is in fact exactly equal to 0.05. What if it had been 0.01 instead? Or 0.10? How would that have changed your conclusions? (In a typical court case, a probability of 0.025 or less is required to serve as evidence of discrimination. Some scientific publications use a cut-off value of 0.05, or sometimes 0.01.)

    56. At the end of Round 3, there were only six hourly workers left. Their ages were 25, 33, 34, 38, 48, and 56. The 33- and 34-year-olds were chosen for layoff. Think about how you would repeat the Physical Simulation using the data for Round 4.

    a. What is the population? (Give a list.)

    b. How big is the sample?

    c. Define, in words, the probability you would estimate if you were to do the simulation.

    d. Write out the rule for estimating the probability, using the format from Step A1 as a guide.

    e. Give your best estimate for the probability by choosing from

    1%, 5%, 20%, 50%, 80%, 95%, and 99%.

    f. Is the actual outcome easy to get just by chance, or hard?

    g. Does this one part of the data (Round 4, hourly) provide evidence in Martin's favor?

    57. After the first three, the next hourly worker laid off by Westvaco was the other 55 year-old. What's wrong with the following argument?

    Lawyer for Westvaco: "I grant you that if you choose three workers at random, the probability of getting an average age of 58 or older is only .05. But if you extend the analysis to include the fourth person laid off, the average age is lower, only 57.25 (= [55+55+55+64] / 4). An average of 57.25 or older is more likely than an average of 58 or older. So in fact, if you look at all four who were laid off instead of just the first three, the evidence of age bias is weaker than you claim."

    58. Use the data from Rounds 2 and 3 combined. Tell how to simulate the chance of getting an average age of 57.25 or more using the methods of the Physical Simulation: What is the population? the sample size? Tell how to estimate the probability, following the Physical Simulation as a guide. Give your best estimate of the probability. Then tell how to use this probability in judging the evidence from Rounds 2 and 3 combined.

    59. Sketch a dot graph like Display 1.4 to illustrate what you think simulations would look like for the following scenario:

    Three workers were laid off from a set of ten whose ages were the same as in the Martin case. The ages of those laid off were 48, 55, and 55. If you choose three workers at random, the probability of getting an average of 52.66 or older is .166.


    60. For some situations, it is possible to find probabilities by counting equally likely outcomes instead of by simulating. Suppose only two workers had been laid off, with an average age of 59.5 years. It is straightforward, though tedious, to list all possible pairs of workers who might have been chosen. Here's the beginning of a systematic listing. The first nine outcomes all include the 25-year-old and one other. The next eight outcomes all include the 33-year-old and one other, but not the 25-year-old, since the pair (25, 33) was already counted.

    Count    Ages (* = chosen for layoff)                  Average age

      1      25* 33* 35  38  48  55  55  55  56  64           29.0
      2      25* 33  35* 38  48  55  55  55  56  64           30.0
      3      25* 33  35  38* 48  55  55  55  56  64           31.5
     ...
      9      25* 33  35  38  48  55  55  55  56  64*          44.5
     10      25  33* 35* 38  48  55  55  55  56  64           34.0
     11      25  33* 35  38* 48  55  55  55  56  64           35.5

     etc.

    How many possible pairs are there? (Don't list them all!) How many give an average age of 59.5 years or older? (Do list them.) If the pair is chosen completely at random, then all possibilities are equally likely, and the probability of getting an average age of 59.5 or older equals the number of possibilities with an average of 59.5 or more divided by the total number of possibilities. What is the probability for this situation? Is the evidence of age bias stronger or weaker than in the example?

    61. It is possible to use the same approach of listing and counting possibilities to find the probability of getting an average of 58 or more when drawing three at random. It turns out there are 120 possibilities. List the ones that give an average of 58 or more, and compute the probability. How does this number compare with the results of the class simulation in the Physical Simulation? Why do the two probabilities differ (if they do)?

    62. How would your reasoning and conclusions change if the five oldest workers among the entire group of ten were all age 55 (so that the ages of the ten were 25, 33, 35, 38, 48, 55, 55, 55, 55, 55), and the three chosen for layoff were all 55? Is the evidence of age bias stronger or weaker than in the actual case?

    63. The law on age discrimination applies only to people 40 or older. Suppose that instead of looking at actual ages, you look only at whether each worker is less than 40, or 40 or older. Tell what summary statistic you would use, and tell how you would set up the model for simulating an age-neutral process for choosing three workers to be laid off. Conclude by discussing whether you think it is better to use actual ages, or just the information about whether a person is 40 or older.


    Applied problems: creating your own null models and test statistics

    64. More Martin.
        a. Use the 2x2 summary tables for salaried workers (in Problems 46 and 47) to create a 3x2 summary table with three age groups: under 40, 40 to 49, and 50 or older.
        b. Describe a null model that corresponds to the hypothesis of no discrimination.
        c. Invent/define a test statistic that will be larger if older workers are more likely to be chosen for layoff.
        d. Compute the observed value of your test statistic.
        e. Find the p-value for your combination of null model, test statistic, and observed value.

    65. Horse racing. The data set below shows the starting position of winning horses in 144 races.24 All races took place in the US, and each race had eight horses. Position 1, nearest the inside rail, is hypothesized to be advantageous.

    Starting position     1    2    3    4    5    6    7    8
    Number of wins       29   19   18   25   17   10   15   11

        a. Describe a null model that corresponds to the hypothesis that starting position has no effect.
        b. Invent/describe a test statistic that will have larger values if lower numbered starting positions are more advantageous.
        c. For the data given here, tell whether the p-value will be < 0.01, ≥ 0.01 but ≤ 0.1, or > 0.1.

    70. Spatial data. One way to record the spatial distribution of a plant species is to subdivide a larger area into a grid of small squares (quadrats), and record whether the plant (Carex arenaria in Display 2.20) is present (1) or absent (0) in each square.25
        a. Suppose you want to test the hypothesis that the plants distribute themselves randomly, without regard to how close they are to others of the same species. Thus the chance of finding a plant in any one quadrat doesn't depend on whether or not there are plants in neighboring quadrats. Consider two possible null models, one that keeps only the total number of ones fixed, and a second that keeps the row and column sums fixed. Which model is more appropriate, and why?

        b. Consider two alternatives to the null model: (i) The presence of the plant in any one quadrat makes its presence in neighboring quadrats more likely. (ii) The presence of the plant in a quadrat makes its presence in neighboring quadrats less likely. Devise and define two test statistics, one for each alternative.

    24 New York Post, August 30, 1955, p. 42. Reprinted in Siegel, S. and Castellan, N.J. (1988). Non-parametric Statistics for the Behavioral Sciences, 2nd ed., New York: McGraw-Hill, p. 47.
    25 Strauss, D. (1992). The many faces of logistic regression. American Statistician 46, 321-327.

          1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

     1    0 1 1 1 0 1 1 1 1  1  0  0  0  1  0  0  1  0  0  1  1  0  1  0
     2    0 1 0 0 1 1 1 0 0  1  0  0  0  0  0  0  0  1  1  1  0  0  0  1
     3    1 1 1 1 0 1 1 0 0  0  0  0  0  0  1  1  1  1  0  0  0  0  0  0
     4    0 0 1 1 1 1 1 0 1  0  1  0  0  0  0  0  1  1  0  0  0  0  0  0
     5    0 1 1 0 1 1 1 1 1  0  0  0  0  1  1  1  0  1  1  0  0  0  0  0
     6    0 1 0 0 0 1 0 1 0  1  1  1  1  1  0  0  0  0  1  1  0  0  0  0
     7    0 1 0 1 1 0 1 0 1  0  0  0  1  0  0  1  1  0  0  1  0  0  0  0
     8    0 0 0 0 0 0 1 1 0  1  0  0  0  0  0  0  1  0  0  0  0  0  0  0
     9    0 0 0 0 0 1 0 1 1  0  0  0  0  0  0  0  1  0  0  0  1  0  0  0
    10    0 0 0 0 0 0 1 0 1  0  0  0  0  0  0  0  1  0  0  0  1  0  0  0
    11    0 0 0 0 0 1 0 0 0  1  1  0  0  0  0  1  1  0  0  0  0  0  0  0
    12    0 0 0 0 1 0 0 0 0  0  1  1  0  0  0  0  1  0  0  0  0  0  0  0
    13    0 0 0 1 0 0 0 0 0  0  0  0  1  0  0  1  0  0  0  0  0  0  1  0
    14    0 0 1 0 0 0 1 1 0  1  0  1  1  1  1  1  1  1  0  0  0  0  1  1
    15    0 1 0 0 1 1 0 0 0  0  0  0  0  1  1  0  0  1  0  1  0  1  0  1
    16    0 0 0 0 0 0 0 0 0  0  0  1  0  0  1  1  0  0  0  1  1  0  0  1
    17    1 0 0 0 0 0 0 1 0  0  1  0  1  0  1  0  0  0  0  0  0  0  0  0
    18    0 0 0 0 0 0 1 0 0  0  0  0  0  1  1  1  0  0  1  1  0  1  0  1
    19    0 1 0 0 0 0 1 0 0  0  0  0  0  1  1  0  0  1  1  1  1  1  0  1
    20    1 0 0 0 0 0 0 0 0  0  0  0  1  0  0  0  0  0  1  0  0  1  0  1
    21    0 0 0 1 0 1 0 0 0  0  0  0  1  0  0  0  0  0  1  0  0  0  0  1
    22    1 0 0 0 0 1 1 0 1  1  0  0  1  0  0  0  0  0  0  0  0  0  0  0
    23    0 0 1 0 0 0 0 0 0  0  0  1  1  0  1  0  0  0  1  0  0  0  0  1
    24    0 0 0 0 1 1 0 1 0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0

    Display 2.20. Presence (1) or absence (0) of the plant Carex arenaria in the squares of a 24 x 24 array.

    71. Capture-mark-recapture. An entomologist wants to estimate the size of an animal population. Suppose, for example, you want to know the number of giant cockroaches that feed in the vicinity of a particular stand of trees in a Costa Rican rainforest. The entomologist captures and marks 25 roaches, then releases them. After waiting long enough for the marked roaches to mix thoroughly with the rest of the roach population, he captures a sample of 50 cockroaches, and finds that 14 of them are marked.
        a. Let N be the unknown number of cockroaches in the population. Tell how to use a null model to test the hypothesis that N is greater than or equal to 100.
        b. Tell how to use your null model to find the set of possible population sizes N that are not rejected by a test of the sort in (a).
        c. What assumptions about the cockroaches are necessary in order to make the random sampling model appropriate?

    For the cockroach data we assume that the 50 insects caught in the sample are a random subset of the population. (This situation differs from the two previous examples. Here our goal is not to test whether the data are consistent with choosing at random. Rather, we accept random selection as a reasonable assumption,26 in order to use it to make deductions about the size of the population.) For this example, the two randomly chosen groups are the 50 roaches that got caught, and the N-50 that didn't get caught. The feature used for comparing groups is whether or not a roach was marked.

    26 Biologists have broken down the assumption of random selection into a list of conditions about the population. In order for the assumption to be reasonable, there must be no movement of individuals in or out of the population during the time of the study, and the individuals must move around enough that the marked individuals get well mixed into the population before the sample is caught.


    Short Investigations

    72. What is the effect of ties on the permutation test? For example, compare drawing three at random from {1, 2, …, N} and drawing three at random from {1, 1, 2, …, N-1}.

    73. What is the effect of the "cut point" (threshold) when you turn a quantitative variable like age into a categorical variable like "Under 40" or "40 or older"?

    74. What is the effect of the choice of test statistic on the performance of the test?

    75. If you use simulations to estimate a p-value, different simulations will give different results. Large simulations (many samples; large value of NRep) are more stable and reliable than small ones. Invent a way to tell from a set of simulation results whether or not you've done enough to stop.

    Related research questions

    76. APPLIED QUESTIONS: No matter what area of application you are interested in, there are two closely related research questions to be answered:
        a. Null model: What choice or choices will provide useful evidence for evaluating a given scientific hypothesis?
        b. Test statistic: Same question.
    The questions and answers will depend very much on the applied context. In the field of community ecology, questions of this sort have been a source of vigorous debate, the subject of voluminous publ