The Difference Between Statistical Hypotheses and Scientific Hypotheses


  • 8/2/2019 The Difference Between Statistical Hypotheses and Scientific Hypotheses

    Psychological Reports, 1962, 11, 639-645. © Southern Universities Press 1962

    THE DIFFERENCE BETWEEN STATISTICAL HYPOTHESES AND SCIENTIFIC HYPOTHESES

    ROBERT C. BOLLES
    Hollins College

    When a professional statistician runs a statistical test he is usually concerned only with the mathematical properties of certain sets of numbers, but when a scientist runs a statistical test he is usually trying to understand some natural phenomenon. The hypotheses the statistician tests exist in a world of black and white, where the alternatives are clear, simple, and few in number, whereas the scientist works in a vast gray area in which the alternative hypotheses are often confusing, complex, and limited in number only by the scientist's ingenuity.

    The present paper is concerned with just one feature of this distinction, namely, that when a statistician rejects the null hypothesis at a certain level of confidence, say .05, he may then be fairly well assured (p = .95) that the alternative statistical hypothesis is correct. However, when a scientist runs the same test, using the same numbers, rejecting the same null hypothesis, he cannot in general conclude with p = .95 that his scientific hypothesis is correct.

    In assessing the probability of his hypothesis he is also obliged to consider the probability that the statistical model he assumed for purposes of the test is really applicable. The statistician can say "if the distribution is normal," or "if we assume the parent population is distributed exponentially." These ifs cost the statistician nothing, but they can prove to be quite a burden on the poor E whose numbers represent controlled observations, not just symbols written on paper.

    The scientist also has the burden of judging whether his hypothesis has a greater probability of being correct than other hypotheses that could also explain his data. The statistician is confronted with just two hypotheses, and the decision which he makes is only between these two. Suppose he has two samples and is concerned with whether the two means differ. The observed difference can be attributed either to random variation (the null hypothesis) or to the alternative hypothesis that the samples have been drawn from two populations with different means. Ordinarily these two alternatives exhaust the statistician's universe. The scientist, on the other hand, being ultimately concerned with the nature of natural phenomena, has only started his work when he rejects the null hypothesis. An example may help to illustrate these two points.

    Consider the following situation. Two groups of rats are tested for water consumption after one, the experimental group, has been subjected to a particular treatment. Suppose the collected data should appear as shown in Table 1.¹

    TABLE 1

    N                  0 cc.   1 cc.   2 cc.   3 cc.   4 cc.   . . .   12 cc.
    Control Ss           15      3       2       0       0     . . .     0
    Experimental Ss      12      3       0       0       4     . . .     1

    After the data are collected, E can pretend he is a statistician for a while and say to himself, "Let's assume the populations are normal and try a t test." The t statistic is encouragingly large but not large enough (which is just as well because of the difficulty that would arise in attempting to justify the assumption of normality). Several transformations of the data are tried, but they don't help. Our E recognizes that he needs another statistical model. Perhaps a non-parametric test would work, one which is not sensitive to the great skewness of the data, and one which makes no assumption about the underlying distribution. A Mann-Whitney test is discouraging. A chi-square test (tried even though the expected frequencies in the important cells at the tails of the distribution are really too small) does not even approach a significant value.
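
    E's first attempt, the t test on the Table 1 frequencies, can be sketched as follows. This is an illustration rather than the paper's own computation: it assumes that the columns elided by ". . ." in Table 1 are empty (so that each row totals 20 Ss) and that a pooled-variance Student's t is what E would have tried. The names `expand` and `pooled_t` are introduced here for illustration.

```python
from math import sqrt
from statistics import mean, variance

# Table 1 frequencies as {cc consumed: number of Ss}.
# ASSUMPTION: the columns elided by ". . ." in the table (5 to 11 cc.) are empty.
CONTROL = {0: 15, 1: 3, 2: 2}
EXPERIMENTAL = {0: 12, 1: 3, 4: 4, 12: 1}


def expand(freqs):
    """Turn a {value: count} table into a flat list of scores."""
    return [value for value, count in sorted(freqs.items()) for _ in range(count)]


def pooled_t(a, b):
    """Student's two-sample t statistic with a pooled variance estimate."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(b) - mean(a)) / sqrt(pooled_var * (1 / na + 1 / nb))


control = expand(CONTROL)
experimental = expand(EXPERIMENTAL)
t = pooled_t(control, experimental)

print(f"control mean      = {mean(control):.2f} cc")
print(f"experimental mean = {mean(experimental):.2f} cc")
print(f"t = {t:.2f} on {len(control) + len(experimental) - 2} df")
```

    Under these assumptions t comes out near 1.8 on 38 df, short of the conventional two-tailed .05 criterion of about 2.02, which agrees with the description "encouragingly large but not large enough."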

    By this time E, weary of being an amateur statistician, consults a professional one who tells him that, if we can assume the populations have exponential distributions, then we can use Festinger's test (1943). This works (F = 4.83). In due time E publishes a report in which a highly significant (p < .01) increase in water consumption is attributed to the experimental treatment.

    ¹The problem, the data, and the claim of significance are those of Siegel and Siegel (1949); what follows is my construction.

    Inspection of the table, however, should lead to some skepticism. If our E has actually discovered anything about nature, he has found that (1) most animals under these test conditions don't drink more than 2 cc., and (2) his experimental treatment may make a few animals drink a good deal more than normal, 4 cc. or more.

    It is necessary to digress a moment in order to notice that throughout this whole scientific episode our E's behavior has been above reproach. Even when he was acting like a statistician he wasn't just hunting for a test that would work; he was also searching for a test, and a model, that would fit his particular problem. The t test will not work, not just because it is inappropriate from a statistical point of view, but also because it tests the wrong thing. The t test tests whether the means are different. The Mann-Whitney test, for another example, is appropriate in the statistical sense, but it too is primarily sensitive to differences in means, and so is likely to pick up only the first phenomenon our E has discovered (that most Ss drink very little) and not the second (that some Ss respond to treatment). E should use a test which is sensitive to his special problem such as the Kolmogorov-Smirnov test which is highly sensitive to differences in the shapes of distributions, or a test specifically for the difference in skewness between the two distributions.²
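
    To make the Kolmogorov-Smirnov suggestion concrete, the sketch below computes the two-sample statistic D, the largest gap between the two cumulative distributions, for the Table 1 frequencies. It is an illustration only: it assumes the columns elided by ". . ." in Table 1 are empty, and the helper names `ecdf` and `ks_statistic` are introduced here, not taken from the paper.

```python
# Table 1 frequencies as {cc consumed: number of Ss}.
# ASSUMPTION: the columns elided by ". . ." in the table (5 to 11 cc.) are empty.
CONTROL = {0: 15, 1: 3, 2: 2}
EXPERIMENTAL = {0: 12, 1: 3, 4: 4, 12: 1}


def ecdf(freqs, x):
    """Proportion of scores in a {value: count} table that are <= x."""
    total = sum(freqs.values())
    return sum(count for value, count in freqs.items() if value <= x) / total


def ks_statistic(freqs_a, freqs_b):
    """Two-sample Kolmogorov-Smirnov D: the largest vertical gap between the ECDFs."""
    points = sorted(set(freqs_a) | set(freqs_b))
    return max(abs(ecdf(freqs_a, x) - ecdf(freqs_b, x)) for x in points)


d = ks_statistic(CONTROL, EXPERIMENTAL)
for x in sorted(set(CONTROL) | set(EXPERIMENTAL)):
    print(f"<= {x:2d} cc: control {ecdf(CONTROL, x):.2f}, "
          f"experimental {ecdf(EXPERIMENTAL, x):.2f}")
print(f"D = {d:.2f}")
```

    The sketch stops at the statistic itself; judging its significance would require the appropriate small-sample tables for two groups of this size.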

    Enough statistical digression. The point about which our E, as a scientist, should be concerned is the discrepancy between the high level of statistical significance obtained (p < .01) and the lingering doubts he must have whether there may be some explanation for his data other than the experimental treatment (subjective p = ?). There are two very good but often ignored reasons for the discrepancy, as I suggested above. One source of doubt is that the probability of correctness of the scientist's hypothesis depends not only upon the probability of rejecting the null hypothesis, but also upon the probability that the statistical model is appropriate. Now, with the non-parametric tests there is little problem here, and in fact, that is their great virtue.³
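
    One crude way to put numbers on this first source of doubt is sketched below. It is not a formula from the paper: it follows the paper's informal convention of reading 1 - p as an assurance level, and it assumes that the factors can be treated as independent probabilities and multiplied. The function name `discounted_assurance` and its parameters are hypothetical.

```python
def discounted_assurance(p_value, p_model, p_own_given_alternative=1.0):
    """Crude upper bound on assurance in a scientific hypothesis.

    ASSUMPTION (not the paper's formula): the three factors are treated as
    independent probabilities and simply multiplied.
    """
    for name, value in (("p_value", p_value), ("p_model", p_model),
                        ("p_own_given_alternative", p_own_given_alternative)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return (1.0 - p_value) * p_model * p_own_given_alternative


# Illustrative numbers only: a test significant at .01 under a model
# given an even chance of being applicable.
print(f"{discounted_assurance(0.01, 0.50):.3f}")
# The same result when a rival explanation is judged as likely as E's own.
print(f"{discounted_assurance(0.01, 0.50, 0.50):.4f}")
```

    Whatever the exact figures, the product can never exceed the smallest of its factors, so doubt about the model caps the assurance no matter how small the reported p.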

    ²Or he may replicate the study to get a larger n so that other tests will have more power. This is probably the best approach, especially if he varies the experimental conditions in order to find out how to get better control over the rare event.

    ³The power of the Mann-Whitney, for example, is usually cited as approaching .95 that of the t test, in those conditions where the latter is appropriate. Considering that there is usually at least a .05 chance that any given set of data does not come from a normal population, the Mann-Whitney emerges as perhaps the more powerful test for testing scientific (as against statistical) hypotheses. Moreover, its loss of statistical power with small samples may be more than offset by the scientist's gain in assurance that the underlying model is appropriate.

    Our E with the thirsty rats wisely eschewed the t test, and he probably would have even if it had yielded significance, not because it was "wrong," but because it would not have given him any assurance that his scientific hypothesis had been confirmed. But what, we may ask, is the probability that the populations which his two groups represent are actually distributed exponentially? I must say they don't look exponential. The samples are much too small to give us any assurance that the model might be appropriate. Let us be generous and say that the model has a .50 chance of being applicable. What becomes of E's claimed high significance level?

    Poor E has a more serious matter to worry about, a more vexing source of doubt. His whole case hinges on the performance of one animal, the one that drank 12 cc. The remaining 39 Ss don't give him a thing, with any test. Now suppose that the true state of affairs is this: The experimental treatment really has no appreciable effect upon water consumption. But let us suppose, however, that there occurs, every once in a while, a bubble in the animal's home-cage drinking tube, which prevents normal drinking. If this were the case, and if bubbles occur, say, 2½% of the time, then about 50% of the time a bubble S will appear in the experimental group, so E will get just the results he got about 50% of the time.⁴ Moreover, if he continues to use the Festinger test, we can expect him half the time to get highly significant differences in support of his hypothesis, whatever his hypothesis may be!
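
    The arithmetic behind the bubble hypothesis can be checked with a short binomial sketch. It assumes 20 Ss per group (the row totals of Table 1 when the elided columns are taken as empty) and that bubbles strike animals independently at the stated rate; neither assumption is spelled out in the paper.

```python
from math import comb

BUBBLE_RATE = 0.025   # the paper's "say, 2 1/2% of the time"
GROUP_SIZE = 20       # ASSUMPTION: 20 Ss per group, from the row totals of Table 1


def prob_exactly(k, n, p):
    """Binomial probability of exactly k events among n independent animals."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


expected_bubbles = GROUP_SIZE * BUBBLE_RATE
p_at_least_one = 1 - prob_exactly(0, GROUP_SIZE, BUBBLE_RATE)
p_exactly_one = prob_exactly(1, GROUP_SIZE, BUBBLE_RATE)

print(f"expected bubble Ss per group = {expected_bubbles:.2f}")
print(f"P(at least one bubble S)     = {p_at_least_one:.3f}")
print(f"P(exactly one bubble S)      = {p_exactly_one:.3f}")
```

    The expected count of 0.5 bubble Ss per group seems to be what the round figure of "about 50%" reflects. Under these assumptions the binomial chance of at least one such animal in a group is nearer 40%, which leaves the argument intact: an artifact of this kind would turn up in a large share of replications.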

    The problem here, basically, is that statistical rejection of the null hypothesis tells the scientist only what he was already quite sure of: the animals are not behaving randomly. The fact the null hypothesis can be rejected with a p of .99 does not give E an assurance of .99 that his particular hypothesis is true, but only that some alternative to the null hypothesis is true. He may not like the bubble hypothesis because it is ad hoc. But that is quite irrelevant. What is crucial is that the bubble hypothesis, or some other hypothesis, may be more probable than his own. The final confidence he can have in his scientific hypothesis is not dependent upon statistical significance levels; it is ultimately determined by his ability to reject alternatives.

    Consider another illustration. Suppose we are interested in whether a certain stimulus will have reinforcing power for a certain group of 20 animals. After pretraining on a straightaway, we run them 15 trials in a T-maze which has the stimulus in question on one side and not on the other. We collect our data and graph them in the hope of seeing a typical learning curve. But what we find is Fig. 1. All is not lost, though, apparently, because it is still possible to conclude from the data that the stimulus did have a reinforcing effect, and that the associated p value is less than .002!⁵

    The first thing to look for is whether there is a rising trend in the points of Fig. 1. It turns out that the best fit line does rise, but that an F test for the significance of the trend shows it to be less than would be expected on the basis of the day-to-day variation. To find a significant difference anywhere here, we have to ignore the data of Fig. 1 and turn our attention to the number of responses made by each S to the "reinforced" side, during its 15 trials. The 20 such scores have a mean of 8.8 which proves to be highly significantly different from the expected value of 7.5 (p = .002). (The learning curve, correspondingly, runs along at about 59% instead of 50%.) What this significance test tells us is that the animals probably weren't running randomly.⁶ But it is a long way from that to the inference that learning has occurred because of the special stimulus.

    ⁴The other half of the time the experiment will be disastrous for E's hypothesis, however. He had better just do the study once; that way he has at least a .50 chance of getting high significance in the favorable direction, and no more than a .50 chance of discovering that the real world is full of bubbles.

    ⁵The problem, the data, and the conclusions are D'Amato's (1955). Note that performance on the first trial was at 50%. This was pre-arranged by putting the reinforcer on both sides for half the Ss and on neither side for the other half.

    ⁶No one who has ever run animals would seriously consider that they might run randomly. What the null hypothesis implies in empirical terms is that the different animals were doing different things at different times so that the total set of scores looks as if it were random.

    FIG. 1. The performance of rats alleged to have learned to go to one side of a T-maze (horizontal axis: trials)

    One hypothesis with a high a priori probability is that most of the animals gave scores that lay quite close to 7.5 but that one or two animals, with strong unlearned position habits, continued going to the side to which they had gone on the first trial. The probability that in the sampling process, just those one or two animals with strong position habits should be placed under the same condition, and under the particular condition that the "reinforcer" was on their preferred side, is fairly small, but still a great deal larger than the reported significance level. This hypothesis can be ruled out, however, but by certain features of the data and not on grounds of its a priori probability. We can deduce from the small SD (1.47) that few of the animals could have had position habits of appreciable strength. In fact, we can deduce (with a little effort) from the size of the SD that no more than one S could have had an extreme position habit, and that even this was not actually the case, since one S could not have moved the mean from 7.5 to 8.8. Hence, we must conclude that the distribution represents a tightly bunched set of scores, whose mean is indeed significantly larger than 7.5.
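
    The "little effort" behind this deduction can be sketched as arithmetic. The sketch assumes that an extreme position habit means choosing the same side on all 15 trials and that the reported SD uses the n - 1 denominator; both are assumptions made here, not statements from the paper.

```python
N = 20            # animals
TRIALS = 15       # T-maze trials per animal
CHANCE = 7.5      # expected score with no preference
MEAN = 8.8        # reported mean score
SD = 1.47         # reported standard deviation

# ASSUMPTION: an "extreme position habit" means choosing the same side on
# all 15 trials, and the reported SD uses the n - 1 denominator.
EXTREME = TRIALS

# Best case for the one-animal explanation: 19 Ss at chance plus one S at 15.
mean_with_one_extreme = ((N - 1) * CHANCE + EXTREME) / N

# Total squared deviation the reported SD allows, and what one extreme S consumes.
sum_of_squares = (N - 1) * SD**2
one_extreme_cost = (EXTREME - MEAN) ** 2

print(f"mean with one extreme S   = {mean_with_one_extreme:.3f} (reported: {MEAN})")
print(f"sum of squares available  = {sum_of_squares:.2f}")
print(f"cost of one extreme S     = {one_extreme_cost:.2f}")
print(f"extreme Ss the SD permits = {int(sum_of_squares // one_extreme_cost)}")
```

    A single S at 15 would use up nearly all of the squared deviation that an SD of 1.47 allows, so a second such S is ruled out; and one S at 15 among 19 at chance lifts the mean only to about 7.9, well short of 8.8.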

    But this suggests another hypothesis, which does account for the data, and which also has a high a priori probability. The hypothesis is simply that most animals have slight position habits. According to this hypothesis, any particular animal could be expected to go to one side 8 or 9 or 10 times out of 15. (We have already noted the high significance level indicating how consistent this is.) The setting of the performance level at 50% on the first trial is a red herring; it does not set the expected percentage correct at 50% (that would be true in any case before it was known which was the preferred side for a given animal). Now that the data are in, we can see (according to this hypothesis) that the preferred side happened to be predominantly on the same side for which S was "reinforced." To assess the probability that a significantly high proportion of the animals had the "reinforcement" on their preferred side (which would be good evidence that the reinforcement was effective), we must go back to the data of Fig. 1. Performance over the last 14 trials was at 59%, while the performance on Trial 1 was 50%. The question is whether E's performance on Trial 1, when he was selecting the side to put the critical stimulus, is significantly different from the 59% baseline for the animal's performance.⁷ The answer is that it is well within the trial by trial variation. There was that much or more variation from the mean on 6 of the trials.
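
    The question posed here can be given a rough numerical form. The sketch below assumes, as a simplification that the paper does not make explicit, that on any single trial each of the 20 animals independently goes to the "reinforced" side with probability .59, and asks how often 10 or fewer would do so.

```python
from math import comb

N_ANIMALS = 20
BASELINE = 0.59   # performance over the last 14 trials
TRIAL_1 = 0.50    # pre-arranged performance on Trial 1


def binom_pmf(k, n, p):
    """Probability of exactly k "reinforced-side" choices among n animals."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# ASSUMPTION: on any single trial each animal independently goes to the
# "reinforced" side with probability 0.59.
observed = round(TRIAL_1 * N_ANIMALS)          # 10 of 20 animals
p_this_low_or_lower = sum(binom_pmf(k, N_ANIMALS, BASELINE)
                          for k in range(observed + 1))

print(f"P({observed} or fewer of {N_ANIMALS} on the reinforced side) "
      f"= {p_this_low_or_lower:.3f}")
```

    Under this simplification a group score as low as 50% turns up on a bit more than one trial in four, so the Trial 1 figure is far from a rare event relative to the 59% baseline.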

    So, what it comes down to is that the animals did show slight but consistent preferences for one side or the other; the p figure of .002 shows this. The important question is whether these slight and consistent preferences are due to a slight but consistent effect of the experimental treatment, or whether they would have occurred without the treatment and E was just a little unlucky in trying to counterbalance them. Which is the more probable?

    The point of this message is not that it is futile to do experiments (although it might be wise to be cautious of some of the statistician's favorite designs). Rather, the emphasis should be upon the distinction between why scientists run statistical tests and why statisticians do it. The former run tests for the same reason they run experiments, in the attempt to understand natural physical phenomena. The latter do it in the attempt to understand mathematical phenomena. The scientist gains his understanding through the rejection or confirmation of scientific hypotheses, but this depends upon much more than merely rejecting or failing to reject the null hypothesis. It depends partly upon the confirmation from other investigators (e.g., Amsel & Maltzman, 1950; Wike & Casey, 1954), particularly as the experimental conditions are varied (e.g., Siegel & Brantley, 1951; Amsel & Cole, 1953). Confirmation of scientific hypotheses also depends in part upon whether they can be incorporated into a larger theoretical framework (e.g., Hull, 1943). Final confirmation of scientific hypotheses and the larger theories they support depends upon whether they can stand the test of time.

    ⁷Or put another way, what is the probability that 20 animals, all with slight position habits, will distribute themselves on one particular trial so that the group will depart 9 percentage points from its mean value.

    These processes have to move slowly. As Bakan (1953) has observed, the development of a scientific idea is gradual, like learning itself; its probability of being correct increases gradually from one experimental verification to the next, as response probability increases from one trial to the next. The effect of any single experimental verification is not to confirm a scientific hypothesis but only to make its a posteriori probability a little higher than its a priori probability. Our present day over-reliance upon statistical hypothesis testing is apt to obscure this feature of the scientific enterprise. We have almost come to believe that an assertion about the nature of the empirical world can be validated (at least with a probability level such as .95 or .99) in one stroke if the data demonstrate statistical significance. Is it any wonder then that our use of statistical hypothesis testing is rapidly passing from routine to ritual?
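
    The gradual growth of a hypothesis's probability from one verification to the next can be illustrated with a few lines of Bayes' rule. The prior and the two likelihoods below are hypothetical numbers chosen only for illustration; nothing in the paper fixes them, and the function name `update` is introduced here.

```python
def update(prior, p_result_if_true, p_result_if_false):
    """Bayes' rule: posterior probability of a hypothesis after one confirming result."""
    evidence = prior * p_result_if_true + (1 - prior) * p_result_if_false
    return prior * p_result_if_true / evidence


# HYPOTHETICAL numbers, chosen only for illustration: a confirming result is
# somewhat more likely if the hypothesis is true (0.8) than if a rival
# explanation produced it (0.5).
belief = 0.20
history = [belief]
for _ in range(5):
    belief = update(belief, p_result_if_true=0.8, p_result_if_false=0.5)
    history.append(belief)

for n, value in enumerate(history):
    print(f"after {n} verifications: {value:.3f}")
```

    With rivals that explain the result nearly as well, each verification moves the probability only a little, and even five of them leave it well short of .95. The more plausible the rival explanations, the closer the two likelihoods and the smaller each step.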

    SUMMARY

    One of the chief differences between the hypotheses of the statistician and those of the scientist is that, when the statistician has rejected the null hypothesis, his job is virtually finished. The scientist, however, has only just begun his task. He must also be able to show that the statistical model underlying the test is applicable to his empirical situation because whatever significance level he obtained for the test, his confidence in his scientific hypothesis must be reduced below that by any lack of confidence in the model. Furthermore, confidence in his scientific hypothesis is reduced by the plausibility of alternative hypotheses. Hence the scientist's ultimate confidence in his hypothesis may be far lower than the significance level he can report.

    REFERENCES

    AMSEL, A., & COLE, K. F. Generalization of fear motivated interference with water intake. J. exp. Psychol., 1953, 46, 243-247.
    AMSEL, A., & MALTZMAN, I. The effect upon generalized drive strength of emotionality as inferred from the level of consummatory response. J. exp. Psychol., 1950, 40, 563-569.
    BAKAN, D. Learning and the principle of inverse probability. Psychol. Rev., 1953, 60, 360-370.
    D'AMATO, M. R. Transfer of secondary reinforcement across the hunger and thirst drives. J. exp. Psychol., 1955, 49, 352-356.
    FESTINGER, L. An exact test of significance for means of samples drawn from populations with an exponential frequency distribution. Psychometrika, 1943, 8, 153-160.
    HULL, C. L. Principles of behavior. New York: Appleton-Century, 1943.
    SIEGEL, P. S., & BRANTLEY, J. J. The relationship of emotionality to the consummatory response of eating. J. exp. Psychol., 1951, 42, 304-306.
    SIEGEL, P. S., & SIEGEL, H. S. The effect of emotionality on the water intake of the rat. J. comp. physiol. Psychol., 1949, 42, 12-16.
    WIKE, E. L., & CASEY, A. The secondary reinforcing value of food for thirsty animals. J. comp. physiol. Psychol., 1954, 47, 240-243.

    Accepted September 25, 1962.