
Applied & Preventive Psychology 11 (2004) 65–67

Commentary

Paul Meehl’s search for the optimal epistemology for the behavioral and social sciences

Frank Schmidt∗

Department of Management and Organization, Tippie College of Business, University of Iowa, Iowa City, IA 52242, USA

∗ Tel.: +1-319-335-0949. E-mail address: [email protected] (F. Schmidt).

Keywords: Paul Meehl; Statistical significance tests; Meta-analysis; Confidence intervals; Data analysis

Paul Meehl spent a large portion of his life in a search for the optimal epistemology for the behavioral and social sciences. Meehl (1978) is one of his finest examples of that quest. He asked broad questions: How should we conduct research? How should we test our theories? How should we analyze and interpret our data? Meehl was not the first to point out to psychologists the dangers of reliance on statistical significance testing in data interpretation. Jones (1955) did this earlier, and Carver (1978) sounded the alarm the same year that the Meehl article we celebrate here appeared. Nevertheless, Meehl deserves credit for presenting an erudite analysis of the harm done by significance testing to progress towards cumulative knowledge in psychology and for stating the indictment unusually bluntly. In his words:

I suggest to you that Sir Ronald Fisher has befuddled us, mesmerized us, and led us down the primrose path. I believe that the almost universal reliance on merely refuting the null hypothesis as the standard method for corroborating substantive theories in the soft areas is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology. (p. 817)

Meehl went on to point out that the null hypothesis is always false—that no relationship is ever precisely zero—and that therefore every null hypothesis will be rejected with a large enough sample size (and hence high statistical power to detect even small relationships). Like Cohen (1994), he focused on the fact that every null hypothesis is statistically false. It is also true, however, that in most contemporary psychology research literatures, almost all null hypotheses are substantively false as well (Schmidt, 1996; Schmidt & Hunter, 1997). That is, in most developed research areas, enough has been learned over time about the phenomenon being studied that the relationships hypothesized do in fact exist—not just statistically but substantively. For example, Lipsey and Wilson (1993) surveyed 302 meta-analyses of psychological research literatures and found that in 99% the results showed substantial substantive relationships. Hence, across these research literatures, the probability that the null was substantively—and not just statistically—false was 0.99. It is likely that these 302 meta-analyses are representative of a much wider domain. If this is true, then any deficiencies in statistical power become all the more devastating.

The argument Meehl was making here is that given high enough statistical power, every null hypothesis will be rejected. The real problem, however, is not excessively high statistical power, but low statistical power. Across the psychology research literatures that have been examined, the power of significance tests to detect relationships that are present is in the 0.40–0.60 range (Cohen, 1962, 1965, 1988, 1992; Schmidt, Hunter, & Urry, 1976; Sedlmeier & Gigerenzer, 1989). This is the real reason for the difficulty described by Meehl (1978) of interpreting research literatures based on “tabular asterisks” (pp. 823–824). If a substantive relationship actually exists (highly probable, as indicated above), and if average statistical power in the literature is, say, 0.40, then a count of the tabular asterisks should reveal that 60% of the studies found no significant relationship and only 40% found a significant relationship. Thus, this majority vote procedure (called “counting noses” by Meehl) leads to the false conclusion that there is no relationship. The point is that there is a simple and easily understood explanation for the confusion and befuddlement in interpretation of research literatures that Meehl describes: low statistical power is that explanation. Some of the more complex explanatory factors that Meehl offers are probably not needed.
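To make the “counting noses” arithmetic concrete, the following is a minimal simulation sketch. The two-sample t-test design, the true effect of d = 0.4, and the per-group n of 36 are illustrative assumptions chosen to put power near 0.40; none of these numbers come from the article.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d, n, n_studies = 0.4, 36, 10_000        # true effect, per-group n, number of studies

significant = 0
for _ in range(n_studies):
    control = rng.normal(0.0, 1.0, n)    # control group
    treated = rng.normal(d, 1.0, n)      # treatment group; the effect is real
    _, p = stats.ttest_ind(treated, control)
    significant += p < 0.05

print(f"Studies finding p < .05: {significant / n_studies:.0%}")
# Prints roughly 40%. A majority vote over the tabular asterisks therefore
# reaches the false conclusion that no relationship exists, even though the
# effect is present in every single study.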


What is the solution to this problem? At the level of the individual study, the solution is to replace significance tests with point estimates and confidence intervals (CIs) (Hunter, 1997; Loftus, 1996; Schmidt, 1996). Unlike significance tests, point estimates reveal the size of the relationship uncovered. And CIs, also unlike significance tests, reveal how much uncertainty there is in the results. For studies with typical sample sizes, CIs are quite wide, and this fact is informative in itself. It shows how limited the information contained in a single study is, and the fact that CIs from different studies overlap reveals to researchers that the different studies are not necessarily contradictory, as they appear to be when significance tests are used.
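As a sketch of how much uncertainty a single study carries, the following computes a point estimate and 95% CI for a correlation via the standard Fisher z transformation. The observed r = 0.30 and n = 68 are invented, typical-sized values, not figures from the article.

import numpy as np
from scipy import stats

r, n = 0.30, 68                          # observed correlation and sample size
z = np.arctanh(r)                        # Fisher z transform of r
se = 1.0 / np.sqrt(n - 3)                # standard error of z
half = stats.norm.ppf(0.975) * se        # half-width of the 95% interval
lo, hi = np.tanh(z - half), np.tanh(z + half)

print(f"r = {r:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
# Prints r = 0.30, 95% CI [0.07, 0.50]: the single study is consistent with
# anything from a trivial to a strong relationship, which is exactly the
# uncertainty a bare significance test conceals.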

Meehl (1978) was concerned here with the problem of making sense of research literatures, and he wrote before the advent of meta-analysis, which is the real solution to this problem. In many (perhaps by now most) research literatures, meta-analysis has revealed a clarity of meaning that was never possible before (Schmidt, 1992, 1996). Meta-analysis ignores indices of statistical significance in the primary studies and instead computes accurate estimates across studies of the mean and standard deviation of indices of size of relationships. The results usually reveal that research literatures are much less contradictory and make much more substantive and theoretical sense than do interpretations based on significance tests (cf. Hedges & Olkin, 1985, in press; Hunter & Schmidt, 1990, 2004).
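The following bare-bones sketch loosely follows the sample-size-weighted averaging described in Hunter and Schmidt (1990): pool the correlations across studies, then compare their observed variance with the variance expected from sampling error alone (computed here with a simple average-n approximation). The six study results are invented for illustration.

import numpy as np

r = np.array([0.12, 0.31, 0.22, 0.08, 0.27, 0.18])  # study correlations
n = np.array([  60,  120,   85,   45,  200,   75])  # study sample sizes

r_bar = np.average(r, weights=n)                    # weighted mean effect size
var_obs = np.average((r - r_bar) ** 2, weights=n)   # observed variance of r
var_err = (1 - r_bar ** 2) ** 2 / (n.mean() - 1)    # approx. sampling-error variance
var_residual = max(var_obs - var_err, 0.0)          # variance left to explain

print(f"mean r = {r_bar:.2f}, observed var = {var_obs:.4f}, "
      f"error var = {var_err:.4f}, residual var = {var_residual:.4f}")
# For these data the residual variance is zero: the studies only appear
# contradictory because of sampling error, and the literature in fact
# supports a single underlying relationship.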

In this connection, it seems to me that Meehl (1978) was calling for at least two things. First, he was calling for researchers to stop relying on statistical significance tests in individual research studies. Second, he was calling for the development or discovery of some means of making sense of our research literatures—something he correctly stated significance tests cannot do. Events since 1978 have shown that it is much easier to achieve the second objective than the first. In fact, the second objective has been achieved by meta-analysis. Almost all the reviews appearing today in the primary review journal in psychology, Psychological Bulletin, are based on meta-analysis—and have been for over 10 years. Conclusions about relationships that are used in theory construction are today mostly the product of meta-analyses. The summaries of research findings in our textbooks today are based almost entirely on the results of meta-analyses.

It has, however, proven much more difficult to convince those who conduct primary studies to abandon the significance test in favor of point estimates and CIs. In an earlier article (Schmidt, 1996), I referred to the attachment of researchers to significance testing as “an addiction” that is based on false beliefs about benefits provided by significance testing. For example, many researchers believe that statistical significance indicates that a finding is “reliable”; that is, they believe that a finding significant at the 0.05 level has a 95% probability of replicating in a new study. Statements like this can even be found in books on research methods. In fact, significance level reveals nothing about replicability, and the probability of replication is likely to be closer to 0.50 than 0.95 (Schmidt, 1996). Many researchers also believe that level of significance indicates the importance of a finding; they falsely believe that a finding significant at the 0.01 level is more important than one significant only at the 0.05 level. But the P-value is heavily a function of sample size, not importance. Many also believe that if a difference or relation is not statistically significant, then it is probably just due to chance and can be considered to be zero. In fact, most non-significant findings are due to low statistical power to detect potentially important real differences or relations.
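A minimal simulation sketch of the replicability point: when power is about 0.50, an initial significant result replicates in an identical new study only about half the time. The effect size d = 0.4 and per-group n of 49 are illustrative choices that put power near 0.50; they are not taken from the article.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
d, n, trials = 0.4, 49, 10_000           # chosen so power is close to 0.50

def study_is_significant() -> bool:
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(d, 1.0, n)
    return stats.ttest_ind(treated, control)[1] < 0.05

original = np.array([study_is_significant() for _ in range(trials)])
replication = np.array([study_is_significant() for _ in range(trials)])

print(f"P(replication significant | original significant) = "
      f"{replication[original].mean():.2f}")
# Prints a value near 0.50, not 0.95: the significance level of the first
# study says nothing about replicability; the probability that a second
# study reaches p < .05 is simply that study's statistical power.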

When the truth about these false beliefs is pointed out, many researchers respond with a variety of objections to abandoning significance testing. Some of the most common of these objections are examined by Schmidt and Hunter (1997) and shown not to stand up to careful analysis. All of this is evidence of the strong but unfounded attachment many researchers have to significance testing.

As should be apparent by now, the good news is that failure to achieve Meehl’s first objective has not prevented achievement of his second objective. It has not been possible to wean most researchers conducting primary studies from their addiction to significance testing, but meta-analysis has made it possible to clarify the meaning of research literatures, and this process has greatly advanced cumulative knowledge in psychology (Schmidt & Hunter, 1997). I learned from an e-mail forwarded to me before he died that this is a development that Paul Meehl welcomed and heartily approved of.

References

Carver, R. P. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378–399.

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145–153.

Cohen, J. (1965). Some statistical issues in psychological research. In B. B. Wolman (Ed.), Handbook of clinical psychology (pp. 95–121). New York: McGraw-Hill.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.

Cohen, J. (1992). Statistical power analysis. Current Directions in Psychological Science, 1, 98–101.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.

Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic Press.

Hedges, L. V., & Olkin, I. (in press). Statistical methods for meta-analysis (2nd ed.). Orlando, FL: Academic Press.

Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3–7.

Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error and bias in research findings. Newbury Park, CA: Sage.

Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings. Thousand Oaks, CA: Sage.

Jones, L. V. (1955). Statistics and research design. Annual Review of Psychology (Vol. 6, pp. 405–430). Stanford, CA: Annual Reviews Inc.

Lipsey, M. W., & Wilson, D. B. (1993). The efficacy of psychological, educational, and behavioral treatment: Confirmation from meta-analysis. American Psychologist, 48, 1181–1209.


Loftus, G. R. (1996). Psychology will be a much better science when we change the way we analyze data. Current Directions in Psychological Science, 5, 161–171.

Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806–834.

Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist, 47, 1173–1181.

Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for the training of researchers. Psychological Methods, 1, 115–129.

Schmidt, F. L., & Hunter, J. E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 37–64). Mahwah, NJ: Lawrence Erlbaum.

Schmidt, F. L., Hunter, J. E., & Urry, V. (1976). Statistical power in criterion-related validation studies. Journal of Applied Psychology, 61, 473–485.

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309–316.