James W. Grice
Oklahoma State University, Department of Psychology
Presented to researchers and staff of Walter Reed Army Research Institute, Silver Spring, MD, April 14th, 2015.
Alternatives to Null Hypothesis Significance Testing and Variable-Based Modeling
Null Hypothesis Significance Testing (NHST)
Thoughts running through the researcher’s mind: Do I have an effect? Are my results significant? Is my hypothesis supported?
α = pcrit = .05
The Null Ritual:
1. Set up a statistical null hypothesis of “no mean difference” or “zero correlation.” Don’t specify the predictions of your research hypothesis or of any alternative substantive hypotheses.
2. Use 5% as a convention for rejecting the null. If significant, accept your research hypothesis. Report the result as p < 0.05, p < 0.01, or p < 0.001 (whichever comes next to the obtained p-value).
3. Always perform this procedure.
p. 588, Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics, 33, 587-606.
NHST
[Path diagram: Optimism → Delay visit to doctor, r = -.18*]
$r_{xy} = \frac{\sum z_x z_y}{n - 1} = -.18$
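A minimal sketch of the computation behind the coefficient above: Pearson's r as the sum of z-score products divided by n - 1. The scores are hypothetical, invented only to make the example run; they are not the study's data, and the function name is illustrative.

```python
import math

def pearson_r(x, y):
    """Pearson r as the sum of z-score products divided by (n - 1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample standard deviations (n - 1 in the denominator).
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    z_x = [(v - mean_x) / sd_x for v in x]
    z_y = [(v - mean_y) / sd_y for v in y]
    return sum(a * b for a, b in zip(z_x, z_y)) / (n - 1)

# HYPOTHETICAL scores, invented for illustration only.
optimism = [12, 15, 9, 20, 17, 11]
delay_days = [30, 14, 25, 7, 21, 28]

r = pearson_r(optimism, delay_days)
print(f"r = {r:.2f}")
```

Note that the single number r summarizes the aggregate; it says nothing about which individual women fit the pattern.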
Assumption-laden NHST
Assumptions
• Linearity
• Random Sampling
• Bivariate Normal Population Distribution
• Homoscedasticity
• Continuous variables
• Independence of pairs of observations
• Ho is true
• “p ≤ .05” is proper significance level
Goal is to estimate a population parameter; here, the population correlation
NHST
Linear relationship between optimism and visiting a doctor after detecting a lump in the breast.
NHST
Hypotheses:
Ho : ρxy = 0
HA : ρxy > 0 or ρxy < 0
where ρxy is the population correlation
Assumptions
• Linearity
• Random Sampling
• Bivariate Normal Population Distribution
• Homoscedasticity
• Continuous variables
• Independence of pairs of observations
• Ho is true
• “p ≤ .05” is proper significance level
NHST
Ho : ρxy = 0
Sampling Distribution : Distribution of possible outcomes (r values) with assumptions being fulfilled.
pcrit = .05
rcrit = -.169 rcrit = .169
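As a rough check on where the critical values come from, the critical t can be converted to a critical r. This sketch assumes n = 135 (the sample size reported later in the deck), so df = n - 2 = 133, and a tabled two-tailed critical value of about 1.978 for that df; both the df and the table value are assumptions not shown on this slide.

```latex
r_{crit} = \frac{t_{crit}}{\sqrt{t_{crit}^{2} + df}}
         = \frac{1.978}{\sqrt{1.978^{2} + 133}}
         \approx \frac{1.978}{11.70}
         \approx .169
```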
NHST
pcrit = .05
r = -.18 and +.18; pobs = .037
rcrit = -.169 rcrit = .169
Specifically: Given the assumptions, pobs is the probability of obtaining a result at least as extreme as +/- .18 in a repeated, random sampling scheme.
This is all you get!
.0185 in each tail (.0185 + .0185 = .037)
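The "repeated, random sampling scheme" can be made concrete with a small simulation. This is a sketch under stated assumptions: samples of n = 135 pairs (the sample size reported later in the deck) are drawn from two independent standard normal variables, so every assumption listed earlier holds and Ho is true. The function names, the number of simulated samples, and the seed are illustrative choices.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simulate_p(r_obs, n, n_samples=4000, seed=1):
    """Share of simulated null samples with |r| at least as large as |r_obs|."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_samples):
        # Independent normal variables: the population correlation is 0.
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]
        if abs(pearson_r(x, y)) >= abs(r_obs):
            extreme += 1
    return extreme / n_samples

p_sim = simulate_p(r_obs=-0.18, n=135)
print(f"simulated two-tailed p = {p_sim:.3f}")
```

The simulated proportion should land near the reported pobs, and that proportion is all the p-value is: a statement about imaginary repeated samples, not about the hypothesis or the women in the study.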
Things you may want, but do not get from the p-value…
“Bakan (1966) and Thompson (1996, 1999) catalogue some of the most common:
1. A p value is the probability the results will replicate if the study is conducted again (false).
2. We should have more confidence in p values obtained with larger Ns than smaller Ns (this is not only false but backwards).
3. A p value is a measure of the degree of confidence in the obtained result (false).
4. A p value automates the process of making an inductive inference (false, you still have to do that yourself—and most don’t bother).
5. Significance testing lends objectivity to the inferential process (it really doesn’t).
6. A p value is an inference from population parameters to our research hypothesis (false, it is only an inference from sample statistics to population parameters).
7. A p value is a measure of the confidence we should have in the veracity of our research hypothesis (false).
8. A p value tells you something about the members of your sample (no it doesn’t).
9. A p value is a measure of the validity of the inductions made based on the results (false).
10. A p value is the probability the null is true (or false) given the data (it is not).
11. A p value is the probability the alternative hypothesis is true (or false; this is false).
12. A p value is the probability that the results obtained occurred due to chance (very popular but nevertheless false).”
p. 73, Lambdin, C. (2012). Significance tests as sorcery: Science is empirical—significance tests are not. Theory & Psychology, 22(1), 67–90.
NHST
“The 16th edition of a highly influential textbook, Gerrig and Zimbardo’s Psychology and Life (2002), portrays the null ritual as statistics per se and calls it the ‘backbone of psychological research’ ” (p. 46).
p. 589, Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics, 33, 587-606.
NHST
[Path diagram: Optimism → Delay visit to doctor, r = -.18*]
Assumptions
• Linearity
• Random Sampling
• Bivariate Normal Population Distribution
• Homoscedasticity
• Continuous variables
• Independence of pairs of observations
• Ho is true
Hypotheses:
Ho : ρxy = 0
HA : ρxy > 0 or ρxy < 0
Goal: ? ≤ ρxy ≤ ?
Population of Women
All women over 40 years of age?
Only women without a history of breast cancer in their families?
Only women who have had children?
Only American women?
Population correlation often has no empirical reality
NHST
Population of Women
“…researchers may find themselves assuming that their sample is a random sample from an imaginary population. Such a population has no empirical existence, but is defined in an essentially circular way—as that population from which the sample may be assumed to be randomly drawn. At the risk of the obvious, inferences to imaginary populations are also imaginary.”
Berk, R. A. & Freedman, D. A. (2003). Statistical assumptions as empirical commitments. In T. G. Blomberg and S. Cohen (eds.), Law, Punishment, and Social Control: Essays in Honor of Sheldon Messinger, 2nd ed., pp. 235-254, Aldine de Gruyter.
NHST
Assumptions
• Linearity
• Random Sampling
• Bivariate Normal Population Distribution
• Homoscedasticity
• Continuous variables
• Independence of pairs of observations
• Ho is true
Hypotheses:
Ho : ρxy = 0
HA : ρxy > 0 or ρxy < 0
Goal: ? ≤ ρxy ≤ ?
The authors did not draw a random sample!
What of the other assumptions as well?
NHST
The correlation (r = -.18, n = 135) is statistically significant (p = .038). I have an effect. My result is significant. My hypothesis is supported.
Statisticians: “We have corrections for some assumption violations.”
r = -.18 and +.18; pobs = ?
rcrit = -.169 rcrit = .169
NHST
“These adjustments will be successful only under restrictive assumptions whose relevance to the social world is dubious. Moreover, adjustments require new layers of technical complexity, which tend to distance the researcher from the data. Very soon, the model rather than the data will be driving the research.” Berk & Freedman (2003).
r = -.18 and +.18; pobs = ?
rcrit = -.169 rcrit = .169
NHST
Paul Meehl: NHST is “one of the worst things that ever happened in the history of psychology” (Meehl, 1978, p. 817; Journal of Consulting and Clinical Psychology, 46, 806-834).
Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Med, 2(8), e124.
A few references…
NHST
Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33, 587-606.
Lambdin, C. (2012). Significance tests as sorcery: Science is empirical—significance tests are not. Theory & Psychology, 22(1), 67–90.
Ziliak, S. & McCloskey, D. (2008). The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice and Lives. Ann Arbor: University of Michigan Press.
McCloskey, D. (1995). The insignificance of statistical significance. Scientific American, 272, 32–33.
Cohen, J. (1994). The earth is round (p < 0.05). American Psychologist, 49, 997–1003.
Branch, M. (2014). Malignant side effects of null-hypothesis significance testing. Theory & Psychology, 24(2), 256-277.
Nuzzo, R. (2014). Statistical errors. Nature, 506, 151-152.
Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Med, 2(8), e124.
Some suggest…
1. Replace or supplement p-values with confidence intervals and effect sizes
2. Replace NHST with Bayesian statistics
Others suggest…
Attempt a Gestalt shift:
3. De-emphasize mean and variance-based statistics
4. Think in terms of patterns
5. Focus on accuracy
6. Create analogical (particularly iconic) models
7. …all of this will require that we take our numbers more seriously
What must we do?
Effect Sizes and Confidence Intervals
1. R² = .67; p = .002; CI.95 = .40 to .94
2. R² = .67; p = .002; CI.95 = .40 to .94
3. R² = .67; p = .002; CI.95 = .40 to .94
4. R² = .67; p = .002; CI.95 = .40 to .94
Notice the large effect sizes, small p-values, and moderately wide confidence intervals (df = 1,10)
Hypothetical Results from Four Studies:
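For readers who want to see how an effect size of this kind connects to a test statistic, the standard conversion from R² to F with the stated degrees of freedom (1, 10) works out as follows. This is a worked illustration only; the slide does not report F.

```latex
F(df_1, df_2) = \frac{R^{2} / df_1}{(1 - R^{2}) / df_2}
              = \frac{.67 / 1}{.33 / 10}
              \approx 20.3
```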
• “LOT [optimism] scores were related inversely to delay…”
• “Consistent with theory and prior research, overall, optimism explained both delay and…” (p. 205)
• Optimism was a significant predictor of delay
Effect Sizes and Confidence Intervals
A Study in Terror Management Theory
Norenzayan, A. & Hansen, I. (2006). Belief in supernatural agents in the face of death. Personality and Social Psychology Bulletin, 32, 174-187.
• Random assignment to one of two groups:
1. Write about favorite food
2. Write about personal death
• Memory task to clear your short term memory
• “How strongly do you believe in God?”
Not at all 1 2 3 4 5 6 7 Very Strongly
(4 = midpoint)
Effect Sizes and Confidence Intervals
[Path diagram: Thought of Death → Belief in God, t(64) = 2.18*]
$t_{obs} = \frac{\bar{x}_D - \bar{x}_F}{\sqrt{\frac{s_p^2}{n_D} + \frac{s_p^2}{n_F}}}$
Assumption-laden NHST
Assumptions
• Random assignment (or sampling)
• Normal population distributions
• Homogeneity of population variances
• Continuous dependent variable
• Independence of observations
• Ho is true
• “p ≤ .05” is proper significance level
Goal is to estimate two population parameters, µDeath and µFood, and the difference between them.
Effect Sizes and Confidence Intervals
Hypotheses: Ho : μFood = μDeath; HA : μFood > μDeath or μFood < μDeath
MDeath = 4.39 (SD = 1.64), MFood = 3.42 (SD = 1.97), t(64) = 2.18, p = .033, d = .54 (medium effect using Cohen’s conventions), CI.95 for the mean difference: .08 to 1.86.
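The reported statistics can be approximately reproduced from the summary values alone. This sketch assumes equal group sizes of 33 (the slide gives only df = 64, so the split is a guess) and a tabled two-tailed critical t of about 1.998 for df = 64; both are labeled assumptions, and small rounding differences from the reported values are expected.

```python
import math

# Summary statistics as reported on the slide.
m_death, sd_death = 4.39, 1.64
m_food, sd_food = 3.42, 1.97

# ASSUMPTION: equal group sizes; the slide reports only df = 64 (n1 + n2 = 66).
n_death = n_food = 33
# ASSUMPTION: tabled two-tailed critical t for df = 64 at the .05 level.
t_crit = 1.998

diff = m_death - m_food
# Pooled variance: the df-weighted average of the two sample variances.
pooled_var = ((n_death - 1) * sd_death ** 2 + (n_food - 1) * sd_food ** 2) / (
    n_death + n_food - 2
)
se_diff = math.sqrt(pooled_var / n_death + pooled_var / n_food)

t_obs = diff / se_diff              # independent-samples t
d = diff / math.sqrt(pooled_var)    # Cohen's d using the pooled SD
ci_low = diff - t_crit * se_diff    # 95% CI for the mean difference
ci_high = diff + t_crit * se_diff

print(f"t = {t_obs:.2f}, d = {d:.2f}, CI.95 = {ci_low:.2f} to {ci_high:.2f}")
```

Every quantity here is computed from two means and two standard deviations; no individual person's response enters the calculation.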
Effect Sizes and Confidence Intervals
Accuracy
“In contrast [to traditional statistical methods], ODA maximizes the accuracy of a model.” (p. 4, Yarnold, P., & Soltysik, R. (2005). Optimal Data Analysis. Washington, DC: APA.)
Accuracy & Patterns
Focus on patterns and accuracy using the Percent Correct Classification (PCC) index
[Path diagram: Thought of Death → Increased Religiosity]
Observation Oriented Modeling (OOM) shows that the pattern of results makes no sense with regard to Terror Management Theory when examined at the level of the individuals in the study and when we attempt to take our numbers seriously.
MDeath = 4.39 (SD = 1.64), MFood = 3.42 (SD = 1.97), t(64) = 2.18, p = .033, d = .54 (medium effect using Cohen’s conventions), CI.95 for the mean difference: .08 to 1.86.
t(64) = 2.18*
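To make the PCC index concrete, here is a minimal sketch that counts how many individuals match a predicted pattern. The ratings are hypothetical, and the classification rule (death-group members rate above the scale midpoint, food-group members do not) is one assumed reading of the theory's prediction. It is not the algorithm used by the OOM software or by the authors.

```python
MIDPOINT = 4  # midpoint of the 1-7 belief scale shown on the slide

def pcc(predicted, observed):
    """Percent of cases whose observed category matches the predicted one."""
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)

# HYPOTHETICAL (group, rating) pairs, invented for illustration only.
people = [
    ("death", 6), ("death", 5), ("death", 3), ("death", 7), ("death", 2),
    ("food", 2), ("food", 4), ("food", 6), ("food", 3), ("food", 5),
]

# ASSUMED rule: writing about death should put a person above the midpoint;
# writing about food should not.
predicted = ["above" if group == "death" else "not_above" for group, _ in people]
observed = ["above" if rating > MIDPOINT else "not_above" for _, rating in people]

pcc_value = pcc(predicted, observed)
print(f"PCC = {pcc_value:.1f}%")
```

Unlike t or d, the result is a count of persons: it answers how many individuals behaved as the theory says they should.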
Accuracy & Patterns
[Path diagram: Daily PTSD symptoms, Daily NA (negative affect), and Number of standard drinks/day; path coefficients shown: 0.13***, 0.42***, -0.14 (-0.02); *** p < .001]
Persons & Patterns, not Aggregates
• Diary data for 54 women. Plenty of within-person data! (Cohn, Hagman, Moore, Mitchell, & Ehlke (2014). Psychology of Addictive Behaviors, 28, 114-126.)
• “Statisticism”: In part is a failure to recognize the difference between an aggregate statistical effect and the cause-effect processes at the level of the persons (Lamiell, J. T., 2013, New Ideas in Psychology, 31, 65-71).
• How many individual women fit this causal model?
“Indeed, only six women responded to the survey on all 14 days, and the median number of completed days was equal to 11. The median PCC value was equal to 44.35, indicating general incongruity between the relative changes in PTSD and negative affect observations across all days and all women. More specifically, PCC values for only 23 women exceeded 50%, and of those only eight patterns 1) passed the eye test, 2) included seven or more days of observations, and 3) showed some variability in the observations.” Grice et al., in press.
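A person-level analysis of this kind can be sketched as follows. The rule used here, counting the day-to-day changes in PTSD symptoms and negative affect that move in the same direction, is one plausible reading of "relative changes"; the source does not describe the actual algorithm. The diaries are hypothetical, and the function names are illustrative.

```python
from statistics import median

def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)

def person_pcc(ptsd, neg_affect):
    """Percent of day-to-day changes where both series move the same way."""
    days = list(zip(ptsd, neg_affect))
    hits = 0
    total = 0
    for (p0, a0), (p1, a1) in zip(days, days[1:]):
        total += 1
        if sign(p1 - p0) == sign(a1 - a0):
            hits += 1
    return 100.0 * hits / total

# HYPOTHETICAL diaries (PTSD ratings, negative affect ratings), one per woman.
diaries = {
    "woman_1": ([1, 2, 3, 2, 4], [2, 3, 4, 3, 5]),
    "woman_2": ([3, 3, 2, 4, 1], [1, 2, 2, 3, 3]),
    "woman_3": ([2, 4, 3, 5], [3, 2, 4, 3]),
}

pcc_by_person = {name: person_pcc(p, a) for name, (p, a) in diaries.items()}
median_pcc = median(pcc_by_person.values())
n_above_50 = sum(value > 50 for value in pcc_by_person.values())

print(pcc_by_person)
print(f"median PCC = {median_pcc}, persons above 50% = {n_above_50}")
```

The output is one PCC per woman, so the question "how many individual women fit this causal model?" can be answered by counting.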
Persons & Patterns, not Aggregates
Inferences
Rather than seeking:
1. An inference to a population parameter: ? ≤ µDeath - µFood ≤ ?
2. An inference about aggregate statistics (in Bayesian analysis)
We are seeking:
Inference to best explanation. Why are the data patterned in such and such a manner?
Aristotle
• Philosophical Realism : AKA “Reasoned common sense”
• Natural science (epistēmē) is demonstrable knowledge of nature through its causes
• Causes inhere in the things themselves and are knowable; this is causality
• Thing-based rather than event-based ontology
• Cause : Material, Formal, Efficient, and Final
Philosophical Realism
Integrated Model from Bill Powers’ Perceptual Control Theory
Powers, W.T. (2008). Living control systems III: Modeling behavior. Montclair, NJ: Benchmark Publications.
Analogical (Iconic) Models
https://www.youtube.com/watch?v=AJXFiO-ULv0
http://ccl.northwestern.edu/netlogo/
Analogical (Iconic) Models
So…Forget NHST!
Attempt a Gestalt shift:
1. De-emphasize mean and variance-based statistics
2. Think in terms of patterns
3. Focus on accuracy
4. Create analogical (particularly iconic) models
5. …all of this will require that we take our numbers more seriously
What must we do?