August 2014 A.H. Welsh & E.J. Knight
“Magnitude-based Inference”: A Statistical Review
Alan Welsh, The Australian National University
Emma Knight, The Australian Institute of Sport
xParallelGroupsTrial.xls
Comparing change in two groups
Compare Post1 - Pre2 measurements for the Control group with Post1 - Pre2
measurements for the Exptal group to see if there is a treatment effect.
Comparing two means
• Assume all 40 Post1 - Pre2 measurements are independent.
• The Post1 - Pre2 measurements for the Control group and the Post1 - Pre2
measurements for the Exptal group are approximately normally
distributed.
• The problem is to make inferences about the effect of the treatment
on a typical (randomly chosen) individual; this effect is summarized
by the difference in the means of the separate (normal) populations
represented by the experimental and control athletes.
• For simplicity, assume throughout that positive values of the Exptal
population mean - Control population mean represent a positive or
beneficial effect.
• The two normal populations are allowed to have different variances; this is
called the Behrens-Fisher problem.
xParallelGroupsTrial.xls Results
According to the papers . . . : Confidence Intervals
Compute a standard approximate Student t confidence interval (default
level: 90%) for the difference in population means.
Specify the smallest meaningful positive effect δ > 0; this defines three
regions on the real line:
“negative or harmful” region (−∞,−δ),
“trivial or no effect” region [−δ, δ],
“positive or beneficial” region (δ,∞).
The confidence interval is classified by the extent of overlap with these three
regions into one of the four categories “Beneficial”, “Trivial”, “Harmful” or
“Unclear”, where the last category is used for confidence intervals that do not
belong to any of the other categories.
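The ingredients of the approximate Student t interval can be sketched as follows. This is a minimal illustration that assumes the interval is the usual Welch (unequal-variance) one with Welch-Satterthwaite degrees of freedom, which the slides do not state explicitly. The function name `welch_summary` and the data are hypothetical, not taken from the spreadsheet.

```python
from math import sqrt
from statistics import mean, variance

def welch_summary(control, exptal):
    """Return (difference in means, standard error, Welch-Satterthwaite df).

    Assumes the Welch approximation for two normal samples with
    possibly different variances (the Behrens-Fisher setting).
    """
    n1, n2 = len(control), len(exptal)
    v1 = variance(control) / n1   # squared standard error of the Control mean
    v2 = variance(exptal) / n2    # squared standard error of the Exptal mean
    diff = mean(exptal) - mean(control)
    se = sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return diff, se, df

# Hypothetical change scores (Post1 - Pre2), NOT the spreadsheet data.
control = [1.0, 2.0, 3.0, 4.0, 5.0]
exptal = [2.0, 4.0, 6.0, 8.0, 10.0]
diff, se, df = welch_summary(control, exptal)
```

Under these assumptions the 90% interval would be diff ± t × se, where t is the 0.95 quantile of the Student t distribution on df degrees of freedom, taken from tables or a statistics library.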
For δ = 4.41, the xParallelGroupsTrial.xls data produces the third confidence
interval: not significant but possibly beneficial.
xParallelGroupsTrial.xls : Classical Results
“It’s all in the spreadsheets . . . ”
“Chances” and “Qualitative Probabilities”
pb “substantially positive (+ve) or beneficial” value
1− pb − ph “trivial value”
ph “substantially negative (-ve) or harmful” value
xParallelGroupsTrial.xls : “Magnitude-based Inference” Results
“Clinical Inference” and “Mechanistic Inference”
Classify pb and ph into one of four categories:

             ph small      ph large
pb small     trivial       harmful
pb large     beneficial    unclear
Qualify the classifications “beneficial”, “harmful” and “trivial” by the
corresponding classifications of pb, ph and 1− pb − ph.
“Clinical inference” distinguishes positive and negative values; it needs
thresholds for the “minimum chance of benefit” (default: ηb = 0.25) and
the “maximum risk of harm” (default: ηh = 0.005).
“Mechanistic inference” applies when there is no direct clinical or practical
application and positive and negative values represent equally important effects;
it needs a single threshold (default α = 0.1 obtained by setting
ηb = ηh = 0.05).
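The four-way classification can be written as a small function. This is only a sketch: it assumes that “large” means strictly exceeding the corresponding threshold, and the function name `classify` is hypothetical. The example calls use the mechanistic thresholds ηb = ηh = 0.05 with made-up values of pb and ph.

```python
def classify(pb, ph, eta_b, eta_h):
    """Apply the 2x2 table to the chance of benefit pb and risk of harm ph.

    Assumption: pb is 'large' when pb > eta_b and ph is 'large' when ph > eta_h.
    """
    benefit_large = pb > eta_b
    harm_large = ph > eta_h
    if benefit_large and harm_large:
        return "unclear"
    if benefit_large:
        return "beneficial"
    if harm_large:
        return "harmful"
    return "trivial"

# Hypothetical probabilities with the mechanistic thresholds.
example_beneficial = classify(0.60, 0.02, 0.05, 0.05)
example_unclear = classify(0.60, 0.10, 0.05, 0.05)
example_harmful = classify(0.01, 0.40, 0.05, 0.05)
example_trivial = classify(0.01, 0.02, 0.05, 0.05)
```

Lowering eta_h while holding the data fixed moves results from “beneficial” to “unclear”, which is one way to see why the harm threshold matters so much for a positive finding.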
A graphical representation
ANIMATION 1: Constructing the ternary diagram to interpret “magnitude-based inference” and show the effect of changing the thresholds ηb and ηh
Interpretation
The “chance of benefit” pb and “risk of harm” ph cannot be derived as
frequentist probabilities from the standard confidence interval; they can be
derived from a Bayesian credibility interval if we switch to a Bayesian framework.
We can derive pb and ph as frequentist p-values. For δ ≥ 0:
pb is the one-sided p-value for testing the null hypothesis that
µ2 − µ1 = δ against the alternative that µ2 − µ1 < δ;
ph is the one-sided p-value for testing the null hypothesis that
µ2 − µ1 = −δ against the alternative that µ2 − µ1 > −δ;
p, the usual p-value, is the two-sided test of the null hypothesis that µ2 − µ1 = 0
against the alternative that µ2 − µ1 ≠ 0.
When δ = 0, pb = 1 − p/2 and ph = p/2 (for a positive observed difference), so small p
corresponds to large pb and small ph. For p in the range 0.05–0.15, moderate increases
in δ shift the analysis towards a positive conclusion: both ph and pb decrease, but pb
usually not by enough to lose the “evidence” for a positive effect (given that ηb is
small: 0.25 compared to, say, 0.95). The important threshold for obtaining a positive
result is ηh.
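The p-value reading of pb and ph can be checked numerically. The sketch below substitutes a standard normal distribution for the Student t distribution (a large-sample simplification made here so that only the Python standard library is needed). The function names and the values of `diff` and `se` are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal, standing in for the Student t distribution

def chances(diff, se, delta):
    """Return (pb, ph) for an observed difference diff with standard error se."""
    # One-sided p-value for H0: mu2 - mu1 = delta against mu2 - mu1 < delta.
    pb = Z.cdf((diff - delta) / se)
    # One-sided p-value for H0: mu2 - mu1 = -delta against mu2 - mu1 > -delta.
    ph = Z.cdf((-delta - diff) / se)
    return pb, ph

def two_sided_p(diff, se):
    """Usual two-sided p-value for H0: mu2 - mu1 = 0."""
    return 2 * (1 - Z.cdf(abs(diff) / se))

diff, se = 3.0, 2.0                  # hypothetical estimate and standard error
p = two_sided_p(diff, se)            # falls in the 0.05-0.15 range
pb0, ph0 = chances(diff, se, 0.0)    # delta = 0
pb1, ph1 = chances(diff, se, 1.0)    # a moderate delta > 0
```

With δ = 0 the two identities pb = 1 − p/2 and ph = p/2 hold for this positive difference, and raising δ to 1 lowers both pb and ph, as described above.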
A graphical representation
ANIMATION 2: The effect of changing δ on pb and ph in the ternary diagram and on the probabilities of finding an effect when there is none
ANIMATION 3: The effect of changing δ on pb and ph, showing both the Frequentist and the Bayesian interpretations of these probabilities
The “Magnitude-based Inference” Test
“Magnitude-based inference” has not replaced tests by confidence
intervals but is actually a test.
“Mechanistic inference” is a complicated and confusing way of
increasing the level of the test; it does nothing to the power of the test.
It is equivalent to using the usual p-value with a much larger
threshold value, e.g. 50% instead of 5%.
“Clinical inference” in “magnitude-based inference” increases the level
of the test and changes the thresholds.
• The increase in ηb (from 5% to 25%) looks spectacular but this is
misleading because ηb is not actually important when the p-value is in the
range 0.05–0.15.
• The decrease in ηh (from 5% to 0.5%) works against the other changes (in
the p-value and δ), but is outweighed by the gains from the other two
changes.
“Magnitude-based inference” is less conservative than other clinical inference
procedures.
If other researchers feel that clinical conclusions should be more conservative (“do
no harm”) than mere statistical significance, what is the role for a method
for clinical inference that is explicitly designed to be less conservative?
“I can’t be bothered addressing this kind of criticism. If you believe in God, no amount of
evidence against His existence will disabuse you of your belief. Similarly, if you believe in
null hypothesis testing, the evidence for a better method of making inferences about true
effects means nothing to you. In any case, has this person read the evidence? I doubt it.”
Will Hopkins, quoted by Martin Buchheit, April 30, 2013
Sample size calculations
The standard formula (from significance testing) is
n ≈ [function of level (default: 5%) and power (default: 80%)] / (smallest difference you hope to detect)²
Without explanation or justification, Hopkins uses
n ≈ [function of 2ηh (default: 1%) and 1 − ηb (default: 75%)] / [4 × (smallest difference you hope to detect)²]
Calling ηh the “Type I clinical error rate” and ηb the “Type II clinical error rate”
acknowledges (ironically) that “magnitude-based inference” is a test but does not
justify their use in the standard sample size formula because ηh is not the level of
the test and ηb is not the probability of “not using an effect that is beneficial”.
There is no basis for the division by 4.
The changes to the numerator increase the sample size by a factor of about 4/3; the
division by 4 then reduces it to about 1/3 of the standard sample size.
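The 4/3 and 1/3 factors can be reproduced with a short calculation. This sketch assumes that the “function of level and power” is the usual normal-approximation term (z for 1 − level/2 plus z for the power, squared), which the slides do not spell out. The variance factor is the same in both formulas and cancels in the ratio. The function name `numerator` is hypothetical.

```python
from statistics import NormalDist

def numerator(level, power):
    """Normal-approximation numerator: (z_{1 - level/2} + z_{power}) squared."""
    z = NormalDist().inv_cdf
    return (z(1 - level / 2) + z(power)) ** 2

standard = numerator(0.05, 0.80)   # level 5%, power 80%
hopkins = numerator(0.01, 0.75)    # 2 * eta_h = 1%, 1 - eta_b = 75%

numerator_ratio = hopkins / standard   # effect of changing level and power
overall_ratio = numerator_ratio / 4    # after the extra division by 4
```

Under these assumptions numerator_ratio comes out at about 1.35 and overall_ratio at about 0.34, in line with the 4/3 and 1/3 figures quoted above.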
Conclusion
The real motivation for “magnitude-based inference” is that significance tests
(the use of p-values) and confidence intervals are seen as being too
conservative.
“Magnitude-based inference” is promoted as an alternative to significance
tests, but it is also a test.
It is less conservative than standard tests because it inflates the level of the
test to levels that should not be used.
The sample size calculations should not be used.
We sympathize with the frustration of the researcher finding that the
evidence they have for an effect is weaker than they would like, but we
have to recognize the limitations of the data and be careful about
trying to strengthen weak evidence just because it suits us to do so.
We recommend being realistic about the limitations of the data and
using confidence intervals (in preference to p-values).
“Should scientists accept and offer overconfidence,
oversimplification, distortion and rhetoric disguised as
quantified science ...?” Sander Greenland