“Magnitude-based Inference”: A Statistical Review · August 2014 A.H. Welsh & E.J. Knight...


“Magnitude-based Inference”: A Statistical Review

Alan Welsh, The Australian National University

Emma Knight, The Australian Institute of Sport


xParallelGroupsTrial.xls


Comparing change in two groups

Compare Post1 - Pre2 measurements for the Control group with Post1 - Pre2

measurements for the Exptal group to see if there is a treatment effect.


Comparing two means

• Assume all 40 Post1 - Pre2 measurements are independent.

• The Post1 - Pre2 measurements for the Control group and the Post1 - Pre2

measurements for the Exptal group are approximately normally

distributed.

• The problem is to make inferences about the effect of the treatment

on a typical (randomly chosen) individual; this effect is summarized

by the difference in the means of the separate (normal) populations

represented by the experimental and control athletes.

• For simplicity, assume throughout that positive values of the Exptal

population mean - Control population mean represent a positive or

beneficial effect.

• The two normal populations are allowed to have different variances; this is

called the Behrens-Fisher problem.
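
As a rough sketch of the setup above, the difference in means, its standard error and approximate degrees of freedom can be computed without assuming equal variances. This assumes the approximation intended is Welch's (the Welch-Satterthwaite degrees of freedom); the function name and the data values are hypothetical and are not taken from xParallelGroupsTrial.xls.

```python
import math
from statistics import mean, variance


def welch_summary(control, exptal):
    """Return (difference in means, standard error, approximate df).

    Uses unequal-variance (Welch) formulas; positive differences mean
    Exptal > Control, matching the sign convention in the slides.
    """
    n1, n2 = len(control), len(exptal)
    v1 = variance(control) / n1  # squared standard error of the Control mean
    v2 = variance(exptal) / n2   # squared standard error of the Exptal mean
    diff = mean(exptal) - mean(control)
    se = math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return diff, se, df


# Hypothetical change scores (Post1 - Pre2), for illustration only
control = [1.2, -0.8, 3.1, 0.4, 2.2, -1.5, 0.9, 1.7]
exptal = [4.5, 2.1, 7.9, -0.3, 6.2, 3.8, 9.4, 1.0]

diff, se, df = welch_summary(control, exptal)
print(f"difference = {diff:.2f}, SE = {se:.2f}, df = {df:.1f}")
```

The approximate degrees of freedom always lie between the smaller group size minus one and the pooled value n1 + n2 - 2.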


xParallelGroupsTrial.xls Results


According to the papers . . . : Confidence Intervals

Compute a standard approximate Student t confidence interval (default

level: 90%) for the difference in population means.

Specify the smallest meaningful positive effect δ > 0; this defines three

regions on the real line:

“negative or harmful” region (−∞,−δ),

“trivial or no effect” region [−δ, δ],

“positive or beneficial” region (δ,∞).

The confidence interval is classified by the extent of overlap with these three

regions into one of the four categories “Beneficial”, “Trivial”, “Harmful” or

“Unclear”, where the last category is used for confidence intervals that do not

belong to any of the other categories.
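
The three regions can be made concrete with a small sketch that reports which regions a confidence interval overlaps. It deliberately stops short of the final four-way label, because the exact overlap rule is not stated on this slide. The function name and the interval endpoints are hypothetical; only δ = 4.41 comes from the talk.

```python
def regions_overlapped(lower, upper, delta):
    """List the regions that the closed interval [lower, upper] overlaps.

    Regions follow the slide: harmful (-inf, -delta), trivial
    [-delta, delta], beneficial (delta, inf).
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    regions = []
    if lower < -delta:
        regions.append("harmful")
    if lower <= delta and upper >= -delta:
        regions.append("trivial")
    if upper > delta:
        regions.append("beneficial")
    return regions


# Hypothetical interval that contains 0 but extends past delta = 4.41
print(regions_overlapped(-1.0, 12.0, 4.41))
```

An interval like this one contains zero (so it is not significant) yet reaches into the beneficial region, which is the pattern the slides describe as "not significant but possibly beneficial".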


For δ = 4.41, the xParallelGroupsTrial.xls data produces the third confidence

interval: not significant but possibly beneficial.


xParallelGroupsTrial.xls : Classical Results


“It’s all in the spreadsheets . . . ”


“Chances” and “Qualitative Probabilities”

pb             “substantially positive (+ve) or beneficial” value

1 − pb − ph    “trivial” value

ph             “substantially negative (-ve) or harmful” value


xParallelGroupsTrial.xls : “Magnitude-based Inference” Results


“Clinical Inference” and “Mechanistic Inference”

Classify pb and ph into one of four categories:

              ph small      ph large

pb small      trivial       harmful

pb large      beneficial    unclear

Qualify the classifications “beneficial”, “harmful” and “trivial” by the

corresponding classifications of pb, ph and 1− pb − ph.

“Clinical inference” distinguishes positive and negative values; it needs

thresholds for the “minimum chance of benefit” (default: ηb = 0.25) and

the “maximum risk of harm” (default: ηh = 0.005).

“Mechanistic inference” applies when there is no direct clinical or practical

application and positive and negative values represent equally important effects;

it needs a single threshold (default α = 0.1 obtained by setting

ηb = ηh = 0.05).
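
The two-by-two classification can be sketched as follows. This assumes that “large” means strictly exceeding the relevant threshold (the slides do not say how ties are handled) and takes the clinical ηh as 0.005, i.e. the 0.5% quoted later in the talk. The function name, the dictionary names and the example values of pb and ph are hypothetical.

```python
def classify(pb, ph, eta_b, eta_h):
    """Classify (pb, ph) using the two-by-two table from the slide.

    Assumption: a probability is "large" when it strictly exceeds
    its threshold.
    """
    benefit_large = pb > eta_b
    harm_large = ph > eta_h
    if benefit_large and harm_large:
        return "unclear"
    if benefit_large:
        return "beneficial"
    if harm_large:
        return "harmful"
    return "trivial"


# Default thresholds quoted in the talk
CLINICAL = {"eta_b": 0.25, "eta_h": 0.005}
MECHANISTIC = {"eta_b": 0.05, "eta_h": 0.05}

# Hypothetical probabilities, for illustration only
pb, ph = 0.60, 0.02
print("clinical:   ", classify(pb, ph, **CLINICAL))
print("mechanistic:", classify(pb, ph, **MECHANISTIC))
```

The same pair (pb, ph) can receive different labels under the two sets of thresholds, which shows how much of the conclusion is driven by the choice of ηb and ηh.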


A graphical representation

ANIMATION 1: Constructing the ternary diagram to interpret “magnitude-based inference” and show the effect of changing the thresholds ηb and ηh


Interpretation

The “chance of benefit” pb and “risk of harm” ph cannot be derived as

frequentist probabilities from the standard confidence interval; they can be

derived from a Bayesian credible interval if we switch to a Bayesian framework.

We can derive pb and ph as frequentist p-values. For δ ≥ 0:

pb is the one-sided p-value for testing the null hypothesis that

µ2 − µ1 = δ against the alternative that µ2 − µ1 < δ;

ph is the one-sided p-value for testing the null hypothesis that

µ2 − µ1 = −δ against the alternative that µ2 − µ1 > −δ;

p, the usual p-value, is the two-sided test of the null hypothesis that µ2 − µ1 = 0

against the alternative that µ2 − µ1 ≠ 0.

When δ = 0 and the observed difference is positive, pb = 1 − p/2 and ph = p/2, so small p corresponds to large pb and

small ph. For p in the range 0.05–0.15, moderate increases in δ shift the analysis towards a

positive conclusion: we decrease ph and pb, but usually not by enough to lose the

“evidence” for a positive effect (given that ηb is small; 0.25 compared to, say,

0.95). The important threshold for obtaining a positive result is ηh.
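
The p-value definitions above can be checked numerically. The sketch below evaluates the Student t distribution function by Simpson's rule so that it runs on the standard library alone; that numerical shortcut, the function names and the summary values (difference 5.0, standard error 3.0, 38 degrees of freedom) are assumptions chosen for illustration, and only δ = 4.41 comes from the talk.

```python
import math


def t_cdf(t, df, steps=2000):
    """Student t distribution function via Simpson's rule (steps even)."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)

    def density(u):
        return c * (1 + u * u / df) ** (-(df + 1) / 2)

    h = t / steps  # negative for t < 0, which gives a negative integral
    total = density(0.0) + density(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return 0.5 + total * h / 3


def mbi_probabilities(diff, se, df, delta):
    """pb and ph as the one-sided p-values described in the slides."""
    pb = t_cdf((diff - delta) / se, df)      # H0: mu2 - mu1 = delta vs < delta
    ph = 1 - t_cdf((diff + delta) / se, df)  # H0: mu2 - mu1 = -delta vs > -delta
    return pb, ph


def two_sided_p(diff, se, df):
    """Usual two-sided p-value for H0: mu2 - mu1 = 0."""
    return 2 * (1 - t_cdf(abs(diff) / se, df))


# Hypothetical summary statistics, for illustration only
diff, se, df = 5.0, 3.0, 38
p = two_sided_p(diff, se, df)
for delta in (0.0, 4.41):
    pb, ph = mbi_probabilities(diff, se, df, delta)
    print(f"delta = {delta}: pb = {pb:.4f}, ph = {ph:.4f}, p = {p:.4f}")
```

With δ = 0 the output reproduces pb = 1 − p/2 and ph = p/2. Raising δ lowers both probabilities, but ph can fall below a small ηh while pb stays well above ηb = 0.25, which is the mechanism described above.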


A graphical representation

ANIMATION 2: The effect of changing δ on pb and ph in the ternary diagram and on the probabilities of finding an effect when there is none

ANIMATION 3: The effect of changing δ on pb and ph, showing both the Frequentist and the Bayesian interpretations of these probabilities


The “Magnitude-based Inference” Test

“Magnitude-based inference” has not replaced tests by confidence

intervals but is actually a test.

“Mechanistic inference” is a complicated and confusing way of

increasing the level of the test; it does nothing to the power of the test.

It is equivalent to using the usual p-value with a much larger

threshold value, e.g. 50% instead of 5%.


“Clinical inference” in “magnitude-based inference” increases the level

of the test and changes the thresholds.

• The increase in ηb (from 5% to 25%) looks spectacular but this is

misleading because ηb is not actually important when the p-value is in the

range 0.05–0.15.

• The decrease in ηh (from 5% to 0.5%) works against the other changes (in

the p-value and δ), but is outweighed by the gains from the other two

changes.

“Magnitude-based inference” is less conservative than other clinical inference

procedures.

If other researchers feel that clinical conclusions should be more conservative (“do

no harm”) than mere statistical significance, what is the role for a method

for clinical inference that is explicitly designed to be less conservative?


“I can’t be bothered addressing this kind of criticism. If you believe in God, no amount of evidence against His existence will disabuse you of your belief. Similarly, if you believe in null hypothesis testing, the evidence for a better method of making inferences about true effects means nothing to you. In any case, has this person read the evidence? I doubt it.”

Will Hopkins, quoted by Martin Buchheit, April 30, 2013


Sample size calculations


The standard formula (from significance testing) is

n ≈ [function of level (default: 5%) and power (default: 80%)] / (smallest difference you hope to detect)²

Without explanation or justification, Hopkins uses

n ≈ [function of 2ηh (default: 1%) and 1 − ηb (default: 75%)] / [4 × (smallest difference you hope to detect)²]

Calling ηh the “Type I clinical error rate” and ηb the “Type II clinical error rate”

acknowledges (ironically) that “magnitude-based inference” is a test but does not

justify their use in the standard sample size formula because ηh is not the level of

the test and ηb is not the probability of “not using an effect that is beneficial”.

There is no basis for the division by 4.

The changes to the numerator multiply the sample size by about 4/3; the division by 4

then reduces it to about 1/3 of the standard sample size.
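
The arithmetic behind these factors can be sketched as follows. The slides only say “function of level and power”; the code assumes the usual normal-approximation form, the square of the sum of two standard normal quantiles, and the function and variable names are hypothetical.

```python
from statistics import NormalDist


def numerator(level, power):
    """Assumed numerator of the sample size formula:
    (z_{1 - level/2} + z_{power})^2 for a two-sided test."""
    z = NormalDist().inv_cdf
    return (z(1 - level / 2) + z(power)) ** 2


eta_h, eta_b = 0.005, 0.25  # defaults quoted in the talk

standard = numerator(level=0.05, power=0.80)
hopkins = numerator(level=2 * eta_h, power=1 - eta_b)

ratio = hopkins / standard  # effect of changing the numerator only
overall = ratio / 4         # after the unexplained division by 4

print(f"numerator ratio: {ratio:.3f}")
print(f"overall factor:  {overall:.3f}")
```

Under this assumed form the numerator ratio comes out close to 4/3 and the overall factor close to 1/3, in line with the figures on the slide.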


Conclusion

The real motivation for “magnitude-based inference” is that significance tests

(the use of p-values) and confidence intervals are seen as being too

conservative.

“Magnitude-based inference” is promoted as an alternative to significance

tests, but it is also a test.

It is less conservative than standard tests because it inflates the level of the

test to levels that should not be used.

The sample size calculations should not be used.

We sympathize with the frustration of the researcher finding that the

evidence they have for an effect is weaker than they would like, but we

have to recognize the limitations of the data and be careful about

trying to strengthen weak evidence just because it suits us to do so.

We recommend being realistic about the limitations of the data and

using confidence intervals (in preference to p-values).


“Should scientists accept and offer overconfidence, oversimplification, distortion and rhetoric disguised as quantified science ...?” Sander Greenland
