Institut für Medizinische Biometrie, Epidemiologie und Informatik
Aesthetics and power in multiple testing – a contradiction?
MCP 2007, Vienna
Gerhard Hommel
Introduction: Economics and Statistics
Economics: profit is not everything
  Ethical / social component
  Competing interests
  Aesthetics: protection of environment, industrial art, patronage
Statistics: power is not everything
  Ethics: decisions are logical, comprehensible, simple
  Competing interests
  Aesthetics: “beauty of mathematics” (subjective), but also same points as for ethics
Examples for (non-) aesthetics:
Closure test
  + : principle is simple to describe
  + : coherence is obtained directly
  – : often very cumbersome to perform
Bonferroni-Holm: SD(α/n, α/(n-1), … , α/2, α)
Hochberg: SU(α/n, α/(n-1), … , α/2, α)
FDP, e.g. control of P(FDP > 0.2): SD(α/n, α/(n-1), α/(n-2), α/(n-3), 2α/(n-3), 2α/(n-4), … , 3α/(n-7), …)
  not beautiful (and not powerful)!
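The SD(...) and SU(...) notation can be made concrete with a small sketch. The function names below are illustrative, and the formula in `fdp_constants` is an assumption: the constants (⌊γi⌋+1)α/(n+⌊γi⌋+1−i) with γ = 0.2 reproduce the sequence listed above, but the slides do not state the formula themselves.

```python
from fractions import Fraction
from math import floor

def step_down(pvals, crit):
    """SD: reject in order of increasing p-value, stop at the first p(i) > crit[i]."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    rejected = set()
    for step, i in enumerate(order):
        if pvals[i] > crit[step]:
            break
        rejected.add(i)
    return rejected

def step_up(pvals, crit):
    """SU: find the largest i with p(i) <= crit[i], then reject H(1), ..., H(i)."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    for step in range(len(order) - 1, -1, -1):
        if pvals[order[step]] <= crit[step]:
            return set(order[:step + 1])
    return set()

def holm_constants(n, alpha):
    """alpha/n, alpha/(n-1), ..., alpha/2, alpha (shared by Holm and Hochberg)."""
    return [alpha / (n - i) for i in range(n)]

def fdp_constants(n, alpha, gamma):
    """Assumed formula for the step-down constants controlling P(FDP > gamma)."""
    consts = []
    for i in range(1, n + 1):
        k = floor(gamma * i) + 1
        consts.append(k * alpha / (n + k - i))
    return consts

p = [0.01, 0.04, 0.03]                          # hypothetical p-values
print(step_down(p, holm_constants(3, 0.05)))    # Bonferroni-Holm
print(step_up(p, holm_constants(3, 0.05)))      # Hochberg
print(fdp_constants(12, Fraction(1, 20), Fraction(1, 5)))
```

With these hypothetical p-values the step-up version rejects all three hypotheses while the step-down version stops after the first. Exact fractions are used for the FDP constants so that ⌊γi⌋ is not affected by floating-point rounding.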
Logical decisions: Coherence
Coherence: When a hypothesis (= subset of the parameter space) is rejected, each of its subsets is rejected as well.
Closure test: local level-α tests for all ∩-hypotheses + coherence ⇒ control of the multiple level (FWER) α.
Closure tests form a complete class within all MTPs controlling the FWER at level α.
But: Bonferroni-Holm is not coherent, in general!
Quasi-coherence: coherence for all index sets forming an intersection.
Quasi-closure test: local level-α tests for all index sets + quasi-coherence ⇒ control of the multiple level (FWER) α.
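A minimal sketch of a closure test over index sets may help here. It assumes Bonferroni local tests for the intersection hypotheses, which is only one possible choice; the names `closure_test` and `bonferroni_local` are illustrative. An elementary hypothesis is rejected only if every intersection containing its index is rejected locally, so coherence holds by construction.

```python
from itertools import combinations

def bonferroni_local(index_set, pvals, alpha):
    """Local level-alpha test of the intersection hypothesis for index_set."""
    return min(pvals[i] for i in index_set) <= alpha / len(index_set)

def closure_test(pvals, alpha, local_test):
    """Return the (0-based) indices of the elementary hypotheses rejected."""
    n = len(pvals)
    index_sets = [frozenset(c)
                  for r in range(1, n + 1)
                  for c in combinations(range(n), r)]
    locally_rejected = {s for s in index_sets if local_test(s, pvals, alpha)}
    # H_i is rejected only if all intersections containing i are rejected.
    return {i for i in range(n)
            if all(s in locally_rejected for s in index_sets if i in s)}

print(closure_test([0.01, 0.04, 0.03], 0.05, bonferroni_local))
print(closure_test([0.01, 0.02, 0.03], 0.05, bonferroni_local))
```

With Bonferroni local tests the decisions coincide with those of Bonferroni-Holm on these hypothetical p-values. The loop over all 2^n − 1 index sets also illustrates why the closure test is cumbersome to perform for larger n.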
Monotonic decisions
Consider: monotonicity between different hypotheses:
p1, … ,pn = p-values
pi ≤ pj and Hj rejected ⇒ Hi rejected.
Not obligatory: weights for hypotheses (from importance or expected power)
  See Benjamini / Hochberg (1997)
  Fixed sequence tests
  Gatekeeping procedures
Monotonic decisions: nested hypotheses
Example: Yi = β0 + β1 xi + β2 xi² + εi
H1: β1 = β2 = 0
H2: β2 = 0
F test of H1: p = .051
t test of H2: p = .024
Bonferroni-Holm (α = .05) rejects only H2
Logical: reject H1, too.
Size of a p-value is not the only criterion for rejection!
xi –3 –2 –1 0 1 2 3
yi 8 2 –1 1.6 –2 3 4
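The two p-values can be recomputed from the data as a check. The sketch below is not taken from the slides: it exploits the fact that Σxi = Σxi³ = 0, so the linear term and the centred quadratic term are orthogonal and their sums of squares separate, and it uses closed-form tail probabilities that hold only for the 4 error degrees of freedom of this example. Since H1 implies H2, rejecting H2 logically entails rejecting H1, although Bonferroni-Holm does not do so.

```python
from math import sqrt

x = [-3, -2, -1, 0, 1, 2, 3]
y = [8, 2, -1, 1.6, -2, 3, 4]
n, alpha = len(x), 0.05

mean_y = sum(y) / n
mean_x2 = sum(xi ** 2 for xi in x) / n
z = [xi ** 2 - mean_x2 for xi in x]          # centred quadratic regressor

def explained_ss(regressor):
    """Sum of squares explained by a regressor orthogonal to all others."""
    sxy = sum(r * yi for r, yi in zip(regressor, y))
    return sxy ** 2 / sum(r ** 2 for r in regressor)

ss_total = sum((yi - mean_y) ** 2 for yi in y)
ss_lin, ss_quad = explained_ss(x), explained_ss(z)
mse = (ss_total - ss_lin - ss_quad) / (n - 3)    # 4 error degrees of freedom

f_h1 = (ss_lin + ss_quad) / 2 / mse              # F test of beta1 = beta2 = 0
t_h2 = sqrt(ss_quad / mse)                       # |t| for beta2 = 0

def p_value_f_2_4(f):
    """Upper tail of the F(2, 4) distribution."""
    return (1 + f / 2) ** -2

def p_value_t_4(t):
    """Two-sided p-value of the t distribution with 4 degrees of freedom."""
    s = 1 + t * t / 4
    cdf = 0.5 + 0.375 * (abs(t) / sqrt(s)) * (1 - t * t / (12 * s))
    return 2 * (1 - cdf)

p_h1, p_h2 = p_value_f_2_4(f_h1), p_value_t_4(t_h2)

# Bonferroni-Holm for two hypotheses: alpha/2 for the smaller p-value, then alpha.
holm_rejects_h2 = p_h2 <= alpha / 2
holm_rejects_h1 = holm_rejects_h2 and p_h1 <= alpha
print(round(p_h1, 3), round(p_h2, 3), holm_rejects_h1, holm_rejects_h2)
```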
Monotonic decisions: multiple comparisons
Example: Comparison of k=4 means (ANOVA)
Hij: μi = μj, 1 ≤ i < j ≤ 4
p13 = .0241 < p34 = .0244 (t test; pooled variance)
Closure test rejects H14, H24, H34, but not H13!
(same result with regwq)
Non-monotonicity may be reasonable:
It is easier to separate group 4 from the cluster of groups 1,2,3 than to find differences within the cluster.
group 1 2 3 4
mean value 0 1 2 3.99
Monotonic decisions
My conclusion:
Only for equal weights and no logical constraints is it mandatory that
  decisions are monotonic in p-values, and
  decisions are exchangeable.
Monotonicity within same hypothesis (α-consistency)
Given p-values p1, …, pn; q1, …, qn
with qi ≤ pi for i=1,…,n.
When a hypothesis is rejected based on the pi, it should also be rejected when based on the qi.
Counterexample 1 (WAP procedure of Benjamini-Hochberg, 1997):
Stepdown based on p(j) ≤ w(j)α/(w(j)+…+w(n)):
Controls the FWER, but is not α-consistent.
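A small numerical illustration of the lack of α-consistency, under the assumption that the step-down order is given by the raw p-values and w(j) is the weight attached to the j-th smallest p-value. The weights and p-values are hypothetical, chosen only to show the effect.

```python
def wap_step_down(pvals, weights, alpha):
    """Step-down on ordered p-values with critical values w(j)*alpha / (w(j)+...+w(n))."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    remaining = sum(weights)          # total weight of hypotheses not yet rejected
    rejected = set()
    for i in order:
        if pvals[i] > weights[i] * alpha / remaining:
            break
        rejected.add(i)
        remaining -= weights[i]
    return rejected

weights = [9, 1]                      # hypothetical weights for H1, H2
p = [0.04, 0.045]
q = [0.04, 0.03]                      # q_i <= p_i for both hypotheses

print(wap_step_down(p, weights, 0.05))   # both hypotheses rejected
print(wap_step_down(q, weights, 0.05))   # nothing rejected
```

Lowering the second p-value moves the low-weight hypothesis to the front of the sequence, where its critical value is only .005, so the procedure stops immediately.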
Monotonicity within same hypothesis (α-consistency)
Counterexample 2: Tarone's (1990) MTP
Uses information about minimum attainable p-values α1*, …, αn*
n=2, α1*=.03, α2*=.04:
  α = .05: no Hj can be rejected;
  α = .035: H1 can be rejected if p1 ≤ .035.
Hommel/Krummenauer (1998): monotonic improvement of Tarone's procedure (using a "rejection function" b(α))
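The counterexample can be reproduced with a short sketch of Tarone's procedure as it is usually stated: take the smallest k for which at most k hypotheses have a minimum attainable p-value of at most α/k, and test exactly those hypotheses at level α/k. This reading is an assumption, since the slides do not spell the procedure out, and the observed p-values are hypothetical.

```python
def tarone(pvals, min_attainable, alpha):
    """Return the (0-based) indices rejected by Tarone's modified Bonferroni test."""
    n = len(pvals)
    for k in range(1, n + 1):
        testable = [i for i in range(n) if min_attainable[i] <= alpha / k]
        if len(testable) <= k:
            return {i for i in testable if pvals[i] <= alpha / k}
    return set()

min_attainable = [0.03, 0.04]
p = [0.032, 0.5]                      # hypothetical observed p-values

print(tarone(p, min_attainable, 0.05))    # k = 2: level .025 is unattainable
print(tarone(p, min_attainable, 0.035))   # k = 1: H1 is tested at level .035
```

The smaller level α = .035 leads to a rejection that the larger level α = .05 does not allow.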
The fallback procedure (I)
Wiens (2003): "fixed sequence testing procedure" with possibility to continue
Dmitrienko, Wiens, Westfall (2005): "fallback procedure"
Wiens + Dmitrienko (2005): proof that the FWER is controlled, suggestion for improvement
Two types of weights:
  sequence of hypotheses;
  "assigned weights" α1′,…,αn′ with Σαi′ = α.
The fallback procedure (II)
Use "assigned weights" α1′,…,αn′ with Σαi′ = α.
Actual significance levels:
α1 = α1′
αi = αi′ + αi-1 if Hi-1 has been rejected
αi = αi′ if Hi-1 has not been rejected.
α1′ = α, α2′ = ... = αn′ = 0 ⇒ fixed sequence test.
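The recursion for the actual significance levels translates directly into a loop. The sketch below follows the rules as listed; the function name and the p-values are illustrative.

```python
def fallback(pvals, assigned):
    """Fallback procedure: hypotheses are tested in their prespecified order.

    assigned[i] is the assigned weight alpha_i'. The level of a rejected
    hypothesis is carried over to the next one in the sequence.
    """
    decisions = []
    carry = 0.0                        # level passed on by the previous rejection
    for p, a in zip(pvals, assigned):
        level = a + carry
        if p <= level:
            decisions.append(True)
            carry = level
        else:
            decisions.append(False)
            carry = 0.0
    return decisions

# Assigned weights (alpha, 0, 0) give the fixed sequence test.
print(fallback([0.01, 0.03, 0.20], [0.05, 0.0, 0.0]))
print(fallback([0.06, 0.001, 0.001], [0.05, 0.0, 0.0]))
```

In the second call the first hypothesis is retained, so the later hypotheses are tested at level 0 and cannot be rejected, however small their p-values are.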
Example for n = 2
Endpoint 1: Functional capacity of heart
Endpoint 2: Mortality
α = .05, α1′ = .04, α2′ = .01
p1 ≤ .04: Reject H1 and test H2 with α2 = .05.
p1 > .04: Retain H1 and test H2 with α2 = .01.
Weighted Bonferroni-Holm with α1′ = .04, α2′ = .01:
Rejects H1, in addition, when p2 ≤ .01 and .04 < p1 ≤ .05!
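For two hypotheses both procedures fit in a few lines, which makes the extra rejection easy to see. The sketch assumes that weighted Bonferroni-Holm is carried out as a closure test with a weighted Bonferroni test of the intersection hypothesis; the function names and the p-values are illustrative.

```python
def fallback_2(p1, p2, a1, a2):
    """Fallback for n = 2 with assigned weights a1, a2."""
    r1 = p1 <= a1
    r2 = p2 <= (a1 + a2 if r1 else a2)
    return r1, r2

def weighted_holm_2(p1, p2, a1, a2, alpha):
    """Weighted Bonferroni-Holm for n = 2, written as a closure test."""
    intersection_rejected = p1 <= a1 or p2 <= a2
    return (intersection_rejected and p1 <= alpha,
            intersection_rejected and p2 <= alpha)

alpha, a1, a2 = 0.05, 0.04, 0.01
print(fallback_2(0.045, 0.005, a1, a2))              # only H2 is rejected
print(weighted_holm_2(0.045, 0.005, a1, a2, alpha))  # H1 and H2 are rejected

# On a grid of p-values, WBH rejects everything that fallback rejects.
grid = [i / 200 for i in range(0, 21)]
dominates = all(
    w >= f
    for p1 in grid for p2 in grid
    for f, w in zip(fallback_2(p1, p2, a1, a2),
                    weighted_holm_2(p1, p2, a1, a2, alpha))
)
print(dominates)
```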
Comparison with weighted Bonferroni-Holm
For n = 2: WBH is strictly more powerful than the fallback procedure. The improvement by Wiens + Dmitrienko is identical to WBH.
For n ≥ 3: There exist situations where fallback rejects and WBH does not, and conversely. (⇒ the improvement by W+D is not identical to WBH)
The fallback procedure for n=3: weights for intersection hypotheses
αi′ = wi α, Σwi = 1 (see W+D)

index set    weight for index
             1        2        3
{1,2,3}      w1       w2       w3
{1,2}        w1       w2       --
{1,3}        w1       --       w2+w3
{2,3}        --       w1+w2    w3
{1}          w1       --       --
{2}          --       w1+w2    --
{3}          --       --       w1+w2+w3
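The pattern in the table is that each index in a set collects its own weight plus the weights of all omitted indices directly before it, while weights after the last index of the set are dropped. A sketch of the resulting closure test follows. The function names are illustrative, and the rule for general n is an extrapolation from the n = 3 table rather than something stated on the slides.

```python
from fractions import Fraction
from itertools import combinations

def fallback_weights(index_set, w):
    """Weights of an intersection hypothesis; indices are 1-based, w is a list."""
    weights, prev = {}, 0
    for i in sorted(index_set):
        weights[i] = sum(w[prev:i])    # w_(prev+1) + ... + w_i
        prev = i
    return weights

def fallback_by_closure(pvals, w, alpha):
    """Closure test with weighted Bonferroni local tests using the table weights."""
    n = len(pvals)
    sets = [c for r in range(1, n + 1)
            for c in combinations(range(1, n + 1), r)]
    rejected_sets = set()
    for s in sets:
        wts = fallback_weights(s, w)
        if any(pvals[i - 1] <= wts[i] * alpha for i in s):
            rejected_sets.add(s)
    return {i for i in range(1, n + 1)
            if all(s in rejected_sets for s in sets if i in s)}

w = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]   # hypothetical weights
print(fallback_weights((2, 3), w))
print(fallback_weights((3,), w))

equal = [Fraction(1, 3)] * 3
print(fallback_by_closure([0.015, 0.02, 1.0], equal, Fraction(1, 20)))
print(fallback_by_closure([0.02, 1.0, 0.015], equal, Fraction(1, 20)))
```

Exact fractions avoid rounding problems when p-values are compared with levels such as α/3.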
The fallback procedure for n=3: equal weights
αi′ = wi α, wi = 1/3
Consequence for importance: H2 > H3 > H1?

index set    weight for index
             1      2      3
{1,2,3}      1/3    1/3    1/3
{1,2}        1/3    1/3    --
{1,3}        1/3    --     2/3
{2,3}        --     2/3    1/3
{1}          1/3    --     --
{2}          --     2/3    --
{3}          --     --     1
The fallback procedure for n=3: equal weights; improvement by W+D
αi′ = wi α, wi = 1/3
Consequence for importance: H2 > H3 > H1 (remains)

index set    weight for index
             1      2      3
{1,2,3}      1/3    1/3    1/3
{1,2}        1/2    1/2    --
{1,3}        1/3    --     2/3
{2,3}        --     2/3    1/3
{1}          1      --     --
{2}          --     1      --
{3}          --     --     1
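To see what the changed rows buy, the tabulated weights can be plugged into a closure test. This is a sketch under the assumption that the improvement is applied as a closure test with weighted Bonferroni local tests. The two dictionaries transcribe the equal-weight tables for the fallback procedure and for the improved version, and the p-values are hypothetical.

```python
from fractions import Fraction as F

ORIGINAL = {
    (1, 2, 3): {1: F(1, 3), 2: F(1, 3), 3: F(1, 3)},
    (1, 2): {1: F(1, 3), 2: F(1, 3)},
    (1, 3): {1: F(1, 3), 3: F(2, 3)},
    (2, 3): {2: F(2, 3), 3: F(1, 3)},
    (1,): {1: F(1, 3)},
    (2,): {2: F(2, 3)},
    (3,): {3: F(1)},
}
# The improved table differs only in the rows {1,2}, {1} and {2}.
IMPROVED = dict(ORIGINAL)
IMPROVED.update({
    (1, 2): {1: F(1, 2), 2: F(1, 2)},
    (1,): {1: F(1)},
    (2,): {2: F(1)},
})

def closure_rejections(pvals, table, alpha):
    """Reject H_i if every index set containing i is rejected by its local test."""
    rejected_sets = {
        s for s, wts in table.items()
        if any(pvals[i - 1] <= w * alpha for i, w in wts.items())
    }
    return {i for i in (1, 2, 3)
            if all(s in rejected_sets for s in table if i in s)}

alpha = F(1, 20)
p = [0.02, 1.0, 0.015]                        # p3 < p1 < p2
print(closure_rejections(p, ORIGINAL, alpha))
print(closure_rejections(p, IMPROVED, alpha))
```

With the original weights only H3 is rejected for these p-values. The larger weights for {1,2} and {1} additionally allow the rejection of H1.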
The fallback procedure for n=3: equal weights
The decisions of the fallback procedure (with equal weights) are not exchangeable (and can never be made exchangeable!).
Example: p(1)=.015, p(2)=.02, p(3)=1; α=.05.
(Bonferroni-Holm: rejects H(1) and H(2))
p1 < p2 < p3 : reject H1, H2
p1 < p3 < p2 : reject H1
p2 < p1 < p3 : reject H2
p2 < p3 < p1 : reject H2, H3
p3 < p1 < p2 : reject H3 (, H1)
p3 < p2 < p1 : reject H3
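The six cases can be reproduced by running the sequential fallback rules over all assignments of the three p-values to H1, H2, H3. This sketch covers the plain fallback procedure only, so the additional rejection given in parentheses does not appear; the function name is illustrative.

```python
from fractions import Fraction
from itertools import permutations

def fallback(pvals, assigned):
    """Return the 1-based indices rejected by the fallback procedure."""
    rejected, carry = set(), 0
    for i, (p, a) in enumerate(zip(pvals, assigned), start=1):
        level = a + carry              # assigned weight plus carried-over level
        if p <= level:
            rejected.add(i)
            carry = level
        else:
            carry = 0
    return rejected

alpha = Fraction(1, 20)
assigned = [alpha / 3] * 3             # equal assigned weights
values = (0.015, 0.02, 1.0)

results = {}
for perm in permutations(values):      # perm = (p1, p2, p3)
    results[perm] = fallback(perm, assigned)
    print(perm, sorted(results[perm]))
```

The same three p-values lead to between one and two rejections depending only on which hypothesis they belong to, whereas Bonferroni-Holm always rejects the two hypotheses with the smallest p-values.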
The fallback procedure: critical questions
What are the relations of the two different types of weighting?
Can it be meaningful to give higher assigned weights for higher indices?
Can one give "guidelines" on how to choose the weights?
Equal assigned weights: what is the influence of ordering? (anyway: the procedure has "aesthetic" drawbacks)
For which situations can one expect that the fallback procedure is more powerful than WBH?
Or would it be better to abandon it altogether?