A Comparison of the Optimality
of Statistical Significance Tests
for Information Retrieval Evaluation
Julián Urbano, Mónica Marrero and Diego Martín
Department of Computer Science · University Carlos III of Madrid
The problem: is system A more effective than system B?
The drill: evaluate with a test collection and run a statistical significance test
The dilemma: t-test, Wilcoxon, sign, bootstrap or permutation?
The reason: test assumptions are violated, so which one is optimal in practice?
Three criteria: power (maximize the number of significant results), safety (minimize the number of errors), exactness (keep the error rate at α)
Data and Methods
• TREC Robust 2004: 100 topics from Ad Hoc 7 and 8
o 110 runs, 5995 pairs of systems
• Randomly split the topics into T1 and T2, as if they were two collections
o Evaluate all runs and compute p-values
o Compare p-values from T1 with p-values from T2
o 1000 trials, 12M p-values per test, 60M in total
• Interpret pairs of p-values for different α levels
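The split procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names `permutation_p_value` and `split_trial` are hypothetical, the topics are split into two equal halves, the scores are synthetic, and only one of the five tests (a paired sign-flip permutation test, approximated by random resampling) is shown.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_resamples=2000, rng=None):
    """Two-sided paired permutation (sign-flip) test on per-topic differences."""
    if rng is None:
        rng = random.Random(0)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each difference is arbitrary.
        resampled = mean(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(resampled) >= observed:
            hits += 1
    # The add-one correction keeps the estimate away from an exact zero.
    return (hits + 1) / (n_resamples + 1)


def split_trial(scores_a, scores_b, rng):
    """One trial: split topics at random into T1 and T2, compare A and B on each.

    Assumption: T1 and T2 are equal-sized halves of the topic set.
    Returns [(mean difference on T1, p-value on T1), (same for T2)].
    """
    topics = list(range(len(scores_a)))
    rng.shuffle(topics)
    half = len(topics) // 2
    results = []
    for subset in (topics[:half], topics[half:]):
        a = [scores_a[t] for t in subset]
        b = [scores_b[t] for t in subset]
        results.append((mean(a) - mean(b), permutation_p_value(a, b, rng=rng)))
    return results


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic per-topic scores for two hypothetical systems over 100 topics.
    system_a = [rng.random() for _ in range(100)]
    system_b = [s + rng.gauss(0.02, 0.1) for s in system_a]
    (d1, p1), (d2, p2) = split_trial(system_a, system_b, rng)
    print(f"T1: diff={d1:+.4f} p={p1:.4f} | T2: diff={d2:+.4f} p={p2:.4f}")
```

Repeating `split_trial` over every pair of systems and over many random splits would produce the pairs of p-values that are then interpreted at each α level.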
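Outcome of each pair of p-values (rows: result with T1, columns: result with T2)

             | T2: A ≻ B        | T2: A ≺ B        | T2: A ≻≻ B       | T2: A ≺≺ B
T1: A ≻ B    | Non-significance (all columns)
T1: A ≻≻ B   | Lack of power    | Minor error      | Success          | Major error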
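The outcome categories above can be expressed as a small decision function. The sketch below is an assumption-laden illustration: `classify_pair` is a hypothetical name, a result counts as significant when p ≤ α, the direction of a result is taken from the sign of the mean score difference between A and B, and a zero difference in either half is treated as disagreement (an arbitrary choice made only for this example).

```python
def classify_pair(diff_t1, p_t1, diff_t2, p_t2, alpha=0.05):
    """Label a (T1, T2) pair of test results for one pair of systems.

    diff_*: mean score difference A - B on that topic subset.
    p_*:    p-value of the significance test on that subset.
    """
    if p_t1 > alpha:
        # T1 did not find a significant difference: nothing to confirm.
        return "non-significance"
    same_direction = diff_t1 * diff_t2 > 0
    significant_t2 = p_t2 <= alpha
    if same_direction:
        return "success" if significant_t2 else "lack of power"
    # T2 prefers the other system: a major error if T2 is also significant.
    return "major error" if significant_t2 else "minor error"


if __name__ == "__main__":
    # T1 says A is significantly better; T2 agrees on direction only.
    print(classify_pair(0.05, 0.01, 0.03, 0.20))
```

Because the label depends on α, the same pair of p-values can move between categories as the significance level changes.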
Dublin, Ireland · 30th July 2013 · Supported by ACM SIGIR Student Travel Grant
[Figure: six panels plotting each rate against the significance level α (.001, .005, .01, .05, .1) for the t-test, permutation, bootstrap, Wilcoxon and sign tests; the legend also lists a y=x reference line]
• Non-significance rate: Non-significants / Total
• Success rate: Successes / Total significants
• Lack of power rate: Lacks of power / Total significants
• Minor error rate: Minor errors / Total significants
• Major error rate: Major errors / Total significants
• Global error rate: Minor and Major errors / Total significants (α from .0001 to .5)
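The ratios named on the panel axes can be computed from a list of outcome labels, one per pair of p-values. This is a minimal sketch: `outcome_rates` is a hypothetical helper, and the label strings are assumptions chosen to match the outcome names used in the poster.

```python
from collections import Counter


def outcome_rates(outcomes):
    """Turn a list of outcome labels into the six rates plotted in the panels.

    The non-significance rate is relative to all comparisons; every other
    rate is relative to the comparisons that were significant with T1.
    """
    counts = Counter(outcomes)
    total = len(outcomes)
    significants = total - counts["non-significance"]
    nan = float("nan")
    rates = {
        "non-significance": counts["non-significance"] / total if total else nan
    }
    for label in ("success", "lack of power", "minor error", "major error"):
        rates[label] = counts[label] / significants if significants else nan
    errors = counts["minor error"] + counts["major error"]
    rates["global error"] = errors / significants if significants else nan
    return rates


if __name__ == "__main__":
    # Made-up labels for ten comparisons, for illustration only.
    labels = (["non-significance"] * 4 + ["success"] * 4
              + ["lack of power", "minor error"])
    for name, value in outcome_rates(labels).items():
        print(f"{name}: {value:.3f}")
```

Exactness can then be judged by comparing the global error rate with α at each significance level.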
Take-Home Messages
• Power: the bootstrap test gives more significant results than the other tests
• Safety: the t-test gives fewer errors than the other tests
• Exactness: the Wilcoxon test best tracks the nominal level α
• The permutation test is not optimal in practice
• Error rates seem lower than expected, so the focus should be on power
Previous Work
Zobel’98, Sanderson & Zobel’05, Cormack & Lynam’06
• Wilcoxon more powerful than t-test, but more errors
Smucker et al. ’07, ’09
• Bootstrap test overly powerful, though similar to t-test and permutation
• Wilcoxon and sign unreliable, should use permutation