Designing Test Collections for Comparing Many Systems
![Page 1: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/1.jpg)
Designing Test Collections for Comparing Many Systems
Tetsuya Sakai
Waseda University, Japan
@tetsuyasakai
November 4, 2014 @ CIKM 2014, Shanghai
![Page 2: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/2.jpg)
Acknowledgement
This research is a part of Waseda University’s project “Taxonomising and Evaluating Web Search Engine User Behaviours,” supported by Microsoft Research.
THANK YOU!
![Page 3: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/3.jpg)
Takeaways (1)
• Using one-way ANOVA-based power analysis, researchers can determine the topic set size n by specifying:
  α: Type I error probability
  β: Type II error probability
  minD: minimum detectable range (performance difference between the best and worst systems) for ensuring a statistical power of 1-β
  m: number of systems to be compared
  σ̂²: estimated variance of each system
• Different measures have different σ̂²s, so researchers should decide on the evaluation measure at the test collection design phase
![Page 4: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/4.jpg)
Takeaways (2)
• Our method can provide different test collection designs (n, pd) that satisfy the same statistical requirement.
  n: topic set size
  pd: pool depth
• The assessment cost of a pd=100 test collection can be reduced to 18% or less while keeping it statistically equally reliable.
• Our method can be used to compare evaluation measures in terms of practical significance = judgment cost.
• Our tools and data are available at
  http://www.f.waseda.jp/tetsuya/tools.html
  http://www.f.waseda.jp/tetsuya/data.html
![Page 5: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/5.jpg)
TALK OUTLINE
1. How Information Retrieval (IR) test collections are constructed
2. Statistical reform
3. How test collections SHOULD be constructed
4. Experimental results
5. Conclusions and future work
![Page 6: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/6.jpg)
Test collections = standard data sets for evaluation
[Figure: systems are run on test collection A and on test collection B, each producing evaluation measure values]
![Page 7: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/7.jpg)
An Information Retrieval (IR) test collection
Document collection
Topic set: each topic comes with relevance assessments (relevant/nonrelevant documents)
“Qrels” = query relevance sets
Example topic: “CIKM 2014 home page”
• cikm2014.fudan.edu.cn/: highly relevant
• cikmconference.org/: partially relevant
• www.cikm2013.org: nonrelevant
![Page 8: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/8.jpg)
How IR people build test collections (1)
Okay, let’s build a test collection…
Organiser
![Page 9: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/9.jpg)
How IR people build test collections (2)
…with maybe n=50 topics (search requests)…
Well, n>25 sounds good for statistical significance testing, but why 50? Why not 100? Why not 30?
[Figure: a stack of topics, starting with Topic 1]
![Page 10: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/10.jpg)
How IR people build test collections (3)
50 topics
Okay folks, give me your runs (search results)!
[Figure: participants submit their runs for the 50 topics]
![Page 11: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/11.jpg)
How IR people build test collections (4)
50 topics
Pool depth pd=100 looks affordable…
Pool for Topic 1: the top pd=100 documents from each run
The document collection is too large for exhaustive relevance assessments, so only the pooled documents are judged.
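The pooling step on this slide can be sketched in a few lines. This is a minimal illustration, assuming each run is a ranked list of document IDs for one topic; the function name `build_pool`, the run names, and the document IDs are hypothetical and not taken from the talk.

```python
def build_pool(runs, pd):
    """Return the pool for one topic: the union of the top-pd documents of every run."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:pd])  # keep only the top pd documents of this run
    return pool


# Toy ranked lists for Topic 1 (hypothetical data).
runs_topic1 = {
    "runA": ["d3", "d1", "d7", "d2"],
    "runB": ["d1", "d9", "d3", "d5"],
    "runC": ["d4", "d1", "d8", "d6"],
}

# Only the documents in the pool are shown to the relevance assessors.
print(sorted(build_pool(runs_topic1, pd=2)))
```

Because runs overlap, the pool is usually much smaller than pd times the number of runs, which is what makes a given pool depth "affordable".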
![Page 12: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/12.jpg)
How IR people build test collections (5)
50 topics
Pool for Topic 1: the top pd=100 documents from each run
Relevance assessments: each pooled document is judged highly relevant, partially relevant, or nonrelevant
![Page 13: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/13.jpg)
An Information Retrieval (IR) test collection
Document collection
Topic set: each topic comes with relevance assessments (relevant/nonrelevant documents)
“Qrels” = query relevance sets
n=50 topics… why?
Pool depth pd=100 (not exhaustive)
![Page 14: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/14.jpg)
TALK OUTLINE
1. How Information Retrieval (IR) test collections are constructed
2. Statistical reform
3. How test collections SHOULD be constructed
4. Experimental results
5. Conclusions and future work
![Page 15: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/15.jpg)
NHST = null hypothesis significance testing (1)
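EXAMPLE: paired t-test for comparing systems X and Y with n topics
Assumptions: the per-topic differences dj = xj - yj (j=1,…,n) are independent and normally distributed
Null hypothesis: H0: μX = μY (population means are the same)
Test statistic: t0 = d̄ / √(V/n), where d̄ is the sample mean and V the unbiased sample variance of the dj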
![Page 16: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/16.jpg)
NHST = null hypothesis significance testing (2)
EXAMPLE: paired t‐test for comparing systems X and Y with n topics
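Null hypothesis: H0: μX = μY
Test statistic: t0 = d̄ / √(V/n), with d̄ and V the sample mean and variance of the per-topic differences xj - yj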
Under H0, t0 obeys a t distribution with n‐1 degrees of freedom.
![Page 17: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/17.jpg)
NHST = null hypothesis significance testing (3)
EXAMPLE: paired t-test for comparing systems X and Y with n topics
Null hypothesis: H0: μX = μY
Under H0, t0 obeys a t distribution with n-1 degrees of freedom.
Given a significance criterion α (=0.05), reject H0 if |t0| >= t(n-1; α).
[Figure: t distribution for n=50 with the critical values -t(n-1; α) and t(n-1; α) marked]
“H0 is probably not true because the chance of observing t0 under H0 is very small”
![Page 18: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/18.jpg)
NHST = null hypothesis significance testing (4)
EXAMPLE: paired t-test for comparing systems X and Y with n topics
Null hypothesis: H0: μX = μY
Given a significance criterion α (=0.05), reject H0 if |t0| >= t(n-1; α).
[Figure: two t distributions for n=50 with the critical values -t(n-1; α) and t(n-1; α) marked]
• t0 beyond a critical value. Conclusion: X ≠ Y!
• t0 between the critical values. Conclusion: H0 not rejected, so don’t know
![Page 19: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/19.jpg)
NHST is not good enough [Cumming12]
• Dichotomous thinking (“different or not different?”)
  A more important question is “What is the magnitude of the difference?” Another is “How accurate is my estimate?”
• p-values are a little more informative than “significant at α=0.05”, but…
[Figure: t distribution for n=50 with -t(n-1; α), t(n-1; α) and t0 marked; the p-value is the probability of observing t0 or something more extreme under H0]
![Page 20: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/20.jpg)
The p‐value is not good enough either [Ellis10,Nagata03]
Reject H0 if |t0| >= t(n-1; α), where t0 = √n × (sample effect size).
The sample effect size is the difference between X and Y measured in standard deviation units.
But a large |t0| could mean two things:
(1) Sample effect size (ES) is large;
(2) Topic set size n is large.
If you increase the sample size n, you can always achieve statistical significance!
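A small numeric sketch makes the second point concrete. It assumes the standard paired t statistic t0 = d̄ / √(V/n), which equals √n times the sample effect size d̄ / √V; the effect size of 0.1 and the topic set sizes below are made-up values for illustration only.

```python
import math


def t0_from_effect_size(sample_es, n):
    """Paired t statistic written as sqrt(n) times the sample effect size."""
    return math.sqrt(n) * sample_es


# The same small effect size (0.1 standard deviation units) at growing n.
small_es = 0.1
for n in (25, 50, 400, 1600):
    print(n, round(t0_from_effect_size(small_es, n), 2))
```

The two-sided 5% critical t value approaches 1.96 as n grows, so a fixed, tiny effect eventually crosses it once enough topics are added.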
![Page 21: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/21.jpg)
Statistical reform – effect sizes [Cumming12,Okubo12]
• ES: “how much difference is there?”
• ES for the paired t-test measures the difference in standard deviation units
Population ES = (μX - μY) / σ, where σ is the population standard deviation of the per-topic differences
Sample ES as an estimate of the above = d̄ / √V (sample mean of the differences over their sample standard deviation)
In several research disciplines such as psychology and medicine, it is required to report ESs!
In this study, we determine the topic set size n by ensuring high power 1-β whenever ES is large.
![Page 22: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/22.jpg)
Statistical reform – confidence intervals
• CIs are much more informative than NHST (point estimate + uncertainty/accuracy)
• Estimation thinking, not dichotomous thinking [Cumming12]
In several research disciplines such as psychology and medicine, it is required to report CIs! [Sakai14SIGIRforum]
See [Sakai14FIT] (Designing Test Collections that Provide Tight Confidence Intervals)
![Page 23: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/23.jpg)
TALK OUTLINE
1. How Information Retrieval (IR) test collections are constructed
2. Statistical reform
3. How test collections SHOULD be constructed
4. Experimental results
5. Conclusions and future work
![Page 24: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/24.jpg)
Overview of our approaches
• Using the paired t-test to determine n (for m=2 systems)
  INPUT: α, β, minDt (minimum detectable difference for ensuring power=1-β), σ̂t² (estimated variance of performance difference)
• Using one-way ANOVA to determine n (for m>=2 systems)
  INPUT: α, β, m, minD (minimum detectable range for ensuring power=1-β), σ̂² (estimated variance of system performance)
• Methods for estimating σ̂t² and σ̂²
  INPUT: test collections with runs and an evaluation measure (topic-by-run matrices)
See Appendix
![Page 25: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/25.jpg)
The ANOVA approach (1)
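Assume xij ~ N(μi, σ²), i=1,…,m (systems), j=1,…,n (topics)
Homoscedasticity (equal variance): every system shares the same σ²
Let ai = μi - μ (system effect), where μ = (1/m) Σi μi
Hypotheses
H0: a1 = … = am = 0 (no system effect)
H1: ai ≠ 0 for some i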
![Page 26: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/26.jpg)
The ANOVA approach (2)
i=1,…,m (systems), j=1,…,n (topics)
Total variation ST = Σi Σj (xij - x̄)² can be decomposed into ST = SA + SE, where
SA = n Σi (x̄i - x̄)² (between-system variation)
SE = Σi Σj (xij - x̄i)² (within-system variation)
x̄: sample grand mean; x̄i: sample system mean
![Page 27: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/27.jpg)
The ANOVA approach (3)
i=1,…,m (systems), j=1,…,n (topics)
Test statistic F0 = VA/VE where VA = SA/φA, VE = SE/φE, φA = m-1, φE = m(n-1)
How large is the between-system variance compared to the within-system variance?
Under H0, F0 ~ F distribution with (φA, φE) degrees of freedom.
One-way ANOVA rejects H0 if F0 >= F(φA; φE; α).
[Figure: F distribution with the critical value F(φA; φE; α); area 1-α to its left and α to its right, with F0 marked]
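As a rough sketch of how these quantities come out of a topic-by-run matrix, the following computes the standard one-way ANOVA sums of squares and F0 = VA/VE with φA = m-1 and φE = m(n-1). The function name `one_way_anova` and the 3-system, 4-topic score matrix are hypothetical; comparing F0 with the critical value F(φA; φE; α) is left out because it needs an F-distribution table or library.

```python
def one_way_anova(matrix):
    """matrix[i][j] is the score of system i on topic j (m systems, n topics each)."""
    m, n = len(matrix), len(matrix[0])
    grand_mean = sum(sum(row) for row in matrix) / (m * n)
    system_means = [sum(row) / n for row in matrix]

    # Between-system variation, within-system variation, total variation.
    s_a = n * sum((sm - grand_mean) ** 2 for sm in system_means)
    s_e = sum((x - sm) ** 2 for row, sm in zip(matrix, system_means) for x in row)
    s_t = sum((x - grand_mean) ** 2 for row in matrix for x in row)

    phi_a, phi_e = m - 1, m * (n - 1)  # degrees of freedom
    v_a, v_e = s_a / phi_a, s_e / phi_e
    return {"ST": s_t, "SA": s_a, "SE": s_e, "VA": v_a, "VE": v_e, "F0": v_a / v_e}


# Toy topic-by-run matrix (rows: systems, columns: topics); made-up scores.
scores = [
    [0.30, 0.50, 0.40, 0.60],
    [0.20, 0.40, 0.30, 0.50],
    [0.50, 0.70, 0.60, 0.80],
]
result = one_way_anova(scores)
print(result["F0"], result["VE"])
```

The decomposition ST = SA + SE can be checked directly on the returned values.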
![Page 28: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/28.jpg)
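The ANOVA approach (4)
i=1,…,m (systems), j=1,…,n (topics)
The probability of rejecting H0: Pr{ F0 >= F(φA; φE; α) }
Under H0, this is exactly α (rejecting H0 that is true)
[Figure: F distribution with the critical value F(φA; φE; α); area 1-α to its left and α to its right]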
![Page 29: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/29.jpg)
The ANOVA approach (5)
i=1,…,m (systems), j=1,…,n (topics)
The probability of rejecting H0: Pr{ F0 >= F(φA; φE; α) }
Under H1, this is exactly the power 1-β (rejecting H0 that is false)
Under H1, F0 ~ noncentral F distribution with (φA, φE) degrees of freedom and a noncentrality parameter λ = nΔ, where Δ = (Σi ai²) / σ² measures total system effects in variance units.
![Page 30: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/30.jpg)
The ANOVA approach (6)
i=1,…,m (systems), j=1,…,n (topics)
The power (probability of rejecting H0 that is false): 1-β = Pr{ F0 >= F(φA; φE; α) }, where F0 ~ noncentral F distribution
For a random variable F’ that obeys a noncentral F distribution, Pr{ F’ <= w } can be approximated using a normal distribution [Nagata03] (Eqs. 14 and 15 in my paper).
Given α, n, Δ, m, the power 1-β can be computed.
But what we want is: Given α, β, Δ, m, compute n!
![Page 31: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/31.jpg)
The ANOVA approach (7)
i=1,…,m (systems), j=1,…,n (topics)
Under H0, we know Δ=0. But under H1, we only know that Δ ≠ 0.
Δ needs to be specified to guarantee power=1-β.
Let’s guarantee power=1-β (i.e. correctly reject H0 with 100(1-β)% confidence) whenever Δ >= minΔ (minimum detectable delta).
How shall we set minΔ?
![Page 32: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/32.jpg)
The ANOVA approach (8)
i=1,…,m (systems), j=1,…,n (topics)
Let minΔ = minD² / (2σ̂²), where minD is the minimum detectable range that you specify and σ̂² is a variance estimate from past data.
Whenever the difference between the best system and the worst system is minD or more, we guarantee power=1-β.
Given α, n, Δ, m, the power 1-β can be computed.
Using α, n, minD, σ̂², m, the worst-case power can be computed (see Eq. 17 in my paper).
[Figure: system means μi around the grand mean μ, with ai = μi - μ; the best and worst systems are minD apart]
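Why a range of minD corresponds to a worst case of minD²/2 for the summed squared system effects can be seen from a short derivation. It assumes Δ = Σ ai²/σ² with ai = μi - μ, and writes D for the difference between the best and the worst system; the symbols follow the surrounding slides, but the derivation is a reconstruction rather than a quotation from the paper.

```latex
\begin{align*}
\sum_{i=1}^{m} a_i^2
  &\ge a_{\max}^2 + a_{\min}^2
   \ge \frac{(a_{\max} - a_{\min})^2}{2} = \frac{D^2}{2}, \\
\Delta = \frac{\sum_{i=1}^{m} a_i^2}{\sigma^2}
  &\ge \frac{D^2}{2\sigma^2}
   \ge \frac{\mathit{minD}^2}{2\sigma^2}
   \quad \text{whenever } D \ge \mathit{minD}.
\end{align*}
```

The middle step uses x² + y² >= (x - y)²/2, with equality when the best and worst systems sit at +D/2 and -D/2 around the grand mean and every other system sits exactly on it. That configuration is the hardest one to detect for a given range, which is why it serves as the worst case.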
![Page 33: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/33.jpg)
The ANOVA approach (9)
i=1,…,m (systems), j=1,…,n (topics)
Given (α, β, minD, σ̂², m), the required n follows from the noncentrality parameter λ = nΔ: with Δ = minΔ, n = λ / minΔ.
Here, λ can be obtained if we use the following approximation [Nagata03]:
Let φE = m(n-1) ≒ ∞. Then 1-β ≒ Pr{ χ’² >= χ²(φA; α) }, where χ’² ~ noncentral chi-square distribution with φA degrees of freedom and the same noncentrality parameter λ.
![Page 34: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/34.jpg)
The ANOVA approach (10)
i=1,…,m (systems), j=1,…,n (topics)
Given (α, β, minD, σ̂², m), λ = nΔ: noncentrality parameter
φA = m‐1
For noncentral chi‐square distributions, use the following λ values [Nagata03]:
![Page 35: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/35.jpg)
The ANOVA approach (11)
i=1,…,m (systems), j=1,…,n (topics)
Given (α, β, minD, σ̂², m), obtain n = λ / minΔ using λ from the table (λ = nΔ: noncentrality parameter), and check if (α, n, minΔ, m) satisfies the power requirement:
Pr{ F0 >= F(φA; φE; α) } >= 1-β, with F0 ~ noncentral F distribution (normal approximation available: Eqs. 15, 16 in my paper)
If not, n++ and try the above again.
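The first half of this procedure, getting λ and an initial n, can be sketched without any table. The code below is not the author's tool: instead of looking λ up in [Nagata03], it solves for λ numerically under the large-φE chi-square approximation, and it skips the final check against the noncentral F distribution. All function names and the example inputs (minD=0.1, σ̂²=0.05) are hypothetical, and minΔ = minD²/(2σ̂²) is assumed.

```python
import math


def chi2_cdf(x, k):
    """Central chi-square CDF via the series for the regularized incomplete gamma."""
    if x <= 0:
        return 0.0
    a, z = k / 2.0, x / 2.0
    term = math.exp(-z + a * math.log(z) - math.lgamma(a + 1.0))
    total, n = term, 0
    while term > 1e-16 * total:
        n += 1
        term *= z / (a + n)
        total += term
    return min(total, 1.0)


def ncx2_cdf(x, k, lam):
    """Noncentral chi-square CDF as a Poisson(lam/2) mixture of central CDFs."""
    half = lam / 2.0
    weight, total, j = math.exp(-half), 0.0, 0
    while True:
        total += weight * chi2_cdf(x, k + 2 * j)
        j += 1
        weight *= half / j
        if j > half and weight < 1e-16:
            return total


def solve_increasing(f, target, iters=100):
    """Bisection for f(x) = target, assuming f is increasing on x >= 0."""
    lo, hi = 0.0, 1.0
    while f(hi) < target:
        hi *= 2.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return hi


def required_lambda(m, alpha=0.05, beta=0.20):
    """Noncentrality needed for power 1-beta under the chi-square approximation."""
    phi_a = m - 1
    crit = solve_increasing(lambda x: chi2_cdf(x, phi_a), 1.0 - alpha)
    return solve_increasing(lambda lam: 1.0 - ncx2_cdf(crit, phi_a, lam), 1.0 - beta)


def initial_topic_set_size(m, min_d, sigma2, alpha=0.05, beta=0.20):
    """Initial n = lambda / minDelta, with minDelta = minD^2 / (2 * sigma2)."""
    min_delta = min_d ** 2 / (2.0 * sigma2)
    return math.ceil(required_lambda(m, alpha, beta) / min_delta)


print(round(required_lambda(2), 3))
print(initial_topic_set_size(m=2, min_d=0.1, sigma2=0.05))
```

The value returned here is only the starting point of the loop described on the slide; the power check and the "n++" step would still have to be run against the noncentral F distribution.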
![Page 36: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/36.jpg)
Demo: determine n from (α, β, minD, σ̂², m)
http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeANOVA.xlsx
![Page 37: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/37.jpg)
Obtaining σ̂²
A time-honoured method using one-way ANOVA statistics [Okubo12]:
• Estimate of the population between-system variance
• Estimate of the population within-system variance
• Estimate of the population variance
Given multiple ANOVA statistics (test collections + runs), the estimated variances can be pooled to enhance reliability:
![Page 38: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/38.jpg)
Overview of our approaches
• Using the paired t-test to determine n (for m=2 systems)
  INPUT: α, β, minDt (minimum detectable difference for ensuring power=1-β), σ̂t² (estimated variance of performance difference)
• Using one-way ANOVA to determine n (for m>=2 systems)
  INPUT: α, β, m, minD (minimum detectable range for ensuring power=1-β), σ̂² (estimated variance of system performance)
• Methods for estimating σ̂t² and σ̂²
  INPUT: test collections with runs and an evaluation measure (topic-by-run matrices)
See Appendix
![Page 39: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/39.jpg)
TALK OUTLINE
1. How Information Retrieval (IR) test collections are constructed
2. Statistical reform
3. How test collections SHOULD be constructed
4. Experimental results
5. Conclusions and future work
![Page 40: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/40.jpg)
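Data for estimating σ̂²

| Task | Data | #topics | runs | pd | #docs |
|---|---|---|---|---|---|
| Adhoc news IR | TREC03new | 50 | 78 | 125 | 528,155 news articles |
| Adhoc news IR | TREC04new | 49 | 78 | 100 | 528,155 news articles |
| Adhoc web IR | TREC11w | 50 | 37 | 25 | One billion web pages |
| Adhoc web IR | TREC12w | 50 | 28 | 20/30 | One billion web pages |
| Diversified web IR | TREC11wD | 50 | 25 | 25 | One billion web pages |
| Diversified web IR | TREC12wD | 50 | 20 | 20/30 | One billion web pages |

We have a topic-by-run matrix for each data set and evaluation measure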
![Page 41: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/41.jpg)
Evaluation measures
Adhoc measures (AP, Q, nDCG, nERR): News (l=10, 1000), Web (l=10)
Diversity measures (α-nDCG, nERR-IA, D-nDCG, D#-nDCG): Web (l=10)
l: measurement depth
![Page 42: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/42.jpg)
![Page 43: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/43.jpg)
t-test vs ANOVA
When m=2,
• minD for ANOVA (range) reduces to minDt for t-test (difference).
• Results are similar, with ANOVA giving slightly larger estimates of n.
Henceforth we discuss ANOVA as it can also consider m>3 and we prefer to “err on the side of oversampling” [Ellis10].
![Page 44: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/44.jpg)
[Figure (a2): adhoc/news (l=10), (α, β, minD)=(0.05, 0.20, 0.05); required topic set size n (y-axis) against number of systems m (x-axis) for AP, Q, nDCG and nERR]
For comparing m=100 systems, Q/nDCG/AP/nERR require 2198/2382/2863/4063 topics
![Page 45: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/45.jpg)
[Figure (b): adhoc/web (l=10), (α, β, minD)=(0.05, 0.20, 0.05); required topic set size n (y-axis) against number of systems m (x-axis) for AP, Q, nDCG and nERR]
For comparing m=100 systems, Q/nDCG/AP/nERR require 1240/1291/2801/2921 topics
![Page 46: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/46.jpg)
[Figure (c): diversity/web (l=10), (α, β, minD)=(0.05, 0.20, 0.05); required topic set size n (y-axis) against number of systems m (x-axis) for α-nDCG, nERR-IA, D-nDCG and D#-nDCG]
For comparing m=100 systems, D-nDCG/D#-nDCG/α-nDCG/nERR-IA require 1201/1749/2662/2869 topics
![Page 47: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/47.jpg)
What if we reduce the pool depth pd?
[Figure: the pooling diagram again: n=50 topics, the top pd=100 documents from each run form the pool for Topic 1, and pooled documents are judged highly relevant, partially relevant or nonrelevant]
For adhoc/news l=1000 (pd=100) only
![Page 48: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/48.jpg)
when pd is reduced
As pd gets smaller,
• Average #judged/topic decreases (naturally)
• Variance increases (fewer data points hurts stability)
Re-estimate n for (α, β, minD, new σ̂²)
![Page 49: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/49.jpg)
[Figure: required topic set size n (y-axis) against average #judged documents per topic (x-axis) for AP, Q, nDCG and nERR at pd=100, 70, 50, 30 and 10; (α, β, minD)=(0.05, 0.20, 0.05), m=10]
TREC ad hoc pool depth: total cost for AP: 731 docs/topic * 652 topics = 476,612 docs
Alternative design with cost reduced to 18%: total cost for AP: 96 docs/topic * 879 topics = 84,384 docs
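The 18% figure follows directly from the two totals quoted for AP. A quick check, using only the numbers on the slide (the dictionary keys are labels chosen here for illustration):

```python
# (average judged docs per topic, required number of topics) for AP, from the slide.
designs = {
    "trec_adhoc_pool_depth": (731, 652),
    "alternative_design": (96, 879),
}

# Total assessment cost = judged docs per topic * number of topics.
costs = {name: docs * topics for name, (docs, topics) in designs.items()}
ratio = costs["alternative_design"] / costs["trec_adhoc_pool_depth"]

print(costs)
print(f"{ratio:.1%}")
```

The shallower design needs more topics, but each topic is so much cheaper to judge that the total cost drops to under a fifth.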
![Page 50: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/50.jpg)
TALK OUTLINE
1. How Information Retrieval (IR) test collections are constructed
2. Statistical reform
3. How test collections SHOULD be constructed
4. Experimental results
5. Conclusions and future work
![Page 51: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/51.jpg)
Takeaways (1)
• Using one-way ANOVA-based power analysis, researchers can determine the topic set size n by specifying:
  α: Type I error probability
  β: Type II error probability
  minD: minimum detectable range (performance difference between the best and worst systems) for ensuring a statistical power of 1-β
  m: number of systems to be compared
  σ̂²: estimated variance of each system
• Different measures have different σ̂²s, so researchers should decide on the evaluation measure at the test collection design phase
![Page 52: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/52.jpg)
Takeaways (2)
• Our method can provide different test collection designs (n, pd) that satisfy the same statistical requirement.
  n: topic set size
  pd: pool depth
• The assessment cost of a pd=100 test collection can be reduced to 18% or less while keeping it statistically equally reliable.
• Our method can be used to compare evaluation measures in terms of practical significance = judgment cost.
• Our tools and data are available at
  http://www.f.waseda.jp/tetsuya/tools.html
  http://www.f.waseda.jp/tetsuya/data.html
![Page 53: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/53.jpg)
Future work
• Investigating the relationship between our power‐based approach and a CI (confidence interval)‐based approach: DONE [Sakai14EVIA]
• Estimating n for various tasks (not just IR) – our methods are applicable to any paired‐data evaluation tasks
• Given a set of statistically equally reliable designs (n,pd), choose the best one based on reusability and assessment cost
Can we evaluate new systems fairly?
![Page 54: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/54.jpg)
References
[Cumming12] Cumming, G.: Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. Routledge, 2012.
[Ellis10] Ellis, P.D.: The Essential Guide to Effect Sizes. Cambridge, 2010.
[Nagata03] Nagata, Y.: How to Design the Sample Size. Asakura Shoten, 2003.
[Okubo12] Okubo, M. and Okada, K.: Psychological Statistics to Tell Your Story: Effect Size, Confidence Interval (in Japanese). Keiso Shobo, 2012.
[Sakai14PROMISE] Sakai, T.: Metrics, Statistics, Tests. PROMISE Winter School 2013: Bridging between Information Retrieval and Databases (LNCS 8173), pp.116-163, Springer, 2014.
[Sakai14SIGIRforum] Sakai, T.: Statistical Reform in Information Retrieval? SIGIR Forum, 48(1), 2014.
[Sakai14FIT] Sakai, T.: Designing Test Collections that Provide Tight Confidence Intervals. FIT 2014, RD-003, 2014.
[Webber08] Webber, W., Moffat, A. and Zobel, J.: Statistical Power in Retrieval Experimentation. ACM CIKM 2008, pp.571-580, 2008.
![Page 55: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/55.jpg)
Appendices
• The t-test approach
• Delta over [Webber08]
![Page 56: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/56.jpg)
The t‐test approach (1)
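Hypotheses
H0: μX = μY (systems X and Y are equally effective) vs. H1: μX ≠ μY
Assume xj ~ N(μX, σX²) and yj ~ N(μY, σY²)
⇒ dj = xj - yj ~ N(μX - μY, σt²), where σt² = σX² + σY²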
![Page 57: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/57.jpg)
The t‐test approach (2)
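Test statistic: t0 = d̄ / √(V/n), where d̄ is the sample mean and V the unbiased sample variance of the per-topic differences
Under H0, t0 ~ t distribution with φ=n-1 degrees of freedom.
The paired t-test rejects H0 if |t0| >= t(φ; α), where t(φ; α) is the two-sided critical t value.
[Figure: t distribution with the critical values -t(φ; α) and t(φ; α); area α/2 in each tail and 1-α in between, with t0 marked]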
![Page 58: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/58.jpg)
The t‐test approach (3)
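The probability of rejecting H0: Pr{ |t0| >= t(φ; α) }
Under H0, this is exactly α (rejecting H0 that is true)
[Figure: t distribution with the critical values -t(φ; α) and t(φ; α); area α/2 in each tail and 1-α in between]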
![Page 59: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/59.jpg)
The t-test approach (4)
The probability of rejecting H0: Pr{ |t0| >= t(φ; α) }
Under H1, this is exactly the power 1-β (rejecting H0 that is false)
Under H1, t0 ~ noncentral t distribution with φ=n-1 degrees of freedom and a noncentrality parameter λt = √n Δt, where Δt = (μX - μY) / σt is the effect size.
![Page 60: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/60.jpg)
The t-test approach (5)
The power (probability of rejecting H0 that is false): 1-β = Pr{ |t0| >= t(φ; α) }, where t0 ~ noncentral t distribution
For a random variable t’ that obeys a noncentral t distribution, Pr{ t’ <= w } can be approximated using a normal distribution [Nagata03] (Eqs. 4 and 5 in my paper).
Given α, n, effect size Δt, the power 1-β can be computed.
But what we want is: Given α, β, Δt, compute n!
![Page 61: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/61.jpg)
The t-test approach (6)
Given α, n, effect size Δt, the power 1-β can be computed.
But what we want is: Given α, β, Δt, compute n!
Under H0, we know Δt = 0. But under H1, we need to specify Δt to discuss power.
So let’s correctly reject H0 with 100(1-β)% confidence whenever |Δt| >= minΔt (minimum detectable effect).
Don’t miss a real difference if it’s minΔt or larger!
![Page 62: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/62.jpg)
The t‐test approach (7)
Given (α, β, minΔt), the required n can be approximated [Nagata03] (zp: one-sided critical z value).
Check if the above n actually satisfies the power requirement: 1-β <= Pr{ |t0| >= t(φ; α) }, where t0 ~ noncentral t distribution.
If not, n++ and try the above again.
![Page 63: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/63.jpg)
The t‐test approach (8)
In practice, instead of setting minΔt (in terms of effect size), set minDt (minimum detectable difference), |μX - μY| >= minDt, then convert it to the effect size minΔt = minDt / σ̂t.
Need a variance estimate σ̂t² from past data!
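The conversion and the initial estimate of n can be sketched as follows. The approximation used here, n ≈ ((z(α/2) + z(β)) / minΔt)² + z(α/2)²/2 with one-sided critical z values, is a commonly used normal approximation for the paired t-test; that it matches the [Nagata03] formula on the slides is an assumption, and the follow-up check against the noncentral t distribution is not implemented. The function name and example inputs are hypothetical.

```python
import math
from statistics import NormalDist


def topic_set_size_ttest(min_dt, var_t, alpha=0.05, beta=0.20):
    """Initial topic set size for a paired t-test (normal approximation, assumed form)."""
    # Convert the minimum detectable difference into an effect size.
    min_delta_t = min_dt / math.sqrt(var_t)

    # One-sided critical z values: upper alpha/2 and upper beta points.
    z = NormalDist().inv_cdf
    z_alpha, z_beta = z(1.0 - alpha / 2.0), z(1.0 - beta)

    n = ((z_alpha + z_beta) / min_delta_t) ** 2 + z_alpha ** 2 / 2.0
    return math.ceil(n)


# Hypothetical inputs: variance 1.0, so minDt equals the effect size.
print(topic_set_size_ttest(min_dt=0.5, var_t=1.0))
print(topic_set_size_ttest(min_dt=0.2, var_t=1.0))
```

Halving the detectable difference roughly quadruples n, which is why the variance estimate and the choice of minDt dominate the design.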
![Page 64: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/64.jpg)
Demo: determine n from (α, β, minDt, σ̂t²)
http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeTTEST.xlsx
![Page 65: Designing Test Collections for Comparing Many Systems](https://reader033.fdocuments.us/reader033/viewer/2022052322/557d60cfd8b42abf3d8b5139/html5/thumbnails/65.jpg)
Delta over [Webber08]
• They addressed the problem of building a test collection incrementally (add a topic, judge, re-estimate variance…). We ask the direct question: “How many topics do we need to create?”
• They considered the t-test only. We use both the t-test and ANOVA to handle m (>=2) systems.
• They used heuristics to estimate the variance. We use estimates from ANOVA.
• They considered AP and adhoc IR only. We consider a variety of graded-relevance measures and three different IR tasks.