Evaluation in (Music) Information Retrieval through the Audio Music Similarity task
Julián Urbano
Barcelona, Spain · January 16th 2014
Spam
• @julian_urbano
• Postdoctoral researcher
– Music Technology Group, Universitat Pompeu Fabra
• Recently: PhD, Computer Science
– (Evaluation in) (Music) Information Retrieval
2
Information Retrieval
• Automatic representation, storage and search of unstructured information
– Traditionally textual information
– Lately multimedia too: images, video, music
• A user has an information need and uses an IR system that retrieves the relevant or significant information from a collection of documents
3
Information Retrieval Evaluation
• IR systems are based on models to estimate relevance, implementing different techniques
• How good is my system? What system is better?
– Answered with IR Evaluation experiments
– “if you can’t measure it, you can’t improve it”
– But we need to be able to trust our measurements
• Research on IR Evaluation
– Improve our methods to evaluate systems
– Critical for the correct development of the field
4
Disclaimer
• If you see…
A system is evaluated with a test collection containing queries, documents and judgments
telling how relevant a document is to a query
• …you can think of
An algorithm is evaluated with a dataset containing queries, songs and annotations
telling how similar a song is to a query
5
Talk outline
• Why we want to Evaluate…
• …and what we do with Cranfield
• Validity: users versus systems
• Reliability: estimating from samples
• Efficiency: reducing annotations
6
Introduction: Why we want to Evaluate…
The two questions
• How good is my system?
– What does good mean?
– What is good enough?
• Is system A better than system B?
– What does better mean?
– How much better?
• Efficiency? Effectiveness? Ease?
8
Measure user experience
• We are interested in user-measures
– Time to complete task, idle time, success rate, failure rate, frustration, ease to learn, ease to use …
• Their distributions describe user experience, fully
– For an arbitrary user, query and document collection, what is the distribution of…
9
[Plots: example distributions of time to complete task and of frustration (none, some, much)]
The big(ger) picture
• Different user-measures attempting to assess the same thing: user satisfaction
– How likely is it that an arbitrary user, with an arbitrary query (and with an arbitrary document collection) will be satisfied by the system?
• This is the ultimate goal: the good, the better
10
The big(ger) question
• User satisfaction…as Bernoulli trial
• Probability of satisfaction P(Sat = yes)?
• Probability that k in n users are satisfied?
• Probability of >80% users satisfied?
11
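To make the Bernoulli/binomial framing above concrete, here is a minimal sketch in Python with SciPy; the probability 0.7 and the 10-of-15 case reappear in a worked example later in the talk, everything else is illustrative.

```python
# User satisfaction as a Bernoulli trial: the number of satisfied users among
# n independent users follows a Binomial(n, p) distribution.
from scipy.stats import binom

p = 0.7      # assumed P(Sat = yes) for the system
n = 15       # assumed number of users

print(binom.pmf(10, n, p))      # P(exactly 10 of the 15 users are satisfied), ≈ 0.21
print(binom.sf(0.8 * n, n, p))  # P(more than 80% of the users are satisfied)
```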
Introduction: …what we do with Cranfield
Sources of variability
user-measure = f(documents, query, user, system)
• Our goal is the distribution of the user-measure for our system, which is impossible to calculate
– (Possibly?) infinite population
• The best we can do is estimate it
– Sample documents, queries and users
– Measure user experience, implicitly or explicitly
– Representativeness, cost, ethics, privacy…
13
Fix samples
• Hard to replicate experiment and repeat results
• Just plain impossible to reproduce results
• Get a (hopefully) good sample and fix it
– Documents and queries
• But we can’t fix the users!
14
Simulate users…and fix them
• Cranfield paradigm: remove users, but include a user-abstraction, fixed across experiments
– Static user component: judgments in the ground truth
– Dynamic user component: effectiveness measures
• Remove all sources of variability, except systems
user-measure = f(documents, query, user, system)
user-measure = f(system)
15
Test collections
• Controlled set of documents, queries and judgments, shared across researchers
• (Most?) important resource for IR research
– Experiments are inexpensive (collections are not!)
– Research becomes systematic
– Reproducibility becomes possible and easy
16
Wait a minute
• Are we estimating distributions about users or distributions about systems?
system-effectiveness = f(system, scale, measure)
• We come up with different distributions of system-effectiveness, depending on how we abstract users from the experiment
– Different scales to assess relevance
– Different measures to model user behavior
17
Assumption
• System-measures correspond to user-measures
18
Users: time to complete task, idle time, success rate, failure rate, frustration, ease to learn, ease to use, satisfaction, …
Systems: P, AP, RR, DCG, nDCG, ERR, GAP, Q, …
Experiments with Test Collections
• Our goal is the users
user-measure = f(system)
• but Cranfield tells us about systems
system-effectiveness = f(system, scale, measure)
• This poses several problems
– That we have been dealing with for over 50 years
– But hey, they’re extremely interesting!
20
Validity, Reliability and Efficiency
• Validity: are we measuring what we want to?
– Internal: are observed effects due to hidden factors?
– External: are queries, documents and users representative?
– Construct: do system-measures match user-measures?
– Conclusion: how good is good and how better is better?
• Reliability: how repeatable are the results?
– How large do collections need to be?
– What statistical methods should be used?
• Efficiency: how inexpensive is it to get valid and reliable results? (i.e. to build a test collection)
– Can we estimate results with fewer judgments?
21
In this talk
How to study and improve the validity, reliability and efficiency
of the methods used to evaluate IR systems
• Audio Music Similarity task as example
– Song as query input to system, audio signal
– Retrieve songs musically similar to it, by content
– Resembles traditional Ad Hoc retrieval in Text IR
– Important task in Music IR
• Music recommendation
• Playlist generation
• Plagiarism detection
22
Validity: Effectiveness and Satisfaction
Assumption of Cranfield
• Systems with better effectiveness are perceived by users as more useful, more satisfactory
• Tricky: different effectiveness measures and relevance scales produce different distributions
– Which one is better to predict satisfaction?
• Map system effectiveness onto user satisfaction, experimentally
– If P@10 = 0.2, how likely is it that an arbitrary user will find the results satisfactory?
– What is P(Sat | P@10 = 0.2)?
24
User-oriented System-measures
• Effectiveness measures are generally not formulated to correlate with user-satisfaction
– If effectiveness is λ = 0, we expect P(Sat) = 0
– If effectiveness is λ = 1, we expect P(Sat) = 1
– In general, we expect P(Sat | λ) = λ
• But this is not what we have
– Effectiveness measures need to be reformulated
– Upper bounds, recall components, ideal rankings
– Many mathematical details omitted in this talk
25
User Components: Measures and Scales
• How is relevance measured in the judgments?
– Nominal, ordinal, interval, ratio
• How are results consumed?
– Set, list
• What determines document utility?
– Positional, cascade
– Linear, exponential
• What determines user persistence?
– Navigational, informational
26
Measures and Scales
Measure Original Artificial Graded Artificial Binary
Broad Fine nℒ=3 nℒ=4 nℒ=5 ℓmin=20 ℓmin=40 ℓmin=60 ℓmin=80
P@5 X X X X
AP@5 X X X X
RR@5 X X X X
CGl@5 X X X X X P@5 P@5 P@5 P@5
CGe@5 X X X X P@5 P@5 P@5 P@5
DCGl@5 X X X X X X X X X
DCGe@5 X X X X DCGl@5 DCGl@5 DCGl@5 DCGl@5
EDCGl@5 X X X X X X X X X
EDCGe@5 X X X X EDCGl@5 EDCGl@5 EDCGl@5 EDCGl@5
Ql@5 X X X X X AP@5 AP@5 AP@5 AP@5
Qe@5 X X X X AP@5 AP@5 AP@5 AP@5
RBPl@5 X X X X X X X X X
RBPe@5 X X X X RBPl@5 RBPl@5 RBPl@5 RBPl@5
ERRl@5 X X X X X X X X X
ERRe@5 X X X X ERRl@5 ERRl@5 ERRl@5 ERRl@5
GAP@5 X X X X X AP@5 AP@5 AP@5 AP@5
ADR@5 X X X X X X X X
27
MIREX
Experimental design
28
What can we infer?
• Preference: difference noticed by user
– Positive: user agrees with evaluation
– Negative: user disagrees with evaluation
• Non-preference: difference not noticed by user
– Good: both systems are satisfactory
– Bad: both systems are not satisfactory
29
Data
• Queries, documents and judgments from MIREX
• 4115 unique and artificial examples
– At least 200 examples per (measure-scale-λ)
• 432 unique queries, 5636 unique documents
• Answers collected via Crowdsourcing
– Quality control with trap questions
• 113 unique subjects
30
Single system: how good is it?
• For 2045 examples (49%) users could not decide which system was better
What do we expect?
31
Single system: how good is it?
• Large ℓmin thresholds underestimate satisfaction
32
Single system: how good is it?
• Users don’t pay attention to ranking?
33
Single system: how good is it?
• Exponential gain underestimates satisfaction
34
Single system: how good is it?
• Document utility independent of others
35
Two systems: which one is better?
• For 2090 examples (51%) users did prefer one system over the other one
What do we expect?
36
Two systems: which one is better?
• Large differences needed for users to note them
37
Two systems: which one is better?
• More relevance levels are better to discriminate
38
Two systems: which one is better?
• Cascade and navigational user models are not appropriate
39
Two systems: which one is better?
• Users do prefer the (supposedly) worse system
40
Summary
• Effectiveness and satisfaction are clearly correlated
– There is a 20% bias: P(Sat | 0) > 0 and P(Sat | 1) < 1
– Room to improve: personalization, better user abstraction
• Magnitude of differences does matter
– Just looking at rankings is very naive
• Be careful with statistical significance
– Need Δλ ≈ 0.4 for users to agree with effectiveness
• Historically, only 20% of times in MIREX
• Differences among measures and scales
– Linear gain slightly better than exponential gain
– Informational and positional user models better than navigational and cascade
– The more relevance levels, the better
41
Validity: Satisfaction over Samples
Evaluate in terms of user satisfaction
• So far, arbitrary users for a single query
– P(Sat | Ql@5 = 0.61) = 0.7
• Easily for n users and a single query
– P(Sat15 = 10 | Ql@5 = 0.61) = 0.21
• What about a sample of queries 𝒬?
– Map queries separately for the distribution of P(Sat)
– For easier mappings, P(Sat | λ) functions are interpolated with simple polynomials
45
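The per-query mapping step can be sketched as follows (NumPy; the calibration points and the per-query scores are assumed values, not the actual figures from the study):

```python
# Fit a simple polynomial to assumed (λ, P(Sat | λ)) points and use it to map
# per-query effectiveness scores onto satisfaction probabilities.
import numpy as np

lam   = np.array([0.0, 0.25, 0.5, 0.75, 1.0])   # assumed effectiveness levels
p_sat = np.array([0.2, 0.45, 0.7, 0.8, 0.85])   # assumed P(Sat | λ) at those levels

mapping = np.poly1d(np.polyfit(lam, p_sat, deg=2))   # simple polynomial interpolation

scores = np.array([0.61, 0.30, 0.90])   # assumed per-query Ql@5 scores
print(np.clip(mapping(scores), 0, 1))   # estimated P(Sat) for each query
```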
Expected probability of satisfaction
• Now we can compute point and interval estimates of the expected probability of satisfaction
• Intuition fails when interpreting effectiveness
46
System success
• If P(Sat) ≥ threshold the system is successful
– Setting the threshold was rather arbitrary before
– Now it is meaningful, in terms of user satisfaction
• Intuitively, we want the majority of users to find the system satisfactory
– P(Succ) = P(P(Sat) > 0.5) = 1 − FP(Sat)(0.5)
• Improving queries for which we are bad is more worthwhile than further improving those for which we are already good
47
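A minimal sketch of the success criterion above (the per-query P(Sat) values are assumed):

```python
# P(Succ) = P(P(Sat) > 0.5): the fraction of queries for which a majority of
# users is expected to be satisfied, i.e. 1 - ecdf(0.5) over per-query P(Sat).
import numpy as np

p_sat_per_query = np.array([0.82, 0.40, 0.67, 0.55, 0.31, 0.73])  # assumed values
print(np.mean(p_sat_per_query > 0.5))
```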
Distribution of P(Sat)
• But we (will) only have a handful of queries, so estimates will probably be bad
– Need to estimate the cumulative distribution function of user satisfaction: FP(Sat)
– Not described by any typical distribution family
• More than ≈25 queries in the collection
– ecdf approximates better
• Fewer than ≈25 queries in the collection
– Normal for graded scales, ecdf for binary scales
• Beta is always the best with the Fine scale
– Which turns out to be the best scale, overall
48
Intuition fails, again
Intuitive conclusions based on effectiveness alone contradict those based on user satisfaction
– E[Δλ] = −0.002
– E[ΔP(Sat)] = 0.001
– E[ΔP(Succ)] = 0.07
49
Historically, in MIREX
• Systems are not as satisfactory as we thought
• But they are more successful
– Good (or bad) for some kinds of queries
50
Measures and scales
Measure Original Artificial Graded Artificial Binary
Broad Fine nℒ=4 nℒ=5 ℓmin=20 ℓmin=40
P@5 X X
AP@5 X X
CGl@5 X X X X P@5 P@5
CGe@5 X X X P@5 P@5
DCGl@5 X X X X X X
DCGe@5 X X X DCGl@5 DCGl@5
Ql@5 X X X X AP@5 AP@5
Qe@5 X X X AP@5 AP@5
RBPl@5 X X X X X X
RBPe@5 X X X RBPl@5 RBPl@5
GAP@5 X X X X AP@5 AP@5
51
Reliability: Optimal Statistical Significance Tests
Random error
• Test collections are just samples from larger, possibly infinite, populations
• If we conclude system A is better than B, how confident can we be?
– Δλ𝒬 is just an estimate of the population mean μΔλ
• Usually employ some statistical significance test for differences in location
• If it is statistically significant, we have confidence that the true difference is at least that large
54
Statistical hypothesis testing
• Set two mutually exclusive hypotheses
– H0: μΔλ = 0
– H1: μΔλ ≠ 0
• Run test, obtain p-value = P(Δλ ≥ Δλ𝒬 | H0)
– p ≤ α: statistically significant, high confidence
– p > α: statistically non-significant, low confidence
• Possible errors in the binary decision
– Type I: incorrectly reject H0
– Type II: incorrectly accept H0
55
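As a minimal sketch of running such tests on paired per-query scores (SciPy; the scores are simulated, and the permutation test shown is the usual sign-flip randomization variant):

```python
# Paired significance tests on per-query differences between two systems.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.beta(5, 3, size=50)                              # assumed scores of system A
b = np.clip(a + rng.normal(0.02, 0.08, size=50), 0, 1)   # assumed scores of system B
d = b - a

print(stats.ttest_rel(b, a).pvalue)   # paired t-test
print(stats.wilcoxon(b, a).pvalue)    # Wilcoxon signed-rank test

# sign-flip permutation (randomization) test on the mean difference
flips = rng.choice([-1, 1], size=(10000, d.size))
null = (flips * d).mean(axis=1)
print(np.mean(np.abs(null) >= abs(d.mean())))
```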
Statistical significance tests
• (Non-)parametric tests
– t-test, Wilcoxon test, Sign test
• Based on resampling
– Bootstrap test, permutation/randomization test
• They make certain assumptions about distributions and sampling methods
– Often violated in IR evaluation experiments
– Which test behaves better, in practice, knowing that assumptions are violated?
56
Optimality criteria
• Power
– Achieve significance as often as possible (low Type II)
– Usually increases Type I error rates
• Safety
– Minimize Type I error rates
– Usually decreases power
• Exactness
– Maintain Type I error rate at α level
– Permutation test is theoretically exact
57
Experimental design
• Randomly split query set in two
• Evaluate all systems with both subsets
– Simulating two different test collections
• Compare p-values with both subsets
– How well do statistical tests agree with themselves?
– At different α levels
• All systems and queries from MIREX 2007-2011
– >15M p-values
58
Power and success
• Bootstrap test is the most powerful
• Wilcoxon, bootstrap and permutation are the most successful, depending on α level
59
Conflicts
• Wilcoxon and t-test are the safest at low α levels
• Wilcoxon is the most exact at low α levels, but bootstrap is for usual levels
60
Summary
• Bootstrap test is the most powerful, and still it has smaller Type I error rates, so we are safe
• Power and success:
– CGl@5 > GAP@5 > DCGl@5 > RBPl@5
– Fine > Broad > binary
• Conflicts:
– Very similar across measures and scales
– Corrections for multiple comparisons (e.g. Tukey) do not seem necessary
61
Reliability: Test Collection Size
Acceptable sample size
• Reliability is higher with larger sample sizes
– But it is also more expensive
– What is an acceptable test collection size?
• Answer with Generalizability Theory
– G-Study: estimate variance components
– D-Study: estimate reliability of different sample sizes and experimental designs
63
G-study: variance components
Fully crossed experimental design: s × q
λq,A = λ + λA + λq + εqA
σ² = σs² + σq² + σsq²
• Estimated with Analysis of Variance
64
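A minimal sketch of the G-study estimation from a systems × queries score matrix, using the standard ANOVA expected-mean-squares estimators (the score matrix is simulated; the GT4IReval R package mentioned later implements this properly):

```python
# Variance components of a fully crossed s x q design with one observation per cell.
import numpy as np

rng = np.random.default_rng(1)
n_s, n_q = 10, 50                              # assumed numbers of systems and queries
scores = rng.beta(4, 3, size=(n_s, n_q))       # assumed effectiveness matrix (system x query)

grand = scores.mean()
ms_s = n_q * np.sum((scores.mean(axis=1) - grand) ** 2) / (n_s - 1)   # systems mean square
ms_q = n_s * np.sum((scores.mean(axis=0) - grand) ** 2) / (n_q - 1)   # queries mean square
resid = scores - scores.mean(axis=1, keepdims=True) \
               - scores.mean(axis=0, keepdims=True) + grand
ms_e = np.sum(resid ** 2) / ((n_s - 1) * (n_q - 1))                   # interaction/residual

var_e = ms_e                     # σ²sq (confounded with error)
var_s = (ms_s - ms_e) / n_q      # σ²s
var_q = (ms_q - ms_e) / n_s      # σ²q
print(var_s, var_q, var_e)
```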
Intuition
• If σs² is small or σq² is large, we need more queries
[Figure: scores λA…λF of six systems, under a larger σs² and under a smaller σq² or more queries]
65
D-study: variance ratios
• Stability of absolute scores
Φ(nq) = σs² / (σs² + (σq² + σe²)/nq)
• Stability of relative scores
Eρ²(nq) = σs² / (σs² + σe²/nq)
• We can easily estimate how many queries are needed to reach some level of stability (reliability)
66
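A minimal sketch of the D-study ratios and of finding the query set size that reaches a target stability (the variance components are assumed, e.g. taken from a G-study like the one above):

```python
# D-study: stability of absolute and relative scores as a function of n_q.
def phi(var_s, var_q, var_e, n_q):
    """Stability of absolute scores."""
    return var_s / (var_s + (var_q + var_e) / n_q)

def erho2(var_s, var_e, n_q):
    """Stability of relative scores."""
    return var_s / (var_s + var_e / n_q)

var_s, var_q, var_e = 0.004, 0.02, 0.015   # assumed variance components

n_q = 1
while erho2(var_s, var_e, n_q) < 0.95:     # smallest query set reaching the target
    n_q += 1
print(n_q, phi(var_s, var_q, var_e, n_q), erho2(var_s, var_e, n_q))
```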
Effect of query set size
• Average absolute stability Φ = 0.97
• ≈65 queries needed for Φ = 0.95, ≈100 in worst cases
• Fine scale slightly better than Broad and binary scales
• RBPl@5 and nDCGl@5 are the most stable
67
Effect of query set size
• Average relative stability Eρ² = 0.98
• ≈35 queries needed for Eρ² = 0.95, ≈60 in worst cases
• Fine scale better than Broad and binary scales
• CGl@5 and RBPl@5 are the most stable
68
Effect of cutoff k
• What if we use a deeper cutoff, k=10?
– From 100 queries and k=5 to 50 queries and k=10
– Should still have stable scores
– Judging effort should decrease
– Rank-based measures should become more stable
• Tested in MIREX 2012
– In 2013 too, but not analyzed here
69
Effect of cutoff k
• Judging effort reduced to 72% of the usual
• Generally stable
– From Φ = 0.81 to Φ = 0.83
– From Eρ² = 0.93 to Eρ² = 0.95
70
Effect of cutoff k
• Reliability given a fixed budget for judging?
– k=10 allows us to use fewer queries, about 70%
– Slightly reduced relative stability
71
Effect of assessor set size
• More assessors or simply more queries?
– Judging effort is multiplied
• Can be studied with MIREX 2006 data
– 3 different assessors per query
– Nested experimental design: s × (h:q)
72
Effect of assessor set size
• Broad scale: σs² ≈ σh:q²
• Fine scale: σs² ≫ σh:q²
• Always better to spend resources on queries
73
Summary
• MIREX collections generally larger than necessary
• For fixed budget
– More queries better than more assessors
– More queries slightly better than deeper cutoff
• Worth studying alternative user model?
• Employ G-Theory while building the collection
• Fine better than Broad, better than binary
• CGl@5 and DCGl@5 best for relative stability
• RBPl@5 and nDCGl@5 best for absolute stability
74
Implications
• Fixing the number of queries across years is unrealistic
– Especially because they are not intended for reuse
• Fixing the number of queries across tasks is simply nonsense
• Need to analyze on a case-by-case basis, while building the collections
– GT4IReval, R package online
– https://github.com/julian-urbano/GT4IREval
75
Efficiency: Learning Relevance Distributions
Probabilistic evaluation
• The MIREX setting is still expensive
– Need to judge all top k documents from all systems
– Takes days, even weeks sometimes
• Model relevance probabilistically
– Relevance judgments are random variables over the space of possible assignments of relevance
E[Rd] = Σℓ∈ℒ P(Rd = ℓ) · ℓ
Var[Rd] = Σℓ∈ℒ P(Rd = ℓ) · ℓ² − E[Rd]²
• Effectiveness measures are also probabilistic
77
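A minimal sketch of these per-document moments (the distribution over relevance levels is assumed):

```python
# Expected relevance and variance of one document under a categorical distribution.
import numpy as np

grades = np.array([0, 1, 2])       # assumed relevance levels ℒ
p = np.array([0.2, 0.5, 0.3])      # assumed P(Rd = ℓ) for this document

e_r = np.sum(p * grades)                   # E[Rd]
var_r = np.sum(p * grades**2) - e_r**2     # Var[Rd]
print(e_r, var_r)
```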
Probabilistic evaluation
• Accuracy increases as we make judgments
– E[Rd] ← rd
• Reliability increases too (confidence)
– Var[Rd] ← 0
• Iteratively estimate relevance and effectiveness
– If confidence is low, make judgments
– If confidence is high, stop
• Judge as few documents as possible
78
Learning distributions of relevance
• Uniform distribution is very uninformative
• Historical distribution in MIREX has high variance
• Estimate from a set of features: P(Rd = ℓ | θd)
– For each document separately
– Ordinal Logistic Regression
• Three sets of features
– Output-based, can always be used
– Judgment-based, to exploit known judgments
– Audio-based, to exploit musical similarity
79
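One possible way to learn such distributions is sketched below with ordinal logistic regression via statsmodels' OrderedModel; the two features and the synthetic grades are purely illustrative, not the actual feature sets used in the talk:

```python
# Ordinal logistic regression estimating P(Rd = ℓ | θd) from document features.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(2)
n = 500
X = pd.DataFrame({
    "output_similarity": rng.uniform(0, 1, n),   # assumed output-based feature
    "same_genre": rng.integers(0, 2, n),         # assumed metadata feature
})
# synthetic ordinal grades (0, 1, 2), loosely related to the features
latent = 2 * X["output_similarity"] + X["same_genre"] + rng.normal(0, 0.5, n)
y = np.digitize(latent, bins=[1.0, 2.0])

res = OrderedModel(y, X, distr="logit").fit(method="bfgs", disp=False)
probs = res.model.predict(res.params, exog=X.iloc[:5])   # P(Rd = ℓ) for 5 documents
print(probs)
```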
Learned models
• Mout : can be used even without judgments
– Similarity between systems’ outputs
– Genre and artist metadata
• Genre is highly correlated to similarity
– Decent fit, R² ≈ 0.35
• Mjud : can be used when there are judgments
– Similarity between systems’ outputs
– Known relevance of same system and same artist
• Artist is extremely correlated to similarity
– Excellent fit, R² ≈ 0.91
80
Estimation errors
• Actual vs. predicted by Mout
– 0.36 with Broad and 0.34 with Fine
• Actual vs. predicted by Mjud
– 0.14 with Broad and 0.09 with Fine
• Among assessors in MIREX 2006
– 0.39 with Broad and 0.31 with Fine
• Negligible under the current MIREX setting
81
Efficiency: Probabilistic Evaluation
Probabilistic effectiveness measures
• Effectiveness scores become random variables too
• Example: DCGl@k
– (Usual) deterministic formulation:
DCGl@k = ( Σi=1..k ri / log2(i+1) ) / ( Σi=1..k (nℒ − 1) / log2(i+1) )
– (New) probabilistic formulation, with ηDCGl the normalizing denominator above:
E[DCGl@k] = (1/ηDCGl) · Σi=1..k E[Ri] / log2(i+1)
Var[DCGl@k] = (1/ηDCGl²) · Σi=1..k Var[Ri] / (log2(i+1))²
83
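A minimal sketch of the probabilistic DCGl@5 above (the per-rank relevance distributions are assumed):

```python
# Expected value and variance of DCGl@k when each Ri is a random variable.
import numpy as np

grades = np.array([0, 1, 2])                  # assumed relevance levels, nℒ = 3
# assumed P(Ri = ℓ) for the top k = 5 documents of one system (rows sum to 1)
P = np.array([[0.1, 0.3, 0.6],
              [0.2, 0.5, 0.3],
              [0.4, 0.4, 0.2],
              [0.5, 0.3, 0.2],
              [0.7, 0.2, 0.1]])

k = P.shape[0]
disc = np.log2(np.arange(1, k + 1) + 1)       # log2(i + 1)
eta = np.sum(grades.max() / disc)             # ηDCGl = Σ (nℒ - 1) / log2(i + 1)

e_r = P @ grades                              # E[Ri]
var_r = P @ grades**2 - e_r**2                # Var[Ri]

print(np.sum(e_r / disc) / eta)               # E[DCGl@5]
print(np.sum(var_r / disc**2) / eta**2)       # Var[DCGl@5]
```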
Probabilistic effectiveness measures
• From there we can compute Δ𝐷𝐶𝐺𝑙@𝑘AB
• And averages over a sample of queries 𝒬
• Different approaches to compute estimates
– Deal with dependence of random variables
– Different definitions of confidence
• For measures based on ideal ranking (nDCGl@k and RBPl@k) we do not have a closed form
– Approximated with Delta method and Taylor series
84
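For such ratio-type measures, a standard first-order Delta-method (Taylor) approximation of the moments of a ratio of two correlated random variables looks as follows; this is the generic textbook approximation, given as a sketch rather than the talk's exact derivation:

```latex
\mathrm{E}\!\left[\frac{X}{Y}\right] \approx \frac{\mu_X}{\mu_Y},
\qquad
\mathrm{Var}\!\left[\frac{X}{Y}\right] \approx
\frac{\mu_X^2}{\mu_Y^2}\left(
\frac{\mathrm{Var}[X]}{\mu_X^2} + \frac{\mathrm{Var}[Y]}{\mu_Y^2}
- \frac{2\,\mathrm{Cov}[X,Y]}{\mu_X \mu_Y}\right)
```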
Ranking without judgments
1. Estimate relevance with Mout
2. Estimate relative differences and rank systems
• Average confidence in the rankings is 94%
• Average accuracy of the ranking is 92%
85
Ranking without judgments
• Can we trust individual estimates?
– Ideally, we want X% accuracy when X% confidence
– Confidence slightly overestimated in [0.9, 0.99)
86
DCGl@5
Confidence | Broad: in bin | Broad: accuracy | Fine: in bin | Fine: accuracy
[0.5, 0.6) | 23 (6.5%) | 0.826 | 22 (6.2%) | 0.636
[0.6, 0.7) | 14 (4%) | 0.786 | 16 (4.5%) | 0.812
[0.7, 0.8) | 14 (4%) | 0.571 | 11 (3.1%) | 0.364
[0.8, 0.9) | 22 (6.2%) | 0.864 | 21 (6%) | 0.762
[0.9, 0.95) | 23 (6.5%) | 0.87 | 19 (5.4%) | 0.895
[0.95, 0.99) | 24 (6.8%) | 0.917 | 27 (7.7%) | 0.926
[0.99, 1) | 232 (65.9%) | 0.996 | 236 (67%) | 0.996
E[Accuracy] | | 0.938 | | 0.921
Relative estimates with judgments
1. Estimate relevance with Mout
2. Estimate relative differences and rank systems
3. While confidence is low (<95%)
1. Select a document and judge it
2. Update relevance estimates with Mjud when possible
3. Update estimates of differences and rank systems
• What documents should we judge?
– Those that are the most informative
– Measure-dependent
87
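A minimal, self-contained sketch of this loop for a toy P@5 comparison between two systems; all probabilities and "true" judgments are simulated, and the confidence criterion is a simple normal approximation rather than necessarily the one used in the talk:

```python
# Iterative probabilistic evaluation: estimate, judge the most informative
# document, update, and stop once the sign of the difference is confident enough.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
k = 5
p = {"A": rng.uniform(0.3, 0.9, k), "B": rng.uniform(0.3, 0.9, k)}  # assumed P(relevant)
truth = {s: rng.binomial(1, p[s]) for s in p}                       # hidden true judgments
judged = {s: np.full(k, False) for s in p}

def moments(s):
    e = np.where(judged[s], truth[s], p[s])           # E[Ri]: judged docs become certain
    v = np.where(judged[s], 0.0, p[s] * (1 - p[s]))   # Var[Ri]: judged docs have none
    return e.mean(), v.sum() / k**2                   # E[P@5], Var[P@5]

while True:
    (ea, va), (eb, vb) = moments("A"), moments("B")
    conf = norm.cdf(abs(ea - eb) / np.sqrt(va + vb + 1e-12))  # P(estimated sign is correct)
    if conf >= 0.95 or all(judged[s].all() for s in judged):
        break
    # judge the single most informative (here: highest-variance) unjudged document
    s, i = max(((s, i) for s in p for i in range(k) if not judged[s][i]),
               key=lambda si: p[si[0]][si[1]] * (1 - p[si[0]][si[1]]))
    judged[s][i] = True

print(ea - eb, conf, sum(j.sum() for j in judged.values()), "judgments made")
```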
Relative estimates with judgments
• Judging effort dramatically reduced
– 1.3% with CGl@5, 9.7% with RBPl@5
• Average accuracy still 92%, but improved individually
– 74% of estimates with >99% confidence, 99.9% accurate
– Expected accuracy improves slightly from 0.927 to 0.931
88
Absolute estimates with judgments
1. Estimate relevance with Mout
2. Estimate absolute effectiveness scores
3. While confidence is low (expected error >±0.05)
1. Select a document and judge it
2. Update relevance estimates with Mjud when possible
3. Update estimates of absolute effectiveness scores
• What documents should we judge?
– Those that reduce variance the most
– Measure-dependent
89
Absolute estimates with judgments
• The stopping condition is overly confident
– Virtually no judgments are even needed (supposedly)
• But effectiveness is highly overestimated
– Especially with nDCGl@5 and RBPl@5
– Mjud, and especially Mout, tend to overestimate relevance
90
Absolute estimates with judgments
• Practical fix: correct variance
• Estimates are better, but at the cost of judging
– Need between 15% and 35% of judgments
91
Summary
• Estimate ranking of systems with no judgments
– 92% accuracy on average, trustworthy individually
– Statistically significant differences are always correct
• If we want more confidence, judge documents
– As few as 2% needed to reach 95% confidence
– 74% of estimates have >99% confidence and accuracy
• Estimate absolute scores, judging as necessary
– Around 25% needed to ensure error <0.05
92
Implications
• We do not need dozens of volunteers to make thousands of judgments over several days
• Just one person spending a couple hours is fine
• The spare manpower can be put to better use
– Redundant judgments to have better estimates
– Make annotations for other tasks
• It naturally promotes collaborative creation of test collections by iteratively adding the judgments needed in each experiment (if any)
93
Future Work
Validity
• User studies to understand user behavior
• What information to include in test collections
• Other forms of relevance judgment to better capture document utility
• Explicitly define judging guidelines
• Similar mapping for Text IR
– Different user models within the same task
95
Reliability
• Corrections for Multiple Comparisons
• Methods to reliably estimate reliability while building test collections
96
Efficiency
• Better models to estimate document relevance
• Correct variance when having just a few relevance judgments available
• Estimate relevance beyond k=5
• Other stopping conditions and document weights
97
Conduct similar studies
for the wealth of tasks in
Music Information Retrieval