Evaluation in Audio Music Similarity
PhD dissertation
by
Julián Urbano
Leganés, October 3rd, 2013
Picture by Javier García
Outline
• Introduction
• Validity
• Reliability
• Efficiency
• Conclusions and Future Work
Outline
• Introduction
– Scope
– The Cranfield Paradigm
• Validity
• Reliability
• Efficiency
• Conclusions and Future Work
Information Retrieval
• Automatic representation, storage and search of unstructured information
– Traditionally textual information
– Lately multimedia too: images, video, music
• A user has an information need and uses an IR system that retrieves the relevant or significant information from a collection of documents
Information Retrieval Evaluation
• IR systems are based on models to estimate relevance, implementing different techniques
• How good is my system? What system is better?
• Answered with IR Evaluation experiments
– “if you can’t measure it, you can’t improve it”
– But we need to be able to trust our measurements
• Research on IR Evaluation
– Improve our methods to evaluate systems
– Critical for the correct development of the field
History of IR Evaluation research
[Timeline figure, 1960 to 2010: Cranfield 2, MEDLARS, SMART, SIGIR, TREC, CLEF, NTCIR, INEX, ISMIR, MIREX, MusiCLEF, MSD Challenge]
Audio Music Similarity
• The input to the system is a song, as an audio signal
• The system retrieves songs musically similar to it, by content
• Resembles traditional Ad Hoc retrieval in Text IR
• Arguably the most important task in Music IR
– Music recommendation
– Playlist generation
– Plagiarism detection
• Annual evaluation in MIREX
Outline
• Introduction
– Scope
– The Cranfield Paradigm
• Validity
• Reliability
• Efficiency
• Conclusions and Future Work
The two questions
• How good is my system?
– What does good mean?
– What is good enough?
• Is system A better than system B?
– What does better mean?
– How much better?
• Efficiency? Effectiveness? Ease?
Measure user experience
• We are interested in user-measures
– Time to complete task, idle time, success rate, failure rate, frustration, ease to learn, ease to use …
– Their distributions fully describe the user experience
• User satisfaction is the bigger picture
– How likely is it that an arbitrary user, with an arbitrary query (and with an arbitrary document collection) will be satisfied by the system?
• This is the ultimate goal: the good, the better
The Cranfield Paradigm
• Estimate user-measure distributions
– Sample documents, queries and users
– Monitor user experience and behavior
– Representativeness, cost, ethics, privacy …
• Fix samples to allow reproducibility
– But cannot fix users and their behavior
– Remove users, but include a static user component, fixed across experiments: ground truth judgments
– Still need to include the dynamics of the process: user models behind effectiveness measures and scales
Test collections
• Our goal is the users: user-measure = f(system)
• Cranfield measures systems: system-effectiveness = f(system, measure, scale)
• Estimators of the distributions of user-measures
– The only source of variability is the systems themselves
– Reproducibility becomes easy
– Experiments are inexpensive (collections are not)
– Research becomes systematic
Validity, Reliability and Efficiency
• Validity: are we measuring what we want to?
– How well are effectiveness and satisfaction correlated?
– How good is good and how better is better?
• Reliability: how repeatable are the results?
– How large do samples have to be?
– What statistical methods should be used?
• Efficiency: how inexpensive is it to get valid and reliable results?
– Can we estimate results with fewer judgments?
Goal of this dissertation
Study and improve the validity, reliability and efficiency
of the methods used to evaluate AMS systems
Additionally, improve meta-evaluation methods
Outline
• Introduction
– Scope
– The Cranfield Paradigm
• Validity
• Reliability
• Efficiency
• Conclusions and Future Work
Outline
• Introduction
• Validity
– System Effectiveness and User Satisfaction
– Modeling Distributions
• Reliability
• Efficiency
• Conclusions and Future Work
Assumption of Cranfield
• Systems with better effectiveness are perceived by users as more useful, more satisfactory
• But different effectiveness measures and relevance scales produce different distributions
– Which one is better to predict user satisfaction?
• Map system effectiveness onto user satisfaction, experimentally
– If P@10 = 0.2, how likely is it that an arbitrary user will find the results satisfactory?
– What if DCG@20 = 0.46?
Measures and scales

Broad and Fine are the original MIREX scales; the nℒ columns are artificial graded scales; the ℓmin columns are artificial binary scales. A cell naming another measure indicates that the combination is equivalent to that measure (e.g., CGl@5 on a binary scale equals P@5).

| Measure | Broad | Fine | nℒ=3 | nℒ=4 | nℒ=5 | ℓmin=20 | ℓmin=40 | ℓmin=60 | ℓmin=80 |
|---|---|---|---|---|---|---|---|---|---|
| P@5 | | | | | | X | X | X | X |
| AP@5 | | | | | | X | X | X | X |
| RR@5 | | | | | | X | X | X | X |
| CGl@5 | X | X | X | X | X | P@5 | P@5 | P@5 | P@5 |
| CGe@5 | X | | X | X | X | P@5 | P@5 | P@5 | P@5 |
| DCGl@5 | X | X | X | X | X | X | X | X | X |
| DCGe@5 | X | | X | X | X | DCGl@5 | DCGl@5 | DCGl@5 | DCGl@5 |
| EDCGl@5 | X | X | X | X | X | X | X | X | X |
| EDCGe@5 | X | | X | X | X | EDCGl@5 | EDCGl@5 | EDCGl@5 | EDCGl@5 |
| Ql@5 | X | X | X | X | X | AP@5 | AP@5 | AP@5 | AP@5 |
| Qe@5 | X | | X | X | X | AP@5 | AP@5 | AP@5 | AP@5 |
| RBPl@5 | X | X | X | X | X | X | X | X | X |
| RBPe@5 | X | | X | X | X | RBPl@5 | RBPl@5 | RBPl@5 | RBPl@5 |
| ERRl@5 | X | X | X | X | X | X | X | X | X |
| ERRe@5 | X | | X | X | X | ERRl@5 | ERRl@5 | ERRl@5 | ERRl@5 |
| GAP@5 | X | X | X | X | X | AP@5 | AP@5 | AP@5 | AP@5 |
| ADR@5 | X | X | X | X | X | X | X | X | |
Experimental design
[Figure: experimental design]
What can we infer?
• Preference: difference noticed by user
– Positive: user agrees with evaluation
– Negative: user disagrees with evaluation
• Non-preference: difference not noticed by user
– Good: both systems are satisfactory
– Bad: both systems are not satisfactory
Data
• Queries, documents and judgments from MIREX
• 4115 unique, artificially constructed examples
• 432 unique queries, 5636 unique documents
• Answers collected via crowdsourcing
– Quality control with trap questions
• 113 unique subjects
Single system: how good is it?
• For 2045 examples (49%) users could not decide which system was better. What do we expect?
• Large ℓmin thresholds underestimate satisfaction
• Users don't pay attention to ranking?
• Exponential gain underestimates satisfaction
• Document utility is independent of other documents
Two systems: which one is better?
• For 2090 examples (51%) users did prefer one system over the other. What do we expect?
• Large differences are needed for users to notice them
• More relevance levels discriminate better
• Cascade and navigational user models are not appropriate
• Users sometimes prefer the (supposedly) worse system
Summary
• Effectiveness and satisfaction are clearly correlated
– But there is a bias of 20% because of user disagreement
– Room for improvement through personalization
• Magnitude of differences does matter
– Just looking at rankings is very naive
• Be careful with statistical significance
– Need Δλ ≈ 0.4 for users to agree with effectiveness
– Historically, only 20% of the time in MIREX
• Differences among measures and scales
– Linear gain slightly better than exponential gain
– Informational and positional user models better than navigational and cascade
– The more relevance levels, the better
Outline
• Introduction
• Validity
– System Effectiveness and User Satisfaction
– Modeling Distributions
• Reliability
• Efficiency
• Conclusions and Future Work
Evaluate in terms of user satisfaction
• So far, arbitrary users for a single query
– P(Sat | Ql@5 = 0.61) = 0.7
• Easily extended to n users and a single query
– P(Sat15 = 10 | Ql@5 = 0.61) = 0.21
• What about a sample of queries 𝒬?
– Map queries separately to obtain the distribution of P(Sat)
– For easier mappings, the P(Sat | λ) functions are interpolated with simple polynomials
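To make the n-user case above concrete, a minimal sketch (assuming the slide's numbers: a single arbitrary user is satisfied with probability 0.7 at Ql@5 = 0.61, and users are independent), computing it as a binomial probability:

```python
from scipy.stats import binom

p_sat = 0.7  # P(Sat | Ql@5 = 0.61) for one arbitrary user, from the mapping

# Probability that exactly 10 out of 15 independent users are satisfied:
# P(Sat15 = 10) = C(15, 10) * 0.7^10 * 0.3^5
print(binom.pmf(10, n=15, p=p_sat))  # ~0.206, i.e. the 0.21 on the slide
```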
Expected probability of satisfaction
• Now we can compute point and interval estimates of the expected probability of satisfaction
• Intuition fails when interpreting effectiveness
System success
• If P(Sat) ≥ threshold the system is successful
– Setting the threshold was rather arbitrary
– Now it is meaningful, in terms of user satisfaction
• Intuitively, we want the majority of users to find the system satisfactory
– P(Succ) = P(P(Sat) > 0.5) = 1 − F_P(Sat)(0.5)
• Improving queries for which we do badly is more worthwhile than further improving those for which we already do well
Distribution of P(Sat)
• Need to estimate the cumulative distribution function of user satisfaction: FP(Sat)
• Not described by a typical distribution family
– ecdf converges, but what is a good sample size?
– Compare with Normal, Truncated Normal and Beta
• Compared on >2M random samples from MIREX collections, at different query set sizes
• Goodness of fit measured with the Cramér-von Mises ω² statistic
Estimated distribution of P(Sat)
• More than ≈25 queries in the collection
– ecdf approximates better
• Less than ≈25 queries in the collection
– Normal for graded scales, ecdf for binary scales
• Beta is always the best with the Fine scale
• The more levels in the relevance scale, the better
• Linear gain better than exponential gain
Intuition fails, again
• Intuitive conclusions based on effectiveness alone contradict those based on user satisfaction
– E[Δλ] = −0.002
– E[ΔP(Sat)] = 0.001
– E[ΔP(Succ)] = 0.07
Historically, in MIREX
• Systems are not as satisfactory as we thought
• But they are more successful
– Good (or bad) for some kinds of queries
Measures and scales

Reduced set of combinations (same conventions as the earlier table):

| Measure | Broad | Fine | nℒ=4 | nℒ=5 | ℓmin=20 | ℓmin=40 |
|---|---|---|---|---|---|---|
| P@5 | | | | | X | X |
| AP@5 | | | | | X | X |
| CGl@5 | X | X | X | X | P@5 | P@5 |
| CGe@5 | X | | X | X | P@5 | P@5 |
| DCGl@5 | X | X | X | X | X | X |
| DCGe@5 | X | | X | X | DCGl@5 | DCGl@5 |
| Ql@5 | X | X | X | X | AP@5 | AP@5 |
| Qe@5 | X | | X | X | AP@5 | AP@5 |
| RBPl@5 | X | X | X | X | X | X |
| RBPe@5 | X | | X | X | RBPl@5 | RBPl@5 |
| GAP@5 | X | X | X | X | AP@5 | AP@5 |
Outline
• Introduction
• Validity
– System Effectiveness and User Satisfaction
– Modeling Distributions
• Reliability
• Efficiency
• Conclusions and Future Work
Outline
• Introduction
• Validity
• Reliability
– Optimality of Statistical Significance Tests
– Test Collection Size
• Efficiency
• Conclusions and Future Work
Random error
• Test collections are just samples from larger, possibly infinite, populations
• If we conclude system A is better than B, how confident can we be?
– Δλ𝒬 is just an estimate of the population mean μΔλ
• Usually employ some statistical significance test for differences in location
• If it is statistically significant, we have confidence that the true difference is at least that large
Statistical hypothesis testing
• Set two mutually exclusive hypotheses
– H0: μΔλ = 0
– H1: μΔλ ≠ 0
• Run the test and obtain a p-value = P(Δλ ≥ Δλ𝒬 | H0)
– p ≤ α: statistically significant, high confidence
– p > α: statistically non-significant, low confidence
• Possible errors in the binary decision
– Type I: incorrectly reject H0
– Type II: incorrectly accept H0
Statistical significance tests
• (Non-)parametric tests
– t-test, Wilcoxon test, Sign test
• Based on resampling
– Bootstrap test, permutation/randomization test
• They make certain assumptions about distributions and sampling methods
– Often violated in IR evaluation experiments
– Which test behaves better, in practice, knowing that assumptions are violated?
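As an illustration of the resampling family (not the thesis code), a minimal paired randomization test on per-query score differences; under H0 the sign of each difference is exchangeable:

```python
import numpy as np

def randomization_test(scores_a, scores_b, n_perm=10_000, seed=0):
    """Two-sided paired randomization test on per-query differences."""
    rng = np.random.default_rng(seed)
    d = np.asarray(scores_a) - np.asarray(scores_b)
    observed = abs(d.mean())
    signs = rng.choice((-1, 1), size=(n_perm, d.size))  # random sign flips
    null = np.abs((signs * d).mean(axis=1))             # null distribution
    return float((null >= observed).mean())             # p-value

# Hypothetical per-query scores of systems A and B over the same queries
p = randomization_test([0.61, 0.43, 0.80, 0.55], [0.50, 0.47, 0.62, 0.51])
```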
Optimality criteria
• Power
– Achieve significance as often as possible (low Type II)
– Usually increases Type I error rates
• Safety
– Minimize Type I error rates
– Usually decreases power
• Exactness
– Maintain Type I error rate at α level
– Permutation test is theoretically exact
Experimental design
• Randomly split query set in two
• Evaluate all systems with both subsets
– Simulating two different test collections
• Compare p-values with both subsets
– How well do statistical tests agree with themselves?
– At different α levels
• All systems and queries from MIREX 2007-2011
– >15M p-values
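A sketch of this split-half design under stated assumptions (a hypothetical scores matrix of systems × queries, with the randomization_test sketch above standing in for any of the tests): split the queries at random, test every system pair on both halves, and compare the binary significance decisions.

```python
import itertools
import numpy as np

def split_half_disagreement(scores, alpha=0.05, seed=0):
    """scores: (n_systems, n_queries) per-query effectiveness matrix.
    Returns the fraction of system pairs whose significance decision
    (p <= alpha) differs between the two random halves of the queries."""
    rng = np.random.default_rng(seed)
    q = rng.permutation(scores.shape[1])
    h1, h2 = q[: q.size // 2], q[q.size // 2 :]
    flips = []
    for a, b in itertools.combinations(range(scores.shape[0]), 2):
        p1 = randomization_test(scores[a, h1], scores[b, h1])
        p2 = randomization_test(scores[a, h2], scores[b, h2])
        flips.append((p1 <= alpha) != (p2 <= alpha))
    return float(np.mean(flips))
```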
Power and success
• Bootstrap test is the most powerful
• Wilcoxon, bootstrap and permutation are the most successful, depending on α level
Conflicts
• Wilcoxon and t-test are the safest at low α levels
• Wilcoxon is the most exact at low α levels, but bootstrap is for usual levels
Optimal measure and scale
• Power: CGl@5, GAP@5, DCGl@5 and RBPl@5
• Success: CGl@5, GAP@5, DCGl@5 and RBPl@5
• Conflicts: very similar across measures
• Power: Fine, Broad and binary
• Success: Fine, Broad and binary
• Conflicts: very similar across scales
Outline
• Introduction
• Validity
• Reliability
– Optimality of Statistical Significance Tests
– Test Collection Size
• Efficiency
• Conclusions and Future Work
Acceptable sample size
• Reliability is higher with larger sample sizes
– But it is also more expensive
– What is an acceptable test collection size?
• Answer with Generalizability Theory
– G-Study: estimate variance components
– D-Study: estimate reliability of different sample sizes and experimental designs
G-study: variance components
• Fully crossed experimental design: s × q
λq,A = λ + λA + λq + εqA
σ² = σs² + σq² + σsq²
• Estimated with Analysis of Variance
• If σs² is small or σq² is large, we need more queries
D-study: variance ratios
• Stability of absolute scores
Φ(nq) = σs² / (σs² + (σq² + σe²) / nq)
• Stability of relative scores
Eρ²(nq) = σs² / (σs² + σe² / nq)
• We can easily estimate how many queries are needed to reach some level of stability (reliability)
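A compact sketch of both studies under stated assumptions (scores is a hypothetical systems × queries effectiveness matrix; variance components come from the expected mean squares of the fully crossed s × q ANOVA, with the interaction as the error term):

```python
import numpy as np

def g_study(scores):
    """scores: (n_s, n_q) matrix. Returns (sigma2_s, sigma2_q, sigma2_e)
    estimated from the ANOVA expected mean squares."""
    n_s, n_q = scores.shape
    grand = scores.mean()
    ms_s = n_q * ((scores.mean(axis=1) - grand) ** 2).sum() / (n_s - 1)
    ms_q = n_s * ((scores.mean(axis=0) - grand) ** 2).sum() / (n_q - 1)
    resid = (scores - scores.mean(axis=1, keepdims=True)
                    - scores.mean(axis=0, keepdims=True) + grand)
    ms_e = (resid ** 2).sum() / ((n_s - 1) * (n_q - 1))
    return (ms_s - ms_e) / n_q, (ms_q - ms_e) / n_s, ms_e

def d_study(sigma2_s, sigma2_q, sigma2_e, n_q):
    """Stability coefficients for a hypothetical collection of n_q queries."""
    phi = sigma2_s / (sigma2_s + (sigma2_q + sigma2_e) / n_q)
    erho2 = sigma2_s / (sigma2_s + sigma2_e / n_q)
    return phi, erho2
```

Scanning n_q in d_study then gives the smallest query set size that reaches a target such as Φ ≥ 0.95.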
Effect of query set size
• Average absolute stability Φ = 0.97
• ≈65 queries needed for Φ = 0.95, ≈100 in worst cases
• Fine scale slightly better than Broad and binary scales
• RBPl@5 and nDCGl@5 are the most stable
Effect of query set size
• Average relative stability Eρ² = 0.98
• ≈35 queries needed for Eρ² = 0.95, ≈60 in worst cases
• Fine scale better than Broad and binary scales
• CGl@5 and RBPl@5 are the most stable
Effect of cutoff k
• What if we use a deeper cutoff, k=10?
– From 100 queries and k=5 to 50 queries and k=10
– Should still have stable scores
– Judging effort should decrease
– Rank-based measures should become more stable
• Tested in MIREX 2012
– Apparently in 2013 too
Effect of cutoff k
• Judging effort reduced to 72% of the usual
• Generally stable
– From Φ = 0.81 to Φ = 0.83
– From Eρ² = 0.93 to Eρ² = 0.95
Effect of cutoff k
• Reliability given a fixed budget for judging?
– k=10 allows us to use fewer queries, about 70%
– Slightly reduced relative stability
Effect of assessor set size
• More assessors or simply more queries?
– Judging effort is multiplied
• Can be studied with MIREX 2006 data
– 3 different assessors per query
– Nested experimental design: s × (h:q)
Effect of assessor set size
• Broad scale: σs² ≈ σh:q²
• Fine scale: σs² ≫ σh:q²
• Always better to spend resources on queries
Summary
• MIREX collections generally larger than necessary
• For fixed budget
– More queries better than more assessors
– More queries slightly better than deeper cutoff
• Worth studying an alternative user model?
• Employ G-Theory while building the collection
• Fine better than Broad, better than binary
• CGl@5 and DCGl@5 best for relative stability
• RBPl@5 and nDCGl@5 best for absolute stability
Outline
• Introduction
• Validity
• Reliability
– Optimality of Statistical Significance Tests
– Test Collection Size
• Efficiency
• Conclusions and Future Work
Outline
• Introduction
• Validity
• Reliability
• Efficiency
– Learning Relevance Distributions
– Low-cost Evaluation
• Conclusions and Future Work
Probabilistic evaluation
• The MIREX setting is still expensive
– Need to judge all top k documents from all systems
– Takes days, even weeks sometimes
• Model relevance probabilistically
• Relevance judgments are random variables over the space of possible assignments of relevance
• Effectiveness measures are also probabilistic
Probabilistic evaluation
• Accuracy increases as we make judgments
– E[Rd] ← rd
• Reliability increases too (confidence)
– Var[Rd] ← 0
• Iteratively estimate relevance and effectiveness
– If confidence is low, make judgments
– If confidence is high, stop
• Judge as few documents as possible
Learning distributions of relevance
• Uniform distribution is very uninformative
• Historical distribution in MIREX has high variance
• Estimate from a set of features: P(Rd = ℓ | θd)
– For each document separately
– Ordinal Logistic Regression
• Three sets of features
– Output-based, can always be used
– Judgment-based, to exploit known judgments
– Audio-based, to exploit musical similarity
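A minimal sketch of the per-document relevance model under stated assumptions: hypothetical features X and ordinal labels y, with statsmodels' OrderedModel as one off-the-shelf ordinal logistic regression (not necessarily the thesis implementation).

```python
import numpy as np
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Hypothetical training data: feature vectors theta_d for judged documents
# and their ordinal relevance labels (e.g., Broad scale: 0, 1, 2)
rng = np.random.default_rng(0)
X = rng.random((200, 3))           # e.g., output/judgment/audio-based features
y = rng.integers(0, 3, size=200)

model = OrderedModel(y, X, distr="logit")   # cumulative-logit (ordinal) model
res = model.fit(method="bfgs", disp=False)

# P(Rd = l | theta_d) for a new document: one probability per relevance level
probs = res.predict(X[:1])
```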
Learned models
• Mout: can be used even without judgments
– Similarity between systems' outputs
– Genre and artist metadata
• Genre is highly correlated with similarity
– Decent fit, R² ≈ 0.35
• Mjud: can be used when there are judgments
– Similarity between systems' outputs
– Known relevance of the same system and the same artist
• Artist is extremely correlated with similarity
– Excellent fit, R² ≈ 0.91
Estimation errors
• Actual vs. predicted by Mout
– 0.36 with Broad and 0.34 with Fine
• Actual vs. predicted by Mjud
– 0.14 with Broad and 0.09 with Fine
• Among assessors in MIREX 2006
– 0.39 with Broad and 0.31 with Fine
• Negligible under the current MIREX setting
Outline
• Introduction
• Validity
• Reliability
• Efficiency
– Learning Relevance Distributions
– Low-cost Evaluation
• Conclusions and Future Work
Probabilistic effectiveness measures
• Effectiveness scores are also random variables
• Different approaches to compute estimates
– Deal with dependence of random variables
– Different definitions of confidence
• For measures based on ideal ranking (nDCGl@k and RBPl@k) we do not have a closed form
– Approximated with Delta method and Taylor series
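For intuition, the first-order Delta method referenced above approximates the moments of a function of a random variable via a Taylor expansion; a generic statement (not the thesis' exact derivation) is:

```latex
% First-order Taylor expansion of g around \mu = E[X]:
%   g(X) \approx g(\mu) + g'(\mu)\,(X - \mu)
% which yields the approximations
\mathbb{E}[g(X)] \approx g(\mu),
\qquad
\operatorname{Var}[g(X)] \approx g'(\mu)^2 \,\operatorname{Var}[X]
```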
Ranking without judgments
1. Estimate relevance with Mout
2. Estimate relative differences and rank systems
• Average confidence in the rankings is 94%
• Average accuracy of the ranking is 92%
Ranking without judgments
• Can we trust individual estimates?
– Ideally, we want X% accuracy when X% confidence
– Confidence slightly overestimated in [0.9, 0.99)
| Confidence (DCGl@5) | Broad: in bin | Broad: accuracy | Fine: in bin | Fine: accuracy |
|---|---|---|---|---|
| [0.5, 0.6) | 23 (6.5%) | 0.826 | 22 (6.2%) | 0.636 |
| [0.6, 0.7) | 14 (4%) | 0.786 | 16 (4.5%) | 0.812 |
| [0.7, 0.8) | 14 (4%) | 0.571 | 11 (3.1%) | 0.364 |
| [0.8, 0.9) | 22 (6.2%) | 0.864 | 21 (6%) | 0.762 |
| [0.9, 0.95) | 23 (6.5%) | 0.87 | 19 (5.4%) | 0.895 |
| [0.95, 0.99) | 24 (6.8%) | 0.917 | 27 (7.7%) | 0.926 |
| [0.99, 1) | 232 (65.9%) | 0.996 | 236 (67%) | 0.996 |
| E[Accuracy] | | 0.938 | | 0.921 |
Relative estimates with judgments
1. Estimate relevance with Mout
2. Estimate relative differences and rank systems
3. While confidence is low (<95%):
   1. Select a document and judge it
   2. Update relevance estimates with Mjud when possible
   3. Update estimates of differences and rank systems
• What documents should we judge?
– Those that are the most informative
– Measure-dependent
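A schematic sketch of the loop above; estimate_relevance, rank_systems, most_informative and judge are hypothetical helpers standing in for the thesis' estimators and a human assessor:

```python
def incremental_evaluation(systems, queries, target_confidence=0.95):
    """Judge the most informative documents until the relative ranking
    of systems is estimated with enough confidence (schematic sketch)."""
    judgments = {}                        # (query, doc) -> judged relevance
    while True:
        # Mout without judgments, Mjud where judgments are available
        estimates = estimate_relevance(systems, queries, judgments)
        ranking, confidence = rank_systems(systems, estimates)
        if confidence >= target_confidence:
            return ranking, judgments
        qd = most_informative(estimates, judgments)  # measure-dependent pick
        judgments[qd] = judge(qd)                    # ask a human assessor
```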
Relative estimates with judgments
• Judging effort is dramatically reduced
– 1.3% of judgments with CGl@5, 9.7% with RBPl@5
• Average accuracy still 92%, but improved individually
– 74% of estimates with >99% confidence, 99.9% accurate
– Expected accuracy improves slightly from 0.927 to 0.931
Absolute estimates with judgments
1. Estimate relevance with Mout
2. Estimate absolute effectiveness scores
3. While confidence is low (expected error > ±0.05):
   1. Select a document and judge it
   2. Update relevance estimates with Mjud when possible
   3. Update estimates of absolute effectiveness scores
• What documents should we judge?
– Those that reduce variance the most
– Measure-dependent
Absolute estimates with judgments
• The stopping condition is overly confident
– Virtually no judgments are even needed (supposedly)
• But effectiveness is highly overestimated
– Especially with nDCGl@5 and RBPl@5
– Mjud, and especially Mout, tend to overestimate relevance
Absolute estimates with judgments
• Practical fix: correct variance
• Estimates are better, but at the cost of judging
– Need between 15% and 35% of judgments
Summary
• Estimate ranking of systems with no judgments
– 92% accuracy on average, trustworthy individually
– Statistically significant differences are always correct
• If we want more confidence, judge documents
– As few as 2% needed to reach 95% confidence
– 74% of estimates have >99% confidence and accuracy
• Estimate absolute scores, judging as necessary
– Around 25% needed to ensure error <0.05
Outline
• Introduction
• Validity
• Reliability
• Efficiency
– Learning Relevance Distributions
– Low-cost Evaluation
• Conclusions and Future Work
Outline
• Introduction
• Validity
• Reliability
• Efficiency
• Conclusions and Future Work
– Conclusions
– Future Work
Validity
• Cranfield tells us about systems, not about users
• Provide empirical mapping from system effectiveness onto user satisfaction
• Room for personalization quantified at 20%
• Need large differences for users to note them
• Consider full distributions, not just averages
• Conclusions based on effectiveness tend to contradict conclusions based on user satisfaction
Reliability
• Different significance tests for different needs
– Bootstrap test is the most powerful
– Wilcoxon and t-test are the safest
– Wilcoxon and bootstrap test are the most exact
• Practical interpretation of p-values
• MIREX collections generally larger than needed
• Spend resources on queries, not on assessors
• User models with deeper cutoffs are feasible
• Employ G-Theory while building collections
Efficiency
• Probabilistic evaluation reduces cost, dramatically
• Two models to estimate document relevance
• System rankings 92% accurate without judgments
• 2% of judgments to reach 95% confidence
• 25% of judgments to reduce error to 0.05
Measures and scales
• Best measure and scale depends on situation
• But generally speaking
– CGl@5, DCGl@5 and RBPl@5
– Fine scale
– Model distributions as Beta
Outline
• Introduction
• Validity
• Reliability
• Efficiency
• Conclusions and Future Work
– Conclusions
– Future Work
Validity
• User studies to understand user behavior
• What information to include in test collections
• Other forms of relevance judgment to better capture document utility
• Explicitly define judging guidelines
• Similar mapping for Text IR
Reliability
• Corrections for Multiple Comparisons
• Methods to reliably estimate reliability while building test collections
Efficiency
• Better models to estimate document relevance
• Correct variance when only a few relevance judgments are available
• Estimate relevance beyond k=5
• Other stopping conditions and document weights
Conduct similar studies
for the wealth of tasks in
Music Information Retrieval