Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
Julián Urbano, Jorge Morato, Mónica Marrero and Diego Martín
[email protected]
SIGIR CSE 2010 · Geneva, Switzerland · July 23rd
Outline
• Introduction
• Motivation
• Alternative Methodology
• Crowdsourcing Preferences
• Results
• Conclusions and Future Work
Evaluation Experiments
• Essential for Information Retrieval [Voorhees, 2002]
• Traditionally followed the Cranfield paradigm
  ▫ Relevance judgments are the most important part of test collections (and the most expensive)
• In the music domain, evaluation was not taken very seriously until recently
  ▫ MIREX appeared in 2005 [Downie et al., 2010]
  ▫ Additional problems with the construction and maintenance of test collections [Downie, 2004]
Music Similarity Tasks
• Given a music piece (i.e. the query), return a ranked list of other pieces similar to it
  ▫ Based on the actual music content, forget the metadata!
• It comes in two flavors
  ▫ Symbolic Melodic Similarity (SMS)
  ▫ Audio Music Similarity (AMS)
• It is inherently more complex to evaluate
  ▫ Relevance judgments are very problematic
Relevance (Similarity) Judgments
• Relevance is usually considered on a fixed scale
  ▫ Relevant, not relevant, very relevant…
• For music similarity tasks relevance is rather continuous [Selfridge-Field, 1998][Typke et al., 2005][Jones et al., 2007]
  ▫ Single melodic changes (move a note up or down in pitch, shorten it, etc.) are not perceived to change the overall melody
  ▫ But the similarity weakens as more changes apply
• Where is the line between relevance levels?
Partially Ordered Lists
• The relevance of a document is implied by its position in a partially ordered list [Typke et al., 2005]
  ▫ Does not need any prefixed relevance scale
• Ordered groups of equally relevant documents
  ▫ The order of the groups must be kept
  ▫ Permutations within the same group are allowed
• Assessors only need to be sure that every pair of documents is ordered properly (see the sketch below)
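A minimal sketch of this data structure in Python (illustrative names and documents, not from the original slides): a partially ordered list is simply an ordered list of groups, and a flat ranking is acceptable as long as it never places a document before one from an earlier group.

```python
# A partially ordered list: groups are ordered by relevance to the query;
# documents within a group are interchangeable.
partially_ordered_list = [
    {"A", "B", "C"},  # most similar to the query
    {"D", "E"},       # less similar
]

def is_consistent(ranking, groups):
    """Check that a flat ranking respects the group order: no document
    may appear before a document from an earlier (more relevant) group."""
    group_of = {doc: i for i, group in enumerate(groups) for doc in group}
    indices = [group_of[doc] for doc in ranking]
    return all(a <= b for a, b in zip(indices, indices[1:]))

print(is_consistent(["B", "A", "C", "E", "D"], partially_ordered_list))  # True
print(is_consistent(["A", "D", "B", "C", "E"], partially_ordered_list))  # False
```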
Partially Ordered Lists (II)
(Figure: an example of a partially ordered list.)
Partially Ordered Lists (and III)
• Used in the first edition of MIREX in 2005 [Downie et al., 2005]
• Widely accepted by the MIR community to report new developments [Urbano et al., 2010a][Pinto et al., 2008][Hanna et al., 2007][Grachten et al., 2006]
• MIREX was forced to move to traditional level-based relevance in 2006
  ▫ Partially ordered lists are expensive
  ▫ And they have some inconsistencies
Expensiveness
• Building the ground truth for just 11 queries took 35 music experts working for 2 hours [Typke et al., 2005]
  ▫ Only 11 of them had time to work on all 11 queries
  ▫ This exceeds MIREX's resources for a single task
• MIREX had to move to level-based relevance
  ▫ BROAD: Not Similar, Somewhat Similar, Very Similar
  ▫ FINE: numerical, from 0 to 10 with one decimal digit
• Problems with assessor consistency came up
Issues with Assessor Consistency
• The line between levels is certainly unclear [Jones et al., 2007][Downie et al., 2010]
Original Methodology
• Go back to partially ordered lists
  ▫ Filter the collection
  ▫ Have the experts rank the candidates
  ▫ Arrange the candidates by rank
  ▫ Aggregate candidates whose ranks are not significantly different (Mann-Whitney U)
• There are known odd results and inconsistencies [Typke et al., 2005][Hanna et al., 2007][Urbano et al., 2010b]
• Filtering disregards changes that do not alter the actual perception, such as clef, key or time signature
  ▫ Something like translating a text or using synonyms [Urbano et al., 2010a]
Inconsistencies due to Ranking
(Figure: examples of inconsistencies caused by the ranking process.)
Alternative Methodology
• Minimize inconsistencies [Urbano et al., 2010b]
• Cheapen the whole process
• Reasonable person hypothesis [Downie, 2004]
  ▫ With crowdsourcing (finally)
• Use Amazon Mechanical Turk
  ▫ Get rid of experts [Alonso et al., 2008][Alonso et al., 2009]
  ▫ Work with "reasonable turkers"
  ▫ Explore other domains to apply crowdsourcing
Equally Relevant Documents
• Experts were forced to give totally ordered lists
• One would expect ranks to randomly average out
  ▫ Half the experts prefer one document
  ▫ Half the experts prefer the other one
• That is hardly the case
  ▫ Do not expect similar ranks if the experts cannot give similar ranks in the first place
Give Audio instead of Images
• Experts may be guided by the images, not the music
  ▫ Some irrelevant changes in the image can deceive them
• No music expertise should be needed
  ▫ Reasonable person (turker) hypothesis
Preference Judgments
• In their heads, experts actually make preference judgments
  ▫ Similar to a binary search
  ▫ Assessor fatigue accelerates as the list grows
• Already noted for level-based relevance
  ▫ Assessors go back and re-judge [Downie et al., 2010][Jones et al., 2007]
  ▫ Overlap between BROAD and FINE scores
• Change the relevance assessment question
  ▫ Which is more similar to Q: A or B? [Carterette et al., 2008]
Preference Judgments (II)
• Better than traditional level-based relevance
  ▫ Inter-assessor agreement
  ▫ Time to answer
• In our case, three-point preferences
  ▫ A < B (A is more similar)
  ▫ A = B (they are equally similar/dissimilar)
  ▫ A > B (B is more similar)
Preference Judgments (and III)
• Use a modified QuickSort algorithm to sort documents into a partially ordered list (see the sketch below)
  ▫ Does not need all O(n²) judgments, but only O(n·log n)
(Figure: the sorting process. Legend: "X" is the current pivot on the segment; a marked "X" has been a pivot already.)
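A minimal sketch of this preference-based QuickSort in Python (the `preference` oracle stands in for the crowdsourced judgments; all names are illustrative rather than taken from the original implementation). Documents tied with the pivot form one group of the partially ordered list, and each call to `preference` corresponds to one judged pair, so in expectation only O(n·log n) of the O(n²) pairs are judged:

```python
import random

def preference_sort(docs, preference):
    """Sort documents into a partially ordered list (a list of groups).
    preference(a, b) returns -1 if a is more similar to the query (a < b),
    0 if they are judged equally similar (a = b), and +1 otherwise (a > b)."""
    if not docs:
        return []
    pivot = random.choice(docs)
    more_similar, tied, less_similar = [], [pivot], []
    for doc in docs:
        if doc is pivot:
            continue
        judgment = preference(doc, pivot)
        if judgment < 0:
            more_similar.append(doc)   # doc preferred over the pivot
        elif judgment > 0:
            less_similar.append(doc)   # pivot preferred over doc
        else:
            tied.append(doc)           # tie: same group as the pivot
    # Ties with the pivot form one group; recurse on both sides.
    return (preference_sort(more_similar, preference)
            + [set(tied)]
            + preference_sort(less_similar, preference))
```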
How Many Assessors?
• Ranks are given to each document in a pair
  ▫ +1 if it is preferred over the other one
  ▫ -1 if the other one is preferred
  ▫ 0 if they were judged equally similar/dissimilar
• Test for signed differences in the samples (see the sketch below)
• In the original lists 35 experts were used
  ▫ Ranks of a document ranged between 1 and more than 20
• Our rank sample is less (and equally) variable
  ▫ rank(A) = -rank(B) ⇒ var(A) = var(B)
  ▫ The effect size is larger, so statistical power increases
  ▫ Fewer assessors are needed overall
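The slides do not name the exact test; a two-sided sign test over the non-tied judgments is one plausible instantiation, so take this minimal sketch as an assumption rather than the original procedure:

```python
from scipy.stats import binomtest

def judge_pair(judgments, alpha=0.05):
    """judgments: one entry per worker for a single (A, B) pair:
    +1 if A was preferred, -1 if B was preferred, 0 if judged equal.
    Returns 'A', 'B', or 'tie' using a two-sided sign test."""
    non_ties = [j for j in judgments if j != 0]
    if not non_ties:
        return "tie"
    wins_a = sum(1 for j in non_ties if j > 0)
    if binomtest(wins_a, n=len(non_ties), p=0.5).pvalue >= alpha:
        return "tie"  # no statistically significant preference
    return "A" if 2 * wins_a > len(non_ties) else "B"

print(judge_pair([+1, +1, +1, +1, +1, +1, +1, +1, 0, -1]))  # 'A'
```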
Crowdsourcing Preferences
• Crowdsourcing seems very appropriate
  ▫ Reasonable person hypothesis
  ▫ Audio instead of images
  ▫ Preference judgments
  ▫ QuickSort for partially ordered lists
• The task can be split into very small assignments
• It should be much cheaper and more consistent
  ▫ No experts needed
  ▫ Workers are not deceived, which increases consistency
  ▫ Easier and faster to judge
  ▫ Fewer judgments and judges needed
New Domain of Application
• Crowdsourcing has been used mainly to evaluate text documents in English
• How about other languages?
  ▫ Spanish [Alonso et al., 2010]
• How about multimedia?
  ▫ Image tagging? [Nowak et al., 2010]
  ▫ Music similarity?
Data
• MIREX 2005 evaluation collection
  ▫ ~550 musical incipits in MIDI format
  ▫ 11 queries, also in MIDI format
  ▫ 4 to 23 candidates per query
• Converted to MP3, as it is easier to play in browsers
• Trimmed the leading and trailing silence (see the sketch below)
  ▫ From 1 to 57 secs. (mean 6) down to 1 to 26 secs. (mean 4)
  ▫ 4 to 24 secs. (mean 13) to listen to all 3 incipits
• Uploaded all MP3 files and a Flash player to a private server to stream the data on the fly
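A minimal sketch of the silence-trimming step, assuming the MIDI incipits have already been rendered to audio; pydub is my choice for illustration, the slides do not say which tool was actually used:

```python
from pydub import AudioSegment
from pydub.silence import detect_leading_silence

def trim_silence(path, silence_threshold=-50.0):
    """Strip leading and trailing silence from a rendered incipit."""
    audio = AudioSegment.from_file(path)
    start = detect_leading_silence(audio, silence_threshold=silence_threshold)
    # Trailing silence: reverse, measure leading silence, cut from the end.
    end = detect_leading_silence(audio.reverse(), silence_threshold=silence_threshold)
    return audio[start:len(audio) - end]

trim_silence("incipit_001.wav").export("incipit_001.mp3", format="mp3")
```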
HIT Design
(Figure: the HIT as presented to workers, paying 2 yummy cents of a dollar per assignment.)
Threats to Validity
• Basically had to randomize everything
  ▫ Initial order of candidates in the first segment
  ▫ Alternate between queries
  ▫ Alternate between pivots of the same query
  ▫ Alternate pivots as variations A and B
• Let the workers know about this randomization
• In the first trials some documents were judged more similar to the query than the query itself!
  ▫ Require at least a 95% acceptance rate
  ▫ Ask for 10 different workers per HIT [Alonso et al., 2009]
  ▫ Beware of bots (they always judged "equal", in 8 secs.); see the sketch below
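A minimal sketch of a filter for that bot signature; the answer encoding, log format and 10-second threshold are my own assumptions for illustration:

```python
def looks_like_bot(assignments, max_seconds=10):
    """assignments: list of (answer, seconds_taken) pairs for one worker.
    Flag workers who always answer "equal" and always finish suspiciously
    fast, matching the bot behavior observed during the trials."""
    return all(answer == "equal" and seconds <= max_seconds
               for answer, seconds in assignments)

worker_log = [("equal", 8), ("equal", 8), ("equal", 7)]
print(looks_like_bot(worker_log))  # True: discard this worker's judgments
```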
Summary of Submissions
• The 11 lists account for 119 candidates to judge
• Sent 8 batches (QuickSort iterations) to MTurk
• Had to judge 281 pairs (38%) = 2,810 judgments
• 79 unique workers over about a day and a half
• A total cost (excluding trials) of $70.25
Feedback and Music Background
• 23 of the 79 workers gave us feedback
  ▫ 4 very positive comments: very relaxing music
  ▫ 1 greedy worker: give me more money
  ▫ 2 technical problems loading the audio in 2 HITs (not reported by any of the other 9 workers)
  ▫ 5 reported no music background
  ▫ 6 had formal music education
  ▫ 9 had been professional practitioners for several years
  ▫ 9 play an instrument, mainly piano
  ▫ 6 have performed in a choir
Agreement between Workers
• Forget about Fleiss' Kappa
  ▫ It does not account for the size of the disagreement
  ▫ A<B vs. A=B is not as bad as A<B vs. B<A
• Look at all 45 pairs of judgments per judged pair (10 workers yield 45 worker pairs); see the sketch below
  ▫ +2 if total agreement (e.g. A<B and A<B)
  ▫ +1 if partial agreement (e.g. A<B and A=B)
  ▫ 0 if no agreement (i.e. A<B and B<A)
  ▫ Divide by 90 (the score if all pairs agreed totally)
• Average agreement score per pair was 0.664
  ▫ From 0.506 (iteration 8) to 0.822 (iteration 2)
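A minimal sketch of that agreement score for the 10 judgments of one pair (illustrative code; answers are encoded as '<', '=' and '>'):

```python
from itertools import combinations

def agreement_score(judgments):
    """judgments: the 10 workers' answers for one (A, B) pair,
    each one of '<', '=' or '>'. Score every pair of workers:
    2 for identical answers, 0 for opposite ones ('<' vs '>'),
    1 otherwise; normalize by 90, the all-agree maximum."""
    score = 0
    for a, b in combinations(judgments, 2):
        if a == b:
            score += 2          # total agreement
        elif {a, b} != {"<", ">"}:
            score += 1          # partial agreement (one judged equal)
    return score / 90           # 45 worker pairs x 2 points each

print(agreement_score(["<"] * 6 + ["="] * 3 + [">"]))  # 0.633...
```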
Agreement Workers-Experts
• Those 10 judgments per pair were actually aggregated into one (see the sketch below)
(Figure: contingency table of aggregated worker vs. expert judgments; percentages per row total.)
  ▫ 155 (55%) total agreement
  ▫ 102 (36%) partial agreement
  ▫ 23 (8%) no agreement
• Total agreement score = 0.735
• This supports the reasonable person hypothesis
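The slides do not spell out the aggregation rule; a majority of preference signs is one plausible sketch (an assumption, not necessarily the rule actually used):

```python
def aggregate(judgments):
    """Collapse the 10 worker judgments for one (A, B) pair into a single
    preference by majority of signs; '=' when neither side has a majority.
    This majority rule is an assumed, illustrative aggregation."""
    balance = judgments.count("<") - judgments.count(">")
    if balance > 0:
        return "<"   # A preferred overall
    if balance < 0:
        return ">"   # B preferred overall
    return "="       # no overall preference

print(aggregate(["<", "<", "<", "=", "=", ">", "<", "<", "=", "<"]))  # '<'
```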
Agreement Single Worker-Experts
(Figure: agreement between a single worker and the experts.)
Agreement (Summary)
• Very similar judgments overall
  ▫ The reasonable person hypothesis still stands
  ▫ Crowdsourcing seems a doable alternative
  ▫ No music expertise seems necessary
• We could use just one assessor per pair
  ▫ If we could keep him/her throughout the whole query
Ground Truth Similarity
• Do high agreement scores translate into highly similar ground truth lists?
• Consider the original lists (All-2) as the ground truth
• And the crowdsourced lists as a system's result
  ▫ Compute the Average Dynamic Recall (ADR) [Typke et al., 2006]
  ▫ And then the other way around
• Also compare with the (more consistent) original lists aggregated in Any-1 form [Urbano et al., 2010b]
Ground Truth Similarity (II)
• The result depends on the initial ordering
  ▫ Ground truth = (A, B, C), (D, E)
  ▫ Results1 = (A, B), (D, E, C): ADR score = 0.933
  ▫ Results2 = (A, B), (C, D, E): ADR score = 1
• Yet Results1 and Results2 are identical as partially ordered lists
• Generate 1000 (identical) versions by randomly permuting the documents within each group, and average (see the sketch below)
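A minimal sketch of ADR as I read its definition in Typke et al. (2006), reproducing the numbers above, plus the within-group permutation averaging (illustrative code):

```python
import random

def adr(ground_truth_groups, results):
    """Average Dynamic Recall: at rank i, the relevant set is the union of
    the first ground-truth groups that together hold at least i documents;
    recall@i is then averaged over all ranks i."""
    n = sum(len(group) for group in ground_truth_groups)
    total = 0.0
    for i in range(1, n + 1):
        allowed, count = set(), 0
        for group in ground_truth_groups:
            allowed |= group
            count += len(group)
            if count >= i:
                break
        total += len(set(results[:i]) & allowed) / i
    return total / n

ground_truth = [{"A", "B", "C"}, {"D", "E"}]
print(round(adr(ground_truth, ["A", "B", "D", "E", "C"]), 3))  # 0.933
print(adr(ground_truth, ["A", "B", "C", "D", "E"]))            # 1.0

def mean_adr(ground_truth_groups, result_groups, trials=1000):
    """Average ADR over random within-group permutations of the results."""
    scores = []
    for _ in range(trials):
        flat = [doc for group in result_groups
                    for doc in random.sample(sorted(group), len(group))]
        scores.append(adr(ground_truth_groups, flat))
    return sum(scores) / trials
```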
Ground Truth Similarity (and III)
(Table: ADR scores between ground truth versions; min. and max. between square brackets.)
• Very similar to the original All-2 lists
• Like the Any-1 version, also more restrictive
• More consistent (workers were not deceived)
MIREX 2005 Revisited
• Would the evaluation have been affected?
  ▫ Re-evaluated the 7 systems that participated
  ▫ Included our Splines system [Urbano et al., 2010a]
• All systems perform significantly worse
  ▫ ADR scores drop between 9% and 15%
• But their ranking is just the same (see the sketch below)
  ▫ Kendall's τ = 1
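Checking that with Kendall's τ is a one-liner; the scores below are hypothetical placeholders, not the actual MIREX results:

```python
from scipy.stats import kendalltau

# Hypothetical ADR scores per system under the original and the
# crowdsourced ground truths (placeholders for illustration only).
adr_original = [0.71, 0.68, 0.64, 0.60]
adr_crowd    = [0.63, 0.60, 0.55, 0.52]

tau, p_value = kendalltau(adr_original, adr_crowd)
print(tau)  # 1.0: both ground truths rank the systems identically
```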
Conclusions
• Partially ordered lists should come back
• We proposed an alternative methodology
  ▫ Asked for three-point preference judgments
  ▫ Used Amazon Mechanical Turk
  ▫ Crowdsourcing can be used for music-related tasks
  ▫ Provided empirical evidence supporting the reasonable person hypothesis
• What for?
  ▫ More affordable and large-scale evaluations
Conclusions (and II)
• We need fewer assessors
  ▫ More queries with the same manpower
• Preferences are easier and faster to judge
• Fewer judgments are required
  ▫ Thanks to the sorting algorithm
• Inconsistencies are avoided (the A=B option)
• Using audio instead of images removes the need for experts
• From 70 expert hours to 35 hours of crowdsourcing for $70
Future Work
• Choice of pivots in the sorting algorithm
  ▫ e.g. the query itself would not provide information
• Study the collections for Audio tasks
  ▫ They have more data, though it is inaccessible
  ▫ But no partially ordered lists (yet)
• Use our methodology with one real expert judging preferences for the same query
• Try crowdsourcing too, with one single worker
Future Work (and II)
• Experimental study on the characteristics of human music similarity perception
  ▫ Is it transitive? (We assumed it is)
  ▫ Is it symmetrical?
• If these properties do not hold, we have problems
• If they do, we can start thinking about Minimal and Incremental Test Collections [Carterette et al., 2005]