Information Retrieval Evaluation
1
IR Evaluation
Mihai [email protected]
Chapter 8 of the Introduction to IR book
M. Sanderson. Test Collection Based Evaluation of Information Retrieval Systems. Foundations and Trends in IR, 2010
2
Outline
• Introduction
– Introduction to IR
• Kinds of evaluation
• Retrieval Effectiveness evaluation
– Measures, Experimentation
– Test Collections
• User-based evaluation
• Discussion on Evaluation
• Conclusion
3
Introduction
• Why?
– Put a figure on the benefit we get from a system
– Because without evaluation, there is no research
• Objective measurements
Information Retrieval
“Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Salton, 1968)
General definition that can be applied to many types of information and search applications
Primary focus of IR since the 50s has been on text and documents
Information Retrieval
Key insights of/for information retrieval:
– text has no meaning
  ฉันมีรถสีแดง
– but it is still the most informative source
  ฉันมีรถสีฟ้า is more similar to the above than คุณมีรถไฟฟ้า
– text is not random
  "I drive a red car" is more probable than "I drive a red horse", "A red car I drive", "Car red a drive I"
– meaning is defined by usage
  "I drive a truck" / "I drive a car" / "I drive the bus": truck / car / bus are similar in meaning
– term frequency (TF), document frequency (DF), TF-IDF, BM25 (Best Match 25)
– language models (uni-gram, bi-gram, n-gram)
– statistical semantics (latent semantic analysis, random indexing, deep learning)
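As a concrete illustration of the term-weighting idea listed above, here is a minimal sketch of one common TF-IDF variant (the exact formula differs between systems; the toy documents are invented):

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, all_docs):
    """One common TF-IDF variant: raw term frequency times log inverse document frequency."""
    tf = Counter(doc_tokens)[term]
    df = sum(1 for d in all_docs if term in d)           # document frequency
    idf = math.log(len(all_docs) / df) if df else 0.0    # inverse document frequency
    return tf * idf

docs = [["i", "drive", "a", "red", "car"],
        ["i", "drive", "a", "truck"],
        ["a", "red", "horse"]]
print(tf_idf("red", docs[0], docs))   # discriminative term, positive weight
print(tf_idf("a", docs[0], docs))     # occurs in every document, weight 0
```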
Big Issues in IR
Relevance
– What is it?
– Simple (and simplistic) definition: a relevant document contains the information that a person was looking for when they submitted a query to the search engine
– Many factors influence a person's decision about what is relevant: e.g. task, context, novelty, style
– Topical relevance (same topic) vs. user relevance (everything else)
Relevance
– Retrieval models define a view of relevance
– Ranking algorithms used in search engines are based on retrieval models
– Most models describe statistical properties of text rather than linguistic ones, i.e. counting simple text features such as words instead of parsing and analyzing the sentences
  The statistical approach to text processing started with Luhn in the 1950s
  Linguistic features can be part of a statistical model
Big Issues in IR
Evaluation
– Experimental procedures and measures for comparing system output with user expectations
  Originated in the Cranfield experiments in the 1960s
– IR evaluation methods are now used in many fields
– Typically use a test collection of documents, queries, and relevance judgments
  Most commonly used are the TREC collections
– Recall and precision are two examples of effectiveness measures
Big Issues in IR
Users and Information Needs
– Search evaluation is user-centered
– Keyword queries are often poor descriptions of actual information needs
– Interaction and context are important for understanding user intent
– Query refinement techniques such as query expansion, query suggestion, and relevance feedback improve ranking
13
Introduction
• Why?
– Put a figure on the benefit we get from a system
– Because without evaluation, there is no research
• Why is this a research field in itself?
– Because there are many kinds of IR, with different evaluation criteria
– Because it's difficult
• Why is it difficult?
– Because it involves human subjectivity (document relevance)
– Because of the amount of data involved (who can sit down and evaluate 1,750,000 documents returned by Google for 'university vienna'?)
14
Kinds of evaluation
15
Kinds of evaluation
• "Efficient and effective system"
• Time and space: efficiency
– Generally constrained by pre-development specification
  E.g. real-time answers vs. batch jobs
  E.g. index-size constraints
– Easy to measure
• Good results: effectiveness
– Harder to define --> more research into it
• And…
16
Kinds of evaluation (cont.)
• User studies
– Does a 2% increase in some retrieval performance measure actually make a user happier?
– Does displaying a text snippet improve usability even if the underlying method is 10% weaker than some other method?
– Hard to do
– Mostly anecdotal examples
– Many IR people don't like to do it (though that is starting to change)
17
Kinds of evaluation (cont.)
Intrinsic
– "internal"
– the ultimate goal is the retrieved set
Extrinsic
– "external"
– evaluated in the context of the usage of the retrieval tool
18
What to measure in an IR system?
1966, Cleverdon:
1. coverage – the extent to which relevant matter exists in the system
2. time lag ~ efficiency
3. presentation
4. effort on the part of the user to answer his information need
5. recall
6. precision
19
What to measure in an IR system?
1966, Cleverdon:
1. coverage – the extent to which relevant matter exists in the system
2. time lag ~ efficiency
3. presentation
4. effort on the part of the user to answer his information need
5. recall
6. precision
Effectiveness
A desirable measure of retrieval performance would have the following properties: 1, it would be a measure of effectiveness. 2, it would not be
confounded by the relative willingness of the system to emit items. 3, it would be a single number – in preference, for example, to a pair of numbers which
may co-vary in a loosely specified way, or a curve representing a table of several pairs of numbers 4, it would allow complete ordering of different
performances, and assess the performance of any one system in absolute terms. Given a measure with these properties, we could be confident of
having a pure and valid index of how well a retrieval system (or method) were performing the function it was primarily designed to accomplish, and we could
reasonably ask questions of the form “Shall we pay X dollars for Y units of effectiveness?” (Swets, 1967)
20
Outline
• Introduction
• Kinds of evaluation
• Retrieval Effectiveness evaluation
– Measures
– Test Collections
• User-based evaluation
• Discussion on Evaluation
• Conclusion
21
Efficiency Metrics
22
Retrieval Effectiveness
Precision
– How happy are we with what we've got
Recall
– How much more we could have had

Precision = (number of relevant documents retrieved) / (number of documents retrieved)

Recall = (number of relevant documents retrieved) / (number of relevant documents)
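A minimal sketch of these two set-based definitions, treating retrieved and relevant as plain sets of document ids (the example values are invented):

```python
def precision(retrieved, relevant):
    """Relevant documents retrieved / documents retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Relevant documents retrieved / relevant documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}
print(precision(retrieved, relevant))   # 2/4 = 0.5
print(recall(retrieved, relevant))      # 2/3 ≈ 0.67
```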
23
Retrieval Effectiveness
[Figure: the set of retrieved documents and the set of relevant documents overlap within the universe of documents]
24
Precision and Recall
25
Retrieval effectiveness
What if we don't like this twin-measure approach? A solution:
– Van Rijsbergen's E-measure
– With a special case: the harmonic mean
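The formulas themselves did not survive the transcript; for reference, the usual textbook forms are (with α weighting precision against recall):

```latex
E = 1 - \frac{1}{\alpha \frac{1}{P} + (1 - \alpha)\frac{1}{R}}
% special case \alpha = 1/2: the complement of E is the harmonic mean of P and R (the F-measure)
F = 1 - E = \frac{2PR}{P + R}
```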
27
Retrieval effectiveness
Tools we need:
– A set of documents (the "dataset")
– A set of questions/queries/topics
– For each topic, and for each document, a decision: relevant or not relevant
Let's assume for the moment that's all we need and that we have it
28
Retrieval Effectiveness
• Precision and Recall generally plotted as a “Precision-Recall curve”
[Figure: precision (y-axis, 0 to 1) plotted against recall (x-axis, 0 to 1); precision tends to drop as the size of the retrieved set increases]
• They do not play well together
29
Precision-Recall Curves
How to build a Precision-Recall Curve?
– For one query at a time
– Make checkpoints on the recall axis
30
Precision-Recall Curves
How to build a Precision-Recall Curve?
– For one query at a time
– Make checkpoints on the recall axis
31
Precision-Recall Curves
• How to build a Precision-Recall Curve?
– For one query at a time
– Make checkpoints on the recall axis
– Repeat for all queries
32
Precision-Recall Curves
• And the average is the system’s P-R curve
[Figure: the averaged precision-recall curve; precision falls as the number of retrieved documents increases]
• We can compare systems by comparing the curves
33
Precision-Recall Graph (reality check)
34
Interpolation
To average graphs, calculate precision at standard recall levels:
P(R) = max{ P' : R' >= R, (R', P') in S }
– where S is the set of observed (R, P) points
– defines precision at any recall level as the maximum precision observed in any recall-precision point at a higher recall level
– produces a step function
– defines precision at recall 0.0
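A small sketch of this interpolation rule: at each standard recall level, report the highest precision observed at that recall level or beyond (the observed points are invented for the example):

```python
def interpolated_precision(observed, levels=None):
    """observed: list of (recall, precision) points for one query.
    Returns interpolated precision at the standard recall levels (a step function)."""
    if levels is None:
        levels = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0
    return [max((p for r, p in observed if r >= level), default=0.0)
            for level in levels]

points = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.44), (1.0, 0.3)]
print(interpolated_precision(points))
```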
35
Interpolation
36
Average Precision at Standard Recall Levels
• Recall-precision graph plotted by simply joining the average precision points at the standard recall levels
37
Average Recall-Precision Graph
38
Graph for 50 Queries
39
Retrieval Effectiveness
• Not quite done yet…
– When to stop retrieving?
  • Both P and R imply a cut-off value
– How about graded relevance?
  • Some documents may be more relevant to the question than others
– How about ranking?
  • Can a document retrieved at position 1,234,567 still be considered useful?
– Who says which documents are relevant and which are not?
40
Single-value measures
• Fix a "reasonable" cutoff
– R-precision: precision at R, where R is the number of relevant documents (fix the number of desired documents)
– Reciprocal rank (RR): 1 / rank of the first relevant document in the ranked list returned
• Make it less sensitive to the cutoff
– Average precision
  • For each query: R = # relevant documents, i = rank, k = # retrieved documents, P(i) = precision at rank i, rel(i) = 1 if the document at rank i is relevant, 0 otherwise
  • AP = (1/R) * sum_{i=1..k} P(i) * rel(i)
– For each system:
  • Compute the mean of these averages: Mean Average Precision (MAP) – one of the most used measures
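A minimal sketch of average precision and MAP following the definition above (document ids and judgments are invented):

```python
def average_precision(ranking, relevant):
    """AP for one query: sum of P(i) over the ranks i holding relevant documents,
    divided by the total number of relevant documents R."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:                    # rel(i) = 1
            hits += 1
            total += hits / i                  # P(i)
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """MAP: the mean of the per-query average precisions."""
    aps = [average_precision(ranking, rel) for ranking, rel in queries]
    return sum(aps) / len(aps) if aps else 0.0

queries = [(["d3", "d1", "d7", "d2"], {"d1", "d2", "d9"})]
print(mean_average_precision(queries))        # (1/2 + 2/4) / 3 ≈ 0.33
```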
41
R- Precision
Precision at the R-th position in the ranking of results for a query that has R relevant documents.
rank  doc  relevant
 1    588  x
 2    589  x
 3    576
 4    590  x
 5    986
 6    592  x
 7    984
 8    988
 9    578
10    985
11    103
12    591
13    772  x
14    990
R = # of relevant docs = 6
R-Precision = 4/6 = 0.67
42
Averaging Across Queries
43
Average Precision
44
MAP
45
Retrieval Effectiveness
• Not quite done yet…
– When to stop retrieving?
  • Both P and R imply a cut-off value
– How about graded relevance?
  • Some documents may be more relevant to the question than others
– How about ranking?
  • Can a document retrieved at position 1,234,567 still be considered useful?
– Who says which documents are relevant and which are not?
46
Cumulative Gain
• For each document d and query q, define rel(d, q) >= 0
• The higher the value, the more relevant the document is to the query
• Pitfalls:
– Graded relevance introduces even more ambiguity in practice
– With great flexibility comes great responsibility to justify parameter values
47
Retrieval Effectiveness
• Not quite done yet…
– When to stop retrieving?
  • Both P and R imply a cut-off value
– How about graded relevance?
  • Some documents may be more relevant to the question than others
– How about ranking?
  • Can a document retrieved at position 1,234,567 still be considered useful?
– Who says which documents are relevant and which are not?
48
Discounted Cumulative Gain
Popular measure for evaluating web search and related tasks
Two assumptions:
– Highly relevant documents are more useful than marginally relevant documents
– The lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined
49
Discounted Cumulative Gain
Uses graded relevance as a measure of the usefulness, or gain, from examining a document
Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks
Typical discount is 1/log2(rank)
– With base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3
50
Discounted Cumulative Gain
DCG is the total gain accumulated at a particular rank p:
DCG_p = rel_1 + sum_{i=2..p} rel_i / log2(i)   [Jarvelin:2000]
Alternative formulation:
DCG_p = sum_{i=1..p} (2^rel_i - 1) / log2(1 + i)   [Burges:2005]
– used by some web search companies
– emphasis on retrieving highly relevant documents
51
Discounted Cumulative Gain
• Neither CG nor DCG can be used for comparison across topics! Both depend on the number of relevant documents per topic.
52
Normalised Discounted Cumulative Gain
Compute the CG / DCG of the optimal return set
E.g. (5,5,5,4,4,3,3,3,3,2,2,2,1,1,1,1,1,1,0,0,0,0, …) has the Ideal Discounted Cumulative Gain: IDCG
Normalise: nDCG = DCG / IDCG
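A sketch of DCG and nDCG as described on the last few slides, using the original Järvelin & Kekäläinen discount; the gain vectors are toy examples:

```python
import math

def dcg(gains, p=None):
    """DCG_p = rel_1 + sum_{i=2..p} rel_i / log2(i)."""
    gains = gains[:p] if p else gains
    return sum(g if i == 1 else g / math.log2(i)
               for i, g in enumerate(gains, start=1))

def ndcg(gains, ideal_gains, p=None):
    """nDCG_p = DCG_p of the run / DCG_p of the ideal ranking (IDCG)."""
    idcg = dcg(sorted(ideal_gains, reverse=True), p)
    return dcg(gains, p) / idcg if idcg > 0 else 0.0

run_gains = [3, 2, 3, 0, 1, 2]          # graded relevance of the ranked results
all_judged = [3, 3, 3, 2, 2, 2, 1, 0]   # all judged gains for the topic
print(dcg(run_gains, p=6))
print(ndcg(run_gains, all_judged, p=6))
```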
53
some more variations
E.g. (5,5,5,4,4,3,3,3,3,2,2,2,1,1,1,1,1,1,0,0,0,0, …) has the Ideal Discounted Cumulative Gain: IDCG
"our rank": (5,2,0,0,5,2,4,0,0,1,4,…)
Two ranked lists:
– rank correlation measures
  Kendall's tau (similarity of orderings)
  Pearson's rho (linear correlation between variables)
  Spearman's rho (Pearson for ranks)
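A quick way to compute the three correlation coefficients mentioned above, here applied to two toy gain vectors with scipy:

```python
from scipy.stats import kendalltau, pearsonr, spearmanr

ideal = [5, 5, 5, 4, 4, 3, 3, 3, 3, 2, 2]
ours  = [5, 2, 0, 0, 5, 2, 4, 0, 0, 1, 4]

tau, _ = kendalltau(ideal, ours)     # similarity of the two orderings
r, _ = pearsonr(ideal, ours)         # linear correlation between the values
rho, _ = spearmanr(ideal, ours)      # Pearson computed on the ranks
print(tau, r, rho)
```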
54
some more variations
Rank-biased precision (RBP)
– "log-based discount is not a good model of users' behaviour"
– imagine the probability p of the user moving on to the next document
  p ~ 0.95 vs. p ~ 0.0
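A minimal sketch of RBP with binary relevance, where p is the persistence parameter mentioned on the slide (the relevance vector is invented):

```python
def rbp(rels, p=0.95):
    """Rank-biased precision: (1 - p) * sum_i rel_i * p^(i-1),
    with p the probability that the user moves on to the next document."""
    return (1 - p) * sum(rel * p ** (i - 1) for i, rel in enumerate(rels, start=1))

rels = [1, 0, 1, 1, 0, 0, 1]
print(rbp(rels, p=0.95))   # patient user: deep results still contribute
print(rbp(rels, p=0.1))    # impatient user: almost only rank 1 matters
```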
55
Time-based calibration
Assumption
– The objective of the search engine is to improve the efficiency of an information seeking task
Extend nDCG to replace the rank-based discount with a time-based function
(Smucker and Clarke, 2011)
[Formulas: gain, normalization, and decay expressed as functions of the time needed to reach item k in the ranked list]
56
The water filling model (Luo et al, 2013)
and the corresponding Cube Test (CT)
– also for professional search, to capture embedded subtopics
– no assumption of linear traversal of documents
– takes time into account
– potential cap on the amount of information taken into account
– high discriminative power
57
Other diversity metrics
Several aspects of the topic might [need to] be covered
– Aspectual recall/precision
– The discount may take into account previously seen aspects
– α-NDCG: NDCG where the gain of a document is reduced for aspects that have already been seen higher in the ranking
58
Other measures
• There are many IR measures!
• trec_eval is a little program that computes many of them
– 37 in v9.0, many of which are multi-point (e.g. Precision @10, @20, …)
• http://trec.nist.gov/trec_eval/
• "there is a measure to make anyone a winner"
– Not really true, but still…
59
Other measures
• How about correlations between measures?
• Kendall tau values, from Voorhees and Harman, 2004
• Overall they correlate
            P(30)  R-Prec  MAP   .5 prec  R(1,1000)  Rel Ret  MRR
P(10)       0.88   0.81    0.79  0.78     0.78       0.77     0.77
P(30)              0.87    0.84  0.82     0.80       0.79     0.72
R-Prec                     0.93  0.87     0.83       0.83     0.67
MAP                              0.88     0.85       0.85     0.64
.5 prec                                   0.77       0.78     0.63
R(1,1000)                                            0.92     0.67
Rel ret                                                       0.66
60
Topic sets
Topic selection
– In early TREC, candidate topics were rejected if ambiguous
Are all topics equal?
– Mean Average Precision uses the arithmetic mean; GMAP uses the geometric mean
– Classical Test Theory experiments (Bodoff and Li, 2007) identified outliers that could change the rankings
MAP: a change in AP from 0.05 to 0.1 has the same effect as a change from 0.25 to 0.3
GMAP: a change in AP from 0.05 to 0.1 has the same effect as a change from 0.25 to 0.5
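A tiny numerical illustration of this difference between the arithmetic mean (MAP) and the geometric mean (GMAP); the AP values are invented:

```python
import math

def gmap(aps, eps=1e-5):
    """Geometric mean of per-topic AP, with a small epsilon to cope with zeros."""
    return math.exp(sum(math.log(ap + eps) for ap in aps) / len(aps))

baseline = [0.05, 0.40, 0.60]
improved = [0.10, 0.40, 0.60]          # only the hardest topic improves (0.05 -> 0.10)
print(sum(improved) / 3 - sum(baseline) / 3)   # MAP gain: the same +0.0167 as a 0.25 -> 0.30 change
print(gmap(improved) - gmap(baseline))         # GMAP gain: doubling AP on a hard topic counts for much more
```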
61
Measure measures
What is the best measure?
– What makes a measure better?
Match to task
– E.g. known-item search: MRR
Something more quantitative?
– Correlations between measures
  Does the system ranking change when using different measures?
  Useful to group measures
– Ability to distinguish between runs
– Measure stability
62
Ad-hoc quiz
It was necessary to normalize the discounted cumulative gain (NDCG) because…
– of the assumption of a normal probability distribution
– to be able to compare across topics
– normalization is always better
– to be able to average across topics
63
Ad-hoc quiz
It was necessary to normalize the discounted cumulative gain (NDCG) because…
– of the assumption of a normal probability distribution
– to be able to compare across topics
– normalization is always better
– to be able to average across topics
(Answer: to be able to compare across topics; CG and DCG depend on the number of relevant documents per topic.)
64
Measure stability
Success criteria:
– A measure is good if it is able to predict differences between systems (on average, over future queries)
Method
– Split the collection in 2
  1. Use one half as a train collection to rank runs
  2. Use the other half as a test collection to compute how many pair-wise comparisons hold
Observations
– Cut-off measures are less stable than MAP
65
Measure stability
Success criteria:
– A measure is good if it is able to predict differences between systems (on average, over future queries)
Method
– Split the collection in 2
  1. Use one half as a train collection to rank runs
  2. Use the other half as a test collection to compute how many pair-wise comparisons hold
Observations
– Cut-off measures are less stable than MAP
Any other criteria for measure quality?
66
Measure measures
We started with opinions from the '60s and have since seen a number of measures – have the targets changed?
7 numeric properties of effectiveness metrics (Moffat, 2013)
68
7 properties of effectiveness metrics
Boundedness – the set of scores attainable by the metric is bounded, usually in [0,1]
Monotonicity – if a ranking of length k is extended so that k+1 elements are included, the score never decreases
Convergence – if a document outside the top k is swapped with a less relevant document inside the top k, the score strictly increases
Top-weightedness – if a document within the top k is swapped with a less relevant one higher in the ranking, the score strictly increases
Localization – a score at depth k can be computed based solely on knowledge of the documents that appear in the top k
Completeness – a score can be calculated even if the query has no relevant documents
Realizability – provided that the collection has at least one relevant document, it is possible for the score at depth k to be maximal.
69
So far
so far: introduction, metrics
we are now able to say "System A is better than System B"
or are we? Remember:
– we only have limited data
– potential future applications are unbounded
a very strong statement!
70
Statistical validity
Whatever evaluation metric is used, all experiments must be statistically valid, i.e. differences must not be the result of chance
[Figure: MAP values on a scale from 0 to 0.2]
71
Statistical validity
• Ingredients of a significance test
– A test statistic (e.g. the difference between AP values)
– A null hypothesis (e.g. "there is no difference between the two systems")
  This gives us a particular distribution of the test statistic
– An alternative hypothesis (one- or two-tailed tests)
  Don't change it after the test
– A significance level, computed by taking the actual value of the test statistic and determining how likely it is to see this value given the distribution implied by the null hypothesis
• P-value
• If the p-value is low, we can feel confident that we can reject the null hypothesis: the systems are different
72
Statistical validity
Common practice is to declare systems different when the p-value <= 0.05
A few tests
– Randomization tests
– Wilcoxon Signed Rank test
– Sign test
– Bootstrap test
– Student's paired t-test
See the recent discussion in SIGIR Forum
– T. Sakai, Statistical Reform in Information Retrieval?
  effect sizes, confidence intervals
73
Statistical validity
How do we increase the statistical validity of an experiment?
By increasing the number of topics
– The more topics, the more confident we are that the difference between average scores will be significant
What's the minimum number of topics?
42
• Depends, but
• TREC started with 50
• Below 25 is generally considered not significant
74
Example Experimental Results
75
t-Test
Assumption is that the difference between the effectiveness values is a sample from a normal distribution
Null hypothesis is that the mean of the distribution of differences is zero
Test statistic: t = mean(d) / (sd(d) / sqrt(n)), where d is the vector of per-topic differences and n the number of topics
– for the example, t = 2.33
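A sketch of how this paired test is usually run in practice with scipy (the per-topic AP values are made up for illustration):

```python
from scipy.stats import ttest_rel

# average precision per topic for two systems (toy numbers, ten topics)
system_a = [0.25, 0.43, 0.39, 0.75, 0.43, 0.15, 0.20, 0.52, 0.49, 0.50]
system_b = [0.35, 0.54, 0.43, 0.75, 0.68, 0.35, 0.30, 0.50, 0.58, 0.75]

t, p = ttest_rel(system_b, system_a)   # paired t-test on the per-topic differences
print(f"t = {t:.2f}, p = {p:.3f}")     # declare a significant difference if p <= 0.05
```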
76
t-Test: t = 2.33
77
t-Test: t = 2.33
78
Statistical Validity - example
79
80
81
82
83
Summary
so far
– introduction
– metrics
next
– where to get ground truth
– some more metrics
– discussion
84
Retrieval Effectiveness
• Not quite done yet…
– When to stop retrieving?
  • Both P and R imply a cut-off value
– How about graded relevance?
  • Some documents may be more relevant to the question than others
– How about ranking?
  • Can a document retrieved at position 1,234,567 still be considered useful?
– Who says which documents are relevant and which are not?
85
Relevance assessments
• Ideally
– Sit down and look at all documents
• Practically
– The ClueWeb09 collection has
  • 1,040,809,705 web pages, in 10 languages
  • 5 TB compressed (25 TB uncompressed)
– No way to do this exhaustively
– Look only at the set of returned documents
• Assumption: if there are enough systems being tested and not one of them returned a document, the document is not relevant
86
Relevance assessments - Pooling
Combine the results retrieved by all systems
– Choose a parameter k (typically 100)
– Take the top k documents as ranked in each submitted run
– The pool is the union of these sets of docs
– Between k and (# submitted runs) × k documents in the pool
– The (k+1)st document returned in one run is either irrelevant or ranked higher in another run
Give the pool to judges for relevance assessments
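A minimal sketch of depth-k pooling as described above (run names and documents are invented):

```python
def build_pool(runs, k=100):
    """Union of the top-k documents from every submitted run; this set is sent to the judges."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool

runs = {"sysA": ["d1", "d2", "d3", "d4"],
        "sysB": ["d2", "d5", "d1", "d6"]}
print(sorted(build_pool(runs, k=3)))   # ['d1', 'd2', 'd3', 'd5']
```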
87 From Donna Harman
88
Relevance assessments - Pooling
Conditions under which pooling works [Robertson]
– A range of different kinds of systems, including manual systems
– Reasonably deep pools (100+ from each system), but this depends on collection size
– The collection cannot be too big. Big is so relative…
89
Relevance assessments - Pooling
Advantage of pooling:
– Fewer documents must be manually assessed for relevance
Disadvantages of pooling:
– Can't be certain that all documents satisfying the query are found (recall values may not be accurate)
– Runs that did not participate in the pooling may be disadvantaged
– If only one run finds certain relevant documents, but ranks them lower than 100, it will not get credit for them
90
Relevance assessments
Pooling with randomized sampling
As the data collection grows, the top 100 may not be representative of the entire result set
– (i.e. the assumption that everything below the pool is not relevant no longer holds)
Add, to the pool, a set of documents randomly sampled from the entire retrieved set
– If the sampling is uniform, it is easy to reason about, but it may be too sparse as the collection grows
– Stratified sampling: get more from the top of the ranked list [Yilmaz et al.:2008]
91
Relevance assessments - incomplete
• The unavoidable conclusion is that we have to handle incomplete relevance assessments
– Consider unjudged = non-relevant
– Do not consider unjudged documents at all (i.e. compress the ranked lists)
• A new measure:
– BPref (binary preference): bpref = (1/R) * sum over relevant retrieved r of (1 - (# of n ranked above r) / min(R, N))
  r = a relevant returned document, R = # documents judged relevant, N = # documents judged non-relevant, n = a non-relevant returned document
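A sketch of bpref with the variables above, in the usual Buckley & Voorhees formulation; unjudged documents are simply skipped (the toy ranking is invented):

```python
def bpref(ranking, judged_relevant, judged_nonrelevant):
    """bpref = (1/R) * sum over retrieved relevant r of
    1 - (# judged non-relevant ranked above r) / min(R, N)."""
    R, N = len(judged_relevant), len(judged_nonrelevant)
    if R == 0 or N == 0:
        return 0.0
    denom = min(R, N)
    nonrel_above, total = 0, 0.0
    for doc in ranking:
        if doc in judged_nonrelevant:
            nonrel_above += 1
        elif doc in judged_relevant:
            total += 1 - min(nonrel_above, denom) / denom
        # unjudged documents are ignored
    return total / R

ranking = ["d1", "u1", "n1", "d2", "n2", "d3"]   # u1 is unjudged
print(bpref(ranking, {"d1", "d2", "d3", "d4"}, {"n1", "n2", "n3"}))   # 0.5
```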
92
Relevance assessments - incomplete
• BPref was designed to mimic MAP
• Soon after, induced AP and inferred AP were proposed
• If the judgment data is complete, they are equal to MAP
– inferred AP estimates the expectation of precision at rank k
93
not only are we incomplete, but we might also be inconsistent in our judgments
94
Relevance assessment - subjectivity
In TREC-CHEM'09 we had each topic evaluated by two students
– "conflicts" ranged between 2% and 33% (excluding a topic with 60% conflict)
– This all increased if we considered "strict disagreement"
In general, inter-evaluator agreement is rarely above 80%
There is little one can do about it
95
Relevance assessment - subjectivity
Good news:
– the "idiosyncratic nature of relevance judgments does not affect comparative results" (E. Voorhees)
– Mean Kendall tau between system rankings produced from different query relevance sets: 0.938
– Similar results held for: different query sets, different evaluation measures, different assessor types, single-opinion vs. group-opinion judgments
96
No assessors
Pooling assumes all relevant documents are found by the systems
– Take this assumption further
Voting-based relevance assessments
– Consider the top K only
(Soboroff et al., 2001)
97
Test Collections
Generally created as the result of an evaluation campaign
– TREC – Text REtrieval Conference (USA)
– CLEF – Cross Language Evaluation Forum (EU)
– NTCIR – NII Test Collection for IR Systems (JP)
– INEX – Initiative for the Evaluation of XML Retrieval
– …
First one and paradigm definer:
– The Cranfield Collection
  In the 1950s; aeronautics; 1400 queries, about 6000 documents; fully evaluated
98
TREC
Started in 1992
Always organised in the States, on the NIST campus
As the leader, introduced most of the jargon used in IR Evaluation:
– Topic = query / request for information
– Run = a ranked list of results
– Qrel = relevance judgements
99
TREC
Organised as a set of tracks that each focus on a particular sub-problem of IR
– E.g. Patient Records, Session, Chemical, Genome, Legal, Blog, Spam, Q&A, Novelty, Enterprise, Terabyte, Web, Video, Speech, OCR, Chinese, Spanish, Interactive, Filtering, Routing, Million Query, Ad-Hoc, Robust
– The set of tracks in a year depends on:
  interest of participants, fit to TREC, needs of sponsors, resource constraints
100
TREC
Call for participation → Task definition → Document procurement → Topic definition → IR experiments → Relevance assessments → Results evaluation → Results analysis → TREC conference → Proceedings publication
101
TREC – Task definition
Each Track has a set of Tasks. Examples of tasks from the Blog track:
1. Finding blog posts that contain opinions about the topic
2. Ranking positive and negative blog posts
3. (A separate baseline task to just find blog posts relevant to the topic)
4. Finding blogs that have a principal, recurring interest in the topic
102
TREC - Topics
For TREC, topics generally have a specific format (though not always):
– <ID>
– <title>: very short
– <description>: a brief statement of what would be a relevant document
– <narrative>: a long description, meant also for the evaluator to understand how to judge the topic
103
TREC - Topics
Example:
– <ID> 312
– <title> Hydroponics
– <description> Document will discuss the science of growing plants in water or some substance other than soil
– <narrative> A relevant document will contain specific information on the necessary nutrients, experiments, types of substrates, and/or any other pertinent facts related to the science of hydroponics. Related information includes, but is not limited to, the history of hydro- …
104
CLEF
Cross Language Evaluation Forum
– From 2010: Conference on Multilingual and Multimodal Information Access Evaluation
– Supported by the PROMISE Network of Excellence
Started in 2000
Grand challenge:
– Fully multilingual, multimodal IR systems
  capable of processing a query in any medium and any language
  finding relevant information in a multilingual multimedia collection
  and presenting it in the style most likely to be useful for the user
105
CLEF
• Previous tracks:
– Mono-, bi-, and multilingual text retrieval
– Interactive cross-language retrieval
– Cross-language spoken document retrieval
– QA in multiple languages
– Cross-language retrieval in image collections
– CL geographical retrieval
– CL video retrieval
– Multilingual information filtering
– Intellectual property
– Log file analysis
– Large-scale grid experiments
• From 2010
– Organised as a series of "labs"
106
MediaEval
Dedicated to evaluating new algorithms for multimedia access and retrieval
– emphasizes the 'multi' in multimedia
– focuses on the human and social aspects of multimedia tasks
  speech recognition, multimedia content analysis, music and audio analysis, user-contributed information (tags, tweets), viewer affective response, social networks, temporal and geo-coordinates
http://www.multimediaeval.org/
107
Test collections - summary
It is important to design the right experiment for the right IR task
– Web retrieval is very different from legal retrieval
The example of patent retrieval:
– High recall: a single missed document can invalidate a patent
– Session-based: single searches may involve days of cycles of results review and query reformulation
– Defendable: the process and results may need to be defended in court
108
Outline
Introduction
Kinds of evaluation
Retrieval Effectiveness evaluation
– Measures, Experimentation
– Test Collections
User-based evaluation
Discussion on Evaluation
Conclusion
109
User-based evaluation
Different levels of user involvement, based on subjectivity levels:
1. Relevant/non-relevant assessments
   – used largely in lab-like evaluation, as described before
2. User satisfaction evaluation
Some work on 1., very little on 2.
– User satisfaction is very subjective
– UIs play a major role
– Search dissatisfaction can be a result of the non-existence of relevant documents
110
User-based evaluation
User-based relevance assessments– Focus the user on each query-document pair
111
User-based evaluation
User-based relevance assessments
– Focus the user on each query-document pair
112
User-based evaluation
User-based relevance assessments– Focus the user on each query-document pair– Focus the user on query-document-document
113
User-based evaluation
User-based relevance assessments– Focus the user on each query-document pair– Focus the user on query-document-document
Relative judgements of documents:
"Is document X more relevant than document Y for the given query?"
– Many more assessments needed
– Better inter-annotator agreement [Rees and Schultz, 1967]
114
User-based evaluation
User-based relevance assessments– Focus the user on each query-document pair– Focus the user on query-document-document – Focus the user on lists of results
115
User-based evaluation
User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document
– Focus the user on lists of results
Image from Thomas and Hawking, Evaluation by comparing result sets in context, CIKM2006
116
User-based evaluation
User-based relevance assessments– Focus the user on each query-document pair– Focus the user on query-document-document – Focus the user on lists of results
Some issues, alternatives– Control for all sorts of user-based biases
117
User-based evaluation
User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document
– Focus the user on lists of results
Some issues, alternatives
– Control for all sorts of user-based biases
Image from Bailey, Thomas and Hawking, Does brandname influence perceived search result quality?, ADCS2007
118
User-based evaluation
User-based relevance assessments– Focus the user on each query-document pair– Focus the user on query-document-document – Focus the user on lists of results
Some issues, alternatives
– Control for all sorts of user-based biases
– Two-panel evaluation
  – limits the number of systems which can be evaluated
  – is unusable in real-life contexts
– Interspersed ranked list with click monitoring
119
Effectiveness evaluation: lab-like vs. user-focused
Results are mixed: some experiments show correlations, some do not
Do User Preferences and Evaluation Measures Line Up? (SIGIR 2010: Sanderson, Paramita, Clough, Kanoulas)
– shows the existence of correlations
– user preference is inherently user-dependent; domain-specific IR will be different
The relationship between IR effectiveness measures and user satisfaction (SIGIR 2007: Al-Maskari, Sanderson, Clough)
– strong correlation between user satisfaction and DCG, which disappeared when normalized to NDCG
120
Predicting performance
Future data and queries
– not absolute, but relative performance
– ad-hoc evaluations suffer in particular
– no comparison between lab and operational settings
  for justified reasons, but still none
– how much better must a system be?
  generally, we require statistical significance
[Trippe:2011]
121
Predictive performance
Future systems
Test collections are often used to prove we have a better system than the state of the art
– not all documents were evaluated
122
Predictive performance
Future systems
Test collections are often used to prove we have a better system than the state of the art
– not all documents were evaluated
– "retrofit" metrics that are not considered resilient to such evolution
  RBP [Webber:2009]
  Precision@n [Lipani:2014], Recall@n […]
Why do this?
– Precision@n and Recall@n are loved in industry
– Also in industry, technology migration steps are high (i.e. hold on to a system that 'works' until it is patently obvious it affects business performance)
123
Are Lab evals sufficient?
Patent search is an active process where the end-user engages in a process of understanding and interacting with the information
Evaluation needs a definition of success
– success ~ lower risk
  partly precision and recall
  partly (some argue the most important part) the intellectual and interactive role of the patent search system as a whole
A series of evaluation layers
– lab evals are now the lowest level
– to elevate them, they must measure risk and incentivize systems to provide estimates of confidence in the results they provide
[Trippe:2011]
124
Outline
Introduction
Kinds of evaluation
Retrieval Effectiveness evaluation
– Measures, Experimentation
– Test Collections
User-based evaluation
Discussion on Evaluation
Conclusion
125
Discussion on evaluation
Laboratory evaluation – good or bad?
– Rigorous testing
– Over-constrained
I usually make the comparison to a tennis racket:
– No evaluation of the device will tell you how well it will perform in real life – that largely depends on the user
– But the user will choose the device based on the lab evaluation
126
Discussion on evaluation
There is bias to account for
– E.g. the number of relevant documents per topic
127
Discussion on evaluation
Recall and recall-related measures are often contested [Cooper:73, p.95]
– “The involvement of unexamined documents in a performance formula has long been taken for granted as a perfectly natural thing, but if one stops to ponder the situation, it begins to appear most peculiar. … Surely a document which the system user has not been shown in any form, to which he has not devoted the slightest particle of time or attention during his use of the system output, and of whose very existence he is unaware, does that user neither harm nor good in his search”
Clearly not true in the legal & patent domains
128
Discussion on Evaluation
Realistic tasks and user models
– Evaluation has to be based on the available data sets; this creates the user model
– Tasks need to correspond to available techniques
Much literature on generating tasks
– Experts describe typical tasks
– Use of log files of various sorts
IR research is decades behind sociology in terms of user modeling – there is a place to learn from
129
Discussion on Evaluation
Competitiveness
– Most campaigns take pains to explain "This is not a competition – this is an evaluation"
Competitions are stimulating, but
– Participants are wary of taking part if they are not sure to win
  particularly commercial vendors
– Without special care from the organizers, it stifles creativity:
  the best way to win is to take last year's method and improve it a bit
  original approaches are risky
130
Discussion on Evaluation
Topical Relevance
What other kinds of relevance factors are there?
– diversity of information
– quality
– credibility
– ease of reading
131
Conclusion
• IR Evaluation is a research field in itself
• Without evaluation, research is pointless
– IR Evaluation research included
• Statistical significance testing is a must to validate results
• Most IR Evaluation exercises are laboratory experiments
– As such, care must be taken to match, to the extent possible, the real needs of the users
• Experiments in the wild are rare, small and domain-specific:
– VideOlympics (2007-2009)
– PatOlympics (2010-2012)
132
Bibliography
Test Collection Based Evaluation of Information Retrieval Systems, M. Sanderson, 2010
TREC – Experiment and Evaluation in Information Retrieval, E. Voorhees, D. Harman (eds.)
On the history of evaluation in IR, S. Robertson, 2008, Journal of Information Science
A Comparison of Statistical Significance Tests for Information Retrieval Evaluation, M. Smucker, J. Allan, B. Carterette, CIKM 2007
A Simple and Efficient Sampling Method for Estimating AP and NDCG, E. Yilmaz, E. Kanoulas, J. Aslam, SIGIR 2008
133
Bibliography
Do User Preferences and Evaluation Measures Line Up?, M. Sanderson, M. L. Paramita, P. Clough, E. Kanoulas, 2010
A Review of Factors Influencing User Satisfaction in Information Retrieval, A. Al-Maskari, M. Sanderson, 2010
Towards higher quality health search results: Automated quality rating of depression websites, D. Hawking, T. Tang, R. Sankaranarayana, K. Griffiths, N. Craswell, P. Bailey, 2007
Evaluating Sampling Methods for Uncooperative Collections, P. Thomas, D. Hawking, 2007
Comparing the Sensitivity of Information Retrieval Metrics, F. Radlinski, N. Craswell, 2010
Redundancy, Diversity and Interdependent Document Relevance, F. Radlinski, P. Bennett, B. Carterette, T. Joachims, 2009
Does Brandname Influence Perceived Search Result Quality? Yahoo!, Google, and WebKumara, P. Bailey, P. Thomas, D. Hawking, 2007
Methods for Evaluating Interactive Information Retrieval Systems with Users, D. Kelly, 2009
C-TEST: Supporting Novelty and Diversity in Testfiles for Search Tuning, D. Hawking, T. Rowlands, P. Thomas, 2009
Live Web Search Experiments for the Rest of Us, T. Jones, D. Hawking, R. Sankaranarayana, 2010
Quality and relevance of domain-specific search: A case study in mental health, T. Tang, N. Craswell, D. Hawking, K. Griffiths, H. Christensen, 2006
New methods for creating testfiles: Tuning enterprise search with C-TEST, D. Hawking, P. Thomas, T. Gedeon, T. Jones, T. Rowlands, 2006
A Field Experimental Approach to the Study of Relevance Assessments in Relation to Document Searching, A. M. Rees, D. G. Schultz. Final Report to the National Science Foundation, Volume II, Appendices. Clearinghouse for Federal Scientific and Technical Information, October 1967
The Water Filling Model and the Cube Test: Multi-dimensional Evaluation for Professional Search, J. Luo, C. Wing, H. Yang, M. Hearst, CIKM 2013
On sample sizes for non-matched-pair IR experiments, S. Robertson, 1990, Information Processing & Management
Splitting Water: Precision and Anti-Precision to Reduce Pool Bias, A. Lipani, M. Lupu, A. Hanbury, SIGIR 2015
Score adjustment for correction of pooling bias, W. Webber, L. A. F. Park, SIGIR 2009
On sample sizes for non-matched-pair IR experiments, S. Robertson, 1990, Information Processing & Management Lipani A, Lupu M, Hanbury A, Splitting Water: Precision and Anti-Precision to Reduce Pool Bias, SIGIR 2015 W. Webber and L. A. F. Park. Score adjustment for correction of pooling bias. In Proc. of SIGIR, 2009