Information Retrieval Evaluation
1
IR Evaluation
Mihai [email protected]
Chapter 8 of the Introduction to IR book
M. Sanderson. Test Collection Based Evaluation of Information Retrieval Systems. Foundations and Trends in IR, 2010
2
Outline
• Introduction
– Introduction to IR
• Kinds of evaluation
• Retrieval Effectiveness evaluation
– Measures, Experimentation
– Test Collections
• User-based evaluation
• Discussion on Evaluation
• Conclusion
3
Introduction
• Why?
– Put a figure on the benefit we get from a system
– Because without evaluation, there is no research
• Objective measurements
Information Retrieval
“Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Salton, 1968)
General definition that can be applied to many types of information and search applications
Primary focus of IR since the 50s has been on text and documents
Information Retrieval
Key insights of/for information retrieval:
– text has no meaning
  ฉันมีรถสีแดง
– but it is still the most informative source
  ฉันมีรถสีฟ้า is more similar to the above than คุณมีรถไฟฟ้า
– text is not random
  "I drive a red car" is more probable than "I drive a red horse", "A red car I drive", "Car red a drive I"
– meaning is defined by usage
  "I drive a truck" / "I drive a car" / "I drive the bus": truck / car / bus are similar in meaning
– term frequency (TF), document frequency (DF), TF-IDF, BM25 (Best Match 25)
– language models (uni-gram, bi-gram, n-gram)
– statistical semantics (latent semantic analysis, random indexing, deep learning)
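As a concrete illustration of the term-weighting idea listed above, here is a minimal sketch of one common TF-IDF variant (the exact formula differs between systems; the toy documents are invented):

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, all_docs):
    """One common TF-IDF variant: raw term frequency times log inverse document frequency."""
    tf = Counter(doc_tokens)[term]
    df = sum(1 for d in all_docs if term in d)           # document frequency
    idf = math.log(len(all_docs) / df) if df else 0.0    # inverse document frequency
    return tf * idf

docs = [["i", "drive", "a", "red", "car"],
        ["i", "drive", "a", "truck"],
        ["a", "red", "horse"]]
print(tf_idf("red", docs[0], docs))   # discriminative term, positive weight
print(tf_idf("a", docs[0], docs))     # occurs in every document, weight 0
```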
Big Issues in IR
Relevance
– What is it?
– Simple (and simplistic) definition: a relevant document contains the information that a person was looking for when they submitted a query to the search engine
– Many factors influence a person's decision about what is relevant: e.g. task, context, novelty, style
– Topical relevance (same topic) vs. user relevance (everything else)
Relevance
– Retrieval models define a view of relevance
– Ranking algorithms used in search engines are based on retrieval models
– Most models describe statistical properties of text rather than linguistic ones, i.e. counting simple text features such as words instead of parsing and analyzing the sentences
  The statistical approach to text processing started with Luhn in the 1950s
  Linguistic features can be part of a statistical model
Big Issues in IR
Evaluation
– Experimental procedures and measures for comparing system output with user expectations
  Originated in the Cranfield experiments in the 1960s
– IR evaluation methods are now used in many fields
– Typically use a test collection of documents, queries, and relevance judgments
  Most commonly used are the TREC collections
– Recall and precision are two examples of effectiveness measures
Big Issues in IR
Users and Information Needs
– Search evaluation is user-centered
– Keyword queries are often poor descriptions of actual information needs
– Interaction and context are important for understanding user intent
– Query refinement techniques such as query expansion, query suggestion, and relevance feedback improve ranking
13
Introduction
• Why?
– Put a figure on the benefit we get from a system
– Because without evaluation, there is no research
• Why is this a research field in itself?
– Because there are many kinds of IR, with different evaluation criteria
– Because it's difficult
• Why is it difficult?
– Because it involves human subjectivity (document relevance)
– Because of the amount of data involved (who can sit down and evaluate 1,750,000 documents returned by Google for 'university vienna'?)
14
Kinds of evaluation
15
Kinds of evaluation
• "Efficient and effective system"
• Time and space: efficiency
– Generally constrained by pre-development specification
  E.g. real-time answers vs. batch jobs
  E.g. index-size constraints
– Easy to measure
• Good results: effectiveness
– Harder to define --> more research into it
• And…
16
Kinds of evaluation (cont.)
• User studies
– Does a 2% increase in some retrieval performance measure actually make a user happier?
– Does displaying a text snippet improve usability even if the underlying method is 10% weaker than some other method?
– Hard to do
– Mostly anecdotal examples
– Many IR people don't like to do it (though that is starting to change)
17
Kinds of evaluation (cont.)
Intrinsic
– "internal"
– the ultimate goal is the retrieved set
Extrinsic
– "external"
– evaluated in the context of the usage of the retrieval tool
18
What to measure in an IR system?
1966, Cleverdon:
1. coverage – the extent to which relevant matter exists in the system
2. time lag ~ efficiency
3. presentation
4. effort on the part of the user to answer his information need
5. recall
6. precision
19
What to measure in an IR system?
1966, Cleverdon:
1. coverage – the extent to which relevant matter exists in the system
2. time lag ~ efficiency
3. presentation
4. effort on the part of the user to answer his information need
5. recall
6. precision
Effectiveness
A desirable measure of retrieval performance would have the following properties: 1, it would be a measure of effectiveness. 2, it would not be
confounded by the relative willingness of the system to emit items. 3, it would be a single number – in preference, for example, to a pair of numbers which
may co-vary in a loosely specified way, or a curve representing a table of several pairs of numbers 4, it would allow complete ordering of different
performances, and assess the performance of any one system in absolute terms. Given a measure with these properties, we could be confident of
having a pure and valid index of how well a retrieval system (or method) were performing the function it was primarily designed to accomplish, and we could
reasonably ask questions of the form “Shall we pay X dollars for Y units of effectiveness?” (Swets, 1967)
20
Outline
• Introduction
• Kinds of evaluation
• Retrieval Effectiveness evaluation
– Measures
– Test Collections
• User-based evaluation
• Discussion on Evaluation
• Conclusion
21
Efficiency Metrics
22
Retrieval Effectiveness
Precision
– How happy are we with what we've got
Recall
– How much more we could have had

Precision = (number of relevant documents retrieved) / (number of documents retrieved)

Recall = (number of relevant documents retrieved) / (number of relevant documents)
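A minimal sketch of these two set-based definitions, treating retrieved and relevant as plain sets of document ids (the example values are invented):

```python
def precision(retrieved, relevant):
    """Relevant documents retrieved / documents retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Relevant documents retrieved / relevant documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}
print(precision(retrieved, relevant))   # 2/4 = 0.5
print(recall(retrieved, relevant))      # 2/3 ≈ 0.67
```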
23
Retrieval Effectiveness
[Figure: the set of retrieved documents and the set of relevant documents overlap within the universe of documents]
24
Precision and Recall
25
Retrieval effectiveness
What if we don't like this twin-measure approach? A solution:
– Van Rijsbergen's E-measure
– With a special case: the harmonic mean
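The formulas themselves did not survive the transcript; for reference, the usual textbook forms are (with α weighting precision against recall):

```latex
E = 1 - \frac{1}{\alpha \frac{1}{P} + (1 - \alpha)\frac{1}{R}}
% special case \alpha = 1/2: the complement of E is the harmonic mean of P and R (the F-measure)
F = 1 - E = \frac{2PR}{P + R}
```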
27
Retrieval effectiveness
Tools we need:
– A set of documents (the "dataset")
– A set of questions/queries/topics
– For each topic, and for each document, a decision: relevant or not relevant
Let's assume for the moment that's all we need and that we have it
28
Retrieval Effectiveness
• Precision and Recall generally plotted as a “Precision-Recall curve”
[Figure: precision (y-axis, 0 to 1) plotted against recall (x-axis, 0 to 1); precision tends to drop as the size of the retrieved set increases]
• They do not play well together
29
Precision-Recall Curves
How to build a Precision-Recall Curve?
– For one query at a time
– Make checkpoints on the recall axis
30
Precision-Recall Curves
How to build a Precision-Recall Curve?
– For one query at a time
– Make checkpoints on the recall axis
31
Precision-Recall Curves
• How to build a Precision-Recall Curve?
– For one query at a time
– Make checkpoints on the recall axis
– Repeat for all queries
32
Precision-Recall Curves
• And the average is the system’s P-R curve
[Figure: the averaged precision-recall curve; precision falls as the number of retrieved documents increases]
• We can compare systems by comparing the curves
33
Precision-Recall Graph (reality check)
34
Interpolation
To average graphs, calculate precision at standard recall levels:
P(R) = max{ P' : R' >= R, (R', P') in S }
– where S is the set of observed (R, P) points
– defines precision at any recall level as the maximum precision observed in any recall-precision point at a higher recall level
– produces a step function
– defines precision at recall 0.0
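A small sketch of this interpolation rule: at each standard recall level, report the highest precision observed at that recall level or beyond (the observed points are invented for the example):

```python
def interpolated_precision(observed, levels=None):
    """observed: list of (recall, precision) points for one query.
    Returns interpolated precision at the standard recall levels (a step function)."""
    if levels is None:
        levels = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0
    return [max((p for r, p in observed if r >= level), default=0.0)
            for level in levels]

points = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.44), (1.0, 0.3)]
print(interpolated_precision(points))
```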
35
Interpolation
36
Average Precision at Standard Recall Levels
• Recall-precision graph plotted by simply joining the average precision points at the standard recall levels
37
Average Recall-Precision Graph
38
Graph for 50 Queries
39
Retrieval Effectiveness
• Not quite done yet…
– When to stop retrieving?
  • Both P and R imply a cut-off value
– How about graded relevance?
  • Some documents may be more relevant to the question than others
– How about ranking?
  • Can a document retrieved at position 1,234,567 still be considered useful?
– Who says which documents are relevant and which are not?
40
Single-value measures
• Fix a "reasonable" cutoff
– R-precision: precision at R, where R is the number of relevant documents (fix the number of desired documents)
– Reciprocal rank (RR): 1 / rank of the first relevant document in the ranked list returned
• Make it less sensitive to the cutoff
– Average precision
  • For each query: R = # relevant documents, i = rank, k = # retrieved documents, P(i) = precision at rank i, rel(i) = 1 if the document at rank i is relevant, 0 otherwise
  • AP = (1/R) * sum_{i=1..k} P(i) * rel(i)
– For each system:
  • Compute the mean of these averages: Mean Average Precision (MAP) – one of the most used measures
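A minimal sketch of average precision and MAP following the definition above (document ids and judgments are invented):

```python
def average_precision(ranking, relevant):
    """AP for one query: sum of P(i) over the ranks i holding relevant documents,
    divided by the total number of relevant documents R."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:                    # rel(i) = 1
            hits += 1
            total += hits / i                  # P(i)
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """MAP: the mean of the per-query average precisions."""
    aps = [average_precision(ranking, rel) for ranking, rel in queries]
    return sum(aps) / len(aps) if aps else 0.0

queries = [(["d3", "d1", "d7", "d2"], {"d1", "d2", "d9"})]
print(mean_average_precision(queries))        # (1/2 + 2/4) / 3 ≈ 0.33
```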
41
R- Precision
Precision at the R-th position in the ranking of results for a query that has R relevant documents.
rank  doc  relevant
 1    588  x
 2    589  x
 3    576
 4    590  x
 5    986
 6    592  x
 7    984
 8    988
 9    578
10    985
11    103
12    591
13    772  x
14    990
R = # of relevant docs = 6
R-Precision = 4/6 = 0.67
42
Averaging Across Queries
43
Average Precision
44
MAP
45
Retrieval Effectiveness
• Not quite done yet…
– When to stop retrieving?
  • Both P and R imply a cut-off value
– How about graded relevance?
  • Some documents may be more relevant to the question than others
– How about ranking?
  • Can a document retrieved at position 1,234,567 still be considered useful?
– Who says which documents are relevant and which are not?
46
Cumulative Gain
• For each document d and query q, define rel(d, q) >= 0
• The higher the value, the more relevant the document is to the query
• Pitfalls:
– Graded relevance introduces even more ambiguity in practice
– With great flexibility comes great responsibility to justify parameter values
47
Retrieval Effectiveness
• Not quite done yet…
– When to stop retrieving?
  • Both P and R imply a cut-off value
– How about graded relevance?
  • Some documents may be more relevant to the question than others
– How about ranking?
  • Can a document retrieved at position 1,234,567 still be considered useful?
– Who says which documents are relevant and which are not?
48
Discounted Cumulative Gain
Popular measure for evaluating web search and related tasks
Two assumptions:
– Highly relevant documents are more useful than marginally relevant documents
– The lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined
49
Discounted Cumulative Gain
Uses graded relevance as a measure of the usefulness, or gain, from examining a document
Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks
Typical discount is 1/log2(rank)
– With base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3
50
Discounted Cumulative Gain
DCG is the total gain accumulated at a particular rank p:
DCG_p = rel_1 + sum_{i=2..p} rel_i / log2(i)   [Jarvelin:2000]
Alternative formulation:
DCG_p = sum_{i=1..p} (2^rel_i - 1) / log2(1 + i)   [Burges:2005]
– used by some web search companies
– emphasis on retrieving highly relevant documents
51
Discounted Cumulative Gain
• Neither CG nor DCG can be used for comparison across topics! Both depend on the number of relevant documents per topic.
52
Normalised Discounted Cumulative Gain
Compute the CG / DCG of the optimal return set
E.g. (5,5,5,4,4,3,3,3,3,2,2,2,1,1,1,1,1,1,0,0,0,0, …) has the Ideal Discounted Cumulative Gain: IDCG
Normalise: nDCG = DCG / IDCG
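A sketch of DCG and nDCG as described on the last few slides, using the original Järvelin & Kekäläinen discount; the gain vectors are toy examples:

```python
import math

def dcg(gains, p=None):
    """DCG_p = rel_1 + sum_{i=2..p} rel_i / log2(i)."""
    gains = gains[:p] if p else gains
    return sum(g if i == 1 else g / math.log2(i)
               for i, g in enumerate(gains, start=1))

def ndcg(gains, ideal_gains, p=None):
    """nDCG_p = DCG_p of the run / DCG_p of the ideal ranking (IDCG)."""
    idcg = dcg(sorted(ideal_gains, reverse=True), p)
    return dcg(gains, p) / idcg if idcg > 0 else 0.0

run_gains = [3, 2, 3, 0, 1, 2]          # graded relevance of the ranked results
all_judged = [3, 3, 3, 2, 2, 2, 1, 0]   # all judged gains for the topic
print(dcg(run_gains, p=6))
print(ndcg(run_gains, all_judged, p=6))
```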
53
some more variations
E.g. (5,5,5,4,4,3,3,3,3,2,2,2,1,1,1,1,1,1,0,0,0,0, …) has the Ideal Discounted Cumulative Gain: IDCG
"our rank": (5,2,0,0,5,2,4,0,0,1,4,…)
Two ranked lists:
– rank correlation measures
  Kendall's tau (similarity of orderings)
  Pearson's rho (linear correlation between variables)
  Spearman's rho (Pearson for ranks)
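A quick way to compute the three correlation coefficients mentioned above, here applied to two toy gain vectors with scipy:

```python
from scipy.stats import kendalltau, pearsonr, spearmanr

ideal = [5, 5, 5, 4, 4, 3, 3, 3, 3, 2, 2]
ours  = [5, 2, 0, 0, 5, 2, 4, 0, 0, 1, 4]

tau, _ = kendalltau(ideal, ours)     # similarity of the two orderings
r, _ = pearsonr(ideal, ours)         # linear correlation between the values
rho, _ = spearmanr(ideal, ours)      # Pearson computed on the ranks
print(tau, r, rho)
```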
54
some more variations
Rank-biased precision (RBP)
– "log-based discount is not a good model of users' behaviour"
– imagine the probability p of the user moving on to the next document
  p ~ 0.95 vs. p ~ 0.0
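A minimal sketch of RBP with binary relevance, where p is the persistence parameter mentioned on the slide (the relevance vector is invented):

```python
def rbp(rels, p=0.95):
    """Rank-biased precision: (1 - p) * sum_i rel_i * p^(i-1),
    with p the probability that the user moves on to the next document."""
    return (1 - p) * sum(rel * p ** (i - 1) for i, rel in enumerate(rels, start=1))

rels = [1, 0, 1, 1, 0, 0, 1]
print(rbp(rels, p=0.95))   # patient user: deep results still contribute
print(rbp(rels, p=0.1))    # impatient user: almost only rank 1 matters
```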
55
Time-based calibration
Assumption
– The objective of the search engine is to improve the efficiency of an information seeking task
Extend nDCG to replace the rank-based discount with a time-based function
(Smucker and Clarke, 2011)
[Formulas: gain, normalization, and decay expressed as functions of the time needed to reach item k in the ranked list]
56
The water filling model (Luo et al, 2013)
and the corresponding Cube Test (CT)
– also for professional search, to capture embedded subtopics
– no assumption of linear traversal of documents
– takes time into account
– potential cap on the amount of information taken into account
– high discriminative power
57
Other diversity metrics
Several aspects of the topic might [need to] be covered
– Aspectual recall/precision
– The discount may take into account previously seen aspects
– α-NDCG: NDCG where the gain of a document is reduced for aspects that have already been seen higher in the ranking
58
Other measures
• There are many IR measures!
• trec_eval is a little program that computes many of them
– 37 in v9.0, many of which are multi-point (e.g. Precision @10, @20, …)
• http://trec.nist.gov/trec_eval/
• "there is a measure to make anyone a winner"
– Not really true, but still…
59
Other measures
• How about correlations between measures?
• Kendall tau values, from Voorhees and Harman, 2004
• Overall they correlate
            P(30)  R-Prec  MAP   .5 prec  R(1,1000)  Rel Ret  MRR
P(10)       0.88   0.81    0.79  0.78     0.78       0.77     0.77
P(30)              0.87    0.84  0.82     0.80       0.79     0.72
R-Prec                     0.93  0.87     0.83       0.83     0.67
MAP                              0.88     0.85       0.85     0.64
.5 prec                                   0.77       0.78     0.63
R(1,1000)                                            0.92     0.67
Rel ret                                                       0.66
60
Topic sets
Topic selection
– In early TREC, candidate topics were rejected if ambiguous
Are all topics equal?
– Mean Average Precision uses the arithmetic mean; GMAP uses the geometric mean
– Classical Test Theory experiments (Bodoff and Li, 2007) identified outliers that could change the rankings
MAP: a change in AP from 0.05 to 0.1 has the same effect as a change from 0.25 to 0.3
GMAP: a change in AP from 0.05 to 0.1 has the same effect as a change from 0.25 to 0.5
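A tiny numerical illustration of this difference between the arithmetic mean (MAP) and the geometric mean (GMAP); the AP values are invented:

```python
import math

def gmap(aps, eps=1e-5):
    """Geometric mean of per-topic AP, with a small epsilon to cope with zeros."""
    return math.exp(sum(math.log(ap + eps) for ap in aps) / len(aps))

baseline = [0.05, 0.40, 0.60]
improved = [0.10, 0.40, 0.60]          # only the hardest topic improves (0.05 -> 0.10)
print(sum(improved) / 3 - sum(baseline) / 3)   # MAP gain: the same +0.0167 as a 0.25 -> 0.30 change
print(gmap(improved) - gmap(baseline))         # GMAP gain: doubling AP on a hard topic counts for much more
```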
61
Measure measures
What is the best measure?
– What makes a measure better?
Match to task
– E.g. known-item search: MRR
Something more quantitative?
– Correlations between measures
  Does the system ranking change when using different measures?
  Useful to group measures
– Ability to distinguish between runs
– Measure stability
62
Ad-hoc quiz
It was necessary to normalize the discounted cumulative gain (NDCG) because…
– of the assumption of a normal probability distribution
– to be able to compare across topics
– normalization is always better
– to be able to average across topics
63
Ad-hoc quiz
It was necessary to normalize the discounted cumulative gain (NDCG) because…
– of the assumption of a normal probability distribution
– to be able to compare across topics
– normalization is always better
– to be able to average across topics
(Answer: to be able to compare across topics; CG and DCG depend on the number of relevant documents per topic.)
64
Measure stability
Success criteria:
– A measure is good if it is able to predict differences between systems (on average, over future queries)
Method
– Split the collection in 2
  1. Use one half as a train collection to rank runs
  2. Use the other half as a test collection to compute how many pair-wise comparisons hold
Observations
– Cut-off measures are less stable than MAP
65
Measure stability
Success criteria:
– A measure is good if it is able to predict differences between systems (on average, over future queries)
Method
– Split the collection in 2
  1. Use one half as a train collection to rank runs
  2. Use the other half as a test collection to compute how many pair-wise comparisons hold
Observations
– Cut-off measures are less stable than MAP
Any other criteria for measure quality?
66
Measure measures
We started with opinions from the '60s and have since seen a number of measures – have the targets changed?
7 numeric properties of effectiveness metrics (Moffat, 2013)
68
7 properties of effectiveness metrics
Boundedness – the set of scores attainable by the metric is bounded, usually in [0,1]
Monotonicity – if a ranking of length k is extended so that k+1 elements are included, the score never decreases
Convergence – if a document outside the top k is swapped with a less relevant document inside the top k, the score strictly increases
Top-weightedness – if a document within the top k is swapped with a less relevant one higher in the ranking, the score strictly increases
Localization – a score at depth k can be computed based solely on knowledge of the documents that appear in the top k
Completeness – a score can be calculated even if the query has no relevant documents
Realizability – provided that the collection has at least one relevant document, it is possible for the score at depth k to be maximal.
69
So far
so far: introduction, metrics
we are now able to say "System A is better than System B"
or are we? Remember:
– we only have limited data
– potential future applications are unbounded
a very strong statement!
70
Statistical validity
Whatever evaluation metric is used, all experiments must be statistically valid, i.e. differences must not be the result of chance
[Figure: MAP values on a scale from 0 to 0.2]
71
Statistical validity
• Ingredients of a significance test
– A test statistic (e.g. the difference between AP values)
– A null hypothesis (e.g. "there is no difference between the two systems")
  This gives us a particular distribution of the test statistic
– An alternative hypothesis (one- or two-tailed tests)
  Don't change it after the test
– A significance level, computed by taking the actual value of the test statistic and determining how likely it is to see this value given the distribution implied by the null hypothesis
• P-value
• If the p-value is low, we can feel confident that we can reject the null hypothesis: the systems are different
72
Statistical validity
Common practice is to declare systems different when the p-value <= 0.05
A few tests
– Randomization tests
– Wilcoxon Signed Rank test
– Sign test
– Bootstrap test
– Student's paired t-test
See the recent discussion in SIGIR Forum
– T. Sakai, Statistical Reform in Information Retrieval?
  effect sizes, confidence intervals
73
Statistical validity
How do we increase the statistical validity of an experiment?
By increasing the number of topics
– The more topics, the more confident we are that the difference between average scores will be significant
What's the minimum number of topics?
42
• Depends, but
• TREC started with 50
• Below 25 is generally considered not significant
74
Example Experimental Results
75
t-Test
Assumption is that the difference between the effectiveness values is a sample from a normal distribution
Null hypothesis is that the mean of the distribution of differences is zero
Test statistic: t = mean(d) / (sd(d) / sqrt(n)), where d is the vector of per-topic differences and n the number of topics
– for the example, t = 2.33
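A sketch of how this paired test is usually run in practice with scipy (the per-topic AP values are made up for illustration):

```python
from scipy.stats import ttest_rel

# average precision per topic for two systems (toy numbers, ten topics)
system_a = [0.25, 0.43, 0.39, 0.75, 0.43, 0.15, 0.20, 0.52, 0.49, 0.50]
system_b = [0.35, 0.54, 0.43, 0.75, 0.68, 0.35, 0.30, 0.50, 0.58, 0.75]

t, p = ttest_rel(system_b, system_a)   # paired t-test on the per-topic differences
print(f"t = {t:.2f}, p = {p:.3f}")     # declare a significant difference if p <= 0.05
```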
76
t-Test: t = 2.33
77
t-Test: t = 2.33
78
Statistical Validity - example
79
80
81
82
83
Summary
so far
– introduction
– metrics
next
– where to get ground truth
– some more metrics
– discussion
84
Retrieval Effectiveness
• Not quite done yet…
– When to stop retrieving?
  • Both P and R imply a cut-off value
– How about graded relevance?
  • Some documents may be more relevant to the question than others
– How about ranking?
  • Can a document retrieved at position 1,234,567 still be considered useful?
– Who says which documents are relevant and which are not?
85
Relevance assessments
• Ideally
– Sit down and look at all documents
• Practically
– The ClueWeb09 collection has
  • 1,040,809,705 web pages, in 10 languages
  • 5 TB compressed (25 TB uncompressed)
– No way to do this exhaustively
– Look only at the set of returned documents
• Assumption: if there are enough systems being tested and not one of them returned a document, the document is not relevant
86
Relevance assessments - Pooling
Combine the results retrieved by all systems
– Choose a parameter k (typically 100)
– Take the top k documents as ranked in each submitted run
– The pool is the union of these sets of docs
– Between k and (# submitted runs) × k documents in the pool
– The (k+1)st document returned in one run is either irrelevant or ranked higher in another run
Give the pool to judges for relevance assessments
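A minimal sketch of depth-k pooling as described above (run names and documents are invented):

```python
def build_pool(runs, k=100):
    """Union of the top-k documents from every submitted run; this set is sent to the judges."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool

runs = {"sysA": ["d1", "d2", "d3", "d4"],
        "sysB": ["d2", "d5", "d1", "d6"]}
print(sorted(build_pool(runs, k=3)))   # ['d1', 'd2', 'd3', 'd5']
```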
87 From Donna Harman
88
Relevance assessments - Pooling
Conditions under which pooling works [Robertson]
– A range of different kinds of systems, including manual systems
– Reasonably deep pools (100+ from each system), but this depends on collection size
– The collection cannot be too big. Big is so relative…
89
Relevance assessments - Pooling
Advantage of pooling:
– Fewer documents must be manually assessed for relevance
Disadvantages of pooling:
– Can't be certain that all documents satisfying the query are found (recall values may not be accurate)
– Runs that did not participate in the pooling may be disadvantaged
– If only one run finds certain relevant documents, but ranks them lower than 100, it will not get credit for them
90
Relevance assessments
Pooling with randomized sampling
As the data collection grows, the top 100 may not be representative of the entire result set
– (i.e. the assumption that everything below the pool is not relevant no longer holds)
Add, to the pool, a set of documents randomly sampled from the entire retrieved set
– If the sampling is uniform, it is easy to reason about, but it may be too sparse as the collection grows
– Stratified sampling: get more from the top of the ranked list [Yilmaz et al.:2008]
91
Relevance assessments - incomplete
• The unavoidable conclusion is that we have to handle incomplete relevance assessments
– Consider unjudged = non-relevant
– Do not consider unjudged documents at all (i.e. compress the ranked lists)
• A new measure:
– BPref (binary preference): bpref = (1/R) * sum over relevant retrieved r of (1 - (# of n ranked above r) / min(R, N))
  r = a relevant returned document, R = # documents judged relevant, N = # documents judged non-relevant, n = a non-relevant returned document
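A sketch of bpref with the variables above, in the usual Buckley & Voorhees formulation; unjudged documents are simply skipped (the toy ranking is invented):

```python
def bpref(ranking, judged_relevant, judged_nonrelevant):
    """bpref = (1/R) * sum over retrieved relevant r of
    1 - (# judged non-relevant ranked above r) / min(R, N)."""
    R, N = len(judged_relevant), len(judged_nonrelevant)
    if R == 0 or N == 0:
        return 0.0
    denom = min(R, N)
    nonrel_above, total = 0, 0.0
    for doc in ranking:
        if doc in judged_nonrelevant:
            nonrel_above += 1
        elif doc in judged_relevant:
            total += 1 - min(nonrel_above, denom) / denom
        # unjudged documents are ignored
    return total / R

ranking = ["d1", "u1", "n1", "d2", "n2", "d3"]   # u1 is unjudged
print(bpref(ranking, {"d1", "d2", "d3", "d4"}, {"n1", "n2", "n3"}))   # 0.5
```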
92
Relevance assessments - incomplete
• BPref was designed to mimic MAP
• Soon after, induced AP and inferred AP were proposed
• If the judgment data is complete, they are equal to MAP
– inferred AP estimates the expectation of precision at rank k
93
not only are we incomplete, but we might also be inconsistent in our judgments
94
Relevance assessment - subjectivity
In TREC-CHEM'09 we had each topic evaluated by two students
– "conflicts" ranged between 2% and 33% (excluding a topic with 60% conflict)
– This all increased if we considered "strict disagreement"
In general, inter-evaluator agreement is rarely above 80%
There is little one can do about it
95
Relevance assessment - subjectivity
Good news:
– the "idiosyncratic nature of relevance judgments does not affect comparative results" (E. Voorhees)
– Mean Kendall tau between system rankings produced from different query relevance sets: 0.938
– Similar results held for: different query sets, different evaluation measures, different assessor types, single-opinion vs. group-opinion judgments
96
No assessors
Pooling assumes all relevant documents are found by the systems
– Take this assumption further
Voting-based relevance assessments
– Consider the top K only
(Soboroff et al., 2001)
97
Test Collections
Generally created as the result of an evaluation campaign
– TREC – Text REtrieval Conference (USA)
– CLEF – Cross Language Evaluation Forum (EU)
– NTCIR – NII Test Collection for IR Systems (JP)
– INEX – Initiative for the Evaluation of XML Retrieval
– …
First one and paradigm definer:
– The Cranfield Collection
  In the 1950s; aeronautics; 1400 queries, about 6000 documents; fully evaluated
98
TREC
Started in 1992
Always organised in the States, on the NIST campus
As the leader, introduced most of the jargon used in IR Evaluation:
– Topic = query / request for information
– Run = a ranked list of results
– Qrel = relevance judgements
99
TREC
Organised as a set of tracks that each focus on a particular sub-problem of IR
– E.g. Patient Records, Session, Chemical, Genome, Legal, Blog, Spam, Q&A, Novelty, Enterprise, Terabyte, Web, Video, Speech, OCR, Chinese, Spanish, Interactive, Filtering, Routing, Million Query, Ad-Hoc, Robust
– The set of tracks in a year depends on:
  interest of participants, fit to TREC, needs of sponsors, resource constraints
100
TREC
Call for participation → Task definition → Document procurement → Topic definition → IR experiments → Relevance assessments → Results evaluation → Results analysis → TREC conference → Proceedings publication
101
TREC – Task definition
Each Track has a set of Tasks. Examples of tasks from the Blog track:
1. Finding blog posts that contain opinions about the topic
2. Ranking positive and negative blog posts
3. (A separate baseline task to just find blog posts relevant to the topic)
4. Finding blogs that have a principal, recurring interest in the topic
102
TREC - Topics
For TREC, topics generally have a specific format (though not always):
– <ID>
– <title>: very short
– <description>: a brief statement of what would be a relevant document
– <narrative>: a long description, meant also for the evaluator to understand how to judge the topic
103
TREC - Topics
Example:
– <ID> 312
– <title> Hydroponics
– <description> Document will discuss the science of growing plants in water or some substance other than soil
– <narrative> A relevant document will contain specific information on the necessary nutrients, experiments, types of substrates, and/or any other pertinent facts related to the science of hydroponics. Related information includes, but is not limited to, the history of hydro- …
104
CLEF
Cross Language Evaluation Forum
– From 2010: Conference on Multilingual and Multimodal Information Access Evaluation
– Supported by the PROMISE Network of Excellence
Started in 2000
Grand challenge:
– Fully multilingual, multimodal IR systems
  capable of processing a query in any medium and any language
  finding relevant information in a multilingual multimedia collection
  and presenting it in the style most likely to be useful for the user
105
CLEF
• Previous tracks:
– Mono-, bi-, and multilingual text retrieval
– Interactive cross-language retrieval
– Cross-language spoken document retrieval
– QA in multiple languages
– Cross-language retrieval in image collections
– CL geographical retrieval
– CL video retrieval
– Multilingual information filtering
– Intellectual property
– Log file analysis
– Large-scale grid experiments
• From 2010
– Organised as a series of "labs"
106
MediaEval
Dedicated to evaluating new algorithms for multimedia access and retrieval
– emphasizes the 'multi' in multimedia
– focuses on the human and social aspects of multimedia tasks
  speech recognition, multimedia content analysis, music and audio analysis, user-contributed information (tags, tweets), viewer affective response, social networks, temporal and geo-coordinates
http://www.multimediaeval.org/
107
Test collections - summary
It is important to design the right experiment for the right IR task
– Web retrieval is very different from legal retrieval
The example of patent retrieval:
– High recall: a single missed document can invalidate a patent
– Session-based: single searches may involve days of cycles of results review and query reformulation
– Defendable: the process and results may need to be defended in court
108
Outline
Introduction
Kinds of evaluation
Retrieval Effectiveness evaluation
– Measures, Experimentation
– Test Collections
User-based evaluation
Discussion on Evaluation
Conclusion
109
User-based evaluation
Different levels of user involvement, based on subjectivity levels:
1. Relevant/non-relevant assessments
   – used largely in lab-like evaluation, as described before
2. User satisfaction evaluation
Some work on 1., very little on 2.
– User satisfaction is very subjective
– UIs play a major role
– Search dissatisfaction can be a result of the non-existence of relevant documents
110
User-based evaluation
User-based relevance assessments– Focus the user on each query-document pair
111
User-based evaluation
User-based relevance assessments
– Focus the user on each query-document pair
112
User-based evaluation
User-based relevance assessments– Focus the user on each query-document pair– Focus the user on query-document-document
113
User-based evaluation
User-based relevance assessments– Focus the user on each query-document pair– Focus the user on query-document-document
Relative judgements of documents:
"Is document X more relevant than document Y for the given query?"
– Many more assessments needed
– Better inter-annotator agreement [Rees and Schultz, 1967]
114
User-based evaluation
User-based relevance assessments– Focus the user on each query-document pair– Focus the user on query-document-document – Focus the user on lists of results
115
User-based evaluation
User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document
– Focus the user on lists of results
Image from Thomas and Hawking, Evaluation by comparing result sets in context, CIKM2006
116
User-based evaluation
User-based relevance assessments– Focus the user on each query-document pair– Focus the user on query-document-document – Focus the user on lists of results
Some issues, alternatives– Control for all sorts of user-based biases
117
User-based evaluation
User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document
– Focus the user on lists of results
Some issues, alternatives
– Control for all sorts of user-based biases
Image from Bailey, Thomas and Hawking, Does brandname influence perceived search result quality?, ADCS2007
118
User-based evaluation
User-based relevance assessments– Focus the user on each query-document pair– Focus the user on query-document-document – Focus the user on lists of results
Some issues, alternatives
– Control for all sorts of user-based biases
– Two-panel evaluation
  – limits the number of systems which can be evaluated
  – is unusable in real-life contexts
– Interspersed ranked list with click monitoring
119
Effectiveness evaluation: lab-like vs. user-focused
Results are mixed: some experiments show correlations, some do not
Do User Preferences and Evaluation Measures Line Up? (SIGIR 2010: Sanderson, Paramita, Clough, Kanoulas)
– shows the existence of correlations
– user preference is inherently user-dependent; domain-specific IR will be different
The relationship between IR effectiveness measures and user satisfaction (SIGIR 2007: Al-Maskari, Sanderson, Clough)
– strong correlation between user satisfaction and DCG, which disappeared when normalized to NDCG
120
Predicting performance
Future data and queries
– not absolute, but relative performance
– ad-hoc evaluations suffer in particular
– no comparison between lab and operational settings
  for justified reasons, but still none
– how much better must a system be?
  generally, we require statistical significance
[Trippe:2011]
121
Predictive performance
Future systems
Test collections are often used to prove we have a better system than the state of the art
– not all documents were evaluated
122
Predictive performance
Future systems
Test collections are often used to prove we have a better system than the state of the art
– not all documents were evaluated
– "retrofit" metrics that are not considered resilient to such evolution
  RBP [Webber:2009]
  Precision@n [Lipani:2014], Recall@n […]
Why do this?
– Precision@n and Recall@n are loved in industry
– Also in industry, technology migration steps are high (i.e. hold on to a system that 'works' until it is patently obvious it affects business performance)
123
Are Lab evals sufficient?
Patent search is an active process where the end-user engages in a process of understanding and interacting with the information
Evaluation needs a definition of success
– success ~ lower risk
  partly precision and recall
  partly (some argue the most important part) the intellectual and interactive role of the patent search system as a whole
A series of evaluation layers
– lab evals are now the lowest level
– to elevate them, they must measure risk and incentivize systems to provide estimates of confidence in the results they provide
[Trippe:2011]
124
Outline
Introduction
Kinds of evaluation
Retrieval Effectiveness evaluation
– Measures, Experimentation
– Test Collections
User-based evaluation
Discussion on Evaluation
Conclusion
125
Discussion on evaluation
Laboratory evaluation – good or bad?
– Rigorous testing
– Over-constrained
I usually make the comparison to a tennis racket:
– No evaluation of the device will tell you how well it will perform in real life – that largely depends on the user
– But the user will choose the device based on the lab evaluation
126
Discussion on evaluation
There is bias to account for
– E.g. the number of relevant documents per topic
127
Discussion on evaluation
Recall and recall-related measures are often contested [Cooper:73, p.95]
– “The involvement of unexamined documents in a performance formula has long been taken for granted as a perfectly natural thing, but if one stops to ponder the situation, it begins to appear most peculiar. … Surely a document which the system user has not been shown in any form, to which he has not devoted the slightest particle of time or attention during his use of the system output, and of whose very existence he is unaware, does that user neither harm nor good in his search”
Clearly not true in the legal & patent domains
128
Discussion on Evaluation
Realistic tasks and user models
– Evaluation has to be based on the available data sets; this creates the user model
– Tasks need to correspond to available techniques
Much literature on generating tasks
– Experts describe typical tasks
– Use of log files of various sorts
IR research is decades behind sociology in terms of user modeling – there is a place to learn from
129
Discussion on Evaluation
Competitiveness
– Most campaigns take pains to explain "This is not a competition – this is an evaluation"
Competitions are stimulating, but
– Participants are wary of taking part if they are not sure to win
  particularly commercial vendors
– Without special care from the organizers, it stifles creativity:
  the best way to win is to take last year's method and improve it a bit
  original approaches are risky
130
Discussion on Evaluation
Topical Relevance
What other kinds of relevance factors are there?
– diversity of information
– quality
– credibility
– ease of reading
131
Conclusion
• IR Evaluation is a research field in itself
• Without evaluation, research is pointless
– IR Evaluation research included
• Statistical significance testing is a must to validate results
• Most IR Evaluation exercises are laboratory experiments
– As such, care must be taken to match, to the extent possible, the real needs of the users
• Experiments in the wild are rare, small and domain-specific:
– VideOlympics (2007-2009)
– PatOlympics (2010-2012)
132
Bibliography
Test Collection Based Evaluation of Information Retrieval Systems, M. Sanderson, 2010
TREC – Experiment and Evaluation in Information Retrieval, E. Voorhees, D. Harman (eds.)
On the history of evaluation in IR, S. Robertson, 2008, Journal of Information Science
A Comparison of Statistical Significance Tests for Information Retrieval Evaluation, M. Smucker, J. Allan, B. Carterette, CIKM 2007
A Simple and Efficient Sampling Method for Estimating AP and NDCG, E. Yilmaz, E. Kanoulas, J. Aslam, SIGIR 2008
133
Bibliography
Do User Preferences and Evaluation Measures Line Up?, M. Sanderson, M. L. Paramita, P. Clough, E. Kanoulas, 2010
A Review of Factors Influencing User Satisfaction in Information Retrieval, A. Al-Maskari, M. Sanderson, 2010
Towards higher quality health search results: Automated quality rating of depression websites, D. Hawking, T. Tang, R. Sankaranarayana, K. Griffiths, N. Craswell, P. Bailey, 2007
Evaluating Sampling Methods for Uncooperative Collections, P. Thomas, D. Hawking, 2007
Comparing the Sensitivity of Information Retrieval Metrics, F. Radlinski, N. Craswell, 2010
Redundancy, Diversity and Interdependent Document Relevance, F. Radlinski, P. Bennett, B. Carterette, T. Joachims, 2009
Does Brandname Influence Perceived Search Result Quality? Yahoo!, Google, and WebKumara, P. Bailey, P. Thomas, D. Hawking, 2007
Methods for Evaluating Interactive Information Retrieval Systems with Users, D. Kelly, 2009
C-TEST: Supporting Novelty and Diversity in Testfiles for Search Tuning, D. Hawking, T. Rowlands, P. Thomas, 2009
Live Web Search Experiments for the Rest of Us, T. Jones, D. Hawking, R. Sankaranarayana, 2010
Quality and relevance of domain-specific search: A case study in mental health, T. Tang, N. Craswell, D. Hawking, K. Griffiths, H. Christensen, 2006
New methods for creating testfiles: Tuning enterprise search with C-TEST, D. Hawking, P. Thomas, T. Gedeon, T. Jones, T. Rowlands, 2006
A Field Experimental Approach to the Study of Relevance Assessments in Relation to Document Searching, A. M. Rees, D. G. Schultz. Final Report to the National Science Foundation, Volume II, Appendices. Clearinghouse for Federal Scientific and Technical Information, October 1967
The Water Filling Model and the Cube Test: Multi-dimensional Evaluation for Professional Search, J. Luo, C. Wing, H. Yang, M. Hearst, CIKM 2013
On sample sizes for non-matched-pair IR experiments, S. Robertson, 1990, Information Processing & Management
Splitting Water: Precision and Anti-Precision to Reduce Pool Bias, A. Lipani, M. Lupu, A. Hanbury, SIGIR 2015
Score adjustment for correction of pooling bias, W. Webber, L. A. F. Park, SIGIR 2009
On sample sizes for non-matched-pair IR experiments, S. Robertson, 1990, Information Processing & Management Lipani A, Lupu M, Hanbury A, Splitting Water: Precision and Anti-Precision to Reduce Pool Bias, SIGIR 2015 W. Webber and L. A. F. Park. Score adjustment for correction of pooling bias. In Proc. of SIGIR, 2009