Performance measures


Page 1: Performance  measures

Performance measures

Page 2: Performance  measures

Performance measures for matching

• The following counts are typically measured for the performance of matching:
  – TP: true positives, i.e. the number of correct matches
  – FN: false negatives, matches that were not correctly detected
  – FP: false positives, proposed matches that are incorrect
  – TN: true negatives, non-matches that were correctly rejected

• Based on these counts, any particular matching strategy at a particular threshold can be rated by the following measures (see the code sketch below):
  – True Positive Rate (TPR), also referred to as True Acceptance Rate (TAR) = TP / (TP + FN) = TP / P
  – False Positive Rate (FPR), also referred to as False Acceptance Rate (FAR) = FP / (FP + TN) = FP / N

• TAR @ 0.001 FAR is a typical performance index used in benchmarks. Ideally, the true positive rate will be close to 1 and the false positive rate close to 0.
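
As a concrete illustration (not part of the original slides), the following minimal Python sketch computes TPR/TAR and FPR/FAR from hypothetical ground-truth labels and matcher decisions; all data and names are made up for the example.

def tpr_fpr(ground_truth, accepted):
    # ground_truth[i]: True if pair i is a genuine match; accepted[i]: the matcher's decision
    TP = sum(1 for g, a in zip(ground_truth, accepted) if g and a)
    FN = sum(1 for g, a in zip(ground_truth, accepted) if g and not a)
    FP = sum(1 for g, a in zip(ground_truth, accepted) if not g and a)
    TN = sum(1 for g, a in zip(ground_truth, accepted) if not g and not a)
    return TP / (TP + FN), FP / (FP + TN)   # (TPR = TP/P, FPR = FP/N)

# Hypothetical example: 4 genuine pairs and 4 impostor pairs
print(tpr_fpr([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 1, 0, 0, 0]))   # (0.75, 0.25)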

Page 3: Performance  measures

ROC curves

• As we vary the matching threshold at which TPR and FPR are obtained, we obtain a set of points in the TPR-FPR space, collectively known as the receiver operating characteristic (ROC) curve.

• The ROC curve plots the true positive rate against the false positive rate for a particular combination of feature extraction and matching algorithms. The area under the ROC curve (AUC) is often used as a scalar measure of performance.

• As the threshold θ is increased, more matches are accepted: the number of true positives increases, but so does the number of false positives, so the two quantities trade off against each other. The closer the curve lies to the upper left corner, the better the performance.

• The ROC curve can also be used to calculate the mean average precision, i.e. the average of the precision values obtained as the threshold is varied to select the best results.
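
A minimal sketch (with made-up scores and labels) of how the ROC points can be traced by sweeping the acceptance threshold over match scores, and how the AUC can be obtained with the trapezoidal rule:

def roc_points(scores, labels):
    # labels[i] = 1 for a correct match, 0 for an incorrect one; scores are similarities
    P, N = sum(labels), len(labels) - sum(labels)
    pts = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= thr and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= thr and not l)
        pts.append((fp / N, tp / P))          # (FPR, TPR)
    return pts + [(1.0, 1.0)]

def auc(points):
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5]      # hypothetical match scores
labels = [1, 1, 0, 1, 0, 0]
print(auc(roc_points(scores, labels)))        # ≈ 0.889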

Page 4: Performance  measures

Performance measure for retrieval sets

[Figure: document selection vs. document ranking. Selection: f(d,q) ∈ {0, 1} splits the collection into a retrieved set R′(q) and the rest, to be compared against the true relevant set R(q). Ranking: f(d,q) assigns a score to each document (e.g. d1: 0.98 +, d2: 0.95 +, d3: 0.83 −, d4: 0.80 +, d5: 0.76 −, d6: 0.56 −, d7: 0.34 −, d8: 0.21 +, d9: 0.21 −, where +/− marks relevant/non-relevant), and R′(q) is obtained by cutting the ranked list at a chosen position.]

• Definition of performance measures for retrieval sets stems from information retrieval. The case of document selection is distinguished from the case in which position in the retrieval set is considered (document ranking).

• With selection, the classifier is inaccurate:
  – "Over-constrained" query (terms are too specific): no relevant documents are found.
  – "Under-constrained" query (terms are too general): over-delivery.

Even when the classifier is accurate, not all relevant documents are equally relevant.

• Ranking allows the user to control the boundary according to his/her preferences.

Page 5: Performance  measures

Performance measures for unranked retrieval sets

• The two most frequent and basic measures for unranked retrieval sets are Precision and Recall. These are first defined for the simple case where the information retrieval system returns a set of documents for a query: Precision = |relevant ∩ retrieved| / |retrieved| and Recall = |relevant ∩ retrieved| / |relevant| (a short code sketch follows the contingency table below).

[Figure: Venn diagram of the Relevant and Retrieved sets within the set of all docs.]

             retrieved                 not retrieved
relevant     retrieved & relevant      not retrieved but relevant
irrelevant   retrieved & irrelevant    not retrieved & irrelevant
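
A minimal sketch (with hypothetical document IDs) computing precision and recall from the retrieved and relevant sets, following the contingency table above:

def precision_recall(retrieved, relevant):
    hits = retrieved & relevant                  # the "retrieved & relevant" cell
    return len(hits) / len(retrieved), len(hits) / len(relevant)

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d5", "d6", "d7"}
print(precision_recall(retrieved, relevant))     # (0.5, 0.4)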

Page 6: Performance  measures

[Figure: three retrieved-vs-relevant configurations: very high precision but very low recall; high recall but low precision; high precision and high recall.]

• The advantage of having two numbers is that one is more important than the other in many circumstances:

− Surfers would like every result on the first page to be relevant (i.e. high precision).
− Professional searchers are more concerned with high recall and will tolerate low precision.

Page 7: Performance  measures


• F-Measure: a single measure that takes into account both recall and precision.

It is the weighted harmonic mean of precision and recall:

• Compared to the arithmetic mean, the harmonic mean is high only when both precision and recall are high.

F = 2PR / (P + R) = 2 / (1/P + 1/R)
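
A small sketch (with illustrative values) contrasting the harmonic mean behind the F-measure with the arithmetic mean: a very low precision or recall drags F down.

def f_measure(p, r):
    # harmonic mean of precision and recall
    return 2 * p * r / (p + r)

# With P = 0.9 and R = 0.1 the arithmetic mean is 0.5, but F is only 0.18:
print(f_measure(0.9, 0.1))   # 0.18
print(f_measure(0.8, 0.8))   # 0.8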

Page 8: Performance  measures


• E-Measure (parameterized F-Measure): a variant of the F-measure that trades off precision versus recall, allowing the emphasis to be weighted towards one or the other:

• The value of β controls the trade-off:
  – β = 1: equally weights precision and recall (E = F).
  – β > 1: weights recall more.
  – β < 1: weights precision more.

E = (β² + 1)PR / (β²P + R) = (β² + 1) / (β²/R + 1/P)
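
A small sketch (with illustrative P, R and β values) of the parameterized measure above; with β = 1 it reduces to the F-measure.

def e_measure(p, r, beta):
    # E = (1 + β²)PR / (β²P + R)
    return (1 + beta**2) * p * r / (beta**2 * p + r)

p, r = 0.6, 0.9
print(e_measure(p, r, 1.0))   # 0.72, equal to F
print(e_measure(p, r, 2.0))   # ≈ 0.818, β > 1 pulls the score towards the (higher) recall
print(e_measure(p, r, 0.5))   # ≈ 0.643, β < 1 pulls the score towards the (lower) precision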

Page 9: Performance  measures

Performance measures for ranked retrieval sets

• In a ranking context, appropriate sets of retrieved documents are given by the top k retrieved documents. For each such set, precision and recall values can be plotted to give a precision-recall curve, which shows the trade-off between relevant and non-relevant items retrieved.

[Figure: precision (y-axis, 0 to 1) vs. recall (x-axis, 0 to 1): the ideal case; a curve retrieving many relevant documents but missing many other useful ones (high precision, low recall); a curve retrieving most relevant documents but also many non-relevant ones (high recall, low precision).]

Slide content from J. Ghosh

Page 10: Performance  measures

Computing Precision-Recall points

• Precision-Recall plots are built as follows (see the sketch below):
  – For each query, produce the ranked list of retrieved documents. Setting different thresholds on this ranked list results in different sets of retrieved documents, and therefore different recall/precision measures.
  – Mark each document in the ranked list that is relevant.
  – Compute a recall/precision pair for each position in the ranked list that contains a relevant document.

Slide content from J. Ghosh
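
A minimal sketch of this procedure: given the ranked list's relevance marks (1 = relevant) and the total number of relevant documents, it emits a recall/precision pair at each rank holding a relevant document. The list below is reconstructed from Example 1 on the next slide (relevant items at ranks 1, 2, 4, 6 and 13, six relevant in total).

def precision_recall_points(ranked_relevance, total_relevant):
    points, hits = [], 0
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            points.append((hits / total_relevant, hits / k))   # (recall, precision)
    return points

ranked = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1]
for recall, precision in precision_recall_points(ranked, total_relevant=6):
    print(f"R={recall:.3f}  P={precision:.3f}")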

Page 11: Performance  measures

Example 1

• Let the total # of relevant documents = 6. Check each new recall point:

R = 1/6 = 0.167; P = 1/1 = 1
R = 2/6 = 0.333; P = 2/2 = 1
R = 3/6 = 0.5; P = 3/4 = 0.75
R = 4/6 = 0.667; P = 4/6 = 0.667
R = 5/6 = 0.833; P = 5/13 = 0.38

One relevant document is missing, so the curve doesn't reach 100% recall.

Slide from J. Ghosh

Page 12: Performance  measures

Example 2

• Let the total # of relevant documents = 6. Check each new recall point:

R = 1/6 = 0.167; P = 1/1 = 1
R = 2/6 = 0.333; P = 2/3 = 0.667
R = 3/6 = 0.5; P = 3/5 = 0.6
R = 4/6 = 0.667; P = 4/8 = 0.5
R = 5/6 = 0.833; P = 5/9 = 0.556
R = 6/6 = 1.0; P = 6/14 = 0.429

Slide from J. Ghosh

Page 13: Performance  measures

• Precision-recall curves have a distinctive saw-tooth shape:
  – if the (k + 1)th retrieved document is non-relevant, recall is the same as for the top k documents but precision drops;
  – if it is relevant, both precision and recall increase, and the curve jags up and to the right.

• Interpolated precision is often used to remove these jiggles: the interpolated precision at a certain recall level r is defined as the highest precision found for any recall level r′ ≥ r, i.e. p_interp(r) = max over r′ ≥ r of p(r′) (see the sketch below).

[Figure: interpolated precision-recall curve; at each recall level r, the interpolated precision is the highest precision observed at any recall ≥ r.]
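
A short sketch of the interpolation rule, applied to the precision-recall points of Example 2 above:

def interpolate(pr_points):
    # pr_points: (recall, precision) pairs sorted by increasing recall;
    # p_interp(r) = max over r' >= r of p(r')
    return [(r, max(p for _, p in pr_points[i:])) for i, (r, _) in enumerate(pr_points)]

points = [(0.167, 1.0), (0.333, 0.667), (0.5, 0.6), (0.667, 0.5), (0.833, 0.556), (1.0, 0.429)]
print(interpolate(points))
# the dip at recall 0.667 (precision 0.5) is lifted to 0.556 by the later, higher point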

Page 14: Performance  measures

[Figure: precision (y-axis) vs. recall (x-axis) curves for the two examples.
Example 1: recall 0.167, 0.333, 0.5, 0.667, 0.833 with precision 1, 1, 0.75, 0.667, 0.38.
Example 2: recall 0.167, 0.333, 0.5, 0.667, 0.833, 1 with precision 1, 0.67, 0.6, 0.5, 0.556, 0.429.]

• In order to obtain reliable performance measures, performance is averaged over a large set of queries:
  – Compute the average precision at each standard recall level across all queries.
  – Plot the average precision/recall curves to evaluate overall system performance on a document/query corpus.
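
A minimal sketch of this averaging, assuming interpolated precision is taken at the standard recall levels 0.0, 0.1, ..., 1.0 (the choice of 11 levels is an assumption, not stated on the slide); the two example curves above act as two "queries":

def precision_at(level, pr_points):
    # interpolated precision: highest precision at any recall >= level (0 if none)
    candidates = [p for r, p in pr_points if r >= level]
    return max(candidates) if candidates else 0.0

def averaged_curve(queries, levels=tuple(i / 10 for i in range(11))):
    return [(lvl, sum(precision_at(lvl, q) for q in queries) / len(queries)) for lvl in levels]

q1 = [(0.167, 1.0), (0.333, 1.0), (0.5, 0.75), (0.667, 0.667), (0.833, 0.38)]
q2 = [(0.167, 1.0), (0.333, 0.667), (0.5, 0.6), (0.667, 0.5), (0.833, 0.556), (1.0, 0.429)]
for level, avg_p in averaged_curve([q1, q2]):
    print(f"recall {level:.1f}: average precision {avg_p:.3f}")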


Page 15: Performance  measures

Comparing performance of two or more systems

• When the performance of two or more systems is compared, the curve closest to the upper right-hand corner of the graph indicates the best performance.

[Figure: precision (y-axis) vs. recall (x-axis) curves of two systems, NoStem and Stem; the system whose curve lies closest to the upper right-hand corner has the best performance.]

Slide from J. Ghosh


Page 16: Performance  measures

Other performance measures for ranked retrieval sets

• Average Precision (AP) is a typical performance measure used for ranked sets. It is defined as the average of the precision scores after each relevant item (true positive, TP) within the scope S. Given a scope S = 7 and a ranked list (gain vector) G = [1,1,0,1,1,0,0,1,1,0,1,0,0,...], where 1/0 indicate the gains associated with relevant/non-relevant items, respectively:

AP = (1/1 + 2/2 + 3/4 + 4/5) / 4 = 0.8875

• Mean Average Precision (MAP): Average of the average precision value for a set of queries.

• Average Dynamic Precision (ADP) is also used. It is defined as the average of the precision values obtained at increasing scope S, with 1 ≤ S ≤ #relevant items:

ADP = (1/7) Σ_{S=1..7} TP_S / S = (1 + 1 + 0.667 + 0.75 + 0.80 + 0.667 + 0.571) / 7 = 0.779

S   Gain vector        Relevant   Precision at S
1   [2]                1          1/1 = 1
2   [2,2]              2          2/2 = 1
3   [2,2,0]            2          2/3 = 0.667
4   [2,2,0,2]          3          3/4 = 0.75
5   [2,2,0,2,2]        4          4/5 = 0.80
6   [2,2,0,2,2,0]      4          4/6 = 0.667
7   [2,2,0,2,2,0,0]    4          4/7 = 0.571
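
A sketch reproducing the AP and ADP values above from the gain vectors on this slide (binary gains are used here, which give the same relevant counts as the graded gains in the table):

def average_precision(gains, scope):
    precisions, hits = [], 0
    for i, g in enumerate(gains[:scope], start=1):
        if g > 0:
            hits += 1
            precisions.append(hits / i)     # precision after each relevant item
    return sum(precisions) / len(precisions)

def average_dynamic_precision(gains, scope):
    # mean of TP_S / S over every scope S = 1..scope
    return sum(sum(1 for g in gains[:s] if g > 0) / s for s in range(1, scope + 1)) / scope

G = [1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
print(average_precision(G, scope=7))           # (1/1 + 2/2 + 3/4 + 4/5) / 4 = 0.8875
print(average_dynamic_precision(G, scope=7))   # ≈ 0.779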


Page 17: Performance  measures

• Other measures for ranked retrieval sets usually employed in benchmarks are the mean values of:

– Recognition Rate: the total number of queries for which a relevant item is in the 2nd position of the ranked list, divided by the number of items in the dataset.

– 1st tier and 2nd tier: the average number of relevant items retrieved in the first n and 2n positions of the ranked list, respectively (n = 7 is typically used in benchmarks).

– Cumulated Gain (CG) at a particular rank position p: CG_p = Σ_{i=1..p} rel_i, where rel_i is the graded relevance of the result at position i (rank 5 is typically used).

– Discounted Cumulated Gain (DCG) at a particular rank position p: highly relevant documents appearing lower in a search result list are penalized by reducing their graded relevance value logarithmically with the position of the result, commonly DCG_p = rel_1 + Σ_{i=2..p} rel_i / log2(i).
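
A short sketch (with hypothetical graded relevances) of CG and DCG at rank p, using the log2 discount above; note that the exact discount convention may differ between benchmarks.

import math

def cg(rels, p):
    # rels[i] is the graded relevance of the result at rank i + 1
    return sum(rels[:p])

def dcg(rels, p):
    # no discount at rank 1, log2(rank) discount afterwards
    return rels[0] + sum(rels[i] / math.log2(i + 1) for i in range(1, p))

rels = [3, 2, 3, 0, 1, 2]    # hypothetical graded relevance values
print(cg(rels, 5))           # 3 + 2 + 3 + 0 + 1 = 9
print(dcg(rels, 5))          # 3 + 2/log2(2) + 3/log2(3) + 0 + 1/log2(5) ≈ 7.32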