
Introduction to Information Retrieval

Joongjin Bae(@bae_j)

Chapter 8 : Evaluation in Information Retrieval

http://baepiff.blogspot.com/



Overview

I. Information retrieval system evaluation

II. Standard test collections

III. Evaluation for unranked retrieval

IV. Evaluation for ranked retrieval

V. Assessing relevance

VI. System quality and user utility



IR system evaluation

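User satisfaction with search results can be judged by relevance.

How do we evaluate relevance?

Concrete evaluation methods are explained in later slides.

Three elements needed to evaluate relevance:

1. A document collection

2. A suite of queries (search terms)

3. A set of binary assessments, relevant or nonrelevant, for each query-document pair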

Sec. 8.1



IR system evaluation

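An information need is submitted to the system as a query.

Relevance is assessed relative to the information need, not the query.

Example) Information need: I want a cheap, tasty lunch near the office

Query: Shibuya AND cheap AND lunch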


Standard test collections

TREC: the National Institute of Standards and Technology (NIST) has run an IR test bed since 1992.

Reuters and other test collections are also used.

Relevance judgments for each query-document pair are made by human assessors.

Sec. 8.2



Precision and Recall

Precision: fraction of retrieved documents that are relevant = P(relevant|retrieved)

Recall: fraction of relevant documents that are retrieved = P(retrieved|relevant)

Precision P = tp/(tp + fp)

Recall R = tp/(tp + fn)

                 Relevant   Nonrelevant
Retrieved        tp         fp
Not retrieved    fn         tn
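
As a minimal sketch of the two definitions above, the following Python function computes precision and recall from the contingency-table counts. The function name and the example counts are hypothetical, chosen only for illustration.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from contingency-table counts."""
    retrieved = tp + fp   # everything the system returned
    relevant = tp + fn    # everything that should have been returned
    precision = tp / retrieved if retrieved else 0.0
    recall = tp / relevant if relevant else 0.0
    return precision, recall


# Hypothetical counts: 40 relevant documents retrieved,
# 10 nonrelevant documents retrieved, 60 relevant documents missed.
p, r = precision_recall(tp=40, fp=10, fn=60)
print(p, r)  # 0.8 0.4
```

Note that tn does not appear in either formula, which is why both measures stay informative even when almost all documents are nonrelevant.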

Sec. 8.3



Accuracy

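accuracy = (tp + tn) / (tp + fp + fn + tn)

Accuracy is commonly used as an evaluation measure in machine learning.

In IR, around 99.9% of documents are nonrelevant to the user's information need.

A system can therefore score very high accuracy simply by judging every document nonrelevant.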
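
The point about accuracy can be checked with a small calculation. This is only a sketch: the collection size and the number of relevant documents are assumed values matching the 99.9% figure, not data from a real collection.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all documents that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


# Assumed collection: 100,000 documents, of which 100 (0.1%) are relevant.
# A "system" that retrieves nothing has tp = fp = 0.
acc = accuracy(tp=0, fp=0, fn=100, tn=99_900)
recall = 0 / (0 + 100)  # tp / (tp + fn)
print(acc, recall)  # 0.999 0.0
```

The accuracy is 99.9% even though the system found none of the relevant documents, which is why precision and recall are preferred in IR.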


Precision/Recall trade off

Recall can be raised by retrieving every document, but precision is then low.

Precision can usually be raised by retrieving fewer documents.

Precision and recall trade off against each other.


F measure

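Weighted harmonic mean of precision and recall:

$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}, \qquad \beta^2 = \frac{1 - \alpha}{\alpha}$$

The balanced F1 measure is used most often,

i.e., with β = 1 or α = ½

β < 1 emphasizes precision

β > 1 emphasizes recall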
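
A short Python sketch of the weighted F measure, written in its β form F = (β² + 1)PR / (β²P + R). The function name and the sample precision and recall values are assumptions for illustration.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (beta = 1 gives F1)."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0.0:
        return 0.0  # convention chosen here for the degenerate case
    return (b2 + 1) * precision * recall / denominator


# Assumed values: precision 0.8, recall 0.4.
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_measure(0.8, 0.4, beta), 3))
```

With these values the score moves toward the precision (0.8) for β < 1 and toward the recall (0.4) for β > 1, matching the emphasis described above.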


F1 and other averages

[Figure: Combined Measures. Minimum, maximum, arithmetic mean, geometric mean, and harmonic mean of precision and recall (y-axis, 0 to 100) plotted against precision (x-axis, 0 to 100), with recall fixed at 70%.]
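
The comparison of averages can be reproduced numerically. The sketch below fixes recall at 70% as in the figure; the function name and the sampled precision values are assumptions.

```python
import math


def combined_measures(p: float, r: float) -> dict[str, float]:
    """Five ways of combining precision p and recall r into one number."""
    return {
        "minimum": min(p, r),
        "maximum": max(p, r),
        "arithmetic": (p + r) / 2,
        "geometric": math.sqrt(p * r),
        "harmonic": 2 * p * r / (p + r) if p + r else 0.0,
    }


# Recall fixed at 0.7; precision sampled at a few assumed points.
for p in (0.1, 0.4, 0.7, 1.0):
    print(p, {name: round(v, 3) for name, v in combined_measures(p, 0.7).items()})
```

For any pair of values the harmonic mean is never larger than the geometric mean, which is never larger than the arithmetic mean, so the harmonic mean (F1) stays close to the minimum when precision and recall differ widely.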


Evaluating ranked results

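Evaluation of ranked results:

Precision, recall, and the F measure are set-based measures, computed over an unordered set of retrieved documents.

For ranked results, precision and recall change with the cutoff k (the top k results).

Computing precision and recall for each top-k set and plotting the values gives the precision-recall curve.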
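
A minimal sketch of how the points of a precision-recall curve can be computed from a ranked list. The ranking and the number of relevant documents are hypothetical.

```python
def precision_recall_points(
    ranking: list[bool], total_relevant: int
) -> list[tuple[float, float]]:
    """Return (recall, precision) for every cutoff k = 1 .. len(ranking).

    ranking[i] is True when the document at rank i + 1 is relevant.
    total_relevant must be greater than zero.
    """
    points = []
    hits = 0
    for k, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / total_relevant, hits / k))
    return points


# Hypothetical ranking of six results for a query with 3 relevant documents.
ranking = [True, False, True, False, False, True]
for recall, precision in precision_recall_points(ranking, total_relevant=3):
    print(round(recall, 2), round(precision, 2))
```

Each nonrelevant result lowers precision while recall stays the same, and each relevant result raises both, which produces the sawtooth shape of the curve.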

Sec. 8.4



precision-recall curve

[Figure: precision-recall curve, with Recall (0.0 to 1.0) on the x-axis and Precision (0.0 to 1.0) on the y-axis.]


Interpolated precision

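Interpolated precision is used to remove the sawtooth shape from the graph.

Put simply, the interpolated precision at recall level r is the maximum precision found at any recall level greater than or equal to r.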
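
Interpolated precision can be sketched as a maximum over the curve points at or beyond a recall level. The function name and the sample (recall, precision) points are assumptions for illustration.

```python
def interpolated_precision(points: list[tuple[float, float]], r: float) -> float:
    """Highest precision among points whose recall is at least r.

    points is a list of (recall, precision) pairs; returns 0.0 when no
    point reaches recall r.
    """
    return max((p for recall, p in points if recall >= r), default=0.0)


# Hypothetical curve points for a query with 3 relevant documents.
points = [(1 / 3, 1.0), (1 / 3, 0.5), (2 / 3, 2 / 3),
          (2 / 3, 0.5), (2 / 3, 0.4), (1.0, 0.5)]
print(interpolated_precision(points, 0.0))  # 1.0
print(interpolated_precision(points, 1.0))  # 0.5
```

Because the maximum is taken over a shrinking set of points as r grows, the interpolated curve never increases with recall.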


Evaluation

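Graphs are good, but we also want a summary measure!

11-point interpolated average precision

The standard measure in the early TREC competitions: take recall levels from 0 to 1 in steps of 0.1, measure the interpolated precision at each of the 11 points, and average the results.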
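
The 11-point measure can be sketched for a single query as follows. The function name and the sample points are hypothetical; in an evaluation the value at each recall level would also be averaged over all queries.

```python
def eleven_point_interpolated_ap(points: list[tuple[float, float]]) -> float:
    """Average of interpolated precision at recall 0.0, 0.1, ..., 1.0.

    points is a list of (recall, precision) pairs for one query.
    """
    total = 0.0
    for i in range(11):
        level = i / 10
        # Interpolated precision: best precision at recall >= level.
        total += max((p for recall, p in points if recall >= level), default=0.0)
    return total / 11


# Hypothetical curve points for a query with 3 relevant documents.
points = [(1 / 3, 1.0), (1 / 3, 0.5), (2 / 3, 2 / 3),
          (2 / 3, 0.5), (2 / 3, 0.4), (1.0, 0.5)]
print(round(eleven_point_interpolated_ap(points), 3))  # 0.727
```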


Typical (good) 11 point precisions

[Figure: typical 11-point interpolated precision curve, with Recall (0 to 1) on the x-axis and Precision (0 to 1) on the y-axis.]


Precision@K

Set a rank threshold K

Compute the percentage of relevant documents in the top K

Ignore documents ranked below K

Ex: Prec@3 of 2/3

Prec@4 of 2/4

Prec@5 of 3/5
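
Precision@K can be sketched in a few lines of Python. The ranking used here (relevant, nonrelevant, relevant, nonrelevant, relevant) is an assumption: it is one ranking that is consistent with the example values, since the ranking on the original slide was shown as a figure.

```python
def precision_at_k(ranking: list[bool], k: int) -> float:
    """Fraction of the top k results that are relevant.

    ranking[i] is True when the document at rank i + 1 is relevant.
    Ranks beyond the end of the list count as nonrelevant.
    """
    return sum(ranking[:k]) / k


# Assumed ranking consistent with Prec@3 = 2/3, Prec@4 = 2/4, Prec@5 = 3/5.
ranking = [True, False, True, False, True]
for k in (3, 4, 5):
    print(k, precision_at_k(ranking, k))
```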



Mean Average Precision

Take the rank position of each relevant doc: K1, K2, … KR

Compute Precision@K at each of those ranks.

Average precision = average of Precision@K

MAP is the mean of average precision across multiple queries/rankings.
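
The two steps, average precision per query and then the mean over queries, can be sketched as below. The function names and the two sample rankings are hypothetical. A relevant document that is never retrieved contributes a precision of 0, which is why the sum is divided by the total number of relevant documents.

```python
def average_precision(ranking: list[bool], total_relevant: int) -> float:
    """Average of Precision@K over the ranks K of the relevant documents."""
    hits = 0
    precision_sum = 0.0
    for k, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k  # Precision@K at this relevant rank
    return precision_sum / total_relevant if total_relevant else 0.0


def mean_average_precision(runs: list[tuple[list[bool], int]]) -> float:
    """Arithmetic mean of average precision over (ranking, total_relevant) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, n) for r, n in runs) / len(runs)


# Two hypothetical queries; the second has one relevant document never retrieved.
runs = [
    ([True, False, True, False, True], 3),
    ([False, True, False, False], 2),
]
print(round(mean_average_precision(runs), 3))  # 0.503
```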



Average Precision



MAP



Mean average precision

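If no relevant document is retrieved, the average precision for that query is 0 (a relevant document that is never retrieved contributes a precision of 0).

MAP is an arithmetic mean over queries.

It is the most commonly used evaluation measure.

MAP assumes the user wants to find many relevant documents for each query.

MAP requires many relevance judgments in the text collection.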



R-precision

Rel = set of known relevant documents

If r of the top |Rel| results are relevant, then

R-precision = r / |Rel|

A perfect system scores 1.

Example: 100 documents, |Rel| = 8, k = 20

A perfect system returns r = 8

Precision@K = r / k = 8 / 20 = 0.4

R-precision = r / |Rel| = 8 / 8 = 1
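
The comparison in the example can be sketched in Python. The function name is hypothetical, and the ranking is the assumed output of a perfect system that places all 8 relevant documents first among 100.

```python
def r_precision(ranking: list[bool], num_relevant: int) -> float:
    """Precision over the top |Rel| results, where |Rel| = num_relevant."""
    if num_relevant == 0:
        return 0.0
    return sum(ranking[:num_relevant]) / num_relevant


# Assumed perfect ranking: 8 relevant documents first, then 92 nonrelevant.
ranking = [True] * 8 + [False] * 92
print(r_precision(ranking, num_relevant=8))  # 1.0
print(sum(ranking[:20]) / 20)                # Precision@20 = 0.4
```

Precision@20 cannot exceed 0.4 here because only 8 relevant documents exist, while R-precision adjusts the cutoff to the number of relevant documents.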


ROC curve and NDCG

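ROC

Plots recall on the y-axis against the false-positive rate ( fp / (fp + tn) ) on the x-axis.

For a good system, the curve climbs steeply on the left side.

NDCG

Often used when the ranking is produced by machine learning.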
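
The points of an ROC curve can be sketched by walking down a ranked list. The function name, the ranking, and the collection counts are assumptions for illustration.

```python
def roc_points(
    ranking: list[bool], total_relevant: int, total_nonrelevant: int
) -> list[tuple[float, float]]:
    """Return (false-positive rate, recall) after each rank, starting at (0, 0).

    total_nonrelevant equals fp + tn, so fp / total_nonrelevant is the
    false-positive rate. Both totals must be greater than zero.
    """
    points = [(0.0, 0.0)]
    tp = fp = 0
    for is_relevant in ranking:
        if is_relevant:
            tp += 1
        else:
            fp += 1
        points.append((fp / total_nonrelevant, tp / total_relevant))
    return points


# Hypothetical ranking over a tiny collection: 3 relevant, 2 nonrelevant.
for fpr, recall in roc_points([True, True, False, True, False], 3, 2):
    print(round(fpr, 2), round(recall, 2))
```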


Assessing relevance

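Test queries

Must be appropriate for the documents in the collection

Best designed by domain experts

Random query terms are not a good idea

Relevance assessments

Human judgments are costly.

Humans are not perfect.

Kappa statistic

We need to measure how well the judges agree on relevance judgments.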

Sec. 8.5



Kappa statistic

Measure of agreement between judges

Designed for categorical judgments

Corrects the simple agreement rate for chance agreement

Kappa = [ P(A) – P(E) ] / [ 1 – P(E) ]

P(A): proportion of the time the judges agree

P(E): proportion of the time they would be expected to agree by chance

Kappa = 0 for chance agreement, 1 for total agreement.


Kappa Example

Kappa = [ P(A) – P(E) ] / [ 1 – P(E) ]

                   Judge 2
Judge 1      Relevant   Non-Rel   Total
Relevant     300        20        320
Non-Rel      10         70        80
Total        310        90        400



$$P(A) = \frac{300 + 70}{400} = 0.925$$

$$P(\text{non}) = \frac{10 + 20 + 70 + 70}{800} = 0.2125$$

$$P(\text{rel}) = \frac{10 + 20 + 300 + 300}{800} = 0.7875$$

$$P(E) = P(\text{non})^2 + P(\text{rel})^2 = 0.2125^2 + 0.7875^2 = 0.665$$

$$\text{Kappa} = \frac{P(A) - P(E)}{1 - P(E)} = \frac{0.925 - 0.665}{1 - 0.665} = 0.776$$
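
The kappa calculation for the two-judge table (300 and 70 agreements, 20 and 10 disagreements) can be checked with a short Python sketch. The function and parameter names are hypothetical. The chance-agreement term uses marginals pooled across both judges, as in the calculation on the slide; other kappa variants, such as Cohen's kappa, use each judge's own marginals instead.

```python
def kappa_pooled(both_rel: int, j1_only_rel: int, j2_only_rel: int, both_non: int) -> float:
    """Kappa for two judges and binary judgments, using pooled marginals."""
    total = both_rel + j1_only_rel + j2_only_rel + both_non
    p_agree = (both_rel + both_non) / total
    # Pool the 2 * total individual judgments made by the two judges.
    p_rel = (2 * both_rel + j1_only_rel + j2_only_rel) / (2 * total)
    p_non = (2 * both_non + j1_only_rel + j2_only_rel) / (2 * total)
    p_chance = p_rel ** 2 + p_non ** 2
    if p_chance == 1.0:
        return 1.0  # convention chosen here: every judgment is in one category
    return (p_agree - p_chance) / (1 - p_chance)


# Counts from the table: 300 both relevant, 20 relevant for Judge 1 only,
# 10 relevant for Judge 2 only, 70 both nonrelevant.
print(round(kappa_pooled(300, 20, 10, 70), 3))  # 0.776
```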



Kappa statistic

Interpretation of the kappa statistic k:

k > 0.8 good agreement

0.67 <= k < 0.8 fair agreement

k < 0.67 bad agreement


System quality and user utility

System issues

How fast does it index?

How fast does it search?

How expressive is its query language? How fast is it on complex queries?

How large is its document collection?

User utility: measuring user happiness

Web: do users find what they are looking for, and do they come back?

Enterprise: time taken to find the needed information

Refining a deployed system

A/B test

Sec. 8.6



Reference

IIR Chapter 8

http://www.stanford.edu/class/cs276/handouts/lecture8-evaluation.ppt

http://bloghackers.net/~naoya/iir/ppt/

http://www.stanford.edu/class/cs276/handouts/EvaluationNew.ppt