Web Search Engine Metrics for Measuring User Satisfaction

Transcript of Web Search Engine Metrics for Measuring User Satisfaction

Page 1: Web Search Engine Metrics for Measuring User Satisfaction

1

Web Search Engine Metrics for Measuring User Satisfaction

Ali Dasdan, Kostas Tsioutsiouliklis, Emre Velipasaoglu
{dasdan, kostas, emrev}@yahoo-inc.com
Yahoo! Inc.
20 Apr 2009

Page 2: Web Search Engine Metrics for Measuring User Satisfaction

2

Tutorial @ 18th International World Wide Web Conference

http://www2009.org/ April 20-24, 2009

Page 3: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Disclaimers

•  This talk presents the opinions of the authors. It does not necessarily reflect the views of Yahoo! Inc.

•  This talk does not imply that these metrics are used by Yahoo!; even if they are used, they may not be used in the way described in this talk.

•  The examples are just that – examples. Please do not generalize them to the level of comparing search engines.

3

Page 4: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Acknowledgments for presentation material (in alphabetical order of last names in each category)

•  Coverage –  Paolo D’Alberto, Amit Sasturkar

•  Discovery –  Chris Drome, Kaori Drome

•  Freshness –  Xinh Huynh

•  Presentation –  Rob Aseron, Youssef Billawala, Prasad Kantamneni, Diane Yip

•  General
  –  Stanford U. presentation audience (organized by Aneesh Sharma and Panagiotis Papadimitriou), Yahoo! presentation audience (organized by Pavel Dmitriev)

4

Page 5: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Learning objectives

•  To learn about user satisfaction metrics

•  To learn about how to interpret metrics results

•  To get the relevant bibliography
•  To learn about the open problems

5

Page 6: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Scope

•  Web “textual” search
•  Users’ point of view
•  Analysis rather than synthesis
•  Intuitive rather than formal
•  Not exhaustive coverage (including references)

6

Page 7: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Outline

•  Introduction (30min) – Ali
•  Relevance metrics (50min) – Emre
•  Break (15min)
•  Coverage metrics (15min) – Ali
•  Diversity metrics (15min) – Ali
•  Discovery metrics (15min) – Ali
•  Freshness metrics (15min) – Ali
•  Presentation metrics (50min) – Kostas
•  Conclusions (5min) – Kostas

7

Page 8: Web Search Engine Metrics for Measuring User Satisfaction

8

Introduction PART “0”

of WWW’09 Tutorial on Web Search Engine Metrics

by A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu

Page 9: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

“To measure is to know”

“If you cannot measure it, you cannot improve it”

Lord Kelvin (1824-1907)

Why measure? Why metrics?

9

Page 10: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Search engine pipeline: Simplified architecture

10

•  Serving system: serves user queries and search results

•  Content system: acquires and processes content

[Diagram: User ↔ Serving system (front-end tiers, search tiers) ↔ Content system (indexers, Web graphs, crawlers) ↔ WWW]

Page 11: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Search engine pipeline: Content selection

11

[Diagram: content selection funnel over the search engine pipeline – content from the Web is Crawled, Graphed, Indexed, Served, and Accessed.]

How do you select content to pass to the next catalog?

Page 12: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

User view of metrics: Example with coverage metrics (SE #1)

12

[Diagram: search engine pipeline for Search Engine (SE) #1 – User, Serving system (front-end tiers, search tiers), Content system (indexers, Web graphs, crawlers), WWW]

Page 13: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

User view of metrics: Example with coverage metrics (SE #2)

13

[Diagram: search engine pipeline for Search Engine (SE) #2 – User, Serving system (front-end tiers, search tiers), Content system (indexers, Web graphs, crawlers), WWW]

Page 14: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

User view of metrics: Example with coverage metrics (SE #3)

14

[Diagram: search engine pipeline for Search Engine (SE) #3 – User, Serving system (front-end tiers, search tiers), Content system (indexers, Web graphs, crawlers), WWW]

Page 15: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

System view of metrics: Example with coverage metrics

15

Check for coverage of expected URL http://rain.stanford.edu/schedule/ (if missing from SRP)

[Diagram: search engine pipeline – User, Serving system (front-end tiers, search tiers), Content system (indexers, Web graphs, crawlers), WWW]

Page 16: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Ideal vs. reality

•  Ideal view
  –  crawl all content
  –  discover all changes instantaneously
  –  serve all content instantaneously
  –  store all content indefinitely
  –  meet user’s information need perfectly

•  Practical view
  –  constraints on the above aspects due to market focus, long tails, cost, resources, complexity

•  Moral of the story
  –  Cannot make all the users happy all the time!

16

Page 17: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Sampling methods for metrics

•  Random sampling of queries
  –  from the search engine’s query logs
  –  from third-party logs (e.g., ComScore)

•  Random sampling of URLs
  –  from random walking the Web (see a review in Baykan et al., WWW’06)
  –  from directories and similar hubs
  –  from RSS feeds and sitemaps
  –  from third-party feeds
  –  from the search engine’s catalogs
  –  from competitors’ indices using queries

•  Customer-selected samples

17

Page 18: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Different dimensions for metrics

•  Content types and sources
  –  news, blogs, wikipedia, forums, scholar; regions, languages; adult, spam, etc.

•  Site types
  –  small vs. large, region, language

•  Document formats
  –  html, pdf, etc.

•  Query types
  –  head, torso, tail; #terms; informational, navigational, transactional; celebrity, adult, business, research, etc.

•  Open web vs. hidden web
•  Organic vs. commercial
•  Dynamic vs. static content
•  New content vs. existing content

18

Page 19: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Further issues to consider

•  Rate limitations
  –  search engine blocking, hence the difficulty of large-scale competitive testing
  –  internal bandwidth usage limitations

•  Intrusiveness
  –  How can metrics queries affect what’s observed?

•  Statistical soundness
  –  in the methods used and guarantees provided
  –  accumulation of errors
  –  the “value” question
  –  E.g., what is “random”? Is “random” good enough?

•  Undesired positive feedback, or the chicken-and-egg problem
  –  Focus on popular queries may make them more popular at the expense of what’s potentially good for the future.

•  Controlled feedback, or labeled training and testing data
  –  Paid human judges (or editors), crowdsourcing (e.g., Amazon’s Mechanical Turk), Games with a Purpose (e.g., Dasdan et al., WWW’09), bucket testing on live traffic, etc.

19

Page 20: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Key problems for metrics

•  Measure user satisfaction
•  Compare two search engines
•  Optimize for user satisfaction in each component of the pipeline
•  Automate all metrics
•  Discover anomalies
•  Visualize, mine, and summarize metrics data
•  Debug problems automatically

20

Also see: Yahoo Research list at http://research.yahoo.com/ksc

Page 21: Web Search Engine Metrics for Measuring User Satisfaction

21

Relevance Metrics PART I

of WWW’09 Tutorial on Web Search Engine Metrics

by A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu

Page 22: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 22

Example on relevance

[Annotated screenshot of a search results page:]
•  Ad for gear. OK if I will go to the game.
•  No schedule here.
•  There is a schedule.
•  A different schedule?
•  A different Real Madrid!

Page 23: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 23

What is relevance?

•  The user issues a query to a search engine and receives an ordered list of results…

•  Relevance: How effectively was the user’s information need met?
  –  How useful were the results?
  –  How many of the retrieved results were useful?
  –  Were there any useful pages not retrieved?
  –  Did the order of the results make the user’s search easier or harder?
  –  How successfully did the search engine handle the ambiguity and the subjectivity of the query?

Page 24: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 24

Evaluating relevance

•  Set based evaluation
  –  basic but fundamental

•  Rank based evaluation with explicit absolute judgments
  –  binary vs. graded judgments

•  Rank based evaluation with explicit preference judgments
  –  binary vs. graded judgments
  –  practical system testing and incomplete judgments

•  Rank based evaluation with implicit judgments
  –  direct and indirect evaluation by clicks

•  User satisfaction

•  More notes

Page 25: Web Search Engine Metrics for Measuring User Satisfaction

25

Relevance Metrics: Set Based Evaluation

Page 26: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 26

Precision

–  True Positive (TP): A retrieved document is relevant
–  False Positive (FP): A retrieved document is not relevant

•  Kent et al. (1955)

Precision = (# relevant items retrieved) / (# retrieved items) = TP / (TP + FP) = Prob(relevant | retrieved)

•  How many of the retrieved results were useful?

Page 27: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 27

Recall

–  True Positive (TP): A retrieved document is relevant
–  False Negative (FN): A relevant document is not retrieved

•  Kent et al. (1955)

Recall = (# relevant items retrieved) / (# relevant items) = TP / (TP + FN) = Prob(retrieved | relevant)

•  Were there any useful pages left not retrieved?

Page 28: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 28

Properties of precision and recall

•  Precision decreases when false positives increase
•  False positives
  –  also known as false alarms in signal processing
  –  correspond to Type I errors in statistical hypothesis testing

•  Recall decreases when false negatives increase
•  False negatives
  –  also known as missed opportunities
  –  correspond to Type II errors in statistical hypothesis testing

Page 29: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 29

F-measure

•  Inconvenient to have two numbers
•  F-measure: harmonic mean of precision and recall
  –  related to van Rijsbergen’s effectiveness measure
  –  reflects the user’s willingness to trade precision for recall, controlled by a parameter selected by the system designer

F = 1 / (α·(1/P) + (1 − α)·(1/R)) = (β² + 1)·P·R / (β²·P + R), where α = 1 / (β² + 1)

F(β = 1) = 2·P·R / (P + R)
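To make the set-based definitions concrete, here is a minimal Python sketch (the function name and example numbers are ours, not from the tutorial) that computes precision, recall, and F_beta for one query from sets of retrieved and relevant document ids.

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Set-based precision, recall, and F-beta for one query.

    retrieved, relevant: collections of document ids; beta trades recall vs. precision.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)                      # relevant items that were retrieved
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (b2 + 1) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# Example: 10 results retrieved, 4 of the 6 relevant documents among them.
p, r, f1 = precision_recall_f(range(10), [0, 1, 3, 6, 11, 12])
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.4 0.67 0.5
```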

Page 30: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 30

Various means of precision and recall

[Chart: average of precision and recall as a function of precision, for fixed Recall = 70%, comparing the arithmetic mean, the geometric mean, F1, F2, and F0.5.]

Page 31: Web Search Engine Metrics for Measuring User Satisfaction

31

Relevance Metrics: Rank Based Evaluation with Explicit Absolute Judgments

Page 32: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 32

Extending precision and recall

•  So far, considered:
  –  How many of the retrieved results were useful?
  –  Were there any useful pages left not retrieved?

•  Next, consider:
  –  Did the order of the results make the user’s search for information easier or harder?

•  Extending set-based precision/recall to a ranked list
  –  It is possible to define many sets over a ranked list.
  –  E.g., start with a set containing the first result and progressively increase the size of the set by adding the next result.

•  Precision-recall curve:
  –  Calculate precision at standard recall levels and interpolate.

Page 33: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 33

Precision-recall curve example

rank  relevance  TP  FP  FN  recall  precision  interpolated precision
 1        1       1   0   3   0.25     1.00          1.00
 2        1       2   0   2   0.50     1.00          1.00
 3        0       2   1   2   0.50     0.67          0.75
 4        1       3   1   1   0.75     0.75          0.75
 5        0       3   2   1   0.75     0.60          0.60
 6        0       3   3   1   0.75     0.50          0.57
 7        1       4   3   0   1.00     0.57          0.57
 8        0       4   4   0   1.00     0.50          0.50
 9        0       4   5   0   1.00     0.44          0.44
10        0       4   6   0   1.00     0.40          0.40
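The table above can be reproduced with a short Python sketch (the binary judgment list is taken from the table; variable names are ours). The interpolated precision at a rank is the maximum precision at that rank or any deeper rank.

```python
# Precision/recall after each rank and the interpolated precision.
relevance = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]   # binary judgments by rank
total_relevant = sum(relevance)

precision, recall = [], []
tp = 0
for k, rel in enumerate(relevance, start=1):
    tp += rel
    precision.append(tp / k)
    recall.append(tp / total_relevant)

# Interpolated precision at rank k: max precision at rank k or deeper.
interpolated, best = [], 0.0
for p in reversed(precision):
    best = max(best, p)
    interpolated.append(best)
interpolated.reverse()

for k in range(len(relevance)):
    print(k + 1, relevance[k], round(recall[k], 2),
          round(precision[k], 2), round(interpolated[k], 2))
```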

Page 34: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 34

Precision-recall curve example

[Chart: precision and interpolated precision plotted against recall for the example above.]

Page 35: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 35

Average precision-recall curve

•  A precision-recall curve is for one ranked list (i.e., one query).

•  To evaluate the relevance of a search engine:
  –  Calculate interpolated precision-recall curves for a sample of queries at 11 points (Recall = 0.0, 0.1, …, 1.0).
  –  Average over the test sample of queries.

[Chart: 11-point interpolated precision-recall curve averaged over queries.]

Page 36: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 36

Mean average precision (MAP)

•  Single number instead of a graph
•  Measure of quality at all recall levels
•  Average precision for a single query:

AP = (1 / # relevant) × Σ_{k = 1 .. # relevant} (Precision at rank of the kth relevant document)

•  MAP: mean of average precision over all queries
  –  Most frequently, the arithmetic mean is used over the query sample.
  –  Sometimes, the geometric mean can be useful by putting emphasis on low-performing queries.

Page 37: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 37

Average precision example

rank  relevance  TP  FP  FN   R     P    P@rel(k)
 1        1       1   0   3  0.25  1.00    1.00
 2        1       2   0   2  0.50  1.00    1.00
 3        0       2   1   2  0.50  0.67    0
 4        1       3   1   1  0.75  0.75    0.75
 5        0       3   2   1  0.75  0.60    0
 6        0       3   3   1  0.75  0.50    0
 7        1       4   3   0  1.00  0.57    0.57
 8        0       4   4   0  1.00  0.50    0
 9        0       4   5   0  1.00  0.44    0
10        0       4   6   0  1.00  0.40    0

# relevant = 4        ave P = 0.83
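A minimal Python sketch of AP and MAP consistent with the definitions above (function names are ours); the example reproduces ave P = 0.83 for the relevance pattern in the table.

```python
def average_precision(relevance, total_relevant=None):
    """AP for one query; `relevance` is a binary judgment list ordered by rank.
    If relevant documents are missing from the list, pass their total count."""
    tp, precision_sum = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            tp += 1
            precision_sum += tp / k        # precision at the rank of each relevant doc
    n = total_relevant if total_relevant is not None else tp
    return precision_sum / n if n else 0.0

def mean_average_precision(relevance_lists):
    """MAP: arithmetic mean of AP over a sample of queries."""
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)

print(round(average_precision([1, 1, 0, 1, 0, 0, 1, 0, 0, 0]), 2))  # 0.83
```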

Page 38: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 38

Precision @ k

•  MAP evaluates precision at all recall levels.
•  In web search, the top portion of a result set is more important.
•  A natural alternative is to report precision at top k (e.g., top 10).
•  Problem:
  –  Not all queries will have more than k relevant results, so even a perfect system may score less than 1.0 for some queries.

Page 39: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 39

R-precision

•  Allan (2005)
•  Use a variable result-set cut-off for each query based on the number of its relevant results.
•  In this case, a perfect system can score 1.0 over all queries.
•  Official evaluation metric of the TREC HARD track
•  Highly correlated with MAP

Page 40: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 40

Mean reciprocal rank (MRR)

•  Voorhees (1999)
•  Reciprocal of the rank of the first relevant result, averaged over a population of queries
•  Possible to define it for entities other than explicit absolute relevance judgments (e.g., clicks; see implicit judgments later on)

MRR = (1 / # queries) × Σ_{q = 1 .. # queries} 1 / rank(1st relevant result of query q)
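A small Python sketch of MRR over a query sample (binary judgment lists assumed; queries with no relevant result contribute 0, one common convention).

```python
def mean_reciprocal_rank(relevance_lists):
    """MRR over a set of queries; each list holds binary judgments by rank."""
    reciprocal_ranks = []
    for relevance in relevance_lists:
        rank = next((k for k, rel in enumerate(relevance, start=1) if rel), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)  # 0 if nothing relevant returned
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0], [0, 0, 0]]))  # (1/2 + 1 + 0) / 3 = 0.5
```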

Page 41: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 41

Graded Relevance

•  So far, the evaluation methods did not measure satisfaction in the following aspects:
  –  How useful were the results?
     •  Do documents have grades of usefulness in meeting an information need?
  –  How successfully did the search engine handle the ambiguity and the subjectivity of the query?
     •  Is the information need of the user clear in the query?
     •  Do different users mean different things with the same query?

•  Can we cover these aspects by using graded relevance judgments instead of binary ones?
  –  very useful
  –  somewhat useful
  –  not useful

Page 42: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 42

Precision-recall curves

•  If we have grades of relevance, how can we modify some of the binary relevance measures?

•  Calculate precision-recall curves at each grade level (Järvelin and Kekäläinen (2000))

•  Informative, but too many curves to compare

Page 43: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 43

Discounted cumulative gain (DCG)

•  Järvelin and Kekäläinen (2002)
•  Gain adjustable for the importance of different relevance grades for user satisfaction
•  Discounting desirable for web ranking
  –  Most users don’t browse deep.
  –  Search engines truncate the list of results returned.

DCG = Σ_{r = 1 .. R} Gain(result@r) / log_b(r + 1)

  –  The gain is proportional to the utility of the result at rank r.
  –  The discount is proportional to the effort to reach the result at rank r.

Page 44: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 44

DCG example

•  Gain for various grades
  –  Very useful (V): 2
  –  Somewhat useful (S): 1
  –  Not useful (N): 0

•  E.g., results ordered as VSN:

DCG = 2/log2(1+1) + 1/log2(2+1) + 0/log2(3+1) = 2.63

•  E.g., results ordered as VNS:

DCG = 2/log2(1+1) + 0/log2(2+1) + 1/log2(3+1) = 2.50

Page 45: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 45

Normalized DCG (nDCG)

•  DCG yields unbounded scores. It is desirable for the best possible result set to have a score of 1.
•  For each query, divide the DCG by the best attainable DCG for that query.

•  E.g., VSN:

nDCG = 2.63 / 2.63 = 1.00

•  E.g., VNS:

nDCG = 2.50 / 2.63 = 0.95
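A minimal Python sketch of DCG and nDCG matching the worked example (gains V = 2, S = 1, N = 0 and log base 2 assumed; names are ours).

```python
import math

def dcg(gains, base=2):
    """DCG = sum over ranks r of gain_r / log_base(r + 1), with r starting at 1."""
    return sum(g / math.log(r + 1, base) for r, g in enumerate(gains, start=1))

def ndcg(gains):
    """Normalize by the DCG of the ideal (descending-gain) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Gains: V = 2, S = 1, N = 0 as in the example above.
print(round(dcg([2, 1, 0]), 2), round(ndcg([2, 1, 0]), 2))  # 2.63 1.0
print(round(dcg([2, 0, 1]), 2), round(ndcg([2, 0, 1]), 2))  # 2.5 0.95
```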

Page 46: Web Search Engine Metrics for Measuring User Satisfaction

46

Relevance Metrics: Rank Based Evaluation with Explicit Preference Judgments

Page 47: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 47

Kendall tau coefficient

•  Based on counts of preferences
  –  Preference judgments are cheaper and easier/cleaner than absolute judgments.
  –  But one may need to deal with circular preferences.

•  Range in [-1, 1]
  –  τ = 1 when all preferences are in agreement
  –  τ = -1 when all disagree

•  Robust for incomplete judgments
  –  Just use the known set of preferences.

τ = (A − D) / (A + D), where A = # preferences in agreement and D = # preferences in disagreement
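A small Python sketch that computes τ from two rankings by counting concordant and discordant pairs over the items both rankings contain (names are ours).

```python
from itertools import combinations

def kendall_tau(ranking_a, ranking_b):
    """tau = (A - D) / (A + D) over item pairs ranked by both lists."""
    common = set(ranking_a) & set(ranking_b)
    pos_a = {item: r for r, item in enumerate(ranking_a)}
    pos_b = {item: r for r, item in enumerate(ranking_b)}
    agree = disagree = 0
    for x, y in combinations(common, 2):
        same_order = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0
        agree += same_order
        disagree += not same_order
    return (agree - disagree) / (agree + disagree) if (agree + disagree) else 0.0

print(kendall_tau(["a", "b", "c"], ["a", "b", "c"]))  # 1.0
print(kendall_tau(["a", "b", "c"], ["c", "b", "a"]))  # -1.0
```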

Page 48: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 48

Binary preference (bpref)

•  Buckley and Voorhees (2004)
•  Designed in particular for incomplete judgments
•  Similar to some other relevance metrics (MAP)
•  Can be generalized to graded judgments

For a query with R judged relevant results:

bpref = (1/R) × Σ_{r ∈ relevant} (1 − N_r / R) ∝ A / (A + D)

where N_r = # of judged non-relevant docs ranked above relevant doc r, counted among the first R judged non-relevant docs.

Page 49: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 49

Bpref example

rank  relevance  N_r  R  summand = 1 − N_r/R
 1        0
 2        1        1   3     0.66
 3       NA
 4        1        1   3     0.66
 5       NA
 6        0
 7        0
 8        0
 9        1        3   3     0
10        0

# relevant = 3        bpref = 0.44
# non-relevant = 5
# unjudged = 2
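A minimal Python sketch of bpref that reproduces the example above; judgments are 1 (relevant), 0 (non-relevant), or None (unjudged), and the count of non-relevant docs above each relevant doc is capped at R as in the formula.

```python
def bpref(judgments):
    """bpref from a ranked list of judgments: 1 = relevant, 0 = non-relevant,
    None = unjudged. For each relevant doc, count the judged non-relevant docs
    ranked above it, capped at R (the number of judged relevant docs)."""
    R = sum(1 for j in judgments if j == 1)
    if R == 0:
        return 0.0
    total, nonrel_above = 0.0, 0
    for j in judgments:
        if j == 0:
            nonrel_above += 1
        elif j == 1:
            total += 1.0 - min(nonrel_above, R) / R
    return total / R

print(round(bpref([0, 1, None, 1, None, 0, 0, 0, 1, 0]), 2))  # 0.44
```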

Page 50: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 50

Generalization of bpref to graded judgments - rpref

•  De Beer and Moens (2006)
•  Graded-relevance version of bpref
•  Sakai (2007) gives a corrected version expressed in terms of cumulative gain:

rpref_relative(R) = (1 / CG_ideal(R)) × Σ_{r ≤ R, g(r) > 0} g(r) × (1 − penalty(r) / N_r)

penalty(r) = Σ_{i < r, g(i) < g(r)} (g(r) − g(i)) / g(r)

where g(r) is the relevance gain of the result at rank r, N_r is the # of judged docs above rank r, CG_ideal(R) is the ideal cumulative gain at depth R, and penalty(r) is a soft count of out-of-order pairs.

Page 51: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 51

Practical system testing with incomplete judgments

•  Comparing two search engines in practice
  –  Scrape the top-k result sets for a sample of queries
  –  Calculate any of the metrics above for each engine and compare using a statistical test (e.g., a paired t-test)

•  Need judgments
•  Use existing judgments
•  What to do if judgments are missing?
•  Use a metric robust to missing judgments

Page 52: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 52

Comparing various metrics under incomplete judgment scenario

•  Sakai (2007) simulates incomplete judgments by sampling from pooled judgments
  –  Stratified sampling yields various levels of completeness, from 100% down to 10%.

•  Then tests bpref, rpref, MAP, Q-measure, and normalized DCG (nDCG)
  –  Q-measure is similar to rpref (see Sakai (2007))
  –  Since all but the first two were originally designed for complete judgments, he tests two versions of them:
     •  one that assumes results with missing judgments are non-relevant,
     •  and another computed on condensed lists, obtained by removing results with missing judgments.

•  nDCG with incomplete absolute judgments
  –  As in average-precision-based measures, one can ignore the unjudged documents when using normalized DCG.

Page 53: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 53

Robustness of evaluation with incomplete judgments

•  Among the original methods, only bpref and rpref stay stable with increasing incompleteness.

•  nDCG, Q-measure, and MAP computed on condensed lists also perform well.
  –  Furthermore, they have more discriminative power.

•  Graded relevance metrics are more robust than binary metrics to incompleteness.

•  nDCG and Q-measure on condensed lists are the best metrics.

Page 54: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 54

Average precision based rank correlation

•  Yilmaz, Aslam, and Robertson (2008)
•  Kendall tau rank correlation as a random variable
  –  Pick a pair of items at random.
  –  Define p: return 1 if the pair is in the same order in both lists, 0 otherwise.

•  Rank correlation based on average precision as a random variable
  –  Pick an item at random from the 1st list (other than the top item).
  –  Pick another document at random above the current one.
  –  Define p′: return 1 if this pair is in the same relevance order in the 2nd list, 0 otherwise.

•  Agreement at the top of the list is rewarded.

τ = (A − D) / (A + D) = p − (1 − p) = 2p − 1

Page 55: Web Search Engine Metrics for Measuring User Satisfaction

55

Relevance Metrics: Rank Based Evaluation with Implicit Judgments

Page 56: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 56

Implicit judgments from clicks

•  Explicit judgments are expensive.
•  A search engine has lots of user interaction data:
  –  which results were viewed for a query, and
  –  which of those received clicks.

•  Can we obtain implicit judgments of satisfaction or relevance from clicks?
  –  Clicks are highly biased:
     •  presentation details (order of results, attractiveness of abstracts)
     •  trust and other subtle aspects of the user’s need
  –  Not impossible; some innovative methods are emerging.

•  Pros: cheap; better model of ambiguity and subjectivity
•  Cons: noisy and retroactive (may expose poor-quality search engines to live traffic)

Page 57: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 57

Performance metrics from user logs

•  A naïve way to utilize user interaction data is to compute basic statistics from raw observations:
  –  abandonment rate
  –  reformulation rate
  –  number of queries per session
  –  clicks per query
  –  mean reciprocal rank of clicked results
  –  time to first or last click

•  Intuitive, but not clear how sensitive these metrics are to what we want to measure

Page 58: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 58

Implicit preference judgments from clicks

•  Joachims (2002)
•  Radlinski and Joachims (2005)
•  These are document-level preference judgments and have not been used in evaluation.

[Diagram: two ranked lists A, B, C with click/skip patterns.
 Left: A skipped, B skipped, C clicked ⇒ infer C > A and C > B.
 Right: A clicked, B skipped, C skipped ⇒ infer A > B.]

Page 59: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 59

Direct evaluation by clicks

•  Randomly interleave the two result sets to be compared.
  –  Take the same number of links from the top of each result set.
  –  More clicks on links from one result set indicates a preference for it.

•  Balanced interleaving (Joachims (2003))
  –  Determine randomly which side goes first at the start.
  –  Pick the next available result from the side whose turn it is, removing duplicates.
  –  Caution: biased when the two result sets are nearly identical.

•  Team-draft interleaving (Radlinski et al. (2008)) – see the sketch below
  –  Determine randomly which side goes first at each round.
  –  Pick the next available result from the side whose turn it is, removing duplicates.

•  Effectively removes the rank bias, but not directly applicable to evaluation of multi-page sessions.
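A minimal Python sketch of team-draft interleaving and click crediting (the function names, coin-flip details, and example lists are ours, not from the cited papers).

```python
import random

def team_draft_interleave(results_a, results_b, k=10):
    """Team-draft interleaving (sketch): each round a coin flip decides which team
    drafts first; a team adds its highest-ranked result not already shown."""
    shown, team_of = [], {}

    def draft(results, team):
        for doc in results:
            if doc not in team_of:          # skip duplicates already shown
                team_of[doc] = team
                shown.append(doc)
                return True
        return False

    while len(shown) < k:
        order = [("A", results_a), ("B", results_b)]
        if random.random() < 0.5:
            order.reverse()
        progressed = False
        for team, results in order:
            if len(shown) < k:
                progressed |= draft(results, team)
        if not progressed:                  # both lists exhausted
            break
    return shown, team_of

def click_credit(team_of, clicked):
    """Credit each click to the team that contributed the clicked result."""
    credit = {"A": 0, "B": 0}
    for doc in clicked:
        if doc in team_of:
            credit[team_of[doc]] += 1
    return credit

shown, team_of = team_draft_interleave(["a", "b", "c", "d"], ["b", "e", "a", "f"], k=6)
print(shown, click_credit(team_of, clicked=["b", "e"]))
```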

Page 60: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 60

Interleaving example

[Table: per-rank comparison of two engines A and B (results a–k), the balanced-interleave orderings for “A first” and “B first”, and two team-draft-interleave examples showing which team (A or B) contributed the result at each rank.]

Page 61: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 61

Indirect evaluation by clicks

•  Carterette and Jones (2007)
•  Relevance as a multinomial random variable: P(R_i = grade_j)

•  Model absolute judgments by clicks:

p(R | q, c) = Π_{i = 1 .. N} p(R_i | q, c)

log [ p(R > g_j | q, c) / p(R ≤ g_j | q, c) ] = α_j + β_q + Σ_{i = 1 .. N} β_i c_i + Σ_{i < k} β_ik c_i c_k

•  Expected DCG (incomplete judgments are OK):

E[DCG_N] = E[R_1] + Σ_{i = 2 .. N} E[R_i] / log2(i)

Page 62: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 62

Indirect evaluation by clicks (cont’d)

•  Comparing two search engines:

E[ΔDCG] = E[DCG_A] − E[DCG_B]

•  Predict whether the difference is statistically significant, e.g. P(ΔDCG < 0) ≥ 0.95
  –  use Monte Carlo simulation

•  Can improve confidence by asking for labels where |E[G_i^A] − E[G_i^B]| is largest, with

G_i = R_i if rank(i) = 1, and G_i = R_i / log2(rank(i)) otherwise

•  Efficient, but effectiveness depends on the quality of the relevance model obtained from the clicks.

Page 63: Web Search Engine Metrics for Measuring User Satisfaction

63

Relevance Metrics: User Satisfaction

Page 64: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 64

Relevance evaluation and user satisfaction

•  So far, we focused on the evaluation method rather than the entity (i.e., user satisfaction) to be evaluated.

•  Subtle and salient aspects of user satisfaction are difficult for traditional relevance evaluation.
  –  E.g., trust, expectation, patience, ambiguity, subjectivity
  –  Explicit absolute or preference judgments are not very successful in addressing all aspects at once.
  –  Implicit judgment models get one step closer to user satisfaction by incorporating user feedback.

•  The popular IR relevance metrics are not strongly based on user tasks and experiences.
  –  Turpin and Scholer (2006): precision-based metrics such as MAP fail to assess user satisfaction on tasks targeting recall.

Page 65: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 65

Modeling user satisfaction

•  Huffman and Hochster (2007)
•  Obtain explicit judgments of true satisfaction over a sample of sessions or any other grain.
•  Develop a predictive model based on observable statistics:
  –  explicit absolute relevance judgments
  –  number of user actions in a session
  –  query classification
•  Carry out correlation analysis
•  Pros: more direct than many other evaluation metrics
•  Cons: more exploratory than a usable metric at this stage

Page 66: Web Search Engine Metrics for Measuring User Satisfaction

66

Relevance Metrics: More Notes

Page 67: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 67

Relevance through search system components

•  Relevance can explicitly be measured for each search system component (Dasdan and Drome (2009)).
  –  Use set-based evaluation for the WWW, catalog, and database tiers.
     •  Rank-based evaluation can be used if the sampled subset is ordered by explicit judgments or by an order inferred from a downstream component.
  –  Yields approximate upper bounds
  –  Use rank-based evaluation for the candidate document list and the result set.

•  Useful for quantifying and monitoring the relevance gap
  –  inter-system relevance gap by comparing different system stages
  –  intra-system relevance gap by comparing against external benchmarks

[Diagram: query → WWW → crawl → catalog → index tiers 1..N → selection (candidate doc list) → ranking (result set)]

Page 68: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 68

Where to find more

•  Traditional relevance metrics have deep roots in information retrieval
  –  Cranfield experiments (Cleverdon (1991))
  –  SMART (Salton (1991))
  –  TREC (Voorhees and Harman (2005))

•  Modern metrics address cost and noise by using statistical inference in more advanced ways.

•  For more on relevance evaluation, see
  –  Manning, Raghavan, and Schütze (2008)
  –  Croft, Metzler, and Strohman (2009)

•  For more on the user dimension, see
  –  Baeza-Yates and Ribeiro-Neto (1999)
  –  Spink and Cole (2005)

Page 69: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 69

References 1/2

•  J. Allan (2005), HARD track overview in TREC 2005: High accuracy retrieval from documents.
•  R. Baeza-Yates and B. Ribeiro-Neto (1999), Modern Information Retrieval, Addison-Wesley.
•  C. Buckley and E.M. Voorhees (2004), Retrieval Evaluation with Incomplete Information, SIGIR’04.
•  B. Carterette and R. Jones (2007), Evaluating Search Engines by Modeling the Relationship Between Relevance and Clicks, NIPS’07.
•  C.W. Cleverdon (1991), The significance of the Cranfield tests on index languages, SIGIR’91.
•  B. Croft, D. Metzler, and T. Strohman (2009), Search Engines: Information Retrieval in Practice, Addison-Wesley.
•  A. Dasdan and C. Drome (2008), Measuring Relevance Loss of Search Engine Components, submitted.
•  J. De Beer and M.-F. Moens (2006), Rpref - A Generalization of Bpref towards Graded Relevance Judgments, SIGIR’06.
•  S.B. Huffman and M. Hochster (2007), How Well does Result Relevance Predict Session Satisfaction? SIGIR’07.
•  K. Järvelin and J. Kekäläinen (2000), IR evaluation methods for retrieving highly relevant documents, SIGIR’00.
•  K. Järvelin and J. Kekäläinen (2002), Cumulated Gain-Based Evaluation of IR Techniques, ACM Trans. IS 20(4):422-446.
•  T. Joachims (2002), Optimizing Search Engines using Clickthrough Data, SIGKDD’02.
•  T. Joachims (2003), Evaluating Retrieval Performance using Clickthrough Data, in J. Franke et al. (eds.), Text Mining, Physica Verlag.

Page 70: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 70

References 2/2

•  A. Kent, M.M. Berry, F.U. Luehrs Jr., and J.W. Perry (1955), Machine literature searching VIII. Operational criteria for designing information retrieval systems, American Documentation 6(2):93-101.
•  C. Manning, P. Raghavan, and H. Schütze (2008), Introduction to Information Retrieval, Cambridge University Press.
•  F. Radlinski, M. Kurup, and T. Joachims (2008), How Does Clickthrough Data Reflect Retrieval Quality? CIKM’08.
•  F. Radlinski and T. Joachims (2005), Evaluating the Robustness of Learning from Implicit Feedback, ICML’05.
•  T. Sakai (2007), Alternatives to Bpref, SIGIR’07.
•  G. Salton (1991), The SMART project in automatic document retrieval, SIGIR’91.
•  A. Spink and C. Cole (eds.) (2005), New Directions in Cognitive Information Retrieval, Springer.
•  A. Turpin and F. Scholer (2006), User performance versus precision measures for simple search tasks, SIGIR’06.
•  C.J. van Rijsbergen (1979), Information Retrieval (2nd ed.), Butterworth.
•  E.M. Voorhees and D. Harman (eds.) (2005), TREC: Experiment and Evaluation in Information Retrieval, MIT Press.
•  E.M. Voorhees (1999), TREC-8 Question Answering Track Report.
•  E. Yilmaz and J. Aslam (2006), Estimating Average Precision with Incomplete and Imperfect Judgments, CIKM’06.
•  E. Yilmaz, J. Aslam, and S. Robertson (2008), A New Rank Correlation Coefficient for Information Retrieval, SIGIR’08.

Page 71: Web Search Engine Metrics for Measuring User Satisfaction

71

Coverage Metrics PART II

of WWW’09 Tutorial on Web Search Engine Metrics

by A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu

Page 72: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Example on coverage: Heard some interesting news; decided to search

72

Page 73: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Example on coverage: URL was not found

73

Page 74: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Example on coverage: But content was found under different URLs

74

Page 75: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Example on coverage: URL was also found after some time

75

Page 76: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Definitions for coverage

•  Coverage refers to the presence of content of interest in a catalog.

•  Coverage ratio
  –  defined as the ratio of the number of documents (pages) found to the number of documents (pages) tested
  –  can be represented as a distribution when many document attributes are considered together

76

Page 77: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Some background: Shingling and Jaccard Index

77

Doc = (a b c d e) (5 terms)
2-grams: (a b, b c, c d, d e)
Shingles for the 2-grams (after hashing them): 10, 3, 7, 16
Min shingle: 3 (used as a signature of Doc)

Doc1 = (a b c d e), Doc2 = (a e f g)
Doc1 ∩ Doc2 = (a e)
Doc1 ∪ Doc2 = (a b c d e f g)

Jaccard index = |Doc1 ∩ Doc2| / |Doc1 ∪ Doc2| = 2 / 7 ≈ 30% (Shingling estimates this index.)
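A small Python sketch of hashed shingles and the Jaccard index for the toy documents above (the hash function and modulus are arbitrary choices for illustration; names are ours).

```python
import hashlib

def hashed_shingles(terms, n=2):
    """Hash each n-gram of consecutive terms to a small integer shingle."""
    grams = [" ".join(terms[i:i + n]) for i in range(len(terms) - n + 1)]
    return {int(hashlib.md5(g.encode()).hexdigest(), 16) % 1000 for g in grams}

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over two shingle (or term) sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

doc1, doc2 = "a b c d e".split(), "a e f g".split()
print(round(jaccard(set(doc1), set(doc2)), 2))   # 0.29, i.e. 2/7 as above

s1, s2 = hashed_shingles(doc1), hashed_shingles(doc2)
print(min(s1), round(jaccard(s1, s2), 2))        # min shingle as signature; shingle-level similarity
```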

Page 78: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

How to measure coverage

•  Given an input document with its URL

•  Query by URL (QBU)
  –  enter the URL at the target search engine’s query interface
  –  if the URL is not found, then iterate using “normalized” forms of the same URL

•  Query by content (QBC)
  –  if the URL is not given or the URL search has failed, then perform this search
  –  generate a set of queries (called strong queries) from the document
  –  submit the queries to the target search engine’s query interface
  –  combine the returned results
  –  perform a more thorough similarity check between the returned documents and the input document

•  Compute the coverage ratio over multiple documents (see the sketch after this slide)

78
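A sketch of the coverage-ratio loop under the QBU/QBC scheme above; find_by_url, find_by_content, and similar are hypothetical engine-specific hooks (e.g., QBU via the query interface, QBC via strong queries, shingle-based similarity), not a real API.

```python
def coverage_ratio(documents, find_by_url, find_by_content, similar):
    """Fraction of sampled documents found in the target engine.

    documents: dicts with at least a 'url' key.
    find_by_url(url) -> bool, find_by_content(doc) -> candidate docs, and
    similar(a, b) -> bool are caller-supplied hooks."""
    found = 0
    for doc in documents:
        if find_by_url(doc["url"]):
            found += 1                                          # query by URL succeeded
        elif any(similar(doc, cand) for cand in find_by_content(doc)):
            found += 1                                          # query by content succeeded
    return found / len(documents) if documents else 0.0
```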

Page 79: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Query-by-Content flowchart

79

[Flowchart with steps: string signature (terms from the page); strings combined into queries; search results extraction; similarity check using shingles]

Page 80: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Query by content: How to generate queries

•  Select sequences of terms randomly
  –  find the document’s shingle signature
  –  find the corresponding sequences of terms
  –  This method can produce the same query signature for the same document, as opposed to just selecting random sequences of terms from the document.

•  Select sequences of terms by frequency
  –  terms with the lowest frequency or the highest TF-IDF

•  Select sequences of terms by position
  –  +/- two terms at every 5th term

80

Page 81: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Further issues to consider

•  URL normalization
  –  see Dasgupta, Kumar, and Sasturkar (2008)

•  Page templates and ads
  –  or how to avoid undesired matches

•  Search for non-textual content
  –  images, mathematical formulas, tables, and other similar structures

•  Definition of content similarity
•  Syntactic vs. semantic match
•  How to balance coverage against other objectives

81

Page 82: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Key problems

•  Measure web growth in general and along any dimension
•  Compare search engines automatically and reliably
•  Improve content-based search, including semantic-similarity search
•  Improve copy detection methods for quality and performance, including URL-based copy detection

82

Page 83: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Reference review on coverage metrics

•  Luhn (1957)
  –  summarizes an input document by selecting terms or sentences by frequency
  –  Bharat and Broder (1998) discovered the same method independently for a different purpose

•  Bar-Yossef and Gurevich (2008)
  –  introduces improved methods to randomly sample pages from a search engine’s index using its public query interface, a problem introduced by Bharat and Broder (1998)

•  Dasdan et al. (2008), Pereira and Ziviani (2004)
  –  represent an input document by selecting (sequences of) terms randomly or by frequency
  –  use the term-based document signature as queries (called strong queries) for similarity search
  –  Yang et al. (2009) proposes similar methods for blog search

83

Page 84: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

References

•  Z. Bar-Yossef and M. Gurevich (2008), Random sampling from a search engine’s index, J. ACM, 55(5).

•  K. Bharat, A. Broder (1998), A technique for measuring the relative size and overlap of public Web search engines, WWW’98.

•  S. Brin, J. Davis, and H. Garcia-Molina (1995), Copy detection mechanisms for digital documents, SIGMOD’95.

•  A. Dasdan, P. D’Alberto, C. Drome, and S. Kolay (2008), Automating retrieval for similar content using search engine query interface, submitted.

•  A. Dasgupta, R. Kumar, and A. Sasturkar (2008), De-duping URLs via Rewrite Rules, KDD’08.

•  H. Luhn (1957), A statistical approach to mechanized encoding and searching of literary information, IBM J. Research and Dev., 1(4):309–317.

•  H. P. Luhn (1958), The automatic creation of literature abstracts, IBM J. Research and Dev., 2(2).

•  A.R. Pereira Jr. and N. Ziviani (2004), Retrieving similar documents from the Web, J. Web Engineering, 2(4):247-261.

•  Y. Yang, N. Bansal, W. Dakka, P. Ipeirotis, N. Koudas, D. Papadias (2009), Query by document, WSDM’09.

84

Page 85: Web Search Engine Metrics for Measuring User Satisfaction

85

Diversity Metrics PART III

of WWW’09 Tutorial on Web Search Engine Metrics

by A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu

Page 86: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Example on diversity: Long query

86

[Screenshot annotation: Every result is about the same news.]

Page 87: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Example on diversity: Long query

87

[Screenshot annotation: More diverse.]

Page 88: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Example on diversity: Ambiguous query [stanford]

88

See http://en.wikipedia.org/wiki/Stanford_(disambiguation)

Page 89: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Example on diversity: Ambiguous query [stanford]

89

See http://en.wikipedia.org/wiki/Stanford_(disambiguation)

Page 90: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Definitions for diversity

•  Diversity
  –  related to the breadth of the content
  –  also related to the quantification of “concepts” in a set of documents, or the quantification of query disambiguation or query intent

•  Closely tied to relevance and redundancy
  –  excluding near-duplicate results

•  May have implications for search engine interfaces too
  –  e.g., clustered or faceted presentations

90

Page 91: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

How to measure diversity

•  Method #1:
  –  get editorial judgments as to the degree of diversity in a catalog

•  Method #2:
  –  use the number of content or source types for the documents in a catalog
  –  find the set of concepts in a catalog and measure diversity based on their relationships
     •  e.g., cluster using document similarity and assign a concept to each cluster

•  Method #3 (with a given relevance metric; see the sketch after this slide):
  –  iterate over each intent of the input query
  –  consider the sets of documents relevant to each intent
  –  weight the given relevance metric by the probability of each intent

91
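Method #3 can be sketched as an intent-weighted version of any relevance metric; the intent probabilities, relevant sets, and choice of metric below are illustrative assumptions, not values from the tutorial.

```python
def intent_weighted_metric(results, intents, metric):
    """Weight a relevance metric by the probability of each query intent.

    intents: list of (probability, relevant_set) pairs, one per intent.
    metric(results, relevant_set) -> score, e.g. precision@k or average precision."""
    return sum(p * metric(results, relevant) for p, relevant in intents)

def precision_at_k(results, relevant, k=10):
    top = results[:k]
    return sum(1 for d in top if d in relevant) / len(top) if top else 0.0

# Example: query [stanford] with two intents (university vs. town).
results = ["u1", "u2", "t1", "u3", "t2"]
intents = [(0.8, {"u1", "u2", "u3", "u4"}), (0.2, {"t1", "t2"})]
print(intent_weighted_metric(results, intents,
                             lambda r, rel: precision_at_k(r, rel, k=5)))
# 0.8 * 3/5 + 0.2 * 2/5 = 0.56
```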

Page 92: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

How to measure diversity: Example

92

•  Types: news, organic, rich, ads
•  Sources for 10 organic results:
  •  4 domains
•  Themes for organic results:
  •  6 for Stanford University related
  •  1 for Stanfords restaurant related
  •  1 for Stanford, MT related
  •  1 for Stanford, KY related
•  Detailed themes for organic results:
  •  2 for general Stanford U. intro
  •  1 for Stanford athletics
  •  1 for Stanford medical school
  •  1 for Stanford business school
  •  1 for Stanford news
  •  1 for Stanford green buildings
  •  1 for Stanfords restaurant
  •  1 for Stanford, MT high school
  •  1 for Stanford, KY fire department

Page 93: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Further issues to consider

•  Categorization and similarity methods
  –  for documents, queries, sites

•  Presentation issues
  –  single page, clusters, facets, term cloud

•  Summarizing diversity

•  How to balance diversity against other objectives
  –  diversity vs. relevance in particular

93

Page 94: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Key problems

•  Measure and summarize diversity better
•  Measure tradeoffs between diversity and relevance better
•  Determine the best presentation of diversity

Page 95: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Reference review on diversity metrics

•  Goldstein and Carbonell (1998) –  defines maximizal marginal relevance as a parameterized linear combination of novelty and

relevance •  novelty: measured via the similarity among documents (to avoid redundancy) •  relevance: measured via the similarity between documents and the query

•  Jain, Sarda, and Haritsa (2003); Chen and Karger (2006); Joachims et al. (2008); and Swaminathan et al. (2008)

–  iteratively expand a document set to maximize marginal gain –  each time add a new relevant document that is least similar to the existing set –  Joachims et al. (2008) address the learning aspect.

•  Radlinski and Dumais (2006) –  diversifies search results using relevant results to the input query and queries related to it

•  Agrawal et al. (2009) –  diversifies search results using a taxonomy for classifying queries and documents –  also reviews diversity metrics and proposes new ones

•  Gollapudi and Sharma (2009) –  proposes an axiomatization of result diversification (similar to similar recent efforts for ranking

and clustering) and proves the impossibility of satisfying all properties –  enumerates a set of diversification functions satisfying different subsets of properties

•  Metrics to measure diversity of a given set of results are proposed by Chen and Karger (2006), Clarke et al. (2008), and Agrawal et al. (2009).

95

Page 96: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

References

•  R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong (2009), Diversifying search results, WSDM’09.

•  H. Chen and D.R. Karger (2006), Less is more: Probabilistic models for retrieving fewer relevant documents, SIGIR’06.

•  C.L.A. Clarke, M. Kolla, G.V. Cormack, O. Vechtomova, A. Ashkan, S. Büttcher, and I. MacKinnon (2008), Novelty and diversity in information retrieval evaluation, SIGIR’08.

•  J. Goldstein and J. Carbonell (1998), Summarization: (1) Using MMR for Diversity-based Reranking and (2) Evaluating Summaries, SIGIR’98.

•  S. Gollapudi and A. Sharma (2009), An axiomatic approach for result diversification, WWW’09.

•  A. Jain, P. Sarda, and J.R. Haritsa (2003), Providing Diversity in K-Nearest Neighbor Query Results, CoRR’03.

•  R. Kleinberg, F. Radlinski, and T. Joachims (2008), Learning Diverse Rankings with Multi-armed Bandits, ICML’08.

•  F. Radlinski and S.T. Dumais (2006), Improving personalized web search using result diversification, SIGIR’06.

•  A. Swaminathan, C. Mathew, and D. Kirovski (2008), Essential pages, MSR-TR-2008-015, Microsoft Research.

•  Y. Yue, and T. Joachims (2008), Predicting Diverse Subsets Using Structural SVMs, ICML’08.

•  C. Zhai and J.D. Lafferty (2006), A risk minimization framework for information retrieval, Info. Proc. and Management, 42(1):31-55.

96

Page 97: Web Search Engine Metrics for Measuring User Satisfaction

97

Discovery and Latency Metrics

PART IV of

WWW’09 Tutorial on Web Search Engine Metrics by

A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu

Page 98: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Example on discovery: Page was born ~30 minutes before

98

Page 99: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Example on discovery: URL of page was not found

99

Page 100: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Example on discovery: But content existed under different URLs

100

Page 101: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Example on discovery: URL was also found after ~1 hr

101

Page 102: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Life of a URL

102

[Timeline of a URL: BORN → DISCOVERED → NOW → EXPIRED, with the LATENCY and AGE intervals marked along the TIME axis.]

Page 103: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Lives of many URLs

103

[Timeline of many URLs: each URL is BORN → DISCOVERED → NOW → EXPIRED, with a LATENCY interval per URL and AGE marked along the TIME axis.]

Page 104: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

How to measure discovery and latency

•  Consider a sample of new pages on the Web
  –  Feeds at regular intervals
  –  Each sample monitored for a period (e.g., 15 days)

•  User view
  –  Discovery: measure how many of these new pages are in the search results
     •  using the coverage ratio formula
  –  Latency: measure how long it took to get these new pages into the search results

•  System view (see the sketch after this slide)
  –  Discovery: measure how many of these new pages are in a catalog
  –  Latency: measure how long it took to get these new pages into a catalog

104
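A sketch of the discovery-ratio and latency computation over a monitored sample; the 'born' timestamps and the lookup hook (when the engine or catalog first returned the URL) are hypothetical inputs, not part of any real API.

```python
from datetime import timedelta

def discovery_and_latency(sample, lookup, horizon_days=15):
    """Discovery ratio and average latency over a sample of new pages.

    sample: dicts with 'url' and 'born' (datetime when the page appeared).
    lookup(url) -> datetime when the URL was first found, or None if never
    found within the monitoring horizon."""
    horizon = timedelta(days=horizon_days)
    latencies, discovered = [], 0
    for page in sample:
        found_at = lookup(page["url"])
        if found_at is not None and found_at - page["born"] <= horizon:
            discovered += 1
            latencies.append(found_at - page["born"])
    ratio = discovered / len(sample) if sample else 0.0
    avg_latency = sum(latencies, timedelta()) / len(latencies) if latencies else None
    return ratio, avg_latency
```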

Page 105: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Discovery profile of a search engine component: Overview

105

[Chart: discovery (coverage) percentage over time, computed over many URLs per search engine component; annotations mark the time to reach a certain coverage percentage, convergence, content that has not yet expired vs. content that has expired, and other behaviors.]

Page 106: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Discovery profiles and monitoring: Examples

106

[Charts: example discovery profiles and monitoring of the profile parameters over time.]

Page 107: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Latency profiles of a search engine component: Overview

107

[Chart: latency distribution over many URLs, per search engine component; annotations mark the desired skewness direction and that latency should be close to zero for crawlers.]

Page 108: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Latency profiles and monitoring: Examples

108

[Charts: example latency profiles and monitoring of the profile parameters over time.]

Page 109: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Further issues to consider

•  How to discover samples to measure discovery and latency
•  How to beat crawlers to acquire samples
•  Discovery of top-level pages
•  Discovery of deep links
•  Discovery of hidden web content
•  How to balance discovery against other objectives

109

Page 110: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Key problems

•  Predict content changes on the Web
•  Discover new content almost instantaneously
•  Reduce latency per search engine component and overall

110

Page 111: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Reference review on discovery metrics

•  Cho, Garcia-Molina, and Page (1998)
  –  discusses how to order URL accesses based on importance scores
     •  importance: PageRank (best), link count, similarity to the query in anchortext or URL string, attributes of the URL string

•  Dasgupta et al. (2007)
  –  formulates the problem of discoverability (discover new content from the fewest number of known pages) and proposes approximation algorithms

•  Kim and Kang (2007)
  –  compares the top three search engines for discovery (called “timeliness”), freshness, and latency

•  Lewandowski (2008)
  –  compares the top three search engines for freshness and latency

•  Dasdan and Drome (2009)
  –  proposes discovery metrics along the lines discussed in this section

111

Page 112: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

References

•  J. Cho, H. Garcia-Molina, and L. Page (1998), Efficient Crawling Through URL Ordering, Computer Networks and ISDN Systems, 30(1-7):161-172.

•  A. Dasdan and C. Drome (2009), Discovery coverage: Measuring how fast content is discovered by search engines, submitted.

•  A. Dasgupta, A. Ghosh, R. Kumar, C. Olston, S. Pandey, and A. Tomkins (2007), The discoverability of the Web, WWW’07.

•  J. Dean (2009), Challenges in building large-scale information retrieval systems, WSDM’09.

•  N. Eiron, K.S. McCurley, and J.A. Tomlin (2004), Ranking the Web frontier, WWW’04.

•  C. Grimes, D. Ford, and E. Tassone (2008), Keeping a search engine index fresh: Risk and optimality in estimating refresh rates for web pages, INTERFACE’08.

•  Y.S. Kim and B.H. Kang (2007), Coverage and timeliness analysis of search engines with webpage monitoring results, WISE’07.

•  D. Lewandowski (2008), A three-year study on the freshness of Web search engine databases, to appear in J. Info. Syst., 2008.

112

Page 113: Web Search Engine Metrics for Measuring User Satisfaction

113

Freshness Metrics PART V

of WWW’09 Tutorial on Web Search Engine Metrics

by A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu

Page 114: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Example on freshness: Stale abstract in Search Results Page

114

Page 115: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Example on freshness: Actual page content

115

http://en.wikipedia.org/wiki/John_Yoo:

Page 116: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Example on freshness: Fresh abstract now

116

Page 117: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Definitions illustrated for a page

117

(Dasdan and Huynh, WWW’09)

[Timeline over times 0–6: the page is CRAWLED (last sync) at time 0, then INDEXED, MODIFIED twice, and CLICKED. The page is up-to-date (fresh) until time 3, the first modification after the last sync; by the end of the timeline its AGE = 3.]

Page 118: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Definitions illustrated for a page

118

[Chart: the same timeline (CRAWLED, INDEXED, MODIFIED, MODIFIED, CLICKED over times 0–6) with FRESHNESS (1 while the page is up-to-date, 0 afterwards) and AGE (0 while fresh, then growing to 3) plotted against TIME.]

(Dasdan and Huynh, WWW’09)

Page 119: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Freshness and age of a page

•  The freshness F(p, t) of a local page p at time t is
  –  1 if p is up-to-date at time t
  –  0 otherwise

•  The age A(p, t) of a local page p at time t is
  –  0 if p is up-to-date at time t
  –  t − tmod otherwise, where tmod is the time of the first modification after the last sync of p

119

Page 120: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Freshness and age of a catalog

•  S: catalog of documents
•  Sc: catalog of clicked documents
•  Basic freshness and age: averaged over all documents in S
•  Unweighted freshness and age: averaged over the clicked documents in Sc
•  Weighted freshness and age: weighted by c(·) = #clicks (see the sketch after this slide)

120
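A minimal sketch of catalog-level freshness and age, assuming the basic form averages F(p, t) and A(p, t) over all documents, the unweighted form averages over the clicked documents, and the weighted form weights by c(p) = #clicks, following the per-page definitions above and the averaging conventions on the next slide; the function names and data layout are ours.

```python
def page_freshness(page, t):
    """F(p, t): 1 if the local copy is up-to-date at time t, else 0."""
    return 1 if page["up_to_date"] else 0

def page_age(page, t):
    """A(p, t): 0 if up-to-date, else t - t_mod (first modification after the last sync)."""
    return 0 if page["up_to_date"] else t - page["t_mod"]

def catalog_metric(pages, t, per_page, weight=lambda p: 1):
    """Weighted average of a per-page metric over a catalog at time t."""
    total_w = sum(weight(p) for p in pages)
    return sum(weight(p) * per_page(p, t) for p in pages) / total_w if total_w else 0.0

pages = [{"up_to_date": True,  "t_mod": None, "clicks": 5},
         {"up_to_date": False, "t_mod": 3,    "clicks": 1}]
clicked = [p for p in pages if p["clicks"] > 0]

print(catalog_metric(pages, 6, page_freshness))                                   # basic freshness: 0.5
print(catalog_metric(clicked, 6, page_age))                                       # unweighted age over clicked docs: 1.5
print(catalog_metric(clicked, 6, page_freshness, weight=lambda p: p["clicks"]))   # click-weighted freshness: ~0.83
```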

Page 121: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

How to measure freshness

•  Find the true refresh history of each page in the sample
  –  Needs independent crawling

•  Compare with the history in the search engine

•  Determine freshness and age
  –  basic form: averaged over all documents in the catalog

•  Consider clicked or viewed documents
  –  unweighted form: averaged over all clicked or viewed documents in the catalog
  –  weighted form: unweighted form weighted with #clicks or #views (or any other weight function)

121

Page 122: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

How to measure freshness: Example

122

Page 123: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Further issues to consider

•  Sampling pages
  –  random, from DMOZ, revisited, popular

•  Classifying pages
  –  topical, importance, change period, refresh period

•  Refresh period for monitoring
  –  daily, hourly, minutely

•  Measuring change
  –  hashing (MD5, Broder’s shingles, Charikar’s SimHash), Jaccard’s index, Dice coefficient, word frequency distribution similarity, structural similarity via DOM trees

•  What is change?
  –  content, “information”, structure, status, links, features, ads

•  How to balance freshness against other objectives

123

Page 124: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Key problems

•  Measure the evolution of the content on the Web

•  Design refresh policies to adapt to the changes on the Web

•  Reduce latency from discovery to serving

•  Improve freshness metrics

124

Page 125: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Reference review on web page change patterns

•  Cho & Garcia-Molina (2000): Crawled 720K pages once a day for 4 months.

•  Ntoulas, Cho, & Olston (2004): Crawled 150 sites once a week for a year.

–  found: most pages didn’t change; changes were minor; freq of change couldn’t predict degree of change but degree of change could predict future degree of change;

•  Fetterly, Manasse, Najork, & Wiener (2003): Crawled 150M pages once a week for 11 weeks.

–  found: past change could predict future change; page length & top level domain name were correlated with change;

•  Olston & Pandey (2008): Crawled 10K random pages and 10K pages sampled from DMOZ every two days for several months.

–  found: moderate correlation between change frequency and information longevity

•  Adar, Teevan, Dumais, & Elsas (2009): Crawled 55K revisited pages (sub)hourly for 5 weeks.

–  found: higher change rates compared to random pages; large portions of pages changing more than hourly; focus on pages with important static or dynamic content;

125

Page 126: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Reference review on predicting refresh rates

•  Grimes, Ford & Tassone (2008)
  –  determines optimal crawl rates under a set of scenarios:
     •  while doing estimation; while fairly sure of the estimate;
     •  when crawls are expensive, and when they are cheap

•  Matloff (2005)
  –  derives estimators similar to Cho & Garcia-Molina but with lower variance (and with improved theory)
  –  also derives estimators for the non-Poisson case
  –  finds that the Poisson model is not very good for its data
     •  but the estimators seem accurate (bias around 10%)

•  Singh (2007)
  –  non-homogeneous Poisson, localized windows, piecewise, Weibull, experimental evaluation

•  No work seems to consider the non-periodical case.

126

Page 127: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Reference review on freshness metrics

•  Cho & Garcia-Molina (2003)
  –  freshness & age of one page
  –  average/expected freshness & age of one page & corpus
  –  freshness & age w.r.t. a Poisson model of change
  –  weighted freshness & age
  –  sync policies
     •  uniform (better): all pages at the same rate
     •  nonuniform: rates proportional to change rates
  –  sync order
     •  fixed order (better), random order
  –  to improve freshness, penalize pages that change too often
  –  to improve age, sync proportionally to frequency, but uniform is not far from optimal

•  Han et al. (2004) and Dasdan and Huynh (2009) add the user perspective with weights.

•  Lewandowski (2008) and Kim and Kang (2007) compare the top three search engines for freshness.

127

Page 128: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

References 1/2

•  E. Adar, J. Teevan, S. Dumais, and J.L. Elsas (2009), The Web changes everything: Understanding the dynamics of Web content, WSDM’09.

•  J. Cho and H. Garcia-Molina (2000), The evolution of the Web and implications for an incremental crawler, VLDB’00.

•  D. Fetterly, M. Manasse, M. Najork, and J. Wiener (2003), A large-scale study of the evolution of Web pages, WWW’03.

•  F. Grandi (2004), Introducing an annotated bibliography on temporal and evolution aspects in the World Wide Web, SIGMOD Record, 33(2):84-86.

•  A. Ntoulas, J. Cho, and C. Olston (2004), What’s new on the Web? The evolution of the Web from a search engine perspective, WWW’04.

128

Page 129: Web Search Engine Metrics for Measuring User Satisfaction

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

References 2/2

•  J. Cho and H. Garcia-Molina (2003), Effective page refresh policies for web crawlers, ACM Trans. Database Syst., 28(4):390-426.

•  J. Cho and H. Garcia-Molina (2003), Estimating frequency of change, ACM Trans. Inter. Tech., 3(3):256-290.

•  A. Dasdan and X. Huynh (2009), User-centric content freshness metrics for search engines, WWW’09.

•  J. Dean (2009), Challenges in building large-scale information retrieval systems, WSDM’09.

•  C. Grimes, D. Ford, and E. Tassone (2008), Keeping a search engine index fresh: Risk and optimality in estimating refresh rates for web pages, INTERFACE’08.

•  J. Han, N. Cercone, and X. Hu (2004), A Weighted freshness metric for maintaining a search engine local repository, WI’04.

•  Y.S. Kim and B.H. Kang (2007), Coverage and timeliness analysis of search engines with webpage monitoring results, WISE’07.

•  D. Lewandowski, H. Wahlig, and G. Meyer-Bautor (2006), The freshness of web search engine databases, J. Info. Syst., 32(2):131-148.

•  D. Lewandowski (2008), A three-year study on the freshness of Web search engine databases, to appear in J. Info. Syst., 2008.

•  N. Matloff (2005), Estimation of internet file-access/modification rates from indirect data, ACM Trans. Model. Comput. Simul., 15(3):233-253.

•  C. Olston and S. Pandey (2008), Recrawl scheduling based on information longevity, WWW’08.

•  S.R. Singh (2007), Estimating the rate of web page changes, IJCAI’07.

129