Web Search Engine Metrics for Measuring User Satisfaction

[Section 5 of 7: Discovery]

Ali Dasdan, eBay

Kostas Tsioutsiouliklis, Yahoo!

Emre Velipasaoglu, Yahoo!

With contributions from Prasad Kantamneni, Yahoo!

27 Apr 2010

(Update in Aug 2015: The authors work in different companies now.)

Tutorial @ 19th International World Wide Web Conference

http://www2010.org/

April 26-30, 2010

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009-2010.

Disclaimers

•  This talk presents the opinions of the authors. It does not necessarily reflect the views of our employers.

•  This talk does not imply that our employers use these metrics; if they do use them, they may not use them in the way described in this talk.

•  The examples are just that – examples. Please do not generalize them to the level of comparing search engines.


Discovery and Latency Metrics

Section 5/7 of WWW’10 Tutorial on Web Search Engine Metrics by A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu


Example on discovery: Page was born ~30 minutes before


Example on discovery: URL of page was not found


Example on discovery: But content existed under different URLs


Example on discovery: URL was also found after ~1 hr


Life of a URL

[Figure: timeline of a single URL along the TIME axis, marking the events BORN, DISCOVERED, NOW, and EXPIRED, with the intervals AGE and LATENCY.]

Lives of many URLs

[Figure: timelines of several URLs along the TIME axis, marking the events BORN, DISCOVERED, NOW, and EXPIRED, with the interval AGE and a separate LATENCY interval for each URL.]

How to measure discovery and latency

•  Consider a sample of new pages on the Web
   –  Feeds at regular intervals
   –  Each sample monitored for a period (e.g., 15 days)

•  User view
   –  Discovery: Measure how many of these new pages are in the search results
      •  using the coverage ratio formula
   –  Latency: Measure how long it took to get these new pages into the search results
      •  variants such as ‘Time-To-First-* (TTF*)’ metrics, e.g., Time-To-First-Click and Time-To-First-View

•  System view
   –  Discovery: Measure how many of these new pages are in a catalog
   –  Latency: Measure how long it took to get these new pages into a catalog
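The two measurements above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SampledPage` record, its field names, the hour unit, and the example URLs are hypothetical, and the discovery figure is computed as a plain found/sampled fraction, which may differ from the coverage ratio formula the slide refers to. The same code applies to the user view (found in search results) and the system view (found in a catalog); only the source of `found_at` changes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SampledPage:
    """One new page from the sample; times are in hours (assumed unit)."""
    url: str
    born_at: float             # when the page first appeared on the Web
    found_at: Optional[float]  # first seen in results/catalog; None if never found


def discovery_ratio(sample: List[SampledPage]) -> float:
    """Fraction of sampled pages found within the monitoring period."""
    if not sample:
        return 0.0
    found = sum(1 for p in sample if p.found_at is not None)
    return found / len(sample)


def latencies(sample: List[SampledPage]) -> List[float]:
    """Latency (found_at - born_at) for each page that was found."""
    return [p.found_at - p.born_at for p in sample if p.found_at is not None]


# Hypothetical sample of four new pages.
sample = [
    SampledPage("http://example.com/a", born_at=0.0, found_at=0.5),
    SampledPage("http://example.com/b", born_at=1.0, found_at=3.0),
    SampledPage("http://example.com/c", born_at=2.0, found_at=None),
    SampledPage("http://example.com/d", born_at=2.5, found_at=3.5),
]
ratio = discovery_ratio(sample)
lat = latencies(sample)
print(ratio, lat)
```

Pages that are never found count against discovery but contribute no latency value, so the two numbers should be reported together.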


Discovery profile of a search engine component: Overview

[Figure: discovery profile, over many URLs, per search engine component. Annotations: time to reach a certain coverage percentage; convergence; no expiration yet; content expired; other behaviors.]
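One way to read "time to reach a certain coverage percentage" and "convergence" is as properties of a coverage-versus-time curve built from per-URL latencies. The sketch below takes that reading as an assumption; the function names and the sample latencies are hypothetical, and content expiration is not modeled.

```python
from typing import List, Optional


def coverage_at(latencies_h: List[Optional[float]], t: float) -> float:
    """Fraction of sampled URLs discovered within t hours of birth."""
    if not latencies_h:
        return 0.0
    found = sum(1 for x in latencies_h if x is not None and x <= t)
    return found / len(latencies_h)


def time_to_coverage(latencies_h: List[Optional[float]],
                     target: float) -> Optional[float]:
    """Smallest observed latency at which coverage reaches target, else None."""
    for t in sorted(x for x in latencies_h if x is not None):
        if coverage_at(latencies_h, t) >= target:
            return t
    return None


# Hypothetical per-URL latencies in hours; None means never discovered.
observed = [0.5, 1.0, 2.0, 6.0, None]
profile = [(t, coverage_at(observed, t)) for t in (1, 2, 6, 24)]
t60 = time_to_coverage(observed, 0.6)
t90 = time_to_coverage(observed, 0.9)
print(profile, t60, t90)
```

In this made-up sample the curve flattens at 80% because one URL is never discovered, so a 90% target is never reached and `time_to_coverage` returns `None`.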


Discovery profiles and monitoring: Examples

[Figure: two panels, "Profiles" and "Monitoring of profile parameters".]

Latency profiles of a search engine component: Overview

[Figure: latency profile, over many URLs, per search engine component. Annotations: desired skewness direction; close to zero for crawlers.]
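A latency profile over many URLs can be summarized with a few distribution statistics. The sketch below is an assumption-laden illustration: the slide does not say which statistics are tracked, so the median, mean, and population skewness are chosen here for illustration, and the latency values are made up. A positive skewness means most latencies sit near zero with a long tail of slow discoveries.

```python
import statistics
from typing import List


def skewness(values: List[float]) -> float:
    """Population skewness: third central moment over variance ** 1.5."""
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / len(values)
    m3 = sum((v - mean) ** 3 for v in values) / len(values)
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


# Hypothetical per-URL latencies in hours for one component.
latencies_h = [0.2, 0.3, 0.5, 0.5, 1.0, 2.0, 12.0]
summary = {
    "median": statistics.median(latencies_h),
    "mean": statistics.fmean(latencies_h),
    "skewness": skewness(latencies_h),
}
print(summary)
```

Tracking such values per component over time is one possible way to monitor profile parameters; the median is less sensitive than the mean to a few very late discoveries.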


Latency profiles and monitoring: Examples

[Figure: two panels, "Profiles" and "Monitoring of profile parameters".]

Further issues to consider

•  How to discover samples to measure discovery and latency
•  How to beat crawlers to acquire samples
•  Discovery of top-level pages
•  Discovery of deep links
•  Discovery of hidden web content
•  How to balance discovery against other objectives


Key problems

•  Predict content changes on the Web
•  Discover new content almost instantaneously
•  Reduce latency per search engine component and overall


Reference review on discovery metrics

•  Cho, Garcia-Molina, & Page (1998)
   –  discusses how to order URL accesses based on importance scores
      •  importance: PageRank (best), link count, similarity to query in anchor text or URL string, attributes of URL string

•  Dasgupta et al. (2007)
   –  formulates the problem of discoverability (discover new content from the fewest number of known pages) and proposes approximation algorithms

•  Kim and Kang (2007)
   –  compares top three search engines for discovery (called “timeliness”), freshness, and latency


Reference review on discovery metrics

•  Lewandowski (2008)
   –  compares top three search engines for freshness and latency

•  Dasdan and Drome (2009)
   –  proposes discovery metrics along the lines discussed in this section

•  Olston and Najork (2010)
   –  gives a detailed survey of web crawling, including how crawlers discover URLs
   –  discusses how to optimize for both coverage and freshness in a web crawler


References

•  J. Cho, H. Garcia-Molina, and L. Page (1998), Efficient Crawling Through URL Ordering, Computer Networks and ISDN Systems, 30(1-7):161-172.

•  A. Dasdan and C. Drome (2009), Discovery coverage: Measuring how fast content is discovered by search engines, submitted.

•  A. Dasgupta, A. Ghosh, R. Kumar, C. Olston, S. Pandey, and A. Tomkins (2007), The discoverability of the Web, WWW’07.

•  J. Dean (2009), Challenges in building large-scale information retrieval systems, WSDM’09.

•  N. Eiron, K.S. McCurley, and J.A. Tomlin (2004), Ranking the Web frontier, WWW’04.

•  C. Grimes, D. Ford, and E. Tassone (2008), Keeping a search engine index fresh: Risk and optimality in estimating refresh rates for web pages, INTERFACE’08.

•  Y.S. Kim and B.H. Kang (2007), Coverage and timeliness analysis of search engines with webpage monitoring results, WISE’07.

•  D. Lewandowski (2008), A three-year study on the freshness of Web search engine databases, to appear in J. Info. Syst., 2008.

•  C. Olston and M. Najork (2010), Web crawling, Chapter in Foundations and Trends in Information Retrieval, 4(3):175-246.
