Coping with copies on the Web: Investigating Deduplication by Major Search Engines

Wouter Mettrop (Wouter.Mettrop@cwi.nl), CWI, Amsterdam, The Netherlands
Paul Nieuwenhuysen (Paul.Nieuwenhuysen@vub.ac.be), Vrije Universiteit Brussel and Universiteit Antwerpen, Belgium
Hanneke Smulders, Infomare Consultancy, The Netherlands
Presented at Internet Librarian International 2006 in London, England, October 2006

Transcript of "Coping with copies on the Web: Investigating Deduplication by Major Search Engines"

Page 1:

Coping with copies on the Web: Investigating Deduplication by Major Search Engines

Wouter Mettrop (Wouter.Mettrop@cwi.nl), CWI, Amsterdam, The Netherlands
Paul Nieuwenhuysen (Paul.Nieuwenhuysen@vub.ac.be), Vrije Universiteit Brussel and Universiteit Antwerpen, Belgium
Hanneke Smulders, Infomare Consultancy, The Netherlands

Presented at Internet Librarian International 2006 in London, England, October 2006

Page 2:

Overview of this paper

1. Introduction: Internet search engines omit documents from search results
2. Purpose of this investigation
3. Experimental procedure
4. Results / findings
5. Discussion: this may be important
6. Conclusion of our investigation and recommendation

Page 3:

Introduction: duplicates on the Web (1)

Many computer files that carry documents, images, multimedia, programs… exist in personal information systems, organizations, intranets, the Internet and the Web in more than one copy, or are very similar to other files.

Page 4:

Introduction: duplicates on the Web (2)

These "duplicates" and "near-duplicates" cause problems in the storage and retrieval of information:
• They consume memory and processing power of computers.
• What is worse: users lose time locating the file that is the most appropriate, original, authentic or recent, wading through duplicates and near-duplicates.

Page 5:

Introduction: duplicates on the Web (3)

• This forms a challenge for information professionals in organizations, and in particular for information retrieval systems such as databases and intranets plus their search engines, federated search systems, Web search engines...
• Research has shown that about 30% of all Web pages are very similar to pages in the remaining 70%, and that about 20% are virtually identical to other pages on the Web.
• Furthermore, as an increasing number of people create, copy, store and distribute files, this challenge becomes more important.

Page 6:

Deduplication of very similar files by Web search engines (1)

To help users cope with the many copies, duplicates, or very similar files, Web search engines can apply deduplication. For this investigation we define deduplication of search results as the selection of one or more copies of a duplicate Web page to represent a cluster of all the recognized copies. The default search result lists present these so-called representatives. Deduplication can be applied during the harvesting of new Web pages or during the retrieval phase.
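As a rough illustration of this definition, the Python sketch below clusters result pages whose normalized text is identical under one content fingerprint and keeps a single representative per cluster. The normalization rule and the choice of representative (the shortest URL) are assumptions made for the sketch; real engines use more sophisticated near-duplicate detection, and nothing here describes any particular engine's algorithm.

```python
import hashlib
import re

def fingerprint(page_text: str) -> str:
    """Collapse whitespace and case so trivially differing copies match.
    (Assumed normalization; real engines detect near-duplicates with
    techniques such as shingling or simhash.)"""
    normalized = re.sub(r"\s+", " ", page_text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def deduplicate(results: list[tuple[str, str]]) -> list[str]:
    """results: list of (url, page_text) pairs from a search.
    Returns one representative URL per cluster of recognized copies."""
    clusters: dict[str, list[str]] = {}
    for url, text in results:
        clusters.setdefault(fingerprint(text), []).append(url)
    # Pick one representative per cluster; the shortest URL is an
    # arbitrary choice made for this sketch.
    return [min(urls, key=len) for urls in clusters.values()]
```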

Page 7:

Deduplication of very similar files by Web search engines (2)

Deduplication can serve several purposes:
• For the search engines themselves it can improve the work of Web crawlers and of systems that archive the Web. An experiment with the Google Web Search crawler in 1999 showed that its work can be reduced by 40% by avoiding redundant crawling (a minimal sketch follows this list).
• For the user it offers methods for identifying versioned and plagiarized documents, and it helps to review the results faster.
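To make the crawler-side saving concrete, here is a minimal sketch of a crawl loop that skips pages whose content fingerprint has already been seen, so duplicate content is neither indexed nor expanded again. The fetch, link-extraction and fingerprint functions are hypothetical stand-ins supplied by the caller; the 40% figure above comes from the cited 1999 experiment, not from this sketch.

```python
from collections import deque

def crawl(seed_urls, fetch, extract_links, fingerprint, max_pages=1000):
    """Breadth-first crawl that avoids redundant work on duplicate content.
    `fetch(url) -> str`, `extract_links(text) -> list[str]` and
    `fingerprint(text) -> str` are assumed to be supplied by the caller."""
    seen_urls = set(seed_urls)
    seen_fingerprints = set()
    queue = deque(seed_urls)
    pages = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        text = fetch(url)
        fp = fingerprint(text)
        if fp in seen_fingerprints:
            continue  # duplicate content: do not index or expand it again
        seen_fingerprints.add(fp)
        pages.append((url, text))
        for link in extract_links(text):
            if link not in seen_urls:
                seen_urls.add(link)
                queue.append(link)
    return pages
```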

Page 8:

Purpose of this investigation

The investigation reported here was motivated by the central problem: how is the user confronted with the various ways in which Web search engines handle very similar documents?

Analytical problem statements:
• Do the important, popular Web search engines offer their users results that have been deduplicated in some way?
• Is the user confronted with deduplication by the various Web search engines in the same way?
• How stable and predictable is the deduplication function of Web search engines?

Page 9:

Experimental procedure (1): Test documents for the WWW

• We have performed experiments with very similar test documents.
• We constructed a unique test document with specific content in several metatags, and we put it on the Internet.
• This guaranteed a high rank in the results of our searches for this test document.
• We also created variations of this document: differences among our test documents were made in the title tag, the body text and the filename (a generator sketch follows below).
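Purely as an illustration (the actual test documents and their unique content are not reproduced here), the following sketch generates HTML variants that differ only in the three dimensions the experiment varied: title tag, body text and filename. The marker string, titles and filenames are invented for the sketch.

```python
# Hypothetical generator of near-duplicate test documents; the unique
# marker string and the variant texts are invented for illustration.
MARKER = "unique-test-marker-000"  # stands in for the real unique content

TEMPLATE = """<html>
<head>
  <title>{title}</title>
  <meta name="description" content="{marker}">
  <meta name="keywords" content="{marker}">
</head>
<body>{body}</body>
</html>"""

variants = [
    ("testdoc_a.html", "Test document", "Body text A with {m}."),
    ("testdoc_b.html", "Test document", "Body text B with {m}."),          # body differs
    ("testdoc_c.html", "Test document (copy)", "Body text A with {m}."),   # title differs
]

for filename, title, body in variants:
    with open(filename, "w", encoding="utf-8") as f:
        f.write(TEMPLATE.format(title=title, marker=MARKER,
                                body=body.format(m=MARKER)))
```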

Page 10:

Experimental procedure (2): Test documents for the WWW

• We keep 18 different samples of the test document on the Internet, spread over 8 server computers.
• The test documents were put on the WWW at the end of 2002 (8 documents) and at the end of 2003 (10 documents).
• Our test documents do NOT change over time.

Page 11:

Experimental procedure (3): Example of our test document on the WWW

Page 12:

Experimental procedure (4)

• We investigated Alltheweb, AltaVista, AskJeeves, Google, Lycos, MSN, Teoma and Yahoo!
• In this experiment the test documents were searched for with one specific content query, repeated every hour during September and October 2005.

Page 13:

Experimental procedure (5)

• Every investigated Web search engine has been queried 430 times with the content query.
• They were also queried simultaneously with 18 of what we call "URL queries": queries that search for (a part of) the URL of the 18 test documents, using the possibilities that the particular search engine offers.
• We call the test documents retrieved by all 19 simultaneously submitted queries "known test documents" (the bookkeeping is sketched after this list).
• The total number of queries submitted is 28886.
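The bookkeeping behind "known test documents" can be sketched as follows: in each hourly round an engine receives the content query plus the 18 URL queries, and every test document returned by any of those 19 queries belongs to that round's known set. The submit function is a hypothetical wrapper around the engine-specific search interfaces.

```python
def known_test_documents(submit, content_query, url_queries, test_doc_urls):
    """One hourly round for one engine. `submit(query) -> set[str]` is a
    hypothetical wrapper around the engine's search interface, returning
    the result URLs. Returns the round's "known test documents": every
    test document retrieved by any of the 19 simultaneous queries."""
    known = set()
    for query in [content_query, *url_queries]:  # 1 + 18 = 19 queries
        known |= submit(query) & set(test_doc_urls)
    return known
```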

Page 14:

Experimental procedure (6)

• We are well aware that fluctuations can occur in the results over time.
• The fluctuations over time in these search results were counted: a so-called document fluctuation occurs every time the content query does not show all test documents that were retrieved with the previous submission (see the sketch below).
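Read literally, that definition can be counted as in the sketch below, where each element of `rounds` is the set of test documents the content query showed at one hourly submission (a data structure assumed for the sketch).

```python
def count_document_fluctuations(rounds):
    """`rounds` is a chronological list of sets: the test documents shown
    by the content query at each hourly submission. A document fluctuation
    is counted whenever a round fails to show every test document that the
    previous round retrieved."""
    fluctuations = 0
    for previous, current in zip(rounds, rounds[1:]):
        if not previous <= current:  # a previously shown document is missing
            fluctuations += 1
    return fluctuations

# Example: the third round drops document "B" that round two retrieved.
rounds = [{"A"}, {"A", "B"}, {"A"}]
assert count_document_fluctuations(rounds) == 1
```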

Page 15:

Experimental Results (1)

Page 16:

Experimental Results (2)

Page 17:

Experimental Results (3)

Page 18:

Experimental Results (4)

Page 19:

Experimental results (6) - Example: numbers of test documents involved for Yahoo!

Page 20:

Experimental results (7) - Screen shot of Google Web search in a case with "omitted entries"

Page 21:

Experimental Results (8) - Screen shot of Yahoo! search in a case with "omitted entries"

Page 22:

Experimental Results (9) – Quantitative results (averages)

Search engine        Known documents hidden per query set    Deduplicated result sets with document fluctuations
MSN                  83.6%                                   0.0%
Alltheweb            37.7%                                   5.1%
AltaVista            29.7%                                   5.1%
Google Web Search    50.8%                                   0.2%
Yahoo!               46.1%                                   10.0%
AskJeeves            0.0%                                    0.0%
Lycos                0.0%                                    0.0%
Teoma                0.0%                                    0.0%

Page 23:

Discussion: The importance of our findings

• Real, authentic documents on their original server computer have to compete with "very similar" versions (which can be substantially different), made available by others on other servers.
• In reality documents are not abstract items: they can be concrete, real laws, regulations, price lists, scientific reports, political programs… so that NOT finding the more authentic document can have real consequences.

Page 24:

Discussion: The importance of our findings

• This may complicate scientometric / bibliometric studies, that is, quantitative studies of the numbers of documents retrieved.
• Documents on their original server can be pushed out of the search results by very similar competing documents on one or several other servers!
• A small change in a document may have large consequences for its meaning!

Page 25:

Conclusion and recommendations

• Deduplication occurs. Not only duplicates, but also very similar documents are omitted from search results.
• Enjoy this when you don't want very similar documents in your search results: use a search engine that deduplicates rigorously.

Page 26:

Conclusion and recommendations

• But take this into account when it is important to find:
  » the oldest, authentic, master version of a document;
  » the newest, most recent version of a document;
  » versions of a document with comments, corrections…;
  » in general: variations of documents.
• Use a search engine that does not deduplicate, or one that shows the omitted search results.

Page 27:

Conclusion and recommendations

• Search engines that deduplicate partially show fluctuations in the search results.
• Searchers for a known item should be aware of this.