Coping with copies on the Web: Investigating Deduplication by Major Search Engines

Wouter Mettrop (Wouter.Mettrop@cwi.nl), CWI, Amsterdam, The Netherlands
Paul Nieuwenhuysen (Paul.Nieuwenhuysen@vub.ac.be), Vrije Universiteit Brussel and Universiteit Antwerpen, Belgium
Hanneke Smulders, Infomare Consultancy, The Netherlands
Presented at Internet Librarian International 2006 in London, England, October 2006

Transcript of "Coping with copies on the Web: Investigating Deduplication by Major Search Engines"

Page 1:

Coping with copies on the Web: Investigating Deduplication by Major Search Engines

Wouter Mettrop (Wouter.Mettrop@cwi.nl), CWI, Amsterdam, The Netherlands
Paul Nieuwenhuysen (Paul.Nieuwenhuysen@vub.ac.be), Vrije Universiteit Brussel and Universiteit Antwerpen, Belgium
Hanneke Smulders, Infomare Consultancy, The Netherlands

Presented at Internet Librarian International 2006 in London, England, October 2006

Page 2:

Overview of this paper

1. Introduction: Internet search engines omit documents from search results
2. Purpose of this investigation
3. Experimental procedure
4. Results / findings
5. Discussion: this may be important
6. Conclusion of our investigation and recommendation

Page 3:

Introduction: duplicates on the Web (1)

Many computer files that carry documents, images, multimedia, programs… exist in personal information systems, organizations, intranets, the Internet and the Web in more than one copy, or are very similar to other files.

Page 4:

Introduction: duplicates on the Web (2)

These "duplicates" and "near-duplicates" cause problems in the storage and retrieval of information:
• They consume memory and processing power of computers.
• What is worse: users lose time locating the file that is the most appropriate, original, authentic or recent, wading through duplicates and near-duplicates.

Page 5:

Introduction: duplicates on the Web (3)

• This forms a challenge for information professionals in organizations, and in particular for information retrieval systems such as databases and intranets plus their search engines, federated search systems, Web search engines...
• Research has shown that about 30% of all Web pages are very similar to pages in the remaining 70%, and that about 20% are virtually identical to other pages on the Web.
• Furthermore, as an increasing number of people create, copy, store and distribute files, this challenge becomes more important.

Page 6:

Deduplication of very similar files by Web search engines (1)

To help users cope with the many copies, duplicates, or very similar files, Web search engines can apply deduplication. For this investigation we define deduplication of search results as the selection of one or more copies of a duplicate Web page to represent a cluster of all the recognized copies. The default search result lists present these so-called representatives. Deduplication can be applied during the harvesting of new Web pages or during the retrieval phase.
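As a rough illustration of this definition, the Python sketch below clusters result pages whose normalized text is identical under one content fingerprint and keeps a single representative per cluster. The normalization rule and the choice of representative (the shortest URL) are assumptions made for the sketch; real engines use more sophisticated near-duplicate detection, and nothing here describes any particular engine's algorithm.

```python
import hashlib
import re

def fingerprint(page_text: str) -> str:
    """Collapse whitespace and case so trivially differing copies match.
    (Assumed normalization; real engines detect near-duplicates with
    techniques such as shingling or simhash.)"""
    normalized = re.sub(r"\s+", " ", page_text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def deduplicate(results: list[tuple[str, str]]) -> list[str]:
    """results: list of (url, page_text) pairs from a search.
    Returns one representative URL per cluster of recognized copies."""
    clusters: dict[str, list[str]] = {}
    for url, text in results:
        clusters.setdefault(fingerprint(text), []).append(url)
    # Pick one representative per cluster; the shortest URL is an
    # arbitrary choice made for this sketch.
    return [min(urls, key=len) for urls in clusters.values()]
```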

Page 7:

Deduplication of very similar files by Web search engines (2)

Deduplication can serve several purposes:
• For the search engines themselves it can improve the work of Web crawlers and of systems that archive the Web. An experiment with the Google Web Search crawler in 1999 showed that its work can be reduced by 40% by avoiding redundant crawling (a minimal sketch follows this list).
• For the user it offers methods for identifying versioned and plagiarized documents, and it helps to review the results faster.
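To make the crawler-side saving concrete, here is a minimal sketch of a crawl loop that skips pages whose content fingerprint has already been seen, so duplicate content is neither indexed nor expanded again. The fetch, link-extraction and fingerprint functions are hypothetical stand-ins supplied by the caller; the 40% figure above comes from the cited 1999 experiment, not from this sketch.

```python
from collections import deque

def crawl(seed_urls, fetch, extract_links, fingerprint, max_pages=1000):
    """Breadth-first crawl that avoids redundant work on duplicate content.
    `fetch(url) -> str`, `extract_links(text) -> list[str]` and
    `fingerprint(text) -> str` are assumed to be supplied by the caller."""
    seen_urls = set(seed_urls)
    seen_fingerprints = set()
    queue = deque(seed_urls)
    pages = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        text = fetch(url)
        fp = fingerprint(text)
        if fp in seen_fingerprints:
            continue  # duplicate content: do not index or expand it again
        seen_fingerprints.add(fp)
        pages.append((url, text))
        for link in extract_links(text):
            if link not in seen_urls:
                seen_urls.add(link)
                queue.append(link)
    return pages
```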

Page 8:

Purpose of this investigation

The investigation reported here was motivated by the central problem: how is the user confronted with the various ways in which Web search engines handle very similar documents?

Analytical problem statements:
• Do the important, popular Web search engines offer their users results that have been deduplicated in some way?
• Is the user confronted with deduplication by the various Web search engines in the same way?
• How stable and predictable is the deduplication function of Web search engines?

Page 9:

Experimental procedure (1): Test documents for the WWW

• We have performed experiments with very similar test documents.
• We constructed a unique test document with specific content in several metatags, and we put it on the Internet.
• This guaranteed a high rank in the results of our searches for this test document.
• We also created variations of this document: differences among our test documents were made in the title tag, the body text and the filename (a generator sketch follows below).
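Purely as an illustration (the actual test documents and their unique content are not reproduced here), the following sketch generates HTML variants that differ only in the three dimensions the experiment varied: title tag, body text and filename. The marker string, titles and filenames are invented for the sketch.

```python
# Hypothetical generator of near-duplicate test documents; the unique
# marker string and the variant texts are invented for illustration.
MARKER = "unique-test-marker-000"  # stands in for the real unique content

TEMPLATE = """<html>
<head>
  <title>{title}</title>
  <meta name="description" content="{marker}">
  <meta name="keywords" content="{marker}">
</head>
<body>{body}</body>
</html>"""

variants = [
    ("testdoc_a.html", "Test document", "Body text A with {m}."),
    ("testdoc_b.html", "Test document", "Body text B with {m}."),          # body differs
    ("testdoc_c.html", "Test document (copy)", "Body text A with {m}."),   # title differs
]

for filename, title, body in variants:
    with open(filename, "w", encoding="utf-8") as f:
        f.write(TEMPLATE.format(title=title, marker=MARKER,
                                body=body.format(m=MARKER)))
```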

Page 10:

Experimental procedure (2): Test documents for the WWW

• We keep 18 different samples of the test document on the Internet, spread over 8 server computers.
• The test documents were put on the WWW at the end of 2002 (8 documents) and at the end of 2003 (10 documents).
• Our test documents do NOT change over time.

Page 11:

Experimental procedure (3): Example of our test document on the WWW

Page 12:

Experimental procedure (4)

• We investigated Alltheweb, AltaVista, AskJeeves, Google, Lycos, MSN, Teoma and Yahoo!
• In this experiment the test documents were searched for with one specific content query, repeated every hour during September and October 2005.

Page 13:

Experimental procedure (5)

• Every investigated Web search engine has been queried 430 times with the content query.
• They were also queried simultaneously with 18 of what we call "URL queries": queries that search for (a part of) the URL of the 18 test documents, using the possibilities that the particular search engine offers.
• We call the test documents retrieved by all 19 simultaneously submitted queries "known test documents" (the bookkeeping is sketched after this list).
• The total number of queries submitted is 28886.
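The bookkeeping behind "known test documents" can be sketched as follows: in each hourly round an engine receives the content query plus the 18 URL queries, and every test document returned by any of those 19 queries belongs to that round's known set. The submit function is a hypothetical wrapper around the engine-specific search interfaces.

```python
def known_test_documents(submit, content_query, url_queries, test_doc_urls):
    """One hourly round for one engine. `submit(query) -> set[str]` is a
    hypothetical wrapper around the engine's search interface, returning
    the result URLs. Returns the round's "known test documents": every
    test document retrieved by any of the 19 simultaneous queries."""
    known = set()
    for query in [content_query, *url_queries]:  # 1 + 18 = 19 queries
        known |= submit(query) & set(test_doc_urls)
    return known
```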

Page 14:

Experimental procedure (6)

• We are well aware that fluctuations can occur in the results over time.
• The fluctuations over time in these search results were counted: a so-called document fluctuation occurs every time the content query does not show all test documents that were retrieved with the previous submission (see the sketch below).
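Read literally, that definition can be counted as in the sketch below, where each element of `rounds` is the set of test documents the content query showed at one hourly submission (a data structure assumed for the sketch).

```python
def count_document_fluctuations(rounds):
    """`rounds` is a chronological list of sets: the test documents shown
    by the content query at each hourly submission. A document fluctuation
    is counted whenever a round fails to show every test document that the
    previous round retrieved."""
    fluctuations = 0
    for previous, current in zip(rounds, rounds[1:]):
        if not previous <= current:  # a previously shown document is missing
            fluctuations += 1
    return fluctuations

# Example: the third round drops document "B" that round two retrieved.
rounds = [{"A"}, {"A", "B"}, {"A"}]
assert count_document_fluctuations(rounds) == 1
```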

Page 15:

Experimental Results (1)

Page 16:

Experimental Results (2)

Page 17:

Experimental Results (3)

Page 18:

Experimental Results (4)

Page 19:

Experimental results (6) - Example: numbers of test documents involved for Yahoo!

Page 20:

Experimental results (7) - Screen shot of Google Web search in a case with "omitted entries"

Page 21:

Experimental Results (8) - Screen shot of Yahoo! search in a case with "omitted entries"

Page 22:

Experimental Results (9) – Quantitative results (averages)

Search engine        Known documents hidden per query set    Deduplicated result sets with document fluctuations
MSN                  83.6%                                   0.0%
Alltheweb            37.7%                                   5.1%
AltaVista            29.7%                                   5.1%
Google Web Search    50.8%                                   0.2%
Yahoo!               46.1%                                   10.0%
AskJeeves            0.0%                                    0.0%
Lycos                0.0%                                    0.0%
Teoma                0.0%                                    0.0%

Page 23:

Discussion: The importance of our findings

• Real, authentic documents on their original server computer have to compete with "very similar" versions (which can be substantially different), made available by others on other servers.
• In reality documents are not abstract items: they can be concrete, real laws, regulations, price lists, scientific reports, political programs… so that NOT finding the more authentic document can have real consequences.

Page 24:

Discussion: The importance of our findings

• This may complicate scientometric / bibliometric studies, that is, quantitative studies of the numbers of documents retrieved.
• Documents on their original server can be pushed out of the search results by very similar competing documents on one or several other servers!
• A small change in a document may have large consequences for its meaning!

Page 25:

Conclusion and recommendations

• Deduplication occurs. Not only duplicates, but also very similar documents are omitted from search results.
• Enjoy this when you don't want very similar documents in your search results: use a search engine that deduplicates rigorously.

Page 26:

Conclusion and recommendations

• But take this into account when it is important to find:
  » the oldest, authentic, master version of a document;
  » the newest, most recent version of a document;
  » versions of a document with comments, corrections…;
  » in general: variations of documents.
• Use a search engine that does not deduplicate, or one that shows the omitted search results.

Page 27:

Conclusion and recommendations

• Search engines that deduplicate partially show fluctuations in the search results.
• Searchers for a known item should be aware of this.