S. Lawrence and C.L. Giles
Presented by
Robert Cadwgan-Evans, Simon Munday
Searching the World Wide Web
Introduction
• Analyse the paper
– Coverage of search engines
– Size of the Indexable Web
• Consider search and Internet development from 1998 to today
• The future of searching
Paper Outline
• Published April 1998, data collected in 1997
• Investigates the comparative coverage of the internet by major search engines of the time
• Attempts to put a figure on the size of the web
• Important because it provides a way to measure the size of the web
Search Engine Coverage: The Test
Coverage: Percentage of the unique list that an individual engine returns in its queries
[Diagram: 575 queries are submitted to each of HotBot, AltaVista, Northern Light, Excite, Infoseek and Lycos; the results from all engines are combined into a list of unique results]
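The coverage definition above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual procedure: the engine names and URLs are made-up placeholders, and `coverage_percent` is a hypothetical helper name.

```python
def coverage_percent(results_by_engine):
    """Return each engine's share (in %) of the combined list of unique results."""
    # Merge every engine's results into one set of unique pages.
    unique_results = set().union(*results_by_engine.values())
    if not unique_results:
        raise ValueError("no results to compare")
    return {
        engine: 100 * len(results) / len(unique_results)
        for engine, results in results_by_engine.items()
    }


# Toy data (hypothetical, not taken from the paper).
toy_results = {
    "engine_x": {"url1", "url2", "url3"},
    "engine_y": {"url3", "url4"},
    "engine_z": {"url5"},
}
toy_coverage = coverage_percent(toy_results)
print(toy_coverage)
```

Here the combined list holds five unique pages, so an engine returning three of them has a coverage of 60%.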
Search Engine Coverage: Results
Results of search engine coverage using this test:
Search Engine Coverage (%)
HotBot 57.5
AltaVista 46.5
Northern Light 32.9
Excite 23.1
Infoseek 16.5
Lycos 4.41
Even the most successful of the engines, HotBot, doesn’t manage to cover two thirds of the result set from all engines
Size of the Indexable Web: Method
Estimated from an analysis of the overlap between search engines
[Venn diagram: within the set N, the sets Na and Nb intersect in N0]
N Set of indexable web pages
Na Set of results returned by search engine A
Nb Set of results returned by search engine B
N0 Set of results returned by A and B, the overlap
An estimate of the fraction of the indexable web covered by engine A can be calculated from the sizes of these sets:
Pa = N0 / Nb
From this fraction, an estimate for the overall size of the indexable web, N, can be calculated:
N = Sa / Pa
where Sa is the number of pages indexed by engine A
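As a rough worked example of the two formulas, the sketch below takes Sa to be the number of pages that engine A indexes (an assumption here, since the slide does not define it). All figures are invented toy values, not data from the paper, and `estimate_web_size` is a hypothetical helper name.

```python
def estimate_web_size(n_overlap, n_b, s_a):
    """Estimate N from the overlap of two engines.

    n_overlap: number of results returned by both A and B (N0)
    n_b:       number of results returned by B (Nb)
    s_a:       number of pages indexed by A (Sa, assumed meaning)
    """
    if n_overlap <= 0 or n_b <= 0:
        raise ValueError("need a non-empty overlap to estimate coverage")
    p_a = n_overlap / n_b   # Pa = N0 / Nb
    return s_a / p_a        # N  = Sa / Pa


# Toy numbers: B returns 120 results, 30 of which A also returns,
# and A indexes 1000 pages in total.
toy_estimate = estimate_web_size(n_overlap=30, n_b=120, s_a=1000)
print(toy_estimate)
```

With these toy values, A is estimated to cover a quarter of the web (30 / 120), so the estimated total is four times the size of A's index.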
Size of the Indexable Web: Examples
[Two Venn diagrams of Na and Nb within N: one with a small overlap N0, one with a large overlap N0]
• Little overlap suggests that each engine is missing many pages, so not much of the web is covered
• Big overlap suggests that the sets are almost complete and so must contain most of the web
• Works on the assumption of randomness and independence
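To see why the independence assumption matters, the following small simulation (a sketch under invented conditions, not the paper's experiment) applies the overlap estimate to a toy "web" of 10,000 pages. In the first case both engines index pages at random from the whole web; in the second, both only index the same half of it. The sizes, seed, and the helper name `overlap_estimate` are all assumptions made for illustration, and each engine's index size stands in for Sa.

```python
import random


def overlap_estimate(index_a, index_b):
    """Estimate the total number of pages from two engines' index sets."""
    overlap = len(index_a & index_b)
    if overlap == 0:
        raise ValueError("no overlap, cannot estimate")
    p_a = overlap / len(index_b)   # fraction of the web A appears to cover
    return len(index_a) / p_a


rng = random.Random(0)             # fixed seed so the run is repeatable
TRUE_SIZE = 10_000
all_pages = range(TRUE_SIZE)

# Case 1: independent engines, each indexing a random part of the whole web.
a_indep = set(rng.sample(all_pages, 3000))
b_indep = set(rng.sample(all_pages, 2000))
independent_estimate = overlap_estimate(a_indep, b_indep)

# Case 2: correlated engines, both restricted to the same half of the web.
shared_half = range(TRUE_SIZE // 2)
a_corr = set(rng.sample(shared_half, 3000))
b_corr = set(rng.sample(shared_half, 2000))
correlated_estimate = overlap_estimate(a_corr, b_corr)

print(round(independent_estimate), round(correlated_estimate))
```

In this toy setting the independent case lands near the true size, while the correlated case comes out at roughly half of it, because pages that neither engine can reach never enter the calculation. When engines tend to index the same pages, the overlap method therefore underestimates the size of the web.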
Size of the Indexable Web: Results
Comparison between pairs of search engines
Search Engines Indexable Web (millions of pages)
Lycos and Infoseek 90
Infoseek and Excite 220
Excite and Northern Light 230
Northern Light and AltaVista 230
AltaVista and HotBot 320
The paper selects the largest of these, 320 million pages, as its estimate for the size of the indexable web
Paper Summary
• Paper admits the size is an estimate; the actual figure is probably larger
• Query terms are based on scientists' searching habits, not those of the general public
• This estimate suggests that previous estimates of as little as 75 million pages are incorrect
Current Technology
• Newcomers: Google, Yahoo, MSN and Ask Jeeves
• Size of the web has exploded in the last 5 years [1]
Size of the Web Today
• Up-to-date and accurate measurement is difficult, but current figures put the size of the web at around 11.5 billion pages [2]
• About 9.4 billion pages are currently indexed [2]
• Google indexes 8 billion pages, but also takes searching further, indexing 880 million images [3]
• Does a bigger index mean better quality results?
• Larger index could hamper performance [4]
Specialized Search Engines
• With such big search engines providing general results, more specialized search engines have emerged.
The Future
• The Deep Web refers to the databases from which dynamic pages are created
• Over 200,000 deep websites exist [5]
• Examples include eBay and Amazon
• Deep Web is 400 to 550 times larger than the “surface web” [5]
Conclusion
• Estimating the size of the web is difficult, and an accurate measurement is not yet possible
• Paper does a good job of showing previous estimates are far too low (even if its own is low)
• The inclusion of the deep web will only make the problem harder
References
• 1. Search Engine Sizes, D. Sullivan, January 2005, http://searchenginewatch.com/reports/article.php/2156481
• 2. The Indexable Web is More than 11.5 Billion Pages, A. Gulli and A. Signorini, 2005, http://citeseer.ist.psu.edu/gulli05indexable.html
• 3. Google Product Descriptions, http://www.google.co.uk/press/descriptions.html
• 4. Accessibility of Information on the Web, S. Lawrence and C. Giles, Nature, 400:107-109, 1999
• 5. The Deep Web: Surfacing Hidden Value, Michael K. Bergman, 2001, http://beta.brightplanet.com/deepcontent/turtorials/DeepWeb/index.asp