Retrieving Information on the Web
Presented by
Md. Zaheed Iftekhar
Course: Information Retrieval (IFT6255)
Professor: Jian E. Nie, DIRO, University of Montreal
April 9th, 2003
April 9, 2003 - Presented by: Md. Zaheed Iftekhar
Overview
• Web search: general description
– Introduction of web, search engines
– Definitions
– Major search engines
– Current technologies
• The future
– Where is the technology heading
– Proposal for further improvement
• Conclusion
• References
History of the Web
• In 1990 the World Wide Web (WWW) was developed by Tim Berners-Lee at CERN to organize research documents available on the Internet.
• Combined idea of documents available by FTP with the idea of hypertext to link documents.
• Developed initial HTTP network protocol, URLs, HTML, and first “web server.”
World Wide Web
• Ted Nelson developed the idea of hypertext in 1965.
• Doug Engelbart invented the mouse and built the first implementation of hypertext in the late 1960s at SRI.
• ARPANET came online in 1969 and grew through the early 1970s.
• The basic technology was in place in the 1970s, but it took the PC revolution and widespread networking to inspire the web and make it practical.
Web Browser
• Early browsers were developed in 1992 (Erwise, ViolaWWW).
• In 1993, Marc Andreessen and Eric Bina at UIUC NCSA developed the Mosaic browser.
• Andreessen joined with James Clark (Stanford Prof. and Silicon Graphics founder) to form Mosaic Communications Inc. in 1994 (which became Netscape to avoid conflict with UIUC).
• Microsoft licensed the original Mosaic from UIUC and used it to build Internet Explorer in 1995.
Web Search
• By the late 1980s, many files were available by anonymous FTP.
• In 1990, Alan Emtage of McGill Univ. developed Archie (short for “archives”).
– Assembled lists of files available on many FTP servers.
– Allowed regex search of these file names.
• In 1993, Veronica and Jughead were developed to search names of text files available through Gopher servers.
Web Search
• In 1993, early web robots (spiders) were built to collect URLs:
– Wanderer
– ALIWEB (Archie-Like Index of the WEB)
– WWW Worm (indexed URLs and titles for regex search)
• In 1994, Stanford grad students David Filo and Jerry Yang started manually collecting popular web sites into a topical hierarchy called Yahoo.
Web Search
• In early 1994, Brian Pinkerton developed WebCrawler as a class project at U Wash. (became part of Excite and AOL).
• The same year, Fuzzy Mauldin, a grad student at CMU, developed Lycos.
– First to use a standard IR system.
– First to index a large set of pages.
• In late 1995, DEC developed AltaVista. It supported Boolean operators, phrases, and “reverse pointer” queries.
• In 1998, Larry Page and Sergey Brin, Ph.D. students at Stanford, started Google.
Spiders (Robots/Bots/Crawlers)
• Start with a comprehensive set of root URL’s from which to start the search.
• Follow all links on these pages recursively to find additional pages.
• Index all newly found pages in an inverted index as they are encountered.
• May allow users to directly submit pages to be indexed (and crawled from).
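The indexing step can be illustrated with a small sketch. This is only one possible reading of "inverted index": a map from each term to the set of pages containing it, queried with a Boolean AND. The function names, the tokenizer, and the example.org pages are assumptions made for illustration, not part of the original slides.

```python
from collections import defaultdict
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(pages):
    """Map each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index

def boolean_and(index, query):
    """Return the URLs that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical pages standing in for what a spider would have downloaded.
pages = {
    "http://example.org/a": "Web search engines crawl the web",
    "http://example.org/b": "Spiders follow links to find pages",
    "http://example.org/c": "Search engines index pages found by spiders",
}
index = build_inverted_index(pages)
```

A real engine would also store term positions and weights; this sketch keeps only term membership.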
Breadth-first Search
Depth-first Search
Search Strategy Trade-Offs
• Breadth-first explores uniformly outward from the root page but requires memory of all nodes on the previous level (exponential in depth). It is the standard spidering method.
• Depth-first requires memory of only depth times branching-factor (linear in depth) but gets “lost” pursuing a single thread.
• Both strategies are implementable using a queue of links (URLs).
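To make the trade-off concrete, the following sketch runs both strategies over a tiny, made-up link graph and records the visit order together with the largest size the queue reached. The graph `LINKS` and the function `crawl_order` are hypothetical; the only difference between the two strategies is where new links are placed in the queue.

```python
from collections import deque

# Hypothetical toy link graph: page -> pages it links to.
LINKS = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1", "b2"],
    "a1": [], "a2": [], "b1": [], "b2": [],
}

def crawl_order(start, strategy):
    """Visit pages reachable from start; return (visit order, peak queue size)."""
    frontier = deque([start])
    seen = {start}
    order = []
    peak = 1
    while frontier:
        page = frontier.popleft()              # always take from the front
        order.append(page)
        new_links = [link for link in LINKS[page] if link not in seen]
        seen.update(new_links)
        if strategy == "bfs":
            frontier.extend(new_links)         # append to the end of the queue
        else:
            # add to the front; reversed() keeps the page's own link order
            frontier.extendleft(reversed(new_links))
        peak = max(peak, len(frontier))
    return order, peak

bfs_order, bfs_peak = crawl_order("root", "bfs")
dfs_order, dfs_peak = crawl_order("root", "dfs")
```

On this small graph the breadth-first queue peaks at a whole level of the tree, while the depth-first queue stays closer to depth times branching factor; the gap widens quickly on deeper graphs.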
Avoiding Page Duplication
• Must detect when revisiting a page that has already been spidered (the web is a graph, not a tree).
• Must efficiently index visited pages to allow a rapid recognition test.
– Tree indexing (e.g. trie)
– Hashtable
• Index page using URL as a key.
– Must canonicalize URLs (e.g. delete ending “/”)
– Does not detect duplicated or mirrored pages.
• Index page using textual content as a key.
– Requires first downloading the page.
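Both kinds of key can be sketched in a few lines. The normalization rules below (lowercase scheme and host, drop default ports, strip trailing slashes, drop fragments) are one plausible choice rather than a rule set taken from the slides, and stripping the trailing slash is not safe for every server. The function names `canonicalize` and `content_key` are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Return a normalized form of url to use as a visited-set key."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    default_ports = {"http": 80, "https": 443}
    if parts.port is not None and parts.port != default_ports.get(scheme):
        host = f"{host}:{parts.port}"
    # Delete ending "/" but keep a bare "/" for the site root.
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment: it names a position inside the page, not another page.
    return urlunsplit((scheme, host, path, parts.query, ""))

def content_key(text):
    """Fingerprint page text so that exact mirrored copies share one key."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

visited = set()
for url in ["HTTP://Example.org:80/docs/", "http://example.org/docs"]:
    visited.add(canonicalize(url))
```

Either key can be stored in a Python `set`, which is a hashtable; the content key catches mirrors that the URL key misses, at the cost of downloading the page first.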
Spidering Algorithm
Initialize queue (Q) with initial set of known URLs.
Until Q empty or page or time limit exhausted:
    Pop URL, L, from front of Q.
    If L is not to an HTML page (.gif, .jpeg, .ps, .pdf, .ppt…) continue loop.
    If already visited L, continue loop.
    Download page, P, for L.
    If cannot download P (e.g. 404 error, robot excluded) continue loop.
    Index P (e.g. add to inverted index or store cached copy).
    Parse P to obtain list of new links N.
    Append N to the end of Q.
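A minimal Python sketch of this loop follows. To stay runnable without network access it takes the download step as a parameter: `fetch` is a hypothetical function returning the page's HTML or `None` on failure, and `TOY_WEB` is a made-up two-page site. Only the page limit is modeled (not the time limit), and "indexing" is reduced to storing the page text.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect the href targets of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

NON_HTML = (".gif", ".jpeg", ".jpg", ".ps", ".pdf", ".ppt")

def spider(seeds, fetch, page_limit=100):
    """Crawl from seeds; fetch(url) returns HTML text or None on failure."""
    queue = deque(seeds)
    visited = set()
    indexed = {}                      # stands in for the inverted index or cache
    while queue and len(indexed) < page_limit:
        url = queue.popleft()         # pop from the front of Q
        if url.lower().endswith(NON_HTML):
            continue                  # not an HTML page
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        if page is None:              # e.g. 404 error or robot exclusion
            continue
        indexed[url] = page
        parser = LinkParser()
        parser.feed(page)
        # Resolve relative links against the current URL, then append to Q.
        queue.extend(urljoin(url, link) for link in parser.links)
    return indexed

# Hypothetical in-memory "web" so the sketch runs offline.
TOY_WEB = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="/logo.gif">logo</a>',
    "http://example.org/a.html": '<a href="/">home</a> <a href="/missing.html">gone</a>',
}
result = spider(["http://example.org/"], TOY_WEB.get)
```

A production spider would also honor robots.txt, canonicalize URLs before the visited test, and rate-limit requests per host; those are left out here to keep the correspondence with the pseudocode visible.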
Queueing Strategy
• How new links are added to the queue determines the search strategy.
• FIFO (append to end of Q) gives breadth-first search.
• LIFO (add to front of Q) gives depth-first search.
• Heuristically ordering the Q gives a “focused crawler” that directs its search towards “interesting” pages.
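The heuristic ordering can be sketched with a priority queue. The scoring function here is purely an assumption for illustration (it counts topic keywords in the URL string); real focused crawlers typically score the text of the linking page or the anchor text. `Frontier` and `topic_score` are hypothetical names.

```python
import heapq
from itertools import count

class Frontier:
    """Priority queue of URLs; higher-scoring URLs are popped first."""
    def __init__(self, score):
        self._score = score
        self._heap = []
        self._tie = count()           # keeps insertion order among equal scores

    def push(self, url):
        # heapq is a min-heap, so negate the score to pop the best URL first.
        heapq.heappush(self._heap, (-self._score(url), next(self._tie), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)

def topic_score(url, keywords=("search", "retrieval")):
    """Hypothetical heuristic: count topic keywords appearing in the URL."""
    return sum(word in url.lower() for word in keywords)

frontier = Frontier(topic_score)
for url in ["http://example.org/sports",
            "http://example.org/search-engines",
            "http://example.org/information-retrieval/search",
            "http://example.org/news"]:
    frontier.push(url)
order = [frontier.pop() for _ in range(len(frontier))]
```

With a constant score the tie-breaking counter makes this queue behave as plain FIFO, so breadth-first search falls out as a special case.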
Source: http://www.bruceclay.com
• Google is a search engine that maintains its own spider-based index.
• Google also has a directory that is powered by the Open Directory.
• Google supports:
– Boolean search
– Phrase
– Similarity
– Proximity
Source: lookoff.com, http://www.bruceclay.com
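Phrase and proximity queries need word positions, not just word membership. The sketch below shows one generic way to support them with a positional index; it is not a description of how Google implements these features, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions where the term occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def proximity_match(index, first, second, max_gap):
    """Docs where `second` occurs 1..max_gap positions after `first`."""
    hits = set()
    firsts = index.get(first, {})
    seconds = index.get(second, {})
    for doc_id in firsts.keys() & seconds.keys():
        second_positions = set(seconds[doc_id])
        if any(p + gap in second_positions
               for p in firsts[doc_id] for gap in range(1, max_gap + 1)):
            hits.add(doc_id)
    return hits

def phrase_match(index, first, second):
    """A two-word phrase is the proximity query with a gap of exactly one."""
    return proximity_match(index, first, second, max_gap=1)

# Hypothetical documents keyed by id.
docs = {
    1: "web search engines index the web",
    2: "search the web with engines",
    3: "engines power web search",
}
index = build_positional_index(docs)
```

Here `phrase_match(index, "search", "engines")` only accepts documents where the two words are adjacent and in order, while `proximity_match` with a larger `max_gap` also accepts documents where other words come between them.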
Strengths
• The interface is tremendously simple, but the quality in results is not significantly impeded
• Accuracy for common topics
Weaknesses
• Lack of power features
• Coverage of the Internet is much less than some competitors
• No OR keyword support for Boolean searches
Source: lookoff.com, http://www.bruceclay.com
Yahoo!
Strengths
• Coverage of the Internet is excellent
• Links are generally quite up to date and free of spam and poor quality sites
• Human maintainers ensure that sites are placed correctly within the relevant topic
• The search interface is very fast
• Yahoo integrates with indexed searches after presenting Yahoo topic areas
• Accuracy for common topics
Weaknesses
• The search interface is very effective for general searches but could be better for powerful searches
• Not all relevant sites are listed in Yahoo; they have to be submitted and accepted.
Source: lookoff.com, http://www.bruceclay.com
Ask Jeeves
Strengths
• A simple interface makes it very easy to form queries. Excellent for new users and children.
• If your query corresponds to a pre-packaged answer, you can expect some surprisingly good results. Millions of bundled answers provide premium answers that are superior to standard index searches.
• The site is actively maintained.
• An integrated metacrawler provides results for your search from Goto, AltaVista, Mamma and 4Anything.
• The search code is very fast.
Weaknesses
• The site supposedly takes payment for top spots, sometimes placing dubious quality links at the top of results.
• No advanced search.
• Very little power in constructing your keywords.
• Little control over filtering results.
MSN
Strengths
• Very active news portal with updated and well-presented headlines.
• Integrated single sign-on with hotmail, msn, etc.
• Configurable interface lets you customize content, layout and colors.
• Very actively maintained.
• Many interesting (although often commercially-oriented) services tied into the MSN network.
• Nationalized versions for quite a few countries providing a more specific content and news feed.
• Ability to save (i.e. tag) results to quickly filter search results into a candidates list.
Weaknesses
• Not a low-bandwidth interface. Slow modem users should beware.
• Mediocre search interface
• Less web coverage than most search engines
Program        Pages (#)   Class      FAQ  FTP  Index  Meta  Misc  News  Portal
Dejanews       300M msg    Best       N    N    N      N     Y     Y     N
Raging         250M        Best       N    N    Y      N     N     N     N
Yahoo          500T        Best       N    N    N      N     N     N     Y
AllTheWeb      300M        Excellent  N    N    Y      N     N     N     N
AltaVista      250M        Excellent  N    N    Y      N     N     Y     Y
FAQS           3300 FAQs   Excellent  Y    N    N      N     Y     N     N
FTPSearch      100M file   Excellent  N    Y    N      N     N     N     N
Search.com     N/A         Excellent  N    N    N      Y     N     N     N
About          ?           Good       N    N    N      N     Y     N     Y
AskJeeves      8M Ques.    Good       Y    N    Y      N     N     N     Y
DirectHit      ?           Good       N    N    N      N     N     N     Y
Excite         ?           Good       N    N    Y      N     N     Y     Y
Go             50M?        Good       N    N    Y      N     N     N     Y
[name missing] 100M?       Good       N    N    Y      N     N     N     N
HotBot         150M?       Good       N    N    Y      N     N     N     Y
Lycos          250M?       Good       N    Y    Y      N     N     N     Y
MetaCrawler    N/A         Good       N    N    N      Y     N     N     N
MSN            120M?       Good       N    N    Y      N     N     N     Y
NorthernLight  200M?       Good       N    N    Y      N     N     Y     N
OpenDirectory  1M?         Good       N    N    N      N     N     N     Y
WebCenter      500T?       Good       N    N    N      N     N     N     Y
DogPile        N/A         Okay       N    Y    N      Y     Y     Y     Y
GoTo           ?           Okay       N    N    Y      N     N     N     Y
InfoSpace      very few    Okay       N    N    Y      N     Y     N     N
iWon           350M?       Okay       N    N    Y      N     Y     N     N
Snap           ?           Okay       N    N    Y      N     N     N     Y
Mamma          n/a         Weak       N    N    N      Y     N     N     N
Conclusion
• Intelligent agent technology could be used to improve the searching method.
• Quantum search methods could also be explored.
Thank you all!