Copy of Presentation 1
PRESENTED BY:
Buddaraju Akhila Devi (502)
Contents:
What is a web crawler?
History of web crawlers
Architecture of a web crawler
How does a web crawler work?
Uses and difficulties
Policies used
Examples (crawler architectures)
Open source crawlers
Demo program: Nutch
What is a web crawler?
Also known as a web spider or web robot.
A program used by search engines to download pages from the web for later processing by a search engine, which indexes the downloaded pages to provide fast searches.
It crawls (visits) web pages and downloads them.
Starting from one page, it determines which page(s) to go to next.
It creates a copy of visited pages for later usage (indexing).
History of Web Crawlers:
1994: WebCrawler created by Brian Pinkerton.
1994: Launched as a web application with only a few sites in its database.
1994: DealerNet and Starwave contributed funding for advertising and searching.
1995: America Online bought WebCrawler and continued its development.
1995: Spidey introduced along with design changes.
1996: Functionality extended from pure search to a human-edited guide, GNN Select.
1997: Excite acquired WebCrawler.
2001: InfoSpace turned WebCrawler into a meta-search engine.
Architecture of Web Crawler:
A typical architecture combines a URL frontier (the queue of pages to visit), a downloader that fetches pages, a parser that extracts links and text, and storage for the downloaded copies.
How does a web crawler work?
It starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs still to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
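A minimal sketch of this loop in Python; the fetch_page and extract_links helpers, the page limit, and all names are illustrative assumptions rather than anything from the original slides:

    from collections import deque

    def crawl(seeds, fetch_page, extract_links, max_pages=100):
        # The frontier holds URLs waiting to be visited; seeds go in first.
        frontier = deque(seeds)
        visited = set()   # URLs already downloaded
        pages = {}        # local copies kept for later indexing

        while frontier and len(pages) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            visited.add(url)
            html = fetch_page(url)          # download the page (assumed helper)
            pages[url] = html               # keep a copy for indexing
            for link in extract_links(url, html):   # assumed helper
                if link not in visited:
                    frontier.append(link)
        return pages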
Uses:
Collect information.
Textual analysis.
Update data.
Automate maintenance.
Harvest e-mail addresses.
Mirroring, visualization.
Illegal uses.

Difficulties:
Large volume.
Dynamic page generation.
Fast rate of change.
Policies Used:
Selection policy: which pages to download and store.
Re-visit policy: when to check pages for changes.
Politeness policy: how to avoid overloading sites.
Parallelization policy: how to coordinate distributed web crawlers.

Behavior of Web Crawler:
The behavior of a web crawler is the outcome of the combination of these policies.
Selection Policy:
Pages are not sampled at random; the crawler selects relevant pages by prioritizing them.
Popularity can be calculated, for example with Abiteboul's OPIC (On-line Page Importance Computation) algorithm.
Daneshpajouh et al. proposed a method for discovering good seeds.
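A sketch of a prioritized frontier in Python. The scores here are plain numbers supplied by the caller purely for illustration; OPIC itself would compute importance by distributing "cash" from each page along its out-links:

    import heapq
    import itertools

    class PriorityFrontier:
        # Pops the highest-scoring URL first.
        def __init__(self):
            self._heap = []
            self._order = itertools.count()   # tie-breaker for equal scores

        def push(self, url, score):
            # heapq is a min-heap, so negate the score for max-first order
            heapq.heappush(self._heap, (-score, next(self._order), url))

        def pop(self):
            _, _, url = heapq.heappop(self._heap)
            return url

    frontier = PriorityFrontier()
    frontier.push("http://example.com/important", 0.9)
    frontier.push("http://example.com/minor", 0.1)
    assert frontier.pop() == "http://example.com/important"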
Re-visit Policy:
The cost functions used are freshness and age.
Coffman et al. restated the crawling objective in terms of freshness: the crawler should minimize the fraction of time pages remain outdated.
The problem can be modeled as a multiple-queue, single-server polling system, with the crawler as the server and the web sites as the queues.
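The two cost functions have standard definitions, sketched below in Python; the function and parameter names are mine, not from the slides:

    def freshness(is_up_to_date):
        # Freshness of page p at time t: 1 if the local copy matches
        # the live page, 0 otherwise.
        return 1 if is_up_to_date else 0

    def age(is_up_to_date, t, changed_at):
        # Age of page p at time t: 0 while the copy is fresh, otherwise
        # the time elapsed since the live page last changed.
        return 0.0 if is_up_to_date else t - changed_at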
Politeness Policy:
Crawlers retrieve data much faster than human searchers.
Even a single crawler issuing multiple requests per second can impose significant overhead on a server.
The costs of crawling include network resources, server overload, poorly written crawlers that can crash servers, and personal crawlers that disturb networks when deployed by too many users.
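A minimal politeness check using Python's standard-library robots.txt parser; the user agent string, the example URLs, and the fallback delay are illustrative assumptions:

    import time
    from urllib import robotparser

    USER_AGENT = "example-crawler"   # illustrative user agent
    DEFAULT_DELAY = 1.0              # fallback seconds between requests

    rp = robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()   # fetch and parse the site's robots.txt

    url = "http://example.com/some/page"
    if rp.can_fetch(USER_AGENT, url):
        # Honor the site's requested crawl delay, if it declares one
        delay = rp.crawl_delay(USER_AGENT) or DEFAULT_DELAY
        time.sleep(delay)
        # ... fetch the page here ...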
Parallelization Policy:
The crawler runs multiple crawl processes in parallel.
Objectives: maximize the download rate while minimizing the overhead of parallelization, and avoid repeated downloads of the same page.
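One common way to meet both objectives is to partition URLs among the processes by host; a sketch, where the process count and hashing scheme are assumptions:

    import hashlib
    from urllib.parse import urlsplit

    NUM_PROCESSES = 4   # assumed number of parallel crawl processes

    def assign_process(url):
        # Partition URLs by host so each site is handled by exactly one
        # process: no page is downloaded twice, and processes need not
        # coordinate on individual URLs.
        host = urlsplit(url).netloc
        digest = hashlib.md5(host.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % NUM_PROCESSES

    # Both URLs share a host, so both land on the same process
    assert assign_process("http://example.com/a") == assign_process("http://example.com/b")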
Examples (Crawler Architectures):
RBSE: the first published web crawler.
Slurp (Yahoo!).
Bingbot (Microsoft), which replaced Msnbot.
FAST Crawler: a distributed crawler.
Googlebot: written in C++ and Python.
PolyBot: written in C++ and Python.
World Wide Web Worm: indexing with the UNIX grep command.
WebFountain: distributed, written in C++, coordinated by a controller machine.
WebRACE: written in Java.
Open Source Crawlers:
ASPseek, in C++.
DataparkSearch, under the GNU GPL.
GNU Wget, under the GPL.
GRUB, used by Wikia Search.
Heritrix, in Java.
ICDL Crawler, in C++; and many more.
Demo Program:
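A minimal, self-contained Python demo of the crawl loop described earlier; the seed URL, page limit, and all names are illustrative assumptions, not the program shown on the original slide:

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        # Collects the href targets of anchor tags
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_pages=5):
        frontier = deque([seed])   # URLs waiting to be visited
        visited = set()
        while frontier and len(visited) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            visited.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue            # skip pages that fail to download
            print("fetched:", url)
            parser = LinkParser()
            parser.feed(html)
            for href in parser.links:
                link = urljoin(url, href)   # resolve relative links
                if link.startswith("http") and link not in visited:
                    frontier.append(link)
        return visited

    if __name__ == "__main__":
        crawl("http://example.com/")   # illustrative seed URL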
NUTCH:
Nutch is an open-source web crawler.
The Nutch web search application:
Maintains a database of pages and links.
Pages have scores, assigned by analysis.
Fetches high-scoring, out-of-date pages.
Provides a distributed search front end.
Based on Lucene.
http://lucene.apache.org/nutch/