Web Crawlers
Web crawler
• A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner.
• Other terms include ants, automatic indexers, bots, and worms, as well as Web spider, Web robot, and Web scutter.
• The process is called Web crawling or spidering.
Use of Web Crawlers
• To create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches.
• To automate maintenance tasks on a Web site, such as checking links or validating HTML code.
• To gather specific types of information from Web pages, such as harvesting e-mail addresses.
• A Web crawler is one type of bot, or software agent.
• It starts with a list of URLs to visit, called the seeds.
• As it visits each URL, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier.
• URLs from the frontier are recursively visited according to a set of policies.
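The seed-and-frontier loop described above can be sketched as follows. This is a minimal illustration, not a production crawler; `fetch` and `extract_links` are caller-supplied placeholders, not part of any particular library:

```python
from collections import deque

def crawl(seeds, fetch, extract_links, max_pages=100):
    """Minimal crawl loop: seeds -> frontier -> visit -> extract links."""
    frontier = deque(seeds)          # the crawl frontier, seeded with start URLs
    visited = set()
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:           # skip URLs reached via multiple links
            continue
        visited.add(url)
        html = fetch(url)            # caller-supplied HTTP fetcher
        pages[url] = html
        for link in extract_links(url, html):
            if link not in visited:
                frontier.append(link)
    return pages
```

With a FIFO frontier as here, the crawl proceeds breadth-first from the seeds.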
Characteristics of Web Crawling
• Its large volume – a crawler can only download a fraction of the Web's pages within a given time, so it must prioritize its downloads.
• Its fast rate of change – it is very likely that new pages have been added to a site, or that existing pages have been updated or even deleted, while the crawler is still downloading.
• Dynamic page generation – pages generated by server-side scripting languages have also created difficulty in crawling.
Crawling policies
• a selection policy that states which pages to download,
• a re-visit policy that states when to check for changes to the pages,
• a politeness policy that states how to avoid overloading Web sites,
• a parallelization policy that states how to coordinate distributed Web crawlers.
Selection policy
– PageRank
– Path-ascending crawling
– Focused crawling
Re-visit policy
• Freshness: a binary measure that indicates whether the local copy is accurate or not.
• Age: a measure that indicates how outdated the local copy is.
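These two measures can be written down directly as functions of three timestamps; the argument names are assumptions of this sketch:

```python
def freshness(local_copy_time, last_remote_change, now):
    """Freshness: 1 if the local copy still matches the server, else 0
    (a binary measure of accuracy)."""
    return 1 if last_remote_change <= local_copy_time else 0

def age(local_copy_time, last_remote_change, now):
    """Age: 0 while the copy is fresh; otherwise the time elapsed since
    the remote page changed, i.e. how long the copy has been outdated."""
    if last_remote_change <= local_copy_time:
        return 0
    return now - last_remote_change
```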
Re-visit policy(contd..)
• Uniform policy: This involves re-visiting all pages in the collection with the same frequency, regardless of their rates of change.
• Proportional policy: This involves re-visiting more often the pages that change more frequently. The visiting frequency is directly proportional to the (estimated) change frequency.
Politeness Policy
• Costs of using Web crawlers include:
– network resources: crawlers require considerable bandwidth and operate with a high degree of parallelism over a long period of time;
– server overload, if the frequency of accesses to a given server is too high;
– poorly-written crawlers, which can crash servers or routers, or download pages they cannot handle; and
– personal crawlers that, if deployed by too many users, can disrupt networks and Web servers.
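A minimal politeness mechanism enforces a delay between successive requests to the same host. The 2-second default and the injectable clock/sleep functions are assumptions of this sketch:

```python
import time
from urllib.parse import urlparse

class PoliteScheduler:
    """Enforce a minimum delay between requests to the same host
    (a simple politeness policy; the default delay is illustrative)."""
    def __init__(self, min_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.last_hit = {}           # host -> time of last request
        self.clock, self.sleep = clock, sleep

    def wait(self, url):
        """Block until it is polite to contact the URL's host again."""
        host = urlparse(url).netloc
        now = self.clock()
        due = self.last_hit.get(host, -self.min_delay) + self.min_delay
        if due > now:
            self.sleep(due - now)    # requests to other hosts are unaffected
            now = due
        self.last_hit[host] = now
```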
Parallelization Policy
• A parallel crawler runs multiple crawling processes in parallel.
• The goal is to maximize the download rate while minimizing the overhead from parallelization, and to avoid repeated downloads of the same page.
• To avoid downloading the same page more than once, the crawling system requires a policy for assigning the new URLs discovered during the crawl, since the same URL can be found by two different crawling processes.
Components
• Search engine
– Responsible for deciding which new documents to explore, and for initiating the process of their retrieval.
• Database
– Used to store the document metadata, the full-text index, and the hyperlinks between documents.
• Agents
– Responsible for retrieving the documents from the Web under the control of the search engine.
Components(Contd..)
• Query server
– Responsible for handling the query-processing service.
• libWWW
– The CERN WWW library, used by agents to access several different kinds of content using different protocols.
Web Crawler Architecture
Crawling Infrastructure
• Maintains a list of unvisited URLs called the frontier. The list is initialized with seed URLs, which may be provided by a user or another program.
• Each crawling loop involves picking the next URL to crawl from the frontier and fetching the corresponding page through HTTP.
Crawling Infrastructure(contd..)
• Before the URLs are added to the frontier they may be assigned a score that represents the estimated benefit of visiting the page corresponding to the URL.
• The crawling process may be terminated when a certain number of pages have been crawled.
• If the crawler is ready to crawl another page and the frontier is empty, the situation signals a dead-end for the crawler.
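The score-based frontier described above can be sketched with a priority queue; the class and method names here are illustrative, not from any particular crawler:

```python
import heapq

class ScoredFrontier:
    """Priority-queue frontier: URLs with higher estimated benefit are
    crawled first (heapq is a min-heap, so scores are negated)."""
    def __init__(self):
        self._heap = []
        self._seen = set()
        self._count = 0              # insertion order breaks score ties

    def add(self, url, score):
        """Add a URL with its estimated benefit; duplicates are ignored."""
        if url in self._seen:
            return
        self._seen.add(url)
        heapq.heappush(self._heap, (-score, self._count, url))
        self._count += 1

    def next_url(self):
        """Pop the highest-scoring URL, or None at a dead-end (empty frontier)."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```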
Graph Search Problem
• Crawling can be viewed as a graph search problem.
• The Web is seen as a large graph with pages as its nodes and hyperlinks as its edges.
• A crawler starts at a few of the nodes (seeds) and then follows the edges to reach other nodes.
• The process of fetching a page and extracting the links within it is analogous to expanding a node in graph search.
Frontier
• The frontier is the to-do list of a crawler that contains the URLs of unvisited pages.
• In graph search terminology the frontier is an open list of unexpanded (unvisited) nodes.
• The frontier can fill up rather quickly as pages are crawled.
• The frontier may be implemented as a FIFO queue, in which case we have a breadth-first crawler that can be used to blindly crawl the Web.
History and Page Repository
• A time-stamped list of URLs that were fetched by the crawler.
• It shows the path of the crawler through the Web starting from the seed pages.
• A URL entry is made into the history only after fetching the corresponding page.
• History may be used for post crawl analysis and evaluations.
History and Page Repository(contd..)
• In its simplest form, a page repository may store the crawled pages as separate files.
• Each page must map to a unique file name.
• One way is to derive a compact string using some form of hashing function with a low probability of collisions.
• MD5 is a one-way hashing function that provides a 128-bit hash code for each URL.
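Mapping each URL to a unique repository file name via MD5 might look like this (the `.html` extension is an assumption of the sketch):

```python
import hashlib

def page_filename(url):
    """Map a URL to a unique repository file name via its MD5 hash
    (a 128-bit code rendered as 32 hex characters)."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return digest + ".html"
```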
Fetching
• The crawler acts as an HTTP client: it sends an HTTP request for a page and reads the response.
• The client needs timeouts to make sure that an unnecessary amount of time is not spent on slow servers or in reading large pages.
• The client also needs to parse the response headers for status codes and redirections.
Fetching(contd..)
• Error checking and exception handling are important during the page-fetching process.
• It is useful to collect statistics on timeouts and status codes for identifying problems or automatically adjusting timeout values.
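A fetcher with a timeout, a size cap, and basic error handling could be sketched with the standard library as follows; the 10-second timeout and 1 MB cap are illustrative values:

```python
import urllib.request
import urllib.error

def fetch(url, timeout=10, max_bytes=1_000_000):
    """Fetch a page with a timeout and a size cap.
    Returns (status, body) on success, or (None, None) on error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read(max_bytes)   # avoid reading arbitrarily large pages
            return resp.status, body
    except (urllib.error.URLError, TimeoutError, ValueError):
        return None, None                 # callers can count these for statistics
```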
Robot Exclusion Protocol
• Provides a mechanism for Web server administrators to communicate their file-access policies.
• It identifies files that may not be accessed by a crawler.
• This is done by keeping a file named robots.txt under the root directory of the Web server.
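Python's standard `urllib.robotparser` can apply such rules; the sample robots.txt content and the agent name below are illustrative:

```python
from urllib import robotparser

# A sample robots.txt (illustrative content), parsed directly from lines
# rather than fetched from a live server.
rules = robotparser.RobotFileParser()
rules.parse("""\
User-agent: *
Disallow: /private/
""".splitlines())

def allowed(url, agent="MyCrawler"):
    """Check a URL against the parsed robots.txt rules."""
    return rules.can_fetch(agent, url)
```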
Parsing
• Once a page has been fetched, its content is parsed to extract information that will feed, and possibly guide, the future path of the crawler.
• Parsing may imply simple hyperlink/URL extraction, or it may involve the more complex process of tidying up the HTML content in order to analyze the HTML tag tree.
• Parsing might also involve steps to convert the extracted URLs to a canonical form, remove stopwords from the page's content, and stem the remaining words.
URL Extraction and Canonicalization
• To extract hyperlink URLs from a Web page, an HTML parser can be used to find anchor tags and grab the values of the associated href attributes.
• Any relative URLs are converted to absolute URLs using the base URL of the page.
• Different URLs that correspond to the same Web page can then be mapped onto a single canonical form.
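Anchor-tag extraction with relative-URL resolution can be sketched with the standard-library HTML parser (class and function names are illustrative):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # relative URLs become absolute via the page's base URL
                    self.links.append(urljoin(self.base_url, value))

def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```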
Canonicalization Procedures
• Convert the protocol and hostname to lowercase.
• Remove the 'anchor' or 'reference' (fragment) part of the URL.
• Perform URL encoding for some commonly used characters such as '~'.
• For some URLs, add trailing '/'s.
• Use heuristics to recognize default Web pages.
Canonicalization Procedures(contd..)
• Remove '..' and its parent directory from the URL path.
• Leave the port number in the URL unless it is port 80; add port 80 when no port number is specified.
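Several of the steps above can be combined into one function. This is a partial sketch, not a complete normalizer; the handling of trailing slashes, encoding, and default pages is omitted:

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Partial canonicalization: lowercase scheme/host, drop the fragment,
    resolve '.' and '..' path segments, and make port 80 explicit for http."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    port = parts.port if parts.port is not None else (80 if scheme == "http" else None)
    netloc = host + (":%d" % port if port is not None else "")
    segments = []
    for seg in parts.path.split("/"):
        if seg == "..":
            if segments:
                segments.pop()       # '..' removes its parent directory
        elif seg not in (".", ""):
            segments.append(seg)
    path = "/" + "/".join(segments)
    # fragment ('anchor') is dropped entirely
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```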
Stoplisting
• Commonly used words, or stopwords, such as "it" and "can", are removed.
• The process of removing stopwords from text is called stoplisting.
• Some systems recognize no more than nine words ("an", "and", "by", "for", "from", "of", "the", "to", and "with") as stopwords.
Stemming
• The stemming process normalizes words by conflating a number of morphologically similar words to a single root form, or stem.
• Example: "connect", "connected", and "connection" are all reduced to "connect".
• Stemming has been found to reduce the precision of crawling results.
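A toy suffix-stripping stemmer illustrates the conflation. Real systems typically use the Porter stemmer; this deliberately simplified version is an assumption of the sketch:

```python
def simple_stem(word):
    """Toy suffix-stripping stemmer (not the Porter algorithm):
    strips a few common suffixes, keeping a stem of at least 3 letters."""
    for suffix in ("ion", "ed", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```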
HTML tag tree
• Crawlers may assess a link or a piece of text by examining the HTML tag context in which it resides.
• Often the crawler only needs the links within a page, and the text or portions of the text in the page, which can be obtained using HTML parsers.
URL Normalization
• Needed to avoid crawling the same resource more than once.
• Also called URL canonicalization, refers to the process of modifying and standardizing a URL in a consistent manner.
• Several types of normalization exist, including conversion of URLs to lowercase, removal of "." and ".." segments, and adding trailing slashes to the non-empty path component.
Crawler identification
• Crawlers typically identify themselves to a Web server through the User-agent field of an HTTP request.
• Identification is also useful for administrators, who can then know when to expect their Web pages to be indexed by a particular search engine.
• Spambots and other malicious Web crawlers are unlikely to place identifying information in the user-agent field, or they may mask their identity as a browser or another well-known crawler.
Multi-threaded Crawlers
• A sequential crawling loop spends a large amount of time in which either the CPU or the network is idle.
• In a multi-threaded crawler, each thread follows its own crawling loop; this can provide a reasonable speed-up and efficient use of the available bandwidth.
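A thread-per-worker crawl over a shared queue can be sketched as follows. `fetch` is again a caller-supplied placeholder, and for simplicity this sketch does not add newly discovered links back to the queue:

```python
import threading
import queue

def threaded_crawl(seeds, fetch, num_threads=4):
    """Multi-threaded crawl sketch: worker threads share a frontier queue
    so network I/O on one URL overlaps with work on others."""
    frontier = queue.Queue()
    for url in seeds:
        frontier.put(url)
    results, lock = {}, threading.Lock()

    def worker():
        while True:
            try:
                url = frontier.get_nowait()
            except queue.Empty:
                return               # no work left for this thread
            page = fetch(url)        # the I/O-bound step, run concurrently
            with lock:               # serialize writes to the shared dict
                results[url] = page
            frontier.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```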
Page Importance
• Keywords in the document
• Similarity to a query
• Similarity to seed pages
• Classifier score
• Retrieval system rank
• Link-based popularity
Summary Analysis
• Acquisition rate
• Average relevance
• Target recall
• Robustness
Nutch
• An open-source Web crawler.
• Nutch Web search application:
– Maintains a DB of pages and links
– Pages have scores, assigned by analysis
– Fetches high-scoring, out-of-date pages
– Distributed search front end
– Based on Lucene
Examples
• Yahoo Crawler (Slurp) is the name of the Yahoo! Search crawler.
• Google Crawler is described in some detail, but the reference covers only an early version of its architecture, which was based on C++ and Python.
Open-source crawlers
• Aspseek is a crawler, indexer and a search engine written in C and licensed under the GPL.
• DataparkSearch is a crawler and search engine released under the GNU General Public License.
• YaCy, a free distributed search engine, built on principles of peer-to-peer networks (licensed under GPL).