distributed web crawlers

Posted 22-Dec-2015
Implementation
• All of the following experiments were conducted on 40M web pages downloaded with Stanford's WebBase crawler over a two-week period in December 1999.
• The web image projected from this crawl may be biased, but it represents the pages a parallel crawler would fetch.
Firewall Mode & Coverage
• Firewall mode:
– Every c-proc collects pages only from its predetermined partition, and follows only intra-partition links.
– This mode has minimal communication overhead, but it may have quality and coverage problems.
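The firewall rule above can be sketched in a few lines. This is an illustrative sketch, not the paper's code: it assumes site-hash partitioning over hostnames, and the function names are made up for this example.

```python
import hashlib

def partition_of(url, n):
    """Site-hash partitioning: assign a URL to one of n c-procs by
    hashing its site (hostname), so all pages of a site land in the
    same partition."""
    site = url.split("/")[2]  # hostname of an http://host/path URL
    digest = hashlib.md5(site.encode()).hexdigest()
    return int(digest, 16) % n

def firewall_filter(links, my_id, n):
    """Firewall mode: keep only intra-partition links; links that hash
    to another c-proc's partition are simply dropped, which is where
    the coverage loss comes from."""
    return [u for u in links if partition_of(u, n) == my_id]
```

Because partitioning is by site, two pages on the same host always map to the same c-proc, which keeps most links intra-partition.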
Firewall Mode & Coverage
• We treat the 40M pages as the entire web.
• We use site-hash based partitioning.
• Each c-proc was given five random seed sites from its own partition (5n sites for the overall crawler of n c-procs).
Results
Results (2)
Conclusions
• When a small number of c-procs run in parallel, this mode provides good coverage, and the crawler may start with a relatively small number of seed URLs.
• This mode is not a good choice when coverage is important, especially when many c-procs run in parallel.
Example
• Suppose we want to download 1B pages over one month, with a 10 Mbps link to the Internet per c-proc machine:
– We need to download 10^9 × 10^4 = 10^13 bytes (assuming ~10 KB per page).
– This requires a download rate of roughly 34 Mbps, so we need 4 c-procs. From Fig. 4 we conclude that coverage will be about 80%.
– If we only have a week, we need a download rate of about 140 Mbps, i.e. 14 c-procs, which will cover only about 50%.
Cross-over & Overlap
• This mode may yield improved coverage, since a c-proc follows inter-partition links once it runs out of links in its own partition.
• This mode also incurs overlap, because a page can be downloaded by several c-procs.
• => The crawler increases coverage at the expense of overlap.
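The cross-over scheduling policy can be sketched as a two-queue rule. This is an illustrative sketch (names are made up), showing how overlap arises: a foreign link is followed only when the local frontier is empty, but another c-proc that owns that page may fetch it too.

```python
from collections import deque

def crossover_next(own_queue, foreign_queue):
    """Cross-over mode: prefer links in the c-proc's own partition;
    only follow inter-partition (foreign) links once the local queue
    is exhausted. Returns None when both queues are empty."""
    if own_queue:
        return own_queue.popleft()
    if foreign_queue:
        return foreign_queue.popleft()
    return None
```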
Cross-over & Overlap
• We treat the 40M pages as the entire web.
• We use site-hash based partitioning.
• Each c-proc was given five random seed sites from its own partition (5n sites for the overall crawler).
• We measure overlap at various coverage points.
Results
Conclusions
• While this mode is much better than an independent crawl, it still incurs quite significant overlap. For example, 4 c-procs must incur an overlap of almost 2.5 in order to reach coverage close to 1. For this reason, this mode is not recommended unless coverage is important and no communication between c-procs is available.
Exchange Mode & Communication
• In this section we learn the communication overhead of an exchange mode crawler and how to reduce it by replication.
• We split the 40M pages into n partitions based on site-hash value, and run n c-proc’s in exchange mode.
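The exchange rule differs from firewall mode in what happens to inter-partition links: instead of being dropped, they are forwarded to the owning c-proc. A minimal sketch, again assuming site-hash partitioning over hostnames (function names are illustrative):

```python
import hashlib
from collections import defaultdict

def route_links(links, my_id, n):
    """Exchange mode: intra-partition links go to the local frontier;
    inter-partition links are queued in an outbox keyed by the owning
    c-proc's id, to be sent in the next (batched) exchange."""
    local, outbox = [], defaultdict(list)
    for url in links:
        site = url.split("/")[2]
        owner = int(hashlib.md5(site.encode()).hexdigest(), 16) % n
        if owner == my_id:
            local.append(url)
        else:
            outbox[owner].append(url)
    return local, outbox
```

Every discovered link ends up either in the local frontier or in exactly one peer's outbox, which is why exchange mode has no coverage loss and no overlap.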
Results
Conclusions
• The site-hash based partitioning scheme significantly reduces communication overhead compared with the URL-hash based scheme. On average we need to transfer less than 10% of the discovered links (or up to 1 link per page).
Conclusions (2)
• The network bandwidth used for URL exchange is relatively small. The average URL is about 40 bytes long, while an average page is about 10 KB, so this transfer consumes only about 0.4% of the total network bandwidth.
Conclusions (3)
• The processing overhead of this exchange is nonetheless quite significant, because each transmission goes through the TCP/IP network stack on both sides and incurs two switches between kernel and user mode.
Reducing Overhead by Replication
Conclusions
• Based on this result, replicating the 10,000–100,000 most popular URLs in each c-proc gives the best results: it minimizes communication overhead while keeping the replication overhead low.
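The idea can be sketched as a pre-filter on the exchange traffic: links to the replicated popular URLs need not be sent, because every c-proc already knows them. This is an illustrative sketch; `replicated_popular` is simply a set here, and in practice it would be the k most linked-to URLs from a previous crawl.

```python
def filter_exchange(outgoing, replicated_popular):
    """Drop links to replicated popular URLs from the outgoing
    exchange batch; only links to less popular (non-replicated)
    pages are actually transferred between c-procs."""
    return [u for u in outgoing if u not in replicated_popular]
```

Because link popularity is highly skewed, a small replicated set absorbs a large share of the discovered links, which is why replicating only 10k–100k URLs already cuts the exchange traffic substantially.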
Quality & Batch Communication
• In this section we study the quality issue: as mentioned, a parallel crawler can be worse than a single-process crawler if every c-proc decides what to download based solely on its own local information.
Quality & Batch Communication (2)
• Throughout this section we regard a page's importance I(p) as the number of backlinks it has.
– This is the most common importance metric.
– The quality achieved obviously depends on how often the c-procs exchange backlink information.
Quality at Different Exchange Rates
Conclusions
• As the number of c-procs increases, quality becomes worse unless they exchange backlink messages often.
• The quality of a firewall mode crawl is worse than that of a single-process crawler when downloading a small fraction of the pages; however, there is almost no difference when downloading larger fractions.
Quality and Communication Overhead
Conclusions
• Communication overhead does not increase linearly.
• A large number of URL exchanges is not necessary to achieve high quality, especially when downloading a large portion of the web (Fig. 9).
Final Example
• Say we plan to operate a medium-scale search engine that covers 20% of the web (240M pages). We plan to refresh the index once a month, and our machines have a 1 Mbps connection to the Internet:
– We need about 7.44 Mbps of download bandwidth, so we have to run at least 8 c-procs in parallel.
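This figure can be checked the same way as the earlier example. A direct calculation with exactly 10 KB pages and a 30-day month gives about 7.4 Mbps (close to the slide's 7.44, which likely used slightly different constants); rounding up to whole 1 Mbps machines yields 8 c-procs either way.

```python
import math

# Final example: 240M pages (~20% of the web) of ~10 KB each,
# refreshed monthly, with 1 Mbps per c-proc machine.
total_bits = 240_000_000 * 10_000 * 8             # ~1.92e13 bits
required_mbps = total_bits / (30 * 86400) / 1e6   # ~7.4 Mbps
n_cprocs = math.ceil(required_mbps / 1.0)         # at least 8 c-procs
```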
Related charts
Final Conclusions
• When a small number of c-procs run in parallel, firewall mode provides good coverage. Given the simplicity of this mode, it is a good option to consider unless:
– More than 4 c-procs are required (Fig. 4).
– Only a small subset of the web is needed and quality is important (Fig. 9).
Final Conclusions (2)
• An exchange mode crawler consumes little network bandwidth and minimizes overhead when batch communication is used. Quality is maximized even with fewer than 100 URL exchanges.
• Replicating the 10,000–100,000 most popular URLs reduces communication overhead by roughly 40%. Further replication contributes little (Fig. 8).
References
• Junghoo Cho and Hector Garcia-Molina. "Parallel Crawlers." October 2001.
• Mike Burner. "Crawling Towards Eternity." Web Techniques Magazine, May 1998.