distributed web crawlers

Posted 22-Dec-2015
Implementation
• All of the following experiments were conducted on 40M web pages downloaded with Stanford's WebBase crawler over a two-week period in December 1999.
• The web image projected from this crawl may be biased, but it represents the pages a parallel crawler would fetch.
Firewall Mode & Coverage
• Firewall mode:
– Every c-proc collects pages only from its predetermined partition, and follows only intra-partition links.
– This mode has minimal communication overhead, but it may have quality and coverage problems.
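The firewall rule above can be sketched in a few lines. This is an illustrative sketch, not the paper's code: it assumes site-hash partitioning over hostnames, and the function names are made up for this example.

```python
import hashlib

def partition_of(url, n):
    """Site-hash partitioning: assign a URL to one of n c-procs by
    hashing its site (hostname), so all pages of a site land in the
    same partition."""
    site = url.split("/")[2]  # hostname of an http://host/path URL
    digest = hashlib.md5(site.encode()).hexdigest()
    return int(digest, 16) % n

def firewall_filter(links, my_id, n):
    """Firewall mode: keep only intra-partition links; links that hash
    to another c-proc's partition are simply dropped, which is where
    the coverage loss comes from."""
    return [u for u in links if partition_of(u, n) == my_id]
```

Because partitioning is by site, two pages on the same host always map to the same c-proc, which keeps most links intra-partition.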
Firewall Mode & Coverage
• We treat the 40M pages as the entire web.
• We use site-hash based partitioning.
• Each c-proc was given five random seed sites from its own partition (5n sites for the overall crawler of n c-procs).
Results
Results (2)
Conclusions
• When a small number of c-procs run in parallel, this mode provides good coverage, and the crawler may start with a relatively small number of seed URLs.
• This mode is not a good choice when coverage is important, especially when many c-procs run in parallel.
Example
• Suppose we want to download 1B pages over one month, with a 10 Mbps link to the Internet per c-proc machine:
– We need to download 10^9 × 10^4 = 10^13 bytes (assuming ~10 KB per page).
– This requires a download rate of roughly 34 Mbps, so we need 4 c-procs. From Fig. 4 we conclude that coverage will be about 80%.
– If we only have a week, we need a download rate of about 140 Mbps, i.e. 14 c-procs, which will cover only about 50%.
Cross-over & Overlap
• This mode may yield improved coverage, since a c-proc follows inter-partition links once it runs out of links in its own partition.
• This mode also incurs overlap, because a page can be downloaded by several c-procs.
• => The crawler increases coverage at the expense of overlap.
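The cross-over scheduling policy can be sketched as a two-queue rule. This is an illustrative sketch (names are made up), showing how overlap arises: a foreign link is followed only when the local frontier is empty, but another c-proc that owns that page may fetch it too.

```python
from collections import deque

def crossover_next(own_queue, foreign_queue):
    """Cross-over mode: prefer links in the c-proc's own partition;
    only follow inter-partition (foreign) links once the local queue
    is exhausted. Returns None when both queues are empty."""
    if own_queue:
        return own_queue.popleft()
    if foreign_queue:
        return foreign_queue.popleft()
    return None
```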
Cross-over & Overlap
• We treat the 40M pages as the entire web.
• We use site-hash based partitioning.
• Each c-proc was given five random seed sites from its own partition (5n sites for the overall crawler).
• We measure overlap at various coverage points.
Results
Conclusions
• While this mode is much better than an independent crawl, it still incurs quite significant overlap. For example, 4 c-procs must incur an overlap of almost 2.5 in order to reach coverage close to 1. For this reason, this mode is not recommended unless coverage is important and no communication between c-procs is available.
Exchange Mode & Communication
• In this section we learn the communication overhead of an exchange mode crawler and how to reduce it by replication.
• We split the 40M pages into n partitions based on site-hash value, and run n c-proc’s in exchange mode.
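The exchange rule differs from firewall mode in what happens to inter-partition links: instead of being dropped, they are forwarded to the owning c-proc. A minimal sketch, again assuming site-hash partitioning over hostnames (function names are illustrative):

```python
import hashlib
from collections import defaultdict

def route_links(links, my_id, n):
    """Exchange mode: intra-partition links go to the local frontier;
    inter-partition links are queued in an outbox keyed by the owning
    c-proc's id, to be sent in the next (batched) exchange."""
    local, outbox = [], defaultdict(list)
    for url in links:
        site = url.split("/")[2]
        owner = int(hashlib.md5(site.encode()).hexdigest(), 16) % n
        if owner == my_id:
            local.append(url)
        else:
            outbox[owner].append(url)
    return local, outbox
```

Every discovered link ends up either in the local frontier or in exactly one peer's outbox, which is why exchange mode has no coverage loss and no overlap.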
Results
Conclusions
• The site-hash based partitioning scheme significantly reduces communication overhead compared with the URL-hash based scheme. On average we need to transfer less than 10% of the discovered links (or up to 1 link per page).
Conclusions (2)
• The network bandwidth used for URL exchange is relatively small. The average URL is about 40 bytes long, while an average page is about 10 KB, so this transfer consumes only about 0.4% of the total network bandwidth.
Conclusions (3)
• The processing overhead of this exchange is nonetheless quite significant, because each transmission goes through the TCP/IP network stack on both sides and incurs two switches between kernel and user mode.
Reducing Overhead by Replication
Conclusions
• Based on this result, replicating the 10,000–100,000 most popular URLs in each c-proc gives the best results: it minimizes communication overhead while keeping the replication overhead low.
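The idea can be sketched as a pre-filter on the exchange traffic: links to the replicated popular URLs need not be sent, because every c-proc already knows them. This is an illustrative sketch; `replicated_popular` is simply a set here, and in practice it would be the k most linked-to URLs from a previous crawl.

```python
def filter_exchange(outgoing, replicated_popular):
    """Drop links to replicated popular URLs from the outgoing
    exchange batch; only links to less popular (non-replicated)
    pages are actually transferred between c-procs."""
    return [u for u in outgoing if u not in replicated_popular]
```

Because link popularity is highly skewed, a small replicated set absorbs a large share of the discovered links, which is why replicating only 10k–100k URLs already cuts the exchange traffic substantially.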
Quality & Batch Communication
• In this section we study the quality issue: as mentioned, a parallel crawler can be worse than a single-process crawler if every c-proc decides what to download based solely on its own local information.
Quality & Batch Communication (2)
• Throughout this section we regard a page's importance I(p) as the number of backlinks it has.
– This is the most common importance metric.
– The quality achieved obviously depends on how often the c-procs exchange backlink information.
Quality at Different Exchange Rates
Conclusions
• As the number of c-procs increases, quality becomes worse unless they exchange backlink messages often.
• The quality of a firewall mode crawl is worse than that of a single-process crawler when downloading a small fraction of the pages; however, there is almost no difference when downloading larger fractions.
Quality and Communication Overhead
Conclusions
• Communication overhead does not increase linearly.
• A large number of URL exchanges is not necessary to achieve high quality, especially when downloading a large portion of the web (Fig. 9).
Final Example
• Say we plan to operate a medium-scale search engine that covers 20% of the web (240M pages). We plan to refresh the index once a month, and our machines have a 1 Mbps connection to the Internet:
– We need about 7.44 Mbps of download bandwidth, so we have to run at least 8 c-procs in parallel.
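This figure can be checked the same way as the earlier example. A direct calculation with exactly 10 KB pages and a 30-day month gives about 7.4 Mbps (close to the slide's 7.44, which likely used slightly different constants); rounding up to whole 1 Mbps machines yields 8 c-procs either way.

```python
import math

# Final example: 240M pages (~20% of the web) of ~10 KB each,
# refreshed monthly, with 1 Mbps per c-proc machine.
total_bits = 240_000_000 * 10_000 * 8             # ~1.92e13 bits
required_mbps = total_bits / (30 * 86400) / 1e6   # ~7.4 Mbps
n_cprocs = math.ceil(required_mbps / 1.0)         # at least 8 c-procs
```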
Related charts
Final Conclusions
• When a small number of c-procs run in parallel, firewall mode provides good coverage. Given the simplicity of this mode, it is a good option to consider unless:
– More than 4 c-procs are required (Fig. 4).
– Only a small subset of the web is needed and quality is important (Fig. 9).
Final Conclusions (2)
• An exchange mode crawler consumes little network bandwidth and minimizes overhead when batch communication is used. Quality is maximized even with fewer than 100 URL exchanges.
• Replicating the 10,000–100,000 most popular URLs reduces communication overhead by roughly 40%. Further replication contributes little (Fig. 8).
References
• Junghoo Cho and Hector Garcia-Molina. "Parallel Crawlers." October 2001.
• Mike Burner. "Crawling Towards Eternity." Web Techniques Magazine, May 1998.