DEXA'11, Toulouse, France, 31.08.2011
Sampling National Deep Web
Denis Shestakov, fname.lname at aalto.fi
Department of Media Technology, Aalto University
Outline
● Background
● Our approach: Host-IP cluster random sampling
● Results
● Conclusions
Background
● Deep Web: web content behind search interfaces
● Example of a search interface: [image on the original slide]
● Main problem: hard to crawl, so content is poorly indexed and not available for search (hidden)
● Many research problems: roughly 150-200 works address certain aspects of the challenge (e.g., see 'Search interfaces on the Web: querying and characterizing', Shestakov, 2008)
● "Clearly, the science and practice of deep web crawling is in its infancy" (in 'Web crawling', Olston&Najork, 2010)
Background
● What is still unknown (surprisingly):
  ○ How large is the deep Web: number of deep web resources? amount of content in them? what portion is indexed?
● So far only several studies addressed this:
  ○ Bergman, 2001: number, amount of content
  ○ Chang et al., 2004: number, coverage
  ○ Shestakov et al., 2007: number
  ○ Chinese surveys: number
  ○ ....
Background
● None of the approaches used so far works well
● Basically, the idea behind estimating the number of deep web sites:
  ○ IP address random sampling method (proposed in 1997)
  ○ Description: take the pool of all IP addresses (~3 billion currently in use), generate a random sample (~one million is enough), and connect to each address; if it serves HTTP, crawl it and look for search interfaces
  ○ Count the search interfaces found in the sample and apply sampling math to get an estimate
  ○ One can restrict to some segment of the Web (e.g., national): then the pool consists of national IP addresses only
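The IP random sampling procedure above can be sketched in a few lines. This is a minimal illustration, not the code used in any of the cited studies: the pool is a documentation-only network (RFC 5737) standing in for a national IP range, and `has_search_interface` is a hypothetical placeholder for the real work of probing HTTP, crawling, and detecting interfaces.

```python
import ipaddress
import random


def sample_ips(pool, n, seed=0):
    """Draw n distinct addresses uniformly at random from the pool."""
    rng = random.Random(seed)
    return rng.sample(pool, n)


def estimate_total(pool_size, sample_size, found_in_sample):
    """Scale the count observed in the sample up to the whole pool."""
    return pool_size * found_in_sample / sample_size


def has_search_interface(ip):
    """Hypothetical stand-in: a real study would connect to the address,
    crawl it if it serves HTTP, and look for search interfaces."""
    return int(ip.rsplit(".", 1)[1]) % 10 == 0


# Assumed toy pool: 254 addresses from a documentation-only range.
pool = [str(ip) for ip in ipaddress.ip_network("198.51.100.0/24").hosts()]
sample = sample_ips(pool, 50)
found = sum(has_search_interface(ip) for ip in sample)
estimate = estimate_total(len(pool), len(sample), found)
print(f"{found} of {len(sample)} sampled IPs matched; estimated total: {estimate:.1f}")
```

Restricting the study to a national segment only changes what goes into `pool`; the scaling step stays the same.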
Virtual Hosting
● Bottleneck: virtual hosting
● When only an IP address is available, the URLs to crawl look like http://X.Y.Z.W -----> lots of web sites hosted on X.Y.Z.W are missed
● Examples:
  ○ OVH (hosting company): 65,000 servers host 7,500,000
  ○ This survey: 670,000 hosts on 80,000 IP addresses
● You can't ignore it!
Host-IP cluster sampling
● What if a large list of hosts is available?
  ○ In fact, getting one is not trivial, as such a list should cover a certain web segment well
● Host random sampling can be applied (Shestakov et al., 2007)
  ○ Works but with limitations
  ○ Bottleneck: host aliasing, i.e., different hostnames lead to the same web site
    ■ Hard to solve: need to crawl all hosts in the list (their start web pages)
● Idea: resolve all hosts to their IPs
Host-IP cluster sampling
● Resolve all hosts in the list to their IP addresses
  ○ A set of host-IP pairs
● Cluster hosts (pairs) by IP
  ○ IP1: host11, host12, host13, ...
  ○ IP2: host21, host22, host23, ...
  ○ ...
  ○ IPN: hostN1, hostN2, hostN3, ...
● Generate random sample of IPs
● Analyze sampled IPs
  ○ E.g., if IP2 sampled then crawl host21, host22, host23, ...
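The clustering and sampling steps can be illustrated with a short sketch. It assumes the host-IP pairs have already been resolved (for instance with the standard library's `socket.gethostbyname`, which is not called here); the hostnames, addresses, and function names below are made up for illustration.

```python
import random
from collections import defaultdict


def cluster_by_ip(host_ip_pairs):
    """Group hostnames by the IP address they resolve to."""
    clusters = defaultdict(set)
    for host, ip in host_ip_pairs:
        clusters[ip].add(host)
    return dict(clusters)


def sample_clusters(clusters, n, seed=0):
    """Pick n IPs uniformly at random; every host on a chosen IP
    becomes part of the crawl seed."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n)
    return {ip: sorted(clusters[ip]) for ip in chosen}


# Hypothetical, already-resolved host-IP pairs.
pairs = [
    ("a.example", "192.0.2.1"),
    ("b.example", "192.0.2.1"),
    ("c.example", "192.0.2.2"),
    ("d.example", "192.0.2.3"),
]
clusters = cluster_by_ip(pairs)
seed_hosts = sample_clusters(clusters, 2)
for ip, hosts in seed_hosts.items():
    print(ip, "->", ", ".join(hosts))
```

The sampling unit is the IP, not the host, so all hosts sharing a sampled IP are crawled together.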
Host-IP cluster sampling
● Analyze sampled IPs
  ○ E.g., if IP2 sampled then crawl host21, host22, host23, ...
  ○ While crawling, 'unknown' hosts (not in the list) may be found
    ■ Crawl only those that resolve either to IP2 or to IPs that are not among the list's IPs (IP1, IP2, ..., IPN)
● Identify search interfaces
  ○ Filtering, machine learning, manual check
  ○ Out of the scope (see ref [14] in the paper)
● Apply sampling formulas (see Section 4.4 of the paper)
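The rule for 'unknown' hosts found during a crawl can be written as a single predicate. This is one possible reading of the slide, with a hypothetical function name and made-up addresses: a newly discovered host is crawled if it resolves to the IP currently being analyzed, or to an IP that no listed host resolves to.

```python
def should_crawl(resolved_ip, sampled_ip, listed_ips):
    """Decide whether a host that is not in the original list gets crawled.

    resolved_ip: IP the newly found host resolves to
    sampled_ip:  IP of the cluster currently being analyzed
    listed_ips:  all IPs that hosts in the original list resolve to
    """
    return resolved_ip == sampled_ip or resolved_ip not in listed_ips


# Hypothetical addresses from documentation-only ranges.
listed = {"192.0.2.1", "192.0.2.2"}
print(should_crawl("192.0.2.2", "192.0.2.2", listed))    # same cluster
print(should_crawl("192.0.2.1", "192.0.2.2", listed))    # belongs to another listed cluster
print(should_crawl("203.0.113.9", "192.0.2.2", listed))  # IP not covered by the list
```

Hosts that resolve to another listed IP are skipped, presumably because they belong to a different cluster that is either sampled on its own or not sampled at all.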
Results
● Dataset:
  ○ ~670 thousand hostnames
  ○ Obtained from Yandex: good coverage of Russian Web as of 2006
  ○ Resolved to ~80 thousand unique IP addresses
  ○ 77.2% of hosts shared their IPs with at least 20 other hosts <-- virtual hosting scale
● 1075 IPs sampled - 6237 hosts in initial crawl seed
  ○ Enough if satisfied with NUM+/-25% with 95% confidence
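To show how a "NUM +/- margin with 95% confidence" figure arises from sampled clusters, here is a sketch of the textbook estimator for a population total under one-stage cluster sampling (simple random sampling of clusters without replacement). It is not necessarily the formula set from Section 4.4 of the paper, and the per-IP counts are invented.

```python
import math
import statistics


def estimate_cluster_total(counts, num_clusters, z=1.96):
    """Estimate a population total from per-cluster counts.

    counts:       deep web sites found on each sampled IP (one value per IP)
    num_clusters: total number of IPs (clusters) in the population
    z:            normal quantile, 1.96 for roughly 95% confidence
    Returns (estimated total, margin of error).
    """
    n = len(counts)
    total = num_clusters * statistics.fmean(counts)
    s2 = statistics.variance(counts)   # sample variance across clusters
    fpc = 1 - n / num_clusters         # finite population correction
    se = num_clusters * math.sqrt(fpc * s2 / n)
    return total, z * se


# Invented counts for 8 sampled IPs out of a population of 80.
counts = [0, 1, 0, 3, 0, 0, 2, 0]
total, margin = estimate_cluster_total(counts, num_clusters=80)
print(f"estimate: {total:.0f} +/- {margin:.0f}")
```

The margin shrinks roughly with the square root of the number of sampled IPs, which is why the sample size determines the achievable precision.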
Results
Comparison: host-IP vs IP sampling
Conclusion: IP random sampling (used in previous deep web characterization studies) applied to the same dataset resulted in estimates that are 3.5 times smaller than actual numbers (obtained by host-IP)
Conclusion
● Proposed Host-IP clustering technique
  ○ Superior to IP random sampling
● Accurately characterized a national web segment
  ○ As of 09/2006, 14,200+/-3800 deep web sites in Russian Web
● Estimates obtained by Chang et al. (ref [9] in the paper) are underestimates
● Planning to apply Host-IP to other datasets
  ○ Main challenge is to obtain a large list of hosts that reliably covers a certain web segment
● Contact me if interested in Host-IP pairs datasets
Thank you!
Questions?