Sampling national deep Web


Talk given at DEXA 2011 in Toulouse, France. Full text paper is available at http://goo.gl/oCWPkN

Transcript of Sampling national deep Web

Page 1: Sampling national deep Web

DEXA'11, Toulouse, France, 31.08.2011

Sampling National Deep Web
Denis Shestakov, fname.lname at aalto.fi

Department of Media Technology, Aalto University

Page 2: Sampling national deep Web

Outline

● Background
● Our approach: Host-IP cluster random sampling
● Results
● Conclusions

Page 3: Sampling national deep Web

Background

● Deep Web: web content behind search interfaces

● See the example of a search interface shown on the slide
● Main problem: hard to crawl, thus content is poorly indexed and not available for search (hidden)

● Many research problems: roughly 150-200 works addressing certain aspects of the challenge (e.g., see 'Search interfaces on the Web: querying and characterizing', Shestakov, 2008)
● "Clearly, the science and practice of deep web crawling is in its infancy" (in 'Web crawling', Olston & Najork, 2010)

Page 4: Sampling national deep Web

Background

● What is still unknown (surprisingly):
  ○ How large is the deep Web: the number of deep web resources? the amount of content in them? what portion of it is indexed?
● So far only a few studies have addressed this:
  ○ Bergman, 2001: number, amount of content
  ○ Chang et al., 2004: number, coverage
  ○ Shestakov et al., 2007: number
  ○ Chinese surveys: number
  ○ ...

Page 5: Sampling national deep Web

Background

● All approaches used so far have significant shortcomings
● Basically, the idea behind estimating the number of deep web sites:
  ○ IP address random sampling method (proposed in 1997)
  ○ Description: take the pool of all IP addresses (~3 billion currently in use), generate a random sample (~one million is enough), connect to each address, and if it serves HTTP, crawl it and search for search interfaces
  ○ Obtain the number of search interfaces in the sample and apply sampling math to get an estimate
  ○ One can restrict this to some segment of the Web (e.g., national): the pool then consists of national IP addresses only (a minimal sketch of the procedure follows)
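For illustration only, a minimal Python sketch of this IP random sampling procedure; the national CIDR blocks, the sample size, and the helper names are hypothetical placeholders, not the setup used in any of the cited studies:

```python
# A minimal sketch of IP address random sampling, NOT the actual survey code.
# The national CIDR blocks and the sample size are hypothetical placeholders.
import ipaddress
import random
import urllib.request

NATIONAL_RANGES = ["198.51.100.0/24", "203.0.113.0/24"]  # placeholder national IP blocks

def random_national_ip(cidr_blocks):
    """Pick a uniformly random IP address from the given national CIDR blocks."""
    networks = [ipaddress.ip_network(block) for block in cidr_blocks]
    # Weight blocks by their size so every individual address is equally likely.
    weights = [net.num_addresses for net in networks]
    net = random.choices(networks, weights=weights, k=1)[0]
    return str(net[random.randrange(net.num_addresses)])

def fetch_start_page(ip, timeout=5):
    """Return the start page if the address serves HTTP, otherwise None."""
    try:
        with urllib.request.urlopen(f"http://{ip}/", timeout=timeout) as resp:
            return resp.read()
    except OSError:
        return None

sample = [random_national_ip(NATIONAL_RANGES) for _ in range(1000)]
responding = {ip: page for ip in sample if (page := fetch_start_page(ip)) is not None}
# Sites behind the responding IPs would then be crawled and checked for search
# interfaces, and sampling math applied to the count of interfaces found.
```

Virtual hosting (next slide) is exactly what such an IP-only probe misses: a request to http://X.Y.Z.W reaches only the default site configured on that address.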

Page 6: Sampling national deep Web

Virtual Hosting

● Bottleneck: virtual hosting
● When only an IP is available, the URLs for the crawl look like http://X.Y.Z.W, so lots of web sites hosted on X.Y.Z.W are missed
● Examples:
  ○ OVH (hosting company): 65,000 servers host 7,500,000 web sites
  ○ This survey: 670,000 hosts on 80,000 IP addresses
● You can't ignore it!

Page 7: Sampling national deep Web

Host-IP cluster sampling

● What if a large list of hosts is available?
  ○ In fact, not very trivial to get one, as such a list should cover a certain web segment well
● Host random sampling can be applied (Shestakov et al., 2007)
  ○ Works, but with limitations
  ○ Bottleneck: host aliasing, i.e., different hostnames lead to the same web site
    ■ Hard to solve: need to crawl all hosts in the list (their start web pages)
● Idea: resolve all hosts to their IPs

Page 8: Sampling national deep Web

Host-IP cluster sampling

● Resolve all hosts in the list to their IP addresses
  ○ Result: a set of host-IP pairs
● Cluster hosts (pairs) by IP
  ○ IP1: host11, host12, host13, ...
  ○ IP2: host21, host22, host23, ...
  ○ ...
  ○ IPN: hostN1, hostN2, hostN3, ...
● Generate a random sample of IPs
● Analyze the sampled IPs
  ○ E.g., if IP2 is sampled, then crawl host21, host22, host23, ... (see the sketch below)
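A minimal sketch of the clustering and sampling steps, assuming Python and one plain DNS lookup per hostname; the host list, sample size, and function names are placeholders, not the study's actual pipeline:

```python
# A minimal sketch of the Host-IP clustering and IP sampling steps (illustrative only).
import random
import socket
from collections import defaultdict

def cluster_hosts_by_ip(hosts):
    """Resolve each hostname to an IP address and group hosts sharing the same IP."""
    clusters = defaultdict(list)
    for host in hosts:
        try:
            ip = socket.gethostbyname(host)
        except socket.gaierror:
            continue  # skip hostnames that no longer resolve
        clusters[ip].append(host)
    return clusters

def sample_ip_clusters(clusters, n_ips):
    """Draw a simple random sample of IPs; a sampled IP brings along all of its hosts."""
    sampled = random.sample(list(clusters), n_ips)
    return {ip: clusters[ip] for ip in sampled}

# Hypothetical usage: 'host_list' stands in for the ~670,000 hostnames of the study.
host_list = ["example.com", "www.example.org", "example.net"]
clusters = cluster_hosts_by_ip(host_list)
crawl_seed = sample_ip_clusters(clusters, n_ips=min(2, len(clusters)))
# The hosts in 'crawl_seed' form the initial crawl seed for the sampled IPs.
```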

Page 9: Sampling national deep Web

Host-IP cluster sampling

● Analyze the sampled IPs
  ○ E.g., if IP2 is sampled, then crawl host21, host22, host23, ...
  ○ While crawling, 'unknown' hosts (not in the list) may be found
    ■ Crawl only those that resolve either to IP2 or to IPs that are not in the list's set of IPs (IP1, IP2, ..., IPN)
● Identify search interfaces
  ○ Filtering, machine learning, manual check
  ○ Out of the scope of this talk (see ref [14] in the paper)
● Apply sampling formulas (see Section 4.4 of the paper; an illustrative estimator is sketched below)
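As a rough illustration only (the exact formulas the talk refers to are in Section 4.4 of the paper), a standard one-stage cluster-sampling estimator of a population total, with made-up per-IP counts as input:

```python
# An illustrative one-stage cluster-sampling estimate of a total; the paper's own
# formulas (Section 4.4) may differ in details. All input numbers below are made up.
import math

def estimate_total(found_per_ip, total_ips, z=1.96):
    """Estimate the total number of deep web sites and a confidence margin.

    found_per_ip -- search interfaces found in each sampled IP cluster
    total_ips    -- number of distinct IPs in the whole host-IP list
    z            -- normal quantile for the confidence level (1.96 ~ 95%)
    """
    n = len(found_per_ip)
    mean = sum(found_per_ip) / n
    total = total_ips * mean  # expand the per-cluster mean to the whole IP population
    # Sample variance of per-cluster counts, with a finite population correction.
    variance = sum((y - mean) ** 2 for y in found_per_ip) / (n - 1)
    margin = z * total_ips * math.sqrt((1 - n / total_ips) * variance / n)
    return total, margin

# Made-up per-IP counts, only to show the shape of the computation:
total, margin = estimate_total(found_per_ip=[0, 1, 0, 2, 0, 0, 1, 0], total_ips=80000)
print(f"estimated deep web sites: {total:.0f} +/- {margin:.0f}")
```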

Page 10: Sampling national deep Web

Results

● Dataset:
  ○ ~670 thousand hostnames
  ○ Obtained from Yandex: good coverage of the Russian Web as of 2006
  ○ Resolved to ~80 thousand unique IP addresses
  ○ 77.2% of hosts shared their IPs with at least 20 other hosts <-- the scale of virtual hosting
● 1075 IPs sampled - 6237 hosts in the initial crawl seed
  ○ Enough if we are satisfied with NUM+/-25% at 95% confidence

Page 11: Sampling national deep Web

Results

Page 12: Sampling national deep Web

Comparison: host-IP vs IP sampling

Conclusion: IP random sampling (used in previous deep web characterization studies), applied to the same dataset, produced estimates 3.5 times smaller than the actual numbers (obtained by Host-IP sampling)

Page 13: Sampling national deep Web

Conclusion

● Proposed the Host-IP clustering technique
  ○ Superior to IP random sampling
● Accurately characterized a national web segment
  ○ As of 09/2006, 14,200+/-3,800 deep web sites in the Russian Web
● Estimates obtained by Chang et al. (ref [9] in the paper) are underestimates
● Planning to apply Host-IP to other datasets
  ○ Main challenge is to obtain a large list of hosts that reliably covers a certain web segment
● Contact me if interested in the Host-IP pairs datasets

Page 14: Sampling national deep Web

Thank you! Questions?