Sampling national deep Web


Talk given at DEXA 2011 in Toulouse, France. Full text paper is available at http://goo.gl/oCWPkN

Transcript of Sampling national deep Web

Page 1: Sampling national deep Web

DEXA'11, Toulouse, France, 31.08.2011

Sampling National Deep Web
Denis Shestakov, fname.lname at aalto.fi

Department of Media Technology, Aalto University

Page 2: Sampling national deep Web

Outline

● Background
● Our approach: Host-IP cluster random sampling
● Results
● Conclusions

Page 3: Sampling national deep Web

Background

● Deep Web: web content behind search interfaces

● See the example of a search interface shown on the slide
● Main problem: hard to crawl, thus content is poorly indexed and not available for search (hidden)

● Many research problems: roughly 150-200 works addressing certain aspects of the challenge (e.g., see 'Search interfaces on the Web: querying and characterizing', Shestakov, 2008)
● "Clearly, the science and practice of deep web crawling is in its infancy" (in 'Web crawling', Olston & Najork, 2010)

Page 4: Sampling national deep Web

Background

● What is still unknown (surprisingly):
  ○ How large is the deep Web: the number of deep web resources? the amount of content in them? what portion of it is indexed?
● So far only a few studies have addressed this:
  ○ Bergman, 2001: number, amount of content
  ○ Chang et al., 2004: number, coverage
  ○ Shestakov et al., 2007: number
  ○ Chinese surveys: number
  ○ ...

Page 5: Sampling national deep Web

Background

● All approaches used so far have significant shortcomings
● Basically, the idea behind estimating the number of deep web sites:
  ○ IP address random sampling method (proposed in 1997)
  ○ Description: take the pool of all IP addresses (~3 billion currently in use), generate a random sample (~one million is enough), connect to each address, and if it serves HTTP, crawl it and search for search interfaces
  ○ Obtain the number of search interfaces in the sample and apply sampling math to get an estimate
  ○ One can restrict this to some segment of the Web (e.g., national): the pool then consists of national IP addresses only (a minimal sketch of the procedure follows)
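For illustration only, a minimal Python sketch of this IP random sampling procedure; the national CIDR blocks, the sample size, and the helper names are hypothetical placeholders, not the setup used in any of the cited studies:

```python
# A minimal sketch of IP address random sampling, NOT the actual survey code.
# The national CIDR blocks and the sample size are hypothetical placeholders.
import ipaddress
import random
import urllib.request

NATIONAL_RANGES = ["198.51.100.0/24", "203.0.113.0/24"]  # placeholder national IP blocks

def random_national_ip(cidr_blocks):
    """Pick a uniformly random IP address from the given national CIDR blocks."""
    networks = [ipaddress.ip_network(block) for block in cidr_blocks]
    # Weight blocks by their size so every individual address is equally likely.
    weights = [net.num_addresses for net in networks]
    net = random.choices(networks, weights=weights, k=1)[0]
    return str(net[random.randrange(net.num_addresses)])

def fetch_start_page(ip, timeout=5):
    """Return the start page if the address serves HTTP, otherwise None."""
    try:
        with urllib.request.urlopen(f"http://{ip}/", timeout=timeout) as resp:
            return resp.read()
    except OSError:
        return None

sample = [random_national_ip(NATIONAL_RANGES) for _ in range(1000)]
responding = {ip: page for ip in sample if (page := fetch_start_page(ip)) is not None}
# Sites behind the responding IPs would then be crawled and checked for search
# interfaces, and sampling math applied to the count of interfaces found.
```

Virtual hosting (next slide) is exactly what such an IP-only probe misses: a request to http://X.Y.Z.W reaches only the default site configured on that address.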

Page 6: Sampling national deep Web

Virtual Hosting

● Bottleneck: virtual hosting
● When only an IP is available, the URLs for the crawl look like http://X.Y.Z.W, so lots of web sites hosted on X.Y.Z.W are missed
● Examples:
  ○ OVH (hosting company): 65,000 servers host 7,500,000 web sites
  ○ This survey: 670,000 hosts on 80,000 IP addresses
● You can't ignore it!

Page 7: Sampling national deep Web

Host-IP cluster sampling

● What if a large list of hosts is available?
  ○ In fact, not very trivial to get one, as such a list should cover a certain web segment well
● Host random sampling can be applied (Shestakov et al., 2007)
  ○ Works, but with limitations
  ○ Bottleneck: host aliasing, i.e., different hostnames lead to the same web site
    ■ Hard to solve: need to crawl all hosts in the list (their start web pages)
● Idea: resolve all hosts to their IPs

Page 8: Sampling national deep Web

Host-IP cluster sampling

● Resolve all hosts in the list to their IP addresses
  ○ Result: a set of host-IP pairs
● Cluster hosts (pairs) by IP
  ○ IP1: host11, host12, host13, ...
  ○ IP2: host21, host22, host23, ...
  ○ ...
  ○ IPN: hostN1, hostN2, hostN3, ...
● Generate a random sample of IPs
● Analyze the sampled IPs
  ○ E.g., if IP2 is sampled, then crawl host21, host22, host23, ... (see the sketch below)
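A minimal sketch of the clustering and sampling steps, assuming Python and one plain DNS lookup per hostname; the host list, sample size, and function names are placeholders, not the study's actual pipeline:

```python
# A minimal sketch of the Host-IP clustering and IP sampling steps (illustrative only).
import random
import socket
from collections import defaultdict

def cluster_hosts_by_ip(hosts):
    """Resolve each hostname to an IP address and group hosts sharing the same IP."""
    clusters = defaultdict(list)
    for host in hosts:
        try:
            ip = socket.gethostbyname(host)
        except socket.gaierror:
            continue  # skip hostnames that no longer resolve
        clusters[ip].append(host)
    return clusters

def sample_ip_clusters(clusters, n_ips):
    """Draw a simple random sample of IPs; a sampled IP brings along all of its hosts."""
    sampled = random.sample(list(clusters), n_ips)
    return {ip: clusters[ip] for ip in sampled}

# Hypothetical usage: 'host_list' stands in for the ~670,000 hostnames of the study.
host_list = ["example.com", "www.example.org", "example.net"]
clusters = cluster_hosts_by_ip(host_list)
crawl_seed = sample_ip_clusters(clusters, n_ips=min(2, len(clusters)))
# The hosts in 'crawl_seed' form the initial crawl seed for the sampled IPs.
```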

Page 9: Sampling national deep Web

Host-IP cluster sampling

● Analyze the sampled IPs
  ○ E.g., if IP2 is sampled, then crawl host21, host22, host23, ...
  ○ While crawling, 'unknown' hosts (not in the list) may be found
    ■ Crawl only those that resolve either to IP2 or to IPs that are not in the list's set of IPs (IP1, IP2, ..., IPN)
● Identify search interfaces
  ○ Filtering, machine learning, manual check
  ○ Out of the scope of this talk (see ref [14] in the paper)
● Apply sampling formulas (see Section 4.4 of the paper; an illustrative estimator is sketched below)
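As a rough illustration only (the exact formulas the talk refers to are in Section 4.4 of the paper), a standard one-stage cluster-sampling estimator of a population total, with made-up per-IP counts as input:

```python
# An illustrative one-stage cluster-sampling estimate of a total; the paper's own
# formulas (Section 4.4) may differ in details. All input numbers below are made up.
import math

def estimate_total(found_per_ip, total_ips, z=1.96):
    """Estimate the total number of deep web sites and a confidence margin.

    found_per_ip -- search interfaces found in each sampled IP cluster
    total_ips    -- number of distinct IPs in the whole host-IP list
    z            -- normal quantile for the confidence level (1.96 ~ 95%)
    """
    n = len(found_per_ip)
    mean = sum(found_per_ip) / n
    total = total_ips * mean  # expand the per-cluster mean to the whole IP population
    # Sample variance of per-cluster counts, with a finite population correction.
    variance = sum((y - mean) ** 2 for y in found_per_ip) / (n - 1)
    margin = z * total_ips * math.sqrt((1 - n / total_ips) * variance / n)
    return total, margin

# Made-up per-IP counts, only to show the shape of the computation:
total, margin = estimate_total(found_per_ip=[0, 1, 0, 2, 0, 0, 1, 0], total_ips=80000)
print(f"estimated deep web sites: {total:.0f} +/- {margin:.0f}")
```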

Page 10: Sampling national deep Web

Results

● Dataset:
  ○ ~670 thousand hostnames
  ○ Obtained from Yandex: good coverage of the Russian Web as of 2006
  ○ Resolved to ~80 thousand unique IP addresses
  ○ 77.2% of hosts shared their IPs with at least 20 other hosts <-- the scale of virtual hosting
● 1075 IPs sampled - 6237 hosts in the initial crawl seed
  ○ Enough if we are satisfied with NUM+/-25% at 95% confidence

Page 11: Sampling national deep Web

Results

Page 12: Sampling national deep Web

Comparison: host-IP vs IP sampling

Conclusion: IP random sampling (used in previous deep web characterization studies), applied to the same dataset, produced estimates 3.5 times smaller than the actual numbers (obtained by Host-IP sampling)

Page 13: Sampling national deep Web

Conclusion

● Proposed the Host-IP clustering technique
  ○ Superior to IP random sampling
● Accurately characterized a national web segment
  ○ As of 09/2006, 14,200+/-3,800 deep web sites in the Russian Web
● Estimates obtained by Chang et al. (ref [9] in the paper) are underestimates
● Planning to apply Host-IP to other datasets
  ○ Main challenge is to obtain a large list of hosts that reliably covers a certain web segment
● Contact me if interested in the Host-IP pairs datasets

Page 14: Sampling national deep Web

Thank you! Questions?