Sampling national deep Web

DEXA'11, Toulouse, France, 31.08.2011

Denis Shestakov, fname.lname at aalto.fi

Department of Media Technology, Aalto University

Outline

● Background
● Our approach: Host-IP cluster random sampling
● Results
● Conclusions

Background

● Deep Web: web content behind search interfaces
● (Example of a search interface shown on the slide)
● Main problem: hard to crawl, thus content is poorly indexed and not available for search (hidden)
● Many research problems: roughly 150-200 works addressing certain aspects of the challenge (e.g., see 'Search interfaces on the Web: querying and characterizing', Shestakov, 2008)
● "Clearly, the science and practice of deep web crawling is in its infancy" (in 'Web crawling', Olston & Najork, 2010)

Background

● What is still unknown (surprisingly):
  ○ How large is the deep Web: the number of deep web resources? The amount of content in them? What portion is indexed?
● So far only several studies have addressed this:
  ○ Bergman, 2001: number, amount of content
  ○ Chang et al., 2004: number, coverage
  ○ Shestakov et al., 2007: number
  ○ Chinese surveys: number
  ○ ....

Background

● None of the approaches used so far is satisfactory
● Basically, the idea behind estimating the number of deep web sites:
  ○ IP address random sampling method (proposed in 1997)
  ○ Description: take the pool of all IP addresses (~3 billion currently in use), generate a random sample (~one million is enough), connect to each sampled address and, if it serves HTTP, crawl it and look for search interfaces
  ○ Count the search interfaces found in the sample and apply sampling math to get an estimate
  ○ One can restrict the method to some segment of the Web (e.g., national): the pool then consists of national IP addresses only
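The scale-up step of IP random sampling can be illustrated with a minimal Python sketch. It is not the procedure used in any of the cited studies: the pool, the made-up "ground truth" set, and the function names `sample_ips` and `estimate_total` are assumptions for illustration, and the crawl itself is replaced by a lookup.

```python
import random


def sample_ips(pool, sample_size, seed=0):
    """Draw a simple random sample of IP addresses without replacement."""
    rng = random.Random(seed)
    return rng.sample(pool, sample_size)


def estimate_total(pool_size, sample_counts):
    """Scale the per-IP average observed in the sample up to the whole pool."""
    if not sample_counts:
        raise ValueError("sample is empty")
    return pool_size * sum(sample_counts) / len(sample_counts)


# Toy pool of 1,000 hypothetical addresses; 20 of them are assumed to
# serve a site with a search interface (made-up ground truth).
pool = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]
deep_web_ips = set(pool[::50])

sample = sample_ips(pool, 200)
# Stand-in for "connect, crawl, look for search interfaces".
counts = [1 if ip in deep_web_ips else 0 for ip in sample]
print(estimate_total(len(pool), counts))
```

With a real crawl, `counts` would hold the number of deep web sites found at each sampled address; under virtual hosting that number is undercounted because only http://X.Y.Z.W is visited.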

Virtual Hosting

● Bottleneck: virtual hosting
● When only an IP address is available, URLs for the crawl look like http://X.Y.Z.W, so lots of web sites hosted on X.Y.Z.W are missed
● Examples:
  ○ OVH (hosting company): 65,000 servers host 7,500,000 web sites
  ○ This survey: 670,000 hosts on 80,000 IP addresses
● You can't ignore it!
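As a rough illustration of the scale, the two examples can be turned into average hosts per address. This is a back-of-the-envelope sketch: treating OVH servers as a stand-in for IP addresses, and reading 7,500,000 as a count of hosted sites, are assumptions, and `hosts_per_ip` is a hypothetical helper.

```python
def hosts_per_ip(hosts, ips):
    """Average number of hosts served from one address."""
    return hosts / ips


ovh = hosts_per_ip(7_500_000, 65_000)    # OVH example from the slide
survey = hosts_per_ip(670_000, 80_000)   # dataset used in this survey
print(round(ovh, 1), round(survey, 1))
```

An IP-only crawl sees at most one site per address, so averages well above one mean that most hosted sites are never visited.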

Host-IP cluster sampling

● What if a large list of hosts is available?
  ○ In fact, getting one is not trivial, as such a list should cover a certain web segment well
● Host random sampling can be applied (Shestakov et al., 2007)
  ○ Works but with limitations
  ○ Bottleneck: host aliasing, i.e., different hostnames lead to the same web site
    ■ Hard to solve: need to crawl all hosts in the list (their start web pages)
● Idea: resolve all hosts to their IPs

● Resolve all hosts in the list to their IP addresses
  ○ Result: a set of host-IP pairs
● Cluster hosts (pairs) by IP
  ○ IP1: host11, host12, host13, ...
  ○ IP2: host21, host22, host23, ...
  ○ ...
  ○ IPN: hostN1, hostN2, hostN3, ...
● Generate a random sample of IPs
● Analyze sampled IPs
  ○ E.g., if IP2 is sampled, then crawl host21, host22, host23, ...
  ○ While crawling, 'unknown' hosts (not in the list) may be found
    ■ Crawl only those that resolve either to IP2 or to IPs that are not among the list's IPs (IP1, IP2, ..., IPN)
● Identify search interfaces
  ○ Filtering, machine learning, manual check
  ○ Out of scope here (see ref [14] in the paper)
● Apply sampling formulas (see Section 4.4 of the paper)
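The clustering, sampling, and 'unknown host' steps can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: DNS resolution and crawling are left out, the host-IP pairs are toy data, and the names `cluster_by_ip`, `sample_clusters`, and `should_crawl` are hypothetical.

```python
import random
from collections import defaultdict


def cluster_by_ip(pairs):
    """Group (host, ip) pairs into {ip: set of hosts}."""
    clusters = defaultdict(set)
    for host, ip in pairs:
        clusters[ip].add(host)
    return dict(clusters)


def sample_clusters(clusters, n, seed=0):
    """Draw n distinct IPs (clusters) uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(sorted(clusters), n)


def should_crawl(new_host_ip, sampled_ip, known_ips):
    """Rule for a host found during the crawl but absent from the list:
    crawl it if it resolves to the sampled IP or to an IP outside the list."""
    return new_host_ip == sampled_ip or new_host_ip not in known_ips


# Toy host-IP pairs, as if already resolved.
pairs = [
    ("host11", "IP1"), ("host12", "IP1"),
    ("host21", "IP2"), ("host22", "IP2"), ("host23", "IP2"),
    ("host31", "IP3"),
]
clusters = cluster_by_ip(pairs)
for ip in sample_clusters(clusters, 2):
    print(ip, sorted(clusters[ip]))  # hosts to crawl for this sampled IP
```

Skipping unknown hosts that resolve to another listed IP keeps each site attached to exactly one cluster, so it cannot be counted twice if that other IP is sampled as well.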

Results

● Dataset:
  ○ ~670 thousand hostnames
  ○ Obtained from Yandex: good coverage of the Russian Web as of 2006
  ○ Resolved to ~80 thousand unique IP addresses
  ○ 77.2% of hosts shared their IPs with at least 20 other hosts (this shows the scale of virtual hosting)
● 1075 IPs sampled, giving 6237 hosts in the initial crawl seed
  ○ Enough if one is satisfied with NUM+/-25% at 95% confidence
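How an estimate of the form NUM+/-margin at 95% confidence can arise from a sample of IP clusters is sketched below. The formulas of Section 4.4 are not reproduced here; this uses a textbook single-stage cluster sampling estimator with a finite population correction, which is an assumption, as are the function name `cluster_estimate` and the toy counts.

```python
import math
import statistics


def cluster_estimate(total_clusters, per_cluster_counts, z=1.96):
    """Estimate a population total from sampled clusters.

    total_clusters     -- number of IPs (clusters) in the whole list
    per_cluster_counts -- deep web sites found on each sampled IP
    z                  -- normal quantile (1.96 for about 95% confidence)
    Returns (estimated total, half-width of the confidence interval).
    """
    n = len(per_cluster_counts)
    if n < 2:
        raise ValueError("need at least two sampled clusters")
    mean = statistics.mean(per_cluster_counts)
    sd = statistics.stdev(per_cluster_counts)  # sample standard deviation
    fpc = 1 - n / total_clusters               # finite population correction
    total = total_clusters * mean
    half_width = z * total_clusters * sd / math.sqrt(n) * math.sqrt(fpc)
    return total, half_width


# Toy data: 10 sampled IPs out of 100, with made-up counts per IP.
total, margin = cluster_estimate(100, [0, 0, 1, 0, 2, 0, 0, 1, 0, 0])
print(f"{total:.0f} +/- {margin:.0f}")
```

The relative margin shrinks roughly with the square root of the number of sampled IPs, which is why a sample of about a thousand clusters can be enough for a +/-25% target.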

Results

Comparison: host-IP vs IP sampling

Conclusion: IP random sampling (used in previous deep web characterization studies) applied to the same dataset resulted in estimates that are 3.5 times smaller than actual numbers (obtained by host-IP)

Conclusion

● Proposed Host-IP clustering technique
  ○ Superior to IP random sampling
● Accurately characterized a national web segment
  ○ As of 09/2006, 14,200+/-3800 deep web sites in the Russian Web
● Estimates obtained by Chang et al. (ref [9] in the paper) are underestimates
● Planning to apply Host-IP to other datasets
  ○ Main challenge is to obtain a large list of hosts that reliably covers a certain web segment
● Contact me if interested in Host-IP pairs datasets

Thank you! Questions?
