Accessing the Deep Web
Bin He, IBM Almaden Research Center, San Jose, CA
Mitesh Patel, Microsoft Corporation
Zhen Zhang, Computer Science, University of Illinois at Urbana-Champaign
Kevin Chen-Chuan Chang, Computer Science, University of Illinois at Urbana-Champaign
COMMUNICATIONS OF THE ACM May 2007/Vol. 50, No. 5

Introduction
• The Web has been rapidly “deepened” by massive databases online
  ▫ current search engines do not reach most of the data on the Internet
• Surface Web
  ▫ linked static HTML pages
• A far more significant amount of information is believed to be “hidden” in the deep Web
  ▫ behind the query forms of searchable databases

Conceptual View of the Deep Web

Introduction (Cont.)
• This article reports a survey of the deep Web
  ▫ scale
  ▫ subject distribution
  ▫ search-engine coverage
  ▫ other access characteristics of online databases

Related Work
• BrightPlanet.com, 2000
  ▫ established interest in this area
  ▫ focused on only the scale aspect
  ▫ overlap analysis: 43,000–96,000 “deep Web sites”; an informal estimate of 7,500 terabytes of data, 500 times larger than the surface Web
  ▫ likely an underestimate: overlap analysis assumes that two search engines obtain their data randomly and independently, whereas their coverage of deep Web data is actually highly correlated
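Overlap analysis is a capture-recapture style estimate: if two engines each index a random, independent sample of a population, the population size is roughly |A|·|B| / |A ∩ B|. The following is a minimal sketch on a simulated population, not BrightPlanet's actual procedure (which these slides do not describe); the function name and all numbers are hypothetical. It illustrates why correlated coverage pushes the estimate down.

```python
import random

def overlap_estimate(a: set, b: set) -> float:
    """Capture-recapture style estimate: |A| * |B| / |A intersect B|."""
    common = len(a & b)
    if common == 0:
        raise ValueError("no overlap: estimate is undefined")
    return len(a) * len(b) / common

rng = random.Random(0)
population = list(range(10_000))  # hypothetical "true" set of sites

# Independent engines: each indexes its own random 20% of the population.
indep_a = set(rng.sample(population, 2_000))
indep_b = set(rng.sample(population, 2_000))

# Correlated engines: both draw only from the same "easy to find" 30%.
popular = population[:3_000]
corr_a = set(rng.sample(popular, 2_000))
corr_b = set(rng.sample(popular, 2_000))

print(round(overlap_estimate(indep_a, indep_b)))  # roughly the true size, 10,000
print(round(overlap_estimate(corr_a, corr_b)))    # roughly 3,000, far below the true size
```

When both engines favor the same sites, their overlap is inflated, so the quotient shrinks and the population is underestimated.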

Global Scale Estimation
• IP sampling approach
  ▫ randomly sampled 1,000,000 IPs
  ▫ from the entire space of 2,230,124,544 valid IP addresses
• For each IP
  ▫ downloaded HTML pages
  ▫ identified and analyzed the Web databases in this sample
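The sampling step can be sketched as rejection sampling over the 32-bit IPv4 space. This is only an illustration: the slides do not say how the survey defined a "valid" address, so the filter below (Python's `is_global` flag, with multicast excluded) is an assumption and will not reproduce the 2,230,124,544 figure exactly. The function name `sample_ips` is hypothetical.

```python
import ipaddress
import random

def sample_ips(n: int, seed: int = 0) -> list[ipaddress.IPv4Address]:
    """Draw n distinct IPv4 addresses uniformly by rejection sampling."""
    rng = random.Random(seed)
    chosen: set[int] = set()
    while len(chosen) < n:
        candidate = ipaddress.IPv4Address(rng.getrandbits(32))
        # Assumption: "valid" is approximated by is_global, which excludes
        # private, loopback, link-local and similar special-purpose ranges;
        # multicast addresses are excluded explicitly.
        if candidate.is_global and not candidate.is_multicast:
            chosen.add(int(candidate))
    return [ipaddress.IPv4Address(i) for i in sorted(chosen)]

sample = sample_ips(1_000)
print(len(sample), sample[0], sample[-1])
```

Each sampled address would then be probed for a Web server and its pages downloaded, which is outside the scope of this sketch.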

Site, Databases, and Interface
• We distinguish three related notions for accessing the deep Web
  ▫ site, database, and interface
• A deep Web site
  ▫ a Web server that provides information maintained in one or more back-end Web databases
• Each database is searchable through one or more HTML forms as its query interfaces

Site, Databases, and Interface (Cont.)
• Counting order: first find the number of query interfaces for each Web site, then the number of Web databases, and finally the number of deep Web sites
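The site / database / interface hierarchy and the counting order can be sketched as a small data model. The class names, fields, and example hosts below are hypothetical and only illustrate the relationships described on these slides.

```python
from dataclasses import dataclass, field

@dataclass
class Database:
    name: str
    interfaces: list[str] = field(default_factory=list)  # URLs of query forms

@dataclass
class Site:
    host: str
    databases: list[Database] = field(default_factory=list)

    def is_deep(self) -> bool:
        # A site counts as a deep Web site if it has at least one query interface.
        return any(db.interfaces for db in self.databases)

def count_all(sites: list[Site]) -> tuple[int, int, int]:
    """Return (deep Web sites, Web databases, query interfaces)."""
    deep = [s for s in sites if s.is_deep()]
    databases = [db for s in deep for db in s.databases if db.interfaces]
    interfaces = sum(len(db.interfaces) for db in databases)
    return len(deep), len(databases), interfaces

sites = [
    Site("books.example", [Database("catalog", ["/search", "/advanced"])]),
    Site("static.example"),  # no query interface: a surface Web site
    Site("cars.example", [Database("new", ["/new"]), Database("used", ["/used"])]),
]
print(count_all(sites))  # (2, 3, 4)
```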

Query Interface
• exclude non-query HTML forms (which do not access back-end databases) from query interfaces
• exclude forms for login, subscription, registration, polling, and message posting
• exclude “site search”
  ▫ many Web sites now provide it for searching the HTML pages on their own sites
• remove duplicates

Web Database
• based on the discovered query interfaces
• compute the number of Web databases by finding the sets of query interfaces (within a site) that refer to the same database
• if the objects from one interface can always be found through the other one
  ▫ the two interfaces are searching the same database
  ▫ tested with five randomly chosen objects
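The same-database test can be sketched as follows: sample five objects from one interface, probe the other, and group interfaces that pass the test. This is a sketch under stated assumptions. In the survey the probing meant issuing real queries; here set membership stands in for "found through the other interface", and the function names and example data are hypothetical.

```python
import random
from typing import Callable, Hashable, Sequence

def same_database(objects_a: Sequence[Hashable],
                  found_in_b: Callable[[Hashable], bool],
                  k: int = 5, seed: int = 0) -> bool:
    """Sample up to k objects returned by interface A and probe interface B."""
    rng = random.Random(seed)
    probes = rng.sample(list(objects_a), min(k, len(objects_a)))
    return bool(probes) and all(found_in_b(obj) for obj in probes)

def count_databases(results: dict[str, set]) -> int:
    """Group a site's interfaces with union-find; each group is one database."""
    parent = {name: name for name in results}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    names = list(results)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            # Stand-in for real queries: membership in B's result set
            # plays the role of "the object can be found through B".
            if same_database(sorted(results[a]), results[b].__contains__):
                parent[find(a)] = find(b)
    return len({find(n) for n in names})

site = {
    "simple_search": {"b1", "b2", "b3", "b4", "b5", "b6"},
    "advanced_search": {"b1", "b2", "b3", "b4", "b5", "b6"},
    "job_search": {"j1", "j2", "j3", "j4", "j5"},
}
print(count_databases(site))  # 2
```

With only five probes the test is a heuristic: two interfaces over different but overlapping data could occasionally be merged by chance.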

Deep Web Site
• recognizing a deep Web site is rather simple
• a Web site is a deep Web site if it has at least one query interface

(Q1) Where to find “entrances” to databases?
• To access a Web database, we must first find its entrances: the query interfaces
• Depth of a query interface
  ▫ the minimum number of hops from the root page of the site to the interface page
• Because this question requires deep crawling, only 1/10 of the total IP samples were analyzed
  ▫ 100,000 IPs
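Since depth is defined as the minimum number of hops from the root page, it can be computed with a breadth-first search over a site's link graph. The sketch below assumes the link graph has already been crawled into a dictionary. The function name, the example pages, and the default crawl limit of 10 are illustrative assumptions.

```python
from collections import deque

def interface_depths(links: dict[str, list[str]], root: str,
                     interface_pages: set[str], max_depth: int = 10) -> dict[str, int]:
    """Breadth-first search from the root; BFS reaches each page by a shortest path."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue  # do not expand pages at the crawl limit
        for nxt in links.get(page, []):
            if nxt not in depth:
                depth[nxt] = depth[page] + 1
                queue.append(nxt)
    return {p: d for p, d in depth.items() if p in interface_pages}

# Hypothetical site graph: page -> pages it links to.
links = {
    "/": ["/about", "/catalog"],
    "/catalog": ["/catalog/search"],
    "/about": ["/catalog/search", "/contact"],
    "/catalog/search": ["/catalog/advanced"],
}
print(interface_depths(links, "/", {"/catalog/search", "/catalog/advanced"}))
# {'/catalog/search': 2, '/catalog/advanced': 3}
```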

Results of Q1
• Found 281 Web servers
• Exhaustively crawling these servers to depth 10, we found that 24 of them are deep Web sites
  ▫ they contained a total of 129 query interfaces representing 34 Web databases
• Query interfaces tend to be located shallowly in their sites
  ▫ none of the 129 query interfaces had a depth greater than 5
  ▫ 72% (93 out of 129) of the interfaces were found within depth 3

Depth of Web Database
• since a Web database may be accessed through multiple interfaces
  ▫ its depth is measured as the minimum depth of all its interfaces
  ▫ 94% (32 out of 34) of the Web databases appeared within depth 3
  ▫ 91.6% (22 out of 24) of the deep Web sites had their databases within depth 3
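The database-depth definition and the reported fractions can be checked with a few lines of Python. The helper names and the example depth lists are hypothetical. Only the counts 32/34 and 22/24 come from the slide.

```python
def database_depth(interface_depths: list[int]) -> int:
    """Depth of a Web database: the shallowest of its query interfaces."""
    return min(interface_depths)

def coverage_within(depths: list[int], limit: int = 3) -> float:
    """Fraction of items whose depth is at most `limit`."""
    return sum(d <= limit for d in depths) / len(depths)

# Hypothetical database reachable through three interfaces.
print(database_depth([4, 2, 3]))           # 2
print(coverage_within([1, 2, 2, 3, 5]))    # 0.8

# Reported counts from the slide, expressed as percentages.
print(f"{32 / 34:.1%}")  # 94.1%
print(f"{22 / 24:.1%}")  # 91.7%
```

Note that 22/24 is 91.67%, which the slide reports truncated as 91.6%.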

(Q2) What is the scale of the deep Web?
• Tested and analyzed all of the 1,000,000 IP samples to estimate the scale of the deep Web
• High depth-three coverage
  ▫ almost all Web databases can be identified within depth 3
• Crawled to depth 3 for these one million IPs

Results of Q2
• 2,256 Web servers
  ▫ 126 deep Web sites
  ▫ 406 query interfaces
  ▫ 190 Web databases
• s = 1,000,000 unique IP samples
• the entire IP space of t = 2,230,124,544 IPs
• Totals extrapolated from the s samples to the entire space t (values not preserved in this transcript)
  ▫ number of deep Web sites
  ▫ number of Web databases
  ▫ number of query interfaces
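The estimation formulas on this slide did not survive in the transcript. A natural reading is proportional extrapolation, count · t / s. The sketch below also divides by the depth-3 coverage measured in Q1, which is an assumption on my part rather than something the slide states. It is included because the crawl stopped at depth 3, and because the adjusted database total lands near 450,000, which matches the sum of the totals on the Q3 results slide (102,000 + 348,000).

```python
S = 1_000_000        # sampled IPs (s)
T = 2_230_124_544    # size of the IP space (t)

def extrapolate(count: int, coverage: float = 1.0) -> float:
    """Scale a sample count to the whole IP space.

    `coverage` is the assumed fraction of items that a depth-3 crawl finds;
    dividing by it compensates for items located deeper than depth 3.
    """
    return count * T / S / coverage

found = {"deep Web sites": 126, "Web databases": 190, "query interfaces": 406}
# Depth-3 coverage taken from the Q1 results (assumed to carry over to Q2).
depth3 = {"deep Web sites": 22 / 24, "Web databases": 32 / 34,
          "query interfaces": 93 / 129}

for name, count in found.items():
    raw = extrapolate(count)
    adjusted = extrapolate(count, depth3[name])
    print(f"{name}: raw {raw:,.0f}, depth-adjusted {adjusted:,.0f}")
```

Without the adjustment, the 190 sampled databases extrapolate to about 424,000, so the unadjusted formula alone does not account for the Q3 totals.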

(Q3) How “structured” is the deep Web?
• Classified Web databases into two types
  ▫ unstructured databases provide data objects as unstructured media (text, images, audio, and video)
  ▫ structured databases provide data objects as structured “relational” records with attribute-value pairs

Results of Q3
• Manual querying and inspection of the 190 sampled Web databases
  ▫ found 43 unstructured and 147 structured
  ▫ similarly estimated their total numbers to be 102,000 and 348,000
• The deep Web features mostly structured data sources
  ▫ ratio of structured to unstructured: 3.4:1
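The 3.4:1 ratio follows directly from the sample counts, and the estimated totals keep the same proportion. A short arithmetic check, using only numbers from the slide:

```python
unstructured, structured = 43, 147
total = unstructured + structured  # 190 sampled databases

print(f"structured share: {structured / total:.1%}")  # 77.4%
print(f"ratio: {structured / unstructured:.1f}:1")    # 3.4:1

# The slide's estimated totals keep the same proportion.
print(f"{348_000 / 102_000:.1f}:1")                   # 3.4:1
```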

(Q4) What is the subject distribution of Web databases?
• used the top-level categories of the yahoo.com directory as the taxonomy
• manually categorized the 190 sampled Web databases
• the distribution indicates great subject diversity among Web databases
  ▫ non-commerce categories account for 51% (97 out of 190 databases)

(Q5) How do search engines cover the deep Web?
• Randomly selected 20 Web databases from the 190
• For each database, first manually sampled five objects (result pages) as test data

Coverage of Search Engines
• figure annotations: some engines index almost the same objects; one engine's coverage is entirely a subset of Yahoo's
• this contrasts with the surface Web
  ▫ there, overlap between engines is low, so combining them greatly improves coverage
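Coverage and overlap of this kind reduce to set operations on the sampled test objects. The sketch below uses made-up engines and object IDs purely for illustration. None of the numbers are survey results.

```python
def coverage(indexed: set, test_objects: set) -> float:
    """Fraction of the sampled test objects that an engine has indexed."""
    return len(indexed & test_objects) / len(test_objects)

# Hypothetical data: 10 test objects and the subsets three engines index.
test_objects = set(range(10))
engines = {
    "A": {0, 1, 2, 3},
    "B": {0, 1, 2, 4},  # almost the same objects as A
    "C": {0, 1},        # entirely a subset of B
}

for name, indexed in engines.items():
    print(name, coverage(indexed, test_objects))

combined = set().union(*engines.values())
print("combined", coverage(combined, test_objects))    # combined 0.5
print("C subset of B:", engines["C"] <= engines["B"])  # C subset of B: True
```

When engines index nearly the same objects, the combined coverage (0.5 here) is barely higher than the best single engine (0.4), which is the pattern the slide contrasts with the surface Web.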

(Q6) What is the coverage of deep Web directories?
• services providing deep Web directories
  ▫ classify Web databases into taxonomies
• recorded the number of databases each directory claimed to have indexed
• low coverage
  ▫ directory-based indexing services rely on manual classification of Web databases
  ▫ which can hardly scale for the deep Web

Conclusion

Conclusion (Cont.)
• Poor coverage of both the deep Web's data and its databases
  ▫ access to the deep Web is not adequately supported
• Vs. the surface Web
  ▫ same: large, fast-growing, and diverse
  ▫ different: more diversely distributed, mostly structured, and subject to an inherent limitation of crawling
• Crawl-and-index techniques
  ▫ face “limit of coverage” and “structural heterogeneity”

Future Work
• Database-centered, discover-and-forward access model
• Automatically discover databases on the Web by crawling and indexing their query interfaces
• When a user issues a query, forward the user to the actual search of data
  ▫ use the databases' data-specific interfaces
  ▫ fully leverage their structures
• Recent projects
  ▫ MetaQuerier and WISE-Integrator