Downloading Textual Hidden-Web Content Through Keyword Queries
Alexandros Ntoulas, Petros Zerfos, Junghoo Cho
University of California, Los Angeles, Computer Science Department
{ntoulas, pzerfos, cho}@cs.ucla.edu
JCDL, June 8th 2005
April 10, 2023
Motivation
I would like to buy a used ’98 Ford Taurus
Technical specs ?
Reviews ?
Classifieds ?
Vehicle history ?
Google ?
Why can’t we use a search engine ?
Search engines today employ crawlers that find pages by following links
Many useful pages are available only after issuing queries (e.g. Classifieds, USPTO, PubMed, LoC, …)
Search engines cannot reach such pages: there are no links to them (Hidden-Web)
In this talk: how can we download Hidden-Web content?
Outline
Interacting with Hidden-Web sites
Algorithms for selecting queries for the Hidden-Web sites
Experimental evaluation of our algorithms
Interacting with Hidden-Web pages (1)
1. The user issues a query through a query interface
Example query: liver
Interacting with Hidden-Web pages (2)
1. The user issues a query through a query interface
2. A result list is presented to the user
Result List Page
Interacting with Hidden-Web pages (3)
1. The user issues a query through a query interface
2. A result list is presented to the user
3. The user selects and views the “interesting” results
Querying a Hidden-Web site
Procedure
while ( there are available resources ) do
(1) select a query to send to the site
(2) send query and acquire result list
(3) download the pages
done
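The procedure above can be sketched in Python as follows. This is only an illustration: the function names (`crawl_hidden_web`, `pick_unused`), the toy index, and the choice to measure "available resources" as a maximum number of queries are assumptions, not part of the talk.

```python
from typing import Callable, Iterable, Optional, Set


def crawl_hidden_web(
    select_query: Callable[[Set[str]], Optional[str]],
    issue_query: Callable[[str], Iterable[str]],
    download: Callable[[str], None],
    max_queries: int,
) -> Set[str]:
    """Run the select / issue / download loop until the budget is spent."""
    issued: Set[str] = set()
    downloaded: Set[str] = set()
    for _ in range(max_queries):              # assumed budget: number of queries
        query = select_query(issued)          # (1) select a query
        if query is None:                     # no candidate query left
            break
        issued.add(query)
        for page_id in issue_query(query):    # (2) send query, read result list
            if page_id not in downloaded:     # (3) download pages not seen yet
                download(page_id)
                downloaded.add(page_id)
    return downloaded


# Hypothetical stand-in for a site: keyword -> ids of matching pages.
toy_index = {"liver": {"d1", "d2"}, "heart": {"d2", "d3"}}


def pick_unused(issued: Set[str]) -> Optional[str]:
    """Trivial policy: next unused keyword in alphabetical order."""
    remaining = sorted(set(toy_index) - issued)
    return remaining[0] if remaining else None


pages = crawl_hidden_web(pick_unused, lambda q: toy_index[q], lambda p: None, max_queries=5)
```

Everything interesting happens inside `select_query`; the rest of the talk is about how to choose it.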
How should we select the queries ? (1)
S: set of pages in the Web site (pages as points)
qi: set of pages returned if we issue query qi (queries as circles)
How should we select the queries ? (2)
Find the queries (circles) that cover the maximum number of pages (points)
Equivalent to the set-covering problem in graph-theory
Challenges during query selection
In practice we don’t know which pages will be returned by which queries (qi are unknown)
Even if we did know qi, the set-covering problem is NP-Hard
We will present approximation algorithms to the query selection problem
We will assume single-keyword queries
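To make the set-covering view concrete, here is the textbook greedy heuristic for set cover, run on made-up data. It assumes every result set qi is known in advance, which the slide says is not true in practice, so it is background rather than the algorithm presented in this talk. The names `greedy_cover` and `toy_results` are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def greedy_cover(
    query_results: Dict[str, Set[int]], max_queries: int
) -> Tuple[List[str], Set[int]]:
    """Repeatedly pick the query that returns the most pages not covered yet."""
    covered: Set[int] = set()
    chosen: List[str] = []
    for _ in range(max_queries):
        best = max(
            query_results,
            key=lambda q: len(query_results[q] - covered),
            default=None,
        )
        if best is None or not (query_results[best] - covered):
            break                              # nothing new can be covered
        chosen.append(best)
        covered |= query_results[best]
    return chosen, covered


# Hypothetical site with pages 1..7 and three candidate queries.
toy_results = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}}
chosen, covered = greedy_cover(toy_results, max_queries=3)
```

On this toy input the heuristic picks "c" first (four new pages), then "a" (three new pages), and stops because "b" adds nothing.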
Outline
Interacting with Hidden-Web sites
Algorithms for selecting queries for the Hidden-Web sites
Experimental evaluation of our algorithms
Some background (1)
Assumption: When we issue query qi to a Web site, all pages containing qi are returned
P(qi): fraction of pages from the site we get back after issuing qi
Example: q = liver
No. of docs in DB: 10,000
No. of docs containing liver: 3,000
P(liver) = 0.3
Some background (2)
P(q1/\q2): fraction of pages containing both q1 and q2 (intersection of q1 and q2)
P(q1\/q2): fraction of pages containing either q1 or q2 (union of q1 and q2)
Cost and benefit:
How much benefit do we get out of a query ?
How costly is it to issue a query ?
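The fractions P(q), P(q1/\q2) and P(q1\/q2) defined above can be computed by plain counting when the whole collection is visible. The four-document collection and the helper names below are made up for illustration.

```python
from typing import Dict, Set

# Hypothetical collection: document id -> set of keywords it contains.
docs: Dict[int, Set[str]] = {
    1: {"liver", "disease"},
    2: {"liver", "transplant"},
    3: {"heart", "disease"},
    4: {"kidney"},
}


def matches(term: str) -> Set[int]:
    """Ids of documents containing the term (all matches are returned)."""
    return {doc_id for doc_id, words in docs.items() if term in words}


def fraction(doc_ids: Set[int]) -> float:
    """Fraction of the collection covered by the given documents."""
    return len(doc_ids) / len(docs)


p_liver = fraction(matches("liver"))                        # P(liver)
p_disease = fraction(matches("disease"))                    # P(disease)
p_both = fraction(matches("liver") & matches("disease"))    # P(liver /\ disease)
p_either = fraction(matches("liver") | matches("disease"))  # P(liver \/ disease)
```

Here P(liver) = P(disease) = 0.5, the intersection is 0.25 and the union is 0.75, which agrees with P(q1\/q2) = P(q1) + P(q2) - P(q1/\q2).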
Cost function
The cost to issue a query and download the Hidden-Web pages:
cq: query cost
cr: cost for retrieving a result item
cd: cost for downloading a document
Cost(qi) = cq + cr P(qi) + cd P(qi)
(1) cq: cost for issuing a query
(2) cr P(qi): cost for retrieving a result item times no. of results
(3) cd P(qi): cost for retrieving a doc times no. of docs
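As a small worked example of the cost function, the sketch below evaluates Cost(qi) for the earlier P(liver) = 0.3 example. The weights cq = 1, cr = 2 and cd = 10 are arbitrary values chosen for illustration; the slide does not give any at this point.

```python
def query_cost(p_q: float, c_q: float, c_r: float, c_d: float) -> float:
    """Cost(qi) = cq + cr * P(qi) + cd * P(qi)."""
    return c_q + c_r * p_q + c_d * p_q


# Assumed weights, for illustration only.
cost_liver = query_cost(p_q=0.3, c_q=1.0, c_r=2.0, c_d=10.0)  # about 4.6
```

With these weights the document-download term (10 x 0.3) dominates the fixed cost of issuing the query.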
Problem formalization
Find the set of queries q1,…,qn which maximizes
P(q1\/…\/qn)
under the constraint:
$\sum_{i=1}^{n} \mathrm{Cost}(q_i) \le t$
Query selection algorithms
Random: Select a query randomly from a precompiled list (e.g. a dictionary)
Frequency-based: Select a query from a precompiled list based on frequency (e.g. a corpus previously downloaded from the Web)
Adaptive: Analyze previously downloaded pages to determine “promising” future queries
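The two non-adaptive policies can be sketched in a few lines. The word list, the frequency table and the function names are hypothetical; the only idea taken from the slide is "pick at random from a list" versus "pick the most frequent term not yet issued".

```python
import random
from typing import Dict, List, Optional, Set


def random_policy(dictionary: List[str], issued: Set[str], rng: random.Random) -> Optional[str]:
    """Random: pick any not-yet-issued term from a precompiled list."""
    candidates = [word for word in dictionary if word not in issued]
    return rng.choice(candidates) if candidates else None


def frequency_policy(corpus_freq: Dict[str, int], issued: Set[str]) -> Optional[str]:
    """Frequency-based: pick the most frequent not-yet-issued term."""
    candidates = [word for word in corpus_freq if word not in issued]
    return max(candidates, key=lambda w: corpus_freq[w]) if candidates else None


# Hypothetical term frequencies from a previously downloaded corpus.
toy_freq = {"the": 50, "data": 20, "liver": 7}
next_by_frequency = frequency_policy(toy_freq, issued={"the"})
next_at_random = random_policy(list(toy_freq), issued={"the"}, rng=random.Random(0))
```

Neither policy looks at what the site actually returned, which is the gap the adaptive policy is meant to close.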
Adaptive query selection
Assume we have issued q1,…,qi-1.
To find a promising query qi we need to estimate P(q1\/…\/qi-1\/qi)
P((q1\/…\/qi-1) \/ qi) = P(q1\/…\/qi-1) + P(qi) - P(q1\/…\/qi-1) P(qi|q1\/…\/qi-1)
P(q1\/…\/qi-1): known (by counting) since we have issued q1,…,qi-1
P(qi|q1\/…\/qi-1): can be measured by counting occurrences of qi within the pages returned by q1,…,qi-1
What about P(qi) ?
Estimating P(qi)
Independence estimator: P(qi) ~ P(qi|q1\/…\/qi-1)
Zipf estimator [IG02]:
Rank queries based on frequency of occurrence and fit a power law distribution
Use the fitted distribution to estimate P(qi)
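The "fit a power law" step of the Zipf estimator can be illustrated with an ordinary least-squares fit in log-log space. This is a generic sketch under the assumption that frequency falls off as c * rank^(-a); it is not the exact procedure of [IG02], whose details are not given on the slide, and the function name and data are made up.

```python
import math
from typing import List, Tuple


def fit_power_law(freqs: List[float]) -> Tuple[float, float]:
    """Fit f(rank) = c * rank**(-a) to frequencies sorted in descending order.

    Uses least squares on log(f) versus log(rank); returns (c, a).
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic frequencies that follow f(rank) = 120 / rank exactly.
toy_freqs = [120.0 / rank for rank in range(1, 7)]
c, a = fit_power_law(toy_freqs)
estimated_rank_10 = c * 10 ** (-a)   # extrapolated frequency at rank 10
```

The fitted curve can then be read off at the rank of a candidate term to estimate its frequency.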
Query selection algorithm
foreach qi in [potential queries] do
  Pnew(qi) = P(q1\/…\/qi-1\/qi) - P(q1\/…\/qi-1)
  Estimate $\mathrm{Efficiency}(q_i) = \frac{P_{new}(q_i)}{\mathrm{Cost}(q_i)}$
done
return qi with maximum Efficiency(qi)
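A minimal sketch of this selection step, using the independence estimator for P(qi), might look as follows. It assumes the total number of pages on the site (`site_size`) is known or estimated, takes the terms seen in downloaded pages as the candidate queries, and uses unit cost weights by default; these choices and all names are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Optional, Set


def select_next_query(
    downloaded: Dict[str, Set[str]],   # page id -> terms found in that page
    issued: Set[str],
    site_size: int,                    # assumed known or estimated
    c_q: float = 1.0,
    c_r: float = 1.0,
    c_d: float = 1.0,
) -> Optional[str]:
    """Return the candidate term with maximum Pnew(qi) / Cost(qi)."""
    if not downloaded:
        return None
    n_down = len(downloaded)
    p_union = n_down / site_size                      # P(q1 \/ ... \/ qi-1)
    doc_freq = Counter(t for terms in downloaded.values() for t in terms)
    best_term, best_eff = None, float("-inf")
    for term, df in doc_freq.items():
        if term in issued:
            continue
        p_cond = df / n_down                          # P(qi | q1 \/ ... \/ qi-1)
        p_q = p_cond                                  # independence estimator
        p_new = p_q - p_union * p_cond                # Pnew(qi)
        cost = c_q + c_r * p_q + c_d * p_q            # Cost(qi)
        efficiency = p_new / cost
        if efficiency > best_eff:
            best_term, best_eff = term, efficiency
    return best_term


# Hypothetical state after issuing "liver" on a 10-page site.
toy_pages = {
    "d1": {"liver", "disease"},
    "d2": {"liver", "transplant"},
    "d3": {"liver", "disease", "heart"},
}
next_query = select_next_query(toy_pages, issued={"liver"}, site_size=10)
```

In this toy state "disease" has efficiency 0.2 against 0.14 for "transplant" and "heart", so it is returned.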
Other practical issues
Efficient calculation of P(qi|q1\/…\/qi-1)
Selection of the initial query
Crawling sites that limit the number of results (e.g. DMOZ returns up to 10,000 results)
Please refer to our paper for the details
Outline
Interacting with Hidden-Web sites
Algorithms for selecting queries for the Hidden-Web sites
Experimental evaluation of our algorithms
Experimental evaluation
Applied our algorithms to 4 different sites

| Hidden-Web site | No. of documents | Limit in the no. of results |
|---|---|---|
| PubMed medical library | ~13 million | no limit |
| Books section of Amazon | ~4.2 million | 32,000 |
| DMOZ: Open directory project | ~3.8 million | 10,000 |
| Arts section of DMOZ | ~429,000 | 10,000 |
Policies
Random-16K: pick query randomly from the 16,000 most popular terms
Random-1M: pick query randomly from the 1,000,000 most popular terms
Frequency-based: pick query based on frequency of occurrence
Adaptive
Coverage of policies
What fraction of the Web sites can we download by issuing queries ?
Study P(q1\/…\/qi) as i increases
Coverage of policies for PubMed
Adaptive gets ~80% with ~83 queries
Frequency-based needs 103 queries for the same coverage
Coverage of policies for DMOZ (whole)
Adaptive outperforms others
Coverage of policies for DMOZ (arts)
Adaptive performs best in topic-specific texts
Other experiments
Impact of the initial query
Impact of the various parameters of the cost function
Crawling sites that limit the number of results (e.g. DMOZ returns up to 10,000 results)
Please refer to our paper for the details
Related work
Issuing queries to databases
Acquire language model [CCD99]
Estimate fraction of the Web indexed [LG98]
Estimate relative size and overlap of indexes [BB98]
Build multi-keyword queries that can return a large number of documents [BF04]
Harvesting approaches/cooperative databases (OAI [LS01], DP9 [LMZN02])
Conclusion
An adaptive algorithm for issuing queries to Hidden-Web sites
Our algorithm is highly efficient (downloaded >90% of a site with ~100 queries)
Allows users to tap into unexplored information on the Web
Allows the research community to download, mine, study, understand the Hidden-Web
References
[IG02] P. Ipeirotis, L. Gravano. Distributed search over the hidden web: Hierarchical database sampling and selection. VLDB 2002.
[CCD99] J. Callan, M.E. Connell, A. Du. Automatic discovery of language models for text databases. SIGMOD 1999.
[LG98] S. Lawrence, C.L. Giles. Searching the World Wide Web. Science 280(5360):98-100, 1998.
[BB98] K. Bharat, A. Broder. A technique for measuring the relative size and overlap of public web search engines. WWW 1998.
[BF04] L. Barbosa, J. Freire. Siphoning hidden-web data through keyword-based interfaces.
[LS01] C. Lagoze, H.V. Sompel. The Open Archives Initiative: Building a low-barrier interoperability framework. JCDL 2001.
[LMZN02] X. Liu, K. Maly, M. Zubair, M.L. Nelson. DP9-An OAI Gateway Service for Web Crawlers. JCDL 2002.
Thank you !
Questions ?
Impact of the initial query
Does it matter what the first query is ?
Crawled PubMed with queries:
data (1,344,999 results)
information (308,474 results)
return (29,707 results)
pubmed (695 results)
Impact of the initial query
Algorithm converges regardless of initial query
Incorporating the document download cost
Cost(qi) = cq + cr P(qi) + cd Pnew(qi)
Crawled PubMed with
cq = 100
cr = 100
cd = 10,000
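To see what charging the download cost only for new pages does, the sketch below plugs the weights from this slide (cq = 100, cr = 100, cd = 10,000) into the refined formula. The values P(qi) = 0.3 and Pnew(qi) = 0.1 are made up for illustration.

```python
def query_cost_new_pages(
    p_q: float, p_new: float, c_q: float, c_r: float, c_d: float
) -> float:
    """Cost(qi) = cq + cr * P(qi) + cd * Pnew(qi)."""
    return c_q + c_r * p_q + c_d * p_new


# Weights from the slide; the probabilities are hypothetical.
cost_refined = query_cost_new_pages(p_q=0.3, p_new=0.1, c_q=100, c_r=100, c_d=10_000)
cost_if_all_new = query_cost_new_pages(p_q=0.3, p_new=0.3, c_q=100, c_r=100, c_d=10_000)
```

With these numbers the cost is about 1,130 when a third of the matching pages are new, against about 3,130 when all of them are, so the download term dominates.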
Incorporating document download cost
Adaptive uses resources more efficiently
Document cost is a significant portion of the cost
Can we get all the results back ?
…
Downloading from sites limiting the number of results (1)
Site returns qi’ instead of qi
For qi+1 we need to estimate P(qi+1|q1\/…\/qi)
Downloading from sites limiting the number of results (2)
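$P(q_{i+1} \mid q_1 \vee \dots \vee q_i) = \frac{1}{P(q_1 \vee \dots \vee q_i)} \left[ P(q_{i+1} \wedge (q_1 \vee \dots \vee q_{i-1})) + P(q_{i+1} \wedge q_i) - P(q_{i+1} \wedge q_i \wedge (q_1 \vee \dots \vee q_{i-1})) \right]$
Assuming qi’ is a random sample of qi:
$\frac{P(q_{i+1} \wedge q_i)}{P(q_{i+1} \wedge q_i')} = \frac{P(q_i)}{P(q_i')}$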
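Under the random-sample assumption, the fraction of pages matching both q(i+1) and qi can be scaled up from what is observed in the truncated result list qi’. The sketch below does that arithmetic on made-up numbers; it additionally assumes the site reports the total number of matches for qi, so that P(qi) is known even though only part of the results is returned. All names and values are hypothetical.

```python
def estimate_joint_with_cap(p_next_and_sample: float, p_qi: float, p_qi_sample: float) -> float:
    """Estimate P(q_{i+1} /\\ q_i) as P(q_{i+1} /\\ q_i') * P(q_i) / P(q_i')."""
    if p_qi_sample <= 0:
        raise ValueError("the truncated result list must not be empty")
    return p_next_and_sample * p_qi / p_qi_sample


# Hypothetical numbers: a 100,000-page site with a 1,000-result cap.
site_size = 100_000
matches_reported = 5_000      # total matches for qi, assumed reported by the site
results_returned = 1_000      # size of qi', the truncated result list
returned_with_next = 200      # pages in qi' that also contain q_{i+1}

p_qi = matches_reported / site_size
p_qi_sample = results_returned / site_size
p_joint_sample = returned_with_next / site_size
p_joint = estimate_joint_with_cap(p_joint_sample, p_qi, p_qi_sample)
```

Since one in five returned pages contains q(i+1), the estimate scales that share to all 5,000 matches, giving roughly 1,000 pages or 1% of the site.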
Impact of the limit of results
How does the limit of results affect our algorithms ?
Crawled DMOZ but restricted the algorithms to 1,000 results instead of 10,000
DMOZ with a result cap of 1,000
Adaptive still outperforms frequency-based