On building a search interface discovery system

Post on 06-May-2015



Slides of my talk at RED'09 workshop


28.08.2009 RED’09 1

On building a search interface discovery system

Denis Shestakov

Helsinki University of Technology, Finland

fname.lname at tkk dot fi


Outline

• Background: search interfaces & deep Web

• Motivation

• Building a directory of deep web resources

• Interface Crawler

• Experiments & results

• Discussion & conclusion


Background

• Search engines (e.g., Google) do not crawl and index a significant portion of the Web

• The information from the non-indexable part of the Web cannot be found or accessed via search engines

• An important type of web content that is badly indexed:

  • Web pages generated based on parameters provided by users via search interfaces

  • Typically, these pages contain 'high-quality' structured content (e.g., product descriptions)

• Search interfaces are entry points to myriads of databases on the Web

• The part of the Web 'behind' search interfaces is known as the deep Web (or hidden Web)

• Also see the VLDB'09 papers presented yesterday: Lixto & Kosmix


Background: example

AutoTrader search form (http://autotrader.com/):


Background: deep Web numbers & misconceptions

Size of the deep Web:

• 400 to 550 times larger than the indexable Web according to a survey of 2001; but it is not that big

• Comparable with the size of the indexable Web [indirect support in the tech report by Shestakov & Salakoski]

Content of some (well, of many) web databases is, in fact, indexable:

• Correlation with database subjects: content of books/movies/music databases (i.e., relatively 'static' data) is indexed well

• Search engines' crawlers do go behind web forms [see the VLDB'08 work by Madhavan et al.]

Total number of web databases:

• Survey of Apr'04 by Chang et al.: 450,000 web databases

• An underestimation

• Now, in 2009, several million databases are available online


Motivation

• Several million databases available online …

• To access a database, a user needs to know its URL

• But there are directories/lists of databases, right?

  • The biggest, Completeplanet.com, includes 70,000 resources

  • Manually created and maintained by domain specialists, such as the Molecular Biology Database Collection with 1,170 summaries of bioinformatics databases in 2009

• Essentially, we currently have no idea about the location of most deep web resources:

  • And the content of many of these databases is either not indexed or poorly indexed

  • I.e., undiscovered resources with unknown content

• Directories of online databases corresponding to the scale of the deep Web are needed


Motivation

• Building such directories requires a technique for finding search interfaces

• A database on the Web is identifiable by its search interface

• For any given topic there are too many web databases with relevant content: resource discovery has to be automatic

• One specific application: general web search

  • Transactional queries (i.e., find a site where further interaction will happen)

  • For example, if a query suggests that a user wants to buy/sell a car, search results should contain links to pages with web forms for car search


Building a directory of deep web resources

1. Visit as many pages that potentially have search interfaces as possible

• (Dozens of) billions of web pages vs. millions of databases

• Visiting a page with a search interface during a 'regular' crawl is a rare event

• It is even rarer if the databases of interest belong to a particular domain

• Thus, some visiting (or crawling) strategy could be very helpful
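Such a visiting strategy can be sketched as a best-first crawl: keep a priority queue of unvisited URLs and always expand the one that looks most likely to lead to a search form. This is only a minimal illustration, not the strategy actually used in this work; the hint words and scoring are made-up assumptions, and a toy link map stands in for real fetching.

```python
import heapq

# Hypothetical hint words; a real strategy would learn such weights.
HINT_WORDS = ("search", "find", "query", "advanced", "lookup")

def score(url):
    """Crude promise score: URLs mentioning search-like words rank higher."""
    return sum(word in url.lower() for word in HINT_WORDS)

def best_first_crawl(start, links, limit=10):
    """Visit up to `limit` pages, always expanding the most promising URL.

    `links` maps a URL to the URLs it links to (a stand-in for fetching
    and parsing real pages).
    """
    frontier = [(-score(start), start)]   # max-heap via negated scores
    seen, visited = {start}, []
    while frontier and len(visited) < limit:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for out in links.get(url, ()):
            if out not in seen:
                seen.add(out)
                heapq.heappush(frontier, (-score(out), out))
    return visited

links = {
    "a.com": ["a.com/about", "a.com/search"],
    "a.com/search": ["a.com/results"],
}
print(best_first_crawl("a.com", links, limit=3))
```

Note how the "/search" page jumps ahead of "/about" in the visiting order even though both are linked from the root: that is the whole point of prioritizing, given how rare form pages are in a regular crawl.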



Building a directory of deep web resources

2. Recognize a search interface on a web page (the focus of this work)

• Forms have great variety in structure and vocabulary

• JavaScript-rich and non-HTML forms (e.g., in Flash) have to be recognized


Building a directory of deep web resources

3. Classify search interfaces (and, hence, databases) into a subject hierarchy

• One of the challenges: some interfaces belong to several domains


Interface Crawler

• I-Crawler is a system to automatically discover search interfaces and identify the main subject of an underlying database

• Deals with JavaScript-rich and non-HTML forms

• Uses a binary domain-independent classifier for identifying searchable web forms

• Divides all forms into two groups: u-forms (those with one or two visible fields) and s-forms (the rest)

• U- and s-forms are processed differently: u-forms are classified using query probing [Bergholz and Chidlovskii, 2003; Gravano et al., 2003]
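The u-form/s-form split can be sketched directly from its definition on this slide (one or two visible fields vs. the rest). The field representation and the hidden-type list below are assumptions made for illustration; the downstream processing (query probing vs. structural classification) is only named, not implemented.

```python
def count_visible_fields(fields):
    """`fields` is a list of dicts like {"tag": "input", "type": "text"}."""
    hidden_types = {"hidden", "submit", "image", "button"}
    return sum(1 for f in fields if f.get("type") not in hidden_types)

def route_form(fields):
    """Split forms as the slide describes: u-forms have one or two
    visible fields, s-forms are the rest."""
    n = count_visible_fields(fields)
    if n == 0:
        return "ignore"   # nothing for a user to fill in
    if n <= 2:
        return "u-form"   # classified by query probing
    return "s-form"       # classified by structure and vocabulary

simple = [{"tag": "input", "type": "text"}, {"tag": "input", "type": "submit"}]
print(route_form(simple))  # u-form
```

The split matters because a one-box form carries almost no structural signal, so probing it with test queries is about the only way to tell a site search from a login field.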


Interface crawler: architecture


Experiments and results

• Tested the Interface Identification component• Datasets:

1.216 searchable (HTML) web forms from the UIUC repository plus 90 searchable web forms (60 HTML forms and 30 JS-rich or non-HTML forms) and 300 non-searchable forms (270 and 30) added by us

2.Only s-forms from the dataset 13.264 searchable forms and 264 non-searchable

forms (all in Russian)4.90 searchable u-forms and 120 non-searchable

u-forms• Learning with two thirds of each dataset and testing

on the remaining third
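The evaluation protocol (learn on two thirds of each dataset, test on the remaining third) amounts to a shuffled split; a minimal sketch, with a fixed seed assumed only for reproducibility of the example:

```python
import random

def split_two_thirds(dataset, seed=0):
    """Shuffle and split: two thirds for learning, the rest for testing."""
    items = list(dataset)
    random.Random(seed).shuffle(items)
    cut = (2 * len(items)) // 3
    return items[:cut], items[cut:]

# E.g., the 216 UIUC forms of dataset 1 split into 144 / 72.
train, test = split_two_thirds(range(216))
print(len(train), len(test))  # 144 72
```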



Experiments and results

• Used the decision tree to detect search interfaces on real web sites

• Three groups of web sites:

  1. 150 deep web sites (in Russian)

  2. 150 sites randomly selected from the "Recreation" category of http://www.dmoz.org

  3. 150 sites randomly selected based on IP addresses

• All sites in each group were crawled to depth 5
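Crawling each site to a fixed depth can be sketched as a breadth-first traversal with a depth cutoff; the link map below is a toy stand-in for fetching and parsing real pages, and the function names are illustrative.

```python
from collections import deque

def crawl_to_depth(root, links, max_depth=5):
    """Breadth-first crawl of a site graph down to `max_depth` levels.

    `links` maps a page to the pages it links to, standing in for
    fetching a page and extracting its outgoing links.
    """
    seen = {root}
    queue = deque([(root, 0)])
    pages = []
    while queue:
        page, depth = queue.popleft()
        pages.append(page)
        if depth == max_depth:
            continue  # do not expand links below the cutoff
        for out in links.get(page, ()):
            if out not in seen:
                seen.add(out)
                queue.append((out, depth + 1))
    return pages

site = {"/": ["/cars", "/about"], "/cars": ["/cars/search"]}
print(crawl_to_depth("/", site, max_depth=2))
```

Every visited page would then be handed to the interface-recognition step; the depth cutoff bounds the per-site cost while still reaching forms a few clicks below the root.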


Discussion and conclusion

• One specific use for the I-Crawler: deep web characterization (i.e., how many deep web resources are on the Web)

  • Hence, while false positives are OK, false negatives are not OK (resources are ignored)

• Root pages of deep web sites are good starting points for discovering more databases

• JS-rich and non-HTML forms are becoming more and more popular

  • Recognizing them is essential

• Nowadays, more and more content owners provide APIs to their data, databases, etc.

  • Techniques for API discovery are needed


Thank you!

Questions?