On building a search interface discovery system

28.08.2009 RED’09 1 On building a search interface discovery system Denis Shestakov Helsinki University of Technology, Finland fname.lname at tkk dot fi

description

Slides of my talk at RED'09 workshop

Transcript of On building a search interface discovery system

Page 1: On building a search interface discovery system

28.08.2009 RED’09 1

On building a search interface discovery system

Denis Shestakov

Helsinki University of Technology, Finland

fname.lname at tkk dot fi

Page 2: On building a search interface discovery system

Outline

• Background: search interfaces & deep Web

• Motivation

• Building directory of deep web resources

• Interface Crawler

• Experiments & results

• Discussion & conclusion

Page 3: On building a search interface discovery system

Background

• Search engines (e.g., Google) do not crawl and index a significant portion of the Web

• The information from the non-indexable part of the Web cannot be found and accessed via search engines

• Important type of web content which is badly indexed:

• Web pages generated based on parameters provided by users via search interfaces

• Typically, these pages contain ‘high-quality’ structured content (e.g., product descriptions)

• Search interfaces are entry-points to myriads of databases on the Web

• The part of the Web ’behind’ search interfaces is known as deep Web (or hidden Web)

• Also, see VLDB’09 papers presented yesterday: Lixto & Kosmix

Page 4: On building a search interface discovery system

Background: example

AutoTrader search form (http://autotrader.com/):

Page 5: On building a search interface discovery system

Background: deep Web numbers & misconceptions

Size of the deep Web:

• 400 to 550 times larger than the indexable Web according to survey of 2001; but it is not that big

• Comparable with the size of the indexable Web [indirect support in tech.report by Shestakov&Salakoski]

Content of some (well, of many) web databases is, in fact, indexable:

• Correlation with database subjects: content of books/movies/music databases (i.e., relatively ’static’ data) is indexed well

• Search engines’ crawlers do go behind web forms [see VLDB’08 work by Madhavan et al.]

Total number of web databases:

• Survey of Apr’04 by Chang et al.: 450 000 web dbs

• Underestimation

• Now in 2009, several million dbs available online

Page 6: On building a search interface discovery system

Motivation

• Several million databases available online …

• To access a database, a user needs to know its URL

• But there are directories/lists of databases, right?

• Biggest, Completeplanet.com, includes 70,000 resources

• Manually created and maintained by domain specialists, such as Molecular Biology Database Collection with 1170 summaries of bioinformatics databases in 2009

• Essentially, we currently have no idea about the location of most deep web resources:

• And content of many of these databases is either not indexed or poorly indexed

• I.e., undiscovered resources with unknown content

• Directories of online databases corresponding to the scale of deep Web are needed

Page 7: On building a search interface discovery system

Motivation

• Building such directories requires a technique for finding search interfaces

• A database on the Web is identifiable by its search interface

• For any given topic there are too many web databases with relevant content: resource discovery has to be automatic

• One specific application: general web search

• Transactional queries (i.e., find a site where further interaction will happen)

• For example, if a query suggests that a user wants to buy/sell a car, search results should contain links to pages with web forms for car search

Page 8: On building a search interface discovery system

Building directory of deep web resources

1. Visit as many pages that potentially have search interfaces as possible

• (Dozens of) billions of web pages vs. millions of databases

• Visiting a page with a search interface during a ‘regular’ crawl is a rare event

• It is even rarer if databases of interest belong to a particular domain

• Thus, some visiting (or crawling) strategy could be very helpful

Page 9: On building a search interface discovery system

Building directory of deep web resources

2. Recognize search interface on a web page (focus in this work)

Page 10: On building a search interface discovery system

Building directory of deep web resources

2. Recognize search interface on a web page (focus in this work)

• Forms have great variety in structure and vocabulary

• JavaScript-rich and non-HTML forms (e.g., in Flash) have to be recognized

Page 11: On building a search interface discovery system

Building directory of deep web resources

3. Classify search interfaces (and, hence, databases) into subject hierarchy

• One of the challenges: some interfaces belong to several domains

Page 12: On building a search interface discovery system

Interface crawler

• I-Crawler is a system to automatically discover search interfaces and identify the main subject of an underlying database

• Deals with JavaScript-rich and non-HTML forms

• Uses a binary domain-independent classifier for identifying searchable web forms

• Divides all forms into two groups: u-forms (those with one or two visible fields) and s-forms (the rest)

• U- and s-forms are processed differently: u-interfaces are classified using query probing [Bergholz and Chidlovskii, 2003; Gravano et al., 2003]
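
The u-form/s-form division described on this slide can be illustrated with a short sketch. It is not the I-Crawler implementation: the names `FormFieldCounter`, `split_forms` and `NON_FIELD_TYPES` are hypothetical, and the definition of a "visible field" (any input, select or textarea that is not a hidden field or a button) is an assumption of this example. It only handles plain HTML forms, not the JS-rich or Flash forms the slide mentions.

```python
from html.parser import HTMLParser

# Assumption of this sketch: these <input> types are not fillable, visible fields.
NON_FIELD_TYPES = {"hidden", "submit", "reset", "button", "image"}


class FormFieldCounter(HTMLParser):
    """Counts visible fields for every <form> in an HTML page."""

    def __init__(self):
        super().__init__()
        self.counts = []       # one visible-field count per form, in page order
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._in_form = True
            self.counts.append(0)
        elif self._in_form and tag in ("input", "select", "textarea"):
            # An <input> without a type attribute defaults to a text field.
            input_type = (attrs.get("type") or "text").lower()
            if tag != "input" or input_type not in NON_FIELD_TYPES:
                self.counts[-1] += 1

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False


def split_forms(html):
    """Label each form 'u' (one or two visible fields) or 's' (the rest)."""
    parser = FormFieldCounter()
    parser.feed(html)
    return ["u" if 1 <= n <= 2 else "s" for n in parser.counts]
```

For instance, a form with a single keyword box and a submit button would be labelled `u`, while a car-search form with make, model and ZIP fields would be labelled `s`.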

Page 13: On building a search interface discovery system

Interface crawler: architecture

Page 14: On building a search interface discovery system

Experiments and results

• Tested the Interface Identification component

• Datasets:

1. 216 searchable (HTML) web forms from the UIUC repository plus 90 searchable web forms (60 HTML forms and 30 JS-rich or non-HTML forms) and 300 non-searchable forms (270 and 30) added by us

2. Only s-forms from the dataset 1

3. 264 searchable forms and 264 non-searchable forms (all in Russian)

4. 90 searchable u-forms and 120 non-searchable u-forms

• Learning with two thirds of each dataset and testing on the remaining third
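
The two-thirds/one-third protocol can be sketched as follows. The slide does not say how the forms were assigned to each part, so the random shuffle with a fixed seed is an assumption of this example, and `split_two_thirds` is a hypothetical helper name. The sizes of dataset 4 are taken from the slide; the placeholder tuples stand in for real forms.

```python
import random


def split_two_thirds(examples, seed=0):
    """Shuffle examples and return (training, testing) with a 2:1 split."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # assumption: random assignment
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


# Dataset 4 from the slide: 90 searchable and 120 non-searchable u-forms.
dataset = [("searchable", i) for i in range(90)]
dataset += [("non-searchable", i) for i in range(120)]

train, test = split_two_thirds(dataset)
print(len(train), len(test))  # 140 70
```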

Page 15: On building a search interface discovery system

Experiments and results

Page 16: On building a search interface discovery system

Experiments and results

• Used the decision tree to detect search interfaces on real web sites

• Three groups of web sites:

1. 150 deep web sites (in Russian)

2. 150 sites randomly selected from “Recreation” category of http://www.dmoz.org

3. 150 sites randomly selected based on IP addresses

• All sites in each group were crawled to depth 5
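
Crawling a site "to depth 5" can be sketched as a breadth-first traversal that stops following links at a fixed depth. This is an illustration only: the slides do not describe the crawler used, counting the root page as depth 0 is an assumption, and `crawl_to_depth` and `get_links` are hypothetical names. A toy in-memory link table replaces real page fetches.

```python
from collections import deque


def crawl_to_depth(root, get_links, max_depth=5):
    """Breadth-first crawl from root; returns {url: depth} for pages within max_depth."""
    depth_of = {root: 0}  # assumption: the root page is at depth 0
    queue = deque([root])
    while queue:
        url = queue.popleft()
        if depth_of[url] == max_depth:
            continue  # do not follow links out of the deepest allowed level
        for link in get_links(url):
            if link not in depth_of:  # visit each page once
                depth_of[link] = depth_of[url] + 1
                queue.append(link)
    return depth_of


# Toy site: a chain of pages p0 -> p1 -> ... -> p7, standing in for HTTP fetches.
toy_site = {f"p{i}": [f"p{i + 1}"] for i in range(7)}
visited = crawl_to_depth("p0", lambda url: toy_site.get(url, []))
print(sorted(visited))  # p0 through p5; p6 and p7 lie deeper than 5
```

In a real run, `get_links` would fetch the page, extract same-site links, and hand each fetched page to the form classifier.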

Page 17: On building a search interface discovery system

Discussion and conclusion

• One specific use for the I-Crawler: deep web characterization (i.e., how many deep web resources are on the Web)

• Hence, while false positives are OK, false negatives are not OK (resources are ignored)

• Root pages of deep web sites are good starting points for discovering more databases

• JS-rich and non-HTML forms become more and more popular

• Recognizing them is essential

• Nowadays more and more content owners provide APIs to their data, databases, etc.

• Need for techniques for API discovery
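
The remark on false positives and false negatives maps onto the usual precision and recall measures: a missed search interface lowers recall, while a wrongly accepted form lowers precision. The sketch below uses invented counts purely for illustration; they are not results from the talk, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall); a measure is 0.0 when its denominator is zero."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Invented counts for two hypothetical classifiers over 100 real search interfaces.
permissive = precision_recall(true_pos=90, false_pos=30, false_neg=10)
strict = precision_recall(true_pos=70, false_pos=5, false_neg=30)
print(permissive)  # higher recall: fewer resources are ignored
print(strict)      # higher precision, but 30 interfaces are missed
```

Under the priority stated on the slide, the permissive classifier would be preferred for characterization, since its extra false positives can be filtered later while missed resources are lost.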

Page 18: On building a search interface discovery system

Thank you!

Questions?