Google's Deep-Web Crawl (VLDB 2008)


Google’s Deep-Web Crawl

Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, and Alon Halevy, Google Inc.

Speaker: Tom


What is the Deep Web?

Content hidden behind HTML forms

Deep = not accessible through search engines


Why is it important?

Large source of structured data

Forms present a search interface over backend databases

Significant gap in search engine coverage

Potentially more content than the currently searchable web [Bergman+, Madhavan+, He+]

More than 10 million distinct HTML forms

Likely to increase as more data comes online

Challenge: make the Deep Web accessible to web search


What is in the Deep Web?

Examples: store locations, used cars, radio stations, patents, recipes

Yes: Informational forms

No: Login forms, anything that requires user information

Maybe: Interactive forms, e.g., airline reservations


Virtual Integration

Mediator forms per domain

Mappings between forms [Doan+, He+, Wu+]

Query routing/reformulation at run-time

Popular with vertical search engines

Impractical for web search!

Modeling all domains in all languages might not be possible

High cost of building and maintaining

Query routing at run-time is very difficult

Potentially high loads on deep-web sources

[Figure: a mediated form connected to deep-web sources through semantic mappings]


Surfacing the Deep Web

Pre-compute all interesting form submissions for each HTML form

Each form submission corresponds to a distinct URL

Add URLs for each form submission into search engine index

Enables the reuse of existing search engine infrastructure

Deep-web URLs are like any other URL

Reduced load on deep-web sites

Sites are queried only when a user clicks on a search result

Search engine performance not dependent on deep-web source


Surfacing Challenges

1. Predicting the appropriate values for text inputs

Valid input values are required for retrieving data

E.g., ingredients on recipes.com and zip codes on borderstores.com

2. Predicting the correct input combinations

Generating all possible URLs is wasteful and unnecessary

Cars.com has ~500K listings, but 250M possible queries


Surfacing for a Search Engine

Goal: access to as much Deep-Web content as possible.

Distribution of form-generated traffic is heavy-tailed

More than 800,000 distinct forms in a week

Overall coverage more important than site-specific coverage

Completely automatic and efficient solution required!

Many domains and many languages

No human in the loop, no site-specific scripts


Contributions and Impact

Research contributions

Formulation: searching for informative query templates

Algorithms: predicting input combinations

Algorithms: predicting input values for text boxes

Google’s Deep-Web crawling system

Affects more than 1000 queries per second

Enables access to more than a million Deep-Web sites

Spans 50+ languages and 100+ domains


Problem Formulation


Form Processing 101

GET and POST: types of HTML forms

Only GETs can be surfaced

<form action="http://www.borders.com/locator" method="GET">
  <select name="store"><option …/>… </select>
  …
  <input name="zip" type="text"/>
  <input name="search" type="submit" value="Go"/>
  <input name="site" type="hidden" value="homepage"/>
</form>

On submit, the form produces the URL:

http://www.borders.com/locator?store=All&city=&state=&zip=94043&within=25&search=Go&site=homepage
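The mapping from a GET form submission to a URL can be sketched in a few lines of Python. This is only an illustration: the helper name `submission_url` is hypothetical, and the input names and values are copied from the example URL on this slide.

```python
from urllib.parse import urlencode


def submission_url(action: str, values: dict[str, str]) -> str:
    """Build the URL a browser would request when a GET form is submitted."""
    # urlencode escapes names and values and joins the pairs with '&'.
    return f"{action}?{urlencode(values)}"


# Input names and values taken from the example URL on the slide.
values = {
    "store": "All",
    "city": "",
    "state": "",
    "zip": "94043",
    "within": "25",
    "search": "Go",
    "site": "homepage",
}
url = submission_url("http://www.borders.com/locator", values)
print(url)
```

Because every submission of a GET form is fully described by its URL, each one can be handed to an index like any other page; a POST submission carries its values in the request body instead.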


Problem Formulation

Form submission ~ SQL Query

select * from DB where I1 = V1 and … and IN = VN

Not all inputs impose selection predicates

E.g., sort order and results per page affect presentation

Problem: find the best set of SQL queries
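To make the analogy concrete, here is a minimal sketch that renders a form submission as the SQL-like query it roughly corresponds to. The function name, the sample input names, and the rule that an empty value imposes no restriction are all assumptions made for illustration; the slides do not define them.

```python
def submission_to_sql(values, presentation_inputs=frozenset()):
    """Render a form submission as the SQL-like query it roughly corresponds to."""
    predicates = [
        f"{name} = '{value}'"
        for name, value in values.items()
        # Assumption: an empty value means the input imposes no restriction,
        # and presentation inputs (e.g. sort order) never add a predicate.
        if value and name not in presentation_inputs
    ]
    if not predicates:
        return "select * from DB"
    return "select * from DB where " + " and ".join(predicates)


# Hypothetical submission: 'sort' only affects presentation.
query = submission_to_sql(
    {"zip": "94043", "city": "", "sort": "distance"},
    presentation_inputs={"sort"},
)
print(query)
```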


Query Templates

Query Template: compact representation of a set of queries

{ select * from DB where PB }

IB: binding inputs in the form

PB: selection predicates only involving IB

All queries with different values for IB

Default values assigned to other inputs

Store locator with zip and type can have templates:

<Z> { select * from DB where zip = z | z are valid zip codes }

<T> { select * from DB where type = t | t are valid store types }

<T, Z> { select * from DB where zip = z and type = t | … }

Problem: find the best possible query templates
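A query template can be expanded into concrete submissions by varying only the binding inputs and leaving every other input at its default. The sketch below assumes a hypothetical store locator; the candidate values and defaults are made up for illustration.

```python
from itertools import product


def template_submissions(binding, candidate_values, defaults):
    """Yield one submission per combination of values for the binding inputs."""
    names = sorted(binding)
    for combo in product(*(candidate_values[name] for name in names)):
        submission = dict(defaults)           # non-binding inputs keep defaults
        submission.update(zip(names, combo))  # binding inputs take the combo
        yield submission


# Hypothetical candidate values and defaults for a store locator form.
candidate_values = {
    "zip": ["94043", "10001"],
    "type": ["books", "music", "cafe"],
}
defaults = {"zip": "", "type": "All"}

z = list(template_submissions({"zip"}, candidate_values, defaults))
tz = list(template_submissions({"zip", "type"}, candidate_values, defaults))
print(len(z), len(tz))
```

The number of submissions is the product of the value counts of the binding inputs, which is why templates with many binding inputs grow so quickly.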


Predicting Input Combinations

Forms can have multiple inputs

Generating all possible URLs is wasteful … and unnecessary!

Goal: minimize URLs while maximizing retrieval!

Other considerations

Generated URLs must be good candidates for the index

Only need URLs sufficient to drive traffic

Only need URLs sufficient to seed the web crawler


Query Template Quality

Presentation input is binding
– There exists a template with fewer binding inputs

Large query templates (many binding inputs)
– Too many queries generated
– Numerous queries with empty results
+ Likely to ensure complete coverage

Small query templates (fewer binding inputs)
+ Smaller number of queries
– Lower actual coverage (restrictions on the results per page)
– Results of a single query not sufficiently related


Good Query Templates

Do not contain presentation inputs

Neither too small nor too large

Dependent on database size?

Dependent on potential query traffic?


Informative Query Templates

Varying the state input:

http://jobs.shrm.org/search?state=All&kw=&type=All
http://jobs.shrm.org/search?state=AL&kw=&type=All
http://jobs.shrm.org/search?state=AK&kw=&type=All
…
http://jobs.shrm.org/search?state=WV&kw=&type=All

Result pages different → informative

Varying the type input:

http://jobs.shrm.org/search?state=All&kw=&type=ALL
http://jobs.shrm.org/search?state=All&kw=&type=ANY
http://jobs.shrm.org/search?state=All&kw=&type=EXACT

Result pages similar → un-informative


Identifying Informative Templates

Generate a sampling of possible form submissions

Analyze and compare the contents of the result pages

Compute content signatures for each corresponding web page

Dist. Frac. = # Distinct Signatures / # URLs

Dist. Frac. > Threshold → Informative Template

Content signatures must be robust to

Changes in HTML layout

Minor differences in content

Presence of advertisements and transient content
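The informativeness test can be sketched as follows. The signature used here (a hash of the set of words left after stripping markup) is a deliberately crude stand-in chosen for illustration, not the signature used in the actual system, and the 0.25 threshold is likewise an assumed value.

```python
import hashlib
import re


def content_signature(html: str) -> str:
    """Crude signature: hash of the set of words left after stripping tags."""
    text = re.sub(r"<[^>]+>", " ", html)  # ignore HTML layout
    words = sorted(set(re.findall(r"[a-z]+", text.lower())))
    return hashlib.sha1(" ".join(words).encode("utf-8")).hexdigest()


def distinctness_fraction(pages: list[str]) -> float:
    """Number of distinct signatures divided by number of sampled URLs."""
    if not pages:
        return 0.0
    return len({content_signature(page) for page in pages}) / len(pages)


def is_informative(pages: list[str], threshold: float = 0.25) -> bool:
    # The threshold value is an assumption made for this sketch.
    return distinctness_fraction(pages) > threshold


# Toy result pages: the first two differ only in markup and spacing.
pages = [
    "<p>No results found</p>",
    "<div>no results  found</div>",
    "<p>Jobs in Alabama</p>",
    "<p>Jobs in Alaska</p>",
]
print(distinctness_fraction(pages))
```

A template whose sampled submissions all return the same error page collapses to a single signature and therefore scores 1 / (number of URLs).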


URL Generation

Low distinctness fractions imply one of:

presentation inputs: many pages have similar results

very large template: many pages are empty

error template: all pages are the same with an error message

Generated submissions unlikely to be useful

URL generation strategy

Enumerate all possible query templates

Test each template for informativeness

Generate all URLs from informative templates


Incremental Template Search

Determine informative templates with one binding input

Determine informative templates with two binding inputs

Only consider pairs with one input known to be informative

Incrementally build candidate templates

Only consider supersets of smaller informative templates

Halt when no larger templates are possible

ISIT: Incremental Search for Informative Templates
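The incremental search described on this slide can be sketched as a level-by-level expansion. This is an illustrative reading of the steps rather than the authors' implementation: `is_informative` is a caller-supplied test (in practice it would sample submissions and compare result pages), and the toy test and input names below are hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, List

Template = FrozenSet[str]


def isit(inputs: Iterable[str],
         is_informative: Callable[[Template], bool]) -> List[Template]:
    """Grow templates one binding input at a time, keeping informative ones."""
    inputs = sorted(inputs)
    candidates = {frozenset([name]) for name in inputs}
    informative: List[Template] = []
    while candidates:  # halts once no larger candidate can be built
        passed = [t for t in sorted(candidates, key=sorted) if is_informative(t)]
        informative.extend(passed)
        # Only supersets of informative templates become new candidates.
        candidates = {t | {name} for t in passed for name in inputs if name not in t}
    return informative


tested = []


def toy_test(template: Template) -> bool:
    # Hypothetical rule: any template binding a presentation input fails.
    tested.append(template)
    return not (template & {"sort", "per_page"})


found = isit(["state", "type", "sort", "per_page"], toy_test)
print(found, len(tested))
```

In this toy run only 11 of the 15 possible templates are tested, because no candidate is built on top of an uninformative template.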


Scalable URL Generation

Our algorithm generates far fewer URLs

Informativeness test plays a critical role

Number of URLs generated depends on database size

Competitors

• Cartesian: all possible URLs

• Triple: templates with three binding inputs

[Figure: average URLs per form (log scale, 1 to 10,000,000,000) versus number of inputs (1 to 10), for INFORMATIVE, CARTESIAN, and TRIPLE]


Other significant results

Larger templates are useful

Compare with simple strategy: single binding input templates

Among forms with informative templates with 3 inputs

Templates of size 1 contribute 6% of search results on Google.com

Templates of size 2 contribute 37%

Templates of size 3 contribute 57%

Informative templates are discovered efficiently

Among forms with 5 inputs, on average

Only 12.6 (out of possible 31) templates are tested

Only 1300 URLs are analyzed in total
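The figure of 31 possible templates for a form with 5 inputs is the number of non-empty subsets of the inputs, 2^5 - 1. A short check, using placeholder input names:

```python
from itertools import combinations


def all_templates(inputs):
    """Every non-empty subset of the inputs is a candidate query template."""
    return [
        frozenset(subset)
        for size in range(1, len(inputs) + 1)
        for subset in combinations(inputs, size)
    ]


# Placeholder input names for a form with five inputs.
count = len(all_templates(["i1", "i2", "i3", "i4", "i5"]))
print(count)
```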


Predicting Text Values


Generic and Typed Text boxes

Generic Search Boxes

Accept any keywords

Challenge: selecting the most appropriate values

Typed Text Boxes

Accept only values belonging to specific types, e.g., zip codes

Challenge: selecting the type of the input


Example: www.wipo.int


Input values for Generic Search

Iterative Probing for search boxes

Select an initial list of candidate keywords

Download pages based on current set of keywords

Extract more candidate keywords from result pages

Refine the current set of keywords

Repeat until no more new candidate keywords

Prune list of candidate keywords
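The probing loop above can be sketched as follows. Several details are assumptions made for illustration: `fetch_results` is a hypothetical callable that submits one keyword and returns the result page text, keywords are extracted with a simple regular expression, and pruning keeps the keywords that appeared on the most result pages. The slides do not specify the extraction or pruning rules.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List


def iterative_probing(seeds: Iterable[str],
                      fetch_results: Callable[[str], str],
                      max_keywords: int = 20,
                      max_rounds: int = 5) -> List[str]:
    """Probe a search box, harvesting new keywords from each result page."""
    probed = set()
    frontier = sorted(set(seeds))
    page_count = Counter()  # number of result pages mentioning each word
    for _ in range(max_rounds):
        if not frontier:
            break  # no new candidate keywords were found
        discovered = set()
        for keyword in frontier:
            probed.add(keyword)
            words = set(re.findall(r"[a-z]{4,}", fetch_results(keyword).lower()))
            page_count.update(words)
            discovered |= words
        frontier = sorted(discovered - probed)
    # Assumed pruning rule: prefer keywords seen on the most result pages.
    ranked = sorted(probed, key=lambda w: (-page_count[w], w))
    return ranked[:max_keywords]


# Toy stand-in for a site: maps a keyword to the text of its result page.
site = {
    "patent": "protein antibody patent",
    "protein": "protein viscosity",
    "antibody": "antibody protein",
    "viscosity": "viscosity",
}
keywords = iterative_probing(["patent"], lambda kw: site.get(kw, ""))
print(keywords)
```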

Related Work:

Classifying Deep-Web sources [Ipeirotis+]

Extracting text documents [Ntoulas+, Barbosa+]


Example: www.wipo.int

Metalworking, Protein, Antibody, Pyrazole, Immobilizer, Vasoconstriction, Phosphinates, Nosepiece, Sandbridge, Viscosity, Carboxydiphenylsulphide, Ozonizer, …


Results Summary

Distribution of extracted keywords is heavy-tailed

A large fraction of the records is retrieved

Text inputs and select menus are complementary and both are important

Web crawler can automatically retrieve additional content


Typed Text Boxes

Library of types that are common across domains

Name patterns and sample values

Zipcodes, City Names, Prices, Dates

Re-use informativeness test

Test singleton text boxes

Informative only when using the correct type
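Reusing the informativeness test to guess the type of a text box might look like the sketch below. Everything here is hypothetical: the type library, the sample values, the `fetch_results` callable, the 0.25 threshold, and the use of exact page text as a stand-in for a content signature.

```python
from typing import Callable, Dict, List, Optional


def guess_input_type(type_library: Dict[str, List[str]],
                     fetch_results: Callable[[str], str],
                     threshold: float = 0.25) -> Optional[str]:
    """Pick the type whose sample values yield the most distinct result pages."""
    best_type, best_fraction = None, threshold
    for type_name, samples in type_library.items():
        if not samples:
            continue
        pages = [fetch_results(value) for value in samples]
        # Exact page text stands in for a robust content signature here.
        fraction = len(set(pages)) / len(pages)
        if fraction > best_fraction:
            best_type, best_fraction = type_name, fraction
    return best_type


# Hypothetical type library with a few sample values per type.
type_library = {
    "zipcode": ["94043", "10001", "60601"],
    "city": ["Boston", "Austin", "Denver"],
}


def toy_site(value: str) -> str:
    # Toy store locator that only understands 5-digit zip codes.
    if value.isdigit() and len(value) == 5:
        return f"stores near {value}"
    return "invalid zip code"


guessed = guess_input_type(type_library, toy_site)
print(guessed)
```

With the wrong type, every sample value produces the same error page, so the distinctness fraction stays low and the type is rejected.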


Summary


Google’s Deep-Web Crawl

Solution based on the idea of informative templates

Automatic descriptions learned for millions of forms

Spans many domains and 50+ languages

Affects more than 1000 queries per sec

Results served from 400K+ distinct forms per day

Results served from 800K+ distinct forms per week

Results validate the utility of Deep-Web content


Future Work

Extending the coverage of crawlable forms

Dependencies between inputs, which are currently being ignored

JavaScript-based submissions, which involve complex URL generation

Surfacing is only part of the solution

POST forms cannot be indexed by surfacing

Surfacing flattens structure – cannot be exploited during ranking


Related to 3D-LBS


• Mobile application
• Accessibility
• Limited screen size, hard to fill in forms
• Recommendation
• Location-sensitive query suggestion
• Dependency of inputs
• Hong Kong Style Dim Sum Shatin


Q&A

Thanks!