Crawling the Hidden Web Sriram Raghavan Hector Garcia-Molina @ Stanford University.

Date posted: 19-Dec-2015

Crawling the Hidden Web

Sriram Raghavan

Hector Garcia-Molina

@ Stanford University

Introduction

What’s the problem? Current-day crawlers retrieve only the Publicly Indexable Web (PIW).

Why is it a problem? Large amounts of high-quality information are ‘hidden’ behind search forms. The hidden Web is estimated to be about 500 times as large as the PIW.

Introduction (cont’d)

What’s the solution?
– Design a crawler capable of extracting content from the hidden Web
– A generic operational model of a hidden Web crawler, Hidden Web Exposer (HiWE)

Why is HiWE a solution?

User Form Interaction

Challenges and Simplifications

Challenges
– Parse, process and interact with search forms
– Fill out forms for submission

Simplifications
– Application dependent
– With user assistance
– Address only content retrieval, assuming the resource discovery step is already done

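Crawler Form Interaction

Form: $F = (\{E_1, E_2, \ldots, E_n\}, S, M)$

Matching function: $\mathrm{Match}((\{E_1, \ldots, E_n\}, S, M), D) = [E_1 \leftarrow v_1, \ldots, E_n \leftarrow v_n]$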
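
The form model and matching function on this slide can be sketched in a few lines of Python. This is a minimal illustration, not HiWE's implementation: it reads S as submission information, M as meta-information and D as a task-specific database keyed by label, and the class names, field names and exact-label lookup are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Dict, Iterator, List, Optional


@dataclass
class FormElement:
    label: str                          # Label(E_i): text shown next to the control
    domain: Optional[List[str]] = None  # Dom(E_i); None stands for a free-text box


@dataclass
class Form:
    elements: List[FormElement]                               # {E_1, ..., E_n}
    submission: Dict[str, str] = field(default_factory=dict)  # S (assumed: URL, method)
    meta: Dict[str, str] = field(default_factory=dict)        # M (assumed: page info)


def match(form: Form, database: Dict[str, List[str]]) -> Iterator[Dict[str, str]]:
    """Yield value assignments [E_1 <- v_1, ..., E_n <- v_n] as label -> value dicts."""
    candidates = []
    for element in form.elements:
        if element.domain is not None:
            values = element.domain                    # finite domain: enumerate it
        else:
            values = database.get(element.label, [])   # free text: look up D by label
        candidates.append(values)
    # Every combination of one candidate value per element is one assignment.
    for combo in product(*candidates):
        yield {e.label: v for e, v in zip(form.elements, combo)}


# Hypothetical form with one text box and one select list.
form = Form(
    elements=[FormElement("Company"), FormElement("Type", ["news", "reports"])],
    submission={"action": "/search", "method": "GET"},
)
D = {"Company": ["IBM", "Intel"]}
print(len(list(match(form, D))))  # 4
```

If any element has no candidate values, `match` yields nothing, which mirrors the idea that a form is only submitted when every element can be filled.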

Performance Metrics

Coverage Metric: $\frac{N_{\text{retrieved pages}}}{N_{\text{total hidden pages}}}$

Submission Efficiency: $\frac{N_{\text{successful submissions}}}{N_{\text{total submissions}}}$

Lenient Submission Efficiency: $\frac{N_{\text{valid submissions}}}{N_{\text{total submissions}}}$
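
The three ratios above are simple to compute once the counts are known. The sketch below assumes that a "successful" submission is one whose response page contains results and a "valid" submission is one that was filled in sensibly even if nothing came back; the function names and the sample counts are illustrative only.

```python
def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, rejecting a non-positive denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator


def coverage(n_retrieved_pages: int, n_total_hidden_pages: int) -> float:
    """Fraction of the relevant hidden pages that the crawler retrieved."""
    return _ratio(n_retrieved_pages, n_total_hidden_pages)


def submission_efficiency(n_successful: int, n_total: int) -> float:
    """Fraction of form submissions counted as successful."""
    return _ratio(n_successful, n_total)


def lenient_submission_efficiency(n_valid: int, n_total: int) -> float:
    """Fraction of form submissions counted as valid."""
    return _ratio(n_valid, n_total)


# Hypothetical crawl: 200 submissions, 140 successful, 170 valid.
print(submission_efficiency(140, 200))          # 0.7
print(lenient_submission_efficiency(170, 200))  # 0.85
```

Coverage needs the total number of hidden pages, which is usually unknown in practice, so the two submission-based ratios are the ones a crawler can track directly.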

Design Issues

– Internal Form Representation
– Task-specific Database
– Matching Function
– Response Analysis

HiWE Architecture

HiWE – Form Representation

$F = (\{E_1, E_2, \ldots, E_n\}, S, M)$

Figure annotations: $\mathrm{Label}(E_2)$, $\mathrm{Dom}(E_2)$

HiWE – Sample Forms

HiWE – Task-Specific Database

Label Value-Set (LVS) Tables

Value Set
– $V = \{v_1, \ldots, v_n\}$ is a fuzzy set of element values
– $M_V(v_i)$ is a membership function that assigns a weight in [0, 1] to each member of the set
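
One way to picture an LVS table is as a mapping from a label to a fuzzy value set. The sketch below is only an illustration: the `ValueSet` class, its method names and the rule of keeping the larger weight when a value is added twice are assumptions, not details taken from the slides.

```python
from typing import Dict


class ValueSet:
    """Fuzzy set V = {v_1, ..., v_n} with membership function M_V."""

    def __init__(self) -> None:
        self._weights: Dict[str, float] = {}

    def add(self, value: str, weight: float) -> None:
        if not 0.0 <= weight <= 1.0:
            raise ValueError("weight must lie in [0, 1]")
        # Assumption: when a value is seen again, keep the larger weight.
        self._weights[value] = max(weight, self._weights.get(value, 0.0))

    def membership(self, value: str) -> float:
        """M_V(value); values outside the set have weight 0."""
        return self._weights.get(value, 0.0)


# The LVS table maps each label L to its value set V.
lvs_table: Dict[str, ValueSet] = {"State": ValueSet()}
lvs_table["State"].add("California", 1.0)
lvs_table["State"].add("Calif.", 0.6)

print(lvs_table["State"].membership("Calif."))  # 0.6
print(lvs_table["State"].membership("Mars"))    # 0.0
```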

HiWE – Populating the LVS Table

– Explicit Initialization
– Built-in Entries
– Wrapped Data Sources
– Crawling Experience

HiWE – Computing Weights

– Values from explicit initialization and built-in categories have weight 1
– Values from external data sources are assigned weights in [0, 1] by wrappers
– Values gathered by the crawler:
  – Label extracted and matched: add the new values to the matching entry
  – Label extracted but not matched: add a new entry (L, V)
  – Label cannot be extracted: find the closest entry and add the new values to it
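
The three crawler cases can be sketched as one update routine over a table of the form label -> {value: weight}. Everything specific here is an assumption for illustration: labels are matched by exact string equality, the "closest entry" is chosen by Jaccard overlap of values, and crawled values get a fixed hypothetical weight `crawl_weight` rather than whatever weighting HiWE actually uses.

```python
from typing import Dict, List, Optional

LVSTable = Dict[str, Dict[str, float]]  # label -> {value: weight}


def closest_label(table: LVSTable, values: List[str]) -> Optional[str]:
    """Return the label whose value set overlaps most with `values` (Jaccard)."""
    wanted = set(values)
    best_label, best_score = None, 0.0
    for label, value_set in table.items():
        union = wanted | set(value_set)
        score = len(wanted & set(value_set)) / len(union) if union else 0.0
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def learn_from_element(table: LVSTable, label: Optional[str],
                       values: List[str], crawl_weight: float = 0.5) -> Optional[str]:
    """Add values seen in a finite-domain form element; return the entry updated."""
    if label is None:
        # Case 3: no label could be extracted, fall back to the closest entry.
        target = closest_label(table, values)
        if target is None:
            return None
    else:
        # Case 1 (label already in the table) or case 2 (new entry is created).
        target = label
    entry = table.setdefault(target, {})
    for value in values:
        # Never lower a weight that is already in the table.
        entry[value] = max(entry.get(value, 0.0), crawl_weight)
    return target


table: LVSTable = {"State": {"California": 1.0, "Texas": 1.0}}
learn_from_element(table, "State", ["Ohio"])        # case 1
learn_from_element(table, "Country", ["India"])     # case 2
learn_from_element(table, None, ["Texas", "Utah"])  # case 3, lands in "State"
print(sorted(table["State"]))
```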

HiWE – Matching Function

– Enumerate values for finite domain elements
– Label matching
  – step 1: string normalization
  – step 2: string matching
– Evaluate value assignment
  – Fuzzy Conjunction
  – Average
  – Probabilistic
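
The two label-matching steps and the three ways of scoring a value assignment can be sketched as follows. The slides do not give the formulas, so the code uses the usual readings of the names as assumptions: fuzzy conjunction as the minimum weight, average as the mean, and probabilistic as one minus the product of the complements. Python's `difflib.SequenceMatcher` stands in for the string matcher, and the 0.8 threshold is arbitrary.

```python
import math
import re
from difflib import SequenceMatcher
from typing import List, Optional


def normalize(label: str) -> str:
    """Step 1: lowercase, drop punctuation, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", label.lower())
    return " ".join(cleaned.split())


def best_label(label: str, known_labels: List[str],
               threshold: float = 0.8) -> Optional[str]:
    """Step 2: return the most similar known label, or None if none is close."""
    target = normalize(label)
    scored = [(SequenceMatcher(None, target, normalize(k)).ratio(), k)
              for k in known_labels]
    if not scored:
        return None
    score, name = max(scored)
    return name if score >= threshold else None


def rank_fuzzy(weights: List[float]) -> float:
    """Fuzzy conjunction: an assignment is as good as its weakest value."""
    return min(weights)


def rank_average(weights: List[float]) -> float:
    """Average of the membership weights."""
    return sum(weights) / len(weights)


def rank_probabilistic(weights: List[float]) -> float:
    """Treat weights as independent probabilities that each value is useful."""
    return 1.0 - math.prod(1.0 - w for w in weights)


print(best_label("Company name:", ["Company Name", "State"]))  # Company Name
weights = [0.5, 0.5]  # M_V(v_i) for each element in one assignment
print(rank_fuzzy(weights), rank_average(weights), rank_probabilistic(weights))
```

An assignment would then be submitted only if its score clears a configured minimum; which of the three scores to use is a trade-off between caution (minimum) and coverage (probabilistic).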

Configuring HiWE

HiWE – extraction from pages

Prune the form page, keeping only the forms

Approximately lay out the pruned page using a layout engine

Use the layout engine to identify candidate labels for form elements

Rank each candidate and choose the best one

HiWE – extraction from pages (cont’d)

HiWE – Experiments

HiWE – Experiments (cont’d)

HiWE – Experiments (cont’d)

HiWE – Experiments (cont’d)

HiWE – Experiments (cont’d)

93% accuracy

Future Work

Recognize and respond to the dependencies between form elements

Support partially filling out forms

Conclusion

Propose an application-specific approach to hidden Web crawling

Implement a prototype crawler – HiWE

Set the stage for designing a variety of hidden Web crawlers