Crawling the Hidden Web
by
Michael Weinberg
mwmw@cs.huji.ac.il
Internet DB Seminar,
The Hebrew University of Jerusalem,
School of Computer Science and Engineering,
December 2001
23/12/2001 Michael Weinberg, SDBI Seminar 2

Agenda
Hidden Web - what is it all about?
Generic model for a hidden Web crawler
HiWE (Hidden Web Exposer)
LITE – Layout-based Information Extraction Technique
Results from experiments conducted to test these techniques

Web Crawlers
Automatically traverse the Web graph, building a local repository of the portion of the Web that they visit
Traditionally, crawlers have only targeted a portion of the Web called the publicly indexable Web (PIW)
PIW – the set of pages reachable purely by following hypertext links, ignoring search forms and pages that require authentication

The Hidden Web
Recent studies show that a significant fraction of Web content in fact lies outside the PIW
Large portions of the Web are ‘hidden’ behind search forms in searchable databases
HTML pages are dynamically generated in response to queries submitted via the search forms
Also referred to as the ‘Deep’ Web

The Hidden Web Growth
Hidden Web continues to grow, as organizations with large amounts of high-quality information are placing their content online, providing web-accessible search facilities over existing databases
For example:
– Census Bureau
– Patents and Trademarks Office
– News media companies
InvisibleWeb.com lists over 10000 such databases

Surface Web
[Figure: Surface Web]

Deep Web
[Figure: Deep Web]

Deep Web Content Distribution
[Figure: Deep Web content distribution]

Deep Web Stats
The Deep Web is 500 times larger than the PIW!
Contains 7,500 terabytes of information (March 2000)
More than 200,000 Deep Web sites exist
Sixty of the largest Deep Web sites collectively contain about 750 terabytes of information
95% of the Deep Web is publicly accessible (no fees)
Google indexes about 16% of the PIW, so we search about 0.03% of the pages available today
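The 0.03% figure can be checked from the two numbers on this slide. The sketch below assumes that Google's 16% coverage applies to the PIW only and that "pages available" means the PIW plus a Deep Web 500 times its size; the variable names are illustrative.

```python
# Assumption: sizes are in the same unit, with the PIW normalised to 1.
piw_size = 1.0
deep_web_size = 500.0 * piw_size   # "500 times larger than the PIW"
google_coverage_of_piw = 0.16      # "about 16% of the PIW"

indexed = google_coverage_of_piw * piw_size
searchable_fraction = indexed / (piw_size + deep_web_size)

# 0.16 / 501 is roughly 0.00032, i.e. about 0.03%
print(f"{searchable_fraction:.2%}")
```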

The Problem
Hidden Web contains large amounts of high-quality information
The information is buried on dynamically generated sites
Search engines that use traditional crawlers never find this information

The Solution
Build a hidden Web crawler
Can crawl and extract content from hidden databases
Enable indexing, analysis, and mining of hidden Web content
The content extracted by such crawlers can be used to categorize and classify the hidden databases

Challenges
Significant technical challenges in designing a hidden Web crawler
Should interact with forms that were designed primarily for human consumption
Must provide input in the form of search queries
How to equip the crawlers with input values for use in constructing search queries?
To address these challenges, we adopt the task-specific, human-assisted approach

Task-Specificity
Extract content based on the requirements of a particular application or task
For example, consider a market analyst interested in press releases, articles, etc., pertaining to the semiconductor industry and dated sometime in the last ten years

Human-Assistance
Human assistance is critical to ensure that the crawler issues queries that are relevant to the particular task
For instance, in the semiconductor example, the market analyst may provide the crawler with lists of companies or products that are of interest
The crawler will be able to gather additional potential company and product names as it processes a number of pages

Two Steps
There are two steps in achieving our goal:
– Resource discovery – identify sites and databases that are likely to be relevant to the task
– Content extraction – actually visit the identified sites to submit queries and extract the hidden pages
In this presentation we do not directly address the resource discovery problem

Hidden Web Crawlers

User form interaction
[Figure: a user, the form page and response page, the Web query front-end, and the hidden database]
(1) Download form
(2) View form
(3) Fill-out form
(4) Submit form
(5) Download response
(6) View result

Operation Model
Our model of a hidden Web crawler consists of four components:
– Internal Form Representation
– Task-specific database
– Matching function
– Response Analysis
Form Page – the page containing the search form
Response Page – the page received in response to a form submission

Generic Operational Model
[Figure: generic operational model. Components: hidden Web crawler (internal form representation, task-specific database, Match, set of value assignments, response analysis), repository, form page, response page, Web query front-end, hidden database. Labelled arrows: download form, form analysis, form submission, download response]

Internal Form Representation
Form $F = (\{E_1, E_2, \ldots, E_n\}, S, M)$: $\{E_1, E_2, \ldots, E_n\}$ is a set of $n$ form elements
S – submission information associated with the form:
– submission URL
– internal identifiers for each form element
M – meta-information about the form:
– web-site hosting the form
– set of pages pointing to this form page
– other text on the page besides the form

Task-specific Database
The crawler is equipped with a task-specific database D
Contains the necessary information to formulate queries relevant to the particular task
In the ‘market analyst’ example, D could contain lists of semiconductor company and product names
The actual format and organization of D are specific to a particular crawler implementation
HiWE uses a set of labeled fuzzy sets

Matching Function
Matching algorithm properties:
– Input: internal form representation and current contents of the database D
– Output: set of value assignments
$\mathrm{Match}((\{E_1, \ldots, E_n\}, S, M), D) = \{[E_1 \leftarrow v_1, \ldots, E_n \leftarrow v_n]\}$
– $E_i \leftarrow v_i$ associates value $v_i$ with element $E_i$

Response Analysis
Module that stores the response page in the repository
Attempts to distinguish between pages containing search results and pages containing error messages
This feedback is used to tune the matching function
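How the four components of the operational model fit together can be sketched as a single loop over one form. This is only an illustration: the function names, the callable signatures, and the stub implementations are assumptions made here, not part of the presented model.

```python
from typing import Callable, Dict, Iterable, List

Assignment = Dict[str, str]  # form element identifier -> value


def process_form(
    form: object,
    database: object,
    match: Callable[[object, object], Iterable[Assignment]],
    submit: Callable[[object, Assignment], str],
    is_result_page: Callable[[str], bool],
    repository: List[str],
) -> float:
    """Run one form through match -> submit -> response analysis.

    Returns the fraction of submissions that produced a result page,
    which is the kind of feedback that could be used to tune matching.
    """
    total = successes = 0
    for assignment in match(form, database):
        page = submit(form, assignment)   # form submission + download response
        repository.append(page)           # response analysis stores the page
        total += 1
        if is_result_page(page):          # search results vs. error message
            successes += 1
    return successes / total if total else 0.0


# Hypothetical stand-ins so the sketch runs without network access.
def demo_match(form, database):
    return [{"company": name} for name in database]


def demo_submit(form, assignment):
    return "3 results" if assignment["company"] == "IBM" else "error: no match"


repo: List[str] = []
ratio = process_form("form", ["IBM", "Acme"], demo_match, demo_submit,
                     lambda page: not page.startswith("error"), repo)
```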

Traditional Performance Metrics
Traditional crawlers' performance metrics:
– Crawling speed
– Scalability
– Page importance
– Freshness
These metrics are relevant to hidden web crawlers, but do not capture the fundamental challenges in dealing with the Hidden Web

New Performance Metrics
Coverage metric:
– ‘Relevant’ pages extracted / ‘relevant’ pages present in the targeted hidden databases
– Problem: difficult to estimate how much of the hidden content is relevant to the task

New Performance Metrics
Strict submission efficiency: $SE_{strict} = \frac{N_{success}}{N_{total}}$
– $N_{total}$: the total number of forms that the crawler submits
– $N_{success}$: number of submissions which result in a response page with one or more search results
– Problem: the crawler is penalized if the database didn’t contain any relevant search results

New Performance Metrics
Lenient submission efficiency: $SE_{lenient} = \frac{N_{valid}}{N_{total}}$
– $N_{valid}$: number of semantically correct form submissions
– Penalizes the crawler only if a form submission is semantically incorrect
– Problem: difficult to evaluate since a manual comparison is needed to decide whether the form is semantically correct
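The two submission-efficiency ratios are simple to compute once the counts are known. The sketch below uses made-up counts for illustration, and returning 0.0 when nothing was submitted is an assumption of this example.

```python
def se_strict(n_success: int, n_total: int) -> float:
    """N_success / N_total; 0.0 if the crawler submitted no forms."""
    return n_success / n_total if n_total else 0.0


def se_lenient(n_valid: int, n_total: int) -> float:
    """N_valid / N_total; 0.0 if the crawler submitted no forms."""
    return n_valid / n_total if n_total else 0.0


# Hypothetical crawl: 100 submissions, 60 returned results,
# 85 were judged semantically correct by a human reviewer.
strict = se_strict(60, 100)
lenient = se_lenient(85, 100)
```

Every submission that returns results is presumably also semantically correct, so the lenient value should not fall below the strict one.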

Design Issues
What information about each form element $E_i$ should the crawler collect?
What meta-information is likely to be useful?
How should the task-specific database be organized, updated and accessed?
What Match function is likely to maximize submission efficiency?
How to use the response analysis module to tune the Match function?

HiWE: Hidden Web Exposer

Basic Idea
Extract descriptive information (label) for each element of a form
Task-specific database is organized in terms of categories, each of which is also associated with labels
Matching function attempts to match form labels to database categories to compute a set of candidate value assignments

HiWE Architecture
[Figure: HiWE architecture. Components: URL List (URL 1 ... URL N), Crawl Manager, Parser, Form Analyzer, Form Processor, Response Analyzer, LVS Manager, LVS Table (Label 1 / Value-Set 1 ... Label n / Value-Set n), custom data sources, WWW. Labelled arrows: form submission, response, feedback]

HiWE’s Main Modules
URL List:
– contains all the URLs the crawler has discovered so far
Crawl Manager:
– controls the entire crawling process
Parser:
– extracts hypertext links from the crawled pages and adds them to the URL list
Form Analyzer, Form Processor, Response Analyzer:
– together implement the form processing and submission operations

HiWE’s Main Modules
LVS Manager:
– Manages additions and accesses to the LVS table
LVS table:
– HiWE’s implementation of the task-specific database

HiWE’s Form Representation
Form $F = (\{E_1, E_2, \ldots, E_n\}, S, \emptyset)$
– The third component of F is an empty set since the current implementation of HiWE does not collect any meta-information about the form
For each element $E_i$, HiWE collects a domain $\mathrm{Dom}(E_i)$ and a label $\mathrm{label}(E_i)$

HiWE’s Form Representation
Domain of an element:
– Set of values which can be associated with the corresponding form element
– May be a finite set (e.g., domain of a selection list)
– May be an infinite set (e.g., domain of a text box)
Label of an element:
– The descriptive information associated with the element, if any
– Most forms include some descriptive text to help users understand the semantics of the element

Form Representation - Figure
Element E1: Label(E1) = "Document Type", Dom(E1) = {Articles, Press Releases, Reports}
Element E2: Label(E2) = "Company Name", Dom(E2) = {s | s is a text string}
Element E3: Label(E3) = "Sector", Dom(E3) = {Entertainment, Automobile, Information Technology, Construction}
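The three-element form in the figure could be held in memory roughly as follows. This is a minimal sketch: the class names, the use of `None` to stand for an infinite domain, and the submission URL are assumptions made for illustration, not HiWE's actual data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional


@dataclass
class FormElement:
    label: Optional[str]               # label(E); None if no label was extracted
    domain: Optional[FrozenSet[str]]   # Dom(E); None stands for an infinite domain

    @property
    def is_finite(self) -> bool:
        return self.domain is not None


@dataclass
class Form:
    elements: List[FormElement]                          # {E_1, ..., E_n}
    submission_url: str                                  # part of S
    meta: Dict[str, str] = field(default_factory=dict)   # M; left empty, as in HiWE


form = Form(
    elements=[
        FormElement("Document Type",
                    frozenset({"Articles", "Press Releases", "Reports"})),
        FormElement("Company Name", None),  # free-text box
        FormElement("Sector",
                    frozenset({"Entertainment", "Automobile",
                               "Information Technology", "Construction"})),
    ],
    submission_url="http://example.com/search",  # hypothetical URL
)
```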

HiWE’s Task-specific Database
Task-specific information is organized in terms of a finite set of concepts or categories
Each concept has one or more labels and an associated set of values
For example, the label ‘Company Name’ could be associated with the set of values {‘IBM’, ‘Microsoft’, ‘HP’,…}

HiWE’s Task-specific Database
The concepts are organized in a table called the Label Value Set (LVS) table
Each entry in the LVS table is of the form (L, V):
– L: label
– $V = \{v_1, \ldots, v_n\}$: fuzzy set of values
– Fuzzy set V has an associated membership function $M_V$ that assigns weights, in the range [0,1], to each member of the set
– $M_V(v_i)$ is a measure of the crawler’s confidence that the assignment of $v_i$ to E is semantically meaningful
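One simple way to picture the LVS table is a mapping from a label to a fuzzy set, with the fuzzy set itself a mapping from each value to its weight. The sketch below assumes one label per entry for brevity (the slides allow several), and all weights are made up.

```python
from typing import Dict

FuzzySet = Dict[str, float]      # value -> membership weight M_V(value) in [0, 1]
LVSTable = Dict[str, FuzzySet]   # label L -> fuzzy value set V

lvs: LVSTable = {
    "Company Name": {"IBM": 1.0, "Microsoft": 1.0, "HP": 0.8},
    "State": {"California": 0.9},
}


def membership(table: LVSTable, label: str, value: str) -> float:
    """M_V(value) for the entry with this label; 0.0 if either is unknown."""
    return table.get(label, {}).get(value, 0.0)
```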

HiWE’s Matching Function
For elements with a finite domain:
– The set of possible values is fixed and can be exhaustively enumerated
Element E1: Label(E1) = "Document Type", Dom(E1) = {Articles, Press Releases, Reports}
– In this example, the crawler can first retrieve all relevant articles, then all relevant press releases and finally all relevant reports

HiWE’s Matching Function
For elements with an infinite domain:
– HiWE textually matches the labels of these elements with labels in the LVS table
– For example, if a textbox element has the label “Enter State” which best matches an LVS entry with the label “State”, the values associated with that LVS entry (e.g., “California”) can be used to fill the textbox
– How do we match Form labels with LVS labels?

Label Matching
Two steps in matching Form labels with LVS labels:
– 1. Normalization: includes conversion to a common case and standard style
– 2. Use of an approximate string matching algorithm to compute minimum edit distances
– HiWE employs the string matching algorithm of D. Lopresti and A. Tomkins, which takes word reordering into account

Label Matching
Let LabelMatch($E_i$) denote the LVS entry with the minimum distance to label($E_i$)
Threshold $\sigma$: if all LVS entries are more than $\sigma$ edit operations away from label($E_i$), LabelMatch($E_i$) = nil
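The two-step label matching with a threshold can be sketched as below. Note the substitution: this example uses the classic character-level Levenshtein distance, not the Lopresti and Tomkins block edit model that HiWE employs, so it does not account for word reordering. The function names and the `sigma` parameter name are assumptions of this sketch.

```python
import re
from typing import Iterable, Optional


def normalize(label: str) -> str:
    """Step 1: lower-case, replace punctuation with spaces, collapse whitespace."""
    label = re.sub(r"[^a-z0-9 ]+", " ", label.lower())
    return " ".join(label.split())


def edit_distance(a: str, b: str) -> int:
    """Step 2 (stand-in): Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def label_match(form_label: str, lvs_labels: Iterable[str],
                sigma: int) -> Optional[str]:
    """Closest LVS label, or None (nil) if every label is more than sigma edits away."""
    target = normalize(form_label)
    best, best_dist = None, None
    for candidate in lvs_labels:
        dist = edit_distance(target, normalize(candidate))
        if best_dist is None or dist < best_dist:
            best, best_dist = candidate, dist
    if best_dist is None or best_dist > sigma:
        return None
    return best


match_found = label_match("Enter State:", ["Company Name", "State"], sigma=6)
match_none = label_match("Enter State:", ["Company Name", "State"], sigma=2)
```

With plain Levenshtein distance, "enter state" is six deletions away from "state", so the match succeeds only when the threshold is at least 6.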

Label Matching
For each element $E_i$, compute ($V_i$, $M_{V_i}$):
– If $E_i$ has an infinite domain and (L, V) is the closest matching LVS entry, then $V_i = V$ and $M_{V_i} = M_V$
– If $E_i$ has a finite domain, then $V_i = \mathrm{Dom}(E_i)$ and $M_{V_i}(x) = 1, \forall x \in V_i$
The set of value assignments is computed as the product of all the $V_i$'s:
$\mathrm{Match}(F, LVS) = \{[E_1 \leftarrow v_1, \ldots, E_n \leftarrow v_n] : v_i \in V_i, i = 1 \ldots n\}$
Too many assignments?
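The product of value sets is easy to enumerate, and doing so shows why the number of assignments grows quickly. In this sketch the element names and value sets are invented for illustration.

```python
from itertools import product
from typing import Dict, Iterator, List


def match(value_sets: Dict[str, List[str]]) -> Iterator[Dict[str, str]]:
    """Yield every assignment [E_1 <- v_1, ..., E_n <- v_n] with v_i in V_i."""
    names = list(value_sets)
    for combo in product(*(value_sets[name] for name in names)):
        yield dict(zip(names, combo))


value_sets = {
    "Document Type": ["Articles", "Press Releases", "Reports"],  # finite domain
    "Company Name": ["IBM", "Microsoft"],                        # taken from an LVS entry
}
assignments = list(match(value_sets))
```

The count is the product of the set sizes (3 x 2 = 6 here), which is why a ranking step is useful before submitting.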

Ranking Value Assignments
HiWE employs an aggregation function to compute a rank for each value assignment
Uses a configurable parameter, a minimum acceptable value assignment rank ($\rho_{min}$)
The intent is to improve submission efficiency by only using ‘high-quality’ value assignments
We will show three possible aggregation functions

Fuzzy Conjunction
The rank of a value assignment is the minimum of the weights of all the constituent values
$\rho_{fuz}([E_1 \leftarrow v_1, \ldots, E_n \leftarrow v_n]) = \min_{i=1 \ldots n} M_{V_i}(v_i)$
Very conservative in assigning ranks. Assigns a high rank only if each individual weight is high

Average
The rank of a value assignment is the average of the weights of the constituent values
$\rho_{avg}([E_1 \leftarrow v_1, \ldots, E_n \leftarrow v_n]) = \frac{1}{n} \sum_{i=1 \ldots n} M_{V_i}(v_i)$
Less conservative than fuzzy conjunction

Probabilistic
This ranking function treats weights as probabilities
$M_{V_i}(v_i)$ is the likelihood that the choice of $v_i$ is useful and $1 - M_{V_i}(v_i)$ is the likelihood that it is not
The likelihood of a value assignment being useful is:
$\rho_{prob}([E_1 \leftarrow v_1, \ldots, E_n \leftarrow v_n]) = 1 - \prod_{i=1 \ldots n} (1 - M_{V_i}(v_i))$
Assigns low rank if all the individual weights are very low
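The three aggregation functions can be compared side by side on the same list of weights. The sketch below takes the weights of the constituent values as input; the function names, the sample weights, and the threshold value are illustrative assumptions.

```python
from math import prod
from typing import Sequence


def rank_fuzzy(weights: Sequence[float]) -> float:
    """Fuzzy conjunction: the minimum weight."""
    return min(weights)


def rank_average(weights: Sequence[float]) -> float:
    """Average of the weights."""
    return sum(weights) / len(weights)


def rank_probabilistic(weights: Sequence[float]) -> float:
    """1 - product of (1 - weight): likelihood that at least one choice is useful."""
    return 1.0 - prod(1.0 - w for w in weights)


weights = [0.9, 0.5, 0.2]   # hypothetical M_{V_i}(v_i) for a 3-element form
min_rank = 0.5              # hypothetical minimum acceptable rank

ranks = {
    "fuzzy": rank_fuzzy(weights),
    "average": rank_average(weights),
    "probabilistic": rank_probabilistic(weights),
}
accepted = {name: rank >= min_rank for name, rank in ranks.items()}
```

For these weights the fuzzy rank is 0.2, the average is about 0.53, and the probabilistic rank is 0.96, so only fuzzy conjunction rejects the assignment at this threshold.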

Populating the LVS Table
HiWE supports a variety of mechanisms for adding entries to the LVS table:
– Explicit Initialization
– Built-in entries
– Wrapped data sources
– Crawling experience

Explicit Initialization
Supply labels and associated value sets at startup time
Useful to equip the crawler with labels that the crawler is most likely to encounter
In the ‘semiconductor’ example, we supply HiWE with a list of relevant company names and associate the list with the labels ‘Company’ and ‘Company Name’

Built-in Entries
HiWE has built-in entries for commonly used concepts:
– Dates and Times
– Names of months
– Days of week

Wrapped Data Sources
LVS Manager can query data sources through a well-defined interface
The data source must be ‘wrapped’ by a program that supports two kinds of queries:
– Given a set of labels, return a value set
– Given a set of values, return other values that belong to the same value set
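The two kinds of queries suggest a small interface that every wrapper would implement. The sketch below is hypothetical: the method names, the in-memory source, and the single fixed weight are assumptions for illustration and do not describe HiWE's actual wrapper interface.

```python
from typing import Dict, Protocol, Set


class WrappedDataSource(Protocol):
    def values_for_labels(self, labels: Set[str]) -> Dict[str, float]:
        """Given a set of labels, return a (weighted) value set."""
        ...

    def related_values(self, values: Set[str]) -> Dict[str, float]:
        """Given a set of values, return other values from the same value set."""
        ...


class InMemorySource:
    """Toy wrapper around a dict; every value gets the same fixed weight."""

    def __init__(self, table: Dict[str, Set[str]], weight: float = 0.7) -> None:
        self._table = table
        self._weight = weight

    def values_for_labels(self, labels: Set[str]) -> Dict[str, float]:
        found: Set[str] = set()
        for label in labels:
            found |= self._table.get(label, set())
        return {value: self._weight for value in found}

    def related_values(self, values: Set[str]) -> Dict[str, float]:
        found: Set[str] = set()
        for members in self._table.values():
            if members & values:            # this value set contains a given value
                found |= members - values   # return only the other members
        return {value: self._weight for value in found}


source: WrappedDataSource = InMemorySource({"Company": {"IBM", "Intel", "AMD"}})
by_label = source.values_for_labels({"Company"})
siblings = source.related_values({"IBM"})
```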

Crawling Experience
Finite domain form elements are a useful source of labels and associated value sets
HiWE adds this information to the LVS table
Effective when a similar label is associated with a finite domain element in one form and with an infinite domain element in another

Computing Weights
A new value added to the LVS must be assigned a suitable weight
Explicit initialization and built-in values have fixed weights
Values obtained from external data sources or through the crawler’s own activity are assigned weights that vary with time

Initial Weights
For external data sources - computed by the respective wrappers
For values directly gathered by the crawler:
– Finite domain element E with Dom(E)
– $M_{\mathrm{Dom}(E)}(x) = 1$ iff $x \in \mathrm{Dom}(E)$
– Three cases arise when incorporating Dom(E) into the LVS table

Updating LVS – Case 1
Crawler successfully extracts label(E) and computes LabelMatch(E) = (L, V):
– Replace the (L, V) entry by the entry $(L, V \cup \mathrm{Dom}(E))$
– $M_{V \cup \mathrm{Dom}(E)}(x) = \max(M_V(x), M_{\mathrm{Dom}(E)}(x))$
– Intuitively, Dom(E) provides new elements to the value set and ‘boosts’ the weights of existing elements

Updating LVS – Case 2
Crawler successfully extracts label(E) but LabelMatch(E) = nil:
– A new entry (label(E), Dom(E)) is created in the LVS

Updating LVS – Case 3
Crawler cannot extract label(E):
– For each entry (L, V), compute a score: $\frac{\sum_{x \in \mathrm{Dom}(E)} M_V(x)}{|\mathrm{Dom}(E)|}$
– Identify the entry $(L_{max}, V_{max})$ with the maximum score
– Identify the value $s_{max}$ of the maximum score
– Replace entry $(L_{max}, V_{max})$ with new entry $(L_{max}, V_{max} \cup \mathrm{Dom}(E))$
– Confidence of new values: $M(x) = s_{max} \cdot M_{\mathrm{Dom}(E)}(x)$
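The three update cases can be put into one function to make the differences concrete. This is a sketch under assumptions: the LVS table is a dict of dicts, the caller passes in the extracted label and the result of label matching, and in Case 3 values already present in the chosen entry keep their existing weights (the slides only specify the confidence of new values).

```python
from typing import Dict, Optional, Set

FuzzySet = Dict[str, float]


def update_lvs(lvs: Dict[str, FuzzySet], domain: Set[str],
               label: Optional[str], matched: Optional[str]) -> None:
    """Fold Dom(E) of a finite-domain element into the LVS table in place.

    label   -- label(E), or None if it could not be extracted
    matched -- label of the entry returned by LabelMatch(E), or None for nil
    """
    m_dom = 1.0  # M_Dom(E)(x) = 1 for every x in Dom(E)
    if label is not None and matched is not None:
        # Case 1: union with the matched entry, taking the max weight
        entry = lvs[matched]
        for x in domain:
            entry[x] = max(entry.get(x, 0.0), m_dom)
    elif label is not None:
        # Case 2: no close entry, so create a new one
        lvs[label] = {x: m_dom for x in domain}
    elif lvs and domain:
        # Case 3: no label; pick the entry that overlaps Dom(E) the most
        scores = {
            entry_label: sum(values.get(x, 0.0) for x in domain) / len(domain)
            for entry_label, values in lvs.items()
        }
        l_max = max(scores, key=scores.get)
        s_max = scores[l_max]
        for x in domain:
            if x not in lvs[l_max]:
                lvs[l_max][x] = s_max * m_dom  # lower confidence for new values


lvs = {"Company": {"IBM": 1.0, "Intel": 0.8}, "State": {"California": 1.0}}
update_lvs(lvs, {"IBM", "Intel", "AMD"}, label=None, matched=None)          # Case 3
update_lvs(lvs, {"Intel", "TI"}, label="Company Name", matched="Company")   # Case 1
update_lvs(lvs, {"Automobile"}, label="Sector", matched=None)               # Case 2
```

In the Case 3 call, the "Company" entry scores (1.0 + 0.8 + 0) / 3 = 0.6, so "AMD" is added with weight 0.6.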

Configuring HiWE
Initialization of the crawling activity includes:
– Set of sites to crawl
– Explicit initialization for the LVS table
– Set of data sources
– Label matching threshold ($\sigma$)
– Minimum acceptable value assignment rank
– Value assignment aggregation function
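The six configuration items map naturally onto a single settings object. The sketch below is hypothetical: the class name, field names, and default values are all assumptions made for illustration, not HiWE's actual configuration format.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence


@dataclass
class CrawlerConfig:
    sites: List[str]                                      # set of sites to crawl
    initial_lvs: Dict[str, Dict[str, float]] = field(default_factory=dict)
    data_sources: List[object] = field(default_factory=list)
    label_match_threshold: int = 3                        # hypothetical default
    min_rank: float = 0.5                                 # hypothetical default
    aggregate: Callable[[Sequence[float]], float] = min   # fuzzy conjunction


config = CrawlerConfig(
    sites=["http://example.com/search"],  # hypothetical site
    initial_lvs={"Company": {"IBM": 1.0}},
)
keep = config.aggregate([0.9, 0.6]) >= config.min_rank
```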

Introducing LITE
Layout-based Information Extraction Technique
Physical layout of a page is also used to aid in extraction
For example, a piece of text that is physically adjacent to a form element is very likely a description of that element
Unfortunately, this semantic association is not always reflected in the underlying HTML of the Web page

Layout-based Information Extraction Technique

The Challenge
Accurate extraction of the labels and domains of form elements
Elements that are visually close on the screen may be separated arbitrarily in the actual HTML text
Even when HTML provides a facility for semantic relationships, it’s not used in a majority of pages
Accurate page layout is a complex process
Even a crude approximate layout of portions of a page can yield very useful semantic information

Form Analysis in HiWE
LITE-based heuristic:
– Prune the form page and isolate elements which directly influence the layout
– Approximately lay out the pruned page using a custom layout engine
– Identify the pieces of text that are physically closest to the form element (these are candidates)
– Rank each candidate using a variety of measures
– Choose the highest ranked candidate as the label
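The last three steps of the heuristic can be illustrated with bounding boxes. This sketch assumes a layout engine has already produced a box for each text piece and form element, and it ranks candidates by a single measure (distance between box centres) rather than the variety of measures the heuristic uses. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Optional, Tuple


@dataclass
class Box:
    text: str      # text content ("" for a form element)
    x: float       # left edge, in layout units
    y: float       # top edge
    width: float
    height: float


def centre(box: Box) -> Tuple[float, float]:
    return (box.x + box.width / 2, box.y + box.height / 2)


def pick_label(element: Box, texts: List[Box]) -> Optional[str]:
    """Return the text piece whose centre is closest to the element's centre."""
    if not texts:
        return None
    ex, ey = centre(element)

    def distance(candidate: Box) -> float:
        cx, cy = centre(candidate)
        return hypot(cx - ex, cy - ey)

    return min(texts, key=distance).text


textbox = Box("", x=100, y=10, width=80, height=20)
candidates = [
    Box("Company Name", x=0, y=10, width=90, height=20),   # directly to the left
    Box("Search tips", x=0, y=200, width=60, height=20),   # far below
]
label = pick_label(textbox, candidates)
```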

Pruning Before Partial Layout

LITE - Figure
[Figure: LITE pipeline. Stages and artifacts: DOM parser, DOM representation, prune, pruned page, partial layout, DOM API, list of elements, submission info, labels and domain values, internal form representation]
Key idea in LITE: physical page layout embeds significant semantic information

Experiments
A number of experiments were conducted to study the performance of HiWE
We will see how performance depends on:
– Minimum form size
– Crawler input to LVS table
– Different ranking functions

Parameter Values for Task 1
Task 1:
News articles, reports, press releases and white papers relating to the semiconductor industry, dated sometime in the last ten years

Variation of Performance with Minimum Form Size

Effect of Crawler input to LVS

Different Ranking Functions
When using $\rho_{fuz}$ and $\rho_{avg}$ the crawler’s submission efficiency is mostly above 80%
$\rho_{prob}$ performs poorly
$\rho_{avg}$ submits more forms than $\rho_{fuz}$ (less conservative)

Label Extraction
LITE-based heuristic achieved overall accuracy of 93%
The test set was manually analyzed

Conclusion
Addressed the problem of extending current-day crawlers to build repositories that include pages from the ‘Hidden Web’
Presented a simple operational model of a hidden web crawler
Described the implementation of a prototype crawler – HiWE
Introduced a technique for layout-based information extraction

Bibliography
Crawling the Hidden Web, by S. Raghavan and H. Garcia-Molina, Stanford University, 2001
BrightPlanet.com white papers
D. Lopresti and A. Tomkins. Block edit models for approximate string matching