WebIQ: Learning from the Web to Match Deep-Web Query Interfaces
Wensheng Wu, Database & Information Systems Group
University of Illinois, Urbana
Joint work with AnHai Doan & Clement Yu
ICDE, April 2006
Search Problems on the Deep Web
united.com airtravel.com delta.com
Find round-trip flights from Chicago to New York under $500
Solution: Build Data Integration Systems
Find round-trip flights from Chicago to New York under $500
united.com airtravel.com delta.com
Global query interface
comparison shopping systems “on steroids”
Current State of Affairs
Very active in both the research community & industry
Research
– multidisciplinary efforts: Database, Web, KDD & AI
– 10+ research groups in US, Asia & Europe
– focuses: source discovery, schema matching & integration, query processing, data extraction
Industry
– Transformic, Glenbrook Networks, WebScalers, PriceGrabber, Shopping.com, MySimon, Google, …
Key Task: Schema Matching
1-1 match
Complex match
Schema Matching is Ubiquitous!
Fundamental problem in numerous applications
– data integration
– data warehousing
– peer data management
– ontology merging
– view integration
– personal information management
Schema matching across Web sources
– 30+ papers in the past few years
– Washington [AAAI-03, ICDE-05], Illinois [SIGMOD-03, SIGMOD-04, ICDE-06], MSR [VLDB-04], Binghamton [VLDB-03], HKUST [VLDB-04], Utah [WebDB-05], …
Schema Matching is Still Very Difficult
Must rely on properties of attributes, e.g., label & instances
Often there is little in common between matching attributes
Many attributes do not even have instances!
1-1 match
Complex match
Matching Performance Greatly Hampered by Pervasive Lack of Attribute Instances
28.1% ~ 74.6% of attributes with no instances
Extremely challenging to match these attributes– e.g., does departure city match from city or departure date?
Also difficult to match attributes with dissimilar instances
– e.g., airline (with American airlines as instances) vs. carrier (with European ones)
Our Solution: Exploit the Web
Discover instances from the Web
– e.g., Chicago, New York, etc. for departure city & from city
Borrow instances from other attributes & validate via the Web
– e.g., check if Air Canada is an instance of carrier using the Web
Key Idea: Question-Answering from AI
Search the Web via search engines, e.g., Google
… but search engines do not understand natural-language questions
Idea: form extraction queries as sentences to be completed
“Trick” the search engine into completing the sentences with instances
Example extraction query: “departure cities such as” (attribute label: departure city)
Extraction Patterns:
– Ls such as NP1, …, NPn
– such Ls as NP1, …, NPn
– NP1, …, NPn, and other Ls
– Ls including NP1, …, NPn
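The pattern list above can be instantiated mechanically from an attribute label. A minimal Python sketch, assuming a naive pluralization rule (the talk does not specify how labels are pluralized):

```python
# Sketch: build WebIQ-style extraction queries from an attribute label.
# The four patterns follow the slide; the pluralization rule below is
# a simplifying assumption for illustration only.

def pluralize(label: str) -> str:
    # naive English pluralization (assumption, not part of WebIQ)
    if label.endswith("y"):
        return label[:-1] + "ies"
    return label + "s"

def extraction_queries(label: str) -> list[str]:
    ls = pluralize(label)
    return [
        f'"{ls} such as"',    # Ls such as NP1, ..., NPn
        f'"such {ls} as"',    # such Ls as NP1, ..., NPn
        f'"and other {ls}"',  # NP1, ..., NPn, and other Ls
        f'"{ls} including"',  # Ls including NP1, ..., NPn
    ]

print(extraction_queries("departure city")[0])  # "departure cities such as"
```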
Key Idea: Question-Answering from AI
Search Google & obtain snippets:
Extract instance candidates from snippets:
other departure cities such as Boston, Chicago and LAX available …
Boston, Chicago, LAX
extraction query completion
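Turning a snippet into candidates requires chunking the pattern's completion. A rough sketch, assuming runs of capitalized words approximate noun phrases (the talk does not detail its actual NP extractor):

```python
import re

# Sketch: extract instance candidates from a search-engine snippet.
# Assumption: noun phrases are approximated by maximal runs of
# capitalized words; WebIQ's real NP chunking is not specified here.

def extract_candidates(snippet: str, label_plural: str) -> list[str]:
    m = re.search(re.escape(label_plural) + r" such as (.+)", snippet, re.I)
    if not m:
        return []
    # runs of capitalized tokens, so "New York" stays one candidate
    return re.findall(r"[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*", m.group(1))

snippet = "other departure cities such as Boston, Chicago and LAX available ..."
print(extract_candidates(snippet, "departure cities"))  # ['Boston', 'Chicago', 'LAX']
```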
But Not Every Candidate is a True Instance
Reason 1: Extraction queries may not be perfect
Reason 2: Web content is inherently noisy
Example:
– attribute: city
– extraction query: “and other cities”
– extracted candidate: 150
need to perform instance verification
Instance Verification: Outlier Detection
Goal: Remove statistical outliers (among candidates)
Step 1: Pre-processing
– recognize types of instances via pattern matching & the 80% rule
– types: numeric & string
– discard all candidates not of the determined type
– e.g., most instance candidates for city are strings, so remove 150
Step 2: Type-specific detection
– perform discordance tests
– test statistics, e.g.,
– # of words: abnormal if more than 5 words in person name
– % of numeric characters: US zip code contains only digits
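The two steps above can be sketched as follows. The 0.8 majority cut-off and the 5-word limit mirror the slide's examples; the numeric-type check is a simplifying assumption:

```python
# Sketch of outlier-based instance verification.
# Step 1: type recognition via the 80% rule (numeric vs. string).
# Step 2: a word-count discordance test (> 5 words is abnormal,
# following the person-name example). The numeric check below is
# our own simplification.

def is_numeric(s: str) -> bool:
    return s.replace(".", "", 1).replace(",", "").isdigit()

def verify_by_outliers(candidates: list[str], max_words: int = 5) -> list[str]:
    numeric = [c for c in candidates if is_numeric(c)]
    strings = [c for c in candidates if not is_numeric(c)]
    # Step 1: keep only the dominant type, if one covers >= 80%
    if len(numeric) >= 0.8 * len(candidates):
        kept = numeric
    elif len(strings) >= 0.8 * len(candidates):
        kept = strings
    else:
        kept = candidates  # no type determined; keep all
    # Step 2: discordance test on the number of words
    return [c for c in kept if len(c.split()) <= max_words]

cities = ["Chicago", "New York", "Boston", "150", "Los Angeles"]
print(verify_by_outliers(cities))  # ['Chicago', 'New York', 'Boston', 'Los Angeles']
```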
Instance Verification: Web Validation
Goal: Further semantic-level validation
Idea: Exploit co-occurrence statistics of label & instances
– “Make: Honda; Model: Accord”
– “a variety of makes such as Honda, Mitsubishi”
Form validation queries using validation patterns
– e.g., “make Honda”, “makes such as Honda”
Validation Patterns (V + x), where V is the validation phrase:
– L x
– Ls such as x
– such Ls as x
– x and other Ls
– Ls including x
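The V + x phrases can be generated directly from the pattern list; a sketch, again under a naive pluralization assumption:

```python
# Sketch: form validation queries (V + x) for a label L and candidate x.
# Patterns follow the slide; pluralization is a naive assumption.

def pluralize(label: str) -> str:
    return label[:-1] + "ies" if label.endswith("y") else label + "s"

def validation_queries(label: str, x: str) -> list[str]:
    ls = pluralize(label)
    return [
        f'"{label} {x}"',         # L x
        f'"{ls} such as {x}"',    # Ls such as x
        f'"such {ls} as {x}"',    # such Ls as x
        f'"{x} and other {ls}"',  # x and other Ls
        f'"{ls} including {x}"',  # Ls including x
    ]

print(validation_queries("make", "Honda")[:2])  # ['"make Honda"', '"makes such as Honda"']
```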
Instance Verification: Web Validation (cont.)
Possible measure: NumHits(V+x)
– e.g., NumHits(“cities such as Los Angeles”) = 26M
Potential problem: bias towards popular instances
Use PMI(V, x), point-wise mutual information:
PMI(V, x) = NumHits(V+x) / (NumHits(V) * NumHits(x))
Example:
– V = “cities such as”, candidates: California, Los Angeles
– NumHits(V, California) = 29
– PMI(V, Los Angeles) = 3000 * PMI(V, California)
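A minimal sketch of PMI-based scoring. Here `num_hits` is a stub standing in for search-engine hit counts (the real system queries a search engine), and the counts are made-up illustrative numbers, not real data:

```python
# Sketch of PMI-based validation: PMI(V, x) = NumHits(V+x) /
# (NumHits(V) * NumHits(x)). The HITS table holds fabricated
# example counts for illustration only.

HITS = {
    "cities such as Los Angeles": 26_000_000,
    "cities such as California": 29,
    "cities such as": 50_000_000,
    "Los Angeles": 200_000_000,
    "California": 900_000_000,
}

def num_hits(phrase: str) -> int:
    return HITS.get(phrase, 0)

def pmi(v: str, x: str) -> float:
    denom = num_hits(v) * num_hits(x)
    return num_hits(f"{v} {x}") / denom if denom else 0.0

# A true instance scores far higher than a popular non-instance:
print(pmi("cities such as", "Los Angeles") > pmi("cities such as", "California"))  # True
```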
Validate Instances from Other Attributes
Method 1: Discover k more instances from the Web
– then check for the borrowed one (Aer Lingus for Airline)
– problem: very likely Aer Lingus is not among the discovered instances
Method 2: Compare validation score with those of known instances
– problem: score for Aer Lingus may be much lower; how to decide?
Key observation: compare also to scores of non-instances
– e.g., Economy (with respect to Airline)
Train Validation-Based Instance Classifier
Naïve Bayes classifier with validation-based features
V1: “Airlines such as”   V2: “Airline”

Validation scores:
Example     | M1 | M2  | +/-
Air Canada  | .5 | .3  | +
American    | .8 | .1  | +
Economy     | .4 | .03 | -
First Class | .2 | .05 | -
Delta       | .6 | .3  | +
United      | .9 | .4  | +
Jan         | .1 | .06 | -
1           | .3 | .09 | -

Thresholds: t1 = .45, t2 = .075

Thresholded features (fi = 1 iff Mi ≥ ti):
Example | f1 | f2 | +/-
Delta   | 1  | 1  | +
United  | 1  | 1  | +
Jan     | 0  | 0  | -
1       | 0  | 1  | -

P(C|X) ~ P(C) P(X|C)
P(+) = P(-) = ½
P(f1=1|+) = 3/4, P(f1=1|-) = 1/4
…
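The classifier on this slide can be sketched end-to-end. The thresholds and example scores mirror the slide; Laplace smoothing is our addition to avoid zero probabilities:

```python
# Sketch of the validation-based Naive Bayes instance classifier.
# Features are thresholded validation scores (f_i = 1 if M_i >= t_i).
# Thresholds t1 = .45, t2 = .075 follow the slide; Laplace smoothing
# is our own addition.

T = [0.45, 0.075]  # thresholds t1, t2

def featurize(scores):
    return tuple(1 if m >= t else 0 for m, t in zip(scores, T))

def train(examples):
    # examples: list of (scores, label) pairs with label '+' or '-'
    rows = {"+": [], "-": []}
    for scores, y in examples:
        rows[y].append(featurize(scores))
    model = {}
    for y, fs in rows.items():
        prior = len(fs) / len(examples)
        # P(f_i = 1 | y) with Laplace smoothing
        p1 = [(sum(f[i] for f in fs) + 1) / (len(fs) + 2) for i in range(len(T))]
        model[y] = (prior, p1)
    return model

def classify(model, scores):
    f = featurize(scores)
    def posterior(y):
        prior, p1 = model[y]
        p = prior
        for i, fi in enumerate(f):
            p *= p1[i] if fi else 1 - p1[i]
        return p
    return max("+-", key=posterior)

train_data = [
    ([0.5, 0.3], "+"), ([0.8, 0.1], "+"),    # Air Canada, American
    ([0.4, 0.03], "-"), ([0.2, 0.05], "-"),  # Economy, First Class
]
model = train(train_data)
print(classify(model, [0.6, 0.3]))  # + (Delta)
```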
Validate Instances via the Deep Web
Handle attributes that are difficult to validate via the Web, e.g., from
Disadvantage: ambiguity when no results are found
Architecture of Assisted Matching System
[diagram: Source interfaces → Instance acquisition → Source interfaces with augmented instances → Interface matcher → Attribute matches]
Empirical Evaluation
Five domains:
Experiments:
– Baseline: IceQ [Wu et al., SIGMOD-04]
– Web assistance
Performance metrics:
– precision (P), recall (R), & F1 (= 2PR/(P+R))
Domain      | # schemas | # attributes per schema | % attributes with no instances | Avg. schema depth
Airfare     | 20        | 10.7                    | 28.1                           | 3.6
Automobile  | 20        | 5.1                     | 38.6                           | 2.4
Book        | 20        | 5.4                     | 74.6                           | 2.3
Job         | 20        | 4.6                     | 30.0                           | 2.1
Real Estate | 20        | 6.5                     | 32.2                           | 2.7
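The F1 metric used in the evaluation is just the harmonic mean of precision and recall:

```python
# F1 as defined on the evaluation slide: F1 = 2PR / (P + R).
def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r) if (p + r) else 0.0

print(f1(0.5, 1.0))  # 0.6666666666666666
```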
Matching Accuracy
Web assistance boosts accuracy (F1) from 89.5 to 97.5
[bar chart: F1 (%) by domain — Airfare, Automobile, Book, Job, Real Estate; series: Baseline, Baseline + WebIQ, Baseline + WebIQ + Threshold; y-axis 80–100]
Overhead Analysis
Reasonable overhead: 6~11 minutes across domains
[bar chart: overhead (min) by domain — Airfare, Auto, Book, Job, RE; series: Baseline, Surface, Attr-Surface, Attr-Deep; y-axis 0–7]
Conclusion
Search problems on the Deep Web are increasingly crucial!
Novel QA-based approach to learning attribute instances
Incorporation into a state-of-the-art matching system
Extensive evaluation over varied real-world domains
More details: search for Wensheng Wu on Google