WebIQ: Learning from the Web to Match Deep-Web Query Interfaces
Wensheng Wu, Database & Information Systems Group
University of Illinois, Urbana
Joint work with AnHai Doan & Clement Yu
ICDE, April 2006
Search Problems on the Deep Web
united.com airtravel.com delta.com
Find round-trip flights from Chicago to New York under $500
Solution: Build Data Integration Systems
Find round-trip flights from Chicago to New York under $500
united.com airtravel.com delta.com
Global query interface
comparison shopping systems “on steroids”
Current State of Affairs
Very active in both the research community & industry
Research
– multidisciplinary efforts: Database, Web, KDD & AI
– 10+ research groups in US, Asia & Europe
– focuses: source discovery, schema matching & integration, query processing, data extraction
Industry
– Transformic, Glenbrook Networks, WebScalers, PriceGrabber, Shopping.com, MySimon, Google, …
Key Task: Schema Matching
1-1 match
Complex match
Schema Matching is Ubiquitous!
Fundamental problem in numerous applications
– data integration
– data warehousing
– peer data management
– ontology merging
– view integration
– personal information management
Schema matching across Web sources
– 30+ papers in the past few years
– Washington [AAAI-03, ICDE-05], Illinois [SIGMOD-03, SIGMOD-04, ICDE-06], MSR [VLDB-04], Binghamton [VLDB-03], HKUST [VLDB-04], Utah [WebDB-05], …
Schema Matching is Still Very Difficult
Must rely on properties of attributes, e.g., label & instances
Often there is little in common between matching attributes
Many attributes do not even have instances!
1-1 match
Complex match
Matching Performance Greatly Hampered by Pervasive Lack of Attribute Instances
28.1% ~ 74.6% of attributes with no instances
Extremely challenging to match these attributes– e.g., does departure city match from city or departure date?
Also difficult to match attributes with dissimilar instances
– e.g., airline (with American airlines as instances) vs. carrier (with European ones)
Our Solution: Exploit the Web
Discover instances from the Web
– e.g., Chicago, New York, etc. for departure city & from city
Borrow instances from other attributes & validate via the Web
– e.g., check if Air Canada is an instance of carrier using the Web
Key Idea: Question-Answering from AI
Search the Web via search engines, e.g., Google
… but search engines do not understand natural-language questions
Idea: form extraction queries as sentences to be completed
“Trick” the search engine into completing the sentences with instances
Example extraction query: “departure cities such as” (attribute label: departure city)
Extraction Patterns:
– Ls such as NP1, …, NPn
– such Ls as NP1, …, NPn
– NP1, …, NPn, and other Ls
– Ls including NP1, …, NPn
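The pattern list above can be instantiated mechanically from an attribute label. A minimal Python sketch, assuming a naive pluralization rule (the talk does not specify how labels are pluralized):

```python
# Sketch: build WebIQ-style extraction queries from an attribute label.
# The four patterns follow the slide; the pluralization rule below is
# a simplifying assumption for illustration only.

def pluralize(label: str) -> str:
    # naive English pluralization (assumption, not part of WebIQ)
    if label.endswith("y"):
        return label[:-1] + "ies"
    return label + "s"

def extraction_queries(label: str) -> list[str]:
    ls = pluralize(label)
    return [
        f'"{ls} such as"',    # Ls such as NP1, ..., NPn
        f'"such {ls} as"',    # such Ls as NP1, ..., NPn
        f'"and other {ls}"',  # NP1, ..., NPn, and other Ls
        f'"{ls} including"',  # Ls including NP1, ..., NPn
    ]

print(extraction_queries("departure city")[0])  # "departure cities such as"
```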
Key Idea: Question-Answering from AI
Search Google & obtain snippets:
Extract instance candidates from snippets:
other departure cities such as Boston, Chicago and LAX available …
Boston, Chicago, LAX
extraction query completion
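Turning a snippet into candidates requires chunking the pattern's completion. A rough sketch, assuming runs of capitalized words approximate noun phrases (the talk does not detail its actual NP extractor):

```python
import re

# Sketch: extract instance candidates from a search-engine snippet.
# Assumption: noun phrases are approximated by maximal runs of
# capitalized words; WebIQ's real NP chunking is not specified here.

def extract_candidates(snippet: str, label_plural: str) -> list[str]:
    m = re.search(re.escape(label_plural) + r" such as (.+)", snippet, re.I)
    if not m:
        return []
    # runs of capitalized tokens, so "New York" stays one candidate
    return re.findall(r"[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*", m.group(1))

snippet = "other departure cities such as Boston, Chicago and LAX available ..."
print(extract_candidates(snippet, "departure cities"))  # ['Boston', 'Chicago', 'LAX']
```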
But Not Every Candidate is a True Instance
Reason 1: Extraction queries may not be perfect
Reason 2: Web content is inherently noisy
Example:
– attribute: city
– extraction query: “and other cities”
– extracted candidate: 150
need to perform instance verification
Instance Verification: Outlier Detection
Goal: Remove statistical outliers (among candidates)
Step 1: Pre-processing
– recognize types of instances via pattern matching & the 80% rule
– types: numeric & string
– discard all candidates not of the determined type
– e.g., most instance candidates for city are strings, so remove 150
Step 2: Type-specific detection
– perform discordance tests
– test statistics, e.g.,
– # of words: abnormal if more than 5 words in person name
– % of numeric characters: US zip code contains only digits
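The two steps above can be sketched as follows. The 0.8 majority cut-off and the 5-word limit mirror the slide's examples; the numeric-type check is a simplifying assumption:

```python
# Sketch of outlier-based instance verification.
# Step 1: type recognition via the 80% rule (numeric vs. string).
# Step 2: a word-count discordance test (> 5 words is abnormal,
# following the person-name example). The numeric check below is
# our own simplification.

def is_numeric(s: str) -> bool:
    return s.replace(".", "", 1).replace(",", "").isdigit()

def verify_by_outliers(candidates: list[str], max_words: int = 5) -> list[str]:
    numeric = [c for c in candidates if is_numeric(c)]
    strings = [c for c in candidates if not is_numeric(c)]
    # Step 1: keep only the dominant type, if one covers >= 80%
    if len(numeric) >= 0.8 * len(candidates):
        kept = numeric
    elif len(strings) >= 0.8 * len(candidates):
        kept = strings
    else:
        kept = candidates  # no type determined; keep all
    # Step 2: discordance test on the number of words
    return [c for c in kept if len(c.split()) <= max_words]

cities = ["Chicago", "New York", "Boston", "150", "Los Angeles"]
print(verify_by_outliers(cities))  # ['Chicago', 'New York', 'Boston', 'Los Angeles']
```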
Instance Verification: Web Validation
Goal: Further semantic-level validation
Idea: Exploit co-occurrence statistics of label & instances
– “Make: Honda; Model: Accord”
– “a variety of makes such as Honda, Mitsubishi”
Form validation queries using validation patterns
– e.g., “make Honda”, “makes such as Honda”
Validation Patterns (V + x), where V is the validation phrase:
– L x
– Ls such as x
– such Ls as x
– x and other Ls
– Ls including x
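The V + x phrases can be generated directly from the pattern list; a sketch, again under a naive pluralization assumption:

```python
# Sketch: form validation queries (V + x) for a label L and candidate x.
# Patterns follow the slide; pluralization is a naive assumption.

def pluralize(label: str) -> str:
    return label[:-1] + "ies" if label.endswith("y") else label + "s"

def validation_queries(label: str, x: str) -> list[str]:
    ls = pluralize(label)
    return [
        f'"{label} {x}"',         # L x
        f'"{ls} such as {x}"',    # Ls such as x
        f'"such {ls} as {x}"',    # such Ls as x
        f'"{x} and other {ls}"',  # x and other Ls
        f'"{ls} including {x}"',  # Ls including x
    ]

print(validation_queries("make", "Honda")[:2])  # ['"make Honda"', '"makes such as Honda"']
```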
Instance Verification: Web Validation (cont.)
Possible measure: NumHits(V+x)
– e.g., NumHits(“cities such as Los Angeles”) = 26M
Potential problem: bias towards popular instances
Use PMI(V, x), point-wise mutual information:
PMI(V, x) = NumHits(V+x) / (NumHits(V) * NumHits(x))
Example:
– V = “cities such as”, candidates: California, Los Angeles
– NumHits(V, California) = 29
– PMI(V, Los Angeles) = 3000 * PMI(V, California)
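A minimal sketch of PMI-based scoring. Here `num_hits` is a stub standing in for search-engine hit counts (the real system queries a search engine), and the counts are made-up illustrative numbers, not real data:

```python
# Sketch of PMI-based validation: PMI(V, x) = NumHits(V+x) /
# (NumHits(V) * NumHits(x)). The HITS table holds fabricated
# example counts for illustration only.

HITS = {
    "cities such as Los Angeles": 26_000_000,
    "cities such as California": 29,
    "cities such as": 50_000_000,
    "Los Angeles": 200_000_000,
    "California": 900_000_000,
}

def num_hits(phrase: str) -> int:
    return HITS.get(phrase, 0)

def pmi(v: str, x: str) -> float:
    denom = num_hits(v) * num_hits(x)
    return num_hits(f"{v} {x}") / denom if denom else 0.0

# A true instance scores far higher than a popular non-instance:
print(pmi("cities such as", "Los Angeles") > pmi("cities such as", "California"))  # True
```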
Validate Instances from Other Attributes
Method 1: Discover k more instances from the Web
– then check for the borrowed one (Aer Lingus for Airline)
– problem: very likely Aer Lingus is not among the discovered instances
Method 2: Compare validation score with those of known instances
– problem: score for Aer Lingus may be much lower; how to decide?
Key observation: compare also to scores of non-instances
– e.g., Economy (with respect to Airline)
Train Validation-Based Instance Classifier
Naïve Bayes classifier with validation-based features
V1: “Airlines such as”   V2: “Airline”

Validation scores:
Example     | M1 | M2  | +/-
Air Canada  | .5 | .3  | +
American    | .8 | .1  | +
Economy     | .4 | .03 | -
First Class | .2 | .05 | -
Delta       | .6 | .3  | +
United      | .9 | .4  | +
Jan         | .1 | .06 | -
1           | .3 | .09 | -

Thresholds: t1 = .45, t2 = .075

Thresholded features (fi = 1 iff Mi ≥ ti):
Example | f1 | f2 | +/-
Delta   | 1  | 1  | +
United  | 1  | 1  | +
Jan     | 0  | 0  | -
1       | 0  | 1  | -

P(C|X) ~ P(C) P(X|C)
P(+) = P(-) = ½
P(f1=1|+) = 3/4, P(f1=1|-) = 1/4
…
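The classifier on this slide can be sketched end-to-end. The thresholds and example scores mirror the slide; Laplace smoothing is our addition to avoid zero probabilities:

```python
# Sketch of the validation-based Naive Bayes instance classifier.
# Features are thresholded validation scores (f_i = 1 if M_i >= t_i).
# Thresholds t1 = .45, t2 = .075 follow the slide; Laplace smoothing
# is our own addition.

T = [0.45, 0.075]  # thresholds t1, t2

def featurize(scores):
    return tuple(1 if m >= t else 0 for m, t in zip(scores, T))

def train(examples):
    # examples: list of (scores, label) pairs with label '+' or '-'
    rows = {"+": [], "-": []}
    for scores, y in examples:
        rows[y].append(featurize(scores))
    model = {}
    for y, fs in rows.items():
        prior = len(fs) / len(examples)
        # P(f_i = 1 | y) with Laplace smoothing
        p1 = [(sum(f[i] for f in fs) + 1) / (len(fs) + 2) for i in range(len(T))]
        model[y] = (prior, p1)
    return model

def classify(model, scores):
    f = featurize(scores)
    def posterior(y):
        prior, p1 = model[y]
        p = prior
        for i, fi in enumerate(f):
            p *= p1[i] if fi else 1 - p1[i]
        return p
    return max("+-", key=posterior)

train_data = [
    ([0.5, 0.3], "+"), ([0.8, 0.1], "+"),    # Air Canada, American
    ([0.4, 0.03], "-"), ([0.2, 0.05], "-"),  # Economy, First Class
]
model = train(train_data)
print(classify(model, [0.6, 0.3]))  # + (Delta)
```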
Validate Instances via the Deep Web
Handle attributes that are difficult to validate via the Web, e.g., from
Disadvantage: ambiguity when no results are found
Architecture of Assisted Matching System
[diagram: Source interfaces → Instance acquisition → Source interfaces with augmented instances → Interface matcher → Attribute matches]
Empirical Evaluation
Five domains:
Experiments:
– Baseline: IceQ [Wu et al., SIGMOD-04]
– Web assistance
Performance metrics:
– precision (P), recall (R), & F1 (= 2PR/(P+R))
Domain      | # schemas | # attributes per schema | % attributes with no instances | Avg. schema depth
Airfare     | 20        | 10.7                    | 28.1                           | 3.6
Automobile  | 20        | 5.1                     | 38.6                           | 2.4
Book        | 20        | 5.4                     | 74.6                           | 2.3
Job         | 20        | 4.6                     | 30.0                           | 2.1
Real Estate | 20        | 6.5                     | 32.2                           | 2.7
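The F1 metric used in the evaluation is just the harmonic mean of precision and recall:

```python
# F1 as defined on the evaluation slide: F1 = 2PR / (P + R).
def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r) if (p + r) else 0.0

print(f1(0.5, 1.0))  # 0.6666666666666666
```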
Matching Accuracy
Web assistance boosts accuracy (F1) from 89.5 to 97.5
[bar chart: F1 (%) by domain — Airfare, Automobile, Book, Job, Real Estate; series: Baseline, Baseline + WebIQ, Baseline + WebIQ + Threshold; y-axis 80–100]
Overhead Analysis
Reasonable overhead: 6~11 minutes across domains
[bar chart: overhead (min) by domain — Airfare, Auto, Book, Job, RE; series: Baseline, Surface, Attr-Surface, Attr-Deep; y-axis 0–7]
Conclusion
Search problems on the Deep Web are increasingly crucial!
Novel QA-based approach to learning attribute instances
Incorporation into a state-of-the-art matching system
Extensive evaluation over varied real-world domains
More details: search for Wensheng Wu on Google