1
Statistical Schema Matching across Web Query Interfaces
Bin He , Kevin Chen-Chuan Chang
SIGMOD 2003
2
Background: Large-Scale Integration of the deep Web
Query Result
The Deep Web
3
Challenge: matching query interfaces (QIs)
Book Domain
Music Domain
4
Traditional approaches of schema matching – Pairwise Attribute Correspondence
Scale is a challenge
  Only small scale
  Large-scale is a must for our task
Scale is an opportunity
  Useful context
Pairwise Attribute Correspondence
  S1: author, title, subject, ISBN
  S2: writer, title, category, format
  S3: name, title, keyword, binding
  S1.author – S3.name
  S1.subject – S2.category
5
Deep Web Observation
Proliferating sources
Converging vocabularies
6
A hidden schema model exists?
Our View (Hypothesis):
[Figure: Finite Vocabulary -> Statistical Model M -> Generate QIs with different probabilities P]
Example: QI1, with instantiation probability P(QI1|M)
7
A hidden schema model exists?
Our View (Hypothesis):
[Figure: Finite Vocabulary -> Statistical Model M -> Generate QIs with different probabilities P]
Example: QI1, with instantiation probability P(QI1|M)
Now the problem is:
Given QIs, can we discover M?
8
MGS framework & Goal
Hypothesis modeling
Hypothesis generation
Hypothesis selection
Goal:
  Verify the phenomena
  Validate MGSsd with two metrics
9
Comparison with Related Work
 | Related Work | Authors’ Work
Paradigms | Match two input sources | Match many sources
Techniques | Machine learning, constraint-based, hybrid ones | Statistical approach
Input data | Relational or structured schemas with inconsistency | Interfaces with consistency
Focuses | Name match, structure match, etc. | Synonym discovery
10
Outline
MGS
MGSsd: Hypothesis Modeling, Generation, Selection
Deal with Real World Data
Final Algorithm
Case Study
Metrics
Experimental Results
Conclusion and Future Issues
My Assessment
11
Towards hidden model discovery: Statistical schema matching (MGS)
1. Define the abstract model structure M to solve a target question
   P(QI|M) = …
2. Given QIs, generate the model candidates
   P(QIs|M) > 0
   [Figure: candidate models M1 and M2]
3. Select the candidate with the highest confidence
   What is the confidence of M1 given the QIs?
12
MGSsd: Specialize MGS for Synonym Discovery
MGS is generally applicable to a wide range of schema matching tasks
  E.g., attribute grouping
Focus: discover synonym attributes
  Author – Writer, Subject – Category
  No hierarchical matching: query interface as flat schema
  No complex matching: (LastName, FirstName) – Author
13
Hypothesis Modeling: Structure
Goal: capture synonym relationship
Two-level model structure
  Concepts: mutually independent
  Attributes: mutually exclusive
  No overlapping concepts
Possible schemas: I1={author, title, subject, ISBN}, I2={title, category, ISBN}
14
Hypothesis Modeling: Formula
Definition and Formula:
Probability that M can generate schema I:
15
Hypothesis Modeling: Instantiation probability
1. Observing an attribute
   P(author|M) = P(C1|M) * P(author|C1) = α1 * β1
2. Observing a schema
   P({author, ISBN, subject}|M) = P(author|M) * P(ISBN|M) * P(subject|M) * (1 – P(C2|M))
3. Observing a schema set
   P(QIs|M) = ∏i P(QIi|M)
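The three observation levels above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a model is represented as a list of concepts, each a pair (alpha, betas) with alpha = P(C|M) and betas mapping each attribute to P(attribute|C). The representation, the function names, and every number in `example_model` are hypothetical and do not come from the slides.

```python
from math import prod

def attribute_prob(attr, model):
    """P(attr|M) = alpha * beta for the concept that owns attr (0 if none does)."""
    for alpha, betas in model:
        if attr in betas:
            return alpha * betas[attr]
    return 0.0

def schema_prob(schema, model):
    """P(schema|M): each concept contributes alpha*beta if exactly one of its
    attributes is present, (1 - alpha) if none is, and 0 if two or more are."""
    known = {a for _, betas in model for a in betas}
    if not set(schema) <= known:
        return 0.0  # the model cannot generate an unknown attribute
    p = 1.0
    for alpha, betas in model:
        present = [a for a in schema if a in betas]
        if len(present) > 1:
            return 0.0  # synonyms are mutually exclusive
        p *= alpha * betas[present[0]] if present else (1.0 - alpha)
    return p

def schema_set_prob(observations, model):
    """P(QIs|M): product over observed schemas, each raised to its count."""
    return prod(schema_prob(s, model) ** count for s, count in observations)

# Hypothetical model: C1 = {author, writer}, C2 = {title}, C3 = {ISBN},
# C4 = {subject, category}. All probabilities are made up for illustration.
example_model = [
    (0.8, {"author": 0.6, "writer": 0.4}),
    (0.5, {"title": 1.0}),
    (0.9, {"ISBN": 1.0}),
    (0.4, {"subject": 0.7, "category": 0.3}),
]

print(attribute_prob("author", example_model))
print(schema_prob({"author", "ISBN", "subject"}, example_model))
```

With this model, the schema {author, ISBN, subject} leaves C2 unselected, so its probability carries the factor (1 – P(C2|M)), matching the shape of the slide's example.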
16
Consistency check
Schema observations: a set I of pairs <Ii, Bi>, where Bi is the number of occurrences of schema Ii
M is consistent if Pr(I|M) > 0
Find consistent models as candidates
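A structural version of this check can be sketched as follows. It assumes the parameters are estimated afterwards from the same observations, so that only the structure of M (which attributes share a concept) can force Pr(I|M) to zero. The function name and the data layout are hypothetical.

```python
def is_consistent(concepts, observations):
    """Return True if no observed schema is impossible under the concept structure.

    concepts: list of sets of attribute names (one set per concept).
    observations: list of (schema, count) pairs, schema being a set of attributes.
    """
    covered = set().union(*concepts)
    for schema, count in observations:
        if count <= 0:
            continue  # schemas that never occurred impose no constraint
        schema = set(schema)
        if not schema <= covered:
            return False  # an observed attribute belongs to no concept
        if any(len(schema & concept) > 1 for concept in concepts):
            return False  # two attributes of one concept co-occur
    return True

observations = [({"author", "subject"}, 1), ({"author", "category"}, 1)]
print(is_consistent([{"author"}, {"subject", "category"}], observations))
print(is_consistent([{"author", "subject"}, {"category"}], observations))
```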
17
Hypothesis Generation
Two sub-steps
1. Consistent Concept Construction
2. Build Hypothesis Space
18
Hypothesis Generation: Space pruning
Prune the space of model candidates
  Generate M such that P(QI|M)>0 for any observed QI
  Mutual exclusion assumption
  Co-occurrence graph
Example:
  Observations: QI1 = {author, subject} and QI2 = {author, category}
  Space of model: any set partition of {author, subject, category}
[Figure: the five set partitions of {author, subject, category}, labeled M1 to M5; M1 has three concepts (C1, C2, C3), M2, M3 and M4 have two (C1, C2), and M5 has one (C1)]
19
Hypothesis Generation
Prune the space of model candidates
  Generate M such that P(QI|M)>0 for any observed QI
  Mutual exclusion assumption
Example:
  Observations: QI1 = {author, subject} and QI2 = {author, category}
  Space of model: any set partition of {author, subject, category}
  Model candidates after pruning: only the partitions that keep author in a concept of its own, because author co-occurs with both subject and category
[Figure: the five set partitions M1 to M5 of {author, subject, category}, showing the candidates that remain after pruning]
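The pruning in this example can be reproduced with a small sketch: enumerate every set partition of the vocabulary, build the co-occurrence graph from the observed schemas, and keep the partitions in which no concept contains two co-occurring attributes. The helper names are hypothetical, and the brute-force enumeration is only meant for tiny vocabularies like the one in the example.

```python
from itertools import combinations

def set_partitions(items):
    """Yield every partition of items as a list of sets."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Either first forms a concept of its own ...
        yield [{first}] + partition
        # ... or it joins one of the existing concepts.
        for i in range(len(partition)):
            yield partition[:i] + [partition[i] | {first}] + partition[i + 1:]

def cooccurrence_edges(schemas):
    """Edges of the co-occurrence graph: attribute pairs seen in the same schema."""
    edges = set()
    for schema in schemas:
        edges.update(combinations(sorted(schema), 2))
    return edges

def consistent_partitions(schemas):
    """Partitions in which no concept holds two co-occurring attributes."""
    vocabulary = sorted(set().union(*schemas))
    edges = cooccurrence_edges(schemas)
    return [
        partition
        for partition in set_partitions(vocabulary)
        if all(pair not in edges
               for concept in partition
               for pair in combinations(sorted(concept), 2))
    ]

schemas = [{"author", "subject"}, {"author", "category"}]
for candidate in consistent_partitions(schemas):
    print(candidate)
```

For the two observed interfaces, five partitions are enumerated and two survive: the one with three singleton concepts and the one that groups subject with category.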
20
Hypothesis Generation (Cont.)
Build Probability Functions
Maximum likelihood estimation
  Estimate αi and βj that maximize Pr(I|M)
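For a model whose concepts are chosen independently and whose attributes within a concept are mutually exclusive, the likelihood factorizes, so the maximizing parameters reduce to relative frequencies. The sketch below assumes exactly that: αi is the fraction of observed schemas that use concept Ci, and βj is attribute Aj's share among the occurrences of its concept. The function name, data layout, and counts are hypothetical, and the input is assumed to be consistent with the concept structure.

```python
from collections import Counter

def estimate_parameters(concepts, observations):
    """Frequency-based estimates of alpha (per concept) and beta (per attribute).

    concepts: list of sets of attribute names.
    observations: list of (schema, count) pairs; each schema is assumed to
    contain at most one attribute per concept.
    Returns a list of (alpha, {attribute: beta}) pairs, one per concept.
    """
    total = sum(count for _, count in observations)
    attr_counts = Counter()
    for schema, count in observations:
        for attr in schema:
            attr_counts[attr] += count

    model = []
    for concept in concepts:
        concept_count = sum(attr_counts[a] for a in concept)
        alpha = concept_count / total
        betas = {
            a: (attr_counts[a] / concept_count if concept_count else 0.0)
            for a in sorted(concept)
        }
        model.append((alpha, betas))
    return model

# Hypothetical counts for the two interfaces of the pruning example.
observations = [({"author", "subject"}, 3), ({"author", "category"}, 1)]
model = estimate_parameters([{"author"}, {"subject", "category"}], observations)
print(model)
```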
21
Hypothesis Selection
Rank the model candidates
Select the model that generates the closest distribution to the observations
Approach: hypothesis testing
Example: select schema model at significance level 0.05
  χ² = 3.93; 3.93 < 7.815: accept
  χ² = 20.20; 20.20 > 14.067: reject
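The accept and reject decisions in the example follow the usual chi-square goodness-of-fit pattern: compare a test statistic with the critical value for the chosen significance level. The thresholds 7.815 and 14.067 are the standard 0.05 critical values for 3 and 7 degrees of freedom. The sketch below is a generic version of that test, not the authors' exact procedure; the observed and expected counts are hypothetical, and how the degrees of freedom are derived from a model is left as an assumption.

```python
def chi_square_statistic(observed, expected):
    """Pearson statistic: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Standard chi-square critical values at significance level 0.05,
# keyed by degrees of freedom (the two values quoted in the example).
CRITICAL_AT_0_05 = {3: 7.815, 7: 14.067}

def accept_model(observed, expected, degrees_of_freedom):
    """Accept the model if the statistic stays below the critical value."""
    statistic = chi_square_statistic(observed, expected)
    return statistic < CRITICAL_AT_0_05[degrees_of_freedom]

# Hypothetical counts: observed schema frequencies vs. the frequencies a
# candidate model would predict for the same four schemas.
observed = [55, 25, 15, 5]
expected = [50, 30, 15, 5]
print(chi_square_statistic(observed, expected))
print(accept_model(observed, expected, degrees_of_freedom=3))
```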
22
Dealing with the Real World Data
Head-often, tail-rare distribution
Attribute Selection
  Systematically remove rare attributes
Rare Schema Smoothing
  Aggregate infrequent schemas into a conceptual event I(rare)
Consensus Projection
  Follow concept mutual independence assumption
  Extract and aggregate
  New input schemas with re-estimated parameters
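The first two steps can be sketched as simple preprocessing over schema counts. This is a rough illustration under assumptions that are not spelled out on the slide: the threshold `f` is taken to be the minimum fraction of schemas in which an attribute must appear, and `min_count` (a hypothetical parameter) decides which projected schemas are folded into the single event I(rare).

```python
from collections import Counter

def select_attributes(observations, f=0.10):
    """Drop attributes that appear in fewer than a fraction f of the schemas,
    then project every schema onto the surviving attributes."""
    total = sum(count for _, count in observations)
    freq = Counter()
    for schema, count in observations:
        for attr in schema:
            freq[attr] += count
    keep = {a for a, n in freq.items() if n / total >= f}

    projected = Counter()
    for schema, count in observations:
        reduced = frozenset(schema) & keep
        if reduced:  # skip schemas left empty by the projection
            projected[reduced] += count
    return keep, projected

def smooth_rare_schemas(schema_counts, min_count=2):
    """Aggregate schemas seen fewer than min_count times into one event."""
    smoothed = Counter()
    for schema, count in schema_counts.items():
        key = schema if count >= min_count else "I(rare)"
        smoothed[key] += count
    return smoothed

# Hypothetical observations: 20 interfaces in total.
observations = [
    ({"author", "title"}, 15),
    ({"title", "subject"}, 4),
    ({"title", "binding"}, 1),
]
keep, projected = select_attributes(observations, f=0.10)
print(sorted(keep))
print(smooth_rare_schemas(projected, min_count=2))
```

Here binding appears in 1 of 20 schemas, below the 10% threshold, so it is removed; the schema that contained it shrinks to {title} and is then folded into I(rare).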
23
Final Algorithm
Two phases:
  Build initial hypothesis space
  Discover the hidden model
Attribute Selection
Extract the common parts of model candidates of last iteration
Hypothesis Generation
Hypothesis Selection
Combine rare interfaces
24
Experiment Setup in Case Studies
Over 200 sources on four domains
Threshold f=10%
Significance level: 0.05
  Can be specified by users
25
Example of the MGSsd Algorithm
M1={(ti), (is), (kw), (pr), (fm), (pd), (pu), (su,cg), (au,ln), (fn)}
M2={(ti), (is), (kw), (pr), (fm), (pd), (pu), (su,cg), (au,fn), (ln)}
26
Metrics
1. How close it is to the correct schema model
   Precision:
   Recall:
2. How well it can answer the target question
   Precision:
   Recall:
27
Examples on Metrics
I={<I1,6>, <I2,3>, <I3,1>}
I1={author, subject}, I2={author, category}, I3={subject}
M1={(author:1):0.6, (subject:0.7,category:0.3):1}
M2={(author:1):0.6, (subject:1):0.7, (category:1):0.3}
Metrics 1:
  Pm(M2,Mc)=0.196+0.036+0.294+0.054=0.58
  Rm(M2,Mc)=0.28+0.12+0.42+0.18=1
Metrics 2:
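The figures for Metrics 1 can be reproduced with a short sketch. It rests on two assumptions that are inferred from the numbers rather than stated in the transcript: the correct model Mc is the M1 listed above, and model precision and recall sum Pr(I|M2) and Pr(I|Mc), respectively, over the schemas that both models can generate. The function names and the model representation (a list of (alpha, betas) pairs) are hypothetical.

```python
from itertools import combinations

def schema_prob(schema, model):
    """P(schema|M) for a model given as a list of (alpha, {attribute: beta})."""
    p = 1.0
    for alpha, betas in model:
        present = [a for a in schema if a in betas]
        if len(present) > 1:
            return 0.0  # two synonyms cannot appear together
        p *= alpha * betas[present[0]] if present else (1.0 - alpha)
    return p

def instances(model):
    """All schemas the model generates with non-zero probability."""
    vocabulary = sorted(a for _, betas in model for a in betas)
    result = {}
    for size in range(len(vocabulary) + 1):
        for combo in combinations(vocabulary, size):
            p = schema_prob(combo, model)
            if p > 0:
                result[frozenset(combo)] = p
    return result

def model_precision_recall(hypothesis, correct):
    """Sum each model's probability mass over the schemas both can generate."""
    ins_h, ins_c = instances(hypothesis), instances(correct)
    shared = ins_h.keys() & ins_c.keys()
    precision = sum(ins_h[s] for s in shared)
    recall = sum(ins_c[s] for s in shared)
    return precision, recall

# Mc is assumed to be the slide's M1; M2 is the slide's M2.
Mc = [(0.6, {"author": 1.0}), (1.0, {"subject": 0.7, "category": 0.3})]
M2 = [(0.6, {"author": 1.0}), (0.7, {"subject": 1.0}), (0.3, {"category": 1.0})]

precision, recall = model_precision_recall(M2, Mc)
print(round(precision, 3), round(recall, 3))  # 0.58 1.0
```

The four shared schemas are {subject}, {category}, {author, subject} and {author, category}; their probabilities under M2 are 0.196, 0.036, 0.294 and 0.054, and under Mc 0.28, 0.12, 0.42 and 0.18.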
28
Experimental Results
This approach can identify most concepts correctly
Incorrect matchings are due to a small number of observations
Two suites of metrics are needed
Time complexity is exponential
Can generate all correct instances
The discovered synonyms are all correct ones
29
Advantages
Scalability: large-scale matching
Solvability: exploit statistical information
Generality
Holistic Model Discovery vs. Pairwise Attribute Correspondence
Input schemas (both approaches):
  S1: author, title, subject, ISBN
  S2: writer, title, category, format
  S3: name, title, keyword, binding
Holistic Model Discovery: author, name, writer; subject, category
Pairwise Attribute Correspondence: S1.author – S3.name; S1.subject – S2.category
30
Conclusions & Future Work
Holistic statistical schema matching of massive sources
MGS framework to find synonym attributes
  Discover hidden models
  Suited for large-scale database
Results verify the observed phenomena and show accuracy and effectiveness
Future Issues
  Complex matching: (Last Name, First Name) – Author
  More efficient approximation algorithm
  Incorporating other matching techniques
31
My Assessments
Promise
  Use minimal “light-weight” information: attribute name
  Effective with sufficient instances
  Leverage challenge as opportunity
Limitation
  Need sufficient observations
  Simple assumptions
  Exponential time complexity
  Homonyms
32
Questions