Pedro Domingos
Joint work with AnHai Doan & Alon Levy
Department of Computer Science & Engineering
University of Washington

Data Integration: A “Killer App” for Multi-Strategy Learning

Overview

Data integration & XML
Schema matching
Multi-strategy learning
Prototype system & experiments
Related work
Future work
Summary

Data Integration

[Figure: the user query "Find houses with four bathrooms and price under $500,000" is posed against the mediated schema. Below it are three sources, superhomes.com, realestate.com, and homeseekers.com, each with its own source schema and its own wrapper.]

Why Data Integration Matters

Very active area in database & AI
– research / workshops
– start-ups
Large organizations
– multiple databases with differing schemas
Data warehousing
The Web: HTML sources
The Web: XML sources

XML

Extensible Markup Language
– introduced in 1996
The standard for data publishing & exchange
– replaces HTML & proprietary formats
– embraced by database/web/e-commerce communities
XML versus HTML
– both use tags to mark up data elements
– HTML tags specify format
– XML tags define meaning
– relationships among elements provided via nesting

Example

HTML:

<h1> Residential Listings </h1>
<ul> House For Sale
  <li> location: Seattle, WA, USA
  <li> agent-phone: (206) 729 0831
  <li> listed-price: $250,000
  <li> comments: Fantastic house ...
</ul>
<hr>
<ul> House For Sale
  ...
</ul>
...

XML:

<residential-listings>
  <house>
    <location>
      <city> Seattle </city>
      <state> WA </state>
      <country> USA </country>
    </location>
    <agent-phone> (206) 729 0831 </agent-phone>
    <listed-price> $250,000 </listed-price>
    <comments> Fantastic house ... </comments>
  </house>
  ...
</residential-listings>
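
Because XML tags carry meaning, a program can ask for a value by name instead of scraping it out of formatted text. The sketch below parses one house from the XML listing with Python's standard `xml.etree.ElementTree`; the variable names are illustrative and the trailing "..." of the slide is left out so the snippet is well-formed.

```python
import xml.etree.ElementTree as ET

# One house from the slide's XML example (the elided "..." entries are omitted).
XML_LISTING = """
<residential-listings>
  <house>
    <location>
      <city> Seattle </city>
      <state> WA </state>
      <country> USA </country>
    </location>
    <agent-phone> (206) 729 0831 </agent-phone>
    <listed-price> $250,000 </listed-price>
    <comments> Fantastic house ... </comments>
  </house>
</residential-listings>
"""

root = ET.fromstring(XML_LISTING)
house = root.find("house")

# Nesting expresses relationships: city is reached through location.
city = house.findtext("location/city").strip()
price = house.findtext("listed-price").strip()
print(city, price)
```

The HTML version offers no comparable handle: every field is an anonymous `<li>` whose meaning has to be guessed from its text.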

XML DTD

Document Type Definition
– BNF grammar
– constraints on element structure: type, order, # of times
A real-estate DTD

<!ELEMENT residential-listings (house*)>
<!ELEMENT house (location?, agent-phone, listed-price, comments?)>
<!ELEMENT location (city, state, country?)>

A DTD can be visualized as a tree
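
One way to picture the tree view is to write the three declarations down as a parent-to-children table and print it with indentation. This is a hand-written encoding for illustration only (the `DTD` dictionary and `render` helper are assumptions, not a DTD parser); the `*` and `?` markers are the occurrence constraints from the declarations.

```python
# Hand-written encoding of the slide's three <!ELEMENT ...> declarations:
# parent element -> ordered list of (child name, occurrence marker).
DTD = {
    "residential-listings": [("house", "*")],
    "house": [("location", "?"), ("agent-phone", ""),
              ("listed-price", ""), ("comments", "?")],
    "location": [("city", ""), ("state", ""), ("country", "?")],
}

def render(element, marker="", depth=0):
    """Return the subtree rooted at element as indented lines."""
    lines = ["  " * depth + element + marker]
    for child, child_marker in DTD.get(element, []):
        lines.extend(render(child, child_marker, depth + 1))
    return lines

tree_lines = render("residential-listings")
print("\n".join(tree_lines))
```

Elements with no entry in the table (city, agent-phone, and so on) are the leaves of the tree.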

Semantic Mappings between Schemas

Mediated & source schemas = XML DTDs

[Figure: two DTD trees, each rooted at house, with semantic mappings between their elements. Element labels, in the order they appear: house, location, contact-info, house, address, agent-name, agent-phone, num-baths, amenities, full-baths, half-baths, handicap-equipped, contact, name, phone.]

Map of the Problem

[Figure: source descriptions comprise schema matching, data translation, scope, completeness, reliability, and query capability. Two further distinctions are shown: leaf elements vs. higher-level elements, and 1-1 mappings vs. complex mappings.]

Current State of Affairs

Largely done by hand
– labor intensive & error prone
– key bottleneck in building applications
Will only be exacerbated
– data sharing & XML become pervasive
– proliferation of DTDs
– translation of legacy data
Need automatic approaches to scale up!

Our Approach

Use machine learning to match schemas
Basic idea
1. create training data
– manually map a set of sources to mediated schema
2. train system on training data
– learns from
  – name of schema elements
  – format of values
  – frequency of words & symbols
  – characteristics of value distribution
  – proximity, position, structure, ...
3. system proposes mappings for subsequent sources

Example

realestate.com

<house>
  <location> Seattle, WA </location>
  <agent-phone> (206) 729 0831 </agent-phone>
  <listed-price> $250,000 </listed-price>
  <comments> Fantastic house ... </comments>
</house>
...

mediated schema: address, phone, price, description

location: Seattle, WA; Seattle, WA; Dallas, TX; ...
listed-price: $250,000; $162,000; $180,000; ...
agent-phone: (206) 729 0831; (206) 321 4571; (214) 722 4035; ...
comments: Fantastic house ...; Great ...; Hurry! ...; ...
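
The per-element value lists on this slide can be produced by walking the source's listings and grouping leaf values by tag. The sketch below is a minimal illustration: the `<listings>` root element, the second house, and the `collect_columns` helper are assumptions added so the snippet is self-contained.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumed wrapper output: a <listings> root around two houses. The first house
# is the slide's; the second reuses values from the slide's value lists.
SOURCE = """
<listings>
  <house>
    <location> Seattle, WA </location>
    <agent-phone> (206) 729 0831 </agent-phone>
    <listed-price> $250,000 </listed-price>
    <comments> Fantastic house ... </comments>
  </house>
  <house>
    <location> Dallas, TX </location>
    <agent-phone> (214) 722 4035 </agent-phone>
    <listed-price> $180,000 </listed-price>
    <comments> Hurry! ... </comments>
  </house>
</listings>
"""

def collect_columns(xml_text):
    """Group the text of every leaf element by its tag name."""
    columns = defaultdict(list)
    for house in ET.fromstring(xml_text).iter("house"):
        for leaf in house:
            columns[leaf.tag].append(leaf.text.strip())
    return dict(columns)

columns = collect_columns(SOURCE)
for tag, values in columns.items():
    print(tag, values)
```

Each resulting column is what a content-based learner sees when it tries to decide which mediated-schema element the tag corresponds to.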

Multi-Strategy Learning

Use a set of base learners
– each exploits certain types of information
Match schema elements of a new source
– apply the learners
– combine their predictions using a meta-learner
Meta-learner
– measures base learner accuracy on training data
– weighs each learner based on its accuracy

Learners

Input
– schema information: name, proximity, structure, ...
– data information: value, format, ...
Output
– prediction weighted by confidence score
Example learners
– name matcher
  – agent-name => (name,0.7), (phone,0.3)
– Naive Bayes
  – “Seattle, WA” => (address,0.8), (name,0.2)
  – “Great location ...” => (description,0.9), (address,0.1)
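
A learner's output can be represented as a mapping from mediated-schema element to confidence score. The sketch below uses the confidence values from the examples above; the `Prediction` alias and the `best_label` helper are illustrative names, not part of LSD.

```python
from typing import Dict

# A prediction maps each candidate mediated-schema element to a confidence.
Prediction = Dict[str, float]

# Confidence scores copied from the slide's examples.
name_matcher_output: Prediction = {"name": 0.7, "phone": 0.3}      # agent-name
naive_bayes_output: Prediction = {"address": 0.8, "name": 0.2}     # "Seattle, WA"

def best_label(prediction: Prediction) -> str:
    """Return the element with the highest confidence score."""
    return max(prediction, key=prediction.get)

print(best_label(name_matcher_output), best_label(naive_bayes_output))
```

Keeping the full distribution, rather than only the top label, is what lets a meta-learner weigh one learner's opinion against another's.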

Training the Learners

realestate.com

<house>
  <location> Seattle, WA </location>
  <agent-phone> (206) 729 0831 </agent-phone>
  <listed-price> $ 250,000 </listed-price>
  <comments> Fantastic house ... </comments>
</house>
...

mediated schema: address, phone, price, description
source elements: location, listed-price, agent-phone, comments

Name Matcher
(location, address)
(agent-phone, phone)
(listed-price, price)
(comments, description)
...

Naive Bayes
(“Seattle, WA”, address)
(“(206) 729 0831”, phone)
(“$ 250,000”, price)
(“Fantastic house ...”, description)
...
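
Both training sets on this slide follow mechanically from the manual mapping: the name matcher is trained on (tag, label) pairs and Naive Bayes on (value, label) pairs. A minimal sketch, with `MAPPING`, `LISTING`, and the two helper functions as illustrative names:

```python
# Manual 1-1 mapping from realestate.com tags to the mediated schema (from the slide).
MAPPING = {
    "location": "address",
    "agent-phone": "phone",
    "listed-price": "price",
    "comments": "description",
}

# Leaf values of the slide's single listing, keyed by source tag.
LISTING = {
    "location": "Seattle, WA",
    "agent-phone": "(206) 729 0831",
    "listed-price": "$ 250,000",
    "comments": "Fantastic house ...",
}

def name_examples(mapping):
    """Training pairs for the name matcher: (source tag, mediated element)."""
    return list(mapping.items())

def content_examples(listing, mapping):
    """Training pairs for content learners: (data value, mediated element)."""
    return [(value, mapping[tag]) for tag, value in listing.items()]

print(name_examples(MAPPING))
print(content_examples(LISTING, MAPPING))
```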

Applying the Learned Models

homes.com
mediated schema: address, phone, price, description

[Figure: the source element area has the data instances Seattle, WA; Kent, WA; Austin, TX; Seattle, WA. The instances go through the Name Matcher and Naive Bayes, then the Meta-learner, which outputs the labels address, address, description, address. The Combiner outputs a single label for area: address.]
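
The slide does not say how the Combiner turns per-instance labels into one label for the element. As a stand-in, the sketch below assumes a simple majority vote, which reproduces the outcome shown for area; the actual rule in LSD may differ.

```python
from collections import Counter

# Per-instance labels shown on the slide for the area element.
instance_labels = ["address", "address", "description", "address"]

def combine_by_majority(labels):
    """Return the most frequent label (an assumed rule, not necessarily LSD's)."""
    return Counter(labels).most_common(1)[0][0]

element_label = combine_by_majority(instance_labels)
print(element_label)
```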

The LSD System

Base learners/modules
– name matcher
– Naive Bayes
– WHIRL nearest-neighbor classifier [Cohen&Hirsh-KDD98]
– county-name recognizer
Meta-learner
– stacking [Ting&Witten99, Wolpert92]

Name Matcher

Matches based on names
– including all names on path from root to current node
– allowing synonyms
Good for ...
– specific, descriptive names: agent-phone, listed-price
Bad for ...
– vacuous names: item, listings
– partially specified, ambiguous names: office (for “office phone”)
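
A rough sketch of name-based matching, assuming token overlap as the similarity measure: the names on the root-to-node path are split into tokens, synonyms are normalized, and each mediated-schema element is scored by Jaccard overlap. The synonym table and the scoring function are assumptions for illustration; the slide does not specify either.

```python
import re

# Hypothetical synonym table; the slide does not list the synonyms LSD uses.
SYNONYMS = {"comments": "description", "location": "address"}

def tokens(names):
    """Split a list of element names into normalized word tokens."""
    words = re.findall(r"[a-z]+", " ".join(names).lower())
    return {SYNONYMS.get(word, word) for word in words}

def name_score(path, label):
    """Jaccard overlap between path tokens and label tokens (an assumed metric)."""
    a, b = tokens(path), tokens([label])
    return len(a & b) / len(a | b)

def match(path, labels):
    """Return the label whose name overlaps most with the path."""
    return max(labels, key=lambda label: name_score(path, label))

LABELS = ["address", "phone", "price", "description"]
print(match(["house", "agent-phone"], LABELS))
print(match(["house", "comments"], LABELS))
```

A vacuous path such as `["listings", "item"]` scores zero against every label, which is the failure mode noted above.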

Naive Bayes Learner

Exploits frequencies of words & symbols
Good for ...
– elements with words/symbols that are strongly indicative
– examples:
  – “fantastic” & “great” in house descriptions
  – $ in prices, parentheses in phone numbers
Bad for ...
– short, numeric elements: num-baths, num-bedrooms
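
To make the word-and-symbol idea concrete, here is a small multinomial Naive Bayes with add-one smoothing, written from scratch. It is a generic textbook sketch rather than LSD's implementation; the tokenizer, class name, and training set (the four pairs from the "Training the Learners" slide) are choices made for this example.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Words, digit runs, and single symbols such as "$", "(" and ")" are tokens.
    return re.findall(r"[a-z]+|\d+|[^\w\s]", text.lower())

class NaiveBayes:
    def fit(self, examples):
        self.token_counts = defaultdict(Counter)
        self.label_counts = Counter()
        for text, label in examples:
            self.label_counts[label] += 1
            self.token_counts[label].update(tokenize(text))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(n / total)
            for tok in tokenize(text):
                score += math.log((counts[tok] + 1) / denom)
            log_scores[label] = score
        # Normalize log scores into confidences that sum to 1.
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: v / z for label, v in exp.items()}

TRAIN = [
    ("Seattle, WA", "address"),
    ("(206) 729 0831", "phone"),
    ("$ 250,000", "price"),
    ("Fantastic house ...", "description"),
]
model = NaiveBayes().fit(TRAIN)
pred = model.predict("$162,000")
print(max(pred, key=pred.get))
```

With this training set the "$" and "000" tokens pull "$162,000" toward price, while a bare value such as "3" shares no token with any class, so the prediction falls back to the class priors.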

WHIRL Nearest-Neighbor Classifier

Similarity-based
– stores all examples seen so far
– classifies a new example based on similarity to training examples
– IR document similarity metric
Good for ...
– long, textual elements: house description, names
– limited, descriptive set of values: color (blue, red, ...)
Bad for ...
– short, numeric elements: num-baths, num-bedrooms
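
The following is not WHIRL itself but a stand-in for the general idea: store the training examples, weight tokens by TF-IDF (a common IR similarity scheme, assumed here), and label a new value with the label of its most similar stored example under cosine similarity. The example texts and function names are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector per document, plus the IDF table."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n) / (1 + count)) + 1 for tok, count in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(toks).items()} for toks in tokenized]
    return vectors, idf

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def nearest_label(query, examples):
    """Label the query with the label of its most similar stored example."""
    vectors, idf = tfidf_vectors([text for text, _ in examples])
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(query.lower().split()).items()}
    scores = [cosine(q, vec) for vec in vectors]
    best = max(range(len(examples)), key=scores.__getitem__)
    return examples[best][1], scores[best]

# Made-up stored examples for illustration.
EXAMPLES = [
    ("fantastic house with great view", "description"),
    ("great location close to schools", "description"),
    ("richard smith", "name"),
    ("blue", "color"),
    ("red", "color"),
]
label, score = nearest_label("great house near schools", EXAMPLES)
print(label)
```

A short numeric value such as "2" shares no token with any stored text, so every similarity is zero and the choice is arbitrary, which matches the weakness listed above.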

County-Name Recognizer

Stores all county names, obtained from the Web
Verifies if the input name is a county name
Essential to matching a county-name element
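
A recognizer of this kind reduces to a dictionary lookup. The sketch below assumes a tiny hard-coded sample of county names and a made-up output format matching the other learners (element name to confidence); a real recognizer would load the full list.

```python
# Tiny illustrative sample; the recognizer on the slide stores all county names.
COUNTY_NAMES = {"king", "pierce", "snohomish", "travis"}

def recognize_county(value):
    """Return confidence 1.0 if the value is a known county name, else 0.0."""
    name = value.strip().lower()
    if name.endswith(" county"):
        name = name[: -len(" county")]
    confidence = 1.0 if name in COUNTY_NAMES else 0.0
    return {"county-name": confidence}

print(recognize_county(" King County "))
print(recognize_county("Seattle, WA"))
```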

Meta-Learner: Stacking

Training
– uses training data to learn weights
– one for each (base learner, mediated-schema element)
Combining predictions
– for each mediated-schema element
  – computes weighted sum of base-learner confidence scores
– picks mediated-schema element with highest sum
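
The combination step can be sketched directly: one weight per (base learner, mediated-schema element), a weighted sum per element, and an argmax. The weights and confidence scores below are invented for illustration; how LSD learns the weights is not shown on the slide and is not attempted here.

```python
# Hypothetical learned weights: one per (base learner, mediated-schema element).
WEIGHTS = {
    ("name_matcher", "address"): 0.3, ("name_matcher", "description"): 0.4,
    ("naive_bayes", "address"): 0.7, ("naive_bayes", "description"): 0.6,
}

def combine(predictions, weights):
    """Weighted sum of base-learner confidences per element, then argmax."""
    labels = {label for _, label in weights}
    sums = {
        label: sum(weights.get((learner, label), 0.0) * scores.get(label, 0.0)
                   for learner, scores in predictions.items())
        for label in labels
    }
    return max(sums, key=sums.get), sums

# Invented base-learner outputs for one source element.
predictions = {
    "name_matcher": {"address": 0.2, "description": 0.8},
    "naive_bayes": {"address": 0.8, "description": 0.2},
}
label, sums = combine(predictions, WEIGHTS)
print(label, sums)
```

Here the name matcher prefers description, but Naive Bayes carries more weight for address, so the combined prediction is address (0.3·0.2 + 0.7·0.8 = 0.62 against 0.4·0.8 + 0.6·0.2 = 0.44).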

Experiments

Sources               Coverage    # of Matchable Leaf Elements   Best Single Learner   LSD
realestate.yahoo      USA         31                             63%                   77%
homeseekers.com       USA         31                             52%                   64%
nkymls.com            Kentucky    28                             64%                   75%
texasproperties.com   Texas       42                             59%                   62%
windermere.com        Northwest   35                             55%                   63%
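
The gain of LSD over the best single learner can be read off the table with a few lines of arithmetic. The figures are copied from the table; the unweighted mean across sources is a summary chosen for this sketch, not one reported on the slide.

```python
# (best single learner %, LSD %) per source, copied from the table above.
RESULTS = {
    "realestate.yahoo": (63, 77),
    "homeseekers.com": (52, 64),
    "nkymls.com": (64, 75),
    "texasproperties.com": (59, 62),
    "windermere.com": (55, 63),
}

# Percentage-point gain of LSD over the best single learner, per source.
gains = {source: lsd - best for source, (best, lsd) in RESULTS.items()}

# Unweighted means across the five sources.
mean_best = sum(best for best, _ in RESULTS.values()) / len(RESULTS)
mean_lsd = sum(lsd for _, lsd in RESULTS.values()) / len(RESULTS)

print(gains)
print(mean_best, mean_lsd)
```

LSD is ahead on every source, by 3 to 14 percentage points, and the unweighted means are 58.6% for the best single learner and 68.2% for LSD.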

Reasons for Incorrect Matchings

Unfamiliarity
– suburb
– solution: add a suburb-name recognizer
Insufficient information
– correctly identified the general type
– failed to pinpoint the exact type
– <agent-name> Richard Smith </agent-name> <phone> (206) 234 5412 </phone>
– solution: add a proximity learner

Experiments: Summary

Multi-strategy learning
– better performance than any single learner
Accuracy of 100% unlikely to be reached
– difficult even for humans
Lots of room for improvement
– more learners
– better learning algorithms

Related Work

Rule-based approaches
– TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99], [Palopoli et al. 98]
– utilize only schema information
Learner-based approaches
– SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95]
– employ a single learner, limited applicability

Future Work

[Figure: source descriptions comprise schema matching, data translation, scope, completeness, reliability, and query capability. Two further distinctions are shown: leaf elements vs. higher-level elements, and 1-1 mappings vs. complex mappings.]

Future Work

Improve matching accuracy
– more learners, more domains
Incorporate domain knowledge
– semantic integrity constraints
– concept hierarchy of mediated-schema elements
Learn with structured data

Learning with Structured Data

Each example with >1 level of structure
Generative model for XML
XML classifier
XML: “killer app” for relational learning

Summary

Schema matching
– automated by learning
Multi-strategy learning is essential
– handles different types of data
– incorporates different types of domain knowledge
– easy to incorporate new learners
– alleviates effects of noise & dirty data
Implemented LSD
– promising results with initial experiments