HAMSTER: USING SEARCH CLICKLOGS FOR SCHEMA AND TAXONOMY MATCHING
Arnab Nandi (Univ. of Michigan), Phil Bernstein (Microsoft Research)
2
Scenario
Arnab Nandi & Phil Bernstein
3
Scenario
Search over structured data: commerce, entertainment
Data onboarding: merge an XML data feed from a 3rd party into the Microsoft data warehouse.
4
Scenario
[Diagram labels: Users, query, results, Search engine + data warehouse, 3rd Party Feed (×4), “Amazon.com”]
• High Precision
• High Recall
• Minimal Human Involvement
5
Example Feed
3rd Party Movie Site (Foreign):
<Movie>
  <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
  <Release Key="Yes">2008</Release>
  <Description>Ever…</Description>
  <RunTime>127</RunTime>
  <Categories>
    <Category>Action</Category>
    <Category>Comedy</Category>
  </Categories>
  <MPAA>PG-13</MPAA>
  <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
  <Persons>
    <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
  </Persons>
</Movie>
Warehouse: Movies (Host):
<MOVIE>
  <MOVIE_ID>57590</MOVIE_ID>
  <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
  <RUNTIME>02:00</RUNTIME>
  <GENRE1>Action/Adventure</GENRE1>
  <GENRE2/>
  <MPAA>NR</MPAA>
  <ADVISORY/>
  <URL>http://www.indianajones.com/</URL>
  <ACTOR1>Harrison Ford</ACTOR1>
  <ACTOR2>Karen Allen</ACTOR2>
</MOVIE>
6
Schema Matching
3rd Party Movie Site (Foreign):
<Movie>
  <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
  <Release Key="Yes">2008</Release>
  <Description>Ever…</Description>
  <RunTime>127</RunTime>
  <Categories>
    <Category>Action</Category>
    <Category>Comedy</Category>
  </Categories>
  <MPAA>PG-13</MPAA>
  <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
  <Persons>
    <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
  </Persons>
</Movie>
Warehouse: Movies (Host):
<MOVIE>
  <MOVIE_ID>57590</MOVIE_ID>
  <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
  <RUNTIME>02:00</RUNTIME>
  <GENRE1>Action/Adventure</GENRE1>
  <GENRE2/>
  <RATING>NR</RATING>
  <ADVISORY/>
  <URL>http://www.indianajones.com/</URL>
  <ACTOR1>Harrison Ford</ACTOR1>
  <ACTOR2>Karen Allen</ACTOR2>
</MOVIE>
From      To
Movie     MOVIE
Title     MOVIE_NAME
RunTime   RUNTIME
Category  GENRE*
MPAA      RATING
Person    ACTOR*
7
Taxonomy Matching
(Same example feed and warehouse record as on the Schema Matching slide.)
From     To
Action   Action/Adventure
PG-13    NR
R        R
8
Various Problems
Badly normalized…
Unit conversion…
Formatting choices…
In-band signaling…
Arbitrary labels
Non-standard vocabulary / language
Zero documentation
Not enough instances
9
Unlike conventional matching…
We have web search click data
For both Warehouse & 3rd party website
The databases we are integrating (usually) have a presence on the web
Why not use click data as a feature for schema & taxonomy matching?
[Diagram labels: Users, query, results, Search engine + data warehouse, 3rd Party Feed]
10
Outline
Scenario
Using Clicklogs
  Core idea
  Using Query Distributions
  Example
  System Architecture
Results
11
Core idea
“If two (sets of) products are searched for by similar queries, then they are similar”
[Diagram labels: Small laptop, Web Search]
12
Core idea
[Diagram labels: Clicklog; Warehouse: Small Laptops, Pro. Laptops; Asus.com: hardware eee; eee ::: small laptops; Small laptop (×3); X, Y, Z]
13
Query Distributions
[Bar chart: click count (axis 0 to 50) per query: small laptop, netbook, hp mini 1000, hp mini]
14
Mapping to Taxonomy
Map URL to product, which belongs to taxonomy
http://www.amazon.com/dp/B001JTA59C
Shopping | Electronics | Netbooks
3rd party DB (provided to us)
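The URL-to-taxonomy lookup on this slide could be sketched as follows. This is only an illustration: the `/dp/<id>` URL shape is taken from the single example URL above, and `PRODUCT_TAXONOMY` is a hypothetical stand-in for the 3rd party DB, whose real format is not described in the talk.

```python
import re
from typing import Optional, Tuple

# Hypothetical stand-in for the 3rd party DB: product ID -> taxonomy path.
PRODUCT_TAXONOMY = {
    "B001JTA59C": ("Shopping", "Electronics", "Netbooks"),
}

# Assumption: the product ID is the path segment that follows "/dp/".
_PRODUCT_ID = re.compile(r"/dp/([A-Z0-9]+)")


def url_to_category(url: str) -> Optional[Tuple[str, ...]]:
    """Return the taxonomy path of the product a URL points at, or None."""
    match = _PRODUCT_ID.search(url)
    if match is None:
        return None  # not a recognizable product page
    return PRODUCT_TAXONOMY.get(match.group(1))


print(url_to_category("http://www.amazon.com/dp/B001JTA59C"))
```

Clicks on URLs that do not resolve to a known product simply return `None` here and would be skipped by whatever consumes the result.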
15
Aggregating Query Distributions
[Diagram: per-page query distribution bar charts (click count axis 0 to 50) aggregated per category. Labels: Warehouse: Small Laptops, Pro. Laptops; Asus.com: hardware eee; eee ::: small laptops]
17
Generating Correspondences
Goal: To match two schema elements (or categories), determine if they have the same distribution of queries searching for them.
Process
  For each page (URL)
    Identify query distribution
    Identify category / schema element of that page
  For each category / schema element C
    Aggregate over pages in C to get query distribution
  For each foreign category / schema element
    Find host category / schema element with most similar query distribution
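The last step of the process (pick the most similar host element for each foreign element) might look like the sketch below. It assumes the per-element query distributions are already available as dictionaries from query text to click share. `shared_mass` is a deliberately simple placeholder similarity, not the metric used in the talk, and all names and numbers are made up for illustration.

```python
from typing import Callable, Dict

QueryDist = Dict[str, float]  # query text -> share of clicks


def shared_mass(host_qd: QueryDist, foreign_qd: QueryDist) -> float:
    """Placeholder similarity (an assumption): click share on identical queries."""
    return sum(min(p, foreign_qd[q]) for q, p in host_qd.items() if q in foreign_qd)


def generate_correspondences(
    foreign: Dict[str, QueryDist],
    host: Dict[str, QueryDist],
    similarity: Callable[[QueryDist, QueryDist], float] = shared_mass,
) -> Dict[str, str]:
    """Map each foreign element to the host element with the most similar distribution."""
    correspondences = {}
    for f_name, f_qd in foreign.items():
        best = max(host, key=lambda h_name: similarity(host[h_name], f_qd))
        correspondences[f_name] = best
    return correspondences


# Toy data, invented for this example.
host = {
    "Cameras": {"digital camera": 0.8, "dslr": 0.2},
    "Phones": {"smartphone": 1.0},
}
foreign = {"Photo": {"dslr": 0.5, "digital camera": 0.5}}
print(generate_correspondences(foreign, host))
```

Passing the similarity as a parameter keeps the matching loop independent of whichever distribution metric is chosen.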
18
Outline
Scenario
Using Clicklogs
  Core idea
  Using Query Distributions
  Example
  System Architecture
Results
19
Example: Taxonomy Matching
query          freq  url
laptop         70    http://searchengine.com/product/macbookpro
laptop         25    http://searchengine.com/product/mininote
laptop         5     http://asus.com/eeepc
netbook        5     http://searchengine.com/product/macbookpro
netbook        20    http://searchengine.com/product/mininote
netbook        15    http://asus.com/eeepc
cheap netbook  5     http://asus.com/eeepc
[Diagram labels: Warehouse: Small Laptops; Warehouse: Professional Laptops; eee]
20
Example: Taxonomy Matching
Warehouse: Small Laptops: “laptop”: 25/45, “netbook”: 20/45
Warehouse: Professional Laptops: “laptop”: 70/75, “netbook”: 5/75
eee: “laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25
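As a rough check on these fractions, the sketch below aggregates the example clicklog per category and normalizes the counts. The URL-to-category assignment is inferred from the numbers on the slide rather than stated in the talk. The last query is spelled “cheap laptop” here, following the eee distribution above; the clicklog table spells it “cheap netbook”.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# (query, click count, url) rows of the example clicklog.
CLICKLOG = [
    ("laptop", 70, "http://searchengine.com/product/macbookpro"),
    ("laptop", 25, "http://searchengine.com/product/mininote"),
    ("laptop", 5, "http://asus.com/eeepc"),
    ("netbook", 5, "http://searchengine.com/product/macbookpro"),
    ("netbook", 20, "http://searchengine.com/product/mininote"),
    ("netbook", 15, "http://asus.com/eeepc"),
    ("cheap laptop", 5, "http://asus.com/eeepc"),
]

# Assumed page -> category assignment (inferred from the slide's totals).
URL_CATEGORY = {
    "http://searchengine.com/product/macbookpro": "Warehouse: Professional Laptops",
    "http://searchengine.com/product/mininote": "Warehouse: Small Laptops",
    "http://asus.com/eeepc": "eee",
}


def query_distributions(clicklog, url_category):
    """Sum clicks per (category, query), then normalize within each category."""
    counts = defaultdict(Counter)
    for query, freq, url in clicklog:
        counts[url_category[url]][query] += freq
    return {
        category: {q: Fraction(n, sum(c.values())) for q, n in c.items()}
        for category, c in counts.items()
    }


distributions = query_distributions(CLICKLOG, URL_CATEGORY)
for category, qd in distributions.items():
    print(category, {q: str(share) for q, share in qd.items()})
```

`Fraction` keeps the shares exact, although it prints them in lowest terms (25/45 appears as 5/9).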
21
Distribution Similarity Metric
$$\sum_{\text{all } (q_{host},\, q_{foreign}) \text{ combinations}} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})$$
22
Example: Taxonomy Matching
Warehouse: Small Laptops: “laptop”: 25/45, “netbook”: 20/45
Warehouse: Professional Laptops: “laptop”: 70/75, “netbook”: 5/75
eee: “laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25
“small laptops” vs “eee”, as Jaccard × MinFreq per query pair:
  laptop vs laptop: 1 × min(25/45, 5/25) = 1 × (5/25)
  netbook vs netbook: 1 × min(20/45, 15/25) = 1 × (20/45)
  laptop vs cheap laptop: 0.5 × min(25/45, 5/25) = 0.5 × (5/25)
  1 × (5/25) + 1 × (20/45) + 0.5 × (5/25) = 0.74
Scores against eee: Small Laptops 0.74, Professional Laptops 0.31
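A minimal sketch of this score, assuming Jaccard is computed over the words of the two queries and MinFreq is the smaller of the two click shares. Neither definition is spelled out in the talk, but under these assumptions the 0.74 for Small Laptops vs eee is reproduced.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard: shared words divided by all distinct words."""
    wa, wb = set(a.split()), set(b.split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def similarity(host_qd: dict, foreign_qd: dict) -> float:
    """Sum Jaccard x MinFreq over every (host query, foreign query) pair."""
    return sum(
        jaccard(qh, qf) * min(ph, pf)  # MinFreq assumed to be min of the shares
        for qh, ph in host_qd.items()
        for qf, pf in foreign_qd.items()
    )


SMALL = {"laptop": 25 / 45, "netbook": 20 / 45}
PRO = {"laptop": 70 / 75, "netbook": 5 / 75}
EEE = {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25}

print(round(similarity(SMALL, EEE), 2))
print(round(similarity(PRO, EEE), 2))
```

Professional Laptops scores lower than Small Laptops, as on the slide. Under these assumptions the sketch gives about 0.37 for that pair rather than the 0.31 shown, and the slide does not show the arithmetic behind its figure.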
23
Advantages of Clicklogs
Resilient to language
Resilient to new domains, data, and features
  As long as people query & click, we have data to learn from
Generates mappings previous methods can’t
  Electronics ▷ Electronics Features ▷ Brands ▷ Texas Instruments ≈ Office Products ▷ Office Machines ▷ Calculators
  Software ▷ Categories ▷ Programming ▷ Programming Languages ▷ Visual Basic ≈ Software ▷ Developer Tools
24
System Design
25
Outline
Scenario
Using Clicklogs
  Core idea
  Using Query Distributions
  Example
  System Architecture
Results
26
Experimenting with Click Logs
Commercial warehouse mapping, 258 products
  from a 70,000 term Amazon.com taxonomy (613 in gold)
  to a 6,000 term warehouse taxonomy (40 in gold)
Live.com (now Bing.com) search querylog
  Amazon to warehouse mapping task, consecutively halving the clicklog size used
  1.8 million clicks to Amazon.com product pages
  Typically each product had a query distribution averaging 13 unique (i.e., different) search queries (min 1, max 181, stdev 22).
27
Summary of Results
90% precision / recall possible
Query distribution is a good similarity metric
Bigger clicklogs imply better recall
Technique isn't very sensitive to similarity metric
28
Precision / Recall
Commercial warehouse mapping, 258 products, from a 70K term Amazon.com taxonomy to a 6,000 term warehouse taxonomy (613 categories used)
[Plot: Precision (y axis, 0 to 1) vs. Recall (x axis, 0 to 1) for four matchers: Instance-based, Query Distribution, Consensus, Name-based]
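For reference, precision and recall of a proposed mapping against a gold mapping can be computed as in the sketch below. The talk does not describe its evaluation protocol, so this uses the standard definitions, and the toy mappings are invented for illustration.

```python
from typing import Dict, Tuple


def precision_recall(predicted: Dict[str, str], gold: Dict[str, str]) -> Tuple[float, float]:
    """Precision = correct / proposed, recall = correct / gold entries."""
    correct = sum(1 for src, tgt in predicted.items() if gold.get(src) == tgt)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical mappings: 3 proposed, 2 of them right, 4 in the gold standard.
gold = {"A": "x", "B": "y", "C": "z", "D": "w"}
predicted = {"A": "x", "B": "y", "C": "q"}
print(precision_recall(predicted, gold))
```

A matcher that only proposes a correspondence when its score is high trades recall for precision, which is what a precision/recall curve sweeps over.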
29
Summary of Results
90% precision / recall possible
Query distribution is a good similarity metric
Bigger clicklogs imply better recall
Technique isn't very sensitive to similarity metric
30
Match Quality
Correct matches:
                      Amazon Products  Amazon Categories  Warehouse Products  Warehouse Categories
Amazon Products       257/258          241/258            189/258 (73%)       226/258
Amazon Categories                      373/613            204/400             525/613 (85%)
Warehouse Products                                        392/400             383/400
Warehouse Categories                                                          40/40
QDs are unique to entities
QDs are unique to aggregate classes
QDs of entities are closest to the distributions of their aggregate classes
QDs of similar aggregates are similar
31
Summary of Results
90% precision / recall possible
Query distribution is a good similarity metric
Bigger clicklogs imply better recall
Technique isn't very sensitive to similarity metric
32
Varying Clicklog Size
Successively decreased clicklog size by half
Recall decreases as clicklog size is decreased
[Plot: Precision (y axis, 0.65 to 0.95) vs. Recall (x axis, 0 to 0.7) for Items and Categories; points labeled by clicklog fraction: 1/32, ¼, ½, Full Log]
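The halving experiment could be set up along these lines. The talk does not say how each half was chosen, so uniform random sampling without replacement is an assumption here, as are the function and parameter names.

```python
import random
from typing import List, Sequence


def halved_clicklogs(clicklog: Sequence, halvings: int, seed: int = 0) -> List[list]:
    """Return the full log followed by successively halved random subsets."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    current = list(clicklog)
    logs = [current]
    for _ in range(halvings):
        # Assumption: each smaller log is sampled from the previous one.
        current = rng.sample(current, len(current) // 2)
        logs.append(current)
    return logs


# Five halvings take the full log down to 1/32 of its size.
logs = halved_clicklogs(list(range(64)), halvings=5)
print([len(log) for log in logs])
```

Sampling each log from the previous one keeps the subsets nested, so any loss in recall comes from having fewer clicks rather than from a different random draw.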
33
Summary of Results
90% precision / recall possible
Query distribution is a good similarity metric
Bigger clicklogs imply better recall
Technique isn't very sensitive to similarity metric
34
Comparing Query Distributions
$$\sum_{\text{all } (q_{host},\, q_{foreign}) \text{ combinations}} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})$$
Replace Jaccard with various phrase similarity metrics
Minimal difference, due to the short length of most queries
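To illustrate swapping the phrase metric, the sketch below makes it a parameter and compares word-level Jaccard with a word-level Dice coefficient on the laptop example from earlier in the talk. Dice is just one possible substitute chosen for illustration; the talk does not list which metrics were tried. MinFreq is assumed to be the smaller of the two click shares.

```python
def _words(query: str) -> set:
    return set(query.split())


def jaccard(a: str, b: str) -> float:
    wa, wb = _words(a), _words(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def dice(a: str, b: str) -> float:
    wa, wb = _words(a), _words(b)
    total = len(wa) + len(wb)
    return 2 * len(wa & wb) / total if total else 0.0


def qd_similarity(host_qd: dict, foreign_qd: dict, phrase_sim=jaccard) -> float:
    """Sum phrase_sim x min(click shares) over all query pairs."""
    return sum(
        phrase_sim(qh, qf) * min(ph, pf)
        for qh, ph in host_qd.items()
        for qf, pf in foreign_qd.items()
    )


HOST = {
    "Small Laptops": {"laptop": 25 / 45, "netbook": 20 / 45},
    "Professional Laptops": {"laptop": 70 / 75, "netbook": 5 / 75},
}
EEE = {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25}


def best_host(phrase_sim) -> str:
    return max(HOST, key=lambda name: qd_similarity(HOST[name], EEE, phrase_sim))


print(best_host(jaccard), best_host(dice))
```

With one- and two-word queries the two metrics disagree only on partially overlapping pairs, so the scores shift a little while the best match stays the same.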
35
Summary of Results
90% precision / recall possible
Query distribution is a good similarity metric
Bigger clicklogs imply better recall
Technique isn't very sensitive to similarity metric
36
Related + Future Work
Usage Based / Crowdsourcing
  Usage-Based Schema Matching (ICDE 2008). Elmeleegy, H.; Ouzzani, M.; Elmagarmid, A.
  Matching schemas in online communities: A web 2.0 approach (ICDE 2008). R McCann, W Shen, AH Doan
Web Scale Integration
  Web-scale Data Integration: You can only afford to Pay As You Go (CIDR 2007). Jayant Madhavan, Shawn R. Jeffery, Shirley Cohen, Xin (Luna) Dong, David Ko, Cong Yu, Alon Halevy
37
Related + Future Work
“Mixed” methods
  Ontology matching: A machine learning approach (Handbook on Ontologies 2004). A Doan, J Madhavan, P Domingos, A Halevy
  Learning to match the schemas of data sources: A multistrategy approach (Machine Learning Journal 2003). A Doan, P Domingos, A Halevy
  Schema and ontology matching with COMA++ (SIGMOD 2005). D Aumueller, HH Do, S Massmann, E Rahm
38
Conclusion
Unsupervised mapping is possible
  Very high recall / precision when enough queries are present
Click logs are promising
  Finds results that other methods cannot find
  As clicklog size increases, it will produce more mappings
Combinable with existing methods
39
Arnab Nandi & Phil Bernstein
http://arnab.org/contact
http://research.microsoft.com/~philbe/
Questions?