Synthesizing Products For Online Catalogs
![Page 1: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/1.jpg)
Synthesizing Products For Online Catalogs
Hoa Nguyen, Juliana Freire
University of Utah
Ariel Fuxman, Stelios Paparizos, Rakesh Agrawal
Microsoft Research
![Page 2: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/2.jpg)
All major search engine companies provide an offering for Commerce Search
Commerce Search Engines
![Page 3: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/3.jpg)
Commerce Search Engines
Product Catalog
Relevant Products
![Page 4: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/4.jpg)
Building catalogs in a timely fashion is at the heart of the business model of Commerce Search Engines
Economic Importance of Catalogs
Merchant offers
The search engine receives revenue for every click to a merchant offer
If an offer has no matching product, it is dropped and will never receive any click
![Page 5: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/5.jpg)
• Catalogs are currently built from feeds provided by data aggregators, who employ mostly manual techniques
• Manual techniques cannot keep up with the introduction of new products to the market
• No product, no clicks
Building Catalog Today
Our Goal: Automatically build product catalogs
![Page 6: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/6.jpg)
• Catalogs contain structured data about their products
Structured Data
![Page 7: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/7.jpg)
• It enables faceted search
Structured Data Drives Commerce Experience
![Page 8: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/8.jpg)
• It enables the use of structure to improve search
Structured Data Drives Commerce Experience
Our Goal: Add structured data to the Catalog
![Page 9: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/9.jpg)
• Automated construction of product catalogs
– End-to-end system
– Producing structured product representations
– Scalable to millions of products and thousands of categories
Product Synthesis
![Page 10: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/10.jpg)
• Problems and solutions
– Identifying data sources
– Extracting structured data
– Schema matching
• End-to-end system
• Experimental evaluation
• Conclusion
Outline
![Page 11: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/11.jpg)
• Leverage merchant offer feeds
Identifying Data Sources
Input: Merchant Offers
Output: Synthesized Products
Our System
![Page 12: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/12.jpg)
Offer Feeds Lack Structured Data
Table with offer specification
• Information extraction from merchant landing pages
![Page 13: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/13.jpg)
• Generating one wrapper per merchant does not scale
• Our solution: Use generic wrappers
Information Extraction
Warranty Terms-Parts 1 year
Warranty Terms-Labor 1 year limited
Product Height 2-9/10”
Product Height 4-7/8”
Product Weight 6.1 oz
… …
![Page 14: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/14.jpg)
• Generic wrappers are noisy
• Vocabulary mismatch between catalog and data extracted from merchant pages
• Our solution: Schema matching
Dealing With Data Heterogeneity
Divot Pros: efficient, effective, …
The truth Pros: When it worked …

| Attribute Name | Merchant |
| --- | --- |
| part number | AutoAnything.com |
| mpn | Runtechmedia.com |
| mfg sku number | Number1Direct |
| manufacturer part | Memory Place |
| msku | AppliancesConnection.com |
| part # | MemorySuppliers.com |

![Page 15: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/15.jpg)
Divot Pros: efficient, effective, …
The truth Pros: When it worked …
Schema Matching For Noise Filtering
Screen Size 4.3, 3.5, 4.3, 4.3
Manufacturer Tomtom, Garmin, Magellan, Garmin
Product Catalog
Weight 7.51, 3.8, 6.8, 5.7
Potential Attributes
No overlap with catalog values
![Page 16: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/16.jpg)
• Large-scale schema matching problem
– Thousands of merchants
– Thousands of categories; each merchant-category pair has a different schema
• Our solution: Exploit historical offer-product associations to automatically learn matches
Schema Matching In The Wild
![Page 17: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/17.jpg)
• Problem: Merchants and catalog may have widely different value distributions
Exploiting Historical Associations

Catalog

| Manufacturer | Screen Size | Weight |
| --- | --- | --- |
| Garmin | 4.3" | 4.2 |
| Tom Tom | 3.5" | 6.8 |
| Garmin | 4.3" | 6.1 |
| Magellan | 3.0" | 3.8 |
| Garmin | 5" | 7.8 |

Garmin.com offers

| Description | Brand | Weight |
| --- | --- | --- |
| Garmin Nuvi 3490LMT | Garmin | 4.2 ounces |
| Nuvi 265WT | Garmin | 6.1 ounces |
| Nuvi 1490T | Garmin | 7.8 ounces |

![Page 18: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/18.jpg)
Exploiting Historical Associations
• Match offers to products
• Keep only the offers and catalog products that match each other
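
The filtering step can be sketched as follows, using the Garmin example from the slides. The record layout, the `product_id` field, the ids `p1` to `p5`, and the helper name `restrict_to_matched` are assumptions made for illustration; the slides do not specify how historical offer-product associations are stored.

```python
def restrict_to_matched(catalog, offers):
    """Keep only catalog products that have at least one historically matched offer.

    catalog: dict mapping a (hypothetical) product id to its catalog attributes
    offers:  list of offer dicts; 'product_id' is the matched product, or None
    """
    matched_ids = {o["product_id"] for o in offers if o.get("product_id") is not None}
    return {pid: attrs for pid, attrs in catalog.items() if pid in matched_ids}


# Catalog rows from the slide; the ids are made up for this sketch.
catalog = {
    "p1": {"Manufacturer": "Garmin", "Screen Size": '4.3"', "Weight": "4.2"},
    "p2": {"Manufacturer": "Tom Tom", "Screen Size": '3.5"', "Weight": "6.8"},
    "p3": {"Manufacturer": "Garmin", "Screen Size": '4.3"', "Weight": "6.1"},
    "p4": {"Manufacturer": "Magellan", "Screen Size": '3.0"', "Weight": "3.8"},
    "p5": {"Manufacturer": "Garmin", "Screen Size": '5"', "Weight": "7.8"},
}

# Garmin.com offers, each already associated with a catalog product.
garmin_offers = [
    {"Description": "Garmin Nuvi 3490LMT", "Brand": "Garmin", "Weight": "4.2 ounces", "product_id": "p1"},
    {"Description": "Nuvi 265WT", "Brand": "Garmin", "Weight": "6.1 ounces", "product_id": "p3"},
    {"Description": "Nuvi 1490T", "Brand": "Garmin", "Weight": "7.8 ounces", "product_id": "p5"},
]

matched_catalog = restrict_to_matched(catalog, garmin_offers)
```

After filtering, the catalog side contains only Garmin products, so its Manufacturer values line up with the merchant's Brand values instead of being diluted by Tom Tom and Magellan rows.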
![Page 19: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/19.jpg)
• For the tail of merchants, data may be too sparse to construct reliable distributions
• Our Solution: Match at multiple levels of granularity
Overcoming Sparsity
Product Catalog
Does Interface match Connectivity?
Mom&PapGPS has few offers
Mom&PapGPS offers
GPS offers from all merchants
![Page 20: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/20.jpg)
Learning Classifier To Identify Matches
• Compute features for every candidate <Catalog attribute, Merchant Attribute, Merchant, Category>
– Exploit historical associations
– Compute features for multiple granularity levels
• Build a classifier:
– Automatically create training set
– Logistic regression classifier
![Page 21: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/21.jpg)
Classifier Features
• Computed on three types of matching
– Fine grained
$O_{m,c}$: offers of merchant $m$ in category $c$
$P_{m,c}$: products in catalog that match offers in $O_{m,c}$
– Coarse grained, grouped by category
$O_c$: offers in category $c$ (regardless of merchant)
$P_c$: products in catalog that match offers in $O_c$
– Coarse grained, grouped by merchant
$O_m$: offers of merchant $m$ (regardless of category)
$P_m$: products in catalog that match offers in $O_m$
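
A minimal sketch of how the three offer groupings could be built. The field names `merchant` and `category`, the function name `group_offers`, and the sample offers are hypothetical; the slides do not describe the data layout.

```python
from collections import defaultdict


def group_offers(offers):
    """Group offers at the three granularity levels used for feature computation."""
    fine = defaultdict(list)         # O_{m,c}: keyed by (merchant, category)
    by_category = defaultdict(list)  # O_c: keyed by category, regardless of merchant
    by_merchant = defaultdict(list)  # O_m: keyed by merchant, regardless of category
    for offer in offers:
        m, c = offer["merchant"], offer["category"]
        fine[(m, c)].append(offer)
        by_category[c].append(offer)
        by_merchant[m].append(offer)
    return fine, by_category, by_merchant


sample_offers = [
    {"merchant": "Garmin.com", "category": "GPS", "title": "Nuvi 265WT"},
    {"merchant": "Garmin.com", "category": "GPS", "title": "Nuvi 1490T"},
    {"merchant": "Mom&PapGPS", "category": "GPS", "title": "Nuvi 265WT"},
    {"merchant": "Mom&PapGPS", "category": "Cameras", "title": "Lens cap"},
]

fine, by_category, by_merchant = group_offers(sample_offers)
```

A merchant with few offers has a small fine-grained group, but the features computed from its category-level and merchant-level groups still draw on more data.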
![Page 22: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/22.jpg)
Classifier Features
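For each matching of offers $O$ and products $P$, catalog attribute $a_c$, and merchant attribute $a_m$:
– Get bag of words from $a_c$ and $a_m$
– Compute term distributions $p_c$ and $p_m$ from bag of words
– Compute Jensen-Shannon divergence

$$JS(p_c \,\|\, p_m) = \tfrac{1}{2}\, KL(p_c \,\|\, p_A) + \tfrac{1}{2}\, KL(p_m \,\|\, p_A)$$

$$KL(p_c \,\|\, p_A) = \sum_{t} p_c(t) \log \frac{p_c(t)}{p_A(t)}$$

where $p_A = \tfrac{1}{2}(p_c + p_m)$ is the average of the two distributions
– Compute Jaccard coefficient

$$J(a_c, a_m) = \frac{|a_c \cap a_m|}{|a_c \cup a_m|}$$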
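
The two similarity features can be sketched in a few lines. This assumes whitespace tokenization with lowercasing, and it computes the Jaccard coefficient over the sets of distinct terms; both choices are assumptions, since the slides do not say how values are tokenized or which sets the coefficient is taken over. All function names are illustrative.

```python
import math
from collections import Counter


def term_distribution(values):
    """Turn a list of attribute values into a term -> probability mapping."""
    counts = Counter(tok for v in values for tok in v.lower().split())
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}


def kl(p, q):
    """KL(p || q); q must be non-zero wherever p is non-zero."""
    return sum(pt * math.log(pt / q[t]) for t, pt in p.items() if pt > 0)


def js_divergence(p_c, p_m):
    """Jensen-Shannon divergence using the average distribution p_A."""
    terms = set(p_c) | set(p_m)
    p_a = {t: 0.5 * (p_c.get(t, 0.0) + p_m.get(t, 0.0)) for t in terms}
    return 0.5 * kl(p_c, p_a) + 0.5 * kl(p_m, p_a)


def jaccard(values_c, values_m):
    """Jaccard coefficient over the distinct terms of the two attributes."""
    a = {tok for v in values_c for tok in v.lower().split()}
    b = {tok for v in values_m for tok in v.lower().split()}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


catalog_values = ["Garmin", "Garmin", "Garmin"]  # Manufacturer, after filtering
merchant_values = ["Garmin", "Garmin", "Garmin"]  # Brand on Garmin.com
js_same = js_divergence(term_distribution(catalog_values),
                        term_distribution(merchant_values))
```

Because $p_A$ is non-zero wherever $p_c$ or $p_m$ is, the divergence is always finite: it is 0 for identical distributions and $\log 2$ (natural logarithm here) for distributions with no terms in common.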
![Page 23: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/23.jpg)
End-To-End System
[Figure: system architecture in two parts]
OFFLINE LEARNING: Historical Offers, Extraction from tables, Extracted offer data, Product database, Offer-to-product matching, Offers matched to products, Schema Matching Component, Dictionary of schema matches
RUNTIME PRODUCT SYNTHESIS PIPELINE: Unmatched Offers, Extraction from tables, Schema Reconciliation, Offer Clustering, Value Fusion
![Page 33: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/33.jpg)
• Data set obtained from Bing Shopping catalog
• 850K offers from 1100 merchants
• Merchant landing pages fetched using crawler
• 500 leaf-level categories
– Computing products (laptops, hard drives, etc.)
– Cameras (digital cameras, lenses, etc.)
– Home furnishings (bedspreads, home lighting, etc.)
– Kitchen and housewares (air conditioners, dishwashers, etc.)
Experimental Setup: Data Set
![Page 34: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/34.jpg)
• Validate effectiveness of end-to-end system
– What is the quality of synthesized products?
• Drill down into schema matching results
– Understand the effect of using historical associations
– Comparison with state-of-the-art schema matchers
Experimental Goals
![Page 35: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/35.jpg)
• Attribute Precision
– Fraction of correct attribute-value pairs over the total number of extracted pairs
• Attribute Recall
– Fraction of correct attribute-value pairs over the expected number of pairs
• Product Precision
– Fraction of correct products over all products
– A product is correct if all offers and attribute-value pairs are correct
End-to-End System: Metrics
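
The three metrics can be computed as in the sketch below. The record layout (`pairs`, `expected_pairs`, `offers_correct`) and the sample data are invented for illustration and do not come from the evaluation itself.

```python
def evaluate(products):
    """Compute attribute precision, attribute recall, and product precision.

    Each product is a dict with:
      pairs:          list of (attribute, value, is_correct) for extracted pairs
      expected_pairs: number of attribute-value pairs the product should have
      offers_correct: True if every offer clustered into the product belongs to it
    """
    n_extracted = sum(len(p["pairs"]) for p in products)
    n_correct = sum(1 for p in products for _, _, ok in p["pairs"] if ok)
    n_expected = sum(p["expected_pairs"] for p in products)
    # A product counts as correct only if all its offers and all its pairs are correct.
    n_correct_products = sum(
        1 for p in products
        if p["offers_correct"] and all(ok for _, _, ok in p["pairs"])
    )
    return {
        "attribute_precision": n_correct / n_extracted,
        "attribute_recall": n_correct / n_expected,
        "product_precision": n_correct_products / len(products),
    }


sample_products = [
    {"pairs": [("Brand", "Garmin", True), ("Weight", "4.2 ounces", True)],
     "expected_pairs": 3, "offers_correct": True},
    {"pairs": [("Brand", "Garmin", True), ("Weight", "61 ounces", False)],
     "expected_pairs": 2, "offers_correct": True},
]

scores = evaluate(sample_products)
```

Product precision is the strictest of the three: a single wrong attribute-value pair or a single misplaced offer makes the whole product incorrect.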
![Page 36: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/36.jpg)
End-To-End System: Results
• Attribute Precision: 92%
– Out of 1.1M synthesized attribute-value pairs
• Product Precision: 85%
– Out of 280K synthesized products
• Attribute Recall
– Products with >= 10 offers: 66%
– Products with < 10 offers: 47%
Higher recall when there are more offers associated with a product
![Page 37: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/37.jpg)
• Precision
– Fraction of correct matches over the number of extracted matches
• Coverage
– Absolute number of extracted matches
– Higher coverage at the same precision implies higher recall
Schema Matching: Metrics
![Page 38: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/38.jpg)
Benefit Of Matching Step
[Figure: precision (y-axis, 0 to 1) against coverage, the number of correspondences (x-axis, 0 to 50,000), for two curves: "Our approach" and "No matching"]
Offer-to-product matching step improves quality
![Page 39: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/39.jpg)
Comparison To State-Of-Art
[Figure: precision (y-axis, 0 to 1) against coverage, the number of correspondences (x-axis, 0 to 50,000), for six matchers: Our approach, Instance-based Naïve Bayes, DUMAS, Name-based COMA++, Instance-based COMA++, Combined COMA++]
Outperforms state-of-the-art schema matchers
![Page 40: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/40.jpg)
• End-to-end solution for product synthesis
• Schema matching at huge scale
– Thousands of merchants and categories
– Resilient to noisy data from generic extractors
• Experimental evaluation on Bing Shopping data
Conclusions
![Page 41: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/41.jpg)
Thank you!
![Page 44: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/44.jpg)
• Classification based on logistic regression
Schema Matching Component
Probability that candidate <Catalog attribute, Merchant Attribute, Merchant, Category> is a match
Values for features
Importance score of features (computed offline using automatically created training data)
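
A minimal sketch of how a logistic regression model turns feature values and learned importance scores into a match probability. The feature order, the weights, and the bias below are made-up numbers for illustration; in the system they would be learned offline from the automatically created training data.

```python
import math


def match_probability(features, weights, bias=0.0):
    """Logistic regression score: sigmoid of the weighted sum of feature values."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    # Numerically stable sigmoid for both signs of z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical candidate: [JS divergence, Jaccard coefficient] at one granularity level.
features = [0.05, 0.80]
weights = [-4.0, 3.0]  # low divergence and high overlap both push towards "match"
p_match = match_probability(features, weights, bias=-1.0)
```

Thresholding this probability trades precision against coverage, which is how the precision-coverage curves in the evaluation can be traced.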
![Page 45: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/45.jpg)
[Figure: RUNTIME PRODUCT SYNTHESIS PIPELINE. Stages: Unmatched Offers, Extraction from tables, Schema Reconciliation, Offer Clustering, Value Fusion]
• Schema Reconciliation:
– Translate the merchant attribute names into the product attribute names using the extracted attribute correspondences
Runtime Pipeline

| Merchant Attribute | Catalog Attribute |
| --- | --- |
| Operating System@Microwarehouse | OS Provided/Type |
| Platform@Amazon | OS Provided/Type |

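
Schema reconciliation amounts to a dictionary lookup, sketched below. The dictionary is keyed here by (merchant, merchant attribute) as in the table above; the category part of each correspondence is omitted for brevity. Dropping attributes that have no correspondence is an assumption of this sketch, as are the function name and the sample values.

```python
# Learned correspondences from the slide: (merchant, merchant attribute) -> catalog attribute.
SCHEMA_MATCHES = {
    ("Microwarehouse", "Operating System"): "OS Provided/Type",
    ("Amazon", "Platform"): "OS Provided/Type",
}


def reconcile(offer_attributes, merchant, matches=SCHEMA_MATCHES):
    """Rename merchant attribute names to catalog attribute names.

    Attributes without a learned correspondence are dropped, which also
    discards noise picked up by the generic extractor.
    """
    return {
        matches[(merchant, name)]: value
        for name, value in offer_attributes.items()
        if (merchant, name) in matches
    }


# Hypothetical extracted offer data; "Pros" stands in for extraction noise.
extracted = {"Operating System": "Windows 7", "Pros": "efficient, effective"}
reconciled = reconcile(extracted, "Microwarehouse")
```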
![Page 46: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/46.jpg)
• Offer Clustering:
– Group offers of the same product together
– The more offers, the more attributes are synthesized
– Using *Key* catalog attributes (e.g., MPN, UPC):
• Get values from merchant attributes that correspond to the key catalog attributes
• Group offers that have the same values for those key attributes
Runtime Pipeline
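
Clustering on key attributes can be sketched as a group-by. The sketch assumes offers have already been through schema reconciliation, so key attributes appear under their catalog names. It groups on the exact tuple of normalized key values, which is a simplification: an offer carrying only an MPN will not join a cluster whose offers carry both MPN and UPC. The identifiers and sample values are hypothetical.

```python
from collections import defaultdict

KEY_ATTRIBUTES = ("MPN", "UPC")


def _normalize(value):
    """Trim and lowercase a key value so trivial formatting differences do not split clusters."""
    return value.strip().lower() if value is not None else None


def cluster_offers(offers, key_attributes=KEY_ATTRIBUTES):
    """Group offers that share the same values for the key catalog attributes."""
    clusters = defaultdict(list)
    for offer in offers:
        key = tuple(_normalize(offer.get(name)) for name in key_attributes)
        if all(part is None for part in key):
            continue  # no key attribute available, so the offer cannot be clustered
        clusters[key].append(offer)
    return dict(clusters)


reconciled_offers = [
    {"MPN": "MPN-001", "Weight": "4.2 ounces"},
    {"MPN": " mpn-001 ", "Screen Size": '4.3"'},
    {"MPN": "MPN-002", "Weight": "6.1 ounces"},
    {"Weight": "7.8 ounces"},  # no key attribute
]

clusters = cluster_offers(reconciled_offers)
```

Each resulting cluster pools the attributes of several offers, which is why products with more offers end up with more synthesized attributes.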
![Page 47: Synthesizing Products For Online Catalogs](https://reader036.fdocuments.us/reader036/viewer/2022062411/568168ad550346895ddf61a3/html5/thumbnails/47.jpg)
• Value Fusion:
– Generate spec using learned correspondences and centroid computation
Runtime Pipeline
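
The slide only names "centroid computation", so the following is one possible reading rather than the system's actual method: each candidate value of an attribute is treated as a bag of words, the centroid is the mean term-count vector, and the value most similar to the centroid (by cosine similarity) is chosen. The function names and sample offers are hypothetical.

```python
import math
from collections import Counter, defaultdict


def _vector(value):
    """Bag-of-words term counts for one attribute value."""
    return Counter(value.lower().split())


def _cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fuse_values(values):
    """Pick the candidate value closest to the centroid of all candidates."""
    totals = Counter()
    for value in values:
        totals.update(_vector(value))
    centroid = {t: c / len(values) for t, c in totals.items()}
    return max(values, key=lambda v: _cosine(_vector(v), centroid))


def fuse_cluster(offers):
    """Build one product spec from a cluster of reconciled offers."""
    by_attribute = defaultdict(list)
    for offer in offers:
        for name, value in offer.items():
            by_attribute[name].append(value)
    return {name: fuse_values(values) for name, values in by_attribute.items()}


cluster = [
    {"OS Provided/Type": "Windows 7", "Weight": "4.2 ounces"},
    {"OS Provided/Type": "Windows 7 Home"},
    {"OS Provided/Type": "Windows 7", "Weight": "4.2 ounces"},
]

spec = fuse_cluster(cluster)
```

Choosing the value nearest the centroid favors the wording most offers agree on, so a single outlier offer has little influence on the synthesized spec.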