Merging Taxonomies
Date posted: 19-Dec-2015
Assertion
Creation and maintenance of large ontologies will require the capability to merge taxonomies.
This problem is similar to the problem of merging e-commerce catalogs.
R. Agrawal, R. Srikant: On Catalog Integration. WWW-10
Catalog Integration Problem
Integrate products from new catalog into master catalog.
[Figure: the Master Catalog (ICs, with categories Logic, Mem., DSP and products a to f) next to the New Catalog (ICs, with categories Cat 1, Cat 2 and products x, y, z).]
Desired Solution
Automatically integrate products:
    little or no effort on part of user
    domain independent
Problem size:
    million products
    thousands of nodes in the hierarchy
How do we do it?
Build classification model (rules) using product descriptions in master catalog.
    Example: If the product description contains "DRAM", the product is likely to be in the "Memory" category.
Use classification model to predict categories for products in the new catalog.
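The two steps above can be sketched in a few lines of Python. This is only an illustration of the "DRAM implies Memory" style of rule, not the classifier used in the talk: the function names, the `min_share` threshold, and the four sample products are all assumptions made for this example.

```python
from collections import Counter, defaultdict

def learn_keyword_rules(master, min_share=0.8):
    """Learn {word: category} rules from (description, category) pairs.

    A word becomes a rule when at least `min_share` of the master-catalog
    products containing it fall in one category (hypothetical threshold).
    """
    word_to_categories = defaultdict(Counter)
    for description, category in master:
        for word in set(description.lower().split()):
            word_to_categories[word][category] += 1
    rules = {}
    for word, counts in word_to_categories.items():
        category, hits = counts.most_common(1)[0]
        if hits / sum(counts.values()) >= min_share:
            rules[word] = category
    return rules

def predict(rules, description):
    """Predict a category for a new-catalog product by majority vote of matching rules."""
    words = set(description.lower().split())
    votes = Counter(rules[w] for w in words if w in rules)
    return votes.most_common(1)[0][0] if votes else None

# Made-up master catalog entries, for illustration only.
master = [
    ("16M DRAM module", "Memory"),
    ("synchronous DRAM chip", "Memory"),
    ("quad NAND gate", "Logic"),
    ("hex inverter gate", "Logic"),
]
rules = learn_keyword_rules(master)
print(predict(rules, "64M DRAM"))
```

A product whose description matches no rule gets `None` here; a probabilistic classifier such as the Naive Bayes model described later in the deck would instead return a score for every category.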
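[Figure: product x from the new catalog scored against the master categories Logic and DSP, with predicted probabilities of 5% and 95%.]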
National Semiconductor Files
Part: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Part_Id: DS14185
Manufacturer: national
Title: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Description: The DS14185 is a three driver, five receiver device which conforms to the EIA/TIA-232-E standard. The flow-through pinout facilitates simple non-crossover board layout. The DS14185 provides a one-chip solution for the common 9-pin serial RS-232 interface between data terminal and data communications equipment.
Part: LM3940 1A Low Dropout Regulator
Part: Wide Adjustable Range PNP Voltage Regulator
Part: LM2940/LM2940C 1A Low Dropout Regulator
...
...
...
National Semiconductor Files with Categories
Part: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
Pangea Category:
    Choice 1: Transceiver
    Choice 2: Line Receiver
    Choice 3: Line Driver
    Choice 4: General-Purpose Silicon Rectifier
    Choice 5: Tapped Delay Line
Part: LM3940 1A Low Dropout Regulator
Pangea Category:
    Choice 1: Positive Fixed Voltage Regulator
    Choice 2: Voltage-Feedback Operational Amplifier
    Choice 3: Voltage Reference
    Choice 4: Voltage-Mode SMPS Controller
    Choice 5: Positive Adjustable Voltage Regulator
...
...
Accuracy on Pangea Data
B2B Portal for electronic components:
    1200 categories, 40K training documents.
    500 categories with < 5 documents.
Accuracy:
    72% for top choice.
    99.7% for top 5 choices.
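The "top choice" and "top 5 choices" figures are top-k accuracies: a prediction counts as correct when the true category appears among the first k ranked choices. A minimal sketch of that measure follows. The function name is an assumption, and the "true" labels in the example are hypothetical, since the slides do not say which choice was correct for either part.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of items whose true label is among the first k ranked choices."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)

# Ranked choices in the style of the Pangea examples above (truncated to three).
ranked = [
    ["Transceiver", "Line Receiver", "Line Driver"],
    ["Voltage Reference", "Positive Fixed Voltage Regulator", "Voltage-Mode SMPS Controller"],
]
# Hypothetical ground truth, invented for this illustration only.
truth = ["Transceiver", "Positive Fixed Voltage Regulator"]

print(top_k_accuracy(ranked, truth, k=1))
print(top_k_accuracy(ranked, truth, k=2))
```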
Enhanced Algorithm
Use affinity information in catalog to be integrated:
    Products in same category are similar.
    Bias the classifier to incorporate this information.
Accuracy boost depends on quality of current catalog:
    Use tuning set to determine amount of bias.
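One plausible way to "use a tuning set to determine the amount of bias" is a simple search over candidate weights, keeping the one with the best accuracy on the tuning set. The sketch below assumes exactly that; the slides do not spell out the selection procedure. `choose_weight` and `tuning_accuracy` are hypothetical names, and the accuracy values are made up purely to exercise the function.

```python
def choose_weight(candidate_weights, tuning_accuracy):
    """Return (weight, accuracy) for the candidate that scores best on the tuning set.

    `tuning_accuracy` is any callable mapping a weight to the accuracy of the
    biased classifier on the tuning set (assumed to be supplied by the caller).
    """
    best_weight, best_accuracy = None, float("-inf")
    for w in candidate_weights:
        accuracy = tuning_accuracy(w)
        if accuracy > best_accuracy:
            best_weight, best_accuracy = w, accuracy
    return best_weight, best_accuracy

# Made-up tuning-set accuracies, NOT results from the talk.
illustrative_accuracy = {1: 0.72, 2: 0.75, 5: 0.80, 10: 0.83, 25: 0.82, 50: 0.79}

best = choose_weight(sorted(illustrative_accuracy), illustrative_accuracy.get)
print(best)
```

On ties this sketch keeps the smallest weight, which is the more conservative amount of bias.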
Empirical Results
[Figure: % errors (axis 0 to 20) versus purity (no. of classes and their distribution: 71-22-6, 79-21, 100), comparing the Standard and Enhanced classifiers.]
Improvement in Accuracy (Pangea)
[Figure: accuracy (axis 65 to 100) versus weight (1, 2, 5, 10, 25, 50, 100, 200). Series: Perfect, 90-10, 80-20, GaussianA, GaussianB, Base.]
Improvement in Accuracy (Reuters)
[Figure: accuracy (axis 82 to 100) versus weight (1, 2, 5, 10, 25, 50, 100, 200). Series: Perfect, 90-10, 80-20, GaussianA, GaussianB, Base.]
Improvement in Accuracy (Google.Outdoors)
[Figure: accuracy (axis 50 to 100) versus weight (1, 5, 25, 100, 400, 1000). Series: Perfect, 90-10, 80-20, GaussianA, GaussianB, Base.]
Tune Set Size (Pangea)
[Figure: accuracy (axis 70 to 95) versus tune set size (0, 5, 10, 20, 35, 50). Series: Perfect, 90-10, 80-20, GaussianA, GaussianB, Base.]
Similar results for Reuters and Google.
Summary
The catalog integration technology can be directly used for creating and evolving large taxonomies.
See WWW-2000 paper for experimental results on merging Yahoo and Google categorizations.
Naive Bayes Classifier
Estimates the probability of a product belonging to a class:
$$\Pr(\text{class} \mid \text{product}) = \frac{\Pr(\text{class}) \cdot \Pr(\text{product} \mid \text{class})}{\Pr(\text{product})}$$
Pr(class): # products in class / total products
Pr(product): same for all classes ($\sum_{\text{classes}} \Pr(\text{class}) \cdot \Pr(\text{product} \mid \text{class})$)
How to compute Pr(product | class)?
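As a worked instance of the rule on this slide, the sketch below computes Pr(class) from class sizes and then normalizes Pr(class) * Pr(product | class) over all classes. The function names, the class sizes, and the likelihood values are assumptions chosen only to make the arithmetic visible.

```python
def class_priors(class_counts):
    """Pr(class) = # products in class / total products."""
    total = sum(class_counts.values())
    return {c: n / total for c, n in class_counts.items()}

def posterior(priors, likelihoods):
    """Pr(class | product) via Bayes' rule, given Pr(product | class) per class."""
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    evidence = sum(joint.values())  # Pr(product), the same for every class
    return {c: joint[c] / evidence for c in joint}

# Illustrative numbers only: 30 Memory products, 70 Logic products,
# and assumed values of Pr(product | class) for one new product.
priors = class_priors({"Memory": 30, "Logic": 70})
post = posterior(priors, {"Memory": 0.02, "Logic": 0.001})
print(post)
```

Because Pr(product) is identical for every class, ranking classes by the unnormalized product Pr(class) * Pr(product | class) gives the same ordering; the division only matters when calibrated probabilities are needed.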
Naive Bayes Classifier (cont.)
$$\Pr(\text{product} \mid \text{class}) = \Pr(\text{product description} \mid \text{class}) \cdot \Pr(\text{product attributes} \mid \text{class})$$
$$= \prod_{\text{words in description}} \Pr(\text{word} \mid \text{class}) \cdot \prod_{\text{attributes}} \Pr(A_i = v_k \mid \text{class})$$
assumption: words, attributes occur independently
$$\Pr(\text{word} \mid \text{class}) = \frac{n + \lambda}{t + \lambda \cdot |\text{Vocabulary}|}$$
n: number of times word occurs in class
t: total number of words in class
$\lambda$: smoothing parameter
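The word part of Pr(product | class) can be sketched as follows, working in log space so that a product of many small probabilities does not underflow. This is a minimal illustration under stated assumptions: the function name and the `smoothing` parameter are hypothetical, the attribute terms Pr(A_i = v_k | class) are left out for brevity, and the word counts are made up.

```python
import math
from collections import Counter

def word_log_likelihood(words, class_word_counts, vocabulary_size, smoothing=1.0):
    """log of the product over words of (n + s) / (t + s * |Vocabulary|).

    n: times the word occurs in the class; t: total words in the class;
    s: the smoothing amount (an assumed default of 1.0).
    """
    t = sum(class_word_counts.values())
    log_p = 0.0
    for word in words:
        n = class_word_counts.get(word, 0)
        log_p += math.log((n + smoothing) / (t + smoothing * vocabulary_size))
    return log_p

# Made-up per-class word counts over a four-word vocabulary.
memory_counts = Counter({"dram": 3, "module": 1})
logic_counts = Counter({"gate": 3, "nand": 1})
vocabulary_size = len(set(memory_counts) | set(logic_counts))

ll_memory = word_log_likelihood(["dram"], memory_counts, vocabulary_size)
ll_logic = word_log_likelihood(["dram"], logic_counts, vocabulary_size)
print(math.exp(ll_memory), math.exp(ll_logic))
```

The smoothing term keeps a word that never occurs in a class from driving the whole product to zero, which is why "dram" still gets a small nonzero probability under Logic.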
Enhanced Classifier
S: node in new hierarchy
$$\Pr(\text{class} \mid \text{product}, S) = \frac{\Pr(\text{class} \mid S) \cdot \Pr(\text{product} \mid \text{class})}{\Pr(\text{product} \mid S)}$$
Ignore Pr(product | S)
$$\Pr(C_i \mid S) = \frac{\left(|C_i| \cdot \text{Number of products in } S \text{ predicted to be from } C_i\right)^{w}}{\sum_j \left(|C_j| \cdot \text{Number of products in } S \text{ predicted to be from } C_j\right)^{w}}$$
w determines the weight
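The biased prior Pr(C_i | S) can be sketched directly from the formula on this slide. The function and argument names are assumptions, as are the example counts; the code follows the formula as written on the slide, with the exponent w applied to the whole product.

```python
def enhanced_priors(class_sizes, predicted_in_s, w):
    """Pr(C_i | S) proportional to (|C_i| * #products in S predicted to be from C_i) ** w."""
    scores = {
        c: (class_sizes[c] * predicted_in_s.get(c, 0)) ** w
        for c in class_sizes
    }
    total = sum(scores.values())
    if total == 0:
        # No product in S was predicted into any class; the formula is undefined.
        raise ValueError("no predicted products in S for any class")
    return {c: score / total for c, score in scores.items()}

# Illustrative setup: two equally sized master classes, and a new-catalog node S
# whose ten products were first predicted as 1 Logic and 9 DSP.
class_sizes = {"Logic": 100, "DSP": 100}
predicted_in_s = {"Logic": 1, "DSP": 9}

for w in (0, 1, 2):
    print(w, enhanced_priors(class_sizes, predicted_in_s, w))
```

With this form, w = 0 makes every class equally likely, w = 1 follows the first-pass predictions in S proportionally, and larger w pushes the prior further toward the class that dominates S.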