Recognizing Ontology-Applicable Multiple-Record Web Documents
![Page 1: Recognizing Ontology-Applicable Multiple-Record Web Documents](https://reader035.fdocuments.us/reader035/viewer/2022081516/56812dd2550346895d93171e/html5/thumbnails/1.jpg)
Recognizing Ontology-Applicable Multiple-Record Web Documents
David W. Embley
Dennis Ng
Li Xu
Brigham Young University
![Page 2: Recognizing Ontology-Applicable Multiple-Record Web Documents](https://reader035.fdocuments.us/reader035/viewer/2022081516/56812dd2550346895d93171e/html5/thumbnails/2.jpg)
Problem: Recognizing Applicable Documents
Document 1: Car Ads
Document 2: Items for Sale or Rent
![Page 3: Recognizing Ontology-Applicable Multiple-Record Web Documents](https://reader035.fdocuments.us/reader035/viewer/2022081516/56812dd2550346895d93171e/html5/thumbnails/3.jpg)
A Conceptual Modeling Solution
![Page 4: Recognizing Ontology-Applicable Multiple-Record Web Documents](https://reader035.fdocuments.us/reader035/viewer/2022081516/56812dd2550346895d93171e/html5/thumbnails/4.jpg)
Car-Ads Ontology
Car [->object];
Car [0:0.975:1] has Year [1:*];
Car [0:0.925:1] has Make [1:*];
Car [0:0.908:1] has Model [1:*];
Car [0:0.45:1] has Mileage [1:*];
Car [0:2.1:*] has Feature [1:*];
Car [0:0.8:1] has Price [1:*];
PhoneNr [1:*] is for Car [1:1.15:*];
Year matches [4]
constant {extract "\d{2}";
context "([^\$\d]|^)[4-9]\d,[^\d]";
substitute "^" -> "19"; },
…
End;
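The Year rule above can be read as: find a two-digit number in a year-like context, then prefix it with "19". A minimal Python sketch of that reading follows. The function name `extract_years` is hypothetical, and the exact semantics of `extract`, `context`, and `substitute` in the ontology language are an assumption here, not taken from the slides.

```python
import re

# Patterns copied from the Year rule on the slide.
CONTEXT = re.compile(r"([^\$\d]|^)[4-9]\d,[^\d]")
EXTRACT = re.compile(r"\d{2}")


def extract_years(text):
    """Return four-digit years for two-digit years found in context.

    Assumed reading of the rule: the context pattern locates a candidate,
    the extract pattern pulls the two digits out of it, and the substitute
    step prepends "19" to the extracted value.
    """
    years = []
    for ctx in CONTEXT.finditer(text):
        digits = EXTRACT.search(ctx.group(0))
        if digits:
            years.append("19" + digits.group(0))
    return years
```

Under this reading, a price such as "$85, firm" is skipped because the context pattern rejects a leading "$".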
![Page 5: Recognizing Ontology-Applicable Multiple-Record Web Documents](https://reader035.fdocuments.us/reader035/viewer/2022081516/56812dd2550346895d93171e/html5/thumbnails/5.jpg)
Recognition Heuristics
• H1: Density
• H2: Expected Values
• H3: Grouping
![Page 6: Recognizing Ontology-Applicable Multiple-Record Web Documents](https://reader035.fdocuments.us/reader035/viewer/2022081516/56812dd2550346895d93171e/html5/thumbnails/6.jpg)
H1: Density
Document 1: Car Ads
Document 2: Items for Sale or Rent
![Page 7: Recognizing Ontology-Applicable Multiple-Record Web Documents](https://reader035.fdocuments.us/reader035/viewer/2022081516/56812dd2550346895d93171e/html5/thumbnails/7.jpg)
H1: Density
• Car Ads
– Number of Matched Characters: 626
– Total Number of Characters: 2048
– Density: 0.306
• Items for Rent or Sale
– Number of Matched Characters: 196
– Total Number of Characters: 2671
– Density: 0.073
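The density figures above are the ratio of matched characters to total characters. A small sketch of that arithmetic follows. The function name `density` is hypothetical, and how matched characters are counted is assumed to happen elsewhere.

```python
def density(matched_chars, total_chars):
    """Fraction of a document's characters matched by the ontology."""
    if total_chars == 0:
        return 0.0  # treat an empty document as having no matches
    return matched_chars / total_chars


car_ads = density(626, 2048)      # counts from the Car Ads document
sale_items = density(196, 2671)   # counts from the Items for Rent or Sale document
print(round(car_ads, 3), round(sale_items, 3))
```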
![Page 8: Recognizing Ontology-Applicable Multiple-Record Web Documents](https://reader035.fdocuments.us/reader035/viewer/2022081516/56812dd2550346895d93171e/html5/thumbnails/8.jpg)
H2: Expected Values
Document 1: Car Ads
Year: 3, Make: 2, Model: 3, Mileage: 1, Price: 1, Feature: 15, PhoneNr: 3
Document 2: Items for Sale or Rent
Year: 1, Make: 0, Model: 0, Mileage: 1, Price: 0, Feature: 0, PhoneNr: 4
![Page 9: Recognizing Ontology-Applicable Multiple-Record Web Documents](https://reader035.fdocuments.us/reader035/viewer/2022081516/56812dd2550346895d93171e/html5/thumbnails/9.jpg)
H2: Expected Values
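| Object Set | OV | D1 | D2 |
|---|---|---|---|
| Year | 0.98 | 16 | 6 |
| Make | 0.93 | 10 | 0 |
| Model | 0.91 | 12 | 0 |
| Mileage | 0.45 | 6 | 2 |
| Price | 0.80 | 11 | 8 |
| Feature | 2.10 | 29 | 0 |
| PhoneNr | 1.15 | 15 | 11 |

D1: 0.996
D2: 0.567
[Figure: ov, D1, D2]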
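The slide does not name the measure behind the D1 and D2 scores, but both values are consistent with the cosine of the angle between the OV column and each document's count column. A sketch under that assumption follows. The function name `cosine_similarity` is hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_u * norm_v)


# Rows in order: Year, Make, Model, Mileage, Price, Feature, PhoneNr
ov = [0.98, 0.93, 0.91, 0.45, 0.80, 2.10, 1.15]
d1 = [16, 10, 12, 6, 11, 29, 15]
d2 = [6, 0, 0, 2, 8, 0, 11]

print(round(cosine_similarity(ov, d1), 3))
print(round(cosine_similarity(ov, d2), 3))
```

Because cosine similarity ignores vector length, a long page and a short page with the same proportions of matches score the same.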
![Page 10: Recognizing Ontology-Applicable Multiple-Record Web Documents](https://reader035.fdocuments.us/reader035/viewer/2022081516/56812dd2550346895d93171e/html5/thumbnails/10.jpg)
H3: Grouping (of 1-Max Object Sets)
Document 1: Car Ads
Year Make Model Price Year Model Year Make Model Mileage …
Document 2: Items for Sale or Rent
Year Mileage … Mileage Year Price Price …
![Page 11: Recognizing Ontology-Applicable Multiple-Record Web Documents](https://reader035.fdocuments.us/reader035/viewer/2022081516/56812dd2550346895d93171e/html5/thumbnails/11.jpg)
H3: Grouping
• Car Ads (distinct 1-max object sets per group)
– Year, Year, Make, Model: 3
– Price, Year, Model, Year: 3
– Make, Model, Mileage, Year: 4
– Model, Mileage, Price, Year: 4
– …
– Grouping: 0.875
• Sale Items (distinct 1-max object sets per group)
– Year, Year, Year, Mileage: 2
– Mileage, Year, Price, Price: 3
– Year, Price, Price, Year: 2
– Price, Price, Price, Price: 1
– …
– Grouping: 0.500
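Expected Number in a Group = Ave = 4 (for our example)

$$\text{Grouping} = \frac{\text{Sum of Distinct 1-Max Object Sets in each Group}}{\text{Number of Groups} \times \text{Expected Number in a Group}}$$

Car Ads: $\frac{3+3+4+4}{4 \times 4} = 0.875$
Sale Items: $\frac{2+3+2+1}{4 \times 4} = 0.500$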
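The grouping computation can be sketched in Python as follows. The function name `grouping_measure` is hypothetical. The sketch assumes that matched 1-max values are cut into consecutive groups of the expected size, and it ignores an incomplete trailing group, which is a choice made here rather than something the slides state.

```python
def grouping_measure(sequence, expected_per_group):
    """Sum of distinct object sets per group / (groups * expected size)."""
    n = expected_per_group
    # Consecutive groups of n matched values; drop an incomplete tail.
    groups = [sequence[i:i + n] for i in range(0, len(sequence) - n + 1, n)]
    if not groups:
        return 0.0
    distinct_total = sum(len(set(group)) for group in groups)
    return distinct_total / (len(groups) * n)


# The first four groups shown for each document.
car_ads = ["Year", "Year", "Make", "Model",
           "Price", "Year", "Model", "Year",
           "Make", "Model", "Mileage", "Year",
           "Model", "Mileage", "Price", "Year"]
sale_items = ["Year", "Year", "Year", "Mileage",
              "Mileage", "Year", "Price", "Price",
              "Year", "Price", "Price", "Year",
              "Price", "Price", "Price", "Price"]

print(grouping_measure(car_ads, 4), grouping_measure(sale_items, 4))
```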
![Page 12: Recognizing Ontology-Applicable Multiple-Record Web Documents](https://reader035.fdocuments.us/reader035/viewer/2022081516/56812dd2550346895d93171e/html5/thumbnails/12.jpg)
Combining Heuristics
• Decision-Tree Learning Algorithm C4.5
– (H1, H2, H3, Positive)
– (H1, H2, H3, Negative)
• Training Set
– 20 positive examples
– 30 negative examples (some purposely similar, e.g. classified ads)
• Test Set
– 10 positive examples
– 20 negative examples
![Page 13: Recognizing Ontology-Applicable Multiple-Record Web Documents](https://reader035.fdocuments.us/reader035/viewer/2022081516/56812dd2550346895d93171e/html5/thumbnails/13.jpg)
Car Ads: Rule & Results
• Precision: 100%
• Recall: 91%
• Accuracy: 97%
– Harmonic Mean
– 2/(1/Precision + 1/Recall)
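The harmonic-mean formula above combines precision and recall into a single number. A small sketch of the standard definitions follows. The function names are hypothetical, and the counts in the usage lines are invented for illustration only; they are not the counts behind the results on this slide.

```python
def precision(tp, fp):
    """Fraction of documents labeled positive that really are positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of truly positive documents that were labeled positive."""
    return tp / (tp + fn)


def harmonic_mean(p, r):
    """2 / (1/Precision + 1/Recall), as written on the slide."""
    return 2 / (1 / p + 1 / r)


# Hypothetical confusion counts, NOT taken from the slides.
tp, fp, fn = 8, 2, 2
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, harmonic_mean(p, r))
```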
![Page 14: Recognizing Ontology-Applicable Multiple-Record Web Documents](https://reader035.fdocuments.us/reader035/viewer/2022081516/56812dd2550346895d93171e/html5/thumbnails/14.jpg)
False Negative
![Page 15: Recognizing Ontology-Applicable Multiple-Record Web Documents](https://reader035.fdocuments.us/reader035/viewer/2022081516/56812dd2550346895d93171e/html5/thumbnails/15.jpg)
Obituaries
![Page 16: Recognizing Ontology-Applicable Multiple-Record Web Documents](https://reader035.fdocuments.us/reader035/viewer/2022081516/56812dd2550346895d93171e/html5/thumbnails/16.jpg)
Obituaries: Rule & Results
• Precision: 91%
• Recall: 100%
• Accuracy: 97%
![Page 17: Recognizing Ontology-Applicable Multiple-Record Web Documents](https://reader035.fdocuments.us/reader035/viewer/2022081516/56812dd2550346895d93171e/html5/thumbnails/17.jpg)
False Positive: Missing Person Report
![Page 18: Recognizing Ontology-Applicable Multiple-Record Web Documents](https://reader035.fdocuments.us/reader035/viewer/2022081516/56812dd2550346895d93171e/html5/thumbnails/18.jpg)
Universal Rule
• Precision: 84%
• Recall: 100%
• Accuracy: 93%
![Page 19: Recognizing Ontology-Applicable Multiple-Record Web Documents](https://reader035.fdocuments.us/reader035/viewer/2022081516/56812dd2550346895d93171e/html5/thumbnails/19.jpg)
Additional and Future Work
• Other Approaches
– Naïve Bayes [McCallum96] (accuracy near 90%)
– Logistic Regression [Wang01] (accuracy near 95%)
– Multivariate Analysis with Continuous Random Vectors [Tang01] (accuracy near 100%)
• More Extensive Testing
– Similar documents (motorcycles, wedding announcements, …)
– Accuracy drops to near 87%
– Naïve Bayes drops to near 77%
– Others … ?
• Other Types of Documents
– XML Documents
– Forms and the Hidden Web
– Tables
![Page 20: Recognizing Ontology-Applicable Multiple-Record Web Documents](https://reader035.fdocuments.us/reader035/viewer/2022081516/56812dd2550346895d93171e/html5/thumbnails/20.jpg)
Summary
• Objective: Automatically Recognize Document Applicability
• Approach:
– Conceptual Modeling
– Recognition Heuristics
  • Density
  • Expected Values
  • Grouping
• Result: Accuracy Near 95%
www.deg.byu.edu