Semiautomatic Generation of Resilient Data-Extraction Ontologies
Yihong Ding
Data Extraction Group, Brigham Young University
Sponsored by NSF
Introduction
• Wrapper-driven data extraction
– Pros: data-source-specified, high performance
– Cons: lack of resiliency and scalability
• Ontology-driven data extraction
– Pros: application-domain-specified, resilient and scalable
– Cons: hard to create
• Objective
– Generating data-extraction ontologies
Generation Architecture
[Figure: generation architecture diagram. Stages: Knowledge Preparation, Application Specification, Domain Allocation (Concept Selection, Relation Retrieval, Constraint Discovery), Ontology Generation, Result Evaluation. Other labels: Knowledge Sources, Integrated Knowledge Base, training documents, test documents, pre-processing, clean records, Extraction Processing, Results Storage, Data Extraction Ontology, "interact if necessary".]
Knowledge Base Construction
• Knowledge Sources
– Mikrokosmos (K) Ontology
– Data-Frame Library
– Additional Lexicons
– WordNet
• Integration of Knowledge Base
[Figure: the Data-Frame Library, K Ontology, Synonym Dictionary (WordNet), and Lexicons combined into the KNOWLEDGE BASE.]
Application Specification
Record 1:
00 GrandAM SE, Sunfire Red, CD, AC, PW, PL
Great Condition, $10,800, Call 798-3446
Record 2:
02 Buick Century Custom, Pwr Seat, Nada Retail 13,695
Only $12,695. 221-1250
Record 3:
02 Buick Century, lo mi, mint cond, $11,999. 373-4445 dlr# 2755
Record 4:
00 Buick Century Stk# HU7159 Green $9,319, 714-2200
To Apply By Phone, 1-877-228-9486, OREM Utah
Domain Allocation: concept selection
• Select concepts using string matching with object values
• Resolve conflicts by context or semantic meaning
[Figure: the record "02 Buick Century Pwr Seat, Nada Retail 13,695." checked against the Data Frame Library. Candidate concepts: <Price> and <Mileage>. The word "retail" is found by keyword identification.]
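The two bullets on this slide (string matching against object values, then conflict resolution by context) can be sketched roughly as follows. This is only an illustration: the `DATA_FRAMES` dictionary, its patterns and keywords, the `window` size, and the function name `select_concepts` are all assumptions made for the example, not the actual Data Frame Library.

```python
import re

# Hypothetical miniature data-frame library (an assumption for illustration):
# each concept has a value pattern and a few context keywords.
DATA_FRAMES = {
    "Price": {"value": r"\$?\d{1,3}(?:,\d{3})+", "keywords": ("retail", "only", "$")},
    "Mileage": {"value": r"\d{1,3}(?:,\d{3})+", "keywords": ("miles", "mileage")},
}

# Numbers with thousands separators, optionally prefixed by a dollar sign.
NUMBER = re.compile(r"\$?\d{1,3}(?:,\d{3})+")


def select_concepts(record, window=20):
    """Return (value, candidate concepts) pairs for one record."""
    selections = []
    for match in NUMBER.finditer(record):
        value = match.group()
        # Step 1: string matching of the value against every concept pattern.
        candidates = [name for name, frame in DATA_FRAMES.items()
                      if re.fullmatch(frame["value"], value)]
        # Step 2: on a conflict, keep only concepts whose keywords occur nearby.
        if len(candidates) > 1:
            start = max(0, match.start() - window)
            context = record[start:match.end() + window].lower()
            supported = [name for name in candidates
                         if any(k in context for k in DATA_FRAMES[name]["keywords"])]
            if supported:
                candidates = supported
        selections.append((value, candidates))
    return selections


print(select_concepts("02 Buick Century Pwr Seat, Nada Retail 13,695."))
# → [('13,695', ['Price'])]
```

When no keyword appears near an ambiguous value, the sketch leaves both candidates in place rather than guessing, which is where user interaction or semantic information would have to step in.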
Domain Allocation: relationship retrieval
Record 1:
00 GrandAM SE, Sunfire Red, CD, AC, PW, PL
Great Condition, $10,800, Call 798-3446
Record 2:
02 Buick Century Custom, Pwr Seat, Nada Retail 13,695
Only $12,695. 221-1250
Record 3:
02 Buick Century, lo mi, mint cond, $11,999. 373-4445 dlr# 2755
Record 4:
00 Buick Century Stk# HU7159 Green $9,319, 714-2200
To Apply By Phone, 1-877-228-9486, OREM Utah
• Find paths among selected concept nodes
• Retrieve cluster representing application domain
[Figure: concept graph with nodes <MAKE>, <FEATURE>, <AUTOMOBILE>, <PRICE>, <PHONE>, <YEAR>, <TEMPORAL-UNIT>.]
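One plausible reading of "find paths among selected concept nodes" is a shortest-path search between every pair of selected nodes, with the union of those paths taken as the cluster. The sketch below assumes that reading. The edge list is hypothetical (the slide names the nodes but not how they are connected), and `shortest_path` and `retrieve_cluster` are names invented for the example.

```python
from collections import deque
from itertools import combinations

# Hypothetical knowledge-base fragment; the edges are assumptions.
EDGES = [
    ("AUTOMOBILE", "MAKE"),
    ("AUTOMOBILE", "FEATURE"),
    ("AUTOMOBILE", "PRICE"),
    ("AUTOMOBILE", "PHONE"),
    ("AUTOMOBILE", "YEAR"),
    ("YEAR", "TEMPORAL-UNIT"),
]

# Build an undirected adjacency map.
GRAPH = {}
for a, b in EDGES:
    GRAPH.setdefault(a, set()).add(b)
    GRAPH.setdefault(b, set()).add(a)


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list from start to goal, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def retrieve_cluster(graph, selected):
    """Union of all nodes on shortest paths between pairs of selected concepts."""
    cluster = set()
    for a, b in combinations(selected, 2):
        path = shortest_path(graph, a, b)
        if path:
            cluster.update(path)
    return cluster


print(sorted(retrieve_cluster(GRAPH, ["MAKE", "PRICE", "YEAR"])))
# → ['AUTOMOBILE', 'MAKE', 'PRICE', 'YEAR']
```

Note that AUTOMOBILE enters the cluster even though it was not selected directly, because it lies on the paths connecting the selected concepts.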
Domain Allocation: constraint discovery
• Discover participation counts for each object value
• Specify the discovered counts as participation constraints
02 Buick Century, lo mi, mint cond, green, pwr seat, $11,999. 373-4445 dlr# 2755
00 Buick Century Stk# HU7159 Green $9,319, 714-2200
To Apply By Phone, 1-877-228-9486, OREM Utah
[Figure: concept nodes <MAKE>, <FEATURE>, <AUTOMOBILE>, <PRICE>.]
AUTOMOBILE [0:1] has MAKE [1:*]
AUTOMOBILE [0:*] has FEATURE [1:*]
AUTOMOBILE [0:1] has PRICE [1:1]
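As a rough sketch of how participation counts could become constraints, the code below counts how many values of each concept occur per record and reports the minimum and maximum as a [min:max] pair on the AUTOMOBILE side only. The per-record value assignments in `RECORDS` are hypothetical extraction output for the two ads on this slide, and mapping any maximum above 1 to `*` is a simplifying assumption.

```python
# Hypothetical extraction output for the two sample ads (assumed values).
RECORDS = [
    {"MAKE": ["Buick"], "FEATURE": ["green", "pwr seat"], "PRICE": ["$11,999"]},
    {"MAKE": ["Buick"], "FEATURE": ["Green"], "PRICE": ["$9,319"]},
]


def discover_constraints(records):
    """Map each concept to (min, max) occurrences per record; max > 1 becomes '*'."""
    concepts = sorted({concept for record in records for concept in record})
    constraints = {}
    for concept in concepts:
        counts = [len(record.get(concept, [])) for record in records]
        low, high = min(counts), max(counts)
        constraints[concept] = (low, "*" if high > 1 else high)
    return constraints


def format_constraint(owner, concept, bounds):
    """Render a constraint in the slide's notation, owner side only."""
    low, high = bounds
    return f"{owner} [{low}:{high}] has {concept}"


for concept, bounds in discover_constraints(RECORDS).items():
    print(format_constraint("AUTOMOBILE", concept, bounds))
# → AUTOMOBILE [1:*] has FEATURE
# → AUTOMOBILE [1:1] has MAKE
# → AUTOMOBILE [1:1] has PRICE
```

With only two records, every concept appears at least once, so the lower bounds come out as 1. A lower bound of 0, as on the slide, would require training records in which the value is absent.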
Ontology Generation
• Initial ontology: automatically generated
• Updated ontology: user tuning
• Expectation
– Rejecting existing components is much easier than adding new ones
– As little modification as possible
Evaluation and Results
• Evaluation
– Compare: Generated vs. Expert-created
– POG (Precision of Ontology Generation)
– PROG (Pseudo-Recall of Ontology Generation)
– EPROG (Effective-PROG)
• Results
– Three testing domains: Apt-Rental, Used-Auto-Ads, Nation-Essence
– Average POG less than 0.23
– Lowest EPROG is around 0.70, highest is almost 1.0
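The slide names the measures but does not define them. Purely as an assumption, the sketch below treats POG and PROG as the usual set-overlap precision and recall between the generated ontology's components and the expert-created ones. EPROG is left out because nothing here indicates how "effective" is computed. The component sets are hypothetical.

```python
def pog(generated, expert):
    """Assumed definition: share of generated components the expert ontology also has."""
    generated, expert = set(generated), set(expert)
    return len(generated & expert) / len(generated) if generated else 0.0


def prog(generated, expert):
    """Assumed definition: share of expert components that were generated."""
    generated, expert = set(generated), set(expert)
    return len(generated & expert) / len(expert) if expert else 0.0


# Hypothetical component sets, for illustration only.
generated = {"MAKE", "FEATURE", "PRICE", "PHONE", "TEMPORAL-UNIT"}
expert = {"MAKE", "FEATURE", "PRICE", "PHONE", "YEAR", "MODEL"}

print(round(pog(generated, expert), 2), round(prog(generated, expert), 2))
# → 0.8 0.67
```

Under these assumed definitions, a generator that proposes many extra components scores low on POG while still scoring well on recall, which fits the expectation that rejecting components is easier than adding new ones.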
Conclusion
• Exploits existing knowledge
• Specifies application domain
• Allocates domain inside the knowledge base
• Generates a data-extraction ontology
• Shows effective recall of more than 70% on average