Semiautomatic Generation of Resilient Data-Extraction Ontologies

Post on 01-Jan-2016

Yihong Ding

Data Extraction Group, Brigham Young University

Sponsored by NSF

2

Wrapper-Driven Data Extraction

Web data extraction
– Obtain user-specified information from Web documents

Wrapper
– Convert implicit HTML data into explicit formatted data
– Data-source-specific, high performance

Examples: SoftMealy, STALKER, WIEN, Omini, ROADRUNNER, …

3

Common Problem of Wrappers

<LI> <A HREF="…"> Mani Chandy </A>, <I>Professor of Computer Science</I> and <I>Executive Officer for Computer Science</I>

[Figure: SoftMealy finite-state transducer for the fragment above. State labels as extracted: b, U_U, N_N. Edge labels (input / output): s<b,U> / “U=” + next_token; s<U,U> / ε; s<b,N> / “N=” + next_token; s<U,N> / “N=” + next_token; s<N,N> / ε; ? / next_token; ? / ε.]

Resiliency
– Fixed domain
– Changeable layout

Scalability
– Unchanged existing wrapper
– Extendable domain and functions

4

Data-Extraction Ontology

Structure
– Object sets
– Relationship sets
– Participation constraints
– Data frames

Pros: resilient and scalable
Cons: hard to create
– Knowledge requirements
– Tedious and error-prone work

Car [-> object];

Car [0:1] has Make [1:*];
Make matches [10] constant
  { extract "\baudi\b"; };
end;

Car [0:1] has Model [1:*];
Model matches [25] constant
  { extract "80"; context "\baudi\S*\s*80\b"; };
end;

Car [0:1] has Mileage [1:*];
Mileage matches [8] constant
  { extract "\b[1-9]\d{0,2}k"; substitute "[kK]" -> "000"; };
end;

Car [0:1] has Price [1:*];
Price matches [8] constant
  { extract "[1-9]\d{3,6}"; context "\$[1-9]\d{3,6}"; };
end;
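The data frames above can be read as regular-expression rules. The following is a minimal Python sketch of how such rules might be applied, not the extraction engine itself. The dictionary layout, the function names, case-insensitive matching, and the sample ad are all assumptions made for illustration. The bracketed numbers after "matches" are ignored because the slide does not explain them.

```python
import re

# Rules transcribed from the Car ontology on the slide.
# The dictionary layout is an assumption, not the ontology language itself.
DATA_FRAMES = {
    "Make": {"extract": r"\baudi\b"},
    "Model": {"extract": r"80", "context": r"\baudi\S*\s*80\b"},
    "Mileage": {"extract": r"\b[1-9]\d{0,2}k", "substitute": (r"[kK]", "000")},
    "Price": {"extract": r"[1-9]\d{3,6}", "context": r"\$[1-9]\d{3,6}"},
}


def apply_data_frame(rule, text):
    """Return the values that one data-frame rule extracts from text."""
    # When a context pattern exists, only search inside the context matches.
    if "context" in rule:
        regions = [m.group(0)
                   for m in re.finditer(rule["context"], text, re.IGNORECASE)]
    else:
        regions = [text]
    values = []
    for region in regions:
        for match in re.finditer(rule["extract"], region, re.IGNORECASE):
            value = match.group(0)
            # Optional normalization step, e.g. "95k" -> "95000".
            if "substitute" in rule:
                pattern, replacement = rule["substitute"]
                value = re.sub(pattern, replacement, value)
            values.append(value)
    return values


def extract_record(text):
    """Apply every data frame to one record."""
    return {name: apply_data_frame(rule, text)
            for name, rule in DATA_FRAMES.items()}


# Made-up ad, not taken from the talk.
ad = "1994 Audi 80, 95k miles, $4500, call 555-0100"
result = extract_record(ad)
print(result)
```

The context pattern is what keeps the bare "80" or a four-digit number from being picked up anywhere in the ad: a value counts only when it appears inside its context match.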

5

Motif of Ontology Generation

Human Brain

Concepts of Interest

Concepts with Relations

Data-Extraction Ontology

Knowledge Base

Sample Documents

6

Thesis Statement

Given: knowledge base
Input: sample Web pages of interest
Output: a data-extraction ontology for the domain of interest

Between input and output: this is the work of this thesis

7

Ontology-Generation Procedure

[Figure: flow diagram of the procedure. Knowledge sources are pre-processed into an integrated knowledge base, and training documents are pre-processed into clean records. Three stages (concept selection, relation retrieval, constraint discovery), interacting if necessary, produce the data-extraction ontology. Extraction processing on test documents, result evaluation, and results storage follow.]

8

Primary Knowledge Source

Requirements
– Available
– General in coverage
– Rich in meaningful relationships
– Encoded in or easily converted to XML

Mikrokosmos (μK) Ontology
– Developed by NMSU jointly with U.S. DoD
– Contains over 5000 concepts
– Averages 14 links per concept
– Represented in XML format

9

Integrated Knowledge Base

[Figure: the knowledge base integrates four components.]
– μK Ontology
– Data-Frame Library
– Synonym Dictionary (WordNet)
– Lexicons

11

Domain Specification

Training documents
– Data-rich
– Narrow in topic breadth

Preprocessing

12

Example – Car Advertisement

Record 1:
00 GrandAM SE, Sunfire Red, CD, AC, PW, PL Great Condition, $10,800, Call 798-3446

Record 2:
02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250

Record 3:
02 Buick Century, lo mi, mint cond, $11,999. 373-4445 dlr# 2755

Record 4:
00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah

14

Concept Selection

Selection strategies
– Compare a string with the name of a concept
– Compare a string with the values belonging to a concept
– Apply data-frame recognizers to recognize a string

00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah

[Figure: the knowledge base (KB) maps a string in the record to the concept <PHONE-NR>.]
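The three selection strategies can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the toy knowledge base, its concept entries, and the recognizer patterns are hypothetical stand-ins, not content of the actual knowledge base.

```python
import re

# Toy knowledge base. Concept entries, values, and recognizer patterns
# are illustrative assumptions, not taken from the real knowledge base.
KB = {
    "AUTOMOBILE": {"names": {"automobile", "car"},
                   "values": {"buick", "grandam"}},
    "COLOR": {"names": {"color"}, "values": {"green", "red"}},
    "PHONE-NR": {"names": {"phone"},
                 "recognizer": r"(?<![\d-])(?:1-\d{3}-)?\d{3}-\d{4}\b"},
    "PRICE": {"names": {"price"},
              "recognizer": r"\$\d{1,3}(?:,\d{3})*"},
}


def select_concepts(record):
    """Map each selected concept to the evidence that triggered it."""
    tokens = {t.lower() for t in re.findall(r"[A-Za-z]+", record)}
    selected = {}
    for concept, entry in KB.items():
        evidence = []
        # Strategy 1: a string matches the name of a concept.
        evidence += sorted(tokens & entry.get("names", set()))
        # Strategy 2: a string matches a value belonging to a concept.
        evidence += sorted(tokens & entry.get("values", set()))
        # Strategy 3: a data-frame recognizer recognizes a string.
        if "recognizer" in entry:
            evidence += re.findall(entry["recognizer"], record)
        if evidence:
            selected[concept] = evidence
    return selected


record = ("00 Buick Century Stk# HU7159 Green $9,319, 714-2200 "
          "To Apply By Phone, 1-877-228-9486, OREM Utah")
selected = select_concepts(record)
for concept, evidence in selected.items():
    print(concept, evidence)
```

Here <PHONE-NR> is selected both by name ("Phone") and by its recognizer, while AUTOMOBILE is selected only through a value ("Buick").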

15

Concept Selection

Sources of conflict
– Synonymy
– Polysemy

Conflict resolution
– The same string has only one meaning
– Favor longer matches over shorter ones
– Context decides meaning

02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250.

[Figure: the KB proposes <PRICE> and <MILEAGE> as candidate concepts; price is chosen by keyword identification.]
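A minimal Python sketch of the conflict-resolution rules, assuming candidate matches arrive as (start, end, concept) spans. The keyword lists, the look-back window, the YEAR candidate, and the fallback of keeping the first candidate when no keyword is found are all assumptions made for illustration.

```python
def resolve_conflicts(record, candidates, keywords, window=15):
    """Return one concept per matched string.

    candidates: list of (start, end, concept) spans into record.
    keywords:   hypothetical context keywords per concept.
    """
    def contained(a, b):
        # True when span a lies inside a strictly longer span b.
        return (b[0] <= a[0] and a[1] <= b[1]
                and (b[1] - b[0]) > (a[1] - a[0]))

    # Favor longer over shorter: drop spans inside a longer candidate.
    survivors = [c for c in candidates
                 if not any(contained(c, other) for other in candidates)]

    # Same string, only one meaning: group competing concepts by span.
    by_span = {}
    for start, end, concept in survivors:
        by_span.setdefault((start, end), []).append(concept)

    resolved = {}
    for (start, end), concepts in by_span.items():
        # Context decides meaning: look for a keyword just before the string.
        context = record[max(0, start - window):start].lower()
        supported = [c for c in concepts
                     if any(k in context for k in keywords.get(c, []))]
        # Assumption: fall back to the first candidate if no keyword is found.
        resolved[record[start:end]] = (supported or concepts)[0]
    return resolved


record = "02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250."
start = record.index("13,695")
candidates = [
    (start, start + 6, "MILEAGE"),
    (start, start + 6, "PRICE"),
    (start, start + 2, "YEAR"),  # shorter match inside the same string
]
keywords = {"PRICE": ["retail", "$"], "MILEAGE": ["miles", "mileage"]}
resolved = resolve_conflicts(record, candidates, keywords)
print(resolved)
```

With this record, the keyword "Retail" before "13,695" supports the price reading, which agrees with the outcome shown on the slide.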

17

Relationship Retrieval

[Figure: graph retrieved from the KB that connects the selected concepts <AUTOMOBILE>, <PRICE>, <PHONE-NR>, <YEAR>, <CENTURY>, <MILEAGE>, and <AUDIO-MEDIA-ARTIFACT>.]
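One way to picture relationship retrieval is as a path search between selected concepts in the knowledge-base graph. The slides do not state the search algorithm, so the breadth-first search below is an assumption, and the toy edge list is a hypothetical fragment modeled on relationship-set names that appear in the generated ontology.

```python
from collections import deque

# Hypothetical fragment of a knowledge-base graph: concept -> (link, concept).
KB_EDGES = {
    "AUTOMOBILE": [("IsA", "ARTIFACT")],
    "ARTIFACT": [("CostOfProduction", "PRICE")],
    "PRICE": [("IsA", "SCALARATTRIBUTE")],
    "SCALARATTRIBUTE": [("MeasuredIn", "MEASURINGUNIT")],
}


def find_path(source, target, max_links=4):
    """Breadth-first search for a shortest directed path, or None.

    The returned list alternates link names and concept names and
    ends with the target concept.
    """
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        # Each link adds two items (link name, concept) to the path.
        if len(path) // 2 >= max_links:
            continue
        for link, neighbor in KB_EDGES.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [link, neighbor]))
    return None


def relationship_name(path):
    """Join the path, minus the target concept, into a dotted name."""
    return ".".join(path[:-1])


path = find_path("AUTOMOBILE", "PRICE")
print(relationship_name(path))
```

The dotted name produced here has the same shape as the relationship-set names in the automatically generated ontology, where the intermediate concepts on the path become part of the name.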

19

Constraint Discovery

[Figure: the <AUTOMOBILE> and <PRICE> concept nodes and the relationship between them.]

02 Buick Century, lo mi, mint cond, green, pwr seat, $11,999. 373-4445 dlr# 2755

00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah

AUTOMOBILE [0:1] IsA.ARTIFACT.CostOfProduction PRICE [1:1]
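A minimal Python sketch of how participation constraints might be estimated from training records, assuming the constraint is read off the minimum and maximum number of recognized values per record. The slides do not give the exact discovery rule, so this counting rule and both recognizer patterns are assumptions.

```python
import re

# Hypothetical recognizers for two concepts.
RECOGNIZERS = {
    "PRICE": r"\$\d{1,3}(?:,\d{3})*",
    "PHONE-NR": r"(?<![\d-])(?:1-\d{3}-)?\d{3}-\d{4}\b",
}


def participation(counts):
    """Turn per-record occurrence counts into a [min:max] constraint."""
    lo, hi = min(counts), max(counts)
    # More than one occurrence in any record is written as "*".
    return f"[{lo}:{'*' if hi > 1 else hi}]"


def discover_constraints(records):
    """Estimate a participation constraint for every recognizer."""
    constraints = {}
    for concept, pattern in RECOGNIZERS.items():
        counts = [len(re.findall(pattern, record)) for record in records]
        constraints[concept] = participation(counts)
    return constraints


records = [
    "02 Buick Century, lo mi, mint cond, green, pwr seat, $11,999. "
    "373-4445 dlr# 2755",
    "00 Buick Century Stk# HU7159 Green $9,319, 714-2200 "
    "To Apply By Phone, 1-877-228-9486, OREM Utah",
]
constraints = discover_constraints(records)
print(constraints)
```

Both records contain exactly one price, which is consistent with the PRICE [1:1] constraint on the slide. The second record lists two phone numbers, so the phone-number constraint comes out as [1:*]. An estimate like this is only as good as the training set, since two records say little about what a third might contain.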

21

Ontology Generation

concept nodes → object sets
paths → relationship sets
discovered constraints → participation constraints
concept recognizers → data frames

22

Automatically Generated Ontology -- Car Advertisement

(01) {Automobile [-> object];}

(02) {Automobile [0:1] has Mileage [1:1];}

(03) {Automobile [0:1] IsA.ARTIFACT.CostOfProduction Price [1:1];}

(12) {Price [1:1] IsA.SCALARATTRIBUTE.MeasuredIn.MEASURINGUNIT.Subclasses Year [0:*];}

(20) {Automobile [0:1] relatesTo PhoneNr [1:*] relatesTo ArtifactPart [1:*] relatesTo Mileage [1:*] relatesTo Truck [1:*] relatesTo AudioMediaArtifact [1:*] relatesTo CommunicationDevice [1:*] relatesTo ControlEvent [1:*] relatesTo TravelEvent [1:*];}

24

Updating Strategies

Remove all bad relationship sets

Modify remaining incorrect relationship sets
– Substitute incorrect object sets
– Reduce long n-ary relationship sets
– Fix participation constraints

Adjust names or re-arrange sequences

Add new relationship sets

25

Final Ontology

Car [-> object]
Car [0:1] has Year [1:*]
Car [0:1] has Mileage [1:*]
Car [0:1] has Price [1:*]
PhoneNr [1:*] is for Car [0:1]
PhoneNr [0:1] has Extension [1:*]
Car [0:*] has Feature [1:*]
Car [0:1] has Make [1:*]
Car [0:1] has Model [1:*]

26

Evaluation Criteria

Basic measures
– POG (Precision of Ontology Generation)
– ROG (Recall of Ontology Generation)

Human constraints
– PROG (Pseudo-ROG)
– Comparing with an expert-created ontology

Knowledge base constraints
– EPROG (Effective-PROG)

Correctness dependency
– DEPROG (Dependent-EPROG)
– For example: relationship sets depend on object sets
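The slides name the measures but do not give formulas. The sketch below assumes POG and ROG are ordinary precision and recall over ontology components, computed against a reference ontology. PROG, EPROG, and DEPROG are not sketched because their definitions are not stated here. The two object-set collections are made up for illustration and are not the evaluation data.

```python
def precision_recall(generated, reference):
    """Precision and recall of generated components against a reference.

    Assumption: POG and ROG follow the usual definitions of
    precision and recall over sets of ontology components.
    """
    correct = generated & reference
    precision = len(correct) / len(generated) if generated else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall


# Illustrative object sets only, not results from the talk.
generated = {"Year", "Mileage", "Price", "PhoneNr", "Truck"}
reference = {"Year", "Mileage", "Price", "PhoneNr",
             "Extension", "Feature", "Make", "Model"}

pog, rog = precision_recall(generated, reference)
print(f"POG = {pog:.2f}, ROG = {rog:.2f}")
```

In this toy case four of the five generated object sets are correct, and four of the eight reference object sets are recovered.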

27

Evaluation Results

28

Discussion of Results

Bottleneck: the system cannot generate what is not in the knowledge base

Object sets
– Concept-selection procedure works well
– Desired concept not shown in training records
  • A rarely occurring concept is not a severe problem even if the error is left unfixed
  • Example: extension
– Aggregation and union
  • USAddressCity, USAddressState, USAddressZipCode → Location
  • CropPlant, AnimalProduct, FruitFoodStuff → AgriculturalProduct
– Close-meaning concepts: FurniturePart → Furnished

29

Discussion of Results

Relationship sets
– Binary relationship sets: over 95%
– Most errors are due to incorrectly generated object sets
– Semantically incorrect relationship sets
  • Price IsA.SCALARATTRIBUTE.MeasuredIn.MEASURINGUNIT.Subclasses Year
– n-ary relationship sets (usually huge)

Participation constraints
– Errors are due to a lack of training examples
– How much is enough?

30

Knowledge Base Extensibility

Add SALT -- a new knowledge source
Successfully integrated into existing KB
Sample new relationship set (DOE abstract domain)
– CrudeOil IsA.PHYSICALOBJECT.Location.PLACE.Subclasses Nation

31

Conclusion

Experimented with knowledge-base construction and extension

Standardized application domain specification

Generated data-extraction ontologies from a specified domain and an integrated knowledge base

Showed DEPROG results of more than 70% on average and over 90% for well-defined domains

32

Future Work

Build a general-purpose knowledge source for data-extraction usage

Study more about data frames
– Can a system correctly identify concepts with data frames?
– Can a system update a data frame to fit a special situation?
– Can a system generate a data frame from a collection of information of interest?