A Framework for Extraction Plans and Heuristics in an Ontology-Based Data-Extraction System
Alan Wessman
Brigham Young University
MS Thesis Defense
Based in part on research funded by the National Science Foundation.
Presentation Overview
• Background of legacy Ontos
• Assumptions, challenges, concerns
• Framework as solution
• Explain framework
• Explain reference implementation
• Evaluation of system
• Future work and conclusion
Data Extraction
• Goals of data extraction
  • Find relevant data in unstructured or semi-structured documents
  • Map extracted data to a formal structure
• Approaches
  • Wrappers (ROADRUNNER, TSIMMIS)
  • NLP and machine learning (RAPIER, WHISK)
  • Ontologies (Ontos)
Ontos
• Developed by Data Extraction Group (DEG) at BYU
• Based on OSM ontologies and data frames
• Focuses on multiple-record extraction
• Good precision/recall
• Resilient to document changes
How Ontos Works
Ontos Assumptions
• OSML ontologies
• Single- or multiple-record text documents
• Each document/record relevant to domain
• Heuristics produce accurate mappings
• Output to relational database
Some Current Challenges
| Challenge | Example |
|---|---|
| New/evolving ontology features | Enhanced data frames |
| Variety of documents | PDF, plaintext, XML |
| Content filtering | Extract from certain HTML attributes (ALT, SRC, HREF) |
| Locating values | On-the-fly lexicon |
| Optimizing mappings | Better heuristics; HMM-based mapping |
Architectural Concerns
• Variety of technologies
• Different OSM representations
• Highly coupled code
• Difficult to install elsewhere
• Difficult to upgrade or extend
Thesis Statement
A framework for data extraction can give us a flexible and configurable platform for conducting data-extraction research.
We can re-implement Ontos under the framework, which will let us adapt the system to particular research needs without ongoing massive rewrites.
Frameworks
• Abstract architecture
• Decouple independent functions
• Define interfaces
• Use abstract classes, interfaces, declarative configuration files
• Allow quick adjustment of system settings without re-coding
• Make a system customizable
Image from http://www.mcoe.org
Creating an Extraction Framework
• Analyze systems
• Generalize functionality
• Define interfaces
• Create supporting code
• Document framework
[Figure: framework class diagram. Labels: DataExtractionEngine (public void doExtraction()); ExtractionPlan (execute()); ExtractionAlgorithm; "uses"; config parameters; dynamically loaded components: DocumentRetriever, DocumentStructureRecognizer, DocumentStructureParser, ContentFilter, ValueRecognizer, ValueMapper, OntologyWriter.]
Managing the Process
• DataExtractionEngine
  • Main class
  • Initialize, perform extraction, finalize
• ExtractionPlan
  • Defines order of steps in the extraction process
  • Can be imperative, declarative, or dynamic (like SQL execution plan)
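The engine/plan split described on this slide can be sketched roughly as follows. This is a minimal illustration, not the thesis code: the slides confirm only that the engine exposes doExtraction() and the plan exposes execute(). The class names ImperativePlan and DataExtractionEngineSketch, the string log, and the step names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the framework's plan abstraction; the slides
// only confirm that an ExtractionPlan has an execute() method.
interface ExtractionPlan {
    void execute();
}

// An imperative plan: a fixed, hard-coded sequence of steps.
// The step names are placeholders; a real plan would call framework components.
final class ImperativePlan implements ExtractionPlan {
    private final List<String> log;

    ImperativePlan(List<String> log) {
        this.log = log;
    }

    @Override
    public void execute() {
        log.add("retrieve");
        log.add("parse");
        log.add("filter");
        log.add("recognize");
        log.add("map");
        log.add("write");
    }
}

// Engine lifecycle as listed on the slide: initialize, perform extraction, finalize.
final class DataExtractionEngineSketch {
    private final ExtractionPlan plan;
    private final List<String> log;

    DataExtractionEngineSketch(ExtractionPlan plan, List<String> log) {
        this.plan = plan;
        this.log = log;
    }

    public void doExtraction() {
        log.add("initialize");
        plan.execute();      // the plan decides the order of the steps
        log.add("finalize");
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        new DataExtractionEngineSketch(new ImperativePlan(log), log).doExtraction();
        System.out.println(log);
    }
}
```

Because the engine only depends on the ExtractionPlan interface, a declarative or dynamic plan could be swapped in without touching the engine, which is the decoupling the slide argues for.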
Handling Documents
• DocumentRetriever
  • Responsible for locating relevant documents
  • Search engine, local filesystem, CMS
• DocumentStructureRecognizer
  • Decides which DocumentStructureParser to use
• DocumentStructureParser
  • Breaks document into individual records or sub-documents
  • Record separator, table analyzer
• ContentFilter
  • Normalizes document text
  • Strips out unwanted markup, stopwords, etc.
[Figure: document-handling pipeline with component signatures]
• DocumentRetriever: public Iterator retrieveDocuments() (input: URI)
• DocumentStructureRecognizer: public DocumentStructureParser getDocumentParser(Document doc) (structure types: Single, Multi, Tabular, Hierarchical, ...)
• DocumentStructureParser: public Document parse(Document doc)
• ContentFilter: public Document filterDocument(Document doc) (example: <p>Price:<br>$452.00</p> becomes Price:$452.00)
• ValueRecognizer: public void findValues(Ontology ont, Document doc) (example: in "Price: $452.00", "Price:" is a keyword match and "$452.00" is a value match)
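The ContentFilter example on this slide (markup stripped from a price snippet) can be reproduced with a small sketch. It assumes a simplification: the framework's filterDocument works on Document objects, while this version filters plain strings with a regular expression. The class name HtmlTagStripper is hypothetical, and a regex is not a full HTML parser.

```java
import java.util.regex.Pattern;

// Hypothetical simplification of a ContentFilter: operates on strings
// rather than the framework's Document type.
final class HtmlTagStripper {
    // Matches anything that looks like an HTML tag, e.g. <p>, </p>, <br>.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Returns the input with every tag-like span removed.
    static String filter(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // The slide's example input.
        System.out.println(filter("<p>Price:<br>$452.00</p>")); // prints Price:$452.00
    }
}
```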
Extracting Values
• ValueRecognizer
  • Uses matching rules defined in ontology
  • Produces set of candidate matches (like data record table)
• ValueMapper
  • Accepts or rejects candidate matches
  • Assigns accepted matches to elements of the ontology (e.g., object sets)
• OntologyWriter
  • Emits ontology structure and/or extracted data in an output format (e.g., XML, SQL)
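The recognizer/mapper division of labor might look like the sketch below. Everything here is illustrative: the Candidate class, the rule map keyed by object-set name, the sample text, and the "earliest match wins" acceptance rule are assumptions for the example. In particular, the acceptance rule is a toy heuristic and not the HeuristicBasedMapper described later.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical candidate match: which object set matched, what text, and where.
final class Candidate {
    final String objectSet;
    final String text;
    final int start;

    Candidate(String objectSet, String text, int start) {
        this.objectSet = objectSet;
        this.text = text;
        this.start = start;
    }
}

final class RecognizeAndMapSketch {
    // Recognizer: apply each object set's value pattern and record every hit.
    static List<Candidate> findValues(Map<String, Pattern> rules, String doc) {
        List<Candidate> out = new ArrayList<>();
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(doc);
            while (m.find()) {
                out.add(new Candidate(rule.getKey(), m.group(), m.start()));
            }
        }
        return out;
    }

    // Mapper: toy heuristic (an assumption) that accepts only the earliest
    // candidate for each object set and rejects the rest.
    static Map<String, String> map(List<Candidate> candidates) {
        Map<String, Candidate> best = new LinkedHashMap<>();
        for (Candidate c : candidates) {
            Candidate seen = best.get(c.objectSet);
            if (seen == null || c.start < seen.start) {
                best.put(c.objectSet, c);
            }
        }
        Map<String, String> accepted = new LinkedHashMap<>();
        for (Candidate c : best.values()) {
            accepted.put(c.objectSet, c.text);
        }
        return accepted;
    }

    public static void main(String[] args) {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("Price", Pattern.compile("\\$\\d+\\.\\d{2}"));
        String doc = "Price: $452.00 (was $499.00)"; // made-up sample text
        System.out.println(map(findValues(rules, doc)));
    }
}
```

Keeping recognition (produce all candidates) separate from mapping (choose among them) is what lets a different mapping strategy be substituted without changing the matching rules.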
Implementing the Framework
[Figure: data flow through the framework components, annotated with the reference implementation's choices]
• Framework labels: ApplicationOntology (object sets, relationship sets, and constraints; value and keyword matching rules), SourceDescriptor, DocumentRetriever, Document, StructureRecognizer, StructureParser, ContentFilter, ValueRecognizer (candidate matches), ValueMapper (extracted objects and relationships), OntologyWriter (StructureOutput, DataOutput)
• Reference implementation labels: OSMX ontology, URI, LocalDocumentRetriever, DOMDocument, (no DocumentStructureRecognizer), FanoutRecordSeparator, TextDocument, HTMLFilter, DataFrameMatcher, HeuristicBasedMapper, ObjectRelationshipWriter, (no structural output), HTML representation
OSMX
• Legacy Ontos: OSML
• OntologyEditor: OSM.dtd
• New standard is OSMX
  • XML Schema (better constraints; validation)
  • JAXB generates corresponding Java classes
  • Common language for DEG tools
  • Allows data to be stored inline with model
Managing the Process
• OntosEngine
  • Main class for Ontos system
  • Takes parameters from command line or configuration file
• OntosExtractionPlan
  • Sequentially retrieves, parses, filters, and extracts from individual documents
  • Imperative (hard-coded) algorithm
Handling Documents
• LocalDocumentRetriever
  • Retrieves documents from local filesystem
  • Filename filter excludes irrelevant files
• FanoutRecordSeparator
  • Implements DocumentStructureParser
  • Locates record boundaries and creates sub-documents
• HTMLFilter
  • Removes all HTML markup from documents
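A local retriever with a filename filter could be sketched as below. This is an assumption-laden illustration, not the LocalDocumentRetriever source: the class name LocalRetrieverSketch, the suffix-based filter, and the demo file names are invented for the example. It uses only java.io.File.listFiles with a FilenameFilter, mirroring the Iterator-returning retrieveDocuments() signature from the framework slide.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

final class LocalRetrieverSketch {
    // Returns the files in dir whose names end with one of the given suffixes,
    // sorted by path so the iteration order is predictable.
    static Iterator<File> retrieveDocuments(File dir, String... suffixes) {
        File[] hits = dir.listFiles((d, name) -> {
            for (String s : suffixes) {
                if (name.endsWith(s)) {
                    return true;
                }
            }
            return false; // the filename filter excludes everything else
        });
        // listFiles returns null if dir is not a readable directory.
        List<File> docs = hits == null ? new ArrayList<>() : new ArrayList<>(Arrays.asList(hits));
        Collections.sort(docs);
        return docs.iterator();
    }

    // Demo on a temporary directory with made-up file names.
    static List<String> demo() {
        try {
            Path dir = Files.createTempDirectory("docs");
            for (String n : new String[] {"a.html", "b.txt", "notes.bak"}) {
                Files.createFile(dir.resolve(n));
            }
            List<String> names = new ArrayList<>();
            Iterator<File> it = retrieveDocuments(dir.toFile(), ".html", ".txt");
            while (it.hasNext()) {
                names.add(it.next().getName());
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```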
Recognizing Values: DataFrameMatcher
• Uses data frame enhancements:
  • Keyword affinity (left and right)
  • Require context for left, right, or both
  • Value phrase-specific keywords
  • Link matches back to specific patterns
• Other improvements:
  • Consistent regular expression handling
  • Unlimited recursive macro definition
Mapping Values: HeuristicBasedMapper
• New algorithm
• Fully recursive with respect to ontology structure
• ContextualHeuristic generates objects
• Connection-based heuristics (singleton, nested-group, etc.) generate relationships
• See paper for additional details
Output
• Human-readable HTML format
• Easier to count correct, partial, incorrect mappings

DeceasedPerson osmx3113
• has DeceasedName Sandoval, Ernesto J.
• has DeathDate October 7, 2004
• has BirthDate November 9, 1923
• has Age 63
• DeceasedPerson has Relationship to RelativeName
  • RelativeName Agullar Sandoval
  • Relationship daughter
• DeceasedPerson has Relationship to RelativeName
  • RelativeName Lalo Sandoval
  • Relationship brother
Using the Framework and Reference Implementation
• Adding new features
  • Create new implementation classes
  • Extend (subclass) existing implementations
• Switching feature set
  • Change class name in config file
  • Override class on command line
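Switching implementations by class name is typically done with reflection. The sketch below shows one way this could work. The property key "contentFilter", the ComponentLoader class, and the two filter classes are hypothetical names for illustration; the actual configuration format used by Ontos is not shown in the slides.

```java
import java.util.Properties;

// Hypothetical component interface and two interchangeable implementations.
interface ContentFilter {
    String filter(String text);
}

final class HtmlFilter implements ContentFilter {
    @Override
    public String filter(String text) {
        return text.replaceAll("<[^>]+>", ""); // drop tag-like spans
    }
}

final class NoOpFilter implements ContentFilter {
    @Override
    public String filter(String text) {
        return text;
    }
}

final class ComponentLoader {
    // Picks the class name from the command-line override if present,
    // otherwise from the config, then instantiates it reflectively.
    static ContentFilter loadFilter(Properties config, String override) {
        String className = override != null ? override : config.getProperty("contentFilter");
        try {
            return Class.forName(className)
                    .asSubclass(ContentFilter.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load component " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        // Stand-in for a line in a config file; the key name is an assumption.
        config.setProperty("contentFilter", HtmlFilter.class.getName());
        String override = args.length > 0 ? args[0] : null;
        ContentFilter f = loadFilter(config, override);
        System.out.println(f.getClass().getName());
    }
}
```

With this arrangement, trying a new filter means writing one class and changing one config value; no existing code is recompiled.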
Evaluating the Framework
| Object set | New Ontos recall | New Ontos precision | Legacy Ontos recall | Legacy Ontos precision |
|---|---|---|---|---|
| Age | 60% | 50% | 57% | 38% |
| FuneralDate | 68% | 76% | 63% | 75% |
| Viewing | 80% | 63% | 93% | 18% |
| Relationship/RelativeName | 74% | 43% | 73% | 41% |
Four of eighteen object sets shown above.
Data from Salt Lake Tribune and Arizona Daily Star
Input:
Obituaries ontology
25 obituaries from two newspapers
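For readers unfamiliar with the two measures, recall and precision are conventionally computed from counts as sketched below. This uses the standard definitions and assumes only fully correct mappings are counted; how the thesis scored partial mappings is not stated on the slide. The counts in the example are invented and are not taken from the evaluation.

```java
final class ExtractionMetrics {
    // Recall: share of the values actually present in the documents
    // that the system extracted correctly.
    static double recall(int correct, int expected) {
        return expected == 0 ? 0.0 : (double) correct / expected;
    }

    // Precision: share of the values the system extracted that were correct.
    static double precision(int correct, int extracted) {
        return extracted == 0 ? 0.0 : (double) correct / extracted;
    }

    public static void main(String[] args) {
        // Made-up counts for illustration only.
        int expected = 20;   // values a human found in the documents
        int extracted = 25;  // values the system produced
        int correct = 15;    // produced values that match the human's
        System.out.println("recall = " + recall(correct, expected));
        System.out.println("precision = " + precision(correct, extracted));
    }
}
```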
Statistics about the System
| Component | Files | Lines of code* |
|---|---|---|
| Framework | 38 | 2,868 |
| OntologyEditor | 141 | 22,249 |
| OSMX (XML Schema) | 1 | 1,918 |
| OSMX (Java)** | 60 | 6,912 |
| Ontos | 29 | 6,295 |
* Includes comments and whitespace.
** JAXB-generated classes add 197 files and 62,888 lines of code.
Future Work
• Algorithm improvements
  • On-the-fly lexicons
  • Machine learning techniques
  • Confidence values
  • Canonicalization
  • Expected participation cardinality
  • Negative-indicator keywords
• Integration
  • Online search engines
  • Semantic Web annotator and query engine
  • Web interface to extraction engine
Contributions
• Design and construction of a data-extraction framework
• Reference implementation
  • Ontos upgrade
  • Pattern for future use of framework
• OSMX
  • Standardized storage format
  • http://www.deg.byu.edu/xml/osmx.xsd
Contributions
• Uniform codebase and language
• OntologyEditor migration
• New graphics classes
• Extended data frame support
• Modular heuristic-based mapper
• Concept of extraction plans
• Flexible research platform
Conclusion
Framework gives us the flexibility we need for further data-extraction research
Framework is capable of supporting Ontos functionality
OSMX and reference implementation provide solid base for future research applications