Ontology-based information extraction: progresses and perspectives of the Ex tool
University of Economics Prague
Martin Labský, [email protected]
KEG seminar, May 29, 2008
ISMIS 2008 Combining Multiple Sources of Evidence in Web IE
Agenda
1. Motivation for Web Information Extraction (IE)
2. Difficulties in practical applications
3. Extraction Ontologies
4. Extraction process
5. Experimental results: contact information
6. Future work and Conclusion
Motivation for Web IE (1/4): online products
Motivation for Web IE (2/4): contact information
Motivation for Web IE (3/4): seminars, events
Motivation for Web IE (4/4)
Store the extracted results in a DB to enable structured search over documents
– information retrieval
– database-like querying
– e.g. an online product search engine
– e.g. building a contact DB
Support for web page quality assessment
– involved in the EU project MedIEQ to support medical website accreditation agencies
Source documents
– internet, intranet, emails
– can be very diverse
Agenda
1. Motivation for Web Information Extraction (IE)
2. Difficulties in practical IE applications
3. Extraction Ontologies
4. Extraction process
5. Experimental results: contact information
6. Future work and Conclusion
Difficulties in practical applications (1/3)
Requirements
– be able to extract some information quickly: not necessarily with the best accuracy; often needed for a proof-of-concept application; more work can then be done to boost accuracy
– the extraction model changes: the meaning of to-be-extracted items may shift, and new items are often added
Difficulties in practical applications (2/3)
Training data
– most state-of-the-art trainable IE systems require large amounts of training data, which are almost never available
– when training data is collected, it is not easy to adapt it to changed or additional criteria
– active learning helps reduce the data collection effort, but users often need to spend time annotating trivial examples that could easily be covered by manual rules
– this is our experience from experiments with extraction of bicycle descriptions using Hidden Markov Models
Wrappers
– cannot rely on wrapper-only systems when extracting from multiple websites
– non-wrapper systems often do not utilize regular formatting cues
Purely manual rules
– just writing extraction rules manually does not extend easily once training data are collected in later phases
Difficulties in practical applications (3/3)
It seems to be difficult to exploit, at the same time:
– extraction knowledge from domain experts
– training data
– formatting regularities, both within a document and within a group of documents from the same source
We attempt to address this with the approach of extraction ontologies.
Agenda
1. Motivation for Web Information Extraction (IE)
2. Difficulties in practical applications
3. Extraction Ontologies
4. Extraction process
5. Experimental results: contact information
6. Future work and Conclusion
Extraction ontologies
An extraction ontology is a part of a domain ontology transformed to suit extraction needs.
Contains classes composed of attributes
– more like UML class diagrams, less like ontologies where e.g. relations are standalone entities
– also contains axioms related to classes or attributes
Classes and attributes are augmented with extraction evidence
– manually provided patterns for content and context
– value or length ranges
– links to external trainable classifiers
(diagram) Example class Person with attribute cardinalities: name {1}, degree {0-5}, email {0-2}, phone {0-3}; a "Responsible" relation.
Extraction evidence provided by domain expert (1)
Patterns
– for attributes and classes, for their content and context
– patterns may be defined at several levels: word and character level, formatting tag level, level of labels (e.g. sentence breaks, POS tags)
Attribute value constraints
– word length constraints, numeric value ranges
– possible to attach units to numeric attributes
Axioms
– may enforce relations among attributes
– interpreted using the JavaScript scripting language
Simple co-reference resolution rules
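The evidence types above can be pictured with a tiny sketch. The representation, attribute name and regexes below are invented for illustration only, not Ex's actual ontology syntax: a content pattern, a left-context pattern and a word-length constraint for a phone attribute.

```python
import re

# Hypothetical mini-representation of extraction evidence for one attribute.
EVIDENCE = {
    "phone": {
        "content": re.compile(r"\+?\d[\d\s\-]{6,14}\d"),             # value pattern
        "context": re.compile(r"(phone|tel\.?|fax)\s*:?\s*$", re.I), # left context
        "max_words": 4,                                              # length constraint
    },
}

def candidate(attr, left_context, value):
    """Check the content pattern and length constraint of a candidate value;
    report separately whether the context pattern fired (in a real scorer the
    context evidence would only raise the candidate's confidence)."""
    ev = EVIDENCE[attr]
    content_ok = (ev["content"].fullmatch(value) is not None
                  and len(value.split()) <= ev["max_words"])
    context_ok = ev["context"].search(left_context) is not None
    return content_ok, context_ok

print(candidate("phone", "tel: ", "+420 224 095 462"))  # (True, True)
```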
Extraction evidence provided by domain expert (2)
(annotated example ontology) Axioms: class level, attribute level. Patterns: class content, attribute value, attribute context, class context. Value constraints: word length, numeric value.
Extraction evidence from classifiers
Links to trainable classifiers
– may classify attributes only
– binary or multi-class
Training (if not done externally) uses these features:
– re-use of all evidence provided by the expert
– induced binary features based on word n-grams
Feature induction
– candidate features are all word n-grams of given lengths occurring inside or near training attribute values
– pruning parameters: point-wise mutual information threshold, minimal absolute occurrence count, maximum number of features
pwmi(f, cls) = log [ P(f, cls) / ( P(f) · P(cls) ) ]
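The n-gram pruning can be sketched as follows. The counts, threshold values and function names are made up for illustration; only the pwmi formula and the three pruning parameters come from the slide.

```python
from math import log

def pwmi(n_f_cls, n_f, n_cls, n_total):
    """Point-wise mutual information between feature f and class cls:
    pwmi = log( P(f, cls) / (P(f) * P(cls)) ), estimated from counts."""
    return log((n_f_cls / n_total) / ((n_f / n_total) * (n_cls / n_total)))

def induce_features(ngram_counts, min_count=3, pwmi_threshold=1.0, max_features=1000):
    """Keep n-grams that occur often enough near attribute values and are
    strongly associated with the attribute class, mirroring the slide's
    pruning parameters (threshold values here are arbitrary)."""
    kept = []
    for ngram, (n_f_cls, n_f, n_cls, n_total) in ngram_counts.items():
        if n_f_cls < min_count:
            continue  # minimal absolute occurrence count
        score = pwmi(n_f_cls, n_f, n_cls, n_total)
        if score >= pwmi_threshold:
            kept.append((score, ngram))
    kept.sort(reverse=True)
    return [ng for _, ng in kept[:max_features]]  # maximum number of features

# "phone:" occurs 40x overall, 30x next to phone values; 100 of 1000 tokens are phones
print(induce_features({"phone:": (30, 40, 100, 1000)}))  # ['phone:']
```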
(figure: classifier definition and usage in the extraction ontology)
Probabilistic model to combine evidence
Each piece of evidence E is equipped with two probability estimates with respect to the predicted attribute A:
– evidence precision P(A|E) ... prediction confidence
– evidence coverage P(E|A) ... necessity of evidence (support)
Each attribute is assigned some low prior probability P(A). Let E_1, ..., E_n be the set of evidence applicable to A, and assume conditional independence among the E_i given A (and given ¬A). Bayes' formula then gives:

P(A | E_1, ..., E_n) = P(A) ∏ P(E_i|A) / [ P(A) ∏ P(E_i|A) + P(¬A) ∏ P(E_i|¬A) ]

where each P(E_i|¬A) follows from the precision, coverage and prior, since P(E_i) = P(E_i|A) P(A) / P(A|E_i).
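A minimal sketch of this naive-Bayes-style combination, assuming each piece of evidence is summarized only by its precision and coverage values (a simplification of the slide's model, not Ex's exact implementation):

```python
def combine_evidence(prior, evidence):
    """Combine evidence for attribute A under conditional independence.
    `evidence` is a list of (precision P(A|E), coverage P(E|A)) pairs;
    `prior` is the low prior probability P(A)."""
    num = prior          # running P(A) * prod P(Ei|A)
    den_neg = 1 - prior  # running P(~A) * prod P(Ei|~A)
    for precision, coverage in evidence:
        p_e = coverage * prior / precision                     # P(E) via Bayes
        p_e_given_not_a = p_e * (1 - precision) / (1 - prior)  # P(E|~A)
        num *= coverage
        den_neg *= p_e_given_not_a
    return num / (num + den_neg)

# two pieces of fairly precise evidence push a low prior close to 1
p = combine_evidence(prior=0.01, evidence=[(0.8, 0.5), (0.6, 0.7)])
print(round(p, 3))  # 0.998
```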
Agenda
1. Motivation for Web Information Extraction (IE)
2. Difficulties in practical applications
3. Extraction Ontologies
4. Extraction process
5. Experimental results: contact information
6. Future work and Conclusion
The extraction process (1/5)
1. Tokenize, build the HTML formatting tree, apply a sentence splitter and POS tagger
2. Match patterns
3. Create Attribute Candidates (ACs)
– for each created AC, let P_AC = P(A | evidence observed for the AC), computed by the probabilistic model above
– prune ACs below a threshold
– build the document AC lattice, scoring ACs by log(P_AC)
(figure: AC lattice over the phrase "Washington, DC")
The extraction process (2/5)
4. Evaluate co-reference resolution rules for each pair of ACs, e.g. "Dr. Burns" ↔ "John Burns"; possible coreferring groups are remembered in the attribute's value section
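A toy illustration of such a co-reference rule. The surname-matching heuristic and title list are assumptions for illustration only, not Ex's actual rule set:

```python
def corefer(a, b):
    """Toy co-reference rule: two person-name ACs corefer if they share a
    surname (last token), ignoring leading titles such as 'Dr.'."""
    titles = {"dr.", "mr.", "mrs.", "prof.", "ing."}
    ta = [t for t in a.lower().split() if t not in titles]
    tb = [t for t in b.lower().split() if t not in titles]
    return bool(ta) and bool(tb) and ta[-1] == tb[-1]

print(corefer("Dr. Burns", "John Burns"))  # True
print(corefer("Dr. Burns", "John Doe"))    # False
```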
5. Compute the best scoring path BP through AC lattice using dynamic programming
6. Run the wrapper induction algorithm using all ACs on BP
– the wrapper induction algorithm is described on the next slides
– if new local patterns are induced, apply them to rescore existing ACs and to create new ACs; update the AC lattice and recompute BP
7. Terminate here if no instances are to be generated
– output all ACs on BP (n-best paths supported)
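Steps 5-7 rest on a best-path computation over the lattice. A simplified sketch follows; the token-indexed lattice, the fixed per-token skip penalty and the edge format are all simplifying assumptions, not Ex's data structures:

```python
from math import log

def best_path(n_tokens, acs, skip_logp=-1.0):
    """Dynamic-programming best path through an AC lattice.
    `acs` are (start, end, logp, label) edges over token positions
    0..n_tokens; tokens not covered by any chosen AC are skipped at a
    fixed log-penalty `skip_logp`."""
    best = [float("-inf")] * (n_tokens + 1)
    back = [None] * (n_tokens + 1)
    best[0] = 0.0
    for pos in range(n_tokens):          # positions in topological order
        if best[pos] == float("-inf"):
            continue
        if best[pos] + skip_logp > best[pos + 1]:   # skip one token
            best[pos + 1] = best[pos] + skip_logp
            back[pos + 1] = (pos, None)
        for start, end, logp, label in acs:         # take an AC starting here
            if start == pos and best[pos] + logp > best[end]:
                best[end] = best[pos] + logp
                back[end] = (pos, label)
    path, pos = [], n_tokens             # reconstruct labels on winning path
    while pos > 0:
        prev, label = back[pos]
        if label:
            path.append(label)
        pos = prev
    return list(reversed(path))

# tokens: ["John", "Doe", "tel:", "123456"]
acs = [(0, 2, log(0.9), "name"), (3, 4, log(0.8), "phone"),
       (1, 2, log(0.3), "name")]
print(best_path(4, acs))  # ['name', 'phone']
```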
The extraction process (3/5)
8. Generate Instance Candidates (ICs) bottom-up
– a triangular trellis is used to store partial ICs
– when scoring new ICs, only axioms and patterns that can already be applied to the IC are considered; validity is not required
– pruning parameters: absolute and relative beam size at each trellis node, maximum number of ACs that can be skipped, minimum IC probability
The extraction process (4/5)
8. IC generation (continued): when a new IC is created, its P(IC) is computed from two components:
– an AC-based component, where |IC| is the member attribute count, AC_skip is a non-member AC fully or partially inside the IC, and P_ACskip is the probability of that AC being a "false positive"
– a class-evidence component P(C | E_C), where E_C is the set of evidence known for the class C, computed using the same probabilistic model as for ACs
The two scores are combined using the Prospector pseudo-Bayesian method:

P1 ⊕ P2 = P1·P2 / ( P1·P2 + (1 - P1)·(1 - P2) )
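The Prospector rule for combining two probability estimates (equivalent to a naive-Bayes odds product under a uniform 0.5 prior) can be sketched as:

```python
def prospector(p1, p2):
    """Prospector-style pseudo-Bayesian combination of two probability
    estimates; 0.5 acts as the neutral element."""
    return (p1 * p2) / (p1 * p2 + (1 - p1) * (1 - p2))

print(round(prospector(0.9, 0.8), 3))  # 0.973 (two confident scores reinforce)
print(round(prospector(0.9, 0.5), 3))  # 0.9   (0.5 is neutral)
```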
The extraction process (5/5)
9. Insert valid ICs into the AC lattice
– valid ICs were assembled during the IC generation phase
– the score of a valid IC reflects all extraction evidence of its class
– all unpruned valid ICs are inserted into the AC lattice
10. Compute the best path BP through the combined IC+AC lattice (n-best supported)
– the search algorithm allows constraints to be defined over the extracted path(s), e.g. min/max count of extracted instances
– output all ACs and ICs on BP
Lattice score of an inserted IC: log(score(IC)) · |IC|
Extraction evidence based on formatting
A simple wrapper induction algorithm
– identify formatting regularities
– turn them into "local" context patterns to boost the contained ACs
1. Assemble distinct formatting subtrees rooted at block elements that contain ACs from the best path BP currently determined by the system
2. For each subtree S, calculate prec(Att | S) = C(S, Att) / C(S), where C(S) is the number of occurrences of S and C(S, Att) the number of those containing an AC of attribute Att
3. If both C(S, Att) and prec(Att | S) reach defined thresholds, a new local context pattern is created, with its precision set to prec(Att | S) and its recall close to 0 (in order not to harm potential singleton ACs)

(figure: a formatting subtree TD > A_href > B learned from known names like "John Doe [email protected]" and applied to unknown names like "Argentina Agosto [email protected]")
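Steps 1-3 can be condensed into a counting sketch. Subtrees are reduced to path strings and the threshold values are made up; both are simplifications for illustration:

```python
from collections import Counter

def local_patterns(subtree_occurrences, min_count=3, min_prec=0.9):
    """Count each distinct formatting subtree S and how often it contains
    an AC of attribute Att; keep (S, Att) pairs whose count C(S, Att) and
    precision prec(Att|S) = C(S, Att) / C(S) pass the thresholds."""
    c_s = Counter()
    c_s_att = Counter()
    for subtree, att in subtree_occurrences:  # att is None if no AC inside
        c_s[subtree] += 1
        if att is not None:
            c_s_att[(subtree, att)] += 1
    patterns = []
    for (s, att), joint in c_s_att.items():
        prec = joint / c_s[s]
        if joint >= min_count and prec >= min_prec:
            patterns.append((s, att, prec))
    return patterns

occ = [("TD>A_href>B", "name")] * 4 + [("P", None)] * 5
print(local_patterns(occ))  # [('TD>A_href>B', 'name', 1.0)]
```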
Agenda
1. Motivation for Web Information Extraction (IE)
2. Difficulties in practical applications
3. Extraction Ontologies
4. Extraction process
5. Experimental results: contact information
6. Future work and Conclusion
Experimental results: Contact information
109 English contact pages, 200 Spanish, 108 Czech
– named entity counts: 7000, 5000 and 11000, respectively; instances not labeled
– only the domain expert's evidence and formatting pattern induction were used
– the domain expert saw 30 randomly chosen documents; the rest was test data
– instance extraction was done but not evaluated
Future work
Confirm that improved results can be achieved when combining expert knowledge and formatting pattern induction with classifiers
Attempt to improve a seed extraction ontology by bootstrapping, using relevant pages retrieved from the Internet
Adapt the structure of the extraction ontology according to data
– e.g. add new attributes to represent product features
Conclusions
Presented an extraction ontology approach to
– allow for fast prototyping of IE applications
– accommodate extraction schema changes easily
– utilize all available forms of extraction knowledge: the domain expert's knowledge, training data, and formatting regularities found in web pages
Results
– indicate that extraction ontologies can serve as a quick prototyping tool
– it seems possible to improve performance of the prototyped ontology when training data become available
Acknowledgements
The research was partially supported by the EC under contract FP6-027026, Knowledge Space of Semantic Inference for Automatic Annotation and Retrieval of Multimedia Content: K-Space.
The medical website application is carried out in the context of the EC-funded (DG-SANCO) project MedIEQ.
Backup slides
IET and co.
Information extraction toolkit - architecture
(architecture diagram) The INFORMATION EXTRACTION TOOLKIT comprises user and admin components: IE engines (Ex extraction ontology engine, CRF extraction engine, rule-based integrator (TBD)); labelling schemas; classified documents from WCC; Data Model Manager; Pre-processor UI; the expert's domain and extraction knowledge and annotated corpora; Task Manager UI; Document IO; annotated documents; extracted attributes and instances; Annotation tool UI; AQUA Evaluator.
Information extraction toolkit – document flow

(diagram) A classified document enters the Pre-processor, which selects extraction model(s) based on the document class. The extraction ontology engine extracts attributes and instances; the CRF NE engine extracts attributes. The rule-based integrator refines the extracted values, e.g. based on document classification, and outputs the extracted attributes and instances.
Czech contact data set: results
attribute      | strict: prec / recall / F | loose: prec / recall / F
title          | 0.87 / 0.88 / 0.88        | 0.89 / 0.91 / 0.90
name           | 0.74 / 0.82 / 0.78        | 0.76 / 0.83 / 0.80
street         | 0.78 / 0.66 / 0.71        | 0.83 / 0.69 / 0.75
city           | 0.67 / 0.69 / 0.68        | 0.75 / 0.79 / 0.77
region         |                           |
zip            | 0.91 / 0.97 / 0.94        | 0.91 / 0.97 / 0.94
country        | 0.64 / 0.87 / 0.74        | 0.66 / 0.96 / 0.78
phone          | 0.92 / 0.85 / 0.88        | 0.93 / 0.85 / 0.89
email          | 0.99 / 0.98 / 0.98        | 0.99 / 0.98 / 0.98
organization   |                           |
department     |                           |
overall        | 0.81 / 0.84 / 0.82        | 0.84 / 0.87 / 0.84
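As a sanity check, the F columns are consistent with the standard F-measure F = 2PR/(P+R) applied to the (rounded) precision and recall values:

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# spot-check a few strict-mode rows of the Czech results table
for att, p, r, f in [("title", 0.87, 0.88, 0.88),
                     ("zip", 0.91, 0.97, 0.94),
                     ("email", 0.99, 0.98, 0.98)]:
    # tolerance 0.01 allows for the rounding of P and R to two decimals
    assert abs(f1(p, r) - f) < 0.01, att
print("table F values consistent")
```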
Czech dataset: per-attribute F-measures
(charts: per-attribute F-measure, strict and loose mode, for title, name, street, city, zip, country, phone, email and overall; y-axis 0 to 1)
IET purpose: to support the user by providing suggestions, not to work standalone without supervision.
Customization to new criteria
Precisely define the criterion or criteria group
– define it and give positive and negative examples
If gazetteers are required:
– search for or construct appropriate gazetteers
If training is required:
– annotate a training corpus of at least 100 documents with at least 300 occurrences of the criterion
– train one of the trainable extractors: the CRF engine, or Ex with Weka integration
If some extraction evidence can be given by a human:
– write a new, or extend an existing, extraction ontology
Evaluate performance
Localization to a new language
Reuse language-independent parts of the extraction ontology:
– class structure (attributes in a class)
– cardinalities, constraints, axioms
– some criteria can be reused almost completely (phone, email)
If a criterion requires training:
– annotate a corpus and train a classifier, as when adding a new criterion
Provide language-specific extraction evidence that can be encoded by a human (if any):
– add it to the extraction ontology
Demo + tutorial
IET + Ex
– free text criteria
– (shows the internal IET user interface)
Tutorial
– http://eso.vse.cz/~labsky/ex/ex_tutorial.pdf
New features in Ex IE engine
– significant speed-up
– memory footprint reduction
– multiple class extraction
– extended axiom support
– instance parsing and reference resolution improvements
– extraction ontology authoring made easier