SOFIE - A Unified Approach To Ontology-Based Information Extraction Using Reasoning
Copyright 2010 Digital Enterprise Research Institute. All rights reserved.
Digital Enterprise Research Institute www.deri.ie
SOFIE - A Unified Approach To Ontology-Based Information Extraction
Using Reasoning
Tobias Wunner, Unit for Natural Language Processing (UNLP)
Wednesday, 22nd June, 2011
DERI, Reading Group
Based On:
“SOFIE: A Self-Organizing Framework for Information Extraction”
Authors: Fabian Suchanek, Mauro Sozio, Gerhard Weikum
Published: World Wide Web Conference (WWW)
Madrid, 2009
Overview
1. Introduction
2. SOFIE Model + Rules
3. Excursion: Satisfiability
4. SOFIE Approach
5. Evaluation experiments
6. Conclusion
Motivation
Classical IE on text (pattern-based): 80%
Semi-structured approach (Wikipedia infoboxes): 95%
Idea of the paper: combine both
use text (hypotheses) + ontology (trusted facts)
Example
Document 1
Einstein attended secondary school in Germany.
YAGO ontology
familyName(AlbertEinstein, Einstein)
bornIn(AlbertEinstein, Germany)
New knowledge
attendedSchoolIn(AlbertEinstein, Germany)
General Idea
Express extraction patterns as facts
patternOcc(“X went to school in Y”, Einstein, Switzerland)
Rules to understand usage of terms
patternOcc(Pattern,X,Y) and R(X,Y) ⇒ express(Pattern,R)
Add restrictions
Contribution
Unified approach to
– Pattern matching
– Word Sense Disambiguation
– Reasoning
Large scale
– On unstructured data
Pattern extraction with WICs
Extract patterns based on ‘interesting’ entities
Documents
Einstein was born at Ulm in Württemberg, Germany, on March 18, 1879. When Albert was around four, his father gave him a magnetic compass.
When Albert became older, he went to a school in Switzerland. After he graduated, he got a job in the patent office there…
Knowledge Base
patternOcc(“Einstein was born in Ulm”, Einstein@D1, Ulm@D1) [1]
patternOcc(“Ulm is in Württemberg, Germany”, Ulm@D1, Germany@D1) [1]
patternOcc(“Albert .. Switzerland”, Albert@D1, Switzerland@D1) [1]
WICs (Words in Context): Einstein@D1, Ulm@D1, Albert@D1, …
Grounding
How to test rules?
Find an instance which satisfies the formulae
Rules with variables
bornIn(X,Ulm) ⇒ ¬bornIn(X,Timbuktu)
studiedIn(X,Ulm)
Grounded instance (X = Einstein)
bornIn(Einstein,Ulm) ⇒ ¬bornIn(Einstein,Timbuktu)
studiedIn(Einstein,Ulm)
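The grounding step can be sketched in Python. This is a minimal illustration, not SOFIE's implementation: the tuple encoding of literals, the helper names (`is_variable`, `substitute`, `ground`) and the convention that a single uppercase letter is a variable are all assumptions made for this example.

```python
from itertools import product

# A literal is (negated, relation, args). Single uppercase letters such as "X"
# are treated as variables. This encoding is a hypothetical choice.
RULE = {
    "if": [(False, "bornIn", ("X", "Ulm"))],
    "then": (True, "bornIn", ("X", "Timbuktu")),
}

def is_variable(term):
    return len(term) == 1 and term.isupper()

def substitute(literal, binding):
    negated, relation, args = literal
    return (negated, relation, tuple(binding.get(a, a) for a in args))

def ground(rule, entities):
    """Return one variable-free copy of the rule per assignment of entities to variables."""
    literals = rule["if"] + [rule["then"]]
    variables = sorted({a for _, _, args in literals for a in args if is_variable(a)})
    grounded = []
    for values in product(entities, repeat=len(variables)):
        binding = dict(zip(variables, values))
        grounded.append({
            "if": [substitute(lit, binding) for lit in rule["if"]],
            "then": substitute(rule["then"], binding),
        })
    return grounded

instances = ground(RULE, ["Einstein"])
print(instances[0])
```

With more entities the number of grounded instances grows with every variable in the rule, which is why the later slides care about the cost of reasoning.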
Rules (Hypotheses)
Disambiguation
– disambiguatesAs(Albert@D,AlbertEinstein)[?]
Expresses a new fact
– expresses(P, livedIn(Einstein,Switzerland) )[?]
New facts
– CityIn(Ulm,Germany)[?]
New fact rule
...with disambiguation
patternOcc( P, WX, WY ) and
disambiguatesAs(WX, X) and
disambiguatesAs(WY, Y) and
R(X,Y)
⇒ express( P, R )
“Pattern P expresses relation R when the analysed WICs are disambiguated”
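A minimal Python sketch of how the new fact rule could be applied to toy data. It treats the disambiguation as already decided, whereas SOFIE keeps it as a hypothesis. The data, the trusted fact `attendedSchoolIn(AlbertEinstein, Switzerland)` and the function name `expressed_relations` are assumptions for illustration only.

```python
# Hypothetical toy data; the names mirror the predicates on the slide.
pattern_occ = [("X went to school in Y", "Albert@D1", "Switzerland@D1")]
disambiguates_as = {"Albert@D1": "AlbertEinstein", "Switzerland@D1": "Switzerland"}
facts = {("attendedSchoolIn", "AlbertEinstein", "Switzerland")}  # assumed trusted fact

def expressed_relations(pattern_occ, disambiguates_as, facts):
    """Apply: patternOcc(P,WX,WY) and disambiguatesAs(WX,X) and
    disambiguatesAs(WY,Y) and R(X,Y) => express(P,R)."""
    derived = set()
    for pattern, wx, wy in pattern_occ:
        x = disambiguates_as.get(wx)
        y = disambiguates_as.get(wy)
        if x is None or y is None:
            continue  # rule body not satisfied: a word in context is not disambiguated
        for relation, fx, fy in facts:
            if (fx, fy) == (x, y):
                derived.add((pattern, relation))
    return derived

express = expressed_relations(pattern_occ, disambiguates_as, facts)
print(express)
```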
Restrictions
Disambiguation: the disambiguation prior should influence the choice of disambiguation
disambPrior( W, X, N )
⇒ disambiguatesAs( W, X )
N: any disambiguation function, e.g. N = |words(D1) ∩ rel(AlbertEinstein)| / |words(D1)|
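One way to compute such a prior, reading the slide's formula as N = |words(D1) ∩ rel(AlbertEinstein)| / |words(D1)|. This is a sketch under that reading. The word lists and the function name `disamb_prior` are made up for the example.

```python
def disamb_prior(document_words, related_words):
    """Fraction of the document's distinct words that are related to the candidate entity."""
    words = set(document_words)
    if not words:
        return 0.0
    return len(words & set(related_words)) / len(words)

# Hypothetical word sets, for illustration only.
words_d1 = ["albert", "school", "switzerland", "patent", "office"]
rel = {
    "AlbertEinstein": {"switzerland", "patent", "physics", "ulm"},
    "HermannEinstein": {"ulm", "munich"},
}

priors = {entity: disamb_prior(words_d1, related) for entity, related in rel.items()}
best = max(priors, key=priors.get)
print(priors, best)
```

The prior only weights the hypothesis disambiguatesAs(W, X). The final choice is left to the reasoner, which can overrule it when other rules disagree.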
Restrictions
Functional restrictions
R(X,Y) and
type(R, function) and
different(Y,Z)
⇒ ¬R(X,Z)
Albert@D1 ≠ Albert@D2
“Albert@D1 born in?”
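The functional restriction can be checked mechanically. The sketch below is an illustration, not SOFIE's code: the fact triples and the name `functional_conflicts` are assumptions. It reports pairs R(X,Y), R(X,Z) with Y ≠ Z for relations declared functional.

```python
def functional_conflicts(facts, functional_relations):
    """Return pairs of facts R(X,Y), R(X,Z) with Y != Z where R is functional.
    Each later fact is compared with the first value seen for (R, X)."""
    first_value = {}
    conflicts = []
    for relation, x, y in facts:
        if relation not in functional_relations:
            continue
        key = (relation, x)
        if key in first_value and first_value[key] != y:
            conflicts.append(((relation, x, first_value[key]), (relation, x, y)))
        first_value.setdefault(key, y)
    return conflicts

# Hypothetical hypotheses: bornIn is functional, livedIn is not.
hypotheses = [
    ("bornIn", "AlbertEinstein", "Ulm"),
    ("bornIn", "AlbertEinstein", "Timbuktu"),
    ("livedIn", "AlbertEinstein", "Ulm"),
    ("livedIn", "AlbertEinstein", "Switzerland"),
]
conflicts = functional_conflicts(hypotheses, {"bornIn"})
print(conflicts)
```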
SOFIE Rules
Framework to test the hypotheses
Question: “How to satisfy all of them?”
Rules + trusted facts
patternOcc( P, X, Y ) and
R(X,Y)
⇒ express( P, R )
disambPrior(Albert@D1, HermannEinstein, 3)
⇒ disambiguatesAs(Albert@D1, HermannEinstein)
disambPrior(Albert@D1, AlbertEinstein, 10)
⇒ disambiguatesAs(Albert@D1, AlbertEinstein)
Country(Germany)
livedIn(AlbertEinstein,Ulm)
…
SAT / MAX SAT
SAT (Satisfiability): prove that a formula can be TRUE
Complexity classes
P: good, e.g. N^k
NP: bad, e.g. c^N
– e.g. naive algorithm for 100 variables
2^100 rows x 10^-10 s per row = 4 x 10^12 years
– Not always: 3SAT in (4/3)^N
– SAT solver
F = (X or Y or Z) and (¬X or Y or Z) and (¬X or ¬Y or ¬Z)
X Y Z F
0 0 0 0
0 0 1 1
0 1 0 1
0 1 1 1
1 0 0 0
1 0 1 1
1 1 0 1
1 1 1 0
The truth table has 2^3 = 8 rows
Details: Schöning 2010
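The naive algorithm, enumerating the whole truth table, can be sketched for the formula F as follows. The clause encoding and function names are assumptions for illustration. The running-time estimate assumes 10^-10 seconds per row and a 365-day year.

```python
from itertools import product

# Each clause is a list of (variable, wanted_value) literals;
# a clause holds if at least one literal matches the assignment.
F = [
    [("X", True), ("Y", True), ("Z", True)],     # X or Y or Z
    [("X", False), ("Y", True), ("Z", True)],    # not X or Y or Z
    [("X", False), ("Y", False), ("Z", False)],  # not X or not Y or not Z
]

def satisfied(clause, assignment):
    return any(assignment[var] == wanted for var, wanted in clause)

def satisfying_assignments(clauses, variables):
    """Enumerate all 2^N truth-table rows and keep those where every clause holds."""
    models = []
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(satisfied(clause, assignment) for clause in clauses):
            models.append(assignment)
    return models

models = satisfying_assignments(F, ["X", "Y", "Z"])
print(len(models), "of", 2 ** 3, "rows satisfy F")

# Cost of the same enumeration for 100 variables (assumed 1e-10 s per row).
years = (2 ** 100) * 1e-10 / (365 * 24 * 3600)
print(f"about {years:.1e} years")
```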
MAX SAT: maximize the number of satisfied clauses
G = (X or Y) and (¬X or ¬Y) and (X)
X Y G #clauses
0 0 0 1
0 1 0 2
1 0 1 3
1 1 0 2
Details: Schöning 2010
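MAX SAT asks for the assignment that satisfies the most clauses. A brute-force sketch for the formula G = (X or Y) and (¬X or ¬Y) and (X), using an assumed clause encoding and hypothetical function names:

```python
from itertools import product

# Each clause is a list of (variable, wanted_value) literals.
G = [
    [("X", True), ("Y", True)],    # X or Y
    [("X", False), ("Y", False)],  # not X or not Y
    [("X", True)],                 # X
]

def count_satisfied(clauses, assignment):
    return sum(
        any(assignment[var] == wanted for var, wanted in clause)
        for clause in clauses
    )

def max_sat(clauses, variables):
    """Try every assignment and return the one satisfying the most clauses."""
    best_assignment, best_count = None, -1
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        count = count_satisfied(clauses, assignment)
        if count > best_count:
            best_assignment, best_count = assignment, count
    return best_assignment, best_count

best, count = max_sat(G, ["X", "Y"])
print(best, count)
```

The same exhaustive search is exponential in the number of variables, so it only serves to make the objective concrete.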
Weighted MAX SAT in SOFIE
...back to SOFIE
This is MAX SAT, but with weights
Rules + trusted facts
patternOcc( P, X, Y ) and
R(X,Y)
⇒ express( P, R )
disambPrior(Albert@D1, HermannEinstein, 3)
⇒ disambiguatesAs(Albert@D1, HermannEinstein)
disambPrior(Albert@D1, AlbertEinstein, 10)
⇒ disambiguatesAs(Albert@D1, AlbertEinstein)
Country(Germany)
livedIn(AlbertEinstein,Ulm)
…
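A small weighted MAX SAT sketch for the Albert@D1 example. The weights 3 and 10 are the priors from the slide. The mutual-exclusion clause (a word in context disambiguates to at most one entity) and its weight of 100 are assumptions added for this illustration, as are the encoding and function names. The search is brute force, not the algorithm SOFIE uses.

```python
from itertools import product

H = "disambiguatesAs(Albert@D1, HermannEinstein)"
A = "disambiguatesAs(Albert@D1, AlbertEinstein)"

# Weighted clauses: (weight, [(variable, wanted_value), ...]).
CLAUSES = [
    (3, [(H, True)]),                  # prior 3 supports HermannEinstein
    (10, [(A, True)]),                 # prior 10 supports AlbertEinstein
    (100, [(H, False), (A, False)]),   # assumed: not both at once
]

def total_weight(clauses, assignment):
    """Sum the weights of all clauses satisfied by the assignment."""
    return sum(
        weight
        for weight, clause in clauses
        if any(assignment[var] == wanted for var, wanted in clause)
    )

def weighted_max_sat(clauses):
    variables = sorted({var for _, clause in clauses for var, _ in clause})
    candidates = [
        dict(zip(variables, values))
        for values in product([False, True], repeat=len(variables))
    ]
    return max(candidates, key=lambda a: total_weight(clauses, a))

best = weighted_max_sat(CLAUSES)
print(best, total_weight(CLAUSES, best))
```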
Weighted MAX SAT in SOFIE
Weighted MAX SAT is NP-hard: finding the optimal solution is impractical, so only approximation algorithms are used
SAT solver: Johnson's algorithm, approximation guarantee 2/3
Weighted MAX SAT in SOFIE
Functional MAX SAT
Specialized reasoning (support for functional properties)
Approximation guarantee 1/2
A v B [w1]
A v B [w2]
B v C [w3]
C [w4]
Considers only unit clauses
Propagates dominating unit clauses
¬A v B [10]
¬A [10]
A [30]
⇒ A = true, because 30 > 10 + 10
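The dominance test behind "30 > 10+10" can be sketched as follows. Two assumptions are made here: a unit clause is taken to dominate when its weight exceeds the combined weight of all clauses containing the opposite literal, and the two weight-10 clauses of the example are read as containing ¬A, which is what the inequality implies. The function name and encoding are hypothetical, and this covers only the dominance check, not the full Functional MAX SAT procedure.

```python
def dominating_unit_literals(clauses):
    """Return literals whose unit-clause weight exceeds the total weight
    of all clauses that contain the opposite literal."""
    unit_weight = {}     # literal -> total weight of unit clauses asserting it
    mention_weight = {}  # literal -> total weight of all clauses containing it
    for weight, literals in clauses:
        if len(literals) == 1:
            unit_weight[literals[0]] = unit_weight.get(literals[0], 0) + weight
        for literal in set(literals):
            mention_weight[literal] = mention_weight.get(literal, 0) + weight
    dominating = []
    for (var, value), weight in unit_weight.items():
        opposing = mention_weight.get((var, not value), 0)
        if weight > opposing:
            dominating.append((var, value))
    return dominating

# Literals are (variable, wanted_value); negations are an assumed reading.
CLAUSES = [
    (10, [("A", False), ("B", True)]),  # not A or B [10]
    (10, [("A", False)]),               # not A [10]
    (30, [("A", True)]),                # A [30]
]
print(dominating_unit_literals(CLAUSES))
```

Once A is fixed to true, the clauses containing ¬A can be simplified and the test repeated on what remains.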
Controlled experiment
Corpus: Wikipedia infoboxes from 100 articles
Semantics are known!
Controlled experiment
Large-scale: corpus of 2000 Wikipedia articles
13 frequent relations from YAGO
Parsing: 87 min, reasoning: 77 min
Unstructured text sources
150 newspaper articles, relation under test: headquarterOf (functional relation)
YAGO (modified with relation seeds)
Parsing: 87 min, Weighted MAX SAT: 77 min
Disambiguated entries (provenance) could be manually assessed
Unstructured text sources
Large-scale: 10 biographies for each of 400 US senators
5 relationships
Disambiguation was not ideal for YAGO (13 entries for James Watson)
Parsing: 7 h, W-MAX-SAT: 9 h
Results
– 4 good
– 1 bad (misleading patterns)
Reformulate OWL in propositional logic: OWL → FOL → Skolem Normal Form → Propositional Logic
Might find OWL-inconsistent ontologies due to the Open World Assumption
MAX SAT can't do OWL per se (Open World Assumption)
Define a student as a subclass of “attends some course”
⇒ ∀x ∃y: attends(x,y) ∧ Course(y) → Student(x)
⇒ ∀x: attends(x,k) ∧ Course(k) → Student(x), with Skolem term k replacing y
⇒ ¬attends(xi,ki) or ¬Course(ki) or Student(xi), for individuals x1 .. xn
Inferred Ontology
{ Student(alex), Student(bob), Student subClassOf attends some Course, attends(alex, SemanticWeb) }
Details JMC 2010
Conclusions
Ontology-based IE (OBIE) reformulated as a weighted MAX SAT problem
Approximation algorithm with guarantee 1/2
Works and scales (large corpus + YAGO)
Limitations
Specialized approximation algorithm
– Accounts for SOFIE rules, NOT OWL
MAX SAT restrictions
– ∈ Propositional Logic
– ∉ First-Order Logic
Ontology population approach (can’t infer new relations)
References
1. F. Suchanek et al., SOFIE: A Self-Organizing Framework for Information Extraction, Proceedings of the 18th International Conference on World Wide Web (WWW '09)
2. John McCrae, Automatic Extraction of Logically Consistent Ontologies from Text, PhD thesis, National Institute of Informatics, Japan, 2009
3. Uwe Schöning, Das SAT-Problem, Informatik Spektrum 33(5): 479-483, 2010
4. F. Suchanek, Automated Construction and Growth of a Large Ontology, PhD thesis, Saarland University, Saarbrücken, Germany, 2009