Information Extractions from Texts - Synthesis Group
![Page 1: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/1.jpg)
Information Extractions from Texts
Dmitry Briukhov, Candidate of Technical Sciences, Senior Researcher
Institute of Informatics Problems, Russian Academy of Sciences
![Page 2: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/2.jpg)
Content
Motivation
Applications
Information Extraction Steps
Source Selection and Preparation
Entity Recognition
Named Entity Annotation
Entity Disambiguation
2
![Page 3: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/3.jpg)
Motivation
Around 80% of data are unstructured text documents
Reports
Web pages
Tweets
E-mails
Texts contain useful information
Entities
Entity Relationships
Facts
Sentiments
3
![Page 4: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/4.jpg)
Def: Information Extraction
Information Extraction (IE) is the process of deriving structured
factual information from digital text documents.
4
Persons Occupation
Elvis singer
![Page 5: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/5.jpg)
Need for structured data
Business intelligence tools work with structured data
OLAP
Data mining
To use unstructured data with business intelligence tools
Structured data has to be extracted from unstructured and semi-structured data
Why would anybody want to do Information Extraction?
The reason is that text is abundant, whereas structured data is useful.
5
Persons Occupation
Elvis singer
![Page 6: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/6.jpg)
Problem with unstructured data
Structured data has
Known attribute types
Integer
Character
Decimal
Known usage
Represents salary versus zip code
Unstructured data has no
Known attribute types nor usage
Usage is based upon context
Tom Brown has brown eyes
A computer program has to be able to view a word in context to
know its meaning
6
![Page 7: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/7.jpg)
Difficulty: Ambiguity
7
“Ford Prefect thought cars were the dominant life form on
Earth.”
“Ford Prefect”
entity name (of person ”Ford Prefect”) ?
entity name (of car brand) ?
![Page 8: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/8.jpg)
Difficulty: Detect phrases
Elvis is a rock star.
Elvis is a rock star.
Elvis is a real rock star.
8
![Page 9: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/9.jpg)
Application: Customer care
Dear Apple support,
my iPhone 5 charging cable is not interoperable with my fridge.
Can you please help me?
9
Product | Part | Problem
iPhone 5 | Charger | Compatibility
![Page 10: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/10.jpg)
Application: Opinion Mining
10
iPhone maps doesn’t work
Product Opinion
iPhone negative
![Page 11: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/11.jpg)
Application: Portals
11
News Google Wiki
Extraction Engine
Portal
Everything about
Organization/Person/…
![Page 12: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/12.jpg)
Application: Portals
12
![Page 13: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/13.jpg)
Application: Question Answering
Which protein inhibits atherosclerosis?
Who was king of England when Napoleon I was Emperor of
France?
Has any scientist ever won the Nobel Prize in Literature?
Which countries have an HDI comparable to Sweden’s?
Which scientific papers have led to patents?
13
![Page 14: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/14.jpg)
Question Answering
Questions of this kind can be answered with structured information.
14
Person | Occupation | Awards
Russell | Mathematician | Nobel Prize in Literature
Bohr | Physicist | Nobel Prize in Physics
![Page 15: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/15.jpg)
Question Answering
Structured information can be extracted from digital documents.
15
![Page 16: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/16.jpg)
Application: IBM WATSON
IBM Watson is a question answering system based on many
sources (among others, YAGO).
It outperformed the 74-time human champion of the Jeopardy!
quiz show.
16
![Page 17: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/17.jpg)
Application: Midas
17
![Page 18: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/18.jpg)
Application: British Library
The British Library has a very large and rapidly growing web
archive portal that allows researchers and historians to explore
preserved web content.
With initial manual methods, pages of 5,000 .uk websites were
classified by 30 research analysts. However, the cost to
manually archive the entire .uk web domain — which
comprises 4 million websites and is growing daily — would be
exorbitant.
The British Library uses InfoSphere BigInsights and a classification
module built by IBM to electronically classify and tag web
content and enable/create visualizations across numerous
commodity PCs in parallel, dramatically reducing the cost of
archiving.
The British Library can now archive and preserve massive
numbers of web pages and allow patrons to explore and
generate new data insights.
18
![Page 19: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/19.jpg)
Information Extraction Steps
Source Selection and Preparation
Entity Recognition
Named Entity Annotation
Entity Disambiguation
19
![Page 20: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/20.jpg)
Source Selection and Preparation
The corpus is the set of digital text documents from which we
want to extract information.
Where do we get our corpus from?
We can use a given corpus
We can crawl the Web
A Web crawler is a system that follows hyperlinks, collecting all pages on the way.
We can find pages on demand
E.g., Google Search
20
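The crawler idea above can be sketched as a breadth-first traversal of hyperlinks. To stay self-contained, this sketch assumes a made-up in-memory "web" (`TOY_WEB`) instead of real HTTP requests; the URLs, the `crawl` function, and the `max_pages` limit are all hypothetical.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of outgoing hyperlinks.
TOY_WEB = {
    "a.example/index": ["a.example/news", "b.example/home"],
    "a.example/news": ["a.example/index"],
    "b.example/home": ["b.example/about"],
    "b.example/about": [],
}

def crawl(seed, get_links, max_pages=100):
    """Follow hyperlinks breadth-first from seed, collecting every page once."""
    seen = {seed}
    queue = deque([seed])
    corpus = []
    while queue and len(corpus) < max_pages:
        url = queue.popleft()
        corpus.append(url)
        for link in get_links(url):
            if link not in seen:  # avoid visiting a page twice
                seen.add(link)
                queue.append(link)
    return corpus

pages = crawl("a.example/index", lambda url: TOY_WEB.get(url, []))
```

A real crawler would replace the lambda with a function that downloads a page and extracts its links.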
![Page 21: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/21.jpg)
Information Extraction Steps
Source Selection and Preparation
Entity Recognition
Named Entity Annotation
Entity Disambiguation
21
![Page 22: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/22.jpg)
Def: Named Entity Recognition
Named entity recognition (NER) is the task of finding entity
names in a corpus.
Uruguay's economy is characterized by an export-oriented
agricultural sector, a well-educated workforce, and high levels of
social spending.
22
![Page 23: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/23.jpg)
NER is difficult
23
“Ford Prefect thought cars were the dominant life form on Earth.”
entity name
(of person ”Ford Prefect”) ?
entity name
(of car brand) ?
![Page 24: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/24.jpg)
NER Approaches
by dictionaries
by rules (regular expressions)
24
![Page 25: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/25.jpg)
Def: Dictionary
A dictionary is a set of names.
Countries
Cities
Organizations
Company staff
…
NER by dictionary finds only names that are in the dictionary.
25
![Page 26: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/26.jpg)
NER by dictionary
NER by dictionary can be used if the entities are known upfront.
Countries: {Argentina, Brazil, China, Uruguay, ...}
26
… the economy suffered from lower demand in
Argentina and Brazil, which together account for
nearly half of Uruguay's exports …
![Page 27: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/27.jpg)
Trie
A trie is a tree, where nodes are
labeled with booleans and edges are
labeled with characters.
A trie contains a string, if the string
denotes a path from the root to a
node marked with ’true’.
Example
{ADA, ADAMS, ADORA}
27
![Page 28: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/28.jpg)
Adding strings
To add a string that is a prefix of an
existing string, switch node to ’true’.
{ADA, ADAMS, ADORA} + ADAM
To add another string, make a new
branch.
{ADA, ADAM, ADAMS, ADORA} +
ART
28
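A minimal sketch of the trie described above, assuming a node holds a boolean flag and a map from characters to child nodes. The class and method names (`TrieNode`, `Trie`, `add`, `contains`) are illustrative choices, not taken from the slides.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # edge label (character) -> child node
        self.is_end = False  # the boolean label of the node

class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.add(word)

    def add(self, word):
        """Walk down the trie, creating a new branch where needed."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True  # for a prefix of an existing string, this just flips the flag

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_end

trie = Trie(["ADA", "ADAMS", "ADORA"])
had_adam_before = trie.contains("ADAM")  # ADAM is only a path so far, not a member
trie.add("ADAM")  # prefix of ADAMS: switches an existing node to 'true'
trie.add("ART")   # creates a new branch below A
```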
![Page 29: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/29.jpg)
Tries can be used for NER
For every character in the doc
advance as far as possible in the trie
and
report match whenever you meet a
’true’ node
Tries have good runtime
O(textLength × maxWordLength)
29
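The scanning procedure above can be sketched as follows, assuming the trie is stored as nested dictionaries and a reserved key (`END`, a hypothetical marker) plays the role of a 'true' node. The inner loop runs at most maxWordLength steps for each start position, which gives the stated runtime. This sketch does not check word boundaries.

```python
END = "$end"  # hypothetical marker key standing for a 'true' node

def build_trie(names):
    root = {}
    for name in names:
        node = root
        for ch in name:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def find_names(text, trie):
    """Return (start, name) for every dictionary name that occurs in text."""
    matches = []
    for start in range(len(text)):
        node = trie
        pos = start
        # Advance as far as possible in the trie from this start position.
        while pos < len(text) and text[pos] in node:
            node = node[text[pos]]
            pos += 1
            if END in node:  # report a match at every 'true' node
                matches.append((start, text[start:pos]))
    return matches

countries = build_trie(["Argentina", "Brazil", "China", "Uruguay"])
text = ("the economy suffered from lower demand in Argentina and Brazil, "
        "which together account for nearly half of Uruguay's exports")
hits = find_names(text, countries)
```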
![Page 30: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/30.jpg)
Dictionary NER
Dictionary NER is very efficient, but dictionaries:
have to be given upfront
have to be maintained to accommodate new names
cannot deal with name variants
cannot deal with unknown sets of names (e.g., people names)
30
![Page 31: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/31.jpg)
NER Approaches
by dictionaries
by rules (regular expressions)
31
![Page 32: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/32.jpg)
Disadvantages of Dictionary Approach
Dictionaries do not always work
It is impractical or impossible to come up with exhaustive sets of all
movies, people, or books.
32
![Page 33: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/33.jpg)
Some names follow patterns
33
Years: From 1713 to 1728 and from 1732 to 1918, Saint Petersburg was the Imperial capital of Russia.
People with titles: Dr. Albert Einstein, one of the great thinkers of the ages
Addresses: Main street 42, West Country
![Page 34: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/34.jpg)
Regular expressions
Regular expressions are used to extract textual values from a
non-structured file
There is a defined set of structural specifications that are used
to find patterns in the data
Example (uses Perl syntax)
Phone numbers (e.g. (123) 456-78-90)
\(\d{3}\) \d{3}\-\d{2}\-\d{2}
\(\d\d\d\) \d\d\d\-\d\d\-\d\d
Address (e.g. Main street 42)
[A-Z][a-z]* (street|str\.) [0-9]+
34
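The two patterns above can be tried out with Python's `re` module, which accepts these Perl-style constructs. In this sketch the sample sentence is made up, and the alternation is written as a non-capturing group `(?:...)` so that `findall` returns whole matches; the unnecessary escapes before `-` are dropped.

```python
import re

# Phone numbers such as (123) 456-78-90
PHONE = re.compile(r"\(\d{3}\) \d{3}-\d{2}-\d{2}")
# Addresses such as "Main street 42"
ADDRESS = re.compile(r"[A-Z][a-z]* (?:street|str\.) [0-9]+")

text = "Call (123) 456-78-90 or visit Main street 42."
phones = PHONE.findall(text)
addresses = ADDRESS.findall(text)
```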
![Page 35: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/35.jpg)
Character category and character classes
35
Construct Matches
\\ The backslash character
\t The tab character
\n The newline character
\r The carriage-return character
\e The escape character
" " The space character
Construct Matches
[abc] a, b, or c
[^abc] Any character except a, b, and c
[a-z] a through z, inclusive
![Page 36: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/36.jpg)
Predefined character classes
36
Character Class Meaning
any character That particular character
. Any character except the newline character
\w Any word character
\W Any non-word character
\p{L} Any letter
\p{Ll} Any lower case letter
\p{Lu} Any upper case letter
\d Any digit [0-9]
\D Any non-digit [^\d]
\s Any whitespace character
\S Any non-whitespace character [^\s]
![Page 37: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/37.jpg)
Boundary matches and greedy quantifiers
Construct Matches
^ The beginning of a line
$ The end of a line
\b A word boundary
\B A non-word boundary
37
Construct Matches
X? X, once or not at all
X* X, zero or more times
X+ X, one or more times
X{n} X, exactly n times
X{n,} X, at least n times
X{n,m} X, at least n times but not more than m times
![Page 38: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/38.jpg)
Logical operators
Construct Matches
XY X followed by Y
X|Y either X or Y
38
![Page 39: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/39.jpg)
Finite State Machine
A Finite state machine (FSM) is a quintuple of
an input alphabet A
a finite non-empty set of states S
an initial state s, an element of S
a set of final states F, a subset of S
a state transition relation, a subset of S × A × S
An FSM is a directed multi-graph, where each edge is labeled
with a symbol or the empty symbol ε
One node is labeled ”start”
Zero or more nodes are labeled ”final”
39
![Page 40: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/40.jpg)
Def: Acceptance
An FSM accepts (also: generates) a string, if there is a path
from the start node to a final node whose edge labels are the
string.
(The empty symbol ε can be passed without consuming a character of the
string.)
To check whether a given string is accepted by an FSM, find a
path from the start node to a final node whose labels
correspond to the string.
40
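The acceptance check can be sketched by tracking the set of all states reachable so far, a standard way to simulate a nondeterministic FSM. The encoding is an assumption of this sketch: transitions are `(source, symbol, target)` triples and the empty symbol is written as the empty string `EPS`. The example machine is made up; it accepts ab, abab, ababab, and so on.

```python
EPS = ""  # stands for the empty symbol

def eps_closure(states, delta):
    """All states reachable from `states` using only empty-symbol edges."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for src, sym, dst in delta:
            if src == s and sym == EPS and dst not in closure:
                closure.add(dst)
                stack.append(dst)
    return closure

def accepts(string, start, finals, delta):
    """True if some path from start to a final state is labeled with string."""
    current = eps_closure({start}, delta)
    for ch in string:
        moved = {dst for src, sym, dst in delta if src in current and sym == ch}
        current = eps_closure(moved, delta)
    return bool(current & finals)

# Hypothetical FSM: 0 -a-> 1 -b-> 2, plus an empty-symbol edge from 2 back to 0.
DELTA = {(0, "a", 1), (1, "b", 2), (2, EPS, 0)}
FINALS = {2}
results = {s: accepts(s, 0, FINALS, DELTA) for s in ["ab", "abab", "aba", ""]}
```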
![Page 41: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/41.jpg)
FSM = Regex
For every regex, there is an FSM that accepts exactly the words
that the regex matches (and vice versa).
RegEx
42(0)+
Simplified regex
420(0)*
FSM
Matcher
His favorite numbers are 42, 4200, and 19.
41
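As a small check of the example above, the sketch below runs both regexes over the sentence with Python's `re` module and compares them with a hand-written deterministic FSM. The state numbering and the `fsm_accepts` helper are assumptions made for illustration; the group is written as non-capturing so that `findall` returns whole matches.

```python
import re

sentence = "His favorite numbers are 42, 4200, and 19."
original = re.findall(r"42(?:0)+", sentence)
simplified = re.findall(r"420(?:0)*", sentence)

def fsm_accepts(s):
    """FSM for 420(0)*: state 3 (after at least one 0) is the only final state."""
    transitions = {(0, "4"): 1, (1, "2"): 2, (2, "0"): 3, (3, "0"): 3}
    state = 0
    for ch in s:
        state = transitions.get((state, ch))
        if state is None:  # no edge for this character
            return False
    return state == 3

candidates = ["42", "420", "4200", "19"]
fsm_results = [fsm_accepts(c) for c in candidates]
regex_results = [re.fullmatch(r"42(?:0)+", c) is not None for c in candidates]
```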
![Page 42: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/42.jpg)
Information Extraction Steps
Source Selection and Preparation
Entity Recognition
Named Entity Annotation
Entity Disambiguation
42
![Page 43: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/43.jpg)
Def: Named Entity Annotation
Named Entity Annotation (NEA) is the task of (1) finding entity
names in a corpus and (2) annotating each name with a class
out of a set of given classes.
Example
43
Andrey Nikolaevich Kolmogorov graduated
from the Moscow State University in 1925.
Classes: {Person, Location, Organization, ..}
Person Organization
![Page 44: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/44.jpg)
Classes for NEA
Person
Organization
Location
Address
City
Country
DateTime
EmailAddress
PhoneNumber
URL
…
44
![Page 45: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/45.jpg)
Example
45
In 1930, [PER Kolmogorov] went on his first long trip
abroad, traveling to [LOC Göttingen] and [LOC
Munich] , and then to [LOC Paris] . His pioneering
work, About the [MISC Analytical Methods of
Probability Theory] , was published (in [MISC German]
) in 1931. Also in 1931, he became a professor at the
[ORG Moscow State University] .
In 1930, Kolmogorov went on his first long trip abroad,
traveling to Göttingen and Munich, and then to Paris.
His pioneering work, About the Analytical Methods of
Probability Theory, was published (in German) in 1931.
Also in 1931, he became a professor at the Moscow
State University.
![Page 46: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/46.jpg)
NEA Approaches
NEA by rules
Learning NEA rules
NEA by statistical models
Learning statistical NEA models
46
![Page 47: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/47.jpg)
Def: NEA by rules
NEA by rules uses rules of the form PATTERN => CLASS
It annotates a string by the class CLASS, if it matches the NEA
pattern PATTERN.
Example
([A-Z][a-z]+) (City|Forest) => Location
47
Light City is the only city in Ursa Minor.
Location
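A rule of the form PATTERN => CLASS can be sketched as a pair of a compiled regex and a class name. The `RULES` list and the `annotate` helper are illustrative names; the pattern is the one from the example, with a non-capturing group for the City/Forest alternative.

```python
import re

# Each rule is a (PATTERN, CLASS) pair.
RULES = [
    (re.compile(r"[A-Z][a-z]+ (?:City|Forest)"), "Location"),
]

def annotate(text, rules):
    """Return (matched string, class) for every match of every rule."""
    return [(m.group(0), cls)
            for pattern, cls in rules
            for m in pattern.finditer(text)]

annotations = annotate("Light City is the only city in Ursa Minor.", RULES)
```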
![Page 48: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/48.jpg)
Def: NEA pattern, feature
A NEA pattern is a sequence of features.
Features can be more than regexes
48
([A-Z][a-z]+) (City|Forest) => Location
Feature Feature
Dictionary:Title ”.” CapWord{2} => Person
{Dr,Mr,Prof,...} {.} {Ab Cd, Efg Hij,...}
![Page 49: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/49.jpg)
Def: Context/Designated features
Designated features in a NEA pattern are those whose match
will be annotated.
The others are context features.
49
”a pub in” [CapWord] => location
Arthur and Fenchurch have a drink
in a pub in Taunton.
<loc>Taunton</loc>
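One way to sketch the distinction: the context feature is matched but left untouched, while the designated feature sits in a capturing group and receives the annotation. The `CAPWORD` regex is an assumed definition of CapWord (one capital letter followed by lower-case letters).

```python
import re

CAPWORD = r"[A-Z][a-z]+"  # assumed definition of the CapWord feature
# Context feature: the literal "a pub in"; designated feature: the captured CapWord.
RULE = re.compile(r"a pub in (" + CAPWORD + r")")

def annotate(text):
    """Wrap only the designated match in <loc> tags; the context stays as it is."""
    return RULE.sub(lambda m: "a pub in <loc>" + m.group(1) + "</loc>", text)

annotated = annotate("Arthur and Fenchurch have a drink in a pub in Taunton.")
```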
![Page 50: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/50.jpg)
NEA Approaches
NEA by rules
Learning NEA rules
NEA by statistical models
Learning statistical NEA models
50
![Page 51: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/51.jpg)
Where do we get the rules from?
NEA rules are usually designed manually (as in the GATE
system).
However, they can also be learned automatically (as in the
Rapier, LP2, FOIL, and WHISK systems).
51
![Page 52: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/52.jpg)
Bottom Up NEA rule learning
Input: A corpus annotated with classes
Output: NEA rules
Algorithm:
1. Make one rule for each annotation
2. Prune redundant rules, merge rules
3. If rules are not general enough, go to 2
52
![Page 53: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/53.jpg)
Example: Bottom up NEA rule learning
0. Start with annotated training corpus
1. Find a NEA rule for each annotation
53
<pers>Arthur</pers> says ’Hello’
[Arthur] ”says ’Hello’” => pers
![Page 54: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/54.jpg)
Example: Bottom up NEA rule learning
2. Merge two rules by replacing a feature by a more general
feature
54
[Ford] ”says ’Hello’” => pers
[Arthur] ”says ’Hello’” => pers
[CapWord] ”says ’Hello’” => pers
Generalize
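The generalization step can be sketched under strong simplifications: a rule is a tuple of features plus a class, features are plain tokens, and the only generalization available is replacing a capitalized token by CapWord. The designated/context distinction is ignored, and the function names are hypothetical.

```python
def generalize(token):
    """Replace a capitalized literal token by the more general CapWord feature."""
    return "CapWord" if token[:1].isupper() else token

def merge(rule_a, rule_b):
    """Merge two rules of the same class and length, or return None if they do not fit."""
    (feats_a, cls_a), (feats_b, cls_b) = rule_a, rule_b
    if cls_a != cls_b or len(feats_a) != len(feats_b):
        return None
    merged = []
    for fa, fb in zip(feats_a, feats_b):
        if fa == fb:
            merged.append(fa)  # identical features are kept
        elif generalize(fa) == generalize(fb) == "CapWord":
            merged.append("CapWord")  # both generalize to the same feature
        else:
            return None
    return (tuple(merged), cls_a)

rule_ford = (("Ford", "says", "'Hello'"), "pers")
rule_arthur = (("Arthur", "says", "'Hello'"), "pers")
merged_rule = merge(rule_ford, rule_arthur)
```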
![Page 55: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/55.jpg)
Example: Bottom up NEA rule learning
3. Merge two rules by dropping a feature
55
[CapWord] ”says ’Bye’” => pers
[CapWord] ”says ’Hello’” => pers
[CapWord] ”says” => pers
Drop
![Page 56: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/56.jpg)
Example: Bottom up NEA rule learning
4. Remove redundant rules
5. Continue
56
[CapWord] (says|yells|screams)=>pers
[CapWord] ”says” => pers
[CapWord] (says|yells|screams)=>pers
![Page 57: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/57.jpg)
NEA Approaches
NEA by rules
Learning NEA rules
NEA by statistical models
Learning statistical NEA models
57
![Page 58: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/58.jpg)
Def: Statistical NEA corpus
A corpus for statistical NEA is a vector of words (”tokens”).
The output is a vector of class names.
Words that fall into no class are annotated with ”other”
58
Adams lives in California
pers oth oth loc
=: X, input
=: Y, output
X is a vector of words
Y is a vector of class names
![Page 59: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/59.jpg)
Def: Statistical NEA Feature
In statistical NEA, a feature is a function that maps
a word vector X
a position i in X
a class name y
to a real value
59
f1(X; i; y) = 1 if X[i] is CapWord & y=Person
= 0 else
f2(X; i; y) = 1 if X[i-1] is a title & y=Person
= 0 else
![Page 60: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/60.jpg)
Example: Statistical NEA features
f1(< Arthur,talks>; 1; Person) = 1
f1(< Arthur,talks>; 2; Person) = 0
f1(< Arthur,talks>; 1; Location) = 0
f1(< Arthur,talks>; 2; Location) = 0
60
f1(X; i; y) = 1 if X[i] is CapWord & y=Person
= 0 else
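The four evaluations above can be reproduced with a direct transcription of f1. The sketch keeps the 1-based positions of the example and assumes that a CapWord is a word with an upper-case first letter followed by lower-case letters.

```python
def is_cap_word(word):
    # Assumed definition: upper-case first letter, lower-case rest.
    return word[:1].isupper() and word[1:].islower()

def f1(X, i, y):
    """1 if the i-th word (1-based, as in the example) is a CapWord and y is Person, else 0."""
    return 1 if is_cap_word(X[i - 1]) and y == "Person" else 0

X = ["Arthur", "talks"]
values = [
    f1(X, 1, "Person"),
    f1(X, 2, "Person"),
    f1(X, 1, "Location"),
    f1(X, 2, "Location"),
]
```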
![Page 61: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/61.jpg)
Def: Statistical NEA
Given
a corpus vector X =< x1; …; xm >
a vector of features F =< f1; f2; …; fn >
a weight vector W =< w1;w2; …;wn >
Compute class names Y =< y1; …; ym >
that maximize
$\sum_{i} \sum_{j} w_j \, f_j(X, i, y_i)$
”Find class names for the words, s.t. each feature is happy for
each word.”
61
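A minimal sketch of the maximization, assuming every feature looks only at the class of position i. In that case the double sum splits into one term per position, so each class name can be chosen independently. The features, weights, and class list below are made up for illustration, and positions are 0-based here.

```python
CLASSES = ["Person", "Location", "Other"]

# Hypothetical features; each maps (X, i, y) to a real value.
def f_cap_person(X, i, y):
    return 1.0 if X[i][:1].isupper() and y == "Person" else 0.0

def f_after_in_location(X, i, y):
    return 1.0 if i > 0 and X[i - 1] == "in" and y == "Location" else 0.0

def f_lower_other(X, i, y):
    return 1.0 if X[i][:1].islower() and y == "Other" else 0.0

FEATURES = [f_cap_person, f_after_in_location, f_lower_other]
WEIGHTS = [1.0, 2.0, 1.0]  # hand-picked for this example

def score(X, i, y):
    """Weighted sum of all features for class y at position i."""
    return sum(w * f(X, i, y) for f, w in zip(FEATURES, WEIGHTS))

def annotate(X):
    """Pick, for each position, the class name with the highest score."""
    return [max(CLASSES, key=lambda y: score(X, i, y)) for i in range(len(X))]

labels = annotate(["Adams", "lives", "in", "California"])
```

Features that depend on neighboring class names would need a joint search over Y instead of this per-position choice.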
![Page 62: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/62.jpg)
NEA Approaches
NEA by rules
Learning NEA rules
NEA by statistical models
Learning statistical NEA models
62
![Page 63: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/63.jpg)
Where do we get the weights from?
Define features F =< f1; …; fn >
Produce a training corpus (i.e., a manually annotated corpus)
(X =< x1; …xm >; Y =< y1; …ym >)
Find weights W =< w1; …;wn > for the features so that statistical
NEA annotates the training corpus correctly
63
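The slides do not say how the weights are found. One simple possibility, shown here only as a sketch, is a perceptron-style update: whenever the current weights pick the wrong class, move each weight toward the features of the correct class and away from those of the wrong guess. The features and the one-sentence training corpus are made up; ties are broken by the order of `CLASSES`.

```python
CLASSES = ["Person", "Location", "Other"]

# Hypothetical features; each maps (X, i, y) to a real value.
def f_cap_person(X, i, y):
    return 1.0 if X[i][:1].isupper() and y == "Person" else 0.0

def f_after_in_location(X, i, y):
    return 1.0 if i > 0 and X[i - 1] == "in" and y == "Location" else 0.0

def f_lower_other(X, i, y):
    return 1.0 if X[i][:1].islower() and y == "Other" else 0.0

FEATURES = [f_cap_person, f_after_in_location, f_lower_other]

def predict(X, i, weights):
    return max(CLASSES,
               key=lambda y: sum(w * f(X, i, y) for f, w in zip(FEATURES, weights)))

def train(X, Y, epochs=10):
    """Perceptron-style training: adjust weights on every wrong prediction."""
    weights = [0.0] * len(FEATURES)
    for _ in range(epochs):
        for i, gold in enumerate(Y):
            guess = predict(X, i, weights)
            if guess != gold:
                for j, f in enumerate(FEATURES):
                    weights[j] += f(X, i, gold) - f(X, i, guess)
    return weights

X = ["Adams", "lives", "in", "California"]
Y = ["Person", "Other", "Other", "Location"]
W = train(X, Y)
predicted = [predict(X, i, W) for i in range(len(X))]
```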
![Page 64: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/64.jpg)
Information Extraction Steps
Source Selection and Preparation
Entity Recognition
Named Entity Annotation
Entity Disambiguation
64
![Page 65: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/65.jpg)
Def: Disambiguation
Given an ambiguous name in a corpus and its meanings,
disambiguation is the task of determining the intended
meaning.
65
“Homer eats a doughnut.” → ?
![Page 66: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/66.jpg)
Disambiguation Setting
Usually NER runs first, and the goal is to map the names to
entities in a KB.
66
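[Figure: on the corpus side, the sentence “Homer eats a doughnut.”; on the knowledge base side, the label “Homer” attached to two entities, one of type American and one of type Poet.]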
![Page 67: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/67.jpg)
Disambiguation Approaches
Context-based
Disambiguation Prior
Coherence
67
![Page 68: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/68.jpg)
Def: Context-based disambiguation
Context-based disambiguation (also: bag of words
disambiguation) maps a name in a corpus to the entity in the
KB whose context has the highest overlap to the context of the
name.
The context of a word in a corpus is the multi-set of the words in
its vicinity without the stopwords.
“Homer eats a doughnut.”
Context of ”Homer”: {eats, doughnut}
The context of an entity in a KB is the set of all labels of all
entities in its vicinity.
68
[Figure: knowledge base fragment. Homer livesIn USA, which has the labels “USA” and “America”; Homer likes doughnut, which has the label “doughnut”.]
Context of Homer: {doughnut, USA, America}
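The overlap computation can be sketched as follows. For brevity the sketch uses plain sets instead of multi-sets and lower-cases all words; the stopword list and the context of the second candidate are made up, and the candidate names are illustrative identifiers.

```python
STOPWORDS = {"a", "an", "the", "in", "of"}  # assumed, tiny stopword list

def corpus_context(sentence, name):
    """Words around the name in the sentence, without stopwords (as a set)."""
    words = [w.strip(".,").lower() for w in sentence.split()]
    return {w for w in words if w and w != name.lower() and w not in STOPWORDS}

# KB contexts: the first follows the example above, the second is hypothetical.
KB_CONTEXTS = {
    "Homer (American)": {"doughnut", "usa", "america"},
    "Homer (Poet)": {"iliad", "odyssey", "greece"},
}

def disambiguate(sentence, name, kb_contexts):
    """Map the name to the candidate whose KB context overlaps most with the corpus context."""
    context = corpus_context(sentence, name)
    return max(kb_contexts, key=lambda entity: len(context & kb_contexts[entity]))

context = corpus_context("Homer eats a doughnut.", "Homer")
best = disambiguate("Homer eats a doughnut.", "Homer", KB_CONTEXTS)
```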
![Page 69: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/69.jpg)
Def: Disambiguation Prior
A disambiguation prior is a mapping from names to their
meanings, weighted by the number of times that the name
refers to the meaning in a reference corpus.
69
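“This is very important for the Simpsons.”
[Figure: candidate meanings of the name “Simpsons”, among them Simpsons Store, weighted by reference-corpus counts; the counts 4 and 493201 appear in the figure.]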
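A disambiguation prior can be sketched as a table of counts per name, built from (name, meaning) pairs observed in a reference corpus. The observations and meaning labels below are invented for illustration and are not the counts from the slide.

```python
from collections import Counter, defaultdict

def build_prior(observations):
    """Count how often each name refers to each meaning in a reference corpus."""
    prior = defaultdict(Counter)
    for name, meaning in observations:
        prior[name][meaning] += 1
    return prior

def most_likely(prior, name):
    """Return the most frequent meaning of name, or None if the name is unknown."""
    counts = prior.get(name)
    return counts.most_common(1)[0][0] if counts else None

# Hypothetical reference-corpus observations.
observations = [
    ("Simpsons", "The Simpsons (family)"),
    ("Simpsons", "The Simpsons (family)"),
    ("Simpsons", "The Simpsons (family)"),
    ("Simpsons", "Simpsons Store"),
]
prior = build_prior(observations)
choice = most_likely(prior, "Simpsons")
```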
![Page 70: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/70.jpg)
Def: Coherence Criterion
The Coherence Criterion postulates that entities that are
mentioned in one document should be related in the KB.
70
Bart and Homer accidentally launch a model rocket into
the Springfield church, causing Lisa to leave Christianity.
![Page 71: Information Extractions from Texts - Synthesis Group](https://reader031.fdocuments.us/reader031/viewer/2022021308/6207b8ff5db02351542b361d/html5/thumbnails/71.jpg)
Questions ?
71