Self-supervised Probabilistic Methods for Extracting Facts from Text
Transcript of Self-supervised Probabilistic Methods for Extracting Facts from Text
1
Self-supervised Probabilistic Methods for Extracting Facts from Text
Doug Downey
2
Q: Who did IBM acquire in 2002?
A: “IBM acquired * in 2002”
Q: Who has won a best actor Oscar for playing a villain?
A: “won best actor for playing a villain” – 0 hits!
The answer isn’t on just one Web page
Web Search: Answering Questions
3
Q: Who has won a best actor Oscar for playing a villain?
A: Find all $X where the following appear:
“$X won best actor for $Y”
“$X, who played $Z in $Y”
“the villain, $Z”
“Forest Whitaker won best actor for The Last King of Scotland” – 210 hits
“Forest Whitaker, who played Idi Amin in The Last King of Scotland” – 4 hits
“the villain, Idi Amin” – 1 hit
Answer: Forest Whitaker
Solution: Synthesizing Across Pages
4
Given: One or more contexts indicating a semantic class C, e.g., “$X starred in $Y” => StarredIn($X, $Y)
– User-specified (TextRunner [Banko et al., IJCAI 2007])
– Automatically generated (KnowItAll [Etzioni et al., AIJ 2005])
– Bootstrapped from resources [Snow et al., NIPS 2004]
Output: instances of C
But extraction from contexts is highly imperfect!
=> Output P(x ∈ C) for each term x
Self-supervised – no hand-tagged examples
Self-supervised Information Extraction
5
Given: One or more contexts suggestive of a semantic class C, and a corpus of text
Output: P(x ∈ C) for each term x
KnowItAll Hypothesis – Terms x which occur in the suggestive contexts more frequently are more likely to be instances of C.
Distributional Hypothesis – Terms in the same class tend to appear in similar contexts.
My task: formalizing these heuristics into statements about P(x ∈ C) given a corpus.
Self-supervised Information Extraction
6
Who cares about Probabilities?
Why not use rankings (e.g., the precision/recall metric)?
P( WonBestActorFor(Forest Whitaker, The Last King of Scotland) )
And P( PlayedVillainIn(Forest Whitaker, The Last King of Scotland) )
=> Our goal: an estimate of the probability that Forest Whitaker won best actor for playing a villain.
Not possible with rankings! In fact, combining even perfect rankings can yield accuracy < .
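Calibrated probabilities, unlike rankings, compose across facts. A minimal sketch of combining the two extraction probabilities above; the numeric values are hypothetical placeholders, and multiplying them assumes the two facts are independent:

```python
# Combining calibrated extraction probabilities for a conjunctive query.
# The probability values below are hypothetical, for illustration only.
def combine_independent(p_a: float, p_b: float) -> float:
    """P(A and B) under an independence assumption."""
    return p_a * p_b

p_won = 0.9      # P( WonBestActorFor(Forest Whitaker, The Last King of Scotland) )
p_villain = 0.8  # P( PlayedVillainIn(Forest Whitaker, The Last King of Scotland) )

# Estimated probability that Forest Whitaker won best actor for playing a villain.
p_answer = combine_independent(p_won, p_villain)  # 0.72
```

With only two ranked lists and no probabilities, there is no principled way to perform this combination.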
7
1) Two Research Questions
2) URNS model
3) REALM
4) Proposal for DH
5) Chez KnowItAll
Outline
8
Term-Context Matrix
Terms
. . . 98 0 2 25 1 513 . . .
. . . 2 0 930 0 0 1 . . .
. . . 1 0 10 0 0 1 . . .
Contexts
E.g., Miami
(Robert De Niro, Raging Bull)
…potential elements of C
9
Terms
. . . 98 0 2 25 1 513 . . .
. . . 2 0 930 0 0 1 . . .
. . . 1 0 10 0 0 1 . . .
Contexts
E.g.,
cities such as $X,
$X said $Y offered to,
also: parse trees, bag of words, containing Web domain, etc.
Term-Context Matrix
10
Miami:     . . . 98 0 20 250 30 513 . . .
Twisp:     . . . 5 0 1 2 1 1 . . .
Star Wars: . . . 1 1000 0 2 1 1 . . .
Contexts (columns), e.g.: “X soundtrack”, “he visited X and”, “cities such as X”, “X and other cities”, “X lodging”
KnowItAll Hypothesis: counts in the suggestive context columns
Distributional Hypothesis: similarity between term rows
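A term-context matrix like the one above can be built by counting pattern matches over a corpus. A toy sketch; the corpus sentences, the exact pattern strings, and the capitalized-term matching heuristic are all illustrative assumptions, not the talk's actual extraction machinery:

```python
import re
from collections import defaultdict

# Toy corpus; sentences are made up for illustration.
corpus = [
    "cities such as Miami are warm",
    "he visited Miami and liked it",
    "the Star Wars soundtrack is famous",
    "cities such as Twisp are small",
]

# Context patterns with an X slot, in the spirit of the slide's columns.
patterns = ["cities such as X", "he visited X and", "the X soundtrack"]

def build_matrix(corpus, patterns):
    """Map term -> context -> count by matching each pattern's X slot.
    X is matched by one or two capitalized words (a crude heuristic)."""
    matrix = defaultdict(lambda: defaultdict(int))
    for context in patterns:
        regex = re.compile(
            re.escape(context).replace("X", r"([A-Z]\w*(?: [A-Z]\w*)*)"))
        for sentence in corpus:
            for term in regex.findall(sentence):
                matrix[term][context] += 1
    return matrix

m = build_matrix(corpus, patterns)
# m["Miami"] now holds counts for the contexts Miami appeared in.
```

Rows of the resulting structure play the role of the matrix rows above: terms with similar context counts are distributionally similar.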
11
Two Research Questions
M – term-context matrix
M_C – columns of M for contexts suggesting C
p_x – prior estimate that x ∈ C
Formalizing the KnowItAll hypothesis: What is an expression for P(x ∈ C | M_C, p_x)?
Formalizing the distributional hypothesis: What is an expression for P(x ∈ C | M, p_x)?
12
Key Requirements for Models
1) Produce probabilities
2) Execute at “interactive” speed
3) No hand-tagged data
13
1) Two Research Questions
2) URNS model
3) REALM
4) Proposal for DH
5) Chez KnowItAll
Outline
14
Miami:     . . . 98 0 20 250 30 513 . . .
Twisp:     . . . 5 0 1 2 1 1 . . .
Star Wars: . . . 1 1000 0 2 1 1 . . .
Contexts (columns), e.g.: “X soundtrack”, “he visited X and”, “cities such as X”, “X and other cities”, “X lodging”
KnowItAll Hypothesis: counts in the suggestive context columns
Distributional Hypothesis: similarity between term rows
15
Miami:     . . . 98 0 20 250 30 513 . . .
Twisp:     . . . 5 0 1 2 1 1 . . .
Star Wars: . . . 1 1000 0 2 1 1 . . .
Contexts (columns), e.g.: “X soundtrack”, “he visited X and”, “cities such as X”, “X and other cities”, “X lodging”
KnowItAll Hypothesis: counts in the suggestive context columns
Distributional Hypothesis: similarity between term rows
16
1. Modeling Redundancy – The Problem
Consider a single context, e.g.:“cities such as x”
If an extraction x appears k times in a set of n sentences containing this pattern, what is the probability that x ∈ C?
17
Modeling with k
“…countries such as Saudi Arabia…”
“…countries such as the United States…”
“…countries such as Saudi Arabia…”
“…countries such as Japan…”
“…countries such as Africa…”
“…countries such as Japan…”
“…countries such as the United Kingdom…”
“…countries such as Iraq…”
“…countries such as Afghanistan…”
“…countries such as Australia…”
Country(x)
extractions, n = 10
18
Modeling with k
Country(x) extractions, n = 10

Noisy-Or Model:
P(x ∈ C | x appears k times) = P_noisy-or = 1 − (1 − p)^k
where p is the probability that a single sentence is true, e.g. p = 0.9

Extraction       k   P_noisy-or
Saudi Arabia     2   0.99
Japan            2   0.99
United States    1   0.9
Africa           1   0.9
United Kingdom   1   0.9
Iraq             1   0.9
Afghanistan      1   0.9
Australia        1   0.9

Important – noisy-or ignores:
– Sample size (n)
– Distribution of C
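The noisy-or computation on the n = 10 sample can be sketched directly; the formula is the slide's, with p = 0.9 as given:

```python
def noisy_or(k: int, p: float = 0.9) -> float:
    """P(x in C | x appears k times) = 1 - (1 - p)^k."""
    return 1.0 - (1.0 - p) ** k

# Appearance counts from the n = 10 "countries such as ..." sample.
counts = {"Saudi Arabia": 2, "Japan": 2, "United States": 1, "Africa": 1}

probs = {term: noisy_or(k) for term, k in counts.items()}
# Note that the erroneous extraction "Africa" still receives 0.9:
# noisy-or never consults the sample size n or the distribution of C.
```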
19
Needed in Model: Sample Size

Country(x) extractions, n ≈ 50,000:
Extraction           k      P_noisy-or
Japan                1723   0.9999…
Norway               295    0.9999…
Israil               1      0.9
OilWatch Africa      1      0.9
Religion Paraguay    1      0.9
Chicken Mole         1      0.9
Republics of Kenya   1      0.9
Atlantic Ocean       1      0.9
New Zeland           1      0.9

Country(x) extractions, n = 10:
Extraction       k   P_noisy-or
Saudi Arabia     2   0.99
Japan            2   0.99
United States    1   0.9
Africa           1   0.9
United Kingdom   1   0.9
Iraq             1   0.9
Afghanistan      1   0.9
Australia        1   0.9

As sample size increases, noisy-or becomes inaccurate.
20
Needed in Model: Distribution of C

Frequency model:
P(x ∈ C | x appears k times in n) = P_freq = 1 − (1 − p)^(1000·k/n)

Country(x) extractions, n ≈ 50,000:
Extraction           k      P_noisy-or
Japan                1723   0.9999…
Norway               295    0.9999…
Israil               1      0.9
OilWatch Africa      1      0.9
Religion Paraguay    1      0.9
Chicken Mole         1      0.9
Republics of Kenya   1      0.9
Atlantic Ocean       1      0.9
New Zeland           1      0.9
21
Needed in Model: Distribution of C

Frequency model:
P(x ∈ C | x appears k times in n) = P_freq = 1 − (1 − p)^(1000·k/n)

Country(x) extractions, n ≈ 50,000:
Extraction           k      P_freq
Japan                1723   0.9999…
Norway               295    0.9999…
Israil               1      0.05
OilWatch Africa      1      0.05
Religion Paraguay    1      0.05
Chicken Mole         1      0.05
Republics of Kenya   1      0.05
Atlantic Ocean       1      0.05
New Zeland           1      0.05
22
Needed in Model: Distribution of C

City(x) extractions, n ≈ 50,000:
Extraction       k     P_freq
Toronto          274   0.9999…
Belgrade         81    0.98
Lacombe          1     0.05
Kent County      1     0.05
Nikki            1     0.05
Ragaz            1     0.05
Villegas         1     0.05
Cres             1     0.05
Northeastwards   1     0.05

Country(x) extractions, n ≈ 50,000:
Extraction           k      P_freq
Japan                1723   0.9999…
Norway               295    0.9999…
Israil               1      0.05
OilWatch Africa      1      0.05
Religion Paraguay    1      0.05
Chicken Mole         1      0.05
Republics of Kenya   1      0.05
Atlantic Ocean       1      0.05
New Zeland           1      0.05

The probability that x ∈ C depends on the distribution of C.
23
The URNS Model – Single Urn
24
The URNS Model – Single Urn
Urn for City(x): U.K., Sydney, Cairo, Tokyo, Tokyo, Atlanta, Atlanta, Yakima, Utah, U.K.
25
The URNS Model – Single Urn
Urn for City(x): U.K., Sydney, Cairo, Tokyo, Tokyo, Atlanta, Atlanta, Yakima, Utah, U.K.
Draw “Tokyo” → “…cities such as Tokyo…”
26
Single Urn – Formal Definition
C – set of unique target labels
E – set of unique error labels
num(b) – number of balls labeled by b ∈ C ∪ E
num(B) – distribution giving the number of balls for each label b ∈ B
27
Single Urn Example
Urn for City(x): U.K., Sydney, Cairo, Tokyo, Tokyo, Atlanta, Atlanta, Yakima, Utah, U.K.
num(“Atlanta”) = 2
num(C) = {2, 2, 1, 1, 1}
num(E) = {2, 1}
Estimated from data
28
Single Urn: Computing Probabilities
If an extraction x appears k times in a set of n sentences containing a pattern, what is the probability that x ∈ C?
29
Single Urn: Computing Probabilities
Given that an extraction x appears k times in n draws from the urn (with replacement), what is the probability that x ∈ C?

P(x ∈ C | x appears k times in n draws) =
  [ Σ_{c ∈ C} (num(c)/s)^k (1 − num(c)/s)^(n−k) ] / [ Σ_{b ∈ C ∪ E} (num(b)/s)^k (1 − num(b)/s)^(n−k) ]

where s is the total number of balls in the urn
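The single-urn probability sums, over labels, the (binomial-form) likelihood that a label with r balls is drawn exactly k times in n draws, then normalizes over target versus all labels. A sketch of that computation; the ball counts passed in are the toy num(C) and num(E) from the example:

```python
def urns_probability(k: int, n: int, num_C, num_E) -> float:
    """P(x in C | x appears k times in n draws with replacement).
    num_C / num_E list the ball count for each target / error label;
    s is the total number of balls in the urn."""
    s = sum(num_C) + sum(num_E)
    def weight(r):
        # Likelihood (up to the shared binomial coefficient) that a
        # label with r balls appears exactly k times in n draws.
        return (r / s) ** k * (1 - r / s) ** (n - k)
    target = sum(weight(r) for r in num_C)
    total = target + sum(weight(r) for r in num_E)
    return target / total

# Toy urn from the City(x) example: num(C) = {2, 2, 1, 1, 1}, num(E) = {2, 1}.
p = urns_probability(k=2, n=10, num_C=[2, 2, 1, 1, 1], num_E=[2, 1])
```

Unlike noisy-or, this expression depends on both the sample size n and the distribution of ball counts, which is exactly what the earlier examples showed was needed.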
30
Uniform Special Case
Consider the case where num(c_i) = R_C and num(e_j) = R_E for all c_i ∈ C, e_j ∈ E. Then:

P(x ∈ C | k, n) / P(x ∉ C | k, n) = (|C| / |E|) · (R_C / R_E)^k · ((s − R_C) / (s − R_E))^(n−k)

Then, using a Poisson approximation:

≈ (|C| / |E|) · (R_C / R_E)^k · e^(n (R_E − R_C) / s)

Odds increase exponentially with k, but decrease exponentially with n.
31
The URNS Model – Multiple Urns
Correlation across contexts is higher for elements of C than for elements of E.
32
[Bar chart: deviation from ideal log likelihood (0–5) for the urns, noisy-or, and pmi models on the City, Film, Country, and MayorOf classes.]
Unsupervised Performance
33
1) Two Research Questions
2) URNS model
3) REALM
4) Proposal for DH
5) Chez KnowItAll
Outline
34
Redundancy fails on “sparse” facts
[Plot: number of times an extraction appears in the pattern (y-axis, 0–500) vs. frequency rank of the extraction (x-axis, 0–100,000).]
High-frequency extractions tend to be correct, e.g. (Michael Bloomberg, New York City).
The sparse tail is a mixture of correct and incorrect, e.g. (Dave Shaver, Pickerington), (Ronald McDonald, McDonaldland).
35
Miami:     . . . 98 0 20 250 30 513 . . .
Twisp:     . . . 5 0 1 2 1 1 . . .
Star Wars: . . . 1 1000 0 2 1 1 . . .
Contexts (columns), e.g.: “X soundtrack”, “he visited X and”, “cities such as X”, “X and other cities”, “X lodging”
KnowItAll Hypothesis: counts in the suggestive context columns
Distributional Hypothesis: similarity between term rows
36
Miami:     . . . 98 0 20 250 30 513 . . .
Twisp:     . . . 5 0 1 2 1 1 . . .
Star Wars: . . . 1 1000 0 2 1 1 . . .
Contexts (columns), e.g.: “X soundtrack”, “he visited X and”, “cities such as X”, “X and other cities”, “X lodging”
KnowItAll Hypothesis: counts in the suggestive context columns
Distributional Hypothesis: similarity between term rows
37
Assessing Sparse Extractions
Task: Identify which sparse extractions are correct.
Strategy:
1. Build a model of how common extractions occur in text
2. Rank sparse extractions by fit to the model
• The distributional hypothesis!
Our contribution: Unsupervised language models.
– Methods for mitigating sparsity
– Precomputed, so greatly improved scalability
38
The REALM Architecture
RElation Assessment using Language Models
Input: Set of extractions for relation R:
E_R = {(arg1_1, arg2_1), …, (arg1_M, arg2_M)}
1) Seeds S_R = the s most frequent pairs in E_R (assume these are correct)
2) Output a ranking of (arg1, arg2) ∈ E_R by distributional similarity to each (seed1, seed2) in S_R
39
Distributional Similarity (1)
N-gram Language Model:
Estimate P(w_i | w_{i−1}, …, w_{i−k})
Number of parameters scales with (vocabulary size)^(k+1)
w_{i−k} … w_{i−1} w_i
40
Distributional Similarity (2)
Naïve Approach: compare context distributions:
P(w_g, …, w_j | seed1, seed2)
P(w_g, …, w_j | arg1, arg2)
But j − g can be large: many parameters and sparse data => inaccuracy
w_g … w_h seed1 w_{h+2} … w_i seed2 w_{i+2} … w_j
w_g … w_h arg1 w_{h+2} … w_i arg2 w_{i+2} … w_j
41
The REALM Architecture
Two steps for assessing R(arg1, arg2):
• Typechecking
– e.g., for AuthorOf(arg1, arg2), arg1 must be an author and arg2 a written work
– Valuable, but allows errors like AuthorOf(Danielle Steele, Hamlet)
• Relation Assessment
– Ensure R actually holds between arg1 and arg2
Both steps use small, pre-computed language models => scalable
42
Task: For each extraction (arg1, arg2) ∈ E_R, determine if arg1 and arg2 are the proper type for R.
Solution: Assume the seed_j ∈ S_R are of the proper type, and rank each arg_j by distributional similarity to the seed_j.
Computing Distributional Similarity:
1) Offline, train a Hidden Markov Model (HMM) of the corpus
2) Measure distance between arg_j and seed_j in the HMM’s N-dimensional latent state space
Typechecking and HMM-T
43
HMM Language Model
t_i t_{i+1} t_{i+2} t_{i+3}
w_i w_{i+1} w_{i+2} w_{i+3}
“cities such as Seattle”
w_i – words; t_i ∈ {1, …, N} – latent states
Offline Training: Learn P(w | t) and P(t_i | t_{i−1}, …, t_{i−k}) to maximize the probability of the corpus (using EM).
The diagram shows the k = 1 case.
44
HMM-T
Trained HMM gives a “distributional summary” of each w: an N-dimensional state distribution P(t | w).
Typecheck each arg by comparing state distributions:

f(arg) = (1 / |Seeds|) Σ_{seed ∈ Seeds} KL( P(t | seed) ‖ P(t | arg) )

Rank extractions in ascending order of f(arg) summed over arguments.
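The HMM-T score averages the KL divergence from each seed's state distribution to the candidate's, so lower means more seed-like. A toy sketch; the state distributions and the example words attached to them are made-up numbers, not trained HMM output:

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) over discrete state distributions.
    eps guards against zero entries in toy inputs."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def typecheck_score(arg_dist, seed_dists):
    """f(arg): mean KL from each seed's state distribution to the arg's.
    Lower scores mean the argument is distributionally closer to the seeds."""
    return sum(kl(s, arg_dist) for s in seed_dists) / len(seed_dists)

# Toy P(t | w) over N = 3 latent states (illustrative values).
seeds = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]   # e.g. frequent city seeds
city_like = [0.65, 0.25, 0.10]                # e.g. a sparse city extraction
film_like = [0.05, 0.15, 0.80]                # e.g. a film title

# The city-like candidate scores lower (better) than the film-like one.
assert typecheck_score(city_like, seeds) < typecheck_score(film_like, seeds)
```

Ranking candidates by this score, ascending, implements the typechecking step.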
45
Miami: < . . . 71 25 1 513 . . . >
Twisp: < . . . 0 0 0 1 . . . >
Contexts: “when he visited X”, “he visited X and”, “visited X and other”, “X and other cities”
Problems:
– Vectors are large
– Intersections are sparse
Why not use context vectors?
46
Miami: < . . . 71 25 1 513 . . . >
P(t | Miami): < 0.14 0.01 … 0.06 > for t = 1, 2, …, N
Latent state distribution P(t | w):
– Compact (efficient – 10–50x less data retrieved)
– Dense (accurate)
HMM-T Advantages (1)
47
HMM-T Advantages (2)
Is Pickerington of the same type as Chicago?
“Chicago , Illinois” – “Pickerington , Ohio”
Context counts:
               <x> , Illinois   <x> , Ohio
Chicago        291              0
Pickerington   0                1
=> N-grams say no; the dot product is 0!
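The contrast can be made concrete with the slide's counts; the two-state latent distributions below are made-up toy numbers standing in for trained HMM output:

```python
# Sparse context-count vectors over the two suffix contexts on the slide:
# ["<x> , Illinois", "<x> , Ohio"] (counts from the slide).
chicago_counts = [291, 0]
pickerington_counts = [0, 1]

dot = sum(a * b for a, b in zip(chicago_counts, pickerington_counts))
assert dot == 0  # raw n-gram context vectors say the terms share nothing

# Toy latent-state distributions P(t | w): the HMM can map ", Illinois"
# and ", Ohio" to the same latent state, so city-like terms end up with
# overlapping distributions even when their surface contexts never overlap.
chicago_states = [0.8, 0.2]
pickerington_states = [0.7, 0.3]
overlap = sum(a * b for a, b in zip(chicago_states, pickerington_states))
assert overlap > 0
```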
48
The HMM generalizes:
Chicago , Illinois
Pickerington , Ohio
HMM-T Advantages (3)
49
HMM-T Limitations
Learning time is proportional to (corpus size × T^(k+1))
T = number of latent states
k = HMM order
We use the limited values T = 20, k = 3
– Sufficient for typechecking (Santa Clara is a city)
– Too coarse for relation assessment (Santa Clara is where Intel is headquartered)
50
1) Two Research Questions
2) URNS model
3) REALM
4) Proposal for Formalizing the DH
5) Chez KnowItAll
Outline
51
Formalizing the Distributional Hypothesis
How is this not just semi-supervised or transductive learning?
– Starts with a prior estimate that x ∈ C, not hand-labeled examples.
– Features are counts.
Two alternative formalizations:
– Context Counts
– Distance Functions
Don’t yet have an expression for P(x ∈ C) – instead: basic formalizations, preliminary results.
52
Context Counts
Terms
. . . 920 600 293 20 2 1 . . .
. . . 20 110 930 3 0 1 . . .
. . . 43 30 0 1 0 2 . . .
Contexts
Reliable
Unreliable
As the corpus increases in size, the number of reliable contexts increases.
53
Context Counts
Terms
. . . 920 600 293 20 2 1 . . .
. . . 20 110 930 3 0 1 . . .
. . . 43 30 0 1 0 2 . . .
Contexts
Reliable
Unreliable
Basic idea: model each reliable context as a “single urn.”
54
Context Counts – Assumptions
1) Only a term’s reliable contexts are useful.
• Reliable contexts occur at least r times with the term.
2) Contexts are conditionally independent given C.
3) Terms and contexts are Zipf distributed.
Key question: how many reliable contexts co-occur with a given term in a corpus of n total tokens?
This can be computed in closed form given the above assumptions.
55
Preliminary Result (1)
Assume that the Bayes risk for a classifier using just one context is at least β. Then for a corpus of n tokens over a vocabulary V and context set Π,
56
Preliminary Result (2)
Provides non-trivial bounds. For the Google n-grams data set (roughly):
n = 1,000,000,000,000
|V| = 15,000,000
|Π| = 1,000,000,000
Setting β = 0.45, we get E[accuracy] ≤ 0.85.
57
Alternate Formalization: Distance Functions
[Plot: P(x, y in same class | distance(x, y)) on the y-axis (0–1) vs. distance(x, y) on the x-axis.]
58
Distance Functions
Key Formal Problem:
Given a distance function d(x, y) and a prior P(x ∈ C), what is
P(x ∈ C | prior, d(x_i, x_j) for i, j ∈ V)?
Straightforward to compute, but requires (naively) summing over the power set of V.
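For a tiny vocabulary the naive computation can be written out directly: enumerate every subset S ⊆ V as the hypothesized extension of C, score it using the prior and the pairwise distances, then marginalize over the subsets containing each term. The scoring function below is a made-up illustration of the exponential blow-up (all values are toy numbers), not the proposal's actual model:

```python
from itertools import chain, combinations
import math

V = ["Miami", "Twisp", "Star Wars"]
prior = 0.5                        # illustrative prior P(x in C)
d = {("Miami", "Twisp"): 0.2,      # toy symmetric distances
     ("Miami", "Star Wars"): 0.9,
     ("Twisp", "Star Wars"): 0.8}

def dist(x, y):
    return d.get((x, y)) or d.get((y, x))

def subset_score(S):
    """Unnormalized score for C = S: a prior term times a distance term
    that rewards subsets whose members are close together (made-up form)."""
    score = prior ** len(S) * (1 - prior) ** (len(V) - len(S))
    for x, y in combinations(S, 2):
        score *= math.exp(-dist(x, y))
    return score

# All 2^|V| subsets of V -- this enumeration is the exponential cost.
subsets = list(chain.from_iterable(combinations(V, r) for r in range(len(V) + 1)))
Z = sum(subset_score(S) for S in subsets)
# Marginal P(x in C): sum over the subsets that contain x.
p_in_C = {x: sum(subset_score(S) for S in subsets if x in S) / Z for x in V}
```

With |V| = 3 there are only 8 subsets; at realistic vocabulary sizes the sum is intractable, which is exactly the formal obstacle stated above.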
59
Empirical Investigation
Either formalization is governed by parameters, some specific to C, others more global.
Proposed Experiments – with a variety of classes, measure empirically:
Context Counts:
– Urn parameters for contexts
– Dependence between contexts
Distance Functions:
– Observed distance functions, as a function of term frequency, corpus size, and class prevalence.
60
1) Two Research Questions
2) URNS model
3) REALM
4) Proposal for DH
5) Chez KnowItAll
Outline
61
Theoretical Questions:
Entrée: DH Formalisms (Distance Functions, Context Counts, something else?)
Sides: Relationship between KH and DH; generative textual models yielding the hypotheses.
Empirical Questions:
Improving REALM’s language modeling techniques
Modeling polysemy
Language modeling accuracy vs. IE accuracy
Applying HMM-T to NER
62
Entrée: DH Formalisms
Context Counts advantages:
– Explicitly models counts
– Leverages the Urns model
– Likely tractable
Distance Function advantages:
– Applicable to semi-supervised learning
– More “pure” instantiation of the DH
63
Theoretical Sides (1)
Relationship between KH and DH
Terms:
. . . 920 400 293 … 2 1 . . .
. . . 200 170 30 … 0 1 . . .
. . . 43 30 50 … 0 2 . . .
Contexts span from generic, e.g. “in $X” (where the DH applies), to suggestive, e.g. “cities such as $X” (where the KH applies).
64
Theoretical Sides(2)
Is there a generative model of text that leads to KH, DH?
E.g., if text is generated by an HMM…
65
Empirical Questions (1)
Improving REALM with language modeling enhancements:
– Character-level models, syntax, PCFGs, etc.
Modeling polysemy:
– P(t | Chicago) is the same for Chicago the city and Chicago the musical.
– Idea: an HMM that selectively bifurcates words into senses when this improves LM accuracy.
66
Empirical Questions (2)
Language modeling accuracy vs. information extraction accuracy: is it monotonic?
Applying HMM-T to Named Entity Recognition
67
Thanks!