Albert Gatt Corpora and Statistical Methods Lecture 5
Slide 2
Application 3: Verb selectional restrictions
Slide 3
Observation
Some verbs place strong restrictions on the semantic category of the NPs they take as arguments.
Assumption: we are focusing attention on Direct Objects (DOs) only.
e.g. eat selects for FOOD DOs: eat cake, eat some fresh vegetables
grow selects for LEGUME DOs: grow potatoes
Slide 4
Not all verbs are equally constraining
Some verbs seem to place fewer restrictions than others.
see doesn't seem too restrictive: see John, see the potato, see the fresh vegetables
Slide 5
Problem definition
For a given verb and a potential set of arguments (nouns), we want to learn to what extent the verb selects for those arguments.
Rather than individual nouns, we're better off using noun classes (FOOD etc.), since these allow us to generalise more.
We can obtain these classes from a standard resource, e.g. WordNet.
Slide 6
A short detour: Kullback-Leibler divergence
Slide 7
Kullback-Leibler divergence
We are often in a position where we estimate a probability distribution from (incomplete) data; this problem is inherent in sampling.
We end up with an estimated distribution Q, which is intended as a model of the true distribution P.
How good is Q as a model?
Kullback-Leibler divergence tells us how well our model matches the actual distribution.
Slide 8
Motivating example
Suppose I'm interested in the semantic type or class to which a noun belongs, e.g.:
cake, meat, cauliflower are types of FOOD (among other things)
potato, carrot are types of LEGUME (among other things)
How do I infer this?
It helps if I know that certain predicates, like grow, select for some types of DO and not others:
*grow meat, *grow cake
grow potatoes, grow carrots
Slide 9
Motivating example cont'd
Ingredients:
C: the class of interest (e.g. LEGUME)
v: the verb of interest (e.g. grow)
P(C) = probability of class C: the prior probability of finding some element of C as DO of any verb
P(C|v) = probability of C given that we know that a noun is a DO of grow: this is my posterior probability
More precise way of asking the question: does the probability distribution of C change given the info about v?
Slide 10
Ingredients for KL Divergence
some prior distribution
some posterior distribution
Intuition: KL-Divergence measures how much information we gain by moving from the prior to the posterior; if it is 0, then we gain no info.
Given two probability distributions P and Q, with probability mass functions p(x) and q(x), the KL-Divergence is denoted D(p||q).
Slide 11
Calculating KL-Divergence
The divergence between two distributions with probability mass functions p(x) and q(x):
D(p||q) = Σ_x p(x) log ( p(x) / q(x) )
In our setting, p is the posterior distribution and q is the prior, so D(p||q) measures how far the posterior diverges from the prior.
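As a rough illustration (not part of the original slides), here is a minimal Python sketch of the formula above; the function name and the toy distributions are made up for this example.

```python
import math

def kl_divergence(p, q, base=2.0):
    """D(p||q) = sum over x of p(x) * log(p(x) / q(x)).

    p and q are dicts mapping outcomes to probabilities; terms with
    p(x) = 0 contribute nothing, and q(x) must be non-zero wherever
    p(x) > 0 for the divergence to be finite.
    """
    return sum(px * math.log(px / q[x], base)
               for x, px in p.items() if px > 0)

# Toy example: a uniform prior vs. a skewed posterior
prior = {"a": 0.5, "b": 0.5}
posterior = {"a": 0.9, "b": 0.1}
print(kl_divergence(posterior, prior))  # > 0: the posterior diverges from the prior
print(kl_divergence(prior, prior))      # 0.0: identical distributions, no info gained
```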
Slide 12
More on the interpretation of KL-Divergence If probability
distribution P is interpreted as the truth and distribution Q is my
approximation, then: D(p||q) tells me how much extra info I need to
add to Q to get to the actual truth
Slide 13
Back to our problem: Applying KL-divergence to selectional
restrictions
Slide 14
Resnik's model (Resnik 1996)
Two main ingredients:
1. Selectional Preference Strength (S): how strongly a verb constrains its direct object (a global estimate)
2. Selectional Association (A): how much a verb v is associated with a given noun class (an estimate specific to a given class)
Slide 15
Notation
v = a verb of interest
S(v) = the selectional preference strength of v
c = a noun class
C = the set of all noun classes
A(v,c) = the selectional association between v and class c
Slide 16
Selectional Preference Strength
S(v) is the KL-Divergence between:
the overall prior distribution of all noun classes
the posterior distribution of noun classes in the direct object position of v
i.e. how much info we gain from knowing the probability that members of a class occur as DO of v.
It works as a global estimate of how much v constrains its arguments semantically: the more it constrains them, the more info we stand to gain from knowing that an argument occurs as DO of v.
Slide 17
S(grow): prior vs. posterior Source: Resnik 1996, p. 135
Slide 18
Calculating S(v)
S(v) = D( P(C|v) || P(C) ) = Σ_c P(c|v) log ( P(c|v) / P(c) )
This quantifies the extent to which our prior and posterior probability estimates diverge: how much info do we gain about C by knowing it's the object of v?
Slide 19
Some more examples

class      P(c)   P(c|eat)  P(c|see)  P(c|find)
people     0.25   0.01      0.25      0.33
furniture  0.25   0.01      0.25      0.33
food       0.25   0.97      0.25      0.33
action     0.25   0.01      0.25      0.01
SPS S(v)          1.76      0.00      0.35

How much info do we gain if we know what a noun is the DO of?
quite a lot if it's an argument of eat
not much if it's an argument of find
none if it's an argument of see
Source: Manning and Schütze 1999, p. 290
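The SPS row can be reproduced directly from the table. Below is a small Python sketch (my own illustration, not from the slides) that recomputes S(v) with log base 2, using the probabilities above.

```python
import math

# Prior P(c) and posteriors P(c|v) copied from the table above
prior = {"people": 0.25, "furniture": 0.25, "food": 0.25, "action": 0.25}
posterior = {
    "eat":  {"people": 0.01, "furniture": 0.01, "food": 0.97, "action": 0.01},
    "see":  {"people": 0.25, "furniture": 0.25, "food": 0.25, "action": 0.25},
    "find": {"people": 0.33, "furniture": 0.33, "food": 0.33, "action": 0.01},
}

def selectional_preference_strength(verb):
    """S(v) = D( P(c|v) || P(c) ), summed over noun classes c (log base 2)."""
    return sum(p_cv * math.log2(p_cv / prior[c])
               for c, p_cv in posterior[verb].items() if p_cv > 0)

for v in ("eat", "see", "find"):
    print(v, round(selectional_preference_strength(v), 2))
# eat 1.76, see 0.0, find 0.35 -- matching the SPS row of the table
```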
Slide 20
Selectional association
This is estimated based on selectional preference strength.
It tells us how much a verb is associated with a specific class, given the extent to which the verb constrains its arguments.
Given a class c, A(v,c) tells us how much of S(v) is contributed by c.
Slide 21
Calculating A(v,c)
A(v,c) = P(c|v) log ( P(c|v) / P(c) ) / S(v)
The numerator is the term contributed by class c to our summation for S(v); dividing by S(v) gives the proportion of S(v) which is caused by class c.
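Continuing the sketch from slide 19 (reusing prior, posterior and selectional_preference_strength defined there), a possible implementation of A(v,c) looks like this; the function name is mine, not Resnik's.

```python
def selectional_association(verb, cls):
    """A(v,c): the share of S(v) contributed by class c."""
    s_v = selectional_preference_strength(verb)
    if s_v == 0:
        return 0.0  # the verb places no constraint at all (e.g. 'see' in the toy table)
    p_cv = posterior[verb][cls]
    if p_cv == 0:
        return 0.0
    return (p_cv * math.log2(p_cv / prior[cls])) / s_v

print(round(selectional_association("eat", "food"), 2))    # 1.08: FOOD accounts for (more than) all of S(eat)
print(round(selectional_association("eat", "people"), 2))  # -0.03: PEOPLE makes a small negative contribution
```

Note that because individual terms in the sum for S(v) can be negative, a single class's share can be slightly negative or exceed 1.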
Slide 22
From A(v,c) to A(v,n)
We know how to estimate the association strength of a class with v.
Problem: some nouns can belong to more than one class.
Let classes(n) be the set of classes to which noun n belongs; then:
A(v,n) = max over c in classes(n) of A(v,c)
Slide 23
Example
Susan interrupted the chair.
chair is in class FURNITURE
chair is in class PEOPLE
A(interrupt,PEOPLE) > A(interrupt,FURNITURE)
so A(interrupt,chair) = A(interrupt,PEOPLE)
Note that this is a kind of word-sense disambiguation!
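A small Python sketch of this disambiguation step, with made-up association scores (the numbers below are illustrative, not Resnik's):

```python
# Hypothetical association scores, for illustration only
A = {("interrupt", "PEOPLE"): 2.1, ("interrupt", "FURNITURE"): 0.2}
classes_of = {"chair": ["FURNITURE", "PEOPLE"]}

def noun_association(verb, noun):
    """A(v,n) = the maximum of A(v,c) over the classes c that noun n belongs to."""
    return max(A[(verb, c)] for c in classes_of[noun])

print(noun_association("interrupt", "chair"))  # 2.1: the PEOPLE reading wins
```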
Slide 24
Some results from Resnik 1996

Verb (v)  Noun (n)  Class (c)       A(v,n)
answer    request   speech act      4.49
answer    tragedy   communication   3.88
hear      story     communication   1.89
hear      issue     communication   1.89

There are some fairly atypical examples; these are due to the disambiguation method.
e.g. tragedy can be in the COMMUNICATION class, and so is assigned A(answer,COMMUNICATION) as its A(v,n)
Slide 25
Overall evaluation
Resnik's results were shown to correlate very well with results from a psycholinguistic study.
The method is promising:
seems to mirror human intuitions
may have some psychological validity
Possibly an alternative, data-driven account of the semantic bootstrapping hypothesis of Pinker (1989)?