Optimizing Local Probability Models for Statistical Parsing
Kristina Toutanova, Mark Mitchell, Christopher Manning
Computer Science Department, Stanford University
Highlights
Choosing a local probability model P(expansion(n) | history(n)) for statistical parsing: a comparison of commonly used models
A new player: memory-based models and their relation to interpolated models
Joint likelihood, conditional likelihood, and classification accuracy for models of this form
Motivation
Many problems in natural language processing are disambiguation problems, for example:
Word senses: jaguar (a big cat, a car, the name of a Java package); line (phone line, queue, a line in mathematics, an airline, etc.)
Part-of-speech tags (noun, verb, proper noun, etc.): in "Joy makes progress every day.", several tags are possible for each word (e.g. NNP or NN for "Joy", VBZ or NNS for "makes", NN for "progress")
Parsing as Classification
Input: a sentence, e.g. "I would like to meet with you again on Monday"
Classify to one of the possible parses
Motivation – Classification Problems
There are two major differences from typical ML domains:
The number of classes can be very large or even infinite; the set of available classes for an input varies (and depends on a grammar)
Data is usually very sparse and the number of possible features is large (e.g. words)
Solutions
The possible parse trees are broken down into small pieces defining features; features are now functions of input and class, not input only
Discriminative or generative models are built using these features
We concentrate on generative models here; when a huge number of analyses are possible, they are the only practical ones
History-Based Generative Parsing Models
Example: "Tuesday Marks bought Brooks"
[Figure: parse tree with nodes TOP, S, NP (NNP "Tuesday"), NP-C, VP]
The generative models learn a distribution P(S, T) over <sentence, parse tree> pairs:

P(S, T) = \prod_{n \in nodes(T)} P(expansion(n) | history(n))

and select the single most likely parse:

T_{best} = \arg\max_{T : yield(T) = S} P(S, T)
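The factorization above turns parsing into a product of local decisions. A minimal sketch (not the authors' implementation): the tree and node interfaces are hypothetical, and local_prob stands in for any of the local models compared in this talk.

```python
import math

def tree_log_prob(tree, local_prob):
    """log P(S, T) = sum over nodes n of log P(expansion(n) | history(n)).

    `tree.nodes()` yielding objects with .expansion and .history is a
    hypothetical interface; `local_prob` is any local model P(e | h).
    """
    return sum(math.log(local_prob(n.expansion, n.history))
               for n in tree.nodes())

def best_parse(candidates, local_prob):
    """T_best = argmax over candidate trees (all with yield S) of P(S, T)."""
    return max(candidates, key=lambda t: tree_log_prob(t, local_prob))
```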
Factors in the Performance of Generative History-Based Models
The chosen decomposition of parse tree generation, including the representation of parse tree nodes and the independence assumptions
The model family chosen for representing local probability distributions: decision trees, Naïve Bayes, log-linear models
The optimization method for fitting major and smoothing parameters: maximum likelihood, maximum conditional likelihood, minimum error rate, etc.
Previous Studies and This Work
The influence of these three factors has not been isolated in previous work: authors presented specific choices for all components, and the importance of each was unclear.
We assume the generative history-based model and the set of features (the representation of parse tree nodes) are fixed, and carefully study the other two factors.
Deleted Interpolation
Notation: X = (x_1, ..., x_n) is the feature vector; for S = {i_1, ..., i_k} \subseteq {1, ..., n}, X_S = (x_{i_1}, ..., x_{i_k}) is the corresponding feature subset.
Estimate the probability P(y | X) by interpolating relative frequency estimates for lower-order distributions:

\tilde{P}(y | X) = \sum_{S \subseteq \{1, ..., n\}} \lambda_S(X) \hat{P}(y | X_S)

Most commonly used: a linear feature-subsets order,

\tilde{P}(y | x_1, ..., x_n) = \lambda_n(X) \hat{P}(y | x_1, ..., x_n) + \lambda_{n-1}(X) \hat{P}(y | x_1, ..., x_{n-1}) + ... + \lambda_0 \hat{P}(y)

Instances: Jelinek-Mercer with fixed weight, Witten-Bell with varying d, decision trees with path interpolation, memory-based learning.
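A minimal sketch of the linear case (illustrative, not the paper's code; the nested-dictionary count structure and fixed interpolation weights are assumptions):

```python
from collections import Counter

def train_relative_freqs(samples, n):
    """counts[i] maps a context prefix (x_1, ..., x_i) to a Counter over y."""
    counts = [dict() for _ in range(n + 1)]
    for x, y in samples:                     # x is a tuple of n feature values
        for i in range(n + 1):
            counts[i].setdefault(x[:i], Counter())[y] += 1
    return counts

def interpolated_prob(y, x, counts, lambdas):
    """P~(y|x) = sum_i lambda_i * Phat(y | x_1..x_i), lambdas summing to 1."""
    p = 0.0
    for i, lam in enumerate(lambdas):
        ctr = counts[i].get(x[:i])
        if ctr:                              # skip orders with unseen contexts
            p += lam * ctr[y] / sum(ctr.values())
    return p
```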
Memory-Based Learning as Deleted Interpolation
In k-NN, the probability of a class given features is estimated as:

\tilde{P}(y | X) = \frac{\sum_{X' \in N_K(X)} w(\Delta(X, X')) \, \delta(y, y(X'))}{\sum_{X' \in N_K(X)} w(\Delta(X, X'))}

where N_K(X) is the set of K nearest neighbors of X, w is the weighting function, \Delta is the distance, and \delta(y, y(X')) is 1 if the neighbor's class is y and 0 otherwise.
If the distance function depends only on the positions of the matching features, this is a case of deleted interpolation.
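A minimal sketch of this estimator (illustrative; the overlap distance shown is one concrete choice, and INV3 is one of the weighting functions used later in the talk):

```python
def knn_prob(y, X, samples, K, distance, weight):
    """P~(y|X): weight-normalized vote of the K nearest stored samples."""
    neighbors = sorted(samples, key=lambda s: distance(X, s[0]))[:K]
    total = sum(weight(distance(X, xp)) for xp, _ in neighbors)
    votes = sum(weight(distance(X, xp)) for xp, yp in neighbors if yp == y)
    return votes / total if total > 0 else 0.0

def overlap_distance(a, b):
    """Number of feature positions where the two instances disagree."""
    return sum(ai != bi for ai, bi in zip(a, b))

def inv3(d):
    """INV3 weighting: w(d) = (1 / (1 + d))^3."""
    return (1.0 / (1.0 + d)) ** 3
```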
Memory-Based Learning as Deleted Interpolation
Example: estimate P(eye-color=blue | hair-color=blond). We have N = 12 samples of people; distances are d = 1 (mismatch) or d = 0 (match), with weights w(1) = w_1 and w(0) = w_0, and K = 12 (all samples are neighbors).
This is deleted interpolation where the interpolation weights depend on the counts and weights of the nearest neighbors at all accepted distances:
\tilde{P}(blue | blond) = \frac{w_0 \, c(blue, blond) + w_1 \, [c(blue, \cdot) - c(blue, blond)]}{w_0 \, c(\cdot, blond) + w_1 \, [c(\cdot, \cdot) - c(\cdot, blond)]}

Collecting terms (the numerator is (w_0 - w_1) c(blue, blond) + w_1 c(blue, \cdot), the denominator is (w_0 - w_1) c(\cdot, blond) + w_1 c(\cdot, \cdot)):

\tilde{P}(blue | blond) = \frac{(w_0 - w_1) \, c(\cdot, blond)}{(w_0 - w_1) \, c(\cdot, blond) + w_1 \, c(\cdot, \cdot)} \hat{P}(blue | blond) + \frac{w_1 \, c(\cdot, \cdot)}{(w_0 - w_1) \, c(\cdot, blond) + w_1 \, c(\cdot, \cdot)} \hat{P}(blue)

Here c(blue, blond) is the joint count, c(blue, \cdot) and c(\cdot, blond) are marginal counts, and c(\cdot, \cdot) = N.
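A quick numeric check of this identity (a sketch with made-up counts, not data from the paper):

```python
# Made-up counts for the eye-color/hair-color example (N = 12 samples).
w0, w1 = 1.0, 0.25           # hypothetical weights, with w(0) > w(1)
N = 12
c_blond = 5                  # c(., blond)
c_blue = 4                   # c(blue, .)
c_blue_blond = 3             # c(blue, blond)

# k-NN form: weighted votes over all K = N samples.
knn = (w0 * c_blue_blond + w1 * (c_blue - c_blue_blond)) / \
      (w0 * c_blond + w1 * (N - c_blond))

# Deleted-interpolation form with a count-dependent weight lambda.
lam = (w0 - w1) * c_blond / ((w0 - w1) * c_blond + w1 * N)
interp = lam * (c_blue_blond / c_blond) + (1 - lam) * (c_blue / N)

assert abs(knn - interp) < 1e-12   # both equal 13/27 ~ 0.4815
```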
The Task and Features Used
No  Name                          Example
1   Node label                    HCOMP
2   Parent node label             HCOMP
3   Node direction                left
4   Parent node direction         none
5   Grandparent node label        IMPER
6   Great-grandparent node label  TOP
7   Left sister node label        none
8   Category of node              verb

Corpus: 5,312 sentences; average sentence length 7.0; average structural ambiguity 8.3; random-choice baseline accuracy 25.81%
Maximum ambiguity: 507, minimum: 2
[Figure: derivation tree for "let us see": TOP dominates IMPER, which dominates HCOMP (LET_V1 "let", HCOMP (US "us", SEE_V3 "see"))]
Experiments
Linear feature-subsets order: Jelinek-Mercer with fixed weight; Witten-Bell with varying d; linear memory-based learning
Arbitrary feature-subsets order: decision trees; memory-based learning; log-linear models
Experiments on the connection among likelihoods and accuracy
Experiments – Linear Sequence
The features {1, 2, ..., 8}, ordered by gain ratio: {1, 8, 2, 3, 5, 4, 7, 6}

Jelinek-Mercer with fixed weight:

\tilde{P}_{JM}(y | x_1, ..., x_i) = \lambda \hat{P}(y | x_1, ..., x_i) + (1 - \lambda) \tilde{P}_{JM}(y | x_1, ..., x_{i-1})

Witten-Bell with varying d:

\lambda(x_1, ..., x_i) = \frac{c(x_1, ..., x_i)}{c(x_1, ..., x_i) + d \, |\{y : c(y, x_1, ..., x_i) > 0\}|}

\tilde{P}_{WBd}(y | x_1, ..., x_i) = \lambda(x_1, ..., x_i) \hat{P}(y | x_1, ..., x_i) + (1 - \lambda(x_1, ..., x_i)) \tilde{P}_{WBd}(y | x_1, ..., x_{i-1})
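A minimal sketch of the two recursions (illustrative; `counts` is the prefix-count structure from the deleted-interpolation sketch above):

```python
def jm_fixed_weight(y, x, counts, i, lam):
    """Jelinek-Mercer with fixed weight lam, recursing down the feature order."""
    if i == 0:
        ctr = counts[0][()]
        return ctr[y] / sum(ctr.values())
    ctr = counts[i].get(x[:i])
    rel_freq = ctr[y] / sum(ctr.values()) if ctr else 0.0
    return lam * rel_freq + (1 - lam) * jm_fixed_weight(y, x, counts, i - 1, lam)

def witten_bell_d(y, x, counts, i, d):
    """Witten-Bell with multiplier d: sparse contexts get heavier smoothing."""
    if i == 0:
        ctr = counts[0][()]
        return ctr[y] / sum(ctr.values())
    ctr = counts[i].get(x[:i])
    if ctr is None:
        return witten_bell_d(y, x, counts, i - 1, d)
    total, distinct = sum(ctr.values()), len(ctr)
    lam = total / (total + d * distinct)
    return lam * ctr[y] / total + (1 - lam) * witten_bell_d(y, x, counts, i - 1, d)
```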
[Figure: Jelinek-Mercer fixed weight: accuracy (73%–81%) as a function of the interpolation weight (0.0–0.9)]
Experiments – Linear Sequence
Heavy smoothing is needed for best results
[Figure: Witten-Bell varying d: accuracy (73%–81%) as a function of d (0–30)]
MBL Linear Subsets Sequence
Restrict MBL to be an instance of the same linear-subsets-sequence deleted interpolation, as follows: the distance depends only on the length of the longest matching feature prefix,

sim(X, X') = \max \{ i : x_1 = x'_1, ..., x_i = x'_i \}, \quad \Delta(X, X') = n - sim(X, X')

Weighting functions INV3 and INV4 performed best:

INV3(\Delta) = (1 / (1 + \Delta))^3
INV4(\Delta) = (1 / (1 + \Delta))^4
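A sketch of this restricted distance (assuming features are already arranged in the gain-ratio order, and that the distance counts the features outside the matching prefix):

```python
def prefix_sim(a, b):
    """Length of the longest matching feature prefix of two instances."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return i

def linear_distance(a, b):
    """Delta(X, X') = n - sim(X, X') under the linear-subsets restriction."""
    return len(a) - prefix_sim(a, b)
```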
[Figure: linear k-NN: accuracy (73%–81%) as a function of K (0–15,000), for weighting functions INV3 and INV4]
LKNN3 is best at K = 3,000: 79.94%
LKNN4 is best at K = 15,000: 80.18%
LKNN4 is the best of all
Experiments
Linear feature-subsets order: Jelinek-Mercer with fixed weight; Witten-Bell with varying d; linear memory-based learning
Arbitrary feature-subsets order: decision trees; memory-based learning; log-linear models
Experiments on the connection among likelihoods and accuracy
Model Implementations – Decision Trees
(DecTreeWBd) n-ary decision trees: if we choose a feature f to split on, all its values form subtrees
Splitting criterion: gain ratio
The final probability estimates at the leaves are Witten-Bell d interpolations of the estimates on the path to the root
[Figure: decision tree splitting first on feature 1 (node label) and then on feature 2 (parent node label); the leaf reached by x_1 = hcomp, x_2 = noptcomp holds the estimate \tilde{P}_{WBd}(exp | x_1 = hcomp, x_2 = noptcomp)]
Decision trees are thus instances of deleted interpolation models!
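Since each leaf interpolates the relative-frequency estimates along its path to the root, the leaf estimate can reuse the Witten-Bell recursion. A sketch, where path_counts (a hypothetical structure) holds the Counter of outcomes at each node from root to leaf:

```python
def leaf_prob(y, path_counts, d):
    """Witten-Bell d interpolation of the estimates on a root-to-leaf path."""
    p = None
    for ctr in path_counts:                  # root first, leaf last
        total, distinct = sum(ctr.values()), len(ctr)
        lam = total / (total + d * distinct)
        rel_freq = ctr[y] / total
        p = rel_freq if p is None else lam * rel_freq + (1 - lam) * p
    return p
```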
Model Implementations – Log-linear Models
Binary features formed by instantiating templates
Three models with different allowable features:
Single attributes only: LogLinSingle
Pairs of attributes, only pairs involving the most important feature (node label): LogLinPairs
Linear feature subsets, comparable to the previous models: LogLinBackoff
Gaussian smoothing was used; trained by conjugate gradient (Stanford Classify package)
Model Implementations –Memory-Based Learning
Weighting functions INV3 and INV4
The distance function depends only on the set of matching feature indices: if {i_1, ..., i_k} is the set of feature indices on which X and X' match, then \Delta(X, X') = \Delta(\{i_1, ..., i_k\})
KNN4 is better than DecTreeWBd and the log-linear models
KNN4 gives a 5.8% error reduction from WBd (significant at the 0.01 level)
Model          Accuracy
KNN4           80.79%
DecTreeWBd     79.66%
LogLinSingle   78.65%
LogLinPairs    78.91%
LogLinBackoff  77.52%
Accuracy Curves for MBL and Decision Trees
[Figure: decision tree accuracy (74%–81%) as a function of d (0–50)]
[Figure: k-NN accuracy (74%–81%) as a function of K (0–15,000), for inverse weights 3 and 4]
Experiments
Linear feature-subsets order: Jelinek-Mercer with fixed weight; Witten-Bell with varying d; linear memory-based learning
Arbitrary feature-subsets order: decision trees; memory-based learning; log-linear models
Experiments on the connection among likelihoods and accuracy
Joint Likelihood, Conditional Likelihood, and Classification Accuracy
Our aim is to maximize parsing accuracy, but:
Smoothing parameters are usually fit on held-out data to maximize joint likelihood
Sometimes conditional likelihood is optimized instead
We look at the relationship among the maxima of these three scoring functions, depending on the amount of smoothing, and find that:
Much heavier smoothing is needed to maximize accuracy than to maximize joint likelihood
Conditional likelihood also keeps increasing with smoothing, long after joint likelihood has peaked
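The comparison can be made concrete with a small sketch (illustrative; model_prob is any smoothed estimator, such as jm_fixed_weight above, and classes(x) enumerating the candidate classes for input x is an assumed helper):

```python
import math

def heldout_scores(heldout, classes, model_prob, lam):
    """Joint log-likelihood and accuracy of a smoothed model on held-out data."""
    joint_ll, correct = 0.0, 0
    for x, y in heldout:
        joint_ll += math.log(model_prob(y, x, lam))
        pred = max(classes(x), key=lambda c: model_prob(c, x, lam))
        correct += (pred == y)
    return joint_ll, correct / len(heldout)

# Fitting the smoothing parameter for accuracy rather than likelihood:
# best_lam = max(grid, key=lambda v: heldout_scores(heldout, classes, model_prob, v)[1])
```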
Test Set Performance versus Amount of Smoothing - I
[Figure: JMFW: joint log-likelihood, conditional log-likelihood (−7.0 to 0.0), and accuracy (73%–81%) as functions of the interpolation weight (0.0–0.9)]
Test Set Performance versus Amount of Smoothing - II
[Figure: k-NN INV4: joint log-likelihood, conditional log-likelihood (−3.5 to 0.0), and accuracy (74%–81%) as functions of K (0–15,000)]
Test Set Performance versus Amount of Smoothing – PP Attachment
[Figure, left panel: Jelinek-Mercer fixed weight: joint log-likelihood, conditional log-likelihood (−1.0 to 0.0), and accuracy (74%–85%) as functions of the interpolation weight (0.0–1.0)]
[Figure, right panel: Witten-Bell varying d: the same quantities as functions of d (0–30)]
Summary
The problem of effectively estimating local probability distributions for compound decision models used for classification is under-explored
We showed that the chosen local distribution model matters
We showed the relationship between MBL and deleted interpolation models
MBL with a large number of neighbors and appropriate weighting outperformed more expensive and popular algorithms: decision trees and log-linear models
Fitting a small number of smoothing parameters to maximize classification accuracy is promising for improving performance
Future Work
Compare MBL to other state-of-the-art smoothing methods
Better ways of fitting MBL weight functions
Theoretical investigation of bias-variance tradeoffs for compound decision systems with strong independence assumptions