INF5830 โ 2015 FALL NATURAL LANGUAGE PROCESSING
Jan Tore Lønning, Lecture 14, 16.11
1
Today
- More on collocations
  - Repeat some statistics
  - T-test
  - Other tests
  - Differences
- Feature selection
- Multinomial logistic regression (MaxEnt)
- Comparing and combining classifiers
2
Collocations
Does a sample belong to a population with mean µ?
4
1. General case, example: height, sentence length
   1. Known standard deviation σ:
      calculate the z-score z = (x̄ − µ) / √(σ²/n) and the corresponding p-value.
   2. Unknown standard deviation: approximate it by the sample standard deviation s;
      calculate the t-score t = (x̄ − µ) / √(s²/n) and use the t(n−1) density curve.
2. Proportion: n items, k successes, p̂ = k/n, p = µ (the expected proportion), σ² = p(1 − p);
   for large n: z-score z = (p̂ − p) / √(p(1 − p)/n) and the corresponding p-value.
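The three test statistics above can be written directly in Python. A minimal sketch; the example figures at the bottom are invented for illustration:

```python
import math

def z_score(xbar, mu, sigma, n):
    # Case 1.1, known population standard deviation: z = (x̄ − µ) / √(σ²/n)
    return (xbar - mu) / math.sqrt(sigma ** 2 / n)

def t_score(xbar, mu, s, n):
    # Case 1.2, unknown standard deviation, approximated by the sample
    # standard deviation s; compare against the t(n−1) density curve
    return (xbar - mu) / math.sqrt(s ** 2 / n)

def z_score_proportion(p_hat, p, n):
    # Case 2, proportions: σ² = p(1 − p); valid for large n
    return (p_hat - p) / math.sqrt(p * (1 - p) / n)

# E.g. 100 sentences with mean length 23, tested against
# a population mean of 20 with known σ = 10:
print(z_score(23.0, 20.0, 10.0, 100))   # 3.0
```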
"T-test" for collocations

How likely/unlikely is it that w1 and w2 occur together?
Null hypothesis, H0: P(w1=new, w2=companies) = P(w1=new) × P(w2=companies)
Alternative hypothesis, Ha: P(w1=new, w2=companies) > P(w1=new) × P(w2=companies)
The expectation, µ = p = P(w1=new) × P(w2=companies) = 3.615 × 10⁻⁷
p̂ = P(w1=new, w2=companies) = 5.591 × 10⁻⁷
N = 14307668
Manning and Schütze use the t-score: t = (p̂ − µ) / √(s²/N)
Since this is a proportion, it is more correct to use z = (p̂ − p) / √(p(1 − p)/N)
With the z-distribution and σ² = p(1 − p):

z = (p̂ − p) / √(p(1 − p)/N) = (5.591 × 10⁻⁷ − 3.615 × 10⁻⁷) / √(3.615 × 10⁻⁷ / 14307668) ≈ 1.243
(Better score than the book – but I still think it can be defended!)
z = (p̂ − p) / √(p(1 − p)/N)
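Plugging the slide's figures into both formulas is a quick sanity check (the probabilities are the Manning & Schütze "new companies" figures quoted above):

```python
import math

# Figures from the slide ("new companies" in a 14,307,668-bigram corpus)
p     = 3.615e-7        # expected under H0: P(new) × P(companies)
p_hat = 5.591e-7        # observed: P(new companies)
N     = 14307668

# Manning & Schütze's t-score, using s² ≈ p̂ (since p̂(1 − p̂) ≈ p̂ for tiny p̂)
t = (p_hat - p) / math.sqrt(p_hat / N)

# z-score for a proportion, with σ² = p(1 − p) under H0
z = (p_hat - p) / math.sqrt(p * (1 - p) / N)

print(round(t, 2), round(z, 2))   # 1.0 1.24
```

Neither score exceeds the 1.645 one-tailed critical value at α = 0.05, so H0 cannot be rejected for this pair.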
Today
- More on collocations
  - Repeat some statistics
  - T-test
  - Other tests
  - Differences
- Feature selection
- Multinomial logistic regression (MaxEnt)
- Comparing and combining classifiers
7
Collocations tests 8
1. T-test
2. Absolute frequencies
3. Dice score: 2 × C(w1 w2) / (C(w1) + C(w2))
4. Pointwise mutual information: log( P(w1 w2) / (P(w1) × P(w2)) )
5. χ² test
6. (Mutual information)
Pointwise mutual information
In the t-test we compared P(new companies) with P(new) × P(companies)
Idea: compare these directly: P(new companies) / (P(new) × P(companies))
log( P(new companies) / (P(new) × P(companies)) ) = pointwise mutual information
Observe: log does not change the ranking
log has a theoretical motivation
Problems with pmi
If w1 and w2 only occur together, the score increases when the frequency of w2 decreases:
P(w1 w2) = P(w1) = P(w2), so P(w1 w2) / (P(w1) × P(w2)) = 1/P(w2)
Bad measure of dependence
Good measure of independence
All measures are unreliable for small samples
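The 1/P(w2) problem can be demonstrated in a few lines (made-up counts):

```python
import math

# A pair that only ever occurs together: P(w1 w2) = P(w1) = P(w2) = c/n,
# so pmi = log2( (c/n) / (c/n)² ) = log2(n/c): it grows as the pair gets rarer
def pmi_always_together(c, n):
    return math.log2((c / n) / ((c / n) * (c / n)))

n = 1_000_000
print(pmi_always_together(100, n))   # ≈ 13.3
print(pmi_always_together(2, n))     # ≈ 18.9 — rarer, yet scored higher
```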
χ² test

O11 = 8
E11 = P(new) × P(companies) × N = (C(new)/N) × (C(companies)/N) × N = (O11 + O12) × (O11 + O21) / N, etc.
H0 : no association between row and column variable:
If true, the X² statistic has approximately a χ² distribution with (r-1)(c-1) degrees of freedom (d.f.)
In practice
Today
- More on collocations
  - Repeat some statistics
  - T-test
  - Other tests
  - Differences
- Feature selection
- Multinomial logistic regression (MaxEnt)
- Comparing and combining classifiers
13
Comparing two samples 14
X = {x1, …, xn}, Y = {y1, …, ym}: «do they belong to the same population?»

1. General case, example: height, sentence length:
   t = (x̄ − ȳ) / √(s_x²/n + s_y²/m), use the t(k−1) density curve (k = min(n, m))
2. Proportions: two samples with n1 and n2 items, p̂_i = k_i/n_i, s² = p̂(1 − p̂):
   z = (p̂1 − p̂2) / √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 ), use the z-density curve
Applied to collocations
Assumption: 1 − p is so close to 1 that p is a good approximation to s² = p(1 − p)
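The two-sample statistics can be sketched the same way (a minimal sketch; the example values are invented):

```python
import math

def two_sample_t(xbar, ybar, s2x, s2y, n, m):
    # t = (x̄ − ȳ) / √(s_x²/n + s_y²/m), compared against t(k−1), k = min(n, m)
    return (xbar - ybar) / math.sqrt(s2x / n + s2y / m)

def two_proportion_z(p1, n1, p2, n2):
    # z = (p̂1 − p̂2) / √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 )
    return (p1 - p2) / math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# E.g. mean sentence lengths 22 vs 20, variance 25, two samples of 100:
print(two_sample_t(22.0, 20.0, 25.0, 25.0, 100, 100))   # ≈ 2.83
```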
What is a collocation
Non-compositional. Ex.: white wine; N-N compounds in English, written as one word in German/Norwegian
Non-substitutability: red wine, but not *burgundy wine
Frozen
Or only frequent co-occurrence?
IR 13.5
Feature selection 18
Today
- More on collocations
  - Repeat some statistics
  - T-test
  - Other tests
  - Differences
- Feature selection
- Multinomial logistic regression (MaxEnt)
- Comparing and combining classifiers
19
Feature selection 20
To include all words as features is too costly. Select features which:
- separate between the classes
- can be expected to occur with a reasonable frequency

To select, use similar measures as for collocations, e.g.:
- raw frequency
- chi-square
- mutual information
- …
Apply chi square 21
Assume a binary classifier: only two classes {yes, no} and binary features
For every feature calculate the chi square score between the two values of the feature and the two classes
Select the words with the highest score as features
Chi square 22
                          Sense: musical instrument
                          Yes      No      Sums
"guitar" in context  Yes   25       5        30
                     No   275     695       970
Sums                      300     700      1000
Chi square 23
Observations:
                          Is in class s
                          Yes      No      Sums
w in context         Yes  O11      O10     O1x
                     No   O01      O00     O0x
Sums                      Ox1      Ox0     N

Expectations:
                          Is in class s
                          Yes              No               Sums
w in context         Yes  E11 = O1x×Ox1/N  E10 = O1x×Ox0/N  O1x
                     No   E01 = O0x×Ox1/N  E00 = O0x×Ox0/N  O0x
Sums                      Ox1              Ox0              N
Chi square 24
Observations:
                          Sense: musical instrument
                          Yes      No      Sums
"guitar" in context  Yes   25       5        30
                     No   275     695       970
Sums                      300     700      1000

Expectations:
                          Yes               No                Sums
"guitar" in context  Yes  9 = 30×300/1000   21 = 30×700/1000    30
                     No   291               679                970
Sums                      300               700               1000
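The X² value for this table is easy to verify: compute the expectations E_ij = O_ix × O_xj / N and sum (O − E)²/E over the four cells:

```python
# 2×2 "guitar" / musical-instrument table from the slide: (w, class) -> count
obs = {(1, 1): 25, (1, 0): 5, (0, 1): 275, (0, 0): 695}
n = sum(obs.values())                                    # 1000

row = {i: obs[(i, 1)] + obs[(i, 0)] for i in (0, 1)}     # O_ix: 30 and 970
col = {j: obs[(1, j)] + obs[(0, j)] for j in (0, 1)}     # O_xj: 300 and 700

exp = {(i, j): row[i] * col[j] / n for (i, j) in obs}    # E11 = 9, E10 = 21, ...
x2 = sum((obs[ij] - exp[ij]) ** 2 / exp[ij] for ij in obs)
print(round(x2, 1))   # 41.9 — far above the 3.84 critical value for 1 d.f.
```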
Line, binary, "product", BoW-only 25

Number of word features   Most frequent   Chi-square
0 0.528 0.528
10 0.632 0.786
20 0.724 0.826
50 0.816 0.846
100 0.844 0.862
200 0.864 0.878
500 0.864 0.898
1000 0.892 0.906
2000 0.912 0.912
5000 0.918 0.916
Feature selection, contd. 26
Similarly to collocations, we may use other association measures, e.g.:
- pointwise mutual information
- mutual information

A difference between the measures is how they trade off discrimination against frequency.
Pointwise mutual information 27
pmi = log( O11 / E11 )
Mutual information 28
I(W;C) = Σ_{i=0,1} Σ_{j=0,1} P̂(W=i, C=j) × log( P̂(W=i, C=j) / (P̂(W=i) × P̂(C=j)) )
       = Σ_{i=0,1; j=0,1} (Oij/N) × log( N × Oij / (Oix × Oxj) )
       = Σ_{i=0,1; j=0,1} (Oij/N) × log( Oij / Eij )
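Applied to the guitar table from the chi-square slides, I(W;C) comes out small; a quick check of the formula above:

```python
import math

# Same 2×2 observation table: (w, class) -> count
obs = {(1, 1): 25, (1, 0): 5, (0, 1): 275, (0, 0): 695}
n = sum(obs.values())
row = {i: obs[(i, 1)] + obs[(i, 0)] for i in (0, 1)}   # O_ix
col = {j: obs[(1, j)] + obs[(0, j)] for j in (0, 1)}   # O_xj

# I(W;C) = Σ_ij (O_ij/N) · log2( N·O_ij / (O_ix · O_xj) )
mi = sum(o / n * math.log2(n * o / (row[i] * col[j]))
         for (i, j), o in obs.items())
print(round(mi, 4))   # 0.0274 bits
```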
Multinomial logistic regression
Today
- More on collocations
  - Repeat some statistics
  - T-test
  - Other tests
  - Differences
- Feature selection
- Multinomial logistic regression (MaxEnt)
- Comparing and combining classifiers
30
A slight reformulation
We saw that for NB:

P(c1|f) > P(c2|f)

iff

P(c1) × Π_{j=1..n} P(fj|c1) > P(c2) × Π_{j=1..n} P(fj|c2)

iff

log P(c1) + Σ_{j=1..n} log P(fj|c1) > log P(c2) + Σ_{j=1..n} log P(fj|c2)

This could also be written:

( log P(c1) − log P(c2) ) + Σ_{j=1..n} ( log P(fj|c1) − log P(fj|c2) ) > 0

i.e.

log( P(c1)/P(c2) ) + Σ_{j=1..n} log( P(fj|c1)/P(fj|c2) ) > 0
31
Reformulation, contd. 32
has the form w1 · f > w2 · f, i.e.

Σ_{i=0..M} w1_i x_i > Σ_{i=0..M} w2_i x_i

where

w1_j = log( P(fj|c1) ),   w2_j = log( P(fj|c2) )

and our earlier w_j = w1_j − w2_j.

(The probability in this notation:

P(c1|f) = e^(w1·f) / ( e^(w1·f) + e^(w2·f) ) = e^((w1−w2)·f) / ( e^((w1−w2)·f) + 1 ) = 1 / ( 1 + e^(−w·f) )

and similarly for P(c2|f).)
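The chain of equalities in the parenthesis can be checked numerically; the weight and feature values below are arbitrary made-up numbers, with x0 = 1 as the bias slot:

```python
import math

w1 = [0.2, -1.3, 0.7]                 # w1_j = log P(f_j|c1) in the NB reading
w2 = [-0.5, 0.4, 0.1]                 # w2_j = log P(f_j|c2)
f  = [1.0, 1.0, 0.0]                  # feature vector, f[0] = 1 (bias)

dot = lambda w, x: sum(a * b for a, b in zip(w, x))

# e^(w1·f) / (e^(w1·f) + e^(w2·f))  versus  1 / (1 + e^(−w·f)) with w = w1 − w2
p_two  = math.exp(dot(w1, f)) / (math.exp(dot(w1, f)) + math.exp(dot(w2, f)))
w      = [a - b for a, b in zip(w1, w2)]
p_diff = 1 / (1 + math.exp(-dot(w, f)))
print(abs(p_two - p_diff) < 1e-12)    # True
```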
Multinomial logistic regression
We may generalize this to more than two classes. For each class cj, j = 1,…,k, a linear expression

wj · f = Σ_{i=0..M} wj_i x_i

and the probability of belonging to class cj:

P(cj|f) = (1/Z) exp( wj · f ) = (1/Z) e^(Σ_i wj_i f_i) = (1/Z) Π_i e^(wj_i f_i) = (1/Z) Π_i a_ji^(f_i)

where

a_ji = e^(wj_i)   and   Z = Σ_{j=1..k} exp( wj · f )

Naive Bayes (Bernoulli), binary NB as linear classifier → Logistic regression, multinomial logistic regression
33
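P(cj|f) with the normalizer Z is exactly a softmax over the per-class linear scores. A minimal sketch with made-up weights, again with f[0] = 1 as bias:

```python
import math

def softmax_probs(weights, f):
    """P(c_j|f) = exp(w_j·f) / Σ_l exp(w_l·f), one weight vector per class."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, f)) for w in weights]
    m = max(scores)                    # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

weights = [[0.1, 1.0, -0.5],           # class 1
           [0.0, -0.3, 0.8],           # class 2
           [0.2, 0.1, 0.1]]            # class 3
probs = softmax_probs(weights, [1.0, 1.0, 1.0])
print(round(sum(probs), 6))            # 1.0
```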
Footnote: Alternative formulation
(In case you read other presentations, like Mitchell or Hastie et al.: they use a slightly different formulation, corresponding to

Z = 1 + Σ_{i=1..k−1} exp( wi · f )

where for i = 1, 2,…, k−1:

P(ci|f) = (1/Z) exp( wi · f ) = exp( wi · f ) / ( 1 + Σ_{j=1..k−1} exp( wj · f ) )

but

P(ck|f) = 1/Z = 1 / ( 1 + Σ_{i=1..k−1} exp( wi · f ) )

The two formulations are equivalent, though: in the J&M formulation, divide the numerator and the denominator in each P(ci|f) by exp( wk · f ), and you get this formulation (with adjustments to Z and w).)
34
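The claimed equivalence is easy to verify numerically: dividing the J&M numerator and denominator by exp(wk·f) amounts to subtracting the last class's weights (arbitrary made-up numbers below):

```python
import math

def softmax(ws, f):
    # J&M formulation: P(ci|f) = exp(wi·f) / Σ_j exp(wj·f)
    dots = [sum(a * b for a, b in zip(w, f)) for w in ws]
    z = sum(math.exp(d) for d in dots)
    return [math.exp(d) / z for d in dots]

def reference_class_form(ws, f):
    # Mitchell/Hastie formulation: subtract the last class's weights, then
    # P(ci|f) = exp(vi·f)/(1 + Σ exp(vj·f)) for i < k, and P(ck|f) = 1/(1 + Σ ...)
    vk = ws[-1]
    vs = [[a - b for a, b in zip(w, vk)] for w in ws[:-1]]
    exps = [math.exp(sum(a * b for a, b in zip(v, f))) for v in vs]
    z = 1 + sum(exps)
    return [e / z for e in exps] + [1 / z]

ws = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.7]]
f = [1.0, 2.0]
pa, pb = softmax(ws, f), reference_class_form(ws, f)
print(all(abs(a - b) < 1e-12 for a, b in zip(pa, pb)))   # True
```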
Indicator variables
Already seen: categorical variables represented by indicator variables, taking the values 0,1
Also usual to let the variables indicate both observation and class
P(cj|f) = (1/Z) exp( wj · f )
        = exp( wj · f ) / Σ_{l=1..k} exp( wl · f )
        = exp( Σ_{i=0..m} wji fi ) / Σ_{l=1..k} exp( Σ_{i=0..m} wli fi )
        = exp( Σ_i wi fi(cj, x) ) / Σ_{l=1..k} exp( Σ_i wi fi(cl, x) )

with joint features fi(c, x) that indicate a combination of class and observation.
35
Examples – J&M 36
Why called "maximum entropy"?
See NLTK book for a further example
37
P(NN)+P(JJ)+P(NNS)+P(VB)=1
P(NN)+P(NNS)=0.8
P(VB)=1/20
Why called "maximum entropy"?
Multinomial logistic regression yields the probability distribution which gives the maximum entropy, given our training data.
38
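For the POS example above, the constraints force P(JJ) = 0.15 and leave only the NN/NNS split free; a small grid search confirms that the even split maximizes the entropy. This is a sketch of the principle, not of how MaxEnt training actually searches:

```python
import math

def entropy(ps):
    # H(p) = −Σ p·log2(p)
    return -sum(p * math.log2(p) for p in ps if p > 0)

# Constraints: P(NN)+P(JJ)+P(NNS)+P(VB) = 1, P(NN)+P(NNS) = 0.8, P(VB) = 1/20
# ⇒ P(JJ) = 0.15, and the only freedom is a = P(NN) in (0, 0.8)
best_h, best_a = max(
    (entropy([a, 0.15, 0.8 - a, 0.05]), a)
    for a in (i / 1000 for i in range(1, 800))
)
print(best_a)   # 0.4 — P(NN) = P(NNS) = 0.4 maximizes the entropy
```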
Line – Most frequent BoW-features 39

Number of word features   NaiveBayes   SklearnClassifier(LogisticRegression())
0 0.528 0.528
10 0.528 0.528
20 0.534 0.546
50 0.576 0.624
100 0.688 0.732
200 0.706 0.752
500 0.744 0.804
1000 0.774 0.838
2000 0.802 0.846
5000 0.826 0.850
Line, binary, "product", BoW-only 40

Number of word features   Bernoulli (most frequent)   Bernoulli (chi-square)   SKLearn-LogReg (most frequent)   SKLearn-LogReg (chi-square)
0 0.528 0.528 0.528 0.528
10 0.632 0.786 0.636 0.786
20 0.724 0.826 0.738 0.830
50 0.816 0.846 0.810 0.866
100 0.844 0.862 0.858 0.898
200 0.864 0.878 0.888 0.906
500 0.864 0.898 0.902 0.924
1000 0.892 0.906 0.912 0.928
2000 0.912 0.912 0.924 0.928
5000 0.918 0.916 0.922 0.928
Comparing and combining classifiers
Today
- More on collocations
  - Repeat some statistics
  - T-test
  - Other tests
  - Differences
- Feature selection
- Multinomial logistic regression (MaxEnt)
- Comparing and combining classifiers
42
Maxent vs Naive Bayes
If the Naive Bayes assumption is warranted, i.e. the features are independent, the two yield the same result in the limit.
Otherwise, Maxent copes better with dependencies between features.
With Maxent you may throw in features and let the model decide whether they are useful.
Maxent training is slower.
43
Generative vs discriminative model
Generative (e.g. NB): models P(o,c); can compute P(c|o), argmax_c P(c|o), P(o), argmax_o P(o), P(o|c), …
Discriminative (e.g. Maxent): models P(c|o); can compute argmax_c P(c|o)

See the NLTK book
44
More than two classes (in general)
Any-of or multivalue classification: an item may belong to 1, 0 or more than 1 classes; classes are independent; use n binary classifiers. Example: documents

One-of or multinomial classification: each item belongs to one class; classes are mutually exclusive. Example: POS-tagging
45
One-of classifiers

Many classifiers are built for binary problems.
Simply combining several binary classifiers does not result in a one-of classifier.
46
Combining binary classifiers
Build a classifier for each class compared to its complement.
For a test document, evaluate it for membership in each class.
Assign the document to the class with either:
- maximum probability
- maximum score
- maximum confidence

Multinomial logistic regression is a good example.
Sometimes one postpones the decision and proceeds with the probabilities (soft classification), e.g. Maxent tagging.
47
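The one-of combination described above is just an argmax over per-class scores. A toy sketch; the class names, documents, and scoring functions here are all invented:

```python
# One-of classification from k one-vs-complement classifiers: evaluate the
# document against every class and assign it to the most confident one.
def one_of(scorers, doc):
    scores = {label: score(doc) for label, score in scorers.items()}
    return max(scores, key=scores.get)

scorers = {
    "sports":   lambda d: 0.2 + 0.6 * ("match" in d),
    "politics": lambda d: 0.3 + 0.5 * ("election" in d),
    "culture":  lambda d: 0.25,
}
print(one_of(scorers, {"election", "debate"}))   # politics
```

Keeping the whole scores dictionary instead of taking the argmax corresponds to the soft-classification variant mentioned above.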