Page 1:

INF5830 – 2015 FALL NATURAL LANGUAGE PROCESSING

Jan Tore Lønning, Lecture 14, 16.11

1

Page 2:

Today

More on collocations: repeat some statistics, T-test, other tests, differences
Feature selection
Multinomial logistic regression (MaxEnt)
Comparing and combining classifiers

2

Page 3:

Collocations

Page 4:

Does a sample belong to a population with mean µ?

4

1. General case, example: height, sentence length
   1. Known standard deviation σ: calculate the z-score z = (x̄ − µ) / √(σ²/n) and the corresponding p-value.
   2. Unknown standard deviation: approximate by the sample standard deviation s. Calculate the t-score t = (x̄ − µ) / √(s²/n) and use the t(n−1) density curve.

2. Proportion: n items, k successes, p̂ = k/n, p = µ (the expected proportion), σ² = p(1 − p).
   For large n: z-score z = (p̂ − p) / √(p(1 − p)/n) and the corresponding p-value.

Page 5:

"T-test" for collocations

How likely/unlikely are w1 and w2 to occur together?

Null hypothesis, H0: P(w1=new, w2=companies) = P(w1=new)×P(w2=companies)

Alternative hypothesis, Ha: P(w1=new, w2=companies) > P(w1=new)×P(w2=companies)

The expectation: µ = p = P(w1=new)×P(w2=companies) = 3.615×10⁻⁷

p̂ = P(w1=new, w2=companies) = 5.591×10⁻⁷

N = 14307668

Manning and Schütze use the t-score: t = (x̄ − µ) / √(s²/n)

Since this is a proportion, it is more correct to use z = (p̂ − p) / √(p(1 − p)/n)

Page 6:

With z-distribution and σ

z = (x̄ − µ) / √(σ²/n) = (5.591×10⁻⁷ − 3.615×10⁻⁷) / √(3.615×10⁻⁷ / 14307668) ≈ 1.24

(Better score than the book – but I still think it can be defended!)

z = (p̂ − p) / √(p(1 − p)/n)

Page 7:

Today

More on collocations: repeat some statistics, T-test, other tests, differences
Feature selection
Multinomial logistic regression (MaxEnt)
Comparing and combining classifiers

7

Page 8:

Collocations tests 8

1. T-test
2. Absolute frequencies
3. Dice score: 2×P(w1w2) / (P(w1) + P(w2))
4. Pointwise mutual information: log( P(w1w2) / (P(w1)×P(w2)) )
5. χ² test
6. (Mutual information)
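
As a small illustration, a Python sketch computing the Dice score and pointwise mutual information directly from raw counts; the counts in the example call are the ones behind the new/companies example in Manning & Schütze (they reproduce the p and p̂ on the earlier slides):

```python
import math

def collocation_scores(c_w1w2, c_w1, c_w2, N):
    """Dice score and pointwise mutual information from raw counts.

    c_w1w2: bigram count, c_w1/c_w2: unigram counts, N: corpus size.
    """
    p_w1w2 = c_w1w2 / N
    p_w1, p_w2 = c_w1 / N, c_w2 / N
    dice = 2 * p_w1w2 / (p_w1 + p_w2)        # equivalently 2*c_w1w2 / (c_w1 + c_w2)
    pmi = math.log(p_w1w2 / (p_w1 * p_w2))   # log of observed over expected probability
    return dice, pmi

# C(new companies) = 8, C(new) = 15828, C(companies) = 4675, N = 14307668
print(collocation_scores(8, 15828, 4675, 14307668))
```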

Page 9:

Pointwise mutual information

In the T-test we compared P(new companies) with P(new)×P(companies)

Idea: compare these directly: P(new companies) / (P(new)×P(companies))

log( P(new companies) / (P(new)×P(companies)) ) = pointwise mutual information
Observe: log does not change the ranking; log has a theoretical motivation

Page 10:

Problems with pmi

If w1 and w2 only occur together, the score increases when the frequency of w2 decreases:
P(w1 w2) = P(w1) = P(w2), so P(w1 w2) / (P(w1)×P(w2)) = 1/P(w2)

A bad measure of dependence, but a good measure of independence

All measures are unreliable for small samples
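
A quick numeric illustration of the 1/P(w2) point (hypothetical counts; only the corpus size is reused from the slides):

```python
import math

N = 14307668                    # corpus size from the earlier slides
for c in (1000, 100, 10):       # w2 always co-occurs with w1: c(w1 w2) = c(w1) = c(w2) = c
    pmi = math.log((c / N) / ((c / N) ** 2))   # = log(1 / P(w2)) = log(N / c)
    print(c, round(pmi, 2))     # the rarer the pair, the higher the PMI
```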

Page 11:

χ² test

O11 = 8

E11 = P(new) × P(companies) × N = (C(new)/N) × (C(companies)/N) × N = (O11 + O12) × (O11 + O21) / N, etc.

H0: no association between the row and the column variable.

If H0 is true, the X² statistic has approximately a χ² distribution with (r−1)(c−1) degrees of freedom (d.f.)

Page 12:

In practice
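
The slide presumably showed a concrete run; as a hedged sketch, NLTK's BigramCollocationFinder and BigramAssocMeasures rank bigrams by the measures discussed above (requires nltk and a downloaded corpus such as Brown):

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# nltk.download('brown')  # uncomment on first run
words = nltk.corpus.brown.words()          # any tokenised corpus will do
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(5)                # ignore very rare bigrams

bam = BigramAssocMeasures()
print(finder.nbest(bam.student_t, 10))     # t-score ranking
print(finder.nbest(bam.pmi, 10))           # pointwise mutual information ranking
print(finder.nbest(bam.chi_sq, 10))        # chi-square ranking
```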

Page 13:

Today

More on collocations: repeat some statistics, T-test, other tests, differences
Feature selection
Multinomial logistic regression (MaxEnt)
Comparing and combining classifiers

13

Page 14:

Comparing two samples 14

X = {x1, …, xn}, Y = {y1, …, ym}: «do they belong to the same population?»

1. General case, example: height, sentence length
   t = (x̄ − ȳ) / √(sx²/n + sy²/m), use the t(k−1) density curve (k = min(n, m))

2. Proportions: n items, k successes, p̂1 = k/n (and similarly p̂2 from the m items), s² = p̂(1 − p̂)
   z = (p̂1 − p̂2) / √(p̂1(1 − p̂1)/n + p̂2(1 − p̂2)/m), use the z-density curve
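
A sketch of the two-sample versions in Python (hypothetical data; note that scipy's Welch t-test estimates the degrees of freedom from the data rather than using k − 1 with k = min(n, m) as on the slide):

```python
import math
from scipy import stats

# Two-sample t-test, e.g. sentence lengths from two corpora (made-up numbers)
x = [18, 25, 22, 19, 27, 21, 24, 20]
y = [15, 17, 20, 16, 19, 18, 21, 17, 16, 18]
t, p = stats.ttest_ind(x, y, equal_var=False)   # Welch's test, unequal variances allowed
print(t, p)

def two_proportion_z(k1, n, k2, m):
    """z-test for two proportions: k1 successes out of n vs. k2 out of m."""
    p1, p2 = k1 / n, k2 / m
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / m)
    z = (p1 - p2) / se
    return z, 2 * stats.norm.sf(abs(z))         # two-sided p-value

print(two_proportion_z(8, 14307668, 3, 11200000))   # hypothetical bigram counts
```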

Page 15:

Applied to collocations

Assumption: 1 − p is so close to 1 that p is a good approximation to s² = p(1 − p)

Page 16:

Page 17:

What is a collocation

Non-compositional. Ex.: N-N compounds in English, written as one word in German/Norwegian: white wine
Non-substitutability: red wine, burgundy wine
Frozen
Or only frequent co-occurrence?

Page 18:

IR 13.5

Feature selection 18

Page 19:

Today

More on collocations: repeat some statistics, T-test, other tests, differences
Feature selection
Multinomial logistic regression (MaxEnt)
Comparing and combining classifiers

19

Page 20:

Feature selection 20

To include all words as features is too costly. Select features which separate between the classes and are expected to occur with a reasonable frequency.

To select, use similar measures as for collocations, e.g. raw frequency, chi-square, mutual information, …

Page 21:

Apply chi square 21

Assume a binary classifier: only two classes {yes, no} and binary features

For every feature calculate the chi square score between the two values of the feature and the two classes

Select the words with the highest score as features
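
A sketch of this selection step with scikit-learn; SelectKBest with the chi2 score implements the same idea, and the 20 Newsgroups pair below is only a hypothetical stand-in for a two-class task, not the data from the slides:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

cats = ["rec.sport.hockey", "sci.space"]      # a binary {yes, no}-style task
data = fetch_20newsgroups(subset="train", categories=cats)

vectorizer = CountVectorizer(binary=True)     # binary bag-of-words features
X = vectorizer.fit_transform(data.data)
y = data.target

selector = SelectKBest(chi2, k=200)           # keep the 200 highest-scoring words
X_selected = selector.fit_transform(X, y)

names = vectorizer.get_feature_names_out()
print(names[selector.get_support()][:20])     # some of the selected feature words
```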

Page 22:

Chi square 22

                              Sense: musical instrument
                              Yes    No     Sums
"guitar" in context   Yes      25      5      30
                      No      275    695     970
Sums                          300    700    1000

Page 23:

Chi square 23

Observations                  Is in class s
                              Yes    No     Sums
w in context   Yes            O11    O10    O1x
               No             O01    O00    O0x
Sums                          Ox1    Ox0    N

Expectations                  Is in class s
                              Yes                  No                   Sums
w in context   Yes            E11 = O1x×Ox1/N      E10 = O1x×Ox0/N      O1x
               No             E01 = O0x×Ox1/N      E00 = O0x×Ox0/N      O0x
Sums                          Ox1                  Ox0                  N

Page 24:

Chi square 24

Observations                  Sense: musical instrument
                              Yes    No     Sums
"guitar" in context   Yes      25      5      30
                      No      275    695     970
Sums                          300    700    1000

Expectations                  Is in class s
                              Yes                  No                   Sums
w in context   Yes            9 = 30×300/1000      21 = 30×700/1000       30
               No             291                  679                   970
Sums                          300                  700                  1000
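
Putting the numbers together, a small Python check of the X² statistic for this table, first by the plain formula and then with scipy (correction=False switches off Yates' continuity correction so that the two agree):

```python
from scipy.stats import chi2_contingency

observed = [[25, 5],        # "guitar" in context: sense yes / no
            [275, 695]]     # "guitar" not in context

# By hand: X^2 = sum over cells of (O - E)^2 / E, with E from the row and column sums
row_sums = [sum(r) for r in observed]
col_sums = [sum(c) for c in zip(*observed)]
N = sum(row_sums)
x2 = sum((observed[i][j] - row_sums[i] * col_sums[j] / N) ** 2
         / (row_sums[i] * col_sums[j] / N)
         for i in range(2) for j in range(2))
print(x2)                                        # roughly 41.9 for this table

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p, dof)                              # dof = (r-1)(c-1) = 1
```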

Page 25:

Line, binary, 'product', BoW-only 25

Number of word features   Most frequent   Chi-square

0 0.528 0.528

10 0.632 0.786

20 0.724 0.826

50 0.816 0.846

100 0.844 0.862

200 0.864 0.878

500 0.864 0.898

1000 0.892 0.906

2000 0.912 0.912

5000 0.918 0.916

Page 26:

Feature selection, contd. 26

Similarly to collocations, we may use other association measures, e.g. pointwise mutual information and mutual information.

A difference between the measures is how they trade off discrimination against frequency.

Page 27:

Pointwise mutual information 27

PMI = log( O11 / E11 )

Page 28:

Mutual information 28

I(W; C) = Σ_{i=0,1} Σ_{j=0,1} P̂(W=i, C=j) × log [ P̂(W=i, C=j) / (P̂(W=i) × P̂(C=j)) ]
        = Σ_{i=0,1; j=0,1} (O_ij / N) × log ( N × O_ij / (O_ix × O_xj) )
        = Σ_{i=0,1; j=0,1} (O_ij / N) × log ( O_ij / E_ij )
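
A small Python sketch computing both scores for the guitar table from the earlier slides (pointwise MI looks at the single cell O11, mutual information sums over all four cells):

```python
import math

# O[(i, j)]: i = "w in context" (1/0), j = "is in class s" (1/0)
O = {(1, 1): 25, (1, 0): 5, (0, 1): 275, (0, 0): 695}
N = sum(O.values())
row = {i: O[(i, 1)] + O[(i, 0)] for i in (0, 1)}           # O_ix
col = {j: O[(1, j)] + O[(0, j)] for j in (0, 1)}           # O_xj
E = {(i, j): row[i] * col[j] / N for i in (0, 1) for j in (0, 1)}

pmi = math.log(O[(1, 1)] / E[(1, 1)])                      # log(O11 / E11)
mi = sum(O[ij] / N * math.log(O[ij] / E[ij]) for ij in O)  # full mutual information
print(round(pmi, 3), round(mi, 4))
```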

Page 29:

Multinomial logistic regression

Page 30:

Today

More on collocations: repeat some statistics, T-test, other tests, differences
Feature selection
Multinomial logistic regression (MaxEnt)
Comparing and combining classifiers

30

Page 31:

A slight reformulation

We saw that for NB:

P(c1|f) > P(c2|f)

iff

P(c1) × ∏_{j=1}^{n} P(fj|c1) > P(c2) × ∏_{j=1}^{n} P(fj|c2)

iff

log P(c1) + Σ_{j=1}^{n} log P(fj|c1) > log P(c2) + Σ_{j=1}^{n} log P(fj|c2)

This could also be written

log( P(c1)/P(c2) ) + Σ_{j=1}^{n} log( P(fj|c1)/P(fj|c2) ) = (log P(c1) − log P(c2)) + Σ_{j=1}^{n} (log P(fj|c1) − log P(fj|c2)) > 0

31

Page 32:

Reformulation, contd. 32

The inequality

log P(c1) + Σ_{j=1}^{n} log P(fj|c1) > log P(c2) + Σ_{j=1}^{n} log P(fj|c2)

has the form

w1·f = Σ_{i=0}^{M} w1_i fi > Σ_{i=0}^{M} w2_i fi = w2·f

where

w1_j = log( P(fj|c1) ) and w2_j = log( P(fj|c2) )

and our earlier w has wj = w1_j − w2_j

(The probability in this notation:

P(c1|f) = e^{w1·f} / (e^{w1·f} + e^{w2·f}) = 1 / (1 + e^{(w2−w1)·f}) = 1 / (1 + e^{−w·f})

and similarly for P(c2|f).)
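
A tiny numpy check of that last chain of identities (made-up weights and features, not from the slides): the two-class softmax form and the sigmoid in the difference weights w = w1 − w2 give the same probability.

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.integers(0, 2, size=10).astype(float)   # a made-up binary feature vector
w1 = rng.normal(size=10)                        # stand-ins for log P(f_j | c1)
w2 = rng.normal(size=10)                        # stand-ins for log P(f_j | c2)
w = w1 - w2

softmax_form = np.exp(w1 @ f) / (np.exp(w1 @ f) + np.exp(w2 @ f))
sigmoid_form = 1.0 / (1.0 + np.exp(-(w @ f)))
print(np.isclose(softmax_form, sigmoid_form))   # True: the formulations agree
```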

Page 33:

Multinomial logistic regression

We may generalize this to more than two classes. For each class cj, j = 1,…,k, we have a linear expression wj·f = Σ_{i=0}^{M} w_ji fi and the probability of belonging to class cj:

P(cj|f) = (1/Z) exp(wj·f) = (1/Z) e^{Σ_i w_ji fi} = (1/Z) ∏_i e^{w_ji fi} = (1/Z) ∏_i a_ji^{fi}

where a_ji = e^{w_ji} and Z = Σ_{j=1}^{k} exp(wj·f)

[Slide diagram relating: Naive Bayes (Bernoulli), Binary NB as linear classifier, Logistic regression, Multinomial regression (≈)]

33
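
A minimal numpy sketch of this softmax form (the weights and features are made up; subtracting the maximum score before exponentiating is only for numerical stability and does not change the result):

```python
import numpy as np

def class_probabilities(W, f):
    """P(c_j | f) = exp(w_j . f) / sum_l exp(w_l . f) for a k x M weight matrix W."""
    scores = W @ f                  # one linear score w_j . f per class
    scores = scores - scores.max()  # stabilise before exponentiating
    exps = np.exp(scores)
    return exps / exps.sum()        # the normaliser Z is the sum over classes

W = np.array([[0.5, -1.0, 2.0],     # made-up weights for 3 classes, 3 features each
              [1.5,  0.0, -0.5],
              [-1.0, 1.0, 0.5]])
f = np.array([1.0, 0.0, 1.0])       # a made-up feature vector
print(class_probabilities(W, f))    # three probabilities summing to 1
```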

Page 34:

Footnote: Alternative formulation

(In case you read other presentations, like Mitchell or Hastie et al.: they use a slightly different formulation, corresponding to

P(ci|f) = (1/Z) exp(wi·f),  for i = 1, 2,…, k−1

where

Z = 1 + Σ_{i=1}^{k−1} exp(wi·f)

But then

P(ck|f) = 1 / (1 + Σ_{i=1}^{k−1} exp(wi·f))

The two formulations are equivalent though: in the J&M formulation, divide the numerator and denominator in each P(ci|f) by exp(wk·f), and you get this formulation (with adjustments to Z and w).)

34

Page 35:

Indicator variables

Already seen: categorical variables represented by indicator variables, taking the values 0,1

Also usual to let the variables indicate both observation and class

P(cj|f) = (1/Z) exp(wj·f)
        = exp(wj·f) / Σ_{l=1}^{k} exp(wl·f)
        = exp( Σ_{i=0}^{m} wi fi(cj, x) ) / Σ_{l=1}^{k} exp( Σ_{i=0}^{m} wi fi(cl, x) )

35

Page 36:

Examples – J&M 36

Page 37:

Why called "maximum entropy"?

See NLTK book for a further example

37

P(NN)+P(JJ)+P(NNS)+P(VB)=1

P(NN)+P(NNS)=0.8

P(VB)=1/20
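
A quick worked solution, not spelled out on the slide: under just these three constraints, the maximum-entropy distribution spreads the probability mass as evenly as the constraints allow, giving P(VB) = 1/20 = 0.05, P(NN) = P(NNS) = 0.8/2 = 0.4, and P(JJ) = 1 − 0.8 − 0.05 = 0.15.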

Page 38:

Why called "maximum entropy"?

The multinomial logistic regression yields the probability distribution which gives the maximum entropy, given our training data.

38

Page 39:

Line – Most frequent BoW-features 39

Number of word features   NaiveBayes   SklearnClassifier(LogisticRegression())

0 0.528 0.528

10 0.528 0.528

20 0.534 0.546

50 0.576 0.624

100 0.688 0.732

200 0.706 0.752

500 0.744 0.804

1000 0.774 0.838

2000 0.802 0.846

5000 0.826 0.850
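
The table compares NLTK's NaiveBayesClassifier with NLTK's scikit-learn wrapper around logistic regression. A rough sketch of how such a comparison can be run (not the actual experiment script; it assumes train and test lists of (feature-dict, label) pairs for the 'line' data are built elsewhere):

```python
from nltk.classify import NaiveBayesClassifier, accuracy
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.linear_model import LogisticRegression

def compare(train, test):
    """Train both classifiers on the same featuresets and return their accuracies."""
    nb = NaiveBayesClassifier.train(train)
    logreg = SklearnClassifier(LogisticRegression(max_iter=1000)).train(train)
    return accuracy(nb, test), accuracy(logreg, test)

# train and test are lists of (feature_dict, label) pairs, e.g.
# ({"cord": True, "fishing": True, ...}, "cord"), built from the "line" corpus.
```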

Page 40:

Line, binary, 'product', BoW-only 40

Number of word features

Bernoulli   Bernoulli, chi-square   SKLearn-LogReg   SKLearn-LogReg, chi-square

0 0.528 0.528 0.528 0.528

10 0.632 0.786 0.636 0.786

20 0.724 0.826 0.738 0.830

50 0.816 0.846 0.810 0.866

100 0.844 0.862 0.858 0.898

200 0.864 0.878 0.888 0.906

500 0.864 0.898 0.902 0.924

1000 0.892 0.906 0.912 0.928

2000 0.912 0.912 0.924 0.928

5000 0.918 0.916 0.922 0.928

Page 41:

Comparing and combining classifiers

Page 42:

Today

More on collocations: repeat some statistics, T-test, other tests, differences
Feature selection
Multinomial logistic regression (MaxEnt)
Comparing and combining classifiers

42

Page 43:

Maxent vs Naive Bayes

If the Naive Bayes assumption is warranted – i.e. the features are independent – the two yield the same result in the limit.

Otherwise, Maxent copes better with dependencies between features.

With Maxent you may throw in features and let the model decide whether they are useful.

Maxent training is slower.

43

Page 44:

Generative vs discriminative model

Generative (e.g. NB): P(o,c), P(c|o), argmax_c P(c|o), P(o), argmax_o P(o), P(o|c), …
Discriminative (e.g. Maxent): P(c|o), argmax_c P(c|o)

See NLTK book

44

Page 45:

More than two classes (in general)

Any-of or multivalue classification: an item may belong to 0, 1, or more than 1 class; the classes are independent; use n binary classifiers. Example: documents.

One-of or multinomial classification: each item belongs to exactly one class; the classes are mutually exclusive. Example: POS-tagging.

45

Page 46:

One-of classifiers

Many classifiers are built for binary problems

Simply combining several binary classifiers does not result in a one-of classifier.


46

Page 47:

Combining binary classifiers

Build a classifier for each class compared to its complement.
For a test document, evaluate it for membership in each class.
Assign the document to the class with either: maximum probability, maximum score, or maximum confidence.

Multinomial logistic regression is a good example.
Sometimes one postpones the decision and proceeds with the probabilities (soft classification), e.g. Maxent tagging.

47