Text categorization: feature selection with the chi-square test
(transcript of a 26-slide deck)

Page 1

Text categorization

Feature selection: chi square test

Page 2

Slides adapted from Mary Ellen Califf

Joint Probability Distribution

The joint probability distribution for a set of random variables X1…Xn gives the probability of every combination of values

P(X1,...,Xn)

           Sneeze   ¬Sneeze
Cold        0.08     0.01
¬Cold       0.01     0.90

The probability of all possible cases can be calculated by summing the appropriate subset of values from the joint distribution. All conditional probabilities can therefore also be calculated

P(Cold | ¬Sneeze)
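For example, reading the required entries off the table above:

$$P(\text{Cold} \mid \neg\text{Sneeze}) = \frac{P(\text{Cold}, \neg\text{Sneeze})}{P(\neg\text{Sneeze})} = \frac{0.01}{0.01 + 0.9} \approx 0.011$$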

BUT it’s often very hard to obtain all the probabilities for a joint distribution

Page 3

Slides adapted from Mary Ellen Califf

Bayes Independence Example

Imagine there are diagnoses ALLERGY, COLD, and WELL and symptoms SNEEZE, COUGH, and FEVER. Can these be correct numbers?

Prob             Well   Cold   Allergy
P(d)             0.9    0.05   0.05
P(sneeze | d)    0.1    0.9    0.9
P(cough | d)     0.1    0.8    0.7
P(fever | d)     0.01   0.7    0.4
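One way to see that they can be correct: P(d) sums to 1 over the diagnoses, while each symptom row is a separate conditional distribution and need not sum to 1. As an illustration (my own worked step, assuming the three diagnoses are mutually exclusive and exhaustive), Bayes' rule for a sneezing patient gives

$$P(\text{well} \mid \text{sneeze}) \propto 0.1 \cdot 0.9 = 0.09, \quad P(\text{cold} \mid \text{sneeze}) \propto 0.9 \cdot 0.05 = 0.045, \quad P(\text{allergy} \mid \text{sneeze}) \propto 0.9 \cdot 0.05 = 0.045,$$

which normalize to 0.5, 0.25, and 0.25.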

Page 4

KL divergence (relative entropy)

$$D(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$

A basis for comparing two probability distributions.
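A minimal sketch (my own illustration, not from the slides) of this sum for two discrete distributions represented as dictionaries:

import math

def kl_divergence(p, q):
    """D(P || Q) = sum over x of P(x) * log(P(x) / Q(x)).

    p and q map the same outcomes to probabilities; assumes q[x] > 0
    wherever p[x] > 0 (otherwise the divergence is infinite).
    """
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

# Toy example: two distributions over the same two outcomes.
p = {"sneeze": 0.8, "no_sneeze": 0.2}
q = {"sneeze": 0.5, "no_sneeze": 0.5}
print(kl_divergence(p, q))  # > 0; equals 0 only when P and Q are identical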

Page 5

Slide adapted from Paul Bennet

Text Categorization Applications

Web pages organized into category hierarchies
Journal articles indexed by subject categories (e.g., the Library of Congress, MEDLINE, etc.)
Responses to Census Bureau occupations
Patents archived using the International Patent Classification
Patient records coded using international insurance categories
E-mail message filtering
News events tracked and filtered by topics
Spam

Page 6

Yahoo News Categories

Page 7

Text Topic categorization

Topic categorization: classify the document into semantic topics.

The U.S. swept into the Davis Cup final on Saturday when twins Bob and Mike Bryan defeated Belarus's Max Mirnyi and Vladimir Voltchkov to give the Americans an unsurmountable 3-0 lead in the best-of-five semi-final tie.

One of the strangest, most relentless hurricane seasons on record reached new bizarre heights yesterday as the plodding approach of Hurricane Jeanne prompted evacuation orders for hundreds of thousands of Floridians and high wind warnings that stretched 350 miles from the swamp towns south of Miami to the historic city of St. Augustine.

Page 8

The Reuters collection

A gold standard: a collection of 21,578 newswire documents.
For research purposes: a standard text collection to compare systems and algorithms.
135 valid topic categories.

Page 9

Reuters

Top topics in Reuters

Page 10

Reuters Document Example

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">

<DATE> 2-MAR-1987 16:51:43.42</DATE>

<TOPICS><D>livestock</D><D>hog</D></TOPICS>

<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>

<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off

tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC.

Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said.

A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter

</BODY></TEXT></REUTERS>
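A rough sketch of reading such records (my own illustration; it assumes a local Reuters-21578 SGML file, here hypothetically named reut2-000.sgm, and uses a tolerant regex pass since the SGML is not strict XML):

import re

raw = open("reut2-000.sgm", encoding="latin-1").read()

for doc in re.findall(r"<REUTERS.*?</REUTERS>", raw, flags=re.S):
    # Topic labels live inside <TOPICS><D>...</D>...</TOPICS>
    topics_block = re.search(r"<TOPICS>(.*?)</TOPICS>", doc, flags=re.S)
    topics = re.findall(r"<D>(.*?)</D>", topics_block.group(1)) if topics_block else []
    body = re.search(r"<BODY>(.*?)</BODY>", doc, flags=re.S)
    text = body.group(1) if body else ""
    print(topics, text[:60].replace("\n", " "))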

Page 11

Classification vs. Clustering

Classification assumes labeled data: we know how many classes there are and we have examples for each class (labeled data). Classification is supervised.

In clustering we don't have labeled data; we just assume that there is a natural division in the data, and we may not know how many divisions (clusters) there are. Clustering is unsupervised.

Page 12

Categories (Labels, Classes)

Labeling data: two problems.

1. Decide the possible classes (which ones, how many): domain and application dependent.

2. Label the text: difficult, time consuming, with inconsistency between annotators.

Page 13

Reuters Example, revisited

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">

<DATE> 2-MAR-1987 16:51:43.42</DATE>

<TOPICS><D>livestock</D><D>hog</D></TOPICS>

<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>

<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off

tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC.

Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said.

A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter

</BODY></TEXT></REUTERS>

Why not topic = policy?

Page 14

Binary vs. multi-way classification

Binary classification: two classes

Multi-way classification: more than two classes

Sometimes it can be convenient to treat a multi-way problem as a set of binary ones: one class versus all the others, for each class (see the sketch below).
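A minimal sketch of this one-versus-rest idea (my own illustration using scikit-learn, which the slides do not mention; the tiny training set is invented):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

texts = ["the team won the match", "stocks fell sharply today",
         "parliament passed the new bill", "the striker scored twice",
         "the central bank cut interest rates", "the senate debated the law"]
labels = ["sport", "finance", "politics", "sport", "finance", "politics"]

# OneVsRestClassifier trains one binary classifier per class.
clf = make_pipeline(CountVectorizer(), OneVsRestClassifier(LogisticRegression()))
clf.fit(texts, labels)
print(clf.predict(["the team scored in the match"]))  # expected to favour "sport"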

Page 15

Features

>>> text = "Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix."

>>> label = "sport"

>>> labeled_text = LabeledText(text, label)

Here the classifier takes as input the whole string.
What's the problem with that?
What are the features that could be useful for this example? (See the sketch below.)
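One possible answer, sketched in plain Python (my own illustration; the feature names are not from the slides): represent the text by word-presence features rather than by the raw string.

text = ("Seven-time Formula One champion Michael Schumacher took on the "
        "Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix.")

# Bag-of-words presence features: one boolean feature per distinct word.
features = {"contains(%s)" % w.lower(): True for w in text.replace(".", "").split()}
labeled_example = (features, "sport")
print(sorted(features)[:5])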

Page 16

Feature terminology

Feature: an aspect of the text that is relevant to the task. Some typical features:

Words present in text
Frequency of words
Capitalization
Are there NEs (named entities)?
WordNet
Others?

Page 17

Feature terminology

Feature: An aspect of the text that is relevant to the task

Feature value: the realization of the feature in the text

Words present in text
Frequency of words
Are there dates? Yes/no
Are there PERSONs? Yes/no
Are there ORGANIZATIONs? Yes/no
WordNet: holonyms (China is part of Asia), synonyms (China, People's Republic of China, mainland China)
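A small sketch of pulling such WordNet features with NLTK (my own illustration; it requires the nltk package and its WordNet data, and the exact synsets printed may vary with the WordNet version):

from nltk.corpus import wordnet as wn

for syn in wn.synsets("China"):
    # lemma_names() gives synonyms; part_holonyms() gives "is part of" relations.
    print(syn.name(), syn.lemma_names(), [h.name() for h in syn.part_holonyms()])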

Page 18

Feature Types

Boolean (or binary) features: features that generate boolean (binary) values. Boolean features are the simplest and the most common type of feature.

f1(text) = 1 if text contains "elections", 0 otherwise

f2(text) = 1 if text contains a PERSON, 0 otherwise
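A sketch of these two boolean features in Python (my own illustration; PERSON detection is stubbed with a hypothetical toy gazetteer rather than a real named-entity recognizer):

person_names = {"michael", "schumacher"}  # hypothetical toy PERSON list

def f1(text):
    """1 if the text contains the word 'elections', 0 otherwise."""
    return 1 if "elections" in text.lower().split() else 0

def f2(text):
    """1 if the text mentions a known PERSON, 0 otherwise."""
    return 1 if any(tok in person_names for tok in text.lower().split()) else 0

print(f1("Local elections were held on Sunday"))      # 1
print(f2("Michael Schumacher took pole position"))    # 1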

Page 19

Feature Types

Integer features: features that generate integer values. Integer features can be used to give classifiers access to more precise information about the text.

f1(text) = Number of times “elections” occurs

f2(text) = Number of times PERSON occurs
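The integer-valued counterparts of the boolean sketch above (same hypothetical PERSON list; again my own illustration):

person_names = {"michael", "schumacher"}

def count_elections(text):
    """Number of times the token 'elections' occurs."""
    return text.lower().split().count("elections")

def count_persons(text):
    """Number of tokens that match the toy PERSON list."""
    return sum(1 for tok in text.lower().split() if tok in person_names)

print(count_elections("The elections follow earlier regional elections"))  # 2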

Page 20

χ² statistic (CHI)

χ² statistic (pronounced “kai square”): a commonly used method of comparing proportions. Measures the lack of independence between a term and a category.

Page 21

χ² statistic (CHI)

Is “jaguar” a good predictor for the “auto” class?

                 Term = jaguar   Term ≠ jaguar
Class = auto            2              500
Class ≠ auto            3             9500

We want to compare: the observed distribution above, and the null hypothesis: that jaguar and auto are independent.

Page 22

χ² statistic (CHI)

Under the null hypothesis (jaguar and auto independent): how many co-occurrences of jaguar and auto do we expect?

                 Term = jaguar   Term ≠ jaguar
Class = auto            2              500
Class ≠ auto            3             9500

If independent: Pr(j,a) = Pr(j) Pr(a)

So there would be N Pr(j,a), i.e. N Pr(j) Pr(a), expected co-occurrences of “jaguar” and “auto”:

Pr(j) = (2+3)/N
Pr(a) = (2+500)/N
N = 2 + 3 + 500 + 9500 = 10005

N (5/N) (502/N) = 2510/N = 2510/10005 ≈ 0.25

Page 23

χ² statistic (CHI)

Under the null hypothesis (jaguar and auto independent): how many co-occurrences of jaguar and auto do we expect?

                 Term = jaguar   Term ≠ jaguar
Class = auto        2 (0.25)           500
Class ≠ auto            3             9500

observed: fo; expected: fe (in parentheses)

Page 24

χ² statistic (CHI)

Under the null hypothesis (jaguar and auto independent): how many co-occurrences of jaguar and auto do we expect?

                 Term = jaguar   Term ≠ jaguar
Class = auto        2 (0.25)       500 (502)
Class ≠ auto        3 (4.75)     9500 (9498)

observed: fo; expected: fe (in parentheses)

Page 25

χ² statistic (CHI)

χ² is interested in (fo - fe)²/fe summed over all table entries:

$$\chi^2(j, a) = \sum \frac{(O - E)^2}{E} = \frac{(2 - 0.25)^2}{0.25} + \frac{(3 - 4.75)^2}{4.75} + \frac{(500 - 502)^2}{502} + \frac{(9500 - 9498)^2}{9498} \approx 12.9 \qquad (p < 0.001)$$

The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the value for .999 confidence).

                 Term = jaguar   Term ≠ jaguar
Class = auto        2 (0.25)       500 (502)
Class ≠ auto        3 (4.75)     9500 (9498)

observed: fo; expected: fe (in parentheses)
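The same computation in a short Python sketch (my own illustration, recomputing the table above from the raw counts):

# Observed counts from the jaguar/auto contingency table.
obs = {("jaguar", "auto"): 2,       ("jaguar", "not-auto"): 3,
       ("not-jaguar", "auto"): 500, ("not-jaguar", "not-auto"): 9500}

N = sum(obs.values())                              # 10005
term_totals = {"jaguar": 2 + 3, "not-jaguar": 500 + 9500}
class_totals = {"auto": 2 + 500, "not-auto": 3 + 9500}

chi2 = 0.0
for (t, c), fo in obs.items():
    fe = term_totals[t] * class_totals[c] / N      # expected count under independence
    chi2 += (fo - fe) ** 2 / fe

print(round(chi2, 2))  # 12.85; the slide's 12.9 comes from rounding the expected counts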

Page 26

χ² statistic (CHI)

There is a simpler formula for χ²:

A = #(t,c)    C = #(¬t,c)
B = #(t,¬c)   D = #(¬t,¬c)
N = A + B + C + D
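With these counts, the standard 2×2 shortcut (consistent with the worked example above; the slide's own formula image is not reproduced here) is

$$\chi^2(t, c) = \frac{N \, (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$$

For the jaguar/auto table (A = 2, B = 3, C = 500, D = 9500) this gives 10005 · (2·9500 − 500·3)² / (502 · 9503 · 5 · 10000) ≈ 12.85, matching the longer computation.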