Using Error-Correcting Codes for Efficient Text Categorization with a Large Number of Categories


Transcript of Using Error-Correcting Codes for Efficient Text Categorization with a Large Number of Categories

Page 1: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

Using Error-Correcting Codes for Efficient Text Categorization with a Large Number of Categories

Rayid Ghani

Center for Automated Learning & Discovery

Carnegie Mellon University

Page 2: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

Some Recent Work

Learning from Sequences of fMRI Brain Images (with Tom Mitchell)

Learning to automatically build language-specific corpora from the web (with Rosie Jones & Dunja Mladenic)

Effect of Smoothing on Naive Bayes for Text Classification (with Tong Zhang @ IBM Research)

Hypertext Categorization using links and extracted information (with Sean Slattery & Yiming Yang)

Hybrids of EM & Co-Training for semi-supervised learning (with Kamal Nigam)

Error-Correcting Output Codes for Text Classification

Page 3: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

Text Categorization

Numerous applications: search engines/portals, customer service, email routing, …

Domains: topics, genres, languages

$$$ Making

Page 4: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

Problems

Practical applications such as web portals deal with a large number of categories

A lot of labeled examples are needed to train the system

Page 5: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

How do people deal with a large number of classes?

Use fast multiclass algorithms (Naïve Bayes): build one model per class

Use binary classification algorithms (SVMs) and break an n-class problem into n binary problems

What happens with a 1000-class problem? Can we do better?

Page 6: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

ECOC to the Rescue!

An n-class problem can be solved by solving only ⌈log₂ n⌉ binary problems

More efficient than one-per-class. Does it actually perform better?
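To make the efficiency claim concrete, here is a quick check (a toy sketch, standard library only; 105 and 1000 are class counts from the deck, the function names are mine): one-per-class needs n binary classifiers, while a code needs only ⌈log₂ n⌉ bits to give every class a distinct codeword.

```python
# One-per-class vs. minimal code length: how many binary problems each needs.
from math import ceil, log2

def one_per_class(n):
    return n                  # one binary classifier per class

def min_code_bits(n):
    return ceil(log2(n))      # shortest code with a distinct codeword per class

for n in (4, 105, 1000):      # 105 = Industry Sector; 1000 = the slide's question
    print(n, one_per_class(n), min_code_bits(n))  # 1000 classes: 1000 vs. 10 problems
```

In practice ECOC uses codes longer than this minimum to gain error-correcting distance, which is exactly the accuracy/cost trade-off discussed on the later slides.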

Page 7: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

What is ECOC?

Solve multiclass problems by decomposing them into multiple binary problems (Dietterich & Bakiri 1995)

Use a learner to learn the binary problems

Page 8: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

Training ECOC

     f1  f2  f3  f4
A     0   0   1   1
B     1   0   1   0
C     0   1   1   1
D     0   1   0   0

Each column fi defines one binary problem: classifier fi is trained to predict bit i of the class codeword.

Testing ECOC

X     1   1   1   1

Run f1–f4 on X and assign X to the class with the nearest codeword (here C, at Hamming distance 1).
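The decoding step can be run directly; a minimal sketch using the slide's 4-class, 4-bit code (variable names are illustrative):

```python
# Hamming-distance decoding for the slide's 4-class, 4-bit code.
CODE = {
    "A": (0, 0, 1, 1),
    "B": (1, 0, 1, 0),
    "C": (0, 1, 1, 1),
    "D": (0, 1, 0, 0),
}

def hamming(u, v):
    # Number of bit positions where the two vectors differ.
    return sum(a != b for a, b in zip(u, v))

def decode(bits):
    # Assign the class whose codeword is nearest to the predicted bits.
    return min(CODE, key=lambda c: hamming(CODE[c], bits))

print(decode((1, 1, 1, 1)))  # X's predicted bits -> "C" (Hamming distance 1)
```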

Page 9: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

ECOC - Picture

[Figure: the four classes A–D shown with their codewords over bits f1–f4.]


Page 12: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

ECOC - Picture

[Figure: the same picture, now with test example X, whose predicted bits 1 1 1 1 are decoded to the nearest codeword.]

Page 13: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

ECOC works but…

Increased code length = increased accuracy

Increased code length = increased computational cost

Page 14: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

[Figure: classification performance vs. efficiency. Naïve Bayes sits at high efficiency but lower performance; ECOC (as used in Berger 99) at high performance but lower efficiency. The GOAL is the high-performance, high-efficiency corner.]

Page 15: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

Choosing the codewords

Random? [Berger 1999, James 1999]
  - Asymptotically good (the longer, the better)
  - Computational cost is very high

Use coding theory for good error-correcting codes? [Dietterich & Bakiri 1995]
  - Guaranteed properties for a fixed-length code
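The random-code option can be illustrated with a short sketch (mine, not from the talk): a code's quality is measured by its minimum pairwise Hamming distance, since a code with minimum distance d corrects up to ⌊(d−1)/2⌋ single-bit classifier errors. Random codes get good distance as length grows but, unlike BCH codes, with no guarantee at a fixed length.

```python
# Measure the minimum pairwise Hamming distance of a random code.
import random

def random_code(n_classes, n_bits, seed=0):
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_classes)]

def min_hamming(code):
    # Smallest Hamming distance over all pairs of codewords.
    return min(sum(a != b for a, b in zip(u, v))
               for i, u in enumerate(code) for v in code[i + 1:])

code = random_code(n_classes=10, n_bits=31)
print(min_hamming(code))  # asymptotically good, but not guaranteed for a fixed length
```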

Page 16: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

Experimental Setup

Generate the code: BCH codes

Choose a base learner: Naïve Bayes classifier, as used in text classification tasks (McCallum & Nigam 1998)

Page 17: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

Text Classification with Naïve Bayes

“Bag of words” document representation

Estimate parameters of the generative model (with Laplace smoothing):

    P(word | class) = (1 + N(word, class)) / (|V| + N(class))

Naïve Bayes classification:

    P(class | doc) = P(class) · P(doc | class) / P(doc) ∝ P(class) · ∏_{word ∈ doc} P(word | class)

where N(word, class) counts occurrences of word in documents of class, N(class) is the total word count of the class, and |V| is the vocabulary size.
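The model on this slide can be sketched in a few lines (the toy corpus and names are mine; log-probabilities are used to avoid underflow on long documents):

```python
# Multinomial Naïve Bayes with the Laplace-smoothed estimate
# P(word|class) = (1 + N(word, class)) / (|V| + N(class)).
from collections import Counter, defaultdict
from math import log

docs = [("buy cheap stocks", "finance"),
        ("stocks rally today", "finance"),
        ("new movie trailer", "entertainment")]

vocab = {w for text, _ in docs for w in text.split()}
word_counts = defaultdict(Counter)   # N(word, class)
class_docs = Counter()               # for the prior P(class)

for text, c in docs:
    class_docs[c] += 1
    word_counts[c].update(text.split())

def log_posterior(text, c):
    # log P(class) + sum over words of log P(word | class)
    n_class = sum(word_counts[c].values())          # N(class)
    lp = log(class_docs[c] / len(docs))
    for w in text.split():
        lp += log((1 + word_counts[c][w]) / (len(vocab) + n_class))
    return lp

def classify(text):
    return max(class_docs, key=lambda c: log_posterior(text, c))

print(classify("cheap stocks rally"))  # -> finance
```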

Page 18: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

Industry Sector Dataset

Consists of company web pages classified into 105 economic sectors

[McCallum et al. 1998, Ghani 2000]

Page 19: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

Results: Industry Sector Data Set

Naïve Bayes   Shrinkage¹   MaxEnt²   MaxEnt w/ Prior³   ECOC 63-bit
66.1%         76%          79%       81.1%              88.5%

ECOC reduces the error of the Naïve Bayes classifier by 66% with no increase in computational cost

1. (McCallum et al. 1998)   2, 3. (Nigam et al. 1999)

Page 20: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

[Figure: two histograms of minimum Hamming distance (Min HD, 0–10) vs. frequency: one for correctly classified examples (frequencies up to ~1000) and one for wrongly classified examples (frequencies up to ~100).]

Page 21: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

ECOC for better Precision

[Figure: precision vs. recall curves for ECOC and NB; precision 0.6–1.0 over recall 0.49–1.0.]

Page 22: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

ECOC for better Precision

[Figure: precision vs. recall curves for NB and 15-bit ECOC; precision 0.4–0.75 over recall 0–1.]

Page 23: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

[Figure: the performance-vs-efficiency plot again, with a New Goal added: reach the performance of ECOC (as used in Berger 99) at the efficiency of NB.]

Page 24: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

Solutions

Design codewords that minimize cost and maximize “performance”

Investigate the assignment of codewords to classes

Learn the decoding function

Incorporate unlabeled data into ECOC

Page 25: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

What happens with sparse data?

[Figure: percent decrease in error (30–70%) vs. training size (0–100 examples) for 15-bit, 31-bit, and 63-bit codes.]

Page 26: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

Use unlabeled data with a large number of classes

How? Use EM → mixed results

Think again! Use Co-Training → disastrous results

Think one more time…

Page 27: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

How to use unlabeled data?

Current learning algorithms using unlabeled data (EM, Co-Training) don’t work well with a large number of categories

ECOC works great with a large number of classes but there is no framework for using unlabeled data

Page 28: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

ECOC + CoTraining = ECoTrain

ECOC decomposes multiclass problems into binary problems

Co-Training works great with binary problems

ECOC + Co-Train = Learn each binary problem in ECOC with Co-Training
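The key transformation is small enough to show directly (a runnable fragment with the deck's 4-class code; the documents are placeholders): ECOC turns the multiclass labels into one binary labeling per code bit, and each resulting binary problem is what Co-Training is run on.

```python
# Relabeling step of ECoTrain: one binary problem per ECOC bit.
CODE = {"A": (0, 0, 1, 1), "B": (1, 0, 1, 0), "C": (0, 1, 1, 1), "D": (0, 1, 0, 0)}
labeled = [("doc1", "A"), ("doc2", "C"), ("doc3", "D")]

def binary_problem(bit):
    # Each labeled doc gets bit `bit` of its class codeword; this binary
    # training set (plus the unlabeled pool) is handed to Co-Training.
    return [(doc, CODE[c][bit]) for doc, c in labeled]

print(binary_problem(1))  # -> [('doc1', 0), ('doc2', 1), ('doc3', 1)]
```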

Page 29: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

ECOC + Co-Training: Results

Algorithm                            300L + 0U    50L + 250U    5L + 295U
                                     per class    per class     per class
Naïve Bayes (no unlabeled data)      76           67            40.3
ECOC 15-bit (no unlabeled data)      76.5         68.5          49.2
EM (unlabeled data, 105 classes)     —            68.2          51.4
Co-Training (unlabeled data)         —            67.6          50.1
ECoTrain (ECOC + Co-Training,        —            72.0          56.1
  unlabeled data)

Page 30: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

What Next?

Use an improved version of Co-Training (gradient descent):
  - Less prone to random fluctuations
  - Uses all unlabeled data at every iteration

Use Co-EM (Nigam & Ghani 2000), a hybrid of EM and Co-Training

Page 31: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

Potential Drawbacks

Random Codes throw away the real-world nature of the data by picking random partitions to create artificial binary problems

Page 32: Using Error-Correcting Codes for Efficient Text Categorization with a  Large  Number of Categories

Summary

Use ECOC for efficient text classification with a large number of categories

Increase accuracy & efficiency

Use unlabeled data by combining ECOC and Co-Training

Generalize to domain-independent classification tasks involving a large number of categories