
IFT6255: Information Retrieval

A synthesis, analysis and comparison of text classification algorithms

Ligen Wang
Jing Bai


Overview

• Definition of text classification
• Important processes in classification
• Classification algorithms
• Advantages and disadvantages of algorithms
• Performance comparison of algorithms
• Conclusion


Text Classification

• Text classification (text categorization): assign documents to one or more predefined categories

[Figure: documents are mapped to classes class1, class2, ..., classn]


Illustration of Text Classification

[Figure: documents sorted into the classes Science, Sport and Art]


Applications of Text Classification

• Organize web pages into hierarchies
• Domain-specific information extraction
• Sort email into different folders
• Find interests of users
• Etc.


Text Classification Framework

Documents → Preprocessing → Indexing → Feature selection → Applying classification algorithms → Performance measure


Preprocessing

• Preprocessing: transform documents into a suitable representation for classification task
  – Remove HTML or other tags
  – Remove stopwords
  – Perform word stemming (remove suffixes)
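The three preprocessing steps can be sketched in a few lines of Python. This is only an illustration: the stopword list and the suffix list are tiny hypothetical stand-ins, and the stemmer is a crude suffix stripper, not the Porter algorithm or any stemmer used by the authors.

```python
import re

# Tiny illustrative stopword list (assumption); real systems use far larger ones.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Hypothetical suffix list for a crude stemmer (not the Porter algorithm).
SUFFIXES = ("ing", "ed", "es", "s")


def strip_tags(text):
    """Replace HTML-like tags such as <p> or </b> with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tags removed -> lowercase tokens -> stopwords removed -> stemmed."""
    tokens = re.findall(r"[a-z]+", strip_tags(text).lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]
```

For example, `preprocess("<p>The players are training in the gym</p>")` yields the tokens `player`, `train` and `gym`.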


Indexing

• Indexing by different weighting schemes:
  – Boolean weighting
  – Word frequency weighting
  – tf*idf weighting
  – ltc weighting
  – Entropy weighting
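As an illustration of one scheme from the list, here is a minimal tf*idf sketch. The exact formula is an assumption (raw term frequency times log(N / df)); the slides do not say which variant was used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumed weighting: w(t, d) = tf(t, d) * log(N / df(t)).
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors
```

A term that occurs in every document gets weight 0 under this variant, which is why tf*idf favours terms that discriminate between documents.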


Feature Selection

• Feature selection: remove non-informative terms from documents
  => improve classification effectiveness
  => reduce computational complexity


Different Feature Selection Methods

• Document Frequency Thresholding (DF)
• Information Gain (IG)
• χ² statistic (CHI)
• Mutual Information (MI)
• Term Strength (TS)
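Two of these methods are easy to sketch. The χ² function below uses the usual 2x2 contingency-table form of the statistic for one term and one class; the argument names `a`, `b`, `c`, `d` and the helper `select_by_df` are illustrative, not taken from the slides.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of one term and one class from document counts.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate term or class; treat as uninformative
    return n * (a * d - c * b) ** 2 / denominator


def select_by_df(doc_freq, min_df):
    """Document frequency thresholding: keep terms seen in >= min_df documents."""
    return {t for t, df in doc_freq.items() if df >= min_df}
```

A term that appears in exactly the documents of one class scores highest, while a term spread evenly across classes scores 0.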


Classification Algorithms

• Rocchio’s algorithm
• K-Nearest-Neighbor algorithm (KNN)
• Decision Tree algorithm (DT)
• Naive Bayes algorithm (NB)
• Artificial Neural Network (ANN)
• Support Vector Machine (SVM)
• Voting algorithms


Rocchio’s Algorithm

• Build prototype vector for each class
  prototype vector: average vector over all training document vectors that belong to class c_i

• Calculate similarity between test document and each of prototype vectors

• Assign test document to the class with maximum similarity

$\vec{C}_i = \alpha \cdot centroid_{C_i} - \beta \cdot centroid_{\overline{C_i}}$


Analysis of Rocchio’s Algorithm

• Advantages:
  – Easy to implement
  – Very fast learner
  – Relevance feedback mechanism

• Disadvantages:
  – Low classification accuracy
  – Linear combination too simple for classification
  – Constants α and β are empirical


K-Nearest-Neighbor Algorithm

• Principle: points (documents) that are close in the space belong to the same class


K-Nearest-Neighbor Algorithm

• Calculate similarity between test document and each neighbor

• Select k nearest neighbors of a test document among training examples

• Assign test document to the class which contains most of the neighbors

$\arg\max_i \sum_{j=1}^{k} sim(D_j \mid D) \cdot \delta(C(D_j), i)$
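The decision rule can be sketched as a similarity-weighted vote. Assumptions: documents are sparse term-weight dicts, similarity is cosine, and each of the k neighbors adds its similarity to the score of its own class.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(train_docs, train_labels, doc, k=3):
    """Similarity-weighted vote among the k most similar training documents."""
    scored = sorted(
        ((cosine(d, doc), y) for d, y in zip(train_docs, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = defaultdict(float)
    for sim, label in scored[:k]:
        votes[label] += sim  # a neighbor only votes for its own class
    return max(votes, key=votes.get)
```

Note that every call compares the test document with all training documents, which is the source of the long classification time.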


Analysis of KNN Algorithm

• Advantages:
  – Effective
  – Non-parametric
  – More local characteristics of documents are considered compared with Rocchio

• Disadvantages:
  – Classification time is long
  – Difficult to find optimal value of k


Decision Tree Algorithm

• Decision tree associated with document:
  – Root node contains all documents
  – Each internal node is subset of documents separated according to one attribute
  – Each arc is labeled with predicate which can be applied to attribute at parent
  – Each leaf node is labeled with a class


Decision Tree Algorithm

• Recursive partition procedure from root node
• Set of documents separated into subsets according to an attribute
• Use the most discriminative attribute first
• Pruning to deal with overfitting


Analysis of Decision Tree Algorithm

• Advantages:
  – Easy to understand
  – Easy to generate rules
  – Reduce problem complexity

• Disadvantages:
  – Training time is relatively expensive
  – A document is only connected with one branch
  – Once a mistake is made at a higher level, any subtree is wrong
  – Does not handle continuous variables well
  – May suffer from overfitting


Naïve Bayes Algorithm

• Estimate the probability of each class for a document:
  – Compute the posterior probability (Bayes rule)

    $P(c_i \mid D) = \frac{P(c_i) \, P(D \mid c_i)}{P(D)}$

  – Assumption of word independence

    $P(D \mid c_i) = \prod_{j=1}^{n} P(d_j \mid c_i)$


Naïve Bayes Algorithm

– P(c_i):

  $P(C = c_i) = \frac{N_i}{N}$

– P(d_j|c_i):

  $P(d_j \mid c_i) = \frac{1 + N_{ji}}{M + \sum_{k=1}^{M} N_{ki}}$
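The two estimates combine into a classifier as sketched below. Assumptions are made loudly here because the slides do not define the counts: N_i is taken as the number of training documents in class c_i, N as the total number of training documents, N_ji as the number of occurrences of word d_j in class c_i, and M as the vocabulary size. Log probabilities are used to avoid underflow, and P(D) is dropped because it is the same for every class.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """docs: list of token lists. Returns (priors, word_counts, vocabulary)."""
    class_docs = Counter(labels)
    priors = {c: n / len(docs) for c, n in class_docs.items()}  # N_i / N
    word_counts = defaultdict(Counter)
    for tokens, c in zip(docs, labels):
        word_counts[c].update(tokens)  # assumption: N_ji counts occurrences
    vocabulary = {t for tokens in docs for t in tokens}
    return priors, word_counts, vocabulary


def classify_nb(model, tokens):
    """Pick the class maximizing log P(c) + sum_j log P(d_j | c)."""
    priors, word_counts, vocabulary = model
    m = len(vocabulary)
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        score = math.log(prior)
        for t in tokens:
            if t in vocabulary:  # skip words never seen in training
                score += math.log((1 + word_counts[c][t]) / (m + total))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

The `1 +` in the numerator keeps a word that never occurred in a class from forcing the whole product to zero.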


Analysis of Naïve Bayes Algorithm

• Advantages:
  – Works well on numeric and textual data
  – Easy to implement and cheap to compute compared with other algorithms

• Disadvantages:
  – Conditional independence assumption is violated by real-world data; performs very poorly when features are highly correlated
  – Does not consider frequency of word occurrences


Basic Neuron Model In A Feedforward Network

• Inputs xi arrive through pre-synaptic connections

• Synaptic efficacy is modeled using real weights wi

• The response of the neuron is a nonlinear function f of its weighted inputs


Inputs To Neurons

• Arise from other neurons or from outside the network

• Nodes whose inputs arise outside the network are called input nodes and simply copy values

• An input may excite or inhibit the response of the neuron to which it is applied, depending upon the weight of the connection


Weights

• Represent synaptic efficacy and may be excitatory or inhibitory

• Normally, positive weights are considered as excitatory while negative weights are thought of as inhibitory

• Learning is the process of modifying the weights in order to produce a network that performs some function


Output

• The response function is normally nonlinear
• Examples include
  – Sigmoid

    $f(x) = \frac{1}{1 + e^{-x}}$

  – Piecewise linear

    $f(x) = \begin{cases} x, & \text{if } x \ge \theta \\ 0, & \text{if } x < \theta \end{cases}$


Backpropagation Preparation

• Training Set
  A collection of input-output patterns that are used to train the network

• Testing Set
  A collection of input-output patterns that are used to assess network performance

• Learning Rate η
  A scalar parameter, analogous to step size in numerical integration, used to set the rate of adjustments


Network Error

• Total-Sum-Squared-Error (TSSE)

  $TSSE = \frac{1}{2} \sum_{patterns} \sum_{outputs} (desired - actual)^2$

• Root-Mean-Squared-Error (RMSE)

  $RMSE = \sqrt{\frac{2 \cdot TSSE}{\#patterns \cdot \#outputs}}$


A Pseudo-Code Algorithm

• Randomly choose the initial weights
• While error is too large
  – For each training pattern
    • Apply the inputs to the network
    • Calculate the output for every neuron from the input layer, through the hidden layer(s), to the output layer
    • Calculate the error at the outputs
    • Use the output error to compute error signals for pre-output layers
    • Use the error signals to compute weight adjustments
    • Apply the weight adjustments
  – Periodically evaluate the network performance


Apply Inputs From A Pattern

• Apply the value of each input parameter to each input node

• Input nodes compute only the identity function

[Figure: feedforward network, signals flow from inputs to outputs]


Calculate Outputs For Each Neuron Based On The Pattern

• The output from neuron j for pattern p is O_pj, where

  $O_{pj}(net_j) = \frac{1}{1 + e^{-net_j}}$

  and

  $net_j = bias \cdot W_{bias} + \sum_k O_{pk} W_{jk}$

  k ranges over the input indices and W_jk is the weight on the connection from input k to neuron j

[Figure: feedforward network, signals flow from inputs to outputs]


Calculate The Error Signal For Each Output Neuron

• The output neuron error signal δ_pj is given by $\delta_{pj} = (T_{pj} - O_{pj}) \, O_{pj} \, (1 - O_{pj})$

• Tpj is the target value of output neuron j for pattern p

• Opj is the actual output value of output neuron j for pattern p


Calculate The Error Signal For Each Hidden Neuron

• The hidden neuron error signal δ_pj is given by

  $\delta_{pj} = O_{pj} (1 - O_{pj}) \sum_k \delta_{pk} W_{kj}$

  where δ_pk is the error signal of a post-synaptic neuron k and W_kj is the weight of the connection from hidden neuron j to the post-synaptic neuron k


Calculate And Apply Weight Adjustments

• Compute weight adjustments ΔW_ji by

  $\Delta W_{ji} = \eta \, \delta_{pj} \, O_{pi}$

• Apply weight adjustments according to

  $W_{ji} = W_{ji} + \Delta W_{ji}$
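The backpropagation steps from the preceding slides fit into a short pure-Python sketch. Assumptions: a single hidden layer, sigmoid units throughout, per-pattern (online) updates, and the bias handled as an extra weight on a constant input of 1.0. The function names and the dict layout of the network are illustrative only.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def random_weights(n_prev, rng):
    """One neuron's weights; the last entry is the bias weight."""
    return [rng.uniform(-0.5, 0.5) for _ in range(n_prev + 1)]


def init_network(n_in, n_hidden, n_out, rng):
    return {
        "hidden": [random_weights(n_in, rng) for _ in range(n_hidden)],
        "output": [random_weights(n_hidden, rng) for _ in range(n_out)],
    }


def forward(net, inputs):
    """Return (hidden outputs, output-layer outputs) for one pattern."""
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs + [1.0])))
              for ws in net["hidden"]]
    output = [sigmoid(sum(w * h for w, h in zip(ws, hidden + [1.0])))
              for ws in net["output"]]
    return hidden, output


def train_pattern(net, inputs, targets, eta):
    """One backpropagation step for a single training pattern."""
    hidden, output = forward(net, inputs)
    # Output error signals: (T - O) * O * (1 - O)
    delta_out = [(t - o) * o * (1 - o) for t, o in zip(targets, output)]
    # Hidden error signals: O * (1 - O) * sum_k delta_k * W_kj,
    # computed before any weight is changed
    delta_hid = [
        h * (1 - h) * sum(d * net["output"][k][j] for k, d in enumerate(delta_out))
        for j, h in enumerate(hidden)
    ]
    # Weight adjustments: W_ji += eta * delta_j * O_i (bias input is 1.0)
    for k, d in enumerate(delta_out):
        for j, h in enumerate(hidden + [1.0]):
            net["output"][k][j] += eta * d * h
    for j, d in enumerate(delta_hid):
        for i, x in enumerate(inputs + [1.0]):
            net["hidden"][j][i] += eta * d * x


def tsse(net, patterns):
    """Total sum-squared error: 0.5 * sum over patterns and outputs."""
    return 0.5 * sum(
        (t - o) ** 2
        for inputs, targets in patterns
        for t, o in zip(targets, forward(net, inputs)[1])
    )
```

A training loop would call `train_pattern` for every pattern in each epoch and stop once `tsse` on the training set falls below a chosen threshold.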


Analysis of ANN Algorithm

• Advantages:
  – Produce good results in complex domains
  – Suitable for both discrete and continuous data (especially better for the continuous domain)
  – Testing is very fast

• Disadvantages:
  – Training is relatively slow
  – Learned results are more difficult for users to interpret than learned rules (compared with DT)
  – Empirical Risk Minimization (ERM) makes ANN try to minimize training error, which may lead to overfitting


Support Vector Machines

• Main idea of SVMs
  Find the linear separating hyperplane which maximizes the margin, i.e., the optimal separating hyperplane (OSH)

• Nonlinearly separable case
  Kernel function and Hilbert space

[Figure: a mapping f from input space X to feature space F; the points of the two classes are mapped to their images under f]


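SVM classification

Maximizing the margin is equivalent to:

$\min_{w, b, \zeta_i} \; \frac{1}{2} w^T w + C \sum_{i=1}^{N} \zeta_i$

subject to
$y_i (w^T x_i + b) - 1 + \zeta_i \ge 0, \quad 1 \le i \le N$
$\zeta_i \ge 0, \quad 1 \le i \le N$

Introducing Lagrange multipliers $\alpha_i$, $\mu_i$, the Lagrangian is:

$L(w, b, \zeta; \alpha, \mu) = \frac{1}{2} w^T w + C \sum_{i=1}^{N} \zeta_i - \sum_{i=1}^{N} \alpha_i [y_i (w^T x_i + b) - 1 + \zeta_i] - \sum_{i=1}^{N} \mu_i \zeta_i$

$= \frac{1}{2} w^T w + \sum_{i=1}^{N} (C - \alpha_i - \mu_i) \zeta_i - \left( \sum_{i=1}^{N} \alpha_i y_i x_i^T \right) w - b \sum_{i=1}^{N} \alpha_i y_i + \sum_{i=1}^{N} \alpha_i$

Dual problem:

$\max_{\alpha} \; L_D = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$

subject to: $0 \le \alpha_i \le C$, $\sum_i \alpha_i y_i = 0$.

The solution is given by:

$w = \sum_{i=1}^{N} \alpha_i y_i x_i$

The problem of classifying a new data point x is now simply solved by looking at the sign of $w \cdot x + b$, with $b = y_j - w \cdot x_j$.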
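Once the dual problem has been solved, classification needs only a few lines. The sketch below does not solve the quadratic program; it assumes the multipliers are already known and shows how w, b and the decision follow from them. The function names are illustrative, and `x_j`, `y_j` is assumed to be a training point lying exactly on the margin.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weight_vector(alphas, ys, xs):
    """w = sum_i alpha_i * y_i * x_i."""
    dim = len(xs[0])
    return [sum(a * y * x[d] for a, y, x in zip(alphas, ys, xs)) for d in range(dim)]


def bias(w, x_j, y_j):
    """b = y_j - w . x_j for a point (x_j, y_j) assumed to lie on the margin."""
    return y_j - dot(w, x_j)


def classify(w, b, x):
    """Sign of w . x + b."""
    return 1 if dot(w, x) + b >= 0 else -1
```

For the two-point toy problem x_1 = (1, 1), y_1 = +1 and x_2 = (-1, -1), y_2 = -1, the dual objective with both multipliers equal to a is 2a - 4a², which is maximal at a = 0.25; this gives w = (0.5, 0.5) and b = 0.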


Analysis of SVM Algorithm

• Advantages:
  – Compared with ANN, SVMs capture the inherent characteristics of the data better
  – Embed the Structural Risk Minimization (SRM) principle, which minimizes the upper bound on the generalization error (better than the Empirical Risk Minimization principle)
  – Ability to learn can be independent of the dimensionality of the feature space
  – Global minima vs. local minima

• Disadvantages:
  – Parameter tuning
  – Kernel selection


Voting Algorithm

Principle: using multiple evidence (multiple poor classifiers => single good classifier)

• Generate some base classifiers
• Combine them to make the final decision


Bagging Algorithm

• Use multiple versions of a training set D of size N, each created by resampling N examples from D with bootstrap

• Each of these data sets is used to train a base classifier; the final classification decision is made by the majority voting of these classifiers
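A minimal sketch of bagging with a pluggable base learner. The base learner used here, `train_nearest_mean`, is a hypothetical stand-in for 1-D data, chosen only to keep the example short; any function that maps a sample to a classifier would do.

```python
import random
from collections import Counter


def bagging_train(data, train_base, n_classifiers, rng):
    """Train each base classifier on a bootstrap sample the same size as data."""
    classifiers = []
    for _ in range(n_classifiers):
        sample = [rng.choice(data) for _ in range(len(data))]  # with replacement
        classifiers.append(train_base(sample))
    return classifiers


def bagging_predict(classifiers, x):
    """Majority vote over the base classifiers."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]


def train_nearest_mean(sample):
    """Hypothetical base learner for 1-D data: label of the nearest class mean."""
    sums, counts = Counter(), Counter()
    for x, y in sample:
        sums[y] += x
        counts[y] += 1
    means = {y: sums[y] / counts[y] for y in counts}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))
```

Because each bootstrap sample leaves out some examples and repeats others, the base classifiers differ, and the vote averages out their individual errors.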


Adaboost

• Main idea:
  - The main idea of this algorithm is to maintain a distribution or set of weights over the training set. Initially, all weights are set equally, but in each iteration the weights of incorrectly classified examples are increased so that the base classifier is forced to focus on the ‘hard’ examples in the training set. For correctly classified examples, the weights are decreased so that they are less important in the next iteration.

• Why ensembles can improve performance:
  - Uncorrelated errors made by the individual classifiers can be removed by voting.
  - Our hypothesis space H may not contain the true function f. Instead, H may include several equally good approximations to f. By taking weighted combinations of these approximations, we may be able to represent classifiers that lie outside of H.


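Adaboost algorithm

Given: m examples $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in Y = \{-1, +1\}$

Initialize $D_1(i) = 1/m$ for all i = 1…m

For t = 1,…,T:
  Train base classifier using distribution $D_t$.
  Get a hypothesis $h_t : X \to \{-1, +1\}$ with error
    $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \ne y_i] = \sum_{i : h_t(x_i) \ne y_i} D_t(i)$.
  Choose $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$.
  Update:
    $D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \ne y_i \end{cases} = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$
  where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ will be a distribution).

Output the final hypothesis:
  $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.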
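The algorithm can be sketched with decision stumps on 1-D data as the base classifier. The stump learner `best_stump` and the clamping of the error away from 0 and 1 are assumptions added to keep the example self-contained; they are not part of the algorithm as stated on the slide.

```python
import math


def best_stump(xs, ys, weights):
    """Find the (error, threshold, polarity) stump with lowest weighted error.

    The stump predicts `polarity` when x < threshold and `-polarity` otherwise.
    """
    points = sorted(set(xs))
    thresholds = [points[0] - 1.0] + [
        (a + b) / 2 for a, b in zip(points, points[1:])
    ]
    best = None
    for threshold in thresholds:
        for polarity in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if (polarity if x < threshold else -polarity) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    """Return a list of (alpha, threshold, polarity) weighted stumps."""
    m = len(xs)
    weights = [1.0 / m] * m  # D_1(i) = 1/m
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # D_{t+1}(i) = D_t(i) * exp(-alpha * y_i * h_t(x_i)) / Z_t
        weights = [
            w * math.exp(-alpha * y * (polarity if x < threshold else -polarity))
            for x, y, w in zip(xs, ys, weights)
        ]
        z = sum(weights)
        weights = [w / z for w in weights]
    return ensemble


def predict(ensemble, x):
    """H(x) = sign(sum_t alpha_t * h_t(x))."""
    score = sum(
        alpha * (polarity if x < threshold else -polarity)
        for alpha, threshold, polarity in ensemble
    )
    return 1 if score >= 0 else -1
```

On the points 0 to 5 with labels +1, +1, -1, -1, +1, +1 no single stump is correct everywhere, but three boosted stumps classify all six points correctly, which illustrates how a weighted combination can represent a classifier outside the base hypothesis space.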


Analysis of Voting Algorithms

• Advantages:
  – Surprisingly effective
  – Robust to noise
  – Decrease the overfitting effect

• Disadvantage:
  – Require more calculation and memory


Performance Measure

• Performance of algorithm:
  – Training time
  – Testing time
  – Classification accuracy
    • Precision, Recall
    • Micro-average / Macro-average
    • Breakeven: precision = recall

Goal: high classification quality and computation efficiency
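The difference between micro- and macro-averaging is easiest to see in code. This sketch assumes single-label classification; the helper names are illustrative.

```python
def per_class_counts(true_labels, predicted_labels, classes):
    """Return {class: (tp, fp, fn)} for single-label predictions."""
    pairs = list(zip(true_labels, predicted_labels))
    counts = {}
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        counts[c] = (tp, fp, fn)
    return counts


def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def macro_precision(counts):
    """Average the per-class precision values (every class weighs the same)."""
    return sum(precision(tp, fp) for tp, fp, _ in counts.values()) / len(counts)


def micro_precision(counts):
    """Pool the counts over all classes first (every decision weighs the same)."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    return precision(tp, fp)
```

With true labels s, s, s, s, a and predictions s, s, s, a, a, the per-class precisions are 1.0 for s and 0.5 for a, so the macro-average is 0.75, while pooling the counts gives a micro-average of 4/5 = 0.8. Macro-averaging is therefore more sensitive to performance on small classes.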


Comparison Based on Six Classifiers

• Classification accuracy: six classifiers (Reuters-21578 collection)

Author       Dumais     Joachims   Weiss      Yang
Training     9603       9603       9603       7789
Test         3299       3299       3299       3309
Topics       118        90         95         93
Indexing     Boolean    tfc        Frequency  ltc
Selection    MI         IG         -
Measure      Breakeven  Microavg.  Breakeven  Breakeven
Rocchio      61.7       79.9       78.7       75
NB           75.2       72         73.4       71
KNN          N/A        82.3       86.3       85
DT           N/A        79.4       78.9       79
SVM          87         86         86.3       N/A
Voting       N/A        N/A        87.8       N/A


Analysis of Results

• SVM, Voting and KNN showed good performance

• DT, NB and Rocchio showed relatively poor performance


Comparison Based on Feature Selection

• Classification accuracy: NB vs. KNN vs. SVM (Reuters collection)

# of features   NB              KNN             SVM
10              48.66 ± 0.10    57.31 ± 0.2     60.78 ± 0.17
20              52.28 ± 0.15    62.57 ± 0.16    73.67 ± 0.11
40              59.19 ± 0.15    68.39 ± 0.13    77.07 ± 0.14
50              60.32 ± 0.14    74.22 ± 0.11    79.02 ± 0.13
75              66.18 ± 0.19    76.41 ± 0.11    83.0 ± 0.10
100             77.9 ± 0.19     80.2 ± 0.09     84.3 ± 0.12
200             78.26 ± 0.15    82.5 ± 0.09     86.94 ± 0.11
500             80.80 ± 0.12    82.19 ± 0.08    86.59 ± 0.10
1000            80.88 ± 0.11    82.91 ± 0.07    86.31 ± 0.08
5000            79.26 ± 0.07    82.97 ± 0.06    86.57 ± 0.04


Analysis of Results

• Accuracy improves as the number of features increases, up to a certain level

• Top level = approximately 500-1000 features: accuracy reaches its peak and begins to decline

• SVM obtains the best performance


Comparison Based on Training Time (1)

• Training time: SVM vs. NB (# features = 100):

# documents   Training Time for SVM   Training Time for NB
9603          5                       8
19206         15                      25
28809         27                      60
38412         32                      120
48015         40                      340
57618         50                      410
67221         65                      498
76824         78                      600
86427         100                     630


Comparison Based on Training Time (2)

• Training time: SVM vs. NB (# of features increasing):

# features   Training Time for SVM   Training Time for NB
20           2.2                     3
50           3                       5
100          3.1                     11
200          3.3                     22
300          3.5                     27
500          4.1                     35


Analysis of Results

• Table 1:
  – Training time of SVM is less than that of NB w.r.t. the number of documents

• Table 2:
  – Training time of SVM increases slowly with the number of features
  – Training time of NB increases more quickly


Conclusion

• Different algorithms perform differently depending on data collections

• Some algorithms (e.g. Rocchio) do not perform well

• None of them appears to be globally superior to the others; however, SVM and Voting are good choices when all the factors are considered