Presenter : Yu-Ting LU Authors : Harun Uğuz 2011.KBS


Page 1

Intelligent Database Systems Lab

Presenter : YU-TING LU

Authors : Harun Uğuz

2011.KBS

A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm

Page 2

Outlines
• Motivation
• Objectives
• Methodology
• Experiments
• Conclusions
• Comments

Page 3

Motivation

• A major problem in text categorization is the large number of features.

• Most of these features are irrelevant noise that can mislead the classifier.

Page 4

Objectives

• A two-stage approach that combines feature selection and feature extraction is used to improve the performance of text categorization.

Page 5

Methodology

Page 6

Methodology – pre-processing
– Removal of stop-words: a, an, and, because, can, do, every, the, …
– Stemming: computer, computing, computation, computes → comput
– Term weighting
– Pruning of words: prune the words that appear fewer than two times in the documents.

[Figure: document-term matrix; one axis lists the terms of the document collection, the other the documents]
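The four pre-processing steps can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the stop-word list is just the examples from the slide, `toy_stem` is a hypothetical crude suffix stripper standing in for a real stemmer, tf-idf is assumed as the term weighting (the slide only says "term weighting"), and pruning is assumed to count occurrences over the whole collection.

```python
import math
import re
from collections import Counter

# Assumption: only the example stop-words shown on the slide.
STOP_WORDS = {"a", "an", "and", "because", "can", "do", "every", "the"}


def toy_stem(word):
    """Hypothetical crude suffix stripper; a real system would use a proper stemmer."""
    for suffix in ("ation", "ing", "es", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(docs, min_count=2):
    """Return (vocabulary, document-term rows) for a list of raw text documents."""
    tokenized = []
    for text in docs:
        tokens = re.findall(r"[a-z]+", text.lower())
        # Stop-word removal followed by stemming.
        tokenized.append([toy_stem(t) for t in tokens if t not in STOP_WORDS])

    # Pruning: keep terms that occur at least min_count times in the collection.
    totals = Counter(t for doc in tokenized for t in doc)
    vocab = sorted(t for t, n in totals.items() if n >= min_count)

    # Term weighting (assumed tf-idf): term frequency times log inverse document frequency.
    doc_freq = Counter(t for doc in tokenized for t in set(doc))
    n_docs = len(docs)
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        rows.append([tf[t] * math.log(n_docs / doc_freq[t]) for t in vocab])
    return vocab, rows


vocab, rows = preprocess([
    "The computer computes",
    "Computing and computation",
    "Every grain trade",
])
```

Each row of `rows` is one document and each column one surviving term, which matches the document-term matrix layout used in the Experiments slides.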

Page 7

Methodology – feature ranking with information gain
• Each term in the text is ranked in decreasing order of its importance for classification, using the IG method.
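A minimal sketch of IG ranking follows, using the standard text-categorization definition IG(t) = −Σ P(c) log P(c) + P(t) Σ P(c|t) log P(c|t) + P(¬t) Σ P(c|¬t) log P(c|¬t). The function names and the toy documents are illustrative assumptions; documents are represented simply as sets of terms.

```python
import math
from collections import Counter


def plogp(p):
    """p * log2(p), with the usual convention that 0 * log(0) = 0."""
    return p * math.log2(p) if p > 0 else 0.0


def information_gain(docs, labels, term):
    """IG of `term`, where docs is a list of term sets and labels the class of each doc."""
    n = len(docs)
    with_term = [c for d, c in zip(docs, labels) if term in d]
    without_term = [c for d, c in zip(docs, labels) if term not in d]

    # Class entropy before observing the term.
    ig = -sum(plogp(k / n) for k in Counter(labels).values())
    # Add the weighted (negative) conditional entropies for term present / absent.
    for subset in (with_term, without_term):
        if subset:
            weight = len(subset) / n
            ig += weight * sum(plogp(k / len(subset)) for k in Counter(subset).values())
    return ig


def rank_by_ig(docs, labels):
    """Terms sorted by decreasing IG (ties broken alphabetically)."""
    vocab = set().union(*docs)
    return sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))


# Toy data (hypothetical): "profit" and "oil" separate the classes, "bank" and "share" do not.
docs = [{"profit", "share"}, {"profit", "bank"}, {"oil", "bank"}, {"oil", "share"}]
labels = ["earn", "earn", "crude", "crude"]
ranking = rank_by_ig(docs, labels)
```

In a two-stage setup, only the top-ranked terms from such a list would be passed on to the second reduction step.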

Page 8

Methodology – dimension reduction methods
• Principal component analysis
• Genetic algorithm for feature selection
– Individual's encoding: 11011001100111011110
– Fitness function
– Selection
– Crossover
– Mutation (slide annotation partly lost in extraction: "p m ≦")
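As a rough illustration of principal component analysis, the sketch below centers the data, builds the sample covariance matrix, and finds only the leading principal component by power iteration. Power iteration and the single-component limit are simplifications chosen here to keep the example dependency-free; a practical implementation would use a linear algebra library and keep several components.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, projected scores) for the leading principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    # Power iteration: repeatedly multiply by the covariance matrix and renormalize.
    # Caveat: a start vector exactly orthogonal to the leading direction would not converge to it.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]

    # Project each centered document onto the component to get its new 1-D feature.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Toy data (hypothetical): the second feature is exactly twice the first.
direction, scores = first_principal_component([[1, 2], [2, 4], [3, 6], [4, 8]])
```

For this toy input the two correlated features collapse onto a single direction proportional to (1, 2), which is the sense in which PCA extracts fewer, combined features.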
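The genetic algorithm components listed above (bit-string encoding, fitness, selection, crossover, mutation) can be sketched as follows. All parameter values, the roulette-wheel selection scheme, the single-point crossover, and especially `toy_fitness` are assumptions made for illustration; in a real feature-selection run the fitness would come from evaluating a classifier on the selected features.

```python
import random


def genetic_feature_selection(fitness, n_features, pop_size=20, generations=30,
                              crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Evolve bit strings (1 = feature kept). `fitness` must return a non-negative number."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        # Selection: roulette wheel; the small offset keeps the total weight positive.
        weights = [fitness(ind) + 1e-9 for ind in pop]
        next_pop = []
        while len(next_pop) < pop_size:
            p1, p2 = rng.choices(pop, weights=weights, k=2)
            c1, c2 = p1[:], p2[:]
            # Crossover: swap tails at a random cut point.
            if rng.random() <= crossover_rate:
                cut = rng.randint(1, n_features - 1)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            # Mutation: flip each bit with a small probability.
            for child in (c1, c2):
                for i in range(n_features):
                    if rng.random() <= mutation_rate:
                        child[i] = 1 - child[i]
                next_pop.append(child)
        pop = next_pop[:pop_size]
        best = max(pop + [best], key=fitness)  # remember the best individual seen so far
    return best


# Hypothetical stand-in fitness: reward three "relevant" features, penalize extra ones.
RELEVANT = {0, 3, 5}


def toy_fitness(individual):
    hits = sum(individual[i] for i in RELEVANT)
    extras = sum(individual) - hits
    return hits / (1 + extras)


best = genetic_feature_selection(toy_fitness, n_features=8)
```

The returned bit string has the same form as the individual shown on the slide: position i set to 1 means feature i is kept.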

Page 9

Methodology – text categorization methods
• KNN classifier
• C4.5 decision tree classifier
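A minimal sketch of the KNN classifier is given below; C4.5 is not sketched here. Cosine similarity and k = 3 are assumptions, since the slide does not state the similarity measure or the neighborhood size, and the vectors are toy data.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label of `query` by majority vote among its k most similar training documents."""
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: cosine(pair[0], query),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy document vectors (hypothetical) over two features.
train_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = ["earn", "earn", "crude", "crude"]
prediction = knn_predict(train_vectors, train_labels, [1.0, 0.2])
```

Because every prediction compares the query against all training documents, the cost of KNN grows with the number of features, which is one reason feature reduction matters for it.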

Page 10

Methodology – evaluation of the performance
• Precision
• Recall
• F-measure
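The three measures can be computed per category from true and predicted labels. The sketch below uses the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F-measure as their harmonic mean); the function name and the sample labels are illustrative assumptions.

```python
def precision_recall_f(true_labels, predicted_labels, category):
    """Precision, recall and F-measure for one category."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure


# Toy labels (hypothetical): for "earn" there are 2 true positives,
# 1 false positive and 1 false negative.
true_labels = ["earn", "earn", "earn", "crude", "crude"]
predicted = ["earn", "earn", "crude", "earn", "crude"]
precision, recall, f_measure = precision_recall_f(true_labels, predicted, "earn")
```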

Page 11

Experiments – datasets
– Reuters-21578 dataset

Category name    Number of documents
Earn             3743
Acquisition      2179
Money-fx         633
Crude            561
Grain            542
Trade            500

– Classic3 dataset

Category name    Number of documents
CRANFIELD        1398
MEDLINE          1033
CISI             1460

Page 12

Experiments – Reuters-21578
• At the end of pre-processing, a document-term matrix of dimension 8158 × 7542 is obtained.

Page 13

Experiments – Reuters-21578

Page 14

Experiments – Classic3
• At the end of pre-processing, a document-term matrix of dimension 3891 × 6679 is obtained.
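As a quick consistency check, the first dimension of each document-term matrix equals the total number of documents in the category tables of the datasets slide, so rows are documents and columns are terms. The variable names below are illustrative.

```python
# Per-category document counts, copied from the datasets slide.
reuters = {"Earn": 3743, "Acquisition": 2179, "Money-fx": 633,
           "Crude": 561, "Grain": 542, "Trade": 500}
classic3 = {"CRANFIELD": 1398, "MEDLINE": 1033, "CISI": 1460}

reuters_docs = sum(reuters.values())    # 8158, the row count of the 8158 × 7542 matrix
classic3_docs = sum(classic3.values())  # 3891, the row count of the 3891 × 6679 matrix
```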

Page 15

Experiments – Classic3

Page 16

Conclusions

• Text categorization with the C4.5 decision tree and KNN algorithms is more successful when it uses the smaller feature sets selected via IG-PCA and IG-GA than when it uses features selected via IG alone.

• Two-stage feature selection methods can improve the performance of text categorization.

Page 17

Comments
• Advantages
– Helps in understanding the basic methods
• Applications
– Text categorization