Extracting Key-Substring-Group Features for Text Classification

Dell Zhang and Wee Sun Lee

KDD2006

The Context

Text Classification via Machine Learning (ML)

[Diagram: Training Documents (L) → Learning → Classifier → Predicting → Test Documents (U)]

Text Data

to_be_or_not_to_be…

To be, or not to be

tobeor

betonot

Some Applications

Non-Topical Text Classification

Text Genre Classification: Paper? Poem? Prose?

Text Authorship Classification: Washington? Adams? Jefferson?

How to exploit sub-word/super-word information?

Some Applications

Asian-Language Text Classification

How to avoid the problem of word-segmentation?

Some Applications

Spam Filtering

How to handle non-alphabetical characters etc.?

(Pampapathi et al., 2006)

Some Applications

Desktop Text Classification

How to deal with different types of files?

Learning Algorithms

Generative: Naïve Bayes, Rocchio, …

Discriminative: Support Vector Machine (SVM), AdaBoost, …

For word-based text classification, discriminative methods are often superior to generative methods.

How about string-based text classification?

String-Based Text Classification

Generative: Markov Chain Models (char-level)

fixed order: n-gram, …

variable order: PST, PPM, …

Discriminative: SVM with string kernel (= taking all substrings as features implicitly through the “kernel trick”)

limitations: (1) ridge problem; (2) feature redundancy; (3) feature selection/weighting and advanced kernels.

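The Problem

[Diagram: a 2×2 grid of learning approaches (generative vs. discriminative) against text representations (word-based vs. string-based), with one cell marked “?”]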

The Difficulty

The number of substrings: O(n^2)

5 + 9 = 14 characters

15 + 45 = 60 substrings

d1: to_be

d2: not_to_be
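The counts on this slide can be reproduced with a few lines of Python. This is only an illustrative check: the helper name `substring_count` is an assumption, and it counts substring occurrences (every start/end pair), which is what the figures 15 and 45 correspond to.

```python
def substring_count(text):
    """Number of non-empty substring occurrences in text: n * (n + 1) / 2."""
    n = len(text)
    return n * (n + 1) // 2

docs = ["to_be", "not_to_be"]

# Brute-force enumeration of every (start, end) pair, to confirm the formula.
enumerated = [
    len([d[i:j] for i in range(len(d)) for j in range(i + 1, len(d) + 1)])
    for d in docs
]

print(sum(len(d) for d in docs))           # 14 characters
print([substring_count(d) for d in docs])  # [15, 45]
print(sum(enumerated))                     # 60 substrings
```

The quadratic growth is the point: doubling the corpus length roughly quadruples the number of candidate substring features.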

Our Idea

The substrings could be partitioned into statistical equivalence groups

d1: to_be

d2: not_to_be

{to, to_, to_b, to_be}

{ot, ot_, ot_t, ot_to, ot_to_, ot_to_b, ot_to_be}

……
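A brute-force way to see these groups, sketched under the assumption that two substrings are equivalent when they start at exactly the same set of (document, position) pairs. The function name `substring_groups` is hypothetical, and this quadratic enumeration is for illustration only; it is not the suffix tree construction the slides go on to use.

```python
from collections import defaultdict

def substring_groups(docs):
    """Group substrings that share the same set of (doc index, start) occurrences."""
    starts = defaultdict(set)
    for d, text in enumerate(docs):
        for i in range(len(text)):
            for j in range(i + 1, len(text) + 1):
                starts[text[i:j]].add((d, i))
    groups = defaultdict(list)
    for s, occ in starts.items():
        groups[frozenset(occ)].append(s)
    # Members of one group are prefixes of each other, so sort them by length.
    return [sorted(g, key=len) for g in groups.values()]

groups = substring_groups(["to_be", "not_to_be"])
by_member = {s: g for g in groups for s in g}

print(by_member["to"])       # ['to', 'to_', 'to_b', 'to_be']
print(len(by_member["ot"]))  # 7
```

Every member of a group occurs in the same places, so keeping one feature per group loses no distributional information.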

Suffix Tree

[Figure: the generalized suffix tree of d1 (to_be) and d2 (not_to_be); each leaf is labeled with its document, d1 or d2]

a suffix tree node = a substring group

Substring-Groups

The substrings in an equivalence group have exactly the same distribution over the corpus, so such a substring-group can be taken as a whole as a single feature for a statistical machine learning algorithm for text classification.

Substring-Groups

The number of substring-groups: O(n)

n trivial substring-groups: leaf nodes; frequency = 1; not so useful to learning

at most n-1 non-trivial substring-groups: internal (non-root) nodes; frequency > 1; to be selected as features

Key-Substring-Groups

Select the key (salient) substring-groups by

-l the minimum frequency: freq(SG_v)

-h the maximum frequency: freq(SG_v)

-b the minimum number of branches: children_num(v)

-p the maximum parent-child conditional probability: freq(SG_v) / freq(SG_p(v))

-q the maximum suffix-link conditional probability: freq(SG_v) / freq(SG_s(v))
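The five criteria can be read as one predicate over a suffix tree node. The sketch below is a hypothetical rendering: the `Node` fields and the function `is_key` are illustrative names, and treating every threshold as inclusive is an assumption, since the slide does not say whether the bounds are strict.

```python
from dataclasses import dataclass

@dataclass
class Node:
    freq: int              # freq(SG_v)
    children_num: int      # children_num(v)
    parent_freq: int       # freq(SG_p(v)), frequency at the parent node
    suffix_link_freq: int  # freq(SG_s(v)), frequency at the suffix-link node

def is_key(node, l, h, b, p, q):
    """True if the node's substring-group passes all five filters (bounds assumed inclusive)."""
    return (
        l <= node.freq <= h
        and node.children_num >= b
        and node.freq / node.parent_freq <= p
        and node.freq / node.suffix_link_freq <= q
    )

params = dict(l=80, h=8000, b=8, p=0.8, q=0.8)
print(is_key(Node(100, 10, 200, 150), **params))  # True
print(is_key(Node(100, 10, 110, 150), **params))  # False: 100/110 exceeds p
```

The two conditional-probability filters drop a group when it is almost fully predictable from its parent or from its suffix-link node, which is one way to reduce redundant features.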

Suffix Link

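The suffix link of a node v representing “c1 c2 … ck” points to the node s(v) representing “c2 … ck”.

[Diagram: nodes v and s(v) connected by a suffix link, and the root]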

Feature Extraction Algorithm

Input: a set of documents; the parameters

Output: the key-substring-groups for each document

Time Complexity: O(n)

Trick: make use of suffix links to traverse the tree

Feature Extraction Algorithm

construct the (generalized) suffix tree T using Ukkonen’s algorithm;
count frequencies recursively;
select features recursively;
accumulate features recursively;
for each document d {
    match d to T and get to the node v;
    while v is not the root {
        output the features associated with v;
        move v to the next node via the suffix link of v;
    }
}
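The per-document loop can be mimicked without a real suffix tree, which may help in seeing why following suffix links visits every substring. In this sketch plain strings stand in for tree nodes, the empty string stands in for the root, and `key_labels` is a hypothetical set holding the longest member of each selected group. It runs in quadratic time; the linear-time behaviour claimed on the slides depends on the features being accumulated on the tree nodes beforehand, which this toy version does not do.

```python
from collections import Counter

def extract_features(doc, key_labels):
    """Count occurrences of each key label in doc by walking 'suffix links'."""
    def accumulated(node):
        # Features accumulated at a node: every key label that is a prefix of it.
        return [node[:k] for k in range(1, len(node) + 1) if node[:k] in key_labels]

    counts = Counter()
    v = doc                 # "match d to T and get to the node v"
    while v != "":          # the empty string plays the role of the root
        counts.update(accumulated(v))
        v = v[1:]           # suffix link: "c1 c2 ... ck" -> "c2 ... ck"
    return counts

print(extract_features("not_to_be", {"to_be", "be", "t"}))
```

Each substring of a document is a prefix of one of its suffixes, so visiting the suffixes in order and reading off prefix features covers every occurrence exactly once.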

Experiments

Parameter Tuning: the number of features; the cross-validation performance

Feature Weighting: TFxIDF (with l2 normalization)

Learning Algorithm: LibSVM, linear kernel
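A minimal sketch of TFxIDF weighting with l2 normalization. The slide does not state which TF or IDF variant was used, so raw term frequency and idf = log(N / df) are assumptions here, as is the function name `tfidf_l2`.

```python
import math
from collections import Counter

def tfidf_l2(docs_features):
    """docs_features: one list of feature ids per document. Returns unit-length weight dicts."""
    n_docs = len(docs_features)
    df = Counter()
    for feats in docs_features:
        df.update(set(feats))  # document frequency: count each feature once per document
    vectors = []
    for feats in docs_features:
        tf = Counter(feats)
        vec = {f: tf[f] * math.log(n_docs / df[f]) for f in tf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Leave an all-zero vector untouched to avoid dividing by zero.
        vectors.append({f: w / norm for f, w in vec.items()} if norm > 0 else vec)
    return vectors

vectors = tfidf_l2([["a", "a", "b"], ["b", "c"]])
print(vectors[0])
```

After normalization every non-zero document vector has Euclidean length 1, so long and short documents contribute on the same scale to a linear kernel.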

English Text Topic Classification

Dataset: Reuters-21578 Top10 (ApteMod); the home-ground of word-based text classification

Classes: (1) earn; (2) acq; (3) money-fx; (4) grain; (5) crude; (6) trade; (7) interest; (8) ship; (9) wheat; (10) corn.

Parameters: -l 80 -h 8000 -b 8 -p 0.8 -q 0.8

Features: 9*10^13 → 6,055 (extracted in < 30 seconds)

English Text Topic Classification

The distribution of substring-groups ~ Zipf’s law (power law)

English Text Topic Classification

The performance of linear kernel SVM with key-substring-group features on the Reuters-21578 top10 dataset.

English Text Topic Classification

Comparing the experimental results of our proposed approach and some representative existing approaches.

English Text Topic Classification

The influence of the feature extraction parameters on the number of features and the text classification performance.

Chinese Text Topic Classification

Dataset: TREC-5 People’s Daily News

Classes: (1) Politics, Law and Society; (2) Literature and Arts; (3) Education, Science and Culture; (4) Sports; (5) Theory and Academy; (6) Economics.

Parameters: -l 20 -h 8000 -b 8 -p 0.8 -q 0.8

Chinese Text Topic Classification

Performance (miF)

SVM + word segmentation: 82.0% (He et al., 2000; He et al., 2003)

char-level n-gram language model: 86.7% (Peng et al., 2004)

SVM with key-substring-group features: 87.3%
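The slide does not expand “miF”; assuming it denotes the micro-averaged F1 measure, it pools true positives, false positives and false negatives over all documents and classes before computing precision and recall. A small sketch, with the hypothetical helper `micro_f1` and made-up toy labels:

```python
def micro_f1(gold, pred):
    """gold, pred: lists of label sets, one per document. Returns micro-averaged F1."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but wrong
        fn += len(g - p)   # labels missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels for illustration only; not taken from any of the datasets above.
gold = [{"earn"}, {"acq"}, {"grain", "wheat"}]
pred = [{"earn"}, {"earn"}, {"grain"}]
print(round(micro_f1(gold, pred), 4))  # 0.5714
```

Because the counts are pooled, frequent classes weigh more heavily in this measure than rare ones.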

Greek Text Authorship Classification

Dataset: (Stamatatos et al., 2000)

Classes: (1) S. Alaxiotis; (2) G. Babiniotis; (3) G. Dertilis; (4) C. Kiosse; (5) A. Liakos; (6) D. Maronitis; (7) M. Ploritis; (8) T. Tasios; (9) K. Tsoukalas; (10) G. Vokos.

Greek Text Authorship Classification

Performance (accuracy)

deep natural language processing: 72% (Stamatatos et al., 2000)

char-level n-gram language model: 90% (Peng et al., 2004)

SVM with key-substring-group features: 92%

Greek Text Genre Classification

Dataset: (Stamatatos et al., 2000)

Classes: (1) press editorial; (2) press reportage; (3) academic prose; (4) official documents; (5) literature; (6) recipes; (7) curriculum vitae; (8) interviews; (9) planned speeches; (10) broadcast news.

Greek Text Genre Classification

Performance (accuracy)

deep natural language processing: 82% (Stamatatos et al., 2000)

char-level n-gram language model: 86% (Peng et al., 2004)

SVM with key-substring-group features: 94%

Conclusion

We propose the concept of key-substring-group features and a linear-time (suffix tree based) algorithm to extract them.

We show that our method works well for some text classification tasks.

clustering etc.?

gene/protein sequence data?
