Extracting Key-Substring-Group Features for Text Classification

Dell Zhang and Wee Sun Lee

KDD2006

The Context

Text Classification via Machine Learning (ML)

L (training documents) → Learning → Classifier → Predicting → U (test documents)

Text Data

to_be_or_not_to_be…

To be, or not to be

…

tobeor

betonot

…

Some Applications

Non-Topical Text Classification

Text Genre Classification: Paper? Poem? Prose?

Text Authorship Classification: Washington? Adams? Jefferson?

How to exploit sub-word/super-word information?

Some Applications

Asian-Language Text Classification

How to avoid the problem of word-segmentation?

Some Applications

Spam Filtering

How to handle non-alphabetical characters etc.?

(Pampapathi et al., 2006)

Some Applications

Desktop Text Classification

How to deal with different types of files?

Learning Algorithms

Generative: Naïve Bayes, Rocchio, …

Discriminative: Support Vector Machine (SVM), AdaBoost, …

For word-based text classification, discriminative methods are often superior to generative methods.

How about string-based text classification?

String-Based Text Classification

Generative: Markov Chain Models (char-level)

fixed order: n-gram, …

variable order: PST, PPM, …

Discriminative: SVM with string kernel (= taking all substrings as features implicitly through the “kernel trick”)

limitations: (1) ridge problem; (2) feature redundancy; (3) feature selection/weighting and advanced kernels.

[Table: a 2×2 grid of generative / discriminative against word-based / string-based, with one cell marked “?”]

The Problem

The Difficulty

The number of substrings: O(n²)

5 + 9 = 14 characters

15 + 45 = 60 substrings

d1: to_be

d2: not_to_be
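The counts on this slide can be checked with a few lines of Python. This is a minimal sketch that counts every (start, end) pair as one substring occurrence, which is the reading that gives 15 + 45 = 60; the function name `substring_occurrences` is illustrative only.

```python
def substring_occurrences(text: str) -> int:
    """Number of (start, end) substring occurrences in a string of length n: n(n+1)/2."""
    n = len(text)
    return n * (n + 1) // 2


docs = {"d1": "to_be", "d2": "not_to_be"}

total_chars = sum(len(d) for d in docs.values())
total_substrings = sum(substring_occurrences(d) for d in docs.values())

print(total_chars)       # 14
print(total_substrings)  # 60
```

The quadratic growth is the point: doubling the corpus length roughly quadruples the number of candidate substring features.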

Our Idea

The substrings could be partitioned into statistical equivalence groups

{to, to_, to_b, to_be}

d1: to_be

d2: not_to_be

{ot, ot_, ot_t, ot_to, ot_to_, ot_to_b, ot_to_be}

……

Suffix Tree

[Figure: the generalized suffix tree of d1 (to_be) and d2 (not_to_be), with each leaf labelled by its document.]

a suffix tree node = a substring group

Substring-Groups

The substrings in an equivalence group have exactly identical distribution over the corpus, therefore such a substring-group could be taken in whole as a single feature to be used by a statistical machine learning algorithm for text classification.
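The grouping can be illustrated without a suffix tree. The sketch below is a brute-force stand-in, not the talk's algorithm: it assumes that two substrings belong to the same group exactly when they start at the same set of (document, offset) positions, which is what sharing a suffix tree node amounts to. The function name `substring_groups` is hypothetical.

```python
from collections import defaultdict


def substring_groups(docs):
    """Group substrings by the set of (document, start offset) positions where they occur."""
    positions = defaultdict(set)
    for name, text in docs.items():
        for i in range(len(text)):
            for j in range(i + 1, len(text) + 1):
                positions[text[i:j]].add((name, i))

    # Substrings with an identical position set form one equivalence group.
    groups = defaultdict(list)
    for sub, occ in positions.items():
        groups[frozenset(occ)].append(sub)
    return [sorted(g, key=len) for g in groups.values()]


docs = {"d1": "to_be", "d2": "not_to_be"}
groups = substring_groups(docs)

print([g for g in groups if "to" in g])  # [['to', 'to_', 'to_b', 'to_be']]
```

Every member of a group occurs the same number of times in each document, so one feature per group loses no information for a bag-of-substrings model.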

Substring-Groups

The number of substring-groups: O(n)

n trivial substring-groups: leaf nodes; frequency = 1; not so useful to learning

at most n-1 non-trivial substring-groups: internal (non-root) nodes; frequency > 1; to be selected as features

Key-Substring-Groups

Select the key (salient) substring-groups by

-l the minimum frequency: freq(SG_v)

-h the maximum frequency: freq(SG_v)

-b the minimum number of branches: children_num(v)

-p the maximum parent-child conditional probability: freq(SG_v) / freq(SG_p(v))

-q the maximum suffix-link conditional probability: freq(SG_v) / freq(SG_s(v))
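The five criteria can be read as one predicate per suffix tree node. The sketch below is an assumption-laden illustration: `NodeStats` and `is_key` are hypothetical names, the statistics are supplied by hand rather than read from a real tree, and treating every bound as inclusive is a guess the slide does not settle.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    """Hypothetical per-node statistics (field names are illustrative)."""
    freq: int          # freq(SG_v)
    children_num: int  # children_num(v)
    parent_freq: int   # freq(SG_p(v))
    suffix_freq: int   # freq(SG_s(v))


def is_key(v: NodeStats, l: int, h: int, b: int, p: float, q: float) -> bool:
    """Apply the -l, -h, -b, -p and -q thresholds (bounds assumed inclusive)."""
    return (
        l <= v.freq <= h                      # -l / -h: frequency window
        and v.children_num >= b               # -b: enough branches
        and v.freq / v.parent_freq <= p       # -p: not implied by the parent
        and v.freq / v.suffix_freq <= q       # -q: not implied by the suffix-link node
    )


params = dict(l=80, h=8000, b=8, p=0.8, q=0.8)
kept = is_key(NodeStats(freq=100, children_num=10, parent_freq=400, suffix_freq=200), **params)
dropped = is_key(NodeStats(freq=100, children_num=10, parent_freq=110, suffix_freq=200), **params)
print(kept, dropped)  # True False
```

The second node is dropped because 100 / 110 exceeds 0.8: it occurs almost every time its parent does, so it adds little beyond the parent group.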

Suffix Link

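The suffix link of a node v with path label “c1 c2 … ck” points to the node s(v) with path label “c2 … ck”.

Following suffix links repeatedly leads from v through s(v), s(s(v)), … to the root.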

Feature Extraction Algorithm

Input: a set of documents; the parameters

Output: the key-substring-groups for each document

Time Complexity: O(n)

Trick: make use of suffix links to traverse the tree

Feature Extraction Algorithm

construct the (generalized) suffix tree T using Ukkonen’s algorithm;

count frequencies recursively;

select features recursively;

accumulate features recursively;

for each document d {

match d to T and get to the node v;

while v is not the root {

output the features associated with v;

move v to the next node via the suffix link of v;

}

}
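The loop above can be mimicked by a brute-force stand-in that is easier to follow than a suffix tree, at the cost of losing the linear running time. The sketch below is one reading of the pseudocode, not the authors' implementation: it replaces the tree with a table of start positions, uses only a minimum-frequency filter in place of the full selection step, and names each group by its longest member. `extract_features` and `min_freq` are hypothetical names.

```python
from collections import Counter, defaultdict


def extract_features(docs, min_freq=2):
    # Naive stand-in for building the tree and counting frequencies:
    # record the (document, offset) positions where each substring starts.
    positions = defaultdict(set)
    for name, text in docs.items():
        for i in range(len(text)):
            for j in range(i + 1, len(text) + 1):
                positions[text[i:j]].add((name, i))

    # Simplified selection: keep groups occurring at least min_freq times,
    # and name each kept group by its longest member.
    group_name = {}
    for sub, occ in positions.items():
        if len(occ) >= min_freq:
            key = frozenset(occ)
            if len(sub) > len(group_name.get(key, "")):
                group_name[key] = sub

    # Output: visiting every suffix of d plays the role of the suffix-link walk;
    # scanning a suffix's prefixes plays the role of the accumulated features.
    features = {}
    for name, text in docs.items():
        counts = Counter()
        for i in range(len(text)):
            seen = set()  # count each group once per suffix
            for j in range(i + 1, len(text) + 1):
                key = frozenset(positions[text[i:j]])
                if key in group_name and key not in seen:
                    seen.add(key)
                    counts[group_name[key]] += 1
        features[name] = counts
    return features


features = extract_features({"d1": "to_be", "d2": "not_to_be"})
print(features["d2"]["t"])      # 2
print(features["d2"]["to_be"])  # 1
```

This version is at least quadratic in the corpus length; the suffix tree with suffix links is what brings the real algorithm down to O(n).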

Experiments

Parameter Tuning: the number of features; the cross-validation performance

Feature Weighting: TF×IDF (with L2 normalization)

Learning Algorithm: LibSVM, linear kernel
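The feature weighting step can be sketched in plain Python. The slide does not say which TF and IDF variants were used, so raw term frequency and log(N / df) are assumptions here, and `tfidf_l2` is a hypothetical helper name.

```python
import math


def tfidf_l2(doc_counts):
    """doc_counts: one {feature: term frequency} dict per document."""
    n_docs = len(doc_counts)

    # Document frequency: in how many documents each feature appears.
    df = {}
    for counts in doc_counts:
        for f in counts:
            df[f] = df.get(f, 0) + 1

    vectors = []
    for counts in doc_counts:
        # Assumed variant: raw tf times log(N / df).
        w = {f: tf * math.log(n_docs / df[f]) for f, tf in counts.items()}
        norm = math.sqrt(sum(x * x for x in w.values()))
        # L2 normalization: scale each document vector to unit length.
        vectors.append({f: x / norm for f, x in w.items()} if norm > 0 else w)
    return vectors


vecs = tfidf_l2([{"to_be": 1, "t": 1}, {"t": 2, "not": 1}, {"or": 3}])
for v in vecs:
    print(round(sum(x * x for x in v.values()), 6))  # 1.0 for each document
```

Unit-length vectors keep long documents from dominating the linear kernel, since the kernel value then depends only on the direction of each vector.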

English Text Topic Classification

Dataset: Reuters-21578 Top10 (ApteMod), the home-ground of word-based text classification

Classes: (1) earn; (2) acq; (3) money-fx; (4) grain; (5) crude; (6) trade; (7) interest; (8) ship; (9) wheat; (10) corn.

Parameters: -l 80 -h 8000 -b 8 -p 0.8 -q 0.8

Features: 9×10^13 → 6,055 (extracted in < 30 seconds)

English Text Topic Classification

The distribution of substring-groups ~ Zipf’s law (power law)


The performance of linear kernel SVM with key-substring-group features on the Reuters-21578 top10 dataset.


Comparing the experimental results of our proposed approach and some representative existing approaches.


The influence of feature extraction parameters on the number of features and the text classification performance.

Chinese Text Topic Classification

Dataset: TREC-5 People’s Daily News

Classes: (1) Politics, Law and Society; (2) Literature and Arts; (3) Education, Science and Culture; (4) Sports; (5) Theory and Academy; (6) Economics.

Parameters: -l 20 -h 8000 -b 8 -p 0.8 -q 0.8

Chinese Text Topic Classification

Performance (miF)

SVM + word segmentation: 82.0% (He et al., 2000; He et al., 2003)

char-level n-gram language model: 86.7% (Peng et al., 2004)

SVM with key-substring-group features: 87.3%
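Assuming miF stands for micro-averaged F1 (the slide does not expand the abbreviation), the measure pools true positives, false positives and false negatives over all documents and classes before computing precision and recall. The sketch below uses a hypothetical `micro_f1` helper and made-up labels, not data from the experiments.

```python
def micro_f1(true_labels, pred_labels):
    """true_labels / pred_labels: one set of class labels per document."""
    tp = fp = fn = 0
    for t, p in zip(true_labels, pred_labels):
        tp += len(t & p)   # labels predicted and correct
        fp += len(p - t)   # labels predicted but wrong
        fn += len(t - p)   # labels missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only.
truth = [{"a"}, {"b", "c"}, {"d"}]
preds = [{"a"}, {"b"}, {"e"}]
score = micro_f1(truth, preds)
print(round(score, 4))  # 0.5714
```

When every document has exactly one true label and one predicted label, this value coincides with plain accuracy.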

Greek Text Authorship Classification

Dataset: (Stamatatos et al., 2000)

Classes: (1) S. Alaxiotis; (2) G. Babiniotis; (3) G. Dertilis; (4) C. Kiosse; (5) A. Liakos; (6) D. Maronitis; (7) M. Ploritis; (8) T. Tasios; (9) K. Tsoukalas; (10) G. Vokos.

Greek Text Authorship Classification

Performance (accuracy)

deep natural language processing: 72% (Stamatatos et al., 2000)

char-level n-gram language model: 90% (Peng et al., 2004)

SVM with key-substring-group features: 92%

Greek Text Genre Classification

Dataset: (Stamatatos et al., 2000)

Classes: (1) press editorial; (2) press reportage; (3) academic prose; (4) official documents; (5) literature; (6) recipes; (7) curriculum vitae; (8) interviews; (9) planned speeches; (10) broadcast news.

Greek Text Genre Classification

Performance (accuracy)

deep natural language processing: 82% (Stamatatos et al., 2000)

char-level n-gram language model: 86% (Peng et al., 2004)

SVM with key-substring-group features: 94%

Conclusion

We propose the concept of key-substring-group features and a linear-time (suffix tree based) algorithm to extract them.

We show that our method works well for some text classification tasks.

Clustering etc.? Gene/protein sequence data?