Extracting Key-Substring-Group Features for Text Classification


Extracting Key-Substring-Group Features for Text Classification

Dell Zhang and Wee Sun Lee

KDD 2006

The Context

Text Classification via Machine Learning (ML)

[Diagram: labeled training documents (L) → Learning → Classifier → Predicting → unlabeled test documents (U)]

Text Data

[Example: the document "To be, or not to be" viewed as the character string "to_be_or_not_to_be…" rather than as the word tokens "to", "be", "or", "not"]

Some Applications

Non-Topical Text Classification

Text Genre Classification: Paper? Poem? Prose?

Text Authorship Classification: Washington? Adams? Jefferson?

How to exploit sub-word/super-word information?

Some Applications

Asian-Language Text Classification

How to avoid the problem of word-segmentation?

Some Applications

Spam Filtering

How to handle non-alphabetic characters, etc.?

(Pampapathi et al., 2006)

Some Applications

Desktop Text Classification

How to deal with different types of files?

Learning Algorithms

Generative: Naïve Bayes, Rocchio, …

Discriminative: Support Vector Machine (SVM), AdaBoost, …

For word-based text classification, discriminative methods are often superior to generative methods.

How about string-based text classification?

String-Based Text Classification

Generative: Markov chain models (character-level)
fixed order: n-gram, …
variable order: PST, PPM, …

Discriminative: SVM with a string kernel (= taking all substrings as features implicitly through the "kernel trick")
Limitations: (1) the ridge problem; (2) feature redundancy; (3) feature selection/weighting and advanced kernels are hard to apply.

[Diagram: a 2×2 grid of generative vs. discriminative methods × word-based vs. string-based features; the discriminative, string-based cell is marked "?"]

The Problem

The Difficulty

The number of substrings is O(n²): a string of length n has n(n+1)/2 (position-distinct) substrings.

d1: to_be (5 characters, 15 substrings)

d2: not_to_be (9 characters, 45 substrings)

5 + 9 = 14 characters, but 15 + 45 = 60 substrings.
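As a quick sanity check on those counts, a tiny Python sketch (illustrative only):

def num_substrings(s):
    # a string of length n has n*(n+1)/2 substrings, counting each position separately
    n = len(s)
    return n * (n + 1) // 2

print(num_substrings("to_be"), num_substrings("not_to_be"))   # 15 45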

Our Idea

The substrings can be partitioned into statistical equivalence groups.

d1: to_be

d2: not_to_be

Example groups: {to, to_, to_b, to_be} and {ot, ot_, ot_t, ot_to, ot_to_, ot_to_b, ot_to_be}, ……
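The grouping can be reproduced by brute force on this toy corpus (far too slow for real data, and not the paper's method, but it yields the same groups): substrings with identical sets of (document, position) occurrences are statistically equivalent.

from collections import defaultdict

docs = {"d1": "to_be", "d2": "not_to_be"}

# map every substring to its set of (document, start position) occurrences
occ = defaultdict(set)
for doc_id, text in docs.items():
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            occ[text[i:j]].add((doc_id, i))

# substrings with identical occurrence sets form one equivalence group
groups = defaultdict(list)
for sub, where in occ.items():
    groups[frozenset(where)].append(sub)

for subs in groups.values():
    print(sorted(subs, key=len))   # e.g. ['to', 'to_', 'to_b', 'to_be']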

Suffix Tree

A suffix tree node = a substring group.

[Figure: the generalized suffix tree of d1 ("to_be") and d2 ("not_to_be"); edges are labeled with substrings and each leaf is labeled with the document (d1 or d2) containing the corresponding suffix.]
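To make the node = group correspondence concrete, here is a naive generalized suffix trie for the toy corpus (a quadratic-time illustration with ad-hoc '$'/'#' terminators, not the compact linear-time suffix tree the paper builds); its branching nodes are exactly the internal suffix tree nodes, i.e. the substring groups.

docs = {"d1": "to_be$", "d2": "not_to_be#"}   # unique terminators per document

# insert every suffix of every document into a character trie (dict of dicts)
root = {}
for text in docs.values():
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})

def branching(node, path=""):
    # internal suffix tree nodes = trie nodes with at least two children;
    # each of them represents one substring group
    if path and len(node) >= 2:
        yield path
    for ch, child in node.items():
        yield from branching(child, path + ch)

print(sorted(branching(root)))
# ['_', '_be', 'be', 'e', 'o', 'o_be', 't', 'to_be']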

Substring-Groups

The substrings in an equivalence group have exactly the same distribution over the corpus; therefore such a substring-group can be taken as a whole as a single feature for a statistical machine learning algorithm to use in text classification.

Substring-Groups

The number of substring-groups is O(n):

n trivial substring-groups: leaf nodes, frequency = 1, not so useful for learning

at most n−1 non-trivial substring-groups: internal (non-root) nodes, frequency > 1, to be selected as features

Key-Substring-Groups

Select the key (salient) substring-groups by:

-l  the minimum frequency: freq(SGv) ≥ l

-h  the maximum frequency: freq(SGv) ≤ h

-b  the minimum number of branches: children_num(v) ≥ b

-p  the maximum parent-child conditional probability: freq(SGv) / freq(SGp(v)) ≤ p

-q  the maximum suffix-link conditional probability: freq(SGv) / freq(SGs(v)) ≤ q
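Written out as a predicate over a suffix tree node v (an illustrative transcription of the criteria above, not the authors' code; the frequency and branching values would be read off the tree):

def is_key_substring_group(freq_v, freq_parent, freq_suffix_link, num_children,
                           l, h, b, p, q):
    return (freq_v >= l                          # -l: minimum frequency
            and freq_v <= h                      # -h: maximum frequency
            and num_children >= b                # -b: minimum number of branches
            and freq_v / freq_parent <= p        # -p: max parent-child conditional probability
            and freq_v / freq_suffix_link <= q)  # -q: max suffix-link conditional probability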

Suffix Link

The suffix link of the node v representing "c1 c2 … ck" points to the node s(v) representing "c2 … ck"; following suffix links repeatedly eventually reaches the root.

Feature Extraction Algorithm

Input: a set of documents; the parameters (-l, -h, -b, -p, -q)

Output: the key-substring-groups for each document

Time complexity: O(n)

Trick: make use of suffix links to traverse the tree

Feature Extraction Algorithm

construct the (generalized) suffix tree T using Ukkonen's algorithm;

count frequencies recursively;

select features recursively;

accumulate features recursively;

for each document d {
    match d to T and get to the node v;
    while v is not the root {
        output the features associated with v;
        move v to the next node via the suffix link of v;
    }
}
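For reference, a brute-force Python sketch of the same pipeline (quadratic rather than linear time, so toy-scale only; it skips the -q suffix-link criterion and document terminators, and all names and parameter defaults are illustrative rather than the authors' implementation):

from collections import defaultdict

def extract_key_substring_groups(docs, l=2, h=10**9, b=2, p=0.8):
    # substring -> set of (document index, start position) occurrences
    occ = defaultdict(set)
    for d, text in enumerate(docs):
        for i in range(len(text)):
            for j in range(i + 1, len(text) + 1):
                occ[text[i:j]].add((d, i))

    # substrings with identical occurrence sets = one substring-group (one suffix tree node)
    groups = defaultdict(list)
    for s, where in occ.items():
        groups[frozenset(where)].append(s)

    selected = []
    total_positions = sum(len(t) for t in docs)          # frequency of the root
    for where, members in groups.items():
        g = max(members, key=len)                        # canonical (longest) member
        freq = len(where)
        # branching factor: distinct characters that follow g in the corpus
        children = {docs[d][i + len(g)] for d, i in where if i + len(g) < len(docs[d])}
        shortest = min(members, key=len)
        parent_freq = len(occ[shortest[:-1]]) if len(shortest) > 1 else total_positions
        if l <= freq <= h and len(children) >= b and freq / parent_freq <= p:
            selected.append(g)

    # per-document frequencies of the selected groups (the feature matrix)
    rows = [[sum(1 for dd, _ in occ[g] if dd == d) for g in selected]
            for d in range(len(docs))]
    return selected, rows

feats, X = extract_key_substring_groups(["to_be", "not_to_be"])
print(feats, X)   # e.g. ['t', 'o', '_'] with counts [[1, 1, 1], [2, 2, 2]]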

Experiments

Parameter tuning: the number of features; the cross-validation performance

Feature weighting: TF×IDF (with L2 normalization)

Learning algorithm: LibSVM, linear kernel
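A minimal sketch of this classification stage in scikit-learn (SVC wraps LibSVM); X_train/X_test are assumed to be sparse count matrices over the extracted key-substring-group features and y_train/y_test the class labels, all hypothetical names:

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# TF*IDF weighting with L2 normalization, then a linear-kernel SVM
clf = make_pipeline(TfidfTransformer(norm="l2"), SVC(kernel="linear"))
clf.fit(X_train, y_train)            # X_train, y_train: assumed to exist
print(clf.score(X_test, y_test))     # X_test, y_test: assumed to exist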

English Text Topic Classification

Dataset: Reuters-21578 Top10 (ApteMod), the home ground of word-based text classification

Classes: (1) earn; (2) acq; (3) money-fx; (4) grain; (5) crude; (6) trade; (7) interest; (8) ship; (9) wheat; (10) corn.

Parameters: -l 80 -h 8000 -b 8 -p 0.8 -q 0.8

Features: from 9×10¹³ substrings down to 6,055 key-substring-groups (extracted in < 30 seconds)

English Text Topic Classification

The distribution of substring-group frequencies follows Zipf's law (a power law).
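One way to eyeball such a claim (group_freqs is an assumed list holding the corpus frequency of every substring-group; the plotting choice is illustrative): a rank-frequency plot on log-log axes should be roughly linear.

import matplotlib.pyplot as plt

freqs = sorted(group_freqs, reverse=True)      # group_freqs: assumed to exist
plt.loglog(range(1, len(freqs) + 1), freqs)    # rank vs. frequency
plt.xlabel("rank")
plt.ylabel("frequency")
plt.show()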

English Text Topic Classification

The performance of linear kernel SVM with key-substring-group features on the Reuters-21578 top10 dataset.

English Text Topic Classification

A comparison of the experimental results of our proposed approach with some representative existing approaches.

English Text Topic Classification

The influence of the feature extraction parameters on the number of features and on the text classification performance.

Chinese Text Topic Classification

Dataset: TREC-5 People's Daily News

Classes: (1) Politics, Law and Society; (2) Literature and Arts; (3) Education, Science and Culture; (4) Sports; (5) Theory and Academy; (6) Economics.

Parameters: -l 20 -h 8000 -b 8 -p 0.8 -q 0.8

Chinese Text Topic Classification

Performance (micro-averaged F1, miF):

SVM + word segmentation: 82.0% (He et al., 2000; He et al., 2003)

char-level n-gram language model: 86.7% (Peng et al. 2004)

SVM with key-substring-group features: 87.3%

Greek Text Authorship Classification

Dataset: (Stamatatos et al., 2000)

Classes: (1) S. Alaxiotis; (2) G. Babiniotis; (3) G. Dertilis; (4) C. Kiosse; (5) A. Liakos; (6) D. Maronitis; (7) M. Ploritis; (8) T. Tasios; (9) K. Tsoukalas; (10) G. Vokos.

Greek Text Authorship Classification

Performance (accuracy):

deep natural language processing: 72% (Stamatatos et al., 2000)

char-level n-gram language model: 90% (Peng et al. 2004)

SVM with key-substring-group features: 92%

Greek Text Genre Classification

Dataset: (Stamatatos et al., 2000)

Classes: (1) press editorial; (2) press reportage; (3) academic prose; (4) official documents; (5) literature; (6) recipes; (7) curriculum vitae; (8) interviews; (9) planned speeches; (10) broadcast news.

Greek Text Genre Classification

Performance (accuracy):

deep natural language processing: 82% (Stamatatos et al., 2000)

char-level n-gram language model: 86% (Peng et al. 2004)

SVM with key-substring-group features: 94%

Conclusion

We propose the concept of key-substring-group features and a linear-time (suffix-tree-based) algorithm to extract them.

We show that our method works well for some text classification tasks.

Open questions: clustering etc.? gene/protein sequence data?