
Transcript of: 2007 IEEE Workshop on Machine Learning for Signal Processing, Thessaloniki, Greece, 27-29 August 2007.

DISCRIMINANT SUBSPACES OF SOME HIGH DIMENSIONAL PATTERN CLASSIFICATION PROBLEMS

Hongyu Li and Mahesan Niranjan

Department of Computer Science, The University of Sheffield

Regent Court, 211 Portobello Street, Sheffield, S1 4DP, U.K. Email: {H.Li, M.Niranjan}@dcs.shef.ac.uk

ABSTRACT

In this paper, we report on an empirical study of several high dimensional classification problems and show that much of the discriminant information may lie in low dimensional subspaces. Feature subset selection is achieved either by forward selection or backward elimination from the full feature space with Support Vector Machines (SVMs) as base classifiers. These "wrapper" methods are compared with a "filter" method of feature selection using information gain as discriminant criterion. Publicly available data sets in areas of text categorization, chemoinformatics, and gene expression analysis are used to illustrate the idea. We found that forward selection systematically outperforms backward elimination at low dimensions when applied to these problems. These observations are known anecdotally in the machine learning community, but here we provide empirical support on a wide range of problems in different domains.

1. INTRODUCTION

Many classification problems of interest in machine learning research are posed in high dimensions. While the curse of dimensionality ought to make high dimensional problems difficult, the observation that pattern classification, by approximating a class boundary directly, bypasses explicit density estimation has led to the application of machine learning algorithms to a large variety of high dimensional problems in areas such as text processing and high throughput genomics.

For such high dimensional patterns, feature selection or dimensionality reduction has become an important issue in machine learning, data mining and knowledge discovery. Generally speaking, feature selection methods can be summarized in three categories: filter, wrapper and embedded methods [1], [2], [3]. Filter methods consider features independently of the induction algorithm that will use them, and select features relying on general characteristics of the training set. Wrapper methods, on the other hand, run the induction algorithm on a training set to generate a set of candidate features and use the accuracy of the resulting model to evaluate the feature set. Embedded methods are based on the understanding that feature selection should lead to the largest possible generalization performance or minimal expected risk; the learning algorithm combines feature subset generation and evaluation. While filter methods have been widely used in a wide range of high dimensional pattern classification problems, such as genomic microarray data classification [4] and spam filtering [6], wrapper methods have not received comparable attention.

In this study, we make a systematic evaluation of wrapper type feature selection on a range of problems: spam filtering, functional classification of genes, text categorization and drug discovery in synthetic chemistry. In all four of these areas, the natural formulation of the learning problem is a high dimensional pattern classification one, either because of the large number of simultaneous measurements (e.g. microarrays) or the particular representation chosen (e.g. bag of words for text). We find that, across these problems, much of the discriminant information lies in a small subset of features. Further, sequential forward selection of features offers a significant performance gain over a subset obtained by backward deletion. Backward deletion often deletes useful features too early in the process, thereby sometimes under-performing filter methods.

2. METHODOLOGY

2.1. SFS with SVMs

Selection of features forming the subspaces, done suboptimally by sequential forward selection (SFS), is shown in pseudo-code form in Fig. 1. SFS is a greedy search procedure considering one feature at a time as a candidate for inclusion in the set of selected features, the criterion for selection being the classifier's performance on a validation set. In each step of the feature selection process, the SFS search finds the next optimal feature

1-4244-1566-7/07/$25.00 ©2007 IEEE. 27

Page 2: [IEEE 2007 IEEE Workshop on Machine Learning for Signal Processing - Thessaloniki, Greece (2007.08.27-2007.08.29)] 2007 IEEE Workshop on Machine Learning for Signal Processing - Discriminant

which, when combined with the previously selected feature set, gives the maximum increase in recognition performance.

The induction method used in the feature selection is the SVM, whose learning process maximizes the margin of classification. Since RFE uses a linear kernel, we use a linear kernel in SFS as well for a fair comparison. Furthermore, the SVMs used save computational cost by taking advantage of the sparsity of text feature vectors [7].

Input: number of candidate features Nfeatures;
Output: ranked feature list F;
Randomly partition S into Str, Sval and Sts;
Initialize the empty feature subset F = [ ];
for k = 1 : Nfeatures
    for each candidate feature i not in F
        w(i) = train an SVM on Str using features {F + i};
        e(i) = performance of w(i) on Sval using features {F + i};
    end
    Find the feature j for which e(j) is maximal; in case of
        multiple candidates, randomly select one;
    trainResults(k) = max(e);
    Add j to the feature subset: F = [F, j];
    w_new = train a classifier on Str using features F;
    testResults(k) = performance of w_new on Sts using features F;
end

Fig. 1. Pseudo-code of the SFS algorithm. Notation: |Str| denotes the size of the set Str; the data set is S = {x_i, y_i}, i = 1, ..., N; Str, Sval and Sts are the training, validation and test partitions.
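As an illustration of the wrapper in Fig. 1, the following sketch implements greedy forward selection with a linear SVM and the F1 criterion. It is not the authors' code: scikit-learn's LinearSVC and f1_score, and the function and argument names, are assumptions made for this example (the paper used an SVM toolbox in MATLAB).

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

def sfs(X_tr, y_tr, X_val, y_val, n_features, C=1.0):
    # Greedy forward selection: at each step add the feature that, combined
    # with the already selected subset, maximizes validation F1.
    selected, remaining, scores = [], list(range(X_tr.shape[1])), []
    for _ in range(n_features):
        best_f1, best_j = -1.0, None
        for j in remaining:
            cols = selected + [j]
            clf = LinearSVC(C=C).fit(X_tr[:, cols], y_tr)
            f1 = f1_score(y_val, clf.predict(X_val[:, cols]))
            if f1 > best_f1:
                best_f1, best_j = f1, j
        selected.append(best_j)        # F = [F, j]
        remaining.remove(best_j)
        scores.append(best_f1)         # trainResults(k) in Fig. 1
    return selected, scores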

2.2. RFE with SVMs

An alternative way of selecting a subset in a wrapper framework is recursive feature elimination (RFE) [8]. This is also a suboptimal search method, which starts with all the features and removes one feature at a time. At each step, the coefficients of the weight vector of a linear SVM are used as the feature selection criterion, and the feature corresponding to the smallest weight is removed, justified by the argument that the weight of a feature corresponds to its importance in forming the decision function. Pseudo-code of the recursive elimination procedure is shown in Fig. 2.
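The sketch below mirrors the RFE procedure of Fig. 2 in the same assumed scikit-learn setting as the SFS sketch above: a linear SVM is retrained on the remaining features, each feature is scored by its squared weight, and the lowest-scoring feature is removed.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

def rfe(X_tr, y_tr, X_ts, y_ts, C=1.0):
    remaining = list(range(X_tr.shape[1]))     # S_r = [1, ..., d]
    ranked, test_scores = [], []
    while remaining:
        clf = LinearSVC(C=C).fit(X_tr[:, remaining], y_tr)
        test_scores.append(f1_score(y_ts, clf.predict(X_ts[:, remaining])))
        c = clf.coef_.ravel() ** 2              # ranking scores c_i = (w_i)^2
        worst = int(np.argmin(c))               # e = argmin(c)
        ranked.insert(0, remaining.pop(worst))  # F = [e, F]
    return ranked, test_scores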

3. EXPERIMENTS

The software environment for the experiments was mainly MATLAB 7.1, which is convenient for manipulating matrices and vectors. The SVM toolbox1 made available by Steve Gunn at the University of Southampton was used in the implementation. Simulations were carried out on a 2.4 GHz machine running Windows XP with 1.0 GB of main memory.

1. http://www.isis.ecs.soton.ac.uk/isystems/kernel/

Input: number of candidate features Nfeatures;
Output: ranked feature list F;
Initialize the empty feature subset F = [ ];
Subset of remaining features S_r = [1, ..., d];
Randomly partition S into the training set Str and the test set Sts;
for k = 1 : Nfeatures
    Train a linear SVM on Str with all the features in S_r;
    Compute the weight vector w = sum_k alpha_k y_k x_k;
    Compute the ranking scores for the features in S_r: c_i = (w_i)^2;
    Find the feature with the smallest ranking score: e = argmin(c);
    Update F: F = [e, F];
    Update S_r by eliminating feature e: S_r = S_r \ {e};
    Evaluate test performance on Sts;
end

Fig. 2. Pseudo-code of the RFE algorithm. S_r \ {e} denotes the set S_r with feature e removed, and d is the dimensionality of the given data.


In the information retrieval context, it is customary to use the notions of precision (p) and recall (r) and to summarize the balance between them by a single measure known as F1 = 2pr/(p + r) [15]. We use F1 as the criterion for feature selection in this work; F1 was also used as the evaluation criterion on these classification problems. The complexity parameter C in each experiment was tuned in a systematic way, exploring a sensible range of values by the classification performance evaluated on a validation set.
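A minimal sketch of this kind of C tuning, under the same assumed scikit-learn setting as the earlier sketches (the grid of values shown is illustrative, not the one used in the paper): pick the C that maximizes validation F1.

from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

def tune_C(X_tr, y_tr, X_val, y_val, grid=(0.01, 0.1, 1.0, 10.0, 100.0)):
    # Train one linear SVM per candidate C and keep the best validation F1.
    def val_f1(C):
        clf = LinearSVC(C=C).fit(X_tr, y_tr)
        return f1_score(y_val, clf.predict(X_val))
    return max(grid, key=val_f1)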

3.1. Spam Filtering

We tested content based spam filtering using two data sets: Ling-Spam and SpamAssassin. The Ling-Spam corpus2 has four versions: bare, stop-list, lemmatizer and lemmatizer with stop-list. The only difference between these versions is whether the lemmatizer and stop-list removal are incorporated or not. The main experiments reported in this paper used the bare version, which provides as much information as possible of the original emails. The corpus contains 10 folders which are used for cross-validation. In the

2. http://www.aueb.gr/users/ion/data/lingspam_public.tar.gz


search for optimal feature subsets, SFS was run 10 times with different training sets: using the pre-defined 10-fold partitions, each run reserved a different folder for testing and used the remaining 9 folders for training and validation. Experimental results are shown in Table 1.

In experiments on the SpamAssassin corpus3, a binary classification problem, spam vs non-spam (hard + easy), was defined. The message representation used was the same as in the previous experiments. 2,499 words weighted by the Term Frequency-Inverse Document Frequency (TF-IDF) were chosen to construct feature vectors. The experimental results are shown in Fig. 4. The line labeled IG in Fig. 4 shows the performance of the filter method based on information gain.
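For completeness, here is a sketch of the information gain filter used as the baseline, assuming binarized (presence/absence) term features and binary class labels, with IG(c, t) = H(c) - H(c | t). It illustrates the standard criterion and is not the authors' implementation.

import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(x, y):
    # x: binary feature column (term present/absent), y: binary class labels.
    ig = entropy(np.bincount(y) / len(y))          # H(c)
    for v in (0, 1):
        mask = (x == v)
        if mask.any():                             # subtract P(t = v) * H(c | t = v)
            ig -= mask.mean() * entropy(np.bincount(y[mask]) / mask.sum())
    return ig

Features are then ranked by their information gain and the top k are kept, independently of the classifier that will use them.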

3.2. Gene Classification

Two datasets are used to pose the problem of classifying ribosomal genes from all others, following the work reported in [5]. One of these used cDNA spotted arrays and the other used synthetic oligo arrays, the so-called Affymetrix arrays. Missing values in the original datasets were simply replaced with 0. For these two datasets, one third of the data was used for training / validation and the remainder for testing. Experimental results reported in Fig. 5 and Fig. 6 are the generalization performance averaged over ten runs with different training / validation / testing partitions. Fig. 7 shows the scatter of the data in the space of the first two discriminative features selected by SFS.

We illustrate the feature selection performance on cDNA using projections onto a linear discriminant plane, following [9]. The same method has been used to visualize an outlier detection method [10]. The first two discriminant directions define a plane in the space containing the data; linear projections onto this plane are maximally separated, forming a convenient mechanism to visualize the data in two dimensions. The four features selected by each of the SFS, RFE and IG methods, shown in Table 2, are projected onto the plane of the first two discriminant directions. In Fig. 3, we show these discriminant plane projections for the cDNA dataset. From this figure, we find that the features selected by SFS separate the classes much better than those selected by IG.

3.3. Text Categorization

The Reuters Corpus Volume 1 (RCV1)4 includes 806,791 English language stories produced by Reuters between Aug. 20, 1996 and Aug. 19, 1997. All stories in RCV1 have been coded by topic, region and industry sector. Two sub-hierarchies of topics, C15 (Performance) and M14 (Commodity Markets), were randomly selected to build up two

[Fig. 3: three scatter plots, cDNA SFS, cDNA on RFE and cDNA on IG, each plotting the first discriminant direction against the second.]

Fig. 3. The cDNA data set, reduced to the 4 features selected by each of SFS, RFE and IG, projected onto the first two discriminant directions for visualization. The discriminant direction that maximizes the Fisher ratio is d1 = α1 S_w^{-1} Δ, where S_w is the within-class scatter matrix and Δ = m1 - m2 is the difference in the estimated class means. The second discriminant direction, orthogonal to d1, is d2 = α2 S_w^{-1} (Δ - β1 S_w^{-1} Δ), where β1 = (Δ^T S_w^{-2} Δ) / (Δ^T S_w^{-3} Δ) [9].
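A sketch of this two-direction projection, written in NumPy from the formulas in the caption rather than from the authors' code (the pseudo-inverse is used in case the scatter matrix is singular):

import numpy as np

def discriminant_plane(X, y):
    # Project the data onto the first two Fisher discriminant directions
    # for a two-class problem with labels y in {0, 1}.
    X0, X1 = X[y == 0], X[y == 1]
    delta = X1.mean(axis=0) - X0.mean(axis=0)               # difference of class means
    Sw = np.cov(X0, rowvar=False) * (len(X0) - 1) \
       + np.cov(X1, rowvar=False) * (len(X1) - 1)           # within-class scatter matrix
    Swi = np.linalg.pinv(Sw)
    d1 = Swi @ delta                                         # maximizes the Fisher ratio
    beta = (delta @ Swi @ Swi @ delta) / (delta @ Swi @ Swi @ Swi @ delta)
    d2 = Swi @ (delta - beta * (Swi @ delta))                # second direction, orthogonal to d1
    d1, d2 = d1 / np.linalg.norm(d1), d2 / np.linalg.norm(d2)
    return X @ np.column_stack([d1, d2])                     # N x 2 projection for plotting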

3. http://spamassassin.apache.org
4. http://about.reuters.com/researchandstandards/corpus/


Dimensions        10       20       50       100      505
SFS (mean)        0.8975   0.8991   0.9267   0.9310   0.9365
SFS (variance)    0.0023   0.0018   0.0017   0.0017   0.0015
RFE (mean)        0.7763   0.8998   0.9138   0.9304   0.9361
RFE (variance)    0.0018   0.0022   0.0014   0.0013   0.0017

Table 1. Performance (F1) on the Ling-Spam corpus using SFS with SVMs and RFE with SVMs in various dimensions, averaged over 10 runs.

SFS                               RFE                               IG
Cell-cycle CLN3 induction 30m     Sporulation ndt80 middle          Cell-cycle Elutriation 4.5 hrs
Cell-cycle Elutriation 4.5 hrs    Cell-cycle CLN3 induction 30m     Cell-cycle cdc15 30m
Cell-cycle Elutriation 2.5 hrs    Sporulation 0                     Cell-cycle cdc15 90m
Cell-cycle cdc15 30m              Cell-cycle Elutriation 4.5 hrs    Cell-cycle cdc15 100m

Table 2. Feature selection on the cDNA dataset: the 4 features selected by each of SFS, RFE and IG.

Fig. 4. SFS, RFE and IG on the SpamAssassin corpus, showing classification performance (F1) as a function of subspace dimension. The results are averaged over 10 random partitions of the data and the bars represent standard deviations.

binary classification problems, C15 vs non-C15 and M14 vs non-M14, by the "one vs the others" method.

RCV1 is split for training / testing purposes by the data providers. One split is the "TREC-10/2001" split [11]. The 23,307 documents published from Aug. 20, 1996 to Aug. 31, 1996 comprise the training set and the 783,484 documents published from Sep. 1, 1996 to Aug. 19, 1997 comprise the test set. The experiments on RCV1 reported here used the "TREC-10/2001" split. Word stemming and stop word removal were employed. The first 1,805 high frequency terms were used to construct vectors for the experiments.

The weight of each term was given by TF-IDF and all vectors were normalized. SFS starts from the empty feature set, whereas RFE starts from the full 1,805 features and eliminates them one by one. Results on RCV1 are shown in Table 3.

Fig. 5. SFS, RFE and IG on the cDNA dataset, showing the generalization performance of SFS with SVMs and RFE with SVMs in subspaces of the full 79-dimensional feature space (results averaged over 10 runs).

3.4. Drug Discovery

Chemical fingerprints are high dimensional binary descriptions in which each bit represents the presence or absence of a certain structural feature of a chemical compound. In the search for drug-like properties of chemical molecules, one synthesizes large numbers of compounds and characterises their properties by such fingerprint representations. Functional similarity between such compounds is posed as a pattern classification problem in the space of these binary representations [12].


Dimensions            3        5        10       20       30       50       100      500      1805
C15 SFS (mean)        0.7317   0.7640   0.7791   0.7870   0.7842   0.7985   0.8045   0.8180   0.8195
C15 SFS (variance)    0.0287   0.0184   0.0267   0.0187   0.0244   0.0306   0.0354   0.0291   0.0243
C15 RFE (mean)        0.6599   0.6599   0640     01618    01839    0.1864   0.191    0.8213   0.8250
C15 RFE (variance)    0.0268   0.0202   0.00184  0.0213   0.0173   0.0245   0.0198   0.0368   0.0255
M14 SFS (mean)        0.5487   0.5695   0.6277   0.6869   0.6967   0.7318   0.7570   0.7829   0.7928
M14 SFS (variance)    0.0423   0.0318   0.0217   0.0417   0.0215   0.0366   0.0254   0.0198   0.0322
M14 RFE (mean)        0.2629   0.2950   0.5736   0.6935   0.7186   0.7262   0.7470   0.7776   0.7954
M14 RFE (variance)    0.0281   0.0122   0.0425   0.0219   0.0243   0.0244   0.0271   0.0235   0.0278

Table 3. SFS and RFE on RCV1 C15 (Performance) and M14 (Commodity Markets). The means and variances are computed over 10 runs. (Some entries in the C15 RFE rows are reproduced as printed in the source, where the decimal points are illegible.)

Fig. 6. SFS, RFE and IG on the Affymetrix dataset, showing the generalization performance of SFS with SVMs and RFE with SVMs in subspaces of the 45-dimensional feature space. The results are averaged over 10 runs.

Fig. 7. The first two discriminative features (Salt 15 mins and Salt 120 mins) of the Affymetrix dataset.

Here we use a classification problem [13] formed by merging the AIDS and cancer databases distributed by the National Cancer Institute5. Class labels are termed "keepers" and "rejects", corresponding to whether the particular molecule is further processed or removed from an experimental chain. The data contains a total of 30,284 keepers and 5,983 rejects. Due to the computational constraints of our MATLAB implementation, a subset of the data set is used in the experiments; the variability obtained in results is low enough to suggest that the sub-sample used is representative. We used 3,000 keepers and 3,000 rejects in the subset. With the UNITY Bit-String methods6, all these chemical structures are transformed into fingerprints, 992 dimensional vectors. Experimental results are shown in Fig. 8.

5. http://dtp.nci.nih.gov/
6. http://www.tripos.com

4. DISCUSSION

This work reports empirical results on the behaviour of feature subset selection algorithms on a wide range of real-world problems. In order to ensure our implementations are correct, we compared the full dimensional problems with published results and found them to be comparable to those quoted by other authors. In Figures 3, 4, 5, 6 & 8 and Tables 1 & 3, SFS and RFE with SVMs are compared. It is found that forward selection outperforms backward elimination at low dimensions when applied to these problems. These experimental results confirm Guyon and Elisseeff's [1] suggestion that sequential forward selection outperforms backward elimination at low dimensions. In [1], the authors demonstrated that backward elimination may delete features that work best on their own. In a very recent publication [14], they further suggest that the performance of backward elimination may degrade significantly if the feature set is reduced too much. However, they did not provide empirical evidence to support this. This study confirms it through a number of


Fig. 8. SFS, RFE and IG on the chemoinformatics data, showing the generalization performance of SFS with SVMs and RFE with SVMs in subspaces of the 992-dimensional fingerprint space. The results are averaged over 10 runs.

experiments.

We also compared the two wrapper methods with a filter feature selection method. In most cases, the performance of IG is worse than that of SFS and RFE. However, we notice that in Figs. 5, 6 & 8, IG performs better than RFE in very low dimensional discriminant feature spaces (up to 4 features). This further confirms Guyon et al.'s hypothesis that backward elimination may delete some important discriminant features early in the search.

5. REFERENCES

[1] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, Vol. 3, pp. 1157-1182, 2003.

[2] D. Lovell, C. Dance, M. Niranjan, R. Prager, K. Dalton, and R. Derom, "Feature selection using expected attainable discrimination," Pattern Recognition Letters, Vol. 19, No. 5-6, pp. 393-402, 1998.

[3] S. Suwannaroj and M. Niranjan, "Subspaces of text discrimination with application to biological literature," In The 13th IEEE Workshop on Neural Networks for Signal Processing, pp. 3-12, Toulouse, France, 2003.

[4] E. P. Xing, M. I. Jordan and R. M. Karp, "Feature selection for high-dimensional genomic microarray data," In Proceedings of the 18th International Conference on Machine Learning, 2001.

[5] M. Brown, W. Grundy, D. Lin, N. Cristianini, C. Sugnet, T. Furey, and D. Haussler, "Knowledge-based analysis of microarray gene expression data by using support vector machines," Proceedings of the National Academy of Sciences, Vol. 97, pp. 262-267, 2000.

[6] I. Androutsopoulos, J. Koutsias, K. Chandrinos, G. Paliouras, and C. Spyropoulos, "An evaluation of naive Bayesian anti-spam filtering," In Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning, Barcelona, Spain, pp. 9-17, 2000.

[7] K. Torkkola, "Discriminative features for text document classification," Pattern Analysis and Applications, Vol. 6, pp. 301-308, 2004.

[8] I. Guyon, J. Weston, S. Barnhill and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, Vol. 46, pp. 389-422, 2002.

[9] D. H. Foley and J. W. Sammon, "An optimal set of discriminant vectors," IEEE Transactions on Computers, Vol. C-24, No. 3, pp. 281-289, 1975.

[10] H. Li and M. Niranjan, "Outlier detection in benchmark classification tasks," In Proceedings of the ICASSP, Vol. 5, pp. 557-560, Toulouse, France, 2006.

[11] S. Robertson and I. Soboroff, "The TREC 2001 filtering track report," In The Tenth Text Retrieval Conference, Gaithersburg, MD, 2001.

[12] P. Willett, J. M. Barnard and G. M. Downs, "Chemical similarity searching," Journal of Chemical Information and Computer Sciences, Vol. 38, pp. 983-996, 1998.

[13] N. Rhodes, P. Willett, J. B. Dunbar, and C. Humblet, "Bit-String Methods for Selective Compound Acquisition," Journal of Chemical Information and Computer Sciences, Vol. 40, pp. 210-214, 2000.

[14] I. Guyon and A. Elisseeff, "An introduction to feature extraction," In I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh, editors, Feature Extraction: Foundations and Applications, Physica-Verlag, Springer, 2006.

[15] C. J. van Rijsbergen, Information Retrieval, 2nd edition, Butterworth-Heinemann, 1979.
