Reducing multiclass cancer classification to binary by output coding and SVM


Computational Biology and Chemistry 30 (2006) 63–71

Brief communication

Reducing multiclass cancer classification to binary by output coding and SVM

Li Shen∗, Eng Chong Tan
School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798, Singapore

Received 20 July 2005; received in revised form 11 October 2005; accepted 11 October 2005

∗ Corresponding author. Tel.: +65 67906614; fax: +65 63162780. E-mail address: [email protected] (L. Shen).

1476-9271/$ – see front matter © 2005 Elsevier Ltd. All rights reserved. doi:10.1016/j.compbiolchem.2005.10.008

Abstract

Multiclass cancer classification based on microarray data is presented. The binary classifiers used combine support vector machines with a generalized output-coding scheme. Different coding strategies, decoding functions and feature selection methods are incorporated and validated on two cancer datasets: GCM and ALL. Using the random coding strategy and recursive feature elimination, the testing accuracy achieved is as high as 83% on the GCM data with 14 classes. Compared with other classification methods, our method is superior in classificatory performance.
© 2005 Elsevier Ltd. All rights reserved.

Keywords: Multiclass; Cancer classification; Microarrays; Output coding; Support vector machine

1. Introduction

DNA microarrays generally measure thousands of gene expression levels in a single experiment and contain valuable information about the gene expression variations of cells in different tissues (Brown and Botstein, 1999; Debouck and Goodfellow, 1999). Studying the expression patterns of tumor tissues can help us to understand the activities of genes underlying different cancers, and this information can in turn be used to identify types or subtypes of cancers.

Applying machine learning techniques to microarray data for cancer classification can achieve rather promising results. Previous work has been mainly in the field of binary classification, where good accuracy can be obtained (Cho and Won, 2003). Multiclass classification, on the other hand, is more difficult (Li et al., 2004; Yeang et al., 2001) but is gaining research momentum in machine learning. One of the more successful directions is to use error correcting output codes (ECOC) with binary classifiers. This was first proposed by Dietterich and Bakiri (1995), and many variants of the method have since been reported, some of which are very encouraging.

Li et al. (2004) have discussed combining different feature selection methods and coding techniques with classifiers for multiclass cancer classification. However, in this communication, we describe a more generalized output-coding scheme which considers different coding strategies and decoding functions in one single framework. The support vector machine (SVM) is chosen as the binary classifier and the effectiveness of the various combinations is verified. It is noticed that simply extending the codeword length of one of the coding strategies yields an improvement on the GCM data (Ramaswamy et al., 2001). Our approach is compared with three other popular machine learning methods: the K-nearest neighbor (KNN), the C4.5 decision tree and the neural network (NN).

Since microarray data have the characteristic that the number of genes is much larger than the number of samples, feature selection is an important issue before classification. Thus, we also evaluate the three major categories of feature selection: gene ranking, dimension reduction and recursive feature elimination (RFE).

2. Methods

2.1. Output coding for multiclass classification

Assume that we have a set of m microarray samples (x_i, y_i), i = 1, 2, ..., m, where x_i ∈ R^n is a vector of length n representing gene expression levels and y_i ∈ {1, 2, ..., k} is the class label of the ith sample. In the multiclass context, k > 2. The classification algorithm aims to find a mapping M: R^n → {1, 2, ..., k} using the m training samples.

The output-coding method decomposes the k-class problem into a set of l binary subproblems, trains the resulting l base classifiers and then combines the l outputs to predict the class label. We have adopted the generalized scheme proposed by Allwein et al. (2000). It begins with a given coding matrix

M ∈ {−1, 0, +1}^(k×l)

for which each row r_i (i = 1, 2, ..., k) represents the codeword of the ith class and each column s_j (j = 1, 2, ..., l) represents the jth base classifier. Each row r_i must be unique for its corresponding class, i.e. for all i, j such that i ≠ j and 1 ≤ i, j ≤ k, we have r_i ≠ r_j. M(i, j) = 1 or −1 means that the ith class should be considered as positive or negative for the jth base classifier, respectively. If M(i, j) = 0, the ith class is simply ignored by the jth base classifier. Therefore, the jth column of M represents a partition of the original multiclass data into two classes, positive and negative. Any binary classifier can be used to solve the induced two-class problem, e.g. the SVM.

Let f_s (s = 1, 2, ..., l) denote the l base classification functions. Given a microarray sample x, let f(x) = (f_1(x), f_2(x), ..., f_l(x)); then its class label y is predicted as

y = argmin_i d(r_i, f(x))    (1)

where d is called the decoding function. Different coding matrices and decoding functions can have a significant influence on the classification accuracy of the output-coding scheme. By adopting this generalized scheme, we can combine several other researchers' work into one single system.

2.2. Coding matrix

There are various methods to generate coding matrices. Different coding matrices may have a substantial effect on classification accuracy. Probably the simplest approach is to set M as a k × k square matrix with all its diagonal elements 1 and all other elements −1. Thus, it is equivalent to a binary problem for each of the k classes. That is, for base classifier i (1 ≤ i ≤ k), we train the classifier in which all samples of class i are considered as positive and all samples of the other classes are considered negative. This is called the one-versus-all (OVA) approach.

Another approach, by Hastie and Tibshirani (1998), is to use a binary classifier to distinguish one pair of classes at a time; meanwhile, the other classes are simply ignored. So there are in total C(k, 2) = k(k − 1)/2 base classifiers to induce. If k = 3, the coding matrix is given by

M = [  1   1   0
      −1   0   1
       0  −1  −1 ]

This is called the all-pairs (AP) approach.

Error correcting output codes (ECOC) were proposed by Dietterich and Bakiri (1995). They argued that if the minimum hamming distance between a pair of rows of the coding matrix is c, the output codes have the ability to correct ⌊(c − 1)/2⌋ errors of the base classifiers. The OVA approach thus does not have correcting power because c = 2. There are many approaches to generate ECOC. Designing a good set of ECOC requires both row and column separation of the coding matrix. Two major coding strategies, the random coding and the exhaustive coding, are given as follows.

2.2.1. Random coding
Let l = ⌈10 log2(k)⌉ (Dietterich and Bakiri, 1995). Each element of the coding matrix is assigned a value from {−1, 1} uniformly at random. Then, a hill-climbing procedure (Ricci and Aha, 1998) is applied. The procedure can usually improve the average and minimum hamming distances between pairs of rows of the coding matrix so that better classification accuracy can be achieved.

2.2.2. Exhaustive coding
First let l = 2^(k−1). The columns of the coding matrix are generated by enumerating all possible bit-strings with their length fixed to k; complementary columns are excluded. Because exactly one column is assigned +1 for all of its elements, it is deleted from the coding matrix. This then makes l = 2^(k−1) − 1. For example, if k = 4, the coding matrix is given by

M = [  1   1   1   1   1   1   1
      −1  −1  −1  −1   1   1   1
      −1  −1   1   1  −1  −1   1
      −1   1  −1   1  −1   1  −1 ]

It is easy to see that the minimum hamming distance is ⌈(2^(k−1) − 1)/2⌉. The disadvantage of exhaustive coding is that l increases exponentially with k.
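The coding strategies above are easy to generate programmatically. The experiments in this paper were implemented in MATLAB (Section 3.1); purely as an illustration, the following Python/NumPy sketch builds the OVA, all-pairs, random and exhaustive coding matrices described in this section. The hill-climbing refinement of the random strategy is omitted, and all function names are ours.

import numpy as np
from itertools import combinations, product

def ova_matrix(k):
    # One-versus-all: k x k matrix, diagonal +1, all other entries -1.
    return 2 * np.eye(k, dtype=int) - 1

def ap_matrix(k):
    # All-pairs: one column per class pair (i, j); the other classes get 0.
    cols = []
    for i, j in combinations(range(k), 2):
        col = np.zeros(k, dtype=int)
        col[i], col[j] = 1, -1
        cols.append(col)
    return np.column_stack(cols)

def random_matrix(k, l=None, rng=None):
    # Random coding with l = ceil(10 * log2(k)) columns by default.
    rng = np.random.default_rng(rng)
    if l is None:
        l = int(np.ceil(10 * np.log2(k)))
    return rng.choice([-1, 1], size=(k, l))

def exhaustive_matrix(k):
    # Enumerate bit-strings of length k, drop complements (fix the first
    # bit to +1) and the all-positive column, leaving l = 2^(k-1) - 1 columns.
    cols = []
    for bits in product([1, -1], repeat=k):
        if bits[0] == -1:
            continue                       # complement already covered
        if all(b == 1 for b in bits):
            continue                       # the all +1 column is deleted
        cols.append(bits)
    return np.array(cols).T

print(exhaustive_matrix(4))                # 4 x 7 matrix, as in the example above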

2.3. Decoding function

The decoding function determines how the distance between the outputs of the base classifiers and the codewords is calculated. A simple type of decoding function is to count the number of positions in which the codeword entry differs from the sign of the prediction f_s(x). Formally, the distance measure is given as

d_H(r, f(x)) = Σ_{s=1}^{l} (1 − sign(r_s f_s(x))) / 2    (2)

where sign(z) = +1 if z > 0, −1 if z < 0 and 0 if z = 0, and r_s is the entry of codeword r at position s. This is called the hamming distance decoding. A disadvantage of this decoding function is that it totally ignores the output values of the base classifiers. For binary classifiers like the SVM, the magnitudes of the outputs usually indicate the level of "confidence" of the prediction. For example, assume that the OVA coding matrix is used and some of the base classifiers are "weak"; then the weak classifiers may often give wrong signs in their predictions. Since the OVA codes do not have error correcting ability, a lot of errors may occur despite the fact that the other classifiers are "strong" and correct.

A second type of decoding function takes into account the confidence of the predictions.

It uses a loss function L which is algorithm-specific. The loss function calculates the "loss" of the prediction given the output values and the codewords. The loss function for the SVM is defined as

L(y, f) = (1 − yf)_+    (3)

where y is an entry of the codeword, f is the output of the SVM and z_+ is defined as max(z, 0). The distance measure can then be written as

d_L(r, f(x)) = Σ_{s=1}^{l} L(r_s, f_s(x))    (4)

This is called the loss based decoding.

There is another type of decoding function that takes the prediction confidence into account by simply calculating the inner product of the codeword and the vector of classifier outputs. The distance measure is defined as

d_I(r, f(x)) = −Σ_{s=1}^{l} r_s f_s(x)    (5)

and it is called the inner product decoding.
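As an illustration of Eqs. (1)–(5) (this is our sketch, not the authors' code), the three distance-based decodings can be written compactly as follows; M is the k × l coding matrix and f is the vector of the l SVM outputs for one sample.

import numpy as np

def hamming_decode(M, f):
    # Eq. (2): count positions where the codeword sign differs from sign(f_s);
    # zero codeword entries contribute 1/2, as the formula implies.
    return np.sum((1 - np.sign(M * f)) / 2.0, axis=1)

def loss_decode(M, f):
    # Eq. (4) with the SVM hinge loss of Eq. (3): L(y, f) = max(1 - y*f, 0).
    return np.sum(np.maximum(1.0 - M * f, 0.0), axis=1)

def inner_product_decode(M, f):
    # Eq. (5): negative inner product of each codeword with the outputs.
    return -M.dot(f)

def predict(M, f, decode=loss_decode):
    # Eq. (1): the class whose codeword is closest to the outputs.
    return int(np.argmin(decode(M, f)))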

Finally, we introduce a decoding function which is based on the probability of the prediction. Under the assumption that the base classifiers are independent, the probability of assigning sample x to class r is

p(r | x) = ∏_{s: r_s = +1} p_s(x) · ∏_{s: r_s = −1} (1 − p_s(x))    (6)

where p_s is the probabilistic output of the sth base classifier and r_s (s = 1, 2, ..., l) are the entries of the codeword for class r. The class whose codeword gives the maximum joint probability is the one predicted. The negative log-likelihood can be used to define the decoding function as

d_P(r, p(x)) = −Σ_{s=1}^{l} [(1 + r_s)/2] log(p_s(x)) − Σ_{s=1}^{l} [(1 − r_s)/2] log(1 − p_s(x))    (7)

where r = (r_1, r_2, ..., r_l) and p(x) = (p_1(x), p_2(x), ..., p_l(x)). There is still a problem in using the probabilistic decoding function: a classifier like the SVM does not give the probability of its prediction directly. A parametric model can be used to estimate the probability, as suggested by Platt (1999):

p_s(x) = 1 / (1 + exp(A_s f_s(x) + B_s))    (8)

where f_s(x) is the output of the SVM which is trained as the base classifier. The sigmoid parameters A_s and B_s can be found by maximizing the following log-likelihood function:

H_s = Σ_i log[1 / (1 + exp(A_s f_s(x_i) + B_s))]    (9)

in which x_i are the samples that are involved in producing the outputs of the base classifiers. However, we cannot use the same samples to train the base classifier and to produce the outputs, because the distribution of the SVM outputs on its own training samples is very different from that on testing samples. To address this problem, a three-fold cross-validation (CV) is used in our case to fit A_s and B_s. An additional advantage of probabilistic decoding is that it gives the probabilities of prediction.
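The following sketch shows one way to realize this probabilistic route in Python with scikit-learn, which is an assumption on our part (the paper used MATLAB): CalibratedClassifierCV performs Platt-style sigmoid fitting with internal three-fold CV, in the spirit of Eqs. (8) and (9) and of the CV procedure described above, and Eq. (7) is then used as the decoding distance.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

def fit_base_classifiers(M, X, y):
    # One calibrated linear SVM per column of the coding matrix M (k x l).
    # y holds integer class indices 0..k-1; classes with a 0 entry are ignored.
    models = []
    for j in range(M.shape[1]):
        signs = M[y, j]
        keep = signs != 0
        clf = CalibratedClassifierCV(LinearSVC(C=1.0), method="sigmoid", cv=3)
        clf.fit(X[keep], (signs[keep] > 0).astype(int))
        models.append(clf)
    return models

def probabilistic_decode(M, p):
    # Eq. (7): negative log-likelihood of each codeword given the l
    # probabilistic outputs p_s(x) (p is a vector of length l).
    eps = 1e-12
    pos = (1 + M) / 2.0
    neg = (1 - M) / 2.0
    return -(pos * np.log(p + eps) + neg * np.log(1 - p + eps)).sum(axis=1)

def predict_probabilistic(models, M, x):
    p = np.array([m.predict_proba(x.reshape(1, -1))[0, 1] for m in models])
    return int(np.argmin(probabilistic_decode(M, p)))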

2.4. Classification methods

The SVM is generally applied to binary problems. It often achieves superior performance to other machine learning techniques. Here, we use SVMs as base classifiers in combination with the output-coding scheme to solve multiclass cancer classification problems. Our SVM output-coding (SVM-OC) method is compared with three popular classifiers: the KNN, the multi-layer perceptron (MLP) neural network and the C4.5 decision tree. These three methods have already been successfully applied to multiclass microarray classification by other researchers (Khan et al., 2001; Yeang et al., 2001; Tan and Gilbert, 2003).

2.4.1. Support vector machine
The main idea of the binary SVM is to implicitly map data to a higher dimensional space via a kernel function and then solve an optimization problem to identify the maximum-margin hyperplane that separates the training instances (Vapnik, 1998). The hyperplane is based on a set of boundary training instances, called support vectors. The optimization problem is formulated as an objective function which allows for non-separable data by penalizing misclassifications. New instances are finally classified according to the side of the hyperplane they fall on.

2.4.2. K-nearest neighbor
The main idea of KNN is that it treats all samples as points in n-dimensional space (n is the number of variables). Given a new testing sample x, the algorithm classifies it by voting of the K nearest training samples as determined by some distance metric, typically the Euclidean distance (Duda et al., 2001).

2.4.3. C4.5 decision tree
The decision tree algorithm is well known for its robustness and learning efficiency. The output of the algorithm is a decision tree, which can easily be represented as a set of symbolic rules. The learning algorithm applies a divide-and-conquer strategy to construct the tree. Each node of a decision tree is a test on the value of a gene, and a leaf represents the class of a sample that satisfies the tests. The tree returns a "yes" or "no" decision when sets of instances are tested. Rules can be derived from the tree by following a path from the root to a leaf and using the nodes along the path as preconditions of the rule to predict the class at the leaf. The rules can be pruned to remove unnecessary preconditions and duplication (Duda et al., 2001).

2.4.4. Backpropagation neural network
An MLP is a feed-forward NN with signals propagated only forwardly through layers of neurons. The MLP we used comprises (i) an input layer with gene expression data, (ii) a hidden layer of neurons using tangent sigmoid transfer functions, and (iii) an output layer of neurons using logistic sigmoid transfer functions. The tangent sigmoid and the logistic sigmoid transfer functions are defined, respectively, as

tansig(x) = (e^x − e^−x) / (e^x + e^−x),   logsig(x) = 1 / (1 + e^−x)

Each output neuron represents one cancer class. The prediction result corresponds to the class of the neuron with the largest output. All connections from neurons of one layer to neurons of the next layer have weights and biases. These weights and biases are initialized before training and can be adjusted by a backpropagation training algorithm. Errors between the network outputs and the targets are calculated, and then the gradient descent optimization algorithm is employed to backpropagate the errors so that the weights and biases can be adjusted to minimize the training errors (Duda et al., 2001). All samples should be presented to the network once before the weights and biases are modified. Such a procedure is called one epoch, and it can be repeated until the performance goal is met.

2.4.5. Parameters for classification methods
The parameters for the different classification methods are determined by three-fold CV to optimize performance and to avoid over-fitting. This is a simple but effective way of parameter selection. The only shortcoming of this method is the computational cost incurred by multiple training and testing processes. For the SVM, a linear kernel is used and the regularization parameter is set from {0.0001, 0.01, 1, 100, 10 000}. For KNN, the number of neighbors K varies from one to the total number of training samples for a thorough optimization. For the MLP NN, the number of neurons in the hidden layer is chosen from {2, 5, 10, 20, 400}. A backpropagation algorithm based on gradient descent with momentum and adaptive learning rate is employed. The number of epochs is set at 3000 with a mean-squared-error (MSE) performance goal of 10^−6, and the momentum constant is set at 0.9.
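For illustration only, a scikit-learn analogue of this three-fold CV parameter selection is sketched below; DecisionTreeClassifier (CART) stands in for C4.5 and MLPClassifier for the MATLAB backpropagation network, so the grids are indicative rather than identical to those above.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

def tune(X, y):
    # Three-fold CV over parameter grids in the spirit of Section 2.4.5.
    # KNN neighbors are capped at the training-fold size (2/3 of the samples).
    searches = {
        "svm": GridSearchCV(LinearSVC(),
                            {"C": [1e-4, 1e-2, 1, 100, 10000]}, cv=3),
        "knn": GridSearchCV(KNeighborsClassifier(),
                            {"n_neighbors": list(range(1, len(y) * 2 // 3))},
                            cv=3),
        "tree": GridSearchCV(DecisionTreeClassifier(),
                             {"min_samples_leaf": [1, 2, 5]}, cv=3),
        "mlp": GridSearchCV(MLPClassifier(max_iter=3000),
                            {"hidden_layer_sizes": [(2,), (5,), (10,), (20,)]},
                            cv=3),
    }
    return {name: s.fit(X, y) for name, s in searches.items()}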

2.5. Feature selection

There are three major categories of feature selection methods. The first one ranks the features according to their values and class labels. The highest-ranked features can be selected for follow-up training and testing. The second one is the dimension reduction method. It can reduce the number of features from thousands to dozens. Two well-established methods, the partial least squares (PLS) and the principal components analysis (PCA), are used. The PLS may incorporate, while the PCA may not incorporate, class labels into dimension reduction. The third category selects features by feedback from classifiers. Features that have the least contribution to classification are eliminated so as to obtain a more compact feature subset and to enhance performance.

2.5.1. Gene ranking
Intuitively one would select those genes that are correlated with a class but are uncorrelated with the other classes. Li et al. (2004) have eight gene ranking methods for multiclass microarray classification. However, they did not find a clear winner out of the eight. We choose a gene ranking method which is based on the ratio of the between-group to within-group sums of squares. It is also used by a few other researchers as the gene selection method (Dudoit et al., 2002; Lee and Lee, 2003). For a gene j, this ratio is defined as

BW(j) = [Σ_i Σ_k I(y_i = k)(x̄_kj − x̄_·j)²] / [Σ_i Σ_k I(y_i = k)(x_ij − x̄_kj)²]    (10)

where x̄_·j and x̄_kj denote the average expression level of gene j across all classes and across the samples belonging to class k only, x_ij is the expression level of gene j in sample i, and I(·) is the indicator function. The base classifiers are built using the genes with the largest BW values.
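A NumPy sketch of Eq. (10) and of the top-gene selection follows (our illustration, not the authors' code); X is the m × n expression matrix and y the vector of integer class labels.

import numpy as np

def bw_ratio(X, y):
    # Eq. (10): between-group over within-group sums of squares, per gene.
    overall = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for k in np.unique(y):
        Xk = X[y == k]
        class_mean = Xk.mean(axis=0)
        between += Xk.shape[0] * (class_mean - overall) ** 2
        within += ((Xk - class_mean) ** 2).sum(axis=0)
    return between / (within + 1e-12)

def top_genes(X, y, n_genes=250):
    # Genes with the largest BW values, e.g. the 250-gene ranking of Section 3.1.
    return np.argsort(bw_ratio(X, y))[::-1][:n_genes]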

2.5.2. Dimension reduction
In the microarray context, because the number of features is much larger than the number of samples, dimension reduction methods have been proposed to tackle the "curse of dimensionality" problem. It is prohibitive to use some of the statistical methods when m < n because of excessive computational time. Dimension reduction is also used as a preprocessing step to make these methods feasible. The PLS (Wegelin, 2000) and the PCA (Golub and Van Loan, 1996) have been proven to be effective for microarray classification (Nguyen and Rocke, 2002; Shen and Tan, 2005).
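As a hedged illustration of this subsection, the sketch below keeps enough PCA components to capture 80% of the predictor variance (the criterion described later in Section 3.1) and approximates the two-block PLS of Wegelin (2000) with scikit-learn's PLSRegression on one-hot class labels; the latter substitution is our assumption, not the paper's method.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

def pca_reduce(X_train, X_test, var=0.80):
    # Keep the leading principal components that capture `var` of the
    # predictor variance; components are fitted on the training data only.
    pca = PCA(n_components=var).fit(X_train)
    return pca.transform(X_train), pca.transform(X_test)

def pls_reduce(X_train, y_train, X_test, n_components=32):
    # Rough PLS analogue: supervise the projection with one-hot class labels
    # (y_train is assumed to hold integer class indices 0..k-1).
    Y = np.eye(int(y_train.max()) + 1)[y_train]
    pls = PLSRegression(n_components=n_components).fit(X_train, Y)
    return pls.transform(X_train), pls.transform(X_test)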

2.5.3. Recursive feature elimination
Given a sample x, the linear SVM function can be written formally as

f(x) = w · x + b    (11)

where w is the weight vector and b is the bias. The elements of |w| thus indicate the contribution of the corresponding genes to the output of the classification. The genes with the smallest values in |w| are dropped, and this process can be executed recursively: each time a few genes are eliminated, the classifier should be run again to generate a new vector w. The method was first proposed by Guyon et al. (2002) to do feature selection in binary classification. In the multiclass context, the RFE should be performed for each base classifier. As a comparison, Rifkin et al. (2003) also used SVM and RFE to carry out feature selection in multiclass microarray classification based on the OVA coding strategy. However, we perform RFE in a slightly different way. The RFE is executed on each base classifier independently so that the best performance and the smallest gene subset can be obtained concurrently for one base classifier. Three-fold CV is then used to evaluate the goodness of a gene subset. Finally, this gives us a set of base classifiers built from different subsets of genes, respectively. This method is based on the assumption that all base classifiers can be considered as independent, and we can thus optimize each classifier individually.
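The per-classifier RFE procedure can be sketched as follows for one induced two-class problem, assuming scikit-learn in place of the authors' MATLAB implementation; RFECV drops the genes with the smallest |w| at each step and scores candidate subsets by three-fold CV.

from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFECV

def rfe_for_base_classifier(X, y_binary, step=0.1):
    # Recursive feature elimination for ONE induced two-class problem:
    # 10% of the remaining genes (smallest |w|, Eq. (11)) are removed per
    # iteration and each candidate subset is scored with three-fold CV.
    selector = RFECV(LinearSVC(C=1.0), step=step, cv=3)
    selector.fit(X, y_binary)
    return selector.support_          # boolean mask of the genes kept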


3. Results

3.1. Datasets and experimental setup

Two multiclass microarray datasets are selected for our experiments. The first is the GCM dataset published by Ramaswamy et al. (2001). It consists of 144 training samples and 54 testing samples of 15 common cancer classes. Each sample has 16063 gene expression levels. The data is available at: http://www.genome.wi.mit.edu/MPR/GCM. For simplicity, we dropped the eight metastatic samples from the testing dataset because they are not present in the training dataset. Therefore, 46 testing samples and 14 cancer classes are considered. The distribution of training and testing samples among the 14 classes is listed in Table 1. The second is the ALL dataset published by Yeoh et al. (2002). It consists of 163 training samples and 85 testing samples of 6 subtypes of acute lymphoblastic leukemia. Each sample has 12558 gene expression levels. The data is available at: http://www.stjuderesearch.org/data/ALL1/. The distribution of training and testing samples among the six classes is listed in Table 2. All data is log-transformed and all genes are normalized to have zero mean and unit standard deviation. No other preprocessing steps are applied.
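A minimal sketch of this preprocessing step is shown below; the flooring of expression values at 1 before the log transform and the use of training-set statistics for the test set are our assumptions, as the paper does not spell out these details.

import numpy as np

def preprocess(X_train, X_test, eps=1e-8):
    # Log-transform, then normalize every gene to zero mean and unit standard
    # deviation; statistics are taken from the training set (an assumption).
    X_train = np.log2(np.clip(X_train, 1.0, None))
    X_test = np.log2(np.clip(X_test, 1.0, None))
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0) + eps
    return (X_train - mu) / sd, (X_test - mu) / sd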

For the GCM data, three coding strategies are used: AP, OVA, and random. We did not use exhaustive coding because l would

Table 1
GCM: number of samples per cancer class

Cancer class        Training  Testing
Breast (BR)         8         3
Prostate (PR)       8         2
Lung (LU)           8         3
Colorectal (CO)     8         3
Lymphoma (LY)       16        6
Bladder (BL)        8         3
Melanoma (ME)       8         2
Uterus (UT)         8         2
Leukemia (LE)       24        6
Renal (RE)          8         3
Pancreas (PA)       8         3
Ovary (OV)          8         3
Mesothelioma (ML)   8         3
Brain (CNS)         16        4

Table 2
ALL: number of samples per subtype

Subtype            Training  Testing
BCR-ABL            9         6
E2A-PBX1           18        9
Hyperdiploid > 50  42        22
MLL                14        6
T-ALL              28        15
TEL-AML1           52        27

Fig. 1. Accuracies of output coding on GCM data using different coding strategies, decoding functions and feature selections.


be equal to 2^13 − 1 = 8191, and this would make the computation intractable. For the ALL data, the AP, OVA and exhaustive coding strategies are used.

According to the suggestion of Li et al. (2004), the 250 top genes are selected from the BW ratio ranking. We also tested the data without feature selection, which is denoted as NO (Tables 5 and 6). For RFE, the gene subset for each base classifier is determined by three-fold CV. For PLS, components are extracted so that 99% of the variance of the response variables and 80% of the variance of the predictor variables are captured. For PCA, components are sorted in descending order according to their eigenvalues and the first ones are selected so that 80% of the variance of the predictor variables is captured. The percentage is determined empirically and is considered sufficient for classification, as a rule of thumb.

All programs are written in MATLAB® code. The software package written by Steve Gunn is used for the SVM algorithm; it is available at: http://www.kernel-machines.org/. However, we made some modifications so that the speed is enhanced.
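Although the original experiments were run in MATLAB with Gunn's SVM package, the overall SVM-OC pipeline can be sketched in Python as follows; random_matrix and loss_decode refer to the illustrative helpers sketched in Section 2, and the whole block is an assumption-laden sketch rather than the authors' code.

import numpy as np
from sklearn.svm import LinearSVC

def train_svm_oc(M, X, y, C=1.0):
    # Train one linear SVM per column of the coding matrix M (Section 2.1);
    # y holds integer class indices 0..k-1 and classes with a 0 entry are skipped.
    models = []
    for j in range(M.shape[1]):
        signs = M[y, j]
        keep = signs != 0
        models.append(LinearSVC(C=C).fit(X[keep], signs[keep]))
    return models

def predict_svm_oc(models, M, X, decode):
    # Decode each test sample against the codewords (Eq. (1)).
    F = np.column_stack([m.decision_function(X) for m in models])
    return np.array([int(np.argmin(decode(M, f))) for f in F])

# Example usage with the earlier sketches:
#   M = random_matrix(k=14)
#   models = train_svm_oc(M, X_train, y_train)
#   y_pred = predict_svm_oc(models, M, X_test, decode=loss_decode)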

3.2. Testing on output coding and SVM

Table 3
The combinations for which the probabilistic decoding fails to produce proper outputs

Dataset  Coding strategy  Feature selection  Errors
GCM      OVA              RFE and PLS        43
ALL      OVA              BW, RFE and PLS    79
ALL      Exhaustive       BW, RFE and PLS    79

Fig. 2. Accuracies of output coding on ALL data using different coding strategies, decoding functions and feature selections.

All combinations of coding strategies, decoding functions and feature selection methods with SVM are applied to the ALL and GCM datasets. The results are shown in Figs. 1 and 2. In Figs. 1(a)–(c) and 2(a)–(c), the baseline value is equal to that of the output coding without feature selection, where for the AP and ECOC codings the decoding is hamming distance and for the OVA coding the decoding is inner product. In Figs. 1(d) and 2(d), the maximum and mean accuracies of all coding strategies are given; the probabilistic decoding fails to produce proper outputs with certain codings and feature selections and is thus not included. The corresponding accuracies are also truncated in Figs. 1(a)–(c) and 2(a)–(c). More information is given in Table 3. We have the following observations:

• The ECOC coding strategy generally outperforms the other coding strategies. The highest accuracy on the GCM data is achieved by random coding: combining it with the loss based and inner product decoding functions and RFE, an 80.4% testing accuracy has been obtained. On the ALL data, exhaustive coding has achieved almost perfect accuracy for most decoding functions and feature selection methods, with some exceptions for the probabilistic decoding function. This can be attributed to the ability of ECOC to correct the errors of weak base classifiers.

• The AP coding strategy works rather well on the ALL data, but it is the worst coding strategy on the GCM data: all testing accuracies on the GCM data are below 70%. From Table 1 we know that some classes of the GCM data are very small. It is hard for base classifiers to perform well on those pairs of small cancer classes, so many base-classifier errors may occur and the multiclass classification accuracy is degraded.

• The probabilistic decoding function is very sensitive to coding strategies and feature selections. It can be observed from Table 3 that it fails to work with RFE and PLS when the OVA coding strategy is applied to the GCM data. It also fails when the OVA and exhaustive coding strategies and the BW, RFE and PLS feature selections are used on the ALL data. However, it achieves 100% accuracy on the ALL data when AP and BW are used. It is known that fitting the sigmoid parameters by solving (9) is sensitive to the distribution of samples over the two classes. If the negative class is very large but the positive class is very small, Eq. (9) would simply produce an infinite negative A_s, and vice versa. For microarray data, the unequal distribution of classes may cause the failure of the probabilistic decoding function. It is observed that the probabilistic decoding function never fails with the AP and random coding strategies, for which the distributions of the two classes trained by the base classifiers are more balanced than those of the OVA and exhaustive coding strategies.

• The hamming-distance decoding function is not suitable for OVA. This is because many ties will happen when the base classifiers do not give high enough prediction confidence, and they are just resolved by random assignment. It is better to integrate prediction confidence when OVA is used. However, hamming-distance decoding works well with AP and ECOC. This is because the base classifiers of AP usually have high prediction confidence and ECOC has the ability to correct errors if base classifiers are weak. It is noticed that loss based and inner product decoding give very similar results.

• Feature selection by BW ratios performs poorly with the GCM data but rather well with the ALL data. This is consistent with the results of Li et al. (2004). BW only gives the values of between- and within-group variances but no information about the class labels. It may select genes that only contain information on several classes without regard to the rest. Again, this is related to the class-unbalance condition of GCM. A possible alternative is to use the Student t-statistics to select genes that distinguish one class from the rest and to make a uniform distribution of the numbers of genes over all classes. It is also noticed that results are usually good when no feature selection is used.

• PLS outperforms PCA no matter what coding matrix and decoding function are used, excluding only a few exceptions from the probabilistic decoding function. It has been validated that PLS is usually a better dimension reduction method than PCA for supervised classification (Nguyen and Rocke, 2002; Shen and Tan, 2005), mainly because PLS incorporates class labels in generating its components. Thirty-two PLS components are extracted for both the GCM and ALL datasets, while 37 and 80 PCA components are extracted for the above two datasets, respectively. With fewer components, the training and testing speeds can usually be enhanced.

Fig. 3. Accuracy versus random code length. Random perturbations have been added for better viewing.

Another experiment is performed to find out the relationship between the accuracy of the output-coding scheme and the codeword length l on the GCM data. The random coding strategy and RFE are used. We vary l at ⌈5 log2 k⌉ and from ⌈10 log2 k⌉ to ⌈40 log2 k⌉. The results are plotted in Fig. 3. The highest accuracy is 83% when l = ⌈20 log2 k⌉, using the loss based and probabilistic decoding functions. Increasing the codeword length does not necessarily increase the accuracy. This is because some base classifiers corresponding to the codeword bits are weak and may commonly introduce errors. They simply degrade the performance without increasing the error correcting power of the ECOC. This is confirmed by deleting the codeword bits that correspond to the base classifiers with CV errors above average. It is done on codeword lengths ⌈30 log2 k⌉ and ⌈40 log2 k⌉. A comparison of the results before and after deletion is listed in Table 4.

Table 4
Testing errors before and after deletion of codeword bits

Decoding function   l = ⌈30 log2(k)⌉           l = ⌈40 log2(k)⌉
                    Before      After          Before      After
Hamming distance    11          10             12          11
Loss based          12          10             12          10
Inner product       12          10             13          12
Probabilistic       12          10             12          11

3.3. Comparison of classification accuracies with other classification methods


Table 5
Testing accuracies in % on the GCM dataset

Classification method  NO    BW    PLS   PCA
SVM-OC                 76.1  56.5  78.3  71.7
KNN                    47.8  37    63    37
C4.5                   52.2  39.1  47.8  43.5
MLP                    65.2  47.8  73.9  60.9

Table 6
Testing accuracies in % on the ALL dataset

Classification method  NO    BW    PLS   PCA
SVM-OC                 100   100   98.8  97.6
KNN                    87.1  100   89.4  88.2
C4.5                   81.2  81.2  85.9  84.7
MLP                    95.3  97.6  97.6  95.3

To compare the classification accuracies of SVM-OC with those of the other classification methods, the three popular classifiers KNN, C4.5 and MLP were applied to the GCM and ALL datasets. Three-fold CVs were used on the training data to find the optimal parameters and then the trained classifiers were applied to the testing data. Three feature selection methods, BW, PLS and PCA, were used. RFE was not used because it is not possible to combine it with KNN and C4.5. For SVM-OC, we cite the results from Section 3.2; they are given in Tables 5 and 6.

It can be seen that SVM-OC achieves the best performance no matter which feature selection method is used. Of the other three classifiers, MLP seems to be the best: it outperforms KNN and C4.5 in most cases, and achieves results very similar to SVM-OC on the ALL data. The classification accuracies of SVM-OC cannot be enhanced further by feature selection methods. However, the performance of KNN and MLP can usually be significantly improved if a proper feature selection is employed. For example, accuracy is improved by 8.2% on GCM when PLS is used for MLP; accuracy is also improved by 12.9% on ALL when BW is used for KNN. KNN has a large variance of prediction in high-dimensional spaces because all the training points are located close to the edge of the sample (Hastie et al., 2001). Furthermore, many irrelevant variables dominate the distances between samples, which causes serious problems for the prediction (Mitchell, 1997). MLP suffers badly from the "curse of dimensionality" for at least two reasons: (i) there may be more local minima in the error space, so that backpropagation may easily fall into one of them; (ii) the model space increases exponentially with the number of variables, and it therefore becomes difficult to find a model that generalizes. C4.5 is the only nonmetric method among the four. This decision tree algorithm generates a set of symbolic rules from which the predictions on new testing samples are made. Generalization can often be improved by pruning the minimum-impurity leaves; thus the method is insensitive to high dimensionality.

4. Conclusions and future work

The output-coding scheme from machine learning has been successfully applied to multiclass microarray classification. The usage of different coding matrices, decoding functions and feature selection methods has been discussed. Combining ECOC, RFE and some decoding functions, an 83% testing accuracy has been achieved on the GCM data. Compared with other machine learning methods, SVM-OC has shown superior performance and is much less sensitive to the "curse of dimensionality".

It has been shown that a good coding matrix can result in high accuracy of multiclass microarray classification. Better coding strategies are required to further improve the performance of the output-coding scheme. Crammer and Singer (2001) have reported that, for a given fixed l, finding a k × l coding matrix M which minimizes the empirical loss is NP-complete. We have empirically shown that by deleting the codeword bits corresponding to base classifiers with high CV errors, the multiclass accuracy can usually be improved. This suggests that the coding matrix can be recursively improved as follows: (i) start with a coding matrix of sufficiently large l; (ii) delete the codeword bits corresponding to high CV errors; (iii) use the hill-climbing algorithm to improve the row separation of the truncated coding matrix. This procedure can be iterated a number of times until good results have been obtained. The computational cost of this heuristic method, however, would be high.
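A sketch of this three-step heuristic is given below for concreteness; cv_errors (one CV error per codeword bit) is a hypothetical input, and the single-bit hill-climbing on the minimum row hamming distance is only a simple stand-in for the procedure of Ricci and Aha (1998). Row uniqueness after pruning is not checked here.

import numpy as np

def prune_and_improve(M, cv_errors, n_flips=1000, rng=None):
    # (ii) delete codeword bits whose base classifiers have above-average CV
    # error, then (iii) hill-climb random single-bit flips that do not reduce
    # the minimum hamming distance between rows.
    rng = np.random.default_rng(rng)
    M = M[:, cv_errors <= np.mean(cv_errors)].copy()

    def min_row_distance(A):
        k = A.shape[0]
        return min(np.sum(A[i] != A[j])
                   for i in range(k) for j in range(i + 1, k))

    best = min_row_distance(M)
    for _ in range(n_flips):
        i = rng.integers(M.shape[0])
        j = rng.integers(M.shape[1])
        M[i, j] *= -1
        new = min_row_distance(M)
        if new >= best:
            best = new                 # keep the flip
        else:
            M[i, j] *= -1              # undo it
    return M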

Although gene ranking and dimension reduction methods have been shown to be effective for multiclass classification, it is observed that the results are sometimes even better without feature selection. RFE is good for binary classification, but for output-coding based multiclass classification it can only be used to enhance the base classifiers. Data over-fitting can easily happen, and the variances of the outputs would be large, especially when class sizes are small. This can degrade the multiclass accuracy in the end. It is better to use the CV errors of the multiclass classification as feedback to select genes. Some algorithms like the genetic algorithm could be considered.

References

Allwein, E.L., Schapire, R.E., Singer, Y., 2000. Reducing multiclass to binary: a unifying approach for margin classifiers. J. Machine Learn. Res. 1, 113–141.

Brown, P.O., Botstein, D., 1999. Exploring the new world of the genome with DNA microarrays. Nat. Genet. Suppl. 21, 33–37.

Cho, S.B., Won, H.H., 2003. Machine learning in DNA microarray analysis for cancer classification. In: Chen, Y.-P.P. (Ed.), Proceedings of the First Asia-Pacific Bioinformatics Conference. Adelaide, Australia.

Crammer, K., Singer, Y., 2001. On the learnability and design of output codes for multiclass kernel-based vector machines. J. Machine Learn. Res. 2, 265–292.

Debouck, C., Goodfellow, P.N., 1999. DNA microarrays in drug discovery and development. Nat. Genet. Suppl. 21, 48–50.

Dietterich, T.G., Bakiri, G., 1995. Solving multiclass learning problems via error-correcting output codes. J. Artific. Intell. Res. 2, 263–286.

Duda, R.O., Hart, P.E., Stork, D.G., 2001. Pattern Classification, 2nd ed. John Wiley, New York.

Dudoit, S., Fridlyand, J., Speed, T.P., 2002. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc. 97 (457), 77–87.

Golub, G.H., Van Loan, C.F., 1996. Matrix Computations. The Johns Hopkins University Press.

Guyon, I., Weston, J., Barnhill, S., Vapnik, V., 2002. Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422.

Hastie, T., Tibshirani, R., 1998. Classification by pairwise coupling. Ann. Stat. 26 (2), 451–471.

Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York.

Khan, J., Wei, J.S., Ringner, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C.R., Peterson, C., Meltzer, P.S., 2001. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med. 7, 673–679.

Lee, Y., Lee, C.K., 2003. Classification of multiple cancer types by multicategory support vector machines using gene expression data. Bioinformatics 19 (9), 1132–1139.

Li, T., Zhang, C., Ogihara, M., 2004. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics 20 (15), 2429–2437.

Mitchell, T.M., 1997. Machine Learning. McGraw-Hill, New York, NY, USA.

Nguyen, D.V., Rocke, D.M., 2002. Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics 18 (9), 1216–1226.

Platt, J., 1999. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In: Advances in Large Margin Classifiers. MIT Press.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., et al., 2001. Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl Acad. Sci. U.S.A. 98, 15149–15154.

Ricci, F., Aha, D.W., 1998. Error-correcting output codes for local learners. In: Proceedings of the 10th European Conference on Machine Learning.

Rifkin, R., Mukherjee, S., Tamayo, P., Ramaswamy, S., Yeang, C.H., Angelo, M., Reich, M., Poggio, T., Lander, E.S., Golub, T.R., Mesirov, J.P., 2003. An analytical method for multiclass molecular cancer classification. SIAM Rev. 45 (4), 706–723.

Shen, L., Tan, E.C., 2005. Dimension reduction-based penalized logistic regression for cancer classification using microarray data. IEEE Trans. Comput. Biol. Bioinform. 2, 166–175.

Tan, A.C., Gilbert, D., 2003. Ensemble machine learning on gene expression data for cancer classification. Appl. Bioinform. 2, S75–S83.

Vapnik, V., 1998. Statistical Learning Theory. Wiley/Interscience, New York, NY, USA.

Wegelin, J.A., 2000. A survey of partial least squares (PLS) methods, with emphasis on the two-block case. Technical Report, Department of Statistics, University of Washington.

Yeang, C.H., Ramaswamy, S., Tamayo, P., Mukherjee, S., Rifkin, R.M., Angelo, M., Reich, M., Lander, E.S., Mesirov, J., Golub, T., 2001. Molecular classification of multiple tumor types. Bioinformatics 17 (Suppl. 1), S316–S322.

Yeoh, E.J., Ross, M.E., Shurtleff, S.A., Williams, W.K., Patel, D., Mahfouz, R., Behm, F.G., Raimondi, S.C., Relling, M.V., Patel, A., Cheng, C., et al., 2002. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell 1, 133–143.