HAL Id: tel-00868847
https://tel.archives-ouvertes.fr/tel-00868847
Submitted on 2 Oct 2013

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Learning algorithms for sparse classification
Luis Francisco Sanchez Merchante

To cite this version:
Luis Francisco Sanchez Merchante. Learning algorithms for sparse classification. Computer science. Université de Technologie de Compiègne, 2013. English. NNT: 2013COMP2084. tel-00868847.
By Luis Francisco SANCHEZ MERCHANTE

Thesis presented for the degree of Doctor of the Université de Technologie de Compiègne (UTC)

Learning algorithms for sparse classification

Defended on 7 June 2013

Specialty: Information and Systems Technologies

D2084

Algorithmes d'estimation pour la classification parcimonieuse
(Estimation algorithms for sparse classification)

Luis Francisco Sanchez Merchante
University of Compiègne
Compiègne, France
"You never know what you will find behind a door. Perhaps that is what life consists of: turning doorknobs."

Albert Espinosa
"Be brave. Take risks. Nothing can substitute experience."

Paulo Coelho
Acknowledgements
If this thesis has fallen into your hands and you have the curiosity to read this paragraph, you must know that, even though it is a short section, there are quite a lot of people behind this volume. All of them supported me during the three years, three months and three weeks that it took me to finish this work. However, you will hardly find any names. I think it is a little sad to write people's names in a document that they will probably not see and that will be condemned to gather dust on a bookshelf. It is like losing a wallet with pictures of your beloved family and friends. It makes me feel something like melancholy.
Obviously, this does not mean that I have nothing to be grateful for. I always felt unconditional love and support from my family, and I never felt homesick, since my Spanish friends did the best they could to visit me frequently. During my time in Compiègne I met wonderful people who are now friends for life. I am sure that all these people do not need to be listed in this section to know how much I love them. I thank them every time we see each other by giving them the best of myself.
I enjoyed my time in Compiègne. It was an exciting adventure and I do not regret a single thing. I am sure that I will miss these days, but this does not make me sad because, as the Beatles sang in "The End", or Jorge Drexler in "Todo se transforma", the amount that you miss people is equal to the love you gave them and received from them.
The only names I am including are my supervisors', Yves Grandvalet and Gérard Govaert. I do not think it is possible to have had better teaching and supervision, and I am sure that the reason I finished this work was not only their technical advice but also their close support, humanity and patience.
Contents
List of Figures v
List of Tables vii
Notation and Symbols ix

I Context and Foundations 1

1 Context 5

2 Regularization for Feature Selection 9
2.1 Motivations 9
2.2 Categorization of Feature Selection Techniques 11
2.3 Regularization 13
2.3.1 Important Properties 14
2.3.2 Pure Penalties 14
2.3.3 Hybrid Penalties 18
2.3.4 Mixed Penalties 19
2.3.5 Sparsity Considerations 19
2.3.6 Optimization Tools for Regularized Problems 21

II Sparse Linear Discriminant Analysis 25

Abstract 27

3 Feature Selection in Fisher Discriminant Analysis 29
3.1 Fisher Discriminant Analysis 29
3.2 Feature Selection in LDA Problems 30
3.2.1 Inertia Based 30
3.2.2 Regression Based 32

4 Formalizing the Objective 35
4.1 From Optimal Scoring to Linear Discriminant Analysis 35
4.1.1 Penalized Optimal Scoring Problem 36
4.1.2 Penalized Canonical Correlation Analysis 37
4.1.3 Penalized Linear Discriminant Analysis 39
4.1.4 Summary 40
4.2 Practicalities 41
4.2.1 Solution of the Penalized Optimal Scoring Regression 41
4.2.2 Distance Evaluation 42
4.2.3 Posterior Probability Evaluation 43
4.2.4 Graphical Representation 43
4.3 From Sparse Optimal Scoring to Sparse LDA 43
4.3.1 A Quadratic Variational Form 44
4.3.2 Group-Lasso OS as Penalized LDA 47

5 GLOSS Algorithm 49
5.1 Regression Coefficients Updates 49
5.1.1 Cholesky Decomposition 52
5.1.2 Numerical Stability 52
5.2 Score Matrix 52
5.3 Optimality Conditions 53
5.4 Active and Inactive Sets 54
5.5 Penalty Parameter 54
5.6 Options and Variants 55
5.6.1 Scaling Variables 55
5.6.2 Sparse Variant 55
5.6.3 Diagonal Variant 55
5.6.4 Elastic Net and Structured Variant 55

6 Experimental Results 57
6.1 Normalization 57
6.2 Decision Thresholds 57
6.3 Simulated Data 58
6.4 Gene Expression Data 60
6.5 Correlated Data 63
Discussion 63

III Sparse Clustering Analysis 67

Abstract 69

7 Feature Selection in Mixture Models 71
7.1 Mixture Models 71
7.1.1 Model 71
7.1.2 Parameter Estimation: The EM Algorithm 72
7.2 Feature Selection in Model-Based Clustering 75
7.2.1 Based on Penalized Likelihood 76
7.2.2 Based on Model Variants 77
7.2.3 Based on Model Selection 79

8 Theoretical Foundations 81
8.1 Resolving EM with Optimal Scoring 81
8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis 81
8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis 82
8.1.3 Clustering Using Penalized Optimal Scoring 82
8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis 83
8.2 Optimized Criterion 83
8.2.1 A Bayesian Derivation 84
8.2.2 Maximum a Posteriori Estimator 85

9 Mix-GLOSS Algorithm 87
9.1 Mix-GLOSS 87
9.1.1 Outer Loop: Whole Algorithm Repetitions 87
9.1.2 Penalty Parameter Loop 88
9.1.3 Inner Loop: EM Algorithm 89
9.2 Model Selection 91

10 Experimental Results 93
10.1 Tested Clustering Algorithms 93
10.2 Results 95
10.3 Discussion 97

Conclusions 97

Appendix 103
A Matrix Properties 105
B The Penalized-OS Problem is an Eigenvector Problem 107
B.1 How to Solve the Eigenvector Decomposition 107
B.2 Why the OS Problem is Solved as an Eigenvector Problem 109
C Solving Fisher's Discriminant Problem 111
D Alternative Variational Formulation for the Group-Lasso 113
D.1 Useful Properties 114
D.2 An Upper Bound on the Objective Function 115
E Invariance of the Group-Lasso to Unitary Transformations 117
F Expected Complete Likelihood and Likelihood 119
G Derivation of the M-Step Equations 121
G.1 Prior Probabilities 121
G.2 Means 122
G.3 Covariance Matrix 122

Bibliography 123
List of Figures
1.1 MASH project logo 5

2.1 Example of relevant features 10
2.2 Four key steps of feature selection 11
2.3 Admissible sets in two dimensions for different pure norms ||β||_p 14
2.4 Two-dimensional regularized problems with ||β||_1 and ||β||_2 penalties 15
2.5 Admissible sets for the Lasso and Group-Lasso 20
2.6 Sparsity patterns for an example with 8 variables characterized by 4 parameters 20

4.1 Graphical representation of the variational approach to the Group-Lasso 45

5.1 GLOSS block diagram 50
5.2 Graph and Laplacian matrix for a 3×3 image 56

6.1 TPR versus FPR for all simulations 60
6.2 2D representations of the Nakayama and Sun datasets based on the first two discriminant vectors provided by GLOSS and SLDA 62
6.3 USPS digits "1" and "0" 63
6.4 Discriminant direction between digits "1" and "0" 64
6.5 Sparse discriminant direction between digits "1" and "0" 64

9.1 Mix-GLOSS loops scheme 88
9.2 Mix-GLOSS model selection diagram 92

10.1 Class mean vectors for each artificial simulation 94
10.2 TPR versus FPR for all simulations 97
List of Tables
6.1 Experimental results for simulated data, supervised classification 59
6.2 Average TPR and FPR for all simulations 60
6.3 Experimental results for gene expression data, supervised classification 61

10.1 Experimental results for simulated data, unsupervised clustering 96
10.2 Average TPR versus FPR for all clustering simulations 96
Notation and Symbols
Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors, and parentheses are used to build row vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.
Sets
N        the set of natural numbers, N = {1, 2, ...}
R        the set of reals
|A|      cardinality of a set A (for finite sets, the number of elements)
Ā        complement of set A
Data
X        input domain
x_i      input sample, x_i ∈ X
X        design matrix, X = (x_1^T, ..., x_n^T)^T
x^j      column j of X
y_i      class indicator of sample i
Y        indicator matrix, Y = (y_1^T, ..., y_n^T)^T
z        complete data, z = (x, y)
G_k      set of the indices of observations belonging to class k
n        number of examples
K        number of classes
p        dimension of X
i, j, k  indices running over N
Vectors, Matrices and Norms

0        vector with all entries equal to zero
1        vector with all entries equal to one
I        identity matrix
A^T      transpose of matrix A (ditto for vectors)
A^{-1}   inverse of matrix A
tr(A)    trace of matrix A
|A|      determinant of matrix A
diag(v)  diagonal matrix with v on the diagonal
||v||_1  L1 norm of vector v
||v||_2  L2 norm of vector v
||A||_F  Frobenius norm of matrix A
Probability
E[·]      expectation of a random variable
var[·]    variance of a random variable
N(μ, σ²)  normal distribution with mean μ and variance σ²
W(W, ν)   Wishart distribution with ν degrees of freedom and scale matrix W
H(X)      entropy of random variable X
I(X; Y)   mutual information between random variables X and Y
Mixture Models
y_ik      hard membership of sample i to cluster k
f_k       distribution function for cluster k
t_ik      posterior probability of sample i belonging to cluster k
T         posterior probability matrix
π_k       prior probability or mixture proportion for cluster k
μ_k       mean vector of cluster k
Σ_k       covariance matrix of cluster k
θ_k       parameter vector for cluster k, θ_k = (μ_k, Σ_k)
θ^(t)     parameter vector at iteration t of the EM algorithm
f(X; θ)   likelihood function
L(θ; X)   log-likelihood function
L_C(θ; X, Y)  complete log-likelihood function
Optimization
J(·)      cost function
L(·)      Lagrangian
β̂         generic notation for the solution with respect to β
β̂_ls      least squares solution coefficient vector
A         active set
γ         step size to update the regularization path
h         direction to update the regularization path
Penalized models
λ, λ_1, λ_2  penalty parameters
P_λ(θ)    penalty term over a generic parameter vector
β_kj      coefficient j of discriminant vector k
β_k       kth discriminant vector, β_k = (β_k1, ..., β_kp)^T
B         matrix of discriminant vectors, B = (β_1, ..., β_{K-1})
β^j       jth row of B, B = (β^{1T}, ..., β^{pT})^T
B_LDA     coefficient matrix in the LDA domain
B_CCA     coefficient matrix in the CCA domain
B_OS      coefficient matrix in the OS domain
X_LDA     data matrix in the LDA domain
X_CCA     data matrix in the CCA domain
X_OS      data matrix in the OS domain
θ_k       score vector k
Θ         score matrix, Θ = (θ_1, ..., θ_{K-1})
Y         label matrix
Ω         penalty matrix
L_C^P(θ; X, Z)  penalized complete log-likelihood function
Σ_B       between-class covariance matrix
Σ_W       within-class covariance matrix
Σ_T       total covariance matrix
Σ̂_B       sample between-class covariance matrix
Σ̂_W       sample within-class covariance matrix
Σ̂_T       sample total covariance matrix
Λ         inverse of the covariance matrix, or precision matrix
w_j       weights
τ_j       penalty components of the variational approach
Part I
Context and Foundations
This thesis is divided into three parts. In Part I, I introduce the context in which this work has been developed, the project that funded it, and the constraints that we had to obey. Generic concepts are also detailed here to introduce the models and some basic notions that will be used throughout this document, and the state of the art is reviewed.
The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments to test its performance against other state-of-the-art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.
The second contribution is described in Part III, with a structure analogous to Part II but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM algorithm. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.
1 Context
The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.
The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University in Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiègne.
From the research point of view, the members of the consortium must deal with four main goals:
1. Software development of the website framework and APIs
2 Classification and goal-planning in high dimensional feature spaces
3 Interfacing the platform with the 3D virtual environment and the robot arm
4 Building tools to assist contributors with the development of the feature extractorsand the configuration of the experiments
Figure 1.1: MASH project logo
The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role has been to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community; the last number I was aware of was about 300. Within those 375 extractors, there must be some sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code on some reference datasets in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors on a particular dataset does not mean that both are using the same variables.
Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to others. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we will also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user who develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of against the whole database. As another example, imagine a user who wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.
As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.
• Clustering Using Mixture Models. This is a well-known technique that models the data as if it were randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm, and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, a C++ library that can be interfaced with MATLAB. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the implemented tool are given in deliverable "mash-deliverable-D7.1-m12" (Govaert et al., 2010).
• Sparse Clustering Using Penalized Optimal Scoring. This technique again intends to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis. All details concerning the implemented tool can be found in deliverable "mash-deliverable-D7.2-m24" (Govaert et al., 2011).
• Table Clustering Using the RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractor space is defined using the RV coefficient, which is a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(O_i, O_j), where O_i and O_j are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors. A detailed description of this technique can be found in deliverables "mash-deliverable-D7.1-m12" (Govaert et al., 2010) and "mash-deliverable-D7.2-m24" (Govaert et al., 2011).
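As a minimal numeric sketch of the RV coefficient between two tables sharing the same rows, the code below uses the simple cross-product operators O_i = X_i X_i^T on column-centered tables; the synthetic tables and the seed are illustrative assumptions, not project data:

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two tables observed on the same n rows."""
    X = X - X.mean(axis=0)          # center each column
    Y = Y - Y.mean(axis=0)
    Wx, Wy = X @ X.T, Y @ Y.T       # n x n cross-product operators
    return np.trace(Wx @ Wy) / np.sqrt(np.trace(Wx @ Wx) * np.trace(Wy @ Wy))

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 5))        # a reference table
B = A @ rng.normal(size=(5, 8))     # linear transform of A: strongly related
C = rng.normal(size=(30, 8))        # independent table: weakly related

print(round(rv_coefficient(A, A), 6))               # 1.0
print(rv_coefficient(A, B) > rv_coefficient(A, C))  # True
```

A matrix of pairwise RV values between extractor tables can then be fed to any standard clustering method, as described above.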
I will not extend this section with further explanations about the MASH project or deeper details about the theory that we used to fulfil our engagements. I simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).
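As a sketch of the first tool (mixture-model clustering with EM and maximum a posteriori assignment), the snippet below uses scikit-learn's GaussianMixture as a stand-in for the mixmod library used in the project; the two synthetic groups are fabricated for the example:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Two well-separated Gaussian groups standing in for families of extractors
X = np.vstack([rng.normal(0.0, 1.0, size=(40, 3)),
               rng.normal(4.0, 1.0, size=(40, 3))])

# Mixture proportions, means and covariances are fitted by the EM algorithm
gm = GaussianMixture(n_components=2, covariance_type="full",
                     random_state=0).fit(X)
labels = gm.predict(X)          # clusters built by maximum a posteriori
representatives = gm.means_     # one candidate representative per group
```

The fitted means play the role of group representatives mentioned earlier in this chapter.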
2 Regularization for Feature Selection
With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues rose as well. Redundancy or extremely correlated features may appear if two contributors implement the same extractor under different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined, even though many algorithms in the field of Machine Learning make use of this statistic.
2.1 Motivations
There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than a "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).
As a rule of thumb, in discriminant and clustering problems, the complexity of the calculus increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in supervised environments, and easier interpretations in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.
When talking about dimensionality reduction, there are two families of techniques that could induce confusion:
• Reduction by feature transformation summarizes the dataset in fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis and Independent Component Analysis are two popular examples.
• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction on the number of variables to preserve, and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features from Chidlovskii and Lecerf (2008)
As a basic rule, we can use reduction techniques by feature transformation when the majority of the features are relevant, and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation of the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text:
"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.
Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats, but not distinguishing between dogs and cats. Thus the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."

Figure 2.2: The four key steps of feature selection according to Liu and Yu (2005)
There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.
2.2 Categorization of Feature Selection Techniques
Feature selection is one of the most frequent preprocessing techniques, used to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.
I reproduce here the scheme that generalizes any feature selection process, as shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps of a feature selection algorithm.
The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a checklist that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews for characterizing feature selection techniques. I propose a framework inspired by these references that does not cover all the possibilities, but which gives a good summary of the existing ones.
• Depending on the type of integration with the machine learning algorithm, we have:

– Filter Models - Filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.
– Wrapper Models - Wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block, while the feature subset evaluation is done in a different one. Therefore, the criterion to optimize and the criterion to evaluate may differ. Those algorithms are computationally expensive.

– Embedded Models - They perform variable selection inside the learning machine, the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation form a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal, because they are specific to the training process of a given mining algorithm.
• Depending on the feature searching technique:

– Complete - No subsets are missed from evaluation. Involves combinatorial searches.
– Sequential - Features are added (forward searches) or removed (backward searches) one at a time.
– Random - The initial subset, or even subsequent subsets, are randomly chosen to escape local optima.

• Depending on the evaluation technique:

– Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.
– Information Measures - Choosing the features that maximize the information gain, that is, minimize the posterior uncertainty.
– Dependency Measures - Measuring the correlation between features.
– Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.
– Predictive Accuracy - Use the selected features to predict the labels.
– Cluster Goodness - Use the selected features to perform clustering and evaluate the result (cluster compactness, scatter separability, maximum likelihood).
The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness allow evaluating subsets of features and can be used in wrapper and embedded models.
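As a toy illustration of the filter paradigm with a dependency measure, the sketch below ranks features by their absolute correlation with the labels, independently of any learning algorithm; the data, the informative feature indices (3 and 7) and the choice of keeping two features are all fabricated for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)        # binary labels
X = rng.normal(size=(n, 10))          # 10 features, mostly noise
X[:, 3] += 2.0 * y                    # feature 3 carries class information
X[:, 7] -= 1.5 * y                    # feature 7 carries class information

# Filter model: score each feature by a dependency measure (correlation
# with the labels), then keep the best-ranked subset
scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
top2 = set(np.argsort(scores)[-2:].tolist())
print(sorted(top2))  # [3, 7]
```

A wrapper or embedded model would instead score candidate subsets through the learning algorithm itself, as described above.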
In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieving this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve hard selection problems exactly when the number of features exceeds a few tens. Regularization techniques provide a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.
2.3 Regularization
In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues in ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.
An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. An example is trying to infer some generic laws from a sample whose size is smaller than its dimensionality. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:
min_β J(β) + λ P(β)     (2.1)

min_β J(β)  subject to  P(β) ≤ t     (2.2)
In expressions (2.1) and (2.2), the parameters λ and t have a similar function, that is, to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set for which the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. The penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.
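To make formulation (2.1) concrete, the sketch below takes J as a least squares fit and P as the squared L2 norm (a ridge penalty), for which the minimizer has a closed form; the dimensions are arbitrary. With more features than samples, the unpenalized problem is ill-posed (X^T X is singular), while any λ > 0 restores a unique, stable solution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 15, 40                  # fewer samples than features
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

def ridge(X, y, lam):
    """Minimizer of J(beta) + lam * P(beta), J least squares, P = ||beta||_2^2."""
    p = X.shape[1]
    # X^T X + lam * I is positive definite for lam > 0, hence invertible
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta = ridge(X, y, lam=1.0)
print(np.linalg.matrix_rank(X.T @ X) < p)   # True: unpenalized problem ill-posed
```

Increasing λ trades data fit for a smaller (more heavily penalized) coefficient vector, which is exactly the trade-off described above.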
In this section I review the pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.
13
2 Regularization for Feature Selection
Figure 2.3: Admissible sets in two dimensions for different pure norms ||β||_p
2.3.1 Important Properties
Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.
Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

∀(x_1, x_2) ∈ X², f(t x_1 + (1 − t) x_2) ≤ t f(x_1) + (1 − t) f(x_2)     (2.3)

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.
Sparsity. Null coefficients usually furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and fewer computational resources.
Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes, such as adding, removing or replacing a few elements in the training set. Adding regularization, besides preventing overfitting, is a means to favor the stability of the solution.
2.3.2 Pure Penalties
For pure penalties, defined as P(β) = ‖β‖_p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and of the algorithms that solve them. In this figure, the shape of the admissible sets corresponding to the different pure penalties is greyed out. Since the convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

2.3 Regularization

Figure 2.4: Two-dimensional regularized problems with ‖β‖_1 and ‖β‖_2 penalties
Regularizing a linear model with a norm like ‖β‖_p means that the larger the magnitude |β_j|, the more important the feature x_j in the estimation; conversely, the closer it is to zero, the more dispensable the feature. In the limit |β_j| = 0, x_j is not involved in the model. If many dimensions can be dismissed in this way, then we can speak of sparsity.
A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered sparse if one of its components (β1 or β2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2), where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function while remaining inside the grey region. Depending on the shape of this region, the probability of obtaining a sparse solution varies: a region with vertexes, like the one corresponding to the L1 penalty, has more chances of inducing sparse solutions than that of the L2 penalty. This idea is displayed in Figure 2.4, where J(β) is a quadratic function represented by three isolevel curves whose global minimum β^ls lies outside the penalties' admissible regions. The closest point to β^ls within the L1 region is β^l1, and within the L2 region it is β^l2. Solution β^l1 is sparse, because its second component is zero, while both components of β^l2 differ from zero.
After reviewing the regions of Figure 2.3, we can relate the capacity of generating sparse solutions to the number and the "sharpness" of the vertexes of the greyed out area. For example, an L_{1/3} penalty has an admissible region with sharper vertexes, which would induce sparse solutions even more strongly than an L1 penalty; however, the non-convex shape of the L_{1/3} ball results in difficulties during optimization that do not arise with a convex shape.
To summarize, a convex problem with a sparse solution is desired. With pure penalties, sparsity is only possible with L_p norms with p ≤ 1, since they are the only ones whose admissible sets have vertexes. On the other side, only norms with p ≥ 1 are convex; hence the only pure penalty that yields a convex problem with a sparse solution is the L1 penalty.
L0 Penalties. The L0 pseudo-norm of a vector β is defined as its number of non-zero entries, that is, P(β) = ‖β‖_0 = card{β_j | β_j ≠ 0}:
\min_{\beta} J(\beta) \quad \text{s.t.} \quad \|\beta\|_0 \le t \qquad (2.4)
where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ, if we use the equivalent expression (2.1)), the fewer the number of zeros induced in vector β. If t equals the dimensionality of the problem (or if λ = 0), then the penalty term has no effect and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes; their solutions are sparse but unstable.
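The combinatorial nature of (2.4) can be made concrete with a brute-force sketch (illustrative, not part of the thesis): for small p, the L0-constrained problem is solved exactly by enumerating every support of size at most t and fitting least squares on each candidate:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p, t = 30, 6, 2
X = rng.standard_normal((n, p))
beta_true = np.array([1.5, 0.0, 0.0, -2.0, 0.0, 0.0])
y = X @ beta_true + 0.01 * rng.standard_normal(n)

# Exhaustive search over every support of size <= t: fit ordinary least
# squares restricted to the candidate columns and keep the best fit.
best_rss, best_support = np.inf, ()
for k in range(t + 1):
    for support in itertools.combinations(range(p), k):
        cols = list(support)
        if cols:
            coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ coef) ** 2)
        else:
            rss = np.sum(y ** 2)
        if rss < best_rss:
            best_rss, best_support = rss, support

assert best_support == (0, 3)   # the true support is recovered
```

The number of candidate supports grows as a binomial coefficient in p, which is why such exact schemes do not scale and the solutions are said to rely on combinatorial optimization.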
L1 Penalties. Penalties built on the L1 norm induce sparsity and stability. The resulting estimator was named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):
\min_{\beta} J(\beta) \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le t \qquad (2.5)
Despite all the advantages of the Lasso, the choice of the right penalty is not merely a question of convexity and sparsity. Concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one of them, resulting in a hardly interpretable model. In a field like genomics, where n is typically a few tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.
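This saturation of the support at n features is easy to reproduce. A small sketch, assuming scikit-learn is available, on a random problem with n < p:

```python
import numpy as np
from sklearn.linear_model import Lasso  # assumed available

rng = np.random.default_rng(0)
n, p = 10, 50                     # fewer examples than variables
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Even with a small penalty, the Lasso support can never exceed n when
# n < p: at most n coefficients are non-zero (Osborne et al., 2000a).
model = Lasso(alpha=0.01, max_iter=100000).fit(X, y)
n_selected = np.count_nonzero(model.coef_)
assert 0 < n_selected <= n
```

Whatever the value of the penalty parameter, the count of selected variables never exceeds the 10 available examples, out of 50 candidates.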
The Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection for supervised classification (Mai et al. 2012, Witten and Tibshirani 2011) and clustering (Roth and Lange 2004, Pan et al. 2006, Pan and Shen 2007, Zhou et al. 2009, Guo et al. 2010, Witten and Tibshirani 2010, Bouveyron and Brunet 2012b,a).
The consistency of problems regularized by a Lasso penalty is also a key feature. Defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large, Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by
minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography establishing conditions under which Lasso estimators become consistent (Knight and Fu 2000, Donoho et al. 2006, Meinshausen and Bühlmann 2006, Zhao and Yu 2007, Bach 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou 2006).
L2 Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that the L2 norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used in order to avoid the square root and to solve a linear system. Thus, an L2 penalized optimization problem looks like
\min_{\beta} J(\beta) + \lambda \|\beta\|_2^2 \qquad (2.6)
The effect of this penalty is an "equalization" of the components of the penalized parameter vector. To highlight this property, let us consider a least squares problem
\min_{\beta} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 \qquad (2.7)
whose solution is β^ls = (X^⊤X)^{−1} X^⊤ y. If some input variables are highly correlated, the estimator β^ls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:
\min_{\beta} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
The solution to this problem is β^{l2} = (X^⊤X + λI_p)^{−1} X^⊤ y. All the eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now shifted upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performances.
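The eigenvalue shift can be verified directly. A short numpy sketch (data are illustrative) with two nearly collinear columns:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 5, 0.5
X = rng.standard_normal((n, p))
X[:, 1] = X[:, 0] + 1e-6 * rng.standard_normal(n)   # two nearly identical columns
y = rng.standard_normal(n)

# Ridge solution: (X^T X + lam * I)^{-1} X^T y, a well-conditioned system.
G = X.T @ X
beta_ridge = np.linalg.solve(G + lam * np.eye(p), X.T @ y)

# Adding lam * I shifts every eigenvalue of X^T X upwards by exactly lam,
# so the smallest eigenvalue of the regularized system is at least lam.
eig_plain = np.linalg.eigvalsh(G)
eig_ridge = np.linalg.eigvalsh(G + lam * np.eye(p))
assert np.allclose(eig_ridge, eig_plain + lam)
assert eig_ridge.min() >= lam - 1e-8
```

Without the shift, the smallest eigenvalue of X^⊤X is nearly zero because of the duplicated column, and the plain least squares solve is numerically unstable.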
As with the Lasso, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, which looks like a ridge regression where each variable is penalized adaptively; the least squares solution is used to define the penalty parameter attached to each coefficient:
\min_{\beta} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{\beta_j^2}{(\beta_j^{ls})^2} \qquad (2.8)
The effect is an elliptic admissible set, instead of the ball of ridge regression. Another example is adaptive ridge regression (Grandvalet 1998, Grandvalet and Canu 2002),
where the penalty parameter differs for each component: every λ_j is optimized to penalize more or less depending on the influence of β_j in the model.
Although L2 penalized problems are stable, they are not sparse, which makes those models harder to interpret, particularly in high dimensions.
L∞ Penalties. A special case of the L_p norms is the infinity norm, defined as ‖x‖_∞ = max(|x_1|, |x_2|, …, |x_p|). The admissible region for a penalty like ‖β‖_∞ ≤ t is displayed in Figure 2.3: the greyed out region is a square containing all the β vectors whose largest coefficient is less than or equal to the penalty parameter t.
This norm is not commonly used as a regularization term by itself; however, it frequently appears in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing the optimality conditions of sparse regularized problems. The dual norm ‖β‖_* of a norm ‖β‖ is defined as
\|\beta\|_* = \max_{w \in \mathbb{R}^p} \beta^\top w \quad \text{s.t.} \quad \|w\| \le 1
In the case of an L_q norm with q ∈ [1, +∞], the dual norm is the L_r norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual, and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why the L∞ norm is so important, even if it is not as popular as a penalty as the L1 norm is. An extensive explanation of dual norms and of the algorithms that make use of them can be found in Bach et al. (2011).
2.3.3 Hybrid Penalties
There is no reason to use pure penalties only in isolation: we can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie 2005), whose objective is to improve on the Lasso penalty when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) of Section 2.3.2, the Elastic net reads
\min_{\beta} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \qquad (2.9)
The term in λ1 is a Lasso penalty that induces sparsity in vector β; the term in λ2 is a ridge penalty that provides universal strong consistency (De Mol et al. 2009), that is, the asymptotic capability (as n goes to infinity) of always making the right choice of relevant variables.
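The grouping effect of the ridge term can be sketched with scikit-learn's ElasticNet (assumed available; data are illustrative): two identical relevant columns receive equal coefficients, whereas a pure Lasso may arbitrarily keep only one of them.

```python
import numpy as np
from sklearn.linear_model import ElasticNet  # assumed available

rng = np.random.default_rng(0)
n = 40
z = rng.standard_normal(n)
X = np.column_stack([z, z, rng.standard_normal((n, 8))])  # columns 0 and 1 identical
y = 3.0 * z + 0.1 * rng.standard_normal(n)

# The ridge part makes the objective strictly convex, so the two identical
# relevant columns receive (nearly) equal coefficients instead of the pure
# Lasso behavior of arbitrarily keeping only one of them.
model = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=100000, tol=1e-10).fit(X, y)
assert abs(model.coef_[0] - model.coef_[1]) < 1e-4
assert model.coef_[0] > 0
```

In scikit-learn's parameterization, `l1_ratio` sets the balance between the λ1 and λ2 terms of (2.9) and `alpha` their overall strength.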
2.3.4 Mixed Penalties
Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us denote by G_ℓ the group of genes of the ℓ-th process and by d_ℓ the number of genes (variables) in each group, for all ℓ ∈ {1, …, L}. Thus, the dimension of vector β is the sum of the sizes of all groups: dim(β) = \sum_{\ell=1}^{L} d_\ell. Mixed norms are a type of norms that take those groups into consideration. Their general expression is

\|\beta\|_{(r,s)} = \left( \sum_{\ell} \Big( \sum_{j \in G_\ell} |\beta_j|^s \Big)^{r/s} \right)^{1/r} \qquad (2.10)
The pair (r, s) identifies the norms that are combined: an L_s norm within groups and an L_r norm between groups. The L_s norm penalizes the variables in every group G_ℓ, while the L_r norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest ones.
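Equation (2.10) translates directly into code. A small sketch (the group definitions are illustrative):

```python
import numpy as np

def mixed_norm(beta, groups, r=1, s=2):
    """Mixed (r, s)-norm of (2.10): an L_s norm within each group,
    then an L_r norm across the vector of group norms."""
    group_norms = np.array(
        [np.sum(np.abs(beta[g]) ** s) ** (1.0 / s) for g in groups]
    )
    return np.sum(group_norms ** r) ** (1.0 / r)

beta = np.array([3.0, 4.0, 0.0, 0.0, 5.0, 12.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]

# The group-Lasso norm ||beta||_(1,2) is the sum of within-group L2 norms:
# sqrt(9 + 16) + 0 + sqrt(25 + 144) = 5 + 0 + 13 = 18.
assert np.isclose(mixed_norm(beta, groups, r=1, s=2), 18.0)
# With r = s = 1 the plain L1 norm is recovered.
assert np.isclose(mixed_norm(beta, groups, r=1, s=1), np.sum(np.abs(beta)))
```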
Several combinations are available; the most popular is the norm ‖β‖_{(1,2)}, known as the group-Lasso (Yuan and Lin 2006, Leng 2008, Xie et al. 2008a,b, Meier et al. 2008, Roth and Fischer 2008, Yang et al. 2010, Sanchez Merchante et al. 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L_{1,2} norm. Many other mixings are possible, such as ‖β‖_{(1,4/3)} (Szafranski et al. 2008) or ‖β‖_{(1,∞)} (Wang and Zhu 2008, Kuan et al. 2010, Vogt and Roth 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al. 2009), the composite absolute penalties (Zhao et al. 2009), or combinations of mixed and pure norms, such as Lasso and group-Lasso (Friedman et al. 2010, Sprechmann et al. 2010) or group-Lasso and ridge (Ng and Abugharbieh 2011).
2.3.5 Sparsity Considerations
In this chapter I have reviewed several possibilities for inducing sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models featurewise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for the non-informative variables.
The Lasso and the other L1 penalties encourage solutions such as the one on the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.
To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across all parameters, as shown on the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L_{1,2} or L_{1,∞} mixed norms, with a proper definition of the groups, can induce sparsity patterns such as
(a) L1: Lasso. (b) L_{(1,2)}: group-Lasso.
Figure 2.5: Admissible sets for the Lasso and the group-Lasso
(a) L1-induced sparsity. (b) L_{(1,2)} group-induced sparsity.
Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters
the one on the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.
2.3.6 Optimization Tools for Regularized Problems
Caramanis et al. (2012) provide a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified into four categories. Even if they belong to different categories, those techniques can be used separately or combined to produce improved optimization algorithms.
In fact, the algorithm implemented in this thesis is inspired by three of those techniques: it can be described as an "active constraints" algorithm, implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.
Subgradient Descent. Subgradient descent is a generic optimization method that can be used in penalized settings where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure; on the other hand, many iterations are needed, so convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β^{(t+1)} is updated proportionally to the negative subgradient of the function at the current point β^{(t)}:

\beta^{(t+1)} = \beta^{(t)} - \alpha (s + \lambda s'), \quad \text{where } s \in \partial J(\beta^{(t)}),\; s' \in \partial P(\beta^{(t)})
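A minimal sketch of this update rule for J(β) = ½‖y − Xβ‖² and P = L1; the step-size schedule and the data are illustrative choices, not from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 100, 5, 0.5
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Subgradient descent for 0.5 * ||y - X beta||^2 + lam * ||beta||_1.
# s is the gradient of the smooth part; sign(beta) is a valid subgradient
# of the L1 norm (with the choice 0 at beta_j = 0).
beta = np.zeros(p)
for t in range(1, 5001):
    s = -X.T @ (y - X @ beta)
    s_prime = np.sign(beta)
    alpha = 0.001 / np.sqrt(t)          # diminishing step size
    beta = beta - alpha * (s + lam * s_prime)

# Convergence is slow and the iterates are not sparse: the irrelevant
# coefficients become small but are not exactly zero.
assert np.allclose(beta[:2], beta_true[:2], atol=0.1)
assert np.count_nonzero(beta) == p
```

The last assertion illustrates the drawback mentioned above: unlike thresholding-based methods, subgradient descent never produces exact zeros.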
Coordinate Descent. Coordinate descent is based on the first-order optimality conditions of criterion (2.1). For penalties like the Lasso, setting to zero the first-order derivative with respect to coefficient β_j gives

\beta_j = \frac{-\lambda\, \mathrm{sign}(\beta_j) - \partial J(\beta)/\partial \beta_j}{2 \sum_{i=1}^{n} x_{ij}^2}
In the literature, those algorithms are also referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with its least squares value β^ls and updating it with an iterative thresholding scheme where β_j^{(t+1)} = S_\lambda(\partial J(\beta^{(t)})/\partial \beta_j). The objective function is optimized with respect
to one variable at a time, while all the others are kept fixed:
S_\lambda\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) =
\begin{cases}
\dfrac{\lambda - \partial J(\beta)/\partial \beta_j}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \partial J(\beta)/\partial \beta_j > \lambda \\[2ex]
\dfrac{-\lambda - \partial J(\beta)/\partial \beta_j}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \partial J(\beta)/\partial \beta_j < -\lambda \\[2ex]
0 & \text{if } \left|\partial J(\beta)/\partial \beta_j\right| \le \lambda
\end{cases} \qquad (2.11)
The same principles define "block-coordinate descent" algorithms, where the first-order conditions are applied to the equations of a group-Lasso penalty (Yuan and Lin 2006, Wu and Lange 2008).
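A compact sketch of cyclic coordinate descent with a soft-thresholding update, written for J(β) = ½‖y − Xβ‖² (data and penalty level are illustrative; with this choice of J the denominator is Σ_i x_ij² rather than twice that quantity):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 100, 5, 20.0
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Cyclic coordinate descent for 0.5 * ||y - X beta||^2 + lam * ||beta||_1:
# each coordinate update is an exact soft-thresholding step, so the
# irrelevant coefficients are set exactly to zero.
beta = np.zeros(p)
col_sq = np.sum(X ** 2, axis=0)
for sweep in range(200):
    for j in range(p):
        r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual without j
        rho = X[:, j] @ r_j
        beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]

assert np.count_nonzero(beta) == 2               # exact zeros on variables 3-5
assert np.allclose(beta[:2], beta_true[:2], atol=0.3)
```

In contrast with subgradient descent, the thresholding step zeroes coordinates exactly, which is what makes these methods attractive for sparse problems.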
Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", usually denoted A, which stores the indices of the variables with non-zero β_j. The complement of the active set is the "inactive set", denoted Ā, which contains the indices of the variables whose β_j is zero. Thus the problem can be reduced to the dimensionality of A.
Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There exists also a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy, which starts with an empty A, has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are intended to be selected.
Working set algorithms have to deal with three main tasks. First, there is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the descent direction of the objective function, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that most violates the optimality conditions of Problem (2.1). Finally, there is also a task computing the optimality conditions, whose expressions are essential for selecting the next variable to add to the active set and for testing whether a particular vector β is a solution of Problem (2.1).
These active constraints or working set methods, even if originally proposed to solve L1 regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and L1 penalties (Roth 2004), linear functions
and L_{1,2} penalties (Roth and Fischer 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al. 2003). The algorithm developed in this work belongs to this family of solutions.
Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built from several secant hyper-planes, obtained from the subgradient of the cost function at different points.
This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.
This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims 2006, Smola et al. 2008, Franc and Sonnenburg 2008) or Multiple Kernel Learning (Sonnenburg et al. 2006).
Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1), where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for problems where the cost function is piecewise quadratic and the regularization term piecewise linear (or vice-versa).
This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the Least Angle Regression (LARS) algorithm, developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.
Once an active set A^{(t)} and its corresponding solution β^{(t)} have been set, following the regularization path means looking for a direction h and a step size γ to update the solution as β^{(t+1)} = β^{(t)} + γh. Afterwards, the active and inactive sets A^{(t+1)} and Ā^{(t+1)} are updated by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size, and decides which variable should enter the active set, from the correlation with the residuals.
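The full Lasso path can be computed with scikit-learn's implementation of LARS (assumed available; data are illustrative):

```python
import numpy as np
from sklearn.linear_model import lars_path  # assumed available

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.standard_normal((n, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# One call returns the whole Lasso path: `alphas` holds the breakpoints of
# the piecewise linear path and `coefs` the solutions at those vertices.
alphas, active, coefs = lars_path(X, y, method="lasso")

assert coefs.shape[0] == p
# At the end of the path (alpha -> 0) the Lasso meets the least squares fit.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(coefs[:, -1], beta_ls, atol=1e-6)
```

Since the path is piecewise linear, any intermediate solution can be obtained by interpolating between two consecutive columns of `coefs`; the `active` list records the variables entering the model, selected from their correlation with the residuals.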
Proximal Methods. Proximal methods optimize an objective function of the form (2.1) resulting from the addition of a Lipschitz-differentiable cost function J(β) and a non-differentiable penalty λP(β).
\min_{\beta \in \mathbb{R}^p} J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2} \left\| \beta - \beta^{(t)} \right\|_2^2 \qquad (2.12)
They are also iterative methods, where the cost function J(β) is linearized in the vicinity of the current solution β^{(t)}, so that the problem to solve at each iteration looks like
(2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. This can be rewritten as
\min_{\beta \in \mathbb{R}^p} \frac{1}{2} \left\| \beta - \left( \beta^{(t)} - \frac{1}{L} \nabla J(\beta^{(t)}) \right) \right\|_2^2 + \frac{\lambda}{L} P(\beta) \qquad (2.13)
The basic algorithm uses the solution of (2.13) as the next iterate β^{(t+1)}. There are, however, faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle 2009). Proximal methods can be seen as generalizations of gradient updates: indeed, setting λ = 0 in equation (2.13) recovers the standard gradient update rule.
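A minimal ISTA sketch solving (2.13) iteratively for the L1 penalty, where the proximal operator is soft-thresholding (data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 100, 5, 20.0
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# ISTA for 0.5 * ||y - X beta||^2 + lam * ||beta||_1: a gradient step on the
# smooth part followed by the proximal operator of (lam / L) * ||.||_1,
# i.e. soft-thresholding, which solves (2.13) exactly for the L1 penalty.
L = np.linalg.eigvalsh(X.T @ X).max()    # Lipschitz constant of grad J
beta = np.zeros(p)
for _ in range(500):
    grad = -X.T @ (y - X @ beta)
    z = beta - grad / L                  # with lam = 0 the update would stop here
    beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)

assert np.count_nonzero(beta) == 2
assert np.allclose(beta[:2], beta_true[:2], atol=0.3)
```

Accelerated variants such as FISTA keep the same proximal step but combine the two previous iterates to choose the linearization point, which improves the convergence rate.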
Part II
Sparse Linear Discriminant Analysis
Abstract
Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.
There is a vast bibliography on sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Section 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee models that are parsimonious with respect to variables.
In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach to LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis and also enables the derivation of variants, such as an LDA assuming a diagonal within-class covariance structure (Bickel and Levina 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of the data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification; the first two or three directions can also be used to generate a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to competing approaches, the models are extremely parsimonious without compromising prediction performance. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.
3 Feature Selection in Fisher Discriminant Analysis
3.1 Fisher Discriminant Analysis
Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates the classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of the data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which such linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part, we focus on Fisher's discriminant analysis, a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities but rather on some inertia principles (Fisher 1936).
We consider data consisting of a set of n examples, with observations x_i ∈ R^p comprising p features, and labels y_i ∈ {0, 1}^K indicating the exclusive assignment of observation x_i to one of the K classes. It will be convenient to gather the observations in the n×p matrix X = (x_1^⊤, …, x_n^⊤)^⊤ and the corresponding labels in the n×K matrix Y = (y_1^⊤, …, y_n^⊤)^⊤.
Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:
\max_{\beta \in \mathbb{R}^p} \frac{\beta^\top \Sigma_B \beta}{\beta^\top \Sigma_W \beta} \qquad (3.1)
where β is the discriminant direction used to project the data, and Σ_B and Σ_W are the p×p between-class and within-class covariance matrices, respectively defined (for a K-class problem) as
\Sigma_W = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (x_i - \mu_k)(x_i - \mu_k)^\top

\Sigma_B = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (\mu - \mu_k)(\mu - \mu_k)^\top
where μ is the sample mean of the whole dataset, μ_k is the sample mean of class k, and G_k indexes the observations of class k.
This analysis can be extended to the multi-class framework with K groups, in which case K−1 discriminant vectors β_k may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:
\max_{B \in \mathbb{R}^{p \times (K-1)}} \frac{\mathrm{tr}\left(B^\top \Sigma_B B\right)}{\mathrm{tr}\left(B^\top \Sigma_W B\right)} \qquad (3.2)
where the matrix B has the discriminant directions β_k as columns. Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K−1 subproblems:

\max_{\beta_k \in \mathbb{R}^p} \beta_k^\top \Sigma_B \beta_k \quad \text{s.t.} \quad \beta_k^\top \Sigma_W \beta_k \le 1, \quad \beta_k^\top \Sigma_W \beta_\ell = 0 \;\; \forall \ell < k \qquad (3.3)
The maximizer of subproblem k is the eigenvector of Σ_W^{−1} Σ_B associated with the k-th largest eigenvalue (see Appendix C).
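This eigen-solution can be sketched in numpy for a two-class toy problem (data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_per, p = 100, 3
X1 = rng.standard_normal((n_per, p))                              # class 1 around 0
X2 = np.array([3.0, 0.0, 0.0]) + rng.standard_normal((n_per, p))  # class 2, shifted mean
X = np.vstack([X1, X2])
n = X.shape[0]
mu, mu1, mu2 = X.mean(0), X1.mean(0), X2.mean(0)

# Within- and between-class covariance matrices as defined above.
Sw = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / n
Sb = (n_per * np.outer(mu1 - mu, mu1 - mu)
      + n_per * np.outer(mu2 - mu, mu2 - mu)) / n

# The leading eigenvector of Sw^{-1} Sb maximizes the Rayleigh quotient (3.1).
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
beta = np.real(eigvecs[:, np.argmax(np.real(eigvals))])

# The projected class means are well separated along beta.
gap = abs((mu2 - mu1) @ beta) / np.sqrt(beta @ Sw @ beta)
assert gap > 2.0
```

For K > 2 classes, the subsequent directions are the following eigenvectors, which automatically satisfy the Σ_W-orthogonality constraints of (3.3).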
3.2 Feature Selection in LDA Problems
LDA is often used as a data reduction technique, where the K−1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.
Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that involve only a few variables. The main target of this sparsity is to reduce the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness of the solution, or computational constraints.
The easiest approach to sparse LDA performs variable selection before discrimination. The relevance of each feature is usually assessed with univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to endow LDA with wrapper and embedded feature selection capabilities.
They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, that is, either Fisher's discriminant analysis (variance-based) or a regression-based formulation.
3.2.1 Inertia Based
The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away from each other (large between-class variance), and
classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.
Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search, with some eigenvalue properties used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al. 2007).
Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where Fisher's discriminant (3.1) is solved as
\min_{\beta \in \mathbb{R}^p} \beta^\top \Sigma_W \beta \quad \text{s.t.} \quad (\mu_1 - \mu_2)^\top \beta = 1, \quad \sum_{j=1}^{p} |\beta_j| \le t
where μ1 and μ2 are the vectors of mean gene expression values of the two groups. The objective and the first constraint match Problem (3.1); the second constraint encourages parsimony.
Witten and Tibshirani (2011) describe a multi-class technique using Fisher's discriminant, rewritten as K−1 constrained and penalized maximization problems:

\max_{\beta_k \in \mathbb{R}^p} \beta_k^\top \Sigma_B^k \beta_k - P_k(\beta_k) \quad \text{s.t.} \quad \beta_k^\top \Sigma_W \beta_k \le 1
The term to maximize is the projected between-class covariance β_k^⊤ Σ_B^k β_k, subject to an upper bound on the projected within-class covariance β_k^⊤ Σ_W β_k. The penalty P_k(β_k) is added to avoid singularities and induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general-purpose data: the Lasso shrinks the less informative variables to zero, and the fused Lasso encourages a piecewise constant β_k vector. The R code is available from the website of Daniela Witten.
Cai and Liu (2011) use Fisher's discriminant to solve a binary LDA problem, but instead of performing separate estimations of Σ_W and (μ1 − μ2) to obtain the optimal solution β = Σ_W^{−1}(μ1 − μ2), they estimate the product directly through a constrained L1 minimization:
minimization minβisinRp
β1
s t∥∥∥Σβ minus (micro1 minus micro2)
∥∥∥infinle λ
Sparsity is encouraged by the L1 norm of vector β, and the parameter λ tunes the optimization.
Most of the algorithms reviewed here are conceived for binary classification. For those that address multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.
3.2.2 Regression Based
In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging in the multi-class case (Duda et al. 2000, Friedman et al. 2009).
Predefined Indicator Matrix
Multi-class classification is usually linked to linear regression through the definition of an indicator matrix (Friedman et al. 2009): an n×K matrix Y holding the class labels of all samples. Several well-known types exist in the literature. For example, the binary or dummy indicator (y_ik = 1 if sample i belongs to class k, and y_ik = 0 otherwise) is commonly used to link multi-class classification with linear regression (Friedman et al. 2009). Another popular choice is y_ik = 1 if sample i belongs to class k, and y_ik = −1/(K−1) otherwise; it was used, for example, in extending Support Vector Machines to multi-class classification (Lee et al. 2004) or in generalizing the kernel-target alignment measure (Guermeur et al. 2004).
Some efforts propose a formulation of the least squares problem based on a new class indicator matrix (Ye 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition that is shown empirically to hold in many applications involving high-dimensional data.
Qiao et al. (2009) propose a discriminant analysis for the high-dimensional, low-sample-size setting, which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required; the lack of publicly available code also restrained an empirical test of this conjecture. If the similarity is confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.
In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is
obtained by solving

    min_{β∈ℝᵖ, β₀∈ℝ}  n⁻¹ ∑_{i=1}^n (y_i − β₀ − x_i⊤β)² + λ ∑_{j=1}^p |β_j| ,

where y_i is the binary label indicator of pattern x_i. Even if the authors focus on the Lasso penalty, they suggest that any other generic sparsity-inducing penalty could be used. The decision rule x⊤β + β₀ > 0 is the LDA classifier when it is built using the resulting β vector for λ = 0, but a different intercept β₀ is required.
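A minimal sketch of such a Lasso-penalized least squares classifier, solved here by proximal gradient descent (ISTA); the function names and the solver choice are illustrative, not those of Mai et al.:

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding, the proximal operator of the L1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_lda(X, y, lam, n_iter=500):
    """ISTA solver for  min_{b0,b} n^{-1} sum_i (y_i - b0 - x_i'b)^2 + lam*sum_j|b_j|."""
    n, p = X.shape
    b0, b = y.mean(), np.zeros(p)
    # step = 1/L, with L the Lipschitz constant of the quadratic loss gradient
    step = n / (2.0 * np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (y - b0 - X @ b) / n   # gradient of the smooth part
        b = soft_threshold(b - step * grad, step * lam)
        b0 = (y - X @ b).mean()                    # closed-form intercept update
    return b0, b
```

A large λ drives all coefficients exactly to zero, while a small λ essentially recovers the least squares direction.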
Optimal Scoring
In binary classification, the regression of (scaled) class indicators makes it possible to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al. 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani 1996) or more conservative ones (Hastie et al. 1995).
As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as
    min_{Θ,B}  ‖YΘ − XB‖²_F + λ tr(B⊤ΩB)      (3.4a)
    s.t.  n⁻¹ Θ⊤Y⊤YΘ = I_{K−1} ,      (3.4b)
where Θ ∈ ℝ^{K×(K−1)} are the class scores, B ∈ ℝ^{p×(K−1)} are the regression coefficients, and ‖·‖_F is the Frobenius norm. This compact form does not convey the ordering that arises naturally when considering the following series of K−1 problems:
    min_{θ_k∈ℝᴷ, β_k∈ℝᵖ}  ‖Yθ_k − Xβ_k‖² + β_k⊤Ωβ_k      (3.5a)
    s.t.  n⁻¹ θ_k⊤Y⊤Yθ_k = 1 ,      (3.5b)
          θ_k⊤Y⊤Yθ_ℓ = 0 ,  ℓ = 1, …, k−1 ,      (3.5c)

where each β_k corresponds to a discriminant direction.
Several sparse LDA variants have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan 2005; Leng 2008; Grosenick et al. 2008; Clemmensen et al. 2011). Grosenick et al. (2008) proposed a variant of the Lasso-based penalized OS of Ghosh and Chinnaiyan (2005) by introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by
    min_{β_k∈ℝᵖ, θ_k∈ℝᴷ}  ∑_k ‖Yθ_k − Xβ_k‖²₂ + λ₁‖β_k‖₁ + λ₂ β_k⊤Ωβ_k ,

where λ₁ and λ₂ are regularization parameters and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.
Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):

    min_{β_k∈ℝᵖ, θ_k∈ℝᴷ}  ∑_{k=1}^{K−1} ‖Yθ_k − Xβ_k‖²₂ + λ ∑_{j=1}^p √( ∑_{k=1}^{K−1} β²_kj ) ,      (3.6)
which is the criterion that was chosen in this thesis. The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets, with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA and propose a publicly available efficient code for solving this problem.
4 Formalizing the Objective
In this chapter, we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis, and also makes it possible to derive variants, such as an LDA assuming a diagonal within-class covariance structure (Bickel and Levina 2004).
The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), that selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of data. For K classes, this representation can be either complete, in dimension K−1, or partial, for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.
The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here, since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka 1984; Hastie et al. 1994; Hastie and Tibshirani 1996; Hastie et al. 1995) and have already been used for sparsity-inducing penalties (Roth and Lange 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by group-Lasso and penalized LDA (Sanchez Merchante et al. 2012).
4.1 From Optimal Scoring to Linear Discriminant Analysis
Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) and in penalized LDA (p-LDA), by going through canonical correlation analysis. We first provide some properties of the solutions of an arbitrary problem in the p-OS series (3.5).
Throughout this chapter we assume that
- there is no empty class, that is, the diagonal matrix Y⊤Y is full rank;

- inputs are centered, that is, X⊤1_n = 0;

- the quadratic penalty Ω is positive-semidefinite and such that X⊤X + Ω is full rank.
4.1.1 Penalized Optimal Scoring Problem
For the sake of simplicity, we now drop the subscript k to refer to any problem in the p-OS series (3.5). First, note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value and vice-versa. The problems are however non-convex: in particular, if (θ*, β*) is a solution, then (−θ*, −β*) is also a solution.
The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K−1 first optimal scores are orthogonal to 1 (and the Kth problem would be solved by β_K = 0). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we no longer mention these orthogonality constraints (3.5c), which apply along the route, so as to simplify all expressions. The generic problem solved is thus
    min_{θ∈ℝᴷ, β∈ℝᵖ}  ‖Yθ − Xβ‖² + β⊤Ωβ      (4.1a)
    s.t.  n⁻¹ θ⊤Y⊤Yθ = 1 .      (4.1b)
For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator
    β_os = (X⊤X + Ω)⁻¹ X⊤Yθ .      (4.2)
The objective function (4.1a) is then

    ‖Yθ − Xβ_os‖² + β_os⊤Ωβ_os = θ⊤Y⊤Yθ − 2θ⊤Y⊤Xβ_os + β_os⊤(X⊤X + Ω)β_os
                               = θ⊤Y⊤Yθ − θ⊤Y⊤X(X⊤X + Ω)⁻¹X⊤Yθ ,
where the second line stems from the definition of β_os (4.2). Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to
    max_{θ : n⁻¹θ⊤Y⊤Yθ=1}  θ⊤Y⊤X(X⊤X + Ω)⁻¹X⊤Yθ ,      (4.3)
which shows that the optimization of the p-OS problem with respect to θ_k boils down to finding the kth largest eigenvector of Y⊤X(X⊤X + Ω)⁻¹X⊤Y. Indeed, Appendix C details that Problem (4.3) is solved by

    (Y⊤Y)⁻¹Y⊤X(X⊤X + Ω)⁻¹X⊤Yθ = α²θ ,      (4.4)
where α² is the maximal eigenvalue:¹

    n⁻¹θ⊤Y⊤X(X⊤X + Ω)⁻¹X⊤Yθ = α² n⁻¹θ⊤(Y⊤Y)θ
    n⁻¹θ⊤Y⊤X(X⊤X + Ω)⁻¹X⊤Yθ = α² .      (4.5)
4.1.2 Penalized Canonical Correlation Analysis
As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables X and Y is defined as follows:
    max_{θ∈ℝᴷ, β∈ℝᵖ}  n⁻¹ θ⊤Y⊤Xβ      (4.6a)
    s.t.  n⁻¹ θ⊤Y⊤Yθ = 1 ,      (4.6b)
          n⁻¹ β⊤(X⊤X + Ω)β = 1 .      (4.6c)
The solutions to (4.6) are obtained by finding the saddle points of the Lagrangian:

    nL(β, θ, ν, γ) = θ⊤Y⊤Xβ − ν(θ⊤Y⊤Yθ − n) − γ(β⊤(X⊤X + Ω)β − n)
    ⇒ n ∂L(β, θ, γ, ν)/∂β = X⊤Yθ − 2γ(X⊤X + Ω)β
    ⇒ β_cca = (2γ)⁻¹ (X⊤X + Ω)⁻¹X⊤Yθ .
Then, as β_cca obeys (4.6c), we obtain

    β_cca = (X⊤X + Ω)⁻¹X⊤Yθ / √( n⁻¹θ⊤Y⊤X(X⊤X + Ω)⁻¹X⊤Yθ ) ,      (4.7)
so that the optimal objective function (4.6a) can be expressed with θ alone:

    n⁻¹θ⊤Y⊤Xβ_cca = n⁻¹θ⊤Y⊤X(X⊤X + Ω)⁻¹X⊤Yθ / √( n⁻¹θ⊤Y⊤X(X⊤X + Ω)⁻¹X⊤Yθ )
                  = √( n⁻¹θ⊤Y⊤X(X⊤X + Ω)⁻¹X⊤Yθ ) ,
and the optimization problem with respect to θ can be restated as

    max_{θ : n⁻¹θ⊤Y⊤Yθ=1}  θ⊤Y⊤X(X⊤X + Ω)⁻¹X⊤Yθ .      (4.8)
Hence the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

    β_os = α β_cca ,      (4.9)
¹The awkward notation α² for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5), for example).
where α is defined by (4.5). The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:

    n ∂L(β, θ, γ, ν)/∂θ = Y⊤Xβ − 2ν Y⊤Yθ
    ⇒ θ_cca = (2ν)⁻¹ (Y⊤Y)⁻¹Y⊤Xβ .      (4.10)
Then, as θ_cca obeys (4.6b), we obtain

    θ_cca = (Y⊤Y)⁻¹Y⊤Xβ / √( n⁻¹β⊤X⊤Y(Y⊤Y)⁻¹Y⊤Xβ ) ,      (4.11)
leading to the following expression of the optimal objective function:

    n⁻¹θ_cca⊤Y⊤Xβ = n⁻¹β⊤X⊤Y(Y⊤Y)⁻¹Y⊤Xβ / √( n⁻¹β⊤X⊤Y(Y⊤Y)⁻¹Y⊤Xβ )
                  = √( n⁻¹β⊤X⊤Y(Y⊤Y)⁻¹Y⊤Xβ ) .
The p-CCA problem can thus be solved with respect to β by plugging this value in (4.6):

    max_{β∈ℝᵖ}  n⁻¹ β⊤X⊤Y(Y⊤Y)⁻¹Y⊤Xβ      (4.12a)
    s.t.  n⁻¹ β⊤(X⊤X + Ω)β = 1 ,      (4.12b)
where the positive objective function has been squared compared to (4.6). This formulation is important, since it will be used to link p-CCA to p-LDA. We thus derive its solution: following the reasoning of Appendix C, β_cca verifies

    n⁻¹X⊤Y(Y⊤Y)⁻¹Y⊤Xβ_cca = λ (X⊤X + Ω) β_cca ,      (4.13)
where λ is the maximal eigenvalue, shown below to be equal to α²:

    n⁻¹β_cca⊤X⊤Y(Y⊤Y)⁻¹Y⊤Xβ_cca = λ
    ⇒ n⁻¹α⁻¹ β_cca⊤X⊤Y(Y⊤Y)⁻¹Y⊤X(X⊤X + Ω)⁻¹X⊤Yθ = λ
    ⇒ n⁻¹α β_cca⊤X⊤Yθ = λ
    ⇒ n⁻¹θ⊤Y⊤X(X⊤X + Ω)⁻¹X⊤Yθ = λ
    ⇒ α² = λ .

The first line is obtained by obeying constraint (4.12b); the second line by the relationship (4.7), whose denominator is α; the third line comes from (4.4); the fourth line uses again the relationship (4.7); and the last one the definition of α (4.5).
4.1.3 Penalized Linear Discriminant Analysis
Still following Hastie et al. (1995), the penalized Linear Discriminant Analysis problem is defined as follows:
    max_{β∈ℝᵖ}  β⊤Σ_B β      (4.14a)
    s.t.  β⊤(Σ_W + n⁻¹Ω)β = 1 ,      (4.14b)
where Σ_B and Σ_W are, respectively, the sample between-class and within-class covariance matrices of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.
As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form that is amenable to a simple matrix representation, using the projection operator Y(Y⊤Y)⁻¹Y⊤:

    Σ_T = n⁻¹ ∑_{i=1}^n x_i x_i⊤ = n⁻¹ X⊤X
    Σ_B = n⁻¹ ∑_{k=1}^K n_k μ_k μ_k⊤ = n⁻¹ X⊤Y(Y⊤Y)⁻¹Y⊤X
    Σ_W = n⁻¹ ∑_{k=1}^K ∑_{i : y_ik=1} (x_i − μ_k)(x_i − μ_k)⊤ = n⁻¹ ( X⊤X − X⊤Y(Y⊤Y)⁻¹Y⊤X ) .
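These formulae are easy to check numerically: the total covariance must split exactly into its between-class and within-class parts. A small illustrative sketch (synthetic data, our own variable names):

```python
import numpy as np

n, p, K = 30, 4, 3
labels = np.arange(n) % K                      # three non-empty classes
Y = np.zeros((n, K))
Y[np.arange(n), labels] = 1.0                  # dummy indicator matrix
rng = np.random.default_rng(0)
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                            # center the features

P = Y @ np.linalg.inv(Y.T @ Y) @ Y.T           # projection operator Y (Y'Y)^{-1} Y'
Sigma_T = X.T @ X / n
Sigma_B = X.T @ P @ X / n
Sigma_W = (X.T @ X - X.T @ P @ X) / n

# the total covariance splits exactly into between- plus within-class parts
assert np.allclose(Sigma_T, Sigma_B + Sigma_W)
```

The same check can be run against the centroid-based expression n⁻¹ ∑ n_k μ_k μ_k⊤ for Σ_B.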
Using these formulae, the solution to the p-LDA problem (4.14) is obtained as

    X⊤Y(Y⊤Y)⁻¹Y⊤Xβ_lda = λ ( X⊤X + Ω − X⊤Y(Y⊤Y)⁻¹Y⊤X ) β_lda
    X⊤Y(Y⊤Y)⁻¹Y⊤Xβ_lda = λ/(1−λ) (X⊤X + Ω) β_lda .
The comparison of the last equation with β_cca (4.13) shows that β_lda and β_cca are proportional, and that λ/(1−λ) = α². Using constraints (4.12b) and (4.14b), it follows that

    β_lda = (1−α²)^{−1/2} β_cca
          = α⁻¹(1−α²)^{−1/2} β_os ,

which ends the path from p-OS to p-LDA.
4.1.4 Summary
The three previous subsections considered a generic form of the kth problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:
    min_{Θ,B}  ‖YΘ − XB‖²_F + λ tr(B⊤ΩB)
    s.t.  n⁻¹ Θ⊤Y⊤YΘ = I_{K−1} .
Let A represent the (K−1)×(K−1) diagonal matrix with elements α_k, the square-roots of the largest eigenvalues of Y⊤X(X⊤X + Ω)⁻¹X⊤Y; we have

    B_LDA = B_CCA (I_{K−1} − A²)^{−1/2}
          = B_OS A⁻¹ (I_{K−1} − A²)^{−1/2} ,      (4.15)
where I_{K−1} is the (K−1)×(K−1) identity matrix. At this point, the feature matrix X, which in the input space has dimensions n×p, can be projected into the optimal scoring domain as an n×(K−1) matrix X_OS = XB_OS, or into the linear discriminant analysis space as an n×(K−1) matrix X_LDA = XB_LDA. Classification can be performed in any of those domains if the appropriate distance (penalized within-class covariance matrix) is applied.
With the aim of performing classification, the whole process can be summarized as follows:

1. Solve the p-OS problem as B_OS = (X⊤X + λΩ)⁻¹X⊤YΘ, where Θ are the K−1 leading eigenvectors of Y⊤X(X⊤X + λΩ)⁻¹X⊤Y.

2. Translate the data samples X into the LDA domain as X_LDA = XB_OS D, where D = A⁻¹(I_{K−1} − A²)^{−1/2}.

3. Compute the matrix M of centroids μ_k from X_LDA and Y.

4. Evaluate the distances d(x, μ_k) in the LDA domain as a function of M and X_LDA.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Optionally, build a graphical representation.
The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Sections 4.2.1, 4.2.2 and 4.2.3, respectively.
4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression
Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:
    min_{Θ∈ℝ^{K×(K−1)}, B∈ℝ^{p×(K−1)}}  ‖YΘ − XB‖²_F + λ tr(B⊤ΩB)      (4.16a)
    s.t.  n⁻¹ Θ⊤Y⊤YΘ = I_{K−1} ,      (4.16b)

where Θ are the class scores, B the regression coefficients, and ‖·‖_F is the Frobenius norm.
Though non-convex, the OS problem is readily solved by a decomposition in Θ and B: the optimal B_OS does not intervene in the optimality conditions with respect to Θ, and the optimum with respect to B is obtained in closed form as a linear combination of the optimal scores Θ (Hastie et al. 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:
1. Initialize Θ to Θ⁰ such that n⁻¹ Θ⁰⊤Y⊤YΘ⁰ = I_{K−1}.

2. Compute B = (X⊤X + λΩ)⁻¹X⊤YΘ⁰.

3. Set Θ to be the K−1 leading eigenvectors of Y⊤X(X⊤X + λΩ)⁻¹X⊤Y.

4. Compute the optimal regression coefficients

    B_OS = (X⊤X + λΩ)⁻¹X⊤YΘ .      (4.17)
Defining Θ⁰ in Step 1, instead of using directly Θ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on Θ⁰⊤Y⊤X(X⊤X + λΩ)⁻¹X⊤YΘ⁰, which is computed as Θ⁰⊤Y⊤XB, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring as an eigenvector decomposition is detailed and justified in Appendix B.
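The four steps above can be sketched in a few lines of numpy. Everything here is illustrative rather than the GLOSS implementation: the function name `penalized_os`, the random initialization of Θ⁰ (orthonormalized under the metric n⁻¹Y⊤Y, within the subspace orthogonal to the constant score), and the assumption of a ridge penalty Ω = I.

```python
import numpy as np

def penalized_os(X, Y, lam, rng=None):
    """Sketch of the four-step penalized OS solver with Omega = I (assumption)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, p = X.shape
    K = Y.shape[1]
    D = (Y.T @ Y) / n                              # n^{-1} Y'Y, diagonal, full rank
    one = np.ones(K)
    # Step 1: Theta0 with Theta0' D Theta0 = I_{K-1}, D-orthogonal to the constant score
    R = rng.standard_normal((K, K - 1))
    R -= np.outer(one, one @ D @ R) / (one @ D @ one)
    L = np.linalg.cholesky(R.T @ D @ R)
    Theta0 = R @ np.linalg.inv(L).T
    # Step 2: penalized regression of Y Theta0 on X
    A = X.T @ X + lam * np.eye(p)
    B = np.linalg.solve(A, X.T @ Y @ Theta0)
    # Step 3: eigen-analysis of the small (K-1)x(K-1) matrix Theta0' Y'X B
    M = Theta0.T @ Y.T @ X @ B                     # symmetric positive semidefinite
    evals, V = np.linalg.eigh((M + M.T) / 2)
    V = V[:, np.argsort(evals)[::-1]]              # sort eigenvectors, leading first
    # Step 4: rotate to the optimal scores and regression coefficients
    Theta = Theta0 @ V
    B_os = B @ V                                   # equals (X'X + lam I)^{-1} X'Y Theta
    return Theta, B_os
```

Note that Step 4 reuses the regression of Step 2 (B_os = BV), so the p×p system is factorized only once.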
This four-step algorithm is valid when the penalty is of the form tr(B⊤ΩB). However, when an L1 penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and Θ. That situation is developed by Clemmensen et al. (2011), where a Lasso or an Elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and Elastic net penalties do not enjoy the equivalence with LDA problems.
4.2.2 Distance Evaluation
The simplest classification rule is the nearest centroid rule, where sample x_i is assigned to class k if x_i is closer (in terms of the shared within-class Mahalanobis distance) to centroid μ_k than to any other centroid μ_ℓ. In general, the parameters of the model are unknown, and the rule is applied with the parameters estimated from training data (sample estimators μ̂_k and Σ̂_W). If μ_k are the centroids in the input space, sample x_i is assigned to class k if the distance
    d(x_i, μ_k) = (x_i − μ_k)⊤ Σ_WΩ⁻¹ (x_i − μ_k) − 2 log(n_k/n)      (4.18)
is minimized among all k. In expression (4.18), the first term is the Mahalanobis distance in the input space and the second term is an adjustment for unequal class sizes that estimates the prior probability of class k. Note that this is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al. 2009; Mai et al. 2012). The matrix Σ_WΩ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed in a penalized and a non-penalized component:
    Σ_WΩ⁻¹ = ( n⁻¹(X⊤X + λΩ) − Σ_B )⁻¹
           = ( n⁻¹X⊤X − Σ_B + n⁻¹λΩ )⁻¹
           = ( Σ_W + n⁻¹λΩ )⁻¹ .      (4.19)
Before explaining how to compute the distances, let us summarize some clarifying points:

- The solution B_OS of the p-OS problem is enough to accomplish classification.

- In the LDA domain (the space of discriminant variates X_LDA), classification is based on Euclidean distances.

- Classification can be done in a reduced-rank space of dimension R < K−1 by using the first R discriminant directions {β_k}_{k=1}^R.
As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain:

    ‖(x_i − μ_k)B_OS‖²_{Σ_WΩ} − 2 log(π̂_k) ,

where π̂_k is the estimated class prior and ‖·‖_S is the Mahalanobis distance assuming within-class covariance S. If classification is done in the p-LDA domain:

    ‖(x_i − μ_k)B_OS A⁻¹(I_{K−1} − A²)^{−1/2}‖²₂ − 2 log(π̂_k) ,

which is a plain Euclidean distance.
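The distance (4.18) with its class-size adjustment can be sketched as follows; `lda_distances` is our own illustrative name, and the covariance argument stands for the penalized within-class estimate of (4.19):

```python
import numpy as np

def lda_distances(X, centroids, Sigma_W_Omega, priors):
    """d(x_i, mu_k) = (x_i - mu_k)' Sigma^{-1} (x_i - mu_k) - 2*log(prior_k):
    the (penalized) Mahalanobis distance plus the class-size adjustment."""
    Sinv = np.linalg.inv(Sigma_W_Omega)
    n, K = X.shape[0], centroids.shape[0]
    d = np.empty((n, K))
    for k in range(K):
        diff = X - centroids[k]
        # row-wise quadratic form diff_i' Sinv diff_i
        d[:, k] = np.einsum('ij,jl,il->i', diff, Sinv, diff) - 2.0 * np.log(priors[k])
    return d

# nearest centroid assignment: the class minimizing the distance
# labels_hat = lda_distances(X, M, S, pri).argmin(axis=1)
```

With the identity covariance and equal priors, this reduces to plain squared Euclidean nearest-centroid classification.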
4.2.3 Posterior Probability Evaluation
Let d(x, μ_k) be the distance between x and μ_k defined as in (4.18); under the assumption that classes are Gaussian, the posterior probabilities p(y_k = 1|x) can be estimated as

    p(y_k = 1|x) ∝ exp( −d(x, μ_k)/2 )
                 ∝ π̂_k exp( −(1/2) ‖(x − μ_k)B_OS A⁻¹(I_{K−1} − A²)^{−1/2}‖²₂ ) .      (4.20)
Those probabilities must be normalized to ensure that they sum to one. When the distances d(x, μ_k) take large values, exp(−d(x, μ_k)/2) can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:

    p(y_k = 1|x) = π̂_k exp(−d(x, μ_k)/2) / ∑_ℓ π̂_ℓ exp(−d(x, μ_ℓ)/2)
                 = π̂_k exp(−(d(x, μ_k) − d_max)/2) / ∑_ℓ π̂_ℓ exp(−(d(x, μ_ℓ) − d_max)/2) ,

where d_max = max_k d(x, μ_k).
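Since shifting every distance by a common per-sample offset leaves the normalized posteriors unchanged, any such constant works; the sketch below subtracts the smallest distance, which keeps all exponents non-positive (the function name `posteriors` is ours):

```python
import numpy as np

def posteriors(d, priors):
    """Underflow-safe posteriors from an (n, K) distance matrix.
    Subtracting the per-sample minimum distance does not change the
    normalized probabilities, but bounds all exponents by zero."""
    d_shifted = d - d.min(axis=1, keepdims=True)   # exponents in (-inf, 0]
    w = priors * np.exp(-0.5 * d_shifted)
    return w / w.sum(axis=1, keepdims=True)
```

With distances around 2000, the naive ratio of exponentials would evaluate to 0/0, while the shifted version remains well defined.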
4.2.4 Graphical Representation
Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. That can be accomplished by plotting the first two or three dimensions of the regression fits X_OS or of the discriminant variates X_LDA, depending on whether we present the dataset in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.
4.3 From Sparse Optimal Scoring to Sparse LDA
The equivalence stated in Section 4.1 holds for quadratic penalties of the form β⊤Ωβ, under the assumption that Y⊤Y and X⊤X + λΩ are full rank (fulfilled when there are no empty classes and Ω is positive definite). Quadratic penalties have interesting properties, but, as recalled in Section 2.3, they do not induce sparsity. In this respect, L1 penalties are preferable, but they lack a connection, such as the one stated by Hastie et al. (1995), between p-LDA and p-OS.
In this section, we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces groups of zeros in the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form tr(B⊤ΩB).
4.3.1 A Quadratic Variational Form
Quadratic variational forms of the Lasso and group-Lasso have been proposed shortly after the original Lasso paper (Tibshirani 1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet 1998; Canu and Grandvalet 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty; they are now often outperformed by more efficient strategies (Bach et al. 2012).
Our formulation of the group-Lasso is shown below:

    min_{τ∈ℝᵖ} min_{B∈ℝ^{p×(K−1)}}  J(B) + λ ∑_{j=1}^p w²_j ‖β^j‖²₂ / τ_j      (4.21a)
    s.t.  ∑_j τ_j − ∑_j w_j ‖β^j‖₂ ≤ 0      (4.21b)
          τ_j ≥ 0 ,  j = 1, …, p ,      (4.21c)

where B ∈ ℝ^{p×(K−1)} is a matrix composed of row vectors β^j ∈ ℝ^{K−1}, B = (β^{1⊤}, …, β^{p⊤})⊤, and the w_j are predefined nonnegative weights. The cost function J(B) is, in our context, the OS regression ‖YΘ − XB‖²₂; from now on, for the sake of simplicity, we simply write J(B). Here and in what follows, b/τ is defined by continuation at zero: b/0 = +∞ if b ≠ 0 and 0/0 = 0. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet 1999; Bach et al. 2012, and references therein).
The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic expression into the convex hull of a family of quadratic penalties, indexed by the variables τ_j. This is graphically shown in Figure 4.1.
Let us start by proving the equivalence of our variational formulation and the standard group-Lasso (an alternative variational formulation is detailed and demonstrated in Appendix D).
Lemma 4.1. The quadratic penalty in β^j in (4.21) acts as the group-Lasso penalty λ ∑_{j=1}^p w_j ‖β^j‖₂.
Proof. The Lagrangian of Problem (4.21) is

    L = J(B) + λ ∑_{j=1}^p w²_j ‖β^j‖²₂ / τ_j + ν₀ ( ∑_{j=1}^p τ_j − ∑_{j=1}^p w_j ‖β^j‖₂ ) − ∑_{j=1}^p ν_j τ_j .
Figure 4.1: Graphical representation of the variational approach to the group-Lasso.
Thus, the first-order optimality conditions for τ_j are

    ∂L/∂τ_j (τ*_j) = 0 ⇔ −λ w²_j ‖β^j‖²₂ / τ*_j² + ν₀ − ν_j = 0
                     ⇔ −λ w²_j ‖β^j‖²₂ + ν₀ τ*_j² − ν_j τ*_j² = 0
                     ⇒ −λ w²_j ‖β^j‖²₂ + ν₀ τ*_j² = 0 .
The last line is obtained from complementary slackness, which implies here ν_j τ*_j = 0 (complementary slackness states that ν_j g_j(τ*_j) = 0, where ν_j is the Lagrange multiplier for the constraint g_j(τ_j) ≤ 0). As a result, the optimal value of τ_j is

    τ*_j = √( λ w²_j ‖β^j‖²₂ / ν₀ ) = √(λ/ν₀) w_j ‖β^j‖₂ .      (4.22)
We note that ν₀ ≠ 0 if there is at least one coefficient β_jk ≠ 0; thus the inequality constraint (4.21b) is at bound (due to complementary slackness):

    ∑_{j=1}^p τ*_j − ∑_{j=1}^p w_j ‖β^j‖₂ = 0 ,      (4.23)
so that τj = wj∥∥βj∥∥
2 Using this value into (421a) it is possible to conclude that
Problem (421) is equivalent to the standard group-Lasso operator
minBisinRptimesM
J(B) + λ
psumj=1
wj∥∥βj∥∥
2 (424)
So we have presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.
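Lemma 4.1 is easy to check numerically: plugging the optimal τ*_j = w_j‖β^j‖₂ into the quadratic penalty collapses it to the group-Lasso penalty. A small illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
p, Km1 = 5, 2
B = rng.standard_normal((p, Km1))          # row vectors beta^j (all non-zero a.s.)
w = rng.uniform(0.5, 2.0, size=p)          # predefined nonnegative weights
lam = 0.3

row_norms = np.linalg.norm(B, axis=1)      # ||beta^j||_2
tau = w * row_norms                        # optimal tau_j from (4.22)-(4.23)
quad_penalty = lam * np.sum(w**2 * row_norms**2 / tau)
group_lasso_penalty = lam * np.sum(w * row_norms)
assert np.isclose(quad_penalty, group_lasso_penalty)
```

The quadratic form is what makes the adaptive diagonal penalty Ω of the next paragraph well defined.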
With Lemma 4.1, we have proved that, under constraints (4.21b)-(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently presented as λ tr(B⊤ΩB), where

    Ω = diag( w²₁/τ₁, w²₂/τ₂, …, w²_p/τ_p ) ,      (4.25)

with τ_j = w_j ‖β^j‖₂, resulting in the diagonal components of Ω:

    (Ω)_jj = w_j / ‖β^j‖₂ .      (4.26)
And, as stated at the beginning of this section, the equivalence between p-LDA problems and p-OS problems is demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set algorithm described in Chapter 5.
The first property states that the quadratic formulation is convex when J is convex, thus providing an easy control of optimality and convergence.
Lemma 4.2. If J is convex, Problem (4.21) is convex.

Proof. The function g(β, τ) = ‖β‖²₂/τ, known as the perspective function of f(β) = ‖β‖²₂, is convex in (β, τ) (see e.g. Boyd and Vandenberghe 2004, Chapter 3), and the constraints (4.21b)-(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to (B, τ).
In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.
Lemma 4.3. For all B ∈ ℝ^{p×(K−1)}, the subdifferential of the objective function of Problem (4.24) is

    { V ∈ ℝ^{p×(K−1)} : V = ∂J(B)/∂B + λG } ,      (4.27)

where G ∈ ℝ^{p×(K−1)} is a matrix composed of row vectors g^j ∈ ℝ^{K−1}, G = (g^{1⊤}, …, g^{p⊤})⊤, defined as follows. Let S(B) denote the support of B over its rows, S(B) = { j ∈ {1, …, p} : ‖β^j‖₂ ≠ 0 }; then we have

    ∀j ∈ S(B),  g^j = w_j ‖β^j‖₂⁻¹ β^j ,      (4.28)
    ∀j ∉ S(B),  ‖g^j‖₂ ≤ w_j .      (4.29)
This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.
Proof. When ‖β^j‖₂ ≠ 0, the gradient of the penalty with respect to β^j is

    ∂( λ ∑_{m=1}^p w_m ‖β^m‖₂ )/∂β^j = λ w_j β^j / ‖β^j‖₂ .      (4.30)

At ‖β^j‖₂ = 0, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al. 2011):

    ∂_{β^j}( λ ∑_{m=1}^p w_m ‖β^m‖₂ ) = ∂_{β^j}( λ w_j ‖β^j‖₂ ) = { λ w_j v ∈ ℝ^{K−1} : ‖v‖₂ ≤ 1 } ,      (4.31)

which gives expression (4.29).
Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points B* of the objective function verifying the following conditions are global minima:

    ∀j ∈ S*,  ∂J(B*)/∂β^j + λ w_j ‖β*^j‖₂⁻¹ β*^j = 0 ,      (4.32a)
    ∀j ∉ S*,  ‖∂J(B*)/∂β^j‖₂ ≤ λ w_j ,      (4.32b)

where S* ⊆ {1, …, p} denotes the set of non-zero row vectors β*^j, and S̄* is its complement.
Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with the direct analysis of the variational problem (4.21).
4.3.2 Group-Lasso OS as Penalized LDA
With all the previous ingredients the group-Lasso Optimal Scoring Solver for per-forming sparse LDA can be introduced
Proposition 4.1. The group-Lasso OS problem

    B_OS = argmin_{B∈ℝ^{p×(K−1)}} min_{Θ∈ℝ^{K×(K−1)}}  (1/2) ‖YΘ − XB‖²_F + λ ∑_{j=1}^p w_j ‖β^j‖₂
    s.t.  n⁻¹ Θ⊤Y⊤YΘ = I_{K−1}

is equivalent to the penalized LDA problem

    B_LDA = argmax_{B∈ℝ^{p×(K−1)}}  tr(B⊤Σ_B B)
    s.t.  B⊤(Σ_W + n⁻¹λΩ)B = I_{K−1} ,

where Ω = diag( w²₁/τ₁, …, w²_p/τ_p ), with

    Ω_jj = +∞  if β^j_os = 0 ,   Ω_jj = w_j ‖β^j_os‖₂⁻¹  otherwise.      (4.33)

That is, B_LDA = B_OS diag( α_k⁻¹(1−α²_k)^{−1/2} ), where α_k ∈ (0, 1) is the kth leading eigenvalue of

    n⁻¹ Y⊤X(X⊤X + λΩ)⁻¹X⊤Y .
Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.
The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al. 2008; Clemmensen et al. 2011) for K = 2, that is, for binary classification, or more generally for a single discriminant direction. Note however that it leads to a slightly different decision rule if the decision threshold is chosen a priori, according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold any more, since the Lasso penalty does not result in an equivalent quadratic penalty of the simple form tr(B⊤ΩB).
5 GLOSS Algorithm
The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems whose sizes are incrementally increased or decreased (Osborne et al. 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer 2008). We adapt this algorithmic framework to the variational form (4.21), with J(B) = (1/2) ‖YΘ − XB‖²₂.
The algorithm belongs to the working set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables, currently identified as non-zero. Then, it iterates the three steps summarized below:

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth. First, the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more β^j may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to the inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.
This mechanism is graphically represented in Figure 5.1 as a block diagram, and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach from Appendix D, then we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.
5.1 Regression Coefficients Updates
Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving K−1 independent card(A)-dimensional problems instead of a single (K−1)×card(A)-dimensional problem. The interaction between the K−1 problems is relegated to the common adaptive quadratic penalty Ω. This decomposition is especially attractive, as we then solve K−1 similar systems:

    (X_A⊤X_A + λΩ) β_k = X_A⊤Y θ⁰_k ,      (5.1)
5 GLOSS Algorithm
Figure 5.1: GLOSS block diagram. The flowchart depicts the three-step active set loop: solve the p-OS problem on the active set, move vanished variables to the inactive set, then test the second optimality condition on the inactive set and either add the worst violator to the active set or stop.
Algorithm 1: Adaptively Penalized Optimal Scoring

Input: X, Y, B, λ
Initialize: A ← { j ∈ {1, …, p} : ||β^j||_2 > 0 },
  Θ^0 such that n^{-1} Θ^0T Y^T Y Θ^0 = I_{K−1}, convergence ← false
repeat
  // Step 1: solve (4.21) in B, assuming A optimal
  repeat
    Ω ← diag(Ω_A), with ω_j ← ||β^j||_2^{-1}
    B_A ← (X_A^T X_A + λΩ)^{-1} X_A^T Y Θ^0
  until condition (4.32a) holds for all j ∈ A
  // Step 2: identify inactivated variables
  for all j ∈ A such that ||β^j||_2 = 0 do
    if optimality condition (4.32b) holds then
      A ← A \ {j}; go back to Step 1
    end if
  end for
  // Step 3: check the greatest violation of condition (4.32b) outside A
  j* ← argmax_{j ∉ A} ||∂J/∂β^j||_2
  if ||∂J/∂β^{j*}||_2 < λ then
    convergence ← true   // B is optimal
  else
    A ← A ∪ {j*}
  end if
until convergence
(s, V) ← eigenanalyze(Θ^0T Y^T X_A B), that is, Θ^0T Y^T X_A B v_k = s_k v_k, k = 1, …, K−1
Θ ← Θ^0 V; B ← B V; α_k ← n^{-1/2} s_k^{1/2}, k = 1, …, K−1
Output: Θ, B, α
where X_A denotes the columns of X indexed by A, and β_k and θ_k^0 denote the k-th columns of B and Θ^0, respectively. These linear systems only differ in their right-hand-side term, so that a single Cholesky decomposition suffices to solve all of them, whereas a blockwise Newton–Raphson method based on the standard group-Lasso formulation would result in different "penalties" Ω for each system.
5.1.1 Cholesky Decomposition
Dropping the subscripts and considering the (K − 1) systems together, (5.1) leads to

(X^T X + λΩ) B = X^T Y Θ .   (5.2)

Defining the Cholesky decomposition as C^T C = (X^T X + λΩ), (5.2) is solved efficiently as follows:

C^T C B = X^T Y Θ
C B = C^T \ X^T Y Θ
B = C \ (C^T \ X^T Y Θ) ,   (5.3)

where the symbol "\" is the MATLAB mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
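The shared factorization can be sketched as follows in Python/NumPy (the actual GLOSS package is MATLAB code; data, sizes and names here are illustrative):

```python
import numpy as np

# Sketch of the update (5.3) with a single Cholesky factorization shared by
# the K-1 systems; synthetic data, illustrative names. NumPy returns the
# lower factor C with A = C C^T, so the two triangular solves below mirror
# the two mldivide calls of (5.3).
rng = np.random.default_rng(0)
n, p, K = 50, 8, 3
X = rng.standard_normal((n, p))
Y = np.eye(K)[rng.integers(0, K, n)]             # class-indicator matrix
Theta = rng.standard_normal((K, K - 1))
lam = 0.5
Omega = np.diag(rng.uniform(1.0, 2.0, p))        # adaptive quadratic penalty

A = X.T @ X + lam * Omega                        # shared left-hand side
C = np.linalg.cholesky(A)                        # A = C C^T, C lower triangular
RHS = X.T @ Y @ Theta                            # the K-1 right-hand sides
B = np.linalg.solve(C.T, np.linalg.solve(C, RHS))
```

One factorization serves all K − 1 systems, since they only differ in their right-hand side.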
5.1.2 Numerical Stability
The OS regression coefficients are obtained by (5.2), where the penalizer Ω is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of Ω takes very large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of X^T X + λΩ. This difficulty can be avoided using the following equivalent expression:

B = Ω^{-1/2} ( Ω^{-1/2} X^T X Ω^{-1/2} + λI )^{-1} Ω^{-1/2} X^T Y Θ^0 ,   (5.4)

where the conditioning of Ω^{-1/2} X^T X Ω^{-1/2} + λI is always well-behaved, provided X is appropriately normalized (recall that 0 ≤ 1/ω_j ≤ 1). This more stable expression demands more computation and is thus reserved for cases with large ω_j values; our code is otherwise based on expression (5.2).
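The equivalence between the direct solve (5.2) and the rescaled form (5.4) can be checked numerically on synthetic values (hypothetical sizes and names):

```python
import numpy as np

# Equivalence of the direct solve (5.2) and the rescaled form (5.4), with
# some very large omega_j mimicking a variable about to leave the active set.
rng = np.random.default_rng(1)
n, p = 40, 6
X = rng.standard_normal((n, p))
R = rng.standard_normal((n, 2))                  # stands for Y @ Theta0
lam = 0.1
omega = rng.uniform(1.0, 1e4, p)                 # adaptive penalty entries

B_direct = np.linalg.solve(X.T @ X + lam * np.diag(omega), X.T @ R)

S = np.diag(omega ** -0.5)                       # Omega^{-1/2}
M = S @ X.T @ X @ S + lam * np.eye(p)            # well-conditioned system
B_stable = S @ np.linalg.solve(M, S @ X.T @ R)

assert np.allclose(B_direct, B_stable)
```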
5.2 Score Matrix
The optimal score matrix Θ is made of the K − 1 leading eigenvectors of Y^T X (X^T X + Ω)^{-1} X^T Y. This eigen-analysis is actually solved in the form Θ^T Y^T X (X^T X + Ω)^{-1} X^T Y Θ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of (X^T X + Ω)^{-1} that
involves the inversion of an n × n matrix. Let Θ^0 be an arbitrary K × (K − 1) matrix whose range includes the K − 1 leading eigenvectors of Y^T X (X^T X + Ω)^{-1} X^T Y.¹ Then, solving the K − 1 systems (5.3) provides the value of B^0 = (X^T X + λΩ)^{-1} X^T Y Θ^0. This B^0 matrix can be identified in the expression to eigenanalyze as

Θ^0T Y^T X (X^T X + Ω)^{-1} X^T Y Θ^0 = Θ^0T Y^T X B^0 .

Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the (K − 1) × (K − 1) matrix Θ^0T Y^T X B^0 = V Λ V^T. Defining Θ = Θ^0 V, we have Θ^T Y^T X (X^T X + Ω)^{-1} X^T Y Θ = Λ, and when Θ^0 is chosen such that n^{-1} Θ^0T Y^T Y Θ^0 = I_{K−1}, we also have n^{-1} Θ^T Y^T Y Θ = I_{K−1}, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of Λ are sorted in decreasing order, θ_k is an optimal solution to the p-OS problem. Finally, once Θ has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered using the mapping from Θ^0 to Θ, that is, B = B^0 V. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which Ω is defined by a variational formulation.
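The whole procedure can be sketched in a few lines (a Python/NumPy stand-in for the MATLAB code; data and names are illustrative):

```python
import numpy as np

# Sketch of the score-matrix computation of Section 5.2: the eigen-analysis
# is carried out on the small (K-1)x(K-1) matrix Theta0' Y' X B0 instead of
# the large matrix involving (X'X + Omega)^{-1}.
rng = np.random.default_rng(2)
n, p, K = 60, 5, 4
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)      # X centered
Y = np.eye(K)[np.arange(n) % K]                           # balanced classes
Omega = np.eye(p)

# Theta0: orthonormal columns orthogonal to 1_K, rescaled so that
# n^{-1} Theta0' Y'Y Theta0 = I (Y'Y is diagonal with the class counts).
U = np.linalg.qr(np.c_[np.ones(K), rng.standard_normal((K, K - 1))])[0][:, 1:]
Theta0 = np.diag((Y.sum(axis=0) / n) ** -0.5) @ U

B0 = np.linalg.solve(X.T @ X + Omega, X.T @ Y @ Theta0)   # the K-1 systems
M = Theta0.T @ Y.T @ X @ B0                               # small symmetric PSD
lam_, V = np.linalg.eigh(M)
order = np.argsort(lam_)[::-1]                            # decreasing eigenvalues
Theta, B = Theta0 @ V[:, order], B0 @ V[:, order]         # mapped solutions
```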
5.3 Optimality Conditions
GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix Θ. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4. Optimality conditions (4.32a) and (4.32b) can be deduced from those lemmas. Both expressions require the computation of the gradient of the objective function

(1/2) ||Y Θ − X B||_2^2 + λ Σ_{j=1}^{p} w_j ||β^j||_2 .   (5.5)

Let J(B) be the data-fitting term (1/2) ||Y Θ − X B||_2^2. Its gradient with respect to the j-th row of B, β^j, is the (K − 1)-dimensional vector

∂J(B)/∂β^j = x_j^T (X B − Y Θ) ,

where x_j is the j-th column of X. Hence, the first optimality condition (4.32a) can be computed for every variable j as

x_j^T (X B − Y Θ) + λ w_j β^j / ||β^j||_2 = 0 .
1. As X is centered, 1_K belongs to the null space of Y^T X (X^T X + Ω)^{-1} X^T Y. It is thus sufficient to choose Θ^0 orthogonal to 1_K to ensure that its range spans the leading eigenvectors of Y^T X (X^T X + Ω)^{-1} X^T Y. In practice, to comply with this desideratum and conditions (3.5b) and (3.5c), we set Θ^0 = (Y^T Y)^{-1/2} U, where U is a K × (K − 1) matrix whose columns are orthonormal vectors orthogonal to 1_K.
The second optimality condition (4.32b) can be computed for every variable j as

||x_j^T (X B − Y Θ)||_2 ≤ λ w_j .
5.4 Active and Inactive Sets
The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set, gathering the variables that have already been considered relevant. A variable j can be considered for inclusion into the active set if it violates the second optimality condition. We proceed one variable at a time, choosing the one that is expected to produce the greatest decrease in the objective function:

j* = argmax_j max( ||x_j^T (X B − Y Θ)||_2 − λ w_j , 0 ) .

The exclusion of a variable belonging to the active set A is considered if the norm ||β^j||_2 is small and if, after setting β^j to zero, the following optimality condition holds:

||x_j^T (X B − Y Θ)||_2 ≤ λ w_j .

The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.
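The inclusion test can be sketched as follows (synthetic data and illustrative names; here B = 0, as at the start of the path):

```python
import numpy as np

# Pick the inactive variable with the greatest violation of the condition
# ||x_j'(XB - Y Theta)||_2 <= lambda * w_j (sketch, not the MATLAB code).
rng = np.random.default_rng(3)
n, p, K = 30, 10, 3
X = rng.standard_normal((n, p))
YTheta = rng.standard_normal((n, K - 1))         # stands for Y @ Theta
B = np.zeros((p, K - 1))                         # null solution
w = np.ones(p)                                   # penalty weights
lam = 1.0

G = X.T @ (X @ B - YTheta)                       # gradients, one row per variable
norms = np.linalg.norm(G, axis=1)
violation = np.maximum(norms - lam * w, 0.0)
j_star = int(np.argmax(violation))               # candidate for the active set
include = bool(violation[j_star] > 0)
```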
5.5 Penalty Parameter
The penalty parameter can be specified by the user, in which case GLOSS solves the problem with this value of λ. The other strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value of the penalty parameter, λ_max, such that B ≠ 0, and solves the p-OS problem for decreasing values of λ, until a prescribed number of features are declared active.

The maximum value of the penalty parameter λ_max, corresponding to a null B matrix, is obtained by evaluating the optimality condition (4.32b) at B = 0:

λ_max = max_{j ∈ {1,…,p}} (1/w_j) ||x_j^T Y Θ^0||_2 .

The algorithm then computes a series of solutions along the regularization path defined by a series of penalties λ_1 = λ_max > ⋯ > λ_t > ⋯ > λ_T = λ_min ≥ 0, regularly decreasing the penalty, λ_{t+1} = λ_t / 2, and using a warm-start strategy, where the feasible initial guess for B(λ_{t+1}) is initialized with B(λ_t). The final penalty parameter λ_min is reached in the optimization process when the maximum number of desired active variables is attained (by default, the minimum of n and p).
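The path initialization can be sketched as follows (synthetic data; the warm-started solves themselves are omitted):

```python
import numpy as np

# Sketch: lambda_max from condition (4.32b) at B = 0, then a halving
# schedule lambda_{t+1} = lambda_t / 2 (illustrative names and data).
rng = np.random.default_rng(4)
n, p = 30, 12
X = rng.standard_normal((n, p))
YTheta0 = rng.standard_normal((n, 2))            # stands for Y @ Theta0
w = np.ones(p)                                   # penalty weights

norms = np.array([np.linalg.norm(X[:, j] @ YTheta0) for j in range(p)])
lam_max = (norms / w).max()                      # smallest lambda giving B = 0
path = [lam_max / 2 ** t for t in range(8)]      # decreasing penalties
```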
5.6 Options and Variants

5.6.1 Scaling Variables

Like most penalization schemes, GLOSS is sensitive to the scaling of variables. It thus makes sense to normalize them before applying the algorithm or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.
5.6.2 Sparse Variant

This version replaces some MATLAB commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.
5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis. Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.
The equivalence proof between penalized OS and penalized LDA (Hastie et al., 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS, corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems

min_{B ∈ R^{p×(K−1)}} ||Y Θ − X B||_F^2 = min_{B ∈ R^{p×(K−1)}} tr( Θ^T Y^T Y Θ − 2 Θ^T Y^T X B + n B^T Σ_T B )

are replaced by

min_{B ∈ R^{p×(K−1)}} tr( Θ^T Y^T Y Θ − 2 Θ^T Y^T X B + n B^T (Σ_B + diag(Σ_W)) B ) .

Note that this variant only requires diag(Σ_W) + Σ_B + n^{-1} Ω to be positive definite, which is a weaker requirement than Σ_T + n^{-1} Ω positive definite.
5.6.4 Elastic Net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition
Pixel grid (numbered from the bottom-left corner):

7 8 9
4 5 6
1 2 3

Ω_L =
[  3  -1   0  -1  -1   0   0   0   0
  -1   5  -1  -1  -1  -1   0   0   0
   0  -1   3   0  -1  -1   0   0   0
  -1  -1   0   5  -1   0  -1  -1   0
  -1  -1  -1  -1   8  -1  -1  -1  -1
   0  -1  -1   0  -1   5   0  -1  -1
   0   0   0  -1  -1   0   3  -1   0
   0   0   0  -1  -1  -1  -1   5  -1
   0   0   0   0  -1  -1   0  -1   3 ]

Figure 5.2: Graph and Laplacian matrix for a 3 × 3 image.
for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of the pixels of a 3 × 3 image, with the corresponding Laplacian matrix. The Laplacian matrix Ω_L is positive semi-definite, and the penalty β^T Ω_L β favors, among vectors of identical L_2 norm, the ones having similar coefficients in the neighborhoods of the graph. For example, this penalty is 9 for the vector (1, 1, 0, 1, 1, 0, 0, 0, 0)^T, which is the indicator of the neighbors of pixel 1, and it is 17 for the vector (−1, 1, 0, 1, 1, 0, 0, 0, 0)^T, with a sign mismatch between pixel 1 and its neighborhood.

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty just has to be added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.
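The penalty value for the indicator vector can be checked directly from the Laplacian of Figure 5.2 (a small sketch):

```python
import numpy as np

# Rebuild the 3x3-image Laplacian of Figure 5.2 from the 8-neighbour grid
# graph, and evaluate the smoothness penalty beta' Omega_L beta for the
# indicator of pixel 1 and its neighbours.
coords = [(r, c) for r in range(3) for c in range(3)]    # the 9 pixels
L = np.zeros((9, 9))
for i, (r1, c1) in enumerate(coords):
    for j, (r2, c2) in enumerate(coords):
        if i != j and abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1:
            L[i, j] = -1.0                               # neighbouring pixels
np.fill_diagonal(L, -L.sum(axis=1))                      # degrees on diagonal

beta = np.array([1, 1, 0, 1, 1, 0, 0, 0, 0], dtype=float)
penalty = float(beta @ L @ beta)                         # -> 9.0
```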
6 Experimental Results
This section presents some comparison results between the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA. Those algorithms are Penalized LDA (PLDA) (Witten and Tibshirani, 2011), which applies a Lasso penalty within a Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al., 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing parsimony capacities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites: PLDA is an R implementation, and SLDA is coded in MATLAB. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.
6.1 Normalization

With shrunken estimates, the scaling of features has important consequences. For the linear discriminants considered here, the two most common normalization strategies consist in setting to ones either the diagonal of the total covariance matrix Σ_T or the diagonal of the within-class covariance matrix Σ_W. These options can be implemented either by scaling the observations accordingly, prior to the analysis, or by providing penalties with weights. The latter option is implemented in our MATLAB package.¹
6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, Chapter 4) suggest investigating decision thresholds other than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.
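As an illustration of this empirical thresholding, a sketch with synthetic one-dimensional discriminant scores for two classes (hypothetical data, not from the experiments):

```python
import numpy as np

# Pick the decision threshold that minimizes the training error, rather
# than the cut implied by the Gaussian mixture assumption.
rng = np.random.default_rng(5)
scores = np.r_[rng.normal(-1.0, 1.0, 100), rng.normal(1.5, 1.0, 100)]
labels = np.r_[np.zeros(100), np.ones(100)]

cuts = np.sort(scores)                           # candidate thresholds
errors = np.array([np.mean((scores > c) != (labels == 1)) for c in cuts])
best_cut = cuts[int(np.argmin(errors))]          # empirical-risk threshold
```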
1. The GLOSS MATLAB code can be found in the software section of www.hds.utc.fr/~grandval.
6.3 Simulated Data

We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split into a training set of size n = 100, a validation set of size 100, and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations, except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be, respectively, 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is:
Simulation 1: Mean shift with independent features. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), where μ_1j = 0.7 × 1_(1≤j≤25), μ_2j = 0.7 × 1_(26≤j≤50), μ_3j = 0.7 × 1_(51≤j≤75), μ_4j = 0.7 × 1_(76≤j≤100).

Simulation 2: Mean shift with dependent features. There are two classes. If sample i is in class 1, then x_i ∼ N(0, Σ), and if i is in class 2, then x_i ∼ N(μ, Σ), with μ_j = 0.6 × 1_(j≤200). The covariance structure is block diagonal, with 5 blocks, each of dimension 100 × 100. The blocks have (j, j′) element 0.6^{|j−j′|}. This covariance structure is intended to mimic gene expression data correlation.

Simulation 3: One-dimensional mean shift with independent features. There are four classes and the features are independent. If sample i is in class k, then X_ij ∼ N((k−1)/3, 1) if j ≤ 100, and X_ij ∼ N(0, 1) otherwise.

Simulation 4: Mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), with mean vectors defined as follows: μ_1j ∼ N(0, 0.3²) for j ≤ 25 and μ_1j = 0 otherwise; μ_2j ∼ N(0, 0.3²) for 26 ≤ j ≤ 50 and μ_2j = 0 otherwise; μ_3j ∼ N(0, 0.3²) for 51 ≤ j ≤ 75 and μ_3j = 0 otherwise; μ_4j ∼ N(0, 0.3²) for 76 ≤ j ≤ 100 and μ_4j = 0 otherwise.
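For instance, Simulation 1 can be generated along these lines (a sketch; the function name and seed are ours):

```python
import numpy as np

# Simulation 1: four classes, p = 500 independent N(0,1) features, mean
# shifts of 0.7 on disjoint blocks of 25 variables (variables 1-100 relevant).
def simulate1(n, p=500, K=4, seed=6):
    rng = np.random.default_rng(seed)
    mu = np.zeros((K, p))
    for k in range(K):
        mu[k, 25 * k:25 * (k + 1)] = 0.7          # class-specific mean block
    y = rng.integers(0, K, n)                     # class labels
    X = mu[y] + rng.standard_normal((n, p))       # mean shift plus noise
    return X, y

X, y = simulate1(100)
```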
Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA, in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.

The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far away by SLDA, which is the only
Table 6.1: Experimental results for simulated data: averages (with standard deviations) computed over 25 repetitions of the test error rate, the number of selected variables, and the number of discriminant directions selected on the validation set.

                 Err (%)        Var            Dir
Sim 1: K = 4, mean shift, ind. features
  PLDA         12.6 (0.1)    411.7 (3.7)     3.0 (0.0)
  SLDA         31.9 (0.1)    228.0 (0.2)     3.0 (0.0)
  GLOSS        19.9 (0.1)    106.4 (1.3)     3.0 (0.0)
  GLOSS-D      11.2 (0.1)    251.1 (4.1)     3.0 (0.0)
Sim 2: K = 2, mean shift, dependent features
  PLDA          9.0 (0.4)    337.6 (5.7)     1.0 (0.0)
  SLDA         19.3 (0.1)     99.0 (0.0)     1.0 (0.0)
  GLOSS        15.4 (0.1)     39.8 (0.8)     1.0 (0.0)
  GLOSS-D       9.0 (0.0)    203.5 (4.0)     1.0 (0.0)
Sim 3: K = 4, 1D mean shift, ind. features
  PLDA         13.8 (0.6)    161.5 (3.7)     1.0 (0.0)
  SLDA         57.8 (0.2)    152.6 (2.0)     1.9 (0.0)
  GLOSS        31.2 (0.1)    123.8 (1.8)     1.0 (0.0)
  GLOSS-D      18.5 (0.1)    357.5 (2.8)     1.0 (0.0)
Sim 4: K = 4, mean shift, ind. features
  PLDA         60.3 (0.1)    336.0 (5.8)     3.0 (0.0)
  SLDA         65.9 (0.1)    208.8 (1.6)     2.7 (0.0)
  GLOSS        60.7 (0.2)     74.3 (2.2)     2.7 (0.0)
  GLOSS-D      58.8 (0.1)    162.7 (4.9)     2.9 (0.0)
Figure 6.1: TPR versus FPR (in %) for all algorithms (GLOSS, GLOSS-D, SLDA, PLDA) and Simulations 1 to 4.
Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

           Simulation 1    Simulation 2    Simulation 3    Simulation 4
           TPR     FPR     TPR     FPR     TPR     FPR     TPR     FPR
PLDA       99.0    78.2    96.9    60.3    98.0    15.9    74.3    65.6
SLDA       73.9    38.5    33.8    16.3    41.6    27.8    50.7    39.5
GLOSS      64.1    10.6    30.0     4.6    51.1    18.2    26.0    12.1
GLOSS-D    93.5    39.4    92.1    28.1    95.6    65.5    42.9    29.9
method that does not succeed in uncovering a low-dimensional representation in Simulation 3. The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR): the TPR is the proportion of relevant variables that are selected, and the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).
6.4 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors. It was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011). The Ramaswamy³ dataset contains 198 exam-
2. http://www.broadinstitute.org/cancer/software/genepattern/datasets
3. http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736
Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits (with standard deviations) of the test error rates and the number of selected variables.

                       Err (%)             Var
Nakayama: n = 86, p = 22,283, K = 5
  PLDA               20.95 (1.3)     10478.7 (2116.3)
  SLDA               25.71 (1.7)       252.5 (3.1)
  GLOSS              20.48 (1.4)       129.0 (18.6)
Ramaswamy: n = 198, p = 16,063, K = 14
  PLDA               38.36 (6.0)     14873.5 (720.3)
  SLDA                   —                 —
  GLOSS              20.61 (6.9)       372.4 (122.1)
Sun: n = 180, p = 54,613, K = 4
  PLDA               33.78 (5.9)     21634.8 (7443.2)
  SLDA               36.22 (6.5)       384.4 (16.5)
  GLOSS              31.77 (4.5)        93.0 (93.6)
ples of 16,063 gene expressions for categorizing 14 classes of cancer. Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.
Each dataset was split into a training set and a test set, with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.

Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution, due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.

Finally, Figure 6.2 displays the projection of the observations for the Nakayama and Sun datasets on the first canonical planes estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well-separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high collinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.
4. http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962
Figure 6.2: 2D-representations of the Nakayama and Sun datasets, based on the first two discriminant vectors provided by GLOSS and SLDA; the big squares represent class means. Nakayama classes: 1) synovial sarcoma, 2) myxoid liposarcoma, 3) dedifferentiated liposarcoma, 4) myxofibrosarcoma, 5) malignant fibrous histiocytoma. Sun classes: 1) non-tumor, 2) astrocytomas, 3) glioblastomas, 4) oligodendrogliomas.
Figure 6.3: USPS digits "1" and "0".
6.5 Correlated Data
When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily introduce this prior knowledge.

The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results are left for future work.
For this illustration, we have used a subset of the USPS handwritten digit dataset, made of 16 × 16 pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and with S-GLOSS. The mean image of each digit is shown in Figure 6.3.

As in Section 5.6.4, we have encoded the pixel proximity relationships of Figure 5.2 into a penalty matrix Ω_L, but this time on a 256-node graph. Introducing this new 256 × 256 Laplacian penalty matrix Ω_L in the GLOSS algorithm is straightforward.
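Building this 256 × 256 Laplacian is a direct extension of the 3 × 3 construction (a sketch, with our own helper name):

```python
import numpy as np

# Laplacian of the 8-neighbour grid graph of a 16x16 image: 256 nodes,
# -1 between neighbouring pixels, degrees on the diagonal.
def grid_laplacian(h, w):
    n = h * w
    L = np.zeros((n, n))
    for i in range(n):
        r1, c1 = divmod(i, w)
        for j in range(n):
            r2, c2 = divmod(j, w)
            if i != j and abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1:
                L[i, j] = -1.0                   # adjacent pixels
    np.fill_diagonal(L, -L.sum(axis=1))          # node degrees
    return L

Omega_L = grid_laplacian(16, 16)                 # 256 x 256 penalty matrix
```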
The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). We perfectly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element for discriminating both digits.

Figure 6.5 displays the discriminant directions β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels, which allows detecting strokes and will probably provide better prediction results.
Figure 6.4: Discriminant direction between digits "1" and "0" (β for GLOSS and β for S-GLOSS).
Figure 6.5: Sparse discriminant direction between digits "1" and "0" (β for GLOSS and β for S-GLOSS, λ = 0.3).
Discussion
GLOSS is an efficient algorithm that performs sparse LDA, based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem; to the best of our knowledge, it is the first approach that enjoys this property in the multi-class setting. This relationship also makes it possible to accommodate interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active set strategy, amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K − 1)-dimensional problem into (K − 1) independent p-dimensional problems. The interaction between the (K − 1) problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, regarding both its prediction abilities and its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performance. Employing the same features in all discriminant directions makes it possible to generate models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of the data from the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can be added to the group penalty to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.
Part III
Sparse Clustering Analysis
Abstract
Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records, or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussians, and the different populations are identified with different means and a common covariance matrix.

As in the supervised framework, traditional clustering techniques perform worse when the number of irrelevant features increases. In this part, we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.
7 Feature Selection in Mixture Models
7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models, which represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population, and they are especially well suited to the problem of clustering.
7.1.1 Model

We assume that the observed data X = (x_1^T, …, x_n^T)^T have been drawn identically from K different subpopulations in the domain R^p. The generative distribution is a finite mixture model; that is, the data are assumed to be generated from a compounded distribution, whose density can be expressed as

f(x_i) = Σ_{k=1}^{K} π_k f_k(x_i) , ∀i ∈ {1, …, n} ,

where K is the number of components, f_k are the densities of the components, and π_k are the mixture proportions (π_k ∈ ]0, 1[ ∀k, and Σ_k π_k = 1). Mixture models transcribe that, given the proportions π_k and the distributions f_k for each class, the data are generated according to the following mechanism:
given the proportions πk and the distributions fk for each class the data is generatedaccording to the following mechanism
bull y each individual is allotted to a class according to a multinomial distributionwith parameters π1 πK
bull x each xi is assumed to arise from a random vector with probability densityfunction fk
In addition, it is usually assumed that the component densities f_k belong to a parametric family of densities φ(·; θ_k). The density of the mixture can then be written as

f(x_i; θ) = Σ_{k=1}^{K} π_k φ(x_i; θ_k) , ∀i ∈ {1, …, n} ,
where θ = (π_1, …, π_K, θ_1, …, θ_K) is the parameter of the model.
7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of the mixture model, Pearson (1894) used the method of moments to estimate the five parameters (μ_1, μ_2, σ_1², σ_2², π) of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods, and Bayesian approaches.

The most widely used process to estimate the parameters is the maximization of the log-likelihood using the EM algorithm, which is typically employed to maximize the likelihood of models with latent variables, for which no analytical solution is available (Dempster et al., 1977).

The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the likelihood expectation with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the E-step expected likelihood.
Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed. In practice, the obtained solution depends on the initialization of the algorithm.
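A minimal EM iteration for a univariate two-component Gaussian mixture with common variance can be sketched as follows (an illustrative sketch, not the estimator developed later in this part):

```python
import numpy as np

# EM for a two-component univariate Gaussian mixture with shared variance.
# E-step: soft memberships t_ik; M-step: closed-form updates of pi, mu, var.
rng = np.random.default_rng(7)
x = np.r_[rng.normal(-2.0, 1.0, 150), rng.normal(2.0, 1.0, 150)]

pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), 1.0
loglik = []
for _ in range(50):
    dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) \
           / np.sqrt(2 * np.pi * var)            # pi_k * phi(x_i; mu_k, var)
    t = dens / dens.sum(axis=1, keepdims=True)   # E-step: responsibilities
    loglik.append(np.log(dens.sum(axis=1)).sum())
    pi = t.mean(axis=0)                          # M-step: proportions
    mu = (t * x[:, None]).sum(axis=0) / t.sum(axis=0)   # component means
    var = (t * (x[:, None] - mu) ** 2).sum() / len(x)   # common variance
```

Each iteration is guaranteed not to decrease the log-likelihood, which illustrates the convergence property stated above.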
Maximum Likelihood Definitions
The likelihood is commonly expressed in its logarithmic form:
$$\mathcal{L}(\theta;X) = \log\left(\prod_{i=1}^{n} f(x_i;\theta)\right) = \sum_{i=1}^{n}\log\left(\sum_{k=1}^{K}\pi_k f_k(x_i;\theta_k)\right), \qquad (7.1)$$
where $n$ is the number of samples, $K$ is the number of components of the mixture (or number of clusters), and $\pi_k$ are the mixture proportions.
To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations $\mathbf{x}$ and the unknown latent variables $\mathbf{y}$, which indicate the cluster membership of every sample. The pair $\mathbf{z} = (\mathbf{x},\mathbf{y})$ is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood or classification log-likelihood:
$$\mathcal{L}_C(\theta;X,Y) = \log\left(\prod_{i=1}^{n} f(x_i,y_i;\theta)\right) = \sum_{i=1}^{n}\log\left(\sum_{k=1}^{K} y_{ik}\,\pi_k f_k(x_i;\theta_k)\right) = \sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik}\log\big(\pi_k f_k(x_i;\theta_k)\big). \qquad (7.2)$$
The $y_{ik}$ are the binary entries of the indicator matrix $Y$, with $y_{ik} = 1$ if observation $i$ belongs to cluster $k$ and $y_{ik} = 0$ otherwise.
Define the soft membership $t_{ik}(\theta)$ as
$$t_{ik}(\theta) = p(Y_{ik} = 1|x_i;\theta) \qquad (7.3)$$
$$= \frac{\pi_k f_k(x_i;\theta_k)}{f(x_i;\theta)}. \qquad (7.4)$$
To lighten notations, $t_{ik}(\theta)$ will be denoted $t_{ik}$ when the parameter $\theta$ is clear from context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:
$$
\begin{aligned}
\mathcal{L}_C(\theta;X,Y) &= \sum_{i,k} y_{ik}\log\big(\pi_k f_k(x_i;\theta_k)\big)\\
&= \sum_{i,k} y_{ik}\log\big(t_{ik} f(x_i;\theta)\big)\\
&= \sum_{i,k} y_{ik}\log t_{ik} + \sum_{i,k} y_{ik}\log f(x_i;\theta)\\
&= \sum_{i,k} y_{ik}\log t_{ik} + \sum_{i=1}^{n}\log f(x_i;\theta)\\
&= \sum_{i,k} y_{ik}\log t_{ik} + \mathcal{L}(\theta;X), \qquad (7.5)
\end{aligned}
$$
where $\sum_{i,k} y_{ik}\log t_{ik}$ can be reformulated as
$$\sum_{i,k} y_{ik}\log t_{ik} = \sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik}\log\big(p(Y_{ik}=1|x_i;\theta)\big) = \sum_{i=1}^{n}\log\big(p(y_i|x_i;\theta)\big) = \log\big(p(Y|X;\theta)\big).$$
As a result, the relationship (7.5) can be rewritten as
$$\mathcal{L}(\theta;X) = \mathcal{L}_C(\theta;Z) - \log\big(p(Y|X;\theta)\big). \qquad (7.6)$$
Likelihood Maximization
The complete log-likelihood cannot be assessed because the variables $y_{ik}$ are unknown. However, it is possible to estimate its value by taking expectations in (7.6), conditionally on a current value $\theta^{(t)}$ of the parameter:
$$\mathcal{L}(\theta;X) = \underbrace{\mathbb{E}_{Y\sim p(\cdot|X;\theta^{(t)})}\big[\mathcal{L}_C(\theta;X,Y)\big]}_{Q(\theta,\theta^{(t)})} + \underbrace{\mathbb{E}_{Y\sim p(\cdot|X;\theta^{(t)})}\big[-\log p(Y|X;\theta)\big]}_{H(\theta,\theta^{(t)})}.$$
In this expression, $H(\theta,\theta^{(t)})$ is an entropy term and $Q(\theta,\theta^{(t)})$ is the conditional expectation of the complete log-likelihood. Let us define the increment of the log-likelihood as $\Delta\mathcal{L} = \mathcal{L}(\theta^{(t+1)};X) - \mathcal{L}(\theta^{(t)};X)$. Then $\theta^{(t+1)} = \arg\max_\theta Q(\theta,\theta^{(t)})$ also increases the log-likelihood:
$$\Delta\mathcal{L} = \underbrace{\big(Q(\theta^{(t+1)},\theta^{(t)}) - Q(\theta^{(t)},\theta^{(t)})\big)}_{\ge 0 \text{ by definition of iteration } t+1} - \underbrace{\big(H(\theta^{(t+1)},\theta^{(t)}) - H(\theta^{(t)},\theta^{(t)})\big)}_{\le 0 \text{ by Jensen's inequality}}.$$
Therefore, it is possible to maximize the likelihood by optimizing $Q(\theta,\theta^{(t)})$. The relationship between $Q(\theta,\theta')$ and $\mathcal{L}(\theta;X)$ is developed in deeper detail in Appendix F, which shows how the value of $\mathcal{L}(\theta;X)$ can be recovered from $Q(\theta,\theta^{(t)})$.
For the mixture model problem, $Q(\theta,\theta')$ is
$$
\begin{aligned}
Q(\theta,\theta') &= \mathbb{E}_{Y\sim p(Y|X;\theta')}\big[\mathcal{L}_C(\theta;X,Y)\big]\\
&= \sum_{i,k} p(Y_{ik}=1|x_i;\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big)\\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big). \qquad (7.7)
\end{aligned}
$$
$Q(\theta,\theta')$, due to its similitude to the expression of the complete likelihood (7.2), is also known as the weighted likelihood. In (7.7), the weights $t_{ik}(\theta')$ are the posterior probabilities of cluster membership.
Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter $\theta^{(0)}$;

• E-step: evaluation of $Q(\theta,\theta^{(t)})$, using $t_{ik}(\theta^{(t)})$ (7.4) in (7.7);

• M-step: calculation of $\theta^{(t+1)} = \arg\max_\theta Q(\theta,\theta^{(t)})$.
Gaussian Model
In the particular case of a Gaussian mixture model with common covariance matrix $\Sigma$ and distinct mean vectors $\mu_k$, the mixture density is
$$f(x_i;\theta) = \sum_{k=1}^{K}\pi_k f_k(x_i;\theta_k) = \sum_{k=1}^{K}\pi_k\,\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k)\right\}.$$
At the E-step, the posterior probabilities $t_{ik}$ are computed as in (7.4) with the current parameters $\theta^{(t)}$; then the M-step maximizes $Q(\theta,\theta^{(t)})$ (7.7), whose form is as follows:
$$
\begin{aligned}
Q(\theta,\theta^{(t)}) &= \sum_{i,k} t_{ik}\log(\pi_k) - \sum_{i,k} t_{ik}\log\big((2\pi)^{p/2}|\Sigma|^{1/2}\big) - \frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k)\\
&= \sum_{k} t_{k}\log(\pi_k) - \underbrace{\frac{np}{2}\log(2\pi)}_{\text{constant term}} - \frac{n}{2}\log(|\Sigma|) - \frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k)\\
&\equiv \sum_{k} t_{k}\log(\pi_k) - \frac{n}{2}\log(|\Sigma|) - \sum_{i,k} t_{ik}\left(\frac{1}{2}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k)\right), \qquad (7.8)
\end{aligned}
$$
where
$$t_k = \sum_{i=1}^{n} t_{ik}. \qquad (7.9)$$
The M-step, which maximizes this expression with respect to $\theta$, applies the following updates defining $\theta^{(t+1)}$:
$$\pi_k^{(t+1)} = \frac{t_k}{n}, \qquad (7.10)$$
$$\mu_k^{(t+1)} = \frac{\sum_i t_{ik}\,x_i}{t_k}, \qquad (7.11)$$
$$\Sigma^{(t+1)} = \frac{1}{n}\sum_{k} W_k, \qquad (7.12)$$
$$\text{with } W_k = \sum_{i} t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top. \qquad (7.13)$$
The derivations are detailed in Appendix G
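For concreteness, the E-step (7.4) and the M-step updates (7.10)-(7.13) can be sketched in a few lines of NumPy. This is a minimal illustration of the unpenalized algorithm, not the Mix-GLOSS implementation; the random soft initialization of the responsibilities is an assumption made for the example.

```python
import numpy as np

def em_gaussian_mixture(X, K, n_iter=50, seed=0):
    """EM for a Gaussian mixture with a common covariance matrix Sigma."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    T = rng.dirichlet(np.ones(K), size=n)       # soft memberships t_ik
    for _ in range(n_iter):
        # M-step: updates (7.10)-(7.13)
        tk = T.sum(axis=0)                      # t_k = sum_i t_ik, (7.9)
        pi = tk / n                             # (7.10)
        mu = (T.T @ X) / tk[:, None]            # (7.11)
        Sigma = np.zeros((p, p))
        for k in range(K):
            Xc = X - mu[k]
            Sigma += (T[:, k, None] * Xc).T @ Xc  # W_k, (7.13)
        Sigma /= n                              # (7.12)
        # E-step: posteriors t_ik, (7.4), computed in the log domain
        Sinv = np.linalg.inv(Sigma)
        _, logdet = np.linalg.slogdet(Sigma)
        logT = np.empty((n, K))
        for k in range(K):
            Xc = X - mu[k]
            maha = np.einsum('ij,jk,ik->i', Xc, Sinv, Xc)
            logT[:, k] = np.log(pi[k]) - 0.5 * (p * np.log(2 * np.pi)
                                                + logdet + maha)
        logT -= logT.max(axis=1, keepdims=True)  # guard against underflow
        T = np.exp(logT)
        T /= T.sum(axis=1, keepdims=True)
    return pi, mu, Sigma, T
```

As noted above, the solution depends on the initialization; in practice several restarts are performed and the run with the highest likelihood is kept.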
7.2 Feature Selection in Model-Based Clustering
When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own covariance matrix $\Sigma_k$, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.
In the high-dimensional, low-sample-size setting, numerical issues appear in the estimation of the covariance matrix. To avoid these singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989); Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition $\Sigma_k = \lambda_k D_k A_k D_k^\top$ (Banfield and Raftery, 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.
In this chapter, we review some techniques to induce sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original $p$-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.
7.2.1 Based on Penalized Likelihood
Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically carried out by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of $x$:
$$\log\left(\frac{p(Y_k=1|x)}{p(Y_\ell=1|x)}\right) = x^\top\Sigma^{-1}(\mu_k-\mu_\ell) - \frac{1}{2}(\mu_k+\mu_\ell)^\top\Sigma^{-1}(\mu_k-\mu_\ell) + \log\frac{\pi_k}{\pi_\ell}.$$
In this model, a simple way of introducing sparsity in the discriminant vectors $\Sigma^{-1}(\mu_k-\mu_\ell)$ is to constrain $\Sigma$ to be diagonal and to favor sparse means $\mu_k$. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension $j$, then variable $j$ is useless for class allocation and can be discarded. The means can be penalized by the L1 norm,
$$\lambda\sum_{k=1}^{K}\sum_{j=1}^{p}|\mu_{kj}|,$$
as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties involving full covariance matrices:
$$\lambda_1\sum_{k=1}^{K}\sum_{j=1}^{p}|\mu_{kj}| + \lambda_2\sum_{k=1}^{K}\sum_{j=1}^{p}\sum_{m=1}^{p}\big|(\Sigma_k^{-1})_{jm}\big|.$$
In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.
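With a common diagonal covariance and centered data, the effect of the L1 penalty on the means is a componentwise soft-thresholding of the weighted cluster means of (7.11). The sketch below illustrates the principle only; the exact threshold scaling in Pan and Shen (2007) involves the variances and the cluster sizes, and the form `lam / t_k` used here is a simplifying assumption.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Componentwise soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def l1_penalized_means(X, T, lam):
    """Illustrative L1-penalized mean update; T holds the posteriors t_ik."""
    tk = T.sum(axis=0)                 # cluster "sizes" t_k
    mu_ml = (T.T @ X) / tk[:, None]    # unpenalized weighted means, (7.11)
    return soft_threshold(mu_ml, lam / tk[:, None])
```

For centered data, a variable whose penalized means are zero in every cluster does not contribute to the class allocation and can be discarded.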
Guo et al. (2010) propose a variation with a pairwise fusion penalty (PFP):
$$\lambda\sum_{j=1}^{p}\sum_{1\le k\le k'\le K}|\mu_{kj}-\mu_{k'j}|.$$
This PFP regularization does not shrink the means towards zero but towards each other. When the $j$th components of all cluster means are driven to the same value, that variable can be considered non-informative.
An $L_{1,\infty}$ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:
$$\lambda\sum_{j=1}^{p}\big\|(\mu_{1j},\mu_{2j},\dots,\mu_{Kj})\big\|_\infty.$$
One group is defined for each variable $j$, as the set of the $j$th components of the $K$ means, $(\mu_{1j},\dots,\mu_{Kj})$. The $L_{1,\infty}$ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time; in addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG achieves genuine feature selection because it forces null values for the same variable in all cluster means:
$$\lambda\sqrt{K}\sum_{j=1}^{p}\sqrt{\sum_{k=1}^{K}\mu_{kj}^2}.$$
The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code is available on the authors' website that would allow testing it.
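The groupwise zeroing induced by this penalty can be illustrated as follows: each variable $j$ contributes through the norm of the vector $(\mu_{1j},\dots,\mu_{Kj})$, and variables whose group norm falls below the threshold are removed entirely. The closed-form shrinkage used below assumes an orthogonal design and is only a sketch of the mechanism, not the VMG algorithm itself.

```python
import numpy as np

def vmg_group_shrink(mu, lam):
    """Groupwise soft-thresholding of a K x p matrix of cluster means:
    one group per variable j, gathering (mu_1j, ..., mu_Kj)."""
    K = mu.shape[0]
    norms = np.linalg.norm(mu, axis=0)            # group norms, one per variable
    scale = np.maximum(1.0 - lam * np.sqrt(K) / np.maximum(norms, 1e-12), 0.0)
    return mu * scale[None, :], norms

mu = np.array([[ 2.0,  0.1, 0.0],
               [-2.0, -0.1, 0.2]])
mu_s, norms = vmg_group_shrink(mu, lam=0.3)
removed = np.where(np.all(mu_s == 0.0, axis=0))[0]  # variables dropped
```

In this toy example, variables whose group of means is small in every cluster are zeroed out as a whole, while the informative first variable is merely shrunk.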
The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity on the discriminant vector. The generalization from quadratic to non-quadratic penalties is quickly outlined in their work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.
7.2.2 Based on Model Variants
The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy in terms of conditional independence: the $j$th feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as
$$f(x_i|\phi,\pi,\theta,\nu) = \sum_{k=1}^{K}\pi_k\prod_{j=1}^{p}\big[f(x_{ij}|\theta_{jk})\big]^{\phi_j}\big[h(x_{ij}|\nu_j)\big]^{1-\phi_j},$$
where $f(\cdot|\theta_{jk})$ is the distribution function for relevant features and $h(\cdot|\nu_j)$ the one for irrelevant features. The binary vector $\phi = (\phi_1,\phi_2,\dots,\phi_p)$ represents relevance, with $\phi_j = 1$ if the $j$th feature is informative and $\phi_j = 0$ otherwise. The saliency of variable $j$ is then formalized as $\rho_j = P(\phi_j = 1)$, so that all the $\phi_j$ can be treated as missing variables. The set of parameters is thus $\{\pi_k, \theta_{jk}, \nu_j, \rho_j\}$; their estimation is done by means of the EM algorithm (Law et al., 2004).
An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix $U\in\mathbb{R}^{p\times(K-1)}$, which is updated inside the EM loop with a new step called the Fisher step (F-step from now on), which maximizes a multi-class Fisher criterion
$$\operatorname{tr}\left(\big(U^\top\Sigma_W U\big)^{-1}U^\top\Sigma_B U\right), \qquad (7.14)$$
so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then, the F-step updates the projection matrix that projects the data onto the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as functions of the projection matrix $U$ and of the model parameters in the latent space, such that the matrix $U$ enters the M-step equations.
To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one computes the best sparse orthogonal approximation $\hat U$ of the matrix $U$ maximizing (7.14). This sparse approximation is defined as the solution of
$$\min_{\hat U\in\mathbb{R}^{p\times(K-1)}}\big\|X_U - X\hat U\big\|_F^2 + \lambda\sum_{k=1}^{K-1}\|\hat u_k\|_1,$$
where $X_U = XU$ is the input data projected onto the non-sparse space and $\hat u_k$ is the $k$th column vector of the projection matrix $\hat U$. The second possibility is inspired by Qiao et al. (2009): it reformulates the Fisher discriminant (7.14) used to compute the projection matrix as a regression criterion penalized by a mixture of Lasso and Elastic net penalties:
$$\min_{A,B\in\mathbb{R}^{p\times(K-1)}}\sum_{k=1}^{K}\big\|R_W^{-\top}H_{B,k} - AB^\top H_{B,k}\big\|_2^2 + \rho\sum_{j=1}^{K-1}\beta_j^\top\Sigma_W\beta_j + \lambda\sum_{j=1}^{K-1}\|\beta_j\|_1 \qquad \text{s.t. } A^\top A = I_{K-1},$$
where $H_B\in\mathbb{R}^{p\times K}$ is a matrix defined conditionally on the posterior probabilities $t_{ik}$, satisfying $H_B H_B^\top = \Sigma_B$, and $H_{B,k}$ is the $k$th column of $H_B$; $R_W\in\mathbb{R}^{p\times p}$ is an upper triangular matrix resulting from the Cholesky decomposition of $\Sigma_W$. Here $\Sigma_W$ and $\Sigma_B$ are the $p\times p$ within-class and between-class covariance matrices in the observation space. $A\in\mathbb{R}^{p\times(K-1)}$ and $B\in\mathbb{R}^{p\times(K-1)}$ are the solutions of the optimization problem, with $B = [\beta_1,\dots,\beta_{K-1}]$ the best sparse approximation of $U$.
The last possibility computes the solution of the Fisher discriminant (7.14) as the solution of the following constrained optimization problem:
$$\min_{\hat U\in\mathbb{R}^{p\times(K-1)}}\sum_{j=1}^{p}\big\|\Sigma_{B,j} - \hat U\hat U^\top\Sigma_{B,j}\big\|_2^2 \qquad \text{s.t. } \hat U^\top\hat U = I_{K-1},$$
where $\Sigma_{B,j}$ is the $j$th column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of $U$.
To comply with the constraint that the columns of $U$ are orthogonal, the first and second options must be followed by a singular value decomposition of $\hat U$ to recover orthogonality. This is not necessary with the third option, since the penalized version of SVD already guarantees orthogonality.
However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed". Immediately after this paragraph, we can read that under certain assumptions their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices: the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized".
7.2.3 Based on Model Selection
Some clustering algorithms recast the feature selection problem as a model selection problem. Following this approach, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:
• $X^{(1)}$: the set of selected relevant variables;

• $X^{(2)}$: the set of variables being considered for inclusion into or exclusion from $X^{(1)}$;

• $X^{(3)}$: the set of non-relevant variables.
With those subsets, they define two different models, where $Y$ is the partition to consider:

• $\mathcal{M}_1$: $\; f(X|Y) = f\big(X^{(1)},X^{(2)},X^{(3)}|Y\big) = f\big(X^{(3)}|X^{(2)},X^{(1)}\big)\,f\big(X^{(2)}|X^{(1)}\big)\,f\big(X^{(1)}|Y\big)$

• $\mathcal{M}_2$: $\; f(X|Y) = f\big(X^{(1)},X^{(2)},X^{(3)}|Y\big) = f\big(X^{(3)}|X^{(2)},X^{(1)}\big)\,f\big(X^{(2)},X^{(1)}|Y\big)$
Model $\mathcal{M}_1$ means that the variables in $X^{(2)}$ are independent of the clustering $Y$; model $\mathcal{M}_2$ states that the variables in $X^{(2)}$ depend on the clustering $Y$. To simplify the algorithm, the subset $X^{(2)}$ is updated one variable at a time. Therefore, deciding the relevance of a variable in $X^{(2)}$ amounts to a model selection between $\mathcal{M}_1$ and $\mathcal{M}_2$. The selection is done via the Bayes factor
$$B_{12} = \frac{f(X|\mathcal{M}_1)}{f(X|\mathcal{M}_2)},$$
where the high-dimensional factor $f\big(X^{(3)}|X^{(2)},X^{(1)}\big)$ cancels from the ratio:
$$B_{12} = \frac{f\big(X^{(1)},X^{(2)},X^{(3)}|\mathcal{M}_1\big)}{f\big(X^{(1)},X^{(2)},X^{(3)}|\mathcal{M}_2\big)} = \frac{f\big(X^{(2)}|X^{(1)},\mathcal{M}_1\big)\,f\big(X^{(1)}|\mathcal{M}_1\big)}{f\big(X^{(2)},X^{(1)}|\mathcal{M}_2\big)}.$$
This factor is approximated, since the integrated likelihoods $f\big(X^{(1)}|\mathcal{M}_1\big)$ and $f\big(X^{(2)},X^{(1)}|\mathcal{M}_2\big)$ are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The term $f\big(X^{(2)}|X^{(1)},\mathcal{M}_1\big)$, when there is only one variable in $X^{(2)}$, can be represented as a linear regression of the variable $X^{(2)}$ on the variables in $X^{(1)}$; there is also a BIC approximation for this term.
Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They also define three subsets of variables: the relevant and irrelevant subsets ($X^{(1)}$ and $X^{(3)}$) remain the same, but $X^{(2)}$ is reformulated as the subset of relevant variables that explains the irrelevant ones through a multidimensional regression. Their algorithm also uses a backward stepwise strategy, instead of the forward stepwise search of Raftery and Dean (2006), and allows the definition of blocks of indivisible variables, which in certain situations improves the clustering and its interpretability.
Both algorithms are well motivated and appear to produce good results; however, the amount of computation needed to test the different subsets of variables leads to huge computation times. In practice, they cannot be used for the amount of data considered in this thesis.
8 Theoretical Foundations
In this chapter we develop Mix-GLOSS, which applies the GLOSS algorithm conceived for supervised classification (see Chapter 5) to clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.
We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to build reduced-rank decision rules using fewer than $K-1$ discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of the discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.
In the subsequent sections, we provide the principles that allow the M-step to be solved as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach to the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.
8.1 Resolving EM with Optimal Scoring
In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.
8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis
LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data in which the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance
$$d(x_i,\mu_k) = (x_i-\mu_k)^\top\Sigma_W^{-1}(x_i-\mu_k),$$
where $\mu_k$ are the $p$-dimensional centroids and $\Sigma_W$ is the $p\times p$ common within-class covariance matrix.
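The Mahalanobis distance above is readily computed for all samples and centroids at once; a small sketch (function and variable names are illustrative):

```python
import numpy as np

def mahalanobis_to_centroids(X, mu, Sigma_W):
    """d(x_i, mu_k) = (x_i - mu_k)^T Sigma_W^{-1} (x_i - mu_k), for all i, k."""
    Sinv = np.linalg.inv(Sigma_W)
    D = np.empty((X.shape[0], mu.shape[0]))
    for k in range(mu.shape[0]):
        Xc = X - mu[k]
        D[:, k] = np.einsum('ij,jk,ik->i', Xc, Sinv, Xc)
    return D
```

With $\Sigma_W = I$ it reduces to the squared Euclidean distance to the centroids.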
The likelihood equations of the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the $n$ observations are replicated $K$ times and weighted by $t_{ik}$ (the posterior probabilities computed at the E-step).
Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood
$$2\,l_{\mathrm{weight}}(\mu,\Sigma) = -\sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}\,d(x_i,\mu_k) - n\log(|\Sigma_W|),$$
which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized maximum likelihood in Gaussian mixtures.
8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis
The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1, in the supervised learning framework. It is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.
8.1.3 Clustering Using Penalized Optimal Scoring
The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix $B_{\mathrm{OS}}$, analytically related to Fisher's discriminative directions $B_{\mathrm{LDA}}$ for the data $(X,Y)$, where $Y$ is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities $t_{ik}$ in the E-step, the distance between the samples $x_i$ and the centroids $\mu_k$ must be evaluated. Depending on whether we are working in the input, OS or LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:
$$d(x_i,\mu_k) = \big\|(x_i-\mu_k)B_{\mathrm{LDA}}\big\|_2^2 - 2\log(\pi_k).$$
This distance defines the computation of the posterior probabilities $t_{ik}$ in the E-step (see Section 4.2.3). Putting all those elements together, the complete clustering algorithm can be summarized as follows:
1. Initialize the membership matrix $Y$ (for example, by the K-means algorithm).

2. Solve the p-OS problem as
$$B_{\mathrm{OS}} = \big(X^\top X + \lambda\Omega\big)^{-1}X^\top Y\Theta,$$
where $\Theta$ are the $K-1$ leading eigenvectors of $Y^\top X\big(X^\top X + \lambda\Omega\big)^{-1}X^\top Y$.

3. Map $X$ to the LDA domain: $X_{\mathrm{LDA}} = XB_{\mathrm{OS}}D$, with $D = \mathrm{diag}\big(\alpha_k^{-1}(1-\alpha_k^2)^{-1/2}\big)$.

4. Compute the centroids $M$ in the LDA domain.

5. Evaluate distances in the LDA domain.

6. Translate distances into posterior probabilities $t_{ik}$ with
$$t_{ik} \propto \exp\left[-\frac{d(x_i,\mu_k) - 2\log(\pi_k)}{2}\right]. \qquad (8.1)$$

7. Update the labels using the posterior probabilities matrix: $Y = T$.

8. Go back to step 2 and iterate until the $t_{ik}$ converge.
Items 2 to 5 can be interpreted as the M-step, and Item 6 as the E-step, in this alternative view of the EM algorithm for Gaussian mixtures.
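Steps 2 and 3 above can be sketched as follows, taking $\Omega = I$ for illustration. The normalization producing the $\alpha_k$ (solving the eigenproblem with the metric $Y^\top Y$, so that $\alpha_k^2\in(0,1)$) is one standard convention for optimal scoring; treat it as an assumption of this sketch, not the exact Mix-GLOSS implementation.

```python
import numpy as np

def penalized_os_step(X, Y, lam):
    """One penalized optimal scoring step: B_OS = (X'X + lam*I)^{-1} X'Y Theta,
    with Theta the K-1 leading eigenvectors of Y'X (X'X + lam*I)^{-1} X'Y."""
    n, p = X.shape
    K = Y.shape[1]
    G = np.linalg.inv(X.T @ X + lam * np.eye(p))
    M = Y.T @ X @ G @ X.T @ Y
    # Generalized eigenproblem M theta = alpha^2 (Y'Y) theta
    L = np.linalg.cholesky(Y.T @ Y)
    Linv = np.linalg.inv(L)
    evals, V = np.linalg.eigh(Linv @ M @ Linv.T)
    order = np.argsort(evals)[::-1][:K - 1]      # K-1 leading eigenvectors
    alpha2 = evals[order]
    Theta = Linv.T @ V[:, order]
    B_os = G @ X.T @ Y @ Theta
    D = np.diag(alpha2 ** -0.5 * (1.0 - alpha2) ** -0.5)
    X_lda = X @ B_os @ D                         # step 3: map to the LDA domain
    return B_os, Theta, alpha2, X_lda
```

Steps 4 to 6 then reduce to computing centroids and distances in the columns of `X_lda` and normalizing the resulting exponentials row-wise.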
8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
In the previous section, we sketched a clustering algorithm that replaces the M-step with a penalized OS problem. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.
8.2 Optimized Criterion
In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood $Q(\theta,\theta')$ (7.7) so as to maximize the likelihood $\mathcal{L}(\theta)$ (see Section 7.1.2). Replacing the M-step by a penalized optimal scoring problem is possible, and the link between the penalized optimal scoring problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.
This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix $\Sigma$ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.
8.2.1 A Bayesian Derivation
This section sketches a Bayesian treatment of inference, limited to our present needs, where penalties are interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).
The model proposed in this thesis considers classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.
The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:
$$f(\Sigma|\Lambda_0,\nu_0) = \frac{1}{2^{np/2}\,|\Lambda_0|^{n/2}\,\Gamma_p(\frac{n}{2})}\,\big|\Sigma^{-1}\big|^{\frac{\nu_0-p-1}{2}}\exp\left\{-\frac{1}{2}\operatorname{tr}\big(\Lambda_0^{-1}\Sigma^{-1}\big)\right\},$$
where $\nu_0$ is the number of degrees of freedom of the distribution, $\Lambda_0$ is a $p\times p$ scale matrix, and $\Gamma_p$ is the multivariate gamma function, defined as
$$\Gamma_p(n/2) = \pi^{p(p-1)/4}\prod_{j=1}^{p}\Gamma\big(n/2 + (1-j)/2\big).$$
The posterior distribution can be maximized, similarly to the likelihood, through the maximization of
$$
\begin{aligned}
Q(\theta,\theta') &+ \log\big(f(\Sigma|\Lambda_0,\nu_0)\big)\\
&= \sum_{k=1}^{K} t_k\log\pi_k - \frac{(n+1)p}{2}\log 2 - \frac{n}{2}\log|\Lambda_0| - \frac{p(p+1)}{4}\log(\pi)\\
&\quad - \sum_{j=1}^{p}\log\left(\Gamma\left(\frac{n}{2}+\frac{1-j}{2}\right)\right) - \frac{\nu_n-p-1}{2}\log|\Sigma| - \frac{1}{2}\operatorname{tr}\big(\Lambda_n^{-1}\Sigma^{-1}\big)\\
&\equiv \sum_{k=1}^{K} t_k\log\pi_k - \frac{n}{2}\log|\Lambda_0| - \frac{\nu_n-p-1}{2}\log|\Sigma| - \frac{1}{2}\operatorname{tr}\big(\Lambda_n^{-1}\Sigma^{-1}\big), \qquad (8.2)
\end{aligned}
$$
with
$$t_k = \sum_{i=1}^{n} t_{ik}, \qquad \nu_n = \nu_0 + n, \qquad \Lambda_n^{-1} = \Lambda_0^{-1} + S_0, \qquad S_0 = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top.$$
Details of these calculations can be found in textbooks (for example, Bishop, 2006; Gelman et al., 2003).
8.2.2 Maximum a Posteriori Estimator
The maximization of (8.2) with respect to $\mu_k$ and $\pi_k$ is of course not affected by the additional prior term, in which only the covariance $\Sigma$ intervenes. The MAP estimator for $\Sigma$ is simply obtained by differentiating (8.2) with respect to $\Sigma$. The details of the calculations follow the same lines as the ones for maximum likelihood, detailed in Appendix G. The resulting estimator for $\Sigma$ is
$$\hat\Sigma_{\mathrm{MAP}} = \frac{1}{\nu_0+n-p-1}\big(\Lambda_0^{-1} + S_0\big), \qquad (8.3)$$
where $S_0$ is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if $\nu_0$ is chosen to be $p+1$ and $\Lambda_0^{-1} = \lambda\Omega$, where $\Omega$ is the penalty matrix from the group-Lasso regularization (4.25).
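The resulting estimator is straightforward to compute; with $\nu_0 = p+1$ the denominator reduces to $n$, so that $\hat\Sigma_{\mathrm{MAP}} = (\lambda\Omega + S_0)/n$. A direct transcription of (8.3):

```python
import numpy as np

def sigma_map(S0, Lambda0_inv, nu0, n, p):
    """Sigma_MAP = (Lambda0^{-1} + S0) / (nu0 + n - p - 1), cf. (8.3)."""
    return (Lambda0_inv + S0) / (nu0 + n - p - 1)

# With nu0 = p + 1 and Lambda0^{-1} = lam * Omega, this becomes the penalized
# within-class covariance (lam * Omega + S0) / n.
```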
9 Mix-GLOSS Algorithm
Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, adapted here for clustering. In this chapter, I describe the details of the implementation of Mix-GLOSS and of its model selection mechanism.
9.1 Mix-GLOSS
The implementation of Mix-GLOSS involves three nested loops, as schemed in Figure 9.1. The innermost one is an EM algorithm that, for a given value of the regularization parameter $\lambda$, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix $B$, which projects the input data $X$ onto the best subspace (in Fisher's sense), and the posteriors $t_{ik}$.
When several values of the penalty parameter are tested, we feed them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous $\lambda$ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.
The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.
9.1.1 Outer Loop: Whole-Algorithm Repetitions
This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:
• the centered $n\times p$ feature matrix $X$;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);

• the number of clusters $K$;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;
Figure 9.1: Mix-GLOSS loops scheme.
• a $p\times(K-1)$ initial coefficient matrix (optional);

• an $n\times K$ initial posterior probability matrix (optional).
For each algorithm repetition, an initial label matrix $Y$ is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix $B$, it can also be fed into Mix-GLOSS to warm-start the process.
9.1.2 Penalty Parameter Loop
The penalty parameter loop goes through all the values of the input vector $\lambda$. These values are sorted in ascending order, such that the resulting $B$ and $Y$ matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some $\lambda$ value results in a null coefficient matrix, the algorithm halts. We have verified that the implemented warm start reduces the computation time by a factor of 8, compared to using a null $B$ matrix and a K-means execution for the initial $Y$ label matrix.
Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix $B$ and posterior matrix $Y$ are used to estimate a trial value of $\lambda$ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that controls the estimated percentage
of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.
Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternative variational approach from Appendix D is used, Equations (4.32b) must be replaced by (D.10b).
Algorithm 2: Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize: B ← 0; Y ← K-means(X, K)
{Run non-penalized Mix-GLOSS}
λ ← 0
(B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
lastLAMBDA ← false
repeat
    {Estimate λ}
    Compute the gradient at $\beta_j = 0$: $\left.\frac{\partial J(B)}{\partial\beta_j}\right|_{\beta_j=0} = x_j^\top\big(\sum_{m\neq j} x_m\beta_m - Y\Theta\big)$
    Compute $\lambda^{\max}$ for every feature using (4.32b): $\lambda_j^{\max} = \frac{1}{w_j}\left\|\left.\frac{\partial J(B)}{\partial\beta_j}\right|_{\beta_j=0}\right\|_2$
    Choose λ so as to remove 10% of the relevant features
    {Run penalized Mix-GLOSS}
    (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    if the number of relevant variables in B > minVAR then
        lastLAMBDA ← false
    else
        lastLAMBDA ← true
    end if
until lastLAMBDA
Output: B, L(θ), t_ik, π_k, µ_k, Σ, Y for every λ in the solution path
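The λmax computation at the heart of Algorithm 2 only needs the gradient of the data-fit term; the sketch below assumes a null coefficient matrix, so that the gradient reduces to $-x_j^\top Y\Theta$ (function and variable names are illustrative):

```python
import numpy as np

def lambda_max_per_feature(X, YTheta, w):
    """Smallest penalty level that keeps feature j out of the model when the
    coefficient matrix is null: lambda_max_j = ||x_j' (Y Theta)||_2 / w_j."""
    G = X.T @ YTheta            # gradient at B = 0, up to sign
    return np.linalg.norm(G, axis=1) / w
```

Choosing λ at a given quantile of these per-feature values removes approximately the corresponding fraction of the currently relevant features.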
9.1.3 Inner Loop: EM Algorithm
The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence in the posterior probabilities $t_{ik}$ is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.
Algorithm 3: Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
if (B0, Y0) available then
    B_OS ← B0; Y ← Y0
else
    B_OS ← 0; Y ← K-means(X, K)
end if
convergenceEM ← false; tolEM ← 1e-3
repeat
    {M-step}
    (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
    $X_{\mathrm{LDA}} = X B_{\mathrm{OS}}\,\mathrm{diag}\big(\alpha^{-1}(1-\alpha^2)^{-1/2}\big)$
    π_k, µ_k and Σ as per (7.10), (7.11) and (7.12)
    {E-step}
    t_ik as per (8.1)
    L(θ) as per (8.2)
    if $\frac{1}{n}\sum_i |t_{ik} - y_{ik}|$ < tolEM then
        convergenceEM ← true
    end if
    Y ← T
until convergenceEM
Y ← MAP(T)
Output: B_OS, Θ, L(θ), t_ik, π_k, µ_k, Σ, Y
M-Step
The M-step deals with the estimation of the model parameters, that is, the cluster means $\mu_k$, the common covariance matrix $\Sigma$ and the priors $\pi_k$ of every component. In a classical M-step, this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses $X$ on the scaled version $Y\Theta$ of the label matrix. For the first iteration of EM, if no initialization is available, $Y$ results from a K-means execution. In subsequent iterations, $Y$ is updated as the posterior probability matrix $T$ resulting from the E-step.
E-Step
The E-step evaluates the posterior probability matrix $T$ using
$$t_{ik} \propto \exp\left[-\frac{d(x_i,\mu_k) - 2\log(\pi_k)}{2}\right].$$
The convergence of those tik is used as stopping criterion for EM
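Equation (8.1) amounts to a softmax over $-d/2$ with an offset $\log\pi_k$, since $\exp[-(d-2\log\pi_k)/2] = \pi_k\exp(-d/2)$; computing it in the log domain avoids underflow when distances are large. A small sketch:

```python
import numpy as np

def posteriors_from_distances(D, pi):
    """t_ik from (8.1): t_ik ∝ pi_k exp(-d(x_i, mu_k)/2), normalized per row."""
    logit = -0.5 * D + np.log(pi)[None, :]
    logit -= logit.max(axis=1, keepdims=True)   # log-sum-exp stabilization
    T = np.exp(logit)
    return T / T.sum(axis=1, keepdims=True)
```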
9.2 Model Selection
Here model selection refers to the choice of the penalty parameter Up to now wehave not conducted experiments where the number of clusters has to be automaticallyselected
In a first attempt, we tried a classical structure where clustering was performed several times, from different initializations, for every value of the penalty parameter. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm took a lot of computing resources, since the stability selection mechanism required a certain number of repetitions that transformed Mix-GLOSS into a lengthy structure of four nested loops.
In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978), but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, also required a large computation time.
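The flavor of such a sparsity-aware criterion can be sketched as follows. The degrees-of-freedom count below (only non-zero rows of the coefficient matrix are charged) is a simplified reading of the idea in Pan and Shen (2007), not the exact formula used in Mix-GLOSS.

```python
import numpy as np

def sparse_bic(loglik, B, n, K):
    """BIC = -2 log L + log(n) * df, where df only counts the parameters
    of the variables that survived the penalty (zeroed rows of B are free)."""
    active = np.sum(np.linalg.norm(B, axis=1) > 0)   # retained variables
    df = (K - 1) + active * B.shape[1]               # free priors + active coefficients
    return -2.0 * loglik + np.log(n) * df

# Toy coefficient matrix: two of four variables were removed by the penalty
B = np.array([[0.0, 0.0], [1.2, -0.3], [0.0, 0.0], [0.4, 0.1]])
bic = sparse_bic(loglik=-310.5, B=B, n=240, K=3)
```

Sparser models are charged fewer degrees of freedom, so for equal fit the criterion prefers them.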
The third and definitive attempt (up to now) proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0). The execution with the best log-likelihood is chosen; the repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism. This time, only one repetition of the algorithm is done for every value of the penalty parameter. This version has been tested
Figure 9.2: Mix-GLOSS model selection diagram. [Flowchart: an initial Mix-GLOSS run with λ = 0 and 20 repetitions takes X, K, the λ grid and the iteration limits as inputs; the B and T of the best repetition are used as StartB and StartT to warm-start Mix-GLOSS(λ, StartB, StartT) for each λ; BIC is computed for each run and λ = argmin_λ BIC is chosen, yielding the partition, t_ik, π_k, λ_BEST, B, Θ, D, L(θ) and the active set.]
with no significant differences in the quality of the clustering, while dramatically reducing the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.
10 Experimental Results
The performance of Mix-GLOSS is measured here on the artificial dataset that has been used in Section 6.
This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for simulation 2, where they are slightly correlated. In simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.
In our tests we have reduced the volume of the problem, because with the original size of 1200 samples and 500 dimensions, some of the algorithms under test took several days (even weeks) to finish. Hence, the definitive database was chosen to approximately maintain the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1 has been adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments and allows a better understanding of the different simulations.
The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are reported as the average value and the standard deviation over the 25 repetitions.
10.1 Tested Clustering Algorithms
This section compares Mix-GLOSS with the following state-of-the-art methods:
• CS general cov: This is model-based clustering with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.
• Fisher EM: This method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package, "FisherEM", is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network website.
Figure 10.1: Class mean vectors for each artificial simulation
• SelvarClust/Clustvarsel: Implements a method of variable selection for clustering with Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software implemented in C++ that makes use of the clustering library mixmod (Biernacki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.
After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.
The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network website.
• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. In Kuan et al. (2010), some slight changes are introduced in the penalty term, such as weighting parameters, that are particularly important for their dataset. The package LumiWCluster allows performing clustering using the expression from Wang and Zhu (2008) (called LumiWCluster-Wang here) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).
• Mix-GLOSS: This is the clustering algorithm implemented using GLOSS (see Section 9). It makes use of an EM algorithm and the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach of the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.
10.2 Results
Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The performance measures are:
• Clustering Error (in percent): To measure the quality of the partition with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows obtaining the ideal 0% clustering error even if the IDs of the clusters and of the real classes are different.
• Number of Disposed Features: This value shows the number of variables whose coefficients have been zeroed and which are therefore not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.
• Time of execution (in hours, minutes or seconds): Finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. These algorithms tend to be more memory- and CPU-consuming as the number of variables increases. This is one of the reasons why the dimensionality of the original problem was reduced.
The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is defined as the proportion of relevant variables that are actually selected; similarly, the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to their high computing time and clustering error, respectively, and, as the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.
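With these definitions, the two rates can be computed as below. The index sets in the usage example are made up for illustration; in the experiments, the relevant variables are the first 20 out of p = 100.

```python
def tpr_fpr(selected, relevant, p):
    """TPR and FPR of a variable selection among p variables.
    TPR: fraction of relevant variables selected.
    FPR: fraction of irrelevant variables selected."""
    selected, relevant = set(selected), set(relevant)
    irrelevant = set(range(p)) - relevant
    tpr = len(selected & relevant) / len(relevant)
    fpr = len(selected & irrelevant) / len(irrelevant)
    return tpr, fpr

# p = 100: the first 20 variables are relevant, the last 80 are noise;
# a hypothetical method that keeps the first 25 variables
tpr, fpr = tpr_fpr(selected=range(25), relevant=range(20), p=100)
```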
The results, in percentages, are displayed in Figure 10.2 (and in Table 10.2).
Table 10.1: Experimental results for simulated data

Sim 1: K = 4, mean shift, ind. features
                      Err (%)      Var          Time
  CS general cov      46 (15)      985 (72)     884h
  Fisher EM           58 (87)      784 (52)     1645m
  Clustvarsel         602 (107)    378 (291)    383h
  LumiWCluster-Kuan   42 (68)      779 (4)      389s
  LumiWCluster-Wang   43 (69)      784 (39)     619s
  Mix-GLOSS           32 (16)      80 (09)      15h

Sim 2: K = 2, mean shift, dependent features
                      Err (%)      Var          Time
  CS general cov      154 (2)      997 (09)     783h
  Fisher EM           74 (23)      809 (28)     8m
  Clustvarsel         73 (2)       334 (207)    166h
  LumiWCluster-Kuan   64 (18)      798 (04)     155s
  LumiWCluster-Wang   63 (17)      799 (03)     14s
  Mix-GLOSS           77 (2)       841 (34)     2h

Sim 3: K = 4, 1D mean shift, ind. features
                      Err (%)      Var          Time
  CS general cov      304 (57)     55 (468)     1317h
  Fisher EM           233 (65)     366 (55)     22m
  Clustvarsel         658 (115)    232 (291)    542h
  LumiWCluster-Kuan   323 (21)     80 (02)      83s
  LumiWCluster-Wang   308 (36)     80 (02)      1292s
  Mix-GLOSS           347 (92)     81 (88)      21h

Sim 4: K = 4, mean shift, ind. features
                      Err (%)      Var          Time
  CS general cov      626 (55)     999 (02)     112h
  Fisher EM           567 (104)    55 (48)      195m
  Clustvarsel         732 (4)      24 (12)      767h
  LumiWCluster-Kuan   692 (112)    99 (2)       876s
  LumiWCluster-Wang   697 (119)    991 (21)     825s
  Mix-GLOSS           669 (91)     975 (12)     11h
Table 10.2: TPR versus FPR (in %), averages computed over 25 repetitions for the best performing algorithms

             Simulation 1    Simulation 2    Simulation 3    Simulation 4
             TPR    FPR      TPR    FPR      TPR    FPR      TPR    FPR
MIX-GLOSS    992    015      828    335      884    67       780    12
LUMI-KUAN    992    28       1000   02       1000   005      50     005
FISHER-EM    986    24       888    17       838    5825     620    4075
Figure 10.2: TPR versus FPR (in %) for the best performing algorithms and simulations. [Scatter plot: FPR (0 to 60%) on the x-axis, TPR (0 to 100%) on the y-axis, with one marker per algorithm (MIX-GLOSS, LUMI-KUAN, FISHER-EM) for each of Simulations 1 to 4.]
10.3 Discussion
After reviewing Tables 10.1 and 10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. Depending on the objectives and constraints of the problem, the following observations deserve to be highlighted.
LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior regarding the other criteria. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on each implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation time is worth mentioning here.
The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.
From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best, or close to the best, in terms of fall-out and recall.
Conclusions
Summary
The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.
In this thesis, we have demonstrated the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.
The equivalence between LDA and OS problems allows taking advantage of all the resources available for solving regression problems when addressing linear discrimination. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.
In Part II, we have used a variational approach of the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver (GLOSS), which has proved its effectiveness, inducing extremely parsimonious models without renouncing any predictive capability. GLOSS has been tested on four artificial and three real datasets, outperforming other state-of-the-art algorithms in almost all situations.
In Part III, this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As in the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion to maximize at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection of the penalty parameter. Also in this case, the theory has been put into practice with the implementation of Mix-GLOSS. By now, due to time constraints, only artificial datasets have been tested, with positive results.
Perspectives
Even if the preliminary results are promising, Mix-GLOSS has not been sufficiently tested. We plan to test it at least on the same real datasets that we used with GLOSS; however, more testing would be recommended in both cases. These algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables, but other high-dimension low-sample-size (HDLSS) domains are also possible: identification of male or female silhouettes, of fungal species, or of fish species
based on shape and texture (Clemmensen et al., 2011), or the Stirling faces (Roth and Lange, 2004), are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a) have also been tested in the bibliography.
At the programming level, both codes must be revisited to improve their robustness and to optimize their computation, because during the prototyping phase the priority was achieving functional code. An old version of GLOSS, numerically more stable but less efficient, has been made publicly available. A better suited and documented version should be made available for GLOSS and Mix-GLOSS in the short term.
The theory developed in this thesis, and the programming structure used for its implementation, allow easy alterations to the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all elements but the diagonal of the covariance matrix; spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as a Laplacian describing the relationships between variables. That can be used to implement pairwise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-like penalties. Some of these possibilities have been partially implemented, such as the diagonal version of GLOSS; however, they have not been properly tested, or even updated with the latest algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time deadlines for the publication of this thesis.
From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized in Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Not knowing this criterion does not prevent successful runs of Mix-GLOSS: other mechanisms that do not involve the computation of the true criterion have been used to stop the EM algorithm and to perform model selection. However, further investigation must be done in this direction to assess the convergence properties of this algorithm.
At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was made in the domains of outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers models the population with a mixture model where the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather together all the points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.
Appendix
A Matrix Properties
Property 1. By definition, $\Sigma_W$ and $\Sigma_B$ are both symmetric matrices:

$$\Sigma_W = \frac{1}{n} \sum_{k=1}^{g} \sum_{i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^\top \qquad \Sigma_B = \frac{1}{n} \sum_{k=1}^{g} n_k (\mu_k - \bar{x})(\mu_k - \bar{x})^\top$$

Property 2. $\dfrac{\partial x^\top a}{\partial x} = \dfrac{\partial a^\top x}{\partial x} = a$

Property 3. $\dfrac{\partial x^\top A x}{\partial x} = (A + A^\top) x$

Property 4. $\dfrac{\partial |X^{-1}|}{\partial X} = -|X^{-1}| (X^{-1})^\top$

Property 5. $\dfrac{\partial a^\top X b}{\partial X} = a b^\top$

Property 6. $\dfrac{\partial}{\partial X}\,\mathrm{tr}\left(A X^{-1} B\right) = -(X^{-1} B A X^{-1})^\top = -X^{-\top} A^\top B^\top X^{-\top}$
B The Penalized-OS Problem is an Eigenvector Problem
In this appendix, we answer the question of why the solution of a penalized optimal scoring regression involves the computation of an eigenvector decomposition. The p-OS problem has the form

$$\min_{\theta_k, \beta_k} \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k \tag{B.1}$$
$$\text{s.t. } \theta_k^\top Y^\top Y \theta_k = 1, \qquad \theta_\ell^\top Y^\top Y \theta_k = 0 \quad \forall \ell < k,$$

for $k = 1, \ldots, K-1$. The Lagrangian associated to Problem (B.1) is

$$L_k(\theta_k, \beta_k, \lambda_k, \nu_k) = \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k + \lambda_k\left(\theta_k^\top Y^\top Y \theta_k - 1\right) + \sum_{\ell < k} \nu_\ell\, \theta_\ell^\top Y^\top Y \theta_k \tag{B.2}$$

Setting the gradient of (B.2) with respect to $\beta_k$ to zero gives the value of the optimal $\beta_k^\star$:

$$\beta_k^\star = (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k \tag{B.3}$$

The objective function of (B.1) evaluated at $\beta_k^\star$ is

$$\min_{\theta_k} \|Y\theta_k - X\beta_k^\star\|_2^2 + \beta_k^{\star\top} \Omega_k \beta_k^\star = \min_{\theta_k} \theta_k^\top Y^\top \left(I - X(X^\top X + \Omega_k)^{-1} X^\top\right) Y \theta_k$$
$$= \max_{\theta_k} \theta_k^\top Y^\top X (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k \tag{B.4}$$

If the penalty matrix $\Omega_k$ is identical for all problems, $\Omega_k = \Omega$, then (B.4) corresponds to an eigen-problem, where the $k$ score vectors $\theta_k$ are the eigenvectors of $Y^\top X (X^\top X + \Omega)^{-1} X^\top Y$.
B.1 How to Solve the Eigenvector Decomposition
Making an eigen-decomposition of an expression like $Y^\top X (X^\top X + \Omega)^{-1} X^\top Y$ is not trivial, due to the $p \times p$ inverse. For some datasets, $p$ can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.
Let $M$ be the matrix $Y^\top X (X^\top X + \Omega)^{-1} X^\top Y$, so that we can rewrite expression (B.4) in a compact way:

$$\max_{\Theta \in \mathbb{R}^{K \times (K-1)}} \mathrm{tr}\left(\Theta^\top M \Theta\right) \qquad \text{s.t. } \Theta^\top Y^\top Y \Theta = I_{K-1} \tag{B.5}$$

If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the $(K-1) \times (K-1)$ matrix $M_\Theta$ be $\Theta^\top M \Theta$. The classical eigenvector formulation associated to (B.5) is

$$M_\Theta v = \lambda v \tag{B.6}$$

where $v$ is an eigenvector and $\lambda$ the associated eigenvalue of $M_\Theta$. Operating,

$$v^\top M_\Theta v = \lambda \;\Leftrightarrow\; v^\top \Theta^\top M \Theta v = \lambda.$$

Making the variable change $w = \Theta v$, we obtain an alternative eigenproblem, where the $w$ are the eigenvectors of $M$ and $\lambda$ the associated eigenvalue:

$$w^\top M w = \lambda \tag{B.7}$$

Therefore, $v$ are the eigenvectors of the eigen-decomposition of matrix $M_\Theta$, and $w$ are the eigenvectors of the eigen-decomposition of matrix $M$. Note that the only difference between the $(K-1) \times (K-1)$ matrix $M_\Theta$ and the $K \times K$ matrix $M$ is the $K \times (K-1)$ matrix $\Theta$ in the expression $M_\Theta = \Theta^\top M \Theta$. Then, to avoid the computation of the $p \times p$ inverse $(X^\top X + \Omega)^{-1}$, we can plug the optimal value of the coefficient matrix $B = (X^\top X + \Omega)^{-1} X^\top Y \Theta$ into $M_\Theta$:

$$M_\Theta = \Theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta = \Theta^\top Y^\top X B.$$

Thus, the eigen-decomposition of the $(K-1) \times (K-1)$ matrix $M_\Theta = \Theta^\top Y^\top X B$ yields the $v$ eigenvectors of (B.6). To obtain the $w$ eigenvectors of the alternative formulation (B.7), the variable change $w = \Theta v$ needs to be undone.

To summarize, we compute the $v$ eigenvectors from the eigen-decomposition of the tractable matrix $M_\Theta$, evaluated as $\Theta^\top Y^\top X B$. Then the definitive eigenvectors $w$ are recovered as $w = \Theta v$. The final step is the reconstruction of the optimal score matrix $\Theta^\star$, using the vectors $w$ as its columns. At this point we understand what the literature calls "updating the initial score matrix": multiplying the initial $\Theta$ by the eigenvector matrix $V$ from decomposition (B.6) reverses the change of variable to restore the $w$ vectors. The $B$ matrix also needs to be "updated", by multiplying it by the same matrix of eigenvectors $V$, in order to account for the initial $\Theta$ used in the first computation of $B$:

$$B^\star = (X^\top X + \Omega)^{-1} X^\top Y \Theta V = B V.$$
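A small NumPy sketch (toy data and a generic quadratic penalty Ω, both assumed here for illustration; the GLOSS solver replaces the direct linear solve in practice) verifies the "update" identity: the updated coefficients BV are exactly the coefficients one would recompute for the updated scores ΘV.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 30, 5, 3
X = rng.standard_normal((n, p))
Y = np.eye(K)[rng.integers(0, K, n)]           # n x K indicator matrix
Omega = np.eye(p)                              # toy quadratic penalty
Theta = np.linalg.qr(rng.standard_normal((K, K - 1)))[0]  # initial scores

# B = (X'X + Omega)^{-1} X'Y Theta, computed without forming the inverse
B = np.linalg.solve(X.T @ X + Omega, X.T @ Y @ Theta)

# Small (K-1)x(K-1) matrix M_Theta = Theta' Y' X B and its eigenvectors V
M_Theta = Theta.T @ Y.T @ X @ B
eigvals, V = np.linalg.eigh((M_Theta + M_Theta.T) / 2)  # symmetrized for eigh

Theta_star = Theta @ V                         # updated score matrix
B_star = B @ V                                 # updated coefficients
```

Because the solve is linear in its right-hand side, BV equals the solution recomputed for ΘV, which is the whole point of the trick: only a (K−1)×(K−1) eigen-decomposition is ever needed.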
B.2 Why the OS Problem is Solved as an Eigenvector Problem
In the optimal scoring literature, the score matrix $\Theta$ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix $M = Y^\top X (X^\top X + \Omega)^{-1} X^\top Y$.

By definition of the eigen-decomposition, the eigenvectors of $M$ (called $w$ in (B.7)) form a basis, so that any score vector $\theta_k$ can be expressed as a linear combination of them:

$$\theta_k = \sum_{m=1}^{K-1} \alpha_m w_m, \qquad \text{s.t. } \theta_k^\top \theta_k = 1 \tag{B.8}$$

The score vector normalization constraint $\theta_k^\top \theta_k = 1$ can also be expressed as a function of this basis,

$$\left(\sum_{m=1}^{K-1} \alpha_m w_m\right)^\top \left(\sum_{m=1}^{K-1} \alpha_m w_m\right) = 1,$$

which, by the orthonormality of the eigenvectors, reduces to

$$\sum_{m=1}^{K-1} \alpha_m^2 = 1 \tag{B.9}$$
Let $M$ be multiplied by a score vector $\theta_k$, replaced by its linear combination of eigenvectors (B.8):

$$M \theta_k = M \sum_{m=1}^{K-1} \alpha_m w_m = \sum_{m=1}^{K-1} \alpha_m M w_m.$$

As the $w_m$ are the eigenvectors of $M$, the relationship $M w_m = \lambda_m w_m$ can be used to obtain

$$M \theta_k = \sum_{m=1}^{K-1} \alpha_m \lambda_m w_m.$$

Multiplying on the left by $\theta_k^\top$, written as its linear combination of eigenvectors,

$$\theta_k^\top M \theta_k = \left(\sum_{\ell=1}^{K-1} \alpha_\ell w_\ell\right)^\top \left(\sum_{m=1}^{K-1} \alpha_m \lambda_m w_m\right).$$

This equation can be simplified using the orthogonality property of the eigenvectors, according to which $w_\ell^\top w_m = 0$ for any $\ell \neq m$, giving

$$\theta_k^\top M \theta_k = \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m.$$
The optimization problem (B.5) for discriminant direction $k$ can be rewritten as

$$\max_{\theta_k \in \mathbb{R}^{K}} \theta_k^\top M \theta_k = \max \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m \tag{B.10}$$
$$\text{with } \theta_k = \sum_{m=1}^{K-1} \alpha_m w_m \text{ and } \sum_{m=1}^{K-1} \alpha_m^2 = 1.$$

One way of maximizing Problem (B.10) is choosing $\alpha_m = 1$ for $m = k$ and $\alpha_m = 0$ otherwise. Hence, as $\theta_k = \sum_{m=1}^{K-1} \alpha_m w_m$, the resulting score vector $\theta_k$ will be equal to the $k$-th eigenvector $w_k$.

As a summary, it can be concluded that the solution to the original problem (B.1) can be achieved by an eigenvector decomposition of the matrix $M = Y^\top X (X^\top X + \Omega)^{-1} X^\top Y$.
C Solving Fisher's Discriminant Problem
The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance, under a unitary constraint on the within-class variance:

$$\max_{\beta \in \mathbb{R}^p} \beta^\top \Sigma_B \beta \tag{C.1a}$$
$$\text{s.t. } \beta^\top \Sigma_W \beta = 1 \tag{C.1b}$$

where $\Sigma_B$ and $\Sigma_W$ are respectively the between-class and within-class variance matrices of the original $p$-dimensional data.
The Lagrangian of Problem (C.1) is

$$L(\beta, \nu) = \beta^\top \Sigma_B \beta - \nu\left(\beta^\top \Sigma_W \beta - 1\right),$$

so that its first derivative with respect to $\beta$ is

$$\frac{\partial L(\beta, \nu)}{\partial \beta} = 2 \Sigma_B \beta - 2 \nu \Sigma_W \beta.$$

A necessary optimality condition for $\beta^\star$ is that this derivative is zero, that is,

$$\Sigma_B \beta^\star = \nu \Sigma_W \beta^\star.$$

Provided $\Sigma_W$ is full rank, we have

$$\Sigma_W^{-1} \Sigma_B \beta^\star = \nu \beta^\star \tag{C.2}$$
Thus, the solutions $\beta^\star$ match the definition of an eigenvector of the matrix $\Sigma_W^{-1}\Sigma_B$ with eigenvalue $\nu$. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:

$$\beta^{\star\top} \Sigma_B \beta^\star = \beta^{\star\top} \Sigma_W \Sigma_W^{-1} \Sigma_B \beta^\star = \nu\, \beta^{\star\top} \Sigma_W \beta^\star \quad \text{from (C.2)}$$
$$= \nu \quad \text{from (C.1b)}$$

That is, the optimal value of the objective function to be maximized is the eigenvalue $\nu$. Hence, $\nu$ is the largest eigenvalue of $\Sigma_W^{-1}\Sigma_B$, and $\beta^\star$ is any eigenvector corresponding to this maximal eigenvalue.
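This derivation is easy to check numerically on toy data (made up here): the top eigenvector of $\Sigma_W^{-1}\Sigma_B$, rescaled so that $\beta^\top \Sigma_W \beta = 1$, attains an objective value equal to its eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(1)
n_k, p, K = 100, 4, 3
means = 2 * rng.standard_normal((K, p))
X = np.vstack([m + rng.standard_normal((n_k, p)) for m in means])
y = np.repeat(np.arange(K), n_k)
xbar = X.mean(axis=0)

# Within- and between-class covariance matrices (Property 1, Appendix A)
Sw = sum((y == k).mean() * np.cov(X[y == k].T, bias=True) for k in range(K))
Sb = sum((y == k).mean() * np.outer(X[y == k].mean(0) - xbar,
                                    X[y == k].mean(0) - xbar) for k in range(K))

# Eigen-decomposition of Sw^{-1} Sb, Equation (C.2)
nu, V = np.linalg.eig(np.linalg.solve(Sw, Sb))
i = np.argmax(nu.real)
beta = V[:, i].real
beta /= np.sqrt(beta @ Sw @ beta)       # enforce the constraint (C.1b)
```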
D Alternative Variational Formulation for the Group-Lasso
In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:

$$\min_{\tau \in \mathbb{R}^p} \min_{B \in \mathbb{R}^{p \times (K-1)}} J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j} \tag{D.1a}$$
$$\text{s.t. } \sum_{j=1}^{p} \tau_j = 1 \tag{D.1b}$$
$$\tau_j \geq 0, \quad j = 1, \ldots, p \tag{D.1c}$$
Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let $B \in \mathbb{R}^{p \times (K-1)}$ be a matrix composed of row vectors $\beta^j \in \mathbb{R}^{K-1}$, $B = (\beta^{1\top}, \ldots, \beta^{p\top})^\top$. The starting point is the Lagrangian

$$L(B, \tau, \lambda, \nu_0, \nu) = J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j} + \nu_0 \left(\sum_{j=1}^{p} \tau_j - 1\right) - \sum_{j=1}^{p} \nu_j \tau_j \tag{D.2}$$

which is differentiated with respect to $\tau_j$ to get the optimal value $\tau_j^\star$:

$$\frac{\partial L(B, \tau, \lambda, \nu_0, \nu)}{\partial \tau_j} \bigg|_{\tau_j = \tau_j^\star} = 0 \;\Rightarrow\; -\lambda \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0$$
$$\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{\star 2} - \nu_j \tau_j^{\star 2} = 0$$
$$\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{\star 2} = 0$$

The last two expressions are related through the complementary slackness property of the Lagrange multipliers, which states that $\nu_j g_j(\tau^\star) = 0$, where $\nu_j$ is the Lagrange multiplier and $g_j(\tau)$ the inequality constraint. Then the optimal $\tau_j^\star$ can be deduced:

$$\tau_j^\star = \sqrt{\frac{\lambda}{\nu_0}}\, w_j \|\beta^j\|_2.$$

Plugging this optimal value of $\tau_j^\star$ into constraint (D.1b),

$$\sum_{j=1}^{p} \tau_j^\star = 1 \;\Rightarrow\; \tau_j^\star = \frac{w_j \|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2} \tag{D.3}$$
With this value of $\tau_j^\star$, Problem (D.1) is equivalent to

$$\min_{B \in \mathbb{R}^{p \times (K-1)}} J(B) + \lambda \left(\sum_{j=1}^{p} w_j \|\beta^j\|_2\right)^2 \tag{D.4}$$

This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. The square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4). In particular, its solution is expected to be sparse, with some null row vectors $\beta^j$.

The penalty term of (D.1a) can be conveniently presented as $\lambda\, \mathrm{tr}(B^\top \Omega B)$, where

$$\Omega = \mathrm{diag}\left(\frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \ldots, \frac{w_p^2}{\tau_p}\right) \tag{D.5}$$

Using the value of $\tau_j^\star$ from (D.3), each diagonal component of $\Omega$ is

$$(\Omega)_{jj} = \frac{w_j \sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2}{\|\beta^j\|_2} \tag{D.6}$$
In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are derived for this alternative formulation.
D.1 Useful Properties
Lemma D.1. If $J$ is convex, Problem (D.1) is convex.
In what follows, $J$ is a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.
Lemma D.2. For all $B \in \mathbb{R}^{p \times (K-1)}$, the subdifferential of the objective function of Problem (D.4) is

$$\left\{ V \in \mathbb{R}^{p \times (K-1)} : V = \frac{\partial J(B)}{\partial B} + 2\lambda \left(\sum_{j=1}^{p} w_j \|\beta^j\|_2\right) G \right\} \tag{D.7}$$

where $G = (g^{1\top}, \ldots, g^{p\top})^\top$ is a $p \times (K-1)$ matrix defined as follows. Let $S(B)$ denote the row support of $B$, $S(B) = \{j \in \{1, \ldots, p\} : \|\beta^j\|_2 \neq 0\}$; then we have

$$\forall j \in S(B), \quad g^j = w_j \|\beta^j\|_2^{-1} \beta^j \tag{D.8}$$
$$\forall j \notin S(B), \quad \|g^j\|_2 \leq w_j \tag{D.9}$$
D2 An Upper Bound on the Objective Function
This condition results in an equality for the ldquoactiverdquo non-zero vectors βj and aninequality for the other ones which both provide essential building blocks of our algo-rithm
Lemma D3 Problem (D4) admits at least one solution which is unique if J(B)is strictly convex All critical points B of the objective function verifying the followingconditions are global minima Let S(B) denote the columnwise support of B S(B) =j isin 1 K minus 1
∥∥βj∥∥26= 0 and let S(B) be its complement then we have
forallj isin S(B) minus partJ(B)
partβj= 2λ
Kminus1sumj=1
wj∥∥βj∥∥2
wj∥∥βj∥∥minus1
2βj (D10a)
forallj isin S(B)
∥∥∥∥partJ(B)
partβj
∥∥∥∥2
le 2λwj
Kminus1sumj=1
wj∥∥βj∥∥2
(D10b)
In particular, Lemma D.3 provides a well-defined characterization of the support of the solution, which is not easily obtained from a direct analysis of the variational problem (D.1).
D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and, for a given $B$, the gap between these objectives is null at $\tau^\star$ such that

$$\tau_j^\star = \frac{w_j \|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2}.$$

Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let $\tau \in \mathbb{R}^p$ be any feasible vector; we have

$$\left(\sum_{j=1}^{p} w_j \|\beta^j\|_2\right)^2 = \left(\sum_{j=1}^{p} \tau_j^{1/2}\, \frac{w_j \|\beta^j\|_2}{\tau_j^{1/2}}\right)^2 \leq \left(\sum_{j=1}^{p} \tau_j\right)\left(\sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j}\right) \leq \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j},$$

where we used the Cauchy-Schwarz inequality for the first inequality and the definition of the feasibility set of $\tau$ for the second one.
This lemma only holds for the alternative variational formulation described in this appendix. A similar result is difficult to obtain for the first variational form (Section 4.3.1), because the definitions of the feasible sets of $\tau$ and $\beta$ are intertwined.
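Lemma D.4 is easy to check numerically; the data below are random and stand for any pair (B, w).

```python
import numpy as np

rng = np.random.default_rng(2)
p, K = 6, 4
B = rng.standard_normal((p, K - 1))
w = rng.uniform(0.5, 2.0, p)
norms = np.linalg.norm(B, axis=1)            # row norms ||beta^j||_2

group_lasso_sq = (w @ norms) ** 2            # squared penalty of (D.4)

def variational(tau):
    """Penalty term of (D.1a) for a feasible tau (on the simplex)."""
    return np.sum(w ** 2 * norms ** 2 / tau)

tau_star = w * norms / (w @ norms)           # optimal tau, Equation (D.3)
tau_rand = rng.dirichlet(np.ones(p))         # an arbitrary feasible tau
```

The variational penalty dominates the squared group-Lasso penalty for every feasible τ, and the two coincide at τ*.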
E Invariance of the Group-Lasso to Unitary Transformations
The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso, provided that the following holds: if the regression coefficients $B_0$ are optimal for the score values $\Theta_0$, and if the optimal scores $\Theta^\star$ are obtained by a unitary transformation of $\Theta_0$, say $\Theta^\star = \Theta_0 V$ (where $V \in \mathbb{R}^{M \times M}$ is a unitary matrix), then $B^\star = B_0 V$ is optimal conditionally on $\Theta^\star$; that is, $(\Theta^\star, B^\star)$ is a global solution of the optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.
Proposition E.1. Let $B^\star$ be a solution of

$$\min_{B \in \mathbb{R}^{p \times M}} \|Y - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 \tag{E.1}$$

and let $\tilde{Y} = YV$, where $V \in \mathbb{R}^{M \times M}$ is a unitary matrix. Then $\tilde{B} = B^\star V$ is a solution of

$$\min_{B \in \mathbb{R}^{p \times M}} \|\tilde{Y} - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 \tag{E.2}$$
Proof. The first-order necessary optimality conditions for $B^\star$ are

$$\forall j \in S(B^\star), \quad 2\, x^{j\top}\left(x^j \beta^{\star j} - Y\right) + \lambda w_j \|\beta^{\star j}\|_2^{-1} \beta^{\star j} = 0 \tag{E.3a}$$
$$\forall j \notin S(B^\star), \quad 2\left\|x^{j\top}\left(x^j \beta^{\star j} - Y\right)\right\|_2 \leq \lambda w_j \tag{E.3b}$$

where $S(B^\star) \subseteq \{1, \ldots, p\}$ denotes the set of non-zero row vectors of $B^\star$ and $\bar{S}(B^\star)$ is its complement.

First, we note that, from the definition of $\tilde{B}$, we have $S(\tilde{B}) = S(B^\star)$. Then we may rewrite the above conditions as follows:

$$\forall j \in S(\tilde{B}), \quad 2\, x^{j\top}\left(x^j \tilde{\beta}^{j} - \tilde{Y}\right) + \lambda w_j \|\tilde{\beta}^{j}\|_2^{-1} \tilde{\beta}^{j} = 0 \tag{E.4a}$$
$$\forall j \notin S(\tilde{B}), \quad 2\left\|x^{j\top}\left(x^j \tilde{\beta}^{j} - \tilde{Y}\right)\right\|_2 \leq \lambda w_j \tag{E.4b}$$

where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by $V$, and also uses $VV^\top = I$, so that $\forall u \in \mathbb{R}^M$, $\|u^\top\|_2 = \|u^\top V\|_2$; Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for $\tilde{B}$ to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
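A quick numerical check of the invariance (random data and an arbitrary B, made up for the example): the group-Lasso objective takes the same value at (Y, B) and at (YV, BV), so optima map to optima.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, M = 20, 7, 3
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, M))
B = rng.standard_normal((p, M))
w = rng.uniform(0.5, 2.0, p)
lam = 0.3

def objective(Y, B):
    """Group-Lasso objective of (E.1): squared Frobenius fit + row-wise penalty."""
    fit = np.linalg.norm(Y - X @ B, "fro") ** 2
    pen = lam * np.sum(w * np.linalg.norm(B, axis=1))
    return fit + pen

V = np.linalg.qr(rng.standard_normal((M, M)))[0]   # a unitary matrix
```

Both terms are invariant: the residual is rotated as (Y − XB)V, which preserves the Frobenius norm, and each row norm satisfies ‖βʲV‖₂ = ‖βʲ‖₂.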
F Expected Complete Likelihood and Likelihood
Section 7.1.2 explains that, by maximizing the conditional expectation of the complete log-likelihood $Q(\theta, \theta')$ (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed from its definition (7.1), but there is a shorter way to compute it from $Q(\theta, \theta')$ when the latter is available.
$$L(\boldsymbol{\theta}) = \sum_{i=1}^n \log\left(\sum_{k=1}^K \pi_k f_k(\mathbf{x}_i; \boldsymbol{\theta}_k)\right) \quad \text{(F.1)}$$
$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}') = \sum_{i=1}^n \sum_{k=1}^K t_{ik}(\boldsymbol{\theta}') \log\left(\pi_k f_k(\mathbf{x}_i; \boldsymbol{\theta}_k)\right) \quad \text{(F.2)}$$
$$\text{with} \quad t_{ik}(\boldsymbol{\theta}') = \frac{\pi'_k f_k(\mathbf{x}_i; \boldsymbol{\theta}'_k)}{\sum_\ell \pi'_\ell f_\ell(\mathbf{x}_i; \boldsymbol{\theta}'_\ell)} . \quad \text{(F.3)}$$
In the EM algorithm, $\boldsymbol{\theta}'$ denotes the model parameters at the previous iteration, $t_{ik}(\boldsymbol{\theta}')$ are the posterior probabilities computed from $\boldsymbol{\theta}'$ at the previous E-step, and $\boldsymbol{\theta}$, without prime, denotes the parameters of the current iteration, obtained by maximizing $Q(\boldsymbol{\theta}, \boldsymbol{\theta}')$.
Using (F.3), we have
$$\begin{aligned}
Q(\boldsymbol{\theta}, \boldsymbol{\theta}') &= \sum_{i,k} t_{ik}(\boldsymbol{\theta}') \log\left(\pi_k f_k(\mathbf{x}_i; \boldsymbol{\theta}_k)\right) \\
&= \sum_{i,k} t_{ik}(\boldsymbol{\theta}') \log(t_{ik}(\boldsymbol{\theta})) + \sum_{i,k} t_{ik}(\boldsymbol{\theta}') \log\left(\sum_\ell \pi_\ell f_\ell(\mathbf{x}_i; \boldsymbol{\theta}_\ell)\right) \\
&= \sum_{i,k} t_{ik}(\boldsymbol{\theta}') \log(t_{ik}(\boldsymbol{\theta})) + L(\boldsymbol{\theta}) ,
\end{aligned}$$
where the last equality uses $\sum_k t_{ik}(\boldsymbol{\theta}') = 1$.
In particular, after the evaluation of $t_{ik}$ in the E-step, where $\boldsymbol{\theta} = \boldsymbol{\theta}'$, the log-likelihood can be computed from the value of $Q(\boldsymbol{\theta}, \boldsymbol{\theta})$ (7.7) and the entropy of the posterior probabilities:
$$L(\boldsymbol{\theta}) = Q(\boldsymbol{\theta}, \boldsymbol{\theta}) - \sum_{i,k} t_{ik}(\boldsymbol{\theta}) \log(t_{ik}(\boldsymbol{\theta})) = Q(\boldsymbol{\theta}, \boldsymbol{\theta}) + H(\mathbf{T}) .$$
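This identity is easy to verify numerically for a Gaussian mixture. The sketch below uses arbitrary toy parameters and a hand-rolled Gaussian density (both are illustrative assumptions, not values from the thesis) to compare the direct evaluation of (F.1) with Q(θ, θ) + H(T).

```python
import numpy as np

def gauss_pdf(X, mu, Sigma):
    """Multivariate normal density evaluated at each row of X."""
    p = len(mu)
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    quad = np.einsum('ij,jk,ik->i', diff, inv, diff)   # (x-mu)^T Sigma^-1 (x-mu)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** p * np.linalg.det(Sigma))

rng = np.random.default_rng(1)
n, p, K = 200, 2, 3
X = rng.standard_normal((n, p))
pi = np.array([0.2, 0.3, 0.5])
mus = rng.standard_normal((K, p))
Sigma = np.eye(p)

# component densities f_k(x_i; theta_k), shape (n, K)
F = np.column_stack([gauss_pdf(X, mus[k], Sigma) for k in range(K)])

# log-likelihood (F.1), computed from its definition
loglik = np.sum(np.log(F @ pi))

# posterior probabilities (F.3), evaluated at theta' = theta
T = (F * pi) / (F @ pi)[:, None]

# Q(theta, theta) as in (F.2), and the posterior entropy H(T)
Q = np.sum(T * np.log(F * pi))
H = -np.sum(T * np.log(T))

print(np.isclose(loglik, Q + H))   # prints True
```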
G Derivation of the M-Step Equations
This appendix shows the whole process to obtain expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as
$$\begin{aligned}
Q(\boldsymbol{\theta}, \boldsymbol{\theta}') &= \sum_{i,k} t_{ik}(\boldsymbol{\theta}') \log\left(\pi_k f_k(\mathbf{x}_i; \boldsymbol{\theta}_k)\right) \\
&= \sum_k \log(\pi_k) \sum_i t_{ik} - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\boldsymbol{\Sigma}| - \frac{1}{2}\sum_{i,k} t_{ik}\,(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_k) ,
\end{aligned}$$
which has to be maximized with respect to $\boldsymbol{\theta}$ subject to $\sum_k \pi_k = 1$.
The Lagrangian of this problem is
$$\mathcal{L}(\boldsymbol{\theta}) = Q(\boldsymbol{\theta}, \boldsymbol{\theta}') + \lambda \left(\sum_k \pi_k - 1\right) .$$
The partial derivatives of the Lagrangian are set to zero to obtain the optimal values of $\pi_k$, $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}$.
G.1 Prior probabilities

$$\frac{\partial \mathcal{L}(\boldsymbol{\theta})}{\partial \pi_k} = 0 \;\Leftrightarrow\; \frac{1}{\pi_k}\sum_i t_{ik} + \lambda = 0 ,$$
where $\lambda$ is identified from the constraint ($\lambda = -n$, since summing over $k$ gives $\sum_{i,k} t_{ik} = n$), leading to
$$\pi_k = \frac{1}{n}\sum_i t_{ik} .$$
G.2 Means

$$\frac{\partial \mathcal{L}(\boldsymbol{\theta})}{\partial \boldsymbol{\mu}_k} = \mathbf{0} \;\Leftrightarrow\; -\frac{1}{2}\sum_i t_{ik}\, 2\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_k - \mathbf{x}_i) = \mathbf{0} \;\Rightarrow\; \boldsymbol{\mu}_k = \frac{\sum_i t_{ik}\,\mathbf{x}_i}{\sum_i t_{ik}} .$$
G.3 Covariance Matrix

$$\frac{\partial \mathcal{L}(\boldsymbol{\theta})}{\partial \boldsymbol{\Sigma}^{-1}} = \mathbf{0} \;\Leftrightarrow\; \underbrace{\frac{n}{2}\boldsymbol{\Sigma}}_{\text{as per property 4}} - \underbrace{\frac{1}{2}\sum_{i,k} t_{ik}\,(\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top}_{\text{as per property 5}} = \mathbf{0}$$
$$\Rightarrow\; \boldsymbol{\Sigma} = \frac{1}{n}\sum_{i,k} t_{ik}\,(\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top .$$
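Gathered together, the three updates (priors, means, common covariance matrix) constitute the M-step for this model. The sketch below is a minimal NumPy illustration; the function `m_step` and its interface are hypothetical, not the Mix-GLOSS code.

```python
import numpy as np

def m_step(X, T):
    """M-step for a Gaussian mixture with a common covariance matrix:
    pi_k = (1/n) sum_i t_ik,  mu_k = weighted mean of cluster k,
    Sigma = (1/n) sum_ik t_ik (x_i - mu_k)(x_i - mu_k)^T."""
    n, p = X.shape
    nk = T.sum(axis=0)                      # soft counts, sum_i t_ik
    pi = nk / n                             # prior probabilities
    mus = (T.T @ X) / nk[:, None]           # cluster means
    Sigma = np.zeros((p, p))
    for k in range(T.shape[1]):
        diff = X - mus[k]
        Sigma += (T[:, k, None] * diff).T @ diff
    Sigma /= n                              # common covariance matrix
    return pi, mus, Sigma

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
T = rng.random((100, 3))
T /= T.sum(axis=1, keepdims=True)           # rows are posterior probabilities

pi, mus, Sigma = m_step(X, T)
print(np.isclose(pi.sum(), 1.0))            # priors sum to one: True
print(np.allclose(Sigma, Sigma.T))          # covariance is symmetric: True
```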
Obviously, this does not mean that I have nothing to be grateful for. I always felt unconditional love and support from my family, and I never felt homesick since my Spanish friends did the best they could to visit me frequently. During my time in Compiègne I met wonderful people who are now friends for life. I am sure that all these people do not need to be listed in this section to know how much I love them; I thank them every time we see each other by giving them the best of myself.

I enjoyed my time in Compiègne. It was an exciting adventure and I do not regret a single thing. I am sure that I will miss these days, but this does not make me sad because, as the Beatles sang in "The End" or Jorge Drexler in "Todo se transforma", the amount that you miss people is equal to the love you gave them and received from them.

The only names I am including are my supervisors', Yves Grandvalet and Gérard Govaert. I do not think it is possible to have had better teaching and supervision, and I am sure that the reason I finished this work was not only their technical advice but also their close support, humanity and patience.
Contents

List of Figures
List of Tables
Notation and Symbols

I Context and Foundations

1 Context

2 Regularization for Feature Selection
  2.1 Motivations
  2.2 Categorization of Feature Selection Techniques
  2.3 Regularization
    2.3.1 Important Properties
    2.3.2 Pure Penalties
    2.3.3 Hybrid Penalties
    2.3.4 Mixed Penalties
    2.3.5 Sparsity Considerations
    2.3.6 Optimization Tools for Regularized Problems

II Sparse Linear Discriminant Analysis

Abstract

3 Feature Selection in Fisher Discriminant Analysis
  3.1 Fisher Discriminant Analysis
  3.2 Feature Selection in LDA Problems
    3.2.1 Inertia Based
    3.2.2 Regression Based

4 Formalizing the Objective
  4.1 From Optimal Scoring to Linear Discriminant Analysis
    4.1.1 Penalized Optimal Scoring Problem
    4.1.2 Penalized Canonical Correlation Analysis
    4.1.3 Penalized Linear Discriminant Analysis
    4.1.4 Summary
  4.2 Practicalities
    4.2.1 Solution of the Penalized Optimal Scoring Regression
    4.2.2 Distance Evaluation
    4.2.3 Posterior Probability Evaluation
    4.2.4 Graphical Representation
  4.3 From Sparse Optimal Scoring to Sparse LDA
    4.3.1 A Quadratic Variational Form
    4.3.2 Group-Lasso OS as Penalized LDA

5 GLOSS Algorithm
  5.1 Regression Coefficients Updates
    5.1.1 Cholesky decomposition
    5.1.2 Numerical Stability
  5.2 Score Matrix
  5.3 Optimality Conditions
  5.4 Active and Inactive Sets
  5.5 Penalty Parameter
  5.6 Options and Variants
    5.6.1 Scaling Variables
    5.6.2 Sparse Variant
    5.6.3 Diagonal Variant
    5.6.4 Elastic net and Structured Variant

6 Experimental Results
  6.1 Normalization
  6.2 Decision Thresholds
  6.3 Simulated Data
  6.4 Gene Expression Data
  6.5 Correlated Data
Discussion

III Sparse Clustering Analysis

Abstract

7 Feature Selection in Mixture Models
  7.1 Mixture Models
    7.1.1 Model
    7.1.2 Parameter Estimation: The EM Algorithm
  7.2 Feature Selection in Model-Based Clustering
    7.2.1 Based on Penalized Likelihood
    7.2.2 Based on Model Variants
    7.2.3 Based on Model Selection

8 Theoretical Foundations
  8.1 Resolving EM with Optimal Scoring
    8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis
    8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis
    8.1.3 Clustering Using Penalized Optimal Scoring
    8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
  8.2 Optimized Criterion
    8.2.1 A Bayesian Derivation
    8.2.2 Maximum a Posteriori Estimator

9 Mix-GLOSS Algorithm
  9.1 Mix-GLOSS
    9.1.1 Outer Loop: Whole Algorithm Repetitions
    9.1.2 Penalty Parameter Loop
    9.1.3 Inner Loop: EM Algorithm
  9.2 Model Selection

10 Experimental Results
  10.1 Tested Clustering Algorithms
  10.2 Results
  10.3 Discussion

Conclusions

Appendix
A Matrix Properties
B The Penalized-OS Problem is an Eigenvector Problem
  B.1 How to Solve the Eigenvector Decomposition
  B.2 Why the OS Problem is Solved as an Eigenvector Problem
C Solving Fisher's Discriminant Problem
D Alternative Variational Formulation for the Group-Lasso
  D.1 Useful Properties
  D.2 An Upper Bound on the Objective Function
E Invariance of the Group-Lasso to Unitary Transformations
F Expected Complete Likelihood and Likelihood
G Derivation of the M-Step Equations
  G.1 Prior probabilities
  G.2 Means
  G.3 Covariance Matrix

Bibliography
List of Figures

1.1 MASH project logo
2.1 Example of relevant features
2.2 Four key steps of feature selection
2.3 Admissible sets in two dimensions for different pure norms ||β||_p
2.4 Two-dimensional regularized problems with ||β||_1 and ||β||_2 penalties
2.5 Admissible sets for the Lasso and Group-Lasso
2.6 Sparsity patterns for an example with 8 variables characterized by 4 parameters
4.1 Graphical representation of the variational approach to Group-Lasso
5.1 GLOSS block diagram
5.2 Graph and Laplacian matrix for a 3×3 image
6.1 TPR versus FPR for all simulations
6.2 2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA
6.3 USPS digits "1" and "0"
6.4 Discriminant direction between digits "1" and "0"
6.5 Sparse discriminant direction between digits "1" and "0"
9.1 Mix-GLOSS Loops Scheme
9.2 Mix-GLOSS model selection diagram
10.1 Class mean vectors for each artificial simulation
10.2 TPR versus FPR for all simulations
List of Tables

6.1 Experimental results for simulated data, supervised classification
6.2 Average TPR and FPR for all simulations
6.3 Experimental results for gene expression data, supervised classification
10.1 Experimental results for simulated data, unsupervised clustering
10.2 Average TPR versus FPR for all clustering simulations
Notation and Symbols

Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors, and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.

Sets

N the set of natural numbers, N = {1, 2, ...}
R the set of reals
|A| cardinality of a set A (for finite sets, the number of elements)
Ā complement of set A

Data

X input domain
x_i input sample, x_i ∈ X
X design matrix, X = (x_1^T, ..., x_n^T)^T
x^j column j of X
y_i class indicator of sample i
Y indicator matrix, Y = (y_1^T, ..., y_n^T)^T
z complete data, z = (x, y)
G_k set of the indices of observations belonging to class k
n number of examples
K number of classes
p dimension of X
i, j, k indices running over N

Vectors, Matrices and Norms

0 vector with all entries equal to zero
1 vector with all entries equal to one
I identity matrix
A^T transpose of matrix A (ditto for vectors)
A^{-1} inverse of matrix A
tr(A) trace of matrix A
|A| determinant of matrix A
diag(v) diagonal matrix with v on the diagonal
||v||_1 L1 norm of vector v
||v||_2 L2 norm of vector v
||A||_F Frobenius norm of matrix A

Probability

E[·] expectation of a random variable
var[·] variance of a random variable
N(μ, σ²) normal distribution with mean μ and variance σ²
W(W, ν) Wishart distribution with ν degrees of freedom and scale matrix W
H(X) entropy of random variable X
I(X; Y) mutual information between random variables X and Y

Mixture Models

y_ik hard membership of sample i to cluster k
f_k distribution function for cluster k
t_ik posterior probability of sample i to belong to cluster k
T posterior probability matrix
π_k prior probability or mixture proportion for cluster k
μ_k mean vector of cluster k
Σ_k covariance matrix of cluster k
θ_k parameter vector for cluster k, θ_k = (μ_k, Σ_k)
θ^(t) parameter vector at iteration t of the EM algorithm
f(X; θ) likelihood function
L(θ; X) log-likelihood function
L_C(θ; X, Y) complete log-likelihood function

Optimization

J(·) cost function
L(·) Lagrangian
β̂ generic notation for the solution with respect to β
β^ls least squares solution coefficient vector
A active set
γ step size to update the regularization path
h direction to update the regularization path

Penalized Models

λ, λ_1, λ_2 penalty parameters
P_λ(θ) penalty term over a generic parameter vector
β_kj coefficient j of discriminant vector k
β_k kth discriminant vector, β_k = (β_k1, ..., β_kp)^T
B matrix of discriminant vectors, B = (β_1, ..., β_{K-1})
β^j jth row of B = (β^{1T}, ..., β^{pT})^T
B_LDA coefficient matrix in the LDA domain
B_CCA coefficient matrix in the CCA domain
B_OS coefficient matrix in the OS domain
X_LDA data matrix in the LDA domain
X_CCA data matrix in the CCA domain
X_OS data matrix in the OS domain
θ_k score vector k
Θ score matrix, Θ = (θ_1, ..., θ_{K-1})
Y label matrix
Ω penalty matrix
L_CP(θ; X, Z) penalized complete log-likelihood function
Σ_B between-class covariance matrix
Σ_W within-class covariance matrix
Σ_T total covariance matrix
Σ̂_B sample between-class covariance matrix
Σ̂_W sample within-class covariance matrix
Σ̂_T sample total covariance matrix
Λ inverse of the covariance matrix, or precision matrix
w_j weights
τ_j penalty components of the variational approach
Part I
Context and Foundations
This thesis is divided in three parts In Part I I am introducing the context in whichthis work has been developed the project that funded it and the constraints that we hadto obey Generic are also detailed here to introduce the models and some basic conceptsthat will be used along this document The state of the art of is also reviewed
The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments that compare its performance to other state-of-the-art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.
The second contribution is described in Part III, with a structure analogous to that of Part II but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM algorithm. This part is also furnished with specific theoretical foundations, an experimental section, and a final discussion.
1 Context
The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki documentation, forums, coding templates, and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.
The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiègne.
From the research point of view, the members of the consortium must deal with four main goals:
1. Software development of the website, framework, and APIs

2. Classification and goal-planning in high dimensional feature spaces

3. Interfacing the platform with the 3D virtual environment and the robot arm

4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments
Figure 1.1: MASH project logo
The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role has been to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community; the last number I was aware of was about 300. Within those 375 extractors, some must share the same theoretical principles or supply similar features. The framework of the project tests every new piece of code with some reference datasets in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors on a particular dataset does not mean that both are using the same variables.
Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to others. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we will also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user who develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of the whole database. As another example, imagine a user who wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.
As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.
• Clustering Using Mixture Models - This is a well-known technique that models the data as if it were randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means, and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm, and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, a C++ library that can be interfaced with Matlab. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the implemented tool are given in deliverable "mash-deliverable-D7.1-m12" (Govaert et al., 2010).
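The EM-plus-MAP procedure described above can be sketched in a few lines. The platform itself relies on the mixmod C++ library; the following minimal one-dimensional, two-component NumPy sketch is only illustrative, and all names in it are hypothetical.

```python
import numpy as np

def em_gmm_1d(x, K=2, n_iter=100):
    """Illustrative EM for a 1-D Gaussian mixture, followed by a
    maximum-a-posteriori (MAP) cluster assignment."""
    n = len(x)
    pi = np.full(K, 1.0 / K)                         # mixture proportions
    mu = np.quantile(x, (np.arange(K) + 0.5) / K)    # deterministic init
    var = np.full(K, x.var())
    for _ in range(n_iter):
        # E-step: posterior probabilities t_ik
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        t = pi * dens
        t /= t.sum(axis=1, keepdims=True)
        # M-step: re-estimate proportions, means, variances
        nk = t.sum(axis=0)
        pi, mu = nk / n, (t * x[:, None]).sum(axis=0) / nk
        var = (t * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    labels = t.argmax(axis=1)                        # MAP assignment
    return pi, mu, var, labels

# Two well-separated synthetic clusters
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(10, 1, 200)])
pi, mu, var, labels = em_gmm_1d(x)
print(sorted(mu.round(1)))
```

On this toy sample the estimated means land near the true cluster centers 0 and 10, and the MAP step splits the samples accordingly.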
• Sparse Clustering Using Penalized Optimal Scoring - This technique again intends to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis.
All details concerning the implemented tool can be found in deliverable "mash-deliverable-D7.2-m24" (Govaert et al., 2011).
• Table Clustering Using The RV Coefficient - This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractor space is defined using the RV coefficient, a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(Oi, Oj), where Oi and Oj are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors. A detailed description of this technique can be found in deliverables "mash-deliverable-D7.1-m12" (Govaert et al., 2010) and "mash-deliverable-D7.2-m24" (Govaert et al., 2011).
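As an illustration, the RV coefficient can be computed directly from two centered tables. In the sketch below the operators Oi are assumed to be the plain cross-product matrices X X⊤, which is one common choice; the deliverables define the exact operators used on the platform.

```python
import numpy as np

def rv_coefficient(Xi, Xj):
    """RV coefficient between two tables describing the same n samples.
    Each table is column-centered; the comparison is an inner product
    between the n x n operators O = X X^T."""
    Xi = Xi - Xi.mean(axis=0)
    Xj = Xj - Xj.mean(axis=0)
    Oi, Oj = Xi @ Xi.T, Xj @ Xj.T
    return np.trace(Oi @ Oj) / np.sqrt(np.trace(Oi @ Oi) * np.trace(Oj @ Oj))

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 5))
print(rv_coefficient(A, A))        # identical tables: 1.0
print(rv_coefficient(A, 2.0 * A))  # invariant to rescaling: 1.0
```

The coefficient lies in [0, 1], equals 1 for tables that span the same configuration of samples (even after rescaling), and can then be turned into a dissimilarity for clustering extractors.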
I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to fulfill our commitments. I simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).
2 Regularization for Feature Selection
With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, so did the numerical issues. Redundancy or extremely correlated features may appear if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined; many algorithms in the field of Machine Learning make use of this statistic.
2.1 Motivations
There is a fairly recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis, for instance, was shown to be no better than a "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).
As a rule of thumb, in discriminant and clustering problems the computational complexity increases with the number of objects in the database, the number of features (dimensionality), and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in the supervised environment, and eases interpretation in the unsupervised framework. Removing features must be done wisely to avoid discarding critical information.
When talking about dimensionality reduction, there are two families of techniques that could induce confusion:
• Reduction by feature transformation summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis and Independent Component Analysis are two popular examples.
• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features of the original dataset. A problem arises when there is a restriction on the number of variables to preserve and discarding the exceeding dimensions leads to a loss of information. Prediction with feature
Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008)
selection is computationally cheaper, because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.
As a basic rule, we can use reduction techniques by feature transformation when the majority of the features are relevant and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. The paper of Chidlovskii and Lecerf (2008) gives a great explanation of the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text:
"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.
Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats, but not distinguishing between dogs and cats. Thus the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the
Figure 2.2: The four key steps of feature selection, according to Liu and Yu (2005)
"diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."
There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.
2.2 Categorization of Feature Selection Techniques
Feature selection is one of the most frequent techniques for preprocessing data in order to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.
I reproduce here the scheme that generalizes any feature selection process, as shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps of a feature selection algorithm.
The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews for characterizing feature selection techniques. I propose a framework inspired by these references that does not cover all the possibilities but gives a good summary of the existing ones:
• Depending on the type of integration with the machine learning algorithm, we have:
– Filter Models - Filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.
– Wrapper Models - Wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block while
the feature subset evaluation is done in a different one. Therefore, the criterion to optimize and the criterion to evaluate may be different. Those algorithms are computationally expensive.
– Embedded Models - They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation form a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal, because they are specific to the training process of a given mining algorithm.
• Depending on the feature searching technique:
– Complete - No subsets are missed from evaluation. This involves combinatorial searches.

– Sequential - Features are added (forward searches) or removed (backward searches) one at a time.

– Random - The initial subset, or even subsequent subsets, are randomly chosen to escape local optima.
• Depending on the evaluation technique:
– Distance Measures - Choosing the features that maximize separability, divergence, or discrimination measures.

– Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

– Dependency Measures - Measuring the correlation between features.

– Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.

– Predictive Accuracy - Using the selected features to predict the labels.

– Cluster Goodness - Using the selected features to perform clustering and evaluating the result (cluster compactness, scatter separability, maximum likelihood).
The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness allow evaluating subsets of features and can be used in wrapper and embedded models.
In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized
goal and the process dedicated to achieving this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve hard selection problems exactly when the number of features exceeds a few tens. Regularization techniques provide a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.
2.3 Regularization
In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues of ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.
An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique, and be stable. This happens, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a low sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:
min_β J(β) + λ P(β)          (2.1)

min_β J(β)   s.t.   P(β) ≤ t          (2.2)
In expressions (2.1) and (2.2), the parameters λ and t have a similar function, which is to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set such that the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. The penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.
In this section I review pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.
Figure 2.3: Admissible sets in two dimensions for different pure norms ||β||_p
2.3.1 Important Properties
Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.
Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies
∀(x1, x2) ∈ X²,   f(t x1 + (1 − t) x2) ≤ t f(x1) + (1 − t) f(x2)          (2.3)
for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.
Sparsity. Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and fewer computational resources.
Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, besides preventing overfitting, is a means to favor the stability of the solution.
2.3.2 Pure Penalties
For pure penalties, defined as P(β) = ||β||_p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and of the algorithms to solve them. In
Figure 2.4: Two-dimensional regularized problems with ||β||_1 and ||β||_2 penalties
this figure, the shape of the admissible set corresponding to each pure penalty is greyed out. Since the convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.
Regularizing a linear model with a norm like ||β||_p means that the larger the component |β_j|, the more important the feature x_j in the estimation. On the contrary, the closer to zero, the more dispensable it is. In the limit of |β_j| = 0, x_j is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.
A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered sparse if one of its components (β1 or β2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2) where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function while remaining inside the grey region. Depending on the shape of this region, the probability of obtaining a sparse solution varies. A region with vertexes, like the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than that of an L2 penalty. This idea is displayed in Figure 2.4, where J(β) is a quadratic function represented by three isolevel curves whose global minimum β^ls lies outside the penalties' admissible regions. The closest point to this β^ls for the L1 regularization is β^l1, and for the L2 regularization it is β^l2. Solution β^l1 is sparse because its second component is zero, while both components of β^l2 are different from zero.
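This geometric contrast has a simple closed-form counterpart when J(β) is quadratic with an orthonormal design: the L1 solution soft-thresholds the least-squares coefficients, producing exact zeros, while the L2 solution only rescales them. A small sketch with illustrative values:

```python
import numpy as np

def soft_threshold(b, t):
    """Closed-form L1 effect (orthonormal design): shrink toward zero
    and clip at zero, so small coefficients vanish exactly."""
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

def ridge_shrink(b, lam):
    """Closed-form L2 effect: rescale every coefficient, never zeroing one."""
    return b / (1.0 + lam)

b_ls = np.array([3.0, 0.4, -2.0, 0.1])   # unconstrained least-squares solution
print(soft_threshold(b_ls, 0.5))  # [ 2.5  0.  -1.5  0. ]: exact zeros
print(ridge_shrink(b_ls, 0.5))    # all four coefficients stay nonzero
```

The two small coefficients are set exactly to zero by the L1 rule, mirroring β^l1 landing on an axis, while the L2 rule leaves every component active, mirroring β^l2.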
After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the quantity and the "sharpness" of the vertexes of the greyed out area. For example, an L_{1/3} penalty has a support region with sharper vertexes that would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L_{1/3} ball results in difficulties during optimization that do not occur with a convex shape.
To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with L_p norms with p ≤ 1, because they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex; hence the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.
L0 Penalties. The L0 pseudo-norm of a vector β is defined as its number of entries different from zero, that is, P(β) = ||β||_0 = card{β_j | β_j ≠ 0}:
min_β J(β)   s.t.   ||β||_0 ≤ t          (2.4)
where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ, if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. Their solutions are sparse but unstable.
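The combinatorial nature of the L0-constrained problem (2.4) can be made concrete with a brute-force sketch that enumerates every support of size at most t for a least-squares J(β). The helper below is purely illustrative and only feasible for a handful of features, which is precisely the point.

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, t):
    """Exact L0-constrained least squares: try every support of size <= t
    and keep the one with the smallest residual sum of squares.
    Cost grows combinatorially with p, hence intractable beyond small p."""
    n, p = X.shape
    best, best_rss = np.zeros(p), np.inf
    for k in range(t + 1):
        for S in combinations(range(p), k):
            beta = np.zeros(p)
            if S:
                beta[list(S)] = np.linalg.lstsq(X[:, S], y, rcond=None)[0]
            rss = ((y - X @ beta) ** 2).sum()
            if rss < best_rss:
                best, best_rss = beta, rss
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
y = X @ np.array([2.0, 0, 0, -3.0, 0, 0]) + 0.01 * rng.normal(size=50)
beta = best_subset(X, y, t=2)
print(np.nonzero(beta)[0])  # the two truly informative features
```

Even with p = 6 the search already visits 22 supports for t = 2; doubling p roughly squares that count, which is why relaxations such as the L1 penalty are used instead.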
L1 Penalties. The penalties built using the L1 norm induce sparsity and stability. The resulting estimator has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):
min_β J(β)   s.t.   Σ_{j=1}^p |β_j| ≤ t          (2.5)
Despite all the advantages of the Lasso, the choice of the right penalty is not just a question of convexity and sparsity. Concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one of them, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.
The Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in feature selection for supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012b,a).
The consistency of the problems regularized by a Lasso penalty is also a key feature, defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large. Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by
minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions under which Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Bühlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).
L2 Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used instead, to avoid the square root and solve a linear system. Thus, an L2 penalized optimization problem looks like
min_β J(β) + λ ||β||²_2          (2.6)
The effect of this penalty is the "equalization" of the components of the parameter vector being penalized. To enlighten this property, let us consider a least squares problem:
min_β Σ_{i=1}^n (y_i − x_i^⊤ β)²          (2.7)
with solution β^ls = (X^⊤X)^{−1} X^⊤y. If some input variables are highly correlated, the estimator β^ls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:
min_β Σ_{i=1}^n (y_i − x_i^⊤ β)² + λ Σ_{j=1}^p β_j²
The solution to this problem is β^l2 = (X^⊤X + λI_p)^{−1} X^⊤y. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performance.
As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrote, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:
min_β Σ_{i=1}^n (y_i − x_i^⊤ β)² + λ Σ_{j=1}^p β_j² / (β_j^ls)²          (2.8)
The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002),
where the penalty parameter differs for each component. There, every λ_j is optimized to penalize more or less depending on the influence of β_j in the model.
Although L2 penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.
L∞ Penalties. A special case of L_p norms is the infinity norm, defined as ||x||_∞ = max(|x1|, |x2|, …, |xp|). The admissible region for a penalty like ||β||_∞ ≤ t is displayed in Figure 2.3. For the L∞ norm, the greyed out region is a square containing all the β vectors whose largest coefficient is less than or equal to the value of the penalty parameter t.
This norm is not commonly used as a regularization term by itself; however, it frequently appears combined in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing optimality conditions of sparse regularized problems. The dual norm ||β||_* of a norm ||β|| is defined as

||β||_* = max_{w ∈ R^p} β^⊤w   s.t.   ||w|| ≤ 1

In the case of an L_q norm with q ∈ [1, +∞], the dual norm is the L_r norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual, and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why L∞ is so important even if it is not so popular as a penalty itself, because L1 is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).
2.3.3 Hybrid Penalties
There is no reason to use pure penalties in isolation. We can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), whose objective is to improve the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is
min_β Σ_{i=1}^n (y_i − x_i^⊤ β)² + λ1 Σ_{j=1}^p |β_j| + λ2 Σ_{j=1}^p β_j²          (2.9)
The term in λ1 is a Lasso penalty that induces sparsity in vector β; on the other side, the term in λ2 is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotic capability (when n goes to infinity) of always making the right choice of relevant variables.
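Under the simplifying assumption of an orthonormal design, X^⊤X = I_p, the Elastic net (2.9) has a coordinate-wise closed form that makes the role of each term explicit: the λ1 term soft-thresholds the least-squares coefficients and the λ2 term shrinks what remains. A sketch with illustrative values:

```python
import numpy as np

def enet_orthonormal(b_ls, lam1, lam2):
    """Coordinate-wise Elastic net solution when X^T X = I:
    soft-threshold at lam1/2 (L1 part), then shrink by 1/(1 + lam2)."""
    st = np.sign(b_ls) * np.maximum(np.abs(b_ls) - lam1 / 2.0, 0.0)
    return st / (1.0 + lam2)

b_ls = np.array([3.0, 0.4, -2.0, 0.1])   # least-squares coefficients
print(enet_orthonormal(b_ls, lam1=1.0, lam2=0.5))
# -> [1.6667  0.  -1.  0.]: sparse (L1 effect) and shrunk (L2 effect)
```

Setting λ1 = λ2 = 0 recovers b_ls unchanged; setting λ2 = 0 gives the pure Lasso soft-thresholding, and λ1 = 0 the pure ridge shrinkage.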
2.3.4 Mixed Penalties
Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us denote by G_ℓ the group of genes for the ℓth process and by d_ℓ the number of genes (variables) in each group, for all ℓ ∈ {1, …, L}. Thus, the dimension of vector β is the sum of the number of genes of every group, dim(β) = Σ_{ℓ=1}^L d_ℓ. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

||β||_(r,s) = ( Σ_ℓ ( Σ_{j∈G_ℓ} |β_j|^s )^{r/s} )^{1/r}          (2.10)
The pair (r, s) identifies the norms that are combined: an L_s norm within groups and an L_r norm between groups. The L_s norm penalizes the variables in every group G_ℓ, while the L_r norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
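Expression (2.10) translates directly into code; the small helper below is illustrative, with the group structure passed as one label per coefficient:

```python
import numpy as np

def mixed_norm(beta, groups, r=1, s=2):
    """||beta||_(r,s): an Ls norm within each group, then an Lr norm
    across the resulting group-level values. `groups` assigns each
    coefficient to its group G_l."""
    groups = np.asarray(groups)
    within = np.array([np.linalg.norm(beta[groups == g], ord=s)
                       for g in np.unique(groups)])
    return np.linalg.norm(within, ord=r)

beta = np.array([3.0, 4.0, 0.0, 0.0, 5.0])
groups = [0, 0, 1, 1, 2]
print(mixed_norm(beta, groups, r=1, s=2))  # group-Lasso norm: 5 + 0 + 5 = 10
print(mixed_norm(beta, groups, r=1, s=1))  # reduces to the plain L1 norm: 12
```

With (r, s) = (1, 2), the zeroed-out middle group contributes nothing to the penalty, which is exactly the groupwise sparsity that the group-Lasso exploits.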
Several combinations are available; the most popular is the norm ‖β‖(1,2), known as group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L(1,2) norm. Many other mixings are possible, such as ‖β‖(1,4/3) (Szafranski et al., 2008) or ‖β‖(1,∞) (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms, such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).
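As an illustration, the mixed norm of (2.10) can be computed directly from its definition; the vector and the group structure below are hypothetical:

```python
import numpy as np

def mixed_norm(beta, groups, r, s):
    """Mixed (r, s) norm of (2.10): an Ls norm within each group G_l,
    then an Lr norm across the resulting group norms."""
    within = [np.sum(np.abs(beta[g]) ** s) ** (1.0 / s) for g in groups]
    return float(np.sum(np.asarray(within) ** r) ** (1.0 / r))

beta = np.array([3.0, 4.0, 0.0, -12.0])
groups = [[0, 1], [2, 3]]                  # two groups of two variables each
gl = mixed_norm(beta, groups, r=1, s=2)    # group-Lasso norm: 5 + 12 = 17
```

Setting r = s = 1 recovers the plain L1 norm, which is why the group-Lasso can be seen as an "L1 norm over groups".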
2.3.5 Sparsity Considerations
In this chapter I have reviewed several possibilities for inducing sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models feature-wise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.
The Lasso and the other L1 penalties encourage solutions such as the one on the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection, and if the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.
To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown on the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L(1,2) or L(1,∞) mixed norms, with the proper definition of groups, can induce sparsity patterns such as
2 Regularization for Feature Selection
Figure 2.5: Admissible sets for the Lasso ((a), L1 norm) and the group-Lasso ((b), L(1,2) norm).
Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters: (a) L1-induced sparsity; (b) L(1,2) group-induced sparsity.
the one on the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.
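The contrast between the two panels of Figure 2.6 can be reproduced numerically: elementwise soft-thresholding (the proximal operator of the L1 norm) zeroes individual parameters, whereas row-wise group thresholding (the proximal operator of the L(1,2) norm with one group per variable) zeroes entire rows, and thus removes variables. The matrix sizes and threshold values below are arbitrary:

```python
import numpy as np

def soft_threshold(B, lam):
    """Proximal operator of lam * L1: shrinks every entry independently."""
    return np.sign(B) * np.maximum(np.abs(B) - lam, 0.0)

def group_threshold(B, lam):
    """Proximal operator of lam * L(1,2) with one group per row:
    shrinks whole rows, zeroing those whose Euclidean norm is below lam."""
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    return B * np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)

rng = np.random.default_rng(1)
B = rng.normal(size=(8, 4))          # 8 variables x 4 parameters, as in Figure 2.6
B[[2, 4, 7]] *= 0.1                  # three weak, non-informative variables
B_l1 = soft_threshold(B, 0.8)        # scattered zeros, variables survive
B_grp = group_threshold(B, 1.5)      # whole rows zeroed, variables removed
removed = np.where(~B_grp.any(axis=1))[0]
```

Only the group operator guarantees the all-or-nothing row pattern that allows discarding a variable entirely.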
2.3.6 Optimization Tools for Regularized Problems
Caramanis et al. (2012) provide a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified in four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.
In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an "active constraints" algorithm, implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Further details are given in the dedicated Chapter 5.
Subgradient Descent. Subgradient descent is a generic optimization method that can be used in penalized settings where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure; on the other hand, many iterations are needed, so convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β(t+1) is updated proportionally to the negative subgradient of the objective at the current point β(t):
β(t+1) = β(t) − α(s + λs′),  where s ∈ ∂J(β(t)), s′ ∈ ∂P(β(t))
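A minimal sketch of this update for the Lasso case (squared loss J, L1 penalty P); the step size, iteration count and data are arbitrary choices for illustration:

```python
import numpy as np

def lasso_subgradient_descent(X, y, lam, alpha=1e-3, n_iter=5000):
    """beta <- beta - alpha * (s + lam * s'), with s the gradient of the
    squared loss and s' = sign(beta), a valid subgradient of the L1 norm."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        s = -2.0 * X.T @ (y - X @ beta)   # gradient of sum_i (y_i - x_i' beta)^2
        s_prime = np.sign(beta)           # subgradient of ||beta||_1 (0 at 0)
        beta = beta - alpha * (s + lam * s_prime)
    return beta

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.0]) + 0.1 * rng.normal(size=30)
beta_hat = lasso_subgradient_descent(X, y, lam=2.0)
objective = lambda b: np.sum((y - X @ b) ** 2) + 2.0 * np.sum(np.abs(b))
```

As announced in the text, the iterates are not exactly sparse: once a coefficient has moved away from zero, the sign-based subgradient step never sets it back to exactly zero.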
Coordinate Descent. Coordinate descent is based on the first-order optimality conditions of criterion (2.1). In the case of penalties like the Lasso, setting to zero the first-order derivative with respect to coefficient βj gives

βj = ( −λ sign(βj) − ∂J(β)/∂βj ) / ( 2 ∑_{i=1}^{n} xij² )
In the literature, those algorithms may also be referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution β^ls and updating the values by iterative thresholding, where βj(t+1) = Sλ(∂J(β(t))/∂βj). The objective function is optimized with respect
to one variable at a time, while all others are kept fixed:
Sλ(∂J(β)/∂βj) =
  ( λ − ∂J(β)/∂βj ) / ( 2 ∑_{i=1}^{n} xij² )     if ∂J(β)/∂βj > λ
  ( −λ − ∂J(β)/∂βj ) / ( 2 ∑_{i=1}^{n} xij² )    if ∂J(β)/∂βj < −λ
  0                                              if |∂J(β)/∂βj| ≤ λ    (2.11)
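A coordinate descent sketch for the Lasso along the lines of (2.11), cycling over coordinates and applying the closed-form soft-thresholding update; the data and penalty level are illustrative:

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize sum_i (y_i - x_i' beta)^2 + lam * ||beta||_1 one coordinate
    at a time; each update is the thresholding of (2.11), written here in
    terms of the partial residual r_j."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # residual without x_j
            # dJ/dbeta_j at beta_j = 0 equals -2 * x_j' r_j, so thresholding
            # dJ/dbeta_j at level lam amounts to thresholding x_j' r_j at lam/2
            beta[j] = soft(X[:, j] @ r_j, lam / 2.0) / (X[:, j] @ X[:, j])
    return beta

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, 0.0, 0.0, -3.0, 0.0]) + 0.1 * rng.normal(size=50)
beta_hat = lasso_coordinate_descent(X, y, lam=20.0)
```

Unlike subgradient descent, these iterates are exactly sparse: coefficients whose partial correlation falls below the threshold are set to zero.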
The same principles define "block-coordinate descent" algorithms. In that case, the first-order optimality conditions are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).
Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", which stores the indices of the variables with non-zero βj; it is usually identified as the set A. The complement of the active set is the "inactive set", denoted Ā, which contains the indices of the variables whose βj is zero. Thus, the problem can be reduced to the dimensionality of A.
Osborne et al. (2000a) proposed the first of those algorithms, to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There is also a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy, which starts with an empty A, has the advantage that the first calculations are low dimensional. In addition, the forward view better fits the feature selection intuition, where only a few features are expected to be selected.
Working set algorithms have to deal with three main tasks. First, there is an optimization task, where a minimization problem has to be solved using only the variables from the active set. Osborne et al. (2000a) solve a linear approximation of the original problem to determine the descent direction of the objective function, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions themselves: their expressions are essential for selecting the next variable to add to the active set and for testing whether a particular vector β is a solution of Problem (2.1).
These active constraints or working set methods, even if they were originally proposed to solve L1-regularized quadratic problems, can also be adapted to generic functions and penalties, for example linear functions and L1 penalties (Roth, 2004), linear functions
and L(1,2) penalties (Roth and Fischer, 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al., 2003). The algorithm developed in this work belongs to this family of solutions.
Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise-linear approximation of the original cost function. This convex approximation is built from several secant hyper-planes at different points, obtained from the subgradient of the cost function at these points.
This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.
This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims, 2006; Smola et al., 2008; Franc and Sonnenburg, 2008) or Multiple Kernel Learning (Sonnenburg et al., 2006).
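A toy illustration of the principle, with a one-dimensional convex cost f(x) = x² approximated from below by secant hyper-planes built from its (sub)gradient at a few arbitrary cut points:

```python
# Piecewise-linear lower approximation of a convex cost by secant planes:
# each plane x -> f(c) + f'(c) * (x - c) lies below f and touches it at c;
# the model is the pointwise maximum of the planes.
f = lambda x: x ** 2          # toy convex cost function
df = lambda x: 2.0 * x        # its (sub)gradient
cut_points = [-2.0, 0.5, 3.0]

def model(x):
    return max(f(c) + df(c) * (x - c) for c in cut_points)
```

Adding a new plane at the current minimizer of the model refines the approximation where it matters; with too few planes the model is loose and its minimizer unstable, as noted above.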
Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1), where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).
This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the Least Angle Regression (LARS) algorithm, developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.
Once an active set A(t) and its corresponding solution β(t) have been set, following the regularization path means looking for a direction h and a step size γ to update the solution as β(t+1) = β(t) + γh. Afterwards, the active and inactive sets A(t+1) and Ā(t+1) are updated. That can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size, and decides which variable should enter the active set, from the correlation with the residuals.
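The path concept can be illustrated by solving the Lasso for a decreasing sequence of penalty values with any solver (a plain coordinate descent is reused here); the active set typically grows as λ decreases. The data and λ grid are made up:

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=200):
    """Plain coordinate descent for sum (y - X beta)^2 + lam * ||beta||_1."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            r = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r
            beta[j] = np.sign(z) * max(abs(z) - lam / 2.0, 0.0) / (X[:, j] @ X[:, j])
    return beta

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 6))
y = X @ np.array([4.0, 0.0, -2.0, 0.0, 0.0, 1.0]) + 0.1 * rng.normal(size=60)

lambdas = [400.0, 100.0, 10.0, 0.1]       # strong to weak regularization
path = [lasso_cd(X, y, lam) for lam in lambdas]
active_sizes = [int(np.sum(beta != 0)) for beta in path]
```

LARS follows this same path exactly and much more cheaply, exploiting its piecewise linearity instead of re-solving at each λ.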
Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz-differentiable cost function J(β) and a non-differentiable penalty λP(β).
min_{β∈R^p}  J(β(t)) + ∇J(β(t))⊤(β − β(t)) + λP(β) + (L/2) ‖β − β(t)‖₂²    (2.12)
They are also iterative methods, where the cost function J(β) is linearized in the proximity of the current solution β(t), so that the problem to solve at each iteration looks like
(2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. That can be rewritten as
min_{β∈R^p}  (1/2) ‖β − (β(t) − (1/L)∇J(β(t)))‖₂² + (λ/L) P(β)    (2.13)
The basic algorithm uses the solution to (2.13) as the next iterate β(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle, 2009). Proximal methods can be seen as generalizations of gradient updates; in fact, setting λ = 0 in equation (2.13) recovers the standard gradient update rule.
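The basic proximal update is easy to implement for the Lasso, where the proximal operator of (λ/L)‖·‖₁ in (2.13) is soft-thresholding; this is the ISTA scheme, and the data below are illustrative:

```python
import numpy as np

def ista(X, y, lam, n_iter=2000):
    """Proximal gradient for the Lasso: a gradient step on the smooth part
    J(beta) = ||y - X beta||^2, followed by the proximal operator of
    (lam / L) * ||.||_1, i.e. soft-thresholding at lam / L, as in (2.13)."""
    L = 2.0 * np.linalg.norm(X, 2) ** 2      # Lipschitz constant of grad J
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (y - X @ beta)
        z = beta - grad / L                  # gradient step
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return beta

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 4))
y = X @ np.array([3.0, 0.0, -1.5, 0.0]) + 0.1 * rng.normal(size=40)
```

Setting λ = 0 leaves only the gradient step, recovering plain gradient descent, in line with the remark above.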
Part II
Sparse Linear Discriminant Analysis
Abstract
Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.
There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Section 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee models that are parsimonious with regard to variables.
In this part, we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach to LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis, and also enables deriving variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of the data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification. The first two or three directions can also be used to project the data, so as to generate a graphical display. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performance. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.
3 Feature Selection in Fisher Discriminant Analysis
3.1 Fisher Discriminant Analysis
Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates the classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of the data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part we focus on Fisher's discriminant analysis, a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities but rather on some inertia principles (Fisher, 1936).
We consider that the data consist of a set of n examples, with observations xi ∈ R^p comprising p features, and labels yi ∈ {0,1}^K indicating the exclusive assignment of observation xi to one of the K classes. It will be convenient to gather the observations in the n×p matrix X = (x1, …, xn)⊤ and the corresponding labels in the n×K matrix Y = (y1, …, yn)⊤.
Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:
max_{β∈R^p}  (β⊤ΣBβ) / (β⊤ΣWβ)    (3.1)
where β is the discriminant direction used to project the data, and ΣB and ΣW are the p×p between-class and within-class covariance matrices respectively, defined (for a K-class problem) as
ΣW = (1/n) ∑_{k=1}^{K} ∑_{i∈Gk} (xi − μk)(xi − μk)⊤

ΣB = (1/n) ∑_{k=1}^{K} ∑_{i∈Gk} (μ − μk)(μ − μk)⊤
where μ is the sample mean of the whole dataset, μk the sample mean of class k, and Gk indexes the observations of class k.
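These definitions translate directly into numpy; the helper below also allows checking the classical decomposition of the total covariance, ΣT = ΣW + ΣB. The data are synthetic:

```python
import numpy as np

def class_covariances(X, labels):
    """Sample within-class (Sigma_W) and between-class (Sigma_B) covariance
    matrices, with the 1/n normalization used in the text."""
    n, p = X.shape
    mu = X.mean(axis=0)
    Sw, Sb = np.zeros((p, p)), np.zeros((p, p))
    for k in np.unique(labels):
        Xk = X[labels == k]
        mu_k = Xk.mean(axis=0)
        Sw += (Xk - mu_k).T @ (Xk - mu_k) / n
        Sb += len(Xk) * np.outer(mu_k - mu, mu_k - mu) / n
    return Sw, Sb

rng = np.random.default_rng(6)
X = rng.normal(size=(90, 4)) + np.repeat(np.eye(3, 4) * 3.0, 30, axis=0)
labels = np.repeat([0, 1, 2], 30)
Sw, Sb = class_covariances(X, labels)
```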
This analysis can be extended to the multi-class framework with K groups. In this case, K−1 discriminant vectors βk may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher discriminant are available, for example as the maximization of a trace ratio:
max_{B∈R^{p×(K−1)}}  tr(B⊤ΣBB) / tr(B⊤ΣWB)    (3.2)
where the matrix B is built with the discriminant directions βk as columns. Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K−1 subproblems:

max_{βk∈R^p}  βk⊤ΣBβk
s.t.  βk⊤ΣWβk ≤ 1,
      βk⊤ΣWβℓ = 0, ∀ℓ < k    (3.3)
The maximizer of subproblem k is the eigenvector of ΣW⁻¹ΣB associated with the kth largest eigenvalue (see Appendix C).
3.2 Feature Selection in LDA Problems
LDA is often used as a data reduction technique, where the K−1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.
Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that involve only a few variables. The main goal of this sparsity is to reduce the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness in the solution, or computational constraints.
The easiest approach to sparse LDA performs variable selection before discrimination. The relevance of each feature is usually assessed with univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to endow LDA with wrapper and embedded feature selection capabilities.
They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, that is, either Fisher's discriminant analysis (variance-based) or a regression-based formulation.
3.2.1 Inertia Based
The Fisher discriminant seeks a projection maximizing the separability of classes based on inertia principles: mass centers should be far away from each other (large between-class variance), and
classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.
Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search, with some eigenvalue properties used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al., 2007).
Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where Fisher's discriminant (3.1) is solved as

min_{β∈R^p}  β⊤ΣWβ
s.t.  (μ1 − μ2)⊤β = 1,  ∑_{j=1}^{p} |βj| ≤ t
where μ1 and μ2 are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.
Witten and Tibshirani (2011) describe a multi-class technique using Fisher's discriminant, rewritten in the form of K−1 constrained and penalized maximization problems:

max_{βk∈R^p}  βk⊤ΣB^k βk − Pk(βk)
s.t.  βk⊤ΣWβk ≤ 1
The term to maximize is the projected between-class covariance βk⊤ΣB^k βk, subject to an upper bound on the projected within-class covariance βk⊤ΣWβk. The penalty Pk(βk) is added to avoid singularities and induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general purpose data: the Lasso shrinks less informative variables to zero, and the fused Lasso encourages a piecewise constant βk vector. The R code is available from the website of Daniela Witten.
Cai and Liu (2011) use Fisher's discriminant to solve a binary LDA problem. But instead of estimating ΣW and (μ1 − μ2) separately to obtain the optimal solution β = ΣW⁻¹(μ1 − μ2), they estimate the product directly, through a constrained L1 minimization:

min_{β∈R^p}  ‖β‖1
s.t.  ‖Σ̂β − (μ1 − μ2)‖∞ ≤ λ
Sparsity is encouraged by the L1 norm of vector β, and the parameter λ is used to tune the optimization.
Most of the algorithms reviewed are conceived for binary classification. For those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.
3.2.2 Regression Based
In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging in the multi-class case (Duda et al., 2000; Friedman et al., 2009).
Predefined Indicator Matrix
Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al., 2009). An indicator matrix Y is an n×K matrix holding the class labels of all samples. There are several well-known types in the literature. For example, the binary or dummy indicator (yik = 1 if sample i belongs to class k, and yik = 0 otherwise) is commonly used to link multi-class classification with linear regression (Friedman et al., 2009). Another popular choice is yik = 1 if sample i belongs to class k, and yik = −1/(K−1) otherwise; it was used, for example, in extending Support Vector Machines to multi-class classification (Lee et al., 2004) or in generalizing the kernel target alignment measure (Guermeur et al., 2004).
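Both indicator conventions are one-liners in numpy; this sketch is only illustrative:

```python
import numpy as np

def indicator_matrix(labels, K, style="dummy"):
    """n x K class indicator matrix Y. 'dummy': y_ik = 1 if sample i is in
    class k, else 0. 'symmetric': y_ik = 1 if in class k, else -1/(K-1)."""
    Y = np.eye(K)[np.asarray(labels)]
    if style == "symmetric":
        Y = np.where(Y == 1.0, 1.0, -1.0 / (K - 1))
    return Y

labels = [0, 2, 1, 1]
Y_dummy = indicator_matrix(labels, K=3)
Y_sym = indicator_matrix(labels, K=3, style="symmetric")
```

With the symmetric coding, each row sums to zero, which centers the targets across classes.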
Some works propose a formulation of the least squares problem based on a new class indicator matrix (Ye, 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition, shown empirically to hold in many applications involving high-dimensional data.
Qiao et al. (2009) propose a discriminant analysis for the high-dimensional, low-sample setting which incorporates variable selection in a Fisher's LDA, formulated as a generalized eigenvalue problem which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to optimal scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required; the lack of publicly available code also restrained an empirical test of this conjecture. If the similitude is confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.
In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is
obtained by solving

min_{β∈R^p, β0∈R}  n⁻¹ ∑_{i=1}^{n} (yi − β0 − xi⊤β)² + λ ∑_{j=1}^{p} |βj|
where yi is the binary indicator of the label for pattern xi. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule x⊤β + β0 > 0 is the LDA classifier when it is built using the resulting β vector for λ = 0, but a different intercept β0 is required.
Optimal Scoring
In binary classification, the regression of (scaled) class indicators enables exact recovery of the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al., 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification, and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the class indicators together with the discriminant functions. The approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani, 1996) or more conservative ones (Hastie et al., 1995).
As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as

min_{Θ,B}  ‖YΘ − XB‖F² + λ tr(B⊤ΩB)    (3.4a)
s.t.  n⁻¹ Θ⊤Y⊤YΘ = I_{K−1}    (3.4b)
where Θ ∈ R^{K×(K−1)} holds the class scores, B ∈ R^{p×(K−1)} holds the regression coefficients, and ‖·‖F is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of K−1 problems:
min_{θk∈R^K, βk∈R^p}  ‖Yθk − Xβk‖² + βk⊤Ωβk    (3.5a)
s.t.  n⁻¹ θk⊤Y⊤Yθk = 1    (3.5b)
      θk⊤Y⊤Yθℓ = 0,  ℓ = 1, …, k−1    (3.5c)
where each βk corresponds to a discriminant direction
Several sparse LDA methods have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan, 2005; Leng, 2008; Grosenick et al., 2008; Clemmensen et al., 2011). Grosenick et al. (2008) proposed a variant of the Lasso-based penalized OS of Ghosh and Chinnaiyan (2005), introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by
min_{βk∈R^p, θk∈R^K}  ∑_k ‖Yθk − Xβk‖₂² + λ1 ‖βk‖1 + λ2 βk⊤Ωβk
where λ1 and λ2 are regularization parameters, and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.
Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):
min_{βk∈R^p, θk∈R^K}  ∑_{k=1}^{K−1} ‖Yθk − Xβk‖₂² + λ ∑_{j=1}^{p} ( ∑_{k=1}^{K−1} βkj² )^{1/2}    (3.6)
which is the criterion chosen in this thesis. The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA and propose a publicly available, efficient code for solving this problem.
4 Formalizing the Objective
In this chapter, we detail the rationale supporting the group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis, and also enables deriving variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004).
The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), which selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of the data. For K classes, this representation can be either complete, in dimension K−1, or partial, for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.
The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here, since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka, 1984; Hastie et al., 1994; Hastie and Tibshirani, 1996; Hastie et al., 1995) and already used for sparsity-inducing penalties (Roth and Lange, 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by group-Lasso and penalized LDA (Sanchez Merchante et al., 2012).
4.1 From Optimal Scoring to Linear Discriminant Analysis
Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) and in penalized LDA (p-LDA), going through canonical correlation analysis. We first provide some properties of the solutions of an arbitrary problem in the p-OS series (3.5).
Throughout this chapter, we assume that:

- there is no empty class, that is, the diagonal matrix Y⊤Y is full rank;
- inputs are centered, that is, X⊤1n = 0;
- the quadratic penalty Ω is positive-semidefinite and such that X⊤X + Ω is full rank.
4.1.1 Penalized Optimal Scoring Problem
For the sake of simplicity, we now drop subscript k to refer to any problem in the p-OS series (3.5). First, note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value and vice-versa. The problems are, however, non-convex: in particular, if (θ, β) is a solution, then (−θ, −β) is also a solution.
The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K−1 first optimal scores are orthogonal to 1 (and the Kth problem would be solved by βK = 0). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we no longer mention these orthogonality constraints (3.5c), which apply along the route, so as to simplify all expressions. The generic problem solved is thus
min_{θ∈R^K, β∈R^p}  ‖Yθ − Xβ‖² + β⊤Ωβ    (4.1a)
s.t.  n⁻¹ θ⊤Y⊤Yθ = 1    (4.1b)
For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator
βos = (X⊤X + Ω)⁻¹ X⊤Yθ    (4.2)
The objective function (4.1a) is then

‖Yθ − Xβos‖² + βos⊤Ωβos = θ⊤Y⊤Yθ − 2θ⊤Y⊤Xβos + βos⊤(X⊤X + Ω)βos
                        = θ⊤Y⊤Yθ − θ⊤Y⊤X(X⊤X + Ω)⁻¹X⊤Yθ
where the second line stems from the definition (4.2) of βos. Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to
max_{θ: n⁻¹θ⊤Y⊤Yθ=1}  θ⊤Y⊤X(X⊤X + Ω)⁻¹X⊤Yθ    (4.3)
which shows that the optimization of the p-OS problem with respect to θk boils down to finding the kth largest eigenvector of Y⊤X(X⊤X + Ω)⁻¹X⊤Y. Indeed, Appendix C details that Problem (4.3) is solved by

(Y⊤Y)⁻¹Y⊤X(X⊤X + Ω)⁻¹X⊤Yθ = α²θ    (4.4)
where α² is the maximal eigenvalue.¹ Indeed, left-multiplying (4.4) by n⁻¹θ⊤(Y⊤Y) and using constraint (4.1b) gives

n⁻¹θ⊤Y⊤X(X⊤X + Ω)⁻¹X⊤Yθ = α² n⁻¹θ⊤(Y⊤Y)θ
n⁻¹θ⊤Y⊤X(X⊤X + Ω)⁻¹X⊤Yθ = α²    (4.5)
4.1.2 Penalized Canonical Correlation Analysis
As per Hastie et al. (1995), the penalized canonical correlation analysis (p-CCA) problem between variables X and Y is defined as follows:
max_{θ∈R^K, β∈R^p}  n⁻¹ θ⊤Y⊤Xβ    (4.6a)
s.t.  n⁻¹ θ⊤Y⊤Yθ = 1    (4.6b)
      n⁻¹ β⊤(X⊤X + Ω)β = 1    (4.6c)
The solutions to (4.6) are obtained by finding the saddle points of the Lagrangian:

nL(β, θ, ν, γ) = θ⊤Y⊤Xβ − ν(θ⊤Y⊤Yθ − n) − γ(β⊤(X⊤X + Ω)β − n)
⇒ n ∂L(β, θ, γ, ν)/∂β = X⊤Yθ − 2γ(X⊤X + Ω)β
⇒ βcca = (1/(2γ)) (X⊤X + Ω)⁻¹X⊤Yθ
Then, as βcca obeys (4.6c), we obtain

βcca = (X⊤X + Ω)⁻¹X⊤Yθ / ( n⁻¹θ⊤Y⊤X(X⊤X + Ω)⁻¹X⊤Yθ )^{1/2}    (4.7)
so that the optimal objective function (4.6a) can be expressed with θ alone:

n⁻¹θ⊤Y⊤Xβcca = n⁻¹θ⊤Y⊤X(X⊤X + Ω)⁻¹X⊤Yθ / ( n⁻¹θ⊤Y⊤X(X⊤X + Ω)⁻¹X⊤Yθ )^{1/2}
             = ( n⁻¹θ⊤Y⊤X(X⊤X + Ω)⁻¹X⊤Yθ )^{1/2}
and the optimization problem with respect to θ can be restated as
max_{θ: n⁻¹θ⊤Y⊤Yθ=1}  θ⊤Y⊤X(X⊤X + Ω)⁻¹X⊤Yθ    (4.8)
Hence, the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

βos = α βcca    (4.9)
¹ The awkward notation α² for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5), for example).
where α is defined by (4.5). The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:
n ∂L(β, θ, γ, ν)/∂θ = Y⊤Xβ − 2νY⊤Yθ
⇒ θcca = (1/(2ν)) (Y⊤Y)⁻¹Y⊤Xβ    (4.10)
Then, as θcca obeys (4.6b), we obtain

θcca = (Y⊤Y)⁻¹Y⊤Xβ / ( n⁻¹β⊤X⊤Y(Y⊤Y)⁻¹Y⊤Xβ )^{1/2}    (4.11)
leading to the following expression of the optimal objective function:

n⁻¹θcca⊤Y⊤Xβ = n⁻¹β⊤X⊤Y(Y⊤Y)⁻¹Y⊤Xβ / ( n⁻¹β⊤X⊤Y(Y⊤Y)⁻¹Y⊤Xβ )^{1/2}
             = ( n⁻¹β⊤X⊤Y(Y⊤Y)⁻¹Y⊤Xβ )^{1/2}
The p-CCA problem can thus be solved with respect to β by plugging this value into (4.6):

max_{β∈ℝᵖ}  n⁻¹βᵀXᵀY(YᵀY)⁻¹YᵀXβ   (4.12a)
s.t.  n⁻¹βᵀ(XᵀX + Ω)β = 1   (4.12b)
where the positive objective function has been squared compared to (4.6). This formulation is important since it will be used to link p-CCA to p-LDA. We thus derive its solution: following the reasoning of Appendix C, β_cca verifies

n⁻¹XᵀY(YᵀY)⁻¹YᵀXβ_cca = λ(XᵀX + Ω)β_cca   (4.13)

where λ is the maximal eigenvalue, shown below to be equal to α²:

n⁻¹β_ccaᵀXᵀY(YᵀY)⁻¹YᵀXβ_cca = λ
⇒ n⁻¹α⁻¹ β_ccaᵀXᵀY(YᵀY)⁻¹YᵀX(XᵀX + Ω)⁻¹XᵀYθ = λ
⇒ n⁻¹α β_ccaᵀXᵀYθ = λ
⇒ n⁻¹θᵀYᵀX(XᵀX + Ω)⁻¹XᵀYθ = λ
⇒ α² = λ
The first line is obtained from constraint (4.12b); the second line follows from relationship (4.7), whose denominator is α; the third line comes from (4.4); the fourth line uses relationship (4.7) again; and the last one uses the definition of α in (4.5).
4.1 From Optimal Scoring to Linear Discriminant Analysis
4.1.3 Penalized Linear Discriminant Analysis
Still following Hastie et al. (1995), penalized Linear Discriminant Analysis (p-LDA) is defined as follows:

max_{β∈ℝᵖ}  βᵀΣ_B β   (4.14a)
s.t.  βᵀ(Σ_W + n⁻¹Ω)β = 1   (4.14b)
where Σ_B and Σ_W are respectively the sample between-class and within-class covariance matrices of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices admit a simple matrix representation using the projection operator Y(YᵀY)⁻¹Yᵀ:
Σ_T = n⁻¹ Σᵢ₌₁ⁿ xᵢxᵢᵀ = n⁻¹XᵀX

Σ_B = n⁻¹ Σₖ₌₁ᴷ nₖ μₖμₖᵀ = n⁻¹XᵀY(YᵀY)⁻¹YᵀX

Σ_W = n⁻¹ Σₖ₌₁ᴷ Σ_{i: y_ik=1} (xᵢ − μₖ)(xᵢ − μₖ)ᵀ = n⁻¹(XᵀX − XᵀY(YᵀY)⁻¹YᵀX)
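These matrix identities are easy to check numerically. The sketch below (Python/NumPy, with synthetic data; the variable names are ours, not from the GLOSS package) verifies that the projector form reproduces the decomposition Σ_T = Σ_B + Σ_W:

```python
import numpy as np

n, p, K = 30, 4, 3
rng = np.random.default_rng(0)
y = np.arange(n) % K                     # balanced classes, none empty
Y = np.eye(K)[y]                         # n x K class-indicator matrix
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                      # X is assumed centered

# Projection operator Y (Y'Y)^{-1} Y' onto the class indicators.
P = Y @ np.linalg.solve(Y.T @ Y, Y.T)
Sigma_T = X.T @ X / n                    # total covariance
Sigma_B = X.T @ P @ X / n                # between-class covariance
Sigma_W = X.T @ (np.eye(n) - P) @ X / n  # within-class covariance

# Sanity check: the total covariance decomposes as Sigma_B + Sigma_W.
assert np.allclose(Sigma_T, Sigma_B + Sigma_W)
```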
Using these formulae, the solution to the p-LDA problem (4.14) is obtained as

XᵀY(YᵀY)⁻¹YᵀXβ_lda = λ(XᵀX + Ω − XᵀY(YᵀY)⁻¹YᵀX)β_lda

XᵀY(YᵀY)⁻¹YᵀXβ_lda = (λ/(1 − λ))(XᵀX + Ω)β_lda
The comparison of the last equation with (4.13) shows that β_lda and β_cca are proportional, and that λ/(1 − λ) = α². Using constraints (4.12b) and (4.14b), it comes that

β_lda = (1 − α²)^(−1/2) β_cca = α⁻¹(1 − α²)^(−1/2) β_os

which ends the path from p-OS to p-LDA.
4.1.4 Summary
The three previous subsections considered a generic form of the kth problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:

min_{Θ,B}  ‖YΘ − XB‖²_F + λ tr(BᵀΩB)   s.t.  n⁻¹ΘᵀYᵀYΘ = I_{K−1}

Let A represent the (K−1)×(K−1) diagonal matrix with elements αₖ, the square root of the kth largest eigenvalue of YᵀX(XᵀX + Ω)⁻¹XᵀY; we have

B_LDA = B_CCA (I_{K−1} − A²)^(−1/2) = B_OS A⁻¹(I_{K−1} − A²)^(−1/2)   (4.15)
where I_{K−1} is the (K−1)×(K−1) identity matrix.

At this point, the feature matrix X, of dimensions n×p in the input space, can be projected into the optimal scoring domain as an n×(K−1) matrix X_OS = XB_OS, or into the linear discriminant analysis space as an n×(K−1) matrix X_LDA = XB_LDA. Classification can be performed in any of those domains, provided the appropriate distance (based on the penalized within-class covariance matrix) is applied.
With the aim of performing classification, the whole process can be summarized as follows:

1. Solve the p-OS problem as

   B_OS = (XᵀX + λΩ)⁻¹XᵀYΘ

   where Θ are the K−1 leading eigenvectors of YᵀX(XᵀX + λΩ)⁻¹XᵀY.

2. Translate the data samples X into the LDA domain as X_LDA = XB_OS D, where D = A⁻¹(I_{K−1} − A²)^(−1/2).

3. Compute the matrix M of centroids μₖ from X_LDA and Y.

4. Evaluate the distances d(x, μₖ) in the LDA domain as a function of M and X_LDA.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Optionally, represent the data graphically.
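The steps above can be sketched end to end. The following Python/NumPy fragment is an illustration only, with a plain ridge penalty Ω = I and synthetic data (none of these names come from the GLOSS package); SciPy's generalized symmetric eigensolver enforces the score constraint n⁻¹ΘᵀYᵀYΘ = I:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
n, p, K, lam = 60, 5, 3, 0.1
y = np.arange(n) % K                    # balanced synthetic labels
Y = np.eye(K)[y]                        # class-indicator matrix
X = rng.standard_normal((n, p)) + 0.8 * y[:, None]  # class-shifted features
X -= X.mean(axis=0)                     # X must be centered

Omega = np.eye(p)                       # ridge penalty, for illustration only
G = np.linalg.solve(X.T @ X + lam * Omega, X.T @ Y)
M = Y.T @ X @ G                         # Y'X (X'X + lam*Omega)^{-1} X'Y

# Step 1: scores = leading eigenvectors under n^{-1} Theta' Y'Y Theta = I.
alpha2, Theta = eigh(M / n, Y.T @ Y / n)
order = np.argsort(alpha2)[::-1][:K - 1]
alpha2, Theta = alpha2[order], Theta[:, order]
B_os = G @ Theta                        # p-OS regression coefficients

# Step 2: map to the LDA domain with D = A^{-1} (I - A^2)^{-1/2}.
alpha = np.sqrt(np.clip(alpha2, 1e-12, 1 - 1e-12))
D = np.diag(1.0 / (alpha * np.sqrt(1.0 - alpha ** 2)))
X_lda = X @ B_os @ D

# Steps 3-5: centroids, distances with the class-size adjustment, MAP rule.
centroids = np.vstack([X_lda[y == k].mean(axis=0) for k in range(K)])
prior = np.bincount(y) / n
d2 = ((X_lda[:, None, :] - centroids[None, :, :]) ** 2).sum(-1) \
     - 2 * np.log(prior)
y_hat = d2.argmin(axis=1)
```

On this toy data the nearest-centroid rule in the LDA domain recovers most training labels; with a sparsity-inducing penalty, Ω would instead be the adaptive diagonal matrix of Section 4.3.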
The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Sections 4.2.1, 4.2.2 and 4.2.3, respectively.
4.2 Practicalities
4.2.1 Solution of the Penalized Optimal Scoring Regression
Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:

min_{Θ∈ℝ^{K×(K−1)}, B∈ℝ^{p×(K−1)}}  ‖YΘ − XB‖²_F + λ tr(BᵀΩB)   (4.16a)
s.t.  n⁻¹ΘᵀYᵀYΘ = I_{K−1}   (4.16b)

where Θ are the class scores, B the regression coefficients, and ‖·‖_F is the Frobenius norm.
Though non-convex, the OS problem is readily solved by a decomposition in Θ and B: the optimal B_OS does not intervene in the optimality conditions with respect to Θ, and the optimum with respect to B is obtained in closed form, as a linear combination of the optimal scores Θ (Hastie et al., 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize Θ to Θ⁰ such that n⁻¹Θ⁰ᵀYᵀYΘ⁰ = I_{K−1}.

2. Compute B = (XᵀX + λΩ)⁻¹XᵀYΘ⁰.

3. Set Θ to be the K−1 leading eigenvectors of YᵀX(XᵀX + λΩ)⁻¹XᵀY.

4. Compute the optimal regression coefficients

   B_OS = (XᵀX + λΩ)⁻¹XᵀYΘ   (4.17)
Defining Θ⁰ in Step 1, instead of using Θ directly as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on Θ⁰ᵀYᵀX(XᵀX + λΩ)⁻¹XᵀYΘ⁰, which is computed as Θ⁰ᵀYᵀXB, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.
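A minimal sketch of this four-step procedure (Python/NumPy, quadratic penalty, synthetic data; a hedged illustration, not the GLOSS implementation) makes the role of Θ⁰ explicit — the eigen-analysis only ever touches the small (K−1)×(K−1) matrix Θ⁰ᵀYᵀXB⁰:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, K, lam = 40, 6, 3, 0.5
y = np.arange(n) % K
Y = np.eye(K)[y]
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
Omega = np.eye(p)                      # quadratic penalty, for illustration

# Step 1: Theta0 with n^{-1} Theta0' Y'Y Theta0 = I, range orthogonal to 1_K.
Q, _ = np.linalg.qr(np.column_stack([np.ones(K),
                                     rng.standard_normal((K, K - 1))]))
U = Q[:, 1:]                           # orthonormal columns, orthogonal to 1_K
counts = np.diag(Y.T @ Y)
Theta0 = np.sqrt(n) * (U / np.sqrt(counts)[:, None])   # (Y'Y)^{-1/2} U scaled
assert np.allclose(Theta0.T @ (Y.T @ Y) @ Theta0 / n, np.eye(K - 1))

# Step 2: penalized regression of the initial scores.
B0 = np.linalg.solve(X.T @ X + lam * Omega, X.T @ Y @ Theta0)

# Step 3: eigen-analysis of the small (K-1)x(K-1) matrix Theta0' Y'X B0;
# no explicit inverse of X'X + lam*Omega is ever formed.
S = Theta0.T @ Y.T @ X @ B0
evals, V = np.linalg.eigh((S + S.T) / 2)    # S is symmetric; symmetrize anyway
V = V[:, np.argsort(evals)[::-1]]

# Step 4: rotate to obtain the optimal scores and regression coefficients.
Theta = Theta0 @ V
B_os = B0 @ V
assert np.allclose(Theta.T @ (Y.T @ Y) @ Theta / n, np.eye(K - 1))
```

The rotation by V preserves the constraint, so Θ satisfies (4.16b) by construction.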
This four-step algorithm is valid when the penalty is of the form tr(BᵀΩB). However, when an L1 penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and Θ. That situation is developed by Clemmensen et al. (2011), where
a Lasso or an Elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and Elastic net penalties do not enjoy the equivalence with LDA problems.
4.2.2 Distance Evaluation
The simplest classification rule is the nearest-centroid rule, where sample xᵢ is assigned to class k if xᵢ is closer (in terms of the shared within-class Mahalanobis distance) to centroid μₖ than to any other centroid μ_ℓ. In general, the parameters of the model are unknown and the rule is applied with parameters estimated from training data (the sample estimators μₖ and Σ_W). If μₖ are the centroids in the input space, sample xᵢ is assigned to the class k for which the distance

d(xᵢ, μₖ) = (xᵢ − μₖ)ᵀ Σ_WΩ⁻¹ (xᵢ − μₖ) − 2 log(nₖ/n)   (4.18)

is minimal. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment for unequal class sizes that estimates the prior probability of class k. Note that this is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al., 2009; Mai et al., 2012). The matrix Σ_WΩ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:

Σ_WΩ⁻¹ = (n⁻¹(XᵀX + λΩ) − Σ_B)⁻¹
       = (n⁻¹XᵀX − Σ_B + n⁻¹λΩ)⁻¹
       = (Σ_W + n⁻¹λΩ)⁻¹   (4.19)
Before explaining how to compute the distances, let us summarize some clarifying points:

• The solution B_OS of the p-OS problem is enough to accomplish classification.

• In the LDA domain (the space of discriminant variates X_LDA), classification is based on Euclidean distances.

• Classification can be done in a reduced-rank space of dimension R < K−1 by using the first R discriminant directions {βₖ}ₖ₌₁ᴿ.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, it reads

‖(xᵢ − μₖ)B_OS‖²_{Σ_WΩ} − 2 log(πₖ)

where πₖ is the estimated class prior and ‖·‖_S is the Mahalanobis distance assuming within-class covariance S. If classification is done in the p-LDA domain, it reads

‖(xᵢ − μₖ)B_OS A⁻¹(I_{K−1} − A²)^(−1/2)‖²₂ − 2 log(πₖ)

which involves a plain Euclidean distance.
4.2.3 Posterior Probability Evaluation
Let d(x, μₖ) be the distance between x and μₖ defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities p(yₖ = 1|x) can be estimated as

p(yₖ = 1|x) ∝ exp(−d(x, μₖ)/2)
           ∝ πₖ exp(−½ ‖(x − μₖ)B_OS A⁻¹(I_{K−1} − A²)^(−1/2)‖²₂)   (4.20)
These probabilities must be normalized to ensure that they sum to one. When the distances d(x, μₖ) take large values, exp(−d(x, μₖ)/2) can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:

p(yₖ = 1|x) = πₖ exp(−d(x, μₖ)/2) / Σ_ℓ π_ℓ exp(−d(x, μ_ℓ)/2)
            = πₖ exp((−d(x, μₖ) + d_max)/2) / Σ_ℓ π_ℓ exp((−d(x, μ_ℓ) + d_max)/2)

where d_max = maxₖ d(x, μₖ).
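A sketch of this normalization in Python/NumPy. Since any per-sample constant cancels in the ratio, the version below shifts by the smallest distance (rather than d_max as in the text), so that all exponents stay nonpositive:

```python
import numpy as np

def posteriors(d, prior):
    """Turn squared distances d (n x K) and class priors into normalized
    posterior probabilities, shifting exponents by a per-sample constant
    to avoid underflow when every exp(-d/2) would round to zero."""
    shift = d.min(axis=1, keepdims=True)    # any per-row constant cancels
    num = prior * np.exp(-(d - shift) / 2)
    return num / num.sum(axis=1, keepdims=True)

# Naive exp(-d/2) underflows to zero here, yielding 0/0; the shift fixes it.
d = np.array([[1500.0, 1502.0, 1510.0]])
prior = np.array([0.5, 0.3, 0.2])
post = posteriors(d, prior)
```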
4.2.4 Graphical Representation
Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. This can be accomplished by plotting the first two or three dimensions of the regression fits X_OS or of the discriminant variates X_LDA, depending on whether we present the dataset in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.
4.3 From Sparse Optimal Scoring to Sparse LDA
The equivalence stated in Section 4.1 holds for quadratic penalties of the form βᵀΩβ, under the assumption that YᵀY and XᵀX + λΩ are full rank (fulfilled when there are no empty classes and Ω is positive definite). Quadratic penalties have interesting properties but, as recalled in Section 2.3, they do not induce sparsity. In this respect, L1 penalties are preferable, but they lack a connection such as the one stated by Hastie et al. (1995) between p-LDA and p-OS.
In this section we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see
Section 2.3.4) that induces groups of zeros in the coefficients corresponding to the same feature across all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form tr(BᵀΩB).
4.3.1 A Quadratic Variational Form
Quadratic variational forms of the Lasso and group-Lasso were proposed shortly after the original Lasso paper of Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet, 1998; Canu and Grandvalet, 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al., 2012).
Our formulation of the group-Lasso is shown below:

min_{τ∈ℝᵖ} min_{B∈ℝ^{p×(K−1)}}  J(B) + λ Σⱼ wⱼ² ‖βʲ‖²₂ / τⱼ   (4.21a)
s.t.  Σⱼ τⱼ − Σⱼ wⱼ‖βʲ‖₂ ≤ 0   (4.21b)
      τⱼ ≥ 0,  j = 1, …, p   (4.21c)

where B ∈ ℝ^{p×(K−1)} is a matrix composed of row vectors βʲ ∈ ℝ^{K−1}, B = (β¹ᵀ, …, βᵖᵀ)ᵀ, and the wⱼ are predefined nonnegative weights. The cost function J(B) in our context is the OS regression loss ‖YΘ − XB‖²₂; from now on, for the sake of simplicity, we simply write J(B). Here and in what follows, b/τ is defined by continuation at zero: b/0 = +∞ if b ≠ 0 and 0/0 = 0. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet, 1999; Bach et al., 2012, and references therein).
The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic expression as the convex hull of a family of quadratic penalties indexed by the variables τⱼ. This is shown graphically in Figure 4.1.
Let us start by proving the equivalence of our variational formulation with the standard group-Lasso (an alternative variational formulation is detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty in βʲ in (4.21) acts as the group-Lasso penalty λ Σⱼ wⱼ‖βʲ‖₂.
Proof. The Lagrangian of Problem (4.21) is

L = J(B) + λ Σⱼ wⱼ² ‖βʲ‖²₂ / τⱼ + ν₀ (Σⱼ τⱼ − Σⱼ wⱼ‖βʲ‖₂) − Σⱼ νⱼτⱼ
Figure 4.1: Graphical representation of the variational approach to the group-Lasso.
Thus the first-order optimality conditions for τⱼ are

∂L/∂τⱼ (τⱼ*) = 0 ⇔ −λwⱼ²‖βʲ‖²₂ / τⱼ*² + ν₀ − νⱼ = 0
              ⇔ −λwⱼ²‖βʲ‖²₂ + ν₀τⱼ*² − νⱼτⱼ*² = 0
              ⇒ −λwⱼ²‖βʲ‖²₂ + ν₀τⱼ*² = 0

The last line is obtained from complementary slackness, which implies here νⱼτⱼ* = 0 (complementary slackness states that νⱼ gⱼ(τⱼ*) = 0, where νⱼ is the Lagrange multiplier for constraint gⱼ(τⱼ) ≤ 0). As a result, the optimal value of τⱼ is

τⱼ* = √(λ wⱼ²‖βʲ‖²₂ / ν₀) = √(λ/ν₀) wⱼ‖βʲ‖₂   (4.22)
We note that ν₀ ≠ 0 if there is at least one coefficient βⱼₖ ≠ 0; thus the inequality constraint (4.21b) is at the bound (due to complementary slackness):

Σⱼ τⱼ* − Σⱼ wⱼ‖βʲ‖₂ = 0   (4.23)

so that τⱼ* = wⱼ‖βʲ‖₂. Using this value in (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso:

min_{B∈ℝ^{p×(K−1)}}  J(B) + λ Σⱼ wⱼ‖βʲ‖₂   (4.24)
We have thus presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.
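The equivalence can be checked numerically: at τⱼ = wⱼ‖βʲ‖₂ the variational penalty coincides with the group-Lasso penalty, and any other feasible τ can only increase it. A hedged sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(3)
p, Km1, lam = 5, 2, 0.7
B = rng.standard_normal((p, Km1))        # row vectors beta^j
w = rng.uniform(0.5, 2.0, size=p)        # nonnegative weights

# Standard group-Lasso penalty: lam * sum_j w_j ||beta^j||_2.
row_norms = np.linalg.norm(B, axis=1)
gl = lam * np.sum(w * row_norms)

# Variational form at its optimum tau_j = w_j ||beta^j||_2.
tau = w * row_norms
var = lam * np.sum(w ** 2 * row_norms ** 2 / tau)
assert np.allclose(gl, var)

# Any other feasible tau (same total budget) gives a larger penalty value.
tau_bad = np.full(p, tau.sum() / p)
var_bad = lam * np.sum(w ** 2 * row_norms ** 2 / tau_bad)
assert var_bad >= var - 1e-12
```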
With Lemma 4.1 we have proved that, under constraints (4.21b)–(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently presented as λ tr(BᵀΩB), where

Ω = diag(w₁²/τ₁, w₂²/τ₂, …, w_p²/τ_p)   (4.25)

with τⱼ = wⱼ‖βʲ‖₂, resulting in the diagonal components

(Ω)ⱼⱼ = wⱼ / ‖βʲ‖₂   (4.26)
As stated at the beginning of this section, the equivalence between p-LDA problems and p-OS problems is thus demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set algorithm described in Chapter 5.

The first property states that the quadratic formulation is convex when J is convex, thus providing an easy control of optimality and convergence.
Lemma 4.2. If J is convex, Problem (4.21) is convex.
Proof. The function g(β, τ) = ‖β‖²₂/τ, known as the perspective function of f(β) = ‖β‖²₂, is jointly convex in (β, τ) (see e.g. Boyd and Vandenberghe, 2004, Chapter 3), and the constraints (4.21b)–(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to (B, τ).
In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.
Lemma 4.3. For all B ∈ ℝ^{p×(K−1)}, the subdifferential of the objective function of Problem (4.24) is

{V ∈ ℝ^{p×(K−1)} : V = ∂J(B)/∂B + λG}   (4.27)

where G ∈ ℝ^{p×(K−1)} is a matrix composed of row vectors gʲ ∈ ℝ^{K−1}, G = (g¹ᵀ, …, gᵖᵀ)ᵀ, defined as follows. Let S(B) denote the row support of B, S(B) = {j ∈ {1, …, p} : ‖βʲ‖₂ ≠ 0}; then we have

∀j ∈ S(B),  gʲ = wⱼ‖βʲ‖₂⁻¹ βʲ   (4.28)
∀j ∉ S(B),  ‖gʲ‖₂ ≤ wⱼ   (4.29)
This condition results in an equality for the "active" non-zero vectors βʲ and in an inequality for the other ones; both provide essential building blocks of our algorithm.
Proof. When ‖βʲ‖₂ ≠ 0, the gradient of the penalty with respect to βʲ is

∂(λ Σₘ wₘ‖βᵐ‖₂)/∂βʲ = λwⱼ βʲ/‖βʲ‖₂   (4.30)

At ‖βʲ‖₂ = 0, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al., 2011):

∂_βʲ (λ Σₘ wₘ‖βᵐ‖₂) = ∂_βʲ (λwⱼ‖βʲ‖₂) = {λwⱼ v : v ∈ ℝ^{K−1}, ‖v‖₂ ≤ 1}   (4.31)

This gives expression (4.29).
Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points B of the objective function verifying the following conditions are global minima:

∀j ∈ S,  ∂J(B)/∂βʲ + λwⱼ‖βʲ‖₂⁻¹ βʲ = 0   (4.32a)
∀j ∉ S,  ‖∂J(B)/∂βʲ‖₂ ≤ λwⱼ   (4.32b)

where S ⊆ {1, …, p} denotes the set of non-zero row vectors βʲ and S̄ its complement.
Lemma 4.4 provides a simple characterization of the support of the solution, which would not be as easily obtained from a direct analysis of the variational problem (4.21).
4.3.2 Group-Lasso OS as Penalized LDA
With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem

B_OS = argmin_{B∈ℝ^{p×(K−1)}} min_{Θ∈ℝ^{K×(K−1)}}  ½‖YΘ − XB‖²_F + λ Σⱼ wⱼ‖βʲ‖₂
s.t.  n⁻¹ΘᵀYᵀYΘ = I_{K−1}

is equivalent to the penalized LDA problem

B_LDA = argmax_{B∈ℝ^{p×(K−1)}}  tr(BᵀΣ_B B)   s.t.  Bᵀ(Σ_W + n⁻¹λΩ)B = I_{K−1}

where Ω = diag(w₁²/τ₁, …, w_p²/τ_p), with

Ωⱼⱼ = +∞ if βʲ_os = 0,  and  Ωⱼⱼ = wⱼ‖βʲ_os‖₂⁻¹ otherwise   (4.33)

That is, B_LDA = B_OS diag(αₖ⁻¹(1 − αₖ²)^(−1/2)), where αₖ ∈ (0, 1) and αₖ² is the kth largest eigenvalue of

n⁻¹YᵀX(XᵀX + λΩ)⁻¹XᵀY
Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.
The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al., 2008; Clemmensen et al., 2011) for K = 2, that is, for binary classification, or more generally for a single discriminant direction. Note however that it leads to a slightly different decision rule if the decision threshold is chosen a priori according to the Gaussian assumption on the features. For more than one discriminant direction, the equivalence does not hold anymore, since the Lasso penalty does not result in an equivalent quadratic penalty of the simple form tr(BᵀΩB).
5 GLOSS Algorithm
The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems whose sizes are incrementally increased or decreased (Osborne et al., 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer, 2008). We adapt this algorithmic framework to the variational form (4.21), with J(B) = ½‖YΘ − XB‖²₂.
The algorithm belongs to the working-set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables, currently identified as non-zero. Then it iterates the three steps summarized below:

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth. First the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more βʲ may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to the inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. Otherwise, the variable corresponding to the greatest violation is added to the active set.
This mechanism is represented graphically in Figure 5.1 as a block diagram, and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach from Appendix D, we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.
5.1 Regression Coefficients Updates
Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving K−1 independent card(A)-dimensional problems instead of a single (K−1)×card(A)-dimensional problem. The interaction between the K−1 problems is relegated to the common adaptive quadratic penalty Ω. This decomposition is especially attractive, as we then solve K−1 similar systems

(X_AᵀX_A + λΩ)βₖ = X_AᵀYθ⁰ₖ   (5.1)
[Figure 5.1: GLOSS block diagram. The algorithm initializes the model (λ, B) and the active set {j : ‖βʲ‖₂ > 0}; solves the p-OS problem so that B satisfies the first optimality condition; moves out of the active set any active variable that must become inactive; tests the second optimality condition on the inactive set and moves into the active set any inactive variable that violates it; and, once no move is needed, computes Θ, updates B and terminates.]
Algorithm 1: Adaptively Penalized Optimal Scoring

Input: X, Y, B, λ
Initialize A ← {j ∈ {1, …, p} : ‖βʲ‖₂ > 0},
           Θ⁰ such that n⁻¹Θ⁰ᵀYᵀYΘ⁰ = I_{K−1},
           convergence ← false
repeat
    % Step 1: solve (4.21) in B, assuming A optimal
    repeat
        Ω ← diag(Ω_A), with ωⱼ ← ‖βʲ‖₂⁻¹
        B_A ← (X_AᵀX_A + λΩ)⁻¹X_AᵀYΘ⁰
    until condition (4.32a) holds for all j ∈ A
    % Step 2: identify inactivated variables
    for j ∈ A such that ‖βʲ‖₂ = 0 do
        if optimality condition (4.32b) holds then
            A ← A \ {j}; go back to Step 1
        end if
    end for
    % Step 3: check the greatest violation of optimality condition (4.32b) in Ā
    ĵ ← argmax_{j∈Ā} ‖∂J/∂βʲ‖₂ − λwⱼ
    if ‖∂J/∂β^ĵ‖₂ ≤ λw_ĵ then
        convergence ← true   % B is optimal
    else
        A ← A ∪ {ĵ}
    end if
until convergence
(s, V) ← eigenanalyze(Θ⁰ᵀYᵀX_A B_A), that is,
         Θ⁰ᵀYᵀX_A B_A vₖ = sₖ vₖ,  k = 1, …, K−1
Θ ← Θ⁰V;  B ← BV;  αₖ ← n^(−1/2) sₖ^(1/2),  k = 1, …, K−1
Output: Θ, B, α
where X_A denotes the columns of X indexed by A, and βₖ and θ⁰ₖ denote the kth columns of B and Θ⁰, respectively. These linear systems only differ in their right-hand-side term, so that a single Cholesky decomposition is necessary to solve all the systems, whereas a blockwise Newton–Raphson method based on the standard group-Lasso formulation would result in different "penalties" Ω for each system.
5.1.1 Cholesky Decomposition
Dropping the subscripts and considering the K−1 systems together, (5.1) leads to

(XᵀX + λΩ)B = XᵀYΘ   (5.2)

Defining the Cholesky decomposition CᵀC = XᵀX + λΩ, (5.2) is solved efficiently as follows:

CᵀCB = XᵀYΘ
CB = Cᵀ\(XᵀYΘ)
B = C\(Cᵀ\(XᵀYΘ))   (5.3)

where the symbol "\" is the matlab mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
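In Python, the same factor-once/solve-many pattern can be written with SciPy's Cholesky helpers (a sketch with random data standing in for X_A and YΘ⁰; not the GLOSS code itself):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(4)
n, p, Km1, lam = 50, 8, 2, 0.3
X = rng.standard_normal((n, p))
YTheta = rng.standard_normal((n, Km1))   # stands in for Y @ Theta0
omega = rng.uniform(0.5, 3.0, size=p)    # current diagonal penalty entries
A = X.T @ X + lam * np.diag(omega)       # positive definite since n > p

# One Cholesky factorization serves all K-1 right-hand sides at once,
# mirroring B = C \ (C' \ (X'Y*Theta)) in the matlab implementation.
c, low = cho_factor(A)
B = cho_solve((c, low), X.T @ YTheta)

assert np.allclose(A @ B, X.T @ YTheta)
```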
5.1.2 Numerical Stability
The OS regression coefficients are obtained by (5.2), where the penalizer Ω is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of Ω reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of XᵀX + λΩ. This difficulty can be avoided using the following equivalent expression:

B = Ω^(−1/2)(Ω^(−1/2)XᵀXΩ^(−1/2) + λI)⁻¹Ω^(−1/2)XᵀYΘ⁰   (5.4)

where the conditioning of Ω^(−1/2)XᵀXΩ^(−1/2) + λI is always well-behaved, provided X is appropriately normalized (recall that 0 ≤ 1/ωⱼ ≤ 1). This stabler expression demands more computation and is thus reserved for cases with large ωⱼ values; our code is otherwise based on expression (5.2).
5.2 Score Matrix
The optimal score matrix Θ is made of the K−1 leading eigenvectors of YᵀX(XᵀX + Ω)⁻¹XᵀY. This eigen-analysis is actually solved in the form ΘᵀYᵀX(XᵀX + Ω)⁻¹XᵀYΘ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of (XᵀX + Ω)⁻¹, which involves the inversion of a p×p matrix. Let Θ⁰ be an arbitrary K×(K−1) matrix whose range includes the K−1 leading eigenvectors of YᵀX(XᵀX + Ω)⁻¹XᵀY.¹ Then, solving the K−1 systems (5.3) provides the value of B⁰ = (XᵀX + λΩ)⁻¹XᵀYΘ⁰. This B⁰ matrix can be identified in the expression to eigen-analyze, as

Θ⁰ᵀYᵀX(XᵀX + Ω)⁻¹XᵀYΘ⁰ = Θ⁰ᵀYᵀXB⁰

Thus the solution to the penalized OS problem can be computed through the singular value decomposition of the (K−1)×(K−1) matrix Θ⁰ᵀYᵀXB⁰ = VΛVᵀ. Defining Θ = Θ⁰V, we have ΘᵀYᵀX(XᵀX + Ω)⁻¹XᵀYΘ = Λ, and when Θ⁰ is chosen such that n⁻¹Θ⁰ᵀYᵀYΘ⁰ = I_{K−1}, we also have n⁻¹ΘᵀYᵀYΘ = I_{K−1}, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of Λ are sorted in decreasing order, Θ is an optimal solution to the p-OS problem. Finally, once Θ has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered using the mapping from Θ⁰ to Θ, that is, B = B⁰V. Appendix E details why this computational trick, described here for quadratic penalties, can be applied to the group-Lasso, for which Ω is defined by a variational formulation.
5.3 Optimality Conditions
GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix Θ. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4; optimality conditions (4.32a) and (4.32b) can be deduced from those lemmas. Both expressions require the gradient of the objective function

½‖YΘ − XB‖²₂ + λ Σⱼ wⱼ‖βʲ‖₂   (5.5)

Let J(B) be the data-fitting term ½‖YΘ − XB‖²₂. Its gradient with respect to the jth row of B, βʲ, is the (K−1)-dimensional vector

∂J(B)/∂βʲ = xⱼᵀ(XB − YΘ)

where xⱼ is the jth column of X. Hence the first optimality condition (4.32a) can be computed for every active variable j as

xⱼᵀ(XB − YΘ) + λwⱼ βʲ/‖βʲ‖₂ = 0
¹ As X is centered, 1_K belongs to the null space of YᵀX(XᵀX + Ω)⁻¹XᵀY. It is thus sufficient to choose Θ⁰ orthogonal to 1_K to ensure that its range spans the leading eigenvectors of YᵀX(XᵀX + Ω)⁻¹XᵀY. In practice, to comply with this desideratum and with conditions (3.5b) and (3.5c), we set Θ⁰ = (YᵀY)^(−1/2)U, where U is a K×(K−1) matrix whose columns are orthonormal vectors orthogonal to 1_K.
The second optimality condition (4.32b) can be computed for every inactive variable j as

‖xⱼᵀ(XB − YΘ)‖₂ ≤ λwⱼ
5.4 Active and Inactive Sets
The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set, containing the variables that have already been considered relevant. A variable j can be considered for inclusion in the active set if it violates the second optimality condition. We proceed one variable at a time, choosing the one that is expected to produce the greatest decrease in the objective function:

ĵ = argmaxⱼ max(‖xⱼᵀ(XB − YΘ)‖₂ − λwⱼ, 0)

The exclusion of a variable belonging to the active set A is considered if the norm ‖βʲ‖₂ is small and if, after setting βʲ to zero, the following optimality condition holds:

‖xⱼᵀ(XB − YΘ)‖₂ ≤ λwⱼ

The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.
5.5 Penalty Parameter
The penalty parameter can be specified by the user, in which case GLOSS solves the problem for this value of λ. The other strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value of the penalty parameter, λ_max, such that B ≠ 0, and solves the p-OS problem for decreasing values of λ, until a prescribed number of features are declared active.
The maximum value of the penalty parameter, λ_max, corresponding to a null B matrix, is obtained by evaluating the optimality condition (4.32b) at B = 0:

λ_max = max_{j∈{1,…,p}} (1/wⱼ) ‖xⱼᵀYΘ⁰‖₂

The algorithm then computes a series of solutions along the regularization path defined by a series of penalties λ₁ = λ_max > ⋯ > λₜ > ⋯ > λ_T = λ_min ≥ 0, by regularly decreasing the penalty, λₜ₊₁ = λₜ/2, and using a warm-start strategy where the feasible initial guess for B(λₜ₊₁) is initialized with B(λₜ). The final penalty parameter, λ_min, is determined during the optimization process, when the maximum number of desired active variables is attained (by default, the minimum of n and p).
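A sketch of the computation of λ_max and of the halving schedule (Python/NumPy, with a random matrix standing in for YΘ⁰ and unit weights; an illustration, not the GLOSS code):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, Km1 = 40, 10, 2
X = rng.standard_normal((n, p))
YTheta0 = rng.standard_normal((n, Km1))      # stands in for Y @ Theta0
w = np.ones(p)                               # penalty weights

# Smallest penalty for which B = 0 is optimal: condition (4.32b) at B = 0.
corr = np.linalg.norm(X.T @ YTheta0, axis=1)  # ||x_j' Y Theta0||_2 per feature
lam_max = np.max(corr / w)

# At lam_max, no feature violates the optimality condition of the null model.
assert np.all(corr <= lam_max * w + 1e-12)

# Halving schedule lam_1 = lam_max > ... along the regularization path;
# each solve would be warm-started from the previous B.
lams = lam_max / 2.0 ** np.arange(8)
```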
5.6 Options and Variants
5.6.1 Scaling Variables
As most penalization schemes, GLOSS is sensitive to the scaling of the variables. It thus makes sense to normalize them before applying the algorithm or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.
5.6.2 Sparse Variant
This variant replaces some matlab commands used in the standard version of GLOSS by their sparse equivalents. In addition, some data structures are adapted for sparse computation.
5.6.3 Diagonal Variant
We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis: Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small-sample-size regimes, even if variables are correlated.
The equivalence proof between penalized OS and penalized LDA (Hastie et al., 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS, corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems

min_{B∈ℝ^{p×(K−1)}} ‖YΘ − XB‖²_F = min_{B∈ℝ^{p×(K−1)}} tr(ΘᵀYᵀYΘ − 2ΘᵀYᵀXB + nBᵀΣ_T B)

are replaced by

min_{B∈ℝ^{p×(K−1)}} tr(ΘᵀYᵀYΘ − 2ΘᵀYᵀXB + nBᵀ(Σ_B + diag(Σ_W))B)

Note that this variant only requires diag(Σ_W) + Σ_B + n⁻¹Ω to be positive definite, which is a weaker requirement than Σ_T + n⁻¹Ω positive definite.
5.6.4 Elastic Net and Structured Variant
For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition,
Figure 5.2: Neighborhood graph and Laplacian matrix Ω_L for a 3×3 image, with pixels numbered 1–9 (8-connectivity):

    7 8 9
    4 5 6
    1 2 3

Ω_L =
[  3 −1  0 −1 −1  0  0  0  0
  −1  5 −1 −1 −1 −1  0  0  0
   0 −1  3  0 −1 −1  0  0  0
  −1 −1  0  5 −1  0 −1 −1  0
  −1 −1 −1 −1  8 −1 −1 −1 −1
   0 −1 −1  0 −1  5  0 −1 −1
   0  0  0 −1 −1  0  3 −1  0
   0  0  0 −1 −1 −1 −1  5 −1
   0  0  0  0 −1 −1  0 −1  3 ]
for their penalized discriminant analysis model to constrain the discriminant directionsto be spatially smooth
When an image is represented as a vector of pixels it is reasonable to assume posi-tive correlations between the variables corresponding to neighboring pixels Figure 52represents the neighborhood graph of pixels in an 3 times 3 image with the correspondingLaplacian matrix The Laplacian matrix ΩL is semi-positive definite and the penaltyβgtΩLβ favors among vectors of identical L2 norms the ones having similar coeffi-cients in the neighborhoods of the graph For example this penalty is 9 for the vector(1 1 0 1 1 0 0 0 0)gt which is the indicator of the neighbors of pixel 1 and it is 17 forthe vector (minus1 1 0 1 1 0 0 0 0)gt with sign mismatch between pixel 1 and its neighbor-hood
This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty just has to be added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.
6 Experimental Results
This section presents some comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA. Those algorithms are Penalized LDA (PLDA) (Witten and Tibshirani, 2011), which applies a Lasso penalty within Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al., 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing the parsimony capacities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites: PLDA is an R implementation and SLDA is coded in MATLAB. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.
6.1 Normalization
With shrunken estimates, the scaling of features has important outcomes. For the linear discriminants considered here, the two most common normalization strategies consist in setting to ones either the diagonal of the total covariance matrix ΣT, or the diagonal of the within-class covariance matrix ΣW. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing penalties with weights. The latter option is implemented in our MATLAB package.¹
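The rescaling variant of the two strategies can be emulated in a few lines. A minimal sketch (NumPy; the `normalize` helper and its `scheme` argument are our own naming, not the package's API):

```python
import numpy as np

def normalize(X, y=None, scheme="total"):
    """Scale each feature so the chosen covariance diagonal becomes one.

    scheme="total"  : unit diagonal for the total covariance Sigma_T
    scheme="within" : unit diagonal for the within-class covariance Sigma_W
                      (requires the class labels y)
    """
    if scheme == "total":
        scale = X.std(axis=0)                       # sqrt of diag(Sigma_T)
    else:
        # pooled within-class standard deviation: sqrt of diag(Sigma_W)
        resid = np.concatenate([X[y == k] - X[y == k].mean(axis=0)
                                for k in np.unique(y)])
        scale = np.sqrt((resid ** 2).sum(axis=0) / X.shape[0])
    return X / scale
```

With shrinkage penalties, the two choices generally select different variables, since a feature with small within-class but large between-class variance is amplified by the "within" scheme.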
6.2 Decision Thresholds
The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, Chapter 4) suggest investigating other decision thresholds than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.
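For a two-class problem with one-dimensional discriminant scores, the empirical-minimization rule can be sketched as follows (a hypothetical helper of ours, not the thesis code; candidate thresholds are midpoints between consecutive sorted scores):

```python
import numpy as np

def best_threshold(scores, labels):
    """Decision threshold on 1-D discriminant scores that minimizes the
    empirical classification error (labels in {0, 1}; class 1 is predicted
    for scores above the threshold)."""
    s = np.sort(scores)
    # candidate cuts: below all scores, midpoints, above all scores
    cuts = np.concatenate(([s[0] - 1.0], (s[:-1] + s[1:]) / 2.0, [s[-1] + 1.0]))
    errors = [np.mean((scores > c).astype(int) != labels) for c in cuts]
    return cuts[int(np.argmin(errors))]
```

In practice the same selection would be carried out on a validation set, or by cross-validation, rather than on the training scores themselves.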
¹ The GLOSS MATLAB code can be found in the software section of www.hds.utc.fr/~grandval
6.3 Simulated Data
We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split into a training set of size n = 100, a validation set of size 100, and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is:
Simulation 1: Mean shift with independent features. There are four classes. If sample i is in class k, then xi ∼ N(μk, I), where μ1j = 0.7 × 1(1≤j≤25), μ2j = 0.7 × 1(26≤j≤50), μ3j = 0.7 × 1(51≤j≤75), μ4j = 0.7 × 1(76≤j≤100).
Simulation 2: Mean shift with dependent features. There are two classes. If sample i is in class 1, then xi ∼ N(0, Σ), and if i is in class 2, then xi ∼ N(μ, Σ), with μj = 0.6 × 1(j≤200). The covariance structure is block diagonal, with 5 blocks, each of dimension 100 × 100. The blocks have (j, j′) element 0.6^|j−j′|. This covariance structure is intended to mimic gene expression data correlation.
Simulation 3: One-dimensional mean shift with independent features. There are four classes and the features are independent. If sample i is in class k, then Xij ∼ N((k − 1)/3, 1) if j ≤ 100, and Xij ∼ N(0, 1) otherwise.
Simulation 4: Mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then xi ∼ N(μk, I), with mean vectors defined as follows: μ1j ∼ N(0, 0.3²) for j ≤ 25 and μ1j = 0 otherwise; μ2j ∼ N(0, 0.3²) for 26 ≤ j ≤ 50 and μ2j = 0 otherwise; μ3j ∼ N(0, 0.3²) for 51 ≤ j ≤ 75 and μ3j = 0 otherwise; μ4j ∼ N(0, 0.3²) for 76 ≤ j ≤ 100 and μ4j = 0 otherwise.
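As an illustration of the protocol, Simulation 1 can be reproduced in a few lines (NumPy sketch under our own naming; here class sizes are only balanced in expectation, whereas the study draws exactly equal classes):

```python
import numpy as np

def simulation1(n=100, p=500, K=4, shift=0.7, rng=None):
    """Mean-shift data with independent features: class k shifts features
    25*(k-1) .. 25*k - 1 by `shift` (Simulation 1 of the text)."""
    rng = np.random.default_rng(rng)
    y = rng.integers(K, size=n)              # class labels, balanced in expectation
    mu = np.zeros((K, p))
    for k in range(K):
        mu[k, 25 * k: 25 * (k + 1)] = shift  # 25 relevant features per class
    X = rng.standard_normal((n, p)) + mu[y]
    return X, y
```

Only the first 100 variables carry class information; the remaining 400 are pure noise, which is what makes variable selection essential in this regime.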
Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.
The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far away by SLDA, which is the only
Table 6.1: Experimental results for simulated data: averages, with standard deviations, computed over 25 repetitions, of the test error rate, the number of selected variables, and the number of discriminant directions selected on the validation set.
            Err (%)       Var            Dir

Sim 1, K = 4: mean shift, ind. features
  PLDA      12.6 (0.1)    411.7 (3.7)    3.0 (0.0)
  SLDA      31.9 (0.1)    228.0 (0.2)    3.0 (0.0)
  GLOSS     19.9 (0.1)    106.4 (1.3)    3.0 (0.0)
  GLOSS-D   11.2 (0.1)    251.1 (4.1)    3.0 (0.0)

Sim 2, K = 2: mean shift, dependent features
  PLDA       9.0 (0.4)    337.6 (5.7)    1.0 (0.0)
  SLDA      19.3 (0.1)     99.0 (0.0)    1.0 (0.0)
  GLOSS     15.4 (0.1)     39.8 (0.8)    1.0 (0.0)
  GLOSS-D    9.0 (0.0)    203.5 (4.0)    1.0 (0.0)

Sim 3, K = 4: 1D mean shift, ind. features
  PLDA      13.8 (0.6)    161.5 (3.7)    1.0 (0.0)
  SLDA      57.8 (0.2)    152.6 (2.0)    1.9 (0.0)
  GLOSS     31.2 (0.1)    123.8 (1.8)    1.0 (0.0)
  GLOSS-D   18.5 (0.1)    357.5 (2.8)    1.0 (0.0)

Sim 4, K = 4: mean shift, ind. features
  PLDA      60.3 (0.1)    336.0 (5.8)    3.0 (0.0)
  SLDA      65.9 (0.1)    208.8 (1.6)    2.7 (0.0)
  GLOSS     60.7 (0.2)     74.3 (2.2)    2.7 (0.0)
  GLOSS-D   58.8 (0.1)    162.7 (4.9)    2.9 (0.0)
Figure 6.1: TPR versus FPR (in %) for all algorithms and simulations
Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

            Simulation 1    Simulation 2    Simulation 3    Simulation 4
            TPR     FPR     TPR     FPR     TPR     FPR     TPR     FPR
  PLDA      99.0    78.2    96.9    60.3    98.0    15.9    74.3    65.6
  SLDA      73.9    38.5    33.8    16.3    41.6    27.8    50.7    39.5
  GLOSS     64.1    10.6    30.0     4.6    51.1    18.2    26.0    12.1
  GLOSS-D   93.5    39.4    92.1    28.1    95.6    65.5    42.9    29.9
method that does not succeed in uncovering a low-dimensional representation in Simulation 3. The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR): the TPR is the proportion of relevant variables that are selected, and the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 100% and FPR = 0% simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).
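Both rates are straightforward to compute from the index sets of selected and relevant variables (a small helper of ours, not from the thesis code):

```python
def tpr_fpr(selected, relevant, p):
    """True/false positive rates of a variable selection over p variables.

    TPR: fraction of relevant variables that were selected.
    FPR: fraction of irrelevant variables that were selected.
    """
    selected, relevant = set(selected), set(relevant)
    tpr = len(selected & relevant) / len(relevant)
    fpr = len(selected - relevant) / (p - len(relevant))
    return tpr, fpr
```

For instance, with p = 500 and 100 relevant variables, a method that selects 150 variables including all relevant ones reaches TPR = 100% but FPR = 12.5%.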
6.4 Gene Expression Data
We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors. It was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011). The Ramaswamy³ dataset contains 198 exam-
² http://www.broadinstitute.org/cancer/software/genepattern/datasets
³ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736
Table 6.3: Experimental results for gene expression data: averages, with standard deviations, over 10 training/test set splits, of the test error rates and the number of selected variables.
            Err (%)        Var

Nakayama: n = 86, p = 22,283, K = 5
  PLDA      20.95 (1.3)    10478.7 (2116.3)
  SLDA      25.71 (1.7)      252.5 (3.1)
  GLOSS     20.48 (1.4)      129.0 (18.6)

Ramaswamy: n = 198, p = 16,063, K = 14
  PLDA      38.36 (6.0)    14873.5 (720.3)
  SLDA      —              —
  GLOSS     20.61 (6.9)      372.4 (122.1)

Sun: n = 180, p = 54,613, K = 4
  PLDA      33.78 (5.9)    21634.8 (7443.2)
  SLDA      36.22 (6.5)      384.4 (16.5)
  GLOSS     31.77 (4.5)       93.0 (93.6)
ples of 16,063 gene expressions for categorizing 14 classes of cancer. Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.
Each dataset was split into a training set and a test set with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.
Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution, due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.
Finally, Figure 6.2 displays the projection of the observations for the Nakayama and Sun datasets in the first canonical planes estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high collinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.
⁴ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962
[Figure 6.2 contains four scatter plots: columns GLOSS and SLDA, rows Nakayama and Sun, with axes "1st discriminant" and "2nd discriminant". Nakayama classes: 1) Synovial sarcoma; 2) Myxoid liposarcoma; 3) Dedifferentiated liposarcoma; 4) Myxofibrosarcoma; 5) Malignant fibrous histiocytoma. Sun classes: 1) NonTumor; 2) Astrocytomas; 3) Glioblastomas; 4) Oligodendrogliomas.]

Figure 6.2: 2D-representations of the Nakayama and Sun datasets, based on the two first discriminant vectors provided by GLOSS and SLDA. The big squares represent class means.
Figure 6.3: USPS digits "1" and "0"
6.5 Correlated Data
When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily introduce this prior knowledge.
The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results are left for future work.
For this illustration, we have used a subset of the USPS handwritten digit dataset, made of 16 × 16-pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and S-GLOSS. The mean image of each digit is shown in Figure 6.3.
As in Section 5.6.4, we have encoded the pixel proximity relationships of Figure 5.2 into a penalty matrix ΩL, but this time on a 256-node graph. Introducing this new 256 × 256 Laplacian penalty matrix ΩL in the GLOSS algorithm is straightforward.
The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). We perfectly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element to discriminate both digits.
Figure 6.5 displays the discriminant directions β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels, which allows the detection of strokes and will probably provide better prediction results.
Figure 6.4: Discriminant direction between digits "1" and "0": β for GLOSS (left) and β for S-GLOSS (right)
Figure 6.5: Sparse discriminant direction between digits "1" and "0": β for GLOSS (left) and β for S-GLOSS (right), both with λ = 0.3
Discussion
GLOSS is an efficient algorithm that performs sparse LDA, based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem. This is, to our knowledge, the first approach that enjoys this property in the multi-class setting. This relationship also makes it amenable to accommodate interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.
Computationally, GLOSS is based on an efficient active set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K − 1)-dimensional problem into (K − 1) independent p-dimensional problems. The interaction between the (K − 1) problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.
The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, regarding either its prediction abilities or its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performance. Employing the same features in all discriminant directions enables the generation of models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of the data from the low-dimensional representations that can be produced.
The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can also be added to the group penalty, to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.
Part III
Sparse Clustering Analysis
Abstract
Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records, or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.
Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussian, and the different populations are identified with different means and a common covariance matrix.
As in the supervised framework, the traditional clustering techniques perform worse when the number of irrelevant features increases. In this part, we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.
Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.
7 Feature Selection in Mixture Models
7.1 Mixture Models
One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models, which represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population, and are especially well suited to the problem of clustering.
7.1.1 Model
We assume that the observed data X = (x₁⊤, …, xₙ⊤)⊤ have been drawn identically from K different subpopulations of Rp. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compounded distribution whose density can be expressed as

$$f(x_i) = \sum_{k=1}^{K} \pi_k f_k(x_i), \quad \forall i \in \{1, \dots, n\},$$
where K is the number of components, fk are the densities of the components, and πk are the mixture proportions (πk ∈ ]0, 1[ for all k, and Σk πk = 1). Mixture models transcribe that, given the proportions πk and the distributions fk for each class, the data are generated according to the following mechanism:

• y: each individual is allotted to a class according to a multinomial distribution with parameters π1, …, πK;

• x: each xi is assumed to arise from a random vector with probability density function fk.
In addition, it is usually assumed that the component densities fk belong to a parametric family of densities φ(· ; θk). The density of the mixture can then be written as

$$f(x_i; \theta) = \sum_{k=1}^{K} \pi_k\, \varphi(x_i; \theta_k), \quad \forall i \in \{1, \dots, n\},$$
where θ = (π1, …, πK, θ1, …, θK) is the parameter of the model.
7.1.2 Parameter Estimation: The EM Algorithm
For the estimation of the parameters of the mixture model, Pearson (1894) used the method of moments to estimate the five parameters (μ1, μ2, σ1², σ2², π) of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphic methods, maximum likelihood methods, and Bayesian approaches.
The most widely used process to estimate the parameters is the maximization of the log-likelihood using the EM algorithm. It is typically used to maximize the likelihood of models with latent variables, for which no analytical solution is available (Dempster et al., 1977).
The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the likelihood expectation with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the E-step expected likelihood.
Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed. In practice, the obtained solution depends on the initialization of the algorithm.
Maximum Likelihood Definitions
The likelihood is commonly expressed in its logarithmic version:

$$L(\theta; X) = \log\left(\prod_{i=1}^{n} f(x_i; \theta)\right) = \sum_{i=1}^{n} \log\left(\sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k)\right), \qquad (7.1)$$

where n is the number of samples, K is the number of components of the mixture (or number of clusters), and πk are the mixture proportions.
To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood, or
classification log-likelihood:

$$L_C(\theta; X, Y) = \log\left(\prod_{i=1}^{n} f(x_i, y_i; \theta)\right)
= \sum_{i=1}^{n} \log\left(\sum_{k=1}^{K} y_{ik}\, \pi_k f_k(x_i; \theta_k)\right)
= \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log\left(\pi_k f_k(x_i; \theta_k)\right). \qquad (7.2)$$

The yik are the binary entries of the indicator matrix Y, with yik = 1 if observation i belongs to cluster k, and yik = 0 otherwise.
The soft membership tik(θ) is defined as

$$t_{ik}(\theta) = p(Y_{ik} = 1 \mid x_i; \theta) \qquad (7.3)$$
$$\hphantom{t_{ik}(\theta)} = \frac{\pi_k f_k(x_i; \theta_k)}{f(x_i; \theta)}. \qquad (7.4)$$
To lighten notations, tik(θ) will be denoted tik when the parameter θ is clear from the context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:
$$L_C(\theta; X, Y) = \sum_{i,k} y_{ik} \log\left(\pi_k f_k(x_i; \theta_k)\right)
= \sum_{i,k} y_{ik} \log\left(t_{ik}\, f(x_i; \theta)\right)$$
$$= \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i,k} y_{ik} \log f(x_i; \theta)
= \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i=1}^{n} \log f(x_i; \theta)$$
$$= \sum_{i,k} y_{ik} \log t_{ik} + L(\theta; X), \qquad (7.5)$$
where $\sum_{i,k} y_{ik} \log t_{ik}$ can be reformulated as

$$\sum_{i,k} y_{ik} \log t_{ik} = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log\left(p(Y_{ik} = 1 \mid x_i; \theta)\right)
= \sum_{i=1}^{n} \log\left(p(Y_{ik} = 1 \mid x_i; \theta)\right)
= \log\left(p(Y \mid X; \theta)\right).$$
As a result, the relationship (7.5) can be rewritten as

$$L(\theta; X) = L_C(\theta; Z) - \log\left(p(Y \mid X; \theta)\right). \qquad (7.6)$$
Likelihood Maximization
The complete log-likelihood cannot be assessed because the variables yik are unknown. However, it is possible to estimate the value of the log-likelihood by taking expectations, conditionally on a current value of θ, in (7.6):
$$L(\theta; X) = \underbrace{\mathbb{E}_{Y \sim p(\cdot \mid X; \theta^{(t)})}\left[L_C(\theta; X, Y)\right]}_{Q(\theta, \theta^{(t)})} + \underbrace{\mathbb{E}_{Y \sim p(\cdot \mid X; \theta^{(t)})}\left[-\log p(Y \mid X; \theta)\right]}_{H(\theta, \theta^{(t)})}$$

In this expression, H(θ, θ(t)) is the entropy and Q(θ, θ(t)) is the conditional expectation of the complete log-likelihood. Let us define the increment of the log-likelihood as ΔL = L(θ(t+1); X) − L(θ(t); X). Then, θ(t+1) = argmaxθ Q(θ, θ(t)) also increases the log-likelihood:

$$\Delta L = \underbrace{\left(Q(\theta^{(t+1)}, \theta^{(t)}) - Q(\theta^{(t)}, \theta^{(t)})\right)}_{\geq 0 \text{ by definition of iteration } t+1} - \underbrace{\left(H(\theta^{(t+1)}, \theta^{(t)}) - H(\theta^{(t)}, \theta^{(t)})\right)}_{\leq 0 \text{ by Jensen's inequality}}$$
Therefore, it is possible to maximize the likelihood by optimizing Q(θ, θ(t)). The relationship between Q(θ, θ′) and L(θ; X) is developed in deeper detail in Appendix F, which shows how the value of L(θ; X) can be recovered from Q(θ, θ(t)).
For the mixture model problem, Q(θ, θ′) is

$$Q(\theta, \theta') = \mathbb{E}_{Y \sim p(Y \mid X; \theta')}\left[L_C(\theta; X, Y)\right]
= \sum_{i,k} p(Y_{ik} = 1 \mid x_i; \theta') \log\left(\pi_k f_k(x_i; \theta_k)\right)$$
$$= \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}(\theta') \log\left(\pi_k f_k(x_i; \theta_k)\right). \qquad (7.7)$$
Due to its similitude to the expression of the complete likelihood (7.2), Q(θ, θ′) is also known as the weighted likelihood. In (7.7), the weights tik(θ′) are the posterior probabilities of cluster memberships.
Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter θ(0);

• E-Step: evaluation of Q(θ, θ(t)), using tik(θ(t)) (7.4) in (7.7);

• M-Step: calculation of θ(t+1) = argmaxθ Q(θ, θ(t)).
Gaussian Model
In the particular case of a Gaussian mixture model with a common covariance matrix Σ and different mean vectors μk, the mixture density is

$$f(x_i; \theta) = \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k)
= \sum_{k=1}^{K} \pi_k \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{-\frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)\right\}.$$
At the E-step, the posterior probabilities tik are computed as in (7.4), with the current parameters θ(t); then the M-step maximizes Q(θ, θ(t)) (7.7), whose form is as follows:
$$Q(\theta, \theta^{(t)}) = \sum_{i,k} t_{ik} \log \pi_k - \sum_{i,k} t_{ik} \log\left((2\pi)^{p/2} |\Sigma|^{1/2}\right) - \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)$$
$$= \sum_{k} t_{k} \log \pi_k \underbrace{- \frac{np}{2} \log(2\pi)}_{\text{constant term}} - \frac{n}{2} \log |\Sigma| - \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)$$
$$\equiv \sum_{k} t_{k} \log \pi_k - \frac{n}{2} \log |\Sigma| - \sum_{i,k} t_{ik} \left(\frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)\right), \qquad (7.8)$$

where

$$t_k = \sum_{i=1}^{n} t_{ik}. \qquad (7.9)$$
The M-step, which maximizes this expression with respect to θ, applies the following updates, defining θ(t+1):

$$\pi_k^{(t+1)} = \frac{t_k}{n}, \qquad (7.10)$$
$$\mu_k^{(t+1)} = \frac{\sum_i t_{ik}\, x_i}{t_k}, \qquad (7.11)$$
$$\Sigma^{(t+1)} = \frac{1}{n} \sum_k W_k, \qquad (7.12)$$
$$\text{with} \quad W_k = \sum_i t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top. \qquad (7.13)$$
The derivations are detailed in Appendix G
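The E-step (7.4) and the updates (7.10)–(7.13) yield a compact EM loop. The sketch below (NumPy; a bare-bones illustration of ours, not the Mix-GLOSS implementation) exploits the fact that, with a common covariance matrix, the Gaussian normalizing constant cancels in the posterior probabilities:

```python
import numpy as np

def em_gaussian_mixture(X, K, n_iter=100):
    """EM for a Gaussian mixture with a common covariance matrix.
    E-step: posteriors t_ik as in (7.4); M-step: updates (7.10)-(7.13)."""
    n, p = X.shape
    # farthest-point initialization of the means (a simple deterministic choice)
    mu = [X[0]]
    for _ in range(K - 1):
        dist = np.min([np.sum((X - m) ** 2, axis=1) for m in mu], axis=0)
        mu.append(X[int(np.argmax(dist))])
    mu = np.array(mu)
    Sigma = np.cov(X.T) + 1e-6 * np.eye(p)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: with a common Sigma, the normalizer of the density cancels
        Sinv = np.linalg.inv(Sigma)
        d = X[:, None, :] - mu[None, :, :]                  # (n, K, p)
        logt = np.log(pi) - 0.5 * np.einsum('nkp,pq,nkq->nk', d, Sinv, d)
        t = np.exp(logt - logt.max(axis=1, keepdims=True))
        t /= t.sum(axis=1, keepdims=True)
        # M-step: updates (7.10)-(7.13)
        tk = t.sum(axis=0)
        pi = tk / n                                         # (7.10)
        mu = (t.T @ X) / tk[:, None]                        # (7.11)
        d = X[:, None, :] - mu[None, :, :]
        Sigma = np.einsum('nk,nkp,nkq->pq', t, d, d) / n    # (7.12)-(7.13)
    return pi, mu, Sigma, t
```

As discussed above, only a local maximum is reached, and the result depends on the initialization; the farthest-point scheme used here is just one convenient deterministic option.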
7.2 Feature Selection in Model-Based Clustering
When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own
covariance matrix Σk, Gaussian mixtures are associated to quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid these singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989). Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition, Σk = λk Dk Ak Dk⊤ (Banfield and Raftery, 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.
In this chapter, we review some techniques to induce sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.
7.2.1 Based on Penalized Likelihood
Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:
$$\log\left(\frac{p(Y_k = 1 \mid x)}{p(Y_\ell = 1 \mid x)}\right) = x^\top \Sigma^{-1} (\mu_k - \mu_\ell) - \frac{1}{2} (\mu_k + \mu_\ell)^\top \Sigma^{-1} (\mu_k - \mu_\ell) + \log \frac{\pi_k}{\pi_\ell}.$$
In this model, a simple way of introducing sparsity in the discriminant vectors Σ⁻¹(μk − μℓ) is to constrain Σ to be diagonal, and to favor sparse means μk. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm,
$$\lambda \sum_{k=1}^{K} \sum_{j=1}^{p} |\mu_{kj}|,$$
as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:
$$\lambda_1 \sum_{k=1}^{K} \sum_{j=1}^{p} |\mu_{kj}| + \lambda_2 \sum_{k=1}^{K} \sum_{j=1}^{p} \sum_{m=1}^{p} \left|(\Sigma_k^{-1})_{jm}\right|.$$
In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.
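With a common diagonal covariance, the L1 penalty on the means turns the M-step mean update into a componentwise soft-thresholding of the weighted mean, in the spirit of Pan and Shen (2007). A hedged sketch of this update (function names are ours; the threshold scaling follows from the diagonal Gaussian model):

```python
import numpy as np

def soft_threshold(a, b):
    """S(a, b) = sign(a) * max(|a| - b, 0), applied elementwise."""
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def l1_mean_update(X, t_k, lam, sigma2):
    """L1-penalized M-step update of one cluster mean, assuming a common
    diagonal covariance with variances sigma2 (length-p vector);
    t_k holds the posterior probabilities of cluster k for all n samples."""
    weighted_sum = t_k @ X                              # sum_i t_ik x_i
    return soft_threshold(weighted_sum, lam * np.asarray(sigma2)) / t_k.sum()
```

For λ = 0 this reduces to the plain update (7.11); for large λ, whole mean components are driven exactly to zero, which is what produces the sparsity.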
Guo et al. (2010) propose a variation with a Pairwise Fusion Penalty (PFP):
$$\lambda \sum_{j=1}^{p} \sum_{1 \leq k < k' \leq K} |\mu_{kj} - \mu_{k'j}|.$$
This PFP regularization does not shrink the means to zero, but towards each other. If the jth components of all cluster means are driven to the same value, that variable can be considered as non-informative.
An L1∞ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:
$$\lambda \sum_{j=1}^{p} \left\|(\mu_{1j}, \mu_{2j}, \dots, \mu_{Kj})\right\|_\infty.$$
One group is defined for each variable j, as the set of the K means' jth components (μ1j, …, μKj). The L1∞ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time; in addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG achieves real feature selection, because it forces null values for the same variable in all cluster means:
$$\lambda \sqrt{K} \sum_{j=1}^{p} \sqrt{\sum_{k=1}^{K} \mu_{kj}^2}.$$
The clustering algorithm of VMG differs from ours, but the group penalty proposed is the same; however, no code is available on the authors' website that would allow testing.
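The VMG penalty value is easily computed from the K × p matrix of cluster means (a small helper of ours, for illustration only):

```python
import numpy as np

def vmg_penalty(mu, lam):
    """Vertical mean grouping penalty: one group-Lasso group per variable j,
    gathering the jth component of all K cluster means (mu is K x p)."""
    K = mu.shape[0]
    # per-variable group norms: sqrt(sum_k mu_kj^2), then summed over j
    return lam * np.sqrt(K) * np.linalg.norm(mu, axis=0).sum()
```

Like any group-Lasso penalty, it is non-differentiable only when a whole column of means is zero, which is precisely the configuration it promotes.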
The optimization of a penalized likelihood by means of an EM algorithm can be reformulated, by rewriting the maximization expressions of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity on the discriminant vector. The generalization from quadratic to non-quadratic penalties is quickly outlined in their work. We extend this work by considering an arbitrary number of clusters, and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.
7.2.2 Based on Model Variants
The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy considering conditional independence: the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as
$$f(x_i \mid \phi, \pi, \theta, \nu) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{p} \left[f(x_{ij} \mid \theta_{jk})\right]^{\phi_j} \left[h(x_{ij} \mid \nu_j)\right]^{1 - \phi_j},$$
where f(·|θjk) is the distribution function for relevant features, and h(·|νj) is the distribution function for the irrelevant ones. The binary vector φ = (φ1, φ2, …, φp) represents relevance, with φj = 1 if the jth feature is informative and φj = 0 otherwise. The saliency of variable j is then formalized as ρj = P(φj = 1), so all the φj must be treated as missing variables. Thus, the set of parameters is {πk, θjk, νj, ρj}. Their estimation is done by means of the EM algorithm (Law et al., 2004).
An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix U ∈ Rp×(K−1), which is updated inside the EM loop with a new step, called the Fisher step (F-step from now on), which maximizes a multi-class Fisher's criterion,
$$\operatorname{tr}\left((U^\top \Sigma_W U)^{-1}\, U^\top \Sigma_B U\right), \qquad (7.14)$$
so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then, the F-step updates the projection matrix that projects the data into the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as a function of the projection matrix U and of the model parameters in the latent space, such that the matrix U enters into the M-step equations.
To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one computes the best sparse orthogonal approximation Û of the matrix U that maximizes (7.14). This sparse approximation is defined as the solution of
$$\min_{\hat{U} \in \mathbb{R}^{p\times(K-1)}} \left\|X_U - X\hat{U}\right\|_F^2 + \lambda \sum_{k=1}^{K-1} \left\|\hat{u}_k\right\|_1,$$
where XU = XU is the input data projected onto the non-sparse space, and ûk is the kth column vector of the projection matrix Û. The second possibility is inspired by Qiao et al. (2009), and reformulates the Fisher's discriminant (7.14) used to compute the projection matrix as a regression criterion, penalized by a mixture of Lasso and Elastic net penalties:
$$\min_{A,B \in \mathbb{R}^{p\times(K-1)}} \sum_{k=1}^{K} \left\| R_W^{-\top} H_{B,k} - A B^\top H_{B,k} \right\|_2^2 + \rho \sum_{j=1}^{K-1} \beta_j^\top \Sigma_W \beta_j + \lambda \sum_{j=1}^{K-1} \left\| \beta_j \right\|_1 \quad \text{s.t. } A^\top A = I_{K-1},$$
where H_B ∈ R^{p×K} is a matrix defined conditionally on the posterior probabilities t_ik, satisfying H_B H_B^⊤ = Σ_B, and H_{B,k} is the kth column of H_B. R_W ∈ R^{p×p} is an upper triangular matrix resulting from the Cholesky decomposition of Σ_W. Σ_W and Σ_B are the p × p within-class and between-class covariance matrices in the observation space. A ∈ R^{p×(K−1)} and B ∈ R^{p×(K−1)} are the solutions of the optimization problem, such that B = [β_1, ..., β_{K−1}] is the best sparse approximation of U.
The last possibility defines the solution of the Fisher discriminant (7.14) as the solution of the following constrained optimization problem:
$$\min_{\tilde{U} \in \mathbb{R}^{p\times(K-1)}} \sum_{j=1}^{p} \left\| \Sigma_B^{j} - \tilde{U}\tilde{U}^\top \Sigma_B^{j} \right\|_2^2 \quad \text{s.t. } \tilde{U}^\top \tilde{U} = I_{K-1},$$
where Σ_B^j is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.
To comply with the constraint stating that the columns of U are orthogonal, the first and second options must be followed by a singular value decomposition of Ũ to restore orthogonality. This is not necessary with the third option, since the penalized version of SVD already guarantees orthogonality.
However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed." Immediately after this paragraph, we can read that under certain assumptions their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices: the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized."
7.2.3 Based on Model Selection
Some clustering algorithms recast the feature selection problem as a model selection problem. Accordingly, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:
• X^(1): the set of selected relevant variables;

• X^(2): the set of variables being considered for inclusion in or exclusion from X^(1);

• X^(3): the set of non-relevant variables.
With those subsets, they define two different models, where Y is the partition to consider:

• M1: $f(X|Y) = f\left(X^{(1)}, X^{(2)}, X^{(3)} \,\middle|\, Y\right) = f\left(X^{(3)} \,\middle|\, X^{(2)}, X^{(1)}\right) f\left(X^{(2)} \,\middle|\, X^{(1)}\right) f\left(X^{(1)} \,\middle|\, Y\right)$

• M2: $f(X|Y) = f\left(X^{(1)}, X^{(2)}, X^{(3)} \,\middle|\, Y\right) = f\left(X^{(3)} \,\middle|\, X^{(2)}, X^{(1)}\right) f\left(X^{(2)}, X^{(1)} \,\middle|\, Y\right)$
Model M1 means that variables in X^(2) are independent of the clustering Y; model M2 states that variables in X^(2) depend on the clustering Y. To simplify the algorithm, the subset X^(2) is updated only one variable at a time. Therefore, deciding the relevance of a variable in X^(2) amounts to a model selection between M1 and M2. The selection is done via the Bayes factor
$$B_{12} = \frac{f(X|M_1)}{f(X|M_2)},$$
where the high-dimensional factor f(X^(3)|X^(2), X^(1)) cancels from the ratio:

$$B_{12} = \frac{f\left(X^{(1)}, X^{(2)}, X^{(3)} \,\middle|\, M_1\right)}{f\left(X^{(1)}, X^{(2)}, X^{(3)} \,\middle|\, M_2\right)} = \frac{f\left(X^{(2)} \,\middle|\, X^{(1)}, M_1\right) f\left(X^{(1)} \,\middle|\, M_1\right)}{f\left(X^{(2)}, X^{(1)} \,\middle|\, M_2\right)}.$$
This factor is approximated, since the integrated likelihoods f(X^(1)|M_1) and f(X^(2), X^(1)|M_2) are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of f(X^(2)|X^(1), M_1), if there is only one variable in X^(2), can be represented as a linear regression of the variable in X^(2) on the variables in X^(1). There is also a BIC approximation for this term.
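The BIC approximation of the Bayes factor can be illustrated on a toy relevance test for a single candidate variable; the one-Gaussian-per-cluster fits below are a deliberate simplification of the actual mixture and regression models, and all names and data are illustrative:

```python
import numpy as np

def gaussian_loglik(x, mean, var):
    """Log-likelihood of x under a univariate Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def bic(loglik, n_params, n):
    """BIC on the 2*log-likelihood scale; larger is better."""
    return 2.0 * loglik - n_params * np.log(n)

# Toy relevance test for one candidate variable x, given a fixed 2-cluster labeling z:
# M1 (irrelevant): one global Gaussian; M2 (relevant): one Gaussian per cluster.
rng = np.random.default_rng(0)
z = np.repeat([0, 1], 50)
x = rng.normal(loc=3.0 * z, scale=1.0)     # a variable that does depend on z

n = x.size
ll1 = gaussian_loglik(x, x.mean(), x.var())
ll2 = sum(gaussian_loglik(x[z == k], x[z == k].mean(), x[z == k].var())
          for k in (0, 1))
# 2 log B12 is approximated by the BIC difference between the two models
log_B12 = 0.5 * (bic(ll1, 2, n) - bic(ll2, 4, n))
# log_B12 < 0 here: the evidence favors M2, so the variable is declared relevant
```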
Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets (X^(1) and X^(3)) remain the same, but X^(2) is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. This algorithm also uses a backward stepwise strategy instead of the forward stepwise strategy used by Raftery and Dean (2006). Their algorithm allows defining blocks of indivisible variables that, in certain situations, improve the clustering and its interpretability.
Both algorithms are well motivated and appear to produce good results; however, the computation needed to test the different subsets of variables requires a huge amount of time. In practice, they cannot be used for the amount of data considered in this thesis.
8 Theoretical Foundations
In this chapter we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.
We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to derive reduced-rank decision rules using fewer than K − 1 discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.
In the subsequent sections, we provide the principles that allow solving the M-step as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach to the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.
8.1 Resolving EM with Optimal Scoring
In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.
8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis
LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance
$$d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_W^{-1} (x_i - \mu_k),$$
where μ_k are the p-dimensional centroids and Σ_W is the p × p common within-class covariance matrix.
The likelihood equations in the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by t_ik (the posterior probabilities computed at the E-step).
Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood
$$2\, l_{\mathrm{weight}}(\mu, \Sigma) = -\sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}\, d(x_i, \mu_k) - n \log(|\Sigma_W|),$$
which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized likelihood in Gaussian mixtures.
8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis
The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1 in the supervised learning framework. This is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.
8.1.3 Clustering Using Penalized Optimal Scoring
The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to the Fisher discriminant directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_ik in the E-step, the distance between the samples x_i and the centroids μ_k must be evaluated. Depending on whether we are working in the input, OS or LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:
$$d(x_i, \mu_k) = \left\| (x_i - \mu_k) B_{\mathrm{LDA}} \right\|_2^2 - 2 \log(\pi_k).$$
This distance defines the computation of the posterior probabilities t_ik in the E-step (see Section 4.2.3). Putting all those elements together, the complete clustering algorithm can be summarized as follows:
1. Initialize the membership matrix Y (for example, by the K-means algorithm).

2. Solve the p-OS problem as
$$B_{\mathrm{OS}} = \left( X^\top X + \lambda \Omega \right)^{-1} X^\top Y \Theta,$$
where Θ are the K − 1 leading eigenvectors of
$$Y^\top X \left( X^\top X + \lambda \Omega \right)^{-1} X^\top Y.$$

3. Map X to the LDA domain: $X_{\mathrm{LDA}} = X B_{\mathrm{OS}} D$, with $D = \operatorname{diag}\left(\alpha_k^{-1} (1 - \alpha_k^2)^{-1/2}\right)$.

4. Compute the centroids M in the LDA domain.

5. Evaluate the distances in the LDA domain.

6. Translate the distances into posterior probabilities t_ik with
$$t_{ik} \propto \exp\left[ -\frac{d(x_i, \mu_k) - 2 \log(\pi_k)}{2} \right]. \qquad (8.1)$$

7. Update the labels using the posterior probability matrix: Y = T.

8. Go back to step 2 and iterate until the t_ik converge.
Items 2 to 5 can be interpreted as the M-step, and Item 6 as the E-step, in this alternative view of the EM algorithm for Gaussian mixtures.
8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
In the previous section, we sketched a clustering algorithm that replaces the M-step with a penalized OS regression. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.
8.2 Optimized Criterion
In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ′) (7.7) so as to maximize the likelihood L(θ) (see Section 7.1.2). Replacing the M-step by a penalized optimal scoring problem is possible, and the link between the penalized optimal scoring problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.
This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.
8.2.1 A Bayesian Derivation
This section sketches a Bayesian treatment of inference limited to our present needs, where penalties are interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).
The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.
The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:
$$f(\Sigma|\Lambda_0, \nu_0) = \frac{1}{2^{np/2}\, |\Lambda_0|^{n/2}\, \Gamma_p(n/2)}\; |\Sigma^{-1}|^{(\nu_0 - p - 1)/2} \exp\left( -\frac{1}{2} \operatorname{tr}\left( \Lambda_0^{-1} \Sigma^{-1} \right) \right),$$
where ν_0 is the number of degrees of freedom of the distribution, Λ_0 is a p × p scale matrix, and Γ_p is the multivariate gamma function defined as
$$\Gamma_p(n/2) = \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma\left( \frac{n}{2} + \frac{1-j}{2} \right).$$
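SciPy exposes the log of this function as `scipy.special.multigammaln`; a quick sketch checking it against the definition above:

```python
import math
import numpy as np
from scipy.special import multigammaln

def log_multigamma(a, p):
    """log Gamma_p(a), computed directly from the definition above."""
    return p * (p - 1) / 4.0 * math.log(math.pi) + sum(
        math.lgamma(a + (1 - j) / 2.0) for j in range(1, p + 1))
```

For p = 1 the multivariate gamma reduces to the ordinary gamma function.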
The posterior distribution can be maximized, similarly to the likelihood, through the maximization of
$$
\begin{aligned}
Q(\theta, \theta') &+ \log\left( f(\Sigma|\Lambda_0, \nu_0) \right) \\
&= \sum_{k=1}^{K} t_k \log \pi_k - \frac{(n+1)p}{2} \log 2 - \frac{n}{2} \log |\Lambda_0| - \frac{p(p+1)}{4} \log(\pi) \\
&\quad - \sum_{j=1}^{p} \log\left( \Gamma\left( \frac{n}{2} + \frac{1-j}{2} \right) \right) - \frac{\nu_n - p - 1}{2} \log |\Sigma| - \frac{1}{2} \operatorname{tr}\left( \Lambda_n^{-1} \Sigma^{-1} \right) \\
&\equiv \sum_{k=1}^{K} t_k \log \pi_k - \frac{n}{2} \log |\Lambda_0| - \frac{\nu_n - p - 1}{2} \log |\Sigma| - \frac{1}{2} \operatorname{tr}\left( \Lambda_n^{-1} \Sigma^{-1} \right), \qquad (8.2)
\end{aligned}
$$

with
$$t_k = \sum_{i=1}^{n} t_{ik}, \qquad \nu_n = \nu_0 + n, \qquad \Lambda_n^{-1} = \Lambda_0^{-1} + S_0, \qquad S_0 = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top.$$
Details of these calculations can be found in textbooks (for example, Bishop, 2006; Gelman et al., 2003).
8.2.2 Maximum a Posteriori Estimator
The maximization of (8.2) with respect to μ_k and π_k is of course not affected by the additional prior term, where only the covariance Σ intervenes. The MAP estimator for Σ is simply obtained by differentiating (8.2) with respect to Σ. The details of the calculations follow the same lines as the ones for maximum likelihood, detailed in Appendix G. The resulting estimator for Σ is
$$\Sigma_{\mathrm{MAP}} = \frac{1}{\nu_0 + n - p - 1} \left( \Lambda_0^{-1} + S_0 \right), \qquad (8.3)$$
where S_0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if ν_0 is chosen to be p + 1 and Λ_0^{-1} = λΩ, where Ω is the penalty matrix from the group-Lasso regularization (4.25).
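A direct sketch of the estimator (8.3), assuming S_0 and Λ_0^{-1} already computed (names are ours):

```python
import numpy as np

def sigma_map(S0, Lambda0_inv, n, nu0):
    """MAP estimator (8.3) of the common covariance under the Wishart-type prior."""
    p = S0.shape[0]
    return (Lambda0_inv + S0) / (nu0 + n - p - 1)
```

With ν_0 = p + 1 and Λ_0^{-1} = λΩ, the denominator reduces to n, recovering the penalized within-class covariance (λΩ + S_0)/n mentioned above.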
9 Mix-GLOSS Algorithm
Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter, I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.
9.1 Mix-GLOSS
The implementation of Mix-GLOSS involves three nested loops, as schemed in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_ik.
When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.
The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.
9.1.1 Outer Loop: Whole Algorithm Repetitions
This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:
• the centered n × p feature matrix X;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);
• the number of clusters K;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;
Figure 9.1: Mix-GLOSS loops scheme.
• a p × (K − 1) initial coefficient matrix (optional);

• an n × K initial posterior probability matrix (optional).
For each repetition of the algorithm, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.
9.1.2 Penalty Parameter Loop
The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, such that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have observed that the implemented warm-start reduces the computation time by a factor of 8 with respect to using a null B matrix and a K-means execution for the initial Y label matrix.
Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the estimated percentage of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.
Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternative variational approach from Appendix D is used, Equation (4.32b) must be replaced by (D.10b).
Algorithm 2: Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize: B ← 0, Y ← K-means(X, K)
Run non-penalized Mix-GLOSS:
    λ ← 0
    (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
lastLAMBDA ← false
repeat
    Estimate λ: compute the gradient at β_j = 0,
        $$\left. \frac{\partial J(B)}{\partial \beta_j} \right|_{\beta_j = 0} = x_j^\top \left( \sum_{m \neq j} x_m \beta_m - Y\Theta \right)$$
    Compute λ_max for every feature using (4.32b),
        $$\lambda_{\max}^{j} = \frac{1}{w_j} \left\| \left. \frac{\partial J(B)}{\partial \beta_j} \right|_{\beta_j = 0} \right\|_2$$
    Choose λ so as to remove 10% of the relevant features
    Run penalized Mix-GLOSS:
        (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    if the number of relevant variables in B > minVAR then
        lastLAMBDA ← false
    else
        lastLAMBDA ← true
    end if
until lastLAMBDA
Output: B, L(θ), t_ik, π_k, μ_k, Σ, Y for every λ in the solution path
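The two formulas inside the loop of Algorithm 2 can be sketched as follows (a NumPy illustration; `YTheta` stands for the scaled label matrix YΘ and `w` for the group-Lasso weights, both assumed given):

```python
import numpy as np

def lambda_max_per_feature(X, YTheta, B, w):
    """Per-feature lambda_max: the smallest penalty level that zeroes feature j.

    Implements the two formulas of Algorithm 2: the gradient of the data-fit
    term at beta_j = 0, and its weighted Euclidean norm, as in (4.32b).
    """
    n, p = X.shape
    lam_max = np.zeros(p)
    for j in range(p):
        Bj = B.copy()
        Bj[j, :] = 0.0                        # evaluate the gradient at beta_j = 0
        grad = X[:, j].T @ (X @ Bj - YTheta)  # x_j' (sum_{m != j} x_m beta_m - Y Theta)
        lam_max[j] = np.linalg.norm(grad, 2) / w[j]
    return lam_max
```

Choosing λ just below the λ_max^j of a targeted fraction of the currently relevant features removes roughly that fraction at the next run, which is how the 10% heuristic above can be implemented.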
9.1.3 Inner Loop: EM Algorithm
The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence in the posterior probabilities t_ik is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.
Algorithm 3: Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
if (B0, Y0) are available then
    B_OS ← B0, Y ← Y0
else
    B_OS ← 0, Y ← K-means(X, K)
end if
convergenceEM ← false, tolEM ← 1e-3
repeat
    M-step:
        (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
        X_LDA = X B_OS diag(α^{-1}(1 − α²)^{-1/2})
        π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
    E-step:
        t_ik as per (8.1)
        L(θ) as per (8.2)
    if (1/n) Σ_i |t_ik − y_ik| < tolEM then
        convergenceEM ← true
    end if
    Y ← T
until convergenceEM
Y ← MAP(T)
Output: B_OS, Θ, L(θ), t_ik, π_k, μ_k, Σ, Y
M-Step
The M-step deals with the estimation of the model parameters, that is, the cluster means μ_k, the common covariance matrix Σ, and the prior of every component π_k. In a classical M-step, this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by a penalized optimal scoring regression (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version of the label matrix, YΘ. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.
E-Step
The E-step evaluates the posterior probability matrix T using
$$t_{ik} \propto \exp\left[ -\frac{d(x_i, \mu_k) - 2 \log(\pi_k)}{2} \right].$$
The convergence of these t_ik is used as the stopping criterion for EM.
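This E-step can be sketched with the usual max-shift for numerical stability (a NumPy illustration; the function and variable names are ours):

```python
import numpy as np

def posteriors(d2, log_pi):
    """E-step: posterior matrix T from squared distances d2 (n x K) and log priors.

    Normalizes t_ik proportional to exp[-(d(x_i, mu_k) - 2 log pi_k)/2],
    shifting by the row maximum before exponentiating to avoid underflow.
    """
    logt = -0.5 * d2 + log_pi                # equals -(d - 2 log pi)/2
    logt -= logt.max(axis=1, keepdims=True)  # stabilization; cancels in the ratio
    T = np.exp(logt)
    return T / T.sum(axis=1, keepdims=True)
```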
9.2 Model Selection
Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.
In a first attempt, we tried a classical structure where clustering was performed several times, from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm consumed considerable computing resources, since the stability selection mechanism required a certain number of repetitions that transformed Mix-GLOSS into a lengthy four-nested-loop structure.
In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978) but takes into consideration the variables that have been removed. Even though this mechanism turned out to be faster, it still required a large computation time.
The third and, up to now, definitive attempt proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0), and the execution with the best log-likelihood is chosen. The repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism. This time, only one repetition of the algorithm is done for every value of the penalty parameter. This version has been tested
Figure 9.2: Mix-GLOSS model selection diagram. [An initial Mix-GLOSS run (λ = 0, 20 repetitions) provides the starting B and T; Mix-GLOSS is then run once per λ value from this warm start, BIC is computed for each λ, and λ_BEST = arg min_λ BIC, yielding the partition, the t_ik, π_k, B, Θ, D, L(θ) and the active set.]
with no significant differences in the quality of the clustering, but dramatically reducing the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.
10 Experimental Results
The performance of Mix-GLOSS is measured here with the artificial dataset that was used in Chapter 6.
This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for simulation 2, where they are slightly correlated. In simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes error was estimated to be, respectively, 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.
In our tests, we have reduced the volume of the problem because, with the original size of 1200 samples and 500 dimensions, some of the algorithms to be tested took several days (even weeks) to finish. Hence, the definitive database was chosen to approximately maintain the Bayes error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1 has been adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments and allows a better understanding of the different simulations.
The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.
10.1 Tested Clustering Algorithms
This section compares Mix-GLOSS with the following state-of-the-art methods:
• CS general cov: a model-based clustering with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.
• Fisher EM: this method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package, "FisherEM", is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network.
Figure 10.1: Class mean vectors for each artificial simulation.
• SelvarClust/Clustvarsel: implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software implemented in C++ that makes use of the clustering library mixmod (Biernacki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.
After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.
The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network.
• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. In Kuan et al. (2010), some slight changes are introduced in the penalty term, such as weighting parameters, that are particularly important for their dataset. The LumiWCluster package allows performing clustering using the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).
• Mix-GLOSS: this is the clustering algorithm implemented using GLOSS (see Chapter 9). It makes use of an EM algorithm and the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach to the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.
10.2 Results
Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The criteria used to measure the performance are:
• Clustering error (in %): to measure the quality of the partition with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows obtaining the ideal 0% of clustering error even if the IDs of the clusters and of the real classes are different.
• Number of discarded features: this value shows the number of variables whose coefficients have been zeroed, and which are therefore not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.
• Time of execution (in hours, minutes or seconds): finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. These algorithms tend to be more memory- and CPU-consuming as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.
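The label-permutation-invariant clustering error of the first item can be sketched with the Hungarian algorithm (an illustration of the idea; the exact formulation of Wu and Schölkopf may differ in details):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(y_true, y_pred):
    """Clustering error (%): misassignment rate under the best one-to-one
    matching between cluster IDs and class IDs (Hungarian algorithm)."""
    K = max(y_true.max(), y_pred.max()) + 1
    C = np.zeros((K, K), dtype=int)            # contingency table classes x clusters
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    rows, cols = linear_sum_assignment(-C)     # maximize the matched counts
    return 100.0 * (1.0 - C[rows, cols].sum() / len(y_true))
```

A partition identical to the true labeling up to a renaming of the cluster IDs thus scores the ideal 0%.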
The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of truly relevant variables that are selected; similarly, the FPR is the proportion of truly non-relevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations, but only for three algorithms: CS general cov and Clustvarsel were discarded due to high computing time and high clustering error, respectively, and since the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.
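For reference, a minimal sketch of how TPR and FPR are computed from a selection result (the function and argument names are ours):

```python
import numpy as np

def tpr_fpr(selected, relevant, p):
    """TPR/FPR (%) of a feature-selection result.

    selected: indices kept by the algorithm; relevant: indices of the truly
    informative features; p: total number of features.
    """
    selected, relevant = set(selected), set(relevant)
    irrelevant = set(range(p)) - relevant
    tpr = 100.0 * len(selected & relevant) / len(relevant)
    fpr = 100.0 * len(selected & irrelevant) / len(irrelevant)
    return tpr, fpr
```

In our setting, relevant would be the first 20 of the p = 100 features.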
Results, in percentages, are displayed in Figure 10.2 and in Table 10.2.
Table 10.1: Experimental results for simulated data (standard deviations in parentheses)

                        Err (%)        # Var          Time
Sim 1: K = 4, mean shift, ind. features
  CS general cov        4.6 (1.5)      98.5 (7.2)     88.4 h
  Fisher EM             5.8 (8.7)      78.4 (5.2)     164.5 m
  Clustvarsel           60.2 (10.7)    37.8 (29.1)    38.3 h
  LumiWCluster-Kuan     4.2 (6.8)      77.9 (4)       38.9 s
  LumiWCluster-Wang     4.3 (6.9)      78.4 (3.9)     61.9 s
  Mix-GLOSS             3.2 (1.6)      80 (0.9)       1.5 h
Sim 2: K = 2, mean shift, dependent features
  CS general cov        15.4 (2)       99.7 (0.9)     78.3 h
  Fisher EM             7.4 (2.3)      80.9 (2.8)     8 m
  Clustvarsel           7.3 (2)        33.4 (20.7)    16.6 h
  LumiWCluster-Kuan     6.4 (1.8)      79.8 (0.4)     15.5 s
  LumiWCluster-Wang     6.3 (1.7)      79.9 (0.3)     1.4 s
  Mix-GLOSS             7.7 (2)        84.1 (3.4)     2 h
Sim 3: K = 4, 1D mean shift, ind. features
  CS general cov        30.4 (5.7)     55 (46.8)      131.7 h
  Fisher EM             23.3 (6.5)     36.6 (5.5)     22 m
  Clustvarsel           65.8 (11.5)    23.2 (29.1)    54.2 h
  LumiWCluster-Kuan     32.3 (2.1)     80 (0.2)       8.3 s
  LumiWCluster-Wang     30.8 (3.6)     80 (0.2)       129.2 s
  Mix-GLOSS             34.7 (9.2)     81 (8.8)       2.1 h
Sim 4: K = 4, mean shift, ind. features
  CS general cov        62.6 (5.5)     99.9 (0.2)     11.2 h
  Fisher EM             56.7 (10.4)    55 (4.8)       19.5 m
  Clustvarsel           73.2 (4)       24 (1.2)       76.7 h
  LumiWCluster-Kuan     69.2 (11.2)    99 (2)         87.6 s
  LumiWCluster-Wang     69.7 (11.9)    99.1 (2.1)     82.5 s
  Mix-GLOSS             66.9 (9.1)     97.5 (1.2)     1.1 h
Table 10.2: TPR versus FPR (in %), averages over 25 repetitions, for the best performing algorithms

             Simulation 1     Simulation 2     Simulation 3     Simulation 4
             TPR     FPR      TPR     FPR      TPR     FPR      TPR     FPR
Mix-GLOSS    99.2    0.15     82.8    3.35     88.4    6.7      78.0    1.2
LUMI-KUAN    99.2    2.8      100.0   0.2      100.0   0.05     50.0    0.05
FISHER-EM    98.6    2.4      88.8    1.7      83.8    58.25    62.0    40.75
Figure 10.2: TPR versus FPR (in %) for the best performing algorithms and simulations (Mix-GLOSS, LUMI-KUAN and FISHER-EM on Simulations 1 to 4).
10.3 Discussion
After reviewing Tables 10.1 and 10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. Depending on the objectives and constraints of the problem, the following observations deserve to be highlighted.
LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest method, with good behavior regarding the other performance criteria. At the other end on this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on each implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation times are worth mentioning here.
The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Wang and Zhu, 2008) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.
From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best, or close to the best solution, in terms of fall-out and recall.
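As an illustration, the fall-out (FPR) and recall (TPR) figures of Table 10.2 can be computed from the true and selected variable sets. A minimal Python sketch (the function name and toy data are ours, not part of the thesis):

```python
import numpy as np

def tpr_fpr(true_support, selected, p):
    """True positive rate (recall) and false positive rate (fall-out)
    of a variable-selection result, in percent."""
    truth = np.zeros(p, dtype=bool)
    truth[list(true_support)] = True
    sel = np.zeros(p, dtype=bool)
    sel[list(selected)] = True
    tpr = 100.0 * (sel & truth).sum() / truth.sum()
    fpr = 100.0 * (sel & ~truth).sum() / (~truth).sum()
    return tpr, fpr

# 100 variables, the first 10 relevant; a method keeps 9 of them plus 9 noise ones
tpr, fpr = tpr_fpr(range(10),
                   list(range(9)) + [20, 30, 40, 50, 60, 70, 80, 90, 95],
                   p=100)
# tpr = 90.0 (9 of 10 relevant kept), fpr = 10.0 (9 of 90 irrelevant kept)
```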
Conclusions
Summary
The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.
In this thesis we have demonstrated the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.
The equivalence between LDA and OS problems makes it possible to bring all the resources available for solving regression problems to the solution of linear discrimination. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.
In Part II we have used a variational approach to the group-Lasso penalty that preserves this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver algorithm (GLOSS), which has proved its effectiveness, inducing extremely parsimonious models without renouncing any predictive capability. GLOSS has been tested on four artificial and three real datasets, outperforming state-of-the-art algorithms in almost all situations.
In Part III this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As for the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion to maximize at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection of the penalty parameter. In this case too, the theory has been put into practice with the implementation of Mix-GLOSS. By now, due to time constraints, only artificial datasets have been tested, with positive results.
Perspectives
Even if the preliminary results are promising, Mix-GLOSS has not been sufficiently tested. We plan to test it at least with the same real datasets that we used with GLOSS; however, more testing would be recommended in both cases. These algorithms are well suited to genomic data, where the number of samples is smaller than the number of variables, but other high-dimensional low-sample-size (HDLSS) domains are also possible: identification of male or female silhouettes, fungal species, or fish species
based on shape and texture (Clemmensen et al., 2011), or the Stirling faces (Roth and Lange, 2004), are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a) have also been tested in the literature.
At the programming level, both codes must be revisited to improve their robustness and optimize their computations, because during the prototyping phase the priority was achieving functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version should be made available for GLOSS and Mix-GLOSS in the short term.
The theory developed in this thesis, and the programming structure used for its implementation, allow easy alterations of the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix; spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as the Laplacian that describes the relationships between variables; this can be used to implement pair-wise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-equivalent penalties. Some of those possibilities have been partially implemented, such as the diagonal version of GLOSS; however, they have not been properly tested or even updated with the latest algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time deadlines for the publication of this thesis.
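The diagonal and spherical restrictions mentioned above can be sketched in a few lines. The following illustrative Python (ours, with toy data; it is not the thesis implementation) derives both restricted models from a full covariance estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Sigma = np.cov(X, rowvar=False)        # full within-class covariance model

# diagonal model: keep only the per-variable variances
Sigma_diag = np.diag(np.diag(Sigma))

# spherical model: a single common variance (average of the diagonal)
p = Sigma.shape[0]
Sigma_sph = (np.trace(Sigma) / p) * np.eye(p)
```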
From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized by Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Ignoring this criterion does not prevent successful runs of Mix-GLOSS: other mechanisms that do not involve the computation of the real criterion have been used to stop the EM algorithm and to perform model selection. However, further investigation must be done in this direction to assess the convergence properties of this algorithm.
At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was devoted to the domains of outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers models the population with a mixture model in which the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather together all those points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.
Appendix
A Matrix Properties
Property 1: By definition, Σ_W and Σ_B are both symmetric matrices:

\[
\Sigma_W = \frac{1}{n} \sum_{k=1}^{g} \sum_{i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^\top ,
\qquad
\Sigma_B = \frac{1}{n} \sum_{k=1}^{g} n_k (\mu_k - \bar{x})(\mu_k - \bar{x})^\top .
\]
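A minimal Python sketch (ours, not part of the thesis) computing both matrices from labeled data, and checking the classical decomposition of the total covariance into Σ_W + Σ_B:

```python
import numpy as np

def scatter_matrices(X, y):
    """Sigma_W and Sigma_B as in Property 1; both are symmetric by construction."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    Sw, Sb = np.zeros((p, p)), np.zeros((p, p))
    for k in np.unique(y):
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        Sw += (Xk - mu_k).T @ (Xk - mu_k)
        Sb += len(Xk) * np.outer(mu_k - xbar, mu_k - xbar)
    return Sw / n, Sb / n

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = np.repeat([0, 1, 2], 10)
Sw, Sb = scatter_matrices(X, y)

# Sigma_W + Sigma_B equals the total covariance (within/between decomposition)
T = (X - X.mean(axis=0)).T @ (X - X.mean(axis=0)) / len(X)
assert np.allclose(Sw, Sw.T) and np.allclose(Sb, Sb.T)
assert np.allclose(Sw + Sb, T)
```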
Property 2: \( \partial (x^\top a) / \partial x = \partial (a^\top x) / \partial x = a \)

Property 3: \( \partial (x^\top A x) / \partial x = (A + A^\top) x \)

Property 4: \( \partial |X^{-1}| / \partial X = -|X^{-1}| (X^{-1})^\top \)

Property 5: \( \partial (a^\top X b) / \partial X = a b^\top \)

Property 6: \( \dfrac{\partial}{\partial X} \mathrm{tr}\left( A X^{-1} B \right) = -(X^{-1} B A X^{-1})^\top = -X^{-\top} A^\top B^\top X^{-\top} \)
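These identities are easy to check numerically. For instance, Property 3 can be verified with central finite differences (illustrative Python, not part of the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
x = rng.normal(size=4)

f = lambda v: v @ A @ v          # quadratic form x' A x
eps = 1e-6
# central finite-difference gradient, one coordinate at a time
num_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(4)])
assert np.allclose(num_grad, (A + A.T) @ x, atol=1e-5)   # Property 3
```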
B The Penalized-OS Problem is anEigenvector Problem
In this appendix we explain why the solution of a penalized optimal scoring regression involves an eigenvector decomposition. The p-OS problem has the form

\[
\min_{\theta_k, \beta_k} \; \| Y \theta_k - X \beta_k \|_2^2 + \beta_k^\top \Omega_k \beta_k \tag{B.1}
\]
\[
\text{s.t.} \quad \theta_k^\top Y^\top Y \theta_k = 1 , \qquad \theta_\ell^\top Y^\top Y \theta_k = 0 \quad \forall \ell < k ,
\]
for k = 1, …, K − 1. The Lagrangian associated with Problem (B.1) is

\[
L_k(\theta_k, \beta_k, \lambda_k, \nu_k) = \| Y \theta_k - X \beta_k \|_2^2 + \beta_k^\top \Omega_k \beta_k
+ \lambda_k \left( \theta_k^\top Y^\top Y \theta_k - 1 \right) + \sum_{\ell < k} \nu_\ell \, \theta_\ell^\top Y^\top Y \theta_k . \tag{B.2}
\]
Setting the gradient of (B.2) with respect to β_k to zero gives the optimal β_k^*:

\[
\beta_k^* = (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k . \tag{B.3}
\]
The objective function of (B.1) evaluated at β_k^* is

\[
\min_{\theta_k} \; \| Y \theta_k - X \beta_k^* \|_2^2 + \beta_k^{*\top} \Omega_k \beta_k^*
= \min_{\theta_k} \; \theta_k^\top Y^\top \left( I - X (X^\top X + \Omega_k)^{-1} X^\top \right) Y \theta_k
\]
\[
= \max_{\theta_k} \; \theta_k^\top Y^\top X (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k . \tag{B.4}
\]

If the penalty matrix Ω_k is identical for all problems, Ω_k = Ω, then (B.4) corresponds to an eigen-problem where the k score vectors θ_k are the eigenvectors of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.
B.1 How to Solve the Eigenvector Decomposition
Computing an eigen-decomposition of an expression like Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y is not trivial because of the p × p inverse: for some datasets p can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.
Let M be the matrix Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y, so that expression (B.4) can be rewritten compactly as

\[
\max_{\Theta \in \mathbb{R}^{K \times (K-1)}} \; \mathrm{tr}\left( \Theta^\top M \Theta \right)
\quad \text{s.t.} \quad \Theta^\top Y^\top Y \Theta = I_{K-1} . \tag{B.5}
\]
If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the (K−1) × (K−1) matrix M_Θ be Θ^⊤MΘ. Hence the classical eigenvector formulation associated with (B.5) is

\[
M_\Theta v = \lambda v , \tag{B.6}
\]

where v is an eigenvector and λ the associated eigenvalue of M_Θ. Operating,

\[
v^\top M_\Theta v = \lambda \;\Leftrightarrow\; v^\top \Theta^\top M \Theta v = \lambda .
\]

Making the variable change w = Θv, we obtain an alternative eigenproblem where the w are the eigenvectors of M and λ the associated eigenvalues:

\[
w^\top M w = \lambda . \tag{B.7}
\]

Therefore v are the eigenvectors of the eigen-decomposition of matrix M_Θ, and w are the eigenvectors of the eigen-decomposition of matrix M. Note that the only difference between the (K−1) × (K−1) matrix M_Θ and the K × K matrix M is the K × (K−1) matrix Θ in the expression M_Θ = Θ^⊤MΘ. Then, to avoid the computation of the p × p inverse (X^⊤X + Ω)^{-1}, we can plug the optimal value of the coefficient matrix B = (X^⊤X + Ω)^{-1}X^⊤YΘ into M_Θ:

\[
M_\Theta = \Theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta = \Theta^\top Y^\top X B .
\]

Thus the eigen-decomposition of the (K−1) × (K−1) matrix M_Θ = Θ^⊤Y^⊤XB yields the v eigenvectors of (B.6). To obtain the w eigenvectors of the alternative formulation (B.7), the variable change w = Θv must be undone.

To summarize, we calculate the v eigenvectors from the eigen-decomposition of the tractable matrix M_Θ, evaluated as Θ^⊤Y^⊤XB. Then the definitive eigenvectors w are recovered as w = Θv. The final step is the reconstruction of the optimal score matrix Θ^* using the vectors w as its columns. At this point we understand what in the literature is called "updating the initial score matrix": multiplying the initial Θ by the eigenvector matrix V from decomposition (B.6) reverses the change of variable to restore the w vectors. The matrix B also needs to be "updated", by multiplying it by the same eigenvector matrix V, in order to account for the initial Θ used in the first computation of B:

\[
B^* = (X^\top X + \Omega)^{-1} X^\top Y \Theta V = B V .
\]
B.2 Why the OS Problem is Solved as an Eigenvector Problem
In the optimal scoring literature, the score matrix Θ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix M = Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.
By definition of the eigen-decomposition, the eigenvectors of M (called w in (B.7)) form a basis, so that any score vector θ_k can be expressed as a linear combination of them:

\[
\theta_k = \sum_{m=1}^{K-1} \alpha_m w_m , \quad \text{s.t.} \quad \theta_k^\top \theta_k = 1 . \tag{B.8}
\]

The score vector normalization constraint θ_k^⊤θ_k = 1 can also be expressed as a function of this basis:

\[
\left( \sum_{m=1}^{K-1} \alpha_m w_m \right)^{\!\top} \left( \sum_{m=1}^{K-1} \alpha_m w_m \right) = 1 ,
\]

which, by the orthonormality of the eigenvectors, reduces to

\[
\sum_{m=1}^{K-1} \alpha_m^2 = 1 . \tag{B.9}
\]
Let M be multiplied by a score vector θ_k, which can be replaced by its linear combination of eigenvectors w_m (B.8):

\[
M \theta_k = M \sum_{m=1}^{K-1} \alpha_m w_m = \sum_{m=1}^{K-1} \alpha_m M w_m .
\]

As the w_m are the eigenvectors of M, the relationship M w_m = λ_m w_m can be used to obtain

\[
M \theta_k = \sum_{m=1}^{K-1} \alpha_m \lambda_m w_m .
\]
Multiplying on the left by θ_k^⊤, written as its corresponding linear combination of eigenvectors,

\[
\theta_k^\top M \theta_k = \left( \sum_{\ell=1}^{K-1} \alpha_\ell w_\ell \right)^{\!\top} \left( \sum_{m=1}^{K-1} \alpha_m \lambda_m w_m \right) .
\]

This equation can be simplified using the orthogonality property of the eigenvectors, according to which w_ℓ^⊤ w_m = 0 for any ℓ ≠ m, giving

\[
\theta_k^\top M \theta_k = \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m .
\]
The optimization problem (B.5) for discriminant direction k can thus be rewritten as

\[
\max_{\theta_k \in \mathbb{R}^{K}} \; \theta_k^\top M \theta_k
= \max_{\theta_k \in \mathbb{R}^{K}} \; \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m \tag{B.10}
\]
\[
\text{with} \quad \theta_k = \sum_{m=1}^{K-1} \alpha_m w_m
\quad \text{and} \quad \sum_{m=1}^{K-1} \alpha_m^2 = 1 .
\]

One way of maximizing Problem (B.10) is to choose α_m = 1 for m = k and α_m = 0 otherwise. Hence, as θ_k = Σ_{m=1}^{K-1} α_m w_m, the resulting score vector θ_k is equal to the k-th eigenvector w_k.

As a summary, it can be concluded that the solution of the original problem (B.1) can be achieved by an eigenvector decomposition of the matrix M = Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.
C Solving Fisherrsquos Discriminant Problem
The classical Fisher's discriminant problem seeks a projection that best separates the class centers while keeping every class compact. This is formalized as looking for a projection such that the projected data have maximal between-class variance under a unitary constraint on the within-class variance:
\[
\max_{\beta \in \mathbb{R}^p} \; \beta^\top \Sigma_B \beta \tag{C.1a}
\]
\[
\text{s.t.} \quad \beta^\top \Sigma_W \beta = 1 , \tag{C.1b}
\]

where Σ_B and Σ_W are respectively the between-class variance and the within-class variance of the original p-dimensional data.
The Lagrangian of Problem (C.1) is

\[
L(\beta, \nu) = \beta^\top \Sigma_B \beta - \nu \left( \beta^\top \Sigma_W \beta - 1 \right) ,
\]

so that its first derivative with respect to β is

\[
\frac{\partial L(\beta, \nu)}{\partial \beta} = 2 \Sigma_B \beta - 2 \nu \Sigma_W \beta .
\]

A necessary optimality condition for β^* is that this derivative is zero, that is,

\[
\Sigma_B \beta^* = \nu \Sigma_W \beta^* .
\]

Provided Σ_W is full rank, we have

\[
\Sigma_W^{-1} \Sigma_B \beta^* = \nu \beta^* . \tag{C.2}
\]
Thus the solutions β^* match the definition of an eigenvector of the matrix Σ_W^{-1}Σ_B, with eigenvalue ν. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:

\[
\begin{aligned}
\beta^\top \Sigma_B \beta &= \beta^\top \Sigma_W \Sigma_W^{-1} \Sigma_B \beta \\
&= \nu \, \beta^\top \Sigma_W \beta & \text{from (C.2)} \\
&= \nu & \text{from (C.1b)} .
\end{aligned}
\]

That is, the optimal value of the objective function to be maximized is the eigenvalue ν. Hence ν is the largest eigenvalue of Σ_W^{-1}Σ_B, and β^* is any eigenvector corresponding to this maximal eigenvalue.
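This characterization is straightforward to verify numerically (illustrative Python on toy matrices; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3
A = rng.normal(size=(p, p))
Sw = A @ A.T + p * np.eye(p)          # symmetric positive-definite Sigma_W
Bm = rng.normal(size=(p, 2))
Sb = Bm @ Bm.T                        # Sigma_B, typically rank-deficient

# eigenvectors of Sigma_W^{-1} Sigma_B, as in (C.2)
nus, V = np.linalg.eig(np.linalg.solve(Sw, Sb))
k = np.argmax(nus.real)
beta = V[:, k].real
beta /= np.sqrt(beta @ Sw @ beta)     # enforce the constraint (C.1b)
nu = nus[k].real

# the optimal objective value equals the largest eigenvalue
assert np.isclose(beta @ Sb @ beta, nu)
```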
D Alternative Variational Formulation forthe Group-Lasso
In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:
\[
\min_{\tau \in \mathbb{R}^p} \; \min_{B \in \mathbb{R}^{p \times (K-1)}} \; J(B) + \lambda \sum_{j=1}^{p} w_j^2 \frac{\| \beta^j \|_2^2}{\tau_j} \tag{D.1a}
\]
\[
\text{s.t.} \quad \sum_{j=1}^{p} \tau_j = 1 , \tag{D.1b}
\]
\[
\qquad \tau_j \ge 0 , \quad j = 1, \dots, p . \tag{D.1c}
\]
Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let B ∈ R^{p×(K−1)} be a matrix composed of row vectors β^j ∈ R^{K−1}: B = (β^{1\top}, …, β^{p\top})^⊤.

\[
L(B, \tau, \lambda, \nu_0, \nu_j) = J(B) + \lambda \sum_{j=1}^{p} w_j^2 \frac{\| \beta^j \|_2^2}{\tau_j}
+ \nu_0 \left( \sum_{j=1}^{p} \tau_j - 1 \right) - \sum_{j=1}^{p} \nu_j \tau_j . \tag{D.2}
\]
The starting point is the Lagrangian (D2) that is differentiated with respect to τj toget the optimal value τj
partL(B τ λ ν0 νj)
partτj
∣∣∣∣τj=τj
= 0 rArr minusλw2j
∥∥βj∥∥2
2
τj2 + ν0 minus νj = 0
rArr minusλw2j
∥∥βj∥∥2
2+ ν0τ
j
2 minus νjτj2 = 0
rArr minusλw2j
∥∥βj∥∥2
2+ ν0τ
j
2 = 0
The last two expressions are related through one property of the Lagrange multipliersthat states that νjgj(τ
) = 0 where νj is the Lagrange multiplier and gj(τ) is the
inequality Lagrange condition Then the optimal τj can be deduced
τj =
radicλ
ν0wj∥∥βj∥∥
2
Plugging this optimal value of τ_j^* into constraint (D.1b),

\[
\sum_{j=1}^{p} \tau_j^* = 1 \;\Rightarrow\; \tau_j^* = \frac{w_j \| \beta^j \|_2}{\sum_{j'=1}^{p} w_{j'} \| \beta^{j'} \|_2} . \tag{D.3}
\]
With this value of τ_j^*, Problem (D.1) is equivalent to

\[
\min_{B \in \mathbb{R}^{p \times (K-1)}} \; J(B) + \lambda \left( \sum_{j=1}^{p} w_j \| \beta^j \|_2 \right)^{\!2} . \tag{D.4}
\]

This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4). In particular, its solution is expected to be sparse, with some null vectors β^j.
The penalty term of (D.1a) can be conveniently presented as λ B^⊤ΩB, where

\[
\Omega = \mathrm{diag}\left( \frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \dots, \frac{w_p^2}{\tau_p} \right) . \tag{D.5}
\]

Using the value of τ_j^* from (D.3), each diagonal component of Ω is

\[
(\Omega)_{jj} = \frac{w_j \sum_{j'=1}^{p} w_{j'} \| \beta^{j'} \|_2}{\| \beta^j \|_2} . \tag{D.6}
\]
In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If J is convex, Problem (D.1) is convex.

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.
Lemma D.2. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (D.4) is

\[
\left\{ V \in \mathbb{R}^{p \times (K-1)} : \; V = \frac{\partial J(B)}{\partial B}
+ 2 \lambda \left( \sum_{j=1}^{p} w_j \| \beta^j \|_2 \right) G \right\} , \tag{D.7}
\]

where G = (g^{1\top}, …, g^{p\top})^⊤ is a p × (K−1) matrix defined as follows. Let S(B) denote the row-wise support of B, S(B) = { j ∈ {1, …, p} : ‖β^j‖_2 ≠ 0 }; then we have

\[
\forall j \in S(B) , \quad g^j = w_j \| \beta^j \|_2^{-1} \beta^j , \tag{D.8}
\]
\[
\forall j \notin S(B) , \quad \| g^j \|_2 \le w_j . \tag{D.9}
\]
This condition results in an equality for the "active" non-zero vectors β^j, and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if J(B) is strictly convex. All critical points B^* of the objective function verifying the following conditions are global minima. Let S(B^*) denote the row-wise support of B^*, S(B^*) = { j ∈ {1, …, p} : ‖β^{*j}‖_2 ≠ 0 }, and let S̄(B^*) be its complement; then we have

\[
\forall j \in S(B^*) , \quad -\frac{\partial J(B^*)}{\partial \beta^j}
= 2 \lambda \left( \sum_{j'=1}^{p} w_{j'} \| \beta^{*j'} \|_2 \right) w_j \| \beta^{*j} \|_2^{-1} \beta^{*j} , \tag{D.10a}
\]
\[
\forall j \in \bar{S}(B^*) , \quad \left\| \frac{\partial J(B^*)}{\partial \beta^j} \right\|_2
\le 2 \lambda w_j \left( \sum_{j'=1}^{p} w_{j'} \| \beta^{*j'} \|_2 \right) . \tag{D.10b}
\]
In particular Lemma D3 provides a well-defined appraisal of the support of thesolution which is not easily handled from the direct analysis of the variational problem(D1)
D.2 An Upper Bound on the Objective Function
Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and for a given B the gap between these objectives is null at τ^* such that

\[
\tau_j^* = \frac{w_j \| \beta^j \|_2}{\sum_{j'=1}^{p} w_{j'} \| \beta^{j'} \|_2} .
\]
Proof. The objective functions of the variational form (D.1) and of (D.4) only differ in their second term. Let τ ∈ R^p be any feasible vector; we have

\[
\left( \sum_{j=1}^{p} w_j \| \beta^j \|_2 \right)^{\!2}
= \left( \sum_{j=1}^{p} \tau_j^{1/2} \, \frac{w_j \| \beta^j \|_2}{\tau_j^{1/2}} \right)^{\!2}
\le \left( \sum_{j=1}^{p} \tau_j \right) \left( \sum_{j=1}^{p} w_j^2 \frac{\| \beta^j \|_2^2}{\tau_j} \right)
\le \sum_{j=1}^{p} w_j^2 \frac{\| \beta^j \|_2^2}{\tau_j} ,
\]

where we used the Cauchy–Schwarz inequality in the second step, and the definition of the feasibility set of τ in the last one.
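The proof's inequality can be checked numerically (illustrative Python, ours; random feasible τ are drawn from a Dirichlet distribution, which is nonnegative and sums to one):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 8
norms = np.abs(rng.normal(size=p)) + 1e-3   # stands for w_j ||beta^j||_2
penalty = norms.sum() ** 2                   # squared group-Lasso penalty of (D.4)

# the variational objective dominates the penalty for every feasible tau ...
for _ in range(1000):
    tau = rng.dirichlet(np.ones(p))
    assert penalty <= np.sum(norms ** 2 / tau) + 1e-9

# ... and the gap closes at tau* from Lemma D.4
tau_star = norms / norms.sum()
assert np.isclose(np.sum(norms ** 2 / tau_star), penalty)
```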
This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of τ and β are intertwined.
E Invariance of the Group-Lasso to UnitaryTransformations
The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients B^0 are optimal for the score values Θ^0, and if the optimal scores Θ^* are obtained by a unitary transformation of Θ^0, say Θ^* = Θ^0 V (where V ∈ R^{M×M} is a unitary matrix), then B^* = B^0 V is optimal conditionally on Θ^*; that is, (Θ^*, B^*) is a global solution of the optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.
Proposition E.1. Let \(\hat{B}\) be a solution of

\[
\min_{B \in \mathbb{R}^{p \times M}} \; \| Y - X B \|_F^2 + \lambda \sum_{j=1}^{p} w_j \| \beta^j \|_2 , \tag{E.1}
\]

and let \(\tilde{Y} = Y V\), where V ∈ R^{M×M} is a unitary matrix. Then \(\tilde{B} = \hat{B} V\) is a solution of

\[
\min_{B \in \mathbb{R}^{p \times M}} \; \| \tilde{Y} - X B \|_F^2 + \lambda \sum_{j=1}^{p} w_j \| \beta^j \|_2 . \tag{E.2}
\]
Proof. The first-order necessary optimality conditions for \(\hat{B}\) are

\[
\forall j \in S(\hat{B}) , \quad 2 x^{j\top} \big( x^j \hat{\beta}^j - Y \big)
+ \lambda w_j \| \hat{\beta}^j \|_2^{-1} \hat{\beta}^j = 0 , \tag{E.3a}
\]
\[
\forall j \in \bar{S}(\hat{B}) , \quad 2 \big\| x^{j\top} \big( x^j \hat{\beta}^j - Y \big) \big\|_2 \le \lambda w_j , \tag{E.3b}
\]

where \(S(\hat{B}) \subseteq \{1, \dots, p\}\) denotes the set of non-zero row vectors of \(\hat{B}\), and \(\bar{S}(\hat{B})\) is its complement.

First, we note that, from the definition of \(\tilde{B}\), we have \(S(\tilde{B}) = S(\hat{B})\). Then we may rewrite the above conditions as follows:

\[
\forall j \in S(\tilde{B}) , \quad 2 x^{j\top} \big( x^j \tilde{\beta}^j - \tilde{Y} \big)
+ \lambda w_j \| \tilde{\beta}^j \|_2^{-1} \tilde{\beta}^j = 0 , \tag{E.4a}
\]
\[
\forall j \in \bar{S}(\tilde{B}) , \quad 2 \big\| x^{j\top} \big( x^j \tilde{\beta}^j - \tilde{Y} \big) \big\|_2 \le \lambda w_j , \tag{E.4b}
\]

where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by V, also using that \(V V^\top = I\), so that for all \(u \in \mathbb{R}^M\), \(\| u^\top \|_2 = \| u^\top V \|_2\); Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for \(\tilde{B}\) to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
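The invariance can be checked directly on the two terms of the group-Lasso objective (illustrative Python, ours; a random orthogonal matrix stands in for the unitary V):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, M = 20, 5, 3
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, M))
B = rng.normal(size=(p, M))

V, _ = np.linalg.qr(rng.normal(size=(M, M)))   # random orthogonal (unitary) matrix
Yt, Bt = Y @ V, B @ V

# both terms of the objective (E.1) are invariant under the transformation,
# hence the transformed coefficients solve the transformed problem
assert np.isclose(np.linalg.norm(Y - X @ B), np.linalg.norm(Yt - X @ Bt))
assert np.allclose(np.linalg.norm(B, axis=1), np.linalg.norm(Bt, axis=1))
```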
F Expected Complete Likelihood andLikelihood
Section 7.1.2 explains that, by maximizing the conditional expectation of the complete log-likelihood Q(θ, θ′) (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed using its definition (7.1), but there is a shorter way to compute it from Q(θ, θ′) when the latter is available.
\[
L(\theta) = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k) \right) , \tag{F.1}
\]
\[
Q(\theta, \theta') = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}(\theta') \log \left( \pi_k f_k(x_i; \theta_k) \right) , \tag{F.2}
\]
\[
\text{with} \quad t_{ik}(\theta') = \frac{\pi'_k f_k(x_i; \theta'_k)}{\sum_{\ell} \pi'_\ell f_\ell(x_i; \theta'_\ell)} . \tag{F.3}
\]
In the EM algorithm, θ′ denotes the model parameters at the previous iteration, t_ik(θ′) are the posterior probabilities computed from θ′ at the previous E-step, and θ (without "prime") denotes the parameters of the current iteration, to be obtained by maximizing Q(θ, θ′).
Using (F.3), we have

\[
\begin{aligned}
Q(\theta, \theta') &= \sum_{i,k} t_{ik}(\theta') \log \left( \pi_k f_k(x_i; \theta_k) \right) \\
&= \sum_{i,k} t_{ik}(\theta') \log(t_{ik}(\theta)) + \sum_{i,k} t_{ik}(\theta') \log \left( \sum_{\ell} \pi_\ell f_\ell(x_i; \theta_\ell) \right) \\
&= \sum_{i,k} t_{ik}(\theta') \log(t_{ik}(\theta)) + L(\theta) .
\end{aligned}
\]
In particular, after the evaluation of t_ik in the E-step, where θ = θ′, the log-likelihood can be computed using the value of Q(θ, θ) (7.7) and the entropy of the posterior probabilities:
\[
\begin{aligned}
L(\theta) &= Q(\theta, \theta) - \sum_{i,k} t_{ik}(\theta) \log(t_{ik}(\theta)) \\
&= Q(\theta, \theta) + H(T) .
\end{aligned}
\]
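This shortcut is easy to verify on a toy univariate Gaussian mixture (illustrative Python; the parameter values and data are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 50, 3
x = np.concatenate([rng.normal(3 * k, 1.0, n) for k in range(K)])
pi = np.ones(K) / K
mu, sigma = np.array([0.0, 2.5, 6.0]), 1.0

# component densities and posterior probabilities t_ik (E-step, as in (F.3))
dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
t = pi * dens
t /= t.sum(axis=1, keepdims=True)

L_direct = np.sum(np.log(np.sum(pi * dens, axis=1)))   # definition (F.1)
Q = np.sum(t * np.log(pi * dens))                      # (F.2) at theta = theta'
H = -np.sum(t * np.log(t))                             # entropy of the posteriors
assert np.isclose(L_direct, Q + H)                     # L = Q + H(T)
```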
G Derivation of the M-Step Equations
This appendix details the derivation of expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion to be maximized is
\[
\begin{aligned}
Q(\theta, \theta') &= \sum_{i,k} t_{ik}(\theta') \log(\pi_k f_k(x_i; \theta_k)) \\
&= \sum_{k} \log(\pi_k) \sum_{i} t_{ik} - \frac{np}{2} \log(2\pi) - \frac{n}{2} \log|\Sigma|
- \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) ,
\end{aligned}
\]

which has to be maximized subject to \( \sum_k \pi_k = 1 \).
The Lagrangian of this problem is

\[
L(\theta) = Q(\theta, \theta') + \lambda \left( \sum_{k} \pi_k - 1 \right) .
\]

The partial derivatives of the Lagrangian are set to zero to obtain the optimal values of π_k, μ_k and Σ.
G.1 Prior probabilities

\[
\frac{\partial L(\theta)}{\partial \pi_k} = 0 \;\Leftrightarrow\; \frac{1}{\pi_k} \sum_{i} t_{ik} + \lambda = 0 ,
\]

where λ is identified from the constraint (λ = −n), leading to

\[
\pi_k = \frac{1}{n} \sum_{i} t_{ik} .
\]
G.2 Means

\[
\frac{\partial L(\theta)}{\partial \mu_k} = 0 \;\Leftrightarrow\; -\frac{1}{2} \sum_{i} t_{ik} \, 2 \Sigma^{-1} (\mu_k - x_i) = 0
\;\Rightarrow\; \mu_k = \frac{\sum_i t_{ik} x_i}{\sum_i t_{ik}} .
\]
G.3 Covariance Matrix

\[
\frac{\partial L(\theta)}{\partial \Sigma^{-1}} = 0 \;\Leftrightarrow\;
\underbrace{\frac{n}{2} \Sigma}_{\text{as per Property 4}}
- \underbrace{\frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top}_{\text{as per Property 5}} = 0
\;\Rightarrow\; \Sigma = \frac{1}{n} \sum_{i,k} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top .
\]
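The three M-step updates can be sketched as follows (illustrative Python, ours; the responsibilities t_ik are simulated rather than produced by a real E-step):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 100, 2, 3
X = rng.normal(size=(n, p))
t = rng.dirichlet(np.ones(K), size=n)    # posterior probabilities t_ik, rows sum to 1

# M-step: priors, means, and common covariance matrix (Sections G.1-G.3)
pi = t.sum(axis=0) / n
mu = (t.T @ X) / t.sum(axis=0)[:, None]
Sigma = np.zeros((p, p))
for k in range(K):
    D = X - mu[k]
    Sigma += (t[:, k, None] * D).T @ D
Sigma /= n

assert np.isclose(pi.sum(), 1.0)         # priors form a probability vector
assert np.allclose(Sigma, Sigma.T)       # common covariance is symmetric
```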
Bibliography
F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. Optimization for Machine Learning, pages 19–54, 2011.
F. R. Bach. Bolasso: model consistent lasso estimation through the bootstrap. In Proceedings of the 25th International Conference on Machine Learning, ICML, 2008.
F. R. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.
J. D. Banfield and A. E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, pages 803–821, 1993.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
H. Bensmail and G. Celeux. Regularized Gaussian discriminant analysis through eigenvalue decomposition. Journal of the American Statistical Association, 91(436):1743–1748, 1996.
P. J. Bickel and E. Levina. Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989–1010, 2004.
C. Biernacki, G. Celeux, G. Govaert, and F. Langrognet. MIXMOD statistical documentation. http://www.mixmod.org, 2008.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.
C. Bouveyron and C. Brunet. Discriminative variable selection for clustering with the sparse Fisher-EM algorithm. Technical Report 1204.2067, arXiv e-prints, 2012a.
C. Bouveyron and C. Brunet. Simultaneous model-based clustering and visualization in the Fisher discriminative subspace. Statistics and Computing, 22(1):301–324, 2012b.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37(4):373–384, 1995.
L. Breiman and R. Ihaka. Nonlinear discriminant analysis via ACE and scaling. Technical Report 40, University of California, Berkeley, 1984.
T. Cai and W. Liu. A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106(496):1566–1577, 2011.
S. Canu and Y. Grandvalet. Outcomes of the equivalence of adaptive ridge with least absolute shrinkage. Advances in Neural Information Processing Systems, page 445, 1999.
C. Caramanis, S. Mannor, and H. Xu. Robust optimization in machine learning. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 369–402. MIT Press, 2012.
B. Chidlovskii and L. Lecerf. Scalable feature selection for multi-class problems. In W. Daelemans, B. Goethals, and K. Morik, editors, Machine Learning and Knowledge Discovery in Databases, volume 5211 of Lecture Notes in Computer Science, pages 227–240. Springer, 2008.
L. Clemmensen, T. Hastie, D. Witten, and B. Ersbøll. Sparse discriminant analysis. Technometrics, 53(4):406–413, 2011.
C. De Mol, E. De Vito, and L. Rosasco. Elastic-net regularization in learning theory. Journal of Complexity, 25(2):201–230, 2009.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977. ISSN 0035-9246.
D. L. Donoho, M. Elad, and V. N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6–18, 2006.
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, 2000.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
J. Fan and Y. Fan. High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6):2605, 2008.
R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Human Genetics, 7(2):179–188, 1936.
V. Franc and S. Sonnenburg. Optimized cutting plane algorithm for support vector machines. In Proceedings of the 25th International Conference on Machine Learning, pages 320–327. ACM, 2008.
J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.
J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. Technical Report 1001.0736, arXiv e-prints, 2010.
J. H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165–175, 1989.
W. J. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.
A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 2003.
D. Ghosh and A. M. Chinnaiyan. Classification and selection of biomarkers in genomic data using lasso. Journal of Biomedicine and Biotechnology, 2:147–154, 2005.
G. Govaert, Y. Grandvalet, X. Liu, and L. F. Sanchez Merchante. Implementation baseline for clustering. Technical Report D7.1-m12, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D71-m12.pdf, 2010.
G. Govaert, Y. Grandvalet, B. Laval, X. Liu, and L. F. Sanchez Merchante. Implementations of original clustering. Technical Report D7.2-m24, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D72-m24.pdf, 2011.
Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalization. In Perspectives in Neural Computing, volume 98, pages 201–206, 1998.
Y. Grandvalet and S. Canu. Adaptive scaling for feature selection in SVMs. Advances in Neural Information Processing Systems, 15:553–560, 2002.
L. Grosenick, S. Greer, and B. Knutson. Interpretable classifiers for fMRI improve prediction of purchases. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 16(6):539–548, 2008.
Y. Guermeur, G. Pollastri, A. Elisseeff, D. Zelus, H. Paugam-Moisy, and P. Baldi. Combining protein secondary structure prediction models with ensemble methods of optimal complexity. Neurocomputing, 56:305–327, 2004.
J. Guo, E. Levina, G. Michailidis, and J. Zhu. Pairwise variable selection for high-dimensional model-based clustering. Biometrics, 66(3):793–804, 2010.
I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.
T. Hastie and R. Tibshirani. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):155–176, 1996.
T. Hastie, R. Tibshirani, and A. Buja. Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association, 89(428):1255–1270, 1994.
T. Hastie, A. Buja, and R. Tibshirani. Penalized discriminant analysis. The Annals of Statistics, 23(1):73–102, 1995.
A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
J. Huang, S. Ma, H. Xie, and C. H. Zhang. A group bridge approach for variable selection. Biometrika, 96(2):339–355, 2009.
T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217–226. ACM, 2006.
K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics, 28(5):1356–1378, 2000.
P. F. Kuan, S. Wang, X. Zhou, and H. Chu. A statistical framework for Illumina DNA methylation arrays. Bioinformatics, 26(22):2849–2855, 2010.
T. Lange, M. Braun, V. Roth, and J. Buhmann. Stability-based model selection. Advances in Neural Information Processing Systems, 15:617–624, 2002.
M. H. C. Law, M. A. T. Figueiredo, and A. K. Jain. Simultaneous feature selection and clustering using mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1154–1166, 2004.
Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines. Journal of the American Statistical Association, 99(465):67–81, 2004.
C. Leng. Sparse optimal scoring for multiclass cancer diagnosis and biomarker detection using microarray data. Computational Biology and Chemistry, 32(6):417–425, 2008.
C. Leng, Y. Lin, and G. Wahba. A note on the lasso and related procedures in model selection. Statistica Sinica, 16(4):1273, 2006.
H. Liu and L. Yu. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4):491–502, 2005.
J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.
Q. Mai, H. Zou, and M. Yuan. A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99(1):29–42, 2012.
C. Maugis, G. Celeux, and M. L. Martin-Magniette. Variable selection for clustering with Gaussian mixture models. Biometrics, 65(3):701–709, 2009a.
C. Maugis, G. Celeux, and M. L. Martin-Magniette. SelvarClust: software for variable selection in model-based clustering. http://www.math.univ-toulouse.fr/~maugis/SelvarClustHomepage.html, 2009b.
L. Meier, S. Van De Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 70(1):53–71, 2008.
N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.
B. Moghaddam, Y. Weiss, and S. Avidan. Generalized spectral bounds for sparse LDA. In Proceedings of the 23rd International Conference on Machine Learning, pages 641–648. ACM, 2006.
B. Moghaddam, Y. Weiss, and S. Avidan. Fast pixel/part selection with sparse eigenvectors. In IEEE 11th International Conference on Computer Vision (ICCV 2007), pages 1–8, 2007.
Y. Nesterov. Gradient methods for minimizing composite functions. Preprint, 2007.
S. Newcomb. A generalized theory of the combination of observations so as to obtain the best result. American Journal of Mathematics, 8(4):343–366, 1886.
B. Ng and R. Abugharbieh. Generalized group sparse classifiers with application in fMRI brain decoding. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1065–1071. IEEE, 2011.
M. R. Osborne, B. Presnell, and B. A. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319–337, 2000a.
M. R. Osborne, B. Presnell, and B. A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403, 2000b.
W. Pan and X. Shen. Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research, 8:1145–1164, 2007.
W. Pan, X. Shen, A. Jiang, and R. P. Hebbel. Semi-supervised learning via penalized mixture model with application to microarray sample classification. Bioinformatics, 22(19):2388–2395, 2006.
K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London, 185:71–110, 1894.
S. Perkins, K. Lacker, and J. Theiler. Grafting: fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3:1333–1356, 2003.
Z. Qiao, L. Zhou, and J. Huang. Sparse linear discriminant analysis with applications to high dimensional low sample size data. International Journal of Applied Mathematics, 39(1), 2009.
A. E. Raftery and N. Dean. Variable selection for model-based clustering. Journal of the American Statistical Association, 101(473):168–178, 2006.
C. R. Rao. The utilization of multiple measurements in problems of biological classification. Journal of the Royal Statistical Society, Series B (Methodological), 10(2):159–203, 1948.
S. Rosset and J. Zhu. Piecewise linear regularized solution paths. The Annals of Statistics, 35(3):1012–1030, 2007.
V. Roth. The generalized lasso. IEEE Transactions on Neural Networks, 15(1):16–28, 2004.
V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), volume 307 of ACM International Conference Proceeding Series, pages 848–855, 2008.
V. Roth and T. Lange. Feature selection in clustering problems. In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 473–480. MIT Press, 2004.
C. Sammut and G. I. Webb. Encyclopedia of Machine Learning. Springer-Verlag New York Inc., 2010.
L. F. Sanchez Merchante, Y. Grandvalet, and G. Govaert. An efficient approach to sparse linear discriminant analysis. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.
G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
A. J. Smola, S. V. N. Vishwanathan, and Q. Le. Bundle methods for machine learning. Advances in Neural Information Processing Systems, 20:1377–1384, 2008.
S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.
P. Sprechmann, I. Ramirez, G. Sapiro, and Y. Eldar. Collaborative hierarchical sparse modeling. In Information Sciences and Systems (CISS), 2010 44th Annual Conference on, pages 1–6. IEEE, 2010.
M. Szafranski. Pénalités Hiérarchiques pour l'Intégration de Connaissances dans les Modèles Statistiques. PhD thesis, Université de Technologie de Compiègne, 2008.
M. Szafranski, Y. Grandvalet, and P. Morizet-Mahoudeaux. Hierarchical penalization. Advances in Neural Information Processing Systems, 2008.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.
J. E. Vogt and V. Roth. The group-lasso: ℓ1,∞ regularization versus ℓ1,2 regularization. In Pattern Recognition, 32nd DAGM Symposium, Lecture Notes in Computer Science, 2010.
S. Wang and J. Zhu. Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics, 64(2):440–448, 2008.
D. Witten and R. Tibshirani. Penalized classification using Fisher's linear discriminant. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 73(5):753–772, 2011.
D. M. Witten and R. Tibshirani. A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490):713–726, 2010.
D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.
M. Wu and B. Schölkopf. A local learning approach for clustering. Advances in Neural Information Processing Systems, 19:1529, 2007.
M. C. Wu, L. Zhang, Z. Wang, D. C. Christiani, and X. Lin. Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection. Bioinformatics, 25(9):1145–1151, 2009.
T. T. Wu and K. Lange. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, pages 224–244, 2008.
B. Xie, W. Pan, and X. Shen. Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. Electronic Journal of Statistics, 2:168–172, 2008a.
B. Xie, W. Pan, and X. Shen. Variable selection in penalized model-based clustering via regularization on grouped parameters. Biometrics, 64(3):921–930, 2008b.
C. Yang, X. Wan, Q. Yang, H. Xue, and W. Yu. Identifying main effects and epistatic interactions from large-scale SNP data via adaptive group lasso. BMC Bioinformatics, 11(Suppl 1):S18, 2010.
J. Ye. Least squares linear discriminant analysis. In Proceedings of the 24th International Conference on Machine Learning, pages 1087–1093. ACM, 2007.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 68(1):49–67, 2006.
P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7(2):2541, 2007.
P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37(6A):3468–3497, 2009.
H. Zhou, W. Pan, and X. Shen. Penalized model-based clustering with unconstrained covariance matrices. Electronic Journal of Statistics, 3:1473–1496, 2009.
H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.
H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67(2):301–320, 2005.
Obviously this does not mean that I have nothing to be grateful for. I always felt unconditional love and support from my family, and I never felt homesick since my Spanish friends did the best they could to visit me frequently. During my time in Compiègne I met wonderful people that are now friends for life. I am sure that all these people do not need to be listed in this section to know how much I love them. I thank them every time we see each other by giving them the best of myself.

I enjoyed my time in Compiègne. It was an exciting adventure and I do not regret a single thing. I am sure that I will miss these days, but this does not make me sad because, as the Beatles sang in "The End", or Jorge Drexler in "Todo se transforma", the amount that you miss people is equal to the love you gave them and received from them.

The only names I am including are my supervisors', Yves Grandvalet and Gérard Govaert. I do not think it is possible to have had better teaching and supervision, and I am sure that the reason I finished this work was not only their technical advice, but also their close support, humanity and patience.
Contents
List of figures v
List of tables vii
Notation and Symbols ix
I Context and Foundations 1
1 Context 5
2 Regularization for Feature Selection 9
2.1 Motivations 9
2.2 Categorization of Feature Selection Techniques 11
2.3 Regularization 13
2.3.1 Important Properties 14
2.3.2 Pure Penalties 14
2.3.3 Hybrid Penalties 18
2.3.4 Mixed Penalties 19
2.3.5 Sparsity Considerations 19
2.3.6 Optimization Tools for Regularized Problems 21

II Sparse Linear Discriminant Analysis 25

Abstract 27

3 Feature Selection in Fisher Discriminant Analysis 29
3.1 Fisher Discriminant Analysis 29
3.2 Feature Selection in LDA Problems 30
3.2.1 Inertia Based 30
3.2.2 Regression Based 32

4 Formalizing the Objective 35
4.1 From Optimal Scoring to Linear Discriminant Analysis 35
4.1.1 Penalized Optimal Scoring Problem 36
4.1.2 Penalized Canonical Correlation Analysis 37
4.1.3 Penalized Linear Discriminant Analysis 39
4.1.4 Summary 40
4.2 Practicalities 41
4.2.1 Solution of the Penalized Optimal Scoring Regression 41
4.2.2 Distance Evaluation 42
4.2.3 Posterior Probability Evaluation 43
4.2.4 Graphical Representation 43
4.3 From Sparse Optimal Scoring to Sparse LDA 43
4.3.1 A Quadratic Variational Form 44
4.3.2 Group-Lasso OS as Penalized LDA 47

5 GLOSS Algorithm 49
5.1 Regression Coefficients Updates 49
5.1.1 Cholesky decomposition 52
5.1.2 Numerical Stability 52
5.2 Score Matrix 52
5.3 Optimality Conditions 53
5.4 Active and Inactive Sets 54
5.5 Penalty Parameter 54
5.6 Options and Variants 55
5.6.1 Scaling Variables 55
5.6.2 Sparse Variant 55
5.6.3 Diagonal Variant 55
5.6.4 Elastic net and Structured Variant 55

6 Experimental Results 57
6.1 Normalization 57
6.2 Decision Thresholds 57
6.3 Simulated Data 58
6.4 Gene Expression Data 60
6.5 Correlated Data 63
Discussion 63
III Sparse Clustering Analysis 67
Abstract 69
7 Feature Selection in Mixture Models 71
7.1 Mixture Models 71
7.1.1 Model 71
7.1.2 Parameter Estimation: The EM Algorithm 72
7.2 Feature Selection in Model-Based Clustering 75
7.2.1 Based on Penalized Likelihood 76
7.2.2 Based on Model Variants 77
7.2.3 Based on Model Selection 79

8 Theoretical Foundations 81
8.1 Resolving EM with Optimal Scoring 81
8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis 81
8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis 82
8.1.3 Clustering Using Penalized Optimal Scoring 82
8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis 83
8.2 Optimized Criterion 83
8.2.1 A Bayesian Derivation 84
8.2.2 Maximum a Posteriori Estimator 85

9 Mix-GLOSS Algorithm 87
9.1 Mix-GLOSS 87
9.1.1 Outer Loop: Whole Algorithm Repetitions 87
9.1.2 Penalty Parameter Loop 88
9.1.3 Inner Loop: EM Algorithm 89
9.2 Model Selection 91

10 Experimental Results 93
10.1 Tested Clustering Algorithms 93
10.2 Results 95
10.3 Discussion 97
Conclusions 97
Appendix 103
A Matrix Properties 105
B The Penalized-OS Problem is an Eigenvector Problem 107
B.1 How to Solve the Eigenvector Decomposition 107
B.2 Why the OS Problem is Solved as an Eigenvector Problem 109

C Solving Fisher's Discriminant Problem 111

D Alternative Variational Formulation for the Group-Lasso 113
D.1 Useful Properties 114
D.2 An Upper Bound on the Objective Function 115
E Invariance of the Group-Lasso to Unitary Transformations 117
F Expected Complete Likelihood and Likelihood 119
G Derivation of the M-Step Equations 121
G.1 Prior probabilities 121
G.2 Means 122
G.3 Covariance Matrix 122
Bibliography 123
List of Figures
1.1 MASH project logo 5

2.1 Example of relevant features 10
2.2 Four key steps of feature selection 11
2.3 Admissible sets in two dimensions for different pure norms ||β||p 14
2.4 Two-dimensional regularized problems with ||β||1 and ||β||2 penalties 15
2.5 Admissible sets for the Lasso and Group-Lasso 20
2.6 Sparsity patterns for an example with 8 variables characterized by 4 parameters 20

4.1 Graphical representation of the variational approach to Group-Lasso 45

5.1 GLOSS block diagram 50
5.2 Graph and Laplacian matrix for a 3×3 image 56

6.1 TPR versus FPR for all simulations 60
6.2 2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA 62
6.3 USPS digits "1" and "0" 63
6.4 Discriminant direction between digits "1" and "0" 64
6.5 Sparse discriminant direction between digits "1" and "0" 64

9.1 Mix-GLOSS loops scheme 88
9.2 Mix-GLOSS model selection diagram 92

10.1 Class mean vectors for each artificial simulation 94
10.2 TPR versus FPR for all simulations 97
List of Tables
6.1 Experimental results for simulated data, supervised classification 59
6.2 Average TPR and FPR for all simulations 60
6.3 Experimental results for gene expression data, supervised classification 61

10.1 Experimental results for simulated data, unsupervised clustering 96
10.2 Average TPR versus FPR for all clustering simulations 96
Notation and Symbols
Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors, and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.
Sets

N the set of natural numbers, N = {1, 2, . . .}
R the set of reals
|A| cardinality of a set A (for finite sets, the number of elements)
Ā complement of set A

Data

X input domain
xi input sample, xi ∈ X
X design matrix, X = (x1⊤, . . . , xn⊤)⊤
xj column j of X
yi class indicator of sample i
Y indicator matrix, Y = (y1⊤, . . . , yn⊤)⊤
z complete data, z = (x, y)
Gk set of the indices of observations belonging to class k
n number of examples
K number of classes
p dimension of X
i, j, k indices running over N
Vectors, Matrices and Norms

0 vector with all entries equal to zero
1 vector with all entries equal to one
I identity matrix
A⊤ transpose of matrix A (ditto for vectors)
A−1 inverse of matrix A
tr(A) trace of matrix A
|A| determinant of matrix A
diag(v) diagonal matrix with v on the diagonal
||v||1 L1 norm of vector v
||v||2 L2 norm of vector v
||A||F Frobenius norm of matrix A
Probability

E[·] expectation of a random variable
var[·] variance of a random variable
N(μ, σ²) normal distribution with mean μ and variance σ²
W(W, ν) Wishart distribution with ν degrees of freedom and scale matrix W
H(X) entropy of random variable X
I(X, Y) mutual information between random variables X and Y

Mixture Models

yik hard membership of sample i to cluster k
fk distribution function for cluster k
tik posterior probability of sample i to belong to cluster k
T posterior probability matrix
πk prior probability or mixture proportion for cluster k
μk mean vector of cluster k
Σk covariance matrix of cluster k
θk parameter vector for cluster k, θk = (μk, Σk)
θ(t) parameter vector at iteration t of the EM algorithm
f(X; θ) likelihood function
L(θ; X) log-likelihood function
LC(θ; X, Y) complete log-likelihood function
Optimization

J(·) cost function
L(·) Lagrangian
β̂ generic notation for the solution with respect to β
β̂ls least squares solution coefficient vector
A active set
γ step size to update the regularization path
h direction to update the regularization path
Penalized models

λ, λ1, λ2 penalty parameters
Pλ(θ) penalty term over a generic parameter vector
βkj coefficient j of discriminant vector k
βk kth discriminant vector, βk = (βk1, . . . , βkp)
B matrix of discriminant vectors, B = (β1, . . . , βK−1)
βj jth row of B, B = (β1⊤, . . . , βp⊤)⊤
BLDA coefficient matrix in the LDA domain
BCCA coefficient matrix in the CCA domain
BOS coefficient matrix in the OS domain
XLDA data matrix in the LDA domain
XCCA data matrix in the CCA domain
XOS data matrix in the OS domain
θk score vector k
Θ score matrix, Θ = (θ1, . . . , θK−1)
Y label matrix
Ω penalty matrix
LCP(θ; X, Z) penalized complete log-likelihood function
ΣB between-class covariance matrix
ΣW within-class covariance matrix
ΣT total covariance matrix
Σ̂B sample between-class covariance matrix
Σ̂W sample within-class covariance matrix
Σ̂T sample total covariance matrix
Λ inverse of covariance matrix, or precision matrix
wj weights
τj penalty components of the variational approach
Part I
Context and Foundations
This thesis is divided into three parts. In Part I, I introduce the context in which this work has been developed: the project that funded it and the constraints that we had to obey. Generic concepts are also detailed here to introduce the models and some basic notions that will be used throughout this document. The state of the art is also reviewed.

The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments that test its performance against other state-of-the-art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.

The second contribution is described in Part III, with a structure analogous to that of Part II, but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM algorithm. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.
1 Context
The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.
The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiègne.

From the point of view of the research, the members of the consortium must deal with four main goals:

1. Software development of the website framework and APIs

2. Classification and goal-planning in high dimensional feature spaces

3. Interfacing the platform with the 3D virtual environment and the robot arm

4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments
Figure 1.1: MASH project logo
The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role has been to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community. The last number I was aware of was about 300. Within those 375 extractors, there must be some that share the same theoretical principles or supply similar features. The framework of the project tests every new piece of code on some reference datasets in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors on a particular dataset does not mean that both are using the same variables.

Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we would also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user who develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of against the whole database. As another example, imagine a user wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.

As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.
• Clustering Using Mixture Models. This is a well-known technique that models the data as if it were randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm, and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, a C++ library that can be interfaced with MATLAB. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D7.1-m12" (Govaert et al., 2010).
• Sparse Clustering Using Penalized Optimal Scoring. This technique again intends to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis.
All details concerning the tool implemented can be found in deliverable "mash-deliverable-D7.2-m24" (Govaert et al., 2011).
• Table Clustering Using The RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractor space is defined using the RV coefficient, which is a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(Oi, Oj), where Oi and Oj are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors. A detailed description of this technique can be found in deliverables "mash-deliverable-D7.1-m12" (Govaert et al., 2010) and "mash-deliverable-D7.2-m24" (Govaert et al., 2011).
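The RV coefficient mentioned above can be computed as a normalized inner product between the n × n configuration operators of two tables. The following numpy sketch illustrates the idea on toy data; the tables, the centering step and the linear-map construction are illustrative stand-ins, not the platform's actual operators:

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two column-centered tables sharing the same rows.

    X: (n, p) array, Y: (n, q) array. Returns a similarity in [0, 1];
    1 - RV can then serve as a distance between the two tables.
    """
    Sx = X @ X.T  # (n, n) configuration operator of table X
    Sy = Y @ Y.T  # (n, n) configuration operator of table Y
    num = np.trace(Sx @ Sy)
    den = np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))
    return num / den

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
A -= A.mean(axis=0)
B = A @ rng.normal(size=(5, 3))  # B is a linear transform of A: similar tables
B -= B.mean(axis=0)
C = rng.normal(size=(20, 4))     # C is unrelated to A
C -= C.mean(axis=0)

print(rv_coefficient(A, A))                           # equals 1 up to rounding
print(rv_coefficient(A, B) > rv_coefficient(A, C))    # related tables score higher
```

Feeding the resulting 1 − RV distance matrix to any standard clustering method then groups the extractors, as the deliverables describe.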
I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to fulfil our engagements. I simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).
2 Regularization for Feature Selection
With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues also rose. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined. Many algorithms in the field of Machine Learning make use of this statistic.
2.1 Motivations
There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than a "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).

As a rule of thumb, in discriminant and clustering problems the complexity of calculus increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in supervised environments, and easier interpretations in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

• Reduction by feature transformation summarizes the dataset in fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis and Independent Component Analysis are two popular examples.
• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction on the number of variables to preserve, and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008)
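To illustrate how an ℓ1 penalty performs feature selection, the sketch below solves a small Lasso problem by coordinate descent with soft-thresholding, in the spirit of Wu and Lange (2008). The data, penalty level and iteration count are arbitrary illustrative choices, not a reference implementation:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the absolute value: shrinks z toward 0 by t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for 1/(2n) ||y - X b||^2 + lam ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]      # partial residual excluding j
            rho = X[:, j] @ r_j / n               # univariate fit for coordinate j
            b[j] = soft_threshold(rho, lam) / col_sq[j]
    return b

rng = np.random.default_rng(1)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]                  # only 3 relevant features
y = X @ beta_true + 0.1 * rng.normal(size=n)

b = lasso_cd(X, y, lam=0.1)
print(np.round(b, 2))   # coefficients of the 7 irrelevant features are exactly zero
```

The soft-thresholding step sets small coefficients exactly to zero, which is the sparsity property that makes the Lasso a feature selection device rather than a mere shrinkage method.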
As a basic rule, we can use reduction techniques by feature transformation when the majority of the features are relevant and there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation of the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text.

"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.
Redundant features are those which provide distinguishing information but are cu-mulative to another feature or group of features that provide substantially the same dis-tinguishing information Using previous example consider illustrative ldquodietrdquo and ldquodo-mesticationrdquo features Dogs and cats both have similar carnivorous diets while squirrelsconsume nuts and so forth Thus the ldquodietrdquo feature can efficiently distinguish squirrelsfrom dogs and cats although it provides little information to distinguish between dogsand cats Dogs and cats are also both typically domesticated animals while squirrels arewild animals Thus the ldquodomesticationrdquo feature provides substantially the same infor-mation as the ldquodietrdquo feature namely distinguishing squirrels from dogs and cats but notdistinguishing between dogs and cats Thus the ldquodietrdquo and ldquodomesticationrdquo features arecumulative and one can identify one of these features as redundant so as to be filteredout However unlike irrelevant features care should be taken with redundant featuresto ensure that one retains enough of the redundant features to provide the relevant dis-tinguishing information In the foregoing example on may wish to filter out either the
Figure 2.2: The four key steps of feature selection according to Liu and Yu (2005).
"diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."
There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.
2.2 Categorization of Feature Selection Techniques
Feature selection is one of the most frequent preprocessing techniques to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there; thus, the relevance of the remaining subset of features must be measured.
I reproduce here the scheme that generalizes any feature selection process as it is shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps of a feature selection algorithm.
The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews that characterize feature selection techniques according to their properties. I propose here a framework inspired by these references; it does not cover all the possibilities, but it gives a good summary of the existing ones.
• Depending on the type of integration with the machine learning algorithm, we have:
– Filter Models - Filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance from the mining algorithm.
– Wrapper Models - Wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block, while
2 Regularization for Feature Selection
the feature subset evaluation is done in a different one. Therefore, the criterion to optimize and the criterion to evaluate may be different. Those algorithms are computationally expensive.
– Embedded Models - Embedded models perform variable selection inside the learning machine, the selection being made at the training step. This means that there is only one criterion: the optimization and the evaluation form a single block, and the features are selected to optimize this unique criterion, without needing re-evaluation in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal, because they are specific to the training process of a given mining algorithm.
• Depending on the feature searching technique:
– Complete - No subsets are missed from evaluation; involves combinatorial searches.
– Sequential - Features are added (forward searches) or removed (backward searches) one at a time.
– Random - The initial subset, or even subsequent subsets, are randomly chosen to escape local optima.
• Depending on the evaluation technique:
– Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.
– Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.
– Dependency Measures - Measuring the correlation between features.
– Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.
– Predictive Accuracy - Using the selected features to predict the labels.
– Cluster Goodness - Using the selected features to perform clustering and evaluating the result (cluster compactness, scatter separability, maximum likelihood).
The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness make it possible to evaluate subsets of features and can be used in wrapper and embedded models.
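As an illustration of the filter paradigm with a dependency measure, the toy sketch below (synthetic data and plain NumPy, not taken from the thesis) ranks features by their absolute correlation with a binary label; the informative feature comes out first.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)               # binary class labels
x_info = y + 0.1 * rng.normal(size=n)        # strongly tied to the label
x_noise1 = rng.normal(size=n)                # pure noise
x_noise2 = rng.normal(size=n)
X = np.column_stack([x_noise1, x_info, x_noise2])

# filter step: rank features by absolute correlation with the label
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
ranking = np.argsort(scores)[::-1]           # best-scoring feature first
```

The ranking is computed without any learning algorithm, which is precisely what makes filter models cheap but blind to feature interactions.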
In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized
goal and the process dedicated to achieving this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve exactly hard selection problems when the number of features exceeds a few tens. Regularization techniques provide a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.
2.3 Regularization
In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues on ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.
An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. This is the case, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do so, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:
$$\min_\beta \; J(\beta) + \lambda P(\beta) \qquad (2.1)$$

$$\min_\beta \; J(\beta) \quad \text{s.t.} \quad P(\beta) \le t \qquad (2.2)$$
In expressions (2.1) and (2.2), the parameters λ and t have a similar function, that is, to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set for which the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. The penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.
In this section I review pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.
Figure 2.3: Admissible sets in two dimensions for different pure norms ||β||_p.
2.3.1 Important Properties
Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.
Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

$$\forall (x_1, x_2) \in \mathcal{X}^2, \quad f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2) \qquad (2.3)$$

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.
Sparsity. Null coefficients usually furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and fewer computational resources.
Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes, such as adding, removing or replacing a few elements in the training set. Regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.
2.3.2 Pure Penalties
For pure penalties, defined as P(β) = ||β||_p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and of the algorithms to solve them. In
Figure 2.4: Two-dimensional regularized problems with ||β||_1 and ||β||_2 penalties.
this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since convexity of the penalty corresponds to convexity of the set, we see that this property is verified for p ≥ 1.
Regularizing a linear model with a norm like ||β||_p means that the larger the component |β_j|, the more important the feature x_j in the estimation. On the contrary, the closer to zero, the more dispensable it is. In the limit of |β_j| = 0, x_j is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.
A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered sparse if one of its components (β_1 or β_2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2), where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed-out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, such as the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than that of an L2 penalty. This idea is displayed in Figure 2.4, where J(β) is a quadratic function represented by three isolevel curves whose global minimum β_ls lies outside the penalties' admissible regions. The closest point to β_ls for the L1 regularization is β_l1, and for the L2 regularization it is β_l2. Solution β_l1 is sparse, because its second component is zero, while both components of β_l2 are different from zero.
After reviewing the regions of Figure 2.3, we can relate the capacity of generating sparse solutions to the number and the "sharpness" of the vertexes of the greyed-out area. For example, an L_{1/3} penalty has an admissible region with sharper vertexes that would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L_{1/3} ball results in difficulties during optimization that do not occur with a convex shape.
To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with Lp norms with p ≤ 1, due to the fact that they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex; hence the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.
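The contrast between the two convex choices can be seen on a one-dimensional toy problem. The sketch below (illustrative values, not from the thesis) minimizes a quadratic plus either penalty in closed form: the L1 penalty returns an exact zero, while the L2 penalty only shrinks the coefficient.

```python
import numpy as np

b, lam = 0.3, 0.5   # unpenalized optimum and penalty strength (illustrative values)

# L1: argmin_beta 0.5*(beta - b)**2 + lam*|beta|  ->  soft-thresholding
beta_l1 = float(np.sign(b) * max(abs(b) - lam, 0.0))

# L2: argmin_beta 0.5*(beta - b)**2 + lam*beta**2  ->  proportional shrinkage
beta_l2 = b / (1.0 + 2.0 * lam)
```

Because the L1 solution hits exactly zero whenever |b| ≤ λ, the corresponding variable is removed from the model; the L2 solution never vanishes for b ≠ 0.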
L0 Penalties. The L0 pseudo-norm of a vector β is defined as its number of entries different from zero, that is, P(β) = ||β||_0 = card{β_j | β_j ≠ 0}:
$$\min_\beta \; J(\beta) \quad \text{s.t.} \quad \|\beta\|_0 \le t \qquad (2.4)$$
where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ, if we use the equivalent expression (2.1)), the fewer zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes; the solutions are sparse but unstable.
L1 Penalties. Penalties built using the L1 norm induce sparsity and stability. The corresponding estimator has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):
$$\min_\beta \; J(\beta) \quad \text{s.t.} \quad \sum_{j=1}^p |\beta_j| \le t \qquad (2.5)$$
Despite all the advantages of the Lasso, the choice of the right penalty is not just a question of convexity and sparsity. Concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one of them, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.
The Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection in supervised classification (Mai et al. 2012, Witten and Tibshirani 2011) and clustering (Roth and Lange 2004, Pan et al. 2006, Pan and Shen 2007, Zhou et al. 2009, Guo et al. 2010, Witten and Tibshirani 2010, Bouveyron and Brunet 2012b,a).
The consistency of the problems regularized by a Lasso penalty is also a key feature. Defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large, Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by
minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions under which Lasso estimators become consistent (Knight and Fu 2000, Donoho et al. 2006, Meinshausen and Buhlmann 2006, Zhao and Yu 2007, Bach 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou 2006).
L2 Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used, to avoid the square root and solve a linear system. Thus, an L2-penalized optimization problem looks like
$$\min_\beta \; J(\beta) + \lambda \|\beta\|_2^2 \qquad (2.6)$$
The effect of this penalty is the "equalization" of the components of the penalized parameter vector. To highlight this property, let us consider a least squares problem
$$\min_\beta \; \sum_{i=1}^n (y_i - x_i^\top \beta)^2 \qquad (2.7)$$
with solution β_ls = (X⊤X)^{-1} X⊤y. If some input variables are highly correlated, the estimator β_ls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:
$$\min_\beta \; \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2$$
The solution to this problem is β_ℓ2 = (X⊤X + λI_p)^{-1} X⊤y. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performance.
As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, which looks like a ridge regression where each variable is penalized adaptively. To do so, the least squares solution is used to define the penalty parameter attached to each coefficient:
$$\min_\beta \; \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^p \frac{\beta_j^2}{(\beta_j^{ls})^2} \qquad (2.8)$$
The effect is an elliptic admissible set, instead of the ball of ridge regression. Another example is adaptive ridge regression (Grandvalet 1998, Grandvalet and Canu 2002),
where the penalty parameter differs for each component: every λ_j is optimized to penalize more or less, depending on the influence of β_j in the model.
Although L2-penalized problems are stable, they are not sparse, which makes those models harder to interpret, mainly in high dimensions.
L∞ Penalties. A special case of Lp norms is the infinity norm, defined as ||x||_∞ = max(|x_1|, |x_2|, ..., |x_p|). The admissible region for a penalty like ||β||_∞ ≤ t is displayed in Figure 2.3: the greyed-out region is a square containing all the β vectors whose largest coefficient is less than or equal to the penalty parameter t.
This norm is not commonly used as a regularization term by itself; however, it frequently appears within mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing optimality conditions of sparse regularized problems. The dual norm ||β||_* of a norm ||β|| is defined as

$$\|\beta\|_* = \max_{w \in \mathbb{R}^p} \; \beta^\top w \quad \text{s.t.} \quad \|w\| \le 1 .$$

In the case of an Lq norm with q ∈ [1, +∞], the dual norm is the Lr norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual, and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why the L∞ norm is so important, even if it is not as popular as a penalty as the L1 norm is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).
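The duality between L1 and L∞ can be checked numerically. The sketch below (random vector, plain NumPy) attains ||β||_∞ with a feasible w on the L1 unit sphere and spot-checks that random feasible points do no better, in line with Hölder's inequality.

```python
import numpy as np

rng = np.random.default_rng(8)
beta = rng.normal(size=6)

# dual of the L1 norm: max of beta^T w over ||w||_1 <= 1 equals ||beta||_inf,
# attained by putting all of w's mass on the largest |beta_j|
j = int(np.argmax(np.abs(beta)))
w_star = np.zeros_like(beta)
w_star[j] = np.sign(beta[j])
attained = float(beta @ w_star)

# spot check: no randomly drawn feasible w does better
W = rng.normal(size=(1000, 6))
W /= np.abs(W).sum(axis=1, keepdims=True)     # scale rows onto the L1 unit sphere
best_random = float((W @ beta).max())
```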
2.3.3 Hybrid Penalties
There is no reason for using pure penalties in isolation: we can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie 2005), whose objective is to improve the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) of Section 2.3.2, the Elastic net reads
$$\min_\beta \; \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j=1}^p \beta_j^2 \qquad (2.9)$$
The term in λ1 is a Lasso penalty that induces sparsity in vector β; on the other side, the term in λ2 is a ridge penalty that provides universal strong consistency (De Mol et al. 2009), that is, the asymptotic capability (as n goes to infinity) of always making the right choice of relevant variables.
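As an illustrative sketch (not the thesis's algorithm), the Elastic net objective (2.9) can be minimized by treating the quadratic loss plus the λ2 term as a smooth function and handling the λ1 term by soft-thresholding, i.e. a proximal-gradient loop. Data and parameters below are synthetic.

```python
import numpy as np

def soft(u, t):
    """Component-wise soft-thresholding, the proximal operator of t*||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

rng = np.random.default_rng(2)
n, p = 40, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]               # three informative variables
y = X @ beta_true + 0.1 * rng.normal(size=n)

lam1, lam2 = 1.0, 0.5                           # L1 and L2 penalty parameters
# smooth part J2(b) = ||y - Xb||^2 + lam2*||b||^2; its gradient is
# Lipschitz with the constant L below
L = 2.0 * (np.linalg.norm(X, 2) ** 2 + lam2)

beta = np.zeros(p)
for _ in range(5000):
    grad = 2.0 * X.T @ (X @ beta - y) + 2.0 * lam2 * beta
    beta = soft(beta - grad / L, lam1 / L)      # proximal-gradient step

def objective(b):
    return (np.sum((y - X @ b) ** 2)
            + lam1 * np.sum(np.abs(b)) + lam2 * np.sum(b ** 2))
```

At convergence the iterate satisfies the first-order optimality conditions of (2.9): zero coordinates have a smooth-part gradient bounded by λ1, non-zero ones balance it exactly.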
2.3.4 Mixed Penalties
Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us denote by G_ℓ the group of genes for the ℓ-th process and by d_ℓ the number of genes (variables) in each group, for all ℓ ∈ {1, ..., L}. Thus, the dimension of vector β is the sum of the number of genes of every group: dim(β) = Σ_{ℓ=1}^L d_ℓ. Mixed norms are a type of norms that take those groups into consideration. Their general expression is shown below:

$$\|\beta\|_{(r,s)} = \left( \sum_{\ell} \Big( \sum_{j \in G_\ell} |\beta_j|^s \Big)^{r/s} \right)^{1/r} \qquad (2.10)$$
The pair (r, s) identifies the norms that are combined: an Ls norm within groups and an Lr norm between groups. The Ls norm penalizes the variables in every group G_ℓ, while the Lr norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
Several combinations are available; the most popular is the norm ||β||_{(1,2)}, known as group-Lasso (Yuan and Lin 2006, Leng 2008, Xie et al. 2008a,b, Meier et al. 2008, Roth and Fischer 2008, Yang et al. 2010, Sanchez Merchante et al. 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L(1,2) norm. Many other mixings are possible, such as ||β||_{(1,4/3)} (Szafranski et al. 2008) or ||β||_{(1,∞)} (Wang and Zhu 2008, Kuan et al. 2010, Vogt and Roth 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al. 2009), the composite absolute penalties (Zhao et al. 2009), or combinations of mixed and pure norms, such as Lasso and group-Lasso (Friedman et al. 2010, Sprechmann et al. 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh 2011).
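The group-wise behavior of the L(1,2) penalty can be illustrated by its proximal operator, block-wise soft-thresholding, which either shrinks a group as a whole or zeroes it entirely. The coefficients and groups below are made up for illustration.

```python
import numpy as np

def group_soft_threshold(b, groups, lam):
    """Proximal operator of lam * sum_g ||b_g||_2 (the group-Lasso penalty)."""
    out = np.zeros_like(b)
    for g in groups:
        norm_g = np.linalg.norm(b[g])
        if norm_g > lam:                       # strong group: shrunk as a block
            out[g] = (1.0 - lam / norm_g) * b[g]
    return out                                  # weak groups are zeroed entirely

b = np.array([3.0, -4.0, 0.2, -0.1])           # two groups of two coefficients each
groups = [np.array([0, 1]), np.array([2, 3])]
beta = group_soft_threshold(b, groups, lam=1.0)
```

The first group (norm 5) survives with all its entries shrunk by the same factor, while the second (norm ≈ 0.22) is removed as a whole, which is exactly the variable-wise sparsity pattern discussed in Section 2.3.5.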
2.3.5 Sparsity Considerations
In this chapter I have reviewed several penalties that induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to feature-wise parsimonious models. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.
The Lasso and the other L1 penalties encourage solutions such as the one on the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.
To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown on the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L(1,2) or L(1,∞) mixed norms, with the proper definition of groups, can induce sparsity patterns such as
(a) L1 (Lasso) (b) L(1,2) (group-Lasso)
Figure 2.5: Admissible sets for the Lasso and the group-Lasso.
(a) L1-induced sparsity (b) L(1,2) group-induced sparsity
Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters.
the one on the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.
2.3.6 Optimization Tools for Regularized Problems
In Caramanis et al. (2012) there is a good collection of mathematical techniques and optimization methods for solving regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified into four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.
In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an "active constraints" algorithm, implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.
Subgradient Descent. Subgradient descent is a generic optimization method that can be used in penalized settings where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure; on the other hand, many iterations are needed, so convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β^(t+1) is updated proportionally to the negative subgradient of the function at the current point β^(t):

$$\beta^{(t+1)} = \beta^{(t)} - \alpha \, (s + \lambda s') \quad \text{where } s \in \partial J(\beta^{(t)}), \; s' \in \partial P(\beta^{(t)}) .$$
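A minimal sketch of this update for a Lasso-penalized least squares problem is given below; the synthetic data, the diminishing step size and the iteration count are illustrative choices, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0, 0.0, 0.0, 0.0]) + 0.1 * rng.normal(size=n)
lam = 0.5

def objective(b):
    return np.sum((y - X @ b) ** 2) + lam * np.sum(np.abs(b))

beta = np.zeros(p)
for t in range(1, 3001):
    s = -2.0 * X.T @ (y - X @ beta)       # gradient of the quadratic loss J
    s_prime = np.sign(beta)                # a subgradient of the L1 penalty P
    beta = beta - (0.01 / t) * (s + lam * s_prime)   # diminishing step size
```

Note that, as stated above, the iterates are not sparse: unlike thresholding-based methods, the subgradient step never sets a coefficient exactly to zero.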
Coordinate Descent. Coordinate descent is based on the first-order optimality conditions of criterion (2.1). In the case of penalties like the Lasso, setting to zero the first-order derivative with respect to coefficient β_j gives

$$\beta_j = \frac{-\lambda \, \mathrm{sign}(\beta_j) - \partial J(\beta)/\partial \beta_j}{2 \sum_{i=1}^n x_{ij}^2} .$$
In the literature, those algorithms may also be referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution β_ls and updating its value using an iterative thresholding algorithm where β_j^(t+1) = S_λ(∂J(β^(t))/∂β_j). The objective function is optimized with respect
to one variable at a time, while all the others are kept fixed:

$$S_\lambda\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) =
\begin{cases}
\dfrac{\lambda - \partial J(\beta)/\partial \beta_j}{2 \sum_{i=1}^n x_{ij}^2} & \text{if } \partial J(\beta)/\partial \beta_j > \lambda \\
\dfrac{-\lambda - \partial J(\beta)/\partial \beta_j}{2 \sum_{i=1}^n x_{ij}^2} & \text{if } \partial J(\beta)/\partial \beta_j < -\lambda \\
0 & \text{if } \left|\partial J(\beta)/\partial \beta_j\right| \le \lambda
\end{cases} \qquad (2.11)$$
The same principles define "block-coordinate descent" algorithms, in which first-order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin 2006, Wu and Lange 2008).
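A minimal sketch of this coordinate descent for Lasso-penalized least squares follows; the design is synthetic and chosen orthonormal on purpose, so that the result can be compared with the closed-form soft-thresholded OLS solution known for that special case.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 20, 5
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))    # orthonormal columns: Q.T @ Q = I
X = Q
y = rng.normal(size=n)
lam = 0.4

beta = np.zeros(p)
for sweep in range(100):
    for j in range(p):
        r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding x_j
        rho = 2.0 * (X[:, j] @ r_j)
        denom = 2.0 * np.sum(X[:, j] ** 2)       # equals 2 for orthonormal columns
        # soft-thresholding update, matching (2.11)
        beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / denom

# for orthonormal designs the Lasso solution is the soft-thresholded OLS solution
ols = X.T @ y
beta_closed = np.sign(ols) * np.maximum(np.abs(ols) - lam / 2.0, 0.0)
```

Each inner update solves the one-dimensional problem exactly, which is why, for this orthogonal design, a single sweep already reaches the minimizer.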
Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", which stores the indices of the variables with non-zero β_j; it is usually denoted A. The complement of the active set is the "inactive set", denoted Ā, which contains the indices of the variables whose β_j is zero. Thus, the problem can be reduced to the dimensionality of A.
Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There also exists a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy, which starts with an empty A, has the advantage that the first calculations are low-dimensional. In addition, the forward view better fits the feature selection intuition, where few features are expected to be selected.
Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the descent direction of the objective function, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions; their expressions are essential for selecting the next variable to add to the active set and for testing whether a particular vector β is a solution of Problem (2.1).
These active constraints or working set methods, even if originally proposed to solve L1-regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and L1 penalties (Roth 2004), linear functions
and L(1,2) penalties (Roth and Fischer 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al. 2003). The algorithm developed in this work belongs to this family of solutions.
Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built using several secant hyper-planes at different points, obtained from the subgradient of the cost function at these points.
This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, the quality of the approximation is not good enough and the solution can be unstable.
This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims 2006, Smola et al. 2008, Franc and Sonnenburg 2008) or Multiple Kernel Learning (Sonnenburg et al. 2006).
Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1), where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).
This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the Least Angle Regression (LARS) algorithm, developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.
Once an active set A^(t) and its corresponding solution β^(t) have been set, following the regularization path means looking for a direction h and a step size γ to update the solution as β^(t+1) = β^(t) + γh. Afterwards, the active and inactive sets A^(t+1) and Ā^(t+1) are updated; this can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size, and the variable that should enter the active set, from the correlation with the residuals.
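The warm-start idea behind path following can be sketched by solving a sequence of Lasso problems with decreasing λ, reusing each solution as the starting point of the next. Here the inner solver is a plain proximal-gradient loop rather than LARS, and the data and λ grid are synthetic illustrative choices.

```python
import numpy as np

def soft(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

rng = np.random.default_rng(5)
n, p = 50, 8
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0]) + 0.1 * rng.normal(size=n)
L = 2.0 * np.linalg.norm(X, 2) ** 2             # step-size constant for the solver

lambdas = [100.0, 25.0, 5.0, 1.0]               # from strong to weak penalty
beta = np.zeros(p)                               # warm start carried along the path
path = []
for lam in lambdas:
    for _ in range(2000):
        beta = soft(beta - 2.0 * X.T @ (X @ beta - y) / L, lam / L)
    path.append(beta.copy())

nonzeros = [int(np.count_nonzero(b)) for b in path]
```

As λ decreases, the active set typically grows, so the previous solution is close to the next one and few inner iterations are wasted; this mirrors the incremental behavior that LARS exploits exactly.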
Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz-differentiable cost function J(β) and a non-differentiable penalty λP(β):

$$\min_{\beta \in \mathbb{R}^p} \; J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2} \left\| \beta - \beta^{(t)} \right\|_2^2 \qquad (2.12)$$
They are iterative methods where the cost function J(β) is linearized in the proximity of the current solution β^(t), so that the problem to solve at each iteration looks like
(2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. That can be rewritten as
$$\min_{\beta \in \mathbb{R}^p} \; \frac{1}{2} \left\| \beta - \left( \beta^{(t)} - \frac{1}{L} \nabla J(\beta^{(t)}) \right) \right\|_2^2 + \frac{\lambda}{L} P(\beta) \qquad (2.13)$$
The basic algorithm uses the solution of (2.13) as the next iterate β^(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle 2009). Proximal methods can be seen as generalizations of gradient updates; in fact, setting λ = 0 in equation (2.13), the standard gradient update rule comes up.
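A minimal sketch of the basic proximal-gradient iteration for P = L1 follows (synthetic data; the accelerated variants mentioned above are omitted). It also checks numerically that setting λ = 0 reduces the update to a plain gradient step.

```python
import numpy as np

def soft(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

rng = np.random.default_rng(6)
n, p = 30, 6
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
L = 2.0 * np.linalg.norm(X, 2) ** 2     # upper bound on the Lipschitz constant of grad J

def grad_J(b):
    return 2.0 * X.T @ (X @ b - y)

lam = 0.3
beta = np.zeros(p)
for _ in range(500):
    # each iteration solves (2.13) in closed form for P = L1: a gradient
    # step on J followed by soft-thresholding with threshold lam / L
    beta = soft(beta - grad_J(beta) / L, lam / L)

# with lam = 0 the proximal update is exactly a gradient step
b = rng.normal(size=p)
step_prox = soft(b - grad_J(b) / L, 0.0)
step_grad = b - grad_J(b) / L
```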
Part II
Sparse Linear Discriminant Analysis
Abstract
Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.
There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Section 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models with regard to variables.
In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally through a regression approach of LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis and also enables deriving variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of the data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification. The first two or three directions can also be used to project the data and generate a graphical display. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to competing approaches, the models are extremely parsimonious without compromising prediction performance. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.
3 Feature Selection in Fisher Discriminant Analysis

3.1 Fisher Discriminant Analysis
Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of the data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part we focus on Fisher's discriminant analysis, a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities, but rather on some inertia principles (Fisher, 1936).
We consider that the data consist of a set of n examples, with observations $x_i \in \mathbb{R}^p$ comprising p features, and labels $y_i \in \{0,1\}^K$ indicating the exclusive assignment of observation $x_i$ to one of the K classes. It will be convenient to gather the observations in the $n \times p$ matrix $\mathbf{X} = (x_1, \ldots, x_n)^\top$ and the corresponding labels in the $n \times K$ matrix $\mathbf{Y} = (y_1, \ldots, y_n)^\top$.
Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

$$\max_{\beta \in \mathbb{R}^p} \; \frac{\beta^\top \Sigma_B \beta}{\beta^\top \Sigma_W \beta} , \quad (3.1)$$
where β is the discriminant direction used to project the data, and $\Sigma_B$ and $\Sigma_W$ are the $p \times p$ between-class and within-class covariance matrices, respectively defined (for a K-class problem) as

$$\Sigma_W = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (x_i - \mu_k)(x_i - \mu_k)^\top ,$$

$$\Sigma_B = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (\mu - \mu_k)(\mu - \mu_k)^\top ,$$

where μ is the sample mean of the whole dataset, $\mu_k$ the sample mean of class k, and $G_k$ indexes the observations of class k.
This analysis can be extended to the multi-class framework with K groups. In this case, K − 1 discriminant vectors $\beta_k$ may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

$$\max_{\mathbf{B} \in \mathbb{R}^{p \times (K-1)}} \; \frac{\operatorname{tr}\left(\mathbf{B}^\top \Sigma_B \mathbf{B}\right)}{\operatorname{tr}\left(\mathbf{B}^\top \Sigma_W \mathbf{B}\right)} , \quad (3.2)$$
where the matrix B is built with the discriminant directions $\beta_k$ as columns.

Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K − 1 subproblems:

$$\begin{aligned} \max_{\beta_k \in \mathbb{R}^p} \;& \beta_k^\top \Sigma_B \beta_k \\ \text{s.t. } & \beta_k^\top \Sigma_W \beta_k \le 1 \\ & \beta_k^\top \Sigma_W \beta_\ell = 0 , \; \forall \ell < k . \end{aligned} \quad (3.3)$$

The maximizer of subproblem k is the eigenvector of $\Sigma_W^{-1}\Sigma_B$ associated to the kth largest eigenvalue (see Appendix C).
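The eigen-characterization above is easy to illustrate numerically: the sketch below (our own illustration, not thesis code) computes the discriminant directions as generalized eigenvectors of the pair $(\Sigma_B, \Sigma_W)$, assuming $\Sigma_W$ is full rank.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_directions(X, y, K):
    """Discriminant directions as eigenvectors of Sigma_W^{-1} Sigma_B,
    obtained from the generalized symmetric eigenproblem
    Sigma_B beta = lambda * Sigma_W beta (Sigma_W assumed full rank)."""
    n, p = X.shape
    mu = X.mean(axis=0)
    Sw = np.zeros((p, p))
    Sb = np.zeros((p, p))
    for k in range(K):
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        Sw += (Xk - mk).T @ (Xk - mk) / n
        Sb += len(Xk) * np.outer(mk - mu, mk - mu) / n
    evals, evecs = eigh(Sb, Sw)           # ascending eigenvalues
    order = np.argsort(evals)[::-1]       # column k = kth largest
    return evals[order], evecs[:, order]
```

Since $\Sigma_B$ has rank at most K − 1, only the first K − 1 eigenvalues are non-zero, matching the number of useful discriminant directions.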
3.2 Feature Selection in LDA Problems
LDA is often used as a data reduction technique, where the K − 1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.
Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. The main target of this sparsity is to reduce the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness in the solution, or computational constraints.
The easiest approach to sparse LDA performs variable selection before discrimination. The relevance of each feature is usually based on univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.
They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, that is, either Fisher's discriminant analysis (variance-based) or a regression-based formulation.
3.2.1 Inertia Based
The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away (large between-class variance), and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.
Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search, with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem, and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al., 2007).
Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where Fisher's discriminant (3.1) is solved as

$$\begin{aligned} \min_{\beta \in \mathbb{R}^p} \;& \beta^\top \Sigma_W \beta \\ \text{s.t. } & (\mu_1 - \mu_2)^\top \beta = 1 \\ & \textstyle\sum_{j=1}^{p} |\beta_j| \le t , \end{aligned}$$

where $\mu_1$ and $\mu_2$ are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.
Witten and Tibshirani (2011) describe a multi-class technique using Fisher's discriminant, rewritten in the form of K − 1 constrained and penalized maximization problems:

$$\begin{aligned} \max_{\beta_k \in \mathbb{R}^p} \;& \beta_k^\top \Sigma_B^k \beta_k - P_k(\beta_k) \\ \text{s.t. } & \beta_k^\top \Sigma_W \beta_k \le 1 . \end{aligned}$$

The term to maximize is the projected between-class covariance $\beta_k^\top \Sigma_B \beta_k$, subject to an upper bound on the projected within-class covariance $\beta_k^\top \Sigma_W \beta_k$. The penalty $P_k(\beta_k)$ is added to avoid singularities and induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general purpose data: the Lasso shrinks less informative variables to zero, and the fused Lasso encourages a piecewise constant $\beta_k$ vector. The R code is available from the website of Daniela Witten.
Cai and Liu (2011) use Fisher's discriminant to solve a binary LDA problem. But instead of performing separate estimations of $\Sigma_W$ and $(\mu_1 - \mu_2)$ to obtain the optimal solution $\beta = \Sigma_W^{-1}(\mu_1 - \mu_2)$, they estimate the product directly, through a constrained $L_1$ minimization:

$$\begin{aligned} \min_{\beta \in \mathbb{R}^p} \;& \|\beta\|_1 \\ \text{s.t. } & \left\| \hat{\Sigma}\beta - (\hat{\mu}_1 - \hat{\mu}_2) \right\|_\infty \le \lambda . \end{aligned}$$

Sparsity is encouraged by the $L_1$ norm of the vector β, and the parameter λ is used to tune the optimization.
Most of the algorithms reviewed are conceived for binary classification. As for those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.
3.2.2 Regression Based
In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case (Duda et al., 2000; Friedman et al., 2009).
Predefined Indicator Matrix
Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al., 2009). An indicator matrix Y is an $n \times K$ matrix holding the class labels for all samples. There are several well-known types in the literature. For example, the binary or dummy indicator ($y_{ik} = 1$ if sample i belongs to class k, and $y_{ik} = 0$ otherwise) is commonly used for linking multi-class classification with linear regression (Friedman et al., 2009). Another popular choice is $y_{ik} = 1$ if sample i belongs to class k, and $y_{ik} = -1/(K-1)$ otherwise; it was used, for example, for extending Support Vector Machines to multi-class classification (Lee et al., 2004), or for generalizing the kernel target alignment measure (Guermeur et al., 2004).
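The two codings mentioned above are straightforward to construct; a small sketch (the helper names are ours):

```python
import numpy as np

def dummy_indicator(labels, K):
    """Binary (dummy) indicator: Y[i, k] = 1 iff sample i is in class k."""
    n = len(labels)
    Y = np.zeros((n, K))
    Y[np.arange(n), labels] = 1.0
    return Y

def symmetric_indicator(labels, K):
    """Alternative coding: 1 for the true class, -1/(K-1) elsewhere."""
    Y = dummy_indicator(labels, K)
    return Y - (1.0 - Y) / (K - 1)
```

With the symmetric coding, every row sums to zero, which is the property exploited in the multi-class SVM extension cited above.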
Some works propose a formulation of the least squares problem based on a new class indicator matrix (Ye, 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition, shown empirically to hold in many applications involving high-dimensional data.
Qiao et al. (2009) propose a discriminant analysis for the high-dimensional, low-sample setting, which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required; the lack of publicly available code also restrained an empirical test of this conjecture. If the similitude is confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.
In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is obtained by solving

$$\min_{\beta \in \mathbb{R}^p,\, \beta_0 \in \mathbb{R}} \; n^{-1} \sum_{i=1}^{n} \left(y_i - \beta_0 - x_i^\top \beta\right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| ,$$

where $y_i$ is the binary indicator of the label for pattern $x_i$. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule $x^\top \beta + \beta_0 > 0$ is the LDA classifier when it is built using the resulting β vector for λ = 0, but a different intercept $\beta_0$ is required.
Optimal Scoring
In binary classification, the regression of (scaled) class indicators enables to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al., 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification, and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani, 1996) or more conservative ones (Hastie et al., 1995).
As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as

$$\min_{\Theta, \mathbf{B}} \; \|\mathbf{Y}\Theta - \mathbf{X}\mathbf{B}\|_F^2 + \lambda \operatorname{tr}\left(\mathbf{B}^\top \Omega \mathbf{B}\right) \quad (3.4a)$$
$$\text{s.t. } n^{-1}\, \Theta^\top \mathbf{Y}^\top \mathbf{Y}\Theta = \mathbf{I}_{K-1} , \quad (3.4b)$$

where $\Theta \in \mathbb{R}^{K \times (K-1)}$ are the class scores, $\mathbf{B} \in \mathbb{R}^{p \times (K-1)}$ are the regression coefficients, and $\|\cdot\|_F$ is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of K − 1 problems:

$$\begin{aligned} \min_{\theta_k \in \mathbb{R}^K,\, \beta_k \in \mathbb{R}^p} \;& \|\mathbf{Y}\theta_k - \mathbf{X}\beta_k\|^2 + \beta_k^\top \Omega \beta_k & (3.5a) \\ \text{s.t. } & n^{-1}\, \theta_k^\top \mathbf{Y}^\top \mathbf{Y}\theta_k = 1 & (3.5b) \\ & \theta_k^\top \mathbf{Y}^\top \mathbf{Y}\theta_\ell = 0 , \; \ell = 1, \ldots, k-1 , & (3.5c) \end{aligned}$$

where each $\beta_k$ corresponds to a discriminant direction.
Several sparse LDA have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan, 2005; Leng, 2008; Grosenick et al., 2008; Clemmensen et al., 2011). Grosenick et al. (2008) proposed a variant of the lasso-based penalized OS of Ghosh and Chinnaiyan (2005), introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

$$\min_{\beta_k \in \mathbb{R}^p,\, \theta_k \in \mathbb{R}^K} \; \sum_k \|\mathbf{Y}\theta_k - \mathbf{X}\beta_k\|_2^2 + \lambda_1 \|\beta_k\|_1 + \lambda_2\, \beta_k^\top \Omega \beta_k ,$$

where $\lambda_1$ and $\lambda_2$ are regularization parameters, and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.
Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):

$$\min_{\beta_k \in \mathbb{R}^p,\, \theta_k \in \mathbb{R}^K} \; \sum_{k=1}^{K-1} \|\mathbf{Y}\theta_k - \mathbf{X}\beta_k\|_2^2 + \lambda \sum_{j=1}^{p} \sqrt{\sum_{k=1}^{K-1} \beta_{kj}^2} , \quad (3.6)$$

which is the criterion that was chosen in this thesis.

The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets, with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA, and propose a publicly available, efficient code for solving this problem.
4 Formalizing the Objective
In this chapter, we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis, and also enables the derivation of variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004).
The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), which selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of the data. For K classes, this representation can be either complete, in dimension K − 1, or partial, for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.
The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here, since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka, 1984; Hastie et al., 1994; Hastie and Tibshirani, 1996; Hastie et al., 1995), and were already used before for sparsity-inducing penalties (Roth and Lange, 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by the group-Lasso and penalized LDA (Sanchez Merchante et al., 2012).
4.1 From Optimal Scoring to Linear Discriminant Analysis
Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) and in penalized LDA (p-LDA), by going through canonical correlation analysis. We first provide some properties about the solutions of an arbitrary problem in the p-OS series (3.5).
Throughout this chapter, we assume that:

• there is no empty class, that is, the diagonal matrix $\mathbf{Y}^\top \mathbf{Y}$ is full rank;

• inputs are centered, that is, $\mathbf{X}^\top \mathbf{1}_n = \mathbf{0}$;

• the quadratic penalty Ω is positive-semidefinite and such that $\mathbf{X}^\top \mathbf{X} + \Omega$ is full rank.
4.1.1 Penalized Optimal Scoring Problem
For the sake of simplicity, we now drop subscript k to refer to any problem in the p-OS series (3.5). First, note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value, and vice-versa. The problems are however non-convex: in particular, if (θ, β) is a solution, then (−θ, −β) is also a solution.
The orthogonality constraints (3.5c) inherently limit the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K − 1 first optimal scores are orthogonal to 1 (and the Kth problem would be solved by $\beta_K = \mathbf{0}$). All the problems considered here can be solved through the singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we no longer mention these orthogonality constraints (3.5c), which apply along the route, so as to simplify all expressions. The generic problem solved is thus

$$\min_{\theta \in \mathbb{R}^K,\, \beta \in \mathbb{R}^p} \; \|\mathbf{Y}\theta - \mathbf{X}\beta\|^2 + \beta^\top \Omega \beta \quad (4.1a)$$
$$\text{s.t. } n^{-1}\, \theta^\top \mathbf{Y}^\top \mathbf{Y}\theta = 1 . \quad (4.1b)$$
For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator

$$\beta_{\mathrm{os}} = \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1} \mathbf{X}^\top \mathbf{Y}\theta . \quad (4.2)$$
The objective function (4.1a) is then

$$\begin{aligned} \|\mathbf{Y}\theta - \mathbf{X}\beta_{\mathrm{os}}\|^2 + \beta_{\mathrm{os}}^\top \Omega \beta_{\mathrm{os}} &= \theta^\top \mathbf{Y}^\top \mathbf{Y}\theta - 2\,\theta^\top \mathbf{Y}^\top \mathbf{X}\beta_{\mathrm{os}} + \beta_{\mathrm{os}}^\top \left(\mathbf{X}^\top \mathbf{X} + \Omega\right) \beta_{\mathrm{os}} \\ &= \theta^\top \mathbf{Y}^\top \mathbf{Y}\theta - \theta^\top \mathbf{Y}^\top \mathbf{X} \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1} \mathbf{X}^\top \mathbf{Y}\theta , \end{aligned}$$

where the second line stems from the definition of $\beta_{\mathrm{os}}$ (4.2). Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to

$$\max_{\theta:\; n^{-1}\theta^\top \mathbf{Y}^\top \mathbf{Y}\theta = 1} \; \theta^\top \mathbf{Y}^\top \mathbf{X} \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1} \mathbf{X}^\top \mathbf{Y}\theta , \quad (4.3)$$

which shows that the optimization of the p-OS problem with respect to $\theta_k$ boils down to finding the eigenvector of $\mathbf{Y}^\top \mathbf{X}\left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1}\mathbf{X}^\top \mathbf{Y}$ associated to the kth largest eigenvalue. Indeed, Appendix C details that Problem (4.3) is solved by

$$\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1} \mathbf{Y}^\top \mathbf{X} \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1} \mathbf{X}^\top \mathbf{Y}\theta = \alpha^2 \theta , \quad (4.4)$$
where $\alpha^2$ is the maximal eigenvalue:¹

$$\begin{aligned} n^{-1}\theta^\top \mathbf{Y}^\top \mathbf{X}\left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1}\mathbf{X}^\top \mathbf{Y}\theta &= \alpha^2\, n^{-1}\theta^\top \left(\mathbf{Y}^\top \mathbf{Y}\right)\theta \\ n^{-1}\theta^\top \mathbf{Y}^\top \mathbf{X}\left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1}\mathbf{X}^\top \mathbf{Y}\theta &= \alpha^2 . \end{aligned} \quad (4.5)$$
4.1.2 Penalized Canonical Correlation Analysis
As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables X and Y is defined as follows:

$$\begin{aligned} \max_{\theta \in \mathbb{R}^K,\, \beta \in \mathbb{R}^p} \;& n^{-1}\, \theta^\top \mathbf{Y}^\top \mathbf{X}\beta & (4.6a) \\ \text{s.t. } & n^{-1}\, \theta^\top \mathbf{Y}^\top \mathbf{Y}\theta = 1 & (4.6b) \\ & n^{-1}\, \beta^\top \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)\beta = 1 . & (4.6c) \end{aligned}$$
The solutions to (4.6) are obtained by finding the saddle points of the Lagrangian:

$$n L(\beta, \theta, \nu, \gamma) = \theta^\top \mathbf{Y}^\top \mathbf{X}\beta - \nu\left(\theta^\top \mathbf{Y}^\top \mathbf{Y}\theta - n\right) - \gamma\left(\beta^\top (\mathbf{X}^\top \mathbf{X} + \Omega)\beta - n\right)$$
$$\Rightarrow\; n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \beta} = \mathbf{X}^\top \mathbf{Y}\theta - 2\gamma (\mathbf{X}^\top \mathbf{X} + \Omega)\beta$$
$$\Rightarrow\; \beta_{\mathrm{cca}} = \frac{1}{2\gamma} (\mathbf{X}^\top \mathbf{X} + \Omega)^{-1} \mathbf{X}^\top \mathbf{Y}\theta .$$

Then, as $\beta_{\mathrm{cca}}$ obeys (4.6c), we obtain

$$\beta_{\mathrm{cca}} = \frac{(\mathbf{X}^\top \mathbf{X} + \Omega)^{-1} \mathbf{X}^\top \mathbf{Y}\theta}{\sqrt{n^{-1}\theta^\top \mathbf{Y}^\top \mathbf{X}(\mathbf{X}^\top \mathbf{X} + \Omega)^{-1}\mathbf{X}^\top \mathbf{Y}\theta}} , \quad (4.7)$$
so that the optimal objective function (4.6a) can be expressed with θ alone:

$$n^{-1}\theta^\top \mathbf{Y}^\top \mathbf{X}\beta_{\mathrm{cca}} = \frac{n^{-1}\theta^\top \mathbf{Y}^\top \mathbf{X}(\mathbf{X}^\top \mathbf{X} + \Omega)^{-1}\mathbf{X}^\top \mathbf{Y}\theta}{\sqrt{n^{-1}\theta^\top \mathbf{Y}^\top \mathbf{X}(\mathbf{X}^\top \mathbf{X} + \Omega)^{-1}\mathbf{X}^\top \mathbf{Y}\theta}} = \sqrt{n^{-1}\theta^\top \mathbf{Y}^\top \mathbf{X}(\mathbf{X}^\top \mathbf{X} + \Omega)^{-1}\mathbf{X}^\top \mathbf{Y}\theta} ,$$

and the optimization problem with respect to θ can be restated as

$$\max_{\theta:\; n^{-1}\theta^\top \mathbf{Y}^\top \mathbf{Y}\theta = 1} \; \theta^\top \mathbf{Y}^\top \mathbf{X}\left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1}\mathbf{X}^\top \mathbf{Y}\theta . \quad (4.8)$$
Hence, the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

$$\beta_{\mathrm{os}} = \alpha\, \beta_{\mathrm{cca}} , \quad (4.9)$$
¹The awkward notation $\alpha^2$ for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5), for example).
where α is defined by (4.5).

The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:

$$n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \theta} = \mathbf{Y}^\top \mathbf{X}\beta - 2\nu\, \mathbf{Y}^\top \mathbf{Y}\theta$$
$$\Rightarrow\; \theta_{\mathrm{cca}} = \frac{1}{2\nu} \left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1} \mathbf{Y}^\top \mathbf{X}\beta . \quad (4.10)$$

Then, as $\theta_{\mathrm{cca}}$ obeys (4.6b), we obtain

$$\theta_{\mathrm{cca}} = \frac{\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1}\mathbf{Y}^\top \mathbf{X}\beta}{\sqrt{n^{-1}\beta^\top \mathbf{X}^\top \mathbf{Y}\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1}\mathbf{Y}^\top \mathbf{X}\beta}} , \quad (4.11)$$
leading to the following expression of the optimal objective function:

$$n^{-1}\theta_{\mathrm{cca}}^\top \mathbf{Y}^\top \mathbf{X}\beta = \frac{n^{-1}\beta^\top \mathbf{X}^\top \mathbf{Y}\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1}\mathbf{Y}^\top \mathbf{X}\beta}{\sqrt{n^{-1}\beta^\top \mathbf{X}^\top \mathbf{Y}\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1}\mathbf{Y}^\top \mathbf{X}\beta}} = \sqrt{n^{-1}\beta^\top \mathbf{X}^\top \mathbf{Y}\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1}\mathbf{Y}^\top \mathbf{X}\beta} .$$

The p-CCA problem can thus be solved with respect to β by plugging this value in (4.6):

$$\begin{aligned} \max_{\beta \in \mathbb{R}^p} \;& n^{-1}\, \beta^\top \mathbf{X}^\top \mathbf{Y}\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1}\mathbf{Y}^\top \mathbf{X}\beta & (4.12a) \\ \text{s.t. } & n^{-1}\, \beta^\top \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)\beta = 1 , & (4.12b) \end{aligned}$$

where the positive objective function has been squared compared to (4.6). This formulation is important, since it will be used to link p-CCA to p-LDA. We thus derive its solution: following the reasoning of Appendix C, $\beta_{\mathrm{cca}}$ verifies

$$\mathbf{X}^\top \mathbf{Y}\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1}\mathbf{Y}^\top \mathbf{X}\beta_{\mathrm{cca}} = \lambda \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)\beta_{\mathrm{cca}} , \quad (4.13)$$
where λ is the maximal eigenvalue, shown below to be equal to $\alpha^2$:

$$\begin{aligned} & n^{-1}\beta_{\mathrm{cca}}^\top \mathbf{X}^\top \mathbf{Y}\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1}\mathbf{Y}^\top \mathbf{X}\beta_{\mathrm{cca}} = \lambda \\ \Rightarrow\;& n^{-1}\alpha^{-1}\beta_{\mathrm{cca}}^\top \mathbf{X}^\top \mathbf{Y}\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1}\mathbf{Y}^\top \mathbf{X}\left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1}\mathbf{X}^\top \mathbf{Y}\theta = \lambda \\ \Rightarrow\;& n^{-1}\alpha\, \beta_{\mathrm{cca}}^\top \mathbf{X}^\top \mathbf{Y}\theta = \lambda \\ \Rightarrow\;& n^{-1}\theta^\top \mathbf{Y}^\top \mathbf{X}\left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1}\mathbf{X}^\top \mathbf{Y}\theta = \lambda \\ \Rightarrow\;& \alpha^2 = \lambda . \end{aligned}$$

The first line is obtained by obeying constraint (4.12b); the second line by the relationship (4.7), where the denominator is α; the third line comes from (4.4); the fourth line uses again the relationship (4.7); and the last one the definition of α (4.5).
4.1.3 Penalized Linear Discriminant Analysis
Still following Hastie et al. (1995), penalized Linear Discriminant Analysis is defined as follows:

$$\begin{aligned} \max_{\beta \in \mathbb{R}^p} \;& \beta^\top \Sigma_B \beta & (4.14a) \\ \text{s.t. } & \beta^\top \left(\Sigma_W + n^{-1}\Omega\right)\beta = 1 , & (4.14b) \end{aligned}$$
where $\Sigma_B$ and $\Sigma_W$ are respectively the sample between-class and within-class covariance matrices of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.
As the feature matrix X is assumed to be centered, the sample total, between-class, and within-class covariance matrices can be written in a simple form, amenable to a simple matrix representation using the projection operator $\mathbf{Y}\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1}\mathbf{Y}^\top$:

$$\Sigma_T = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top = n^{-1}\, \mathbf{X}^\top \mathbf{X}$$

$$\Sigma_B = \frac{1}{n} \sum_{k=1}^{K} n_k\, \mu_k \mu_k^\top = n^{-1}\, \mathbf{X}^\top \mathbf{Y}\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1}\mathbf{Y}^\top \mathbf{X}$$

$$\Sigma_W = \frac{1}{n} \sum_{k=1}^{K} \sum_{i:\, y_{ik} = 1} (x_i - \mu_k)(x_i - \mu_k)^\top = n^{-1}\left(\mathbf{X}^\top \mathbf{X} - \mathbf{X}^\top \mathbf{Y}\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1}\mathbf{Y}^\top \mathbf{X}\right) .$$
Using these formulae, the solution to the p-LDA problem (4.14) is obtained as

$$\mathbf{X}^\top \mathbf{Y}\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1}\mathbf{Y}^\top \mathbf{X}\beta_{\mathrm{lda}} = \lambda\left(\mathbf{X}^\top \mathbf{X} + \Omega - \mathbf{X}^\top \mathbf{Y}\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1}\mathbf{Y}^\top \mathbf{X}\right)\beta_{\mathrm{lda}}$$

$$\mathbf{X}^\top \mathbf{Y}\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1}\mathbf{Y}^\top \mathbf{X}\beta_{\mathrm{lda}} = \frac{\lambda}{1 - \lambda}\left(\mathbf{X}^\top \mathbf{X} + \Omega\right)\beta_{\mathrm{lda}} .$$
The comparison of the last equation with (4.13) shows that $\beta_{\mathrm{lda}}$ and $\beta_{\mathrm{cca}}$ are proportional, and that $\lambda/(1 - \lambda) = \alpha^2$. Using constraints (4.12b) and (4.14b), it comes that

$$\begin{aligned} \beta_{\mathrm{lda}} &= (1 - \alpha^2)^{-1/2}\, \beta_{\mathrm{cca}} \\ &= \alpha^{-1}(1 - \alpha^2)^{-1/2}\, \beta_{\mathrm{os}} , \end{aligned}$$

which ends the path from p-OS to p-LDA.
4.1.4 Summary
The three previous subsections considered a generic form of the kth problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:

$$\min_{\Theta, \mathbf{B}} \; \|\mathbf{Y}\Theta - \mathbf{X}\mathbf{B}\|_F^2 + \lambda \operatorname{tr}\left(\mathbf{B}^\top \Omega \mathbf{B}\right) \quad \text{s.t. } n^{-1}\, \Theta^\top \mathbf{Y}^\top \mathbf{Y}\Theta = \mathbf{I}_{K-1} .$$

Let A represent the $(K-1) \times (K-1)$ diagonal matrix with elements $\alpha_k$, the square root of the kth largest eigenvalue of $\mathbf{Y}^\top \mathbf{X}\left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1}\mathbf{X}^\top \mathbf{Y}$; we have

$$\begin{aligned} \mathbf{B}_{\mathrm{LDA}} &= \mathbf{B}_{\mathrm{CCA}}\left(\mathbf{I}_{K-1} - \mathbf{A}^2\right)^{-\frac{1}{2}} \\ &= \mathbf{B}_{\mathrm{OS}}\, \mathbf{A}^{-1}\left(\mathbf{I}_{K-1} - \mathbf{A}^2\right)^{-\frac{1}{2}} , \end{aligned} \quad (4.15)$$
where $\mathbf{I}_{K-1}$ is the $(K-1) \times (K-1)$ identity matrix.

At this point, the features matrix X, which in the input space has dimensions $n \times p$, can be projected into the optimal scoring domain, as the $n \times (K-1)$ matrix $\mathbf{X}_{\mathrm{OS}} = \mathbf{X}\mathbf{B}_{\mathrm{OS}}$, or into the linear discriminant analysis space, as the $n \times (K-1)$ matrix $\mathbf{X}_{\mathrm{LDA}} = \mathbf{X}\mathbf{B}_{\mathrm{LDA}}$. Classification can be performed in any of those domains if the appropriate distance (based on the penalized within-class covariance matrix) is applied.
With the aim of performing classification, the whole process can be summarized as follows:

1. Solve the p-OS problem as
$$\mathbf{B}_{\mathrm{OS}} = \left(\mathbf{X}^\top \mathbf{X} + \lambda\Omega\right)^{-1}\mathbf{X}^\top \mathbf{Y}\Theta ,$$
where Θ are the K − 1 leading eigenvectors of $\mathbf{Y}^\top \mathbf{X}\left(\mathbf{X}^\top \mathbf{X} + \lambda\Omega\right)^{-1}\mathbf{X}^\top \mathbf{Y}$.

2. Translate the data samples X into the LDA domain as $\mathbf{X}_{\mathrm{LDA}} = \mathbf{X}\mathbf{B}_{\mathrm{OS}}\mathbf{D}$, where $\mathbf{D} = \mathbf{A}^{-1}\left(\mathbf{I}_{K-1} - \mathbf{A}^2\right)^{-\frac{1}{2}}$.

3. Compute the matrix M of centroids $\mu_k$ from $\mathbf{X}_{\mathrm{LDA}}$ and Y.

4. Evaluate the distances $d(x, \mu_k)$ in the LDA domain, as a function of M and $\mathbf{X}_{\mathrm{LDA}}$.

5. Translate distances into posterior probabilities, and assign every sample i to a class k following the maximum a posteriori rule.

6. Optionally, produce a graphical representation.
The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Sections 4.2.1, 4.2.2 and 4.2.3, respectively.
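The summarized process can be sketched end to end. The sketch below is our own illustration (all names are hypothetical, and the sparsity-inducing part is replaced by a plain quadratic penalty λΩ); it assumes centered inputs, non-empty classes, and $\alpha_k^2 < 1$:

```python
import numpy as np
from scipy.linalg import eigh

def glda_fit_predict(X, Y, Omega, lam, Xnew):
    """Steps 1-5 of the summary: solve p-OS, map to the LDA domain, then
    classify new points by nearest centroid with a class-prior adjustment
    (distances in the LDA domain are plain Euclidean distances)."""
    n, K = Y.shape
    G = np.linalg.solve(X.T @ X + lam * Omega, X.T @ Y)  # (X'X+lam*Om)^{-1}X'Y
    M = Y.T @ X @ G
    evals, V = eigh(M, Y.T @ Y)
    order = np.argsort(evals)[::-1][:K - 1]
    Theta = np.sqrt(n) * V[:, order]                     # K-1 leading scores
    B_os = G @ Theta                                     # step 1
    alpha2 = evals[order]                                # alpha_k^2, cf. (4.5)
    D = 1.0 / np.sqrt(alpha2 * (1.0 - alpha2))           # A^{-1}(I-A^2)^{-1/2}
    X_lda = (X @ B_os) * D                               # step 2
    centroids = np.linalg.solve(Y.T @ Y, Y.T @ X_lda)    # step 3: class means
    prior = Y.sum(axis=0) / n
    Z = (Xnew @ B_os) * D
    d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # step 4
    return np.argmin(d2 - 2.0 * np.log(prior), axis=1)   # step 5 (MAP rule)
```

New samples must be centered with the training means before calling this sketch, since the whole derivation assumes $\mathbf{X}^\top \mathbf{1}_n = \mathbf{0}$.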
4.2 Practicalities
4.2.1 Solution of the Penalized Optimal Scoring Regression
Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:

$$\begin{aligned} \min_{\Theta \in \mathbb{R}^{K \times (K-1)},\, \mathbf{B} \in \mathbb{R}^{p \times (K-1)}} \;& \|\mathbf{Y}\Theta - \mathbf{X}\mathbf{B}\|_F^2 + \lambda \operatorname{tr}\left(\mathbf{B}^\top \Omega \mathbf{B}\right) & (4.16a) \\ \text{s.t. } & n^{-1}\, \Theta^\top \mathbf{Y}^\top \mathbf{Y}\Theta = \mathbf{I}_{K-1} , & (4.16b) \end{aligned}$$

where Θ are the class scores, B the regression coefficients, and $\|\cdot\|_F$ is the Frobenius norm.
Though non-convex, the OS problem is readily solved by a decomposition in Θ and B: the optimal $\mathbf{B}_{\mathrm{OS}}$ does not intervene in the optimality conditions with respect to Θ, and the optimum with respect to B is obtained in closed form, as a linear combination of the optimal scores Θ (Hastie et al., 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize Θ to $\Theta^0$ such that $n^{-1}\, \Theta^{0\top}\mathbf{Y}^\top \mathbf{Y}\Theta^0 = \mathbf{I}_{K-1}$.

2. Compute $\mathbf{B} = \left(\mathbf{X}^\top \mathbf{X} + \lambda\Omega\right)^{-1}\mathbf{X}^\top \mathbf{Y}\Theta^0$.

3. Set Θ to be the K − 1 leading eigenvectors of $\mathbf{Y}^\top \mathbf{X}\left(\mathbf{X}^\top \mathbf{X} + \lambda\Omega\right)^{-1}\mathbf{X}^\top \mathbf{Y}$.

4. Compute the optimal regression coefficients

$$\mathbf{B}_{\mathrm{OS}} = \left(\mathbf{X}^\top \mathbf{X} + \lambda\Omega\right)^{-1}\mathbf{X}^\top \mathbf{Y}\Theta . \quad (4.17)$$
Defining $\Theta^0$ in Step 1, instead of using directly Θ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on $\Theta^{0\top}\mathbf{Y}^\top \mathbf{X}\left(\mathbf{X}^\top \mathbf{X} + \lambda\Omega\right)^{-1}\mathbf{X}^\top \mathbf{Y}\Theta^0$, which is computed as $\Theta^{0\top}\mathbf{Y}^\top \mathbf{X}\mathbf{B}$, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.
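The four steps can be sketched as follows, under our own naming conventions and with an explicit (hypothetical) construction of $\Theta^0$ from class-indicator contrasts; only the small $(K-1) \times (K-1)$ matrix $\Theta^{0\top}\mathbf{Y}^\top \mathbf{X}\mathbf{B}$ is eigen-decomposed:

```python
import numpy as np
from scipy.linalg import eigh

def penalized_os(X, Y, Omega, lam):
    """Four-step solution of the quadratically penalized OS problem.
    Assumes centered X and non-empty classes. The eigen-analysis is done
    on the small matrix Theta0' (Y'X) B, as described in the text."""
    n, K = Y.shape
    counts = Y.sum(axis=0)
    # Step 1: Theta0 spanning the non-trivial scores,
    # normalized so that n^{-1} Theta0' Y'Y Theta0 = I_{K-1}
    C = np.eye(K)[:, :K - 1] - counts[:K - 1] / n
    L = np.linalg.cholesky(C.T @ (Y.T @ Y) @ C / n)
    Theta0 = C @ np.linalg.inv(L).T
    # Step 2: B = (X'X + lam*Omega)^{-1} X'Y Theta0
    R = X.T @ X + lam * Omega
    B = np.linalg.solve(R, X.T @ Y @ Theta0)
    # Step 3: eigenvectors of the small projected matrix
    S = Theta0.T @ (Y.T @ X) @ B
    w, W = eigh((S + S.T) / 2)
    Theta = Theta0 @ W[:, np.argsort(w)[::-1]]
    # Step 4: optimal regression coefficients (4.17)
    B_os = np.linalg.solve(R, X.T @ Y @ Theta)
    return Theta, B_os
```

With centered X, the trivial constant score has eigenvalue zero and lies outside the span of $\Theta^0$, so the projected eigenproblem recovers exactly the K − 1 non-trivial scores.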
This four-step algorithm is valid when the penalty is of the form $\mathbf{B}^\top \Omega \mathbf{B}$. However, when an $L_1$ penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and Θ. That situation is developed by Clemmensen et al. (2011), where a Lasso or an elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and elastic net penalties do not enjoy the equivalence with LDA problems.
4.2.2 Distance Evaluation
The simplest classification rule is the nearest centroid rule, where sample $x_i$ is assigned to class k if it is closer (in terms of the shared within-class Mahalanobis distance) to centroid $\mu_k$ than to any other centroid $\mu_\ell$. In general, the parameters of the model are unknown, and the rule is applied with parameters estimated from training data (sample estimators $\hat{\mu}_k$ and $\hat{\Sigma}_W$). If $\mu_k$ are the centroids in the input space, sample $x_i$ is assigned to class k if the distance

$$d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_{W\Omega}^{-1}(x_i - \mu_k) - 2\log\left(\frac{n_k}{n}\right) \quad (4.18)$$

is minimized among all k. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment for unequal class sizes that estimates the prior probability of class k. Note that this is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al., 2009; Mai et al., 2012). The matrix $\Sigma_{W\Omega}$ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:

$$\begin{aligned} \Sigma_{W\Omega}^{-1} &= \left(n^{-1}(\mathbf{X}^\top \mathbf{X} + \lambda\Omega) - \Sigma_B\right)^{-1} \\ &= \left(n^{-1}\mathbf{X}^\top \mathbf{X} - \Sigma_B + n^{-1}\lambda\Omega\right)^{-1} \\ &= \left(\Sigma_W + n^{-1}\lambda\Omega\right)^{-1} . \end{aligned} \quad (4.19)$$
Before explaining how to compute the distances, let us summarize some clarifying points:

• The solution $\mathbf{B}_{\mathrm{OS}}$ of the p-OS problem is enough to accomplish classification.

• In the LDA domain (the space of discriminant variates $\mathbf{X}_{\mathrm{LDA}}$), classification is based on Euclidean distances.

• Classification can be done in a reduced-rank space of dimension R < K − 1, by using the first R discriminant directions $\{\beta_k\}_{k=1}^{R}$.
As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain,

$$\left\|(x_i - \mu_k)\mathbf{B}_{\mathrm{OS}}\right\|_{\Sigma_{W\Omega}}^2 - 2\log(\hat{\pi}_k) ,$$

where $\hat{\pi}_k$ is the estimated class prior, and $\|\cdot\|_S$ is the Mahalanobis distance assuming within-class covariance S. If classification is done in the p-LDA domain,

$$\left\|(x_i - \mu_k)\mathbf{B}_{\mathrm{OS}}\mathbf{A}^{-1}\left(\mathbf{I}_{K-1} - \mathbf{A}^2\right)^{-\frac{1}{2}}\right\|_2^2 - 2\log(\hat{\pi}_k) ,$$

which is a plain Euclidean distance.
4.2.3 Posterior Probability Evaluation
Let $d(x, \mu_k)$ be the distance between x and $\mu_k$, defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities $p(y_k = 1|x)$ can be estimated as

$$p(y_k = 1|x) \propto \exp\left(-\frac{d(x, \mu_k)}{2}\right) \propto \hat{\pi}_k \exp\left(-\frac{1}{2}\left\|(x - \mu_k)\mathbf{B}_{\mathrm{OS}}\mathbf{A}^{-1}\left(\mathbf{I}_{K-1} - \mathbf{A}^2\right)^{-\frac{1}{2}}\right\|_2^2\right) . \quad (4.20)$$
Those probabilities must be normalized to ensure that they sum to one. When the distances $d(x, \mu_k)$ take large values, $\exp\left(-\frac{d(x,\mu_k)}{2}\right)$ can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is to shift all the distances by the smallest one before exponentiating:

$$p(y_k = 1|x) = \frac{\hat{\pi}_k \exp\left(-\frac{d(x,\mu_k)}{2}\right)}{\displaystyle\sum_\ell \hat{\pi}_\ell \exp\left(-\frac{d(x,\mu_\ell)}{2}\right)} = \frac{\hat{\pi}_k \exp\left(\frac{-d(x,\mu_k) + d_{\min}}{2}\right)}{\displaystyle\sum_\ell \hat{\pi}_\ell \exp\left(\frac{-d(x,\mu_\ell) + d_{\min}}{2}\right)} ,$$

where $d_{\min} = \min_k d(x, \mu_k)$, so that every exponent is non-positive and the term corresponding to the closest centroid does not underflow.
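A sketch of this normalization trick (our own illustration; `posteriors` is a hypothetical helper operating on a matrix of squared distances, one row per sample and one column per class):

```python
import numpy as np

def posteriors(d2, prior):
    """Class posteriors from distances as in (4.18)-(4.20).

    Subtracting the smallest distance of each row before exponentiating
    keeps every exponent non-positive, so the numerator of the closest
    class is of order one and the 0/0 underflow is avoided."""
    shift = d2.min(axis=1, keepdims=True)          # d_min per sample
    num = prior * np.exp(-(d2 - shift) / 2.0)
    return num / num.sum(axis=1, keepdims=True)
```

The shift cancels between numerator and denominator, so the returned probabilities are mathematically identical to the unshifted expression.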
4.2.4 Graphical Representation
Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but can suffice to inspect the data. This can be accomplished by plotting the first two or three dimensions of the regression fits $\mathbf{X}_{\mathrm{OS}}$, or of the discriminant variates $\mathbf{X}_{\mathrm{LDA}}$, depending on whether we are presenting the dataset in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.
4.3 From Sparse Optimal Scoring to Sparse LDA
The equivalence stated in Section 4.1 holds for quadratic penalties of the form $\beta^\top \Omega \beta$, under the assumption that $\mathbf{Y}^\top \mathbf{Y}$ and $\mathbf{X}^\top \mathbf{X} + \lambda\Omega$ are full rank (fulfilled when there are no empty classes and Ω is positive definite). Quadratic penalties have interesting properties but, as recalled in Section 2.3, they do not induce sparsity. In this respect, $L_1$ penalties are preferable, but they lack a connection, such as the one stated by Hastie et al. (1995), between p-LDA and p-OS.
In this section, we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces groups of zeroes in the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form $\mathbf{B}^\top \Omega \mathbf{B}$.
4.3.1 A Quadratic Variational Form
Quadratic variational forms of the Lasso and group-Lasso were proposed shortly after the original Lasso paper (Tibshirani, 1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet, 1998; Canu and Grandvalet, 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty; they are now often outperformed by more efficient strategies (Bach et al., 2012).
Our formulation of the group-Lasso is shown below:

$$\min_{\tau \in \mathbb{R}^p} \min_{\mathbf{B} \in \mathbb{R}^{p \times (K-1)}} \; J(\mathbf{B}) + \lambda \sum_{j=1}^{p} w_j^2 \frac{\left\|\beta^j\right\|_2^2}{\tau_j} \quad (4.21a)$$
$$\text{s.t. } \sum_j \tau_j - \sum_j w_j \left\|\beta^j\right\|_2 \le 0 \quad (4.21b)$$
$$\tau_j \ge 0 , \; j = 1, \ldots, p , \quad (4.21c)$$

where $\mathbf{B} \in \mathbb{R}^{p \times (K-1)}$ is a matrix composed of the row vectors $\beta^j \in \mathbb{R}^{K-1}$, that is, $\mathbf{B} = \left(\beta^{1\top}, \ldots, \beta^{p\top}\right)^\top$, and the $w_j$ are predefined nonnegative weights. The cost function $J(\mathbf{B})$ in our context is the OS regression loss $\|\mathbf{Y}\Theta - \mathbf{X}\mathbf{B}\|_F^2$; from now on, for simplicity, we keep the generic notation $J(\mathbf{B})$. Here and in what follows, $b/\tau$ is defined by continuation at zero: $b/0 = +\infty$ if $b \neq 0$, and $0/0 = 0$. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet, 1999; Bach et al., 2012, and references therein).
The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic penalty as the convex hull of a family of quadratic penalties indexed by the variables $\tau_j$, as graphically shown in Figure 4.1.
Let us start by proving the equivalence between our variational formulation and the standard group-Lasso (an alternative variational formulation is detailed and demonstrated in Appendix D).
Lemma 4.1. The quadratic penalty in $\boldsymbol\beta^j$ in (4.21) acts as the group-Lasso penalty $\lambda\sum_{j=1}^{p} w_j \|\boldsymbol\beta^j\|_2$.
Proof. The Lagrangian of Problem (4.21) is:

$$\mathcal{L} = J(\mathbf{B}) + \lambda \sum_{j=1}^{p} \frac{w_j^2\|\boldsymbol\beta^j\|_2^2}{\tau_j} + \nu_0\Big(\sum_{j=1}^{p}\tau_j - \sum_{j=1}^{p} w_j\|\boldsymbol\beta^j\|_2\Big) - \sum_{j=1}^{p}\nu_j\tau_j\ .$$
Figure 4.1: Graphical representation of the variational approach to the group-Lasso.
Thus, the first-order optimality conditions for $\tau_j$ are:

$$\frac{\partial\mathcal{L}}{\partial\tau_j}(\tau_j^\star) = 0 \;\Leftrightarrow\; -\lambda\,\frac{w_j^2\|\boldsymbol\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0$$
$$\Leftrightarrow\; -\lambda\, w_j^2\|\boldsymbol\beta^j\|_2^2 + \nu_0\,\tau_j^{\star 2} - \nu_j\,\tau_j^{\star 2} = 0$$
$$\Rightarrow\; -\lambda\, w_j^2\|\boldsymbol\beta^j\|_2^2 + \nu_0\,\tau_j^{\star 2} = 0\ .$$
The last line is obtained from complementary slackness, which implies here $\nu_j\tau_j^\star = 0$. Complementary slackness states that $\nu_j\, g_j(\tau_j^\star) = 0$, where $\nu_j$ is the Lagrange multiplier for the constraint $g_j(\tau_j)\le 0$; here $g_j(\tau_j) = -\tau_j$, so that $\nu_j\tau_j^\star = 0$. As a result, the optimal value of $\tau_j$ is:
$$\tau_j^\star = \sqrt{\frac{\lambda\, w_j^2\|\boldsymbol\beta^j\|_2^2}{\nu_0}} = \sqrt{\frac{\lambda}{\nu_0}}\; w_j\|\boldsymbol\beta^j\|_2\ . \tag{4.22}$$
We note that $\nu_0 \neq 0$ if there is at least one coefficient $\beta^j_k \neq 0$; thus the inequality constraint (4.21b) is at bound (due to complementary slackness):
$$\sum_{j=1}^{p}\tau_j^\star - \sum_{j=1}^{p} w_j\|\boldsymbol\beta^j\|_2 = 0\ , \tag{4.23}$$
so that $\tau_j^\star = w_j\|\boldsymbol\beta^j\|_2$. Plugging this value into (4.21a), we conclude that Problem (4.21) is equivalent to the standard group-Lasso:
$$\min_{\mathbf{B}\in\mathbb{R}^{p\times(K-1)}}\ J(\mathbf{B}) + \lambda\sum_{j=1}^{p} w_j\|\boldsymbol\beta^j\|_2\ . \tag{4.24}$$
We have thus presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.
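As a quick numerical check of Lemma 4.1 (a sketch on arbitrary random data, not part of the GLOSS code), the variational objective evaluated at $\tau_j^\star = w_j\|\boldsymbol\beta^j\|_2$ recovers the group-Lasso penalty, and no feasible $\tau$ does better:

```python
import numpy as np

rng = np.random.default_rng(0)
p, K = 6, 4
B = rng.standard_normal((p, K - 1))
B[2] = 0.0                          # one zeroed row, as group sparsity would produce
w = rng.uniform(0.5, 2.0, p)        # predefined nonnegative weights

norms = np.linalg.norm(B, axis=1)   # ||beta^j||_2 for each row
group_lasso = np.sum(w * norms)     # standard penalty (4.24), lambda omitted

def variational(tau):
    # sum_j w_j^2 ||beta^j||_2^2 / tau_j, with b/0 := +inf if b != 0 and 0/0 := 0
    a = w * norms
    out = 0.0
    for aj, tj in zip(a, tau):
        if tj > 0:
            out += aj ** 2 / tj
        elif aj > 0:
            return np.inf
    return out

tau_star = w * norms                # optimal tau from the proof of Lemma 4.1
assert np.isclose(variational(tau_star), group_lasso)

# Any feasible tau (sum_j tau_j <= sum_j w_j ||beta^j||_2) does no better
for _ in range(100):
    tau = rng.uniform(0.01, 1.0, p)
    tau *= tau_star.sum() / tau.sum()     # rescale so the constraint is tight
    assert variational(tau) >= group_lasso - 1e-9
```

The feasibility check is a direct consequence of the Cauchy–Schwarz inequality applied to the terms $a_j^2/\tau_j$.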
With Lemma 4.1, we have proved that, under constraints (4.21b)–(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently written as $\lambda\,\mathrm{tr}\big(\mathbf{B}^\top\boldsymbol\Omega\mathbf{B}\big)$, where
$$\boldsymbol\Omega = \mathrm{diag}\left(\frac{w_1^2}{\tau_1},\, \frac{w_2^2}{\tau_2},\, \dots,\, \frac{w_p^2}{\tau_p}\right)\ , \tag{4.25}$$

with $\tau_j = w_j\|\boldsymbol\beta^j\|_2$, resulting in the diagonal components of $\boldsymbol\Omega$:

$$(\boldsymbol\Omega)_{jj} = \frac{w_j}{\|\boldsymbol\beta^j\|_2}\ . \tag{4.26}$$
As stated at the beginning of this section, the equivalence between p-LDA and p-OS problems is thus demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set strategy described in Chapter 5.
The first property states that the quadratic formulation is convex when $J$ is convex, thus providing an easy control of optimality and convergence.
Lemma 4.2. If $J$ is convex, Problem (4.21) is convex.
Proof. The function $g(\boldsymbol\beta,\tau) = \|\boldsymbol\beta\|_2^2/\tau$, known as the perspective function of $f(\boldsymbol\beta) = \|\boldsymbol\beta\|_2^2$, is jointly convex in $(\boldsymbol\beta,\tau)$ (see e.g. Boyd and Vandenberghe, 2004, Chapter 3), and the constraints (4.21b)–(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to $(\mathbf{B},\boldsymbol\tau)$.
In what follows, $J$ will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.
Lemma 4.3. For all $\mathbf{B}\in\mathbb{R}^{p\times(K-1)}$, the subdifferential of the objective function of Problem (4.24) is

$$\left\{\mathbf{V}\in\mathbb{R}^{p\times(K-1)} : \mathbf{V} = \frac{\partial J(\mathbf{B})}{\partial\mathbf{B}} + \lambda\mathbf{G}\right\}\ , \tag{4.27}$$
where $\mathbf{G}\in\mathbb{R}^{p\times(K-1)}$ is a matrix composed of row vectors $\mathbf{g}^j\in\mathbb{R}^{K-1}$, $\mathbf{G} = \big(\mathbf{g}^{1\top},\dots,\mathbf{g}^{p\top}\big)^\top$, defined as follows. Let $\mathcal{S}(\mathbf{B})$ denote the columnwise support of $\mathbf{B}$, $\mathcal{S}(\mathbf{B}) = \big\{j\in\{1,\dots,p\} : \|\boldsymbol\beta^j\|_2 \neq 0\big\}$; then we have:

$$\forall j\in\mathcal{S}(\mathbf{B})\ ,\quad \mathbf{g}^j = w_j\|\boldsymbol\beta^j\|_2^{-1}\boldsymbol\beta^j \tag{4.28}$$
$$\forall j\notin\mathcal{S}(\mathbf{B})\ ,\quad \|\mathbf{g}^j\|_2 \le w_j\ . \tag{4.29}$$
This condition results in an equality for the "active" non-zero vectors $\boldsymbol\beta^j$ and an inequality for the other ones; both provide essential building blocks of our algorithm.
Proof. When $\|\boldsymbol\beta^j\|_2 \neq 0$, the gradient of the penalty with respect to $\boldsymbol\beta^j$ is:

$$\frac{\partial}{\partial\boldsymbol\beta^j}\Big(\lambda\sum_{m=1}^{p} w_m\|\boldsymbol\beta^m\|_2\Big) = \lambda\, w_j\,\frac{\boldsymbol\beta^j}{\|\boldsymbol\beta^j\|_2}\ . \tag{4.30}$$

At $\|\boldsymbol\beta^j\|_2 = 0$, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al., 2011):

$$\partial_{\boldsymbol\beta^j}\Big(\lambda\sum_{m=1}^{p} w_m\|\boldsymbol\beta^m\|_2\Big) = \partial_{\boldsymbol\beta^j}\big(\lambda\, w_j\|\boldsymbol\beta^j\|_2\big) = \big\{\lambda\, w_j\mathbf{v} : \mathbf{v}\in\mathbb{R}^{K-1},\ \|\mathbf{v}\|_2\le 1\big\}\ . \tag{4.31}$$
This gives expression (4.29).
Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if $J$ is strictly convex. All critical points $\mathbf{B}^\star$ of the objective function verifying the following conditions are global minima:

$$\forall j\in\mathcal{S}^\star\ ,\quad \frac{\partial J(\mathbf{B}^\star)}{\partial\boldsymbol\beta^j} + \lambda\, w_j\,\|\boldsymbol\beta^{\star j}\|_2^{-1}\,\boldsymbol\beta^{\star j} = \mathbf{0} \tag{4.32a}$$
$$\forall j\notin\mathcal{S}^\star\ ,\quad \left\|\frac{\partial J(\mathbf{B}^\star)}{\partial\boldsymbol\beta^j}\right\|_2 \le \lambda\, w_j\ , \tag{4.32b}$$

where $\mathcal{S}^\star \subseteq \{1,\dots,p\}$ denotes the set of non-zero row vectors of $\mathbf{B}^\star$, and $\bar{\mathcal{S}}^\star$ is its complement.
Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled through a direct analysis of the variational problem (4.21).
4.3.2 Group-Lasso OS as Penalized LDA
With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.
Proposition 4.1. The group-Lasso OS problem

$$\mathbf{B}_{OS} = \operatorname*{argmin}_{\mathbf{B}\in\mathbb{R}^{p\times(K-1)}}\ \min_{\boldsymbol\Theta\in\mathbb{R}^{K\times(K-1)}}\ \frac12\big\|\mathbf{Y}\boldsymbol\Theta - \mathbf{X}\mathbf{B}\big\|_F^2 + \lambda\sum_{j=1}^{p} w_j\|\boldsymbol\beta^j\|_2$$
$$\text{s.t.}\quad n^{-1}\,\boldsymbol\Theta^\top\mathbf{Y}^\top\mathbf{Y}\boldsymbol\Theta = \mathbf{I}_{K-1}$$
is equivalent to the penalized LDA problem

$$\mathbf{B}_{LDA} = \operatorname*{argmax}_{\mathbf{B}\in\mathbb{R}^{p\times(K-1)}}\ \mathrm{tr}\big(\mathbf{B}^\top\boldsymbol\Sigma_B\mathbf{B}\big) \quad\text{s.t.}\quad \mathbf{B}^\top\big(\boldsymbol\Sigma_W + n^{-1}\lambda\boldsymbol\Omega\big)\mathbf{B} = \mathbf{I}_{K-1}\ ,$$

where

$$\boldsymbol\Omega = \mathrm{diag}\left(\frac{w_1^2}{\tau_1},\dots,\frac{w_p^2}{\tau_p}\right), \quad\text{with}\quad \Omega_{jj} = \begin{cases} +\infty & \text{if } \boldsymbol\beta^j_{OS} = \mathbf{0}\ ,\\[2pt] w_j\,\big\|\boldsymbol\beta^j_{OS}\big\|_2^{-1} & \text{otherwise.}\end{cases} \tag{4.33}$$

That is, $\mathbf{B}_{LDA} = \mathbf{B}_{OS}\,\mathrm{diag}\big(\alpha_k^{-1}(1-\alpha_k^2)^{-1/2}\big)$, where $\alpha_k\in(0,1)$ is the $k$th leading eigenvalue of

$$n^{-1}\,\mathbf{Y}^\top\mathbf{X}\big(\mathbf{X}^\top\mathbf{X} + \lambda\boldsymbol\Omega\big)^{-1}\mathbf{X}^\top\mathbf{Y}\ .$$
Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.
The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al., 2008; Clemmensen et al., 2011) for $K = 2$, that is, for binary classification, or more generally for a single discriminant direction. Note however that it leads to a slightly different decision rule if the decision threshold is chosen a priori according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold any more, since the Lasso penalty does not result in an equivalent quadratic penalty of the simple form $\mathrm{tr}\big(\mathbf{B}^\top\boldsymbol\Omega\mathbf{B}\big)$.
5 GLOSS Algorithm
The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems, whose sizes are incrementally increased or decreased (Osborne et al., 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer, 2008). We adapt this algorithmic framework to the variational form (4.21), with $J(\mathbf{B}) = \frac12\|\mathbf{Y}\boldsymbol\Theta - \mathbf{X}\mathbf{B}\|_2^2$.
The algorithm belongs to the working set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say $\mathbf{B} = \mathbf{0}$, thus defining the set $\mathcal{A}$ of "active" variables, currently identified as non-zero. Then it iterates the three steps summarized below.
1. Update the coefficient matrix $\mathbf{B}$ within the current active set $\mathcal{A}$, where the optimization problem is smooth. First the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more $\boldsymbol\beta^j$ may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to the inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. Otherwise, the variable corresponding to the greatest violation is added to the active set.
This mechanism is represented graphically in Figure 5.1 as a block diagram, and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. To use the alternative variational approach from Appendix D instead, Equations (4.21), (4.32a) and (4.32b) must be replaced by (D.1), (D.10a) and (D.10b), respectively.
5.1 Regression Coefficients Updates
Step 1 of Algorithm 1 updates the coefficient matrix $\mathbf{B}$ within the current active set $\mathcal{A}$. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving $(K-1)$ independent $\mathrm{card}(\mathcal{A})$-dimensional problems instead of a single $(K-1)\times\mathrm{card}(\mathcal{A})$-dimensional problem. The interaction between the $(K-1)$ problems is relegated to the common adaptive quadratic penalty $\boldsymbol\Omega$. This decomposition is especially attractive, as we then solve $(K-1)$ similar systems:

$$\big(\mathbf{X}_{\mathcal{A}}^\top\mathbf{X}_{\mathcal{A}} + \lambda\boldsymbol\Omega\big)\boldsymbol\beta_k = \mathbf{X}_{\mathcal{A}}^\top\mathbf{Y}\boldsymbol\theta^0_k\ , \tag{5.1}$$
Figure 5.1: GLOSS block diagram.
Algorithm 1: Adaptively Penalized Optimal Scoring

Input: $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{B}$, $\lambda$.
Initialize $\mathcal{A} \leftarrow \{j \in \{1,\dots,p\} : \|\boldsymbol\beta^j\|_2 > 0\}$; $\boldsymbol\Theta^0$ such that $n^{-1}\boldsymbol\Theta^{0\top}\mathbf{Y}^\top\mathbf{Y}\boldsymbol\Theta^0 = \mathbf{I}_{K-1}$; convergence $\leftarrow$ false.
repeat
    % Step 1: solve (4.21) in B, assuming A optimal
    repeat
        $\boldsymbol\Omega \leftarrow \mathrm{diag}(\boldsymbol\Omega_{\mathcal{A}})$, with $\omega_j \leftarrow \|\boldsymbol\beta^j\|_2^{-1}$
        $\mathbf{B}_{\mathcal{A}} \leftarrow \big(\mathbf{X}_{\mathcal{A}}^\top\mathbf{X}_{\mathcal{A}} + \lambda\boldsymbol\Omega\big)^{-1}\mathbf{X}_{\mathcal{A}}^\top\mathbf{Y}\boldsymbol\Theta^0$
    until condition (4.32a) holds for all $j \in \mathcal{A}$
    % Step 2: identify inactivated variables
    for all $j \in \mathcal{A}$ such that $\|\boldsymbol\beta^j\|_2 = 0$ do
        if optimality condition (4.32b) holds then
            $\mathcal{A} \leftarrow \mathcal{A}\setminus\{j\}$; go back to Step 1
        end if
    end for
    % Step 3: check the greatest violation of optimality condition (4.32b) in $\bar{\mathcal{A}}$
    $j^\star \leftarrow \operatorname{argmax}_{j\in\bar{\mathcal{A}}} \big\|\partial J/\partial\boldsymbol\beta^j\big\|_2$
    if $\big\|\partial J/\partial\boldsymbol\beta^{j^\star}\big\|_2 < \lambda$ then
        convergence $\leftarrow$ true    % B is optimal
    else
        $\mathcal{A} \leftarrow \mathcal{A}\cup\{j^\star\}$
    end if
until convergence
$(\mathbf{s},\mathbf{V}) \leftarrow$ eigenanalyze$\big(\boldsymbol\Theta^{0\top}\mathbf{Y}^\top\mathbf{X}_{\mathcal{A}}\mathbf{B}_{\mathcal{A}}\big)$, that is, $\boldsymbol\Theta^{0\top}\mathbf{Y}^\top\mathbf{X}_{\mathcal{A}}\mathbf{B}_{\mathcal{A}}\mathbf{v}_k = s_k\mathbf{v}_k$, $k = 1,\dots,K-1$
$\boldsymbol\Theta \leftarrow \boldsymbol\Theta^0\mathbf{V}$; $\mathbf{B} \leftarrow \mathbf{B}\mathbf{V}$; $\alpha_k \leftarrow n^{-1/2}s_k^{1/2}$, $k = 1,\dots,K-1$
Output: $\boldsymbol\Theta$, $\mathbf{B}$, $\boldsymbol\alpha$.
where $\mathbf{X}_{\mathcal{A}}$ denotes the columns of $\mathbf{X}$ indexed by $\mathcal{A}$, and $\boldsymbol\beta_k$ and $\boldsymbol\theta^0_k$ denote the $k$th columns of $\mathbf{B}$ and $\boldsymbol\Theta^0$, respectively. These linear systems only differ in their right-hand-side term, so that a single Cholesky decomposition suffices to solve all of them, whereas a blockwise Newton–Raphson method based on the standard group-Lasso formulation would result in a different "penalty" $\boldsymbol\Omega$ for each system.
5.1.1 Cholesky Decomposition
Dropping the subscripts and considering the $(K-1)$ systems together, (5.1) leads to:

$$\big(\mathbf{X}^\top\mathbf{X} + \lambda\boldsymbol\Omega\big)\mathbf{B} = \mathbf{X}^\top\mathbf{Y}\boldsymbol\Theta\ . \tag{5.2}$$

Defining the Cholesky decomposition as $\mathbf{C}^\top\mathbf{C} = \mathbf{X}^\top\mathbf{X} + \lambda\boldsymbol\Omega$, (5.2) is solved efficiently as follows:

$$\mathbf{C}^\top\mathbf{C}\mathbf{B} = \mathbf{X}^\top\mathbf{Y}\boldsymbol\Theta$$
$$\mathbf{C}\mathbf{B} = \mathbf{C}^\top\backslash\,\mathbf{X}^\top\mathbf{Y}\boldsymbol\Theta$$
$$\mathbf{B} = \mathbf{C}\backslash\big(\mathbf{C}^\top\backslash\,\mathbf{X}^\top\mathbf{Y}\boldsymbol\Theta\big)\ , \tag{5.3}$$

where the symbol "$\backslash$" is the matlab mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
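As a sketch of this update (in Python/NumPy–SciPy rather than the matlab of the package; the data, $\boldsymbol\Theta$ and $\boldsymbol\Omega$ below are arbitrary stand-ins), a single Cholesky factorization serves all $K-1$ right-hand sides:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
n, p, K = 50, 8, 4
X = rng.standard_normal((n, p))
Y = np.eye(K)[rng.integers(0, K, n)]       # class-indicator matrix
Theta = rng.standard_normal((K, K - 1))    # score matrix (arbitrary here)
lam = 0.5
omega = rng.uniform(0.5, 2.0, p)           # current diagonal of Omega

# One Cholesky factorization of X'X + lam*Omega serves all K-1 systems (5.2)
A = X.T @ X + lam * np.diag(omega)
c, low = cho_factor(A)
B = cho_solve((c, low), X.T @ Y @ Theta)   # solves A B = X'Y Theta column-wise

assert np.allclose(A @ B, X.T @ Y @ Theta)
```

`cho_solve` plays the role of the two triangular mldivide solves in (5.3).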
5.1.2 Numerical Stability
The OS regression coefficients are obtained by (5.2), where the penalizer $\boldsymbol\Omega$ is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of $\boldsymbol\Omega$ reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of $\mathbf{X}^\top\mathbf{X} + \lambda\boldsymbol\Omega$. This difficulty can be avoided using the following equivalent expression:

$$\mathbf{B} = \boldsymbol\Omega^{-1/2}\big(\boldsymbol\Omega^{-1/2}\mathbf{X}^\top\mathbf{X}\boldsymbol\Omega^{-1/2} + \lambda\mathbf{I}\big)^{-1}\boldsymbol\Omega^{-1/2}\mathbf{X}^\top\mathbf{Y}\boldsymbol\Theta^0\ , \tag{5.4}$$

where the conditioning of $\boldsymbol\Omega^{-1/2}\mathbf{X}^\top\mathbf{X}\boldsymbol\Omega^{-1/2} + \lambda\mathbf{I}$ is always well-behaved, provided $\mathbf{X}$ is appropriately normalized (recall that $0 \le 1/\omega_j \le 1$). This stabler expression demands more computation and is thus reserved for cases with large $\omega_j$ values; our code is otherwise based on expression (5.2).
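A small sketch (arbitrary data; the `omega` entries are chosen artificially large to mimic variables about to leave the active set) checking that the preconditioned expression (5.4) matches the direct solve of (5.2):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 6
X = rng.standard_normal((n, p))
M = rng.standard_normal((n, 2))             # stands for Y @ Theta0
lam = 0.7
omega = np.array([1e6, 2.0, 1e5, 1.0, 3.0, 4.0])  # large entries: exiting variables

# Direct solve of (X'X + lam*Omega) B = X'M -- can be badly conditioned
B_direct = np.linalg.solve(X.T @ X + lam * np.diag(omega), X.T @ M)

# Equivalent form (5.4): B = Om^{-1/2}(Om^{-1/2} X'X Om^{-1/2} + lam I)^{-1} Om^{-1/2} X'M
s = 1.0 / np.sqrt(omega)                    # diagonal of Omega^{-1/2}
inner = s[:, None] * (X.T @ X) * s[None, :] + lam * np.eye(p)
B_stable = s[:, None] * np.linalg.solve(inner, s[:, None] * (X.T @ M))

assert np.allclose(B_direct, B_stable, rtol=1e-5, atol=1e-6)
```

The substitution $\mathbf{B} = \boldsymbol\Omega^{-1/2}\mathbf{Z}$ in (5.2) shows the two expressions are algebraically identical; the second keeps the matrix to factor well conditioned.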
5.2 Score Matrix
The optimal score matrix $\boldsymbol\Theta$ is made of the $K-1$ leading eigenvectors of $\mathbf{Y}^\top\mathbf{X}\big(\mathbf{X}^\top\mathbf{X} + \lambda\boldsymbol\Omega\big)^{-1}\mathbf{X}^\top\mathbf{Y}$. This eigen-analysis is actually solved in the form $\boldsymbol\Theta^\top\mathbf{Y}^\top\mathbf{X}\big(\mathbf{X}^\top\mathbf{X} + \lambda\boldsymbol\Omega\big)^{-1}\mathbf{X}^\top\mathbf{Y}\boldsymbol\Theta$ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of $\big(\mathbf{X}^\top\mathbf{X} + \lambda\boldsymbol\Omega\big)^{-1}$, which involves the inversion of a $p\times p$ matrix. Let $\boldsymbol\Theta^0$ be an arbitrary $K\times(K-1)$ matrix whose range includes the $K-1$ leading eigenvectors of $\mathbf{Y}^\top\mathbf{X}\big(\mathbf{X}^\top\mathbf{X} + \lambda\boldsymbol\Omega\big)^{-1}\mathbf{X}^\top\mathbf{Y}$.¹ Then, solving the $K-1$ systems (5.3) provides the value of $\mathbf{B}^0 = \big(\mathbf{X}^\top\mathbf{X} + \lambda\boldsymbol\Omega\big)^{-1}\mathbf{X}^\top\mathbf{Y}\boldsymbol\Theta^0$. This $\mathbf{B}^0$ matrix can be identified in the expression to eigenanalyze, as

$$\boldsymbol\Theta^{0\top}\mathbf{Y}^\top\mathbf{X}\big(\mathbf{X}^\top\mathbf{X} + \lambda\boldsymbol\Omega\big)^{-1}\mathbf{X}^\top\mathbf{Y}\boldsymbol\Theta^0 = \boldsymbol\Theta^{0\top}\mathbf{Y}^\top\mathbf{X}\mathbf{B}^0\ .$$

Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the $(K-1)\times(K-1)$ matrix $\boldsymbol\Theta^{0\top}\mathbf{Y}^\top\mathbf{X}\mathbf{B}^0 = \mathbf{V}\boldsymbol\Lambda\mathbf{V}^\top$. Defining $\boldsymbol\Theta = \boldsymbol\Theta^0\mathbf{V}$, we have $\boldsymbol\Theta^\top\mathbf{Y}^\top\mathbf{X}\big(\mathbf{X}^\top\mathbf{X} + \lambda\boldsymbol\Omega\big)^{-1}\mathbf{X}^\top\mathbf{Y}\boldsymbol\Theta = \boldsymbol\Lambda$, and when $\boldsymbol\Theta^0$ is chosen such that $n^{-1}\boldsymbol\Theta^{0\top}\mathbf{Y}^\top\mathbf{Y}\boldsymbol\Theta^0 = \mathbf{I}_{K-1}$, we also have $n^{-1}\boldsymbol\Theta^\top\mathbf{Y}^\top\mathbf{Y}\boldsymbol\Theta = \mathbf{I}_{K-1}$, so that the constraints of the p-OS problem still hold. Hence, assuming that the diagonal elements of $\boldsymbol\Lambda$ are sorted in decreasing order, $\boldsymbol\Theta$ is an optimal solution to the p-OS problem. Finally, once $\boldsymbol\Theta$ has been computed, the corresponding optimal regression coefficients $\mathbf{B}$ satisfying (5.2) are simply recovered using the mapping from $\boldsymbol\Theta^0$ to $\boldsymbol\Theta$, that is, $\mathbf{B} = \mathbf{B}^0\mathbf{V}$. Appendix E details why this computational trick, described here for quadratic penalties, can be applied to the group-Lasso, for which $\boldsymbol\Omega$ is defined by a variational formulation.
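The whole procedure can be sketched as follows (a Python stand-in on arbitrary random data; the variable names are ours, not the package's):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, K = 60, 5, 4
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                          # X is assumed centered
y = np.arange(n) % K                         # balanced labels, all classes present
Y = np.eye(K)[y]
lam = 0.3
Om = np.diag(rng.uniform(0.5, 2.0, p))       # current quadratic penalty

# Theta0 = sqrt(n) (Y'Y)^{-1/2} U, with U orthonormal and orthogonal to 1_K,
# so that n^{-1} Theta0' Y'Y Theta0 = I (footnote 1; Y'Y is diagonal for indicators)
ones = np.ones((K, 1)) / np.sqrt(K)
Q, _ = np.linalg.qr(np.hstack([ones, rng.standard_normal((K, K - 1))]))
U = Q[:, 1:]
Theta0 = np.sqrt(n) * np.diag(1.0 / np.sqrt(Y.sum(axis=0))) @ U

# B0 solves the K-1 penalized least squares systems (5.2)
B0 = np.linalg.solve(X.T @ X + lam * Om, X.T @ Y @ Theta0)

# Eigen-decompose the small (K-1)x(K-1) symmetric matrix Theta0' Y'X B0 = V Lambda V'
M = Theta0.T @ Y.T @ X @ B0
evals, V = np.linalg.eigh(M)
order = np.argsort(evals)[::-1]              # sort eigenvalues in decreasing order
evals, V = evals[order], V[:, order]
Theta, B = Theta0 @ V, B0 @ V                # rotated scores and coefficients

# The p-OS constraint still holds, and Theta diagonalizes the target matrix
assert np.allclose(Theta.T @ Y.T @ Y @ Theta / n, np.eye(K - 1))
assert np.allclose(Theta.T @ Y.T @ X @ B, np.diag(evals))
```

Only a $(K-1)\times(K-1)$ eigenproblem is solved; the $p\times p$ inverse never appears explicitly.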
5.3 Optimality Conditions
GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix $\mathbf{B}$ and the score matrix $\boldsymbol\Theta$. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4, from which optimality conditions (4.32a) and (4.32b) can be deduced. Both expressions require the computation of the gradient of the objective function

$$\frac12\big\|\mathbf{Y}\boldsymbol\Theta - \mathbf{X}\mathbf{B}\big\|_2^2 + \lambda\sum_{j=1}^{p} w_j\|\boldsymbol\beta^j\|_2\ . \tag{5.5}$$

Let $J(\mathbf{B})$ be the data-fitting term $\frac12\|\mathbf{Y}\boldsymbol\Theta - \mathbf{X}\mathbf{B}\|_2^2$. Its gradient with respect to $\boldsymbol\beta^j$, the $j$th row of $\mathbf{B}$, is the $(K-1)$-dimensional vector

$$\frac{\partial J(\mathbf{B})}{\partial\boldsymbol\beta^j} = \mathbf{x}_j^\top\big(\mathbf{X}\mathbf{B} - \mathbf{Y}\boldsymbol\Theta\big)\ ,$$

where $\mathbf{x}_j$ is the $j$th column of $\mathbf{X}$. Hence, the first optimality condition (4.32a) can be computed for every variable $j$ as:

$$\mathbf{x}_j^\top\big(\mathbf{X}\mathbf{B} - \mathbf{Y}\boldsymbol\Theta\big) + \lambda\, w_j\,\frac{\boldsymbol\beta^j}{\|\boldsymbol\beta^j\|_2} = \mathbf{0}\ .$$
¹ As $\mathbf{X}$ is centered, $\mathbf{1}_K$ belongs to the null space of $\mathbf{Y}^\top\mathbf{X}\big(\mathbf{X}^\top\mathbf{X} + \lambda\boldsymbol\Omega\big)^{-1}\mathbf{X}^\top\mathbf{Y}$. It is thus sufficient to choose $\boldsymbol\Theta^0$ orthogonal to $\mathbf{1}_K$ to ensure that its range spans the leading eigenvectors of $\mathbf{Y}^\top\mathbf{X}\big(\mathbf{X}^\top\mathbf{X} + \lambda\boldsymbol\Omega\big)^{-1}\mathbf{X}^\top\mathbf{Y}$. In practice, to comply with this desideratum and conditions (3.5b) and (3.5c), we set $\boldsymbol\Theta^0 = \big(\mathbf{Y}^\top\mathbf{Y}\big)^{-1/2}\mathbf{U}$, where $\mathbf{U}$ is a $K\times(K-1)$ matrix whose columns are orthonormal vectors orthogonal to $\mathbf{1}_K$.
The second optimality condition (4.32b) can be computed for every variable $j$ as:

$$\big\|\mathbf{x}_j^\top\big(\mathbf{X}\mathbf{B} - \mathbf{Y}\boldsymbol\Theta\big)\big\|_2 \le \lambda\, w_j\ .$$
5.4 Active and Inactive Sets
The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let $\mathcal{A}$ be the active set, containing the variables that have already been considered relevant. A variable $j$ is considered for inclusion into the active set if it violates the second optimality condition. We proceed one variable at a time, choosing the one that is expected to produce the greatest decrease in the objective function:

$$j^\star = \operatorname*{argmax}_{j\notin\mathcal{A}}\ \max\Big(\big\|\mathbf{x}_j^\top\big(\mathbf{X}\mathbf{B} - \mathbf{Y}\boldsymbol\Theta\big)\big\|_2 - \lambda\, w_j\,,\ 0\Big)\ .$$
The exclusion of a variable belonging to the active set $\mathcal{A}$ is considered if the norm $\|\boldsymbol\beta^j\|_2$ is small and if, after setting $\boldsymbol\beta^j$ to zero, the following optimality condition holds:

$$\big\|\mathbf{x}_j^\top\big(\mathbf{X}\mathbf{B} - \mathbf{Y}\boldsymbol\Theta\big)\big\|_2 \le \lambda\, w_j\ .$$
The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.
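A sketch of these tests (the helper `check_optimality` is ours, written directly from conditions (4.32a)–(4.32b); it is not the package's code):

```python
import numpy as np

def check_optimality(X, Y, Theta, B, lam, w, tol=1e-8):
    """Evaluate (4.32a)/(4.32b) for J(B) = 1/2 ||Y Theta - X B||_2^2.
    Returns stationarity residuals on active rows, slacks on inactive rows,
    and the worst violator of (4.32b) (or None if there is no violation)."""
    G = X.T @ (X @ B - Y @ Theta)           # row j is x_j'(XB - Y Theta)
    norms = np.linalg.norm(B, axis=1)
    active = norms > tol
    # (4.32a): gradient + lam w_j beta^j/||beta^j||_2 must vanish on active rows
    stat = G[active] + (lam * w[active] / norms[active])[:, None] * B[active]
    # (4.32b): ||x_j'(XB - Y Theta)||_2 <= lam w_j on inactive rows
    slack = lam * w[~active] - np.linalg.norm(G[~active], axis=1)
    viol = np.where(~active, np.linalg.norm(G, axis=1) - lam * w, -np.inf)
    j_star = int(np.argmax(viol)) if viol.max() > 0 else None
    return stat, slack, j_star
```

At $\mathbf{B} = \mathbf{0}$, the returned `j_star` is the variable with the largest $\|\mathbf{x}_j^\top\mathbf{Y}\boldsymbol\Theta\|_2 - \lambda w_j$, as in the inclusion rule above.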
5.5 Penalty Parameter
The penalty parameter can be specified by the user, in which case GLOSS solves the problem with this value of $\lambda$. The other strategy is to compute the solution path for several values of $\lambda$: GLOSS then looks for the maximum value of the penalty parameter $\lambda_{\max}$ such that $\mathbf{B} \neq \mathbf{0}$, and solves the p-OS problem for decreasing values of $\lambda$, until a prescribed number of features are declared active.
The maximum value of the penalty parameter $\lambda_{\max}$ corresponding to a null $\mathbf{B}$ matrix is obtained by evaluating the optimality condition (4.32b) at $\mathbf{B} = \mathbf{0}$:

$$\lambda_{\max} = \max_{j\in\{1,\dots,p\}}\ \frac{1}{w_j}\,\big\|\mathbf{x}_j^\top\mathbf{Y}\boldsymbol\Theta^0\big\|_2\ .$$
The algorithm then computes a series of solutions along the regularization path defined by a series of penalties $\lambda_1 = \lambda_{\max} > \dots > \lambda_t > \dots > \lambda_T = \lambda_{\min} \ge 0$, by regularly decreasing the penalty, $\lambda_{t+1} = \lambda_t/2$, and using a warm-start strategy where the feasible initial guess for $\mathbf{B}(\lambda_{t+1})$ is initialized with $\mathbf{B}(\lambda_t)$. The final penalty parameter $\lambda_{\min}$ is reached during the optimization process when the maximum number of desired active variables is attained (by default, the minimum of $n$ and $p$).
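A sketch of the path construction (the helpers are ours, on arbitrary data):

```python
import numpy as np

def lambda_max(X, Y, Theta0, w):
    """Smallest lambda for which B = 0 is optimal, from (4.32b) at B = 0."""
    g = np.linalg.norm(X.T @ Y @ Theta0, axis=1)   # ||x_j' Y Theta0||_2 per variable
    return float(np.max(g / w))

def halving_path(lam_max, T):
    """Regularization path lambda_1 = lam_max > ..., with lambda_{t+1} = lambda_t / 2."""
    return [lam_max / 2 ** t for t in range(T)]
```

Each solve along the path would be warm-started from the previous solution $\mathbf{B}(\lambda_t)$.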
5.6 Options and Variants
5.6.1 Scaling Variables
As most penalization schemes, GLOSS is sensitive to the scaling of the variables. It thus makes sense to normalize them before applying the algorithm or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.
5.6.2 Sparse Variant
This version replaces some matlab commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.
5.6.3 Diagonal Variant
We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis: Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.
The equivalence proof between penalized OS and penalized LDA (Hastie et al., 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS, corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems

$$\min_{\mathbf{B}\in\mathbb{R}^{p\times(K-1)}} \big\|\mathbf{Y}\boldsymbol\Theta - \mathbf{X}\mathbf{B}\big\|_F^2 = \min_{\mathbf{B}\in\mathbb{R}^{p\times(K-1)}} \mathrm{tr}\Big(\boldsymbol\Theta^\top\mathbf{Y}^\top\mathbf{Y}\boldsymbol\Theta - 2\,\boldsymbol\Theta^\top\mathbf{Y}^\top\mathbf{X}\mathbf{B} + n\,\mathbf{B}^\top\boldsymbol\Sigma_T\mathbf{B}\Big)$$

are replaced by

$$\min_{\mathbf{B}\in\mathbb{R}^{p\times(K-1)}} \mathrm{tr}\Big(\boldsymbol\Theta^\top\mathbf{Y}^\top\mathbf{Y}\boldsymbol\Theta - 2\,\boldsymbol\Theta^\top\mathbf{Y}^\top\mathbf{X}\mathbf{B} + n\,\mathbf{B}^\top\big(\boldsymbol\Sigma_B + \mathrm{diag}(\boldsymbol\Sigma_W)\big)\mathbf{B}\Big)\ .$$

Note that this variant only requires $\mathrm{diag}(\boldsymbol\Sigma_W) + \boldsymbol\Sigma_B + n^{-1}\boldsymbol\Omega$ to be positive definite, which is a weaker requirement than $\boldsymbol\Sigma_T + n^{-1}\boldsymbol\Omega$ positive definite.
5.6.4 Elastic Net and Structured Variant
For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition
$$\begin{matrix} 7 & 8 & 9\\ 4 & 5 & 6\\ 1 & 2 & 3 \end{matrix} \qquad\longrightarrow\qquad \boldsymbol\Omega_L = \begin{pmatrix} 3 & -1 & 0 & -1 & -1 & 0 & 0 & 0 & 0\\ -1 & 5 & -1 & -1 & -1 & -1 & 0 & 0 & 0\\ 0 & -1 & 3 & 0 & -1 & -1 & 0 & 0 & 0\\ -1 & -1 & 0 & 5 & -1 & 0 & -1 & -1 & 0\\ -1 & -1 & -1 & -1 & 8 & -1 & -1 & -1 & -1\\ 0 & -1 & -1 & 0 & -1 & 5 & 0 & -1 & -1\\ 0 & 0 & 0 & -1 & -1 & 0 & 3 & -1 & 0\\ 0 & 0 & 0 & -1 & -1 & -1 & -1 & 5 & -1\\ 0 & 0 & 0 & 0 & -1 & -1 & 0 & -1 & 3 \end{pmatrix}$$

Figure 5.2: Graph and Laplacian matrix for a 3×3 image.
for their penalized discriminant analysis model to constrain the discriminant directionsto be spatially smooth
When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of the pixels of a 3×3 image, with the corresponding Laplacian matrix. The Laplacian matrix $\boldsymbol\Omega_L$ is positive semi-definite, and the penalty $\boldsymbol\beta^\top\boldsymbol\Omega_L\boldsymbol\beta$ favors, among vectors of identical $L_2$ norm, the ones having similar coefficients in the neighborhoods of the graph. For example, this penalty is 9 for the vector $(1,1,0,1,1,0,0,0,0)^\top$, which is the indicator of the neighborhood of pixel 1, and it is 21 for the vector $(-1,1,0,1,1,0,0,0,0)^\top$, with a sign mismatch between pixel 1 and its neighbors.
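A sketch that rebuilds this Laplacian and checks the quoted penalty value (the pixel numbering here is row-major from the top-left rather than the figure's bottom-left numbering, which leaves the penalty values unchanged by symmetry):

```python
import numpy as np
from itertools import product

def grid_laplacian(h, w):
    """Laplacian of the pixel graph of an h x w image with 8-connected neighbors."""
    n = h * w
    A = np.zeros((n, n))
    for r, c in product(range(h), range(w)):
        for dr, dc in product((-1, 0, 1), repeat=2):
            rr, cc = r + dr, c + dc
            if (dr, dc) != (0, 0) and 0 <= rr < h and 0 <= cc < w:
                A[r * w + c, rr * w + cc] = 1.0
    return np.diag(A.sum(axis=1)) - A        # L = D - A

L = grid_laplacian(3, 3)
assert np.allclose(L.sum(axis=1), 0)                      # Laplacian rows sum to zero
assert sorted(np.diag(L)) == [3, 3, 3, 3, 5, 5, 5, 5, 8]  # corner/edge/center degrees

beta = np.array([1., 1., 0., 1., 1., 0., 0., 0., 0.])     # indicator of a corner 2x2 block
assert beta @ L @ beta == 9.0                             # penalty value quoted in the text
```

The quadratic form equals $\sum_{(i,j)\in E}(\beta_i-\beta_j)^2$ over the edges of the graph, which explains the value 9 (the number of edges crossing the block's boundary).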
This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty is simply added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.
6 Experimental Results
This section presents comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA: Penalized LDA (PLDA) (Witten and Tibshirani, 2011), which applies a Lasso penalty within Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al., 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing parsimony capacities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites; PLDA is an R implementation and SLDA is coded in matlab. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.
6.1 Normalization
With shrunken estimates, the scaling of features has important outcomes. For the linear discriminants considered here, the two most common normalization strategies consist in setting to ones either the diagonal of the total covariance matrix $\boldsymbol\Sigma_T$ or the diagonal of the within-class covariance matrix $\boldsymbol\Sigma_W$. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing penalties with weights. The latter option is implemented in our matlab package.¹
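The two weighting strategies can be sketched as follows (the helper `penalty_weights` is ours; the package's actual weighting may differ in details):

```python
import numpy as np

def penalty_weights(X, y, scheme="total"):
    """Penalty weights w_j emulating unit-variance scaling of the variables:
    'total' uses the diagonal of Sigma_T, 'within' the diagonal of Sigma_W."""
    if scheme == "total":
        var = X.var(axis=0)                 # diagonal of the total covariance
    else:                                   # pooled within-class variances
        var = np.zeros(X.shape[1])
        for k in np.unique(y):
            Xk = X[y == k]
            var += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)
        var /= X.shape[0]
    return np.sqrt(var)
```

Penalizing variable $j$ with weight $w_j$ proportional to its standard deviation is equivalent to standardizing it beforehand, since both rescale the coefficient in the same way.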
6.2 Decision Thresholds
The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, Chapter 4) suggest investigating other decision thresholds than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.
¹ The GLOSS matlab code can be found in the software section of www.hds.utc.fr/~grandval.
6.3 Simulated Data
We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split into a training set of size n = 100, a validation set of size 100, and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be 1.7%, 6.7%, 7.3% and 30.0%, respectively. The exact definition of each setup, as provided in Witten and Tibshirani (2011), is:
Simulation 1: mean shift with independent features. There are four classes. If sample $i$ is in class $k$, then $x_i \sim \mathcal{N}(\boldsymbol\mu_k, \mathbf{I})$, where $\mu_{1j} = 0.7\times 1_{(1\le j\le 25)}$, $\mu_{2j} = 0.7\times 1_{(26\le j\le 50)}$, $\mu_{3j} = 0.7\times 1_{(51\le j\le 75)}$, $\mu_{4j} = 0.7\times 1_{(76\le j\le 100)}$.
Simulation 2: mean shift with dependent features. There are two classes. If sample $i$ is in class 1, then $x_i \sim \mathcal{N}(\mathbf{0}, \boldsymbol\Sigma)$, and if $i$ is in class 2, then $x_i \sim \mathcal{N}(\boldsymbol\mu, \boldsymbol\Sigma)$, with $\mu_j = 0.6\times 1_{(j\le 200)}$. The covariance structure is block diagonal, with 5 blocks, each of dimension $100\times 100$. The blocks have $(j,j')$ element $0.6^{|j-j'|}$. This covariance structure is intended to mimic gene expression data correlation.
Simulation 3: one-dimensional mean shift with independent features. There are four classes, and the features are independent. If sample $i$ is in class $k$, then $X_{ij} \sim \mathcal{N}\big(\frac{k-1}{3}, 1\big)$ if $j \le 100$, and $X_{ij} \sim \mathcal{N}(0, 1)$ otherwise.
Simulation 4: mean shift with independent features and no linear ordering. There are four classes. If sample $i$ is in class $k$, then $x_i \sim \mathcal{N}(\boldsymbol\mu_k, \mathbf{I})$, with mean vectors defined as follows: $\mu_{1j} \sim \mathcal{N}(0, 0.3^2)$ for $j\le 25$ and $\mu_{1j} = 0$ otherwise; $\mu_{2j} \sim \mathcal{N}(0, 0.3^2)$ for $26\le j\le 50$ and $\mu_{2j} = 0$ otherwise; $\mu_{3j} \sim \mathcal{N}(0, 0.3^2)$ for $51\le j\le 75$ and $\mu_{3j} = 0$ otherwise; $\mu_{4j} \sim \mathcal{N}(0, 0.3^2)$ for $76\le j\le 100$ and $\mu_{4j} = 0$ otherwise.
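A sketch of the first setup's data generation (our own helper, written directly from the description above):

```python
import numpy as np

def simulation1(n_per_class=300, p=500, shift=0.7, seed=0):
    """Simulation 1: four classes, x_i ~ N(mu_k, I), with mean shifts of 0.7 on
    four disjoint blocks of 25 variables (variables 1-100 are the relevant ones)."""
    rng = np.random.default_rng(seed)
    K = 4
    mus = np.zeros((K, p))
    for k in range(K):
        mus[k, 25 * k:25 * (k + 1)] = shift   # class k shifts its own block
    y = np.repeat(np.arange(K), n_per_class)
    X = rng.standard_normal((K * n_per_class, p)) + mus[y]
    return X, y
```

The 1200 generated examples would then be split into training (100), validation (100) and test (1000) sets, as described above.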
Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of $K$. The setup is favorable to PLDA, in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.
The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far behind by SLDA, which is the only
Table 6.1: Experimental results for simulated data: averages (with standard deviations) computed over 25 repetitions of the test error rate, the number of selected variables, and the number of discriminant directions selected on the validation set.

                 Err (%)       Var            Dir
  Sim 1 (K = 4, mean shift, independent features)
  PLDA           12.6 (0.1)    411.7 (3.7)    3.0 (0.0)
  SLDA           31.9 (0.1)    228.0 (0.2)    3.0 (0.0)
  GLOSS          19.9 (0.1)    106.4 (1.3)    3.0 (0.0)
  GLOSS-D        11.2 (0.1)    251.1 (4.1)    3.0 (0.0)
  Sim 2 (K = 2, mean shift, dependent features)
  PLDA            9.0 (0.4)    337.6 (5.7)    1.0 (0.0)
  SLDA           19.3 (0.1)     99.0 (0.0)    1.0 (0.0)
  GLOSS          15.4 (0.1)     39.8 (0.8)    1.0 (0.0)
  GLOSS-D         9.0 (0.0)    203.5 (4.0)    1.0 (0.0)
  Sim 3 (K = 4, 1D mean shift, independent features)
  PLDA           13.8 (0.6)    161.5 (3.7)    1.0 (0.0)
  SLDA           57.8 (0.2)    152.6 (2.0)    1.9 (0.0)
  GLOSS          31.2 (0.1)    123.8 (1.8)    1.0 (0.0)
  GLOSS-D        18.5 (0.1)    357.5 (2.8)    1.0 (0.0)
  Sim 4 (K = 4, mean shift, independent features)
  PLDA           60.3 (0.1)    336.0 (5.8)    3.0 (0.0)
  SLDA           65.9 (0.1)    208.8 (1.6)    2.7 (0.0)
  GLOSS          60.7 (0.2)     74.3 (2.2)    2.7 (0.0)
  GLOSS-D        58.8 (0.1)    162.7 (4.9)    2.9 (0.0)
Figure 6.1: TPR versus FPR (in %) for all algorithms and simulations.
Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

             Simulation 1    Simulation 2    Simulation 3    Simulation 4
             TPR    FPR      TPR    FPR      TPR    FPR      TPR    FPR
  PLDA       99.0   78.2     96.9   60.3     98.0   15.9     74.3   65.6
  SLDA       73.9   38.5     33.8   16.3     41.6   27.8     50.7   39.5
  GLOSS      64.1   10.6     30.0    4.6     51.1   18.2     26.0   12.1
  GLOSS-D    93.5   39.4     92.1   28.1     95.6   65.5     42.9   29.9
method that does not succeed in uncovering a low-dimensional representation in Simulation 3. The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR): the TPR is the proportion of relevant variables that are selected, and the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 100% and FPR = 0% simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. The results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).
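These two rates can be computed as follows (our own helper, with variables indexed from 0):

```python
import numpy as np

def tpr_fpr(selected, relevant, p):
    """TPR: share of relevant variables selected; FPR: share of irrelevant ones selected."""
    selected, relevant = set(selected), set(relevant)
    irrelevant = set(range(p)) - relevant
    tpr = 100.0 * len(selected & relevant) / len(relevant)
    fpr = 100.0 * len(selected & irrelevant) / len(irrelevant)
    return tpr, fpr
```

For instance, in the simulations (100 relevant variables out of p = 500), a method selecting the 150 first variables would score TPR = 100% and FPR = 12.5%.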
6.4 Gene Expression Data
We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors; it was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011). The Ramaswamy³ dataset contains 198 exam-
² http://www.broadinstitute.org/cancer/software/genepattern/datasets
³ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736
64 Gene Expression Data
Table 63 Experimental results for gene expression data averages over 10 trainingtestsets splits with standard deviations of the test error rates and the numberof selected variables
Err () Var
Nakayama n = 86 p = 22 283 K = 5
PLDA 2095 (13) 104787 (21163)SLDA 2571 (17) 2525 (31)GLOSS 2048 (14) 1290 (186)
Ramaswamy n = 198 p = 16 063 K = 14
PLDA 3836 (60) 148735 (7203)SLDA mdash mdashGLOSS 2061 (69) 3724 (1221)
Sun n = 180 p = 54 613 K = 4
PLDA 3378 (59) 216348 (74432)SLDA 3622 (65) 3844 (165)GLOSS 3177 (45) 930 (936)
ples of 16,063 gene expressions for categorizing 14 classes of cancer. Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.
Each dataset was split into a training set and a test set, with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.
Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.
Finally, Figure 6.2 displays the projection of the observations for the Nakayama and Sun datasets in the first canonical planes estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high colinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.
⁴ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962
Figure 6.2: 2D-representations of the Nakayama and Sun datasets based on the first two discriminant vectors provided by GLOSS (left column) and SLDA (right column). The big squares represent class means. Nakayama classes: 1) synovial sarcoma, 2) myxoid liposarcoma, 3) dedifferentiated liposarcoma, 4) myxofibrosarcoma, 5) malignant fibrous histiocytoma. Sun classes: 1) non-tumor, 2) astrocytomas, 3) glioblastomas, 4) oligodendrogliomas.
Figure 6.3: USPS digits "1" and "0".
6.5 Correlated Data
When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to introduce this prior knowledge easily.
The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results are left for future work.
For this illustration, we used a subset of the USPS handwritten digit dataset, made of 16×16 pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and S-GLOSS. The mean image of each digit is shown in Figure 6.3.
As in Section 5.6.4, we have encoded the pixel proximity relationships of Figure 5.2 into a penalty matrix $\Omega_L$, but this time on a 256-node graph. Introducing this new 256×256 Laplacian penalty matrix $\Omega_L$ in the GLOSS algorithm is straightforward.
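As an illustration, a graph Laplacian of this kind can be assembled directly from the pixel grid. The sketch below (Python/NumPy; the 4-neighbour adjacency is an assumption of this sketch, the actual neighbourhood used in Figure 5.2 may differ) builds a 256×256 penalty matrix for a 16×16 image:

```python
import numpy as np

def grid_laplacian(h, w):
    """Graph Laplacian L = D - A of an h x w pixel grid with
    4-neighbour adjacency, usable as a structure penalty Omega_L."""
    n = h * w
    A = np.zeros((n, n))
    for r in range(h):
        for c in range(w):
            i = r * w + c
            if c + 1 < w:                      # right neighbour
                A[i, i + 1] = A[i + 1, i] = 1
            if r + 1 < h:                      # bottom neighbour
                A[i, i + w] = A[i + w, i] = 1
    return np.diag(A.sum(axis=1)) - A

omega_L = grid_laplacian(16, 16)               # 256 x 256 penalty matrix
```

The quadratic form $\beta^\top \Omega_L \beta$ then penalizes differences between neighbouring pixel coefficients, which is what favors the connected discriminant directions observed with S-GLOSS.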
The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). We perfectly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element for discriminating both digits.
Figure 6.5 displays the discriminant directions β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels, which allows strokes to be detected and will probably provide better prediction results.
Figure 6.4: Discriminant direction between digits "1" and "0": β for GLOSS (left) and β for S-GLOSS (right)
Figure 6.5: Sparse discriminant direction between digits "1" and "0": β for GLOSS (left) and β for S-GLOSS (right), both with λ = 0.3
Discussion
GLOSS is an efficient algorithm that performs sparse LDA, based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem; to the best of our knowledge, it is the first approach that enjoys this property in the multi-class setting. This relationship also makes it possible to accommodate interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.
Computationally, GLOSS is based on an efficient active set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K − 1)-dimensional problem into (K − 1) independent p-dimensional problems; the interaction between the (K − 1) problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.
The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, regarding both its prediction abilities and its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performance. Employing the same features in all discriminant directions yields models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of the data through the low-dimensional representations that can be produced.
The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can be added to the group penalty to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.
Part III
Sparse Clustering Analysis
Abstract
Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records, or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.
Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group by its own distribution. Here, the distributions will be Gaussians, and the different populations are identified by different means and a common covariance matrix.
As in the supervised framework, traditional clustering techniques perform worse as the number of irrelevant features increases. In this part we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.
Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.
7 Feature Selection in Mixture Models
7.1 Mixture Models
One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models, which represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and eight years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population, and are especially well suited to the problem of clustering.
7.1.1 Model
We assume that the observed data $X = (x_1^\top, \dots, x_n^\top)^\top$ have been drawn identically from K different subpopulations in the domain $\mathbb{R}^p$. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compounded distribution whose density can be expressed as
\[
f(x_i) = \sum_{k=1}^{K} \pi_k f_k(x_i), \quad \forall i \in \{1, \dots, n\},
\]
where K is the number of components, $f_k$ are the densities of the components, and $\pi_k$ are the mixture proportions ($\pi_k \in \,]0,1[\ \forall k$, and $\sum_k \pi_k = 1$). Mixture models transcribe that, given the proportions $\pi_k$ and the distributions $f_k$ for each class, the data are generated according to the following mechanism:

• y: each individual is allotted to a class according to a multinomial distribution with parameters $\pi_1, \dots, \pi_K$;

• x: each $x_i$ is assumed to arise from a random vector with probability density function $f_k$.

In addition, it is usually assumed that the component densities $f_k$ belong to a parametric family of densities $\varphi(\cdot\,;\theta_k)$. The density of the mixture can then be written as
\[
f(x_i;\theta) = \sum_{k=1}^{K} \pi_k \varphi(x_i;\theta_k), \quad \forall i \in \{1, \dots, n\},
\]
where $\theta = (\pi_1, \dots, \pi_K, \theta_1, \dots, \theta_K)$ is the parameter of the model.
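To make the generative model concrete, the sketch below (Python/NumPy; the function names are ours, not from a library) evaluates such a mixture density for Gaussian components with a common covariance matrix:

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density phi(x; mu, Sigma) of a multivariate Gaussian."""
    p = len(mean)
    diff = x - mean
    quad = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2 * np.pi) ** p * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

def mixture_density(x, pis, means, cov):
    """f(x; theta) = sum_k pi_k phi(x; mu_k, Sigma), common covariance."""
    return sum(pi * gaussian_pdf(x, mu, cov) for pi, mu in zip(pis, means))

# hypothetical two-component mixture in R^2
f_val = mixture_density(np.array([1.0, 1.0]),
                        pis=[0.3, 0.7],
                        means=[np.zeros(2), np.array([2.0, 2.0])],
                        cov=np.eye(2))
```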
7.1.2 Parameter Estimation: The EM Algorithm
For the estimation of the parameters of the mixture model, Pearson (1894) used the method of moments to estimate the five parameters $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \pi)$ of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphic methods, maximum likelihood methods and Bayesian approaches.
The most widely used process to estimate the parameters is the maximization of the log-likelihood using the EM algorithm. It is typically used to maximize the likelihood of models with latent variables, for which no analytical solution is available (Dempster et al., 1977).
The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the likelihood expectation with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the E-step expected likelihood.
Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed. In practice, the solution obtained depends on the initialization of the algorithm.
Maximum Likelihood Definitions
The likelihood is commonly expressed in its logarithmic version:
\[
L(\theta; X) = \log\left(\prod_{i=1}^{n} f(x_i;\theta)\right) = \sum_{i=1}^{n}\log\left(\sum_{k=1}^{K}\pi_k f_k(x_i;\theta_k)\right), \tag{7.1}
\]
where n is the number of samples, K is the number of components of the mixture (or number of clusters), and $\pi_k$ are the mixture proportions.
To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood, or classification log-likelihood:
\begin{align}
L_C(\theta; X, Y) &= \log\left(\prod_{i=1}^{n} f(x_i, y_i;\theta)\right) \notag\\
&= \sum_{i=1}^{n}\log\left(\sum_{k=1}^{K} y_{ik}\,\pi_k f_k(x_i;\theta_k)\right) \notag\\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik}\log\left(\pi_k f_k(x_i;\theta_k)\right). \tag{7.2}
\end{align}
The $y_{ik}$ are the binary entries of the indicator matrix Y, with $y_{ik} = 1$ if observation i belongs to cluster k, and $y_{ik} = 0$ otherwise.
Defining the soft membership $t_{ik}(\theta)$ as
\begin{align}
t_{ik}(\theta) &= p(Y_{ik} = 1 | x_i;\theta) \tag{7.3}\\
&= \frac{\pi_k f_k(x_i;\theta_k)}{f(x_i;\theta)}. \tag{7.4}
\end{align}
To lighten notations, $t_{ik}(\theta)$ will be denoted $t_{ik}$ when the parameter θ is clear from the context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:
\begin{align}
L_C(\theta; X, Y) &= \sum_{i,k} y_{ik}\log\left(\pi_k f_k(x_i;\theta_k)\right) \notag\\
&= \sum_{i,k} y_{ik}\log\left(t_{ik} f(x_i;\theta)\right) \notag\\
&= \sum_{i,k} y_{ik}\log t_{ik} + \sum_{i,k} y_{ik}\log f(x_i;\theta) \notag\\
&= \sum_{i,k} y_{ik}\log t_{ik} + \sum_{i=1}^{n}\log f(x_i;\theta) \notag\\
&= \sum_{i,k} y_{ik}\log t_{ik} + L(\theta; X), \tag{7.5}
\end{align}
where $\sum_{i,k} y_{ik}\log t_{ik}$ can be reformulated as
\begin{align*}
\sum_{i,k} y_{ik}\log t_{ik} &= \sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik}\log\left(p(Y_{ik}=1|x_i;\theta)\right)\\
&= \sum_{i=1}^{n}\log\left(p(y_i|x_i;\theta)\right)\\
&= \log\left(p(Y|X;\theta)\right).
\end{align*}
As a result, the relationship (7.5) can be rewritten as
\[
L(\theta; X) = L_C(\theta; Z) - \log\left(p(Y|X;\theta)\right). \tag{7.6}
\]
Likelihood Maximization
The complete log-likelihood cannot be evaluated because the variables $y_{ik}$ are unknown. However, it is possible to estimate its value by taking expectations of (7.6) conditionally on a current value $\theta^{(t)}$ of the parameter:
\[
L(\theta; X) = \underbrace{\mathbb{E}_{Y\sim p(\cdot|X;\theta^{(t)})}\left[L_C(\theta; X, Y)\right]}_{Q(\theta,\theta^{(t)})} + \underbrace{\mathbb{E}_{Y\sim p(\cdot|X;\theta^{(t)})}\left[-\log p(Y|X;\theta)\right]}_{H(\theta,\theta^{(t)})}.
\]
In this expression, $H(\theta,\theta^{(t)})$ is a cross-entropy and $Q(\theta,\theta^{(t)})$ is the conditional expectation of the complete log-likelihood. Let us define an increment of the log-likelihood as $\Delta L = L(\theta^{(t+1)}; X) - L(\theta^{(t)}; X)$. Then $\theta^{(t+1)} = \arg\max_\theta Q(\theta,\theta^{(t)})$ also increases the log-likelihood:
\[
\Delta L = \underbrace{\left(Q(\theta^{(t+1)},\theta^{(t)}) - Q(\theta^{(t)},\theta^{(t)})\right)}_{\ge 0 \text{ by definition of iteration } t+1} + \underbrace{\left(H(\theta^{(t+1)},\theta^{(t)}) - H(\theta^{(t)},\theta^{(t)})\right)}_{\ge 0 \text{ by Jensen's inequality}}.
\]
Therefore, it is possible to maximize the likelihood by optimizing $Q(\theta,\theta^{(t)})$. The relationship between $Q(\theta,\theta')$ and $L(\theta; X)$ is developed in deeper detail in Appendix F, to show how the value of $L(\theta; X)$ can be recovered from $Q(\theta,\theta^{(t)})$.
For the mixture model problem, $Q(\theta,\theta')$ is
\begin{align}
Q(\theta,\theta') &= \mathbb{E}_{Y\sim p(Y|X;\theta')}\left[L_C(\theta; X, Y)\right] \notag\\
&= \sum_{i,k} p(Y_{ik}=1|x_i;\theta')\log\left(\pi_k f_k(x_i;\theta_k)\right) \notag\\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(\theta')\log\left(\pi_k f_k(x_i;\theta_k)\right). \tag{7.7}
\end{align}
Due to its similitude to the expression of the complete likelihood (7.2), $Q(\theta,\theta')$ is also known as the weighted likelihood. In (7.7), the weights $t_{ik}(\theta')$ are the posterior probabilities of cluster memberships.
Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter $\theta^{(0)}$;

• E-Step: evaluation of $Q(\theta,\theta^{(t)})$, using $t_{ik}(\theta^{(t)})$ (7.4) in (7.7);

• M-Step: calculation of $\theta^{(t+1)} = \arg\max_\theta Q(\theta,\theta^{(t)})$.
Gaussian Model
In the particular case of a Gaussian mixture model with common covariance matrix Σ and different mean vectors $\mu_k$, the mixture density is
\begin{align*}
f(x_i;\theta) &= \sum_{k=1}^{K}\pi_k f_k(x_i;\theta_k)\\
&= \sum_{k=1}^{K}\pi_k \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k)\right\}.
\end{align*}
At the E-step, the posterior probabilities $t_{ik}$ are computed as in (7.4) with the current $\theta^{(t)}$ parameters; then the M-step maximizes $Q(\theta,\theta^{(t)})$ (7.7), whose form is as follows:
\begin{align}
Q(\theta,\theta^{(t)}) &= \sum_{i,k} t_{ik}\log(\pi_k) - \sum_{i,k} t_{ik}\log\left((2\pi)^{p/2}|\Sigma|^{1/2}\right) - \frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k) \notag\\
&= \sum_{k} t_k\log(\pi_k) - \underbrace{\frac{np}{2}\log(2\pi)}_{\text{constant term}} - \frac{n}{2}\log(|\Sigma|) - \frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k) \notag\\
&\equiv \sum_{k} t_k\log(\pi_k) - \frac{n}{2}\log(|\Sigma|) - \sum_{i,k} t_{ik}\left(\frac{1}{2}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k)\right), \tag{7.8}
\end{align}
where
\[
t_k = \sum_{i=1}^{n} t_{ik}. \tag{7.9}
\]
The M-step, which maximizes this expression with respect to θ, applies the following updates defining $\theta^{(t+1)}$:
\begin{align}
\pi_k^{(t+1)} &= \frac{t_k}{n}, \tag{7.10}\\
\mu_k^{(t+1)} &= \frac{\sum_i t_{ik}\, x_i}{t_k}, \tag{7.11}\\
\Sigma^{(t+1)} &= \frac{1}{n}\sum_k W_k, \tag{7.12}\\
\text{with } W_k &= \sum_i t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top. \tag{7.13}
\end{align}
The derivations are detailed in Appendix G.
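These update formulas translate directly into code. The sketch below (Python/NumPy; the function names are ours, not from a library) implements one M-step, (7.10)-(7.13), and one E-step, (7.4), for the common-covariance Gaussian mixture:

```python
import numpy as np

def m_step(X, T):
    """M-step updates (7.10)-(7.13) of the common-covariance Gaussian
    mixture, from the n x K matrix T of responsibilities t_ik."""
    n, p = X.shape
    tk = T.sum(axis=0)                          # t_k, (7.9)
    pis = tk / n                                # (7.10)
    mus = (T.T @ X) / tk[:, None]               # (7.11)
    sigma = np.zeros((p, p))
    for k in range(T.shape[1]):
        D = X - mus[k]
        sigma += (T[:, k, None] * D).T @ D      # W_k, (7.13)
    return pis, mus, sigma / n                  # (7.12)

def e_step(X, pis, mus, sigma):
    """E-step (7.4); the terms that are constant across components
    cancel in the normalization and are dropped."""
    prec = np.linalg.inv(sigma)
    logp = np.array([np.log(pk)
                     - 0.5 * np.einsum('ij,jk,ik->i', X - mk, prec, X - mk)
                     for pk, mk in zip(pis, mus)]).T
    T = np.exp(logp - logp.max(axis=1, keepdims=True))
    return T / T.sum(axis=1, keepdims=True)
```

Alternating the two functions from an initial responsibility matrix reproduces the EM iterations described above.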
7.2 Feature Selection in Model-Based Clustering
When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own covariance matrix $\Sigma_k$, Gaussian mixtures are associated to quadratic discriminant analysis (QDA), with quadratic boundaries.
In the high-dimensional low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989); Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition $\Sigma_k = \lambda_k D_k A_k D_k^\top$ (Banfield and Raftery, 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.
In this chapter we review some techniques to induce sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.
7.2.1 Based on Penalized Likelihood
Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:
\[
\log\left(\frac{p(Y_k=1|x)}{p(Y_\ell=1|x)}\right) = x^\top\Sigma^{-1}(\mu_k-\mu_\ell) - \frac{1}{2}(\mu_k+\mu_\ell)^\top\Sigma^{-1}(\mu_k-\mu_\ell) + \log\frac{\pi_k}{\pi_\ell}.
\]
In this model, a simple way of introducing sparsity in the discriminant vectors $\Sigma^{-1}(\mu_k-\mu_\ell)$ is to constrain Σ to be diagonal and to favor sparse means $\mu_k$. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm,
\[
\lambda \sum_{k=1}^{K}\sum_{j=1}^{p} |\mu_{kj}|,
\]
as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:
\[
\lambda_1 \sum_{k=1}^{K}\sum_{j=1}^{p} |\mu_{kj}| + \lambda_2 \sum_{k=1}^{K}\sum_{j=1}^{p}\sum_{m=1}^{p} |(\Sigma_k^{-1})_{jm}|.
\]
In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.
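For intuition, with a common identity covariance the L1-penalized M-step update of the means reduces to coordinatewise soft-thresholding of the weighted means. The sketch below (Python/NumPy) illustrates the mechanism; the threshold value λ/t_k is a simplifying assumption of this sketch, as the exact threshold derived by Pan et al. involves the variances:

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def l1_penalized_means(X, T, lam):
    """Sketch of an L1-penalized mean update, assuming centred data and a
    common identity covariance; lam / t_k is a simplified threshold."""
    tk = T.sum(axis=0)
    mus = (T.T @ X) / tk[:, None]               # unpenalized update (7.11)
    return soft_threshold(mus, lam / tk[:, None])
```

Coordinates whose weighted means are small are driven exactly to zero, which is how the L1 penalty sparsifies the means.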
Guo et al. (2010) propose a variation with a Pairwise Fusion Penalty (PFP):
\[
\lambda \sum_{j=1}^{p}\sum_{1\le k < k' \le K} |\mu_{kj} - \mu_{k'j}|.
\]
This PFP regularization does not shrink the means to zero, but towards each other. If the jth components of all cluster means are driven to the same value, that variable can be considered as non-informative.
An $L_{1,\infty}$ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:
\[
\lambda \sum_{j=1}^{p} \left\|(\mu_{1j}, \mu_{2j}, \dots, \mu_{Kj})\right\|_\infty.
\]
One group is defined for each variable j, as the set of the jth components of the K means, $(\mu_{1j}, \dots, \mu_{Kj})$. The $L_{1,\infty}$ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time; in addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG achieves real feature selection, because it forces null values for the same variable in all cluster means:
\[
\lambda \sqrt{K} \sum_{j=1}^{p} \sqrt{\sum_{k=1}^{K} \mu_{kj}^2}.
\]
The clustering algorithm of VMG differs from ours, but the group penalty proposed is the same; however, no code is available on the authors' website that would allow testing.
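The behaviour of these penalties on a matrix of means can be compared on a small hypothetical example (Python/NumPy; the numbers are made up). The first variable, whose mean component is zero in every cluster, contributes nothing to any of the penalties and can be discarded; only the column-grouped penalties ($L_{1,\infty}$ and VMG) actively drive whole columns to zero:

```python
import numpy as np
from itertools import combinations

# Hypothetical K x p matrix of means; column j stacks (mu_1j, ..., mu_Kj)
M = np.array([[0.0,  1.2,  0.0],
              [0.0, -0.8,  0.5],
              [0.0,  0.3, -0.5]])
K, p = M.shape

l1   = np.abs(M).sum()                                    # Pan et al.
pfp  = sum(np.abs(M[k] - M[q]).sum()
           for k, q in combinations(range(K), 2))         # pairwise fusion
linf = np.abs(M).max(axis=0).sum()                        # L1-infinity
vmg  = np.sqrt(K) * np.linalg.norm(M, axis=0).sum()       # group-Lasso (VMG)
```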
The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions from the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity on the discriminant vector. The generalization from quadratic to non-quadratic penalties is quickly outlined in their work. We extend this work by considering an arbitrary number of clusters, and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.
7.2.2 Based on Model Variants
The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy in terms of conditional independence: the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as
\[
f(x_i|\phi, \pi, \theta, \nu) = \sum_{k=1}^{K}\pi_k \prod_{j=1}^{p}\left[f(x_{ij}|\theta_{jk})\right]^{\phi_j}\left[h(x_{ij}|\nu_j)\right]^{1-\phi_j},
\]
where $f(\cdot|\theta_{jk})$ is the distribution function for relevant features and $h(\cdot|\nu_j)$ is the distribution function for the irrelevant ones. The binary vector $\phi = (\phi_1, \phi_2, \dots, \phi_p)$ represents relevance, with $\phi_j = 1$ if the jth feature is informative and $\phi_j = 0$ otherwise. The saliency of variable j is then formalized as $\rho_j = P(\phi_j = 1)$, so all the $\phi_j$ must be treated as missing variables. Thus, the set of parameters is $\{\pi_k, \theta_{jk}, \nu_j, \rho_j\}$; their estimation is done by means of the EM algorithm (Law et al., 2004).
An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix $U \in \mathbb{R}^{p\times(K-1)}$, which is updated inside the EM loop with a new step, called the Fisher step (F-step from now on), which maximizes a multi-class Fisher's criterion,
\[
\mathrm{tr}\left((U^\top\Sigma_W U)^{-1} U^\top\Sigma_B U\right), \tag{7.14}
\]
so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then the F-step updates the projection matrix that maps the data to the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as a function of the projection matrix U and of the model parameters in the latent space, such that the U matrix enters the M-step equations.
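The criterion (7.14) is straightforward to evaluate for a given projection; a minimal sketch (Python/NumPy; the function name is hypothetical):

```python
import numpy as np

def fisher_criterion(U, Sw, Sb):
    """Multi-class Fisher criterion (7.14): tr((U' Sw U)^{-1} U' Sb U),
    for a p x (K-1) projection U and within/between covariances Sw, Sb."""
    return np.trace(np.linalg.solve(U.T @ Sw @ U, U.T @ Sb @ U))
```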
To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one computes the best sparse orthogonal approximation $\tilde U$ of the matrix U maximizing (7.14). This sparse approximation is defined as the solution of
\[
\min_{\tilde U\in\mathbb{R}^{p\times(K-1)}} \left\|X_U - X\tilde U\right\|_F^2 + \lambda \sum_{k=1}^{K-1}\left\|\tilde u_k\right\|_1,
\]
where $X_U = XU$ is the input data projected in the non-sparse space, and $\tilde u_k$ is the kth column vector of the projection matrix $\tilde U$. The second possibility is inspired by Qiao et al. (2009), and reformulates Fisher's discriminant (7.14), used to compute the projection matrix, as a regression criterion penalized by a mixture of Lasso and Elastic net:
\begin{align*}
\min_{A,B\in\mathbb{R}^{p\times(K-1)}}\ & \sum_{k=1}^{K}\left\|R_W^{-\top}H_{B,k} - AB^\top H_{B,k}\right\|_2^2 + \rho\sum_{j=1}^{K-1}\beta_j^\top\Sigma_W\beta_j + \lambda\sum_{j=1}^{K-1}\left\|\beta_j\right\|_1\\
\text{s.t. }\ & A^\top A = I_{K-1},
\end{align*}
where $H_B \in \mathbb{R}^{p\times K}$ is a matrix defined conditionally on the posterior probabilities $t_{ik}$, satisfying $H_B H_B^\top = \Sigma_B$, and $H_{B,k}$ is the kth column of $H_B$; $R_W \in \mathbb{R}^{p\times p}$ is an upper triangular matrix resulting from the Cholesky decomposition of $\Sigma_W$. Here, $\Sigma_W$ and $\Sigma_B$ are the $p\times p$ within-class and between-class covariance matrices in the observation space. $A \in \mathbb{R}^{p\times(K-1)}$ and $B \in \mathbb{R}^{p\times(K-1)}$ are the solutions of the optimization problem, such that $B = [\beta_1, \dots, \beta_{K-1}]$ is the best sparse approximation of U.
The last possibility defines the solution of Fisher's discriminant (7.14) as the solution of the following constrained optimization problem:
\begin{align*}
\min_{U\in\mathbb{R}^{p\times(K-1)}}\ & \sum_{j=1}^{p}\left\|\Sigma_{B,j} - UU^\top\Sigma_{B,j}\right\|_2^2\\
\text{s.t. }\ & U^\top U = I_{K-1},
\end{align*}
where $\Sigma_{B,j}$ is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.
To comply with the constraint stating that the columns of U are orthogonal, the first and second options must be followed by a singular value decomposition of $\tilde U$ to recover orthogonality. This is not necessary with the third option, since the penalized version of SVD already guarantees orthogonality.
However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed". Immediately after this paragraph, we can read that under certain suppositions their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices: the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized".
7.2.3 Based on Model Selection
Some clustering algorithms recast the feature selection problem as a model selection problem. Following this idea, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:
• $X^{(1)}$: the set of selected relevant variables;

• $X^{(2)}$: the set of variables being considered for inclusion in, or exclusion from, $X^{(1)}$;

• $X^{(3)}$: the set of non-relevant variables.
With those subsets, they define two different models, where Y is the partition to consider:
• $M_1$:
\[
f(X|Y) = f\left(X^{(1)}, X^{(2)}, X^{(3)}|Y\right) = f\left(X^{(3)}|X^{(2)}, X^{(1)}\right) f\left(X^{(2)}|X^{(1)}\right) f\left(X^{(1)}|Y\right);
\]

• $M_2$:
\[
f(X|Y) = f\left(X^{(1)}, X^{(2)}, X^{(3)}|Y\right) = f\left(X^{(3)}|X^{(2)}, X^{(1)}\right) f\left(X^{(2)}, X^{(1)}|Y\right).
\]
Model $M_1$ means that the variables in $X^{(2)}$ are independent of the clustering Y; model $M_2$ states that the variables in $X^{(2)}$ depend on the clustering Y. To simplify the algorithm, the subset $X^{(2)}$ is only updated one variable at a time. Therefore, deciding the relevance of variable $X^{(2)}$ amounts to a model selection between $M_1$ and $M_2$. The selection is done via the Bayes factor
\[
B_{12} = \frac{f(X|M_1)}{f(X|M_2)},
\]
where the high-dimensional factor $f(X^{(3)}|X^{(2)}, X^{(1)})$ cancels from the ratio:
\[
B_{12} = \frac{f\left(X^{(1)}, X^{(2)}, X^{(3)}|M_1\right)}{f\left(X^{(1)}, X^{(2)}, X^{(3)}|M_2\right)} = \frac{f\left(X^{(2)}|X^{(1)}, M_1\right) f\left(X^{(1)}|M_1\right)}{f\left(X^{(2)}, X^{(1)}|M_2\right)}.
\]
This factor is approximated, since the integrated likelihoods $f(X^{(1)}|M_1)$ and $f(X^{(2)}, X^{(1)}|M_2)$ are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of $f(X^{(2)}|X^{(1)}, M_1)$, if there is only one variable in $X^{(2)}$, can be represented as a linear regression of variable $X^{(2)}$ on the variables in $X^{(1)}$; there is also a BIC approximation for this term.
Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets ($X^{(1)}$ and $X^{(3)}$) remain the same, but $X^{(2)}$ is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. Their algorithm also makes use of a backward stepwise strategy, instead of the forward stepwise search used by Raftery and Dean (2006), and allows the definition of blocks of indivisible variables that, in certain situations, improve the clustering and its interpretability.
Both algorithms are well motivated and appear to produce good results; however, the amount of computation needed to test the different subsets of variables requires a huge computation time. In practice, they cannot be used for the amount of data considered in this thesis.
8 Theoretical Foundations
In this chapter we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.
We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to produce reduced-rank decision rules using less than K − 1 discriminant directions. Their motivation was mainly driven by stability issues: no sparsity-inducing mechanism was introduced in the construction of discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.
In the subsequent sections, we provide the principles that allow the M-step to be solved as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty; as with GLOSS, this is accomplished with a variational approach to the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.
8.1 Resolving EM with Optimal Scoring
In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.
8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis
LDA is typically used in a supervised learning framework, for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance
\[
d(x_i, \mu_k) = (x_i - \mu_k)^\top\Sigma_W^{-1}(x_i - \mu_k),
\]
where $\mu_k$ are the p-dimensional centroids and $\Sigma_W$ is the $p\times p$ common within-class covariance matrix.
The likelihood equations in the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by the $t_{ik}$ (the posterior probabilities computed at the E-step).
Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood
\[
2\, l_{\mathrm{weight}}(\mu, \Sigma) = -\sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}\, d(x_i, \mu_k) - n\log(|\Sigma_W|),
\]
which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized maximum likelihood in Gaussian mixtures.
8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis
The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1, in the supervised learning framework. It is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.
8.1.3 Clustering Using Penalized Optimal Scoring
The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix $B_{\mathrm{OS}}$, analytically related to Fisher's discriminative directions $B_{\mathrm{LDA}}$ for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities $t_{ik}$ in the E-step, the distance between the samples $x_i$ and the centroids $\mu_k$ must be evaluated. Depending on whether we are working in the input, OS or LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:
\[
d(x_i, \mu_k) = \left\|(x_i - \mu_k)B_{\mathrm{LDA}}\right\|_2^2 - 2\log(\pi_k).
\]
This distance defines the computation of the posterior probabilities $t_{ik}$ in the E-step (see Section 4.2.3). Putting all those elements together, the complete clustering algorithm can be summarized as follows:
1. Initialize the membership matrix Y (for example, by the K-means algorithm).

2. Solve the p-OS problem as
\[
B_{\mathrm{OS}} = \left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y\Theta,
\]
where Θ are the K − 1 leading eigenvectors of
\[
Y^\top X\left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y.
\]

3. Map X to the LDA domain: $X_{\mathrm{LDA}} = X B_{\mathrm{OS}} D$, with $D = \mathrm{diag}\big(\alpha_k^{-1}(1-\alpha_k^2)^{-\frac{1}{2}}\big)$.

4. Compute the centroids M in the LDA domain.

5. Evaluate the distances in the LDA domain.

6. Translate the distances into posterior probabilities $t_{ik}$, with
\[
t_{ik} \propto \exp\left[-\frac{d(x_i, \mu_k)}{2}\right]. \tag{8.1}
\]

7. Update the labels using the posterior probabilities matrix: Y = T.

8. Go back to step 2 and iterate until the $t_{ik}$ converge.
Items 2 to 5 can be interpreted as the M-step, and item 6 as the E-step, in this alternative view of the EM algorithm for Gaussian mixtures.
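Steps 2-3 of this loop can be sketched as follows (Python/NumPy/SciPy) for a purely quadratic penalty λΩ; GLOSS itself additionally handles the group-Lasso part through its variational reformulation, so this is an illustration of the linear algebra only, not of the full algorithm:

```python
import numpy as np
from scipy.linalg import eigh

def penalized_os_step(X, Y, Omega, lam):
    """Steps 2-3 above with a quadratic penalty lam * Omega only.
    X is assumed centred; Y holds the current (hard or soft) memberships."""
    K = Y.shape[1]
    G = np.linalg.inv(X.T @ X + lam * Omega)
    M = Y.T @ X @ G @ X.T @ Y
    # Generalized eigenproblem M theta = alpha^2 (Y'Y) theta, keeping the
    # K - 1 leading pairs (eigh returns ascending eigenvalues and
    # eigenvectors normalized such that Theta' (Y'Y) Theta = I)
    alpha2, Theta = eigh(M, Y.T @ Y)
    alpha2 = alpha2[::-1][:K - 1]
    Theta = Theta[:, ::-1][:, :K - 1]
    B_os = G @ X.T @ Y @ Theta
    D = np.diag(1.0 / np.sqrt(alpha2 * (1.0 - alpha2)))  # alpha^{-1}(1-alpha^2)^{-1/2}
    return X @ B_os @ D                                  # data in the LDA domain
```

The returned matrix is the data mapped to the LDA domain, in which centroids and distances (steps 4-5) are then computed.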
8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
In the previous section, we sketched a clustering algorithm that replaces the M-step with a penalized OS regression. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.
8.2 Optimized Criterion
In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood $Q(\theta,\theta')$ (7.7), so as to maximize the likelihood L(θ) (see Section 7.1.2). Replacing the M-step by a penalized optimal scoring problem is possible, and the link between the penalized optimal scoring problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.
This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.
8.2.1 A Bayesian Derivation
This section sketches a Bayesian treatment of inference, limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).
The model proposed in this thesis considers a classical maximum likelihood estimation for the means, and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.
The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:
$$f(\Sigma\,|\,\Lambda_0,\nu_0) = \frac{1}{2^{np/2}\,|\Lambda_0|^{n/2}\,\Gamma_p(n/2)}\;|\Sigma^{-1}|^{(\nu_0-p-1)/2}\,\exp\!\left(-\frac{1}{2}\,\mathrm{tr}\!\left(\Lambda_0^{-1}\Sigma^{-1}\right)\right)$$
where ν0 is the number of degrees of freedom of the distribution, Λ0 is a p × p scale matrix, and Γp is the multivariate gamma function, defined as
$$\Gamma_p(n/2) = \pi^{p(p-1)/4}\prod_{j=1}^{p}\Gamma\!\left(\frac{n}{2} + \frac{1-j}{2}\right)$$
The posterior distribution can be maximized, similarly to the likelihood, through the maximization of
$$
\begin{aligned}
Q(\theta,\theta') &+ \log f(\Sigma\,|\,\Lambda_0,\nu_0)\\
&= \sum_{k=1}^{K} t_{\cdot k}\log\pi_k - \frac{(n+1)p}{2}\log 2 - \frac{n}{2}\log|\Lambda_0| - \frac{p(p+1)}{4}\log\pi\\
&\quad - \sum_{j=1}^{p}\log\Gamma\!\left(\frac{n}{2} + \frac{1-j}{2}\right) - \frac{\nu_n-p-1}{2}\log|\Sigma| - \frac{1}{2}\mathrm{tr}\!\left(\Lambda_n^{-1}\Sigma^{-1}\right)\\
&\equiv \sum_{k=1}^{K} t_{\cdot k}\log\pi_k - \frac{n}{2}\log|\Lambda_0| - \frac{\nu_n-p-1}{2}\log|\Sigma| - \frac{1}{2}\mathrm{tr}\!\left(\Lambda_n^{-1}\Sigma^{-1}\right) \qquad (8.2)
\end{aligned}
$$
with
$$t_{\cdot k} = \sum_{i=1}^{n} t_{ik}, \qquad \nu_n = \nu_0 + n, \qquad \Lambda_n^{-1} = \Lambda_0^{-1} + S_0, \qquad S_0 = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top .$$
Details of these calculations can be found in textbooks (for example, Bishop, 2006; Gelman et al., 2003).
8.2.2 Maximum a Posteriori Estimator
The maximization of (8.2) with respect to μk and πk is of course not affected by the additional prior term, in which only the covariance Σ intervenes. The MAP estimator of Σ is simply obtained by differentiating (8.2) with respect to Σ. The details of the calculations follow the same lines as the ones for maximum likelihood detailed in Appendix G. The resulting estimator of Σ is
$$\hat{\Sigma}_{\mathrm{MAP}} = \frac{1}{\nu_0 + n - p - 1}\left(\Lambda_0^{-1} + S_0\right) \qquad (8.3)$$
where S0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if ν0 is chosen to be p + 1 and Λ0⁻¹ = λΩ, where Ω is the penalty matrix from the group-Lasso regularization (4.25).
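As a small illustration of (8.3) under these choices, ν0 = p + 1 and Λ0⁻¹ = λΩ, the estimator can be computed as follows (a numpy sketch with names of our own, not the thesis implementation):

```python
import numpy as np

def sigma_map(S0: np.ndarray, Omega: np.ndarray, lam: float, n: int) -> np.ndarray:
    """MAP estimator (8.3) of the common covariance with a Wishart-type prior.

    With nu0 = p + 1 and Lambda_0^{-1} = lam * Omega, the denominator
    nu0 + n - p - 1 reduces to n, so the estimator is the penalized
    within-class covariance (lam * Omega + S0) / n."""
    p = S0.shape[0]
    nu0 = p + 1
    return (lam * Omega + S0) / (nu0 + n - p - 1)
```

With λ = 0 this falls back to the classical (unpenalized) maximum likelihood estimate S0/n.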
9 Mix-GLOSS Algorithm
Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter, I describe the details of the implementation of Mix-GLOSS and of its model selection mechanism.
9.1 Mix-GLOSS
The implementation of Mix-GLOSS involves three nested loops, as sketched in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors tik.
When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.
The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.
9.1.1 Outer Loop: Whole-Algorithm Repetitions
This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:
• the centered n × p feature matrix X,
• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically),
• the number of clusters K,
• the maximum number of iterations for the EM algorithm,
• the convergence tolerance for the EM algorithm,
• the number of whole repetitions of the clustering algorithm,
Figure 9.1: Mix-GLOSS loops scheme
• a p × (K − 1) initial coefficient matrix (optional),
• an n × K initial posterior probability matrix (optional).
For each algorithm repetition, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.
9.1.2 Penalty Parameter Loop
The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, such that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have verified that the implemented warm-start reduces the computation time by a factor of 8 with respect to using a null B matrix and a K-means execution for the initial Y label matrix.
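The loop logic just described can be sketched as follows; `fit_em` is a hypothetical stand-in for one full EM run of Mix-GLOSS, not the actual implementation:

```python
def penalty_path(X, K, lambdas, fit_em):
    """Run the EM clustering for each penalty value in ascending order,
    warm-starting each run with the previous solution (B, Y).

    `fit_em(X, K, B, Y, lam)` is a stand-in returning updated (B, Y);
    it is expected to cold-start when B and Y are None.
    The path halts early if some lambda zeroes out the whole coefficient
    matrix, since larger penalties can only be sparser."""
    B, Y = None, None
    results = {}
    for lam in sorted(lambdas):
        B, Y = fit_em(X, K, B, Y, lam)
        results[lam] = (B, Y)
        if all(abs(b) < 1e-12 for row in B for b in row):
            break  # null coefficient matrix: stop the path
    return results
```

The ascending order is what makes the warm-start meaningful: each solution is a good starting point for a slightly sparser problem.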
Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the estimated percentage of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.
Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternative variational approach from Appendix D is used, Equation (4.32b) has to be replaced by (D.10b).
Algorithm 2 Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize: B ← 0, Y ← K-means(X, K)
Run non-penalized Mix-GLOSS: λ ← 0; (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
lastLAMBDA ← false
repeat
    Estimate λ:
        Compute the gradient at βj = 0:  $\partial J(B)/\partial\beta_j\big|_{\beta_j=0} = x_j^\top\big(\sum_{m\neq j} x_m\beta_m - Y\Theta\big)$
        Compute λmax for every feature using (4.32b):  $\lambda_j^{\max} = \frac{1}{w_j}\,\big\|\partial J(B)/\partial\beta_j\big|_{\beta_j=0}\big\|_2$
        Choose λ so as to remove 10% of the relevant features
    Run penalized Mix-GLOSS: (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    if number of relevant variables in B > minVAR then
        lastLAMBDA ← false
    else
        lastLAMBDA ← true
    end if
until lastLAMBDA
Output: B, L(θ), tik, πk, μk, Σ, Y for every λ in the solution path
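The λmax step of Algorithm 2 simplifies at B = 0, where the gradient reduces to −xj⊤YΘ, so only one product per feature is needed. A pure-Python sketch (names are ours, not the actual implementation):

```python
def lambda_max(X, YTheta, w):
    """Per-feature critical penalty: the smallest lambda that keeps
    feature j at zero when starting from B = 0.

    At B = 0 the gradient of the fit term w.r.t. beta_j is
    -x_j^T (Y Theta), so lambda_max_j = ||x_j^T Y Theta||_2 / w_j
    (cf. (4.32b)).  X is an n x p list of lists, YTheta an n x (K-1)
    list of lists, w the vector of group weights."""
    n, p, K1 = len(X), len(X[0]), len(YTheta[0])
    lmax = []
    for j in range(p):
        grad = [sum(X[i][j] * YTheta[i][c] for i in range(n)) for c in range(K1)]
        lmax.append(sum(g * g for g in grad) ** 0.5 / w[j])
    return lmax
```

Sorting these values gives the grid from which the next λ (removing a chosen fraction of the active features) can be picked.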
9.1.3 Inner Loop: EM Algorithm
The inner loop implements the actual clustering algorithm, by means of successive maximizations of a penalized likelihood criterion. Once convergence in the posterior probabilities tik is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.
Algorithm 3 Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
if (B0, Y0) available then
    BOS ← B0, Y ← Y0
else
    BOS ← 0, Y ← K-means(X, K)
end if
convergenceEM ← false, tolEM ← 1e−3
repeat
    M-step:
        (BOS, Θ, α) ← GLOSS(X, Y, BOS, λ)
        $X_{\mathrm{LDA}} = X\,B_{\mathrm{OS}}\,\mathrm{diag}\big(\alpha^{-1}(1-\alpha^2)^{-1/2}\big)$
        πk, μk and Σ as per (7.10), (7.11) and (7.12)
    E-step:
        tik as per (8.1)
        L(θ) as per (8.2)
    if $\frac{1}{n}\sum_i |t_{ik} - y_{ik}| <$ tolEM then
        convergenceEM ← true
    end if
    Y ← T
until convergenceEM
Y ← MAP(T)
Output: BOS, Θ, L(θ), tik, πk, μk, Σ, Y
M-Step
The M-step deals with the estimation of the model parameters, that is, the cluster means μk, the common covariance matrix Σ, and the prior of every component πk. In a classical M-step, this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses the scaled label matrix YΘ on X. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.
E-Step
The E-step evaluates the posterior probability matrix T using
$$t_{ik} \propto \exp\!\left[-\frac{d(x_i,\mu_k) - 2\log\pi_k}{2}\right].$$
The convergence of these tik is used as the stopping criterion for EM.
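A minimal sketch of this E-step update, assuming the distances d(xi, μk) have already been computed (pure Python, not the thesis code):

```python
from math import exp, log

def e_step(dist, priors):
    """Posterior responsibilities t_ik ∝ exp(-(d(x_i, mu_k) - 2 log pi_k)/2),
    normalized over k.

    `dist[i][k]` is the (Mahalanobis-type) distance of sample i to
    center k; `priors[k]` is the mixture proportion pi_k."""
    T = []
    for row in dist:
        scores = [exp(-(d - 2.0 * log(pk)) / 2.0) for d, pk in zip(row, priors)]
        s = sum(scores)
        T.append([v / s for v in scores])
    return T
```

Each row of T sums to one, and a sample closer to a center (or a center with a larger prior) receives a larger responsibility.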
9.2 Model Selection
Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.
In a first attempt, we tried a classical structure where clustering was performed several times, from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm took a lot of computing resources, since the stability selection mechanism required a number of repetitions that transformed Mix-GLOSS into a lengthy structure of four nested loops.
In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978), but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, also required a large computation time.
The third and definitive attempt (up to now) proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0). The execution with the best log-likelihood is chosen; the repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism. This time, only one repetition of the algorithm is done for every value of the penalty parameter. This version has been tested with no significant differences in the quality of the clustering, but a dramatic reduction of the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ: an initial non-penalized Mix-GLOSS is repeated several times, the B and T matrices from the best repetition warm-start the penalized executions for every λ, a BIC value is computed for each of them, and the λ minimizing BIC is finally chosen.

Figure 9.2: Mix-GLOSS model selection diagram
10 Experimental Results
The performance of Mix-GLOSS is measured here on the artificial dataset that was used in Section 6.
This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for simulation 2, where they are slightly correlated. In simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.
In our tests, we have reduced the volume of the problem, because with the original size of 1200 samples and 500 dimensions, some of the algorithms to be tested took several days (even weeks) to finish. Hence, the definitive database was chosen to approximately maintain the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1 has been adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments and allows a better understanding of the different simulations.
The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.
10.1 Tested Clustering Algorithms
This section compares Mix-GLOSS with the following state-of-the-art methods:
• CS general cov: a model-based clustering with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.
• Fisher EM: this method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "FisherEM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network.
Figure 10.1: Class mean vectors for each artificial simulation
• SelvarClust/Clustvarsel: implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a piece of software implemented in C++ that makes use of the clustering library mixmod (Biernacki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.
After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.
The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network.
• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. In Kuan et al. (2010), some slight changes are introduced in the penalty term, such as weighting parameters, that are particularly important for their dataset. The package LumiWCluster allows performing clustering using the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).
• Mix-GLOSS: this is the clustering algorithm implemented using GLOSS (see Chapter 9). It makes use of an EM algorithm and of the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach of the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.

10.2 Results
Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The measures of performance are the following:
• Clustering error (in percentage): to measure the quality of the partition with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows reaching the ideal 0% of clustering error even if the IDs of the clusters and of the real classes differ.
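For small K, this permutation-invariant error can be computed by brute force over all relabelings of the predicted cluster IDs (a simple sketch of the idea, not the Wu and Schölkopf implementation):

```python
from itertools import permutations

def clustering_error(labels_true, labels_pred, K):
    """Permutation-invariant clustering error: the lowest misclassification
    rate over all relabelings of the predicted cluster IDs.

    Brute force over K! permutations, which is fine for the small K
    used here (K = 2 or 4)."""
    n = len(labels_true)
    best = n
    for perm in permutations(range(K)):
        errors = sum(1 for t, p in zip(labels_true, labels_pred) if t != perm[p])
        best = min(best, errors)
    return best / n
```

A partition identical to the ground truth up to a renaming of the clusters thus scores exactly 0.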
• Number of disposed features: this value shows the number of variables whose coefficients have been zeroed, and which are therefore not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.
• Time of execution (in hours, minutes or seconds): finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. These algorithms tend to be more memory- and CPU-consuming as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.
The adequacy of the selected features was assessed by the true positive rate (TPR) and the false positive rate (FPR). The TPR is defined as the ratio of relevant variables that are actually selected; similarly, the FPR is the ratio of non-relevant variables that are actually selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded, due to high computing time and high clustering error respectively, and as the two versions of LumiWCluster provide almost the same TPR and FPR, only one of them is displayed. The three remaining algorithms are Fisher EM, by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.
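These two rates can be computed directly from index sets (a minimal sketch with hypothetical names, not thesis code):

```python
def tpr_fpr(selected, relevant, p):
    """True/false positive rates of a feature selection.

    `selected` and `relevant` are collections of feature indices and
    `p` is the total number of features.  TPR is the fraction of
    relevant features that were selected; FPR is the fraction of
    irrelevant features that were selected."""
    relevant = set(relevant)
    irrelevant = set(range(p)) - relevant
    selected = set(selected)
    tpr = len(selected & relevant) / len(relevant)
    fpr = len(selected & irrelevant) / len(irrelevant)
    return tpr, fpr
```

In the setting of this chapter, the first 20 of p = 100 features are relevant, so a perfect selector returns (1.0, 0.0).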
Results, in percentages, are displayed in Figure 10.2 (and in Table 10.2).
Table 10.1: Experimental results for simulated data

                        Err (%)        Var           Time
Sim 1: K = 4, mean shift, ind. features
  CS general cov        4.6 (1.5)      98.5 (7.2)    884h
  Fisher EM             5.8 (8.7)      78.4 (5.2)    1645m
  Clustvarsel           60.2 (10.7)    37.8 (29.1)   383h
  LumiWCluster-Kuan     4.2 (6.8)      77.9 (4)      389s
  LumiWCluster-Wang     4.3 (6.9)      78.4 (3.9)    619s
  Mix-GLOSS             3.2 (1.6)      80 (0.9)      15h

Sim 2: K = 2, mean shift, dependent features
  CS general cov        15.4 (2)       99.7 (0.9)    783h
  Fisher EM             7.4 (2.3)      80.9 (2.8)    8m
  Clustvarsel           7.3 (2)        33.4 (20.7)   166h
  LumiWCluster-Kuan     6.4 (1.8)      79.8 (0.4)    155s
  LumiWCluster-Wang     6.3 (1.7)      79.9 (0.3)    14s
  Mix-GLOSS             7.7 (2)        84.1 (3.4)    2h

Sim 3: K = 4, 1D mean shift, ind. features
  CS general cov        30.4 (5.7)     55 (46.8)     1317h
  Fisher EM             23.3 (6.5)     36.6 (5.5)    22m
  Clustvarsel           65.8 (11.5)    23.2 (29.1)   542h
  LumiWCluster-Kuan     32.3 (2.1)     80 (0.2)      83s
  LumiWCluster-Wang     30.8 (3.6)     80 (0.2)      1292s
  Mix-GLOSS             34.7 (9.2)     81 (8.8)      21h

Sim 4: K = 4, mean shift, ind. features
  CS general cov        62.6 (5.5)     99.9 (0.2)    112h
  Fisher EM             56.7 (10.4)    55 (4.8)      195m
  Clustvarsel           73.2 (4)       24 (12)       767h
  LumiWCluster-Kuan     69.2 (11.2)    99 (2)        876s
  LumiWCluster-Wang     69.7 (11.9)    99.1 (2.1)    825s
  Mix-GLOSS             66.9 (9.1)     97.5 (1.2)    11h
Table 10.2: TPR versus FPR (in %), average computed over 25 repetitions, for the best performing algorithms

             Simulation 1    Simulation 2    Simulation 3    Simulation 4
             TPR     FPR     TPR     FPR     TPR     FPR     TPR     FPR
MIX-GLOSS    99.2    0.15    82.8    3.35    88.4    6.7     78.0    1.2
LUMI-KUAN    99.2    2.8     100.0   0.2     100.0   0.05    5.0     0.05
FISHER-EM    98.6    2.4     88.8    1.7     83.8    58.25   62.0    40.75
Figure 10.2: TPR versus FPR (in %) for the best performing algorithms and simulations
10.3 Discussion
After reviewing Tables 10.1–10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. Depending on the objectives and constraints of the problem, the following observations deserve to be highlighted.
LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior regarding the other performance measures. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on each implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation times are worth mentioning here.
The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.
From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing the irrelevant variables in all situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best, or close to the best solution, in terms of fall-out and recall.
Conclusions
Summary
The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.
In this thesis, we have demonstrated the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.
The equivalence between LDA and OS problems allows bringing all the resources available for solving regression problems to the solution of linear discrimination problems. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.
In Part II, we have used a variational approach of the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver (GLOSS), which has proved its effectiveness, inducing extremely parsimonious models without renouncing any predictive capability. GLOSS has been tested with four artificial and three real datasets, outperforming state-of-the-art algorithms in almost all situations.
In Part III, this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As for the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion to be maximized at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement the model selection of the penalty parameter. In this case too, the theory has been put into practice with the implementation of Mix-GLOSS. By now, due to time constraints, only artificial datasets have been tested, with positive results.
Perspectives
Even if the preliminary results are optimistic, Mix-GLOSS has not been sufficiently tested. We have planned to test it at least with the same real datasets that we used with GLOSS; however, more testing would be recommended in both cases. These algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables, but other high-dimension low-sample-size (HDLSS) domains are also possible: identification of male or female silhouettes, fungal species or fish species based on shape and texture (Clemmensen et al., 2011), and the Stirling faces (Roth and Lange, 2004) are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a) have also been tested in the bibliography.
At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was achieving functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better-suited and documented version of GLOSS and Mix-GLOSS should be made available in the short term.
The theory developed in this thesis, and the programming structure used for its implementation, allow easy alterations of the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix; spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as a Laplacian describing the relationships between variables; that can be used to implement pair-wise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement elastic-net-like penalties. Some of those possibilities, such as the diagonal version of GLOSS, have been partially implemented; however, they have not been properly tested, or even updated with the last algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time deadlines for the publication of this thesis.
From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized in Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Ignoring this criterion does not prevent successful runs of Mix-GLOSS: other mechanisms, which do not involve the computation of the real criterion, have been used for the stopping of the EM algorithm and for model selection. However, further investigations must be done in this direction to assess the convergence properties of this algorithm.
At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was made in the domains of outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers consists in modeling the population with a mixture model where the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather together all those points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.
Appendix
A Matrix Properties
Property 1. By definition, ΣW and ΣB are both symmetric matrices:
$$\Sigma_W = \frac{1}{n}\sum_{k=1}^{g}\sum_{i\in C_k}(x_i-\mu_k)(x_i-\mu_k)^\top, \qquad \Sigma_B = \frac{1}{n}\sum_{k=1}^{g} n_k(\mu_k-\bar{x})(\mu_k-\bar{x})^\top .$$
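A small numpy sketch of Property 1, which also illustrates the classical decomposition ΣW + ΣB = ΣT of the total covariance (the decomposition check is ours, not stated in the text):

```python
import numpy as np

def within_between(X, labels, K):
    """Within-class and between-class covariance matrices of Property 1.

    Both are symmetric by construction, and Sigma_W + Sigma_B equals
    the total covariance around the global mean."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    SW = np.zeros((p, p))
    SB = np.zeros((p, p))
    for k in range(K):
        Xk = X[labels == k]
        mu = Xk.mean(axis=0)
        SW += (Xk - mu).T @ (Xk - mu)
        SB += len(Xk) * np.outer(mu - xbar, mu - xbar)
    return SW / n, SB / n
```

The symmetry of both outputs is immediate from the outer-product form of each summand.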
Property 2. $\partial\, \mathbf{x}^\top\mathbf{a} / \partial\mathbf{x} = \partial\, \mathbf{a}^\top\mathbf{x} / \partial\mathbf{x} = \mathbf{a}$

Property 3. $\partial\, \mathbf{x}^\top A\mathbf{x} / \partial\mathbf{x} = (A + A^\top)\mathbf{x}$

Property 4. $\partial\, |X^{-1}| / \partial X = -|X^{-1}|\,(X^{-1})^\top$

Property 5. $\partial\, \mathbf{a}^\top X\mathbf{b} / \partial X = \mathbf{a}\mathbf{b}^\top$

Property 6. $\dfrac{\partial}{\partial X}\,\mathrm{tr}\!\left(AX^{-1}B\right) = -\left(X^{-1}BAX^{-1}\right)^\top = -X^{-\top}A^\top B^\top X^{-\top}$
B The Penalized-OS Problem is an Eigenvector Problem
In this appendix, we answer the question of why the solution of a penalized optimal scoring regression involves the computation of an eigenvector decomposition. The p-OS problem has the following form:
$$\min_{\theta_k,\beta_k}\ \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top\Omega_k\beta_k \qquad (B.1)$$
$$\text{s.t.} \quad \theta_k^\top Y^\top Y\theta_k = 1, \qquad \theta_\ell^\top Y^\top Y\theta_k = 0 \quad \forall \ell < k,$$
for k = 1, …, K − 1.

The Lagrangian associated with Problem (B.1) is
$$\mathcal{L}_k(\theta_k,\beta_k,\lambda_k,\boldsymbol{\nu}_k) = \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top\Omega_k\beta_k + \lambda_k\!\left(\theta_k^\top Y^\top Y\theta_k - 1\right) + \sum_{\ell<k}\nu_\ell\,\theta_\ell^\top Y^\top Y\theta_k \qquad (B.2)$$
Setting the gradient of (B.2) with respect to βk to zero gives the value of the optimal βk:
$$\beta_k^\star = (X^\top X + \Omega_k)^{-1}X^\top Y\theta_k \qquad (B.3)$$
The objective function of (B.1), evaluated at $\beta_k^\star$, is
$$\min_{\theta_k}\ \|Y\theta_k - X\beta_k^\star\|_2^2 + \beta_k^{\star\top}\Omega_k\beta_k^\star = \min_{\theta_k}\ \theta_k^\top Y^\top\!\left(I - X(X^\top X+\Omega_k)^{-1}X^\top\right)\!Y\theta_k = \max_{\theta_k}\ \theta_k^\top Y^\top X(X^\top X+\Omega_k)^{-1}X^\top Y\theta_k \qquad (B.4)$$
If the penalty matrix Ωk is identical for all problems, Ωk = Ω, then (B.4) corresponds to an eigen-problem where the K − 1 score vectors θk are the eigenvectors of $Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$.
B.1 How to Solve the Eigenvector Decomposition
Performing an eigen-decomposition of an expression like $Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$ is not trivial, due to the p × p inverse. With some datasets, p can be extremely large, making this inverse intractable. In this section, we show how to circumvent this issue by solving an easier eigenvector decomposition.
Let M be the matrix $Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$, so that we can rewrite expression (B.4) in a compact way:
$$\max_{\Theta\in\mathbb{R}^{K\times(K-1)}}\ \mathrm{tr}\!\left(\Theta^\top M\Theta\right) \qquad (B.5)$$
$$\text{s.t.} \quad \Theta^\top Y^\top Y\Theta = I_{K-1}$$
If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the (K − 1) × (K − 1) matrix $M_\Theta$ be $\Theta^\top M\Theta$. The classical eigenvector formulation associated with (B.5) is then
$$M_\Theta \mathbf{v} = \lambda\mathbf{v} \qquad (B.6)$$
where v is the eigenvector and λ the associated eigenvalue of $M_\Theta$. Operating,
$$\mathbf{v}^\top M_\Theta\mathbf{v} = \lambda \ \Leftrightarrow\ \mathbf{v}^\top\Theta^\top M\Theta\mathbf{v} = \lambda\ .$$
Making the variable change w = Θv, we obtain an alternative eigen-problem, where the w are the eigenvectors of M and λ the associated eigenvalue:
$$\mathbf{w}^\top M\mathbf{w} = \lambda \qquad (B.7)$$
Therefore, v are the eigenvectors of the eigen-decomposition of the matrix $M_\Theta$, and w are the eigenvectors of the eigen-decomposition of the matrix M. Note that the only difference between the (K − 1) × (K − 1) matrix $M_\Theta$ and the K × K matrix M is the K × (K − 1) matrix Θ in the expression $M_\Theta = \Theta^\top M\Theta$. Then, to avoid the computation of the p × p inverse $(X^\top X+\Omega)^{-1}$, we can use the optimal value of the coefficient matrix $B = (X^\top X+\Omega)^{-1}X^\top Y\Theta$ in $M_\Theta$:
$$M_\Theta = \Theta^\top Y^\top X(X^\top X+\Omega)^{-1}X^\top Y\Theta = \Theta^\top Y^\top XB$$
Thus, the eigen-decomposition of the (K − 1) × (K − 1) matrix $M_\Theta = \Theta^\top Y^\top XB$ results in the v eigenvectors of (B.6). To obtain the w eigenvectors of the alternative formulation (B.7), the variable change w = Θv needs to be undone.
To summarize, we compute the v eigenvectors from the eigen-decomposition of the tractable matrix $M_\Theta$, evaluated as $\Theta^\top Y^\top XB$; the definitive eigenvectors w are then recovered as w = Θv. The final step is the reconstruction of the optimal score matrix Θ, using the vectors w as its columns. At this point, we understand what in the literature is called "updating the initial score matrix": multiplying the initial Θ by the eigenvector matrix V from decomposition (B.6) reverses the change of variable to restore the w vectors. The B matrix also needs to be "updated", by multiplying it by the same matrix of eigenvectors V, in order to account for the initial Θ matrix used in the first computation of B:
$$B \leftarrow (X^\top X+\Omega)^{-1}X^\top Y\Theta V = BV$$
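The whole procedure can be sketched with numpy; `update_scores` is our own name for this hypothetical helper, and Ω is assumed symmetric so that $M_\Theta$ is symmetric too:

```python
import numpy as np

def update_scores(X, Y, Theta0, Omega):
    """Solve the p-OS eigen-problem through the small (K-1) x (K-1)
    matrix M_Theta = Theta0^T Y^T X B, avoiding any p x p
    eigen-decomposition.

    Returns the updated score and coefficient matrices
    (Theta0 @ V, B @ V), with eigenvectors sorted by decreasing
    eigenvalue."""
    B = np.linalg.solve(X.T @ X + Omega, X.T @ Y @ Theta0)   # optimal B, cf. (B.3)
    M_theta = Theta0.T @ Y.T @ X @ B                          # small symmetric matrix
    evals, V = np.linalg.eigh(M_theta)                        # ascending eigenvalues
    V = V[:, ::-1]                                            # reorder: decreasing
    return Theta0 @ V, B @ V
```

Only one p × p linear system is solved (for B); the eigen-decomposition itself is performed on a (K − 1) × (K − 1) matrix, which is tiny when K ≪ p.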
B.2 Why the OS Problem is Solved as an Eigenvector Problem
In the optimal scoring literature, the score matrix Θ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix $M = Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$.
By definition of the eigen-decomposition, the eigenvectors of the matrix M (called w in (B.7)) form a basis, so that any score vector θ can be expressed as a linear combination of them:
$$\theta_k = \sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m \quad \text{s.t.} \quad \theta_k^\top\theta_k = 1 \qquad (B.8)$$
The score vector constraint $\theta_k^\top\theta_k = 1$ can also be expressed as a function of this basis,
$$\left(\sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m\right)^{\!\top}\left(\sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m\right) = 1\ ,$$
which, as per the eigenvector properties, can be reduced to
$$\sum_{m=1}^{K-1}\alpha_m^2 = 1 \qquad (B.9)$$
Let M be multiplied by a score vector θk, replaced by its linear combination of eigenvectors (B.8):
$$M\theta_k = M\sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m = \sum_{m=1}^{K-1}\alpha_m M\mathbf{w}_m\ .$$
As the wm are the eigenvectors of the matrix M, the relationship $M\mathbf{w}_m = \lambda_m\mathbf{w}_m$ can be used to obtain
$$M\theta_k = \sum_{m=1}^{K-1}\alpha_m\lambda_m\mathbf{w}_m\ .$$
Multiplying on the left by $\theta_k^\top$, expressed as its linear combination of eigenvectors, yields
$$\theta_k^\top M\theta_k = \left(\sum_{\ell=1}^{K-1}\alpha_\ell\mathbf{w}_\ell\right)^{\!\top}\left(\sum_{m=1}^{K-1}\alpha_m\lambda_m\mathbf{w}_m\right).$$
This equation can be simplified using the orthogonality property of the eigenvectors, according to which $\mathbf{w}_\ell^\top\mathbf{w}_m = 0$ for any ℓ ≠ m, giving
$$\theta_k^\top M\theta_k = \sum_{m=1}^{K-1}\alpha_m^2\lambda_m\ .$$
The optimization problem (B.5) for discriminant direction k can be rewritten as
$$\max_{\theta_k\in\mathbb{R}^{K\times 1}}\ \theta_k^\top M\theta_k = \max_{\theta_k\in\mathbb{R}^{K\times 1}}\ \sum_{m=1}^{K-1}\alpha_m^2\lambda_m \qquad (B.10)$$
$$\text{with} \quad \theta_k = \sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m \quad \text{and} \quad \sum_{m=1}^{K-1}\alpha_m^2 = 1\ .$$
One way of maximizing Problem (B.10) is to choose $\alpha_m = 1$ for $m = k$ and $\alpha_m = 0$ otherwise. Hence, as $\theta_k = \sum_{m=1}^{K-1} \alpha_m w_m$, the resulting score vector $\theta_k$ is equal to the $k$-th eigenvector $w_k$.

As a summary, it can be concluded that the solution of the original problem (B.1) can be achieved by an eigenvector decomposition of the matrix $M = Y^\top X (X^\top X + \Omega)^{-1} X^\top Y$.
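This conclusion can be checked numerically (a hedged sketch with random data, not code from the thesis): at an eigenvector of $M$, the quadratic objective equals the corresponding eigenvalue, and no other unit-norm score vector beats the leading one.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, K = 50, 8, 4

X = rng.standard_normal((n, p))
Y = np.eye(K)[rng.integers(0, K, n)]
Omega = np.eye(p)

# M = Y'X (X'X + Omega)^{-1} X'Y  (symmetric positive semi-definite)
M = Y.T @ X @ np.linalg.solve(X.T @ X + Omega, X.T @ Y)

eigval, W = np.linalg.eigh(M)      # eigenvalues in ascending order
theta = W[:, -1]                   # unit-norm eigenvector of the largest eigenvalue

# The quadratic objective at an eigenvector equals its eigenvalue ...
assert np.isclose(theta @ M @ theta, eigval[-1])

# ... and no random unit-norm score vector does better (Rayleigh quotient bound)
for _ in range(100):
    t = rng.standard_normal(K)
    t /= np.linalg.norm(t)
    assert t @ M @ t <= eigval[-1] + 1e-10
```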
C Solving Fisher's Discriminant Problem
The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data have maximal between-class variance, under a unitary constraint on the within-class variance:
$$\max_{\beta \in \mathbb{R}^p} \; \beta^\top \Sigma_B \beta \qquad (C.1a)$$
$$\text{s.t.} \; \beta^\top \Sigma_W \beta = 1 \;, \qquad (C.1b)$$

where $\Sigma_B$ and $\Sigma_W$ are respectively the between-class variance and the within-class variance of the original $p$-dimensional data.
The Lagrangian of Problem (C.1) is

$$L(\beta, \nu) = \beta^\top \Sigma_B \beta - \nu\left(\beta^\top \Sigma_W \beta - 1\right) \;,$$

so that its first derivative with respect to $\beta$ is

$$\frac{\partial L(\beta, \nu)}{\partial \beta} = 2\Sigma_B \beta - 2\nu \Sigma_W \beta \;.$$

A necessary optimality condition for $\beta^\star$ is that this derivative is zero, that is,

$$\Sigma_B \beta^\star = \nu \Sigma_W \beta^\star \;.$$
Provided $\Sigma_W$ is full rank, we have

$$\Sigma_W^{-1} \Sigma_B \beta^\star = \nu \beta^\star \;. \qquad (C.2)$$

Thus the solutions $\beta^\star$ match the definition of an eigenvector of the matrix $\Sigma_W^{-1}\Sigma_B$, with eigenvalue $\nu$. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:

$$\beta^{\star\top} \Sigma_B \beta^\star = \beta^{\star\top} \Sigma_W \Sigma_W^{-1} \Sigma_B \beta^\star
= \nu \, \beta^{\star\top} \Sigma_W \beta^\star \;\; \text{from (C.2)}
= \nu \;\; \text{from (C.1b)} \;.$$

That is, the optimal value of the objective function to be maximized is the eigenvalue $\nu$. Hence $\nu$ is the largest eigenvalue of $\Sigma_W^{-1}\Sigma_B$, and $\beta^\star$ is any eigenvector corresponding to this maximal eigenvalue.
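A small numerical illustration of this result (a sketch on simulated Gaussian classes; all names are illustrative, not thesis code): the leading eigenvector of $\Sigma_W^{-1}\Sigma_B$, normalized so that $\beta^\top \Sigma_W \beta = 1$, attains an objective value equal to its eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, K = 300, 5, 3

# Three Gaussian classes sharing a common identity covariance
means = rng.standard_normal((K, p)) * 3
labels = rng.integers(0, K, n)
X = means[labels] + rng.standard_normal((n, p))

# Within- and between-class variance matrices
grand = X.mean(axis=0)
Sigma_W = np.zeros((p, p))
Sigma_B = np.zeros((p, p))
for k in range(K):
    Xk = X[labels == k]
    Sigma_W += (Xk - Xk.mean(0)).T @ (Xk - Xk.mean(0)) / n
    Sigma_B += len(Xk) / n * np.outer(Xk.mean(0) - grand, Xk.mean(0) - grand)

# Solve the eigenproblem Sigma_W^{-1} Sigma_B beta = nu beta
nus, betas = np.linalg.eig(np.linalg.solve(Sigma_W, Sigma_B))
i = np.argmax(nus.real)
beta = betas[:, i].real
beta /= np.sqrt(beta @ Sigma_W @ beta)      # enforce beta' Sigma_W beta = 1

# The objective value equals the top eigenvalue, as derived above
assert np.isclose(beta @ Sigma_B @ beta, nus.real[i])
```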
D Alternative Variational Formulation for the Group-Lasso
In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:

$$\min_{\tau \in \mathbb{R}^p} \; \min_{B \in \mathbb{R}^{p\times(K-1)}} \; J(B) + \lambda \sum_{j=1}^{p} w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j} \qquad (D.1a)$$
$$\text{s.t.} \; \sum_{j=1}^{p} \tau_j = 1 \;, \qquad (D.1b)$$
$$\tau_j \ge 0 \;, \quad j = 1, \ldots, p \;. \qquad (D.1c)$$
Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let $B \in \mathbb{R}^{p\times(K-1)}$ be a matrix composed of row vectors $\beta^j \in \mathbb{R}^{K-1}$: $B = \left(\beta^{1\top}, \ldots, \beta^{p\top}\right)^\top$.
$$L(B, \tau, \lambda, \nu_0, \nu_j) = J(B) + \lambda \sum_{j=1}^{p} w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j} + \nu_0\left(\sum_{j=1}^{p} \tau_j - 1\right) - \sum_{j=1}^{p} \nu_j \tau_j \qquad (D.2)$$
The starting point is the Lagrangian (D2) that is differentiated with respect to τj toget the optimal value τj
partL(B τ λ ν0 νj)
partτj
∣∣∣∣τj=τj
= 0 rArr minusλw2j
∥∥βj∥∥2
2
τj2 + ν0 minus νj = 0
rArr minusλw2j
∥∥βj∥∥2
2+ ν0τ
j
2 minus νjτj2 = 0
rArr minusλw2j
∥∥βj∥∥2
2+ ν0τ
j
2 = 0
The last two expressions are related through one property of the Lagrange multipliersthat states that νjgj(τ
) = 0 where νj is the Lagrange multiplier and gj(τ) is the
inequality Lagrange condition Then the optimal τj can be deduced
τj =
radicλ
ν0wj∥∥βj∥∥
2
Placing this optimal value of $\tau_j$ into constraint (D.1b):

$$\sum_{j=1}^{p} \tau_j^\star = 1 \;\Rightarrow\; \tau_j^\star = \frac{w_j \|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2} \;. \qquad (D.3)$$
With this value of $\tau_j^\star$, Problem (D.1) is equivalent to

$$\min_{B \in \mathbb{R}^{p\times(K-1)}} \; J(B) + \lambda \left(\sum_{j=1}^{p} w_j \|\beta^j\|_2\right)^{\!2} \;. \qquad (D.4)$$
This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4). In particular, its solution is expected to be sparse, with some null row vectors $\beta^j$.
The penalty term of (D.1a) can be conveniently presented as $\lambda B^\top \Omega B$, where

$$\Omega = \mathrm{diag}\left(\frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \ldots, \frac{w_p^2}{\tau_p}\right) \;. \qquad (D.5)$$
Using the value of $\tau_j^\star$ from (D.3), each diagonal component of $\Omega$ is

$$(\Omega)_{jj} = \frac{w_j \sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2}{\|\beta^j\|_2} \;. \qquad (D.6)$$
In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.
D.1 Useful Properties

Lemma D.1. If $J$ is convex, Problem (D.1) is convex.
In what follows, $J$ will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.
Lemma D.2. For all $B \in \mathbb{R}^{p\times(K-1)}$, the subdifferential of the objective function of Problem (D.4) is the set of matrices $V \in \mathbb{R}^{p\times(K-1)}$ such that

$$V = \frac{\partial J(B)}{\partial B} + 2\lambda \left(\sum_{j=1}^{p} w_j \|\beta^j\|_2\right) G \;, \qquad (D.7)$$

where $G = \left(g^{1\top}, \ldots, g^{p\top}\right)^\top$ is a $p\times(K-1)$ matrix defined as follows. Let $S(B)$ denote the row-wise support of $B$, $S(B) = \{j \in \{1, \ldots, p\} : \|\beta^j\|_2 \neq 0\}$; then we have

$$\forall j \in S(B) \;, \quad g^j = w_j \|\beta^j\|_2^{-1} \beta^j \;, \qquad (D.8)$$
$$\forall j \notin S(B) \;, \quad \|g^j\|_2 \le w_j \;. \qquad (D.9)$$
This condition results in an equality for the "active" non-zero vectors $\beta^j$ and an inequality for the other ones, which both provide essential building blocks of our algorithm.
Lemma D.3. Problem (D.4) admits at least one solution, which is unique if $J(B)$ is strictly convex. All critical points $B^\star$ of the objective function verifying the following conditions are global minima. Let $S(B^\star)$ denote the row-wise support of $B^\star$, $S(B^\star) = \{j \in \{1, \ldots, p\} : \|\beta^{\star j}\|_2 \neq 0\}$, and let $\bar S(B^\star)$ be its complement; then we have

$$\forall j \in S(B^\star) \;, \quad -\frac{\partial J(B^\star)}{\partial \beta^j} = 2\lambda \left(\sum_{j'=1}^{p} w_{j'} \|\beta^{\star j'}\|_2\right) w_j \|\beta^{\star j}\|_2^{-1} \beta^{\star j} \;, \qquad (D.10a)$$
$$\forall j \in \bar S(B^\star) \;, \quad \left\|\frac{\partial J(B^\star)}{\partial \beta^j}\right\|_2 \le 2\lambda w_j \sum_{j'=1}^{p} w_{j'} \|\beta^{\star j'}\|_2 \;. \qquad (D.10b)$$
In particular, Lemma D.3 provides a well-defined characterization of the support of the solution, which is not easily obtained from the direct analysis of the variational problem (D.1).
D.2 An Upper Bound on the Objective Function
Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and, for a given $B$, the gap between these objectives is null at $\tau^\star$ such that

$$\tau_j^\star = \frac{w_j \|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2} \;.$$
Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let $\tau \in \mathbb{R}^p$ be any feasible vector; we have

$$\left(\sum_{j=1}^{p} w_j \|\beta^j\|_2\right)^{\!2}
= \left(\sum_{j=1}^{p} \tau_j^{1/2}\, \frac{w_j \|\beta^j\|_2}{\tau_j^{1/2}}\right)^{\!2}
\le \left(\sum_{j=1}^{p} \tau_j\right) \left(\sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j}\right)
\le \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j} \;,$$

where we used the Cauchy-Schwarz inequality in the second step and the definition of the feasibility set of $\tau$ in the last one.
This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of $\tau$ and $\beta$ are intertwined.
E Invariance of the Group-Lasso to Unitary Transformations
The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients $B^0$ are optimal for the score values $\Theta^0$, and if the optimal scores $\Theta^\star$ are obtained by a unitary transformation of $\Theta^0$, say $\Theta^\star = \Theta^0 V$ (where $V \in \mathbb{R}^{M\times M}$ is a unitary matrix), then $B^\star = B^0 V$ is optimal conditionally on $\Theta^\star$; that is, $(\Theta^\star, B^\star)$ is a global solution corresponding to the optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.
Proposition E.1. Let $\hat B$ be a solution of

$$\min_{B \in \mathbb{R}^{p\times M}} \; \|Y - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 \;, \qquad (E.1)$$

and let $\tilde Y = YV$, where $V \in \mathbb{R}^{M\times M}$ is a unitary matrix. Then $\tilde B = \hat B V$ is a solution of

$$\min_{B \in \mathbb{R}^{p\times M}} \; \|\tilde Y - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 \;. \qquad (E.2)$$
Proof. The first-order necessary optimality conditions for $\hat B$ are

$$\forall j \in S(\hat B) \;, \quad 2\, x^{j\top}\!\left(X\hat B - Y\right) + \lambda w_j \|\hat\beta^j\|_2^{-1} \hat\beta^j = 0 \;, \qquad (E.3a)$$
$$\forall j \in \bar S(\hat B) \;, \quad 2 \left\|x^{j\top}\!\left(X\hat B - Y\right)\right\|_2 \le \lambda w_j \;, \qquad (E.3b)$$

where $S(\hat B) \subseteq \{1, \ldots, p\}$ denotes the set of non-zero row vectors of $\hat B$, $\bar S(\hat B)$ is its complement, and $x^j$ is the $j$-th column of $X$.
First, we note that, from the definition of $\tilde B$, we have $S(\tilde B) = S(\hat B)$. Then we may rewrite the above conditions as follows:
$$\forall j \in S(\tilde B) \;, \quad 2\, x^{j\top}\!\left(X\tilde B - \tilde Y\right) + \lambda w_j \|\tilde\beta^j\|_2^{-1} \tilde\beta^j = 0 \;, \qquad (E.4a)$$
$$\forall j \in \bar S(\tilde B) \;, \quad 2 \left\|x^{j\top}\!\left(X\tilde B - \tilde Y\right)\right\|_2 \le \lambda w_j \;, \qquad (E.4b)$$

where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by $V$, and also uses that $VV^\top = I$, so that $\forall u \in \mathbb{R}^M$, $\|u^\top\|_2 = \|u^\top V\|_2$. Equation (E.4b) is also
obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for $\tilde B$ to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
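The invariance underlying this proposition can be illustrated numerically (a sketch with random matrices, not thesis code; the objective below mirrors (E.1)): rotating $Y$ and $B$ by the same unitary $V$ leaves the group-Lasso objective unchanged, since both the Frobenius norm and the row norms of $B$ are invariant under right-multiplication by $V$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, M = 30, 5, 3

X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, M))
B = rng.standard_normal((p, M))
w = rng.uniform(0.5, 2.0, p)

# A random unitary matrix V (orthogonal, since we work with real matrices)
V = np.linalg.qr(rng.standard_normal((M, M)))[0]

def objective(Y, B, lam=0.1):
    """Group-Lasso objective: ||Y - XB||_F^2 + lam * sum_j w_j ||beta^j||_2."""
    return (np.linalg.norm(Y - X @ B, "fro") ** 2
            + lam * np.sum(w * np.linalg.norm(B, axis=1)))

# Rotating Y and B together leaves the objective unchanged
assert np.isclose(objective(Y, B), objective(Y @ V, B @ V))
```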
F Expected Complete Likelihood and Likelihood
Section 7.1.2 explains that by maximizing the conditional expectation of the complete log-likelihood $Q(\theta, \theta')$ (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed using its definition (7.1), but there is a shorter way to compute it from $Q(\theta, \theta')$ when the latter is available.
$$L(\theta) = \sum_{i=1}^{n} \log\left(\sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k)\right) \qquad (F.1)$$
$$Q(\theta, \theta') = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}(\theta') \log\left(\pi_k f_k(x_i; \theta_k)\right) \qquad (F.2)$$
$$\text{with } t_{ik}(\theta') = \frac{\pi'_k f_k(x_i; \theta'_k)}{\sum_\ell \pi'_\ell f_\ell(x_i; \theta'_\ell)} \;. \qquad (F.3)$$
In the EM algorithm, $\theta'$ denotes the model parameters at the previous iteration, $t_{ik}(\theta')$ are the posterior probability values computed from $\theta'$ at the previous E-step, and $\theta$, without "prime", denotes the parameters of the current iteration, to be obtained by the maximization of $Q(\theta, \theta')$.
Using (F.3), we have

$$Q(\theta, \theta') = \sum_{i,k} t_{ik}(\theta') \log\left(\pi_k f_k(x_i; \theta_k)\right)$$
$$= \sum_{i,k} t_{ik}(\theta') \log(t_{ik}(\theta)) + \sum_{i,k} t_{ik}(\theta') \log\left(\sum_\ell \pi_\ell f_\ell(x_i; \theta_\ell)\right)$$
$$= \sum_{i,k} t_{ik}(\theta') \log(t_{ik}(\theta)) + L(\theta) \;.$$
In particular, after the evaluation of $t_{ik}$ in the E-step, where $\theta = \theta'$, the log-likelihood can be computed using the value of $Q(\theta, \theta)$ (7.7) and the entropy of the posterior probabilities:

$$L(\theta) = Q(\theta, \theta) - \sum_{i,k} t_{ik}(\theta) \log(t_{ik}(\theta)) = Q(\theta, \theta) + H(T) \;.$$
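This identity is easy to confirm numerically on a toy mixture (a sketch; the univariate two-component Gaussian mixture below is illustrative, not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200

# A one-dimensional two-component Gaussian mixture with known parameters
pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 2.0])
x = np.where(rng.random(n) < pi[0],
             rng.normal(mu[0], 1, n), rng.normal(mu[1], 1, n))

def gauss(x, m):
    """Standard-deviation-one Gaussian density centered at m."""
    return np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)

# Component densities weighted by priors, shape (n, K)
joint = pi * gauss(x[:, None], mu)

L = np.sum(np.log(joint.sum(axis=1)))          # log-likelihood (F.1)
t = joint / joint.sum(axis=1, keepdims=True)   # posteriors t_ik (F.3)
Q = np.sum(t * np.log(joint))                  # Q(theta, theta) (F.2)
H = -np.sum(t * np.log(t))                     # entropy of the posteriors

# L(theta) = Q(theta, theta) + H(T)
assert np.isclose(L, Q + H)
```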
G Derivation of the M-Step Equations
This appendix shows the whole process to obtain expressions (7.10), (7.11) and (7.12), in the context of a Gaussian mixture model with common covariance matrix. The criterion is defined as

$$Q(\theta, \theta') = \sum_{i,k} t_{ik}(\theta') \log\left(\pi_k f_k(x_i; \theta_k)\right)$$
$$= \sum_{k} \log(\pi_k) \sum_{i} t_{ik} \;-\; \frac{np}{2}\log(2\pi) \;-\; \frac{n}{2}\log|\Sigma| \;-\; \frac{1}{2}\sum_{i,k} t_{ik}\, (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \;,$$

which has to be maximized subject to $\sum_k \pi_k = 1$.
The Lagrangian of this problem is

$$L(\theta) = Q(\theta, \theta') + \lambda\left(\sum_k \pi_k - 1\right) \;.$$
The partial derivatives of the Lagrangian are set to zero to obtain the optimal values of $\pi_k$, $\mu_k$ and $\Sigma$.
G.1 Prior probabilities

$$\frac{\partial L(\theta)}{\partial \pi_k} = 0 \;\Leftrightarrow\; \frac{1}{\pi_k}\sum_i t_{ik} + \lambda = 0 \;,$$

where $\lambda$ is identified from the constraint ($\lambda = -n$), leading to

$$\pi_k = \frac{1}{n}\sum_i t_{ik} \;.$$
G.2 Means

$$\frac{\partial L(\theta)}{\partial \mu_k} = 0 \;\Leftrightarrow\; -\frac{1}{2}\sum_i t_{ik}\, 2\Sigma^{-1}(\mu_k - x_i) = 0$$
$$\Rightarrow\; \mu_k = \frac{\sum_i t_{ik}\, x_i}{\sum_i t_{ik}} \;.$$
G.3 Covariance Matrix

$$\frac{\partial L(\theta)}{\partial \Sigma^{-1}} = 0 \;\Leftrightarrow\; \underbrace{\frac{n}{2}\Sigma}_{\text{as per property 4}} \;-\; \underbrace{\frac{1}{2}\sum_{i,k} t_{ik}\, (x_i - \mu_k)(x_i - \mu_k)^\top}_{\text{as per property 5}} = 0$$
$$\Rightarrow\; \Sigma = \frac{1}{n}\sum_{i,k} t_{ik}\, (x_i - \mu_k)(x_i - \mu_k)^\top \;.$$
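The M-step formulas above can be sketched in a few lines (illustrative code with random posteriors, not the thesis implementation): the closed-form updates satisfy the stationarity conditions just derived.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, K = 100, 3, 2

x = rng.standard_normal((n, p))
t = rng.dirichlet(np.ones(K), size=n)      # posterior probabilities, rows sum to 1

# Closed-form M-step updates derived above
pi = t.sum(axis=0) / n
mu = (t.T @ x) / t.sum(axis=0)[:, None]
Sigma = sum(
    (t[:, k, None] * (x - mu[k])).T @ (x - mu[k]) for k in range(K)
) / n

# Sanity checks: priors sum to one, and each mean zeroes the gradient
# term sum_i t_ik (x_i - mu_k) of the criterion
assert np.isclose(pi.sum(), 1.0)
for k in range(K):
    assert np.allclose(t[:, k] @ (x - mu[k]), 0.0)
```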
"One never knows what one will find behind a door. Perhaps that is what life consists of: turning doorknobs."
Albert Espinosa
"Be brave. Take risks. Nothing can substitute experience."
Paulo Coelho
Acknowledgements
If this thesis has fallen into your hands and you have the curiosity to read this paragraph, you must know that, even though it is a short section, there are quite a lot of people behind this volume. All of them supported me during the three years, three months and three weeks that it took me to finish this work. However, you will hardly find any names. I think it is a little sad writing people's names in a document that they will probably not see and that will be condemned to gather dust on a bookshelf. It is like losing a wallet with pictures of your beloved family and friends. It makes me feel something like melancholy.
Obviously, this does not mean that I have nothing to be grateful for. I always felt unconditional love and support from my family, and I never felt homesick, since my Spanish friends did the best they could to visit me frequently. During my time in Compiègne I met wonderful people that are now friends for life. I am sure that all these people do not need to be listed in this section to know how much I love them. I thank them every time we see each other by giving them the best of myself.
I enjoyed my time in Compiègne. It was an exciting adventure and I do not regret a single thing. I am sure that I will miss these days, but this does not make me sad because, as the Beatles sang in "The End", or Jorge Drexler in "Todo se transforma", the amount that you miss people is equal to the love you gave them and received from them.
The only names I am including are my supervisors', Yves Grandvalet and Gérard Govaert. I do not think it is possible to have had better teaching and supervision, and I am sure that the reason I finished this work was not only their technical advice, but also their close support, humanity and patience.
Contents
List of figures v
List of tables vii
Notation and Symbols ix
I Context and Foundations 1
1 Context 5
2 Regularization for Feature Selection 9
2.1 Motivations 9
2.2 Categorization of Feature Selection Techniques 11
2.3 Regularization 13
2.3.1 Important Properties 14
2.3.2 Pure Penalties 14
2.3.3 Hybrid Penalties 18
2.3.4 Mixed Penalties 19
2.3.5 Sparsity Considerations 19
2.3.6 Optimization Tools for Regularized Problems 21
II Sparse Linear Discriminant Analysis 25
Abstract 27
3 Feature Selection in Fisher Discriminant Analysis 29
3.1 Fisher Discriminant Analysis 29
3.2 Feature Selection in LDA Problems 30
3.2.1 Inertia Based 30
3.2.2 Regression Based 32
4 Formalizing the Objective 35
4.1 From Optimal Scoring to Linear Discriminant Analysis 35
4.1.1 Penalized Optimal Scoring Problem 36
4.1.2 Penalized Canonical Correlation Analysis 37
4.1.3 Penalized Linear Discriminant Analysis 39
4.1.4 Summary 40
4.2 Practicalities 41
4.2.1 Solution of the Penalized Optimal Scoring Regression 41
4.2.2 Distance Evaluation 42
4.2.3 Posterior Probability Evaluation 43
4.2.4 Graphical Representation 43
4.3 From Sparse Optimal Scoring to Sparse LDA 43
4.3.1 A Quadratic Variational Form 44
4.3.2 Group-Lasso OS as Penalized LDA 47
5 GLOSS Algorithm 49
5.1 Regression Coefficients Updates 49
5.1.1 Cholesky Decomposition 52
5.1.2 Numerical Stability 52
5.2 Score Matrix 52
5.3 Optimality Conditions 53
5.4 Active and Inactive Sets 54
5.5 Penalty Parameter 54
5.6 Options and Variants 55
5.6.1 Scaling Variables 55
5.6.2 Sparse Variant 55
5.6.3 Diagonal Variant 55
5.6.4 Elastic net and Structured Variant 55
6 Experimental Results 57
6.1 Normalization 57
6.2 Decision Thresholds 57
6.3 Simulated Data 58
6.4 Gene Expression Data 60
6.5 Correlated Data 63
Discussion 63
III Sparse Clustering Analysis 67
Abstract 69
7 Feature Selection in Mixture Models 71
7.1 Mixture Models 71
7.1.1 Model 71
7.1.2 Parameter Estimation: The EM Algorithm 72
7.2 Feature Selection in Model-Based Clustering 75
7.2.1 Based on Penalized Likelihood 76
7.2.2 Based on Model Variants 77
7.2.3 Based on Model Selection 79
8 Theoretical Foundations 81
8.1 Resolving EM with Optimal Scoring 81
8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis 81
8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis 82
8.1.3 Clustering Using Penalized Optimal Scoring 82
8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis 83
8.2 Optimized Criterion 83
8.2.1 A Bayesian Derivation 84
8.2.2 Maximum a Posteriori Estimator 85
9 Mix-GLOSS Algorithm 87
9.1 Mix-GLOSS 87
9.1.1 Outer Loop: Whole Algorithm Repetitions 87
9.1.2 Penalty Parameter Loop 88
9.1.3 Inner Loop: EM Algorithm 89
9.2 Model Selection 91
10 Experimental Results 93
10.1 Tested Clustering Algorithms 93
10.2 Results 95
10.3 Discussion 97
Conclusions 97
Appendix 103
A Matrix Properties 105
B The Penalized-OS Problem is an Eigenvector Problem 107
B.1 How to Solve the Eigenvector Decomposition 107
B.2 Why the OS Problem is Solved as an Eigenvector Problem 109
C Solving Fisherrsquos Discriminant Problem 111
D Alternative Variational Formulation for the Group-Lasso 113
D.1 Useful Properties 114
D.2 An Upper Bound on the Objective Function 115
E Invariance of the Group-Lasso to Unitary Transformations 117
F Expected Complete Likelihood and Likelihood 119
G Derivation of the M-Step Equations 121
G.1 Prior Probabilities 121
G.2 Means 122
G.3 Covariance Matrix 122
Bibliography 123
List of Figures
1.1 MASH project logo 5
2.1 Example of relevant features 10
2.2 Four key steps of feature selection 11
2.3 Admissible sets in two dimensions for different pure norms ||β||p 14
2.4 Two-dimensional regularized problems with ||β||1 and ||β||2 penalties 15
2.5 Admissible sets for the Lasso and Group-Lasso 20
2.6 Sparsity patterns for an example with 8 variables characterized by 4 parameters 20
4.1 Graphical representation of the variational approach to Group-Lasso 45
5.1 GLOSS block diagram 50
5.2 Graph and Laplacian matrix for a 3 × 3 image 56
6.1 TPR versus FPR for all simulations 60
6.2 2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA 62
6.3 USPS digits "1" and "0" 63
6.4 Discriminant direction between digits "1" and "0" 64
6.5 Sparse discriminant direction between digits "1" and "0" 64
9.1 Mix-GLOSS loops scheme 88
9.2 Mix-GLOSS model selection diagram 92
10.1 Class mean vectors for each artificial simulation 94
10.2 TPR versus FPR for all simulations 97
List of Tables
6.1 Experimental results for simulated data, supervised classification 59
6.2 Average TPR and FPR for all simulations 60
6.3 Experimental results for gene expression data, supervised classification 61
10.1 Experimental results for simulated data, unsupervised clustering 96
10.2 Average TPR versus FPR for all clustering simulations 96
Notation and Symbols
Throughout this thesis vectors are denoted by lowercase letters in bold font andmatrices by uppercase letters in bold font Unless otherwise stated vectors are columnvectors and parentheses are used to build line vectors from comma-separated lists ofscalars or to build matrices from comma-separated lists of column vectors
Sets
N the set of natural numbers, N = {1, 2, ...}
R the set of reals
|A| cardinality of a set A (for finite sets, the number of elements)
Ā complement of set A
Data
X input domain
xi input sample, xi ∈ X
X design matrix, X = (x1⊤, ..., xn⊤)⊤
xj column j of X
yi class indicator of sample i
Y indicator matrix, Y = (y1⊤, ..., yn⊤)⊤
z complete data, z = (x, y)
Gk set of the indices of observations belonging to class k
n number of examples
K number of classes
p dimension of X
i, j, k indices, running over N
Vectors Matrices and Norms
0 vector with all entries equal to zero
1 vector with all entries equal to one
I identity matrix
A⊤ transpose of matrix A (ditto for vectors)
A−1 inverse of matrix A
tr(A) trace of matrix A
|A| determinant of matrix A
diag(v) diagonal matrix with v on the diagonal
||v||1 L1 norm of vector v
||v||2 L2 norm of vector v
||A||F Frobenius norm of matrix A
Probability
E[·] expectation of a random variable
var[·] variance of a random variable
N(μ, σ²) normal distribution with mean μ and variance σ²
W(W, ν) Wishart distribution with ν degrees of freedom and scale matrix W
H(X) entropy of random variable X
I(X, Y) mutual information between random variables X and Y
Mixture Models
yik hard membership of sample i to cluster k
fk distribution function for cluster k
tik posterior probability of sample i belonging to cluster k
T posterior probability matrix
πk prior probability or mixture proportion for cluster k
μk mean vector of cluster k
Σk covariance matrix of cluster k
θk parameter vector for cluster k, θk = (μk, Σk)
θ(t) parameter vector at iteration t of the EM algorithm
f(X; θ) likelihood function
L(θ; X) log-likelihood function
LC(θ; X, Y) complete log-likelihood function
Optimization
J(·) cost function
L(·) Lagrangian
β̂ generic notation for the solution with respect to β
β̂ls least squares solution coefficient vector
A active set
γ step size to update the regularization path
h direction to update the regularization path
Penalized models
λ, λ1, λ2 penalty parameters
Pλ(θ) penalty term over a generic parameter vector
βkj coefficient j of discriminant vector k
βk kth discriminant vector, βk = (βk1, ..., βkp)⊤
B matrix of discriminant vectors, B = (β1, ..., βK−1)
βj jth row of B = (β1⊤, ..., βp⊤)⊤
BLDA coefficient matrix in the LDA domain
BCCA coefficient matrix in the CCA domain
BOS coefficient matrix in the OS domain
XLDA data matrix in the LDA domain
XCCA data matrix in the CCA domain
XOS data matrix in the OS domain
θk score vector k
Θ score matrix, Θ = (θ1, ..., θK−1)
Y label matrix
Ω penalty matrix
LCP(θ; X, Z) penalized complete log-likelihood function
ΣB between-class covariance matrix
ΣW within-class covariance matrix
ΣT total covariance matrix
Σ̂B sample between-class covariance matrix
Σ̂W sample within-class covariance matrix
Σ̂T sample total covariance matrix
Λ inverse of covariance matrix, or precision matrix
wj weights
τj penalty components of the variational approach
Part I
Context and Foundations
This thesis is divided in three parts. In Part I, I introduce the context in which this work has been developed, the project that funded it and the constraints that we had to obey. Generic foundations are also detailed here to introduce the models and some basic concepts that will be used along this document. The state of the art is also reviewed.
The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments to test its performance compared to other state-of-the-art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.
The second contribution is described in Part III, with a structure analogous to that of Part II, but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM algorithm. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.
1 Context
The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.
The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiègne.
From the point of view of the research the members of the consortium must deal withfour main goals
1 Software development of the website framework and APIs
2 Classification and goal-planning in high dimensional feature spaces
3 Interfacing the platform with the 3D virtual environment and the robot arm
4 Building tools to assist contributors with the development of the feature extractorsand the configuration of the experiments
Figure 1.1 MASH project logo
The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role has been to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community. The last number I was aware of was about 300. Among those 375 extractors, there must be some sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code on some datasets of reference in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors on a particular dataset does not mean that both are using the same variables.
Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeeded in discovering those groups, we would also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user who develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of against the whole database. As another example, imagine a user wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.
As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.
• Clustering Using Mixture Models. This is a well-known technique that models the data as if it were randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, a C++ library that can be interfaced with MATLAB. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D71-m12" (Govaert et al., 2010).
• Sparse Clustering Using Penalized Optimal Scoring. This technique again intends to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis.
All details concerning the tool implemented can be found in deliverable "mash-deliverable-D72-m24" (Govaert et al., 2011).
• Table Clustering Using The RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractor space is defined using the RV coefficient, a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(Oi, Oj), where Oi and Oj are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors. A detailed description of this technique can be found in deliverables "mash-deliverable-D71-m12" (Govaert et al., 2010) and "mash-deliverable-D72-m24" (Govaert et al., 2011).
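The EM-plus-MAP pipeline behind the first tool can be sketched in a few lines. The snippet below is a minimal, illustrative one-dimensional, two-component implementation, not the platform code (which relies on the mixmod C++ library); all names and data are hypothetical.

```python
import numpy as np

def em_gmm_1d(x, n_iter=100):
    """Minimal EM for a two-component 1-D Gaussian mixture.
    Returns mixture proportions, means, variances and MAP cluster labels."""
    n = len(x)
    pi = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])      # crude but deterministic initialization
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E-step: posterior probabilities t_ik of each sample for each cluster
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        t = pi * dens
        t /= t.sum(axis=1, keepdims=True)
        # M-step: re-estimate proportions, means and variances
        nk = t.sum(axis=0)
        pi, mu = nk / n, (t * x[:, None]).sum(axis=0) / nk
        var = (t * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var, t.argmax(axis=1)   # clusters by maximum a posteriori

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])
pi, mu, var, labels = em_gmm_1d(x)
print(np.sort(mu))                         # close to the true means [-3, 3]
```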
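Concretely, for two column-centered tables X and Y observed on the same n items, with operators Ox = XX⊤ and Oy = YY⊤, the RV coefficient is tr(OxOy) / sqrt(tr(Ox²) tr(Oy²)). A small illustrative sketch (again not the platform code, and assuming numpy):

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two data tables sharing the same rows.
    Tables are column-centered; O = X X' plays the role of the operator."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Ox, Oy = X @ X.T, Y @ Y.T              # n x n inner-product operators
    return np.trace(Ox @ Oy) / np.sqrt(np.trace(Ox @ Ox) * np.trace(Oy @ Oy))

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 5))
print(rv_coefficient(A, 2 * A + 1))        # 1.0: same information up to affine scaling
print(rv_coefficient(A, rng.normal(size=(30, 8))))   # much lower for an unrelated table
```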
I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to fulfil our commitments. I will simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).
2 Regularization for Feature Selection
With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues rose as well. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined. Many algorithms in the field of Machine Learning make use of this statistic.
2.1 Motivations
There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than random guessing of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).
As a rule of thumb, in discriminant and clustering problems, the complexity of the computations increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in the supervised environment, and eases interpretations in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.
When talking about dimensionality reduction, there are two families of techniques that could be confused:
• Reduction by feature transformation summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis and Independent Component Analysis are two popular examples.
• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem arises when there is a restriction in the number of variables to preserve and discarding the exceeding dimensions leads to a loss of information. Prediction with feature
Figure 2.1 Example of relevant features from Chidlovskii and Lecerf (2008)
selection is computationally cheaper because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.
As a basic rule, we can use reduction techniques by feature transformation when the majority of the features are relevant and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation of the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text.
"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.
Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats, but not distinguishing between dogs and cats. Thus the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the
Figure 2.2 The four key steps of feature selection according to Liu and Yu (2005)
"diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."
There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.
2.2 Categorization of Feature Selection Techniques
Feature selection is one of the most frequent techniques for preprocessing data, used to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.
I reproduce here the scheme that generalizes any feature selection process, as shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps of a feature selection algorithm.
The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a categorizing framework that integrates supervised and unsupervised feature selection algorithms. Both references are excellent reviews that characterize feature selection techniques. I propose a framework inspired by these references; it does not cover all the possibilities, but it gives a good summary of the existing ones.
• Depending on the type of integration with the machine learning algorithm, we have:
– Filter Models - The filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.
– Wrapper Models - The wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block, while
the feature subset evaluation is done in a different one. Therefore the criterion to optimize and the criterion to evaluate may be different. Those algorithms are computationally expensive.
– Embedded Models - They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation form a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test processes are needed for every variable subset investigated. However, they are less universal because they are specific to the training process of a given mining algorithm.
• Depending on the feature searching technique:
– Complete - No subsets are missed from evaluation. Involves combinatorial searches.
– Sequential - Features are added (forward searches) or removed (backward searches) one at a time.
– Random - The initial subset, or even subsequent subsets, are randomly chosen to escape local optima.
• Depending on the evaluation technique:
– Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.
– Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.
– Dependency Measures - Measuring the correlation between features.
– Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features does.
– Predictive Accuracy - Use the selected features to predict the labels.
– Cluster Goodness - Use the selected features to perform clustering and evaluate the result (cluster compactness, scatter separability, maximum likelihood).
The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness allow the evaluation of subsets of features and can be used in wrapper and embedded models.
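As an illustration of the filter paradigm with a dependency measure, the sketch below ranks features by the absolute value of their sample correlation with the response; the data and names are hypothetical and numpy is assumed.

```python
import numpy as np

def rank_features(X, y):
    """Filter-model ranking: score each feature by the absolute value of its
    sample correlation with the response (a dependency measure)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    scores = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(scores)[::-1], scores            # best features first

rng = np.random.default_rng(0)
n, p = 100, 6
X = rng.normal(size=(n, p))
y = 3 * X[:, 2] - 2 * X[:, 4] + 0.1 * rng.normal(size=n)   # only features 2 and 4 matter
order, scores = rank_features(X, y)
print(order[:2])                                       # the two informative features come first
```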
In this thesis we developed algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized
goal and the process dedicated to achieving this goal, thus avoiding many problems arising in filter or wrapper methods. In practice, however, it is intractable to solve hard selection problems exactly when the number of features exceeds a few tens. Regularization techniques make it possible to provide a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.
2.3 Regularization
In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues in ill-posed problems (like some matrix singularities when solving a linear system), besides providing other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.
An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. This happens, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:
min_β J(β) + λ P(β)    (2.1)

min_β J(β)  s.t.  P(β) ≤ t    (2.2)
In expressions (2.1) and (2.2), the parameters λ and t play a similar role, namely to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set for which the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. The penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted, in the Bayesian paradigm, as prior distributions on the parameters of the model. In this thesis both views will be taken.
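As a concrete instance of formulation (2.1), taking J as the squared error and P(β) as the squared L2 norm gives ridge regression, whose closed-form solution (X'X + λI)⁻¹X'y shows both how λ controls the trade-off and how the penalty cures an ill-posed system when p > n. A hedged numpy sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 25                       # fewer samples than features: X'X is singular
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

def ridge(X, y, lam):
    """Minimizer of ||y - X beta||^2 + lam * ||beta||_2^2,
    i.e. beta = (X'X + lam I)^(-1) X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print(np.linalg.matrix_rank(X.T @ X))   # 10 < 25: the unpenalized problem is ill-posed
beta_small = ridge(X, y, 0.1)
beta_large = ridge(X, y, 100.0)
print(np.linalg.norm(beta_large) < np.linalg.norm(beta_small))   # True: larger lam shrinks beta
```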
In this section I review the pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.
Figure 2.3 Admissible sets in two dimensions for different pure norms ||β||p
2.3.1 Important Properties
Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.
Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

∀(x1, x2) ∈ X², f(t x1 + (1 − t) x2) ≤ t f(x1) + (1 − t) f(x2)    (2.3)

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.
Sparsity. Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and fewer computational resources.
Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, besides preventing overfitting, is a means to favor the stability of the solution.
2.3.2 Pure Penalties
For pure penalties, defined as P(β) = ||β||p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and of the algorithms to solve them. In
Figure 2.4 Two-dimensional regularized problems with ||β||1 and ||β||2 penalties
this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since the convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.
Regularizing a linear model with a norm like ||β||p means that the larger the component |βj|, the more important the feature xj in the estimation. On the contrary, the closer to zero, the more dispensable it is. In the limit of |βj| = 0, xj is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.
A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered sparse if any of its components (β1 or β2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2), where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, like the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than that of an L2 penalty. This idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves, whose global minimum βls lies outside the penalties' admissible regions. The closest point to this βls for the L1 regularization is βl1, and for the L2 regularization it is βl2. The solution βl1 is sparse because its second component is zero, while both components of βl2 are different from zero.
After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the quantity and the "sharpness" of the vertexes of the greyed out area. For example, an L1/3 penalty has an admissible region with sharper vertexes that would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L1/3 region results in difficulties during optimization that will not happen with a convex shape.
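The sparsity induced by the L1 geometry can be checked numerically. Below is a minimal coordinate-descent solver for the L1-penalized least-squares problem, where each coordinate update is a soft-thresholding step; this is an illustrative sketch on made-up data, not the algorithm developed in this thesis.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for min_beta 0.5 * ||y - X beta||^2 + lam * ||beta||_1.
    Each coordinate update applies the soft-thresholding operator."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]     # partial residual without feature j
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return beta

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:3] = [2.0, -1.5, 1.0]                     # only three relevant features
y = X @ true_beta + 0.1 * rng.normal(size=n)
beta_l1 = lasso_cd(X, y, lam=8.0)
print((beta_l1 != 0).sum())                          # only a few coordinates survive
```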
To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with Lp norms with p ≤ 1, due to the fact that these are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex; hence the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.
L0 Penalties. The L0 pseudo-norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ||β||0 = card{βj : βj ≠ 0}:
min_β J(β)  s.t.  ||β||0 ≤ t    (2.4)
where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ, if we use the equivalent expression in (2.1)), the fewer zeros are induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term has no effect and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes; their solutions are sparse but unstable.
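To make this combinatorial nature concrete, here is a small illustrative sketch (not taken from the thesis; function name and synthetic data are mine) that solves the L0-constrained least squares problem (2.4) by exhaustively enumerating all supports of size at most t:

```python
import itertools
import numpy as np

def l0_least_squares(X, y, t):
    """Solve min ||y - X beta||^2 s.t. ||beta||_0 <= t by exhaustive support search."""
    n, p = X.shape
    best_beta, best_loss = np.zeros(p), np.sum(y ** 2)   # start from the empty support
    for k in range(1, t + 1):
        for support in itertools.combinations(range(p), k):
            cols = list(support)
            # least squares restricted to the candidate support
            coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            loss = np.sum((y - X[:, cols] @ coef) ** 2)
            if loss < best_loss:
                best_loss = loss
                best_beta = np.zeros(p)
                best_beta[cols] = coef
    return best_beta

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 6))
beta_true = np.array([3.0, 0.0, -2.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.01 * rng.standard_normal(50)
beta_hat = l0_least_squares(X, y, t=2)
```

The cost grows as the sum of the binomial coefficients C(p, k), which is why exact L0 solutions are intractable beyond small p.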
L1 Penalties Penalties built using the L1 norm induce sparsity and stability. The resulting estimator has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

$$\min_{\beta} J(\beta) \quad \mathrm{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le t \qquad (2.5)$$
Despite all the advantages of the Lasso, choosing the right penalty is not just a question of convexity and sparsity. Concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, the maximum number of non-zero entries of β is n. Therefore, if several variables are strongly correlated, this penalty risks dismissing all but one of them, resulting in a hardly interpretable model. In a field like genomics, where n is typically a few tens of individuals and p several thousands of genes, both the performance of the algorithm and the interpretability of the genetic relationships are severely limited.
The Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012a,b).
The consistency of problems regularized by a Lasso penalty is also a key feature. Defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large, Leng et al. (2006) have shown that, when the penalty parameter (t or λ, depending on the formulation) is chosen by
2.3 Regularization
minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions under which Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Bühlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).
L2 Penalties The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertices. Strictly speaking, the L2 norm involves the square root of the sum of all squared components; in practice, the square of the norm is used, to avoid the square root and to solve a linear system. Thus, an L2 penalized optimization problem reads

$$\min_{\beta} J(\beta) + \lambda \|\beta\|_2^2 \qquad (2.6)$$
The effect of this penalty is an "equalization" of the components of the penalized parameter vector. To illustrate this property, let us consider a least squares problem

$$\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 \qquad (2.7)$$
with solution βls = (XᵀX)⁻¹Xᵀy. If some input variables are highly correlated, the estimator βls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

$$\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$
The solution to this problem is βl2 = (XᵀX + λIp)⁻¹Xᵀy. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performance.
As with the Lasso, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrote, which looks like a ridge regression where each variable is penalized adaptively; the least squares solution is used to define the penalty parameter attached to each coefficient:

$$\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{\beta_j^2}{(\beta_j^{ls})^2} \qquad (2.8)$$
The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002),
where the penalty parameter differs for each component: every λj is optimized to penalize more or less, depending on the influence of βj in the model.
Although L2 penalized problems are stable, they are not sparse, which makes these models harder to interpret, mainly in high dimensions.
L∞ Penalties A special case of Lp norms is the infinity norm, defined as ‖x‖∞ = max(|x1|, |x2|, ..., |xp|). The admissible region for a penalty like ‖β‖∞ ≤ t is displayed in Figure 2.3: for the L∞ norm, the greyed out region is a square containing all the β vectors whose largest coefficient is less than or equal to the penalty parameter t.
This norm is not commonly used as a regularization term itself; however, it frequently appears as a component of the mixed penalties presented in Section 2.3.4. In addition, in the optimization of penalized problems, there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing the optimality conditions of sparse regularized problems. The dual norm ‖β‖* of a norm ‖β‖ is defined as

$$\|\beta\|_* = \max_{w \in \mathbb{R}^p} \beta^\top w \quad \mathrm{s.t.} \quad \|w\| \le 1$$
In the case of an Lq norm with q ∈ [1, +∞], the dual norm is the Lr norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual, and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why L∞ is so important even though it is not popular as a penalty itself: L1 is. An extensive explanation of dual norms and the algorithms that make use of them can be found in Bach et al. (2011).
2.3.3 Hybrid Penalties
There is no reason for using pure penalties in isolation; we can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), whose objective is to improve the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is

$$\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \qquad (2.9)$$
The term in λ1 is a Lasso penalty that induces sparsity in vector β; the term in λ2 is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotic capability (when n goes to infinity) of always making the right choice of relevant variables.
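As a side remark not made in the text, the interplay of the two terms is visible in the proximal operator of the elastic-net penalty, which (by a standard computation) is a Lasso soft-threshold followed by a ridge shrinkage; a sketch of mine, with made-up values:

```python
import numpy as np

def soft_threshold(v, thr):
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def prox_elastic_net(v, step, lam1, lam2):
    """Prox of step*(lam1*||.||_1 + lam2*||.||_2^2):
    soft-threshold (Lasso part), then shrink (ridge part)."""
    return soft_threshold(v, step * lam1) / (1.0 + 2.0 * step * lam2)

v = np.array([3.0, -0.2, 0.5])
out = prox_elastic_net(v, step=1.0, lam1=0.5, lam2=0.25)
```

Small components are zeroed exactly (the L1 effect) while the surviving ones are uniformly shrunk (the L2 "equalization" effect).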
2.3.4 Mixed Penalties
Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us denote by Gℓ the group of genes for the ℓ-th process and by dℓ the number of genes (variables) in each group, for all ℓ ∈ {1, ..., L}. Thus, the dimension of vector β is the sum of the numbers of genes of all groups, dim(β) = Σℓ dℓ. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

$$\|\beta\|_{(r,s)} = \left( \sum_{\ell} \Big( \sum_{j \in G_\ell} |\beta_j|^s \Big)^{\frac{r}{s}} \right)^{\frac{1}{r}} \qquad (2.10)$$
The pair (r, s) identifies the norms that are combined: an Ls norm within groups and an Lr norm between groups. The Ls norm penalizes the variables in every group Gℓ, while the Lr norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
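A direct transcription of (2.10) into NumPy (an illustration of mine; the group structure and values are made up):

```python
import numpy as np

def mixed_norm(beta, groups, r=1.0, s=2.0):
    """||beta||_(r,s): an L_s norm inside each group, an L_r norm across groups."""
    within = np.array([np.sum(np.abs(beta[g]) ** s) ** (1.0 / s) for g in groups])
    return np.sum(within ** r) ** (1.0 / r)

groups = [np.array([0, 1]), np.array([2, 3])]   # two groups of two variables
beta_grouped = np.array([3.0, 4.0, 0.0, 0.0])   # support concentrated in one group
beta_spread = np.array([3.0, 0.0, 4.0, 0.0])    # support spread over both groups

norm_grouped = mixed_norm(beta_grouped, groups, r=1.0, s=2.0)  # group-Lasso norm
norm_spread = mixed_norm(beta_spread, groups, r=1.0, s=2.0)
```

The (1,2) norm is smaller when the support is concentrated in few groups (5 versus 7 here), which is exactly why it favors group-wise sparsity.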
Several combinations are available; the most popular is the norm ‖β‖(1,2), known as the group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L(1,2) norm. Many other mixings are possible, such as ‖β‖(1,4/3) (Szafranski et al., 2008) or ‖β‖(1,∞) (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms, such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge (Ng and Abugharbieh, 2011).
2.3.5 Sparsity Considerations
In this chapter I have reviewed several possibilities for inducing sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to feature-wise parsimonious models. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.
The Lasso and the other L1 penalties encourage solutions such as the one on the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection, and if the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.
To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown on the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L(1,2) or L(1,∞) mixed norms, with the proper definition of groups, can induce sparsity patterns such as
Figure 2.5: Admissible sets for the Lasso (a: L1) and the group-Lasso (b: L(1,2)).
Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters. (a) L1-induced sparsity; (b) L(1,2) group-induced sparsity.
the one on the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.
2.3.6 Optimization Tools for Regularized Problems
Caramanis et al. (2012) present a good collection of mathematical techniques and optimization methods for solving regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified into four categories. These techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.
In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an "active constraints" algorithm, implemented by following a regularization path that is updated by approximating the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.
Subgradient Descent Subgradient descent is a generic optimization method that can be used in penalized settings where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure; on the other hand, many iterations are needed, so convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β(t+1) is updated proportionally to the negative subgradient of the function at the current point β(t):

$$\beta^{(t+1)} = \beta^{(t)} - \alpha\, (s + \lambda s'), \quad \text{where } s \in \partial J(\beta^{(t)}),\; s' \in \partial P(\beta^{(t)})$$
Coordinate Descent Coordinate descent is based on the first order optimality conditions of criterion (2.1). In the case of penalties like the Lasso, setting to zero the first order derivative with respect to coefficient βj gives

$$\beta_j = \frac{-\lambda\, \mathrm{sign}(\beta_j) - \partial J(\beta)/\partial \beta_j}{2 \sum_{i=1}^{n} x_{ij}^2}$$
In the literature, those algorithms are also referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution βls and updating their values with an iterative thresholding algorithm where β_j^{(t+1)} = S_λ(∂J(β^{(t)})/∂β_j). The objective function is optimized with respect to one variable at a time, while all others are kept fixed:

$$S_\lambda\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) =
\begin{cases}
\dfrac{\lambda - \partial J(\beta)/\partial \beta_j}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \partial J(\beta)/\partial \beta_j > \lambda \\[2ex]
\dfrac{-\lambda - \partial J(\beta)/\partial \beta_j}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \partial J(\beta)/\partial \beta_j < -\lambda \\[2ex]
0 & \text{if } \left|\partial J(\beta)/\partial \beta_j\right| \le \lambda
\end{cases} \qquad (2.11)$$
The same principles define "block-coordinate descent" algorithms, where first order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).
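The coordinate-wise update can be written compactly for the Lasso criterion Σᵢ(yᵢ − xᵢᵀβ)² + λ‖β‖₁: minimizing over βj alone gives βj = S_{λ/2}(xjᵀr)/Σᵢx²ᵢⱼ, with r the partial residual excluding variable j. The sketch below (my own, not Fu's implementation) cycles this soft-thresholding update over synthetic data:

```python
import numpy as np

def soft(z, thr):
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

def cd_lasso(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for sum_i (y_i - x_i' beta)^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]       # partial residual without j
            beta[j] = soft(X[:, j] @ r, lam / 2.0) / col_sq[j]
    return beta

rng = np.random.default_rng(4)
X = rng.standard_normal((60, 8))
beta_true = np.zeros(8)
beta_true[[0, 3]] = [1.5, -2.0]
y = X @ beta_true + 0.1 * rng.standard_normal(60)
beta_hat = cd_lasso(X, y, lam=20.0)
```

Unlike subgradient descent, the soft-threshold sets irrelevant coefficients to exactly zero.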
Active and Inactive Sets Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", usually denoted A, which stores the indices of the variables with non-zero βj. The complement of the active set is the "inactive set", noted Ā, which contains the indices of the variables whose βj is zero. Thus, the problem can be reduced to the dimensionality of A.
Osborne et al. (2000a) proposed the first of those algorithms, to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There also exists a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy, which starts with an empty A, has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are expected to be selected.
Working set algorithms have to deal with three main tasks. First, there is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the descent direction of the objective function, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions; their expressions are essential to select the next variable to add to the active set, and to test whether a particular vector β is a solution of Problem (2.1).
These active constraints or working set methods, even if they were originally proposed to solve L1 regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and L1 penalties (Roth, 2004), linear functions
and L(1,2) penalties (Roth and Fischer, 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al., 2003). The algorithm developed in this work belongs to this family of solutions.
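The three tasks described above — restricted optimization, working set update by the most violating variable, and the optimality (KKT) check — can be sketched as follows for the Lasso, reusing a coordinate-wise soft-threshold as the inner solver. This is an illustration of the forward-growing scheme, not the thesis algorithm:

```python
import numpy as np

def soft(z, thr):
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

def cd_on_set(X, y, beta, active, lam, iters=100):
    """Task 1: optimize over the variables of the active set only."""
    for _ in range(iters):
        for j in active:
            r = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft(X[:, j] @ r, lam / 2.0) / (X[:, j] @ X[:, j])
    return beta

def working_set_lasso(X, y, lam, max_outer=20):
    n, p = X.shape
    beta, active = np.zeros(p), []
    for _ in range(max_outer):
        # Task 3: optimality conditions -- zero coefficients need |dJ/dbeta_j| <= lam
        grad = -2.0 * X.T @ (y - X @ beta)
        viol = np.abs(grad)
        viol[active] = 0.0
        j = int(np.argmax(viol))
        if viol[j] <= lam:        # KKT conditions hold: beta is a solution
            break
        active.append(j)          # Task 2: grow the working set (forward step)
        beta = cd_on_set(X, y, beta, active, lam)
    return beta, active

rng = np.random.default_rng(5)
X = rng.standard_normal((60, 10))
beta_true = np.zeros(10)
beta_true[[2, 7]] = [2.0, -1.0]
y = X @ beta_true
beta_hat, active = working_set_lasso(X, y, lam=10.0)
```

The warm start is implicit here: `beta` retains its previous values each time the working set grows, so only the newly added coordinate starts from zero.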
Hyper-Planes Approximation Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built from several secant hyper-planes at different points, obtained from the subgradient of the cost function at those points.
This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.
This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims, 2006; Smola et al., 2008; Franc and Sonnenburg, 2008) or Multiple Kernel Learning (Sonnenburg et al., 2006).
Regularization Path The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1) where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice versa).
This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the Least Angle Regression (LARS) algorithm, developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.
Once an active set A(t) and its corresponding solution β(t) have been set, following the regularization path means looking for a direction h and a step size γ to update the solution as β(t+1) = β(t) + γh. Afterwards, the active and inactive sets A(t+1) and Ā(t+1) are updated; that can be done by looking for the variables that most strongly violate the optimality conditions. Hence LARS sets the update step size, and decides which variable should enter the active set, from the correlation with the residuals.
Proximal Methods Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz-differentiable cost function J(β) and a non-differentiable penalty λP(β):

$$\min_{\beta \in \mathbb{R}^p} \; J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2} \left\| \beta - \beta^{(t)} \right\|_2^2 \qquad (2.12)$$
They are also iterative methods, where the cost function J(β) is linearized in the proximity of the current solution β(t), so that the problem to solve at each iteration looks like
(2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. That can be rewritten as

$$\min_{\beta \in \mathbb{R}^p} \; \frac{1}{2} \left\| \beta - \left( \beta^{(t)} - \frac{1}{L} \nabla J(\beta^{(t)}) \right) \right\|_2^2 + \frac{\lambda}{L} P(\beta) \qquad (2.13)$$
The basic algorithm uses the solution of (2.13) as the next iterate β(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle, 2009). Proximal methods can be seen as generalizations of gradient updates: in fact, setting λ = 0 in equation (2.13) recovers the standard gradient update rule.
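For P(β) = ‖β‖₁, the solution of (2.13) is a component-wise soft-threshold, which yields the basic ISTA iteration; a self-contained sketch of mine (synthetic data, my own parameter choices) follows:

```python
import numpy as np

def ista(X, y, lam, iters=500):
    """ISTA: gradient step on J(beta) = ||y - X beta||^2, then the L1 prox."""
    L = 2.0 * np.linalg.norm(X, ord=2) ** 2    # upper bound on the Lipschitz constant of grad J
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = -2.0 * X.T @ (y - X @ beta)
        v = beta - grad / L                    # plain gradient step
        beta = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)  # soft-threshold prox
    return beta

rng = np.random.default_rng(6)
X = rng.standard_normal((50, 6))
beta_true = np.array([1.0, 0.0, 0.0, -2.0, 0.0, 0.0])
y = X @ beta_true
beta_hat = ista(X, y, lam=10.0)
```

With lam = 0 the prox is the identity and the iteration reduces to gradient descent, as noted above.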
Part II
Sparse Linear Discriminant Analysis
Abstract
Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.
There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Section 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee variable-wise parsimonious models.
In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach to LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis and also enables variants to be derived, such as an LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of the data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification; the first two or three directions can also be used to project the data and generate a graphical display. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performance. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.
3 Feature Selection in Fisher Discriminant Analysis
3.1 Fisher Discriminant Analysis
Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates the classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of the data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part, we focus on Fisher's discriminant analysis, a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities but rather on some inertia principles (Fisher, 1936).
We consider that the data consist of a set of n examples, with observations xi ∈ Rp comprising p features, and labels yi ∈ {0, 1}^K indicating the exclusive assignment of observation xi to one of the K classes. It will be convenient to gather the observations in the n×p matrix X = (x1ᵀ, ..., xnᵀ)ᵀ and the corresponding labels in the n×K matrix Y = (y1ᵀ, ..., ynᵀ)ᵀ.
Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

$$\max_{\beta \in \mathbb{R}^p} \; \frac{\beta^\top \Sigma_B \beta}{\beta^\top \Sigma_W \beta} \qquad (3.1)$$
where β is the discriminant direction used to project the data, and ΣB and ΣW are the p×p between-class and within-class covariance matrices, respectively, defined (for a K-class problem) as

$$\Sigma_W = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (x_i - \mu_k)(x_i - \mu_k)^\top$$

$$\Sigma_B = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (\mu - \mu_k)(\mu - \mu_k)^\top$$

where μ is the sample mean of the whole dataset, μk the sample mean of class k, and Gk indexes the observations of class k.
This analysis can be extended to the multi-class framework with K groups, in which case K−1 discriminant vectors βk may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

$$\max_{B \in \mathbb{R}^{p \times (K-1)}} \; \frac{\mathrm{tr}\left( B^\top \Sigma_B B \right)}{\mathrm{tr}\left( B^\top \Sigma_W B \right)} \qquad (3.2)$$
where the matrix B is built with the discriminant directions βk as columns. Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K−1 subproblems:

$$\begin{aligned} \max_{\beta_k \in \mathbb{R}^p} \quad & \beta_k^\top \Sigma_B \beta_k \\ \mathrm{s.t.} \quad & \beta_k^\top \Sigma_W \beta_k \le 1 \\ & \beta_k^\top \Sigma_W \beta_\ell = 0 \quad \forall \ell < k \end{aligned} \qquad (3.3)$$
The maximizer of subproblem k is the eigenvector of Σ_W⁻¹Σ_B associated with the k-th largest eigenvalue (see Appendix C).
3.2 Feature Selection in LDA Problems
LDA is often used as a data reduction technique, where the K−1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.
Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that involve only a few variables. The main target of this sparsity is to reduce the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness of the solution, or computational constraints.
The easiest approach to sparse LDA performs variable selection before discrimination. The relevance of each feature is usually assessed with univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to endow LDA with wrapper and embedded feature selection capabilities.
They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension: either Fisher's discriminant analysis (variance-based) or a regression-based formulation.
3.2.1 Inertia Based
The Fisher discriminant seeks a projection maximizing the separability of classes based on inertia principles: mass centers should be far away from each other (large between-class variance), and
classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.
Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification where sparsity originates in a hard cardinality constraint. The formalization is based on Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search, with some eigenvalue properties being used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al., 2007).
Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where Fisher's discriminant (3.1) is solved as

$$\begin{aligned} \min_{\beta \in \mathbb{R}^p} \quad & \beta^\top \Sigma_W \beta \\ \mathrm{s.t.} \quad & (\mu_1 - \mu_2)^\top \beta = 1 \\ & \textstyle\sum_{j=1}^{p} |\beta_j| \le t \end{aligned}$$
where μ1 and μ2 are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.
Witten and Tibshirani (2011) describe a multi-class technique using Fisher's discriminant, rewritten in the form of K−1 constrained and penalized maximization problems:

$$\begin{aligned} \max_{\beta_k \in \mathbb{R}^p} \quad & \beta_k^\top \Sigma_B^k \beta_k - P_k(\beta_k) \\ \mathrm{s.t.} \quad & \beta_k^\top \Sigma_W \beta_k \le 1 \end{aligned}$$
The term to maximize is the projected between-class covariance βkᵀΣBβk, subject to an upper bound on the projected within-class covariance βkᵀΣWβk. The penalty Pk(βk) is added to avoid singularities and to induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general purpose data: the Lasso shrinks the less informative variables to zero, and the fused Lasso encourages a piecewise constant βk vector. The R code is available from the website of Daniela Witten.
Cai and Liu (2011) use Fisher's discriminant to solve a binary LDA problem. But instead of performing separate estimations of ΣW and (μ1 − μ2) to obtain the optimal solution β = Σ_W⁻¹(μ1 − μ2), they estimate the product directly, through constrained L1 minimization:

$$\begin{aligned} \min_{\beta \in \mathbb{R}^p} \quad & \|\beta\|_1 \\ \mathrm{s.t.} \quad & \left\| \Sigma \beta - (\mu_1 - \mu_2) \right\|_\infty \le \lambda \end{aligned}$$
Sparsity is encouraged by the L1 norm of vector β, and the parameter λ is used to tune the optimization.
Most of the algorithms reviewed are conceived for binary classification. And among those designed for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.
3.2.2 Regression Based
In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case (Duda et al., 2000; Friedman et al., 2009).
Predefined Indicator Matrix
Multi-class classification is usually linked to linear regression through the definition of an indicator matrix (Friedman et al., 2009). An indicator matrix Y is an n×K matrix holding the class labels of all samples. Several well-known types exist in the literature. For example, the binary or dummy indicator (yik = 1 if sample i belongs to class k, and yik = 0 otherwise) is commonly used to link multi-class classification with linear regression (Friedman et al., 2009). Another popular choice is yik = 1 if sample i belongs to class k, and yik = −1/(K−1) otherwise; it was used, for example, in extending Support Vector Machines to multi-class classification (Lee et al., 2004) and in generalizing the kernel target alignment measure (Guermeur et al., 2004).
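Both indicator conventions are easy to construct; a small sketch (my own helper, with a "symmetric" label for the ±1/(K−1) variant) follows:

```python
import numpy as np

def indicator_matrix(labels, K, kind="dummy"):
    """Build an n x K class indicator matrix Y from integer labels in {0, ..., K-1}."""
    n = len(labels)
    if kind == "dummy":                     # y_ik = 1 if i in class k, else 0
        Y = np.zeros((n, K))
    else:                                   # y_ik = 1 if i in class k, else -1/(K-1)
        Y = np.full((n, K), -1.0 / (K - 1))
    Y[np.arange(n), labels] = 1.0
    return Y

labels = np.array([0, 2, 1, 2])
Y_dummy = indicator_matrix(labels, K=3)
Y_sym = indicator_matrix(labels, K=3, kind="symmetric")
```

Each row of the dummy indicator sums to 1, while each row of the ±1/(K−1) variant sums to 0.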
Some efforts propose a formulation of the least squares problem based on a new class indicator matrix (Ye, 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition, shown empirically to hold in many applications involving high-dimensional data.
Qiao et al. (2009) propose a discriminant analysis for the high-dimensional, low-sample setting, which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to optimal scoring regression; some rather clumsy steps in the developments hinder the comparison, so that further investigations are required. The lack of publicly available code also restrained an empirical test of this conjecture. If the similitude were confirmed, their formalization would be very close to that of Clemmensen et al. (2011), reviewed in the following section.
In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is
obtained by solving

$$\min_{\beta \in \mathbb{R}^p,\, \beta_0 \in \mathbb{R}} \; n^{-1} \sum_{i=1}^{n} (y_i - \beta_0 - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$
where yi is the binary label indicator of pattern xi. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule xᵀβ + β0 > 0 is the LDA classifier when it is built using the β vector resulting for λ = 0, but a different intercept β0 is required.
Optimal Scoring
In binary classification, the regression of (scaled) class indicators enables exact recovery of the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al., 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification, and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the class indicators together with the discriminant functions. The approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani, 1996) or more conservative ones (Hastie et al., 1995).
As an alternative method for solving LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem through a positive-definite penalty matrix $\boldsymbol{\Omega}$, leading to a problem expressed in compact form as
\[
\begin{aligned}
\min_{\boldsymbol{\Theta},\, \mathbf{B}} \;\; & \|\mathbf{Y}\boldsymbol{\Theta} - \mathbf{X}\mathbf{B}\|_F^2 + \lambda \operatorname{tr}\!\left( \mathbf{B}^\top \boldsymbol{\Omega} \mathbf{B} \right) & \text{(3.4a)} \\
\text{s.t.} \;\; & n^{-1}\, \boldsymbol{\Theta}^\top \mathbf{Y}^\top \mathbf{Y} \boldsymbol{\Theta} = \mathbf{I}_{K-1} \enspace, & \text{(3.4b)}
\end{aligned}
\]
where $\boldsymbol{\Theta} \in \mathbb{R}^{K \times (K-1)}$ are the class scores, $\mathbf{B} \in \mathbb{R}^{p \times (K-1)}$ are the regression coefficients, and $\|\cdot\|_F$ is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of $K-1$ problems:
\[
\begin{aligned}
\min_{\boldsymbol{\theta}_k \in \mathbb{R}^K,\, \boldsymbol{\beta}_k \in \mathbb{R}^p} \;\; & \|\mathbf{Y}\boldsymbol{\theta}_k - \mathbf{X}\boldsymbol{\beta}_k\|^2 + \boldsymbol{\beta}_k^\top \boldsymbol{\Omega} \boldsymbol{\beta}_k & \text{(3.5a)} \\
\text{s.t.} \;\; & n^{-1}\, \boldsymbol{\theta}_k^\top \mathbf{Y}^\top \mathbf{Y} \boldsymbol{\theta}_k = 1 & \text{(3.5b)} \\
& \boldsymbol{\theta}_k^\top \mathbf{Y}^\top \mathbf{Y} \boldsymbol{\theta}_\ell = 0 \enspace, \quad \ell = 1, \ldots, k-1 \enspace, & \text{(3.5c)}
\end{aligned}
\]
where each βk corresponds to a discriminant direction
33
3 Feature Selection in Fisher Discriminant Analysis
Several sparse LDA formulations have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan, 2005; Leng, 2008; Grosenick et al., 2008; Clemmensen et al., 2011). Grosenick et al. (2008) proposed a variant of the Lasso-based penalized OS of Ghosh and Chinnaiyan (2005) by introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by
\[
\min_{\boldsymbol{\beta}_k \in \mathbb{R}^p,\, \boldsymbol{\theta}_k \in \mathbb{R}^K} \; \sum_k \|\mathbf{Y}\boldsymbol{\theta}_k - \mathbf{X}\boldsymbol{\beta}_k\|_2^2 + \lambda_1 \|\boldsymbol{\beta}_k\|_1 + \lambda_2\, \boldsymbol{\beta}_k^\top \boldsymbol{\Omega} \boldsymbol{\beta}_k \enspace,
\]
where $\lambda_1$ and $\lambda_2$ are regularization parameters and $\boldsymbol{\Omega}$ is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.
Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):
\[
\min_{\boldsymbol{\beta}_k \in \mathbb{R}^p,\, \boldsymbol{\theta}_k \in \mathbb{R}^K} \; \sum_{k=1}^{K-1} \|\mathbf{Y}\boldsymbol{\theta}_k - \mathbf{X}\boldsymbol{\beta}_k\|_2^2 + \lambda \sum_{j=1}^p \left( \sum_{k=1}^{K-1} \beta_{kj}^2 \right)^{\frac{1}{2}} \enspace, \tag{3.6}
\]
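To make the row-group structure of this penalty concrete, here is a small illustrative sketch (NumPy, with a hypothetical helper name, not part of Leng's or our code) that computes it as the sum of the Euclidean norms of the rows of the $p \times (K-1)$ coefficient matrix, one group per feature:

```python
import numpy as np

def group_lasso_penalty(B, lam=1.0):
    """Group-Lasso penalty of (3.6): lam * sum_j ||beta^j||_2,
    where beta^j is the j-th row of B (one group per feature)."""
    return lam * np.sum(np.linalg.norm(B, axis=1))

# A row-sparse coefficient matrix: the second and fourth features are zeroed
# in *all* discriminant directions, hence truly removed from the model.
B = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 0.0],
              [0.0, 0.0]])
print(group_lasso_penalty(B))  # 5.0 + 0.0 + 1.0 + 0.0 = 6.0
```

The penalty vanishes on a whole row at once, which is precisely what yields the feature-level sparsity discussed in the text.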
which is the criterion that was chosen in this thesis.
The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here we formally link (3.6) to penalized LDA and propose a publicly available efficient code for solving this problem.
4 Formalizing the Objective
In this chapter we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis, and also makes it possible to derive variants, such as an LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004).
The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), that selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of data. For $K$ classes, this representation can be either complete, in dimension $K-1$, or partial, for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.
The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka, 1984; Hastie et al., 1994; Hastie and Tibshirani, 1996; Hastie et al., 1995) and already used before for sparsity-inducing penalties (Roth and Lange, 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by group-Lasso and penalized LDA (Sanchez Merchante et al., 2012).
4.1 From Optimal Scoring to Linear Discriminant Analysis
Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) problems and in penalized LDA (p-LDA) problems, by going through canonical correlation analysis. We first provide some properties of the solutions of an arbitrary problem in the p-OS series (3.5).
Throughout this chapter we assume that:

- there is no empty class, that is, the diagonal matrix $\mathbf{Y}^\top\mathbf{Y}$ is full rank;
- inputs are centered, that is, $\mathbf{X}^\top \mathbf{1}_n = \mathbf{0}$;
- the quadratic penalty $\boldsymbol{\Omega}$ is positive semidefinite and such that $\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega}$ is full rank.
4.1.1 Penalized Optimal Scoring Problem
For the sake of simplicity, we now drop subscript $k$ to refer to any problem in the p-OS series (3.5). First, note that Problems (3.5) are biconvex in $(\boldsymbol{\theta}, \boldsymbol{\beta})$, that is, convex in $\boldsymbol{\theta}$ for each $\boldsymbol{\beta}$ value and vice-versa. The problems are however non-convex: in particular, if $(\boldsymbol{\theta}, \boldsymbol{\beta})$ is a solution, then $(-\boldsymbol{\theta}, -\boldsymbol{\beta})$ is also a solution.
The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to $K$, since we assumed that there are no empty classes. Moreover, as $\mathbf{X}$ is centered, the $K-1$ first optimal scores are orthogonal to $\mathbf{1}$ (and the $K$th problem would be solved by $\boldsymbol{\beta}_K = \mathbf{0}$). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we no longer mention these orthogonality constraints (3.5c), which apply along the route, so as to simplify all expressions. The generic problem solved is thus
\[
\begin{aligned}
\min_{\boldsymbol{\theta} \in \mathbb{R}^K,\, \boldsymbol{\beta} \in \mathbb{R}^p} \;\; & \|\mathbf{Y}\boldsymbol{\theta} - \mathbf{X}\boldsymbol{\beta}\|^2 + \boldsymbol{\beta}^\top \boldsymbol{\Omega} \boldsymbol{\beta} & \text{(4.1a)} \\
\text{s.t.} \;\; & n^{-1}\, \boldsymbol{\theta}^\top \mathbf{Y}^\top \mathbf{Y} \boldsymbol{\theta} = 1 \enspace. & \text{(4.1b)}
\end{aligned}
\]
For a given score vector $\boldsymbol{\theta}$, the discriminant direction $\boldsymbol{\beta}$ that minimizes the p-OS criterion (4.1) is the penalized least squares estimator
\[
\boldsymbol{\beta}_{\mathrm{os}} = \left( \mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega} \right)^{-1} \mathbf{X}^\top \mathbf{Y} \boldsymbol{\theta} \enspace. \tag{4.2}
\]
The objective function (4.1a) is then
\[
\begin{aligned}
\|\mathbf{Y}\boldsymbol{\theta} - \mathbf{X}\boldsymbol{\beta}_{\mathrm{os}}\|^2 + \boldsymbol{\beta}_{\mathrm{os}}^\top \boldsymbol{\Omega} \boldsymbol{\beta}_{\mathrm{os}}
&= \boldsymbol{\theta}^\top \mathbf{Y}^\top \mathbf{Y} \boldsymbol{\theta} - 2\, \boldsymbol{\theta}^\top \mathbf{Y}^\top \mathbf{X} \boldsymbol{\beta}_{\mathrm{os}} + \boldsymbol{\beta}_{\mathrm{os}}^\top \left( \mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega} \right) \boldsymbol{\beta}_{\mathrm{os}} \\
&= \boldsymbol{\theta}^\top \mathbf{Y}^\top \mathbf{Y} \boldsymbol{\theta} - \boldsymbol{\theta}^\top \mathbf{Y}^\top \mathbf{X} \left( \mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega} \right)^{-1} \mathbf{X}^\top \mathbf{Y} \boldsymbol{\theta} \enspace,
\end{aligned}
\]
where the second line stems from the definition of $\boldsymbol{\beta}_{\mathrm{os}}$ (4.2). Now, using the fact that the optimal $\boldsymbol{\theta}$ obeys constraint (4.1b), the optimization problem is equivalent to
\[
\max_{\boldsymbol{\theta}\,:\; n^{-1} \boldsymbol{\theta}^\top \mathbf{Y}^\top \mathbf{Y} \boldsymbol{\theta} = 1} \; \boldsymbol{\theta}^\top \mathbf{Y}^\top \mathbf{X} \left( \mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega} \right)^{-1} \mathbf{X}^\top \mathbf{Y} \boldsymbol{\theta} \enspace, \tag{4.3}
\]
which shows that the optimization of the p-OS problem with respect to $\boldsymbol{\theta}_k$ boils down to finding the $k$th largest eigenvector of $\mathbf{Y}^\top \mathbf{X} \left( \mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega} \right)^{-1} \mathbf{X}^\top \mathbf{Y}$. Indeed, Appendix C details that Problem (4.3) is solved by
\[
(\mathbf{Y}^\top\mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X} \left( \mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega} \right)^{-1} \mathbf{X}^\top \mathbf{Y} \boldsymbol{\theta} = \alpha^2 \boldsymbol{\theta} \enspace, \tag{4.4}
\]
where $\alpha^2$ is the maximal eigenvalue:¹
\[
\begin{aligned}
n^{-1} \boldsymbol{\theta}^\top \mathbf{Y}^\top \mathbf{X} \left( \mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega} \right)^{-1} \mathbf{X}^\top \mathbf{Y} \boldsymbol{\theta} &= \alpha^2\, n^{-1} \boldsymbol{\theta}^\top (\mathbf{Y}^\top\mathbf{Y}) \boldsymbol{\theta} \\
n^{-1} \boldsymbol{\theta}^\top \mathbf{Y}^\top \mathbf{X} \left( \mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega} \right)^{-1} \mathbf{X}^\top \mathbf{Y} \boldsymbol{\theta} &= \alpha^2 \enspace.
\end{aligned} \tag{4.5}
\]
4.1.2 Penalized Canonical Correlation Analysis
As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables $\mathbf{X}$ and $\mathbf{Y}$ is defined as follows:
\[
\begin{aligned}
\max_{\boldsymbol{\theta} \in \mathbb{R}^K,\, \boldsymbol{\beta} \in \mathbb{R}^p} \;\; & n^{-1} \boldsymbol{\theta}^\top \mathbf{Y}^\top \mathbf{X} \boldsymbol{\beta} & \text{(4.6a)} \\
\text{s.t.} \;\; & n^{-1}\, \boldsymbol{\theta}^\top \mathbf{Y}^\top \mathbf{Y} \boldsymbol{\theta} = 1 & \text{(4.6b)} \\
& n^{-1}\, \boldsymbol{\beta}^\top \left( \mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega} \right) \boldsymbol{\beta} = 1 \enspace. & \text{(4.6c)}
\end{aligned}
\]
The solutions to (4.6) are obtained by finding the saddle points of the Lagrangian:
\[
\begin{aligned}
n L(\boldsymbol{\beta}, \boldsymbol{\theta}, \nu, \gamma) &= \boldsymbol{\theta}^\top \mathbf{Y}^\top \mathbf{X} \boldsymbol{\beta} - \nu \left( \boldsymbol{\theta}^\top \mathbf{Y}^\top \mathbf{Y} \boldsymbol{\theta} - n \right) - \gamma \left( \boldsymbol{\beta}^\top (\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega}) \boldsymbol{\beta} - n \right) \\
\Rightarrow \; n \frac{\partial L(\boldsymbol{\beta}, \boldsymbol{\theta}, \gamma, \nu)}{\partial \boldsymbol{\beta}} &= \mathbf{X}^\top \mathbf{Y} \boldsymbol{\theta} - 2\gamma\, (\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega}) \boldsymbol{\beta} \\
\Rightarrow \; \boldsymbol{\beta}_{\mathrm{cca}} &= \frac{1}{2\gamma}\, (\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega})^{-1} \mathbf{X}^\top \mathbf{Y} \boldsymbol{\theta} \enspace.
\end{aligned}
\]
Then, as $\boldsymbol{\beta}_{\mathrm{cca}}$ obeys (4.6c), we obtain
\[
\boldsymbol{\beta}_{\mathrm{cca}} = \frac{(\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega})^{-1} \mathbf{X}^\top \mathbf{Y} \boldsymbol{\theta}}{\sqrt{n^{-1} \boldsymbol{\theta}^\top \mathbf{Y}^\top \mathbf{X} (\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega})^{-1} \mathbf{X}^\top \mathbf{Y} \boldsymbol{\theta}}} \enspace, \tag{4.7}
\]
so that the optimal objective function (4.6a) can be expressed with $\boldsymbol{\theta}$ alone:
\[
\begin{aligned}
n^{-1} \boldsymbol{\theta}^\top \mathbf{Y}^\top \mathbf{X} \boldsymbol{\beta}_{\mathrm{cca}} &= \frac{n^{-1} \boldsymbol{\theta}^\top \mathbf{Y}^\top \mathbf{X} (\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega})^{-1} \mathbf{X}^\top \mathbf{Y} \boldsymbol{\theta}}{\sqrt{n^{-1} \boldsymbol{\theta}^\top \mathbf{Y}^\top \mathbf{X} (\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega})^{-1} \mathbf{X}^\top \mathbf{Y} \boldsymbol{\theta}}} \\
&= \sqrt{n^{-1} \boldsymbol{\theta}^\top \mathbf{Y}^\top \mathbf{X} (\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega})^{-1} \mathbf{X}^\top \mathbf{Y} \boldsymbol{\theta}} \enspace,
\end{aligned}
\]
and the optimization problem with respect to $\boldsymbol{\theta}$ can be restated as
\[
\max_{\boldsymbol{\theta}\,:\; n^{-1} \boldsymbol{\theta}^\top \mathbf{Y}^\top \mathbf{Y} \boldsymbol{\theta} = 1} \; \boldsymbol{\theta}^\top \mathbf{Y}^\top \mathbf{X} \left( \mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega} \right)^{-1} \mathbf{X}^\top \mathbf{Y} \boldsymbol{\theta} \enspace. \tag{4.8}
\]
Hence the p-OS and p-CCA problems produce the same optimal score vectors $\boldsymbol{\theta}$. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):
\[
\boldsymbol{\beta}_{\mathrm{os}} = \alpha\, \boldsymbol{\beta}_{\mathrm{cca}} \enspace, \tag{4.9}
\]
¹ The awkward notation $\alpha^2$ for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5), for example).
where $\alpha$ is defined by (4.5).
The p-CCA optimization problem can also be written as a function of $\boldsymbol{\beta}$ alone, using the optimality conditions for $\boldsymbol{\theta}$:
\[
\begin{aligned}
n \frac{\partial L(\boldsymbol{\beta}, \boldsymbol{\theta}, \gamma, \nu)}{\partial \boldsymbol{\theta}} &= \mathbf{Y}^\top \mathbf{X} \boldsymbol{\beta} - 2\nu\, \mathbf{Y}^\top \mathbf{Y} \boldsymbol{\theta} \\
\Rightarrow \; \boldsymbol{\theta}_{\mathrm{cca}} &= \frac{1}{2\nu}\, (\mathbf{Y}^\top\mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X} \boldsymbol{\beta} \enspace.
\end{aligned} \tag{4.10}
\]
Then, as $\boldsymbol{\theta}_{\mathrm{cca}}$ obeys (4.6b), we obtain
\[
\boldsymbol{\theta}_{\mathrm{cca}} = \frac{(\mathbf{Y}^\top\mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X} \boldsymbol{\beta}}{\sqrt{n^{-1} \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{Y} (\mathbf{Y}^\top\mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X} \boldsymbol{\beta}}} \enspace, \tag{4.11}
\]
leading to the following expression of the optimal objective function:
\[
\begin{aligned}
n^{-1} \boldsymbol{\theta}_{\mathrm{cca}}^\top \mathbf{Y}^\top \mathbf{X} \boldsymbol{\beta} &= \frac{n^{-1} \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{Y} (\mathbf{Y}^\top\mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X} \boldsymbol{\beta}}{\sqrt{n^{-1} \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{Y} (\mathbf{Y}^\top\mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X} \boldsymbol{\beta}}} \\
&= \sqrt{n^{-1} \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{Y} (\mathbf{Y}^\top\mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X} \boldsymbol{\beta}} \enspace.
\end{aligned}
\]
The p-CCA problem can thus be solved with respect to $\boldsymbol{\beta}$ by plugging this value in (4.6):
\[
\begin{aligned}
\max_{\boldsymbol{\beta} \in \mathbb{R}^p} \;\; & n^{-1} \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{Y} (\mathbf{Y}^\top\mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X} \boldsymbol{\beta} & \text{(4.12a)} \\
\text{s.t.} \;\; & n^{-1}\, \boldsymbol{\beta}^\top \left( \mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega} \right) \boldsymbol{\beta} = 1 \enspace, & \text{(4.12b)}
\end{aligned}
\]
where the positive objective function has been squared compared to (4.6). This formulation is important, since it will be used to link p-CCA to p-LDA. We thus derive its solution: following the reasoning of Appendix C, $\boldsymbol{\beta}_{\mathrm{cca}}$ verifies
\[
n^{-1} \mathbf{X}^\top \mathbf{Y} (\mathbf{Y}^\top\mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X} \boldsymbol{\beta}_{\mathrm{cca}} = \lambda \left( \mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega} \right) \boldsymbol{\beta}_{\mathrm{cca}} \enspace, \tag{4.13}
\]
where $\lambda$ is the maximal eigenvalue, shown below to be equal to $\alpha^2$:
\[
\begin{aligned}
& n^{-1} \boldsymbol{\beta}_{\mathrm{cca}}^\top \mathbf{X}^\top \mathbf{Y} (\mathbf{Y}^\top\mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X} \boldsymbol{\beta}_{\mathrm{cca}} = \lambda \\
\Rightarrow \;& n^{-1} \alpha^{-1} \boldsymbol{\beta}_{\mathrm{cca}}^\top \mathbf{X}^\top \mathbf{Y} (\mathbf{Y}^\top\mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X} (\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega})^{-1} \mathbf{X}^\top \mathbf{Y} \boldsymbol{\theta} = \lambda \\
\Rightarrow \;& n^{-1} \alpha\, \boldsymbol{\beta}_{\mathrm{cca}}^\top \mathbf{X}^\top \mathbf{Y} \boldsymbol{\theta} = \lambda \\
\Rightarrow \;& n^{-1} \boldsymbol{\theta}^\top \mathbf{Y}^\top \mathbf{X} (\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega})^{-1} \mathbf{X}^\top \mathbf{Y} \boldsymbol{\theta} = \lambda \\
\Rightarrow \;& \alpha^2 = \lambda \enspace.
\end{aligned}
\]
The first line is obtained by obeying constraint (4.12b); the second line by the relationship (4.7), whose denominator is $\alpha$; the third line comes from (4.4); the fourth line uses again the relationship (4.7); and the last one the definition of $\alpha$ (4.5).
4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), the penalized Linear Discriminant Analysis problem is defined as follows:
\[
\begin{aligned}
\max_{\boldsymbol{\beta} \in \mathbb{R}^p} \;\; & \boldsymbol{\beta}^\top \boldsymbol{\Sigma}_{\mathrm{B}} \boldsymbol{\beta} & \text{(4.14a)} \\
\text{s.t.} \;\; & \boldsymbol{\beta}^\top \left( \boldsymbol{\Sigma}_{\mathrm{W}} + n^{-1}\boldsymbol{\Omega} \right) \boldsymbol{\beta} = 1 \enspace, & \text{(4.14b)}
\end{aligned}
\]
where $\boldsymbol{\Sigma}_{\mathrm{B}}$ and $\boldsymbol{\Sigma}_{\mathrm{W}}$ are, respectively, the sample between-class and within-class covariance matrices of the original $p$-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.
As the feature matrix $\mathbf{X}$ is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form that is amenable to a simple matrix representation using the projection operator $\mathbf{Y} \left( \mathbf{Y}^\top\mathbf{Y} \right)^{-1} \mathbf{Y}^\top$:
\[
\begin{aligned}
\boldsymbol{\Sigma}_{\mathrm{T}} &= \frac{1}{n} \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^\top = n^{-1} \mathbf{X}^\top \mathbf{X} \\
\boldsymbol{\Sigma}_{\mathrm{B}} &= \frac{1}{n} \sum_{k=1}^K n_k\, \boldsymbol{\mu}_k \boldsymbol{\mu}_k^\top = n^{-1} \mathbf{X}^\top \mathbf{Y} \left( \mathbf{Y}^\top\mathbf{Y} \right)^{-1} \mathbf{Y}^\top \mathbf{X} \\
\boldsymbol{\Sigma}_{\mathrm{W}} &= \frac{1}{n} \sum_{k=1}^K \sum_{i:\, y_{ik}=1} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top = n^{-1} \left( \mathbf{X}^\top\mathbf{X} - \mathbf{X}^\top \mathbf{Y} \left( \mathbf{Y}^\top\mathbf{Y} \right)^{-1} \mathbf{Y}^\top \mathbf{X} \right) \enspace.
\end{aligned}
\]
Using these formulae, the solution to the p-LDA problem (4.14) is obtained as
\[
\begin{aligned}
\mathbf{X}^\top \mathbf{Y} \left( \mathbf{Y}^\top\mathbf{Y} \right)^{-1} \mathbf{Y}^\top \mathbf{X} \boldsymbol{\beta}_{\mathrm{lda}} &= \lambda \left( \mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega} - \mathbf{X}^\top \mathbf{Y} \left( \mathbf{Y}^\top\mathbf{Y} \right)^{-1} \mathbf{Y}^\top \mathbf{X} \right) \boldsymbol{\beta}_{\mathrm{lda}} \\
\mathbf{X}^\top \mathbf{Y} \left( \mathbf{Y}^\top\mathbf{Y} \right)^{-1} \mathbf{Y}^\top \mathbf{X} \boldsymbol{\beta}_{\mathrm{lda}} &= \frac{\lambda}{1-\lambda} \left( \mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega} \right) \boldsymbol{\beta}_{\mathrm{lda}} \enspace.
\end{aligned}
\]
The comparison of the last equation with the equation defining $\boldsymbol{\beta}_{\mathrm{cca}}$ (4.13) shows that $\boldsymbol{\beta}_{\mathrm{lda}}$ and $\boldsymbol{\beta}_{\mathrm{cca}}$ are proportional, and that $\lambda/(1-\lambda) = \alpha^2$. Using constraints (4.12b) and (4.14b), it follows that
\[
\begin{aligned}
\boldsymbol{\beta}_{\mathrm{lda}} &= (1-\alpha^2)^{-1/2}\, \boldsymbol{\beta}_{\mathrm{cca}} \\
&= \alpha^{-1} (1-\alpha^2)^{-1/2}\, \boldsymbol{\beta}_{\mathrm{os}} \enspace,
\end{aligned}
\]
which ends the path from p-OS to p-LDA.
4.1.4 Summary
The three previous subsections considered a generic form of the $k$th problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:
\[
\begin{aligned}
\min_{\boldsymbol{\Theta},\, \mathbf{B}} \;\; & \|\mathbf{Y}\boldsymbol{\Theta} - \mathbf{X}\mathbf{B}\|_F^2 + \lambda \operatorname{tr}\!\left( \mathbf{B}^\top \boldsymbol{\Omega} \mathbf{B} \right) \\
\text{s.t.} \;\; & n^{-1}\, \boldsymbol{\Theta}^\top \mathbf{Y}^\top \mathbf{Y} \boldsymbol{\Theta} = \mathbf{I}_{K-1} \enspace.
\end{aligned}
\]
Let $\mathbf{A}$ represent the $(K-1)\times(K-1)$ diagonal matrix with elements $\alpha_k$, the square root of the $k$th largest eigenvalue of $\mathbf{Y}^\top \mathbf{X} \left( \mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega} \right)^{-1} \mathbf{X}^\top \mathbf{Y}$; we have
\[
\begin{aligned}
\mathbf{B}_{\mathrm{LDA}} &= \mathbf{B}_{\mathrm{CCA}} \left( \mathbf{I}_{K-1} - \mathbf{A}^2 \right)^{-\frac{1}{2}} \\
&= \mathbf{B}_{\mathrm{OS}}\, \mathbf{A}^{-1} \left( \mathbf{I}_{K-1} - \mathbf{A}^2 \right)^{-\frac{1}{2}} \enspace,
\end{aligned} \tag{4.15}
\]
where $\mathbf{I}_{K-1}$ is the $(K-1)\times(K-1)$ identity matrix.
At this point, the feature matrix $\mathbf{X}$, which has dimensions $n \times p$ in the input space, can be projected into the optimal scoring domain as an $n \times (K-1)$ matrix $\mathbf{X}_{\mathrm{OS}} = \mathbf{X}\mathbf{B}_{\mathrm{OS}}$, or into the linear discriminant analysis space as an $n \times (K-1)$ matrix $\mathbf{X}_{\mathrm{LDA}} = \mathbf{X}\mathbf{B}_{\mathrm{LDA}}$. Classification can be performed in any of those domains if the appropriate distance (based on the penalized within-class covariance matrix) is applied.
With the aim of performing classification, the whole process can be summarized as follows:

1. Solve the p-OS problem as
\[
\mathbf{B}_{\mathrm{OS}} = \left( \mathbf{X}^\top\mathbf{X} + \lambda\boldsymbol{\Omega} \right)^{-1} \mathbf{X}^\top \mathbf{Y} \boldsymbol{\Theta} \enspace,
\]
where $\boldsymbol{\Theta}$ holds the $K-1$ leading eigenvectors of $\mathbf{Y}^\top \mathbf{X} \left( \mathbf{X}^\top\mathbf{X} + \lambda\boldsymbol{\Omega} \right)^{-1} \mathbf{X}^\top \mathbf{Y}$.
2. Translate the data samples $\mathbf{X}$ into the LDA domain as $\mathbf{X}_{\mathrm{LDA}} = \mathbf{X}\mathbf{B}_{\mathrm{OS}}\mathbf{D}$, where $\mathbf{D} = \mathbf{A}^{-1} \left( \mathbf{I}_{K-1} - \mathbf{A}^2 \right)^{-\frac{1}{2}}$.
3. Compute the matrix $\mathbf{M}$ of centroids $\boldsymbol{\mu}_k$ from $\mathbf{X}_{\mathrm{LDA}}$ and $\mathbf{Y}$.
4. Evaluate the distances $d(\mathbf{x}, \boldsymbol{\mu}_k)$ in the LDA domain as a function of $\mathbf{M}$ and $\mathbf{X}_{\mathrm{LDA}}$.
5. Translate distances into posterior probabilities and assign every sample $i$ to a class $k$ following the maximum a posteriori rule.
6. Optionally, produce a graphical representation.
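As an illustration of these steps, here is a minimal NumPy sketch on synthetic data, assuming a plain quadratic penalty $\boldsymbol{\Omega} = \mathbf{I}$ (all names and data are illustrative; this is not the GLOSS implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K, lam = 60, 5, 3, 1.0

labels = rng.integers(0, K, size=n)
labels[:K] = np.arange(K)                   # guarantee no empty class
Y = np.eye(K)[labels]                       # n x K indicator matrix
X = rng.standard_normal((n, p)) + 1.5 * labels[:, None]
X = X - X.mean(axis=0)                      # centered inputs

# Step 1: solve the p-OS problem (quadratic penalty Omega = I here)
Omega = np.eye(p)
Ginv = np.linalg.inv(X.T @ X + lam * Omega)
counts = Y.sum(axis=0)
Dm12 = np.diag(1.0 / np.sqrt(counts))       # (Y'Y)^{-1/2}
W = Dm12 @ Y.T @ X @ Ginv @ X.T @ Y @ Dm12
evals, U = np.linalg.eigh(W)
idx = np.argsort(evals)[::-1][:K - 1]       # K-1 leading eigenvectors
alpha2 = evals[idx]                         # the alpha_k^2 of Section 4.1
Theta = np.sqrt(n) * Dm12 @ U[:, idx]       # n^-1 Theta'Y'Y Theta = I holds
B_os = Ginv @ X.T @ Y @ Theta

# Step 2: map the samples to the LDA domain, X_LDA = X B_OS D
A = np.sqrt(alpha2)
D = np.diag(1.0 / (A * np.sqrt(1.0 - alpha2)))
X_lda = X @ B_os @ D

# Steps 3-5: centroids, Euclidean distances with prior adjustment, MAP rule
M = (Y.T @ X_lda) / counts[:, None]
d2 = ((X_lda[:, None, :] - M[None, :, :]) ** 2).sum(-1) \
     - 2.0 * np.log(counts / n)
pred = d2.argmin(axis=1)
print("training accuracy:", (pred == labels).mean())
```

The symmetric change of variable through $(\mathbf{Y}^\top\mathbf{Y})^{-1/2}$ is one convenient way to enforce the constraint (3.4b) with a standard symmetric eigensolver.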
The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Sections 4.2.1, 4.2.2 and 4.2.3, respectively.
4.2 Practicalities
4.2.1 Solution of the Penalized Optimal Scoring Regression
Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:
\[
\begin{aligned}
\min_{\boldsymbol{\Theta} \in \mathbb{R}^{K\times(K-1)},\, \mathbf{B} \in \mathbb{R}^{p\times(K-1)}} \;\; & \|\mathbf{Y}\boldsymbol{\Theta} - \mathbf{X}\mathbf{B}\|_F^2 + \lambda \operatorname{tr}\!\left( \mathbf{B}^\top \boldsymbol{\Omega} \mathbf{B} \right) & \text{(4.16a)} \\
\text{s.t.} \;\; & n^{-1}\, \boldsymbol{\Theta}^\top \mathbf{Y}^\top \mathbf{Y} \boldsymbol{\Theta} = \mathbf{I}_{K-1} \enspace, & \text{(4.16b)}
\end{aligned}
\]
where $\boldsymbol{\Theta}$ are the class scores, $\mathbf{B}$ the regression coefficients, and $\|\cdot\|_F$ the Frobenius norm.
Though non-convex, the OS problem is readily solved by a decomposition in $\boldsymbol{\Theta}$ and $\mathbf{B}$: the optimal $\mathbf{B}_{\mathrm{OS}}$ does not intervene in the optimality conditions with respect to $\boldsymbol{\Theta}$, and the optimum with respect to $\mathbf{B}$ is obtained in closed form as a linear combination of the optimal scores $\boldsymbol{\Theta}$ (Hastie et al., 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:
1. Initialize $\boldsymbol{\Theta}$ to $\boldsymbol{\Theta}^0$ such that $n^{-1}\, \boldsymbol{\Theta}^{0\top} \mathbf{Y}^\top \mathbf{Y} \boldsymbol{\Theta}^0 = \mathbf{I}_{K-1}$.
2. Compute $\mathbf{B} = \left( \mathbf{X}^\top\mathbf{X} + \lambda\boldsymbol{\Omega} \right)^{-1} \mathbf{X}^\top \mathbf{Y} \boldsymbol{\Theta}^0$.
3. Set $\boldsymbol{\Theta}$ to be the $K-1$ leading eigenvectors of $\mathbf{Y}^\top \mathbf{X} \left( \mathbf{X}^\top\mathbf{X} + \lambda\boldsymbol{\Omega} \right)^{-1} \mathbf{X}^\top \mathbf{Y}$.
4. Compute the optimal regression coefficients
\[
\mathbf{B}_{\mathrm{OS}} = \left( \mathbf{X}^\top\mathbf{X} + \lambda\boldsymbol{\Omega} \right)^{-1} \mathbf{X}^\top \mathbf{Y} \boldsymbol{\Theta} \enspace. \tag{4.17}
\]
Defining $\boldsymbol{\Theta}^0$ in Step 1, instead of using directly $\boldsymbol{\Theta}$ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on $\boldsymbol{\Theta}^{0\top} \mathbf{Y}^\top \mathbf{X} \left( \mathbf{X}^\top\mathbf{X} + \lambda\boldsymbol{\Omega} \right)^{-1} \mathbf{X}^\top \mathbf{Y} \boldsymbol{\Theta}^0$, which is computed as $\boldsymbol{\Theta}^{0\top} \mathbf{Y}^\top \mathbf{X} \mathbf{B}$, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.
This four-step algorithm is valid when the penalty is of the form $\operatorname{tr}(\mathbf{B}^\top \boldsymbol{\Omega} \mathbf{B})$. However, when an $L_1$ penalty is applied in (4.16), the optimization algorithm requires iterative updates of $\mathbf{B}$ and $\boldsymbol{\Theta}$. That situation is developed by Clemmensen et al. (2011), where
a Lasso or an elastic-net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and elastic-net penalties do not enjoy the equivalence with LDA problems.
4.2.2 Distance Evaluation
The simplest classification rule is the nearest centroid rule, where the sample $\mathbf{x}_i$ is assigned to class $k$ if $\mathbf{x}_i$ is closer (in terms of the shared within-class Mahalanobis distance) to centroid $\boldsymbol{\mu}_k$ than to any other centroid $\boldsymbol{\mu}_\ell$. In general, the parameters of the model are unknown, and the rule is applied with the parameters estimated from training data (the sample estimators of $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_{\mathrm{W}}$). If $\boldsymbol{\mu}_k$ are the centroids in the input space, sample $\mathbf{x}_i$ is assigned to class $k$ if the distance
\[
d(\mathbf{x}_i, \boldsymbol{\mu}_k) = (\mathbf{x}_i - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_{\mathrm{W}\Omega}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_k) - 2 \log\!\left( \frac{n_k}{n} \right) \tag{4.18}
\]
is minimized among all $k$. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment for unequal class sizes that estimates the prior probability of class $k$. Note that this is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al., 2009; Mai et al., 2012). The matrix $\boldsymbol{\Sigma}_{\mathrm{W}\Omega}$ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:
\[
\begin{aligned}
\boldsymbol{\Sigma}_{\mathrm{W}\Omega}^{-1} &= \left( n^{-1}(\mathbf{X}^\top\mathbf{X} + \lambda\boldsymbol{\Omega}) - \boldsymbol{\Sigma}_{\mathrm{B}} \right)^{-1} \\
&= \left( n^{-1}\mathbf{X}^\top\mathbf{X} - \boldsymbol{\Sigma}_{\mathrm{B}} + n^{-1}\lambda\boldsymbol{\Omega} \right)^{-1} \\
&= \left( \boldsymbol{\Sigma}_{\mathrm{W}} + n^{-1}\lambda\boldsymbol{\Omega} \right)^{-1} \enspace.
\end{aligned} \tag{4.19}
\]
Before explaining how to compute the distances, let us summarize some clarifying points:

- The solution $\mathbf{B}_{\mathrm{OS}}$ of the p-OS problem is enough to accomplish classification.
- In the LDA domain (the space of discriminant variates $\mathbf{X}_{\mathrm{LDA}}$), classification is based on Euclidean distances.
- Classification can be done in a reduced-rank space of dimension $R < K-1$ by using the first $R$ discriminant directions $\{\boldsymbol{\beta}_k\}_{k=1}^R$.
As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, the distance is
\[
\left\| (\mathbf{x}_i - \boldsymbol{\mu}_k) \mathbf{B}_{\mathrm{OS}} \right\|_{\boldsymbol{\Sigma}_{\mathrm{W}\Omega}}^2 - 2 \log(\pi_k) \enspace,
\]
where $\pi_k$ is the estimated class prior and $\|\cdot\|_{\mathbf{S}}$ is the Mahalanobis distance assuming within-class covariance $\mathbf{S}$. If classification is done in the p-LDA domain, the distance is
\[
\left\| (\mathbf{x}_i - \boldsymbol{\mu}_k) \mathbf{B}_{\mathrm{OS}} \mathbf{A}^{-1} \left( \mathbf{I}_{K-1} - \mathbf{A}^2 \right)^{-\frac{1}{2}} \right\|_2^2 - 2 \log(\pi_k) \enspace,
\]
which is a plain Euclidean distance.
4.2.3 Posterior Probability Evaluation
Let $d(\mathbf{x}, \boldsymbol{\mu}_k)$ be the distance between $\mathbf{x}$ and $\boldsymbol{\mu}_k$, defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities $p(y_k = 1 \,|\, \mathbf{x})$ can be estimated as
\[
p(y_k = 1 \,|\, \mathbf{x}) \propto \exp\!\left( -\frac{d(\mathbf{x}, \boldsymbol{\mu}_k)}{2} \right) \propto \pi_k \exp\!\left( -\frac{1}{2} \left\| (\mathbf{x} - \boldsymbol{\mu}_k) \mathbf{B}_{\mathrm{OS}} \mathbf{A}^{-1} \left( \mathbf{I}_{K-1} - \mathbf{A}^2 \right)^{-\frac{1}{2}} \right\|_2^2 \right) \enspace. \tag{4.20}
\]
These probabilities must be normalized to ensure that they sum to one. When the distances $d(\mathbf{x}, \boldsymbol{\mu}_k)$ take large values, $\exp(-d(\mathbf{x}, \boldsymbol{\mu}_k)/2)$ can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:
\[
\begin{aligned}
p(y_k = 1 \,|\, \mathbf{x}) &= \frac{\pi_k \exp\!\left( -\frac{d(\mathbf{x}, \boldsymbol{\mu}_k)}{2} \right)}{\sum_\ell \pi_\ell \exp\!\left( -\frac{d(\mathbf{x}, \boldsymbol{\mu}_\ell)}{2} \right)} \\
&= \frac{\pi_k \exp\!\left( \frac{-d(\mathbf{x}, \boldsymbol{\mu}_k) + d_{\max}}{2} \right)}{\sum_\ell \pi_\ell \exp\!\left( \frac{-d(\mathbf{x}, \boldsymbol{\mu}_\ell) + d_{\max}}{2} \right)} \enspace,
\end{aligned}
\]
where $d_{\max} = \max_k d(\mathbf{x}, \boldsymbol{\mu}_k)$.
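This shift by $d_{\max}$ can be sketched as follows (illustrative NumPy, with hypothetical variable names):

```python
import numpy as np

def posteriors(d, prior):
    """p(y_k = 1 | x) from squared distances d (n x K) and class priors,
    shifting exponents by d_max per sample to avoid underflow."""
    dmax = d.max(axis=1, keepdims=True)
    w = prior * np.exp((-d + dmax) / 2.0)
    return w / w.sum(axis=1, keepdims=True)

d = np.array([[2000.0, 2002.0]])            # large distances: exp(-d/2) underflows
prior = np.array([0.5, 0.5])
naive = prior * np.exp(-d / 2.0)
print(naive.sum())                          # 0.0: naive normalization divides by zero
print(posteriors(d, prior))                 # well-defined, sums to one
```

The shifted and unshifted ratios are mathematically identical, but the shifted one guarantees that the largest exponential equals one, so the denominator never underflows.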
4.2.4 Graphical Representation
Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. This can be accomplished by plotting the first two or three dimensions of the regression fits $\mathbf{X}_{\mathrm{OS}}$ or of the discriminant variates $\mathbf{X}_{\mathrm{LDA}}$, depending on whether the data set is presented in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.
4.3 From Sparse Optimal Scoring to Sparse LDA
The equivalence stated in Section 4.1 holds for quadratic penalties of the form $\boldsymbol{\beta}^\top \boldsymbol{\Omega} \boldsymbol{\beta}$, under the assumption that $\mathbf{Y}^\top\mathbf{Y}$ and $\mathbf{X}^\top\mathbf{X} + \lambda\boldsymbol{\Omega}$ are full rank (fulfilled when there are no empty classes and $\boldsymbol{\Omega}$ is positive definite). Quadratic penalties have interesting properties but, as recalled in Section 2.3, they do not induce sparsity. In this respect, $L_1$ penalties are preferable, but they lack a connection between p-LDA and p-OS such as the one stated by Hastie et al. (1995).
In this section we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see
Section 2.3.4) that induces groups of zeroes in the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form $\operatorname{tr}(\mathbf{B}^\top \boldsymbol{\Omega} \mathbf{B})$.
4.3.1 A Quadratic Variational Form
Quadratic variational forms of the Lasso and group-Lasso were proposed shortly after the original Lasso paper of Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet, 1998; Canu and Grandvalet, 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al., 2012).
Our formulation of the group-Lasso is shown below:
\[
\begin{aligned}
\min_{\boldsymbol{\tau} \in \mathbb{R}^p}\; \min_{\mathbf{B} \in \mathbb{R}^{p\times(K-1)}} \;\; & J(\mathbf{B}) + \lambda \sum_{j=1}^p \frac{w_j^2 \left\| \boldsymbol{\beta}^j \right\|_2^2}{\tau_j} & \text{(4.21a)} \\
\text{s.t.} \;\; & \sum_j \tau_j - \sum_j w_j \left\| \boldsymbol{\beta}^j \right\|_2 \le 0 & \text{(4.21b)} \\
& \tau_j \ge 0 \enspace, \quad j = 1, \ldots, p \enspace, & \text{(4.21c)}
\end{aligned}
\]
where $\mathbf{B} \in \mathbb{R}^{p\times(K-1)}$ is a matrix composed of row vectors $\boldsymbol{\beta}^j \in \mathbb{R}^{K-1}$, $\mathbf{B} = \left( \boldsymbol{\beta}^{1\top}, \ldots, \boldsymbol{\beta}^{p\top} \right)^\top$, and the $w_j$ are predefined nonnegative weights. The cost function $J(\mathbf{B})$ is, in our context, the OS regression objective $\frac{1}{2} \|\mathbf{Y}\boldsymbol{\Theta} - \mathbf{X}\mathbf{B}\|_2^2$; from now on, for the sake of simplicity, we keep the generic notation $J(\mathbf{B})$. Here and in what follows, $b/\tau$ is defined by continuation at zero: $b/0 = +\infty$ if $b \ne 0$ and $0/0 = 0$. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet, 1999; Bach et al., 2012, and references therein).
The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic expression into the convex hull of a family of quadratic penalties indexed by the variables $\tau_j$. This is graphically shown in Figure 4.1.
Let us start by proving the equivalence of our variational formulation with the standard group-Lasso (an alternative variational formulation is detailed and demonstrated in Appendix D).
Lemma 4.1. The quadratic penalty in $\boldsymbol{\beta}^j$ in (4.21) acts as the group-Lasso penalty $\lambda \sum_{j=1}^p w_j \left\| \boldsymbol{\beta}^j \right\|_2$.
Proof. The Lagrangian of Problem (4.21) is
\[
L = J(\mathbf{B}) + \lambda \sum_{j=1}^p \frac{w_j^2 \left\| \boldsymbol{\beta}^j \right\|_2^2}{\tau_j} + \nu_0 \left( \sum_{j=1}^p \tau_j - \sum_{j=1}^p w_j \left\| \boldsymbol{\beta}^j \right\|_2 \right) - \sum_{j=1}^p \nu_j \tau_j \enspace.
\]
Figure 4.1: Graphical representation of the variational approach to the group-Lasso.
Thus, the first-order optimality conditions for $\tau_j^\star$ are
\[
\begin{aligned}
\frac{\partial L}{\partial \tau_j}(\tau_j^\star) = 0 &\Leftrightarrow -\lambda w_j^2 \frac{\left\| \boldsymbol{\beta}^j \right\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0 \\
&\Leftrightarrow -\lambda w_j^2 \left\| \boldsymbol{\beta}^j \right\|_2^2 + \nu_0\, \tau_j^{\star 2} - \nu_j\, \tau_j^{\star 2} = 0 \\
&\Rightarrow -\lambda w_j^2 \left\| \boldsymbol{\beta}^j \right\|_2^2 + \nu_0\, \tau_j^{\star 2} = 0 \enspace.
\end{aligned}
\]
The last line is obtained from complementary slackness, which implies here $\nu_j \tau_j^\star = 0$; complementary slackness states that $\nu_j g_j(\tau_j^\star) = 0$, where $\nu_j$ is the Lagrange multiplier for constraint $g_j(\tau_j) \le 0$. As a result, the optimal value of $\tau_j$ is
\[
\tau_j^\star = \sqrt{\frac{\lambda w_j^2 \left\| \boldsymbol{\beta}^j \right\|_2^2}{\nu_0}} = \sqrt{\frac{\lambda}{\nu_0}}\; w_j \left\| \boldsymbol{\beta}^j \right\|_2 \enspace. \tag{4.22}
\]
We note that $\nu_0 \ne 0$ if there is at least one coefficient $\beta_{jk} \ne 0$; thus, the inequality constraint (4.21b) is at bound (due to complementary slackness):
\[
\sum_{j=1}^p \tau_j^\star - \sum_{j=1}^p w_j \left\| \boldsymbol{\beta}^j \right\|_2 = 0 \enspace, \tag{4.23}
\]
so that $\tau_j^\star = w_j \left\| \boldsymbol{\beta}^j \right\|_2$. Plugging this value into (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso:
\[
\min_{\mathbf{B} \in \mathbb{R}^{p\times M}} \; J(\mathbf{B}) + \lambda \sum_{j=1}^p w_j \left\| \boldsymbol{\beta}^j \right\|_2 \enspace. \tag{4.24}
\]
We have thus presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.
With Lemma 4.1 we have proved that, under constraints (4.21b)–(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently written as $\lambda \operatorname{tr}(\mathbf{B}^\top \boldsymbol{\Omega} \mathbf{B})$, where
\[
\boldsymbol{\Omega} = \operatorname{diag}\!\left( \frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \ldots, \frac{w_p^2}{\tau_p} \right) \enspace, \tag{4.25}
\]
with $\tau_j = w_j \left\| \boldsymbol{\beta}^j \right\|_2$, resulting in the diagonal components
\[
(\boldsymbol{\Omega})_{jj} = \frac{w_j}{\left\| \boldsymbol{\beta}^j \right\|_2} \enspace. \tag{4.26}
\]
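A quick numerical illustration of Lemma 4.1 (illustrative code, not part of GLOSS): at the optimal $\tau_j = w_j \|\boldsymbol{\beta}^j\|_2$, the quadratic penalty of (4.21a) takes exactly the value of the group-Lasso penalty:

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 0.5
B = rng.standard_normal((6, 2))
B[[1, 4], :] = 0.0                          # two inactive rows
w = rng.uniform(0.5, 2.0, size=6)           # nonnegative weights

norms = np.linalg.norm(B, axis=1)
tau = w * norms                             # optimal tau_j from (4.22)-(4.23)

# quadratic penalty of (4.21a), with 0/0 defined as 0 by continuation
quad = lam * np.sum(np.divide(w**2 * norms**2, tau,
                              out=np.zeros_like(tau), where=tau > 0))
group_lasso = lam * np.sum(w * norms)       # standard penalty of (4.24)
print(quad, group_lasso)                    # identical values
```

The `where` clause reproduces the continuation convention of the text: rows with $\|\boldsymbol{\beta}^j\|_2 = 0$ contribute nothing to either penalty.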
As stated at the beginning of this section, the equivalence between p-LDA problems and p-OS problems is thereby demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set strategy described in Chapter 5.
The first property states that the quadratic formulation is convex when $J$ is convex, thus providing an easy control of optimality and convergence.
Lemma 4.2. If $J$ is convex, Problem (4.21) is convex.
Proof. The function $g(\boldsymbol{\beta}, \tau) = \|\boldsymbol{\beta}\|_2^2 / \tau$, known as the perspective function of $f(\boldsymbol{\beta}) = \|\boldsymbol{\beta}\|_2^2$, is convex in $(\boldsymbol{\beta}, \tau)$ (see e.g. Boyd and Vandenberghe, 2004, Chapter 3), and the constraints (4.21b)–(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to $(\mathbf{B}, \boldsymbol{\tau})$.
In what follows, $J$ will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.
Lemma 4.3. For all $\mathbf{B} \in \mathbb{R}^{p\times(K-1)}$, the subdifferential of the objective function of Problem (4.24) is
\[
\left\{ \mathbf{V} \in \mathbb{R}^{p\times(K-1)} : \mathbf{V} = \frac{\partial J(\mathbf{B})}{\partial \mathbf{B}} + \lambda \mathbf{G} \right\} \enspace, \tag{4.27}
\]
where $\mathbf{G} \in \mathbb{R}^{p\times(K-1)}$ is a matrix composed of row vectors $\mathbf{g}^j \in \mathbb{R}^{K-1}$, $\mathbf{G} = \left( \mathbf{g}^{1\top}, \ldots, \mathbf{g}^{p\top} \right)^\top$, defined as follows. Let $S(\mathbf{B})$ denote the row-wise support of $\mathbf{B}$, $S(\mathbf{B}) = \left\{ j \in \{1, \ldots, p\} : \left\| \boldsymbol{\beta}^j \right\|_2 \ne 0 \right\}$; then we have
\[
\begin{aligned}
\forall j \in S(\mathbf{B}) \enspace, \quad & \mathbf{g}^j = w_j \left\| \boldsymbol{\beta}^j \right\|_2^{-1} \boldsymbol{\beta}^j & \text{(4.28)} \\
\forall j \notin S(\mathbf{B}) \enspace, \quad & \left\| \mathbf{g}^j \right\|_2 \le w_j \enspace. & \text{(4.29)}
\end{aligned}
\]
This condition results in an equality for the "active" non-zero vectors $\boldsymbol{\beta}^j$ and in an inequality for the other ones, which both provide essential building blocks of our algorithm.
Proof. When $\left\| \boldsymbol{\beta}^j \right\|_2 \ne 0$, the gradient of the penalty with respect to $\boldsymbol{\beta}^j$ is
\[
\frac{\partial}{\partial \boldsymbol{\beta}^j} \left( \lambda \sum_{m=1}^p w_m \left\| \boldsymbol{\beta}^m \right\|_2 \right) = \lambda w_j \frac{\boldsymbol{\beta}^j}{\left\| \boldsymbol{\beta}^j \right\|_2} \enspace. \tag{4.30}
\]
At $\left\| \boldsymbol{\beta}^j \right\|_2 = 0$, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al., 2011):
\[
\partial_{\boldsymbol{\beta}^j} \left( \lambda \sum_{m=1}^p w_m \left\| \boldsymbol{\beta}^m \right\|_2 \right) = \partial_{\boldsymbol{\beta}^j} \left( \lambda w_j \left\| \boldsymbol{\beta}^j \right\|_2 \right) = \left\{ \lambda w_j \mathbf{v} \in \mathbb{R}^{K-1} : \|\mathbf{v}\|_2 \le 1 \right\} \enspace, \tag{4.31}
\]
which gives expression (4.29).
Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if $J$ is strictly convex. All critical points $\mathbf{B}$ of the objective function verifying the following conditions are global minima:
\[
\begin{aligned}
\forall j \in S \enspace, \quad & \frac{\partial J(\mathbf{B})}{\partial \boldsymbol{\beta}^j} + \lambda w_j \left\| \boldsymbol{\beta}^j \right\|_2^{-1} \boldsymbol{\beta}^j = 0 & \text{(4.32a)} \\
\forall j \notin S \enspace, \quad & \left\| \frac{\partial J(\mathbf{B})}{\partial \boldsymbol{\beta}^j} \right\|_2 \le \lambda w_j \enspace, & \text{(4.32b)}
\end{aligned}
\]
where $S \subseteq \{1, \ldots, p\}$ denotes the set of non-zero row vectors $\boldsymbol{\beta}^j$ and $\bar{S}$ is its complement.
Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with the direct analysis of the variational problem (4.21).
4.3.2 Group-Lasso OS as Penalized LDA
With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.
Proposition 4.1. The group-Lasso OS problem
\[
\begin{aligned}
\mathbf{B}_{\mathrm{OS}} = \operatorname*{argmin}_{\mathbf{B} \in \mathbb{R}^{p\times(K-1)}}\; \min_{\boldsymbol{\Theta} \in \mathbb{R}^{K\times(K-1)}} \;\; & \frac{1}{2} \|\mathbf{Y}\boldsymbol{\Theta} - \mathbf{X}\mathbf{B}\|_F^2 + \lambda \sum_{j=1}^p w_j \left\| \boldsymbol{\beta}^j \right\|_2 \\
\text{s.t.} \;\; & n^{-1}\, \boldsymbol{\Theta}^\top \mathbf{Y}^\top \mathbf{Y} \boldsymbol{\Theta} = \mathbf{I}_{K-1}
\end{aligned}
\]
is equivalent to the penalized LDA problem
\[
\begin{aligned}
\mathbf{B}_{\mathrm{LDA}} = \operatorname*{argmax}_{\mathbf{B} \in \mathbb{R}^{p\times(K-1)}} \;\; & \operatorname{tr}\!\left( \mathbf{B}^\top \boldsymbol{\Sigma}_{\mathrm{B}} \mathbf{B} \right) \\
\text{s.t.} \;\; & \mathbf{B}^\top \left( \boldsymbol{\Sigma}_{\mathrm{W}} + n^{-1}\lambda\boldsymbol{\Omega} \right) \mathbf{B} = \mathbf{I}_{K-1} \enspace,
\end{aligned}
\]
where
\[
\boldsymbol{\Omega} = \operatorname{diag}\!\left( \frac{w_1^2}{\tau_1}, \ldots, \frac{w_p^2}{\tau_p} \right) \enspace, \quad \text{with} \quad \boldsymbol{\Omega}_{jj} = \begin{cases} +\infty & \text{if } \boldsymbol{\beta}^j_{\mathrm{os}} = \mathbf{0} \\ w_j \left\| \boldsymbol{\beta}^j_{\mathrm{os}} \right\|_2^{-1} & \text{otherwise} \enspace. \end{cases} \tag{4.33}
\]
That is, $\mathbf{B}_{\mathrm{LDA}} = \mathbf{B}_{\mathrm{OS}} \operatorname{diag}\!\left( \alpha_k^{-1} (1 - \alpha_k^2)^{-1/2} \right)$, where $\alpha_k \in (0, 1)$ is the $k$th leading eigenvalue of
\[
n^{-1} \mathbf{Y}^\top \mathbf{X} \left( \mathbf{X}^\top\mathbf{X} + \lambda\boldsymbol{\Omega} \right)^{-1} \mathbf{X}^\top \mathbf{Y} \enspace.
\]
Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.
The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al., 2008; Clemmensen et al., 2011) for $K = 2$, that is, for binary classification, or more generally for a single discriminant direction. Note, however, that it leads to a slightly different decision rule if the decision threshold is chosen a priori according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold any more, since the Lasso penalty does not result in an equivalent quadratic penalty of the simple form $\operatorname{tr}(\mathbf{B}^\top \boldsymbol{\Omega} \mathbf{B})$.
5 GLOSS Algorithm
The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems whose sizes are incrementally increased or decreased (Osborne et al., 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer, 2008). We adapt this algorithmic framework to the variational form (4.21), with $J(\mathbf{B}) = \frac{1}{2} \|\mathbf{Y}\boldsymbol{\Theta} - \mathbf{X}\mathbf{B}\|_2^2$.
The algorithm belongs to the working set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say $\mathbf{B} = \mathbf{0}$, thus defining the set $\mathcal{A}$ of "active" variables, currently identified as non-zero. Then it iterates the three steps summarized below:

1. Update the coefficient matrix $\mathbf{B}$ within the current active set $\mathcal{A}$, where the optimization problem is smooth. First the quadratic penalty is updated, and then a standard penalized least squares fit is computed.
2. Check the optimality conditions (4.32) with respect to the active variables. One or more $\boldsymbol{\beta}^j$ may be declared inactive when they vanish from the current solution.
3. Check the optimality conditions (4.32) with respect to inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.
This mechanism is graphically represented in Figure 5.1 as a block diagram, and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. To use the alternative variational approach from Appendix D instead, Equations (4.21), (4.32a) and (4.32b) have to be replaced by (D.1), (D.10a) and (D.10b), respectively.
5.1 Regression Coefficients Updates
Step 1 of Algorithm 1 updates the coefficient matrix $\mathbf{B}$ within the current active set $\mathcal{A}$. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving $K-1$ independent $\operatorname{card}(\mathcal{A})$-dimensional problems instead of a single $(K-1) \times \operatorname{card}(\mathcal{A})$-dimensional problem. The interaction between the $K-1$ problems is relegated to the common adaptive quadratic penalty $\boldsymbol{\Omega}$. This decomposition is especially attractive, as we then solve $K-1$ similar systems
\[
\left( \mathbf{X}_{\mathcal{A}}^\top \mathbf{X}_{\mathcal{A}} + \lambda\boldsymbol{\Omega} \right) \boldsymbol{\beta}_k = \mathbf{X}_{\mathcal{A}}^\top \mathbf{Y} \boldsymbol{\theta}^0_k \enspace, \tag{5.1}
\]
[Figure: flowchart of GLOSS. Initialize the model ($\lambda$, $\mathbf{B}$) and the active set $\{j : \|\boldsymbol{\beta}^j\|_2 > 0\}$; solve the p-OS problem so that $\mathbf{B}$ satisfies the first optimality condition; test the optimality conditions on the active and inactive sets, transferring violating variables between the two sets and returning to the p-OS problem after each transfer; upon convergence, compute $\boldsymbol{\Theta}$, update $\mathbf{B}$ and stop.]

Figure 5.1: GLOSS block diagram.
Algorithm 1: Adaptively Penalized Optimal Scoring

Input: $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{B}$, $\lambda$
Initialize: $\mathcal{A} \leftarrow \left\{ j \in \{1, \ldots, p\} : \|\boldsymbol{\beta}^j\|_2 > 0 \right\}$; $\boldsymbol{\Theta}^0$ such that $n^{-1}\, \boldsymbol{\Theta}^{0\top} \mathbf{Y}^\top \mathbf{Y} \boldsymbol{\Theta}^0 = \mathbf{I}_{K-1}$; convergence $\leftarrow$ false
repeat
    {Step 1: solve (4.21) in $\mathbf{B}$, assuming $\mathcal{A}$ optimal}
    repeat
        $\boldsymbol{\Omega} \leftarrow \operatorname{diag}(\boldsymbol{\Omega}_{\mathcal{A}})$, with $\omega_j \leftarrow \|\boldsymbol{\beta}^j\|_2^{-1}$
        $\mathbf{B}_{\mathcal{A}} \leftarrow \left( \mathbf{X}_{\mathcal{A}}^\top \mathbf{X}_{\mathcal{A}} + \lambda\boldsymbol{\Omega} \right)^{-1} \mathbf{X}_{\mathcal{A}}^\top \mathbf{Y} \boldsymbol{\Theta}^0$
    until condition (4.32a) holds for all $j \in \mathcal{A}$
    {Step 2: identify inactivated variables}
    for $j \in \mathcal{A}$ such that $\|\boldsymbol{\beta}^j\|_2 = 0$ do
        if optimality condition (4.32b) holds then
            $\mathcal{A} \leftarrow \mathcal{A} \setminus \{j\}$; go back to Step 1
        end if
    end for
    {Step 3: check the greatest violation of optimality condition (4.32b) in the set $\bar{\mathcal{A}}$}
    $\hat{\jmath} = \operatorname*{argmax}_{j \in \bar{\mathcal{A}}} \left\| \partial J / \partial \boldsymbol{\beta}^j \right\|_2$
    if $\left\| \partial J / \partial \boldsymbol{\beta}^{\hat{\jmath}} \right\|_2 < \lambda$ then
        convergence $\leftarrow$ true {$\mathbf{B}$ is optimal}
    else
        $\mathcal{A} \leftarrow \mathcal{A} \cup \{\hat{\jmath}\}$
    end if
until convergence
$(\mathbf{s}, \mathbf{V}) \leftarrow$ eigenanalyze$\left( \boldsymbol{\Theta}^{0\top} \mathbf{Y}^\top \mathbf{X}_{\mathcal{A}} \mathbf{B} \right)$, that is, $\boldsymbol{\Theta}^{0\top} \mathbf{Y}^\top \mathbf{X}_{\mathcal{A}} \mathbf{B} \mathbf{V}_k = s_k \mathbf{V}_k$, $k = 1, \ldots, K-1$
$\boldsymbol{\Theta} \leftarrow \boldsymbol{\Theta}^0 \mathbf{V}$; $\mathbf{B} \leftarrow \mathbf{B}\mathbf{V}$; $\alpha_k \leftarrow n^{-1/2} s_k^{1/2}$, $k = 1, \ldots, K-1$
Output: $\boldsymbol{\Theta}$, $\mathbf{B}$, $\boldsymbol{\alpha}$
where $\mathbf{X}_{\mathcal{A}}$ denotes the columns of $\mathbf{X}$ indexed by $\mathcal{A}$, and $\boldsymbol{\beta}_k$ and $\boldsymbol{\theta}^0_k$ denote the $k$th columns of $\mathbf{B}$ and $\boldsymbol{\Theta}^0$, respectively. These linear systems only differ in the right-hand-side term, so that a single Cholesky decomposition is necessary to solve all systems, whereas a blockwise Newton–Raphson method based on the standard group-Lasso formulation would result in different "penalties" $\boldsymbol{\Omega}$ for each system.
5.1.1 Cholesky Decomposition
Dropping the subscripts and considering the $K-1$ systems together, (5.1) leads to
\[
\left( \mathbf{X}^\top\mathbf{X} + \lambda\boldsymbol{\Omega} \right) \mathbf{B} = \mathbf{X}^\top \mathbf{Y} \boldsymbol{\Theta} \enspace. \tag{5.2}
\]
Defining the Cholesky decomposition $\mathbf{C}^\top\mathbf{C} = \mathbf{X}^\top\mathbf{X} + \lambda\boldsymbol{\Omega}$, (5.2) is solved efficiently as follows:
\[
\begin{aligned}
\mathbf{C}^\top\mathbf{C}\mathbf{B} &= \mathbf{X}^\top \mathbf{Y} \boldsymbol{\Theta} \\
\mathbf{C}\mathbf{B} &= \mathbf{C}^\top \backslash\, \mathbf{X}^\top \mathbf{Y} \boldsymbol{\Theta} \\
\mathbf{B} &= \mathbf{C} \backslash \left( \mathbf{C}^\top \backslash\, \mathbf{X}^\top \mathbf{Y} \boldsymbol{\Theta} \right) \enspace,
\end{aligned} \tag{5.3}
\]
where the symbol "$\backslash$" is the MATLAB mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
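A sketch of this resolution in NumPy (illustrative only; the GLOSS code itself relies on MATLAB's mldivide): the Cholesky factor is computed once and shared by all $K-1$ right-hand sides. For brevity, the two triangular systems are solved here with a generic solver, where a dedicated triangular solver would be used in practice:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, K, lam = 50, 8, 4, 0.3
X = rng.standard_normal((n, p))
Y = np.eye(K)[rng.integers(0, K, size=n)]
Theta0 = rng.standard_normal((K, K - 1))
Omega = np.diag(rng.uniform(1.0, 3.0, size=p))

G = X.T @ X + lam * Omega                   # shared left-hand side of (5.2)
RHS = X.T @ Y @ Theta0                      # the K-1 right-hand sides at once

C = np.linalg.cholesky(G)                   # G = C C^T, factored only once
Z = np.linalg.solve(C, RHS)                 # forward substitution (C triangular)
B = np.linalg.solve(C.T, Z)                 # back substitution

print(np.allclose(G @ B, RHS))              # B solves all K-1 systems of (5.1)
```

Because the systems share their left-hand side, the $O(p^3)$ factorization cost is paid once, and each additional score vector only costs two $O(p^2)$ triangular solves.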
5.1.2 Numerical Stability
The OS regression coefficients are obtained by (5.2), where the penalizer $\boldsymbol{\Omega}$ is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of $\boldsymbol{\Omega}$ reaches large values, whereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of $\mathbf{X}^\top\mathbf{X} + \lambda\boldsymbol{\Omega}$. This difficulty can be avoided using the following equivalent expression:
\[
\mathbf{B} = \boldsymbol{\Omega}^{-1/2} \left( \boldsymbol{\Omega}^{-1/2} \mathbf{X}^\top\mathbf{X}\, \boldsymbol{\Omega}^{-1/2} + \lambda\mathbf{I} \right)^{-1} \boldsymbol{\Omega}^{-1/2} \mathbf{X}^\top \mathbf{Y} \boldsymbol{\Theta}^0 \enspace, \tag{5.4}
\]
where the conditioning of $\boldsymbol{\Omega}^{-1/2} \mathbf{X}^\top\mathbf{X}\, \boldsymbol{\Omega}^{-1/2} + \lambda\mathbf{I}$ is always well-behaved, provided $\mathbf{X}$ is appropriately normalized (recall that $0 \le 1/\omega_j \le 1$). This stabler expression demands more computation and is thus reserved to cases with large $\omega_j$ values; our code is otherwise based on expression (5.2).
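The agreement between the two expressions can be checked numerically; an illustrative sketch with some very large $\omega_j$ entries (mimicking variables about to leave the active set):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, K, lam = 40, 6, 3, 0.2
X = rng.standard_normal((n, p))
Y = np.eye(K)[rng.integers(0, K, size=n)]
Theta0 = rng.standard_normal((K, K - 1))
omega = rng.uniform(1.0, 1e6, size=p)       # huge entries: variables about
Omega = np.diag(omega)                      # to leave the active set

# direct expression (5.2)
B_direct = np.linalg.solve(X.T @ X + lam * Omega, X.T @ Y @ Theta0)

# stabler expression (5.4)
S = np.diag(1.0 / np.sqrt(omega))           # Omega^{-1/2}
B_stable = S @ np.linalg.solve(S @ X.T @ X @ S + lam * np.eye(p),
                               S @ X.T @ Y @ Theta0)

print(np.max(np.abs(B_direct - B_stable)))  # the two forms agree
```

In (5.4) the diagonal scaling moves the ill-conditioning out of the matrix being factored: its eigenvalues are bounded below by $\lambda$, whatever the $\omega_j$.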
5.2 Score Matrix
The optimal score matrix Θ is made of the K−1 leading eigenvectors of YᵀX(XᵀX + λΩ)⁻¹XᵀY. This eigen-analysis is actually solved in the form ΘᵀYᵀX(XᵀX + λΩ)⁻¹XᵀYΘ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of (XᵀX + λΩ)⁻¹, which
involves the inversion of a p × p matrix. Let Θ⁰ be an arbitrary K × (K−1) matrix whose range includes the K−1 leading eigenvectors of YᵀX(XᵀX + λΩ)⁻¹XᵀY.¹ Then, solving the K−1 systems (5.3) provides the value of B⁰ = (XᵀX + λΩ)⁻¹XᵀYΘ⁰. This B⁰ matrix can be identified in the expression to eigenanalyze as

  Θ⁰ᵀYᵀX(XᵀX + λΩ)⁻¹XᵀYΘ⁰ = Θ⁰ᵀYᵀX B⁰ .

Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the (K−1)×(K−1) matrix Θ⁰ᵀYᵀX B⁰ = VΛVᵀ. Defining Θ = Θ⁰V, we have ΘᵀYᵀX(XᵀX + λΩ)⁻¹XᵀYΘ = Λ, and when Θ⁰ is chosen such that n⁻¹ Θ⁰ᵀYᵀYΘ⁰ = I_{K−1}, we also have n⁻¹ ΘᵀYᵀYΘ = I_{K−1}, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of Λ are sorted in decreasing order, θ_k is an optimal solution to the p-OS problem. Finally, once Θ has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered using the mapping from Θ⁰ to Θ, that is, B = B⁰V. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which Ω is defined by a variational formulation.
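The procedure just described — solving for B⁰, then eigen-analyzing the small (K−1)×(K−1) matrix instead of inverting a p × p one — might be sketched as follows (illustrative Python; the actual GLOSS code is MATLAB, and the function name is ours):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve, eigh

def optimal_scores(X, Y, Theta0, Omega, lam):
    """From an arbitrary admissible Theta0, recover the optimal score matrix
    Theta = Theta0 V and coefficients B = B0 V via the eigenanalysis of the
    small symmetric matrix Theta0' Y' X B0."""
    c, low = cho_factor(X.T @ X + lam * Omega)
    B0 = cho_solve((c, low), X.T @ (Y @ Theta0))  # B0 = (X'X + lam*Omega)^{-1} X'Y Theta0
    M = Theta0.T @ (Y.T @ (X @ B0))               # (K-1)x(K-1), symmetric PSD
    lams, V = eigh(M)                             # eigenvalues in ascending order
    order = np.argsort(lams)[::-1]                # reorder: decreasing eigenvalues
    V = V[:, order]
    return Theta0 @ V, B0 @ V                     # Theta, B
```

After the mapping, ΘᵀYᵀXB equals the diagonal matrix Λ with decreasing entries, as in the derivation above.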
5.3 Optimality Conditions
GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix Θ. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4. Optimality conditions (4.32a) and (4.32b) can be deduced from those lemmas. Both expressions require the computation of the gradient of the objective function

  (1/2) ‖YΘ − XB‖₂² + λ Σ_{j=1}^{p} w_j ‖β^j‖₂ .  (5.5)

Let J(B) be the data-fitting term (1/2) ‖YΘ − XB‖₂². Its gradient with respect to the j-th row of B, β^j, is the (K−1)-dimensional vector

  ∂J(B)/∂β^j = x_jᵀ (XB − YΘ) ,

where x_j is the j-th column of X. Hence, the first optimality condition (4.32a) can be computed for every variable j as

  x_jᵀ (XB − YΘ) + λ w_j β^j / ‖β^j‖₂ .
1. As X is centered, 1_K belongs to the null space of YᵀX(XᵀX + λΩ)⁻¹XᵀY. It is thus sufficient to choose Θ⁰ orthogonal to 1_K to ensure that its range spans the leading eigenvectors of YᵀX(XᵀX + λΩ)⁻¹XᵀY. In practice, to comply with this desideratum and conditions (3.5b) and (3.5c), we set Θ⁰ = (YᵀY)^{−1/2} U, where U is a K × (K−1) matrix whose columns are orthonormal vectors orthogonal to 1_K.
The second optimality condition (4.32b) can be computed for every variable j as

  ‖x_jᵀ (XB − YΘ)‖₂ ≤ λ w_j .
5.4 Active and Inactive Sets
The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set, containing the variables that have already been considered relevant. A variable j can be considered for inclusion into the active set if it violates the second optimality condition. We proceed one variable at a time, choosing the one that is expected to produce the greatest decrease in the objective function:

  j* = argmax_j max( ‖x_jᵀ (XB − YΘ)‖₂ − λ w_j , 0 ) .

The exclusion of a variable belonging to the active set A is considered if the norm ‖β^j‖₂ is small and if, after setting β^j to zero, the following optimality condition holds:

  ‖x_jᵀ (XB − YΘ)‖₂ ≤ λ w_j .

The process continues until no variable in the active set violates the first optimality condition, and no variable in the inactive set violates the second optimality condition.
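The inclusion test can be sketched as a scan for the worst violator of the second optimality condition (illustrative Python; function and variable names are ours):

```python
import numpy as np

def worst_violator(X, B, YTheta, lam, w, active):
    """Return the index of the inactive variable with the largest violation
    of the second optimality condition ||x_j'(XB - YTheta)||_2 <= lam*w_j,
    or None if no inactive variable violates it."""
    R = X @ B - YTheta                        # residual of the OS regression
    scores = np.linalg.norm(X.T @ R, axis=1)  # ||x_j'(XB - YTheta)||_2, per variable
    viol = scores - lam * w                   # > 0  <=>  condition violated
    viol[list(active)] = -np.inf              # active variables are not candidates
    j = int(np.argmax(viol))
    return j if viol[j] > 0 else None
```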
5.5 Penalty Parameter
The penalty parameter can be specified by the user, in which case GLOSS solves the problem with this value of λ. The other strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value λ_max of the penalty parameter such that B ≠ 0, and solves the p-OS problem for decreasing values of λ until a prescribed number of features are declared active.

The maximum value λ_max of the penalty parameter, corresponding to a null B matrix, is obtained by evaluating optimality condition (4.32b) at B = 0:

  λ_max = max_{j ∈ {1,…,p}} (1/w_j) ‖x_jᵀ Y Θ⁰‖₂ .

The algorithm then computes a series of solutions along the regularization path defined by a series of penalties λ₁ = λ_max > ··· > λ_t > ··· > λ_T = λ_min ≥ 0, regularly decreasing the penalty (λ_{t+1} = λ_t / 2) and using a warm-start strategy where the feasible initial guess for B(λ_{t+1}) is initialized with B(λ_t). The final penalty parameter λ_min is determined in the optimization process when the maximum number of desired active variables is attained (by default, the minimum of n and p).
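A minimal sketch of the path construction, assuming the weights w_j are given (illustrative Python; the function name is ours):

```python
import numpy as np

def lambda_path(X, Y, Theta0, w, T):
    """Regularization path lambda_1 = lambda_max > ... > lambda_T, halving the
    penalty at each step; lambda_max comes from condition (4.32b) at B = 0."""
    lam_max = (np.linalg.norm(X.T @ (Y @ Theta0), axis=1) / w).max()
    return [lam_max / 2.0**t for t in range(T)]
```

In the actual algorithm the path is cut short as soon as the desired number of active variables is reached.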
5.6 Options and Variants

5.6.1 Scaling Variables

Like most penalization schemes, GLOSS is sensitive to the scaling of variables. It thus makes sense to normalize them before applying the algorithm or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.
5.6.2 Sparse Variant

This version replaces some MATLAB commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.
5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis: Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.
The equivalence proof between penalized OS and penalized LDA (Hastie et al., 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS, corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems

  min_{B ∈ R^{p×(K−1)}} ‖YΘ − XB‖²_F = min_{B ∈ R^{p×(K−1)}} tr( ΘᵀYᵀYΘ − 2ΘᵀYᵀXB + n BᵀΣ_T B )

are replaced by

  min_{B ∈ R^{p×(K−1)}} tr( ΘᵀYᵀYΘ − 2ΘᵀYᵀXB + n Bᵀ(Σ_B + diag(Σ_W)) B ) .

Note that this variant only requires diag(Σ_W) + Σ_B + n⁻¹Ω to be positive definite, which is a weaker requirement than Σ_T + n⁻¹Ω positive definite.
5.6.4 Elastic Net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition,
        7  8  9
        4  5  6
        1  2  3

           [  3 −1  0 −1 −1  0  0  0  0 ]
           [ −1  5 −1 −1 −1 −1  0  0  0 ]
           [  0 −1  3  0 −1 −1  0  0  0 ]
           [ −1 −1  0  5 −1  0 −1 −1  0 ]
   Ω_L =   [ −1 −1 −1 −1  8 −1 −1 −1 −1 ]
           [  0 −1 −1  0 −1  5  0 −1 −1 ]
           [  0  0  0 −1 −1  0  3 −1  0 ]
           [  0  0  0 −1 −1 −1 −1  5 −1 ]
           [  0  0  0  0 −1 −1  0 −1  3 ]

Figure 5.2: Graph and Laplacian matrix for a 3 × 3 image.
for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of pixels in a 3 × 3 image, with the corresponding Laplacian matrix. The Laplacian matrix Ω_L is positive semi-definite, and the penalty βᵀΩ_Lβ favors, among vectors of identical L₂ norm, the ones having similar coefficients in the neighborhoods of the graph. For example, this penalty is 9 for the vector (1, 1, 0, 1, 1, 0, 0, 0, 0)ᵀ, which is the indicator of the neighbors of pixel 1, and it is 17 for the vector (−1, 1, 0, 1, 1, 0, 0, 0, 0)ᵀ, with a sign mismatch between pixel 1 and its neighborhood.

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty is simply added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.
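A Laplacian matrix like the one in Figure 5.2 can be generated for any image size; the following sketch (our own helper, 8-connected neighborhood) reproduces the 3 × 3 example:

```python
import numpy as np

def grid_laplacian(h, w):
    """Laplacian of the 8-connected pixel graph of an h x w image:
    L = D - A, with -1 for each pair of neighboring pixels and the
    node degree on the diagonal."""
    n = h * w
    L = np.zeros((n, n))
    for r in range(h):
        for c in range(w):
            i = r * w + c
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        L[i, rr * w + cc] = -1   # adjacency entry
                        L[i, i] += 1             # degree on the diagonal
    return L
```

For a corner pixel and its three neighbors set to 1 (the first example in the text), the quadratic form βᵀLβ counts the edges cut by the indicator, giving 9.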
6 Experimental Results
This section presents comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA. Those algorithms are Penalized LDA (PLDA) (Witten and Tibshirani, 2011), which applies a Lasso penalty within a Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al., 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing the parsimony capabilities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites; PLDA is an R implementation and SLDA is coded in MATLAB. All the experiments used the same training, validation, and test sets. Note that they differ significantly from those of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.
6.1 Normalization

With shrunken estimates, the scaling of features has important consequences. For the linear discriminants considered here, the two most common normalization strategies consist in setting to ones either the diagonal of the total covariance matrix Σ_T, or the diagonal of the within-class covariance matrix Σ_W. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing penalties with weights. The latter option is implemented in our MATLAB package.¹
6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, chapter 4) suggest investigating decision thresholds other than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.
1. The GLOSS MATLAB code can be found in the software section of www.hds.utc.fr/~grandval
6.3 Simulated Data

We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split into a training set of size n = 100, a validation set of size 100, and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is:

Simulation 1: mean shift with independent features. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), with μ_{1j} = 0.7·1_{(1≤j≤25)}, μ_{2j} = 0.7·1_{(26≤j≤50)}, μ_{3j} = 0.7·1_{(51≤j≤75)}, μ_{4j} = 0.7·1_{(76≤j≤100)}.

Simulation 2: mean shift with dependent features. There are two classes. If sample i is in class 1, then x_i ∼ N(0, Σ); if i is in class 2, then x_i ∼ N(μ, Σ), with μ_j = 0.6·1_{(j≤200)}. The covariance structure is block diagonal, with 5 blocks, each of dimension 100 × 100. The blocks have (j, j′) element 0.6^{|j−j′|}. This covariance structure is intended to mimic gene expression data correlation.

Simulation 3: one-dimensional mean shift with independent features. There are four classes, and the features are independent. If sample i is in class k, then X_{ij} ∼ N((k−1)/3, 1) if j ≤ 100, and X_{ij} ∼ N(0, 1) otherwise.

Simulation 4: mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), with mean vectors defined as follows: μ_{1j} ∼ N(0, 0.3²) for j ≤ 25, and μ_{1j} = 0 otherwise; μ_{2j} ∼ N(0, 0.3²) for 26 ≤ j ≤ 50, and μ_{2j} = 0 otherwise; μ_{3j} ∼ N(0, 0.3²) for 51 ≤ j ≤ 75, and μ_{3j} = 0 otherwise; μ_{4j} ∼ N(0, 0.3²) for 76 ≤ j ≤ 100, and μ_{4j} = 0 otherwise.
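As an illustration, Simulation 1 can be generated from its description above (a sketch of our own; the function name is ours):

```python
import numpy as np

def simulation1(n, p=500, shift=0.7, rng=None):
    """Simulation 1: four classes; class k (k = 0..3) shifts its own block of
    25 variables by `shift`, all features independent standard normal."""
    rng = np.random.default_rng(rng)
    y = rng.integers(0, 4, size=n)                 # class labels
    mu = np.zeros((4, p))
    for k in range(4):
        mu[k, 25 * k:25 * (k + 1)] = shift         # class-specific mean block
    X = mu[y] + rng.standard_normal((n, p))
    return X, y
```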
Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA, in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.

The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far behind by SLDA, which is the only
Table 6.1: Experimental results for simulated data: averages, with standard deviations, computed over 25 repetitions, of the test error rate, the number of selected variables, and the number of discriminant directions selected on the validation set.

                    Err (%)         Var             Dir
  Sim 1: K = 4, mean shift, ind. features
    PLDA            12.6 (0.1)      411.7 (3.7)     3.0 (0.0)
    SLDA            31.9 (0.1)      228.0 (0.2)     3.0 (0.0)
    GLOSS           19.9 (0.1)      106.4 (1.3)     3.0 (0.0)
    GLOSS-D         11.2 (0.1)      251.1 (4.1)     3.0 (0.0)
  Sim 2: K = 2, mean shift, dependent features
    PLDA             9.0 (0.4)      337.6 (5.7)     1.0 (0.0)
    SLDA            19.3 (0.1)       99.0 (0.0)     1.0 (0.0)
    GLOSS           15.4 (0.1)       39.8 (0.8)     1.0 (0.0)
    GLOSS-D          9.0 (0.0)      203.5 (4.0)     1.0 (0.0)
  Sim 3: K = 4, 1D mean shift, ind. features
    PLDA            13.8 (0.6)      161.5 (3.7)     1.0 (0.0)
    SLDA            57.8 (0.2)      152.6 (2.0)     1.9 (0.0)
    GLOSS           31.2 (0.1)      123.8 (1.8)     1.0 (0.0)
    GLOSS-D         18.5 (0.1)      357.5 (2.8)     1.0 (0.0)
  Sim 4: K = 4, mean shift, ind. features
    PLDA            60.3 (0.1)      336.0 (5.8)     3.0 (0.0)
    SLDA            65.9 (0.1)      208.8 (1.6)     2.7 (0.0)
    GLOSS           60.7 (0.2)       74.3 (2.2)     2.7 (0.0)
    GLOSS-D         58.8 (0.1)      162.7 (4.9)     2.9 (0.0)
[Figure: scatter plot of TPR versus FPR (in %), one marker per algorithm (GLOSS, GLOSS-D, SLDA, PLDA) and one symbol per simulation (Simulations 1–4).]

Figure 6.1: TPR versus FPR (in %) for all algorithms and simulations.
Table 6.2: Average TPR and FPR (in %), computed over 25 repetitions.

            Simulation 1     Simulation 2     Simulation 3     Simulation 4
            TPR    FPR       TPR    FPR       TPR    FPR       TPR    FPR
  PLDA      99.0   78.2      96.9   60.3      98.0   15.9      74.3   65.6
  SLDA      73.9   38.5      33.8   16.3      41.6   27.8      50.7   39.5
  GLOSS     64.1   10.6      30.0    4.6      51.1   18.2      26.0   12.1
  GLOSS-D   93.5   39.4      92.1   28.1      95.6   65.5      42.9   29.9
method that does not succeed in uncovering a low-dimensional representation in Simulation 3. The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR): the TPR is the proportion of relevant variables that are selected, and the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA's. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).
6.4 Gene Expression Data
We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors; it was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011). The Ramaswamy³ dataset contains 198 examples of 16,063 gene expressions for categorizing 14 classes of cancer. Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

2. http://www.broadinstitute.org/cancer/software/genepattern/datasets
3. http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736
Each dataset was split into a training set and a test set, with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.

Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution, due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.

Finally, Figure 6.2 displays the projection of the observations for the Nakayama and Sun datasets on the first canonical planes estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high collinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.

4. http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962
[Figure: four scatter plots (columns: GLOSS, SLDA; rows: Nakayama, Sun) of the observations projected on the first two discriminant directions ("1st discriminant" vs. "2nd discriminant"). Nakayama classes: 1) Synovial sarcoma, 2) Myxoid liposarcoma, 3) Dedifferentiated liposarcoma, 4) Myxofibrosarcoma, 5) Malignant fibrous histiocytoma. Sun classes: 1) NonTumor, 2) Astrocytomas, 3) Glioblastomas, 4) Oligodendrogliomas.]

Figure 6.2: 2D-representations of the Nakayama and Sun datasets based on the first two discriminant vectors provided by GLOSS and SLDA. The big squares represent class means.
Figure 6.3: USPS digits "1" and "0".

6.5 Correlated Data

When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily introduce this prior knowledge.

The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results are left for future work.

For this illustration, we have used a subset of the USPS handwritten digit dataset, made of 16 × 16-pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and S-GLOSS. The mean image of each digit is shown in Figure 6.3.

As in Section 5.6.4, we have encoded the pixel proximity relationships of Figure 5.2 into a penalty matrix Ω_L, but this time on a 256-node graph. Introducing this new 256 × 256 Laplacian penalty matrix Ω_L into the GLOSS algorithm is straightforward.

The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). We perfectly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element for discriminating both digits.

Figure 6.5 displays the discriminant directions β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels, which allows strokes to be detected and will probably provide better prediction results.
Figure 6.4: Discriminant direction between digits "1" and "0" (left: β for GLOSS; right: β for S-GLOSS).

Figure 6.5: Sparse discriminant direction between digits "1" and "0" (left: β for GLOSS, λ = 0.3; right: β for S-GLOSS, λ = 0.3).
Discussion
GLOSS is an efficient algorithm that performs sparse LDA based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem; to our knowledge, it is the first approach that enjoys this property in the multi-class setting. This relationship is also amenable to accommodating interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K−1)-dimensional problem into (K−1) independent p-dimensional problems. The interaction between the (K−1) problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors regarding both its prediction abilities and its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performance. Employing the same features in all discriminant directions enables the generation of models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of the data through the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can be added to the group penalty to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.
Part III
Sparse Clustering Analysis
Abstract
Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records, or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussian, and the different populations are identified with different means and a common covariance matrix.

As in the supervised framework, traditional clustering techniques perform worse when the number of irrelevant features increases. In this part, we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.
7 Feature Selection in Mixture Models
7.1 Mixture Models
One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models, which represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population, and are especially well suited to the problem of clustering.
7.1.1 Model

We assume that the observed data X = (x₁ᵀ, …, x_nᵀ)ᵀ have been drawn identically from K different subpopulations of the domain R^p. The generative distribution is a finite mixture model; that is, the data are assumed to be generated from a compounded distribution whose density can be expressed as

  f(x_i) = Σ_{k=1}^{K} π_k f_k(x_i) ,  ∀i ∈ {1, …, n},

where K is the number of components, the f_k are the densities of the components, and the π_k are the mixture proportions (π_k ∈ ]0, 1[ for all k, and Σ_k π_k = 1). Mixture models transcribe that, given the proportions π_k and the distributions f_k for each class, the data are generated according to the following mechanism:

  • y: each individual is allotted to a class according to a multinomial distribution with parameters π₁, …, π_K;
  • x: each x_i is assumed to arise from a random vector with probability density function f_k.

In addition, it is usually assumed that the component densities f_k belong to a parametric family of densities φ(·; θ_k). The density of the mixture can then be written as

  f(x_i; θ) = Σ_{k=1}^{K} π_k φ(x_i; θ_k) ,  ∀i ∈ {1, …, n},
where θ = (π₁, …, π_K, θ₁, …, θ_K) is the parameter of the model.
7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of a mixture model, Pearson (1894) used the method of moments to estimate the five parameters (μ₁, μ₂, σ₁², σ₂², π) of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods, and Bayesian approaches.

The most widely used process to estimate the parameters is the maximization of the log-likelihood using the EM algorithm, which is typically used to maximize the likelihood of models with latent variables, for which no analytical solution is available (Dempster et al., 1977).

The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the likelihood expectation with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the E-step expected likelihood.

Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed. In practice, the obtained solution depends on the initialization of the algorithm.
Maximum Likelihood Definitions

The likelihood is commonly expressed in its logarithmic version:

  L(θ; X) = log ∏_{i=1}^{n} f(x_i; θ) = Σ_{i=1}^{n} log ( Σ_{k=1}^{K} π_k f_k(x_i; θ_k) ) ,  (7.1)

where n is the number of samples, K is the number of components of the mixture (or number of clusters), and the π_k are the mixture proportions.

To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood, or
classification log-likelihood:

  L_C(θ; X, Y) = log ∏_{i=1}^{n} f(x_i, y_i; θ)
               = Σ_{i=1}^{n} log ( Σ_{k=1}^{K} y_{ik} π_k f_k(x_i; θ_k) )
               = Σ_{i=1}^{n} Σ_{k=1}^{K} y_{ik} log ( π_k f_k(x_i; θ_k) ) .  (7.2)

The y_{ik} are the binary entries of the indicator matrix Y, with y_{ik} = 1 if observation i belongs to cluster k, and y_{ik} = 0 otherwise.
Define the soft membership t_{ik}(θ) as

  t_{ik}(θ) = p(Y_{ik} = 1 | x_i; θ)  (7.3)
            = π_k f_k(x_i; θ_k) / f(x_i; θ) .  (7.4)

To lighten notations, t_{ik}(θ) will be denoted t_{ik} when the parameter θ is clear from the context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:

  L_C(θ; X, Y) = Σ_{ik} y_{ik} log ( π_k f_k(x_i; θ_k) )
               = Σ_{ik} y_{ik} log ( t_{ik} f(x_i; θ) )
               = Σ_{ik} y_{ik} log t_{ik} + Σ_{ik} y_{ik} log f(x_i; θ)
               = Σ_{ik} y_{ik} log t_{ik} + Σ_{i=1}^{n} log f(x_i; θ)
               = Σ_{ik} y_{ik} log t_{ik} + L(θ; X) ,  (7.5)

where Σ_{ik} y_{ik} log t_{ik} can be reformulated as

  Σ_{ik} y_{ik} log t_{ik} = Σ_{i=1}^{n} Σ_{k=1}^{K} y_{ik} log p(Y_{ik} = 1 | x_i; θ)
                           = Σ_{i=1}^{n} log p(Y_{ik} = 1 | x_i; θ)
                           = log p(Y | X; θ) .

As a result, relationship (7.5) can be rewritten as

  L(θ; X) = L_C(θ; Z) − log p(Y | X; θ) .  (7.6)
Likelihood Maximization
The complete log-likelihood cannot be assessed because the variables yik are unknownHowever it is possible to estimate the value of log-likelihood taking expectations condi-tionally to a current value of θ on (76)
L(θ X) = EYsimp(middot|Xθ(t)) [LC(θ X Y ))]︸ ︷︷ ︸Q(θθ(t))
+EYsimp(middot|Xθ(t)) [minus log p(Y |Xθ)]︸ ︷︷ ︸H(θθ(t))
In this expression H(θθ(t)) is the entropy and Q(θθ(t)) is the conditional expecta-tion of the complete log-likelihood Let us define an increment of the log-likelihood as∆L = L(θ(t+1) X)minus L(θ(t) X) Then θ(t+1) = argmaxθQ(θθ(t)) also increases thelog-likelihood
∆L = (Q(θ(t+1)θ(t))minusQ(θ(t)θ(t)))︸ ︷︷ ︸ge0 by definition of iteration t+1
minus (H(θ(t+1)θ(t))minusH(θ(t)θ(t)))︸ ︷︷ ︸le0 by Jensen Inequality
Therefore it is possible to maximize the likelihood by optimizing Q(θθ(t)) The rela-tionship between Q(θθprime) and L(θ X) is developed in deeper detail in Appendix F toshow how the value of L(θ X) can be recovered from Q(θθ(t))
For the mixture model problem Q(θθprime) is
Q(θθprime) = EYsimp(Y |Xθprime) [LC(θ X Y ))]
=sumik
p(Yik = 1|xiθprime) log(πkfk(xiθk))
=nsumi=1
Ksumk=1
tik(θprime) log (πkfk(xiθk)) (77)
Due to its similarity to the expression of the complete likelihood (7.2), $Q(\theta, \theta')$ is also known as the weighted likelihood. In (7.7), the weights $t_{ik}(\theta')$ are the posterior probabilities of cluster membership.
Hence, the EM algorithm sketched above results in:
• Initialization (not iterated): choice of the initial parameter $\theta^{(0)}$;
• E-step: evaluation of $Q(\theta, \theta^{(t)})$, using $t_{ik}(\theta^{(t)})$ (7.4) in (7.7);
• M-step: calculation of $\theta^{(t+1)} = \arg\max_\theta Q(\theta, \theta^{(t)})$.
Gaussian Model
In the particular case of a Gaussian mixture model with a common covariance matrix $\Sigma$ and distinct mean vectors $\mu_k$, the mixture density is
\begin{align*}
f(x_i; \theta) &= \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k)\\
&= \sum_{k=1}^{K} \pi_k \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)\right)
\end{align*}
At the E-step, the posterior probabilities $t_{ik}$ are computed as in (7.4) with the current parameters $\theta^{(t)}$; the M-step then maximizes $Q(\theta, \theta^{(t)})$ (7.7), whose form is as follows:
\begin{align}
Q(\theta, \theta^{(t)}) &= \sum_{i,k} t_{ik} \log \pi_k - \sum_{i,k} t_{ik} \log\left((2\pi)^{p/2} |\Sigma|^{1/2}\right) - \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \nonumber\\
&= \sum_{k} t_k \log \pi_k \underbrace{- \frac{np}{2} \log(2\pi)}_{\text{constant term}} - \frac{n}{2} \log(|\Sigma|) - \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \nonumber\\
&\equiv \sum_{k} t_k \log \pi_k - \frac{n}{2} \log(|\Sigma|) - \sum_{i,k} t_{ik} \left(\frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)\right) \tag{7.8}
\end{align}
where
\[
t_k = \sum_{i=1}^{n} t_{ik} \tag{7.9}
\]
The M-step, which maximizes this expression with respect to $\theta$, applies the following updates defining $\theta^{(t+1)}$:
\begin{align}
\pi_k^{(t+1)} &= \frac{t_k}{n} \tag{7.10}\\
\mu_k^{(t+1)} &= \frac{\sum_i t_{ik} x_i}{t_k} \tag{7.11}\\
\Sigma^{(t+1)} &= \frac{1}{n} \sum_k W_k \tag{7.12}\\
\text{with } W_k &= \sum_i t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top \tag{7.13}
\end{align}
The derivations are detailed in Appendix G.
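The E-step (7.4) and the closed-form updates (7.10)–(7.13) can be sketched as a compact EM loop. This is a minimal numpy illustration with a crude random initialization, not the thesis implementation; the function name and initialization scheme are ours:

```python
import numpy as np

def em_shared_cov(X, K, n_iter=50, seed=0):
    """EM for a Gaussian mixture with a common covariance matrix.

    E-step: posteriors t_ik as in (7.4); M-step: updates (7.10)-(7.13).
    """
    n, p = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, size=K, replace=False)]   # crude initialization
    Sigma = np.cov(X, rowvar=False) + 1e-6 * np.eye(p)
    for _ in range(n_iter):
        # E-step: t_ik proportional to pi_k N(x_i; mu_k, Sigma), normalized over k (7.4)
        inv = np.linalg.inv(Sigma)
        _, logdet = np.linalg.slogdet(Sigma)
        log_t = np.empty((n, K))
        for k in range(K):
            d = X - mu[k]
            maha = np.einsum('ij,jl,il->i', d, inv, d)
            log_t[:, k] = np.log(pi[k]) - 0.5 * (maha + logdet + p * np.log(2 * np.pi))
        log_t -= log_t.max(axis=1, keepdims=True)  # numerical stability
        T = np.exp(log_t)
        T /= T.sum(axis=1, keepdims=True)
        # M-step: closed-form updates (7.9)-(7.13)
        t_k = T.sum(axis=0)                        # (7.9)
        pi = t_k / n                               # (7.10)
        mu = (T.T @ X) / t_k[:, None]              # (7.11)
        Sigma = np.zeros((p, p))
        for k in range(K):
            d = X - mu[k]
            Sigma += (T[:, k, None] * d).T @ d     # W_k  (7.13)
        Sigma /= n                                 # (7.12)
    return pi, mu, Sigma, T
```

On two well-separated blobs, this loop recovers the partition up to label permutation.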
7.2 Feature Selection in Model-Based Clustering
When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own covariance matrix $\Sigma_k$, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.
In the high-dimensional, low-sample-size setting, numerical issues appear in the estimation of the covariance matrix. To avoid these singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989). Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition $\Sigma_k = \lambda_k D_k A_k D_k^\top$ (Banfield and Raftery, 1993).
These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.
In this chapter, we review some techniques to induce sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original $p$-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.
7.2.1 Based on Penalized Likelihood
Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of $x$:
\[
\log\left(\frac{p(Y_k = 1 \mid x)}{p(Y_\ell = 1 \mid x)}\right) = x^\top \Sigma^{-1} (\mu_k - \mu_\ell) - \frac{1}{2} (\mu_k + \mu_\ell)^\top \Sigma^{-1} (\mu_k - \mu_\ell) + \log\frac{\pi_k}{\pi_\ell}
\]
In this model, a simple way of introducing sparsity in the discriminant vectors $\Sigma^{-1}(\mu_k - \mu_\ell)$ is to constrain $\Sigma$ to be diagonal and to favor sparse means $\mu_k$. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension $j$, then variable $j$ is useless for class allocation and can be discarded. The means can be penalized by the L1 norm:
\[
\lambda \sum_{k=1}^{K} \sum_{j=1}^{p} |\mu_{kj}|
\]
as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:
\[
\lambda_1 \sum_{k=1}^{K} \sum_{j=1}^{p} |\mu_{kj}| + \lambda_2 \sum_{k=1}^{K} \sum_{j=1}^{p} \sum_{m=1}^{p} \left|(\Sigma_k^{-1})_{jm}\right|
\]
In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.
Guo et al. (2010) propose a variation with a pairwise fusion penalty (PFP):
\[
\lambda \sum_{j=1}^{p} \sum_{1 \le k < k' \le K} |\mu_{kj} - \mu_{k'j}|
\]
This PFP regularization does not shrink the means to zero, but towards each other. When the $j$th components of all cluster means are driven to the same value, that variable can be considered non-informative.
An $L_{1,\infty}$ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:
\[
\lambda \sum_{j=1}^{p} \left\|(\mu_{1j}, \mu_{2j}, \ldots, \mu_{Kj})\right\|_\infty
\]
One group is defined for each variable $j$, as the set of the $j$th components of the $K$ means, $(\mu_{1j}, \ldots, \mu_{Kj})$. The $L_{1,\infty}$ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time; in addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG achieves real feature selection, because it forces null values for the same variable in all cluster means:
\[
\lambda \sqrt{K} \sum_{j=1}^{p} \sqrt{\sum_{k=1}^{K} \mu_{kj}^2}
\]
The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code allowing tests is available on the authors' website.
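To illustrate how the penalties of this section treat features, here is a minimal sketch evaluating each of them on a $K \times p$ matrix of cluster means (the function names are ours; the formulas follow the displays above, with unit regularization weights):

```python
import numpy as np
from itertools import combinations

def l1_penalty(mu):
    """Lasso penalty on the means: sum_k sum_j |mu_kj|."""
    return np.abs(mu).sum()

def pairwise_fusion_penalty(mu):
    """PFP: sum_j sum_{k<k'} |mu_kj - mu_k'j| (shrinks means towards each other)."""
    K = mu.shape[0]
    return sum(np.abs(mu[k] - mu[kp]).sum() for k, kp in combinations(range(K), 2))

def l1_inf_penalty(mu):
    """L_{1,inf}: sum_j max_k |mu_kj| -- one group per variable j."""
    return np.abs(mu).max(axis=0).sum()

def group_lasso_penalty(mu):
    """VMG group-Lasso: sqrt(K) * sum_j ||(mu_1j, ..., mu_Kj)||_2."""
    K = mu.shape[0]
    return np.sqrt(K) * np.sqrt((mu ** 2).sum(axis=0)).sum()
```

For a mean matrix whose second column is identically zero, the $L_{1,\infty}$ and group-Lasso terms contribute nothing for that variable, which is exactly the group-level sparsity that allows its removal.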
The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity of the discriminant vector. The generalization from quadratic to non-quadratic penalties is only quickly outlined in their work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.
7.2.2 Based on Model Variants
The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy through conditional independence: the $j$th feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as
\[
f(x_i \mid \phi, \pi, \theta, \nu) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{p} \left[f(x_{ij} \mid \theta_{jk})\right]^{\phi_j} \left[h(x_{ij} \mid \nu_j)\right]^{1 - \phi_j}
\]
where $f(\cdot \mid \theta_{jk})$ is the distribution of relevant features and $h(\cdot \mid \nu_j)$ the distribution of irrelevant ones. The binary vector $\phi = (\phi_1, \phi_2, \ldots, \phi_p)$ represents relevance, with $\phi_j = 1$ if the $j$th feature is informative and $\phi_j = 0$ otherwise. The saliency of variable $j$ is then formalized as $\rho_j = P(\phi_j = 1)$, so all $\phi_j$ are treated as missing variables. The set of parameters $\{\pi_k, \theta_{jk}, \nu_j, \rho_j\}$ is estimated by means of the EM algorithm (Law et al., 2004).
An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix $U \in \mathbb{R}^{p \times (K-1)}$, which is updated inside the EM loop by a new step, called the Fisher step (F-step from now on), which maximizes a multi-class Fisher criterion
\[
\mathrm{tr}\left(\left(U^\top \Sigma_W U\right)^{-1} U^\top \Sigma_B U\right) \tag{7.14}
\]
so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. The F-step then updates the matrix projecting the data onto the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. These parameters can be rewritten as a function of the projection matrix $U$ and of the model parameters in the latent space, so that $U$ enters the M-step equations.
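The criterion (7.14) maximized by the F-step can be evaluated directly from the data and the soft memberships; a minimal sketch follows (the function name and the $1/n$ normalization of the scatter matrices are our assumptions; the normalization cancels in the ratio):

```python
import numpy as np

def fisher_criterion(X, T, U):
    """Multi-class Fisher criterion tr((U' S_W U)^{-1} U' S_B U), as in (7.14).

    X: (n, p) data; T: (n, K) soft memberships; U: (p, d) projection matrix.
    """
    n, p = X.shape
    xbar = X.mean(axis=0)
    t_k = T.sum(axis=0)
    mu = (T.T @ X) / t_k[:, None]                  # weighted class means
    # Between-class scatter: sum_k t_k (mu_k - xbar)(mu_k - xbar)' / n
    S_B = ((mu - xbar).T * t_k) @ (mu - xbar) / n
    # Within-class scatter: sum_{i,k} t_ik (x_i - mu_k)(x_i - mu_k)' / n
    S_W = np.zeros((p, p))
    for k in range(T.shape[1]):
        d = X - mu[k]
        S_W += (T[:, k, None] * d).T @ d
    S_W /= n
    A = U.T @ S_W @ U
    B = U.T @ S_B @ U
    return np.trace(np.linalg.solve(A, B))
```

A direction aligned with the class-mean difference scores much higher than an orthogonal noise direction, which is what the F-step exploits.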
To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one seeks the best sparse orthogonal approximation $\tilde{U}$ of the matrix $U$ maximizing (7.14). This sparse approximation is defined as the solution of
\[
\min_{\tilde{U} \in \mathbb{R}^{p \times (K-1)}} \left\|X_U - X\tilde{U}\right\|_F^2 + \lambda \sum_{k=1}^{K-1} \left\|\tilde{u}_k\right\|_1
\]
where $X_U = XU$ is the input data projected onto the non-sparse space, and $\tilde{u}_k$ is the $k$th column vector of the projection matrix $\tilde{U}$. The second possibility is inspired by Qiao et al. (2009): Fisher's discriminant (7.14), used to compute the projection matrix, is reformulated as a regression criterion penalized by a mixture of Lasso and Elastic net:
\begin{align*}
\min_{A, B \in \mathbb{R}^{p \times (K-1)}} \quad & \sum_{k=1}^{K} \left\|R_W^{-\top} H_{B,k} - A B^\top H_{B,k}\right\|_2^2 + \rho \sum_{j=1}^{K-1} \beta_j^\top \Sigma_W \beta_j + \lambda \sum_{j=1}^{K-1} \left\|\beta_j\right\|_1\\
\text{s.t.} \quad & A^\top A = I_{K-1}
\end{align*}
where $H_B \in \mathbb{R}^{p \times K}$ is a matrix, defined conditionally on the posterior probabilities $t_{ik}$, satisfying $H_B H_B^\top = \Sigma_B$, and $H_{B,k}$ is the $k$th column of $H_B$; $R_W \in \mathbb{R}^{p \times p}$ is an upper triangular matrix resulting from the Cholesky decomposition of $\Sigma_W$; $\Sigma_W$ and $\Sigma_B$ are the $p \times p$ within-class and between-class covariance matrices in the observation space; $A \in \mathbb{R}^{p \times (K-1)}$ and $B \in \mathbb{R}^{p \times (K-1)}$ are the solutions of the optimization problem, such that $B = [\beta_1, \ldots, \beta_{K-1}]$ is the best sparse approximation of $U$.
The last possibility defines the solution of Fisher's discriminant (7.14) as the solution of the following constrained optimization problem:
\begin{align*}
\min_{\tilde{U} \in \mathbb{R}^{p \times (K-1)}} \quad & \sum_{j=1}^{p} \left\|\Sigma_{B,j} - \tilde{U}\tilde{U}^\top \Sigma_{B,j}\right\|_2^2\\
\text{s.t.} \quad & \tilde{U}^\top \tilde{U} = I_{K-1}
\end{align*}
where $\Sigma_{B,j}$ is the $j$th column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of $U$.
To comply with the constraint that the columns of $U$ are orthogonal, the first and second options must be followed by a singular value decomposition of $\tilde{U}$. This is not necessary with the third option, since the penalized version of SVD already guarantees orthogonality.
However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed". Immediately after this paragraph, we can read that under certain assumptions their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices, the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized".
7.2.3 Based on Model Selection
Some clustering algorithms recast the feature selection problem as a model selection problem. Following this approach, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:
• $X^{(1)}$: the set of selected relevant variables;
• $X^{(2)}$: the set of variables being considered for inclusion in or exclusion from $X^{(1)}$;
• $X^{(3)}$: the set of non-relevant variables.
With those subsets, they define two different models, where $\mathcal{Y}$ is the partition to consider:
• $M_1$: \( f(X \mid \mathcal{Y}) = f(X^{(1)}, X^{(2)}, X^{(3)} \mid \mathcal{Y}) = f(X^{(3)} \mid X^{(2)}, X^{(1)})\, f(X^{(2)} \mid X^{(1)})\, f(X^{(1)} \mid \mathcal{Y}) \)
• $M_2$: \( f(X \mid \mathcal{Y}) = f(X^{(1)}, X^{(2)}, X^{(3)} \mid \mathcal{Y}) = f(X^{(3)} \mid X^{(2)}, X^{(1)})\, f(X^{(2)}, X^{(1)} \mid \mathcal{Y}) \)
Model $M_1$ means that the variables in $X^{(2)}$ are independent of the clustering $\mathcal{Y}$; model $M_2$ states that the variables in $X^{(2)}$ depend on the clustering $\mathcal{Y}$. To simplify the algorithm, the subset $X^{(2)}$ is updated one variable at a time. Deciding the relevance of the variable in $X^{(2)}$ thus amounts to a model selection between $M_1$ and $M_2$. The selection is done via the Bayes factor
\[
B_{12} = \frac{f(X \mid M_1)}{f(X \mid M_2)}
\]
where the high-dimensional factor $f(X^{(3)} \mid X^{(2)}, X^{(1)})$ cancels from the ratio:
\[
B_{12} = \frac{f(X^{(1)}, X^{(2)}, X^{(3)} \mid M_1)}{f(X^{(1)}, X^{(2)}, X^{(3)} \mid M_2)} = \frac{f(X^{(2)} \mid X^{(1)}, M_1)\, f(X^{(1)} \mid M_1)}{f(X^{(2)}, X^{(1)} \mid M_2)}
\]
This factor is approximated, since the integrated likelihoods $f(X^{(1)} \mid M_1)$ and $f(X^{(2)}, X^{(1)} \mid M_2)$ are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. When $X^{(2)}$ contains a single variable, $f(X^{(2)} \mid X^{(1)}, M_1)$ can be represented as a linear regression of variable $X^{(2)}$ on the variables in $X^{(1)}$, for which a BIC approximation is also available.
Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets ($X^{(1)}$ and $X^{(3)}$) remain the same, but $X^{(2)}$ is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. Their algorithm also uses a backward stepwise strategy, instead of the forward stepwise search of Raftery and Dean (2006). It allows the definition of blocks of indivisible variables that, in certain situations, improve the clustering and its interpretability.
Both algorithms are well motivated and appear to produce good results; however, the amount of computation needed to test the different subsets of variables requires a huge computation time. In practice, they cannot be used for the amount of data considered in this thesis.
8 Theoretical Foundations
In this chapter we develop Mix-GLOSS, which applies the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.
We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to perform reduced-rank decision rules using fewer than $K-1$ discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.
In the subsequent sections, we provide the principles that allow solving the M-step as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach to the group-Lasso. Finally, some considerations are provided regarding the criterion that is optimized with this modified EM.
8.1 Resolving EM with Optimal Scoring
In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded in an EM algorithm produces a penalized likelihood estimate.
8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis
LDA is typically used in a supervised learning framework, for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance
\[
d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_W^{-1} (x_i - \mu_k)
\]
where $\mu_k$ are the $p$-dimensional centroids and $\Sigma_W$ is the $p \times p$ common within-class covariance matrix.
The likelihood equations of the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the $n$ observations are replicated $K$ times and weighted by $t_{ik}$ (the posterior probabilities computed at the E-step).
Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood
\[
2\, l_{\mathrm{weight}}(\mu, \Sigma) = -\sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}\, d(x_i, \mu_k) - n \log(|\Sigma_W|)
\]
which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized likelihood of Gaussian mixtures.
8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis

The equivalence between penalized optimal scoring and penalized linear discriminant analysis has already been detailed in Section 4.1, in the supervised learning framework. It is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.
8.1.3 Clustering Using Penalized Optimal Scoring
The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix $B_{\mathrm{OS}}$, analytically related to Fisher's discriminant directions $B_{\mathrm{LDA}}$ for the data $(X, Y)$, where $Y$ is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities $t_{ik}$ in the E-step, the distance between the samples $x_i$ and the centroids $\mu_k$ must be evaluated. Depending on whether we are working in the input, OS or LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:
\[
d(x_i, \mu_k) = \left\|(x_i - \mu_k) B_{\mathrm{LDA}}\right\|_2^2 - 2\log(\pi_k)
\]
This distance defines the computation of the posterior probabilities $t_{ik}$ in the E-step (see Section 4.2.3). Putting all those elements together, the complete clustering algorithm can be summarized as follows:
1. Initialize the membership matrix $Y$ (for example with the K-means algorithm).
2. Solve the p-OS problem as
\[
B_{\mathrm{OS}} = \left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y \Theta
\]
where $\Theta$ contains the $K-1$ leading eigenvectors of
\[
Y^\top X \left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y
\]
3. Map $X$ to the LDA domain: $X_{\mathrm{LDA}} = X B_{\mathrm{OS}} D$, with $D = \mathrm{diag}\big(\alpha_k^{-1}(1 - \alpha_k^2)^{-1/2}\big)$.
4. Compute the centroids $M$ in the LDA domain.
5. Evaluate distances in the LDA domain.
6. Translate distances into posterior probabilities $t_{ik}$ with
\[
t_{ik} \propto \exp\left[-\frac{d(x_i, \mu_k) - 2\log(\pi_k)}{2}\right] \tag{8.1}
\]
7. Update the labels using the posterior probability matrix: $Y = T$.
8. Go back to step 2 and iterate until the $t_{ik}$ converge.
Steps 2 to 5 can be interpreted as the M-step, and step 6 as the E-step, in this alternative view of the EM algorithm for Gaussian mixtures.
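Step 2, the core of the M-step, can be sketched in a few lines of numpy. This is an illustration under assumptions, not the thesis implementation: the function name is ours, and the test below uses a ridge-style penalty $\Omega = I$, whereas in Mix-GLOSS $\Omega$ comes from the variational group-Lasso:

```python
import numpy as np

def pos_solve(X, Y, lam, Omega):
    """Penalized optimal scoring step: B_OS = (X'X + lam*Omega)^{-1} X'Y Theta.

    X: (n, p) centered data; Y: (n, K) hard or soft memberships;
    Omega: (p, p) quadratic penalty matrix.
    Theta holds the K-1 leading eigenvectors of Y'X (X'X + lam*Omega)^{-1} X'Y.
    """
    K = Y.shape[1]
    G = np.linalg.inv(X.T @ X + lam * Omega)
    M = Y.T @ X @ G @ X.T @ Y              # symmetric (K, K) matrix
    _, V = np.linalg.eigh(M)               # eigenvalues in ascending order
    Theta = V[:, ::-1][:, :K - 1]          # K-1 leading eigenvectors
    B_os = G @ X.T @ Y @ Theta             # (p, K-1) coefficient matrix
    return B_os, Theta
```

Since the matrix being eigendecomposed is symmetric, the columns of $\Theta$ come out orthonormal, as required by the optimal scoring constraints.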
8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
In the previous section, we sketched a clustering algorithm that replaces the M-step with a penalized OS regression. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.
8.2 Optimized Criterion
In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood $Q(\theta, \theta')$ (7.7), so as to maximize the likelihood $L(\theta)$ (see Section 7.1.2). Replacing the M-step by a penalized optimal scoring problem is possible, and the link between penalized optimal scoring and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.
This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix $\Sigma$ (there is no prior on the means and proportions of the mixture). We believe, however, that the Bayesian interpretation provides some insight, and we detail it in what follows.
8.2.1 A Bayesian Derivation
This section sketches a Bayesian treatment of inference, limited to our present needs, where penalties are interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).
The model proposed in this thesis considers a classical maximum likelihood estimation for the means, and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.
The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:
\[
f(\Sigma \mid \Lambda_0, \nu_0) = \frac{1}{2^{np/2}\, |\Lambda_0|^{n/2}\, \Gamma_p(n/2)}\, |\Sigma^{-1}|^{(\nu_0 - p - 1)/2} \exp\left(-\frac{1}{2} \mathrm{tr}\left(\Lambda_0^{-1} \Sigma^{-1}\right)\right)
\]
where $\nu_0$ is the number of degrees of freedom of the distribution, $\Lambda_0$ is a $p \times p$ scale matrix, and $\Gamma_p$ is the multivariate gamma function defined as
\[
\Gamma_p(n/2) = \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma\left(\frac{n}{2} + \frac{1 - j}{2}\right)
\]
The posterior distribution can be maximized, similarly to the likelihood, through the maximization of
\begin{align}
& Q(\theta, \theta') + \log\left(f(\Sigma \mid \Lambda_0, \nu_0)\right) \nonumber\\
&= \sum_{k=1}^{K} t_k \log \pi_k - \frac{(n+1)p}{2} \log 2 - \frac{n}{2} \log|\Lambda_0| - \frac{p(p+1)}{4} \log(\pi) \nonumber\\
&\quad - \sum_{j=1}^{p} \log\left(\Gamma\left(\frac{n}{2} + \frac{1 - j}{2}\right)\right) - \frac{\nu_n - p - 1}{2} \log|\Sigma| - \frac{1}{2} \mathrm{tr}\left(\Lambda_n^{-1} \Sigma^{-1}\right) \nonumber\\
&\equiv \sum_{k=1}^{K} t_k \log \pi_k - \frac{n}{2} \log|\Lambda_0| - \frac{\nu_n - p - 1}{2} \log|\Sigma| - \frac{1}{2} \mathrm{tr}\left(\Lambda_n^{-1} \Sigma^{-1}\right) \tag{8.2}
\end{align}
\begin{align*}
\text{with}\quad t_k &= \sum_{i=1}^{n} t_{ik}, & \nu_n &= \nu_0 + n,\\
\Lambda_n^{-1} &= \Lambda_0^{-1} + S_0, & S_0 &= \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top
\end{align*}
Details of these calculations can be found in textbooks (for example, Bishop, 2006; Gelman et al., 2003).
8.2.2 Maximum a Posteriori Estimator
The maximization of (8.2) with respect to $\mu_k$ and $\pi_k$ is of course not affected by the additional prior term, where only the covariance $\Sigma$ intervenes. The MAP estimator of $\Sigma$ is simply obtained by differentiating (8.2) with respect to $\Sigma$. The details of the calculations follow the same lines as those of the maximum likelihood estimate, detailed in Appendix G. The resulting estimator is
\[
\Sigma_{\mathrm{MAP}} = \frac{1}{\nu_0 + n - p - 1} \left(\Lambda_0^{-1} + S_0\right) \tag{8.3}
\]
where $S_0$ is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if $\nu_0$ is chosen to be $p + 1$ and $\Lambda_0^{-1} = \lambda\Omega$, where $\Omega$ is the penalty matrix from the group-Lasso regularization (4.25).
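A short numeric sketch of (8.3) makes this identification concrete: with $\nu_0 = p + 1$, the denominator reduces to $n$, and the MAP estimate is the penalized within-class covariance $(\lambda\Omega + S_0)/n$ (the function name is ours; $\Omega = I$ in the check is an illustrative choice):

```python
import numpy as np

def sigma_map(S0, n, Omega, lam, nu0=None):
    """MAP covariance estimator (8.3) under a Wishart prior on the precision.

    With nu0 = p + 1 and Lambda_0^{-1} = lam * Omega, this reduces to
    (lam * Omega + S0) / n, i.e. the penalized within-class covariance
    of the p-OS regression.
    """
    p = S0.shape[0]
    if nu0 is None:
        nu0 = p + 1                       # the choice made in the text
    return (lam * Omega + S0) / (nu0 + n - p - 1)
```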
9 Mix-GLOSS Algorithm
Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter, I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.
9.1 Mix-GLOSS
The implementation of Mix-GLOSS involves three nested loops, as shown in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter $\lambda$, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix $B$, which projects the input data $X$ onto the best subspace (in Fisher's sense), and the posteriors $t_{ik}$.
When several values of the penalty parameter are tested, they are given to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous $\lambda$ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.
The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.
9.1.1 Outer Loop: Whole Algorithm Repetitions
This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:
• the centered $n \times p$ feature matrix $X$;
• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);
• the number of clusters $K$;
• the maximum number of iterations for the EM algorithm;
• the convergence tolerance for the EM algorithm;
• the number of whole repetitions of the clustering algorithm;
Figure 9.1: Mix-GLOSS loops scheme.
• a $p \times (K-1)$ initial coefficient matrix (optional);
• an $n \times K$ initial posterior probability matrix (optional).
For each algorithm repetition, an initial label matrix $Y$ is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix $B$, it can also be fed into Mix-GLOSS to warm-start the process.
9.1.2 Penalty Parameter Loop
The penalty parameter loop goes through all the values of the input vector $\lambda$. These values are sorted in ascending order, such that the resulting $B$ and $Y$ matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some $\lambda$ value results in a null coefficient matrix, the algorithm halts. We have verified that the warm-start reduces the computation time by a factor of 8, with respect to using a null $B$ matrix and a K-means execution for the initial $Y$ label matrix.
Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix $B$ and posterior matrix $Y$ are used to estimate a trial value of $\lambda$ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the estimated percentage of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.
Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternative variational approach from Appendix D is used, Equation (4.32b) must be replaced by (D.10b).
Algorithm 2: Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize: B ← 0; Y ← K-means(X, K)
Run non-penalized Mix-GLOSS:
  λ ← 0
  (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
lastLAMBDA ← false
repeat
  Estimate λ:
    Compute the gradient at β_j = 0:
      ∂J(B)/∂β_j |_{β_j = 0} = x_j^⊤ ( Σ_{m≠j} x_m β_m − YΘ )
    Compute λ_max for every feature using (4.32b):
      λ_max,j = (1/w_j) ‖ ∂J(B)/∂β_j |_{β_j = 0} ‖_2
    Choose λ so as to remove 10% of the relevant features
  Run penalized Mix-GLOSS:
    (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
  if the number of relevant variables in B > minVAR then
    lastLAMBDA ← false
  else
    lastLAMBDA ← true
  end if
until lastLAMBDA
Output: B, L(θ), t_ik, π_k, μ_k, Σ, Y for every λ in the solution path
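At $B = 0$, the gradient in Algorithm 2 reduces to $-x_j^\top Y\Theta$, so the per-feature $\lambda_{\max}$ is just a column-wise norm; a minimal sketch (the function name and the uniform default weights are our assumptions):

```python
import numpy as np

def lambda_max_per_feature(X, YTheta, w=None):
    """Per-feature lambda_max for the group-Lasso path, evaluated at B = 0.

    At B = 0, the gradient of the fit term w.r.t. beta_j is -x_j' YTheta,
    so lambda_max_j = ||x_j' YTheta||_2 / w_j: feature j enters the model
    only for lambda below lambda_max_j.
    X: (n, p) data; YTheta: (n, K-1) scaled label matrix; w: (p,) weights.
    """
    p = X.shape[1]
    if w is None:
        w = np.ones(p)                     # uniform group weights
    grad = X.T @ YTheta                    # (p, K-1); row j is x_j' YTheta
    return np.linalg.norm(grad, axis=1) / w
```

Sorting these values gives the grid of penalty parameters at which each feature leaves the model, which is what the automatic selection mechanism exploits to remove a chosen fraction of features.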
9.1.3 Inner Loop: EM Algorithm
The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities $t_{ik}$ is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.
Algorithm 3: Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
  if (B0, Y0) available then
    B_OS ← B0; Y ← Y0
  else
    B_OS ← 0; Y ← K-means(X, K)
  end if
  convergenceEM ← false; tolEM ← 1e-3
repeat
  M-step:
    (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
    X_LDA = X B_OS diag( α^{-1}(1 − α²)^{-1/2} )
    π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
  E-step:
    t_ik as per (8.1)
    L(θ) as per (8.2)
  if (1/n) Σ_i |t_ik − y_ik| < tolEM then
    convergenceEM ← true
  end if
  Y ← T
until convergenceEM
Y ← MAP(T)
Output: B_OS, Θ, L(θ), t_ik, π_k, μ_k, Σ, Y
M-Step
The M-step deals with the estimation of the model parameters, that is, the cluster means $\mu_k$, the common covariance matrix $\Sigma$ and the prior of every component $\pi_k$. In a classical M-step, this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses $X$ on the scaled version of the label matrix, $Y\Theta$. For the first iteration of EM, if no initialization is available, $Y$ results from a K-means execution. In subsequent iterations, $Y$ is updated as the posterior probability matrix $T$ resulting from the E-step.
E-Step
The E-step evaluates the posterior probability matrix $T$ using
\[
t_{ik} \propto \exp\left[-\frac{d(x_i, \mu_k) - 2\log(\pi_k)}{2}\right]
\]
The convergence of those $t_{ik}$ is used as the stopping criterion for EM.
9.2 Model Selection
Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.
In a first attempt, we tried a classical structure where clustering was performed several times, from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive $\lambda$ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm took a lot of computing resources, since the stability selection mechanism required a number of repetitions that transformed Mix-GLOSS into a lengthy four-nested-loop structure.
In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978), but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, still required a large computation time.
The third and definitive attempt (up to now) proceeds with several executions of Mix-GLOSS for the non-penalized case ($\lambda = 0$). The execution with the best log-likelihood is chosen; the repetitions are only performed for the non-penalized problem. The coefficient matrix $B$ and the posterior matrix $T$ resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism; this time, only one repetition of the algorithm is done for every value of the penalty parameter. This version has been tested
Figure 9.2: Mix-GLOSS model selection diagram — the best of several non-penalized runs ($\lambda = 0$) provides the starting $B$ and $T$, which warm-start the penalized Mix-GLOSS runs; BIC is computed for each $\lambda$, and the final $\lambda$ minimizes BIC.
with no significant differences in the quality of the clustering, but reducing the computation time dramatically. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter $\lambda$.
10 Experimental Results
The performance of Mix-GLOSS is measured here on the artificial dataset that was used in Chapter 6.
This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime, with $p = 500$ variables, out of which 100 differ between classes. Independent variables are generated for all simulations except simulation 2, where they are slightly correlated. In simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.
In our tests we have reduced the volume of the problem because, with the original size of 1200 samples and 500 dimensions, some of the algorithms to be tested took several days (even weeks) to finish. Hence the definitive database was chosen to maintain approximately the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1 has been adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments and allows a better understanding of the different simulations.
The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.
10.1 Tested Clustering Algorithms
This section compares Mix-GLOSS with the following methods from the state of the art:
• CS general cov. This is a model-based clustering with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.
• Fisher EM. This method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package, "FisherEM", is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network website.
Figure 10.1: Class mean vectors for each artificial simulation
• SelvarClust/Clustvarsel. Implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software implemented in C++ that makes use of the clustering library mixmod (Biernacki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.
After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.
SelvarClust was replaced by the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network website.
• LumiWCluster. LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. In Kuan et al. (2010), some slight changes are introduced in the penalty term, such as weighting parameters that are particularly important for their dataset. The package LumiWCluster allows performing clustering using the expression from Wang and Zhu (2008) (called LumiWCluster-Wang here) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).
• Mix-GLOSS. This is the clustering algorithm implemented using GLOSS (see
Section 9). It makes use of an EM algorithm and of the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach of the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.
10.2 Results
Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The performance measures are:
• Clustering error (in percentage). To measure the quality of the partition with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows the ideal 0% clustering error to be obtained even if the IDs of the clusters and of the real classes differ.
• Number of disposed features. This value shows the number of variables whose coefficients have been zeroed, so that they are not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence a good result for the tested algorithms should be around 80.
• Time of execution (in hours, minutes or seconds). Finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. These algorithms tend to be more memory- and CPU-consuming as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.
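The clustering error above can be sketched with a minimal standard-library implementation of the idea: cluster IDs are matched to classes by the best one-to-one mapping, so that relabeling the clusters cannot change the score. This is a sketch of the principle (assuming as many clusters as classes), not the exact formula of Wu and Schölkopf (2007):

```python
from itertools import permutations

def clustering_error(true_labels, cluster_ids):
    # Error (%) under the best one-to-one matching of cluster IDs to classes.
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    n = len(true_labels)
    best_hits = 0
    for perm in permutations(classes):
        mapping = dict(zip(clusters, perm))  # cluster ID -> candidate class
        hits = sum(mapping[c] == t for c, t in zip(cluster_ids, true_labels))
        best_hits = max(best_hits, hits)
    return 100.0 * (n - best_hits) / n

# a partition identical to the truth up to renamed IDs scores 0%
truth = [0, 0, 1, 1, 2, 2]
part = [2, 2, 0, 0, 1, 1]
print(clustering_error(truth, part))  # -> 0.0
```

The explicit enumeration of permutations is only practical for the small K used here; for larger K a Hungarian-type assignment would be used instead.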
The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of the truly relevant variables that are selected (recall); similarly, the FPR is the proportion of the non-relevant variables that are selected (fall-out). The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to their high computing time and clustering error, respectively, and, as the two versions of LumiWCluster provide almost the same TPR and FPR, only one of them is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.
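With the first 20 of the p = 100 variables relevant, TPR and FPR can be computed from the selected set as in this short sketch (the selection shown is invented for illustration):

```python
def tpr_fpr(selected, relevant, p):
    # selected: indices kept by the algorithm; relevant: truly discriminant indices
    selected, relevant = set(selected), set(relevant)
    tpr = 100.0 * len(selected & relevant) / len(relevant)       # recall
    fpr = 100.0 * len(selected - relevant) / (p - len(relevant))  # fall-out
    return tpr, fpr

# 100 variables, the first 20 relevant; a method keeps 18 relevant + 4 spurious ones
sel = list(range(18)) + [40, 55, 70, 90]
print(tpr_fpr(sel, range(20), 100))  # -> (90.0, 5.0)
```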
The results, in percentages, are displayed in Figure 10.2 (and in Table 10.2).
Table 10.1: Experimental results for simulated data; mean (standard deviation) over the 25 repetitions

                      Err (%)        Var           Time
Sim 1: K = 4, mean shift, ind. features
  CS general cov      4.6 (1.5)      98.5 (7.2)    884h
  Fisher EM           5.8 (8.7)      78.4 (5.2)    1645m
  Clustvarsel         60.2 (10.7)    37.8 (29.1)   383h
  LumiWCluster-Kuan   4.2 (6.8)      77.9 (4.0)    389s
  LumiWCluster-Wang   4.3 (6.9)      78.4 (3.9)    619s
  Mix-GLOSS           3.2 (1.6)      80.0 (0.9)    15h
Sim 2: K = 2, mean shift, dependent features
  CS general cov      15.4 (2.0)     99.7 (0.9)    783h
  Fisher EM           7.4 (2.3)      80.9 (2.8)    8m
  Clustvarsel         7.3 (2.0)      33.4 (20.7)   166h
  LumiWCluster-Kuan   6.4 (1.8)      79.8 (0.4)    155s
  LumiWCluster-Wang   6.3 (1.7)      79.9 (0.3)    14s
  Mix-GLOSS           7.7 (2.0)      84.1 (3.4)    2h
Sim 3: K = 4, 1D mean shift, ind. features
  CS general cov      30.4 (5.7)     55.0 (46.8)   1317h
  Fisher EM           23.3 (6.5)     36.6 (5.5)    22m
  Clustvarsel         65.8 (11.5)    23.2 (29.1)   542h
  LumiWCluster-Kuan   32.3 (2.1)     80.0 (0.2)    83s
  LumiWCluster-Wang   30.8 (3.6)     80.0 (0.2)    1292s
  Mix-GLOSS           34.7 (9.2)     81.0 (8.8)    21h
Sim 4: K = 4, mean shift, ind. features
  CS general cov      62.6 (5.5)     99.9 (0.2)    112h
  Fisher EM           56.7 (10.4)    55.0 (4.8)    195m
  Clustvarsel         73.2 (4.0)     24.0 (12.0)   767h
  LumiWCluster-Kuan   69.2 (11.2)    99.0 (2.0)    876s
  LumiWCluster-Wang   69.7 (11.9)    99.1 (2.1)    825s
  Mix-GLOSS           66.9 (9.1)     97.5 (1.2)    11h
Table 10.2: TPR versus FPR (in %), averages computed over 25 repetitions for the best performing algorithms

            Simulation 1    Simulation 2    Simulation 3    Simulation 4
            TPR    FPR      TPR    FPR      TPR    FPR      TPR    FPR
MIX-GLOSS   99.2   0.15     82.8   3.35     88.4   6.7      78.0   1.2
LUMI-KUAN   99.2   2.8      100.0  0.2      100.0  0.05     5.0    0.05
FISHER-EM   98.6   2.4      88.8   1.7      83.8   58.25    62.0   40.75
[Figure 10.2: scatter plot of FPR (x-axis, 0–60) versus TPR (y-axis, 0–100), titled "TPR vs FPR", with one point per algorithm (MIX-GLOSS, LUMI-KUAN, FISHER-EM) and per simulation (Simulations 1–4).]

Figure 10.2: TPR versus FPR (in %) for the best performing algorithms and simulations
10.3 Discussion
After reviewing Tables 10.1 and 10.2 and Figure 10.2, we see that there is no definitive winner across all situations and all criteria. Depending on the objectives and constraints of the problem, the following observations deserve to be highlighted.
LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior on the other criteria. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on each implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation time is worth mentioning here.
The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.
From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all the situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best, or close to the best solution, in terms of fall-out and recall.
Conclusions
Summary
The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.
In this thesis we have proved the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.
The equivalence between LDA and OS problems makes all the resources available for solving regression problems applicable to linear discrimination. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.
In Part II we have used a variational approach of the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver algorithm (GLOSS), which has proved its effectiveness, inducing extremely parsimonious models without renouncing any predictive capabilities. GLOSS has been tested on four artificial and three real datasets, outperforming other state-of-the-art algorithms in almost all situations.
In Part III this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As for the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion to maximize at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection of the penalty parameter. Also in this case the theory has been put into practice with the implementation of Mix-GLOSS. So far, due to time constraints, only artificial datasets have been tested, with positive results.
Perspectives
Even if the preliminary results are promising, Mix-GLOSS has not been sufficiently tested. We have planned to test it at least with the same real datasets that we used with GLOSS. However, more testing would be recommended in both cases. These algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables, but other high-dimensional low-sample-size (HDLSS) domains are also possible: identification of male or female silhouettes, fungal species, or fish species
based on shape and texture (Clemmensen et al., 2011), or the Stirling faces (Roth and Lange, 2004), are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), or the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a), have also been tested in the literature.
At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was achieving functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version should be made available for GLOSS and Mix-GLOSS in the short term.
The theory developed in this thesis, and the programming structure used for its implementation, allow easy alterations to the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix; spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as a Laplacian describing the relationships between variables; this can be used to implement pairwise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-like penalties. Some of those possibilities have been partially implemented, such as the diagonal version of GLOSS; however, they have not been properly tested, or even updated with the last algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, owing to the publication deadline of this thesis.
From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized in Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results because of the time constraints. Not knowing this criterion does not prevent successful runs of Mix-GLOSS: other mechanisms that do not involve the computation of the real criterion have been used for stopping the EM algorithm and for model selection. However, further investigations must be pursued in this direction to assess the convergence properties of the algorithm.
At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was made in the domains of outlier detection and block clustering. One of the most successful mechanisms for detecting outliers consists in modelling the population with a mixture model where the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather together all the points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.
Appendix
A Matrix Properties
Property 1: By definition, Σ_W and Σ_B are both symmetric matrices:

Σ_W = (1/n) ∑_{k=1}^{g} ∑_{i∈C_k} (x_i − μ_k)(x_i − μ_k)^⊤

Σ_B = (1/n) ∑_{k=1}^{g} n_k (μ_k − x̄)(μ_k − x̄)^⊤
Property 2: ∂(x^⊤a)/∂x = ∂(a^⊤x)/∂x = a

Property 3: ∂(x^⊤Ax)/∂x = (A + A^⊤)x

Property 4: ∂|X^{−1}|/∂X = −|X^{−1}| (X^{−1})^⊤

Property 5: ∂(a^⊤Xb)/∂X = ab^⊤

Property 6: ∂/∂X tr(AX^{−1}B) = −(X^{−1}BAX^{−1})^⊤ = −X^{−⊤}A^⊤B^⊤X^{−⊤}
B The Penalized-OS Problem is an Eigenvector Problem
In this appendix we answer the question of why the solution of a penalized optimal scoring regression involves an eigenvector decomposition. The p-OS problem has the form

min_{θ_k, β_k}  ‖Yθ_k − Xβ_k‖₂² + β_k^⊤Ω_kβ_k    (B.1)
s.t.  θ_k^⊤Y^⊤Yθ_k = 1
      θ_ℓ^⊤Y^⊤Yθ_k = 0,  ∀ℓ < k,

for k = 1, …, K − 1. The Lagrangian associated with Problem (B.1) is
L_k(θ_k, β_k, λ_k, ν_k) = ‖Yθ_k − Xβ_k‖₂² + β_k^⊤Ω_kβ_k + λ_k(θ_k^⊤Y^⊤Yθ_k − 1) + ∑_{ℓ<k} ν_ℓ θ_ℓ^⊤Y^⊤Yθ_k.    (B.2)

Setting the gradient of (B.2) with respect to β_k to zero gives the value of the optimal β_k*:

β_k* = (X^⊤X + Ω_k)^{−1}X^⊤Yθ_k.    (B.3)
The objective function of (B.1) evaluated at β_k* is

min_{θ_k} ‖Yθ_k − Xβ_k*‖₂² + β_k*^⊤Ω_kβ_k* = min_{θ_k} θ_k^⊤Y^⊤(I − X(X^⊤X + Ω_k)^{−1}X^⊤)Yθ_k
                                            = max_{θ_k} θ_k^⊤Y^⊤X(X^⊤X + Ω_k)^{−1}X^⊤Yθ_k.    (B.4)

If the penalty matrix is identical for all problems, Ω_k = Ω, then (B.4) corresponds to an eigen-problem where the score vectors θ_k are the eigenvectors of Y^⊤X(X^⊤X + Ω)^{−1}X^⊤Y.
B.1 How to Solve the Eigenvector Decomposition
Making an eigen-decomposition of an expression like Y^⊤X(X^⊤X + Ω)^{−1}X^⊤Y is not trivial because of the p × p inverse: with some datasets, p can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.
Let M be the matrix Y^⊤X(X^⊤X + Ω)^{−1}X^⊤Y, so that we can rewrite expression (B.4) in a compact way:

max_{Θ∈R^{K×(K−1)}}  tr(Θ^⊤MΘ)    (B.5)
s.t.  Θ^⊤Y^⊤YΘ = I_{K−1}.
If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the (K−1)×(K−1) matrix M_Θ be Θ^⊤MΘ. The classical eigenvector formulation associated with (B.5) is then

M_Θv = λv,    (B.6)

where v is an eigenvector and λ the associated eigenvalue of M_Θ. Operating,

v^⊤M_Θv = λ  ⇔  v^⊤Θ^⊤MΘv = λ.

Making the variable change w = Θv, we obtain an alternative eigenproblem where the w are the eigenvectors of M and λ the associated eigenvalue:

w^⊤Mw = λ.    (B.7)
Therefore the v are the eigenvectors of the eigen-decomposition of the matrix M_Θ, and the w are the eigenvectors of the eigen-decomposition of the matrix M. Note that the only difference between the (K−1)×(K−1) matrix M_Θ and the K×K matrix M is the K×(K−1) matrix Θ in the expression M_Θ = Θ^⊤MΘ. Then, to avoid the computation of the p × p inverse (X^⊤X + Ω)^{−1}, we can use the optimal value of the coefficient matrix B = (X^⊤X + Ω)^{−1}X^⊤YΘ in M_Θ:

M_Θ = Θ^⊤Y^⊤X(X^⊤X + Ω)^{−1}X^⊤YΘ = Θ^⊤Y^⊤XB.

Thus the eigen-decomposition of the (K−1)×(K−1) matrix M_Θ = Θ^⊤Y^⊤XB yields the v eigenvectors of (B.6). To obtain the w eigenvectors of the alternative formulation (B.7), the variable change w = Θv needs to be undone.

To summarize, we calculate the v eigenvectors from the eigen-decomposition of the tractable matrix M_Θ, evaluated as Θ^⊤Y^⊤XB; the definitive eigenvectors w are then recovered as w = Θv, and the final step is the reconstruction of the optimal score matrix Θ using the vectors w as its columns. At this point we understand what the literature calls "updating the initial score matrix": multiplying the initial Θ by the eigenvector matrix V from decomposition (B.6) reverses the change of variable to restore the w vectors. The matrix B also needs to be "updated", by multiplying it by the same matrix of eigenvectors V, in order to account for the initial Θ used in the first computation of B:

B* = (X^⊤X + Ω)^{−1}X^⊤YΘV = BV.
B.2 Why the OS Problem is Solved as an Eigenvector Problem
In the optimal scoring literature, the score matrix Θ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix M = Y^⊤X(X^⊤X + Ω)^{−1}X^⊤Y.

By definition of the eigen-decomposition, the eigenvectors of the matrix M (called w in (B.7)) form a basis, so that any score vector θ can be expressed as a linear combination of them:

θ_k = ∑_{m=1}^{K−1} α_m w_m,  s.t.  θ_k^⊤θ_k = 1.    (B.8)
The score vectors' normalization constraint θ_k^⊤θ_k = 1 can also be expressed as a function of this basis,

(∑_{m=1}^{K−1} α_m w_m)^⊤ (∑_{m=1}^{K−1} α_m w_m) = 1,

which, by the orthonormality of the eigenvectors, reduces to

∑_{m=1}^{K−1} α_m² = 1.    (B.9)
Let M be multiplied by a score vector θ_k, which can be replaced by its linear combination of eigenvectors (B.8):

Mθ_k = M ∑_{m=1}^{K−1} α_m w_m = ∑_{m=1}^{K−1} α_m Mw_m.

As the w_m are the eigenvectors of the matrix M, the relationship Mw_m = λ_m w_m can be used to obtain

Mθ_k = ∑_{m=1}^{K−1} α_m λ_m w_m.
Multiplying on the left by θ_k^⊤, written as its own linear combination of eigenvectors,

θ_k^⊤Mθ_k = (∑_{ℓ=1}^{K−1} α_ℓ w_ℓ)^⊤ (∑_{m=1}^{K−1} α_m λ_m w_m).

This equation can be simplified using the orthogonality property of eigenvectors, according to which w_ℓ^⊤w_m = 0 for any ℓ ≠ m, giving

θ_k^⊤Mθ_k = ∑_{m=1}^{K−1} α_m² λ_m.
The optimization problem (B.5) for discriminant direction k can be rewritten as

max_{θ_k∈R^K}  θ_k^⊤Mθ_k = max_{θ_k∈R^K}  ∑_{m=1}^{K−1} α_m² λ_m    (B.10)

with  θ_k = ∑_{m=1}^{K−1} α_m w_m  and  ∑_{m=1}^{K−1} α_m² = 1.

One way of maximizing Problem (B.10) is choosing α_m = 1 for m = k and α_m = 0 otherwise. Hence, as θ_k = ∑_{m=1}^{K−1} α_m w_m, the resulting score vector θ_k will be equal to the k-th eigenvector w_k.

As a summary, it can be concluded that the solution of the original problem (B.1) can be achieved by an eigenvector decomposition of the matrix M = Y^⊤X(X^⊤X + Ω)^{−1}X^⊤Y.
C Solving Fisher's Discriminant Problem
The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data have maximal between-class variance under a unitary constraint on the within-class variance:

max_{β∈R^p}  β^⊤Σ_Bβ    (C.1a)
s.t.  β^⊤Σ_Wβ = 1,    (C.1b)

where Σ_B and Σ_W are respectively the between-class and the within-class variance matrices of the original p-dimensional data.
The Lagrangian of Problem (C.1) is

L(β, ν) = β^⊤Σ_Bβ − ν(β^⊤Σ_Wβ − 1),

so that its first derivative with respect to β is

∂L(β, ν)/∂β = 2Σ_Bβ − 2νΣ_Wβ.

A necessary optimality condition for β is that this derivative is zero, that is,

Σ_Bβ = νΣ_Wβ.

Provided Σ_W is full rank, we have

Σ_W^{−1}Σ_Bβ = νβ.    (C.2)

Thus the solutions β match the definition of an eigenvector of the matrix Σ_W^{−1}Σ_B with eigenvalue ν. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:

β^⊤Σ_Bβ = β^⊤Σ_WΣ_W^{−1}Σ_Bβ
        = νβ^⊤Σ_Wβ    from (C.2)
        = ν    from (C.1b).

That is, the optimal value of the objective function to be maximized is the eigenvalue ν itself. Hence ν is the largest eigenvalue of Σ_W^{−1}Σ_B, and β is any eigenvector corresponding to this maximal eigenvalue.
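The derivation above can be checked numerically: the leading eigenvector of Σ_W^{−1}Σ_B, rescaled to satisfy (C.1b), attains an objective value equal to its eigenvalue. A small NumPy sketch on two synthetic Gaussian classes (all names and data are of our own choosing):

```python
import numpy as np

rng = np.random.default_rng(1)
# two Gaussian classes in R^2, 100 points each
X0 = rng.standard_normal((100, 2))
X1 = rng.standard_normal((100, 2)) + np.array([3.0, 1.0])
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
mu = np.vstack([X0, X1]).mean(axis=0)
n = 200

# within- and between-class covariance matrices (Property 1)
Sw = ((X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)) / n
Sb = (100 * np.outer(mu0 - mu, mu0 - mu) + 100 * np.outer(mu1 - mu, mu1 - mu)) / n

# beta is an eigenvector of Sw^{-1} Sb for the largest eigenvalue (C.2)
vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
beta = vecs[:, np.argmax(vals.real)].real
beta /= np.sqrt(beta @ Sw @ beta)  # enforce the constraint (C.1b)

nu = beta @ Sb @ beta  # objective value: equals the top eigenvalue
```

With two classes, Σ_B has rank one, so a single discriminant direction carries all the separation, as stated for Simulations 2 and 3 in Chapter 10.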
D Alternative Variational Formulation for the Group-Lasso
In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:
min_{τ∈R^p} min_{B∈R^{p×(K−1)}}  J(B) + λ ∑_{j=1}^{p} w_j² ‖β^j‖₂² / τ_j    (D.1a)
s.t.  ∑_{j=1}^{p} τ_j = 1    (D.1b)
      τ_j ≥ 0,  j = 1, …, p.    (D.1c)
Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let B ∈ R^{p×(K−1)} be a matrix composed of row vectors β^j ∈ R^{K−1}: B = (β^{1⊤}, …, β^{p⊤})^⊤.
L(B, τ, λ, ν₀, ν) = J(B) + λ ∑_{j=1}^{p} w_j²‖β^j‖₂²/τ_j + ν₀(∑_{j=1}^{p} τ_j − 1) − ∑_{j=1}^{p} ν_jτ_j    (D.2)
The starting point is the Lagrangian (D.2), which is differentiated with respect to τ_j to get the optimal value τ_j*:

∂L(B, τ, λ, ν₀, ν)/∂τ_j |_{τ_j = τ_j*} = 0  ⇒  −λw_j²‖β^j‖₂²/τ_j*² + ν₀ − ν_j = 0
                                          ⇒  −λw_j²‖β^j‖₂² + ν₀τ_j*² − ν_jτ_j*² = 0
                                          ⇒  −λw_j²‖β^j‖₂² + ν₀τ_j*² = 0.

The last two expressions are related through the complementary slackness property of the Lagrange multipliers, which states that ν_j g_j(τ*) = 0, where ν_j is the Lagrange multiplier and g_j(τ) the inequality constraint. Then the optimal τ_j* can be deduced:

τ_j* = √(λ/ν₀) w_j ‖β^j‖₂.
Plugging this optimal value of τ_j into constraint (D.1b),

∑_{j=1}^{p} τ_j = 1  ⇒  τ_j* = w_j‖β^j‖₂ / ∑_{j′=1}^{p} w_{j′}‖β^{j′}‖₂.    (D.3)
With this value of τ_j*, Problem (D.1) is equivalent to

min_{B∈R^{p×(K−1)}}  J(B) + λ (∑_{j=1}^{p} w_j‖β^j‖₂)².    (D.4)
This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4). In particular, its solution is expected to be sparse, with some null vectors β^j.
The penalty term of (D.1a) can be conveniently presented as λB^⊤ΩB, where

Ω = diag(w₁²/τ₁, w₂²/τ₂, …, w_p²/τ_p).    (D.5)
Using the value of τ_j* from (D.3), each diagonal component of Ω is

(Ω)_jj = w_j ∑_{j′=1}^{p} w_{j′}‖β^{j′}‖₂ / ‖β^j‖₂.    (D.6)
In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.
D.1 Useful Properties
Lemma D.1. If J is convex, Problem (D.1) is convex.
In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.
Lemma D.2. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (D.4) is the set of matrices

V ∈ R^{p×(K−1)},  V = ∂J(B)/∂B + 2λ (∑_{j=1}^{p} w_j‖β^j‖₂) G,    (D.7)

where G = (g^{1⊤}, …, g^{p⊤})^⊤ is a p×(K−1) matrix defined as follows. Let S(B) denote the row support of B, S(B) = {j ∈ {1, …, p} : ‖β^j‖₂ ≠ 0}; then we have

∀j ∈ S(B),  g^j = w_j‖β^j‖₂^{−1}β^j    (D.8)
∀j ∉ S(B),  ‖g^j‖₂ ≤ w_j.    (D.9)
This condition results in an equality for the "active" non-zero vectors β^j and in an inequality for the other ones, which both provide essential building blocks of our algorithm.
Lemma D.3. Problem (D.4) admits at least one solution, which is unique if J(B) is strictly convex. All critical points B of the objective function verifying the following conditions are global minima. Let S(B) denote the row support of B, S(B) = {j ∈ {1, …, p} : ‖β^j‖₂ ≠ 0}, and let S̄(B) be its complement; then we have

∀j ∈ S(B),  −∂J(B)/∂β^j = 2λ (∑_{j′=1}^{p} w_{j′}‖β^{j′}‖₂) w_j‖β^j‖₂^{−1}β^j    (D.10a)
∀j ∈ S̄(B),  ‖∂J(B)/∂β^j‖₂ ≤ 2λ w_j ∑_{j′=1}^{p} w_{j′}‖β^{j′}‖₂.    (D.10b)
In particular, Lemma D.3 provides a well-defined appraisal of the support of the solution, which is not easily obtained from the direct analysis of the variational problem (D.1).
D.2 An Upper Bound on the Objective Function
Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and for a given B the gap between these objectives is null at τ* such that

τ_j* = w_j‖β^j‖₂ / ∑_{j′=1}^{p} w_{j′}‖β^{j′}‖₂.
Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let τ ∈ R^p be any feasible vector; we have

(∑_{j=1}^{p} w_j‖β^j‖₂)² = (∑_{j=1}^{p} τ_j^{1/2} · τ_j^{−1/2}w_j‖β^j‖₂)²
                         ≤ (∑_{j=1}^{p} τ_j)(∑_{j=1}^{p} w_j²‖β^j‖₂²/τ_j)
                         ≤ ∑_{j=1}^{p} w_j²‖β^j‖₂²/τ_j,

where we used the Cauchy–Schwarz inequality in the second line and the definition of the feasibility set of τ in the last one.
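The tightness claimed in Lemma D.4 is easy to check numerically. In this sketch the row norms ‖β^j‖₂ and the weights are arbitrary values chosen for illustration, and the convention 0²/0 = 0 is used for zeroed rows:

```python
# arbitrary row norms ||beta^j||_2 (one of them zero) and unit weights
norms = [2.0, 0.5, 0.0, 1.5]
w = [1.0, 1.0, 1.0, 1.0]

Z = sum(wj * bj for wj, bj in zip(w, norms))
tau = [wj * bj / Z for wj, bj in zip(w, norms)]  # optimal tau_j* from (D.3)

# squared group-Lasso penalty (D.4) versus variational penalty (D.1a) at tau*
lhs = Z ** 2
rhs = sum((wj * bj) ** 2 / tj for wj, bj, tj in zip(w, norms, tau) if tj > 0)
assert abs(lhs - rhs) < 1e-9
```

Any other feasible τ makes the right-hand side strictly larger, which is the Cauchy–Schwarz step of the proof.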
This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of τ and β are intertwined.
E Invariance of the Group-Lasso to Unitary Transformations
The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients B⁰ are optimal for the score values Θ⁰, and if the optimal scores Θ* are obtained by a unitary transformation of Θ⁰, say Θ* = Θ⁰V (where V ∈ R^{M×M} is a unitary matrix), then B* = B⁰V is optimal conditionally on Θ*; that is, (Θ*, B*) is a global solution of the corresponding optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.
Proposition E.1. Let B* be a solution of

min_{B∈R^{p×M}}  ‖Y − XB‖_F² + λ ∑_{j=1}^{p} w_j‖β^j‖₂,    (E.1)

and let Ỹ = YV, where V ∈ R^{M×M} is a unitary matrix. Then B̃ = B*V is a solution of

min_{B∈R^{p×M}}  ‖Ỹ − XB‖_F² + λ ∑_{j=1}^{p} w_j‖β^j‖₂.    (E.2)
Proof. The first-order necessary optimality conditions for B* are

∀j ∈ S(B*),  2x^{j⊤}(XB* − Y) + λw_j‖β*^j‖₂^{−1}β*^j = 0    (E.3a)
∀j ∉ S(B*),  2‖x^{j⊤}(XB* − Y)‖₂ ≤ λw_j,    (E.3b)

where S(B*) ⊆ {1, …, p} denotes the set of non-zero row vectors of B* and S̄(B*) is its complement.
First we note that, from the definition of B̃, we have S(B̃) = S(B*). Then we may rewrite the above conditions as follows:

∀j ∈ S(B̃),  2x^{j⊤}(XB̃ − Ỹ) + λw_j‖β̃^j‖₂^{−1}β̃^j = 0    (E.4a)
∀j ∉ S(B̃),  2‖x^{j⊤}(XB̃ − Ỹ)‖₂ ≤ λw_j,    (E.4b)

where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by V, using VV^⊤ = I, so that ∀u ∈ R^M, ‖u^⊤‖₂ = ‖u^⊤V‖₂. Equation (E.4b) is also
obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for B̃ to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
F Expected Complete Likelihood and Likelihood
Section 7.1.2 explains that, by maximizing the conditional expectation of the complete log-likelihood Q(θ, θ′) (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed from its definition (7.1), but there is a shorter way to compute it from Q(θ, θ′) when the latter is available.
L(θ) = ∑_{i=1}^{n} log(∑_{k=1}^{K} π_k f_k(x_i; θ_k))    (F.1)

Q(θ, θ′) = ∑_{i=1}^{n} ∑_{k=1}^{K} t_ik(θ′) log(π_k f_k(x_i; θ_k))    (F.2)

with  t_ik(θ′) = π′_k f_k(x_i; θ′_k) / ∑_ℓ π′_ℓ f_ℓ(x_i; θ′_ℓ).    (F.3)
In the EM algorithm, θ′ denotes the model parameters at the previous iteration, t_ik(θ′) are the posterior probabilities computed from θ′ at the previous E-step, and θ (without prime) denotes the parameters of the current iteration, obtained by maximizing Q(θ, θ′).
Using (F.3) we have

Q(θ, θ′) = ∑_{i,k} t_ik(θ′) log(π_k f_k(x_i; θ_k))
         = ∑_{i,k} t_ik(θ′) log(t_ik(θ)) + ∑_{i,k} t_ik(θ′) log(∑_ℓ π_ℓ f_ℓ(x_i; θ_ℓ))
         = ∑_{i,k} t_ik(θ′) log(t_ik(θ)) + L(θ).
In particular, after the evaluation of the t_ik in the E-step, where θ = θ′, the log-likelihood can be computed using the value of Q(θ, θ) (7.7) and the entropy of the posterior probabilities:

L(θ) = Q(θ, θ) − ∑_{i,k} t_ik(θ) log(t_ik(θ))
     = Q(θ, θ) + H(T).
G Derivation of the M-Step Equations
This appendix shows the whole process leading to expressions (7.10), (7.11) and (7.12), in the context of a Gaussian mixture model with a common covariance matrix. The criterion to maximize is

Q(θ, θ′) = ∑_{i,k} t_ik(θ′) log(π_k f_k(x_i; θ_k))
         = ∑_k log(π_k) ∑_i t_ik − (np/2) log(2π) − (n/2) log|Σ| − (1/2) ∑_{i,k} t_ik (x_i − μ_k)^⊤Σ^{−1}(x_i − μ_k),

subject to ∑_k π_k = 1.
The Lagrangian of this problem is

L(θ) = Q(θ, θ′) + λ(∑_k π_k − 1).
Partial derivatives of the Lagrangian are set to zero to obtain the optimal values of π_k, μ_k and Σ.
G.1 Prior probabilities

∂L(θ)/∂π_k = 0  ⇔  (1/π_k) ∑_i t_ik + λ = 0,

where λ is identified from the constraint (λ = −n), leading to

π_k = (1/n) ∑_i t_ik.
G.2 Means

∂L(θ)/∂μ_k = 0  ⇔  −(1/2) ∑_i t_ik · 2Σ^{−1}(μ_k − x_i) = 0
            ⇒  μ_k = ∑_i t_ik x_i / ∑_i t_ik.
G.3 Covariance Matrix

∂L(θ)/∂Σ^{−1} = 0  ⇔  (n/2)Σ − (1/2) ∑_{i,k} t_ik (x_i − μ_k)(x_i − μ_k)^⊤ = 0,

where the first term follows from Property 4 and the second from Property 5, so that

Σ = (1/n) ∑_{i,k} t_ik (x_i − μ_k)(x_i − μ_k)^⊤.
Bibliography
F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. Optimization for Machine Learning, pages 19–54, 2011.

F. R. Bach. Bolasso: model consistent lasso estimation through the bootstrap. In Proceedings of the 25th International Conference on Machine Learning, ICML, 2008.

F. R. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

J. D. Banfield and A. E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, pages 803–821, 1993.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

H. Bensmail and G. Celeux. Regularized Gaussian discriminant analysis through eigenvalue decomposition. Journal of the American Statistical Association, 91(436):1743–1748, 1996.

P. J. Bickel and E. Levina. Some theory for Fisher's linear discriminant function, "naive Bayes", and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989–1010, 2004.

C. Bienarcki, G. Celeux, G. Govaert, and F. Langrognet. MIXMOD Statistical Documentation. http://www.mixmod.org, 2008.

C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.

C. Bouveyron and C. Brunet. Discriminative variable selection for clustering with the sparse Fisher-EM algorithm. Technical Report 1204.2067, ArXiv e-prints, 2012a.

C. Bouveyron and C. Brunet. Simultaneous model-based clustering and visualization in the Fisher discriminative subspace. Statistics and Computing, 22(1):301–324, 2012b.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37(4):373–384, 1995.

L. Breiman and R. Ihaka. Nonlinear discriminant analysis via ACE and scaling. Technical Report 40, University of California, Berkeley, 1984.
T. Cai and W. Liu. A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106(496):1566–1577, 2011.

S. Canu and Y. Grandvalet. Outcomes of the equivalence of adaptive ridge with least absolute shrinkage. Advances in Neural Information Processing Systems, page 445, 1999.

C. Caramanis, S. Mannor, and H. Xu. Robust optimization in machine learning. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 369–402. MIT Press, 2012.

B. Chidlovskii and L. Lecerf. Scalable feature selection for multi-class problems. In W. Daelemans, B. Goethals, and K. Morik, editors, Machine Learning and Knowledge Discovery in Databases, volume 5211 of Lecture Notes in Computer Science, pages 227–240. Springer, 2008.

L. Clemmensen, T. Hastie, D. Witten, and B. Ersbøll. Sparse discriminant analysis. Technometrics, 53(4):406–413, 2011.

C. De Mol, E. De Vito, and L. Rosasco. Elastic-net regularization in learning theory. Journal of Complexity, 25(2):201–230, 2009.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977.

D. L. Donoho, M. Elad, and V. N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6–18, 2006.

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, 2000.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

J. Fan and Y. Fan. High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6):2605, 2008.

R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Human Genetics, 7(2):179–188, 1936.

V. Franc and S. Sonnenburg. Optimized cutting plane algorithm for support vector machines. In Proceedings of the 25th International Conference on Machine Learning, pages 320–327. ACM, 2008.

J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.
J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. Technical Report 1001.0736, ArXiv e-prints, 2010.

J. H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165–175, 1989.

W. J. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.

A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 2003.

D. Ghosh and A. M. Chinnaiyan. Classification and selection of biomarkers in genomic data using lasso. Journal of Biomedicine and Biotechnology, 2:147–154, 2005.

G. Govaert, Y. Grandvalet, X. Liu, and L. F. Sanchez Merchante. Implementation baseline for clustering. Technical Report D71-m12, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D71-m12.pdf, 2010.

G. Govaert, Y. Grandvalet, B. Laval, X. Liu, and L. F. Sanchez Merchante. Implementations of original clustering. Technical Report D72-m24, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D72-m24.pdf, 2011.

Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalization. In Perspectives in Neural Computing, volume 98, pages 201–206, 1998.

Y. Grandvalet and S. Canu. Adaptive scaling for feature selection in SVMs. Advances in Neural Information Processing Systems, 15:553–560, 2002.

L. Grosenick, S. Greer, and B. Knutson. Interpretable classifiers for fMRI improve prediction of purchases. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 16(6):539–548, 2008.

Y. Guermeur, G. Pollastri, A. Elisseeff, D. Zelus, H. Paugam-Moisy, and P. Baldi. Combining protein secondary structure prediction models with ensemble methods of optimal complexity. Neurocomputing, 56:305–327, 2004.

J. Guo, E. Levina, G. Michailidis, and J. Zhu. Pairwise variable selection for high-dimensional model-based clustering. Biometrics, 66(3):793–804, 2010.

I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.

T. Hastie and R. Tibshirani. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):155–176, 1996.

T. Hastie, R. Tibshirani, and A. Buja. Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association, 89(428):1255–1270, 1994.
T. Hastie, A. Buja, and R. Tibshirani. Penalized discriminant analysis. The Annals of Statistics, 23(1):73–102, 1995.

A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

J. Huang, S. Ma, H. Xie, and C. H. Zhang. A group bridge approach for variable selection. Biometrika, 96(2):339–355, 2009.

T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217–226. ACM, 2006.

K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics, 28(5):1356–1378, 2000.

P. F. Kuan, S. Wang, X. Zhou, and H. Chu. A statistical framework for Illumina DNA methylation arrays. Bioinformatics, 26(22):2849–2855, 2010.

T. Lange, M. Braun, V. Roth, and J. Buhmann. Stability-based model selection. Advances in Neural Information Processing Systems, 15:617–624, 2002.

M. H. C. Law, M. A. T. Figueiredo, and A. K. Jain. Simultaneous feature selection and clustering using mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1154–1166, 2004.

Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines. Journal of the American Statistical Association, 99(465):67–81, 2004.

C. Leng. Sparse optimal scoring for multiclass cancer diagnosis and biomarker detection using microarray data. Computational Biology and Chemistry, 32(6):417–425, 2008.

C. Leng, Y. Lin, and G. Wahba. A note on the lasso and related procedures in model selection. Statistica Sinica, 16(4):1273, 2006.

H. Liu and L. Yu. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4):491–502, 2005.

J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.

Q. Mai, H. Zou, and M. Yuan. A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99(1):29–42, 2012.

C. Maugis, G. Celeux, and M. L. Martin-Magniette. Variable selection for clustering with Gaussian mixture models. Biometrics, 65(3):701–709, 2009a.
C. Maugis, G. Celeux, and M. L. Martin-Magniette. SelvarClust: software for variable selection in model-based clustering. http://www.math.univ-toulouse.fr/~maugis/SelvarClustHomepage.html, 2009b.

L. Meier, S. Van De Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 70(1):53–71, 2008.

N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.

B. Moghaddam, Y. Weiss, and S. Avidan. Generalized spectral bounds for sparse LDA. In Proceedings of the 23rd International Conference on Machine Learning, pages 641–648. ACM, 2006.

B. Moghaddam, Y. Weiss, and S. Avidan. Fast pixel/part selection with sparse eigenvectors. In IEEE 11th International Conference on Computer Vision (ICCV 2007), pages 1–8, 2007.

Y. Nesterov. Gradient methods for minimizing composite functions. Preprint, 2007.

S. Newcomb. A generalized theory of the combination of observations so as to obtain the best result. American Journal of Mathematics, 8(4):343–366, 1886.

B. Ng and R. Abugharbieh. Generalized group sparse classifiers with application in fMRI brain decoding. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1065–1071. IEEE, 2011.

M. R. Osborne, B. Presnell, and B. A. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319–337, 2000a.

M. R. Osborne, B. Presnell, and B. A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403, 2000b.

W. Pan and X. Shen. Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research, 8:1145–1164, 2007.

W. Pan, X. Shen, A. Jiang, and R. P. Hebbel. Semi-supervised learning via penalized mixture model with application to microarray sample classification. Bioinformatics, 22(19):2388–2395, 2006.

K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London, 185:71–110, 1894.

S. Perkins, K. Lacker, and J. Theiler. Grafting: fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3:1333–1356, 2003.
Z. Qiao, L. Zhou, and J. Huang. Sparse linear discriminant analysis with applications to high dimensional low sample size data. International Journal of Applied Mathematics, 39(1), 2009.

A. E. Raftery and N. Dean. Variable selection for model-based clustering. Journal of the American Statistical Association, 101(473):168–178, 2006.

C. R. Rao. The utilization of multiple measurements in problems of biological classification. Journal of the Royal Statistical Society, Series B (Methodological), 10(2):159–203, 1948.

S. Rosset and J. Zhu. Piecewise linear regularized solution paths. The Annals of Statistics, 35(3):1012–1030, 2007.

V. Roth. The generalized lasso. IEEE Transactions on Neural Networks, 15(1):16–28, 2004.

V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), volume 307 of ACM International Conference Proceeding Series, pages 848–855, 2008.

V. Roth and T. Lange. Feature selection in clustering problems. In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 473–480. MIT Press, 2004.

C. Sammut and G. I. Webb. Encyclopedia of Machine Learning. Springer-Verlag New York, Inc., 2010.

L. F. Sanchez Merchante, Y. Grandvalet, and G. Govaert. An efficient approach to sparse linear discriminant analysis. In Proceedings of the 29th International Conference on Machine Learning, ICML, 2012.

G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

A. J. Smola, S. V. N. Vishwanathan, and Q. Le. Bundle methods for machine learning. Advances in Neural Information Processing Systems, 20:1377–1384, 2008.

S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.

P. Sprechmann, I. Ramirez, G. Sapiro, and Y. Eldar. Collaborative hierarchical sparse modeling. In Information Sciences and Systems (CISS), 2010 44th Annual Conference on, pages 1–6. IEEE, 2010.

M. Szafranski. Pénalités Hiérarchiques pour l'Intégration de Connaissances dans les Modèles Statistiques. PhD thesis, Université de Technologie de Compiègne, 2008.
M. Szafranski, Y. Grandvalet, and P. Morizet-Mahoudeaux. Hierarchical penalization. Advances in Neural Information Processing Systems, 2008.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.

J. E. Vogt and V. Roth. The group-lasso: L1 regularization versus L1,2 regularization. In Pattern Recognition, 32nd DAGM Symposium, Lecture Notes in Computer Science, 2010.

S. Wang and J. Zhu. Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics, 64(2):440–448, 2008.

D. Witten and R. Tibshirani. Penalized classification using Fisher's linear discriminant. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 73(5):753–772, 2011.

D. M. Witten and R. Tibshirani. A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490):713–726, 2010.

D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.

M. Wu and B. Schölkopf. A local learning approach for clustering. Advances in Neural Information Processing Systems, 19:1529, 2007.

M. C. Wu, L. Zhang, Z. Wang, D. C. Christiani, and X. Lin. Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection. Bioinformatics, 25(9):1145–1151, 2009.

T. T. Wu and K. Lange. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, pages 224–244, 2008.

B. Xie, W. Pan, and X. Shen. Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. Electronic Journal of Statistics, 2:168–172, 2008a.

B. Xie, W. Pan, and X. Shen. Variable selection in penalized model-based clustering via regularization on grouped parameters. Biometrics, 64(3):921–930, 2008b.

C. Yang, X. Wan, Q. Yang, H. Xue, and W. Yu. Identifying main effects and epistatic interactions from large-scale SNP data via adaptive group lasso. BMC Bioinformatics, 11(Suppl 1):S18, 2010.

J. Ye. Least squares linear discriminant analysis. In Proceedings of the 24th International Conference on Machine Learning, pages 1087–1093. ACM, 2007.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 68(1):49–67, 2006.

P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7(2):2541, 2007.

P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37(6A):3468–3497, 2009.

H. Zhou, W. Pan, and X. Shen. Penalized model-based clustering with unconstrained covariance matrices. Electronic Journal of Statistics, 3:1473–1496, 2009.

H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67(2):301–320, 2005.
Acknowledgements
If this thesis has fallen into your hands and you have the curiosity to read this paragraph, you must know that even though it is a short section, there are quite a lot of people behind this volume. All of them supported me during the three years, three months and three weeks that it took me to finish this work. However, you will hardly find any names. I think it is a little sad writing people's names in a document that they will probably not see and that will be condemned to gather dust on a bookshelf. It is like losing a wallet with pictures of your beloved family and friends. It makes me feel something like melancholy.
Obviously, this does not mean that I have nothing to be grateful for. I always felt unconditional love and support from my family, and I never felt homesick since my Spanish friends did the best they could to visit me frequently. During my time in Compiègne I met wonderful people that are now friends for life. I am sure that all these people do not need to be listed in this section to know how much I love them. I thank them every time we see each other by giving them the best of myself.
I enjoyed my time in Compiègne. It was an exciting adventure and I do not regret a single thing. I am sure that I will miss these days, but this does not make me sad because, as the Beatles sang in "The End" or Jorge Drexler in "Todo se transforma", the amount that you miss people is equal to the love you gave them and received from them.
The only names I am including are my supervisors', Yves Grandvalet and Gérard Govaert. I do not think it is possible to have had better teaching and supervision, and I am sure that the reason I finished this work was not only their technical advice but also their close support, humanity and patience.
Contents

List of Figures v
List of Tables vii
Notation and Symbols ix

I Context and Foundations 1
1 Context 5
2 Regularization for Feature Selection 9
  2.1 Motivations 9
  2.2 Categorization of Feature Selection Techniques 11
  2.3 Regularization 13
    2.3.1 Important Properties 14
    2.3.2 Pure Penalties 14
    2.3.3 Hybrid Penalties 18
    2.3.4 Mixed Penalties 19
    2.3.5 Sparsity Considerations 19
    2.3.6 Optimization Tools for Regularized Problems 21

II Sparse Linear Discriminant Analysis 25
Abstract 27
3 Feature Selection in Fisher Discriminant Analysis 29
  3.1 Fisher Discriminant Analysis 29
  3.2 Feature Selection in LDA Problems 30
    3.2.1 Inertia Based 30
    3.2.2 Regression Based 32
4 Formalizing the Objective 35
  4.1 From Optimal Scoring to Linear Discriminant Analysis 35
    4.1.1 Penalized Optimal Scoring Problem 36
    4.1.2 Penalized Canonical Correlation Analysis 37
    4.1.3 Penalized Linear Discriminant Analysis 39
    4.1.4 Summary 40
  4.2 Practicalities 41
    4.2.1 Solution of the Penalized Optimal Scoring Regression 41
    4.2.2 Distance Evaluation 42
    4.2.3 Posterior Probability Evaluation 43
    4.2.4 Graphical Representation 43
  4.3 From Sparse Optimal Scoring to Sparse LDA 43
    4.3.1 A Quadratic Variational Form 44
    4.3.2 Group-Lasso OS as Penalized LDA 47
5 GLOSS Algorithm 49
  5.1 Regression Coefficients Updates 49
    5.1.1 Cholesky decomposition 52
    5.1.2 Numerical Stability 52
  5.2 Score Matrix 52
  5.3 Optimality Conditions 53
  5.4 Active and Inactive Sets 54
  5.5 Penalty Parameter 54
  5.6 Options and Variants 55
    5.6.1 Scaling Variables 55
    5.6.2 Sparse Variant 55
    5.6.3 Diagonal Variant 55
    5.6.4 Elastic net and Structured Variant 55
6 Experimental Results 57
  6.1 Normalization 57
  6.2 Decision Thresholds 57
  6.3 Simulated Data 58
  6.4 Gene Expression Data 60
  6.5 Correlated Data 63
Discussion 63

III Sparse Clustering Analysis 67
Abstract 69
7 Feature Selection in Mixture Models 71
  7.1 Mixture Models 71
    7.1.1 Model 71
    7.1.2 Parameter Estimation: The EM Algorithm 72
  7.2 Feature Selection in Model-Based Clustering 75
    7.2.1 Based on Penalized Likelihood 76
    7.2.2 Based on Model Variants 77
    7.2.3 Based on Model Selection 79
8 Theoretical Foundations 81
  8.1 Resolving EM with Optimal Scoring 81
    8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis 81
    8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis 82
    8.1.3 Clustering Using Penalized Optimal Scoring 82
    8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis 83
  8.2 Optimized Criterion 83
    8.2.1 A Bayesian Derivation 84
    8.2.2 Maximum a Posteriori Estimator 85
9 Mix-GLOSS Algorithm 87
  9.1 Mix-GLOSS 87
    9.1.1 Outer Loop: Whole Algorithm Repetitions 87
    9.1.2 Penalty Parameter Loop 88
    9.1.3 Inner Loop: EM Algorithm 89
  9.2 Model Selection 91
10 Experimental Results 93
  10.1 Tested Clustering Algorithms 93
  10.2 Results 95
  10.3 Discussion 97
Conclusions 97

Appendix 103
A Matrix Properties 105
B The Penalized-OS Problem is an Eigenvector Problem 107
  B.1 How to Solve the Eigenvector Decomposition 107
  B.2 Why the OS Problem is Solved as an Eigenvector Problem 109
C Solving Fisher's Discriminant Problem 111
D Alternative Variational Formulation for the Group-Lasso 113
  D.1 Useful Properties 114
  D.2 An Upper Bound on the Objective Function 115
E Invariance of the Group-Lasso to Unitary Transformations 117
F Expected Complete Likelihood and Likelihood 119
G Derivation of the M-Step Equations 121
  G.1 Prior probabilities 121
  G.2 Means 122
  G.3 Covariance Matrix 122
Bibliography 123
List of Figures

1.1 MASH project logo 5
2.1 Example of relevant features 10
2.2 Four key steps of feature selection 11
2.3 Admissible sets in two dimensions for different pure norms ||β||p 14
2.4 Two dimensional regularized problems with ||β||1 and ||β||2 penalties 15
2.5 Admissible sets for the Lasso and Group-Lasso 20
2.6 Sparsity patterns for an example with 8 variables characterized by 4 parameters 20
4.1 Graphical representation of the variational approach to Group-Lasso 45
5.1 GLOSS block diagram 50
5.2 Graph and Laplacian matrix for a 3×3 image 56
6.1 TPR versus FPR for all simulations 60
6.2 2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA 62
6.3 USPS digits "1" and "0" 63
6.4 Discriminant direction between digits "1" and "0" 64
6.5 Sparse discriminant direction between digits "1" and "0" 64
9.1 Mix-GLOSS Loops Scheme 88
9.2 Mix-GLOSS model selection diagram 92
10.1 Class mean vectors for each artificial simulation 94
10.2 TPR versus FPR for all simulations 97
List of Tables

6.1 Experimental results for simulated data, supervised classification 59
6.2 Average TPR and FPR for all simulations 60
6.3 Experimental results for gene expression data, supervised classification 61
10.1 Experimental results for simulated data, unsupervised clustering 96
10.2 Average TPR versus FPR for all clustering simulations 96
Notation and Symbols

Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.

Sets
N : the set of natural numbers, N = {1, 2, ...}
R : the set of reals
|A| : cardinality of a set A (for finite sets, the number of elements)
Ā : complement of set A

Data
X : input domain
xᵢ : input sample, xᵢ ∈ X
X : design matrix, X = (x₁ᵀ, ..., xₙᵀ)ᵀ
xʲ : column j of X
yᵢ : class indicator of sample i
Y : indicator matrix, Y = (y₁ᵀ, ..., yₙᵀ)ᵀ
z : complete data, z = (x, y)
Gₖ : set of the indices of observations belonging to class k
n : number of examples
K : number of classes
p : dimension of X
i, j, k : indices running over N

Vectors, Matrices and Norms
0 : vector with all entries equal to zero
1 : vector with all entries equal to one
I : identity matrix
Aᵀ : transpose of matrix A (ditto for vectors)
A⁻¹ : inverse of matrix A
tr(A) : trace of matrix A
|A| : determinant of matrix A
diag(v) : diagonal matrix with v on the diagonal
‖v‖₁ : L1 norm of vector v
‖v‖₂ : L2 norm of vector v
‖A‖F : Frobenius norm of matrix A

Probability
E[·] : expectation of a random variable
var[·] : variance of a random variable
N(μ, σ²) : normal distribution with mean μ and variance σ²
W(W, ν) : Wishart distribution with ν degrees of freedom and scale matrix W
H(X) : entropy of random variable X
I(X, Y) : mutual information between random variables X and Y

Mixture Models
yᵢₖ : hard membership of sample i to cluster k
fₖ : distribution function for cluster k
tᵢₖ : posterior probability of sample i belonging to cluster k
T : posterior probability matrix
πₖ : prior probability or mixture proportion for cluster k
μₖ : mean vector of cluster k
Σₖ : covariance matrix of cluster k
θₖ : parameter vector for cluster k, θₖ = (μₖ, Σₖ)
θ⁽ᵗ⁾ : parameter vector at iteration t of the EM algorithm
f(X; θ) : likelihood function
L(θ; X) : log-likelihood function
LC(θ; X, Y) : complete log-likelihood function

Optimization
J(·) : cost function
L(·) : Lagrangian
β̂ : generic notation for the solution with respect to β
β̂ˡˢ : least squares solution coefficient vector
A : active set
γ : step size to update the regularization path
h : direction to update the regularization path

Penalized models
λ, λ₁, λ₂ : penalty parameters
Pλ(θ) : penalty term over a generic parameter vector
βₖⱼ : coefficient j of discriminant vector k
βₖ : kth discriminant vector, βₖ = (βₖ₁, ..., βₖₚ)
B : matrix of discriminant vectors, B = (β₁, ..., β_{K−1})
βʲ : jth row of B = (β¹ᵀ, ..., βᵖᵀ)ᵀ
B_LDA : coefficient matrix in the LDA domain
B_CCA : coefficient matrix in the CCA domain
B_OS : coefficient matrix in the OS domain
X_LDA : data matrix in the LDA domain
X_CCA : data matrix in the CCA domain
X_OS : data matrix in the OS domain
θₖ : score vector k
Θ : score matrix, Θ = (θ₁, ..., θ_{K−1})
Y : label matrix
Ω : penalty matrix
LCP(θ; X, Z) : penalized complete log-likelihood function
Σ_B : between-class covariance matrix
Σ_W : within-class covariance matrix
Σ_T : total covariance matrix
Σ̂_B : sample between-class covariance matrix
Σ̂_W : sample within-class covariance matrix
Σ̂_T : sample total covariance matrix
Λ : inverse of covariance matrix, or precision matrix
wⱼ : weights
τⱼ : penalty components of the variational approach
Part I
Context and Foundations
This thesis is divided in three parts. In Part I, I introduce the context in which this work has been developed, the project that funded it and the constraints that we had to obey. Generic concepts are also detailed here to introduce the models and some basic notions that will be used along this document. The state of the art is also reviewed.

The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments to test its performance compared to other state-of-the-art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.

The second contribution is described in Part III, with a structure analogous to that of Part II but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM algorithm. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.
1 Context
The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki-documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.
The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiègne.
From the point of view of the research, the members of the consortium must deal with four main goals:
1. Software development of the website framework and APIs.

2. Classification and goal-planning in high dimensional feature spaces.

3. Interfacing the platform with the 3D virtual environment and the robot arm.

4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments.
Figure 1.1: MASH project logo
The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role has been to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community. The last number I was aware of was about 300. Within those 375 extractors, there must be some of them sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code with some datasets of reference in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors for a particular dataset does not mean that both are using the same variables.
Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we would also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user that develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of against the whole database. As another example, imagine a user wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.
As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.
• Clustering Using Mixture Models. This is a well-known technique that models the data as if it were randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm, and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, a C++ library that can be interfaced with Matlab. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D7.1-m12" (Govaert et al., 2010).
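As a concrete illustration of this pipeline, the sketch below fits a two-component Gaussian mixture by EM and assigns clusters by maximum a posteriori. Since mixmod itself is a C++/Matlab library, scikit-learn's GaussianMixture is used here as a stand-in, on synthetic data; all names are ours.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated synthetic groups standing in for extractor profiles
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 4)),
               rng.normal(5.0, 1.0, size=(50, 4))])

# EM estimation of the mixture parameters (proportions, means, covariances)
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)

# Maximum a posteriori assignment: each sample joins its most probable component
labels = gmm.predict_proba(X).argmax(axis=1)
```

The number of components is fixed in advance to the number of expected groups, exactly as described above.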
• Sparse Clustering Using Penalized Optimal Scoring. This technique again intends to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis.
All details concerning the tool implemented can be found in deliverable "mash-deliverable-D7.2-m24" (Govaert et al., 2011).
• Table Clustering Using The RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractor space is defined using the RV coefficient, which is a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(Oi, Oj), where Oi and Oj are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors. A detailed description of this technique can be found in deliverables "mash-deliverable-D7.1-m12" (Govaert et al., 2010) and "mash-deliverable-D7.2-m24" (Govaert et al., 2011).
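The RV similarity itself is easy to compute from column-centred tables; the sketch below is a minimal illustration (function and variable names are ours, and the exact way a distance is derived from the similarity, e.g. 1 − RV, is left implicit here).

```python
import numpy as np

def rv_coefficient(Xi, Xj):
    """RV coefficient between two column-centred tables sharing the same rows.

    Oi = Xi Xi^T is the n x n inner-product operator of table i; the RV
    coefficient is a cosine between such operators, a multivariate
    generalisation of the squared Pearson correlation.
    """
    Ci = Xi - Xi.mean(axis=0)
    Cj = Xj - Xj.mean(axis=0)
    Oi, Oj = Ci @ Ci.T, Cj @ Cj.T
    return np.trace(Oi @ Oj) / np.sqrt(np.trace(Oi @ Oi) * np.trace(Oj @ Oj))

rng = np.random.default_rng(1)
A = rng.normal(size=(30, 5))
print(round(rv_coefficient(A, A), 6))   # a table compared with itself gives 1.0
```

By construction the coefficient always lies in [0, 1], which makes it usable as a similarity between tables of different widths.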
I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to fulfil our commitments. I will simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).
2 Regularization for Feature Selection
With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, so did the numerical issues. Redundant or extremely correlated features may appear if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined. Many algorithms in the field of machine learning make use of this statistic.
2.1 Motivations
There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than a "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).
As a rule of thumb, in discriminant and clustering problems, the complexity of the calculations increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in the supervised environment, and provides easier interpretations in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.
When talking about dimensionality reduction, there are two families of techniques that could induce confusion:
• Reduction by feature transformation summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis and Independent Component Analysis are two popular examples.
• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction on the number of variables to preserve and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008)
As a basic rule, we can use reduction techniques by feature transformation when the majority of the features are relevant and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. The paper of Chidlovskii and Lecerf (2008) gives a great explanation of the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text:
"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.
Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats but not distinguishing between dogs and cats. Thus the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."

Figure 2.2: The four key steps of feature selection, according to Liu and Yu (2005)
There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.
2.2 Categorization of Feature Selection Techniques
Feature selection is one of the most frequent preprocessing techniques used to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.
I am reproducing here the scheme that generalizes any feature selection process, as it is shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps of a feature selection algorithm.
The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a checklist that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews for characterizing feature selection techniques. I am proposing a framework inspired by these references that does not cover all the possibilities, but which gives a good summary of the existing ones.
• Depending on the type of integration with the machine learning algorithm, we have:
– Filter Models - Filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without the assistance of the mining algorithm.
– Wrapper Models - Wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block, while the feature subset evaluation is done in a different one. Therefore, the criterion to optimize and the criterion to evaluate may differ. Those algorithms are computationally expensive.
– Embedded Models - They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation form a single block, and the features are selected to optimize this unique criterion, without needing to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal because they are specific to the training process of a given mining algorithm.
• Depending on the feature searching technique:

– Complete - No subsets are missed from evaluation. This involves combinatorial searches.

– Sequential - Features are added (forward searches) or removed (backward searches) one at a time.

– Random - The initial subset, or even subsequent subsets, are randomly chosen to escape local optima.

• Depending on the evaluation technique:

– Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.

– Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

– Dependency Measures - Measuring the correlation between features.

– Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features does.

– Predictive Accuracy - Using the selected features to predict the labels.

– Cluster Goodness - Using the selected features to perform clustering and evaluating the result (cluster compactness, scatter separability, maximum likelihood).
The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness make it possible to evaluate subsets of features, and can be used in wrapper and embedded models.
In this thesis, we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieving this goal, thus avoiding many problems arising in filter or wrapper methods. In practice, it is however intractable to solve hard selection problems exactly when the number of features exceeds a few tens. Regularization techniques provide a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.
2.3 Regularization
In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge into the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues in ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.
An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. An example is trying to infer some generic laws when the number of samples is smaller than their dimensionality, that is, from a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced into the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:
\[ \min_{\beta}\; J(\beta) + \lambda P(\beta) \tag{2.1} \]
\[ \min_{\beta}\; J(\beta) \quad \text{s.t.} \quad P(\beta) \le t \tag{2.2} \]
In expressions (2.1) and (2.2), the parameters λ and t have a similar function, that is, to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set for which the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. This penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.
In this section, I review the pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.
Figure 2.3: Admissible sets in two dimensions for different pure norms ‖β‖_p
2.3.1 Important Properties
Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.
Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies
\[ \forall (x_1, x_2) \in \mathcal{X}^2, \quad f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2) \tag{2.3} \]
for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.
Sparsity. Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and fewer computational resources.
Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means of favoring the stability of the solution.
2.3.2 Pure Penalties
For pure penalties, defined as P(β) = ‖β‖_p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since the convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Figure 2.4: Two-dimensional regularized problems with ‖β‖₁ and ‖β‖₂ penalties
Regularizing a linear model with a norm ‖β‖_p means that the larger the component |β_j|, the more important the feature x_j in the estimation. On the contrary, the closer it is to zero, the more dispensable it is. In the limit |β_j| = 0, x_j is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.
A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered sparse if one of its components (β₁ or β₂) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2), where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function while staying inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, such as the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than that of an L2 penalty. This idea is displayed in Figure 2.4, where J(β) is a quadratic function represented by three isolevel curves whose global minimum β^ls is outside the penalties' admissible regions. The closest point to this β^ls for the L1 regularization is β^l1, and for the L2 regularization it is β^l2. Solution β^l1 is sparse because its second component is zero, while both components of β^l2 are different from zero.
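This geometric contrast can be observed numerically: on data where only two of ten features carry signal, an L1-penalized fit zeroes out most coefficients, while an L2-penalized fit merely shrinks them. The sketch below is illustrative only (scikit-learn estimators, arbitrary penalty weight):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
# Only the first two features carry signal
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

beta_l1 = Lasso(alpha=0.5).fit(X, y).coef_   # L1 admissible set: has vertexes
beta_l2 = Ridge(alpha=0.5).fit(X, y).coef_   # L2 admissible set: a smooth ball

print(np.sum(beta_l1 == 0.0))   # several coefficients exactly zero
print(np.sum(beta_l2 == 0.0))   # typically none: ridge shrinks but keeps all
```

The exact zeros of the L1 solution are the algebraic counterpart of the optimum landing on a vertex of the admissible set.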
After reviewing the regions from Figure 2.3, we can relate the capacity to generate sparse solutions to the number and the "sharpness" of the vertexes of the greyed out area. For example, an L_{1/3} penalty has an admissible region with sharper vertexes that would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L_{1/3} ball results in difficulties during optimization that do not occur with a convex shape.
To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with L_p norms with p ≤ 1, due to the fact that they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex; hence the only pure penalty that yields a convex problem with a sparse solution is the L1 penalty.
L0 Penalties. The L0 pseudo-norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ‖β‖₀ = card{β_j | β_j ≠ 0}:
\[ \min_{\beta}\; J(\beta) \quad \text{s.t.} \quad \|\beta\|_0 \le t \tag{2.4} \]
where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ, if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. Their solutions are sparse but unstable.
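A brute-force sketch makes the combinatorial nature explicit: solving the L0-constrained least squares problem exactly means scanning every feature subset of size at most t (only 22 subsets here for p = 6 and t = 2, but the count grows exponentially with p; all names and data are ours):

```python
import itertools
import numpy as np

def best_subset(X, y, t):
    """Exact L0-constrained least squares: try every subset of at most t features."""
    n, p = X.shape
    best_rss, best_support = np.inf, ()
    for k in range(t + 1):
        for support in itertools.combinations(range(p), k):
            if support:
                Xs = X[:, support]
                b, *_ = np.linalg.lstsq(Xs, y, rcond=None)
                rss = np.sum((y - Xs @ b) ** 2)
            else:
                rss = np.sum(y ** 2)   # empty model
            if rss < best_rss:
                best_rss, best_support = rss, support
    return best_support, best_rss

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
y = 2.0 * X[:, 1] - 1.5 * X[:, 4] + rng.normal(scale=0.1, size=40)
support, _ = best_subset(X, y, t=2)
print(support)   # the two truly informative features
```

This exhaustive search is exact but clearly intractable beyond a few tens of features, which motivates the convex relaxations below.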
L1 Penalties. Penalties built using the L1 norm induce sparsity and stability. This penalty has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):
\[ \min_{\beta}\; J(\beta) \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le t \tag{2.5} \]
Despite all the advantages of the Lasso, the choice of the right penalty is not just a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, then the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.
The Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012b,a).
The consistency of the problems regularized by a Lasso penalty is also a key feature. Defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large, Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions under which Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Buhlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).
L2 Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used, to avoid the square root and to solve a linear system. Thus an L2 penalized optimization problem looks like
\[ \min_{\beta}\; J(\beta) + \lambda \|\beta\|_2^2 \tag{2.6} \]
The effect of this penalty is the "equalization" of the components of the parameter being penalized. To illustrate this property, let us consider a least squares problem
\[ \min_{\beta}\; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 \tag{2.7} \]
with solution β^ls = (X⊤X)⁻¹X⊤y. If some input variables are highly correlated, the estimator β^ls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:
\[ \min_{\beta}\; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \]
The solution to this problem is β^l2 = (X⊤X + λI_p)⁻¹X⊤y. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performance.
As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:
\[ \min_{\beta}\; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{\beta_j^2}{(\beta_j^{\mathrm{ls}})^2} \tag{2.8} \]
The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002), where the penalty parameter differs for each component: every λ_j is optimized to penalize more or less depending on the influence of β_j in the model.
Although L2 penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.
L∞ Penalties. A special case of L_p norms is the infinity norm, defined as ‖x‖∞ = max(|x₁|, |x₂|, ..., |x_p|). The admissible region for a penalty like ‖β‖∞ ≤ t is displayed in Figure 2.3. For the L∞ norm, the greyed out region is a square containing all the β vectors whose largest coefficient is less than or equal to the value of the penalty parameter t.
This norm is not commonly used as a regularization term itself; however, it frequently appears combined in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing the optimality conditions of sparse regularized problems. The dual norm ‖β‖* of a norm ‖β‖ is defined as
\[ \|\beta\|^{*} = \max_{w \in \mathbb{R}^p} \beta^\top w \quad \text{s.t.} \quad \|w\| \le 1 \]

In the case of an L_q norm with q ∈ [1, +∞], the dual norm is the L_r norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual, and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why L∞ is so important even if it is not so popular as a penalty itself, because L1 is. An extensive explanation of dual norms and the algorithms that make use of them can be found in Bach et al. (2011).
2.3.3 Hybrid Penalties
There is no reason to use pure penalties in isolation. We can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), which aims at improving the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is
\[ \min_{\beta}\; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \tag{2.9} \]
The term in λ₁ is a Lasso penalty that induces sparsity in vector β; on the other side, the term in λ₂ is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotic capability (when n goes to infinity) of always making the right choice of relevant variables.
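The grouping effect of the ridge term can be seen on a toy case with two identical columns: the strictly convex Elastic net spreads the weight evenly over both copies, while the plain Lasso is free to keep only one of them. This is an illustrative sketch with scikit-learn; the penalty values are arbitrary choices of ours.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n = 100
z = rng.normal(size=n)
X = np.column_stack([z, z, rng.normal(size=(n, 3))])   # columns 0 and 1 identical
y = 2.0 * z + rng.normal(scale=0.1, size=n)

lasso = Lasso(alpha=0.1).fit(X, y).coef_
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_

# The ridge part ties the duplicated features: both get (nearly) equal weight,
# whereas the Lasso solution may concentrate the weight on a single copy
print(enet[:2], lasso[:2])
```

With exactly correlated variables the Lasso optimum is not even unique; the λ₂ term restores strict convexity and hence a unique, symmetric solution.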
2.3.4 Mixed Penalties
Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us denote by G_ℓ the group of genes of the ℓ-th process and by d_ℓ the number of genes (variables) in each group, ∀ℓ ∈ {1, ..., L}. Thus, the dimension of vector β will be the sum of the numbers of genes of all the groups, dim(β) = Σ_{ℓ=1}^L d_ℓ. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

\[ \|\beta\|_{(r,s)} = \Bigg( \sum_{\ell} \Big( \sum_{j \in G_\ell} |\beta_j|^{s} \Big)^{r/s} \Bigg)^{1/r} \tag{2.10} \]
The pair (r, s) identifies the norms that are combined: an L_s norm within groups and an L_r norm between groups. The L_s norm penalizes the variables in every group G_ℓ, while the L_r norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
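Expression (2.10) is straightforward to evaluate; the helper below (our own naming, unweighted groups) computes ‖β‖_(r,s) for an arbitrary group structure and recovers the group-Lasso norm for (r, s) = (1, 2):

```python
import numpy as np

def mixed_norm(beta, groups, r, s):
    """|beta|_(r,s): an L_s norm inside each group, an L_r norm across groups."""
    inner = [np.sum(np.abs(beta[g]) ** s) ** (1.0 / s) for g in groups]
    return np.sum(np.array(inner) ** r) ** (1.0 / r)

beta = np.array([3.0, 4.0, 0.0, 0.0, 5.0])
groups = [[0, 1], [2, 3], [4]]

# group-Lasso norm (r, s) = (1, 2): sum of per-group Euclidean norms
print(mixed_norm(beta, groups, r=1, s=2))   # 5 + 0 + 5 = 10.0
```

Setting (r, s) = (2, 2) recovers the plain L2 norm of β, which shows how mixed norms interpolate between ignoring and exploiting the group structure.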
Several combinations are available; the most popular is the norm ‖β‖_(1,2), known as the group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L_(1,2) norm. Many other mixings are possible, such as ‖β‖_(1,4/3) (Szafranski et al., 2008) or ‖β‖_(1,∞) (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms, such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).
2.3.5 Sparsity Considerations
In this chapter, I have reviewed several possibilities for inducing sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to feature-wise parsimonious models. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for the non-informative variables.
The Lasso and the other L1 penalties encourage solutions such as the one on the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and if the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.
To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown on the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L_(1,2) or L_(1,∞) mixed norms, with the proper definition of groups, can induce sparsity patterns such as the one on the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.

(a) L1: Lasso (b) L_(1,2): group-Lasso

Figure 2.5: Admissible sets for the Lasso and group-Lasso

(a) L1-induced sparsity (b) L_(1,2) group-induced sparsity

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters
2.3.6 Optimization Tools for Regularized Problems
In Caramanis et al. (2012) there is a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques classified into four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.
In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an algorithm of "active constraints", implemented following a regularization path that is updated by approaching the cost function with secant hyperplanes. Deeper details are given in the dedicated Chapter 5.
Subgradient Descent. Subgradient descent is a generic optimization method that can be used in the settings of penalized problems where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure. On the other hand, many iterations are needed, so the convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β^(t+1) is updated proportionally to the negative subgradient of the function at the current point β^(t):
\[ \beta^{(t+1)} = \beta^{(t)} - \alpha (s + \lambda s') \quad \text{where } s \in \partial J(\beta^{(t)}),\; s' \in \partial P(\beta^{(t)}) \]
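A minimal numpy sketch of this update for a Lasso-type problem (mean squared loss plus λ‖β‖₁); the data, step size and λ are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 200, 5, 0.5
X = rng.normal(size=(n, p))
y = X @ np.array([4.0, -3.0, 0.0, 0.0, 0.0]) + rng.normal(size=n)

beta = np.zeros(p)
step = 0.01
for _ in range(5000):
    s = -2.0 * X.T @ (y - X @ beta) / n    # gradient of the mean squared loss
    s_prime = np.sign(beta)                # a subgradient of the L1 penalty
    beta = beta - step * (s + lam * s_prime)

# The informative coefficients dominate; the rest hover near (but rarely at) zero
print(np.round(beta, 2))
```

Note the behavior announced above: with a fixed step size, the iterates oscillate around the optimum and the small coefficients are never exactly zero, which is why subgradient descent does not yield sparse solutions.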
Coordinate Descent. Coordinate descent is based on the first order optimality conditions of criterion (2.1). In the case of penalties like the Lasso, setting to zero the first order derivative with respect to coefficient β_j gives
\[ \beta_j = \frac{-\lambda\,\mathrm{sign}(\beta_j) - \partial J(\beta)/\partial \beta_j}{2 \sum_{i=1}^{n} x_{ij}^2} \]
In the literature, those algorithms may also be referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution $\beta^{ls}$ and updating their values using an iterative thresholding algorithm where $\beta_j^{(t+1)} = S_\lambda\!\left(\partial J(\beta^{(t)})/\partial\beta_j\right)$. The objective function is optimized with respect to one variable at a time, while all others are kept fixed:
$$S_\lambda\!\left(\frac{\partial J(\beta)}{\partial\beta_j}\right) =
\begin{cases}
\dfrac{\lambda - \partial J(\beta)/\partial\beta_j}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \partial J(\beta)/\partial\beta_j > \lambda\\[2ex]
\dfrac{-\lambda - \partial J(\beta)/\partial\beta_j}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \partial J(\beta)/\partial\beta_j < -\lambda\\[2ex]
0 & \text{if } \left|\partial J(\beta)/\partial\beta_j\right| \le \lambda
\end{cases} \tag{2.11}$$
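A minimal runnable version of this scheme, assuming the squared loss J(β) = ‖y − Xβ‖² (my choice for the example; the three thresholding cases mirror Eq. (2.11), and the function names are mine):

```python
import numpy as np

def soft_threshold(g, lam, denom):
    """The three cases of Eq. (2.11), applied to g = dJ/db_j."""
    if g > lam:
        return (lam - g) / denom
    if g < -lam:
        return (-lam - g) / denom
    return 0.0

def shooting_lasso(X, y, lam, n_sweeps=100):
    """Coordinate descent in the spirit of Fu (1998): start from the
    least squares solution, then cycle over coordinates, keeping all
    other variables fixed at each update."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]      # least squares init
    col_sq = (X ** 2).sum(axis=0)                    # sum_i x_ij^2 per column
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # residual without variable j
            g = -2 * X[:, j] @ r_j                   # dJ/db_j evaluated at b_j = 0
            beta[j] = soft_threshold(g, lam, 2 * col_sq[j])
    return beta
```

Unlike the subgradient iterates, the coordinates below the threshold are set exactly to zero, so the returned vector is sparse.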
The same principles define "block-coordinate descent" algorithms; in this case, the first-order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin 2006, Wu and Lange 2008).
Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", which stores the indices of the variables with non-zero β_j; it is usually identified as the set A. The complement of the active set is the "inactive set", noted Ā, which contains the indices of the variables whose β_j is zero. Thus, the problem can be simplified to the dimensionality of A.
Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There also exists a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy that starts with an empty A has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are intended to be selected.
Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the objective function descent direction, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions; their expressions are essential for selecting the next variable to add to the active set, and for testing whether a particular vector β is a solution of Problem (2.1).
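The three tasks can be sketched compactly for the Lasso-penalized squared loss (an instantiation of mine, not the authors' algorithm: the inner solver here is plain coordinate descent on the active set, whereas Osborne et al. use a linear approximation):

```python
import numpy as np

def lasso_active_set(X, y, lam, tol=1e-6):
    """Working-set sketch for min ||y - Xb||^2 + lam * ||b||_1:
    (i) optimality check on the inactive set, (ii) growth of A by the
    most-violating variable, (iii) optimization restricted to A."""
    n, p = X.shape
    beta = np.zeros(p)
    active = []                              # the active set A, forward growing
    for _ in range(p):
        grad = -2 * X.T @ (y - X @ beta)     # full gradient of the loss
        viol = np.abs(grad) - lam            # violation of |grad_j| <= lam
        viol[active] = -np.inf               # only inactive candidates
        j = int(np.argmax(viol))
        if viol[j] <= tol:                   # optimality conditions hold: stop
            break
        active.append(j)
        for _ in range(200):                 # optimize on A (warm-started)
            for k in active:
                r_k = y - X @ beta + X[:, k] * beta[k]
                g = X[:, k] @ r_k
                beta[k] = np.sign(g) * max(abs(g) - lam / 2, 0) / (X[:, k] @ X[:, k])
    return beta, active
```

Note that the inactive coordinates are never touched, so they stay exactly at zero, and the first iterations only involve low-dimensional subproblems, as stated above.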
These active constraints or working set methods, even if they were originally proposed to solve L1-regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and L1 penalties (Roth 2004), linear functions and L1/L2 penalties (Roth and Fischer 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al. 2003). The algorithm developed in this work belongs to this family of solutions.
Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built from several secant hyper-planes at different points, obtained from the subgradient of the cost function at these points.
This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.
This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims 2006, Smola et al. 2008, Franc and Sonnenburg 2008) or Multiple Kernel Learning (Sonnenburg et al. 2006).
Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1), where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).
This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the Least Angle Regression (LARS) algorithm developed by Efron et al. (2004) that those techniques became popular. LARS defines the regularization path using active constraint techniques.
Once an active set A^(t) and its corresponding solution β^(t) have been set, looking for the regularization path means looking for a direction h and a step size γ to update the solution as β^(t+1) = β^(t) + γh. Afterwards, the active and inactive sets A^(t+1) and Ā^(t+1) are updated; that can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size, and decides which variable should enter the active set, from the correlation with the residuals.
Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz-differentiable cost function J(β) and a non-differentiable penalty λP(β):

$$\min_{\beta\in\mathbb{R}^p}\; J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top\!\left(\beta-\beta^{(t)}\right) + \lambda P(\beta) + \frac{L}{2}\left\|\beta-\beta^{(t)}\right\|_2^2 \tag{2.12}$$
They are also iterative methods, where the cost function J(β) is linearized in the proximity of the solution β^(t), so that the problem to solve at each iteration looks like (2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. That can be rewritten as
$$\min_{\beta\in\mathbb{R}^p}\; \frac{1}{2}\left\|\beta - \left(\beta^{(t)} - \frac{1}{L}\nabla J(\beta^{(t)})\right)\right\|_2^2 + \frac{\lambda}{L}\, P(\beta) \tag{2.13}$$
The basic algorithm makes use of the solution to (2.13) as the next value β^(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle 2009). Proximal methods can be seen as generalizations of gradient updates: in fact, setting λ = 0 in equation (2.13) recovers the standard gradient update rule.
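The basic iteration can be written in a few lines for the squared loss and the L1 penalty, whose prox is the soft-thresholding operator (this instantiation is mine; the text describes the generic scheme):

```python
import numpy as np

def ista_lasso(X, y, lam, n_iter=500):
    """Basic proximal iteration (ISTA) for J(b) = ||y - Xb||^2 and the
    L1 penalty: a gradient step of size 1/L, then the prox of
    (lam/L) * ||.||_1, i.e. solving (2.13) at each step."""
    L = 2 * np.linalg.norm(X, 2) ** 2        # Lipschitz constant of grad J
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = beta + (2 / L) * X.T @ (y - X @ beta)                 # gradient step
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return beta
```

With lam = 0 the second line leaves z untouched, recovering the plain gradient update, as noted above.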
Part II
Sparse Linear Discriminant Analysis
Abstract
Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.
There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Section 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models with respect to variables.
In this part, we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach to LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis, and also enables the derivation of variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of the data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification. The first two or three directions can also be used to project the data, so as to generate a graphical display. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performance. The algorithm efficiently processes medium to large numbers of variables, and is thus particularly well suited to the analysis of gene expression data.
3 Feature Selection in Fisher Discriminant Analysis
3.1 Fisher Discriminant Analysis
Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates the classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part, we focus on Fisher's discriminant analysis, a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities, but rather on some inertia principles (Fisher 1936).
We consider that the data consist of a set of n examples, with observations $x_i \in \mathbb{R}^p$ comprising p features, and labels $y_i \in \{0,1\}^K$ indicating the exclusive assignment of observation $x_i$ to one of the K classes. It will be convenient to gather the observations in the $n\times p$ matrix $X = (x_1^\top, \dots, x_n^\top)^\top$ and the corresponding labels in the $n\times K$ matrix $Y = (y_1^\top, \dots, y_n^\top)^\top$.
Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:
$$\max_{\beta\in\mathbb{R}^p}\; \frac{\beta^\top\Sigma_B\beta}{\beta^\top\Sigma_W\beta} \tag{3.1}$$
where β is the discriminant direction used to project the data, and Σ_B and Σ_W are the p×p between-class and within-class covariance matrices respectively, defined (for a K-class problem) as
$$\Sigma_W = \frac{1}{n}\sum_{k=1}^{K}\sum_{i\in G_k}(x_i-\mu_k)(x_i-\mu_k)^\top, \qquad \Sigma_B = \frac{1}{n}\sum_{k=1}^{K}\sum_{i\in G_k}(\mu-\mu_k)(\mu-\mu_k)^\top$$

where μ is the sample mean of the whole dataset, μ_k is the sample mean of class k, and G_k indexes the observations of class k.
This analysis can be extended to the multi-class framework with K groups; in this case, K−1 discriminant vectors β_k may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

$$\max_{B\in\mathbb{R}^{p\times(K-1)}}\; \frac{\operatorname{tr}\!\left(B^\top\Sigma_B B\right)}{\operatorname{tr}\!\left(B^\top\Sigma_W B\right)} \tag{3.2}$$
where the matrix B is built with the discriminant directions β_k as columns.
Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K−1 subproblems:

$$\begin{aligned}
\max_{\beta_k\in\mathbb{R}^p}\;\; & \beta_k^\top\Sigma_B\beta_k\\
\text{s.t. } & \beta_k^\top\Sigma_W\beta_k \le 1\\
& \beta_k^\top\Sigma_W\beta_\ell = 0, \quad \forall\,\ell < k
\end{aligned} \tag{3.3}$$
The maximizer of subproblem k is the eigenvector of $\Sigma_W^{-1}\Sigma_B$ associated with the kth largest eigenvalue (see Appendix C).
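This eigen-characterization can be sketched in a few lines of numpy; the function name is mine, and Σ_W is assumed invertible (a sketch, not the thesis implementation):

```python
import numpy as np

def fisher_directions(X, labels, K):
    """Discriminant directions of (3.3) as the leading eigenvectors of
    inv(Sigma_W) @ Sigma_B, assuming Sigma_W is invertible."""
    n, p = X.shape
    mu = X.mean(axis=0)
    Sw = np.zeros((p, p))
    Sb = np.zeros((p, p))
    for k in range(K):
        Xk = X[labels == k]
        mk = Xk.mean(axis=0)
        Sw += (Xk - mk).T @ (Xk - mk) / n                # within-class part
        Sb += len(Xk) * np.outer(mk - mu, mk - mu) / n   # between-class part
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1]
    return evecs.real[:, order[:K - 1]]   # the K-1 discriminant vectors
```

Note that Σ_B has rank at most K−1, which is why only K−1 meaningful directions exist.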
3.2 Feature Selection in LDA Problems
LDA is often used as a data reduction technique, where the K−1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.
Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. The main target of this sparsity is to reduce the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness of the solution, or computational constraints.
The easiest approach to sparse LDA performs variable selection before discrimination. The relevancy of each feature is usually based on univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.
They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, that is, either Fisher's discriminant analysis (variance-based) or a regression-based formulation.
3.2.1 Inertia Based
The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away from each other (large between-class variance), and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.
Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on the Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search, with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al. 2007).
Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where the Fisher's discriminant (3.1) is solved as

$$\begin{aligned}
\min_{\beta\in\mathbb{R}^p}\;\; & \beta^\top\Sigma_W\beta\\
\text{s.t. } & (\mu_1-\mu_2)^\top\beta = 1\\
& \textstyle\sum_{j=1}^{p}|\beta_j| \le t
\end{aligned}$$

where μ_1 and μ_2 are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.
Witten and Tibshirani (2011) describe a multi-class technique using the Fisher's discriminant, rewritten in the form of K−1 constrained and penalized maximization problems:

$$\begin{aligned}
\max_{\beta_k\in\mathbb{R}^p}\;\; & \beta_k^\top\Sigma_B^k\beta_k - P_k(\beta_k)\\
\text{s.t. } & \beta_k^\top\Sigma_W\beta_k \le 1
\end{aligned}$$

The term to maximize is the projected between-class covariance $\beta_k^\top\Sigma_B^k\beta_k$, subject to an upper bound on the projected within-class covariance $\beta_k^\top\Sigma_W\beta_k$. The penalty P_k(β_k) is added to avoid singularities and induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general purpose data: the Lasso shrinks less informative variables to zero, and the fused Lasso encourages a piecewise constant β_k vector. The R code is available from the website of Daniela Witten.
Cai and Liu (2011) use the Fisher's discriminant to solve a binary LDA problem. But instead of performing separate estimations of Σ_W and (μ_1 − μ_2) to obtain the optimal solution $\beta = \Sigma_W^{-1}(\mu_1-\mu_2)$, they estimate the product directly, through constrained L1 minimization:

$$\min_{\beta\in\mathbb{R}^p}\; \|\beta\|_1 \quad \text{s.t. } \left\|\Sigma\beta - (\mu_1-\mu_2)\right\|_\infty \le \lambda$$

Sparsity is encouraged by the L1 norm of the vector β, and the parameter λ is used to tune the optimization.
Most of the algorithms reviewed are conceived for binary classification. For those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as we discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.
3.2.2 Regression Based
In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging in the multi-class case (Duda et al. 2000, Friedman et al. 2009).
Predefined Indicator Matrix
Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al. 2009). An indicator matrix Y is an n×K matrix with the class labels for all samples. There are several well-known types in the literature. For example, the binary or dummy indicator ($y_{ik} = 1$ if sample i belongs to class k, and $y_{ik} = 0$ otherwise) is commonly used in linking multi-class classification with linear regression (Friedman et al. 2009). Another "popular" choice is $y_{ik} = 1$ if sample i belongs to class k, and $y_{ik} = -1/(K-1)$ otherwise; it was used, for example, in extending Support Vector Machines to multi-class classification (Lee et al. 2004), or for generalizing the kernel target alignment measure (Guermeur et al. 2004).
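Both indicator matrices mentioned above can be built in a few lines (function names are mine):

```python
import numpy as np

def dummy_indicator(labels, K):
    """Binary indicator: Y[i, k] = 1 iff sample i belongs to class k."""
    n = len(labels)
    Y = np.zeros((n, K))
    Y[np.arange(n), labels] = 1.0
    return Y

def symmetric_indicator(labels, K):
    """The second choice above: 1 for the true class, -1/(K-1) elsewhere."""
    Y = dummy_indicator(labels, K)
    return np.where(Y == 1.0, 1.0, -1.0 / (K - 1))
```

Note that each row of the symmetric indicator sums to zero, which is precisely what makes it convenient for sum-to-zero multi-class formulations.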
Some works propose a formulation of the least squares problem based on a new class indicator matrix (Ye 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition, shown empirically to hold in many applications involving high-dimensional data.
Qiao et al. (2009) propose a discriminant analysis in the high-dimensional, low-sample setting, which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression; some rather clumsy steps in the developments hinder the comparison, so that further investigations are required. The lack of publicly available code also restrained an empirical test of this conjecture. If the similitude is confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.
In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is
obtained by solving

$$\min_{\beta\in\mathbb{R}^p,\,\beta_0\in\mathbb{R}}\; n^{-1}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^\top\beta\right)^2 + \lambda\sum_{j=1}^{p}|\beta_j|$$

where y_i is the binary label indicator of pattern x_i. Even if the authors focus on the Lasso penalty, they also suggest that any other generic sparsity-inducing penalty could be used. The decision rule $x^\top\beta + \beta_0 > 0$ is the LDA classifier when it is built using the β vector resulting for λ = 0, but a different intercept β_0 is required.
Optimal Scoring
In binary classification, the regression of (scaled) class indicators enables one to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al. 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification, and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani 1996) or more conservative ones (Hastie et al. 1995).
As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as
$$\min_{\Theta,\,B}\; \|Y\Theta - XB\|_F^2 + \lambda\operatorname{tr}\!\left(B^\top\Omega B\right) \tag{3.4a}$$
$$\text{s.t. } n^{-1}\,\Theta^\top Y^\top Y\Theta = I_{K-1} \tag{3.4b}$$
where $\Theta\in\mathbb{R}^{K\times(K-1)}$ are the class scores, $B\in\mathbb{R}^{p\times(K-1)}$ are the regression coefficients, and $\|\cdot\|_F$ is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of K−1 problems:
$$\begin{aligned}
\min_{\theta_k\in\mathbb{R}^K,\,\beta_k\in\mathbb{R}^p}\;\; & \|Y\theta_k - X\beta_k\|^2 + \beta_k^\top\Omega\beta_k && (3.5\text{a})\\
\text{s.t. } & n^{-1}\,\theta_k^\top Y^\top Y\theta_k = 1 && (3.5\text{b})\\
& \theta_k^\top Y^\top Y\theta_\ell = 0, \quad \ell = 1,\dots,k-1 && (3.5\text{c})
\end{aligned}$$
where each $\beta_k$ corresponds to a discriminant direction.
Several sparse LDA methods have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan 2005, Leng 2008, Grosenick et al. 2008, Clemmensen et al. 2011). Grosenick et al. (2008) proposed a variant of the Lasso-based penalized OS of Ghosh and Chinnaiyan (2005), by introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by
$$\min_{\beta_k\in\mathbb{R}^p,\,\theta_k\in\mathbb{R}^K}\; \sum_k \left(\|Y\theta_k - X\beta_k\|_2^2 + \lambda_1\|\beta_k\|_1 + \lambda_2\,\beta_k^\top\Omega\beta_k\right)$$
where λ1 and λ2 are regularization parameters, and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.
Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):
$$\min_{\beta_k\in\mathbb{R}^p,\,\theta_k\in\mathbb{R}^K}\; \sum_{k=1}^{K-1}\|Y\theta_k - X\beta_k\|_2^2 + \lambda\sum_{j=1}^{p}\sqrt{\sum_{k=1}^{K-1}\beta_{kj}^2} \tag{3.6}$$
which is the criterion that was chosen in this thesis.
The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm closely followed the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA and propose publicly available, efficient code for solving this problem.
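The group-Lasso penalty of (3.6) is simply the sum, over variables, of the L2 norms of the rows of the coefficient matrix B; a tiny sketch (function name mine) makes its variable-removing behavior concrete:

```python
import numpy as np

def group_lasso_penalty(B):
    """The penalty of Eq. (3.6): sum over variables j of the L2 norm of
    row j of B (the coefficients of variable j in all directions)."""
    return np.sqrt((B ** 2).sum(axis=1)).sum()
```

For the same coefficient values, the penalty is smaller when they are concentrated on few rows: for instance, it is sqrt(2) for [[1, 1], [0, 0]] but 2 for [[1, 0], [0, 1]]. This is why minimizing it removes the same variables in all discriminant directions, unlike a plain Lasso on all entries.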
4 Formalizing the Objective
In this chapter, we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis, and also enables the derivation of variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina 2004).
The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), that selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of data. For K classes, this representation can be either complete, in dimension K−1, or partial, for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.
The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here, since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka 1984, Hastie et al. 1994, Hastie and Tibshirani 1996, Hastie et al. 1995), and were already used before for sparsity-inducing penalties (Roth and Lange 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by group-Lasso and penalized LDA (Sanchez Merchante et al. 2012).
4.1 From Optimal Scoring to Linear Discriminant Analysis
Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) and in penalized LDA (p-LDA), by going through canonical correlation analysis. We first provide some properties about the solutions of an arbitrary problem in the p-OS series (3.5).
Throughout this chapter, we assume that:

- there is no empty class, that is, the diagonal matrix $Y^\top Y$ is full rank;
- inputs are centered, that is, $X^\top 1_n = 0$;
- the quadratic penalty Ω is positive semidefinite and such that $X^\top X + \Omega$ is full rank.
4.1.1 Penalized Optimal Scoring Problem
For the sake of simplicity, we now drop the subscript k to refer to any problem in the p-OS series (3.5). First, note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value, and vice-versa. The problems are, however, non-convex: in particular, if (θ, β) is a solution, then (−θ, −β) is also a solution.
The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K−1 first optimal scores are orthogonal to 1_n (and the Kth problem would be solved by β_K = 0). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we do not mention these orthogonality constraints (3.5c) anymore, even though they apply along the route, so as to simplify all expressions. The generic problem solved is thus
$$\min_{\theta\in\mathbb{R}^K,\,\beta\in\mathbb{R}^p}\; \|Y\theta - X\beta\|^2 + \beta^\top\Omega\beta \tag{4.1a}$$
$$\text{s.t. } n^{-1}\,\theta^\top Y^\top Y\theta = 1 \tag{4.1b}$$
For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator:
$$\beta_{os} = \left(X^\top X + \Omega\right)^{-1} X^\top Y\theta \tag{4.2}$$
The objective function (4.1a) is then

$$\begin{aligned}
\|Y\theta - X\beta_{os}\|^2 + \beta_{os}^\top\Omega\beta_{os}
&= \theta^\top Y^\top Y\theta - 2\,\theta^\top Y^\top X\beta_{os} + \beta_{os}^\top\left(X^\top X + \Omega\right)\beta_{os}\\
&= \theta^\top Y^\top Y\theta - \theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta
\end{aligned}$$
where the second line stems from the definition (4.2) of β_os. Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to
$$\max_{\theta:\; n^{-1}\theta^\top Y^\top Y\theta = 1}\; \theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta \tag{4.3}$$
which shows that the optimization of the p-OS problem with respect to θ_k boils down to finding the kth largest eigenvector of $Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y$. Indeed, Appendix C details that Problem (4.3) is solved by

$$\left(Y^\top Y\right)^{-1} Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta = \alpha^2\theta \tag{4.4}$$
where α² is the maximal eigenvalue:¹

$$\begin{aligned}
n^{-1}\,\theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta &= \alpha^2\, n^{-1}\theta^\top\left(Y^\top Y\right)\theta\\
n^{-1}\,\theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta &= \alpha^2
\end{aligned} \tag{4.5}$$
4.1.2 Penalized Canonical Correlation Analysis
As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables X and Y is defined as follows:

$$\begin{aligned}
\max_{\theta\in\mathbb{R}^K,\,\beta\in\mathbb{R}^p}\;\; & n^{-1}\,\theta^\top Y^\top X\beta && (4.6\text{a})\\
\text{s.t. } & n^{-1}\,\theta^\top Y^\top Y\theta = 1 && (4.6\text{b})\\
& n^{-1}\,\beta^\top\left(X^\top X + \Omega\right)\beta = 1 && (4.6\text{c})
\end{aligned}$$
The solutions to (4.6) are obtained by finding the saddle points of the Lagrangian:

$$nL(\beta,\theta,\nu,\gamma) = \theta^\top Y^\top X\beta - \nu\left(\theta^\top Y^\top Y\theta - n\right) - \gamma\left(\beta^\top\left(X^\top X + \Omega\right)\beta - n\right)$$
$$\Rightarrow\; n\,\frac{\partial L(\beta,\theta,\gamma,\nu)}{\partial\beta} = X^\top Y\theta - 2\gamma\left(X^\top X + \Omega\right)\beta$$
$$\Rightarrow\; \beta_{cca} = \frac{1}{2\gamma}\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta$$
Then, as β_cca obeys (4.6c), we obtain

$$\beta_{cca} = \frac{\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta}{\sqrt{n^{-1}\,\theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta}} \tag{4.7}$$
so that the optimal objective function (4.6a) can be expressed with θ alone:

$$n^{-1}\,\theta^\top Y^\top X\beta_{cca} = \frac{n^{-1}\,\theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta}{\sqrt{n^{-1}\,\theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta}} = \sqrt{n^{-1}\,\theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta}$$
and the optimization problem with respect to θ can be restated as

$$\max_{\theta:\; n^{-1}\theta^\top Y^\top Y\theta = 1}\; \theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta \tag{4.8}$$
Hence, the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

$$\beta_{os} = \alpha\,\beta_{cca} \tag{4.9}$$
¹ The awkward notation α² for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5), for example).
where α is defined by (4.5).
The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:

$$n\,\frac{\partial L(\beta,\theta,\gamma,\nu)}{\partial\theta} = Y^\top X\beta - 2\nu\, Y^\top Y\theta$$
$$\Rightarrow\; \theta_{cca} = \frac{1}{2\nu}\left(Y^\top Y\right)^{-1} Y^\top X\beta \tag{4.10}$$
Then, as θ_cca obeys (4.6b), we obtain

$$\theta_{cca} = \frac{\left(Y^\top Y\right)^{-1} Y^\top X\beta}{\sqrt{n^{-1}\,\beta^\top X^\top Y\left(Y^\top Y\right)^{-1} Y^\top X\beta}} \tag{4.11}$$
leading to the following expression of the optimal objective function:

$$n^{-1}\,\theta_{cca}^\top Y^\top X\beta = \frac{n^{-1}\,\beta^\top X^\top Y\left(Y^\top Y\right)^{-1} Y^\top X\beta}{\sqrt{n^{-1}\,\beta^\top X^\top Y\left(Y^\top Y\right)^{-1} Y^\top X\beta}} = \sqrt{n^{-1}\,\beta^\top X^\top Y\left(Y^\top Y\right)^{-1} Y^\top X\beta}$$
The p-CCA problem can thus be solved with respect to β by plugging this value in (4.6):

$$\begin{aligned}
\max_{\beta\in\mathbb{R}^p}\;\; & n^{-1}\,\beta^\top X^\top Y\left(Y^\top Y\right)^{-1} Y^\top X\beta && (4.12\text{a})\\
\text{s.t. } & n^{-1}\,\beta^\top\left(X^\top X + \Omega\right)\beta = 1 && (4.12\text{b})
\end{aligned}$$
where the positive objective function has been squared compared to (4.6). This formulation is important, since it will be used to link p-CCA to p-LDA. We thus derive its solution: following the reasoning of Appendix C, β_cca verifies

$$n^{-1}\, X^\top Y\left(Y^\top Y\right)^{-1} Y^\top X\beta_{cca} = \lambda\left(X^\top X + \Omega\right)\beta_{cca} \tag{4.13}$$
where λ is the maximal eigenvalue, shown below to be equal to α²:

$$\begin{aligned}
& n^{-1}\,\beta_{cca}^\top X^\top Y\left(Y^\top Y\right)^{-1} Y^\top X\beta_{cca} = \lambda\\
\Rightarrow\;& n^{-1}\alpha^{-1}\,\beta_{cca}^\top X^\top Y\left(Y^\top Y\right)^{-1} Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta = \lambda\\
\Rightarrow\;& n^{-1}\alpha\,\beta_{cca}^\top X^\top Y\theta = \lambda\\
\Rightarrow\;& n^{-1}\,\theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta = \lambda\\
\Rightarrow\;& \alpha^2 = \lambda
\end{aligned}$$

The first line is obtained by obeying constraint (4.12b); the second line follows from the relationship (4.7), whose denominator is α; the third line comes from (4.4); the fourth line uses again the relationship (4.7); and the last one, the definition (4.5) of α.
4.1.3 Penalized Linear Discriminant Analysis
Still following Hastie et al. (1995), the penalized Linear Discriminant Analysis problem is defined as follows:

$$\begin{aligned}
\max_{\beta\in\mathbb{R}^p}\;\; & \beta^\top\Sigma_B\beta && (4.14\text{a})\\
\text{s.t. } & \beta^\top\left(\Sigma_W + n^{-1}\Omega\right)\beta = 1 && (4.14\text{b})
\end{aligned}$$
where Σ_B and Σ_W are respectively the sample between-class and within-class covariances of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.
As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form, amenable to a simple matrix representation using the projection operator $Y\left(Y^\top Y\right)^{-1}Y^\top$:
$$\begin{aligned}
\Sigma_T &= \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top = n^{-1}\, X^\top X\\
\Sigma_B &= \frac{1}{n}\sum_{k=1}^{K} n_k\,\mu_k\mu_k^\top = n^{-1}\, X^\top Y\left(Y^\top Y\right)^{-1} Y^\top X\\
\Sigma_W &= \frac{1}{n}\sum_{k=1}^{K}\sum_{i:\,y_{ik}=1}\left(x_i - \mu_k\right)\left(x_i - \mu_k\right)^\top = n^{-1}\left(X^\top X - X^\top Y\left(Y^\top Y\right)^{-1} Y^\top X\right)
\end{aligned}$$
Using these formulae, the solution to the p-LDA problem (4.14) is obtained as

$$X^\top Y\left(Y^\top Y\right)^{-1} Y^\top X\beta_{lda} = \lambda\left(X^\top X + \Omega - X^\top Y\left(Y^\top Y\right)^{-1} Y^\top X\right)\beta_{lda}$$
$$X^\top Y\left(Y^\top Y\right)^{-1} Y^\top X\beta_{lda} = \frac{\lambda}{1-\lambda}\left(X^\top X + \Omega\right)\beta_{lda}$$
The comparison of the last equation with the characterization (4.13) of β_cca shows that β_lda and β_cca are proportional, and that λ/(1−λ) = α². Using constraints (4.12b) and (4.14b), it comes that

$$\beta_{lda} = \left(1-\alpha^2\right)^{-1/2}\beta_{cca} = \alpha^{-1}\left(1-\alpha^2\right)^{-1/2}\beta_{os}$$

which ends the path from p-OS to p-LDA.
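The chain of identities can also be checked numerically; the following script is a sanity check under the stated assumptions (centered X, a dummy indicator Y, and the choice Ω = I, which is mine), not part of the original derivation:

```python
import numpy as np

# Numerical sanity check of beta_os = alpha * beta_cca, Eq. (4.9).
rng = np.random.default_rng(0)
n, p, K = 60, 5, 3
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                       # centered inputs, as assumed
labels = np.arange(n) % K                 # balanced classes, none empty
Y = np.eye(K)[labels]
Omega = np.eye(p)                         # illustrative quadratic penalty

# Leading eigenvector of (Y'Y)^-1 Y'X (X'X + Omega)^-1 X'Y, Eq. (4.4).
M = np.linalg.solve(Y.T @ Y, Y.T @ X @ np.linalg.solve(X.T @ X + Omega, X.T @ Y))
evals, evecs = np.linalg.eig(M)
theta = evecs.real[:, np.argmax(evals.real)]
theta *= np.sqrt(n / (theta @ Y.T @ Y @ theta))   # normalization (4.1b)

beta_os = np.linalg.solve(X.T @ X + Omega, X.T @ Y @ theta)   # Eq. (4.2)
alpha2 = theta @ Y.T @ X @ beta_os / n                        # Eq. (4.5)
beta_cca = beta_os / np.sqrt(alpha2)                          # Eq. (4.9)

# beta_cca must satisfy the p-CCA normalization (4.6c).
assert abs(beta_cca @ (X.T @ X + Omega) @ beta_cca / n - 1) < 1e-8
```

The assertion holds because, by (4.5), the p-CCA normalization of β_os differs from 1 exactly by the factor α².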
4.1.4 Summary
The three previous subsections considered a generic form of the kth problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:

$$\min_{\Theta,\,B}\; \|Y\Theta - XB\|_F^2 + \lambda\operatorname{tr}\!\left(B^\top\Omega B\right) \quad \text{s.t. } n^{-1}\,\Theta^\top Y^\top Y\Theta = I_{K-1}$$
Let A represent the (K−1)×(K−1) diagonal matrix whose elements α_k are the square roots of the K−1 largest eigenvalues of $Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y$; we have

$$B_{LDA} = B_{CCA}\left(I_{K-1} - A^2\right)^{-\frac{1}{2}} = B_{OS}\, A^{-1}\left(I_{K-1} - A^2\right)^{-\frac{1}{2}} \tag{4.15}$$
where I_{K−1} is the (K−1)×(K−1) identity matrix.
At this point, the feature matrix X, which has dimensions n×p in the input space, can be projected into the optimal scoring domain, as the n×(K−1) matrix $X_{OS} = XB_{OS}$, or into the linear discriminant analysis space, as the n×(K−1) matrix $X_{LDA} = XB_{LDA}$. Classification can be performed in any of those domains, provided the appropriate distance (based on the penalized within-class covariance matrix) is applied.
With the aim of performing classification, the whole process can be summarized as follows:

1. Solve the p-OS problem as $B_{OS} = \left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y\Theta$, where Θ are the K−1 leading eigenvectors of $Y^\top X\left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y$.

2. Translate the data samples X into the LDA domain as $X_{LDA} = XB_{OS}D$, where $D = A^{-1}\left(I_{K-1} - A^2\right)^{-\frac{1}{2}}$.

3. Compute the matrix M of centroids μ_k from X_LDA and Y.

4. Evaluate the distances d(x, μ_k) in the LDA domain as a function of M and X_LDA.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Produce a graphical representation, if desired.
The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Sections 4.2.1, 4.2.2 and 4.2.3, respectively.
4.2 Practicalities
4.2.1 Solution of the Penalized Optimal Scoring Regression
Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:

\[
\min_{\Theta \in \mathbb{R}^{K\times(K-1)},\; B \in \mathbb{R}^{p\times(K-1)}} \; \|Y\Theta - XB\|_F^2 + \lambda\,\mathrm{tr}\bigl(B^\top\Omega B\bigr) \tag{4.16a}
\]
\[
\text{s.t.} \quad n^{-1}\,\Theta^\top Y^\top Y\Theta = I_{K-1} \;, \tag{4.16b}
\]

where Θ are the class scores, B the regression coefficients, and \(\|\cdot\|_F\) is the Frobenius norm.
Though non-convex, the OS problem is readily solved by a decomposition in Θ and B: the optimal \(B_{\mathrm{OS}}\) does not intervene in the optimality conditions with respect to Θ, and the optimum with respect to B is obtained in closed form as a linear combination of the optimal scores Θ (Hastie et al., 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:
1. Initialize Θ to \(\Theta^0\) such that \(n^{-1}\, \Theta^{0\top} Y^\top Y \Theta^0 = I_{K-1}\).

2. Compute \(B = \bigl( X^\top X + \lambda\Omega \bigr)^{-1} X^\top Y \Theta^0\).

3. Set Θ to be the K−1 leading eigenvectors of \(Y^\top X (X^\top X + \lambda\Omega)^{-1} X^\top Y\).

4. Compute the optimal regression coefficients

\[
B_{\mathrm{OS}} = \bigl( X^\top X + \lambda\Omega \bigr)^{-1} X^\top Y \Theta \;. \tag{4.17}
\]
Defining \(\Theta^0\) in Step 1, instead of directly using Θ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on \(\Theta^{0\top} Y^\top X (X^\top X + \lambda\Omega)^{-1} X^\top Y \Theta^0\), which is computed as \(\Theta^{0\top} Y^\top X B\), thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.
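The four steps above can be sketched in a few lines. The snippet below is a NumPy illustration under our own naming (the actual package is written in matlab); it implements the procedure for a quadratic penalty Ω, constructing \(\Theta^0\) orthogonal to \(1_K\) and performing the eigen-analysis on the small matrix \(\Theta^{0\top} Y^\top X B\):

```python
import numpy as np

def penalized_os(X, Y, lam, Omega):
    """Four-step penalized OS solution for a quadratic penalty Omega."""
    n, K = Y.shape
    # Step 1: Theta0 with n^{-1} Theta0' Y'Y Theta0 = I, orthogonal to 1_K
    Ctr = np.eye(K) - np.ones((K, K)) / K          # projector on 1_K-perp
    U = np.linalg.svd(Ctr)[0][:, :K - 1]           # orthonormal basis of 1_K-perp
    Theta0 = np.sqrt(n) * np.diag(1 / np.sqrt(Y.sum(axis=0))) @ U
    # Step 2: penalized regression on the initial scores
    B0 = np.linalg.solve(X.T @ X + lam * Omega, X.T @ Y @ Theta0)
    # Step 3: eigen-analysis of the small (K-1)x(K-1) matrix Theta0' Y'X B0
    S = Theta0.T @ Y.T @ X @ B0
    vals, V = np.linalg.eigh((S + S.T) / 2)
    V = V[:, np.argsort(vals)[::-1]]               # decreasing eigenvalues
    # Step 4: rotate scores and coefficients (B is linear in Theta)
    return Theta0 @ V, B0 @ V
```

On exit, Θ satisfies the normalization constraint and \(\Theta^\top Y^\top X B\) is diagonal, as expected from the eigen-decomposition.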
This four-step algorithm is valid when the penalty is of the form \(\mathrm{tr}(B^\top \Omega B)\). However, when an L1 penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and Θ. That situation is developed by Clemmensen et al. (2011), where
a Lasso or an Elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and Elastic net penalties do not enjoy the equivalence with LDA problems.
4.2.2 Distance Evaluation
The simplest classification rule is the nearest centroid rule, where sample \(x_i\) is assigned to class k if it is closer (in terms of the common within-class Mahalanobis distance) to centroid \(\mu_k\) than to any other centroid \(\mu_\ell\). In general, the parameters of the model are unknown, and the rule is applied with the parameters estimated from training data (sample estimates \(\mu_k\) and \(\Sigma_W\)). If \(\mu_k\) are the centroids in the input space, sample \(x_i\) is assigned to class k if the distance
\[
d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_{W\Omega}^{-1} (x_i - \mu_k) - 2\log\Bigl(\frac{n_k}{n}\Bigr) \tag{4.18}
\]
is minimized over all k. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment for unequal class sizes that estimates the prior probability of class k. Note that this adjustment is inspired by the Gaussian view of LDA and that other definitions could be used (Friedman et al., 2009; Mai et al., 2012). The matrix \(\Sigma_{W\Omega}\) used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:
\[
\Sigma_{W\Omega}^{-1}
= \bigl( n^{-1}(X^\top X + \lambda\Omega) - \Sigma_B \bigr)^{-1}
= \bigl( n^{-1} X^\top X - \Sigma_B + n^{-1}\lambda\Omega \bigr)^{-1}
= \bigl( \Sigma_W + n^{-1}\lambda\Omega \bigr)^{-1} \;. \tag{4.19}
\]
Before explaining how to compute the distances, let us summarize some clarifying points:

- The solution \(B_{\mathrm{OS}}\) of the p-OS problem is enough to accomplish classification.
- In the LDA domain (the space of discriminant variates \(X_{\mathrm{LDA}}\)), classification is based on Euclidean distances.
- Classification can be done in a reduced-rank space of dimension R < K−1 by using only the first R discriminant directions \(\{\beta_k\}_{k=1}^R\).
As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, the distance is

\[
\bigl\| (x_i - \mu_k)^\top B_{\mathrm{OS}} \bigr\|^2_{\Sigma_{W\Omega}} - 2\log(\pi_k) \;,
\]

where \(\pi_k\) is the estimated class prior and \(\|\cdot\|_S\) is the Mahalanobis norm assuming within-class covariance S. If classification is done in the p-LDA domain, the distance is

\[
\Bigl\| (x_i - \mu_k)^\top B_{\mathrm{OS}}\, A^{-1} \bigl( I_{K-1} - A^2 \bigr)^{-\frac12} \Bigr\|_2^2 - 2\log(\pi_k) \;,
\]

which is a plain Euclidean distance.
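In the LDA domain, the rule therefore reduces to a Euclidean nearest-centroid assignment corrected by the prior term. A minimal NumPy sketch (our own helper name, not the package code) with the discriminant variates row-wise in XLDA and the centroids row-wise in M:

```python
import numpy as np

def assign_lda(XLDA, M, priors):
    """Nearest-centroid rule in the LDA domain: squared Euclidean distance
    to each centroid, corrected by the class-prior term of (4.18)."""
    # d2[i, k] = ||x_i - mu_k||_2^2 - 2 log(pi_k)
    d2 = ((XLDA[:, None, :] - M[None, :, :]) ** 2).sum(axis=2) - 2 * np.log(priors)
    return d2.argmin(axis=1)
```

With equal priors the correction vanishes and the rule is the plain nearest centroid.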
4.2.3 Posterior Probability Evaluation
Let \(d(x, \mu_k)\) be the distance between x and \(\mu_k\) defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities \(p(y_k = 1 \mid x)\) can be estimated as

\[
\hat p(y_k = 1 \mid x) \;\propto\; \exp\Bigl( -\frac{d(x, \mu_k)}{2} \Bigr)
\;\propto\; \pi_k \exp\Bigl( -\frac12 \bigl\| (x - \mu_k)^\top B_{\mathrm{OS}}\, A^{-1} ( I_{K-1} - A^2 )^{-\frac12} \bigr\|_2^2 \Bigr) \;. \tag{4.20}
\]
These probabilities must be normalized to ensure that they sum to one. When the distances \(d(x, \mu_k)\) take large values, \(\exp(-d(x, \mu_k)/2)\) can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is to shift all distances by the smallest one, which leaves the ratios unchanged:

\[
\hat p(y_k = 1 \mid x)
= \frac{\pi_k \exp\bigl(-\frac{d(x,\mu_k)}{2}\bigr)}{\sum_\ell \pi_\ell \exp\bigl(-\frac{d(x,\mu_\ell)}{2}\bigr)}
= \frac{\pi_k \exp\bigl(-\frac{d(x,\mu_k) - d_{\min}}{2}\bigr)}{\sum_\ell \pi_\ell \exp\bigl(-\frac{d(x,\mu_\ell) - d_{\min}}{2}\bigr)} \;,
\]

where \(d_{\min} = \min_k d(x, \mu_k)\), so that the largest exponential is exactly one.
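The shift can be implemented in a vectorized way. In this sketch (our illustration), d2 holds the distances \(d(x, \mu_k)\) row-wise for a batch of samples and priors the estimated \(\pi_k\):

```python
import numpy as np

def posteriors(d2, priors):
    """Underflow-safe posterior probabilities from distances d2 (n x K).
    Shifting every distance by the smallest one leaves the ratios unchanged."""
    shifted = d2 - d2.min(axis=1, keepdims=True)   # best class gets exponent 0
    num = priors * np.exp(-shifted / 2)
    return num / num.sum(axis=1, keepdims=True)
```

Without the shift, distances around 4000 would make every exponential underflow to zero and the normalization would divide zero by zero.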
4.2.4 Graphical Representation
Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. This can be accomplished by plotting the first two or three dimensions of the regression fits \(X_{\mathrm{OS}}\) or of the discriminant variates \(X_{\mathrm{LDA}}\), depending on whether the data set is displayed in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.
4.3 From Sparse Optimal Scoring to Sparse LDA
The equivalence stated in Section 4.1 holds for quadratic penalties of the form \(\beta^\top\Omega\beta\), under the assumption that \(Y^\top Y\) and \(X^\top X + \lambda\Omega\) are full rank (fulfilled when there are no empty classes and Ω is positive definite). Quadratic penalties have interesting properties but, as recalled in Section 2.3, they do not induce sparsity. In this respect, L1 penalties are preferable, but they lack a connection between p-LDA and p-OS such as the one stated by Hastie et al. (1995).
In this section, we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see
Section 2.3.4) that zeroes jointly the group formed by the coefficients of a given feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore show that our formulation of the group-Lasso can be written in the quadratic form \(B^\top\Omega B\).
4.3.1 A Quadratic Variational Form
Quadratic variational forms of the Lasso and group-Lasso were proposed shortly after the original Lasso paper of Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet, 1998; Canu and Grandvalet, 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al., 2012).
Our formulation of the group-Lasso is shown below:

\[
\min_{\tau \in \mathbb{R}^p} \; \min_{B \in \mathbb{R}^{p\times(K-1)}} \; J(B) + \lambda \sum_{j=1}^p w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j} \tag{4.21a}
\]
\[
\text{s.t.} \quad \sum_j \tau_j - \sum_j w_j \|\beta^j\|_2 \le 0 \;, \tag{4.21b}
\]
\[
\tau_j \ge 0 \;, \quad j = 1, \dots, p \;, \tag{4.21c}
\]
where \(B \in \mathbb{R}^{p\times(K-1)}\) is a matrix composed of row vectors \(\beta^j \in \mathbb{R}^{K-1}\), \(B = (\beta^{1\top}, \dots, \beta^{p\top})^\top\), and the \(w_j\) are predefined nonnegative weights. In our context, the cost function J(B) is the OS regression loss \(\|Y\Theta - XB\|_F^2\); from now on, for simplicity, we simply write J(B). Here and in what follows, \(b/\tau\) is defined by continuation at zero: \(b/0 = +\infty\) if \(b \neq 0\) and \(0/0 = 0\). Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet, 1999; Bach et al., 2012, and references therein).
The intuition behind our approach is that the variational formulation recasts a non-quadratic expression into the convex hull of a family of quadratic penalties indexed by the variables \(\tau_j\). This is graphically shown in Figure 4.1.
Let us start by proving the equivalence of our variational formulation with the standard group-Lasso (an alternative variational formulation is detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty in \(\beta^j\) in (4.21) acts as the group-Lasso penalty \(\lambda \sum_{j=1}^p w_j \|\beta^j\|_2\).
Proof. The Lagrangian of Problem (4.21) is

\[
L = J(B) + \lambda \sum_{j=1}^p w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j}
+ \nu_0 \Bigl( \sum_{j=1}^p \tau_j - \sum_{j=1}^p w_j \|\beta^j\|_2 \Bigr)
- \sum_{j=1}^p \nu_j \tau_j \;.
\]
Figure 4.1: Graphical representation of the variational approach to the group-Lasso.
Thus, the first-order optimality conditions for \(\tau_j\) are

\[
\frac{\partial L}{\partial \tau_j}(\tau_j^\star) = 0
\;\Leftrightarrow\; -\lambda w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0
\]
\[
\;\Leftrightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0\, \tau_j^{\star 2} - \nu_j\, \tau_j^{\star 2} = 0
\]
\[
\;\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0\, \tau_j^{\star 2} = 0 \;.
\]
The last line is obtained from complementary slackness, which implies here \(\nu_j \tau_j^\star = 0\): complementary slackness states that \(\nu_j g_j(\tau_j^\star) = 0\), where \(\nu_j\) is the Lagrange multiplier for the constraint \(g_j(\tau_j) \le 0\). As a result, the optimal value of \(\tau_j\) is
\[
\tau_j^\star = \sqrt{\frac{\lambda\, w_j^2 \|\beta^j\|_2^2}{\nu_0}} = \sqrt{\frac{\lambda}{\nu_0}}\, w_j \|\beta^j\|_2 \;. \tag{4.22}
\]
We note that \(\nu_0 \neq 0\) if there is at least one coefficient \(\beta_{jk} \neq 0\); thus the inequality constraint (4.21b) is at bound (due to complementary slackness):

\[
\sum_{j=1}^p \tau_j^\star - \sum_{j=1}^p w_j \|\beta^j\|_2 = 0 \;, \tag{4.23}
\]

so that \(\tau_j^\star = w_j \|\beta^j\|_2\). Plugging this value into (4.21a), it is possible to conclude that
Problem (4.21) is equivalent to the standard group-Lasso problem

\[
\min_{B \in \mathbb{R}^{p\times(K-1)}} \; J(B) + \lambda \sum_{j=1}^p w_j \|\beta^j\|_2 \;. \tag{4.24}
\]
We have thus presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.
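The key identity of the proof, namely that the weighted quadratic penalty evaluated at \(\tau_j^\star = w_j\|\beta^j\|_2\) coincides with the group-Lasso penalty, can be checked numerically. The values below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 2))           # rows beta^j, here K-1 = 2
w = np.array([1.0, 0.5, 2.0, 1.0])        # predefined nonnegative weights

row_norms = np.linalg.norm(B, axis=1)
tau = w * row_norms                        # optimal tau_j = w_j ||beta^j||_2
quad = np.sum(w**2 * row_norms**2 / tau)   # quadratic penalty at tau*
glasso = np.sum(w * row_norms)             # standard group-Lasso penalty
```

At the optimum, the constraint (4.21b) holds with equality, and the two penalties take exactly the same value.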
With Lemma 4.1, we have proved that, under constraints (4.21b)-(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently written as \(\lambda\, \mathrm{tr}(B^\top\Omega B)\), where

\[
\Omega = \mathrm{diag}\Bigl( \frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \dots, \frac{w_p^2}{\tau_p} \Bigr) \;, \tag{4.25}
\]

with \(\tau_j = w_j \|\beta^j\|_2\), resulting in the diagonal components

\[
(\Omega)_{jj} = \frac{w_j}{\|\beta^j\|_2} \;. \tag{4.26}
\]
Hence, as stated at the beginning of this section, the equivalence between p-LDA and p-OS problems also holds for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set algorithm described in Chapter 5.
The first property states that the quadratic formulation is convex when J is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If J is convex, Problem (4.21) is convex.

Proof. The function \(g(\beta, \tau) = \|\beta\|_2^2/\tau\), known as the perspective function of \(f(\beta) = \|\beta\|_2^2\), is convex in \((\beta, \tau)\) (see e.g. Boyd and Vandenberghe, 2004, Chapter 3), and the constraints (4.21b)-(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to \((B, \tau)\).
In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.
Lemma 4.3. For all \(B \in \mathbb{R}^{p\times(K-1)}\), the subdifferential of the objective function of Problem (4.24) is

\[
\Bigl\{ V \in \mathbb{R}^{p\times(K-1)} : V = \frac{\partial J(B)}{\partial B} + \lambda G \Bigr\} \;, \tag{4.27}
\]

where \(G \in \mathbb{R}^{p\times(K-1)}\) is a matrix composed of row vectors \(g^j \in \mathbb{R}^{K-1}\), \(G = (g^{1\top}, \dots, g^{p\top})^\top\), defined as follows. Let S(B) denote the row-wise support of B, \(S(B) = \{ j \in \{1,\dots,p\} : \|\beta^j\|_2 \neq 0 \}\); then we have

\[
\forall j \in S(B) \;, \quad g^j = w_j \|\beta^j\|_2^{-1} \beta^j \;, \tag{4.28}
\]
\[
\forall j \notin S(B) \;, \quad \|g^j\|_2 \le w_j \;. \tag{4.29}
\]
This condition results in an equality for the "active" non-zero vectors \(\beta^j\) and an inequality for the other ones, which both provide essential building blocks of our algorithm.
Proof. When \(\|\beta^j\|_2 \neq 0\), the gradient of the penalty with respect to \(\beta^j\) is

\[
\frac{\partial}{\partial \beta^j} \Bigl( \lambda \sum_{m=1}^p w_m \|\beta^m\|_2 \Bigr) = \lambda w_j \frac{\beta^j}{\|\beta^j\|_2} \;. \tag{4.30}
\]
At \(\|\beta^j\|_2 = 0\), the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al., 2011):

\[
\partial_{\beta^j} \Bigl( \lambda \sum_{m=1}^p w_m \|\beta^m\|_2 \Bigr)
= \partial_{\beta^j} \bigl( \lambda w_j \|\beta^j\|_2 \bigr)
= \bigl\{ \lambda w_j v \in \mathbb{R}^{K-1} : \|v\|_2 \le 1 \bigr\} \;, \tag{4.31}
\]

which gives expression (4.29).
Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points B of the objective function verifying the following conditions are global minima:

\[
\forall j \in S \;, \quad \frac{\partial J(B)}{\partial \beta^j} + \lambda w_j \|\beta^j\|_2^{-1} \beta^j = 0 \;, \tag{4.32a}
\]
\[
\forall j \notin S \;, \quad \Bigl\| \frac{\partial J(B)}{\partial \beta^j} \Bigr\|_2 \le \lambda w_j \;, \tag{4.32b}
\]

where \(S \subseteq \{1, \dots, p\}\) denotes the set of non-zero row vectors \(\beta^j\) and \(\bar S\) is its complement.
Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with a direct analysis of the variational problem (4.21).
4.3.2 Group-Lasso OS as Penalized LDA
With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.
Proposition 4.1. The group-Lasso OS problem

\[
B_{\mathrm{OS}} = \operatorname*{argmin}_{B \in \mathbb{R}^{p\times(K-1)}} \; \min_{\Theta \in \mathbb{R}^{K\times(K-1)}} \; \frac12 \|Y\Theta - XB\|_F^2 + \lambda \sum_{j=1}^p w_j \|\beta^j\|_2
\quad \text{s.t.} \quad n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1}
\]
is equivalent to the penalized LDA problem

\[
B_{\mathrm{LDA}} = \operatorname*{argmax}_{B \in \mathbb{R}^{p\times(K-1)}} \; \mathrm{tr}\bigl( B^\top \Sigma_B B \bigr)
\quad \text{s.t.} \quad B^\top \bigl( \Sigma_W + n^{-1}\lambda\Omega \bigr) B = I_{K-1} \;,
\]

where

\[
\Omega = \mathrm{diag}\Bigl( \frac{w_1^2}{\tau_1}, \dots, \frac{w_p^2}{\tau_p} \Bigr) \;, \quad \text{with} \quad
\Omega_{jj} = \begin{cases} +\infty & \text{if } \beta^j_{\mathrm{os}} = 0 \;, \\[2pt] w_j \|\beta^j_{\mathrm{os}}\|_2^{-1} & \text{otherwise.} \end{cases} \tag{4.33}
\]

That is, \(B_{\mathrm{LDA}} = B_{\mathrm{OS}}\, \mathrm{diag}\bigl( \alpha_k^{-1} (1 - \alpha_k^2)^{-1/2} \bigr)\), where \(\alpha_k \in (0,1)\) is the kth leading eigenvalue of

\[
n^{-1}\, Y^\top X \bigl( X^\top X + \lambda\Omega \bigr)^{-1} X^\top Y \;.
\]
Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.
The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al., 2008; Clemmensen et al., 2011) for K = 2, that is, for binary classification, or more generally for a single discriminant direction. Note however that it leads to a slightly different decision rule if the decision threshold is chosen a priori according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold any more, since the Lasso penalty does not result in an equivalent quadratic penalty of the simple form \(\mathrm{tr}(B^\top\Omega B)\).
5 GLOSS Algorithm
The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems, whose sizes are incrementally increased or decreased (Osborne et al., 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer, 2008). We adapt this algorithmic framework to the variational form (4.21), with \(J(B) = \frac12 \|Y\Theta - XB\|_F^2\).
The algorithm belongs to the working set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables, currently identified as non-zero. Then it iterates the three steps summarized below:
1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth: first the quadratic penalty is updated, then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more \(\beta^j\) may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to the inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.
This mechanism is graphically represented in Figure 5.1 as a block diagram, and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. To use the alternative variational approach from Appendix D, Equations (4.21), (4.32a) and (4.32b) have to be replaced by (D.1), (D.10a) and (D.10b), respectively.
5.1 Regression Coefficients Updates
Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving (K−1) independent card(A)-dimensional problems instead of a single (K−1)×card(A)-dimensional problem. The interaction between the (K−1) problems is relegated to the common adaptive quadratic penalty Ω. This decomposition is especially attractive, as we then solve (K−1) similar systems

\[
\bigl( X_A^\top X_A + \lambda\Omega \bigr) \beta_k = X_A^\top Y \theta_k^0 \;, \tag{5.1}
\]
Figure 5.1: GLOSS block diagram. The flowchart initializes the model (λ, B) and the active set \(\{j : \|\beta^j\|_2 > 0\}\); solves the p-OS problem so that B satisfies the first optimality condition; moves to the inactive set any active variable violating its optimality condition; tests the second optimality condition on the inactive set and moves the worst violator, if any, to the active set; and, upon convergence, computes Θ, updates B, and ends.
Algorithm 1: Adaptively Penalized Optimal Scoring

Input: X, Y, B, λ.
Initialize \(A \leftarrow \{ j \in \{1,\dots,p\} : \|\beta^j\|_2 > 0 \}\); \(\Theta^0\) such that \(n^{-1}\, \Theta^{0\top} Y^\top Y \Theta^0 = I_{K-1}\); convergence ← false.
repeat
  % Step 1: solve (4.21) in B, assuming A optimal
  repeat
    \(\Omega \leftarrow \mathrm{diag}\,\Omega_A\), with \(\omega_j \leftarrow w_j \|\beta^j\|_2^{-1}\)
    \(B_A \leftarrow \bigl( X_A^\top X_A + \lambda\Omega \bigr)^{-1} X_A^\top Y \Theta^0\)
  until condition (4.32a) holds for all j ∈ A
  % Step 2: identify inactivated variables
  for j ∈ A such that \(\|\beta^j\|_2 = 0\) do
    if optimality condition (4.32b) holds then
      A ← A \ {j}; go back to Step 1
    end if
  end for
  % Step 3: check the greatest violation of optimality condition (4.32b) in \(\bar A\)
  \(j^\star \leftarrow \operatorname{argmax}_{j \in \bar A} \|\partial J / \partial \beta^j\|_2\)
  if \(\|\partial J / \partial \beta^{j^\star}\|_2 < \lambda w_{j^\star}\) then
    convergence ← true  % B is optimal
  else
    A ← A ∪ {j*}
  end if
until convergence
\((s, V) \leftarrow \mathrm{eigenanalyze}(\Theta^{0\top} Y^\top X_A B_A)\), that is, \(\Theta^{0\top} Y^\top X_A B_A V_k = s_k V_k\), k = 1, ..., K−1
\(\Theta \leftarrow \Theta^0 V\); \(B \leftarrow B V\); \(\alpha_k \leftarrow n^{-1/2} s_k^{1/2}\), k = 1, ..., K−1
Output: Θ, B, α.
where \(X_A\) denotes the columns of X indexed by A, and \(\beta_k\) and \(\theta_k^0\) denote the kth columns of B and \(\Theta^0\), respectively. These linear systems differ only in their right-hand side, so that a single Cholesky decomposition suffices to solve all of them, whereas a blockwise Newton-Raphson method based on the standard group-Lasso formulation would result in a different "penalty" Ω for each system.
5.1.1 Cholesky Decomposition
Dropping the subscripts and considering the (K−1) systems together, (5.1) leads to

\[
\bigl( X^\top X + \lambda\Omega \bigr) B = X^\top Y \Theta \;. \tag{5.2}
\]

Defining the Cholesky decomposition \(C^\top C = X^\top X + \lambda\Omega\), (5.2) is solved efficiently as follows:

\[
C^\top C B = X^\top Y \Theta
\qquad
C B = C^\top \backslash X^\top Y \Theta
\qquad
B = C \backslash \bigl( C^\top \backslash X^\top Y \Theta \bigr) \;, \tag{5.3}
\]

where the symbol "\" is the matlab mldivide operator, which solves linear systems efficiently (here, two triangular solves). The GLOSS code implements (5.3).
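In Python, the same pattern (one factorization, all right-hand sides solved at once) can be written with SciPy's Cholesky routines. This is a sketch on toy data, not the package code, which is matlab:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(2)
n, p, K = 20, 6, 4
X = rng.standard_normal((n, p))
Y = np.eye(K)[np.arange(n) % K]
Theta = rng.standard_normal((K, K - 1))
lam, Omega = 0.5, np.eye(p)

A = X.T @ X + lam * Omega            # symmetric positive definite
rhs = X.T @ Y @ Theta                # the K-1 right-hand sides at once
c_and_low = cho_factor(A)            # one Cholesky factorization
B = cho_solve(c_and_low, rhs)        # two triangular solves, as in (5.3)
```

Factoring once and reusing the triangular factors is what makes the shared-penalty decomposition of (5.1) cheap.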
5.1.2 Numerical Stability
The OS regression coefficients are obtained from (5.2), where the penalty matrix Ω is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of Ω takes very large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of \(X^\top X + \lambda\Omega\). This difficulty can be avoided using the following equivalent expression:

\[
B = \Omega^{-1/2} \bigl( \Omega^{-1/2} X^\top X \Omega^{-1/2} + \lambda I \bigr)^{-1} \Omega^{-1/2} X^\top Y \Theta^0 \;, \tag{5.4}
\]

where the conditioning of \(\Omega^{-1/2} X^\top X \Omega^{-1/2} + \lambda I\) is always well-behaved provided X is appropriately normalized (recall that \(0 \le \omega_j^{-1} \le 1\)). This more stable expression demands more computation and is thus reserved for cases with large \(\omega_j\) values; our code is otherwise based on expression (5.2).
5.2 Score Matrix
The optimal score matrix Θ is made of the K−1 leading eigenvectors of \(Y^\top X (X^\top X + \lambda\Omega)^{-1} X^\top Y\). This eigen-analysis is actually solved in the form \(\Theta^\top Y^\top X (X^\top X + \lambda\Omega)^{-1} X^\top Y \Theta\) (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of \((X^\top X + \lambda\Omega)^{-1}\), which
involves the inversion of a p×p matrix. Let \(\Theta^0\) be an arbitrary K×(K−1) matrix whose range includes the K−1 leading eigenvectors of \(Y^\top X (X^\top X + \lambda\Omega)^{-1} X^\top Y\).¹ Then, solving the K−1 systems (5.3) provides the value of \(B^0 = (X^\top X + \lambda\Omega)^{-1} X^\top Y \Theta^0\). This matrix \(B^0\) can be identified in the expression to eigenanalyze, since

\[
\Theta^{0\top} Y^\top X \bigl( X^\top X + \lambda\Omega \bigr)^{-1} X^\top Y \Theta^0 = \Theta^{0\top} Y^\top X B^0 \;.
\]

Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the (K−1)×(K−1) matrix \(\Theta^{0\top} Y^\top X B^0 = V \Lambda V^\top\). Defining \(\Theta = \Theta^0 V\), we have \(\Theta^\top Y^\top X (X^\top X + \lambda\Omega)^{-1} X^\top Y \Theta = \Lambda\), and when \(\Theta^0\) is chosen such that \(n^{-1}\, \Theta^{0\top} Y^\top Y \Theta^0 = I_{K-1}\), we also have \(n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1}\), so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of Λ are sorted in decreasing order, Θ is an optimal solution to the p-OS problem. Finally, once Θ has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered through the mapping from \(\Theta^0\) to Θ, that is, \(B = B^0 V\). Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which Ω is defined by a variational formulation.
5.3 Optimality Conditions
GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix Θ. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4, from which the optimality conditions (4.32a) and (4.32b) can be deduced. Both expressions require the gradient of the objective function

\[
\frac12 \|Y\Theta - XB\|_F^2 + \lambda \sum_{j=1}^p w_j \|\beta^j\|_2 \;. \tag{5.5}
\]
Let J(B) denote the data-fitting term \(\frac12 \|Y\Theta - XB\|_F^2\). Its gradient with respect to \(\beta^j\), the jth row of B, is the (K−1)-dimensional vector

\[
\frac{\partial J(B)}{\partial \beta^j} = x^{j\top} (XB - Y\Theta) \;,
\]

where \(x^j\) is the jth column of X. Hence the first optimality condition (4.32a) reads, for every variable j,

\[
x^{j\top} (XB - Y\Theta) + \lambda w_j \frac{\beta^j}{\|\beta^j\|_2} = 0 \;.
\]
¹As X is centered, \(1_K\) belongs to the null space of \(Y^\top X (X^\top X + \lambda\Omega)^{-1} X^\top Y\). It is thus sufficient to choose \(\Theta^0\) orthogonal to \(1_K\) to ensure that its range spans the leading eigenvectors of \(Y^\top X (X^\top X + \lambda\Omega)^{-1} X^\top Y\). In practice, to comply with this desideratum and with conditions (3.5b) and (3.5c), we set \(\Theta^0 = (Y^\top Y)^{-1/2} U\), where U is a K×(K−1) matrix whose columns are orthonormal vectors orthogonal to \(1_K\).
The second optimality condition (4.32b) reads, for every variable j,

\[
\bigl\| x^{j\top} (XB - Y\Theta) \bigr\|_2 \le \lambda w_j \;.
\]
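Both conditions only require the gradient matrix \(X^\top(XB - Y\Theta)\). A small sketch (the helper name is ours) returns, for each variable, the margin by which the second condition is violated:

```python
import numpy as np

def condition_margins(X, Y, B, Theta, lam, w):
    """Row-wise gradient norms of J minus lambda * w_j: a positive entry
    means variable j violates the second optimality condition (4.32b)."""
    G = X.T @ (X @ B - Y @ Theta)      # row j is dJ/dbeta^j
    return np.linalg.norm(G, axis=1) - lam * w
```

At B = 0 with a sufficiently large λ, all margins are negative, which is exactly the situation where the null solution is optimal.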
5.4 Active and Inactive Sets
The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set, containing the variables that have already been considered relevant. A variable j can be considered for inclusion in the active set if it violates the second optimality condition. We proceed one variable at a time, choosing the one that is expected to produce the greatest decrease in the objective function:

\[
j^\star = \operatorname*{argmax}_{j} \; \max\Bigl( \bigl\| x^{j\top} (XB - Y\Theta) \bigr\|_2 - \lambda w_j \,,\; 0 \Bigr) \;.
\]
The exclusion of a variable belonging to the active set A is considered when the norm \(\|\beta^j\|_2\) is small and, after setting \(\beta^j\) to zero, the following optimality condition holds:

\[
\bigl\| x^{j\top} (XB - Y\Theta) \bigr\|_2 \le \lambda w_j \;.
\]
The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.
5.5 Penalty Parameter
The penalty parameter can be specified by the user, in which case GLOSS solves the problem for this value of λ. The alternative strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value \(\lambda_{\max}\) of the penalty parameter such that \(B \neq 0\), and solves the p-OS problem for decreasing values of λ until a prescribed number of features are declared active.
The maximum value \(\lambda_{\max}\) of the penalty parameter, corresponding to a null B matrix, is obtained by evaluating the optimality condition (4.32b) at B = 0:

\[
\lambda_{\max} = \max_{j \in \{1,\dots,p\}} \; \frac{1}{w_j} \bigl\| x^{j\top} Y \Theta^0 \bigr\|_2 \;.
\]
The algorithm then computes a series of solutions along the regularization path defined by a sequence of penalties \(\lambda_1 = \lambda_{\max} > \dots > \lambda_t > \dots > \lambda_T = \lambda_{\min} \ge 0\), obtained by regularly decreasing the penalty, \(\lambda_{t+1} = \lambda_t / 2\), and using a warm-start strategy, where the feasible initial guess for \(B(\lambda_{t+1})\) is initialized with \(B(\lambda_t)\). The final penalty parameter \(\lambda_{\min}\) is determined during the optimization process, when the maximum number of desired active variables is attained (by default, the minimum of n and p).
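Evaluating (4.32b) at B = 0 gives \(\lambda_{\max}\) directly, since the gradient of J at zero is \(-X^\top Y \Theta^0\). The sketch below (hypothetical helper names) also builds the halving schedule used for warm starts:

```python
import numpy as np

def lambda_max(X, Y, Theta0, w):
    """Smallest penalty for which B = 0 is optimal: condition (4.32b) at B = 0."""
    G0 = X.T @ Y @ Theta0                  # gradient of J at B = 0, up to sign
    return np.max(np.linalg.norm(G0, axis=1) / w)

def penalty_path(lam_max, T):
    """Halving schedule lambda_{t+1} = lambda_t / 2 for warm-started solutions."""
    return lam_max / 2.0 ** np.arange(T)
```

For any λ above `lambda_max`, every row of the gradient at B = 0 satisfies the inequality (4.32b), so the null matrix is already optimal.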
5.6 Options and Variants
5.6.1 Scaling Variables

Like most penalization schemes, GLOSS is sensitive to the scaling of the variables. It thus makes sense to normalize them before applying the algorithm or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.
5.6.2 Sparse Variant

This version replaces some matlab commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.
5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis. Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if the variables are correlated.
The equivalence proof between penalized OS and penalized LDA (Hastie et al., 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems

\[
\min_{B \in \mathbb{R}^{p\times(K-1)}} \|Y\Theta - XB\|_F^2
= \min_{B \in \mathbb{R}^{p\times(K-1)}} \mathrm{tr}\bigl( \Theta^\top Y^\top Y \Theta - 2\, \Theta^\top Y^\top X B + n\, B^\top \Sigma_T B \bigr)
\]

are replaced by

\[
\min_{B \in \mathbb{R}^{p\times(K-1)}} \mathrm{tr}\bigl( \Theta^\top Y^\top Y \Theta - 2\, \Theta^\top Y^\top X B + n\, B^\top \bigl( \Sigma_B + \mathrm{diag}(\Sigma_W) \bigr) B \bigr) \;.
\]

Note that this variant only requires \(\mathrm{diag}(\Sigma_W) + \Sigma_B + n^{-1}\lambda\Omega\) to be positive definite, which is a weaker requirement than \(\Sigma_T + n^{-1}\lambda\Omega\) positive definite.
5.6.4 Elastic Net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition
7 8 9
4 5 6
1 2 3

\[
\Omega_L = \begin{pmatrix}
3 & -1 & 0 & -1 & -1 & 0 & 0 & 0 & 0 \\
-1 & 5 & -1 & -1 & -1 & -1 & 0 & 0 & 0 \\
0 & -1 & 3 & 0 & -1 & -1 & 0 & 0 & 0 \\
-1 & -1 & 0 & 5 & -1 & 0 & -1 & -1 & 0 \\
-1 & -1 & -1 & -1 & 8 & -1 & -1 & -1 & -1 \\
0 & -1 & -1 & 0 & -1 & 5 & 0 & -1 & -1 \\
0 & 0 & 0 & -1 & -1 & 0 & 3 & -1 & 0 \\
0 & 0 & 0 & -1 & -1 & -1 & -1 & 5 & -1 \\
0 & 0 & 0 & 0 & -1 & -1 & 0 & -1 & 3
\end{pmatrix}
\]

Figure 5.2: Graph and Laplacian matrix for a 3×3 image.
for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of the pixels of a 3×3 image, with the corresponding Laplacian matrix. The Laplacian matrix \(\Omega_L\) is positive semi-definite, and the penalty \(\beta^\top \Omega_L \beta\) favors, among vectors of identical L2 norm, the ones having similar coefficients in the neighborhoods of the graph. For example, this penalty is 9 for the vector \((1, 1, 0, 1, 1, 0, 0, 0, 0)^\top\), which is the indicator of pixel 1 and its neighbors, and it is 21 for the vector \((-1, 1, 0, 1, 1, 0, 0, 0, 0)^\top\), with a sign mismatch between pixel 1 and its neighborhood.

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty is simply added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.
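The Laplacian of Figure 5.2 and the quoted penalty values can be reproduced as follows (our own construction of the 8-neighbor grid; pixels are numbered 1 to 9 from the bottom-left corner, row by row):

```python
import numpy as np

# Laplacian of the 8-neighbor graph of a 3x3 image (Figure 5.2)
coords = {j + 1: (j // 3, j % 3) for j in range(9)}   # pixel -> (row, col)
L = np.zeros((9, 9))
for i, (ri, ci) in coords.items():
    for j, (rj, cj) in coords.items():
        if i != j and max(abs(ri - rj), abs(ci - cj)) == 1:
            L[i - 1, j - 1] = -1                       # adjacent pixels
np.fill_diagonal(L, -L.sum(axis=1))                    # degrees on the diagonal

beta1 = np.array([1, 1, 0, 1, 1, 0, 0, 0, 0], dtype=float)
beta2 = np.array([-1, 1, 0, 1, 1, 0, 0, 0, 0], dtype=float)
p1 = beta1 @ L @ beta1    # smooth indicator of pixel 1 and its neighbors
p2 = beta2 @ L @ beta2    # sign mismatch between pixel 1 and its neighborhood
```

The quadratic form sums \((\beta_i - \beta_j)^2\) over the edges of the graph, so sign mismatches between neighbors are penalized much more heavily than a smooth indicator.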
6 Experimental Results
This section presents comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two state-of-the-art classifiers proposed to perform sparse LDA: Penalized LDA (PLDA) (Witten and Tibshirani, 2011), which applies a Lasso penalty within Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al., 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing parsimony capabilities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites; PLDA is an R implementation and SLDA is coded in matlab. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.
6.1 Normalization

With shrunken estimates, the scaling of features has important consequences. For the linear discriminants considered here, the two most common normalization strategies consist in setting to ones either the diagonal of the total covariance matrix \(\Sigma_T\) or the diagonal of the within-class covariance matrix \(\Sigma_W\). These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing penalty weights. The latter option is implemented in our matlab package.¹
6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, Chapter 4) suggest investigating other decision thresholds than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.
¹The GLOSS matlab code can be found in the software section of www.hds.utc.fr/~grandval.
6.3 Simulated Data
We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split into a training set of size n = 100, a validation set of size 100, and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of each setup, as provided in Witten and Tibshirani (2011), is:
Simulation 1: Mean shift with independent features. There are four classes. If sample i is in class k, then \(x_i \sim N(\mu_k, I)\), with \(\mu_{1j} = 0.7 \times 1_{(1 \le j \le 25)}\), \(\mu_{2j} = 0.7 \times 1_{(26 \le j \le 50)}\), \(\mu_{3j} = 0.7 \times 1_{(51 \le j \le 75)}\), \(\mu_{4j} = 0.7 \times 1_{(76 \le j \le 100)}\).
Simulation 2: Mean shift with dependent features. There are two classes. If sample i is in class 1, then xi ∼ N(0, Σ), and if i is in class 2, then xi ∼ N(µ, Σ), with µj = 0.6 × 1(j≤200). The covariance structure is block diagonal, with 5 blocks, each of dimension 100 × 100. The blocks have (j, j′) element 0.6^|j−j′|. This covariance structure is intended to mimic gene expression data correlation.
Simulation 3: One-dimensional mean shift with independent features. There are four classes and the features are independent. If sample i is in class k, then Xij ∼ N((k − 1)/3, 1) if j ≤ 100, and Xij ∼ N(0, 1) otherwise.
Simulation 4: Mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then xi ∼ N(µk, I), with mean vectors defined as follows: µ1j ∼ N(0, 0.3²) for j ≤ 25 and µ1j = 0 otherwise; µ2j ∼ N(0, 0.3²) for 26 ≤ j ≤ 50 and µ2j = 0 otherwise; µ3j ∼ N(0, 0.3²) for 51 ≤ j ≤ 75 and µ3j = 0 otherwise; µ4j ∼ N(0, 0.3²) for 76 ≤ j ≤ 100 and µ4j = 0 otherwise.
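As an illustration of these setups, the first one can be reproduced in a few lines (a sketch assuming numpy; `simulation1` is our name, not part of any package):

```python
import numpy as np

def simulation1(n, rng):
    """Draw n samples from the four-class mean-shift setup with
    independent features: x_i ~ N(mu_k, I) with p = 500, where
    mu_kj = 0.7 on the k-th block of 25 coordinates."""
    p, K = 500, 4
    mus = np.zeros((K, p))
    for k in range(K):
        mus[k, 25 * k:25 * (k + 1)] = 0.7
    y = rng.integers(K, size=n)          # balanced classes in expectation
    X = rng.normal(size=(n, p)) + mus[y]
    return X, y

rng = np.random.default_rng(0)
X, y = simulation1(100, rng)
print(X.shape)   # (100, 500)
```

The three other setups only change the mean vectors and, for Simulation 2, the covariance structure.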
Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA in the sense that the within-class covariance matrices are mostly diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.
The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed from afar by SLDA, which is the only
Table 6.1: Experimental results for simulated data: averages, with standard deviations, computed over 25 repetitions, of the test error rate, the number of selected variables, and the number of discriminant directions selected on the validation set.

                    Err (%)       Var            Dir
Sim 1: K = 4, mean shift, ind. features
  PLDA              12.6 (0.1)    411.7 (3.7)    3.0 (0.0)
  SLDA              31.9 (0.1)    228.0 (0.2)    3.0 (0.0)
  GLOSS             19.9 (0.1)    106.4 (1.3)    3.0 (0.0)
  GLOSS-D           11.2 (0.1)    251.1 (4.1)    3.0 (0.0)
Sim 2: K = 2, mean shift, dependent features
  PLDA               9.0 (0.4)    337.6 (5.7)    1.0 (0.0)
  SLDA              19.3 (0.1)     99.0 (0.0)    1.0 (0.0)
  GLOSS             15.4 (0.1)     39.8 (0.8)    1.0 (0.0)
  GLOSS-D            9.0 (0.0)    203.5 (4.0)    1.0 (0.0)
Sim 3: K = 4, 1D mean shift, ind. features
  PLDA              13.8 (0.6)    161.5 (3.7)    1.0 (0.0)
  SLDA              57.8 (0.2)    152.6 (2.0)    1.9 (0.0)
  GLOSS             31.2 (0.1)    123.8 (1.8)    1.0 (0.0)
  GLOSS-D           18.5 (0.1)    357.5 (2.8)    1.0 (0.0)
Sim 4: K = 4, mean shift, ind. features
  PLDA              60.3 (0.1)    336.0 (5.8)    3.0 (0.0)
  SLDA              65.9 (0.1)    208.8 (1.6)    2.7 (0.0)
  GLOSS             60.7 (0.2)     74.3 (2.2)    2.7 (0.0)
  GLOSS-D           58.8 (0.1)    162.7 (4.9)    2.9 (0.0)
[Figure: scatter plot of TPR versus FPR (both in %) for gloss, gloss-d, slda and plda, with one marker style per setup (Simulations 1 to 4).]

Figure 6.1: TPR versus FPR (in %) for all algorithms and simulations.
Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

           Simulation 1    Simulation 2    Simulation 3    Simulation 4
           TPR    FPR      TPR    FPR      TPR    FPR      TPR    FPR
PLDA       99.0   78.2     96.9   60.3     98.0   15.9     74.3   65.6
SLDA       73.9   38.5     33.8   16.3     41.6   27.8     50.7   39.5
GLOSS      64.1   10.6     30.0    4.6     51.1   18.2     26.0   12.1
GLOSS-D    93.5   39.4     92.1   28.1     95.6   65.5     42.9   29.9
method that does not succeed in uncovering a low-dimensional representation in Simulation 3. The adequacy of the selected features was assessed by the true positive rate (TPR) and the false positive rate (FPR). The TPR is the proportion of relevant variables that are selected; similarly, the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR, but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).
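With these definitions, the two rates reduce to a couple of boolean operations (an illustrative numpy helper; `tpr_fpr` is our name):

```python
import numpy as np

def tpr_fpr(selected, relevant):
    """TPR: fraction of truly relevant variables that are selected.
    FPR: fraction of irrelevant variables that are selected."""
    selected = np.asarray(selected, bool)
    relevant = np.asarray(relevant, bool)
    tpr = (selected & relevant).sum() / relevant.sum()
    fpr = (selected & ~relevant).sum() / (~relevant).sum()
    return tpr, fpr

relevant = np.arange(500) < 100     # first 100 variables matter
selected = np.arange(500) < 150     # a method keeping 150 variables
tpr, fpr = tpr_fpr(selected, relevant)
print(tpr, fpr)   # TPR = 1.0, FPR = 50/400 = 0.125
```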
6.4 Gene Expression Data

We now compare GLOSS with PLDA and SLDA on three genomic datasets. The Nakayama2 dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors; it was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011). The Ramaswamy3 dataset contains 198 examples of 16,063 gene expressions for categorizing 14 classes of cancer. Finally, the Sun4 dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

2http://www.broadinstitute.org/cancer/software/genepattern/datasets
3http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736

Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits, with standard deviations, of the test error rates and the number of selected variables.

                                          Err (%)        Var
Nakayama (n = 86, p = 22,283, K = 5)
  PLDA                                    20.95 (1.3)    10478.7 (2116.3)
  SLDA                                    25.71 (1.7)      252.5 (3.1)
  GLOSS                                   20.48 (1.4)      129.0 (18.6)
Ramaswamy (n = 198, p = 16,063, K = 14)
  PLDA                                    38.36 (6.0)    14873.5 (720.3)
  SLDA                                    —                 —
  GLOSS                                   20.61 (6.9)      372.4 (122.1)
Sun (n = 180, p = 54,613, K = 4)
  PLDA                                    33.78 (5.9)    21634.8 (7443.2)
  SLDA                                    36.22 (6.5)      384.4 (16.5)
  GLOSS                                   31.77 (4.5)       93.0 (93.6)
Each dataset was split into a training set and a test set, with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.
Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution, due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.
Finally, Figure 6.2 displays the projection of the observations for the Nakayama and Sun datasets on the first canonical planes estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high collinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.
4http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962
[Figure: first canonical planes (1st discriminant vs. 2nd discriminant) estimated by GLOSS (left column) and SLDA (right column). Top row: Nakayama dataset, classes 1) Synovial sarcoma, 2) Myxoid liposarcoma, 3) Dedifferentiated liposarcoma, 4) Myxofibrosarcoma, 5) Malignant fibrous histiocytoma. Bottom row: Sun dataset, classes 1) NonTumor, 2) Astrocytomas, 3) Glioblastomas, 4) Oligodendrogliomas.]

Figure 6.2: 2D representations of the Nakayama and Sun datasets based on the first two discriminant vectors provided by GLOSS and SLDA. The big squares represent class means.
Figure 6.3: USPS digits "1" and "0".
6.5 Correlated Data
When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily incorporate this prior knowledge.
The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results are left for future work.
For this illustration, we have used a subset of the USPS handwritten digit dataset, made of 16 × 16 pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and S-GLOSS. The mean image of each digit is shown in Figure 6.3.
As in Section 5.6.4, we have encoded the pixel proximity relationships from Figure 5.2 in a penalty matrix ΩL, but this time on a 256-node graph. Introducing this new 256 × 256 Laplacian penalty matrix ΩL in the GLOSS algorithm is straightforward.
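Such a Laplacian penalty matrix can be built directly from the pixel grid (a sketch assuming 4-connectivity between horizontally and vertically adjacent pixels; `grid_laplacian` is our name):

```python
import numpy as np

def grid_laplacian(h, w):
    """Graph Laplacian L = D - A of the 4-neighbour pixel grid:
    one node per pixel, one edge between adjacent pixels."""
    n = h * w
    A = np.zeros((n, n))
    for r in range(h):
        for c in range(w):
            i = r * w + c
            if c + 1 < w:                  # right neighbour
                A[i, i + 1] = A[i + 1, i] = 1
            if r + 1 < h:                  # bottom neighbour
                A[i, i + w] = A[i + w, i] = 1
    return np.diag(A.sum(axis=1)) - A

Omega = grid_laplacian(16, 16)             # 256 x 256 penalty matrix
print(Omega.shape)   # (256, 256)
```

The quadratic form βᵀΩLβ then sums the squared differences between neighbouring pixels, which is what penalizes spatially irregular discriminant vectors.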
The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). The center of the digit "0", which is probably the most important element for discriminating the two digits, is clearly visible in the discriminant direction obtained by S-GLOSS.
Figure 6.5 displays the discriminant directions β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels, allowing strokes to be detected, and will probably provide better prediction results.
β for GLOSS | β for S-GLOSS

Figure 6.4: Discriminant direction between digits "1" and "0".

β for GLOSS, λ = 0.3 | β for S-GLOSS, λ = 0.3

Figure 6.5: Sparse discriminant direction between digits "1" and "0".
Discussion
GLOSS is an efficient algorithm that performs sparse LDA based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem; to our knowledge, it is the first approach that enjoys this property in the multi-class setting. This relationship is also amenable to accommodating interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.
Computationally GLOSS is based on an efficient active set strategy that is amenableto the processing of problems with a large number of variables The inner optimizationproblem decouples the p times (K minus 1)-dimensional problem into (K minus 1) independent p-dimensional problems The interaction between the (K minus 1) problems is relegated tothe computation of the common adaptive quadratic penalty The algorithm presentedhere is highly efficient in medium to high dimensional setups which makes it a goodcandidate for the analysis of gene expression data
The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, both regarding its prediction abilities and its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performances. Employing the same features in all discriminant directions makes it possible to generate models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of data from the low-dimensional representations that can be produced.
The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can be added to the group penalty to encode priors about the within-class covariance structure, in the spirit of the penalized discriminant analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.
Part III
Sparse Clustering Analysis
Abstract
Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records, or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.
Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussian, and the different populations are identified with different means and a common covariance matrix.
As in the supervised framework, the traditional clustering techniques perform worse when the number of irrelevant features increases. In this part, we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.
Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.
7 Feature Selection in Mixture Models
7.1 Mixture Models
One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models, which represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.
7.1.1 Model
We assume that the observed data X = (x1ᵀ, ..., xnᵀ)ᵀ have been drawn identically from K different subpopulations in the domain Rp. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compounded distribution whose density can be expressed as

    f(xi) = Σ_{k=1}^K πk fk(xi) ,   ∀i ∈ {1, ..., n} ,

where K is the number of components, fk are the densities of the components, and πk are the mixture proportions (πk ∈ ]0, 1[ for all k, and Σk πk = 1). Mixture models transcribe that, given the proportions πk and the distributions fk for each class, the data are generated according to the following mechanism:

• y: each individual is allotted to a class according to a multinomial distribution with parameters π1, ..., πK;

• x: each xi is assumed to arise from a random vector with probability density function fk.

In addition, it is usually assumed that the component densities fk belong to a parametric family of densities φ(·; θk). The density of the mixture can then be written as

    f(xi; θ) = Σ_{k=1}^K πk φ(xi; θk) ,   ∀i ∈ {1, ..., n} ,
where θ = (π1, ..., πK, θ1, ..., θK) is the parameter of the model.
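For concreteness, such a compounded density can be evaluated as follows (a univariate sketch with Gaussian components, assuming numpy; `mixture_density` is our name):

```python
import numpy as np

def mixture_density(x, pis, mus, sigmas):
    """Density of a univariate Gaussian mixture:
    f(x; theta) = sum_k pi_k * phi(x; mu_k, sigma_k)."""
    x = np.asarray(x, dtype=float)[..., None]   # broadcast over components
    comps = np.exp(-0.5 * ((x - mus) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi))
    return (comps * pis).sum(axis=-1)

pis = np.array([0.3, 0.7])       # mixture proportions, sum to one
mus = np.array([-2.0, 1.0])      # component means
sigmas = np.array([1.0, 1.0])    # component standard deviations
print(mixture_density(0.0, pis, mus, sigmas))
```

Since the proportions sum to one and each component is a density, the mixture integrates to one.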
7.1.2 Parameter Estimation: The EM Algorithm
For the estimation of the parameters of the mixture model, Pearson (1894) used the method of moments to estimate the five parameters (µ1, µ2, σ1², σ2², π) of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods, and Bayesian approaches.
The most widely used approach for estimating the parameters is the maximization of the log-likelihood using the EM algorithm, which is typically employed to maximize the likelihood of models with latent variables, for which no analytical solution is available (Dempster et al., 1977).
The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the expectation of the likelihood with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the expected likelihood from the E-step.
Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed: in practice, the obtained solution depends on the initialization of the algorithm.
Maximum Likelihood Definitions
The likelihood is commonly expressed in its logarithmic version:

    L(θ; X) = log( ∏_{i=1}^n f(xi; θ) )
            = Σ_{i=1}^n log( Σ_{k=1}^K πk fk(xi; θk) ) ,   (7.1)
where n is the number of samples, K is the number of components of the mixture (or number of clusters), and πk are the mixture proportions.
To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of each sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood, or
classification log-likelihood:

    LC(θ; X, Y) = log( ∏_{i=1}^n f(xi, yi; θ) )
                = Σ_{i=1}^n log( Σ_{k=1}^K yik πk fk(xi; θk) )
                = Σ_{i=1}^n Σ_{k=1}^K yik log(πk fk(xi; θk)) .   (7.2)
The yik are the binary entries of the indicator matrix Y, with yik = 1 if observation i belongs to cluster k, and yik = 0 otherwise.
The soft membership tik(θ) is defined as

    tik(θ) = p(Yik = 1 | xi; θ)   (7.3)
           = πk fk(xi; θk) / f(xi; θ) .   (7.4)
To lighten notation, tik(θ) will be denoted tik when the parameter θ is clear from context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:
    LC(θ; X, Y) = Σ_{i,k} yik log(πk fk(xi; θk))
                = Σ_{i,k} yik log(tik f(xi; θ))
                = Σ_{i,k} yik log tik + Σ_{i,k} yik log f(xi; θ)
                = Σ_{i,k} yik log tik + Σ_{i=1}^n log f(xi; θ)
                = Σ_{i,k} yik log tik + L(θ; X) ,   (7.5)
where Σ_{i,k} yik log tik can be reformulated as

    Σ_{i,k} yik log tik = Σ_{i=1}^n Σ_{k=1}^K yik log(p(Yik = 1 | xi; θ))
                        = Σ_{i=1}^n log(p(yi | xi; θ))
                        = log(p(Y | X; θ)) .
As a result, the relationship (7.5) can be rewritten as

    L(θ; X) = LC(θ; Z) − log(p(Y | X; θ)) .   (7.6)
Likelihood Maximization
The complete log-likelihood cannot be evaluated, because the variables yik are unknown. However, the log-likelihood can be expressed by taking expectations on (7.6), conditionally on a current value θ(t) of the parameter:

    L(θ; X) = E_{Y∼p(·|X;θ(t))}[LC(θ; X, Y)] + E_{Y∼p(·|X;θ(t))}[−log p(Y | X; θ)] ,

where the first term is denoted Q(θ, θ(t)) and the second H(θ, θ(t)). In this expression, H(θ, θ(t)) is an entropy term and Q(θ, θ(t)) is the conditional expectation of the complete log-likelihood. Let us define an increment of the log-likelihood as ∆L = L(θ(t+1); X) − L(θ(t); X). Then θ(t+1) = argmax_θ Q(θ, θ(t)) also increases the log-likelihood:

    ∆L = (Q(θ(t+1), θ(t)) − Q(θ(t), θ(t))) − (H(θ(t+1), θ(t)) − H(θ(t), θ(t))) ,

where the first difference is nonnegative by definition of iteration t+1, and the second one is nonpositive by Jensen's inequality. Therefore, it is possible to maximize the likelihood by optimizing Q(θ, θ(t)). The relationship between Q(θ, θ′) and L(θ; X) is developed in deeper detail in Appendix F, to show how the value of L(θ; X) can be recovered from Q(θ, θ(t)).
For the mixture model problem, Q(θ, θ′) is

    Q(θ, θ′) = E_{Y∼p(·|X;θ′)}[LC(θ; X, Y)]
             = Σ_{i,k} p(Yik = 1 | xi; θ′) log(πk fk(xi; θk))
             = Σ_{i=1}^n Σ_{k=1}^K tik(θ′) log(πk fk(xi; θk)) .   (7.7)

Due to its similarity with the expression of the complete likelihood (7.2), Q(θ, θ′) is also known as the weighted likelihood. In (7.7), the weights tik(θ′) are the posterior probabilities of cluster membership.
Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter θ(0).

• E-step: evaluation of Q(θ, θ(t)), using tik(θ(t)) (7.4) in (7.7).

• M-step: computation of θ(t+1) = argmax_θ Q(θ, θ(t)).
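The E-step quantities can be sketched for a univariate Gaussian mixture (illustrative code, not the Mix-GLOSS implementation; the function names are ours):

```python
import numpy as np

def e_step(X, pis, mus, sigma2):
    """E-step for a univariate Gaussian mixture with common variance:
    posterior probabilities t_ik = pi_k f_k(x_i) / f(x_i), as in (7.4)."""
    X = np.asarray(X, float)[:, None]
    fk = np.exp(-0.5 * (X - mus) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    num = pis * fk
    return num / num.sum(axis=1, keepdims=True)

def q_value(X, t, pis, mus, sigma2):
    """Weighted likelihood Q = sum_ik t_ik log(pi_k f_k(x_i)), as in (7.7)."""
    X = np.asarray(X, float)[:, None]
    logf = -0.5 * (X - mus) ** 2 / sigma2 - 0.5 * np.log(2 * np.pi * sigma2)
    return (t * (np.log(pis) + logf)).sum()

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])
t = e_step(X, np.array([0.5, 0.5]), np.array([-2.0, 2.0]), 1.0)
print(t.shape)   # (100, 2)
```

The M-step would then re-estimate the parameters from these responsibilities, as detailed below for the Gaussian case.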
Gaussian Model
In the particular case of a Gaussian mixture model with a common covariance matrix Σ and different mean vectors µk, the mixture density is

    f(xi; θ) = Σ_{k=1}^K πk fk(xi; θk)
             = Σ_{k=1}^K πk (2π)^{−p/2} |Σ|^{−1/2} exp( −½ (xi − µk)ᵀ Σ^{−1} (xi − µk) ) .
At the E-step, the posterior probabilities tik are computed as in (7.4), with the current parameters θ(t); the M-step then maximizes Q(θ, θ(t)) (7.7), whose form is as follows:

    Q(θ, θ(t)) = Σ_{i,k} tik log(πk) − Σ_{i,k} tik log( (2π)^{p/2} |Σ|^{1/2} ) − ½ Σ_{i,k} tik (xi − µk)ᵀ Σ^{−1} (xi − µk)
               = Σ_k tk log(πk) − (np/2) log(2π) − (n/2) log|Σ| − ½ Σ_{i,k} tik (xi − µk)ᵀ Σ^{−1} (xi − µk)
               ≡ Σ_k tk log(πk) − (n/2) log|Σ| − Σ_{i,k} tik ( ½ (xi − µk)ᵀ Σ^{−1} (xi − µk) ) ,   (7.8)

where the term (np/2) log(2π) is constant and

    tk = Σ_{i=1}^n tik .   (7.9)
The M-step, which maximizes this expression with respect to θ, applies the following updates, defining θ(t+1):

    πk(t+1) = tk / n ,   (7.10)
    µk(t+1) = Σ_i tik xi / tk ,   (7.11)
    Σ(t+1) = (1/n) Σ_k Wk ,   (7.12)
    with Wk = Σ_i tik (xi − µk)(xi − µk)ᵀ .   (7.13)
The derivations are detailed in Appendix G
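The updates (7.10)-(7.13) translate directly into code (a numpy sketch; `m_step` is our name, not part of any package):

```python
import numpy as np

def m_step(X, t):
    """M-step updates (7.10)-(7.13) for a Gaussian mixture with common
    covariance: weighted proportions, weighted means, and the pooled
    covariance Sigma = (1/n) sum_k W_k."""
    n, p = X.shape
    tk = t.sum(axis=0)                       # (7.9)
    pis = tk / n                             # (7.10)
    mus = (t.T @ X) / tk[:, None]            # (7.11)
    Sigma = np.zeros((p, p))
    for k in range(t.shape[1]):
        R = X - mus[k]                       # deviations from mu_k
        Sigma += (t[:, k] * R.T) @ R         # W_k, as in (7.13)
    return pis, mus, Sigma / n               # (7.12)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
t = rng.dirichlet([1.0, 1.0], size=200)      # arbitrary soft memberships
pis, mus, Sigma = m_step(X, t)
```

Alternating this update with the E-step of (7.4) yields the EM iteration for this model.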
7.2 Feature Selection in Model-Based Clustering
When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own
covariance matrix Σk, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.
In the high-dimensional, low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989). Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition, Σk = λk Dk Ak Dkᵀ (Banfield and Raftery, 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.
In this chapter, we review some techniques to induce sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.
7.2.1 Based on Penalized Likelihood
Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:

    log( p(Yk = 1 | x) / p(Yℓ = 1 | x) ) = xᵀ Σ^{−1} (µk − µℓ) − ½ (µk + µℓ)ᵀ Σ^{−1} (µk − µℓ) + log(πk / πℓ) .
In this model, a simple way of introducing sparsity in the discriminant vectors Σ^{−1}(µk − µℓ) is to constrain Σ to be diagonal and to favor sparse means µk. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm:

    λ Σ_{k=1}^K Σ_{j=1}^p |µkj| ,
as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:

    λ1 Σ_{k=1}^K Σ_{j=1}^p |µkj| + λ2 Σ_{k=1}^K Σ_{j=1}^p Σ_{m=1}^p |(Σk^{−1})jm| .
In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.
Guo et al. (2010) propose a variation with a pairwise fusion penalty (PFP):

    λ Σ_{j=1}^p Σ_{1≤k<k′≤K} |µkj − µk′j| .
This PFP regularization does not shrink the means to zero, but towards each other. If the jth components of all the cluster means are driven to the same value, that variable can be considered as non-informative.
An L1,∞ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:

    λ Σ_{j=1}^p ||(µ1j, µ2j, ..., µKj)||∞ .
One group is defined for each variable j, as the set of the jth components of the K means, (µ1j, ..., µKj). The L1,∞ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time; in addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG achieves genuine feature selection, because it forces null values for the same variable in all cluster means:
    λ √K Σ_{j=1}^p ( Σ_{k=1}^K µkj² )^{1/2} .
The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code is available on the authors' website to allow testing.
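For reference, the value of this VMG group-Lasso penalty is straightforward to compute (a sketch; `vmg_penalty` is our name):

```python
import numpy as np

def vmg_penalty(mus, lam):
    """Vertical mean grouping penalty: one group per variable j,
    gathering the j-th component of all K cluster means:
    lam * sqrt(K) * sum_j ||(mu_1j, ..., mu_Kj)||_2."""
    K = mus.shape[0]
    return lam * np.sqrt(K) * np.linalg.norm(mus, axis=0).sum()

mus = np.array([[0.7, 0.0, 0.0],
                [0.0, 0.7, 0.0]])   # variable 3 inactive in all means
print(vmg_penalty(mus, 1.0))
```

A variable whose mean component is zero in every cluster contributes nothing to the penalty, which is exactly what drives whole variables out of the model.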
The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity on the discriminant vector; the generalization from quadratic to non-quadratic penalties is only briefly outlined in their work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.
7.2.2 Based on Model Variants
The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy via conditional independence: the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as
    f(xi | φ, π, θ, ν) = Σ_{k=1}^K πk ∏_{j=1}^p [f(xij | θjk)]^{φj} [h(xij | νj)]^{1−φj} ,
where f(· | θjk) is the distribution function for relevant features and h(· | νj) is the distribution function for the irrelevant ones. The binary vector φ = (φ1, φ2, ..., φp) represents relevance, with φj = 1 if the jth feature is informative and φj = 0 otherwise. The saliency of variable j is then formalized as ρj = P(φj = 1), so all the φj must be treated as missing variables. Thus, the set of parameters is {πk, θjk, νj, ρj}; their estimation is done by means of the EM algorithm (Law et al., 2004).
An original and recent technique is the Fisher-EM algorithm, proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix U ∈ Rp×(K−1), which is updated inside the EM loop with a new step, called the Fisher step (F-step from now on), which maximizes the multi-class Fisher criterion
    tr( (Uᵀ ΣW U)^{−1} Uᵀ ΣB U ) ,   (7.14)
so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then, the F-step updates the projection matrix that projects the data to the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. These parameters can be rewritten as functions of the projection matrix U and of the model parameters in the latent space, so that the matrix U enters the M-step equations.
To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one computes the best sparse orthogonal approximation Ũ of the matrix U that maximizes (7.14). This sparse approximation is defined as the solution of
    min_{Ũ ∈ Rp×(K−1)} ||XU − XŨ||²F + λ Σ_{k=1}^{K−1} ||ũk||1 ,
where XU = XU is the input data projected in the non-sparse space, and ũk is the kth column vector of the projection matrix Ũ. The second possibility is inspired by Qiao et al. (2009) and reformulates Fisher's discriminant (7.14), used to compute the projection matrix, as a regression criterion penalized by a mixture of Lasso and Elastic net penalties:
    min_{A,B ∈ Rp×(K−1)} Σ_{k=1}^K || RW^{−ᵀ} HB,k − A Bᵀ HB,k ||²2 + ρ Σ_{j=1}^{K−1} βjᵀ ΣW βj + λ Σ_{j=1}^{K−1} ||βj||1
    s.t. AᵀA = I_{K−1} ,

where HB ∈ Rp×K is a matrix defined conditionally on the posterior probabilities tik, satisfying HB HBᵀ = ΣB, and HB,k is the kth column of HB; RW ∈ Rp×p is an upper
triangular matrix resulting from the Cholesky decomposition of ΣW; ΣW and ΣB are the p × p within-class and between-class covariance matrices in the observation space; A ∈ Rp×(K−1) and B ∈ Rp×(K−1) are the solutions of the optimization problem, such that B = [β1, ..., βK−1] is the best sparse approximation of U.
The last possibility casts the solution of Fisher's discriminant (7.14) as the solution of the following constrained optimization problem:
    min_{U ∈ Rp×(K−1)} Σ_{j=1}^p || ΣB,j − U Uᵀ ΣB,j ||²2
    s.t. UᵀU = I_{K−1} ,
where ΣB,j is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.
To comply with the constraint stating that the columns of U are orthogonal, the first and second options must be followed by a singular value decomposition of U to restore orthogonality. This is not necessary with the third option, since the penalized version of SVD already guarantees orthogonality.
However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed". Immediately after this paragraph, we can read that, under certain assumptions, their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices, the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized".
7.2.3 Based on Model Selection
Some clustering algorithms recast the feature selection problem as a model selection problem. Following this idea, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:
• X(1): the set of selected relevant variables;

• X(2): the set of variables being considered for inclusion in, or exclusion from, X(1);

• X(3): the set of irrelevant variables.
With those subsets, they define two different models, where Y is the partition to consider:
• M1: f(X | Y) = f(X(1), X(2), X(3) | Y) = f(X(3) | X(2), X(1)) f(X(2) | X(1)) f(X(1) | Y)

• M2: f(X | Y) = f(X(1), X(2), X(3) | Y) = f(X(3) | X(2), X(1)) f(X(2), X(1) | Y)
Model M1 means that the variables in X(2) are independent of the clustering Y; model M2 means that the variables in X(2) depend on the clustering Y. To simplify the algorithm, the subset X(2) is updated only one variable at a time. Therefore, deciding the relevance of variable X(2) amounts to a model selection between M1 and M2. The selection is done via the Bayes factor
B12 = f(X | M1) / f(X | M2) ,

where the high-dimensional term f(X(3) | X(2), X(1)) cancels from the ratio:

B12 = f(X(1), X(2), X(3) | M1) / f(X(1), X(2), X(3) | M2) = f(X(2) | X(1), M1) f(X(1) | M1) / f(X(2), X(1) | M2) .
This factor is approximated, since the integrated likelihoods f(X(1) | M1) and f(X(2), X(1) | M2) are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The term f(X(2) | X(1), M1), when X(2) contains a single variable, can be computed as a linear regression of the variable in X(2) on the variables in X(1); there is also a BIC approximation for this term.
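To make the mechanics concrete, here is a minimal Python sketch (NumPy assumed) of this decision step. The clustering BICs are passed in as plain numbers, and `regression_bic` and `include_candidate` are our illustrative names for the BIC of the regression term f(X(2) | X(1), M1) and for the resulting M1-versus-M2 test; none of these names come from Raftery and Dean's software.

```python
import numpy as np

def regression_bic(x_cand, X_rel):
    """BIC of the linear regression of the candidate variable on the relevant
    set X(1), i.e. the f(X(2)|X(1), M1) term in the Bayes factor."""
    n = len(x_cand)
    A = np.column_stack([np.ones(n), X_rel])       # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, x_cand, rcond=None)
    resid = x_cand - A @ beta
    sigma2 = resid @ resid / n                     # ML estimate of the noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    k = A.shape[1] + 1                             # regression coefficients + variance
    return 2.0 * loglik - k * np.log(n)

def include_candidate(bic_clust_rel, bic_clust_both, x_cand, X_rel):
    """Approximate B12 decision: include the candidate iff BIC(M2) > BIC(M1),
    where BIC(M1) adds the regression term to the clustering BIC of X(1)."""
    bic_m1 = bic_clust_rel + regression_bic(x_cand, X_rel)
    return bic_clust_both > bic_m1
```

In the real algorithm, `bic_clust_rel` and `bic_clust_both` would come from fitting Gaussian mixtures to X(1) and to X(1) ∪ X(2) respectively.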
Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets (X(1) and X(3)) remain the same, but X(2) is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. This algorithm also uses a backward stepwise strategy instead of the forward stepwise search used by Raftery and Dean (2006). Their algorithm allows blocks of indivisible variables to be defined, which in certain situations improves the clustering and its interpretability.
Both algorithms are well motivated and appear to produce good results; however, the computations needed to test the different subsets of variables require a huge amount of time. In practice, they cannot be used for the amount of data considered in this thesis.
8 Theoretical Foundations
In this chapter we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.
We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to derive reduced-rank decision rules using fewer than K − 1 discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.
In the subsequent sections we provide the principles that allow the M-step to be solved as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach to the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.
8.1 Resolving EM with Optimal Scoring
In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.
8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis
LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance
d(x_i, µ_k) = (x_i − µ_k)ᵀ Σ_W⁻¹ (x_i − µ_k) ,

where the µ_k are the p-dimensional centroids and Σ_W is the p × p common within-class covariance matrix.
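For illustration, this distance is a one-liner in Python (NumPy assumed; `mahalanobis_sq` is our name, not part of any package):

```python
import numpy as np

def mahalanobis_sq(x, mu, sigma_w):
    """Squared Mahalanobis distance (x - mu)' Sigma_W^{-1} (x - mu)."""
    diff = x - mu
    # solve() avoids forming the explicit inverse of Sigma_W
    return float(diff @ np.linalg.solve(sigma_w, diff))
```

With Σ_W = I it reduces to the squared Euclidean distance, which gives a quick sanity check.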
The likelihood equations in the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by t_ik (the posterior probabilities computed at the E-step).
Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood

2 l_weight(µ, Σ) = − Σ_{i=1}^n Σ_{k=1}^K t_ik d(x_i, µ_k) − n log(|Σ_W|) ,

which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of penalized maximum likelihood in Gaussian mixtures.
8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis
The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1 in the supervised learning framework. This is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.
8.1.3 Clustering Using Penalized Optimal Scoring
The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to Fisher's discriminative directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_ik in the E-step, the distance between the samples x_i and the centroids µ_k must be evaluated. Depending on whether we are working in the input, OS or LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:
d(x_i, µ_k) = ‖(x_i − µ_k) B_LDA‖₂² − 2 log(π_k) .
This distance defines the computation of the posterior probabilities t_ik in the E-step (see Section 4.2.3). Putting all those elements together, the complete clustering algorithm can be summarized as follows:
1. Initialize the membership matrix Y (for example by the K-means algorithm).

2. Solve the p-OS problem as

   B_OS = (XᵀX + λΩ)⁻¹ XᵀYΘ ,

   where Θ are the K − 1 leading eigenvectors of YᵀX(XᵀX + λΩ)⁻¹XᵀY.

3. Map X to the LDA domain: X_LDA = X B_OS D, with D = diag(α_k⁻¹ (1 − α_k²)^(−1/2)).

4. Compute the centroids M in the LDA domain.

5. Evaluate distances in the LDA domain.

6. Translate distances into posterior probabilities t_ik with

   t_ik ∝ exp[ −(d(x, µ_k) − 2 log(π_k)) / 2 ] .   (8.1)

7. Update the labels using the posterior probabilities matrix: Y = T.

8. Go back to step 2 and iterate until the t_ik converge.
Items 2 to 5 can be interpreted as the M-step, and Item 6 as the E-step, in this alternative view of the EM algorithm for Gaussian mixtures.
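The loop above can be sketched in a few dozen lines of Python (NumPy assumed). This is an illustrative toy, not the Mix-GLOSS implementation: a quadratic penalty Ω = I stands in for the group-Lasso (so no coefficients are actually zeroed), assignments are hardened at each iteration to keep YᵀY diagonal, and the K-means initialization is a crude Lloyd loop.

```python
import numpy as np

def pos_em_cluster(X, K, lam=1.0, n_iter=50, seed=0):
    """Toy version of steps 1-8: EM for Gaussian mixtures where the M-step is a
    penalized optimal scoring regression (quadratic penalty Omega = I)."""
    n, p = X.shape
    X = X - X.mean(axis=0)                          # OS assumes centered data
    rng = np.random.default_rng(seed)
    # Step 1: initialize memberships with a crude K-means
    centers = X[rng.choice(n, size=K, replace=False)]
    for _ in range(10):
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    G = np.linalg.inv(X.T @ X + lam * np.eye(p))    # (X'X + lam*Omega)^{-1}
    for _ in range(n_iter):
        Y = np.zeros((n, K))
        Y[np.arange(n), labels] = 1.0
        counts = Y.sum(axis=0) + 1e-12
        # Step 2: Theta = K-1 leading eigenvectors (normalized by Y'Y)
        W = np.diag(1.0 / np.sqrt(counts))
        evals, evecs = np.linalg.eigh(W @ Y.T @ X @ G @ X.T @ Y @ W)
        idx = np.argsort(evals)[::-1][:K - 1]
        alpha2 = np.clip(evals[idx], 1e-8, 1.0 - 1e-8)
        Theta = W @ evecs[:, idx]
        B = G @ X.T @ Y @ Theta                     # B_OS
        # Step 3: LDA domain, D = diag(alpha^{-1} (1 - alpha^2)^{-1/2})
        XL = X @ B @ np.diag(1.0 / np.sqrt(alpha2 * (1.0 - alpha2)))
        pis = counts / n
        M = (Y.T @ XL) / counts[:, None]            # step 4: centroids
        d2 = ((XL[:, None, :] - M[None]) ** 2).sum(-1)   # step 5: distances
        # Step 6: posteriors, eq. (8.1), via a stabilized softmax
        logit = -0.5 * (d2 - 2.0 * np.log(pis)[None, :])
        logit -= logit.max(axis=1, keepdims=True)
        T = np.exp(logit)
        T /= T.sum(axis=1, keepdims=True)
        new_labels = T.argmax(axis=1)               # step 7 (hard update)
        if np.array_equal(new_labels, labels):      # step 8: iterate to stability
            break
        labels = new_labels
    return labels, B, T
```

On two well-separated Gaussian blobs this recovers the partition; the real Mix-GLOSS additionally estimates Σ and propagates the soft posteriors.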
8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
In the previous section we sketched a clustering algorithm that replaces the M-step with a penalized OS regression. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.
8.2 Optimized Criterion
In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ′) (7.7) so as to maximize the likelihood L(θ) (see Section 7.1.2). Replacing the M-step by a penalized optimal scoring problem is possible, and the link between the penalized optimal scoring problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.
This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.
8.2.1 A Bayesian Derivation
This section sketches a Bayesian treatment of inference limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).
The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.
The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:

f(Σ | Λ_0, ν_0) = 1 / (2^{np/2} |Λ_0|^{n/2} Γ_p(n/2)) · |Σ⁻¹|^{(ν_0 − p − 1)/2} exp( −(1/2) tr(Λ_0⁻¹ Σ⁻¹) ) ,
where ν_0 is the number of degrees of freedom of the distribution, Λ_0 is a p × p scale matrix, and Γ_p is the multivariate gamma function defined as

Γ_p(n/2) = π^{p(p−1)/4} ∏_{j=1}^p Γ( n/2 + (1 − j)/2 ) .
The posterior distribution can be maximized, similarly to the likelihood, through the maximization of
Q(θ, θ′) + log f(Σ | Λ_0, ν_0)

  = Σ_{k=1}^K t_k log π_k − ((n + 1)p / 2) log 2 − (n/2) log |Λ_0| − (p(p + 1)/4) log π
    − Σ_{j=1}^p log Γ( n/2 + (1 − j)/2 ) − ((ν_n − p − 1)/2) log |Σ| − (1/2) tr(Λ_n⁻¹ Σ⁻¹)

  ≡ Σ_{k=1}^K t_k log π_k − (n/2) log |Λ_0| − ((ν_n − p − 1)/2) log |Σ| − (1/2) tr(Λ_n⁻¹ Σ⁻¹) ,   (8.2)
with t_k = Σ_{i=1}^n t_ik ,  ν_n = ν_0 + n ,  Λ_n⁻¹ = Λ_0⁻¹ + S_0 ,  and

S_0 = Σ_{i=1}^n Σ_{k=1}^K t_ik (x_i − µ_k)(x_i − µ_k)ᵀ .
Details of these calculations can be found in textbooks (for example Bishop, 2006; Gelman et al., 2003).
8.2.2 Maximum a Posteriori Estimator
The maximization of (8.2) with respect to µ_k and π_k is of course not affected by the additional prior term, where only the covariance Σ intervenes. The MAP estimator for Σ is simply obtained by differentiating (8.2) with respect to Σ. The details of the calculations follow the same lines as the ones for maximum likelihood, detailed in Appendix G. The resulting estimator for Σ is
Σ_MAP = (1 / (ν_0 + n − p − 1)) (Λ_0⁻¹ + S_0) ,   (8.3)
where S_0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a) if ν_0 is chosen to be p + 1 and Λ_0⁻¹ = λΩ, where Ω is the penalty matrix from the group-Lasso regularization (4.25).
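Under this identification, the MAP estimate is cheap to compute once the soft scatter S_0 is available. A sketch in Python (NumPy assumed; the function name is ours):

```python
import numpy as np

def sigma_map(X, T, mus, lam_omega):
    """MAP covariance (8.3) with nu_0 = p + 1 and Lambda_0^{-1} = lambda * Omega
    (passed as the matrix lam_omega): Sigma_MAP = (lam_omega + S_0) / n."""
    n, p = X.shape
    S0 = np.zeros((p, p))
    for k in range(T.shape[1]):                 # soft within-class scatter S_0
        diff = X - mus[k]
        S0 += (T[:, k][:, None] * diff).T @ diff
    nu0 = p + 1                                 # makes the denominator equal n
    return (lam_omega + S0) / (nu0 + n - p - 1)
```

With `lam_omega` set to the zero matrix, this reduces to the classical maximum likelihood within-class covariance S_0 / n.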
9 Mix-GLOSS Algorithm
Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.
9.1 Mix-GLOSS
The implementation of Mix-GLOSS involves three nested loops, as sketched in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_ik.
When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.
The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.
9.1.1 Outer Loop: Whole Algorithm Repetitions
This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:
• the centered n × p feature matrix X;
• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);
• the number of clusters K;
• the maximum number of iterations for the EM algorithm;
• the convergence tolerance for the EM algorithm;
• the number of whole repetitions of the clustering algorithm;
Figure 9.1: Mix-GLOSS loops scheme
• a p × (K − 1) initial coefficient matrix (optional);
• an n × K initial posterior probability matrix (optional).
For each algorithm repetition, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.
9.1.2 Penalty Parameter Loop
The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, such that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have verified that the implemented warm-start reduces the computation time by a factor of 8 with respect to using a null B matrix and a K-means execution for the initial Y label matrix.
Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the estimated percentage of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.
Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternate variational approach from Appendix D is used, Equation (4.32b) must be replaced by (D.10b).
Algorithm 2: Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize: B ← 0, Y ← K-means(X, K)
Run non-penalized Mix-GLOSS: λ ← 0; (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
lastLAMBDA ← false
repeat
    Estimate λ:
        Compute the gradient at β_j = 0:
            ∂J(B)/∂β_j |_{β_j = 0} = x_jᵀ ( Σ_{m≠j} x_m β_m − YΘ )
        Compute λ_max for every feature using (4.32b):
            λ_max^j = (1/w_j) ‖ ∂J(B)/∂β_j |_{β_j = 0} ‖₂
        Choose λ so as to remove 10% of the relevant features
    Run penalized Mix-GLOSS: (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    if the number of relevant variables in B > minVAR then
        lastLAMBDA ← false
    else
        lastLAMBDA ← true
    end if
until lastLAMBDA

Output: B, L(θ), t_ik, π_k, µ_k, Σ, Y for every λ in the solution path
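The λ-estimation step of Algorithm 2 can be sketched as follows (NumPy assumed; "(4.32b)" is the formula cited above, and the function names are ours, not from the Mix-GLOSS code):

```python
import numpy as np

def lambda_max_per_feature(X, YTheta, B, w):
    """Smallest penalty that zeroes each feature: the norm of the gradient of
    the OS fit at beta_j = 0, divided by the group weight w_j."""
    p = X.shape[1]
    lam = np.empty(p)
    for j in range(p):
        Bj = B.copy()
        Bj[j, :] = 0.0                          # residual with beta_j forced to 0
        grad = X[:, j] @ (X @ Bj - YTheta)      # x_j' (sum_{m != j} x_m beta_m - Y Theta)
        lam[j] = np.linalg.norm(grad) / w[j]
    return lam

def next_lambda(lam_max, frac=0.10):
    """Penalty removing about `frac` of the features: the frac-quantile of the
    per-feature lambda_max values (features with lambda_max below it get zeroed)."""
    return float(np.quantile(lam_max, frac))
```

Increasing `frac` makes the feature selection more aggressive, which matches the tunable percentage mentioned in Section 9.1.2.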
9.1.3 Inner Loop: EM Algorithm
The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence in the posterior probabilities t_ik is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.
Algorithm 3: Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
if (B0, Y0) available then
    B_OS ← B0; Y ← Y0
else
    B_OS ← 0; Y ← K-means(X, K)
end if
convergenceEM ← false; tolEM ← 1e-3
repeat
    M-step:
        (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
        X_LDA = X B_OS diag( α⁻¹ (1 − α²)^(−1/2) )
        π_k, µ_k and Σ as per (7.10), (7.11) and (7.12)
    E-step:
        t_ik as per (8.1)
        L(θ) as per (8.2)
    if (1/n) Σ_i |t_ik − y_ik| < tolEM then
        convergenceEM ← true
    end if
    Y ← T
until convergenceEM
Y ← MAP(T)

Output: B_OS, Θ, L(θ), t_ik, π_k, µ_k, Σ, Y
M-Step
The M-step deals with the estimation of the model parameters, that is, the cluster means µ_k, the common covariance matrix Σ and the prior π_k of every component. In a classical M-step, this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version YΘ of the label matrix. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.
E-Step
The E-step evaluates the posterior probability matrix T using

t_ik ∝ exp[ −(d(x, µ_k) − 2 log(π_k)) / 2 ] .
The convergence of those t_ik is used as the stopping criterion for EM.
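Written with the usual log-sum-exp safeguard, the normalization behind this E-step is (NumPy assumed; a sketch, not the Mix-GLOSS code):

```python
import numpy as np

def e_step(d2, pis):
    """Posterior matrix T from the n x K distance matrix d2 and priors pis,
    normalizing t_ik proportional to exp(-(d_ik - 2 log pi_k) / 2) row by row."""
    logit = -0.5 * (d2 - 2.0 * np.log(pis)[None, :])
    logit -= logit.max(axis=1, keepdims=True)   # stabilize before exponentiating
    T = np.exp(logit)
    return T / T.sum(axis=1, keepdims=True)
```

Subtracting the row maximum before exponentiating leaves the normalized posteriors unchanged while avoiding underflow when the distances are large.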
9.2 Model Selection
Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.
In a first attempt, we tried a classical structure where clustering was performed several times from different initializations for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm took a lot of computing resources, since the stability selection mechanism required a certain number of repetitions that transformed Mix-GLOSS into a lengthy structure of four nested loops.
In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978) but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, also required a large computation time.
The third and definitive attempt (up to now) proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0); the execution with the best log-likelihood is chosen, so the repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism; this time, only one repetition of the algorithm is done for every value of the penalty parameter. This version has been tested with no significant differences in the quality of the clustering, but with a dramatic reduction of the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.

Figure 9.2: Mix-GLOSS model selection diagram
10 Experimental Results
The performance of Mix-GLOSS is measured here with the artificial dataset that was used in Chapter 6.
This synthetic database is interesting because it covers four different situations where feature selection can be applied. It considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except simulation 2, where they are slightly correlated. In simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be 1.7%, 6.7%, 7.3% and 30.0%, respectively. The exact description of every setup has already been given in Section 6.3.
In our tests we have reduced the volume of the problem, because with the original size of 1200 samples and 500 dimensions, some of the algorithms under test took several days (even weeks) to finish. Hence, the definitive database was chosen to maintain approximately the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1 has been adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments and allows a better understanding of the different simulations.
The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.
10.1 Tested Clustering Algorithms
This section compares Mix-GLOSS with the following state-of-the-art methods:
• CS general cov: a model-based clustering with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.
• Fisher EM: this method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "FisherEM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network website.
Figure 10.1: Class mean vectors for each artificial simulation
• SelvarClust/Clustvarsel: implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software implemented in C++ that makes use of the clustering library mixmod (Biernacki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.
After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.
The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network website.
• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. In Kuan et al. (2010), some slight changes are introduced in the penalty term, such as weighting parameters that are particularly important for their dataset. The package LumiWCluster allows clustering using either the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).
• Mix-GLOSS: the clustering algorithm implemented using GLOSS (see Chapter 9). It makes use of an EM algorithm and of the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach to the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.
10.2 Results
Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The performance measures are the following:
• Clustering error (in percentage). To measure the quality of the partition with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows the ideal 0% clustering error to be obtained even if the IDs of the clusters and of the real classes differ.
• Number of discarded features. This value shows the number of variables whose coefficients have been zeroed; therefore, they are not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.
• Time of execution (in hours, minutes or seconds). Finally, the time needed to execute the 25 repetitions for each simulation setup is also measured. These algorithms tend to consume more memory and CPU as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.
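The relabeling-invariant clustering error above can be computed by brute force over cluster permutations when K is small; a short Python sketch (our implementation of the idea, not Wu and Schölkopf's code):

```python
from itertools import permutations

def clustering_error(pred, truth, K):
    """Lowest misclassification rate (in %) over all K! relabelings of the
    predicted cluster IDs, so identical partitions always score 0%."""
    n = len(pred)
    best = n
    for perm in permutations(range(K)):
        errors = sum(1 for p, t in zip(pred, truth) if perm[p] != t)
        best = min(best, errors)
    return 100.0 * best / n
```

For larger K, the factorial scan over permutations would be replaced by an optimal assignment (Hungarian) algorithm.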
The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of relevant variables that are selected; similarly, the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to their high computing time and clustering error, respectively, and as the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.
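With the first 20 of the p = 100 features relevant, the two rates reduce to simple set operations; a sketch (the function name is ours):

```python
def tpr_fpr(selected, relevant, p):
    """TPR: share of relevant features that were selected; FPR: share of
    irrelevant features that were selected. Both returned in percent."""
    sel, rel = set(selected), set(relevant)
    irr = set(range(p)) - rel
    tpr = 100.0 * len(sel & rel) / len(rel)
    fpr = 100.0 * len(sel & irr) / len(irr)
    return tpr, fpr
```

For example, an algorithm keeping the first 25 features would score TPR = 100% and FPR = 6.25% on this setup.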
Results, in percentages, are displayed in Figure 10.2 (or in Table 10.2).
Table 10.1: Experimental results for simulated data

Sim 1: K = 4, mean shift, independent features
                        Err (%)       Var           Time
  CS general cov        46 (15)       985 (72)      884h
  Fisher EM             58 (87)       784 (52)      1645m
  Clustvarsel           602 (107)     378 (291)     383h
  LumiWCluster-Kuan     42 (68)       779 (4)       389s
  LumiWCluster-Wang     43 (69)       784 (39)      619s
  Mix-GLOSS             32 (16)       80 (09)       15h

Sim 2: K = 2, mean shift, dependent features
                        Err (%)       Var           Time
  CS general cov        154 (2)       997 (09)      783h
  Fisher EM             74 (23)       809 (28)      8m
  Clustvarsel           73 (2)        334 (207)     166h
  LumiWCluster-Kuan     64 (18)       798 (04)      155s
  LumiWCluster-Wang     63 (17)       799 (03)      14s
  Mix-GLOSS             77 (2)        841 (34)      2h

Sim 3: K = 4, 1D mean shift, independent features
                        Err (%)       Var           Time
  CS general cov        304 (57)      55 (468)      1317h
  Fisher EM             233 (65)      366 (55)      22m
  Clustvarsel           658 (115)     232 (291)     542h
  LumiWCluster-Kuan     323 (21)      80 (02)       83s
  LumiWCluster-Wang     308 (36)      80 (02)       1292s
  Mix-GLOSS             347 (92)      81 (88)       21h

Sim 4: K = 4, mean shift, independent features
                        Err (%)       Var           Time
  CS general cov        626 (55)      999 (02)      112h
  Fisher EM             567 (104)     55 (48)       195m
  Clustvarsel           732 (4)       24 (12)       767h
  LumiWCluster-Kuan     692 (112)     99 (2)        876s
  LumiWCluster-Wang     697 (119)     991 (21)      825s
  Mix-GLOSS             669 (91)      975 (12)      11h
Table 10.2: TPR versus FPR (in %), average computed over 25 repetitions, for the best performing algorithms

                     Simulation 1    Simulation 2    Simulation 3    Simulation 4
                     TPR     FPR     TPR     FPR     TPR     FPR     TPR     FPR
  MIX-GLOSS          992     015     828     335     884     67      780     12
  LUMI-KUAN          992     28      1000    02      1000    005     50      005
  FISHER-EM          986     24      888     17      838     5825    620     4075
Figure 10.2: TPR versus FPR (in %) for the best performing algorithms and simulations
10.3 Discussion
After reviewing Tables 10.1 and 10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. Depending on the objectives and constraints of the problem, the following observations deserve to be highlighted.
LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior regarding the other performance measures. At the other end on this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on each implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation times are worth mentioning here.
The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Wang and Zhu, 2008) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.
From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all the situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best, or close to the best solution, in terms of fall-out and recall.
Conclusions
Summary
The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.
In this thesis we have proved the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.
The equivalence between LDA and OS problems allows all the resources available for solving regression problems to be brought to bear on linear discrimination. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.
In Part II we have used a variational approach to the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver (GLOSS) algorithm, which has proved its effectiveness, inducing extremely parsimonious models without renouncing any predictive capability. GLOSS has been tested with four artificial and three real datasets, outperforming other state-of-the-art algorithms in almost all situations.
In Part III this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As in the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion to be maximized at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection of the penalty parameter. Also in this case, the theory has been put into practice with the implementation of Mix-GLOSS. By now, due to time constraints, only artificial datasets have been tested, with positive results.
Perspectives
Even if the preliminary results are promising, Mix-GLOSS has not been sufficiently tested. We plan to test it at least with the same real datasets that we used with GLOSS; however, more testing would be recommended in both cases. Those algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables, but other high-dimension, low-sample-size (HDLSS) domains are also possible: identification of male or female silhouettes, of fungal species, or of fish species based on shape and texture (Clemmensen et al., 2011), or the Stirling faces (Roth and Lange, 2004), are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), the well-known Fisher's iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a) have also been tested in the literature.
At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was achieving functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version of both GLOSS and Mix-GLOSS should be made available in the short term.
The theory developed in this thesis, and the programming structure used for its implementation, allow easy alterations to the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix; spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as a Laplacian describing the relationships between variables; this can be used to implement pairwise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-like penalties. Some of those possibilities, such as the diagonal version of GLOSS, have been partially implemented, but they have not been properly tested or even updated with the latest algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the publication deadlines for this thesis.
From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized by Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Not knowing this criterion does not prevent successful simulations of Mix-GLOSS: other mechanisms that do not involve the computation of the true criterion have been used to stop the EM algorithm and to perform model selection. However, further investigation must be done in this direction to assess the convergence properties of the algorithm.
At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was devoted to outlier detection and block clustering. One of the most successful mechanisms for detecting outliers models the population with a mixture model in which the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather together all those points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.
Appendix
A Matrix Properties
Property 1. By definition, ΣW and ΣB are both symmetric matrices:
\[
\Sigma_W = \frac{1}{n}\sum_{k=1}^{g}\sum_{i\in C_k}(x_i-\mu_k)(x_i-\mu_k)^\top,
\qquad
\Sigma_B = \frac{1}{n}\sum_{k=1}^{g} n_k\,(\mu_k-\bar{x})(\mu_k-\bar{x})^\top .
\]
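As a quick numerical illustration (not part of the thesis code), the following Python sketch builds both scatter matrices for random data and checks Property 1, together with the classical decomposition of the total covariance into within-class and between-class parts; the helper name `scatter_matrices` is ours.

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class (Sigma_W) and between-class (Sigma_B) scatter,
    both normalized by n as in Property 1."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    Sw = np.zeros((p, p))
    Sb = np.zeros((p, p))
    for k in np.unique(y):
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        D = Xk - mu_k
        Sw += D.T @ D / n                         # class-k within contribution
        d = (mu_k - xbar)[:, None]
        Sb += len(Xk) * (d @ d.T) / n             # n_k weighted between part
    return Sw, Sb

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = rng.integers(0, 3, size=60)
Sw, Sb = scatter_matrices(X, y)

# Sanity checks: symmetry, and Sigma_W + Sigma_B = total covariance
St = (X - X.mean(axis=0)).T @ (X - X.mean(axis=0)) / len(X)
assert np.allclose(Sw, Sw.T) and np.allclose(Sb, Sb.T)
assert np.allclose(Sw + Sb, St)
```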
Property 2. \(\dfrac{\partial\, x^\top a}{\partial x} = \dfrac{\partial\, a^\top x}{\partial x} = a\)

Property 3. \(\dfrac{\partial\, x^\top A x}{\partial x} = (A + A^\top)\,x\)

Property 4. \(\dfrac{\partial\, |X^{-1}|}{\partial X} = -|X^{-1}|\,(X^{-1})^\top\)

Property 5. \(\dfrac{\partial\, a^\top X b}{\partial X} = ab^\top\)

Property 6. \(\dfrac{\partial}{\partial X}\operatorname{tr}\!\left(AX^{-1}B\right) = -(X^{-1}BAX^{-1})^\top = -X^{-\top}A^\top B^\top X^{-\top}\)
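Properties like these are easy to verify numerically. The sketch below (an illustration of ours, not part of the thesis code) compares the closed form of Property 6 with a central finite-difference approximation of the gradient.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4
A = rng.normal(size=(p, p))
B = rng.normal(size=(p, p))
X = rng.normal(size=(p, p)) + p * np.eye(p)   # well-conditioned test point

def f(M):
    # Scalar function tr(A M^{-1} B) from Property 6
    return np.trace(A @ np.linalg.inv(M) @ B)

# Closed form: d tr(A X^{-1} B)/dX = -(X^{-1} B A X^{-1})^T
Xinv = np.linalg.inv(X)
grad = -(Xinv @ B @ A @ Xinv).T

# Central finite differences, entry by entry
eps = 1e-6
num = np.zeros_like(X)
for i in range(p):
    for j in range(p):
        E = np.zeros_like(X)
        E[i, j] = eps
        num[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

assert np.allclose(grad, num, atol=1e-5)
```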
105
B The Penalized-OS Problem is an Eigenvector Problem
In this appendix we explain why the solution of a penalized optimal scoring regression involves an eigenvector decomposition. The p-OS problem has the form
\[
\min_{\theta_k,\beta_k}\; \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k \tag{B.1}
\]
\[
\text{s.t.}\quad \theta_k^\top Y^\top Y\theta_k = 1, \qquad \theta_\ell^\top Y^\top Y\theta_k = 0 \quad \forall \ell<k,
\]
for k = 1, ..., K−1. The Lagrangian associated to Problem (B.1) is
\[
L_k(\theta_k,\beta_k,\lambda_k,\nu_k) = \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k + \lambda_k\big(\theta_k^\top Y^\top Y\theta_k - 1\big) + \sum_{\ell<k}\nu_\ell\,\theta_\ell^\top Y^\top Y\theta_k. \tag{B.2}
\]
Setting the gradient of (B.2) with respect to β_k to zero gives the value of the optimal β_k*:
\[
\beta_k^* = (X^\top X + \Omega_k)^{-1}X^\top Y\theta_k. \tag{B.3}
\]
The objective function of (B.1) evaluated at β_k* is
\begin{align*}
\min_{\theta_k}\; \|Y\theta_k - X\beta_k^*\|_2^2 + \beta_k^{*\top}\Omega_k\beta_k^*
&= \min_{\theta_k}\; \theta_k^\top Y^\top\big(I - X(X^\top X + \Omega_k)^{-1}X^\top\big)Y\theta_k \\
&= \max_{\theta_k}\; \theta_k^\top Y^\top X(X^\top X + \Omega_k)^{-1}X^\top Y\theta_k. \tag{B.4}
\end{align*}
If the penalty matrix Ω_k is identical for all problems, Ω_k = Ω, then (B.4) corresponds to an eigen-problem where the K−1 score vectors θ_k are the eigenvectors of Y⊤X(X⊤X + Ω)⁻¹X⊤Y.
B.1 How to Solve the Eigenvector Decomposition

Computing an eigen-decomposition of an expression like Y⊤X(X⊤X + Ω)⁻¹X⊤Y is not trivial, due to the p×p inverse; for some datasets p can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.
Let M be the matrix Y⊤X(X⊤X + Ω)⁻¹X⊤Y, so that expression (B.4) can be rewritten in a compact way:
\[
\max_{\Theta\in\mathbb{R}^{K\times(K-1)}}\; \operatorname{tr}\!\left(\Theta^\top M\Theta\right) \quad \text{s.t.}\quad \Theta^\top Y^\top Y\Theta = I_{K-1}. \tag{B.5}
\]
If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the (K−1)×(K−1) matrix M_Θ be Θ⊤MΘ. The classical eigenvector formulation associated to (B.5) is then
\[
M_\Theta v = \lambda v, \tag{B.6}
\]
where v is an eigenvector and λ the associated eigenvalue of M_Θ. Operating,
\[
v^\top M_\Theta v = \lambda \;\Longleftrightarrow\; v^\top \Theta^\top M\Theta\, v = \lambda.
\]
Making the change of variable w = Θv, we obtain an alternative eigenproblem where the w are eigenvectors of M and λ the associated eigenvalues:
\[
w^\top M w = \lambda. \tag{B.7}
\]
Therefore the v are the eigenvectors of the matrix M_Θ, and the w are the eigenvectors of the matrix M. Note that the only difference between the (K−1)×(K−1) matrix M_Θ and the K×K matrix M is the K×(K−1) matrix Θ in the expression M_Θ = Θ⊤MΘ. Then, to avoid the computation of the p×p inverse (X⊤X + Ω)⁻¹, we can use the optimal value of the coefficient matrix B* = (X⊤X + Ω)⁻¹X⊤YΘ in M_Θ:
\begin{align*}
M_\Theta &= \Theta^\top Y^\top X(X^\top X + \Omega)^{-1}X^\top Y\Theta \\
&= \Theta^\top Y^\top X B^*.
\end{align*}
Thus, the eigen-decomposition of the (K−1)×(K−1) matrix M_Θ = Θ⊤Y⊤XB* yields the v eigenvectors of (B.6). To obtain the w eigenvectors of the alternative formulation (B.7), the change of variable w = Θv must be undone.

To summarize, we calculate the v eigenvectors from the eigen-decomposition of the tractable matrix M_Θ, evaluated as Θ⊤Y⊤XB*. The definitive eigenvectors w are then recovered as w = Θv, and the final step is the reconstruction of the optimal score matrix Θ* using the vectors w as its columns. At this point we understand what in the literature is called "updating the initial score matrix": multiplying the initial Θ by the eigenvector matrix V from decomposition (B.6) reverses the change of variable to restore the w vectors. The B matrix also needs to be "updated", by multiplying it by the same matrix of eigenvectors V, in order to account for the initial Θ matrix used in the first computation of B:
\[
B \leftarrow (X^\top X + \Omega)^{-1}X^\top Y\Theta V = BV.
\]
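The procedure above can be sketched numerically. The following Python fragment is our illustrative reading of it (variable names are ours, and the setup is artificial): it builds an initial score matrix Θ0 that is orthonormal in the Y⊤Y metric, computes B with a single p×p solve, eigen-decomposes the small matrix M_Θ, applies the update, and checks the Rayleigh-quotient identity (B.7) together with the score constraint.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 50, 200, 4
X = rng.normal(size=(n, p))
labels = rng.integers(0, K, size=n)
Y = np.eye(K)[labels]                       # n x K class-indicator matrix
Omega = np.eye(p)                           # quadratic penalty matrix

# Initial scores, orthonormal in the Y^T Y metric: Theta0^T (Y^T Y) Theta0 = I
R = np.linalg.cholesky(Y.T @ Y).T           # Y^T Y = R^T R
Q, _ = np.linalg.qr(rng.normal(size=(K, K - 1)))
Theta0 = np.linalg.solve(R, Q)

# One p x p solve gives B, then a small (K-1) x (K-1) eigenproblem
B = np.linalg.solve(X.T @ X + Omega, X.T @ Y @ Theta0)
M_theta = Theta0.T @ Y.T @ X @ B            # = Theta0^T M Theta0
eigval, V = np.linalg.eigh(M_theta)

# "Update" step: undo the change of variable w = Theta v
Theta = Theta0 @ V
B = B @ V

# Checks against the K x K matrix M of the compact formulation (B.5)
M = Y.T @ X @ np.linalg.solve(X.T @ X + Omega, X.T @ Y)
w = Theta[:, -1]                            # score for the leading eigenvalue
assert np.allclose(w @ M @ w, eigval[-1])   # Rayleigh identity (B.7)
assert np.allclose(Theta.T @ (Y.T @ Y) @ Theta, np.eye(K - 1))
```

The key saving is that the eigen-decomposition is performed on a (K−1)×(K−1) matrix, while the p×p system is handled by a single linear solve.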
B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the optimal scoring literature, the score matrix Θ* that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix M = Y⊤X(X⊤X + Ω)⁻¹X⊤Y.

By definition of the eigen-decomposition, the eigenvectors of M (called w in (B.7)) form a basis, so that any score vector θ can be expressed as a linear combination of them:
\[
\theta_k = \sum_{m=1}^{K-1}\alpha_m w_m \quad \text{s.t.}\quad \theta_k^\top\theta_k = 1. \tag{B.8}
\]
The score vectors' normalization constraint θ_k⊤θ_k = 1 can also be expressed as a function of this basis,
\[
\Big(\sum_{m=1}^{K-1}\alpha_m w_m\Big)^{\!\top}\Big(\sum_{m=1}^{K-1}\alpha_m w_m\Big) = 1,
\]
which, by the eigenvector properties, reduces to
\[
\sum_{m=1}^{K-1}\alpha_m^2 = 1. \tag{B.9}
\]
Let M be multiplied by a score vector θ_k, replaced by its linear combination of eigenvectors w_m (B.8):
\[
M\theta_k = M\sum_{m=1}^{K-1}\alpha_m w_m = \sum_{m=1}^{K-1}\alpha_m M w_m.
\]
As the w_m are the eigenvectors of M, the relationship M w_m = λ_m w_m can be used to obtain
\[
M\theta_k = \sum_{m=1}^{K-1}\alpha_m\lambda_m w_m.
\]
Left-multiplying both sides by θ_k⊤, expressed through its linear combination of eigenvectors,
\[
\theta_k^\top M\theta_k = \Big(\sum_{\ell=1}^{K-1}\alpha_\ell w_\ell\Big)^{\!\top}\Big(\sum_{m=1}^{K-1}\alpha_m\lambda_m w_m\Big).
\]
This equation can be simplified using the orthogonality of the eigenvectors, according to which w_ℓ⊤w_m = 0 for any ℓ ≠ m, giving
\[
\theta_k^\top M\theta_k = \sum_{m=1}^{K-1}\alpha_m^2\lambda_m.
\]
The optimization problem (B.5) for discriminant direction k can be rewritten as
\[
\max_{\theta_k\in\mathbb{R}^{K}}\; \theta_k^\top M\theta_k = \max_{\theta_k}\; \sum_{m=1}^{K-1}\alpha_m^2\lambda_m \tag{B.10}
\]
\[
\text{with}\quad \theta_k = \sum_{m=1}^{K-1}\alpha_m w_m \quad\text{and}\quad \sum_{m=1}^{K-1}\alpha_m^2 = 1.
\]
One way of maximizing Problem (B.10) is choosing α_m = 1 for m = k and α_m = 0 otherwise. Hence, as θ_k = Σ_{m=1}^{K−1} α_m w_m, the resulting score vector θ_k is equal to the k-th eigenvector w_k.

As a summary, it can be concluded that the solution to the original problem (B.1) can be achieved by an eigenvector decomposition of the matrix M = Y⊤X(X⊤X + Ω)⁻¹X⊤Y.
C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data have maximal between-class variance under a unitary constraint on the within-class variance:
\[
\max_{\beta\in\mathbb{R}^p}\; \beta^\top\Sigma_B\beta \tag{C.1a}
\]
\[
\text{s.t.}\quad \beta^\top\Sigma_W\beta = 1, \tag{C.1b}
\]
where Σ_B and Σ_W are respectively the between-class variance and the within-class variance of the original p-dimensional data.

The Lagrangian of Problem (C.1) is
\[
L(\beta,\nu) = \beta^\top\Sigma_B\beta - \nu\big(\beta^\top\Sigma_W\beta - 1\big),
\]
so that its first derivative with respect to β is
\[
\frac{\partial L(\beta,\nu)}{\partial\beta} = 2\Sigma_B\beta - 2\nu\Sigma_W\beta.
\]
A necessary optimality condition for β* is that this derivative is zero, that is,
\[
\Sigma_B\beta^* = \nu\,\Sigma_W\beta^*.
\]
Provided Σ_W is full rank, we have
\[
\Sigma_W^{-1}\Sigma_B\,\beta^* = \nu\,\beta^*. \tag{C.2}
\]
Thus the solutions β* match the definition of an eigenvector of the matrix Σ_W⁻¹Σ_B, with eigenvalue ν. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:
\begin{align*}
\beta^{*\top}\Sigma_B\beta^* &= \beta^{*\top}\Sigma_W\Sigma_W^{-1}\Sigma_B\beta^* \\
&= \nu\,\beta^{*\top}\Sigma_W\beta^* && \text{from (C.2)} \\
&= \nu && \text{from (C.1b)}.
\end{align*}
That is, the optimal value of the objective function to be maximized is the eigenvalue ν. Hence ν is the largest eigenvalue of Σ_W⁻¹Σ_B, and β* is any eigenvector corresponding to this maximal eigenvalue.
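In practice, the generalized eigenproblem Σ_Bβ = νΣ_Wβ can be solved directly. The sketch below (an illustration of ours, not the thesis implementation) uses SciPy's generalized symmetric eigensolver, whose normalization convention matches constraint (C.1b).

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
p, K = 3, 3
# Three Gaussian classes with a shared identity covariance
means = rng.normal(scale=3, size=(K, p))
X = np.vstack([rng.normal(size=(40, p)) + means[k] for k in range(K)])
y = np.repeat(np.arange(K), 40)

n = len(X)
xbar = X.mean(axis=0)
Sw = sum((X[y == k] - X[y == k].mean(0)).T @ (X[y == k] - X[y == k].mean(0))
         for k in range(K)) / n
Sb = sum((y == k).sum() * np.outer(X[y == k].mean(0) - xbar,
                                   X[y == k].mean(0) - xbar)
         for k in range(K)) / n

# Generalized eigenproblem Sigma_B beta = nu Sigma_W beta; eigh(a, b)
# normalizes its eigenvectors so that beta^T Sigma_W beta = 1
nu, betas = eigh(Sb, Sw)
beta = betas[:, -1]                        # direction of largest eigenvalue

assert np.isclose(beta @ Sb @ beta, nu[-1])   # objective value = eigenvalue
assert np.isclose(beta @ Sw @ beta, 1.0)      # constraint (C.1b) holds
```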
D Alternative Variational Formulation for the Group-Lasso

In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:
\[
\min_{\tau\in\mathbb{R}^p}\;\min_{B\in\mathbb{R}^{p\times(K-1)}}\; J(B) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j} \tag{D.1a}
\]
\[
\text{s.t.}\quad \sum_{j=1}^{p}\tau_j = 1, \tag{D.1b}
\]
\[
\tau_j \ge 0,\quad j = 1,\dots,p. \tag{D.1c}
\]
Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let B ∈ R^{p×(K−1)} be a matrix composed of row vectors β^j ∈ R^{K−1}, B = (β^{1⊤} ··· β^{p⊤})⊤, and consider the Lagrangian
\[
L(B,\tau,\lambda,\nu_0,\nu_j) = J(B) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j} + \nu_0\Big(\sum_{j=1}^{p}\tau_j - 1\Big) - \sum_{j=1}^{p}\nu_j\tau_j. \tag{D.2}
\]
The starting point is the Lagrangian (D2) that is differentiated with respect to τj toget the optimal value τj
partL(B τ λ ν0 νj)
partτj
∣∣∣∣τj=τj
= 0 rArr minusλw2j
∥∥βj∥∥2
2
τj2 + ν0 minus νj = 0
rArr minusλw2j
∥∥βj∥∥2
2+ ν0τ
j
2 minus νjτj2 = 0
rArr minusλw2j
∥∥βj∥∥2
2+ ν0τ
j
2 = 0
The last two expressions are related through a property of the Lagrange multipliers (complementary slackness), which states that ν_j g_j(τ*) = 0, where ν_j is the Lagrange multiplier and g_j(τ) the corresponding inequality constraint. The optimal τ_j* can then be deduced:
\[
\tau_j^* = \sqrt{\frac{\lambda}{\nu_0}}\; w_j\,\|\beta^j\|_2.
\]
Plugging this optimal value of τ_j into constraint (D.1b),
\[
\sum_{j=1}^{p}\tau_j = 1 \;\Rightarrow\; \tau_j^* = \frac{w_j\,\|\beta^j\|_2}{\sum_{\ell=1}^{p} w_\ell\,\|\beta^\ell\|_2}. \tag{D.3}
\]
With this value of τ_j*, Problem (D.1) is equivalent to
\[
\min_{B\in\mathbb{R}^{p\times(K-1)}}\; J(B) + \lambda\Big(\sum_{j=1}^{p} w_j\,\|\beta^j\|_2\Big)^{\!2}. \tag{D.4}
\]
This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. The square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4); in particular, its solution is expected to be sparse, with some null vectors β^j.
The penalty term of (D.1a) can be conveniently presented as λB⊤ΩB, where
\[
\Omega = \operatorname{diag}\Big(\frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \dots, \frac{w_p^2}{\tau_p}\Big). \tag{D.5}
\]
Using the value of τ_j* from (D.3), each diagonal component of Ω is
\[
(\Omega)_{jj} = \frac{w_j\sum_{\ell=1}^{p} w_\ell\,\|\beta^\ell\|_2}{\|\beta^j\|_2}. \tag{D.6}
\]
In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If J is convex, Problem (D.1) is convex.

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.
Lemma D.2. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (D.4) is
\[
\Big\{V \in \mathbb{R}^{p\times(K-1)} \;:\; V = \frac{\partial J(B)}{\partial B} + 2\lambda\Big(\sum_{\ell=1}^{p} w_\ell\,\|\beta^\ell\|_2\Big)\,G\Big\}, \tag{D.7}
\]
where G ∈ R^{p×(K−1)} is the matrix of rows g^j defined as follows. Let S(B) denote the row-wise support of B, S(B) = {j ∈ {1,...,p} : ‖β^j‖₂ ≠ 0}; then we have
\[
\forall j \in S(B),\quad g^j = w_j\,\|\beta^j\|_2^{-1}\,\beta^j, \tag{D.8}
\]
\[
\forall j \notin S(B),\quad \|g^j\|_2 \le w_j. \tag{D.9}
\]
This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.
Lemma D.3. Problem (D.4) admits at least one solution, which is unique if J(B) is strictly convex. All critical points B* of the objective function verifying the following conditions are global minima. Let S(B*) denote the row-wise support of B*, S(B*) = {j ∈ {1,...,p} : ‖β*^j‖₂ ≠ 0}, and let S̄(B*) be its complement; then we have
\[
\forall j \in S(B^*),\quad -\frac{\partial J(B^*)}{\partial\beta^j} = 2\lambda\Big(\sum_{\ell=1}^{p} w_\ell\,\|\beta^{*\ell}\|_2\Big)\, w_j\,\|\beta^{*j}\|_2^{-1}\,\beta^{*j}, \tag{D.10a}
\]
\[
\forall j \in \bar{S}(B^*),\quad \Big\|\frac{\partial J(B^*)}{\partial\beta^j}\Big\|_2 \le 2\lambda\, w_j \sum_{\ell=1}^{p} w_\ell\,\|\beta^{*\ell}\|_2. \tag{D.10b}
\]
In particular, Lemma D.3 provides a well-defined appraisal of the support of the solution, which is not easily handled from a direct analysis of the variational problem (D.1).
D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and for a given B the gap between these objectives is null at τ* such that
\[
\tau_j^* = \frac{w_j\,\|\beta^j\|_2}{\sum_{\ell=1}^{p} w_\ell\,\|\beta^\ell\|_2}.
\]
Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let τ ∈ R^p be any feasible vector; we have
\begin{align*}
\Big(\sum_{j=1}^{p} w_j\,\|\beta^j\|_2\Big)^{\!2}
&= \Big(\sum_{j=1}^{p} \tau_j^{1/2}\,\frac{w_j\,\|\beta^j\|_2}{\tau_j^{1/2}}\Big)^{\!2} \\
&\le \Big(\sum_{j=1}^{p}\tau_j\Big)\Big(\sum_{j=1}^{p}\frac{w_j^2\,\|\beta^j\|_2^2}{\tau_j}\Big) \\
&\le \sum_{j=1}^{p}\frac{w_j^2\,\|\beta^j\|_2^2}{\tau_j},
\end{align*}
where we used the Cauchy-Schwarz inequality in the second line, and the definition of the feasibility set of τ in the last one.
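The bound and its tightness at τ* are easy to confirm numerically. The following Python snippet (illustrative only, with arbitrary weights and coefficients) compares the two penalty terms.

```python
import numpy as np

rng = np.random.default_rng(0)
p, K = 6, 3
B = rng.normal(size=(p, K - 1))
w = rng.uniform(1, 2, size=p)
norms = np.linalg.norm(B, axis=1)          # row norms ||beta^j||_2

group_pen = (w @ norms) ** 2               # squared group-Lasso penalty (D.4)

def variational_pen(tau):
    # Variational penalty term of (D.1a) for a feasible tau
    return np.sum(w**2 * norms**2 / tau)

# Any feasible tau (positive, summing to one) upper-bounds the penalty ...
tau = rng.dirichlet(np.ones(p))
assert variational_pen(tau) >= group_pen

# ... and the bound is tight at tau* from (D.3)
tau_star = w * norms / (w @ norms)
assert np.isclose(variational_pen(tau_star), group_pen)
```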
This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of τ and β are intertwined.
E Invariance of the Group-Lasso to Unitary Transformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients B0 are optimal for the score values Θ0, and if the optimal scores Θ* are obtained by a unitary transformation of Θ0, say Θ* = Θ0V (where V ∈ R^{M×M} is a unitary matrix), then B* = B0V is optimal conditionally on Θ*; that is, (Θ*, B*) is a global solution of the corresponding optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.
Proposition E.1. Let B̂ be a solution of
\[
\min_{B\in\mathbb{R}^{p\times M}}\; \|Y - XB\|_F^2 + \lambda\sum_{j=1}^{p} w_j\,\|\beta^j\|_2, \tag{E.1}
\]
and let Ỹ = YV, where V ∈ R^{M×M} is a unitary matrix. Then B̃ = B̂V is a solution of
\[
\min_{B\in\mathbb{R}^{p\times M}}\; \|\tilde{Y} - XB\|_F^2 + \lambda\sum_{j=1}^{p} w_j\,\|\beta^j\|_2. \tag{E.2}
\]
Proof. The first-order necessary optimality conditions for B̂ are
\[
\forall j \in S(\hat{B}),\quad 2\,x^{j\top}\big(x^j\hat{\beta}^j - Y\big) + \lambda\, w_j\,\|\hat{\beta}^j\|_2^{-1}\,\hat{\beta}^j = 0, \tag{E.3a}
\]
\[
\forall j \notin S(\hat{B}),\quad 2\,\big\|x^{j\top}\big(x^j\hat{\beta}^j - Y\big)\big\|_2 \le \lambda\, w_j, \tag{E.3b}
\]
where S(B̂) ⊆ {1,...,p} denotes the set of non-zero row vectors of B̂ and S̄(B̂) is its complement.

First, we note that from the definition of B̃ we have S(B̂) = S(B̃). Then we may rewrite the above conditions as follows:
\[
\forall j \in S(\tilde{B}),\quad 2\,x^{j\top}\big(x^j\tilde{\beta}^j - \tilde{Y}\big) + \lambda\, w_j\,\|\tilde{\beta}^j\|_2^{-1}\,\tilde{\beta}^j = 0, \tag{E.4a}
\]
\[
\forall j \notin S(\tilde{B}),\quad 2\,\big\|x^{j\top}\big(x^j\tilde{\beta}^j - \tilde{Y}\big)\big\|_2 \le \lambda\, w_j, \tag{E.4b}
\]
where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by V, using also that VV⊤ = I, so that for all u ∈ R^M, ‖u⊤‖₂ = ‖u⊤V‖₂; Equation (E.4b) is obtained from the same relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for B̃ to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
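The mechanism behind the proposition is that both terms of the objective are invariant to a simultaneous unitary rotation of scores and coefficients. The short Python check below (illustrative only, on random data) verifies this invariance: the Frobenius norm and every row norm ‖β^jV‖₂ = ‖β^j‖₂ survive the rotation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, M = 30, 10, 3
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, M))
B = rng.normal(size=(p, M))
V, _ = np.linalg.qr(rng.normal(size=(M, M)))   # a random unitary matrix
w = rng.uniform(1, 2, size=p)

def objective(Y, B, lam=0.5):
    # Group-Lasso objective (E.1): squared Frobenius loss + weighted row norms
    return (np.linalg.norm(Y - X @ B, 'fro') ** 2
            + lam * np.sum(w * np.linalg.norm(B, axis=1)))

# Rotating scores and coefficients together leaves the objective unchanged
assert np.isclose(objective(Y, B), objective(Y @ V, B @ V))
```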
F Expected Complete Likelihood and Likelihood

Section 7.1.2 explains that, by maximizing the conditional expectation of the complete log-likelihood Q(θ, θ′) (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed from its definition (7.1), but there is a shorter way to compute it from Q(θ, θ′) when the latter is available:
\[
L(\theta) = \sum_{i=1}^{n}\log\Big(\sum_{k=1}^{K}\pi_k f_k(x_i;\theta_k)\Big), \tag{F.1}
\]
\[
Q(\theta,\theta') = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(\theta')\,\log\big(\pi_k f_k(x_i;\theta_k)\big), \tag{F.2}
\]
\[
\text{with}\quad t_{ik}(\theta') = \frac{\pi'_k f_k(x_i;\theta'_k)}{\sum_{\ell}\pi'_\ell f_\ell(x_i;\theta'_\ell)}. \tag{F.3}
\]
In the EM algorithm, θ′ denotes the model parameters at the previous iteration, the t_{ik}(θ′) are the posterior probabilities computed from θ′ at the previous E-step, and θ (without "prime") denotes the parameters of the current iteration, to be obtained by maximizing Q(θ, θ′).
Using (F.3), we have
\begin{align*}
Q(\theta,\theta') &= \sum_{i,k} t_{ik}(\theta')\,\log\big(\pi_k f_k(x_i;\theta_k)\big) \\
&= \sum_{i,k} t_{ik}(\theta')\,\log\big(t_{ik}(\theta)\big) + \sum_{i,k} t_{ik}(\theta')\,\log\Big(\sum_{\ell}\pi_\ell f_\ell(x_i;\theta_\ell)\Big) \\
&= \sum_{i,k} t_{ik}(\theta')\,\log\big(t_{ik}(\theta)\big) + L(\theta).
\end{align*}
In particular, after the evaluation of the t_{ik} in the E-step, where θ = θ′, the log-likelihood can be computed from the value of Q(θ, θ) (7.7) and the entropy of the posterior probabilities:
\begin{align*}
L(\theta) &= Q(\theta,\theta) - \sum_{i,k} t_{ik}(\theta)\,\log\big(t_{ik}(\theta)\big) \\
&= Q(\theta,\theta) + H(T).
\end{align*}
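This identity, L(θ) = Q(θ, θ) + H(T), can be verified numerically. The sketch below (an illustration assuming SciPy, with arbitrary parameters, not the thesis code) computes the log-likelihood of a Gaussian mixture both from its definition (F.1) and from Q plus the posterior entropy.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n, p, K = 100, 2, 3
X = rng.normal(size=(n, p))
pis = np.array([0.5, 0.3, 0.2])
mus = rng.normal(size=(K, p))
Sigma = np.eye(p)

# n x K matrix of pi_k f_k(x_i; theta_k), then posteriors t_ik (F.3)
dens = np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigma)
                        for k in range(K)])
mix = dens.sum(axis=1)
T = dens / mix[:, None]

L = np.sum(np.log(mix))          # log-likelihood from its definition (F.1)
Q = np.sum(T * np.log(dens))     # Q(theta, theta) as in (F.2)
H = -np.sum(T * np.log(T))       # entropy of the posterior probabilities

assert np.isclose(L, Q + H)      # L(theta) = Q(theta, theta) + H(T)
```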
G Derivation of the M-Step Equations
This appendix details the derivation of expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as
\begin{align*}
Q(\theta,\theta') &= \sum_{i,k} t_{ik}(\theta')\,\log\big(\pi_k f_k(x_i;\theta_k)\big) \\
&= \sum_{k}\Big(\sum_{i} t_{ik}\Big)\log\pi_k - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i,k} t_{ik}\,(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k),
\end{align*}
which has to be maximized subject to Σ_k π_k = 1.

The Lagrangian of this problem is
\[
L(\theta) = Q(\theta,\theta') + \lambda\Big(\sum_{k}\pi_k - 1\Big).
\]
The partial derivatives of the Lagrangian are set to zero to obtain the optimal values of π_k, μ_k and Σ.

G.1 Prior probabilities
\[
\frac{\partial L(\theta)}{\partial\pi_k} = 0 \;\Longleftrightarrow\; \frac{1}{\pi_k}\sum_{i} t_{ik} + \lambda = 0,
\]
where λ is identified from the constraint, leading to
\[
\pi_k = \frac{1}{n}\sum_{i} t_{ik}.
\]
G.2 Means
\[
\frac{\partial L(\theta)}{\partial\mu_k} = 0 \;\Longleftrightarrow\; -\frac{1}{2}\sum_{i} t_{ik}\,2\,\Sigma^{-1}(\mu_k - x_i) = 0
\;\Rightarrow\; \mu_k = \frac{\sum_i t_{ik}\,x_i}{\sum_i t_{ik}}.
\]
G.3 Covariance Matrix
\[
\frac{\partial L(\theta)}{\partial\Sigma^{-1}} = 0 \;\Longleftrightarrow\;
\underbrace{\frac{n}{2}\,\Sigma}_{\text{as per Property 4}}
\;-\;
\underbrace{\frac{1}{2}\sum_{i,k} t_{ik}\,(x_i-\mu_k)(x_i-\mu_k)^\top}_{\text{as per Property 5}}
= 0
\;\Rightarrow\; \Sigma = \frac{1}{n}\sum_{i,k} t_{ik}\,(x_i-\mu_k)(x_i-\mu_k)^\top.
\]
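Gathered together, the three updates give the M-step in closed form. The following Python sketch (our illustrative implementation, not the thesis code) applies them from a matrix of posteriors T whose rows sum to one.

```python
import numpy as np

def m_step(X, T):
    """M-step for a Gaussian mixture with a common covariance matrix:
    closed-form pi_k, mu_k and Sigma from the posteriors t_ik."""
    n, p = X.shape
    nk = T.sum(axis=0)                       # sum_i t_ik, one value per class
    pis = nk / n                             # pi_k = (1/n) sum_i t_ik
    mus = (T.T @ X) / nk[:, None]            # mu_k = sum_i t_ik x_i / sum_i t_ik
    Sigma = np.zeros((p, p))
    for k in range(len(nk)):
        D = X - mus[k]
        Sigma += (T[:, k][:, None] * D).T @ D / n   # pooled covariance
    return pis, mus, Sigma

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
T = rng.dirichlet(np.ones(3), size=200)      # fake posteriors, rows sum to 1
pis, mus, Sigma = m_step(X, T)
assert np.isclose(pis.sum(), 1.0)            # prior probabilities sum to one
assert np.allclose(Sigma, Sigma.T)           # Sigma is symmetric
```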
Bibliography
F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. Optimization for Machine Learning, pages 19–54, 2011.

F. R. Bach. Bolasso: model consistent lasso estimation through the bootstrap. In Proceedings of the 25th International Conference on Machine Learning (ICML), 2008.

F. R. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

J. D. Banfield and A. E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, pages 803–821, 1993.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

H. Bensmail and G. Celeux. Regularized Gaussian discriminant analysis through eigenvalue decomposition. Journal of the American Statistical Association, 91(436):1743–1748, 1996.

P. J. Bickel and E. Levina. Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989–1010, 2004.

C. Biernacki, G. Celeux, G. Govaert, and F. Langrognet. MIXMOD Statistical Documentation. http://www.mixmod.org, 2008.

C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.

C. Bouveyron and C. Brunet. Discriminative variable selection for clustering with the sparse Fisher-EM algorithm. Technical Report 1204.2067, ArXiv e-prints, 2012a.

C. Bouveyron and C. Brunet. Simultaneous model-based clustering and visualization in the Fisher discriminative subspace. Statistics and Computing, 22(1):301–324, 2012b.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37(4):373–384, 1995.

L. Breiman and R. Ihaka. Nonlinear discriminant analysis via ACE and scaling. Technical Report 40, University of California, Berkeley, 1984.
T. Cai and W. Liu. A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106(496):1566–1577, 2011.

S. Canu and Y. Grandvalet. Outcomes of the equivalence of adaptive ridge with least absolute shrinkage. Advances in Neural Information Processing Systems, page 445, 1999.

C. Caramanis, S. Mannor, and H. Xu. Robust optimization in machine learning. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 369–402. MIT Press, 2012.

B. Chidlovskii and L. Lecerf. Scalable feature selection for multi-class problems. In W. Daelemans, B. Goethals, and K. Morik, editors, Machine Learning and Knowledge Discovery in Databases, volume 5211 of Lecture Notes in Computer Science, pages 227–240. Springer, 2008.

L. Clemmensen, T. Hastie, D. Witten, and B. Ersbøll. Sparse discriminant analysis. Technometrics, 53(4):406–413, 2011.

C. De Mol, E. De Vito, and L. Rosasco. Elastic-net regularization in learning theory. Journal of Complexity, 25(2):201–230, 2009.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977. ISSN 0035-9246.

D. L. Donoho, M. Elad, and V. N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6–18, 2006.

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, 2000.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

Jianqing Fan and Yingying Fan. High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6):2605, 2008.

R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Human Genetics, 7(2):179–188, 1936.

V. Franc and S. Sonnenburg. Optimized cutting plane algorithm for support vector machines. In Proceedings of the 25th International Conference on Machine Learning, pages 320–327. ACM, 2008.

J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.
J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. Technical Report 1001.0736, ArXiv e-prints, 2010.

J. H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165–175, 1989.

W. J. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.

A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 2003.

D. Ghosh and A. M. Chinnaiyan. Classification and selection of biomarkers in genomic data using lasso. Journal of Biomedicine and Biotechnology, 2:147–154, 2005.

G. Govaert, Y. Grandvalet, X. Liu, and L. F. Sanchez Merchante. Implementation baseline for clustering. Technical Report D71-m12, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D71-m12.pdf, 2010.

G. Govaert, Y. Grandvalet, B. Laval, X. Liu, and L. F. Sanchez Merchante. Implementations of original clustering. Technical Report D72-m24, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D72-m24.pdf, 2011.

Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalization. In Perspectives in Neural Computing, volume 98, pages 201–206, 1998.

Y. Grandvalet and S. Canu. Adaptive scaling for feature selection in SVMs. Advances in Neural Information Processing Systems, 15:553–560, 2002.

L. Grosenick, S. Greer, and B. Knutson. Interpretable classifiers for fMRI improve prediction of purchases. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 16(6):539–548, 2008.

Y. Guermeur, G. Pollastri, A. Elisseeff, D. Zelus, H. Paugam-Moisy, and P. Baldi. Combining protein secondary structure prediction models with ensemble methods of optimal complexity. Neurocomputing, 56:305–327, 2004.

J. Guo, E. Levina, G. Michailidis, and J. Zhu. Pairwise variable selection for high-dimensional model-based clustering. Biometrics, 66(3):793–804, 2010.

I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.

T. Hastie and R. Tibshirani. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):155–176, 1996.

T. Hastie, R. Tibshirani, and A. Buja. Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association, 89(428):1255–1270, 1994.
T. Hastie, A. Buja, and R. Tibshirani. Penalized discriminant analysis. The Annals of Statistics, 23(1):73–102, 1995.

A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

J. Huang, S. Ma, H. Xie, and C. H. Zhang. A group bridge approach for variable selection. Biometrika, 96(2):339–355, 2009.

T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217–226. ACM, 2006.

K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics, 28(5):1356–1378, 2000.

P. F. Kuan, S. Wang, X. Zhou, and H. Chu. A statistical framework for Illumina DNA methylation arrays. Bioinformatics, 26(22):2849–2855, 2010.

T. Lange, M. Braun, V. Roth, and J. Buhmann. Stability-based model selection. Advances in Neural Information Processing Systems, 15:617–624, 2002.

M. H. C. Law, M. A. T. Figueiredo, and A. K. Jain. Simultaneous feature selection and clustering using mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1154–1166, 2004.

Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines. Journal of the American Statistical Association, 99(465):67–81, 2004.

C. Leng. Sparse optimal scoring for multiclass cancer diagnosis and biomarker detection using microarray data. Computational Biology and Chemistry, 32(6):417–425, 2008.

C. Leng, Y. Lin, and G. Wahba. A note on the lasso and related procedures in model selection. Statistica Sinica, 16(4):1273, 2006.

H. Liu and L. Yu. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4):491–502, 2005.

J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.

Q. Mai, H. Zou, and M. Yuan. A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99(1):29–42, 2012.

C. Maugis, G. Celeux, and M. L. Martin-Magniette. Variable selection for clustering with Gaussian mixture models. Biometrics, 65(3):701–709, 2009a.
C. Maugis, G. Celeux, and M. L. Martin-Magniette. SelvarClust: software for variable selection in model-based clustering. http://www.math.univ-toulouse.fr/~maugis/SelvarClustHomepage.html, 2009b.

L. Meier, S. Van De Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 70(1):53–71, 2008.

N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.

B. Moghaddam, Y. Weiss, and S. Avidan. Generalized spectral bounds for sparse LDA. In Proceedings of the 23rd International Conference on Machine Learning, pages 641–648. ACM, 2006.

B. Moghaddam, Y. Weiss, and S. Avidan. Fast pixel/part selection with sparse eigenvectors. In IEEE 11th International Conference on Computer Vision (ICCV 2007), pages 1–8, 2007.

Y. Nesterov. Gradient methods for minimizing composite functions. Preprint, 2007.

S. Newcomb. A generalized theory of the combination of observations so as to obtain the best result. American Journal of Mathematics, 8(4):343–366, 1886.

B. Ng and R. Abugharbieh. Generalized group sparse classifiers with application in fMRI brain decoding. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1065–1071. IEEE, 2011.

M. R. Osborne, B. Presnell, and B. A. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319–337, 2000a.

M. R. Osborne, B. Presnell, and B. A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403, 2000b.

W. Pan and X. Shen. Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research, 8:1145–1164, 2007.

W. Pan, X. Shen, A. Jiang, and R. P. Hebbel. Semi-supervised learning via penalized mixture model with application to microarray sample classification. Bioinformatics, 22(19):2388–2395, 2006.

K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London, 185:71–110, 1894.

S. Perkins, K. Lacker, and J. Theiler. Grafting: fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3:1333–1356, 2003.
Z. Qiao, L. Zhou, and J. Huang. Sparse linear discriminant analysis with applications to high dimensional low sample size data. International Journal of Applied Mathematics, 39(1), 2009.

A. E. Raftery and N. Dean. Variable selection for model-based clustering. Journal of the American Statistical Association, 101(473):168–178, 2006.

C. R. Rao. The utilization of multiple measurements in problems of biological classification. Journal of the Royal Statistical Society, Series B (Methodological), 10(2):159–203, 1948.

S. Rosset and J. Zhu. Piecewise linear regularized solution paths. The Annals of Statistics, 35(3):1012–1030, 2007.

V. Roth. The generalized lasso. IEEE Transactions on Neural Networks, 15(1):16–28, 2004.

V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), volume 307 of ACM International Conference Proceeding Series, pages 848–855, 2008.

V. Roth and T. Lange. Feature selection in clustering problems. In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 473–480. MIT Press, 2004.

C. Sammut and G. I. Webb. Encyclopedia of Machine Learning. Springer-Verlag New York, Inc., 2010.

L. F. Sanchez Merchante, Y. Grandvalet, and G. Govaert. An efficient approach to sparse linear discriminant analysis. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

A. J. Smola, S. V. N. Vishwanathan, and Q. Le. Bundle methods for machine learning. Advances in Neural Information Processing Systems, 20:1377–1384, 2008.

S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.

P. Sprechmann, I. Ramirez, G. Sapiro, and Y. Eldar. Collaborative hierarchical sparse modeling. In Information Sciences and Systems (CISS), 2010 44th Annual Conference on, pages 1–6. IEEE, 2010.

M. Szafranski. Pénalités Hiérarchiques pour l'Intégration de Connaissances dans les Modèles Statistiques. PhD thesis, Université de Technologie de Compiègne, 2008.
M. Szafranski, Y. Grandvalet, and P. Morizet-Mahoudeaux. Hierarchical penalization. Advances in Neural Information Processing Systems, 2008.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.
J. E. Vogt and V. Roth. The group-lasso: l1 regularization versus l1,2 regularization. In Pattern Recognition, 32nd DAGM Symposium, Lecture Notes in Computer Science, 2010.
S. Wang and J. Zhu. Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics, 64(2):440–448, 2008.
D. Witten and R. Tibshirani. Penalized classification using Fisher's linear discriminant. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 73(5):753–772, 2011.
D. M. Witten and R. Tibshirani. A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490):713–726, 2010.
D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.
M. Wu and B. Schölkopf. A local learning approach for clustering. Advances in Neural Information Processing Systems, 19:1529, 2007.
M. C. Wu, L. Zhang, Z. Wang, D. C. Christiani, and X. Lin. Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection. Bioinformatics, 25(9):1145–1151, 2009.
T. T. Wu and K. Lange. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, pages 224–244, 2008.
B. Xie, W. Pan, and X. Shen. Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. Electronic Journal of Statistics, 2:168–172, 2008a.
B. Xie, W. Pan, and X. Shen. Variable selection in penalized model-based clustering via regularization on grouped parameters. Biometrics, 64(3):921–930, 2008b.
C. Yang, X. Wan, Q. Yang, H. Xue, and W. Yu. Identifying main effects and epistatic interactions from large-scale SNP data via adaptive group lasso. BMC Bioinformatics, 11(Suppl 1):S18, 2010.
J. Ye. Least squares linear discriminant analysis. In Proceedings of the 24th International Conference on Machine Learning, pages 1087–1093. ACM, 2007.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 68(1):49–67, 2006.
P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7(2):2541, 2007.
P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37(6A):3468–3497, 2009.
H. Zhou, W. Pan, and X. Shen. Penalized model-based clustering with unconstrained covariance matrices. Electronic Journal of Statistics, 3:1473–1496, 2009.
H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.
H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67(2):301–320, 2005.
Contents

List of figures v
List of tables vii
Notation and Symbols ix

I Context and Foundations 1

1 Context 5

2 Regularization for Feature Selection 9
2.1 Motivations 9
2.2 Categorization of Feature Selection Techniques 11
2.3 Regularization 13
2.3.1 Important Properties 14
2.3.2 Pure Penalties 14
2.3.3 Hybrid Penalties 18
2.3.4 Mixed Penalties 19
2.3.5 Sparsity Considerations 19
2.3.6 Optimization Tools for Regularized Problems 21

II Sparse Linear Discriminant Analysis 25

Abstract 27

3 Feature Selection in Fisher Discriminant Analysis 29
3.1 Fisher Discriminant Analysis 29
3.2 Feature Selection in LDA Problems 30
3.2.1 Inertia Based 30
3.2.2 Regression Based 32

4 Formalizing the Objective 35
4.1 From Optimal Scoring to Linear Discriminant Analysis 35
4.1.1 Penalized Optimal Scoring Problem 36
4.1.2 Penalized Canonical Correlation Analysis 37
4.1.3 Penalized Linear Discriminant Analysis 39
4.1.4 Summary 40
4.2 Practicalities 41
4.2.1 Solution of the Penalized Optimal Scoring Regression 41
4.2.2 Distance Evaluation 42
4.2.3 Posterior Probability Evaluation 43
4.2.4 Graphical Representation 43
4.3 From Sparse Optimal Scoring to Sparse LDA 43
4.3.1 A Quadratic Variational Form 44
4.3.2 Group-Lasso OS as Penalized LDA 47

5 GLOSS Algorithm 49
5.1 Regression Coefficients Updates 49
5.1.1 Cholesky decomposition 52
5.1.2 Numerical Stability 52
5.2 Score Matrix 52
5.3 Optimality Conditions 53
5.4 Active and Inactive Sets 54
5.5 Penalty Parameter 54
5.6 Options and Variants 55
5.6.1 Scaling Variables 55
5.6.2 Sparse Variant 55
5.6.3 Diagonal Variant 55
5.6.4 Elastic net and Structured Variant 55

6 Experimental Results 57
6.1 Normalization 57
6.2 Decision Thresholds 57
6.3 Simulated Data 58
6.4 Gene Expression Data 60
6.5 Correlated Data 63
Discussion 63

III Sparse Clustering Analysis 67

Abstract 69

7 Feature Selection in Mixture Models 71
7.1 Mixture Models 71
7.1.1 Model 71
7.1.2 Parameter Estimation: The EM Algorithm 72
7.2 Feature Selection in Model-Based Clustering 75
7.2.1 Based on Penalized Likelihood 76
7.2.2 Based on Model Variants 77
7.2.3 Based on Model Selection 79

8 Theoretical Foundations 81
8.1 Resolving EM with Optimal Scoring 81
8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis 81
8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis 82
8.1.3 Clustering Using Penalized Optimal Scoring 82
8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis 83
8.2 Optimized Criterion 83
8.2.1 A Bayesian Derivation 84
8.2.2 Maximum a Posteriori Estimator 85

9 Mix-GLOSS Algorithm 87
9.1 Mix-GLOSS 87
9.1.1 Outer Loop: Whole Algorithm Repetitions 87
9.1.2 Penalty Parameter Loop 88
9.1.3 Inner Loop: EM Algorithm 89
9.2 Model Selection 91

10 Experimental Results 93
10.1 Tested Clustering Algorithms 93
10.2 Results 95
10.3 Discussion 97

Conclusions 97

Appendix 103

A Matrix Properties 105

B The Penalized-OS Problem is an Eigenvector Problem 107
B.1 How to Solve the Eigenvector Decomposition 107
B.2 Why the OS Problem is Solved as an Eigenvector Problem 109

C Solving Fisher's Discriminant Problem 111

D Alternative Variational Formulation for the Group-Lasso 113
D.1 Useful Properties 114
D.2 An Upper Bound on the Objective Function 115

E Invariance of the Group-Lasso to Unitary Transformations 117

F Expected Complete Likelihood and Likelihood 119

G Derivation of the M-Step Equations 121
G.1 Prior probabilities 121
G.2 Means 122
G.3 Covariance Matrix 122

Bibliography 123
List of Figures

1.1 MASH project logo 5
2.1 Example of relevant features 10
2.2 Four key steps of feature selection 11
2.3 Admissible sets in two dimensions for different pure norms ||β||p 14
2.4 Two dimensional regularized problems with ||β||1 and ||β||2 penalties 15
2.5 Admissible sets for the Lasso and Group-Lasso 20
2.6 Sparsity patterns for an example with 8 variables characterized by 4 parameters 20
4.1 Graphical representation of the variational approach to Group-Lasso 45
5.1 GLOSS block diagram 50
5.2 Graph and Laplacian matrix for a 3×3 image 56
6.1 TPR versus FPR for all simulations 60
6.2 2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA 62
6.3 USPS digits "1" and "0" 63
6.4 Discriminant direction between digits "1" and "0" 64
6.5 Sparse discriminant direction between digits "1" and "0" 64
9.1 Mix-GLOSS Loops Scheme 88
9.2 Mix-GLOSS model selection diagram 92
10.1 Class mean vectors for each artificial simulation 94
10.2 TPR versus FPR for all simulations 97
List of Tables

6.1 Experimental results for simulated data, supervised classification 59
6.2 Average TPR and FPR for all simulations 60
6.3 Experimental results for gene expression data, supervised classification 61
10.1 Experimental results for simulated data, unsupervised clustering 96
10.2 Average TPR versus FPR for all clustering simulations 96
Notation and Symbols

Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors, and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.

Sets
N  the set of natural numbers, N = {1, 2, ...}
R  the set of reals
|A|  cardinality of a set A (for finite sets, the number of elements)
Ā  complement of set A

Data
X  input domain
xi  input sample, xi ∈ X
X  design matrix, X = (x1ᵀ, ..., xnᵀ)ᵀ
xj  column j of X
yi  class indicator of sample i
Y  indicator matrix, Y = (y1ᵀ, ..., ynᵀ)ᵀ
z  complete data, z = (x, y)
Gk  set of the indices of observations belonging to class k
n  number of examples
K  number of classes
p  dimension of X
i, j, k  indices, running over N

Vectors, Matrices and Norms
0  vector with all entries equal to zero
1  vector with all entries equal to one
I  identity matrix
Aᵀ  transpose of matrix A (ditto for vectors)
A⁻¹  inverse of matrix A
tr(A)  trace of matrix A
|A|  determinant of matrix A
diag(v)  diagonal matrix with v on the diagonal
||v||1  L1 norm of vector v
||v||2  L2 norm of vector v
||A||F  Frobenius norm of matrix A

Probability
E[·]  expectation of a random variable
var[·]  variance of a random variable
N(μ, σ²)  normal distribution with mean μ and variance σ²
W(W, ν)  Wishart distribution with ν degrees of freedom and scale matrix W
H(X)  entropy of random variable X
I(X; Y)  mutual information between random variables X and Y

Mixture Models
yik  hard membership of sample i to cluster k
fk  distribution function for cluster k
tik  posterior probability of sample i belonging to cluster k
T  posterior probability matrix
πk  prior probability or mixture proportion for cluster k
μk  mean vector of cluster k
Σk  covariance matrix of cluster k
θk  parameter vector for cluster k, θk = (μk, Σk)
θ(t)  parameter vector at iteration t of the EM algorithm
f(X; θ)  likelihood function
L(θ; X)  log-likelihood function
LC(θ; X, Y)  complete log-likelihood function

Optimization
J(·)  cost function
L(·)  Lagrangian
β̂  generic notation for the solution with respect to β
β̂ls  least squares solution coefficient vector
A  active set
γ  step size to update the regularization path
h  direction to update the regularization path

Penalized models
λ, λ1, λ2  penalty parameters
Pλ(θ)  penalty term over a generic parameter vector
βkj  coefficient j of discriminant vector k
βk  kth discriminant vector, βk = (βk1, ..., βkp)
B  matrix of discriminant vectors, B = (β1, ..., βK−1)
βʲ  jth row of B = (β¹ᵀ, ..., βᵖᵀ)ᵀ
BLDA  coefficient matrix in the LDA domain
BCCA  coefficient matrix in the CCA domain
BOS  coefficient matrix in the OS domain
XLDA  data matrix in the LDA domain
XCCA  data matrix in the CCA domain
XOS  data matrix in the OS domain
θk  score vector k
Θ  score matrix, Θ = (θ1, ..., θK−1)
Y  label matrix
Ω  penalty matrix
LCP(θ; X, Z)  penalized complete log-likelihood function
ΣB  between-class covariance matrix
ΣW  within-class covariance matrix
ΣT  total covariance matrix
Σ̂B  sample between-class covariance matrix
Σ̂W  sample within-class covariance matrix
Σ̂T  sample total covariance matrix
Λ  inverse of the covariance matrix, or precision matrix
wj  weights
τj  penalty components of the variational approach
Part I
Context and Foundations
This thesis is divided into three parts. In Part I, I introduce the context in which this work was developed, the project that funded it, and the constraints that we had to obey. The models and basic concepts that will be used along this document are also detailed there, and the relevant state of the art is reviewed.
The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments testing its performance against other state-of-the-art mechanisms. Before the algorithm and the experiments are described, its theoretical foundations are provided.

The second contribution is described in Part III, with a structure analogous to that of Part II, but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM algorithm. This part is also furnished with specific theoretical foundations, an experimental section, and a final discussion.
1 Context
The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki documentation, forums, coding templates, and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.

The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC), attached to the University of Technology of Compiègne.
From the research point of view, the members of the consortium must deal with four main goals:

1. Software development of the website framework and APIs.

2. Classification and goal-planning in high dimensional feature spaces.

3. Interfacing the platform with the 3D virtual environment and the robot arm.

4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments.
Figure 1.1: MASH project logo
The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role has been to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community; the last number I was aware of was about 300. Within those 375 extractors, there must be some of them sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code on some reference datasets in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors on a particular dataset does not mean that both are using the same variables.

Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we will also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user who develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of against the whole database. As another example, imagine a user who wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.

As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.
• Clustering Using Mixture Models. This is a well-known technique that models the data as if it were randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means, and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm, and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, a C++ library that can be interfaced with MATLAB. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D71-m12" (Govaert et al., 2010).

• Sparse Clustering Using Penalized Optimal Scoring. This technique again performs clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis.
All details concerning the tool implemented can be found in deliverable "mash-deliverable-D72-m24" (Govaert et al., 2011).
• Table Clustering Using the RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractors space is defined using the RV coefficient, which is a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(Oi, Oj), where Oi and Oj are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors. A detailed description of this technique can be found in deliverables "mash-deliverable-D71-m12" (Govaert et al., 2010) and "mash-deliverable-D72-m24" (Govaert et al., 2011).
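To make the RV coefficient concrete, here is a minimal NumPy sketch. It is illustrative only: the operators are taken to be the centered inter-observation cross-product matrices of each table, which is the textbook definition of the coefficient; the exact MASH operators Oi are described in the cited deliverables, and the data below are random.

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two data tables sharing the same rows."""
    # Center the columns so the cross-products act as covariance operators
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sx = Xc @ Xc.T   # n x n inter-observation operator for table X
    Sy = Yc @ Yc.T   # same operator for table Y
    # RV = <Sx, Sy> / (||Sx||_F ||Sy||_F): a matrix analogue of squared correlation
    return np.trace(Sx @ Sy) / np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
print(rv_coefficient(A, A))      # identical tables give 1 (up to rounding)
print(rv_coefficient(A, 2 * A))  # the coefficient is invariant to rescaling
```

Since both operators are positive semi-definite, the coefficient always lies in [0, 1], which makes 1 − RV usable as a dissimilarity for standard clustering methods.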
I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to meet our engagements. I simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).
2 Regularization for Feature Selection
With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation, and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues also rose. Redundancy or extremely correlated features may appear if two contributors implement the same extractor under different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined, while many algorithms in the field of Machine Learning make use of this statistic.
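The singularity issue is easy to reproduce. The sketch below (illustrative sizes only) shows that with fewer samples than dimensions the sample covariance matrix is rank-deficient, and that a ridge-style diagonal term, one form of the regularization discussed in this chapter, restores invertibility.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10, 50                      # fewer samples than dimensions
X = rng.normal(size=(n, p))
S = np.cov(X, rowvar=False)        # p x p sample covariance matrix

# With n < p, the sample covariance has rank at most n - 1,
# so its inverse is not defined
print(np.linalg.matrix_rank(S))    # at most n - 1 = 9, far below p = 50

# Adding a small diagonal term (regularization) makes it invertible
S_reg = S + 0.1 * np.eye(p)
S_inv = np.linalg.inv(S_reg)       # now well-defined
```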
2.1 Motivations
There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).

As a rule of thumb, in discriminant and clustering problems the complexity of the computations increases with the number of objects in the database, the number of features (dimensionality), and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in supervised environments, and provides easier interpretations in the unsupervised framework. Removing features must be done wisely to avoid discarding critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

• Reduction by feature transformation summarizes the dataset in fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis and Independent Component Analysis are two popular examples.
• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem arises when there is a restriction on the number of variables to preserve, and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper, because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008)
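As an illustration of the selection family, the following sketch (assuming scikit-learn is available; the data and penalty value are made up) shows the Lasso driving the coefficients of irrelevant features to exactly zero, which is precisely the selection effect just described.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 100, 20
X = rng.normal(size=(n, p))

# Only the first three features carry information; the rest is noise
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.1 * rng.normal(size=n)

model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)
print(selected)   # typically the informative features 0, 1, 2, with most noise zeroed
```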
As a basic rule we can use the reduction techniques by feature transformation whenthe majority of the features are relevant and when there is a lot of redundancy orcorrelation On the contrary feature selection techniques are useful when there areplenty of useless or noisy features (irrelevant information) that needs to be filtered outIn the paper of Chidlovskii and Lecerf (2008) we find a great explanation about thedifference between irrelevant and redundant features The following two paragraphs arealmost exact reproductions of their text
ldquoIrrelevant features are those which provide negligible distinguishing information Forexample if the objects are all dogs cats or squirrels and it is desired to classify eachnew animal into one of these three classes the feature of color may be irrelevant if eachof dogs cats and squirrels have about the same distribution of brown black and tanfur colors In such a case knowing that an input animal is brown provides negligibledistinguishing information for classifying the animal as a cat dog or squirrel Featureswhich are irrelevant for a given classification problem are not useful and accordingly afeature that is irrelevant can be filtered out
Redundant features are those which provide distinguishing information but are cu-mulative to another feature or group of features that provide substantially the same dis-tinguishing information Using previous example consider illustrative ldquodietrdquo and ldquodo-mesticationrdquo features Dogs and cats both have similar carnivorous diets while squirrelsconsume nuts and so forth Thus the ldquodietrdquo feature can efficiently distinguish squirrelsfrom dogs and cats although it provides little information to distinguish between dogsand cats Dogs and cats are also both typically domesticated animals while squirrels arewild animals Thus the ldquodomesticationrdquo feature provides substantially the same infor-mation as the ldquodietrdquo feature namely distinguishing squirrels from dogs and cats but notdistinguishing between dogs and cats Thus the ldquodietrdquo and ldquodomesticationrdquo features arecumulative and one can identify one of these features as redundant so as to be filteredout However unlike irrelevant features care should be taken with redundant featuresto ensure that one retains enough of the redundant features to provide the relevant dis-tinguishing information In the foregoing example on may wish to filter out either the
10
22 Categorization of Feature Selection Techniques
Figure 22 The four key steps of feature selection according to Liu and Yu (2005)
ldquodietrdquo feature or the ldquodomesticationrdquo feature but if one removes both the ldquodietrdquo and theldquodomesticationrdquo features then useful distinguishing information is lost
There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.
2.2 Categorization of Feature Selection Techniques
Feature selection is one of the most frequent techniques for preprocessing data, in order to remove irrelevant, redundant, or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.

I reproduce here the scheme that generalizes any feature selection process, as shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps of a feature selection algorithm.

The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews for characterizing feature selection techniques. Inspired by them, I propose a framework that does not cover all the possibilities, but which gives a good summary of the existing ones:

• Depending on the type of integration with the machine learning algorithm, we have:

– Filter Models: filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.

– Wrapper Models: wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block, while
the feature subset evaluation is done in a different one. Therefore, the criterion to optimize and the criterion to evaluate may be different. These algorithms are computationally expensive.

– Embedded Models: embedded models perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation form a single block, and the features are selected to optimize this unique criterion, without re-evaluation in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal, because they are specific to the training process of a given mining algorithm.

• Depending on the feature searching technique:

– Complete: no subsets are missed from evaluation; involves combinatorial searches.

– Sequential: features are added (forward searches) or removed (backward searches) one at a time.

– Random: the initial subset, or even subsequent subsets, are randomly chosen to escape local optima.

• Depending on the evaluation technique:

– Distance Measures: choosing the features that maximize the difference in separability, divergence, or discrimination measures.

– Information Measures: choosing the features that maximize the information gain, that is, minimize the posterior uncertainty.

– Dependency Measures: measuring the correlation between features.

– Consistency Measures: finding a minimum number of features that separate classes as consistently as the full set of features does.

– Predictive Accuracy: using the selected features to predict the labels.

– Cluster Goodness: using the selected features to perform clustering and evaluating the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation, and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness allow evaluating subsets of features, and can be used in wrapper and embedded models.
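As a concrete instance of the filter paradigm with an information measure, the sketch below (assuming scikit-learn; the data are synthetic and the setup is made up for illustration) ranks two features by their estimated mutual information with the class labels.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(3)
n = 500
labels = rng.integers(0, 2, size=n)

informative = labels + 0.3 * rng.normal(size=n)  # shifts with the class
irrelevant = rng.normal(size=n)                  # ignores the class
X = np.column_stack([informative, irrelevant])

# Estimate I(feature; label) for each column, then rank features by it
scores = mutual_info_classif(X, labels, random_state=0)
ranking = np.argsort(scores)[::-1]
print(ranking)   # the informative feature comes out first
```

A filter model would keep the top-ranked features and hand them to any downstream learner, without ever consulting that learner during selection.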
In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized
goal and the process dedicated to achieve this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, however, it is intractable to solve hard selection problems exactly when the number of features exceeds a few tens. Regularization techniques provide a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.
2.3 Regularization
In the machine learning domain the term ldquoregularizationrdquo refers to a technique thatintroduces some extra assumptions or knowledge in the resolution of an optimizationproblem The most popular point of view presents regularization as a mechanism toprevent overfitting but it can also help to fix some numerical issues on ill-posed problems(like some matrix singularities when solving a linear system) besides other interestingproperties like the capacity to induce sparsity thus producing models that are easier tointerpret
An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. This happens, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:
min_β J(β) + λP(β)   (2.1)

min_β J(β)   s.t.   P(β) ≤ t   (2.2)
In the expressions (2.1) and (2.2), the parameters λ and t have a similar function, that is, to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set such that the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. This penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.
In this section I review the pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.
2 Regularization for Feature Selection
Figure 2.3: Admissible sets in two dimensions for different pure norms ‖β‖_p
2.3.1 Important Properties
Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.
Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

∀(x₁, x₂) ∈ X²,  f(tx₁ + (1 − t)x₂) ≤ t f(x₁) + (1 − t) f(x₂)   (2.3)

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.
Sparsity. Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and fewer computational resources.
Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, besides preventing overfitting, is a means of favoring the stability of the solution.
2.3.2 Pure Penalties
For pure penalties, defined as P(β) = ‖β‖_p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and of the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since the convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Figure 2.4: Two-dimensional regularized problems with ‖β‖₁ and ‖β‖₂ penalties
Regularizing a linear model with a norm like ‖β‖_p means that the larger the component |β_j|, the more important the feature x_j in the estimation. On the contrary, the closer it is to zero, the more dispensable it is. In the limit of |β_j| = 0, x_j is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.
A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components (β₁ or β₂) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2), where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function while remaining inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, like the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than that of an L2 penalty. That idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves whose global minimum β^ls lies outside the penalties' admissible regions. The closest point to this β^ls for the L1 regularization is β^l1, and for the L2 regularization it is β^l2. Solution β^l1 is sparse because its second component is zero, while both components of β^l2 are different from zero.
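This geometric picture can be checked numerically. The sketch below (not from the thesis; the unconstrained minimizer beta_ls is made up) uses the fact that, when J is an isotropic quadratic centered at the least squares solution, the constrained problem amounts to the Euclidean projection of β^ls onto the admissible set: projection onto the L1 ball zeroes small components, while projection onto the L2 ball merely rescales them.

```python
import numpy as np

def project_l2(beta, t):
    # Projection onto the L2 ball: components are shrunk, never zeroed.
    norm = np.linalg.norm(beta)
    return beta if norm <= t else beta * (t / norm)

def project_l1(beta, t):
    # Euclidean projection onto the L1 ball (sort-based algorithm);
    # components falling below the threshold become exactly zero.
    if np.abs(beta).sum() <= t:
        return beta.copy()
    u = np.sort(np.abs(beta))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > css - t)[0][-1]
    theta = (css[rho] - t) / (rho + 1.0)
    return np.sign(beta) * np.maximum(np.abs(beta) - theta, 0.0)

beta_ls = np.array([2.0, 0.3, -0.1, 1.5])  # a made-up unconstrained minimizer
print(project_l2(beta_ls, 1.0))  # all components remain non-zero
print(project_l1(beta_ls, 1.0))  # the small components are forced to zero
```

With t = 1, the L1 projection of this β^ls keeps only the two largest components, while the L2 projection keeps all four.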
After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the quantity and the "sharpness" of the vertexes of the greyed out area. For example, an L_{1/3} penalty has an admissible region with sharper vertexes, which would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L_{1/3} ball results in difficulties during optimization that do not occur with a convex shape.
To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with L_p norms with p ≤ 1, due to the fact that they are the only ones that have vertexes. On the other hand, only norms with p ≥ 1 are convex; hence the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.
L0 Penalties. The L0 pseudo-norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ‖β‖₀ = card{β_j | β_j ≠ 0}:

min_β J(β)   s.t.   ‖β‖₀ ≤ t   (2.4)

where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ, if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. The solutions are sparse but unstable.
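Since the computation relies on combinatorial schemes, an exact L0-constrained solver is only feasible for small p. A hedged brute-force sketch (the function name and toy data are made up for illustration):

```python
import itertools
import numpy as np

def best_subset(X, y, t):
    """Exact L0-constrained least squares: try every subset of at most
    t features and keep the one with the smallest residual sum of squares."""
    n, p = X.shape
    best, best_rss = np.zeros(p), np.inf
    for k in range(1, t + 1):
        for S in itertools.combinations(range(p), k):
            idx = list(S)
            coef, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
            rss = np.sum((y - X[:, idx] @ coef) ** 2)
            if rss < best_rss:
                beta = np.zeros(p)
                beta[idx] = coef
                best, best_rss = beta, rss
    return best

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 4] + 0.01 * rng.standard_normal(50)
beta = best_subset(X, y, t=2)
print(np.nonzero(beta)[0])  # recovers the two informative features
```

The number of candidate subsets grows as C(p, t), which is exactly why this exhaustive search is intractable beyond a few tens of features.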
L1 Penalties. Penalties built using the L1 norm induce sparsity and stability. This penalty has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

min_β J(β)   s.t.   Σ_{j=1}^p |β_j| ≤ t   (2.5)
Despite all the advantages of the Lasso, the choice of the right penalty is not as simple as a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, then the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.
The Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012b,a).
The consistency of the problems regularized by a Lasso penalty is also a key feature. Defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large, Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions under which Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Bühlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).
L2 Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used to avoid the square root and solve a linear system. Thus, an L2 penalized optimization problem looks like

min_β J(β) + λ‖β‖₂²   (2.6)
The effect of this penalty is the "equalization" of the components of the parameter that is being penalized. To illustrate this property, let us consider a least squares problem

min_β Σ_{i=1}^n (y_i − x_iᵀβ)²   (2.7)
with solution β^ls = (XᵀX)⁻¹Xᵀy. If some input variables are highly correlated, the estimator β^ls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:
min_β Σ_{i=1}^n (y_i − x_iᵀβ)² + λ Σ_{j=1}^p β_j²
The solution to this problem is β^l2 = (XᵀX + λI_p)⁻¹Xᵀy. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performance.
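This eigenvalue shift is easy to verify numerically. A small sketch (not from the thesis) on made-up, nearly collinear data: adding λI_p moves every eigenvalue of XᵀX up by exactly λ, which drastically improves the condition number.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two nearly collinear columns make X^T X badly conditioned.
x1 = rng.standard_normal(100)
X = np.column_stack([x1,
                     x1 + 1e-3 * rng.standard_normal(100),
                     rng.standard_normal(100)])
G = X.T @ X
lam = 1.0
eig_plain = np.linalg.eigvalsh(G)
eig_ridge = np.linalg.eigvalsh(G + lam * np.eye(3))
print(eig_plain.min(), eig_ridge.min())   # smallest eigenvalue moves up by lam
print(eig_plain.max() / eig_plain.min())  # huge condition number
print(eig_ridge.max() / eig_ridge.min())  # much smaller after regularization
```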
As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrote, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:
min_β Σ_{i=1}^n (y_i − x_iᵀβ)² + λ Σ_{j=1}^p β_j² / (β_j^ls)²   (2.8)
The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is the adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002),
where the penalty parameter differs for each component. There, every λ_j is optimized to penalize more or less, depending on the influence of β_j in the model.
Although L2 penalized problems are stable, they are not sparse. That makes those models harder to interpret, especially in high dimensions.
L∞ Penalties. A special case of L_p norms is the infinity norm, defined as ‖x‖∞ = max(|x₁|, |x₂|, …, |x_p|). The admissible region for a penalty like ‖β‖∞ ≤ t is displayed in Figure 2.3. For the L∞ norm, the greyed out region is a square containing all the β vectors whose largest coefficient is less than or equal to the value of the penalty parameter t.
This norm is not commonly used as a regularization term itself; however, it frequently appears combined in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems, there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing optimality conditions of sparse regularized problems. The dual norm ‖β‖* of a norm ‖β‖ is defined as

‖β‖* = max_{w∈R^p} βᵀw   s.t.   ‖w‖ ≤ 1
In the case of an L_q norm with q ∈ [1, +∞], the dual norm is the L_r norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual, and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why L∞ is so important even if it is not as popular as a penalty itself, because L1 is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).
2.3.3 Hybrid Penalties
There is no reason for using pure penalties in isolation. We can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), with the objective of improving the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is
min_β Σ_{i=1}^n (y_i − x_iᵀβ)² + λ₁ Σ_{j=1}^p |β_j| + λ₂ Σ_{j=1}^p β_j²   (2.9)
The term in λ₁ is a Lasso penalty that induces sparsity in vector β; on the other hand, the term in λ₂ is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotic capability (when n goes to infinity) of always making the right choice of relevant variables.
2.3.4 Mixed Penalties
Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us denote by G_ℓ the group of genes of the ℓ-th process and by d_ℓ the number of genes (variables) in each group, ∀ℓ ∈ {1, …, L}. Thus, the dimension of vector β is the sum of the number of genes of every group: dim(β) = Σ_{ℓ=1}^L d_ℓ. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

‖β‖_(r,s) = [ Σ_ℓ ( Σ_{j∈G_ℓ} |β_j|^s )^{r/s} ]^{1/r}   (2.10)
The pair (r, s) identifies the norms that are combined: an L_s norm within groups and an L_r norm between groups. The L_s norm penalizes the variables in every group G_ℓ, while the L_r norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
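As an illustration, here is a small NumPy function (hypothetical, not the thesis code) evaluating the mixed norm (2.10) for given groups. With (r, s) = (1, 2) it reduces to the group-Lasso penalty, i.e. the sum of within-group L2 norms, and with (r, s) = (2, 2) it recovers the plain L2 norm of the whole vector.

```python
import numpy as np

def mixed_norm(beta, groups, r, s):
    """Mixed (r, s)-norm of Eq. (2.10): an L_s norm within each group,
    then an L_r norm across the vector of group norms."""
    group_norms = np.array([np.sum(np.abs(beta[g]) ** s) ** (1.0 / s)
                            for g in groups])
    return np.sum(group_norms ** r) ** (1.0 / r)

beta = np.array([3.0, 4.0, 0.0, 0.0, 1.0])
groups = [[0, 1], [2, 3], [4]]
# (1,2): the group-Lasso penalty, sum of within-group L2 norms
print(mixed_norm(beta, groups, r=1, s=2))  # 5 + 0 + 1 = 6.0
```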
Several combinations are available; the most popular is the norm ‖β‖_(1,2), known as the group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L_{1,2} norm. Many other mixings are possible, such as ‖β‖_(1,4/3) (Szafranski et al., 2008) or ‖β‖_(1,∞) (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms, such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).
2.3.5 Sparsity Considerations
In this chapter I have reviewed several possibilities that induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to feature-wise parsimonious models. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.
The Lasso and the other L1 penalties encourage solutions such as the one on the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and if the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.
To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown on the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L_{1,2} or L_{1,∞} mixed norms, with the proper definition of groups, can induce sparsity patterns such as
(a) L1: Lasso   (b) L_{1,2}: group-Lasso

Figure 2.5: Admissible sets for the Lasso and group-Lasso
(a) L1 induced sparsity   (b) L_{1,2} group induced sparsity

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters
the one on the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.
2.3.6 Optimization Tools for Regularized Problems
In Caramanis et al. (2012) there is a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified into four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.
In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an algorithm of "active constraints", implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.
Subgradient Descent. Subgradient descent is a generic optimization method that can be used in the setting of penalized problems where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure. On the other hand, many iterations are needed, so the convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β^(t+1) is updated proportionally to the negative subgradient of the function at the current point β^(t):

β^(t+1) = β^(t) − α(s + λs′),   where s ∈ ∂J(β^(t)), s′ ∈ ∂P(β^(t))
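A minimal NumPy sketch of this update for an L1-penalized least squares problem (an illustration, not from the thesis; the data and the decaying step size schedule are made up). Note how the iterates get close to the generating coefficients but, as stated above, subgradient descent does not deliver exactly sparse solutions.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((80, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(80)

lam = 5.0
beta = np.zeros(5)
for t in range(1, 5001):
    s = 2 * X.T @ (X @ beta - y)   # gradient of J(beta) = ||y - X beta||^2
    s_prime = np.sign(beta)        # a subgradient of P(beta) = ||beta||_1
    beta -= (0.001 / np.sqrt(t)) * (s + lam * s_prime)

print(np.round(beta, 2))  # close to (2, 0, -1, 0, 0)
```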
Coordinate Descent. Coordinate descent is based on the first-order optimality conditions of the criterion (2.1). In the case of penalties like the Lasso, setting to zero the first-order derivative with respect to coefficient β_j gives

β_j = (−λ sign(β_j) − ∂J(β)/∂β_j) / (2 Σ_{i=1}^n x_ij²)
In the literature, those algorithms may also be referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution β^ls and updating their values with an iterative thresholding algorithm where β_j^(t+1) = S_λ(∂J(β^(t))/∂β_j). The objective function is optimized with respect to one variable at a time, while all others are kept fixed:
S_λ(∂J(β)/∂β_j) =
   (λ − ∂J(β)/∂β_j) / (2 Σ_{i=1}^n x_ij²)    if ∂J(β)/∂β_j > λ
   (−λ − ∂J(β)/∂β_j) / (2 Σ_{i=1}^n x_ij²)   if ∂J(β)/∂β_j < −λ
   0                                          if |∂J(β)/∂β_j| ≤ λ
   (2.11)
The same principles define "block-coordinate descent" algorithms. In this case, the first-order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).
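To make the update rule concrete, here is a hedged NumPy sketch of coordinate descent with the soft-threshold of (2.11), for J(β) = ‖y − Xβ‖² (the function name and toy data are invented for the example; this is not the thesis implementation). Unlike subgradient descent, the thresholding produces exact zeros.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent with the soft-threshold update of Eq. (2.11)
    for J(beta) = ||y - X beta||^2 (illustrative sketch)."""
    n, p = X.shape
    beta = np.zeros(p)
    z = 2 * np.sum(X ** 2, axis=0)           # denominators 2 * sum_i x_ij^2
    for _ in range(n_iter):
        for j in range(p):
            beta[j] = 0.0                    # partial derivative at beta_j = 0
            g = -2 * X[:, j] @ (y - X @ beta)
            if g > lam:
                beta[j] = (lam - g) / z[j]
            elif g < -lam:
                beta[j] = (-lam - g) / z[j]
            # otherwise beta_j stays exactly 0: the variable is dismissed
    return beta

rng = np.random.default_rng(3)
X = rng.standard_normal((60, 4))
y = X @ np.array([1.5, 0.0, -2.0, 0.0]) + 0.1 * rng.standard_normal(60)
print(np.round(lasso_cd(X, y, lam=10.0), 2))  # features 1 and 3 zeroed out
```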
Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", which stores the indices of the variables with non-zero β_j; it is usually denoted A. The complement of the active set is the "inactive set", denoted A̅, which contains the indices of the variables whose β_j is zero. Thus, the problem can be reduced to the dimensionality of A.
Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There also exists a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy that starts with an empty A has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are intended to be selected.
Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the objective function descent direction, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set A̅ that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions. Their expressions are essential in the selection of the next variable to add to the active set, and to test whether a particular vector β is a solution of Problem (2.1).
These active constraints or working set methods, even if they were originally proposed to solve L1 regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and L1 penalties (Roth, 2004), linear functions and L_{1,2} penalties (Roth and Fischer, 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al., 2003). The algorithm developed in this work belongs to this family of solutions.
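The three tasks above can be sketched in a compact working set loop for the L1-penalized least squares problem (illustrative only; the thesis algorithm is more elaborate and is described in Chapter 5). The inner optimization task is a plain coordinate descent restricted to the active set, warm-started from the previous solution.

```python
import numpy as np

def lasso_working_set(X, y, lam, tol=1e-6):
    """Working set sketch: grow the active set A from empty by adding the
    most violating inactive variable, then re-solve restricted to A."""
    n, p = X.shape
    beta = np.zeros(p)
    active = []
    for _ in range(p):
        # Optimality task: every inactive j must satisfy |dJ/dbeta_j| <= lam
        grad = -2 * X.T @ (y - X @ beta)
        violation = np.abs(grad) - lam
        violation[active] = -np.inf          # handled by the inner solver
        j = int(np.argmax(violation))
        if violation[j] <= tol:
            break                            # optimality conditions hold
        active.append(j)                     # working set update task
        # Optimization task: coordinate descent on the active set only,
        # warm-started from the previous solution
        for _ in range(500):
            for k in active:
                beta[k] = 0.0
                g = -2 * X[:, k] @ (y - X @ beta)
                z = 2 * X[:, k] @ X[:, k]
                beta[k] = (-lam - g) / z if g < -lam else min(lam - g, 0.0) / z
    return beta

rng = np.random.default_rng(4)
X = rng.standard_normal((60, 6))
y = X @ np.array([2.0, 0.0, 0.0, -1.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(60)
print(np.round(lasso_working_set(X, y, lam=10.0), 2))
```

Since variables never activated keep β_j = 0 exactly, the first iterations indeed work in very low dimension, as noted above for the forward view.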
Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built using several secant hyper-planes at different points, obtained from the sub-gradient of the cost function at these points.
This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.
This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims, 2006; Smola et al., 2008; Franc and Sonnenburg, 2008) or Multiple Kernel Learning (Sonnenburg et al., 2006).
Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1), where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).
This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the Least Angle Regression (LARS) algorithm, developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.
Once an active set A^(t) and its corresponding solution β^(t) have been set, looking for the regularization path means looking for a direction h and a step size γ to update the solution as β^(t+1) = β^(t) + γh. Afterwards, the active and inactive sets A^(t+1) and A̅^(t+1) are updated; that can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size, and decides which variable should enter the active set, from the correlations with the residuals.
Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz-differentiable cost function J(β) and a non-differentiable penalty λP(β):

min_{β∈R^p} J(β^(t)) + ∇J(β^(t))ᵀ(β − β^(t)) + λP(β) + (L/2)‖β − β^(t)‖₂²   (2.12)
They are also iterative methods, where the cost function J(β) is linearized in the proximity of the solution β^(t), so that the problem to solve at each iteration looks like (2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. That can be rewritten as
min_{β∈R^p} (1/2)‖β − (β^(t) − (1/L)∇J(β^(t)))‖₂² + (λ/L)P(β)   (2.13)
The basic algorithm makes use of the solution to (2.13) as the next value β^(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle, 2009). Proximal methods can be seen as generalizations of gradient updates; in fact, setting λ = 0 in equation (2.13) recovers the standard gradient update rule.
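For the L1 penalty, the solution of (2.13) is exactly the soft-threshold with level λ/L, which gives the classic ISTA iteration. A hedged NumPy sketch (illustrative data and names, not the thesis algorithm):

```python
import numpy as np

def ista(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) sketch for J(beta) = ||y - X beta||^2
    plus lam * ||beta||_1."""
    # L: Lipschitz constant of the gradient of J, i.e. 2 * lambda_max(X^T X)
    L = 2 * np.linalg.eigvalsh(X.T @ X).max()
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        v = beta - (2 * X.T @ (X @ beta - y)) / L   # gradient step on J
        # proximal operator of (lam/L) * L1: soft-thresholding, as in (2.13)
        beta = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)
    return beta

rng = np.random.default_rng(6)
X = rng.standard_normal((60, 4))
y = X @ np.array([1.5, 0.0, -2.0, 0.0]) + 0.1 * rng.standard_normal(60)
print(np.round(ista(X, y, lam=10.0), 2))  # exact zeros on the noise features
```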
Part II
Sparse Linear Discriminant Analysis
Abstract
Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.
There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Section 2.3). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee models that are parsimonious with respect to variables.
In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach to LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis and also enables the derivation of variants, such as an LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of the data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification. The first two or three directions can also be used to project the data, generating a graphical display. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performance. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.
3 Feature Selection in Fisher Discriminant Analysis
3.1 Fisher Discriminant Analysis
Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates the classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of the data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part we focus on Fisher's discriminant analysis, a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities but rather on some inertia principles (Fisher, 1936).
We consider that the data consist of a set of n examples, with observations x_i ∈ R^p comprising p features, and labels y_i ∈ {0,1}^K indicating the exclusive assignment of observation x_i to one of the K classes. It will be convenient to gather the observations in the n×p matrix X = (x₁ᵀ, …, x_nᵀ)ᵀ and the corresponding labels in the n×K matrix Y = (y₁ᵀ, …, y_nᵀ)ᵀ.
Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

max_{β∈R^p} (βᵀΣ_B β) / (βᵀΣ_W β)   (3.1)
where β is the discriminant direction used to project the data, and Σ_B and Σ_W are the p×p between-class and within-class covariance matrices, respectively, defined (for a K-class problem) as

Σ_W = (1/n) Σ_{k=1}^K Σ_{i∈G_k} (x_i − μ_k)(x_i − μ_k)ᵀ

Σ_B = (1/n) Σ_{k=1}^K Σ_{i∈G_k} (μ − μ_k)(μ − μ_k)ᵀ
where μ is the sample mean of the whole dataset, μ_k is the sample mean of class k, and G_k indexes the observations of class k.
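These definitions are straightforward to compute. A hedged NumPy sketch on made-up two-class data (not the thesis code), which also checks the classical decomposition Σ_W + Σ_B = total covariance implied by the (1/n)-normalized definitions above:

```python
import numpy as np

def scatter_matrices(X, y):
    """Within- and between-class covariance matrices of Section 3.1,
    both normalized by n so that Sigma_W + Sigma_B = total covariance."""
    n, p = X.shape
    mu = X.mean(axis=0)
    Sw, Sb = np.zeros((p, p)), np.zeros((p, p))
    for k in np.unique(y):
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        Sw += (Xk - mk).T @ (Xk - mk)
        Sb += len(Xk) * np.outer(mk - mu, mk - mu)
    return Sw / n, Sb / n

rng = np.random.default_rng(5)
X = np.vstack([rng.standard_normal((30, 3)) + m
               for m in ([0, 0, 0], [3, 0, 0])])
y = np.repeat([0, 1], 30)
Sw, Sb = scatter_matrices(X, y)
total = np.cov(X.T, bias=True)          # 1/n-normalized total covariance
print(np.allclose(Sw + Sb, total))      # True
```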
This analysis can be extended to the multi-class framework with K groups. In this case, K − 1 discriminant vectors β_k may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

max_{B∈R^{p×(K−1)}} tr(BᵀΣ_B B) / tr(BᵀΣ_W B)   (3.2)
where the matrix B is built with the discriminant directions β_k as columns. Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K − 1 subproblems:

max_{β_k∈R^p} β_kᵀ Σ_B β_k
s.t.   β_kᵀ Σ_W β_k ≤ 1
       β_kᵀ Σ_W β_ℓ = 0, ∀ℓ < k
   (3.3)
The maximizer of subproblem k is the eigenvector of Σ_W⁻¹ Σ_B associated with the k-th largest eigenvalue (see Appendix C).
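A minimal numerical sketch of this eigen-solution (with illustrative matrices; in practice, with large p, Σ_W may be singular and need regularization before inversion):

```python
import numpy as np

# Discriminant directions as eigenvectors of Sigma_W^{-1} Sigma_B,
# using made-up 2x2 covariance matrices for illustration.
Sw = np.array([[2.0, 0.3], [0.3, 1.0]])
Sb = np.array([[1.0, 0.5], [0.5, 0.5]])
vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
order = np.argsort(np.real(vals))[::-1]
beta1 = np.real(vecs[:, order[0]])     # first discriminant direction
lam1 = np.real(vals[order[0]])         # associated (largest) eigenvalue
print(np.real(vals[order]))
```

By construction, β₁ satisfies the generalized eigenproblem Σ_B β = λ Σ_W β, which is exactly the stationarity condition of subproblem (3.3).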
3.2 Feature Selection in LDA Problems
LDA is often used as a data reduction technique, where the K − 1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.
Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. The main target of this sparsity is to reduce the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness of the solution, or computational constraints.
The easiest approach to sparse LDA performs variable selection before discrimination. The relevancy of each feature is usually assessed by univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.
They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, which is either Fisher's discriminant analysis (variance-based) or regression-based.
3.2.1 Inertia Based
The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away (large between-class variance) and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.
Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al., 2007).
Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where Fisher's discriminant (3.1) is solved as

min_{β∈R^p} βᵀΣ_W β
s.t.   (μ₁ − μ₂)ᵀβ = 1
       Σ_{j=1}^p |β_j| ≤ t
where micro1 and micro2 are vectors of mean gene expression values corresponding to the twogroups The expression to optimize and the first constraint match problem (31) Thesecond constraint encourages parsimony
Witten and Tibshirani (2011) describe a multi-class technique using Fisher's discriminant rewritten as a series of $K-1$ constrained and penalized maximization problems:
\[
\max_{\beta_k \in \mathbb{R}^p} \; \beta_k^\top \Sigma_{\mathrm B}^k \beta_k - P_k(\beta_k)
\quad \text{s.t.} \quad \beta_k^\top \Sigma_{\mathrm W} \beta_k \le 1 .
\]
The term to maximize is the projected between-class covariance $\beta_k^\top \Sigma_{\mathrm B}^k \beta_k$, subject to an upper bound on the projected within-class covariance $\beta_k^\top \Sigma_{\mathrm W} \beta_k$. The penalty $P_k(\beta_k)$ is added to avoid singularities and to induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general-purpose data: the Lasso shrinks the less informative variables to zero, and the fused Lasso encourages a piecewise constant $\beta_k$ vector. The R code is available from the website of Daniela Witten.
Cai and Liu (2011) use Fisher's discriminant to solve a binary LDA problem. Instead of estimating $\Sigma_{\mathrm W}$ and $(\mu_1 - \mu_2)$ separately to obtain the optimal solution $\beta = \Sigma_{\mathrm W}^{-1}(\mu_1 - \mu_2)$, they estimate the product directly, through constrained $L_1$ minimization:
\[
\min_{\beta \in \mathbb{R}^p} \; \|\beta\|_1
\quad \text{s.t.} \quad \big\| \Sigma \beta - (\mu_1 - \mu_2) \big\|_\infty \le \lambda .
\]
Sparsity is encouraged by the $L_1$ norm of the vector $\beta$, and the parameter $\lambda$ tunes the optimization.
Most of the algorithms reviewed here are conceived for binary classification. For those that address multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.
3.2.2 Regression Based
In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For $K > 2$, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging in the multi-class case (Duda et al. 2000, Friedman et al. 2009).
Predefined Indicator Matrix
Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al. 2009). An indicator matrix Y is an $n \times K$ matrix encoding the class labels of all samples. Several well-known types exist in the literature. For example, the binary or dummy indicator ($y_{ik} = 1$ if sample $i$ belongs to class $k$, and $y_{ik} = 0$ otherwise) is commonly used to link multi-class classification with linear regression (Friedman et al. 2009). Another popular choice is $y_{ik} = 1$ if sample $i$ belongs to class $k$ and $y_{ik} = -1/(K-1)$ otherwise; it was used, for example, to extend Support Vector Machines to multi-class classification (Lee et al. 2004) or to generalize the kernel target alignment measure (Guermeur et al. 2004).
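As a concrete sketch, the two codings above can be built in a few lines of plain Python (the helper names below are ours, not taken from any library):

```python
def dummy_indicator(labels, K):
    """Binary (dummy) coding: y_ik = 1 if sample i belongs to class k, else 0."""
    return [[1 if y == k else 0 for k in range(K)] for y in labels]

def symmetric_indicator(labels, K):
    """Alternative coding: y_ik = 1 for the true class, -1/(K-1) otherwise."""
    return [[1.0 if y == k else -1.0 / (K - 1) for k in range(K)] for y in labels]

labels = [0, 2, 1]                  # toy labels for n = 3 samples, K = 3 classes
Y = dummy_indicator(labels, 3)      # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
Z = symmetric_indicator(labels, 3)  # each row sums to zero in this coding
```

Each row of the dummy coding sums to one, while each row of the symmetric coding sums to zero; both identify the class of a sample uniquely.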
Some works propose a formulation of the least squares problem based on a new class indicator matrix (Ye 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition, shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis for the high-dimensional, low-sample-size setting, which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required; the lack of publicly available code also prevented an empirical test of this conjecture. If the similarity were confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.
In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector $\beta$ is obtained by solving
\[
\min_{\beta \in \mathbb{R}^p , \, \beta_0 \in \mathbb{R}} \;
n^{-1} \sum_{i=1}^n \big( y_i - \beta_0 - x_i^\top \beta \big)^2
+ \lambda \sum_{j=1}^p |\beta_j| ,
\]
where $y_i$ is the binary indicator of the label of pattern $x_i$. Even if the authors focus on the Lasso penalty, they note that any other generic sparsity-inducing penalty could be used. The decision rule $x^\top \beta + \beta_0 > 0$ is the LDA classifier when it is built using the resulting $\beta$ vector for $\lambda = 0$, but a different intercept $\beta_0$ is required.
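As an illustrative sketch (not the authors' code), the penalized least squares problem above can be solved by proximal gradient descent (ISTA), where the Lasso penalty translates into a soft-thresholding step:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ls(X, y, lam, n_iter=5000):
    """ISTA for min over (beta0, beta) of
    n^{-1} * sum_i (y_i - beta0 - x_i' beta)^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta, beta0 = np.zeros(p), y.mean()
    step = n / (2.0 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant
    for _ in range(n_iter):
        r = y - beta0 - X @ beta                   # current residuals
        beta = soft_threshold(beta + step * 2.0 * X.T @ r / n, lam * step)
        beta0 = (y - X @ beta).mean()              # optimal intercept given beta
    return beta0, beta
```

With $\lambda = 0$ the procedure converges to the ordinary least squares fit; a large $\lambda$ shrinks all coefficients to zero, and intermediate values yield a sparse $\beta$.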
Optimal Scoring
In binary classification, the regression of (scaled) class indicators makes it possible to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al. 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the class indicators together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani 1996) or more conservative ones (Hastie et al. 1995).
As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix $\Omega$, leading to a problem expressed in compact form as
\[
\min_{\Theta, B} \; \| Y\Theta - XB \|_F^2 + \lambda \, \mathrm{tr}\big( B^\top \Omega B \big) \tag{3.4a}
\]
\[
\text{s.t.} \quad n^{-1} \, \Theta^\top Y^\top Y \Theta = I_{K-1} , \tag{3.4b}
\]
where $\Theta \in \mathbb{R}^{K \times (K-1)}$ are the class scores, $B \in \mathbb{R}^{p \times (K-1)}$ are the regression coefficients, and $\|\cdot\|_F$ is the Frobenius norm. This compact form does not render the ordering that arises naturally when considering the following series of $K-1$ problems:
\[
\min_{\theta_k \in \mathbb{R}^K , \, \beta_k \in \mathbb{R}^p} \; \| Y\theta_k - X\beta_k \|^2 + \beta_k^\top \Omega \beta_k \tag{3.5a}
\]
\[
\text{s.t.} \quad n^{-1} \, \theta_k^\top Y^\top Y \theta_k = 1 , \tag{3.5b}
\]
\[
\qquad \theta_k^\top Y^\top Y \theta_\ell = 0 , \quad \ell = 1, \ldots, k-1 , \tag{3.5c}
\]
where each $\beta_k$ corresponds to a discriminant direction.
Several sparse LDA formulations have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan 2005, Leng 2008, Grosenick et al. 2008, Clemmensen et al. 2011). Grosenick et al. (2008) proposed a variant of the Lasso-based penalized OS of Ghosh and Chinnaiyan (2005), introducing an elastic net penalty for binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by
\[
\min_{\beta_k \in \mathbb{R}^p , \, \theta_k \in \mathbb{R}^K} \;
\| Y\theta_k - X\beta_k \|_2^2 + \lambda_1 \| \beta_k \|_1 + \lambda_2 \, \beta_k^\top \Omega \beta_k ,
\]
where $\lambda_1$ and $\lambda_2$ are regularization parameters and $\Omega$ is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.
Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):
\[
\min_{\beta_k \in \mathbb{R}^p , \, \theta_k \in \mathbb{R}^K} \;
\sum_{k=1}^{K-1} \| Y\theta_k - X\beta_k \|_2^2
+ \lambda \sum_{j=1}^{p} \sqrt{ \sum_{k=1}^{K-1} \beta_{kj}^2 } \, , \tag{3.6}
\]
which is the criterion chosen in this thesis.

The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm closely followed the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA and provide publicly available, efficient code for solving this problem.
4 Formalizing the Objective
In this chapter we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis, and also enables the derivation of variants, such as an LDA assuming a diagonal within-class covariance structure (Bickel and Levina 2004).
The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), which selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of the data. For $K$ classes, this representation can be either complete, in dimension $K-1$, or partial, for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.
The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here, since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka 1984, Hastie et al. 1994, Hastie and Tibshirani 1996, Hastie et al. 1995) and have already been used for sparsity-inducing penalties (Roth and Lange 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by the group-Lasso and penalized LDA (Sanchez Merchante et al. 2012).
4.1 From Optimal Scoring to Linear Discriminant Analysis
Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) and in penalized LDA (p-LDA), by going through canonical correlation analysis. We first provide some properties of the solutions of an arbitrary problem in the p-OS series (3.5).
Throughout this chapter we assume that
• there is no empty class, that is, the diagonal matrix $Y^\top Y$ is full rank;

• inputs are centered, that is, $X^\top 1_n = 0$;

• the quadratic penalty $\Omega$ is positive semidefinite and such that $X^\top X + \Omega$ is full rank.
4.1.1 Penalized Optimal Scoring Problem
For the sake of simplicity, we now drop the subscript $k$ to refer to any problem in the p-OS series (3.5). First, note that Problems (3.5) are biconvex in $(\theta, \beta)$, that is, convex in $\theta$ for each $\beta$ value and vice versa. The problems are however non-convex: in particular, if $(\theta^\star, \beta^\star)$ is a solution, then $(-\theta^\star, -\beta^\star)$ is also a solution.

The orthogonality constraints (3.5c) inherently limit the number of possible problems in the series to $K$, since we assumed that there are no empty classes. Moreover, as X is centered, the $K-1$ first optimal scores are orthogonal to $1_n$ (and the $K$th problem would be solved by $\beta_K = 0$). All the problems considered here can be solved through the eigen-decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, to simplify all expressions, we no longer mention the orthogonality constraints (3.5c) that apply along the route. The generic problem solved is thus
\[
\min_{\theta \in \mathbb{R}^K , \, \beta \in \mathbb{R}^p} \; \| Y\theta - X\beta \|^2 + \beta^\top \Omega \beta \tag{4.1a}
\]
\[
\text{s.t.} \quad n^{-1} \, \theta^\top Y^\top Y \theta = 1 . \tag{4.1b}
\]

For a given score vector $\theta$, the discriminant direction $\beta$ that minimizes the p-OS criterion (4.1) is the penalized least squares estimator
\[
\beta_{\mathrm{os}} = \big( X^\top X + \Omega \big)^{-1} X^\top Y \theta . \tag{4.2}
\]
The objective function (4.1a) is then
\begin{align*}
\| Y\theta - X\beta_{\mathrm{os}} \|^2 + \beta_{\mathrm{os}}^\top \Omega \beta_{\mathrm{os}}
&= \theta^\top Y^\top Y \theta - 2\, \theta^\top Y^\top X \beta_{\mathrm{os}} + \beta_{\mathrm{os}}^\top \big( X^\top X + \Omega \big) \beta_{\mathrm{os}} \\
&= \theta^\top Y^\top Y \theta - \theta^\top Y^\top X \big( X^\top X + \Omega \big)^{-1} X^\top Y \theta ,
\end{align*}
where the second line stems from the definition (4.2) of $\beta_{\mathrm{os}}$. Now, using the fact that the optimal $\theta$ obeys constraint (4.1b), the optimization problem is equivalent to
\[
\max_{\theta : \; n^{-1} \theta^\top Y^\top Y \theta = 1} \; \theta^\top Y^\top X \big( X^\top X + \Omega \big)^{-1} X^\top Y \theta , \tag{4.3}
\]
which shows that the optimization of the p-OS problem with respect to $\theta_k$ boils down to finding the $k$th largest eigenvector of $Y^\top X \big( X^\top X + \Omega \big)^{-1} X^\top Y$. Indeed, Appendix C details that Problem (4.3) is solved by
\[
(Y^\top Y)^{-1} Y^\top X \big( X^\top X + \Omega \big)^{-1} X^\top Y \theta = \alpha^2 \theta , \tag{4.4}
\]
where $\alpha^2$ is the maximal eigenvalue:¹
\begin{align*}
n^{-1} \theta^\top Y^\top X \big( X^\top X + \Omega \big)^{-1} X^\top Y \theta &= \alpha^2 \, n^{-1} \theta^\top (Y^\top Y) \theta \\
n^{-1} \theta^\top Y^\top X \big( X^\top X + \Omega \big)^{-1} X^\top Y \theta &= \alpha^2 . \tag{4.5}
\end{align*}
4.1.2 Penalized Canonical Correlation Analysis
As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables X and Y is defined as follows:
\[
\max_{\theta \in \mathbb{R}^K , \, \beta \in \mathbb{R}^p} \; n^{-1} \theta^\top Y^\top X \beta \tag{4.6a}
\]
\[
\text{s.t.} \quad n^{-1} \, \theta^\top Y^\top Y \theta = 1 , \tag{4.6b}
\]
\[
\qquad n^{-1} \, \beta^\top \big( X^\top X + \Omega \big) \beta = 1 . \tag{4.6c}
\]
The solutions to (4.6) are obtained by finding the saddle points of the Lagrangian:
\begin{align*}
n L(\beta, \theta, \nu, \gamma) &= \theta^\top Y^\top X \beta - \nu \big( \theta^\top Y^\top Y \theta - n \big) - \gamma \big( \beta^\top (X^\top X + \Omega) \beta - n \big) \\
\Rightarrow \; n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \beta} &= X^\top Y \theta - 2\gamma (X^\top X + \Omega) \beta \\
\Rightarrow \; \beta_{\mathrm{cca}} &= \frac{1}{2\gamma} (X^\top X + \Omega)^{-1} X^\top Y \theta .
\end{align*}
Then, as $\beta_{\mathrm{cca}}$ obeys (4.6c), we obtain
\[
\beta_{\mathrm{cca}} = \frac{ (X^\top X + \Omega)^{-1} X^\top Y \theta }{ \sqrt{ n^{-1} \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta } } , \tag{4.7}
\]
so that the optimal objective function (4.6a) can be expressed with $\theta$ alone:
\begin{align*}
n^{-1} \theta^\top Y^\top X \beta_{\mathrm{cca}}
&= \frac{ n^{-1} \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta }{ \sqrt{ n^{-1} \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta } } \\
&= \sqrt{ n^{-1} \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta } ,
\end{align*}
and the optimization problem with respect to $\theta$ can be restated as
\[
\max_{\theta : \; n^{-1} \theta^\top Y^\top Y \theta = 1} \; \theta^\top Y^\top X \big( X^\top X + \Omega \big)^{-1} X^\top Y \theta . \tag{4.8}
\]
Hence the p-OS and p-CCA problems produce the same optimal score vectors $\theta$. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):
\[
\beta_{\mathrm{os}} = \alpha \, \beta_{\mathrm{cca}} , \tag{4.9}
\]
¹The awkward notation $\alpha^2$ for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5) for example).
where $\alpha$ is defined by (4.5).

The p-CCA optimization problem can also be written as a function of $\beta$ alone, using the optimality conditions for $\theta$:
\begin{align*}
n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \theta} &= Y^\top X \beta - 2\nu Y^\top Y \theta \\
\Rightarrow \; \theta_{\mathrm{cca}} &= \frac{1}{2\nu} (Y^\top Y)^{-1} Y^\top X \beta . \tag{4.10}
\end{align*}
Then, as $\theta_{\mathrm{cca}}$ obeys (4.6b), we obtain
\[
\theta_{\mathrm{cca}} = \frac{ (Y^\top Y)^{-1} Y^\top X \beta }{ \sqrt{ n^{-1} \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta } } , \tag{4.11}
\]
leading to the following expression of the optimal objective function:
\begin{align*}
n^{-1} \theta_{\mathrm{cca}}^\top Y^\top X \beta
&= \frac{ n^{-1} \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta }{ \sqrt{ n^{-1} \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta } } \\
&= \sqrt{ n^{-1} \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta } .
\end{align*}
The p-CCA problem can thus be solved with respect to $\beta$ by plugging this value in (4.6):
\[
\max_{\beta \in \mathbb{R}^p} \; n^{-1} \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta \tag{4.12a}
\]
\[
\text{s.t.} \quad n^{-1} \, \beta^\top \big( X^\top X + \Omega \big) \beta = 1 , \tag{4.12b}
\]
where the positive objective function has been squared compared to (4.6). This formulation is important, since it will be used to link p-CCA to p-LDA. We thus derive its solution: following the reasoning of Appendix C, $\beta_{\mathrm{cca}}$ verifies
\[
X^\top Y (Y^\top Y)^{-1} Y^\top X \beta_{\mathrm{cca}} = \lambda \big( X^\top X + \Omega \big) \beta_{\mathrm{cca}} , \tag{4.13}
\]
where $\lambda$ is the maximal eigenvalue, shown below to be equal to $\alpha^2$:
\begin{align*}
& n^{-1} \beta_{\mathrm{cca}}^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta_{\mathrm{cca}} = \lambda \\
\Rightarrow \; & n^{-1} \alpha^{-1} \beta_{\mathrm{cca}}^\top X^\top Y (Y^\top Y)^{-1} Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta = \lambda \\
\Rightarrow \; & n^{-1} \alpha \, \beta_{\mathrm{cca}}^\top X^\top Y \theta = \lambda \\
\Rightarrow \; & n^{-1} \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta = \lambda \\
\Rightarrow \; & \alpha^2 = \lambda .
\end{align*}
The first line is obtained from constraint (4.12b), the second line from relationship (4.7), whose denominator is $\alpha$, the third line comes from (4.4), the fourth line uses relationship (4.7) again, and the last one the definition (4.5) of $\alpha$.
4.1.3 Penalized Linear Discriminant Analysis
Still following Hastie et al. (1995), the penalized Linear Discriminant Analysis is defined as follows:
\[
\max_{\beta \in \mathbb{R}^p} \; \beta^\top \Sigma_{\mathrm B} \beta \tag{4.14a}
\]
\[
\text{s.t.} \quad \beta^\top \big( \Sigma_{\mathrm W} + n^{-1} \Omega \big) \beta = 1 , \tag{4.14b}
\]
where $\Sigma_{\mathrm B}$ and $\Sigma_{\mathrm W}$ are respectively the sample between-class and within-class covariance matrices of the original $p$-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form, amenable to a simple matrix representation using the projection operator $Y \big( Y^\top Y \big)^{-1} Y^\top$:
\begin{align*}
\Sigma_{\mathrm T} &= \frac{1}{n} \sum_{i=1}^n x_i x_i^\top = n^{-1} X^\top X , \\
\Sigma_{\mathrm B} &= \frac{1}{n} \sum_{k=1}^K n_k \, \mu_k \mu_k^\top = n^{-1} X^\top Y \big( Y^\top Y \big)^{-1} Y^\top X , \\
\Sigma_{\mathrm W} &= \frac{1}{n} \sum_{k=1}^K \sum_{i : y_{ik} = 1} (x_i - \mu_k)(x_i - \mu_k)^\top = n^{-1} \Big( X^\top X - X^\top Y \big( Y^\top Y \big)^{-1} Y^\top X \Big) .
\end{align*}
Using these formulae, the solution to the p-LDA problem (4.14) is obtained as
\begin{align*}
X^\top Y \big( Y^\top Y \big)^{-1} Y^\top X \beta_{\mathrm{lda}} &= \lambda \Big( X^\top X + \Omega - X^\top Y \big( Y^\top Y \big)^{-1} Y^\top X \Big) \beta_{\mathrm{lda}} \\
X^\top Y \big( Y^\top Y \big)^{-1} Y^\top X \beta_{\mathrm{lda}} &= \frac{\lambda}{1 + \lambda} \big( X^\top X + \Omega \big) \beta_{\mathrm{lda}} .
\end{align*}
The comparison of the last equation with (4.13) shows that $\beta_{\mathrm{lda}}$ and $\beta_{\mathrm{cca}}$ are proportional, with $\lambda / (1 + \lambda) = \alpha^2$. Using constraints (4.12b) and (4.14b), it comes that
\begin{align*}
\beta_{\mathrm{lda}} &= (1 - \alpha^2)^{-1/2} \, \beta_{\mathrm{cca}} \\
&= \alpha^{-1} (1 - \alpha^2)^{-1/2} \, \beta_{\mathrm{os}} ,
\end{align*}
which ends the path from p-OS to p-LDA.
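This chain of equivalences can be verified numerically. The script below (an illustrative check with invented names, not part of GLOSS) solves the p-OS eigenproblem and the p-LDA generalized eigenproblem independently, and confirms that $\beta_{\mathrm{lda}} = \alpha^{-1}(1-\alpha^2)^{-1/2}\beta_{\mathrm{os}}$ up to sign:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 30, 4, 2
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                       # centered inputs: X' 1_n = 0
Y = np.eye(K)[np.repeat([0, 1], n // K)]  # dummy indicator matrix
Om = 0.5 * np.eye(p)                      # quadratic penalty (lambda absorbed)

C = X.T @ X + Om
Cinv = np.linalg.inv(C)
D = Y.T @ Y
A = X.T @ Y @ np.linalg.inv(D) @ Y.T @ X  # equals n * Sigma_B for centered X

# p-OS side: (Y'Y)^{-1} Y'X (X'X+Om)^{-1} X'Y theta = alpha^2 theta, cf. (4.4)
evals, evecs = np.linalg.eig(np.linalg.inv(D) @ Y.T @ X @ Cinv @ X.T @ Y)
i = np.argmax(evals.real)
alpha2, theta = evals.real[i], evecs.real[:, i]
theta *= np.sqrt(n / (theta @ D @ theta))  # enforce n^{-1} theta' Y'Y theta = 1
beta_os = Cinv @ X.T @ Y @ theta           # cf. (4.2)

# p-LDA side: Sigma_B beta = ell * (Sigma_W + n^{-1} Om) beta, cf. (4.14)
Sig_B, Sig_W = A / n, (X.T @ X - A) / n
evals2, evecs2 = np.linalg.eig(np.linalg.solve(Sig_W + Om / n, Sig_B))
beta_lda = evecs2.real[:, np.argmax(evals2.real)]
beta_lda /= np.sqrt(beta_lda @ (Sig_W + Om / n) @ beta_lda)  # cf. (4.14b)

factor = 1.0 / (np.sqrt(alpha2) * np.sqrt(1.0 - alpha2))
assert np.allclose(np.abs(beta_lda), factor * np.abs(beta_os), rtol=1e-6)
```

The absolute values absorb the arbitrary signs of the eigenvectors; the elementwise ratio itself is the constant $\alpha^{-1}(1-\alpha^2)^{-1/2}$.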
4.1.4 Summary
The three previous subsections considered a generic form of the $k$th problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:
\[
\min_{\Theta, B} \; \| Y\Theta - XB \|_F^2 + \lambda \, \mathrm{tr}\big( B^\top \Omega B \big)
\quad \text{s.t.} \quad n^{-1} \, \Theta^\top Y^\top Y \Theta = I_{K-1} .
\]
Let A represent the $(K-1) \times (K-1)$ diagonal matrix with elements $\alpha_k$, the square roots of the leading eigenvalues of $(Y^\top Y)^{-1} Y^\top X \big( X^\top X + \Omega \big)^{-1} X^\top Y$; we have
\begin{align*}
B_{\mathrm{LDA}} &= B_{\mathrm{CCA}} \big( I_{K-1} - A^2 \big)^{-\frac{1}{2}} \\
&= B_{\mathrm{OS}} \, A^{-1} \big( I_{K-1} - A^2 \big)^{-\frac{1}{2}} , \tag{4.15}
\end{align*}
where $I_{K-1}$ is the $(K-1) \times (K-1)$ identity matrix.

At this point, the feature matrix X, which has dimensions $n \times p$ in the input space, can be projected into the optimal scoring domain as the $n \times (K-1)$ matrix $X_{\mathrm{OS}} = X B_{\mathrm{OS}}$, or into the linear discriminant analysis space as the $n \times (K-1)$ matrix $X_{\mathrm{LDA}} = X B_{\mathrm{LDA}}$. Classification can be performed in either of those domains if the appropriate distance (based on the penalized within-class covariance matrix) is applied.

With the aim of performing classification, the whole process can be summarized as follows:

1. Solve the p-OS problem as
\[
B_{\mathrm{OS}} = \big( X^\top X + \lambda\Omega \big)^{-1} X^\top Y \Theta ,
\]
where $\Theta$ are the $K-1$ leading eigenvectors of $Y^\top X \big( X^\top X + \lambda\Omega \big)^{-1} X^\top Y$.

2. Translate the data samples X into the LDA domain as $X_{\mathrm{LDA}} = X B_{\mathrm{OS}} D$, where $D = A^{-1} \big( I_{K-1} - A^2 \big)^{-\frac{1}{2}}$.

3. Compute the matrix M of centroids $\mu_k$ from $X_{\mathrm{LDA}}$ and Y.

4. Evaluate the distances $d(x, \mu_k)$ in the LDA domain as a function of M and $X_{\mathrm{LDA}}$.

5. Translate distances into posterior probabilities, and assign every sample $i$ to a class $k$ following the maximum a posteriori rule.

6. Optionally, produce a graphical representation of the data.
The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Sections 4.2.1, 4.2.2 and 4.2.3, respectively.
4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression
Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:
\[
\min_{\Theta \in \mathbb{R}^{K \times (K-1)} , \, B \in \mathbb{R}^{p \times (K-1)}} \; \| Y\Theta - XB \|_F^2 + \lambda \, \mathrm{tr}\big( B^\top \Omega B \big) \tag{4.16a}
\]
\[
\text{s.t.} \quad n^{-1} \, \Theta^\top Y^\top Y \Theta = I_{K-1} , \tag{4.16b}
\]
where $\Theta$ are the class scores, B the regression coefficients, and $\|\cdot\|_F$ is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in $\Theta$ and B: the optimal $B_{\mathrm{OS}}$ does not intervene in the optimality conditions with respect to $\Theta$, and the optimization with respect to B is obtained in closed form, as a linear combination of the optimal scores $\Theta$ (Hastie et al. 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize $\Theta$ to $\Theta^0$ such that $n^{-1} \, \Theta^{0\top} Y^\top Y \Theta^0 = I_{K-1}$.

2. Compute $B = \big( X^\top X + \lambda\Omega \big)^{-1} X^\top Y \Theta^0$.

3. Set $\Theta$ to be the $K-1$ leading eigenvectors of $Y^\top X \big( X^\top X + \lambda\Omega \big)^{-1} X^\top Y$.

4. Compute the optimal regression coefficients
\[
B_{\mathrm{OS}} = \big( X^\top X + \lambda\Omega \big)^{-1} X^\top Y \Theta . \tag{4.17}
\]

Defining $\Theta^0$ in Step 1, instead of using directly the $\Theta$ expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on $\Theta^{0\top} Y^\top X \big( X^\top X + \lambda\Omega \big)^{-1} X^\top Y \Theta^0$, which is computed as $\Theta^{0\top} Y^\top X B$, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.
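The four steps can be sketched as follows, in an illustrative numpy implementation under the chapter's assumptions (X centered, no empty class); the function and variable names are ours:

```python
import numpy as np

def penalized_os(X, Y, Om, lam):
    """Four-step solution of the quadratically penalized OS problem (4.16)."""
    n, K = Y.shape
    Cinv = np.linalg.inv(X.T @ X + lam * Om)
    D = Y.T @ Y                                  # diagonal matrix of class counts
    c = np.diag(D).astype(float)

    # Step 1: Theta0 spanning the informative scores, n^{-1} Theta0' D Theta0 = I
    Q = np.linalg.qr(c[:, None], mode='complete')[0][:, 1:]
    L = np.linalg.cholesky(Q.T @ D @ Q / n)
    Theta0 = Q @ np.linalg.inv(L).T

    # Step 2: penalized regression fit on the initial scores
    B = Cinv @ X.T @ Y @ Theta0

    # Step 3: eigen-analysis of the small matrix Theta0' Y'X B (no extra inversion)
    R = Theta0.T @ Y.T @ X @ B
    evals, U = np.linalg.eigh((R + R.T) / 2.0)
    Theta = Theta0 @ U[:, ::-1]                  # leading eigenvectors first

    # Step 4: optimal regression coefficients (4.17)
    B_os = Cinv @ X.T @ Y @ Theta
    return Theta, B_os
```

Here $\Theta^0$ is built orthogonal to the vector of class counts, so that its span contains the $K-1$ non-trivial scores and the small eigen-analysis of Step 3 is exact.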
This four-step algorithm is valid when the penalty is of the form $B^\top \Omega B$. However, when an $L_1$ penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and $\Theta$. That situation is developed by Clemmensen et al. (2011), where a Lasso or an elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and elastic net penalties do not enjoy the equivalence with LDA problems.
4.2.2 Distance Evaluation
The simplest classification rule is the nearest centroid rule, which assigns sample $x_i$ to class $k$ if $x_i$ is closer (in terms of the shared within-class Mahalanobis distance) to centroid $\mu_k$ than to any other centroid $\mu_\ell$. In general, the parameters of the model are unknown, and the rule is applied with the parameters estimated from training data (sample estimators $\mu_k$ and $\Sigma_{\mathrm W}$). If $\mu_k$ are the centroids in the input space, sample $x_i$ is assigned to class $k$ if the distance
\[
d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_{\mathrm{W}\Omega}^{-1} (x_i - \mu_k) - 2 \log\!\big( n_k / n \big) \tag{4.18}
\]
is minimized among all $k$. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment for unequal class sizes, which estimates the prior probability of class $k$. Note that this rule is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al. 2009, Mai et al. 2012). The matrix $\Sigma_{\mathrm{W}\Omega}$ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:
\begin{align*}
\Sigma_{\mathrm{W}\Omega}^{-1}
&= \Big( n^{-1} \big( X^\top X + \lambda\Omega \big) - \Sigma_{\mathrm B} \Big)^{-1} \\
&= \big( n^{-1} X^\top X - \Sigma_{\mathrm B} + n^{-1}\lambda\Omega \big)^{-1} \\
&= \big( \Sigma_{\mathrm W} + n^{-1}\lambda\Omega \big)^{-1} . \tag{4.19}
\end{align*}
Before explaining how to compute the distances, let us summarize some clarifying points:

• the solution $B_{\mathrm{OS}}$ of the p-OS problem is enough to accomplish classification;

• in the LDA domain (the space of discriminant variates $X_{\mathrm{LDA}}$), classification is based on Euclidean distances;

• classification can be done in a reduced-rank space of dimension $R < K-1$ by using the first $R$ discriminant directions $\{\beta_k\}_{k=1}^R$.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain,
\[
d(x_i, \mu_k) = \big\| (x_i - \mu_k) B_{\mathrm{OS}} \big\|_{\Sigma_{\mathrm{W}\Omega}}^2 - 2 \log(\pi_k) ,
\]
where $\pi_k$ is the estimated class prior and $\|\cdot\|_{S}$ is the Mahalanobis distance assuming within-class covariance S. If classification is done in the p-LDA domain,
\[
d(x_i, \mu_k) = \Big\| (x_i - \mu_k) B_{\mathrm{OS}} A^{-1} \big( I_{K-1} - A^2 \big)^{-\frac{1}{2}} \Big\|_2^2 - 2 \log(\pi_k) ,
\]
which is a plain Euclidean distance.
4.2.3 Posterior Probability Evaluation
Let $d(x, \mu_k)$ be the distance between x and $\mu_k$, defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities $p(y_k = 1 \,|\, x)$ can be estimated as
\[
p(y_k = 1 \,|\, x) \propto \exp\!\Big( \!-\frac{d(x, \mu_k)}{2} \Big)
\propto \pi_k \exp\!\Big( \!-\frac{1}{2} \Big\| (x - \mu_k) B_{\mathrm{OS}} A^{-1} \big( I_{K-1} - A^2 \big)^{-\frac{1}{2}} \Big\|_2^2 \Big) . \tag{4.20}
\]
Those probabilities must be normalized to ensure that they sum to one. When the distances $d(x, \mu_k)$ take large values, $\exp\!\big( \!-d(x, \mu_k)/2 \big)$ can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:
\begin{align*}
p(y_k = 1 \,|\, x) &= \frac{ \pi_k \exp\!\big( \!-\frac{d(x, \mu_k)}{2} \big) }{ \sum_\ell \pi_\ell \exp\!\big( \!-\frac{d(x, \mu_\ell)}{2} \big) } \\
&= \frac{ \pi_k \exp\!\big( \frac{-d(x, \mu_k) + d_{\max}}{2} \big) }{ \sum_\ell \pi_\ell \exp\!\big( \frac{-d(x, \mu_\ell) + d_{\max}}{2} \big) } ,
\end{align*}
where $d_{\max} = \max_k d(x, \mu_k)$.
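In code, the shift by $d_{\max}$ keeps every exponent non-positive, so no term underflows to zero before normalization. A minimal sketch, where the priors are assumed to be already folded into the distances via the $-2\log(n_k/n)$ term of (4.18):

```python
import numpy as np

def posteriors(d):
    """Row-normalized posteriors from an n x K matrix of distances (4.18)."""
    dmax = d.max(axis=1, keepdims=True)
    w = np.exp((dmax - d) / 2.0)        # exp((-d + dmax)/2), always in (0, 1]
    return w / w.sum(axis=1, keepdims=True)

d = np.array([[2000.0, 2002.0]])        # naive exp(-1000) would underflow to 0/0
p = posteriors(d)                       # well-defined: approx [[0.731, 0.269]]
```

The result equals the untricked formula exactly, since the common factor $\exp(d_{\max}/2)$ cancels in the ratio.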
4.2.4 Graphical Representation

Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but can suffice to inspect the data. This can be accomplished by plotting the first two or three dimensions of the regression fits $X_{\mathrm{OS}}$ or of the discriminant variates $X_{\mathrm{LDA}}$, depending on whether we present the data set in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.
4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form $\beta^\top \Omega \beta$, under the assumption that $Y^\top Y$ and $X^\top X + \lambda\Omega$ are full rank (fulfilled when there are no empty classes and $\Omega$ is positive definite). Quadratic penalties have interesting properties, but as recalled in Section 2.3, they do not induce sparsity. In this respect, $L_1$ penalties are preferable, but they lack a connection, such as the one stated by Hastie et al. (1995), between p-LDA and p-OS.

In this section we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces groups of zeroes in the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form $B^\top \Omega B$.
4.3.1 A Quadratic Variational Form
Quadratic variational forms of the Lasso and group-Lasso were proposed shortly after the original Lasso paper of Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet 1998, Canu and Grandvalet 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al. 2012).
Our formulation of group-Lasso is showed below
minτisinRp
minBisinRptimesKminus1
J(B) + λ
psumj=1
w2j
∥∥βj∥∥2
2
τj(421a)
s tsum
j τj minussum
j wj∥∥βj∥∥
2le 0 (421b)
τj ge 0 j = 1 p (421c)
where B isin RptimesKminus1 is a matrix composed of row vectors βj isin RKminus1
B =(β1gt βpgt
)gtand wj are predefined nonnegative weights The cost function
J(B) in our context is the OS regression YΘ + XB22 by now on behalf of sim-plicity I leave J(B) Here and in what follows bτ is defined by continuation at zeroas b0 = +infin if b 6= 0 and 00 = 0 Note that variants of (421) have been proposedelsewhere (see eg Canu and Grandvalet 1999 Bach et al 2012 and references therein)
The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic penalty as the convex hull of a family of quadratic penalties indexed by the variable $\tau_j$, as graphically shown in Figure 4.1.
Let us start by proving the equivalence between our variational formulation and the standard group-Lasso (an alternative variational formulation is detailed and demonstrated in Appendix D).
Lemma 4.1. The quadratic penalty in $\beta^j$ in (4.21) acts as the group-Lasso penalty $\lambda \sum_{j=1}^p w_j \| \beta^j \|_2$.

Proof. The Lagrangian of Problem (4.21) is
\[
L = J(B) + \lambda \sum_{j=1}^p w_j^2 \frac{ \| \beta^j \|_2^2 }{ \tau_j }
+ \nu_0 \Big( \sum_{j=1}^p \tau_j - \sum_{j=1}^p w_j \| \beta^j \|_2 \Big)
- \sum_{j=1}^p \nu_j \tau_j .
\]
Figure 4.1: Graphical representation of the variational approach to the group-Lasso.
Thus, the first-order optimality conditions for $\tau_j$ are
\begin{align*}
\frac{\partial L}{\partial \tau_j}(\tau_j^\star) = 0
& \Leftrightarrow \; - \lambda w_j^2 \frac{ \| \beta^j \|_2^2 }{ \tau_j^{\star 2} } + \nu_0 - \nu_j = 0 \\
& \Leftrightarrow \; - \lambda w_j^2 \| \beta^j \|_2^2 + \nu_0 \tau_j^{\star 2} - \nu_j \tau_j^{\star 2} = 0 \\
& \Rightarrow \; - \lambda w_j^2 \| \beta^j \|_2^2 + \nu_0 \tau_j^{\star 2} = 0 .
\end{align*}
The last line is obtained from complementary slackness, which implies here $\nu_j \tau_j^\star = 0$ (complementary slackness states that $\nu_j g_j(\tau_j^\star) = 0$, where $\nu_j$ is the Lagrange multiplier for constraint $g_j(\tau_j) \le 0$). As a result, the optimal value of $\tau_j$ is
\[
\tau_j^\star = \sqrt{ \frac{ \lambda w_j^2 \| \beta^j \|_2^2 }{ \nu_0 } }
= \sqrt{ \frac{\lambda}{\nu_0} } \, w_j \| \beta^j \|_2 . \tag{4.22}
\]
We note that $\nu_0 \neq 0$ if there is at least one coefficient $\beta_{jk} \neq 0$; the inequality constraint (4.21b) is thus at bound (due to complementary slackness):
\[
\sum_{j=1}^p \tau_j^\star - \sum_{j=1}^p w_j \| \beta^j \|_2 = 0 , \tag{4.23}
\]
so that $\tau_j^\star = w_j \| \beta^j \|_2$. Plugging this value into (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso operator
\[
\min_{B \in \mathbb{R}^{p \times (K-1)}} \; J(B) + \lambda \sum_{j=1}^p w_j \| \beta^j \|_2 . \tag{4.24}
\]

So, we have presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.
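The equivalence can also be checked numerically: at $\tau_j^\star = w_j \|\beta^j\|_2$ the quadratic penalty of (4.21a) equals the group-Lasso penalty, and any other feasible $\tau$ can only increase it. A small sanity-check script, with arbitrary data:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(5, 3))            # p = 5 row vectors beta^j, K - 1 = 3
w = rng.uniform(0.5, 2.0, size=5)      # predefined nonnegative weights
lam = 0.7

norms = np.linalg.norm(B, axis=1)
group_lasso = lam * np.sum(w * norms)

tau_star = w * norms                   # optimal tau from (4.22)-(4.23)
quad = lam * np.sum(w**2 * norms**2 / tau_star)
assert np.isclose(quad, group_lasso)   # penalties coincide at the optimum

tau_other = rng.uniform(0.1, 1.0, size=5)
tau_other *= tau_star.sum() / tau_other.sum()   # feasible: same total, cf. (4.21b)
quad_other = lam * np.sum(w**2 * norms**2 / tau_other)
assert quad_other >= group_lasso - 1e-12        # any feasible tau does worse
```

The second assertion is the Cauchy-Schwarz inequality underlying the variational bound.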
With Lemma 4.1, we have proved that, under constraints (4.21b)-(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently presented as $\lambda \, \mathrm{tr}\big( B^\top \Omega B \big)$, where
\[
\Omega = \mathrm{diag}\Big( \frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \ldots, \frac{w_p^2}{\tau_p} \Big) , \tag{4.25}
\]
with $\tau_j = w_j \| \beta^j \|_2$, resulting in the diagonal components
\[
(\Omega)_{jj} = \frac{ w_j }{ \| \beta^j \|_2 } . \tag{4.26}
\]
As stated at the beginning of this section, the equivalence between p-LDA and p-OS problems is thus demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set strategy described in Chapter 5.
The first property states that the quadratic formulation is convex when J is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If J is convex, Problem (4.21) is convex.

Proof. The function $g(\beta, \tau) = \| \beta \|_2^2 / \tau$, known as the perspective function of $f(\beta) = \| \beta \|_2^2$, is convex in $(\beta, \tau)$ (see e.g. Boyd and Vandenberghe 2004, Chapter 3), and the constraints (4.21b)-(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to $(B, \tau)$.
In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all $B \in \mathbb{R}^{p \times (K-1)}$, the subdifferential of the objective function of Problem (4.24) is
\[
\Big\{ V \in \mathbb{R}^{p \times (K-1)} : \; V = \frac{\partial J(B)}{\partial B} + \lambda G \Big\} , \tag{4.27}
\]
where $G \in \mathbb{R}^{p \times (K-1)}$ is a matrix composed of row vectors $g^j \in \mathbb{R}^{K-1}$, $G = \big( g^{1\top}, \ldots, g^{p\top} \big)^\top$, defined as follows. Let $S(B)$ denote the columnwise support of B, $S(B) = \big\{ j \in \{1, \ldots, p\} : \| \beta^j \|_2 \neq 0 \big\}$; then we have
\begin{align*}
\forall j \in S(B) , \quad & g^j = w_j \| \beta^j \|_2^{-1} \beta^j , \tag{4.28} \\
\forall j \notin S(B) , \quad & \| g^j \|_2 \le w_j . \tag{4.29}
\end{align*}
This condition results in an equality for the "active" non-zero vectors $\beta^j$ and an inequality for the other ones; both provide essential building blocks of our algorithm.
Proof. When $\| \beta^j \|_2 \neq 0$, the gradient of the penalty with respect to $\beta^j$ is
\[
\frac{ \partial \big( \lambda \sum_{m=1}^p w_m \| \beta^m \|_2 \big) }{ \partial \beta^j }
= \lambda w_j \frac{ \beta^j }{ \| \beta^j \|_2 } . \tag{4.30}
\]
At $\| \beta^j \|_2 = 0$, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al. 2011):
\[
\partial_{\beta^j} \Big( \lambda \sum_{m=1}^p w_m \| \beta^m \|_2 \Big)
= \partial_{\beta^j} \big( \lambda w_j \| \beta^j \|_2 \big)
= \big\{ \lambda w_j v \in \mathbb{R}^{K-1} : \| v \|_2 \le 1 \big\} , \tag{4.31}
\]
which gives expression (4.29).
Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points B of the objective function verifying the following conditions are global minima:
\begin{align*}
\forall j \in S , \quad & \frac{ \partial J(B) }{ \partial \beta^j } + \lambda w_j \| \beta^j \|_2^{-1} \beta^j = 0 , \tag{4.32a} \\
\forall j \notin S , \quad & \Big\| \frac{ \partial J(B) }{ \partial \beta^j } \Big\|_2 \le \lambda w_j , \tag{4.32b}
\end{align*}
where $S \subseteq \{1, \ldots, p\}$ denotes the set of non-zero row vectors $\beta^j$.
Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled through a direct analysis of the variational problem (4.21).
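Conditions (4.32) can be exercised on a toy problem where the group-Lasso solution is available in closed form: for $J(B) = \frac{1}{2}\|B - Z\|_F^2$ (a proximal step, not the OS loss), the minimizer is a row-wise soft-thresholding, and it satisfies (4.32a)-(4.32b) exactly. Illustrative code, with names of our own:

```python
import numpy as np

def prox_group_lasso(Z, lam, w):
    """Minimizer of 0.5 * ||B - Z||_F^2 + lam * sum_j w_j ||b^j||_2."""
    B = np.zeros_like(Z)
    for j, zj in enumerate(Z):
        nrm = np.linalg.norm(zj)
        if nrm > lam * w[j]:
            B[j] = (1.0 - lam * w[j] / nrm) * zj   # shrink the whole row
    return B

Z = np.array([[3.0, 0.0], [0.5, 0.5], [0.0, -2.0], [0.1, 0.2]])
w, lam = np.ones(4), 1.0
B = prox_group_lasso(Z, lam, w)
grad = B - Z                                       # dJ/dB for this toy J
for j in range(4):
    if np.linalg.norm(B[j]) > 0:                   # active row: equality (4.32a)
        gj = grad[j] + lam * w[j] * B[j] / np.linalg.norm(B[j])
        assert np.allclose(gj, 0.0)
    else:                                          # inactive row: inequality (4.32b)
        assert np.linalg.norm(grad[j]) <= lam * w[j] + 1e-12
```

Rows whose norm falls below $\lambda w_j$ are zeroed as a group, which is exactly the support appraisal used by the active set strategy.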
4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.
Proposition 4.1. The group-Lasso OS problem
\[
B_{\mathrm{OS}} = \underset{B \in \mathbb{R}^{p \times (K-1)}}{\arg\min} \; \min_{\Theta \in \mathbb{R}^{K \times (K-1)}} \;
\frac{1}{2} \| Y\Theta - XB \|_F^2 + \lambda \sum_{j=1}^p w_j \| \beta^j \|_2
\quad \text{s.t.} \quad n^{-1} \, \Theta^\top Y^\top Y \Theta = I_{K-1}
\]
is equivalent to the penalized LDA problem
\[
B_{\mathrm{LDA}} = \underset{B \in \mathbb{R}^{p \times (K-1)}}{\arg\max} \; \mathrm{tr}\big( B^\top \Sigma_{\mathrm B} B \big)
\quad \text{s.t.} \quad B^\top \big( \Sigma_{\mathrm W} + n^{-1}\lambda\Omega \big) B = I_{K-1} ,
\]
where
\[
\Omega = \mathrm{diag}\Big( \frac{w_1^2}{\tau_1}, \ldots, \frac{w_p^2}{\tau_p} \Big) ,
\quad \text{with} \quad
\Omega_{jj} =
\begin{cases}
+\infty & \text{if } \beta_{\mathrm{os}}^j = 0 , \\
w_j \| \beta_{\mathrm{os}}^j \|_2^{-1} & \text{otherwise} .
\end{cases} \tag{4.33}
\]
That is, $B_{\mathrm{LDA}} = B_{\mathrm{OS}} \, \mathrm{diag}\big( \alpha_k^{-1} (1 - \alpha_k^2)^{-1/2} \big)$, where $\alpha_k \in (0, 1)$ and $\alpha_k^2$ is the $k$th leading eigenvalue of
\[
n^{-1} Y^\top X \big( X^\top X + \lambda\Omega \big)^{-1} X^\top Y .
\]
Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.
The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al., 2008; Clemmensen et al., 2011) for K = 2, that is, for binary classification or, more generally, for a single discriminant direction. Note however that it leads to a slightly different decision rule if the decision threshold is chosen a priori, according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold anymore, since the Lasso penalty does not result in an equivalent quadratic penalty in the simple form tr(B⊤ΩB).
5 GLOSS Algorithm
The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems, whose sizes are incrementally increased/decreased (Osborne et al., 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer, 2008). We adapt this algorithmic framework to the variational form (4.21), with J(B) = ½ ‖YΘ − XB‖²_F.

The algorithm belongs to the working set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables, currently identified as non-zero. Then, it iterates the three steps summarized below:

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth. First, the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more β^j may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to the inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is graphically represented in Figure 5.1 as a block diagram, and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach from Appendix D, then we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.
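The three-step mechanism can be sketched in a few lines. The following Python/NumPy sketch (the thesis's implementation is in matlab) applies the same working-set loop to the plain group-Lasso regression problem min_B ½‖T − XB‖²_F + λ Σ_j w_j ‖β^j‖₂; for simplicity, the inner solve uses blockwise coordinate descent instead of GLOSS's adaptive quadratic penalty, and the score matrix Θ is held fixed inside T = YΘ.

```python
import numpy as np

def group_lasso_ws(X, T, lam, w=None, tol=1e-8, max_iter=500):
    """Working-set sketch for min_B 0.5*||T - X B||_F^2 + lam*sum_j w_j*||B[j]||_2.
    Inner solve: blockwise coordinate descent (not GLOSS's adaptive ridge)."""
    n, p = X.shape
    w = np.ones(p) if w is None else np.asarray(w, float)
    B = np.zeros((p, T.shape[1]))
    A = []                                   # active set
    sq = (X ** 2).sum(axis=0)                # ||x_j||^2
    for _ in range(max_iter):
        # Step 1: optimize over the current active set only
        for _ in range(max_iter):
            delta = 0.0
            for j in A:
                r = X[:, j] @ (T - X @ B) + sq[j] * B[j]
                nr = np.linalg.norm(r)
                new = np.zeros_like(B[j]) if nr <= lam * w[j] \
                    else r * (1 - lam * w[j] / nr) / sq[j]
                delta = max(delta, np.abs(new - B[j]).max())
                B[j] = new
            if delta < tol:
                break
        # Step 2: drop active variables whose coefficient row vanished
        A = [j for j in A if np.linalg.norm(B[j]) > 0]
        # Step 3: condition ||x_j'(T - X B)||_2 <= lam*w_j on inactive variables
        viol = np.linalg.norm(X.T @ (T - X @ B), axis=1) - lam * w
        viol[A] = -np.inf
        j = int(np.argmax(viol))
        if viol[j] <= 1e-6:
            break                            # all optimality conditions hold
        A.append(j)                          # greatest violator becomes active
    return B, sorted(A)
```

On a toy problem where only a few rows of the true coefficient matrix are non-zero, the returned active set contains those rows and the inactive-set condition holds everywhere.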
5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving (K−1) independent card(A)-dimensional problems instead of a single (K−1) × card(A)-dimensional problem. The interaction between the (K−1) problems is relegated to the common adaptive quadratic penalty Ω. This decomposition is especially attractive, as we then solve (K−1) similar systems

    (X_A⊤X_A + λΩ) β_k = X_A⊤Y θ⁰_k ,    (5.1)
[Figure 5.1 is a block diagram of GLOSS: initialize the model (λ, B); form the active set {j : ‖β^j‖₂ > 0}; solve the p-OS problem so that B satisfies the first optimality condition; if some variable from the active set must go to the inactive set, take it out of the active set and resume; otherwise, test the second optimality condition on the inactive set; if some variable from the inactive set must go to the active set, take it out of the inactive set and resume; otherwise, compute Θ, update B, and end.]

Figure 5.1: GLOSS block diagram
Algorithm 1 Adaptively Penalized Optimal Scoring

    Input: X, Y, B, λ
    Initialize: A ← {j ∈ {1, …, p} : ‖β^j‖₂ > 0},
                Θ⁰ such that n⁻¹ Θ⁰⊤Y⊤YΘ⁰ = I_{K−1},
                convergence ← false
    repeat
        % Step 1: solve (4.21) in B, assuming A optimal
        repeat
            Ω ← diag(Ω_A), with ω_j ← ‖β^j‖₂⁻¹
            B_A ← (X_A⊤X_A + λΩ)⁻¹ X_A⊤YΘ⁰
        until condition (4.32a) holds for all j ∈ A
        % Step 2: identify inactivated variables
        for j ∈ A such that ‖β^j‖₂ = 0 do
            if optimality condition (4.32b) holds then
                A ← A \ {j}
                go back to Step 1
            end if
        end for
        % Step 3: check the greatest violation of optimality condition (4.32b) in Ā
        ĵ ← argmax_{j ∉ A} ‖∂J/∂β^j‖₂
        if ‖∂J/∂β^ĵ‖₂ < λ then
            convergence ← true    % B is optimal
        else
            A ← A ∪ {ĵ}
        end if
    until convergence
    (s, V) ← eigenanalyze(Θ⁰⊤Y⊤X_A B), that is, Θ⁰⊤Y⊤X_A B v_k = s_k v_k, k = 1, …, K−1
    Θ ← Θ⁰V ;  B ← BV ;  α_k ← n^{−1/2} s_k^{1/2}, k = 1, …, K−1
    Output: Θ, B, α
where X_A denotes the columns of X indexed by A, and β_k and θ⁰_k denote the kth columns of B and Θ⁰, respectively. These linear systems only differ in the right-hand-side term, so that a single Cholesky decomposition is necessary to solve all systems, whereas a blockwise Newton-Raphson method based on the standard group-Lasso formulation would result in different "penalties" Ω for each system.
5.1.1 Cholesky decomposition

Dropping the subscripts and considering the (K−1) systems together, (5.1) leads to

    (X⊤X + λΩ) B = X⊤YΘ .    (5.2)

Defining the Cholesky decomposition as C⊤C = X⊤X + λΩ, (5.2) is solved efficiently as follows:

    C⊤C B = X⊤YΘ
    C B = C⊤ \ X⊤YΘ
    B = C \ (C⊤ \ X⊤YΘ) ,    (5.3)

where the symbol "\" is the matlab mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
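The same computation can be sketched with SciPy's Cholesky routines, which play the role of matlab's mldivide on a factorized matrix: one factorization of X⊤X + λΩ is reused for all K−1 right-hand sides. The dimensions and the penalty below are arbitrary illustrative values.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
n, p, Km1 = 50, 8, 3
X = rng.standard_normal((n, p))
R = rng.standard_normal((p, Km1))           # stands for X'Y Theta (K-1 right-hand sides)
lam = 0.5
Omega = np.diag(rng.uniform(0.5, 2.0, p))   # some positive diagonal penalty

# Factorize C'C = X'X + lam*Omega once, then solve all K-1 systems with it,
# mirroring B = C \ (C' \ X'Y Theta) in (5.3)
c, low = cho_factor(X.T @ X + lam * Omega)
B = cho_solve((c, low), R)

assert np.allclose(B, np.linalg.solve(X.T @ X + lam * Omega, R))
```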
5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer Ω is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of Ω reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of X⊤X + λΩ. This difficulty can be avoided using the following equivalent expression:

    B = Ω^{−1/2} (Ω^{−1/2} X⊤X Ω^{−1/2} + λI)⁻¹ Ω^{−1/2} X⊤YΘ⁰ ,    (5.4)
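The equivalence between routes (5.2) and (5.4) is easy to check numerically. In the sketch below, two entries of Ω are set to a huge value, mimicking variables about to leave the active set; both routes agree, while the matrix inverted in (5.4) stays much better conditioned.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, Km1 = 40, 6, 2
X = rng.standard_normal((n, p))
R = rng.standard_normal((p, Km1))           # stands for X'Y Theta0
lam = 0.1
omega = np.array([1.0, 1.0, 1e8, 1.0, 1e8, 1.0])  # huge entries: rows leaving A

# Route (5.2): direct solve, conditioning degrades as some omega_j blow up
B_direct = np.linalg.solve(X.T @ X + lam * np.diag(omega), R)

# Route (5.4): Omega^{-1/2} (Omega^{-1/2} X'X Omega^{-1/2} + lam I)^{-1} Omega^{-1/2} R
s = 1.0 / np.sqrt(omega)                    # diagonal of Omega^{-1/2}
M = s[:, None] * (X.T @ X) * s[None, :] + lam * np.eye(p)
B_stable = s[:, None] * np.linalg.solve(M, s[:, None] * R)

assert np.allclose(B_direct, B_stable)
```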
5.2 Score Matrix

The optimal score matrix Θ is made of the K−1 leading eigenvectors of Y⊤X(X⊤X + Ω)⁻¹X⊤Y. This eigen-analysis is actually solved in the form Θ⊤Y⊤X(X⊤X + Ω)⁻¹X⊤YΘ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of (X⊤X + Ω)⁻¹, which involves the inversion of a p × p matrix. Let Θ⁰ be an arbitrary K × (K−1) matrix whose range includes the K−1 leading eigenvectors of Y⊤X(X⊤X + Ω)⁻¹X⊤Y.¹ Then, solving the K−1 systems (5.3) provides the value of B⁰ = (X⊤X + λΩ)⁻¹X⊤YΘ⁰. This B⁰ matrix can be identified in the expression to eigenanalyze as

    Θ⁰⊤Y⊤X(X⊤X + Ω)⁻¹X⊤YΘ⁰ = Θ⁰⊤Y⊤X B⁰ .

Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the (K−1) × (K−1) matrix Θ⁰⊤Y⊤X B⁰ = VΛV⊤. Defining Θ = Θ⁰V, we have Θ⊤Y⊤X(X⊤X + Ω)⁻¹X⊤YΘ = Λ, and when Θ⁰ is chosen such that n⁻¹ Θ⁰⊤Y⊤YΘ⁰ = I_{K−1}, we also have n⁻¹ Θ⊤Y⊤YΘ = I_{K−1}, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of Λ are sorted in decreasing order, θ_k is an optimal solution to the p-OS problem. Finally, once Θ has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered using the mapping from Θ⁰ to Θ, that is, B = B⁰V. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which Ω is defined by a variational formulation.
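The whole procedure fits in a few lines of NumPy. The sketch below uses arbitrary data, λ and Ω; following the footnote, Θ⁰ is built as (Y⊤Y)^{−1/2}U with U orthogonal to 1_K, scaled here by √n so that n⁻¹Θ⁰⊤Y⊤YΘ⁰ = I_{K−1} holds exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, K = 30, 5, 3
labels = np.arange(n) % K                   # every class represented
Y = np.eye(K)[labels]                       # class-indicator matrix
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                         # centered, so 1_K is in the null space
lam, Omega = 1.0, np.eye(p)

# Theta0 = sqrt(n) (Y'Y)^{-1/2} U, with U orthonormal and orthogonal to 1_K
Q = np.linalg.qr(np.hstack([np.ones((K, 1)), rng.standard_normal((K, K - 1))]))[0]
U = Q[:, 1:]
D = np.diag(1.0 / np.sqrt(np.diag(Y.T @ Y)))    # (Y'Y)^{-1/2}: Y'Y is diagonal
Theta0 = np.sqrt(n) * D @ U

# K-1 penalized least squares solves, then a (K-1)x(K-1) eigen-decomposition
B0 = np.linalg.solve(X.T @ X + lam * Omega, X.T @ Y @ Theta0)
vals, V = np.linalg.eigh(Theta0.T @ Y.T @ X @ B0)
V = V[:, np.argsort(vals)[::-1]]            # leading eigenvectors first
Theta, B = Theta0 @ V, B0 @ V

# The p-OS constraint still holds after the rotation by V
assert np.allclose(Theta.T @ Y.T @ Y @ Theta / n, np.eye(K - 1))
```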
5.3 Optimality Conditions

GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix Θ. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4; optimality conditions (4.32a) and (4.32b) can be deduced from those lemmas. Both expressions require the computation of the gradient of the objective function

    ½ ‖YΘ − XB‖²_F + λ Σ_{j=1}^p w_j ‖β^j‖₂ .    (5.5)

Let J(B) be the data-fitting term ½ ‖YΘ − XB‖²_F. Its gradient with respect to the jth row of B, β^j, is the (K−1)-dimensional vector

    ∂J(B)/∂β^j = x_j⊤(XB − YΘ) ,

where x_j is the jth column of X. Hence, the first optimality condition (4.32a) can be computed for every variable j as

    x_j⊤(XB − YΘ) + λ w_j β^j/‖β^j‖₂ = 0 .
1. As X is centered, 1_K belongs to the null space of Y⊤X(X⊤X + Ω)⁻¹X⊤Y. It is thus sufficient to choose Θ⁰ orthogonal to 1_K to ensure that its range spans the leading eigenvectors of Y⊤X(X⊤X + Ω)⁻¹X⊤Y. In practice, to comply with this desideratum and conditions (3.5b) and (3.5c), we set Θ⁰ = (Y⊤Y)^{−1/2} U, where U is a K × (K−1) matrix whose columns are orthonormal vectors orthogonal to 1_K.
The second optimality condition (4.32b) can be computed for every variable j as

    ‖x_j⊤(XB − YΘ)‖₂ ≤ λ w_j .
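Both conditions are straightforward to evaluate from the residual matrix XB − YΘ. The helper below is an illustrative sketch (the name `optimality_gaps` is ours, not GLOSS's API): it returns the worst violation of (4.32a) over the active rows and of (4.32b) over the inactive rows.

```python
import numpy as np

def optimality_gaps(X, Y, Theta, B, lam, w):
    """Worst violation of (4.32a) on active rows and (4.32b) on inactive rows."""
    G = X.T @ (X @ B - Y @ Theta)           # row j is dJ/d beta^j
    norms = np.linalg.norm(B, axis=1)
    act = norms > 0
    gap_active = 0.0
    if act.any():
        stat = G[act] + lam * (w[act] / norms[act])[:, None] * B[act]
        gap_active = np.linalg.norm(stat, axis=1).max()
    gap_inactive = 0.0
    if (~act).any():
        gap_inactive = max(0.0, (np.linalg.norm(G[~act], axis=1) - lam * w[~act]).max())
    return gap_active, gap_inactive
```

At B = 0 and λ equal to the largest correlation ‖x_j⊤YΘ‖₂, both gaps are zero, which is the basis of the λ_max computation of Section 5.5.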
5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set, containing the variables that have already been considered relevant. A variable j can be considered for inclusion into the active set if it violates the second optimality condition. We proceed one variable at a time, by choosing the one that is expected to produce the greatest decrease in the objective function:

    ĵ = argmax_j max( ‖x_j⊤(XB − YΘ)‖₂ − λ w_j , 0 ) .
The exclusion of a variable belonging to the active set A is considered if the norm ‖β^j‖₂ is small and if, after setting β^j to zero, the following optimality condition holds:

    ‖x_j⊤(XB − YΘ)‖₂ ≤ λ w_j .

The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.
5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem with this value of λ. The other strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value of the penalty parameter λ_max such that B ≠ 0, and solves the p-OS problem for decreasing values of λ, until a prescribed number of features are declared active.

The maximum value of the penalty parameter λ_max, corresponding to a null B matrix, is obtained by computing the optimality condition (4.32b) at B = 0:

    λ_max = max_{j ∈ {1,…,p}} (1/w_j) ‖x_j⊤YΘ⁰‖₂ .

The algorithm then computes a series of solutions along the regularization path defined by a series of penalties λ₁ = λ_max > ⋯ > λ_t > ⋯ > λ_T = λ_min ≥ 0, by regularly decreasing the penalty, λ_{t+1} = λ_t/2, and using a warm-start strategy, where the feasible initial guess for B(λ_{t+1}) is initialized with B(λ_t). The final penalty parameter λ_min is reached when the maximum number of desired active variables is attained (by default, the minimum of n and p).
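A minimal sketch of this strategy (warm starts omitted; the function name and the default path length are ours):

```python
import numpy as np

def lambda_path(X, Y, Theta0, w, T=8):
    """lambda_max from (4.32b) at B = 0, then T penalties halved at each step."""
    corr = np.linalg.norm(X.T @ Y @ Theta0, axis=1)   # ||x_j' Y Theta0||_2, j = 1..p
    lam_max = (corr / w).max()
    return [lam_max / 2 ** t for t in range(T)]
```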
5.6 Options and Variants

5.6.1 Scaling Variables

As most penalization schemes, GLOSS is sensitive to the scaling of the variables. It thus makes sense to normalize them before applying the algorithm or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.
5.6.2 Sparse Variant

This version replaces some matlab commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.
5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis. Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.
The equivalence proof between penalized OS and penalized LDA (Hastie et al., 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS, corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems

    min_{B ∈ ℝ^{p×(K−1)}} ‖YΘ − XB‖²_F = min_{B ∈ ℝ^{p×(K−1)}} tr(Θ⊤Y⊤YΘ − 2Θ⊤Y⊤XB + n B⊤Σ_T B)

are replaced by

    min_{B ∈ ℝ^{p×(K−1)}} tr(Θ⊤Y⊤YΘ − 2Θ⊤Y⊤XB + n B⊤(Σ_B + diag(Σ_W))B) .

Note that this variant only requires diag(Σ_W) + Σ_B + n⁻¹Ω to be positive definite, which is a weaker requirement than Σ_T + n⁻¹Ω positive definite.
5.6.4 Elastic net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digits recognition
    7 8 9
    4 5 6
    1 2 3

    Ω_L =
    ⎡  3 −1  0 −1 −1  0  0  0  0 ⎤
    ⎢ −1  5 −1 −1 −1 −1  0  0  0 ⎥
    ⎢  0 −1  3  0 −1 −1  0  0  0 ⎥
    ⎢ −1 −1  0  5 −1  0 −1 −1  0 ⎥
    ⎢ −1 −1 −1 −1  8 −1 −1 −1 −1 ⎥
    ⎢  0 −1 −1  0 −1  5  0 −1 −1 ⎥
    ⎢  0  0  0 −1 −1  0  3 −1  0 ⎥
    ⎢  0  0  0 −1 −1 −1 −1  5 −1 ⎥
    ⎣  0  0  0  0 −1 −1  0 −1  3 ⎦

Figure 5.2: Graph and Laplacian matrix for a 3 × 3 image
for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of pixels in a 3 × 3 image, with the corresponding Laplacian matrix. The Laplacian matrix Ω_L is positive semi-definite, and the penalty β⊤Ω_Lβ favors, among vectors of identical L₂ norms, the ones having similar coefficients in the neighborhoods of the graph. For example, this penalty is 9 for the vector (1, 1, 0, 1, 1, 0, 0, 0, 0)⊤, which is the indicator of the neighborhood of pixel 1, and it is 21 for the vector (−1, 1, 0, 1, 1, 0, 0, 0, 0)⊤, with a sign mismatch between pixel 1 and its neighborhood.
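The two penalty values quoted for these vectors can be checked numerically by rebuilding the Laplacian of the 8-neighbor graph of a 3 × 3 image (the index ordering below differs from the figure's pixel numbering, but the grid is symmetric, so the quadratic form values are unchanged):

```python
import numpy as np

# Laplacian of the 8-neighbor (king-move) graph of a 3x3 image, as in Figure 5.2
idx = np.arange(9).reshape(3, 3)
L = np.zeros((9, 9))
for r in range(3):
    for c in range(3):
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if (dr, dc) != (0, 0) and 0 <= rr < 3 and 0 <= cc < 3:
                    L[idx[r, c], idx[rr, cc]] = -1
np.fill_diagonal(L, -L.sum(axis=1))          # diagonal = vertex degrees

beta1 = np.array([1, 1, 0, 1, 1, 0, 0, 0, 0], float)   # neighborhood of a corner pixel
beta2 = np.array([-1, 1, 0, 1, 1, 0, 0, 0, 0], float)  # sign mismatch at that pixel
print(beta1 @ L @ beta1, beta2 @ L @ beta2)  # -> 9.0 21.0
```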
This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty is simply added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.
6 Experimental Results
This section presents some comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA. Those algorithms are Penalized LDA (PLDA) (Witten and Tibshirani, 2011), which applies a Lasso penalty within Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al., 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing the parsimony capacities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites; PLDA is an R implementation and SLDA is coded in matlab. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.
6.1 Normalization

With shrunken estimates, the scaling of features has important outcomes. For the linear discriminants considered here, the two most common normalization strategies consist in setting to ones either the diagonal of the total covariance matrix Σ_T, or the diagonal of the within-class covariance matrix Σ_W. These options can be implemented either by scaling the observations accordingly, prior to the analysis, or by providing penalties with weights. The latter option is implemented in our matlab package.¹

6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, chapter 4) suggest investigating decision thresholds other than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.
1. The GLOSS matlab code can be found in the software section of www.hds.utc.fr/~grandval
6.3 Simulated Data

We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split in a training set of size n = 100, a validation set of size 100, and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations, except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be, respectively, 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is:
Simulation 1: mean shift with independent features. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), where μ_{1j} = 0.7 × 1_{(1≤j≤25)}, μ_{2j} = 0.7 × 1_{(26≤j≤50)}, μ_{3j} = 0.7 × 1_{(51≤j≤75)}, μ_{4j} = 0.7 × 1_{(76≤j≤100)}.

Simulation 2: mean shift with dependent features. There are two classes. If sample i is in class 1, then x_i ∼ N(0, Σ), and if i is in class 2, then x_i ∼ N(μ, Σ), with μ_j = 0.6 × 1_{(j≤200)}. The covariance structure is block diagonal, with 5 blocks, each of dimension 100 × 100. The blocks have (j, j′) element 0.6^{|j−j′|}. This covariance structure is intended to mimic gene expression data correlation.

Simulation 3: one-dimensional mean shift with independent features. There are four classes and the features are independent. If sample i is in class k, then X_{ij} ∼ N((k−1)/3, 1) if j ≤ 100, and X_{ij} ∼ N(0, 1) otherwise.

Simulation 4: mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), with mean vectors defined as follows: μ_{1j} ∼ N(0, 0.3²) for j ≤ 25, and μ_{1j} = 0 otherwise; μ_{2j} ∼ N(0, 0.3²) for 26 ≤ j ≤ 50, and μ_{2j} = 0 otherwise; μ_{3j} ∼ N(0, 0.3²) for 51 ≤ j ≤ 75, and μ_{3j} = 0 otherwise; μ_{4j} ∼ N(0, 0.3²) for 76 ≤ j ≤ 100, and μ_{4j} = 0 otherwise.
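As an illustration, the first setup translates directly into code; the sketch below (the function name and defaults are ours) generates the 1200 examples of Simulation 1:

```python
import numpy as np

def simulation1(n_per_class=300, p=500, seed=0):
    """Simulation 1 sketch: four classes, x_i ~ N(mu_k, I), where class k's
    mean is 0.7 on its own block of 25 features (features 1-100 overall)."""
    rng = np.random.default_rng(seed)
    K = 4
    mu = np.zeros((K, p))
    for k in range(K):
        mu[k, 25 * k: 25 * (k + 1)] = 0.7
    y = np.repeat(np.arange(K), n_per_class)
    X = mu[y] + rng.standard_normal((K * n_per_class, p))
    return X, y
```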
Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA, in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.

The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far away by SLDA, which is the only
Table 6.1: Experimental results for simulated data: averages, with standard deviations, computed over 25 repetitions, of the test error rate, the number of selected variables, and the number of discriminant directions selected on the validation set.
                  Err (%)       Var            Dir

Sim 1, K = 4: mean shift, ind. features
    PLDA         12.6 (0.1)    411.7 (3.7)    3.0 (0.0)
    SLDA         31.9 (0.1)    228.0 (0.2)    3.0 (0.0)
    GLOSS        19.9 (0.1)    106.4 (1.3)    3.0 (0.0)
    GLOSS-D      11.2 (0.1)    251.1 (4.1)    3.0 (0.0)

Sim 2, K = 2: mean shift, dependent features
    PLDA          9.0 (0.4)    337.6 (5.7)    1.0 (0.0)
    SLDA         19.3 (0.1)     99.0 (0.0)    1.0 (0.0)
    GLOSS        15.4 (0.1)     39.8 (0.8)    1.0 (0.0)
    GLOSS-D       9.0 (0.0)    203.5 (4.0)    1.0 (0.0)

Sim 3, K = 4: 1D mean shift, ind. features
    PLDA         13.8 (0.6)    161.5 (3.7)    1.0 (0.0)
    SLDA         57.8 (0.2)    152.6 (2.0)    1.9 (0.0)
    GLOSS        31.2 (0.1)    123.8 (1.8)    1.0 (0.0)
    GLOSS-D      18.5 (0.1)    357.5 (2.8)    1.0 (0.0)

Sim 4, K = 4: mean shift, ind. features
    PLDA         60.3 (0.1)    336.0 (5.8)    3.0 (0.0)
    SLDA         65.9 (0.1)    208.8 (1.6)    2.7 (0.0)
    GLOSS        60.7 (0.2)     74.3 (2.2)    2.7 (0.0)
    GLOSS-D      58.8 (0.1)    162.7 (4.9)    2.9 (0.0)
[Figure 6.1 plots the TPR versus the FPR (in %) for gloss, glossd, slda and plda on Simulations 1 to 4.]

Figure 6.1: TPR versus FPR (in %) for all algorithms and simulations
Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

             Simulation 1    Simulation 2    Simulation 3    Simulation 4
             TPR    FPR      TPR    FPR      TPR    FPR      TPR    FPR
    PLDA     99.0   78.2     96.9   60.3     98.0   15.9     74.3   65.6
    SLDA     73.9   38.5     33.8   16.3     41.6   27.8     50.7   39.5
    GLOSS    64.1   10.6     30.0    4.6     51.1   18.2     26.0   12.1
    GLOSS-D  93.5   39.4     92.1   28.1     95.6   65.5     42.9   29.9
method that does not succeed in uncovering a low-dimensional representation in Simulation 3. The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is defined as the ratio of selected variables that are actually relevant; similarly, the FPR is the ratio of selected variables that are actually non-relevant. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR, but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).
6.4 Gene Expression Data
We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors. It was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011). The Ramaswamy³ dataset contains 198 examples of 16,063 gene expressions for categorizing 14 classes of cancer. Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

2. http://www.broadinstitute.org/cancer/software/genepattern/datasets
3. http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736

Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits, with standard deviations, of the test error rates and the number of selected variables.

                                      Err (%)        Var

Nakayama: n = 86, p = 22,283, K = 5
    PLDA                            20.95 (1.3)    10478.7 (2116.3)
    SLDA                            25.71 (1.7)      252.5 (3.1)
    GLOSS                           20.48 (1.4)      129.0 (18.6)

Ramaswamy: n = 198, p = 16,063, K = 14
    PLDA                            38.36 (6.0)    14873.5 (720.3)
    SLDA                                —                —
    GLOSS                           20.61 (6.9)      372.4 (122.1)

Sun: n = 180, p = 54,613, K = 4
    PLDA                            33.78 (5.9)    21634.8 (7443.2)
    SLDA                            36.22 (6.5)      384.4 (16.5)
    GLOSS                           31.77 (4.5)       93.0 (93.6)
Each dataset was split into a training set and a test set, with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.

Test error rates and the numbers of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution, due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.

Finally, Figure 6.2 displays the projection of the observations for the Nakayama and Sun datasets on the first canonical planes estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well-separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high colinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.
4. http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962
[Figure 6.2 shows, for GLOSS (left) and SLDA (right), the projections of the Nakayama data (classes: 1) synovial sarcoma, 2) myxoid liposarcoma, 3) dedifferentiated liposarcoma, 4) myxofibrosarcoma, 5) malignant fibrous histiocytoma) and of the Sun data (classes: 1) non-tumor, 2) astrocytomas, 3) glioblastomas, 4) oligodendrogliomas) on their first two discriminant directions.]

Figure 6.2: 2D-representations of the Nakayama and Sun datasets based on the first two discriminant vectors provided by GLOSS and SLDA. The big squares represent class means.
Figure 6.3: USPS digits "1" and "0"
6.5 Correlated Data

When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily introduce this prior knowledge.

The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results are left for future work.

For this illustration, we have used a subset of the USPS handwritten digit dataset, made of 16 × 16-pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and S-GLOSS. The mean image of every digit is shown in Figure 6.3.

As in Section 5.6.4, we have encoded the pixel proximity relationships of Figure 5.2 into a penalty matrix Ω_L, but this time on a 256-node graph. Introducing this new 256 × 256 Laplacian penalty matrix Ω_L in the GLOSS algorithm is straightforward.

The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). We clearly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element for discriminating the two digits.

Figure 6.5 displays the discriminant direction β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels, which allows strokes to be detected and will probably provide better prediction results.
β for GLOSS                β for S-GLOSS

Figure 6.4: Discriminant direction between digits "1" and "0"

β for GLOSS, λ = 0.3       β for S-GLOSS, λ = 0.3

Figure 6.5: Sparse discriminant direction between digits "1" and "0"
Discussion
GLOSS is an efficient algorithm that performs sparse LDA, based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem; this is, to our knowledge, the first approach that enjoys this property in the multi-class setting. This relationship is also amenable to accommodating interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K−1)-dimensional problem into (K−1) independent p-dimensional problems. The interaction between the (K−1) problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, regarding either its prediction abilities or its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performance. Employing the same features in all discriminant directions enables the generation of models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of the data, from the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can be added to the group-penalty to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.
Part III
Sparse Clustering Analysis
Abstract
Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records, or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussians, and the different populations are identified by different means and a common covariance matrix.

As in the supervised framework, traditional clustering techniques perform worse when the number of irrelevant features increases. In this part, we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.
7 Feature Selection in Mixture Models
7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models, which represent the K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886), for the detection of outlier points, and 8 years later by Pearson (1894), to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population, and are especially well suited to the problem of clustering.
7.1.1 Model
We assume that the observed data X = (x_1^⊤, …, x_n^⊤)^⊤ have been drawn identically from K different subpopulations of the domain R^p. The generative distribution is a finite mixture model; that is, the data are assumed to be generated from a compound distribution whose density can be expressed as
f(x_i) = Σ_{k=1}^K π_k f_k(x_i) ,  ∀i ∈ {1, …, n} ,

where K is the number of components, f_k are the densities of the components, and π_k are the mixture proportions (π_k ∈ ]0, 1[ for all k, and Σ_k π_k = 1). Mixture models express that, given the proportions π_k and the distributions f_k for each class, the data are generated according to the following mechanism:
• y: each individual is allotted to a class according to a multinomial distribution with parameters π_1, …, π_K;
• x: each x_i is assumed to arise from a random vector with probability density function f_k.
In addition, it is usually assumed that the component densities f_k belong to a parametric family of densities φ(·; θ_k). The density of the mixture can then be written as
f(x_i; θ) = Σ_{k=1}^K π_k φ(x_i; θ_k) ,  ∀i ∈ {1, …, n} ,

where θ = (π_1, …, π_K, θ_1, …, θ_K) is the parameter of the model.
7.1.2 Parameter Estimation: The EM Algorithm
For the estimation of the parameters of a mixture model, Pearson (1894) used the method of moments to estimate the five parameters (μ_1, μ_2, σ_1², σ_2², π) of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods, and Bayesian approaches.
The most widely used process to estimate the parameters is the maximization of the log-likelihood using the EM algorithm, which is typically employed to maximize the likelihood of models with latent variables, for which no analytical solution is available (Dempster et al., 1977).

The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the likelihood expectation with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the expected likelihood from the E-step.

Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed; in practice, the obtained solution depends on the initialization of the algorithm.
Maximum Likelihood Definitions
The likelihood is commonly expressed in its logarithmic version:

L(θ; X) = log ( ∏_{i=1}^n f(x_i; θ) ) = Σ_{i=1}^n log ( Σ_{k=1}^K π_k f_k(x_i; θ_k) ) ,  (7.1)

where n is the number of samples, K is the number of components of the mixture (or number of clusters), and π_k are the mixture proportions.
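As an illustration, (7.1) can be evaluated directly for given parameters. The following sketch (not from the thesis; univariate Gaussian components and hypothetical parameter values are assumed) computes the log-likelihood of a toy mixture:

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def mixture_log_likelihood(xs, pis, mus, sigma2s):
    """Log-likelihood (7.1): sum_i log( sum_k pi_k f_k(x_i) )."""
    return sum(math.log(sum(p * gaussian_pdf(x, m, s2)
                            for p, m, s2 in zip(pis, mus, sigma2s)))
               for x in xs)

# Two-component mixture evaluated on three toy points.
ll = mixture_log_likelihood([0.3, -0.7, 1.9], [0.4, 0.6], [-1.0, 2.0], [1.0, 1.0])
```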
To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood, or classification log-likelihood:
L_C(θ; X, Y) = log ( ∏_{i=1}^n f(x_i, y_i; θ) )
             = Σ_{i=1}^n log ( Σ_{k=1}^K y_ik π_k f_k(x_i; θ_k) )
             = Σ_{i=1}^n Σ_{k=1}^K y_ik log ( π_k f_k(x_i; θ_k) ) .  (7.2)
The y_ik are the binary entries of the indicator matrix Y, with y_ik = 1 if observation i belongs to cluster k and y_ik = 0 otherwise.

The soft membership t_ik(θ) is defined as

t_ik(θ) = p(Y_ik = 1 | x_i; θ)  (7.3)
        = π_k f_k(x_i; θ_k) / f(x_i; θ) .  (7.4)
To lighten notation, t_ik(θ) will be denoted t_ik when the parameter θ is clear from context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:
L_C(θ; X, Y) = Σ_{i,k} y_ik log ( π_k f_k(x_i; θ_k) )
             = Σ_{i,k} y_ik log ( t_ik f(x_i; θ) )
             = Σ_{i,k} y_ik log t_ik + Σ_{i,k} y_ik log f(x_i; θ)
             = Σ_{i,k} y_ik log t_ik + Σ_{i=1}^n log f(x_i; θ)
             = Σ_{i,k} y_ik log t_ik + L(θ; X) ,  (7.5)
where Σ_{i,k} y_ik log t_ik can be reformulated as

Σ_{i,k} y_ik log t_ik = Σ_{i=1}^n Σ_{k=1}^K y_ik log ( p(Y_ik = 1 | x_i; θ) )
                      = Σ_{i=1}^n log ( p(Y_ik = 1 | x_i; θ) )
                      = log ( p(Y | X; θ) ) .
As a result, the relationship (7.5) can be rewritten as

L(θ; X) = L_C(θ; Z) − log ( p(Y | X; θ) ) .  (7.6)
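The decomposition (7.5) can be checked numerically; this small sketch (toy values of our own, unit-variance Gaussian components) verifies that L_C(θ; X, Y) = Σ_{i,k} y_ik log t_ik + L(θ; X) for an arbitrary hard assignment Y:

```python
import math

def gauss(x, mu):
    """Unit-variance Gaussian density."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

pis, mus = [0.4, 0.6], [-1.0, 2.0]       # hypothetical mixture parameters
xs = [0.3, -0.7, 1.9]                    # toy observations
ys = [[1, 0], [1, 0], [0, 1]]            # arbitrary hard assignments y_ik

f = [sum(p * gauss(x, m) for p, m in zip(pis, mus)) for x in xs]
t = [[pis[k] * gauss(x, mus[k]) / fi for k in range(2)]
     for x, fi in zip(xs, f)]            # posteriors, Eq. (7.4)

L = sum(math.log(fi) for fi in f)                                   # Eq. (7.1)
LC = sum(y[k] * math.log(pis[k] * gauss(x, mus[k]))
         for x, y in zip(xs, ys) for k in range(2))                 # Eq. (7.2)
gap = sum(y[k] * math.log(tk[k]) for y, tk in zip(ys, t) for k in range(2))

assert abs(LC - (gap + L)) < 1e-12       # relation (7.5) holds
```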
Likelihood Maximization
The complete log-likelihood cannot be assessed because the variables y_ik are unknown. However, it is possible to estimate it by taking expectations of (7.6) conditionally on a current value θ^(t) of the parameters:
L(θ; X) = E_{Y∼p(·|X;θ^(t))} [ L_C(θ; X, Y) ] + E_{Y∼p(·|X;θ^(t))} [ −log p(Y | X; θ) ]
        = Q(θ, θ^(t)) + H(θ, θ^(t)) .
In this expression, H(θ, θ^(t)) is the entropy and Q(θ, θ^(t)) is the conditional expectation of the complete log-likelihood. Let us define an increment of the log-likelihood as ΔL = L(θ^(t+1); X) − L(θ^(t); X). Then θ^(t+1) = argmax_θ Q(θ, θ^(t)) also increases the log-likelihood:
ΔL = ( Q(θ^(t+1), θ^(t)) − Q(θ^(t), θ^(t)) ) − ( H(θ^(t+1), θ^(t)) − H(θ^(t), θ^(t)) ) ,

where the first difference is ≥ 0 by definition of iteration t+1 and the second is ≤ 0 by Jensen's inequality.
Therefore, it is possible to maximize the likelihood by optimizing Q(θ, θ^(t)). The relationship between Q(θ, θ′) and L(θ; X) is developed in more detail in Appendix F, which shows how the value of L(θ; X) can be recovered from Q(θ, θ^(t)).
For the mixture model problem, Q(θ, θ′) is

Q(θ, θ′) = E_{Y∼p(Y|X;θ′)} [ L_C(θ; X, Y) ]
         = Σ_{i,k} p(Y_ik = 1 | x_i; θ′) log ( π_k f_k(x_i; θ_k) )
         = Σ_{i=1}^n Σ_{k=1}^K t_ik(θ′) log ( π_k f_k(x_i; θ_k) ) .  (7.7)
Due to its similarity to the expression of the complete likelihood (7.2), Q(θ, θ′) is also known as the weighted likelihood. In (7.7), the weights t_ik(θ′) are the posterior probabilities of cluster memberships.
Hence, the EM algorithm sketched above proceeds as follows:

• Initialization (not iterated): choice of the initial parameter θ^(0);
• E-step: evaluation of Q(θ, θ^(t)), using t_ik(θ^(t)) (7.4) in (7.7);
• M-step: computation of θ^(t+1) = argmax_θ Q(θ, θ^(t)).
Gaussian Model
In the particular case of a Gaussian mixture model with common covariance matrix Σ and different mean vectors μ_k, the mixture density is

f(x_i; θ) = Σ_{k=1}^K π_k f_k(x_i; θ_k)
          = Σ_{k=1}^K π_k (2π)^{−p/2} |Σ|^{−1/2} exp ( −(1/2) (x_i − μ_k)^⊤ Σ^{−1} (x_i − μ_k) ) .
At the E-step, the posterior probabilities t_ik are computed as in (7.4) with the current parameters θ^(t); then the M-step maximizes Q(θ, θ^(t)) (7.7), whose form is as follows:
Q(θ, θ^(t)) = Σ_{i,k} t_ik log π_k − Σ_{i,k} t_ik log ( (2π)^{p/2} |Σ|^{1/2} ) − (1/2) Σ_{i,k} t_ik (x_i − μ_k)^⊤ Σ^{−1} (x_i − μ_k)
            = Σ_k t_k log π_k − (np/2) log(2π) − (n/2) log |Σ| − (1/2) Σ_{i,k} t_ik (x_i − μ_k)^⊤ Σ^{−1} (x_i − μ_k)
            ≡ Σ_k t_k log π_k − (n/2) log |Σ| − Σ_{i,k} t_ik ( (1/2) (x_i − μ_k)^⊤ Σ^{−1} (x_i − μ_k) ) ,  (7.8)

where the constant term −(np/2) log(2π) has been dropped in the last line, and

t_k = Σ_{i=1}^n t_ik .  (7.9)
The M-step, which maximizes this expression with respect to θ, applies the following updates defining θ^(t+1):

π_k^(t+1) = t_k / n ,  (7.10)
μ_k^(t+1) = ( Σ_i t_ik x_i ) / t_k ,  (7.11)
Σ^(t+1) = (1/n) Σ_k W_k ,  (7.12)
with W_k = Σ_i t_ik (x_i − μ_k)(x_i − μ_k)^⊤ .  (7.13)
The derivations are detailed in Appendix G
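The whole EM loop for this common-covariance Gaussian model can be sketched as follows. This is our own minimal illustration, not the thesis implementation: the E-step computes the posteriors (7.4) and the M-step applies the updates (7.10)-(7.13); the random centroid initialization and the small ridge terms are simplifications added for numerical safety.

```python
import numpy as np

def em_common_cov(X, K, n_iter=50, seed=0):
    """EM for a Gaussian mixture with common covariance matrix."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    mu = X[rng.choice(n, K, replace=False)]            # initial centroids
    Sigma = np.cov(X.T) + 1e-6 * np.eye(p)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: t_ik proportional to pi_k N(x_i; mu_k, Sigma), Eq. (7.4)
        Sinv = np.linalg.inv(Sigma)
        logdet = np.linalg.slogdet(Sigma)[1]
        d = np.stack([np.einsum('ij,jk,ik->i', X - m, Sinv, X - m) for m in mu],
                     axis=1)                           # squared Mahalanobis distances
        logt = np.log(pi) - 0.5 * (d + logdet + p * np.log(2 * np.pi))
        logt -= logt.max(axis=1, keepdims=True)        # stabilize the exponentials
        T = np.exp(logt)
        T /= T.sum(axis=1, keepdims=True)
        # M-step: proportions (7.10), means (7.11), pooled covariance (7.12)-(7.13)
        tk = T.sum(axis=0)
        pi = tk / n
        mu = (T.T @ X) / tk[:, None]
        W = sum((T[:, k, None] * (X - mu[k])).T @ (X - mu[k]) for k in range(K))
        Sigma = W / n + 1e-8 * np.eye(p)
    return pi, mu, Sigma, T
```

On well-separated toy data, the recovered proportions and centroids approach the generating values; in practice, several restarts are needed because of the local maxima mentioned in Section 7.1.2.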
7.2 Feature Selection in Model-Based Clustering
When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own covariance matrix Σ_k, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.
In the high-dimensional, low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989). Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition, Σ_k = λ_k D_k A_k D_k^⊤ (Banfield and Raftery, 1993).
These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.
In this chapter, we review some techniques to induce sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.
7.2.1 Based on Penalized Likelihood
Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:
log ( p(Y_k = 1 | x) / p(Y_ℓ = 1 | x) ) = x^⊤ Σ^{−1} (μ_k − μ_ℓ) − (1/2) (μ_k + μ_ℓ)^⊤ Σ^{−1} (μ_k − μ_ℓ) + log ( π_k / π_ℓ ) .
In this model, a simple way of introducing sparsity in the discriminant vectors Σ^{−1}(μ_k − μ_ℓ) is to constrain Σ to be diagonal and to favor sparse means μ_k. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm,
λ Σ_{k=1}^K Σ_{j=1}^p |μ_kj| ,
as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:
λ_1 Σ_{k=1}^K Σ_{j=1}^p |μ_kj| + λ_2 Σ_{k=1}^K Σ_{j=1}^p Σ_{m=1}^p |(Σ_k^{−1})_{jm}| .
In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.
Guo et al. (2010) propose a variation with a Pairwise Fusion Penalty (PFP):
λ Σ_{j=1}^p Σ_{1≤k<k′≤K} |μ_kj − μ_k′j| .
This PFP regularization does not shrink the means to zero, but towards each other. If the jth component of all cluster means is driven to the same value, that variable can be considered non-informative.
An L_{1,∞} penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:
λ Σ_{j=1}^p ‖(μ_1j, μ_2j, …, μ_Kj)‖_∞ .
One group is defined for each variable j, as the set of the K means' jth components (μ_1j, …, μ_Kj). The L_{1,∞} penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time; in addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG achieves real feature selection because it forces null values for the same variable in all cluster means:
λ √K Σ_{j=1}^p √( Σ_{k=1}^K μ_kj² ) .
The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code allowing tests is available on the authors' website.
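The penalties reviewed in this section can be compared on a toy matrix of cluster means (values of our own choosing), in which the last variable is uninformative: all clusters share the same mean on it. Only the pairwise fusion penalty assigns that variable a zero cost; the L1, L_{1,∞} and group-Lasso (VMG) penalties still charge for its non-zero common value:

```python
import numpy as np

# K = 3 cluster means over p = 4 variables; column 4 is constant across
# clusters, hence uninformative for the partition.
M = np.array([[ 1.0, 0.0, -2.0, 0.5],
              [-1.0, 0.0,  2.0, 0.5],
              [ 0.0, 3.0,  0.0, 0.5]])
K, p = M.shape

l1 = np.abs(M).sum()                                      # L1 on the means
pfp = sum(abs(M[k, j] - M[kk, j])                         # pairwise fusion
          for j in range(p)
          for k in range(K) for kk in range(k + 1, K))
linf = np.abs(M).max(axis=0).sum()                        # L_{1,inf}: one group per variable
vmg = np.sqrt(K) * np.sqrt((M ** 2).sum(axis=0)).sum()    # group-Lasso (VMG)

# Column 4 contributes 1.5 to l1, 0.5 to linf, about 1.5 to vmg,
# but 0 to pfp: fusing means towards a common value does not penalize it.
```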
The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity in the discriminant vector. The generalization from quadratic to non-quadratic penalties is quickly outlined in their work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.
7.2.2 Based on Model Variants
The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy through conditional independence: the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as
f(x_i | φ, π, θ, ν) = Σ_{k=1}^K π_k ∏_{j=1}^p [f(x_ij | θ_jk)]^{φ_j} [h(x_ij | ν_j)]^{1−φ_j} ,
where f(· | θ_jk) is the distribution function for relevant features and h(· | ν_j) is the distribution function for the irrelevant ones. The binary vector φ = (φ_1, φ_2, …, φ_p) represents relevance, with φ_j = 1 if the jth feature is informative and φ_j = 0 otherwise. The saliency of variable j is then formalized as ρ_j = P(φ_j = 1), so that all φ_j can be treated as missing variables. Thus the set of parameters is {π_k, θ_jk, ν_j, ρ_j}; their estimation is done by means of the EM algorithm (Law et al., 2004).
An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix U ∈ R^{p×(K−1)}, which is updated inside the EM loop by a new step called the Fisher step (F-step from now on), which maximizes a multi-class Fisher criterion
tr ( (U^⊤ Σ_W U)^{−1} U^⊤ Σ_B U ) ,  (7.14)
so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then the F-step updates the matrix that projects the data onto the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as a function of the projection matrix U and of the model parameters in the latent space, such that the matrix U enters the M-step equations.
To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one computes the best sparse orthogonal approximation Ũ of the matrix U that maximizes (7.14). This sparse approximation is defined as the solution of
min_{Ũ ∈ R^{p×(K−1)}}  ‖X_U − XŨ‖_F² + λ Σ_{k=1}^{K−1} ‖ũ_k‖_1 ,
where X_U = XU is the input data projected onto the non-sparse space and ũ_k is the kth column vector of the projection matrix Ũ. The second possibility is inspired by Qiao et al. (2009), and reformulates Fisher's discriminant (7.14), used to compute the projection matrix, as a regression criterion penalized by a mixture of Lasso and Elastic net:
min_{A,B ∈ R^{p×(K−1)}}  Σ_{k=1}^K ‖R_W^{−⊤} H_Bk − A B^⊤ H_Bk‖_2² + ρ Σ_{j=1}^{K−1} β_j^⊤ Σ_W β_j + λ Σ_{j=1}^{K−1} ‖β_j‖_1
s.t.  A^⊤ A = I_{K−1} ,
where H_B ∈ R^{p×K} is a matrix defined conditionally on the posterior probabilities t_ik, satisfying H_B H_B^⊤ = Σ_B, and H_Bk is the kth column of H_B; R_W ∈ R^{p×p} is an upper triangular matrix resulting from the Cholesky decomposition of Σ_W. Σ_W and Σ_B are the p × p within-class and between-class covariance matrices in the observation space. A ∈ R^{p×(K−1)} and B ∈ R^{p×(K−1)} are the solutions of the optimization problem, such that B = [β_1, …, β_{K−1}] is the best sparse approximation of U.
The last possibility obtains the solution of Fisher's discriminant (7.14) as the solution of the following constrained optimization problem:
min_{Ũ ∈ R^{p×(K−1)}}  Σ_{j=1}^p ‖Σ_Bj − Ũ Ũ^⊤ Σ_Bj‖_2²
s.t.  Ũ^⊤ Ũ = I_{K−1} ,
where Σ_Bj is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.
To comply with the constraint stating that the columns of U are orthogonal, the first and the second options must be followed by a singular value decomposition of Ũ to recover orthogonality. This is not necessary with the third option, since the penalized version of SVD already guarantees orthogonality.
However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed." Immediately after this paragraph, we can read that under certain suppositions their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices, the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized."
7.2.3 Based on Model Selection
Some clustering algorithms recast the feature selection problem as a model selection problem. Following this approach, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:
• X^(1): the set of selected relevant variables;
• X^(2): the set of variables being considered for inclusion in, or exclusion from, X^(1);
• X^(3): the set of non-relevant variables.
With those subsets, they define two different models, where Y is the partition to consider:
• M1:  f(X | Y) = f(X^(1), X^(2), X^(3) | Y) = f(X^(3) | X^(2), X^(1)) f(X^(2) | X^(1)) f(X^(1) | Y)
• M2:  f(X | Y) = f(X^(1), X^(2), X^(3) | Y) = f(X^(3) | X^(2), X^(1)) f(X^(2), X^(1) | Y)
Model M1 means that the variables in X^(2) are independent of the clustering Y; model M2 states that the variables in X^(2) depend on the clustering Y. To simplify the algorithm, the subset X^(2) is updated one variable at a time. Therefore, deciding the relevance of variable X^(2) amounts to a model selection between M1 and M2. The selection is done via the Bayes factor
B_12 = f(X | M1) / f(X | M2) ,

where the high-dimensional factor f(X^(3) | X^(2), X^(1)) cancels from the ratio:
B_12 = f(X^(1), X^(2), X^(3) | M1) / f(X^(1), X^(2), X^(3) | M2)
     = f(X^(2) | X^(1), M1) f(X^(1) | M1) / f(X^(2), X^(1) | M2) .
This factor is approximated, since the integrated likelihoods f(X^(1) | M1) and f(X^(2), X^(1) | M2) are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of f(X^(2) | X^(1), M1), when there is only one variable in X^(2), can be represented as a linear regression of variable X^(2) on the variables in X^(1); there is also a BIC approximation for this term.
Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets (X^(1) and X^(3)) remain the same, but X^(2) is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. This algorithm also uses a backward stepwise strategy, instead of the forward stepwise strategy used by Raftery and Dean (2006). Their algorithm allows one to define blocks of indivisible variables, which in certain situations improves the clustering and its interpretability.
Both algorithms are well motivated and appear to produce good results; however, the quantity of computation needed to test the different subsets of variables requires huge computation times. In practice, they cannot be used for the amount of data considered in this thesis.
8 Theoretical Foundations
In this chapter we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar: providing an assignment of examples to clusters based on few features.
We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to produce reduced-rank decision rules using less than K − 1 discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.
In the subsequent sections, we provide the principles that allow us to solve the M-step as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach to the group-Lasso. Finally, some considerations regarding the criterion optimized by this modified EM are provided.
8.1 Resolving EM with Optimal Scoring
In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.
8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis
LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance
d(x_i, μ_k) = (x_i − μ_k)^⊤ Σ_W^{−1} (x_i − μ_k) ,

where μ_k are the p-dimensional centroids and Σ_W is the p × p common within-class covariance matrix.
The likelihood equations of the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by t_ik (the posterior probabilities computed at the E-step).
Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood

2 l_weight(μ, Σ) = − Σ_{i=1}^n Σ_{k=1}^K t_ik d(x_i, μ_k) − n log |Σ_W| ,

which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized maximum likelihood in Gaussian mixtures.
8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis
The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1, in the supervised learning framework. This is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.
8.1.3 Clustering Using Penalized Optimal Scoring
The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to Fisher's discriminative directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_ik in the E-step, the distance between the samples x_i and the centroids μ_k must be evaluated. Depending on whether we are working in the input, OS, or LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:
d(x_i, μ_k) = ‖(x_i − μ_k)^⊤ B_LDA‖_2² − 2 log(π_k) .
This distance defines the computation of the posterior probabilities t_ik in the E-step (see Section 4.2.3). Putting all those elements together, the complete clustering algorithm can be summarized as follows:
1. Initialize the membership matrix Y (for example by the K-means algorithm).
2. Solve the p-OS problem as

B_OS = (X^⊤X + λΩ)^{−1} X^⊤ Y Θ ,

where Θ are the K − 1 leading eigenvectors of Y^⊤X (X^⊤X + λΩ)^{−1} X^⊤Y.
3. Map X to the LDA domain: X_LDA = X B_OS D, with D = diag ( α_k^{−1} (1 − α_k²)^{−1/2} ).
4. Compute the centroids M in the LDA domain.
5. Evaluate distances in the LDA domain.
6. Translate distances into posterior probabilities t_ik with

t_ik ∝ exp [ − ( d(x_i, μ_k) − 2 log(π_k) ) / 2 ] .  (8.1)
7. Update the labels using the posterior probability matrix: Y = T.
8. Go back to step 2 and iterate until the t_ik converge.
Items 2 to 5 can be interpreted as the M-step and Item 6 as the E-step in this alter-native view of the EM algorithm for Gaussian mixtures
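Items 2, 3 and 6 can be sketched in a few lines. This is an illustrative toy implementation of ours, with a plain ridge penalty Ω = I and a simplified normalization of Θ; the exact scaling and the group-Lasso penalty used by Mix-GLOSS are those of Chapters 4 and 5:

```python
import numpy as np

def pos_step(X, Y, lam=0.1):
    """One penalized optimal scoring step: B_OS = (X'X + lam*I)^{-1} X'Y Theta,
    followed by the mapping to the LDA domain."""
    n, p = X.shape
    K = Y.shape[1]
    Ginv = np.linalg.inv(X.T @ X + lam * np.eye(p))
    M = np.linalg.solve(Y.T @ Y, Y.T @ X @ Ginv @ X.T @ Y)
    evals, evecs = np.linalg.eig(M)
    order = np.argsort(evals.real)[::-1][:K - 1]          # K-1 leading eigenvectors
    alpha2 = np.clip(evals.real[order], 1e-8, 1 - 1e-8)   # alpha_k^2
    Theta = evecs.real[:, order]
    B = Ginv @ X.T @ Y @ Theta                            # B_OS
    D = np.diag(1.0 / np.sqrt(alpha2 * (1.0 - alpha2)))   # diag(alpha^{-1}(1-alpha^2)^{-1/2})
    return B, X @ B @ D                                   # coefficients, X_LDA

# Toy run: 30 centered samples, 3 clusters with hard memberships.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
X -= X.mean(axis=0)
Y = np.eye(3)[np.arange(30) % 3]
B, X_lda = pos_step(X, Y)
```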
8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
In the previous section, we sketched a clustering algorithm that replaces the M-step with a penalized OS problem. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.
8.2 Optimized Criterion
In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ′) (7.7) so as to maximize the likelihood L(θ) (see Section 7.1.2). Replacing the M-step by a penalized optimal scoring problem is possible, and the link between the penalized optimal scoring problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.
This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.
8.2.1 A Bayesian Derivation
This section sketches a Bayesian treatment of inference limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).
The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.
The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:

f(Σ | Λ_0, ν_0) = ( 2^{np/2} |Λ_0|^{n/2} Γ_p(n/2) )^{−1} |Σ^{−1}|^{(ν_0 − p − 1)/2} exp ( −(1/2) tr(Λ_0^{−1} Σ^{−1}) ) ,
where ν_0 is the number of degrees of freedom of the distribution, Λ_0 is a p × p scale matrix, and Γ_p is the multivariate gamma function, defined as
Γ_p(n/2) = π^{p(p−1)/4} ∏_{j=1}^p Γ ( n/2 + (1 − j)/2 ) .
The posterior distribution can be maximized, similarly to the likelihood, through the maximization of
Q(θ, θ′) + log f(Σ | Λ_0, ν_0)
= Σ_{k=1}^K t_k log π_k − ((n + 1)p/2) log 2 − (n/2) log |Λ_0| − (p(p + 1)/4) log π
  − Σ_{j=1}^p log Γ ( n/2 + (1 − j)/2 ) − ((ν_n − p − 1)/2) log |Σ| − (1/2) tr(Λ_n^{−1} Σ^{−1})
≡ Σ_{k=1}^K t_k log π_k − (n/2) log |Λ_0| − ((ν_n − p − 1)/2) log |Σ| − (1/2) tr(Λ_n^{−1} Σ^{−1}) ,  (8.2)

with t_k = Σ_{i=1}^n t_ik ,
     ν_n = ν_0 + n ,
     Λ_n^{−1} = Λ_0^{−1} + S_0 ,
     S_0 = Σ_{i=1}^n Σ_{k=1}^K t_ik (x_i − μ_k)(x_i − μ_k)^⊤ .
Details of these calculations can be found in textbooks (for example, Bishop, 2006; Gelman et al., 2003).
8.2.2 Maximum a Posteriori Estimator
The maximization of (8.2) with respect to μ_k and π_k is of course not affected by the additional prior term, where only the covariance Σ intervenes. The MAP estimator for Σ is simply obtained by differentiating (8.2) with respect to Σ. The details of the calculations follow the same lines as the ones for maximum likelihood, detailed in Appendix G. The resulting estimator for Σ is
Σ_MAP = ( ν_0 + n − p − 1 )^{−1} ( Λ_0^{−1} + S_0 ) ,  (8.3)
where S_0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if ν_0 is chosen to be p + 1 and Λ_0^{−1} = λΩ, where Ω is the penalty matrix from the group-Lasso regularization (4.25).
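The estimator (8.3) is immediate to compute; a small sketch (variable names are ours):

```python
import numpy as np

def sigma_map(S0, Lambda0_inv, n, nu0):
    """MAP covariance estimator (8.3): (Lambda_0^{-1} + S_0) / (nu_0 + n - p - 1)."""
    p = S0.shape[0]
    return (Lambda0_inv + S0) / (nu0 + n - p - 1)

# With nu_0 = p + 1 the denominator reduces to n, so that
# Sigma_MAP = (lambda * Omega + S_0) / n, the penalized within-class covariance.
p, n, lam = 2, 10, 0.5
S0 = 2.0 * np.eye(p)              # toy weighted scatter matrix
Omega = np.eye(p)                 # toy penalty matrix
Sigma = sigma_map(S0, lam * Omega, n, nu0=p + 1)
```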
9 Mix-GLOSS Algorithm
Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter, I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.
9.1 Mix-GLOSS
The implementation of Mix-GLOSS involves three nested loops, as schemed in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_ik.
When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.
The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.
9.1.1 Outer Loop: Whole Algorithm Repetitions
This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:
• the centered n × p feature matrix X;
• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);
• the number of clusters K;
• the maximum number of iterations for the EM algorithm;
• the convergence tolerance for the EM algorithm;
• the number of whole repetitions of the clustering algorithm;
Figure 9.1: Mix-GLOSS loops scheme.
• a p × (K − 1) initial coefficient matrix (optional);
• an n × K initial posterior probability matrix (optional).
For each repetition of the algorithm, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess of the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.
9.1.2 Penalty Parameter Loop
The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, such that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have verified that the warm-start implemented reduces the computation time by a factor of 8, with respect to using a null B matrix and a K-means execution for the initial Y label matrix.
Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the estimated percentage of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.
Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternative variational approach from Appendix D is used, Equation (4.32b) must be replaced by (D.10b).
Algorithm 2 Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize: B ← 0; Y ← K-means(X, K)
Run non-penalized Mix-GLOSS: λ ← 0; (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
lastLAMBDA ← false
repeat
    Estimate λ: compute the gradient at $\beta^j = 0$:
        $\left.\dfrac{\partial J(\mathbf{B})}{\partial \beta^j}\right|_{\beta^j=0} = \mathbf{x}^{j\top}\Big(\sum_{m \neq j} \mathbf{x}^m \beta^m - \mathbf{Y}\Theta\Big)$
    Compute λmax for every feature using (4.32b):
        $\lambda_{\max}^{j} = \dfrac{1}{w_j}\left\| \left.\dfrac{\partial J(\mathbf{B})}{\partial \beta^j}\right|_{\beta^j=0} \right\|_2$
    Choose λ so as to remove 10% of the relevant features
    Run penalized Mix-GLOSS: (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    if number of relevant variables in B > minVAR then
        lastLAMBDA ← false
    else
        lastLAMBDA ← true
    end if
until lastLAMBDA
Output: B, L(θ), t_ik, π_k, μ_k, Σ, Y for every λ in the solution path
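As an illustration of the λmax computation at the heart of Algorithm 2, the following Python sketch estimates a penalty value expected to discard a given fraction of the currently relevant features. It is a simplified stand-in, not the actual Mix-GLOSS routine: the function name, the least-squares form of the gradient and the ranking strategy are assumptions based on the description above.

```python
import numpy as np

def lambda_for_removal(X, Y_theta, B, weights, frac=0.10):
    """Sketch of the lambda-estimation step of Algorithm 2.

    For every feature j, compute the smallest penalty lam_max[j] that
    zeroes beta^j (norm of the fit-term gradient at beta^j = 0, scaled
    by the group weight), then pick a lambda expected to discard about
    `frac` of the currently relevant (non-zero) features.
    """
    p = X.shape[1]
    lam_max = np.empty(p)
    for j in range(p):
        Bj = B.copy()
        Bj[j, :] = 0.0                        # gradient taken at beta^j = 0
        grad_j = X[:, j] @ (X @ Bj - Y_theta)
        lam_max[j] = np.linalg.norm(grad_j, 2) / weights[j]
    relevant = np.flatnonzero(np.linalg.norm(B, axis=1) > 0)
    if relevant.size == 0:
        return lam_max.max()
    # Features whose lam_max falls below the chosen lambda get discarded.
    k = max(1, int(np.ceil(frac * relevant.size)))
    return np.sort(lam_max[relevant])[k - 1]
```

With frac = 0.10 this mimics the 10% removal target of Algorithm 2; raising frac makes the feature selection more aggressive, as discussed above.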
9.1.3 Inner Loop: EM Algorithm
The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities t_ik is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.
Algorithm 3 Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
if (B0, Y0) available then
    B_OS ← B0; Y ← Y0
else
    B_OS ← 0; Y ← K-means(X, K)
end if
convergenceEM ← false; tolEM ← 1e-3
repeat
    M-step:
        (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
        $\mathbf{X}_{\mathrm{LDA}} = \mathbf{X}\,\mathbf{B}_{\mathrm{OS}}\,\mathrm{diag}\big(\alpha^{-1}(1-\alpha^2)^{-1/2}\big)$
        π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
    E-step:
        t_ik as per (8.1); L(θ) as per (8.2)
    if $\frac{1}{n}\sum_i |t_{ik} - y_{ik}| < \text{tolEM}$ then
        convergenceEM ← true
    end if
    Y ← T
until convergenceEM
Y ← MAP(T)
Output: B_OS, Θ, L(θ), t_ik, π_k, μ_k, Σ, Y
M-Step
The M-step deals with the estimation of the model parameters, that is, the cluster means μ_k, the common covariance matrix Σ and the prior of every component π_k. In a classical M-step this is done explicitly by maximizing the likelihood expression; here this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version of the label matrix, YΘ. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.
E-Step
The E-step evaluates the posterior probability matrix T using

$$t_{ik} \propto \exp\left[-\,\frac{d(\mathbf{x}_i,\mu_k) - 2\log(\pi_k)}{2}\right] \; .$$

The convergence of these $t_{ik}$ is used as the stopping criterion for EM.
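The E-step update can be sketched as follows. This is a hedged illustration: the thesis evaluates the Mahalanobis distance in the reduced LDA space produced by the M-step, whereas this sketch works directly in the input space with a common covariance matrix.

```python
import numpy as np

def e_step(X, means, cov, priors):
    """Posterior probabilities t_ik ∝ exp(-(d(x_i, mu_k) - 2 log pi_k) / 2),
    with d the squared Mahalanobis distance under the common covariance."""
    n, K = X.shape[0], means.shape[0]
    cov_inv = np.linalg.inv(cov)
    log_t = np.empty((n, K))
    for k in range(K):
        diff = X - means[k]
        # Squared Mahalanobis distance of every sample to mean k.
        d = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
        log_t[:, k] = -(d - 2.0 * np.log(priors[k])) / 2.0
    # Normalize rows with the log-sum-exp trick for numerical stability.
    log_t -= log_t.max(axis=1, keepdims=True)
    T = np.exp(log_t)
    return T / T.sum(axis=1, keepdims=True)
```

Each row of the returned T sums to one, and the convergence of T between EM iterations provides the stopping test of Algorithm 3.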
9.2 Model Selection
Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.
In a first attempt, we tried a classical structure where clustering was performed several times from different initializations, for every penalty parameter value. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen, and the definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm took a lot of computing resources, since the stability selection mechanism required a certain number of repetitions that turned Mix-GLOSS into a lengthy structure of four nested loops.
In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978), but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, still required a large computation time.
The third and definitive attempt (up to now) proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0), and the execution with the best log-likelihood is chosen; the repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS uses the values of the penalty parameter provided by the user or computed by the automatic selection mechanism; this time, only one repetition of the algorithm is done for every value of the penalty parameter. This version has been tested
Figure 9.2: Mix-GLOSS model selection diagram. An initial non-penalized Mix-GLOSS (λ = 0, 20 repetitions) takes X, K, λ, the maximum number of EM iterations and the number of repetitions as inputs; the B and T from the best repetition are used as StartB and StartT to warm-start Mix-GLOSS(λ, StartB, StartT) for each penalty value; BIC is computed for each λ, λ_BEST = argmin_λ BIC is chosen, and the partition, t_ik, π_k, λ_BEST, B, Θ, D, L(θ) and the active set are returned.
with no significant differences in the quality of the clustering, but reducing dramatically the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.
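The model-selection driver described above can be sketched as follows. The handle `mix_gloss` is hypothetical and stands in for the actual routine; the degrees-of-freedom term is a simplified reading of the sparsity-aware BIC of Pan and Shen (2007), counting only the retained variables.

```python
import numpy as np

def select_lambda(X, K, lambdas, mix_gloss, n_init=20):
    """Sketch of the definitive model-selection scheme: several
    non-penalized runs, keep the best by log-likelihood, then one
    warm-started penalized run per lambda; return the lambda that
    minimizes a sparsity-aware BIC.

    `mix_gloss(X, K, B, T, lam)` is a hypothetical handle returning
    (B, T, loglik)."""
    best = max((mix_gloss(X, K, None, None, 0.0) for _ in range(n_init)),
               key=lambda r: r[2])             # best non-penalized run
    B, T, _ = best
    n = X.shape[0]
    scores = []
    for lam in sorted(lambdas):
        B, T, loglik = mix_gloss(X, K, B, T, lam)          # warm start
        df = np.count_nonzero(np.linalg.norm(B, axis=1))   # kept variables
        bic = -2.0 * loglik + np.log(n) * df
        scores.append((bic, lam, B, T))
    return min(scores, key=lambda s: s[0])
```

Only the non-penalized problem is repeated; every penalized run reuses the previous (B, T), which is what makes this scheme dramatically faster than the stability-based one.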
10 Experimental Results
The performance of Mix-GLOSS is measured here with the artificial dataset that was used in Section 6.
This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for simulation 2, where they are slightly correlated. In simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be 1.7%, 6.7%, 7.3% and 30.0%, respectively. The exact description of every setup has already been given in Section 6.3.
In our tests, we have reduced the volume of the problem because, with the original size of 1200 samples and 500 dimensions, some of the algorithms to be tested took several days (even weeks) to finish. Hence, the definitive database was chosen to approximately maintain the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1, adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments, allows a better understanding of the different simulations.
The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are reported as the average value and the standard deviation over the 25 repetitions.
10.1 Tested Clustering Algorithms
This section compares Mix-GLOSS with the following state-of-the-art methods:
• CS general cov: model-based clustering with unconstrained covariance matrices, based on the regularization of the likelihood function with L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on Wei Pan's website.
• Fisher EM: this method models and clusters the data in a discriminative, low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "FisherEM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network website.
Figure 10.1: Class mean vectors for each artificial simulation.
• SelvarClust/Clustvarsel: implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software implemented in C++ that makes use of the clustering library mixmod (Biernacki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.
After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.
The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network website.
• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. In Kuan et al. (2010), slight changes are introduced in the penalty term, such as weighting parameters, that are particularly important for their dataset. The LumiWCluster package allows clustering using either the expression from Wang and Zhu (2008) (called LumiWCluster-Wang here) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).
• Mix-GLOSS: this is the clustering algorithm implemented using GLOSS (see
Section 9). It makes use of an EM algorithm and the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach of the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.
10.2 Results
Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The criteria used to measure performance are:
• Clustering error (in percentage): to measure the quality of the partition given the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007). If the obtained partition and the real labeling coincide, the clustering error is 0%. The way this measure is defined allows reaching the ideal 0% clustering error even if the IDs of the clusters and of the real classes differ.
• Number of discarded features: this value gives the number of variables whose coefficients have been zeroed, and which are therefore not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.
• Time of execution (in hours, minutes or seconds): finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. These algorithms tend to consume more memory and CPU as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.
The adequacy of the selected features was assessed by the true positive rate (TPR) and the false positive rate (FPR). The TPR is the proportion of relevant variables that are selected; similarly, the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to high computing time and clustering error, respectively, and since the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.
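A possible implementation of these performance measures is sketched below. The permutation-invariant error uses a best one-to-one matching between cluster and class IDs, which is a common way to realize a relabelling-insensitive measure such as that of Wu and Schölkopf (2007), although their exact definition may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(y_true, y_pred, K):
    """Clustering error (in %) insensitive to cluster relabelling:
    find the one-to-one matching of cluster IDs to class IDs that
    maximizes the number of agreeing samples."""
    C = np.zeros((K, K), dtype=int)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    rows, cols = linear_sum_assignment(-C)   # maximize matched counts
    return 100.0 * (1.0 - C[rows, cols].sum() / len(y_true))

def tpr_fpr(selected, relevant, p):
    """TPR: fraction of relevant variables selected; FPR: fraction of
    irrelevant variables selected (both in %); p is the total number
    of variables."""
    selected, relevant = set(selected), set(relevant)
    tpr = 100.0 * len(selected & relevant) / len(relevant)
    fpr = 100.0 * len(selected - relevant) / (p - len(relevant))
    return tpr, fpr
```

For the datasets of this chapter, `relevant` would be the first 20 variable indices and p = 100.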
Results, in percentages, are displayed in Figure 10.2 (and in Table 10.2).
Table 10.1: Experimental results for simulated data

                        Err. (%)       Var.           Time

Sim. 1: K = 4, mean shift, ind. features
CS general cov          4.6 (1.5)      98.5 (7.2)     884h
Fisher EM               5.8 (8.7)      78.4 (5.2)     1645m
Clustvarsel             60.2 (10.7)    37.8 (29.1)    383h
LumiWCluster-Kuan       4.2 (6.8)      77.9 (4)       389s
LumiWCluster-Wang       4.3 (6.9)      78.4 (3.9)     619s
Mix-GLOSS               3.2 (1.6)      80 (0.9)       15h

Sim. 2: K = 2, mean shift, dependent features
CS general cov          15.4 (2)       99.7 (0.9)     783h
Fisher EM               7.4 (2.3)      80.9 (2.8)     8m
Clustvarsel             7.3 (2)        33.4 (20.7)    166h
LumiWCluster-Kuan       6.4 (1.8)      79.8 (0.4)     155s
LumiWCluster-Wang       6.3 (1.7)      79.9 (0.3)     14s
Mix-GLOSS               7.7 (2)        84.1 (3.4)     2h

Sim. 3: K = 4, 1D mean shift, ind. features
CS general cov          30.4 (5.7)     55 (46.8)      1317h
Fisher EM               23.3 (6.5)     36.6 (5.5)     22m
Clustvarsel             65.8 (11.5)    23.2 (29.1)    542h
LumiWCluster-Kuan       32.3 (2.1)     80 (0.2)       83s
LumiWCluster-Wang       30.8 (3.6)     80 (0.2)       1292s
Mix-GLOSS               34.7 (9.2)     81 (8.8)       21h

Sim. 4: K = 4, mean shift, ind. features
CS general cov          62.6 (5.5)     99.9 (0.2)     112h
Fisher EM               56.7 (10.4)    55 (4.8)       195m
Clustvarsel             73.2 (4)       24 (1.2)       767h
LumiWCluster-Kuan       69.2 (11.2)    99 (2)         876s
LumiWCluster-Wang       69.7 (11.9)    99.1 (2.1)     825s
Mix-GLOSS               66.9 (9.1)     97.5 (1.2)     11h
Table 10.2: TPR versus FPR (in %), averages computed over 25 repetitions for the best performing algorithms

             Simulation 1     Simulation 2     Simulation 3     Simulation 4
             TPR     FPR      TPR     FPR      TPR     FPR      TPR     FPR
MIX-GLOSS    99.2    0.15     82.8    3.35     88.4    6.7      78.0    1.2
LUMI-KUAN    99.2    2.8      100.0   0.2      100.0   0.05     50      0.05
FISHER-EM    98.6    2.4      88.8    1.7      83.8    58.25    62.0    40.75
Figure 10.2: TPR versus FPR (in %) for the best performing algorithms (MIX-GLOSS, LUMI-KUAN, FISHER-EM) across Simulations 1–4.
10.3 Discussion
After reviewing Tables 10.1–10.2 and Figure 10.2, we see that there is no definitive winner in all situations with regard to all criteria. Depending on the objectives and constraints of the problem, the following observations deserve to be highlighted.
LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior regarding the other criteria. At the other end of this scale, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on each implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation time is worth mentioning here.
The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Wang and Zhu, 2008) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.
From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best, or close to the best solution, in terms of fall-out and recall.
Conclusions
Summary
The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, to avoid overfitting, to counteract ill-posed problems, or to remove correlated or noisy variables.
In this thesis we have demonstrated the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.
The equivalence between LDA and OS problems makes it possible to bring all the resources available for solving regression problems to bear on linear discrimination. In their penalized versions, this equivalence holds under certain conditions, which have not always been obeyed when OS has been used to solve LDA problems.
In Part II, we used a variational approach of the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver (GLOSS) algorithm, which has proved its effectiveness, inducing extremely parsimonious models without renouncing any predictive capability. GLOSS has been tested on four artificial and three real datasets, outperforming other state-of-the-art algorithms in almost all situations.
In Part III, this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As in the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion to be maximized at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection of the penalty parameter. Also in this case, the theory has been put into practice with the implementation of Mix-GLOSS. So far, due to time constraints, only artificial datasets have been tested, with positive results.
Perspectives
Even if the preliminary results are encouraging, Mix-GLOSS has not been sufficiently tested. We plan to test it at least with the same real datasets that we used with GLOSS; however, more testing would be advisable in both cases. These algorithms are well suited to genomic data, where the number of samples is smaller than the number of variables, but other high-dimension low-sample-size (HDLSS) domains are also possible: identification of male or female silhouettes, of fungal species, or of fish species based on shape and texture (Clemmensen et al., 2011), and the Stirling faces (Roth and Lange, 2004) are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a) have also been tested in the literature.
At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was achieving functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version of GLOSS and Mix-GLOSS should be made available in the short term.
The theory developed in this thesis, and the programming structure used for its implementation, allow easy alterations to the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements of the covariance matrix but its diagonal; spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as a Laplacian that describes the relationships between variables; this can be used to implement pairwise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-like penalties. Some of those possibilities, such as the diagonal version of GLOSS, have been partially implemented, but they have not been properly tested or updated with the latest algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the deadlines for the publication of this thesis.
From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized in Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Not knowing this criterion does not prevent successful runs of Mix-GLOSS: other mechanisms, not involving the computation of the true criterion, have been used to stop the EM algorithm and to perform model selection. However, further investigation must be done in this direction to assess the convergence properties of the algorithm.
At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was devoted to outlier detection and block clustering. One of the most successful mechanisms for outlier detection models the population with a mixture model in which the outliers are described by a uniform distribution; this technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather all the points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation as a penalized optimal scoring regression.
Appendix
A Matrix Properties

Property 1. By definition, $\Sigma_W$ and $\Sigma_B$ are both symmetric matrices:

$$\Sigma_W = \frac{1}{n}\sum_{k=1}^{g}\sum_{i \in C_k}(\mathbf{x}_i - \mu_k)(\mathbf{x}_i - \mu_k)^\top \; , \qquad \Sigma_B = \frac{1}{n}\sum_{k=1}^{g} n_k(\mu_k - \bar{\mathbf{x}})(\mu_k - \bar{\mathbf{x}})^\top \; .$$

Property 2. $\dfrac{\partial\, \mathbf{x}^\top\mathbf{a}}{\partial \mathbf{x}} = \dfrac{\partial\, \mathbf{a}^\top\mathbf{x}}{\partial \mathbf{x}} = \mathbf{a}$

Property 3. $\dfrac{\partial\, \mathbf{x}^\top\mathbf{A}\mathbf{x}}{\partial \mathbf{x}} = (\mathbf{A} + \mathbf{A}^\top)\,\mathbf{x}$

Property 4. $\dfrac{\partial\, |\mathbf{X}^{-1}|}{\partial \mathbf{X}} = -|\mathbf{X}^{-1}|\,(\mathbf{X}^{-1})^\top$

Property 5. $\dfrac{\partial\, \mathbf{a}^\top\mathbf{X}\mathbf{b}}{\partial \mathbf{X}} = \mathbf{a}\mathbf{b}^\top$

Property 6. $\dfrac{\partial}{\partial \mathbf{X}}\,\mathrm{tr}\left(\mathbf{A}\mathbf{X}^{-1}\mathbf{B}\right) = -\left(\mathbf{X}^{-1}\mathbf{B}\mathbf{A}\mathbf{X}^{-1}\right)^\top = -\,\mathbf{X}^{-\top}\mathbf{A}^\top\mathbf{B}^\top\mathbf{X}^{-\top}$
B The Penalized-OS Problem is an Eigenvector Problem

In this appendix we answer the question of why the solution of a penalized optimal scoring regression involves an eigenvector decomposition. The p-OS problem has the form

$$\min_{\theta_k,\,\beta_k} \; \|\mathbf{Y}\theta_k - \mathbf{X}\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k \quad \text{(B.1)}$$
$$\text{s.t. } \theta_k^\top \mathbf{Y}^\top\mathbf{Y}\theta_k = 1 \; , \quad \theta_\ell^\top \mathbf{Y}^\top\mathbf{Y}\theta_k = 0 \;\; \forall \ell < k \; ,$$

for k = 1, ..., K − 1. The Lagrangian associated with Problem (B.1) is

$$\mathcal{L}_k(\theta_k,\beta_k,\lambda_k,\nu_k) = \|\mathbf{Y}\theta_k - \mathbf{X}\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k + \lambda_k\big(\theta_k^\top \mathbf{Y}^\top\mathbf{Y}\theta_k - 1\big) + \sum_{\ell<k}\nu_\ell\, \theta_\ell^\top \mathbf{Y}^\top\mathbf{Y}\theta_k \; . \quad \text{(B.2)}$$

Setting the gradient of (B.2) with respect to $\beta_k$ to zero gives the optimal $\beta_k^\star$:

$$\beta_k^\star = (\mathbf{X}^\top\mathbf{X} + \Omega_k)^{-1}\mathbf{X}^\top\mathbf{Y}\theta_k \; . \quad \text{(B.3)}$$
The objective function of (B.1) evaluated at $\beta_k^\star$ is

$$\min_{\theta_k}\; \|\mathbf{Y}\theta_k - \mathbf{X}\beta_k^\star\|_2^2 + \beta_k^{\star\top}\Omega_k\beta_k^\star = \min_{\theta_k}\; \theta_k^\top\mathbf{Y}^\top\big(\mathbf{I} - \mathbf{X}(\mathbf{X}^\top\mathbf{X} + \Omega_k)^{-1}\mathbf{X}^\top\big)\mathbf{Y}\theta_k$$
$$= \max_{\theta_k}\; \theta_k^\top\mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \Omega_k)^{-1}\mathbf{X}^\top\mathbf{Y}\theta_k \; . \quad \text{(B.4)}$$

If the penalty matrix is identical for all problems, $\Omega_k = \Omega$, then (B.4) corresponds to an eigen-problem where the score vectors $\theta_k$ are the eigenvectors of $\mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \Omega)^{-1}\mathbf{X}^\top\mathbf{Y}$.
B.1 How to Solve the Eigenvector Decomposition

Performing an eigen-decomposition of an expression like $\mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \Omega)^{-1}\mathbf{X}^\top\mathbf{Y}$ is not trivial because of the p × p inverse. For some datasets, p can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.
Let $\mathbf{M}$ denote the matrix $\mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \Omega)^{-1}\mathbf{X}^\top\mathbf{Y}$, so that expression (B.4) can be rewritten compactly as

$$\max_{\Theta \in \mathbb{R}^{K\times(K-1)}} \; \mathrm{tr}\left(\Theta^\top\mathbf{M}\Theta\right) \quad \text{s.t. } \Theta^\top\mathbf{Y}^\top\mathbf{Y}\Theta = \mathbf{I}_{K-1} \; . \quad \text{(B.5)}$$

If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the $(K-1)\times(K-1)$ matrix $\mathbf{M}_\Theta$ be $\Theta^\top\mathbf{M}\Theta$. The classical eigenvector formulation associated with (B.5) is then

$$\mathbf{M}_\Theta \mathbf{v} = \lambda\mathbf{v} \; , \quad \text{(B.6)}$$

where $\mathbf{v}$ is an eigenvector and $\lambda$ the associated eigenvalue of $\mathbf{M}_\Theta$. Operating,

$$\mathbf{v}^\top\mathbf{M}_\Theta\mathbf{v} = \lambda \;\Leftrightarrow\; \mathbf{v}^\top\Theta^\top\mathbf{M}\Theta\mathbf{v} = \lambda \; .$$

Making the change of variable $\mathbf{w} = \Theta\mathbf{v}$, we obtain an alternative eigen-problem where the $\mathbf{w}$ are the eigenvectors of $\mathbf{M}$, with $\lambda$ the associated eigenvalue:

$$\mathbf{w}^\top\mathbf{M}\mathbf{w} = \lambda \; . \quad \text{(B.7)}$$

Therefore, the $\mathbf{v}$ are the eigenvectors of the eigen-decomposition of $\mathbf{M}_\Theta$, and the $\mathbf{w}$ are the eigenvectors of the eigen-decomposition of $\mathbf{M}$. Note that the only difference between the $(K-1)\times(K-1)$ matrix $\mathbf{M}_\Theta$ and the $K\times K$ matrix $\mathbf{M}$ is the $K\times(K-1)$ matrix $\Theta$ in the expression $\mathbf{M}_\Theta = \Theta^\top\mathbf{M}\Theta$. Then, to avoid computing the $p\times p$ inverse $(\mathbf{X}^\top\mathbf{X} + \Omega)^{-1}$, we can plug the optimal value of the coefficient matrix $\mathbf{B} = (\mathbf{X}^\top\mathbf{X} + \Omega)^{-1}\mathbf{X}^\top\mathbf{Y}\Theta$ into $\mathbf{M}_\Theta$:

$$\mathbf{M}_\Theta = \Theta^\top\mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \Omega)^{-1}\mathbf{X}^\top\mathbf{Y}\Theta = \Theta^\top\mathbf{Y}^\top\mathbf{X}\mathbf{B} \; .$$

Thus, the eigen-decomposition of the $(K-1)\times(K-1)$ matrix $\mathbf{M}_\Theta = \Theta^\top\mathbf{Y}^\top\mathbf{X}\mathbf{B}$ yields the $\mathbf{v}$ eigenvectors of (B.6). To obtain the $\mathbf{w}$ eigenvectors of the alternative formulation (B.7), the change of variable $\mathbf{w} = \Theta\mathbf{v}$ must be undone.

To summarize, we compute the $\mathbf{v}$ eigenvectors from the eigen-decomposition of the tractable matrix $\mathbf{M}_\Theta$, evaluated as $\Theta^\top\mathbf{Y}^\top\mathbf{X}\mathbf{B}$; the definitive eigenvectors $\mathbf{w}$ are then recovered as $\mathbf{w} = \Theta\mathbf{v}$. The final step is the reconstruction of the optimal score matrix $\Theta^\star$ using the vectors $\mathbf{w}$ as its columns. At this point we understand what the literature calls "updating the initial score matrix": multiplying the initial $\Theta$ by the eigenvector matrix $\mathbf{V}$ from decomposition (B.6) reverses the change of variable to restore the $\mathbf{w}$ vectors. The $\mathbf{B}$ matrix also needs to be "updated", by multiplying it by the same eigenvector matrix $\mathbf{V}$, in order to account for the initial $\Theta$ used in the first computation of $\mathbf{B}$:

$$\mathbf{B}^\star = (\mathbf{X}^\top\mathbf{X} + \Omega)^{-1}\mathbf{X}^\top\mathbf{Y}\Theta\mathbf{V} = \mathbf{B}\mathbf{V} \; .$$
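The inverse-avoiding identity above is easy to check numerically on synthetic data: the small matrix $\mathbf{M}_\Theta$ assembled as $\Theta^\top\mathbf{Y}^\top\mathbf{X}\mathbf{B}$ coincides with $\Theta^\top\mathbf{M}\Theta$ computed from the full $\mathbf{M}$. The sizes and the ridge-like Ω below are arbitrary choices made for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 30, 6, 3
X = rng.standard_normal((n, p))
Y = np.eye(K)[rng.integers(0, K, size=n)]    # n x K indicator matrix
Omega = np.eye(p)                            # arbitrary penalty matrix
Theta = rng.standard_normal((K, K - 1))      # initial score matrix

# Optimal coefficients B = (X'X + Omega)^{-1} X'Y Theta, eq. (B.3)
B = np.linalg.solve(X.T @ X + Omega, X.T @ Y @ Theta)
# Tractable (K-1)x(K-1) matrix, assembled without forming M explicitly.
M_theta = Theta.T @ Y.T @ X @ B

# Direct computation through the full matrix M, for comparison only.
M = Y.T @ X @ np.linalg.solve(X.T @ X + Omega, X.T @ Y)
assert np.allclose(M_theta, Theta.T @ M @ Theta)
```

In a real implementation only the first computation is performed; the direct route through $\mathbf{M}$ is shown here solely to verify the identity.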
B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the optimal scoring literature, the score matrix $\Theta$ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix $\mathbf{M} = \mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \Omega)^{-1}\mathbf{X}^\top\mathbf{Y}$.
By definition of the eigen-decomposition, the eigenvectors of $\mathbf{M}$ (called $\mathbf{w}$ in (B.7)) form a basis, so that any score vector $\theta_k$ can be expressed as a linear combination of them:

$$\theta_k = \sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m \; , \quad \text{s.t. } \theta_k^\top\theta_k = 1 \; . \quad \text{(B.8)}$$

The orthonormality constraint $\theta_k^\top\theta_k = 1$ can also be expressed as a function of this basis,

$$\Big(\sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m\Big)^\top\Big(\sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m\Big) = 1 \; ,$$

which, by the properties of eigenvectors, reduces to

$$\sum_{m=1}^{K-1}\alpha_m^2 = 1 \; . \quad \text{(B.9)}$$

Let $\mathbf{M}$ be multiplied by a score vector $\theta_k$, replaced by its linear combination of eigenvectors (B.8):

$$\mathbf{M}\theta_k = \mathbf{M}\sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m = \sum_{m=1}^{K-1}\alpha_m\mathbf{M}\mathbf{w}_m \; .$$

As the $\mathbf{w}_m$ are the eigenvectors of $\mathbf{M}$, the relationship $\mathbf{M}\mathbf{w}_m = \lambda_m\mathbf{w}_m$ can be used to obtain

$$\mathbf{M}\theta_k = \sum_{m=1}^{K-1}\alpha_m\lambda_m\mathbf{w}_m \; .$$

Multiplying on the left by $\theta_k^\top$, written as its linear combination of eigenvectors,

$$\theta_k^\top\mathbf{M}\theta_k = \Big(\sum_{\ell=1}^{K-1}\alpha_\ell\mathbf{w}_\ell\Big)^\top\Big(\sum_{m=1}^{K-1}\alpha_m\lambda_m\mathbf{w}_m\Big) \; .$$

This equation can be simplified using the orthogonality of eigenvectors, according to which $\mathbf{w}_\ell^\top\mathbf{w}_m = 0$ for any $\ell \neq m$, giving

$$\theta_k^\top\mathbf{M}\theta_k = \sum_{m=1}^{K-1}\alpha_m^2\lambda_m \; .$$
The optimization problem (B.5) for discriminant direction k can be rewritten as

$$\max_{\theta_k\in\mathbb{R}^{K}} \; \theta_k^\top\mathbf{M}\theta_k = \max_{\theta_k\in\mathbb{R}^{K}} \; \sum_{m=1}^{K-1}\alpha_m^2\lambda_m \quad \text{(B.10)}$$
$$\text{with } \theta_k = \sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m \quad \text{and} \quad \sum_{m=1}^{K-1}\alpha_m^2 = 1 \; .$$

One way of maximizing Problem (B.10) is to choose $\alpha_m = 1$ for $m = k$ and $\alpha_m = 0$ otherwise; hence, as $\theta_k = \sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m$, the resulting score vector $\theta_k$ is equal to the k-th eigenvector $\mathbf{w}_k$.

In summary, it can be concluded that the solution to the original problem (B.1) can be obtained by an eigenvector decomposition of the matrix $\mathbf{M} = \mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \Omega)^{-1}\mathbf{X}^\top\mathbf{Y}$.
C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance, under a unitary constraint on the within-class variance:

$$\max_{\beta\in\mathbb{R}^p} \; \beta^\top\Sigma_B\beta \quad \text{(C.1a)}$$
$$\text{s.t. } \beta^\top\Sigma_W\beta = 1 \; , \quad \text{(C.1b)}$$

where $\Sigma_B$ and $\Sigma_W$ are, respectively, the between-class and the within-class variance of the original p-dimensional data.

The Lagrangian of Problem (C.1) is

$$L(\beta,\nu) = \beta^\top\Sigma_B\beta - \nu\big(\beta^\top\Sigma_W\beta - 1\big) \; ,$$

so that its first derivative with respect to $\beta$ is

$$\frac{\partial L(\beta,\nu)}{\partial\beta} = 2\Sigma_B\beta - 2\nu\Sigma_W\beta \; .$$

A necessary optimality condition for $\beta^\star$ is that this derivative is zero, that is,

$$\Sigma_B\beta^\star = \nu\Sigma_W\beta^\star \; .$$

Provided $\Sigma_W$ is full rank, we have

$$\Sigma_W^{-1}\Sigma_B\beta^\star = \nu\beta^\star \; . \quad \text{(C.2)}$$

Thus, the solutions $\beta^\star$ match the definition of an eigenvector of the matrix $\Sigma_W^{-1}\Sigma_B$, with eigenvalue $\nu$. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:

$$\beta^{\star\top}\Sigma_B\beta^\star = \beta^{\star\top}\Sigma_W\Sigma_W^{-1}\Sigma_B\beta^\star = \nu\,\beta^{\star\top}\Sigma_W\beta^\star \;\; \text{(from (C.2))} \; = \nu \;\; \text{(from (C.1b))} \; .$$

That is, the optimal value of the objective function to be maximized is the eigenvalue $\nu$. Hence, $\nu$ is the largest eigenvalue of $\Sigma_W^{-1}\Sigma_B$, and $\beta^\star$ is any eigenvector corresponding to this maximal eigenvalue.
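This derivation maps directly onto a generalized symmetric eigenproblem solver. A minimal numerical sketch with synthetic covariances follows; note that `scipy.linalg.eigh` normalizes the eigenvectors exactly as in constraint (C.1b).

```python
import numpy as np
from scipy.linalg import eigh

# Fisher's discriminant as a generalized eigenproblem (Appendix C):
# Sigma_B beta = nu Sigma_W beta; the optimum is the eigenvector of
# the largest eigenvalue nu, normalized so that beta' Sigma_W beta = 1.
rng = np.random.default_rng(1)
A = rng.standard_normal((10, 4))
Sigma_W = A.T @ A / 10 + 0.1 * np.eye(4)     # SPD within-class covariance
mu = rng.standard_normal((3, 4))             # three synthetic class means
xbar = mu.mean(axis=0)
Sigma_B = sum(np.outer(m - xbar, m - xbar) for m in mu) / 3

nu, betas = eigh(Sigma_B, Sigma_W)           # generalized symmetric solver
beta = betas[:, -1]                          # eigenvector of largest nu
assert np.isclose(beta @ Sigma_W @ beta, 1.0)    # constraint (C.1b) holds
assert np.isclose(beta @ Sigma_B @ beta, nu[-1]) # objective equals nu
```

The two assertions restate the two conclusions of the appendix: the within-class constraint is satisfied, and the attained objective value is the largest eigenvalue.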
D Alternative Variational Formulation for the Group-Lasso

In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:

$$\min_{\tau\in\mathbb{R}^p}\;\min_{\mathbf{B}\in\mathbb{R}^{p\times K-1}} \; J(\mathbf{B}) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j} \quad \text{(D.1a)}$$
$$\text{s.t. } \sum_{j=1}^{p}\tau_j = 1 \; , \quad \text{(D.1b)}$$
$$\tau_j \ge 0 \; , \;\; j = 1,\dots,p \; . \quad \text{(D.1c)}$$

Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let $\mathbf{B}\in\mathbb{R}^{p\times K-1}$ be a matrix composed of row vectors $\beta^j\in\mathbb{R}^{K-1}$: $\mathbf{B} = \big(\beta^{1\top},\dots,\beta^{p\top}\big)^\top$.
$$L(\mathbf{B},\tau,\lambda,\nu_0,\nu_j) = J(\mathbf{B}) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j} + \nu_0\Big(\sum_{j=1}^{p}\tau_j - 1\Big) - \sum_{j=1}^{p}\nu_j\tau_j \; . \quad \text{(D.2)}$$

The starting point is the Lagrangian (D.2), which is differentiated with respect to $\tau_j$ to get the optimal value $\tau_j^\star$:

$$\frac{\partial L(\mathbf{B},\tau,\lambda,\nu_0,\nu_j)}{\partial\tau_j}\bigg|_{\tau_j=\tau_j^\star} = 0 \;\Rightarrow\; -\lambda w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0$$
$$\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\,\tau_j^{\star 2} - \nu_j\,\tau_j^{\star 2} = 0$$
$$\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\,\tau_j^{\star 2} = 0 \; .$$

The last two expressions are related through a property of the Lagrange multipliers, which states that $\nu_j\, g_j(\tau^\star) = 0$, where $\nu_j$ is the Lagrange multiplier and $g_j(\tau)$ the inequality constraint. The optimal $\tau_j^\star$ can then be deduced:

$$\tau_j^\star = \sqrt{\frac{\lambda}{\nu_0}}\; w_j\|\beta^j\|_2 \; .$$

Plugging this optimal value of $\tau_j^\star$ into constraint (D.1b),

$$\sum_{j=1}^{p}\tau_j = 1 \;\Rightarrow\; \tau_j^\star = \frac{w_j\|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2} \; . \quad \text{(D.3)}$$
With this value of τj Problem (D1) is equivalent to
minBisinRptimesKminus1
J(B) + λ
psumj=1
wj∥∥βj∥∥
2
2
(D4)
This problem is a slight alteration of the standard group-Lasso as the penalty is squaredcompared to the usual form This square only affects the strength of the penalty and theusual properties of the group-Lasso apply to the solution of problem D4) In particularits solution is expected to be sparse with some null vectors βj
The penalty term of (D1a) can be conveniently presented as λBgtΩB where
Ω = diag
(w2
1
τ1w2
2
τ2
w2p
τp
) (D5)
Using the value of τj from (D3) each diagonal component of Ω is
(Ω)jj =wjsump
j=1wj∥∥βj∥∥
2∥∥βj∥∥2
(D6)
In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.
D.1 Useful Properties
Lemma D.1. If J is convex, Problem (D.1) is convex.
In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.
Lemma D.2. For all $\mathbf{B}\in\mathbb{R}^{p\times K-1}$, the subdifferential of the objective function of Problem (D.4) is

$$\bigg\{\mathbf{V}\in\mathbb{R}^{p\times K-1} : \mathbf{V} = \frac{\partial J(\mathbf{B})}{\partial\mathbf{B}} + 2\lambda\Big(\sum_{j=1}^{p} w_j\|\beta^j\|_2\Big)\mathbf{G}\bigg\} \; , \quad \text{(D.7)}$$

where $\mathbf{G}$ is a $p\times(K-1)$ matrix whose rows $\mathbf{g}_j$ are defined as follows. Let $\mathcal{S}(\mathbf{B})$ denote the row-wise support of $\mathbf{B}$, $\mathcal{S}(\mathbf{B}) = \{j\in\{1,\dots,p\} : \|\beta^j\|_2\neq 0\}$; then we have

$$\forall j\in\mathcal{S}(\mathbf{B}) \; , \quad \mathbf{g}_j = w_j\,\|\beta^j\|_2^{-1}\beta^j \; , \quad \text{(D.8)}$$
$$\forall j\notin\mathcal{S}(\mathbf{B}) \; , \quad \|\mathbf{g}_j\|_2 \le w_j \; . \quad \text{(D.9)}$$
This condition results in an equality for the "active" non-zero vectors $\beta^j$, and an inequality for the other ones; both provide essential building blocks of our algorithm.
Lemma D.3. Problem (D.4) admits at least one solution, which is unique if $J(\mathbf{B})$ is strictly convex. All critical points $\mathbf{B}^\star$ of the objective function verifying the following conditions are global minima. Let $\mathcal{S}(\mathbf{B}^\star)$ denote the row-wise support of $\mathbf{B}^\star$, $\mathcal{S}(\mathbf{B}^\star) = \{j\in\{1,\dots,p\} : \|\beta^{\star j}\|_2\neq 0\}$, and let $\bar{\mathcal{S}}(\mathbf{B}^\star)$ be its complement; then we have

$$\forall j\in\mathcal{S}(\mathbf{B}^\star) \; , \quad -\frac{\partial J(\mathbf{B}^\star)}{\partial\beta^j} = 2\lambda\Big(\sum_{j'=1}^{p} w_{j'}\|\beta^{\star j'}\|_2\Big)\, w_j\,\|\beta^{\star j}\|_2^{-1}\beta^{\star j} \; , \quad \text{(D.10a)}$$
$$\forall j\in\bar{\mathcal{S}}(\mathbf{B}^\star) \; , \quad \bigg\|\frac{\partial J(\mathbf{B}^\star)}{\partial\beta^j}\bigg\|_2 \le 2\lambda\, w_j\Big(\sum_{j'=1}^{p} w_{j'}\|\beta^{\star j'}\|_2\Big) \; . \quad \text{(D.10b)}$$
In particular, Lemma D.3 provides a well-defined characterization of the support of the solution, which is not easily obtained from the direct analysis of the variational problem (D.1).
D.2 An Upper Bound on the Objective Function
Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and, for a given $\mathbf{B}$, the gap between these objectives is null at $\tau^\star$ such that

$$\tau_j^\star = \frac{w_j\|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2} \; .$$
Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let $\boldsymbol{\tau} \in \mathbb{R}^p$ be any feasible vector; we have
\begin{align*}
\Big( \sum_{j=1}^{p} w_j \|\boldsymbol{\beta}_j\|_2 \Big)^2
&= \Bigg( \sum_{j=1}^{p} \tau_j^{1/2} \, \frac{w_j \|\boldsymbol{\beta}_j\|_2}{\tau_j^{1/2}} \Bigg)^2 \\
&\leq \Big( \sum_{j=1}^{p} \tau_j \Big) \Bigg( \sum_{j=1}^{p} \frac{w_j^2 \|\boldsymbol{\beta}_j\|_2^2}{\tau_j} \Bigg) \\
&\leq \sum_{j=1}^{p} \frac{w_j^2 \|\boldsymbol{\beta}_j\|_2^2}{\tau_j} \ ,
\end{align*}
where we used the Cauchy-Schwarz inequality in the second line and the definition of the feasibility set of $\boldsymbol{\tau}$ ($\sum_j \tau_j \leq 1$) in the last one.
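The bound and the zero-gap condition are easy to confirm numerically. The sketch below uses hypothetical data and weights, and assumes the feasibility constraint $\sum_j \tau_j \leq 1$ with $\tau_j \geq 0$; it evaluates both terms at the $\boldsymbol{\tau}$ given in Lemma D.4 and at a random feasible $\boldsymbol{\tau}$.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 6
B = rng.standard_normal((p, 3))            # coefficient matrix, groups indexed by j = 1..p
w = rng.uniform(0.5, 1.5, size=p)          # hypothetical group weights
norms = np.linalg.norm(B, axis=1)          # ||beta_j||_2 for each group

group_lasso_term = (w @ norms) ** 2        # (sum_j w_j ||beta_j||_2)^2

def variational_term(tau):
    # sum_j w_j^2 ||beta_j||_2^2 / tau_j
    return np.sum(w ** 2 * norms ** 2 / tau)

tau_star = w * norms / np.sum(w * norms)   # zero-gap tau from Lemma D.4
tau_rand = rng.uniform(0.1, 1.0, size=p)
tau_rand /= tau_rand.sum()                 # feasible: non-negative, sums to one

gap_at_star = variational_term(tau_star) - group_lasso_term
gap_at_rand = variational_term(tau_rand) - group_lasso_term
```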
This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of $\boldsymbol{\tau}$ and $\boldsymbol{\beta}$ are intertwined.
E Invariance of the Group-Lasso to Unitary Transformations
The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients $\mathbf{B}^0$ are optimal for the score values $\boldsymbol{\Theta}^0$, and if the optimal scores $\boldsymbol{\Theta}^\star$ are obtained by a unitary transformation of $\boldsymbol{\Theta}^0$, say $\boldsymbol{\Theta}^\star = \boldsymbol{\Theta}^0 \mathbf{V}$ (where $\mathbf{V} \in \mathbb{R}^{M \times M}$ is a unitary matrix), then $\mathbf{B}^\star = \mathbf{B}^0 \mathbf{V}$ is optimal conditionally on $\boldsymbol{\Theta}^\star$; that is, $(\boldsymbol{\Theta}^\star, \mathbf{B}^\star)$ is a global solution corresponding to the optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.
Proposition E.1. Let $\hat{\mathbf{B}}$ be a solution of
\[
\min_{\mathbf{B} \in \mathbb{R}^{p \times M}} \; \|\mathbf{Y} - \mathbf{X}\mathbf{B}\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\boldsymbol{\beta}^j\|_2 \ , \tag{E.1}
\]
and let $\tilde{\mathbf{Y}} = \mathbf{Y}\mathbf{V}$, where $\mathbf{V} \in \mathbb{R}^{M \times M}$ is a unitary matrix. Then $\tilde{\mathbf{B}} = \hat{\mathbf{B}}\mathbf{V}$ is a solution of
\[
\min_{\mathbf{B} \in \mathbb{R}^{p \times M}} \; \|\tilde{\mathbf{Y}} - \mathbf{X}\mathbf{B}\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\boldsymbol{\beta}^j\|_2 \ . \tag{E.2}
\]
Proof. The first-order necessary optimality conditions for $\hat{\mathbf{B}}$ are
\begin{align}
\forall j \in \mathcal{S}(\hat{\mathbf{B}}), \quad & 2 \, {\mathbf{x}^j}^\top \big( \mathbf{X}\hat{\mathbf{B}} - \mathbf{Y} \big) + \lambda w_j \|\hat{\boldsymbol{\beta}}^j\|_2^{-1} \hat{\boldsymbol{\beta}}^j = \mathbf{0} \ , \tag{E.3a}\\
\forall j \in \bar{\mathcal{S}}(\hat{\mathbf{B}}), \quad & 2 \, \big\| {\mathbf{x}^j}^\top \big( \mathbf{X}\hat{\mathbf{B}} - \mathbf{Y} \big) \big\|_2 \leq \lambda w_j \ , \tag{E.3b}
\end{align}
where $\mathcal{S}(\hat{\mathbf{B}}) \subseteq \{1, \ldots, p\}$ denotes the set of non-zero row vectors of $\hat{\mathbf{B}}$ and $\bar{\mathcal{S}}(\hat{\mathbf{B}})$ is its complement.

First, we note that, from the definition of $\tilde{\mathbf{B}}$, we have $\mathcal{S}(\tilde{\mathbf{B}}) = \mathcal{S}(\hat{\mathbf{B}})$. Then, we may rewrite the above conditions as follows:
\begin{align}
\forall j \in \mathcal{S}(\tilde{\mathbf{B}}), \quad & 2 \, {\mathbf{x}^j}^\top \big( \mathbf{X}\tilde{\mathbf{B}} - \tilde{\mathbf{Y}} \big) + \lambda w_j \|\tilde{\boldsymbol{\beta}}^j\|_2^{-1} \tilde{\boldsymbol{\beta}}^j = \mathbf{0} \ , \tag{E.4a}\\
\forall j \in \bar{\mathcal{S}}(\tilde{\mathbf{B}}), \quad & 2 \, \big\| {\mathbf{x}^j}^\top \big( \mathbf{X}\tilde{\mathbf{B}} - \tilde{\mathbf{Y}} \big) \big\|_2 \leq \lambda w_j \ , \tag{E.4b}
\end{align}
where (E.4a) is obtained by multiplying both sides of Equation (E.3a) on the right by $\mathbf{V}$, using that $\mathbf{V}\mathbf{V}^\top = \mathbf{I}$, so that, for all $\mathbf{u} \in \mathbb{R}^M$, $\|\mathbf{u}^\top\|_2 = \|\mathbf{u}^\top \mathbf{V}\|_2$; Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary optimality conditions for $\tilde{\mathbf{B}}$ to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
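The key invariances used here, that the Frobenius norm and the row norms $\|\boldsymbol{\beta}^j\|_2$ are unchanged by right-multiplication with a unitary $\mathbf{V}$, imply that the objectives of (E.1) and (E.2) coincide at $\mathbf{B}$ and $\mathbf{B}\mathbf{V}$ for any $\mathbf{B}$, not just at the optimum. A quick numerical sketch with hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, M = 15, 6, 3
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, M))
B = rng.standard_normal((p, M))
w = rng.uniform(0.5, 1.5, size=p)
lam = 0.7

# a random unitary (orthogonal) M x M matrix, from a QR decomposition
V, _ = np.linalg.qr(rng.standard_normal((M, M)))

def objective(Y, B):
    # ||Y - X B||_F^2 + lambda * sum_j w_j ||beta^j||_2, rows beta^j of B
    return np.sum((Y - X @ B) ** 2) + lam * np.sum(w * np.linalg.norm(B, axis=1))

# objective of (E.2) at BV equals objective of (E.1) at B
diff = abs(objective(Y, B) - objective(Y @ V, B @ V))
```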
F Expected Complete Likelihood and Likelihood
Section 7.1.2 explains that, when the conditional expectation of the complete log-likelihood $Q(\theta, \theta')$ (7.7) is maximized by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed from its definition (7.1), but there is a shorter way to compute it from $Q(\theta, \theta')$ when the latter is available.
\begin{align}
L(\theta) &= \sum_{i=1}^{n} \log \Bigg( \sum_{k=1}^{K} \pi_k f_k(\mathbf{x}_i; \theta_k) \Bigg) \ , \tag{F.1}\\
Q(\theta, \theta') &= \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}(\theta') \log\big( \pi_k f_k(\mathbf{x}_i; \theta_k) \big) \ , \tag{F.2}\\
\text{with} \quad t_{ik}(\theta') &= \frac{\pi'_k f_k(\mathbf{x}_i; \theta'_k)}{\sum_{\ell} \pi'_\ell f_\ell(\mathbf{x}_i; \theta'_\ell)} \ . \tag{F.3}
\end{align}
In the EM algorithm, $\theta'$ denotes the model parameters at the previous iteration, $t_{ik}(\theta')$ are the posterior probabilities computed from $\theta'$ at the previous E-step, and $\theta$, without "prime", denotes the parameters of the current iteration, to be obtained by maximizing $Q(\theta, \theta')$.
Using (F.3), we have
\begin{align*}
Q(\theta, \theta') &= \sum_{i,k} t_{ik}(\theta') \log\big( \pi_k f_k(\mathbf{x}_i; \theta_k) \big) \\
&= \sum_{i,k} t_{ik}(\theta') \log\big( t_{ik}(\theta) \big) + \sum_{i,k} t_{ik}(\theta') \log\Big( \sum_{\ell} \pi_\ell f_\ell(\mathbf{x}_i; \theta_\ell) \Big) \\
&= \sum_{i,k} t_{ik}(\theta') \log\big( t_{ik}(\theta) \big) + L(\theta) \ ,
\end{align*}
where the last line uses $\sum_k t_{ik}(\theta') = 1$.
In particular, after the evaluation of the $t_{ik}$ in the E-step, where $\theta = \theta'$, the log-likelihood can be computed from the value of $Q(\theta, \theta)$ (7.7) and the entropy of the posterior probabilities:
\begin{align*}
L(\theta) &= Q(\theta, \theta) - \sum_{i,k} t_{ik}(\theta) \log\big( t_{ik}(\theta) \big) \\
&= Q(\theta, \theta) + H(\mathbf{T}) \ .
\end{align*}
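The identity $L(\theta) = Q(\theta, \theta) + H(\mathbf{T})$ is easy to verify on a toy Gaussian mixture. The sketch below uses hypothetical parameters and spherical unit-variance components for brevity; it computes the log-likelihood both directly from (F.1) and through $Q(\theta, \theta)$ plus the posterior entropy.

```python
import numpy as np

rng = np.random.default_rng(3)
n, K, d = 50, 3, 2
X = rng.standard_normal((n, d))
pi = np.array([0.2, 0.5, 0.3])          # mixture proportions
mu = rng.standard_normal((K, d))        # component means

def gauss_pdf(X, m):
    # density of N(m, I) evaluated at the rows of X
    d = X.shape[1]
    return np.exp(-0.5 * np.sum((X - m) ** 2, axis=1)) / (2 * np.pi) ** (d / 2)

F = np.column_stack([gauss_pdf(X, mu[k]) for k in range(K)])  # f_k(x_i)

L_direct = np.sum(np.log(F @ pi))       # log-likelihood, (F.1)

T = (F * pi) / (F @ pi)[:, None]        # E-step posteriors t_ik, (F.3)
Q = np.sum(T * np.log(F * pi))          # Q(theta, theta), (F.2)
H = -np.sum(T * np.log(T))              # entropy of the posteriors
L_from_Q = Q + H
```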
G Derivation of the M-Step Equations
This appendix shows the whole process of obtaining expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as
\begin{align*}
Q(\theta, \theta') &= \sum_{i,k} t_{ik}(\theta') \log\big( \pi_k f_k(\mathbf{x}_i; \theta_k) \big) \\
&= \sum_{k} \Big( \sum_{i} t_{ik} \Big) \log \pi_k - \frac{np}{2} \log(2\pi) - \frac{n}{2} \log|\boldsymbol{\Sigma}| - \frac{1}{2} \sum_{i,k} t_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_k) \ ,
\end{align*}
which has to be maximized with respect to $\theta$, subject to $\sum_k \pi_k = 1$.
The Lagrangian of this problem is
\[
\mathcal{L}(\theta) = Q(\theta, \theta') + \lambda \Big( \sum_k \pi_k - 1 \Big) \ .
\]
Partial derivatives of the Lagrangian are set to zero to obtain the optimal values of $\pi_k$, $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}$.
G.1 Prior probabilities

\[
\frac{\partial \mathcal{L}(\theta)}{\partial \pi_k} = 0
\;\Leftrightarrow\;
\frac{1}{\pi_k} \sum_i t_{ik} + \lambda = 0 \ ,
\]
where $\lambda$ is identified from the constraint (summing over $k$ gives $\lambda = -n$), leading to
\[
\pi_k = \frac{1}{n} \sum_i t_{ik} \ .
\]
G.2 Means

\[
\frac{\partial \mathcal{L}(\theta)}{\partial \boldsymbol{\mu}_k} = 0
\;\Leftrightarrow\;
-\frac{1}{2} \sum_{i} t_{ik} \, 2 \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_k - \mathbf{x}_i) = 0
\;\Rightarrow\;
\boldsymbol{\mu}_k = \frac{\sum_i t_{ik} \mathbf{x}_i}{\sum_i t_{ik}} \ .
\]
G.3 Covariance Matrix

\[
\frac{\partial \mathcal{L}(\theta)}{\partial \boldsymbol{\Sigma}^{-1}} = 0
\;\Leftrightarrow\;
\underbrace{\frac{n}{2} \boldsymbol{\Sigma}}_{\text{as per property 4}}
- \underbrace{\frac{1}{2} \sum_{i,k} t_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top}_{\text{as per property 5}}
= 0
\;\Rightarrow\;
\boldsymbol{\Sigma} = \frac{1}{n} \sum_{i,k} t_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top \ .
\]
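The three closed-form updates translate directly into code. Below is a minimal numpy sketch of such an M-step for a common-covariance Gaussian mixture, with hypothetical data and responsibilities; this is an illustration, not the Mix-GLOSS implementation.

```python
import numpy as np

def m_step(X, T):
    """M-step for a Gaussian mixture with a common covariance matrix.

    X is the (n, p) data matrix, T the (n, K) matrix of posteriors t_ik.
    Returns proportions pi_k, means mu_k and the shared covariance Sigma.
    """
    n, p = X.shape
    nk = T.sum(axis=0)                  # sum_i t_ik for each cluster
    pi = nk / n                         # pi_k = (1/n) sum_i t_ik
    mu = (T.T @ X) / nk[:, None]        # mu_k = sum_i t_ik x_i / sum_i t_ik
    Sigma = np.zeros((p, p))
    for k in range(T.shape[1]):
        R = X - mu[k]                   # data centered on cluster k
        Sigma += (T[:, k][:, None] * R).T @ R
    Sigma /= n                          # (1/n) sum_ik t_ik (x_i - mu_k)(x_i - mu_k)^T
    return pi, mu, Sigma

# toy usage with random responsibilities
rng = np.random.default_rng(4)
X = rng.standard_normal((30, 2))
T = rng.dirichlet(np.ones(3), size=30)  # each row of T sums to one
pi, mu, Sigma = m_step(X, T)
```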
Contents

4.1.3 Penalized Linear Discriminant Analysis 39
4.1.4 Summary 40
4.2 Practicalities 41
4.2.1 Solution of the Penalized Optimal Scoring Regression 41
4.2.2 Distance Evaluation 42
4.2.3 Posterior Probability Evaluation 43
4.2.4 Graphical Representation 43
4.3 From Sparse Optimal Scoring to Sparse LDA 43
4.3.1 A Quadratic Variational Form 44
4.3.2 Group-Lasso OS as Penalized LDA 47
5 GLOSS Algorithm 49
5.1 Regression Coefficients Updates 49
5.1.1 Cholesky decomposition 52
5.1.2 Numerical Stability 52
5.2 Score Matrix 52
5.3 Optimality Conditions 53
5.4 Active and Inactive Sets 54
5.5 Penalty Parameter 54
5.6 Options and Variants 55
5.6.1 Scaling Variables 55
5.6.2 Sparse Variant 55
5.6.3 Diagonal Variant 55
5.6.4 Elastic net and Structured Variant 55
6 Experimental Results 57
6.1 Normalization 57
6.2 Decision Thresholds 57
6.3 Simulated Data 58
6.4 Gene Expression Data 60
6.5 Correlated Data 63
Discussion 63
III Sparse Clustering Analysis 67
Abstract 69
7 Feature Selection in Mixture Models 71
7.1 Mixture Models 71
7.1.1 Model 71
7.1.2 Parameter Estimation: The EM Algorithm 72
7.2 Feature Selection in Model-Based Clustering 75
7.2.1 Based on Penalized Likelihood 76
7.2.2 Based on Model Variants 77
7.2.3 Based on Model Selection 79
8 Theoretical Foundations 81
8.1 Resolving EM with Optimal Scoring 81
8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis 81
8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis 82
8.1.3 Clustering Using Penalized Optimal Scoring 82
8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis 83
8.2 Optimized Criterion 83
8.2.1 A Bayesian Derivation 84
8.2.2 Maximum a Posteriori Estimator 85
9 Mix-GLOSS Algorithm 87
9.1 Mix-GLOSS 87
9.1.1 Outer Loop: Whole Algorithm Repetitions 87
9.1.2 Penalty Parameter Loop 88
9.1.3 Inner Loop: EM Algorithm 89
9.2 Model Selection 91
10 Experimental Results 93
10.1 Tested Clustering Algorithms 93
10.2 Results 95
10.3 Discussion 97
Conclusions 97
Appendix 103
A Matrix Properties 105
B The Penalized-OS Problem is an Eigenvector Problem 107
B.1 How to Solve the Eigenvector Decomposition 107
B.2 Why the OS Problem is Solved as an Eigenvector Problem 109
C Solving Fisher's Discriminant Problem 111
D Alternative Variational Formulation for the Group-Lasso 113
D.1 Useful Properties 114
D.2 An Upper Bound on the Objective Function 115
E Invariance of the Group-Lasso to Unitary Transformations 117
F Expected Complete Likelihood and Likelihood 119
G Derivation of the M-Step Equations 121
G.1 Prior probabilities 121
G.2 Means 122
G.3 Covariance Matrix 122
Bibliography 123
List of Figures

1.1 MASH project logo 5
2.1 Example of relevant features 10
2.2 Four key steps of feature selection 11
2.3 Admissible sets in two dimensions for different pure norms $\|\boldsymbol{\beta}\|_p$ 14
2.4 Two dimensional regularized problems with $\|\boldsymbol{\beta}\|_1$ and $\|\boldsymbol{\beta}\|_2$ penalties 15
2.5 Admissible sets for the Lasso and Group-Lasso 20
2.6 Sparsity patterns for an example with 8 variables characterized by 4 parameters 20
4.1 Graphical representation of the variational approach to Group-Lasso 45
5.1 GLOSS block diagram 50
5.2 Graph and Laplacian matrix for a 3 × 3 image 56
6.1 TPR versus FPR for all simulations 60
6.2 2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA 62
6.3 USPS digits "1" and "0" 63
6.4 Discriminant direction between digits "1" and "0" 64
6.5 Sparse discriminant direction between digits "1" and "0" 64
9.1 Mix-GLOSS Loops Scheme 88
9.2 Mix-GLOSS model selection diagram 92
10.1 Class mean vectors for each artificial simulation 94
10.2 TPR versus FPR for all simulations 97
List of Tables

6.1 Experimental results for simulated data, supervised classification 59
6.2 Average TPR and FPR for all simulations 60
6.3 Experimental results for gene expression data, supervised classification 61
10.1 Experimental results for simulated data, unsupervised clustering 96
10.2 Average TPR versus FPR for all clustering simulations 96
Notation and Symbols

Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors, and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.

Sets

N : the set of natural numbers, N = {1, 2, ...}
R : the set of reals
|A| : cardinality of a set A (for finite sets, the number of elements)
Ā : complement of set A

Data

X : input domain
x_i : input sample, x_i ∈ X
X : design matrix, X = (x_1^T, ..., x_n^T)^T
x^j : column j of X
y_i : class indicator of sample i
Y : indicator matrix, Y = (y_1^T, ..., y_n^T)^T
z : complete data, z = (x, y)
G_k : set of the indices of observations belonging to class k
n : number of examples
K : number of classes
p : dimension of X
i, j, k : indices running over N

Vectors, Matrices and Norms

0 : vector with all entries equal to zero
1 : vector with all entries equal to one
I : identity matrix
A^T : transpose of matrix A (ditto for vectors)
A^{-1} : inverse of matrix A
tr(A) : trace of matrix A
|A| : determinant of matrix A
diag(v) : diagonal matrix with v on the diagonal
‖v‖_1 : L1 norm of vector v
‖v‖_2 : L2 norm of vector v
‖A‖_F : Frobenius norm of matrix A

Probability

E[·] : expectation of a random variable
var[·] : variance of a random variable
N(μ, σ²) : normal distribution with mean μ and variance σ²
W(W, ν) : Wishart distribution with ν degrees of freedom and scale matrix W
H(X) : entropy of random variable X
I(X; Y) : mutual information between random variables X and Y

Mixture Models

y_ik : hard membership of sample i to cluster k
f_k : distribution function for cluster k
t_ik : posterior probability of sample i to belong to cluster k
T : posterior probability matrix
π_k : prior probability or mixture proportion for cluster k
μ_k : mean vector of cluster k
Σ_k : covariance matrix of cluster k
θ_k : parameter vector for cluster k, θ_k = (μ_k, Σ_k)
θ^(t) : parameter vector at iteration t of the EM algorithm
f(X; θ) : likelihood function
L(θ; X) : log-likelihood function
L_C(θ; X, Y) : complete log-likelihood function

Optimization

J(·) : cost function
L(·) : Lagrangian
β̂ : generic notation for the solution with respect to β
β̂_ls : least squares solution coefficient vector
A : active set
γ : step size to update the regularization path
h : direction to update the regularization path

Penalized models

λ, λ_1, λ_2 : penalty parameters
P_λ(θ) : penalty term over a generic parameter vector
β_kj : coefficient j of discriminant vector k
β_k : kth discriminant vector, β_k = (β_k1, ..., β_kp)^T
B : matrix of discriminant vectors, B = (β_1, ..., β_{K-1})
β^j : jth row of B = (β^{1T}, ..., β^{pT})^T
B_LDA : coefficient matrix in the LDA domain
B_CCA : coefficient matrix in the CCA domain
B_OS : coefficient matrix in the OS domain
X_LDA : data matrix in the LDA domain
X_CCA : data matrix in the CCA domain
X_OS : data matrix in the OS domain
θ_k : score vector k
Θ : score matrix, Θ = (θ_1, ..., θ_{K-1})
Y : label matrix
Ω : penalty matrix
L_CP(θ; X, Z) : penalized complete log-likelihood function
Σ_B : between-class covariance matrix
Σ_W : within-class covariance matrix
Σ_T : total covariance matrix
Σ̂_B : sample between-class covariance matrix
Σ̂_W : sample within-class covariance matrix
Σ̂_T : sample total covariance matrix
Λ : inverse of covariance matrix, or precision matrix
w_j : weights
τ_j : penalty components of the variational approach
Part I
Context and Foundations
This thesis is divided in three parts In Part I I am introducing the context in whichthis work has been developed the project that funded it and the constraints that we hadto obey Generic are also detailed here to introduce the models and some basic conceptsthat will be used along this document The state of the art of is also reviewed
The first contribution of this thesis is explained in Part II where I present the super-vised learning algorithm GLOSS and its supporting theory as well as some experimentsto test its performance compared to other state of the art mechanisms Before describingthe algorithm and the experiments its theoretical foundations are provided
The second contribution is described in Part III, with a structure analogous to that of Part II, but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM algorithm. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.
1 Context
The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.
The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC), attached to the University of Technology of Compiegne.
From the research point of view, the members of the consortium must deal with four main goals:
1. Software development of the website framework and APIs.
2. Classification and goal-planning in high dimensional feature spaces.
3. Interfacing the platform with the 3D virtual environment and the robot arm.
4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments.
Figure 1.1: MASH project logo
The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role has been to provide the users with some feedback regarding the feature extractors. At the time of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community; the last number I was aware of was about 300. Among those 375 extractors, some must share the same theoretical principles or supply similar features. The framework of the project tests every new piece of code on some reference datasets in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors on a particular dataset does not mean that both are using the same variables.
Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to others. Our hypothesis is that many of them rely on the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we will also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user who develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of against the whole database. As another example, imagine a user wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.
As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.
• Clustering Using Mixture Models. This well-known technique models the data as if it was randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm, and the clusters are built by maximum a posteriori estimation. For the computation we use mixmod, a C++ library that can be interfaced with MATLAB and that allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D7.1-m12" (Govaert et al., 2010).
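As an illustration, the pipeline described above (fit a Gaussian mixture by EM, then assign clusters by maximum a posteriori) can be sketched in a few lines. This is only a sketch: it uses scikit-learn's GaussianMixture on synthetic data as a stand-in for the mixmod library actually used in the project.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic groups of "extractors", each described by a
# 5-dimensional feature vector (stand-in for real extractor data)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 5)),
               rng.normal(4.0, 1.0, size=(50, 5))])

# Fit a two-component Gaussian mixture by EM (unknown proportions,
# means and full covariance matrices)
gmm = GaussianMixture(n_components=2, covariance_type="full",
                      random_state=0).fit(X)

# Clusters are built by maximum a posteriori assignment
labels = gmm.predict(X)
```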
• Sparse Clustering Using Penalized Optimal Scoring. This technique again intends to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized optimal scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis.
All details concerning the tool implemented can be found in deliverable "mash-deliverable-D7.2-m24" (Govaert et al., 2011).
• Table Clustering Using The RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractor space is defined using the RV coefficient, a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(Oi, Oj), where Oi and Oj are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors. A detailed description of this technique can be found in deliverables "mash-deliverable-D7.1-m12" (Govaert et al., 2010) and "mash-deliverable-D7.2-m24" (Govaert et al., 2011).
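The RV coefficient itself is simple to compute. The sketch below, on synthetic tables, uses the standard definition based on the centered configuration operators O = XX⊤; the exact operators used in the project deliverables may differ.

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two data tables sharing the same rows,
    computed from the centered configuration operators O = X X'."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Sx, Sy = X @ X.T, Y @ Y.T   # n x n configuration operators
    return np.trace(Sx @ Sy) / np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 6))   # table returned by extractor i
B = 2.0 * A                    # same configuration, different scale
C = rng.normal(size=(20, 4))   # unrelated table, different width

# rv_coefficient(A, B) equals 1; rv_coefficient(A, C) is much smaller
```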
I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to fulfil our engagements. I simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).
2 Regularization for Feature Selection
With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, so did the numerical issues. Redundancy or extremely correlated features may appear if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined; many algorithms in the field of machine learning make use of this statistic.
2.1 Motivations
There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than a "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).
As a rule of thumb, in discriminant and clustering problems the complexity of the computations increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in the supervised environment, and easier interpretations in the unsupervised framework. Removing features must be done wisely to avoid discarding critical information.
When talking about dimensionality reduction, there are two families of techniques that could induce confusion:
• Reduction by feature transformation summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis and Independent Component Analysis are two popular examples.
• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem arises when there is a restriction on the number of variables to preserve and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper, because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008)
As a basic rule, we can use reduction techniques by feature transformation when the majority of the features are relevant and there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. The paper of Chidlovskii and Lecerf (2008) gives a great explanation of the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text:
"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if dogs, cats and squirrels all have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.
Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats, but not distinguishing between dogs and cats. Thus the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."

Figure 2.2: The four key steps of feature selection, according to Liu and Yu (2005)
There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.
2.2 Categorization of Feature Selection Techniques
Feature selection is one of the most frequent techniques for preprocessing data, in order to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there; thus the relevance of the remaining subset of features must be measured.
I reproduce here the scheme that generalizes any feature selection process, as shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps of a feature selection algorithm.
The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews for characterizing feature selection techniques. I propose a framework inspired by these references that does not cover all the possibilities, but which gives a good summary of the existing ones:
• Depending on the type of integration with the machine learning algorithm, we have:
– Filter Models - The filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance from the mining algorithm.
– Wrapper Models - The wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block, while the feature subset evaluation is done in a different one; therefore, the criterion to optimize and the criterion to evaluate may differ. Those algorithms are computationally expensive.
– Embedded Models - They perform variable selection inside the learning machine, the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation form a single block, and the features are selected to optimize this unique criterion, without needing re-evaluation in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal, because they are specific to the training process of a given mining algorithm.
• Depending on the feature searching technique:

– Complete - No subsets are missed from evaluation. Involves combinatorial searches.

– Sequential - Features are added (forward searches) or removed (backward searches) one at a time.

– Random - The initial subset, or even subsequent subsets, are randomly chosen to escape local optima.

• Depending on the evaluation technique:

– Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.

– Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

– Dependency Measures - Measuring the correlation between features.

– Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.

– Predictive Accuracy - Use the selected features to predict the labels.

– Cluster Goodness - Use the selected features to perform clustering and evaluate the result (cluster compactness, scatter separability, maximum likelihood).
The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness make it possible to evaluate subsets of features, and can be used in wrapper and embedded models.
In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieving this goal, thus avoiding many problems arising in filter or wrapper methods. In practice, however, it is intractable to solve hard selection problems exactly when the number of features exceeds a few tens. Regularization techniques make it possible to provide a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.
2.3 Regularization
In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues in ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.
An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. An example is trying to infer some generic laws from a sample whose size is smaller than its dimensionality. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:
\min_\beta \; J(\beta) + \lambda P(\beta) \qquad (2.1)

\min_\beta \; J(\beta) \quad \text{s.t.} \quad P(\beta) \le t \qquad (2.2)
In expressions (2.1) and (2.2), the parameters λ and t play a similar role, that is, to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set for which the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. The penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted, in the Bayesian paradigm, as prior distributions on the parameters of the model. In this thesis both views will be taken.
In this section I review pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.
Figure 2.3: Admissible sets in two dimensions for different pure norms ‖β‖p
2.3.1 Important Properties
Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.
Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

\forall (x_1, x_2) \in \mathcal{X}^2, \quad f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2) \qquad (2.3)

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.
Sparsity. Null coefficients usually furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and fewer computational resources.
Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes, such as adding, removing or replacing a few elements in the training set. Regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.
2.3.2 Pure Penalties
For pure penalties, defined as P(β) = ‖β‖p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and of the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since the convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Figure 2.4: Two dimensional regularized problems with ‖β‖1 and ‖β‖2 penalties
Regularizing a linear model with a norm like ‖β‖p means that the larger the component |βj|, the more important the feature xj in the estimation. On the contrary, the closer it is to zero, the more dispensable it is. In the limit, if |βj| = 0, xj is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.
A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered sparse if one of its components (β1 or β2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2), where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function while remaining inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, like the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than that of an L2 penalty. This idea is displayed in Figure 2.4, where J(β) is a quadratic function represented by three isolevel curves, whose global minimum βls lies outside the penalties' admissible regions. The closest point to this βls for the L1 regularization is βℓ1, and for the L2 regularization it is βℓ2. The solution βℓ1 is sparse because its second component is zero, while both components of βℓ2 are different from zero.
After reviewing the regions in Figure 2.3, we can relate the capacity of generating sparse solutions to the number and the "sharpness" of the vertexes of the greyed out area. For example, an L1/3 penalty has an admissible region with sharper vertexes that would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L1/3 ball results in difficulties during optimization that do not happen with a convex shape.
To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with Lp norms with p ≤ 1, due to the fact that they are the only ones that have vertexes. On the other hand, only norms with p ≥ 1 are convex; hence the only pure penalty that yields a convex problem with a sparse solution is the L1 penalty.
L0 Penalties. The L0 pseudo-norm of a vector β is defined as its number of non-zero entries, that is, P(β) = ‖β‖0 = card{βj : βj ≠ 0}:

\min_\beta \; J(\beta) \quad \text{s.t.} \quad \|\beta\|_0 \le t \qquad (2.4)

where the parameter t represents the maximum number of non-zero coefficients in the vector β. The larger the value of t (or the lower the value of λ, if we use the equivalent expression (2.1)), the fewer the number of zeros induced in β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes; their solutions are sparse but unstable.
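The combinatorial nature of the L0-constrained problem can be made concrete with a brute-force best-subset search, which is feasible only for a small number of features. This is an illustrative sketch on synthetic data, not an algorithm used in this thesis.

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, t):
    """Exact L0-constrained least squares: evaluate every subset of at
    most t features (combinatorial, only feasible for small p)."""
    n, p = X.shape
    best_rss, best_beta = np.inf, np.zeros(p)
    for k in range(1, t + 1):
        for subset in combinations(range(p), k):
            cols = list(subset)
            beta_s, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            residual = y - X[:, cols] @ beta_s
            rss = residual @ residual
            if rss < best_rss:
                best_rss = rss
                best_beta = np.zeros(p)
                best_beta[cols] = beta_s
    return best_beta

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 8))
beta_true = np.zeros(8)
beta_true[[1, 4]] = [3.0, -2.0]
y = X @ beta_true + 0.01 * rng.normal(size=40)

beta_hat = best_subset(X, y, t=2)   # recovers the two informative features
```

With p = 8 and t = 2 there are only 36 subsets to test, but the count grows as binomial coefficients in p, which is why exact L0 optimization is intractable beyond small problems.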
L1 Penalties. The penalties built using the L1 norm induce sparsity and stability. The resulting estimator has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

\min_\beta \; J(\beta) \quad \text{s.t.} \quad \sum_{j=1}^p |\beta_j| \le t \qquad (2.5)
Despite all the advantages of the Lasso, the choice of the right penalty is not just a question of convexity and sparsity. Concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one of them, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.
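This saturation effect is easy to observe numerically. The sketch below fits a Lasso with the LARS algorithm (scikit-learn's LassoLars, chosen here purely for illustration) on a problem with p = 100 variables and only n = 15 samples; whatever the (small) penalty, the active set cannot exceed n variables.

```python
import numpy as np
from sklearn.linear_model import LassoLars

rng = np.random.default_rng(3)
n, p = 15, 100                 # fewer samples than variables
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Fit the Lasso by LARS with a small penalty: the solution is still
# sparse, with at most n non-zero coefficients out of p = 100
lasso = LassoLars(alpha=0.01).fit(X, y)
n_selected = np.count_nonzero(lasso.coef_)
```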
Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012b,a).
The consistency of the problems regularized by a Lasso penalty is also a key feature, defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large. Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions under which Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Buhlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).
L2 Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used, to avoid the square root and to solve a linear system. Thus, an L2 penalized optimization problem looks like:

\min_\beta \; J(\beta) + \lambda \|\beta\|_2^2 \qquad (2.6)
The effect of this penalty is the "equalization" of the components of the penalized parameter vector. To illustrate this property, let us consider a least squares problem:

\min_\beta \; \sum_{i=1}^n (y_i - \mathbf{x}_i^\top \beta)^2 \qquad (2.7)
with solution βls = (X⊤X)−1X⊤y. If some input variables are highly correlated, the estimator βls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

\min_\beta \; \sum_{i=1}^n (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2
The solution to this problem is βℓ2 = (X⊤X + λIp)−1X⊤y. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performance.
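A minimal sketch of the closed-form ridge estimator, on synthetic data with a nearly duplicated column, makes the eigenvalue shift explicit.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimator (X'X + lam * I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))
# Append a nearly duplicated column: X'X becomes almost singular
X = np.hstack([X, X[:, :1] + 1e-6 * rng.normal(size=(30, 1))])
y = rng.normal(size=30)

beta_l2 = ridge(X, y, lam=1.0)

# Every eigenvalue of X'X is shifted upwards by exactly lambda
eig_raw = np.linalg.eigvalsh(X.T @ X)
eig_reg = np.linalg.eigvalsh(X.T @ X + 1.0 * np.eye(6))
```

The shrinkage also bounds the norm of the estimate: the ridge solution never has a larger norm than the unpenalized least squares fit, which explodes along the nearly singular direction.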
As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, which looks like a ridge regression where each variable is penalized adaptively; the least squares solution is used to define the penalty parameter attached to each coefficient:

\min_\beta \; \sum_{i=1}^n (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^p \frac{\beta_j^2}{(\beta_j^{\mathrm{ls}})^2} \qquad (2.8)
The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is the adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002), where the penalty parameter differs for each component: every λj is optimized to penalize more or less, depending on the influence of βj in the model.
Although L2 penalized problems are stable, they are not sparse. That makes those models harder to interpret, particularly in high dimensions.
L∞ Penalties. A special case of Lp norms is the infinity norm, defined as ‖x‖∞ = max(|x1|, |x2|, ..., |xp|). The admissible region for a penalty like ‖β‖∞ ≤ t is displayed in Figure 2.3: it is a square containing all the β vectors whose largest coefficient is less than or equal to the value of the penalty parameter t.
This norm is not commonly used as a regularization term by itself; however, it frequently appears in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing optimality conditions of sparse regularized problems. The dual norm ‖β‖∗ of a norm ‖β‖ is defined as:

\|\beta\|_* = \max_{\mathbf{w} \in \mathbb{R}^p} \beta^\top \mathbf{w} \quad \text{s.t.} \quad \|\mathbf{w}\| \le 1

In the case of an Lq norm with q ∈ [1, +∞], the dual norm is the Lr norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual, and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why the L∞ norm is so important, even if it is not as popular a penalty as the L1 norm. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).
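The duality between L1 and L∞ can be checked numerically: since the L1 unit ball is a polytope whose vertices are the signed canonical basis vectors, the maximum of the linear function β⊤w over that ball is attained at a vertex, and it equals ‖β‖∞. A small sketch:

```python
import numpy as np

beta = np.array([0.5, -3.0, 1.2])
p = len(beta)

# The L1 unit ball is a polytope whose vertices are +/- the canonical
# basis vectors; a linear function attains its maximum at a vertex,
# so max { beta'w : ||w||_1 <= 1 } can be computed exactly
vertices = np.vstack([np.eye(p), -np.eye(p)])
dual_value = np.max(vertices @ beta)

# dual_value coincides with the L-infinity norm of beta
```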
2.3.3 Hybrid Penalties
There is no reason for using pure penalties in isolation; we can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), whose objective is to improve the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is:

\min_\beta \; \sum_{i=1}^n (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j=1}^p \beta_j^2 \qquad (2.9)
The term in λ1 is a Lasso penalty that induces sparsity in the vector β; on the other hand, the term in λ2 is a ridge penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotic capability (as n goes to infinity) of always making the right choice of relevant variables.
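A sketch of the Elastic net at work, using scikit-learn's ElasticNet as an illustrative stand-in (its alpha and l1_ratio parameters jointly encode λ1 and λ2), on a design with groups of strongly correlated columns:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(6)
n = 20
Z = rng.normal(size=(n, 10))
# 10 groups of 6 almost identical columns: 60 variables, 20 samples
X = np.repeat(Z, 6, axis=1) + 0.01 * rng.normal(size=(n, 60))
y = Z[:, 0] + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.05, max_iter=50000).fit(X, y)
enet = ElasticNet(alpha=0.05, l1_ratio=0.5, max_iter=50000).fit(X, y)

n_lasso = np.count_nonzero(lasso.coef_)
n_enet = np.count_nonzero(enet.coef_)
# The pure Lasso tends to keep few representatives of each correlated
# group, while the Elastic net tends to spread weight over the group
```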
2.3.4 Mixed Penalties
Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us denote by Gℓ the group of genes for the ℓth process and by dℓ the number of genes (variables) in each group, for all ℓ ∈ {1, ..., L}. Thus, the dimension of the vector β is the sum of the number of genes of every group: dim(β) = Σℓ=1..L dℓ. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

\|\beta\|_{(r,s)} = \left( \sum_\ell \Big( \sum_{j \in G_\ell} |\beta_j|^s \Big)^{r/s} \right)^{1/r} \qquad (2.10)
The pair (r, s) identifies the norms that are combined: an Ls norm within groups and an Lr norm between groups. The Ls norm penalizes the variables in every group Gℓ, while the Lr norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
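A direct implementation of the mixed norm (2.10) is straightforward. The sketch below checks that (r, s) = (1, 2) gives the group-Lasso penalty, and that (2, 2) collapses to the plain L2 norm.

```python
import numpy as np

def mixed_norm(beta, groups, r=1, s=2):
    """||beta||_(r,s): an L_s norm inside each group, then an L_r norm
    across the group norms; (r, s) = (1, 2) is the group-Lasso penalty."""
    group_norms = np.array([np.sum(np.abs(beta[g]) ** s) ** (1.0 / s)
                            for g in groups])
    return np.sum(group_norms ** r) ** (1.0 / r)

beta = np.array([3.0, 4.0, 0.0, 0.0, 1.0])
groups = [[0, 1], [2, 3], [4]]

# group L2 norms are 5, 0 and 1, so the group-Lasso value is 6
value = mixed_norm(beta, groups, r=1, s=2)
```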
Several combinations are available; the most popular is the norm ‖β‖(1,2), known as the group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L(1,2) norm. Many other mixings are possible, such as ‖β‖(1,4/3) (Szafranski et al., 2008) or ‖β‖(1,∞) (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms, such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).
2.3.5 Sparsity Considerations
In this chapter, I have reviewed several possibilities that induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models featurewise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.

The Lasso and the other $L_1$ penalties encourage solutions such as the one on the left of Figure 2.6. If the objective is sparsity, then the $L_1$ norm does the job. However, if we aim at feature selection, and if the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown on the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, $L_{1,2}$ or $L_{1,\infty}$ mixed norms, with the proper definition of groups, can induce sparsity patterns such as
Figure 2.5: Admissible sets for the Lasso ($L_1$, left) and the group-Lasso ($L_{(1,2)}$, right).

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters: $L_1$-induced sparsity (left) and $L_{(1,2)}$ group-induced sparsity (right).
the one on the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.
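The two patterns of Figure 2.6 can be reproduced with the proximal operators of the corresponding penalties. In the sketch below (not from the thesis; the matrix and the threshold values are arbitrary), each row of $B$ stands for one variable with its 4 parameters: elementwise soft-thresholding scatters zeros anywhere, while block soft-thresholding on rows removes whole variables at once.

```python
import numpy as np

def soft_threshold(B, lam):
    """Proximal operator of lam * L1: shrinks every entry independently."""
    return np.sign(B) * np.maximum(np.abs(B) - lam, 0.0)

def group_soft_threshold(B, lam):
    """Proximal operator of lam * L_{1,2} with one group per row:
    a row is shrunk as a block, and vanishes entirely if its norm <= lam."""
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    return B * np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)

rng = np.random.default_rng(0)
B = rng.normal(size=(8, 4))            # 8 variables, 4 parameters each

B_l1 = soft_threshold(B, 0.8)          # scattered zeros (left of Figure 2.6)
B_grp = group_soft_threshold(B, 1.5)   # whole rows zeroed (right of Figure 2.6)
print("variables removed:", np.where(~B_grp.any(axis=1))[0])
```

The group operator either zeroes a row completely or rescales it as a whole, which is exactly the featurewise sparsity pattern discussed above.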
2.3.6 Optimization Tools for Regularized Problems
Caramanis et al. (2012) gather a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified in four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an algorithm of "active constraints", implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Further details are given in Chapter 5.
Subgradient Descent. Subgradient descent is a generic optimization method that can be used in penalized settings where the subgradient of the loss function, $\partial J(\beta)$, and the subgradient of the regularizer, $\partial P(\beta)$, can be computed efficiently. On the one hand, it is essentially blind to the problem structure; on the other hand, many iterations are needed, so the convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector $\beta^{(t+1)}$ is updated proportionally to the negative subgradient of the function at the current point $\beta^{(t)}$:

$$\beta^{(t+1)} = \beta^{(t)} - \alpha\, (s + \lambda s')\,, \quad \text{where } s \in \partial J(\beta^{(t)}),\ s' \in \partial P(\beta^{(t)})\,.$$
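A minimal sketch of this update for the Lasso case, where $J$ is the squared loss and $P$ the $L_1$ norm; the synthetic data, the step size and the iteration count are arbitrary choices, not taken from the thesis. Note that $\mathrm{sign}(\beta)$, with $\mathrm{sign}(0) = 0$, is a valid subgradient of the penalty.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))
beta_true = np.zeros(10)
beta_true[:3] = [2.0, -1.0, 1.5]
y = X @ beta_true + 0.01 * rng.normal(size=50)
lam = 1.0

def objective(beta):
    return np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))

beta, alpha = np.zeros(10), 1e-3       # fixed step size
for _ in range(2000):
    s = -2.0 * X.T @ (y - X @ beta)    # gradient of the quadratic loss J
    s_prime = np.sign(beta)            # a subgradient of the L1 penalty P
    beta -= alpha * (s + lam * s_prime)
```

As announced in the text, convergence is slow and the iterates are not exactly sparse: small coefficients hover around zero instead of vanishing.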
Coordinate Descent. Coordinate descent is based on the first-order optimality conditions of the criterion (2.1). For penalties like the Lasso, setting to zero the first-order derivative with respect to coefficient $\beta_j$ gives

$$\beta_j = \frac{-\lambda\, \mathrm{sign}(\beta_j) - \partial J(\beta)/\partial \beta_j}{2 \sum_{i=1}^{n} x_{ij}^2}\,.$$

In the literature, those algorithms may also be referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution $\beta^{\mathrm{ls}}$ and updating its value using an iterative thresholding algorithm where $\beta_j^{(t+1)} = S_\lambda\big(\partial J(\beta^{(t)})/\partial \beta_j\big)$. The objective function is optimized with respect to one variable at a time, while all others are kept fixed:
$$S_\lambda\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) = \begin{cases} \dfrac{\lambda - \partial J(\beta)/\partial \beta_j}{2 \sum_{i=1}^{n} x_{ij}^2} & \text{if } \partial J(\beta)/\partial \beta_j > \lambda \\[2mm] \dfrac{-\lambda - \partial J(\beta)/\partial \beta_j}{2 \sum_{i=1}^{n} x_{ij}^2} & \text{if } \partial J(\beta)/\partial \beta_j < -\lambda \\[2mm] 0 & \text{if } \big| \partial J(\beta)/\partial \beta_j \big| \le \lambda \end{cases} \qquad (2.11)$$
The same principles define "block-coordinate descent" algorithms; in this case, the first-order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).
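A sketch of cyclic coordinate descent with soft-thresholding, in the spirit of Fu (1998). It is written for the criterion $\frac{1}{2}\|y - X\beta\|^2 + \lambda \|\beta\|_1$, which differs from the convention above only by a factor of 2 in the loss; the data and the function names are illustrative, not from the thesis.

```python
import numpy as np

def soft(z, lam):
    """Scalar soft-thresholding S_lam(z)."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_cycles=200):
    """Cyclic coordinate descent for 0.5 * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    r = y.copy()                            # residual y - X beta, kept up to date
    for _ in range(n_cycles):
        for j in range(p):
            r += X[:, j] * beta[j]          # remove variable j from the fit
            beta[j] = soft(X[:, j] @ r, lam) / col_sq[j]
            r -= X[:, j] * beta[j]          # add it back with its new value
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))
y = X @ np.array([2.0, 0, 0, -1.5, 0, 0, 0, 0]) + 0.01 * rng.normal(size=40)
beta = lasso_cd(X, y, lam=2.0)
print("support:", np.flatnonzero(beta))
```

At convergence, the solution satisfies the first-order conditions: $|x_j^\top (y - X\beta)| \le \lambda$ for every $j$, with equality on the non-zero coefficients.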
Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", which stores the indices of the variables with non-zero $\beta_j$; it is usually denoted $\mathcal{A}$. The complement of the active set is the "inactive set", denoted $\bar{\mathcal{A}}$, which contains the indices of the variables whose $\beta_j$ is zero. Thus, the problem can be reduced to the dimensionality of $\mathcal{A}$.

Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There also exists a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy, which starts with an empty $\mathcal{A}$, has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are expected to be selected.

Working set algorithms have to deal with three main tasks. First, there is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the descent direction of the objective function, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set $\mathcal{A}$ is augmented with the variable from the inactive set $\bar{\mathcal{A}}$ that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions; their expressions are essential to select the next variable to add to the active set and to test whether a particular vector $\beta$ is a solution of Problem (2.1).
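The three tasks can be arranged in a short skeleton. The sketch below is illustrative, not the thesis's algorithm: the function names are made up, and the inner solver is a plain coordinate descent stand-in. It grows an empty active set forward, solving the restricted Lasso problem and adding the most violating inactive variable until the optimality conditions hold.

```python
import numpy as np

def soft(z, lam):
    return np.sign(z) * max(abs(z) - lam, 0.0)

def solve_on_active(X, y, beta, active, lam, n_cycles=100):
    """Optimization task: minimize over the active variables only
    (warm started, since beta keeps the previous solution)."""
    col_sq = np.sum(X ** 2, axis=0)
    r = y - X @ beta
    for _ in range(n_cycles):
        for j in active:
            r += X[:, j] * beta[j]
            beta[j] = soft(X[:, j] @ r, lam) / col_sq[j]
            r -= X[:, j] * beta[j]
    return beta

def lasso_active_set(X, y, lam, tol=1e-8):
    beta, active = np.zeros(X.shape[1]), []
    while True:
        beta = solve_on_active(X, y, beta, active, lam)
        grad = X.T @ (y - X @ beta)       # optimality: |grad_j| <= lam off the support
        j = int(np.argmax(np.abs(grad)))  # working set update: most violating variable
        if abs(grad[j]) <= lam + tol or j in active:
            return beta, active
        active.append(j)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))
y = X @ np.array([2.0, 0, 0, -1.5, 0, 0, 0, 0])
beta, active = lasso_active_set(X, y, lam=1.0)
print("active set:", sorted(active))      # with this seed, typically {0, 3}
```

The first iterations involve only one or two variables, which illustrates why the forward philosophy keeps the early computations low dimensional.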
These active constraints or working set methods, even if they were originally proposed to solve $L_1$-regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and $L_1$ penalties (Roth, 2004), linear functions and $L_{1,2}$ penalties (Roth and Fischer, 2008), or even logarithmic cost functions and combinations of $L_0$, $L_1$ and $L_2$ penalties (Perkins et al., 2003). The algorithm developed in this work belongs to this family of solutions.
Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built from several secant hyper-planes at different points, obtained from the subgradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims, 2006; Smola et al., 2008; Franc and Sonnenburg, 2008) or Multiple Kernel Learning (Sonnenburg et al., 2006).
Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1), where the penalty parameter $\lambda$ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).

This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the Least Angle Regression (LARS) algorithm, developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.

Once an active set $\mathcal{A}^{(t)}$ and its corresponding solution $\beta^{(t)}$ have been set, following the regularization path means looking for a direction $h$ and a step size $\gamma$ to update the solution as $\beta^{(t+1)} = \beta^{(t)} + \gamma h$. Afterwards, the active and inactive sets $\mathcal{A}^{(t+1)}$ and $\bar{\mathcal{A}}^{(t+1)}$ are updated; this can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size, and decides which variable should enter the active set, from the correlation with the residuals.
Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz-differentiable cost function $J(\beta)$ and a non-differentiable penalty $\lambda P(\beta)$:

$$\min_{\beta \in \mathbb{R}^p}\; J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2} \left\| \beta - \beta^{(t)} \right\|_2^2 \qquad (2.12)$$
They are also iterative methods, where the cost function $J(\beta)$ is linearized in the proximity of the solution $\beta$, so that the problem to solve at each iteration looks like (2.12), where the parameter $L > 0$ should be an upper bound on the Lipschitz constant of the gradient $\nabla J$. This can be rewritten as
$$\min_{\beta \in \mathbb{R}^p}\; \frac{1}{2} \left\| \beta - \Big( \beta^{(t)} - \frac{1}{L} \nabla J(\beta^{(t)}) \Big) \right\|_2^2 + \frac{\lambda}{L} P(\beta) \qquad (2.13)$$
The basic algorithm uses the solution of (2.13) as the next value $\beta^{(t+1)}$. However, there are faster versions that take advantage of information about the previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle, 2009). Proximal methods can be seen as generalizations of gradient updates: in fact, setting $\lambda = 0$ in equation (2.13) recovers the standard gradient update rule.
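The basic iteration, known as ISTA in the $L_1$ case, is compact: a gradient step on $J$ followed by the proximal operator of $(\lambda/L) P$, which for the $L_1$ norm is soft-thresholding, giving the closed-form solution of (2.13). The sketch below uses synthetic data and arbitrary constants, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 15))
beta_true = np.zeros(15)
beta_true[[0, 4]] = [3.0, -2.0]
y = X @ beta_true
lam = 0.5

def grad_J(b):                         # gradient of J(b) = 0.5 * ||y - X b||^2
    return X.T @ (X @ b - y)

def prox_l1(b, t):                     # proximal operator of t * ||.||_1
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

L = np.linalg.eigvalsh(X.T @ X).max()  # Lipschitz constant of grad_J

beta = np.zeros(15)
for _ in range(500):
    beta = prox_l1(beta - grad_J(beta) / L, lam / L)   # solves (2.13) in closed form
print("support:", np.flatnonzero(beta))
```

Unlike subgradient descent, the iterates are exactly sparse, since the proximal step sets small coordinates to zero at every iteration.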
Part II
Sparse Linear Discriminant Analysis
Abstract
Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with $L_1$ penalties (see Section 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models with respect to variables.

In this part, we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach to LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis, and also enables the derivation of variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of the data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification. The first two or three directions can also be used to project the data, so as to generate a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performance. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.
3 Feature Selection in Fisher Discriminant Analysis

3.1 Fisher Discriminant Analysis
Linear discriminant analysis (LDA) aims to describe $n$ labeled observations belonging to $K$ groups by a linear combination of features which characterizes or separates the classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of the data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part, we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities, but rather on some inertia principles (Fisher, 1936).
We consider that the data consist of a set of $n$ examples, with observations $x_i \in \mathbb{R}^p$ comprising $p$ features, and labels $y_i \in \{0,1\}^K$ indicating the exclusive assignment of observation $x_i$ to one of the $K$ classes. It will be convenient to gather the observations in the $n \times p$ matrix $X = (x_1, \dots, x_n)^\top$ and the corresponding labels in the $n \times K$ matrix $Y = (y_1, \dots, y_n)^\top$.
Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

$$\max_{\beta \in \mathbb{R}^p}\; \frac{\beta^\top \Sigma_B \beta}{\beta^\top \Sigma_W \beta} \qquad (3.1)$$
where $\beta$ is the discriminant direction used to project the data, and $\Sigma_B$ and $\Sigma_W$ are the $p \times p$ between-class and within-class covariance matrices, respectively, defined (for a $K$-class problem) as

$$\Sigma_W = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (x_i - \mu_k)(x_i - \mu_k)^\top$$

$$\Sigma_B = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (\mu - \mu_k)(\mu - \mu_k)^\top$$

where $\mu$ is the sample mean of the whole dataset, $\mu_k$ the sample mean of class $k$, and $G_k$ indexes the observations of class $k$.
This analysis can be extended to the multi-class framework with $K$ groups. In this case, $K-1$ discriminant vectors $\beta_k$ may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

$$\max_{B \in \mathbb{R}^{p \times (K-1)}}\; \frac{\operatorname{tr}\big(B^\top \Sigma_B B\big)}{\operatorname{tr}\big(B^\top \Sigma_W B\big)} \qquad (3.2)$$
where the matrix $B$ is built with the discriminant directions $\beta_k$ as columns.

Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of $K-1$ subproblems:

$$\begin{aligned} \max_{\beta_k \in \mathbb{R}^p}\;& \beta_k^\top \Sigma_B \beta_k \\ \text{s.t. } & \beta_k^\top \Sigma_W \beta_k \le 1 \\ & \beta_k^\top \Sigma_W \beta_\ell = 0\,, \quad \forall \ell < k \end{aligned} \qquad (3.3)$$

The maximizer of subproblem $k$ is the eigenvector of $\Sigma_W^{-1} \Sigma_B$ associated to the $k$th largest eigenvalue (see Appendix C).
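A numerical sketch of this eigendecomposition on synthetic data (the three-class Gaussian sample and all constants are invented for the illustration): build $\Sigma_W$ and $\Sigma_B$ as defined above, then take the leading eigenvectors of $\Sigma_W^{-1} \Sigma_B$ as the $K-1$ discriminant directions.

```python
import numpy as np

rng = np.random.default_rng(3)
K, p, n_k = 3, 4, 100
centers = np.array([[0., 0, 0, 0], [4, 0, 0, 0], [0, 4, 0, 0]])
X = np.vstack([centers[k] + rng.normal(size=(n_k, p)) for k in range(K)])
y = np.repeat(np.arange(K), n_k)

mu = X.mean(axis=0)
Sigma_W, Sigma_B = np.zeros((p, p)), np.zeros((p, p))
for k in range(K):
    Xk = X[y == k]
    mu_k = Xk.mean(axis=0)
    Sigma_W += (Xk - mu_k).T @ (Xk - mu_k) / len(X)
    Sigma_B += len(Xk) * np.outer(mu_k - mu, mu_k - mu) / len(X)

# K - 1 discriminant directions: leading eigenvectors of Sigma_W^{-1} Sigma_B
evals, evecs = np.linalg.eig(np.linalg.solve(Sigma_W, Sigma_B))
order = np.argsort(evals.real)[::-1]
B = evecs.real[:, order[:K - 1]]
print("projected data:", (X @ B).shape)  # → (300, 2)
```

Since $\Sigma_B$ has rank at most $K-1$, only the first $K-1$ eigenvalues are non-zero, which is why at most $K-1$ discriminant directions exist.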
3.2 Feature Selection in LDA Problems
LDA is often used as a data reduction technique, where the $K-1$ discriminant directions summarize the $p$ original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. The main target of this sparsity is to reduce the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness in the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevancy of each feature is usually based on univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, that is, either Fisher's discriminant analysis (variance-based) or regression-based formulations.
3.2.1 Inertia Based
The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away from each other (large between-class variance), and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.
Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al., 2007).
Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where Fisher's discriminant (3.1) is solved as

$$\begin{aligned} \min_{\beta \in \mathbb{R}^p}\;& \beta^\top \Sigma_W \beta \\ \text{s.t. } & (\mu_1 - \mu_2)^\top \beta = 1 \\ & \textstyle\sum_{j=1}^{p} |\beta_j| \le t \end{aligned}$$

where $\mu_1$ and $\mu_2$ are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match Problem (3.1); the second constraint encourages parsimony.
Witten and Tibshirani (2011) describe a multi-class technique using Fisher's discriminant, rewritten in the form of $K-1$ constrained and penalized maximization problems:

$$\begin{aligned} \max_{\beta_k \in \mathbb{R}^p}\;& \beta_k^\top \Sigma_B^k \beta_k - P_k(\beta_k) \\ \text{s.t. } & \beta_k^\top \Sigma_W \beta_k \le 1 \end{aligned}$$

The term to maximize is the projected between-class covariance $\beta_k^\top \Sigma_B^k \beta_k$, subject to an upper bound on the projected within-class covariance $\beta_k^\top \Sigma_W \beta_k$. The penalty $P_k(\beta_k)$ is added to avoid singularities and to induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general purpose data: the Lasso shrinks the less informative variables to zero, and the fused Lasso encourages a piecewise constant $\beta_k$ vector. The R code is available from the website of Daniela Witten.
Cai and Liu (2011) use Fisher's discriminant to solve a binary LDA problem. But instead of performing separate estimations of $\Sigma_W$ and $(\mu_1 - \mu_2)$ to obtain the optimal solution $\beta = \Sigma_W^{-1}(\mu_1 - \mu_2)$, they estimate the product directly, through a constrained $L_1$ minimization:

$$\begin{aligned} \min_{\beta \in \mathbb{R}^p}\;& \|\beta\|_1 \\ \text{s.t. } & \big\| \Sigma \beta - (\mu_1 - \mu_2) \big\|_\infty \le \lambda \end{aligned}$$

Sparsity is encouraged by the $L_1$ norm of the vector $\beta$, and the parameter $\lambda$ is used to tune the optimization.
Most of the algorithms reviewed are conceived for binary classification. For those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.
3.2.2 Regression Based
In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For $K > 2$, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case (Duda et al., 2000; Friedman et al., 2009).
Predefined Indicator Matrix
Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al., 2009). An indicator matrix $Y$ is an $n \times K$ matrix with the class labels for all samples. There are several well-known types in the literature. For example, the binary or dummy indicator ($y_{ik} = 1$ if sample $i$ belongs to class $k$, and $y_{ik} = 0$ otherwise) is commonly used to link multi-class classification with linear regression (Friedman et al., 2009). Another "popular" choice is $y_{ik} = 1$ if sample $i$ belongs to class $k$, and $y_{ik} = -1/(K-1)$ otherwise; it was used, for example, to extend Support Vector Machines to multi-class classification (Lee et al., 2004) or to generalize the kernel target alignment measure (Guermeur et al., 2004).
Some works propose a formulation of the least squares problem based on a new class indicator matrix (Ye, 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition, shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis for the high-dimensional, low-sample setting, which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to optimal scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required; the lack of publicly available code also prevented an empirical test of this conjecture. If the similitude were confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.
In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector $\beta$ is obtained by solving

$$\min_{\beta \in \mathbb{R}^p,\, \beta_0 \in \mathbb{R}}\; n^{-1} \sum_{i=1}^{n} (y_i - \beta_0 - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

where $y_i$ is the binary indicator of the label of pattern $x_i$. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule $x^\top \beta + \beta_0 > 0$ is the LDA classifier when it is built using the $\beta$ vector resulting for $\lambda = 0$, but a different intercept $\beta_0$ is required.
Optimal Scoring
In binary classification, the regression of (scaled) class indicators enables exact recovery of the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al., 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification, and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani, 1996) or more conservative ones (Hastie et al., 1995).
As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix $\Omega$, leading to a problem expressed in compact form as

$$\min_{\Theta,\, B}\; \|Y\Theta - XB\|_F^2 + \lambda \operatorname{tr}\big(B^\top \Omega B\big) \qquad (3.4a)$$
$$\text{s.t. } n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} \qquad (3.4b)$$
where $\Theta \in \mathbb{R}^{K \times (K-1)}$ are the class scores, $B \in \mathbb{R}^{p \times (K-1)}$ are the regression coefficients, and $\|\cdot\|_F$ is the Frobenius norm. This compact form does not render the ordering that arises naturally when considering the following series of $K-1$ problems:

$$\min_{\theta_k \in \mathbb{R}^K,\, \beta_k \in \mathbb{R}^p}\; \|Y\theta_k - X\beta_k\|^2 + \beta_k^\top \Omega \beta_k \qquad (3.5a)$$
$$\text{s.t. } n^{-1}\, \theta_k^\top Y^\top Y \theta_k = 1 \qquad (3.5b)$$
$$\theta_k^\top Y^\top Y \theta_\ell = 0\,, \quad \ell = 1, \dots, k-1 \qquad (3.5c)$$
where each $\beta_k$ corresponds to a discriminant direction.
Several sparse LDA variants have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan, 2005; Leng, 2008; Grosenick et al., 2008; Clemmensen et al., 2011). Grosenick et al. (2008) proposed a variant of the Lasso-based penalized OS of Ghosh and Chinnaiyan (2005), introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

$$\min_{\beta_k \in \mathbb{R}^p,\, \theta_k \in \mathbb{R}^K}\; \sum_k \|Y\theta_k - X\beta_k\|_2^2 + \lambda_1 \|\beta_k\|_1 + \lambda_2\, \beta_k^\top \Omega \beta_k$$
where $\lambda_1$ and $\lambda_2$ are regularization parameters, and $\Omega$ is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.
Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):

$$\min_{\beta_k \in \mathbb{R}^p,\, \theta_k \in \mathbb{R}^K}\; \sum_{k=1}^{K-1} \|Y\theta_k - X\beta_k\|_2^2 + \lambda \sum_{j=1}^{p} \Bigg( \sum_{k=1}^{K-1} \beta_{kj}^2 \Bigg)^{1/2} \qquad (3.6)$$
which is the criterion that was chosen in this thesis.

The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm closely followed the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets, with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA and propose a publicly available, efficient code for solving this problem.
4 Formalizing the Objective
In this chapter, we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis, and also enables the derivation of variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004).

The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), that selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of the data. For $K$ classes, this representation can be either complete, in dimension $K-1$, or partial, for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here, since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka, 1984; Hastie et al., 1994; Hastie and Tibshirani, 1996; Hastie et al., 1995) and already used before for sparsity-inducing penalties (Roth and Lange, 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by group-Lasso and penalized LDA (Sanchez Merchante et al., 2012).
4.1 From Optimal Scoring to Linear Discriminant Analysis
Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) and in penalized LDA (p-LDA), by going through canonical correlation analysis. We first provide some properties of the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter, we assume that:

• there is no empty class, that is, the diagonal matrix $Y^\top Y$ is full rank;

• inputs are centered, that is, $X^\top \mathbf{1}_n = 0$;

• the quadratic penalty $\Omega$ is positive-semidefinite and such that $X^\top X + \Omega$ is full rank.
4.1.1 Penalized Optimal Scoring Problem
For the sake of simplicity, we now drop the subscript $k$ to refer to any problem in the p-OS series (3.5). First, note that Problems (3.5) are biconvex in $(\theta, \beta)$, that is, convex in $\theta$ for each $\beta$ value and vice-versa. The problems are however non-convex: in particular, if $(\theta, \beta)$ is a solution, then $(-\theta, -\beta)$ is also a solution.

The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to $K$, since we assumed that there are no empty classes. Moreover, as $X$ is centered, the $K-1$ first optimal scores are orthogonal to $\mathbf{1}$ (and the $K$th problem would be solved by $\beta_K = 0$). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, so as to simplify all expressions, we do not mention anymore these orthogonality constraints (3.5c), which apply along the route. The generic problem solved is thus

$$\min_{\theta \in \mathbb{R}^K,\, \beta \in \mathbb{R}^p}\; \|Y\theta - X\beta\|^2 + \beta^\top \Omega \beta \qquad (4.1a)$$
$$\text{s.t. } n^{-1}\, \theta^\top Y^\top Y \theta = 1 \qquad (4.1b)$$
For a given score vector $\theta$, the discriminant direction $\beta$ that minimizes the p-OS criterion (4.1) is the penalized least squares estimator

$$\beta_{\mathrm{os}} = \big( X^\top X + \Omega \big)^{-1} X^\top Y \theta \qquad (4.2)$$
The objective function (4.1a) is then

$$\begin{aligned} \|Y\theta - X\beta_{\mathrm{os}}\|^2 + \beta_{\mathrm{os}}^\top \Omega \beta_{\mathrm{os}} &= \theta^\top Y^\top Y \theta - 2\, \theta^\top Y^\top X \beta_{\mathrm{os}} + \beta_{\mathrm{os}}^\top \big( X^\top X + \Omega \big) \beta_{\mathrm{os}} \\ &= \theta^\top Y^\top Y \theta - \theta^\top Y^\top X \big( X^\top X + \Omega \big)^{-1} X^\top Y \theta \end{aligned}$$

where the second line stems from the definition of $\beta_{\mathrm{os}}$ (4.2). Now, using the fact that the optimal $\theta$ obeys constraint (4.1b), the optimization problem is equivalent to

$$\max_{\theta:\; n^{-1} \theta^\top Y^\top Y \theta = 1}\; \theta^\top Y^\top X \big( X^\top X + \Omega \big)^{-1} X^\top Y \theta \qquad (4.3)$$
which shows that the optimization of the p-OS problem with respect to $\theta_k$ boils down to finding the $k$th largest eigenvector of $Y^\top X \big( X^\top X + \Omega \big)^{-1} X^\top Y$. Indeed, Appendix C details that Problem (4.3) is solved by

$$(Y^\top Y)^{-1}\, Y^\top X \big( X^\top X + \Omega \big)^{-1} X^\top Y \theta = \alpha^2 \theta \qquad (4.4)$$
where $\alpha^2$ is the maximal eigenvalue:¹

$$n^{-1} \theta^\top Y^\top X \big( X^\top X + \Omega \big)^{-1} X^\top Y \theta = \alpha^2\, n^{-1} \theta^\top (Y^\top Y) \theta$$
$$n^{-1} \theta^\top Y^\top X \big( X^\top X + \Omega \big)^{-1} X^\top Y \theta = \alpha^2 \qquad (4.5)$$
4.1.2 Penalized Canonical Correlation Analysis
As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables $X$ and $Y$ is defined as follows:

$$\max_{\theta \in \mathbb{R}^K,\, \beta \in \mathbb{R}^p}\; n^{-1}\, \theta^\top Y^\top X \beta \qquad (4.6a)$$
$$\text{s.t. } n^{-1}\, \theta^\top Y^\top Y \theta = 1 \qquad (4.6b)$$
$$n^{-1}\, \beta^\top \big( X^\top X + \Omega \big) \beta = 1 \qquad (4.6c)$$
The solutions to (4.6) are obtained by finding the saddle points of the Lagrangian:

$$n L(\beta, \theta, \nu, \gamma) = \theta^\top Y^\top X \beta - \nu\, \big( \theta^\top Y^\top Y \theta - n \big) - \gamma\, \big( \beta^\top (X^\top X + \Omega) \beta - n \big)$$
$$\Rightarrow\; n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \beta} = X^\top Y \theta - 2\gamma\, (X^\top X + \Omega) \beta$$
$$\Rightarrow\; \beta_{\mathrm{cca}} = \frac{1}{2\gamma}\, (X^\top X + \Omega)^{-1} X^\top Y \theta$$
Then, as $\beta_{\mathrm{cca}}$ obeys (4.6c), we obtain

$$\beta_{\mathrm{cca}} = \frac{(X^\top X + \Omega)^{-1} X^\top Y \theta}{\sqrt{n^{-1} \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta}} \qquad (4.7)$$
so that the optimal objective function (4.6a) can be expressed with $\theta$ alone:

$$n^{-1} \theta^\top Y^\top X \beta_{\mathrm{cca}} = \frac{n^{-1} \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta}{\sqrt{n^{-1} \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta}} = \sqrt{n^{-1} \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta}$$

and the optimization problem with respect to $\theta$ can be restated as

$$\max_{\theta:\; n^{-1} \theta^\top Y^\top Y \theta = 1}\; \theta^\top Y^\top X \big( X^\top X + \Omega \big)^{-1} X^\top Y \theta \qquad (4.8)$$
Hence, the p-OS and p-CCA problems produce the same optimal score vectors $\theta$. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

$$\beta_{\mathrm{os}} = \alpha\, \beta_{\mathrm{cca}} \qquad (4.9)$$

¹The awkward notation $\alpha^2$ for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5), for example).
where $\alpha$ is defined by (4.5).

The p-CCA optimization problem can also be written as a function of $\beta$ alone, using the optimality conditions for $\theta$:

$$n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \theta} = Y^\top X \beta - 2\nu\, Y^\top Y \theta$$
$$\Rightarrow\; \theta_{\mathrm{cca}} = \frac{1}{2\nu}\, (Y^\top Y)^{-1} Y^\top X \beta \qquad (4.10)$$
Then, as $\theta_{\mathrm{cca}}$ obeys (4.6b), we obtain

$$\theta_{\mathrm{cca}} = \frac{(Y^\top Y)^{-1} Y^\top X \beta}{\sqrt{n^{-1} \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta}} \qquad (4.11)$$
leading to the following expression of the optimal objective function:

$$n^{-1} \theta_{\mathrm{cca}}^\top Y^\top X \beta = \frac{n^{-1} \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta}{\sqrt{n^{-1} \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta}} = \sqrt{n^{-1} \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta}$$
The p-CCA problem can thus be solved with respect to $\beta$ by plugging this value in (4.6):

$$\max_{\beta \in \mathbb{R}^p}\; n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta \qquad (4.12a)$$
$$\text{s.t. } n^{-1}\, \beta^\top \big( X^\top X + \Omega \big) \beta = 1 \qquad (4.12b)$$
where the positive objective function has been squared compared to (4.6). This formulation is important, since it will be used to link p-CCA to p-LDA. We thus derive its solution: following the reasoning of Appendix C, $\beta_{\mathrm{cca}}$ verifies

$$n^{-1} X^\top Y (Y^\top Y)^{-1} Y^\top X \beta_{\mathrm{cca}} = \lambda\, \big( X^\top X + \Omega \big) \beta_{\mathrm{cca}} \qquad (4.13)$$
where $\lambda$ is the maximal eigenvalue, shown below to be equal to $\alpha^2$:

$$ n^{-1} \beta_{\mathrm{cca}}^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta_{\mathrm{cca}} = \lambda $$
$$ \Rightarrow\; n^{-1} \alpha^{-1}\, \beta_{\mathrm{cca}}^\top X^\top Y (Y^\top Y)^{-1} Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta = \lambda $$
$$ \Rightarrow\; n^{-1} \alpha\, \beta_{\mathrm{cca}}^\top X^\top Y \theta = \lambda $$
$$ \Rightarrow\; n^{-1} \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta = \lambda $$
$$ \Rightarrow\; \alpha^2 = \lambda \;. $$

The first line is obtained from constraint (4.12b); the second from the relationship (4.7), whose denominator is $\alpha$; the third comes from (4.4); the fourth uses the relationship (4.7) again; and the last one the definition of $\alpha$ (4.5).
4.1 From Optimal Scoring to Linear Discriminant Analysis
4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), penalized Linear Discriminant Analysis is defined as follows:

$$ \max_{\beta \in \mathbb{R}^p} \; \beta^\top \Sigma_{\mathrm{B}}\, \beta \tag{4.14a} $$
$$ \text{s.t.}\;\; \beta^\top (\Sigma_{\mathrm{W}} + n^{-1}\Omega)\,\beta = 1 \;, \tag{4.14b} $$
where $\Sigma_{\mathrm{B}}$ and $\Sigma_{\mathrm{W}}$ are respectively the sample between-class and within-class covariance matrices of the original $p$-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.
As the feature matrix $X$ is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form, amenable to a simple matrix representation using the projection operator $Y(Y^\top Y)^{-1}Y^\top$:

$$ \Sigma_{\mathrm{T}} = \frac{1}{n} \sum_{i=1}^n x_i x_i^\top = n^{-1} X^\top X $$
$$ \Sigma_{\mathrm{B}} = \frac{1}{n} \sum_{k=1}^K n_k\, \mu_k \mu_k^\top = n^{-1} X^\top Y (Y^\top Y)^{-1} Y^\top X $$
$$ \Sigma_{\mathrm{W}} = \frac{1}{n} \sum_{k=1}^K \sum_{i:\, y_{ik}=1} (x_i - \mu_k)(x_i - \mu_k)^\top = n^{-1} \left( X^\top X - X^\top Y (Y^\top Y)^{-1} Y^\top X \right) $$
Using these formulae, the solution to the p-LDA problem (4.14) is obtained as

$$ X^\top Y (Y^\top Y)^{-1} Y^\top X\, \beta_{\mathrm{lda}} = \lambda \left( X^\top X + \Omega - X^\top Y (Y^\top Y)^{-1} Y^\top X \right) \beta_{\mathrm{lda}} $$
$$ X^\top Y (Y^\top Y)^{-1} Y^\top X\, \beta_{\mathrm{lda}} = \frac{\lambda}{1 - \lambda}\, (X^\top X + \Omega)\, \beta_{\mathrm{lda}} \;. $$
The comparison of the last equation with (4.13) shows that $\beta_{\mathrm{lda}}$ and $\beta_{\mathrm{cca}}$ are proportional, and that $\lambda/(1-\lambda) = \alpha^2$. Using constraints (4.12b) and (4.14b), it follows that

$$ \beta_{\mathrm{lda}} = (1 - \alpha^2)^{-1/2}\, \beta_{\mathrm{cca}} = \alpha^{-1} (1 - \alpha^2)^{-1/2}\, \beta_{\mathrm{os}} \;, $$

which completes the path from p-OS to p-LDA.
4.1.4 Summary

The three previous subsections considered a generic form of the $k$th problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:

$$ \min_{\Theta,\, B}\; \|Y\Theta - XB\|_F^2 + \lambda\, \mathrm{tr}(B^\top \Omega B) \quad \text{s.t.}\;\; n^{-1}\Theta^\top Y^\top Y \Theta = I_{K-1} \;. $$

Let $A$ represent the $(K-1)\times(K-1)$ diagonal matrix with elements $\alpha_k$, the square roots of the $K-1$ leading eigenvalues of $n^{-1} Y^\top X (X^\top X + \Omega)^{-1} X^\top Y$; we have

$$ B_{\mathrm{LDA}} = B_{\mathrm{CCA}}\, (I_{K-1} - A^2)^{-\frac{1}{2}} = B_{\mathrm{OS}}\, A^{-1} (I_{K-1} - A^2)^{-\frac{1}{2}} \;, \tag{4.15} $$
where $I_{K-1}$ is the $(K-1)\times(K-1)$ identity matrix.
At this point, the feature matrix $X$, of dimensions $n \times p$ in the input space, can be projected into the optimal scoring domain as an $n \times (K-1)$ matrix $X_{\mathrm{OS}} = X B_{\mathrm{OS}}$, or into the linear discriminant analysis domain as an $n \times (K-1)$ matrix $X_{\mathrm{LDA}} = X B_{\mathrm{LDA}}$. Classification can be performed in any of those domains, provided the appropriate distance (based on the penalized within-class covariance matrix) is applied.
With the aim of performing classification, the whole process can be summarized as follows:

1. Solve the p-OS problem as $B_{\mathrm{OS}} = (X^\top X + \lambda\Omega)^{-1} X^\top Y \Theta$, where $\Theta$ holds the $K-1$ leading eigenvectors of $Y^\top X (X^\top X + \lambda\Omega)^{-1} X^\top Y$.
2. Translate the data samples $X$ into the LDA domain as $X_{\mathrm{LDA}} = X B_{\mathrm{OS}} D$, where $D = A^{-1}(I_{K-1} - A^2)^{-\frac{1}{2}}$.
3. Compute the matrix $M$ of centroids $\mu_k$ from $X_{\mathrm{LDA}}$ and $Y$.
4. Evaluate the distances $d(x, \mu_k)$ in the LDA domain as functions of $M$ and $X_{\mathrm{LDA}}$.
5. Translate distances into posterior probabilities and assign every sample $i$ to a class $k$ following the maximum a posteriori rule.
6. Optionally, produce a graphical representation of the data.
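The steps above can be sketched in Python/NumPy as follows. This is a minimal illustrative sketch, not the GLOSS MATLAB implementation: the plain ridge penalty $\Omega = I$, the function names and the construction of $\Theta^0$ via a QR factorization are our assumptions.

```python
import numpy as np

def make_theta0(Y):
    """Initial scores Theta0 (K x K-1) with n^{-1} Theta0' Y'Y Theta0 = I_{K-1}."""
    n, K = Y.shape
    C = np.eye(K) - np.ones((K, K)) / K            # columns orthogonal to 1_K
    U = np.linalg.qr(C)[0][:, :K - 1]              # orthonormal basis of that space
    S = np.diag(1.0 / np.sqrt(Y.sum(axis=0)))      # (Y'Y)^{-1/2}; Y'Y is diagonal
    return np.sqrt(n) * S @ U

def pos_lda_classify(X, Y, lam=1.0):
    """Steps 1-5 of the summary, with a plain ridge penalty Omega = I."""
    n, p = X.shape
    K = Y.shape[1]
    Xc = X - X.mean(axis=0)                        # the feature matrix is centered
    M = Xc.T @ Xc + lam * np.eye(p)
    Theta0 = make_theta0(Y)
    B0 = np.linalg.solve(M, Xc.T @ Y @ Theta0)
    # Step 1: eigen-analysis of the small matrix Theta0' Y'X B0
    s, V = np.linalg.eigh(Theta0.T @ Y.T @ Xc @ B0)
    s, V = s[::-1], V[:, ::-1]                     # decreasing eigenvalues
    B_os = B0 @ V
    # Step 2: map to the LDA domain, D = A^{-1} (I - A^2)^{-1/2}
    alpha2 = s / n                                 # alpha_k^2, in (0, 1)
    D = np.diag(1.0 / np.sqrt(alpha2 * (1.0 - alpha2)))
    X_lda = Xc @ B_os @ D
    # Step 3: class centroids in the LDA domain
    nk = Y.sum(axis=0)
    Mu = (Y / nk).T @ X_lda
    # Steps 4-5: Euclidean distances, prior adjustment, MAP assignment
    d = ((X_lda[:, None, :] - Mu[None, :, :]) ** 2).sum(axis=2) - 2.0 * np.log(nk / n)
    return d.argmin(axis=1)
```

On toy data with well-separated classes this recovers the class partition; replacing the ridge by the adaptively reweighted penalty of Chapter 5 would yield the sparse variant.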
The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Sections 4.2.1, 4.2.2 and 4.2.3, respectively.
4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression
Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:

$$ \min_{\Theta \in \mathbb{R}^{K\times(K-1)},\, B \in \mathbb{R}^{p\times(K-1)}} \; \|Y\Theta - XB\|_F^2 + \lambda\, \mathrm{tr}(B^\top \Omega B) \tag{4.16a} $$
$$ \text{s.t.}\;\; n^{-1}\Theta^\top Y^\top Y \Theta = I_{K-1} \;, \tag{4.16b} $$

where $\Theta$ holds the class scores, $B$ the regression coefficients, and $\|\cdot\|_F$ is the Frobenius norm.
Though non-convex, the OS problem is readily solved by a decomposition in $\Theta$ and $B$: the optimal $B_{\mathrm{OS}}$ does not intervene in the optimality conditions with respect to $\Theta$, and the optimum with respect to $B$ is obtained in closed form as a linear combination of the optimal scores $\Theta$ (Hastie et al. 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize $\Theta$ to $\Theta^0$ such that $n^{-1}\, \Theta^{0\top} Y^\top Y \Theta^0 = I_{K-1}$.
2. Compute $B = (X^\top X + \lambda\Omega)^{-1} X^\top Y \Theta^0$.
3. Set $\Theta$ to be the $K-1$ leading eigenvectors of $Y^\top X (X^\top X + \lambda\Omega)^{-1} X^\top Y$.
4. Compute the optimal regression coefficients

$$ B_{\mathrm{OS}} = (X^\top X + \lambda\Omega)^{-1} X^\top Y \Theta \;. \tag{4.17} $$
Defining $\Theta^0$ in Step 1, instead of directly using $\Theta$ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on $\Theta^{0\top} Y^\top X (X^\top X + \lambda\Omega)^{-1} X^\top Y \Theta^0$, which is computed as $\Theta^{0\top} Y^\top X B$, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.
This four-step algorithm is valid when the penalty is of the form $\mathrm{tr}(B^\top \Omega B)$. However, when an $L_1$ penalty is applied in (4.16), the optimization algorithm requires iterative updates of $B$ and $\Theta$. That situation is developed by Clemmensen et al. (2011), where
a Lasso or an Elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and Elastic net penalties do not enjoy the equivalence with LDA problems.
4.2.2 Distance Evaluation

The simplest classification rule is the nearest-centroid rule, where sample $x_i$ is assigned to class $k$ if $x_i$ is closer (in terms of the shared within-class Mahalanobis distance) to centroid $\mu_k$ than to any other centroid $\mu_\ell$. In general, the parameters of the model are unknown, and the rule is applied with the parameters estimated from training data (sample estimators $\mu_k$ and $\Sigma_{\mathrm{W}}$). If $\mu_k$ are the centroids in the input space, sample $x_i$ is assigned to the class $k$ for which the distance

$$ d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_{W\Omega}^{-1} (x_i - \mu_k) - 2 \log\!\left(\frac{n_k}{n}\right) \tag{4.18} $$

is minimal. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment for unequal class sizes that estimates the prior probability of class $k$. Note that this rule is inspired by the Gaussian view of LDA, and that other definitions of the adjustment term could be used (Friedman et al. 2009, Mai et al. 2012). The matrix $\Sigma_{W\Omega}$ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:
$$ \Sigma_{W\Omega}^{-1} = \left( n^{-1}(X^\top X + \lambda\Omega) - \Sigma_{\mathrm{B}} \right)^{-1} = \left( n^{-1} X^\top X - \Sigma_{\mathrm{B}} + n^{-1}\lambda\Omega \right)^{-1} = \left( \Sigma_{\mathrm{W}} + n^{-1}\lambda\Omega \right)^{-1} \;. \tag{4.19} $$
Before explaining how to compute the distances, let us summarize some clarifying points:

- The solution $B_{\mathrm{OS}}$ of the p-OS problem is sufficient to perform classification.
- In the LDA domain (the space of discriminant variates $X_{\mathrm{LDA}}$), classification is based on Euclidean distances.
- Classification can be done in a reduced-rank space of dimension $R < K-1$, by using the first $R$ discriminant directions $\{\beta_k\}_{k=1}^R$.
As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, the rule relies on

$$ \left\| (x_i - \mu_k) B_{\mathrm{OS}} \right\|_{\Sigma_{W\Omega}}^2 - 2 \log(\pi_k) \;, $$

where $\pi_k$ is the estimated class prior and $\|\cdot\|_S$ is the Mahalanobis distance assuming within-class covariance $S$. If classification is done in the p-LDA domain, it relies on

$$ \left\| (x_i - \mu_k) B_{\mathrm{OS}} A^{-1} (I_{K-1} - A^2)^{-\frac{1}{2}} \right\|_2^2 - 2 \log(\pi_k) \;, $$

which is a plain Euclidean distance.
4.2.3 Posterior Probability Evaluation

Let $d(x, \mu_k)$ be the distance between $x$ and $\mu_k$, defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities $p(y_k = 1|x)$ can be estimated as

$$ p(y_k = 1|x) \;\propto\; \exp\!\left( -\frac{d(x, \mu_k)}{2} \right) \;\propto\; \pi_k \exp\!\left( -\frac{1}{2} \left\| (x - \mu_k) B_{\mathrm{OS}} A^{-1} (I_{K-1} - A^2)^{-\frac{1}{2}} \right\|_2^2 \right) \;. \tag{4.20} $$
These probabilities must be normalized to ensure that they sum to one. When the distances $d(x, \mu_k)$ take large values, $\exp(-d(x, \mu_k)/2)$ can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is to shift all distances by the smallest one before exponentiation:

$$ p(y_k = 1|x) = \frac{\pi_k \exp\!\left( -\frac{d(x, \mu_k)}{2} \right)}{\sum_\ell \pi_\ell \exp\!\left( -\frac{d(x, \mu_\ell)}{2} \right)} = \frac{\pi_k \exp\!\left( \frac{d_{\min} - d(x, \mu_k)}{2} \right)}{\sum_\ell \pi_\ell \exp\!\left( \frac{d_{\min} - d(x, \mu_\ell)}{2} \right)} \;, $$

where $d_{\min} = \min_k d(x, \mu_k)$, so that the largest term in each sum is of order one.
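A minimal NumPy sketch of this shift (the function name is ours; the GLOSS package itself is MATLAB):

```python
import numpy as np

def posteriors(d, priors):
    """Class posteriors from distances d (n x K), with the shift trick.

    d[i, k] is the distance of sample i to centroid k; subtracting the
    smallest distance per row avoids exp underflow without changing the ratio.
    """
    d_min = d.min(axis=1, keepdims=True)
    num = priors * np.exp(-(d - d_min) / 2.0)    # largest term is priors[k] * 1
    return num / num.sum(axis=1, keepdims=True)
```

With distances around 2000, the naive $\exp(-d/2)$ underflows to zero, while the shifted computation returns valid probabilities.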
4.2.4 Graphical Representation

It is sometimes useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. This is accomplished by plotting the first two or three dimensions of the regression fits $X_{\mathrm{OS}}$ or of the discriminant variates $X_{\mathrm{LDA}}$, depending on whether we represent the data set in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.
4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form $\beta^\top \Omega \beta$, under the assumption that $Y^\top Y$ and $X^\top X + \lambda\Omega$ are full rank (fulfilled when there are no empty classes and $\Omega$ is positive definite). Quadratic penalties have interesting properties but, as recalled in Section 2.3, they do not induce sparsity. In this respect, $L_1$ penalties are preferable, but they lack a connection between p-LDA and p-OS such as the one stated by Hastie et al. (1995).
In this section, we introduce the tools used to obtain sparse models while maintaining the equivalence between the p-LDA and p-OS problems. We use a group-Lasso penalty (see
Section 2.3.4) that induces groups of zeros in the coefficients corresponding to the same feature across all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form $\mathrm{tr}(B^\top \Omega B)$.
4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso were proposed shortly after the original Lasso paper of Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet 1998, Canu and Grandvalet 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al. 2012).
Our formulation of the group-Lasso is shown below:

$$ \min_{\tau \in \mathbb{R}^p}\; \min_{B \in \mathbb{R}^{p\times(K-1)}} \; J(B) + \lambda \sum_{j=1}^p \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j} \tag{4.21a} $$
$$ \text{s.t.}\;\; \sum_j \tau_j - \sum_j w_j \|\beta^j\|_2 \le 0 \;, \tag{4.21b} $$
$$ \qquad\;\; \tau_j \ge 0\,, \quad j = 1, \ldots, p \;, \tag{4.21c} $$

where $B \in \mathbb{R}^{p\times(K-1)}$ is the matrix composed of the row vectors $\beta^j \in \mathbb{R}^{K-1}$, $B = (\beta^{1\top}, \ldots, \beta^{p\top})^\top$, and the $w_j$ are predefined nonnegative weights. In our context, the cost function $J(B)$ is the OS regression loss $\frac{1}{2}\|Y\Theta - XB\|_F^2$; from now on, for simplicity, we simply write $J(B)$. Here and in what follows, $b/\tau$ is defined by continuation at zero: $b/0 = +\infty$ if $b \ne 0$, and $0/0 = 0$. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet 1999, Bach et al. 2012, and references therein).
The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic expression within the convex hull of a family of quadratic penalties indexed by the variables $\tau_j$. This is graphically shown in Figure 4.1.
Let us start by proving the equivalence of our variational formulation and the standard group-Lasso (an alternative variational formulation is detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty in $\beta^j$ in (4.21) acts as the group-Lasso penalty $\lambda \sum_{j=1}^p w_j \|\beta^j\|_2$.
Proof. The Lagrangian of Problem (4.21) is

$$ L = J(B) + \lambda \sum_{j=1}^p \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j} + \nu_0 \left( \sum_{j=1}^p \tau_j - \sum_{j=1}^p w_j \|\beta^j\|_2 \right) - \sum_{j=1}^p \nu_j \tau_j \;. $$
Figure 4.1: Graphical representation of the variational approach to the group-Lasso.
Thus, the first-order optimality conditions for $\tau_j$ are

$$ \frac{\partial L}{\partial \tau_j}(\tau_j^\star) = 0 \;\Leftrightarrow\; -\lambda \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0 $$
$$ \Leftrightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0\, \tau_j^{\star 2} - \nu_j\, \tau_j^{\star 2} = 0 $$
$$ \Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0\, \tau_j^{\star 2} = 0 \;. $$

The last line is obtained from complementary slackness, which implies here $\nu_j \tau_j^\star = 0$ (complementary slackness states that $\nu_j g_j(\tau_j^\star) = 0$, where $\nu_j$ is the Lagrange multiplier for the constraint $g_j(\tau_j) \le 0$). As a result, the optimal value of $\tau_j$ is

$$ \tau_j^\star = \sqrt{\frac{\lambda w_j^2 \|\beta^j\|_2^2}{\nu_0}} = \sqrt{\frac{\lambda}{\nu_0}}\; w_j \|\beta^j\|_2 \;. \tag{4.22} $$
We note that $\nu_0 \ne 0$ if there is at least one coefficient $\beta_{jk} \ne 0$; the inequality constraint (4.21b) is thus at bound (due to complementary slackness):

$$ \sum_{j=1}^p \tau_j^\star - \sum_{j=1}^p w_j \|\beta^j\|_2 = 0 \;, \tag{4.23} $$

so that $\tau_j^\star = w_j \|\beta^j\|_2$. Using this value in (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso formulation

$$ \min_{B \in \mathbb{R}^{p\times(K-1)}} \; J(B) + \lambda \sum_{j=1}^p w_j \|\beta^j\|_2 \;. \tag{4.24} $$
We have thus presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.
With Lemma 4.1, we have proved that, under constraints (4.21b)–(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently presented as $\lambda\, \mathrm{tr}(B^\top \Omega B)$, where

$$ \Omega = \mathrm{diag}\!\left( \frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \ldots, \frac{w_p^2}{\tau_p} \right) \;, \tag{4.25} $$

with $\tau_j = w_j \|\beta^j\|_2$, resulting in the diagonal components

$$ (\Omega)_{jj} = \frac{w_j}{\|\beta^j\|_2} \;. \tag{4.26} $$
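This identity is easy to check numerically: with $\tau_j^\star = w_j\|\beta^j\|_2$, the quadratic penalty $\lambda\,\mathrm{tr}(B^\top \Omega B)$ coincides with the group-Lasso penalty $\lambda \sum_j w_j \|\beta^j\|_2$. The following sketch (random data, names ours) builds $\Omega$ from (4.25)–(4.26) and verifies the equality:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(6, 3))            # p = 6 features, K - 1 = 3 directions
w = rng.uniform(0.5, 2.0, size=6)      # predefined nonnegative weights
lam = 0.7

row_norms = np.linalg.norm(B, axis=1)            # ||beta^j||_2
tau = w * row_norms                              # optimal tau_j from (4.22)-(4.23)
Omega = np.diag(w ** 2 / tau)                    # (Omega)_jj = w_j / ||beta^j||_2
quadratic = lam * np.trace(B.T @ Omega @ B)      # lam * tr(B' Omega B)
group_lasso = lam * np.sum(w * row_norms)        # lam * sum_j w_j ||beta^j||_2
```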
As stated at the beginning of this section, the equivalence between p-LDA problems and p-OS problems is thus demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set method described in Chapter 5.
The first property states that the quadratic formulation is convex when $J$ is convex, thus providing an easy control of optimality and convergence.
Lemma 4.2. If $J$ is convex, Problem (4.21) is convex.

Proof. The function $g(\beta, \tau) = \|\beta\|_2^2 / \tau$, known as the perspective of $f(\beta) = \|\beta\|_2^2$, is jointly convex in $(\beta, \tau)$ (see e.g. Boyd and Vandenberghe 2004, Chapter 3), and the constraints (4.21b)–(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to $(B, \tau)$.
In what follows, $J$ is a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.
Lemma 4.3. For all $B \in \mathbb{R}^{p\times(K-1)}$, the subdifferential of the objective function of Problem (4.24) is

$$ \left\{ V \in \mathbb{R}^{p\times(K-1)} : V = \frac{\partial J(B)}{\partial B} + \lambda G \right\} \;, \tag{4.27} $$

where $G \in \mathbb{R}^{p\times(K-1)}$ is a matrix composed of row vectors $g^j \in \mathbb{R}^{K-1}$, $G = (g^{1\top}, \ldots, g^{p\top})^\top$, defined as follows. Let $S(B)$ denote the row support of $B$, $S(B) = \{ j \in \{1, \ldots, p\} : \|\beta^j\|_2 \ne 0 \}$; then we have

$$ \forall j \in S(B)\,, \quad g^j = w_j \|\beta^j\|_2^{-1} \beta^j \;; \tag{4.28} $$
$$ \forall j \notin S(B)\,, \quad \|g^j\|_2 \le w_j \;. \tag{4.29} $$
This condition results in an equality for the "active" non-zero vectors $\beta^j$ and an inequality for the other ones, which both provide essential building blocks of our algorithm.
Proof. When $\|\beta^j\|_2 \ne 0$, the gradient of the penalty with respect to $\beta^j$ is

$$ \frac{\partial}{\partial \beta^j} \left( \lambda \sum_{m=1}^p w_m \|\beta^m\|_2 \right) = \lambda w_j \frac{\beta^j}{\|\beta^j\|_2} \;. \tag{4.30} $$

At $\|\beta^j\|_2 = 0$, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al. 2011):

$$ \partial_{\beta^j} \left( \lambda \sum_{m=1}^p w_m \|\beta^m\|_2 \right) = \partial_{\beta^j} \left( \lambda w_j \|\beta^j\|_2 \right) = \left\{ \lambda w_j v : v \in \mathbb{R}^{K-1},\; \|v\|_2 \le 1 \right\} \;, \tag{4.31} $$

which gives expression (4.29).
Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if $J$ is strictly convex. All critical points $B^\star$ of the objective function verifying the following conditions are global minima:

$$ \forall j \in S^\star\,, \quad \frac{\partial J(B^\star)}{\partial \beta^j} + \lambda w_j \|\beta^{\star j}\|_2^{-1} \beta^{\star j} = 0 \;; \tag{4.32a} $$
$$ \forall j \notin S^\star\,, \quad \left\| \frac{\partial J(B^\star)}{\partial \beta^j} \right\|_2 \le \lambda w_j \;, \tag{4.32b} $$

where $S^\star \subseteq \{1, \ldots, p\}$ denotes the set of non-zero row vectors $\beta^{\star j}$, and $\bar{S}^\star$ is its complement.
Lemma 4.4 provides a simple characterization of the support of the solution, which would not be as easily obtained from the direct analysis of the variational problem (4.21).
4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.
Proposition 4.1. The group-Lasso OS problem

$$ B_{\mathrm{OS}} = \underset{B \in \mathbb{R}^{p\times(K-1)}}{\arg\min}\; \min_{\Theta \in \mathbb{R}^{K\times(K-1)}} \; \frac{1}{2} \|Y\Theta - XB\|_F^2 + \lambda \sum_{j=1}^p w_j \|\beta^j\|_2 $$
$$ \text{s.t.}\;\; n^{-1}\Theta^\top Y^\top Y \Theta = I_{K-1} $$
is equivalent to the penalized LDA problem

$$ B_{\mathrm{LDA}} = \underset{B \in \mathbb{R}^{p\times(K-1)}}{\arg\max}\; \mathrm{tr}\!\left( B^\top \Sigma_{\mathrm{B}} B \right) \quad \text{s.t.}\;\; B^\top (\Sigma_{\mathrm{W}} + n^{-1}\lambda\Omega) B = I_{K-1} \;, $$

where $\Omega = \mathrm{diag}\!\left( w_1^2/\tau_1, \ldots, w_p^2/\tau_p \right)$, with

$$ \Omega_{jj} = \begin{cases} +\infty & \text{if } \beta_{\mathrm{os}}^j = 0 \;, \\ w_j \|\beta_{\mathrm{os}}^j\|_2^{-1} & \text{otherwise.} \end{cases} \tag{4.33} $$

That is, $B_{\mathrm{LDA}} = B_{\mathrm{OS}}\, \mathrm{diag}\!\left( \alpha_k^{-1} (1 - \alpha_k^2)^{-1/2} \right)$, where $\alpha_k \in (0,1)$ and $\alpha_k^2$ is the $k$th leading eigenvalue of

$$ n^{-1} Y^\top X \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y \;. $$

Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.
The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al. 2008, Clemmensen et al. 2011) for $K = 2$, that is, for binary classification, or more generally for a single discriminant direction. Note, however, that it leads to a slightly different decision rule if the decision threshold is chosen a priori according to the Gaussian assumption on the features. For more than one discriminant direction, the equivalence does not hold any more, since the Lasso penalty does not result in an equivalent quadratic penalty of the simple form $\mathrm{tr}(B^\top \Omega B)$.
5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems, whose sizes are incrementally increased or decreased (Osborne et al. 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer 2008). We adapt this algorithmic framework to the variational form (4.21), with $J(B) = \frac{1}{2}\|Y\Theta - XB\|_F^2$.
The algorithm belongs to the working-set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say $B = 0$, thus defining the set $\mathcal{A}$ of "active" variables, currently identified as non-zero. It then iterates the three steps summarized below:

1. Update the coefficient matrix $B$ within the current active set $\mathcal{A}$, where the optimization problem is smooth. First, the quadratic penalty is updated, and then a standard penalized least squares fit is computed.
2. Check the optimality conditions (4.32) with respect to the active variables. One or more $\beta^j$ may be declared inactive when they vanish from the current solution.
3. Check the optimality conditions (4.32) with respect to the inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not, the variable corresponding to the greatest violation is added to the active set.
This mechanism is graphically represented in Figure 5.1 as a block diagram, and formalized in more detail in Algorithm 1. Note that this formulation uses the equations of the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach of Appendix D, we simply have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.
5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix $B$ within the current active set $\mathcal{A}$. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving $K-1$ independent $\mathrm{card}(\mathcal{A})$-dimensional problems instead of a single $(K-1)\times\mathrm{card}(\mathcal{A})$-dimensional problem. The interaction between the $K-1$ problems is relegated to the common adaptive quadratic penalty $\Omega$. This decomposition is especially attractive, as we then solve $K-1$ similar systems

$$ \left( X_{\mathcal{A}}^\top X_{\mathcal{A}} + \lambda\Omega \right) \beta_k = X_{\mathcal{A}}^\top Y \theta_k^0 \;, \tag{5.1} $$
[Figure 5.1: GLOSS block diagram. The model is initialized ($\lambda$, $B$), with the active set holding all $j$ such that $\|\beta^j\|_2 > 0$. The p-OS problem is solved so that $B$ satisfies the first optimality condition; any active variable that must become inactive is moved out of the active set. The second optimality condition is then tested on the inactive set: any violating variable is moved into the active set and the process resumes; when no move is needed, $\Theta$ is computed, $B$ is updated, and the algorithm stops.]
Algorithm 1: Adaptively Penalized Optimal Scoring

Input: $X$, $Y$, $B$, $\lambda$
Initialize $\mathcal{A} \leftarrow \{ j \in \{1,\ldots,p\} : \|\beta^j\|_2 > 0 \}$; $\Theta^0$ such that $n^{-1}\Theta^{0\top} Y^\top Y \Theta^0 = I_{K-1}$; convergence $\leftarrow$ false
repeat
    // Step 1: solve (4.21) in B, assuming A optimal
    repeat
        $\Omega \leftarrow \mathrm{diag}(\omega_{\mathcal{A}})$, with $\omega_j \leftarrow w_j \|\beta^j\|_2^{-1}$
        $B_{\mathcal{A}} \leftarrow \left( X_{\mathcal{A}}^\top X_{\mathcal{A}} + \lambda\Omega \right)^{-1} X_{\mathcal{A}}^\top Y \Theta^0$
    until condition (4.32a) holds for all $j \in \mathcal{A}$
    // Step 2: identify inactivated variables
    for all $j \in \mathcal{A}$ such that $\|\beta^j\|_2 = 0$ do
        if optimality condition (4.32b) holds then
            $\mathcal{A} \leftarrow \mathcal{A} \setminus \{j\}$; go back to Step 1
        end if
    end for
    // Step 3: check the greatest violation of optimality condition (4.32b) in the complement of A
    $j^\star \leftarrow \arg\max_{j \notin \mathcal{A}} \|\partial J / \partial \beta^j\|_2 - \lambda w_j$
    if $\|\partial J / \partial \beta^{j^\star}\|_2 < \lambda w_{j^\star}$ then
        convergence $\leftarrow$ true   // B is optimal
    else
        $\mathcal{A} \leftarrow \mathcal{A} \cup \{j^\star\}$
    end if
until convergence
$(s, V) \leftarrow \mathrm{eigenanalyze}(\Theta^{0\top} Y^\top X_{\mathcal{A}} B_{\mathcal{A}})$, that is, $\Theta^{0\top} Y^\top X_{\mathcal{A}} B_{\mathcal{A}}\, v_k = s_k v_k$, $k = 1, \ldots, K-1$
$\Theta \leftarrow \Theta^0 V$; $B \leftarrow B V$; $\alpha_k \leftarrow n^{-1/2} s_k^{1/2}$, $k = 1, \ldots, K-1$
Output: $\Theta$, $B$, $\alpha$
where $X_{\mathcal{A}}$ denotes the columns of $X$ indexed by $\mathcal{A}$, and $\beta_k$ and $\theta_k^0$ denote the $k$th columns of $B$ and $\Theta^0$, respectively. These linear systems only differ in their right-hand-side term, so that a single Cholesky decomposition suffices to solve all of them, whereas a blockwise Newton-Raphson method based on the standard group-Lasso formulation would result in a different penalty $\Omega$ for each system.
5.1.1 Cholesky Decomposition

Dropping the subscripts and considering the $K-1$ systems together, (5.1) leads to

$$ (X^\top X + \lambda\Omega)\, B = X^\top Y \Theta \;. \tag{5.2} $$

Defining the Cholesky decomposition $C^\top C = X^\top X + \lambda\Omega$, (5.2) is solved efficiently as follows:

$$ C^\top C B = X^\top Y \Theta $$
$$ C B = C^\top \backslash (X^\top Y \Theta) $$
$$ B = C \backslash \left( C^\top \backslash (X^\top Y \Theta) \right) \;, \tag{5.3} $$

where the symbol "$\backslash$" is the MATLAB mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
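The same computation can be sketched in NumPy (the function name is ours; `np.linalg.solve` is used for the two triangular solves, whereas MATLAB's backslash would exploit their triangular structure):

```python
import numpy as np

def solve_pos_systems(X, Y, Theta, lam, omega_diag):
    """Solve (X'X + lam*Omega) B = X'Y Theta for all K-1 columns at once.

    A single Cholesky factorization A = L L' is shared by the K-1
    right-hand sides, mirroring Equations (5.2)-(5.3).
    """
    A = X.T @ X + lam * np.diag(omega_diag)
    R = X.T @ Y @ Theta
    L = np.linalg.cholesky(A)           # lower-triangular factor, A = L L'
    Z = np.linalg.solve(L, R)           # forward substitution: L Z = R
    return np.linalg.solve(L.T, Z)      # back substitution: L' B = Z
```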
5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer $\Omega$ is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of $\Omega$ reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of $X^\top X + \lambda\Omega$. This difficulty can be avoided using the following equivalent expression:

$$ B = \Omega^{-1/2} \left( \Omega^{-1/2} X^\top X\, \Omega^{-1/2} + \lambda I \right)^{-1} \Omega^{-1/2} X^\top Y \Theta^0 \;, \tag{5.4} $$

where the conditioning of $\Omega^{-1/2} X^\top X\, \Omega^{-1/2} + \lambda I$ is always well-behaved, provided $X$ is appropriately normalized (recall that $0 \le 1/\omega_j \le 1$). This stabler expression demands more computation and is thus reserved for cases with large $\omega_j$ values; our code is otherwise based on expression (5.2).
5.2 Score Matrix

The optimal score matrix $\Theta$ is made of the $K-1$ leading eigenvectors of $Y^\top X (X^\top X + \lambda\Omega)^{-1} X^\top Y$. This eigen-analysis is actually performed in the form $\Theta^\top Y^\top X (X^\top X + \lambda\Omega)^{-1} X^\top Y \Theta$ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of $(X^\top X + \lambda\Omega)^{-1}$, which
involves the inversion of a $p \times p$ matrix. Let $\Theta^0$ be an arbitrary $K \times (K-1)$ matrix whose range includes the $K-1$ leading eigenvectors of $Y^\top X (X^\top X + \lambda\Omega)^{-1} X^\top Y$.¹ Then, solving the $K-1$ systems (5.3) provides the value of $B^0 = (X^\top X + \lambda\Omega)^{-1} X^\top Y \Theta^0$. This $B^0$ matrix can be identified in the expression to eigenanalyze:

$$ \Theta^{0\top} Y^\top X (X^\top X + \lambda\Omega)^{-1} X^\top Y \Theta^0 = \Theta^{0\top} Y^\top X B^0 \;. $$

Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the $(K-1)\times(K-1)$ matrix $\Theta^{0\top} Y^\top X B^0 = V \Lambda V^\top$. Defining $\Theta = \Theta^0 V$, we have $\Theta^\top Y^\top X (X^\top X + \lambda\Omega)^{-1} X^\top Y \Theta = \Lambda$ and, when $\Theta^0$ is chosen such that $n^{-1}\Theta^{0\top} Y^\top Y \Theta^0 = I_{K-1}$, we also have $n^{-1}\Theta^\top Y^\top Y \Theta = I_{K-1}$, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of $\Lambda$ are sorted in decreasing order, $\Theta$ is an optimal solution to the p-OS problem. Finally, once $\Theta$ has been computed, the corresponding optimal regression coefficients $B$ satisfying (5.2) are simply recovered using the mapping from $\Theta^0$ to $\Theta$, that is, $B = B^0 V$.
Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which $\Omega$ is defined by a variational formulation.
5.3 Optimality Conditions

GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix $B$ and the score matrix $\Theta$. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4, from which optimality conditions (4.32a) and (4.32b) are deduced. Both expressions require the gradient of the objective function

$$ \frac{1}{2} \|Y\Theta - XB\|_F^2 + \lambda \sum_{j=1}^p w_j \|\beta^j\|_2 \;. \tag{5.5} $$
Let $J(B)$ be the data-fitting term $\frac{1}{2}\|Y\Theta - XB\|_F^2$. Its gradient with respect to the $j$th row of $B$, $\beta^j$, is the $(K-1)$-dimensional vector

$$ \frac{\partial J(B)}{\partial \beta^j} = x_j^\top (XB - Y\Theta) \;, $$

where $x_j$ is the $j$th column of $X$. Hence the first optimality condition (4.32a) can be computed for every variable $j$ as

$$ x_j^\top (XB - Y\Theta) + \lambda w_j \frac{\beta^j}{\|\beta^j\|_2} = 0 \;. $$
¹ As $X$ is centered, $1_K$ belongs to the null space of $Y^\top X (X^\top X + \lambda\Omega)^{-1} X^\top Y$. It is thus sufficient to choose $\Theta^0$ orthogonal to $1_K$ to ensure that its range spans the leading eigenvectors of $Y^\top X (X^\top X + \lambda\Omega)^{-1} X^\top Y$. In practice, to comply with this desideratum and with conditions (3.5b) and (3.5c), we set $\Theta^0 = (Y^\top Y)^{-1/2} U$, where $U$ is a $K \times (K-1)$ matrix whose columns are orthonormal vectors orthogonal to $1_K$.
The second optimality condition (4.32b) can be computed for every variable $j$ as

$$ \left\| x_j^\top (XB - Y\Theta) \right\|_2 \le \lambda w_j \;. $$
5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let $\mathcal{A}$ be the active set, containing the variables currently considered relevant. A variable $j$ can be considered for inclusion in the active set if it violates the second optimality condition. We proceed one variable at a time, choosing the one that is expected to produce the greatest decrease in the objective function:

$$ j^\star = \arg\max_j \; \left\| x_j^\top (XB - Y\Theta) \right\|_2 - \lambda w_j \;, $$

the variable $j^\star$ being included only if this quantity is positive. The exclusion of a variable belonging to the active set $\mathcal{A}$ is considered if the norm $\|\beta^j\|_2$ is small and if, after setting $\beta^j$ to zero, the following optimality condition holds:

$$ \left\| x_j^\top (XB - Y\Theta) \right\|_2 \le \lambda w_j \;. $$

The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.
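The selection rule can be sketched as follows (a NumPy sketch with our own function names, not the GLOSS code):

```python
import numpy as np

def optimality_violations(X, Y, B, Theta, lam, w):
    """Slack of condition (4.32b) per variable: ||x_j'(XB - Y Theta)||_2 - lam*w_j."""
    G = X.T @ (X @ B - Y @ Theta)         # p x (K-1): rows are dJ/d beta^j
    return np.linalg.norm(G, axis=1) - lam * w

def next_active_variable(X, Y, B, Theta, lam, w, active):
    """Index of the strongest violation among inactive variables, or None if optimal."""
    slack = optimality_violations(X, Y, B, Theta, lam, w)
    slack[list(active)] = -np.inf         # restrict the search to inactive variables
    j = int(np.argmax(slack))
    return j if slack[j] > 0 else None    # None: condition (4.32b) holds everywhere
```

At $B = 0$ the rule reduces to picking the variable most correlated with the scored classes, which is also how the path starts in Section 5.5.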
5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem for this value of $\lambda$. The other strategy is to compute the solution path for several values of $\lambda$: GLOSS then looks for the maximum value of the penalty parameter, $\lambda_{\max}$, such that $B \ne 0$, and solves the p-OS problem for decreasing values of $\lambda$ until a prescribed number of features are declared active.
The maximum value of the penalty parameter, $\lambda_{\max}$, corresponding to a null $B$ matrix, is obtained by evaluating the optimality condition (4.32b) at $B = 0$:

$$ \lambda_{\max} = \max_{j \in \{1, \ldots, p\}} \; \frac{1}{w_j} \left\| x_j^\top Y \Theta^0 \right\|_2 \;. $$

The algorithm then computes a series of solutions along the regularization path defined by a series of penalties $\lambda_1 = \lambda_{\max} > \cdots > \lambda_t > \cdots > \lambda_T = \lambda_{\min} \ge 0$, obtained by regularly decreasing the penalty, $\lambda_{t+1} = \lambda_t / 2$, and using a warm-start strategy where the feasible initial guess for $B(\lambda_{t+1})$ is initialized with $B(\lambda_t)$. The final penalty parameter $\lambda_{\min}$ is determined during the optimization process, when the maximum number of desired active variables is attained (by default, the minimum of $n$ and $p$).
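A sketch of $\lambda_{\max}$ and of the halving schedule (function names ours):

```python
import numpy as np

def lambda_max(X, Y, Theta0, w):
    """Smallest penalty for which B = 0 is optimal, from (4.32b) at B = 0."""
    return float(np.max(np.linalg.norm(X.T @ Y @ Theta0, axis=1) / w))

def lambda_path(lam_max, T):
    """Halving schedule lambda_1 = lam_max > ... > lambda_T, used with warm starts."""
    return [lam_max / 2.0 ** t for t in range(T)]
```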
5.6 Options and Variants

5.6.1 Scaling Variables

As most penalization schemes, GLOSS is sensitive to the scaling of the variables. It thus makes sense to normalize them before applying the algorithm or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.

5.6.2 Sparse Variant

This version replaces some MATLAB commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.
5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis. Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small-sample-size regimes, even if variables are correlated.
The equivalence proof between penalized OS and penalized LDA (Hastie et al. 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS, corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems

$$ \min_{B \in \mathbb{R}^{p\times(K-1)}} \|Y\Theta - XB\|_F^2 = \min_{B \in \mathbb{R}^{p\times(K-1)}} \mathrm{tr}\!\left( \Theta^\top Y^\top Y \Theta - 2\,\Theta^\top Y^\top X B + n B^\top \Sigma_{\mathrm{T}} B \right) $$

are replaced by

$$ \min_{B \in \mathbb{R}^{p\times(K-1)}} \mathrm{tr}\!\left( \Theta^\top Y^\top Y \Theta - 2\,\Theta^\top Y^\top X B + n B^\top \left( \Sigma_{\mathrm{B}} + \mathrm{diag}(\Sigma_{\mathrm{W}}) \right) B \right) \;. $$

Note that this variant only requires $\mathrm{diag}(\Sigma_{\mathrm{W}}) + \Sigma_{\mathrm{B}} + n^{-1}\Omega$ to be positive definite, which is a weaker requirement than $\Sigma_{\mathrm{T}} + n^{-1}\Omega$ positive definite.
5.6.4 Elastic Net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition
Pixel grid:

7 8 9
4 5 6
1 2 3

$$ \Omega_L = \begin{pmatrix}
 3 & -1 &  0 & -1 & -1 &  0 &  0 &  0 &  0 \\
-1 &  5 & -1 & -1 & -1 & -1 &  0 &  0 &  0 \\
 0 & -1 &  3 &  0 & -1 & -1 &  0 &  0 &  0 \\
-1 & -1 &  0 &  5 & -1 &  0 & -1 & -1 &  0 \\
-1 & -1 & -1 & -1 &  8 & -1 & -1 & -1 & -1 \\
 0 & -1 & -1 &  0 & -1 &  5 &  0 & -1 & -1 \\
 0 &  0 &  0 & -1 & -1 &  0 &  3 & -1 &  0 \\
 0 &  0 &  0 & -1 & -1 & -1 & -1 &  5 & -1 \\
 0 &  0 &  0 &  0 & -1 & -1 &  0 & -1 &  3
\end{pmatrix} $$

Figure 5.2: Graph and Laplacian matrix for a 3×3 image.
for their penalized discriminant analysis model, constraining the discriminant directions to be spatially smooth.
When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of the pixels of a 3×3 image, with the corresponding Laplacian matrix. The Laplacian matrix $\Omega_L$ is positive semi-definite, and the penalty $\beta^\top \Omega_L \beta$ favors, among vectors of identical $L_2$ norm, those having similar coefficients within the neighborhoods of the graph. For example, this penalty is 9 for the vector $(1, 1, 0, 1, 1, 0, 0, 0, 0)^\top$, the indicator of pixel 1 and its neighbors, and it is 21 for the vector $(-1, 1, 0, 1, 1, 0, 0, 0, 0)^\top$, where the sign of pixel 1 mismatches its neighborhood (each mismatched edge contributes $(-1-1)^2 = 4$ instead of $0$).
This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty is simply added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.
6 Experimental Results
This chapter presents comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA: Penalized LDA (PLDA) (Witten and Tibshirani 2011), which applies a Lasso penalty within Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al. 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing parsimony capabilities, the latter algorithm was run without any quadratic penalty, that is, with a plain Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites; PLDA is an R implementation, and SLDA is coded in MATLAB. All the experiments used the same training, validation and test sets. Note that they differ significantly from those of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.
6.1 Normalization
With shrunken estimates, the scaling of features has important consequences. For the linear discriminants considered here, the two most common normalization strategies consist in setting to ones either the diagonal of the total covariance matrix Σ_T, or the diagonal of the within-class covariance matrix Σ_W. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing penalties with weights. The latter option is implemented in our MATLAB package.¹
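As a sketch of the first strategy (all names below are ours), the observations can be rescaled so that the diagonal of the total covariance matrix Σ_T becomes ones; the within-class variant would divide by the pooled within-class standard deviations instead, and the same effect can be obtained by supplying the standard deviations as penalty weights.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data with very different feature scales
X = rng.normal(size=(30, 5)) * np.array([1.0, 2.0, 5.0, 0.5, 10.0])

s = X.std(axis=0, ddof=1)              # sqrt of diag(Sigma_T)
X_scaled = (X - X.mean(axis=0)) / s    # now diag(Sigma_T) = 1

print(np.cov(X_scaled, rowvar=False).diagonal())  # all ones
```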
6.2 Decision Thresholds
The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, Chapter 4) suggest investigating decision thresholds other than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that minimize the empirical training error. This option was tested using validation sets or cross-validation.
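A minimal sketch of this empirical threshold selection, for a binary problem on a one-dimensional discriminant score (the function name and the convention that class 1 lies on the high-score side are assumptions of ours):

```python
import numpy as np

def best_threshold(scores, labels):
    """Pick the cut on a 1-D discriminant score that minimizes the empirical
    error, assuming class 1 lies on the high-score side. Candidate cuts are
    midpoints between consecutive sorted scores."""
    s = np.sort(scores)
    cuts = (s[:-1] + s[1:]) / 2
    errors = [np.mean((scores > c).astype(int) != labels) for c in cuts]
    return cuts[int(np.argmin(errors))]

scores = np.array([-2.0, -1.5, -0.2, 0.1, 0.8, 1.7])
labels = np.array([0, 0, 0, 1, 1, 1])
t = best_threshold(scores, labels)
print(-0.2 < t < 0.1)  # True: a perfect separator lies between the classes
```

In the experiments, the same selection would be carried out on a validation set or by cross-validation rather than on the training scores themselves, to limit optimism.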
¹The GLOSS MATLAB code can be found in the software section of www.hds.utc.fr/~grandval
6.3 Simulated Data
We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split into a training set of size n = 100, a validation set of size 100, and a test set of size 1000. We are in the small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is:
Simulation 1: Mean shift with independent features. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), where μ_1j = 0.7 × 1_(1≤j≤25), μ_2j = 0.7 × 1_(26≤j≤50), μ_3j = 0.7 × 1_(51≤j≤75), μ_4j = 0.7 × 1_(76≤j≤100).
Simulation 2: Mean shift with dependent features. There are two classes. If sample i is in class 1, then x_i ∼ N(0, Σ), and if i is in class 2, then x_i ∼ N(μ, Σ), with μ_j = 0.6 × 1_(j≤200). The covariance structure is block-diagonal, with 5 blocks, each of dimension 100 × 100. The blocks have (j, j′) element 0.6^|j−j′|. This covariance structure is intended to mimic gene expression data correlation.
Simulation 3: One-dimensional mean shift with independent features. There are four classes and the features are independent. If sample i is in class k, then X_ij ∼ N((k−1)/3, 1) if j ≤ 100, and X_ij ∼ N(0, 1) otherwise.
Simulation 4: Mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), with mean vectors defined as follows: μ_1j ∼ N(0, 0.3²) for j ≤ 25 and μ_1j = 0 otherwise; μ_2j ∼ N(0, 0.3²) for 26 ≤ j ≤ 50 and μ_2j = 0 otherwise; μ_3j ∼ N(0, 0.3²) for 51 ≤ j ≤ 75 and μ_3j = 0 otherwise; μ_4j ∼ N(0, 0.3²) for 76 ≤ j ≤ 100 and μ_4j = 0 otherwise.
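For illustration, Simulation 1 can be generated along the following lines (a sketch; the function name and defaults are ours):

```python
import numpy as np

def simulation1(n_per_class=25, p=500, rng=None):
    """Sketch of Simulation 1: four classes, x_i ~ N(mu_k, I), where class k
    has mean 0.7 on its own block of 25 variables (class 1 -> variables 1-25,
    class 2 -> 26-50, ...) and 0 elsewhere; variables beyond the first 100
    are pure noise."""
    rng = rng or np.random.default_rng()
    X, y = [], []
    for k in range(4):
        mu = np.zeros(p)
        mu[25 * k: 25 * (k + 1)] = 0.7
        X.append(rng.normal(loc=mu, size=(n_per_class, p)))
        y += [k] * n_per_class
    return np.vstack(X), np.array(y)

X, y = simulation1()
print(X.shape, y.shape)  # (100, 500) (100,)
```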
Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.
The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, distantly followed by SLDA, which is the only
Table 6.1: Experimental results for simulated data: averages, with standard deviations, computed over 25 repetitions, of the test error rate, the number of selected variables, and the number of discriminant directions selected on the validation set.

                Err (%)       Var            Dir
  Sim 1: K = 4, mean shift, ind. features
    PLDA        12.6 (0.1)    411.7 (3.7)    3.0 (0.0)
    SLDA        31.9 (0.1)    228.0 (0.2)    3.0 (0.0)
    GLOSS       19.9 (0.1)    106.4 (1.3)    3.0 (0.0)
    GLOSS-D     11.2 (0.1)    251.1 (4.1)    3.0 (0.0)
  Sim 2: K = 2, mean shift, dependent features
    PLDA         9.0 (0.4)    337.6 (5.7)    1.0 (0.0)
    SLDA        19.3 (0.1)     99.0 (0.0)    1.0 (0.0)
    GLOSS       15.4 (0.1)     39.8 (0.8)    1.0 (0.0)
    GLOSS-D      9.0 (0.0)    203.5 (4.0)    1.0 (0.0)
  Sim 3: K = 4, 1D mean shift, ind. features
    PLDA        13.8 (0.6)    161.5 (3.7)    1.0 (0.0)
    SLDA        57.8 (0.2)    152.6 (2.0)    1.9 (0.0)
    GLOSS       31.2 (0.1)    123.8 (1.8)    1.0 (0.0)
    GLOSS-D     18.5 (0.1)    357.5 (2.8)    1.0 (0.0)
  Sim 4: K = 4, mean shift, ind. features
    PLDA        60.3 (0.1)    336.0 (5.8)    3.0 (0.0)
    SLDA        65.9 (0.1)    208.8 (1.6)    2.7 (0.0)
    GLOSS       60.7 (0.2)     74.3 (2.2)    2.7 (0.0)
    GLOSS-D     58.8 (0.1)    162.7 (4.9)    2.9 (0.0)
Figure 6.1: TPR versus FPR (in %) for all algorithms (GLOSS, GLOSS-D, SLDA, PLDA) and all simulations (Simulations 1 to 4).
Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

             Simulation 1    Simulation 2    Simulation 3    Simulation 4
             TPR    FPR      TPR    FPR      TPR    FPR      TPR    FPR
  PLDA       99.0   78.2     96.9   60.3     98.0   15.9     74.3   65.6
  SLDA       73.9   38.5     33.8   16.3     41.6   27.8     50.7   39.5
  GLOSS      64.1   10.6     30.0    4.6     51.1   18.2     26.0   12.1
  GLOSS-D    93.5   39.4     92.1   28.1     95.6   65.5     42.9   29.9
method that does not succeed in uncovering a low-dimensional representation in Simulation 3. The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR): the TPR is the proportion of relevant variables that are selected, and the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and Table 6.2 (both in percentages).
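These two rates can be computed directly from the index sets of selected and relevant variables; the helper below is a small sketch (names are ours):

```python
def tpr_fpr(selected, relevant, p):
    """TPR: fraction of the relevant variables that are selected.
       FPR: fraction of the irrelevant variables that are selected."""
    selected, relevant = set(selected), set(relevant)
    tpr = len(selected & relevant) / len(relevant)
    fpr = len(selected - relevant) / (p - len(relevant))
    return tpr, fpr

# Toy check: 100 relevant variables out of p = 500; the method selects
# 80 relevant and 20 irrelevant ones.
print(tpr_fpr(list(range(80)) + list(range(100, 120)), range(100), 500))
# -> (0.8, 0.05)
```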
6.4 Gene Expression Data
We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors; it was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011). The Ramaswamy³ dataset contains 198 examples of 16,063 gene expressions for categorizing 14 classes of cancer. Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

²http://www.broadinstitute.org/cancer/software/genepattern/datasets
³http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736

Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits, with standard deviations, of the test error rates and the number of selected variables.

                  Err (%)        Var
  Nakayama: n = 86, p = 22,283, K = 5
    PLDA          20.95 (1.3)    10478.7 (2116.3)
    SLDA          25.71 (1.7)      252.5 (3.1)
    GLOSS         20.48 (1.4)      129.0 (18.6)
  Ramaswamy: n = 198, p = 16,063, K = 14
    PLDA          38.36 (6.0)    14873.5 (720.3)
    SLDA             —               —
    GLOSS         20.61 (6.9)      372.4 (122.1)
  Sun: n = 180, p = 54,613, K = 4
    PLDA          33.78 (5.9)    21634.8 (7443.2)
    SLDA          36.22 (6.5)      384.4 (16.5)
    GLOSS         31.77 (4.5)       93.0 (93.6)
Each dataset was split into a training set and a test set, with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times with random choices of training and test set splits.
Test error rates and the numbers of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.
Finally, Figure 6.2 displays the projection of the observations for the Nakayama and Sun datasets on the first canonical planes estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high collinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.
⁴http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962
Figure 6.2: 2D-representations of the Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS (left) and SLDA (right); axes are the 1st and 2nd discriminants, and the big squares represent class means. Nakayama classes: 1) synovial sarcoma, 2) myxoid liposarcoma, 3) dedifferentiated liposarcoma, 4) myxofibrosarcoma, 5) malignant fibrous histiocytoma. Sun classes: 1) non-tumor, 2) astrocytomas, 3) glioblastomas, 4) oligodendrogliomas.
Figure 6.3: USPS digits "1" and "0".
6.5 Correlated Data
When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily introduce this prior knowledge.
The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results are left for future work.
For this illustration, we have used a subset of the USPS handwritten digit dataset, made of 16 × 16 pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and with S-GLOSS. The mean image of every digit is shown in Figure 6.3.
As in Section 5.6.4, we have encoded the pixel proximity relationships of Figure 5.2 into a penalty matrix Ω_L, but this time on a 256-node graph. Introducing this new 256 × 256 Laplacian penalty matrix Ω_L in the GLOSS algorithm is straightforward.
The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). We perfectly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element for discriminating both digits.
Figure 6.5 displays the discriminant directions β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels, which allows strokes to be detected and will probably provide better prediction results.
Figure 6.4: Discriminant direction between digits "1" and "0": β for GLOSS (left) and for S-GLOSS (right).
Figure 6.5: Sparse discriminant direction between digits "1" and "0": β for GLOSS (left) and for S-GLOSS (right), with λ = 0.3.
Discussion
GLOSS is an efficient algorithm that performs sparse LDA, based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem; to our knowledge, it is the first approach that enjoys this property in the multi-class setting. This relationship also accommodates interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.
Computationally, GLOSS is based on an efficient active set strategy, amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K − 1)-dimensional problem into K − 1 independent p-dimensional problems. The interaction between the K − 1 problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.
The experimental results confirm the relevance of the approach, which behaves well compared to its competitors regarding both its prediction abilities and its interpretability (sparsity). Overall, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performance. Employing the same features in all discriminant directions makes it possible to generate models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of the data through the low-dimensional representations that can be produced.
The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can be added to the group penalty to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.
Part III
Sparse Clustering Analysis
Abstract
Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records, or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.
Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussians, and the different populations are identified by different means and a common covariance matrix.
As in the supervised framework, traditional clustering techniques perform worse when the number of irrelevant features increases. In this part, we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.
Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.
7 Feature Selection in Mixture Models
7.1 Mixture Models
One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). K-means can be generalized through probabilistic models that represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlying points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.
7.1.1 Model
We assume that the observed data X = (x_1⊤, …, x_n⊤)⊤ have been drawn identically from K different subpopulations of R^p. The generative distribution is a finite mixture model; that is, the data are assumed to be generated from a compound distribution whose density can be expressed as

    f(x_i) = Σ_{k=1}^{K} π_k f_k(x_i) ,  ∀i ∈ {1, …, n} ,

where K is the number of components, the f_k are the densities of the components, and the π_k are the mixture proportions (π_k ∈ (0, 1) for all k, and Σ_k π_k = 1). Mixture models transcribe that, given the proportions π_k and the distributions f_k for each class, the data are generated according to the following mechanism:

• y: each individual is allotted to a class according to a multinomial distribution with parameters π_1, …, π_K;

• x: each x_i is assumed to arise from a random vector with probability density function f_k.

In addition, it is usually assumed that the component densities f_k belong to a parametric family of densities φ(·; θ_k). The density of the mixture can then be written as

    f(x_i; θ) = Σ_{k=1}^{K} π_k φ(x_i; θ_k) ,  ∀i ∈ {1, …, n} ,
where θ = (π_1, …, π_K, θ_1, …, θ_K) is the parameter of the model.
7.1.2 Parameter Estimation: The EM Algorithm
For the estimation of the parameters of a mixture model, Pearson (1894) used the method of moments to estimate the five parameters (μ_1, μ_2, σ_1², σ_2², π) of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods, and Bayesian approaches.
The most widely used approach for estimating the parameters is the maximization of the log-likelihood by the EM algorithm, which is typically used to maximize the likelihood of models with latent variables, for which no analytical solution is available (Dempster et al., 1977).
The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the expectation of the likelihood with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the expected likelihood obtained at the E-step.
Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed: in practice, the obtained solution depends on the initialization of the algorithm.
Maximum Likelihood Definitions
The likelihood is commonly expressed in its logarithmic version:

    L(θ; X) = log( ∏_{i=1}^{n} f(x_i; θ) ) = Σ_{i=1}^{n} log( Σ_{k=1}^{K} π_k f_k(x_i; θ_k) ) ,   (7.1)
where n is the number of samples, K is the number of components of the mixture (or number of clusters), and the π_k are the mixture proportions.
To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood, or classification log-likelihood:
    L_C(θ; X, Y) = log( ∏_{i=1}^{n} f(x_i, y_i; θ) )
                 = Σ_{i=1}^{n} log( Σ_{k=1}^{K} y_ik π_k f_k(x_i; θ_k) )
                 = Σ_{i=1}^{n} Σ_{k=1}^{K} y_ik log( π_k f_k(x_i; θ_k) ) .   (7.2)
The y_ik are the binary entries of the indicator matrix Y, with y_ik = 1 if observation i belongs to cluster k and y_ik = 0 otherwise.
Define the soft membership t_ik(θ) as

    t_ik(θ) = p(Y_ik = 1 | x_i; θ)   (7.3)
            = π_k f_k(x_i; θ_k) / f(x_i; θ) .   (7.4)
To lighten notations, t_ik(θ) will be denoted t_ik when the parameter θ is clear from context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:
    L_C(θ; X, Y) = Σ_{i,k} y_ik log( π_k f_k(x_i; θ_k) )
                 = Σ_{i,k} y_ik log( t_ik f(x_i; θ) )
                 = Σ_{i,k} y_ik log t_ik + Σ_{i,k} y_ik log f(x_i; θ)
                 = Σ_{i,k} y_ik log t_ik + Σ_{i=1}^{n} log f(x_i; θ)
                 = Σ_{i,k} y_ik log t_ik + L(θ; X) ,   (7.5)
wheresum
ik yik log tik can be reformulated as
sumik
yik log tik =nsumi=1
Ksumk=1
yik log(p(Yik = 1|xiθ))
=
nsumi=1
log(p(Yik = 1|xiθ))
= log (p(Y |Xθ))
As a result, relationship (7.5) can be rewritten as

    L(θ; X) = L_C(θ; X, Y) − log p(Y | X; θ) .   (7.6)
Likelihood Maximization
The complete log-likelihood cannot be assessed because the variables y_ik are unknown. However, taking expectations in (7.6) conditionally on a current value θ(t) of the parameter yields
    L(θ; X) = Q(θ, θ(t)) + H(θ, θ(t)) ,
    with  Q(θ, θ(t)) = E_{Y∼p(·|X;θ(t))}[ L_C(θ; X, Y) ]  and  H(θ, θ(t)) = E_{Y∼p(·|X;θ(t))}[ −log p(Y | X; θ) ] .

In this expression, Q(θ, θ(t)) is the conditional expectation of the complete log-likelihood and H(θ, θ(t)) is a cross-entropy term. Let us define the increment of the log-likelihood as ΔL = L(θ(t+1); X) − L(θ(t); X). Then θ(t+1) = argmax_θ Q(θ, θ(t)) also increases the log-likelihood:

    ΔL = ( Q(θ(t+1), θ(t)) − Q(θ(t), θ(t)) ) + ( H(θ(t+1), θ(t)) − H(θ(t), θ(t)) ) ,

where the first difference is non-negative by definition of iteration t+1, and the second is non-negative by Jensen's (Gibbs') inequality.
Therefore, it is possible to maximize the likelihood by optimizing Q(θ, θ(t)). The relationship between Q(θ, θ′) and L(θ; X) is developed in deeper detail in Appendix F, which shows how the value of L(θ; X) can be recovered from Q(θ, θ(t)).
For the mixture model problem, Q(θ, θ′) is

    Q(θ, θ′) = E_{Y∼p(·|X;θ′)}[ L_C(θ; X, Y) ]
             = Σ_{i,k} p(Y_ik = 1 | x_i; θ′) log( π_k f_k(x_i; θ_k) )
             = Σ_{i=1}^{n} Σ_{k=1}^{K} t_ik(θ′) log( π_k f_k(x_i; θ_k) ) .   (7.7)
Due to its similarity to the expression of the complete likelihood (7.2), Q(θ, θ′) is also known as the weighted likelihood. In (7.7), the weights t_ik(θ′) are the posterior probabilities of cluster memberships.
Hence, the EM algorithm sketched above consists of:
• Initialization (not iterated): choice of the initial parameter θ(0);

• E-step: evaluation of Q(θ, θ(t)), using t_ik(θ(t)) (7.4) in (7.7);

• M-step: computation of θ(t+1) = argmax_θ Q(θ, θ(t)).
Gaussian Model
In the particular case of a Gaussian mixture model with a common covariance matrix Σ and different mean vectors μ_k, the mixture density is

    f(x_i; θ) = Σ_{k=1}^{K} π_k f_k(x_i; θ_k)
              = Σ_{k=1}^{K} π_k (2π)^{−p/2} |Σ|^{−1/2} exp( −½ (x_i − μ_k)⊤ Σ^{−1} (x_i − μ_k) ) .
At the E-step, the posterior probabilities t_ik are computed as in (7.4) with the current parameter θ(t); the M-step then maximizes Q(θ, θ(t)) (7.7), whose form is as follows:
    Q(θ, θ(t)) = Σ_{i,k} t_ik log π_k − Σ_{i,k} t_ik log( (2π)^{p/2} |Σ|^{1/2} ) − ½ Σ_{i,k} t_ik (x_i − μ_k)⊤ Σ^{−1} (x_i − μ_k)
               = Σ_k t_k log π_k − (np/2) log(2π) − (n/2) log|Σ| − ½ Σ_{i,k} t_ik (x_i − μ_k)⊤ Σ^{−1} (x_i − μ_k)
               ≡ Σ_k t_k log π_k − (n/2) log|Σ| − Σ_{i,k} t_ik ( ½ (x_i − μ_k)⊤ Σ^{−1} (x_i − μ_k) ) ,   (7.8)

(the constant term −(np/2) log(2π) is dropped in the last line)
where

    t_k = Σ_{i=1}^{n} t_ik .   (7.9)
The M-step, which maximizes this expression with respect to θ, applies the following updates, defining θ(t+1):

    π_k(t+1) = t_k / n ,   (7.10)
    μ_k(t+1) = ( Σ_i t_ik x_i ) / t_k ,   (7.11)
    Σ(t+1) = (1/n) Σ_k W_k ,   (7.12)
    with  W_k = Σ_i t_ik (x_i − μ_k)(x_i − μ_k)⊤ .   (7.13)
The derivations are detailed in Appendix G
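The E-step (7.4) and the M-step updates (7.10)-(7.13) are straightforward to implement. The sketch below is a minimal NumPy illustration under our own naming and initialization choices (random data points as initial means, a fixed number of iterations); it is not the Mix-GLOSS implementation discussed later.

```python
import numpy as np

def em_common_cov(X, K, n_iter=50, seed=0):
    """Sketch of EM for a Gaussian mixture with common covariance matrix:
    E-step as in (7.4), M-step as in (7.10)-(7.13)."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, K, replace=False)]           # initial means
    pi = np.full(K, 1.0 / K)
    Sigma = np.cov(X, rowvar=False) + 1e-6 * np.eye(p)
    for _ in range(n_iter):
        # E-step: posterior memberships t_ik, computed in log-domain (7.4)
        inv, logdet = np.linalg.inv(Sigma), np.linalg.slogdet(Sigma)[1]
        log_t = np.empty((n, K))
        for k in range(K):
            d = X - mu[k]
            log_t[:, k] = np.log(pi[k]) - 0.5 * (
                p * np.log(2 * np.pi) + logdet
                + np.einsum('ij,jk,ik->i', d, inv, d))
        t = np.exp(log_t - log_t.max(axis=1, keepdims=True))
        t /= t.sum(axis=1, keepdims=True)
        # M-step: (7.9)-(7.13)
        tk = t.sum(axis=0)                            # (7.9)
        pi = tk / n                                   # (7.10)
        mu = (t.T @ X) / tk[:, None]                  # (7.11)
        Sigma = sum((t[:, k] * (X - mu[k]).T) @ (X - mu[k])
                    for k in range(K)) / n            # (7.12)-(7.13)
    return pi, mu, Sigma, t

# usage on two well-separated toy clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(40, 2)),
               rng.normal(3.0, 0.3, size=(40, 2))])
pi, mu, Sigma, t = em_common_cov(X, K=2)
```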
7.2 Feature Selection in Model-Based Clustering
When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own covariance matrix Σ_k, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.
In the high-dimensional, low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989); Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition Σ_k = λ_k D_k A_k D_k⊤ (Banfield and Raftery, 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.
In this chapter, we review some techniques for inducing sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.
7.2.1 Based on Penalized Likelihood
Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically carried out by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:
    log( p(Y_k = 1 | x) / p(Y_ℓ = 1 | x) ) = x⊤ Σ^{−1} (μ_k − μ_ℓ) − ½ (μ_k + μ_ℓ)⊤ Σ^{−1} (μ_k − μ_ℓ) + log( π_k / π_ℓ ) .
In this model, a simple way of introducing sparsity in the discriminant vectors Σ^{−1}(μ_k − μ_ℓ) is to constrain Σ to be diagonal and to favor sparse means μ_k. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm,
    λ Σ_{k=1}^{K} Σ_{j=1}^{p} |μ_kj| ,
as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:
    λ_1 Σ_{k=1}^{K} Σ_{j=1}^{p} |μ_kj| + λ_2 Σ_{k=1}^{K} Σ_{j=1}^{p} Σ_{m=1}^{p} |(Σ_k^{−1})_{jm}| .
In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.
Guo et al. (2010) propose a variation with a Pairwise Fusion Penalty (PFP):
    λ Σ_{j=1}^{p} Σ_{1≤k<k′≤K} |μ_kj − μ_k′j| .
This PFP regularization does not shrink the means to zero, but towards each other. When the jth components of all cluster means are driven to the same value, that variable can be considered as non-informative.
An L1∞ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:
    λ Σ_{j=1}^{p} ‖(μ_1j, μ_2j, …, μ_Kj)‖_∞ .
One group is defined for each variable j, as the set of the jth components of the K means, (μ_1j, …, μ_Kj). The L1∞ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time; in addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG achieves genuine feature selection, because it forces null values for the same variable in all cluster means:
    λ √K Σ_{j=1}^{p} √( Σ_{k=1}^{K} μ_kj² ) .
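To contrast the zero patterns these penalties favor, the toy sketch below evaluates the entrywise L1, the L1∞ and the group-Lasso penalties (scaling factors such as λ and √K omitted) on a hypothetical K × p matrix of cluster means whose second variable is uninformative:

```python
import numpy as np

# Hypothetical cluster-mean matrix M (K = 3 clusters, p = 3 variables):
# variable j is irrelevant when the whole column M[:, j] is zero.
M = np.array([[0.7, 0.0,  0.3],
              [0.0, 0.0, -0.2],
              [0.1, 0.0,  0.0]])

lasso = np.abs(M).sum()                        # entrywise L1 (Pan & Shen)
l1_inf = np.abs(M).max(axis=0).sum()           # L1-inf (Wang & Zhu)
group = np.sqrt((M ** 2).sum(axis=0)).sum()    # group-Lasso / VMG (Xie et al.)

# Only the last two act on each column as a unit, so they can drive a whole
# column -- hence a whole variable -- to zero at once.
print(lasso, l1_inf, group)
```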
The clustering algorithm of VMG differs from ours, but the group penalty proposed is the same; however, no code is available on the authors' website that would allow testing it.
The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented this for two-cluster problems, using an L1 penalty to encourage sparsity on the discriminant vector; the generalization from quadratic to non-quadratic penalties is quickly outlined in their work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.
7.2.2 Based on Model Variants
The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy in terms of conditional independence: the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as
77
7 Feature Selection in Mixture Models
    f(x_i | φ, π, θ, ν) = Σ_{k=1}^{K} π_k ∏_{j=1}^{p} [ f(x_ij | θ_jk) ]^{φ_j} [ h(x_ij | ν_j) ]^{1−φ_j} ,
where f(· | θ_jk) is the distribution function for relevant features and h(· | ν_j) is the distribution function for the irrelevant ones. The binary vector φ = (φ_1, φ_2, …, φ_p) represents relevance, with φ_j = 1 if the jth feature is informative and φ_j = 0 otherwise. The saliency of variable j is then formalized as ρ_j = P(φ_j = 1), so all the φ_j are treated as missing variables. Thus, the set of parameters is {π_k, θ_jk, ν_j, ρ_j}; their estimation is done by means of the EM algorithm (Law et al., 2004).
An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix U ∈ R^{p×(K−1)}, which is updated inside the EM loop by a new step, called the Fisher step (F-step from now on), which maximizes the multi-class Fisher criterion
    tr( (U⊤ Σ_W U)^{−1} U⊤ Σ_B U ) ,   (7.14)
so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities; the F-step then updates the matrix that projects the data onto the latent space; finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. These parameters can be rewritten as functions of the projection matrix U and of the model parameters in the latent space, so that the matrix U enters the M-step equations.
To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one computes the best sparse orthogonal approximation Ũ of the matrix U that maximizes (7.14). This sparse approximation is defined as the solution of
    min_{Ũ ∈ R^{p×(K−1)}}  ‖X_U − X Ũ‖²_F + λ Σ_{k=1}^{K−1} ‖ũ_k‖_1 ,
where X_U = XU is the input data projected onto the non-sparse space, and ũ_k is the kth column vector of the projection matrix Ũ. The second possibility is inspired by Qiao et al. (2009): it reformulates the Fisher discriminant (7.14), used to compute the projection matrix, as a regression criterion penalized by a mixture of Lasso and Elastic net penalties:
    min_{A,B ∈ R^{p×(K−1)}}  Σ_{k=1}^{K} ‖ R_W^{−⊤} H_{B,k} − A B⊤ H_{B,k} ‖²_2 + ρ Σ_{j=1}^{K−1} β_j⊤ Σ_W β_j + λ Σ_{j=1}^{K−1} ‖β_j‖_1
    s.t.  A⊤A = I_{K−1} ,
where H_B ∈ R^{p×K} is a matrix, defined conditionally on the posterior probabilities t_ik, satisfying H_B H_B⊤ = Σ_B, and H_{B,k} is the kth column of H_B; R_W ∈ R^{p×p} is an upper triangular matrix resulting from the Cholesky decomposition of Σ_W; Σ_W and Σ_B are the p × p within-class and between-class covariance matrices in the observation space; A ∈ R^{p×(K−1)} and B ∈ R^{p×(K−1)} are the solutions of the optimization problem, such that B = [β_1, …, β_{K−1}] is the best sparse approximation of U.
The last possibility expresses the solution of the Fisher discriminant (7.14) as the solution of the following constrained optimization problem:
    min_{U ∈ R^{p×(K−1)}}  Σ_{j=1}^{p} ‖ Σ_{B,j} − U U⊤ Σ_{B,j} ‖²_2   s.t.  U⊤U = I_{K−1} ,
where Σ_{B,j} is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.
To comply with the constraint stating that the columns of U are orthogonal, the first and second options must be followed by a singular value decomposition of U to recover orthogonality. This is not necessary with the third option, since the penalized version of SVD already guarantees orthogonality.
However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed." Immediately after this paragraph, we can read that under certain assumptions their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices, the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized."
7.2.3 Based on Model Selection
Some clustering algorithms recast the feature selection problem as a model selection problem. Following this idea, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:
• X^{(1)}: the set of selected relevant variables;
• X^{(2)}: the set of variables being considered for inclusion in or exclusion from X^{(1)};
• X^{(3)}: the set of non-relevant variables.
With those subsets, they define two different models, where Y is the partition to consider:
• M1:
f(X | Y) = f(X^{(1)}, X^{(2)}, X^{(3)} | Y) = f(X^{(3)} | X^{(2)}, X^{(1)}) \, f(X^{(2)} | X^{(1)}) \, f(X^{(1)} | Y)

• M2:
f(X | Y) = f(X^{(1)}, X^{(2)}, X^{(3)} | Y) = f(X^{(3)} | X^{(2)}, X^{(1)}) \, f(X^{(2)}, X^{(1)} | Y)
Model M1 means that the variables in X^{(2)} are independent of the clustering Y, whereas model M2 states that the variables in X^{(2)} depend on the clustering Y. To simplify the algorithm, the subset X^{(2)} is updated one variable at a time. Therefore, deciding the relevance of the variable in X^{(2)} amounts to a model selection between M1 and M2. The selection is done via the Bayes factor
B_{12} = \frac{f(X | M1)}{f(X | M2)}
where the high-dimensional factor f(X^{(3)} | X^{(2)}, X^{(1)}) cancels from the ratio:
B_{12} = \frac{f(X^{(1)}, X^{(2)}, X^{(3)} | M1)}{f(X^{(1)}, X^{(2)}, X^{(3)} | M2)} = \frac{f(X^{(2)} | X^{(1)}, M1) \, f(X^{(1)} | M1)}{f(X^{(2)}, X^{(1)} | M2)}
This factor is approximated, since the integrated likelihoods f(X^{(1)} | M1) and f(X^{(2)}, X^{(1)} | M2) are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of f(X^{(2)} | X^{(1)}, M1), when there is only one variable in X^{(2)}, can be represented as a linear regression of the variable X^{(2)} on the variables in X^{(1)}. There is also a BIC approximation for this term.
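Once BIC values are available for both models, the decision reduces to a difference of BIC scores. A minimal sketch of this approximation, using the convention BIC_M = 2 log L − k log n (so that 2 log B12 ≈ BIC_{M1} − BIC_{M2}); the function name is illustrative:

```python
import math

def log_bayes_factor_bic(loglik1, k1, loglik2, k2, n):
    """BIC approximation of the log Bayes factor log B12 = log f(X|M1)/f(X|M2).
    BIC_M = 2*loglik - k*log(n), and 2*log(B12) is approximated by BIC1 - BIC2."""
    bic1 = 2.0 * loglik1 - k1 * math.log(n)
    bic2 = 2.0 * loglik2 - k2 * math.log(n)
    return 0.5 * (bic1 - bic2)
```

A positive value favors M1 (the variable under consideration is irrelevant to the clustering), a negative value favors M2.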
Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets (X^{(1)} and X^{(3)}) remain the same, but X^{(2)} is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. This algorithm also uses a backward stepwise strategy instead of the forward stepwise search used by Raftery and Dean (2006). Their algorithm allows the definition of blocks of indivisible variables, which in certain situations improves the clustering and its interpretability.
Both algorithms are well motivated and appear to produce good results; however, the amount of computation needed to test the different subsets of variables requires a huge computation time. In practice, they cannot be used for the amount of data considered in this thesis.
8 Theoretical Foundations
In this chapter we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.
We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to derive reduced-rank decision rules using less than K − 1 discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.
In the subsequent sections, we provide the principles that allow the M-step to be solved as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach to the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.
8.1 Resolving EM with Optimal Scoring
In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.
8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis
LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance
d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_W^{-1} (x_i - \mu_k)

where \mu_k are the p-dimensional centroids and \Sigma_W is the p \times p common within-class covariance matrix.
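The Mahalanobis assignment rule can be sketched in a few lines of numpy; the function name and the vectorization are ours, for illustration only:

```python
import numpy as np

def mahalanobis_assign(X, centroids, sigma_w):
    """Assign each row of X to the nearest centroid under the Mahalanobis
    distance d(x, mu_k) = (x - mu_k)^T Sigma_W^{-1} (x - mu_k)."""
    inv_w = np.linalg.inv(sigma_w)
    # diffs[i, k, :] = x_i - mu_k ; dists[i, k] = Mahalanobis distance
    diffs = X[:, None, :] - centroids[None, :, :]          # shape (n, K, p)
    dists = np.einsum('nkp,pq,nkq->nk', diffs, inv_w, diffs)
    return dists.argmin(axis=1), dists
```

With \Sigma_W = I this reduces to the ordinary Euclidean nearest-centroid rule.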
The likelihood equations in the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by t_{ik} (the posterior probabilities computed at the E-step).
Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood
2\, l_{\mathrm{weight}}(\mu, \Sigma) = -\sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} \, d(x_i, \mu_k) - n \log |\Sigma_W|

which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized maximum likelihood in Gaussian mixtures.
8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis
The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1 in the supervised learning framework. This equivalence is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.
8.1.3 Clustering Using Penalized Optimal Scoring
The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to Fisher's discriminative directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_{ik} in the E-step, the distance between the samples x_i and the centroids \mu_k must be evaluated. Depending on whether we are working in the input domain, the OS domain or the LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:
d(x_i, \mu_k) = \| (x_i - \mu_k) B_{LDA} \|_2^2 - 2 \log(\pi_k)
This distance defines the computation of the posterior probabilities t_{ik} in the E-step (see Section 4.2.3). Putting together all those elements, the complete clustering algorithm can be summarized as follows:
1. Initialize the membership matrix Y (for example by the K-means algorithm).
2. Solve the p-OS problem as
   B_{OS} = (X^\top X + \lambda \Omega)^{-1} X^\top Y \Theta ,
   where \Theta are the K − 1 leading eigenvectors of Y^\top X (X^\top X + \lambda \Omega)^{-1} X^\top Y.
3. Map X to the LDA domain: X_{LDA} = X B_{OS} D, with D = \mathrm{diag}\big(\alpha_k^{-1} (1 - \alpha_k^2)^{-1/2}\big).
4. Compute the centroids M in the LDA domain.
5. Evaluate the distances in the LDA domain.
6. Translate the distances into posterior probabilities t_{ik} with
   t_{ik} \propto \exp\left[ -\frac{d(x_i, \mu_k) - 2 \log(\pi_k)}{2} \right] . \quad (8.1)
7. Update the labels using the posterior probability matrix: Y = T.
8. Go back to step 2 and iterate until the t_{ik} converge.
Steps 2 to 5 can be interpreted as the M-step, and step 6 as the E-step, in this alternative view of the EM algorithm for Gaussian mixtures.
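The eight steps above can be condensed into a small numpy sketch. For brevity, a ridge penalty (\Omega = I) stands in for the group-Lasso of Mix-GLOSS, the K-means initialization is simplified, and all names are illustrative rather than the actual implementation:

```python
import numpy as np

def _kmeans_labels(X, K, rng, n_iter=10):
    # farthest-point initialization followed by a few Lloyd iterations
    C = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in C], axis=0)
        C.append(X[int(d.argmax())])
    C = np.array(C)
    for _ in range(n_iter):
        lab = ((X[:, None, :] - C[None]) ** 2).sum(axis=2).argmin(axis=1)
        C = np.array([X[lab == k].mean(axis=0) if (lab == k).any() else C[k]
                      for k in range(K)])
    return lab

def em_penalized_os(X, K, lam=0.1, n_iter=20, seed=0):
    """EM clustering whose M-step is a penalized optimal scoring regression
    (steps 1-8 above), with a ridge penalty Omega = I as a stand-in."""
    X = X - X.mean(axis=0)                      # centered inputs, as in the text
    n, p = X.shape
    rng = np.random.default_rng(seed)
    Y = np.eye(K)[_kmeans_labels(X, K, rng)]    # step 1: initial memberships
    G = np.linalg.inv(X.T @ X + lam * np.eye(p))
    for _ in range(n_iter):
        # step 2: Theta = K-1 leading eigenvectors of
        # (Y'Y)^{-1} Y'X (X'X + lam*Omega)^{-1} X'Y, eigenvalues alpha_k^2
        M = np.linalg.solve(Y.T @ Y, Y.T @ X @ G @ X.T @ Y)
        evals, evecs = np.linalg.eig(M)
        order = np.argsort(evals.real)[::-1][:K - 1]
        theta = evecs[:, order].real
        alpha2 = np.clip(evals.real[order], 1e-6, 1 - 1e-6)
        B = G @ X.T @ Y @ theta
        # step 3: map to the LDA domain, D = diag(alpha^{-1}(1 - alpha^2)^{-1/2})
        X_lda = X @ B @ np.diag(1.0 / np.sqrt(alpha2 * (1.0 - alpha2)))
        # steps 4-5: centroids and squared distances in the LDA domain
        Nk = Y.sum(axis=0)
        Mu = (Y.T @ X_lda) / Nk[:, None]
        d2 = ((X_lda[:, None, :] - Mu[None, :, :]) ** 2).sum(axis=2)
        # step 6: posteriors t_ik ∝ exp(-(d - 2 log pi_k)/2), in log-space
        log_t = -0.5 * d2 + np.log(Nk / n)
        log_t -= log_t.max(axis=1, keepdims=True)
        T = np.exp(log_t)
        T /= T.sum(axis=1, keepdims=True)
        Y = T                                   # step 7: update labels
    return T.argmax(axis=1), B
```

The actual Mix-GLOSS replaces the ridge solve by a GLOSS run, so that whole rows of B are zeroed and irrelevant features drop out of every discriminant direction.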
8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
In the previous section, we sketched a clustering algorithm that replaces the M-step with a penalized OS regression. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.
8.2 Optimized Criterion
In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ′) (7.7) so as to maximize the likelihood L(θ) (see Section 7.1.2). Replacing the M-step by a penalized optimal scoring problem is possible, and the link between penalized optimal scoring and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.
This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We believe, however, that the Bayesian interpretation provides some insight, and we detail it in what follows.
8.2.1 A Bayesian Derivation
This section sketches a Bayesian treatment of inference limited to our present needs, where penalties are interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).
The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.
The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:

f(\Sigma | \Lambda_0, \nu_0) = \frac{1}{2^{\nu_0 p/2} |\Lambda_0|^{\nu_0/2} \Gamma_p(\nu_0/2)} \, |\Sigma^{-1}|^{(\nu_0 - p - 1)/2} \exp\left( -\frac{1}{2} \mathrm{tr}(\Lambda_0^{-1} \Sigma^{-1}) \right)

where \nu_0 is the number of degrees of freedom of the distribution, \Lambda_0 is a p \times p scale matrix, and \Gamma_p is the multivariate gamma function, defined as

\Gamma_p(\nu/2) = \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma\big(\nu/2 + (1 - j)/2\big) .
The posterior distribution can be maximized, similarly to the likelihood, through the maximization of
Q(\theta, \theta') + \log f(\Sigma | \Lambda_0, \nu_0)
= \sum_{k=1}^{K} t_k \log \pi_k - \frac{(n+1)p}{2} \log 2 - \frac{n}{2} \log |\Lambda_0| - \frac{p(p+1)}{4} \log \pi
  - \sum_{j=1}^{p} \log \Gamma\left( \frac{n}{2} + \frac{1-j}{2} \right) - \frac{\nu_n - p - 1}{2} \log |\Sigma| - \frac{1}{2} \mathrm{tr}\big( \Lambda_n^{-1} \Sigma^{-1} \big)
\equiv \sum_{k=1}^{K} t_k \log \pi_k - \frac{n}{2} \log |\Lambda_0| - \frac{\nu_n - p - 1}{2} \log |\Sigma| - \frac{1}{2} \mathrm{tr}\big( \Lambda_n^{-1} \Sigma^{-1} \big) , \quad (8.2)

with
t_k = \sum_{i=1}^{n} t_{ik} , \quad \nu_n = \nu_0 + n , \quad \Lambda_n^{-1} = \Lambda_0^{-1} + S_0 ,
S_0 = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top .
Details of these calculations can be found in textbooks (for example, Bishop, 2006; Gelman et al., 2003).
8.2.2 Maximum a Posteriori Estimator
The maximization of (8.2) with respect to \mu_k and \pi_k is of course not affected by the additional prior term, where only the covariance Σ intervenes. The MAP estimator for Σ is simply obtained by differentiating (8.2) with respect to Σ. The details of the calculations follow the same lines as the ones for maximum likelihood, detailed in Appendix G. The resulting estimator for Σ is
\Sigma_{MAP} = \frac{1}{\nu_0 + n - p - 1} \big( \Lambda_0^{-1} + S_0 \big) \quad (8.3)

where S_0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a) if \nu_0 is chosen to be p + 1 and \Lambda_0^{-1} = \lambda\Omega, where \Omega is the penalty matrix from the group-Lasso regularization (4.25).
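With \nu_0 = p + 1 the denominator of (8.3) reduces to n, so the MAP covariance is simply (\lambda\Omega + S_0)/n. A one-line sketch of this identification (function and argument names are ours):

```python
import numpy as np

def sigma_map(S0, Omega, lam, n, p):
    """MAP estimate of the common covariance, Eq. (8.3), with the Wishart
    prior matched to the penalty: nu0 = p + 1 and Lambda0^{-1} = lam * Omega.
    The denominator nu0 + n - p - 1 then simplifies to n."""
    nu0 = p + 1
    return (lam * Omega + S0) / (nu0 + n - p - 1)
```

Setting \lambda = 0 recovers the usual maximum likelihood estimate S_0/n.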
9 Mix-GLOSS Algorithm
Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.
9.1 Mix-GLOSS
The implementation of Mix-GLOSS involves three nested loops, as sketched in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_{ik}.
When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.
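The warm-started path over ascending penalties can be sketched generically; here `fit` stands for one full EM run and, like the other names and the sparsity test, is an assumption of this sketch:

```python
import numpy as np

def lambda_path(X, K, lambdas, fit, min_vars=5):
    """Run the clustering over an ascending penalty path, warm-starting each
    fit with the previous solution. `fit(X, K, lam, B0, Y0) -> (B, Y)` is any
    EM routine; passing B0 = Y0 = None requests a cold start."""
    results = {}
    B, Y = None, None
    for lam in sorted(lambdas):
        B, Y = fit(X, K, lam, B, Y)
        results[lam] = (B, Y)
        # stop once too few variables remain relevant (nonzero rows of B)
        if np.count_nonzero(np.abs(B).sum(axis=1)) < min_vars:
            break
    return results
```

The ascending order matters: the solution at a smaller λ is a good starting point for a slightly larger one, whereas the reverse path would warm-start dense problems from very sparse solutions.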
The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.
9.1.1 Outer Loop: Whole Algorithm Repetitions
This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:
• the centered n × p feature matrix X;
• the vector of penalty parameter values to be tried; an option is to provide an empty vector and let the algorithm set trial values automatically;
• the number of clusters K;
• the maximum number of iterations for the EM algorithm;
• the convergence tolerance for the EM algorithm;
• the number of whole repetitions of the clustering algorithm;
Figure 9.1: Mix-GLOSS loops scheme.
• a p × (K − 1) initial coefficient matrix (optional);
• an n × K initial posterior probability matrix (optional).
For each repetition of the algorithm, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.
9.1.2 Penalty Parameter Loop
The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, such that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have verified that this warm-start reduces the computation time by a factor of 8 with respect to using a null B matrix and a K-means execution for the initial Y label matrix.
Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the estimated percentage of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.
Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternative variational approach from Appendix D is used, Equation (4.32b) must be replaced by (D.10b).
Algorithm 2: Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize: B ← 0, Y ← K-means(X, K)
Run non-penalized Mix-GLOSS: λ ← 0, (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
lastLAMBDA ← false
repeat
    Estimate λ:
        Compute the gradient at β_j = 0:
            ∂J(B)/∂β_j |_{β_j=0} = x_j^⊤ ( Σ_{m≠j} x_m β_m − YΘ )
        Compute λmax for every feature using (4.32b):
            λmax_j = (1/w_j) ‖ ∂J(B)/∂β_j |_{β_j=0} ‖_2
        Choose λ so as to remove 10% of the relevant features
    Run penalized Mix-GLOSS: (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    if the number of relevant variables in B > minVAR then
        lastLAMBDA ← false
    else
        lastLAMBDA ← true
    end if
until lastLAMBDA
Output: B, L(θ), t_{ik}, π_k, μ_k, Σ, Y for every λ in the solution path
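At B = 0 the gradient of the OS loss reduces to −x_j^⊤YΘ, so the per-feature λmax of Algorithm 2 is available in closed form. A numpy sketch, where the "remove 10%" heuristic is expressed as a quantile of the λmax values; function names are ours:

```python
import numpy as np

def lambda_max(X, Y, Theta, w):
    """Per-feature lambda_max: at B = 0 the gradient of the OS loss is
    -x_j^T Y Theta, so feature j can enter the model only when
    lambda < ||x_j^T Y Theta||_2 / w_j."""
    G = X.T @ (Y @ Theta)                    # (p, K-1) gradients at B = 0
    return np.linalg.norm(G, axis=1) / w

def lambda_for_quantile(lam_max, keep_fraction=0.9):
    """Penalty value expected to keep roughly `keep_fraction` of the features
    (features with lambda_max below the returned value are removed)."""
    return np.quantile(lam_max, 1.0 - keep_fraction)
```

Choosing λ as the 10% quantile of the λmax values thus removes about 10% of the currently relevant features at each pass of the repeat loop.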
9.1.3 Inner Loop: EM Algorithm
The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities t_{ik} is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.
Algorithm 3: Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
if (B0, Y0) available then
    B_OS ← B0, Y ← Y0
else
    B_OS ← 0, Y ← K-means(X, K)
end if
convergenceEM ← false, tolEM ← 1e-3
repeat
    M-step:
        (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
        X_LDA = X B_OS diag(α^{-1} (1 − α^2)^{-1/2})
        π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
    E-step:
        t_{ik} as per (8.1)
        L(θ) as per (8.2)
    if (1/n) Σ_i |t_{ik} − y_{ik}| < tolEM then
        convergenceEM ← true
    end if
    Y ← T
until convergenceEM
Y ← MAP(T)
Output: B_OS, Θ, L(θ), t_{ik}, π_k, μ_k, Σ, Y
M-Step
The M-step deals with the estimation of the model parameters, that is, the cluster means μ_k, the common covariance matrix Σ and the priors π_k of every component. In a classical M-step, this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version of the label matrix, YΘ. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.
E-Step
The E-step evaluates the posterior probability matrix T using

t_{ik} \propto \exp\left[ -\frac{d(x_i, \mu_k) - 2 \log(\pi_k)}{2} \right] .
The convergence of these t_{ik} is used as the stopping criterion for EM.
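In practice, the posteriors are best computed in log-space and normalized afterwards, to avoid numerical underflow when distances are large. A minimal sketch of this E-step (names ours):

```python
import numpy as np

def posteriors(d2, pi):
    """E-step: t_ik ∝ exp(-(d(x_i, mu_k) - 2 log pi_k)/2), computed in
    log-space and normalized row by row for numerical stability."""
    log_t = -0.5 * (d2 - 2.0 * np.log(pi))       # d2[i, k] = d(x_i, mu_k)
    log_t -= log_t.max(axis=1, keepdims=True)    # guard against underflow
    T = np.exp(log_t)
    return T / T.sum(axis=1, keepdims=True)
```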
9.2 Model Selection
Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.
In a first attempt, we tried a classical structure where clustering was performed several times from different initializations for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm took a lot of computing resources, since the stability selection mechanism required a certain number of repetitions that transformed Mix-GLOSS into a lengthy four-nested-loop structure.
In a second attempt, we replaced the stability-based model selection algorithm with the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978) but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, also required a large computation time.
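A sketch of such a modified BIC follows; the exact parameter count used by Pan and Shen (2007) differs in its details, so the count below (mixture proportions plus the coefficients of the retained features) is an assumption of this sketch:

```python
import numpy as np

def modified_bic(loglik, n, B, K):
    """BIC-like criterion that only charges for the features kept in the
    model, i.e. the nonzero rows of the coefficient matrix B."""
    kept = int(np.count_nonzero(np.abs(B).sum(axis=1)))
    n_params = (K - 1) + kept * (K - 1)   # proportions + retained coefficients
    return -2.0 * loglik + np.log(n) * n_params
```

Because removed variables do not contribute to the penalty term, sparser solutions are charged less, which lets the criterion trade likelihood against the number of retained features.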
The third and definitive attempt (up to now) proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0), among which the execution with the best log-likelihood is chosen. The repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism. This time, only one repetition of the algorithm is done for every value of the penalty parameter. This version has been tested with no significant differences in the quality of the clustering, but with a dramatic reduction of the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.

Figure 9.2: Mix-GLOSS model selection diagram: repeated non-penalized runs of Mix-GLOSS (λ = 0), warm-start of the penalized runs with the best B and T, BIC computation for each λ, and selection of the λ minimizing BIC.
10 Experimental Results
The performance of Mix-GLOSS is measured here with the artificial dataset that was used in Chapter 6.
This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for simulation 2, where they are slightly correlated. In simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes error was estimated to be 1.7%, 6.7%, 7.3% and 30.0%, respectively. The exact description of every setup has already been given in Section 6.3.
In our tests, we reduced the volume of the problem because, with the original size of 1200 samples and 500 dimensions, some of the algorithms to be tested took several days (even weeks) to finish. Hence, the definitive database was chosen to approximately maintain the Bayes error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1 has been adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments and allows a better understanding of the different simulations.
The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.
10.1 Tested Clustering Algorithms
This section compares Mix-GLOSS with the following methods from the state of the art:
• CS general cov: a model-based clustering method with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.
• Fisher EM: this method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "FisherEM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network.
Figure 10.1: Class mean vectors for each artificial simulation.
• SelvarClust/Clustvarsel: implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software implemented in C++ that makes use of the clustering library mixmod (Biernacki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.
After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.
The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network.
• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. In Kuan et al. (2010), some slight changes are introduced in the penalty term, such as weighting parameters, that are particularly important for their dataset. The package LumiWCluster allows clustering using the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).
• Mix-GLOSS: the clustering algorithm implemented using GLOSS (see Chapter 9). It makes use of an EM algorithm and the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach to the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.
10.2 Results
Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The criteria used to measure performance are:
• Clustering error (in %): to measure the quality of the partition with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows the ideal 0% clustering error to be obtained even if the IDs of the clusters and of the real classes differ.
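This permutation-invariant error can be computed by brute force over cluster-label permutations, which is adequate for small K (the Hungarian algorithm scales better). A sketch with illustrative names; this is one standard way to realize such a measure, not necessarily the exact computation of Wu and Schölkopf (2007):

```python
from itertools import permutations

def clustering_error(true, pred):
    """Clustering error minimized over cluster-label permutations, so that a
    perfect partition scores 0 even when cluster IDs differ from class IDs."""
    true_labels = sorted(set(true))
    pred_labels = sorted(set(pred))
    best = len(true)
    for perm in permutations(true_labels, len(pred_labels)):
        mapping = dict(zip(pred_labels, perm))
        errors = sum(mapping[p] != t for p, t in zip(pred, true))
        best = min(best, errors)
    return best / len(true)
```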
• Number of discarded features: this value shows the number of variables whose coefficients have been zeroed; they are therefore not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.
• Time of execution (in hours, minutes or seconds): finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. These algorithms tend to be more memory- and CPU-consuming as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.
The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of relevant variables that are selected; similarly, the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations, but only for three algorithms: CS general cov and Clustvarsel were discarded due to their high computing time and clustering error, respectively, and since the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM (Bouveyron and Brunet, 2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.
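Both rates are straightforward to compute from the index sets of selected and truly relevant variables; a sketch with illustrative names:

```python
def tpr_fpr(selected, relevant, p):
    """True/false positive rates of variable selection: TPR is the fraction
    of relevant variables that were selected, FPR the fraction of irrelevant
    variables that were selected, out of p variables in total."""
    selected, relevant = set(selected), set(relevant)
    irrelevant = set(range(p)) - relevant
    tpr = len(selected & relevant) / len(relevant)
    fpr = len(selected & irrelevant) / len(irrelevant)
    return tpr, fpr
```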
The results, in percentages, are displayed in Figure 10.2 (and in Table 10.2).
Table 10.1: Experimental results for simulated data; mean (standard deviation) over the 25 repetitions.

Simulation 1: K = 4, mean shift, independent features
  Algorithm             Err (%)        Var           Time
  CS general cov        4.6 (1.5)      98.5 (7.2)    884h
  Fisher EM             5.8 (8.7)      78.4 (5.2)    1645m
  Clustvarsel           60.2 (10.7)    37.8 (29.1)   383h
  LumiWCluster-Kuan     4.2 (6.8)      77.9 (4)      389s
  LumiWCluster-Wang     4.3 (6.9)      78.4 (3.9)    619s
  Mix-GLOSS             3.2 (1.6)      80.0 (0.9)    15h

Simulation 2: K = 2, mean shift, dependent features
  Algorithm             Err (%)        Var           Time
  CS general cov        15.4 (2)       99.7 (0.9)    783h
  Fisher EM             7.4 (2.3)      80.9 (2.8)    8m
  Clustvarsel           7.3 (2)        33.4 (20.7)   166h
  LumiWCluster-Kuan     6.4 (1.8)      79.8 (0.4)    155s
  LumiWCluster-Wang     6.3 (1.7)      79.9 (0.3)    14s
  Mix-GLOSS             7.7 (2)        84.1 (3.4)    2h

Simulation 3: K = 4, one-dimensional mean shift, independent features
  Algorithm             Err (%)        Var           Time
  CS general cov        30.4 (5.7)     55.0 (46.8)   1317h
  Fisher EM             23.3 (6.5)     36.6 (5.5)    22m
  Clustvarsel           65.8 (11.5)    23.2 (29.1)   542h
  LumiWCluster-Kuan     32.3 (2.1)     80.0 (0.2)    83s
  LumiWCluster-Wang     30.8 (3.6)     80.0 (0.2)    1292s
  Mix-GLOSS             34.7 (9.2)     81.0 (8.8)    21h

Simulation 4: K = 4, mean shift, independent features
  Algorithm             Err (%)        Var           Time
  CS general cov        62.6 (5.5)     99.9 (0.2)    112h
  Fisher EM             56.7 (10.4)    55.0 (4.8)    195m
  Clustvarsel           73.2 (4)       24 (12)       767h
  LumiWCluster-Kuan     69.2 (11.2)    99 (2)        876s
  LumiWCluster-Wang     69.7 (11.9)    99.1 (2.1)    825s
  Mix-GLOSS             66.9 (9.1)     97.5 (1.2)    11h
Table 10.2: TPR versus FPR (in %), average computed over the 25 repetitions, for the best performing algorithms.

                      Simulation 1     Simulation 2     Simulation 3     Simulation 4
                      TPR     FPR      TPR     FPR      TPR     FPR      TPR     FPR
  Mix-GLOSS           99.2    0.15     82.8    3.35     88.4    6.7      78.0    1.2
  LumiWCluster-Kuan   99.2    2.8      100.0   0.2      100.0   0.05     50      0.05
  Fisher EM           98.6    2.4      88.8    1.7      83.8    58.25    62.0    40.75
Figure 10.2: TPR versus FPR (in %) for the best performing algorithms (Mix-GLOSS, LumiWCluster-Kuan, Fisher EM) on the four simulations.
10.3 Discussion
After reviewing Tables 10.1 and 10.2 and Figure 10.2, we see that there is no definitive winner in all situations with regard to all criteria. Depending on the objectives and constraints of the problem, the following observations deserve to be highlighted.
LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior regarding the other criteria. At the other end on this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort was spent on each implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation time is worth mentioning here.
The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.
From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best, or close to the best, in terms of fall-out and recall.
Conclusions
Summary
The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.
In this thesis we have demonstrated the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.
The equivalence between LDA and OS problems makes it possible to bring all the resources available for solving regression problems to bear on linear discrimination. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.
In Part II we have used a variational approach to the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver algorithm (GLOSS), which has proved its effectiveness, inducing extremely parsimonious models without renouncing any predictive capabilities. GLOSS has been tested on four artificial and three real datasets, outperforming state-of-the-art algorithms in almost all situations.
In Part III this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As in the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion to maximize at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection of the penalty parameter. Also in this case the theory has been put into practice with the implementation of Mix-GLOSS. So far, due to time constraints, only artificial datasets have been tested, with positive results.
Perspectives
Even if the preliminary results are promising, Mix-GLOSS has not been sufficiently tested. We plan to test it at least with the same real datasets that we used with GLOSS. However, more testing would be recommended in both cases. Those algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables, but other high-dimensional low-sample-size (HDLSS) domains are also possible: identification of male or female silhouettes, fungal species, or fish species
based on shape and texture (Clemmensen et al., 2011) or the Stirling faces (Roth and Lange, 2004) are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), the well-known Fisher's Iris dataset, and six other UCI datasets (Bouveyron and Brunet, 2012a) have also been tested in the bibliography.
At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was achieving a functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better-suited and documented version should be made available for GLOSS and Mix-GLOSS in the short term.
The theory developed in this thesis and the programming structure used for its implementation allow easy alterations to the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix; spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as a Laplacian describing the relationships between variables; this can be used to implement pair-wise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-equivalent penalties. Some of those possibilities, such as the diagonal version of GLOSS, have been partially implemented; however, they have not been properly tested or even updated with the last algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time deadlines for the publication of this thesis.
From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized by Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Not knowing this criterion does not prevent successful simulations of Mix-GLOSS: other mechanisms, which do not involve the computation of the real criterion, have been used for stopping the EM algorithm and for model selection. However, further investigation must be done in this direction to assess the convergence properties of this algorithm.
At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was devoted to the domains of outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers models the population with a mixture model in which the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather together all those points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex, but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.
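The envisaged uniform "noise" component would only change the E-step responsibilities; a hypothetical minimal sketch (not part of the thesis code; the function name, the fixed common covariance, and the bounding-region `volume` are illustrative assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities_with_noise(X, pis, mus, Sigma, pi_unif, volume):
    """E-step posteriors for a Gaussian mixture plus one uniform 'noise'
    component of density 1/volume over a bounding region of the data."""
    n, K = X.shape[0], len(pis)
    dens = np.empty((n, K + 1))
    for k in range(K):
        dens[:, k] = pis[k] * multivariate_normal.pdf(X, mus[k], Sigma)
    dens[:, K] = pi_unif / volume            # flat density catches outliers
    return dens / dens.sum(axis=1, keepdims=True)
```

Points far from every Gaussian center receive most of their posterior mass on the last (uniform) column, which is how such a model isolates outliers without a prior outlier count.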
Appendix
A Matrix Properties
Property 1. By definition, $\Sigma_W$ and $\Sigma_B$ are both symmetric matrices:

$$\Sigma_W = \frac{1}{n}\sum_{k=1}^{g}\sum_{i\in C_k}(x_i-\mu_k)(x_i-\mu_k)^\top \qquad \Sigma_B = \frac{1}{n}\sum_{k=1}^{g} n_k(\mu_k-\bar{x})(\mu_k-\bar{x})^\top$$
Property 2. $\dfrac{\partial\, x^\top a}{\partial x} = \dfrac{\partial\, a^\top x}{\partial x} = a$

Property 3. $\dfrac{\partial\, x^\top A x}{\partial x} = (A + A^\top)x$

Property 4. $\dfrac{\partial\, |X^{-1}|}{\partial X} = -|X^{-1}|(X^{-1})^\top$

Property 5. $\dfrac{\partial\, a^\top X b}{\partial X} = ab^\top$

Property 6. $\dfrac{\partial}{\partial X}\mathrm{tr}\left(AX^{-1}B\right) = -(X^{-1}BAX^{-1})^\top = -X^{-\top}A^\top B^\top X^{-\top}$
B The Penalized-OS Problem is an Eigenvector Problem
In this appendix we answer the question of why the solution of a penalized optimal scoring regression involves the computation of an eigenvector decomposition. The p-OS problem has the form

$$\min_{\theta_k,\beta_k}\;\|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top\Omega_k\beta_k \qquad\text{(B.1)}$$
$$\text{s.t.}\quad \theta_k^\top Y^\top Y\theta_k = 1, \qquad \theta_\ell^\top Y^\top Y\theta_k = 0 \quad \forall\,\ell < k,$$

for $k = 1,\ldots,K-1$. The Lagrangian associated with Problem (B.1) is

$$\mathcal{L}_k(\theta_k,\beta_k,\lambda_k,\nu_k) = \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top\Omega_k\beta_k + \lambda_k\left(\theta_k^\top Y^\top Y\theta_k - 1\right) + \sum_{\ell<k}\nu_\ell\,\theta_\ell^\top Y^\top Y\theta_k \qquad\text{(B.2)}$$
Setting the gradient of (B.2) with respect to $\beta_k$ to zero gives the optimal $\beta_k^\star$:

$$\beta_k^\star = (X^\top X + \Omega_k)^{-1}X^\top Y\theta_k \qquad\text{(B.3)}$$
The objective function of (B.1) evaluated at $\beta_k^\star$ is

$$\min_{\theta_k}\;\|Y\theta_k - X\beta_k^\star\|_2^2 + {\beta_k^\star}^\top\Omega_k\beta_k^\star = \min_{\theta_k}\;\theta_k^\top Y^\top\left(I - X(X^\top X + \Omega_k)^{-1}X^\top\right)Y\theta_k$$
$$= \max_{\theta_k}\;\theta_k^\top Y^\top X(X^\top X + \Omega_k)^{-1}X^\top Y\theta_k \qquad\text{(B.4)}$$

If the penalty matrix $\Omega_k$ is identical for all problems, $\Omega_k = \Omega$, then (B.4) corresponds to an eigen-problem where the $k$ score vectors $\theta_k$ are the eigenvectors of $Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$.
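The algebra behind (B.3)–(B.4) can be checked numerically; a small sketch (not thesis code; random data and a simple ridge-type $\Omega = I$ are assumptions) verifies that the penalized residual at $\beta_k^\star$ equals the compact quadratic form in $\theta_k$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, K = 30, 5, 3
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, K))      # stands in for the scaled indicator matrix
theta = rng.standard_normal(K)       # an arbitrary score vector
Omega = np.eye(p)                    # a simple ridge-type penalty matrix

Yt = Y @ theta
beta = np.linalg.solve(X.T @ X + Omega, X.T @ Yt)            # eq. (B.3)
obj = np.sum((Yt - X @ beta) ** 2) + beta @ Omega @ beta     # penalized residual
compact = Yt @ (Yt - X @ beta)      # theta^T Y^T (I - X(X^T X+Omega)^{-1}X^T) Y theta
assert np.isclose(obj, compact)
```

Minimizing `compact` over $\theta$ under the constraint of (B.1) is then the eigenvector problem developed below.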
B.1 How to Solve the Eigenvector Decomposition
Making an eigen-decomposition of an expression like $Y^\top X(X^\top X + \Omega)^{-1}X^\top Y$ is not trivial due to the $p\times p$ inverse. With some datasets, $p$ can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.
Let $M$ be the matrix $Y^\top X(X^\top X + \Omega)^{-1}X^\top Y$, so that expression (B.4) can be rewritten in a compact way:

$$\max_{\Theta\in\mathbb{R}^{K\times(K-1)}}\;\mathrm{tr}\left(\Theta^\top M\Theta\right) \qquad\text{(B.5)}$$
$$\text{s.t.}\quad \Theta^\top Y^\top Y\Theta = I_{K-1}$$
If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the $(K-1)\times(K-1)$ matrix $M_\Theta$ be $\Theta^\top M\Theta$. The classical eigenvector formulation associated with (B.5) is then

$$M_\Theta v = \lambda v \qquad\text{(B.6)}$$

where $v$ is the eigenvector and $\lambda$ the associated eigenvalue of $M_\Theta$. Operating,

$$v^\top M_\Theta v = \lambda \;\Leftrightarrow\; v^\top\Theta^\top M\Theta v = \lambda.$$

Making the change of variables $w = \Theta v$, we obtain an alternative eigenproblem, where $w$ are the eigenvectors of $M$ and $\lambda$ the associated eigenvalue:

$$w^\top M w = \lambda \qquad\text{(B.7)}$$
Therefore $v$ are the eigenvectors of the eigen-decomposition of matrix $M_\Theta$, and $w$ are the eigenvectors of the eigen-decomposition of matrix $M$. Note that the only difference between the $(K-1)\times(K-1)$ matrix $M_\Theta$ and the $K\times K$ matrix $M$ is the $K\times(K-1)$ matrix $\Theta$ in the expression $M_\Theta = \Theta^\top M\Theta$. Then, to avoid the computation of the $p\times p$ inverse $(X^\top X+\Omega)^{-1}$, we can use the optimal value of the coefficient matrix $B^\star = (X^\top X+\Omega)^{-1}X^\top Y\Theta$ in $M_\Theta$:

$$M_\Theta = \Theta^\top Y^\top X(X^\top X + \Omega)^{-1}X^\top Y\Theta = \Theta^\top Y^\top X B^\star$$
Thus, the eigen-decomposition of the $(K-1)\times(K-1)$ matrix $M_\Theta = \Theta^\top Y^\top XB^\star$ yields the $v$ eigenvectors of (B.6). To obtain the $w$ eigenvectors of the alternative formulation (B.7), the change of variables $w = \Theta v$ needs to be undone.

To summarize, we compute the $v$ eigenvectors from the eigen-decomposition of a tractable matrix $M_\Theta$, evaluated as $\Theta^\top Y^\top XB^\star$. Then the definitive eigenvectors $w$ are recovered by $w = \Theta v$. The final step is the reconstruction of the optimal score matrix $\Theta^\star$ using the vectors $w$ as its columns. At this point we understand what in the literature is called "updating the initial score matrix": multiplying the initial $\Theta$ by the matrix of eigenvectors $V$ from decomposition (B.6) reverses the change of variables to restore the $w$ vectors. The $B^\star$ matrix also needs to be "updated", by multiplying it by the same matrix of eigenvectors $V$, in order to account for the initial $\Theta$ matrix used in the first computation of $B^\star$:

$$B = (X^\top X + \Omega)^{-1}X^\top Y\Theta V = B^\star V$$
B.2 Why the OS Problem is Solved as an Eigenvector Problem
In the optimal scoring literature, the score matrix $\Theta^\star$ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix $M = Y^\top X(X^\top X + \Omega)^{-1}X^\top Y$.

By definition of the eigen-decomposition, the eigenvectors of the matrix $M$ (called $w$ in (B.7)) form a basis, so that any score vector $\theta_k$ can be expressed as a linear combination of them:

$$\theta_k = \sum_{m=1}^{K-1}\alpha_m w_m \quad \text{s.t.}\quad \theta_k^\top\theta_k = 1 \qquad\text{(B.8)}$$
The score vectors' normalization constraint $\theta_k^\top\theta_k = 1$ can also be expressed as a function of this basis,

$$\left(\sum_{m=1}^{K-1}\alpha_m w_m\right)^\top\left(\sum_{m=1}^{K-1}\alpha_m w_m\right) = 1,$$

which, by the eigenvector properties, reduces to

$$\sum_{m=1}^{K-1}\alpha_m^2 = 1 \qquad\text{(B.9)}$$
Let $M$ be multiplied by a score vector $\theta_k$, which can be replaced by its linear combination of eigenvectors $w_m$ (B.8):

$$M\theta_k = M\sum_{m=1}^{K-1}\alpha_m w_m = \sum_{m=1}^{K-1}\alpha_m M w_m$$

As the $w_m$ are the eigenvectors of the matrix $M$, the relationship $Mw_m = \lambda_m w_m$ can be used to obtain

$$M\theta_k = \sum_{m=1}^{K-1}\alpha_m\lambda_m w_m$$
Multiplying on the left by $\theta_k^\top$, expressed through its corresponding linear combination of eigenvectors,

$$\theta_k^\top M\theta_k = \left(\sum_{\ell=1}^{K-1}\alpha_\ell w_\ell\right)^\top\left(\sum_{m=1}^{K-1}\alpha_m\lambda_m w_m\right)$$

This equation can be simplified using the orthogonality property of eigenvectors, according to which $w_\ell^\top w_m$ is zero for any $\ell\neq m$, giving

$$\theta_k^\top M\theta_k = \sum_{m=1}^{K-1}\alpha_m^2\lambda_m$$
The optimization problem (B.5) for discriminant direction $k$ can be rewritten as

$$\max_{\theta_k\in\mathbb{R}^{K\times 1}}\;\theta_k^\top M\theta_k = \max_{\theta_k\in\mathbb{R}^{K\times 1}}\;\sum_{m=1}^{K-1}\alpha_m^2\lambda_m \qquad\text{(B.10)}$$
$$\text{with}\quad \theta_k = \sum_{m=1}^{K-1}\alpha_m w_m \quad\text{and}\quad \sum_{m=1}^{K-1}\alpha_m^2 = 1$$
One way of maximizing Problem (B.10) is choosing $\alpha_m = 1$ for $m = k$ and $\alpha_m = 0$ otherwise. Hence, as $\theta_k = \sum_{m=1}^{K-1}\alpha_m w_m$, the resulting score vector $\theta_k$ is equal to the $k$th eigenvector $w_k$.

As a summary, it can be concluded that the solution of the original problem (B.1) can be obtained by an eigenvector decomposition of the matrix $M = Y^\top X(X^\top X + \Omega)^{-1}X^\top Y$.
C Solving Fisher's Discriminant Problem
The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance under a unitary constraint on the within-class variance:

$$\max_{\beta\in\mathbb{R}^p}\;\beta^\top\Sigma_B\beta \qquad\text{(C.1a)}$$
$$\text{s.t.}\quad \beta^\top\Sigma_W\beta = 1 \qquad\text{(C.1b)}$$

where $\Sigma_B$ and $\Sigma_W$ are, respectively, the between-class variance and the within-class variance of the original $p$-dimensional data.
The Lagrangian of Problem (C.1) is

$$\mathcal{L}(\beta,\nu) = \beta^\top\Sigma_B\beta - \nu\left(\beta^\top\Sigma_W\beta - 1\right),$$

so that its first derivative with respect to $\beta$ is

$$\frac{\partial\mathcal{L}(\beta,\nu)}{\partial\beta} = 2\Sigma_B\beta - 2\nu\Sigma_W\beta.$$

A necessary optimality condition for $\beta^\star$ is that this derivative is zero, that is,

$$\Sigma_B\beta^\star = \nu\Sigma_W\beta^\star.$$

Provided $\Sigma_W$ is full rank, we have

$$\Sigma_W^{-1}\Sigma_B\beta^\star = \nu\beta^\star \qquad\text{(C.2)}$$
Thus the solutions $\beta^\star$ match the definition of an eigenvector of the matrix $\Sigma_W^{-1}\Sigma_B$, with eigenvalue $\nu$. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:

$$\beta^{\star\top}\Sigma_B\beta^\star = \beta^{\star\top}\Sigma_W\Sigma_W^{-1}\Sigma_B\beta^\star$$
$$= \nu\,\beta^{\star\top}\Sigma_W\beta^\star \quad\text{from (C.2)}$$
$$= \nu \quad\text{from (C.1b)}$$

That is, the optimal value of the objective function to be maximized is the eigenvalue $\nu$. Hence $\nu$ is the largest eigenvalue of $\Sigma_W^{-1}\Sigma_B$, and $\beta^\star$ is any eigenvector corresponding to this maximal eigenvalue.
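In practice, (C.2) is solved as a symmetric generalized eigenproblem rather than by forming $\Sigma_W^{-1}\Sigma_B$ explicitly. A small sketch (not thesis code; the synthetic covariance matrices are assumptions) checks that the optimal objective value equals the top eigenvalue:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(2)
p = 4
# Synthetic symmetric positive-(semi)definite within/between covariances.
W = rng.standard_normal((p, p)); Sigma_W = W @ W.T + p * np.eye(p)
B = rng.standard_normal((p, 2)); Sigma_B = B @ B.T   # low rank, as is typical

# Generalized eigenproblem  Sigma_B beta = nu Sigma_W beta  -- eq. (C.2).
nus, betas = eigh(Sigma_B, Sigma_W)      # eigenvalues sorted ascending
beta, nu = betas[:, -1], nus[-1]         # keep the leading pair

# Rescale to satisfy the within-class variance constraint (C.1b) exactly.
beta = beta / np.sqrt(beta @ Sigma_W @ beta)
assert np.isclose(beta @ Sigma_B @ beta, nu)   # objective value equals nu
```

`scipy.linalg.eigh(A, B)` handles the $\Sigma_W$-weighted orthogonality directly, which is numerically safer than inverting $\Sigma_W$.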
D Alternative Variational Formulation for the Group-Lasso
In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:

$$\min_{\tau\in\mathbb{R}^p}\;\min_{B\in\mathbb{R}^{p\times(K-1)}}\; J(B) + \lambda\sum_{j=1}^{p}\frac{w_j^2\,\|\beta^j\|_2^2}{\tau_j} \qquad\text{(D.1a)}$$
$$\text{s.t.}\quad \sum_{j=1}^{p}\tau_j = 1 \qquad\text{(D.1b)}$$
$$\qquad\quad \tau_j \ge 0,\; j = 1,\ldots,p \qquad\text{(D.1c)}$$
Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let $B\in\mathbb{R}^{p\times(K-1)}$ be a matrix composed of row vectors $\beta^j\in\mathbb{R}^{K-1}$, $B = \left(\beta^{1\top},\ldots,\beta^{p\top}\right)^\top$.

$$\mathcal{L}(B,\tau,\lambda,\nu_0,\nu_j) = J(B) + \lambda\sum_{j=1}^{p}\frac{w_j^2\,\|\beta^j\|_2^2}{\tau_j} + \nu_0\left(\sum_{j=1}^{p}\tau_j - 1\right) - \sum_{j=1}^{p}\nu_j\tau_j \qquad\text{(D.2)}$$
The starting point is the Lagrangian (D.2), which is differentiated with respect to $\tau_j$ to get the optimal value $\tau_j^\star$:

$$\frac{\partial\mathcal{L}(B,\tau,\lambda,\nu_0,\nu_j)}{\partial\tau_j}\bigg|_{\tau_j=\tau_j^\star} = 0 \;\Rightarrow\; -\lambda w_j^2\,\frac{\|\beta^j\|_2^2}{{\tau_j^\star}^2} + \nu_0 - \nu_j = 0$$
$$\Rightarrow\; -\lambda w_j^2\,\|\beta^j\|_2^2 + \nu_0\,{\tau_j^\star}^2 - \nu_j\,{\tau_j^\star}^2 = 0$$
$$\Rightarrow\; -\lambda w_j^2\,\|\beta^j\|_2^2 + \nu_0\,{\tau_j^\star}^2 = 0$$
The last two expressions are related through a property of the Lagrange multipliers (complementary slackness), which states that $\nu_j g_j(\tau_j^\star) = 0$, where $\nu_j$ is the Lagrange multiplier and $g_j(\tau_j)$ is the inequality constraint. Then the optimal $\tau_j^\star$ can be deduced:

$$\tau_j^\star = \sqrt{\frac{\lambda}{\nu_0}}\,w_j\,\|\beta^j\|_2$$
Plugging this optimal value of $\tau_j^\star$ into constraint (D.1b),

$$\sum_{j=1}^{p}\tau_j^\star = 1 \;\Rightarrow\; \tau_j^\star = \frac{w_j\,\|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'}\,\|\beta^{j'}\|_2} \qquad\text{(D.3)}$$
With this value of $\tau_j^\star$, Problem (D.1) is equivalent to

$$\min_{B\in\mathbb{R}^{p\times(K-1)}}\; J(B) + \lambda\left(\sum_{j=1}^{p} w_j\,\|\beta^j\|_2\right)^2 \qquad\text{(D.4)}$$

This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4). In particular, its solution is expected to be sparse, with some null vectors $\beta^j$.
The penalty term of (D.1a) can be conveniently presented as $\lambda B^\top\Omega B$, where

$$\Omega = \mathrm{diag}\left(\frac{w_1^2}{\tau_1},\frac{w_2^2}{\tau_2},\ldots,\frac{w_p^2}{\tau_p}\right) \qquad\text{(D.5)}$$

Using the value of $\tau_j^\star$ from (D.3), each diagonal component of $\Omega^\star$ is

$$(\Omega^\star)_{jj} = \frac{w_j\sum_{j'=1}^{p} w_{j'}\,\|\beta^{j'}\|_2}{\|\beta^j\|_2} \qquad\text{(D.6)}$$
In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.
D.1 Useful Properties
Lemma D.1. If $J$ is convex, Problem (D.1) is convex.

In what follows, $J$ will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.
Lemma D.2. For all $B\in\mathbb{R}^{p\times(K-1)}$, the subdifferential of the objective function of Problem (D.4) is

$$\left\{V\in\mathbb{R}^{p\times(K-1)} \;:\; V = \frac{\partial J(B)}{\partial B} + 2\lambda\left(\sum_{j=1}^{p} w_j\,\|\beta^j\|_2\right)G\right\} \qquad\text{(D.7)}$$

where $G = \left(g^{1\top},\ldots,g^{p\top}\right)^\top$ is a $p\times(K-1)$ matrix defined as follows. Let $\mathcal{S}(B)$ denote the row-wise support of $B$, $\mathcal{S}(B) = \left\{j\in\{1,\ldots,p\} : \|\beta^j\|_2\neq 0\right\}$; then we have

$$\forall j\in\mathcal{S}(B),\quad g^j = w_j\,\|\beta^j\|_2^{-1}\beta^j \qquad\text{(D.8)}$$
$$\forall j\notin\mathcal{S}(B),\quad \|g^j\|_2 \le w_j \qquad\text{(D.9)}$$
This condition results in an equality for the "active" non-zero vectors $\beta^j$ and an inequality for the other ones, which both provide essential building blocks of our algorithm.
Lemma D.3. Problem (D.4) admits at least one solution, which is unique if $J(B)$ is strictly convex. All critical points $B^\star$ of the objective function verifying the following conditions are global minima. Let $\mathcal{S}(B^\star)$ denote the row-wise support of $B^\star$, $\mathcal{S}(B^\star) = \left\{j\in\{1,\ldots,p\} : \|\beta^{\star j}\|_2\neq 0\right\}$, and let $\bar{\mathcal{S}}(B^\star)$ be its complement; then we have

$$\forall j\in\mathcal{S}(B^\star),\quad -\frac{\partial J(B^\star)}{\partial\beta^j} = 2\lambda\left(\sum_{j'=1}^{p} w_{j'}\,\|\beta^{\star j'}\|_2\right) w_j\,\|\beta^{\star j}\|_2^{-1}\beta^{\star j} \qquad\text{(D.10a)}$$
$$\forall j\in\bar{\mathcal{S}}(B^\star),\quad \left\|\frac{\partial J(B^\star)}{\partial\beta^j}\right\|_2 \le 2\lambda\, w_j\sum_{j'=1}^{p} w_{j'}\,\|\beta^{\star j'}\|_2 \qquad\text{(D.10b)}$$
In particular, Lemma D.3 provides a well-defined appraisal of the support of the solution, which is not easily handled from the direct analysis of the variational problem (D.1).
D.2 An Upper Bound on the Objective Function
Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and for a given $B$ the gap between these objectives is null at $\tau^\star$ such that

$$\tau_j^\star = \frac{w_j\,\|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'}\,\|\beta^{j'}\|_2}$$
Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let $\tau\in\mathbb{R}^p$ be any feasible vector; we have

$$\left(\sum_{j=1}^{p} w_j\,\|\beta^j\|_2\right)^2 = \left(\sum_{j=1}^{p}\tau_j^{1/2}\,\frac{w_j\,\|\beta^j\|_2}{\tau_j^{1/2}}\right)^2$$
$$\le \left(\sum_{j=1}^{p}\tau_j\right)\left(\sum_{j=1}^{p}\frac{w_j^2\,\|\beta^j\|_2^2}{\tau_j}\right)$$
$$\le \sum_{j=1}^{p}\frac{w_j^2\,\|\beta^j\|_2^2}{\tau_j},$$

where we used the Cauchy–Schwarz inequality in the second line and the definition of the feasibility set of $\tau$ in the last one.
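The bound and the zero-gap point are easy to confirm numerically; a small sketch (not thesis code; the weights and row norms are random stand-ins) checks Lemma D.4 on the probability simplex:

```python
import numpy as np

rng = np.random.default_rng(3)
p = 6
w = rng.uniform(0.5, 2.0, p)          # positive group weights w_j
b = rng.uniform(0.1, 1.0, p)          # stands in for the row norms ||beta^j||_2

lasso_sq = (w @ b) ** 2               # squared group-Lasso penalty, eq. (D.4)

def variational(tau):
    return np.sum(w**2 * b**2 / tau)  # variational penalty term, eq. (D.1a)

tau_star = w * b / np.sum(w * b)      # optimal tau, eq. (D.3)
assert np.isclose(variational(tau_star), lasso_sq)   # the gap is null at tau*

for _ in range(100):                  # the upper bound holds on the simplex
    tau = rng.dirichlet(np.ones(p))
    assert variational(tau) >= lasso_sq - 1e-9
```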
This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of $\tau$ and $\beta$ are intertwined.
E Invariance of the Group-Lasso to Unitary Transformations
The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients $B^0$ are optimal for the score values $\Theta^0$, and if the optimal scores $\Theta^\star$ are obtained by a unitary transformation of $\Theta^0$, say $\Theta^\star = \Theta^0 V$ (where $V\in\mathbb{R}^{M\times M}$ is a unitary matrix), then $B^\star = B^0V$ is optimal conditionally on $\Theta^\star$; that is, $(\Theta^\star,B^\star)$ is a global solution of the optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.
Proposition E.1. Let $B^\star$ be a solution of

$$\min_{B\in\mathbb{R}^{p\times M}}\;\|Y - XB\|_F^2 + \lambda\sum_{j=1}^{p} w_j\,\|\beta^j\|_2 \qquad\text{(E.1)}$$

and let $\tilde{Y} = YV$, where $V\in\mathbb{R}^{M\times M}$ is a unitary matrix. Then $\tilde{B} = B^\star V$ is a solution of

$$\min_{B\in\mathbb{R}^{p\times M}}\;\|\tilde{Y} - XB\|_F^2 + \lambda\sum_{j=1}^{p} w_j\,\|\beta^j\|_2 \qquad\text{(E.2)}$$
Proof. The first-order necessary optimality conditions for $B^\star$ are

$$\forall j\in\mathcal{S}(B^\star),\quad 2x_j^\top\left(XB^\star - Y\right) + \lambda w_j\,\|\beta^{\star j}\|_2^{-1}\beta^{\star j} = 0 \qquad\text{(E.3a)}$$
$$\forall j\notin\mathcal{S}(B^\star),\quad 2\left\|x_j^\top\left(XB^\star - Y\right)\right\|_2 \le \lambda w_j \qquad\text{(E.3b)}$$

where $x_j$ denotes the $j$th column of $X$, $\mathcal{S}(B^\star)\subseteq\{1,\ldots,p\}$ denotes the set of non-zero row vectors of $B^\star$, and $\bar{\mathcal{S}}(B^\star)$ is its complement.
First, we note that, from the definition of $\tilde{B}$, we have $\mathcal{S}(\tilde{B}) = \mathcal{S}(B^\star)$. Then we may rewrite the above conditions as follows:

$$\forall j\in\mathcal{S}(\tilde{B}),\quad 2x_j^\top\left(X\tilde{B} - \tilde{Y}\right) + \lambda w_j\,\|\tilde{\beta}^j\|_2^{-1}\tilde{\beta}^j = 0 \qquad\text{(E.4a)}$$
$$\forall j\notin\mathcal{S}(\tilde{B}),\quad 2\left\|x_j^\top\left(X\tilde{B} - \tilde{Y}\right)\right\|_2 \le \lambda w_j \qquad\text{(E.4b)}$$

where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by $V$, and also uses $VV^\top = I$, so that $\forall u\in\mathbb{R}^M$, $\|u^\top\|_2 = \|u^\top V\|_2$. Equation (E.4b) is also
obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for $\tilde{B}$ to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
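The mechanism behind the proof, that both terms of the group-Lasso objective are invariant under right-multiplication by a unitary $V$, can be checked directly; a small sketch (not thesis code; random data and an arbitrary candidate $B$ are assumptions):

```python
import numpy as np
from scipy.stats import ortho_group

rng = np.random.default_rng(4)
n, p, M = 20, 6, 3
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, M))
Bc = rng.standard_normal((p, M))        # any candidate coefficient matrix
w = rng.uniform(0.5, 2.0, p)
V = ortho_group.rvs(M, random_state=0)  # a random orthogonal (unitary) matrix

def objective(Y, B, lam=0.1):
    fit = np.sum((Y - X @ B) ** 2)                      # squared Frobenius norm
    pen = lam * np.sum(w * np.linalg.norm(B, axis=1))   # group-Lasso row penalty
    return fit + pen

# Both the residual ||YV - X(BV)||_F = ||(Y - XB)V||_F and the row norms of B
# are unchanged by V, so the objectives at (Y, B) and (YV, BV) coincide.
assert np.isclose(objective(Y, Bc), objective(Y @ V, Bc @ V))
```

Since the objective landscape is simply rotated, minimizers map onto minimizers, which is the content of Proposition E.1.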
F Expected Complete Likelihood and Likelihood
Section 7.1.2 explains that by maximizing the conditional expectation of the complete log-likelihood $Q(\theta,\theta')$ (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed using its definition (7.1), but there is a shorter way to compute it from $Q(\theta,\theta')$ when the latter is available.
$$\mathcal{L}(\theta) = \sum_{i=1}^{n}\log\left(\sum_{k=1}^{K}\pi_k f_k(x_i;\theta_k)\right) \qquad\text{(F.1)}$$
$$Q(\theta,\theta') = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(\theta')\log\left(\pi_k f_k(x_i;\theta_k)\right) \qquad\text{(F.2)}$$
$$\text{with}\quad t_{ik}(\theta') = \frac{\pi_k' f_k(x_i;\theta_k')}{\sum_\ell \pi_\ell' f_\ell(x_i;\theta_\ell')} \qquad\text{(F.3)}$$
In the EM algorithm, $\theta'$ denotes the model parameters at the previous iteration, $t_{ik}(\theta')$ are the posterior probability values computed from $\theta'$ at the previous E-step, and $\theta$ (without "prime") denotes the parameters of the current iteration, to be obtained by the maximization of $Q(\theta,\theta')$.
Using (F.3), we have

$$Q(\theta,\theta') = \sum_{i,k} t_{ik}(\theta')\log\left(\pi_k f_k(x_i;\theta_k)\right)$$
$$= \sum_{i,k} t_{ik}(\theta')\log\left(t_{ik}(\theta)\right) + \sum_{i,k} t_{ik}(\theta')\log\left(\sum_\ell\pi_\ell f_\ell(x_i;\theta_\ell)\right)$$
$$= \sum_{i,k} t_{ik}(\theta')\log\left(t_{ik}(\theta)\right) + \mathcal{L}(\theta)$$
In particular, after the evaluation of $t_{ik}$ in the E-step, where $\theta = \theta'$, the log-likelihood can be computed using the value of $Q(\theta,\theta)$ (7.7) and the entropy of the posterior probabilities:

$$\mathcal{L}(\theta) = Q(\theta,\theta) - \sum_{i,k} t_{ik}(\theta)\log\left(t_{ik}(\theta)\right) = Q(\theta,\theta) + \mathcal{H}(T)$$
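This shortcut is cheap to verify; a small sketch (not thesis code; a synthetic Gaussian mixture with common covariance is assumed) computes the log-likelihood both ways:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(5)
n, K, p = 50, 3, 2
X = rng.standard_normal((n, p))
pis = np.array([0.2, 0.3, 0.5])
mus = rng.standard_normal((K, p))
Sigma = np.eye(p)

# Weighted component densities pi_k f_k(x_i) and posteriors t_ik (E-step).
dens = np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigma)
                        for k in range(K)])
T = dens / dens.sum(axis=1, keepdims=True)

loglik = np.sum(np.log(dens.sum(axis=1)))    # direct definition, eq. (F.1)
Q = np.sum(T * np.log(dens))                 # Q(theta, theta), eq. (F.2)
H = -np.sum(T * np.log(T))                   # entropy of the posteriors
assert np.isclose(loglik, Q + H)             # L(theta) = Q(theta,theta) + H(T)
```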
G Derivation of the M-Step Equations
This appendix shows the whole process of obtaining expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with common covariance matrix. The criterion is defined as

$$Q(\theta,\theta') = \sum_{i,k} t_{ik}(\theta')\log\left(\pi_k f_k(x_i;\theta_k)\right)$$
$$= \sum_{k}\left(\sum_i t_{ik}\right)\log\pi_k - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k),$$

which has to be maximized subject to $\sum_k\pi_k = 1$.
The Lagrangian of this problem is

$$\mathcal{L}(\theta) = Q(\theta,\theta') + \lambda\left(\sum_k\pi_k - 1\right)$$
Partial derivatives of the Lagrangian are set to zero to obtain the optimal values of $\pi_k$, $\mu_k$ and $\Sigma$.
G.1 Prior probabilities

$$\frac{\partial\mathcal{L}(\theta)}{\partial\pi_k} = 0 \;\Leftrightarrow\; \frac{1}{\pi_k}\sum_i t_{ik} + \lambda = 0,$$

where $\lambda$ is identified from the constraint, leading to

$$\pi_k = \frac{1}{n}\sum_i t_{ik}$$
G.2 Means

$$\frac{\partial\mathcal{L}(\theta)}{\partial\mu_k} = 0 \;\Leftrightarrow\; -\frac{1}{2}\sum_i t_{ik}\,2\Sigma^{-1}(\mu_k - x_i) = 0$$
$$\Rightarrow\; \mu_k = \frac{\sum_i t_{ik}\,x_i}{\sum_i t_{ik}}$$
G.3 Covariance Matrix

$$\frac{\partial\mathcal{L}(\theta)}{\partial\Sigma^{-1}} = 0 \;\Leftrightarrow\; \underbrace{\frac{n}{2}\Sigma}_{\text{as per Property 4}} - \underbrace{\frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top}_{\text{as per Property 5}} = 0$$
$$\Rightarrow\; \Sigma = \frac{1}{n}\sum_{i,k} t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top$$
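The three update equations translate directly into a vectorized M-step; a minimal numpy sketch (not the Mix-GLOSS implementation; function and variable names are illustrative):

```python
import numpy as np

def m_step(X, T):
    """M-step for a Gaussian mixture with a common covariance matrix,
    given data X (n x p) and responsibilities T (n x K) from the E-step."""
    n, p = X.shape
    nk = T.sum(axis=0)                 # soft counts per component
    pis = nk / n                       # priors, eq. (G.1)
    mus = (T.T @ X) / nk[:, None]      # means, eq. (G.2)
    Sigma = np.zeros((p, p))           # pooled covariance, eq. (G.3)
    for k in range(mus.shape[0]):
        R = X - mus[k]
        Sigma += (T[:, k, None] * R).T @ R
    return pis, mus, Sigma / n

rng = np.random.default_rng(6)
X = rng.standard_normal((40, 2))
T = rng.dirichlet(np.ones(3), size=40)   # dummy responsibilities for the sketch
pis, mus, Sigma = m_step(X, T)
```

Alternating this with the E-step of Appendix F yields the usual EM iteration for this model.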
Bibliography
F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. Optimization for Machine Learning, pages 19–54, 2011.

F. R. Bach. Bolasso: model consistent lasso estimation through the bootstrap. In Proceedings of the 25th International Conference on Machine Learning, ICML, 2008.

F. R. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

J. D. Banfield and A. E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, pages 803–821, 1993.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

H. Bensmail and G. Celeux. Regularized Gaussian discriminant analysis through eigenvalue decomposition. Journal of the American Statistical Association, 91(436):1743–1748, 1996.

P. J. Bickel and E. Levina. Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989–1010, 2004.

C. Biernacki, G. Celeux, G. Govaert, and F. Langrognet. MIXMOD Statistical Documentation. http://www.mixmod.org, 2008.

C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.

C. Bouveyron and C. Brunet. Discriminative variable selection for clustering with the sparse Fisher-EM algorithm. Technical Report 1204.2067, Arxiv e-prints, 2012a.

C. Bouveyron and C. Brunet. Simultaneous model-based clustering and visualization in the Fisher discriminative subspace. Statistics and Computing, 22(1):301–324, 2012b.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37(4):373–384, 1995.

L. Breiman and R. Ihaka. Nonlinear discriminant analysis via ACE and scaling. Technical Report 40, University of California, Berkeley, 1984.
T. Cai and W. Liu. A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106(496):1566–1577, 2011.

S. Canu and Y. Grandvalet. Outcomes of the equivalence of adaptive ridge with least absolute shrinkage. Advances in Neural Information Processing Systems, page 445, 1999.

C. Caramanis, S. Mannor, and H. Xu. Robust optimization in machine learning. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 369–402. MIT Press, 2012.

B. Chidlovskii and L. Lecerf. Scalable feature selection for multi-class problems. In W. Daelemans, B. Goethals, and K. Morik, editors, Machine Learning and Knowledge Discovery in Databases, volume 5211 of Lecture Notes in Computer Science, pages 227–240. Springer, 2008.

L. Clemmensen, T. Hastie, D. Witten, and B. Ersbøll. Sparse discriminant analysis. Technometrics, 53(4):406–413, 2011.

C. De Mol, E. De Vito, and L. Rosasco. Elastic-net regularization in learning theory. Journal of Complexity, 25(2):201–230, 2009.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977.

D. L. Donoho, M. Elad, and V. N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6–18, 2006.

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, 2000.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

J. Fan and Y. Fan. High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6):2605, 2008.

R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Human Genetics, 7(2):179–188, 1936.

V. Franc and S. Sonnenburg. Optimized cutting plane algorithm for support vector machines. In Proceedings of the 25th International Conference on Machine Learning, pages 320–327. ACM, 2008.

J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.
J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. Technical Report 1001.0736, ArXiv e-prints, 2010.

J. H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165–175, 1989.

W. J. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.

A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 2003.

D. Ghosh and A. M. Chinnaiyan. Classification and selection of biomarkers in genomic data using lasso. Journal of Biomedicine and Biotechnology, 2:147–154, 2005.

G. Govaert, Y. Grandvalet, X. Liu, and L. F. Sanchez Merchante. Implementation baseline for clustering. Technical Report D71-m12, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D71-m12.pdf, 2010.

G. Govaert, Y. Grandvalet, B. Laval, X. Liu, and L. F. Sanchez Merchante. Implementations of original clustering. Technical Report D72-m24, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D72-m24.pdf, 2011.

Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalization. In Perspectives in Neural Computing, volume 98, pages 201–206, 1998.

Y. Grandvalet and S. Canu. Adaptive scaling for feature selection in SVMs. Advances in Neural Information Processing Systems, 15:553–560, 2002.

L. Grosenick, S. Greer, and B. Knutson. Interpretable classifiers for fMRI improve prediction of purchases. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 16(6):539–548, 2008.

Y. Guermeur, G. Pollastri, A. Elisseeff, D. Zelus, H. Paugam-Moisy, and P. Baldi. Combining protein secondary structure prediction models with ensemble methods of optimal complexity. Neurocomputing, 56:305–327, 2004.

J. Guo, E. Levina, G. Michailidis, and J. Zhu. Pairwise variable selection for high-dimensional model-based clustering. Biometrics, 66(3):793–804, 2010.

I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.

T. Hastie and R. Tibshirani. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):155–176, 1996.

T. Hastie, R. Tibshirani, and A. Buja. Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association, 89(428):1255–1270, 1994.
T. Hastie, A. Buja, and R. Tibshirani. Penalized discriminant analysis. The Annals of Statistics, 23(1):73–102, 1995.

A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

J. Huang, S. Ma, H. Xie, and C. H. Zhang. A group bridge approach for variable selection. Biometrika, 96(2):339–355, 2009.

T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217–226. ACM, 2006.

K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics, 28(5):1356–1378, 2000.

P. F. Kuan, S. Wang, X. Zhou, and H. Chu. A statistical framework for Illumina DNA methylation arrays. Bioinformatics, 26(22):2849–2855, 2010.

T. Lange, M. Braun, V. Roth, and J. Buhmann. Stability-based model selection. Advances in Neural Information Processing Systems, 15:617–624, 2002.

M. H. C. Law, M. A. T. Figueiredo, and A. K. Jain. Simultaneous feature selection and clustering using mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1154–1166, 2004.

Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines. Journal of the American Statistical Association, 99(465):67–81, 2004.

C. Leng. Sparse optimal scoring for multiclass cancer diagnosis and biomarker detection using microarray data. Computational Biology and Chemistry, 32(6):417–425, 2008.

C. Leng, Y. Lin, and G. Wahba. A note on the lasso and related procedures in model selection. Statistica Sinica, 16(4):1273, 2006.

H. Liu and L. Yu. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4):491–502, 2005.

J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.

Q. Mai, H. Zou, and M. Yuan. A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99(1):29–42, 2012.

C. Maugis, G. Celeux, and M. L. Martin-Magniette. Variable selection for clustering with Gaussian mixture models. Biometrics, 65(3):701–709, 2009a.
C. Maugis, G. Celeux, and M. L. Martin-Magniette. SelvarClust: software for variable selection in model-based clustering. http://www.math.univ-toulouse.fr/~maugis/SelvarClustHomepage.html, 2009b.

L. Meier, S. Van De Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 70(1):53–71, 2008.

N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.

B. Moghaddam, Y. Weiss, and S. Avidan. Generalized spectral bounds for sparse LDA. In Proceedings of the 23rd International Conference on Machine Learning, pages 641–648. ACM, 2006.

B. Moghaddam, Y. Weiss, and S. Avidan. Fast pixel/part selection with sparse eigenvectors. In IEEE 11th International Conference on Computer Vision (ICCV 2007), pages 1–8, 2007.

Y. Nesterov. Gradient methods for minimizing composite functions. Preprint, 2007.

S. Newcomb. A generalized theory of the combination of observations so as to obtain the best result. American Journal of Mathematics, 8(4):343–366, 1886.

B. Ng and R. Abugharbieh. Generalized group sparse classifiers with application in fMRI brain decoding. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1065–1071. IEEE, 2011.

M. R. Osborne, B. Presnell, and B. A. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319–337, 2000a.

M. R. Osborne, B. Presnell, and B. A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403, 2000b.

W. Pan and X. Shen. Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research, 8:1145–1164, 2007.

W. Pan, X. Shen, A. Jiang, and R. P. Hebbel. Semi-supervised learning via penalized mixture model with application to microarray sample classification. Bioinformatics, 22(19):2388–2395, 2006.

K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London, 185:71–110, 1894.

S. Perkins, K. Lacker, and J. Theiler. Grafting: fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3:1333–1356, 2003.
Z. Qiao, L. Zhou, and J. Huang. Sparse linear discriminant analysis with applications to high dimensional low sample size data. International Journal of Applied Mathematics, 39(1), 2009.
A. E. Raftery and N. Dean. Variable selection for model-based clustering. Journal of the American Statistical Association, 101(473):168–178, 2006.
C. R. Rao. The utilization of multiple measurements in problems of biological classification. Journal of the Royal Statistical Society, Series B (Methodological), 10(2):159–203, 1948.
S. Rosset and J. Zhu. Piecewise linear regularized solution paths. The Annals of Statistics, 35(3):1012–1030, 2007.
V. Roth. The generalized lasso. IEEE Transactions on Neural Networks, 15(1):16–28, 2004.
V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), volume 307 of ACM International Conference Proceeding Series, pages 848–855, 2008.
V. Roth and T. Lange. Feature selection in clustering problems. In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 473–480. MIT Press, 2004.
C. Sammut and G. I. Webb. Encyclopedia of Machine Learning. Springer-Verlag New York, Inc., 2010.
L. F. Sanchez Merchante, Y. Grandvalet, and G. Govaert. An efficient approach to sparse linear discriminant analysis. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.
G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
A. J. Smola, S. V. N. Vishwanathan, and Q. Le. Bundle methods for machine learning. Advances in Neural Information Processing Systems, 20:1377–1384, 2008.
S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.
P. Sprechmann, I. Ramirez, G. Sapiro, and Y. Eldar. Collaborative hierarchical sparse modeling. In Information Sciences and Systems (CISS), 2010 44th Annual Conference on, pages 1–6. IEEE, 2010.
M. Szafranski. Pénalités Hiérarchiques pour l'Intégration de Connaissances dans les Modèles Statistiques. PhD thesis, Université de Technologie de Compiègne, 2008.
M. Szafranski, Y. Grandvalet, and P. Morizet-Mahoudeaux. Hierarchical penalization. Advances in Neural Information Processing Systems, 2008.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.
J. E. Vogt and V. Roth. The group-lasso: ℓ1,∞ regularization versus ℓ1,2 regularization. In Pattern Recognition, 32nd DAGM Symposium, Lecture Notes in Computer Science, 2010.
S. Wang and J. Zhu. Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics, 64(2):440–448, 2008.
D. Witten and R. Tibshirani. Penalized classification using Fisher's linear discriminant. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 73(5):753–772, 2011.
D. M. Witten and R. Tibshirani. A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490):713–726, 2010.
D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.
M. Wu and B. Schölkopf. A local learning approach for clustering. Advances in Neural Information Processing Systems, 19:1529, 2007.
M. C. Wu, L. Zhang, Z. Wang, D. C. Christiani, and X. Lin. Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection. Bioinformatics, 25(9):1145–1151, 2009.
T. T. Wu and K. Lange. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, pages 224–244, 2008.
B. Xie, W. Pan, and X. Shen. Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. Electronic Journal of Statistics, 2:168–172, 2008a.
B. Xie, W. Pan, and X. Shen. Variable selection in penalized model-based clustering via regularization on grouped parameters. Biometrics, 64(3):921–930, 2008b.
C. Yang, X. Wan, Q. Yang, H. Xue, and W. Yu. Identifying main effects and epistatic interactions from large-scale SNP data via adaptive group lasso. BMC Bioinformatics, 11(Suppl 1):S18, 2010.
J. Ye. Least squares linear discriminant analysis. In Proceedings of the 24th International Conference on Machine Learning, pages 1087–1093. ACM, 2007.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 68(1):49–67, 2006.
P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7(2):2541, 2007.
P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37(6A):3468–3497, 2009.
H. Zhou, W. Pan, and X. Shen. Penalized model-based clustering with unconstrained covariance matrices. Electronic Journal of Statistics, 3:1473–1496, 2009.
H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.
H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67(2):301–320, 2005.
Contents
7.2 Feature Selection in Model-Based Clustering 75
  7.2.1 Based on Penalized Likelihood 76
  7.2.2 Based on Model Variants 77
  7.2.3 Based on Model Selection 79

8 Theoretical Foundations 81
  8.1 Resolving EM with Optimal Scoring 81
    8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis 81
    8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis 82
    8.1.3 Clustering Using Penalized Optimal Scoring 82
    8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis 83
  8.2 Optimized Criterion 83
    8.2.1 A Bayesian Derivation 84
    8.2.2 Maximum a Posteriori Estimator 85

9 Mix-GLOSS Algorithm 87
  9.1 Mix-GLOSS 87
    9.1.1 Outer Loop: Whole Algorithm Repetitions 87
    9.1.2 Penalty Parameter Loop 88
    9.1.3 Inner Loop: EM Algorithm 89
  9.2 Model Selection 91

10 Experimental Results 93
  10.1 Tested Clustering Algorithms 93
  10.2 Results 95
  10.3 Discussion 97

Conclusions 97

Appendix 103

A Matrix Properties 105

B The Penalized-OS Problem is an Eigenvector Problem 107
  B.1 How to Solve the Eigenvector Decomposition 107
  B.2 Why the OS Problem is Solved as an Eigenvector Problem 109

C Solving Fisher's Discriminant Problem 111

D Alternative Variational Formulation for the Group-Lasso 113
  D.1 Useful Properties 114
  D.2 An Upper Bound on the Objective Function 115
E Invariance of the Group-Lasso to Unitary Transformations 117
F Expected Complete Likelihood and Likelihood 119
G Derivation of the M-Step Equations 121
  G.1 Prior probabilities 121
  G.2 Means 122
  G.3 Covariance Matrix 122
Bibliography 123
List of Figures
1.1 MASH project logo 5
2.1 Example of relevant features 10
2.2 Four key steps of feature selection 11
2.3 Admissible sets in two dimensions for different pure norms ||β||p 14
2.4 Two dimensional regularized problems with ||β||1 and ||β||2 penalties 15
2.5 Admissible sets for the Lasso and Group-Lasso 20
2.6 Sparsity patterns for an example with 8 variables characterized by 4 parameters 20
4.1 Graphical representation of the variational approach to Group-Lasso 45
5.1 GLOSS block diagram 50
5.2 Graph and Laplacian matrix for a 3×3 image 56
6.1 TPR versus FPR for all simulations 60
6.2 2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA 62
6.3 USPS digits "1" and "0" 63
6.4 Discriminant direction between digits "1" and "0" 64
6.5 Sparse discriminant direction between digits "1" and "0" 64
9.1 Mix-GLOSS loops scheme 88
9.2 Mix-GLOSS model selection diagram 92
10.1 Class mean vectors for each artificial simulation 94
10.2 TPR versus FPR for all simulations 97
List of Tables
6.1 Experimental results for simulated data, supervised classification 59
6.2 Average TPR and FPR for all simulations 60
6.3 Experimental results for gene expression data, supervised classification 61
10.1 Experimental results for simulated data, unsupervised clustering 96
10.2 Average TPR versus FPR for all clustering simulations 96
Notation and Symbols
Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors, and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.
Sets
N   the set of natural numbers, N = {1, 2, ...}
R   the set of reals
|A|   cardinality of a set A (for finite sets, the number of elements)
Ā   complement of set A
Data
X   input domain
xi   input sample, xi ∈ X
X   design matrix, X = (x1ᵀ, ..., xnᵀ)ᵀ
xj   column j of X
yi   class indicator of sample i
Y   indicator matrix, Y = (y1ᵀ, ..., ynᵀ)ᵀ
z   complete data, z = (x, y)
Gk   set of the indices of observations belonging to class k
n   number of examples
K   number of classes
p   dimension of X
i, j, k   indices running over N
Vectors, Matrices and Norms
0   vector with all entries equal to zero
1   vector with all entries equal to one
I   identity matrix
Aᵀ   transpose of matrix A (ditto for vectors)
A⁻¹   inverse of matrix A
tr(A)   trace of matrix A
|A|   determinant of matrix A
diag(v)   diagonal matrix with v on the diagonal
||v||1   L1 norm of vector v
||v||2   L2 norm of vector v
||A||F   Frobenius norm of matrix A
Probability
E[·]   expectation of a random variable
var[·]   variance of a random variable
N(μ, σ²)   normal distribution with mean μ and variance σ²
W(W, ν)   Wishart distribution with ν degrees of freedom and scale matrix W
H(X)   entropy of random variable X
I(X; Y)   mutual information between random variables X and Y
Mixture Models
yik   hard membership of sample i to cluster k
fk   distribution function for cluster k
tik   posterior probability of sample i to belong to cluster k
T   posterior probability matrix
πk   prior probability or mixture proportion for cluster k
μk   mean vector of cluster k
Σk   covariance matrix of cluster k
θk   parameter vector for cluster k, θk = (μk, Σk)
θ(t)   parameter vector at iteration t of the EM algorithm
f(X; θ)   likelihood function
L(θ; X)   log-likelihood function
LC(θ; X, Y)   complete log-likelihood function
Optimization
J(·)   cost function
L(·)   Lagrangian
β̂   generic notation for the solution with respect to β
β̂ls   least squares solution coefficient vector
A   active set
γ   step size to update the regularization path
h   direction to update the regularization path
Penalized models
λ, λ1, λ2   penalty parameters
Pλ(θ)   penalty term over a generic parameter vector
βkj   coefficient j of discriminant vector k
βk   kth discriminant vector, βk = (βk1, ..., βkp)
B   matrix of discriminant vectors, B = (β1, ..., βK−1)
βj   jth row of B, B = (β1ᵀ, ..., βpᵀ)ᵀ
BLDA   coefficient matrix in the LDA domain
BCCA   coefficient matrix in the CCA domain
BOS   coefficient matrix in the OS domain
XLDA   data matrix in the LDA domain
XCCA   data matrix in the CCA domain
XOS   data matrix in the OS domain
θk   score vector k
Θ   score matrix, Θ = (θ1, ..., θK−1)
Y   label matrix
Ω   penalty matrix
LCP(θ; X, Z)   penalized complete log-likelihood function
ΣB   between-class covariance matrix
ΣW   within-class covariance matrix
ΣT   total covariance matrix
Σ̂B   sample between-class covariance matrix
Σ̂W   sample within-class covariance matrix
Σ̂T   sample total covariance matrix
Λ   inverse of covariance matrix, or precision matrix
wj   weights
τj   penalty components of the variational approach
Part I
Context and Foundations
This thesis is divided in three parts. In Part I, I introduce the context in which this work has been developed, the project that funded it, and the constraints that we had to obey. Generic notions are also detailed here, to introduce the models and some basic concepts that will be used along this document, and the state of the art is also reviewed.
The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments testing its performance against other state-of-the-art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.
The second contribution is described in Part III, with a structure analogous to that of Part II, but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM algorithm. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.
1 Context
The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.
The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiègne.
From the point of view of the research, the members of the consortium must deal with four main goals:
1. Software development of the website framework and APIs
2. Classification and goal-planning in high dimensional feature spaces
3. Interfacing the platform with the 3D virtual environment and the robot arm
4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments
Figure 1.1: MASH project logo
The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role has been to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community; the last number I was aware of was about 300. Within those 375 extractors, there must be some sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code on some datasets of reference, in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors on a particular dataset does not mean that both are using the same variables.
Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to one another. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we will also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user who develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of the whole database. As another example, imagine a user wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.
As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.
• Clustering Using Mixture Models. This is a well-known technique that models the data as if it were randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm, and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, a C++ library that can be interfaced with MATLAB. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D7.1-m12" (Govaert et al., 2010).
• Sparse Clustering Using Penalized Optimal Scoring. This technique intends again to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis.
All details concerning the tool implemented can be found in deliverable "mash-deliverable-D7.2-m24" (Govaert et al., 2011).
• Table Clustering Using The RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractor space is defined using the RV coefficient, a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(Oi, Oj), where Oi and Oj are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors. A detailed description of this technique can be found in deliverables "mash-deliverable-D7.1-m12" (Govaert et al., 2010) and "mash-deliverable-D7.2-m24" (Govaert et al., 2011).
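As an illustration of the last tool, the standard RV coefficient between two data tables sharing the same rows can be sketched as below. This is a hedged sketch: the function name `rv_coefficient` is ours, and raw centered tables are used in place of the operators Oi computed in the deliverables.

```python
import numpy as np

def rv_coefficient(X, Y):
    # RV coefficient between two column-centered tables with the same rows:
    # a normalized inner product between the n x n configuration matrices
    # X X' and Y Y', generalizing Pearson's squared correlation.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Wx, Wy = X @ X.T, Y @ Y.T
    num = np.trace(Wx @ Wy)
    den = np.sqrt(np.trace(Wx @ Wx) * np.trace(Wy @ Wy))
    return num / den
```

The coefficient lies in [0, 1] and is invariant to affine rescaling of either table, which is why it is convenient for comparing feature extractors whose outputs live on different scales.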
I am not extending this section with further explanations about the MASH project, or deeper details about the theory that we used to fulfil our engagements. I simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).
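For concreteness, the mixture-model clustering of the first tool (EM estimation of Gaussian parameters, clusters by maximum a posteriori) can be sketched in a few lines. This is a generic illustration, not the mixmod implementation: the diagonal-covariance model, the naive initialization and the function name `em_gmm` are simplifying assumptions of this sketch.

```python
import numpy as np

def em_gmm(X, K, n_iter=50):
    # Generic EM for a Gaussian mixture with diagonal covariances;
    # clusters are obtained at the end by maximum a posteriori (MAP).
    n, p = X.shape
    pi = np.full(K, 1.0 / K)                                       # proportions
    mu = X[np.round(np.linspace(0, n - 1, K)).astype(int)].copy()  # naive init
    var = np.tile(X.var(axis=0), (K, 1)) + 1e-6                    # per-component variances
    for _ in range(n_iter):
        # E-step: posterior probabilities t_ik, computed in the log domain
        log_t = np.stack(
            [np.log(pi[k]) - 0.5 * np.sum(
                np.log(2.0 * np.pi * var[k]) + (X - mu[k]) ** 2 / var[k], axis=1)
             for k in range(K)], axis=1)
        log_t -= log_t.max(axis=1, keepdims=True)
        T = np.exp(log_t)
        T /= T.sum(axis=1, keepdims=True)
        # M-step: update proportions, means and diagonal covariances
        nk = T.sum(axis=0) + 1e-12
        pi = nk / n
        mu = (T.T @ X) / nk[:, None]
        var = np.stack(
            [(T[:, k, None] * (X - mu[k]) ** 2).sum(axis=0) / nk[k]
             for k in range(K)]) + 1e-6
    return T.argmax(axis=1), mu
```

The M-step here is the classical closed-form update; Part III of this thesis replaces it by a penalized Optimal Scoring problem to induce sparsity.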
2 Regularization for Feature Selection
With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues rose as well. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined; many algorithms in the field of Machine Learning make use of this statistic.
2.1 Motivations
There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than random guessing of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).
As a rule of thumb, in discriminant and clustering problems the complexity of the calculus increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in the supervised environment, and eases interpretation in the unsupervised framework. Removing features must be done wisely, to avoid removing critical information.
When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

• Reduction by feature transformation summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis and Independent Component Analysis are two popular examples.
• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. A problem arises when there is a restriction on the number of variables to preserve, and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper, because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008)
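To make the contrast concrete, here is a minimal sketch of the first family, feature transformation via Principal Component Analysis computed with an SVD; the function name `pca_transform` is an assumption of this sketch. Note that every output dimension mixes all original attributes, whereas feature selection would keep a subset of the original columns untouched.

```python
import numpy as np

def pca_transform(X, k):
    # Feature transformation: replace the p original attributes by k
    # linear combinations (principal components) of maximal variance.
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T  # n x k scores; original columns are no longer identifiable
```

Because each component blends all inputs, PCA offers no sparsity: interpretability in terms of the original features is lost, which is exactly what selection-based methods try to preserve.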
As a basic rule, we can use reduction techniques by feature transformation when the majority of the features are relevant, and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation of the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text:
"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.
Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus, the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus, the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats, but not distinguishing between dogs and cats. Thus, the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant, so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."

Figure 2.2: The four key steps of feature selection, according to Liu and Yu (2005)
There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.
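The effect of such a regularity condition can be illustrated on the covariance matrix itself: with fewer samples than dimensions, the sample covariance is singular, but adding a small multiple of the identity makes it invertible again. The ridge value `lam` below is an arbitrary choice for this toy sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 40                          # fewer samples than dimensions
X = rng.standard_normal((n, p))
S = np.cov(X, rowvar=False)            # p x p sample covariance, rank <= n - 1
rank_S = np.linalg.matrix_rank(S)      # deficient: the inverse is not defined

lam = 0.1
S_reg = S + lam * np.eye(p)            # imposing a regularity condition
P = np.linalg.inv(S_reg)               # now well defined and numerically stable
```

Every eigenvalue of `S_reg` is at least `lam`, so the inversion required by discriminant-type methods becomes possible even when n < p.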
2.2 Categorization of Feature Selection Techniques
Feature selection is one of the most frequent techniques for preprocessing data, used to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there; thus the relevance of the remaining subset of features must be measured.
I reproduce here the scheme that generalizes any feature selection process, as shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps of a feature selection algorithm.
The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews for characterizing feature selection techniques. I propose here a framework inspired by these references, which does not cover all the possibilities but gives a good summary of the existing ones:
• Depending on the type of integration with the machine learning algorithm, we have:

– Filter Models - The filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.
– Wrapper Models - The wrapper models require a classification or clustering algorithm, and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block, while the feature subset evaluation is done in a different one. Therefore, the criterion to optimize and the criterion to evaluate may differ. Those algorithms are computationally expensive.

– Embedded Models - They perform variable selection inside the learning machine, the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation form a single block, and the features are selected to optimize this unique criterion, without being re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal, because they are specific to the training process of a given mining algorithm.
• Depending on the feature searching technique:

– Complete - No subsets are missed from evaluation. Involves combinatorial searches.

– Sequential - Features are added (forward searches) or removed (backward searches) one at a time.

– Random - The initial subset, or even the subsequent subsets, are randomly chosen to escape local optima.

• Depending on the evaluation technique:

– Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.

– Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

– Dependency Measures - Measuring the correlation between features.

– Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.

– Predictive Accuracy - Using the selected features to predict the labels.

– Cluster Goodness - Using the selected features to perform clustering, and evaluating the result (cluster compactness, scatter separability, maximum likelihood).
The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness make it possible to evaluate subsets of features, and can be used in wrapper and embedded models.
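As a minimal example of a filter model built on a dependency measure, one can rank features by their absolute Pearson correlation with the target and keep the top k; the function name `filter_rank` is an assumption of this sketch, not an algorithm from this thesis.

```python
import numpy as np

def filter_rank(X, y, k):
    # Filter model with a dependency measure: score each feature
    # independently of any learning algorithm, then keep the top k.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    scores = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0)
                                  * np.linalg.norm(yc) + 1e-12)
    return np.argsort(scores)[::-1][:k]
```

Because each feature is scored in isolation, such a filter is fast but blind to redundancy: two perfectly correlated informative features would both be kept.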
In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieving it, thus avoiding many problems arising in filter or wrapper methods. In practice, however, it is intractable to solve hard selection problems exactly when the number of features exceeds a few tens. Regularization techniques provide a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.
2.3 Regularization
In the machine learning domain, the term “regularization” refers to a technique that introduces some extra assumptions or knowledge into the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also fix some numerical issues in ill-posed problems (such as matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.
An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. This happens, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced into the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:
\[
\min_{\beta}\; J(\beta) + \lambda P(\beta) \tag{2.1}
\]
\[
\min_{\beta}\; J(\beta) \quad \text{s.t.}\quad P(\beta) \le t \tag{2.2}
\]
In expressions (2.1) and (2.2), the parameters λ and t play a similar role, namely controlling the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set for which the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. The penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.
In this section I review the pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.
2 Regularization for Feature Selection
Figure 2.3: Admissible sets in two dimensions for different pure norms ‖β‖p
2.3.1 Important Properties
Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.
Convexity: Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies
\[
\forall (x_1, x_2) \in \mathcal{X}^2, \quad f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2) \tag{2.3}
\]
for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.
Sparsity: Null coefficients usually furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property which moreover entails lower memory usage and fewer computational resources.
Stability: There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes, such as adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.
2.3.2 Pure Penalties
For pure penalties, defined as P(β) = ‖β‖p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and the algorithms to solve them. In
Figure 2.4: Two-dimensional regularized problems with ‖β‖1 and ‖β‖2 penalties
this figure, the shapes of the admissible sets corresponding to different pure penalties are greyed out. Since convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.
Regularizing a linear model with such a norm means that the larger the component |βj|, the more important the feature xj in the estimation. On the contrary, the closer to zero, the more dispensable it is. In the limit of |βj| = 0, xj is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.
A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if one of its components (β1 or β2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2), where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function while remaining inside the grey region. Depending on the shape of this region, the probability of obtaining a sparse solution varies: a region with vertexes, such as the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than that of an L2 penalty. This idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves whose global minimum βls lies outside the penalties' admissible regions. The closest point to βls under the L1 regularization is βℓ1, and under the L2 regularization it is βℓ2. Solution βℓ1 is sparse because its second component is zero, while both components of βℓ2 are different from zero.
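This contrast can be checked numerically on the simplest penalized problem, J(β) = ‖β − βls‖², for which both solutions are available in closed form: the L1 penalty acts by componentwise soft-thresholding, the squared L2 penalty by uniform shrinkage. A minimal sketch (the values of βls and λ are made up for illustration):

```python
import numpy as np

# Hypothetical unconstrained least-squares optimum and penalty level
beta_ls = np.array([2.0, 0.3])
lam = 1.0

# L1 solution of min ||b - beta_ls||^2 + lam * ||b||_1:
# componentwise soft-thresholding at lam/2
beta_l1 = np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam / 2, 0.0)

# L2 solution of min ||b - beta_ls||^2 + lam * ||b||_2^2:
# uniform shrinkage by 1/(1 + lam)
beta_l2 = beta_ls / (1.0 + lam)

print(beta_l1)  # [1.5  0. ]  -> second component exactly zero: sparse
print(beta_l2)  # [1.   0.15] -> both components shrunk but nonzero
```

The small component is driven exactly to zero by the L1 penalty, while the L2 penalty merely shrinks it, which is the algebraic counterpart of the vertex argument above.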
After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the quantity and the “sharpness” of the vertexes of the greyed out area. For example, an L1/3 penalty has an admissible region with sharper vertexes that would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L1/3 penalty results in difficulties during optimization that do not happen with a convex shape.
To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with Lp norms with p ≤ 1, since they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex; hence the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.
L0 Penalties: The L0 pseudo-norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ‖β‖0 = card{βj | βj ≠ 0}:
\[
\min_{\beta}\; J(\beta) \quad \text{s.t.}\quad \|\beta\|_0 \le t \tag{2.4}
\]
where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ, if we use the equivalent expression (2.1)), the fewer the number of zeros induced in vector β. If t equals the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. The solutions are sparse but unstable.
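For small p, the combinatorial nature of problem (2.4) can be made concrete by brute-force enumeration: every support of size t is tried, least squares is fitted on each candidate subset, and the best fit is kept. A sketch on synthetic data (all names and values are illustrative); for realistic p this search is exponential, which is precisely why the continuous surrogates below are used:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p, t = 30, 6, 2
X = rng.standard_normal((n, p))
beta_true = np.array([3.0, 0, 0, -2.0, 0, 0])   # only features 0 and 3 matter
y = X @ beta_true + 0.01 * rng.standard_normal(n)

# Enumerate all supports with ||beta||_0 <= t and keep the best residual sum of squares
best_rss, best_support = np.inf, None
for support in itertools.combinations(range(p), t):
    cols = list(support)
    b, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    rss = np.sum((y - X[:, cols] @ b) ** 2)
    if rss < best_rss:
        best_rss, best_support = rss, support

print(best_support)  # (0, 3): the truly informative variables
```

With C(p, t) subsets to visit, the cost explodes combinatorially with p, and small perturbations of the data can switch the selected support, illustrating the instability mentioned above.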
L1 Penalties: Penalties built using the L1 norm induce sparsity and stability. The L1-penalized least squares estimator has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):
\[
\min_{\beta}\; J(\beta) \quad \text{s.t.}\quad \sum_{j=1}^{p} |\beta_j| \le t \tag{2.5}
\]
Despite all the advantages of the Lasso, choosing the right penalty is not as simple as a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one of them, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.
The Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection in supervised classification (Mai et al. 2012; Witten and Tibshirani 2011) and clustering (Roth and Lange 2004; Pan et al. 2006; Pan and Shen 2007; Zhou et al. 2009; Guo et al. 2010; Witten and Tibshirani 2010; Bouveyron and Brunet 2012b,a).
The consistency of the problems regularized by a Lasso penalty is also a key feature. Defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large, Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by
minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions under which Lasso estimators become consistent (Knight and Fu 2000; Donoho et al. 2006; Meinshausen and Bühlmann 2006; Zhao and Yu 2007; Bach 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou 2006).
L2 Penalties: The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used, to avoid the square root and to solve a linear system. Thus, an L2-penalized optimization problem looks like
\[
\min_{\beta}\; J(\beta) + \lambda \|\beta\|_2^2 \tag{2.6}
\]
The effect of this penalty is the “equalization” of the components of the parameter being penalized. To highlight this property, let us consider a least squares problem
\[
\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 \tag{2.7}
\]
with solution βls = (X⊤X)⁻¹X⊤y. If some input variables are highly correlated, the estimator βls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:
\[
\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
\]
The solution to this problem is βℓ2 = (X⊤X + λIp)⁻¹X⊤y. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This “equalization” of the coefficients reduces the variability of the estimation, which may improve performances.
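The eigenvalue shift is easy to verify numerically: the spectrum of X⊤X + λIp is exactly the spectrum of X⊤X translated by λ. A small sketch with two nearly collinear columns (the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
# Two nearly collinear columns make X^T X ill-conditioned
x1 = rng.standard_normal(50)
X = np.column_stack([x1,
                     x1 + 1e-4 * rng.standard_normal(50),
                     rng.standard_normal(50)])

lam = 0.5
gram = X.T @ X
eig_plain = np.linalg.eigvalsh(gram)                   # one eigenvalue is nearly 0
eig_ridge = np.linalg.eigvalsh(gram + lam * np.eye(3)) # every eigenvalue shifted by lam

print(eig_ridge - eig_plain)  # approximately [0.5, 0.5, 0.5]
```

The smallest eigenvalue, which was nearly zero and caused the instability of βls, is now bounded below by λ, so the inverse in βℓ2 is well conditioned.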
As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrote, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:
\[
\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{\beta_j^2}{(\beta_j^{ls})^2} \tag{2.8}
\]
The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is adaptive ridge regression (Grandvalet 1998; Grandvalet and Canu 2002),
where the penalty parameter differs for each component. There, every λj is optimized to penalize more or less, depending on the influence of βj in the model.
Although L2-penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.
L∞ Penalties: A special case of Lp norms is the infinity norm, defined as ‖x‖∞ = max(|x1|, |x2|, ..., |xp|). The admissible region for a penalty like ‖β‖∞ ≤ t is displayed in Figure 2.3: for the L∞ norm, the greyed out region is a square containing all the β vectors whose largest coefficient is less than or equal to the value of the penalty parameter t.
This norm is not commonly used as a regularization term by itself; however, it frequently appears combined in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing optimality conditions of sparse regularized problems. The dual norm ‖β‖∗ of a norm ‖β‖ is defined as
\[
\|\beta\|_* = \max_{w \in \mathbb{R}^p}\; \beta^\top w \quad \text{s.t.}\quad \|w\| \le 1
\]
In the case of an Lq norm with q ∈ [1, +∞], the dual norm is the Lr norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual, and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why L∞ is so important even if it is not as popular as a penalty itself, because L1 is. An extensive explanation of dual norms and the algorithms that make use of them can be found in Bach et al. (2011).
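Both dual pairs can be checked numerically from the definition ‖β‖∗ = max{β⊤w : ‖w‖ ≤ 1}. A small sketch (the vector is arbitrary):

```python
import numpy as np

beta = np.array([0.5, -2.0, 1.0])

# Over the L1 ball, beta^T w is maximized at a signed canonical basis vector,
# so the dual of the L1 norm is the Linf norm of beta.
dual_of_l1 = max(abs(b) for b in beta)

# Over the L2 ball, the maximizer is w = beta / ||beta||_2,
# so the L2 norm is self-dual.
dual_of_l2 = beta @ (beta / np.linalg.norm(beta))

print(dual_of_l1)                                    # 2.0 = ||beta||_inf
print(np.isclose(dual_of_l2, np.linalg.norm(beta)))  # True
```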
2.3.3 Hybrid Penalties
There is no reason to use pure penalties in isolation: we can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie 2005), whose objective is to improve the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is
\[
\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \tag{2.9}
\]
The term in λ1 is a Lasso penalty that induces sparsity in vector β; on the other hand, the term in λ2 is a ridge regression penalty that provides universal strong consistency (De Mol et al. 2009), that is, the asymptotic capability (when n goes to infinity) of always making the right choice of relevant variables.
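How the two penalties combine is visible in the proximal operator of the Elastic net, which has a closed form: the λ1 term soft-thresholds (producing exact zeros, as the Lasso does) and the λ2 term uniformly shrinks what survives (as ridge does). A sketch (function name and values are my own, for illustration):

```python
import numpy as np

def enet_prox(v, lam1, lam2):
    """Closed-form minimizer of (1/2)||b - v||^2 + lam1*||b||_1 + (lam2/2)*||b||_2^2."""
    # Soft-threshold at lam1 (L1 part), then shrink by 1/(1+lam2) (L2 part)
    return np.sign(v) * np.maximum(np.abs(v) - lam1, 0.0) / (1.0 + lam2)

v = np.array([3.0, 0.2, -1.5])
print(enet_prox(v, lam1=0.5, lam2=1.0))  # [ 1.25  0.   -0.5 ]
```

The middle component is zeroed exactly while the others are both thresholded and shrunk, the two effects described above acting in sequence.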
2.3.4 Mixed Penalties
Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us denote by Gℓ the group of genes of the ℓth process and by dℓ the number of genes (variables) in each group, ∀ℓ ∈ {1, ..., L}. Thus, the dimension of vector β is the sum of the group sizes, \(\dim(\beta) = \sum_{\ell=1}^{L} d_\ell\). Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:
\[
\|\beta\|_{(r,s)} = \left( \sum_{\ell} \Big( \sum_{j \in G_\ell} |\beta_j|^s \Big)^{r/s} \right)^{1/r} \tag{2.10}
\]
The pair (r, s) identifies the norms that are combined: an Ls norm within groups and an Lr norm between groups. The Ls norm penalizes the variables in every group Gℓ, while the Lr norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
Several combinations are available; the most popular is the norm ‖β‖(1,2), known as the group-Lasso (Yuan and Lin 2006; Leng 2008; Xie et al. 2008a,b; Meier et al. 2008; Roth and Fischer 2008; Yang et al. 2010; Sanchez Merchante et al. 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L(1,2) norm. Many other mixings are possible, such as ‖β‖(1,4/3) (Szafranski et al. 2008) or ‖β‖(1,∞) (Wang and Zhu 2008; Kuan et al. 2010; Vogt and Roth 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al. 2009) and the composite absolute penalties (Zhao et al. 2009), as well as combinations of mixed and pure norms, such as Lasso and group-Lasso (Friedman et al. 2010; Sprechmann et al. 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh 2011).
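Equation (2.10) is straightforward to evaluate; the sketch below computes the (unweighted) ‖β‖(1,2) group-Lasso norm for a hypothetical grouping of six variables into three groups, alongside the plain L1 norm for comparison:

```python
import numpy as np

def mixed_norm(beta, groups, r=1, s=2):
    """||beta||_(r,s): an Ls norm within each group, an Lr norm across groups."""
    within = [np.linalg.norm(beta[g], ord=s) for g in groups]
    return np.linalg.norm(within, ord=r)

# Hypothetical 6-dimensional vector split into L = 3 groups
beta = np.array([3.0, 4.0, 0.0, 0.0, 1.0, 0.0])
groups = [[0, 1], [2, 3], [4, 5]]

print(mixed_norm(beta, groups))     # 6.0 : group norms 5 + 0 + 1
print(np.linalg.norm(beta, ord=1))  # 8.0 : plain L1 norm
```

The second group contributes zero as a whole, which is exactly the groupwise sparsity pattern that the outer L1 norm promotes.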
2.3.5 Sparsity Considerations
In this chapter I have reviewed several possibilities for inducing sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models feature-wise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.
The Lasso and the other L1 penalties encourage solutions such as the one on the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.
To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown on the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L(1,2) or L(1,∞) mixed norms, with the proper definition of groups, can induce sparsity patterns such as
(a) L1, Lasso. (b) L(1,2), group-Lasso.

Figure 2.5: Admissible sets for the Lasso and group-Lasso
(a) L1-induced sparsity. (b) L(1,2) group-induced sparsity.

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters
the one on the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.
2.3.6 Optimization Tools for Regularized Problems
In Caramanis et al. (2012) there is a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified in four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.
In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be described as an algorithm of “active constraints”, implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.
Subgradient Descent: Subgradient descent is a generic optimization method that can be used in penalized settings where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure; on the other hand, many iterations are needed, so convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β(t+1) is updated proportionally to the negative subgradient of the function at the current point β(t):
\[
\beta^{(t+1)} = \beta^{(t)} - \alpha\, (s + \lambda s') \quad \text{where } s \in \partial J(\beta^{(t)}),\; s' \in \partial P(\beta^{(t)})
\]
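A minimal sketch of this update rule for a Lasso-type problem, with J(β) = ‖y − Xβ‖² and P(β) = ‖β‖₁, on synthetic data (the step size and iteration count are arbitrary choices of mine). Note that, as stated above, the iterates approach the solution slowly and are never exactly sparse:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 5))
y = X @ np.array([2.0, 0, 0, -1.0, 0]) + 0.1 * rng.standard_normal(40)
lam, alpha = 1.0, 0.001

beta = np.zeros(5)
for t in range(5000):
    s = 2 * X.T @ (X @ beta - y)   # gradient of the quadratic loss J
    s_prime = np.sign(beta)        # a subgradient of ||beta||_1 (0 is valid at 0)
    beta = beta - alpha * (s + lam * s_prime)

print(np.round(beta, 2))  # close to [2, 0, 0, -1, 0], but not exactly sparse
```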
Coordinate Descent: Coordinate descent is based on the first-order optimality conditions of criterion (2.1). In the case of penalties like the Lasso, setting the first-order derivative with respect to coefficient βj to zero gives
\[
\beta_j = \frac{-\lambda\, \mathrm{sign}(\beta_j) - \frac{\partial J(\beta)}{\partial \beta_j}}{2 \sum_{i=1}^{n} x_{ij}^2}
\]
In the literature, those algorithms can also be referred to as “iterative thresholding” algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution βls and updating its value with an iterative thresholding algorithm where \(\beta_j^{(t+1)} = S_\lambda\big(\partial J(\beta^{(t)}) / \partial \beta_j\big)\). The objective function is optimized with respect
to one variable at a time, while all others are kept fixed:
\[
S_\lambda\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) =
\begin{cases}
\dfrac{\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \dfrac{\partial J(\beta)}{\partial \beta_j} > \lambda \\[2ex]
\dfrac{-\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \dfrac{\partial J(\beta)}{\partial \beta_j} < -\lambda \\[2ex]
0 & \text{if } \left|\dfrac{\partial J(\beta)}{\partial \beta_j}\right| \le \lambda
\end{cases} \tag{2.11}
\]
The same principles define “block-coordinate descent” algorithms; in this case, the first-order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin 2006; Wu and Lange 2008).
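The thresholding update (2.11) leads directly to a cyclic coordinate descent for the Lasso. A self-contained sketch, written for J(β) = ‖y − Xβ‖² on synthetic data (variable names are my own choices):

```python
import numpy as np

def soft_threshold(z, thr):
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for min ||y - X b||^2 + lam * ||b||_1."""
    beta = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            # residual ignoring the contribution of coordinate j
            r_j = y - X @ beta + X[:, j] * beta[j]
            # the one-dimensional problem in beta_j is solved by soft-thresholding
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2) / col_sq[j]
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 6))
y = X @ np.array([1.5, 0, 0, -2.0, 0, 0]) + 0.05 * rng.standard_normal(50)
beta = lasso_cd(X, y, lam=1.0)
print(np.round(beta, 2))  # informative coordinates survive, the others are shrunk to (or near) zero
```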
Active and Inactive Sets: Active set algorithms are also referred to as “active constraints” or “working set” methods. These algorithms define a subset of variables called the “active set”, usually denoted A, which stores the indices of the variables with non-zero βj. The complement of the active set is the “inactive set”, noted Ā, which contains the indices of the variables whose βj is zero. Thus the problem can be reduced to the dimensionality of A.
Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There also exists a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy, which starts with an empty A, has the advantage that the first calculations are low-dimensional. In addition, the forward view fits better the feature selection intuition, where few features are intended to be selected.
Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the objective function descent direction, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions: their expressions are essential for selecting the next variable to add to the active set and for testing whether a particular vector β is a solution of Problem (2.1).
These active constraints or working set methods, even if they were originally proposed to solve L1-regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and L1 penalties (Roth 2004), linear functions
and L(1,2) penalties (Roth and Fischer 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al. 2003). The algorithm developed in this work belongs to this family of solutions.
Hyper-Planes Approximation: Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built from several secant hyper-planes at different points, obtained from the subgradient of the cost function at these points.
This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.
This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims 2006; Smola et al. 2008; Franc and Sonnenburg 2008) or Multiple Kernel Learning (Sonnenburg et al. 2006).
Regularization Path: The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1) where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice versa).
This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the Least Angle Regression (LARS) algorithm, developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.
Once an active set A(t) and its corresponding solution β(t) have been set, looking for the regularization path means looking for a direction h and a step size γ to update the solution as β(t+1) = β(t) + γh. Afterwards, the active and inactive sets A(t+1) and Ā(t+1) are updated; that can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size, and decides which variable should enter the active set, from the correlation with the residuals.
Proximal Methods: Proximal methods optimize an objective function of the form (2.1) resulting from the addition of a Lipschitz-differentiable cost function J(β) and a non-differentiable penalty λP(β):
\[
\min_{\beta \in \mathbb{R}^p}\; J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2} \left\| \beta - \beta^{(t)} \right\|_2^2 \tag{2.12}
\]
They are also iterative methods, where the cost function J(β) is linearized in the proximity of the solution β(t), so that the problem to solve at each iteration looks like
(2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. That can be rewritten as
\[
\min_{\beta \in \mathbb{R}^p}\; \frac{1}{2} \left\| \beta - \Big( \beta^{(t)} - \frac{1}{L} \nabla J(\beta^{(t)}) \Big) \right\|_2^2 + \frac{\lambda}{L} P(\beta) \tag{2.13}
\]
The basic algorithm uses the solution to (2.13) as the next value β(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle 2009). Proximal methods can be seen as generalizations of gradient updates; in fact, setting λ = 0 in equation (2.13), the standard gradient update rule comes up.
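With P(β) = ‖β‖₁, this scheme is the classical iterative soft-thresholding algorithm (ISTA): a gradient step on J followed by the proximal operator of (λ/L)‖·‖₁, which is soft-thresholding at λ/L. A compact sketch on synthetic data (names and values are illustrative):

```python
import numpy as np

def ista(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for min ||y - X b||^2 + lam * ||b||_1."""
    L = 2.0 * np.linalg.norm(X, ord=2) ** 2  # Lipschitz constant of the gradient of J
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ beta - y)
        v = beta - grad / L                  # gradient step, as in (2.13)
        beta = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)  # prox step
    return beta

rng = np.random.default_rng(1)
X = rng.standard_normal((60, 8))
y = X @ np.array([3.0, 0, 0, 0, -1.0, 0, 0, 0]) + 0.05 * rng.standard_normal(60)
beta = ista(X, y, lam=1.0)
print(np.round(beta, 2))  # sparse estimate close to the true coefficients
```

Replacing the prox step by the identity (λ = 0) recovers plain gradient descent, as noted above.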
Part II
Sparse Linear Discriminant Analysis
Abstract
Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.
There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Chapter 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models regarding variables.
In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach of LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis, and also enables the derivation of variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of the data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification. The first two or three directions can also be used to project the data, so as to generate a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performances. The algorithm efficiently processes medium to large numbers of variables, and is thus particularly well suited to the analysis of gene expression data.
3 Feature Selection in Fisher Discriminant Analysis
3.1 Fisher Discriminant Analysis
Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates the classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of the data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part we focus on Fisher's discriminant analysis, a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities, but rather on some inertia principles (Fisher 1936).
We consider that the data consist of a set of n examples, with observations xi ∈ ℝp comprising p features, and labels yi ∈ {0, 1}K indicating the exclusive assignment of observation xi to one of the K classes. It will be convenient to gather the observations in the n × p matrix X = (x1⊤, ..., xn⊤)⊤ and the corresponding labels in the n × K matrix Y = (y1⊤, ..., yn⊤)⊤.
Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:
\[
\max_{\beta \in \mathbb{R}^p}\; \frac{\beta^\top \Sigma_B \beta}{\beta^\top \Sigma_W \beta} \tag{3.1}
\]
where β is the discriminant direction used to project the data, and ΣB and ΣW are the p × p between-class and within-class covariance matrices, respectively, defined (for a K-class problem) as
\[
\Sigma_W = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (x_i - \mu_k)(x_i - \mu_k)^\top
\]
\[
\Sigma_B = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (\mu - \mu_k)(\mu - \mu_k)^\top
\]
where μ is the sample mean of the whole dataset, μk is the sample mean of class k, and Gk indexes the observations of class k.
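Both matrices follow directly from these definitions; the sketch below computes them in numpy on synthetic data (my own helper names), and also checks the classical decomposition of the total covariance into within-class plus between-class parts:

```python
import numpy as np

def scatter_matrices(X, y, K):
    """Within-class and between-class covariance matrices, as defined above."""
    n, p = X.shape
    mu = X.mean(axis=0)
    sigma_w = np.zeros((p, p))
    sigma_b = np.zeros((p, p))
    for k in range(K):
        Xk = X[y == k]                 # observations of class k (the set G_k)
        mu_k = Xk.mean(axis=0)
        sigma_w += (Xk - mu_k).T @ (Xk - mu_k)
        sigma_b += len(Xk) * np.outer(mu - mu_k, mu - mu_k)
    return sigma_w / n, sigma_b / n

rng = np.random.default_rng(2)
# 3 classes of 30 points in R^4, with shifted means
X = rng.standard_normal((90, 4)) + np.repeat(np.eye(3, 4) * 3, 30, axis=0)
y = np.repeat([0, 1, 2], 30)
sigma_w, sigma_b = scatter_matrices(X, y, K=3)

# Total covariance = within-class + between-class (Huygens decomposition)
Xc = X - X.mean(axis=0)
print(np.allclose(sigma_w + sigma_b, Xc.T @ Xc / len(X)))  # True
```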
This analysis can be extended to the multi-class framework with K groups. In this case, K − 1 discriminant vectors βk may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:
\[
\max_{B \in \mathbb{R}^{p \times (K-1)}}\; \frac{\mathrm{tr}\!\left(B^\top \Sigma_B B\right)}{\mathrm{tr}\!\left(B^\top \Sigma_W B\right)} \tag{3.2}
\]
where the matrix B is built with the discriminant directions βk as columns. Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K − 1 subproblems:
\[
\begin{aligned}
\max_{\beta_k \in \mathbb{R}^p}\;& \beta_k^\top \Sigma_B \beta_k \\
\text{s.t.}\;& \beta_k^\top \Sigma_W \beta_k \le 1 \\
& \beta_k^\top \Sigma_W \beta_\ell = 0 \quad \forall \ell < k
\end{aligned} \tag{3.3}
\]
The maximizer of subproblem k is the eigenvector of \(\Sigma_W^{-1} \Sigma_B\) associated with the kth largest eigenvalue (see Appendix C).
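Numerically, the whole series of subproblems can be solved at once as the generalized symmetric eigenproblem ΣB β = ω ΣW β. A sketch using scipy, whose `eigh` routine returns eigenvectors normalized so that β⊤ΣWβ = 1, matching the constraints in (3.3); the two covariance matrices here are random stand-ins, not real data:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(3)
p, K = 5, 3
A = rng.standard_normal((p, p))
sigma_w = A @ A.T + np.eye(p)   # SPD stand-in for the within-class covariance
C = rng.standard_normal((p, K - 1))
sigma_b = C @ C.T               # rank K-1 stand-in for the between-class covariance

# Generalized eigenproblem sigma_b v = w sigma_w v; eigenvalues in ascending order,
# so the K-1 leading discriminant directions are the last columns, reversed
w, V = eigh(sigma_b, sigma_w)
B = V[:, ::-1][:, : K - 1]

# Constraint check: the directions are sigma_w-orthonormal
print(np.allclose(B.T @ sigma_w @ B, np.eye(K - 1)))  # True
```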
3.2 Feature Selection in LDA Problems
LDA is often used as a data reduction technique, where the K − 1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.
Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. The main target of this sparsity is to reduce the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness in the solution, or computational constraints.
The easiest approach to sparse LDA performs variable selection before discrimination. The relevance of each feature is usually based on univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.
They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, that is, either Fisher's discriminant analysis (variance-based) or a regression-based formulation.
3.2.1 Inertia Based
The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away (large between-class variance) and
classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.
Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search, with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al. 2007).
Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where Fisher's discriminant (3.1) is solved as
\[
\min_{\beta \in \mathbb{R}^p} \; \beta^\top \Sigma_{\mathrm{W}} \beta
\quad \text{s.t.} \quad (\mu_1 - \mu_2)^\top \beta = 1 , \;\; \sum_{j=1}^p |\beta_j| \le t ,
\]
where $\mu_1$ and $\mu_2$ are the vectors of mean gene expression values of the two groups. The objective and the first constraint match problem (3.1); the second constraint encourages parsimony.
Witten and Tibshirani (2011) describe a multi-class technique using Fisher's discriminant, rewritten as a series of $K-1$ constrained and penalized maximization problems:
\[
\max_{\beta_k \in \mathbb{R}^p} \; \beta_k^\top \Sigma_{\mathrm{B}} \beta_k - P_k(\beta_k)
\quad \text{s.t.} \quad \beta_k^\top \Sigma_{\mathrm{W}} \beta_k \le 1 .
\]
The term to maximize is the projected between-class covariance $\beta_k^\top \Sigma_{\mathrm{B}} \beta_k$, subject to an upper bound on the projected within-class covariance $\beta_k^\top \Sigma_{\mathrm{W}} \beta_k$. The penalty $P_k(\beta_k)$ is added to avoid singularities and to induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general-purpose data: the Lasso shrinks the less informative variables to zero, and the fused Lasso encourages a piecewise constant $\beta_k$ vector. The R code is available from the website of Daniela Witten.
Cai and Liu (2011) use Fisher's discriminant to solve a binary LDA problem. But instead of separately estimating $\Sigma_{\mathrm{W}}$ and $(\mu_1 - \mu_2)$ to obtain the optimal solution $\beta = \Sigma_{\mathrm{W}}^{-1}(\mu_1 - \mu_2)$, they estimate the product directly through constrained $\ell_1$ minimization:
\[
\min_{\beta \in \mathbb{R}^p} \; \|\beta\|_1
\quad \text{s.t.} \quad \big\| \Sigma \beta - (\mu_1 - \mu_2) \big\|_\infty \le \lambda .
\]
Sparsity is encouraged by the $\ell_1$ norm of the vector $\beta$, and the parameter $\lambda$ tunes the optimization.
Most of the algorithms reviewed here are conceived for binary classification. For those designed for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.
3.2.2 Regression Based
In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For $K > 2$, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging in the multi-class case (Duda et al., 2000; Friedman et al., 2009).
Predefined Indicator Matrix
Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al., 2009). An indicator matrix $\mathbf{Y}$ is an $n \times K$ matrix encoding the class labels of all samples. There are several well-known types in the literature. For example, the binary or dummy indicator ($y_{ik} = 1$ if sample $i$ belongs to class $k$, and $y_{ik} = 0$ otherwise) is commonly used to link multi-class classification with linear regression (Friedman et al., 2009). Another popular choice is $y_{ik} = 1$ if sample $i$ belongs to class $k$, and $y_{ik} = -1/(K-1)$ otherwise; it was used, for example, for extending Support Vector Machines to multi-class classification (Lee et al., 2004), or for generalizing the kernel target alignment measure (Guermeur et al., 2004).
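A minimal sketch of these two codings (helper names are illustrative, not from the original text):

```python
import numpy as np

def dummy_indicator(y, K):
    """n x K binary indicator: Y[i, k] = 1 iff sample i belongs to class k."""
    Y = np.zeros((len(y), K))
    Y[np.arange(len(y)), y] = 1.0
    return Y

def symmetric_indicator(y, K):
    """Alternative coding: 1 for the true class, -1/(K-1) elsewhere."""
    Y = np.full((len(y), K), -1.0 / (K - 1))
    Y[np.arange(len(y)), y] = 1.0
    return Y
```

For `y = [0, 2, 1]` and `K = 3`, `dummy_indicator` yields the rows (1,0,0), (0,0,1), (0,1,0), while `symmetric_indicator` replaces each 0 with -1/2.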
Some works propose a formulation of the least squares problem based on a new class indicator matrix (Ye, 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition, shown empirically to hold in many applications involving high-dimensional data.
Qiao et al. (2009) propose a discriminant analysis for the high-dimensional, low-sample-size setting which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression. Some rather clumsy steps in the developments hinder the comparison, so further investigation is required; the lack of publicly available code also prevented an empirical test of this conjecture. If the similarity is confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.
In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector $\beta$ is obtained by solving
\[
\min_{\beta \in \mathbb{R}^p, \, \beta_0 \in \mathbb{R}} \; n^{-1} \sum_{i=1}^n \big( y_i - \beta_0 - \mathbf{x}_i^\top \beta \big)^2 + \lambda \sum_{j=1}^p |\beta_j| ,
\]
where $y_i$ is the binary indicator of the label of pattern $\mathbf{x}_i$. Even though the authors focus on the Lasso penalty, they also allow any other generic sparsity-inducing penalty. The decision rule $\mathbf{x}^\top \beta + \beta_0 > 0$ is the LDA classifier when it is built with the resulting $\beta$ vector for $\lambda = 0$, but a different intercept $\beta_0$ is required.
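As an illustration, a penalized least squares problem of this form can be solved by proximal gradient descent (ISTA), soft-thresholding after each gradient step; the sketch below handles the intercept by centering. This is a generic sketch, not the authors' implementation, and the step-size choice is one standard option:

```python
import numpy as np

def lasso_lda_direction(X, y, lam, n_iter=2000):
    """Solve min_{b0, b} n^{-1} * sum_i (y_i - b0 - x_i^T b)^2 + lam * ||b||_1
    by proximal gradient (ISTA). y holds binary class indicators."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)          # centering absorbs the intercept
    yc = y - y.mean()
    b = np.zeros(p)
    L = 2.0 * np.linalg.norm(Xc, 2) ** 2 / n   # Lipschitz constant of the loss gradient
    for _ in range(n_iter):
        grad = -2.0 / n * Xc.T @ (yc - Xc @ b)
        z = b - grad / L
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-thresholding
    b0 = y.mean() - X.mean(axis=0) @ b
    return b0, b
```

For $\lambda$ larger than the sup-norm of the loss gradient at zero, the solution collapses to $\beta = 0$; for small $\lambda$, the direction approaches the unpenalized least squares fit.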
Optimal Scoring
In binary classification, the regression of (scaled) class indicators recovers exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al., 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification, and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the class indicators together with the discriminant functions. The approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, aiming either at constructing more flexible discriminants (Hastie and Tibshirani, 1996) or more conservative ones (Hastie et al., 1995).
As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix $\mathbf{\Omega}$, leading to a problem expressed in compact form as
\[
\min_{\mathbf{\Theta}, \mathbf{B}} \; \|\mathbf{Y}\mathbf{\Theta} - \mathbf{X}\mathbf{B}\|_F^2 + \lambda \, \mathrm{tr}\big( \mathbf{B}^\top \mathbf{\Omega} \mathbf{B} \big) \qquad (3.4\mathrm{a})
\]
\[
\text{s.t.} \quad n^{-1} \, \mathbf{\Theta}^\top \mathbf{Y}^\top \mathbf{Y} \mathbf{\Theta} = \mathbf{I}_{K-1} , \qquad (3.4\mathrm{b})
\]
where $\mathbf{\Theta} \in \mathbb{R}^{K \times (K-1)}$ holds the class scores, $\mathbf{B} \in \mathbb{R}^{p \times (K-1)}$ holds the regression coefficients, and $\|\cdot\|_F$ is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of $K-1$ problems:
\[
\min_{\theta_k \in \mathbb{R}^K, \, \beta_k \in \mathbb{R}^p} \; \|\mathbf{Y}\theta_k - \mathbf{X}\beta_k\|^2 + \beta_k^\top \mathbf{\Omega} \beta_k \qquad (3.5\mathrm{a})
\]
\[
\text{s.t.} \quad n^{-1} \, \theta_k^\top \mathbf{Y}^\top \mathbf{Y} \theta_k = 1 , \qquad (3.5\mathrm{b})
\]
\[
\qquad \theta_k^\top \mathbf{Y}^\top \mathbf{Y} \theta_\ell = 0 , \;\; \ell = 1, \ldots, k-1 , \qquad (3.5\mathrm{c})
\]
where each $\beta_k$ corresponds to a discriminant direction.
Several sparse LDA formulations have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan, 2005; Leng, 2008; Grosenick et al., 2008; Clemmensen et al., 2011). Grosenick et al. (2008) proposed a variant of the Lasso-based penalized OS of Ghosh and Chinnaiyan (2005), introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by
\[
\min_{\beta_k \in \mathbb{R}^p, \, \theta_k \in \mathbb{R}^K} \; \|\mathbf{Y}\theta_k - \mathbf{X}\beta_k\|_2^2 + \lambda_1 \|\beta_k\|_1 + \lambda_2 \, \beta_k^\top \mathbf{\Omega} \beta_k ,
\]
where $\lambda_1$ and $\lambda_2$ are regularization parameters, and $\mathbf{\Omega}$ is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.
Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):
\[
\min_{\beta_k \in \mathbb{R}^p, \, \theta_k \in \mathbb{R}^K} \; \sum_{k=1}^{K-1} \|\mathbf{Y}\theta_k - \mathbf{X}\beta_k\|_2^2 + \lambda \sum_{j=1}^p \sqrt{\sum_{k=1}^{K-1} \beta_{kj}^2} \; , \qquad (3.6)
\]
which is the criterion chosen in this thesis.
The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm closely followed the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA and provide a publicly available, efficient code for solving this problem.
4 Formalizing the Objective
In this chapter, we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis, and also enables the derivation of variants, such as an LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004).
The sparsity arises from the group-Lasso penalty (3.6) due to Leng (2008), which selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of the data. For $K$ classes, this representation can be either complete, in dimension $K-1$, or partial, for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.
The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka, 1984; Hastie et al., 1994; Hastie and Tibshirani, 1996; Hastie et al., 1995) and have already been used for sparsity-inducing penalties (Roth and Lange, 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by the group-Lasso and penalized LDA (Sanchez Merchante et al., 2012).
4.1 From Optimal Scoring to Linear Discriminant Analysis
Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) and in penalized LDA (p-LDA), by going through canonical correlation analysis. We first provide some properties of the solutions of an arbitrary problem in the p-OS series (3.5).
Throughout this chapter, we assume that:

- there is no empty class, that is, the diagonal matrix $\mathbf{Y}^\top\mathbf{Y}$ is full rank;
- inputs are centered, that is, $\mathbf{X}^\top \mathbf{1}_n = \mathbf{0}$;
- the quadratic penalty $\mathbf{\Omega}$ is positive-semidefinite and such that $\mathbf{X}^\top\mathbf{X} + \mathbf{\Omega}$ is full rank.
4.1.1 Penalized Optimal Scoring Problem
For the sake of simplicity, we now drop the subscript $k$ to refer to any problem in the p-OS series (3.5). First, note that Problems (3.5) are biconvex in $(\theta, \beta)$, that is, convex in $\theta$ for each $\beta$ value and vice-versa. The problems are however non-convex: in particular, if $(\theta, \beta)$ is a solution, then $(-\theta, -\beta)$ is also a solution.
The orthogonality constraints (3.5c) inherently limit the number of possible problems in the series to $K$, since we assumed that there are no empty classes. Moreover, as $\mathbf{X}$ is centered, the $K-1$ first optimal scores are orthogonal to $\mathbf{1}$ (and the $K$th problem would be solved by $\beta_K = \mathbf{0}$). All the problems considered here can be solved through the eigendecomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, to simplify all expressions, in the sequel we no longer mention the orthogonality constraints (3.5c), which apply throughout. The generic problem solved is thus
\[
\min_{\theta \in \mathbb{R}^K, \, \beta \in \mathbb{R}^p} \; \|\mathbf{Y}\theta - \mathbf{X}\beta\|^2 + \beta^\top \mathbf{\Omega} \beta \qquad (4.1\mathrm{a})
\]
\[
\text{s.t.} \quad n^{-1} \, \theta^\top \mathbf{Y}^\top \mathbf{Y} \theta = 1 . \qquad (4.1\mathrm{b})
\]
For a given score vector $\theta$, the discriminant direction $\beta$ that minimizes the p-OS criterion (4.1) is the penalized least squares estimator
\[
\beta_{\mathrm{os}} = \big( \mathbf{X}^\top\mathbf{X} + \mathbf{\Omega} \big)^{-1} \mathbf{X}^\top \mathbf{Y} \theta . \qquad (4.2)
\]
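The penalized least squares estimator (4.2) is a ridge-type linear system. A minimal sketch (illustrative; it solves the system rather than forming an explicit inverse):

```python
import numpy as np

def penalized_ls(X, Y, theta, Omega):
    """beta_os = (X^T X + Omega)^{-1} X^T Y theta, as in equation (4.2)."""
    A = X.T @ X + Omega
    return np.linalg.solve(A, X.T @ (Y @ theta))
```

The returned vector satisfies the normal equations $(\mathbf{X}^\top\mathbf{X} + \mathbf{\Omega})\beta = \mathbf{X}^\top\mathbf{Y}\theta$ by construction.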
The objective function (4.1a) is then
\[
\|\mathbf{Y}\theta - \mathbf{X}\beta_{\mathrm{os}}\|^2 + \beta_{\mathrm{os}}^\top \mathbf{\Omega} \beta_{\mathrm{os}}
= \theta^\top\mathbf{Y}^\top\mathbf{Y}\theta - 2\,\theta^\top\mathbf{Y}^\top\mathbf{X}\beta_{\mathrm{os}} + \beta_{\mathrm{os}}^\top \big( \mathbf{X}^\top\mathbf{X} + \mathbf{\Omega} \big) \beta_{\mathrm{os}}
\]
\[
= \theta^\top\mathbf{Y}^\top\mathbf{Y}\theta - \theta^\top\mathbf{Y}^\top\mathbf{X} \big( \mathbf{X}^\top\mathbf{X} + \mathbf{\Omega} \big)^{-1} \mathbf{X}^\top\mathbf{Y}\theta ,
\]
where the second line stems from the definition (4.2) of $\beta_{\mathrm{os}}$. Now, using the fact that the optimal $\theta$ obeys constraint (4.1b), the optimization problem is equivalent to
\[
\max_{\theta : \, n^{-1}\theta^\top\mathbf{Y}^\top\mathbf{Y}\theta = 1} \; \theta^\top\mathbf{Y}^\top\mathbf{X} \big( \mathbf{X}^\top\mathbf{X} + \mathbf{\Omega} \big)^{-1} \mathbf{X}^\top\mathbf{Y}\theta , \qquad (4.3)
\]
which shows that the optimization of the p-OS problem with respect to $\theta_k$ boils down to finding the $k$th largest eigenvector of $\mathbf{Y}^\top\mathbf{X} \big( \mathbf{X}^\top\mathbf{X} + \mathbf{\Omega} \big)^{-1} \mathbf{X}^\top\mathbf{Y}$. Indeed, Appendix C details that Problem (4.3) is solved by
\[
(\mathbf{Y}^\top\mathbf{Y})^{-1} \mathbf{Y}^\top\mathbf{X} \big( \mathbf{X}^\top\mathbf{X} + \mathbf{\Omega} \big)^{-1} \mathbf{X}^\top\mathbf{Y} \, \theta = \alpha^2 \theta , \qquad (4.4)
\]
where $\alpha^2$ is the maximal eigenvalue:¹
\[
n^{-1}\theta^\top\mathbf{Y}^\top\mathbf{X} \big( \mathbf{X}^\top\mathbf{X} + \mathbf{\Omega} \big)^{-1} \mathbf{X}^\top\mathbf{Y}\theta = \alpha^2 \, n^{-1}\theta^\top (\mathbf{Y}^\top\mathbf{Y}) \theta
\]
\[
n^{-1}\theta^\top\mathbf{Y}^\top\mathbf{X} \big( \mathbf{X}^\top\mathbf{X} + \mathbf{\Omega} \big)^{-1} \mathbf{X}^\top\mathbf{Y}\theta = \alpha^2 . \qquad (4.5)
\]
4.1.2 Penalized Canonical Correlation Analysis
As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables $\mathbf{X}$ and $\mathbf{Y}$ is defined as follows:
\[
\max_{\theta \in \mathbb{R}^K, \, \beta \in \mathbb{R}^p} \; n^{-1} \theta^\top\mathbf{Y}^\top\mathbf{X}\beta \qquad (4.6\mathrm{a})
\]
\[
\text{s.t.} \quad n^{-1} \, \theta^\top\mathbf{Y}^\top\mathbf{Y}\theta = 1 , \qquad (4.6\mathrm{b})
\]
\[
\qquad n^{-1} \, \beta^\top \big( \mathbf{X}^\top\mathbf{X} + \mathbf{\Omega} \big) \beta = 1 . \qquad (4.6\mathrm{c})
\]
The solutions to (4.6) are obtained by finding the saddle points of the Lagrangian:
\[
n L(\beta, \theta, \nu, \gamma) = \theta^\top\mathbf{Y}^\top\mathbf{X}\beta - \nu \big( \theta^\top\mathbf{Y}^\top\mathbf{Y}\theta - n \big) - \gamma \big( \beta^\top (\mathbf{X}^\top\mathbf{X} + \mathbf{\Omega}) \beta - n \big)
\]
\[
\Rightarrow \;\; n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \beta} = \mathbf{X}^\top\mathbf{Y}\theta - 2\gamma (\mathbf{X}^\top\mathbf{X} + \mathbf{\Omega}) \beta
\]
\[
\Rightarrow \;\; \beta_{\mathrm{cca}} = \frac{1}{2\gamma} (\mathbf{X}^\top\mathbf{X} + \mathbf{\Omega})^{-1} \mathbf{X}^\top\mathbf{Y}\theta .
\]
Then, as $\beta_{\mathrm{cca}}$ obeys (4.6c), we obtain
\[
\beta_{\mathrm{cca}} = \frac{(\mathbf{X}^\top\mathbf{X} + \mathbf{\Omega})^{-1} \mathbf{X}^\top\mathbf{Y}\theta}{\sqrt{n^{-1}\theta^\top\mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \mathbf{\Omega})^{-1}\mathbf{X}^\top\mathbf{Y}\theta}} , \qquad (4.7)
\]
so that the optimal objective function (4.6a) can be expressed with $\theta$ alone:
\[
n^{-1}\theta^\top\mathbf{Y}^\top\mathbf{X}\beta_{\mathrm{cca}}
= \frac{n^{-1}\theta^\top\mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \mathbf{\Omega})^{-1}\mathbf{X}^\top\mathbf{Y}\theta}{\sqrt{n^{-1}\theta^\top\mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \mathbf{\Omega})^{-1}\mathbf{X}^\top\mathbf{Y}\theta}}
= \sqrt{n^{-1}\theta^\top\mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \mathbf{\Omega})^{-1}\mathbf{X}^\top\mathbf{Y}\theta} ,
\]
and the optimization problem with respect to $\theta$ can be restated as
\[
\max_{\theta : \, n^{-1}\theta^\top\mathbf{Y}^\top\mathbf{Y}\theta = 1} \; \theta^\top\mathbf{Y}^\top\mathbf{X} \big( \mathbf{X}^\top\mathbf{X} + \mathbf{\Omega} \big)^{-1} \mathbf{X}^\top\mathbf{Y}\theta . \qquad (4.8)
\]
Hence, the p-OS and p-CCA problems produce the same optimal score vectors $\theta$. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):
\[
\beta_{\mathrm{os}} = \alpha \, \beta_{\mathrm{cca}} , \qquad (4.9)
\]
¹ The awkward notation $\alpha^2$ for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5), for example).
where $\alpha$ is defined by (4.5). The p-CCA optimization problem can also be written as a function of $\beta$ alone, using the optimality conditions for $\theta$:
\[
n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \theta} = \mathbf{Y}^\top\mathbf{X}\beta - 2\nu \mathbf{Y}^\top\mathbf{Y}\theta
\]
\[
\Rightarrow \;\; \theta_{\mathrm{cca}} = \frac{1}{2\nu} (\mathbf{Y}^\top\mathbf{Y})^{-1} \mathbf{Y}^\top\mathbf{X}\beta . \qquad (4.10)
\]
Then, as $\theta_{\mathrm{cca}}$ obeys (4.6b), we obtain
\[
\theta_{\mathrm{cca}} = \frac{(\mathbf{Y}^\top\mathbf{Y})^{-1} \mathbf{Y}^\top\mathbf{X}\beta}{\sqrt{n^{-1}\beta^\top\mathbf{X}^\top\mathbf{Y}(\mathbf{Y}^\top\mathbf{Y})^{-1}\mathbf{Y}^\top\mathbf{X}\beta}} , \qquad (4.11)
\]
leading to the following expression of the optimal objective function:
\[
n^{-1}\theta_{\mathrm{cca}}^\top\mathbf{Y}^\top\mathbf{X}\beta
= \frac{n^{-1}\beta^\top\mathbf{X}^\top\mathbf{Y}(\mathbf{Y}^\top\mathbf{Y})^{-1}\mathbf{Y}^\top\mathbf{X}\beta}{\sqrt{n^{-1}\beta^\top\mathbf{X}^\top\mathbf{Y}(\mathbf{Y}^\top\mathbf{Y})^{-1}\mathbf{Y}^\top\mathbf{X}\beta}}
= \sqrt{n^{-1}\beta^\top\mathbf{X}^\top\mathbf{Y}(\mathbf{Y}^\top\mathbf{Y})^{-1}\mathbf{Y}^\top\mathbf{X}\beta} .
\]
The p-CCA problem can thus be solved with respect to $\beta$ by plugging this value into (4.6):
\[
\max_{\beta \in \mathbb{R}^p} \; n^{-1}\beta^\top\mathbf{X}^\top\mathbf{Y}(\mathbf{Y}^\top\mathbf{Y})^{-1}\mathbf{Y}^\top\mathbf{X}\beta \qquad (4.12\mathrm{a})
\]
\[
\text{s.t.} \quad n^{-1} \, \beta^\top \big( \mathbf{X}^\top\mathbf{X} + \mathbf{\Omega} \big) \beta = 1 , \qquad (4.12\mathrm{b})
\]
where the positive objective function has been squared compared to (4.6). This formulation is important since it will be used to link p-CCA to p-LDA. We thus derive its solution: following the reasoning of Appendix C, $\beta_{\mathrm{cca}}$ verifies
\[
\mathbf{X}^\top\mathbf{Y}(\mathbf{Y}^\top\mathbf{Y})^{-1}\mathbf{Y}^\top\mathbf{X}\beta_{\mathrm{cca}} = \lambda \big( \mathbf{X}^\top\mathbf{X} + \mathbf{\Omega} \big) \beta_{\mathrm{cca}} , \qquad (4.13)
\]
where $\lambda$ is the maximal eigenvalue, shown below to be equal to $\alpha^2$:
\[
n^{-1}\beta_{\mathrm{cca}}^\top\mathbf{X}^\top\mathbf{Y}(\mathbf{Y}^\top\mathbf{Y})^{-1}\mathbf{Y}^\top\mathbf{X}\beta_{\mathrm{cca}} = \lambda
\]
\[
\Rightarrow \; n^{-1}\alpha^{-1}\beta_{\mathrm{cca}}^\top\mathbf{X}^\top\mathbf{Y}(\mathbf{Y}^\top\mathbf{Y})^{-1}\mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \mathbf{\Omega})^{-1}\mathbf{X}^\top\mathbf{Y}\theta = \lambda
\]
\[
\Rightarrow \; n^{-1}\alpha \, \beta_{\mathrm{cca}}^\top\mathbf{X}^\top\mathbf{Y}\theta = \lambda
\]
\[
\Rightarrow \; n^{-1}\theta^\top\mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \mathbf{\Omega})^{-1}\mathbf{X}^\top\mathbf{Y}\theta = \lambda
\]
\[
\Rightarrow \; \alpha^2 = \lambda .
\]
The first line is obtained from constraint (4.12b); the second line from relationship (4.7), whose denominator is $\alpha$; the third line comes from (4.4); the fourth line uses relationship (4.7) again; and the last one, the definition (4.5) of $\alpha$.
4.1.3 Penalized Linear Discriminant Analysis
Still following Hastie et al. (1995), penalized Linear Discriminant Analysis is defined as follows:
\[
\max_{\beta \in \mathbb{R}^p} \; \beta^\top \mathbf{\Sigma}_{\mathrm{B}} \beta \qquad (4.14\mathrm{a})
\]
\[
\text{s.t.} \quad \beta^\top \big( \mathbf{\Sigma}_{\mathrm{W}} + n^{-1}\mathbf{\Omega} \big) \beta = 1 , \qquad (4.14\mathrm{b})
\]
where $\mathbf{\Sigma}_{\mathrm{B}}$ and $\mathbf{\Sigma}_{\mathrm{W}}$ are respectively the sample between-class and within-class covariance matrices of the original $p$-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.
As the feature matrix $\mathbf{X}$ is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form, amenable to a simple matrix representation using the projection operator $\mathbf{Y}\big(\mathbf{Y}^\top\mathbf{Y}\big)^{-1}\mathbf{Y}^\top$:
\[
\mathbf{\Sigma}_{\mathrm{T}} = \frac{1}{n} \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^\top = n^{-1}\mathbf{X}^\top\mathbf{X} ,
\]
\[
\mathbf{\Sigma}_{\mathrm{B}} = \frac{1}{n} \sum_{k=1}^K n_k \, \hat{\mu}_k \hat{\mu}_k^\top = n^{-1}\mathbf{X}^\top\mathbf{Y}\big(\mathbf{Y}^\top\mathbf{Y}\big)^{-1}\mathbf{Y}^\top\mathbf{X} ,
\]
\[
\mathbf{\Sigma}_{\mathrm{W}} = \frac{1}{n} \sum_{k=1}^K \sum_{i : y_{ik} = 1} (\mathbf{x}_i - \hat{\mu}_k)(\mathbf{x}_i - \hat{\mu}_k)^\top = n^{-1}\Big( \mathbf{X}^\top\mathbf{X} - \mathbf{X}^\top\mathbf{Y}\big(\mathbf{Y}^\top\mathbf{Y}\big)^{-1}\mathbf{Y}^\top\mathbf{X} \Big) .
\]
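These identities are easy to check numerically with a one-hot indicator matrix and centered inputs. The following sketch (synthetic data, illustrative only) verifies the matrix form of $\mathbf{\Sigma}_{\mathrm{B}}$ and the decomposition $\mathbf{\Sigma}_{\mathrm{T}} = \mathbf{\Sigma}_{\mathrm{B}} + \mathbf{\Sigma}_{\mathrm{W}}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 60, 4, 3
y = np.repeat(np.arange(K), n // K)          # balanced classes, none empty
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                           # centered inputs
Y = np.zeros((n, K))
Y[np.arange(n), y] = 1.0                      # one-hot indicator

P = Y @ np.linalg.inv(Y.T @ Y) @ Y.T          # projection onto class indicators
Sigma_T = X.T @ X / n
Sigma_B = X.T @ P @ X / n
Sigma_W = X.T @ (np.eye(n) - P) @ X / n

# direct definitions, from centroids and class counts
mu = np.linalg.inv(Y.T @ Y) @ Y.T @ X         # class centroids (K x p)
nk = Y.sum(axis=0)
Sigma_B_direct = sum(nk[k] * np.outer(mu[k], mu[k]) for k in range(K)) / n
assert np.allclose(Sigma_B, Sigma_B_direct)
assert np.allclose(Sigma_T, Sigma_B + Sigma_W)
```

Note the between-class matrix is written here without grand-mean subtraction, which is valid precisely because the inputs are centered.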
Using these formulae, the solution to the p-LDA problem (4.14) is obtained as
\[
\mathbf{X}^\top\mathbf{Y}\big(\mathbf{Y}^\top\mathbf{Y}\big)^{-1}\mathbf{Y}^\top\mathbf{X}\beta_{\mathrm{lda}}
= \lambda \Big( \mathbf{X}^\top\mathbf{X} + \mathbf{\Omega} - \mathbf{X}^\top\mathbf{Y}\big(\mathbf{Y}^\top\mathbf{Y}\big)^{-1}\mathbf{Y}^\top\mathbf{X} \Big) \beta_{\mathrm{lda}}
\]
\[
\mathbf{X}^\top\mathbf{Y}\big(\mathbf{Y}^\top\mathbf{Y}\big)^{-1}\mathbf{Y}^\top\mathbf{X}\beta_{\mathrm{lda}}
= \frac{\lambda}{1 + \lambda} \big( \mathbf{X}^\top\mathbf{X} + \mathbf{\Omega} \big) \beta_{\mathrm{lda}} .
\]
The comparison of the last equation with (4.13) shows that $\beta_{\mathrm{lda}}$ and $\beta_{\mathrm{cca}}$ are proportional, and that $\lambda/(1+\lambda) = \alpha^2$. Using constraints (4.12b) and (4.14b), it follows that
\[
\beta_{\mathrm{lda}} = (1 - \alpha^2)^{-1/2} \, \beta_{\mathrm{cca}} = \alpha^{-1} (1 - \alpha^2)^{-1/2} \, \beta_{\mathrm{os}} ,
\]
which ends the path from p-OS to p-LDA.
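This equivalence can be checked numerically: the leading p-OS direction, obtained via the eigenproblem (4.4) and Equation (4.2), should be collinear with the leading generalized eigenvector of the p-LDA problem (4.14). The sketch below uses synthetic data and $\mathbf{\Omega} = \mathbf{I}$; all settings are illustrative:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
n, p, K = 90, 5, 3
y = np.repeat(np.arange(K), n // K)
means = 2.0 * rng.normal(size=(K, p))
X = rng.normal(size=(n, p)) + means[y]
X -= X.mean(axis=0)                      # centered inputs
Y = np.zeros((n, K))
Y[np.arange(n), y] = 1.0
Omega = np.eye(p)                        # quadratic penalty (identity, illustrative)

# p-OS route: leading score vector theta from (4.4), then beta_os from (4.2)
Ainv_XtY = np.linalg.solve(X.T @ X + Omega, X.T @ Y)
M = np.linalg.solve(Y.T @ Y, Y.T @ X @ Ainv_XtY)
evals, evecs = np.linalg.eig(M)
theta = np.real(evecs[:, np.argmax(np.real(evals))])
beta_os = Ainv_XtY @ theta

# p-LDA route: leading generalized eigenvector of (Sigma_B, Sigma_W + Omega/n)
P = Y @ np.linalg.solve(Y.T @ Y, Y.T)
Sigma_B = X.T @ P @ X / n
Sigma_W = X.T @ X / n - Sigma_B
w, V = eigh(Sigma_B, Sigma_W + Omega / n)
beta_lda = V[:, -1]                      # eigh sorts eigenvalues in ascending order

# the two directions coincide up to scaling (and sign)
cos = beta_os @ beta_lda / (np.linalg.norm(beta_os) * np.linalg.norm(beta_lda))
```

Since the scalings differ, collinearity is the relevant check: $|\cos|$ should equal 1 up to numerical precision.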
4.1.4 Summary
The three previous subsections considered a generic form of the $k$th problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:
\[
\min_{\mathbf{\Theta}, \mathbf{B}} \; \|\mathbf{Y}\mathbf{\Theta} - \mathbf{X}\mathbf{B}\|_F^2 + \lambda \, \mathrm{tr}\big( \mathbf{B}^\top \mathbf{\Omega} \mathbf{B} \big)
\quad \text{s.t.} \quad n^{-1} \, \mathbf{\Theta}^\top \mathbf{Y}^\top \mathbf{Y} \mathbf{\Theta} = \mathbf{I}_{K-1} .
\]
Let $\mathbf{A}$ denote the $(K-1) \times (K-1)$ diagonal matrix whose elements $\alpha_k$ are the square roots of the leading eigenvalues of $\mathbf{Y}^\top\mathbf{X}\big(\mathbf{X}^\top\mathbf{X} + \mathbf{\Omega}\big)^{-1}\mathbf{X}^\top\mathbf{Y}$, sorted in decreasing order; we have
\[
\mathbf{B}_{\mathrm{LDA}} = \mathbf{B}_{\mathrm{CCA}} \big( \mathbf{I}_{K-1} - \mathbf{A}^2 \big)^{-\frac{1}{2}}
= \mathbf{B}_{\mathrm{OS}} \, \mathbf{A}^{-1} \big( \mathbf{I}_{K-1} - \mathbf{A}^2 \big)^{-\frac{1}{2}} , \qquad (4.15)
\]
where $\mathbf{I}_{K-1}$ is the $(K-1) \times (K-1)$ identity matrix. At this point, the feature matrix $\mathbf{X}$, of dimensions $n \times p$ in the input space, can be projected into the optimal scoring domain as the $n \times (K-1)$ matrix $\mathbf{X}_{\mathrm{OS}} = \mathbf{X}\mathbf{B}_{\mathrm{OS}}$, or into the linear discriminant analysis space as the $n \times (K-1)$ matrix $\mathbf{X}_{\mathrm{LDA}} = \mathbf{X}\mathbf{B}_{\mathrm{LDA}}$. Classification can be performed in either of those domains, provided the appropriate distance (based on the penalized within-class covariance matrix) is applied.
With the aim of performing classification, the whole process can be summarized as follows:

1. Solve the p-OS problem as
\[
\mathbf{B}_{\mathrm{OS}} = \big( \mathbf{X}^\top\mathbf{X} + \lambda\mathbf{\Omega} \big)^{-1} \mathbf{X}^\top\mathbf{Y}\mathbf{\Theta} ,
\]
where $\mathbf{\Theta}$ holds the $K-1$ leading eigenvectors of $\mathbf{Y}^\top\mathbf{X}\big( \mathbf{X}^\top\mathbf{X} + \lambda\mathbf{\Omega} \big)^{-1}\mathbf{X}^\top\mathbf{Y}$.

2. Translate the data samples $\mathbf{X}$ into the LDA domain as $\mathbf{X}_{\mathrm{LDA}} = \mathbf{X}\mathbf{B}_{\mathrm{OS}}\mathbf{D}$, where $\mathbf{D} = \mathbf{A}^{-1}\big( \mathbf{I}_{K-1} - \mathbf{A}^2 \big)^{-\frac{1}{2}}$.

3. Compute the matrix $\mathbf{M}$ of centroids $\hat{\mu}_k$ from $\mathbf{X}_{\mathrm{LDA}}$ and $\mathbf{Y}$.

4. Evaluate the distances $d(\mathbf{x}, \hat{\mu}_k)$ in the LDA domain as a function of $\mathbf{M}$ and $\mathbf{X}_{\mathrm{LDA}}$.

5. Translate distances into posterior probabilities, and assign every sample $i$ to a class $k$ following the maximum a posteriori rule.

6. Graphical representation.
The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Sections 4.2.1, 4.2.2 and 4.2.3, respectively.
4.2 Practicalities
4.2.1 Solution of the Penalized Optimal Scoring Regression
Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:
\[
\min_{\mathbf{\Theta} \in \mathbb{R}^{K \times (K-1)}, \, \mathbf{B} \in \mathbb{R}^{p \times (K-1)}} \; \|\mathbf{Y}\mathbf{\Theta} - \mathbf{X}\mathbf{B}\|_F^2 + \lambda \, \mathrm{tr}\big( \mathbf{B}^\top \mathbf{\Omega} \mathbf{B} \big) \qquad (4.16\mathrm{a})
\]
\[
\text{s.t.} \quad n^{-1} \, \mathbf{\Theta}^\top \mathbf{Y}^\top \mathbf{Y} \mathbf{\Theta} = \mathbf{I}_{K-1} , \qquad (4.16\mathrm{b})
\]
where $\mathbf{\Theta}$ holds the class scores, $\mathbf{B}$ the regression coefficients, and $\|\cdot\|_F$ is the Frobenius norm.
Though non-convex, the OS problem is readily solved by a decomposition in $\mathbf{\Theta}$ and $\mathbf{B}$: the optimal $\mathbf{B}_{\mathrm{OS}}$ does not intervene in the optimality conditions with respect to $\mathbf{\Theta}$, and the optimum with respect to $\mathbf{B}$ is obtained in closed form, as a linear combination of the optimal scores $\mathbf{\Theta}$ (Hastie et al., 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize $\mathbf{\Theta}$ to $\mathbf{\Theta}^0$ such that $n^{-1} \, \mathbf{\Theta}^{0\top}\mathbf{Y}^\top\mathbf{Y}\mathbf{\Theta}^0 = \mathbf{I}_{K-1}$.

2. Compute $\mathbf{B} = \big( \mathbf{X}^\top\mathbf{X} + \lambda\mathbf{\Omega} \big)^{-1}\mathbf{X}^\top\mathbf{Y}\mathbf{\Theta}^0$.

3. Set $\mathbf{\Theta}$ to the $K-1$ leading eigenvectors of $\mathbf{Y}^\top\mathbf{X}\big( \mathbf{X}^\top\mathbf{X} + \lambda\mathbf{\Omega} \big)^{-1}\mathbf{X}^\top\mathbf{Y}$.

4. Compute the optimal regression coefficients
\[
\mathbf{B}_{\mathrm{OS}} = \big( \mathbf{X}^\top\mathbf{X} + \lambda\mathbf{\Omega} \big)^{-1}\mathbf{X}^\top\mathbf{Y}\mathbf{\Theta} . \qquad (4.17)
\]

Defining $\mathbf{\Theta}^0$ in Step 1, instead of using directly $\mathbf{\Theta}$ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on $\mathbf{\Theta}^{0\top}\mathbf{Y}^\top\mathbf{X}\big( \mathbf{X}^\top\mathbf{X} + \lambda\mathbf{\Omega} \big)^{-1}\mathbf{X}^\top\mathbf{Y}\mathbf{\Theta}^0$, which is computed as $\mathbf{\Theta}^{0\top}\mathbf{Y}^\top\mathbf{X}\mathbf{B}$, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.
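A direct transcription of the four steps in NumPy might look as follows. The construction of $\mathbf{\Theta}^0$ (score vectors orthogonal to the constant vector and orthonormal in the $n^{-1}\mathbf{Y}^\top\mathbf{Y}$ metric) is one possible choice, not prescribed by the text; it relies on $\mathbf{X}$ being centered:

```python
import numpy as np

def penalized_os(X, Y, Omega, lam):
    """Four-step penalized optimal scoring (quadratic penalty), following
    the procedure above. X must be centered; Y is the n x K indicator."""
    n, K = Y.shape
    D = Y.T @ Y                                   # diagonal matrix of class counts
    # Step 1: Theta0 with n^{-1} Theta0^T D Theta0 = I, orthogonal to 1 in the D metric
    C = np.eye(K)[:, : K - 1]
    d = D @ np.ones(K)                            # class counts; d.sum() == n
    C = C - np.outer(np.ones(K), d[: K - 1]) / d.sum()
    L = np.linalg.cholesky(C.T @ D @ C / n)
    Theta0 = C @ np.linalg.inv(L).T
    # Step 2: ridge-type fit against Y @ Theta0
    A = X.T @ X + lam * Omega
    B = np.linalg.solve(A, X.T @ (Y @ Theta0))
    # Step 3: eigen-analysis of the small (K-1) x (K-1) matrix Theta0^T Y^T X B
    S = Theta0.T @ (Y.T @ (X @ B))
    evals, V = np.linalg.eigh((S + S.T) / 2)      # symmetric by construction
    order = np.argsort(evals)[::-1]               # leading eigenvectors first
    Theta = Theta0 @ V[:, order]
    # Step 4: optimal regression coefficients (4.17)
    B_os = np.linalg.solve(A, X.T @ (Y @ Theta))
    return Theta, B_os
```

The point of the trick is visible in Step 3: the eigen-analysis runs on a $(K-1) \times (K-1)$ matrix computed from the already-available fit $\mathbf{B}$, never on the $K \times K$ operator involving the $p \times p$ inverse.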
This four-step algorithm is valid when the penalty has the form $\mathbf{B}^\top\mathbf{\Omega}\mathbf{B}$. However, when an $\ell_1$ penalty is applied in (4.16), the optimization algorithm requires iterative updates of $\mathbf{B}$ and $\mathbf{\Theta}$. That situation is developed by Clemmensen et al. (2011), where a Lasso or an elastic-net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and elastic-net penalties do not enjoy the equivalence with LDA problems.
4.2.2 Distance Evaluation
The simplest classification rule is the nearest-centroid rule, where sample $\mathbf{x}_i$ is assigned to class $k$ if it is closer (in terms of the shared within-class Mahalanobis distance) to centroid $\mu_k$ than to any other centroid $\mu_\ell$. In general, the parameters of the model are unknown, and the rule is applied with parameters estimated from training data (the sample estimators $\hat{\mu}_k$ and $\mathbf{\Sigma}_{\mathrm{W}}$). If $\hat{\mu}_k$ are the centroids in the input space, sample $\mathbf{x}_i$ is assigned to class $k$ if the distance
\[
d(\mathbf{x}_i, \hat{\mu}_k) = (\mathbf{x}_i - \hat{\mu}_k)^\top \mathbf{\Sigma}_{\mathrm{W}\Omega}^{-1} (\mathbf{x}_i - \hat{\mu}_k) - 2 \log\big( n_k / n \big) \qquad (4.18)
\]
is minimal over all $k$. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment for unequal class sizes that estimates the prior probability of class $k$. Note that this rule is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al., 2009; Mai et al., 2012). The matrix $\mathbf{\Sigma}_{\mathrm{W}\Omega}$ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:
\[
\mathbf{\Sigma}_{\mathrm{W}\Omega}^{-1} = \Big( n^{-1}\big( \mathbf{X}^\top\mathbf{X} + \lambda\mathbf{\Omega} \big) - \mathbf{\Sigma}_{\mathrm{B}} \Big)^{-1}
= \Big( n^{-1}\mathbf{X}^\top\mathbf{X} - \mathbf{\Sigma}_{\mathrm{B}} + n^{-1}\lambda\mathbf{\Omega} \Big)^{-1}
= \Big( \mathbf{\Sigma}_{\mathrm{W}} + n^{-1}\lambda\mathbf{\Omega} \Big)^{-1} . \qquad (4.19)
\]
Before explaining how to compute the distances, let us summarize some clarifying points:

- the solution $\mathbf{B}_{\mathrm{OS}}$ of the p-OS problem is enough to accomplish classification;
- in the LDA domain (the space of discriminant variates $\mathbf{X}_{\mathrm{LDA}}$), classification is based on Euclidean distances;
- classification can be done in a reduced-rank space of dimension $R < K-1$, by using the first $R$ discriminant directions $\{\beta_k\}_{k=1}^R$.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, the distance is
\[
\big\| (\mathbf{x}_i - \hat{\mu}_k)\mathbf{B}_{\mathrm{OS}} \big\|_{\mathbf{\Sigma}_{\mathrm{W}\Omega}}^2 - 2 \log(\hat{\pi}_k) ,
\]
where $\hat{\pi}_k$ is the estimated class prior and $\|\cdot\|_{\mathbf{S}}$ denotes the Mahalanobis norm with within-class covariance $\mathbf{S}$. If classification is done in the p-LDA domain, the distance is
\[
\big\| (\mathbf{x}_i - \hat{\mu}_k)\mathbf{B}_{\mathrm{OS}} \mathbf{A}^{-1}\big( \mathbf{I}_{K-1} - \mathbf{A}^2 \big)^{-\frac{1}{2}} \big\|_2^2 - 2 \log(\hat{\pi}_k) ,
\]
which is a plain Euclidean distance.
4.2.3 Posterior Probability Evaluation
Let $d(\mathbf{x}, \hat{\mu}_k)$ be the distance between $\mathbf{x}$ and $\hat{\mu}_k$, defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities $p(y_k = 1 \mid \mathbf{x})$ can be estimated as
\[
\hat{p}(y_k = 1 \mid \mathbf{x}) \propto \exp\Big( -\frac{d(\mathbf{x}, \hat{\mu}_k)}{2} \Big)
\propto \hat{\pi}_k \exp\Big( -\frac{1}{2} \big\| (\mathbf{x} - \hat{\mu}_k)\mathbf{B}_{\mathrm{OS}} \mathbf{A}^{-1}\big( \mathbf{I}_{K-1} - \mathbf{A}^2 \big)^{-\frac{1}{2}} \big\|_2^2 \Big) . \qquad (4.20)
\]
These quantities must be normalized to ensure that they sum to one. When the distances $d(\mathbf{x}, \hat{\mu}_k)$ take large values, $\exp\big( -d(\mathbf{x}, \hat{\mu}_k)/2 \big)$ can take extremely small values, generating underflow issues. A classical trick to fix this numerical problem is detailed below:
\[
\hat{p}(y_k = 1 \mid \mathbf{x})
= \frac{\hat{\pi}_k \exp\big( -\frac{d(\mathbf{x}, \hat{\mu}_k)}{2} \big)}{\sum_\ell \hat{\pi}_\ell \exp\big( -\frac{d(\mathbf{x}, \hat{\mu}_\ell)}{2} \big)}
= \frac{\hat{\pi}_k \exp\big( \frac{-d(\mathbf{x}, \hat{\mu}_k) + d_{\max}}{2} \big)}{\sum_\ell \hat{\pi}_\ell \exp\big( \frac{-d(\mathbf{x}, \hat{\mu}_\ell) + d_{\max}}{2} \big)} ,
\]
where $d_{\max} = \max_k d(\mathbf{x}, \hat{\mu}_k)$.
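A floating-point sketch of this normalization (function and argument names are illustrative). Note it shifts by the largest exponent, i.e., the smallest distance; this is a common variant of the trick above that guards against overflow as well as underflow, and it leaves the normalized ratio unchanged:

```python
import numpy as np

def posteriors(d, priors):
    """Class posteriors from a matrix of distances d (n x K) and class priors.

    Works even when exp(-d/2) would underflow to zero for every class.
    """
    d = np.asarray(d, dtype=float)
    logits = -d / 2.0 + np.log(priors)            # log of unnormalized posteriors
    logits -= logits.max(axis=1, keepdims=True)   # shift so exp() arguments are <= 0
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)
```

For example, with distances 2000 and 2002 both `exp(-d/2)` terms underflow to zero in double precision, yet the shifted computation returns the exact ratio $1/(1+e^{-1})$ for the closer class.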
4.2.4 Graphical Representation
Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. This can be accomplished by plotting the first two or three dimensions of the regression fits $\mathbf{X}_{\mathrm{OS}}$, or of the discriminant variates $\mathbf{X}_{\mathrm{LDA}}$, depending on whether the data set is represented in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.
4.3 From Sparse Optimal Scoring to Sparse LDA
The equivalence stated in Section 4.1 holds for quadratic penalties of the form $\beta^\top\mathbf{\Omega}\beta$, under the assumption that $\mathbf{Y}^\top\mathbf{Y}$ and $\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{\Omega}$ are full rank (fulfilled when there are no empty classes and $\mathbf{\Omega}$ is positive definite). Quadratic penalties have interesting properties but, as recalled in Section 2.3, they do not induce sparsity. In this respect, $\ell_1$ penalties are preferable, but they lack a connection between p-LDA and p-OS such as the one stated by Hastie et al. (1995).
In this section, we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces groups of zeros in the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. Accordingly, we will show that our formulation of the group-Lasso can be written in the quadratic form $\mathbf{B}^\top\mathbf{\Omega}\mathbf{B}$.
4.3.1 A Quadratic Variational Form
Quadratic variational forms of the Lasso and group-Lasso were proposed shortly after the original Lasso paper of Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet, 1998; Canu and Grandvalet, 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al., 2012).
Our formulation of the group-Lasso is shown below:
\[
\min_{\tau \in \mathbb{R}^p} \; \min_{\mathbf{B} \in \mathbb{R}^{p \times (K-1)}} \; J(\mathbf{B}) + \lambda \sum_{j=1}^p w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j} \qquad (4.21\mathrm{a})
\]
\[
\text{s.t.} \quad \sum_j \tau_j - \sum_j w_j \|\beta^j\|_2 \le 0 , \qquad (4.21\mathrm{b})
\]
\[
\qquad \tau_j \ge 0 , \;\; j = 1, \ldots, p , \qquad (4.21\mathrm{c})
\]
where $\mathbf{B} \in \mathbb{R}^{p \times (K-1)}$ is a matrix composed of the row vectors $\beta^j \in \mathbb{R}^{K-1}$, that is, $\mathbf{B} = \big( \beta^{1\top}, \ldots, \beta^{p\top} \big)^\top$, and the $w_j$ are predefined nonnegative weights. In our context, the cost function $J(\mathbf{B})$ is the OS regression loss $\|\mathbf{Y}\mathbf{\Theta} - \mathbf{X}\mathbf{B}\|_2^2$; from now on, for simplicity, we keep the generic notation $J(\mathbf{B})$. Here and in what follows, $b/\tau$ is defined by continuation at zero: $b/0 = +\infty$ if $b \neq 0$, and $0/0 = 0$. Note that variants of (4.21) have been proposed elsewhere (see, e.g., Canu and Grandvalet, 1999; Bach et al., 2012, and references therein).
The intuition behind our approach is that, through the variational formulation, we recast a non-quadratic penalty as the convex hull of a family of quadratic penalties indexed by the variables $\tau_j$. This is shown graphically in Figure 4.1.
Let us start by proving the equivalence of our variational formulation with the standard group-Lasso (an alternative variational formulation is detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty in $\beta^j$ in (4.21) acts as the group-Lasso penalty $\lambda \sum_{j=1}^p w_j \|\beta^j\|_2$.
Proof. The Lagrangian of Problem (4.21) is
\[
L = J(\mathbf{B}) + \lambda \sum_{j=1}^p w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j}
+ \nu_0 \Big( \sum_{j=1}^p \tau_j - \sum_{j=1}^p w_j \|\beta^j\|_2 \Big) - \sum_{j=1}^p \nu_j \tau_j .
\]
Figure 4.1: Graphical representation of the variational approach to the group-Lasso.
Thus, the first-order optimality conditions for $\tau_j^\star$ are
\[
\frac{\partial L}{\partial \tau_j}(\tau_j^\star) = 0
\;\Leftrightarrow\; -\lambda w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0
\]
\[
\;\Leftrightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \, \tau_j^{\star 2} - \nu_j \, \tau_j^{\star 2} = 0
\]
\[
\;\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \, \tau_j^{\star 2} = 0 .
\]
The last line is obtained from complementary slackness, which implies here $\nu_j \tau_j^\star = 0$: complementary slackness states that $\nu_j^\star g_j(\tau_j^\star) = 0$, where $\nu_j^\star$ is the Lagrange multiplier of the constraint $g_j(\tau_j) \le 0$. As a result, the optimal value of $\tau_j$ is
\[
\tau_j^\star = \sqrt{\frac{\lambda w_j^2 \|\beta^j\|_2^2}{\nu_0}} = \sqrt{\frac{\lambda}{\nu_0}} \, w_j \|\beta^j\|_2 . \qquad (4.22)
\]
We note that $\nu_0 \neq 0$ if there is at least one coefficient $\beta_{jk} \neq 0$; thus, the inequality constraint (4.21b) is active (due to complementary slackness):
\[
\sum_{j=1}^p \tau_j^\star - \sum_{j=1}^p w_j \|\beta^j\|_2 = 0 , \qquad (4.23)
\]
so that $\tau_j^\star = w_j \|\beta^j\|_2$. Plugging this value into (4.21a), we conclude that Problem (4.21) is equivalent to the standard group-Lasso:
\[
\min_{\mathbf{B} \in \mathbb{R}^{p \times (K-1)}} \; J(\mathbf{B}) + \lambda \sum_{j=1}^p w_j \|\beta^j\|_2 . \qquad (4.24)
\]
We have thus presented a convex quadratic variational form of the group-Lasso, and demonstrated its equivalence with the standard group-Lasso formulation.
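The lemma suggests a simple reweighting scheme: alternate the closed-form update $\tau_j = w_j\|\beta^j\|_2$ with a quadratically penalized fit. The sketch below applies it to a generic least squares cost $J(\mathbf{B}) = \frac{1}{2}\|\mathbf{Y} - \mathbf{X}\mathbf{B}\|_F^2$ (with an $\varepsilon$-smoothing of the reweighting and a ridge initialization, both practical safeguards not discussed in the text); at its fixed point, the stationarity condition for the active rows holds:

```python
import numpy as np

def group_lasso_irls(X, Y, lam, w=None, n_iter=500, eps=1e-10):
    """Group-Lasso via the quadratic variational form: alternate
    tau_j = w_j * ||beta^j||_2 with a ridge-type solve for B.
    J(B) = 0.5 * ||Y - X B||_F^2; rows beta^j are shrunk jointly.
    eps guards the 1/||beta^j|| reweighting (practical smoothing)."""
    n, p = X.shape
    if w is None:
        w = np.ones(p)
    B = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)  # ridge start
    for _ in range(n_iter):
        norms = np.maximum(np.linalg.norm(B, axis=1), eps)
        Omega = np.diag(w / norms)        # Omega_jj = w_j / ||beta^j||_2
        B = np.linalg.solve(X.T @ X + lam * Omega, X.T @ Y)
    return B
```

At convergence, rows whose optimal norm is zero are driven numerically to (near) zero by the ever-growing reweighting, while the non-zero rows satisfy the group-Lasso stationarity equation exactly.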
With Lemma 4.1, we have proved that, under constraints (4.21b)-(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently presented as $\lambda \, \mathrm{tr}\big( \mathbf{B}^\top\mathbf{\Omega}\mathbf{B} \big)$, where
\[
\mathbf{\Omega} = \mathrm{diag}\Big( \frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \ldots, \frac{w_p^2}{\tau_p} \Big) , \qquad (4.25)
\]
with $\tau_j = w_j \|\beta^j\|_2$, resulting in the diagonal components
\[
(\mathbf{\Omega})_{jj} = \frac{w_j}{\|\beta^j\|_2} . \qquad (4.26)
\]
As stated at the beginning of this section, the equivalence between p-LDA and p-OS problems is thereby demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set strategy described in Chapter 5.
The first property states that the quadratic formulation is convex when $J$ is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If $J$ is convex, Problem (4.21) is convex.

Proof. The function $g(\beta, \tau) = \|\beta\|_2^2/\tau$, known as the perspective function of $f(\beta) = \|\beta\|_2^2$, is jointly convex in $(\beta, \tau)$ (see, e.g., Boyd and Vandenberghe, 2004, Chapter 3), and the constraints (4.21b)-(4.21c) define convex admissible sets; hence, Problem (4.21) is jointly convex with respect to $(\mathbf{B}, \tau)$.
In what follows, $J$ will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all $\mathbf{B} \in \mathbb{R}^{p \times (K-1)}$, the subdifferential of the objective function of Problem (4.24) is
\[
\Big\{ \mathbf{V} \in \mathbb{R}^{p \times (K-1)} : \; \mathbf{V} = \frac{\partial J(\mathbf{B})}{\partial \mathbf{B}} + \lambda \mathbf{G} \Big\} , \qquad (4.27)
\]
where $\mathbf{G} \in \mathbb{R}^{p \times (K-1)}$ is a matrix composed of row vectors $\mathbf{g}^j \in \mathbb{R}^{K-1}$, $\mathbf{G} = \big( \mathbf{g}^{1\top}, \ldots, \mathbf{g}^{p\top} \big)^\top$, defined as follows. Let $\mathcal{S}(\mathbf{B})$ denote the columnwise support of $\mathbf{B}$, $\mathcal{S}(\mathbf{B}) = \{ j \in \{1, \ldots, p\} : \|\beta^j\|_2 \neq 0 \}$; then we have
\[
\forall j \in \mathcal{S}(\mathbf{B}) , \;\; \mathbf{g}^j = w_j \|\beta^j\|_2^{-1} \beta^j , \qquad (4.28)
\]
\[
\forall j \notin \mathcal{S}(\mathbf{B}) , \;\; \|\mathbf{g}^j\|_2 \le w_j . \qquad (4.29)
\]
This condition results in an equality for the "active" non-zero vectors $\beta^j$, and in an inequality for the other ones; both provide essential building blocks of our algorithm.
Proof. When $\|\beta^j\|_2 \neq 0$, the gradient of the penalty with respect to $\beta^j$ is
\[
\frac{\partial}{\partial \beta^j} \Big( \lambda \sum_{m=1}^p w_m \|\beta^m\|_2 \Big) = \lambda w_j \frac{\beta^j}{\|\beta^j\|_2} . \qquad (4.30)
\]
At $\|\beta^j\|_2 = 0$, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al., 2011):
\[
\partial_{\beta^j} \Big( \lambda \sum_{m=1}^p w_m \|\beta^m\|_2 \Big) = \partial_{\beta^j} \big( \lambda w_j \|\beta^j\|_2 \big)
= \big\{ \lambda w_j \mathbf{v} \in \mathbb{R}^{K-1} : \|\mathbf{v}\|_2 \le 1 \big\} , \qquad (4.31)
\]
which gives expression (4.29).
Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if $J$ is strictly convex. All critical points $\mathbf{B}^\star$ of the objective function verifying the following conditions are global minima:
\[
\forall j \in \mathcal{S}^\star , \;\; \frac{\partial J(\mathbf{B}^\star)}{\partial \beta^j} + \lambda w_j \|\beta^{\star j}\|_2^{-1} \beta^{\star j} = 0 , \qquad (4.32\mathrm{a})
\]
\[
\forall j \notin \mathcal{S}^\star , \;\; \Big\| \frac{\partial J(\mathbf{B}^\star)}{\partial \beta^j} \Big\|_2 \le \lambda w_j , \qquad (4.32\mathrm{b})
\]
where $\mathcal{S}^\star \subseteq \{1, \ldots, p\}$ denotes the set of non-zero row vectors $\beta^{\star j}$, and $\bar{\mathcal{S}}^\star$ is its complement.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled through a direct analysis of the variational problem (4.21).
4.3.2 Group-Lasso OS as Penalized LDA
With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem
\[
\mathbf{B}_{\mathrm{OS}} = \operatorname*{argmin}_{\mathbf{B} \in \mathbb{R}^{p \times (K-1)}} \; \min_{\mathbf{\Theta} \in \mathbb{R}^{K \times (K-1)}} \; \frac{1}{2} \|\mathbf{Y}\mathbf{\Theta} - \mathbf{X}\mathbf{B}\|_F^2 + \lambda \sum_{j=1}^p w_j \|\beta^j\|_2
\quad \text{s.t.} \quad n^{-1} \, \mathbf{\Theta}^\top\mathbf{Y}^\top\mathbf{Y}\mathbf{\Theta} = \mathbf{I}_{K-1}
\]
is equivalent to the penalized LDA problem
\[
\mathbf{B}_{\mathrm{LDA}} = \operatorname*{argmax}_{\mathbf{B} \in \mathbb{R}^{p \times (K-1)}} \; \mathrm{tr}\big( \mathbf{B}^\top \mathbf{\Sigma}_{\mathrm{B}} \mathbf{B} \big)
\quad \text{s.t.} \quad \mathbf{B}^\top \big( \mathbf{\Sigma}_{\mathrm{W}} + n^{-1}\lambda\mathbf{\Omega} \big) \mathbf{B} = \mathbf{I}_{K-1} ,
\]
where $\mathbf{\Omega} = \mathrm{diag}\big( w_1^2/\tau_1, \ldots, w_p^2/\tau_p \big)$, with
\[
\mathbf{\Omega}_{jj} =
\begin{cases}
+\infty & \text{if } \beta_{\mathrm{os}}^j = 0 , \\
w_j \|\beta_{\mathrm{os}}^j\|_2^{-1} & \text{otherwise} .
\end{cases} \qquad (4.33)
\]
That is, $\mathbf{B}_{\mathrm{LDA}} = \mathbf{B}_{\mathrm{OS}} \, \mathrm{diag}\big( \alpha_k^{-1} (1 - \alpha_k^2)^{-1/2} \big)$, where $\alpha_k \in (0, 1)$ is the $k$th leading eigenvalue of
\[
n^{-1} \mathbf{Y}^\top\mathbf{X}\big( \mathbf{X}^\top\mathbf{X} + \lambda\mathbf{\Omega} \big)^{-1}\mathbf{X}^\top\mathbf{Y} .
\]
Proof The proof simply consists in applying the result of Hastie et al (1995) whichholds for quadratic penalties to the quadratic variational form of the group-Lasso
The proposition applies in particular to the Lasso-based OS approaches to sparseLDA (Grosenick et al 2008 Clemmensen et al 2011) for K = 2 that is for binaryclassification or more generally for a single discriminant direction Note however thatit leads to a slightly different decision rule if the decision threshold is chosen a prioriaccording to the Gaussian assumption for the features For more than one discriminantdirection the equivalence does not hold any more since the Lasso penalty does notresult in an equivalent quadratic penalty in the simple form tr
(BgtΩB
)
5 GLOSS Algorithm
The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems whose sizes are incrementally increased/decreased (Osborne et al., 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer, 2008). We adapt this algorithmic framework to the variational form (4.21), with J(B) = (1/2) ‖YΘ − XB‖²_F.

The algorithm belongs to the working set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables, currently identified as non-zero. Then it iterates the three steps summarized below:

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth. First the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more β^j may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is graphically represented in Figure 5.1 as a block diagram, and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach from Appendix D, then we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.
5.1 Regression Coefficients Updates
Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving (K − 1) independent card(A)-dimensional problems instead of a single (K − 1) × card(A)-dimensional problem. The interaction between the (K − 1) problems is relegated to the common adaptive quadratic penalty Ω. This decomposition is especially attractive, as we then solve (K − 1) similar systems

  ( X_A^⊤ X_A + λΩ ) β_k = X_A^⊤ Y θ⁰_k ,   (5.1)
49
5 GLOSS Algorithm
[Block diagram: from an initial model (λ, B), GLOSS forms the active set { j : ‖β^j‖_2 > 0 } and solves the p-OS problem so that B satisfies the first optimality condition; any active variable that must become inactive is taken out of the active set; the second optimality condition is then tested on the inactive set, and any violating variable is moved to the active set; once both conditions hold, Θ is computed, B is updated, and the algorithm stops.]

Figure 5.1: GLOSS block diagram.
Algorithm 1: Adaptively Penalized Optimal Scoring

Input: X, Y, B, λ
Initialize: A ← { j ∈ {1, …, p} : ‖β^j‖_2 > 0 }; Θ⁰ such that n^{−1} Θ⁰⊤ Y^⊤ Y Θ⁰ = I_{K−1}; convergence ← false
repeat
  % Step 1: solve (4.21) in B, assuming A optimal
  repeat
    Ω ← diag(Ω_A), with ω_j ← ‖β^j‖_2^{−1}
    B_A ← ( X_A^⊤ X_A + λΩ )^{−1} X_A^⊤ Y Θ⁰
  until condition (4.32a) holds for all j ∈ A
  % Step 2: identify inactivated variables
  for j ∈ A such that ‖β^j‖_2 = 0 do
    if optimality condition (4.32b) holds then
      A ← A \ {j}; go back to Step 1
    end if
  end for
  % Step 3: check greatest violation of optimality condition (4.32b) in Ā
  j⋆ ← argmax_{j ∈ Ā} ‖∂J/∂β^j‖_2
  if ‖∂J/∂β^{j⋆}‖_2 < λ then
    convergence ← true (B is optimal)
  else
    A ← A ∪ {j⋆}
  end if
until convergence
(s, V) ← eigenanalyze( Θ⁰⊤ Y^⊤ X_A B_A ), that is, Θ⁰⊤ Y^⊤ X_A B_A V_k = s_k V_k, k = 1, …, K − 1
Θ ← Θ⁰ V ;  B ← B V ;  α_k ← n^{−1/2} s_k^{1/2}, k = 1, …, K − 1
Output: Θ, B, α
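The inner loop of Step 1 amounts to iteratively reweighted ridge regressions. A minimal NumPy sketch of this adaptive quadratic update (our own illustration with unit weights, not the thesis' MATLAB implementation; all names are ours):

```python
import numpy as np

def update_B_active(X_A, Y, Theta0, lam, n_iter=50, eps=1e-12):
    """Step 1 sketch: alternate the adaptive quadratic penalty
    omega_j = 1/||beta^j||_2 and the penalized least-squares solve
    B_A = (X_A'X_A + lam*Omega)^{-1} X_A'Y Theta0."""
    p = X_A.shape[1]
    XtX = X_A.T @ X_A
    R = X_A.T @ (Y @ Theta0)                       # fixed right-hand side
    B = np.linalg.solve(XtX + lam * np.eye(p), R)  # ridge warm start
    for _ in range(n_iter):
        omega = 1.0 / np.maximum(np.linalg.norm(B, axis=1), eps)
        B = np.linalg.solve(XtX + lam * np.diag(omega), R)
    return B
```

Each pass is a majorization–minimization step for the group-Lasso penalty, so the objective (1/2)‖YΘ⁰ − X_A B‖²_F + λ Σ_j ‖β^j‖_2 never increases; the eps clamp only guards against division by zero when a row vanishes.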
where X_A denotes the columns of X indexed by A, and β_k and θ⁰_k denote the kth column of B and Θ⁰, respectively. These linear systems only differ in their right-hand-side term, so that a single Cholesky decomposition is sufficient to solve all of them, whereas a blockwise Newton–Raphson method based on the standard group-Lasso formulation would result in different "penalties" Ω for each system.
5.1.1 Cholesky Decomposition
Dropping the subscripts and considering the (K − 1) systems together, (5.1) leads to

  ( X^⊤ X + λΩ ) B = X^⊤ Y Θ .   (5.2)

Defining the Cholesky decomposition C^⊤ C = X^⊤ X + λΩ, (5.2) is solved efficiently as follows:

  C^⊤ C B = X^⊤ Y Θ
  C B = C^⊤ \ X^⊤ Y Θ
  B = C \ ( C^⊤ \ X^⊤ Y Θ ) ,   (5.3)

where the symbol "\" is the MATLAB mldivide operator, which solves linear systems efficiently (here by forward- and back-substitution on triangular systems). The GLOSS code implements (5.3).
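In NumPy, as a stand-in for the MATLAB mldivide calls (a sketch, not the actual GLOSS code; names are ours), the single-factorization solve of (5.3) reads:

```python
import numpy as np

def solve_penalized_ls(X, Y, Theta, lam, Omega):
    """Solve (X'X + lam*Omega) B = X'Y Theta for all K-1 right-hand sides
    with a single Cholesky factorization, mirroring Eq. (5.3)."""
    A = X.T @ X + lam * Omega            # symmetric positive definite
    C = np.linalg.cholesky(A)            # A = C C', with C lower triangular
    RHS = X.T @ (Y @ Theta)
    Z = np.linalg.solve(C, RHS)          # forward substitution:  C Z = RHS
    return np.linalg.solve(C.T, Z)       # back substitution:    C' B = Z
```

Note that NumPy's `cholesky` returns the lower triangular factor, whereas MATLAB's `chol` returns the upper one; the two triangular solves play the role of the two mldivide calls.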
5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer Ω is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of Ω takes very large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of X^⊤X + λΩ. This difficulty can be avoided using the following equivalent expression:

  B = Ω^{−1/2} ( Ω^{−1/2} X^⊤ X Ω^{−1/2} + λI )^{−1} Ω^{−1/2} X^⊤ Y Θ⁰ ,   (5.4)

where the conditioning of Ω^{−1/2} X^⊤ X Ω^{−1/2} + λI is always well-behaved provided X is appropriately normalized (recall that 0 ≤ 1/ω_j ≤ 1). This more stable expression demands more computation and is thus reserved for cases with large ω_j values; our code is otherwise based on expression (5.2).
5.2 Score Matrix

The optimal score matrix Θ is made of the K − 1 leading eigenvectors of Y^⊤X( X^⊤X + λΩ )^{−1}X^⊤Y. This eigen-analysis is actually solved in the form Θ^⊤Y^⊤X( X^⊤X + λΩ )^{−1}X^⊤YΘ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of ( X^⊤X + λΩ )^{−1}, which involves the inversion of a p × p matrix. Let Θ⁰ be an arbitrary K × (K − 1) matrix whose range includes the K − 1 leading eigenvectors of Y^⊤X( X^⊤X + λΩ )^{−1}X^⊤Y.¹ Then, solving the K − 1 systems (5.3) provides the value of B⁰ = ( X^⊤X + λΩ )^{−1}X^⊤YΘ⁰. This B⁰ matrix can be identified in the expression to eigenanalyze as

  Θ⁰⊤ Y^⊤ X ( X^⊤X + λΩ )^{−1} X^⊤ Y Θ⁰ = Θ⁰⊤ Y^⊤ X B⁰ .

Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the (K − 1) × (K − 1) matrix Θ⁰⊤Y^⊤XB⁰ = VΛV^⊤. Defining Θ = Θ⁰V, we have Θ^⊤Y^⊤X( X^⊤X + λΩ )^{−1}X^⊤YΘ = Λ, and, when Θ⁰ is chosen such that n^{−1} Θ⁰⊤Y^⊤YΘ⁰ = I_{K−1}, we also have n^{−1} Θ^⊤Y^⊤YΘ = I_{K−1}, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of Λ are sorted in decreasing order, Θ is an optimal solution to the p-OS problem. Finally, once Θ has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered using the mapping from Θ⁰ to Θ, that is, B = B⁰V. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which Ω is defined by a variational formulation.
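A NumPy transcription of this computation (our own sketch, with all names ours; the construction of Θ⁰ follows the footnote's recipe Θ⁰ = (Y^⊤Y)^{−1/2}U, up to a √n scaling so that the normalization constraint holds exactly):

```python
import numpy as np

def optimal_scores(X, Y, lam, Omega, seed=0):
    """Compute Theta and B from an arbitrary feasible Theta0, via the SVD of
    the small (K-1)x(K-1) matrix Theta0' Y'X B0 (X assumed centered)."""
    n, p = X.shape
    K = Y.shape[1]
    rng = np.random.default_rng(seed)
    # U: orthonormal columns orthogonal to 1_K
    Q, _ = np.linalg.qr(np.column_stack([np.ones(K), rng.standard_normal((K, K - 1))]))
    U = Q[:, 1:]
    counts = Y.sum(axis=0)                   # Y'Y = diag(counts) for indicators
    Theta0 = np.sqrt(n) * U / np.sqrt(counts)[:, None]  # n^{-1} Theta0'Y'Y Theta0 = I
    B0 = np.linalg.solve(X.T @ X + lam * Omega, X.T @ Y @ Theta0)
    M = Theta0.T @ (Y.T @ (X @ B0))          # small symmetric PSD matrix
    V, s, _ = np.linalg.svd((M + M.T) / 2)   # M = V diag(s) V'
    return Theta0 @ V, B0 @ V                # Theta, B = B0 V
```

Only the small (K − 1) × (K − 1) matrix is eigenanalyzed; the p × p inverse is never formed explicitly beyond the linear solves already needed for B⁰.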
5.3 Optimality Conditions

GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix Θ. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4, from which optimality conditions (4.32a) and (4.32b) can be deduced. Both expressions require the computation of the gradient of the objective function

  (1/2) ‖YΘ − XB‖²_F + λ Σ_{j=1}^p w_j ‖β^j‖_2 .   (5.5)

Let J(B) be the data-fitting term (1/2) ‖YΘ − XB‖²_F. Its gradient with respect to the jth row of B, β^j, is the (K − 1)-dimensional vector

  ∂J(B)/∂β^j = x_j^⊤ ( XB − YΘ ) ,

where x_j is the jth column of X. Hence the first optimality condition (4.32a) can be computed for every variable j as

  x_j^⊤ ( XB − YΘ ) + λ w_j β^j / ‖β^j‖_2 = 0 .
¹ As X is centered, 1_K belongs to the null space of Y^⊤X( X^⊤X + λΩ )^{−1}X^⊤Y. It is thus sufficient to choose Θ⁰ orthogonal to 1_K to ensure that its range spans the leading eigenvectors of Y^⊤X( X^⊤X + λΩ )^{−1}X^⊤Y. In practice, to comply with this desideratum and conditions (3.5b) and (3.5c), we set Θ⁰ = ( Y^⊤Y )^{−1/2} U, where U is a K × (K − 1) matrix whose columns are orthonormal vectors orthogonal to 1_K.
The second optimality condition (4.32b) can be computed for every variable j as

  ‖ x_j^⊤ ( XB − YΘ ) ‖_2 ≤ λ w_j .
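The gradient formula underlying both conditions can be sanity-checked by finite differences (a toy check of ∂J/∂β^j = x_j^⊤(XB − YΘ), on our own random data):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, K = 25, 5, 3
X = rng.standard_normal((n, p))
Y = np.eye(K)[rng.integers(0, K, n)]
Theta = rng.standard_normal((K, K - 1))
B = rng.standard_normal((p, K - 1))

J = lambda M: 0.5 * np.linalg.norm(Y @ Theta - X @ M, 'fro') ** 2
grad = X.T @ (X @ B - Y @ Theta)     # row j is dJ/d beta^j

eps = 1e-6                           # central finite difference on entry (2, 1)
E = np.zeros_like(B); E[2, 1] = eps
fd = (J(B + E) - J(B - E)) / (2 * eps)
```

The central difference `fd` agrees with `grad[2, 1]` up to O(eps²) error.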
5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set, containing the variables that have already been considered relevant. A variable j can be considered for inclusion into the active set if it violates the second optimality condition. We proceed one variable at a time, choosing the one that is expected to produce the greatest decrease in the objective function:

  j⋆ = argmax_j  max( ‖ x_j^⊤ ( XB − YΘ ) ‖_2 − λ w_j , 0 ) .

The exclusion of a variable belonging to the active set A is considered if the norm ‖β^j‖_2 is small and if, after setting β^j to zero, the following optimality condition holds:

  ‖ x_j^⊤ ( XB − YΘ ) ‖_2 ≤ λ w_j .

The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.
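The inclusion test can be written compactly (an illustrative sketch; function and variable names are ours):

```python
import numpy as np

def inclusion_candidate(X, Y, B, Theta, lam, w, active):
    """Return the inactive variable with the largest violation of the second
    optimality condition, or None when every inactive variable satisfies it."""
    scores = np.linalg.norm(X.T @ (X @ B - Y @ Theta), axis=1) - lam * w
    scores[list(active)] = -np.inf      # only inactive variables compete
    j = int(np.argmax(scores))
    return j if scores[j] > 0 else None
```

Returning None signals global optimality of the current solution; otherwise the returned index is added to the active set.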
5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem with this value of λ. The other strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value of the penalty parameter λ_max such that B ≠ 0, and solves the p-OS problem for decreasing values of λ until a prescribed number of features are declared active.

The maximum value of the penalty parameter λ_max corresponding to a null B matrix is obtained by evaluating the optimality condition (4.32b) at B = 0:

  λ_max = max_{j ∈ {1, …, p}}  (1/w_j) ‖ x_j^⊤ Y Θ⁰ ‖_2 .

The algorithm then computes a series of solutions along the regularization path defined by a series of penalties λ_1 = λ_max > ⋯ > λ_t > ⋯ > λ_T = λ_min ≥ 0, by regularly decreasing the penalty, λ_{t+1} = λ_t / 2, and using a warm-start strategy where the feasible initial guess for B(λ_{t+1}) is initialized with B(λ_t). The final penalty parameter λ_min is determined during the optimization process, when the maximum number of desired active variables is attained (by default, the minimum of n and p).
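Both λ_max and the halving schedule are directly computable (a sketch; names are ours):

```python
import numpy as np

def lambda_max(X, Y, Theta0, w):
    """Smallest penalty for which B = 0 is optimal: condition (4.32b) at B = 0."""
    return float(np.max(np.linalg.norm(X.T @ (Y @ Theta0), axis=1) / w))

def lambda_path(lam_max, T=8):
    """Halving schedule lambda_1 = lam_max > ... > lambda_T for warm starts."""
    return lam_max * 0.5 ** np.arange(T)
```

At any λ ≥ λ_max, the null matrix B = 0 satisfies the second optimality condition for every variable, so the path can safely start there.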
5.6 Options and Variants

5.6.1 Scaling Variables

Like most penalization schemes, GLOSS is sensitive to the scaling of variables. It thus makes sense to normalize them before applying the algorithm or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.
5.6.2 Sparse Variant

This version replaces some MATLAB commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.
5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis: Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.

The equivalence proof between penalized OS and penalized LDA (Hastie et al., 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS, corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems

  min_{B ∈ R^{p×(K−1)}} ‖YΘ − XB‖²_F = min_{B ∈ R^{p×(K−1)}} tr( Θ^⊤Y^⊤YΘ − 2Θ^⊤Y^⊤XB + n B^⊤ Σ̂_T B )

are replaced by

  min_{B ∈ R^{p×(K−1)}} tr( Θ^⊤Y^⊤YΘ − 2Θ^⊤Y^⊤XB + n B^⊤ ( Σ̂_B + diag(Σ̂_W) ) B ) .

Note that this variant only requires diag(Σ̂_W) + Σ̂_B + n^{−1}λΩ to be positive definite, which is a weaker requirement than Σ̂_T + n^{−1}λΩ positive definite.
5.6.4 Elastic Net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition
  7 8 9
  4 5 6
  1 2 3

  Ω_L =
  (  3 −1  0 −1 −1  0  0  0  0 )
  ( −1  5 −1 −1 −1 −1  0  0  0 )
  (  0 −1  3  0 −1 −1  0  0  0 )
  ( −1 −1  0  5 −1  0 −1 −1  0 )
  ( −1 −1 −1 −1  8 −1 −1 −1 −1 )
  (  0 −1 −1  0 −1  5  0 −1 −1 )
  (  0  0  0 −1 −1  0  3 −1  0 )
  (  0  0  0 −1 −1 −1 −1  5 −1 )
  (  0  0  0  0 −1 −1  0 −1  3 )

Figure 5.2: Graph and Laplacian matrix for a 3 × 3 image.
for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of pixels in a 3 × 3 image, with the corresponding Laplacian matrix. The Laplacian matrix Ω_L is positive semi-definite, and the penalty β^⊤Ω_Lβ favors, among vectors of identical L2 norms, the ones having similar coefficients within the neighborhoods of the graph. For example, this penalty is 9 for the vector (1, 1, 0, 1, 1, 0, 0, 0, 0)^⊤, which is the indicator of pixel 1 and its neighbors, and it is 21 for the vector (−1, 1, 0, 1, 1, 0, 0, 0, 0)^⊤, with a sign mismatch between pixel 1 and its neighborhood.

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty is simply added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.
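The penalty values quoted above can be checked by building the Laplacian of the 8-neighbor pixel graph (a sketch with our own helper; pixel k maps to row k − 1, numbering the grid row by row from the bottom, as in Figure 5.2):

```python
import numpy as np
from itertools import product

def king_laplacian(rows, cols):
    """Unweighted Laplacian of the 8-neighbor (king-move) pixel graph."""
    cells = list(product(range(rows), range(cols)))
    n = rows * cols
    L = np.zeros((n, n))
    for (r, c), (r2, c2) in product(cells, repeat=2):
        if (r, c) != (r2, c2) and abs(r - r2) <= 1 and abs(c - c2) <= 1:
            L[r * cols + c, r2 * cols + c2] = -1.0
    np.fill_diagonal(L, -L.sum(axis=1))
    return L

L = king_laplacian(3, 3)                         # the matrix of Figure 5.2
v1 = np.array([1, 1, 0, 1, 1, 0, 0, 0, 0.0])     # pixel 1 and its neighbors
v2 = np.array([-1, 1, 0, 1, 1, 0, 0, 0, 0.0])    # sign mismatch at pixel 1
p1, p2 = float(v1 @ L @ v1), float(v2 @ L @ v2)  # 9 and 21
```

The quadratic form equals the sum of squared differences over graph edges, which is why sign mismatches between neighbors are penalized much more heavily than a smooth boundary.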
6 Experimental Results
This section presents comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA: Penalized LDA (PLDA) (Witten and Tibshirani, 2011), which applies a Lasso penalty within Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al., 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing parsimony capacities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites; PLDA is an R implementation and SLDA is coded in MATLAB. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.
6.1 Normalization
With shrunken estimates, the scaling of features has important outcomes. For the linear discriminants considered here, the two most common normalization strategies consist in setting to ones either the diagonal of the total covariance matrix Σ̂_T, or the diagonal of the within-class covariance matrix Σ̂_W. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing penalties with weights. The latter option is implemented in our MATLAB package.¹
6.2 Decision Thresholds
The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, chapter 4) suggest investigating decision thresholds other than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.
¹ The GLOSS MATLAB code can be found in the software section of www.hds.utc.fr/~grandval.
6.3 Simulated Data
We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split into a training set of size n = 100, a validation set of size 100, and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is:
Simulation 1: Mean shift with independent features. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), where μ_{1j} = 0.7 × 1_{(1≤j≤25)}, μ_{2j} = 0.7 × 1_{(26≤j≤50)}, μ_{3j} = 0.7 × 1_{(51≤j≤75)}, μ_{4j} = 0.7 × 1_{(76≤j≤100)}.

Simulation 2: Mean shift with dependent features. There are two classes. If sample i is in class 1, then x_i ∼ N(0, Σ), and if i is in class 2, then x_i ∼ N(μ, Σ), with μ_j = 0.6 × 1_{(j≤200)}. The covariance structure is block diagonal, with 5 blocks, each of dimension 100 × 100. The blocks have (j, j′) element 0.6^{|j−j′|}. This covariance structure is intended to mimic gene expression data correlation.

Simulation 3: One-dimensional mean shift with independent features. There are four classes and the features are independent. If sample i is in class k, then X_{ij} ∼ N((k−1)/3, 1) if j ≤ 100, and X_{ij} ∼ N(0, 1) otherwise.

Simulation 4: Mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), with mean vectors defined as follows: μ_{1j} ∼ N(0, 0.3²) for j ≤ 25, and μ_{1j} = 0 otherwise; μ_{2j} ∼ N(0, 0.3²) for 26 ≤ j ≤ 50, and μ_{2j} = 0 otherwise; μ_{3j} ∼ N(0, 0.3²) for 51 ≤ j ≤ 75, and μ_{3j} = 0 otherwise; μ_{4j} ∼ N(0, 0.3²) for 76 ≤ j ≤ 100, and μ_{4j} = 0 otherwise.
Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA, in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.

The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far away by SLDA, which is the only
Table 6.1: Experimental results for simulated data: averages (with standard deviations) computed over 25 repetitions of the test error rate, the number of selected variables, and the number of discriminant directions selected on the validation set.

                 Err (%)      Var            Dir

Sim 1: K = 4, mean shift, ind. features
  PLDA           12.6 (0.1)   411.7 (3.7)    3.0 (0.0)
  SLDA           31.9 (0.1)   228.0 (0.2)    3.0 (0.0)
  GLOSS          19.9 (0.1)   106.4 (1.3)    3.0 (0.0)
  GLOSS-D        11.2 (0.1)   251.1 (4.1)    3.0 (0.0)

Sim 2: K = 2, mean shift, dependent features
  PLDA            9.0 (0.4)   337.6 (5.7)    1.0 (0.0)
  SLDA           19.3 (0.1)    99.0 (0.0)    1.0 (0.0)
  GLOSS          15.4 (0.1)    39.8 (0.8)    1.0 (0.0)
  GLOSS-D         9.0 (0.0)   203.5 (4.0)    1.0 (0.0)

Sim 3: K = 4, 1D mean shift, ind. features
  PLDA           13.8 (0.6)   161.5 (3.7)    1.0 (0.0)
  SLDA           57.8 (0.2)   152.6 (2.0)    1.9 (0.0)
  GLOSS          31.2 (0.1)   123.8 (1.8)    1.0 (0.0)
  GLOSS-D        18.5 (0.1)   357.5 (2.8)    1.0 (0.0)

Sim 4: K = 4, mean shift, ind. features
  PLDA           60.3 (0.1)   336.0 (5.8)    3.0 (0.0)
  SLDA           65.9 (0.1)   208.8 (1.6)    2.7 (0.0)
  GLOSS          60.7 (0.2)    74.3 (2.2)    2.7 (0.0)
  GLOSS-D        58.8 (0.1)   162.7 (4.9)    2.9 (0.0)
Figure 6.1: TPR versus FPR (in %) for all algorithms (GLOSS, GLOSS-D, SLDA, PLDA) and all simulations.

Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

            Simulation 1    Simulation 2    Simulation 3    Simulation 4
            TPR    FPR      TPR    FPR      TPR    FPR      TPR    FPR
  PLDA      99.0   78.2     96.9   60.3     98.0   15.9     74.3   65.6
  SLDA      73.9   38.5     33.8   16.3     41.6   27.8     50.7   39.5
  GLOSS     64.1   10.6     30.0    4.6     51.1   18.2     26.0   12.1
  GLOSS-D   93.5   39.4     92.1   28.1     95.6   65.5     42.9   29.9
method that does not succeed in uncovering a low-dimensional representation in Simulation 3. The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR): the TPR is the proportion of relevant variables that are selected, and the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 100% and FPR = 0% simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA's. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).
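The two rates are computed as follows (an illustrative helper, with names and the toy data ours):

```python
def tpr_fpr(selected, relevant, p):
    """TPR: share of relevant variables selected;
    FPR: share of irrelevant variables selected."""
    selected, relevant = set(selected), set(relevant)
    irrelevant = set(range(p)) - relevant
    tpr = len(selected & relevant) / len(relevant)
    fpr = len(selected & irrelevant) / len(irrelevant)
    return tpr, fpr
```

For instance, in the simulations above, p = 500 with 100 relevant variables, so a method selecting 411 variables of which 99 are relevant gets a TPR of 99% and an FPR of 78%.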
6.4 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors; it was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011). The Ramaswamy³ dataset contains 198 examples of 16,063 gene expressions for categorizing 14 classes of cancer. Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

Each dataset was split into a training set and a test set with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.

Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits (with standard deviations) of the test error rates and the number of selected variables.

                                        Err (%)        Var

Nakayama (n = 86, p = 22,283, K = 5)
  PLDA                                  20.95 (1.3)    10478.7 (2116.3)
  SLDA                                  25.71 (1.7)      252.5 (3.1)
  GLOSS                                 20.48 (1.4)      129.0 (18.6)

Ramaswamy (n = 198, p = 16,063, K = 14)
  PLDA                                  38.36 (6.0)    14873.5 (720.3)
  SLDA                                  —                —
  GLOSS                                 20.61 (6.9)      372.4 (122.1)

Sun (n = 180, p = 54,613, K = 4)
  PLDA                                  33.78 (5.9)    21634.8 (7443.2)
  SLDA                                  36.22 (6.5)      384.4 (16.5)
  GLOSS                                 31.77 (4.5)       93.0 (93.6)

Test error rates and the numbers of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.

Finally, Figure 6.2 displays the projection of the observations for the Nakayama and Sun datasets on the first canonical planes estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high colinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.

² http://www.broadinstitute.org/cancer/software/genepattern/datasets
³ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736
⁴ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962
[Scatter plots: first canonical plane (1st discriminant vs. 2nd discriminant) for GLOSS (left) and SLDA (right), on the Nakayama dataset (classes: 1) Synovial sarcoma; 2) Myxoid liposarcoma; 3) Dedifferentiated liposarcoma; 4) Myxofibrosarcoma; 5) Malignant fibrous histiocytoma) and on the Sun dataset (classes: 1) NonTumor; 2) Astrocytomas; 3) Glioblastomas; 4) Oligodendrogliomas).]

Figure 6.2: 2D-representations of the Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA. The big squares represent class means.
Figure 6.3: USPS digits "1" and "0".
6.5 Correlated Data
When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to introduce this prior knowledge easily.

The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results are left for future work.

For this illustration, we have used a subset of the USPS handwritten digit dataset, made of 16 × 16-pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and S-GLOSS. The mean image of each digit is shown in Figure 6.3.

As in Section 5.6.4, we have encoded the pixel proximity relationships of Figure 5.2 into a penalty matrix Ω_L, but this time on a 256-node graph. Introducing this new 256 × 256 Laplacian penalty matrix Ω_L into the GLOSS algorithm is straightforward.

The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). We clearly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element for discriminating both digits.

Figure 6.5 displays the discriminant direction β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels, which allows strokes to be detected and will probably provide better prediction results.
[Left: β for GLOSS; right: β for S-GLOSS.]

Figure 6.4: Discriminant direction between digits "1" and "0".

[Left: β for GLOSS, λ = 0.3; right: β for S-GLOSS, λ = 0.3.]

Figure 6.5: Sparse discriminant direction between digits "1" and "0".
Discussion
GLOSS is an efficient algorithm that performs sparse LDA based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem; this is, to our knowledge, the first approach that enjoys this property in the multi-class setting. This relationship also makes it possible to accommodate interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K − 1)-dimensional problem into (K − 1) independent p-dimensional problems. The interaction between the (K − 1) problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, regarding both its prediction abilities and its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performance. Employing the same features in all discriminant directions enables the generation of models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of data through the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can be added to the group penalty to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.
Part III
Sparse Clustering Analysis
Abstract
Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records, or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussian, and the different populations are identified by different means and a common covariance matrix.

As in the supervised framework, traditional clustering techniques perform worse as the number of irrelevant features increases. In this part we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.
7 Feature Selection in Mixture Models
7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models, which represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.
7.1.1 Model

We assume that the observed data $\mathbf{X} = (x_1^\top, \dots, x_n^\top)^\top$ have been drawn identically from K different subpopulations in the domain $\mathbb{R}^p$. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compounded distribution whose density can be expressed as

$$f(x_i) = \sum_{k=1}^{K} \pi_k f_k(x_i) \quad \forall i \in \{1, \dots, n\} ,$$
where K is the number of components, $f_k$ are the densities of the components, and $\pi_k$ are the mixture proportions ($\pi_k \in \,]0,1[\,$ for all k, and $\sum_k \pi_k = 1$). Mixture models transcribe that, given the proportions $\pi_k$ and the distributions $f_k$ for each class, the data are generated according to the following mechanism:

- y: each individual is allotted to a class according to a multinomial distribution with parameters $\pi_1, \dots, \pi_K$;
- x: each $x_i$ is assumed to arise from a random vector with probability density function $f_k$.

In addition, it is usually assumed that the component densities $f_k$ belong to a parametric family of densities $\varphi(\cdot\,;\theta_k)$. The density of the mixture can then be written as

$$f(x_i; \theta) = \sum_{k=1}^{K} \pi_k \, \varphi(x_i; \theta_k) \quad \forall i \in \{1, \dots, n\} ,$$
where $\theta = (\pi_1, \dots, \pi_K, \theta_1, \dots, \theta_K)$ is the parameter vector of the model.
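As a concrete illustration, the compounded density above is straightforward to evaluate numerically. The sketch below (function and variable names are ours, not from the thesis) implements the Gaussian case with a common covariance matrix:

```python
import numpy as np

def mixture_density(X, pis, mus, Sigma):
    """Evaluate f(x_i; theta) = sum_k pi_k * phi(x_i; mu_k, Sigma) for each row of X."""
    n, p = X.shape
    inv = np.linalg.inv(Sigma)
    # Gaussian normalizing constant, common to all components (shared covariance)
    norm = 1.0 / np.sqrt((2 * np.pi) ** p * np.linalg.det(Sigma))
    dens = np.zeros(n)
    for pi_k, mu_k in zip(pis, mus):
        d = X - mu_k                                 # (n, p) deviations from the k-th mean
        maha = np.einsum('ij,jk,ik->i', d, inv, d)   # squared Mahalanobis distances
        dens += pi_k * norm * np.exp(-0.5 * maha)
    return dens
```

With `pis`, `mus` and `Sigma` taken from a fitted model, `mixture_density` returns the vector of $f(x_i;\theta)$ values used throughout this chapter.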
7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of the mixture model, Pearson (1894) used the method of moments to estimate the five parameters $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \pi)$ of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphic methods, maximum likelihood methods, and Bayesian approaches.

The most widely used process to estimate the parameters is the maximization of the log-likelihood using the EM algorithm. It is typically used to maximize the likelihood for models with latent variables, for which no analytical solution is available (Dempster et al., 1977).
The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the likelihood expectation with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the E-step expected likelihood.

Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed. In practice, the obtained solution depends on the initialization of the algorithm.
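The alternation just described can be sketched as a generic skeleton, with the E- and M-steps passed in as functions. This is an illustrative sketch, not the thesis implementation; in practice the loop is restarted from several initializations because of the local maxima mentioned above:

```python
import numpy as np

def em(X, e_step, m_step, theta0, tol=1e-6, max_iter=500):
    """Generic EM skeleton: alternate E and M steps until the log-likelihood
    stops increasing by more than tol. e_step(X, theta) must return
    (responsibilities, loglik); m_step(X, responsibilities) returns new theta."""
    theta, prev = theta0, -np.inf
    for _ in range(max_iter):
        T, loglik = e_step(X, theta)   # E-step: posterior responsibilities + likelihood
        theta = m_step(X, T)           # M-step: maximize expected complete log-likelihood
        if loglik - prev < tol:        # monotone increase makes this a valid stop rule
            break
        prev = loglik
    return theta, prev
```

Because convergence is only to a local maximum, a typical usage runs `em` from several values of `theta0` and keeps the run with the largest final log-likelihood.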
Maximum Likelihood Definitions
The likelihood is commonly expressed in its logarithmic version:

$$\mathcal{L}(\theta; \mathbf{X}) = \log \left( \prod_{i=1}^{n} f(x_i; \theta) \right) = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k) \right) , \tag{7.1}$$

where n is the number of samples, K is the number of components of the mixture (or number of clusters), and $\pi_k$ are the mixture proportions.

To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood or
classification log-likelihood:

$$\mathcal{L}_C(\theta; \mathbf{X}, \mathbf{Y}) = \log \left( \prod_{i=1}^{n} f(x_i, y_i; \theta) \right) = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} y_{ik} \, \pi_k f_k(x_i; \theta_k) \right) = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log \left( \pi_k f_k(x_i; \theta_k) \right) . \tag{7.2}$$
The $y_{ik}$ are the binary entries of the indicator matrix $\mathbf{Y}$, with $y_{ik} = 1$ if the observation i belongs to the cluster k, and $y_{ik} = 0$ otherwise.

Defining the soft membership $t_{ik}(\theta)$ as

$$t_{ik}(\theta) = p(Y_{ik} = 1 \,|\, x_i; \theta) \tag{7.3}$$
$$= \frac{\pi_k f_k(x_i; \theta_k)}{f(x_i; \theta)} . \tag{7.4}$$
To lighten notations, $t_{ik}(\theta)$ will be denoted $t_{ik}$ when the parameter $\theta$ is clear from the context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:
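Numerically, the posteriors (7.4) follow by normalizing $\pi_k f_k(x_i)$ over components. A minimal sketch, assuming the component densities are already tabulated in an (n, K) matrix (names are illustrative):

```python
import numpy as np

def responsibilities(F, pis):
    """E-step posteriors t_ik = pi_k f_k(x_i) / sum_l pi_l f_l(x_i),
    given an (n, K) matrix F of component densities f_k(x_i)."""
    num = F * np.asarray(pis)                    # broadcast pi_k over the K columns
    return num / num.sum(axis=1, keepdims=True)  # each row sums to one
```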
$$\begin{aligned}
\mathcal{L}_C(\theta; \mathbf{X}, \mathbf{Y}) &= \sum_{i,k} y_{ik} \log \left( \pi_k f_k(x_i; \theta_k) \right) \\
&= \sum_{i,k} y_{ik} \log \left( t_{ik} f(x_i; \theta) \right) \\
&= \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i,k} y_{ik} \log f(x_i; \theta) \\
&= \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i=1}^{n} \log f(x_i; \theta) \\
&= \sum_{i,k} y_{ik} \log t_{ik} + \mathcal{L}(\theta; \mathbf{X}) , \tag{7.5}
\end{aligned}$$
where $\sum_{i,k} y_{ik} \log t_{ik}$ can be reformulated as

$$\begin{aligned}
\sum_{i,k} y_{ik} \log t_{ik} &= \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log \left( p(Y_{ik} = 1 \,|\, x_i; \theta) \right) \\
&= \sum_{i=1}^{n} \log \left( p(y_i \,|\, x_i; \theta) \right) \\
&= \log \left( p(\mathbf{Y} \,|\, \mathbf{X}; \theta) \right) .
\end{aligned}$$
As a result, the relationship (7.5) can be rewritten as

$$\mathcal{L}(\theta; \mathbf{X}) = \mathcal{L}_C(\theta; \mathbf{Z}) - \log \left( p(\mathbf{Y} \,|\, \mathbf{X}; \theta) \right) . \tag{7.6}$$
Likelihood Maximization
The complete log-likelihood cannot be assessed because the variables $y_{ik}$ are unknown. However, it is possible to estimate the value of the log-likelihood by taking expectations in (7.6), conditionally on a current value $\theta^{(t)}$ of the parameter:

$$\mathcal{L}(\theta; \mathbf{X}) = \underbrace{\mathbb{E}_{\mathbf{Y} \sim p(\cdot | \mathbf{X}; \theta^{(t)})} \left[ \mathcal{L}_C(\theta; \mathbf{X}, \mathbf{Y}) \right]}_{Q(\theta, \theta^{(t)})} + \underbrace{\mathbb{E}_{\mathbf{Y} \sim p(\cdot | \mathbf{X}; \theta^{(t)})} \left[ - \log p(\mathbf{Y} \,|\, \mathbf{X}; \theta) \right]}_{H(\theta, \theta^{(t)})} .$$
In this expression, $H(\theta, \theta^{(t)})$ is the entropy and $Q(\theta, \theta^{(t)})$ is the conditional expectation of the complete log-likelihood. Let us define an increment of the log-likelihood as $\Delta\mathcal{L} = \mathcal{L}(\theta^{(t+1)}; \mathbf{X}) - \mathcal{L}(\theta^{(t)}; \mathbf{X})$. Then $\theta^{(t+1)} = \operatorname{argmax}_\theta Q(\theta, \theta^{(t)})$ also increases the log-likelihood:

$$\Delta\mathcal{L} = \underbrace{\left( Q(\theta^{(t+1)}, \theta^{(t)}) - Q(\theta^{(t)}, \theta^{(t)}) \right)}_{\geq 0 \text{ by definition of iteration } t+1} - \underbrace{\left( H(\theta^{(t+1)}, \theta^{(t)}) - H(\theta^{(t)}, \theta^{(t)}) \right)}_{\leq 0 \text{ by Jensen's inequality}} .$$
Therefore, it is possible to maximize the likelihood by optimizing $Q(\theta, \theta^{(t)})$. The relationship between $Q(\theta, \theta')$ and $\mathcal{L}(\theta; \mathbf{X})$ is developed in deeper detail in Appendix F, to show how the value of $\mathcal{L}(\theta; \mathbf{X})$ can be recovered from $Q(\theta, \theta^{(t)})$.
For the mixture model problem, $Q(\theta, \theta')$ is

$$Q(\theta, \theta') = \mathbb{E}_{\mathbf{Y} \sim p(\mathbf{Y} | \mathbf{X}; \theta')} \left[ \mathcal{L}_C(\theta; \mathbf{X}, \mathbf{Y}) \right] = \sum_{i,k} p(Y_{ik} = 1 \,|\, x_i; \theta') \log \left( \pi_k f_k(x_i; \theta_k) \right) = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}(\theta') \log \left( \pi_k f_k(x_i; \theta_k) \right) . \tag{7.7}$$

$Q(\theta, \theta')$, due to its similitude to the expression of the complete likelihood (7.2), is also known as the weighted likelihood. In (7.7), the weights $t_{ik}(\theta')$ are the posterior probabilities of cluster memberships.
Hence, the EM algorithm sketched above results in:

- Initialization (not iterated): choice of the initial parameter $\theta^{(0)}$;
- E-Step: evaluation of $Q(\theta, \theta^{(t)})$, using $t_{ik}(\theta^{(t)})$ (7.4) in (7.7);
- M-Step: calculation of $\theta^{(t+1)} = \operatorname{argmax}_\theta Q(\theta, \theta^{(t)})$.
Gaussian Model
In the particular case of a Gaussian mixture model with common covariance matrix $\Sigma$ and different mean vectors $\mu_k$, the mixture density is

$$f(x_i; \theta) = \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k) = \sum_{k=1}^{K} \pi_k \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp \left\{ -\frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \right\} .$$
At the E-step, the posterior probabilities $t_{ik}$ are computed as in (7.4) with the current $\theta^{(t)}$ parameters; then, the M-step maximizes $Q(\theta, \theta^{(t)})$ (7.7), whose form is as follows:

$$\begin{aligned}
Q(\theta, \theta^{(t)}) &= \sum_{i,k} t_{ik} \log(\pi_k) - \sum_{i,k} t_{ik} \log \left( (2\pi)^{p/2} |\Sigma|^{1/2} \right) - \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \\
&= \sum_{k} t_k \log(\pi_k) - \underbrace{\frac{np}{2} \log(2\pi)}_{\text{constant term}} - \frac{n}{2} \log(|\Sigma|) - \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \\
&\equiv \sum_{k} t_k \log(\pi_k) - \frac{n}{2} \log(|\Sigma|) - \sum_{i,k} t_{ik} \left( \frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \right) , \tag{7.8}
\end{aligned}$$
where

$$t_k = \sum_{i=1}^{n} t_{ik} . \tag{7.9}$$
The M-step, which maximizes this expression with respect to $\theta$, applies the following updates, defining $\theta^{(t+1)}$:

$$\pi_k^{(t+1)} = \frac{t_k}{n} , \tag{7.10}$$
$$\mu_k^{(t+1)} = \frac{\sum_i t_{ik} x_i}{t_k} , \tag{7.11}$$
$$\Sigma^{(t+1)} = \frac{1}{n} \sum_k \mathbf{W}_k , \tag{7.12}$$
$$\text{with} \quad \mathbf{W}_k = \sum_i t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top . \tag{7.13}$$
The derivations are detailed in Appendix G
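The updates (7.10)-(7.13) translate directly into code. A sketch of the Gaussian M-step (function and variable names are illustrative, not from the thesis):

```python
import numpy as np

def m_step_gaussian(X, T):
    """Gaussian M-step (7.10)-(7.13): proportions, means and common covariance
    from the (n, K) responsibility matrix T."""
    n, p = X.shape
    tk = T.sum(axis=0)                 # t_k = sum_i t_ik            (7.9)
    pis = tk / n                       # pi_k = t_k / n              (7.10)
    mus = (T.T @ X) / tk[:, None]      # mu_k = sum_i t_ik x_i / t_k (7.11)
    Sigma = np.zeros((p, p))
    for k in range(T.shape[1]):        # Sigma = (1/n) sum_k W_k     (7.12)-(7.13)
        d = X - mus[k]
        Sigma += (T[:, k, None] * d).T @ d
    return pis, mus, Sigma / n
```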
7.2 Feature Selection in Model-Based Clustering
When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own covariance matrix $\Sigma_k$, Gaussian mixtures are associated to quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989). Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition $\Sigma_k = \lambda_k D_k A_k D_k^\top$ (Banfield and Raftery, 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.
In this chapter, we review some techniques to induce sparsity within model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.
7.2.1 Based on Penalized Likelihood
Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:

$$\log \left( \frac{p(Y_k = 1 \,|\, x)}{p(Y_\ell = 1 \,|\, x)} \right) = x^\top \Sigma^{-1} (\mu_k - \mu_\ell) - \frac{1}{2} (\mu_k + \mu_\ell)^\top \Sigma^{-1} (\mu_k - \mu_\ell) + \log \frac{\pi_k}{\pi_\ell} .$$
In this model, a simple way of introducing sparsity in the discriminant vectors $\Sigma^{-1}(\mu_k - \mu_\ell)$ is to constrain $\Sigma$ to be diagonal and to favor sparse means $\mu_k$. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm:

$$\lambda \sum_{k=1}^{K} \sum_{j=1}^{p} |\mu_{kj}| ,$$
as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:

$$\lambda_1 \sum_{k=1}^{K} \sum_{j=1}^{p} |\mu_{kj}| + \lambda_2 \sum_{k=1}^{K} \sum_{j=1}^{p} \sum_{m=1}^{p} \left| (\Sigma_k^{-1})_{jm} \right| .$$

In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.
Guo et al. (2010) propose a variation with a Pairwise Fusion Penalty (PFP):

$$\lambda \sum_{j=1}^{p} \sum_{1 \leq k < k' \leq K} |\mu_{kj} - \mu_{k'j}| .$$

This PFP regularization does not shrink the means to zero, but towards each other. When the jth components of all cluster means are driven to the same value, that variable can be considered as non-informative.
An $L_{1,\infty}$ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:

$$\lambda \sum_{j=1}^{p} \left\| (\mu_{1j}, \mu_{2j}, \dots, \mu_{Kj}) \right\|_\infty .$$

One group is defined for each variable j, as the set of the jth components of the K means, $(\mu_{1j}, \dots, \mu_{Kj})$. The $L_{1,\infty}$ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time. In addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG achieves real feature selection, because it forces null values for the same variable in all cluster means:
$$\lambda \sqrt{K} \sum_{j=1}^{p} \sqrt{\sum_{k=1}^{K} \mu_{kj}^2} .$$

The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code is available on the authors' website that would allow testing.
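For illustration, the VMG penalty is a sum of column norms of the mean matrix, so a whole column (one feature across all clusters) is either shrunk to zero jointly or kept. A minimal sketch (names are ours):

```python
import numpy as np

def vmg_penalty(M, lam):
    """Vertical mean grouping penalty: lam * sqrt(K) * sum_j ||(mu_1j, ..., mu_Kj)||_2,
    for a (K, p) matrix of cluster means M."""
    K = M.shape[0]
    # norm over axis 0 groups the K means of each variable j together
    return lam * np.sqrt(K) * np.linalg.norm(M, axis=0).sum()
```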
The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions from the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity on the discriminant vector. The generalization from quadratic to non-quadratic penalties is quickly outlined in their work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.
7.2.2 Based on Model Variants
The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy considering conditional independency: the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as
$$f(x_i \,|\, \boldsymbol{\phi}, \boldsymbol{\pi}, \boldsymbol{\theta}, \boldsymbol{\nu}) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{p} \left[ f(x_{ij} \,|\, \theta_{jk}) \right]^{\phi_j} \left[ h(x_{ij} \,|\, \nu_j) \right]^{1 - \phi_j} ,$$

where $f(\cdot\,|\,\theta_{jk})$ is the distribution function for relevant features and $h(\cdot\,|\,\nu_j)$ is the distribution function for the irrelevant ones. The binary vector $\boldsymbol{\phi} = (\phi_1, \phi_2, \dots, \phi_p)$ represents relevance, with $\phi_j = 1$ if the jth feature is informative and $\phi_j = 0$ otherwise. The saliency of variable j is then formalized as $\rho_j = P(\phi_j = 1)$, so all $\phi_j$ must be treated as missing variables. Thus, the set of parameters is $\{\pi_k, \theta_{jk}, \nu_j, \rho_j\}$. Their estimation is done by means of the EM algorithm (Law et al., 2004).
An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix $U \in \mathbb{R}^{p \times (K-1)}$, which is updated inside the EM loop with a new step, called the Fisher step (F-step from now on), which maximizes a multi-class Fisher's criterion

$$\operatorname{tr} \left( (U^\top \Sigma_W U)^{-1} U^\top \Sigma_B U \right) , \tag{7.14}$$

so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then, the F-step updates the projection matrix that maps the data to the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as a function of the projection matrix U and of the model parameters in the latent space, such that the U matrix enters into the M-step equations.
To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one results in the best sparse orthogonal approximation $\tilde{U}$ of the matrix $U$ which maximizes (7.14). This sparse approximation is defined as the solution of

$$\min_{\tilde{U} \in \mathbb{R}^{p \times (K-1)}} \left\| X_U - X \tilde{U} \right\|_F^2 + \lambda \sum_{k=1}^{K-1} \left\| \tilde{u}_k \right\|_1 ,$$
where $X_U = XU$ is the input data projected in the non-sparse space and $\tilde{u}_k$ is the kth column vector of the projection matrix $\tilde{U}$. The second possibility is inspired by Qiao et al. (2009) and reformulates the Fisher's discriminant (7.14) used to compute the projection matrix as a regression criterion penalized by a mixture of Lasso and Elastic net:

$$\min_{A, B \in \mathbb{R}^{p \times (K-1)}} \sum_{k=1}^{K} \left\| R_W^{-\top} H_{B,k} - A B^\top H_{B,k} \right\|_2^2 + \rho \sum_{j=1}^{K-1} \beta_j^\top \Sigma_W \beta_j + \lambda \sum_{j=1}^{K-1} \left\| \beta_j \right\|_1 \quad \text{s.t. } A^\top A = I_{K-1} ,$$
where $H_B \in \mathbb{R}^{p \times K}$ is a matrix defined conditionally on the posterior probabilities $t_{ik}$, satisfying $H_B H_B^\top = \Sigma_B$, and $H_{B,k}$ is the kth column of $H_B$. $R_W \in \mathbb{R}^{p \times p}$ is an upper triangular matrix resulting from the Cholesky decomposition of $\Sigma_W$. $\Sigma_W$ and $\Sigma_B$ are the $p \times p$ within-class and between-class covariance matrices in the observations space. $A \in \mathbb{R}^{p \times (K-1)}$ and $B \in \mathbb{R}^{p \times (K-1)}$ are the solutions of the optimization problem, such that $B = [\beta_1, \dots, \beta_{K-1}]$ is the best sparse approximation of $U$.
The last possibility obtains the solution of the Fisher's discriminant (7.14) as the solution of the following constrained optimization problem:

$$\min_{\tilde{U} \in \mathbb{R}^{p \times (K-1)}} \sum_{j=1}^{p} \left\| \Sigma_{B,j} - \tilde{U} \tilde{U}^\top \Sigma_{B,j} \right\|_2^2 \quad \text{s.t. } \tilde{U}^\top \tilde{U} = I_{K-1} ,$$

where $\Sigma_{B,j}$ is the jth column of the between-class covariance matrix in the observations space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of $U$.
To comply with the constraint stating that the columns of $U$ are orthogonal, the first and the second options must be followed by a singular value decomposition of $\tilde{U}$ to recover orthogonality. This is not necessary with the third option, since the penalized version of SVD already guarantees orthogonality.
However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed." Immediately after this paragraph, we can read that under certain suppositions their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices: the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized."
7.2.3 Based on Model Selection
Some clustering algorithms recast the feature selection problem as a model selection problem. Following this approach, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:

- $X^{(1)}$: the set of selected relevant variables;
- $X^{(2)}$: the set of variables being considered for inclusion in, or exclusion from, $X^{(1)}$;
- $X^{(3)}$: the set of non-relevant variables.
With those subsets, they define two different models, where Y is the partition to consider:

- $M_1$: $f(\mathbf{X} \,|\, \mathbf{Y}) = f(X^{(1)}, X^{(2)}, X^{(3)} \,|\, \mathbf{Y}) = f(X^{(3)} \,|\, X^{(2)}, X^{(1)}) \, f(X^{(2)} \,|\, X^{(1)}) \, f(X^{(1)} \,|\, \mathbf{Y})$;
- $M_2$: $f(\mathbf{X} \,|\, \mathbf{Y}) = f(X^{(1)}, X^{(2)}, X^{(3)} \,|\, \mathbf{Y}) = f(X^{(3)} \,|\, X^{(2)}, X^{(1)}) \, f(X^{(2)}, X^{(1)} \,|\, \mathbf{Y})$.
Model $M_1$ means that the variables in $X^{(2)}$ are independent of the clustering Y. Model $M_2$ states that the variables in $X^{(2)}$ depend on the clustering Y. To simplify the algorithm, subset $X^{(2)}$ is only updated one variable at a time. Therefore, deciding the relevance of variable $X^{(2)}$ amounts to a model selection between $M_1$ and $M_2$. The selection is done via the Bayes factor

$$B_{12} = \frac{f(\mathbf{X} \,|\, M_1)}{f(\mathbf{X} \,|\, M_2)} ,$$

where the high-dimensional factor $f(X^{(3)} \,|\, X^{(2)}, X^{(1)})$ cancels from the ratio:

$$B_{12} = \frac{f(X^{(1)}, X^{(2)}, X^{(3)} \,|\, M_1)}{f(X^{(1)}, X^{(2)}, X^{(3)} \,|\, M_2)} = \frac{f(X^{(2)} \,|\, X^{(1)}, M_1) \, f(X^{(1)} \,|\, M_1)}{f(X^{(2)}, X^{(1)} \,|\, M_2)} .$$
This factor is approximated, since the integrated likelihoods $f(X^{(1)} \,|\, M_1)$ and $f(X^{(2)}, X^{(1)} \,|\, M_2)$ are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. When there is only one variable in $X^{(2)}$, the computation of $f(X^{(2)} \,|\, X^{(1)}, M_1)$ can be represented as a linear regression of variable $X^{(2)}$ on the variables in $X^{(1)}$, for which there is also a BIC approximation.
Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets ($X^{(1)}$ and $X^{(3)}$) remain the same, but $X^{(2)}$ is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. Their algorithm also uses a backward stepwise strategy, instead of the forward stepwise search used by Raftery and Dean (2006), and it allows the definition of blocks of indivisible variables, which in certain situations improves the clustering and its interpretability.
Both algorithms are well motivated and appear to produce good results; however, the amount of computation needed to test the different subsets of variables requires a huge computation time. In practice, they cannot be used for the amount of data considered in this thesis.
8 Theoretical Foundations
In this chapter, we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.
We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to build reduced-rank decision rules using fewer than K − 1 discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of the discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.
In the subsequent sections, we provide the principles that allow solving the M-step as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach to the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.
8.1 Resolving EM with Optimal Scoring
In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.
8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis
LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance

$$d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_W^{-1} (x_i - \mu_k) ,$$

where $\mu_k$ are the p-dimensional centroids and $\Sigma_W$ is the $p \times p$ common within-class covariance matrix.
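A minimal numerical sketch of this distance (names are illustrative):

```python
import numpy as np

def mahalanobis_sq(X, mu, Sigma_w):
    """Squared Mahalanobis distances d(x_i, mu) = (x_i - mu)' Sigma_W^{-1} (x_i - mu)
    for every row of X."""
    d = X - mu
    return np.einsum('ij,jk,ik->i', d, np.linalg.inv(Sigma_w), d)
```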
The likelihood equations in the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by $t_{ik}$ (the posterior probabilities computed at the E-step).

Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood

$$2 \, l_{\text{weight}}(\mu, \Sigma) = - \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} \, d(x_i, \mu_k) - n \log(|\Sigma_W|) ,$$

which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized maximum likelihood in Gaussian mixtures.
8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis

The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1, in the supervised learning framework. This is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.
8.1.3 Clustering Using Penalized Optimal Scoring
The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix $B_{\text{OS}}$, analytically related to the Fisher's discriminative directions $B_{\text{LDA}}$ for the data $(\mathbf{X}, \mathbf{Y})$, where $\mathbf{Y}$ is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities $t_{ik}$ in the E-step, the distance between the samples $x_i$ and the centroids $\mu_k$ must be evaluated. Depending on whether we are working in the input domain, the OS domain or the LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:

$$d(x_i, \mu_k) = \left\| (x_i - \mu_k) B_{\text{LDA}} \right\|_2^2 - 2 \log(\pi_k) .$$

This distance defines the computation of the posterior probabilities $t_{ik}$ in the E-step (see Section 4.2.3). Putting together all those elements, the complete clustering algorithm can be summarized as:
1. Initialize the membership matrix $\mathbf{Y}$ (for example by the K-means algorithm).

2. Solve the p-OS problem as
$$B_{\text{OS}} = \left( \mathbf{X}^\top \mathbf{X} + \lambda \Omega \right)^{-1} \mathbf{X}^\top \mathbf{Y} \Theta ,$$
where $\Theta$ are the K − 1 leading eigenvectors of
$$\mathbf{Y}^\top \mathbf{X} \left( \mathbf{X}^\top \mathbf{X} + \lambda \Omega \right)^{-1} \mathbf{X}^\top \mathbf{Y} .$$

3. Map $\mathbf{X}$ to the LDA domain: $\mathbf{X}_{\text{LDA}} = \mathbf{X} B_{\text{OS}} D$, with $D = \operatorname{diag}\left( \alpha_k^{-1} (1 - \alpha_k^2)^{-\frac{1}{2}} \right)$.

4. Compute the centroids $\mathbf{M}$ in the LDA domain.

5. Evaluate distances in the LDA domain.

6. Translate distances into posterior probabilities $t_{ik}$, with
$$t_{ik} \propto \exp \left[ - \frac{d(x_i, \mu_k) - 2 \log(\pi_k)}{2} \right] . \tag{8.1}$$

7. Update the labels using the posterior probabilities: $\mathbf{Y} = \mathbf{T}$.

8. Go back to step 2 and iterate until the $t_{ik}$ converge.
Items 2 to 5 can be interpreted as the M-step, and Item 6 as the E-step, in this alternative view of the EM algorithm for Gaussian mixtures.
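For a fixed quadratic penalty $\Omega$, step 2 above can be sketched directly (the group-Lasso case requires the GLOSS variational machinery instead; function and variable names here are illustrative):

```python
import numpy as np

def pos_step(X, Y, Omega, lam):
    """One penalized optimal scoring solve (step 2):
    B = (X'X + lam*Omega)^{-1} X'Y Theta, with Theta the K-1 leading
    eigenvectors of Y'X (X'X + lam*Omega)^{-1} X'Y."""
    G = np.linalg.inv(X.T @ X + lam * Omega)
    M = Y.T @ X @ G @ X.T @ Y          # symmetric (K, K) matrix
    vals, vecs = np.linalg.eigh(M)     # ascending eigenvalues
    K = Y.shape[1]
    Theta = vecs[:, ::-1][:, :K - 1]   # keep the K-1 leading eigenvectors
    B = G @ X.T @ Y @ Theta            # p x (K-1) coefficient matrix
    return B, Theta
```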
8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
In the previous section, we sketched a clustering algorithm that replaces the M-step with a penalized OS regression. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.
8.2 Optimized Criterion
In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood $Q(\theta, \theta')$ (7.7) so as to maximize the likelihood $\mathcal{L}(\theta)$ (see Section 7.1.2). Replacing the M-step by a penalized optimal scoring problem is possible, and the link between the penalized optimal scoring problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.
This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix $\Sigma$ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.
8.2.1 A Bayesian Derivation
This section sketches a Bayesian treatment of inference, limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means, and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.
The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:

$$f(\Sigma \,|\, \Lambda_0, \nu_0) = \frac{1}{2^{\nu_0 p/2} \, |\Lambda_0|^{\nu_0/2} \, \Gamma_p\!\left(\frac{\nu_0}{2}\right)} \, |\Sigma^{-1}|^{\frac{\nu_0 - p - 1}{2}} \exp \left\{ -\frac{1}{2} \operatorname{tr} \left( \Lambda_0^{-1} \Sigma^{-1} \right) \right\} ,$$
where $\nu_0$ is the number of degrees of freedom of the distribution, $\Lambda_0$ is a $p \times p$ scale matrix, and $\Gamma_p$ is the multivariate gamma function, defined as

$$\Gamma_p(n/2) = \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma \left( n/2 + (1 - j)/2 \right) .$$
The posterior distribution can be maximized, similarly to the likelihood, through the maximization of

$$\begin{aligned}
Q(\theta, \theta') &+ \log(f(\Sigma \,|\, \Lambda_0, \nu_0)) \\
&= \sum_{k=1}^{K} t_k \log \pi_k - \frac{(n+1)p}{2} \log 2 - \frac{n}{2} \log |\Lambda_0| - \frac{p(p+1)}{4} \log(\pi) \\
&\quad - \sum_{j=1}^{p} \log \left( \Gamma \left( \frac{n}{2} + \frac{1-j}{2} \right) \right) - \frac{\nu_n - p - 1}{2} \log |\Sigma| - \frac{1}{2} \operatorname{tr} \left( \Lambda_n^{-1} \Sigma^{-1} \right) \\
&\equiv \sum_{k=1}^{K} t_k \log \pi_k - \frac{n}{2} \log |\Lambda_0| - \frac{\nu_n - p - 1}{2} \log |\Sigma| - \frac{1}{2} \operatorname{tr} \left( \Lambda_n^{-1} \Sigma^{-1} \right) , \tag{8.2}
\end{aligned}$$
with

$$t_k = \sum_{i=1}^{n} t_{ik} , \quad \nu_n = \nu_0 + n , \quad \Lambda_n^{-1} = \Lambda_0^{-1} + S_0 , \quad S_0 = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top .$$
Details of these calculations can be found in textbooks (for example, Bishop, 2006; Gelman et al., 2003).
8.2.2 Maximum a Posteriori Estimator
The maximization of (8.2) with respect to $\mu_k$ and $\pi_k$ is of course not affected by the additional prior term, where only the covariance $\Sigma$ intervenes. The MAP estimator for $\Sigma$ is simply obtained by deriving (8.2) with respect to $\Sigma$. The details of the calculations follow the same lines as the ones for maximum likelihood, detailed in Appendix G. The resulting estimator for $\Sigma$ is

$$\hat{\Sigma}_{\text{MAP}} = \frac{1}{\nu_0 + n - p - 1} \left( \Lambda_0^{-1} + S_0 \right) , \tag{8.3}$$

where $S_0$ is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if $\nu_0$ is chosen to be $p + 1$ and $\Lambda_0^{-1} = \lambda \Omega$, where $\Omega$ is the penalty matrix from the group-Lasso regularization (4.25).
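The estimator (8.3) is a simple shrinkage formula; a sketch (with $\Lambda_0^{-1}$ passed explicitly; names are ours):

```python
import numpy as np

def sigma_map(S0, Lambda0_inv, nu0, n, p):
    """MAP covariance estimate (8.3): (Lambda0^{-1} + S0) / (nu0 + n - p - 1).
    With nu0 = p + 1 and Lambda0_inv = lam * Omega, this matches the penalized
    within-class covariance of the p-OS regression."""
    return (Lambda0_inv + S0) / (nu0 + n - p - 1)
```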
9 Mix-GLOSS Algorithm
Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter, I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.
9.1 Mix-GLOSS
The implementation of Mix-GLOSS involves three nested loops, as schemed in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors $t_{ik}$.

When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.
9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:
- the centered n × p feature matrix X;
- the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);
- the number of clusters K;
- the maximum number of iterations for the EM algorithm;
- the convergence tolerance for the EM algorithm;
- the number of whole repetitions of the clustering algorithm;
Figure 9.1: Mix-GLOSS loops scheme.
- a p × (K − 1) initial coefficient matrix (optional);
- an n × K initial posterior probability matrix (optional).

For each repetition of the algorithm, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.
9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, such that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have verified that the warm-start implemented here reduces the computation time by a factor of 8, with respect to using a null B matrix and a K-means execution for the initial Y label matrix.
Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the estimated percentage of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.
Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternative variational approach from Appendix D is used, Equation (4.32b) must be replaced by (D.10b).
Algorithm 2 Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize: B ← 0, Y ← K-means(X, K)
Run non-penalized Mix-GLOSS:
    λ ← 0
    (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
lastLAMBDA ← false
repeat
    Estimate λ: compute the gradient at βj = 0,
    \[ \frac{\partial J(B)}{\partial \beta^j}\bigg|_{\beta^j=0} = x_j^\top \Big( \sum_{m \neq j} x_m \beta^m - Y\Theta \Big), \]
    and compute λmax for every feature using (4.32b),
    \[ \lambda_{\max}^j = \frac{1}{w_j} \bigg\| \frac{\partial J(B)}{\partial \beta^j}\bigg|_{\beta^j=0} \bigg\|_2 . \]
    Choose λ so as to remove 10% of the relevant features
    Run penalized Mix-GLOSS:
        (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    if the number of relevant variables in B > minVAR then
        lastLAMBDA ← false
    else
        lastLAMBDA ← true
    end if
until lastLAMBDA
Output: B, L(θ), tik, πk, μk, Σ, Y for every λ in the solution path
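The per-feature critical penalties of Algorithm 2 can be sketched as below. The quadratic OS loss J(B) = ½‖YΘ − XB‖² is an assumed normalization, and `next_lambda` is a hypothetical rule implementing the "remove about 10% of features" heuristic.

```python
import numpy as np

def lambda_max_per_feature(X, YTheta, B, w):
    # Gradient of the assumed loss J(B) = 0.5 * ||Y Theta - X B||^2 at
    # beta_j = 0, and the critical penalty lambda_max_j = ||grad_j||_2 / w_j.
    p = X.shape[1]
    lam_max = np.empty(p)
    for j in range(p):
        Bj0 = B.copy()
        Bj0[j, :] = 0.0                      # evaluate the gradient at beta_j = 0
        grad_j = X[:, j] @ (X @ Bj0 - YTheta)
        lam_max[j] = np.linalg.norm(grad_j) / w[j]
    return lam_max

def next_lambda(lam_max, drop_fraction=0.10):
    # Hypothetical rule: pick lambda so that about drop_fraction of the
    # features fall below their critical penalty and get removed.
    return float(np.quantile(lam_max, drop_fraction))
```

A feature j stays out of the model as soon as λ exceeds its λmax, so quantiles of the λmax values directly control how many features the next run will discard.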
9.1.3 Inner Loop: EM Algorithm
The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities tik is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.
Algorithm 3 Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
if (B0, Y0) available then
    BOS ← B0, Y ← Y0
else
    BOS ← 0, Y ← K-means(X, K)
end if
convergenceEM ← false, tolEM ← 1e-3
repeat
    M-step:
        (BOS, Θ, α) ← GLOSS(X, Y, BOS, λ)
        \[ X_{\mathrm{LDA}} = X B_{\mathrm{OS}} \operatorname{diag}\big( \alpha^{-1} (1 - \alpha^2)^{-1/2} \big) \]
        πk, μk and Σ as per (7.10), (7.11) and (7.12)
    E-step:
        tik as per (8.1)
        L(θ) as per (8.2)
    if (1/n) Σi |tik − yik| < tolEM then
        convergenceEM ← true
    end if
    Y ← T
until convergenceEM
Y ← MAP(T)
Output: BOS, Θ, L(θ), tik, πk, μk, Σ, Y
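A minimal skeleton of this inner loop, with the M-step and E-step abstracted into a single callback (a hypothetical `em_step`; the real steps are the GLOSS fit and the posterior update), illustrates the stopping rule on the posteriors and the final MAP assignment:

```python
import numpy as np

def em_loop(T0, em_step, tol_em=1e-3, max_iter=100):
    # Alternate EM updates of the posterior matrix T and stop when the
    # mean absolute change falls below tol_em, as in Algorithm 3.
    T = T0
    for it in range(1, max_iter + 1):
        T_new = em_step(T)
        if np.mean(np.abs(T_new - T)) < tol_em:
            return T_new, it
        T = T_new
    return T, max_iter

def map_rule(T):
    # Maximum a posteriori assignment applied after convergence.
    return np.argmax(T, axis=1)
```

Measuring convergence on T rather than on the (unknown) exact criterion is precisely the pragmatic choice discussed in the Conclusions.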
M-Step
The M-step deals with the estimation of the model parameters, that is, the cluster means μk, the common covariance matrix Σ, and the priors πk of every component. In a classical M-step this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version of the label matrix, ΘY. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.
E-Step
The E-step evaluates the posterior probability matrix T using
\[ t_{ik} \propto \exp\left( -\frac{d(x_i, \mu_k) - 2\log(\pi_k)}{2} \right). \]
The convergence of these tik is used as the stopping criterion for EM.
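In code, the posterior update can be sketched as follows, where `D2[i, k]` is assumed to hold the squared Mahalanobis distance d(x_i, μ_k); the shift in the log domain is a standard numerical safeguard, not part of the formula:

```python
import numpy as np

def e_step(D2, pi):
    # t_ik ∝ exp(-(d(x_i, mu_k) - 2 log pi_k) / 2), normalized over k.
    log_t = -0.5 * (D2 - 2.0 * np.log(pi))
    log_t -= log_t.max(axis=1, keepdims=True)   # avoid overflow in exp
    T = np.exp(log_t)
    return T / T.sum(axis=1, keepdims=True)
```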
9.2 Model Selection
Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.
In a first attempt, we tried a classical structure where clustering was performed several times from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm took a lot of computing resources, since the stability selection mechanism required a number of repetitions that transformed Mix-GLOSS into a lengthy structure of four nested loops.
In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978), but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, still required a large computation time.
The third and definitive attempt (up to now) proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0). The execution with the best log-likelihood is chosen; the repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism. This time, only one repetition of the algorithm is done for every value of the penalty parameter. This version has been tested
[Figure 9.2: Mix-GLOSS model selection diagram. From the inputs X, K, λ, EMITER_MAX and REP_Mix-GLOSS, an initial Mix-GLOSS is run (λ = 0, REP_Mix-GLOSS = 20); the B and T of the best repetition are used as StartB and StartT for Mix-GLOSS(λ, StartB, StartT); BIC is computed for each λ, and λ = argmin_λ BIC is chosen, yielding the partition, tik, πk, λBEST, B, Θ, D, L(θ) and the active set.]
with no significant differences in the quality of the clustering, but with dramatically reduced computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.
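The final selection step can be sketched with a hypothetical helper that receives, for each λ of the path, the log-likelihood and the number of retained parameters; the BIC-style score counting only the retained variables follows the spirit of Pan and Shen (2007) without reproducing their exact parameter count:

```python
import numpy as np

def select_lambda(lambdas, logliks, n_params, n):
    # BIC-style score: -2 log L + (number of free parameters) * log(n);
    # the penalty parameter minimizing it is retained.
    bic = [-2.0 * ll + k * np.log(n) for ll, k in zip(logliks, n_params)]
    best = int(np.argmin(bic))
    return lambdas[best], bic
```

Sparser models pay a smaller log(n) penalty, so a small drop in likelihood can still win if enough variables are removed.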
10 Experimental Results
The performance of Mix-GLOSS is measured here with the artificial dataset that has been used in Section 6.
This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for simulation 2, where they are slightly correlated. In simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be, respectively, 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.
In our tests, we have reduced the volume of the problem, because with the original size of 1200 samples and 500 dimensions, some of the algorithms under test took several days (even weeks) to finish. Hence, the definitive database was chosen to approximately maintain the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1 has been adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments, and allows a better understanding of the different simulations.
The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.
10.1 Tested Clustering Algorithms
This section compares Mix-GLOSS with the following state-of-the-art methods:
• CS general cov: This is a model-based clustering with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.
• Fisher EM: This method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "FisherEM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network website.
Figure 10.1: Class mean vectors for each artificial simulation
• SelvarClust/Clustvarsel: Implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software implemented in C++ that makes use of the clustering library mixmod (Biernacki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.
After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.
SelvarClust was replaced by the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network website.
• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. In Kuan et al. (2010), some slight changes are introduced in the penalty term, such as weighting parameters, which are particularly important for their dataset. The LumiWCluster package allows clustering with either the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).
• Mix-GLOSS: This is the clustering algorithm implemented using GLOSS (see Section 9). It makes use of an EM algorithm and the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach of the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.
10.2 Results
Table 10.1 shows the results of the experiments for all the algorithms of Section 10.1. The measures of performance are the following:
• Clustering error (in percentage): To measure the quality of the partition with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows the ideal 0% clustering error to be obtained even if the IDs of the clusters and of the real classes are different.
• Number of disposed features: This value shows the number of variables whose coefficients have been zeroed, and which are therefore not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.
• Time of execution (in hours, minutes or seconds): Finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. These algorithms tend to be more memory- and CPU-consuming as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.
The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is defined as the fraction of the relevant variables that are selected; similarly, the FPR is the fraction of the non-relevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to high computing time and clustering error, respectively, and as the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.
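With 20 relevant variables out of p = 100, as in the simulated datasets, these rates can be computed as follows (the index sets in the example are hypothetical illustrations):

```python
def tpr_fpr(selected, relevant, p):
    # TPR: fraction of the relevant variables that are selected;
    # FPR: fraction of the irrelevant variables that are selected.
    selected, relevant = set(selected), set(relevant)
    irrelevant = set(range(p)) - relevant
    tpr = len(selected & relevant) / len(relevant)
    fpr = len(selected & irrelevant) / len(irrelevant)
    return tpr, fpr

print(tpr_fpr(selected=range(25), relevant=range(20), p=100))  # (1.0, 0.0625)
```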
Results, in percentages, are displayed in Figure 10.2 (or in Table 10.2).
Table 10.1: Experimental results for simulated data

Sim 1: K = 4, mean shift, ind. features
                      Err (%)       Var           Time
  CS general cov      4.6 (1.5)     98.5 (7.2)    884h
  Fisher EM           5.8 (8.7)     78.4 (5.2)    1645m
  Clustvarsel         60.2 (10.7)   37.8 (29.1)   383h
  LumiWCluster-Kuan   4.2 (6.8)     77.9 (4)      389s
  LumiWCluster-Wang   4.3 (6.9)     78.4 (3.9)    619s
  Mix-GLOSS           3.2 (1.6)     80 (0.9)      15h

Sim 2: K = 2, mean shift, dependent features
  CS general cov      15.4 (2)      99.7 (0.9)    783h
  Fisher EM           7.4 (2.3)     80.9 (2.8)    8m
  Clustvarsel         7.3 (2)       33.4 (20.7)   166h
  LumiWCluster-Kuan   6.4 (1.8)     79.8 (0.4)    155s
  LumiWCluster-Wang   6.3 (1.7)     79.9 (0.3)    14s
  Mix-GLOSS           7.7 (2)       84.1 (3.4)    2h

Sim 3: K = 4, 1D mean shift, ind. features
  CS general cov      30.4 (5.7)    55 (46.8)     1317h
  Fisher EM           23.3 (6.5)    36.6 (5.5)    22m
  Clustvarsel         65.8 (11.5)   23.2 (29.1)   542h
  LumiWCluster-Kuan   32.3 (2.1)    80 (0.2)      83s
  LumiWCluster-Wang   30.8 (3.6)    80 (0.2)      1292s
  Mix-GLOSS           34.7 (9.2)    81 (8.8)      21h

Sim 4: K = 4, mean shift, ind. features
  CS general cov      62.6 (5.5)    99.9 (0.2)    112h
  Fisher EM           56.7 (10.4)   55 (4.8)      195m
  Clustvarsel         73.2 (4)      24 (12)       767h
  LumiWCluster-Kuan   69.2 (11.2)   99 (2)        876s
  LumiWCluster-Wang   69.7 (11.9)   99.1 (2.1)    825s
  Mix-GLOSS           66.9 (9.1)    97.5 (1.2)    11h
Table 10.2: TPR versus FPR (in %), average computed over 25 repetitions, for the best performing algorithms

              Simulation 1    Simulation 2    Simulation 3    Simulation 4
              TPR    FPR      TPR    FPR      TPR    FPR      TPR    FPR
  MIX-GLOSS   99.2   0.15     82.8   3.35     88.4   6.7      78.0   1.2
  LUMI-KUAN   99.2   2.8      100.0  0.2      100.0  0.05     50.0   0.05
  FISHER-EM   98.6   2.4      88.8   1.7      83.8   58.25    62.0   40.75
[Figure 10.2: TPR versus FPR (in %) for the best performing algorithms (MIX-GLOSS, LUMI-KUAN, FISHER-EM) on the four simulations.]
10.3 Discussion
After reviewing Tables 10.1–10.2 and Figure 10.2, we see that there is no definitive winner in all situations with regard to all criteria. Depending on the objectives and constraints of the problem, the following observations deserve to be highlighted.
LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior on the other criteria. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort was spent on each implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation time is worth mentioning here.
The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Wang and Zhu, 2008) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.
From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best, or close to the best solution, in terms of fall-out and recall.
Conclusions
Summary
The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.
In this thesis, we have proved the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.
The equivalence between LDA and OS problems allows all the resources available for solving regression problems to be brought to bear on linear discrimination. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.
In Part II, we used a variational approach of the group-Lasso penalty to preserve this equivalence, justifying the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver (GLOSS) algorithm, which proved its effectiveness by inducing extremely parsimonious models without sacrificing predictive capability. GLOSS has been tested on four artificial and three real datasets, outperforming other state-of-the-art algorithms in almost all situations.
In Part III, this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As in the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion to maximize at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection of the penalty parameter. Also in this case, the theory has been put into practice with the implementation of Mix-GLOSS. So far, due to time constraints, only artificial datasets have been tested, with positive results.
Perspectives
Even if the preliminary results are promising, Mix-GLOSS has not been sufficiently tested. We plan to test it at least with the same real datasets that we used with GLOSS; however, more testing would be recommended in both cases. These algorithms are well suited to genomic data, where the number of samples is smaller than the number of variables, but other high-dimension low-sample-size (HDLSS) domains are also possible: identification of male or female silhouettes, fungal species, or fish species based on shape and texture (Clemmensen et al., 2011), and the Stirling faces (Roth and Lange, 2004) are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), the well-known Fisher's Iris dataset, and six others from UCI (Bouveyron and Brunet, 2012a) have also been tested in the bibliography.
At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was achieving functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version should be made available for GLOSS and Mix-GLOSS in the short term.
The theory developed in this thesis, and the programming structure used for its implementation, allow easy alterations to the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix; spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as a Laplacian describing the relationships between variables; this can be used to implement pairwise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-like penalties. Some of those possibilities have been partially implemented, such as the diagonal version of GLOSS; however, they have not been properly tested, or even updated with the latest algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time deadlines for the publication of this thesis.
From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized in Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Not knowing this criterion does not prevent successful runs of Mix-GLOSS: other mechanisms that do not involve the computation of the real criterion have been used for stopping the EM algorithm and for model selection. However, further investigations must be conducted in this direction to assess the convergence properties of this algorithm.
At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was made in the domains of outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers models the population with a mixture model where the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather together all those points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.
Appendix
A Matrix Properties
Property 1. By definition, ΣW and ΣB are both symmetric matrices:
\[ \Sigma_W = \frac{1}{n} \sum_{k=1}^{g} \sum_{i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^\top, \qquad \Sigma_B = \frac{1}{n} \sum_{k=1}^{g} n_k (\mu_k - \bar{x})(\mu_k - \bar{x})^\top. \]
Property 2. \( \partial (x^\top a)/\partial x = \partial (a^\top x)/\partial x = a \).

Property 3. \( \partial (x^\top A x)/\partial x = (A + A^\top) x \).

Property 4. \( \partial |X^{-1}|/\partial X = -|X^{-1}| (X^{-1})^\top \).

Property 5. \( \partial (a^\top X b)/\partial X = a b^\top \).

Property 6. \( \dfrac{\partial}{\partial X} \operatorname{tr}\big( A X^{-1} B \big) = -(X^{-1} B A X^{-1})^\top = -X^{-\top} A^\top B^\top X^{-\top} \).
B The Penalized-OS Problem is anEigenvector Problem
In this appendix we answer the question of why the solution of a penalized optimal scoring regression involves an eigenvector decomposition. The p-OS problem has the form
\[
\min_{\theta_k, \beta_k} \; \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k \tag{B.1}
\]
\[
\text{s.t.} \quad \theta_k^\top Y^\top Y \theta_k = 1, \qquad \theta_\ell^\top Y^\top Y \theta_k = 0 \;\; \forall \ell < k,
\]
for k = 1, …, K − 1. The Lagrangian associated to Problem (B.1) is
\[
L_k(\theta_k, \beta_k, \lambda_k, \nu_k) = \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k + \lambda_k \big( \theta_k^\top Y^\top Y \theta_k - 1 \big) + \sum_{\ell < k} \nu_\ell \, \theta_\ell^\top Y^\top Y \theta_k. \tag{B.2}
\]
Setting the gradient of (B.2) with respect to βk to zero gives the value of the optimal βk:
\[
\beta_k^\star = (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k. \tag{B.3}
\]
The objective function of (B.1) evaluated at βk⋆ is
\[
\min_{\theta_k} \|Y\theta_k - X\beta_k^\star\|_2^2 + \beta_k^{\star\top} \Omega_k \beta_k^\star
= \min_{\theta_k} \theta_k^\top Y^\top \big( I - X (X^\top X + \Omega_k)^{-1} X^\top \big) Y \theta_k
\]
\[
= \max_{\theta_k} \theta_k^\top Y^\top X (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k. \tag{B.4}
\]
If the penalty matrix Ωk is identical for all problems, Ωk = Ω, then (B.4) corresponds to an eigen-problem where the k score vectors θk are the eigenvectors of Y⊤X(X⊤X + Ω)⁻¹X⊤Y.
B.1 How to Solve the Eigenvector Decomposition
Computing an eigen-decomposition of an expression like Y⊤X(X⊤X + Ω)⁻¹X⊤Y is not trivial, due to the p × p inverse. For some datasets, p can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.
Let M be the matrix Y⊤X(X⊤X + Ω)⁻¹X⊤Y, so that expression (B.4) can be rewritten in a compact way:
\[
\max_{\Theta \in \mathbb{R}^{K \times (K-1)}} \operatorname{tr}\big( \Theta^\top M \Theta \big) \quad \text{s.t.} \quad \Theta^\top Y^\top Y \Theta = I_{K-1}. \tag{B.5}
\]
If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the (K − 1) × (K − 1) matrix MΘ be Θ⊤MΘ. The classical eigenvector formulation associated to (B.5) is
\[
M_\Theta v = \lambda v, \tag{B.6}
\]
where v is an eigenvector and λ the associated eigenvalue of MΘ. Operating,
\[
v^\top M_\Theta v = \lambda \;\Leftrightarrow\; v^\top \Theta^\top M \Theta v = \lambda.
\]
Making the change of variables w = Θv, we obtain an alternative eigenproblem, where the w are eigenvectors of M with λ the associated eigenvalue:
\[
w^\top M w = \lambda. \tag{B.7}
\]
Therefore, v are the eigenvectors of the matrix MΘ and w are the eigenvectors of the matrix M. Note that the only difference between the (K − 1) × (K − 1) matrix MΘ and the K × K matrix M is the K × (K − 1) matrix Θ in the expression MΘ = Θ⊤MΘ. Then, to avoid the computation of the p × p inverse (X⊤X + Ω)⁻¹, we can use the optimal value of the coefficient matrix B⋆ = (X⊤X + Ω)⁻¹X⊤YΘ in MΘ:
\[
M_\Theta = \Theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta = \Theta^\top Y^\top X B^\star.
\]
Thus, the eigen-decomposition of the (K − 1) × (K − 1) matrix MΘ = Θ⊤Y⊤XB⋆ yields the v eigenvectors of (B.6). To obtain the w eigenvectors of the alternative formulation (B.7), the change of variables w = Θv needs to be undone.
To summarize, we calculate the v eigenvectors from the eigen-decomposition of the tractable MΘ matrix, evaluated as Θ⊤Y⊤XB⋆; then the definitive eigenvectors w are recovered as w = Θv. The final step is the reconstruction of the optimal score matrix Θ⋆ using the vectors w as its columns. At this point we understand what the literature calls "updating the initial score matrix": multiplying the initial Θ by the eigenvector matrix V from decomposition (B.6) reverses the change of variables, restoring the w vectors. The B⋆ matrix also needs to be "updated", by multiplying it by the same matrix of eigenvectors V, in order to account for the initial Θ matrix used in the first computation of B⋆:
\[
B^\star \leftarrow (X^\top X + \Omega)^{-1} X^\top Y \Theta V = B^\star V.
\]
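This "update" can be sketched numerically: solve one linear system for B0, eigen-decompose the small (K − 1) × (K − 1) matrix MΘ = Θ0⊤Y⊤XB0 instead of anything of size p, then rotate Θ0 and B0 by the eigenvector matrix V. The symmetrization of MΘ is a numerical safeguard, and `update_scores` is a hypothetical name:

```python
import numpy as np

def update_scores(X, Y, Theta0, Omega):
    # One linear solve replaces the explicit p x p inverse.
    B0 = np.linalg.solve(X.T @ X + Omega, X.T @ Y @ Theta0)
    M_Theta = Theta0.T @ Y.T @ X @ B0            # small (K-1) x (K-1) matrix
    M_Theta = (M_Theta + M_Theta.T) / 2.0        # symmetrize for stability
    eigvals, V = np.linalg.eigh(M_Theta)
    V = V[:, np.argsort(eigvals)[::-1]]          # decreasing eigenvalue order
    return Theta0 @ V, B0 @ V                    # "updated" Theta and B
```

By linearity of B in Θ, the rotated coefficients B0·V are exactly the coefficients that a fresh solve with the updated scores Θ0·V would produce, which is the point of the update trick.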
B.2 Why the OS Problem is Solved as an Eigenvector Problem
In the optimal scoring literature, the score matrix Θ⋆ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix M = Y⊤X(X⊤X + Ω)⁻¹X⊤Y.
By definition of the eigen-decomposition, the eigenvectors of the matrix M (called w in (B.7)) form a basis, so that any score vector θ can be expressed as a linear combination of them:
\[
\theta_k = \sum_{m=1}^{K-1} \alpha_m w_m, \quad \text{s.t.} \quad \theta_k^\top \theta_k = 1. \tag{B.8}
\]
The score vectors' orthogonality constraint θk⊤θk = 1 can also be expressed as a function of this basis:
\[
\Big( \sum_{m=1}^{K-1} \alpha_m w_m \Big)^\top \Big( \sum_{m=1}^{K-1} \alpha_m w_m \Big) = 1,
\]
which, as per the eigenvector properties, reduces to
\[
\sum_{m=1}^{K-1} \alpha_m^2 = 1. \tag{B.9}
\]
Let M be multiplied by a score vector θk, which can be replaced by its linear combination of eigenvectors wm (B.8):
\[
M \theta_k = M \sum_{m=1}^{K-1} \alpha_m w_m = \sum_{m=1}^{K-1} \alpha_m M w_m.
\]
As the wm are the eigenvectors of the matrix M, the relationship Mwm = λm wm can be used to obtain
\[
M \theta_k = \sum_{m=1}^{K-1} \alpha_m \lambda_m w_m.
\]
Multiplying on the right-hand side by θk⊤, and on the left-hand side by its corresponding linear combination of eigenvectors:
\[
\theta_k^\top M \theta_k = \Big( \sum_{\ell=1}^{K-1} \alpha_\ell w_\ell \Big)^\top \Big( \sum_{m=1}^{K-1} \alpha_m \lambda_m w_m \Big).
\]
This equation can be simplified using the orthogonality property of eigenvectors, according to which wℓ⊤wm = 0 for any ℓ ≠ m, giving
\[
\theta_k^\top M \theta_k = \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m.
\]
The optimization problem (B.5) for discriminant direction k can be rewritten as
\[
\max_{\theta_k \in \mathbb{R}^{K \times 1}} \theta_k^\top M \theta_k = \max_{\theta_k \in \mathbb{R}^{K \times 1}} \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m, \tag{B.10}
\]
with
\[
\theta_k = \sum_{m=1}^{K-1} \alpha_m w_m \quad \text{and} \quad \sum_{m=1}^{K-1} \alpha_m^2 = 1.
\]
One way of maximizing Problem (B.10) is choosing αm = 1 for m = k and αm = 0 otherwise. Hence, as θk = Σ_{m=1}^{K−1} αm wm, the resulting score vector θk is equal to the k-th eigenvector wk. As a summary, it can be concluded that the solution to the original problem (B.1) can be obtained by an eigenvector decomposition of the matrix M = Y⊤X(X⊤X + Ω)⁻¹X⊤Y.
C Solving Fisher's Discriminant Problem
The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance, under a unitary constraint on the within-class variance:
\[
\max_{\beta \in \mathbb{R}^p} \; \beta^\top \Sigma_B \beta \tag{C.1a}
\]
\[
\text{s.t.} \quad \beta^\top \Sigma_W \beta = 1, \tag{C.1b}
\]
where ΣB and ΣW are, respectively, the between-class variance and the within-class variance of the original p-dimensional data.
The Lagrangian of Problem (C.1) is
\[
L(\beta, \nu) = \beta^\top \Sigma_B \beta - \nu \big( \beta^\top \Sigma_W \beta - 1 \big),
\]
so that its first derivative with respect to β is
\[
\frac{\partial L(\beta, \nu)}{\partial \beta} = 2 \Sigma_B \beta - 2 \nu \Sigma_W \beta.
\]
A necessary optimality condition for β⋆ is that this derivative is zero, that is,
\[
\Sigma_B \beta^\star = \nu \Sigma_W \beta^\star.
\]
Provided ΣW is full rank, we have
\[
\Sigma_W^{-1} \Sigma_B \beta^\star = \nu \beta^\star. \tag{C.2}
\]
Thus, the solutions β⋆ match the definition of an eigenvector of the matrix ΣW⁻¹ΣB, with eigenvalue ν. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:
\[
\beta^{\star\top} \Sigma_B \beta^\star = \beta^{\star\top} \Sigma_W \Sigma_W^{-1} \Sigma_B \beta^\star
= \nu \, \beta^{\star\top} \Sigma_W \beta^\star \quad \text{from (C.2)}
\quad = \nu \quad \text{from (C.1b)}.
\]
That is, the optimal value of the objective function to be maximized is the eigenvalue ν. Hence, ν is the largest eigenvalue of ΣW⁻¹ΣB, and β⋆ is any eigenvector corresponding to this maximal eigenvalue.
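Numerically, (C.2) can be solved as a plain eigenproblem on ΣW⁻¹ΣB, with the returned vector rescaled so that constraint (C.1b) holds. A minimal sketch (the function name and test matrices are hypothetical):

```python
import numpy as np

def fisher_direction(Sb, Sw):
    # Leading eigenvector of Sw^{-1} Sb, rescaled so that b' Sw b = 1.
    nu, V = np.linalg.eig(np.linalg.solve(Sw, Sb))
    k = int(np.argmax(nu.real))
    b = V[:, k].real
    return b / np.sqrt(b @ Sw @ b), float(nu.real[k])
```

The achieved objective b⊤ΣB b then equals the eigenvalue ν, as derived above.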
D Alternative Variational Formulation for the Group-Lasso
In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:
\[
\min_{\tau \in \mathbb{R}^p} \; \min_{B \in \mathbb{R}^{p \times (K-1)}} \; J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j} \tag{D.1a}
\]
\[
\text{s.t.} \quad \sum_{j=1}^{p} \tau_j = 1, \tag{D.1b}
\]
\[
\qquad \tau_j \ge 0, \;\; j = 1, \ldots, p. \tag{D.1c}
\]
Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let B ∈ R^{p×(K−1)} be a matrix composed of row vectors βj ∈ R^{K−1}: \( B = \big( \beta^{1\top}, \ldots, \beta^{p\top} \big)^\top \).
\[
L(B, \tau, \lambda, \nu_0, \nu) = J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j} + \nu_0 \Big( \sum_{j=1}^{p} \tau_j - 1 \Big) - \sum_{j=1}^{p} \nu_j \tau_j. \tag{D.2}
\]
The starting point is the Lagrangian (D.2), which is differentiated with respect to τj to get the optimal value τj⋆:
\[
\frac{\partial L(B, \tau, \lambda, \nu_0, \nu)}{\partial \tau_j}\bigg|_{\tau_j = \tau_j^\star} = 0
\;\Rightarrow\; -\frac{\lambda w_j^2 \|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0
\]
\[
\;\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{\star 2} - \nu_j \tau_j^{\star 2} = 0
\;\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{\star 2} = 0.
\]
The last two expressions are related through a property of the Lagrange multipliers, which states that νj⋆ gj(τ⋆) = 0, where νj is the Lagrange multiplier and gj(τ) is the corresponding inequality constraint. Then the optimal τj⋆ can be deduced:
\[
\tau_j^\star = \sqrt{\frac{\lambda}{\nu_0}} \, w_j \|\beta^j\|_2.
\]
Placing this optimal value of τj into constraint (D.1b):
\[
\sum_{j=1}^{p} \tau_j = 1 \;\Rightarrow\; \tau_j^\star = \frac{w_j \|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2}. \tag{D.3}
\]
With this value of τj Problem (D1) is equivalent to
minBisinRptimesKminus1
J(B) + λ
psumj=1
wj∥∥βj∥∥
2
2
(D4)
This problem is a slight alteration of the standard group-Lasso as the penalty is squaredcompared to the usual form This square only affects the strength of the penalty and theusual properties of the group-Lasso apply to the solution of problem D4) In particularits solution is expected to be sparse with some null vectors βj
The penalty term of (D.1a) can be conveniently presented as λB⊤ΩB, where
\[
\Omega = \operatorname{diag}\Big( \frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \ldots, \frac{w_p^2}{\tau_p} \Big). \tag{D.5}
\]
Using the value of τj⋆ from (D.3), each diagonal component of Ω is
\[
(\Omega)_{jj} = \frac{w_j \sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2}{\|\beta^j\|_2}. \tag{D.6}
\]
In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.
D.1 Useful Properties
Lemma D.1. If J is convex, Problem (D.1) is convex.
In what follows, J is a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.
Lemma D.2. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (D.4) is
\[
\Big\{ V \in \mathbb{R}^{p \times (K-1)} : V = \frac{\partial J(B)}{\partial B} + 2\lambda \Big( \sum_{j=1}^{p} w_j \|\beta^j\|_2 \Big) G \Big\}, \tag{D.7}
\]
where G is a p × (K−1) matrix whose rows g^j are defined as follows. Let S(B) denote the row-wise support of B, S(B) = { j ∈ {1, …, p} : ‖βj‖2 ≠ 0 }; then we have
\[
\forall j \in S(B), \quad g^j = w_j \|\beta^j\|_2^{-1} \beta^j, \tag{D.8}
\]
\[
\forall j \notin S(B), \quad \|g^j\|_2 \le w_j. \tag{D.9}
\]
This condition results in an equality for the "active" non-zero vectors βj, and an inequality for the other ones, which both provide essential building blocks of our algorithm.
Lemma D.3. Problem (D.4) admits at least one solution, which is unique if J(B) is strictly convex. All critical points B⋆ of the objective function verifying the following conditions are global minima. Let S(B⋆) denote the row-wise support of B⋆, S(B⋆) = { j ∈ {1, …, p} : ‖β⋆j‖2 ≠ 0 }, and let S̄(B⋆) be its complement; then we have
\[
\forall j \in S(B^\star), \quad -\frac{\partial J(B^\star)}{\partial \beta^j} = 2\lambda \Big( \sum_{j'=1}^{p} w_{j'} \|\beta^{\star j'}\|_2 \Big) w_j \|\beta^{\star j}\|_2^{-1} \beta^{\star j}, \tag{D.10a}
\]
\[
\forall j \in \bar{S}(B^\star), \quad \Big\| \frac{\partial J(B^\star)}{\partial \beta^j} \Big\|_2 \le 2\lambda w_j \Big( \sum_{j'=1}^{p} w_{j'} \|\beta^{\star j'}\|_2 \Big). \tag{D.10b}
\]
In particular, Lemma D.3 provides a well-defined characterization of the support of the solution, which is not easily obtained from the direct analysis of the variational problem (D1).
D.2 An Upper Bound on the Objective Function
Lemma D.4. The objective function of the variational form (D1) is an upper bound on the group-Lasso objective function (D4), and for a given $\mathbf{B}$ the gap between these objectives is null at $\tau$ such that
$$\tau_j = \frac{w_j\|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2} \enspace .$$
Proof. The objective functions of (4.21) and (4.24) only differ in their second term. Let $\tau \in \mathbb{R}^p$ be any feasible vector; we have
$$\begin{aligned}
\Big(\sum_{j=1}^{p} w_j\|\beta^j\|_2\Big)^2 &= \Big(\sum_{j=1}^{p} \tau_j^{1/2}\, \frac{w_j\|\beta^j\|_2}{\tau_j^{1/2}}\Big)^2 \\
&\le \Big(\sum_{j=1}^{p} \tau_j\Big)\Big(\sum_{j=1}^{p} \frac{w_j^2\|\beta^j\|_2^2}{\tau_j}\Big) \\
&\le \sum_{j=1}^{p} \frac{w_j^2\|\beta^j\|_2^2}{\tau_j} \enspace ,
\end{aligned}$$
where we used the Cauchy–Schwarz inequality in the second line and the definition of the feasibility set of $\tau$ in the last one.
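Lemma D.4 lends itself to a quick numerical illustration (a sketch with made-up data, not part of the original text): for any feasible $\tau$ the variational term upper-bounds the squared group-Lasso penalty, and the bound is tight at the $\tau$ given by the lemma.

```python
import numpy as np

rng = np.random.default_rng(1)
p, K = 8, 3
B = rng.normal(size=(p, K - 1))
w = rng.uniform(0.5, 2.0, size=p)
norms = np.linalg.norm(B, axis=1)

lhs = np.sum(w * norms) ** 2                 # squared group-Lasso penalty

# an arbitrary feasible tau: positive entries summing to one
tau = rng.uniform(0.1, 1.0, size=p)
tau /= tau.sum()
assert lhs <= np.sum(w**2 * norms**2 / tau) + 1e-9   # upper bound holds

# the bound is tight at the tau given by Lemma D.4
tau_star = w * norms / np.sum(w * norms)
assert np.isclose(lhs, np.sum(w**2 * norms**2 / tau_star))
```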
This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of $\tau$ and $\beta$ are intertwined.
E Invariance of the Group-Lasso to Unitary Transformations
The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients $\mathbf{B}_0$ are optimal for the score values $\boldsymbol\Theta_0$, and if the optimal scores $\boldsymbol\Theta$ are obtained by a unitary transformation of $\boldsymbol\Theta_0$, say $\boldsymbol\Theta = \boldsymbol\Theta_0\mathbf{V}$ (where $\mathbf{V} \in \mathbb{R}^{M\times M}$ is a unitary matrix), then $\mathbf{B} = \mathbf{B}_0\mathbf{V}$ is optimal conditionally on $\boldsymbol\Theta$, that is, $(\boldsymbol\Theta, \mathbf{B})$ is a global solution corresponding to the optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.
Proposition E.1. Let $\hat{\mathbf{B}}$ be a solution of
$$\min_{\mathbf{B}\in\mathbb{R}^{p\times M}} \|\mathbf{Y}-\mathbf{X}\mathbf{B}\|_F^2 + \lambda \sum_{j=1}^{p} w_j\|\beta^j\|_2 \enspace , \qquad \text{(E1)}$$
and let $\tilde{\mathbf{Y}} = \mathbf{Y}\mathbf{V}$, where $\mathbf{V} \in \mathbb{R}^{M\times M}$ is a unitary matrix. Then $\tilde{\mathbf{B}} = \hat{\mathbf{B}}\mathbf{V}$ is a solution of
$$\min_{\mathbf{B}\in\mathbb{R}^{p\times M}} \big\|\tilde{\mathbf{Y}}-\mathbf{X}\mathbf{B}\big\|_F^2 + \lambda \sum_{j=1}^{p} w_j\|\beta^j\|_2 \enspace . \qquad \text{(E2)}$$
Proof. The first-order necessary optimality conditions for $\hat{\mathbf{B}}$ are
$$\forall j \in \mathcal{S}(\hat{\mathbf{B}})\,, \quad 2\,\mathbf{x}^{j\top}\big(\mathbf{x}^{j}\hat{\beta}^{j} - \mathbf{Y}\big) + \lambda w_j \big\|\hat{\beta}^{j}\big\|_2^{-1} \hat{\beta}^{j} = \mathbf{0} \enspace , \qquad \text{(E3a)}$$
$$\forall j \in \bar{\mathcal{S}}(\hat{\mathbf{B}})\,, \quad 2\,\Big\|\mathbf{x}^{j\top}\big(\mathbf{x}^{j}\hat{\beta}^{j} - \mathbf{Y}\big)\Big\|_2 \le \lambda w_j \enspace , \qquad \text{(E3b)}$$
where $\mathcal{S}(\hat{\mathbf{B}}) \subseteq \{1,\dots,p\}$ denotes the set of non-zero row vectors of $\hat{\mathbf{B}}$ and $\bar{\mathcal{S}}(\hat{\mathbf{B}})$ is its complement.

First, we note that, from the definition of $\tilde{\mathbf{B}}$, we have $\mathcal{S}(\tilde{\mathbf{B}}) = \mathcal{S}(\hat{\mathbf{B}})$. We may then rewrite the above conditions as follows:
$$\forall j \in \mathcal{S}(\tilde{\mathbf{B}})\,, \quad 2\,\mathbf{x}^{j\top}\big(\mathbf{x}^{j}\tilde{\beta}^{j} - \tilde{\mathbf{Y}}\big) + \lambda w_j \big\|\tilde{\beta}^{j}\big\|_2^{-1} \tilde{\beta}^{j} = \mathbf{0} \enspace , \qquad \text{(E4a)}$$
$$\forall j \in \bar{\mathcal{S}}(\tilde{\mathbf{B}})\,, \quad 2\,\Big\|\mathbf{x}^{j\top}\big(\mathbf{x}^{j}\tilde{\beta}^{j} - \tilde{\mathbf{Y}}\big)\Big\|_2 \le \lambda w_j \enspace , \qquad \text{(E4b)}$$
where (E4a) is obtained by multiplying both sides of Equation (E3a) by $\mathbf{V}$, also using that $\mathbf{V}\mathbf{V}^\top = \mathbf{I}$, so that $\forall \mathbf{u} \in \mathbb{R}^M$, $\|\mathbf{u}^\top\|_2 = \|\mathbf{u}^\top\mathbf{V}\|_2$. Equation (E4b) is also obtained from the latter relationship. Conditions (E4) are then recognized as the first-order necessary conditions for $\tilde{\mathbf{B}}$ to be a solution of Problem (E2). As the latter is convex, these conditions are sufficient, which concludes the proof.
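The invariance can also be checked numerically: right-multiplying both $\mathbf{Y}$ and $\mathbf{B}$ by a unitary $\mathbf{V}$ changes neither the fit nor the row norms, so the whole objective of (E1) is preserved. A minimal NumPy sketch (illustrative names, not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, M = 20, 5, 3
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, M))
B = rng.normal(size=(p, M))
w = rng.uniform(0.5, 2.0, size=p)
lam = 0.3

def objective(Y, B):
    # group-Lasso objective of (E.1): Frobenius fit + row-wise L2 penalty
    fit = np.linalg.norm(Y - X @ B, 'fro') ** 2
    penalty = lam * np.sum(w * np.linalg.norm(B, axis=1))
    return fit + penalty

# a random unitary (orthogonal) matrix from a QR factorization
V, _ = np.linalg.qr(rng.normal(size=(M, M)))

# the objective is invariant under the simultaneous rotation of Y and B
assert np.isclose(objective(Y, B), objective(Y @ V, B @ V))
```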
F Expected Complete Likelihood and Likelihood
Section 7.1.2 explains that maximizing the conditional expectation of the complete log-likelihood $Q(\theta,\theta')$ (7.7) by means of the EM algorithm also maximizes the log-likelihood (7.1). The value of the log-likelihood can be computed from its definition (7.1), but there is a shorter way to compute it from $Q(\theta,\theta')$ when the latter is available:
$$\mathcal{L}(\theta) = \sum_{i=1}^{n} \log\Big(\sum_{k=1}^{K} \pi_k f_k(\mathbf{x}_i;\theta_k)\Big) \enspace , \qquad \text{(F1)}$$
$$Q(\theta,\theta') = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(\theta')\log\big(\pi_k f_k(\mathbf{x}_i;\theta_k)\big) \enspace , \qquad \text{(F2)}$$
$$\text{with} \quad t_{ik}(\theta') = \frac{\pi'_k f_k(\mathbf{x}_i;\theta'_k)}{\sum_{\ell} \pi'_\ell f_\ell(\mathbf{x}_i;\theta'_\ell)} \enspace . \qquad \text{(F3)}$$
In the EM algorithm, $\theta'$ is the vector of model parameters at the previous iteration, $t_{ik}(\theta')$ are the posterior probabilities computed from $\theta'$ at the previous E-step, and $\theta$ (without "prime") denotes the parameters of the current iteration, to be obtained by maximizing $Q(\theta,\theta')$.
Using (F3), we have
$$\begin{aligned}
Q(\theta,\theta') &= \sum_{i,k} t_{ik}(\theta')\log\big(\pi_k f_k(\mathbf{x}_i;\theta_k)\big) \\
&= \sum_{i,k} t_{ik}(\theta')\log\big(t_{ik}(\theta)\big) + \sum_{i,k} t_{ik}(\theta')\log\Big(\sum_{\ell} \pi_\ell f_\ell(\mathbf{x}_i;\theta_\ell)\Big) \\
&= \sum_{i,k} t_{ik}(\theta')\log\big(t_{ik}(\theta)\big) + \mathcal{L}(\theta) \enspace .
\end{aligned}$$
In particular, after the evaluation of $t_{ik}$ in the E-step, where $\theta = \theta'$, the log-likelihood can be computed from the value of $Q(\theta,\theta)$ (7.7) and the entropy of the posterior probabilities:
$$\begin{aligned}
\mathcal{L}(\theta) &= Q(\theta,\theta) - \sum_{i,k} t_{ik}(\theta)\log\big(t_{ik}(\theta)\big) \\
&= Q(\theta,\theta) + H(\mathbf{T}) \enspace .
\end{aligned}$$
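The identity $\mathcal{L}(\theta) = Q(\theta,\theta) + H(\mathbf{T})$ can be verified on a toy Gaussian mixture with identity covariance. The sketch below (illustrative, not part of the thesis) computes the log-likelihood both directly from (F1) and via $Q$ plus the posterior entropy:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, K = 50, 2, 3
X = rng.normal(size=(n, p))
pi = np.array([0.2, 0.5, 0.3])
mus = rng.normal(size=(K, p))

# component densities f_k(x_i; theta_k), Gaussian with Sigma = I
F = np.column_stack([
    np.exp(-0.5 * np.sum((X - mus[k]) ** 2, axis=1)) / (2 * np.pi) ** (p / 2)
    for k in range(K)
])

L = np.sum(np.log(F @ pi))                            # log-likelihood (F.1)

T = (F * pi) / (F * pi).sum(axis=1, keepdims=True)    # posteriors (F.3), theta' = theta
Q = np.sum(T * np.log(F * pi))                        # expected complete log-lik. (F.2)
H = -np.sum(T * np.log(T))                            # entropy of the posteriors

assert np.isclose(L, Q + H)
```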
G Derivation of the M-Step Equations
This appendix details the derivation of expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as
$$\begin{aligned}
Q(\theta,\theta') &= \sum_{i,k} t_{ik}(\theta')\log\big(\pi_k f_k(\mathbf{x}_i;\theta_k)\big) \\
&= \sum_{k} \Big(\sum_{i} t_{ik}\Big)\log \pi_k - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\boldsymbol\Sigma| - \frac{1}{2}\sum_{i,k} t_{ik}\,(\mathbf{x}_i-\boldsymbol\mu_k)^\top\boldsymbol\Sigma^{-1}(\mathbf{x}_i-\boldsymbol\mu_k) \enspace ,
\end{aligned}$$
which has to be maximized subject to $\sum_{k} \pi_k = 1$.
The Lagrangian of this problem is
$$L(\theta) = Q(\theta,\theta') + \lambda\Big(\sum_{k} \pi_k - 1\Big) \enspace .$$
The partial derivatives of the Lagrangian are set to zero to obtain the optimal values of $\pi_k$, $\boldsymbol\mu_k$ and $\boldsymbol\Sigma$.
G.1 Prior probabilities
$$\frac{\partial L(\theta)}{\partial \pi_k} = 0 \iff \frac{1}{\pi_k}\sum_{i} t_{ik} + \lambda = 0 \enspace ,$$
where $\lambda$ is identified from the constraint, leading to
$$\pi_k = \frac{1}{n}\sum_{i} t_{ik} \enspace .$$
G.2 Means
$$\frac{\partial L(\theta)}{\partial \boldsymbol\mu_k} = 0 \iff -\frac{1}{2}\sum_{i} t_{ik}\, 2\,\boldsymbol\Sigma^{-1}(\boldsymbol\mu_k - \mathbf{x}_i) = 0
\;\Rightarrow\; \boldsymbol\mu_k = \frac{\sum_{i} t_{ik}\,\mathbf{x}_i}{\sum_{i} t_{ik}} \enspace .$$
G.3 Covariance Matrix
$$\frac{\partial L(\theta)}{\partial \boldsymbol\Sigma^{-1}} = 0 \iff \underbrace{\frac{n}{2}\boldsymbol\Sigma}_{\text{as per property 4}} - \underbrace{\frac{1}{2}\sum_{i,k} t_{ik}\,(\mathbf{x}_i-\boldsymbol\mu_k)(\mathbf{x}_i-\boldsymbol\mu_k)^\top}_{\text{as per property 5}} = 0
\;\Rightarrow\; \boldsymbol\Sigma = \frac{1}{n}\sum_{i,k} t_{ik}\,(\mathbf{x}_i-\boldsymbol\mu_k)(\mathbf{x}_i-\boldsymbol\mu_k)^\top \enspace .$$
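The three M-step updates translate directly into a few lines of NumPy. This sketch (illustrative; the responsibilities are drawn at random rather than produced by a real E-step) checks the basic properties of the resulting estimates:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, K = 100, 3, 2
X = rng.normal(size=(n, p))

# responsibilities t_ik from a hypothetical E-step: rows sum to one
T = rng.uniform(size=(n, K))
T /= T.sum(axis=1, keepdims=True)

pi = T.sum(axis=0) / n                       # prior probabilities (G.1)
mu = (T.T @ X) / T.sum(axis=0)[:, None]      # class means (G.2)

R = np.zeros((p, p))                         # pooled covariance (G.3)
for k in range(K):
    D = X - mu[k]
    R += (T[:, k, None] * D).T @ D
Sigma = R / n

assert np.isclose(pi.sum(), 1.0)                                   # priors sum to one
assert np.allclose(mu[0], np.average(X, axis=0, weights=T[:, 0]))  # weighted mean
assert np.allclose(Sigma, Sigma.T)                                 # symmetric
assert np.all(np.linalg.eigvalsh(Sigma) >= -1e-9)                  # positive semi-definite
```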
Contents
E Invariance of the Group-Lasso to Unitary Transformations 117
F Expected Complete Likelihood and Likelihood 119
G Derivation of the M-Step Equations 121
   G.1 Prior probabilities 121
   G.2 Means 122
   G.3 Covariance Matrix 122
Bibliography 123
List of Figures
1.1 MASH project logo 5

2.1 Example of relevant features 10
2.2 Four key steps of feature selection 11
2.3 Admissible sets in two dimensions for different pure norms $\|\beta\|_p$ 14
2.4 Two-dimensional regularized problems with $\|\beta\|_1$ and $\|\beta\|_2$ penalties 15
2.5 Admissible sets for the Lasso and Group-Lasso 20
2.6 Sparsity patterns for an example with 8 variables characterized by 4 parameters 20

4.1 Graphical representation of the variational approach to Group-Lasso 45

5.1 GLOSS block diagram 50
5.2 Graph and Laplacian matrix for a 3×3 image 56

6.1 TPR versus FPR for all simulations 60
6.2 2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA 62
6.3 USPS digits "1" and "0" 63
6.4 Discriminant direction between digits "1" and "0" 64
6.5 Sparse discriminant direction between digits "1" and "0" 64

9.1 Mix-GLOSS Loops Scheme 88
9.2 Mix-GLOSS model selection diagram 92

10.1 Class mean vectors for each artificial simulation 94
10.2 TPR versus FPR for all simulations 97
List of Tables
6.1 Experimental results for simulated data, supervised classification 59
6.2 Average TPR and FPR for all simulations 60
6.3 Experimental results for gene expression data, supervised classification 61

10.1 Experimental results for simulated data, unsupervised clustering 96
10.2 Average TPR versus FPR for all clustering simulations 96
Notation and Symbols
Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors, and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.
Sets
N the set of natural numbers, N = {1, 2, . . .}
R the set of reals
|A| cardinality of a set A (for finite sets, the number of elements)
Ā the complement of set A
Data
X input domain
x_i input sample, x_i ∈ X
X design matrix, X = (x_1ᵀ, . . . , x_nᵀ)ᵀ
xʲ column j of X
y_i class indicator of sample i
Y indicator matrix, Y = (y_1ᵀ, . . . , y_nᵀ)ᵀ
z complete data, z = (x, y)
G_k set of the indices of observations belonging to class k
n number of examples
K number of classes
p dimension of X
i, j, k indices, running over N
Vectors, Matrices and Norms
0 vector with all entries equal to zero
1 vector with all entries equal to one
I identity matrix
Aᵀ transpose of matrix A (ditto for vectors)
A⁻¹ inverse of matrix A
tr(A) trace of matrix A
|A| determinant of matrix A
diag(v) diagonal matrix with v on the diagonal
‖v‖₁ L1 norm of vector v
‖v‖₂ L2 norm of vector v
‖A‖_F Frobenius norm of matrix A
Probability
E[·] expectation of a random variable
var[·] variance of a random variable
N(μ, σ²) normal distribution with mean μ and variance σ²
W(W, ν) Wishart distribution with ν degrees of freedom and scale matrix W
H(X) entropy of random variable X
I(X; Y) mutual information between random variables X and Y
Mixture Models
y_ik hard membership of sample i to cluster k
f_k distribution function for cluster k
t_ik posterior probability of sample i to belong to cluster k
T posterior probability matrix
π_k prior probability or mixture proportion for cluster k
μ_k mean vector of cluster k
Σ_k covariance matrix of cluster k
θ_k parameter vector for cluster k, θ_k = (μ_k, Σ_k)
θ⁽ᵗ⁾ parameter vector at iteration t of the EM algorithm
f(X; θ) likelihood function
L(θ; X) log-likelihood function
L_C(θ; X, Y) complete log-likelihood function
Optimization
J(·) cost function
L(·) Lagrangian
β̂ generic notation for the solution w.r.t. β
β̂_ls least squares solution coefficient vector
A active set
γ step size to update the regularization path
h direction to update the regularization path
Penalized models
λ, λ₁, λ₂ penalty parameters
P_λ(θ) penalty term over a generic parameter vector
β_kj coefficient j of discriminant vector k
β_k kth discriminant vector, β_k = (β_k1, . . . , β_kp)ᵀ
B matrix of discriminant vectors, B = (β_1, . . . , β_{K−1})
βʲ jth row of B, B = (β¹ᵀ, . . . , βᵖᵀ)ᵀ
B_LDA coefficient matrix in the LDA domain
B_CCA coefficient matrix in the CCA domain
B_OS coefficient matrix in the OS domain
X_LDA data matrix in the LDA domain
X_CCA data matrix in the CCA domain
X_OS data matrix in the OS domain
θ_k score vector k
Θ score matrix, Θ = (θ_1, . . . , θ_{K−1})
Y label matrix
Ω penalty matrix
L_CP(θ; X, Z) penalized complete log-likelihood function
Σ_B between-class covariance matrix
Σ_W within-class covariance matrix
Σ_T total covariance matrix
Σ̂_B sample between-class covariance matrix
Σ̂_W sample within-class covariance matrix
Σ̂_T sample total covariance matrix
Λ inverse of covariance matrix, or precision matrix
w_j weights
τ_j penalty components of the variational approach
Part I
Context and Foundations
This thesis is divided in three parts. In Part I, I introduce the context in which this work has been developed, the project that funded it, and the constraints that we had to obey. Generic concepts are also detailed here to introduce the models and some basic notions that will be used along this document. The state of the art is also reviewed.

The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments testing its performance against other state-of-the-art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.

The second contribution is described in Part III, with a structure analogous to that of Part II, but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM algorithm. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.
1 Context
The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.

The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiègne.

From the point of view of the research, the members of the consortium must deal with four main goals:
1 Software development of the website framework and APIs
2 Classification and goal-planning in high dimensional feature spaces
3 Interfacing the platform with the 3D virtual environment and the robot arm
4 Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments
Figure 1.1: MASH project logo
The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role has been to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community; the last number I was aware of was about 300. Within those 375 extractors, there must be some of them sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code with some datasets of reference, in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors for a particular dataset does not mean that both are using the same variables.
Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we would also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user who develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of the whole database. As another example, imagine a user wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.
As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.
• Clustering Using Mixture Models. This is a well-known technique that models the data as if it was randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm, and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, a C++ library that can be interfaced with MATLAB. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D7.1-m12" (Govaert et al. 2010).
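The clustering step described in this bullet can be sketched in a few lines. The example below is illustrative only: it uses scikit-learn's `GaussianMixture` (not the mixmod C++ library used in the project) on synthetic data, fitting the mixture by EM and assigning clusters by maximum a posteriori estimation.

```python
# Model-based clustering sketch: fit a Gaussian mixture with EM and
# assign clusters by maximum a posteriori (MAP) estimation.
# Illustrative stand-in for the mixmod-based tool described in the text.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated synthetic groups of "extractors" in 2D
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(5.0, 0.3, size=(50, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM fit
labels = gmm.predict(X)  # MAP cluster assignment
```

With well-separated groups, the MAP labels recover the two generating components exactly (up to label permutation).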
• Sparse Clustering Using Penalized Optimal Scoring. This technique intends again to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis.
All details concerning the tool implemented can be found in deliverable "mash-deliverable-D7.2-m24" (Govaert et al. 2011).
• Table Clustering Using The RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractors space is defined using the RV coefficient, a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(Oi, Oj), where Oi and Oj are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors. A detailed description of this technique can be found in deliverables "mash-deliverable-D7.1-m12" (Govaert et al. 2010) and "mash-deliverable-D7.2-m24" (Govaert et al. 2011).
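As an illustration of the inner-product form mentioned above, the RV coefficient between two column-centered tables with the same rows can be computed directly from its standard definition. The helper below is an illustrative sketch, not the project's implementation.

```python
# Sketch of the RV coefficient between two data tables sharing the same rows
# (e.g. the outputs of two feature extractors on the same samples).
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient: a multivariate generalization of Pearson's squared
    correlation, computed from the n x n inner-product operators."""
    Xc = X - X.mean(axis=0)        # column-center each table
    Yc = Y - Y.mean(axis=0)
    Sx, Sy = Xc @ Xc.T, Yc @ Yc.T  # operators in sample space
    return np.trace(Sx @ Sy) / np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))

rng = np.random.default_rng(1)
A = rng.normal(size=(30, 5))
r_same = rv_coefficient(A, A)          # identical tables: RV = 1
r_affine = rv_coefficient(A, 2 * A + 3)  # invariant to scaling and shift
```

Because centering removes shifts and the ratio cancels global scalings, two tables carrying the same information get an RV of 1 regardless of units.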
I will not extend this section with further explanations about the MASH project or deeper details about the theory that we used to fulfill our commitments. I will simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al. 2010, 2011).
2 Regularization for Feature Selection
With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues rose as well. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined. Many algorithms in the field of Machine Learning make use of this statistic.
2.1 Motivations
There has been quite a recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than a "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina 2004, Fan and Fan 2008).
As a rule of thumb, in discriminant and clustering problems the computational complexity increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in the supervised environments, and easier interpretations in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.
When talking about dimensionality reduction, there are two families of techniques that could induce confusion:
• Reduction by feature transformations summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis or Independent Component Analysis are two popular examples.
• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction in the number of variables to preserve, and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008)
As a basic rule, we can use reduction techniques by feature transformation when the majority of the features are relevant, and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation of the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text:
"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.
Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats, but not distinguishing between dogs and cats. Thus the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."

Figure 2.2: The four key steps of feature selection according to Liu and Yu (2005)
There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.
2.2 Categorization of Feature Selection Techniques
Feature selection is one of the most frequent techniques for preprocessing data, in order to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.
I am reproducing here the scheme that generalizes any feature selection process, as shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps of a feature selection algorithm.
The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing scheme. Both references are excellent reviews for characterizing feature selection techniques. I am proposing a framework inspired by these references that does not cover all the possibilities, but which gives a good summary of the existing ones.
• Depending on the type of integration with the machine learning algorithm, we have:
– Filter Models – The filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.
– Wrapper Models – The wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block, while the feature subset evaluation is done in a different one. Therefore, the criterion to optimize and the criterion to evaluate may be different. Those algorithms are computationally expensive.

– Embedded Models – They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation are a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal, because they are specific to the training process of a given mining algorithm.
• Depending on the feature searching technique:
– Complete – No subsets are missed from evaluation; involves combinatorial searches.
– Sequential – Features are added (forward searches) or removed (backward searches) one at a time.
– Random – The initial subset, or even subsequent subsets, are randomly chosen to escape local optima.
• Depending on the evaluation technique:
– Distance Measures – Choosing the features that maximize the difference in separability, divergence or discrimination measures.

– Information Measures – Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

– Dependency Measures – Measuring the correlation between features.

– Consistency Measures – Finding a minimum number of features that separate classes as consistently as the full set of features can.

– Predictive Accuracy – Using the selected features to predict the labels.

– Cluster Goodness – Using the selected features to perform clustering and evaluating the result (cluster compactness, scatter separability, maximum likelihood).
The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness make it possible to evaluate subsets of features, and can be used in wrapper and embedded models.
In this thesis, we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieving this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve hard selection problems exactly when the number of features exceeds a few tens. Regularization techniques provide a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.
2.3 Regularization
In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues in ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.
An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. This happens, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:
$$\min_{\beta} \; J(\beta) + \lambda P(\beta) \tag{2.1}$$

$$\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad P(\beta) \le t \tag{2.2}$$
In the expressions (2.1) and (2.2), the parameters $\lambda$ and $t$ have a similar function, which is to control the trade-off between fitting the data to the model according to $J(\beta)$ and the effect of the penalty $P(\beta)$. The set such that the constraint in (2.2) is verified, $\{\beta : P(\beta) \le t\}$, is called the admissible set. This penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.
In this section I review the pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.
Figure 2.3: Admissible sets in two dimensions for different pure norms $\|\beta\|_p$
2.3.1 Important Properties
Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.
Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

$$\forall (x_1, x_2) \in \mathcal{X}^2, \quad f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2) \tag{2.3}$$

for any value of $t \in [0, 1]$. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function $J(\beta)$ and the penalty $P(\beta)$ are both convex.
Sparsity. Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and fewer computational resources.
Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.
2.3.2 Pure Penalties
For pure penalties, defined as $P(\beta) = \|\beta\|_p$, convexity holds for $p \ge 1$. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since the convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for $p \ge 1$.

Figure 2.4: Two-dimensional regularized problems with $\|\beta\|_1$ and $\|\beta\|_2$ penalties
Regularizing a linear model with a norm like $\|\beta\|_p$ means that the larger the component $|\beta_j|$, the more important the feature $x_j$ in the estimation. On the contrary, the closer it is to zero, the more dispensable it is. In the limit of $|\beta_j| = 0$, $x_j$ is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.
A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components ($\beta_1$ or $\beta_2$) is null, that is, if the optimal $\beta$ is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2), where $J(\beta)$ is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by $P(\beta)$ (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, such as the one corresponding to an $L_1$ penalty, has more chances of inducing sparse solutions than that of an $L_2$ penalty. That idea is displayed in Figure 2.4, where $J(\beta)$ is a quadratic function represented with three isolevel curves, whose global minimum $\beta^{ls}$ is outside the penalties' admissible region. The closest point to this $\beta^{ls}$ for the $L_1$ regularization is $\beta^{l_1}$, and for the $L_2$ regularization it is $\beta^{l_2}$. The solution $\beta^{l_1}$ is sparse because its second component is zero, while both components of $\beta^{l_2}$ are different from zero.
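This geometric argument can be checked numerically: on the same synthetic regression problem, an L1 penalty produces exact zeros while an L2 penalty only shrinks the coefficients. A sketch with scikit-learn, on illustrative data and penalty levels:

```python
# Geometry in action: the L1 penalty (Lasso) drives coefficients exactly
# to zero, while the L2 penalty (ridge) only shrinks them toward zero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.5, 1.0]  # only 3 informative features out of 20
y = X @ beta_true + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)
n_zero_l1 = int(np.sum(lasso.coef_ == 0.0))  # many exact zeros
n_zero_l2 = int(np.sum(ridge.coef_ == 0.0))  # essentially never exactly zero
```

The Lasso fit zeroes out most of the 17 irrelevant coefficients; the ridge fit keeps all 20 coefficients non-zero, merely small.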
After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the quantity and the "sharpness" of the vertexes of the greyed out area. For example, an $L_{1/3}$ penalty has a support region with sharper vertexes that would induce a sparse solution even more strongly than an $L_1$ penalty; however, the non-convex shape of the $L_{1/3}$ penalty results in difficulties during optimization that will not happen with a convex shape.
To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with $L_p$ norms with $p \le 1$, due to the fact that they are the only ones that have vertexes. On the other side, only norms with $p \ge 1$ are convex; hence the only pure penalty that yields a convex problem with a sparse solution is the $L_1$ penalty.
$L_0$ Penalties. The $L_0$ pseudo-norm of a vector $\beta$ is defined as the number of entries different from zero, that is, $P(\beta) = \|\beta\|_0 = \mathrm{card}\{\beta_j \mid \beta_j \neq 0\}$:
$$\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad \|\beta\|_0 \le t \tag{2.4}$$
where the parameter $t$ represents the maximum number of non-zero coefficients in the vector $\beta$. The larger the value of $t$ (or the lower the value of $\lambda$, if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in $\beta$. If $t$ is equal to the dimensionality of the problem (or if $\lambda = 0$), then the penalty term has no effect and $\beta$ is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes; their solutions are sparse but unstable.
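A minimal brute-force sketch makes the combinatorial nature of the $L_0$-constrained problem concrete. The data and helper below are illustrative; exhaustive search is only feasible for a handful of features.

```python
# Brute-force best-subset search: min ||y - X beta||^2  s.t.  ||beta||_0 <= t.
# Every subset of at most t features is scanned -- combinatorial cost.
from itertools import combinations
import numpy as np

def best_subset(X, y, t):
    """Exhaustive search over all subsets of at most t columns of X."""
    n, p = X.shape
    best_rss, best_cols = np.inf, ()
    for k in range(t + 1):
        for cols in combinations(range(p), k):
            if cols:
                coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
                rss = np.sum((y - X[:, cols] @ coef) ** 2)
            else:
                rss = np.sum(y ** 2)  # empty model
            if rss < best_rss:
                best_rss, best_cols = rss, cols
    return best_cols, best_rss

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))
y = 3.0 * X[:, 2] - 2.0 * X[:, 5] + 0.01 * rng.normal(size=40)
cols, rss = best_subset(X, y, t=2)  # should recover columns 2 and 5
```

With 8 features and t = 2 there are only 37 subsets to scan, but the count explodes as p grows, which is why practical methods replace the $L_0$ constraint with convex surrogates.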
$L_1$ Penalties. The penalties built using $L_1$ norms induce sparsity and stability. The resulting estimator has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):
$$\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le t \tag{2.5}$$
Despite all the advantages of the Lasso, the choice of the right penalty is not just a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples $n$ is lower than the number of variables $p$, then the maximum number of non-zero entries of $\beta$ is $n$. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one, resulting in a hardly interpretable model. In a field like genomics, where $n$ is typically some tens of individuals and $p$ several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.
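The bound of Osborne et al. (2000a) can be observed numerically. The sketch below, with illustrative data and penalty level, checks that the fitted Lasso never activates more than n variables even though p is ten times larger:

```python
# With n < p, the Lasso selects at most n non-zero coefficients.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 20, 200                          # many more variables than samples
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.1 * rng.normal(size=n)  # one informative feature

fit = Lasso(alpha=0.1, max_iter=50000).fit(X, y)
n_selected = int(np.sum(fit.coef_ != 0.0))
```

Whatever the data, `n_selected` stays bounded by n = 20; in genomics-like regimes (n in the tens, p in the thousands) this cap is exactly the limitation discussed above.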
The Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection in supervised classification (Mai et al. 2012, Witten and Tibshirani 2011) and clustering (Roth and Lange 2004, Pan et al. 2006, Pan and Shen 2007, Zhou et al. 2009, Guo et al. 2010, Witten and Tibshirani 2010, Bouveyron and Brunet 2012b,a).
The consistency of problems regularized by a Lasso penalty is also a key feature. Defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large, Leng et al. (2006) have shown that when the penalty parameter ($t$ or $\lambda$, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions under which Lasso estimators become consistent (Knight and Fu 2000, Donoho et al. 2006, Meinshausen and Bühlmann 2006, Zhao and Yu 2007, Bach 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou 2006).
$L_2$ Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the $L_2$ norm involves the square root of the sum of all squared components. In practice, when using $L_2$ penalties, the square of the norm is used, to avoid the square root and solve a linear system. Thus, an $L_2$-penalized optimization problem looks like
$$\min_{\beta} \; J(\beta) + \lambda \|\beta\|_2^2 \tag{2.6}$$
The effect of this penalty is the "equalization" of the components of the penalized parameter. To illustrate this property, let us consider a least squares problem:
$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 \tag{2.7}$$
with solution $\beta^{ls} = (X^\top X)^{-1} X^\top y$. If some input variables are highly correlated, the estimator $\beta^{ls}$ is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:
$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$
The solution to this problem is $\beta^{l_2} = (X^\top X + \lambda I_p)^{-1} X^\top y$. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by $\lambda$. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performances.
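The closed form and the eigenvalue shift can be verified directly. In this illustrative sketch, a nearly collinear column mimics the correlated-variables scenario that makes $X^\top X$ ill-conditioned:

```python
# Ridge closed form: beta = (X^T X + lambda I)^{-1} X^T y. The penalty
# shifts every eigenvalue of X^T X up by exactly lambda.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X[:, 4] = X[:, 3] + 1e-6 * rng.normal(size=50)  # nearly collinear columns
y = rng.normal(size=50)
lam = 1.0

G = X.T @ X
beta_ridge = np.linalg.solve(G + lam * np.eye(5), X.T @ y)

# Eigenvalues of the regularized Gram matrix are eig(G) + lambda
eig_G = np.linalg.eigvalsh(G)
eig_reg = np.linalg.eigvalsh(G + lam * np.eye(5))
```

The smallest eigenvalue of G here is close to zero because of the collinearity; after regularization it is close to lambda, which is what stabilizes the inverse.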
As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrote, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:
$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{\beta_j^2}{(\beta_j^{ls})^2} \tag{2.8}$$
The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is adaptive ridge regression (Grandvalet 1998, Grandvalet and Canu 2002), where the penalty parameter differs for each component: every $\lambda_j$ is optimized to penalize more or less depending on the influence of $\beta_j$ in the model.
Although $L_2$-penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.
$L_\infty$ Penalties. A special case of $L_p$ norms is the infinity norm, defined as $\|x\|_\infty = \max(|x_1|, |x_2|, \ldots, |x_p|)$. The admissible region for a penalty like $\|\beta\|_\infty \le t$ is displayed in Figure 2.3. For the $L_\infty$ norm, the greyed out region is a square containing all the $\beta$ vectors whose largest coefficient is less than or equal to the value of the penalty parameter $t$.
This norm is not commonly used as a regularization term by itself; however, it frequently appears in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing optimality conditions of sparse regularized problems. The dual norm $\|\beta\|_*$ of a norm $\|\beta\|$ is defined as

$$\|\beta\|_* = \max_{w \in \mathbb{R}^p} \; \beta^\top w \quad \text{s.t.} \quad \|w\| \le 1$$

In the case of an $L_q$ norm with $q \in [1, +\infty]$, the dual norm is the $L_r$ norm such that $1/q + 1/r = 1$. For example, the $L_2$ norm is self-dual, and the dual norm of the $L_1$ norm is the $L_\infty$ norm. This is one of the reasons why $L_\infty$ is so important even if it is not as popular as a penalty itself, because $L_1$ is. An extensive explanation of dual norms and the algorithms that make use of them can be found in Bach et al. (2011).
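The duality between $L_1$ and $L_\infty$ can be checked numerically: no point of the $L_1$ unit ball gives a larger inner product with $\beta$ than the best signed basis vector, whose value is exactly the $L_\infty$ norm. An illustrative sketch:

```python
# Numeric sanity check: the dual of the L1 norm is the L_inf norm, i.e.
# max { beta^T w : ||w||_1 <= 1 } = max_j |beta_j|.
import numpy as np

rng = np.random.default_rng(0)
beta = rng.normal(size=6)

# Random points on the L1 unit sphere never beat the signed basis vector
best_random = 0.0
for _ in range(2000):
    w = rng.normal(size=6)
    w /= np.sum(np.abs(w))           # normalize onto the L1 sphere
    best_random = max(best_random, float(beta @ w))

dual_value = float(np.max(np.abs(beta)))  # L_inf norm of beta

# The maximizer is a signed basis vector at the largest component
j = int(np.argmax(np.abs(beta)))
w_star = np.zeros(6)
w_star[j] = np.sign(beta[j])
```

By Hölder's inequality every sampled value is bounded by the $L_\infty$ norm, and `w_star` attains the bound exactly.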
2.3.3 Hybrid Penalties
There are no reasons for using pure penalties in isolation. We can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie 2005), designed to improve the Lasso penalization when $n \le p$. As recalled in Section 2.3.2, when $n \le p$ the Lasso penalty can select at most $n$ non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of $L_1$ and $L_2$ penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is
$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \tag{2.9}$$
The term in $\lambda_1$ is a Lasso penalty that induces sparsity in the vector $\beta$; on the other side, the term in $\lambda_2$ is a ridge regression penalty that provides universal strong consistency (De Mol et al. 2009), that is, the asymptotic capability (as $n$ goes to infinity) of always making the right choice of relevant variables.
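A quick numerical sketch of the Elastic net criterion (2.9), here in scikit-learn's parametrization, which merges $\lambda_1$ and $\lambda_2$ into `alpha` and `l1_ratio`. The data and penalty levels are illustrative:

```python
# Elastic net sketch: L1 + L2 penalties combined, as in eq. (2.9).
# The fitted coefficients must achieve a lower penalized objective
# than the zero vector, which is always a feasible candidate.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 100))                  # n < p regime
y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=30)

alpha, l1_ratio = 0.1, 0.5
model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                   fit_intercept=False, max_iter=50000).fit(X, y)

def objective(beta):
    # scikit-learn's form of the elastic-net criterion
    return (np.sum((y - X @ beta) ** 2) / (2 * len(y))
            + alpha * l1_ratio * np.sum(np.abs(beta))
            + 0.5 * alpha * (1 - l1_ratio) * np.sum(beta ** 2))
```

The ridge part keeps groups of correlated variables in the model together, while the Lasso part still zeroes out clearly irrelevant ones.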
2.3.4 Mixed Penalties
Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by $L$ different groups of genes. Let us denote by $G_\ell$ the group of genes for the $\ell$-th process and by $d_\ell$ the number of genes (variables) in each group, $\forall \ell \in \{1, \ldots, L\}$. Thus, the dimension of the vector $\beta$ is the sum of the number of genes of every group: $\dim(\beta) = \sum_{\ell=1}^{L} d_\ell$. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

$$\|\beta\|_{(r,s)} = \left( \sum_{\ell} \left( \sum_{j \in G_\ell} |\beta_j|^s \right)^{\frac{r}{s}} \right)^{\frac{1}{r}} \tag{2.10}$$
The pair $(r, s)$ identifies the norms that are combined: an $L_s$ norm within groups and an $L_r$ norm between groups. The $L_s$ norm penalizes the variables in every group $G_\ell$, while the $L_r$ norm penalizes the within-group norms. The pair $(r, s)$ is set so as to induce different properties in the resulting $\beta$ vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
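Equation (2.10) translates directly into code. The `mixed_norm` helper below is an illustrative sketch, with the group-Lasso norm as a worked example:

```python
# Direct implementation of the mixed norm ||beta||_(r,s) of eq. (2.10):
# an L_s norm within each group, then an L_r norm across the group norms.
import numpy as np

def mixed_norm(beta, groups, r, s):
    group_norms = [np.sum(np.abs(beta[g]) ** s) ** (1.0 / s) for g in groups]
    return float(np.sum(np.array(group_norms) ** r) ** (1.0 / r))

beta = np.array([3.0, 4.0, 0.0, 5.0, 12.0])
groups = [np.array([0, 1]), np.array([2, 3, 4])]

# Group-Lasso norm (r=1, s=2): sum of the Euclidean norms of the groups,
# here ||(3,4)||_2 + ||(0,5,12)||_2 = 5 + 13 = 18
gl = mixed_norm(beta, groups, r=1, s=2)
```

With a single group covering all variables and r = s = 1, the same helper reduces to the plain $L_1$ norm, as expected from the definition.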
Several combinations are available; the most popular is the norm $\|\beta\|_{(1,2)}$, known as group-Lasso (Yuan and Lin 2006, Leng 2008, Xie et al. 2008a,b, Meier et al. 2008, Roth and Fischer 2008, Yang et al. 2010, Sanchez Merchante et al. 2012). Figure 2.5 shows the difference between the admissible sets of a pure $L_1$ norm and a mixed $L_{1,2}$ norm. Many other mixings are possible, such as $\|\beta\|_{(1,4/3)}$ (Szafranski et al. 2008) or $\|\beta\|_{(1,\infty)}$ (Wang and Zhu 2008, Kuan et al. 2010, Vogt and Roth 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al. 2009), the composite absolute penalties (Zhao et al. 2009), or combinations of mixed and pure norms, such as Lasso and group-Lasso (Friedman et al. 2010, Sprechmann et al. 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh 2011).
2.3.5 Sparsity Considerations
In this chapter I have reviewed several possibilities to induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to models that are parsimonious featurewise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.
The Lasso and the other $L_1$ penalties encourage solutions such as the one on the left of Figure 2.6. If the objective is sparsity, then the $L_1$ norm does the job. However, if we aim at feature selection, and if the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.
To be able to dismiss some features the sparsity pattern must encourage null valuesfor the same variable across parameters as shown in the right of Figure 26 This can beachieved with mixed penalties that define groups of features For example L12 or L1infinmixed norms with the proper definition of groups can induce sparsity patterns such as
19
2 Regularization for Feature Selection
(a) L1, Lasso. (b) L(1,2), group-Lasso.
Figure 2.5: Admissible sets for the Lasso and group-Lasso.
(a) L1-induced sparsity. (b) L(1,2) group-induced sparsity.
Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters.
the one on the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.
2.3.6 Optimization Tools for Regularized Problems
Caramanis et al. (2012) provide a good collection of mathematical techniques and optimization methods for solving regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified into four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.
In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It can be described as an "active constraints" algorithm implemented along a regularization path, where the cost function is approached with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.
Subgradient Descent. Subgradient descent is a generic optimization method that can be used in penalized settings where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure; on the other hand, many iterations are needed, so convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β(t+1) is updated proportionally to the negative subgradient of the function at the current point β(t):
$$\beta^{(t+1)} = \beta^{(t)} - \alpha\,(s + \lambda s')\,,\qquad s \in \partial J(\beta^{(t)}),\ s' \in \partial P(\beta^{(t)})\,.$$
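To illustrate, the update above can be instantiated for a Lasso-type problem, with J(β) = ‖y − Xβ‖² and P(β) = ‖β‖₁. The following numpy sketch is only illustrative (the function name, step-size choice and synthetic usage are ours):

```python
import numpy as np

def subgradient_lasso(X, y, lam, n_iter=5000):
    """Subgradient descent for J(beta) + lam * P(beta), with
    J(beta) = ||y - X beta||^2 and P(beta) = ||beta||_1 (illustrative sketch)."""
    # fixed step 1/L, with L the Lipschitz constant of the gradient of J
    L = 2 * np.linalg.eigvalsh(X.T @ X).max()
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        s = 2 * X.T @ (X @ beta - y)   # s in dJ(beta): gradient of the smooth loss
        s_prime = np.sign(beta)        # s' in dP(beta): a subgradient of the L1 norm
        beta = beta - (s + lam * s_prime) / L
    return beta
```

As stated above, the iterates approach the optimum slowly and do not become exactly sparse, which is the main weakness of this generic scheme.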
Coordinate Descent. Coordinate descent is based on the first-order optimality conditions of criterion (2.1). In the case of penalties like the Lasso, setting to zero the first-order derivative with respect to coefficient βj gives
$$\beta_j = \frac{-\lambda\,\operatorname{sign}(\beta_j) - \partial J(\beta)/\partial\beta_j}{2\sum_{i=1}^{n} x_{ij}^2}\,.$$
In the literature, those algorithms are also referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution βls and updating the values with an iterative thresholding algorithm where
$$\beta_j^{(t+1)} = S_\lambda\!\left(\frac{\partial J(\beta^{(t)})}{\partial\beta_j}\right).$$
The objective function is optimized with respect to one variable at a time, while all others are kept fixed:
$$S_\lambda\!\left(\frac{\partial J(\beta)}{\partial\beta_j}\right) =
\begin{cases}
\dfrac{\lambda - \partial J(\beta)/\partial\beta_j}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if }\ \partial J(\beta)/\partial\beta_j > \lambda\,,\\[2ex]
\dfrac{-\lambda - \partial J(\beta)/\partial\beta_j}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if }\ \partial J(\beta)/\partial\beta_j < -\lambda\,,\\[2ex]
0 & \text{if }\ \left|\partial J(\beta)/\partial\beta_j\right| \le \lambda\,.
\end{cases} \tag{2.11}$$
The same principles define "block-coordinate descent" algorithms; in this case, the first-order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).
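The soft-thresholding update (2.11) can be sketched as a cyclic coordinate descent for the Lasso. The numpy snippet below is an illustrative implementation (names and synthetic usage are ours, not one of the reviewed algorithms):

```python
import numpy as np

def soft_threshold_update(g, lam, denom):
    """One coordinate update, as in eq. (2.11): S_lambda applied to
    g = dJ/dbeta_j, evaluated with beta_j set to zero."""
    if g > lam:
        return (lam - g) / denom
    if g < -lam:
        return (-lam - g) / denom
    return 0.0

def coordinate_descent_lasso(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for ||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    denom = 2 * (X ** 2).sum(axis=0)               # 2 * sum_i x_ij^2
    for _ in range(n_sweeps):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual, excluding j
            g = -2 * X[:, j] @ r                   # dJ/dbeta_j at beta_j = 0
            beta[j] = soft_threshold_update(g, lam, denom[j])
    return beta
```

Unlike subgradient descent, each update can set a coefficient exactly to zero, so the iterates are sparse.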
Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", which stores the indices of the variables with non-zero βj and is usually denoted A. The complement of the active set is the "inactive set", denoted Ā, which contains the indices of the variables whose βj is zero. Thus, the problem can be simplified to the dimensionality of A.
Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There also exists a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy, which starts with an empty A, has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are intended to be selected.
Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the descent direction of the objective function, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions; their expressions are essential for selecting the next variable to add to the active set, and for testing whether a particular vector β is a solution of Problem (2.1).
These active constraints or working set methods, even if they were originally proposed to solve L1-regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and L1 penalties (Roth, 2004), linear functions and L1,2 penalties (Roth and Fischer, 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al., 2003). The algorithm developed in this work belongs to this family of solutions.
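The three tasks described above can be sketched for the Lasso case. The following numpy skeleton (all names are ours; the inner solver is a plain coordinate descent, not the linear approximation of Osborne et al.) grows the active set forward from an empty A:

```python
import numpy as np

def lasso_active_set(X, y, lam, tol=1e-6, max_outer=50):
    """Working-set sketch for min ||y - X beta||^2 + lam * ||beta||_1.
    A stores the indices of nonzero coefficients; at each outer step the
    problem is solved on A only, then the inactive variable that most
    violates the optimality condition |dJ/dbeta_j| <= lam is added."""
    n, p = X.shape
    beta = np.zeros(p)
    A = []                                    # active set, starts empty
    for _ in range(max_outer):
        # optimality-condition task: gradient of J at the current beta
        g = -2 * X.T @ (y - X @ beta)
        violation = np.abs(g) - lam
        violation[A] = -np.inf                # only inspect inactive variables
        j_star = int(np.argmax(violation))
        if violation[j_star] <= tol:
            return beta                       # optimality conditions hold
        A.append(j_star)                      # working-set update task
        # optimization task: coordinate descent restricted to A (warm start)
        for _ in range(200):
            for j in A:
                r = y - X @ beta + X[:, j] * beta[j]
                g_j = -2 * X[:, j] @ r
                d = 2 * (X[:, j] ** 2).sum()
                if g_j > lam:
                    beta[j] = (lam - g_j) / d
                elif g_j < -lam:
                    beta[j] = (-lam - g_j) / d
                else:
                    beta[j] = 0.0
        A = [j for j in A if beta[j] != 0.0]  # backward step: drop zeroed variables
    return beta
```

The early subproblems involve very few variables, which is the computational advantage of the forward philosophy mentioned above.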
Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built from several secant hyper-planes, obtained from the subgradient of the cost function at different points.
This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.
This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims, 2006; Smola et al., 2008; Franc and Sonnenburg, 2008) or Multiple Kernel Learning (Sonnenburg et al., 2006).
Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1), where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).
This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the Least Angle Regression (LARS) algorithm of Efron et al. (2004) that those techniques became popular. LARS defines the regularization path using active constraint techniques.
Once an active set A(t) and its corresponding solution β(t) have been set, following the regularization path means looking for a direction h and a step size γ to update the solution as β(t+1) = β(t) + γh. Afterwards, the active and inactive sets A(t+1) and Ā(t+1) are updated; this can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size, and decides which variable should enter the active set, from the correlation with the residuals.
Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz-differentiable cost function J(β) and a non-differentiable penalty λP(β):
$$\min_{\beta\in\mathbb{R}^p}\; J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top(\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2}\left\|\beta - \beta^{(t)}\right\|_2^2\,. \tag{2.12}$$
They are also iterative methods, where the cost function J(β) is linearized in the proximity of the current solution β(t), so that the problem to solve at each iteration takes the form (2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. This can be rewritten as
$$\min_{\beta\in\mathbb{R}^p}\; \frac{1}{2}\left\|\beta - \left(\beta^{(t)} - \frac{1}{L}\nabla J(\beta^{(t)})\right)\right\|_2^2 + \frac{\lambda}{L}\,P(\beta)\,. \tag{2.13}$$
The basic algorithm uses the solution of (2.13) as the next iterate β(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle, 2009). Proximal methods can be seen as generalizations of gradient updates: in fact, setting λ = 0 in equation (2.13) recovers the standard gradient update rule.
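As an illustration, the basic proximal iteration can be sketched for the Lasso, where the solution of (2.13) is given by soft-thresholding (the proximal operator of the L1 norm). The snippet below is an illustrative sketch with names of our own:

```python
import numpy as np

def ista_lasso(X, y, lam, n_iter=500):
    """Basic proximal-gradient (ISTA-like) sketch for
    min ||y - X beta||^2 + lam * ||beta||_1, following eq. (2.13):
    a gradient step on J, then the prox of (lam/L) * ||.||_1."""
    # L: upper bound on the Lipschitz constant of grad J
    L = 2 * np.linalg.eigvalsh(X.T @ X).max()
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ beta - y)      # gradient of J at beta(t)
        z = beta - grad / L                  # gradient step
        # soft-thresholding: proximal operator of the scaled L1 penalty
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return beta
```

With λ = 0 the thresholding step disappears and the iteration reduces to plain gradient descent, as noted above.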
Part II
Sparse Linear Discriminant Analysis
Abstract
Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.
There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Section 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models with respect to variables.
In this part, we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach of LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis, and also enables the derivation of variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of the data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification. The first two or three directions can also be used to project the data, so as to generate a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performances. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.
3 Feature Selection in Fisher Discriminant Analysis
3.1 Fisher Discriminant Analysis
Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features that characterizes or separates the classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of the data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part, we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities, but rather on some inertia principles (Fisher, 1936).
We consider that the data consist of a set of n examples, with observations xi ∈ Rp comprising p features, and labels yi ∈ {0, 1}K indicating the exclusive assignment of observation xi to one of the K classes. It will be convenient to gather the observations in the n×p matrix X = (x1, . . . , xn)⊤ and the corresponding labels in the n×K matrix Y = (y1, . . . , yn)⊤.
Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:
$$\max_{\beta\in\mathbb{R}^p}\; \frac{\beta^\top\Sigma_B\,\beta}{\beta^\top\Sigma_W\,\beta}\,, \tag{3.1}$$
where β is the discriminant direction used to project the data, and ΣB and ΣW are the p×p between-class and within-class covariance matrices, respectively, defined (for a K-class problem) as
$$\Sigma_W = \frac{1}{n}\sum_{k=1}^{K}\sum_{i\in G_k}\left(x_i - \mu_k\right)\left(x_i - \mu_k\right)^\top\,,\qquad
\Sigma_B = \frac{1}{n}\sum_{k=1}^{K}\sum_{i\in G_k}\left(\mu - \mu_k\right)\left(\mu - \mu_k\right)^\top\,,$$
where μ is the sample mean of the whole dataset, μk is the sample mean of class k, and Gk indexes the observations of class k.
This analysis can be extended to the multi-class framework with K groups; in this case, K − 1 discriminant vectors βk may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:
$$\max_{B\in\mathbb{R}^{p\times(K-1)}}\; \frac{\operatorname{tr}\!\left(B^\top\Sigma_B\,B\right)}{\operatorname{tr}\!\left(B^\top\Sigma_W\,B\right)}\,, \tag{3.2}$$
where the matrix B is built with the discriminant directions βk as columns.
Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K − 1 subproblems:
$$\begin{aligned}
\max_{\beta_k\in\mathbb{R}^p}\;\; & \beta_k^\top\Sigma_B\,\beta_k\\
\text{s.t.}\;\; & \beta_k^\top\Sigma_W\,\beta_k \le 1\\
& \beta_k^\top\Sigma_W\,\beta_\ell = 0\,,\quad \forall\,\ell < k\,.
\end{aligned}\tag{3.3}$$
The maximizer of subproblem k is the eigenvector of ΣW⁻¹ΣB associated with the kth largest eigenvalue (see Appendix C).
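As an illustration, the discriminant directions can be computed numerically from this eigendecomposition. The following numpy sketch (names and synthetic usage of our own) builds ΣW and ΣB from their definitions and returns the K − 1 leading eigenvectors of ΣW⁻¹ΣB:

```python
import numpy as np

def fisher_directions(X, y, K):
    """Multi-class Fisher LDA sketch: the discriminant directions are the
    leading eigenvectors of SigmaW^{-1} SigmaB (at most K-1 of them)."""
    n, p = X.shape
    mu = X.mean(axis=0)                      # sample mean of the whole dataset
    Sw = np.zeros((p, p))
    Sb = np.zeros((p, p))
    for k in range(K):
        Xk = X[y == k]                       # observations of class k (the set G_k)
        mk = Xk.mean(axis=0)                 # sample mean of class k
        Sw += (Xk - mk).T @ (Xk - mk)
        Sb += len(Xk) * np.outer(mk - mu, mk - mu)
    Sw /= n
    Sb /= n
    # eigenvectors of Sw^{-1} Sb, sorted by decreasing eigenvalue
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(vals.real)[::-1]
    return vecs.real[:, order[:K - 1]]
```

This assumes ΣW is invertible (n large enough), which is precisely what fails in the high-dimensional settings discussed later.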
3.2 Feature Selection in LDA Problems
LDA is often used as a data reduction technique, where the K − 1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.
Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. The main target of this sparsity is to reduce the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness in the solution, or computational constraints.
The easiest approach to sparse LDA performs variable selection before discrimination. The relevancy of each feature is usually based on univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to endow LDA with wrapper and embedded feature selection capabilities.
They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, that is, either Fisher's discriminant analysis (variance-based) or regression-based formulations.
3.2.1 Inertia Based
The Fisher discriminant seeks a projection maximizing the separability of classes based on inertia principles: mass centers should be far away from each other (large between-class variance), and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.
Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on Fisher's discriminant (3.1), reformulated as a quadratically constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search, with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al., 2007).
Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where Fisher's discriminant (3.1) is solved as
$$\begin{aligned}
\min_{\beta\in\mathbb{R}^p}\;\; & \beta^\top\Sigma_W\,\beta\\
\text{s.t.}\;\; & (\mu_1 - \mu_2)^\top\beta = 1\\
& \textstyle\sum_{j=1}^{p}|\beta_j| \le t\,,
\end{aligned}$$
where μ1 and μ2 are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.
Witten and Tibshirani (2011) describe a multi-class technique using Fisher's discriminant, rewritten in the form of K − 1 constrained and penalized maximization problems:
$$\begin{aligned}
\max_{\beta_k\in\mathbb{R}^p}\;\; & \beta_k^\top\Sigma_B^k\,\beta_k - P_k(\beta_k)\\
\text{s.t.}\;\; & \beta_k^\top\Sigma_W\,\beta_k \le 1\,.
\end{aligned}$$
The term to maximize is the projected between-class covariance βk⊤ΣBβk, subject to an upper bound on the projected within-class covariance βk⊤ΣWβk. The penalty Pk(βk) is added to avoid singularities and to induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general-purpose data: the Lasso shrinks less informative variables to zero, and the fused Lasso encourages a piecewise constant βk vector. The R code is available from the website of Daniela Witten.
Cai and Liu (2011) use Fisher's discriminant to solve a binary LDA problem, but instead of performing separate estimations of ΣW and (μ1 − μ2) to obtain the optimal solution β = ΣW⁻¹(μ1 − μ2), they estimate the product directly, through a constrained L1 minimization:
$$\begin{aligned}
\min_{\beta\in\mathbb{R}^p}\;\; & \|\beta\|_1\\
\text{s.t.}\;\; & \left\|\hat\Sigma\,\beta - (\hat\mu_1 - \hat\mu_2)\right\|_\infty \le \lambda\,.
\end{aligned}$$
Sparsity is encouraged by the L1 norm of vector β, and the parameter λ is used to tune the optimization.
Most of the reviewed algorithms are conceived for binary classification, and for those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity. However, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.
3.2.2 Regression Based
In binary classification, LDA has been known to be equivalent to a linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that a multivariate linear regression on a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging in the multi-class case (Duda et al., 2000; Friedman et al., 2009).
Predefined Indicator Matrix
Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al., 2009). An indicator matrix Y is an n×K matrix storing the class labels of all samples. Several well-known types exist in the literature. For example, the binary or dummy indicator (yik = 1 if sample i belongs to class k, and yik = 0 otherwise) is commonly used for linking multi-class classification with linear regression (Friedman et al., 2009). Another "popular" choice is yik = 1 if sample i belongs to class k and yik = −1/(K − 1) otherwise; it was used, for example, for extending Support Vector Machines to multi-class classification (Lee et al., 2004), or for generalizing the kernel target alignment measure (Guermeur et al., 2004).
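For illustration, the two indicator codings mentioned above can be built as follows (a short numpy sketch; the function names are ours):

```python
import numpy as np

def dummy_indicator(y, K):
    """Binary ('dummy') class indicator matrix: Y[i, k] = 1 iff sample i
    belongs to class k, and 0 otherwise."""
    Y = np.zeros((len(y), K))
    Y[np.arange(len(y)), y] = 1.0
    return Y

def symmetric_indicator(y, K):
    """Alternative coding: +1 for the sample's class, -1/(K-1) elsewhere,
    as used e.g. for multi-class SVM extensions."""
    Y = np.full((len(y), K), -1.0 / (K - 1))
    Y[np.arange(len(y)), y] = 1.0
    return Y
```

Each row of the dummy coding sums to one, while each row of the symmetric coding sums to zero.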
Some works propose a formulation of the least squares problem based on a new class indicator matrix (Ye, 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition, shown empirically to hold in many applications involving high-dimensional data.
Qiao et al. (2009) propose a discriminant analysis for the high-dimensional, low-sample-size setting, which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to optimal scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required; the lack of publicly available code also restrained an empirical test of this conjecture. If the similitude is confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.
In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is
obtained by solving
$$\min_{\beta\in\mathbb{R}^p,\,\beta_0\in\mathbb{R}}\; n^{-1}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^\top\beta\right)^2 + \lambda\sum_{j=1}^{p}|\beta_j|\,,$$
where yi is the binary indicator of the label for pattern xi. Even if the authors focus on the Lasso penalty, they also suggest that any other generic sparsity-inducing penalty could be used. The decision rule x⊤β + β0 > 0 is the LDA classifier when it is built using the resulting β vector for λ = 0, but a different intercept β0 is required.
Optimal Scoring
In binary classification, the regression of (scaled) class indicators enables the exact recovery of the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al., 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification, and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani, 1996) or more conservative ones (Hastie et al., 1995).
As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as
$$\min_{\Theta,\,B}\; \left\|Y\Theta - XB\right\|_F^2 + \lambda\operatorname{tr}\!\left(B^\top\Omega\,B\right) \tag{3.4a}$$
$$\text{s.t.}\quad n^{-1}\,\Theta^\top Y^\top Y\,\Theta = I_{K-1}\,, \tag{3.4b}$$
where Θ ∈ RK×(K−1) holds the class scores, B ∈ Rp×(K−1) the regression coefficients, and ‖·‖F is the Frobenius norm. This compact form does not render the ordering that arises naturally when considering the following series of K − 1 problems:
$$\min_{\theta_k\in\mathbb{R}^K,\,\beta_k\in\mathbb{R}^p}\; \left\|Y\theta_k - X\beta_k\right\|^2 + \beta_k^\top\Omega\,\beta_k \tag{3.5a}$$
$$\text{s.t.}\quad n^{-1}\,\theta_k^\top Y^\top Y\,\theta_k = 1 \tag{3.5b}$$
$$\theta_k^\top Y^\top Y\,\theta_\ell = 0\,,\quad \ell = 1,\dots,k-1\,, \tag{3.5c}$$
where each βk corresponds to a discriminant direction.
Several sparse LDA formulations have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan, 2005; Leng, 2008; Grosenick et al., 2008; Clemmensen et al., 2011). Grosenick et al. (2008) proposed a variant of the Lasso-based penalized OS of Ghosh and Chinnaiyan (2005), introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by
$$\min_{\beta_k\in\mathbb{R}^p,\,\theta_k\in\mathbb{R}^K}\; \sum_k \left\|Y\theta_k - X\beta_k\right\|_2^2 + \lambda_1\left\|\beta_k\right\|_1 + \lambda_2\,\beta_k^\top\Omega\,\beta_k\,,$$
where λ1 and λ2 are regularization parameters, and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.
Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):
$$\min_{\beta_k\in\mathbb{R}^p,\,\theta_k\in\mathbb{R}^K}\; \sum_{k=1}^{K-1}\left\|Y\theta_k - X\beta_k\right\|_2^2 + \lambda\left(\sum_{j=1}^{p}\sqrt{\sum_{k=1}^{K-1}\beta_{kj}^2}\,\right)^{\!2}\,, \tag{3.6}$$
which is the criterion that was chosen in this thesis.
The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm closely followed the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA and propose a publicly available, efficient code for solving this problem.
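For reference, the value of criterion (3.6) for a given candidate (Θ, B) can be computed directly; the group-Lasso term is a feature-wise norm, taken over the rows of B. The following numpy sketch (names of our own) evaluates it:

```python
import numpy as np

def gloss_objective(Y, Theta, X, B, lam):
    """Value of criterion (3.6): the squared OS residuals plus lam times
    the squared sum of the row-wise (per-feature) L2 norms of B.
    Columns of Theta/B index the K-1 discriminant problems."""
    residual = np.sum((Y @ Theta - X @ B) ** 2)
    group_norms = np.sqrt((B ** 2).sum(axis=1))   # one norm per feature j
    return residual + lam * group_norms.sum() ** 2
```

Because the penalty couples all K − 1 coefficients of a feature, a feature is either retained in every discriminant direction or dropped from all of them, which is the sparsity pattern targeted in Section 2.3.5.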
4 Formalizing the Objective
In this chapter, we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis, and also enables the derivation of variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004).
The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), which selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of the data. For K classes, this representation can be either complete, in dimension K − 1, or partial, for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.
The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here, since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka, 1984; Hastie et al., 1994; Hastie and Tibshirani, 1996; Hastie et al., 1995) and have already been used for sparsity-inducing penalties (Roth and Lange, 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by group-Lasso and penalized LDA (Sanchez Merchante et al., 2012).
4.1 From Optimal Scoring to Linear Discriminant Analysis
Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) and in penalized LDA (p-LDA), by going through canonical correlation analysis. We first provide some properties of the solutions of an arbitrary problem in the p-OS series (3.5).
Throughout this chapter, we assume that:
• there is no empty class, that is, the diagonal matrix Y⊤Y is full rank;
• inputs are centered, that is, X⊤1n = 0;
• the quadratic penalty Ω is positive semidefinite and such that X⊤X + Ω is full rank.
4.1.1 Penalized Optimal Scoring Problem
For the sake of simplicity, we now drop the subscript k to refer to any problem in the p-OS series (3.5). First, note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value and vice-versa. The problems are, however, non-convex: in particular, if (θ, β) is a solution, then (−θ, −β) is also a solution.
The orthogonality constraints (3.5c) inherently limit the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K − 1 first optimal scores are orthogonal to 1 (and the Kth problem would be solved by βK = 0). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we do not mention these orthogonality constraints (3.5c) anymore, although they apply along the route, so as to simplify all expressions. The generic problem solved is thus
$$\min_{\theta\in\mathbb{R}^K,\,\beta\in\mathbb{R}^p}\; \left\|Y\theta - X\beta\right\|^2 + \beta^\top\Omega\,\beta \tag{4.1a}$$
$$\text{s.t.}\quad n^{-1}\,\theta^\top Y^\top Y\,\theta = 1\,. \tag{4.1b}$$
For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator
$$\beta_{\text{OS}} = \left(X^\top X + \Omega\right)^{-1} X^\top Y\,\theta\,. \tag{4.2}$$
The objective function (4.1a) is then
$$\begin{aligned}
\left\|Y\theta - X\beta_{\text{OS}}\right\|^2 + \beta_{\text{OS}}^\top\Omega\,\beta_{\text{OS}}
&= \theta^\top Y^\top Y\theta - 2\,\theta^\top Y^\top X\beta_{\text{OS}} + \beta_{\text{OS}}^\top\left(X^\top X + \Omega\right)\beta_{\text{OS}}\\
&= \theta^\top Y^\top Y\theta - \theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta\,,
\end{aligned}$$
where the second line stems from the definition (4.2) of βOS. Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to
$$\max_{\theta:\; n^{-1}\theta^\top Y^\top Y\theta = 1}\; \theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\,\theta\,, \tag{4.3}$$
which shows that the optimization of the p-OS problem with respect to θk boils down to finding the kth largest eigenvector of Y⊤X(X⊤X + Ω)⁻¹X⊤Y. Indeed, Appendix C details that Problem (4.3) is solved by
details that Problem (43) is solved by
(YgtY)minus1YgtX(XgtX + Ω
)minus1XgtYθ = α2θ (44)
where α² is the maximal eigenvalue.¹ Left-multiplying (4.4) by n⁻¹θ⊤Y⊤Y and using constraint (4.1b) yields
$$n^{-1}\,\theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta = \alpha^2\, n^{-1}\,\theta^\top\!\left(Y^\top Y\right)\theta$$
$$n^{-1}\,\theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta = \alpha^2\,. \tag{4.5}$$
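The chain (4.2), (4.4) and (4.5) can be checked numerically. The sketch below (names of our own, synthetic usage) computes the leading score vector θ from the eigenproblem (4.4), normalizes it to satisfy (4.1b), and returns βOS and α²:

```python
import numpy as np

def pos_leading_score(X, Y, Omega):
    """p-OS sketch: theta solves the eigenproblem (4.4); beta_OS is the
    penalized least-squares estimator (4.2); alpha2 is the eigenvalue."""
    n = X.shape[0]
    G = np.linalg.solve(X.T @ X + Omega, X.T @ Y)   # (X'X + Om)^{-1} X'Y
    M = np.linalg.solve(Y.T @ Y, Y.T @ X @ G)       # matrix of eq. (4.4)
    vals, vecs = np.linalg.eig(M)
    i = int(np.argmax(vals.real))
    theta = vecs.real[:, i]
    # rescale theta to satisfy the constraint (4.1b)
    theta = theta / np.sqrt(theta @ Y.T @ Y @ theta / n)
    beta_os = G @ theta                             # eq. (4.2)
    return theta, beta_os, vals.real[i]
```

The returned α² is the identity (4.5) evaluated at the optimum, and it lies in [0, 1), as footnote 1 suggests.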
4.1.2 Penalized Canonical Correlation Analysis
As per Hastie et al. (1995), the penalized canonical correlation analysis (p-CCA) problem between variables X and Y is defined as follows:
$$\max_{\theta\in\mathbb{R}^K,\,\beta\in\mathbb{R}^p}\; n^{-1}\,\theta^\top Y^\top X\beta \tag{4.6a}$$
$$\text{s.t.}\quad n^{-1}\,\theta^\top Y^\top Y\,\theta = 1 \tag{4.6b}$$
$$n^{-1}\,\beta^\top\!\left(X^\top X + \Omega\right)\beta = 1\,. \tag{4.6c}$$
The solutions to (4.6) are obtained by finding the saddle points of the Lagrangian:
$$nL(\beta,\theta,\nu,\gamma) = \theta^\top Y^\top X\beta - \nu\left(\theta^\top Y^\top Y\theta - n\right) - \gamma\left(\beta^\top(X^\top X + \Omega)\beta - n\right)$$
$$\Rightarrow\quad n\,\frac{\partial L(\beta,\theta,\gamma,\nu)}{\partial\beta} = X^\top Y\theta - 2\gamma\,(X^\top X + \Omega)\,\beta$$
$$\Rightarrow\quad \beta_{\text{CCA}} = \frac{1}{2\gamma}\,(X^\top X + \Omega)^{-1} X^\top Y\theta\,.$$
Then, as βCCA obeys (4.6c), we obtain
$$\beta_{\text{CCA}} = \frac{(X^\top X + \Omega)^{-1} X^\top Y\theta}{\sqrt{n^{-1}\,\theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y\theta}}\,, \tag{4.7}$$
so that the optimal objective function (4.6a) can be expressed with θ alone:
$$n^{-1}\theta^\top Y^\top X\beta_{\text{CCA}} = \frac{n^{-1}\theta^\top Y^\top X(X^\top X + \Omega)^{-1}X^\top Y\theta}{\sqrt{n^{-1}\theta^\top Y^\top X(X^\top X + \Omega)^{-1}X^\top Y\theta}} = \sqrt{n^{-1}\theta^\top Y^\top X(X^\top X + \Omega)^{-1}X^\top Y\theta}\,,$$
and the optimization problem with respect to θ can be restated as
$$\max_{\theta:\; n^{-1}\theta^\top Y^\top Y\theta = 1}\; \theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\,\theta\,. \tag{4.8}$$
Hence, the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):
$$\beta_{\text{OS}} = \alpha\,\beta_{\text{CCA}}\,, \tag{4.9}$$
¹The awkward notation α² for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5), for example).
where α is defined by (4.5). The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:
$$n\,\frac{\partial L(\beta,\theta,\gamma,\nu)}{\partial\theta} = Y^\top X\beta - 2\nu\,Y^\top Y\theta$$
$$\Rightarrow\quad \theta_{\text{CCA}} = \frac{1}{2\nu}\,(Y^\top Y)^{-1} Y^\top X\beta\,. \tag{4.10}$$
Then, as θCCA obeys (4.6b), we obtain
$$\theta_{\text{CCA}} = \frac{(Y^\top Y)^{-1} Y^\top X\beta}{\sqrt{n^{-1}\,\beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X\beta}}\,, \tag{4.11}$$
leading to the following expression of the optimal objective function:
$$n^{-1}\theta_{\text{CCA}}^\top Y^\top X\beta = \frac{n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1}Y^\top X\beta}{\sqrt{n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1}Y^\top X\beta}} = \sqrt{n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1}Y^\top X\beta}\,.$$
The p-CCA problem can thus be solved with respect to β by plugging this value into (4.6):
$$\max_{\beta\in\mathbb{R}^p}\; n^{-1}\,\beta^\top X^\top Y\,(Y^\top Y)^{-1}\,Y^\top X\beta \tag{4.12a}$$
$$\text{s.t.}\quad n^{-1}\,\beta^\top\!\left(X^\top X + \Omega\right)\beta = 1\,, \tag{4.12b}$$
where the positive objective function has been squared compared to (4.6). This formulation is important, since it will be used to link p-CCA to p-LDA. We thus derive its solution: following the reasoning of Appendix C, βCCA verifies
$$n^{-1}\,X^\top Y\,(Y^\top Y)^{-1}\,Y^\top X\,\beta_{\text{CCA}} = \lambda\left(X^\top X + \Omega\right)\beta_{\text{CCA}}\,, \tag{4.13}$$
where λ is the maximal eigenvalue, which is shown below to be equal to α²:
$$\begin{aligned}
& n^{-1}\,\beta_{\text{CCA}}^\top X^\top Y(Y^\top Y)^{-1}Y^\top X\beta_{\text{CCA}} = \lambda\\
\Rightarrow\;& n^{-1}\alpha^{-1}\,\beta_{\text{CCA}}^\top X^\top Y(Y^\top Y)^{-1}Y^\top X(X^\top X + \Omega)^{-1}X^\top Y\theta = \lambda\\
\Rightarrow\;& n^{-1}\alpha\,\beta_{\text{CCA}}^\top X^\top Y\theta = \lambda\\
\Rightarrow\;& n^{-1}\,\theta^\top Y^\top X(X^\top X + \Omega)^{-1}X^\top Y\theta = \lambda\\
\Rightarrow\;& \alpha^2 = \lambda\,.
\end{aligned}$$
The first line is obtained from constraint (4.12b); the second line follows from relationship (4.7), whose denominator is α; the third line comes from (4.4); the fourth line uses relationship (4.7) again; and the last one uses the definition (4.5) of α.
4.1.3 Penalized Linear Discriminant Analysis
Still following Hastie et al. (1995), penalized linear discriminant analysis is defined as follows:
$$\max_{\beta\in\mathbb{R}^p}\; \beta^\top\Sigma_B\,\beta \tag{4.14a}$$
$$\text{s.t.}\quad \beta^\top\!\left(\Sigma_W + n^{-1}\Omega\right)\beta = 1\,, \tag{4.14b}$$
where ΣB and ΣW are, respectively, the sample between-class and within-class covariance matrices of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.
As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form, amenable to a simple matrix representation using the projection operator Y(Y⊤Y)⁻¹Y⊤:

    Σ_T = (1/n) Σ_{i=1}^n x_i x_i⊤ = n⁻¹ X⊤X
    Σ_B = (1/n) Σ_{k=1}^K n_k μ_k μ_k⊤ = n⁻¹ X⊤Y(Y⊤Y)⁻¹Y⊤X
    Σ_W = (1/n) Σ_{k=1}^K Σ_{i: y_ik=1} (x_i − μ_k)(x_i − μ_k)⊤ = n⁻¹ ( X⊤X − X⊤Y(Y⊤Y)⁻¹Y⊤X ) .
Using these formulae, the solution to the p-LDA problem (4.14) is obtained as:

    X⊤Y(Y⊤Y)⁻¹Y⊤X β_lda = λ ( X⊤X + Ω − X⊤Y(Y⊤Y)⁻¹Y⊤X ) β_lda
    X⊤Y(Y⊤Y)⁻¹Y⊤X β_lda = λ/(1−λ) ( X⊤X + Ω ) β_lda .
The comparison of the last equation with β_cca (4.13) shows that β_lda and β_cca are proportional, and that λ/(1−λ) = α². Using constraints (4.12b) and (4.14b), it comes that

    β_lda = (1 − α²)^(−1/2) β_cca = α⁻¹ (1 − α²)^(−1/2) β_os ,

which ends the path from p-OS to p-LDA.
4.1.4 Summary
The three previous subsections considered a generic form of the kth problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:

    min_{Θ, B}  ‖YΘ − XB‖²_F + λ tr(B⊤ΩB)
    s.t.  n⁻¹ Θ⊤Y⊤YΘ = I_{K−1} .

Let A represent the (K−1)×(K−1) diagonal matrix with elements α_k, the square-roots of the K−1 largest eigenvalues of n⁻¹Y⊤X(X⊤X + Ω)⁻¹X⊤Y; we have:

    B_LDA = B_CCA (I_{K−1} − A²)^(−1/2) = B_OS A⁻¹ (I_{K−1} − A²)^(−1/2) ,   (4.15)
where I_{K−1} is the (K−1)×(K−1) identity matrix.

At this point, the feature matrix X, of dimensions n×p in the input space, can be projected into the optimal scoring domain as an n×(K−1) matrix X_OS = XB_OS, or into the linear discriminant analysis domain as an n×(K−1) matrix X_LDA = XB_LDA. Classification can be performed in any of those domains if the appropriate distance (penalized within-class covariance matrix) is applied.
With the aim of performing classification, the whole process can be summarized as follows:

1. Solve the p-OS problem as B_OS = (X⊤X + λΩ)⁻¹X⊤YΘ, where Θ holds the K−1 leading eigenvectors of Y⊤X(X⊤X + λΩ)⁻¹X⊤Y.

2. Translate the data samples X into the LDA domain as X_LDA = XB_OS D, where D = A⁻¹(I_{K−1} − A²)^(−1/2).

3. Compute the matrix M of centroids μ_k from X_LDA and Y.

4. Evaluate the distances d(x, μ_k) in the LDA domain as a function of M and X_LDA.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Optionally, produce a graphical representation of the data.
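The steps above can be sketched in NumPy for a toy problem (the GLOSS package itself is MATLAB; the data, the unit weights and the ridge penalty Ω = I used here are illustrative assumptions, not the thesis's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (illustrative sizes): n samples, p features, K classes.
n, p, K, lam = 60, 5, 3, 1.0
labels = np.repeat(np.arange(K), n // K)
X = 3.0 * rng.standard_normal((K, p))[labels] + rng.standard_normal((n, p))
X -= X.mean(axis=0)                        # X is assumed centered
Y = np.eye(K)[labels]

# Step 1: solve the p-OS problem (Omega = I).  Theta0 has columns orthogonal
# to 1_K and satisfies n^{-1} Theta0' Y'Y Theta0 = I_{K-1}.
Q, _ = np.linalg.qr(np.hstack([np.ones((K, 1)), rng.standard_normal((K, K - 1))]))
e, V = np.linalg.eigh(Y.T @ Y)
Theta0 = np.sqrt(n) * (V @ np.diag(e ** -0.5) @ V.T) @ Q[:, 1:]
B0 = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y @ Theta0)
M = Theta0.T @ (Y.T @ X) @ B0 / n          # n^{-1} Theta0'Y'X (X'X+lam I)^{-1} X'Y Theta0
evals, W = np.linalg.eigh(M)
order = np.argsort(evals)[::-1]
alpha = np.sqrt(evals[order])              # alpha_k, each in (0, 1)
B_os = B0 @ W[:, order]

# Step 2: move to the LDA domain, X_LDA = X B_OS D with D = A^{-1}(I - A^2)^{-1/2}.
X_lda = X @ B_os @ np.diag(1.0 / (alpha * np.sqrt(1.0 - alpha ** 2)))

# Steps 3-5: centroids, distances with the class-size adjustment, MAP rule.
mu = np.vstack([X_lda[labels == k].mean(axis=0) for k in range(K)])
log_prior = np.log(np.bincount(labels) / n)
d2 = ((X_lda[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2) - 2 * log_prior
pred = d2.argmin(axis=1)
print("training accuracy:", (pred == labels).mean())
```

On such well-separated toy data the nearest-centroid rule in the LDA domain recovers the class labels almost perfectly.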
The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Sections 4.2.1, 4.2.2 and 4.2.3, respectively.
4.2 Practicalities
4.2.1 Solution of the Penalized Optimal Scoring Regression
Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:

    min_{Θ∈R^{K×(K−1)}, B∈R^{p×(K−1)}}  ‖YΘ − XB‖²_F + λ tr(B⊤ΩB)   (4.16a)
    s.t.  n⁻¹ Θ⊤Y⊤YΘ = I_{K−1} ,   (4.16b)
where Θ holds the class scores, B the regression coefficients, and ‖·‖_F is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in Θ and B: the optimal B_OS does not intervene in the optimality conditions with respect to Θ, and the optimum with respect to B is obtained in closed form as a linear combination of the optimal scores Θ (Hastie et al. 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:
1. Initialize Θ to Θ⁰ such that n⁻¹ Θ⁰⊤Y⊤YΘ⁰ = I_{K−1}.

2. Compute B = (X⊤X + λΩ)⁻¹X⊤YΘ⁰.

3. Set Θ to the K−1 leading eigenvectors of Y⊤X(X⊤X + λΩ)⁻¹X⊤Y.

4. Compute the optimal regression coefficients

    B_OS = (X⊤X + λΩ)⁻¹X⊤YΘ .   (4.17)
Defining Θ⁰ in Step 1, instead of using directly Θ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on Θ⁰⊤Y⊤X(X⊤X + λΩ)⁻¹X⊤YΘ⁰, which is computed as Θ⁰⊤Y⊤XB, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.
This four-step algorithm is valid when the penalty is of the form tr(B⊤ΩB). However, when an L1 penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and Θ. That situation is developed by Clemmensen et al. (2011), where a Lasso or an Elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and Elastic net penalties do not enjoy the equivalence with LDA problems.
4.2.2 Distance Evaluation
The simplest classification rule is the nearest centroid rule, where sample x_i is assigned to class k if it is closer (in terms of the shared within-class Mahalanobis distance) to centroid μ_k than to any other centroid μ_ℓ. In general, the parameters of the model are unknown, and the rule is applied with the parameters estimated from training data (sample estimators μ_k and Σ_W). If μ_k are the centroids in the input space, sample x_i is assigned to class k if the distance
    d(x_i, μ_k) = (x_i − μ_k)⊤ Σ_WΩ⁻¹ (x_i − μ_k) − 2 log(n_k/n)   (4.18)
is minimized over all k. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment for unequal class sizes that estimates the prior probability of class k. Note that this is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al. 2009, Mai et al. 2012). The matrix Σ_WΩ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:

    Σ_WΩ⁻¹ = ( n⁻¹(X⊤X + λΩ) − Σ_B )⁻¹
           = ( n⁻¹X⊤X − Σ_B + n⁻¹λΩ )⁻¹
           = ( Σ_W + n⁻¹λΩ )⁻¹ .   (4.19)
Before explaining how to compute the distances, let us summarize some clarifying points:

• The solution B_OS of the p-OS problem is sufficient to perform classification.

• In the LDA domain (the space of discriminant variates X_LDA), classification is based on Euclidean distances.

• Classification can be done in a reduced-rank space of dimension R < K−1 by using the first R discriminant directions {β_k}, k = 1,…,R.
As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, the distance is

    ‖(x_i − μ_k)B_OS‖²_{Σ_WΩ} − 2 log(π_k) ,

where π_k is the estimated class prior and ‖·‖_S is the Mahalanobis norm assuming within-class covariance S. If classification is done in the p-LDA domain, it is

    ‖(x_i − μ_k)B_OS A⁻¹(I_{K−1} − A²)^(−1/2)‖²₂ − 2 log(π_k) ,

which is a plain Euclidean distance.
4.2.3 Posterior Probability Evaluation
Let d(x, μ_k) be the distance between x and μ_k, defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities p(y_k = 1|x) can be estimated as

    p(y_k = 1|x) ∝ exp( −d(x, μ_k)/2 )
                ∝ π_k exp( −½ ‖(x − μ_k)B_OS A⁻¹(I_{K−1} − A²)^(−1/2)‖²₂ ) .   (4.20)
These probabilities must be normalized to ensure that they sum to one. When the distances d(x, μ_k) take large values, exp(−d(x, μ_k)/2) can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is to shift all distances by d_min = min_k d(x, μ_k):

    p(y_k = 1|x) = π_k exp(−d(x, μ_k)/2) / Σ_ℓ π_ℓ exp(−d(x, μ_ℓ)/2)
                 = π_k exp(−(d(x, μ_k) − d_min)/2) / Σ_ℓ π_ℓ exp(−(d(x, μ_ℓ) − d_min)/2) ,

so that the largest term in the sum is exp(0) = 1.
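A minimal Python illustration of this shift (the distances are made-up values, and the function name is ours):

```python
import math

def posteriors(d, prior):
    """Posterior probabilities from squared distances d(x, mu_k), shifted by
    d_min so that the largest term is exp(0) = 1 and cannot underflow."""
    dmin = min(d)
    unnorm = [p * math.exp(-(dk - dmin) / 2) for p, dk in zip(prior, d)]
    s = sum(unnorm)
    return [u / s for u in unnorm]

d = [4000.0, 4100.0, 4500.0]                # naively, every exp(-d/2) underflows
print([math.exp(-dk / 2) for dk in d])      # [0.0, 0.0, 0.0]
print(posteriors(d, [1 / 3, 1 / 3, 1 / 3]))
```

Without the shift, the normalization would be a 0/0; with it, the first class receives essentially all of the posterior mass.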
4.2.4 Graphical Representation
Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. This can be accomplished by plotting the first two or three dimensions of the regression fits X_OS or of the discriminant variates X_LDA, depending on whether we present the dataset in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.
4.3 From Sparse Optimal Scoring to Sparse LDA
The equivalence stated in Section 4.1 holds for quadratic penalties of the form β⊤Ωβ, under the assumption that Y⊤Y and X⊤X + λΩ are full rank (fulfilled when there are no empty classes and Ω is positive definite). Quadratic penalties have interesting properties but, as recalled in Section 2.3, they do not induce sparsity. In this respect, L1 penalties are preferable, but they lack a connection between p-LDA and p-OS such as the one stated by Hastie et al. (1995).
In this section we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces groups of zeroes in the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form tr(B⊤ΩB).
4.3.1 A Quadratic Variational Form
Quadratic variational forms of the Lasso and group-Lasso were proposed shortly after the original Lasso paper of Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet 1998, Canu and Grandvalet 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al. 2012).
Our formulation of the group-Lasso is shown below:

    min_{τ∈R^p} min_{B∈R^{p×(K−1)}}  J(B) + λ Σ_{j=1}^p w_j² ‖β^j‖²₂ / τ_j   (4.21a)
    s.t.  Σ_j τ_j − Σ_j w_j ‖β^j‖₂ ≤ 0   (4.21b)
          τ_j ≥ 0 , j = 1,…,p ,   (4.21c)

where B ∈ R^{p×(K−1)} is a matrix composed of row vectors β^j ∈ R^{K−1}, B = (β^1⊤, …, β^p⊤)⊤, and the w_j are predefined nonnegative weights. The cost function J(B) in our context is the OS regression objective ½‖YΘ − XB‖²₂; from now on, for the sake of simplicity, we simply write J(B). Here and in what follows, b/τ is defined by continuation at zero: b/0 = +∞ if b ≠ 0, and 0/0 = 0. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet 1999, Bach et al. 2012, and references therein).
The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic expression into the convex hull of a family of quadratic penalties indexed by the variables τ_j, as shown graphically in Figure 4.1.
Let us start by proving the equivalence of our variational formulation with the standard group-Lasso (an alternative variational formulation is detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty in β^j in (4.21) acts as the group-Lasso penalty λ Σ_{j=1}^p w_j ‖β^j‖₂.
Proof. The Lagrangian of Problem (4.21) is:

    L = J(B) + λ Σ_{j=1}^p w_j² ‖β^j‖²₂ / τ_j + ν₀ ( Σ_{j=1}^p τ_j − Σ_{j=1}^p w_j ‖β^j‖₂ ) − Σ_{j=1}^p ν_j τ_j .
Figure 4.1: Graphical representation of the variational approach to the group-Lasso.
Thus, the first-order optimality conditions for τ_j* are:

    ∂L/∂τ_j (τ_j*) = 0 ⇔ −λ w_j² ‖β^j‖²₂ / τ_j*² + ν₀ − ν_j = 0
                       ⇔ −λ w_j² ‖β^j‖²₂ + ν₀ τ_j*² − ν_j τ_j*² = 0
                       ⇒ −λ w_j² ‖β^j‖²₂ + ν₀ τ_j*² = 0 .
The last line is obtained from complementary slackness, which implies here that ν_j τ_j* = 0; complementary slackness states that ν_j g_j(τ_j*) = 0, where ν_j is the Lagrange multiplier for the constraint g_j(τ_j) ≤ 0. As a result, the optimal value of τ_j is:

    τ_j* = √( λ w_j² ‖β^j‖²₂ / ν₀ ) = √(λ/ν₀) w_j ‖β^j‖₂ .   (4.22)
We note that ν₀ ≠ 0 if there is at least one coefficient β_jk ≠ 0; thus the inequality constraint (4.21b) is at bound (due to complementary slackness):

    Σ_{j=1}^p τ_j* − Σ_{j=1}^p w_j ‖β^j‖₂ = 0 ,   (4.23)

so that τ_j* = w_j ‖β^j‖₂. Using this value in (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso problem:
    min_{B∈R^{p×M}}  J(B) + λ Σ_{j=1}^p w_j ‖β^j‖₂ .   (4.24)

We have thus presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.
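The equivalence can be illustrated numerically: plugging τ_j = w_j‖β^j‖₂ into the variational penalty recovers the group-Lasso penalty exactly, and any other feasible τ can only increase it (the coefficients and weights below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((5, 3))            # p = 5 row vectors beta^j, K - 1 = 3
B[2] = 0.0                                 # one inactive group (0/0 counts as 0)
w = np.array([1.0, 0.5, 2.0, 1.0, 1.5])    # nonnegative weights

norms = np.linalg.norm(B, axis=1)
tau = w * norms                            # optimal tau_j from (4.22)-(4.23)
act = tau > 0
variational = np.sum(w[act] ** 2 * norms[act] ** 2 / tau[act])
group_lasso = np.sum(w * norms)
print(np.isclose(variational, group_lasso))                   # True

tau_alt = np.full(5, tau.sum() / 5)        # another feasible tau (same total)
print(np.sum(w ** 2 * norms ** 2 / tau_alt) >= group_lasso)   # True
```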
With Lemma 4.1 we have proved that, under constraints (4.21b)–(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently written as λ tr(B⊤ΩB), where

    Ω = diag( w₁²/τ₁, w₂²/τ₂, …, w_p²/τ_p ) ,   (4.25)

with τ_j = w_j ‖β^j‖₂, resulting in the diagonal components

    (Ω)_jj = w_j / ‖β^j‖₂ .   (4.26)
Hence, as stated at the beginning of this section, the equivalence between p-LDA problems and p-OS problems extends to the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set algorithm described in Chapter 5.

The first property states that the quadratic formulation is convex when J is convex, thus providing an easy control of optimality and convergence.
Lemma 4.2. If J is convex, Problem (4.21) is convex.

Proof. The function g(β, τ) = ‖β‖²₂/τ, known as the perspective function of f(β) = ‖β‖²₂, is jointly convex in (β, τ) (see e.g. Boyd and Vandenberghe 2004, Chapter 3), and the constraints (4.21b)–(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to (B, τ).
In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (4.24) is

    { V ∈ R^{p×(K−1)} : V = ∂J(B)/∂B + λG } ,   (4.27)

where G ∈ R^{p×(K−1)} is a matrix composed of row vectors g^j ∈ R^{K−1}, G = (g^1⊤, …, g^p⊤)⊤, defined as follows. Let S(B) denote the support of B over its rows, S(B) = { j ∈ {1,…,p} : ‖β^j‖₂ ≠ 0 }; then we have:

    ∀j ∈ S(B) ,  g^j = w_j ‖β^j‖₂⁻¹ β^j   (4.28)
    ∀j ∉ S(B) ,  ‖g^j‖₂ ≤ w_j .   (4.29)
This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.
Proof. When ‖β^j‖₂ ≠ 0, the gradient of the penalty with respect to β^j is

    ∂( λ Σ_{m=1}^p w_m ‖β^m‖₂ ) / ∂β^j = λ w_j β^j / ‖β^j‖₂ .   (4.30)

At ‖β^j‖₂ = 0, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al. 2011):

    ∂_{β^j} ( λ Σ_{m=1}^p w_m ‖β^m‖₂ ) = ∂_{β^j} ( λ w_j ‖β^j‖₂ ) = { λ w_j v ∈ R^{K−1} : ‖v‖₂ ≤ 1 } ,   (4.31)

which gives expression (4.29).
Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points B of the objective function verifying the following conditions are global minima:

    ∀j ∈ S ,  ∂J(B)/∂β^j + λ w_j ‖β^j‖₂⁻¹ β^j = 0   (4.32a)
    ∀j ∉ S ,  ‖∂J(B)/∂β^j‖₂ ≤ λ w_j ,   (4.32b)

where S ⊆ {1,…,p} denotes the set of non-zero row vectors β^j and S̄ is its complement.
Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with a direct analysis of the variational problem (4.21).
4.3.2 Group-Lasso OS as Penalized LDA
With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.
Proposition 4.1. The group-Lasso OS problem

    B_OS = argmin_{B∈R^{p×(K−1)}} min_{Θ∈R^{K×(K−1)}}  ½ ‖YΘ − XB‖²_F + λ Σ_{j=1}^p w_j ‖β^j‖₂
    s.t.  n⁻¹ Θ⊤Y⊤YΘ = I_{K−1}
is equivalent to the penalized LDA problem

    B_LDA = argmax_{B∈R^{p×(K−1)}}  tr( B⊤Σ_B B )
    s.t.  B⊤( Σ_W + n⁻¹λΩ )B = I_{K−1} ,

where Ω = diag( w₁²/τ₁, …, w_p²/τ_p ), with

    Ω_jj = +∞ if β^j_os = 0 , and Ω_jj = w_j ‖β^j_os‖₂⁻¹ otherwise.   (4.33)

That is, B_LDA = B_OS diag( α_k⁻¹ (1 − α_k²)^(−1/2) ), where α_k ∈ (0, 1) and α_k² is the kth leading eigenvalue of

    n⁻¹ Y⊤X( X⊤X + λΩ )⁻¹X⊤Y .
Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al. 2008, Clemmensen et al. 2011) for K = 2, that is, for binary classification, or more generally for a single discriminant direction. Note however that it leads to a slightly different decision rule if the decision threshold is chosen a priori according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold any more, since the Lasso penalty does not result in an equivalent quadratic penalty of the simple form tr(B⊤ΩB).
5 GLOSS Algorithm
The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems, whose sizes are incrementally increased or decreased (Osborne et al. 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer 2008). We adapt this algorithmic framework to the variational form (4.21), with J(B) = ½ ‖YΘ − XB‖²₂.
The algorithm belongs to the working-set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables, currently identified as non-zero. Then it iterates the three steps summarized below:

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth: first the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more β^j may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to the inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal; if not, the variable corresponding to the greatest violation is added to the active set.
This mechanism is represented graphically in Figure 5.1 as a block diagram, and formalized in more detail in Algorithm 1. Note that this formulation uses the equations of the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach of Appendix D, then we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.
5.1 Regression Coefficients Updates
Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving (K−1) independent card(A)-dimensional problems instead of a single (K−1)×card(A)-dimensional problem. The interaction between the (K−1) problems is relegated to the common adaptive quadratic penalty Ω. This decomposition is especially attractive, as we then solve (K−1) similar systems

    ( X_A⊤X_A + λΩ ) β_k = X_A⊤Yθ⁰_k ,   (5.1)
[Figure 5.1 is a block diagram of GLOSS: after initializing the model (λ, B) and the active set {j : ‖β^j‖₂ > 0}, the algorithm solves the p-OS problem so that B satisfies the first optimality condition, moves to the inactive set any active variable that must leave, then tests the second optimality condition on the inactive set and moves any violating variable to the active set; when no variable moves, Θ is computed, B is updated, and the algorithm ends.]

Figure 5.1: GLOSS block diagram.
Algorithm 1: Adaptively Penalized Optimal Scoring

Input: X, Y, B, λ.
Initialize: A ← { j ∈ {1,…,p} : ‖β^j‖₂ > 0 }; Θ⁰ such that n⁻¹ Θ⁰⊤Y⊤YΘ⁰ = I_{K−1}; convergence ← false.
repeat
  % Step 1: solve (4.21) in B, assuming A optimal
  repeat
    Ω ← diag(ω_j), j ∈ A, with ω_j ← ‖β^j‖₂⁻¹
    B_A ← ( X_A⊤X_A + λΩ )⁻¹ X_A⊤YΘ⁰
  until condition (4.32a) holds for all j ∈ A
  % Step 2: identify inactivated variables
  for j ∈ A such that ‖β^j‖₂ = 0 do
    if optimality condition (4.32b) holds then
      A ← A \ {j}; go back to Step 1
    end if
  end for
  % Step 3: check the greatest violation of optimality condition (4.32b) in the complement of A
  j* ← argmax_{j∉A} ‖∂J/∂β^j‖₂
  if ‖∂J/∂β^{j*}‖₂ < λ w_{j*} then
    convergence ← true   % B is optimal
  else
    A ← A ∪ {j*}
  end if
until convergence
(s, V) ← eigenanalyze( Θ⁰⊤Y⊤X_A B ), that is, Θ⁰⊤Y⊤X_A B v_k = s_k v_k, k = 1,…,K−1
Θ ← Θ⁰V; B ← BV; α_k ← n^(−1/2) s_k^(1/2), k = 1,…,K−1
Output: Θ, B, α.
where X_A denotes the columns of X indexed by A, and β_k and θ⁰_k denote the kth columns of B and Θ⁰, respectively. These linear systems differ only in their right-hand-side term, so that a single Cholesky decomposition is necessary to solve all of them, whereas a blockwise Newton–Raphson method based on the standard group-Lasso formulation would result in different "penalties" Ω for each system.
5.1.1 Cholesky Decomposition
Dropping the subscripts and considering the (K−1) systems together, (5.1) leads to

    ( X⊤X + λΩ ) B = X⊤YΘ .   (5.2)

Defining the Cholesky decomposition C⊤C = X⊤X + λΩ, (5.2) is solved efficiently as follows:

    C⊤C B = X⊤YΘ
    C B = C⊤ \ ( X⊤YΘ )
    B = C \ ( C⊤ \ ( X⊤YΘ ) ) ,   (5.3)

where the symbol "\" is the MATLAB mldivide operator, which solves linear systems efficiently (here, by triangular backsubstitution). The GLOSS code implements (5.3).
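In NumPy, the same computation reads as follows (a sketch with illustrative sizes; np.linalg.solve stands in for the two triangular backsolves that mldivide performs, and scipy.linalg.solve_triangular would exploit the triangular structure):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, K1, lam = 20, 6, 2, 0.5
X = rng.standard_normal((n, p))
R = rng.standard_normal((p, K1))       # plays the role of X'Y(Theta), K-1 columns

G = X.T @ X + lam * np.eye(p)          # X'X + lam*Omega (here Omega = I), SPD
C = np.linalg.cholesky(G)              # lower triangular, G = C C'
B = np.linalg.solve(C.T, np.linalg.solve(C, R))   # one factorization, K-1 solves
print(np.allclose(G @ B, R))           # True
```

A single factorization of G serves all K−1 right-hand sides, which is the point of the decomposition in (5.1).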
5.1.2 Numerical Stability
The OS regression coefficients are obtained by (5.2), where the penalizer Ω is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of Ω takes very large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of X⊤X + λΩ. This difficulty can be avoided using the following equivalent expression:

    B = Ω^(−1/2) ( Ω^(−1/2)X⊤XΩ^(−1/2) + λI )⁻¹ Ω^(−1/2)X⊤YΘ⁰ ,   (5.4)

where the conditioning of Ω^(−1/2)X⊤XΩ^(−1/2) + λI is always well-behaved provided X is appropriately normalized (recall that 0 ≤ 1/ω_j ≤ 1). This stabler expression demands more computation and is thus reserved for cases with large ω_j values; our code is otherwise based on expression (5.2).
5.2 Score Matrix
The optimal score matrix Θ is made of the K−1 leading eigenvectors of Y⊤X(X⊤X + Ω)⁻¹X⊤Y. This eigen-analysis is actually solved in the form Θ⊤Y⊤X(X⊤X + Ω)⁻¹X⊤YΘ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of (X⊤X + Ω)⁻¹, which
involves the inversion of a p×p matrix. Let Θ⁰ be an arbitrary K×(K−1) matrix whose range includes the K−1 leading eigenvectors of Y⊤X(X⊤X + Ω)⁻¹X⊤Y.¹ Then, solving the K−1 systems (5.3) provides the value of B⁰ = (X⊤X + λΩ)⁻¹X⊤YΘ⁰. This B⁰ matrix can be identified in the expression to eigenanalyze as

    Θ⁰⊤Y⊤X(X⊤X + Ω)⁻¹X⊤YΘ⁰ = Θ⁰⊤Y⊤XB⁰ .

Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the (K−1)×(K−1) matrix Θ⁰⊤Y⊤XB⁰ = VΛV⊤. Defining Θ = Θ⁰V, we have Θ⊤Y⊤X(X⊤X + Ω)⁻¹X⊤YΘ = Λ, and when Θ⁰ is chosen such that n⁻¹ Θ⁰⊤Y⊤YΘ⁰ = I_{K−1}, we also have n⁻¹ Θ⊤Y⊤YΘ = I_{K−1}, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of Λ are sorted in decreasing order, Θ is an optimal solution of the p-OS problem. Finally, once Θ has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered using the mapping from Θ⁰ to Θ, that is, B = B⁰V. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which Ω is defined by a variational formulation.
5.3 Optimality Conditions
GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix Θ. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4, from which the optimality conditions (4.32a) and (4.32b) can be deduced. Both expressions require the computation of the gradient of the objective function

    ½ ‖YΘ − XB‖²₂ + λ Σ_{j=1}^p w_j ‖β^j‖₂ .   (5.5)
Let J(B) be the data-fitting term ½ ‖YΘ − XB‖²₂. Its gradient with respect to the jth row of B, β^j, is the (K−1)-dimensional vector

    ∂J(B)/∂β^j = x_j⊤( XB − YΘ ) ,

where x_j is the jth column of X. Hence, the first optimality condition (4.32a) can be computed for every active variable j as

    x_j⊤( XB − YΘ ) + λ w_j β^j / ‖β^j‖₂ = 0 .
¹ As X is centered, 1_K belongs to the null space of Y⊤X(X⊤X + Ω)⁻¹X⊤Y. It is thus sufficient to choose Θ⁰ orthogonal to 1_K to ensure that its range spans the leading eigenvectors of Y⊤X(X⊤X + Ω)⁻¹X⊤Y. In practice, to comply with this desideratum and conditions (3.5b) and (3.5c), we set Θ⁰ = (Y⊤Y)^(−1/2)U, where U is a K×(K−1) matrix whose columns are orthonormal vectors orthogonal to 1_K.
The second optimality condition (4.32b) can be computed for every inactive variable j as

    ‖x_j⊤( XB − YΘ )‖₂ ≤ λ w_j .
5.4 Active and Inactive Sets
The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set, containing the variables that have already been considered relevant. A variable j can be considered for inclusion into the active set if it violates the second optimality condition. We proceed one variable at a time, by choosing the one that is expected to produce the greatest decrease in the objective function:

    j* = argmax_j  max( ‖x_j⊤(XB − YΘ)‖₂ − λ w_j , 0 ) .
The exclusion of a variable belonging to the active set A is considered if the norm ‖β^j‖₂ is small and if, after setting β^j to zero, the following optimality condition holds:

    ‖x_j⊤( XB − YΘ )‖₂ ≤ λ w_j .

The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.
5.5 Penalty Parameter
The penalty parameter can be specified by the user, in which case GLOSS solves the problem for this value of λ. The other strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value of the penalty parameter λ_max such that B ≠ 0, and solves the p-OS problem for decreasing values of λ, until a prescribed number of features are declared active.
The maximum value of the penalty parameter λ_max corresponding to a null B matrix is obtained by evaluating the optimality condition (4.32b) at B = 0:

    λ_max = max_{j∈{1,…,p}}  (1/w_j) ‖x_j⊤YΘ⁰‖₂ .
The algorithm then computes a series of solutions along the regularization path defined by a series of penalties λ₁ = λ_max > ⋯ > λ_t > ⋯ > λ_T = λ_min ≥ 0, regularly decreasing the penalty as λ_{t+1} = λ_t/2 and using a warm-start strategy, where the feasible initial guess for B(λ_{t+1}) is initialized with B(λ_t). The final penalty parameter λ_min is determined during the optimization process, when the maximum number of desired active variables is attained (by default, the minimum of n and p).
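A short NumPy sketch of λ_max and the halving schedule (the data and weights are illustrative, and Θ⁰ is a placeholder score matrix assumed to satisfy the p-OS constraint):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, K = 30, 8, 3
Y = np.eye(K)[rng.integers(0, K, n)]
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                          # centered features
w = np.ones(p)                               # penalty weights
Theta0 = rng.standard_normal((K, K - 1))     # placeholder initial scores

# At B = 0, the gradient w.r.t. beta^j is -x_j' Y Theta0, so condition (4.32b)
# holds for every variable as soon as lam >= lam_max:
scores = np.linalg.norm(X.T @ (Y @ Theta0), axis=1) / w
lam_max = scores.max()
print(np.all(scores <= lam_max * w))         # True: B = 0 is optimal at lam_max

# Halving schedule with warm starts: B(lam_{t+1}) is initialized with B(lam_t).
lams = [lam_max / 2 ** t for t in range(8)]
print(lams[0] / lams[-1])                    # 128.0
```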
5.6 Options and Variants
5.6.1 Scaling Variables
Like most penalization schemes, GLOSS is sensitive to the scaling of the variables. It thus makes sense to normalize them before applying the algorithm, or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.
5.6.2 Sparse Variant
This version replaces some MATLAB commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.
5.6.3 Diagonal Variant
We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis. Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.
The equivalence proof between penalized OS and penalized LDA (Hastie et al. 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS, corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems

    min_{B∈R^{p×(K−1)}} ‖YΘ − XB‖²_F = min_{B∈R^{p×(K−1)}} tr( Θ⊤Y⊤YΘ − 2Θ⊤Y⊤XB + nB⊤Σ_T B )

are replaced by

    min_{B∈R^{p×(K−1)}} tr( Θ⊤Y⊤YΘ − 2Θ⊤Y⊤XB + nB⊤( Σ_B + diag(Σ_W) )B ) .

Note that this variant only requires diag(Σ_W) + Σ_B + n⁻¹Ω to be positive definite, which is a weaker requirement than Σ_T + n⁻¹Ω being positive definite.
5.6.4 Elastic Net and Structured Variant
For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition,
Figure 5.2: Neighborhood graph and Laplacian matrix for a 3×3 image. The pixels are numbered

    7 8 9
    4 5 6
    1 2 3

and the corresponding Laplacian matrix is

    Ω_L = |  3 −1  0 −1 −1  0  0  0  0 |
          | −1  5 −1 −1 −1 −1  0  0  0 |
          |  0 −1  3  0 −1 −1  0  0  0 |
          | −1 −1  0  5 −1  0 −1 −1  0 |
          | −1 −1 −1 −1  8 −1 −1 −1 −1 |
          |  0 −1 −1  0 −1  5  0 −1 −1 |
          |  0  0  0 −1 −1  0  3 −1  0 |
          |  0  0  0 −1 −1 −1 −1  5 −1 |
          |  0  0  0  0 −1 −1  0 −1  3 |
constraining the discriminant directions of their penalized discriminant analysis model to be spatially smooth.
When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of the pixels of a 3×3 image, with the corresponding Laplacian matrix. The Laplacian matrix Ω_L is positive semi-definite, and the penalty β⊤Ω_Lβ favors, among vectors of identical L2 norm, the ones having similar coefficients within the neighborhoods of the graph. For example, this penalty is 9 for the vector (1, 1, 0, 1, 1, 0, 0, 0, 0)⊤, which is the indicator of pixel 1 and its neighbors, and it is 21 for the vector (−1, 1, 0, 1, 1, 0, 0, 0, 0)⊤, which has a sign mismatch between pixel 1 and its neighborhood.
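The penalty values for these two vectors can be recomputed directly from the Laplacian, built here for the 8-connected 3×3 grid of Figure 5.2:

```python
import numpy as np

# 8-connected neighborhood graph of a 3x3 image; the adjacency pattern is the
# same whatever the pixel numbering, so row-major indexing is used here.
coords = [(r, c) for r in range(3) for c in range(3)]
A = np.zeros((9, 9))
for i, (r1, c1) in enumerate(coords):
    for j, (r2, c2) in enumerate(coords):
        if i != j and abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1:
            A[i, j] = 1.0
L = np.diag(A.sum(axis=1)) - A               # graph Laplacian Omega_L

b1 = np.array([1, 1, 0, 1, 1, 0, 0, 0, 0.])  # indicator of pixel 1 and its neighbors
b2 = np.array([-1, 1, 0, 1, 1, 0, 0, 0, 0.]) # sign mismatch at pixel 1
print(b1 @ L @ b1, b2 @ L @ b2)              # 9.0 21.0
```

The quadratic form counts Σ_{(i,j)∈E} (β_i − β_j)² over the edges of the graph, so sign flips between neighbors are penalized much more heavily than a smooth indicator.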
This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty is simply added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.
6 Experimental Results
This section presents comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA. Those algorithms are Penalized LDA (PLDA) (Witten and Tibshirani 2011), which applies a Lasso penalty within Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al. 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing parsimony capabilities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites; PLDA is an R implementation and SLDA is coded in MATLAB. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.
6.1 Normalization
With shrunken estimates, the scaling of features has important consequences. For the linear discriminants considered here, the two most common normalization strategies consist in setting to ones either the diagonal of the total covariance matrix Σ_T, or the diagonal of the within-class covariance matrix Σ_W. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing penalties with weights. The latter option is implemented in our MATLAB package.¹
6.2 Decision Thresholds
The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations; hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, Chapter 4) suggest investigating other decision thresholds than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that minimize the empirical training error. This option was tested using validation sets or cross-validation.
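The threshold selection described above can be sketched as a simple search over candidate cutoffs on a one-dimensional discriminant score; the function and variable names below are illustrative, not taken from the GLOSS package:

```python
import numpy as np

def best_threshold(scores, labels):
    """Pick the cutoff on a 1D discriminant score that minimizes the
    empirical classification error, instead of the threshold implied by
    the Gaussian mixture assumption.  `labels` are binary (0/1)."""
    s = np.sort(np.unique(scores))
    # Candidate cutoffs: midpoints between consecutive distinct scores,
    # plus one cutoff below and one above all scores.
    candidates = np.concatenate(([s[0] - 1.0], (s[:-1] + s[1:]) / 2.0, [s[-1] + 1.0]))
    best_t, best_err = None, np.inf
    for t in candidates:
        err = np.mean((scores > t).astype(int) != labels)
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err
```

In practice the threshold would be chosen on a validation set or by cross-validation, as stated above, rather than on the training scores themselves.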
¹The GLOSS Matlab code can be found in the software section of www.hds.utc.fr/~grandval.
6.3 Simulated Data
We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split into a training set of size n = 100, a validation set of size 100, and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is:
Simulation 1: Mean shift with independent features. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), where μ_1j = 0.7 × 1_(1≤j≤25), μ_2j = 0.7 × 1_(26≤j≤50), μ_3j = 0.7 × 1_(51≤j≤75), μ_4j = 0.7 × 1_(76≤j≤100).

Simulation 2: Mean shift with dependent features. There are two classes. If sample i is in class 1, then x_i ∼ N(0, Σ), and if i is in class 2, then x_i ∼ N(μ, Σ), with μ_j = 0.6 × 1_(j≤200). The covariance structure is block diagonal, with 5 blocks, each of dimension 100 × 100. The blocks have (j, j′) element 0.6^|j−j′|. This covariance structure is intended to mimic gene expression data correlation.

Simulation 3: One-dimensional mean shift with independent features. There are four classes and the features are independent. If sample i is in class k, then X_ij ∼ N((k−1)/3, 1) if j ≤ 100, and X_ij ∼ N(0, 1) otherwise.

Simulation 4: Mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), with mean vectors defined as follows: μ_1j ∼ N(0, 0.3²) for j ≤ 25 and μ_1j = 0 otherwise; μ_2j ∼ N(0, 0.3²) for 26 ≤ j ≤ 50 and μ_2j = 0 otherwise; μ_3j ∼ N(0, 0.3²) for 51 ≤ j ≤ 75 and μ_3j = 0 otherwise; μ_4j ∼ N(0, 0.3²) for 76 ≤ j ≤ 100 and μ_4j = 0 otherwise.
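As an illustration, Simulation 1 can be generated with a few lines of code; this is a sketch of the protocol above, not the original scripts of Witten and Tibshirani:

```python
import numpy as np

def simulation1(n_per_class=25, p=500, seed=None):
    """Simulation 1: four classes, mean shift, independent features;
    mu_kj = 0.7 on the k-th block of 25 variables (100 relevant overall)."""
    rng = np.random.default_rng(seed)
    K = 4
    means = np.zeros((K, p))
    for k in range(K):
        means[k, 25 * k:25 * (k + 1)] = 0.7
    X = np.vstack([rng.standard_normal((n_per_class, p)) + means[k]
                   for k in range(K)])
    y = np.repeat(np.arange(K), n_per_class)
    return X, y
```

Calling `simulation1()` three times with different sizes produces the training, validation and test sets used in the protocol.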
Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA, in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.
The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed, far behind, by SLDA, which is the only
Table 6.1: Experimental results for simulated data: averages, with standard deviations, computed over 25 repetitions, of the test error rate, the number of selected variables, and the number of discriminant directions selected on the validation set.
                Err. (%)      Var.            Dir.
  Sim. 1: K = 4, mean shift, ind. features
    PLDA        12.6 (0.1)    411.7 (3.7)     3.0 (0.0)
    SLDA        31.9 (0.1)    228.0 (0.2)     3.0 (0.0)
    GLOSS       19.9 (0.1)    106.4 (1.3)     3.0 (0.0)
    GLOSS-D     11.2 (0.1)    251.1 (4.1)     3.0 (0.0)
  Sim. 2: K = 2, mean shift, dependent features
    PLDA         9.0 (0.4)    337.6 (5.7)     1.0 (0.0)
    SLDA        19.3 (0.1)     99.0 (0.0)     1.0 (0.0)
    GLOSS       15.4 (0.1)     39.8 (0.8)     1.0 (0.0)
    GLOSS-D      9.0 (0.0)    203.5 (4.0)     1.0 (0.0)
  Sim. 3: K = 4, 1D mean shift, ind. features
    PLDA        13.8 (0.6)    161.5 (3.7)     1.0 (0.0)
    SLDA        57.8 (0.2)    152.6 (2.0)     1.9 (0.0)
    GLOSS       31.2 (0.1)    123.8 (1.8)     1.0 (0.0)
    GLOSS-D     18.5 (0.1)    357.5 (2.8)     1.0 (0.0)
  Sim. 4: K = 4, mean shift, ind. features
    PLDA        60.3 (0.1)    336.0 (5.8)     3.0 (0.0)
    SLDA        65.9 (0.1)    208.8 (1.6)     2.7 (0.0)
    GLOSS       60.7 (0.2)     74.3 (2.2)     2.7 (0.0)
    GLOSS-D     58.8 (0.1)    162.7 (4.9)     2.9 (0.0)
Figure 6.1: TPR versus FPR (in %) for all algorithms and simulations.
Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.
              Simulation 1     Simulation 2     Simulation 3     Simulation 4
              TPR     FPR      TPR     FPR      TPR     FPR      TPR     FPR
    PLDA      99.0    78.2     96.9    60.3     98.0    15.9     74.3    65.6
    SLDA      73.9    38.5     33.8    16.3     41.6    27.8     50.7    39.5
    GLOSS     64.1    10.6     30.0     4.6     51.1    18.2     26.0    12.1
    GLOSS-D   93.5    39.4     92.1    28.1     95.6    65.5     42.9    29.9
method that does not succeed in uncovering a low-dimensional representation in Simulation 3. The adequacy of the selected features is assessed by the true positive rate (TPR) and the false positive rate (FPR): the TPR is the proportion of relevant variables that are selected, and the FPR is the proportion of irrelevant variables that are selected. The best algorithm would select all the relevant variables and reject all the others, that is, reach TPR = 100% and FPR = 0% simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).
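Under these definitions, the two rates can be computed directly from the index sets of selected and truly relevant variables; a minimal sketch:

```python
def tpr_fpr(selected, relevant, p):
    """TPR: fraction of the truly relevant variables that are selected.
       FPR: fraction of the irrelevant variables that are selected.
    `selected` and `relevant` are iterables of variable indices in 0..p-1."""
    selected, relevant = set(selected), set(relevant)
    irrelevant = set(range(p)) - relevant
    tpr = len(selected & relevant) / len(relevant)
    fpr = len(selected & irrelevant) / len(irrelevant)
    return tpr, fpr
```

For instance, in the simulations above, `relevant` would be the 100 differing variables out of p = 500, and `selected` the support returned by each algorithm.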
6.4 Gene Expression Data
We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors. It was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011). The Ramaswamy³ dataset contains 198 exam-
²http://www.broadinstitute.org/cancer/software/genepattern/datasets
³http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736
Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits, with standard deviations, of the test error rates and the number of selected variables.
                Err. (%)        Var.
  Nakayama: n = 86, p = 22,283, K = 5
    PLDA        20.95 (1.3)     10478.7 (2116.3)
    SLDA        25.71 (1.7)       252.5 (3.1)
    GLOSS       20.48 (1.4)       129.0 (18.6)
  Ramaswamy: n = 198, p = 16,063, K = 14
    PLDA        38.36 (6.0)     14873.5 (720.3)
    SLDA            —                 —
    GLOSS       20.61 (6.9)       372.4 (122.1)
  Sun: n = 180, p = 54,613, K = 4
    PLDA        33.78 (5.9)     21634.8 (7443.2)
    SLDA        36.22 (6.5)       384.4 (16.5)
    GLOSS       31.77 (4.5)        93.0 (93.6)
ples of 16,063 gene expressions for categorizing 14 classes of cancer. Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.
Each dataset was split into a training set and a test set, with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.
Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution, due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.
Finally, Figure 6.2 displays the projection of the observations for the Nakayama and Sun datasets in the first canonical planes estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high collinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.
⁴http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962
[Figure 6.2: scatter plots of the observations in the first canonical plane (1st discriminant on the x-axis, 2nd discriminant on the y-axis), for GLOSS (left column) and SLDA (right column). Top row: Nakayama dataset, with classes 1) synovial sarcoma, 2) myxoid liposarcoma, 3) dedifferentiated liposarcoma, 4) myxofibrosarcoma, 5) malignant fibrous histiocytoma. Bottom row: Sun dataset, with classes 1) non-tumor, 2) astrocytomas, 3) glioblastomas, 4) oligodendrogliomas.]
Figure 6.2: 2D-representations of the Nakayama and Sun datasets, based on the first two discriminant vectors provided by GLOSS and SLDA. The big squares represent class means.
Figure 6.3: USPS digits "1" and "0".
6.5 Correlated Data
When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily introduce this prior knowledge.
The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class covariance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results are left for future work.
For this illustration, we have used a subset of the USPS handwritten digit dataset, made of 16 × 16 pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and S-GLOSS. The mean image of every digit is shown in Figure 6.3.
As in Section 5.6.4, we have encoded the pixel proximity relationships of Figure 5.2 into a penalty matrix Ω_L, this time on a 256-node graph. Introducing this new 256 × 256 Laplacian penalty matrix Ω_L in the GLOSS algorithm is straightforward.
The effect of this penalty is clearly visible in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). The center of the digit "0", which is probably the most important element for discriminating the two digits, is clearly visible in the discriminant direction obtained by S-GLOSS.
Figure 6.5 displays the discriminant directions β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels, which allows detecting strokes and will probably provide better prediction results.
Figure 6.4: Discriminant direction between digits "1" and "0": β for GLOSS (left) and for S-GLOSS (right).
Figure 6.5: Sparse discriminant direction between digits "1" and "0": β for GLOSS (left) and for S-GLOSS (right), with λ = 0.3.
Discussion
GLOSS is an efficient algorithm that performs sparse LDA, based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem; to the best of our knowledge, this is the first approach that enjoys this property in the multi-class setting. This relationship also makes it possible to accommodate interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.
Computationally, GLOSS is based on an efficient active set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K − 1)-dimensional problem into (K − 1) independent p-dimensional problems, the interaction between the (K − 1) problems being relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.
The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, regarding both its prediction abilities and its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performances. Employing the same features in all discriminant directions enables generating models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of data, from the low-dimensional representations that can be produced.
The approach has many potential extensions that have not been implemented yet. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can be added to the group penalty to encode priors about the within-class covariance structure, in the spirit of the penalized discriminant analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.
Part III
Sparse Clustering Analysis
Abstract
Clustering can be defined as the task of grouping samples such that the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records, or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.
Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussian, and the different populations are identified by different means and a common covariance matrix.
As in the supervised framework, traditional clustering techniques perform worse as the number of irrelevant features increases. In this part, we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.
Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.
7 Feature Selection in Mixture Models
7.1 Mixture Models
One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). K-means can be generalized through probabilistic models that represent the K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population, and are especially well suited to the problem of clustering.
7.1.1 Model
We assume that the observed data X = (x_1⊤, …, x_n⊤)⊤ have been drawn identically from K different subpopulations in the domain ℝ^p. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compounded distribution whose density can be expressed as

    f(x_i) = ∑_{k=1}^{K} π_k f_k(x_i) ,  ∀i ∈ {1, …, n} ,

where K is the number of components, f_k are the densities of the components, and π_k are the mixture proportions (π_k ∈ ]0, 1[ for all k, and ∑_k π_k = 1). Mixture models transcribe that, given the proportions π_k and the distributions f_k for each class, the data are generated according to the following mechanism:
• y: each individual is allotted to a class according to a multinomial distribution with parameters π_1, …, π_K;

• x: each x_i is assumed to arise from a random vector with probability density function f_k.
In addition, it is usually assumed that the component densities f_k belong to a parametric family of densities φ(·; θ_k). The density of the mixture can then be written as

    f(x_i; θ) = ∑_{k=1}^{K} π_k φ(x_i; θ_k) ,  ∀i ∈ {1, …, n} ,
where θ = (π_1, …, π_K, θ_1, …, θ_K) is the parameter of the model.
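As a toy illustration of the compounded density, the following sketch evaluates a univariate two-component mixture (the Gaussian components and parameter values are arbitrary choices, not taken from the text):

```python
import numpy as np

def mixture_pdf(x, pis, component_pdfs):
    """Compounded density f(x) = sum_k pi_k f_k(x) of a finite mixture.
    `component_pdfs` is a list of callables, one density per component."""
    return sum(pi_k * f_k(x) for pi_k, f_k in zip(pis, component_pdfs))

def gaussian_pdf(mu, sigma):
    """Univariate Gaussian density, used here as an illustrative phi(.; theta_k)."""
    return lambda x: np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Two components: pi = (0.3, 0.7), N(-2, 1) and N(3, 1).
f = lambda x: mixture_pdf(x, [0.3, 0.7], [gaussian_pdf(-2.0, 1.0), gaussian_pdf(3.0, 1.0)])
```

Since the weights sum to one and each component is a density, the mixture integrates to one.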
7.1.2 Parameter Estimation: The EM Algorithm
For the estimation of the parameters of the mixture model, Pearson (1894) used the method of moments to estimate the five parameters (μ_1, μ_2, σ_1², σ_2², π) of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods, and Bayesian approaches.
The most widely used process for estimating the parameters is the maximization of the log-likelihood using the EM algorithm, which is typically used to maximize the likelihood of models with latent variables, for which no analytical solution is available (Dempster et al., 1977).
The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the expectation of the likelihood with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the expected likelihood from the E-step.
Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed: in practice, the obtained solution depends on the initialization of the algorithm.
Maximum Likelihood Definitions
The likelihood is commonly expressed in its logarithmic version:

    L(θ; X) = log ( ∏_{i=1}^{n} f(x_i; θ) ) = ∑_{i=1}^{n} log ( ∑_{k=1}^{K} π_k f_k(x_i; θ_k) ) ,    (7.1)

where n is the number of samples, K is the number of components of the mixture (or number of clusters), and π_k are the mixture proportions.
To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood, or
classification log-likelihood:

    L_C(θ; X, Y) = log ( ∏_{i=1}^{n} f(x_i, y_i; θ) )
                 = ∑_{i=1}^{n} log ( ∑_{k=1}^{K} y_ik π_k f_k(x_i; θ_k) )
                 = ∑_{i=1}^{n} ∑_{k=1}^{K} y_ik log ( π_k f_k(x_i; θ_k) ) .    (7.2)
The y_ik are the binary entries of the indicator matrix Y, with y_ik = 1 if observation i belongs to cluster k, and y_ik = 0 otherwise.
The soft membership t_ik(θ) is defined as

    t_ik(θ) = p(Y_ik = 1 | x_i; θ)    (7.3)
            = π_k f_k(x_i; θ_k) / f(x_i; θ) .    (7.4)
To lighten notations, t_ik(θ) will be denoted t_ik when the parameter θ is clear from context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:

    L_C(θ; X, Y) = ∑_{i,k} y_ik log ( π_k f_k(x_i; θ_k) )
                 = ∑_{i,k} y_ik log ( t_ik f(x_i; θ) )
                 = ∑_{i,k} y_ik log t_ik + ∑_{i,k} y_ik log f(x_i; θ)
                 = ∑_{i,k} y_ik log t_ik + ∑_{i=1}^{n} log f(x_i; θ)
                 = ∑_{i,k} y_ik log t_ik + L(θ; X) ,    (7.5)
where ∑_{i,k} y_ik log t_ik can be reformulated as

    ∑_{i,k} y_ik log t_ik = ∑_{i=1}^{n} ∑_{k=1}^{K} y_ik log ( p(Y_ik = 1 | x_i; θ) )
                          = ∑_{i=1}^{n} log ( p(Y_ik = 1 | x_i; θ) )
                          = log ( p(Y | X; θ) ) .
As a result, the relationship (7.5) can be rewritten as

    L(θ; X) = L_C(θ; Z) − log ( p(Y | X; θ) ) .    (7.6)
Likelihood Maximization
The complete log-likelihood cannot be evaluated, because the variables y_ik are unknown. However, taking expectations in (7.6) conditionally on a current value θ^(t) of the parameter gives

    L(θ; X) = E_{Y∼p(·|X;θ^(t))} [ L_C(θ; X, Y) ] + E_{Y∼p(·|X;θ^(t))} [ −log p(Y | X; θ) ]
            = Q(θ, θ^(t)) + H(θ, θ^(t)) .

In this expression, H(θ, θ^(t)) is an entropy term and Q(θ, θ^(t)) is the conditional expectation of the complete log-likelihood. Let us define the increment of the log-likelihood as ∆L = L(θ^(t+1); X) − L(θ^(t); X). Then, θ^(t+1) = arg max_θ Q(θ, θ^(t)) also increases the log-likelihood:

    ∆L = ( Q(θ^(t+1), θ^(t)) − Q(θ^(t), θ^(t)) ) + ( H(θ^(t+1), θ^(t)) − H(θ^(t), θ^(t)) ) ,

where the first difference is non-negative by definition of iteration t+1, and the second is non-negative by Jensen's inequality.
Therefore, it is possible to maximize the likelihood by iteratively optimizing Q(θ, θ^(t)). The relationship between Q(θ, θ′) and L(θ; X) is developed in deeper detail in Appendix F, to show how the value of L(θ; X) can be recovered from Q(θ, θ^(t)).
For the mixture model problem, Q(θ, θ′) is

    Q(θ, θ′) = E_{Y∼p(Y|X;θ′)} [ L_C(θ; X, Y) ]
             = ∑_{i,k} p(Y_ik = 1 | x_i; θ′) log ( π_k f_k(x_i; θ_k) )
             = ∑_{i=1}^{n} ∑_{k=1}^{K} t_ik(θ′) log ( π_k f_k(x_i; θ_k) ) .    (7.7)

Due to its similarity with the expression of the complete log-likelihood (7.2), Q(θ, θ′) is also known as the weighted likelihood. In (7.7), the weights t_ik(θ′) are the posterior probabilities of cluster memberships.
Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter θ^(0);

• E-step: evaluation of Q(θ, θ^(t)), using t_ik(θ^(t)) from (7.4) in (7.7);

• M-step: calculation of θ^(t+1) = arg max_θ Q(θ, θ^(t)).
Gaussian Model
In the particular case of a Gaussian mixture model with a common covariance matrix Σ and different mean vectors μ_k, the mixture density is

    f(x_i; θ) = ∑_{k=1}^{K} π_k f_k(x_i; θ_k)
              = ∑_{k=1}^{K} π_k (2π)^{−p/2} |Σ|^{−1/2} exp ( −(1/2) (x_i − μ_k)⊤ Σ^{−1} (x_i − μ_k) ) .
At the E-step, the posterior probabilities t_ik are computed as in (7.4) with the current parameter θ^(t); then the M-step maximizes Q(θ, θ^(t)) (7.7), whose form is as follows:

    Q(θ, θ^(t)) = ∑_{i,k} t_ik log π_k − ∑_{i,k} t_ik log ( (2π)^{p/2} |Σ|^{1/2} ) − (1/2) ∑_{i,k} t_ik (x_i − μ_k)⊤ Σ^{−1} (x_i − μ_k)
                = ∑_k t_k log π_k − (np/2) log(2π) − (n/2) log |Σ| − (1/2) ∑_{i,k} t_ik (x_i − μ_k)⊤ Σ^{−1} (x_i − μ_k)
                ≡ ∑_k t_k log π_k − (n/2) log |Σ| − ∑_{i,k} t_ik ( (1/2) (x_i − μ_k)⊤ Σ^{−1} (x_i − μ_k) ) ,    (7.8)

where the constant term −(np/2) log(2π) has been dropped in the last line,
and where

    t_k = ∑_{i=1}^{n} t_ik .    (7.9)
The M-step, which maximizes this expression with respect to θ, applies the following updates, defining θ^(t+1):

    π_k^(t+1) = t_k / n ,    (7.10)

    μ_k^(t+1) = ∑_i t_ik x_i / t_k ,    (7.11)

    Σ^(t+1) = (1/n) ∑_k W_k ,    (7.12)

    with W_k = ∑_i t_ik (x_i − μ_k)(x_i − μ_k)⊤ .    (7.13)
The derivations are detailed in Appendix G
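The whole procedure for the Gaussian model can be summarized in a short sketch implementing the E-step (7.4) and the M-step updates (7.10)-(7.13); this is a minimal transcription of the formulas above, not the Mix-GLOSS code:

```python
import numpy as np

def em_common_cov(X, K, n_iter=50, seed=0):
    """EM for a Gaussian mixture with common covariance matrix:
    E-step from (7.4), M-step updates (7.10)-(7.13)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    mus = X[rng.choice(n, size=K, replace=False)]  # initial means: random samples
    Sigma = np.cov(X.T) + 1e-6 * np.eye(p)
    pis = np.full(K, 1.0 / K)
    T = np.full((n, K), 1.0 / K)
    for _ in range(n_iter):
        # E-step: posterior probabilities t_ik; the terms common to all k
        # (-(p/2) log 2pi - (1/2) log|Sigma|) cancel in the normalization.
        Sinv = np.linalg.inv(Sigma)
        logd = np.empty((n, K))
        for k in range(K):
            d = X - mus[k]
            logd[:, k] = np.log(pis[k]) - 0.5 * np.einsum('ij,jk,ik->i', d, Sinv, d)
        T = np.exp(logd - logd.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)
        # M-step:
        tk = T.sum(axis=0)                          # t_k       (7.9)
        pis = tk / n                                # pi_k      (7.10)
        mus = (T.T @ X) / tk[:, None]               # mu_k      (7.11)
        Sigma = np.zeros((p, p))
        for k in range(K):
            d = X - mus[k]
            Sigma += (T[:, k, None] * d).T @ d      # W_k       (7.13)
        Sigma = Sigma / n + 1e-8 * np.eye(p)        # Sigma     (7.12), tiny ridge added
    return pis, mus, Sigma, T
```

The tiny ridge on Σ is a numerical safeguard (an addition of ours, not part of the derivation); as noted above, the result depends on the initialization, so several restarts are advisable in practice.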
7.2 Feature Selection in Model-Based Clustering
When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own
covariance matrix Σ_k, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.
In the high-dimensional, low-sample-size setting, numerical issues appear in the estimation of the covariance matrix. To avoid these singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989); Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition Σ_k = λ_k D_k A_k D_k⊤ (Banfield and Raftery, 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.
In this chapter, we review some techniques for inducing sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.
7.2.1 Based on Penalized Likelihood
Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:
    log ( p(Y_k = 1 | x) / p(Y_ℓ = 1 | x) ) = x⊤ Σ^{−1} (μ_k − μ_ℓ) − (1/2) (μ_k + μ_ℓ)⊤ Σ^{−1} (μ_k − μ_ℓ) + log (π_k / π_ℓ) .
In this model, a simple way of introducing sparsity in the discriminant vectors Σ^{−1}(μ_k − μ_ℓ) is to constrain Σ to be diagonal and to favor sparse means μ_k. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm:
    λ ∑_{k=1}^{K} ∑_{j=1}^{p} |μ_kj| ,
as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:
    λ_1 ∑_{k=1}^{K} ∑_{j=1}^{p} |μ_kj| + λ_2 ∑_{k=1}^{K} ∑_{j=1}^{p} ∑_{m=1}^{p} |(Σ_k^{−1})_jm| .
In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.
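Under a common identity covariance (a simplifying assumption), the L1 penalty on the means turns the M-step mean update (7.11) into a soft-thresholding operation, in the spirit of Pan and Shen (2007); a minimal sketch:

```python
import numpy as np

def soft_threshold(u, thr):
    """Elementwise soft-thresholding: sign(u) * max(|u| - thr, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - thr, 0.0)

def penalized_mean_update(X, T, lam):
    """M-step for the L1-penalized means, assuming a common identity
    covariance: the unpenalized update (7.11) is soft-thresholded with
    a threshold lam / t_k (a sketch, not the referenced implementation)."""
    tk = T.sum(axis=0)                      # t_k as in (7.9)
    mu_tilde = (T.T @ X) / tk[:, None]      # unpenalized means (7.11)
    return soft_threshold(mu_tilde, lam / tk[:, None])
```

With λ = 0 the update coincides with (7.11); large λ drives mean components exactly to zero, which is what produces the sparsity discussed above.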
Guo et al. (2010) propose a variation with a pairwise fusion penalty (PFP):
    λ ∑_{j=1}^{p} ∑_{1 ≤ k < k′ ≤ K} |μ_kj − μ_k′j| .
This PFP regularization does not shrink the means to zero, but towards each other. When the jth components of all cluster means are driven to the same value, that variable can be considered as non-informative.
An L1,∞ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:
    λ ∑_{j=1}^{p} ‖(μ_1j, μ_2j, …, μ_Kj)‖_∞ .
One group is defined for each variable j, as the set of the K means' jth components (μ_1j, …, μ_Kj). The L1,∞ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time; in addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG achieves genuine feature selection, because it forces null values for the same variable in all cluster means:
    λ √K ∑_{j=1}^{p} √( ∑_{k=1}^{K} μ_kj² ) .
The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code is available on the authors' website to test it.
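The penalties surveyed in this section can be compared on a toy K × p matrix of cluster means (the values below are purely illustrative); note that the L1,∞ and group penalties act on whole columns, which is what removes a feature:

```python
import numpy as np
from itertools import combinations

# mu is the K x p matrix of cluster means; column j collects the K means
# of variable j.  Illustrative values, not taken from any cited paper.
mu = np.array([[0.0,  1.0, 0.5],
               [0.0, -1.0, 0.5]])
K, p = mu.shape

lasso = np.abs(mu).sum()                                   # sum_k sum_j |mu_kj|
pfp = sum(np.abs(mu[k] - mu[kp]).sum()
          for k, kp in combinations(range(K), 2))          # pairwise fusion penalty
l1_inf = np.abs(mu).max(axis=0).sum()                      # sum_j ||mu_.j||_inf
group = np.sqrt(K) * np.sqrt((mu ** 2).sum(axis=0)).sum()  # VMG group-Lasso
```

Here the first column is jointly zero, so it contributes nothing to any of the penalties: the corresponding variable is non-informative for the class allocation.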
The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity of the discriminant vector; the generalization from quadratic to non-quadratic penalties is quickly outlined in their work. We extend this work by considering an arbitrary number of clusters, and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.
7.2.2 Based on Model Variants
The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy considering conditional independence: the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as
    f(x_i | φ, π, θ, ν) = ∑_{k=1}^{K} π_k ∏_{j=1}^{p} [f(x_ij | θ_jk)]^{φ_j} [h(x_ij | ν_j)]^{1−φ_j} ,
where f(· | θ_jk) is the distribution function for relevant features and h(· | ν_j) is the distribution function for the irrelevant ones. The binary vector φ = (φ_1, φ_2, …, φ_p) represents relevance, with φ_j = 1 if the jth feature is informative and φ_j = 0 otherwise. The saliency of variable j is then formalized as ρ_j = P(φ_j = 1), so that all the φ_j are treated as missing variables. Thus, the set of parameters is {π_k, θ_jk, ν_j, ρ_j}; their estimation is done by means of the EM algorithm (Law et al., 2004).
An original and recent technique is the Fisher-EM algorithm, proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix U ∈ ℝ^{p×(K−1)}, which is updated inside the EM loop with a new step, called the Fisher step (F-step from now on), which maximizes a multi-class Fisher criterion
    tr ( (U⊤ Σ_W U)^{−1} U⊤ Σ_B U ) ,    (7.14)
so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then, the F-step updates the projection matrix that projects the data onto the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as functions of the projection matrix U and of the model parameters in the latent space, so that the matrix U enters the M-step equations.
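The F-step objective (7.14) is straightforward to evaluate for a given projection; a minimal sketch, with Σ_W and Σ_B passed as precomputed within-class and between-class covariance matrices:

```python
import numpy as np

def fisher_criterion(U, Sw, Sb):
    """Multi-class Fisher criterion (7.14): tr((U' Sw U)^{-1} U' Sb U)."""
    M = np.linalg.solve(U.T @ Sw @ U, U.T @ Sb @ U)
    return np.trace(M)
```

Note that the criterion is invariant to a rescaling of the columns of U, since the scale factors cancel between the two quadratic forms; only the subspace spanned by U matters.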
To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one computes the best sparse orthogonal approximation Ũ of the matrix U that maximizes (7.14). This sparse approximation is defined as the solution of
    min_{Ũ ∈ ℝ^{p×(K−1)}} ‖X_U − X Ũ‖²_F + λ ∑_{k=1}^{K−1} ‖ũ_k‖_1 ,
where X_U = XU is the input data projected onto the non-sparse space, and ũ_k is the kth column vector of the projection matrix Ũ. The second possibility is inspired by Qiao et al. (2009), and reformulates the Fisher discriminant (7.14) used to compute the projection matrix as a regression criterion penalized by a mixture of Lasso and Elastic net penalties:
    min_{A,B ∈ ℝ^{p×(K−1)}} ∑_{k=1}^{K} ‖R_W^{−⊤} H_{B,k} − A B⊤ H_{B,k}‖²_2 + ρ ∑_{j=1}^{K−1} β_j⊤ Σ_W β_j + λ ∑_{j=1}^{K−1} ‖β_j‖_1

    s.t. A⊤A = I_{K−1} ,
where H_B ∈ ℝ^{p×K} is a matrix defined conditionally on the posterior probabilities t_ik, satisfying H_B H_B⊤ = Σ_B, and H_{B,k} is the kth column of H_B; R_W ∈ ℝ^{p×p} is an upper triangular matrix resulting from the Cholesky decomposition of Σ_W. Σ_W and Σ_B are the p × p within-class and between-class covariance matrices in the observation space. A ∈ ℝ^{p×(K−1)} and B ∈ ℝ^{p×(K−1)} are the solutions of the optimization problem, such that B = [β_1, …, β_{K−1}] is the best sparse approximation of U.
The last possibility computes the solution of the Fisher discriminant problem (7.14) as the solution of the following constrained optimization problem:
    min_{U ∈ ℝ^{p×(K−1)}} ∑_{j=1}^{p} ‖Σ_{B,j} − U U⊤ Σ_{B,j}‖²_2

    s.t. U⊤U = I_{K−1} ,
where Σ_{B,j} is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.
To comply with the constraint stating that the columns of U are orthogonal, the first and second options must be followed by a singular value decomposition of U. This is not necessary with the third option, since the penalized version of SVD already guarantees orthogonality.
However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed." Immediately after this paragraph, we can read that, under certain assumptions, their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices, the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized."
7.2.3 Based on Model Selection
Some clustering algorithms recast the feature selection problem as a model selection problem. Following this idea, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:

• X(1): the set of selected relevant variables;

• X(2): the set of variables being considered for inclusion into or exclusion from X(1);

• X(3): the set of non-relevant variables.
7 Feature Selection in Mixture Models
With those subsets, they define two different models, where Y is the partition to consider:

• M1:  f(X|Y) = f(X(1), X(2), X(3)|Y) = f(X(3)|X(2), X(1)) f(X(2)|X(1)) f(X(1)|Y)

• M2:  f(X|Y) = f(X(1), X(2), X(3)|Y) = f(X(3)|X(2), X(1)) f(X(2), X(1)|Y)
Model M1 means that the variables in X(2) are independent of the clustering Y; model M2 states that the variables in X(2) depend on the clustering Y. To simplify the algorithm, the subset X(2) is updated only one variable at a time. Deciding the relevance of the variable in X(2) therefore amounts to a model selection between M1 and M2. The selection is done via the Bayes factor
B12 = f(X|M1) / f(X|M2) ,

where the high-dimensional term f(X(3)|X(2), X(1)) cancels from the ratio:

B12 = f(X(1), X(2), X(3)|M1) / f(X(1), X(2), X(3)|M2) = [ f(X(2)|X(1), M1) f(X(1)|M1) ] / f(X(2), X(1)|M2) .
This factor is approximated, since the integrated likelihoods f(X(1)|M1) and f(X(2), X(1)|M2) are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. When X(2) contains a single variable, f(X(2)|X(1), M1) can be represented as a linear regression of the variable in X(2) on the variables in X(1), for which there is also a BIC approximation.
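The BIC approximation of this Bayes factor can be illustrated with a toy version of the M1/M2 comparison for a single candidate variable. This is a hedged sketch, not Raftery and Dean's implementation: `bic_linreg` plays the role of f(X(2)|X(1), M1) (a Gaussian linear regression), `bic_clustered` is a crude stand-in for the clustering-dependent side of M2 restricted to the candidate variable, and all names are ours.

```python
import numpy as np

def bic_linreg(y, X):
    """BIC of a Gaussian linear regression of y on X (with intercept):
    the candidate variable is explained by the already-selected
    variables, independently of the clustering (M1 side)."""
    n = len(y)
    Z = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    sigma2 = resid @ resid / n
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return -2 * loglik + (Z.shape[1] + 1) * np.log(n)   # coefficients + variance

def bic_clustered(y, labels):
    """BIC of a Gaussian model whose mean depends on the cluster: a rough
    stand-in for the M2 side, restricted to the candidate variable."""
    n = len(y)
    K = labels.max() + 1
    mu = np.array([y[labels == k].mean() for k in range(K)])
    resid = y - mu[labels]
    sigma2 = resid @ resid / n
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return -2 * loglik + (K + 1) * np.log(n)            # K means + variance

# Toy check: a variable whose mean shifts across clusters should favor M2.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
x1 = rng.normal(size=100)                               # already-selected variable
x2 = 2.0 * labels + 0.1 * x1 + rng.normal(scale=0.5, size=100)
# 2 log B12 is approximated by the BIC difference (lower BIC = better model)
prefers_m2 = bic_clustered(x2, labels) < bic_linreg(x2, x1)
```

Comparing the two BIC values approximates 2 log B12 and decides whether the candidate variable enters the relevant subset.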
Maugis et al. (2009a) proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets (X(1) and X(3)) remain the same, but X(2) is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. Their algorithm also uses a backward stepwise strategy instead of the forward stepwise search used by Raftery and Dean (2006), and it allows the definition of blocks of indivisible variables, which in certain situations improves the clustering and its interpretability.
Both algorithms are well motivated and appear to produce good results; however, the computation needed to test the different subsets of variables requires a huge amount of time. In practice, they cannot be used for the amount of data considered in this thesis.
8 Theoretical Foundations
In this chapter we develop Mix-GLOSS, which adapts the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.
We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to derive reduced-rank decision rules using fewer than K − 1 discriminant directions. Their motivation was mainly driven by stability issues: no sparsity-inducing mechanism was introduced in the construction of discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.
In the subsequent sections, we provide the principles that allow the M-step to be solved as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach to the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.
8.1 Resolving EM with Optimal Scoring
In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.
8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis
LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance

d(x_i, μ_k) = (x_i − μ_k)^⊤ Σ_W^{−1} (x_i − μ_k) ,

where μ_k are the p-dimensional centroids and Σ_W is the p × p common within-class covariance matrix.
The likelihood equations in the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by t_ik (the posterior probabilities computed at the E-step).
Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood l_weight given by

−2 l_weight(μ, Σ) = Σ_{i=1}^{n} Σ_{k=1}^{K} t_ik d(x_i, μ_k) + n log |Σ_W| ,

which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of penalized maximum likelihood in Gaussian mixtures.
8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis
The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1 in the supervised learning framework. It is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.
8.1.3 Clustering Using Penalized Optimal Scoring
The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to Fisher's discriminant directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_ik in the E-step, the distance between the samples x_i and the centroids μ_k must be evaluated. Depending on whether we are working in the input domain, the OS domain or the LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:

d(x_i, μ_k) = ‖(x_i − μ_k)^⊤ B_LDA‖²₂ − 2 log(π_k) .

This distance defines the computation of the posterior probabilities t_ik in the E-step (see Section 4.2.3). Putting together all those elements, the complete clustering algorithm can be summarized as follows:
1. Initialize the membership matrix Y (for example with the K-means algorithm).

2. Solve the p-OS problem as

B_OS = (X^⊤X + λΩ)^{−1} X^⊤YΘ ,

where Θ are the K − 1 leading eigenvectors of Y^⊤X (X^⊤X + λΩ)^{−1} X^⊤Y.

3. Map X to the LDA domain: X_LDA = X B_OS D, with D = diag(α_k^{−1}(1 − α_k²)^{−1/2}).

4. Compute the centroids M in the LDA domain.

5. Evaluate distances in the LDA domain.

6. Translate distances into posterior probabilities t_ik with

t_ik ∝ exp[ −(d(x_i, μ_k) − 2 log(π_k)) / 2 ] .   (8.1)

7. Update the labels using the posterior probabilities matrix: Y = T.

8. Go back to step 2 and iterate until the t_ik converge.

Items 2 to 5 can be interpreted as the M-step, and item 6 as the E-step, in this alternative view of the EM algorithm for Gaussian mixtures.
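Steps 1 to 8 can be sketched end-to-end in a few dozen lines. The sketch below makes two simplifying assumptions that are ours, not the thesis': a quadratic (ridge) penalty λΩ = λI stands in for the group-Lasso, and labels are re-assigned hard at each pass (a CEM-style simplification of the soft update Y = T); the normalization of Θ through (Y^⊤Y)^{−1/2}, so that the eigenvalues can be read as α_k², is also our assumption.

```python
import numpy as np

def one_hot(labels, K):
    Y = np.zeros((len(labels), K))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

def kmeans(X, K, rng, iters=20):
    """Lloyd's algorithm with farthest-point initialization (step 1)."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        d = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.stack(centers)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        centers = np.stack([X[labels == k].mean(0) if (labels == k).any()
                            else centers[k] for k in range(K)])
    return labels

def em_penalized_os(X, K, lam=1.0, iters=50, seed=0):
    """EM-by-penalized-OS loop (steps 1-8), ridge penalty, hard labels."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    X = X - X.mean(0)                        # centered inputs
    labels = kmeans(X, K, rng)               # step 1: initial partition
    A = X.T @ X + lam * np.eye(p)
    for _ in range(iters):
        Y = one_hot(labels, K)
        counts = Y.sum(0)
        G = X.T @ Y
        AinvG = np.linalg.solve(A, G)
        # step 2: Theta = leading eigenvectors of Y'X (X'X + lam I)^{-1} X'Y,
        # symmetrized with (Y'Y)^{-1/2} so that eigenvalues play alpha_k^2
        P = 1.0 / np.sqrt(counts)
        Ms = (G.T @ AinvG) * np.outer(P, P)
        w, V = np.linalg.eigh(Ms)
        idx = np.argsort(w)[::-1][: K - 1]
        alpha2 = np.clip(w[idx], 1e-8, 1 - 1e-8)
        Theta = np.sqrt(n) * V[:, idx] * P[:, None]
        B = AinvG @ Theta                    # p x (K-1) coefficient matrix
        # step 3: map to the LDA domain
        Xl = X @ B / np.sqrt(alpha2 * (1.0 - alpha2))
        # steps 4-6: centroids, distances, posteriors (hardened)
        mu = np.stack([Xl[labels == k].mean(0) for k in range(K)])
        d2 = ((Xl[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        new = (np.log(counts / n)[None, :] - 0.5 * d2).argmax(1)
        if np.array_equal(new, labels):      # steps 7-8
            break
        labels = new
    return labels

# Toy problem: two clusters separated along the first coordinate.
rng = np.random.default_rng(42)
true = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 5))
X[:, 0] += np.where(true == 0, -3.0, 3.0)
pred = em_penalized_os(X, K=2)
acc = max((pred == true).mean(), (pred != true).mean())
```

With well-separated clusters the loop typically converges in a couple of passes; the actual Mix-GLOSS replaces the ridge solve by a GLOSS run and keeps the soft posteriors.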
8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
In the previous section, we sketched a clustering algorithm that replaces the M-step with a penalized OS regression. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.
8.2 Optimized Criterion
In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ′) (7.7) so as to maximize the likelihood L(θ) (see Section 7.1.2). Replacing the M-step by a penalized optimal scoring problem is possible, and the link between the penalized OS problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.
This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We believe, however, that the Bayesian interpretation provides some insight, and we detail it in what follows.
8.2.1 A Bayesian Derivation
This section sketches a Bayesian treatment of inference limited to our present needs, where penalties are interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).
The model proposed in this thesis considers a classical maximum likelihood estimation for the means, and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter. The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:
f(Σ|Λ0, ν0) = [ 2^{ν0 p/2} |Λ0|^{ν0/2} Γ_p(ν0/2) ]^{−1} |Σ^{−1}|^{(ν0−p−1)/2} exp( −(1/2) tr(Λ0^{−1} Σ^{−1}) ) ,

where ν0 is the number of degrees of freedom of the distribution, Λ0 is a p × p scale matrix, and Γ_p is the multivariate gamma function, defined as

Γ_p(ν0/2) = π^{p(p−1)/4} Π_{j=1}^{p} Γ( ν0/2 + (1 − j)/2 ) .
The posterior distribution can be maximized, similarly to the likelihood, through the maximization of

Q(θ, θ′) + log f(Σ|Λ0, ν0)
  = Σ_{k=1}^{K} t_k log π_k − ((n+1)p/2) log 2 − (n/2) log |Λ0| − (p(p+1)/4) log π
    − Σ_{j=1}^{p} log Γ( n/2 + (1−j)/2 ) − ((ν_n − p − 1)/2) log |Σ| − (1/2) tr(Λ_n^{−1} Σ^{−1})
  ≡ Σ_{k=1}^{K} t_k log π_k − (n/2) log |Λ0| − ((ν_n − p − 1)/2) log |Σ| − (1/2) tr(Λ_n^{−1} Σ^{−1}) ,   (8.2)

with

t_k = Σ_{i=1}^{n} t_ik ,   ν_n = ν0 + n ,   Λ_n^{−1} = Λ0^{−1} + S0 ,   S0 = Σ_{i=1}^{n} Σ_{k=1}^{K} t_ik (x_i − μ_k)(x_i − μ_k)^⊤ .
Details of these calculations can be found in textbooks (for example, Bishop, 2006; Gelman et al., 2003).
8.2.2 Maximum a Posteriori Estimator
The maximization of (8.2) with respect to μ_k and π_k is of course not affected by the additional prior term, in which only the covariance Σ intervenes. The MAP estimator for Σ is simply obtained by differentiating (8.2) with respect to Σ. The details of the calculations follow the same lines as the ones for maximum likelihood, detailed in Appendix G. The resulting estimator for Σ is

Σ_MAP = (1 / (ν0 + n − p − 1)) (Λ0^{−1} + S0) ,   (8.3)

where S0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a) if ν0 is chosen to be p + 1, setting Λ0^{−1} = λΩ, where Ω is the penalty matrix from the group-Lasso regularization (4.25).
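Equation (8.3) transcribes directly into code. The sketch below (function and variable names are ours) computes Σ_MAP from soft responsibilities; note that with ν0 = p + 1 the denominator reduces to n, which is what makes the identification with the penalized within-class covariance of the p-OS regression possible.

```python
import numpy as np

def sigma_map(X, T, mu, Lambda0_inv, nu0):
    """MAP estimator of the common covariance, Eq. (8.3):
    Sigma = (Lambda0^{-1} + S0) / (nu0 + n - p - 1), with
    S0 = sum_i sum_k t_ik (x_i - mu_k)(x_i - mu_k)^T."""
    n, p = X.shape
    S0 = np.zeros((p, p))
    for k in range(T.shape[1]):
        R = X - mu[k]                      # residuals w.r.t. centroid k
        S0 += (R * T[:, [k]]).T @ R        # responsibility-weighted scatter
    return (Lambda0_inv + S0) / (nu0 + n - p - 1)

# Toy data with soft responsibilities; nu0 = p + 1 and Lambda0^{-1} = lam * Omega
rng = np.random.default_rng(1)
n, p, K, lam = 200, 3, 2, 0.5
X = rng.normal(size=(n, p))
T = rng.dirichlet(np.ones(K), size=n)
mu = np.stack([(T[:, [k]] * X).sum(0) / T[:, k].sum() for k in range(K)])
Sigma = sigma_map(X, T, mu, lam * np.eye(p), nu0=p + 1)
```

With Ω = I this reproduces a ridge-like shrinkage of the within-class covariance toward (λ/n) I.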
9 Mix-GLOSS Algorithm
Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter, I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.
9.1 Mix-GLOSS
The implementation of Mix-GLOSS involves three nested loops, as sketched in Figure 9.1. The innermost one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_ik.
When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.
The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.
9.1.1 Outer Loop: Whole Algorithm Repetitions
This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n × p feature matrix X;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);

• the number of clusters K;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;
Figure 9.1: Mix-GLOSS loops scheme.
• a p × (K − 1) initial coefficient matrix (optional);

• an n × K initial posterior probability matrix (optional).

For each repetition of the algorithm, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.
9.1.2 Penalty Parameter Loop
The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, so that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have verified that this warm-start reduces the computation time by a factor of 8, with respect to using a null B matrix and a K-means execution for the initial Y label matrix.
Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the estimated percentage of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.
Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternative variational approach from Appendix D is used, Equation (4.32b) has to be replaced by (D.10b).
Algorithm 2: Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize: B ← 0, Y ← K-means(X, K)
{Run non-penalized Mix-GLOSS}
λ ← 0
(B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
lastLAMBDA ← false
repeat
    {Estimate λ}
    Compute the gradient at β_j = 0:
        ∂J(B)/∂β_j |_{β_j=0} = x_j^⊤ ( Σ_{m≠j} x_m β_m − YΘ )
    Compute λ_max for every feature using (4.32b):
        λ_j^max = (1/w_j) ‖ ∂J(B)/∂β_j |_{β_j=0} ‖₂
    Choose λ so as to remove 10% of the relevant features
    {Run penalized Mix-GLOSS}
    (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    if the number of relevant variables in B > minVAR then
        lastLAMBDA ← false
    else
        lastLAMBDA ← true
    end if
until lastLAMBDA
Output: B, L(θ), t_ik, π_k, μ_k, Σ, Y for every λ in the solution path
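The λ-estimation step of Algorithm 2 can be sketched as follows. With all coefficients at zero, the gradient block for feature j reduces to −x_j^⊤(YΘ), and feature j stays at zero as long as λ w_j exceeds its Euclidean norm; picking λ at a low percentile of these per-feature thresholds removes roughly that percentage of features. Function names are ours, and this is a generic group-Lasso entry criterion standing in for (4.32b).

```python
import numpy as np

def lambda_max_per_feature(X, YTheta, w):
    """Per-feature entry thresholds for a group-Lasso penalized OS
    regression: lambda_j^max = ||x_j^T (Y Theta)||_2 / w_j, evaluated
    with all coefficients at zero (so the fitted part vanishes)."""
    G = X.T @ YTheta                        # p x (K-1): one gradient row per feature
    return np.linalg.norm(G, axis=1) / w

# Choosing lambda as the 10th percentile of the thresholds removes the
# ~10% of currently active features with the smallest entry thresholds.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
YTheta = rng.normal(size=(50, 2))
lam_max = lambda_max_per_feature(X, YTheta, w=np.ones(20))
lam = np.percentile(lam_max, 10)
kept = (lam_max > lam).sum()               # features that would survive
```

The same routine, evaluated once per outer pass, yields the full automatically-selected λ path.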
9.1.3 Inner Loop: EM Algorithm
The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence in the posterior probabilities t_ik is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.
Algorithm 3: Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
if (B0, Y0) available then
    B_OS ← B0, Y ← Y0
else
    B_OS ← 0, Y ← K-means(X, K)
end if
convergenceEM ← false, tolEM ← 1e-3
repeat
    {M-step}
    (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
    X_LDA = X B_OS diag( α^{−1}(1 − α²)^{−1/2} )
    π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
    {E-step}
    t_ik as per (8.1)
    L(θ) as per (8.2)
    if (1/n) Σ_i |t_ik − y_ik| < tolEM then
        convergenceEM ← true
    end if
    Y ← T
until convergenceEM
Y ← MAP(T)
Output: B_OS, Θ, L(θ), t_ik, π_k, μ_k, Σ, Y
M-Step
The M-step deals with the estimation of the model parameters, that is, the cluster means μ_k, the common covariance matrix Σ, and the prior of every component π_k. In a classical M-step, this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by a penalized optimal scoring regression (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version of the label matrix, YΘ. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.
E-Step
The E-step evaluates the posterior probability matrix T using

t_ik ∝ exp[ −(d(x_i, μ_k) − 2 log(π_k)) / 2 ] .
The convergence of those t_ik is used as the stopping criterion for EM.
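The E-step update can be sketched as a log-domain softmax over the distances; here we read d as the squared Euclidean distance in the LDA domain, so that the class prior enters exactly once (a simplifying reading of (8.1)), and the function name is ours.

```python
import numpy as np

def posteriors(D2, pi):
    """E-step: squared distances in the LDA domain -> posterior
    probabilities, t_ik proportional to pi_k * exp(-d(x_i, mu_k)/2).
    Computed via a log-domain softmax for numerical stability."""
    logit = np.log(pi)[None, :] - 0.5 * D2   # n x K unnormalized log-posteriors
    logit -= logit.max(axis=1, keepdims=True)
    T = np.exp(logit)
    return T / T.sum(axis=1, keepdims=True)

D2 = np.array([[0.1, 9.0], [16.0, 0.4]])     # two well-separated samples
T = posteriors(D2, pi=np.array([0.5, 0.5]))
```

Subtracting the row maximum before exponentiating prevents underflow when the distances are large, which is the usual failure mode of a naive implementation.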
9.2 Model Selection
Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.
In a first attempt, we tried a classical structure where clustering was performed several times, from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm took a lot of computing resources, since the stability selection mechanism required a certain number of repetitions that transformed Mix-GLOSS into a lengthy four-nested-loop structure.
In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978), but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, also required a large computation time.
The third and definitive attempt (up to now) proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0); the execution with the best log-likelihood is chosen. The repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism. This time, only one repetition of the algorithm is done for every value of the penalty parameter. This version has been tested
Figure 9.2: Mix-GLOSS model selection diagram.
with no significant differences in the quality of the clustering, but dramatically reducing the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.
10 Experimental Results
The performance of Mix-GLOSS is measured here on the artificial dataset that was used in Chapter 6.
This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for simulation 2, where they are slightly correlated. In simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be, respectively, 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.
In our tests, we reduced the volume of the problem, because with the original size of 1200 samples and 500 dimensions some of the algorithms to be tested took several days (even weeks) to finish. Hence, the definitive database was chosen to approximately maintain the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1 has been adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments, and allows a better understanding of the different simulations.
The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.
10.1 Tested Clustering Algorithms
This section compares Mix-GLOSS with the following methods from the state of the art:
• CS general cov: a model-based clustering with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.
• Fisher EM: this method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "FisherEM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network.
Figure 10.1: Class mean vectors for each artificial simulation.
• SelvarClust / Clustvarsel: implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a piece of software implemented in C++ that makes use of the clustering library mixmod (Biernacki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.
After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.

Its substitute has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network.
• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. In Kuan et al. (2010), some slight changes are introduced in the penalty term, such as weighting parameters that are particularly important for their dataset. The LumiWCluster package allows clustering using the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).
• Mix-GLOSS: the clustering algorithm implemented using GLOSS (see Chapter 9). It makes use of an EM algorithm, and of the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach to the group-Lasso penalty (see Section 8.1.4), which induces zeros in all discriminant directions for the same variable.
10.2 Results
Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The criteria used to measure the performance are:
• Clustering error (in percentage). To measure the quality of the partition with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Scholkopf (2007). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows the ideal 0% clustering error to be obtained even if the IDs of the clusters and of the real classes differ.
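A label-permutation-invariant clustering error of this kind can be sketched as a brute-force search over cluster-to-class mappings (fine for small K; the Hungarian algorithm scales better). This is a generic implementation in the spirit of the measure of Wu and Scholkopf (2007), not their exact code, and the function name is ours.

```python
import numpy as np
from itertools import permutations

def clustering_error(y_true, y_pred, K):
    """Minimum misclassification rate over all mappings of cluster IDs
    to class IDs, so relabeled but identical partitions score 0."""
    n = len(y_true)
    best = n
    for perm in permutations(range(K)):
        mapped = np.array([perm[c] for c in y_pred])
        best = min(best, int((mapped != y_true).sum()))
    return best / n

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([2, 2, 0, 0, 1, 1])        # same partition, relabeled
err = clustering_error(y_true, y_pred, K=3)  # -> 0.0
```

The K! loop is exact; for larger K one would replace it with a minimum-cost assignment on the confusion matrix.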
• Number of discarded features. This value shows the number of variables whose coefficients have been zeroed, and which are therefore not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.
• Time of execution (in hours, minutes or seconds). Finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. These algorithms tend to be more memory- and CPU-consuming as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.
The adequacy of the selected features was assessed by the true positive rate (TPR) and the false positive rate (FPR). The TPR is the proportion of relevant variables that are selected; similarly, the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded, due to their high computing time and clustering error respectively, and since the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.
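With these definitions, TPR and FPR reduce to two set ratios; a minimal sketch (function name ours) for the setting of our simulations, where the first 20 of p = 100 features are the relevant ones:

```python
import numpy as np

def tpr_fpr(selected, relevant, p):
    """TPR = fraction of relevant features that were selected;
    FPR = fraction of irrelevant features that were selected."""
    selected, relevant = set(selected), set(relevant)
    irrelevant = set(range(p)) - relevant
    tpr = len(selected & relevant) / len(relevant)
    fpr = len(selected & irrelevant) / len(irrelevant)
    return tpr, fpr

# e.g. an algorithm keeping 18 of the 20 relevant features and no others
tpr, fpr = tpr_fpr(selected=range(18), relevant=range(20), p=100)  # -> (0.9, 0.0)
```

Averaging these two ratios over the 25 repetitions gives the entries of Table 10.2.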
Results, in percentages, are displayed in Figure 10.2 (and in Table 10.2).
Table 10.1: Experimental results for simulated data.

Sim 1: K = 4, mean shift, ind. features
                      Err (%)       Var          Time
CS general cov        46 (15)       985 (72)     884h
Fisher EM             58 (87)       784 (52)     1645m
Clustvarsel           602 (107)     378 (291)    383h
LumiWCluster-Kuan     42 (68)       779 (4)      389s
LumiWCluster-Wang     43 (69)       784 (39)     619s
Mix-GLOSS             32 (16)       80 (09)      15h

Sim 2: K = 2, mean shift, dependent features
CS general cov        154 (2)       997 (09)     783h
Fisher EM             74 (23)       809 (28)     8m
Clustvarsel           73 (2)        334 (207)    166h
LumiWCluster-Kuan     64 (18)       798 (04)     155s
LumiWCluster-Wang     63 (17)       799 (03)     14s
Mix-GLOSS             77 (2)        841 (34)     2h

Sim 3: K = 4, 1D mean shift, ind. features
CS general cov        304 (57)      55 (468)     1317h
Fisher EM             233 (65)      366 (55)     22m
Clustvarsel           658 (115)     232 (291)    542h
LumiWCluster-Kuan     323 (21)      80 (02)      83s
LumiWCluster-Wang     308 (36)      80 (02)      1292s
Mix-GLOSS             347 (92)      81 (88)      21h

Sim 4: K = 4, mean shift, ind. features
CS general cov        626 (55)      999 (02)     112h
Fisher EM             567 (104)     55 (48)      195m
Clustvarsel           732 (4)       24 (12)      767h
LumiWCluster-Kuan     692 (112)     99 (2)       876s
LumiWCluster-Wang     697 (119)     991 (21)     825s
Mix-GLOSS             669 (91)      975 (12)     11h
Table 10.2: TPR versus FPR (in %), averages computed over 25 repetitions for the best performing algorithms.

             Simulation 1    Simulation 2    Simulation 3    Simulation 4
             TPR    FPR      TPR    FPR      TPR    FPR      TPR    FPR
Mix-GLOSS    992    015      828    335      884    67       780    12
Lumi-Kuan    992    28       1000   02       1000   005      50     005
Fisher-EM    986    24       888    17       838    5825     620    4075
Figure 10.2: TPR versus FPR (in %) for the best performing algorithms and simulations.
10.3 Discussion
After reviewing Tables 10.1 and 10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. Depending on the objectives and constraints of the problem, the following observations deserve to be highlighted.
LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior regarding the other performance measures. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation time is worth mentioning here.
The quality of the partition varies depending on the simulation and the algorithm. Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Wang and Zhu, 2008) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.
From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing the irrelevant variables in all situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best, or close to the best solution, in terms of fall-out and recall.
Conclusions
Summary
The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.
In this thesis, we have demonstrated the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.
The equivalence between LDA and OS problems allows all the resources available for solving regression problems to be brought to bear on linear discrimination. In their penalized versions, this equivalence holds under certain conditions, which have not always been obeyed when OS has been used to solve LDA problems.
In Part II, we used a variational approach to the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory was verified with the implementation of our Group-Lasso Optimal Scoring Solver (GLOSS) algorithm, which proved its effectiveness, inducing extremely parsimonious models without renouncing any predictive capability. GLOSS was tested on four artificial and three real datasets, outperforming other state-of-the-art algorithms in almost all situations.
In Part III, this theory was adapted, by means of an EM algorithm, to the unsupervised domain. As in the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion to maximize at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection of the penalty parameter. In this case too, the theory was put into practice with the implementation of Mix-GLOSS. For now, due to time constraints, only artificial datasets have been tested, with positive results.
Perspectives
Even if the preliminary results are encouraging, Mix-GLOSS has not been sufficiently tested. We plan to test it at least on the same real datasets that we used with GLOSS; however, more testing would be advisable in both cases. These algorithms are well suited to genomic data, where the number of samples is smaller than the number of variables, but other high-dimension low-sample-size (HDLSS) domains are also possible: identification of male or female silhouettes, fungal species, or fish species based on shape and texture (Clemmensen et al., 2011), or the Stirling faces (Roth and Lange, 2004), are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), as well as Fisher's well-known Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a), have also been tested in the literature.
At the programming level, both codes must be revisited to improve their robustness and optimize their computations, because during the prototyping phase the priority was achieving functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version of both GLOSS and Mix-GLOSS should be released in the short term.
The theory developed in this thesis, and the programming structure used for its implementation, allow easy alterations of the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements of the covariance matrix except its diagonal; spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as a Laplacian describing the relationships between variables; this can be used to implement pairwise penalties when the dataset is formed of pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-like penalties. Some of those possibilities, such as the diagonal version of GLOSS, have been partially implemented, but they have not been properly tested or updated with the latest algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the publication deadline of this thesis.
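As a rough illustration of the covariance restrictions mentioned above, the helper below builds the diagonal and spherical variants from a full within-class covariance estimate. This is a sketch assuming numpy; the function and model names are ours, not part of the GLOSS code.

```python
import numpy as np

def restrict_covariance(sigma_w, model="full"):
    """Restrict a within-class covariance estimate to a simpler family.

    model = "full" keeps sigma_w unchanged, "diagonal" keeps only its
    diagonal, and "spherical" replaces it by (tr(sigma_w) / p) * I.
    """
    p = sigma_w.shape[0]
    if model == "full":
        return sigma_w
    if model == "diagonal":
        return np.diag(np.diag(sigma_w))
    if model == "spherical":
        return (np.trace(sigma_w) / p) * np.eye(p)
    raise ValueError("unknown model: %s" % model)
```

All three variants preserve the trace of the estimate, so the overall scale of the within-class scatter is kept.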
From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized by Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to time constraints. Not knowing this criterion does not prevent successful simulations of Mix-GLOSS: other mechanisms, which do not involve computing the true criterion, have been used to stop the EM algorithm and to perform model selection. However, further investigation in this direction is needed to assess the convergence properties of the algorithm.
At the beginning of this thesis, even though the work finally took the direction of feature selection, a considerable effort was devoted to outlier detection and block clustering. One of the most successful mechanisms for detecting outliers models the population with a mixture model in which the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather all the points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.
Appendix
A Matrix Properties
Property 1. By definition, $\Sigma_W$ and $\Sigma_B$ are both symmetric matrices:
$$\Sigma_W = \frac{1}{n}\sum_{k=1}^{g}\sum_{i\in C_k}(x_i-\mu_k)(x_i-\mu_k)^\top, \qquad
\Sigma_B = \frac{1}{n}\sum_{k=1}^{g} n_k(\mu_k-\bar{x})(\mu_k-\bar{x})^\top.$$
Property 2. $\dfrac{\partial\, x^\top a}{\partial x} = \dfrac{\partial\, a^\top x}{\partial x} = a$

Property 3. $\dfrac{\partial\, x^\top A x}{\partial x} = (A + A^\top)x$

Property 4. $\dfrac{\partial |X^{-1}|}{\partial X} = -|X^{-1}|(X^{-1})^\top$

Property 5. $\dfrac{\partial\, a^\top X b}{\partial X} = ab^\top$
Property 6. $\dfrac{\partial}{\partial X}\,\mathrm{tr}\!\left(AX^{-1}B\right) = -(X^{-1}BAX^{-1})^\top = -X^{-\top}A^\top B^\top X^{-\top}$
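Identities such as Property 6 are easy to check numerically with central finite differences; a small sketch (numpy assumed, dimensions arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
A = rng.normal(size=(p, p))
B = rng.normal(size=(p, p))
X = rng.normal(size=(p, p)) + p * np.eye(p)   # well-conditioned, invertible

def f(X):
    # f(X) = tr(A X^{-1} B)
    return np.trace(A @ np.linalg.inv(X) @ B)

# Analytic gradient from Property 6: -(X^{-1} B A X^{-1})^T
Xinv = np.linalg.inv(X)
grad_analytic = -(Xinv @ B @ A @ Xinv).T

# Central finite differences, one entry of X at a time
eps = 1e-6
grad_fd = np.zeros((p, p))
for i in range(p):
    for j in range(p):
        E = np.zeros((p, p))
        E[i, j] = eps
        grad_fd[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
```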
B The Penalized-OS Problem is anEigenvector Problem
In this appendix, we answer the question of why the solution of a penalized optimal scoring regression involves an eigenvector decomposition. The p-OS problem has the form
$$\min_{\theta_k,\beta_k}\; \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k \tag{B.1}$$
$$\text{s.t. } \theta_k^\top Y^\top Y \theta_k = 1, \qquad \theta_\ell^\top Y^\top Y \theta_k = 0 \quad \forall \ell < k,$$
for $k = 1,\ldots,K-1$. The Lagrangian associated with Problem (B.1) is
$$L_k(\theta_k,\beta_k,\lambda_k,\nu_k) = \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top\Omega_k\beta_k + \lambda_k\big(\theta_k^\top Y^\top Y\theta_k - 1\big) + \sum_{\ell<k}\nu_\ell\,\theta_\ell^\top Y^\top Y\theta_k. \tag{B.2}$$
Setting the gradient of (B.2) with respect to $\beta_k$ to zero gives the optimal $\beta_k^*$:
$$\beta_k^* = (X^\top X + \Omega_k)^{-1}X^\top Y\theta_k. \tag{B.3}$$
The objective function of (B.1) evaluated at $\beta_k^*$ is
$$\begin{aligned}
\min_{\theta_k}\; \|Y\theta_k - X\beta_k^*\|_2^2 + \beta_k^{*\top}\Omega_k\beta_k^*
&= \min_{\theta_k}\; \theta_k^\top Y^\top\big(I - X(X^\top X+\Omega_k)^{-1}X^\top\big)Y\theta_k\\
&= \max_{\theta_k}\; \theta_k^\top Y^\top X(X^\top X+\Omega_k)^{-1}X^\top Y\theta_k. \tag{B.4}
\end{aligned}$$
If the penalty matrix is identical for all problems, $\Omega_k = \Omega$, then (B.4) corresponds to an eigen-problem where the score vectors $\theta_k$ are the eigenvectors of $Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$.
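Equations (B.3) and (B.4) can be sanity-checked numerically; the sketch below (numpy assumed, sizes arbitrary) verifies that the objective of (B.1) at $\beta_k^*$ equals the closed form of (B.4):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, K = 30, 5, 3
X = rng.normal(size=(n, p))
Y = np.eye(K)[rng.integers(K, size=n)]        # n x K indicator matrix
theta = rng.normal(size=K)                     # one score vector
Omega = 0.1 * np.eye(p)                        # quadratic penalty matrix

# Optimal coefficients from (B.3)
beta = np.linalg.solve(X.T @ X + Omega, X.T @ Y @ theta)

# Objective of (B.1) at the optimal beta
obj = np.sum((Y @ theta - X @ beta) ** 2) + beta @ Omega @ beta

# Closed form from (B.4): theta' Y' (I - X (X'X + Omega)^{-1} X') Y theta
H = X @ np.linalg.solve(X.T @ X + Omega, X.T)
obj_closed = theta @ Y.T @ (np.eye(n) - H) @ Y @ theta
```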
B.1 How to Solve the Eigenvector Decomposition
Making an eigen-decomposition of an expression like $Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$ is not trivial, due to the $p\times p$ inverse. With some datasets, $p$ can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.
Let $M$ be the matrix $Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$, so that we can rewrite expression (B.4) in a compact way:
$$\max_{\Theta\in\mathbb{R}^{K\times(K-1)}}\; \mathrm{tr}\big(\Theta^\top M\Theta\big) \qquad \text{s.t. } \Theta^\top Y^\top Y\Theta = I_{K-1}. \tag{B.5}$$
If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the $(K-1)\times(K-1)$ matrix $M_\Theta$ be $\Theta^\top M\Theta$. The classical eigenvector formulation associated with (B.5) is then
$$M_\Theta v = \lambda v, \tag{B.6}$$
where $v$ is an eigenvector and $\lambda$ the associated eigenvalue of $M_\Theta$. Operating,
$$v^\top M_\Theta v = \lambda \iff v^\top\Theta^\top M\Theta v = \lambda.$$
Making the change of variable $w = \Theta v$, we obtain an alternative eigenproblem where the $w$ are eigenvectors of $M$ and $\lambda$ the associated eigenvalue:
$$w^\top M w = \lambda. \tag{B.7}$$
Therefore, the $v$ are eigenvectors in the eigen-decomposition of matrix $M_\Theta$, and the $w$ are eigenvectors in the eigen-decomposition of matrix $M$. Note that the only difference between the $(K-1)\times(K-1)$ matrix $M_\Theta$ and the $K\times K$ matrix $M$ is the $K\times(K-1)$ matrix $\Theta$ in the expression $M_\Theta = \Theta^\top M\Theta$. Then, to avoid the computation of the $p\times p$ inverse $(X^\top X+\Omega)^{-1}$, we can plug the optimal value of the coefficient matrix $B^* = (X^\top X+\Omega)^{-1}X^\top Y\Theta$ into $M_\Theta$:
$$M_\Theta = \Theta^\top Y^\top X(X^\top X+\Omega)^{-1}X^\top Y\Theta = \Theta^\top Y^\top X B^*.$$
Thus, the eigen-decomposition of the $(K-1)\times(K-1)$ matrix $M_\Theta = \Theta^\top Y^\top XB^*$ yields the $v$ eigenvectors of (B.6). To obtain the $w$ eigenvectors of the alternative formulation (B.7), the change of variable $w = \Theta v$ needs to be undone.
To summarize, we compute the $v$ eigenvectors from the eigen-decomposition of the tractable matrix $M_\Theta$, evaluated as $\Theta^\top Y^\top XB^*$. The definitive eigenvectors $w$ are then recovered as $w = \Theta v$, and the final step is the reconstruction of the optimal score matrix $\Theta^*$ using the vectors $w$ as its columns. At this point we understand what the literature calls "updating the initial score matrix": multiplying the initial $\Theta$ by the eigenvector matrix $V$ from decomposition (B.6) reverses the change of variable to restore the $w$ vectors. The matrix $B^*$ also needs to be "updated", by multiplying it by the same eigenvector matrix $V$, in order to account for the initial $\Theta$ used in the first computation of $B^*$:
$$B^* \leftarrow (X^\top X+\Omega)^{-1}X^\top Y\Theta V = B^*V.$$
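The shortcut above can be checked on a small problem where forming $M$ explicitly is still tractable: $M_\Theta$ computed as $\Theta^\top Y^\top XB^*$ matches $\Theta^\top M\Theta$. A sketch, numpy assumed:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, K = 40, 6, 4
X = rng.normal(size=(n, p))
Y = np.eye(K)[rng.integers(K, size=n)]
Theta = rng.normal(size=(K, K - 1))            # some initial score matrix
Omega = 0.5 * np.eye(p)
S = X.T @ X + Omega

# Direct route: form the K x K matrix M, then project it
M = Y.T @ X @ np.linalg.solve(S, X.T @ Y)
M_theta_direct = Theta.T @ M @ Theta

# Shortcut: reuse the coefficient matrix B* = S^{-1} X' Y Theta
B_star = np.linalg.solve(S, X.T @ Y @ Theta)
M_theta_shortcut = Theta.T @ Y.T @ X @ B_star
```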
B.2 Why the OS Problem is Solved as an Eigenvector Problem
In the optimal scoring literature, the score matrix $\Theta^*$ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix $M = Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$.
By definition of the eigen-decomposition, the eigenvectors of $M$ (called $w$ in (B.7)) form a basis, so that any score vector $\theta$ can be expressed as a linear combination of them:
$$\theta_k = \sum_{m=1}^{K-1}\alpha_m w_m \qquad \text{s.t. } \theta_k^\top\theta_k = 1. \tag{B.8}$$
The score vectors' normalization constraint $\theta_k^\top\theta_k = 1$ can also be expressed as a function of this basis:
$$\left(\sum_{m=1}^{K-1}\alpha_m w_m\right)^{\!\top}\left(\sum_{m=1}^{K-1}\alpha_m w_m\right) = 1,$$
which, as per the eigenvector properties, can be reduced to
$$\sum_{m=1}^{K-1}\alpha_m^2 = 1. \tag{B.9}$$
Let $M$ be multiplied by a score vector $\theta_k$, which can be replaced by its linear combination of eigenvectors $w_m$ (B.8):
$$M\theta_k = M\sum_{m=1}^{K-1}\alpha_m w_m = \sum_{m=1}^{K-1}\alpha_m M w_m.$$
As the $w_m$ are eigenvectors of $M$, the relationship $Mw_m = \lambda_m w_m$ can be used to obtain
$$M\theta_k = \sum_{m=1}^{K-1}\alpha_m\lambda_m w_m.$$
Multiplying on the left by $\theta_k^\top$, expressed as its linear combination of eigenvectors:
$$\theta_k^\top M\theta_k = \left(\sum_{\ell=1}^{K-1}\alpha_\ell w_\ell\right)^{\!\top}\left(\sum_{m=1}^{K-1}\alpha_m\lambda_m w_m\right).$$
This equation can be simplified using the orthogonality property of the eigenvectors, according to which $w_\ell^\top w_m$ is zero for any $\ell\neq m$, giving
$$\theta_k^\top M\theta_k = \sum_{m=1}^{K-1}\alpha_m^2\lambda_m.$$
The optimization problem (B.5) for discriminant direction $k$ can be rewritten as
$$\max_{\theta_k\in\mathbb{R}^{K}}\; \theta_k^\top M\theta_k = \max_{\theta_k\in\mathbb{R}^{K}}\; \sum_{m=1}^{K-1}\alpha_m^2\lambda_m \tag{B.10}$$
$$\text{with } \theta_k = \sum_{m=1}^{K-1}\alpha_m w_m \quad\text{and}\quad \sum_{m=1}^{K-1}\alpha_m^2 = 1.$$
One way of maximizing Problem (B.10) is to choose $\alpha_m = 1$ for $m = k$ and $\alpha_m = 0$ otherwise. Hence, as $\theta_k = \sum_{m=1}^{K-1}\alpha_m w_m$, the resulting score vector $\theta_k$ is equal to the $k$th eigenvector $w_k$.

As a summary, it can be concluded that the solution to the original problem (B.1) can be achieved by an eigenvector decomposition of the matrix $M = Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$.
C Solving Fisher's Discriminant Problem
The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance, under a unitary constraint on the within-class variance:
$$\max_{\beta\in\mathbb{R}^p}\; \beta^\top\Sigma_B\beta \tag{C.1a}$$
$$\text{s.t. } \beta^\top\Sigma_W\beta = 1, \tag{C.1b}$$
where $\Sigma_B$ and $\Sigma_W$ are, respectively, the between-class and within-class variances of the original $p$-dimensional data.
The Lagrangian of Problem (C.1) is
$$L(\beta,\nu) = \beta^\top\Sigma_B\beta - \nu\big(\beta^\top\Sigma_W\beta - 1\big),$$
so that its first derivative with respect to $\beta$ is
$$\frac{\partial L(\beta,\nu)}{\partial\beta} = 2\Sigma_B\beta - 2\nu\Sigma_W\beta.$$
A necessary optimality condition for $\beta$ is that this derivative is zero, that is,
$$\Sigma_B\beta = \nu\Sigma_W\beta.$$
Provided $\Sigma_W$ is full rank, we have
$$\Sigma_W^{-1}\Sigma_B\beta = \nu\beta. \tag{C.2}$$
Thus, the solutions $\beta$ match the definition of an eigenvector of the matrix $\Sigma_W^{-1}\Sigma_B$, with eigenvalue $\nu$. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:
$$\begin{aligned}
\beta^\top\Sigma_B\beta &= \beta^\top\Sigma_W\Sigma_W^{-1}\Sigma_B\beta\\
&= \nu\,\beta^\top\Sigma_W\beta &&\text{from (C.2)}\\
&= \nu &&\text{from (C.1b)}.
\end{aligned}$$
That is, the optimal value of the objective function to be maximized is the eigenvalue $\nu$. Hence, $\nu$ is the largest eigenvalue of $\Sigma_W^{-1}\Sigma_B$, and $\beta$ is any eigenvector corresponding to this maximal eigenvalue.
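In practice, the pair $(\nu,\beta)$ can be obtained directly from a generalized symmetric eigensolver; a sketch on random Gaussian classes (numpy/scipy assumed, class layout arbitrary):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(3)
n_per, p, g = 50, 4, 3
means = rng.normal(scale=3.0, size=(g, p))
X = np.vstack([rng.normal(size=(n_per, p)) + m for m in means])
y = np.repeat(np.arange(g), n_per)
n = X.shape[0]

xbar = X.mean(axis=0)
Sigma_W = np.zeros((p, p))
Sigma_B = np.zeros((p, p))
for k in range(g):
    Xk = X[y == k]
    mu_k = Xk.mean(axis=0)
    Sigma_W += (Xk - mu_k).T @ (Xk - mu_k) / n
    Sigma_B += len(Xk) * np.outer(mu_k - xbar, mu_k - xbar) / n

# Generalized eigenproblem Sigma_B beta = nu Sigma_W beta; eigh returns
# eigenvalues in ascending order, normalized so beta' Sigma_W beta = 1
nus, betas = eigh(Sigma_B, Sigma_W)
nu, beta = nus[-1], betas[:, -1]               # leading eigenpair
```

The returned $\beta$ satisfies the constraint (C.1b) by construction, and $\beta^\top\Sigma_B\beta$ equals the top eigenvalue $\nu$.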
D Alternative Variational Formulation for the Group-Lasso
In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:
$$\min_{\tau\in\mathbb{R}^p}\;\min_{B\in\mathbb{R}^{p\times(K-1)}}\; J(B) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j} \tag{D.1a}$$
$$\text{s.t. } \sum_{j=1}^{p}\tau_j = 1, \tag{D.1b}$$
$$\qquad\; \tau_j \ge 0,\quad j = 1,\ldots,p. \tag{D.1c}$$
Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let $B\in\mathbb{R}^{p\times(K-1)}$ be a matrix composed of row vectors $\beta^j\in\mathbb{R}^{K-1}$: $B = \big(\beta^{1\top},\ldots,\beta^{p\top}\big)^\top$.
$$L(B,\tau,\lambda,\nu_0,\nu) = J(B) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j} + \nu_0\left(\sum_{j=1}^{p}\tau_j - 1\right) - \sum_{j=1}^{p}\nu_j\tau_j. \tag{D.2}$$
The starting point is the Lagrangian (D.2), which is differentiated with respect to $\tau_j$ to get the optimal value $\tau_j^*$:
$$\left.\frac{\partial L(B,\tau,\lambda,\nu_0,\nu)}{\partial\tau_j}\right|_{\tau_j=\tau_j^*} = 0
\;\Rightarrow\; -\lambda w_j^2\,\frac{\|\beta^j\|_2^2}{{\tau_j^*}^2} + \nu_0 - \nu_j = 0$$
$$\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0{\tau_j^*}^2 - \nu_j{\tau_j^*}^2 = 0$$
$$\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0{\tau_j^*}^2 = 0.$$
The last two expressions are related through the complementary slackness property of the Lagrange multipliers, which states that $\nu_j g_j(\tau^*) = 0$, where $\nu_j$ is the Lagrange multiplier and $g_j(\tau)$ the inequality constraint. The optimal $\tau_j^*$ can then be deduced:
$$\tau_j^* = \sqrt{\frac{\lambda}{\nu_0}}\,w_j\|\beta^j\|_2.$$
Placing this optimal value of $\tau_j^*$ into constraint (D.1b):
$$\sum_{j=1}^{p}\tau_j^* = 1 \;\Rightarrow\; \tau_j^* = \frac{w_j\|\beta^j\|_2}{\sum_{j'=1}^{p}w_{j'}\|\beta^{j'}\|_2}. \tag{D.3}$$
With this value of $\tau_j^*$, Problem (D.1) is equivalent to
$$\min_{B\in\mathbb{R}^{p\times(K-1)}}\; J(B) + \lambda\left(\sum_{j=1}^{p}w_j\|\beta^j\|_2\right)^{\!2}. \tag{D.4}$$
This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. The square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4). In particular, its solution is expected to be sparse, with some null vectors $\beta^j$.
The penalty term of (D.1a) can be conveniently presented as $\lambda B^\top\Omega B$, where
$$\Omega = \mathrm{diag}\left(\frac{w_1^2}{\tau_1},\frac{w_2^2}{\tau_2},\ldots,\frac{w_p^2}{\tau_p}\right). \tag{D.5}$$
Using the value of $\tau_j^*$ from (D.3), each diagonal component of $\Omega$ is
$$(\Omega)_{jj} = \frac{w_j\sum_{j'=1}^{p}w_{j'}\|\beta^{j'}\|_2}{\|\beta^j\|_2}. \tag{D.6}$$
In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.
D.1 Useful Properties
Lemma D.1. If $J$ is convex, Problem (D.1) is convex.
In what follows, $J$ will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.
Lemma D.2. For all $B\in\mathbb{R}^{p\times(K-1)}$, the subdifferential of the objective function of Problem (D.4) is
$$\left\{V\in\mathbb{R}^{p\times(K-1)}:\; V = \frac{\partial J(B)}{\partial B} + 2\lambda\left(\sum_{j=1}^{p}w_j\|\beta^j\|_2\right)G\right\}, \tag{D.7}$$
where $G = \big(g^{1\top},\ldots,g^{p\top}\big)^\top$ is a $p\times(K-1)$ matrix defined as follows. Let $\mathcal{S}(B)$ denote the row support of $B$, $\mathcal{S}(B) = \big\{j\in\{1,\ldots,p\}:\; \|\beta^j\|_2\neq 0\big\}$; then we have
$$\forall j\in\mathcal{S}(B),\quad g^j = w_j\|\beta^j\|_2^{-1}\beta^j, \tag{D.8}$$
$$\forall j\notin\mathcal{S}(B),\quad \|g^j\|_2 \le w_j. \tag{D.9}$$
This condition results in an equality for the "active" non-zero vectors $\beta^j$ and an inequality for the other ones, which both provide essential building blocks of our algorithm.
Lemma D.3. Problem (D.4) admits at least one solution, which is unique if $J(B)$ is strictly convex. All critical points $B^*$ of the objective function verifying the following conditions are global minima. Let $\mathcal{S}(B^*)$ denote the row support of $B^*$, $\mathcal{S}(B^*) = \big\{j\in\{1,\ldots,p\}:\; \|\beta^{*j}\|_2\neq 0\big\}$, and let $\bar{\mathcal{S}}(B^*)$ be its complement; then we have
$$\forall j\in\mathcal{S}(B^*),\quad -\frac{\partial J(B^*)}{\partial\beta^j} = 2\lambda\left(\sum_{j'=1}^{p}w_{j'}\|\beta^{*j'}\|_2\right)w_j\|\beta^{*j}\|_2^{-1}\beta^{*j}, \tag{D.10a}$$
$$\forall j\in\bar{\mathcal{S}}(B^*),\quad \left\|\frac{\partial J(B^*)}{\partial\beta^j}\right\|_2 \le 2\lambda w_j\left(\sum_{j'=1}^{p}w_{j'}\|\beta^{*j'}\|_2\right). \tag{D.10b}$$
In particular, Lemma D.3 provides a well-defined appraisal of the support of the solution, which is not easily obtained from a direct analysis of the variational problem (D.1).
D.2 An Upper Bound on the Objective Function
Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and, for a given $B$, the gap between these objectives is null at $\tau^*$ such that
$$\tau_j^* = \frac{w_j\|\beta^j\|_2}{\sum_{j'=1}^{p}w_{j'}\|\beta^{j'}\|_2}.$$
Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let $\tau\in\mathbb{R}^p$ be any feasible vector; we have
$$\left(\sum_{j=1}^{p}w_j\|\beta^j\|_2\right)^{\!2} = \left(\sum_{j=1}^{p}\tau_j^{1/2}\,\frac{w_j\|\beta^j\|_2}{\tau_j^{1/2}}\right)^{\!2}
\le \left(\sum_{j=1}^{p}\tau_j\right)\left(\sum_{j=1}^{p}w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j}\right)
\le \sum_{j=1}^{p}w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j},$$
where we used the Cauchy-Schwarz inequality in the second step and the definition of the feasibility set of $\tau$ in the last one.
This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of $\tau$ and $\beta$ are intertwined.
E Invariance of the Group-Lasso to Unitary Transformations
The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients $B^0$ are optimal for the score values $\Theta^0$, and if the optimal scores $\Theta^*$ are obtained by a unitary transformation of $\Theta^0$, say $\Theta^* = \Theta^0 V$ (where $V\in\mathbb{R}^{M\times M}$ is a unitary matrix), then $B^* = B^0V$ is optimal conditionally on $\Theta^*$; that is, $(\Theta^*,B^*)$ is a global solution of the corresponding optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.
Proposition E.1. Let $B^*$ be a solution of
$$\min_{B\in\mathbb{R}^{p\times M}}\; \|Y - XB\|_F^2 + \lambda\sum_{j=1}^{p}w_j\|\beta^j\|_2, \tag{E.1}$$
and let $\tilde{Y} = YV$, where $V\in\mathbb{R}^{M\times M}$ is a unitary matrix. Then $\tilde{B}^* = B^*V$ is a solution of
$$\min_{B\in\mathbb{R}^{p\times M}}\; \big\|\tilde{Y} - XB\big\|_F^2 + \lambda\sum_{j=1}^{p}w_j\|\beta^j\|_2. \tag{E.2}$$
Proof. The first-order necessary optimality conditions for $B^*$ are
$$\forall j\in\mathcal{S}(B^*),\quad 2x^{j\top}\big(x^j\beta^{*j} - Y\big) + \lambda w_j\|\beta^{*j}\|_2^{-1}\beta^{*j} = 0, \tag{E.3a}$$
$$\forall j\notin\mathcal{S}(B^*),\quad 2\big\|x^{j\top}\big(x^j\beta^{*j} - Y\big)\big\|_2 \le \lambda w_j, \tag{E.3b}$$
where $\mathcal{S}(B^*)\subseteq\{1,\ldots,p\}$ denotes the set of non-zero row vectors of $B^*$, and $\bar{\mathcal{S}}(B^*)$ is its complement.
First, we note that, from the definition of $\tilde{B}^*$, we have $\mathcal{S}(\tilde{B}^*) = \mathcal{S}(B^*)$. Then we may rewrite the above conditions as follows:
$$\forall j\in\mathcal{S}(\tilde{B}^*),\quad 2x^{j\top}\big(x^j\tilde{\beta}^{*j} - \tilde{Y}\big) + \lambda w_j\big\|\tilde{\beta}^{*j}\big\|_2^{-1}\tilde{\beta}^{*j} = 0, \tag{E.4a}$$
$$\forall j\notin\mathcal{S}(\tilde{B}^*),\quad 2\big\|x^{j\top}\big(x^j\tilde{\beta}^{*j} - \tilde{Y}\big)\big\|_2 \le \lambda w_j, \tag{E.4b}$$
where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by $V$, and also uses that $VV^\top = I$, so that $\forall u\in\mathbb{R}^M$, $\|u^\top\|_2 = \|u^\top V\|_2$. Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for $\tilde{B}^*$ to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
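The two ingredients of the proof, invariance of the Frobenius loss and of the row norms under a right unitary transformation, can be checked directly (a sketch, numpy assumed, with unit weights):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, M = 20, 5, 3
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, M))
B = rng.normal(size=(p, M))
V, _ = np.linalg.qr(rng.normal(size=(M, M)))   # random unitary matrix

def objective(Y, B, lam=0.3):
    # group-Lasso objective of (E.1) with unit weights
    w = np.ones(p)
    return (np.sum((Y - X @ B) ** 2)
            + lam * np.sum(w * np.linalg.norm(B, axis=1)))

# Right-multiplying both Y and B by V leaves the objective unchanged,
# which is why B*V is optimal for YV whenever B* is optimal for Y.
```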
F Expected Complete Likelihood and Likelihood
Section 7.1.2 explains that, by maximizing the conditional expectation of the complete log-likelihood $Q(\theta,\theta')$ (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed from its definition (7.1), but there is a shorter way to compute it from $Q(\theta,\theta')$ when the latter is available:
$$L(\theta) = \sum_{i=1}^{n}\log\left(\sum_{k=1}^{K}\pi_k f_k(x_i;\theta_k)\right) \tag{F.1}$$
$$Q(\theta,\theta') = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big) \tag{F.2}$$
$$\text{with } t_{ik}(\theta') = \frac{\pi'_k f_k(x_i;\theta'_k)}{\sum_{\ell}\pi'_\ell f_\ell(x_i;\theta'_\ell)}. \tag{F.3}$$
In the EM algorithm, $\theta'$ denotes the model parameters at the previous iteration, $t_{ik}(\theta')$ are the posterior probabilities computed from $\theta'$ at the previous E-step, and $\theta$, without the prime, denotes the parameters of the current iteration, obtained by maximizing $Q(\theta,\theta')$.
Using (F.3), we have
$$\begin{aligned}
Q(\theta,\theta') &= \sum_{i,k} t_{ik}(\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big)\\
&= \sum_{i,k} t_{ik}(\theta')\log\big(t_{ik}(\theta)\big) + \sum_{i,k} t_{ik}(\theta')\log\left(\sum_{\ell}\pi_\ell f_\ell(x_i;\theta_\ell)\right)\\
&= \sum_{i,k} t_{ik}(\theta')\log\big(t_{ik}(\theta)\big) + L(\theta).
\end{aligned}$$
In particular, after the evaluation of $t_{ik}$ in the E-step, where $\theta = \theta'$, the log-likelihood can be computed using the value of $Q(\theta,\theta)$ (7.7) and the entropy of the posterior probabilities:
$$L(\theta) = Q(\theta,\theta) - \sum_{i,k} t_{ik}(\theta)\log\big(t_{ik}(\theta)\big) = Q(\theta,\theta) + H(T).$$
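This identity is easy to verify numerically at $\theta = \theta'$ for a small Gaussian mixture (a sketch, numpy/scipy assumed, parameters arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(6)
n, p, K = 50, 2, 3
X = rng.normal(size=(n, p))
pi = np.array([0.2, 0.5, 0.3])
mus = rng.normal(size=(K, p))
Sigma = np.eye(p)

# Joint terms pi_k f_k(x_i), one column per component
dens = np.column_stack([
    pi[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigma)
    for k in range(K)
])
mix = dens.sum(axis=1)

loglik = np.sum(np.log(mix))                   # L(theta), direct (F.1)

t = dens / mix[:, None]                        # posteriors t_ik (F.3)
Q = np.sum(t * np.log(dens))                   # Q(theta, theta) (F.2)
H = -np.sum(t * np.log(t))                     # entropy of the posteriors
# L(theta) = Q(theta, theta) + H(T)
```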
G Derivation of the M-Step Equations
This appendix details the derivation of expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as
$$\begin{aligned}
Q(\theta,\theta') &= \sum_{i,k} t_{ik}(\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big)\\
&= \sum_{k}\log(\pi_k)\sum_{i}t_{ik} - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i,k}t_{ik}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k),
\end{aligned}$$
which has to be maximized subject to $\sum_k\pi_k = 1$.
The Lagrangian of this problem is
$$L(\theta) = Q(\theta,\theta') + \lambda\left(\sum_k\pi_k - 1\right).$$
The partial derivatives of the Lagrangian are set to zero to obtain the optimal values of $\pi_k$, $\mu_k$ and $\Sigma$.
G.1 Prior probabilities
$$\frac{\partial L(\theta)}{\partial\pi_k} = 0 \iff \frac{1}{\pi_k}\sum_i t_{ik} + \lambda = 0,$$
where $\lambda$ is identified from the constraint ($\lambda = -n$, since $\sum_k\sum_i t_{ik} = n$), leading to
$$\pi_k = \frac{1}{n}\sum_i t_{ik}.$$
G.2 Means
$$\frac{\partial L(\theta)}{\partial\mu_k} = 0 \iff -\frac{1}{2}\sum_i t_{ik}\,2\Sigma^{-1}(\mu_k - x_i) = 0
\;\Rightarrow\; \mu_k = \frac{\sum_i t_{ik}x_i}{\sum_i t_{ik}}.$$
G.3 Covariance Matrix
$$\frac{\partial L(\theta)}{\partial\Sigma^{-1}} = 0 \iff \underbrace{\frac{n}{2}\Sigma}_{\text{as per Property 4}} - \underbrace{\frac{1}{2}\sum_{i,k}t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top}_{\text{as per Property 5}} = 0
\;\Rightarrow\; \Sigma = \frac{1}{n}\sum_{i,k}t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top.$$
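The three updates above fit in a few lines; a sketch of the M-step for a common-covariance Gaussian mixture (numpy assumed; this is illustrative code, not the Mix-GLOSS implementation):

```python
import numpy as np

def m_step(X, t):
    """M-step of EM for a Gaussian mixture with a shared covariance matrix.

    X: (n, p) data; t: (n, K) posterior probabilities from the E-step.
    Returns priors pi (K,), means mu (K, p), common covariance (p, p).
    """
    n, p = X.shape
    nk = t.sum(axis=0)                          # soft class counts
    pi = nk / n                                 # prior probabilities
    mu = (t.T @ X) / nk[:, None]                # weighted means
    Sigma = np.zeros((p, p))
    for k in range(t.shape[1]):
        R = X - mu[k]                           # data centered on class k
        Sigma += (t[:, k, None] * R).T @ R      # sum_i t_ik (x_i-mu_k)(x_i-mu_k)'
    return pi, mu, Sigma / n
```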
Bibliography
F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. Optimization for Machine Learning, pages 19-54, 2011.
F. R. Bach. Bolasso: model consistent lasso estimation through the bootstrap. In Proceedings of the 25th International Conference on Machine Learning, ICML, 2008.
F. R. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1-106, 2012.
J. D. Banfield and A. E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, pages 803-821, 1993.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
H. Bensmail and G. Celeux. Regularized Gaussian discriminant analysis through eigenvalue decomposition. Journal of the American Statistical Association, 91(436):1743-1748, 1996.
P. J. Bickel and E. Levina. Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989-1010, 2004.
C. Biernacki, G. Celeux, G. Govaert, and F. Langrognet. MIXMOD Statistical Documentation. http://www.mixmod.org, 2008.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.
C. Bouveyron and C. Brunet. Discriminative variable selection for clustering with the sparse Fisher-EM algorithm. Technical Report 1204.2067, Arxiv e-prints, 2012a.
C. Bouveyron and C. Brunet. Simultaneous model-based clustering and visualization in the Fisher discriminative subspace. Statistics and Computing, 22(1):301-324, 2012b.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37(4):373-384, 1995.
L. Breiman and R. Ihaka. Nonlinear discriminant analysis via ACE and scaling. Technical Report 40, University of California, Berkeley, 1984.
T. Cai and W. Liu. A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106(496):1566-1577, 2011.
S. Canu and Y. Grandvalet. Outcomes of the equivalence of adaptive ridge with least absolute shrinkage. Advances in Neural Information Processing Systems, page 445, 1999.
C. Caramanis, S. Mannor, and H. Xu. Robust optimization in machine learning. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 369-402. MIT Press, 2012.
B. Chidlovskii and L. Lecerf. Scalable feature selection for multi-class problems. In W. Daelemans, B. Goethals, and K. Morik, editors, Machine Learning and Knowledge Discovery in Databases, volume 5211 of Lecture Notes in Computer Science, pages 227-240. Springer, 2008.
L. Clemmensen, T. Hastie, D. Witten, and B. Ersbøll. Sparse discriminant analysis. Technometrics, 53(4):406-413, 2011.
C. De Mol, E. De Vito, and L. Rosasco. Elastic-net regularization in learning theory. Journal of Complexity, 25(2):201-230, 2009.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38, 1977. ISSN 0035-9246.
D. L. Donoho, M. Elad, and V. N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6-18, 2006.
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, 2000.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407-499, 2004.
J. Fan and Y. Fan. High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6):2605, 2008.
R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Human Genetics, 7(2):179-188, 1936.
V. Franc and S. Sonnenburg. Optimized cutting plane algorithm for support vector machines. In Proceedings of the 25th International Conference on Machine Learning, pages 320-327. ACM, 2008.
J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.
J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. Technical Report 1001.0736, ArXiv e-prints, 2010.
J. H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165-175, 1989.
W. J. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397-416, 1998.
A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 2003.
D. Ghosh and A. M. Chinnaiyan. Classification and selection of biomarkers in genomic data using lasso. Journal of Biomedicine and Biotechnology, 2:147-154, 2005.
G. Govaert, Y. Grandvalet, X. Liu, and L. F. Sanchez Merchante. Implementation baseline for clustering. Technical Report D7.1-m12, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D71-m12.pdf, 2010.
G. Govaert, Y. Grandvalet, B. Laval, X. Liu, and L. F. Sanchez Merchante. Implementations of original clustering. Technical Report D7.2-m24, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D72-m24.pdf, 2011.
Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalization. In Perspectives in Neural Computing, volume 98, pages 201-206, 1998.
Y. Grandvalet and S. Canu. Adaptive scaling for feature selection in SVMs. Advances in Neural Information Processing Systems, 15:553-560, 2002.
L. Grosenick, S. Greer, and B. Knutson. Interpretable classifiers for fMRI improve prediction of purchases. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 16(6):539-548, 2008.
Y. Guermeur, G. Pollastri, A. Elisseeff, D. Zelus, H. Paugam-Moisy, and P. Baldi. Combining protein secondary structure prediction models with ensemble methods of optimal complexity. Neurocomputing, 56:305-327, 2004.
J. Guo, E. Levina, G. Michailidis, and J. Zhu. Pairwise variable selection for high-dimensional model-based clustering. Biometrics, 66(3):793-804, 2010.
I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157-1182, 2003.
T. Hastie and R. Tibshirani. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):155-176, 1996.
T. Hastie, R. Tibshirani, and A. Buja. Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association, 89(428):1255-1270, 1994.
T. Hastie, A. Buja, and R. Tibshirani. Penalized discriminant analysis. The Annals of Statistics, 23(1):73-102, 1995.
A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1):55-67, 1970.
J. Huang, S. Ma, H. Xie, and C. H. Zhang. A group bridge approach for variable selection. Biometrika, 96(2):339-355, 2009.
T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217-226. ACM, 2006.
K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics, 28(5):1356-1378, 2000.
P. F. Kuan, S. Wang, X. Zhou, and H. Chu. A statistical framework for Illumina DNA methylation arrays. Bioinformatics, 26(22):2849-2855, 2010.
T. Lange, M. Braun, V. Roth, and J. Buhmann. Stability-based model selection. Advances in Neural Information Processing Systems, 15:617-624, 2002.
M. H. C. Law, M. A. T. Figueiredo, and A. K. Jain. Simultaneous feature selection and clustering using mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1154-1166, 2004.
Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines. Journal of the American Statistical Association, 99(465):67-81, 2004.
C. Leng. Sparse optimal scoring for multiclass cancer diagnosis and biomarker detection using microarray data. Computational Biology and Chemistry, 32(6):417-425, 2008.
C. Leng, Y. Lin, and G. Wahba. A note on the lasso and related procedures in model selection. Statistica Sinica, 16(4):1273, 2006.
H. Liu and L. Yu. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4):491-502, 2005.
J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281-297. University of California Press, 1967.
Q. Mai, H. Zou, and M. Yuan. A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99(1):29-42, 2012.
C. Maugis, G. Celeux, and M. L. Martin-Magniette. Variable selection for clustering with Gaussian mixture models. Biometrics, 65(3):701-709, 2009a.
C. Maugis, G. Celeux, and M. L. Martin-Magniette. SelvarClust: software for variable selection in model-based clustering. http://www.math.univ-toulouse.fr/~maugis/SelvarClustHomepage.html, 2009b.
L. Meier, S. Van De Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 70(1):53-71, 2008.
N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436-1462, 2006.
B. Moghaddam, Y. Weiss, and S. Avidan. Generalized spectral bounds for sparse LDA. In Proceedings of the 23rd International Conference on Machine Learning, pages 641-648. ACM, 2006.
B. Moghaddam, Y. Weiss, and S. Avidan. Fast pixel/part selection with sparse eigenvectors. In IEEE 11th International Conference on Computer Vision (ICCV 2007), pages 1-8, 2007.
Y. Nesterov. Gradient methods for minimizing composite functions. Preprint, 2007.
S. Newcomb. A generalized theory of the combination of observations so as to obtain the best result. American Journal of Mathematics, 8(4):343-366, 1886.
B. Ng and R. Abugharbieh. Generalized group sparse classifiers with application in fMRI brain decoding. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1065-1071. IEEE, 2011.
M. R. Osborne, B. Presnell, and B. A. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319-337, 2000a.
M. R. Osborne, B. Presnell, and B. A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389-403, 2000b.
W. Pan and X. Shen. Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research, 8:1145-1164, 2007.
W. Pan, X. Shen, A. Jiang, and R. P. Hebbel. Semi-supervised learning via penalized mixture model with application to microarray sample classification. Bioinformatics, 22(19):2388-2395, 2006.
K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London, 185:71-110, 1894.
S. Perkins, K. Lacker, and J. Theiler. Grafting: fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3:1333-1356, 2003.
Z. Qiao, L. Zhou, and J. Huang. Sparse linear discriminant analysis with applications to high dimensional low sample size data. International Journal of Applied Mathematics, 39(1), 2009.
A. E. Raftery and N. Dean. Variable selection for model-based clustering. Journal of the American Statistical Association, 101(473):168-178, 2006.
C. R. Rao. The utilization of multiple measurements in problems of biological classification. Journal of the Royal Statistical Society, Series B (Methodological), 10(2):159-203, 1948.
S. Rosset and J. Zhu. Piecewise linear regularized solution paths. The Annals of Statistics, 35(3):1012-1030, 2007.
V. Roth. The generalized lasso. IEEE Transactions on Neural Networks, 15(1):16-28, 2004.
V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Machine Learning: Proceedings of the Twenty-Fifth International Conference (ICML 2008), volume 307 of ACM International Conference Proceeding Series, pages 848-855, 2008.
V. Roth and T. Lange. Feature selection in clustering problems. In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 473-480. MIT Press, 2004.
C. Sammut and G. I. Webb. Encyclopedia of Machine Learning. Springer-Verlag New York Inc., 2010.
L. F. Sanchez Merchante, Y. Grandvalet, and G. Govaert. An efficient approach to sparse linear discriminant analysis. In Proceedings of the 29th International Conference on Machine Learning, ICML, 2012.
G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461-464, 1978.
A. J. Smola, S. V. N. Vishwanathan, and Q. Le. Bundle methods for machine learning. Advances in Neural Information Processing Systems, 20:1377-1384, 2008.
S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531-1565, 2006.
P. Sprechmann, I. Ramirez, G. Sapiro, and Y. Eldar. Collaborative hierarchical sparse modeling. In Information Sciences and Systems (CISS), 2010 44th Annual Conference on, pages 1-6. IEEE, 2010.
M. Szafranski. Pénalités Hiérarchiques pour l'Intégration de Connaissances dans les Modèles Statistiques. PhD thesis, Université de Technologie de Compiègne, 2008.
M. Szafranski, Y. Grandvalet, and P. Morizet-Mahoudeaux. Hierarchical penalization. Advances in Neural Information Processing Systems, 2008.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.
J. E. Vogt and V. Roth. The group-lasso: l1,∞ regularization versus l1,2 regularization. In Pattern Recognition, 32nd DAGM Symposium, Lecture Notes in Computer Science, 2010.
S. Wang and J. Zhu. Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics, 64(2):440–448, 2008.
D. Witten and R. Tibshirani. Penalized classification using Fisher's linear discriminant. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 73(5):753–772, 2011.
D. M. Witten and R. Tibshirani. A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490):713–726, 2010.
D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.
M. Wu and B. Schölkopf. A local learning approach for clustering. Advances in Neural Information Processing Systems, 19:1529, 2007.
M. C. Wu, L. Zhang, Z. Wang, D. C. Christiani, and X. Lin. Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection. Bioinformatics, 25(9):1145–1151, 2009.
T. T. Wu and K. Lange. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, pages 224–244, 2008.
B. Xie, W. Pan, and X. Shen. Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. Electronic Journal of Statistics, 2:168–172, 2008a.
B. Xie, W. Pan, and X. Shen. Variable selection in penalized model-based clustering via regularization on grouped parameters. Biometrics, 64(3):921–930, 2008b.
C. Yang, X. Wan, Q. Yang, H. Xue, and W. Yu. Identifying main effects and epistatic interactions from large-scale SNP data via adaptive group lasso. BMC Bioinformatics, 11(Suppl 1):S18, 2010.
J. Ye. Least squares linear discriminant analysis. In Proceedings of the 24th International Conference on Machine Learning, pages 1087–1093. ACM, 2007.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 68(1):49–67, 2006.
P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7(2):2541, 2007.
P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37(6A):3468–3497, 2009.
H. Zhou, W. Pan, and X. Shen. Penalized model-based clustering with unconstrained covariance matrices. Electronic Journal of Statistics, 3:1473–1496, 2009.
H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.
H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67(2):301–320, 2005.